cs.CL
[1] The Overlooked Repetitive Lengthening Form in Sentiment Analysis
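Lei Wang, Eduard Dragut
Main category: cs.CL
TL;DR: This paper examines the importance of the Repetitive Lengthening Form (RLF) in sentiment analysis and how well large language models understand it. It introduces Lengthening, the first multi-domain dataset focused on RLF, and ExpInstruct, an explainable instruction tuning framework that improves both model performance and explainability.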
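As a rough illustration of what a repetitive lengthening looks like in text, the sketch below flags words that contain a run of three or more identical letters (for example "sooooo"). The three-letter threshold and the function name `find_lengthened_words` are assumptions made for this example; this is not the procedure used to build the Lengthening dataset.

```python
import re

# A run of 3+ identical letters, e.g. the "ooooo" in "sooooo".
# [^\W\d_] matches a single letter (word character that is not a digit or "_").
_LETTER_RUN = re.compile(r"([^\W\d_])\1{2,}")

def find_lengthened_words(text: str) -> list[str]:
    """Return the words in `text` that contain a run of 3+ identical letters."""
    words = re.findall(r"\w+", text)
    return [w for w in words if _LETTER_RUN.search(w)]

if __name__ == "__main__":
    print(find_lengthened_words("This was sooooo goooood, really cool"))
```

Ordinary doubled letters such as the "oo" in "cool" are left alone, which is why the threshold is three rather than two.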
Details
Motivation: The Repetitive Lengthening Form (RLF), a distinctive and emphatic informal style, has long been overlooked in sentiment analysis. This paper investigates its importance and whether language models can understand it.
Method: The authors build Lengthening, the first multi-domain sentiment analysis dataset focused on RLF (850k samples), propose ExpInstruct, a two-stage explainable instruction tuning framework, and design a unified method for quantifying how well language models understand informal expressions.
Result: RLF sentences are strongly expressive and can serve as signatures of document-level sentiment. Fine-tuned pre-trained language models outperform zero-shot GPT-4 on RLF tasks but fall short in explanation. With limited samples, ExpInstruct brings open-source LLMs up to zero-shot GPT-4 in both performance and explainability.
Conclusion: RLF is an important informal form of expression that sentiment analysis should not ignore. ExpInstruct improves both models' understanding of RLF and their explainability, and offers a new angle on online content analysis.
Abstract: Individuals engaging in online communication frequently express personal opinions with informal styles (e.g., memes and emojis). While Language Models (LMs) with informal communications have been widely discussed, a unique and emphatic style, the Repetitive Lengthening Form (RLF), has been overlooked for years. In this paper, we explore answers to two research questions: 1) Is RLF important for sentiment analysis (SA)? 2) Can LMs understand RLF? Inspired by previous linguistic research, we curate Lengthening, the first multi-domain dataset with 850k samples focused on RLF for SA. Moreover, we introduce Explainable Instruction Tuning (ExpInstruct), a two-stage instruction tuning framework aimed to improve both performance and explainability of LLMs for RLF. We further propose a novel unified approach to quantify LMs' understanding of informal expressions. We show that RLF sentences are expressive expressions and can serve as signatures of document-level sentiment. Additionally, RLF has potential value for online content analysis. Our results show that fine-tuned Pre-trained Language Models (PLMs) can surpass zero-shot GPT-4 in performance but not in explanation for RLF. Finally, we show ExpInstruct can improve the open-sourced LLMs to match zero-shot GPT-4 in performance and explainability for RLF with limited samples. Code and sample data are available at https://github.com/Tom-Owl/OverlookedRLF
[2] Scaling Reasoning Tokens via RL and Parallel Thinking: Evidence From Competitive Programming
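Qianfan Zhang, Tianyu Guo, Xuandi Ren, Jiale Chen, Ming Ding, Ran Xin, Xia Xiao
Main category: cs.CL
TL;DR: This paper studies how to scale the reasoning token budget for competitive programming through two complementary approaches: training-time reinforcement learning (RL) and test-time parallel thinking. The authors find an approximately log-linear relationship between validation accuracy and the average number of generated reasoning tokens, and propose two improvements, verification RL warmup and randomized clipping. To contain the cost of scaling single-generation reasoning under full attention, they design a multi-round parallel thinking pipeline that distributes the token budget across threads and rounds of generation, verification, and refinement, and train the model end-to-end so that training matches the test-time structure. The final system surpasses GPT-5-high on 456 hard problems from AetherCode.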
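For readers unfamiliar with the pass@k metric behind the pass@1 and pass@16 figures in this entry, here is a minimal sketch of the standard unbiased estimator commonly used in code-generation evaluation: with n samples per problem of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k). Whether the paper computes its numbers with exactly this estimator is an assumption; the function name is illustrative.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct,
    given that c of n generated samples were correct."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

if __name__ == "__main__":
    # 4 correct out of 16 samples: pass@1 is the plain success rate.
    print(pass_at_k(16, 4, 1))
    # With k = 16 and at least one correct sample, the problem counts as solved.
    print(pass_at_k(16, 4, 16))
```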
Details
Motivation: Competitive programming requires large numbers of reasoning tokens, but simply lengthening a single generation is expensive under full attention, so more efficient ways to scale the token budget are needed.
Method: Two complementary approaches are proposed: (1) training-time RL optimization, which introduces verification RL warmup and randomized clipping to improve the log-linear training trajectory; (2) test-time parallel thinking, a multi-thread, multi-round generation-verification-refinement pipeline on which the model is trained end-to-end.
Result: Starting from Seed-OSS-36B, the full system with 16 threads and 16 rounds, using 7.6M tokens per problem on average, matches at pass@1 the oracle pass@16 of the underlying RL model and surpasses GPT-5-high on 456 hard AetherCode problems.
Conclusion: Training-time RL optimization and test-time parallel thinking together scale the reasoning token budget efficiently, and training end-to-end on the test-time structure markedly improves practical inference efficiency and performance.
Abstract: We study how to scale reasoning token budgets for competitive programming through two complementary approaches: training-time reinforcement learning (RL) and test-time parallel thinking. During RL training, we observe an approximately log-linear relationship between validation accuracy and the average number of generated reasoning tokens over successive checkpoints, and show two ways to shift this training trajectory: verification RL warmup raises the starting point, while randomized clipping produces a steeper trend in the observed regime. As scaling single-generation reasoning during RL quickly becomes expensive under full attention, we introduce a multi-round parallel thinking pipeline that distributes the token budget across threads and rounds of generation, verification, and refinement. We train the model end-to-end on this pipeline to match the training objective to the test-time structure. Starting from Seed-OSS-36B, the full system with 16 threads and 16 rounds per thread matches the underlying RL model's oracle pass@16 at pass@1 using 7.6 million tokens per problem on average, and surpasses GPT-5-high on 456 hard competitive programming problems from AetherCode.
[3] M2-Verify: A Large-Scale Multidomain Benchmark for Checking Multimodal Claim Consistency
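Abolfazl Ansari, Delvin Ce Zhang, Zhuoyang Zou, Wenpeng Yin, Dongwon Lee
Main category: cs.CL
TL;DR: This paper introduces M2-Verify, a large-scale, multimodal, multidomain dataset for verifying scientific claim consistency, designed to evaluate how well models judge the strict consistency between a scientific claim and its multimodal evidence. Experiments show that current state-of-the-art models degrade markedly in complex settings and hallucinate.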
Details
Motivation: Existing benchmarks lack the scale, domain diversity, and visual complexity needed to realistically evaluate the strict consistency between scientific claims and multimodal evidence.
Method: The authors build M2-Verify from PubMed and arXiv, with over 469K instances across 16 domains, validated through expert audits. They run baseline experiments and expert evaluations to analyze model behavior under perturbations of varying complexity and hallucination in generated explanations.
Result: State-of-the-art models reach 85.8% Micro-F1 on low-complexity medical perturbations but drop to 61.6% on high-complexity tasks such as anatomical shifts. Expert evaluation finds clear hallucinations when models generate scientific explanations.
Conclusion: M2-Verify fills a gap in benchmarks for multimodal scientific consistency verification, exposes the limits of current models in rigorous scientific reasoning, and provides a reliable evaluation platform and practical guidelines for future research.
Abstract: Evaluating scientific arguments requires assessing the strict consistency between a claim and its underlying multimodal evidence. However, existing benchmarks lack the scale, domain diversity, and visual complexity needed to evaluate this alignment realistically. To address this gap, we introduce M2-Verify, a large-scale multimodal dataset for checking scientific claim consistency. Sourced from PubMed and arXiv, M2-Verify provides over 469K instances across 16 domains, rigorously validated through expert audits. Extensive baseline experiments show that state-of-the-art models struggle to maintain robust consistency. While top models achieve up to 85.8% Micro-F1 on low-complexity medical perturbations, performance drops to 61.6% on high-complexity challenges like anatomical shifts. Furthermore, expert evaluations expose hallucinations when models generate scientific explanations for their alignment decisions. Finally, we demonstrate our dataset's utility and provide comprehensive usage guidelines.
[4] Preference learning in shades of gray: Interpretable and bias-aware reward modeling for human preferences
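Simona-Vasilica Oprea, Adela Bâra
Main category: cs.CL
TL;DR: This paper examines the challenges of learning human preferences in language models and proposes a feature-augmented reward modeling framework. Adding interpretable auxiliary features (response length, refusal signals, toxicity scores, and semantic similarity) improves both performance and interpretability on preference ranking, and the analysis shows how feature interactions affect bias amplification.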
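To make the idea of interpretable auxiliary features concrete, the sketch below computes three simple signals for a prompt-response pair. Everything here is an assumption for illustration: the refusal marker list is hypothetical, token-set Jaccard overlap is a crude stand-in for the semantic similarity the entry mentions, and toxicity scoring is omitted because it would need an external classifier.

```python
# Hypothetical phrase list; a real refusal detector would be more careful.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am unable")

def response_features(prompt: str, response: str) -> dict:
    """Compute simple, interpretable features for one prompt-response pair."""
    lowered = response.lower()
    prompt_tokens = set(prompt.lower().split())
    response_tokens = set(lowered.split())
    union = prompt_tokens | response_tokens
    # Jaccard overlap of whitespace tokens: a lexical proxy, not semantic similarity.
    overlap = len(prompt_tokens & response_tokens) / len(union) if union else 0.0
    return {
        "length": len(response.split()),  # response length in whitespace tokens
        "refusal": any(marker in lowered for marker in REFUSAL_MARKERS),
        "lexical_overlap": overlap,
    }

if __name__ == "__main__":
    print(response_features("How do I bake bread?", "I'm sorry, I cannot help with that."))
```

In a pairwise setting, such features would be computed for both the chosen and the rejected response and supplied to the reward model alongside the text.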
Details
Motivation: Current reward modeling relies on subjective, ambiguous preference comparisons and struggles to capture the multidimensional nature of human judgment, which limits performance.
Method: On the Anthropic HHRLHF dataset, ten large language models are evaluated in a standard pairwise preference setting. Interpretable features (response length, refusal indicators, toxicity scores, and prompt-response semantic similarity) are added to build a feature-augmented hybrid framework. SHAP and LIME are used for interpretability analysis, and the effect of feature interactions on bias is examined.
Result: The method raises ROC AUC from a baseline below 0.74 to as high as 0.84 and significantly improves pairwise accuracy, with DeBERTav3Large performing best. Interpretability analysis indicates that model decisions depend on contextualized safety and supportive framing rather than isolated keywords. Feature interactions are found to amplify preference bias.
Conclusion: Feature augmentation improves both the performance and the interpretability of reward models, but feature combinations can introduce implicit bias. Fusing multidimensional, interpretable signals is a key step toward more robust and trustworthy preference modeling.
Abstract: Learning human preferences in language models remains fundamentally challenging, as reward modeling relies on subtle, subjective comparisons or shades of gray rather than clear-cut labels. This study investigates the limits of current approaches and proposes a feature-augmented framework to better capture the multidimensional nature of human judgment. Using the Anthropic HHRLHF dataset, we evaluate ten diverse large language models (LLMs) under a standard pairwise preference setting, where baseline performance remains below 0.74 ROC AUC, highlighting the difficulty of the task. To address this, we enrich textual representations with interpretable signals: response length, refusal indicators, toxicity scores and prompt-response semantic similarity, enabling models to explicitly capture key aspects of helpfulness, safety and relevance. The proposed hybrid approach yields consistent improvements across all models, achieving up to 0.84 ROC AUC and significantly higher pairwise accuracy, with DeBERTav3Large demonstrating the best performance. Beyond accuracy, we integrate SHAP and LIME to provide fine-grained interpretability, revealing that model decisions depend on contextualized safety and supportive framing rather than isolated keywords. We further analyze bias amplification, showing that while individual features have weak marginal effects, their interactions influence preference learning.
[5] Procedural Knowledge at Scale Improves Reasoning
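Di Wu, Devendra Singh Sachan, Wen-tau Yih, Mingda Chen
Main category: cs.CL
TL;DR: This paper proposes Reasoning Memory, a retrieval-augmented generation (RAG) framework that reuses large-scale procedural knowledge (such as problem reframing, approach selection, and verification or backtracking) during reasoning, substantially improving language models on complex math, science, and coding tasks.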
Details
Motivation: Existing test-time scaling methods treat each problem in isolation and do not systematically reuse the procedural knowledge in prior reasoning trajectories (how to reframe a problem, choose a strategy, and verify or backtrack), leaving potential gains untapped.
Method: Reasoning Memory decomposes existing step-by-step reasoning corpora into 32 million self-contained subquestion-subroutine pairs that form a procedural knowledge datastore. At inference time, a lightweight in-thought prompt lets the model state the core subquestion and retrieve relevant subroutines, which serve as implicit procedural priors for reasoning along multiple paths.
Result: On six math, science, and coding benchmarks, Reasoning Memory consistently outperforms RAG with document, trajectory, and template knowledge, as well as a compute-matched test-time scaling baseline. Gains reach up to 19.2% over no retrieval and 7.9% over the strongest baseline. Ablations attribute the gains mainly to the breadth of procedural coverage and to the decomposition and retrieval design.
Conclusion: Explicitly modeling and efficiently reusing procedural knowledge is a key route to stronger complex reasoning in large models, and Reasoning Memory offers a scalable, interpretable paradigm for test-time reasoning.
Abstract: Test-time scaling has emerged as an effective way to improve language models on challenging reasoning tasks. However, most existing methods treat each problem in isolation and do not systematically reuse knowledge from prior reasoning trajectories. In particular, they underutilize procedural knowledge: how to reframe a problem, choose an approach, and verify or backtrack when needed. We introduce Reasoning Memory, a retrieval-augmented generation (RAG) framework for reasoning models that explicitly retrieves and reuses procedural knowledge at scale. Starting from existing corpora of step-by-step reasoning trajectories, we decompose each trajectory into self-contained subquestion-subroutine pairs, yielding a datastore of 32 million compact procedural knowledge entries. At inference time, a lightweight in-thought prompt lets the model verbalize the core subquestion, retrieve relevant subroutines within its reasoning trace, and reason under diverse retrieved subroutines as implicit procedural priors. Across six math, science, and coding benchmarks, Reasoning Memory consistently outperforms RAG with document, trajectory, and template knowledge, as well as a compute-matched test-time scaling baseline. With a higher inference budget, it improves over no retrieval by up to 19.2% and over the strongest compute-matched baseline by 7.9% across task types. Ablation studies show that these gains come from two key factors: the broad procedural coverage of the source trajectories and our decomposition and retrieval design, which together enable effective extraction and reuse of procedural knowledge.
[6] No Attacker Needed: Unintentional Cross-User Contamination in Shared-State LLM Agents
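Tiankai Yang, Jiate Li, Yi Nian, Shen Dong, Ruiyao Xu, Ryan Rossi, Kaize Ding, Yue Zhao
Main category: cs.CL
TL;DR: This paper introduces and formalizes unintentional cross-user contamination (UCC), a failure mode in which scope-bound information produced by benign interactions is wrongly reused by multi-user LLM agents with shared state, causing silent failures. Controlled evaluation and a taxonomy show contamination rates of 57-71% under raw shared state. Conversation-level write-time sanitization alone is not enough when shared state contains executable artifacts, so artifact-level defenses are needed.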
Details
Motivation: When LLM agents are deployed with a knowledge layer shared across users, they can reuse locally valid information without regard for user scope, silently degrading outcomes. Prior work has not systematically studied this benign, attacker-free form of contamination.
Method: The authors define and formalize UCC, design a controlled evaluation protocol, build a taxonomy of three contamination types, and empirically measure contamination rates and the effect of write-time sanitization under two shared-state mechanisms (purely conversational state and state that includes executable artifacts).
Result: Under raw shared state, UCC occurs at rates of 57-71%. Write-time sanitization is effective for conversational shared state but leaves substantial residual risk when shared state includes executable artifacts, and contamination often shows up as silent wrong answers.
Conclusion: Shared-state LLM agents need artifact-level defenses beyond text-level sanitization to prevent silent failures caused by unintentional cross-user contamination.
Abstract: LLM-based agents increasingly operate across repeated sessions, maintaining task states to ensure continuity. In many deployments, a single agent serves multiple users within a team or organization, reusing a shared knowledge layer across user identities. This shared persistence expands the failure surface: information that is locally valid for one user can silently degrade another user's outcome when the agent reapplies it without regard for scope. We refer to this failure mode as unintentional cross-user contamination (UCC). Unlike adversarial memory poisoning, UCC requires no attacker; it arises from benign interactions whose scope-bound artifacts persist and are later misapplied. We formalize UCC through a controlled evaluation protocol, introduce a taxonomy of three contamination types, and evaluate the problem in two shared-state mechanisms. Under raw shared state, benign interactions alone produce contamination rates of 57-71%. A write-time sanitization is effective when shared state is conversational, but leaves substantial residual risk when shared state includes executable artifacts, with contamination often manifesting as silent wrong answers. These results indicate that shared-state agents need artifact-level defenses beyond text-level sanitization to prevent silent cross-user failures.
[7] Open-Domain Safety Policy Construction
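Di Wu, Siyue Liu, Zixiang Ji, Ya-Liang Chang, Zhe-Yu Liu, Andrew Pleffer, Kai-Wei Chang
Main category: cs.CL
TL;DR: This paper presents Deep Policy Research (DPR), a lightweight, task-specific agentic system that, given only a small amount of human-written seed domain information, drafts content moderation policies automatically through iterative web search, information distillation, and structured organization. Across several benchmarks it outperforms baseline methods and in some domains approaches expert-written policy.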
Details
Motivation: Drafting and maintaining domain-specific safety policies is costly, and automated tools are needed to assist policy drafting.
Method: DPR uses a single web search tool and lightweight scaffolding to iteratively propose search queries, distill policy rules from diverse web sources, and organize the rules into an indexed document.
Result: On the OpenAI undesired content benchmark (five domains) and an in-house multimodal advertisement moderation benchmark, DPR consistently outperforms definition-only and in-context learning baselines. In the end-to-end setting it comes close to expert-written policy sections in several domains, and it also outperforms a general-purpose deep research system.
Conclusion: A task-specific, structured research loop suits policy drafting better than generic web research, and DPR offers a practical path to low-cost, high-quality content safety policy generation.
Abstract: Moderation layers are increasingly a core component of many products built on user- or model-generated content. However, drafting and maintaining domain-specific safety policies remains costly. We present Deep Policy Research (DPR), a minimal agentic system that drafts a full content moderation policy based on only human-written seed domain information. DPR uses a single web search tool and lightweight scaffolding to iteratively propose search queries, distill diverse web sources into policy rules, and organize rules into an indexed document. We evaluate DPR on (1) the OpenAI undesired content benchmark across five domains with two compact reader LLMs and (2) an in-house multimodal advertisement moderation benchmark. DPR consistently outperforms definition-only and in-context learning baselines, and in our end-to-end setting it is competitive with expert-written policy sections in several domains. Moreover, under the same seed specification and evaluation protocol, DPR outperforms a general-purpose deep research system, suggesting that a task-specific, structured research loop can be more effective than generic web research for policy drafting. We release our experiment code at https://github.com/xiaowu0162/deep-policy-research.
[8] Friends and Grandmothers in Silico: Localizing Entity Cells in Language Models
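Itay Yona, Dan Barzilay, Michael Karasik, Mor Geva
Main category: cs.CL
TL;DR: This paper studies the internal mechanisms behind entity-centric factual question answering in language models. By localizing entity-selective MLP neurons and applying causal interventions, it finds that these neurons concentrate in early layers and that activating a single neuron can recover entity-consistent predictions, supporting a canonicalization interpretation.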
Details
Motivation: Language models can answer many entity-centric factual questions, but the internal mechanisms involved remain unclear.
Method: Entity-selective MLP neurons are localized with templated prompts about each entity and validated with causal interventions on PopQA-based QA examples. The analysis covers 200 curated entities and combines negative ablation with controlled injection experiments.
Result: Entity-selective neurons concentrate in early layers. Negative ablation produces entity-specific amnesia, and activating a single neuron can recover entity-consistent predictions. The effect is robust to aliases, acronyms, misspellings, and multilingual forms. It is strong but not universal, with higher coverage for popular entities.
Conclusion: The work identifies sparse, causally actionable access points for analyzing and modulating entity-conditioned factual behavior in language models.
Abstract: Language models can answer many entity-centric factual questions, but it remains unclear which internal mechanisms are involved in this process. We study this question across multiple language models. We localize entity-selective MLP neurons using templated prompts about each entity, and then validate them with causal interventions on PopQA-based QA examples. On a curated set of 200 entities drawn from PopQA, localized neurons concentrate in early layers. Negative ablation produces entity-specific amnesia, while controlled injection at a placeholder token improves answer retrieval relative to mean-entity and wrong-cell controls. For many entities, activating a single localized neuron is sufficient to recover entity-consistent predictions once the context is initialized, consistent with compact entity retrieval rather than purely gradual enrichment across depth. Robustness to aliases, acronyms, misspellings, and multilingual forms supports a canonicalization interpretation. The effect is strong but not universal: not every entity admits a reliable single-neuron handle, and coverage is higher for popular entities. Overall, these results identify sparse, causally actionable access points for analyzing and modulating entity-conditioned factual behavior.
[9] Assessing Pause Thresholds for empirical Translation Process Research
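Devi Sri Bandaru, Michael Carl, Xinyue Ren
Main category: cs.CL
TL;DR: This paper compares three methods for computing typing pause thresholds, and proposes and evaluates a new method for computing Production Unit Breaks, with the aim of separating automated translation processes from those that require reflection.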
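The role a pause threshold plays can be sketched in a few lines: given keystroke timestamps, any gap longer than the threshold starts a new stretch of typing. This is a generic illustration with a fixed, hand-picked threshold; the function name and the 1000 ms value are assumptions, and it does not reproduce any of the threshold-computation methods the paper compares.

```python
def split_production_units(timestamps_ms: list[int], pause_threshold_ms: int) -> list[list[int]]:
    """Group keystroke timestamps into stretches of typing separated by long pauses."""
    units: list[list[int]] = []
    current: list[int] = []
    previous = None
    for t in timestamps_ms:
        # A gap strictly longer than the threshold closes the current unit.
        if previous is not None and t - previous > pause_threshold_ms:
            units.append(current)
            current = []
        current.append(t)
        previous = t
    if current:
        units.append(current)
    return units

if __name__ == "__main__":
    keystrokes = [0, 120, 250, 1500, 1600, 4000]
    print(split_production_units(keystrokes, pause_threshold_ms=1000))
```

Lowering the threshold yields more, shorter units, which is why the choice of threshold matters for any downstream claim about automated versus reflective production.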
Details
Motivation: There is a long-running debate over how to determine the pause thresholds that separate automated from reflective translation processes, and the available methods need further comparison and refinement.
Method: The paper compares three recent methods for computing pause thresholds, and proposes and evaluates a new method for computing Production Unit Breaks.
Result: An improved method for computing Production Unit Breaks is proposed, and its effectiveness is evaluated empirically.
Conclusion: The proposed method for computing Production Unit Breaks is more explanatory and practical for identifying changes in cognitive load during translation.
Abstract: Text production (and translations) proceeds in the form of stretches of typing, interrupted by keystroke pauses. It is often assumed that fast typing reflects unchallenged/automated translation production while long(er) typing pauses are indicative of translation problems, hurdles or difficulties. Building on a long discussion concerning the determination of pause thresholds that separate automated from presumably reflective translation processes (O'Brien, 2006; Alves and Vale, 2009; Timarova et al., 2011; Dragsted and Carl, 2013; Lacruz et al., 2014; Kumpulainen, 2015; Heilmann and Neumann, 2016), this paper compares three recent approaches for computing these pause thresholds, and suggests and evaluates a novel method for computing Production Unit Breaks.
[10] Adaptive Stopping for Multi-Turn LLM Reasoning
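Xiaofan Zhou, Huy Nguyen, Bo Yu, Chenxi Liu, Lu Cheng
Main category: cs.CL
TL;DR: This paper proposes MiCP, the first conformal prediction (CP) framework for multi-turn reasoning. By allocating an error budget across turns, it enables adaptive stopping while guaranteeing overall coverage, and substantially reduces inference cost and the number of turns.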
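One simple way to see why splitting an error budget across turns can preserve an overall guarantee is the union bound: if turn t is allowed miscoverage at most alpha_t and the alpha_t sum to alpha, then the chance of an error at whichever turn the model stops is at most alpha. The sketch below is a generic illustration under that assumption, with a hypothetical geometric weighting; it is not MiCP's actual allocation rule.

```python
def allocate_error_budget(alpha: float, num_turns: int, decay: float = 0.5) -> list[float]:
    """Split a total miscoverage budget `alpha` into per-turn budgets that sum to `alpha`.

    Weights decay geometrically, so earlier turns receive a larger share
    (a hypothetical choice made for this example).
    """
    if num_turns < 1:
        raise ValueError("num_turns must be at least 1")
    weights = [decay ** t for t in range(num_turns)]
    total = sum(weights)
    return [alpha * w / total for w in weights]

if __name__ == "__main__":
    budgets = allocate_error_budget(alpha=0.1, num_turns=4)
    print(budgets, sum(budgets))
```

How the budget is shaped is a design choice: giving early turns a larger share makes early stopping easier, while a uniform split treats all turns alike.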
Details
Motivation: Existing multi-turn reasoning methods for large language models lack a formal stopping criterion, which leads to stopping too early or reasoning too long in high-stakes domains. Conventional conformal prediction applies only to a single output and cannot handle multi-turn adaptive pipelines.
Method: The paper proposes Multi-Turn Language Models with Conformal Prediction (MiCP), which allocates error budgets across turns in multi-turn reasoning (such as adaptive RAG and ReAct) and supports adaptive stopping with a coverage guarantee.
Result: MiCP achieves the target coverage on single-hop and multi-hop question answering benchmarks while reducing the number of turns, inference cost, and prediction set size. A new metric is introduced that jointly evaluates coverage validity and answering efficiency.
Conclusion: MiCP is the first extension of conformal prediction to multi-turn language models and provides an adaptive reasoning framework with statistical guarantees for high-reliability AI systems.
Abstract: Large Language Models (LLMs) increasingly rely on multi-turn reasoning and interaction, such as adaptive retrieval-augmented generation (RAG) and ReAct-style agents, to answer difficult questions. These methods improve accuracy by iteratively retrieving information, reasoning, or acting, but introduce a key challenge: When should the model stop? Existing approaches rely on heuristic stopping rules or fixed turn budgets and provide no formal guarantees that the final prediction still contains the correct answer. This limitation is particularly problematic in high-stakes domains such as finance and healthcare, where unnecessary turns increase cost and latency, while stopping too early risks incorrect decisions. Conformal prediction (CP) provides formal coverage guarantees, but existing LLM-CP methods only apply to a single model output and cannot handle multi-turn pipelines with adaptive stopping. To address this gap, we propose Multi-Turn Language Models with Conformal Prediction (MiCP), the first CP framework for multi-turn reasoning. MiCP allocates different error budgets across turns, enabling the model to stop early while maintaining an overall coverage guarantee. We demonstrate MiCP on adaptive RAG and ReAct, where it achieves the target coverage on both single-hop and multi-hop question answering benchmarks while reducing the number of turns, inference cost, and prediction set size. We further introduce a new metric that jointly evaluates coverage validity and answering efficiency.
[11] Cost-Efficient Estimation of General Abilities Across Benchmarks
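Michael Krumdick, Adam Wiemerslage, Seth Ebner, Charles Lovering, Chris Tanner
Main category: cs.CL
TL;DR: This paper proposes an efficient LLM benchmarking framework grounded in predictive validity. Using the WILD dataset and a modified multidimensional item response theory (IRT) model with adaptive item selection, it predicts performance on unseen tasks with a mean absolute error below 7% after observing only 16 items, and with cost-aware selection reaches that error level using 22,000 tokens.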
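The two headline quantities in this entry are easy to compute by hand, and the sketch below does so: mean absolute error between predicted and observed task scores, and the relative reduction in tokens. The helper names and the example scores are illustrative assumptions; only the 141,000 and 22,000 token counts come from the abstract.

```python
def mean_absolute_error(predicted: list[float], observed: list[float]) -> float:
    """Average absolute difference between predicted and observed scores."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(predicted)

def relative_reduction(before: float, after: float) -> float:
    """Fraction by which `after` is smaller than `before`."""
    return (before - after) / before

if __name__ == "__main__":
    # Made-up scores for two tasks, for illustration only.
    print(mean_absolute_error([0.5, 0.7], [0.6, 0.6]))
    # Token counts reported in the abstract (rounded figures).
    print(relative_reduction(141_000, 22_000))
```

With the rounded token counts the reduction comes out at roughly 0.84, in line with the approximately 85% figure the abstract reports.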
Details
Motivation: LLM benchmarks are numerous but redundant, and performance can be explained by a small number of latent abilities. A comparable way to assess benchmarking methods is needed, based on how efficiently they predict performance on unseen tasks.
Method: The authors build WILD, a large item-level dataset (65 models, 109,564 items, 163 tasks), propose a modified multidimensional IRT model, combine it with optimal experimental design for adaptive item selection, and add cost-aware discount factors to reduce token usage.
Result: The approach reaches an MAE below 7% on 112 held-out benchmark tasks after observing only 16 items. Cost-aware selection cuts the tokens needed to reach 7% MAE from 141,000 to 22,000 (an 85% reduction).
Conclusion: A benchmarking framework centered on predictive validity is more efficient, scalable, and cost-effective, and offers a new paradigm for LLM evaluation.
Abstract: Thousands of diverse benchmarks have been developed to measure the quality of large language models (LLMs). Yet prior work has demonstrated that LLM performance is often sufficiently explained by a small set of latent factors, or abilities. This suggests the potential for more efficient and principled benchmarking, but it remains difficult to compare the quality of different methods. Motivated by predictive validity, we argue that the quality of a benchmarking framework should be grounded in how efficiently it enables the prediction of model performance on unseen tasks. To analyze this objective, we collect the "Wide-scale Item Level Dataset" (WILD), a dataset of item-model response pairs, comprising evaluations of 65 models on 109,564 unique items spanning 163 tasks drawn from 27 datasets. This dataset enables the first analysis of how different techniques can predict a model's performance on a large, diverse collection of unseen tasks under different budget constraints. We demonstrate that combining a modified multidimensional item response theory (IRT) model with adaptive item selection driven by optimal experimental design can predict performance on 112 held-out benchmark tasks with a mean absolute error (MAE) of less than 7%, and can do so after observing only 16 items. We further demonstrate that incorporating cost-aware discount factors into our selection criteria can reduce the total tokens needed to reach 7% MAE from 141,000 tokens to only 22,000, an 85% reduction in evaluation cost.
[12] The power of context: Random Forest classification of near synonyms. A case study in Modern Hindi
Jacek Bąkowski
Main category: cs.CL
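TL;DR: Using distributional semantics and a Random Forest model, this study finds that Hindi synonyms of Sanskrit and Perso-Arabic origin can be accurately separated by etymology from their contextual usage patterns (word embeddings), even when their meanings are close. This gives quantitative evidence that synonyms carry different cultural perspectives and that historical origin shapes distinct conceptual subspaces.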
Details
Motivation: The study tests whether synonyms really reflect different cognitive perspectives or cultural associations, and in particular whether the etymology of Hindi synonym pairs of Sanskrit and Perso-Arabic origin, formed through prolonged language contact, still leaves a trace in modern usage.
Method: A Random Forest classifier is trained on word embeddings of Hindi synonym pairs to predict each word's origin (Sanskrit or Perso-Arabic), with semantic relatedness controlled for validation.
Result: The model separates the two origins with high accuracy, even for semantically unrelated words, indicating that distributional context contains an extractable etymological signal.
Conclusion: Synonymy is not redundancy but a systematic linguistic mechanism that carries historical, cultural, and cognitive differences. Etymology can shape distinct conceptual subspaces, and context reveals subtle but structured distinctions in meaning better than traditional semantic similarity does.
Abstract: Synonymy is a widespread yet puzzling linguistic phenomenon. Absolute synonyms theoretically should not exist, as they do not expand language's expressive potential. However, it was suggested that even if synonyms denote the same concept, they may reflect different perspectives or carry distinct cultural associations, claims that have rarely been tested quantitatively. In Hindi, prolonged contact with Persian produced many Perso-Arabic loanwords coexisting with their Sanskrit counterparts, forming numerous synonym pairs. This study investigates whether centuries after these borrowings appeared in the Subcontinent their origin can still be distinguished using distributional data alone and regardless of their semantic content. A Random Forest trained on word embeddings of Hindi synonyms successfully classified words by Sanskrit or Perso-Arabic origin, even when they were semantically unrelated, suggesting that usage patterns preserve traces of etymology. These findings provide quantitative evidence that context encodes etymological signals and that synonymy may reflect subtle but systematic distinctions linked to origin. They support the idea that synonymous words can offer different perspectives and that etymologically related words may form distinct conceptual subspaces, creating a new type of semantic frame shaped by historical origin. Overall, the results highlight the power of context in capturing nuanced distinctions beyond traditional semantic similarity.
[13] Are Finer Citations Always Better? Rethinking Granularity for Attributed Generation
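Hexuan Wang, Jingyu Zhang, Benjamin Van Durme, Daniel Khashabi
Main category: cs.CL
TL;DR: This paper studies how citation granularity (sentence-level, paragraph-level, document-level) affects attributed generation. Intermediate granularity (paragraph-level) gives the best balance of attribution quality and answer correctness, while citations that are too fine or too coarse both hurt, and the size of the effect varies non-monotonically with model scale.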
Details
Motivation: Fine-grained citations help human verification, but their effect on model attribution performance is under-explored. The question is how to satisfy human verifiability while respecting the model's own constraints.
Method: Four language models of different scales (8B-120B) are evaluated systematically for attribution quality and answer correctness under different citation granularities (sentence, paragraph, multi-paragraph), and the interaction between performance and model scale is analyzed.
Result: Paragraph-level citation yields the best attribution quality at every model scale, and enforcing fine-grained citation degrades attribution quality by 16-276% relative to the best granularity. Fine granularity disrupts semantic dependencies and coarse granularity introduces noise. Larger models are penalized more by fine-grained constraints. The optimal granularity also preserves or even improves answer correctness.
Conclusion: Pursuing fine-grained citation purely for human verifiability works against how models handle semantics and reduces attribution faithfulness and generation reliability. Citation granularity should be aligned with the model's natural semantic scope.
Abstract: Citation granularity - whether to cite individual sentences, paragraphs, or documents - is a critical design choice in attributed generation. While fine-grained citations are often preferred for precise human verification, their impact on model performance remains under-explored. We analyze four model scales (8B-120B) and demonstrate that enforcing fine-grained citations degrades attribution quality by 16-276% compared to the best-performing granularity. We observe a consistent performance pattern where attribution quality peaks at intermediate granularities (paragraph-level). Our analysis suggests that fine-grained (sentence-level) citations disrupt necessary semantic dependencies for attributing evidence to answer claims, while excessively coarse citations (multi-paragraph) introduce distracting noise. Importantly, the magnitude of this performance gap varies non-monotonically with model scale: fine-grained constraints disproportionately penalize larger models, suggesting that atomic citation units disrupt the multi-sentence information synthesis at which these models excel. Strikingly, citation-optimal granularity leads to substantial gains in attribution quality while preserving or even improving answer correctness. Overall, our findings demonstrate that optimizing solely for human verification via fine-grained citation disregards model constraints, compromising both attribution faithfulness and generation reliability. Instead, effective attribution requires aligning citation granularity with the model's natural semantic scope.
[14] Wired for Overconfidence: A Mechanistic Perspective on Inflated Verbalized Confidence in LLMs
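Tianyi Zhao, Yinhan He, Wendy Zheng, Yujie Zhang, Chen Chen
Main category: cs.CL
TL;DR: Through circuit-level mechanistic analysis, this paper uncovers why large language models (LLMs) still express high confidence when they produce wrong answers (being "confidently wrong"). A small set of MLP blocks and attention heads in middle-to-late layers is responsible for the confidence-inflation signal, and targeted inference-time interventions on them substantially improve calibration.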
Details
Motivation: Large language models are often "confidently wrong", assigning overly high confidence to incorrect answers. This misleads users and weakens confidence as an uncertainty signal, yet the internal mechanism is poorly understood.
Method: The paper proposes a circuit-level mechanistic analysis with three parts: (1) modeling verbalized confidence as a differentiable internal signal; (2) identifying the circuits that causally inflate it; (3) using these findings for targeted inference-time recalibration. The empirical analysis covers two instruction-tuned LLMs and three datasets.
Result: A small set of MLP blocks and attention heads in middle-to-late layers consistently writes the confidence-inflation signal at the final token position. Targeted inference-time interventions on them substantially improve calibration.
Conclusion: Verbalized overconfidence in LLMs is driven by identifiable internal circuits and can be mitigated effectively through targeted intervention.
Abstract: Large language models are often not just wrong, but confidently wrong: when they produce factually incorrect answers, they tend to verbalize overly high confidence rather than signal uncertainty. Such verbalized overconfidence can mislead users and weaken confidence scores as a reliable uncertainty signal, yet its internal mechanisms remain poorly understood. We present a circuit-level mechanistic analysis of this inflated verbalized confidence in LLMs, organized around three axes: capturing verbalized confidence as a differentiable internal signal, identifying the circuits that causally inflate it, and leveraging these insights for targeted inference-time recalibration. Across two instruction-tuned LLMs on three datasets, we find that a compact set of MLP blocks and attention heads, concentrated in middle-to-late layers, consistently writes the confidence-inflation signal at the final token position. We further show that targeted inference-time interventions on these circuits substantially improve calibration. Together, our results suggest that verbalized overconfidence in LLMs is driven by identifiable internal circuits and can be mitigated through targeted intervention.
[15] A Dynamic Atlas of Persian Poetic Symbolism: Families, Fields, and the Historical Rewiring of Meaning
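Kourosh Shahnazari, Seyed Moein Ayyoubzadeh, Mohammadali Keshtparvar
Main category: cs.CL
TL;DR: Using a corpus of 129,451 Persian poems, this paper consolidates recurrent images into traceable "families", separates imagistic material from sacred and courtly reference, and models their relations in a multi-layer graph, revealing how the structure of Persian poetic symbolism changes over time.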
Details
Motivation: In Persian poetry, images recur and relate to one another as families, yet computational work usually reduces them to isolated words or broad document semantics, missing this key unit of poetic organization.
Method: From a corpus of 129,451 Persian poems, recurrent images are identified and consolidated into traceable "families", imagistic material is separated from sacred and courtly reference, and a multi-layer relation graph is built. The corpus is binned into 11 Hijri centuries to analyze how graph structure evolves (modularity, cross-scope linkage, hub changes, and so on).
Result: The symbolic core is sparse and the referential component dense, with selective attachment between them. Some families (such as Shab, Ruz, and Khaak) stay widely distributed over the long term, while wine vessels, gardens, flame, and lyric sound strengthen later, and courtly-heroic vocabulary is weighted earlier. Modularity rises, cross-scope linkage declines, courtly bridges weaken, and sacred bridges strengthen. Hubs shift over time: Kherqe gains late prominence, Farkhondeh and Banafsheh recede, and Saaghar stays central throughout.
Conclusion: Persian poetic symbolism is not a static lexicon but a long-lived dynamic system whose internal weights and connections keep changing over time.
Abstract: Persian poetry is often remembered through recurrent symbols before it is remembered through plot. Wine vessels, gardens, flames, sacred titles, bodily beauty, and courtly names return across centuries, yet computational work still tends to flatten this material into isolated words or broad document semantics. That misses a practical unit of organization in Persian poetics: related forms travel as families and gain force through recurring relations. Using a corpus of 129,451 poems, we consolidate recurrent forms into traceable families, separate imagistic material from sacred and courtly reference, and map their relations in a multi-layer graph. The symbolic core is relatively sparse, the referential component much denser, and the attachment zone between them selective rather than diffuse. Across 11 Hijri-century bins, some families remain widely distributed, especially Shab (Night), Ruz (Day), and Khaak (Earth). Wine vessels, garden space, flame, and lyric sound strengthen later, while prestige-coded and heroic-courtly vocabulary is weighted earlier. Century-specific graphs show change in arrangement as well as membership. Modularity rises, cross-scope linkage declines, courtly bridges weaken, and sacred bridges strengthen. Hub positions shift too: Kherqe (Sufi Robe) gains late prominence, Farkhondeh (Blessed) and Banafsheh (Violet) recede, and Saaghar (Wine Cup) stays central across the chronology. In this corpus, Persian symbolism appears less as a fixed repertory than as a long-lived system whose internal weights and connections change over time.
[16] Magic, Madness, Heaven, Sin: LLM Output Diversity is Everything, Everywhere, All at Once
Harnoor Dhingra
Main category: cs.CL
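TL;DR: This paper proposes the Magic, Madness, Heaven, Sin framework, which starts from the normative objective of a task (epistemic, interactional, societal, safety) to analyze systematically what output variation in large language models means and how it should be evaluated. It argues that variation is a task-dependent property rather than an intrinsic trait of the model.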
Details
Motivation: Discussion of output "diversity" in large language models uses fragmented terminology, mainly because the normative objectives behind tasks are not made explicit.
Method: The paper builds a four-part normative framework (Magic: factuality; Madness: user utility; Heaven: societal representation; Sin: safety), characterizes how variation appears and fails in each context, and analyzes the interactions across contexts.
Result: Optimizing for one objective (such as safety) can harm others (such as demographic representation or creative diversity), which shows that the evaluation of variation must be context-sensitive.
Conclusion: Output variation is not an intrinsic property of a model but an evaluable characteristic shaped by the objectives of the task at hand. Evaluation should be context-aware and grounded in normative objectives.
Abstract: Research on Large Language Models (LLMs) studies output variation across generation, reasoning, alignment, and representational analysis, often under the umbrella of "diversity." Yet the terminology remains fragmented, largely because the normative objectives underlying tasks are rarely made explicit. We introduce the Magic, Madness, Heaven, Sin framework, which models output variation along a homogeneity-heterogeneity axis, where valuation is determined by the task and its normative objective. We organize tasks into four normative contexts: epistemic (factuality), interactional (user utility), societal (representation), and safety (robustness). For each, we examine the failure modes and vocabulary such as hallucination, mode collapse, bias, and erasure through which variation is studied. We apply the framework to analyze all pairwise cross-contextual interactions, revealing that optimizing for one objective, such as improving safety, can inadvertently harm demographic representation or creative diversity. We argue for context-aware evaluation of output variation, reframing it as a property shaped by task objectives rather than a model's intrinsic trait.
[17] Why Instruction-Based Unlearning Fails in Diffusion Models?
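Zeliang Zhang, Rui Sun, Jiani Liu, Qi Wu, Chenliang Xu
Main category: cs.CL
TL;DR: This paper studies whether instruction-based unlearning works in diffusion models and finds that natural-language instructions alone cannot suppress targeted concepts, revealing a fundamental limitation of prompt-level instruction in diffusion models.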
Details
Motivation: The paper asks whether instruction-based unlearning extends to other generative models such as diffusion models.
Method: Controlled experiments across multiple concepts and prompt variants, together with analysis of the CLIP text encoder and cross-attention dynamics during denoising, are used to assess how diffusion models respond to natural-language unlearning instructions.
Result: Diffusion models systematically fail to suppress targeted concepts from natural-language unlearning instructions alone. The instructions do not produce a sustained reduction in attention to the targeted concept tokens, so the concept representation persists throughout generation.
Conclusion: Prompt-level instruction has a fundamental limitation in diffusion models, and effective unlearning requires interventions beyond inference-time language control.
Abstract: Instruction-based unlearning has proven effective for modifying the behavior of large language models at inference time, but whether this paradigm extends to other generative models remains unclear. In this work, we investigate instruction-based unlearning in diffusion-based image generation models and show, through controlled experiments across multiple concepts and prompt variants, that diffusion models systematically fail to suppress targeted concepts when guided solely by natural-language unlearning instructions. By analyzing both the CLIP text encoder and cross-attention dynamics during the denoising process, we find that unlearning instructions do not induce sustained reductions in attention to the targeted concept tokens, causing the targeted concept representations to persist throughout generation. These results reveal a fundamental limitation of prompt-level instruction in diffusion models and suggest that effective unlearning requires interventions beyond inference-time language control.
[18] Read More, Think More: Revisiting Observation Reduction for Web Agents
Masafumi Enomoto,Ryoma Obara,Haochen Zhang,Masafumi Oyamada
Main category: cs.CL
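TL;DR: This paper revisits how HTML observations are represented for web agents and finds that the best representation depends on model capability and thinking-token budget: lower-capability models do better with compact accessibility trees, while higher-capability models benefit from detailed HTML, and more thinking tokens further amplify HTML's advantage. Adding observation history improves performance, and a diff-based representation keeps it token-efficient.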
Details
Motivation: Prior work treats HTML verbosity as an obstacle to performance and routinely applies observation reduction. This paper re-evaluates that trend and asks what determines the best observation representation at different capability levels.
Method: Compares lower- and higher-capability models across observation representations (HTML vs. accessibility tree) and thinking-token budgets, combined with error analysis and experiments on observation history and diff-based representations.
Result: 1) Higher-capability models perform better with HTML and gain further from more thinking tokens, while lower-capability models are better served by accessibility trees; 2) higher-capability models use layout information in HTML to ground actions better, while lower-capability models hallucinate more under longer inputs; 3) observation history improves performance across most models and settings, and a diff-based representation offers a token-efficient alternative.
Conclusion: Observation representations should be selected adaptively based on model capability and thinking-token budget, with observation history incorporated through diff-based representations.
Abstract: Web agents based on large language models (LLMs) rely on observations of web pages -- commonly represented as HTML -- as the basis for identifying available actions and planning subsequent steps. Prior work has treated the verbosity of HTML as an obstacle to performance and adopted observation reduction as a standard practice. We revisit this trend and demonstrate that the optimal observation representation depends on model capability and thinking token budget: (1) compact observations (accessibility trees) are preferable for lower-capability models, while detailed observations (HTML) are advantageous for higher-capability models; moreover, increasing thinking tokens further amplifies the benefit of HTML. (2) Our error analysis suggests that higher-capability models exploit layout information in HTML for better action grounding, while lower-capability models suffer from increased hallucination under longer inputs. We also find that incorporating observation history improves performance across most models and settings, and a diff-based representation offers a token-efficient alternative. Based on these findings, we suggest practical guidelines: adaptively select observation representations based on model capability and thinking token budget, and incorporate observation history using diff-based representations.
[19] Countering Catastrophic Forgetting of Large Language Models for Better Instruction Following via Weight-Space Model Merging
Mengxian Lyu,Cheng Peng,Ziyi Chen,Mengyuan Zhang,Jieting Li Lu,Yonghui Wu
Main category: cs.CL
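TL;DR: This study proposes a model merging framework that interpolates a clinical foundation model (GatorTronLlama) with a general instruct model (Llama-3.1-8B-Instruct), improving clinical task performance while retaining instruction-following ability and mitigating catastrophic forgetting during medical fine-tuning of large language models.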
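As a rough illustration of what interpolation-based merging means, the sketch below linearly interpolates two checkpoints parameter by parameter. The function name `interpolate_weights`, the coefficient `alpha`, and the toy parameter dictionaries are hypothetical; the specific merge methods and coefficients used in the paper are not stated in this entry.

```python
def interpolate_weights(clinical, instruct, alpha=0.5):
    """Return alpha * clinical + (1 - alpha) * instruct for every parameter.

    Both arguments map parameter names to flat lists of floats, a toy
    stand-in for real tensors. `alpha` is a hypothetical mixing coefficient.
    """
    if clinical.keys() != instruct.keys():
        raise ValueError("checkpoints must share the same parameter names")
    merged = {}
    for name, w_clin in clinical.items():
        w_inst = instruct[name]
        if len(w_clin) != len(w_inst):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise convex combination of the two weight vectors.
        merged[name] = [alpha * a + (1.0 - alpha) * b
                        for a, b in zip(w_clin, w_inst)]
    return merged


# Toy checkpoints (illustrative values only).
clinical = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
instruct = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]}
merged = interpolate_weights(clinical, instruct, alpha=0.25)
```

With `alpha` closer to 1 the merged model stays nearer the clinical checkpoint; closer to 0 it stays nearer the instruct checkpoint.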
Details
Motivation: General-purpose LLMs lose much of their instruction-following ability ("forgetting") when fine-tuned on medical data; addressing this is needed for reliable clinical deployment.
Method: Applies interpolation-based weight merging to combine the clinical foundation model GatorTronLlama with the general instruct model Llama-3.1-8B-Instruct in weight space, without requiring large amounts of labeled data.
Result: Across medical benchmarks and five clinical generation tasks (e.g., radiology and discharge summarization), merged models mitigate catastrophic forgetting while preserving clinical expertise and instruction-following ability; under severely constrained supervision (e.g., 64-shot) they perform on par with fully fine-tuned baselines (256-shot).
Conclusion: Weight-space merging is an efficient, scalable, and resource-friendly way to adapt open-source LLMs to resource-constrained clinical settings.
Abstract: Large language models have been adopted in the medical domain for clinical documentation to reduce clinician burden. However, studies have reported that LLMs often "forget" a significant amount of instruction-following ability when fine-tuned using a task-specific medical dataset, a critical challenge in adopting general-purpose LLMs for clinical applications. This study presents a model merging framework to efficiently adapt general-purpose LLMs to the medical domain by countering this forgetting issue. By merging a clinical foundation model (GatorTronLlama) with a general instruct model (Llama-3.1-8B-Instruct) via interpolation-based merge methods, we seek to derive a domain-adapted model with strong performance on clinical tasks while retaining instruction-following ability. Comprehensive evaluation across medical benchmarks and five clinical generation tasks (e.g., radiology and discharge summarization) shows that merged models can effectively mitigate catastrophic forgetting, preserve clinical domain expertise, and retain instruction-following ability. In addition, our model merging strategies demonstrate training efficiency, achieving performance on par with fully fine-tuned baselines under severely constrained supervision (e.g., 64-shot vs. 256-shot). Consequently, weight-space merging constitutes a highly scalable solution for adapting open-source LLMs to clinical applications, facilitating broader deployment in resource-constrained healthcare environments.
[20] DeltaMem: Towards Agentic Memory Management via Reinforcement Learning
Qi Zhang,Shen Huang,Chu Liu,Shouqing Yang,Junbo Zhao,Haobo Wang,Pengjun Xie
Main category: cs.CL
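TL;DR: This paper proposes DeltaMem, a single-agent persona-centric memory management system. Drawing on the evolution of human memory, it synthesizes a user-assistant dialogue dataset with memory-update labels, introduces a Memory-based Levenshtein Distance as the reward, and applies reinforcement learning, outperforming existing methods on several long-term memory benchmarks.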
Details
Motivation: Existing multi-agent frameworks for persona-centric memory management suffer from information loss, fragility across scenarios, and suboptimal performance.
Method: Proposes the single-agent DeltaMem system; synthesizes a user-assistant dialogue dataset with operation-level memory-update labels; designs a Memory-based Levenshtein Distance as the memory-update reward; and trains with a tailored reinforcement learning framework.
Result: Both training-free and RL-trained DeltaMem outperform all product-level baselines on long-term memory benchmarks including LoCoMo, HaluMem, and PersonaMem.
Conclusion: By simplifying the architecture and adding cognitively inspired modeling and reinforcement learning, DeltaMem achieves more robust, efficient, and scalable persona-centric memory management.
Abstract: Recent advances in persona-centric memory have revealed the powerful capability of multi-agent systems in managing persona memory, especially in conversational scenarios. However, these complex frameworks often suffer from information loss and are fragile across varying scenarios, resulting in suboptimal performance. In this paper, we propose DeltaMem, an agentic memory management system that formulates persona-centric memory management as an end-to-end task within a single-agent setting. To further improve the performance of our agentic memory manager, we draw inspiration from the evolution of human memory and synthesize a user-assistant dialogue dataset along with corresponding operation-level memory updating labels. Building on this, we introduce a novel Memory-based Levenshtein Distance to formalize the memory updating reward, and propose a tailored reinforcement learning framework to further enhance the management capabilities of DeltaMem. Extensive experiments show that both training-free and RL-trained DeltaMem outperform all product-level baselines across diverse long-term memory benchmarks, including LoCoMo, HaluMem, and PersonaMem.
[21] Swift-SVD: Theoretical Optimality Meets Practical Efficiency in Low-Rank LLM Compression
Ruoling Qi,Yirui Liu,Xuaner Wu,Xiangyu Wang,Ming Li,Chen Chen,Jian Chen,Yin Chen,Qizhen Weng
Main category: cs.CL
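TL;DR: This paper proposes Swift-SVD, an activation-aware, closed-form SVD compression framework that combines theoretical optimality, practical efficiency, and numerical stability, substantially improving the speed and accuracy of compressing LLM weights and KV cache.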
Details
Motivation: LLM deployment is constrained by the memory and bandwidth demands of static weights and the dynamic KV cache; existing SVD compression methods fall short in either reconstruction error or computational efficiency.
Method: Swift-SVD incrementally aggregates the covariance of output activations over a batch of inputs and performs a single eigenvalue decomposition, giving training-free, fast, and optimal layer-wise low-rank approximation; it uses effective rank to analyze layer compressibility and a dynamic rank allocation strategy that jointly considers local reconstruction loss and end-to-end layer importance.
Result: Across six LLMs and eight datasets, Swift-SVD achieves the best compression accuracy while running 3-70x faster than state-of-the-art methods in end-to-end compression time.
Conclusion: Swift-SVD offers a hardware-friendly, theoretically grounded, and practically efficient approach to SVD compression for LLM deployment.
Abstract: The deployment of Large Language Models is constrained by the memory and bandwidth demands of static weights and dynamic Key-Value cache. SVD-based compression provides a hardware-friendly solution to reduce these costs. However, existing methods suffer from two key limitations: some are suboptimal in reconstruction error, while others are theoretically optimal but practically inefficient. In this paper, we propose Swift-SVD, an activation-aware, closed-form compression framework that simultaneously guarantees theoretical optimum, practical efficiency and numerical stability. Swift-SVD incrementally aggregates covariance of output activations given a batch of inputs and performs a single eigenvalue decomposition after aggregation, enabling training-free, fast, and optimal layer-wise low-rank approximation. We employ effective rank to analyze local layer-wise compressibility and design a dynamic rank allocation strategy that jointly accounts for local reconstruction loss and end-to-end layer importance. Extensive experiments across six LLMs and eight datasets demonstrate that Swift-SVD outperforms state-of-the-art baselines, achieving optimal compression accuracy while delivering 3-70X speedups in end-to-end compression time. Our code will be released upon acceptance.
[22] Grounding AI-in-Education Development in Teachers' Voices: Findings from a National Survey in Indonesia
Nurul Aisyah,Muhammad Dehan Al Kautsar,Arif Hidayat,Fajri Koto
Main category: cs.CL
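TL;DR: Based on a nationwide survey of 349 K-12 teachers in Indonesia, this study describes how AI is actually used in classrooms, how use varies, and what the main barriers are: elementary teachers use AI more consistently and senior high teachers engage less; mid-career teachers place higher importance on AI, and teachers in Eastern Indonesia perceive greater value; the main use is reducing preparation workload, but generic outputs, infrastructure constraints, and limited contextual alignment hinder effective integration.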
Details
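Motivation: To fill the gap in large-scale, teacher-centred evidence on how AI is used in Indonesian classrooms and what support teachers need, in order to inform context-appropriate AI systems and policies.
Method: A nationwide questionnaire survey of 349 K-12 teachers across elementary, junior high, and senior high schools in Indonesia, analyzing how AI use frequency and teaching activities relate to teacher background (school level, experience, region) and perceived value.
Result: AI use for pedagogy, content development, and teaching media is increasing but uneven: elementary teachers report more consistent use and senior high teachers engage less; mid-career teachers assign higher importance to AI; teachers in Eastern Indonesia perceive greater value; the main use is reducing preparation workload such as assessment, lesson planning, and material development; the main barriers are generic outputs, infrastructure constraints, and limited contextual alignment.
Conclusion: More context-aware AI tools that support local languages and curricula are needed, along with better infrastructure and targeted teacher training, for AI to be adopted fairly, effectively, and sustainably in Indonesian education.
Abstract: Despite emerging use in Indonesian classrooms, there is limited large-scale, teacher-centred evidence on how AI is used in practice and what support teachers need, hindering the development of context-appropriate AI systems and policies. To address this gap, we conduct a nationwide survey of 349 K-12 teachers across elementary, junior high, and senior high schools. We find increasing use of AI for pedagogy, content development, and teaching media, although adoption remains uneven. Elementary teachers report more consistent use, while senior high teachers engage less; mid-career teachers assign higher importance to AI, and teachers in Eastern Indonesia perceive greater value. Across levels, teachers primarily use AI to reduce instructional preparation workload (e.g., assessment, lesson planning, and material development). However, generic outputs, infrastructure constraints, and limited contextual alignment continue to hinder effective classroom integration.
[23] Fragile Reasoning: A Mechanistic Analysis of LLM Sensitivity to Meaning-Preserving Perturbations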
Shou-Tzu Han,Rodrigue Rizk,KC Santosh
Main category: cs.CL
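TL;DR: This paper studies how fragile large language models are to surface perturbations in mathematical reasoning, proposes the Mechanistic Perturbation Diagnostics (MPD) framework to diagnose failure mechanisms, builds a failure taxonomy from the diagnostics, and tests how well different repair strategies work.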
Details
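Motivation: Although LLMs perform strongly on mathematical reasoning benchmarks, they are surprisingly fragile to meaning-preserving surface perturbations (such as name substitution and number-format paraphrasing), and the mechanisms behind these failures need to be understood.
Method: Systematically evaluates Mistral-7B, Llama-3-8B, and Qwen2.5-7B on 677 GSM8K problems and their semantically equivalent variants; proposes the MPD framework, which combines logit lens analysis, activation patching, component ablation, and a new metric, the Cascading Amplification Index (CAI); builds a mechanistic failure taxonomy and runs targeted repair experiments (steering vectors and layer fine-tuning).
Result: Answer-flip rates across the three models range from 28.8% to 45.1%, with number paraphrasing more disruptive than name swaps; CAI outperforms first divergence layer as a failure predictor for two of three architectures (AUC up to 0.679); logit lens shows that flipped samples diverge from correct predictions at earlier layers; activation patching reveals an architectural divide, with Llama-3 failures recoverable by patching specific layers (43/60) while Mistral and Qwen failures are broadly distributed (3/60 and 0/60); steering vectors and layer fine-tuning recover 12.2% of localized failures (Llama-3) but only 7.2% of entangled (Qwen) and 5.2% of distributed (Mistral) failures.
Conclusion: Failures in LLM mathematical reasoning are mechanistically heterogeneous; MPD can identify failure modes and guide targeted repair, and future work should design robustness strategies that account for architecture.
Abstract: Large language models demonstrate strong performance on mathematical reasoning benchmarks, yet remain surprisingly fragile to meaning-preserving surface perturbations. We systematically evaluate three open-weight LLMs, Mistral-7B, Llama-3-8B, and Qwen2.5-7B, on 677 GSM8K problems paired with semantically equivalent variants generated through name substitution and number format paraphrasing. All three models exhibit substantial answer-flip rates (28.8%-45.1%), with number paraphrasing consistently more disruptive than name swaps. To trace the mechanistic basis of these failures, we introduce the Mechanistic Perturbation Diagnostics (MPD) framework, combining logit lens analysis, activation patching, component ablation, and the Cascading Amplification Index (CAI) into a unified diagnostic pipeline. CAI, a novel metric quantifying layer-wise divergence amplification, outperforms first divergence layer as a failure predictor for two of three architectures (AUC up to 0.679). Logit lens reveals that flipped samples diverge from correct predictions at significantly earlier layers than stable samples. Activation patching reveals a stark architectural divide in failure localizability: Llama-3 failures are recoverable by patching at specific layers (43/60 samples), while Mistral and Qwen failures are broadly distributed (3/60 and 0/60).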
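To make the answer-flip rate concrete, here is a minimal sketch. It assumes the rate is the fraction of problem pairs whose predicted answer changes between the original problem and its perturbed variant; the paper's exact definition (for example, whether it conditions on the original answer being correct) is not given in this entry, and `answer_flip_rate` is a hypothetical name.

```python
def answer_flip_rate(original_answers, perturbed_answers):
    """Fraction of paired problems whose predicted answer changes.

    Assumption: a "flip" is any change in the predicted answer between
    the original problem and its meaning-preserving variant.
    """
    if len(original_answers) != len(perturbed_answers):
        raise ValueError("answers must be paired one-to-one")
    if not original_answers:
        raise ValueError("need at least one problem pair")
    flips = sum(1 for o, p in zip(original_answers, perturbed_answers)
                if o != p)
    return flips / len(original_answers)


# Toy predictions for four problems (illustrative values only).
orig_preds = ["18", "42", "7", "100"]
pert_preds = ["18", "40", "7", "10"]
rate = answer_flip_rate(orig_preds, pert_preds)  # 2 of 4 answers changed
```

Computing the rate separately for name substitution and number paraphrasing would allow the kind of per-perturbation comparison the abstract describes.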
Based on these diagnostic signals, we propose a mechanistic failure taxonomy (localized, distributed, and entangled) and validate it through targeted repair experiments: steering vectors and layer fine-tuning recover 12.2% of localized failures (Llama-3) but only 7.2% of entangled (Qwen) and 5.2% of distributed (Mistral) failures.
[24] What Do Claim Verification Datasets Actually Test? A Reasoning Trace Analysis
Delip Rao,Chris Callison-Burch
Main category: cs.CL
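TL;DR: This paper systematically analyzes reasoning patterns in 24K claim-verification examples and finds that existing benchmarks mainly test direct evidence extraction, with little multi-sentence synthesis or numerical reasoning; datasets differ sharply in their biases, error types vary by domain, and high scores mostly reflect retrieval-plus-entailment ability rather than deeper reasoning.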
Details
Motivation: Despite rapid progress in claim verification, there is no systematic understanding of what reasoning its benchmarks actually exercise.
Method: Uses GPT-4o-mini to generate structured reasoning traces for 24K examples across 9 datasets, and uses a 1B-parameter reasoning verifier to characterize five error types and how they vary by domain.
Result: Direct evidence extraction dominates, while multi-sentence synthesis and numerical reasoning are severely under-represented; datasets differ markedly in their distribution of reasoning types; error types vary by domain (general, scientific, mathematical).
Conclusion: High benchmark scores mainly reflect retrieval-plus-entailment ability rather than complex reasoning; more challenging evaluation suites are needed to test the reasoning that verification systems require.
Abstract: Despite rapid progress in claim verification, we lack a systematic understanding of what reasoning these benchmarks actually exercise. We generate structured reasoning traces for 24K claim-verification examples across 9 datasets using GPT-4o-mini and find that direct evidence extraction dominates, while multi-sentence synthesis and numerical reasoning are severely under-represented. A dataset-level breakdown reveals stark biases: some datasets almost exclusively test lexical matching, while others require information synthesis in roughly half of cases. Using a compact 1B-parameter reasoning verifier, we further characterize five error types and show that error profiles vary dramatically by domain -- general-domain verification is dominated by lexical overlap bias, scientific verification by overcautiousness, and mathematical verification by arithmetic reasoning failures. Our findings suggest that high benchmark scores primarily reflect retrieval-plus-entailment ability. We outline recommendations for building more challenging evaluation suites that better test the reasoning capabilities verification systems need.
[25] PRCCF: A Persona-guided Retrieval and Causal-aware Cognitive Filtering Framework for Emotional Support Conversation
Yanxin Luo,Xiaoyu Zhang,Jing Li,Yan Gao,Donghong Han
Main category: cs.CL
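TL;DR: This paper proposes the PRCCF framework, which adds a persona-guided retrieval mechanism and a causality-aware cognitive filtering module to improve contextual understanding and empathetic response generation in emotional support conversation.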
Details
Motivation: Existing emotional support conversation methods fall short on deep contextual understanding and struggle to model the causal links between user emotion and external knowledge.
Method: PRCCF comprises two core modules: persona-guided retrieval, which jointly models semantic compatibility and persona alignment, and causality-aware cognitive filtering, which prioritizes causally relevant external knowledge.
Result: On the ESConv dataset, PRCCF outperforms state-of-the-art baselines on both automatic metrics and human evaluation.
Conclusion: Combining persona information with causal knowledge can markedly strengthen the contextual understanding and empathetic generation of emotional support dialogue systems.
Abstract: Emotional Support Conversation (ESC) aims to alleviate individual emotional distress by generating empathetic responses. However, existing methods face challenges in effectively supporting deep contextual understanding. To address this issue, we propose PRCCF, a Persona-guided Retrieval and Causality-aware Cognitive Filtering framework. Specifically, the framework incorporates a persona-guided retrieval mechanism that jointly models semantic compatibility and persona alignment to enhance response generation. Furthermore, it employs a causality-aware cognitive filtering module to prioritize causally relevant external knowledge, thereby improving contextual cognitive understanding for emotional reasoning. Extensive experiments on the ESConv dataset demonstrate that PRCCF outperforms state-of-the-art baselines on both automatic metrics and human evaluations. Our code is publicly available at: https://github.com/YancyLyx/PRCCF.
[26] PRISM: Probability Reallocation with In-Span Masking for Knowledge-Sensitive Alignment
Chenning Xu,Mao Zheng,Mingyang Song
Main category: cs.CL
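TL;DR: This paper proposes PRISM, a framework that uses sentence-level factuality risk labels and inter-sentence dependency annotations to reallocate probability at high-risk factual positions during supervised fine-tuning, reducing hallucination in large language model generation.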
Details
Motivation: Supervised fine-tuning (SFT) with token-level hard labels can lead models to imitate factually unsupported targets with overconfidence, propagating hallucinations across multi-sentence generation.
Method: PRISM is a differentiable risk-gated framework that adjusts learning only at fact-critical positions: it combines sentence-level factuality risk labels with inter-sentence dependency annotations and adds a lightweight, model-aware probability reallocation objective that penalizes high-confidence predictions on risky target tokens, with scope controlled by span-level risk weights and model-aware gating.
Result: On hallucination-sensitive factual benchmarks and general evaluations, PRISM improves factual aggregates across backbones while keeping overall capability competitive; ablations show the auxiliary signal works best when used conservatively, and that knowledge masking and model-aware reallocation are complementary in balancing factual correction and capability preservation.
Conclusion: With structured factual-risk signals and fine-grained, model-aware optimization, PRISM mitigates hallucination amplification in SFT and offers a practical, broadly compatible approach to more trustworthy text generation.
Abstract: Supervised fine-tuning (SFT) with token-level hard labels can amplify overconfident imitation of factually unsupported targets, causing hallucinations that propagate in multi-sentence generation. We study an augmented SFT setting in which training instances include coarse sentence-level factuality risk labels and inter-sentence dependency annotations, providing structured signals about where factual commitments are weakly supported. We propose PRISM, a differentiable risk-gated framework that modifies learning only at fact-critical positions. PRISM augments standard SFT with a lightweight, model-aware probability reallocation objective that penalizes high-confidence predictions on risky target tokens, with its scope controlled by span-level risk weights and model-aware gating. Experiments on hallucination-sensitive factual benchmarks and general evaluations show that PRISM improves factual aggregates across backbones while maintaining a competitive overall capability profile. Ablations further show that the auxiliary signal is most effective when used conservatively, and that knowledge masking and model-aware reallocation play complementary roles in balancing factual correction and capability preservation.
[27] On the Role of Reasoning Patterns in the Generalization Discrepancy of Long Chain-of-Thought Supervised Fine-Tuning
Zhaoyi Li,Xiangyu Xi,Zhengyu Chen,Wei Wang,Gangwei Jiang,Ranran Shen,Linqi Song,Ying Wei,Defu Lian
Main category: cs.CL
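TL;DR: This paper compares supervised fine-tuning (SFT) on verified chain-of-thought (CoT) trajectories generated by two models (DeepSeek-R1-0528 and gpt-oss-120b) and finds that the DeepSeek data, despite lower training loss, generalizes worse. Analysis attributes this to DeepSeek's more divergent, branch-heavy reasoning, which leads trained models into redundant exploration. Filtering out frequently branching trajectories improves performance on several reasoning benchmarks.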
Details
Motivation: To understand how verified CoT trajectories from different sources affect the generalization of reasoning models, with a focus on the paradox that lower training loss does not yield better generalization.
Method: Runs a controlled comparison of CoT trajectories from two models with comparable performance but different reasoning patterns (DeepSeek-R1-0528 and gpt-oss-120b); analyzes token-level loss and step-level reasoning behavior; proposes and validates a data selection strategy that filters out frequently branching trajectories.
Result: DeepSeek-R1 data yields lower training loss but significantly worse generalization on reasoning benchmarks such as AIME25 and BeyondAIME; the convergent, deductive reasoning of gpt-oss-120b generalizes better; after filtering frequently branching DeepSeek-R1 trajectories, reasoning performance improves by up to 5.1% on AIME25, 5.5% on BeyondAIME, and 3.6% on average across five benchmarks.
Conclusion: The quality of CoT trajectories depends not only on correctness but also on reasoning structure (convergent vs. divergent); simply filtering inefficient exploratory trajectories improves SFT and offers a new criterion for building high-quality reasoning data.
Abstract: Supervised Fine-Tuning (SFT) on long Chain-of-Thought (CoT) trajectories has become a pivotal phase in building large reasoning models. However, how CoT trajectories from different sources influence the generalization performance of models remains an open question. In this paper, we conduct a comparative study using two sources of verified CoT trajectories generated by two competing models, DeepSeek-R1-0528 and gpt-oss-120b, with their problem sets controlled to be identical. Despite their comparable performance, we uncover a striking paradox: lower training loss does not translate to better generalization. SFT on DeepSeek-R1-0528 data achieves remarkably lower training loss, yet exhibits significantly worse generalization performance on reasoning benchmarks compared to those trained on gpt-oss-120b. To understand this paradox, we perform a multi-faceted analysis probing token-level SFT loss and step-level reasoning behaviors. Our analysis reveals a difference in reasoning patterns. gpt-oss-120b exhibits highly convergent and deductive trajectories, whereas DeepSeek-R1-0528 favors a divergent and branch-heavy exploration pattern. Consequently, models trained with DeepSeek-R1 data inherit inefficient exploration behaviors, often getting trapped in redundant exploratory branches that hinder them from reaching correct solutions.
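One rough way to make "branch-heavy" concrete is to count how often a trajectory signals a change of approach. The sketch below is only an illustration: the marker list, the line-based notion of a reasoning step, and the name `branch_rate` are all hypothetical, and the paper's own step-level analysis is not specified in this entry.

```python
import re

# Hypothetical phrases taken to signal a switch to a new line of attack.
BRANCH_MARKERS = ("alternatively", "wait", "another approach", "let me try")


def branch_rate(trajectory):
    """Branch markers per reasoning step (a step = one non-empty line)."""
    steps = [line for line in trajectory.splitlines() if line.strip()]
    if not steps:
        return 0.0
    text = trajectory.lower()
    hits = sum(len(re.findall(r"\b" + re.escape(marker) + r"\b", text))
               for marker in BRANCH_MARKERS)
    return hits / len(steps)


convergent = "Compute 3 * 4 = 12.\nAdd 5 to get 17.\nSo the answer is 17."
divergent = ("Try factoring.\nWait, that fails.\n"
             "Alternatively, use induction.\nWait, try n = 2 first.")
rates = (branch_rate(convergent), branch_rate(divergent))
```

A score like this could be thresholded to flag trajectories for inspection, though a keyword count is a crude proxy for reasoning structure.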
Building upon this insight, we propose a simple yet effective remedy of filtering out frequently branching trajectories to improve the generalization of SFT. Experiments show that training on selected DeepSeek-R1-0528 subsets surprisingly improves reasoning performance by up to 5.1% on AIME25, 5.5% on BeyondAIME, and on average 3.6% on five benchmarks.
[28] Development and multi-center evaluation of domain-adapted speech recognition for human-AI teaming in real-world gastrointestinal endoscopy
Ruijie Yang,Yan Zhu,Peiyao Fu,Te Luo,Zhihua Wang,Xian Yang,Quanlin Li,Pinghong Zhou,Shuo Wang
Main category: cs.CL
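TL;DR: This paper presents EndoASR, a domain-adapted speech recognition system for gastrointestinal endoscopy. A two-stage adaptation strategy based on synthetic data improves terminology accuracy and noise robustness; in multi-center clinical validation the system lowers character error rate and raises medical term accuracy, and its low latency and compact size allow edge deployment in support of human-AI teaming.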
Details
Motivation: In real-world gastrointestinal endoscopy, existing ASR systems are unreliable because of domain-specific terminology and noisy acoustic conditions, so a domain-adapted solution is needed.
Method: EndoASR uses a two-stage adaptation strategy based on synthetic endoscopy reports: the first stage targets domain-specific language modeling and the second targets noise robustness; the study also tests how pairing it with large language models benefits downstream structured information extraction.
Result: In retrospective evaluation, CER falls from 20.52% to 14.14% and Med ACC rises from 54.30% to 87.59%; in the prospective multi-center study, CER falls from 16.20% to 14.97% and Med ACC rises from 61.63% to 84.16%; RTF is 0.005 (versus 0.055 for Whisper-large-v3), with only 220M parameters.
Conclusion: EndoASR generalizes well and is practical in real multi-center clinical settings, showing that domain-adapted ASR can serve as a reliable speech interface for human-AI teaming in gastrointestinal endoscopy.
Abstract: Automatic speech recognition (ASR) is a critical interface for human-AI interaction in gastrointestinal endoscopy, yet its reliability in real-world clinical settings is limited by domain-specific terminology and complex acoustic conditions. Here, we present EndoASR, a domain-adapted ASR system designed for real-time deployment in endoscopic workflows. We develop a two-stage adaptation strategy based on synthetic endoscopy reports, targeting domain-specific language modeling and noise robustness. In retrospective evaluation across six endoscopists, EndoASR substantially improves both transcription accuracy and clinical usability, reducing character error rate (CER) from 20.52% to 14.14% and increasing medical term accuracy (Med ACC) from 54.30% to 87.59%. In a prospective multi-center study spanning five independent endoscopy centers, EndoASR demonstrates consistent generalization under heterogeneous real-world conditions. Compared with the baseline Paraformer model, CER is reduced from 16.20% to 14.97%, while Med ACC is improved from 61.63% to 84.16%, confirming its robustness in practical deployment scenarios. Notably, EndoASR achieves a real-time factor (RTF) of 0.005, significantly faster than Whisper-large-v3 (RTF 0.055), while maintaining a compact model size of 220M parameters, enabling efficient edge deployment. Furthermore, integration with large language models demonstrates that improved ASR quality directly enhances downstream structured information extraction and clinician-AI interaction.
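For readers unfamiliar with the two headline metrics, the sketch below uses their standard definitions: CER is the character-level edit distance divided by the reference length, and RTF is processing time divided by audio duration. The function names and the example strings are illustrative, not taken from the paper, whose evaluation scripts are not described here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """Character-level edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def real_time_factor(processing_seconds, audio_seconds):
    """Processing time per second of audio; below 1 is faster than real time."""
    return processing_seconds / audio_seconds


# Illustrative example: one substituted character in a 22-character reference.
cer = character_error_rate("polyp in sigmoid colon", "polyp in sigmoid colin")
rtf = real_time_factor(0.5, 100.0)  # 0.5 s to transcribe 100 s of audio
```

On the reported figures, 0.055 / 0.005 = 11, so EndoASR processes audio roughly an order of magnitude faster than Whisper-large-v3.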
These results demonstrate that domain-adapted ASR can serve as a reliable interface for human-AI teaming in gastrointestinal endoscopy, with consistent performance validated across multi-center real-world clinical settings.
[29] Memory in the LLM Era: Modular Architectures and Strategies in a Unified Framework
Yanchen Wu,Tenghui Lin,Yingli Zhou,Fangyuan Zhang,Qintian Guo,Xun Zhou,Sibo Wang,Xilin Liu,Yuchi Ma,Yixiang Fang
Main category: cs.CL
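TL;DR: This paper surveys memory methods for LLM agents under a unified framework, compares them in comprehensive experiments on two benchmarks, analyzes their strengths and weaknesses, proposes a new memory method that performs better, and outlines directions for future research.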
Details
Motivation: Existing memory methods for LLM agents have not been compared systematically and comprehensively under the same experimental settings, making their effectiveness and applicability hard to judge.
Method: 1) Proposes a unified framework covering existing agent memory methods; 2) extensively compares representative methods on two well-known benchmarks; 3) designs a new memory method that combines modules from existing methods.
Result: Reveals differences among memory methods in task performance, efficiency, and scalability; the new method outperforms state-of-the-art methods.
Conclusion: Unified modeling and empirical analysis are key to understanding agent memory; the new method shows that combining existing modules is effective and gives later work a reproducible baseline and clear directions.
Abstract: Memory emerges as the core module in the large language model (LLM)-based agents for long-horizon complex tasks (e.g., multi-turn dialogue, game playing, scientific discovery), where memory can enable knowledge accumulation, iterative reasoning and self-evolution. A number of memory methods have been proposed in the literature. However, these methods have not been systematically and comprehensively compared under the same experimental settings. In this paper, we first summarize a unified framework that incorporates all the existing agent memory methods from a high-level perspective. We then extensively compare representative agent memory methods on two well-known benchmarks and examine the effectiveness of all methods, providing a thorough analysis of those methods. As a byproduct of our experimental analysis, we also design a new memory method by exploiting modules in the existing methods, which outperforms the state-of-the-art methods. Finally, based on these findings, we offer promising future research opportunities. We believe that a deeper understanding of the behavior of existing methods can provide valuable new insights for future research.
[30] Human-Guided Reasoning with Large Language Models for Vietnamese Speech Emotion Recognition
Truc Nguyen,Then Tran,Binh Truong,Phuoc Nguyen T. H
Main category: cs.CL
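TL;DR: This paper proposes a human-machine collaborative framework that combines LLM reasoning with acoustic feature-based models, using confidence-based routing and iterative rule refinement to improve Vietnamese speech emotion recognition on ambiguous samples and in low-resource settings.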
Details
Motivation: Vietnamese speech emotion recognition is hampered by ambiguous acoustic patterns, scarce annotated data, and unclear emotional boundaries in real-world conditions, which calls for integrating human knowledge to make models more robust.
Method: Builds a human-machine collaborative framework centered on LLM reasoning: acoustic models supply confidence and feature-level evidence; a confidence-based routing mechanism sends ambiguous samples to an LLM, which reasons with structured rules derived from human annotation behavior; an iterative strategy of error analysis and rule updates refines the system.
Result: On a Vietnamese dataset of 2,764 samples across three emotion classes (calm, angry, panic) with high inter-annotator agreement (Fleiss Kappa = 0.8574), the method reaches up to 86.59% accuracy and a Macro F1 of about 0.85-0.86.
Conclusion: Combining data-driven models with human reasoning improves recognition of ambiguous samples, is model-agnostic, and suits low-resource speech emotion recognition.
Abstract: Vietnamese Speech Emotion Recognition (SER) remains challenging due to ambiguous acoustic patterns and the lack of reliable annotated data, especially in real-world conditions where emotional boundaries are not clearly separable. To address this problem, this paper proposes a human-machine collaborative framework that integrates human knowledge into the learning process rather than relying solely on data-driven models. The proposed framework is centered around LLM-based reasoning, where acoustic feature-based models are used to provide auxiliary signals such as confidence and feature-level evidence. A confidence-based routing mechanism is introduced to distinguish between easy and ambiguous samples, allowing uncertain cases to be delegated to LLMs for deeper reasoning guided by structured rules derived from human annotation behavior. In addition, an iterative refinement strategy is employed to continuously improve system performance through error analysis and rule updates. Experiments are conducted on a Vietnamese speech dataset of 2,764 samples across three emotion classes (calm, angry, panic), with high inter-annotator agreement (Fleiss Kappa = 0.8574), ensuring reliable ground truth. The proposed method achieves strong performance, reaching up to 86.59% accuracy and Macro F1 around 0.85-0.86, demonstrating its effectiveness in handling ambiguous and hard-to-classify cases.
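The confidence-based routing described above can be sketched as a simple threshold rule. The threshold of 0.8, the function name `route`, and the probability values are hypothetical; the paper's actual routing criterion and confidence estimate are not given in this entry.

```python
def route(acoustic_probs, threshold=0.8):
    """Decide which component handles a sample.

    `acoustic_probs` maps each emotion label to the acoustic model's
    probability. Returns ("acoustic", label) when the top probability
    reaches the threshold, otherwise ("llm", None) to signal that the
    sample should be delegated to rule-guided LLM reasoning.
    The threshold value is an assumption for illustration.
    """
    label, confidence = max(acoustic_probs.items(), key=lambda kv: kv[1])
    if confidence >= threshold:
        return ("acoustic", label)
    return ("llm", None)


# Illustrative probabilities over the three classes named in the abstract.
easy = route({"calm": 0.92, "angry": 0.05, "panic": 0.03})
hard = route({"calm": 0.40, "angry": 0.35, "panic": 0.25})
```

Raising the threshold sends more samples to the LLM, trading cost for extra scrutiny of borderline cases.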
Overall, this work highlights the importance of combining data-driven models with human reasoning, providing a robust and model-agnostic approach for speech emotion recognition in low-resource settings.
[31] Detecting Toxic Language: Ontology and BERT-based Approaches for Bulgarian Text
Melania Berbatova,Tsvetoslav Vasev
Main category: cs.CL
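TL;DR: This paper presents a more nuanced approach to toxic content detection in Bulgarian text: it builds an ontology of toxic vocabulary and an annotated dataset, then trains a BERT model that reaches a macro F1 of 0.89, identifying toxicity while preserving essential information such as medical terms and text related to minority groups.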
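Macro F1, the metric reported above, is the unweighted mean of per-class F1 scores, so each of the four categories counts equally regardless of size. A minimal pure-Python sketch follows; the label names and predictions are made up for illustration and do not come from the paper's dataset.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Illustrative four-class example (hypothetical labels).
y_true = ["toxic", "toxic", "medical", "non_toxic", "minority", "minority"]
y_pred = ["toxic", "non_toxic", "medical", "non_toxic", "minority", "toxic"]
score = macro_f1(y_true, y_pred)
```

Here the per-class F1 values are 0.5 (toxic), 1.0 (medical), 2/3 (non_toxic), and 2/3 (minority), giving a macro F1 of 17/24.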
Details
Motivation: Current toxic content detection often blocks valuable text (such as medical terms and content related to minority groups); a finer-grained, culturally adapted approach for Bulgarian is needed.
Method: Builds an ontology of potentially toxic Bulgarian words; manually annotates 4,384 sentences in four categories (toxic language, medical terminology, non-toxic language, and terms related to minority communities); trains a BERT-based toxicity classifier.
Result: The BERT model reaches a macro F1 of 0.89 on the four-class task and can be deployed in practice as a component of content moderation systems.
Conclusion: The approach improves Bulgarian toxicity detection while reducing wrongful removal of important non-toxic information, offering a workable path for content safety in lower-resource languages and sensitive settings.
Abstract: Toxic content detection in online communication remains a significant challenge, with current solutions often inadvertently blocking valuable information, including medical terms and text related to minority groups. This paper presents a more nuanced approach to identifying toxicity in Bulgarian text while preserving access to essential information. The research explores two distinct methodologies for detecting toxic content. The developed methodologies have potential applications across diverse online platforms and content moderation systems. First, we propose an ontology that models the potentially toxic words in the Bulgarian language. Then, we compose a dataset that comprises 4,384 manually annotated sentences from Bulgarian online forums across four categories: toxic language, medical terminology, non-toxic language, and terms related to minority communities. We then train a BERT-based model for toxic language classification, which reaches a 0.89 F1 macro score. The trained model is directly applicable in a real environment and can be integrated as a component of toxic content detection systems.
[32] LiveMathematicianBench: A Live Benchmark for Mathematician-Level Reasoning with Proof Sketches
Linyang He,Qiyao Yu,Hanze Dong,Baohao Liao,Xinxing Xu,Micah Goldblum,Jiang Bian,Nima Mesgarani
Main category: cs.CL
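TL;DR: This paper introduces LiveMathematicianBench, a dynamic multiple-choice benchmark for mathematical reasoning built from recent arXiv papers, with a logical taxonomy of theorem types and a substitution-resistant mechanism; it shows that even the strongest current LLMs fall well short on research-level mathematical reasoning.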
Details
Motivation: Existing mathematical reasoning benchmarks are limited by synthetic settings and data contamination, so they cannot reliably measure how well LLMs generalize and understand research-level mathematics.
Method: Builds a dynamic multiple-choice benchmark from recent arXiv papers; defines a thirteen-category logical taxonomy of theorem types; generates distractors with a proof-sketch-guided pipeline; adds a substitution-resistant mechanism to separate answer recognition from substantive reasoning.
Result: Gemini-3.1-pro-preview reaches only 43.5% under standard evaluation; under substitution-resistant evaluation, GPT-5.4 scores highest at 30.6% and Gemini-3.1-pro-preview falls to 17.6%, below the 20% random baseline; access to proof sketches consistently improves accuracy.
Conclusion: LiveMathematicianBench is a scalable, contamination-resistant, fine-grained testbed for research-level mathematical reasoning, and it shows that current LLMs remain fundamentally limited in deep mathematical reasoning.
Abstract: Mathematical reasoning is a hallmark of human intelligence, and whether large language models (LLMs) can meaningfully perform it remains a central question in artificial intelligence and cognitive science. As LLMs are increasingly integrated into scientific workflows, rigorous evaluation of their mathematical capabilities becomes a practical necessity. Existing benchmarks are limited by synthetic settings and data contamination. We present LiveMathematicianBench, a dynamic multiple-choice benchmark for research-level mathematical reasoning built from recent arXiv papers published after model training cutoffs. By grounding evaluation in newly published theorems, it provides a realistic testbed beyond memorized patterns. The benchmark introduces a thirteen-category logical taxonomy of theorem types (e.g., implication, equivalence, existence, uniqueness), enabling fine-grained evaluation across reasoning forms. It employs a proof-sketch-guided distractor pipeline that uses high-level proof strategies to construct plausible but invalid answer choices reflecting misleading proof directions, increasing sensitivity to genuine understanding over surface-level matching. We also introduce a substitution-resistant mechanism to distinguish answer recognition from substantive reasoning. Evaluation shows the benchmark is far from saturated: Gemini-3.1-pro-preview, the best model, achieves only 43.5%. Under substitution-resistant evaluation, accuracy drops sharply: GPT-5.4 scores highest at 30.6%, while Gemini-3.1-pro-preview falls to 17.6%, below the 20% random baseline.
A dual-mode protocol reveals that proof-sketch access yields consistent accuracy gains, suggesting models can leverage high-level proof strategies for reasoning. Overall, LiveMathematicianBench offers a scalable, contamination-resistant testbed for studying research-level mathematical reasoning in LLMs.
[33] Taming CATS: Controllable Automatic Text Simplification through Instruction Fine-Tuning with Control Tokens
Hanna Hubarava,Yingqiang Gao
Main category: cs.CL
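TL;DR: This paper proposes a domain-agnostic framework for controllable automatic text simplification (CATS) based on instruction fine-tuning with discrete control tokens. Small models (1-3B) control readability well, but compression-rate control is limited by too little variation in the target attribute in training data; standard metrics are also inadequate for measuring controllability, motivating error-based measures of target-output alignment.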
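As background on the readability targets, the Flesch-Kincaid Grade Level (FKGL) is computed as 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The sketch below applies that formula with a crude vowel-group syllable counter, which is an assumption made for illustration; real readability tools use more careful syllable and sentence segmentation, and the paper's tooling is not described here.

```python
import re


def count_syllables(word):
    """Crude heuristic: number of vowel groups, with a minimum of one."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def fkgl(text):
    """Flesch-Kincaid Grade Level with simple regex tokenization."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and word")
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)


simple = fkgl("The cat sat. The dog ran.")
hard = fkgl("Gastrointestinal endoscopy necessitates considerable "
            "procedural expertise.")
```

An error-based controllability measure in the spirit of the abstract could then compare the FKGL of a model's output against the requested target level, for instance as an absolute difference.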
Details
Motivation: In controllable automatic text simplification (CATS), controllability is often reduced to a decoding problem and evaluated with metrics that do not reflect the degree of control; the authors argue that data and evaluation are what constrain controllability.
Method: Proposes a domain-agnostic CATS framework based on instruction fine-tuning with discrete control tokens, evaluated with Llama, Mistral, and Qwen models from 1B to 14B across four domains (medicine, public administration, news, encyclopedic text), together with error-based controllability measures and sampling and stratification experiments.
Result: Readability control (FKGL, ARI, Dale-Chall) is learned consistently and smaller models (1-3B) can be competitive, but compression control underperforms because existing corpora carry too little variation in compression rate; standard simplification and similarity metrics do not measure control well; naive data splits introduce distributional mismatch that undermines training and evaluation.
Conclusion: Controllability depends chiefly on sufficient variation of the target attribute in training data, not only on model size or decoding strategy; more varied controllable-simplification datasets and target-aligned, error-based evaluation measures are needed.
Abstract: Controllable Automatic Text Simplification (CATS) produces user-tailored outputs, yet controllability is often treated as a decoding problem and evaluated with metrics that do not reflect the degree of control. We observe that controllability in ATS is significantly constrained by data and evaluation. To this end, we introduce a domain-agnostic CATS framework based on instruction fine-tuning with discrete control tokens, steering open-source models to target readability levels and compression rates. Across three model families with different model sizes (Llama, Mistral, Qwen; 1-14B) and four domains (medicine, public administration, news, encyclopedic text), we find that smaller models (1-3B) can be competitive, but reliable controllability strongly depends on whether the training data encodes sufficient variation in the target attribute. Readability control (FKGL, ARI, Dale-Chall) is learned consistently, whereas compression control underperforms due to limited signal variability in the existing corpora. We further show that standard simplification and similarity metrics are insufficient for measuring control, motivating error-based measures for target-output alignment. Finally, our sampling and stratification experiments demonstrate that naive splits can introduce distributional mismatch that undermines both training and evaluation.
[34] DEFT: Distribution-guided Efficient Fine-Tuning for Human Alignment
Liang Zhu,Feiteng Fang,Yuelin Bai,Longze Chen,Zhexiang Zhang,Minghuan Tan,Min Yang
Main category: cs.CL
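TL;DR: This paper proposes DEFT, an efficient alignment framework that uses a differential distribution reward for data filtering and distributional guidance, improving the efficiency and performance of aligning large language models with human values while mitigating the loss of generalization ability.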
Details
Motivation: Existing reinforcement learning from human feedback (RLHF) methods such as PPO are costly and unstable; alternatives still require large amounts of preference data and may weaken model generalization. Method: Distribution-guided Efficient Fine-Tuning (DEFT) computes a differential distribution reward from the language model's output distribution and the discrepancy distribution of the preference data, uses it to filter a small, high-quality subset, and incorporates it into existing alignment methods to guide the model's output distribution. Result: Experiments show that methods combined with DEFT outperform the original methods in both alignment and generalization ability, with significantly reduced training time. Conclusion: DEFT is an efficient and stable approach to aligning large language models that balances alignment performance and generalization ability. Abstract: Reinforcement Learning from Human Feedback (RLHF), using algorithms like Proximal Policy Optimization (PPO), aligns Large Language Models (LLMs) with human values but is costly and unstable. Alternatives have been proposed to replace PPO or integrate Supervised Fine-Tuning (SFT) and contrastive learning for direct fine-tuning and value alignment. However, these methods still require voluminous data to learn preferences and may weaken the generalization ability of LLMs. To further enhance alignment efficiency and performance while mitigating the loss of generalization ability, this paper introduces Distribution-guided Efficient Fine-Tuning (DEFT), an efficient alignment framework incorporating data filtering and distributional guidance by calculating the differential distribution reward based on the output distribution of the language model and the discrepancy distribution of preference data. A small yet high-quality subset is filtered from the raw data using a differential distribution reward, which is then incorporated into existing alignment methods to guide the model's output distribution. Experimental results demonstrate that the methods enhanced by DEFT outperform the original methods in both alignment capability and generalization ability, with significantly reduced training time.
[35] PLOT: Enhancing Preference Learning via Optimal Transport
Liang Zhu,Yuelin Bai,Xiankun Ren,Jiaxi Yang,Lei Zhang,Feiteng Fang,Hamid Alinejad-Rokny,Minghuan Tan,Min Yang
Main category: cs.CL
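TL;DR: This paper proposes PLOT, which builds a token-level loss from optimal transport theory to improve the performance, stability, and global semantic modeling of preference learning in large language models.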
Details
Motivation: Existing preference learning methods suffer from limited performance gains, high computational cost, hyperparameter sensitivity, and a lack of modeling of global token relationships. Method: Preference learning is formulated as an optimal transport problem, with a token-level loss based on token embeddings that aligns outputs with human preferences during fine-tuning-based alignment while preserving the original LLM distribution. Result: Across seven sub-preference tasks in two categories (Human Values; Logic & Problem Solving), PLOT consistently improves alignment performance while maintaining fluency and coherence. Conclusion: Optimal transport provides a principled, theoretically grounded framework for preference learning and offers new insights for preference learning in LLMs. Abstract: Preference learning in Large Language Models (LLMs) has advanced significantly, yet existing methods remain limited by modest performance gains, high computational costs, hyperparameter sensitivity, and insufficient modeling of global token-level relationships. We introduce PLOT, which enhances Preference Learning in fine-tuning-based alignment through a token-level loss derived from Optimal Transport. By formulating preference learning as an Optimal Transport Problem, PLOT aligns model outputs with human preferences while preserving the original distribution of LLMs, ensuring stability and robustness. Furthermore, PLOT leverages token embeddings to capture semantic relationships, enabling globally informed optimization. Experiments across two preference categories (Human Values and Logic & Problem Solving) spanning seven subpreferences demonstrate that PLOT consistently improves alignment performance while maintaining fluency and coherence. These results substantiate optimal transport as a principled methodology for preference learning, establishing a theoretically grounded framework that provides new insights for preference learning of LLMs.
[36] From Guessing to Placeholding: A Cost-Theoretic Framework for Uncertainty-Aware Code Completion
Liang Zhu,Haolin Chen,Lidong Zhao,Xian Wu
Main category: cs.CL
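TL;DR: This paper proposes the Adaptive Placeholder Completion (APC) framework, which outputs explicit placeholders at high-entropy positions instead of forcing a conventional Hard Completion (HC), reducing user editing cost; it achieves theoretical optimality and empirical gains without sacrificing standard completion performance.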
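The cost argument behind placeholder completion can be illustrated with a toy calculation. The sketch below compares the expected cost of a hard completion with the fixed cost of a placeholder at a single token position, and also reports Shannon entropy as the uncertainty signal. The cost values, the error model (a hard completion is wrong with probability 1 - max p), and all function names are assumptions for illustration, not the paper's formulation or its entropy threshold.

```python
import math


def shannon_entropy(probs):
    """Entropy in bits of a next-token distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def choose_action(probs, fix_cost, fill_cost):
    """Pick the cheaper action in expectation at one token position.

    Assumption: a hard completion is wrong with probability 1 - max(probs)
    and costs `fix_cost` to correct; a placeholder always costs `fill_cost`.
    """
    expected_hard_cost = (1.0 - max(probs)) * fix_cost
    return "placeholder" if fill_cost < expected_hard_cost else "complete"


confident = [0.9, 0.05, 0.05]
uncertain = [0.4, 0.3, 0.3]
for dist in (confident, uncertain):
    print(round(shannon_entropy(dist), 3), choose_action(dist, fix_cost=10.0, fill_cost=3.0))
```

With filling cheaper than fixing, the placeholder wins only once the distribution is uncertain enough, which is the intuition behind a critical threshold.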
Details
Motivation: Conventional LLM code completion follows the Hard Completion (HC) paradigm, which forces concrete code even when context is insufficient, so many suggestions end up edited or rejected; analyzing 3 million real-world interactions, the authors find this strategy markedly inefficient. Method: Adaptive Placeholder Completion (APC) models code completion as cost minimization under uncertainty, derives an entropy threshold, and proves that APC outperforms HC above it; training data is built from real edit logs, and a cost-based reward function drives reinforcement learning. Result: Validated on 1.5B-14B parameter models, APC reduces expected editing cost by 19%-50% while preserving HC performance. Conclusion: APC provides a theoretical foundation and a practical training framework for uncertainty-aware code completion, showing that adaptive abstention (inserting placeholders) can be learned end-to-end without sacrificing conventional completion quality. Abstract: While Large Language Models (LLMs) have demonstrated exceptional proficiency in code completion, they typically adhere to a Hard Completion (HC) paradigm, compelling the generation of fully concrete code even amidst insufficient context. Our analysis of 3 million real-world interactions exposes the limitations of this strategy: 61% of the generated suggestions were either edited after acceptance or rejected despite exhibiting over 80% similarity to the user's subsequent code, suggesting that models frequently make erroneous predictions at specific token positions. Motivated by this observation, we propose Adaptive Placeholder Completion (APC), a collaborative framework that extends HC by strategically outputting explicit placeholders at high-entropy positions, allowing users to fill directly via IDE navigation. Theoretically, we formulate code completion as a cost-minimization problem under uncertainty. Premised on the observation that filling placeholders incurs lower cost than correcting errors, we prove the existence of a critical entropy threshold above which APC achieves strictly lower expected cost than HC. We instantiate this framework by constructing training data from filtered real-world edit logs and design a cost-based reward function for reinforcement learning. Extensive evaluations across 1.5B-14B parameter models demonstrate that APC reduces expected editing costs by 19% to 50% while preserving standard HC performance. Our work provides both a theoretical foundation and a practical training framework for uncertainty-aware code completion, demonstrating that adaptive abstention can be learned end-to-end without sacrificing conventional completion quality.
[37] Beyond Detection: Ethical Foundations for Automated Dyslexic Error Attribution
Samuel Rose,Debarati Chakraborty
Main category: cs.CL
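TL;DR: This paper proposes a binary classification method for identifying whether a spelling error stems from dyslexia, combining phonological, orthographic, and morphological features in a twin-input neural network that reaches 93.01% accuracy under writer-independent conditions. It takes an ethics-first stance, systematically examining fairness, interpretability, informed consent, transparency, and human oversight, and stresses that high technical feasibility does not mean the system can be deployed directly in educational settings.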
Details
Motivation: Prior work focuses on correcting the spelling errors of dyslexic writers and neglects error attribution (deciding whether an error is caused by dyslexia); it also lacks systematic consideration of the ethical risks of automated attribution, such as labelling, algorithmic bias, and institutional misuse. Method: Dyslexic error attribution is formulated as a binary classification task with a comprehensive feature set covering orthographic, phonological, and morphological properties; a twin-input neural model is proposed and compared with traditional machine learning baselines in a writer-independent setting. Result: The neural model achieves 93.01% accuracy and a 94.01% F1-score; phonetically plausible errors and vowel confusions are the strongest attribution signals; an ethical framework analysis and deployment guidelines are also provided. Conclusion: Although high-accuracy dyslexic error attribution is technically feasible, deployment in high-stakes settings such as education must rest on sound ethical and legal frameworks; performance metrics alone do not justify responsible use. Abstract: Dyslexic spelling errors exhibit systematic phonological and orthographic patterns that distinguish them from the errors produced by typically developing writers. While this observation has motivated dyslexic-specific spell-checking and assistive writing tools, prior work has focused predominantly on error correction rather than attribution, and has largely neglected the ethical risks. The risk of harmful labelling, covert screening, algorithmic bias, and institutional misuse that automated classification of learners entails requires the development of robust ethical and legal frameworks for research in this area. This paper addresses both gaps. We formulate dyslexic error attribution as a binary classification task: given a misspelt word and its correct target form, determine whether the error pattern is characteristic of a dyslexic or non-dyslexic writer. We develop a comprehensive feature set capturing orthographic, phonological, and morphological properties of each error, and propose a twin-input neural model evaluated against traditional machine learning baselines under writer-independent conditions. The neural model achieves 93.01% accuracy and an F1-score of 94.01%, with phonetically plausible errors and vowel confusions emerging as the strongest attribution signals. We situate these technical results within an explicit ethics-first framework, analysing fairness across subgroups, the interpretability requirements of educational deployment, and the conditions (consent, transparency, human oversight, and recourse) under which a system could be responsibly used. We provide concrete guidelines for ethical deployment and an open discussion of the system's limitations and misuse potential. Our results demonstrate that dyslexic error attribution is feasible at high accuracy while underscoring that feasibility alone is insufficient for deployment in high-stakes educational contexts.
[38] SURE: Synergistic Uncertainty-aware Reasoning for Multimodal Emotion Recognition in Conversations
Yiqiang Cai,Chengyan Wu,Bolei Ma,Bo Chen,Yun Xue,Julia Hirschberg,Ziwei Gong
Main category: cs.CL
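TL;DR: The SURE framework uses an uncertainty-aware mixture-of-experts module, an iterative reasoning module, and a Transformer gate module to improve robustness and contextual modeling in multimodal emotion recognition in conversations, outperforming existing methods on benchmark datasets.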
Details
Motivation: Existing methods focus heavily on modality fusion and overlook uncertainty in noisy features and the need for fine-grained contextual reasoning. Method: The SURE framework comprises an uncertainty-aware mixture-of-experts module (handling modality-specific noise), an iterative reasoning module (supporting multi-turn reasoning over context), and a Transformer gate module (modeling intra- and inter-modal interactions). Result: SURE consistently outperforms state-of-the-art methods on multiple MERC benchmark datasets, supporting the effectiveness of uncertainty modeling and iterative reasoning. Conclusion: Uncertainty modeling and iterative contextual reasoning are key to improving multimodal emotion recognition in conversations. Abstract: Multimodal emotion recognition in conversations (MERC) requires integrating multimodal signals while being robust to noise and modeling contextual reasoning. Existing approaches often emphasize fusion but overlook uncertainty in noisy features and fine-grained reasoning. We propose SURE (Synergistic Uncertainty-aware REasoning) for MERC, a framework that improves robustness and contextual modeling. SURE consists of three components: an Uncertainty-Aware Mixture-of-Experts module to handle modality-specific noise, an Iterative Reasoning module for multi-turn reasoning over context, and a Transformer Gate module to capture intra- and inter-modal interactions. Experiments on benchmark MERC datasets show that SURE consistently outperforms state-of-the-art methods, demonstrating its effectiveness in robust multimodal reasoning. These results highlight the importance of uncertainty modeling and iterative reasoning in advancing emotion recognition in conversational settings.
[39] Reliable News or Propagandist News? A Neurosymbolic Model Using Genre, Topic, and Persuasion Techniques to Improve Robustness in Classification
Géraud Faye,Benjamin Icard,Morgane Casanova,Guillaume Gadek,Guillaume Gravier,Wassila Ouerdane,Céline Hudelot,Sylvain Gatepaille,Paul Égré
Main category: cs.CL
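TL;DR: This paper proposes a neurosymbolic method that combines non-contextual text embeddings (fastText) with symbolic conceptual features (such as genre, topic, and persuasion techniques) to improve the robustness and generalization of propagandist news detection; experiments show it outperforms text-only methods.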
Details
Motivation: Existing propaganda detection methods based on language models such as BERT tend to overfit because of data-collection biases and generalize poorly; robustness and adaptation to new sources need to improve. Method: A hybrid neurosymbolic method that fuses fastText text embeddings with symbolic features (genre, topic, persuasion techniques), accompanied by explainability analyses and ablation studies. Result: The method outperforms text-only baselines on propagandist news classification; ablations and explainability analyses confirm the value of the symbolic features. Conclusion: Adding symbolic conceptual features can significantly improve robustness and generalization; neurosymbolic fusion is an effective route for tackling information disorder. Abstract: Among news disorders, propagandist news is particularly insidious, because it tends to mix oriented messages with factual reports intended to look like reliable news. To detect propaganda, extant approaches based on Language Models such as BERT are promising but often overfit their training datasets, due to biases in data collection. To enhance classification robustness and improve generalization to new sources, we propose a neurosymbolic approach combining non-contextual text embeddings (fastText) with symbolic conceptual features such as genre, topic, and persuasion techniques. Results show improvements over equivalent text-only methods, and ablation studies as well as explainability analyses confirm the benefits of the added features. Keywords: Information disorder, Fake news, Propaganda, Classification, Topic modeling, Hybrid method, Neurosymbolic model, Ablation, Robustness
[40] How to measure the optimality of word or gesture order with respect to the principle of swap distance minimization
Ramon Ferrer-i-Cancho
Main category: cs.CL
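TL;DR: This paper proposes a mathematical framework based on swap distance minimization on the permutohedron for evaluating the optimality of word or gesture order, introduces the quadratic assignment problem (QAP) into linguistic research, and postulates a unifying principle of optimal assignment.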
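The swap distance this entry refers to is the shortest-path distance on the permutohedron, which equals the number of inversions between two orders (the Kendall tau distance). A minimal sketch follows; the function names and the S/O/V example are illustrative choices, and the paper's own optimality score is not reproduced here.

```python
def swap_distance(source, target):
    """Minimum number of adjacent swaps turning `source` into `target`.

    Equals the number of inversions of `source` relative to `target`
    (the Kendall tau distance). Both must contain the same distinct items.
    """
    if sorted(source) != sorted(target) or len(set(source)) != len(source):
        raise ValueError("orders must be permutations of the same distinct items")
    rank = {item: i for i, item in enumerate(target)}
    positions = [rank[item] for item in source]
    return sum(
        1
        for i in range(len(positions))
        for j in range(i + 1, len(positions))
        if positions[i] > positions[j]
    )


def max_swap_distance(n):
    """Diameter of the permutohedron on n elements."""
    return n * (n - 1) // 2


print(swap_distance("SOV", "SVO"), swap_distance("SOV", "VOS"), max_swap_distance(3))
```

Under the hypothesis summarized above, orders at a small swap distance from a source order would be less costly and therefore more likely.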
Details
Motivation: To investigate whether word orders in languages minimize adjacent-swap distance on the permutohedron, as a possible optimization mechanism behind low cognitive or communicative cost. Method: A permutohedron-based swap distance measure quantifies how close cross-linguistic word or gesture orders come to swap distance minimization; the quadratic assignment problem (QAP) is introduced as a unifying optimization framework. Result: Cross-linguistic gesture orders are at least 77% optimal, and this high optimality is unlikely to be due to chance; swap distance minimization is established as a theoretical foundation for order optimization in language and gesture systems. Conclusion: Word and gesture orders follow a general principle of optimal assignment, of which swap distance minimization is one concrete instance; QAP can serve as a unifying theoretical paradigm for multiple linguistic optimization phenomena. Abstract: The structure of all the permutations of a sequence can be represented as a permutohedron, a graph where vertices are permutations and two vertices are linked if a swap of adjacent elements in the permutation of one of the vertices produces the permutation of the other vertex. It has been hypothesized that word orders in languages minimize the swap distance in the permutohedron: given a source order, word orders that are closer in the permutohedron should be less costly and thus more likely. Here we explain how to measure the degree of optimality of word order variation with respect to swap distance minimization. We illustrate the power of our novel mathematical framework by showing that crosslinguistic gestures are at least 77% optimal. It is unlikely that the multiple times where crosslinguistic gestures hit optimality are due to chance. We establish the theoretical foundations for research on the optimality of word or gesture order with respect to swap distance minimization in communication systems. Finally, we introduce the quadratic assignment problem (QAP) into language research as an umbrella for multiple optimization problems and, accordingly, postulate a general principle of optimal assignment that unifies various linguistic principles including swap distance minimization.
[41] Diagnosing Translated Benchmarks: An Automated Quality Assurance Study of the EU20 Benchmark Suite
Klaudia Thellmann,Bernhard Stadler,Michael Färber
Main category: cs.CL
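TL;DR: This paper proposes a three-step automated quality assurance approach for assessing and improving the quality of a machine-translated benchmark suite (EU20), finds that datasets with lower COMET scores show more accuracy errors at span level, and releases cleaned datasets and code.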
Details
Motivation: Machine-translated benchmark datasets offer cost and scale advantages, but noise, loss of structure, and uneven quality weaken confidence; scalable methods for measuring and verifying translation reliability are needed. Method: A three-step automated quality assurance pipeline: (i) a structural corpus audit with fixes; (ii) quality profiling with the neural metric COMET (reference-free and reference-based), including a comparison of translation services (DeepL / ChatGPT / Google); (iii) an LLM-based span-level translation error landscape. Result: Datasets with lower COMET scores (e.g., HellaSwag) show a higher share of accuracy errors at span level; ARC is comparatively clean; reference-based COMET on MMLU against human-edited samples shows the same trend; the cleaned EU20 datasets and the code are released. Conclusion: Automated quality assurance provides practical, scalable quality indicators that help prioritize human review; it complements rather than replaces human gold standards. Abstract: Machine-translated benchmark datasets reduce costs and offer scale, but noise, loss of structure, and uneven quality weaken confidence. What matters is not merely whether we can translate, but also whether we can measure and verify translation reliability at scale. We study translation quality in the EU20 benchmark suite, which comprises five established benchmarks translated into 20 languages, via a three-step automated quality assurance approach: (i) a structural corpus audit with targeted fixes; (ii) quality profiling using a neural metric (COMET, reference-free and reference-based) with translation service comparisons (DeepL / ChatGPT / Google); and (iii) an LLM-based span-level translation error landscape. Trends are consistent: datasets with lower COMET scores exhibit a higher share of accuracy/mistranslation errors at span level (notably HellaSwag; ARC is comparatively clean). Reference-based COMET on MMLU against human-edited samples points in the same direction. We release cleaned/corrected versions of the EU20 datasets, and code for reproducibility. In sum, automated quality assurance offers practical, scalable indicators that help prioritize review, complementing, not replacing, human gold standards.
[42] SAFE: Stepwise Atomic Feedback for Error correction in Multi-hop Reasoning
Daeyong Kwon,Soyoung Yoon,Seung-won Hwang
Main category: cs.CL
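TL;DR: This paper proposes SAFE, a framework that uses a knowledge graph (KG)-backed verification process to eliminate ungrounded reasoning steps in multi-hop question answering at both training and inference time, improving the verifiability and accuracy of model reasoning.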
Details
Motivation: Existing multi-hop QA benchmarks often reward large language models for spurious correctness, masking ungrounded or flawed reasoning; a more rigorous way to evaluate reasoning is needed. Method: SAFE has two phases: (1) train-time verification, which establishes an atomic error taxonomy and a knowledge-graph-grounded verification pipeline to identify and remove unanswerable instances; (2) inference-time verification, which uses a feedback model trained on the verified data to detect and correct ungrounded reasoning steps in real time. Result: SAFE identifies up to 14% of instances as unanswerable, guarantees verifiable trajectories at inference, and improves average accuracy by 8.4 percentage points, significantly outperforming baselines. Conclusion: SAFE improves the verifiability and reliability of reasoning in multi-hop QA and offers a template for building more rigorous reasoning benchmarks. Abstract: Multi-hop QA benchmarks frequently reward Large Language Models (LLMs) for spurious correctness, masking ungrounded or flawed reasoning steps. To shift toward rigorous reasoning, we propose SAFE, a dynamic benchmarking framework that replaces the ungrounded Chain-of-Thought (CoT) with a strictly verifiable sequence of grounded entities. Our framework operates across two phases: (1) train-time verification, where we establish an atomic error taxonomy and a Knowledge Graph (KG)-grounded verification pipeline to eliminate noisy supervision in standard benchmarks, identifying up to 14% of instances as unanswerable, and (2) inference-time verification, where a feedback model trained on this verified dataset dynamically detects ungrounded steps in real-time. Experimental results demonstrate that SAFE not only exposes the critical flaws of existing benchmarks at train-time, but also significantly outperforms standard baselines, achieving an average accuracy gain of 8.4 pp while guaranteeing verifiable trajectories at inference-time.
[43] kNNProxy: Efficient Training-Free Proxy Alignment for Black-Box Zero-Shot LLM-Generated Text Detection
Kahim Wong,Kemou Li,Haiwei Wu,Jiantao Zhou
Main category: cs.CL
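TL;DR: This paper proposes kNNProxy, a training-free, query-efficient method for zero-shot detection of LLM-generated text that reuses the kNN-LM retrieval mechanism to adapt a fixed proxy LLM to the target domain, and extends it to a mixture of proxies (MoP) for better robustness across domains.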
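The kNN-LM style interpolation that this entry builds on can be sketched in a few lines: retrieved neighbors vote for their stored next token with weight exp(-distance), and the resulting distribution is mixed with the proxy model's distribution. The token names, distances, and the mixing weight `lam` below are made-up illustrative values, and the helper names are hypothetical; datastore construction and nearest-neighbor search are omitted.

```python
import math


def knn_distribution(neighbors):
    """Token distribution from retrieved (token, distance) pairs.

    Each neighbor votes for its stored next token with weight exp(-distance).
    """
    weights = {}
    for token, distance in neighbors:
        weights[token] = weights.get(token, 0.0) + math.exp(-distance)
    total = sum(weights.values())
    return {token: w / total for token, w in weights.items()}


def interpolate(p_proxy, p_knn, lam):
    """Mix proxy and retrieval distributions: lam * p_knn + (1 - lam) * p_proxy."""
    vocab = set(p_proxy) | set(p_knn)
    return {t: lam * p_knn.get(t, 0.0) + (1.0 - lam) * p_proxy.get(t, 0.0) for t in vocab}


# Made-up numbers: the proxy prefers "cat", the datastore is split between "cat" and "dog".
p_proxy = {"cat": 0.8, "dog": 0.1, "fish": 0.1}
p_knn = knn_distribution([("cat", 0.0), ("dog", 0.0)])
aligned = interpolate(p_proxy, p_knn, lam=0.5)
print(sorted(aligned.items()))
```

Because the mixture only reweights probabilities at inference time, the proxy's parameters stay untouched, which is what makes this family of methods training-free.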
Details
Motivation: Existing zero-shot detection methods depend on a proxy model that is aligned with the source LLM, which is hard to satisfy in black-box settings; existing alignment methods require supervised fine-tuning or frequent API calls, making them costly, vulnerable to API changes, and fragile under domain shift. Method: The kNNProxy framework builds a lightweight datastore from a target-reflective LGT corpus; at inference, k-nearest-neighbor retrieval yields a token-level predictive distribution that is interpolated with the proxy LLM's output to give an aligned prediction. It is further extended to MoP, which routes each input to a domain-specific datastore. Result: Experiments on multiple benchmarks show that kNNProxy significantly outperforms existing zero-shot methods and some learning-based methods in detection performance, query efficiency, and cross-domain robustness. Conclusion: kNNProxy is a training-free, query-efficient approach to zero-shot LGT detection that needs no proxy fine-tuning, and the MoP extension further strengthens generalization in practical deployment. Abstract: LLM-generated text (LGT) detection is essential for reliable forensic analysis and for mitigating LLM misuse. Existing LGT detectors can generally be categorized into two broad classes: learning-based approaches and zero-shot methods. Compared with learning-based detectors, zero-shot methods are particularly promising because they eliminate the need to train task-specific classifiers. However, the reliability of zero-shot methods fundamentally relies on the assumption that an off-the-shelf proxy LLM is well aligned with the often unknown source LLM, a premise that rarely holds in real-world black-box scenarios. To address this discrepancy, existing proxy alignment methods typically rely on supervised fine-tuning of the proxy or repeated interactions with commercial APIs, thereby increasing deployment costs, exposing detectors to silent API changes, and limiting robustness under domain shift. Motivated by these limitations, we propose the k-nearest neighbor proxy (kNNProxy), a training-free and query-efficient proxy alignment framework that repurposes the kNN language model (kNN-LM) retrieval mechanism as a domain adapter for a fixed proxy LLM. Specifically, a lightweight datastore is constructed once from a target-reflective LGT corpus, either via fixed-budget querying or from existing datasets. During inference, nearest-neighbor evidence induces a token-level predictive distribution that is interpolated with the proxy output, yielding an aligned prediction without proxy fine-tuning or per-token API outputs. To improve robustness under domain shift, we extend kNNProxy into a mixture of proxies (MoP) that routes each input to a domain-specific datastore for domain-consistent retrieval. Extensive experiments demonstrate strong detection performance of our method.
[44] Why Gaussian Diffusion Models Fail on Discrete Data?
Alexander Shabalin,Simon Elistratov,Viacheslav Meshchaninov,Ildus Sadrtdinov,Dmitry Vetrov
Main category: cs.CL
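TL;DR: This paper studies the sampling problem of Gaussian diffusion models (DDPM) on discrete data, finding that within a critical interval where the density of noised data is multimodal, sampling tends to fall into low-density regions and generation quality degrades; combining self-conditioning with a q-sampling strategy markedly improves generation of discrete data such as text, code, and proteins.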
Details
Motivation: Gaussian diffusion models are standard for continuous-domain generation but perform poorly on discrete data (such as text and code); the underlying limitation needs to be understood and addressed. Method: A toy Random Hierarchy Model is used for analysis, identifying a critical interval in which the noised data density is multimodal and sampling fails; self-conditioning and a solver the authors term q-sampling are analyzed, with the solver switched dynamically inside the critical interval. Result: On real discrete-data tasks covering text, programming code, and protein sequences, combining self-conditioning with a switch to q-sampling inside the interval markedly improves generation quality (on metrics such as BLEU, edit distance, and validity). Conclusion: DDPM's failure on discrete data stems from a low-density sampling trap in the multimodal noise interval; mechanism-driven solver switching combined with self-conditioning effectively mitigates it. Abstract: Diffusion models have become a standard approach for generative modeling in continuous domains, yet their application to discrete data remains challenging. We investigate why Gaussian diffusion models with the DDPM solver struggle to sample from discrete distributions that are represented as a mixture of delta-distributions in the continuous space. Using a toy Random Hierarchy Model, we identify a critical sampling interval in which the density of noisified data becomes multimodal. In this regime, DDPM occasionally enters low-density regions between modes, producing out-of-distribution inputs for the model and degrading sample quality. We show that existing heuristics, including self-conditioning and a solver we term q-sampling, help alleviate this issue. Furthermore, we demonstrate that combining self-conditioning with switching from DDPM to q-sampling within the critical interval improves generation quality on real data. We validate these findings across conditional and unconditional tasks in multiple domains, including text, programming code, and proteins.
[45] Tracking the emergence of linguistic structure in self-supervised models learning from speech
Marianne de Heer Kloots,Martijn Bentum,Hosein Mohebbi,Charlotte Pouw,Gaofei Shen,Willem Zuidema
Main category: cs.CL
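TL;DR: This paper studies how six Wav2Vec2 and HuBERT models pre-trained on Dutch speech encode a range of linguistic structures across layers and training stages, finding that the emergence of linguistic structure differs across layers and over time and is strongly affected by the level at which the pre-training objective is defined (e.g., iteratively refined pseudo-labels).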
Details
Motivation: To investigate when and how linguistic structure emerges during the training of self-supervised speech models. Method: Probing analyses over the layers and intermediate checkpoints of six Wav2Vec2 and HuBERT models trained on Dutch, systematically assessing how well they encode a range of linguistic structures. Result: Different linguistic structures show clearly distinct layerwise patterns and learning trajectories; degree of abstraction and timescale affect where they are encoded; higher-order prediction tasks (such as iterative pseudo-labels) increase parallelism across layers. Conclusion: The emergence of linguistic structure in speech models depends not only on model depth but also on the level of abstraction of the pre-training objective and on how information is integrated over time. Abstract: Self-supervised speech models learn effective representations of spoken language, which have been shown to reflect various aspects of linguistic structure. But when does such structure emerge in model training? We study the encoding of a wide range of linguistic structures, across layers and intermediate checkpoints of six Wav2Vec2 and HuBERT models trained on spoken Dutch. We find that different levels of linguistic structure show notably distinct layerwise patterns as well as learning trajectories, which can partially be explained by differences in their degree of abstraction from the acoustic signal and the timescale at which information from the input is integrated. Moreover, we find that the level at which pre-training objectives are defined strongly affects both the layerwise organization and the learning trajectories of linguistic structures, with greater parallelism induced by higher-order prediction tasks (i.e., iteratively refined pseudo-labels).
[46] BidirLM: From Text to Omnimodal Bidirectional Encoders by Adapting and Composing Causal LLMs
Nicolas Boizard,Théo Deschamps-Berger,Hippolyte Gisserot-Boukhlef,Céline Hudelot,Pierre Colombo
Main category: cs.CL
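TL;DR: This paper proposes a new recipe for turning causal generative language models into bidirectional encoders. It introduces a prior masking phase, linear weight merging, and a lightweight multi-domain data mixture to mitigate catastrophic forgetting, merges in specialized causal models to add multimodal capabilities, and releases the resulting BidirLM encoder family as open source.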
Details
Motivation: Existing methods for converting causal generative models into bidirectional encoders lack consensus on training objectives, suffer from catastrophic forgetting at scale, and cannot flexibly integrate the diverse ecosystem of specialized generative models. Method: Systematic ablations on the Gemma3 and Qwen3 families show that a prior masking phase is critical; a dual strategy for the setting without original pre-training data combines linear weight merging with a lightweight multi-domain data mixture; the encoders are further augmented by merging them with specialized causal models to add modality- and domain-specific capabilities. Result: The open-source BidirLM family (five encoders) outperforms alternatives on text, vision, and audio representation benchmarks. Conclusion: The method offers a reproducible, scalable, modular recipe for building bidirectional encoders from any causal decoder LLM and markedly improves multimodal representation. Abstract: Transforming causal generative language models into bidirectional encoders offers a powerful alternative to BERT-style architectures. However, current approaches remain limited: they lack consensus on optimal training objectives, suffer from catastrophic forgetting at scale, and fail to flexibly integrate the vast ecosystem of specialized generative models. In this work, through systematic ablations on the Gemma3 and Qwen3 families, we identify the key factors driving successful adaptation, highlighting the critical role of an often-omitted prior masking phase. To scale this process without original pre-training data, we introduce a dual strategy combining linear weight merging with a lightweight multi-domain data mixture that mitigates catastrophic forgetting. Finally, we augment our encoders by merging them with specialized causal models, seamlessly transferring modality- and domain-specific capabilities. This open-source recipe, designed for any causal decoder LLM, yields BidirLM, a family of five encoders that outperform alternatives on text, vision, and audio representation benchmarks.
[47] Goose: Anisotropic Speculation Trees for Training-Free Speculative Decoding
Tao Jin,Phuong Minh Nguyen,Naoya Inoue
Main category: cs.CL
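TL;DR: This paper proposes GOOSE, a framework that builds an adaptive spine tree (an anisotropic tree) to make speculative decoding for large language models more efficient, shaping the tree according to quality differences between token sources and achieving a 1.9-4.3x lossless speedup.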
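One way to see why a reliable token source deserves depth while an unreliable one does not is to compute the expected number of accepted tokens along a single draft chain. The sketch below assumes independent, equal per-token acceptance rates, which is a simplification made here for illustration and not the paper's proof; the two rates are hypothetical values chosen to differ by a factor of six.

```python
def expected_accepted(p_accept, depth):
    """Expected accepted tokens along one draft chain.

    Token i counts only if tokens 1..i are all accepted, so the expectation
    is p + p^2 + ... + p^depth under independent, equal acceptance rates.
    """
    return sum(p_accept ** i for i in range(1, depth + 1))


# Hypothetical acceptance rates with a 6x gap between the two sources.
reliable, unreliable = 0.9, 0.15
for depth in (1, 2, 4, 8):
    print(depth, round(expected_accepted(reliable, depth), 3),
          round(expected_accepted(unreliable, depth), 3))
```

Extra depth keeps paying off for the reliable source but adds almost nothing for the unreliable one, which matches the intuition of a deep spine with wide fallback branches.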
Details
Motivation: Existing training-free speculative decoding methods do not distinguish the quality of different token sources (such as n-gram matches versus statistical predictions), which leads to inefficient trees; the acceptance-rate gap between high- and low-quality tokens is large (about 6x at the median), calling for a tree design that fits this difference. Method: GOOSE builds an adaptive spine tree: high-acceptance context-matched tokens form a deep main chain, and low-acceptance alternatives hang off each node as wide branches; the anisotropic structure is proved optimal when such a quality gap exists under a fixed verification budget, with accepted tokens per step guaranteed to be at least those of either single-source scheme. Result: On five LLMs (7B-33B) and five benchmarks, GOOSE achieves a 1.9-4.3x lossless speedup, 12%-33% better than balanced-tree baselines. Conclusion: When token sources differ markedly in quality, an asymmetric (anisotropic) tree is the optimal choice; by modeling this heterogeneity explicitly, GOOSE improves speculative decoding efficiency without any training. Abstract: Speculative decoding accelerates large language model inference by drafting multiple candidate tokens and verifying them in a single forward pass. Candidates are organized as a tree: deeper trees accept more tokens per step, but adding depth requires sacrificing breadth (fallback options) under a fixed verification budget. Existing training-free methods draft from a single token source and shape their trees without distinguishing candidate quality across origins. We observe that two common training-free token sources, n-gram matches copied from the input context and statistical predictions from prior forward passes, differ dramatically in acceptance rate (~6x median gap, range 2-18x across five models and five benchmarks). We prove that when such a quality gap exists, the optimal tree is anisotropic (asymmetric): reliable tokens should form a deep chain while unreliable tokens spread as wide branches, breaking through the depth limit of balanced trees. We realize this structure in GOOSE, a training-free framework that builds an adaptive spine tree, a deep chain of high-acceptance context-matched tokens with wide branches of low-acceptance alternatives at each node. We prove that the number of tokens accepted per step is at least as large as that of either source used alone. On five LLMs (7B-33B) and five benchmarks, GOOSE achieves 1.9-4.3x lossless speedup, outperforming balanced-tree baselines by 12-33% under the same budget.
[48] Optimizing RAG Rerankers with LLM Feedback via Reinforcement Learning
Yuhang Wu,Xiangqing Shen,Fanfan Wang,Cangqi Zhou,Zhen Wu,Xinyu Dai,Rui Xia
Main category: cs.CL
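TL;DR: This paper proposes ReRanking Preference Optimization (RRPO), a reinforcement learning framework that optimizes rerankers directly from LLM feedback to improve downstream generation quality, requires no human annotation, and significantly outperforms existing methods on several benchmarks.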
Details
Motivation: Current rerankers rely only on static, human-annotated relevance labels and are decoupled from the downstream LLM generation task, so highly relevant documents do not necessarily provide real utility for generation. Method: Reranking is formulated as a sequential decision-making process; the RRPO reinforcement learning framework optimizes context utility from LLM feedback, and a reference-anchored deterministic baseline stabilizes training. Result: RRPO significantly outperforms strong baselines (such as RankZephyr) on knowledge-intensive benchmarks; it generalizes to different LLM readers (such as GPT-4o), is compatible with query expansion modules (such as Query2Doc), and is robust to noisy supervision. Conclusion: RRPO bridges the objective gap between reranking and LLM generation, enabling reranker optimization that is aligned end-to-end with generation utility and needs no human annotation. Abstract: Rerankers play a pivotal role in refining retrieval results for Retrieval-Augmented Generation. However, current reranking models are typically optimized on static human-annotated relevance labels in isolation, decoupled from the downstream generation process. This isolation leads to a fundamental misalignment: documents identified as topically relevant by information retrieval metrics often fail to provide the actual utility required by the LLM for precise answer generation. To bridge this gap, we introduce ReRanking Preference Optimization (RRPO), a reinforcement learning framework that directly aligns reranking with the LLM's generation quality. By formulating reranking as a sequential decision-making process, RRPO optimizes for context utility using LLM feedback, thereby eliminating the need for expensive human annotations. To ensure training stability, we further introduce a reference-anchored deterministic baseline. Extensive experiments on knowledge-intensive benchmarks demonstrate that RRPO significantly outperforms strong baselines, including the powerful list-wise reranker RankZephyr. Further analysis highlights the versatility of our framework: it generalizes seamlessly to diverse readers (e.g., GPT-4o), integrates orthogonally with query expansion modules like Query2Doc, and remains robust even when trained with noisy supervisors.
[49] Prosodic ABX: A Language-Agnostic Method for Measuring Prosodic Contrast in Speech Representations
Haitong Sun,Stephen McIntosh,Kwanghee Choi,Eunjung Yeo,Daisuke Saito,Nobuaki Minematsu
Main category: cs.CL
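TL;DR: This paper proposes prosodic ABX, a new framework for evaluating how sensitive self-supervised speech models (S3Ms) are to prosodic contrasts (such as stress, tone, and pitch accent), and validates it on minimal-pair datasets for English, Japanese, and Mandarin.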
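The ABX discrimination test underlying this entry asks whether a probe X is closer to A (same category as X) than to B (the contrasting category). The sketch below scores toy triples with cosine distance over single pooled vectors; the pooled-vector representation, the distance choice, and the function names are assumptions for illustration, since ABX on real speech typically compares frame sequences rather than single vectors.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def abx_error_rate(triples):
    """Fraction of (a, b, x) triples where x is not closer to a than to b.

    By convention here, x belongs to the same category as a.
    """
    errors = sum(1 for a, b, x in triples if cosine_distance(a, x) >= cosine_distance(b, x))
    return errors / len(triples)


# Toy 2-d "representations" of a minimal pair differing only in prosody.
triples = [
    ([1.0, 0.0], [0.0, 1.0], [0.9, 0.1]),  # x clearly nearer a
    ([1.0, 0.0], [0.0, 1.0], [0.2, 0.8]),  # x nearer b, counted as an error
]
print(abx_error_rate(triples))
```

No labels are needed beyond the pairing itself, which is why the test suits settings with only a handful of minimal pairs.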
Details
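Motivation: Prior work has examined the sensitivity of S3Ms to phonemic contrasts, but their sensitivity to prosodic contrasts has not been measured directly. Method: The ABX discrimination task is extended to prosodic ABX, which evaluates prosodic contrast from a small number of unlabeled minimal pairs; English and Japanese minimal-pair datasets are built and released and used alongside an existing Mandarin dataset; experiments cover prosodic features of different languages (English stress, Japanese pitch accent, Mandarin tone). Result: Prosodic ABX effectively evaluates how well S3Ms model prosodic contrasts; model and layer rankings are consistent across several experimental conditions, making the method suitable for low-resource settings. Conclusion: S3Ms do capture prosodic contrasts, and prosodic ABX is an efficient evaluation method with low resource requirements that provides a new tool for studying prosody in speech representations. Abstract: Speech representations from self-supervised speech models (S3Ms) are known to be sensitive to phonemic contrasts, but their sensitivity to prosodic contrasts has not been directly measured. The ABX discrimination task has been used to measure phonemic contrast in S3M representations via minimal pairs. We introduce prosodic ABX, an extension of this framework to evaluate prosodic contrast with only a handful of examples and no explicit labels. Also, we build and release a dataset of English and Japanese minimal pairs and use it along with a Mandarin dataset to evaluate contrast in English stress, Japanese pitch accent, and Mandarin tone. Finally, we show that model and layer rankings are often preserved across several experimental conditions, making it practical for low-resource settings.
[50] Reliable Control-Point Selection for Steering Reasoning in Large Language Models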
Haomin Zhuang,Hojun Yoo,Xiaonan Luo,Kehan Guo,Xiangliang Zhang
Main category: cs.CL
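TL;DR: This paper proposes a method for controlling reasoning behaviors based on stability filtering and content-subspace projection, markedly improving the controllability and generalization of spontaneous reasoning behaviors (such as self-reflection) in large language models.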
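The idea of stability filtering can be sketched as follows: re-generate several times from the same prefix at each keyword-detected boundary, record whether the target behavior reappears, and keep only boundaries whose reproduction rate clears a threshold. The data layout, the 0.8 threshold, and the function names are hypothetical; the re-generation and behavior detection steps are assumed to have been run already.

```python
def reproduction_rate(outcomes):
    """Share of re-generations from the same prefix that showed the behavior."""
    return sum(outcomes) / len(outcomes)


def filter_stable(boundaries, threshold=0.8):
    """Keep boundary ids whose behavior reproduces at least `threshold` of the time.

    `boundaries` maps a boundary id to a list of booleans, one per re-generation.
    The 0.8 default is an arbitrary illustrative value.
    """
    return [
        boundary_id
        for boundary_id, outcomes in boundaries.items()
        if outcomes and reproduction_rate(outcomes) >= threshold
    ]


# Made-up re-generation outcomes for three keyword-detected boundaries.
boundaries = {
    "b1": [True, True, True, True, True],
    "b2": [True, False, False, False, False],
    "b3": [True, True, True, True, False],
}
print(filter_stable(boundaries))
```

Only the retained boundaries would then contribute hidden states to the steering vector, so unstable detections no longer dilute the signal.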
Details
Motivation: 现有基于关键词匹配识别自发推理行为(如自我反思)的方法假设所有检测到的边界都代表真实行为信号,但实证发现其中93.3%的行为不稳定、不可复现,导致构建的转向向量效果差。 Method: 提出概率模型刻画内在推理行为为上下文依赖的随机事件;设计稳定性过滤机制,仅保留行为可复现的边界;结合内容子空间投影去除问题相关噪声;最终构建高质量转向向量。 Result: 在MATH-500上达到0.784准确率(较最强基线+5.0);转向向量可在同架构家族模型间直接迁移,提升Nemotron-Research-Reasoning-1.5B(+5.0)和DeepScaleR-1.5B-Preview(+6.0)。 Conclusion: 行为稳定性是构建有效转向向量的关键前提;所提方法克服了关键词匹配的固有缺陷,为训练无关的推理控制提供了更鲁棒、可迁移的新范式。 Abstract: Steering vectors offer a training-free mechanism for controlling reasoning behaviors in large language models, but constructing effective vectors requires identifying genuine behavioral signals in the model's hidden states. For behaviors that can be toggled via prompts, this is straightforward. However, many reasoning behaviors -- such as self-reflection -- emerge spontaneously and resist prompt-level control. Current methods detect these behaviors through keyword matching in chain-of-thought traces, implicitly assuming that every detected boundary encodes a genuine behavioral signal. We show that this assumption is overwhelmingly wrong: across 541 keyword-detected boundaries, 93.3\% are behaviorally unstable, failing to reproduce the detected behavior under re-generation from the same prefix. We develop a probabilistic model that formalizes intrinsic reasoning behaviors as stochastic events with context-dependent trigger probabilities, and show that unstable boundaries dilute the steering signal. Guided by this analysis, we propose stability filtering, which retains only boundaries where the model consistently reproduces the target behavior. Combined with a content-subspace projection that removes residual question-specific noise, our method achieves 0.784 accuracy on MATH-500 (+5.0 over the strongest baseline). The resulting steering vectors transfer across models in the same architecture family without re-extraction, improving Nemotron-Research-Reasoning-1.5B (+5.0) and DeepScaleR-1.5B-Preview (+6.0). 
Code is available at https://github.com/zhmzm/stability-steering.
[51] GaelEval: Benchmarking LLM Performance for Scottish Gaelic
Peter Devine,William Lamb,Beatrice Alex,Ignatius Ezeani,Dawn Knight,Mícheál J. Ó Meachair,Paul Rayson,Martin Wynne
Main category: cs.CL
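TL;DR: This paper introduces GaelEval, the first multi-dimensional benchmark for Scottish Gaelic, covering three tasks: morphosyntax, culturally grounded translation, and cultural knowledge Q&A. Experiments show that frontier proprietary models (e.g., Gemini 3 Pro Preview) exceed the fluent-speaker human baseline on the grammar task, that Gaelic prompting yields a small gain, and that proprietary models substantially outperform open-weight models overall.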
Details
Motivation: Multilingual LLMs perform unevenly and are under-evaluated on languages without official support, such as the morphologically rich minority language Scottish Gaelic, and existing translation benchmarks do not measure structural linguistic competence.
Method: The authors build GaelEval, the first multi-dimensional benchmark for Scottish Gaelic, comprising an expert-authored morphosyntactic multiple-choice task (MCQA), a culturally grounded translation task, and a large-scale cultural knowledge Q&A task. They evaluate 19 LLMs against a baseline of 30 fluent speakers and compare English with Gaelic prompting.
Result: Gemini 3 Pro Preview reaches 83.3% accuracy on the morphosyntactic task, above the human baseline (78.1%). Proprietary models consistently outperform open-weight models. Gaelic prompting gives a stable +2.4% gain on the linguistic task. On the cultural knowledge task, leading models exceed 90% accuracy, but Gaelic prompting lowers the performance of most models.
Conclusion: Frontier models achieve above-human performance on several dimensions of Scottish Gaelic grammar. GaelEval exposes the actual capability limits of models on a minority language, documents the effect of in-language prompting, and shows a systematic advantage for proprietary models.
Abstract: Multilingual large language models (LLMs) often exhibit emergent 'shadow' capabilities in languages without official support, yet their performance on these languages remains uneven and under-measured. This is particularly acute for morphosyntactically rich minority languages such as Scottish Gaelic, where translation benchmarks fail to capture structural competence. We introduce GaelEval, the first multi-dimensional benchmark for Gaelic, comprising: (i) an expert-authored morphosyntactic MCQA task; (ii) a culturally grounded translation benchmark; and (iii) a large-scale cultural knowledge Q&A task. Evaluating 19 LLMs against a fluent-speaker human baseline (n=30), we find that Gemini 3 Pro Preview achieves 83.3% accuracy on the linguistic task, surpassing the human baseline (78.1%). Proprietary models consistently outperform open-weight systems, and in-language (Gaelic) prompting yields a small but stable advantage (+2.4%). On the cultural task, leading models exceed 90% accuracy, though most systems perform worse under Gaelic prompting and absolute scores are inflated relative to the manual benchmark. Overall, GaelEval reveals that frontier models achieve above-human performance on several dimensions of Gaelic grammar, demonstrates the effect of Gaelic prompting and shows a consistent performance gap favouring proprietary over open-weight models.
[52] Brief Is Better: Non-Monotonic Chain-of-Thought Budget Effects in Function-Calling Language Agents
Xuan Qi
Main category: cs.CL
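TL;DR: This paper systematically studies how chain-of-thought (CoT) reasoning length affects function-calling language agents. Very brief reasoning (8-16 tokens) works best, while overly long reasoning severely hurts accuracy. Based on this, the paper proposes FR-CoT, a structured brief-CoT method that keeps accuracy high while reducing function hallucination to 0.0%.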
Details
Motivation: Chain-of-thought (CoT) is widely assumed to improve agent performance, but the relationship between reasoning length and accuracy in structured tool-use settings is unclear, and a systematic empirical analysis of CoT budget effects in function-calling agents is lacking.
Method: On 200 tasks from the Berkeley Function Calling Leaderboard v3 Multiple benchmark, the author runs a systematic ablation of Qwen2.5-1.5B-Instruct over six token budgets (0-512), combined with a three-way error decomposition (wrong function, wrong arguments, hallucinated function), an oracle analysis, and the proposed structured FR-CoT method (templated as 'Function: [name] / Key args: [...]').
Result: CoT shows a clearly non-monotonic effect: a 32-token CoT raises accuracy from 44.0% to 64.0% (+45% relative), while a 256-token CoT drops it to 25.0%. Error analysis shows that brief CoT sharply reduces wrong-function selection (30.5% to 1.5%), whereas long CoT yields 28.0% wrong selections and 18.0% hallucinated functions. Oracle analysis indicates that 88.6% of solvable tasks need at most 32 tokens, with the optimum at 8-16 tokens. FR-CoT matches the accuracy of free-form 32-token CoT and reduces function hallucination to 0.0%.
Conclusion: Longer CoT is not better. Its main value here is lightweight function routing rather than deep reasoning. Structured brief CoT such as FR-CoT delivers both performance and reliability without budget tuning, which informs the design of tool-calling agents.
Abstract: How much should a language agent think before taking action? Chain-of-thought (CoT) reasoning is widely assumed to improve agent performance, but the relationship between reasoning length and accuracy in structured tool-use settings remains poorly understood. We present a systematic study of CoT budget effects on function-calling agents, sweeping six token budgets (0--512) across 200 tasks from the Berkeley Function Calling Leaderboard v3 Multiple benchmark. Our central finding is a striking non-monotonic pattern on Qwen2.5-1.5B-Instruct: brief reasoning (32 tokens) dramatically improves accuracy by 45% relative over direct answers, from 44.0% to 64.0%, while extended reasoning (256 tokens) degrades performance well below the no-CoT baseline, to 25.0% (McNemar p < 0.001). A three-way error decomposition reveals the mechanism. At d = 0, 30.5% of tasks fail because the model selects the wrong function from the candidate set; brief CoT reduces this to 1.5%, effectively acting as a function-routing step, while long CoT reverses the gain, yielding 28.0% wrong selections and 18.0% hallucinated functions at d = 256. Oracle analysis shows that 88.6% of solvable tasks require at most 32 reasoning tokens, with an average of 27.6 tokens, and a finer-grained sweep indicates that the true optimum lies at 8--16 tokens.
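The headline figures in this abstract can be checked with a few lines of arithmetic. The sketch below recomputes the relative change from the reported accuracies and converts them to task counts on the 200-task benchmark; `relative_change` and `tasks_correct` are helper names introduced here for illustration only.

```python
def relative_change(baseline: float, new: float) -> float:
    """Relative change of `new` with respect to `baseline` (0.10 means +10%)."""
    return (new - baseline) / baseline


def tasks_correct(accuracy_pct: float, n_tasks: int) -> int:
    """Number of correctly solved tasks implied by an accuracy percentage."""
    return round(accuracy_pct * n_tasks / 100)


N_TASKS = 200          # tasks swept in the study
ACC_NO_COT = 44.0      # budget d = 0
ACC_BRIEF = 64.0       # budget d = 32
ACC_LONG = 25.0        # budget d = 256

brief_gain = relative_change(ACC_NO_COT, ACC_BRIEF)   # about +0.45
long_loss = relative_change(ACC_NO_COT, ACC_LONG)     # about -0.43

print(f"brief CoT: {brief_gain:+.1%} relative, {tasks_correct(ACC_BRIEF, N_TASKS)} tasks solved")
print(f"long CoT:  {long_loss:+.1%} relative, {tasks_correct(ACC_LONG, N_TASKS)} tasks solved")
```

The 20-point absolute gain corresponds to roughly 45% relative improvement because the baseline is only 44.0%; the same baseline makes the drop to 25.0% a relative loss of about 43%.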
Motivated by this routing effect, we propose Function-Routing CoT (FR-CoT), a structured brief-CoT method that templates the reasoning phase as "Function: [name] / Key args: [...]," forcing commitment to a valid function name at the start of reasoning. FR-CoT achieves accuracy statistically equivalent to free-form d = 32 CoT while reducing function hallucination to 0.0%, providing a structural reliability guarantee without budget tuning.
[53] AstroConcepts: A Large-Scale Multi-Label Classification Corpus for Astrophysics
Atilla Kaan Alkan,Felix Grezes,Sergi Blanco-Cuaresma,Jennifer Lynn Bartlett,Daniel Chivvis,Anna Kelbert,Kelly Lockhart,Alberto Accomazzi
Main category: cs.CL
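TL;DR: This paper introduces the AstroConcepts corpus for studying extreme class imbalance in multi-label classification of astrophysics text, and proposes frequency-stratified evaluation to expose how methods differ on rare terms.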
Details
Motivation: Multi-label text classification in scientific domains faces extreme class imbalance, with specialized terminology following severe power-law distributions, and existing corpora lack comprehensive controlled vocabularies, which makes this imbalance hard to study systematically.
Method: The authors build AstroConcepts, a corpus of 21,702 astrophysics paper abstracts labeled with 2,367 concepts from the Unified Astronomy Thesaurus, run baselines with traditional models, neural models, and vocabulary-constrained LLMs, and propose a frequency-stratified evaluation strategy.
Result: Vocabulary-constrained LLMs perform close to domain-adapted models on astrophysics classification. Domain adaptation helps more on rare terms, although absolute performance remains limited. Frequency-stratified evaluation reveals performance patterns that aggregate metrics hide.
Conclusion: AstroConcepts provides a new resource and benchmark for studying extreme imbalance in scientific domains, and the proposed evaluation method and findings offer actionable insights for scientific NLP.
Abstract: Scientific multi-label text classification suffers from extreme class imbalance, where specialized terminology exhibits severe power-law distributions that challenge standard classification approaches. Existing scientific corpora lack comprehensive controlled vocabularies, focusing instead on broad categories and limiting systematic study of extreme imbalance. We introduce AstroConcepts, a corpus of English abstracts from 21,702 published astrophysics papers, labeled with 2,367 concepts from the Unified Astronomy Thesaurus. The corpus exhibits severe label imbalance, with 76% of concepts having fewer than 50 training examples. By releasing this resource, we enable systematic study of extreme class imbalance in scientific domains and establish strong baselines across traditional, neural, and vocabulary-constrained LLM methods. Our evaluation reveals three key patterns that provide new insights into scientific text classification. First, vocabulary-constrained LLMs achieve competitive performance relative to domain-adapted models in astrophysics classification, suggesting a potential for parameter-efficient approaches. Second, domain adaptation yields relatively larger improvements for rare, specialized terminology, although absolute performance remains limited across all methods. Third, we propose frequency-stratified evaluation to reveal performance patterns that are hidden by aggregate scores, thereby making robustness assessment central to scientific multi-label evaluation.
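One way to picture frequency-stratified evaluation is to group labels by how often they occur in training and report a macro-averaged F1 per group instead of a single aggregate score. The sketch below is a minimal version under stated assumptions: the two-bucket split at 50 training examples mirrors the threshold quoted in the abstract, but the bucket names, the function names, and the toy counts are invented for this example and are not the authors' protocol.

```python
from typing import Dict, List, Tuple


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Per-label F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def stratified_macro_f1(
    train_freq: Dict[str, int],
    counts: Dict[str, Tuple[int, int, int]],
    rare_below: int = 50,
) -> Dict[str, float]:
    """Macro F1 per frequency bucket; labels with fewer than `rare_below` training examples are 'rare'."""
    buckets: Dict[str, List[float]] = {"rare": [], "frequent": []}
    for label, (tp, fp, fn) in counts.items():
        name = "rare" if train_freq[label] < rare_below else "frequent"
        buckets[name].append(f1_score(tp, fp, fn))
    return {name: sum(scores) / len(scores) for name, scores in buckets.items() if scores}


# Toy labels and counts, not taken from the corpus.
train_freq = {"concept_a": 10, "concept_b": 500, "concept_c": 20}
counts = {
    "concept_a": (1, 1, 1),   # tp, fp, fn
    "concept_b": (8, 1, 1),
    "concept_c": (0, 0, 2),
}
per_bucket = stratified_macro_f1(train_freq, counts)
print(per_bucket)
```

In this toy case the frequent bucket scores far higher than the rare bucket, which is the kind of gap a single aggregate F1 would blur.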
These results offer actionable insights for scientific NLP and establish benchmarks for research on extreme imbalance.
[54] Do Lexical and Contextual Coreference Resolution Systems Degrade Differently under Mention Noise? An Empirical Study on Scientific Software Mentions
Atilla Kaan Alkan,Felix Grezes,Jennifer Lynn Bartlett,Anna Kelbert,Kelly Lockhart,Alberto Accomazzi
Main category: cs.CL
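TL;DR: This paper reports a submission to the SOMD 2026 shared task and compares two fine-tuning-free approaches (Fuzzy Matching and Context Aware Representations) for cross-document software mention coreference resolution. Overall, CAR is stronger than FM in performance, robustness to boundary noise, and scalability.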
Details
Motivation: Software names have highly regular surface forms, which may reduce the need for complex semantic reasoning. Lightweight, efficient, and robust coreference methods are needed for large-scale, noise-sensitive practical settings.
Method: The authors compare two fine-tuning-free approaches: (1) Fuzzy Matching, a lexical method based on string similarity; and (2) Context Aware Representations, which combines mention-level and document-level embeddings. They also run noise-injection experiments and an inference-efficiency analysis.
Result: CAR consistently leads FM by about 1 point of CoNLL F1 (both in the 0.94-0.96 range). CAR is more robust to boundary noise (F1 drops by only 0.07, versus 0.20 for FM), while FM degrades more gracefully under mention substitution. CAR inference time scales approximately linearly, FM superlinearly.
Conclusion: System choice should account for both the noise profile of the upstream mention detector and the size of the target corpus. CAR suits large-scale settings with boundary noise, while FM is more resilient under heavy mention-substitution noise. The code is released to support work on this underexplored task.
Abstract: We present our participation in the SOMD 2026 shared task on cross-document software mention coreference resolution, where our systems ranked second across all three subtasks. We compare two fine-tuning-free approaches: Fuzzy Matching (FM), a lexical string-similarity method, and Context Aware Representations (CAR), which combines mention-level and document-level embeddings. Both achieve competitive performance across all subtasks (CoNLL F1 of 0.94-0.96), with CAR consistently outperforming FM by 1 point on the official test set, consistent with the high surface regularity of software names, which reduces the need for complex semantic reasoning. A controlled noise-injection study reveals complementary failure modes: as boundary noise increases, CAR loses only 0.07 F1 points from clean to fully corrupted input, compared to 0.20 for FM, whereas under mention substitution, FM degrades more gracefully (0.52 vs. 0.63). Our inference-time analysis shows that FM scales superlinearly with corpus size, whereas CAR scales approximately linearly, making CAR the more efficient choice at large scale. These findings suggest that system selection should be informed by both the noise profile of the upstream mention detector and the scale of the target corpus. We release our code to support future work on this underexplored task.
[55] Adam's Law: Textual Frequency Law on Large Language Models
Hongyuan Adam Lu,Z. L.,Victor Wei,Zefan Zhang,Zhao Hong,Qiqi Xiang,Bowen Cao,Wai Lam
Main category: cs.CL
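TL;DR: This paper argues that textual frequency has an important effect on large language model (LLM) performance, builds a framework comprising the Textual Frequency Law (TFL), Textual Frequency Distillation (TFD), and Curriculum Textual Frequency Training (CTFT), and validates it on several tasks.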
Details
Motivation: Textual frequency is known to affect human reading speed, but its effect on large language models (LLMs) has rarely been studied. In addition, LLM training data is often not public, so reliable sentence-level frequency estimates are lacking.
Method: The authors propose the Textual Frequency Law (TFL) and estimate sentence-level frequency from online resources. An input paraphraser rewrites inputs into more frequent expressions. Textual Frequency Distillation (TFD) extends the corpus through LLM story completion to refine the frequency estimates. Finally, Curriculum Textual Frequency Training (CTFT) fine-tunes LLMs in increasing order of sentence-level frequency.
Result: On the authors' curated Textual Frequency Paired Dataset (TFPD), the framework improves performance on math reasoning, machine translation, commonsense reasoning, and agentic tool calling.
Conclusion: Textual frequency is a key factor in LLM performance. The TFL-TFD-CTFT framework improves model results and suggests a new angle on data selection and training strategy for LLMs.
Abstract: While textual frequency has been validated as relevant to human cognition in reading speed, its relatedness to Large Language Models (LLMs) is seldom studied. We propose a novel research direction in terms of textual data frequency, which is an understudied topic, to the best of our knowledge. Our framework is composed of three units. First, this paper proposes Textual Frequency Law (TFL), which indicates that frequent textual data should be preferred for LLMs for both prompting and fine-tuning. Since many LLMs are closed-source in their training data, we propose using online resources to estimate the sentence-level frequency. We then utilize an input paraphraser to paraphrase the input into a more frequent textual expression. Next, we propose Textual Frequency Distillation (TFD) by querying LLMs to conduct story completion by further extending the sentences in the datasets, and the resulting corpora are used to adjust the initial estimation. Finally, we propose Curriculum Textual Frequency Training (CTFT) that fine-tunes LLMs in an increasing order of sentence-level frequency. Experiments are conducted on our curated dataset Textual Frequency Paired Dataset (TFPD) on math reasoning, machine translation, commonsense reasoning and agentic tool calling. Results show the effectiveness of our framework.
[56] The Expert Strikes Back: Interpreting Mixture-of-Experts Language Models at Expert Level
Jeremy Herbst,Jae Hee Lee,Stefan Wermter
Main category: cs.CL
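TL;DR: This paper studies how experts in Mixture-of-Experts (MoE) architectures differ from dense feed-forward networks (FFNs) in interpretability. Expert neurons in MoEs tend to be more monosemantic, and more so as routing becomes sparser. The authors then treat the expert as the basic unit of analysis and automatically interpret hundreds of experts, finding that they act as fine-grained linguistic or semantic task experts (e.g., closing brackets in LaTeX) rather than broad domain specialists or simple token processors. This suggests that MoEs are inherently interpretable at the expert level.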
Details
Motivation: To examine whether the sparsity of MoE architectures makes them easier to interpret than dense FFNs, and to settle the debate over whether their experts are interpretable and specialized.
Method: The authors use k-sparse probing to compare the polysemanticity of neurons in MoE experts and dense FFNs, and run a large-scale automatic interpretation analysis with the expert as the unit.
Result: MoE expert neurons are consistently less polysemantic, and monosemanticity becomes more pronounced as routing sparsity increases. Experts are neither broad domain specialists nor token processors; they carry out fine-grained linguistic or semantic tasks (e.g., closing brackets in LaTeX). The expert level is inherently interpretable.
Conclusion: Because of their structural sparsity, MoE architectures are inherently interpretable at the expert level, which gives large-model interpretability research a more effective and more scalable unit of analysis.
Abstract: Mixture-of-Experts (MoE) architectures have become the dominant choice for scaling Large Language Models (LLMs), activating only a subset of parameters per token. While MoE architectures are primarily adopted for computational efficiency, it remains an open question whether their sparsity makes them inherently easier to interpret than dense feed-forward networks (FFNs). We compare MoE experts and dense FFNs using k-sparse probing and find that expert neurons are consistently less polysemantic, with the gap widening as routing becomes sparser. This suggests that sparsity pressures both individual neurons and entire experts toward monosemanticity. Leveraging this finding, we zoom out from the neuron to the expert level as a more effective unit of analysis. We validate this approach by automatically interpreting hundreds of experts. This analysis allows us to resolve the debate on specialization: experts are neither broad domain specialists (e.g., biology) nor simple token-level processors. Instead, they function as fine-grained task experts, specializing in linguistic operations or semantic tasks (e.g., closing brackets in LaTeX). Our findings suggest that MoEs are inherently interpretable at the expert level, providing a clearer path toward large-scale model interpretability. Code is available at: https://github.com/jerryy33/MoE_analysis
[57] Neuro-RIT: Neuron-Guided Instruction Tuning for Robust Retrieval-Augmented Language Model
Jaemin Kim,Jae O Lee,Sumyeong Ahn,Seo Yeon Park
Main category: cs.CL
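TL;DR: This paper proposes the Neuro-RIT framework, which uses neuron-level attribution analysis to identify and separate the neurons that process relevant versus irrelevant retrieved contexts, and applies a two-stage instruction tuning strategy for noise suppression and evidence distillation, substantially improving the robustness of retrieval-augmented language models under noisy retrieval.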
Details
Motivation: Existing retrieval-augmented language models (RALMs) degrade when given irrelevant or noisy retrieved contexts, and mainstream robustness methods only make coarse-grained parameter updates at the module or layer level, ignoring the neuron-level sparsity inherent in large language models.
Method: Neuro-RIT first performs attribution-based neuron mining to disentangle neurons responsible for relevant versus irrelevant contexts. A two-stage instruction tuning then functionally deactivates neurons that respond only to irrelevant contexts, giving direct noise suppression, and optimizes targeted layers to strengthen evidence distillation.
Result: Extensive experiments on multiple QA benchmarks show that Neuro-RIT consistently outperforms strong baselines and a range of robustness-enhancing methods.
Conclusion: Fine-grained neuron-level control is more effective than conventional coarse-grained adaptation, and Neuro-RIT offers a new approach to noise robustness for RALMs on knowledge-intensive tasks.
Abstract: Retrieval-Augmented Language Models (RALMs) have demonstrated significant potential in knowledge-intensive tasks; however, they remain vulnerable to performance degradation when presented with irrelevant or noisy retrieved contexts. Existing approaches to enhance robustness typically operate via coarse-grained parameter updates at the layer or module level, often overlooking the inherent neuron-level sparsity of Large Language Models (LLMs). To address this limitation, we propose Neuro-RIT (Neuron-guided Robust Instruction Tuning), a novel framework that shifts the paradigm from dense adaptation to precision-driven neuron alignment. Our method explicitly disentangles neurons that are responsible for processing relevant versus irrelevant contexts using attribution-based neuron mining. Subsequently, we introduce a two-stage instruction tuning strategy that enforces a dual capability for noise robustness: achieving direct noise suppression by functionally deactivating neurons exclusive to irrelevant contexts, while simultaneously optimizing targeted layers for evidence distillation. Extensive experiments across diverse QA benchmarks demonstrate that Neuro-RIT consistently outperforms strong baselines and robustness-enhancing methods.
[58] Towards Position-Robust Talent Recommendation via Large Language Models
Silin Du,Hongyan Liu
Main category: cs.CL
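TL;DR: This paper proposes the L3TR framework, which uses a block attention mechanism and local positional encoding to address position bias and token bias in LLM-based talent recommendation, enabling listwise recommendation.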
Details
Motivation: Most existing LLM-based talent recommendation systems follow a pointwise paradigm, which leads to high token consumption, cannot model relationships among candidates, and suffers from position bias and the lost-in-the-middle problem.
Method: The authors propose the L3TR framework, which includes a block attention mechanism, a local positional encoding method, and an ID sampling strategy, and they design bias evaluation methods and training-free debiasing methods.
Result: Experiments on two real-world datasets show that L3TR consistently improves over existing baselines.
Conclusion: L3TR mitigates position bias and concurrent token bias in LLM-based listwise talent recommendation and improves both recommendation quality and efficiency.
Abstract: Talent recruitment is a critical, yet costly process for many industries, with high recruitment costs and long hiring cycles. Existing talent recommendation systems increasingly adopt large language models (LLMs) due to their remarkable language understanding capabilities. However, most prior approaches follow a pointwise paradigm, which requires LLMs to repeatedly process some text and fails to capture the relationships among candidates in the list, resulting in higher token consumption and suboptimal recommendations. Besides, LLMs exhibit position bias and the lost-in-the-middle issue when answering multiple-choice questions and processing multiple long documents. To address these issues, we introduce an implicit strategy to utilize LLM's potential output for the recommendation task and propose L3TR, a novel framework for listwise talent recommendation with LLMs. In this framework, we propose a block attention mechanism and a local positional encoding method to enhance inter-document processing and mitigate the position bias and concurrent token bias issues. We also introduce an ID sampling method for resolving the inconsistency between candidate set sizes in the training phase and the inference phase. We design evaluation methods to detect position bias and token bias and training-free debiasing methods. Extensive experiments on two real-world datasets validated the effectiveness of L3TR, showing consistent improvements over existing baselines.
[59] CV-18 NER: Augmented Common Voice for Named Entity Recognition from Arabic Speech
Youssef Saidi,Haroun Elleuch,Fethi Bougares
Main category: cs.CL
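TL;DR: This paper presents CV-18 NER, the first public dataset for named entity recognition (NER) from Arabic speech, and compares end-to-end and cascaded models on it, finding that end-to-end methods perform substantially better.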
Details
Motivation: Arabic remains under-explored in end-to-end speech NER because of its morphological complexity, the absence of short vowels, and scarce annotated resources, so a high-quality dataset and benchmark are needed.
Method: The authors build CV-18 NER (based on Common Voice 18, manually annotated with 21 entity types following the fine-grained Wojood schema) and evaluate cascaded ASR + text NER systems against end-to-end models based on Whisper and AraBEST-RQ. They also analyze how pretraining strategy and model size affect low-resource speech NER.
Result: End-to-end models substantially outperform the best cascaded system: AraBEST-RQ 300M reaches 37.0% CoER and Whisper-medium reaches 38.0% CVER. Arabic-specific self-supervised pretraining benefits ASR, while multilingual weak supervision transfers better to joint speech-to-entity learning. Larger models adapt less well in the low-resource setting.
Conclusion: End-to-end speech NER is an effective route for Arabic NER, and CV-18 NER provides the first open benchmark for the area, supporting research on low-resource speech understanding.
Abstract: End-to-end speech Named Entity Recognition (NER) aims to directly extract entities from speech. Prior work has shown that end-to-end (E2E) approaches can outperform cascaded pipelines for English, French, and Chinese, but Arabic remains under-explored due to its morphological complexity, the absence of short vowels, and limited annotated resources. We introduce CV-18 NER, the first publicly available dataset for NER from Arabic speech, created by augmenting the Arabic Common Voice 18 corpus with manual NER annotations following the fine-grained Wojood schema (21 entity types). We benchmark both pipeline systems (ASR + text NER) and E2E models based on Whisper and AraBEST-RQ. E2E systems substantially outperform the best pipeline configuration on the test set, reaching 37.0% CoER (AraBEST-RQ 300M) and 38.0% CVER (Whisper-medium). Further analysis shows that Arabic-specific self-supervised pretraining yields strong ASR performance, while multilingual weak supervision transfers more effectively to joint speech-to-entity learning, and that larger models may be harder to adapt in this low-resource setting. Our dataset and models are publicly released, providing the first open benchmark for end-to-end named entity recognition from Arabic speech: https://huggingface.co/datasets/Elyadata/CV18-NER
[60] No Single Best Model for Diversity: Learning a Router for Sample Diversity
Yuhan Liu,Fangyuan Xu,Vishakh Padmakumar,Daphne Ippolito,Eunsol Choi
Main category: cs.CL
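TL;DR: This paper studies how to generate diverse answers to open-ended questions with a suite of models, proposes the diversity coverage metric, and designs a router that selects the best model per query, improving answer diversity over the single best model.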
Details
Motivation: For open-ended prompts that admit many valid answers, generating those answers comprehensively is the first step toward serving a wide range of users. The authors also find that no single model is best across all open-ended prompts, yet for each prompt there is one model that performs best.
Method: The authors propose diversity coverage, a metric defined as the total quality score of the unique answers in the predicted set relative to the best possible answer set of the same size. They evaluate 18 LLMs on diverse answer generation and, based on the findings, design and train a router that picks the best generating model for each query.
Result: On NB-Wildchat, the router raises diversity coverage from 23.8% (single best model baseline) to 26.3%. Generalization is shown on the out-of-domain NB-Curated dataset and across different prompting strategies.
Conclusion: Combining multiple models through a router is a viable way to improve the diversity and comprehensiveness of answers to open-ended questions, and it lays groundwork for generating comprehensive answers from a suite of models.
Abstract: When posed with prompts that permit a large number of valid answers, comprehensively generating them is the first step towards satisfying a wide range of users. In this paper, we study methods to elicit a comprehensive set of valid responses. To evaluate this, we introduce diversity coverage, a metric that measures the total quality scores assigned to each unique answer in the predicted answer set relative to the best possible answer set with the same number of answers. Using this metric, we evaluate 18 LLMs, finding no single model dominates at generating diverse responses to a wide range of open-ended prompts. Yet, for each prompt, there exists a model that outperforms all other models significantly at generating a diverse answer set. Motivated by this finding, we introduce a router that predicts the best model for each query. On NB-Wildchat, our trained router outperforms the single best model baseline (26.3% vs. 23.8%). We further show generalization to an out-of-domain dataset (NB-Curated) as well as different answer-generation prompting strategies. Our work lays the foundation for studying how to generate comprehensive answers when we have access to a suite of models.
[61] Grounded Token Initialization for New Vocabulary in LMs for Generative Recommendation
Daiwei Chen,Zhoutong Fu,Chengming Jiang,Haichao Zhang,Ran Zhou,Tan Wang,Chunnan Yao,Guoyao Li,Rui Cai,Yihan Cao,Ruijie Jiang,Fedor Borisyuk,Jianqiang Shen,Jingwei Wu,Ramya Korlakai Vinayak
Main category: cs.CL
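TL;DR: This paper finds that mean initialization of newly added vocabulary tokens in language models collapses the new embeddings into a degenerate subspace and harms their distinguishability. It proposes GTI, a linguistically grounded initialization that uses paired linguistic supervision to map new tokens to meaningful locations in the pretrained embedding space before fine-tuning, which improves performance on generative recommendation and related tasks.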
Details
Motivation: Language models are usually extended with new vocabulary through mean initialization, but the authors find that this strategy collapses the new token embeddings and erases their distinctions, making it a key bottleneck for model extension. A better initialization mechanism is therefore needed.
Method: Grounded Token Initialization (GTI) uses a small amount of paired linguistic supervision (for example, text descriptions paired with new tokens) to map new token embeddings, before fine-tuning, to semantically sensible and mutually distinct locations in the pretrained embedding space.
Result: GTI outperforms mean initialization and existing auxiliary-task adaptation methods in the majority of settings across multiple generative recommendation benchmarks (industry-scale and public datasets). Analysis shows that GTI embeddings have richer inter-token structure and that this structure persists through fine-tuning.
Conclusion: Initialization quality is a key bottleneck in vocabulary extension. Linguistically grounded initialization (GTI) helps the model draw on pretrained knowledge and improves downstream performance, pointing to a new direction for model extensibility.
Abstract: Language models (LMs) are increasingly extended with new learnable vocabulary tokens for domain-specific tasks, such as Semantic-ID tokens in generative recommendation. The standard practice initializes these new tokens as the mean of existing vocabulary embeddings, then relies on supervised fine-tuning to learn their representations. We present a systematic analysis of this strategy: through spectral and geometric diagnostics, we show that mean initialization collapses all new tokens into a degenerate subspace, erasing inter-token distinctions that subsequent fine-tuning struggles to fully recover. These findings suggest that token initialization is a key bottleneck when extending LMs with new vocabularies. Motivated by this diagnosis, we propose the Grounded Token Initialization Hypothesis: linguistically grounding novel tokens in the pretrained embedding space before fine-tuning better enables the model to leverage its general-purpose knowledge for novel-token domains. We operationalize this hypothesis as GTI (Grounded Token Initialization), a lightweight grounding stage that, prior to fine-tuning, maps new tokens to distinct, semantically meaningful locations in the pretrained embedding space using only paired linguistic supervision. Despite its simplicity, GTI outperforms both mean initialization and existing auxiliary-task adaptation methods in the majority of evaluation settings across multiple generative recommendation benchmarks, including industry-scale and public datasets.
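The collapse caused by mean initialization is easy to see on a toy embedding table: every new token starts at exactly the same point, so the pairwise distance between new tokens is zero. The sketch below contrasts that with a deliberately simple stand-in that averages the embeddings of each new token's description words. This stand-in is not the authors' GTI procedure, which the abstract does not detail; the vocabulary, the descriptions, and all helper names are assumptions made for illustration.

```python
import math
from typing import Dict, List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def euclidean(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy pretrained embedding table (hypothetical values).
vocab: Dict[str, Vector] = {
    "red": [1.0, 0.0, 0.0],
    "shoe": [0.0, 1.0, 0.0],
    "blue": [0.0, 0.0, 1.0],
    "hat": [1.0, 1.0, 0.0],
}

# New tokens paired with short text descriptions (hypothetical).
descriptions: Dict[str, List[str]] = {
    "<item_1>": ["red", "shoe"],
    "<item_2>": ["blue", "hat"],
}

# Standard practice: every new token gets the mean of the existing vocabulary.
mean_init = {tok: mean_vector(list(vocab.values())) for tok in descriptions}

# Simplified stand-in for grounding: average the description-word embeddings.
desc_init = {
    tok: mean_vector([vocab[w] for w in words]) for tok, words in descriptions.items()
}

collapsed_gap = euclidean(mean_init["<item_1>"], mean_init["<item_2>"])
grounded_gap = euclidean(desc_init["<item_1>"], desc_init["<item_2>"])
print(collapsed_gap, grounded_gap)
```

With mean initialization the two new tokens are indistinguishable before fine-tuning, while even this crude description-based start places them at different points in the pretrained space.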
Further analyses show that grounded embeddings produce richer inter-token structure that persists through fine-tuning, corroborating the hypothesis that initialization quality is a key bottleneck in vocabulary extension.
cs.CV [Back]
[62] CLPIPS: A Personalized Metric for AI-Generated Image Similarity
Khoi Trinh,Jay Rothenberger,Scott Seidenberger,Dimitrios Diochnos,Anindya Maiti
Main category: cs.CV
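TL;DR: This paper proposes CLPIPS, a lightweight, customizable perceptual image similarity metric built on LPIPS and driven by human feedback. By fine-tuning only the LPIPS layer combination weights, it improves agreement with human subjective rankings.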
Details
Motivation: Existing image similarity metrics such as LPIPS and CLIP are objective but often disagree with human subjective judgments, especially in user-facing or context-dependent text-to-image tasks. A customizable similarity metric that adapts to human perception is needed.
Method: Customized Learned Perceptual Image Patch Similarity (CLPIPS) fine-tunes only the LPIPS layer combination weights, using margin ranking loss on a dataset of human pairwise rankings of generated images. Alignment is measured with Spearman rank correlation and the Intraclass Correlation Coefficient (ICC).
Result: On the human-subject dataset, CLPIPS achieves higher Spearman rank correlation and ICC than the original LPIPS, indicating better agreement with human similarity rankings. Fine-tuning only a small number of parameters is enough to improve perceptual alignment.
Conclusion: Lightweight, human-augmented fine-tuning can improve the agreement between image similarity metrics and human judgments. As an adaptable component, CLPIPS may improve feedback quality and controllability in human-in-the-loop text-to-image workflows.
Abstract: Iterative prompt refinement is central to reproducing target images with text to image generative models. Previous studies have incorporated image similarity metrics (ISMs) as additional feedback to human users. Existing ISMs such as LPIPS and CLIP provide objective measures of image likeness but often fail to align with human judgments, particularly in context specific or user driven tasks. In this paper, we introduce Customized Learned Perceptual Image Patch Similarity (CLPIPS), a customized extension of LPIPS that adapts a metric's notion of similarity directly to human judgments. We aim to explore whether lightweight, human augmented fine tuning can meaningfully improve perceptual alignment, positioning similarity metrics as adaptive components for human in the loop workflows with text to image tools. We evaluate CLPIPS on a human subject dataset in which participants iteratively regenerate target images and rank generated outputs by perceived similarity. Using margin ranking loss on human ranked image pairs, we fine tune only the LPIPS layer combination weights and assess alignment via Spearman rank correlation and Intraclass Correlation Coefficient. Our results show that CLPIPS achieves stronger correlation and agreement with human judgments than baseline LPIPS.
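The training signal described here can be sketched in a few lines: given per-layer distances for two generated images and a human judgment that one of them is closer to the target, a margin ranking loss pushes the weighted distance of the closer image below that of the other by at least a margin. The version below is a simplified illustration, not the CLPIPS implementation: it assumes one scalar weight per layer, a non-negativity clamp on the weights, a plain subgradient step, and made-up distance values.

```python
from typing import List


def weighted_distance(weights: List[float], layer_dists: List[float]) -> float:
    """Combine per-layer perceptual distances with one weight per layer."""
    return sum(w * d for w, d in zip(weights, layer_dists))


def margin_ranking_loss(
    weights: List[float], closer: List[float], farther: List[float], margin: float = 0.1
) -> float:
    """Zero once the human-preferred image is closer by at least `margin`."""
    gap = weighted_distance(weights, closer) - weighted_distance(weights, farther)
    return max(0.0, margin + gap)


def sgd_step(
    weights: List[float],
    closer: List[float],
    farther: List[float],
    margin: float = 0.1,
    lr: float = 0.5,
) -> List[float]:
    """One subgradient step on the layer weights only; weights are clamped at zero (assumption)."""
    if margin_ranking_loss(weights, closer, farther, margin) == 0.0:
        return list(weights)
    # While the loss is active, d(loss)/d(w_l) = closer_l - farther_l.
    return [max(0.0, w - lr * (c - f)) for w, c, f in zip(weights, closer, farther)]


# Hypothetical per-layer distances to the target image for one human-ranked pair.
closer = [0.6, 0.2, 0.1]    # image the participant ranked as more similar
farther = [0.2, 0.3, 0.3]   # image the participant ranked as less similar

weights = [1.0, 1.0, 1.0]
loss_before = margin_ranking_loss(weights, closer, farther)
weights = sgd_step(weights, closer, farther)
loss_after = margin_ranking_loss(weights, closer, farther)
print(loss_before, loss_after)
```

The step lowers the weight of the layer that disagreed with the human ranking and raises the others, so the loss on this pair shrinks while the feature extractor itself stays untouched.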
Rather than optimizing absolute metric performance, our work emphasizes improving alignment consistency between metric predictions and human ranks, demonstrating that even limited human specific fine tuning can meaningfully enhance perceptual alignment in human in the loop text to image workflows.
[63] Camouflage-aware Image-Text Retrieval via Expert Collaboration
Yao Jiang,Zhongkuan Mao,Xuan Wu,Keren Fu,Qijun Zhao
Main category: cs.CV
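TL;DR: This paper proposes a new task, camouflage-aware image-text retrieval (CA-ITR), builds CamoIT, the first dedicated dataset, and designs the camouflage-expert collaborative network (CECNet) to improve cross-modal alignment, substantially outperforming existing methods.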
Details
Motivation: In camouflaged scene understanding (CSU), robust image-text cross-modal alignment is under-explored, which limits deeper understanding of camouflaged scenes and related applications.
Method: The authors build CamoIT, a dataset of about 10.5K samples with multi-granularity textual annotations, and propose CECNet with a dual-branch visual encoder: one branch models holistic image representations, the other injects representations of camouflaged objects, and a confidence-conditioned graph attention (C²GA) mechanism fuses the two branches.
Result: On CamoIT, CECNet improves overall CA-ITR accuracy by about 29% over seven representative retrieval models. Benchmark experiments show that camouflage properties and complex image content are the main challenges for current methods.
Conclusion: CA-ITR is a new task of practical relevance, and the CamoIT dataset and CECNet model provide a solid basis for cross-modal understanding of camouflaged scenes.
Abstract: Camouflaged scene understanding (CSU) has attracted significant attention due to its broad practical implications. However, in this field, robust image-text cross-modal alignment remains under-explored, hindering deeper understanding of camouflaged scenarios and their related applications. To this end, we focus on the typical image-text retrieval task, and formulate a new task dubbed "camouflage-aware image-text retrieval" (CA-ITR). We first construct a dedicated camouflage image-text retrieval dataset (CamoIT), comprising ~10.5K samples with multi-granularity textual annotations. Benchmark results conducted on CamoIT reveal the underlying challenges of CA-ITR for existing cutting-edge retrieval techniques, which are mainly caused by objects' camouflage properties as well as those complex image contents. As a solution, we propose a camouflage-expert collaborative network (CECNet), which features a dual-branch visual encoder: one branch captures holistic image representations, while the other incorporates a dedicated model to inject representations of camouflaged objects. A novel confidence-conditioned graph attention (C²GA) mechanism is incorporated to exploit the complementarity across branches. Comparative experiments show that CECNet achieves ~29% overall CA-ITR accuracy boost, surpassing seven representative retrieval models. The dataset and code will be available at https://github.com/jiangyao-scu/CA-ITR.
[64] Look Twice: Training-Free Evidence Highlighting in Multimodal Large Language Models
Marco Morini,Sara Sarto,Marcella Cornia,Lorenzo Baraldi
Main category: cs.CV
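TL;DR: This paper proposes Look Twice (LoT), a training-free inference-time framework that improves how multimodal large language models (MLLMs) reason jointly over visual input and external knowledge. It uses the model's attention to identify and highlight key visual regions and textual evidence, improving knowledge-intensive visual question answering.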
Details
Motivation: Existing MLLMs struggle in knowledge-intensive visual question answering to identify and combine the relevant visual cues and externally retrieved text, especially when the text is noisy or only partially relevant and when fine-grained visual localization is required.
Method: LoT is a training-free inference-time framework that uses the attention patterns of a pretrained MLLM to estimate which visual regions and text segments are relevant to the query, then highlights this evidence with lightweight prompt-level markers so that the model re-attends to the key multimodal cues while generating the answer.
Result: LoT consistently improves over zero-shot MLLMs on multiple knowledge-based VQA benchmarks. Evaluations on vision-centric and hallucination-oriented benchmarks show that visual highlighting alone also improves performance, with no additional training or model modification.
Conclusion: LoT is a general, efficient, plug-and-play inference-time method that improves how selectively MLLMs use multimodal evidence and offers a new direction for knowledge-driven visual understanding.
Abstract: Answering questions about images often requires combining visual understanding with external knowledge. Multimodal Large Language Models (MLLMs) provide a natural framework for this setting, but they often struggle to identify the most relevant visual and textual evidence when answering knowledge-intensive queries. In such scenarios, models must integrate visual cues with retrieved textual evidence that is often noisy or only partially relevant, while also localizing fine-grained visual information in the image. In this work, we introduce Look Twice (LoT), a training-free inference-time framework that improves how pretrained MLLMs utilize multimodal evidence. Specifically, we exploit the model attention patterns to estimate which visual regions and retrieved textual elements are relevant to a query, and then generate the answer conditioned on this highlighted evidence. The selected cues are highlighted through lightweight prompt-level markers that encourage the model to re-attend to the relevant evidence during generation. Experiments across multiple knowledge-based VQA benchmarks show consistent improvements over zero-shot MLLMs. Additional evaluations on vision-centric and hallucination-oriented benchmarks further demonstrate that visual evidence highlighting alone improves model performance in settings without textual context, all without additional training or architectural modifications. Source code will be publicly released.
[65] Sparse Spectral LoRA: Routed Experts for Medical VLMs
Omid Nejati Manzari,Hojat Asgariandehkordi,Taha Koleilat,Yiming Xiao,Hassan Rivaz
Main category: cs.CV
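TL;DR: This paper proposes MedQwen, a parameter-efficient medical vision-language model that combines a spectrally routed Mixture-of-Experts (MoE) with a theoretically grounded scaling rule. Without changing the base architecture, it improves cross-dataset robustness and continual learning, greatly reduces the number of trainable parameters, and mitigates catastrophic forgetting.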
Details
Motivation: Large vision-language models perform well on general benchmarks, but in medical imaging heterogeneous supervision causes cross-dataset interference and sensitivity to the data regime, and in clinical settings data and tasks arrive sequentially, which leads to catastrophic forgetting.
Method: MedQwen uses a spectrally routed MoE, initializes each expert from non-overlapping SVD segments of the pretrained weight, introduces a residual compensation and scaling scheme to stabilize expert specialization and keep routing consistent under distribution shift, and applies a theoretically aligned low-rank update rule that matches a fully fine-tuned, full-rank MoE.
Result: Across 23 medical datasets (VQA, report generation, radiology classification, hallucination mitigation), MedQwen approaches full fine-tuning on zero-shot classification with 339× fewer trainable parameters, and limits sequential forgetting to about 5%, where strong baselines degrade by more than 20-50%.
Conclusion: Through structural design and a theory-driven parameter-efficient fine-tuning strategy, MedQwen addresses the robustness and scalability problems of medical VLMs under heterogeneous supervision and continual learning.
Abstract: Large vision-language models (VLMs) excel on general benchmarks but often lack robustness in medical imaging, where heterogeneous supervision induces cross-dataset interference and sensitivity to data regime (i.e., how the supervisory signals are mixed). In realistic clinical workflows, data and tasks arrive sequentially, so naive continual training further leads to catastrophic forgetting. To address these challenges, we propose MedQwen, a parameter-efficient medical VLM that couples a spectrally routed Mixture-of-Experts (MoE) with a theoretically grounded scaling rule that aligns low-rank updates with a full-rank, fully fine-tuned MoE, without changing the base architecture. Concretely, we initialize each expert from non-overlapping singular value decomposition (SVD) segments of the pretrained weight and introduce a residual compensation and scaling scheme to enable stable expert specialization and consistent routing under distribution shift. Across 23 medical datasets covering visual question answering, report generation, radiology classification, and hallucination mitigation, MedQwen achieves strong, reliable performance: it approaches full fine-tuning on zero-shot classification with 339× fewer trainable parameters, and reduces sequential forgetting to ~5% where strong baselines degrade by >20-50%.
[66] ViTs for Action Classification in Videos: An Approach to Risky Tackle Detection in American Football Practice Videos
Syed Ahsan Masud Zaidi,William Hsu,Scott Dietrich
Main category: cs.CV
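TL;DR: This paper proposes a vision-Transformer-based video analysis method for detecting risky tackles in American football practice videos and builds a larger dataset of 733 annotated clips, markedly improving detection of rare but safety-critical events.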
Details
Motivation: Early identification of hazardous actions in contact sports enables timely intervention and improves athlete safety.
Method: A vision-Transformer-based model with a training strategy that addresses class imbalance, applied to risky tackle detection on a newly built, larger dataset (733 single-athlete-dummy tackle clips with SATT-3 strike-zone annotations).
Result: Under cross-validation, risky recall reaches 0.67 and Risky F1 reaches 0.59; relative to the earlier baseline on a smaller dataset (recall 0.58, F1 0.56), recall improves by more than 8 percentage points.
Conclusion: Vision Transformers combined with imbalance-aware learning can reliably detect rare but safety-critical tackling patterns, offering a viable path toward coach-centered injury prevention tools.
Abstract: Early identification of hazardous actions in contact sports enables timely intervention and improves player safety. We present a method for detecting risky tackles in American football practice videos and introduce a substantially expanded dataset for this task. Our dataset contains 733 single-athlete-dummy tackle clips, each temporally localized around the first point of contact and labeled with a strike zone component of the standardized Assessment for Tackling Technique (SATT-3), extending prior work that reported 178 annotated videos. Using a vision transformer-based model with imbalance-aware training, we obtain risky recall of 0.67 and Risky F1 of 0.59 under cross-validation. Relative to the previous baseline on a smaller subset (risky recall of 0.58; Risky F1 of 0.56), our approach improves risky recall by more than 8 percentage points on a much larger dataset. These results indicate that vision transformer-based video analysis, coupled with careful handling of class imbalance, can reliably detect rare but safety-critical tackling patterns, offering a practical pathway toward coach-centered injury prevention tools.
[67] Human Pose Estimation in Trampoline Gymnastics: Improving Performance Using a New Synthetic Dataset
Léa Drolet-Roy,Victor Nogues,Sylvain Gaudet,Eve Charbonneau,Mickaël Begon,Lama Séoud
Main category: cs.CV
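TL;DR: This paper improves pose estimation for the extreme poses of trampoline gymnastics by fine-tuning a ViTPose model on a synthetic trampoline pose dataset (STP), markedly improving both 2D pose estimation and 3D triangulation.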
Details
Motivation: Existing pose estimation models perform poorly on the extreme body poses and unconventional viewpoints of trampoline gymnastics, so targeted improvement is needed.
Method: Build the synthetic trampoline pose dataset (STP) from motion capture data by fitting noisy mocap data to a parametric human model and generating realistic multi-view images; fine-tune ViTPose on this dataset and test it on real multi-view trampoline images.
Result: 2D pose estimation reaches state-of-the-art results on this challenging data; 3D MPJPE drops by 12.5 mm, a 19.6% improvement over the pretrained ViTPose.
Conclusion: Fine-tuning on high-quality synthetic data is an effective way to improve extreme-pose estimation and can markedly narrow the performance gap between common and extreme poses.
Abstract: Trampoline gymnastics involves extreme human poses and uncommon viewpoints, on which state-of-the-art pose estimation models tend to under-perform. We demonstrate that this problem can be addressed by fine-tuning a pose estimation model on a dataset of synthetic trampoline poses (STP). STP is generated from motion capture recordings of trampoline routines. We develop a pipeline to fit noisy motion capture data to a parametric human model, then generate realistic multi-view images. We use this data to fine-tune a ViTPose model, and test it on real multi-view trampoline images. The resulting model exhibits accuracy improvements in 2D, which translate to improved 3D triangulation. In 2D, we obtain state-of-the-art results on such challenging data, bridging the performance gap between common and extreme poses. In 3D, we reduce the MPJPE by 12.5 mm with our best model, which represents an improvement of 19.6% compared to the pretrained ViTPose model.
[68] Regularizing Attention Scores with Bootstrapping
Neo Christopher Chung,Maxim Laletin
Main category: cs.CV
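TL;DR: This paper proposes a bootstrapping-based attention regularization method that quantifies the uncertainty of attention scores in vision transformers (ViT), removing spurious attention caused by noise and improving the sparsity and interpretability of attention maps.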
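The idea of testing attention scores against a bootstrap baseline can be illustrated with a small standard-library sketch. This is not the authors' implementation: the pooled resampling scheme, the function names, and the threshold `alpha` are all assumptions chosen to keep the example short. It builds a null distribution of scores by resampling key features with replacement, converts each observed score into an empirical p-value, and zeroes out scores that are not significant.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_scores(query, keys):
    """Scaled dot-product attention of one query over all keys."""
    d = len(query)
    logits = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    return softmax(logits)


def bootstrap_p_values(query, keys, n_boot=200, seed=0):
    """Empirical p-value of each score against a bootstrap baseline
    (hypothetical scheme: resample key features from the pooled values)."""
    rng = random.Random(seed)
    observed = attention_scores(query, keys)
    pool = [v for key in keys for v in key]
    null_scores = []
    for _ in range(n_boot):
        null_keys = [[rng.choice(pool) for _ in key] for key in keys]
        null_scores.extend(attention_scores(query, null_keys))
    n = len(null_scores)
    return [(1 + sum(s >= obs for s in null_scores)) / (1 + n) for obs in observed]


def regularize(scores, p_values, alpha=0.05):
    """Keep a score only if it is significant at level alpha."""
    return [s if p <= alpha else 0.0 for s, p in zip(scores, p_values)]
```

A larger observed score can never receive a larger p-value than a smaller one under this scheme, so thresholding on the p-value produces a sparse map that keeps only the strongest attention weights.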
Details
Motivation: Attention scores in vision transformers are usually non-zero and noisy, which yields diffuse attention maps with poor interpretability; a regularization method that can measure their uncertainty is needed.
Method: Treat attention scores as statistics subject to independent noise, build a baseline distribution of attention scores by bootstrap-resampling the input features, and use it to estimate the significance and posterior probability of each score, thereby regularizing attention.
Result: On natural and medical images, the method removes spurious attention induced by noise and markedly improves the sparsity and shrinkage of attention maps; experiments on simulated and real data support its effectiveness.
Conclusion: Bootstrapping is a practical and effective attention regularization tool that makes attention in ViT more reliable and interpretable as an explanation.
Abstract: Vision transformers (ViT) rely on an attention mechanism to weigh input features, and therefore attention scores have naturally been considered as explanations for their decision-making process. However, attention scores are almost always non-zero, resulting in noisy and diffused attention maps and limiting interpretability. Can we quantify uncertainty measures of attention scores and obtain regularized attention scores? To this end, we consider attention scores of ViT in a statistical framework where independent noise would lead to insignificant yet non-zero scores. Leveraging statistical learning techniques, we introduce bootstrapping for attention scores, which generates a baseline distribution of attention scores by resampling input features. Such a bootstrap distribution is then used to estimate significances and posterior probabilities of attention scores. In natural and medical images, the proposed Attention Regularization approach demonstrates a straightforward removal of spurious attention arising from noise, drastically improving shrinkage and sparsity. Quantitative evaluations are conducted using both simulation and real-world datasets. Our study highlights bootstrapping as a practical regularization tool when using attention scores as explanations for ViT. Code available: https://github.com/ncchung/AttentionRegularization
[69] Perceptual misalignment of texture representations in convolutional neural networks
Ludovica de Paolis,Fabio Anselmi,Alessio Ansuini,Eugenio Piasini
Main category: cs.CV
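TL;DR: This paper examines how well convolutional neural networks (CNNs) model texture perception. It finds no correlation between a CNN's quality as a model of the visual system (e.g., its Brain-Score) and its ability to represent human texture perception, suggesting that texture perception may rely on mechanisms that CNNs, especially those trained for object recognition, do not adequately model, such as contextual integration.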
Details
Motivation: To test whether a CNN's quality as a model of the visual system agrees with its ability to represent human texture perception, i.e., whether more "biologically plausible" CNNs also have more human-like texture representations.
Method: Compare the perceptual texture content captured by correlations between nonlinear features (Gram matrices) extracted by a variety of CNNs, and relate it to each CNN's Brain-Score for modeling the mammalian visual system.
Result: No significant correlation was found between a CNN's Brain-Score and its ability to represent texture perception.
Conclusion: Human texture perception may rely on mechanisms that current mainstream CNNs trained on object recognition fail to model effectively, such as the integration of contextual information.
Abstract: Mathematical modeling of visual textures traces back to Julesz's intuition that texture perception in humans is based on local correlations between image features. An influential approach for texture analysis and generation generalizes this notion to linear correlations between the nonlinear features computed by convolutional neural networks (CNNs), compiled into Gram matrices. Given that CNNs are often used as models for the visual system, it is natural to ask whether such "texture representations" spontaneously align with the textures' perceptual content, and in particular whether those CNNs that are regarded as better models for the visual system also possess more human-like texture representations. Here we compare the perceptual content captured by feature correlations computed for a diverse pool of CNNs, and we compare it to the models' perceptual alignment with the mammalian visual system as measured by Brain-Score. Surprisingly, we find that there is no connection between conventional measures of CNN quality as a model of the visual system and its alignment with human texture perception. We conclude that texture perception involves mechanisms that are distinct from those that are commonly modeled using approaches based on CNNs trained on object recognition, possibly depending on the integration of contextual information.
[70] IGLOSS: Image Generation for Lidar Open-vocabulary Semantic Segmentation
Nermin Samet,Gilles Puy,Renaud Marlet
Main category: cs.CV
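TL;DR: This paper proposes a new method for zero-shot open-vocabulary semantic segmentation (OVSS) of 3D lidar data that generates images from text to create prototype images, avoiding the image-text modality gap of methods based on vision-language models such as CLIP.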
Details
Motivation: Address the image-text modality gap inherent to zero-shot OVSS methods based on vision-language models such as CLIP.
Method: Create prototype images by text-to-image generation and, using a 3D network distilled from a 2D vision foundation model, label the point cloud by matching 3D point features with the 2D image features of the prototypes.
Result: State-of-the-art zero-shot OVSS performance on nuScenes and SemanticKITTI.
Conclusion: The method sidesteps the image-text modality gap and improves zero-shot OVSS on 3D lidar data.
Abstract: This paper presents a new method for the zero-shot open-vocabulary semantic segmentation (OVSS) of 3D automotive lidar data. To circumvent the recognized image-text modality gap that is intrinsic to approaches based on Vision Language Models (VLMs) such as CLIP, our method relies instead on image generation from text, to create prototype images. Given a 3D network distilled from a 2D Vision Foundation Model (VFM), we then label a point cloud by matching 3D point features with 2D image features of these prototypes. Our method is state-of-the-art for OVSS on nuScenes and SemanticKITTI. Code, pre-trained models, and generated images are available at https://github.com/valeoai/IGLOSS.
[71] AffordTissue: Dense Affordance Prediction for Tool-Action Specific Tissue Interaction
Aiza Maksutova,Lalithkumar Seenivasan,Hao Ding,Jiru Xu,Chenhao Yu,Chenyan Jing,Yiqing Shen,Mathias Unberath
Main category: cs.CV
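TL;DR: This paper proposes AffordTissue, a framework that predicts tool-action-specific tissue affordance regions (as dense heatmaps) during cholecystectomy. It combines a temporal vision encoder, language conditioning, and a DiT-style decoder, and markedly outperforms existing vision-language models on the authors' own tissue affordance benchmark, the first of its kind (15,638 video clips).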
Details
Motivation: Existing surgical automation methods face two obstacles to clinical deployment: it is hard to predict where instruments will interact with tissue surfaces, and there is no explicit conditioning to enforce tool-action-specific safe interaction regions.
Method: AffordTissue is a multimodal framework with three parts: 1) a temporal vision encoder capturing tool motion and tissue dynamics across multiple viewpoints; 2) a language conditioning module that supports generalization across instrument-action pairs; 3) a DiT-style decoder for dense affordance prediction. The authors also build the first tissue affordance benchmark (103 cholecystectomy procedures, 6 tool-action pairs).
Result: On this benchmark AffordTissue reaches an average surface distance (ASSD) of 20.6 px, well ahead of vision-language baselines such as Molmo-VLM (60.2 px), supporting the advantage of a task-specific architecture for dense surgical affordance prediction.
Conclusion: By predicting tool-action-specific tissue affordance regions, AffordTissue gives surgical automation explicit spatial reasoning and may enable policy guidance toward safe regions and early safe stops when instruments leave them.
Abstract: Surgical action automation has progressed rapidly toward achieving surgeon-like dexterous control, driven primarily by advances in learning from demonstration and vision-language-action models. While these have demonstrated success in table-top experiments, translating them to clinical deployment remains challenging: current methods offer limited predictability on where instruments will interact on tissue surfaces and lack explicit conditioning inputs to enforce tool-action-specific safe interaction regions. Addressing this gap, we introduce AffordTissue, a multimodal framework for predicting tool-action specific tissue affordance regions as dense heatmaps during cholecystectomy. Our approach combines a temporal vision encoder capturing tool motion and tissue dynamics across multiple viewpoints, language conditioning enabling generalization across diverse instrument-action pairs, and a DiT-style decoder for dense affordance prediction. We establish the first tissue affordance benchmark by curating and annotating 15,638 video clips across 103 cholecystectomy procedures, covering six unique tool-action pairs involving four instruments (hook, grasper, scissors, clipper) and their associated tasks: dissection, grasping, clipping, and cutting. Experiments demonstrate substantial improvement over vision-language model baselines (20.6 px ASSD vs. 60.2 px for Molmo-VLM), showing that our task-specific architecture outperforms large-scale foundation models for dense surgical affordance prediction.
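The comparison above is reported in pixels of ASSD. Assuming ASSD here denotes the average symmetric surface distance commonly used in segmentation evaluation (the entry does not expand the acronym), a minimal standard-library sketch of the metric on two sets of boundary pixels looks like this. The function name and the brute-force nearest-neighbour search are illustrative choices, not the authors' evaluation code.

```python
import math


def assd(boundary_a, boundary_b):
    """Average symmetric surface distance between two non-empty sets of
    boundary points, given as (x, y) pixel coordinates."""
    # Distance from every point in A to its nearest point in B ...
    a_to_b = sum(min(math.dist(p, q) for q in boundary_b) for p in boundary_a)
    # ... and from every point in B to its nearest point in A.
    b_to_a = sum(min(math.dist(q, p) for p in boundary_a) for q in boundary_b)
    return (a_to_b + b_to_a) / (len(boundary_a) + len(boundary_b))


predicted = [(0, 0), (0, 1)]
reference = [(3, 0), (3, 1)]
print(assd(predicted, reference))  # 3.0: the two contours are 3 px apart
```

Lower is better: a value of 0 means the predicted and reference boundaries coincide, and the score grows with the typical gap between them.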
By predicting tool-action specific tissue affordance regions, AffordTissue provides explicit spatial reasoning for safe surgical automation, potentially unlocking explicit policy guidance toward appropriate tissue regions and early safe stop when instruments deviate outside predicted safe zones.
[72] GRAZE: Grounded Refinement and Motion-Aware Zero-Shot Event Localization
Syed Ahsan Masud Zaidi,Lior Shamir,William Hsu,Scott Dietrich,Talha Zaidi
Main category: cs.CV
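TL;DR: This paper proposes GRAZE, a training-free pipeline that, without labeled data, localizes the frame in which a player first touches the tackle dummy (First Point of Contact, FPOC) in American football practice videos. It combines Grounding DINO, motion-aware temporal reasoning, and SAM2 pixel-level verification, and localizes FPOC within ±10 frames on 77.5% of 738 videos.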
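The "within ±10 frames" figure is a tolerance-based accuracy. A small sketch of how such a number could be computed is given below; the function names and the toy frame indices are made up for illustration. Following the abstract's phrasing "of all clips", clips with no valid output (`None`) are counted as misses rather than dropped.

```python
def within_tolerance_rate(predicted, ground_truth, tolerance):
    """Fraction of all clips whose predicted contact frame lies within
    `tolerance` frames of the annotated frame (None counts as a miss)."""
    hits = sum(
        1
        for p, g in zip(predicted, ground_truth)
        if p is not None and abs(p - g) <= tolerance
    )
    return hits / len(ground_truth)


def valid_output_rate(predicted):
    """Fraction of clips for which the pipeline returned any frame."""
    return sum(p is not None for p in predicted) / len(predicted)


# Toy example with four clips; the third produced no valid output.
pred = [100, 115, None, 130]
truth = [105, 100, 50, 149]
print(valid_output_rate(pred))                # 0.75
print(within_tolerance_rate(pred, truth, 10))  # 0.25
print(within_tolerance_rate(pred, truth, 20))  # 0.75
```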
Details
Motivation: American football practice videos are long and untrimmed, and key biomechanical analysis depends on precise spatiotemporal localization of brief contact events (such as the first touch on the dummy), but existing methods struggle with camera motion, scene clutter, multiple similarly equipped athletes, and rapid pose changes around impact.
Method: GRAZE is a training-free pipeline. It first uses Grounding DINO to discover candidate player-dummy interaction regions, then refines the temporal localization with motion-aware temporal reasoning, and finally uses SAM2 for pixel-level contact verification instead of relying on detection confidence, thereby decoupling candidate discovery from contact confirmation.
Result: On 738 real practice videos, GRAZE produces valid outputs for 97.4% of clips; FPOC is localized within ±10 frames on 77.5% of all clips and within ±20 frames on 82.7%.
Conclusion: Frame-level localization of contact onset in real-world practice video is achievable without task-specific annotation or training, supporting the feasibility and robustness of unsupervised or weakly supervised approaches to complex sports video analysis.
Abstract: American football practice generates video at scale, yet the interaction of interest occupies only a brief window of each long, untrimmed clip. Reliable biomechanical analysis, therefore, depends on spatiotemporal localization that identifies both the interacting entities and the onset of contact. We study First Point of Contact (FPOC), defined as the first frame in which a player physically touches a tackle dummy, in unconstrained practice footage with camera motion, clutter, multiple similarly equipped athletes, and rapid pose changes around impact. We present GRAZE, a training-free pipeline for FPOC localization that requires no labeled tackle-contact examples. GRAZE uses Grounding DINO to discover candidate player-dummy interactions, refines them with motion-aware temporal reasoning, and uses SAM2 as an explicit pixel-level verifier of contact rather than relying on detection confidence alone. This separation between candidate discovery and contact confirmation makes the approach robust to cluttered scenes and unstable grounding near impact. On 738 tackle-practice videos, GRAZE produces valid outputs for 97.4% of clips and localizes FPOC within ±10 frames on 77.5% of all clips and within ±20 frames on 82.7% of all clips. These results show that frame-accurate contact onset localization in real-world practice footage is feasible without task-specific training.
[73] LESV: Language Embedded Sparse Voxel Fusion for Open-Vocabulary 3D Scene Understanding
Fusang Wang,Nathan Piasco,Moussab Bennehar,Luis Roldão,Dzmitry Tsishkou,Fabien Moutarde
Main category: cs.CV
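TL;DR: This paper proposes a new framework based on Sparse Voxel Rasterization (SVRaster) to resolve the spatial and semantic ambiguity of 3D Gaussian Splatting (3DGS) in open-vocabulary 3D scene understanding, markedly improving performance on fine-grained queries.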
Details
Motivation: Existing 3DGS-based methods for open-vocabulary 3D scene understanding have two key weaknesses: spatial ambiguity (overlapping Gaussians force probabilistic feature registration) and multi-level semantic ambiguity (pooling over object masks dilutes fine detail).
Method: Use Sparse Voxel Rasterization (SVRaster) as a structured, non-overlapping geometry representation, regularized with monocular depth and normal priors; combine it with the dense alignment properties of the AM-RADIO foundation model to obtain deterministic, confidence-aware feature registration that avoids semantic bleeding and the cost of hierarchical training.
Result: State-of-the-art results on open-vocabulary 3D object retrieval and point cloud understanding benchmarks, with especially large gains over existing registration-based methods on fine-grained queries.
Conclusion: SVRaster offers a more stable geometric foundation and a more precise feature registration mechanism, making it a strong alternative to 3DGS for open-vocabulary 3D understanding.
Abstract: Recent advancements in open-vocabulary 3D scene understanding heavily rely on 3D Gaussian Splatting (3DGS) to register vision-language features into 3D space. However, we identify two critical limitations in these approaches: the spatial ambiguity arising from unstructured, overlapping Gaussians which necessitates probabilistic feature registration, and the multi-level semantic ambiguity caused by pooling features over object-level masks, which dilutes fine-grained details. To address these challenges, we present a novel framework that leverages Sparse Voxel Rasterization (SVRaster) as a structured, disjoint geometry representation. By regularizing SVRaster with monocular depth and normal priors, we establish a stable geometric foundation. This enables a deterministic, confidence-aware feature registration process and suppresses the semantic bleeding artifact common in 3DGS. Furthermore, we resolve multi-level ambiguity by exploiting the emerging dense alignment properties of the foundation model AM-RADIO, avoiding the computational overhead of hierarchical training methods. Our approach achieves state-of-the-art performance on Open Vocabulary 3D Object Retrieval and Point Cloud Understanding benchmarks, particularly excelling on fine-grained queries where registration methods typically fail.
[74] EgoFlow: Gradient-Guided Flow Matching for Egocentric 6DoF Object Motion Generation
Abhishek Saroha,Huajian Zeng,Xingxing Zuo,Daniel Cremers,Xi Wang
Main category: cs.CV
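TL;DR: EgoFlow is a flow-matching framework that generates physically plausible, realistic 6DoF object motion trajectories from egocentric video. It combines a hybrid Mamba-Transformer-Perceiver architecture with differentiable physical constraints, markedly reducing collision rates and improving generalization on several real-world datasets.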
Details
Motivation: Existing generative models lack explicit physical reasoning and struggle to produce physically consistent 6DoF trajectories under occlusion, fast motion, and other difficult conditions.
Method: EgoFlow uses a hybrid Mamba-Transformer-Perceiver architecture to jointly model temporal dynamics, scene geometry, and semantic intent, and applies differentiable physical constraints (such as collision avoidance and motion smoothness) through gradient-guided inference.
Result: On HD-EPIC, EgoExo4D, and HOT3D, EgoFlow outperforms diffusion-based and Transformer baselines, reduces collision rates by up to 79%, and generalizes well across scenes.
Conclusion: Flow-based generative modeling offers a new route to scalable, physically grounded egocentric motion understanding.
Abstract: Understanding and predicting object motion from egocentric video is fundamental to embodied perception and interaction. However, generating physically consistent 6DoF trajectories remains challenging due to occlusions, fast motion, and the lack of explicit physical reasoning in existing generative models. We present EgoFlow, a flow-matching framework that synthesizes realistic and physically plausible trajectories conditioned on multimodal egocentric observations. EgoFlow employs a hybrid Mamba-Transformer-Perceiver architecture to jointly model temporal dynamics, scene geometry, and semantic intent, while a gradient-guided inference process enforces differentiable physical constraints such as collision avoidance and motion smoothness. This combination yields coherent and controllable motion generation without post-hoc filtering or additional supervision. Experiments on real-world datasets HD-EPIC, EgoExo4D, and HOT3D show that EgoFlow outperforms diffusion-based and transformer baselines in accuracy, generalization, and physical realism, reducing collision rates by up to 79% and generalizing strongly to unseen scenes. Our results highlight the promise of flow-based generative modeling for scalable and physically grounded egocentric motion understanding.
[75] Better Rigs, Not Bigger Networks: A Body Model Ablation for Gaussian Avatars
Derek Austin
Main category: cs.CV
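TL;DR: This paper replaces SMPL with the Momentum Human Rig (MHR), estimated via SAM-3D-Body, to build a simpler 3D Gaussian splatting pipeline for human avatars that achieves the best results on several metrics, and uses controlled ablations to show that body model expressiveness is a primary bottleneck in current avatar reconstruction.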
Details
Motivation: Existing SMPL-based 3D Gaussian splatting methods look good but rely on increasingly complex training architectures; the author asks whether the architecture can be simplified without sacrificing performance.
Method: Replace SMPL with MHR, using SAM-3D-Body to estimate MHR pose and shape; design a minimal training pipeline with no learned deformations and no pose-dependent corrections; and run two controlled ablations (mesh translation and pose translation) to separate the effect of pose estimation quality from that of body model expressiveness.
Result: Highest PSNR and competitive or better LPIPS and SSIM on PeopleSnapshot and ZJU-MoCap; the ablations show that MHR's expressiveness and SAM-3D-Body's pose estimation quality both contribute to the gains.
Conclusion: Body model expressiveness (such as MHR's geometric flexibility relative to SMPL/SMPL-X) and high-quality pose estimation (as provided by SAM-3D-Body) are key to avatar reconstruction; overly complex learnable modules are not required.
Abstract: Recent 3D Gaussian splatting methods built atop SMPL achieve remarkable visual fidelity while continually increasing the complexity of the overall training architecture. We demonstrate that much of this complexity is unnecessary: by replacing SMPL with the Momentum Human Rig (MHR), estimated via SAM-3D-Body, a minimal pipeline with no learned deformations or pose-dependent corrections achieves the highest reported PSNR and competitive or superior LPIPS and SSIM on PeopleSnapshot and ZJU-MoCap. To disentangle pose estimation quality from body model representational capacity, we perform two controlled ablations: translating SAM-3D-Body meshes to SMPL-X, and translating the original dataset's SMPL poses into MHR, both retrained under identical conditions. These ablations confirm that body model expressiveness has been a primary bottleneck in avatar reconstruction, with both mesh representational capacity and pose estimation quality contributing meaningfully to the full pipeline's gains.
[76] Nonlinear Methods for Analyzing Pose in Behavioral Research
Carter Sale,Margaret C. Macpherson,Gaurav Patil,Kelly Miles,Rachel W. Kallen,Sebastian Wallot,Michael J. Richardson
Main category: cs.CV
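TL;DR: This paper proposes a general-purpose analysis pipeline for human pose data that combines preprocessing, dimensionality reduction, and recurrence-based time series analysis to extract the temporal structure of movement dynamics, and demonstrates its flexibility and applicability through several case studies.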
Details
Motivation: Pose data are high-dimensional, noisy, and temporally complex, which makes it hard to extract meaningful patterns of coordination and behavioral change; a general, robust analysis framework is needed.
Method: A general-purpose pipeline comprising principled preprocessing, dimensionality reduction, and recurrence-based time series analysis, supporting both linear and nonlinear characterizations of movement.
Result: Three case studies covering facial and full-body movement, 2D and 3D data, and individual and multi-agent behavior show that the pipeline adapts to varied experimental contexts and yields theoretically meaningful behavioral insights.
Conclusion: The pipeline provides a scalable, reusable, theory-driven toolkit for studying human behavior at scale in naturalistic settings.
Abstract: Advances in markerless pose estimation have made it possible to capture detailed human movement in naturalistic settings using standard video, enabling new forms of behavioral analysis at scale. However, the high dimensionality, noise, and temporal complexity of pose data raise significant challenges for extracting meaningful patterns of coordination and behavioral change. This paper presents a general-purpose analysis pipeline for human pose data, designed to support both linear and nonlinear characterizations of movement across diverse experimental contexts. The pipeline combines principled preprocessing, dimensionality reduction, and recurrence-based time series analysis to quantify the temporal structure of movement dynamics. To illustrate the pipeline's flexibility, we present three case studies spanning facial and full-body movement, 2D and 3D data, and individual versus multi-agent behavior. Together, these examples demonstrate how the same analytic workflow can be adapted to extract theoretically meaningful insights from complex pose time series.
[77] Reinforcing Consistency in Video MLLMs with Structured Rewards
Yihao Quan,Zeru Shi,Jinman Zhao,Ruixiang Tang
Main category: cs.CV
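TL;DR: This paper proposes a structured reward mechanism to improve the visual and temporal grounding of multimodal large language models (MLLMs) in video understanding. It audits consistency by decomposing captions into factual and temporal claims, and designs a fine-grained reward with scene-graph, temporal, and video question answering components, markedly improving model faithfulness.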
Details
Motivation: In video understanding, existing MLLMs often produce plausible-looking outputs that lack visual and temporal support (fabricated objects, wrong attributes, or collapsed repeated events), and standard sentence-level supervision and reinforcement learning rewards cannot accurately localize and correct these underlying grounding failures.
Method: A top-down compositional consistency audit that decomposes a caption into factual and temporal claims, plus a structured reinforcement learning objective with: (1) an instance-aware scene-graph reward (objects, attributes, relations); (2) a temporal reward (event ordering and repetition); (3) a video-grounded VQA reward (hierarchical self-verification).
Result: On temporal, general video understanding, and hallucination-oriented benchmarks, the structured reward yields consistent gains across several open-source backbones, reducing hallucination and improving grounding reliability.
Conclusion: Structured reward shaping is a practical and effective route to more faithful video understanding, outperforming coarse sentence-level supervision and rewards.
Abstract: Multimodal large language models (MLLMs) have achieved remarkable progress in video understanding. However, seemingly plausible outputs often suffer from poor visual and temporal grounding: a model may fabricate object existence, assign incorrect attributes, or collapse repeated events while still producing a globally reasonable caption or answer. We study this failure mode through a compositional consistency audit that decomposes a caption into supporting factual and temporal claims, investigating whether a correct high-level prediction is actually backed by valid lower-level evidence. Our top-down audit reveals that even correct root relational claims often lack reliable attribute and existence support. This indicates that standard sentence-level supervision is a weak proxy for faithful video understanding. Furthermore, when turning to reinforcement learning (RL) for better alignment, standard sentence-level rewards often prove too coarse to accurately localize specific grounding failures. To address this, we replace generic sentence-level rewards with a structured reward built from factual and temporal units. Our training objective integrates three complementary components: (1) an instance-aware scene-graph reward for factual objects, attributes, and relations; (2) a temporal reward for event ordering and repetition; and (3) a video-grounded VQA reward for hierarchical self-verification. Across temporal, general video understanding, and hallucination-oriented benchmarks, this objective yields consistent gains on open-source backbones.
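A structured reward of this kind can be pictured as a weighted sum of per-component scores. The sketch below is only an illustration under stated assumptions: the component definitions (set F1 over scene-graph claims, pairwise order agreement for events), the equal weights, and all function names are hypothetical, and the temporal term covers ordering only, not repetition. The entry does not specify how the authors compute or combine the three terms.

```python
from itertools import combinations


def set_f1(predicted, reference):
    """F1 overlap between predicted and reference claims
    (stand-in for a scene-graph reward)."""
    predicted, reference = set(predicted), set(reference)
    true_pos = len(predicted & reference)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(reference)
    return 2 * precision * recall / (precision + recall)


def order_agreement(predicted_events, reference_events):
    """Fraction of reference event pairs that appear in the same relative
    order in the prediction (assumes each event occurs once)."""
    position = {event: i for i, event in enumerate(predicted_events)}
    pairs = list(combinations(reference_events, 2))
    if not pairs:
        return 0.0
    agree = sum(
        1
        for first, second in pairs
        if first in position and second in position
        and position[first] < position[second]
    )
    return agree / len(pairs)


def structured_reward(scene_graph, temporal, vqa, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of the three component rewards, each in [0, 1]."""
    w_sg, w_t, w_vqa = weights
    return w_sg * scene_graph + w_t * temporal + w_vqa * vqa


sg = set_f1(["cup on table", "cup is red"], ["cup on table", "cup is blue"])
tm = order_agreement(["pour", "open", "close"], ["open", "pour", "close"])
print(structured_reward(sg, tm, vqa=1.0))
```

Scoring each kind of claim separately is what lets the reward point at a specific grounding failure, for example a wrong attribute, instead of penalizing the whole sentence.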
These results suggest that structured reward shaping is a practical route to more faithful video understanding.
[78] Prime Once, then Reprogram Locally: An Efficient Alternative to Black-Box Service Model Adaptation
Yunbei Zhang,Chengyi Cai,Feng Liu,Jihun Hamm
Main category: cs.CV
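TL;DR: This paper proposes AReS, which uses a single pass of API calls to lightly tune a local pre-trained encoder and then performs white-box reprogramming on the local model, avoiding repeated calls to an expensive closed-source API. It markedly improves performance on several datasets while sharply cutting API call overhead.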
Details
Motivation: Existing reprogramming methods for closed-source service models (such as GPT-4o) based on Zeroth-Order Optimization (ZOO) suffer from high API call costs and unstable optimization, and modern large models are less sensitive to the input perturbations they rely on.
Method: AReS first uses a single-pass API interaction to add and train a lightweight layer on a local pre-trained encoder (priming) so that it fits the target task; it then performs white-box reprogramming on this local proxy model, with all subsequent adaptation and inference done without the API.
Result: On GPT-4o, a 27.8% gain over the zero-shot baseline, where ZOO methods give almost no improvement; across 10 datasets, it outperforms state-of-the-art methods on average (+2.5% for VLMs, +15.6% for standard VMs) while reducing API calls by more than 99.99%.
Conclusion: AReS offers an efficient, stable, low-cost way to adapt modern closed-source large models that is both practical and scalable.
Abstract: Adapting closed-box service models (i.e., APIs) for target tasks typically relies on reprogramming via Zeroth-Order Optimization (ZOO). However, this standard strategy is known for extensive, costly API calls and often suffers from slow, unstable optimization. Furthermore, we observe that this paradigm faces new challenges with modern APIs (e.g., GPT-4o). These models can be less sensitive to the input perturbations ZOO relies on, thereby hindering performance gains. To address these limitations, we propose an Alternative efficient Reprogramming approach for Service models (AReS). Instead of direct, continuous closed-box optimization, AReS initiates a single-pass interaction with the service API to prime an amenable local pre-trained encoder. This priming stage trains only a lightweight layer on top of the local encoder, making it highly receptive to the subsequent glass-box (white-box) reprogramming stage performed directly on the local model. Consequently, all subsequent adaptation and inference rely solely on this local proxy, eliminating all further API costs. Experiments demonstrate AReS's effectiveness where prior ZOO-based methods struggle: on GPT-4o, AReS achieves a +27.8% gain over the zero-shot baseline, a task where ZOO-based methods provide little to no improvement. Broadly, across ten diverse datasets, AReS outperforms state-of-the-art methods (+2.5% for VLMs, +15.6% for standard VMs) while reducing API calls by over 99.99%. AReS thus provides a robust and practical solution for adapting modern closed-box models.
[79] UniRecGen: Unifying Multi-View 3D Reconstruction and Generation
Zhisheng Huang,Jiahao Chen,Cheng Lin,Chenyu Hu,Hanzhuo Huang,Zhengming Yu,Mengfei Li,Yuheng Liu,Zekai Gu,Zibo Zhao,Yuan Liu,Xin Li,Wenping Wang
Main category: cs.CV
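TL;DR: UniRecGen is a unified framework that combines feed-forward reconstruction with diffusion-based generation and trains them cooperatively in a shared canonical space, producing high-fidelity, structurally complete, multi-view-consistent 3D models from sparse views.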
Details
Motivation: Sparse-view 3D modeling faces a fundamental tension between reconstruction fidelity and generative plausibility: feed-forward reconstruction is efficient but lacks global priors, while diffusion-based generation is rich in detail but inconsistent across views.
Method: UniRecGen aligns the two paradigms in a shared canonical space and uses a disentangled cooperative learning strategy; the reconstruction module provides canonical geometric anchors, and the diffusion generator uses latent-augmented conditioning to refine and complete the geometry.
Result: Experiments show that, given sparse-view inputs, UniRecGen produces 3D models that surpass existing methods in fidelity, completeness, and multi-view consistency.
Conclusion: Integrating feed-forward reconstruction and diffusion-based generation cooperatively in a canonical space is an effective way to improve sparse-view 3D modeling, balancing efficiency, structural completeness, and geometric detail.
Abstract: Sparse-view 3D modeling represents a fundamental tension between reconstruction fidelity and generative plausibility. While feed-forward reconstruction excels in efficiency and input alignment, it often lacks the global priors needed for structural completeness. Conversely, diffusion-based generation provides rich geometric details but struggles with multi-view consistency. We present UniRecGen, a unified framework that integrates these two paradigms into a single cooperative system. To overcome inherent conflicts in coordinate spaces, 3D representations, and training objectives, we align both models within a shared canonical space. We employ disentangled cooperative learning, which maintains stable training while enabling seamless collaboration during inference. Specifically, the reconstruction module is adapted to provide canonical geometric anchors, while the diffusion generator leverages latent-augmented conditioning to refine and complete the geometric structure. Experimental results demonstrate that UniRecGen achieves superior fidelity and robustness, outperforming existing methods in creating complete and consistent 3D models from sparse observations.
[80] Universal computational thermal imaging overcoming the ghosting effect
Hongyi Xu,Du Wang,Chenjun Zhao,Jiashuo Chen,Jiale Lin,Liqin Cao,Yanfei Zhong,Yiyuan She,Fanglin Bao
Main category: cs.CV
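TL;DR: This paper proposes TAG (thermal anti-ghosting), a universal computational thermal imaging framework that addresses the ghosting effect caused by material non-uniformity and enables high-fidelity night vision. It reports the first recovery of facial texture and expression, 3D topological alignment, and mood detection in thermal imaging.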
Details
Motivation: Thermal imaging is limited by the ghosting effect, which causes severe loss of detail; HADAR was a breakthrough but applies only to scenes with uniform materials, whereas material non-uniformity is ubiquitous in the real world, so a universal anti-ghosting solution is needed.
Method: The TAG framework performs nonparametric texture recovery from hyperspectral photon streams, requires no prior material model, and adapts to scenes with arbitrary material distributions.
Result: Experiments achieve, for the first time in thermal imaging, high-fidelity recovery of texture and expression in ghostly blurred human faces; TAG clearly outperforms HADAR, reveals how material non-uniformity bounds HADAR's effectiveness, and demonstrates thermal 3D topological alignment and mood detection for the first time.
Conclusion: TAG establishes a universal foundation for high-fidelity computational night vision and broadens the potential of thermal imaging in autonomous driving, reconnaissance, healthcare, and wildlife monitoring.
Abstract: Thermal imaging is crucial for night vision but fundamentally hampered by the ghosting effect, a loss of detailed texture in cluttered photon streams. While conventional ghosting mitigation has relied on data post-processing, the recent breakthrough in heat-assisted detection and ranging (HADAR) opens a promising frontier for hyperspectral computational thermal imaging that produces night vision with day-like visibility. However, universal anti-ghosting imaging remains elusive, as state-of-the-art HADAR applies only to limited scenes with uniform materials, whereas material non-uniformity is ubiquitous in the real world. Here, we propose a universal computational thermal imaging framework, TAG (thermal anti-ghosting), to address material non-uniformity and overcome ghosting for high-fidelity night vision. TAG takes hyperspectral photon streams for nonparametric texture recovery, enabling our experimental demonstration of unprecedented expression recovery in thus-far-elusive ghostly human faces, the archetypal, long-recognized ghosting phenomenon. Strikingly, TAG not only universally outperforms HADAR across various scenes, but also reveals the influence of material non-uniformity, shedding light on HADAR's effectiveness boundary. We extensively test facial texture and expression recovery across day and night, and demonstrate, for the first time, thermal 3D topological alignment and mood detection. This work establishes a universal foundation for high-fidelity computational night vision, with potential applications in autonomous navigation, reconnaissance, healthcare, and wildlife monitoring.
[81] Prototype-Based Low Altitude UAV Semantic Segmentation
Da Zhang,Gao Junyu,Zhao Zhiyuan
Main category: cs.CV
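TL;DR: This paper proposes PBSeg, an efficient prototype-based framework for semantic segmentation of low-altitude UAV imagery. Using prototype-based cross-attention (PBCA) and a multi-scale feature extraction module that combines deformable convolutions with context-aware modulation, it keeps accuracy high while substantially reducing computational cost.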
Details
Motivation: Existing Transformer-based methods are computationally expensive, while lightweight methods struggle to capture fine-grained detail in high-resolution aerial scenes; UAV imagery also brings extreme scale variation, complex boundaries, and resource-constrained edge devices.
Method: The PBSeg framework includes prototype-based cross-attention (PBCA), which exploits feature redundancy to lower computational complexity, and an efficient multi-scale feature extraction module that combines deformable convolutions (DConv) with context-aware modulation (CAM).
Result: 71.86% mIoU on UAVid and 80.92% mIoU on UDD6, which is competitive while remaining computationally efficient.
Conclusion: PBSeg strikes a good balance between accuracy and efficiency and suits real-time semantic segmentation of UAV imagery under resource constraints.
Abstract: Semantic segmentation of low-altitude UAV imagery presents unique challenges due to extreme scale variations, complex object boundaries, and limited computational resources on edge devices. Existing transformer-based segmentation methods achieve remarkable performance but incur high computational overhead, while lightweight approaches struggle to capture fine-grained details in high-resolution aerial scenes. To address these limitations, we propose PBSeg, an efficient prototype-based segmentation framework tailored for UAV applications. PBSeg introduces a novel prototype-based cross-attention (PBCA) that exploits feature redundancy to reduce computational complexity while maintaining segmentation quality. The framework incorporates an efficient multi-scale feature extraction module that combines deformable convolutions (DConv) with context-aware modulation (CAM) to capture both local details and global semantics. Experiments on two challenging UAV datasets demonstrate the effectiveness of the proposed approach. PBSeg achieves 71.86% mIoU on UAVid and 80.92% mIoU on UDD6, establishing competitive performance while maintaining computational efficiency. Code is available at https://github.com/zhangda1018/PBSeg.
[82] Cross-Domain Vessel Segmentation via Latent Similarity Mining and Iterative Co-Optimization
Zhanqiang Guo,Jianjiang Feng,Jie Zhou
Main category: cs.CV
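TL;DR: This paper proposes a new cross-domain retinal vessel segmentation framework based on latent vascular similarity and co-optimization of generation and segmentation networks, markedly improving segmentation in clinical scenarios with large modality gaps.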
Details
Motivation: Existing CNN-based methods degrade substantially when there is a domain shift between training and test data, so better cross-domain generalization is needed.
Method: A domain transfer framework that 1) pre-trains generation networks for the source and target domains separately; 2) uses the source-domain conditional diffusion model for deterministic inversion, building domain-agnostic latent prototypes of vascular images from which target-domain images are synthesized; 3) iteratively co-optimizes the segmentation and generation networks through cyclic parameter updates.
Result: State-of-the-art performance on cross-domain retinal vessel segmentation, especially in clinical scenarios with significant modality discrepancies.
Conclusion: By jointly modeling generation and segmentation and exploiting vascular similarity in latent space, the framework mitigates domain shift and offers a new approach to cross-domain medical image analysis.
Abstract: Retinal vessel segmentation serves as a critical prerequisite for automated diagnosis of retinal pathologies. While recent advances in Convolutional Neural Networks (CNNs) have demonstrated promising performance in this task, significant performance degradation occurs when domain shifts exist between training and testing data. To address these limitations, we propose a novel domain transfer framework that leverages latent vascular similarity across domains and iterative co-optimization of generation and segmentation networks. Specifically, we first pre-train generation networks for source and target domains. Subsequently, the pretrained source-domain conditional diffusion model performs deterministic inversion to establish intermediate latent representations of vascular images, creating domain-agnostic prototypes for target synthesis. Finally, we develop an iterative refinement strategy where the segmentation network and the generative model undergo mutual optimization through cyclic parameter updating. This co-evolution process enables simultaneous enhancement of cross-domain image synthesis quality and segmentation accuracy. Experiments demonstrate that our framework achieves state-of-the-art performance in cross-domain retinal vessel segmentation, particularly in challenging clinical scenarios with significant modality discrepancies.
[83] ReFlow: Self-correction Motion Learning for Dynamic Scene Reconstruction
Yanzhe Liang,Ruijie Zhu,Hanzhi Chang,Zhuoyuan Li,Jiahao Lu,Tianzhu Zhang
Main category: cs.CV
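TL;DR: ReFlow proposes a monocular dynamic scene reconstruction framework that needs no external motion guidance, achieving more robust and accurate 4D reconstruction through a self-correction flow matching mechanism and separation-based modeling.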
Details
Motivation: Existing monocular dynamic scene reconstruction methods often yield unstable reconstruction and motion estimation because dynamic regions are incompletely initialized; relying on external dense motion guidance (such as pre-computed optical flow) adds complexity and risks error propagation.
Method: ReFlow comprises a Complete Canonical Space Construction module (enhancing initialization of static and dynamic regions), a Separation-Based Dynamic Scene Modeling module (decoupling static and dynamic components for targeted motion supervision), and, at its core, a self-correction flow matching mechanism (Full Flow Matching plus Camera Flow Matching).
Result: Experiments across diverse scenarios show that ReFlow significantly improves reconstruction quality and robustness and outperforms existing methods.
Conclusion: ReFlow establishes a novel self-correction paradigm, providing a unified and stable solution for monocular 4D reconstruction that needs no external motion priors.
Abstract: We present ReFlow, a unified framework for monocular dynamic scene reconstruction that learns 3D motion in a novel self-correction manner from raw video. Existing methods often suffer from incomplete scene initialization for dynamic regions, leading to unstable reconstruction and motion estimation, and therefore often resort to external dense motion guidance such as pre-computed optical flow to further stabilize and constrain the reconstruction of dynamic components. However, this introduces additional complexity and potential error propagation. To address these issues, ReFlow integrates a Complete Canonical Space Construction module for enhanced initialization of both static and dynamic regions, and a Separation-Based Dynamic Scene Modeling module that decouples static and dynamic components for targeted motion supervision. The core of ReFlow is a novel self-correction flow matching mechanism, consisting of Full Flow Matching to align 3D scene flow with time-varying 2D observations, and Camera Flow Matching to enforce multi-view consistency for static objects. Together, these modules enable robust and accurate dynamic scene reconstruction. Extensive experiments across diverse scenarios demonstrate that ReFlow achieves superior reconstruction quality and robustness, establishing a novel self-correction paradigm for monocular 4D reconstruction.
[84] VideoZeroBench: Probing the Limits of Video MLLMs with Spatio-Temporal Evidence Verification
Jiahao Meng,Tan Yue,Qi Xu,Haochen Wang,Zhongwei Ren,Weisong Liu,Yuhao Wang,Renrui Zhang,Yunhai Tong,Haodong Duan
Main category: cs.CV
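TL;DR: This paper presents VideoZeroBench, a hierarchical benchmark for long-video question answering that emphasizes rigorous verification of spatio-temporal evidence; experiments show that on the task requiring both a correct answer and accurate spatio-temporal localization (Level-5), current models score below 1% accuracy, exposing a serious deficiency of video multimodal models in evidence-driven reasoning.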
Details
Motivation: Current evaluation of video multimodal large language models has two major flaws: inflated scores mask deficiencies in fine-grained visual understanding and reasoning, and only answer correctness is judged, without checking whether the model has located the exact spatio-temporal evidence supporting its answer.
Method: Builds the VideoZeroBench benchmark of 500 manually annotated questions, each paired with temporal intervals and spatial bounding boxes as evidence; designs a five-level evaluation protocol that progressively tightens the spatio-temporal localization requirements in order to disentangle answer generation, temporal grounding, and spatial grounding.
Result: Gemini-3-Pro answers fewer than 17% of questions correctly under standard end-to-end QA (Level-3); under the strictest Level-5 (correct answer plus accurate spatio-temporal localization), every model scores below 1% accuracy and most make no correct prediction at all, revealing a large gap between surface-level answer correctness and genuine evidence-based reasoning.
Conclusion: Grounded video understanding remains a key bottleneck for long-video QA; VideoZeroBench provides a new benchmark and analytical perspective for future research on grounded video reasoning.
Abstract: Recent video multimodal large language models achieve impressive results across various benchmarks. However, current evaluations suffer from two critical limitations: (1) inflated scores can mask deficiencies in fine-grained visual understanding and reasoning, and (2) answer correctness is often measured without verifying whether models identify the precise spatio-temporal evidence supporting their predictions. To address this, we present VideoZeroBench, a hierarchical benchmark designed for challenging long-video question answering that rigorously verifies spatio-temporal evidence. It comprises 500 manually annotated questions across 13 domains, paired with temporal intervals and spatial bounding boxes as evidence. To disentangle answer generation, temporal grounding, and spatial grounding, we introduce a five-level evaluation protocol that progressively tightens evidence requirements. Experiments show that even Gemini-3-Pro correctly answers fewer than 17% of questions under the standard end-to-end QA setting (Level-3). When grounding constraints are imposed, performance drops sharply: no model exceeds 1% accuracy when both correct answering and accurate spatio-temporal localization are required (Level-5), with most failing to achieve any correct grounded predictions. These results expose a significant gap between surface-level answer correctness and genuine evidence-based reasoning, revealing that grounded video understanding remains a bottleneck for long-video QA.
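To make the "correct answer plus accurate spatio-temporal localization" requirement concrete, the following is a minimal sketch of a Level-5-style check. The IoU matching rule, the 0.5 thresholds, and the function names are assumptions made for illustration; the abstract does not specify how the benchmark scores grounding.

```python
def temporal_iou(pred, gt):
    """IoU of two time intervals given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def box_iou(pred, gt):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    inter_h = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = inter_w * inter_h
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_pred + area_gt - inter
    return inter / union if union > 0 else 0.0


def grounded_correct(answer_ok, pred_span, gt_span, pred_box, gt_box,
                     t_thresh=0.5, s_thresh=0.5):
    """Hypothetical strict check: answer, time span and box must all pass.

    The thresholds are illustrative assumptions, not the benchmark's values.
    """
    return (answer_ok
            and temporal_iou(pred_span, gt_span) >= t_thresh
            and box_iou(pred_box, gt_box) >= s_thresh)
```

Under such a conjunctive rule a model that answers correctly but points at the wrong moment or region scores zero, which is one way the reported gap between Level-3 and Level-5 accuracy can arise.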
We further analyze performance across minimal evidence spans, atomic abilities, and inference paradigms, providing insights for future research in grounded video reasoning. The benchmark and code will be made publicly available.
[85] Harmonized Tabular-Image Fusion via Gradient-Aligned Alternating Learning
Longfei Huang,Yang Yang
Main category: cs.CV
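TL;DR: This paper proposes the Gradient-Aligned Alternating Learning (GAAL) paradigm, which alternates unimodal learning with a shared classifier and applies an uncertainty-based cross-modal gradient surgery to mitigate gradient conflicts in multimodal fusion, significantly improving tabular-image fusion performance.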
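As a rough illustration of what "gradient surgery" means in the TL;DR above, the sketch below shows the generic projection step used by PCGrad-style methods: when two gradients conflict (negative dot product), the conflicting component is removed. This is a stand-in technique, not GAAL's uncertainty-based rule, whose details are not given in this entry; the function names are hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length gradient vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_conflict(grad, other):
    """Remove from `grad` the component that opposes `other`.

    If the two gradients do not conflict (dot product >= 0), `grad` is
    returned unchanged. This is the generic PCGrad-style projection; GAAL's
    selective, uncertainty-based variant is not reproduced here.
    """
    d = dot(grad, other)
    if d >= 0:
        return list(grad)
    scale = d / dot(other, other)
    return [g - scale * o for g, o in zip(grad, other)]
```

After projection the adjusted gradient is orthogonal to the other modality's gradient, so a step along it no longer increases the other modality's loss to first order.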
Details
Motivation: Existing tabular-image multimodal fusion methods are limited by gradient conflicts between modalities, which mislead the optimization of the unimodal learners.
Method: Proposes the Gradient-Aligned Alternating Learning (GAAL) paradigm: alternating unimodal learning with a shared classifier decouples the multimodal gradients, and an uncertainty-based cross-modal gradient surgery performs selective gradient alignment.
Result: On several widely used datasets, GAAL significantly outperforms various SoTA baselines both in tabular-image fusion and when tabular data are missing at test time.
Conclusion: GAAL provides effective unimodal assistance and improves overall fusion performance, offering a new approach to cooperative multimodal gradient optimization.
Abstract: Multimodal tabular-image fusion is an emerging task that has received increasing attention in various domains. However, existing methods may be hindered by gradient conflicts between modalities, misleading the optimization of the unimodal learner. In this paper, we propose a novel Gradient-Aligned Alternating Learning (GAAL) paradigm to address this issue by aligning modality gradients. Specifically, GAAL adopts an alternating unimodal learning and shared classifier to decouple the multimodal gradient and facilitate interaction. Furthermore, we design uncertainty-based cross-modal gradient surgery to selectively align cross-modal gradients, thereby steering the shared parameters to benefit all modalities. As a result, GAAL can provide effective unimodal assistance and help boost the overall fusion performance. Empirical experiments on widely used datasets reveal the superiority of our method through comparison with various state-of-the-art (SoTA) tabular-image fusion baselines and test-time tabular missing baselines. The source code is available at https://github.com/njustkmg/ICME26-GAAL.
[86] Satellite-Free Training for Drone-View Geo-Localization
Tao Liu,Yingzhi Zhang,Kan Ren,Xiaoqi Zhao
Main category: cs.CV
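TL;DR: This paper proposes a satellite-free training (SFT) framework for drone-view geo-localization (DVGL): it reconstructs 3D scenes from multi-view drone images, generates geometry-normalized pseudo-orthophotos, and learns a feature aggregation model from drone data alone, achieving high-accuracy cross-view retrieval.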
Details
Motivation: Existing DVGL methods rely on satellite imagery for training (paired supervision or unsupervised alignment), which makes them hard to deploy when satellite data are unavailable or restricted; most methods also use only a single oblique drone image and ignore multi-view information.
Method: Proposes the satellite-free training (SFT) framework: 1) reconstruct dense 3D scenes from multi-view drone images with 3D Gaussian Splatting; 2) generate pseudo-orthophotos via PCA-guided orthographic projection (no camera parameters required); 3) apply lightweight geometry-guided inpainting to improve texture completeness; 4) extract DINOv3 patch features from the generated pseudo-orthophotos, train a Fisher vector aggregation model on drone data only, and reuse it to encode satellite tiles for cross-view retrieval.
Result: On University-1652 and SUES-200, SFT significantly outperforms other satellite-free baselines and substantially narrows the gap to methods trained with satellite imagery.
Conclusion: Drone data alone can support high-quality cross-view geo-localization; SFT offers a new paradigm for practical, satellite-independent DVGL in GPS-denied environments.
Abstract: Drone-view geo-localization (DVGL) aims to determine the location of drones in GPS-denied environments by retrieving the corresponding geotagged satellite tile from a reference gallery given UAV observations of a location. In many existing formulations, these observations are represented by a single oblique UAV image. In contrast, our satellite-free setting is designed for multi-view UAV sequences, which are used to construct a geometry-normalized UAV-side location representation before cross-view retrieval. Existing approaches rely on satellite imagery during training, either through paired supervision or unsupervised alignment, which limits practical deployment when satellite data are unavailable or restricted. In this paper, we propose a satellite-free training (SFT) framework that converts drone imagery into cross-view compatible representations through three main stages: drone-side 3D scene reconstruction, geometry-based pseudo-orthophoto generation, and satellite-free feature aggregation for retrieval. Specifically, we first reconstruct dense 3D scenes from multi-view drone images using 3D Gaussian splatting and project the reconstructed geometry into pseudo-orthophotos via PCA-guided orthographic projection. This rendering stage operates directly on reconstructed scene geometry without requiring camera parameters at rendering time. Next, we refine these orthophotos with lightweight geometry-guided inpainting to obtain texture-complete drone-side views.
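One plausible reading of "PCA-guided orthographic projection" is sketched below: the two principal axes with the largest variance span the ground plane, and the remaining axis is treated as the viewing direction. That interpretation, the power-iteration eigen-solver, and all function names are assumptions for illustration; the paper's actual procedure may differ. The sketch also assumes the points are not collinear.

```python
import math


def _normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def _matvec(m, v):
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]


def _dominant_axis(m, iters=200):
    """Power iteration: unit eigenvector of m with the largest eigenvalue."""
    v = _normalize([1.0, 0.7, 0.3])  # arbitrary generic start vector
    for _ in range(iters):
        v = _normalize(_matvec(m, v))
    return v


def pca_orthographic(points):
    """Project 3D points onto the plane of their two largest principal axes."""
    n = len(points)
    centroid = [sum(p[i] for p in points) / n for i in range(3)]
    centered = [[p[i] - centroid[i] for i in range(3)] for p in points]
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(3)]
           for i in range(3)]
    a1 = _dominant_axis(cov)
    lam1 = sum(a * b for a, b in zip(a1, _matvec(cov, a1)))
    # Deflate the first component so power iteration finds the second axis.
    deflated = [[cov[i][j] - lam1 * a1[i] * a1[j] for j in range(3)]
                for i in range(3)]
    a2 = _dominant_axis(deflated)
    # Re-orthogonalize against a1 to guard against rounding drift.
    overlap = sum(a * b for a, b in zip(a1, a2))
    a2 = _normalize([b - overlap * a for a, b in zip(a1, a2)])
    return [(sum(c[i] * a1[i] for i in range(3)),
             sum(c[i] * a2[i] for i in range(3))) for c in centered]
```

For points that already lie on a tilted plane, this projection preserves all pairwise distances, which is the sense in which the resulting top-down view is geometry-normalized without using camera parameters.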
Finally, we extract DINOv3 patch features from the generated orthophotos, learn a Fisher vector aggregation model solely from drone data, and reuse it at test time to encode satellite tiles for cross-view retrieval. Experimental results on University-1652 and SUES-200 show that our SFT framework substantially outperforms satellite-free generalization baselines and narrows the gap to methods trained with satellite imagery.
[87] SHOE: Semantic HOI Open-Vocabulary Evaluation Metric
Maja Noack,Qinqian Lei,Taipeng Tian,Bihan Dong,Robby T. Tan,Yixin Chen,John Young,Saijun Zhang,Bo Wang
Main category: cs.CV
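TL;DR: This paper proposes SHOE, a semantics-aware evaluation framework for open-vocabulary HOI detection that uses LLMs to compute the semantic similarity of verbs and objects in place of traditional exact-match metrics, better matching how humans understand interactions.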
Details
Motivation: Existing HOI evaluation metrics (such as mAP) rely on exact string matching only and cannot credit predictions that are semantically close but worded differently (e.g., "lean on couch" vs. "sit on couch"), which makes them ill-suited to open-vocabulary settings.
Method: SHOE decomposes each HOI prediction into its verb and object, estimates their semantic similarity to the ground truth by averaging over multiple large language models (LLMs), and fuses the two into a final similarity score; it supports evaluating all kinds of HOI methods on standard benchmarks such as HICO-DET.
Result: SHOE scores agree with human judgments 85.73% of the time, clearly ahead of existing LLM-based and embedding-based baseline metrics, supporting its semantic soundness and practicality.
Conclusion: SHOE offers a more natural, scalable, and semantically grounded evaluation paradigm for open-vocabulary HOI detection, pushing HOI systems toward real-world generalization and multimodal reasoning; the code and metric will be open-sourced.
Abstract: Open-vocabulary human-object interaction (HOI) detection is a step towards building scalable systems that generalize to unseen interactions in real-world scenarios and support grounded multimodal systems that reason about human-object relationships. However, standard evaluation metrics, such as mean Average Precision (mAP), treat HOI classes as discrete categorical labels and fail to credit semantically valid but lexically different predictions (e.g., "lean on couch" vs. "sit on couch"), limiting their applicability for evaluating open-vocabulary predictions that go beyond any predefined set of HOI labels. We introduce SHOE (Semantic HOI Open-Vocabulary Evaluation), a new evaluation framework that incorporates semantic similarity between predicted and ground-truth HOI labels. SHOE decomposes each HOI prediction into its verb and object components, estimates their semantic similarity using the average of multiple large language models (LLMs), and combines them into a similarity score to evaluate alignment beyond exact string match. This enables a flexible and scalable evaluation of both existing HOI detection methods and open-ended generative models using standard benchmarks such as HICO-DET. Experimental results show that SHOE scores align more closely with human judgments than existing metrics, including LLM-based and embedding-based baselines, achieving an agreement of 85.73% with the average human ratings. Our work underscores the need for semantically grounded HOI evaluation that better mirrors human understanding of interactions.
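The decompose, average, and combine steps described in this entry can be sketched as follows. The abstract does not say how the verb and object similarities are combined, so the product used here is purely an assumption, as are the function names and the idea that each model returns a similarity in [0, 1].

```python
def mean(values):
    return sum(values) / len(values)


def shoe_like_score(verb_sims, object_sims):
    """Combine per-model verb and object similarities into one score.

    verb_sims / object_sims: one similarity in [0, 1] per LLM judge for the
    predicted vs. ground-truth verb (resp. object). Each component is
    averaged over the models, then the two averages are multiplied.
    The product rule is a hypothetical choice, not SHOE's stated formula.
    """
    return mean(verb_sims) * mean(object_sims)
```

With a multiplicative rule, a prediction such as "lean on couch" against "sit on couch" keeps full object credit and partial verb credit, whereas exact string matching would score it as zero.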
We will release our evaluation metric to the public to facilitate future research.
[88] Mitigating the ID-OOD Tradeoff in Open-Set Test-Time Adaptation
Wenjie Zhao,Jia Li,Xin Dong,Yapeng Tian,Yu Xiang,Yunhui Guo
Main category: cs.CV
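TL;DR: This paper proposes ROSETTA, which introduces an angular loss and a feature-norm loss to resolve the inherent conflict between entropy minimization and entropy maximization in open-set test-time adaptation (OSTTA), preserving classification performance on ID samples while improving detection of csOOD samples.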
Details
Motivation: Existing OSTTA methods use entropy minimization for csID classification and entropy maximization for csOOD detection, but the two objectives conflict intrinsically and force a performance trade-off; a new strategy is needed to optimize both together.
Method: Proposes the ROSETTA framework: an angular loss regulates feature norm magnitudes, and a feature-norm loss suppresses the logits of csOOD samples, avoiding reliance on entropy maximization.
Result: Achieves strong OOD detection with high ID classification accuracy on CIFAR-10-C, CIFAR-100-C, Tiny-ImageNet-C, and ImageNet-C; Cityscapes semantic segmentation and the HAC dataset confirm generalization across tasks.
Conclusion: ROSETTA effectively alleviates the conflict between entropy objectives, balancing robust csID classification and accurate csOOD detection across benchmarks and real-world scenarios, and provides a more robust solution for OSTTA.
Abstract: Open-set test-time adaptation (OSTTA) addresses the challenge of adapting models to new environments where out-of-distribution (OOD) samples coexist with in-distribution (ID) samples affected by distribution shifts. In such settings, covariate shift (for example, changes in weather conditions such as snow) can alter ID samples, reducing model reliability. Consequently, models must not only correctly classify covariate-shifted ID (csID) samples but also effectively reject covariate-shifted OOD (csOOD) samples. Entropy minimization is a common strategy in test-time adaptation to maintain ID performance under distribution shifts, while entropy maximization is widely applied to enhance OOD detection. Several studies have sought to combine these objectives to tackle the challenges of OSTTA. However, the intrinsic conflict between entropy minimization and maximization inevitably leads to a trade-off between csID classification and csOOD detection. In this paper, we first analyze the limitations of entropy maximization in OSTTA and then introduce an angular loss to regulate feature norm magnitudes, along with a feature-norm loss to suppress csOOD logits, thereby improving OOD detection. These objectives form ROSETTA, a $\underline{r}$obust $\underline{o}$pen-$\underline{se}$t $\underline{t}$est-$\underline{t}$ime $\underline{a}$daptation. Our method achieves strong OOD detection while maintaining high ID classification performance on CIFAR-10-C, CIFAR-100-C, Tiny-ImageNet-C and ImageNet-C.
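For readers less familiar with the entropy objectives discussed here, the short sketch below computes the Shannon entropy of a softmax prediction. It only illustrates the quantity that entropy minimization lowers and entropy maximization raises; it does not implement ROSETTA's angular or feature-norm losses, whose formulas are not given in the abstract.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)
```

A confident prediction such as logits `[10, 0, 0]` has entropy near zero, a uniform one has entropy `ln(K)` for `K` classes, and shrinking the logit magnitudes (for instance `[1, 0, 0]`) flattens the softmax and raises the entropy, which gives some intuition for why suppressing csOOD logits can help separate OOD samples without an explicit entropy-maximization term.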
Furthermore, experiments on Cityscapes validate the method's effectiveness in real-world semantic segmentation, and results on the HAC dataset demonstrate its applicability across different open-set TTA setups.
[89] Riemannian and Symplectic Geometry for Hierarchical Text-Driven Place Recognition
Tianyi Shang,Zhenyu Li
Main category: cs.CV
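TL;DR: This paper proposes the SympLoc framework, which improves text-to-point-cloud localization through a coarse-to-fine strategy with multi-level alignment (instance, relation, and global levels), raising Top-1 recall@10m on the KITTI360Pose dataset by 19%.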
Details
Motivation: Existing methods rely on global descriptors for similarity retrieval, which causes severe information loss and fails to capture discriminative scene structure.
Method: Proposes the SympLoc framework, whose coarse stage has three complementary alignment levels: 1) instance-level alignment uses Riemannian self-attention in hyperbolic space to establish direct correspondence between point-cloud objects and textual hints; 2) relation-level alignment uses the Information-Symplectic Relation Encoder (ISRE), which combines the Fisher-Rao metric with Hamiltonian dynamics to model spatial relationships between objects; 3) global-level alignment uses the Spectral Manifold Transform (SMT) to extract graph-spectral structural invariants and produce discriminative global descriptors.
Result: On the KITTI360Pose dataset, SympLoc improves Top-1 recall@10m by 19% over existing state-of-the-art methods.
Conclusion: Through multi-level, geometrically consistent alignment, SympLoc significantly improves the robustness and accuracy of text-to-point-cloud localization and offers a new paradigm for natural-language-driven spatial understanding.
Abstract: Text-to-point-cloud localization enables robots to understand spatial positions through natural language descriptions, which is crucial for human-robot collaboration in applications such as autonomous driving and last-mile delivery. However, existing methods employ pooled global descriptors for similarity retrieval, which suffer from severe information loss and fail to capture discriminative scene structures. To address these issues, we propose SympLoc, a novel coarse-to-fine localization framework with multi-level alignment in the coarse stage. Different from previous methods that rely solely on global descriptors, our coarse stage consists of three complementary alignment levels: 1) Instance-level alignment establishes direct correspondence between individual object instances in point clouds and textual hints through Riemannian self-attention in hyperbolic space; 2) Relation-level alignment explicitly models pairwise spatial relationships between objects using the Information-Symplectic Relation Encoder (ISRE), which reformulates relation features through Fisher-Rao metric and Hamiltonian dynamics for uncertainty-aware geometrically consistent propagation; 3) Global-level alignment synthesizes discriminative global descriptors via the Spectral Manifold Transform (SMT) that extracts structural invariants through graph spectral analysis. This hierarchical alignment strategy progressively captures fine-grained to coarse-grained scene semantics, enabling robust cross-modal retrieval.
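Attention "in hyperbolic space" generally replaces Euclidean similarity with a hyperbolic distance. As a small reference point, the sketch below computes the standard geodesic distance in the Poincaré ball model. Whether SympLoc uses the Poincaré ball, the Lorentz model, or another formulation is not stated in this entry, so the choice of model here is an assumption.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball.

    Uses d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2))).
    Both points must have Euclidean norm strictly below 1.
    """
    def sq_norm(x):
        return sum(a * a for a in x)

    diff = sq_norm([a - b for a, b in zip(u, v)])
    denom = (1.0 - sq_norm(u)) * (1.0 - sq_norm(v))
    return math.acosh(1.0 + 2.0 * diff / denom)
```

Distances grow rapidly near the boundary of the ball, which is why hyperbolic embeddings are often used for hierarchical structure such as object instances nested within a scene.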
Extensive experiments on the KITTI360Pose dataset demonstrate that SympLoc achieves a 19% improvement in Top-1 recall@10m compared to existing state-of-the-art approaches.
[90] Towards Minimal Focal Stack in Shape from Focus
Khurram Ashfaq,Muhammad Tariq Mahmood
Main category: cs.CV
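TL;DR: This paper proposes a physics-based augmentation for two-image focal stacks, combining an all-in-focus (AiF) image with Energy-of-Difference (EOD) maps, and designs a multi-scale ConvGRU depth network so that Shape from Focus (SFF) methods can estimate depth accurately from only two images.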
Details
Motivation: Existing SFF methods depend on large, densely sampled focal stacks, which limits their practicality; the number of images needs to be reduced while preserving accuracy.
Method: Proposes a physics-driven focal stack augmentation that generates an all-in-focus (AiF) image and Energy-of-Difference (EOD) maps as auxiliary cues, and builds an end-to-end deep network that iteratively refines the depth map with ConvGRUs at multiple scales.
Result: On synthetic and real-world datasets, the augmentation significantly improves several SFF models, which reach accuracy comparable to traditional large-stack methods with only two images while maintaining state-of-the-art performance.
Conclusion: Augmenting a two-image focal stack is feasible and efficient, and offers a new paradigm for lightweight, practical SFF.
Abstract: Shape from Focus (SFF) is a depth reconstruction technique that estimates scene structure from focus variations observed across a focal stack, that is, a sequence of images captured at different focus settings. A key limitation of SFF methods is their reliance on densely sampled, large focal stacks, which limits their practical applicability. In this study, we propose a focal stack augmentation that enables SFF methods to estimate depth using a reduced stack of just two images, without sacrificing precision. We introduce a simple yet effective physics-based focal stack augmentation that enriches the stack with two auxiliary cues: an all-in-focus (AiF) image estimated from two input images, and Energy-of-Difference (EOD) maps, computed as the energy of differences between the AiF and input images. Furthermore, we propose a deep network that computes a deep focus volume from the augmented focal stacks and iteratively refines depth using convolutional Gated Recurrent Units (ConvGRUs) at multiple scales. Extensive experiments on both synthetic and real-world datasets demonstrate that the proposed augmentation benefits existing state-of-the-art SFF models, enabling them to achieve comparable accuracy. The results also show that our approach maintains state-of-the-art performance with a minimal stack size.
[91] F3DGS: Federated 3D Gaussian Splatting for Decentralized Multi-Agent World Modeling
Morui Zhu,Mohammad Dehghani Tezerjani,Mátyás Szántó,Márton Vaitkus,Song Fu,Qing Yang
Main category: cs.CV
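TL;DR: F3DGS is a federated 3D Gaussian Splatting framework for distributed multi-agent 3D reconstruction; using a shared geometric scaffold for initialization and visibility-aware attribute aggregation, it achieves high-quality collaborative reconstruction without transmitting raw data.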
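The visibility-aware aggregation mentioned in the TL;DR above can be pictured as a per-Gaussian weighted average, with each client weighted by how often it observed that Gaussian. The sketch below assumes scalar attributes, raw observation counts as weights, and a hypothetical `aggregate` helper; the actual server-side rule in F3DGS may differ.

```python
def aggregate(client_values, client_counts):
    """Visibility-weighted average of one scalar attribute per Gaussian.

    client_values[c][g]: client c's updated attribute for Gaussian g.
    client_counts[c][g]: how many times client c observed Gaussian g.
    Returns one aggregated value per Gaussian, or None where no client
    observed it (the caller would keep the previous global value).
    """
    num_gaussians = len(client_values[0])
    merged = []
    for g in range(num_gaussians):
        total = sum(counts[g] for counts in client_counts)
        if total == 0:
            merged.append(None)
            continue
        weighted = sum(values[g] * counts[g]
                       for values, counts in zip(client_values, client_counts))
        merged.append(weighted / total)
    return merged
```

Compared with plain federated averaging, this keeps a client that never saw a given Gaussian from diluting the estimates of clients that did, which is one way to handle partial observability across agents.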
Details
Motivation: Existing 3D Gaussian Splatting (3DGS) methods assume centralized data and are hard to apply to distributed robotic systems; multi-agent settings bring high communication overhead, geometric inconsistency, and partial observability.
Method: First builds a shared geometric scaffold by registering LiDAR point clouds from multiple clients to initialize a global 3DGS model; during federated optimization, Gaussian positions are fixed to preserve geometric alignment and only covariance, opacity, and spherical harmonic coefficients are updated locally; the server fuses updates with visibility-aware aggregation, weighting each client by how often it observed each Gaussian.
Result: Validated on a self-collected multi-sequence indoor dataset with synchronized LiDAR, RGB, and IMU measurements; F3DGS reaches reconstruction quality comparable to centralized training while supporting decentralized joint optimization across agents.
Conclusion: F3DGS effectively addresses geometric consistency, communication efficiency, and partial observability in distributed multi-agent 3D reconstruction, offering a viable federated reconstruction paradigm for privacy-sensitive and resource-constrained scenarios.
Abstract: We present F3DGS, a federated 3D Gaussian Splatting framework for decentralized multi-agent 3D reconstruction. Existing 3DGS pipelines assume centralized access to all observations, which limits their applicability in distributed robotic settings where agents operate independently, and centralized data aggregation may be restricted. Directly extending centralized training to multi-agent systems introduces communication overhead and geometric inconsistency. F3DGS first constructs a shared geometric scaffold by registering locally merged LiDAR point clouds from multiple clients to initialize a global 3DGS model. During federated optimization, Gaussian positions are fixed to preserve geometric alignment, while each client updates only appearance-related attributes, including covariance, opacity, and spherical harmonic coefficients. The server aggregates these updates using visibility-aware aggregation, weighting each client's contribution by how frequently it observed each Gaussian, resolving the partial-observability challenge inherent to multi-agent exploration. To evaluate decentralized reconstruction, we collect a multi-sequence indoor dataset with synchronized LiDAR, RGB, and IMU measurements. Experiments show that F3DGS achieves reconstruction quality comparable to centralized training while enabling distributed optimization across agents. The dataset, development kit, and source code will be publicly released.
[92] NEMESIS: Noise-suppressed Efficient MAE with Enhanced Superpatch Integration Strategy
Kyeonghun Kim,Hyeonseok Jung,Youngung Han,Hyunsu Go,Eunseob Choi,Seongbin Park,Junsu Lim,Jiwon Yang,Sumin Lee,Insung Hwang,Ken Ying-Kai Liao,Nam-Joon Kim
Main category: cs.CV
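TL;DR: This paper proposes NEMESIS, a memory-efficient, anatomy-aware masked autoencoder framework for 3D CT imaging that achieves high accuracy and strong label efficiency on multi-organ classification through local superpatch processing, noise-enhanced reconstruction, dual-masking Transformer blocks, and cross-scale token design.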
Details
Motivation: Annotating 3D CT images is costly, which calls for self-supervised learning; however, full-volume Transformers have a large memory footprint, and conventional masking strategies struggle to model the anisotropic spatial structure of CT data.
Method: Proposes the NEMESIS framework: 1) operate on local 128x128x128 superpatches to reduce memory; 2) use noise-enhanced reconstruction as the pretext task; 3) Masked Anatomical Transformer Blocks (MATB) perform parallel plane-wise and axis-wise masking; 4) NEMESIS Tokens (NT) aggregate cross-scale context.
Result: On the BTCV benchmark, a frozen backbone with a linear classifier reaches a mean AUROC of 0.9633, above SuPreM (0.9493) and VoCo (0.9387); with only 10% of labels the AUROC is still 0.9075; the cost of a single forward pass drops to 31.0 GFLOPs (versus 985.8 GFLOPs for the full-volume baseline).
Conclusion: NEMESIS provides a scalable, robust, and label-efficient self-supervised learning paradigm for 3D medical imaging.
Abstract: Volumetric CT imaging is essential for clinical diagnosis, yet annotating 3D volumes is expensive and time-consuming, motivating self-supervised learning (SSL) from unlabeled data. However, applying SSL to 3D CT remains challenging due to the high memory cost of full-volume transformers and the anisotropic spatial structure of CT data, which is not well captured by conventional masking strategies. We propose NEMESIS, a masked autoencoder (MAE) framework that operates on local 128x128x128 superpatches, enabling memory-efficient training while preserving anatomical detail. NEMESIS introduces three key components: (i) noise-enhanced reconstruction as a pretext task, (ii) Masked Anatomical Transformer Blocks (MATB) that perform dual-masking through parallel plane-wise and axis-wise token removal, and (iii) NEMESIS Tokens (NT) for cross-scale context aggregation. On the BTCV multi-organ classification benchmark, NEMESIS with a frozen backbone and a linear classifier achieves a mean AUROC of 0.9633, surpassing fully fine-tuned SuPreM (0.9493) and VoCo (0.9387). Under a low-label regime with only 10% of available annotations, it retains an AUROC of 0.9075, demonstrating strong label efficiency. Furthermore, the superpatch-based design reduces computational cost to 31.0 GFLOPs per forward pass, compared to 985.8 GFLOPs for the full-volume baseline, providing a scalable and robust foundation for 3D medical imaging.
[93] Tex3D: Objects as Attack Surfaces via Adversarial 3D Textures for Vision-Language-Action Models
Jiawei Chen,Simin Huang,Jiawei Du,Shuaihang Chen,Yu Tian,Mingjie Wei,Chao Yu,Zhaoxia Yin
Main category: cs.CV
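TL;DR: This paper proposes Tex3D, the first framework for end-to-end optimization of 3D adversarial textures against vision-language-action (VLA) models; it uses Foreground-Background Decoupling (FBD) and Trajectory-Aware Adversarial Optimization (TAAO) to overcome non-differentiability in 3D simulation, and experiments show task failure rates of up to 96.7%, exposing serious robustness flaws of VLA systems in the physical world.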
Details
Motivation: The robustness of VLA models to physically realizable adversarial attacks (such as 3D texture attacks) is underexplored; 2D visual or language perturbations lack physical realism, whereas adversarial 3D textures are closer to real deployment and more damaging, but standard 3D simulators are non-differentiable, which makes end-to-end optimization difficult.
Method: Proposes Foreground-Background Decoupling (FBD), which enables differentiable texture optimization through dual-renderer alignment; designs Trajectory-Aware Adversarial Optimization (TAAO), which focuses on behaviorally critical frames and stabilizes optimization with a vertex-based parameterization; builds the Tex3D framework to optimize 3D adversarial textures end-to-end within the VLA simulation environment.
Result: In simulation and real-robot experiments, Tex3D significantly lowers success rates on multiple manipulation tasks, with task failure rates of up to 96.7%, confirming that 3D adversarial textures are both highly effective against VLA systems and physically feasible.
Conclusion: VLA systems are highly vulnerable to physically grounded 3D adversarial textures and need robustness-aware training; Tex3D offers a new paradigm and practical tool for evaluating and improving the physical robustness of embodied models.
Abstract: Vision-language-action (VLA) models have shown strong performance in robotic manipulation, yet their robustness to physically realizable adversarial attacks remains underexplored. Existing studies reveal vulnerabilities through language perturbations and 2D visual attacks, but these attack surfaces are either less representative of real deployment or limited in physical realism. In contrast, adversarial 3D textures pose a more physically plausible and damaging threat, as they are naturally attached to manipulated objects and are easier to deploy in physical environments. Bringing adversarial 3D textures to VLA systems is nevertheless nontrivial. A central obstacle is that standard 3D simulators do not provide a differentiable optimization path from the VLA objective function back to object appearance, making it difficult to optimize in an end-to-end manner. To address this, we introduce Foreground-Background Decoupling (FBD), which enables differentiable texture optimization through dual-renderer alignment while preserving the original simulation environment. To further ensure that the attack remains effective across long-horizon and diverse viewpoints in the physical world, we propose Trajectory-Aware Adversarial Optimization (TAAO), which prioritizes behaviorally critical frames and stabilizes optimization with a vertex-based parameterization. Built on these designs, we present Tex3D, the first framework for end-to-end optimization of 3D adversarial textures directly within the VLA simulation environment.
Experiments in both simulation and real-robot settings show that Tex3D significantly degrades VLA performance across multiple manipulation tasks, achieving task failure rates of up to 96.7%. Our empirical results expose critical vulnerabilities of VLA systems to physically grounded 3D adversarial attacks and highlight the need for robustness-aware training.
[94] Automatic Image-Level Morphological Trait Annotation for Organismal Images
Vardaan Pahuja,Samuel Stevens,Alyson East,Sydne Record,Yu Su
Main category: cs.CV
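TL;DR: This paper proposes an automated morphological trait annotation method based on sparse autoencoders and foundation-model features, and builds the Bioscan-Traits dataset of 80K annotations, substantially improving the scalability and biological plausibility of morphological trait extraction for large-scale ecological research.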
Details
Motivation: Morphological trait extraction currently depends on experts and is slow, and high-quality datasets pairing images with traits are lacking, which holds back large-scale ecological studies.
Method: Applies sparse autoencoders to foundation-model features to obtain monosemantic, spatially grounded neurons; combines salient-region localization with vision-language prompting to generate interpretable trait descriptions; builds the Bioscan-Traits dataset and runs a systematic ablation study.
Result: Builds Bioscan-Traits, with 80K trait annotations covering 19K insect images; human evaluation confirms that the generated descriptions are biologically plausible; ablations validate the design of each module.
Conclusion: The modular automatic annotation pipeline can replace costly manual annotation, provides a scalable route to large-scale morphological analysis, and bridges the gap between ecological relevance and machine-learning practicality.
Abstract: Morphological traits are physical characteristics of biological organisms that provide vital clues on how organisms interact with their environment. Yet extracting these traits remains a slow, expert-driven process, limiting their use in large-scale ecological studies. A major bottleneck is the absence of high-quality datasets linking biological images to trait-level annotations. In this work, we demonstrate that sparse autoencoders trained on foundation-model features yield monosemantic, spatially grounded neurons that consistently activate on meaningful morphological parts. Leveraging this property, we introduce a trait annotation pipeline that localizes salient regions and uses vision-language prompting to generate interpretable trait descriptions. Using this approach, we construct Bioscan-Traits, a dataset of 80K trait annotations spanning 19K insect images from BIOSCAN-5M. Human evaluation confirms the biological plausibility of the generated morphological descriptions. We assess design sensitivity through a comprehensive ablation study, systematically varying key design choices and measuring their impact on the quality of the resulting trait descriptions. By annotating traits with a modular pipeline rather than prohibitively expensive manual efforts, we offer a scalable way to inject biologically meaningful supervision into foundation models, enable large-scale morphological analyses, and bridge the gap between ecological relevance and machine-learning practicality.
[95] LivingWorld: Interactive 4D World Generation with Environmental Dynamics
Hyeongju Mun,In-Hwan Jin,Sohyeong Kim,Kyeongbo Kong
Main category: cs.CV
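TL;DR: LivingWorld is a framework that generates interactive 4D worlds with environmental dynamics (such as clouds, water, and smoke) from a single image, using a geometry-aware alignment module and a hash-based motion field to model dynamics with global consistency and low latency.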
Details
Motivation: Existing 3D scene generation methods focus mainly on reconstructing static geometry; they do not model scene-scale environmental dynamics (such as clouds, water, and smoke) and struggle to keep motion consistent and interaction real-time as the scene expands.
Method: Progressively constructs a globally coherent motion field; introduces a geometry-aware alignment module to resolve directional and scale ambiguities across views; uses a compact hash-based motion field for efficient querying and stable propagation of dynamics, with support for bidirectional motion propagation at render time.
Result: On a single RTX 5090 GPU, each scene expansion takes 9 seconds and motion alignment and update take 3 seconds, enabling interactive, globally coherent 4D world generation; long, temporally coherent 4D sequences are produced without expensive video-based refinement.
Conclusion: LivingWorld achieves interactive generation of 4D environments with globally consistent dynamics from a single image, offering a new paradigm for immersive content creation.
Abstract: We introduce LivingWorld, an interactive framework for generating 4D worlds with environmental dynamics from a single image. While recent advances in 3D scene generation enable large-scale environment creation, most approaches focus primarily on reconstructing static geometry, leaving scene-scale environmental dynamics such as clouds, water, or smoke largely unexplored. Modeling such dynamics is challenging because motion must remain coherent across an expanding scene while supporting low-latency user feedback. LivingWorld addresses this challenge by progressively constructing a globally coherent motion field as the scene expands. To maintain global consistency during expansion, we introduce a geometry-aware alignment module that resolves directional and scale ambiguities across views. We further represent motion using a compact hash-based motion field, enabling efficient querying and stable propagation of dynamics throughout the scene. This representation also supports bidirectional motion propagation during rendering, producing long and temporally coherent 4D sequences without relying on expensive video-based refinement. On a single RTX 5090 GPU, generating each new scene expansion step requires 9 seconds, followed by 3 seconds for motion alignment and motion field updates, enabling interactive 4D world generation with globally coherent environmental dynamics. Video demonstrations are available at cvsp-lab.github.io/LivingWorld.
[96] TOL: Textual Localization with OpenStreetMap
Youqi Liao,Shuhao Kang,Jingyu Xu,Olaf Wysocki,Yan Xia,Jianping Li,Zhen Dong,Bisheng Yang,Xieyuanli Chen
Main category: cs.CV
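TL;DR: This paper introduces a new task of localizing natural-language descriptions on OpenStreetMap (OSM), called T2O, builds TOL, the first large-scale multi-city benchmark for it, and designs the coarse-to-fine localization framework TOLoc, which clearly outperforms existing methods on several localization accuracy metrics.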
Details
Motivation: Existing localization methods mostly rely on point clouds or high-resolution imagery, whereas OSM is semantically rich, lightweight, and free; text-to-OSM (T2O) localization has not been explored, and purely text-driven global localization without geometric observations or a GNSS-based initial position needs support.
Method: Proposes the TOLoc framework: the coarse stage extracts direction-aware global descriptors from text and OSM to retrieve candidate locations; the fine stage fuses the textual descriptor with local map features through a dedicated alignment module and regresses the 2-DoF pose.
Result: TOLoc improves on the best baseline by 6.53%, 9.93%, and 8.31% at the 5m, 10m, and 25m thresholds, respectively, and generalizes well to unseen environments.
Conclusion: T2O localization is a feasible and effective paradigm; the TOL benchmark and the TOLoc framework provide a solid foundation and a new direction for text-driven geo-localization.
Abstract: Natural language provides an intuitive way to express spatial intent in geospatial applications. While existing localization methods often rely on dense point cloud maps or high-resolution imagery, OpenStreetMap (OSM) offers a compact and freely available map representation that encodes rich semantic and structural information, making it well suited for large-scale localization. However, text-to-OSM (T2O) localization remains largely unexplored. In this paper, we formulate the T2O global localization task, which aims to estimate accurate 2 degree-of-freedom (DoF) positions in urban environments from textual scene descriptions without relying on geometric observations or GNSS-based initial location. To support the proposed task, we introduce TOL, a large-scale benchmark spanning multiple continents and diverse urban environments. TOL contains approximately 121K textual queries paired with OSM map tiles and covers about 316 km of road trajectories across Boston, Karlsruhe, and Singapore. We further propose TOLoc, a coarse-to-fine localization framework that explicitly models the semantics of surrounding objects and their directional information. In the coarse stage, direction-aware features are extracted from both textual descriptions and OSM tiles to construct global descriptors, which are used to retrieve candidate locations for the query. In the fine stage, the query text and top-1 retrieved tile are jointly processed, where a dedicated alignment module fuses textual descriptor and local map features to regress the 2-DoF pose.
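The coarse retrieval stage described above amounts to ranking map tiles by descriptor similarity to the text query. A minimal sketch follows, assuming cosine similarity over plain vectors; the similarity measure, descriptor format, and function names are illustrative assumptions rather than TOLoc's actual design.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero descriptor vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_desc, tile_descs, k=1):
    """Return indices of the k tiles most similar to the text query."""
    ranked = sorted(range(len(tile_descs)),
                    key=lambda i: cosine(query_desc, tile_descs[i]),
                    reverse=True)
    return ranked[:k]
```

In the pipeline described here, the top-1 tile returned by such a ranking would then be passed with the query text to the fine stage for 2-DoF pose regression.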
Experimental results demonstrate that TOLoc achieves strong localization performance, outperforming the best existing method by 6.53%, 9.93%, and 8.31% at 5m, 10m, and 25m thresholds, respectively, and shows strong generalization to unseen environments. Dataset, code and models will be publicly available at: https://github.com/WHU-USI3DV/TOL.
[97] MonoSAOD: Monocular 3D Object Detection with Sparsely Annotated Label
Junyoung Jung,Seokwon Kim,Jun Uk Kim
Main category: cs.CV
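TL;DR: This paper proposes a new framework for sparsely annotated monocular 3D object detection, built on two core modules, Road-Aware Patch Augmentation (RAPA) and Prototype-Based Filtering (PBF), to improve detection when annotations are scarce.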
Details
Motivation: Monocular 3D object detection performs well on densely annotated datasets, but 3D annotation is expensive in practice and often only some objects are labeled; existing methods degrade markedly in this sparsely annotated setting.
Method: Proposes Road-Aware Patch Augmentation (RAPA), which pastes segmented patches of annotated objects onto road regions while preserving geometric consistency, and Prototype-Based Filtering (PBF), which selects high-quality pseudo-labels by combining 2D RoI feature prototypes with depth uncertainty; the overall training strategy joins geometry-preserving augmentation with prototype-guided pseudo-labeling.
Result: Under sparse annotation, the method significantly improves monocular 3D detection performance, as confirmed by extensive experiments.
Conclusion: RAPA and PBF work together to offset the weak supervision caused by sparse annotation, offering a viable route to low-cost deployment of monocular 3D detection.
Abstract: Monocular 3D object detection has achieved impressive performance on densely annotated datasets. However, it struggles when only a fraction of objects are labeled due to the high cost of 3D annotation. This sparsely annotated setting is common in real-world scenarios where annotating every object is impractical. To address this, we propose a novel framework for sparsely annotated monocular 3D object detection with two key modules. First, we propose Road-Aware Patch Augmentation (RAPA), which leverages sparse annotations by augmenting segmented object patches onto road regions while preserving 3D geometric consistency. Second, we propose Prototype-Based Filtering (PBF), which generates high-quality pseudo-labels by filtering predictions through prototype similarity and depth uncertainty. It maintains global 2D RoI feature prototypes and selects pseudo-labels that are both feature-consistent with learned prototypes and have reliable depth estimates. Our training strategy combines geometry-preserving augmentation with prototype-guided pseudo-labeling to achieve robust detection under sparse supervision. Extensive experiments demonstrate the effectiveness of the proposed method. The source code is available at https://github.com/VisualAIKHU/MonoSAOD .
[98] Moiré Video Authentication: A Physical Signature Against AI Video Generation
Yuan Qing,Kunyu Zheng,Lingxiao Li,Boqing Gong,Chang Xiao
Main category: cs.CV
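TL;DR: This paper proposes a physical authentication signature based on the Moiré effect: real cameras naturally produce interference fringes that generative AI models cannot reproduce accurately, which makes it possible to tell AI-generated video from real footage.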
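The verification step described in this entry (extract two signals, then test their correlation) can be sketched with a plain Pearson correlation. This is a toy sketch under stated assumptions: it presumes the fringe phase has already been extracted and unwrapped, and the function names and the `threshold` value are hypothetical rather than taken from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def looks_linearly_coupled(fringe_phase, displacement, threshold=0.9):
    """Hypothetical check: strong |correlation| suggests the linear coupling holds."""
    return abs(pearson(fringe_phase, displacement)) >= threshold


# Synthetic per-frame signals (illustrative values only).
displacement = [0.0, 0.5, 1.1, 1.4, 2.0, 2.7]
coupled_phase = [3.0 * d + 0.2 for d in displacement]   # linear in displacement
decoupled_phase = [0.3, -1.2, 0.8, 0.1, -0.5, 0.9]      # unrelated to displacement
```

A linear relation between phase and displacement gives a correlation magnitude near 1, while an unrelated phase sequence gives a much lower value; the real verifier presumably has to handle noise and phase wrapping, which this sketch ignores.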
Details
Motivation: As video generation advances, AI-synthesized content is increasingly hard to distinguish from real video, creating a need for a reliable, physically verifiable way to tell them apart.
Method: The approach uses the Moiré effect that arises when a camera films a compact two-layer grating structure. It derives a Moiré motion invariant (a linear coupling between fringe phase and grating image displacement), then extracts both signals from video and tests their correlation.
Result: Tests on real footage and on videos from multiple state-of-the-art AI video generators show significantly different correlation signatures for the two, indicating that the method discriminates well.
Conclusion: Deterministic optical phenomena such as the Moiré effect can serve as physically grounded, verifiable authentication signatures, offering a new direction for detecting AI-generated video.
Abstract: Recent advances in video generation have made AI-synthesized content increasingly difficult to distinguish from real footage. We propose a physics-based authentication signature that real cameras produce naturally, but that generative models cannot faithfully reproduce. Our approach exploits the Moiré effect: the interference fringes formed when a camera views a compact two-layer grating structure. We derive the Moiré motion invariant, showing that fringe phase and grating image displacement are linearly coupled by optical geometry, independent of viewing distance and grating structure. A verifier extracts both signals from video and tests their correlation. We validate the invariant on both real-captured and AI-generated videos from multiple state-of-the-art generators, and find that real and AI-generated videos produce significantly different correlation signatures, suggesting a robust means of differentiating them. Our work demonstrates that deterministic optical phenomena can serve as physically grounded, verifiable signatures against AI-generated video.
[99] DynaVid: Learning to Generate Highly Dynamic Videos using Synthetic Motion Data
Wonjoon Jin,Jiyun Won,Janghyeok Han,Qi Dai,Chong Luo,Seung-Hwan Baek,Sunghyun Cho
Main category: cs.CV
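TL;DR: DynaVid is a video synthesis framework that introduces synthetic optical-flow motion data during training and decouples motion modeling from appearance modeling, improving the realism and motion controllability of dynamic video generation.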
Details
Motivation: Existing video diffusion models struggle to generate realistic videos that are highly dynamic or need fine-grained motion control, mainly because such samples are scarce in real training data.
Method: A two-stage framework: a motion generator is first trained on synthetic optical flow rendered with a graphics pipeline; a video generator conditioned on that motion then synthesizes frames with realistic appearance. Optical flow encodes only motion and is decoupled from appearance, which avoids carrying over the artificial look of synthetic renders.
Result: In two challenging scenarios, vigorous human motion generation and extreme camera motion control, DynaVid significantly improves the dynamic realism and motion controllability of the generated videos.
Conclusion: Synthetic optical flow provides a controllable, diverse, appearance-independent motion supervision signal that compensates for gaps in real data, so motion modeling and visual fidelity can be improved together.
Abstract: Despite recent progress, video diffusion models still struggle to synthesize realistic videos involving highly dynamic motions or requiring fine-grained motion controllability. A central limitation lies in the scarcity of such examples in commonly used training datasets. To address this, we introduce DynaVid, a video synthesis framework that leverages synthetic motion data in training, which is represented as optical flow and rendered using computer graphics pipelines. This approach offers two key advantages. First, synthetic motion offers diverse motion patterns and precise control signals that are difficult to obtain from real data. Second, unlike rendered videos with artificial appearances, rendered optical flow encodes only motion and is decoupled from appearance, thereby preventing models from reproducing the unnatural look of synthetic videos. Building on this idea, DynaVid adopts a two-stage generation framework: a motion generator first synthesizes motion, and then a motion-guided video generator produces video frames conditioned on that motion. This decoupled formulation enables the model to learn dynamic motion patterns from synthetic data while preserving visual realism from real-world videos. We validate our framework on two challenging scenarios, vigorous human motion generation and extreme camera motion control, where existing datasets are particularly limited. Extensive experiments demonstrate that DynaVid improves the realism and controllability in dynamic motion generation and camera motion control.
[100] Robust Embodied Perception in Dynamic Environments via Disentangled Weight Fusion
Juncen Guo,Xiaoguang Zhu,Jingyi Wu,Jingyu Zhang,Jingnan Cai,Zhenghao Niu,Liang Song
Main category: cs.CV
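TL;DR: This paper proposes an incremental learning framework that needs neither domain IDs nor stored exemplars; it uses disentangled representations and a weight fusion strategy to improve the continual adaptation and generalization of embodied multimedia systems in dynamic environments.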
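One simple way to picture integrating old and new knowledge in parameter space is linear interpolation of two parameter sets. The sketch below shows only that generic technique; this entry does not specify the paper's actual fusion rule (which is described as dynamic), so the fixed coefficient `alpha`, the dictionary layout, and the name `fuse_weights` are all assumptions.

```python
def fuse_weights(old_params, new_params, alpha):
    """Interpolate two parameter dictionaries: alpha * old + (1 - alpha) * new.

    Both arguments map a parameter name to a flat list of floats.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if old_params.keys() != new_params.keys():
        raise ValueError("parameter names must match")
    fused = {}
    for name, old_values in old_params.items():
        new_values = new_params[name]
        if len(old_values) != len(new_values):
            raise ValueError(f"shape mismatch for {name}")
        fused[name] = [alpha * o + (1.0 - alpha) * n
                       for o, n in zip(old_values, new_values)]
    return fused
```

With `alpha` close to 1 the fused model stays near the old environment's weights, and with `alpha` close to 0 it follows the newly adapted weights; no stored samples from earlier environments are needed for this step.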
Details
Motivation: Existing domain-incremental perception methods depend on domain IDs known in advance at test time and tend to overfit scene-specific perceptual noise, leading to poor generalization and catastrophic forgetting; this makes them hard to use in unknown interaction scenarios in open physical spaces.
Method: A domain-ID-free, exemplar-free incremental learning framework: (1) a disentangled representation mechanism removes non-essential environmental style interference and focuses on intrinsic semantic features shared across scenes; (2) a weight fusion strategy dynamically integrates knowledge of old and new environments in parameter space, with no need to store historical data.
Result: Experiments on multiple standard benchmark datasets show that, in a fully exemplar-free and domain-ID-free setting, the method significantly reduces catastrophic forgetting and achieves higher accuracy than existing state-of-the-art methods.
Conclusion: The framework enables robust continual adaptation of embodied perception systems in dynamic open environments, balancing generalization with retention of old knowledge.
Abstract: Embodied perception systems face severe distribution drift when they interact continuously with dynamic environments in open physical spaces. However, existing domain-incremental perception methods often rely on a domain ID obtained in advance during the testing phase, which limits their practicality in unknown interaction scenarios. At the same time, the model often overfits to context-specific perceptual noise, which leads to insufficient generalization ability and catastrophic forgetting. To address these limitations, we propose a domain-ID-free and exemplar-free incremental learning framework for embodied multimedia systems, which aims to achieve robust continuous environment adaptation. The method uses a disentangled representation mechanism to remove non-essential environmental style interference and to guide the model toward semantic intrinsic features shared across scenes, thereby reducing perceptual uncertainty and improving generalization. We further use a weight fusion strategy to dynamically integrate old and new environment knowledge in the parameter space, so that the model adapts to the new distribution without storing historical data while retaining as much of its discrimination ability on old environments as possible. Extensive experiments on multiple standard benchmark datasets show that the proposed method significantly reduces catastrophic forgetting in a completely exemplar-free and domain-ID-free setting, and that its accuracy is better than that of existing state-of-the-art methods.
[101] HOT: Harmonic-Constrained Optimal Transport for Remote Photoplethysmography Domain Adaptation
Ba-Thinh Nguyen,Thi-Duyen Ngo,Thanh-Trung Huynh,Thanh-Ha Le,Huy-Hieu Pham
Main category: cs.CV
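TL;DR: This paper proposes frequency domain adaptation (FDA) and Harmonic-Constrained Optimal Transport (HOT) to improve the robustness and generalization of remote photoplethysmography (rPPG) models under cross-domain conditions.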
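To make "transferring low-frequency spectral components" concrete, here is a 1D toy sketch in which the low-frequency amplitudes of a source signal are replaced by those of a target signal while the source phase is kept. Treat it as an assumption-laden illustration: swapping amplitude (rather than the full complex coefficient), working on a 1D sequence instead of video frames, and the names `transfer_low_freq_amplitude` and `num_low` are choices made here, not details confirmed by this entry.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft_real(spectrum):
    """Inverse DFT, returning only the real part of each sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


def transfer_low_freq_amplitude(source, target, num_low):
    """Give `source` the low-frequency amplitudes of `target`, keeping source phase.

    Bins with min(k, n - k) <= num_low count as low frequency, so the
    replacement stays symmetric and the output of a real input stays real.
    """
    if len(source) != len(target):
        raise ValueError("signals must have equal length")
    n = len(source)
    src_spec = dft(source)
    tgt_spec = dft(target)
    mixed = list(src_spec)
    for k in range(n):
        if min(k, n - k) <= num_low:
            mixed[k] = cmath.rect(abs(tgt_spec[k]), cmath.phase(src_spec[k]))
    return idft_real(mixed)
```

With `num_low=0` only the DC bin changes, so the source keeps its shape but takes on the target's mean level; larger values of `num_low` move progressively more slowly varying content from the target into the source.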
Details
Motivation: Existing deep learning rPPG methods tend to overfit appearance-related factors such as illumination and camera characteristics, causing a marked drop in cross-domain performance.
Method: Frequency domain adaptation (FDA) models appearance variation, and Harmonic-Constrained Optimal Transport (HOT) uses the harmonic property of cardiac signals to align representations.
Result: Cross-dataset experiments on multiple datasets show that the FDA + HOT framework significantly improves the robustness and generalization of rPPG models.
Conclusion: Combining FDA and HOT helps separate appearance variation from physiological signals and strengthens the domain adaptation ability of rPPG models.
Abstract: Remote photoplethysmography (rPPG) enables non-contact physiological measurement from facial videos; however, its practical deployment is often hindered by substantial performance degradation under domain shift. While recent deep learning-based rPPG methods have achieved strong performance on individual datasets, they frequently overfit to appearance-related factors, such as illumination, camera characteristics, and color response, that vary significantly across domains. To address this limitation, we introduce frequency domain adaptation (FDA) as a principled strategy for modeling appearance variation in rPPG. By transferring low-frequency spectral components that encode domain-dependent appearance characteristics, FDA encourages rPPG models to learn invariance to appearance variations while retaining cardiac-induced signals. To further support physiologically consistent alignment under such appearance variation, we propose Harmonic-Constrained Optimal Transport (HOT), which leverages the harmonic property of cardiac signals to guide alignment between original and FDA-transferred representations. Extensive cross-dataset experiments demonstrate that the proposed FDA and HOT framework effectively enhances the robustness and generalization of rPPG models across diverse datasets.
[102] GPA: Learning GUI Process Automation from Demonstrations
Zirui Zhao,Jun Hao Liew,Yan Yang,Wenzhuo Yang,Ziyang Luo,Doyen Sahoo,Silvio Savarese,Junnan Li
Main category: cs.CV
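TL;DR: This paper presents GUI Process Automation (GPA), a lightweight, vision-based robotic process automation method that uses Sequential Monte Carlo localization, readiness calibration, and local execution to achieve robust, deterministic, privacy-preserving GUI task automation; in a pilot experiment it outperforms Gemini 3 Pro (with CUA tools).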
Details
Motivation: To address the fragility of traditional RPA and the non-determinism risks of current vision-language-model GUI agents, and to meet the needs of enterprise workflows for adaptability, robustness, and security.
Method: Sequential Monte Carlo-based localization provides robustness, readiness calibration safeguards determinism and reliability, and fast, fully local execution protects privacy; GPA can also be exposed as an MCP/CLI tool for other agents with coding capabilities.
Result: On long-horizon GUI tasks, GPA achieves a higher success rate than Gemini 3 Pro (with CUA tools) and executes 10 times faster.
Conclusion: GPA is an efficient, stable, and secure GUI automation approach that is practical for enterprise use and has potential for multi-agent collaboration.
Abstract: GUI Process Automation (GPA) is a lightweight but general vision-based Robotic Process Automation (RPA) approach, which enables fast and stable process replay with only a single demo. Addressing the fragility of traditional RPA and the non-deterministic risks of current vision language model-based GUI agents, GPA introduces three core benefits: (1) Robustness via Sequential Monte Carlo-based localization to handle rescaling and detection uncertainty; (2) Determinism and reliability safeguarded by readiness calibration; and (3) Privacy through fast, fully local execution. This approach delivers the adaptability, robustness, and security required for enterprise workflows. It can also be used as an MCP/CLI tool by other agents with coding capabilities so that the agent only reasons and orchestrates while GPA handles the GUI execution. We conducted a pilot experiment to compare GPA with Gemini 3 Pro (with CUA tools) and found that GPA achieves a higher success rate and 10 times faster execution on long-horizon GUI tasks.
[103] Director: Instance-aware Gaussian Splatting for Dynamic Scene Modeling and Understanding
Yuheng Jiang,Yiwen Cai,Zihao Wang,Yize Wu,Sicheng Li,Zhuo Su,Shaohui Jiao,Lan Xu
Main category: cs.CV
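TL;DR: This paper presents Director, a unified spatio-temporal Gaussian representation that jointly models human performance, high-quality rendering, and instance-level semantics; with language-aligned semantic supervision and optical-flow-guided motion refinement, it achieves stable, interpretable 4D reconstruction of dynamic scenes.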
Details
Motivation: Existing Gaussian-based volumetric video methods focus on appearance reconstruction and lack instance-level structure modeling, which makes stable tracking and semantic reasoning in dynamic scenes difficult.
Method: A unified spatio-temporal Gaussian representation; sentence embeddings from Multimodal Large Language Models and temporally aligned instance masks supervise each Gaussian's semantic features through two MLP decoders; 2D optical flow is used to refine Gaussian motion and improve temporal stability; geometry-aware SDF constraints and a surface-continuity regularizer are added.
Result: The method achieves temporally consistent 4D reconstruction while keeping high-fidelity rendering, and supports instance segmentation and open-vocabulary querying.
Conclusion: Embedding instance-consistent semantics naturally strengthens 4D modeling; Director offers an approach to dynamic scene understanding that combines geometric accuracy, temporal stability, and semantic interpretability.
Abstract: Volumetric video seeks to model dynamic scenes as temporally coherent 4D representations. While recent Gaussian-based approaches achieve impressive rendering fidelity, they primarily emphasize appearance but are largely agnostic to instance-level structure, limiting stable tracking and semantic reasoning in highly dynamic scenarios. In this paper, we present Director, a unified spatio-temporal Gaussian representation that jointly models human performance, high-fidelity rendering, and instance-level semantics. Our key insight is that embedding instance-consistent semantics naturally complements 4D modeling, enabling more accurate scene decomposition while supporting robust dynamic scene understanding. To this end, we leverage temporally aligned instance masks and sentence embeddings derived from Multimodal Large Language Models to supervise the learnable semantic features of each Gaussian via two MLP decoders, enabling language-aligned 4D representations and enforcing identity consistency over time. To enhance temporal stability, we bridge 2D optical flow with 4D Gaussians and finetune their motions, yielding reliable initialization and reducing drift. For training, we further introduce geometry-aware SDF constraints, along with regularization terms that enforce surface continuity, enhancing temporal coherence in dynamic foreground modeling. Experiments demonstrate that Director achieves temporally coherent 4D reconstructions while simultaneously enabling instance segmentation and open-vocabulary querying.
[104] BTS-rPPG: Orthogonal Butterfly Temporal Shifting for Remote Photoplethysmography
Ba-Thinh Nguyen,Thi-Duyen Ngo,Thanh-Trung Huynh,Thanh-Ha Le,Huy-Hieu Pham
Main category: cs.CV
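TL;DR: This paper proposes BTS-rPPG, a framework that uses Orthogonal Butterfly Temporal Shifting (BTS) and an orthogonal feature transfer (OFT) mechanism to strengthen long-range temporal modeling and improve contactless rPPG signal estimation.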
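The two ingredients named in this entry can be pictured with a small sketch: an XOR-based butterfly pairing schedule in the style of the FFT, and the removal of the part of a source feature that is parallel to a target feature. Both functions are illustrative assumptions: the paper's exact schedule and its OFT formulation are not given here, the frame count is assumed to be a power of two, and the names `butterfly_partners` and `orthogonal_component` are hypothetical.

```python
def butterfly_partners(num_frames):
    """Return, for each stage s, the partner index i XOR 2**s of every frame i.

    Assumes num_frames is a power of two, as in the classical FFT butterfly.
    """
    if num_frames <= 0 or num_frames & (num_frames - 1) != 0:
        raise ValueError("num_frames must be a positive power of two")
    num_stages = num_frames.bit_length() - 1
    return [[i ^ (1 << s) for i in range(num_frames)]
            for s in range(num_stages)]


def orthogonal_component(source, target):
    """Part of `source` orthogonal to `target` (plain vector rejection)."""
    dot_st = sum(s * t for s, t in zip(source, target))
    dot_tt = sum(t * t for t in target)
    if dot_tt == 0.0:
        return list(source)
    scale = dot_st / dot_tt
    return [s - scale * t for s, t in zip(source, target)]
```

For 8 frames the schedule has 3 stages with pair distances 1, 2, and 4, so after all stages information from any frame can have reached every other frame, which is the receptive-field growth the abstract describes. The rejection step leaves a vector with zero inner product against the target, matching the idea of transmitting only complementary content.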
Details
Motivation: Existing deep learning methods for rPPG mostly rely on local temporal operations such as temporal shifting or convolution, which limits the temporal receptive field and makes long-range physiological dynamics hard to model.
Method: Orthogonal Butterfly Temporal Shifting (BTS), based on the FFT butterfly communication pattern, creates structured frame interactions through XOR pairing; an orthogonal feature transfer (OFT) mechanism filters out redundant features before temporal shifting and passes on only the component orthogonal to the target context.
Result: Experiments on multiple benchmark datasets show that BTS-rPPG significantly improves long-range temporal modeling and consistently outperforms existing temporal modeling methods for rPPG.
Conclusion: Through structured long-range interaction and orthogonal feature refinement, BTS-rPPG overcomes the local-modeling limits of earlier methods and offers a more robust, better-generalizing temporal modeling approach for rPPG.
Abstract: Remote photoplethysmography (rPPG) enables contactless physiological sensing from facial videos by analyzing subtle appearance variations induced by blood circulation. However, modeling the temporal dynamics of these signals remains challenging, as many deep learning methods rely on temporal shifting or convolutional operators that aggregate information primarily from neighboring frames, resulting in predominantly local temporal modeling and limited temporal receptive fields. To address this limitation, we propose BTS-rPPG, a temporal modeling framework based on Orthogonal Butterfly Temporal Shifting (BTS). Inspired by the butterfly communication pattern in the Fast Fourier Transform (FFT), BTS establishes structured frame interactions via an XOR-based butterfly pairing schedule, progressively expanding the temporal receptive field and enabling efficient propagation of information across distant frames. Furthermore, we introduce an orthogonal feature transfer mechanism (OFT) that filters the source feature with respect to the target context before temporal shifting, retaining only the orthogonal component for cross-frame transmission. This reduces redundant feature propagation and encourages complementary temporal interaction. Extensive experiments on multiple benchmark datasets demonstrate that BTS-rPPG improves long-range temporal modeling of physiological dynamics and consistently outperforms existing temporal modeling strategies for rPPG estimation.
[105] From Understanding to Erasing: Towards Complete and Stable Video Object Removal
Dingming Liu,Wenjing Wang,Chen Li,Jing Lyu
Main category: cs.CV
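TL;DR: This paper proposes a video object removal method that combines external knowledge distillation with an internal framewise context cross-attention mechanism, improving understanding of the target object and its physical effects (such as shadows and reflections), producing clearer and more coherent results, and establishing the first real-world benchmark for video object removal.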
Details
Motivation: Existing diffusion models for video object removal struggle to eliminate side effects caused by the target (such as shadows, reflections, and illumination changes), because they lack sufficient understanding of the target object and its physical and semantic interactions with the scene.
Method: Externally, a distillation scheme based on vision foundation models transfers the relationships between objects and their induced effects; internally, a framewise context cross-attention mechanism grounds each denoising block in context from unmasked regions.
Result: The method reaches state-of-the-art performance on multiple metrics, and the authors build the first real-world benchmark dataset for video object removal.
Conclusion: With joint external and internal guidance, the model better understands the target object, its induced effects, and the global background, significantly improving the plausibility and spatio-temporal consistency of video object removal.
Abstract: Video object removal aims to eliminate target objects from videos while plausibly completing missing regions and preserving spatio-temporal consistency. Although diffusion models have recently advanced this task, it remains challenging to remove object-induced side effects (e.g., shadows, reflections, and illumination changes) without compromising overall coherence. This limitation stems from the insufficient physical and semantic understanding of the target object and its interactions with the scene. In this paper, we propose to introduce understanding into erasing from two complementary perspectives. Externally, we introduce a distillation scheme that transfers the relationships between objects and their induced effects from vision foundation models to video diffusion models. Internally, we propose a framewise context cross-attention mechanism that grounds each denoising block in informative, unmasked context surrounding the target region. External and internal guidance jointly enable our model to understand the target object, its induced effects, and the global background context, resulting in clear and coherent object removal. Extensive experiments demonstrate our state-of-the-art performance, and we establish the first real-world benchmark for video object removal to facilitate future research and community progress. Our code, data, and models are available at: https://github.com/WeChatCV/UnderEraser.
[106] Can Video Diffusion Models Predict Past Frames? Bidirectional Cycle Consistency for Reversible Interpolation
Lingyu Liu,Yaxiong Wang,Li Zhu,Zhedong Zheng
Main category: cs.CV
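TL;DR: This paper proposes a bidirectional video frame interpolation framework based on temporal cycle consistency; with learnable directional tokens and a curriculum learning strategy, it significantly improves motion consistency and quality in long-sequence interpolation without adding inference cost.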
Details
Motivation: Existing generative video frame interpolation methods are mostly unidirectional and lack a mechanism to self-verify temporal consistency, which leads to motion drift, directional ambiguity, and boundary misalignment, especially in long sequences.
Method: A bidirectional cycle-consistent framework: learnable directional tokens explicitly model temporal direction, and a shared backbone jointly optimizes forward synthesis and backward reconstruction; curriculum learning moves training gradually from short to long sequences; the cycle constraints are used only in training, so inference remains a single forward pass.
Result: The method reaches state-of-the-art imaging quality, motion smoothness, and dynamic control on both 37-frame and 73-frame tasks, outperforming strong baselines with no additional computational overhead.
Conclusion: Temporal cycle consistency is an effective regularizer that improves the logical reversibility and spatio-temporal consistency of generated motion paths; the bidirectional framework balances performance and efficiency.
Abstract: Video frame interpolation aims to synthesize realistic intermediate frames between given endpoints while adhering to specific motion semantics. While recent generative models have improved visual fidelity, they predominantly operate in a unidirectional manner, lacking mechanisms to self-verify temporal consistency. This often leads to motion drift, directional ambiguity, and boundary misalignment, especially in long-range sequences. Inspired by the principle of temporal cycle-consistency in self-supervised learning, we propose a novel bidirectional framework that enforces symmetry between forward and backward generation trajectories. Our approach introduces learnable directional tokens to explicitly condition a shared backbone on temporal orientation, enabling the model to jointly optimize forward synthesis and backward reconstruction within a single unified architecture. This cycle-consistent supervision acts as a powerful regularizer, ensuring that generated motion paths are logically reversible. Furthermore, we employ a curriculum learning strategy that progressively trains the model from short to long sequences, stabilizing dynamics across varying durations. Crucially, our cyclic constraints are applied only during training; inference requires a single forward pass, maintaining the high efficiency of the base model. Extensive experiments show that our method achieves state-of-the-art performance in imaging quality, motion smoothness, and dynamic control on both 37-frame and 73-frame tasks, outperforming strong baselines while incurring no additional computational overhead.
[107] Bias mitigation in graph diffusion models
Meng Yu,Kun Zhan
Main category: cs.CV
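TL;DR: This paper proposes a comprehensive approach that uses Langevin sampling to align with the forward maximum perturbation distribution and so mitigate reverse-starting bias, and introduces a score correction mechanism based on a newly defined score difference to address exposure bias; it requires no network modifications and achieves state-of-the-art results across multiple models, datasets, and tasks.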
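For readers unfamiliar with Langevin sampling, the sketch below runs textbook unadjusted Langevin dynamics to move samples drawn from a standard Gaussian toward a different target distribution. It is not the paper's newly designed algorithm: the target here is a 1D Gaussian with a closed-form score standing in for the forward maximum perturbation distribution, and the step size, step count, and all names are assumptions for illustration.

```python
import math
import random


def gaussian_score(x, mean, std):
    """Score (gradient of the log-density) of a 1D Gaussian N(mean, std^2)."""
    return -(x - mean) / (std * std)


def langevin_sample(x0, score_fn, step_size, num_steps, rng):
    """Unadjusted Langevin dynamics:
    x <- x + (step_size / 2) * score(x) + sqrt(step_size) * noise.
    """
    x = x0
    for _ in range(num_steps):
        noise = rng.gauss(0.0, 1.0)
        x = x + 0.5 * step_size * score_fn(x) + math.sqrt(step_size) * noise
    return x


rng = random.Random(0)
target_mean, target_std = 2.0, 0.5  # toy stand-in for a non-standard terminal distribution


def target_score(x):
    return gaussian_score(x, target_mean, target_std)


# Each chain starts from a standard Gaussian draw, as reverse sampling usually does.
samples = [langevin_sample(rng.gauss(0.0, 1.0), target_score,
                           step_size=0.01, num_steps=500, rng=rng)
           for _ in range(500)]
sample_mean = sum(samples) / len(samples)
sample_std = math.sqrt(sum((s - sample_mean) ** 2 for s in samples) / len(samples))
```

After enough steps the chains forget their standard-Gaussian starting points and their empirical mean and spread approach those of the target, which is the sense in which Langevin steps can re-align a starting distribution. In a graph diffusion model the score would come from the trained network rather than a closed-form expression.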
Details
Motivation: Existing graph diffusion models suffer from significant bias: a reverse-starting bias, which arises because the forward diffusion's maximum perturbation distribution deviates from the standard Gaussian, and the exposure bias inherent to diffusion models; together they degrade generation quality.
Method: (1) A newly designed Langevin sampling algorithm aligns the reverse sampling starting point with the forward maximum perturbation distribution, mitigating reverse-starting bias; (2) a score correction mechanism based on a newly defined score difference mitigates exposure bias. The method requires no network modifications.
Result: The method is validated across multiple graph diffusion models, datasets, and tasks, achieving state-of-the-art generation performance.
Conclusion: The proposed bias correction approach, which needs no network modifications, mitigates both reverse-starting bias and exposure bias in graph diffusion models and significantly improves generation quality.
Abstract: Most existing graph diffusion models have significant bias problems. We observe that the forward diffusion's maximum perturbation distribution in most models deviates from the standard Gaussian distribution, while reverse sampling consistently starts from a standard Gaussian distribution, which results in a reverse-starting bias. Together with the inherent exposure bias of diffusion models, this results in degraded generation quality. This paper proposes a comprehensive approach to mitigate both biases. To mitigate reverse-starting bias, we employ a newly designed Langevin sampling algorithm to align with the forward maximum perturbation distribution, establishing a new reverse-starting point. To address the exposure bias, we introduce a score correction mechanism based on a newly defined score difference. Our approach, which requires no network modifications, is validated across multiple models, datasets, and tasks, achieving state-of-the-art results. Code is at https://github.com/kunzhan/spp
[108] End-to-End Shared Attention Estimation via Group Detection with Feedback Refinement
Chihiro Nakatani,Norimichi Ukita,Jean-Marc Odobez
Main category: cs.CV
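TL;DR: This paper proposes a new method for end-to-end shared attention (SA) estimation via group detection; it jointly optimizes group detection and SA heatmap generation and significantly improves performance.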
Details
Motivation: Previous methods either do not explicitly detect the group of people attending to the same target or assume that an image contains a single shared attention point, which limits practical applicability and performance.
Method: A two-step process: (i) SA heatmaps are generated from individual gaze heatmaps and group membership scalars; (ii) the initial SA heatmaps are fed back to refine the group memberships, and the final SA heatmap is predicted.
Result: The method outperforms existing methods on both group detection and shared attention estimation, and ablation experiments confirm the contribution of each component.
Conclusion: Jointly modeling group structure and shared attention improves the accuracy of both estimates and offers a more practical solution for real-world scenes.
Abstract: This paper proposes an end-to-end shared attention estimation method via group detection. Most previous methods estimate shared attention (SA) without detecting the actual group of people focusing on it, or assume that there is a single SA point in a given image. These issues limit the applicability of SA detection in practice and impact performance. To address them, we propose to simultaneously achieve group detection and shared attention estimation using a two-step process: (i) the generation of SA heatmaps relying on individual gaze attention heatmaps and group membership scalars estimated in a group inference; (ii) a refinement of the initial group memberships that accounts for the initial SA heatmaps, and the final prediction of the SA heatmap. Experiments demonstrate that our method outperforms other methods in group detection and shared attention estimation. Additional analyses validate the effectiveness of the proposed components. Code: https://github.com/chihina/sagd-CVPRW2026.
[109] SteerFlow: Steering Rectified Flows for Faithful Inversion-Based Image Editing
Thinh Dao,Zhen Wang,Kien T. Pham,Long Chen
Main category: cs.CV
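TL;DR: This paper presents SteerFlow, a model-agnostic text-guided image editing framework that introduces an amortized fixed-point solver, trajectory interpolation, and an adaptive masking mechanism to improve editing quality while keeping high fidelity to the source image.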
Details
Motivation: Existing flow-based generative models struggle to balance source fidelity and editing flexibility in text-guided image editing: higher-order solvers add computational cost, truncated inversion limits editability, and feature injection methods do not transfer across architectures.
Method: SteerFlow has three parts: (1) in the forward process, an Amortized Fixed-Point Solver implicitly straightens the forward trajectory by enforcing velocity consistency across consecutive timesteps, yielding a high-fidelity inverted latent; (2) in the backward process, Trajectory Interpolation adaptively blends target-editing and source-reconstruction velocities to anchor the editing trajectory; (3) an Adaptive Masking mechanism spatially constrains the edit using concept-guided segmentation and source-target velocity differences.
Result: Experiments on FLUX.1-dev and Stable Diffusion 3.5 Medium show that SteerFlow consistently achieves better editing quality than existing methods and naturally supports multi-turn editing without accumulating drift.
Conclusion: SteerFlow is a general, model-agnostic image editing framework with strong theoretical guarantees on source fidelity that extends to multi-turn editing and addresses the trade-off between source fidelity and editing controllability.
Abstract: Recent advances in flow-based generative models have enabled training-free, text-guided image editing by inverting an image into its latent noise and regenerating it under a new target conditional guidance. However, existing methods struggle to preserve source fidelity: higher-order solvers incur additional model inferences, truncated inversion constrains editability, and feature injection methods lack architectural transferability. To address these limitations, we propose SteerFlow, a model-agnostic editing framework with strong theoretical guarantees on source fidelity. In the forward process, we introduce an Amortized Fixed-Point Solver that implicitly straightens the forward trajectory by enforcing velocity consistency across consecutive timesteps, yielding a high-fidelity inverted latent. In the backward process, we introduce Trajectory Interpolation, which adaptively blends target-editing and source-reconstruction velocities to keep the editing trajectory anchored to the source. To further improve background preservation, we introduce an Adaptive Masking mechanism that spatially constrains the editing signal with concept-guided segmentation and source-target velocity differences. Extensive experiments on FLUX.1-dev and Stable Diffusion 3.5 Medium demonstrate that SteerFlow consistently achieves better editing quality than existing methods. Finally, we show that SteerFlow extends naturally to a complex multi-turn editing paradigm without accumulating drift.
[110] Setup-Independent Full Projector Compensation
Haibo Li,Qingyue Deng,Jijiang Li,Haibin Ling,Bingyao Huang
Main category: cs.CV
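TL;DR: This paper presents SIComp, the first full projector compensation framework that generalizes to new projection setups without fine-tuning or retraining, enabled by a large-scale real-world dataset and a co-adaptive design that decouples geometric and photometric compensation.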
Details
Motivation: Existing projector compensation methods depend heavily on a specific setup; large, diverse training data is lacking, and geometric correction models generalize poorly to new geometric configurations.
Method: The SIComp framework combines an optical flow module for online geometric correction with a novel photometric compensation network, and integrates intensity-varying surface priors for robustness to illumination changes; the authors also build a large-scale real-world dataset covering 277 distinct projector-camera setups.
Result: SIComp produces high-quality compensation across a range of unseen setups, substantially outperforms existing methods, and is the first generalizable solution to projector compensation.
Conclusion: SIComp is the first full projector compensation framework that is setup-independent and generalizes zero-shot, setting a new baseline for the field.
Abstract: Projector compensation seeks to correct geometric and photometric distortions that occur when images are projected onto nonplanar or textured surfaces. However, most existing methods are highly setup-dependent, requiring fine-tuning or retraining whenever the surface, lighting, or projector-camera pose changes. Progress has been limited by two key challenges: (1) the absence of large, diverse training datasets and (2) geometric correction models that are tied to specific spatial setups and, without further retraining or fine-tuning, often fail to generalize to novel geometric configurations. We introduce SIComp, the first Setup-Independent framework for full projector Compensation, capable of generalizing to unseen setups without fine-tuning or retraining. To enable this, we construct a large-scale real-world dataset spanning 277 distinct projector-camera setups. SIComp adopts a co-adaptive design that decouples geometry and photometry: a carefully tailored optical flow module performs online geometric correction, while a novel photometric network handles photometric compensation. To further enhance robustness under varying illumination, we integrate intensity-varying surface priors into the network design. Extensive experiments demonstrate that SIComp consistently produces high-quality compensation across diverse unseen setups, substantially outperforming existing methods in terms of generalization ability and establishing the first generalizable solution to projector compensation. The code and dataset are available on our project page: https://hai-bo-li.github.io/SIComp/
[111] Dense Point-to-Mask Optimization with Reinforced Point Selection for Crowd Instance Segmentation
Hongru Chen,Jiyang Huang,Jia Wan,Antoni B. Chan
Main category: cs.CV
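TL;DR: This paper proposes Dense Point-to-Mask Optimization (DPMO) and the Reinforced Point Selection (RPS) framework, which use point annotations to generate instance segmentation masks for dense crowds and improve counting accuracy.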
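Group Relative Policy Optimization, which this entry uses to train point selection, scores each sampled candidate relative to the other candidates in its group. The sketch below shows only that group-relative normalization in its commonly used form (reward minus group mean, divided by group standard deviation). The reward values and the function name are hypothetical, and how the paper defines the reward for a predicted point is not stated here.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    n = len(rewards)
    if n == 0:
        raise ValueError("rewards must be non-empty")
    mean = sum(rewards) / n
    variance = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for three candidate points sampled around one prediction.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
```

Candidates that beat the group average get positive advantages and are reinforced, those below it get negative ones, and a group whose candidates all score the same contributes no learning signal.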
Details
Motivation: Existing dense-crowd datasets mostly provide point annotations, while region annotations (such as boxes) are rare and inaccurate; popular large models such as SAM perform poorly in dense scenes, so a better-suited instance segmentation method is needed.
Method: (1) DPMO combines SAM with the Nearest Neighbor Exclusive Circle (NNEC) constraint to generate dense instance masks from point annotations; (2) the RPS framework, trained with Group Relative Policy Optimization (GRPO), selects the best predicted point; (3) new loss functions supervised by masks are designed.
Result: The method reaches state-of-the-art instance segmentation performance on four benchmarks (ShanghaiTech, UCF-QNRF, JHU-CROWD++, and NWPU-Crowd), and mask annotations are shown to significantly improve the counting accuracy of different models.
Conclusion: High-quality mask annotations bridge point annotations and density maps or region annotations; DPMO and RPS offer a new approach to dense-crowd instance segmentation, and mask supervision brings broadly applicable gains to downstream counting.
Abstract: Crowd instance segmentation is a crucial task with a wide range of applications, including surveillance and transportation. Currently, point labels are common in crowd datasets, while region labels (e.g., boxes) are rare and inaccurate. The masks obtained through segmentation help to improve the accuracy of region labels and resolve the correspondence between individual location coordinates and crowd density maps. However, directly applying currently popular large foundation models such as SAM does not yield ideal results in dense crowds. To this end, we first propose Dense Point-to-Mask Optimization (DPMO), which integrates SAM with the Nearest Neighbor Exclusive Circle (NNEC) constraint to generate dense instance segmentation from point annotations. With DPMO and manual correction, we obtain mask annotations from the existing point annotations for traditional crowd datasets. Then, to predict instance segmentation in dense crowds, we propose a Reinforced Point Selection (RPS) framework trained with Group Relative Policy Optimization (GRPO), which selects the best predicted point from a sampling of the initial point prediction. Through extensive experiments, we achieve state-of-the-art crowd instance segmentation performance on ShanghaiTech, UCF-QNRF, JHU-CROWD++, and NWPU-Crowd datasets. Furthermore, we design new loss functions supervised by masks that boost counting performance across different models, demonstrating the significant role of mask annotations in enhancing counting accuracy.
[112] Unifying UAV Cross-View Geo-Localization via 3D Geometric Perception
Haoyuan Li,Wen Yang,Fang Xu,Hong Tan,Haijian Zhang,Shengyang Li,Gui-Song Xia
Main category: cs.CV
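TL;DR: This paper proposes a geometry-aware UAV geo-localization framework that reconstructs the local 3D scene and renders an orthorectified bird's-eye view (BEV), unifying coarse cross-view place retrieval with fine-grained 3-DoF pose estimation and significantly improving UAV localization accuracy on satellite maps in GNSS-denied environments.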
Details
Motivation: To solve cross-view geo-localization for UAVs in GNSS-denied environments, where oblique UAV images differ severely in geometry from orthogonal satellite maps; existing methods treat perspective distortion as appearance noise and do not model geometric structure explicitly.
Method: A Visual Geometry Grounded Transformer (VGGT) reconstructs the local 3D scene from multi-view UAV image sequences, and a geometrically consistent virtual bird's-eye view (BEV) is rendered as the cross-view intermediary; a Satellite-wise Attention Block matches multiple candidate regions efficiently without interference between them; a recalibrated University-1652 dataset is released.
Result: On the recalibrated University-1652 and SUES-200 datasets, the method significantly outperforms existing state-of-the-art methods, reaching robust meter-level localization accuracy and generalizing better in complex urban environments.
Conclusion: Explicitly modeling 3D geometry and introducing a BEV intermediate representation bridges the cross-view geometric gap and offers a new approach to precise UAV localization without GNSS.
Abstract: Cross-view geo-localization for Unmanned Aerial Vehicles (UAVs) operating in GNSS-denied environments remains challenging due to the severe geometric discrepancy between oblique UAV imagery and orthogonal satellite maps. Most existing methods address this problem through a decoupled pipeline of place retrieval and pose estimation, implicitly treating perspective distortion as appearance noise rather than an explicit geometric transformation. In this work, we propose a geometry-aware UAV geo-localization framework that explicitly models the 3D scene geometry to unify coarse place recognition and fine-grained pose estimation within a single inference pipeline. Our approach reconstructs a local 3D scene from multi-view UAV image sequences using a Visual Geometry Grounded Transformer (VGGT), and renders a virtual Bird's-Eye View (BEV) representation that orthorectifies the UAV perspective to align with satellite imagery. This BEV serves as a geometric intermediary that enables robust cross-view retrieval and provides spatial priors for accurate 3 Degrees of Freedom (3-DoF) pose regression. To efficiently handle multiple location hypotheses, we introduce a Satellite-wise Attention Block that isolates the interaction between each satellite candidate and the reconstructed UAV scene, preventing inter-candidate interference while maintaining linear computational complexity. In addition, we release a recalibrated version of the University-1652 dataset with precise coordinate annotations and spatial overlap analysis, enabling rigorous evaluation of end-to-end localization accuracy. Extensive experiments on the refined University-1652 benchmark and SUES-200 demonstrate that our method significantly outperforms state-of-the-art baselines, achieving robust meter-level localization accuracy and improved generalization in complex urban environments.
[113] Ultrasound-CLIP: Semantic-Aware Contrastive Pre-training for Ultrasound Image-Text Understanding
Jiayun Jin,Haolong Chai,Xueying Huang,Xiaoqing Guo,Zengwei Zheng,Zhan Zhou,Junmei Wang,Xinyu Wang,Jie Liu,Binbin Zhou
Main category: cs.CV
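TL;DR: This paper introduces the US-365K ultrasound image-text dataset and the UDT knowledge taxonomy, and designs the Ultrasound-CLIP framework, which uses semantic soft labels, a semantic loss, and a heterogeneous graph modality to significantly improve cross-modal understanding of ultrasound images.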
Details
Motivation: Existing vision-language pre-training models such as CLIP are hard to apply directly to ultrasound images, which show highly heterogeneous anatomical structures and diverse diagnostic attributes, and no dedicated dataset or domain knowledge framework is available.
Method: The authors build US-365K, a large-scale ultrasound image-text dataset (365k samples, 52 anatomical categories); establish the Ultrasonographic Diagnostic Taxonomy (UDT), which contains a hierarchical anatomical taxonomy and a nine-dimension diagnostic attribute framework (covering, for example, echogenicity, margins, and vascularity); and propose Ultrasound-CLIP, which introduces semantic soft labels and a semantic loss and builds a heterogeneous graph modality from the diagnostic attributes to support reasoning over lesion-attribute relations.
Result: The approach reaches state-of-the-art performance on classification and retrieval and generalizes well to zero-shot, linear probing, and fine-tuning tasks; all experiments use patient-level data splits to keep the evaluation clinically realistic.
Conclusion: The work lays a data and knowledge foundation for vision-language modeling of ultrasound images; the proposed method improves the model's understanding of and reasoning over medical semantics and moves ultrasound AI closer to clinical use.
Abstract: Ultrasound imaging is widely used in clinical diagnostics due to its real-time capability and radiation-free nature. However, existing vision-language pre-training models, such as CLIP, are primarily designed for other modalities, and are difficult to directly apply to ultrasound data, which exhibit heterogeneous anatomical structures and diverse diagnostic attributes. To bridge this gap, we construct US-365K, a large-scale ultrasound image-text dataset containing 365k paired samples across 52 anatomical categories. We establish the Ultrasonographic Diagnostic Taxonomy (UDT), which contains two hierarchical knowledge frameworks. The Ultrasonographic Hierarchical Anatomical Taxonomy standardizes anatomical organization, and the Ultrasonographic Diagnostic Attribute Framework (UDAF) formalizes nine diagnostic dimensions, including body system, organ, diagnosis, shape, margins, echogenicity, internal characteristics, posterior acoustic phenomena, and vascularity. Building upon these foundations, we propose Ultrasound-CLIP, a semantic-aware contrastive learning framework that introduces semantic soft labels and semantic loss to refine sample discrimination. Moreover, we construct a heterogeneous graph modality derived from UDAF's textual representations, enabling structured reasoning over lesion-attribute relations. Extensive experiments with patient-level data splitting demonstrate that our approach achieves state-of-the-art performance on classification and retrieval benchmarks, while also delivering strong generalization to zero-shot, linear probing, and fine-tuning tasks.
[114] Control-DINO: Feature Space Conditioning for Controllable Image-to-Video Diffusion
Edoardo A. Dominici,Thomas Deixelberger,Konstantinos Vardis,Markus Steinberger
Main category: cs.CV
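TL;DR: This paper proposes a lightweight architecture and training strategy that uses high-dimensional features from self-supervised learning (e.g., DINO) and decouples appearance from the other features to be preserved, enabling robust control over appearance changes such as stylization and relighting in video domain transfer and video generation (e.g., video-from-3D generation).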
Details
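Motivation: Existing video diffusion models mostly rely on perceptual, geometric, or simple semantic signals for conditioning, whereas self-supervised image or point-cloud features carry rich scene information but entangle style, lighting, and semantics, which limits their controllability in generative tasks; this paper explores using them as a general-purpose conditioning signal for pretrained video diffusion models. Method: A lightweight network architecture and training strategy explicitly decouple appearance features (e.g., style, lighting) from the other structural and semantic features, and use high-dimensional, low-resolution features to compensate for the loss of spatial detail, improving controllability when generating video from explicit spatial representations (e.g., 3D). Result: Fine-grained, robust control over appearance attributes such as stylization and relighting in video domain transfer and video-from-3D generation; higher feature dimensionality is shown to compensate for the detail lost to low spatial resolution, improving controllability. Conclusion: Suitably disentangled self-supervised visual features can serve as a powerful, general conditioning signal for pretrained video diffusion models, supporting flexible appearance editing and generation while keeping content and structure consistent, and extending the controllability of video generation models. Abstract: Video models have recently been applied with success to problems in content generation, novel view synthesis, and, more broadly, world simulation. Many applications in generation and transfer rely on conditioning these models, typically through perceptual, geometric, or simple semantic signals, fundamentally using them as generative renderers. At the same time, high-dimensional features obtained from large-scale self-supervised learning on images or point clouds are increasingly used as a general-purpose interface for vision models. The connection between the two has been explored for subject specific editing, aligning and training video diffusion models, but not in the role of a more general conditioning signal for pretrained video diffusion models. Features obtained through self-supervised learning, like DINO, contain a lot of entangled information about style, lighting and semantics of the scene. This makes them great at reconstruction tasks but limits their generative capabilities. In this paper, we show how we can use the features for tasks such as video domain transfer and video-from-3D generation. We introduce a lightweight architecture and training strategy that decouples appearance from other features that we wish to preserve, enabling robust control for appearance changes such as stylization and relighting. Furthermore, we show that low spatial resolution can be compensated by higher feature dimensionality, improving controllability in generative rendering from explicit spatial representations.
[115] Cosine-Normalized Attention for Hyperspectral Image Classification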
Muhammad Ahmad,Manuel Mazzara
Main category: cs.CV
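TL;DR: This paper proposes a cosine-normalized attention mechanism for hyperspectral image classification that emphasizes angular relationships and reduces sensitivity to magnitude variations, outperforming existing Transformer and Mamba models under extremely limited supervision.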
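The scoring idea in this entry can be sketched in a few lines. This is a minimal illustration only, assuming a softmax over squared cosine similarities between unit-normalized queries and keys; the function names and the `temperature` parameter are hypothetical and not taken from the paper, and the spatial-spectral Transformer around the attention is not reproduced.

```python
import math


def l2_normalize(vector, eps=1e-12):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / (norm + eps) for x in vector]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def squared_cosine_attention(queries, keys, values, temperature=1.0):
    """Attention whose scores are squared cosine similarities.

    `temperature` is a hypothetical knob added for illustration.
    """
    unit_queries = [l2_normalize(q) for q in queries]
    unit_keys = [l2_normalize(k) for k in keys]
    outputs = []
    for q in unit_queries:
        # Cosine similarity of unit vectors is their dot product; squaring
        # keeps only the angular relationship, regardless of sign.
        scores = [
            sum(a * b for a, b in zip(q, k)) ** 2 / temperature
            for k in unit_keys
        ]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs
```

Because queries and keys are normalized before scoring, rescaling either by a positive constant leaves the attention weights unchanged, which is the magnitude insensitivity the summary describes.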
Details
Motivation: The dot-product attention of conventional Transformers mixes feature magnitude and direction, which may not suit the geometric properties of hyperspectral data. Method: Cosine-normalized attention projects query and key embeddings onto a unit hypersphere and computes attention scores with a squared cosine similarity, emphasizing angular relationships and suppressing magnitude effects; it is embedded in a spatial-spectral Transformer architecture. Result: On three benchmark datasets, the method consistently outperforms several recent Transformer- and Mamba-based models under extremely limited supervision while using only a lightweight backbone; controlled experiments show that cosine scoring provides a reliable inductive bias. Conclusion: Reformulating the attention score function from a geometric perspective, in particular with cosine normalization, better fits the intrinsic structure of hyperspectral data and significantly improves classification under few-sample conditions. Abstract: Transformer-based methods have improved hyperspectral image classification (HSIC) by modeling long-range spatial-spectral dependencies; however, their attention mechanisms typically rely on dot-product similarity, which mixes feature magnitude and orientation and may be suboptimal for hyperspectral data. This work revisits attention scoring from a geometric perspective and introduces a cosine-normalized attention formulation that aligns similarity computation with the angular structure of hyperspectral signatures. By projecting query and key embeddings onto a unit hypersphere and applying a squared cosine similarity, the proposed method emphasizes angular relationships while reducing sensitivity to magnitude variations. The formulation is integrated into a spatial-spectral Transformer and evaluated under extremely limited supervision. Experiments on three benchmark datasets demonstrate that the proposed approach consistently achieves higher performance, outperforming several recent Transformer- and Mamba-based models despite using a lightweight backbone. In addition, a controlled analysis of multiple attention score functions shows that cosine-based scoring provides a reliable inductive bias for hyperspectral representation learning.
[116] Hidden Meanings in Plain Sight: RebusBench for Evaluating Cognitive Visual Reasoning
Seyed Amir Kasaei,Arash Marioriyad,Mahbod Khaleti,MohammadAmin Fazli,Mahdieh Soleymani Baghshah,Mohammad Hossein Rohban
Main category: cs.CV
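TL;DR: This paper introduces RebusBench, a benchmark for evaluating large vision-language models on multi-step neurosymbolic reasoning (e.g., solving rebus puzzles); current SOTA models perform very poorly (below 10% Exact Match), revealing that they lack a reasoning mechanism connecting perception and knowledge.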
Details
Motivation: Existing LVLMs are good at explicit visual recognition but show a marked cognitive gap on tasks (such as rebus puzzles) where the image is only a clue and the answer requires multi-step reasoning (extracting attributes, invoking linguistic priors, abstract mapping). Method: The authors build RebusBench, a benchmark of 1,164 rebus puzzles, systematically evaluate SOTA models including Qwen, InternVL, and LLaVA on Exact Match and semantic accuracy, and analyze the effects of model scaling and in-context learning. Result: All tested models score below 10% Exact Match and below 20% semantic accuracy, and neither larger models nor ICL brings significant improvement. Conclusion: Current LVLMs have the visual and linguistic components but lack the neurosymbolic reasoning "glue" needed to combine them; new architectures or training paradigms are needed to close this cognitive gap. Abstract: Large Vision-Language Models (LVLMs) have achieved remarkable proficiency in explicit visual recognition, effectively describing what is directly visible in an image. However, a critical cognitive gap emerges when the visual input serves only as a clue rather than the answer. We identify that current models struggle with the complex, multi-step reasoning required to solve problems where information is not explicitly depicted. Successfully solving a rebus puzzle requires a distinct cognitive workflow: the model must extract visual and textual attributes, retrieve linguistic prior knowledge (such as idioms), and perform abstract mapping to synthesize these elements into a meaning that exists outside the pixel space. To evaluate this neurosymbolic capability, we introduce RebusBench, a benchmark of 1,164 puzzles designed to test this specific integration of perception and knowledge. Our evaluation of state-of-the-art models (including Qwen, InternVL, and LLaVA) shows a severe deficiency: performance saturates below 10% Exact Match and 20% semantic accuracy, with no significant improvement observed from model scaling or In-Context Learning (ICL). These findings suggest that while models possess the necessary visual and linguistic components, they lack the cognitive reasoning glue to connect them. Project page available at https://amirkasaei.com/rebusbench/.
[117] DriveDreamer-Policy: A Geometry-Grounded World-Action Model for Unified Generation and Planning
Yang Zhou,Xiaofeng Wang,Hao Shao,Letian Wang,Guosheng Zhao,Jiangnan Shao,Jiagang Zhu,Tingdong Yu,Zheng Zhu,Guan Huang,Steven L. Waslander
Main category: cs.CV
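TL;DR: DriveDreamer-Policy is a unified driving world-action model that integrates depth generation, future video generation, and motion planning, improving prediction coherence and driving decision quality on top of a geometry-aware world representation.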
Details
Motivation: Existing world-action models (WAMs) mostly model 2D appearance or latent representations and lack the geometric grounding that is essential in the physical world, which limits embodied systems in real driving scenarios. Method: DriveDreamer-Policy uses a large language model to process language instructions, multi-view images, and actions, followed by three lightweight generators that output depth maps, future video frames, and driving actions; it explicitly learns a geometry-aware world representation and uses it for both future prediction and action planning. Result: 89.2 PDMS on Navsim v1 and 88.7 EPDMS on Navsim v2, outperforming existing world-model-based methods; higher-quality generated future video and depth; ablations show that explicit depth learning strengthens video imagination and improves planning robustness. Conclusion: A geometry-aware unified architecture improves the synergy between world modeling and action planning and significantly strengthens the imagination and decision-making of autonomous driving systems while keeping modularity and controllable latency. Abstract: Recently, world-action models (WAM) have emerged to bridge vision-language-action (VLA) models and world models, unifying their reasoning and instruction-following capabilities and spatio-temporal world modeling. However, existing WAM approaches often focus on modeling 2D appearance or latent representations, with limited geometric grounding, an essential element for embodied systems operating in the physical world. We present DriveDreamer-Policy, a unified driving world-action model that integrates depth generation, future video generation, and motion planning within a single modular architecture. The model employs a large language model to process language instructions, multi-view images, and actions, followed by three lightweight generators that produce depth, future video, and actions. By learning a geometry-aware world representation and using it to guide both future prediction and planning within a unified framework, the proposed model produces more coherent imagined futures and more informed driving actions, while maintaining modularity and controllable latency. Experiments on the Navsim v1 and v2 benchmarks demonstrate that DriveDreamer-Policy achieves strong performance on both closed-loop planning and world generation tasks. In particular, our model reaches 89.2 PDMS on Navsim v1 and 88.7 EPDMS on Navsim v2, outperforming existing world-model-based approaches while producing higher-quality future video and depth predictions. Ablation studies further show that explicit depth learning provides complementary benefits to video imagination and improves planning robustness.
[118] FSKD: Monocular Forest Structure Inference via LiDAR-to-RGBI Knowledge Distillation
Taimur Khan,Hannes Feilhauer,Muhammad Jazib Zafar
Main category: cs.CV
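TL;DR: This paper proposes FSKD, a framework that uses knowledge distillation to transfer forest structure information derived from LiDAR (e.g., CHM, PAI, FHD) to a SegFormer model that takes only RGBI imagery, achieving accurate multi-metric forest structure prediction without strict co-registration and with cross-season data, and significantly outperforming existing methods.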
Details
Motivation: High-resolution forest structure data is essential for carbon, biodiversity, and ecosystem monitoring, but airborne LiDAR is costly and infrequently acquired, and existing RGB/multispectral methods struggle to retrieve multi-dimensional structure metrics (e.g., CHM, PAI, FHD) accurately; a low-cost, scalable alternative that estimates multiple metrics jointly is needed. Method: The FSKD (LiDAR-to-RGBI knowledge distillation) framework uses a multi-modal cross-attention teacher network, which fuses LiDAR-derived planar metrics (e.g., CHM) and vertical profile features, to guide a SegFormer student network that takes only RGBI imagery and predicts multiple metrics (CHM/PAI/FHD) jointly end to end; it uses an asymmetric distillation strategy, is trained on 384 km² of data in Saxony, Germany, and is tested on eight geographically heterogeneous regions. Result: The student model achieves SOTA zero-shot CHM prediction: MedAE = 4.17 m, R² = 0.51, IoU = 0.87; MAE is 29–46% lower than the HRCHM/DAC baselines (5.81 m vs. 8.14–10.84 m) and the correlation coefficient rises to 0.713; joint output of CHM, PAI, and FHD is achieved for the first time; and the model remains robust under a winter/summer mismatch (winter LiDAR, summer RGBI). Conclusion: FSKD transfers high-quality forest structure knowledge from expensive LiDAR to low-cost RGBI imagery, overcoming the single-metric, strict-registration, same-season limits of conventional remote-sensing retrieval, and offers a scalable, operational path for national high-resolution ecological monitoring projects such as Digital Twin Germany. Abstract: Very High Resolution (VHR) forest structure data at individual-tree scale is essential for carbon, biodiversity, and ecosystem monitoring. Still, airborne LiDAR remains costly and infrequent despite being the reference for forest structure metrics like Canopy Height Model (CHM), Plant Area Index (PAI), and Foliage Height Diversity (FHD). We propose FSKD: a LiDAR-to-RGB-Infrared (RGBI) knowledge distillation (KD) framework in which a multi-modal teacher fuses RGBI imagery with LiDAR-derived planar metrics and vertical profiles via cross-attention, and an RGBI-only SegFormer student learns to reproduce these outputs. Trained on 384 $km^2$ of forests in Saxony, Germany (20 cm ground sampling distance (GSD)) and evaluated on eight geographically distinct test tiles, the student achieves state-of-the-art (SOTA) zero-shot CHM performance (MedAE 4.17 m, $R^2$=0.51, IoU 0.87), outperforming HRCHM/DAC baselines by 29--46% in MAE (5.81 m vs. 8.14--10.84 m) with stronger correlation coefficients (0.713 vs. 0.166--0.652). Ablations show that multi-modal fusion improves performance by 10--26% over RGBI-only training, and that asymmetric distillation with appropriate model capacity is critical. The method jointly predicts CHM, PAI, and FHD, a multi-metric capability not provided by current monocular CHM estimators, although PAI/FHD transfer remains region-dependent and benefits from local calibration. The framework also remains effective under temporal mismatch (winter LiDAR, summer RGBI), removing strict co-acquisition constraints and enabling scalable 20 cm operational monitoring for workflows such as Digital Twin Germany and national Digital Orthophoto programs.
[119] GardenDesigner: Encoding Aesthetic Principles into Jiangnan Garden Construction via a Chain of Agents
Mengtian Li,Fan Yang,Ruixue Xiong,Yiyan Fan,Zhifeng Xie,Zeyu Wang
Main category: cs.CV
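TL;DR: This paper proposes GardenDesigner, a framework that combines Jiangnan garden aesthetic principles with a chain of procedural-modeling agents to generate digital Jiangnan gardens quickly, attractively, and interactively.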
Details
Motivation: Jiangnan gardens are an important cultural heritage with great potential in film, games, and digital tourism, but manual modeling depends on expert experience and is time-consuming, so automated, intelligent generation methods are needed. Method: The GardenDesigner framework comprises terrain distribution and road generation agents based on water-centric terrain and explorative pathway rules, and asset selection and layout optimization agents that follow aesthetic and cultural constraints; it also builds GardenVerse, a knowledge base with expert-annotated knowledge, and provides an interactive editing interface in Unity that supports text input. Result: Experiments and human evaluations show that GardenDesigner lets non-expert users generate diverse, aesthetically pleasing digital Jiangnan garden scenes within one minute. Conclusion: This work is the first to encode Jiangnan garden aesthetics systematically as computable rules and integrate them with multi-agent procedural modeling, substantially improving the efficiency and accessibility of digitizing classical gardens and offering a new paradigm for intelligent generation of cultural heritage. Abstract: Jiangnan gardens, a prominent style of Chinese classical gardens, hold great potential as digital assets for film and game production and digital tourism. However, manual modeling of Jiangnan gardens heavily relies on expert experience for layout design and asset creation, making the process time-consuming. To address this gap, we propose GardenDesigner, a novel framework that encodes aesthetic principles for Jiangnan garden construction and integrates a chain of agents based on procedural modeling. The water-centric terrain and explorative pathway rules are applied by terrain distribution and road generation agents. Selection and spatial layout of garden assets follow the aesthetic and cultural constraints. Consequently, we propose asset selection and layout optimization agents to select and arrange objects for each area in the garden. Additionally, we introduce GardenVerse for Jiangnan garden construction, including expert-annotated garden knowledge to enhance the asset arrangement process. To enable interaction and editing, we develop an interactive interface and tools in Unity, in which non-expert users can construct Jiangnan gardens via text input within one minute. Experiments and human evaluations demonstrate that GardenDesigner can generate diverse and aesthetically pleasing Jiangnan gardens. Project page is available at https://monad-cube.github.io/GardenDesigner.
[120] PTC-Depth: Pose-Refined Monocular Depth Estimation with Temporal Consistency
Leezy Han,Seunggyu Kim,Dongseok Shim,Hyeonbeom Lee
Main category: cs.CV
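TL;DR: This paper proposes a monocular depth estimation framework that uses wheel odometry to improve temporal consistency: it estimates sparse depth by optical-flow triangulation, recursively updates the scale, and then rescales the relative depth output of a pre-trained depth model.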
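The recursive scale update in this entry can be illustrated with a scalar Gaussian filter. This is a sketch under stated assumptions: the entry says only that the metric scale is a recursive Bayesian estimate, so the Gaussian (Kalman-style) form, the median-ratio measurement, and every name below are hypothetical choices rather than the paper's formulation.

```python
from statistics import median


def scale_measurement(sparse_metric_depths, relative_depths):
    """One per-frame scale observation: median ratio of triangulated metric
    depth to the model's relative depth at the same pixels (assumed form).

    Raises statistics.StatisticsError if no relative depth is positive.
    """
    ratios = [
        metric / relative
        for metric, relative in zip(sparse_metric_depths, relative_depths)
        if relative > 0
    ]
    return median(ratios)


def update_scale(mean, variance, measurement, measurement_variance,
                 process_variance=0.0):
    """Fuse the previous scale belief with a new observation."""
    # Predict: the scale may drift a little between frames.
    predicted_variance = variance + process_variance
    # Update: weight the observation by its relative certainty.
    gain = predicted_variance / (predicted_variance + measurement_variance)
    new_mean = mean + gain * (measurement - mean)
    new_variance = (1.0 - gain) * predicted_variance
    return new_mean, new_variance


def rescale_depth(relative_depths, scale):
    """Convert relative depth to metric depth with the current scale."""
    return [scale * d for d in relative_depths]
```

A filter of this kind changes the scale gradually instead of re-fitting it per frame, which is one way to obtain the temporal consistency the summary describes.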
Details
Motivation: Existing monocular depth estimation methods struggle to stay temporally consistent across consecutive frames, causing depth jitter and estimation failures when the depth range changes abruptly. Method: Using wheel odometry together with optical flow between consecutive frames, the method triangulates to estimate camera pose and sparse depth; the sparse depth drives a recursive Bayesian update of the metric scale, which is then used to rescale the relative depth output by a pre-trained depth model. Result: Robust and accurate depth estimation is demonstrated on KITTI, TartanAir, MS2, and a self-collected dataset. Conclusion: Introducing an external motion prior (wheel odometry) effectively improves the temporal consistency and stability of monocular depth estimation, especially on mobile robot platforms. Abstract: Monocular depth estimation (MDE) has been widely adopted in the perception systems of autonomous vehicles and mobile robots. However, existing approaches often struggle to maintain temporal consistency in depth estimation across consecutive frames. This inconsistency not only causes jitter but can also lead to estimation failures when the depth range changes abruptly. To address these challenges, this paper proposes a consistency-aware monocular depth estimation framework that leverages wheel odometry from a mobile robot to achieve stable and coherent depth predictions over time. Specifically, we estimate camera pose and sparse depth from triangulation using optical flow between consecutive frames. The sparse depth estimates are used to update a recursive Bayesian estimate of the metric scale, which is then applied to rescale the relative depth predicted by a pre-trained depth estimation foundation model. The proposed method is evaluated on the KITTI, TartanAir, MS2, and our own dataset, demonstrating robust and accurate depth estimation performance.
[121] A deep learning pipeline for PAM50 subtype classification using histopathology images and multi-objective patch selection
Arezoo Borji,Gernot Kronreif,Bernhard Angermayr,Francisco Mario Calisto,Wolfgang Birkfellner,Inna Servetnyk,Yinyin Yuan,Sepideh Hatamikia
Main category: cs.CV
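TL;DR: This paper proposes a deep learning framework that combines NSGA-II optimization with Monte Carlo dropout uncertainty estimation to predict PAM50 subtypes directly from H&E-stained whole-slide images, reducing reliance on costly molecular assays.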
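NSGA-II ranks candidate solutions by Pareto dominance, and that core step can be sketched on its own. The snippet below is an illustrative simplification, not the paper's pipeline: it finds only the first non-dominated front among candidate patch subsets, and the objective directions (maximize informativeness and diversity, minimize uncertainty and patch count) and all names are assumptions.

```python
def to_minimisation(informativeness, diversity, uncertainty, patch_count):
    """Express all four objectives as quantities to minimise (assumed
    directions: higher informativeness/diversity is better, lower
    uncertainty/patch count is better)."""
    return (-informativeness, -diversity, uncertainty, patch_count)


def dominates(a, b):
    """True if `a` is no worse than `b` on every objective and strictly
    better on at least one (all objectives minimised)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(candidates):
    """Indices of candidates that no other candidate dominates."""
    front = []
    for i, candidate in enumerate(candidates):
        dominated = any(
            dominates(other, candidate)
            for j, other in enumerate(candidates)
            if j != i
        )
        if not dominated:
            front.append(i)
    return front


# Three hypothetical patch subsets scored on the four objectives.
subsets = [
    to_minimisation(0.9, 0.5, 0.2, 50),
    to_minimisation(0.8, 0.4, 0.3, 60),  # worse than the first everywhere
    to_minimisation(0.7, 0.6, 0.1, 40),
]
```

Full NSGA-II also sorts the remaining candidates into later fronts and uses crowding distance to keep the population spread out; those parts are omitted here.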
Details
Motivation: To reduce the reliance of breast cancer PAM50 subtype classification on costly molecular assays by using routine H&E-stained images for accurate, scalable subtype prediction. Method: A ResNet18 backbone extracts features and a custom CNN head classifies; patch selection jointly optimizes patch informativeness, spatial diversity, uncertainty, and patch count using NSGA-II with Monte Carlo dropout uncertainty estimation. Result: F1 = 0.8812 and AUC = 0.9841 on the internal TCGA-BRCA dataset; F1 = 0.7952 and AUC = 0.9512 on the external CPTAC-BRCA validation set. Conclusion: The optimization-driven, uncertainty-aware patch selection method performs well and is computationally efficient, and could serve as a scalable imaging-based alternative for clinical decision support. Abstract: Breast cancer is a highly heterogeneous disease with diverse molecular profiles. The PAM50 gene signature is widely recognized as a standard for classifying breast cancer into intrinsic subtypes, enabling more personalized treatment strategies. In this study, we introduce a novel optimization-driven deep learning framework that aims to reduce reliance on costly molecular assays by directly predicting PAM50 subtypes from H&E-stained whole-slide images (WSIs). Our method jointly optimizes patch informativeness, spatial diversity, uncertainty, and patch count by combining the non-dominated sorting genetic algorithm II (NSGA-II) with Monte Carlo dropout-based uncertainty estimation. The proposed method can identify a small but highly informative patch subset for classification. We used a ResNet18 backbone for feature extraction and a custom CNN head for classification. For evaluation, we used the internal TCGA-BRCA dataset as the training cohort and the external CPTAC-BRCA dataset as the test cohort. On the internal dataset, an F1-score of 0.8812 and an AUC of 0.9841 using 627 WSIs from the TCGA-BRCA cohort were achieved. The performance of the proposed approach on the external validation dataset showed an F1-score of 0.7952 and an AUC of 0.9512. These findings indicate that the proposed optimization-guided, uncertainty-aware patch selection can achieve high performance and improve the computational efficiency of histopathology-based PAM50 classification compared to existing methods, suggesting a scalable imaging-based replacement that has the potential to support clinical decision-making.
[122] Language-Pretraining-Induced Bias: A Strong Foundation for General Vision Tasks
Yaxin Luo,Zhiqiang Shen
Main category: cs.CV
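TL;DR: This paper proposes random label bridge training, which requires no manual labeling and lets large language model (LLM) parameters adapt effectively to vision tasks; it also finds that some LLM layers have strong foundational properties and can be used for vision tasks directly without fine-tuning, offering a new path for cross-modality transfer.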
Details
Motivation: Prior work commonly assumes that language pre-trained models are unsuitable for downstream visual tasks because their parameter spaces differ greatly from those of vision models; this paper challenges that assumption and explores the feasibility of transferring language models to vision tasks. Method: Random label bridge training is proposed as a modality adaptation learner that aligns language model parameters with vision tasks; a partial bridge training strategy is also explored to identify and keep the LLM layers that are naturally suited to vision tasks. Result: LLMs can adapt effectively to vision foundation tasks through bridge training; some LLM layers have strong foundational properties and benefit vision tasks without fine-tuning; random label bridge training remains effective without annotations. Conclusion: The parameter gap between language and vision modalities is not an insurmountable barrier; bridge training, and partial bridge training in particular, enables efficient cross-modality transfer and broadens the potential of large language models in multimodal applications. Abstract: The ratio of outlier parameters in language pre-training models and vision pre-training models differs significantly, making cross-modality (language and vision) inherently more challenging than cross-domain adaptation. As a result, many prior studies have focused on cross-domain transfer rather than attempting to bridge language and vision modalities, assuming that language pre-trained models are unsuitable for downstream visual tasks due to disparate parameter spaces. Contrary to this assumption, we show that adding a bridge training stage as a modality adaptation learner can effectively align Large Language Model (LLM) parameters with vision tasks. Specifically, we propose a simple yet powerful solution, random label bridge training, that requires no manual labeling and helps LLM parameters adapt to vision foundation tasks. Moreover, our findings reveal that partial bridge training is often advantageous, as certain layers in LLMs exhibit strong foundational properties that remain beneficial even without fine-tuning for visual tasks. This surprising discovery opens up new avenues for leveraging language pre-trained parameters directly within vision models and highlights the potential of partial bridge training as a practical pathway to cross-modality adaptation.
[123] STRIVE: Structured Spatiotemporal Exploration for Reinforcement Learning in Video Question Answering
Emad Bahrami,Olga Zatsarynna,Parth Pathak,Sunando Sengupta,Juergen Gall,Mohsen Fayyaz
Main category: cs.CV
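TL;DR: STRIVE is a spatiotemporal reinforcement learning framework for video question answering that constructs multiple spatiotemporal variants of the input video and normalizes jointly across textual generations and visual variants, improving reward signal quality and the stability of policy updates; an importance-aware sampling mechanism focuses on question-relevant frames while preserving temporal coverage, strengthening robust reasoning across multiple views.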
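The effect of joint normalization can be seen in a small numeric sketch. It assumes group-relative advantages of the common mean/standard-deviation form; the entry does not give STRIVE's exact normalization, so the formula and all function names below are illustrative assumptions.

```python
from statistics import mean, pstdev


def normalize(rewards, eps=1e-6):
    """Group-relative advantages: centre and scale rewards within a group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def per_variant_advantages(rewards_by_variant):
    """Baseline: normalize the generations of each visual variant alone."""
    return [normalize(group) for group in rewards_by_variant]


def joint_advantages(rewards_by_variant):
    """Normalize over all generations of all visual variants together,
    then restore the per-variant grouping."""
    flat = [r for group in rewards_by_variant for r in group]
    flat_advantages = normalize(flat)
    grouped, start = [], 0
    for group in rewards_by_variant:
        grouped.append(flat_advantages[start:start + len(group)])
        start += len(group)
    return grouped


# Hypothetical rewards: every response is correct on the first variant,
# only one of three is correct on the second.
rewards = [[1.0, 1.0, 1.0], [0.0, 0.0, 1.0]]
separate = per_variant_advantages(rewards)
joint = joint_advantages(rewards)
```

In this example the first variant yields zero advantage everywhere when normalized alone, because its rewards have no variance, while joint normalization still assigns its responses a positive advantage relative to the weaker variant.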
Details
Motivation: Existing group-based policy optimization methods in large multimodal models often suffer from low reward variance when responses have similar correctness, which makes advantage estimates weak or unstable. Method: STRIVE (1) constructs multiple spatiotemporal variants of each video; (2) normalizes jointly across textual generations and visual variants; and (3) uses an importance-aware sampling mechanism that prioritizes the frames most relevant to the question while preserving temporal coverage. Result: On six video reasoning benchmarks (VideoMME, TempCompass, VideoMMMU, MMVU, VSI-Bench, and PerceptionTest), STRIVE consistently outperforms strong reinforcement learning baselines across multiple large multimodal models. Conclusion: Structured spatiotemporal exploration is a principled mechanism for stabilizing multimodal reinforcement learning and improving video reasoning performance. Abstract: We introduce STRIVE (SpatioTemporal Reinforcement with Importance-aware Variant Exploration), a structured reinforcement learning framework for video question answering. While group-based policy optimization methods have shown promise in large multimodal models, they often suffer from low reward variance when responses exhibit similar correctness, leading to weak or unstable advantage estimates. STRIVE addresses this limitation by constructing multiple spatiotemporal variants of each input video and performing joint normalization across both textual generations and visual variants. By expanding group comparisons beyond linguistic diversity to structured visual perturbations, STRIVE enriches reward signals and promotes more stable and informative policy updates. To ensure exploration remains semantically grounded, we introduce an importance-aware sampling mechanism that prioritizes frames most relevant to the input question while preserving temporal coverage. This design encourages robust reasoning across complementary visual perspectives rather than overfitting to a single spatiotemporal configuration. Experiments on six challenging video reasoning benchmarks including VideoMME, TempCompass, VideoMMMU, MMVU, VSI-Bench, and PerceptionTest demonstrate consistent improvements over strong reinforcement learning baselines across multiple large multimodal models. Our results highlight the role of structured spatiotemporal exploration as a principled mechanism for stabilizing multimodal reinforcement learning and improving video reasoning performance.
[124] HieraVid: Hierarchical Token Pruning for Fast Video Large Language Models
Yansong Guo,Chaoyang Zhu,Jiayi Ji,Jianghang Lin,Liujuan Cao
Main category: cs.CV
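TL;DR: This paper proposes HieraVid, a hierarchical video token pruning framework that dynamically reduces video redundancy at the segment, frame, and layer levels; retaining only 30% of video tokens, it stays close to the original models' performance and sets a new SOTA on several video understanding benchmarks.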
Details
Motivation: Existing VideoLLM methods mainly prune video tokens at the input level and ignore the information structure inside videos and large language models, which leads to high computational cost and limited pruning effectiveness. Method: Based on the observations that videos have a segment-frame structure and that multi-modal information propagates unidirectionally inside LLMs, pruning is decomposed into three levels: segment-level (temporal segmentation and spatial merging), frame-level (similar frames within a segment are pruned jointly to preserve diversity), and layer-level (redundancy is reduced progressively as LLM depth increases). Result: Validated on four widely used video understanding benchmarks; with only 30% of tokens retained, it reaches a new SOTA and maintains over 98% and 99% of the original performance of LLaVA-Video-7B and LLaVA-OneVision-7B, respectively. Conclusion: Through a structure-aware hierarchical pruning strategy, HieraVid substantially lowers the computational cost of VideoLLMs without sacrificing performance, offering a new paradigm for efficient video understanding. Abstract: Video Large Language Models (VideoLLMs) have demonstrated impressive capabilities in video understanding, yet the massive number of input video tokens incurs a significant computational burden for deployment. Existing methods mainly prune video tokens at input level while neglecting the inherent information structure embedded in videos and large language models (LLMs). To address this, we propose HieraVid, a hierarchical pruning framework that progressively and dynamically reduces visual redundancy. Based on two observations that videos possess the segment-frame structure and LLMs internally propagate multi-modal information unidirectionally, we decompose pruning into three levels: 1) segment-level, where video tokens are first temporally segmented and spatially merged; 2) frame-level, where similar frames within the same segment are jointly pruned to preserve diversity; 3) layer-level, where redundancy gradually shrinks as the LLM layer increases without compromising performance. We conduct extensive experiments on four widely used video understanding benchmarks to comprehensively evaluate the effectiveness of HieraVid. Remarkably, with only 30% of tokens retained, HieraVid achieves new state-of-the-art performance, while maintaining over 98% and 99% of the performance of LLaVA-Video-7B and LLaVA-OneVision-7B, respectively.
[125] SafeRoPE: Risk-specific Head-wise Embedding Rotation for Safe Generation in Rectified Flow Transformers
Xiang Yang,Feifei Li,Mi Zhang,Geng Hong,Xiaoyu You,Min Yang
Main category: cs.CV
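TL;DR: This paper proposes SafeRoPE, a lightweight, fine-grained safe generation framework that analyzes the safety-critical subspaces of the attention mechanism in MMDiT and uses Rotary Positional Embedding (RoPE) perturbations to suppress harmful semantics precisely, balancing safety and generation fidelity.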
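A rough sketch of the two ingredients named in this entry, a projection-based risk score and a rotation of query/key coordinates, is given below. It is heavily simplified and assumption-laden: the score is taken to be the fraction of a vector's norm lying in an orthonormal "unsafe" basis, and the rotation angle is taken to scale linearly with that score. Neither choice, nor any of the names or the `max_angle` parameter, is confirmed by the source.

```python
import math


def latent_risk_score(vector, unsafe_basis):
    """Fraction of the vector's norm captured by an orthonormal basis of a
    hypothetical unsafe subspace; 0.0 means fully outside, 1.0 fully inside."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return 0.0
    coefficients = [
        sum(x * b for x, b in zip(vector, basis_vector))
        for basis_vector in unsafe_basis
    ]
    return math.sqrt(sum(c * c for c in coefficients)) / norm


def rotate_pairs(vector, angle):
    """Rotate consecutive coordinate pairs by `angle`, the same pairwise
    rotation structure RoPE uses. Expects an even-length vector."""
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    rotated = []
    for i in range(0, len(vector), 2):
        x, y = vector[i], vector[i + 1]
        rotated.extend([cos_a * x - sin_a * y, sin_a * x + cos_a * y])
    return rotated


def risk_scaled_rotation(vector, unsafe_basis, max_angle=math.pi / 2):
    """Apply an extra rotation whose size grows with the risk score
    (assumed linear scaling; `max_angle` is a hypothetical parameter)."""
    score = latent_risk_score(vector, unsafe_basis)
    return rotate_pairs(vector, max_angle * score)
```

Under these assumptions a vector orthogonal to the unsafe subspace is left untouched, while a vector inside it is rotated by the full `max_angle`; the rotation preserves the vector's norm in both cases.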
Details
Motivation: Existing T2I models (e.g., SD3, FLUX) generate high-quality images but are prone to unsafe semantics triggered by multi-token interactions, while conventional mitigation methods are computationally expensive and hard to adapt to Transformer architectures such as MMDiT; an efficient safety mechanism designed for MMDiT is needed. Method: From an analysis of MMDiT attention heads, the authors identify low-dimensional, interpretable subspaces that carry unsafe semantics and the safety-critical heads; build head-wise unsafe subspaces and compute a Latent Risk Score (LRS) for each input vector; and design head-wise RoPE perturbations that apply risk-guided rotations to query and key vectors to suppress harmful semantics selectively. Result: SafeRoPE achieves SOTA harmful-content mitigation on MMDiT while largely preserving image quality and benign generation; experiments confirm its effectiveness, light weight, and generalization. Conclusion: SafeRoPE reveals that safety semantics in MMDiT have a structured distribution and shows that RoPE perturbation is an effective controllable intervention, offering a scalable, fine-tuning-free paradigm for safe generation in Transformer-based diffusion models. Abstract: Recent Text-to-Image (T2I) models based on rectified-flow transformers (e.g., SD3, FLUX) achieve high generative fidelity but remain vulnerable to unsafe semantics, especially when triggered by multi-token interactions. Existing mitigation methods largely rely on fine-tuning or attention modulation for concept unlearning; however, their expensive computational overhead and design tailored to U-Net-based denoisers hinder direct adaptation to transformer-based diffusion models (e.g., MMDiT). In this paper, we conduct an in-depth analysis of the attention mechanism in MMDiT and find that unsafe semantics concentrate within interpretable, low-dimensional subspaces at head level, where a finite set of safety-critical heads is responsible for unsafe feature extraction. We further observe that perturbing the Rotary Positional Embedding (RoPE) applied to the query and key vectors can effectively modify some specific concepts in the generated images. Motivated by these insights, we propose SafeRoPE, a lightweight and fine-grained safe generation framework for MMDiT. Specifically, SafeRoPE first constructs head-wise unsafe subspaces by decomposing unsafe embeddings within safety-critical heads, and computes a Latent Risk Score (LRS) for each input vector via projection onto these subspaces. We then introduce head-wise RoPE perturbations that can suppress unsafe semantics without degrading benign content or image quality. SafeRoPE combines both head-wise LRS and RoPE perturbations to perform risk-specific head-wise rotation on query and key vector embeddings, enabling precise suppression of unsafe outputs while maintaining generation fidelity. Extensive experiments demonstrate that SafeRoPE achieves SOTA performance in balancing effective harmful content mitigation and utility preservation for safe generation of MMDiT. Codes are available at https://github.com/deng12yx/SafeRoPE.
[126] Ranking-Guided Semi-Supervised Domain Adaptation for Severity Classification
Shota Harada,Ryoma Bise,Kiyohito Tanaka,Seiichi Uchida
Main category: cs.CV
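TL;DR: This paper proposes a new semi-supervised domain adaptation method for severity classification of medical images, which aligns the rank score distributions of the source and target domains through cross-domain ranking and continuous distribution alignment.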
Details
Motivation: Existing semi-supervised domain adaptation methods perform poorly on severity classification of medical images, mainly because class boundaries are unclear and severity labels are naturally ordered, which makes adaptation harder. Method: Two modules, Cross-Domain Ranking and Continuous Distribution Alignment, learn rank scores from ordered labels and align their distributions. Result: Experiments on ulcerative colitis and diabetic retinopathy datasets show that the method aligns class-specific rank score distributions and improves severity classification. Conclusion: By modeling class order and aligning distributions, the method significantly improves semi-supervised domain adaptation for ordinal severity classification. Abstract: Semi-supervised domain adaptation leverages a few labeled and many unlabeled target samples, making it promising for addressing domain shifts in medical image analysis. However, existing methods struggle with severity classification due to unclear class boundaries. Severity classification involves naturally ordered class labels, complicating adaptation. We propose a novel method that aligns source and target domains using rank scores learned via ranking with class order. Specifically, Cross-Domain Ranking ranks sample pairs across domains, while Continuous Distribution Alignment aligns rank score distributions. Experiments on ulcerative colitis and diabetic retinopathy classification validate the effectiveness of our approach, demonstrating successful alignment of class-specific rank score distributions.
[127] Semantic Segmentation of Textured Non-manifold 3D Meshes using Transformers
Mohammadreza Heidarianbaei,Max Mehltretter,Franz Rottensteiner
Main category: cs.CV
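TL;DR: This paper proposes a texture-aware Transformer that learns directly from the raw pixels of mesh faces and combines them with geometric descriptors through multi-scale feature aggregation, achieving substantial gains in semantic segmentation.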
Details
Motivation: Most deep-learning methods for mesh semantic segmentation ignore texture, although texture, together with geometry and topology, is a key attribute of 3D meshes. Method: A texture-aware Transformer: a texture branch encodes face-level pixels into a learnable token that is fused with geometric descriptors; Two-Stage Transformer Blocks (TSTB) enable local and global information exchange; a hierarchical learning scheme aggregates features at multiple scales. Result: 81.9% mF1 and 94.3% OA on the SUM benchmark; 49.7% mF1 and 72.8% OA on a newly built cultural-heritage roof tile dataset, substantially outperforming existing methods. Conclusion: Texture is crucial for mesh semantic segmentation; the texture-aware Transformer fuses texture and geometric features effectively and improves segmentation accuracy, particularly for complex textured meshes in real scenes. Abstract: Textured 3D meshes jointly represent geometry, topology, and appearance, yet their irregular structure poses significant challenges for deep-learning-based semantic segmentation. While a few recent methods operate directly on meshes without imposing geometric constraints, they typically overlook the rich textural information also provided by such meshes. We introduce a texture-aware transformer that learns directly from raw pixels associated with each mesh face, coupled with a new hierarchical learning scheme for multi-scale feature aggregation. A texture branch summarizes all face-level pixels into a learnable token, which is fused with geometrical descriptors and processed by a stack of Two-Stage Transformer Blocks (TSTB), which allow for both a local and a global information flow. We evaluate our model on the Semantic Urban Meshes (SUM) benchmark and a newly curated cultural-heritage dataset comprising textured roof tiles with triangle-level annotations for damage types. Our method achieves 81.9% mF1 and 94.3% OA on SUM and 49.7% mF1 and 72.8% OA on the new dataset, substantially outperforming existing approaches.
[128] Investigating Permutation-Invariant Discrete Representation Learning for Spatially Aligned Images
Jamie S. J. Stirling,Noura Al-Moubayed,Hubert P. H. Shum
Main category: cs.CV
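TL;DR: This paper proposes a permutation-invariant vector-quantized autoencoder (PI-VQ) that removes positional information from the latent codes so that they capture only global semantic features, and introduces matching quantization, based on optimal bipartite matching, to increase bottleneck capacity, enabling direct image interpolation and single-pass generation without a learned prior.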
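The contrast between nearest-neighbour quantization and a matching-based alternative can be sketched on a toy codebook. The snippet assumes that matching quantization assigns each latent vector a distinct code so as to minimize total squared distance; this reading, the brute-force search, and all names are illustrative assumptions rather than the paper's algorithm. A practical implementation would use a polynomial-time assignment solver (for example the Hungarian algorithm) instead of enumerating permutations.

```python
from itertools import permutations


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_neighbour_quantize(latents, codebook):
    """Each latent independently picks its closest code; codes may repeat."""
    return [
        min(range(len(codebook)),
            key=lambda j: squared_distance(z, codebook[j]))
        for z in latents
    ]


def matching_quantize(latents, codebook):
    """Assign distinct codes to latents with minimal total distance.

    Brute force over all injective assignments, so only suitable for tiny
    examples; requires len(latents) <= len(codebook).
    """
    best_cost, best_assignment = float("inf"), None
    for assignment in permutations(range(len(codebook)), len(latents)):
        cost = sum(
            squared_distance(z, codebook[j])
            for z, j in zip(latents, assignment)
        )
        if cost < best_cost:
            best_cost, best_assignment = cost, list(assignment)
    return best_assignment


# Two nearly identical latents and a three-entry codebook (toy values).
latents = [[0.0, 0.0], [0.1, 0.0]]
codebook = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
```

On these toy values nearest-neighbour quantization collapses both latents onto the same code, whereas the matching keeps them distinct, which gives an intuition for why a matching can carry more information through a permutation-invariant bottleneck.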
Details
Motivation: To investigate whether positional information is necessary in discrete representations of spatially aligned data (such as images), challenging the modeling complexity that position dependence brings to VQ-VAE/VQ-GAN. Method: The PI-VQ model constrains latent codes to be permutation-invariant (carrying no positional information); matching quantization replaces conventional nearest-neighbour quantization and uses optimal bipartite matching to raise effective bottleneck capacity; the compositional structure of the codes enables interpolation-based sampling. Result: Competitive generation quality metrics (precision/density/coverage) on CelebA, CelebA-HQ, and FFHQ; direct image interpolation and single-forward-pass synthesis without a learned prior; codes with better separability and interpretability. Conclusion: Positional information is not required for discrete image representations; removing the positional constraint strengthens semantic abstraction and generative flexibility, but information capacity must be traded off against structural modeling, pointing to new directions for position-free representations. Abstract: Vector quantization approaches (VQ-VAE, VQ-GAN) learn discrete neural representations of images, but these representations are inherently position-dependent: codes are spatially arranged and contextually entangled, requiring autoregressive or diffusion-based priors to model their dependencies at sample time. In this work, we ask whether positional information is necessary for discrete representations of spatially aligned data. We propose the permutation-invariant vector-quantized autoencoder (PI-VQ), in which latent codes are constrained to carry no positional information. We find that this constraint encourages codes to capture global, semantic features, and enables direct interpolation between images without a learned prior. To address the reduced information capacity of permutation-invariant representations, we introduce matching quantization, a vector quantization algorithm based on optimal bipartite matching that increases effective bottleneck capacity by $3.5\times$ relative to naive nearest-neighbour quantization. The compositional structure of the learned codes further enables interpolation-based sampling, allowing synthesis of novel images in a single forward pass. We evaluate PI-VQ on CelebA, CelebA-HQ and FFHQ, obtaining competitive precision, density and coverage metrics for images synthesised with our approach. We discuss the trade-offs inherent to position-free representations, including separability and interpretability of the latent codes, pointing to numerous directions for future work.
[129] FaCT-GS: Fast and Scalable CT Reconstruction with Gaussian Splatting
Pawel Tomasz Pieta,Rasmus Juul Pedersen,Sina Borgi,Jakob Sauer Jørgensen,Jens Wenzel Andreasen,Vedrana Andersen Dahl
Main category: cs.CV
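TL;DR: This paper proposes FaCT-GS, a fast and flexible CT reconstruction framework based on Gaussian Splatting that, through in-depth optimization of the voxelization and rasterization pipelines, is significantly faster and supports prior-guided reconstruction and compressed representations.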
Details
Motivation: Existing GS methods perform reasonably in CT reconstruction but are not fast or practical enough to replace well-established algorithms; key limitations such as low efficiency, poor scalability, and the lack of a mechanism for using priors must be addressed. Method: The FaCT-GS framework centers on in-depth optimization of the voxelization and rasterization pipelines and supports rapid fitting of Gaussians to pre-existing volumes, enabling warm-start reconstruction or use as a compressed representation. Result: Over 4× faster than the state-of-the-art GS CT method on 512×512 projections and over 13× faster on 2k projections; it scales well and supports prior guidance and compressed volume representation. Conclusion: FaCT-GS markedly improves the practicality and efficiency of GS for CT reconstruction, offering a more viable neural rendering solution for clinical and industrial applications. Abstract: Gaussian Splatting (GS) has emerged as a dominating technique for image rendering and has quickly been adapted for the X-ray Computed Tomography (CT) reconstruction task. However, despite being on par or better than many of its predecessors, the benefits of GS are typically not substantial enough to motivate a transition from well-established reconstruction algorithms. This paper addresses the most significant remaining limitations of the GS-based approach by introducing FaCT-GS, a framework for fast and flexible CT reconstruction. Enabled by an in-depth optimization of the voxelization and rasterization pipelines, our new method is significantly faster than its predecessors and scales well with projection and output volume size. Furthermore, the improved voxelization enables rapid fitting of Gaussians to pre-existing volumes, which can serve as a prior for warm-starting the reconstruction, or simply as an alternative, compressed representation. FaCT-GS is over 4X faster than the State of the Art GS CT reconstruction on standard 512x512 projections, and over 13X faster on 2k projections. Implementation available at: https://github.com/PaPieta/fact-gs.
[130] Semantic Richness or Geometric Reasoning? The Fragility of VLM's Visual Invariance
Jason Qiu,Zachary Meurer,Xavier Thomas,Deepti Ghadiyaram
Main category: cs.CV
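TL;DR: This paper shows that current vision-language models (VLMs) lack spatial invariance and equivariance under basic geometric transformations such as rotation and scaling, so their object recognition degrades markedly on semantically sparse content (e.g., sketches and abstract art), exposing a systematic gap between semantic understanding and spatial reasoning.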
Details
Motivation: Modern VLMs excel at semantic tasks but are not robust to basic spatial transformations; the fragility of their spatial reasoning needs to be examined. Method: Systematically evaluates how VLMs of different architectures, model sizes, and prompting strategies respond to rotation, scaling, and identity transformations across diverse visual domains, including symbolic sketches, natural photographs, and abstract art. Result: VLM performance drops sharply under geometric transformations, especially on semantically sparse images; the effect holds across architectures, capacities, and prompting strategies. Conclusion: Current VLMs show a fundamental disconnect between semantic understanding and spatial reasoning; future multimodal systems need stronger geometric perception and modeling. Abstract: This work investigates the fundamental fragility of state-of-the-art Vision-Language Models (VLMs) under basic geometric transformations. While modern VLMs excel at semantic tasks such as recognizing objects in canonical orientations and describing complex scenes, they exhibit systematic failures at a more fundamental level: a lack of the robust spatial invariance and equivariance required to reliably determine object identity under simple rotations, scaling, and identity transformations. We demonstrate this limitation through a systematic evaluation across diverse visual domains, including symbolic sketches, natural photographs, and abstract art. Performance drops sharply as semantic content becomes sparse, and this behavior is observed across architectures, model capacities, and prompting strategies. Overall, our results reveal a systematic gap between semantic understanding and spatial reasoning in current VLMs, highlighting the need for stronger geometric grounding in future multimodal systems.
[131] Combining Boundary Supervision and Segment-Level Regularization for Fine-Grained Action Segmentation
Hinako Mitsuoka,Kazuhiro Hotta
Main category: cs.CV
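TL;DR: This paper proposes a lightweight dual-loss training framework that improves the fine-grained quality of temporal action segmentation (TAS) with only one additional output channel and two auxiliary loss terms, without major changes to the model architecture.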
Details
Motivation: Existing TAS methods rely on complex architectures that hinder practical deployment; segmentation accuracy and boundary quality should be improved without a significant increase in computational cost. Method: Introduces two losses: 1) a single-channel boundary-regression loss that improves temporal boundary localization; 2) a segment-level regularization loss based on the cumulative distribution function (CDF) that strengthens structural consistency within predicted segments. The framework is architecture-agnostic and can be plugged into mainstream TAS models. Result: On three benchmark datasets, the method improves F1 and Edit scores while frame-wise accuracy stays largely unchanged, confirming its effect on segment-level consistency and boundary quality. Conclusion: Careful loss design, rather than heavier architectures or inference-time refinement, is enough to improve TAS performance, pointing toward lightweight and efficient temporal action segmentation. Abstract: Recent progress in Temporal Action Segmentation (TAS) has increasingly relied on complex architectures, which can hinder practical deployment. We present a lightweight dual-loss training framework that improves fine-grained segmentation quality with only one additional output channel and two auxiliary loss terms, requiring minimal architectural modification. Our approach combines a boundary-regression loss that promotes accurate temporal localization via a single-channel boundary prediction and a CDF-based segment-level regularization loss that encourages coherent within-segment structure by matching cumulative distributions over predicted and ground-truth segments. The framework is architecture-agnostic and can be integrated into existing TAS models (e.g., MS-TCN, C2F-TCN, FACT) as a training-time loss function. Across three benchmark datasets, the proposed method improves segment-level consistency and boundary quality, yielding higher F1 and Edit scores across three different models. Frame-wise accuracy remains largely unchanged, highlighting that precise segmentation can be achieved through simple loss design rather than heavier architectures or inference-time refinements.
[132] MAR-MAER: Metric-Aware and Ambiguity-Adaptive Autoregressive Image Generation
Kai Dong,Tingting Bai
Main category: cs.CV
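TL;DR: This paper proposes MAR-MAER, a hierarchical autoregressive text-to-image generation framework that uses metric-aware embedding regularization and a conditional variational module to improve the consistency of generated image quality and the semantic flexibility for ambiguous prompts.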
Details
Motivation: To address two problems of existing autoregressive text-to-image models: generated images that fall short of expected quality, and difficulty handling ambiguous prompts. Method: Proposes the MAR-MAER framework, comprising: 1) metric-aware embedding regularization, based on a lightweight projection head and an adaptive kernel regression loss, which aligns representations with human-preference metrics such as CLIPScore and HPSv2; 2) a conditional variational module that introduces controlled randomness to better model ambiguous semantics during hierarchical token generation. Result: Experiments on COCO and a newly built Ambiguous-Prompt Benchmark show gains over the Hi-MAR baseline of +1.6 CLIPScore and +5.3 HPSv2; output diversity for ambiguous prompts increases notably, as confirmed by both human evaluation and automated metrics. Conclusion: MAR-MAER improves the balance between consistency with human judgment and semantic openness in autoregressive image generation, offering a new approach to ambiguous and open-ended prompts. Abstract: Autoregressive (AR) models have demonstrated significant success in the realm of text-to-image generation. However, they usually face two major challenges. Firstly, the generated images may not always meet the quality standards expected by humans. Furthermore, these models face difficulty when dealing with ambiguous prompts that could be interpreted in several valid ways. To address these issues, we introduce MAR-MAER, an innovative hierarchical autoregressive framework. It combines two main components. The first is a metric-aware embedding regularization method. The second is a probabilistic latent model used for handling ambiguous semantics. Our method utilizes a lightweight projection head, which is trained with an adaptive kernel regression loss function. This aligns the model's internal representations with human-preferred quality metrics, such as CLIPScore and HPSv2. As a result, the learned embedding space more accurately reflects human judgment. We also introduce a conditional variational module. This approach incorporates an aspect of controlled randomness within the hierarchical token generation process. This capability allows the model to produce a diverse array of coherent images based on ambiguous or open-ended prompts. We conducted extensive experiments using COCO and a newly developed Ambiguous-Prompt Benchmark. The results show that MAR-MAER achieves excellent performance in both metric consistency and semantic flexibility. It exceeds the baseline Hi-MAR model's performance, showing an improvement of +1.6 in CLIPScore and +5.3 in HPSv2.
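The abstract does not define its adaptive kernel regression loss. As a rough illustration of the general idea only, the sketch below assumes a leave-one-out Nadaraya-Watson estimator with a Gaussian kernel and a batch-median bandwidth; the function name, the bandwidth rule, and the toy data are all hypothetical and are not taken from the paper. The loss is small when samples that are close in embedding space carry similar metric scores.

```python
import math


def kernel_regression_loss(embeddings, scores):
    """Leave-one-out kernel regression error (illustrative assumption, not the paper's loss)."""
    n = len(embeddings)
    # Squared Euclidean distance between every pair of embeddings.
    sq_dist = [[sum((a - b) ** 2 for a, b in zip(embeddings[i], embeddings[j]))
                for j in range(n)] for i in range(n)]
    # "Adaptive" bandwidth: median off-diagonal squared distance in the batch.
    off_diagonal = sorted(sq_dist[i][j] for i in range(n) for j in range(n) if i != j)
    bandwidth = off_diagonal[len(off_diagonal) // 2] or 1.0
    total = 0.0
    for i in range(n):
        # Predict sample i's score as the kernel-weighted mean of the other scores.
        weights = [math.exp(-sq_dist[i][j] / (2.0 * bandwidth)) for j in range(n) if j != i]
        targets = [scores[j] for j in range(n) if j != i]
        prediction = sum(w * t for w, t in zip(weights, targets)) / sum(weights)
        total += (prediction - scores[i]) ** 2
    return total / n


# Toy batch: two tight clusters of 2-D embeddings.
toy_embeddings = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
aligned_loss = kernel_regression_loss(toy_embeddings, [0.2, 0.2, 0.9, 0.9])
misaligned_loss = kernel_regression_loss(toy_embeddings, [0.2, 0.9, 0.2, 0.9])
```

In this toy batch the loss is lower when scores agree within each cluster than when they are mixed across clusters, which is the behaviour a metric-aware regularizer would reward.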
For unclear inputs, it produces a notably wider range of outputs. These findings have been confirmed through both human evaluation and automated metrics.
[133] GeoAI Agency Primitives
Akram Zaytar,Rohan Sawahn,Caleb Robinson,Gilles Q. Hacheme,Girmaw A. Tadesse,Inbal Becker-Reshef,Rahul Dodhia,Juan Lavista Ferres
Main category: cs.CV
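TL;DR: This paper proposes a set of agency primitives for geospatial AI (GeoAI) assistants, aiming to bridge the gap between foundation-model capabilities and the actual workflows of GIS practitioners; it emphasizes a human-centered, iteratively collaborative "agency layer" and defines 9 core primitives together with an accompanying benchmark.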
Details
Motivation: Despite progress in GeoAI models (e.g., satellite image captioning and visual question answering), they have not raised the productivity of GIS practitioners on practical tasks such as map making and vector-layer production; the root cause is the absence of an agency layer that supports iterative human-AI collaboration. Method: Proposes a vocabulary of 9 primitives for an agency layer (including navigation, perception, geo-referenced memory, and dual modeling) and designs a benchmark that measures human productivity. Result: A framework of agentic assistance capabilities for GeoAI that is implementable, testable, and comparable, together with an evaluation benchmark. Conclusion: The agency layer is the key link between foundation models and real GIS workflows; the work lays a systematic foundation for practical, standardized, human-AI collaborative GeoAI assistants. Abstract: We present ongoing research on agency primitives for GeoAI assistants -- core capabilities that connect foundation models to the artifact-centric, human-in-the-loop workflows where GIS practitioners actually work. Despite advances in satellite image captioning, visual question answering, and promptable segmentation, these capabilities have not translated into productivity gains for practitioners who spend most of their time producing vector layers, raster maps, and cartographic products. The gap is not model capability alone but the absence of an agency layer that supports iterative collaboration. We propose a vocabulary of 9 primitives for such a layer -- including navigation, perception, geo-referenced memory, and dual modeling -- along with a benchmark that measures human productivity. Our goal is a vocabulary that makes agentic assistance in GIS implementable, testable, and comparable.
[134] A3R: Agentic Affordance Reasoning via Cross-Dimensional Evidence in 3D Gaussian Scenes
Di Li,Jie Feng,Guanbin Li,Ronghua Shang,Yuhui Zheng,Weisheng Dong,Guangming Shi
Main category: cs.CV
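TL;DR: This paper proposes the A3R framework, which recasts fine-grained affordance reasoning as a sequential evidence acquisition process, progressively reduces ambiguity using cross-dimensional evidence, and uses GRPO-based policy learning to improve efficiency and accuracy.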
Details
Motivation: Existing methods treat affordance reasoning as static one-shot prediction, which often fails in complex 3D scenes because task-relevant evidence is incomplete under fixed observations. Method: Proposes A3R, in which an MLLM-based policy iteratively selects evidence acquisition actions and updates the affordance belief by fusing cross-dimensional 3D geometric and 2D semantic evidence; a GRPO-based policy learning strategy optimizes the sequential decision making. Result: On scene-level benchmarks, A3R consistently surpasses static one-shot baselines. Conclusion: Agentic cross-dimensional evidence acquisition improves fine-grained affordance reasoning in complex 3D Gaussian scenes. Abstract: Affordance reasoning in 3D Gaussian scenes aims to identify the region that supports the action specified by a given text instruction in complex environments. Existing methods typically cast this problem as one-shot prediction from static scene observations, assuming sufficient evidence is already available for reasoning. However, in complex 3D scenes, many failure cases arise not from weak prediction capacity, but from incomplete task-relevant evidence under fixed observations. To address this limitation, we reformulate fine-grained affordance reasoning as a sequential evidence acquisition process, where ambiguity is progressively reduced through complementary 3D geometric and 2D semantic evidence. Building on this formulation, we propose A3R, an agentic affordance reasoning framework that enables an MLLM-based policy to iteratively select evidence acquisition actions and update the affordance belief through cross-dimensional evidence acquisition. To optimize such sequential decision making, we further introduce a GRPO-based policy learning strategy that improves evidence acquisition efficiency and reasoning accuracy. Extensive experiments on scene-level benchmarks show that A3R consistently surpasses static one-shot baselines, demonstrating the advantage of agentic cross-dimensional evidence acquisition for fine-grained affordance reasoning in complex 3D Gaussian scenes.
[135] GS^2: Graph-based Spatial Distribution Optimization for Compact 3D Gaussian Splatting
Xianben Yang,Tao Wang,Yuxuan Li,Yi Jin,Haibin Ling
Main category: cs.CV
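TL;DR: This paper proposes GS^2, which combines graph-based feature encoding, ELBO-based adaptive densification, and opacity-aware progressive pruning to improve rendering quality and memory efficiency while using far fewer Gaussian points (only about 12.5%).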
Details
Motivation: 3D Gaussian Splatting (3DGS) performs well in novel view synthesis and real-time rendering, but its huge number of Gaussian points makes it memory-hungry; existing pruning-based compression methods often harm spatial consistency and introduce rendering artifacts. Method: Proposes GS^2, a graph-based spatial distribution optimization framework with three parts: 1) an adaptive densification strategy based on the evidence lower bound (ELBO); 2) an opacity-aware progressive pruning strategy; 3) a graph-based feature encoding module that adjusts Gaussian positions through feature-guided point shifting. Result: With only about 12.5% of the Gaussian points, GS^2 achieves higher PSNR than the original 3DGS and outperforms all compared baselines in both rendering quality and memory efficiency. Conclusion: By jointly optimizing the spatial distribution and the number of Gaussian points, GS^2 sharply reduces memory use while maintaining or improving reconstruction quality, offering a practical compression scheme for deploying 3DGS. Abstract: 3D Gaussian Splatting (3DGS) has demonstrated breakthrough performance in novel view synthesis and real-time rendering. Nevertheless, its practicality is constrained by the high memory cost due to a huge number of Gaussian points. Many pruning-based 3DGS variants have been proposed for memory saving, but often compromise spatial consistency and may lead to rendering artifacts. To address this issue, we propose graph-based spatial distribution optimization for compact 3D Gaussian Splatting (GS^2), which enhances reconstruction quality by optimizing the spatial distribution of Gaussian points. Specifically, we introduce an evidence lower bound (ELBO)-based adaptive densification strategy that automatically controls the densification process. In addition, an opacity-aware progressive pruning strategy is proposed to further reduce memory consumption by dynamically removing low-opacity Gaussian points. Furthermore, we propose a graph-based feature encoding module to adjust the spatial distribution via feature-guided point shifting. Extensive experiments validate that GS^2 achieves a compact Gaussian representation while delivering superior rendering quality. Compared with 3DGS, it achieves higher PSNR with only about 12.5% Gaussian points. Furthermore, it outperforms all compared baselines in both rendering quality and memory efficiency.
[136] Low-Effort Jailbreak Attacks Against Text-to-Image Safety Filters
Ahmed B Mustafa,Zihan Ye,Yang Lu,Michael P Pound,Shreyank N Gowda
Main category: cs.CV
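TL;DR: This paper shows that the safety filtering of current text-to-image generation models has serious weaknesses: low-effort jailbreak attacks using only natural language prompts reach an attack success rate of up to 74.47%. The authors propose five categories of visual jailbreak strategies, systematically evaluate them on several mainstream models, and conclude that current safety mechanisms fail to understand malicious intent at the semantic level.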
Details
Motivation: Existing text-to-image models rely on safety filters to prevent misuse, but how well these filters actually protect is unclear; the authors aim to expose their vulnerability under conditions that require no model access, optimization, or adversarial training. Method: Proposes and systematically studies five categories of natural-language visual jailbreak techniques (artistic reframing, material substitution, pseudo-educational framing, lifestyle aesthetic camouflage, and ambiguous action substitution) and evaluates them as black-box attacks on several state-of-the-art text-to-image models. Result: The strategies reach an attack success rate (ASR) of up to 74.47% across the tested models, bypassing existing safety filters. Conclusion: Safety mechanisms based on surface-level prompt filtering cannot capture deeper semantic intent; more robust, semantics-aware multimodal safety detection is needed. Abstract: Text-to-image generative models are widely deployed in creative tools and online platforms. To mitigate misuse, these systems rely on safety filters and moderation pipelines that aim to block harmful or policy-violating content. In this work we show that modern text-to-image models remain vulnerable to low-effort jailbreak attacks that require only natural language prompts. We present a systematic study of prompt-based strategies that bypass safety filters without model access, optimization, or adversarial training. We introduce a taxonomy of visual jailbreak techniques including artistic reframing, material substitution, pseudo-educational framing, lifestyle aesthetic camouflage, and ambiguous action substitution. These strategies exploit weaknesses in prompt moderation and visual safety filtering by masking unsafe intent within benign semantic contexts. We evaluate these attacks across several state-of-the-art text-to-image systems and demonstrate that simple linguistic modifications can reliably evade existing safeguards and produce restricted imagery. Our findings highlight a critical gap between surface-level prompt filtering and the semantic understanding required to detect adversarial intent in generative media systems. Across all tested models and attack categories we observe an attack success rate (ASR) of up to 74.47%.
[137] ProVG: Progressive Visual Grounding via Language Decoupling for Remote Sensing Imagery
Ke Li,Ting Wang,Di Wang,Yongshan Zhu,Yiming Zhang,Tao Lei,Quan Wang
Main category: cs.CV
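TL;DR: This paper proposes ProVG, a framework that decouples language expressions into global context, spatial relations, and object attributes and uses a progressive cross-modal modulator for coarse-to-fine vision-language alignment, improving the accuracy of natural-language-based object localization in remote sensing imagery.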
Details
Motivation: Existing methods rely on sentence-level vision-language alignment and struggle to exploit fine-grained linguistic cues such as spatial relations and object attributes; these cues play different roles at different grounding stages and should be used accordingly. Method: Proposes ProVG, which decouples language expressions into global context, spatial relations, and object attributes; designs a progressive cross-modal modulator (a survey-locate-verify scheme); adds a cross-scale fusion module and a language-guided calibration decoder; and uses a unified multi-task head to support both referring expression comprehension and segmentation. Result: On the RRSIS-D and RISBench benchmarks, ProVG consistently outperforms existing methods and sets a new state of the art. Conclusion: Through fine-grained language decoupling and progressive alignment, ProVG improves the accuracy and robustness of remote sensing visual grounding and confirms the value of using linguistic cues stage by stage. Abstract: Remote sensing visual grounding (RSVG) aims to localize objects in remote sensing imagery according to natural language expressions. Previous methods typically rely on sentence-level vision-language alignment, which struggles to exploit fine-grained linguistic cues, such as spatial relations and object attributes, that are crucial for distinguishing objects with similar characteristics. Importantly, these cues play distinct roles across different grounding stages and should be leveraged accordingly to provide more explicit guidance. In this work, we propose ProVG, a novel RSVG framework that improves localization accuracy by decoupling language expressions into global context, spatial relations, and object attributes. To integrate these linguistic cues, ProVG employs a simple yet effective progressive cross-modal modulator, which dynamically modulates visual attention through a survey-locate-verify scheme, enabling coarse-to-fine vision-language alignment. In addition, ProVG incorporates a cross-scale fusion module to mitigate the large-scale variations in remote sensing imagery, along with a language-guided calibration decoder to refine cross-modal alignment during prediction. A unified multi-task head further enables ProVG to support both referring expression comprehension and segmentation tasks. Extensive experiments on two benchmarks, i.e., RRSIS-D and RISBench, demonstrate that ProVG consistently outperforms existing methods, achieving new state-of-the-art performance.
[138] SHARC: Reference point driven Spherical Harmonic Representation for Complex Shapes
Panagiotis Sapoutzoglou,George Terzakis,Maria Pateraki
Main category: cs.CV
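TL;DR: SHARC is a new framework based on spherical harmonic (SH) representations of distance fields; by placing reference points at optimal locations inside the object and sampling the visible distance field, it achieves high-fidelity, genus-agnostic 3D shape synthesis.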
Details
Motivation: Existing methods struggle to balance reconstruction accuracy, efficiency, and model parsimony, and have limited ability to model shapes of arbitrary topology (genus-agnostic). Method: Proposes SHARC: reference points are selected inside the object by jointly optimizing sparsity, centrality, and surface visibility; for each point the visible distance field is sampled via ray-casting, the SH coefficients are computed with the Fast Spherical Harmonic Transform (FSHT), and a low-pass filter plus a proximity-based local consistency constraint improves geometric fidelity. Result: Outperforms state-of-the-art methods in both reconstruction accuracy and time efficiency while preserving model parsimony. Conclusion: SHARC provides an efficient, accurate, and compact way to represent and synthesize 3D shapes of arbitrary topology. Abstract: We propose SHARC, a novel framework that synthesizes arbitrary, genus-agnostic shapes by means of a collection of Spherical Harmonic (SH) representations of distance fields. These distance fields are anchored at optimally placed reference points in the interior volume of the surface in a way that maximizes learning of the finer details of the surface. To achieve this, we employ a cost function that jointly maximizes sparsity and centrality in terms of positioning, as well as visibility of the surface from their location. For each selected reference point, we sample the visible distance field to the surface geometry via ray-casting and compute the SH coefficients using the Fast Spherical Harmonic Transform (FSHT). To enhance geometric fidelity, we apply a configurable low-pass filter to the coefficients and refine the output using a local consistency constraint based on proximity. Evaluation of SHARC against state-of-the-art methods demonstrates that the proposed method outperforms existing approaches in both reconstruction accuracy and time efficiency without sacrificing model parsimony. The source code is available at https://github.com/POSE-Lab/SHARC.
[139] FTPFusion: Frequency-Aware Infrared and Visible Video Fusion with Temporal Perturbation
Xilai Li,Chusheng Fang,Xiaosong Li
Main category: cs.CV
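TL;DR: This paper proposes FTPFusion, an infrared and visible video fusion method based on frequency awareness, temporal perturbation, and sparse cross-modal interaction, which aims to improve spatial detail preservation and temporal stability at the same time.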
Details
Motivation: Existing methods struggle to achieve both temporal stability and spatial detail preservation: frame-wise enhancement methods lack temporal modeling, while heavy spatio-temporal aggregation tends to lose high-frequency detail. Method: FTPFusion decomposes features into high- and low-frequency components: the high-frequency branch captures motion context and complementary details through sparse cross-modal spatio-temporal interaction; the low-frequency branch adds a temporal perturbation strategy for robustness to flickering, jitter, and local misalignment; an offset-aware temporal consistency constraint explicitly stabilizes cross-frame representations. Result: On multiple public benchmarks, FTPFusion consistently outperforms state-of-the-art methods on metrics of both spatial fidelity and temporal consistency. Conclusion: Through frequency decoupling and collaborative modeling, FTPFusion addresses the spatio-temporal trade-off in infrared-visible video fusion and offers a more robust and detailed fusion method for intelligent surveillance and low-light monitoring. Abstract: Infrared and visible video fusion plays a critical role in intelligent surveillance and low-light monitoring. However, maintaining temporal stability while preserving spatial detail remains a fundamental challenge. Existing methods either focus on frame-wise enhancement with limited temporal modeling or rely on heavy spatio-temporal aggregation that often sacrifices high-frequency details. In this paper, we propose FTPFusion, a frequency-aware infrared and visible video fusion method based on temporal perturbation and sparse cross-modal interaction. Specifically, FTPFusion decomposes the feature representations into high-frequency and low-frequency components for collaborative modeling. The high-frequency branch performs sparse cross-modal spatio-temporal interaction to capture motion-related context and complementary details. The low-frequency branch introduces a temporal perturbation strategy to enhance robustness against complex video variations, such as flickering, jitter, and local misalignment. Furthermore, we design an offset-aware temporal consistency constraint to explicitly stabilize cross-frame representations under temporal disturbances. Extensive experiments on multiple public benchmarks demonstrate that FTPFusion consistently outperforms state-of-the-art methods across multiple metrics in both spatial fidelity and temporal consistency. The source code will be available at https://github.com/ixilai/FTPFusion.
[140] Light-ResKAN: A Parameter-Sharing Lightweight KAN with Gram Polynomials for Efficient SAR Image Recognition
Pan Yi,Weijie Li,Xiaodong Chen,Jiehua Zhang,Li Liu,Yongxiang Liu
Main category: cs.CV
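TL;DR: This paper proposes Light-ResKAN, a lightweight SAR image recognition model based on the Kolmogorov-Arnold Network (KAN); using learnable activations, Gram polynomial activation functions, and a channel-wise parameter-sharing strategy, it balances high accuracy and low computational cost on the MSTAR, FUSAR-Ship, and SAR-ACD datasets.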
Details
Motivation: SAR images are large, which makes deep learning models hard to deploy on resource-constrained edge devices; existing lightweight models struggle to combine high-precision feature extraction with low computational requirements. Method: 1) Replaces the convolutions in ResNet with KAN convolutions for adaptive feature extraction; 2) uses Gram polynomials as learnable activation functions suited to the complex non-linear characteristics of SAR images; 3) designs a channel-wise parameter-sharing strategy that reduces parameters and FLOPs. Result: Achieves 99.09%, 93.01%, and 97.26% accuracy on MSTAR, FUSAR-Ship, and SAR-ACD, respectively; on 1024×1024 MSTAR images, FLOPs are reduced by 82.90× and parameters by 163.78× relative to VGG16. Conclusion: Light-ResKAN offers an efficient, accurate solution for SAR image recognition at the edge and demonstrates the potential of KAN architectures for remote sensing image processing. Abstract: Synthetic Aperture Radar (SAR) image recognition is vital for disaster monitoring, military reconnaissance, and ocean observation. However, large SAR image sizes hinder deep learning deployment on resource-constrained edge devices, and existing lightweight models struggle to balance high-precision feature extraction with low computational requirements. The emerging Kolmogorov-Arnold Network (KAN) enhances fitting by replacing fixed activations with learnable ones, reducing parameters and computation. Inspired by KAN, we propose Light-ResKAN to achieve a better balance between precision and efficiency. First, Light-ResKAN modifies ResNet by replacing convolutions with KAN convolutions, enabling adaptive feature extraction for SAR images. Second, we use Gram Polynomials as activations, which are well-suited for SAR data to capture complex non-linear relationships. Third, we employ a parameter-sharing strategy: each kernel shares parameters per channel, preserving unique features while reducing parameters and FLOPs. Our model achieves 99.09%, 93.01%, and 97.26% accuracy on MSTAR, FUSAR-Ship, and SAR-ACD datasets, respectively. Experiments on MSTAR resized to $1024 \times 1024$ show that compared to VGG16, our model reduces FLOPs by $82.90 \times$ and parameters by $163.78 \times$. This work establishes an efficient solution for edge SAR image recognition.
[141] Lifting Unlabeled Internet-level Data for 3D Scene Understanding
Yixin Chen,Yaowei Zhang,Huangyue Yu,Junchao He,Yan Wang,Jiangyong Huang,Hongyu Shen,Junfeng Ni,Shaofei Wang,Baoxiong Jia,Song-Chun Zhu,Siyuan Huang
Main category: cs.CV
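TL;DR: This paper proposes using the large volume of unlabeled video on the web, via carefully designed data engines, to automatically generate training data for 3D scene understanding, compensating for scarce annotated data; the approach is validated on several 3D perception and reasoning tasks.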
Details
Motivation: Annotated 3D scene data is scarce and expensive, while unlabeled video is abundant on the internet; these cheap resources should be put to use for 3D scene understanding. Method: Designs and analyzes data engines for automated data generation, identifies the key bottlenecks that determine the efficiency and effectiveness of learning from unlabeled data, and validates the approach on 3D perception tasks of different granularity (3D object detection, instance segmentation, 3D spatial VQA, and VLN). Result: Models trained on the generated data show strong zero-shot performance and improve further after finetuning, demonstrating that unlabeled web video is a viable source for building 3D scene understanding systems. Conclusion: With suitable data engines, unlabeled web video can be turned into a high-quality training signal, an important route toward more capable and more scalable 3D scene understanding systems. Abstract: Annotated 3D scene data is scarce and expensive to acquire, while abundant unlabeled videos are readily available on the internet. In this paper, we demonstrate that carefully designed data engines can leverage web-curated, unlabeled videos to automatically generate training data, to facilitate end-to-end models in 3D scene understanding alongside human-annotated datasets. We identify and analyze bottlenecks in automated data generation, revealing critical factors that determine the efficiency and effectiveness of learning from unlabeled data. To validate our approach across different perception granularities, we evaluate on three tasks spanning low-level perception, i.e., 3D object detection and instance segmentation, to high-level reasoning, i.e., 3D spatial Visual Question Answering (VQA) and Vision-Language Navigation (VLN). Models trained on our generated data demonstrate strong zero-shot performance and show further improvement after finetuning. This demonstrates the viability of leveraging readily available web data as a path toward more capable scene understanding systems.
[142] Night Eyes: A Reproducible Framework for Constellation-Based Corneal Reflection Matching
Virmarie Maquiling,Yasmeen Abdrabou,Enkelejda Kasneci
Main category: cs.CV
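TL;DR: This paper proposes a multi-glint detection and matching framework based on 2D geometry and inspired by star identification; it emphasizes reproducibility and clear evaluation, uses the SLA procedure to obtain stable, identity-preserving correspondences, and releases code and tools.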
Details
Motivation: Corneal reflection (glint) detection is mostly handled by heuristics embedded in larger systems, which makes it hard to reproduce across hardware setups; a structured, evaluable, and reproducible multi-glint pipeline is needed. Method: Proposes a 2D geometry-driven, constellation-based pipeline that borrows from lost-in-space star identification; designs the Similarity-Layout Alignment (SLA) procedure, which combines controlled over-detection, adaptive candidate fallback, appearance-aware scoring, and optional semantic layout priors, and keeps detection and matching explicitly separated. Result: On a public multi-LED dataset, the system provides stable, identity-preserving glint correspondence under noisy conditions. Conclusion: The framework improves the reproducibility and robustness of multi-glint detection, supports transparent evaluation and cross-platform comparison, and releases code, presets, and scripts to support replication and annotation. Abstract: Corneal reflection (glint) detection plays an important role in pupil-corneal reflection (P-CR) eye tracking, but in practice it is often handled as heuristics embedded within larger systems, making reproducibility difficult across hardware setups. We introduce a 2D geometry-driven, constellation-based pipeline for multi-glint detection and matching, focusing on reproducibility and clear evaluation. Inspired by lost-in-space star identification, we treat glints as structured constellations rather than independent blobs. We propose a Similarity-Layout Alignment (SLA) procedure which adapts constellation matching to the specific constraints of multi-LED eye tracking. The framework brings together controlled over-detection, adaptive candidate fallback, appearance-aware scoring, and optional semantic layout priors while keeping detection and correspondence explicitly separated. Evaluated on a public multi-LED dataset, the system provides stable identity-preserving correspondence under noisy conditions. We release code, presets, and evaluation scripts to enable transparent replication, comparison, and dataset annotation.
[143] Enhancing Medical Visual Grounding via Knowledge-guided Spatial Prompts
Yifan Gao,Tao Zhou,Yi Zhou,Ke Zou,Yizhe Zhang,Huazhu Fu
Main category: cs.CV
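TL;DR: This paper proposes KnowMVG, a framework that improves the spatial precision of medical visual grounding (MVG) through knowledge-enhanced prompting and a global-local attention mechanism, outperforming existing methods.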
Details
Motivation: Existing vision-language models (VLMs) lack explicit localization priors for medical visual grounding; relying only on latent embeddings leaves spatial precision insufficient. Method: Analyzes the problem from an attention perspective and proposes KnowMVG: 1) a knowledge-enhanced prompting strategy that encodes phrase-related medical knowledge into compact embeddings; 2) a global-local attention mechanism that jointly uses coarse global information and fine-grained local cues to improve localization. Result: On four MVG benchmarks, KnowMVG exceeds prior state-of-the-art methods by 3.0% in AP50 and 2.6% in mIoU; ablation and qualitative analyses confirm the contribution of each component. Conclusion: By adding knowledge priors and an improved attention mechanism, KnowMVG bridges high-level semantic understanding and fine-grained visual perception, improving the localization of diagnostic phrases in medical images without extra textual reasoning overhead. Abstract: Medical Visual Grounding (MVG) aims to identify diagnostically relevant phrases from free-text radiology reports and localize their corresponding regions in medical images, providing interpretable visual evidence to support clinical decision-making. Although recent Vision-Language Models (VLMs) exhibit promising multimodal reasoning ability, their grounding still lacks sufficient spatial precision, largely due to a lack of explicit localization priors when relying solely on latent embeddings. In this work, we analyze this limitation from an attention perspective and propose KnowMVG, a Knowledge-prior and global-local attention enhancement framework for MVG in VLMs that explicitly strengthens spatial awareness during decoding. Specifically, we present a knowledge-enhanced prompting strategy that encodes phrase-related medical knowledge into compact embeddings, together with a global-local attention that jointly leverages coarse global information and refined local cues to guide precise region localization. This design bridges high-level semantic understanding and fine-grained visual perception without introducing extra textual reasoning overhead. Extensive experiments on four MVG benchmarks demonstrate that our KnowMVG consistently outperforms existing approaches, achieving gains of 3.0% in AP50 and 2.6% in mIoU over prior state-of-the-art methods. Qualitative and ablation studies further validate the effectiveness of each component.
[144] Learning Spatial Structure from Pre-Beamforming Per-Antenna Range-Doppler Radar Data via Visibility-Aware Cross-Modal Supervision
George Sebastian,Philipp Berthold,Bianca Forkel,Leon Pohl,Mirko Maehlisch
Main category: cs.CV
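TL;DR: This paper asks whether meaningful spatial structure can be learned directly from pre-beamforming per-antenna range-Doppler (RD) measurements, without the conventional beamformed angle-domain representation. Experiments use a commodity automotive radar in an end-to-end, fully data-driven framework: a dual-chirp shared-weight encoder processes the RD tensors, and bird's-eye-view (BEV) occupancy serves as a geometric probe of spatial recoverability. Supervision comes from visibility-aware, cross-modal LiDAR labels that model the radar field of view and occlusion-aware LiDAR observability. The results indicate that spatial structure can be learned from raw RD data without explicit angle-domain construction or hand-crafted signal processing.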
Details
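Motivation: Conventional automotive radar perception pipelines build angle-domain representations via beamforming; this work asks whether that step can be bypassed by learning spatial structure end to end from lower-level pre-beamforming per-antenna RD data, simplifying the pipeline and making fuller use of the raw data. Method: Uses a 6-TX × 8-RX (48 virtual antennas) commodity CS-FMCW radar whose A/B chirp sequence provides controlled single-TX and multi-TX configurations; trains a dual-chirp shared-weight encoder end to end on pre-beamforming per-antenna RD tensors; evaluates spatial recoverability with BEV occupancy as a geometric probe; and derives visibility-aware, cross-modal supervision from LiDAR with explicit modeling of the radar FOV and ray-based visibility. Result: Chirp ablations (A-only, B-only, A+B), range-band analysis, and physics-aligned baselines show that spatial structure can be recovered from pre-beamforming per-antenna RD tensors alone; the transmit configuration affects geometric recoverability, but no explicit angle-domain construction or hand-crafted signal-processing stage is needed. Conclusion: Spatial structure can be learned end to end from pre-beamforming per-antenna RD data, which supports dropping conventional beamforming and hand-crafted feature engineering and points to a simpler, data-driven approach to radar perception. Abstract: Automotive radar perception pipelines commonly construct angle-domain representations via beamforming before applying learning-based models. This work instead investigates a representational question: can meaningful spatial structure be learned directly from pre-beamforming per-antenna range-Doppler (RD) measurements? Experiments are conducted on a 6-TX x 8-RX (48 virtual antennas) commodity automotive radar employing an A/B chirp-sequence frequency-modulated continuous-wave (CS-FMCW) transmit scheme, in which the effective transmit aperture varies between chirps (single-TX vs. multi-TX), enabling controlled analysis of chirp-dependent transmit configurations. We operate on pre-beamforming per-antenna RD tensors using a dual-chirp shared-weight encoder trained in an end-to-end, fully data-driven manner, and evaluate spatial recoverability using bird's-eye-view (BEV) occupancy as a geometric probe rather than a performance-driven objective. Supervision is visibility-aware and cross-modal, derived from LiDAR with explicit modeling of the radar field-of-view and occlusion-aware LiDAR observability via ray-based visibility. Through chirp ablations (A-only, B-only, A+B), range-band analysis, and physics-aligned baselines, we assess how transmit configurations affect geometric recoverability.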
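The figure of 48 virtual antennas follows from the usual MIMO radar convention of one virtual element per TX-RX pair, located at the sum of the two physical positions. A minimal sketch, assuming a hypothetical uniform linear layout measured in wavelengths (the abstract does not give the sensor's actual geometry, and the spacing values below are illustrative only):

```python
def virtual_array(tx_positions, rx_positions):
    """Return one virtual element per (TX, RX) pair, placed at tx + rx."""
    return [tx + rx for tx in tx_positions for rx in rx_positions]


NUM_TX, NUM_RX = 6, 8  # antenna counts quoted in the abstract

# Hypothetical layout: RX elements half a wavelength apart, TX elements
# spaced by the full RX aperture so that no virtual positions overlap.
rx_positions = [0.5 * i for i in range(NUM_RX)]
tx_positions = [0.5 * NUM_RX * j for j in range(NUM_TX)]

multi_tx = virtual_array(tx_positions, rx_positions)       # all transmitters active
single_tx = virtual_array(tx_positions[:1], rx_positions)  # one transmitter active
```

Under this assumed layout, the multi-TX case yields 6 × 8 = 48 distinct virtual elements while the single-TX case yields only the 8 physical receive positions, which is one way to picture how the effective aperture can differ between chirps.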
The results indicate that spatial structure can be learned directly from pre-beamforming per-antenna RD tensors without explicit angle-domain construction or hand-crafted signal-processing stages.
[145] Rethinking Representations for Cross-Domain Infrared Small Target Detection: A Generalizable Perspective from the Frequency Domain
Yimin Fu,Songbo Wang,Feiyan Wu,Jialin Lyu,Zhunga Liu,Michael K. Ng
Main category: cs.CV
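TL;DR: This paper proposes S²CPNet, a spatial-spectral collaborative perception network for cross-domain infrared small target detection (IRSTD). From a frequency-domain perspective it shows that domain discrepancies appear mainly as phase inconsistency, and it designs a phase rectification module (PRM), an orthogonal attention mechanism (OAM), and selective style recomposition (SSR) to improve generalization to unseen domains; experiments show state-of-the-art performance under multiple cross-domain settings.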
Details
Motivation: Most existing IRSTD methods are confined to domain-consistent settings and cope poorly with distribution shifts between training and test data caused by changes in observational conditions and environment; in addition, infrared small targets have low signal-to-noise ratio and weak features, so models overfit to source-domain patterns and degrade sharply in cross-domain deployment. Method: Proposes the spatial-spectral collaborative perception network S²CPNet: 1) analyzes domain discrepancies from the frequency domain and finds that spectral phase inconsistency is their primary manifestation; 2) designs a phase rectification module (PRM) for generalizable target awareness; 3) adds an orthogonal attention mechanism (OAM) in the skip connections to preserve positional information while refining representations; 4) uses selective style recomposition (SSR) to reduce bias toward domain-specific patterns. Result: In extensive cross-domain experiments on three IRSTD datasets, the method achieves state-of-the-art performance under diverse cross-domain settings. Conclusion: By combining spatial and spectral perception, explicitly modeling and rectifying cross-domain phase differences, and recomposing style information, S²CPNet improves the robustness and generalization of IRSTD models on unseen domains. Abstract: The accurate target-background separation in infrared small target detection (IRSTD) highly depends on the discriminability of extracted representations. However, most existing methods are confined to domain-consistent settings, while overlooking whether such discriminability can generalize to unseen domains. In practice, distribution shifts between training and testing data are inevitable due to variations in observational conditions and environmental factors. Meanwhile, the intrinsic indistinctiveness of infrared small targets aggravates overfitting to domain-specific patterns. Consequently, the detection performance of models trained on source domains can be severely degraded when deployed in unseen domains. To address this challenge, we propose a spatial-spectral collaborative perception network (S$^2$CPNet) for cross-domain IRSTD. Moving beyond conventional spatial learning pipelines, we rethink IRSTD representations from a frequency perspective and reveal inconsistencies in spectral phase as the primary manifestation of domain discrepancies. Based on this insight, we develop a phase rectification module (PRM) to derive generalizable target awareness. Then, we employ an orthogonal attention mechanism (OAM) in skip connections to preserve positional information while refining informative representations. Moreover, the bias toward domain-specific patterns is further mitigated through selective style recomposition (SSR).
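For readers unfamiliar with the term, "spectral phase" refers to the angle of each Fourier coefficient, as opposed to its magnitude. The sketch below is not the paper's PRM; it only illustrates the amplitude/phase split on a 1-D signal with a naive discrete Fourier transform, and every function name in it is hypothetical.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal))
            for k in range(n)]


def idft(spectrum):
    """Inverse DFT; returns the real part, which suffices for real signals."""
    n = len(spectrum)
    return [sum(c * cmath.exp(2j * cmath.pi * k * t / n) for k, c in enumerate(spectrum)).real / n
            for t in range(n)]


def amplitude_phase(signal):
    """Split the spectrum of a signal into per-frequency amplitude and phase."""
    spectrum = dft(signal)
    return [abs(c) for c in spectrum], [cmath.phase(c) for c in spectrum]


def recompose(amplitude, phase):
    """Rebuild a signal from an amplitude spectrum and a phase spectrum."""
    return idft([cmath.rect(a, p) for a, p in zip(amplitude, phase)])


# Two "targets" on an empty background, and the same scene circularly shifted.
scene = [0.0, 1.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0]
shifted = scene[-1:] + scene[:-1]
scene_amp, scene_phase = amplitude_phase(scene)
shifted_amp, shifted_phase = amplitude_phase(shifted)
```

A circular shift leaves the amplitude spectrum unchanged and alters only the phase, so recomposing the original amplitudes with the shifted phases reproduces the shifted scene. This is one standard way to see that positional structure is carried by phase.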
Extensive experiments have been conducted on three IRSTD datasets, and the proposed method consistently achieves state-of-the-art performance under diverse cross-domain settings.
[146] Captioning Daily Activity Images in Early Childhood Education: Benchmark and Algorithm
Sixing Li,Zhibin Gu,Ziqi Zhang,Weiguo Pan,Bing Li,Ying Wang,Hongzhe Liu
Main category: cs.CV
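TL;DR: This paper proposes the ECAC benchmark dataset and the RSRS hybrid training framework to address domain specificity and fine-grained recognition in image captioning for early childhood education, and develops the KinderMM-Cap-3B model, which substantially outperforms existing methods on professional object naming accuracy (TTS).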
Details
Motivation: Existing image captioning methods face two challenges in early childhood education (ECE) scenarios: the lack of large-scale, domain-specific datasets leads to generic and imprecise descriptions, and conventional supervised or reinforcement learning paradigms do little to improve fine-grained naming of professional objects such as teaching toys. Method: Builds ECAC, a large-scale ECE captioning benchmark (256,121 real-world images with expert annotations), and proposes the domain-oriented metric TTS (Teaching Toy Recognition Score); designs the RSRS hybrid training framework, which dynamically switches between reinforcement learning and supervised fine-tuning and reroutes hard samples with zero reward to supervised optimization to mitigate advantage collapse. Result: KinderMM-Cap-3B, developed with ECAC and RSRS, reaches a TTS of 51.06, substantially outperforming state-of-the-art baselines while maintaining high caption quality. Conclusion: Together, the ECAC dataset, the TTS evaluation protocol, and the RSRS training framework advance specialized, fine-grained image captioning for early childhood education and show that domain-adapted multimodal large language models are feasible and effective for educational assessment. Abstract: Image captioning for Early Childhood Education (ECE) is essential for automated activity understanding and educational assessment. However, existing methods face two key challenges. First, the lack of large-scale, domain-specific datasets limits the model's ability to capture fine-grained semantic concepts unique to ECE scenarios, resulting in generic and imprecise descriptions. Second, conventional training paradigms exhibit limitations in enhancing professional object description capability, as supervised learning tends to favor high-frequency expressions, while reinforcement learning may suffer from unstable optimization on difficult samples. To address these limitations, we introduce ECAC, a large-scale benchmark for ECE daily activity image captioning, comprising 256,121 real-world images annotated with expert-level captions and fine-grained labels. ECAC is further equipped with a domain-oriented evaluation protocol, the Teaching Toy Recognition Score (TTS), to explicitly measure professional object naming accuracy. Furthermore, we propose RSRS (Reward-Conditional Switch of Reinforcement Learning and Supervised Fine-Tuning), a hybrid training framework that dynamically alternates between RL and supervised optimization. By rerouting hard samples with zero rewards to supervised fine-tuning, RSRS effectively mitigates advantage collapse and enables stable optimization for fine-grained recognition. Leveraging ECAC and RSRS, we develop KinderMM-Cap-3B, a domain-adapted multimodal large language model.
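The rerouting rule can be pictured with a small sketch. It assumes group-normalized advantages of the kind used in GRPO-style training, which the abstract does not specify, and all function names are hypothetical. Under that assumption, a sample whose rollouts all earn zero reward produces all-zero advantages and therefore no policy gradient, so it is sent to supervised fine-tuning instead.

```python
from statistics import mean, pstdev


def group_advantages(rewards):
    """Group-normalized advantages (assumed form): (r - mean) / std, zeros if std is 0."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


def route_sample(rewards):
    """Return "sft" when every rollout earned zero reward, otherwise "rl"."""
    return "sft" if all(r == 0 for r in rewards) else "rl"


def split_batch(batch):
    """Split {sample_id: rollout rewards} into RL updates and SFT reroutes."""
    rl_updates, sft_samples = [], []
    for sample_id, rewards in batch.items():
        if route_sample(rewards) == "sft":
            sft_samples.append(sample_id)  # no learning signal from RL here
        else:
            rl_updates.append((sample_id, group_advantages(rewards)))
    return rl_updates, sft_samples


# Toy batch: one sample with mixed rewards, one with only zero rewards.
toy_batch = {"easy": [1.0, 0.0, 1.0, 0.0], "hard": [0.0, 0.0, 0.0, 0.0]}
rl_updates, sft_samples = split_batch(toy_batch)
```

In the toy batch, the mixed-reward sample keeps non-zero advantages and stays on the RL path, while the all-zero sample is routed to SFT.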
Extensive experiments demonstrate that our model achieves a TTS of 51.06, substantially outperforming state-of-the-art baselines while maintaining superior caption quality, highlighting its potential for specialized educational applications.
[147] A Self supervised learning framework for imbalanced medical imaging datasets
Yash Kumar Sharma,Charan Ramtej Kodi,Vineet Padmanabhan
Main category: cs.CV
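TL;DR: This paper proposes AMIMV, which constructs asymmetric multi-image, multi-view pairs to address data scarcity and class imbalance in medical image classification, and validates its robustness and performance gains on the MedMNIST datasets.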
Details
Motivation: Medical image analysis often faces two challenges, insufficient labeled data and class imbalance; existing self-supervised learning (SSL) methods ease data scarcity, but their robustness under class imbalance has not been adequately studied.
Method: Extends the previously proposed MIMV method with a new augmentation strategy to construct asymmetric multi-image, multi-view (AMIMV) pairs; systematically evaluates AMIMV and eight mainstream SSL methods on 11 MedMNIST datasets under long-tailed distributions and limited supervision.
Result: On MedMNIST subsets, AMIMV improves over the baseline by 4.25% on retinaMNIST, 1.88% on tissueMNIST, and 3.1% on DermaMNIST.
Conclusion: AMIMV effectively handles data scarcity and class imbalance in medical images; experiments indicate good robustness under long-tailed distributions and clear gains over several SSL methods.
Abstract: Two problems often plague medical imaging analysis: 1) Non-availability of large quantities of labeled training data, and 2) Dealing with imbalanced data, i.e., abundant data are available for frequent classes, whereas data are highly limited for the rare class. Self supervised learning (SSL) methods have been proposed to deal with the first problem to a certain extent, but the issue of investigating the robustness of SSL to imbalanced data has rarely been addressed in the domain of medical image classification. In this work, we make the following contributions: 1) The MIMV method proposed by us in an earlier work is extended with a new augmentation strategy to construct asymmetric multi-image, multi-view (AMIMV) pairs to address both data scarcity and dataset imbalance in medical image classification. 2) We carry out a data analysis to evaluate the robustness of AMIMV under varying degrees of class imbalance in medical imaging. 3) We evaluate eight representative SSL methods in 11 medical imaging datasets (MedMNIST) under long-tailed distributions and limited supervision. Our experimental results on the MedMNIST dataset show an improvement of 4.25% on retinaMNIST, 1.88% on tissueMNIST, and 3.1% on DermaMNIST.
[148] MAVFusion: Efficient Infrared and Visible Video Fusion via Motion-Aware Sparse Interaction
Xilai Li,Weijun Jiang,Xiaosong Li,Yang Liu,Hongbin Wang,Tao Ye,Huafeng Li,Haishu Tan
Main category: cs.CV
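TL;DR: This paper proposes MAVFusion, an end-to-end infrared and visible video fusion framework whose motion-aware sparse interaction mechanism substantially improves computational efficiency while preserving high-quality fusion results.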
Details
Motivation: Most existing methods are designed for static images and cannot effectively handle frame-to-frame motion in video; current video fusion methods improve temporal consistency but are computationally expensive.
Method: Uses optical flow to identify dynamic regions in multi-modal sequences and adaptively applies costly cross-modal attention to these sparse regions; applies a lightweight weak interaction module to static background regions; decouples the processing of dynamic and static regions.
Result: Achieves SOTA performance on multiple infrared-visible video benchmarks, with an inference speed of 14.16 FPS at 640×480 resolution.
Conclusion: Through motion-aware sparse interaction, MAVFusion greatly accelerates inference while preserving temporal consistency and detail fidelity, offering an efficient new paradigm for real-time video fusion.
Abstract: Infrared and visible video fusion combines the object saliency from infrared images with the texture details from visible images to produce semantically rich fusion results. However, most existing methods are designed for static image fusion and cannot effectively handle frame-to-frame motion in videos. Current video fusion methods improve temporal consistency by introducing interactions across frames, but they often require high computational cost. To mitigate these challenges, we propose MAVFusion, an end-to-end video fusion framework featuring a motion-aware sparse interaction mechanism that enhances efficiency while maintaining superior fusion quality. Specifically, we leverage optical flow to identify dynamic regions in multi-modal sequences, adaptively allocating computationally intensive cross-modal attention to these sparse areas to capture salient transitions and facilitate inter-modal information exchange. For static background regions, a lightweight weak interaction module is employed to maintain structural and appearance integrity. By decoupling the processing of dynamic and static regions, MAVFusion simultaneously preserves temporal consistency and fine-grained details while significantly accelerating inference. Extensive experiments demonstrate that MAVFusion achieves state-of-the-art performance on multiple infrared and visible video benchmarks, achieving a speed of 14.16\,FPS at $640 \times 480$ resolution. The source code will be available at https://github.com/ixilai/MAVFusion.
[149] Automated Prostate Gland Segmentation in MRI Using nnU-Net
Pablo Rodriguez-Belenguer,Gloria Ribas,Javier Aquerreta Escribano,Rafael Moreno-Calatayud,Leonor Cerda-Alberich,Luis Marti-Bonmati
Main category: cs.CV
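TL;DR: This paper proposes a dedicated deep learning method based on the nnU-Net v2 framework that automatically segments the prostate from multimodal mpMRI (T2WI, DWI, ADC); it substantially outperforms the general-purpose tool TotalSegmentator, reaching Dice scores of 0.96 in internal cross-validation and 0.82 on external testing.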
Details
Motivation: Manual prostate delineation is time-consuming and subject to inter-observer variability, and general-purpose segmentation tools lack accuracy on prostate-specific tasks.
Method: Uses the nnU-Net v2 framework with multimodal training on T2-weighted, diffusion-weighted (DWI), and apparent diffusion coefficient (ADC) images; trains on 981 cases with whole-gland annotations from the PI-CAI dataset and validates with 5-fold cross-validation and an external dataset of 54 cases from Hospital La Fe.
Result: Mean cross-validation Dice is 0.96±0.00 and external test Dice is 0.82; compared with TotalSegmentator (Dice of only 0.15), the method is substantially better, notably avoiding under-segmentation.
Conclusion: Task-specific, multimodal deep learning strategies are essential for prostate segmentation; the method shows potential for use in clinical research, and the model has been containerized and made available for ready-to-use inference.
Abstract: Accurate segmentation of the prostate gland in multiparametric MRI (mpMRI) is a fundamental step for a wide range of clinical and research applications, including image registration, volume estimation, and radiomic analysis. However, manual delineation is time-consuming and subject to inter-observer variability, while general-purpose segmentation tools often fail to provide sufficient accuracy for prostate-specific tasks. In this work, we propose a dedicated deep learning-based approach for automatic prostate gland segmentation using the nnU-Net v2 framework. The model leverages multimodal mpMRI data, including T2-weighted imaging, diffusion-weighted imaging (DWI), and apparent diffusion coefficient (ADC) maps, to exploit complementary tissue information. Training was performed on 981 cases from the PI-CAI dataset using whole-gland annotations, and model performance was assessed through 5-fold cross-validation and external validation on an independent cohort of 54 patients from Hospital La Fe. The proposed model achieved a mean Dice score of 0.96 +/- 0.00 in cross-validation and 0.82 on the external test set, demonstrating strong generalization despite domain shift. In comparison, a general-purpose approach (TotalSegmentator) showed substantially lower performance, with a Dice score of 0.15, primarily due to under-segmentation of the gland. These results highlight the importance of task-specific, multimodal segmentation strategies and demonstrate the potential of the proposed approach for reliable integration into clinical research workflows.
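For readers less familiar with the metric quoted above, the Dice score of a predicted mask P against a reference mask T is 2|P ∩ T| / (|P| + |T|). The sketch below works on flattened binary masks and uses made-up toy masks (not data from the paper) to show why under-segmentation drives the score down: a prediction covering only a small part of the gland has few overlapping voxels relative to the reference size.

```python
def dice_score(pred, target):
    """Dice coefficient for two flattened binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy reference mask: 10 gland voxels out of 20.
reference = [1] * 10 + [0] * 10
under_segmented = [1] * 1 + [0] * 19   # finds only one gland voxel
close_match = [1] * 9 + [0] * 11       # misses one gland voxel

dice_under = dice_score(under_segmented, reference)
dice_close = dice_score(close_match, reference)
print(f"under-segmented: {dice_under:.3f}, close match: {dice_close:.3f}")
```

The under-segmented toy mask scores 2/11 while the close match scores 18/19, even though neither prediction contains a single false-positive voxel.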
To facilitate reproducibility and deployment, the model has been fully containerized and is available as a ready-to-use inference tool.
[150] Ego-Grounding for Personalized Question-Answering in Egocentric Videos
Junbin Xiao,Shenglang Zhang,Pengxiang Zhu,Angela Yao
Main category: cs.CV
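TL;DR: This paper presents the first systematic analysis of multimodal large language models (MLLMs) for personalized question answering in egocentric videos, builds MyEgo, the first dataset supporting ego-grounding evaluation, and finds that current mainstream MLLMs face significant bottlenecks in understanding "me" and remembering "my past".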
Details
Motivation: Existing MLLMs lack the ability to model the camera wearer ("me") and so cannot support truly personalized video QA; dedicated datasets and benchmarks are needed to evaluate ego-grounding and long-term memory.
Method: Builds the MyEgo dataset (541 long videos, 5K personalized questions) covering three question types, "my things", "my activities", and "my past"; systematically evaluates many types of MLLMs (open-source vs. proprietary, thinking vs. non-thinking, different scales) and runs ablations (e.g., providing explicit evidence) to analyze performance bottlenecks.
Result: All advanced MLLMs fall far short of humans (GPT-5 about 46%, Qwen3-VL about 36%, humans above 85%); explicit reasoning and model scaling bring no consistent gains; performance rises only when relevant evidence is provided, and the gains drop quickly over time, indicating serious deficiencies in "tracking me" and "remembering my past".
Conclusion: Ego-grounding and long-range memory are the core challenges for personalized QA in egocentric videos; MyEgo provides a key benchmark and starting point for this direction.
Abstract: We present the first systematic analysis of multimodal large language models (MLLMs) in personalized question-answering requiring ego-grounding - the ability to understand the camera-wearer in egocentric videos. To this end, we introduce MyEgo, the first egocentric VideoQA dataset designed to evaluate MLLMs' ability to understand, remember, and reason about the camera wearer. MyEgo comprises 541 long videos and 5K personalized questions asking about "my things", "my activities", and "my past". Benchmarking reveals that competitive MLLMs across variants, including open-source vs. proprietary, thinking vs. non-thinking, and small vs. large scales, all struggle on MyEgo. Top closed- and open-source models (e.g., GPT-5 and Qwen3-VL) achieve only about 46% and 36% accuracy, trailing human performance by nearly 40% and 50%, respectively. Surprisingly, neither explicit reasoning nor model scaling yields consistent improvements. Models improve when relevant evidence is explicitly provided, but gains drop over time, indicating limitations in tracking and remembering "me" and "my past". These findings collectively highlight the crucial role of ego-grounding and long-range memory in enabling personalized QA in egocentric videos. We hope MyEgo and our analyses catalyze further progress in these areas for egocentric personalized assistance. Data and code are available at https://github.com/Ryougetsu3606/MyEgo
[151] SDesc3D: Towards Layout-Aware 3D Indoor Scene Generation from Short Descriptions
Jie Feng,Jiawei Shen,Junjia Huang,Junpeng Zhang,Mingtao Feng,Weisheng Dong,Guanbin Li
Main category: cs.CV
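TL;DR: This paper proposes SDesc3D, a framework that generates physically plausible, richly detailed 3D indoor scenes from short text through multi-view structural prior augmentation and functionality-aware layout grounding.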
Details
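Motivation: Existing text-conditioned 3D scene generation methods show poor physical plausibility and insufficient detail with short text inputs, mainly because they rely on explicit semantic cues and lack sufficient 3D reasoning capabilities (such as prior integration and spatial anchoring).
Method: Proposes the SDesc3D framework: 1) multi-view scene prior augmentation, which maps sparse text to multi-view relational priors; 2) functionality-aware layout grounding, which uses regional functionality to implicitly define spatial anchors and reasons about layout hierarchically; 3) an iterative reflection-rectification scheme that progressively improves structural plausibility.
Result: Substantially outperforms existing methods on short-text conditioned 3D indoor scene generation, producing scenes with higher physical plausibility and semantic richness.
Conclusion: Combining multi-view structural priors with functional semantic anchoring effectively improves short-text-driven 3D scene generation and offers a new paradigm for interactive 3D environment construction.
Abstract: 3D indoor scene generation conditioned on short textual descriptions provides a promising avenue for interactive 3D environment construction without the need for labor-intensive layout specification. Despite recent progress in text-conditioned 3D scene generation, existing works suffer from poor physical plausibility and insufficient detail richness in such semantic condensation cases, largely due to their reliance on explicit semantic cues about compositional objects and their spatial relationships. This limitation highlights the need for enhanced 3D reasoning capabilities, particularly in terms of prior integration and spatial anchoring. Motivated by this, we propose SDesc3D, a short-text conditioned 3D indoor scene generation framework that leverages multi-view structural priors and regional functionality implications to enable 3D layout reasoning under sparse textual guidance. Specifically, we introduce a Multi-view scene prior augmentation that enriches underspecified textual inputs with aggregated multi-view structural knowledge, shifting from inaccessible semantic relation cues to multi-view relational prior aggregation.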
Building on this, we design a Functionality-aware layout grounding, employing regional functionality grounding for implicit spatial anchors and conducting hierarchical layout reasoning to enhance scene organization and semantic plausibility. Furthermore, an Iterative reflection-rectification scheme is employed for progressive structural plausibility refinement via self-rectification. Extensive experiments show that our method outperforms existing approaches on short-text conditioned 3D indoor scene generation. Code will be publicly available.
[152] NearID: Identity Representation Learning via Near-identity Distractors
Aleksandar Cvejic,Rameen Abdal,Abdelrahman Eldesokey,Bernard Ghanem,Peter Wonka
Main category: cs.CV
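TL;DR: This paper proposes the NearID framework, which introduces semantically similar but identity-distinct near-identity distractors to remove background context shortcuts and evaluate only how well vision encoders discriminate object identity; it builds the NearID dataset and a strict margin-based evaluation protocol, finds that existing pre-trained encoders perform very poorly on this task (SSR as low as 30.7%), and proposes a two-tier contrastive objective that, on a frozen backbone, substantially improves identity discrimination (SSR of 99.2%) and aligns better with human judgments.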
Details
Motivation: Existing vision encoders entangle object identity with background context in identity-related tasks (such as personalized generation and image editing), making representations and metrics unreliable; a new evaluation paradigm that strips away background interference and focuses on identity discrimination is needed.
Method: Proposes the NearID framework: 1) constructs the NearID dataset (19K identities, 316K matched-background near-identity distractors); 2) designs a strict margin-based evaluation protocol (Sample Success Rate, SSR); 3) applies a two-tier contrastive objective on a frozen backbone that enforces the ranking: same identity > near-identity distractor > random negative.
Result: Pre-trained encoders score SSR as low as 30.7% under the NearID protocol and often rank distractors above true cross-view matches; the proposed method raises SSR to 99.2%, improves part-level discrimination by 28.0%, and aligns more closely with human judgments on DreamBench++.
Conclusion: NearID exposes a fundamental weakness of current vision encoders in identity discrimination and provides the first reproducible, context-independent identity evaluation benchmark together with an effective training scheme, laying a foundation for trustworthy evaluation and optimization of personalized vision models.
Abstract: When evaluating identity-focused tasks such as personalized generation and image editing, existing vision encoders entangle object identity with background context, leading to unreliable representations and metrics. We introduce the first principled framework to address this vulnerability using Near-identity (NearID) distractors, where semantically similar but distinct instances are placed on the exact same background as a reference image, eliminating contextual shortcuts and isolating identity as the sole discriminative signal. Based on this principle, we present the NearID dataset (19K identities, 316K matched-context distractors) together with a strict margin-based evaluation protocol. Under this setting, pre-trained encoders perform poorly, achieving Sample Success Rates (SSR), a strict margin-based identity discrimination metric, as low as 30.7% and often ranking distractors above true cross-view matches. We address this by learning identity-aware representations on a frozen backbone using a two-tier contrastive objective enforcing the hierarchy: same identity > NearID distractor > random negative. This improves SSR to 99.2%, enhances part-level discrimination by 28.0%, and yields stronger alignment with human judgments on DreamBench++, a human-aligned benchmark for personalization. Project page: https://gorluxor.github.io/NearID/
[153] Interactive Tracking: A Human-in-the-Loop Paradigm with Memory-Augmented Adaptation
Yuqing Huang,Guotian Zeng,Zhenqiao Yuan,Zhenyu He,Xin Li,Yaowei Wang,Ming-Hsuan Yang
Main category: cs.CV
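TL;DR: This paper proposes a new interactive tracking paradigm, builds InteractTrack, the first large-scale interactive tracking benchmark, designs an evaluation protocol, finds that existing methods fail in this setting, and proposes IMAT, a baseline model based on a dynamic memory mechanism.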
Details
Motivation: Existing visual trackers run in a non-interactive, one-shot manner and adapt poorly to real-world scenarios that require human adjustment, so a new interactive tracking paradigm that supports a human in the loop is needed.
Method: Builds the InteractTrack benchmark (150 videos with dense annotations and timestamped language instructions), designs a comprehensive evaluation protocol, and proposes the Interactive Memory-Augmented Tracking (IMAT) model, which uses a dynamic memory mechanism to learn from user feedback and update tracking behavior in real time.
Result: Experiments show that current SOTA trackers degrade markedly in interactive scenarios and that strong performance on conventional benchmarks does not transfer; IMAT, as a new baseline, shows better adaptability.
Conclusion: The paper lays a research foundation for interactive tracking, moving visual tracking systems toward more intelligent, adaptive, and human-collaborative operation.
Abstract: Existing visual trackers mainly operate in a non-interactive, fire-and-forget manner, making them impractical for real-world scenarios that require human-in-the-loop adaptation. To overcome this limitation, we introduce Interactive Tracking, a new paradigm that allows users to guide the tracker at any time using natural language commands. To support research in this direction, we make three main contributions. First, we present InteractTrack, the first large-scale benchmark for interactive tracking, containing 150 videos with dense bounding box annotations and timestamped language instructions. Second, we propose a comprehensive evaluation protocol and evaluate 25 representative trackers, showing that state-of-the-art methods fail in interactive scenarios; strong performance on conventional benchmarks does not transfer. Third, we introduce Interactive Memory-Augmented Tracking (IMAT), a new baseline that employs a dynamic memory mechanism to learn from user feedback and update tracking behavior accordingly. Our benchmark, protocol, and baseline establish a foundation for developing more intelligent, adaptive, and collaborative tracking systems, bridging the gap between automated perception and human guidance. The full benchmark, tracking results, and analysis are available at https://github.com/NorahGreen/InteractTrack.git.
[154] Curia-2: Scaling Self-Supervised Learning for Radiology Foundation Models
Antoine Saporta,Baptiste Callard,Corentin Dancette,Julien Khlaut,Charles Corbière,Leo Butsanets,Amaury Prat,Pierre Manceron
Main category: cs.CV
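TL;DR: This paper proposes Curia-2, a new medical imaging foundation model designed for CT and MRI, which improves the pre-training strategy and representation quality, scales multi-modal Vision Transformers to the billion-parameter level for the first time, and restructures the CuriaBench benchmark into 2D and 3D evaluation tracks.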
Details
Motivation: The rapid growth of medical imaging has overloaded radiologists, and existing foundation models still have room for improvement on complex radiological volumes.
Method: Building on the Curia framework, proposes Curia-2 with an improved pre-training strategy and better representation quality; scales the architecture to billion-parameter Vision Transformers; restructures CuriaBench into a 2D (slice-level) track and a 3D (volume-level) track.
Result: Curia-2 outperforms existing foundation models across vision-focused tasks and is competitive with vision-language models on clinically complex tasks such as finding detection.
Conclusion: Curia-2 marks notable progress for multi-modal medical imaging foundation models, and its public weights should support follow-up research.
Abstract: The rapid growth of medical imaging has fueled the development of Foundation Models (FMs) to reduce the growing, unsustainable workload on radiologists. While recent FMs have shown the power of large-scale pre-training for CT and MRI analysis, there remains significant room to optimize how these models learn from complex radiological volumes. Building upon the Curia framework, this work introduces Curia-2, which significantly improves the original pre-training strategy and representation quality to better capture the specificities of radiological data. The proposed methodology enables scaling the architecture up to billion-parameter Vision Transformers, marking a first for multi-modal CT and MRI FMs. Furthermore, we formalize the evaluation of these models by extending and restructuring CuriaBench into two distinct tracks: a 2D track tailored for slice-based vision models and a 3D track for volumetric benchmarking. Our results demonstrate that Curia-2 outperforms all FMs on vision-focused tasks and fares competitively against vision-language models on clinically complex tasks such as finding detection. Weights will be made publicly available to foster further research.
[155] Attention at Rest Stays at Rest: Breaking Visual Inertia for Cognitive Hallucination Mitigation
Boyang Gong,Yu Zheng,Fanye Kong,Jie Zhou,Jiwen Lu
Main category: cs.CV
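TL;DR: This paper proposes Inertia-aware Visual Excitation (IVE), a training-free method that mitigates the inertia of visual attention in multimodal large language models (MLLMs), improving their ability to reason about inter-object relations and notably reducing cognitive hallucinations.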
Details
Motivation: Existing hallucination mitigation methods mainly target perceptual hallucinations (such as errors in object existence or attributes) and struggle with cognitive hallucinations that require reasoning over inter-object relations; the authors find that visual attention in MLLMs becomes largely static after early decoding ("visual inertia"), which hinders compositional cognitive inference.
Method: Identifies visual inertia through token-wise attention analysis; proposes the training-free IVE method: 1) dynamically selects visual tokens that are emerging relative to historical attention trends and distinguishes inertial tokens; 2) introduces an inertia-aware penalty that discourages over-concentration of attention and persistent focus on localized regions.
Result: IVE is effective across several base MLLMs and hallucination benchmarks, with especially clear gains on cognitive hallucinations; it requires no fine-tuning and is model-agnostic and plug-and-play.
Conclusion: Visual attention inertia is a key factor limiting cognitive inference in MLLMs; by modeling the dynamic responsiveness of attention, IVE strengthens compositional inference and offers a new approach to mitigating cognitive hallucinations.
Abstract: Like a body at rest that stays at rest, we find that visual attention in multimodal large language models (MLLMs) exhibits pronounced inertia, remaining largely static once settled during early decoding steps and failing to support the compositional understanding required for cognitive inference. While existing hallucination mitigation methods mainly target perceptual hallucinations concerning object existence or attributes, they remain inadequate for such cognitive hallucinations that require inter-object relational deduction. Through token-wise attention analysis, we identify this visual inertia as a key factor: attention to semantically critical regions remains persistently focused and fails to dynamically support relational inference. We thereby propose a training-free Inertia-aware Visual Excitation (IVE) method that breaks this inertial pattern by modeling cognitive inference as the dynamic responsiveness of visual attention. Specifically, IVE selects visual tokens that are dynamically emerging relative to historical attention trends while distinguishing tokens exhibiting inertial behavior. To further facilitate compositional inference, IVE introduces an inertia-aware penalty that discourages over-concentration and limits the persistence of attention within localized regions. Extensive experiments show that IVE is effective across various base MLLMs and multiple hallucination benchmarks, particularly for cognitive hallucinations.
[156] Resonance4D: Frequency-Domain Motion Supervision for Preset-Free Physical Parameter Learning in 4D Dynamic Physical Scene Simulation
Changshe Zhang,Jie Feng,Siyu Chen,Guanbin Li,Ronghua Shang,Junpeng Zhang
Main category: cs.CV
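TL;DR: Resonance4D is a lightweight physics-driven 4D dynamic simulation framework that couples 3D Gaussian Splatting with the Material Point Method; through Dual-domain Motion Supervision (DMS) and a full-parameter physical recovery strategy, it substantially reduces compute and memory costs while preserving physical realism and motion consistency.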
Details
Motivation: Existing methods rely on computationally expensive video diffusion or optical-flow supervision and optimize only some material parameters, which makes scenes with complex materials and dynamics hard to model.
Method: Proposes the Resonance4D framework, which couples 3D Gaussian Splatting with the Material Point Method; introduces Dual-domain Motion Supervision (DMS), which jointly enforces spatial structural consistency and frequency-domain spectral consistency; combines zero-shot text-prompted segmentation with simulation-guided initialization to decompose Gaussians into object-part-level regions and jointly optimize the full set of material parameters.
Result: Shows high physical fidelity and motion consistency on synthetic and real scenes; peak GPU memory drops from over 35 GB to about 20 GB, enabling operation on a single consumer-grade GPU.
Conclusion: Resonance4D addresses two bottlenecks of physics-driven 4D simulation, costly supervision and incomplete parameter optimization, offering a new paradigm for efficient, realistic, and scalable 4D dynamic modeling.
Abstract: Physics-driven 4D dynamic simulation from static 3D scenes remains constrained by an overlooked contradiction: reliable motion supervision often relies on online video diffusion or optical-flow pipelines whose computational cost exceeds that of the simulator itself. Existing methods further simplify inverse physical modeling by optimizing only partial material parameters, limiting realism in scenes with complex materials and dynamics. We present Resonance4D, a physics-driven 4D dynamic simulation framework that couples 3D Gaussian Splatting with the Material Point Method through lightweight yet physically expressive supervision. Our key insight is that dynamic consistency can be enforced without dense temporal generation by jointly constraining motion in complementary domains. To this end, we introduce Dual-domain Motion Supervision (DMS), which combines spatial structural consistency for local deformation with frequency-domain spectral consistency for oscillatory and global dynamic patterns, substantially reducing training cost and memory overhead while preserving physically meaningful motion cues. To enable stable full-parameter physical recovery, we further combine zero-shot text-prompted segmentation with simulation-guided initialization to automatically decompose Gaussians into object-part-level regions and support joint optimization of full material parameters.
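To give a feel for what a frequency-domain consistency term might look like, the sketch below compares the amplitude spectra of two 1-D motion trajectories using a plain discrete Fourier transform. This is only an illustration under stated assumptions: the abstract does not give the actual DMS formulation, and `amplitude_spectrum`, `spectral_loss`, and the sinusoidal test signals are hypothetical. The point it shows is that an amplitude-spectrum comparison is insensitive to phase, so two oscillations at the same frequency match even when they are shifted in time, while a different oscillation frequency is penalised.

```python
import cmath
import math


def amplitude_spectrum(signal):
    """One-sided amplitude spectrum of a real 1-D signal via a direct DFT."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(signal))) / n
        for k in range(n // 2 + 1)
    ]


def spectral_loss(simulated, reference):
    """Mean absolute difference between amplitude spectra (assumed form)."""
    a, b = amplitude_spectrum(simulated), amplitude_spectrum(reference)
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Hypothetical displacement traces of one tracked point over 32 frames.
frames = 32
reference = [math.sin(2 * math.pi * 2 * t / frames) for t in range(frames)]
same_freq_shifted = [math.cos(2 * math.pi * 2 * t / frames) for t in range(frames)]
wrong_freq = [math.sin(2 * math.pi * 5 * t / frames) for t in range(frames)]

loss_shifted = spectral_loss(same_freq_shifted, reference)
loss_wrong = spectral_loss(wrong_freq, reference)
print(f"same frequency, shifted phase: {loss_shifted:.2e}")
print(f"different frequency:           {loss_wrong:.4f}")
```

A term of this kind would complement, not replace, a spatial-domain term, since it deliberately ignores where in time the motion occurs.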
Experiments on both synthetic and real scenes show that Resonance4D achieves strong physical fidelity and motion consistency while reducing peak GPU memory from over 35\,GB to around 20\,GB, enabling high-fidelity physics-driven 4D simulation on a single consumer-grade GPU.
[157] MTLSI-Net: A Linear Semantic Interaction Network for Parameter-Efficient Multi-Task Dense Prediction
Chen Liu,Hengyu Man,Xiaopeng Fan,Debin Zhao
Main category: cs.CV
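TL;DR: This paper proposes MTLSI-Net, which uses a linear attention mechanism to achieve low-complexity cross-task interaction in multi-task dense prediction and reaches SOTA performance on NYUDv2 and PASCAL-Context.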
Details
Motivation: Standard self-attention has quadratic complexity on high-resolution features, making it hard to model global cross-task interactions efficiently in multi-task dense prediction.
Method: Proposes MTLSI-Net with three core modules: a Multi-Task Multi-scale Query Linear Fusion Block (a shared global context matrix enables cross-task modeling at linear complexity), a Semantic Token Distiller (compresses redundant features and distills essential cross-task knowledge), and a Cross-Window Integrated Attention Block (a dual-branch design injects global semantics, balancing global consistency and spatial precision).
Result: Achieves SOTA performance on the NYUDv2 and PASCAL-Context datasets, supporting the method's effectiveness and efficiency.
Conclusion: MTLSI-Net models comprehensive cross-task interactions at linear complexity with fewer parameters, providing an efficient and effective solution for multi-task dense prediction.
Abstract: Multi-task dense prediction aims to perform multiple pixel-level tasks simultaneously. However, capturing global cross-task interactions remains non-trivial due to the quadratic complexity of standard self-attention on high-resolution features. To address this limitation, we propose a Multi-Task Linear Semantic Interaction Network (MTLSI-Net), which facilitates cross-task interaction through linear attention. Specifically, MTLSI-Net incorporates three key components: a Multi-Task Multi-scale Query Linear Fusion Block, which captures cross-task dependencies across multiple scales with linear complexity using a shared global context matrix; a Semantic Token Distiller that compresses redundant features into compact semantic tokens, distilling essential cross-task knowledge; and a Cross-Window Integrated attention Block that injects global semantics into local features via a dual-branch architecture, preserving both global consistency and spatial precision. These components collectively enable the network to capture comprehensive cross-task interactions at linear complexity with reduced parameters. Extensive experiments on NYUDv2 and PASCAL-Context demonstrate that MTLSI-Net achieves state-of-the-art performance, validating its effectiveness and efficiency in multi-task learning.
[158] ProDiG: Progressive Diffusion-Guided Gaussian Splatting for Aerial to Ground Reconstruction
Sirshapan Mitra,Yogesh S. Rawat
Main category: cs.CV
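TL;DR: This paper proposes the ProDiG framework, which uses progressive Gaussian Splatting transformation and diffusion guidance to generate ground-level views and a coherent 3D scene model from aerial-only imagery, without multi-altitude ground-truth data.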
Details
Motivation: Existing methods perform poorly under extreme viewpoint changes, missing intermediate observations, and large scale differences, or depend on multi-altitude ground-truth data that is hard to obtain.
Method: Proposes ProDiG (Progressive Altitude Gaussian Splatting), which combines a geometry-aware causal attention module that injects epipolar structure with a distance-adaptive Gaussian module that dynamically adjusts Gaussian scale and opacity, enabling multi-stage progressive reconstruction.
Result: Significantly outperforms existing methods on synthetic and real-world datasets in visual quality, geometric consistency, and robustness to extreme viewpoint changes.
Conclusion: ProDiG achieves geometrically reliable and visually realistic aerial-to-ground reconstruction without additional ground-truth viewpoints.
Abstract: Generating ground-level views and coherent 3D site models from aerial-only imagery is challenging due to extreme viewpoint changes, missing intermediate observations, and large scale variations. Existing methods either refine renderings post-hoc, often producing geometrically inconsistent results, or rely on multi-altitude ground-truth, which is rarely available. Gaussian Splatting and diffusion-based refinements improve fidelity under small variations but fail under wide aerial-to-ground gaps. To address these limitations, we introduce ProDiG (Progressive Altitude Gaussian Splatting), a diffusion-guided framework that progressively transforms aerial 3D representations toward ground-level fidelity. ProDiG synthesizes intermediate-altitude views and refines the Gaussian representation at each stage using a geometry-aware causal attention module that injects epipolar structure into reference-view diffusion. A distance-adaptive Gaussian module dynamically adjusts Gaussian scale and opacity based on camera distance, ensuring stable reconstruction across large viewpoint gaps. Together, these components enable progressive, geometrically grounded refinement without requiring additional ground-truth viewpoints. Extensive experiments on synthetic and real-world datasets demonstrate that ProDiG produces visually realistic ground-level renderings and coherent 3D geometry, significantly outperforming existing approaches in terms of visual quality, geometric consistency, and robustness to extreme viewpoint changes.
[159] Test-Time Adaptation for Height Completion via Self-Supervised ViT Features and Monocular Foundation Models
Osher Rafaeli,Tal Svoray,Ariel Nahlieli
Main category: cs.CV
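TL;DR: This paper proposes Prior2DSM, a training-free framework for digital surface model (DSM) completion that uses DINOv3 visual features and monocular depth foundation models, propagating height priors at test time through semantic feature-space matching and calibrating with lightweight LoRA + MLP adaptation to achieve accurate, generalizable metric DSM completion.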
Details
Motivation: Existing DSMs often contain missing or outdated regions; traditional interpolation fails because it assumes spatial continuity, while learning-based methods are limited by supervised training and sensor specificity and generalize poorly.
Method: Combines self-supervised DINOv3 ViT features with monocular depth foundation models and propagates height priors at test time through semantic feature-space correspondence; performs test-time adaptation with LoRA and a lightweight MLP that predicts spatially varying scale and shift parameters to convert relative depth into metric height.
Result: Outperforms interpolation, prior-based rescaling, and SOTA monocular depth estimation models on multiple metrics, with up to a 46% reduction in RMSE; also supports DSM updating and coupled RGB-DSM generation.
Conclusion: Prior2DSM is a general, training-free, foundation-model-based approach to DSM completion that balances accuracy, structural fidelity, and cross-domain generalization.
Abstract: Accurate digital surface models (DSMs) are essential for many geospatial applications, including urban monitoring, environmental analyses, infrastructure management, and change detection. However, large-scale DSMs frequently contain incomplete or outdated regions due to acquisition limitations, reconstruction artifacts, or changes in the built environment. Traditional height completion approaches primarily rely on spatial interpolation, which assumes spatial continuity and therefore fails when objects are missing. Recent learning-based approaches improve reconstruction quality but typically require supervised training on sensor-specific datasets, limiting their generalization across domains and sensing conditions. We propose Prior2DSM, a training-free framework for metric DSM completion that operates entirely at test time by leveraging foundation models. Unlike previous height completion approaches that require task-specific training, the proposed method combines self-supervised Vision Transformer (ViT) features from DINOv3 with monocular depth foundation models to propagate metric information from incomplete height priors through semantic feature-space correspondence. Test-time adaptation (TTA) is performed using parameter-efficient low-rank adaptation (LoRA) together with a lightweight multilayer perceptron (MLP), which predicts spatially varying scale and shift parameters to convert relative depth estimates into metric heights. Experiments demonstrate consistent improvements over interpolation-based methods, prior-based rescaling height approaches, and state-of-the-art monocular depth estimation models.
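The scale-and-shift conversion mentioned above can be illustrated with its simplest form: a single global scale s and shift b fitted by least squares on the pixels where a height prior exists, then applied to fill the gaps, so that height ≈ s · depth + b. This is a minimal sketch of a global linear fit, not of Prior2DSM itself, which predicts spatially varying parameters with a learned MLP; the function names and the toy values are hypothetical.

```python
def fit_scale_shift(relative_depth, height_prior):
    """Least-squares fit of height = s * depth + b on pixels with a known prior.

    height_prior uses None for missing pixels. A single global (s, b) pair is
    fitted here; a spatially varying version would predict one pair per pixel.
    """
    pairs = [(d, h) for d, h in zip(relative_depth, height_prior) if h is not None]
    if len(pairs) < 2:
        raise ValueError("need at least two pixels with a height prior")
    mean_d = sum(d for d, _ in pairs) / len(pairs)
    mean_h = sum(h for _, h in pairs) / len(pairs)
    var_d = sum((d - mean_d) ** 2 for d, _ in pairs)
    if var_d == 0:
        raise ValueError("relative depth is constant on the known pixels")
    cov = sum((d - mean_d) * (h - mean_h) for d, h in pairs)
    scale = cov / var_d
    shift = mean_h - scale * mean_d
    return scale, shift


def complete_heights(relative_depth, height_prior):
    """Keep known heights and fill missing ones from the fitted linear map."""
    scale, shift = fit_scale_shift(relative_depth, height_prior)
    return [h if h is not None else scale * d + shift
            for d, h in zip(relative_depth, height_prior)]


# Toy 1-D strip of pixels: unitless relative estimates and a partial prior in metres.
depth = [0.1, 0.2, 0.3, 0.4, 0.5]
prior = [5.0, None, 9.0, None, 13.0]
scale, shift = fit_scale_shift(depth, prior)
completed = complete_heights(depth, prior)
print(scale, shift, completed)
```

A single global pair cannot correct errors that vary across the scene, which is one plausible motivation for predicting the parameters per location.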
Prior2DSM reduces reconstruction error while preserving structural fidelity, achieving up to a 46% reduction in RMSE compared to linear fitting of MDE, and further enables DSM updating and coupled RGB-DSM generation.
[160] Decouple and Rectify: Semantics-Preserving Structural Enhancement for Open-Vocabulary Remote Sensing Segmentation
Jie Feng,Fengze Li,Junpeng Zhang,Siyu Chen,Yuping Liang,Junying Chen,Ronghua Shang
Main category: cs.CV
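TL;DR: This paper proposes the DR-Seg framework, which decouples CLIP features into semantics-dominated and structure-dominated subspaces and applies DINO-guided graph rectification with uncertainty-guided adaptive fusion to improve boundary accuracy and semantic consistency in open-vocabulary semantic segmentation of remote sensing images.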
Details
Motivation: CLIP's globally aligned visual representations struggle to capture the fine-grained spatial structure that remote sensing images require, and existing methods that add DINO features ignore the functional heterogeneity of CLIP channels, so structural enhancement is poorly localized and can damage semantic integrity.
Method: DR-Seg has three parts: 1) decoupling CLIP features into semantics-dominated and structure-dominated subspaces; 2) a DINO-guided, prior-driven graph rectification module that produces a high-fidelity structurally refined branch; 3) an uncertainty-guided adaptive fusion module that dynamically fuses the refined branch with the original CLIP branch.
Result: Sets a new SOTA on eight remote sensing benchmarks, markedly improving boundary delineation while preserving language-aligned semantics.
Conclusion: CLIP feature channels are functionally heterogeneous; explicitly decoupling them and selectively enhancing structural information balances semantic generalization with spatial detail, offering a new paradigm for open-vocabulary remote sensing segmentation.
Abstract: Open-vocabulary semantic segmentation in the remote sensing (RS) field requires both language-aligned recognition and fine-grained spatial delineation. Although CLIP offers robust semantic generalization, its global-aligned visual representations inherently struggle to capture structural details. Recent methods attempt to compensate for this by introducing RS-pretrained DINO features. However, these methods treat CLIP representations as a monolithic semantic space and cannot localize where structural enhancement is required, failing to effectively delineate boundaries while risking the disruption of CLIP's semantic integrity. To address this limitation, we propose DR-Seg, a novel decouple-and-rectify framework in this paper. Our method is motivated by the key observation that CLIP feature channels exhibit distinct functional heterogeneity rather than forming a uniform semantic space. Building on this insight, DR-Seg decouples CLIP features into semantics-dominated and structure-dominated subspaces, enabling targeted structural enhancement by DINO without distorting language-aligned semantics. Subsequently, a prior-driven graph rectification module injects high-fidelity structural priors under DINO guidance to form a refined branch, while an uncertainty-guided adaptive fusion module dynamically integrates this refined branch with the original CLIP branch for final prediction. Comprehensive experiments across eight benchmarks demonstrate that DR-Seg establishes a new state-of-the-art.
[161] Are VLMs Lost Between Sky and Space? LinkS$^2$Bench for UAV-Satellite Dynamic Cross-View Spatial Intelligence
Dian Liu,Jie Feng,Di Li,Yuhui Zheng,Guanbin Li,Weisheng Dong,Guangming Shi
Main category: cs.CV
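TL;DR: This paper proposes LinkS$^2$Bench, the first comprehensive benchmark for evaluating the dynamic, wide-area cross-view spatial intelligence of vision-language models (VLMs), pairing 1,022 minutes of UAV video with high-resolution satellite imagery covering 200 km$^2$ and providing 17.9k high-quality question-answer pairs; experiments identify cross-view dynamic alignment as the key bottleneck, and a proposed Cross-View Alignment Adapter improves performance.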
Details
Motivation: Existing benchmarks cover only isolated UAV videos or static satellite imagery and cannot evaluate VLMs on dynamic local-to-global spatial mapping and cross-view reasoning; a new benchmark that reflects air-space collaborative spatial intelligence in real emergency and security scenarios is needed.
Method: Builds the LinkS$^2$Bench benchmark: an LMM-assisted pipeline with careful human annotation aligns 1,022 minutes of dynamic UAV video with high-resolution satellite imagery covering 200 km$^2$ and produces 17.9k question-answer pairs spanning 12 fine-grained tasks across four dimensions (perception, localization, relation, and reasoning); designs a Cross-View Alignment Adapter that models cross-view alignment explicitly; runs a systematic evaluation of 18 mainstream VLMs along with fine-tuning experiments.
Result: The 18 representative VLMs lag well behind human baselines on LinkS$^2$Bench, confirming cross-view dynamic alignment as the core bottleneck; the proposed adapter improves performance; fine-tuning experiments indicate that LinkS$^2$Bench can help adapt VLMs to complex spatial reasoning.
Conclusion: LinkS$^2$Bench fills a gap in evaluating air-space collaborative spatial intelligence, exposes fundamental limits of VLMs in dynamic cross-view understanding, and provides a new benchmark and technical path for improving their geospatial cognition.
Abstract: Synergistic spatial intelligence between Unmanned Aerial Vehicles (UAVs) and satellites is indispensable for emergency response and security operations, as it uniquely integrates macro-scale global coverage with dynamic, real-time local perception. However, the capacity of Vision-Language Models (VLMs) to master this complex interplay remains largely unexplored. This gap persists primarily because existing benchmarks are confined to isolated UAV videos or static satellite imagery, failing to evaluate the dynamic local-to-global spatial mapping essential for comprehensive cross-view reasoning. To bridge this gap, we introduce LinkS$^2$Bench, the first comprehensive benchmark designed to evaluate VLMs' wide-area, dynamic cross-view spatial intelligence. LinkS$^2$Bench links 1,022 minutes of dynamic UAV footage with high-resolution satellite imagery covering over 200 km$^2$. Through an LMM-assisted pipeline and rigorous human annotation, we constructed 17.9k high-quality question-answer pairs comprising 12 fine-grained tasks across four dimensions: perception, localization, relation, and reasoning. Evaluations of 18 representative VLMs reveal a substantial gap compared to human baselines, identifying accurate cross-view dynamic alignment as the critical bottleneck. To alleviate this, we design a Cross-View Alignment Adapter, demonstrating that explicit alignment significantly improves model performance.
Furthermore, fine-tuning experiments underscore the potential of LinkS$^2$Bench in advancing VLM adaptation for complex spatial reasoning.
[162] Rare-Aware Autoencoding: Reconstructing Spatially Imbalanced Data
Alejandro Castañeda Garcia,Jan van Gemert,Daan Brinks,Nergis Tömen
Main category: cs.CV
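TL;DR: This paper proposes an autoencoder improvement for spatially imbalanced data that uses a self-entropy loss and a sample propagation mechanism to better reconstruct rare spatial locations.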
Details
Motivation: Autoencoders perform poorly on images with spatially non-uniform sampling, which is common in medical imaging, biology, and physics: because background dominates, the model is biased toward the majority pattern, loses detail, and produces blurred reconstructions. Method: Two complementary components: (i) a self-entropy-based loss that upweights statistically rare spatial locations; (ii) Sample Propagation, a training mechanism that selectively replays hard-to-reconstruct samples. Result: Validated on a simulated dataset and three real-world datasets (physical, biological, astronomical), the method outperforms baselines on several reconstruction metrics, particularly under spatially imbalanced distributions. Conclusion: Spatial data representation and rare samples matter in unsupervised image reconstruction, and the proposed method mitigates the reconstruction bias caused by spatial imbalance. Abstract: Autoencoders can be challenged by spatially non-uniform sampling of image content. This is common in medical imaging, biology, and physics, where informative patterns occur rarely at specific image coordinates, as background dominates these locations in most samples, biasing reconstructions toward the majority appearance. In practice, autoencoders are biased toward dominant patterns, resulting in the loss of fine-grained detail and causing blurred reconstructions for rare spatial inputs, especially under spatial data imbalance. We address spatial imbalance with two complementary components: (i) a self-entropy-based loss that upweights statistically uncommon spatial locations and (ii) Sample Propagation, a replay mechanism that selectively re-exposes the model to hard-to-reconstruct samples across batches during training. We benchmark existing data balancing strategies, originally developed for supervised classification, in the unsupervised reconstruction setting. Drawing on the limitations of these approaches, our method specifically targets spatial imbalance by encouraging models to focus on statistically rare locations, improving reconstruction consistency compared to existing baselines. We validate on a simulated dataset with controlled spatial imbalance conditions, and on three uncontrolled, diverse real-world datasets spanning physical, biological, and astronomical domains. Our approach outperforms baselines on various reconstruction metrics, particularly under spatial imbalance distributions. These results highlight the importance of data representation in a batch and emphasize rare samples in unsupervised image reconstruction.
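One way to picture a loss that upweights statistically uncommon spatial locations is to weight each pixel's squared error by its self-information, -log p, where p is how often the observed value occurs at that location across the dataset. The sketch below follows that reading on flat binary images. It is an assumption made for illustration and not necessarily the authors' formulation; all function names and the toy data are hypothetical.

```python
import math

def location_frequencies(images):
    """Per-pixel frequency of value 1 across a list of flat binary images."""
    n = len(images)
    return [sum(img[i] for img in images) / n for i in range(len(images[0]))]

def self_information_weights(image, freq, eps=1e-6):
    """Weight each pixel by -log p(observed value at that location)."""
    weights = []
    for value, p_one in zip(image, freq):
        p = p_one if value == 1 else 1.0 - p_one
        weights.append(-math.log(max(p, eps)))
    return weights

def weighted_mse(image, recon, weights):
    """Mean squared error with a per-pixel weight."""
    total = sum(w * (x - y) ** 2 for w, x, y in zip(weights, image, recon))
    return total / len(image)

# Toy data: pixel 0 is foreground in 1 of 4 images (rare), pixel 1 in 3 of 4.
images = [[1, 1], [0, 1], [0, 1], [0, 0]]
freq = location_frequencies(images)            # [0.25, 0.75]
w = self_information_weights(images[0], freq)
print(w[0] > w[1])  # True: the rare foreground pixel gets the larger weight
```

With these weights, an error on a rarely active location costs more than the same error on a location that is usually active, which is the behaviour the abstract describes in words.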
We will make all code and related data available.
[163] IndoorCrowd: A Multi-Scene Dataset for Human Detection, Segmentation, and Tracking with an Automated Annotation Pipeline
Sebastian-Ion Nae,Radu Moldoveanu,Alexandra Stefania Ghita,Adina Magda Florea
Main category: cs.CV
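TL;DR: This paper introduces IndoorCrowd, a dataset for indoor human detection, instance segmentation, and multi-object tracking. It covers four campus scenes with 31 videos (9,913 frames) and human-verified segmentation masks. The authors compare several foundation-model auto-annotators against human labels, establish detection, segmentation, and tracking baselines, and analyze how difficulty varies across scenes.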
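Two of the agreement metrics that this entry's abstract names for comparing auto-annotators with human labels, Cohen's kappa and mask IoU, can be sketched as follows. This is a generic illustration on binary per-pixel labels, not the authors' evaluation code, and the toy masks are hypothetical.

```python
def mask_iou(pred, ref):
    """Intersection over union of two binary masks given as flat 0/1 lists."""
    inter = sum(p and r for p, r in zip(pred, ref))
    union = sum(p or r for p, r in zip(pred, ref))
    return inter / union if union else 1.0

def cohens_kappa(pred, ref):
    """Cohen's kappa for two binary label sequences of equal length."""
    n = len(ref)
    observed = sum(p == r for p, r in zip(pred, ref)) / n
    p_pos, r_pos = sum(pred) / n, sum(ref) / n
    # Agreement expected by chance from the two marginal label rates.
    expected = p_pos * r_pos + (1 - p_pos) * (1 - r_pos)
    return (observed - expected) / (1 - expected) if expected < 1 else 1.0

human = [1, 1, 1, 0, 0, 0, 0, 0]  # hypothetical human-verified mask
auto = [1, 1, 0, 1, 0, 0, 0, 0]   # hypothetical auto-annotator mask
print(mask_iou(auto, human))      # 0.5
print(cohens_kappa(auto, human))
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when most pixels are background.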
Details
Motivation: Existing datasets rarely capture human behaviour in complex, real-world indoor environments at scale, which holds back applications such as surveillance, smart buildings, and human-robot interaction. Method: Build the multi-scene indoor crowd dataset IndoorCrowd with 31 videos (9,913 frames) and human-verified per-instance segmentation masks; use a 620-frame control subset to evaluate the auto-annotators SAM3, GroundingSAM, and EfficientGroundingSAM; provide a further 2,552-frame subset for multi-object tracking in MOTChallenge format; establish detection, segmentation, and tracking baselines with YOLOv8n, YOLOv26n, and RT-DETR-L combined with ByteTrack, BoT-SORT, and OC-SORT. Result: ACS-EC is the most challenging scene, with 79.3% dense frames and a mean instance scale of only 60.8 px; the auto-annotators are scored against human labels using Cohen's kappa, AP, precision, recall, and mask IoU; baseline performance varies substantially across scenes. Conclusion: IndoorCrowd fills a gap in datasets for understanding real, complex indoor crowds and provides a high-quality benchmark and a reproducible evaluation framework for detection, segmentation, and tracking. Abstract: Understanding human behaviour in crowded indoor environments is central to surveillance, smart buildings, and human-robot interaction, yet existing datasets rarely capture real-world indoor complexity at scale. We introduce IndoorCrowd, a multi-scene dataset for indoor human detection, instance segmentation, and multi-object tracking, collected across four campus locations (ACS-EC, ACS-EG, IE-Central, R-Central). It comprises 31 videos (9,913 frames at 5 fps) with human-verified, per-instance segmentation masks. A 620-frame control subset benchmarks three foundation-model auto-annotators (SAM3, GroundingSAM, and EfficientGroundingSAM) against human labels using Cohen's $\kappa$, AP, precision, recall, and mask IoU. A further 2,552-frame subset supports multi-object tracking with continuous identity tracks in MOTChallenge format. We establish detection, segmentation, and tracking baselines using YOLOv8n, YOLOv26n, and RT-DETR-L paired with ByteTrack, BoT-SORT, and OC-SORT. Per-scene analysis reveals substantial difficulty variation driven by crowd density, scale, and occlusion: ACS-EC, with 79.3% dense frames and a mean instance scale of 60.8 px, is the most challenging scene. The project page is available at https://sheepseb.github.io/IndoorCrowd/.
[164] Efficient Reasoning via Thought Compression for Language Segmentation
Qing Zhou,Shiyu Zhang,Yuyu Jia,Junyu Gao,Weiping Ni,Junzheng Wu,Qi Wang
Main category: cs.CV
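TL;DR: WISE is a new efficient-reasoning paradigm that follows a "think twice" strategy (once for learning, once for speed) to cut inference cost sharply while keeping performance high; its variant WISE-S reaches state-of-the-art zero-shot performance on ReasonSeg (58.3 cIoU) with roughly 5x shorter reasoning.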
Details
Motivation: Chain-of-thought (CoT) reasoning improves language-guided segmentation in multimodal models, but generating verbose rationales makes it computationally expensive, which limits practical use. Method: WISE trains the model to generate a structured sequence (concise rationale, then answer, then detailed explanation), uses autoregressive conditioning so that the concise rationale acts as a sufficient summary of the detailed explanation, and reinforces this with a self-distillation objective that rewards both semantic fidelity and conciseness. At inference the detailed explanation is omitted, and the WISE-S prompting strategy injects a brevity instruction to mitigate the resulting distribution shift. Result: WISE-S achieves state-of-the-art zero-shot performance on the ReasonSeg benchmark with 58.3 cIoU, while the average reasoning length drops from 112 to 23 tokens (about 5x compression). Conclusion: WISE shows that detailed reasoning can be internalized into a concise representation, offering a new paradigm for efficient, practical multimodal reasoning. Abstract: Chain-of-thought (CoT) reasoning has significantly improved the performance of large multimodal models in language-guided segmentation, yet its prohibitive computational cost, stemming from generating verbose rationales, limits real-world applicability. We introduce WISE (Wisdom from Internal Self-Exploration), a novel paradigm for efficient reasoning guided by the principle of thinking twice -- once for learning, once for speed. WISE trains a model to generate a structured sequence: a concise rationale, the final answer, and then a detailed explanation. By placing the concise rationale first, our method leverages autoregressive conditioning to enforce that the concise rationale acts as a sufficient summary for generating the detailed explanation. This structure is reinforced by a self-distillation objective that jointly rewards semantic fidelity and conciseness, compelling the model to internalize its detailed reasoning into a compact form. At inference, the detailed explanation is omitted. To address the resulting conditional distribution shift, our inference strategy, WISE-S, employs a simple prompting technique that injects a brevity-focused instruction into the user's query. This final adjustment facilitates the robust activation of the learned concise policy, unlocking the full benefits of our framework. Extensive experiments show that WISE-S achieves state-of-the-art zero-shot performance on the ReasonSeg benchmark with 58.3 cIoU, while reducing the average reasoning length by nearly 5$\times$, from 112 to just 23 tokens.
Code is available at https://github.com/mrazhou/WISE.
[165] Jagle: Building a Large-Scale Japanese Multimodal Post-Training Dataset for Vision-Language Models
Issa Sugiura,Keito Sasagawa,Keisuke Nakao,Koki Maeda,Ziqi Yin,Zhishen Yang,Shuhei Kurita,Yusuke Oda,Ryoko Tokuhisa,Daisuke Kawahara,Naoaki Okazaki
Main category: cs.CV
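TL;DR: This paper presents Jagle, the largest Japanese multimodal post-training dataset to date (about 9.2 million instances), built with strategies such as VLM-based QA generation, translation, and text rendering. It substantially improves Japanese VLM performance and preserves, or even improves, English capability.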
Details
Motivation: English vision-language models (VLMs) rely on large, multi-source VQA datasets, but other languages, Japanese in particular, lack VQA resources of sufficient scale and domain coverage, which severely constrains multilingual VLM development. Method: Rather than relying on existing VQA data, collect heterogeneous sources (images, image-text pairs, PDF documents) and automatically generate Japanese VQA pairs through VLM-based QA generation, translation, text rendering, and other strategies to build Jagle; then post-train and evaluate a 2.2B model. Result: The 2.2B model trained with Jagle surpasses InternVL3.5-2B in average score across ten Japanese evaluation tasks and comes within five points of Qwen3-VL-2B-Instruct; training jointly with FineVision improves English performance rather than degrading it. Conclusion: Jagle shows that a high-quality non-English multimodal dataset can be built efficiently without existing VQA data, offers a template for multilingual VLM development, and is released with the trained models and code to support follow-up research. Abstract: Developing vision-language models (VLMs) that generalize across diverse tasks requires large-scale training datasets with diverse content. In English, such datasets are typically constructed by aggregating and curating numerous existing visual question answering (VQA) resources. However, this strategy does not readily extend to other languages, where VQA datasets remain limited in both scale and domain coverage, posing a major obstacle to building high-quality multilingual and non-English VLMs. In this work, we introduce Jagle, the largest Japanese multimodal post-training dataset to date, comprising approximately 9.2 million instances across diverse tasks. Rather than relying on existing VQA datasets, we collect heterogeneous source data, including images, image-text pairs, and PDF documents, and generate VQA pairs through multiple strategies such as VLM-based QA generation, translation, and text rendering. Experiments demonstrate that a 2.2B model trained with Jagle achieves strong performance on Japanese tasks, surpassing InternVL3.5-2B in average score across ten Japanese evaluation tasks and approaching within five points of Qwen3-VL-2B-Instruct. Furthermore, combining Jagle with FineVision does not degrade English performance; instead, it improves English performance compared to training with FineVision alone. To facilitate reproducibility and future research, we release the dataset, trained models, and code.
[166] True to Tone? Quantifying Skin Tone Fidelity and Bias in Photographic-to-Virtual Human Pipelines
Gabriel Ferri Schneider,Erick Menezes,Rafael Mecenas,Paulo Knob,Victor Araujo,Soraia Raupp Musse
Main category: cs.CV
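TL;DR: This paper proposes a fully automatic, scalable method for systematically evaluating skin tone fidelity in virtual human (VH) generation pipelines, and finds that current methods produce consistently higher colorimetric errors for darker skin tones.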
Details
Motivation: Existing virtual human modeling pipelines mostly rely on photographic inputs without colorimetric calibration, which can introduce skin tone inconsistencies and bias that affect realism, identity preservation, and fairness. Method: Build an end-to-end automatic pipeline: skin color and illumination extraction (comparing cheek-region sampling with full-face multidimensional masking), lighting isolation with the pre-trained TRUST framework (used without any training), MetaHuman texture recolorization, real-time rendering under multiple lighting configurations, and quantitative evaluation in CIELAB space with $\Delta E$ and ITA. Result: Validated on approximately 19,848 rendered instances: the extraction strategies behave in a phenotype-dependent way, and darker skin tones consistently show higher colorimetric error ($\Delta E$). Conclusion: The framework, which has no training stages of its own and a low computational cost, exposes skin tone reproduction bias in current VH pipelines and provides a scalable evaluation baseline for fairness-driven virtual human development. Abstract: Accurate reproduction of facial skin tone is essential for realism, identity preservation, and fairness in Virtual Human (VH) rendering. However, most accessible avatar creation pipelines rely on photographic inputs that lack colorimetric calibration, which can introduce inconsistencies and bias. We propose a fully automatic and scalable methodology to systematically evaluate skin tone fidelity across the VH generation pipeline. Our approach defines a full workflow that integrates skin color and illumination extraction, texture recolorization, real-time rendering, and quantitative color analysis. Using facial images from the Chicago Face Database (CFD), we compare skin tone extraction strategies based on cheek-region sampling, following the literature, and multidimensional masking derived from full-face analysis. Additionally, we test both strategies with lighting isolation, using the pre-trained TRUST framework, employed without any training or optimization within our pipeline. Extracted skin tones are applied to MetaHuman textures and rendered under multiple lighting configurations. Skin tone consistency is evaluated objectively in the CIELAB color space using the $\Delta E$ metric and the Individual Typology Angle (ITA). The proposed methodology operates without manual intervention and, with the exception of pre-trained illumination compensation modules, the pipeline does not include learning or training stages, enabling low computational cost and large-scale evaluation. Using this framework, we generate and analyze approximately 19,848 rendered instances.
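The two colour metrics named above are standard and easy to reproduce. The sketch below assumes the CIE76 form of Delta E (Euclidean distance in CIELAB), since the abstract does not say which Delta E variant the authors use. ITA follows its usual definition, arctan((L* - 50) / b*) expressed in degrees. The sample Lab values are hypothetical.

```python
import math

def delta_e_cie76(lab1, lab2):
    """Euclidean distance between two CIELAB colours (CIE76 Delta E)."""
    return math.dist(lab1, lab2)

def ita_degrees(lab):
    """Individual Typology Angle: arctan((L* - 50) / b*) in degrees.

    atan2 matches arctan of the ratio when b* > 0, which is the usual
    case for skin colours, and avoids a division by zero at b* == 0.
    """
    L, _, b = lab
    return math.degrees(math.atan2(L - 50.0, b))

# Hypothetical values: skin tone measured on the photo vs. on the render.
source = (65.0, 14.0, 18.0)
render = (62.0, 18.0, 18.0)
print(delta_e_cie76(source, render))  # 5.0
print(round(ita_degrees(source), 1))  # 39.8
```

A higher ITA corresponds to a lighter skin tone, so reporting Delta E per ITA band is one straightforward way to check whether errors grow for darker tones.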
Our results show phenotype-dependent behavior of extraction strategies and consistently higher colorimetric errors for darker skin tones.
[167] COMPASS: Complete Multimodal Fusion via Proxy Tokens and Shared Spaces for Ubiquitous Sensing
Hao Wang,Yanyu Qian,Pengcheng Weng,Zixuan Xia,William Dan,Yangxin Xu,Fei Wang
Main category: cs.CV
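TL;DR: This paper proposes COMPASS, a framework that handles missing modalities by generating target-specific proxy tokens so that the fusion head always receives a fixed N-slot multimodal input, improving cross-modal interaction and robustness.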
Details
Motivation: When modalities are missing, existing methods give the fusion head an input structure different from the one seen in training, leading to incomplete fusion and degraded cross-modal interaction. Method: Following the principle of fusion completeness, COMPASS synthesizes, for each missing modality, target-specific proxy tokens from the observed modalities using pairwise source-to-target generators in a shared latent space and aggregates them into a single replacement token; proxy alignment, shared-space regularization, and per-proxy discriminative supervision keep the proxies representation-compatible and task-informative. Result: On XRF55, MM-Fi, and OctoNet, COMPASS outperforms prior methods in the large majority of single- and multiple-missing-modality scenarios. Conclusion: Preserving a modality-complete fusion interface is a simple and effective design principle for robust multimodal sensing. Abstract: Missing modalities remain a major challenge for multimodal sensing, because most existing methods adapt the fusion process to the observed subset by dropping absent branches, using subset-specific fusion, or reconstructing missing features. As a result, the fusion head often receives an input structure different from the one seen during training, leading to incomplete fusion and degraded cross-modal interaction. We propose COMPASS, a missing-modality fusion framework built on the principle of fusion completeness: the fusion head always receives a fixed N-slot multimodal input, with one token per modality slot. For each missing modality, COMPASS synthesizes a target-specific proxy token from the observed modalities using pairwise source-to-target generators in a shared latent space, and aggregates them into a single replacement token. To make these proxies both representation-compatible and task-informative, we combine proxy alignment, shared-space regularization, and per-proxy discriminative supervision. Experiments on XRF55, MM-Fi, and OctoNet under diverse single- and multiple-missing settings show that COMPASS outperforms prior methods on the large majority of scenarios. Our results suggest that preserving a modality-complete fusion interface is a simple and effective design principle for robust multimodal sensing.
[168] CompassAD: Intent-Driven 3D Affordance Grounding in Functionally Competing Objects
Jingliang Li,Jindou Jia,Tuo An,Chuhao Zhou,Xiangyu Chen,Shilin Shan,Boyu Ma,Bofan Lyu,Gen Li,Jianfei Yang
Main category: cs.CV
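TL;DR: This paper introduces a new task, intent-driven 3D affordance grounding in multi-object scenes, builds CompassAD, the first benchmark centered on implicit intent and confusing pairs, and proposes the CompassNet framework, whose Instance-bounded Cross Injection (ICI) and Bi-level Contrastive Refinement (BCR) modules distinguish affordances among confusable objects. It reports state-of-the-art results on the benchmark and effective transfer to real-robot grasping.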
Details
Motivation: Existing 3D affordance methods are mostly evaluated on single objects with explicit category prompts, so they cannot handle the "confusing pairs" found in real scenes, where several objects share the same affordance but only one fits the task intent. Method: CompassNet has two core modules: 1) Instance-bounded Cross Injection (ICI), which constrains language-geometry alignment within object instance boundaries to prevent cross-object semantic leakage; 2) Bi-level Contrastive Refinement (BCR), which applies contrastive learning at the geometric-group and point levels to sharpen discrimination between target and confusable surfaces. Result: State-of-the-art performance on the self-built CompassAD benchmark for both seen and unseen instructions; deployment on a real robotic platform confirms effective transfer to grasping in confusing multi-object scenes. Conclusion: Implicit-intent-driven multi-object affordance grounding is a key step toward real-world embodied intelligence, and CompassNet's module design offers a generalizable approach to resolving affordance ambiguity. Abstract: When told to "cut the apple," a robot must choose the knife over nearby scissors, despite both objects affording the same cutting function. In real-world scenes, multiple objects may share identical affordances, yet only one is appropriate under the given task context. We call such cases confusing pairs. However, existing 3D affordance methods largely sidestep this challenge by evaluating isolated single objects, often with explicit category names provided in the query. We formalize Multi-Object Affordance Grounding under Intent-Driven Instructions, a new 3D affordance setting that requires predicting a per-point affordance mask on the correct object within a cluttered multi-object point cloud, conditioned on implicit natural language intent. To study this problem, we construct CompassAD, the first benchmark centered on implicit intent in confusable multi-object scenes. It comprises 30 confusing object pairs spanning 16 affordance types, 6,422 scenes, and 88K+ query-answer pairs. Furthermore, we propose CompassNet, a framework that incorporates two dedicated modules tailored to this task. Instance-bounded Cross Injection (ICI) constrains language-geometry alignment within object boundaries to prevent cross-object semantic leakage. Bi-level Contrastive Refinement (BCR) enforces discrimination at both geometric-group and point levels, sharpening distinctions between target and confusable surfaces.
Extensive experiments demonstrate state-of-the-art results on both seen and unseen queries, and deployment on a robotic manipulator confirms effective transfer to real-world grasping in confusing multi-object scenes.
[169] Network Structure in UK Payment Flows: Evidence on Economic Interdependencies and Implications for Real-Time Measurement
Aditya Humnabadkar
Main category: cs.CV
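TL;DR: Using graph-theoretic methods, this paper analyzes about 530,000 payment records across 89 UK industry sectors and finds that network structure features (such as centrality and clustering coefficients) substantially improve payment flow forecasting, especially during economic shocks such as the COVID-19 pandemic, suggesting new approaches for real-time economic monitoring and official statistics.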
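The graph-theoretic features mentioned here (centrality, clustering coefficient, network density) can be computed on a small undirected graph as follows. The sector names and links are hypothetical toy data, and the paper may use weighted or directed variants that this sketch does not cover.

```python
from itertools import combinations

# Hypothetical payment links between sectors (undirected, unweighted).
edges = [("Finance", "Wholesale"), ("Finance", "Professional"),
         ("Wholesale", "Professional"), ("Finance", "Retail")]

neighbours = {}
for u, v in edges:
    neighbours.setdefault(u, set()).add(v)
    neighbours.setdefault(v, set()).add(u)

n = len(neighbours)

def degree_centrality(node):
    """Share of the other nodes that this node is linked to."""
    return len(neighbours[node]) / (n - 1)

def clustering(node):
    """Fraction of a node's neighbour pairs that are themselves linked."""
    nbrs = neighbours[node]
    if len(nbrs) < 2:
        return 0.0
    pairs = list(combinations(nbrs, 2))
    linked = sum(1 for a, b in pairs if b in neighbours[a])
    return linked / len(pairs)

# Density: existing links divided by all possible links.
density = 2 * len(edges) / (n * (n - 1))

print(degree_centrality("Finance"))  # 1.0
print(clustering("Finance"))         # one of three neighbour pairs is linked
print(density)
```

Features like these, recomputed per period, are the kind of input a forecasting model could take alongside lagged payment volumes.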
Details
Motivation: Traditional bilateral measurement struggles to reveal latent structural economic relationships between industries, whereas network analysis offers a more complete, real-time view for economic monitoring; more robust forecasting tools are especially needed during periods of economic volatility. Method: Build an inter-industry payment network from 532,346 UK payment records (2017–2024) across 89 sectors, extract graph-theoretic features such as centrality and clustering coefficients, feed them into time-series forecasting models, and compare traditional and network-enhanced models under normal conditions and during the pandemic shock using $R^2$. Result: Network features improve forecasting by 8.8 percentage points; during the pandemic the traditional model's $R^2$ fell from 0.38 to 0.19, while network contributions reached +13.8 percentage points; Financial Services, Wholesale Trade, and Professional Services are the most structurally central industries; network density rose 12.5% overall, with disruption in 2020 followed by recovery beyond pre-pandemic levels. Conclusion: Payment network structure can serve as a leading indicator of structural economic change and can markedly improve nowcasting, especially in turbulent periods when traditional temporal patterns fail, which could make official statistics more timely and robust. Abstract: Network analysis of inter-industry payment flows reveals structural economic relationships invisible to traditional bilateral measurement approaches, with significant implications for real-time economic monitoring. Analysing 532,346 UK payment records (2017–2024) across 89 industry sectors, we demonstrate that graph-theoretic features, including centrality measures and clustering coefficients, improve payment flow forecasting by 8.8 percentage points beyond traditional time-series methods. Critically, network features prove most valuable during economic disruptions: during the COVID-19 pandemic, when traditional forecasting accuracy collapsed ($R^2$ falling from 0.38 to 0.19), network-enhanced models maintained substantially better performance, with network contributions reaching +13.8 percentage points. The analysis identifies Financial Services, Wholesale Trade, and Professional Services as structurally central industries whose network positions indicate systemic importance beyond their transaction volumes. Network density increased 12.5% over the sample period, with visible disruption during 2020 followed by recovery exceeding pre-pandemic integration levels. These findings suggest payment network monitoring could enhance official statistics production by providing leading indicators of structural economic change and improving nowcasting accuracy during periods when traditional temporal patterns prove unreliable.
[170] Mining Instance-Centric Vision-Language Contexts for Human-Object Interaction Detection
Soo Won Seo,KyungChae Lee,Hyungchan Cho,Taein Son,Nam Ik Cho,Jun Won Choi
Main category: cs.CV
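TL;DR: This paper proposes InCoM-Net, a framework that fuses multi-level scene context extracted from a vision-language model (VLM) with instance features from a detector to improve human-object interaction (HOI) detection, reaching state-of-the-art results on HICO-DET and V-COCO.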
Details
Motivation: Existing VLM-based HOI methods do not fully exploit the diverse contextual cues distributed across the whole scene, which limits interaction reasoning. Method: The Instance-centric Context Mining Network (InCoM-Net) has two core modules: 1) Instance-centric Context Refinement (ICR), which separately extracts intra-instance, inter-instance, and global contextual cues; 2) Progressive Context Aggregation (ProCA), which iteratively fuses the multi-context features with instance-level detector features. Result: State-of-the-art performance on the HICO-DET and V-COCO benchmarks. Conclusion: InCoM-Net models multi-level context from instance to scene and improves visual understanding and reasoning for HOI detection. Abstract: Human-Object Interaction (HOI) detection aims to localize human-object pairs and classify their interactions from a single image, a task that demands strong visual understanding and nuanced contextual reasoning. Recent approaches have leveraged Vision-Language Models (VLMs) to introduce semantic priors, significantly improving HOI detection performance. However, existing methods often fail to fully capitalize on the diverse contextual cues distributed across the entire scene. To overcome these limitations, we propose the Instance-centric Context Mining Network (InCoM-Net), a novel framework that effectively integrates rich semantic knowledge extracted from VLMs with instance-specific features produced by an object detector. This design enables deeper interaction reasoning by modeling relationships not only within each detected instance but also across instances and their surrounding scene context. InCoM-Net comprises two core components: Instance-centric Context Refinement (ICR), which separately extracts intra-instance, inter-instance, and global contextual cues from VLM-derived features, and Progressive Context Aggregation (ProCA), which iteratively fuses these multi-context features with instance-level detector features to support high-level HOI reasoning. Extensive experiments on the HICO-DET and V-COCO benchmarks show that InCoM-Net achieves state-of-the-art performance, surpassing previous HOI detection methods. Code is available at https://github.com/nowuss/InCoM-Net.
[171] PLUME: Latent Reasoning Based Universal Multimodal Embedding
Chenwei He,Xiangzhao Hao,Tianyu Yang,Yuxiang Ma,Yuheng Jia,Lingxiang Wu,Chaoyang Zhao,Haiyun Guo,Jinqiao Wang
Main category: cs.CV
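TL;DR: PLUME is a latent reasoning framework that replaces explicit textual chain-of-thought (CoT) with an autoregressive rollout of continuous latent states. Combined with a semantic-anchor-guided transition adapter and a progressive explicit-to-latent training curriculum, it keeps the benefits of reasoning while greatly reducing inference cost, improving both the efficiency and the performance of multimodal retrieval.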
Details
Motivation: Explicit chain-of-thought (CoT) improves multimodal embeddings but adds heavy inference overhead and a textual bottleneck, making it hard to handle dense, structurally complex multimodal evidence such as video and visual documents efficiently. Method: PLUME reasons in latent space: a short autoregressive rollout of continuous latent states replaces verbalized CoT; a semantic-anchor-guided transition adapter enables diverse reasoning trajectories under a fixed computation budget; a progressive explicit-to-latent curriculum uses explicit CoT only as a scaffold early in training and removes text generation entirely at inference. Result: Outperforms strong explicit-CoT baselines on the 78-task MMEB-v2 benchmark; reasoning shrinks from hundreds of tokens to fewer than 10 latent steps, with over 30x faster inference; particularly suited to dense, structurally complex retrieval settings such as video and visual documents. Conclusion: Structured latent computation keeps the benefits of intermediate reasoning while avoiding the cost of explicit rationale generation, offering a stronger and more efficient paradigm for practical multimodal retrieval systems. Abstract: Universal multimodal embedding (UME) maps heterogeneous inputs into a shared retrieval space with a single model. Recent approaches improve UME by generating explicit chain-of-thought (CoT) rationales before extracting embeddings, enabling multimodal large language models to better infer complex query intent. However, explicit CoT incurs substantial inference overhead and can compress rich multimodal evidence into a narrow textual bottleneck. We propose PLUME, a latent reasoning framework that advances UME by replacing verbalized CoT with a short autoregressive rollout of continuous latent states. To support diverse multimodal queries, PLUME further introduces a semantic-anchor-guided transition adapter that steers latent rollout along different reasoning trajectories under the same fixed computation budget. To stabilize training, PLUME adopts a progressive explicit-to-latent curriculum that uses verbalized reasoning only as a temporary training scaffold and gradually transfers this behavior into hidden-state computation, eliminating explicit CoT at inference. On the 78-task MMEB-v2 benchmark, PLUME outperforms strong explicit-CoT UME baselines while reducing reasoning from hundreds of generated tokens to fewer than 10 latent steps, delivering over 30x faster inference. PLUME is especially well suited to retrieval settings where relevant evidence is dense, structurally complex, and difficult to organize through verbalized intermediate rationales, such as video and visual document retrieval.
These results show that structured latent computation can preserve the benefits of intermediate reasoning without the overhead of explicit rationale generation, providing a stronger and more efficient paradigm for practical retrieval systems.
[172] FlowSlider: Training-Free Continuous Image Editing via Fidelity-Steering Decomposition
Taichi Endo,Guoqing Hao,Kazuhiko Sumi
Main category: cs.CV
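TL;DR: This paper proposes FlowSlider, a training-free method for continuous image editing that decomposes FlowEdit's update into a fidelity term and a steering term to give smooth, reliable control of edit strength.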
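The control rule described here, keep the fidelity term as it is and scale only the steering term, can be written in a few lines. The vectors below are hypothetical stand-ins for the two terms; how FlowEdit's update is actually split into them is defined in the paper and is not reproduced in this sketch.

```python
def scaled_update(fidelity, steering, strength):
    """Combine the terms, scaling only the steering term by the slider value."""
    return [f + strength * s for f, s in zip(fidelity, steering)]

def cosine(a, b):
    """Cosine similarity, used here to check how orthogonal two terms are."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

# Hypothetical toy terms, chosen to be exactly orthogonal.
fidelity = [1.0, 0.0, 0.5]
steering = [0.0, 2.0, 0.0]
print(cosine(fidelity, steering))              # 0.0
print(scaled_update(fidelity, steering, 0.5))  # [1.0, 1.0, 0.5]
```

When the two terms are close to orthogonal, changing the slider value moves the update along the edit direction without altering the component that preserves the source image.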
Details
Motivation: Existing learning-based slider methods for continuous editing depend on auxiliary modules and synthetic supervision, which adds training overhead and reduces reliability under distribution shift. Method: Within the Rectified Flow framework, FlowSlider decomposes FlowEdit's update into a source-conditioned fidelity term (preserving identity and structure) and a steering term that drives the semantic change; because the two are approximately orthogonal, edit strength is adjusted by scaling only the steering term. Result: FlowSlider provides stable, smooth, and reliable continuous editing without post-training and improves editing quality across diverse tasks. Conclusion: FlowSlider is an efficient, general, training-free approach to continuous image editing that avoids the dependence on the training distribution and the reliability problems of existing methods. Abstract: Continuous image editing aims to provide slider-style control of edit strength while preserving source-image fidelity and maintaining a consistent edit direction. Existing learning-based slider methods typically rely on auxiliary modules trained with synthetic or proxy supervision. This introduces additional training overhead and couples slider behavior to the training distribution, which can reduce reliability under distribution shifts in edits or domains. We propose FlowSlider, a training-free method for continuous editing in Rectified Flow that requires no post-training. FlowSlider decomposes FlowEdit's update into (i) a fidelity term, which acts as a source-conditioned stabilizer that preserves identity and structure, and (ii) a steering term that drives semantic transition toward the target edit. Geometric analysis and empirical measurements show that these terms are approximately orthogonal, enabling stable strength control by scaling only the steering term while keeping the fidelity term unchanged. As a result, FlowSlider provides smooth and reliable control without post-training, improving continuous editing quality across diverse tasks.
[173] Center-Aware Detection with Swin-based Co-DETR Framework for Cervical Cytology
Yan Kong,Yuan Yin,Hongan Chen,Yuqi Fang,Caifeng Shan
Main category: cs.CV
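TL;DR: This paper proposes a cell detection method based on Co-DINO with a Swin-Large backbone. For fixed-size annotations it formulates detection as center-point prediction and adds center-preserving augmentation, geometric box optimization, and track-specific loss tuning, placing 1st in Track B and 2nd in Track A of the RIVA Cervical Cytology Challenge.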
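Because the annotations are fixed-size boxes, a predicted center fully determines a box. A minimal sketch of that conversion follows; the 100-pixel box size and the clipping to image bounds are assumptions made for illustration, not values or steps taken from the paper.

```python
def center_to_box(cx, cy, img_w, img_h, box_size=100.0):
    """Turn a predicted center into a fixed-size (x1, y1, x2, y2) box.

    box_size is a hypothetical side length; the box is clipped to the image.
    """
    half = box_size / 2.0
    x1, y1 = max(0.0, cx - half), max(0.0, cy - half)
    x2, y2 = min(float(img_w), cx + half), min(float(img_h), cy + half)
    return x1, y1, x2, y2

print(center_to_box(320, 240, 1024, 1024))  # (270.0, 190.0, 370.0, 290.0)
print(center_to_box(20, 30, 1024, 1024))    # (0.0, 0.0, 70.0, 80.0)
```

Under this formulation only the two center coordinates carry localization information, so small errors in a regressed width or height no longer affect the final box.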
Details
Motivation: Automated analysis of Pap smear images is critical for cervical cancer screening but is difficult because cells are densely distributed and morphologically complex; in addition, the dataset uses fixed-size bounding box annotations, which calls for an adapted detection formulation. Method: Starting from a Co-DINO baseline with a Swin-Large backbone, formulate detection as center-point prediction; introduce a center-preserving data augmentation strategy and an analytical geometric box optimization to absorb localization jitter; tune loss weights per track. Result: 1st place in Track B and 2nd place in Track A of the RIVA Cervical Cytology Challenge; experiments show that the targeted optimizations improve detection performance. Conclusion: The center-point formulation and its supporting techniques form an effective pipeline for cervical cytology image analysis and may have potential for clinical use. Abstract: Automated analysis of Pap smear images is critical for cervical cancer screening but remains challenging due to dense cell distribution and complex morphology. In this paper, we present our winning solution for the RIVA Cervical Cytology Challenge, achieving 1st place in Track B and 2nd place in Track A. Our approach leverages a powerful baseline, integrating the Co-DINO framework with a Swin-Large backbone for robust multi-scale feature extraction. To address the dataset's unique fixed-size bounding box annotations, we formulate the detection task as a center-point prediction problem. Tailoring our approach to this formulation, we introduce a center-preserving data augmentation strategy and an analytical geometric box optimization to effectively absorb localization jitter. Finally, we apply track-specific loss tuning to adapt the loss weights for each task. Experiments demonstrate that our targeted optimizations improve detection performance, providing an effective pipeline for cytology image analysis. Our code is available at https://github.com/YanKong0408/Center-DETR.
[174] GroundVTS: Visual Token Sampling in Multimodal Large Language Models for Video Temporal Grounding
Rong Fan,Kaiyan Xiao,Minghao Zhu,Liuyi Wang,Kai Dai,Zhao Yang
Main category: cs.CV
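TL;DR: This paper proposes GroundVTS, a video large language model architecture for video temporal grounding that uses query-guided, fine-grained visual token sampling and a progressive optimization strategy to better model temporal information, outperforming existing methods on three benchmarks.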
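One simple way to read "query-guided visual token sampling" is to score each visual token against the query embedding, keep the top-k, and then restore temporal order. The sketch below does exactly that with cosine similarity. The scoring function, `k`, and the toy vectors are assumptions, and GroundVTS's actual mechanism may differ.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def sample_tokens(tokens, query, k):
    """Return indices of the k tokens most similar to the query, in order."""
    ranked = sorted(range(len(tokens)),
                    key=lambda i: cosine(tokens[i], query), reverse=True)
    return sorted(ranked[:k])  # re-sort so temporal order is preserved

# Hypothetical 2-D token embeddings in temporal order, and a query embedding.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
query = [0.0, 1.0]
print(sample_tokens(tokens, query, k=2))  # [2, 3]
```

Unlike uniform frame sampling, the kept indices cluster wherever the content matches the query, which is the non-uniform distribution the downstream model then has to adapt to.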
Details
Motivation: Existing video LLMs use uniform frame sampling, which yields sparse key frames and loses important temporal cues, limiting performance on video temporal grounding (VTG). Method: Grounded Visual Token Sampling (GroundVTS) combines a fine-grained, query-guided mechanism for filtering visual tokens with a progressive optimization strategy that adapts the LLM to the non-uniform distribution of visual features. Result: Gains on three standard VTG benchmarks: a 7.7-point mIoU improvement for moment retrieval and a 12.0-point mAP improvement for highlight detection. Conclusion: GroundVTS preserves spatio-temporal information and strengthens temporal modeling, offering a better architecture for Vid-LLMs on fine-grained video understanding tasks such as VTG. Abstract: Video temporal grounding (VTG) is a critical task in video understanding and a key capability for extending video large language models (Vid-LLMs) to broader applications. However, existing Vid-LLMs rely on uniform frame sampling to extract video information, resulting in a sparse distribution of key frames and the loss of crucial temporal cues. To address this limitation, we propose Grounded Visual Token Sampling (GroundVTS), a Vid-LLM architecture that focuses on the most informative temporal segments. GroundVTS employs a fine-grained, query-guided mechanism to filter visual tokens before feeding them into the LLM, thereby preserving essential spatio-temporal information and maintaining temporal coherence. Furthermore, we introduce a progressive optimization strategy that enables the LLM to effectively adapt to the non-uniform distribution of visual features, enhancing its ability to model temporal dependencies and achieve precise video localization. We comprehensively evaluate GroundVTS on three standard VTG benchmarks, where it outperforms existing methods, achieving a 7.7-point improvement in mIoU for moment retrieval and a 12.0-point improvement in mAP for highlight detection. Code is available at https://github.com/Florence365/GroundVTS.
[175] LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model
Jiachun Jin,Zetong Zhou,Xiao Yang,Hao Zhang,Pengfei Liu,Jun Zhu,Zhijie Deng
Main category: cs.CV
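TL;DR: This paper proposes LatentUM, a unified model that represents all modalities in a shared semantic latent space, removing the need for pixel-space mediation between visual understanding and generation. This enables efficient, flexible cross-modal reasoning and generation, with state-of-the-art results on the Visual Spatial Planning benchmark.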
Details
Motivation: Existing unified models rely on pixel decoding as the bridge between understanding and generation, which is inefficient and introduces codec bias; interleaved cross-modal reasoning calls for understanding and generation within one semantic space, free of the pixel-space constraint. Method: LatentUM maps text, images, and other modalities into a shared semantic latent space, drops the separate visual representations and pixel-level mediation, and supports end-to-end cross-modal modeling and self-reflective generation. Result: State of the art on the Visual Spatial Planning benchmark; better visual generation through self-reflection; support for world modeling by predicting future visual states in the latent space; reduced codec bias and stronger cross-modal alignment. Conclusion: A shared semantic latent space is a key route to truly unified multimodal models, and LatentUM demonstrates its combined advantages in efficiency, alignment, and reasoning. Abstract: Unified models (UMs) hold promise for their ability to understand and generate content across heterogeneous modalities. Compared to merely generating visual content, the use of UMs for interleaved cross-modal reasoning is more promising and valuable, e.g., for solving understanding problems that require dense visual thinking, improving visual generation through self-reflection, or modeling visual dynamics of the physical world guided by stepwise action interventions. However, existing UMs necessitate pixel decoding as a bridge due to their disjoint visual representations for understanding and generation, which is both ineffective and inefficient. In this paper, we introduce LatentUM, a novel unified model that represents all modalities within a shared semantic latent space, eliminating the need for pixel-space mediation between visual understanding and generation. This design naturally enables flexible interleaved cross-modal reasoning and generation. Beyond improved computational efficiency, the shared representation substantially alleviates codec bias and strengthens cross-modal alignment, allowing LatentUM to achieve state-of-the-art performance on the Visual Spatial Planning benchmark, push the limits of visual generation through self-reflection, and support world modeling by predicting future visual states within the shared semantic latent space.
[176] CASHG: Context-Aware Stylized Online Handwriting Generation
Jinsu Shin,Sungeun Hong,Jin Yeong Bak
Main category: cs.CV
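TL;DR: This paper proposes CASHG, a context-aware stylized online handwriting generator that explicitly models inter-character connectivity to improve the coherence and style consistency of sentence-level handwriting trajectories, and introduces the boundary-aware evaluation suite CSM.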
Details
Motivation: Sentence-level online handwriting generation must handle context-dependent characters, stroke continuity, and spacing, but existing methods model these boundary properties only implicitly, which becomes unreliable at sentence scale and under limited compositional diversity. Method: CASHG uses a Character Context Encoder to obtain character identity and sentence-level context memory; a bigram-aware sliding-window Transformer decoder that emphasizes local predecessor-current transitions; gated context fusion; and a three-stage curriculum from isolated glyphs to full sentences. Result: Consistent improvement over comparison methods on the boundary-aware CSM (Connectivity and Spacing Metrics); competitive DTW-based trajectory similarity; gains corroborated by human evaluation. Conclusion: Explicitly modeling inter-character connectivity and fusing context is an effective way to make sentence-level online handwriting more natural and style-consistent, and CSM provides an evaluation standard better matched to the task. Abstract: Online handwriting represents strokes as time-ordered trajectories, which makes handwritten content easier to transform and reuse in a wide range of applications. However, generating natural sentence-level online handwriting that faithfully reflects a writer's style remains challenging, since sentence synthesis demands context-dependent characters with stroke continuity and spacing. Prior methods treat these boundary properties as implicit outcomes of sequence modeling, which becomes unreliable at the sentence scale and under limited compositional diversity. We propose CASHG, a context-aware stylized online handwriting generator that explicitly models inter-character connectivity for style-consistent sentence-level trajectory synthesis. CASHG uses a Character Context Encoder to obtain character identity and sentence-dependent context memory and fuses them in a bigram-aware sliding-window Transformer decoder that emphasizes local predecessor-current transitions, complemented by gated context fusion for sentence-level context. Training proceeds through a three-stage curriculum from isolated glyphs to full sentences, improving robustness under sparse transition coverage. We further introduce Connectivity and Spacing Metrics (CSM), a boundary-aware evaluation suite that quantifies cursive connectivity and spacing similarity. Under benchmark-matched evaluation protocols, CASHG consistently improves CSM over comparison methods while remaining competitive in DTW-based trajectory similarity, with gains corroborated by a human evaluation.
[177] CoRegOVCD: Consistency-Regularized Open-Vocabulary Change Detection
Weidong Tang,Hanbin Sun,Zihan Li,Yikai Wang,Feifan Zhang
Main category: cs.CV
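TL;DR: This paper proposes CoRegOVCD, a training-free open-vocabulary change detection method for remote sensing that uses posterior calibration and spatial consistency constraints to make concept responses more comparable across dates and semantically more reliable.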
Details
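Motivation: Most existing remote sensing change detection methods assume a fixed label space and cannot answer arbitrary user-defined concept queries, while existing training-free open-vocabulary methods produce noisy, fragmented, and semantically unreliable change evidence because of appearance variation, weak cross-concept competition, and the spatial continuity of land-cover categories.
Method: The CoRegOVCD framework has two parts: 1) Competitive Posterior Calibration (CPC) and the Semantic Posterior Delta (SPD) convert raw concept responses into competition-aware queried-concept posteriors and quantify their cross-temporal discrepancy; 2) the Geometry-Token Consistency Gate (GeoGate) and Regional Consensus Discrepancy (RCD) suppress unsupported responses and improve spatial coherence through geometry-aware structural verification and regional consensus.
Result: Across four benchmarks spanning building-oriented and multi-class settings, CoRegOVCD consistently exceeds the strongest training-free baseline by 2.24 to 4.98 F1$_C$ points and reaches a six-class average F1$_C$ of 47.50% on SECOND.
Conclusion: CoRegOVCD effectively mitigates the incomparable concept responses, semantic unreliability, and spatial incoherence of training-free open-vocabulary change detection, offering a new paradigm for flexible and robust remote sensing change understanding.
Abstract: Remote sensing change detection (CD) aims to identify where land-cover semantics change across time, but most existing methods still assume a fixed label space and therefore cannot answer arbitrary user-defined queries. Open-vocabulary change detection (OVCD) instead asks for the change mask of a queried concept. In the fully training-free setting, however, dense concept responses are difficult to compare directly across dates: appearance variation, weak cross-concept competition, and the spatial continuity of many land-cover categories often produce noisy, fragmented, and semantically unreliable change evidence. We propose Consistency-Regularized Open-Vocabulary Change Detection (CoRegOVCD), a training-free dense inference framework that reformulates concept-specific change as calibrated posterior discrepancy. Competitive Posterior Calibration (CPC) and the Semantic Posterior Delta (SPD) convert raw concept responses into competition-aware queried-concept posteriors and quantify their cross-temporal discrepancy, making semantic change evidence more comparable without explicit instance matching. Geometry-Token Consistency Gate (GeoGate) and Regional Consensus Discrepancy (RCD) further suppress unsupported responses and improve spatial coherence through geometry-aware structural verification and regional consensus.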
Across four benchmarks spanning building-oriented and multi-class settings, CoRegOVCD consistently improves over the strongest previous training-free baseline by 2.24 to 4.98 F1$_C$ points and reaches a six-class average of 47.50% F1$_C$ on SECOND.
[178] Beyond the Fold: Quantifying Split-Level Noise and the Case for Leave-One-Dataset-Out AU Evaluation
Saurabh Hinduja,Gurmeet Kaur,Maneesh Bilalpur,Jeffrey Cohn,Shaun Canavan
Main category: cs.CV
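TL;DR: This paper shows that subject-exclusive cross-validation, the common protocol for facial Action Unit (AU) detection, carries substantial stochastic variance, so reported performance gains may be overstated; it proposes cross-dataset Leave-One-Dataset-Out (LODO) evaluation for more stable and interpretable results.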
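The split-level noise described in this entry can be illustrated with a toy simulation. The sketch below is not the authors' protocol: the per-subject counts are made-up synthetic numbers, the helper names `f1` and `fold_averaged_f1` are hypothetical, and predictions are held fixed, so it only shows the part of the variance that comes from how subjects are grouped into folds (retraining effects are ignored).

```python
import random
import statistics
from typing import Dict, List, Tuple

Counts = Tuple[int, int, int]  # (true positives, false positives, false negatives)


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 score from raw counts; returns 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def fold_averaged_f1(per_subject: Dict[str, Counts], k: int,
                     rng: random.Random) -> float:
    """Shuffle subjects into k subject-exclusive folds and average per-fold F1."""
    subjects = list(per_subject)
    rng.shuffle(subjects)
    # Striding guarantees that every subject lands in exactly one fold
    folds: List[List[str]] = [subjects[i::k] for i in range(k)]
    scores = []
    for fold in folds:
        tp = sum(per_subject[s][0] for s in fold)
        fp = sum(per_subject[s][1] for s in fold)
        fn = sum(per_subject[s][2] for s in fold)
        scores.append(f1(tp, fp, fn))
    return sum(scores) / k


# Synthetic per-subject counts for a single AU (assumed values, not BP4D+ data)
data_rng = random.Random(0)
per_subject = {
    f"subject_{i:02d}": (data_rng.randint(0, 30),
                         data_rng.randint(0, 15),
                         data_rng.randint(0, 15))
    for i in range(30)
}

# Repeat the 3-fold split many times and look at the spread of the reported score
split_rng = random.Random(1)
runs = [fold_averaged_f1(per_subject, k=3, rng=split_rng) for _ in range(200)]
spread = statistics.stdev(runs)
print(f"fold-averaged F1: mean={statistics.mean(runs):.3f}, stdev={spread:.4f}")
```

Even with identical predictions, the fold-averaged F1 moves with the fold assignment because F1 is a nonlinear function of the pooled counts; comparing a claimed gain against this spread is the kind of check the entry argues for.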
Details
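Motivation: AU detection studies routinely use subject-exclusive cross-validation, yet reported improvements are often small; the authors suspect that the evaluation protocol itself introduces non-negligible randomness that masks real differences between models.
Method: The stochastic variance of cross-validation is quantified by repeating 3-fold subject-exclusive splits on BP4D+; the fluctuation of metrics such as F1 and AUC is compared; and a Leave-One-Dataset-Out (LODO) protocol is applied across five AU datasets to evaluate cross-dataset robustness.
Result: Subject-exclusive cross-validation introduces a noise floor of $\pm 0.065$ in average F1 on BP4D+, with larger variation for low-prevalence AUs; operating-point metrics such as F1 are less stable than AUC, and model ranking changes with fold assignment; LODO removes partition randomness and exposes domain-level instability that single-dataset cross-validation cannot reveal.
Conclusion: Many gains claimed under cross-validation may fall within the variance inherent to the protocol; LODO is a more robust and interpretable evaluation paradigm for AU detection.
Abstract: Subject-exclusive cross-validation is the standard evaluation protocol for facial Action Unit (AU) detection, yet reported improvements are often small. We show that cross-validation itself introduces measurable stochastic variance. On BP4D+, repeated 3-fold subject-exclusive splits produce an empirical noise floor of $\pm 0.065$ in average F1, with substantially larger variation for low-prevalence AUs. Operating-point metrics such as F1 fluctuate more than threshold-independent measures such as AUC, and model ranking can change under different fold assignments. We further evaluate cross-dataset robustness using a Leave-One-Dataset-Out (LODO) protocol across five AU datasets. LODO removes partition randomness and exposes domain-level instability that is not visible under single-dataset cross-validation. Together, these results suggest that gains often reported in cross-fold validation may fall within protocol variance. Leave-one-dataset-out cross-validation yields more stable and interpretable findings.
[179] Reflection Generation for Composite Image Using Diffusion Model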
Haonan Zhao,Qingyang Liu,Jiaxuan Chen,Li Niu
Main category: cs.CV
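TL;DR: This paper proposes a diffusion-based reflection generation method that injects priors on reflection placement and appearance, adopts a type-aware design, and validates its physical coherence and visual realism on the newly constructed large-scale reflection dataset DEROBA.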
Details
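Motivation: Reflection generation has long been neglected in image composition, whereas shadow generation has been studied extensively, so dedicated and systematic work on reflection generation is needed.
Method: Prior information on reflection placement and appearance is injected into a foundation diffusion model; reflections are divided into two types and handled with a type-aware model design; and the first large-scale object reflection dataset, DEROBA, is constructed for training.
Result: Experiments show that the generated reflections are physically coherent and visually realistic, establishing a new benchmark for reflection generation.
Conclusion: This work advances reflection modeling in image composition and provides an effective method and a key data resource for follow-up research.
Abstract: Image composition involves inserting a foreground object into the background while synthesizing environment-consistent effects such as shadows and reflections. Although shadow generation has been extensively studied, reflection generation remains largely underexplored. In this work, we focus on reflection generation. We inject the prior information of reflection placement and reflection appearance into a foundation diffusion model. We also divide reflections into two types and adopt a type-aware model design. To support training, we construct the first large-scale object reflection dataset, DEROBA. Experiments demonstrate that our method generates reflections that are physically coherent and visually realistic, establishing a new benchmark for reflection generation.
[180] ViT-Explainer: An Interactive Walkthrough of the Vision Transformer Pipeline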
Juan Manuel Hernandez,Mariana Fernandez-Espinosa,Denis Parra,Diego Gomez-Zara
Main category: cs.CV
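TL;DR: ViT-Explainer is an interactive visual explanation system for Vision Transformers that supports understanding of the end-to-end inference process, from image patching to the classification decision.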
Details
Motivation: Existing interpretability tools mostly focus on isolated components or expert-oriented analysis and lack guided, end-to-end support for understanding the full Vision Transformer inference pipeline.
Method: ViT-Explainer is a web-based interactive system that integrates animated walkthroughs, patch-level attention overlays, and a vision-adapted Logit Lens, and supports both guided and free exploration modes.
Result: A user study with six participants suggests that the system is easy to learn and use and helps users understand ViT behavior.
Conclusion: ViT-Explainer fills the gap in full-pipeline interpretability tools for Vision Transformers and gives non-expert users intuitive, integrated support for understanding the model.
Abstract: Transformer-based architectures have become the shared backbone of natural language processing and computer vision. However, understanding how these models operate remains challenging, particularly in vision settings, where images are processed as sequences of patch tokens. Existing interpretability tools often focus on isolated components or expert-oriented analysis, leaving a gap in guided, end-to-end understanding of the full inference pipeline. To bridge this gap, we present ViT-Explainer, a web-based interactive system that provides an integrated visualization of Vision Transformer inference, from patch tokenization to final classification. The system combines animated walkthroughs, patch-level attention overlays, and a vision-adapted Logit Lens within both guided and free exploration modes. A user study with six participants suggests that ViT-Explainer is easy to learn and use, helping users interpret and understand Vision Transformer behavior.
[181] CXR-LT 2026 Challenge: Projection-Aware Multi-Label and Zero-Shot Chest X-Ray Classification
Juno Cho,Dohui Kim,Mingeon Kim,Hyunseo Jang,Chang Sun Lee,Jong Chul Ye
Main category: cs.CV
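TL;DR: This paper proposes a unified framework for multi-label classification of known lesions and zero-shot classification of unseen lesions in chest X-rays (CXR); it combines an ensemble of projection-specific models, a CheXzero-based dual-branch architecture (contrastive learning, Asymmetric Loss, and LLM-generated prompts), and strong data and test-time augmentation to mitigate long-tail imbalance and improve generalization.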
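For readers unfamiliar with the Asymmetric Loss (ASL) named in this entry, the sketch below follows the commonly cited formulation for multi-label classification: positives get a focal-style term, while negatives are first shifted down by a probability margin and then down-weighted more strongly. Whether the challenge entry uses exactly this form is not stated in the abstract, and the function name, default hyperparameters, and example probabilities are illustrative assumptions.

```python
import math
from typing import Sequence


def asymmetric_loss(probs: Sequence[float], targets: Sequence[int],
                    gamma_pos: float = 0.0, gamma_neg: float = 4.0,
                    margin: float = 0.05, eps: float = 1e-8) -> float:
    """Sum of per-label asymmetric losses for one sample (illustrative defaults)."""
    total = 0.0
    for p, y in zip(probs, targets):
        if y == 1:
            # Positive label: focal-style weight with a small (here zero) focusing power
            total -= (1.0 - p) ** gamma_pos * math.log(max(p, eps))
        else:
            # Negative label: shift the probability down by the margin, then
            # down-weight easy negatives with a stronger focusing power
            p_m = max(p - margin, 0.0)
            total -= p_m ** gamma_neg * math.log(max(1.0 - p_m, eps))
    return total


# One sample with four labels: a single rare positive among easy negatives
probs = [0.30, 0.04, 0.10, 0.02]
targets = [1, 0, 0, 0]
loss = asymmetric_loss(probs, targets)

# A negative whose probability is below the margin contributes nothing
easy_negative_only = asymmetric_loss([0.04], [0])
print(f"loss={loss:.4f}, easy negative={easy_negative_only}")
```

In this example almost all of the loss comes from the under-confident positive, which is the behaviour that makes this family of losses attractive for long-tailed label distributions.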
Details
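Motivation: The work addresses the dual challenge of multi-label classification of known lesions and zero-shot classification of unseen lesions in chest X-rays, in particular the diversity introduced by different projections and the severe long-tail distribution.
Method: Projection-specific models are integrated into a unified classification framework; CheXzero is extended with a dual-branch architecture that combines contrastive learning, Asymmetric Loss (ASL), and LLM-generated descriptive prompts; and strong data augmentation and test-time augmentation (TTA) are applied.
Result: The approach significantly improves multi-label classification of known lesions and zero-shot generalization to unseen lesions, while making the model more robust across projection views.
Conclusion: The proposed method handles both the multi-label and zero-shot tasks while effectively mitigating the long-tail problem, demonstrates the feasibility of combining multi-source supervision with semantic prompts, and offers a new direction for open-world classification in medical imaging.
Abstract: This challenge tackles multi-label classification for known chest X-ray (CXR) lesions and zero-shot classification for unseen ones. To handle diverse CXR projections, we integrate projection-specific models via a classification network into a unified framework. For zero-shot classification (Task 2), we extend CheXzero with a novel dual-branch architecture that combines contrastive learning, Asymmetric Loss (ASL), and LLM-generated descriptive prompts. This effectively mitigates severe long-tail imbalances and maximizes zero-shot generalization. Additionally, strong data and test-time augmentations (TTA) ensure robustness across both tasks.
[182] Lightweight Spatiotemporal Highway Lane Detection via 3D-ResNet and PINet with ROI-Aware Attention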
Sorna Shanmuga Raja,Abdelhafid Zenati
Main category: cs.CV
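TL;DR: This paper proposes a lightweight, end-to-end highway lane detection architecture that combines 3D CNNs with instance segmentation; two model variants (FPN plus self-attention, and an ROI detection head) improve accuracy and efficiency, reaching 93.40% accuracy on TuSimple, and are suited to ADAS/LAS.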
Details
Motivation: Robust lane detection in real-world driving requires modeling spatial and temporal information jointly while balancing computational efficiency and accuracy; existing 2D and 3D methods fall short in parameter count, latency, or false detection rate.
Method: Two joint spatiotemporal models are built on a 3D-ResNet encoder and a PINet decoder: the first adds an FPN and a self-attention mechanism to strengthen multi-scale features and spatial dependencies; the second adds an ROI detection head that focuses on lane-relevant regions and reduces computational complexity.
Result: On TuSimple, the second model reaches 93.40% accuracy and significantly reduces false negatives; compared with 2D and 3D baselines it uses fewer parameters and has lower latency; it has been validated in the laboratory through offline training and real-time inference.
Conclusion: The proposed lightweight end-to-end architecture strikes a good balance among performance, efficiency, and practicality; it is suitable for integration into ADAS and has the potential to scale to full Lane Assist Systems (LAS).
Abstract: This paper presents a lightweight, end-to-end highway lane detection architecture that jointly captures spatial and temporal information for robust performance in real-world driving scenarios. Building on the strengths of 3D convolutional neural networks and instance segmentation, we propose two models that integrate a 3D-ResNet encoder with a Point Instance Network (PINet) decoder. The first model enhances multi-scale feature representation using a Feature Pyramid Network (FPN) and Self-Attention mechanism to refine spatial dependencies. The second model introduces a Region of Interest (ROI) detection head to selectively focus on lane-relevant regions, thereby improving precision and reducing computational complexity. Experiments conducted on the TuSimple dataset (highway driving scenarios) demonstrate that the proposed second model achieves 93.40% accuracy while significantly reducing false negatives. Compared to existing 2D and 3D baselines, our approach achieves improved performance with fewer parameters and reduced latency. The architecture has been validated through offline training and real-time inference in the Autonomous Systems Laboratory at City, St George's University of London. These results suggest that the proposed models are well-suited for integration into Advanced Driver Assistance Systems (ADAS), with potential scalability toward full Lane Assist Systems (LAS).
[183] UniDriveVLA: Unifying Understanding, Perception, and Action Planning for Autonomous Driving
Yongkang Li,Lijun Zhou,Sixu Yan,Bencheng Liao,Tianyi Yan,Kaixin Xiong,Long Chen,Hongwei Xie,Bing Wang,Guang Chen,Hangjun Ye,Wenyu Liu,Haiyang Sun,Xinggang Wang
Main category: cs.CV
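TL;DR: This paper proposes UniDriveVLA, which uses a Mixture-of-Transformers to decouple perception and reasoning experts, resolving the conflict between spatial perception and semantic reasoning in Vision-Language-Action models for autonomous driving and achieving state-of-the-art results on multiple tasks.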
Details
Motivation: Existing VLA models for autonomous driving struggle to achieve both spatial perception and semantic reasoning, a dilemma rooted in the coupled optimization of the two within shared parameters.
Method: UniDriveVLA is built on a Mixture-of-Transformers with three experts for driving understanding, scene perception, and action planning, coordinated through masked joint attention; it combines a sparse perception paradigm with a three-stage progressive training strategy.
Result: The model achieves state-of-the-art performance on nuScenes (open-loop) and Bench2Drive (closed-loop), and performs strongly on a range of tasks including 3D detection, online mapping, motion forecasting, and driving-oriented VQA.
Conclusion: UniDriveVLA effectively mitigates the perception-reasoning conflict through expert decoupling and is a broadly applicable unified VLA model for autonomous driving.
Abstract: Vision-Language-Action (VLA) models have recently emerged in autonomous driving, with the promise of leveraging rich world knowledge to improve the cognitive capabilities of driving systems. However, adapting such models for driving tasks currently faces a critical dilemma between spatial perception and semantic reasoning. Consequently, existing VLA systems are forced into suboptimal compromises: directly adopting 2D Vision-Language Models yields limited spatial perception, whereas enhancing them with 3D spatial representations often impairs the native reasoning capacity of VLMs. We argue that this dilemma largely stems from the coupled optimization of spatial perception and semantic reasoning within shared model parameters. To overcome this, we propose UniDriveVLA, a Unified Driving Vision-Language-Action model based on Mixture-of-Transformers that addresses the perception-reasoning conflict via expert decoupling. Specifically, it comprises three experts for driving understanding, scene perception, and action planning, which are coordinated through masked joint attention. In addition, we combine a sparse perception paradigm with a three-stage progressive training strategy to improve spatial perception while maintaining semantic reasoning capability. Extensive experiments show that UniDriveVLA achieves state-of-the-art performance in open-loop evaluation on nuScenes and closed-loop evaluation on Bench2Drive.
Moreover, it demonstrates strong performance across a broad range of perception, prediction, and understanding tasks, including 3D detection, online mapping, motion forecasting, and driving-oriented VQA, highlighting its broad applicability as a unified model for autonomous driving. Code and model have been released at https://github.com/xiaomi-research/unidrivevla
[184] SCALE: Semantic- and Confidence-Aware Conditional Variational Autoencoder for Zero-shot Skeleton-based Action Recognition
Soroush Oraki,Feng Ding,Jie Liang
Main category: cs.CV
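TL;DR: This paper proposes SCALE, a semantic- and confidence-aware listwise energy-based framework for zero-shot skeleton-based action recognition that avoids explicit skeleton-text alignment and uses a conditional variational autoencoder with new loss functions to improve recognition of unseen classes.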
Details
Motivation: Existing zero-shot skeleton-based action recognition methods rely on explicit skeleton-text alignment, which is brittle when action names underspecify fine-grained dynamics and when unseen classes are semantically confusable.
Method: SCALE is a lightweight, deterministic, semantic- and confidence-aware listwise energy-based model. Its core is a Conditional Variational Autoencoder (CVAE) conditioned on frozen text representations, which parameterize both the latent prior and the decoder. A semantic- and confidence-aware listwise energy loss emphasizes semantically similar hard negatives and uses posterior uncertainty to adapt decision margins, and a latent prototype contrast objective aligns posterior means with text-derived latent prototypes.
Result: On NTU-60 and NTU-120, SCALE consistently outperforms VAE-based and alignment-based baselines and remains competitive with diffusion-based methods.
Conclusion: Through energy-based modeling, uncertainty-aware optimization, and latent semantic alignment, SCALE effectively mitigates semantic ambiguity and class confusion in zero-shot skeleton-based action recognition and offers a new paradigm for zero-shot learning that does not require sample generation.
Abstract: Zero-shot skeleton-based action recognition (ZSAR) aims to recognize action classes without any training skeletons from those classes, relying instead on auxiliary semantics from text. Existing approaches frequently depend on explicit skeleton-text alignment, which can be brittle when action names underspecify fine-grained dynamics and when unseen classes are semantically confusable. We propose SCALE, a lightweight and deterministic Semantic- and Confidence-Aware Listwise Energy-based framework that formulates ZSAR as class-conditional energy ranking. SCALE builds a text-conditioned Conditional Variational Autoencoder where frozen text representations parameterize both the latent prior and the decoder, enabling likelihood-based evaluation for unseen classes without generating samples at test time. To separate competing hypotheses, we introduce a semantic- and confidence-aware listwise energy loss that emphasizes semantically similar hard negatives and incorporates posterior uncertainty to adapt decision margins and reweight ambiguous training instances. Additionally, we utilize a latent prototype contrast objective to align posterior means with text-derived latent prototypes, improving semantic organization and class separability without direct feature matching. Experiments on NTU-60 and NTU-120 datasets show that SCALE consistently improves over prior VAE- and alignment-based baselines while remaining competitive with diffusion-based methods.
[185] UAV-Track VLA: Embodied Aerial Tracking via Vision-Language-Action Models
Qiyao Zhang,Shuhua Zheng,Jianli Sun,Chengxiang Li,Xianke Wu,Zihan Song,Zhiyong Cui,Yisheng Lv,Yonglin Tian
Main category: cs.CV
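TL;DR: This paper proposes UAV-Track VLA for embodied visual tracking by UAVs in dynamic urban environments; it builds a large-scale dataset (890K frames, 176 tasks) and a new benchmark, and combines a temporal compression net with a dual-branch decoder to significantly improve tracking success rate, robustness, and real-time performance.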
Details
Motivation: Existing VLA models suffer from temporal feature redundancy and a lack of spatial geometric priors, and there is no evaluation benchmark for embodied visual tracking in complex, dynamic urban scenes.
Method: A large-scale dataset of 890K frames and 176 tasks and a dedicated evaluation benchmark are constructed. The UAV-Track VLA model builds on the $π_{0.5}$ architecture, adds a temporal compression net to capture inter-frame dynamics, and uses a parallel dual-branch decoder (a spatial-aware auxiliary grounding head plus a flow matching action expert) to decouple cross-modal features and generate fine-grained continuous actions.
Result: In the CARLA simulator, the model achieves a 61.76% success rate and 269.65 average tracking frames on long-distance pedestrian tracking, shows strong zero-shot generalization, and reduces single-step inference latency by 33.4% to 0.0571 s.
Conclusion: UAV-Track VLA addresses key challenges of embodied visual tracking in dynamic urban environments, significantly outperforms existing methods in performance, generalization, and real-time capability, and advances the practical deployment of VLA models for real UAV control.
Abstract: Embodied visual tracking is crucial for Unmanned Aerial Vehicles (UAVs) executing complex real-world tasks. In dynamic urban scenarios with complex semantic requirements, Vision-Language-Action (VLA) models show great promise due to their cross-modal fusion and continuous action generation capabilities. To benchmark multimodal tracking in such environments, we construct a dedicated evaluation benchmark and a large-scale dataset encompassing over 890K frames, 176 tasks, and 85 diverse objects. Furthermore, to address temporal feature redundancy and the lack of spatial geometric priors in existing VLA models, we propose an improved VLA tracking model, UAV-Track VLA. Built upon the $π_{0.5}$ architecture, our model introduces a temporal compression net to efficiently capture inter-frame dynamics. Additionally, a parallel dual-branch decoder comprising a spatial-aware auxiliary grounding head and a flow matching action expert is designed to decouple cross-modal features and generate fine-grained continuous actions. Systematic experiments in the CARLA simulator validate the superior end-to-end performance of our method. Notably, in challenging long-distance pedestrian tracking tasks, UAV-Track VLA achieves a 61.76% success rate and 269.65 average tracking frames, significantly outperforming existing baselines. Furthermore, it demonstrates robust zero-shot generalization in unseen environments and reduces single-step inference latency by 33.4% (to 0.0571s) compared to the original $π_{0.5}$, enabling highly efficient, real-time UAV control.
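The latency figures quoted above imply a baseline latency that the abstract does not state directly. As a small worked example, the snippet below back-solves it from the two quoted numbers, assuming the 33.4% is a relative reduction of the baseline's single-step latency; the step rates are upper bounds that assume inference is the only per-step cost.

```python
# Figures quoted in the abstract
new_latency_s = 0.0571   # single-step inference latency after the reduction
reduction = 0.334        # 33.4% relative reduction versus the baseline

# new = old * (1 - reduction)  =>  old = new / (1 - reduction)
baseline_latency_s = new_latency_s / (1.0 - reduction)

# Upper bound on the step rate if inference were the only cost (an assumption)
new_rate_hz = 1.0 / new_latency_s
baseline_rate_hz = 1.0 / baseline_latency_s

print(f"implied baseline latency: {baseline_latency_s:.4f} s")
print(f"max step rate: {baseline_rate_hz:.1f} Hz -> {new_rate_hz:.1f} Hz")
```

Under these assumptions the implied baseline is roughly 0.086 s per step, so the reduction corresponds to moving from about 11.7 to about 17.5 inference steps per second.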
Data samples and demonstration videos are available at: https://github.com/Hub-Tian/UAV-Track_VLA.
[186] SPAR: Single-Pass Any-Resolution ViT for Open-vocabulary Segmentation
Naomi Kombol,Ivan Martinović,Siniša Šegvić,Giorgos Tolias
Main category: cs.CV
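TL;DR: This paper proposes SPAR, a ViT that processes images of any resolution in a single forward pass without architectural changes; knowledge distillation transfers the spatial reasoning of a sliding-window teacher to the student, significantly improving efficiency and accuracy on high-resolution dense prediction tasks such as open-vocabulary segmentation.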
Details
Motivation: Existing ViTs are limited on tasks that need fine-grained spatial understanding, such as open-vocabulary segmentation, because of their fixed pre-training resolution and coarse patch-level representations; sliding-window strategies improve accuracy but are computationally expensive.
Method: SPAR uses a feature regression loss to distill the spatial reasoning of a finely-strided sliding-window teacher into a single-pass student ViT, without pixel-level supervision or architectural changes.
Result: On open-vocabulary segmentation, SPAR improves over single-pass baselines by up to 10.5 mIoU and even surpasses the teacher, confirming its efficient high-resolution inference.
Conclusion: SPAR delivers resolution-agnostic, single-pass, efficient, and high-performing dense feature extraction, offering a new paradigm for deploying ViTs on high-resolution dense prediction tasks.
Abstract: Foundational Vision Transformers (ViTs) have limited effectiveness in tasks requiring fine-grained spatial understanding, due to their fixed pre-training resolution and inherently coarse patch-level representations. These challenges are especially pronounced in dense prediction scenarios, such as open-vocabulary segmentation with ViT-based vision-language models, where high-resolution inputs are essential for accurate pixel-level reasoning. Existing approaches typically process large-resolution images using a sliding-window strategy at the pre-training resolution. While this improves accuracy through finer strides, it comes at a significant computational cost. We introduce SPAR: Single-Pass Any-Resolution ViT, a resolution-agnostic dense feature extractor designed for efficient high-resolution inference. We distill the spatial reasoning capabilities of a finely-strided, sliding-window teacher into a single-pass student using a feature regression loss, without requiring architectural changes or pixel-level supervision. Applied to open-vocabulary segmentation, SPAR improves single-pass baselines by up to 10.5 mIoU and even surpasses the teacher, demonstrating effectiveness in efficient, high-resolution reasoning. Code: https://github.com/naomikombol/SPAR
[187] Modular Energy Steering for Safe Text-to-Image Generation with Foundation Models
Yaoteng Tan,Zikui Cai,M. Salman Asif
Main category: cs.CV
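TL;DR: This paper proposes a training-free safety steering framework based on inference-time gradient feedback; it uses frozen multimodal foundation models as semantic energy estimators to achieve high-quality, scalable safety control for text-to-image generation without modifying the generator.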
Details
Motivation: Existing safety controls for text-to-image generative models usually rely on model fine-tuning or curated datasets, which can degrade generation quality or limit scalability.
Method: An inference-time steering framework uses frozen vision-language foundation models to provide gradient feedback at each sampling step and injects this feedback through clean latent estimates, formulating safety steering as an energy-based sampling problem.
Result: The method achieves state-of-the-art robustness on NSFW red-teaming benchmarks, supports multi-target steering, and preserves high generation quality on benign non-targeted prompts.
Conclusion: The framework provides a principled approach to using foundation models as semantic energy estimators, enabling reliable and scalable safety control for text-to-image generation.
Abstract: Controlling the behavior of text-to-image generative models is critical for safe and practical deployment. Existing safety approaches typically rely on model fine-tuning or curated datasets, which can degrade generation quality or limit scalability. We propose an inference-time steering framework that leverages gradient feedback from frozen pretrained foundation models to guide the generation process without modifying the underlying generator. Our key observation is that vision-language foundation models encode rich semantic representations that can be repurposed as off-the-shelf supervisory signals during generation. By injecting such feedback through clean latent estimates at each sampling step, our method formulates safety steering as an energy-based sampling problem. This design enables modular, training-free safety control that is compatible with both diffusion and flow-matching models and can generalize across diverse visual concepts. Experiments demonstrate state-of-the-art robustness against NSFW red-teaming benchmarks and effective multi-target steering, while preserving high generation quality on benign non-targeted prompts. Our framework provides a principled approach for utilizing foundation models as semantic energy estimators, enabling reliable and scalable safety control for text-to-image generation.
[188] Omni123: Exploring 3D Native Foundation Models with Limited 3D Data by Unifying Text to 2D and 3D Generation
Chongjie Ye,Cheng Cao,Chuanyu Pan,Yiming Hao,Yihao Zhi,Yuanming Hu,Xiaoguang Han
Main category: cs.CV
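TL;DR: Omni123 is a 3D-native multimodal foundation model that unifies text, images, and 3D as discrete token sequences and jointly models cross-modal consistency in an autoregressive framework, using abundant 2D data as a geometric prior to improve 3D generation quality without requiring strictly aligned text-image-3D triplets.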
Details
Motivation: Existing methods are constrained by the scarcity of high-quality 3D data and rely on indirect pipelines that edit in 2D and then lift to 3D, which causes geometric inconsistency; a new paradigm is needed that generates 3D directly and consistently while exploiting 2D data as a prior.
Method: Omni123 1) represents text, images, and 3D as discrete tokens in a shared sequence space; 2) adopts an interleaved X-to-X training paradigm that coordinates cross-modal tasks over heterogeneous paired datasets; and 3) traverses semantic-visual-geometric cycles (e.g., text→image→3D→image) within autoregressive sequences to jointly enforce semantic, appearance, and multi-view geometric consistency.
Result: Omni123 significantly improves text-guided 3D generation and editing, outperforming existing methods on multiple metrics and in qualitative evaluation, which supports its geometric consistency and scalability.
Conclusion: Omni123 demonstrates that 3D-native generation is feasible when 2D data serves as a geometric prior and cross-modal sequence modeling is used, providing a scalable path toward multimodal 3D world models.
Abstract: Recent multimodal large language models have achieved strong performance in unified text and image understanding and generation, yet extending such native capability to 3D remains challenging due to limited data. Compared to abundant 2D imagery, high-quality 3D assets are scarce, making 3D synthesis under-constrained. Existing methods often rely on indirect pipelines that edit in 2D and lift results into 3D via optimization, sacrificing geometric consistency. We present Omni123, a 3D-native foundation model that unifies text-to-2D and text-to-3D generation within a single autoregressive framework. Our key insight is that cross-modal consistency between images and 3D can serve as an implicit structural constraint. By representing text, images, and 3D as discrete tokens in a shared sequence space, the model leverages abundant 2D data as a geometric prior to improve 3D representations. We introduce an interleaved X-to-X training paradigm that coordinates diverse cross-modal tasks over heterogeneous paired datasets without requiring fully aligned text-image-3D triplets. By traversing semantic-visual-geometric cycles (e.g., text to image to 3D to image) within autoregressive sequences, the model jointly enforces semantic alignment, appearance fidelity, and multi-view geometric consistency. Experiments show that Omni123 significantly improves text-guided 3D generation and editing, demonstrating a scalable path toward multimodal 3D world models.
[189] AdamFlow: Adam-based Wasserstein Gradient Flows for Surface Registration in Medical Imaging
Qiang Ma,Qingjie Meng,Xin Hu,Yicheng Wu,Wenjia Bai
Main category: cs.CV
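TL;DR: This paper proposes AdamFlow, a fast surface registration method based on probability measures and the sliced Wasserstein distance that balances efficiency and robustness and performs strongly on registration of anatomical structures.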
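The sliced Wasserstein distance mentioned in this entry can be estimated with a short routine. The sketch below is the textbook Monte Carlo estimator for two point sets of equal size with uniform weights, not necessarily the variant used in the paper; the function name, the number of projections, and the random point clouds standing in for mesh vertices are assumptions. It does not implement the AdamFlow optimiser itself.

```python
import math
import random
from typing import List, Sequence

Point = Sequence[float]


def sliced_wasserstein2(x: List[Point], y: List[Point],
                        n_proj: int = 64, seed: int = 0) -> float:
    """Monte Carlo estimate of the squared sliced 2-Wasserstein distance."""
    if not x or len(x) != len(y):
        raise ValueError("this sketch needs two non-empty point sets of equal size")
    dim = len(x[0])
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_proj):
        # Random unit direction: a normalised Gaussian vector
        d = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in d))
        d = [c / norm for c in d]
        # Project to 1D and sort; in 1D, optimal transport pairs sorted samples,
        # so each projection costs O(n log n)
        px = sorted(sum(p[k] * d[k] for k in range(dim)) for p in x)
        py = sorted(sum(p[k] * d[k] for k in range(dim)) for p in y)
        total += sum((a - b) ** 2 for a, b in zip(px, py)) / len(px)
    return total / n_proj


# Random 3D points stand in for mesh vertices (illustrative data only)
data_rng = random.Random(1)
source = [[data_rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(200)]
target = [[p[0] + 1.0, p[1], p[2]] for p in source]   # shifted by one unit along x

d_same = sliced_wasserstein2(source, source)
d_shift = sliced_wasserstein2(source, target)
print(f"identical sets: {d_same:.4f}, shifted set: {d_shift:.4f}")
```

For a pure translation, each projection shifts all sorted samples by the same amount, so the estimate approaches the squared shift divided by the dimension (about 1/3 here) as the number of projections grows. The sort is what gives the log-linear cost per projection that the entry refers to.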
Details
Motivation: Existing surface registration methods trade off efficiency against robustness: local point matching is fast but sensitive to noise and initialisation, whereas global alignment is robust but computationally expensive.
Method: Surface meshes are modeled as probability measures and registration is cast as a distributional optimisation problem; the discrepancy between meshes is measured by a sliced Wasserstein distance with log-linear complexity; and the AdamFlow optimiser generalises Adam to the probability space to minimise this distance.
Result: The asymptotic convergence of AdamFlow is analysed theoretically, and experiments show that it outperforms existing methods in both affine and non-rigid registration across various anatomical structures, combining efficiency with robustness.
Conclusion: The proposed method effectively eases the efficiency-robustness trade-off and provides a practical and reliable surface registration paradigm for anatomical shape analysis in medical imaging.
Abstract: Surface registration plays an important role in anatomical shape analysis in medical imaging. Existing surface registration methods often face a trade-off between efficiency and robustness. Local point matching methods are computationally efficient, but vulnerable to noise and initialisation. Methods designed for global point set alignment tend to incur a high computational cost. To address the challenge, here we present a fast surface registration method, which formulates surface meshes as probability measures and surface registration as a distributional optimisation problem. The discrepancy between two meshes is measured using an efficient sliced Wasserstein distance with log-linear computational complexity. We propose a novel optimisation method, AdamFlow, which generalises the well-known Adam optimisation method from the Euclidean space to the probability space for minimising the sliced Wasserstein distance. We theoretically analyse the asymptotic convergence of AdamFlow and empirically demonstrate its superior performance in both affine and non-rigid surface registration across various anatomical structures.
[190] VOID: Video Object and Interaction Deletion
Saman Motamed,William Harvey,Benjamin Klein,Luc Van Gool,Zhuoning Yuan,Ta-Ying Cheng
Main category: cs.CV
TL;DR: VOID is a video object removal framework that enables physically-plausible inpainting by modeling causal physical interactions, using a synthetic counterfactual dataset and vision-language-guided video diffusion.
Details
Motivation: Current video object removal methods fail to handle scenes where the removed object has significant physical interactions (e.g., collisions), leading to implausible results; there's a need for physically consistent, causally-aware editing.
Method: VOID uses Kubric and HUMOTO to generate a paired synthetic dataset of counterfactual object removals with altered physical dynamics; during inference, a vision-language model detects affected regions, guiding a video diffusion model to generate physically consistent outcomes.
Result: VOID outperforms prior methods in preserving scene dynamics after object removal on both synthetic and real data, demonstrating improved physical plausibility and causal consistency.
Conclusion: VOID advances video editing toward world simulation by integrating high-level causal reasoning and physics-aware generation, offering a framework for more realistic and physically grounded video editing.
Abstract: Existing video object removal methods excel at inpainting content "behind" the object and correcting appearance-level artifacts such as shadows and reflections. However, when the removed object has more significant interactions, such as collisions with other objects, current models fail to correct them and produce implausible results. We present VOID, a video object removal framework designed to perform physically-plausible inpainting in these complex scenarios. To train the model, we generate a new paired dataset of counterfactual object removals using Kubric and HUMOTO, where removing an object requires altering downstream physical interactions. During inference, a vision-language model identifies regions of the scene affected by the removed object. These regions are then used to guide a video diffusion model that generates physically consistent counterfactual outcomes.
Experiments on both synthetic and real data show that our approach better preserves consistent scene dynamics after object removal compared to prior video object removal methods. We hope this framework sheds light on how to make video editing models better simulators of the world through high-level causal reasoning.
[191] A Simple Baseline for Streaming Video Understanding
Yujiao Shen,Shulin Tian,Jingkang Yang,Ziwei Liu
Main category: cs.CV
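TL;DR: This paper proposes SimpleStream, a sliding-window baseline that uses only the most recent N frames; it matches or surpasses models with complex memory mechanisms on several streaming video understanding benchmarks, reveals a trade-off between perception and memory, and argues that future benchmarks should separate recent-scene perception from long-range memory.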
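The sliding-window idea in this entry fits in a few lines of Python. This is a minimal sketch rather than the authors' implementation: the class name `RecentFrameWindow`, its `answer` method, and the `vlm` callable are hypothetical stand-ins, and only the "keep the most recent N frames and pass them to an off-the-shelf VLM" behaviour comes from the abstract.

```python
from collections import deque
from typing import Any, Callable, Deque, List


class RecentFrameWindow:
    """Keeps only the most recent `n` frames of a stream (hypothetical helper)."""

    def __init__(self, n: int = 4) -> None:
        if n < 1:
            raise ValueError("window size must be at least 1")
        # A deque with maxlen drops the oldest frame automatically on overflow
        self._frames: Deque[Any] = deque(maxlen=n)

    def push(self, frame: Any) -> None:
        """Add the newest frame; the oldest one is evicted once the window is full."""
        self._frames.append(frame)

    def frames(self) -> List[Any]:
        """Frames currently in the window, ordered oldest to newest."""
        return list(self._frames)

    def answer(self, question: str,
               vlm: Callable[[List[Any], str], str]) -> str:
        # `vlm` is an assumed callable standing in for an off-the-shelf VLM
        return vlm(self.frames(), question)


window = RecentFrameWindow(n=4)
for t in range(10):       # integers stand in for decoded video frames
    window.push(t)
recent = window.frames()  # only the last four frames remain: [6, 7, 8, 9]
print(recent)
```

Because the window has a fixed size, memory and per-query cost stay constant no matter how long the stream runs, which is what makes such a baseline a useful reference point for more elaborate memory modules.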
Details
Motivation: The paper challenges the trend of relying on complex memory mechanisms for streaming video understanding, tests how effective a simple sliding-window method is, and re-examines how much long context actually contributes to performance.
Method: SimpleStream is a sliding-window baseline that feeds only the most recent N frames to an off-the-shelf vision-language model (VLM); it is compared with 13 major offline and online video LLM baselines on OVO-Bench and StreamingBench; and controlled ablations analyse how context length and model scale relate to performance.
Result: With only 4 frames, SimpleStream reaches 67.7% average accuracy on OVO-Bench and 80.59% on StreamingBench; the benefit of longer context depends on the backbone rather than increasing uniformly with model scale; and there is a perception-memory trade-off in which more historical context can improve recall but often weakens real-time perception.
Conclusion: Complex memory modules should not be taken as progress by default unless they clearly outperform SimpleStream under the same protocol; future streaming benchmarks should decouple recent-scene perception from long-range memory so that the value of added complexity can be assessed more clearly.
Abstract: Recent streaming video understanding methods increasingly rely on complex memory mechanisms to handle long video streams. We challenge this trend with a simple finding: a sliding-window baseline that feeds only the most recent N frames to an off-the-shelf VLM already matches or surpasses published streaming models. We formalize this baseline as SimpleStream and evaluate it against 13 major offline and online video LLM baselines on OVO-Bench and StreamingBench. Despite its simplicity, SimpleStream delivers consistently strong performance. With only 4 recent frames, it reaches 67.7% average accuracy on OVO-Bench and 80.59% on StreamingBench. Controlled ablations further show that the value of longer context is backbone-dependent rather than uniformly increasing with model scale, and reveal a consistent perception-memory trade-off: adding more historical context can improve recall, but often weakens real-time perception. This suggests that stronger memory, retrieval, or compression modules should not be taken as evidence of progress unless they clearly outperform SimpleStream under the same protocol. We therefore argue that future streaming benchmarks should separate recent-scene perception from long-range memory, so that performance improvements from added complexity can be evaluated more clearly.
[192] Large-scale Codec Avatars: The Unreasonable Effectiveness of Large-scale Avatar Pretraining
Junxuan Li,Rawal Khirodkar,Chengan He,Zhongshi Jiang,Giljoo Nam,Lingchen Yang,Jihyun Lee,Egor Zakharov,Zhaoen Su,Rinat Abdrashitov,Yuan Dong,Julieta Martinez,Kai Li,Qingyang Tan,Takaaki Shiratori,Matthew Hu,Peihong Guo,Xuhua Huang,Ariyan Zarei,Marco Pesavento,Yichen Xu,He Wen,Teng Deng,Wyatt Borsos,Anjali Thakrar,Jean-Charles Bazin,Carsten Stoll,Ginés Hidalgo,James Booth,Lucy Wang,Xiaowen Ma,Yu Rong,Sairanjith Thalanki,Chen Cao,Christian Häne,Abhishek Kar,Sofien Bouaziz,Jason Saragih,Yaser Sheikh,Shunsuke Saito
Main category: cs.CV
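TL;DR: This paper proposes Large-Scale Codec Avatars (LCA), which uses a pre-training (1M in-the-wild videos) and post-training (high-quality multi-view data) paradigm to achieve both high fidelity and generalization in 3D avatar modeling; it performs well on identity preservation, expression and finger control, and diversity of hair styles, clothing, and demographics, and shows emergent capabilities such as relighting, loose garment support, and zero-shot robustness to stylized imagery.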
Details
Motivation: The work addresses the fundamental trade-off in high-quality 3D avatar modeling between fidelity (as offered by multi-view studio data) and generalization (as offered by large-scale in-the-wild data).
Method: LCA is a large-scale codec avatar model trained in two stages: pre-training on 1M in-the-wild videos to learn priors over appearance and geometry, followed by post-training on high-quality multi-view data to enhance expressivity and fidelity.
Result: LCA models fine-grained facial expressions, finger-level articulation, identity preservation, and diversity across hair styles, clothing, and demographics at high quality; without direct supervision it also shows emergent relightability, loose garment support, and zero-shot robustness to stylized imagery.
Conclusion: LCA is the first to bring the pre/post-training paradigm of large models to 3D avatar modeling, bridging the gap between fidelity and generalization and offering a new paradigm for real-time, high-quality 3D avatar generation for world-scale populations.
Abstract: High-quality 3D avatar modeling faces a critical trade-off between fidelity and generalization. On the one hand, multi-view studio data enables high-fidelity modeling of humans with precise control over expressions and poses, but it struggles to generalize to real-world data due to limited scale and the domain gap between the studio environment and the real world. On the other hand, recent large-scale avatar models trained on millions of in-the-wild samples show promise for generalization across a wide range of identities, yet the resulting avatars are often of low quality due to inherent 3D ambiguities. To address this, we present Large-Scale Codec Avatars (LCA), a high-fidelity, full-body 3D avatar model that generalizes to world-scale populations in a feedforward manner, enabling efficient inference. Inspired by the success of large language models and vision foundation models, we present, for the first time, a pre/post-training paradigm for 3D avatar modeling at scale: we pretrain on 1M in-the-wild videos to learn broad priors over appearance and geometry, then post-train on high-quality curated data to enhance expressivity and fidelity. LCA generalizes across hair styles, clothing, and demographics while providing precise, fine-grained facial expressions and finger-level articulation control, with strong identity preservation. Notably, we observe emergent generalization to relightability and loose garment support to unconstrained inputs, and zero-shot robustness to stylized imagery, despite the absence of direct supervision.
[193] Beyond Referring Expressions: Scenario Comprehension Visual Grounding
Ruozhen He,Nisarg A. Shah,Qihua Dong,Zilin Xiao,Jaywon Koo,Vicente Ordonez
Main category: cs.CV
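TL;DR: This paper introduces RSC, a new visual grounding benchmark focused on scenario-based grounding, where models must infer the target from roles, intentions, and relational context rather than explicit naming. It also proposes ScenGround, which uses curriculum reasoning to improve model performance on this task.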
Details
Motivation: Existing visual grounding benchmarks mainly evaluate alignment between image regions and literal referring expressions, where models can often succeed simply by matching a prominent named category. This paper explores a more challenging scenario-based setting in which the target must be inferred from roles, intentions, and relational context.
Method: Builds the Referring Scenario Comprehension (RSC) benchmark, with paragraph-length queries and fine-grained difficulty annotations, and proposes ScenGround, which combines supervised warm-starting with difficulty-aware reinforcement learning for curriculum training.
Result: Experiments show that scenario-based queries expose systematic failures in current models that standard benchmarks do not reveal. Curriculum training improves performance on the hard subsets and also transfers to standard benchmarks.
Conclusion: Scenario-based visual grounding is an important complement to existing benchmarks. RSC and ScenGround offer a new direction and practical tools for advancing deep semantic understanding and robust reasoning.
Abstract: Existing visual grounding benchmarks primarily evaluate alignment between image regions and literal referring expressions, where models can often succeed by matching a prominent named category. We explore a complementary and more challenging setting of scenario-based visual grounding, where the target must be inferred from roles, intentions, and relational context rather than explicit naming. We introduce Referring Scenario Comprehension (RSC), a benchmark designed for this setting. The queries in this benchmark are paragraph-length texts that describe object roles, user goals, and contextual cues, including deliberate references to distractor objects that often require deep understanding to resolve. Each instance is annotated with interpretable difficulty tags for uniqueness, clutter, size, overlap, and position, which expose distinct failure modes and support fine-grained analysis. RSC contains approximately 31k training examples, 4k in-domain test examples, and a 3k out-of-distribution split with unseen object categories. We further propose ScenGround, a curriculum reasoning method serving as a reference point for this setting, combining supervised warm-starting with difficulty-aware reinforcement learning. Experiments show that scenario-based queries expose systematic failures in current models that standard benchmarks do not reveal, and that curriculum training improves performance on challenging slices and transfers to standard benchmarks.
[194] Steerable Visual Representations
Jona Ruthardt,Manu Gaur,Deva Ramanan,Makarand Tapaswi,Yuki M. Asano
Main category: cs.CV
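TL;DR: This paper proposes steerable visual representations. By injecting text prompts early into the visual encoder through lightweight cross-attention, ViT features can be flexibly steered with natural language, combining generality with controllability.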
Details
Motivation: Pretrained ViTs (e.g., DINOv2, MAE) are general-purpose but cannot be directed toward non-salient concepts, while multimodal LLMs support text guidance but yield language-centric representations that weaken generic visual performance. The goal is to combine the strengths of both.
Method: Proposes Steerable Visual Representations: text prompts are injected directly into the layers of the visual encoder via lightweight cross-attention (early fusion), rather than fused after encoding as in CLIP-style late fusion. The paper also designs new benchmarks for measuring representational steerability.
Result: The method can focus precisely on any specified object in an image while preserving the underlying representation quality. It matches or outperforms dedicated approaches on anomaly detection and personalized object discrimination, and shows zero-shot generalization.
Conclusion: A visual encoder with early text injection effectively makes visual representations semantically steerable, opening a new path toward controllability in general-purpose vision models.
Abstract: Pretrained Vision Transformers (ViTs) such as DINOv2 and MAE provide generic image features that can be applied to a variety of downstream tasks such as retrieval, classification, and segmentation. However, such representations tend to focus on the most salient visual cues in the image, with no way to direct them toward less prominent concepts of interest. In contrast, Multimodal LLMs can be guided with textual prompts, but the resulting representations tend to be language-centric and lose their effectiveness for generic visual tasks. To address this, we introduce Steerable Visual Representations, a new class of visual representations whose global and local features can be steered with natural language. While most vision-language models (e.g., CLIP) fuse text with visual features after encoding (late fusion), we inject text directly into the layers of the visual encoder (early fusion) via lightweight cross-attention. We introduce benchmarks for measuring representational steerability, and demonstrate that our steerable visual features can focus on any desired objects in an image while preserving the underlying representation quality. Our method also matches or outperforms dedicated approaches on anomaly detection and personalized object discrimination, exhibiting zero-shot generalization to out-of-distribution tasks.
[195] Modulate-and-Map: Crossmodal Feature Mapping with Cross-View Modulation for 3D Anomaly Detection
Alex Costanzino,Pierluigi Zama Ramirez,Giuseppe Lisanti,Luigi Di Stefano
Main category: cs.CV
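TL;DR: ModMap is a natively multiview, multimodal framework for 3D anomaly detection and segmentation. It combines crossmodal and cross-view feature mapping, feature-wise modulation, and a cross-view training strategy with a depth encoder designed for industrial data, achieving state-of-the-art performance on the SiM3D benchmark.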
Details
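Motivation: Existing methods process each view independently and do not effectively model the relationships across views and modalities. They also lack a dedicated encoder suited to high-resolution industrial 3D data.
Method: Proposes the ModMap framework, which maps features across modalities and views, models view-dependent relationships through feature-wise modulation, and uses a cross-view training strategy. A depth encoder tailored to industrial data is also trained and publicly released.
Result: On SiM3D, a multiview and multimodal benchmark for 3D anomaly detection and segmentation, ModMap surpasses previous methods by wide margins and attains state-of-the-art performance.
Conclusion: ModMap demonstrates the importance of jointly modeling multiview and multimodal information for 3D anomaly detection and segmentation, offering a new paradigm for industrial 3D visual inspection.
Abstract: We present ModMap, a natively multiview and multimodal framework for 3D anomaly detection and segmentation. Unlike existing methods that process views independently, our method draws inspiration from the crossmodal feature mapping paradigm to learn to map features across both modalities and views, while explicitly modelling view-dependent relationships through feature-wise modulation. We introduce a cross-view training strategy that leverages all possible view combinations, enabling effective anomaly scoring through multiview ensembling and aggregation. To process high-resolution 3D data, we train and publicly release a foundational depth encoder tailored to industrial datasets. Experiments on SiM3D, a recent benchmark that introduces the first multiview and multimodal setup for 3D anomaly detection and segmentation, demonstrate that ModMap attains state-of-the-art performance by surpassing previous methods by wide margins.
[196] Generative World Renderer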
Zheng-Hui Huang,Zhixiang Wang,Jiaming Tan,Ruihan Yu,Yidan Zhang,Bo Zheng,Yu-Lun Liu,Yung-Yu Chuang,Kaipeng Zhang
Main category: cs.CV
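TL;DR: This paper presents a large-scale dynamic synthetic dataset built from AAA games. A dual-screen stitched capture method collects 4M frames of synchronized RGB and G-buffer data to improve the realism and temporal coherence of generative inverse and forward rendering in real-world scenes. The paper also designs a VLM-based evaluation protocol that requires no ground truth.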
Details
Motivation: Existing synthetic datasets fall short in realism and temporal coherence, which makes it hard to scale generative inverse and forward rendering to real-world scenes and leaves a significant domain gap.
Method: Proposes a dual-screen stitched capture method to build a large-scale dynamic dataset from AAA games with RGB and five G-buffer channels (4M frames, 720p/30 FPS); designs a vision-language model (VLM) based protocol that assesses semantic, spatial, and temporal consistency; and develops a forward rendering toolkit that supports text-driven G-buffer editing.
Result: Inverse renderers fine-tuned on this data perform better on cross-dataset generalization and controllable video generation. The VLM evaluation correlates strongly with human judgment, and the forward rendering toolkit supports text-guided editing of AAA game styles.
Conclusion: With high-quality game-derived synthetic data and a new evaluation paradigm, this work narrows the domain gap for generative rendering in real-world scenes and moves bidirectional rendering closer to practical use.
Abstract: Scaling generative inverse and forward rendering to real-world scenarios is bottlenecked by the limited realism and temporal coherence of existing synthetic datasets. To bridge this persistent domain gap, we introduce a large-scale, dynamic dataset curated from visually complex AAA games. Using a novel dual-screen stitched capture method, we extracted 4M continuous frames (720p/30 FPS) of synchronized RGB and five G-buffer channels across diverse scenes, visual effects, and environments, including adverse weather and motion-blur variants. This dataset uniquely advances bidirectional rendering: enabling robust in-the-wild geometry and material decomposition, and facilitating high-fidelity G-buffer-guided video generation. Furthermore, to evaluate the real-world performance of inverse rendering without ground truth, we propose a novel VLM-based assessment protocol measuring semantic, spatial, and temporal consistency. Experiments demonstrate that inverse renderers fine-tuned on our data achieve superior cross-dataset generalization and controllable generation, while our VLM evaluation strongly correlates with human judgment. Combined with our toolkit, our forward renderer enables users to edit styles of AAA games from G-buffers using text prompts.
[197] ActionParty: Multi-Subject Action Binding in Generative Video Games
Alexander Pondaven,Ziyi Wu,Igor Gilitschenski,Philip Torr,Sergey Tulyakov,Fabio Pizzati,Aliaksandr Siarohin
Main category: cs.CV
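TL;DR: This paper proposes ActionParty, a world model for generative video games that can control the actions of multiple subjects. By introducing subject state tokens and a spatial biasing mechanism, it addresses the action binding problem in existing video diffusion models and enables simultaneous control of up to seven players.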
Details
Motivation: Existing video diffusion models are largely limited to single-agent settings. They cannot control multiple agents in a scene at the same time and struggle to bind actions to the correct subjects.
Method: Proposes the ActionParty model, which introduces subject state tokens that persistently represent the state of each subject, and jointly models the state tokens and video latents with a spatial biasing mechanism, disentangling global frame rendering from individual action-controlled subject updates.
Result: On the Melting Pot benchmark, ActionParty is the first to control up to seven players simultaneously across 46 different environments, with significant improvements in action-following accuracy and identity consistency, along with robust autoregressive tracking of subjects through complex interactions.
Conclusion: ActionParty effectively addresses multi-subject action binding and offers a new paradigm for building scalable, controllable multi-agent video world models.
Abstract: Recent advances in video diffusion have enabled the development of "world models" capable of simulating interactive environments. However, these models are largely restricted to single-agent settings, failing to control multiple agents simultaneously in a scene. In this work, we tackle a fundamental issue of action binding in existing video diffusion models, which struggle to associate specific actions with their corresponding subjects. For this purpose, we propose ActionParty, an action-controllable multi-subject world model for generative video games. It introduces subject state tokens, i.e., latent variables that persistently capture the state of each subject in the scene. By jointly modeling state tokens and video latents with a spatial biasing mechanism, we disentangle global video frame rendering from individual action-controlled subject updates. We evaluate ActionParty on the Melting Pot benchmark, demonstrating the first video world model capable of controlling up to seven players simultaneously across 46 diverse environments. Our results show significant improvements in action-following accuracy and identity consistency, while enabling robust autoregressive tracking of subjects through complex interactions.
[198] EventHub: Data Factory for Generalizable Event-Based Stereo Networks without Active Sensors
Luca Bartolomei,Fabio Tosi,Matteo Poggi,Stefano Mattoccia,Guillermo Gallego
Main category: cs.CV
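TL;DR: EventHub is a framework for training deep event-based stereo networks without ground-truth annotations. It uses standard color images to generate proxy annotations and proxy event data, improving the generalization of event stereo matching models and, in turn, boosting RGB stereo foundation models in challenging settings such as nighttime scenes.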