Table of Contents
cs.CL
[1] Evaluating Prompting Strategies for Chart Question Answering with Large Language Models
Ruthuparna Naikar,Ying Zhu
Main category: cs.CL
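TL;DR: This paper systematically evaluates four prompting strategies (zero-shot, few-shot, zero-shot chain-of-thought, and few-shot chain-of-thought) on chart question answering and finds that few-shot chain-of-thought prompting performs best on reasoning-intensive questions.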
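As a rough illustration of the four prompting strategies compared in this entry, the sketch below builds each prompt from the same structured chart data so that prompt structure is the only thing that varies. The template wording, the `build_prompt` and `exact_match` helpers, and the "Let's think step by step." cue are assumptions made for illustration; the paper's actual prompts and metric normalization are not given in the abstract.

```python
from typing import Sequence, Tuple

# Hypothetical zero-shot chain-of-thought cue (an assumption, not the paper's prompt).
COT_CUE = "Let's think step by step."

STRATEGIES = {"zero_shot", "few_shot", "zero_shot_cot", "few_shot_cot"}

# Each demonstration is (chart_table, question, rationale, answer).
Example = Tuple[str, str, str, str]


def build_prompt(table: str, question: str, strategy: str,
                 examples: Sequence[Example] = ()) -> str:
    """Build one prompt; only the prompt structure changes between strategies."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy}")
    use_cot = strategy.endswith("_cot")
    use_examples = strategy.startswith("few_shot")

    parts = []
    if use_examples:
        for ex_table, ex_question, ex_rationale, ex_answer in examples:
            block = f"Chart data:\n{ex_table}\nQuestion: {ex_question}\n"
            if use_cot:
                # Few-shot CoT demonstrations include the worked reasoning.
                block += f"Reasoning: {ex_rationale}\n"
            block += f"Answer: {ex_answer}\n"
            parts.append(block)

    target = f"Chart data:\n{table}\nQuestion: {question}\n"
    target += f"{COT_CUE}\n" if use_cot else "Answer:"
    parts.append(target)
    return "\n".join(parts)


def exact_match(prediction: str, gold: str) -> bool:
    """Exact Match after trimming whitespace and lowercasing (an assumed normalization)."""
    return prediction.strip().lower() == gold.strip().lower()
```

Calling `build_prompt(table, question, "few_shot_cot", examples)` with a handful of demonstrations gives the few-shot chain-of-thought variant; passing `"zero_shot"` drops both the demonstrations and the reasoning cue.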
Details
Motivation: Prompting strategies affect the reasoning performance of large language models, but their role in chart-based question answering has not been studied in depth.
Method: GPT-3.5, GPT-4, and GPT-4o are evaluated systematically on the ChartQA dataset under four prompting paradigms, with prompt structure as the only experimental variable and Accuracy and Exact Match as the two metrics.
Result: On 1,200 diverse ChartQA samples, few-shot chain-of-thought prompting achieves the highest accuracy (up to 78.2%), especially on reasoning-intensive questions; few-shot prompting improves format adherence; zero-shot prompting performs well only when high-capacity models handle simple tasks.
Conclusion: The study offers practical guidance for choosing prompting strategies in structured-data reasoning tasks, balancing efficiency and accuracy, with implications for real-world applications.
Abstract: Prompting strategies affect LLM reasoning performance, but their role in chart-based QA remains underexplored. We present a systematic evaluation of four widely used prompting paradigms (Zero-Shot, Few-Shot, Zero-Shot Chain-of-Thought, and Few-Shot Chain-of-Thought) across GPT-3.5, GPT-4, and GPT-4o on the ChartQA dataset. Our framework operates exclusively on structured chart data, isolating prompt structure as the only experimental variable, and evaluates performance using two metrics: Accuracy and Exact Match. Results from 1,200 diverse ChartQA samples show that Few-Shot Chain-of-Thought prompting consistently yields the highest accuracy (up to 78.2%), particularly on reasoning-intensive questions, while Few-Shot prompting improves format adherence. Zero-Shot performs well only with high-capacity models on simpler tasks. These findings provide actionable guidance for selecting prompting strategies in structured data reasoning tasks, with implications for both efficiency and accuracy in real-world applications.
[2] MERIT: Memory-Enhanced Retrieval for Interpretable Knowledge Tracing
Runze Li,Kedi Chen,Guwei Feng,Mo Yu,Jun Wang,Wei Zhang
Main category: cs.CL
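TL;DR: MERIT is a training-free knowledge tracing framework that pairs a frozen large language model with structured pedagogical memory. Through semantic denoising, a paradigm bank, and hierarchical retrieval, it tracks student knowledge states accurately and interpretably without gradient updates.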
Details
Motivation: Traditional deep learning KT models lack interpretability, while existing LLM-based methods suffer from limited context, hallucinations, and costly fine-tuning, making it hard to combine accuracy, interpretability, and scalability.
Method: MERIT (1) converts interaction logs into an interpretable memory bank; (2) uses semantic denoising to group students into cognitive schemas; (3) builds an offline bank of error paradigms with explicit chain-of-thought (CoT) rationales; and (4) at inference time uses hierarchical routing and a logic-augmented module for context retrieval and prediction calibration.
Result: MERIT reaches state-of-the-art performance on real-world datasets without parameter updates, significantly reduces computational cost, and supports dynamic knowledge updates.
Conclusion: By anchoring the LLM in interpretable memory, MERIT delivers accurate, low-cost, interpretable, and extensible knowledge tracing, improving the transparency and practicality of educational diagnosis.
Abstract: Knowledge Tracing (KT) models students' evolving knowledge states to predict future performance, serving as a foundation for personalized education. While traditional deep learning models achieve high accuracy, they often lack interpretability. Large Language Models (LLMs) offer strong reasoning capabilities but struggle with limited context windows and hallucinations. Furthermore, existing LLM-based methods typically require expensive fine-tuning, limiting scalability and adaptability to new data. We propose MERIT (Memory-Enhanced Retrieval for Interpretable Knowledge Tracing), a training-free framework combining frozen LLM reasoning with structured pedagogical memory. Rather than updating parameters, MERIT transforms raw interaction logs into an interpretable memory bank. The framework uses semantic denoising to categorize students into latent cognitive schemas and constructs a paradigm bank where representative error patterns are analyzed offline to generate explicit Chain-of-Thought (CoT) rationales. During inference, a hierarchical routing mechanism retrieves relevant contexts, while a logic-augmented module applies semantic constraints to calibrate predictions. By grounding the LLM in interpretable memory, MERIT achieves state-of-the-art performance on real-world datasets without gradient updates. This approach reduces computational costs and supports dynamic knowledge updates, improving the accessibility and transparency of educational diagnosis.
[3] Less is More: Adapting Text Embeddings for Low-Resource Languages with Small Scale Noisy Synthetic Data
Zaruhi Navasardyan,Spartak Bughdaryan,Bagrat Minasyan,Hrant Davtyan
Main category: cs.CL
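TL;DR: This paper challenges the assumption that semantic alignment for low-resource languages (LRLs) requires large, high-quality datasets. Fine-tuning a multilingual encoder (mE5) on only 10,000 noisy synthetic pairs yields retrieval performance on Armenian comparable to training on about a million examples, a "Less is More" effect that also generalizes to another LRL.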
Details
Motivation: LRLs lack high-quality, large-scale training data, which limits text embedding models in tasks such as RAG and semantic search; existing approaches depend on human-verified translations or massive datasets and are costly and impractical.
Method: For Armenian (which has a unique script), open-weights models translate English Reddit title-body pairs into a small noisy synthetic dataset (only 10,000 pairs) used to fine-tune mE5; a comprehensive benchmark is built from existing datasets, translated data, and a manually curated dataset.
Result: Fine-tuning on just 10,000 noisy synthetic pairs gives an 11-12% average improvement and a relative retrieval gain above 20%, matching training on about 1 million examples; more data, better translation quality, or broader domains bring no significant further gain; the finding also holds for another LRL with a unique script.
Conclusion: Semantic alignment for LRLs saturates early and is highly robust to noise; small noisy datasets suffice for high-performance embeddings, giving resource-constrained communities a low-cost, transferable solution.
Abstract: Low-resource languages (LRLs) often lack high-quality, large-scale datasets for training effective text embedding models, hindering their application in tasks like retrieval-augmented generation (RAG) and semantic search. In this work, we challenge the prevailing assumption that effective semantic alignment requires massive datasets or pristine, human-verified translations. Focusing on Armenian (an LRL with a unique script), we introduce a cost-effective adaptation strategy using small scale noisy synthetic data generated by translating English Reddit title-body pairs with open-weights models. We establish a comprehensive evaluation benchmark comprising existing datasets, translated data, and a manually curated dataset. Our experiments reveal a surprising "Less is More" phenomenon: fine-tuning a multilingual encoder (mE5) on just 10,000 noisy synthetic pairs yields 11-12% average improvements across the benchmark with a 20%+ relative improvement in retrieval performance, matching the performance of models trained on ~1 million examples. Furthermore, we demonstrate that neither increasing data scale, improving translation quality via state-of-the-art LLMs, nor diversifying data domains yields significant gains over this minimal baseline. We validate the generalizability of these findings on another LRL with a unique script. Our results suggest that semantic alignment for LRLs saturates early and is highly robust to noise, democratizing high-performance embedding creation for resource-constrained communities.
We release the model, data, and the benchmark at https://metric-ai-lab.github.io/less-is-more-embeddings/ to facilitate further research.
[4] Evaluating Large Language Models' Responses to Sexual and Reproductive Health Queries in Nepali
Medha Sharma,Supriya Khadka,Udit Chandra Aryal,Bishnu Hari Bhatta,Bijayan Bhattarai,Santosh Dahal,Kamal Gautam,Pushpa Joshi,Saugat Kafle,Shristi Khadka,Shushila Khadka,Binod Lamichhane,Shilpa Lamichhane,Anusha Parajuli,Sabina Pokharel,Suvekshya Sitaula,Neha Verma,Bishesh Khanal
Main category: cs.CL
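TL;DR: This paper proposes the LEAF evaluation framework for assessing large language models in sensitive domains such as sexual and reproductive health (SRH) along four dimensions: accuracy, language, usability (relevance, adequacy, cultural appropriateness), and safety (safety, sensitivity, confidentiality). On Nepali SRH data only 35.1% of responses were judged proper, exposing clear weaknesses of current LLMs on low-resource languages and sensitive topics.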
Details
Motivation: Existing LLM evaluation focuses on accuracy for objective queries, mostly in high-resource languages, and lacks criteria for usability and safety in low-resource languages and culturally sensitive domains such as SRH.
Method: The multi-dimensional LEAF framework covers accuracy, language, usability gaps (relevance, adequacy, cultural appropriateness), and safety gaps (safety, sensitivity, confidentiality); 14K Nepali SRH user queries were manually annotated by SRH experts.
Result: Only 35.1% of LLM responses were judged "proper" (accurate, adequate, and free of major usability or safety gaps); ChatGPT versions were similar in accuracy but differed in usability and safety.
Conclusion: Current LLMs have serious limitations in low-resource languages and sensitive health domains; evaluation frameworks such as LEAF that cover usability and safety are needed to guide improvement, and the framework can be adapted across languages and domains.
Abstract: As Large Language Models (LLMs) become integrated into daily life, they are increasingly used for personal queries, including Sexual and Reproductive Health (SRH), allowing users to chat anonymously without fear of judgment. However, current evaluation methods primarily focus on accuracy, often for objective queries in high-resource languages, and lack criteria to assess usability and safety, especially for low-resource languages and culturally sensitive domains like SRH. This paper introduces the LLM Evaluation Framework (LEAF), which conducts assessments across multiple criteria: accuracy, language, usability gaps (including relevance, adequacy, and cultural appropriateness), and safety gaps (safety, sensitivity, and confidentiality). Using the LEAF framework, we assessed 14K SRH queries in Nepali from over 9K users. Responses were manually annotated by SRH experts according to the framework. Results revealed that only 35.1% of the responses were "proper", meaning they were accurate, adequate and had no major usability or safety related gaps. Insights include differences in performance between ChatGPT versions, such as similar accuracy but varying usability and safety aspects. This evaluation highlights significant limitations of current LLMs and underscores the need for improvement. The LEAF Framework is adaptable across domains and languages, particularly where usability and safety are critical, offering a pathway to better address sensitive topics.
[5] TIPS: Turn-Level Information-Potential Reward Shaping for Search-Augmented LLMs
Yutao Xie,Nathaniel Thomas,Nicklas Hansen,Yang Fu,Li Erran Li,Xiaolong Wang
Main category: cs.CL
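TL;DR: This paper proposes TIPS, a reward shaping framework that gives dense, fine-grained rewards at every reasoning and tool-call turn, based on the gain in a teacher model's probability of the correct answer. It targets the instability that sparse rewards and hard credit assignment cause when training search-augmented LLMs with reinforcement learning. Experiments show that TIPS improves performance and training stability on several open-domain QA benchmarks.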
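To make the idea of turn-level, potential-based shaping in this entry concrete, here is a minimal sketch. It assumes the potential of a partial trajectory is the teacher model's log-probability of the correct answer given everything produced so far, and that each turn is rewarded with the standard potential-based difference `gamma * phi(next) - phi(current)`. The function name, the choice of log-probability as the potential, and adding the outcome reward to the final turn are illustrative assumptions rather than the paper's stated formulation.

```python
from typing import List


def turn_level_rewards(teacher_logprobs: List[float],
                       outcome_reward: float,
                       gamma: float = 1.0) -> List[float]:
    """Dense per-turn rewards from a potential function.

    teacher_logprobs[t] is the assumed potential after t turns (index 0 is the
    state before any reasoning or tool call). One reward is returned per turn,
    and the sparse outcome reward is added to the last turn.
    """
    if len(teacher_logprobs) < 2:
        raise ValueError("need the initial potential plus at least one turn")
    rewards = [
        gamma * teacher_logprobs[t + 1] - teacher_logprobs[t]
        for t in range(len(teacher_logprobs) - 1)
    ]
    rewards[-1] += outcome_reward
    return rewards


# Toy trajectory: the second turn retrieves nothing useful, so its shaping term is zero.
potentials = [-4.0, -2.5, -2.5, -0.5]
rewards = turn_level_rewards(potentials, outcome_reward=1.0)
```

With `gamma = 1.0` the shaping terms telescope, so their sum depends only on the first and last potentials. That telescoping is the property behind the policy invariance usually attributed to potential-based shaping.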
Details
Motivation: RL training of search-augmented LLMs suffers from sparse rewards and difficult credit assignment across reasoning and tool-call steps, which makes training unstable.
Method: Turn-Level Information Potential Reward Shaping (TIPS) uses a teacher model to measure how much each reasoning + tool-call turn raises the probability of the correct answer, and uses this gain as a dense, policy-invariant, potential-based reward.
Result: On seven QA benchmarks TIPS consistently outperforms GRPO/PPO baselines; with Qwen-2.5 7B Instruct, average EM improves by 11.8% and F1 by 13.6% relative to PPO.
Conclusion: TIPS is an effective and general solution to sparse-reward credit assignment in multi-turn LLM reasoning.
Abstract: Search-augmented large language models (LLMs) trained with reinforcement learning (RL) have achieved strong results on open-domain question answering (QA), but training still remains a significant challenge. The optimization is often unstable due to sparse rewards and difficult credit assignments across reasoning and tool calls. To address this, we introduce Turn-Level Information Potential Reward Shaping (TIPS), a simple framework that assigns dense, turn-level rewards to each reasoning + tool-call segment based on the increased likelihood of the correct answer under a teacher model. By leveraging potential-based reward shaping, TIPS offers fine-grained and policy-invariant guidance that overcomes the limitations of outcome-only optimization. Evaluated on seven QA benchmarks, TIPS consistently outperforms GRPO/PPO baselines and substantially improves training stability. For instance, with a Qwen-2.5 7B Instruct model, TIPS improves the average Exact Match score by 11.8% and F1 by 13.6% relative to PPO. Our results demonstrate that turn-level information-potential reward shaping provides an effective and general solution to sparse-reward credit assignment for multi-turn LLM reasoning.
[6] Whether, Not Which: Mechanistic Interpretability Reveals Dissociable Affect Reception and Emotion Categorization in LLMs
Michael Keeman
Main category: cs.CL
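TL;DR: Using an experimental design inspired by clinical psychology, this paper tests for the first time whether the so-called emotion circuits in large language models capture emotional meaning rather than merely spotting emotion keywords. Models detect emotional content with high accuracy (affect reception), but assigning specific emotion categories (emotion categorization) still depends partly on keywords, revealing two dissociable emotion processing mechanisms.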
Details
Motivation: Prior studies claiming emotion representations all use stimuli containing explicit emotion words, so they cannot tell whether a model understands emotional semantics or just matches keywords; a clinically valid test is needed.
Method: Clinical vignettes without emotion keywords serve as stimuli, combined with four mechanistic interpretability methods (linear probing, causal activation patching, knockout experiments, and representational geometry), applied to six mainstream open models.
Result: Two dissociable mechanisms emerge: (1) affect reception reaches AUROC 1.000 in all models and is localized in early layers; (2) emotion categorization drops 1-7% without keywords and improves with model scale. Causal patching shows that the two stimulus types share representational space and transfer affective salience rather than category identity.
Conclusion: The models do perceive genuine emotional meaning, which rules out the pure keyword-matching hypothesis; the paper proposes a mechanistic dichotomy of emotion processing and establishes the clinical stimulus paradigm as a standard for testing LLM emotion processing, with direct implications for AI safety and alignment evaluation.
Abstract: Large language models appear to develop internal representations of emotion -- "emotion circuits," "emotion neurons," and structured emotional manifolds have been reported across multiple model families. But every study making these claims uses stimuli signalled by explicit emotion keywords, leaving a fundamental question unanswered: do these circuits detect genuine emotional meaning, or do they detect the word "devastated"? We present the first clinical validity test of emotion circuit claims using mechanistic interpretability methods grounded in clinical psychology -- clinical vignettes that evoke emotions through situational and behavioural cues alone, emotion keywords removed. Across six models (Llama-3.2-1B, Llama-3-8B, Gemma-2-9B; base and instruct variants), we apply four convergent mechanistic interpretability methods -- linear probing, causal activation patching, knockout experiments, and representational geometry -- and discover two dissociable emotion processing mechanisms. Affect reception -- detecting emotionally significant content -- operates with near-perfect accuracy (AUROC 1.000), consistent with early-layer saturation, and replicates across all six models. Emotion categorization -- mapping affect to specific emotion labels -- is partially keyword-dependent, dropping 1-7% without keywords and improving with scale. Causal activation patching confirms keyword-rich and keyword-free stimuli share representational space, transferring affective salience rather than emotion-category identity.
These findings falsify the keyword-spotting hypothesis, establish a novel mechanistic dissociation, and introduce clinical stimulus methodology as a rigorous standard for testing emotion processing claims in large language models -- with direct implications for AI safety evaluation and alignment. All stimuli, code, and data are released for replication.
[7] Sparse but Critical: A Token-Level Analysis of Distributional Shifts in RLVR Fine-Tuning of LLMs
Haoming Meng,Kexin Huang,Shaohang Wei,Chiyu Ma,Shuo Yang,Xue Wang,Guoyin Wang,Bolin Ding,Jingren Zhou
Main category: cs.CL
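TL;DR: This paper systematically studies the token-level mechanisms by which reinforcement learning with verifiable rewards (RLVR) improves reasoning in large language models. RL fine-tuning causes only sparse, targeted distributional changes, and cross-sampling experiments show that this small set of token decisions directly accounts for the performance gains.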
Details
Motivation: RLVR markedly improves LLM reasoning, but how it acts at the token level is unclear.
Method: Three analyses: (1) characterizing token-level distributional shifts between base and RL models; (2) measuring how those shifts affect sequence-level reasoning performance through cross-sampling interventions; (3) examining their fine-grained mechanics (entropy, positional concentration, reallocation of probability mass). A diagnostic intervention with divergence-weighted advantage signals is also run.
Result: RL fine-tuning meaningfully changes only a small fraction of token distributions; these sparse changes are structured (for example low entropy and positional concentration); inserting a few RL tokens into base generations recovers most of the gain, while the reverse collapses RL performance; divergence-weighted advantage signals can improve results further.
Conclusion: RLVR is not a global retuning but a targeted refinement focused on key token decisions; the paper offers a fine-grained, token-level view of its mechanism.
Abstract: Reinforcement learning with verifiable rewards (RLVR) has significantly improved reasoning in large language models (LLMs), yet the token-level mechanisms underlying these improvements remain unclear. We present a systematic empirical study of RLVR's distributional effects organized around three main analyses: (1) token-level characterization of distributional shifts between base and RL models, (2) the impact of token-level distributional shifts on sequence-level reasoning performance through cross-sampling interventions, and (3) fine-grained mechanics of these shifts at the token level. We find that RL fine-tuning induces highly sparse and targeted changes, with only a small fraction of token distributions exhibiting meaningful divergence between the base and RL policies. We further characterize the structure and evolution of these shifts through analyses of token entropy, positional concentration, and reallocation of probability mass. To assess the functional importance of these sparse changes, we conduct cross-sampling experiments that selectively swap token choices between the base and RL models with varying intervention budgets. We show that inserting only a small fraction of RL-sampled tokens into base generations progressively recovers RL performance gains, while injecting a similarly small number of base token choices into otherwise RL-generated sequences collapses performance to base levels, isolating a small set of token-level decisions directly responsible for RLVR's performance gains.
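The two ingredients described above, per-token divergence between base and RL policies and budgeted cross-sampling, can be sketched on toy data as follows. This is a simplified illustration under stated assumptions: KL divergence is used as the divergence measure (the abstract does not name one), distributions are small dictionaries rather than model outputs, and tokens are swapped on two already-aligned sequences, whereas an actual cross-sampling run would presumably regenerate the continuation after each swap. All function names are hypothetical.

```python
import math
from typing import Dict, List

Dist = Dict[str, float]  # token -> probability at one position


def kl_divergence(p: Dist, q: Dist, eps: float = 1e-12) -> float:
    """KL(p || q), flooring missing q entries at eps to avoid division by zero."""
    return sum(
        pv * math.log(pv / max(q.get(tok, 0.0), eps))
        for tok, pv in p.items() if pv > 0.0
    )


def divergent_positions(rl_dists: List[Dist], base_dists: List[Dist],
                        threshold: float) -> List[int]:
    """Positions where the RL policy differs meaningfully from the base policy."""
    return [
        i for i, (p, q) in enumerate(zip(rl_dists, base_dists))
        if kl_divergence(p, q) > threshold
    ]


def cross_sample(base_tokens: List[str], rl_tokens: List[str],
                 positions: List[int], budget: int) -> List[str]:
    """Copy at most `budget` RL token choices into the base sequence."""
    mixed = list(base_tokens)
    for i in positions[:budget]:
        mixed[i] = rl_tokens[i]
    return mixed


# Toy example: the two policies agree at position 0 and disagree at position 1.
base = [{"a": 0.5, "b": 0.5}, {"x": 0.9, "y": 0.1}]
rl = [{"a": 0.5, "b": 0.5}, {"x": 0.1, "y": 0.9}]
shifted = divergent_positions(rl, base, threshold=0.1)
mixed = cross_sample(["a", "x"], ["a", "y"], shifted, budget=1)
```

The ratio `len(shifted) / len(base)` gives the fraction of divergent positions, which is the kind of sparsity statistic the abstract refers to.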
Finally, we explore divergence-weighted variants of the advantage signal as a diagnostic intervention, finding that they can yield improvements over baselines. Together, our results shed light on the distributional changes induced by RLVR and provide a fine-grained, token-level lens for understanding RLVR fine-tuning as a targeted refinement process.
[8] Towards Automated Community Notes Generation with Large Vision Language Models for Combating Contextual Deception
Jin Ma,Jingwen Yan,Mohammed Aldeen,Ethan Anderson,Taran Kavuru,Jinkyung Katie Park,Feng Luo,Long Cheng
Main category: cs.CL
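TL;DR: This paper proposes ACCNote, a method that automatically generates Community Notes to correct contextual deception around images (for example a wrong time, entity, or event). With the real-world dataset XCheck, a retrieval-augmented multi-agent framework, and the new metric CHS, it improves both detection and note generation.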
Details
Motivation: Community Notes depend on human contributors and struggle to be both timely and scalable; image-based contextual deception calls for concise, grounded corrective notes rather than a true/false verdict, and the task is underexplored because data are scarce, the deception is dynamic, and evaluation is hard.
Method: Curate the real-world dataset XCheck; propose ACCNote, a retrieval-augmented multi-agent collaboration framework built on large vision-language models; design the Context Helpfulness Score (CHS), a metric aligned with user study outcomes.
Result: On XCheck, ACCNote beats baselines in both deception detection and note generation and outperforms the commercial model GPT5-mini.
Conclusion: With a new dataset, method, and metric, the work advances practical automated generation of context-corrective notes for more responsible social networks.
Abstract: Community Notes have emerged as an effective crowd-sourced mechanism for combating online deception on social media platforms. However, their reliance on human contributors limits both timeliness and scalability. In this work, we study the automated Community Notes generation method for image-based contextual deception, where an authentic image is paired with misleading context (e.g., time, entity, and event). Unlike prior work that primarily focuses on deception detection (i.e., judging whether a post is true or false in a binary manner), Community Notes-style systems need to generate concise and grounded notes that help users recover the missing or corrected context. This problem remains underexplored due to three reasons: (i) datasets that support the research are scarce; (ii) methods must handle the dynamic nature of contextual deception; (iii) evaluation is difficult because standard metrics do not capture whether notes actually improve user understanding. To address these gaps, we curate a real-world dataset, XCheck, comprising X posts with associated Community Notes and external contexts. We further propose the Automated Context-Corrective Note generation method, named ACCNote, which is a retrieval-augmented, multi-agent collaboration framework built on large vision-language models. Finally, we introduce a new evaluation metric, Context Helpfulness Score (CHS), that aligns with user study outcomes rather than relying on lexical overlap.
Experiments on our XCheck dataset show that the proposed ACCNote improves both deception detection and note generation performance over baselines, and outperforms the commercial model GPT5-mini. Together, our dataset, method, and metric advance practical automated generation of context-corrective notes toward more responsible online social networks.
[9] LLM-guided headline rewriting for clickability enhancement without clickbait
Yehudit Aperstein,Linoy Halifa,Sagiv Bar,Alexander Apartsin
Main category: cs.CL
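TL;DR: This paper proposes an LLM-based guided headline rewriting framework. Two guide models, a positive one for engagement and a negative one for clickbait, let the system raise headline appeal in a controllable way while preserving semantic fidelity and avoiding a slide into clickbait.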
Details
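Motivation: Headline optimization is often equated with clickbait, which damages editorial trust; the authors view clickbait as the extreme result of over-amplifying legitimate engagement cues, so controllable generation must balance appeal, semantic faithfulness, and clickbait avoidance.
Method: A controllable generation framework based on the FUDGE paradigm: one main LLM steered by two guide models trained on neutral headlines from a real-world news corpus, (1) a clickbait scoring model (negative guidance) and (2) an engagement-attribute model (positive guidance). Clickbait samples are synthesized by an LLM using predefined tactics. Adjusting the guidance weights yields a continuum from neutral to highly engaging yet acceptable headlines.
Result: The framework gives fine-grained, adjustable control over headline appeal under semantic-fidelity constraints; it is shown to raise click potential substantially while suppressing clickbait tendencies; and it provides an interpretable setting for studying the trade-off among appeal, fidelity, and clickbait avoidance.
Conclusion: Clickbait should be seen as the outcome of unbalanced engagement amplification rather than a separate style; the dual-guide FUDGE framework offers a responsible, interpretable, and controllable approach to LLM-driven headline optimization in journalism.
Abstract: Enhancing reader engagement while preserving informational fidelity is a central challenge in controllable text generation for news media. Optimizing news headlines for reader engagement is often conflated with clickbait, resulting in exaggerated or misleading phrasing that undermines editorial trust. We frame clickbait not as a separate stylistic category, but as an extreme outcome of disproportionate amplification of otherwise legitimate engagement cues. Based on this view, we formulate headline rewriting as a controllable generation problem, where specific engagement-oriented linguistic attributes are selectively strengthened under explicit constraints on semantic faithfulness and proportional emphasis. We present a guided headline rewriting framework built on a large language model (LLM) that uses the Future Discriminators for Generation (FUDGE) paradigm for inference-time control. The LLM is steered by two auxiliary guide models: (1) a clickbait scoring model that provides negative guidance to suppress excessive stylistic amplification, and (2) an engagement-attribute model that provides positive guidance aligned with target clickability objectives. Both guides are trained on neutral headlines drawn from a curated real-world news corpus. At the same time, clickbait variants are generated synthetically by rewriting these original headlines using an LLM under controlled activation of predefined engagement tactics.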
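A FUDGE-style decoding step with one positive and one negative guide might look like the sketch below. It assumes each guide returns, for every candidate next token, a probability that the finished headline will be engaging or will read as clickbait, and that the three signals are combined additively in log space with adjustable weights. The function name, the exact combination rule, and the toy numbers are assumptions for illustration; the abstract does not specify how the guides are combined.

```python
import math
from typing import Dict


def guided_logprobs(base_logprobs: Dict[str, float],
                    engage_prob: Dict[str, float],
                    clickbait_prob: Dict[str, float],
                    w_pos: float = 1.0,
                    w_neg: float = 1.0,
                    eps: float = 1e-9) -> Dict[str, float]:
    """Reweight next-token log-probabilities with two guide signals.

    score(tok) = log p_LM(tok)
                 + w_pos * log p(engaging | prefix + tok)
                 + w_neg * log (1 - p(clickbait | prefix + tok))
    The scores are then renormalized over the candidate set.
    """
    scores = {}
    for tok, lp in base_logprobs.items():
        positive = math.log(max(engage_prob[tok], eps))
        negative = math.log(max(1.0 - clickbait_prob[tok], eps))
        scores[tok] = lp + w_pos * positive + w_neg * negative
    log_z = math.log(sum(math.exp(s) for s in scores.values()))
    return {tok: s - log_z for tok, s in scores.items()}


# Toy candidates for the next headline word (all numbers are made up).
base = {"reveals": math.log(0.5), "SHOCKING": math.log(0.3), "reports": math.log(0.2)}
engage = {"reveals": 0.8, "SHOCKING": 0.9, "reports": 0.3}
clickbait = {"reveals": 0.2, "SHOCKING": 0.95, "reports": 0.05}

guided = guided_logprobs(base, engage, clickbait)
best = max(guided, key=guided.get)
```

Setting both weights to zero recovers the unguided distribution, while raising `w_pos` alone pushes toward more engaging wording. That gives a continuum of outputs controlled by the two weights.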
By adjusting guidance weights at inference time, the system generates headlines along a continuum from neutral paraphrases to more engaging yet editorially acceptable formulations. The proposed framework provides a principled approach for studying the trade-off between attractiveness, semantic preservation, and clickbait avoidance, and supports responsible LLM-based headline optimization in journalistic settings.
[10] Functional Component Ablation Reveals Specialization Patterns in Hybrid Language Model Architectures
Hector Borobia,Elies Seguí-Mas,Guillermina Tormo-Carbó
Main category: cs.CL
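TL;DR: This paper presents a functional component ablation framework and uses it to evaluate how attention interacts with state space models or linear attention in two sub-1B hybrid language models (Qwen3.5-0.8B and Falcon-H1-0.5B). The alternative component (SSM or linear attention) turns out to be the backbone, and the hybrid architectures show substantial functional redundancy and fault tolerance.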
Details
Motivation: Determine whether attention and SSM/linear attention in hybrid language models are both genuinely used, rather than one being bypassed or marginalized.
Method: Group ablations, layer-wise sweeps, positional ablations, matched random controls, and multi-benchmark perplexity analysis on two sub-1B hybrid models (Qwen3.5-0.8B, Falcon-H1-0.5B) and a pure Transformer control (Qwen2.5-0.5B).
Result: (1) Both component types are essential; (2) the alternative component (SSM/linear attention) is the backbone: removing it raises perplexity by more than 35,000x, versus about 82x for removing attention; (3) component importance follows a positional gradient, with early layers more critical; (4) hybrid models are 20-119x more resilient to random layer removal than pure Transformers.
Conclusion: Hybrid architectures are not simple stacks but cooperative systems with functional redundancy and division of labor, which gives empirical grounding for model compression, architecture design, and fault-tolerant deployment.
Abstract: Hybrid language models combining attention with state space models (SSMs) or linear attention offer improved efficiency, but whether both components are genuinely utilized remains unclear. We present a functional component ablation framework applied to two sub-1B hybrid models -- Qwen3.5-0.8B (sequential: Gated DeltaNet + softmax attention) and Falcon-H1-0.5B (parallel: Mamba-2 + attention) -- with a pure Transformer control (Qwen2.5-0.5B). Through group ablations, layer-wise sweeps, positional ablations, matched random controls, and perplexity analysis across five benchmarks, we establish four findings: (1) both component types are essential and neither is bypassed; (2) the alternative component (linear attention or SSM) is the primary language modeling backbone, causing >35,000x perplexity degradation when removed versus ~82x for attention; (3) component importance follows a positional gradient, with early layers being disproportionately critical; and (4) hybrid architectures exhibit 20-119x greater resilience to random layer removal than pure Transformers, revealing built-in functional redundancy between component types. These results provide actionable guidance for hybrid model compression, architecture design, and fault-tolerant deployment.
[11] Rashid: A Cipher-Based Framework for Exploring In-Context Language Learning
Niyati Bafna,Ryan Soh-Eun Shim,Barbara Plank,David Yarowsky,Hale Sirin
Main category: cs.CL
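TL;DR: This paper proposes Rashid, a framework that reversibly ciphers high-resource languages (HRLs) to construct truly unseen languages. This sidesteps the lack of tools, data, and experts that holds back ICLL research on low-resource languages and supports large-scale, multi-task, reproducible ICLL studies.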
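The core trick in this entry, reversibly ciphering a high-resource language so that it looks unseen while every annotation can still be mapped back, can be illustrated with a seeded letter-substitution cipher. This is only a sketch: the ciphers Rashid actually uses are not described in the abstract, and character-level substitution over ASCII letters, along with the `make_cipher` and `apply_table` names, are assumptions chosen for brevity.

```python
import random
import string
from typing import Dict, Tuple


def make_cipher(seed: int) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return (encode, decode) tables for a seeded permutation of a-z."""
    letters = list(string.ascii_lowercase)
    shuffled = letters[:]
    random.Random(seed).shuffle(shuffled)
    encode = dict(zip(letters, shuffled))
    decode = {cipher: plain for plain, cipher in encode.items()}
    return encode, decode


def apply_table(text: str, table: Dict[str, str]) -> str:
    """Substitute letters, preserving case, spacing, and punctuation."""
    out = []
    for ch in text:
        lower = ch.lower()
        if lower in table:
            sub = table[lower]
            out.append(sub.upper() if ch.isupper() else sub)
        else:
            out.append(ch)  # digits, whitespace, punctuation are left untouched
    return "".join(out)


encode, decode = make_cipher(seed=0)
source = "The cat sat on the mat."
ciphered = apply_table(source, encode)
restored = apply_table(ciphered, decode)
```

Because the mapping is a bijection, token boundaries and word frequencies carry over unchanged. Labels, parses, or reference translations attached to the original text remain valid for the ciphered text.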
Details
Motivation: In in-context language learning (ICLL) research, low-resource languages lack NLP tools, data, and experts, which makes evaluation hard, experiments expensive, and conclusions hard to generalize.
Method: Rashid applies reversible ciphers to high-resource languages (such as English) to produce "pseudo low-resource languages" that keep realistic structure but are unreadable, simulating a low-resource setting while retaining rich annotated data and tools.
Result: Using Rashid, the authors evaluate existing ICLL methods, examine the usefulness of expensive resources (such as bilingual dictionaries and phonemic information), and test ICLL strategies on rich downstream tasks beyond translation, revealing limits of current methods and directions for improvement.
Conclusion: Rashid gives ICLL research a controlled, scalable, and reproducible experimental setup, broadens what the field can explore, and offers practical paths and empirical insights for future work on low-resource language learning.
Abstract: While there is growing interest in in-context language learning (ICLL) for unseen languages with large language models, such languages usually suffer from the lack of NLP tools, data resources, and researcher expertise. This means that progress is difficult to assess, the field does not allow for cheap large-scale experimentation, and findings on ICLL are often limited to very few languages and tasks. In light of such limitations, we introduce a framework (Rashid) for studying ICLL wherein we reversibly cipher high-resource languages (HRLs) to construct truly unseen languages with access to a wide range of resources available for HRLs, unlocking previously impossible exploration of ICLL phenomena. We use our framework to assess current methods in the field with SOTA evaluation tools and manual analysis, explore the utility of potentially expensive resources in improving ICLL, and test ICLL strategies on rich downstream tasks beyond machine translation. These lines of exploration showcase the possibilities enabled by our framework, as well as providing actionable insights regarding current performance and future directions in ICLL.
[12] Reddit After Roe: A Computational Analysis of Abortion Narratives and Barriers in the Wake of Dobbs
Aria Pessianzadeh,Alex H. Poole,Rezvaneh Rezapour
Main category: cs.CL
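TL;DR: Through a computational analysis of more than 17,000 abortion-related Reddit posts, this study examines how users describe barriers to abortion access around the Dobbs decision. It identifies eight barrier types, including legal, financial, emotional, and social ones, and finds that emotional and psychological barriers dominate in every phase, accompanied by marked negative emotions.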
Details
Motivation: The Dobbs decision reshaped abortion rights in the United States, so it is important to understand how the public, especially online communities, perceives, expresses, and copes with new access barriers.
Method: Using 17,000+ posts from four abortion-related subreddits, a multi-step classification pipeline labels information type, abortion stage (before, during, after), eight barrier categories, and emotions; topic modeling traces how barrier rationales evolve over time.
Result: Emotional and psychological barriers (with emotions such as fear, confusion, sadness, and nervousness) consistently dominate online narratives; barrier types are systematically associated with particular emotions and information behaviors (seeking/sharing); topic modeling shows barrier discourse shifting with the legal and cultural context.
Conclusion: Online abortion discourse centers on emotional burden rather than institutional barriers alone; the study provides a multi-dimensional empirical account of public health narratives under abrupt policy change and highlights digital platforms as a window for monitoring vulnerability and offering support.
Abstract: The 2022 U.S. Supreme Court decision in Dobbs v. Jackson Women's Health Organization reshaped the reproductive rights landscape, introducing new uncertainty and barriers to abortion access. We present a large-scale computational analysis of abortion discourse on Reddit, examining how barriers to access are articulated across information-seeking and information-sharing behaviors, different stages of abortion (before, during, after), and three phases of the Dobbs decision in 2022. Drawing on more than 17,000 posts from four abortion-related subreddits, we employed a multi-step pipeline to classify posts by information type, abortion stage, barrier category, and expressed emotions. Using a codebook of eight barrier types, including legal, financial, emotional, and social obstacles, we analyzed their associations with emotions and information behaviors. Topic modeling of model-generated barrier rationales further revealed how discourse evolved in response to shifting legal and cultural contexts. Our findings show that emotional and psychological barriers consistently dominate abortion narratives online, with emotions such as nervousness, confusion, fear, and sadness prevalent across discourse. By linking information behaviors, barriers, emotions, and temporal dynamics, this study provides a multi-dimensional account of how abortion is navigated in online communities.
[13] CAPITU: A Benchmark for Evaluating Instruction-Following in Brazilian Portuguese with Literary Context
Giovana Kerche Bonás,Roseval Malaquias Junior,Marcos Piau,Thiago Laitz,Thales Sales Almeida,Hugo Abonizio,Celio Larcher,Ramon Pires,Rodrigo Nogueira
Main category: cs.CL
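TL;DR: CAPITU is a benchmark for instruction following in Brazilian Portuguese that sets 59 automatically verifiable instruction types in the context of eight canonical works of Brazilian literature. Across 18 models, frontier reasoning models score highest (GPT-5.2 with reasoning: 98.5% strict accuracy), while Portuguese-specialized models are competitive on cost.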
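The abstract for this entry stresses that every instruction type can be checked automatically, including Portuguese word-termination constraints such as -ando/-endo/-indo, -inho/-inha, and -mente. A checker for that kind of constraint could be as small as the sketch below. The function names, the "at least N words" form of the constraint, and the example sentence are hypothetical; the benchmark's real verifiers are not shown in the abstract.

```python
import re
from typing import Iterable, List


def words_ending_with(text: str, suffixes: Iterable[str]) -> List[str]:
    """Return the words in `text` that end with any of the given suffixes."""
    # [^\W\d_]+ matches runs of Unicode letters, so accented words stay whole.
    words = re.findall(r"[^\W\d_]+", text.lower())
    endings = tuple(suffixes)
    return [w for w in words if w.endswith(endings)]


def check_min_suffix_words(text: str, suffixes: Iterable[str], minimum: int) -> bool:
    """Verify an instruction such as 'use at least N words ending in -mente'."""
    return len(words_ending_with(text, suffixes)) >= minimum


# Made-up response used only to exercise the checker.
response = "Capitu olhava calmamente, andando devagarinho."
adverbs = words_ending_with(response, ["mente"])
gerunds = words_ending_with(response, ["ando", "endo", "indo"])
diminutives = words_ending_with(response, ["inho", "inha"])
```

This is a plain suffix match, so a word like "quando" would also count as ending in -ando; a stricter verifier would need morphological information.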
Details
Abstract: We introduce CAPITU, a benchmark for evaluating instruction-following capabilities of Large Language Models (LLMs) in Brazilian Portuguese. Unlike existing benchmarks that focus on English or use generic prompts, CAPITU contextualizes all tasks within eight canonical works of Brazilian literature, combining verifiable instruction constraints with culturally-grounded content. The benchmark comprises 59 instruction types organized into seven categories, all designed to be automatically verifiable without requiring LLM judges or human evaluation. Instruction types include Portuguese-specific linguistic constraints (word termination patterns like -ando/-endo/-indo, -inho/-inha, -mente) and structural requirements. We evaluate 18 state-of-the-art models across single-turn and multi-turn settings. Our results show that frontier reasoning models achieve strong performance (GPT-5.2 with reasoning: 98.5% strict accuracy), while Portuguese-specialized models offer competitive cost-efficiency (Sabiazinho-4: 87.0% at $0.13 vs Claude-Haiku-4.5: 73.5% at $1.12). Multi-turn evaluation reveals significant variation in constraint persistence, with conversation-level accuracy ranging from 60% to 96% across models. We identify specific challenges in morphological constraints, exact counting, and constraint persistence degradation across turns. We release the complete benchmark, evaluation code, and baseline results to facilitate research on instruction-following in Portuguese.
[14] Lie to Me: How Faithful Is Chain-of-Thought Reasoning in Reasoning Models?
Richard J. Young
Main category: cs.CL
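TL;DR: This study evaluates the chain-of-thought (CoT) faithfulness of 12 open-weight reasoning models, that is, whether a model accurately verbalizes the factors that actually influence its output. Faithfulness ranges widely, from 39.7% to 89.9%, and depends far more on architecture and training method than on parameter count. Models acknowledge hint influence in their thinking tokens (about 87.5%) much more often than in their answer text (about 28.6%), which points to systematic limits of CoT as a safety monitoring mechanism.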
Details
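Motivation: CoT is used as a transparency mechanism for large models in safety-critical settings, but its value depends on faithfulness; earlier evaluations covered only two proprietary models with poor results (25%-39%), so a systematic evaluation across the open-weight ecosystem is needed.
Method: Twelve open-weight reasoning models (7B-685B parameters, 9 architectural families) are tested on 498 multiple-choice questions from MMLU and GPQA Diamond with six categories of injected reasoning hints (such as consistency, sycophancy, and metadata). The study measures how often a model acknowledges the hint in its CoT when the hint successfully changes the answer, and uses keyword analysis to separate acknowledgment in thinking tokens from acknowledgment in answer text.
Result: Overall faithfulness ranges from 39.7% (Seed-1.6-Flash) to 89.9% (DeepSeek-V3.2-Speciale); consistency hints (35.5%) and sycophancy hints (53.9%) have the lowest acknowledgment rates; training method and model family predict faithfulness better than parameter count; acknowledgment is about 87.5% in thinking tokens but only about 28.6% in answer text.
Conclusion: CoT faithfulness is not a fixed property of a model but varies systematically with architecture, training method, and hint type; the viability of CoT monitoring as a safety mechanism is limited, and finer-grained faithfulness modeling and intervention are needed.
Abstract: Chain-of-thought (CoT) reasoning has been proposed as a transparency mechanism for large language models in safety-critical deployments, yet its effectiveness depends on faithfulness (whether models accurately verbalize the factors that actually influence their outputs), a property that prior evaluations have examined in only two proprietary models, finding acknowledgment rates as low as 25% for Claude 3.7 Sonnet and 39% for DeepSeek-R1. To extend this evaluation across the open-weight ecosystem, this study tests 12 open-weight reasoning models spanning 9 architectural families (7B-685B parameters) on 498 multiple-choice questions from MMLU and GPQA Diamond, injecting six categories of reasoning hints (sycophancy, consistency, visual pattern, metadata, grader hacking, and unethical information) and measuring the rate at which models acknowledge hint influence in their CoT when hints successfully alter answers. Across 41,832 inference runs, overall faithfulness rates range from 39.7% (Seed-1.6-Flash) to 89.9% (DeepSeek-V3.2-Speciale) across model families, with consistency hints (35.5%) and sycophancy hints (53.9%) exhibiting the lowest acknowledgment rates.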
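The faithfulness metric described above conditions on the hint having changed the answer. The sketch below shows one way to compute it, assuming "successfully alter" means the model's answer moved from a different option to the hinted option, and that acknowledgment has already been labeled per run. The `Run` record and its field names are hypothetical. As a side note, 41,832 equals 12 × 498 × 7, which would be consistent with six hinted conditions plus one unhinted baseline per question, though the abstract does not spell out that breakdown.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Run:
    baseline_answer: str   # answer without any hint
    hinted_answer: str     # answer with the hint injected
    hint_target: str       # option the hint points to
    acknowledged: bool     # whether the CoT mentions relying on the hint


def faithfulness_rate(runs: Iterable[Run]) -> Optional[float]:
    """Share of hint-influenced runs whose CoT acknowledges the hint.

    Only runs where the answer switched to the hinted option are counted;
    returns None when no run was influenced.
    """
    influenced = [
        r for r in runs
        if r.baseline_answer != r.hint_target and r.hinted_answer == r.hint_target
    ]
    if not influenced:
        return None
    return sum(r.acknowledged for r in influenced) / len(influenced)


runs = [
    Run("A", "B", "B", True),    # influenced and acknowledged
    Run("A", "B", "B", False),   # influenced but silent
    Run("A", "A", "B", True),    # hint ignored: excluded
    Run("B", "B", "B", False),   # already answered B: excluded
]
rate = faithfulness_rate(runs)
```

Running the same function separately on thinking-token labels and answer-text labels would give the two acknowledgment rates that the abstract contrasts.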
Training methodology and model family predict faithfulness more strongly than parameter count, and keyword-based analysis reveals a striking gap between thinking-token acknowledgment (approximately 87.5%) and answer-text acknowledgment (approximately 28.6%), suggesting that models internally recognize hint influence but systematically suppress this acknowledgment in their outputs. These findings carry direct implications for the viability of CoT monitoring as a safety mechanism and suggest that faithfulness is not a fixed property of reasoning models but varies systematically with architecture, training method, and the nature of the influencing cue.
[15] LGSE: Lexically Grounded Subword Embedding Initialization for Low-Resource Language Adaptation
Hailay Teklehaymanot,Dren Fazlija,Wolfgang Nejdl
Main category: cs.CL
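TL;DR: This paper proposes LGSE, a framework that uses morphology-aware subword segmentation and embedding initialization to improve the adaptation of pretrained language models to low-resource, morphologically rich languages.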
Details
Motivation: Existing vocabulary expansion methods rely on arbitrary subword segmentation, which fragments lexical representations and loses key morphological information, so they adapt poorly to low-resource, morphologically rich languages.
Method: Lexically Grounded Subword Embedding Initialization (LGSE) decomposes words into morphemes and initializes each new token embedding as the average of pretrained subword or FastText-based morpheme representations; tokens that cannot be segmented fall back to character n-gram representations; during language-adaptive pretraining a regularization term limits how far new embeddings drift from their initial values, and only the new embeddings are updated.
Result: On question answering, named entity recognition, and text classification in Amharic and Tigrinya, LGSE consistently outperforms baseline methods.
Conclusion: Morphology-driven embedding initialization improves representation quality for low-resource languages, supporting the use of morphological structure as prior knowledge for initialization.
Abstract: Adapting pretrained language models to low-resource, morphologically rich languages remains a significant challenge. Existing vocabulary expansion methods typically rely on arbitrarily segmented subword units, resulting in fragmented lexical representations and loss of critical morphological information. To address this limitation, we propose the Lexically Grounded Subword Embedding Initialization (LGSE) framework, which introduces morphologically informed segmentation for initializing embeddings of novel tokens. Instead of using random vectors or arbitrary subwords, LGSE decomposes words into their constituent morphemes and constructs semantically coherent embeddings by averaging pretrained subword or FastText-based morpheme representations. When a token cannot be segmented into meaningful morphemes, its embedding is constructed using character n-gram representations to capture structural information. During Language-Adaptive Pretraining, we apply a regularization term that penalizes large deviations of newly introduced embeddings from their initialized values, preserving alignment with the original pretrained embedding space while enabling adaptation to the target language. To isolate the effect of initialization, we retain the original pre-trained model vocabulary and tokenizer and update only the new embeddings during adaptation. We evaluate LGSE on three NLP tasks: Question Answering, Named Entity Recognition, and Text Classification, in two morphologically rich, low-resource languages: Amharic and Tigrinya, where morphological segmentation resources are available.
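The initialization and regularization steps described above can be sketched with plain Python lists standing in for embedding vectors. The segmenter passed in as `segment`, the lookup tables, the trigram fallback size, and the squared-L2 form of the penalty are all assumptions made for illustration; the abstract states only that morpheme representations are averaged, that character n-grams are the fallback, and that large deviations from the initial values are penalized.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Vector]) -> Vector:
    """Element-wise average of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def init_embedding(token: str,
                   segment: Callable[[str], List[str]],
                   morpheme_vecs: Dict[str, Vector],
                   ngram_vecs: Dict[str, Vector],
                   n: int = 3) -> Vector:
    """Initialize a new token from its morphemes, else from character n-grams."""
    morphemes = segment(token)  # hypothetical segmenter; [] when no analysis exists
    known = [morpheme_vecs[m] for m in morphemes if m in morpheme_vecs]
    if known:
        return mean_vector(known)
    grams = [token[i:i + n] for i in range(len(token) - n + 1)]
    known = [ngram_vecs[g] for g in grams if g in ngram_vecs]
    if not known:
        raise KeyError(f"no morpheme or n-gram vectors available for {token!r}")
    return mean_vector(known)


def deviation_penalty(current: Vector, initial: Vector, lam: float) -> float:
    """Assumed regularizer: lam * squared L2 distance from the initial embedding."""
    return lam * sum((c - i) ** 2 for c, i in zip(current, initial))


# Toy tables (English stand-ins; the paper works on Amharic and Tigrinya).
morpheme_vecs = {"un": [1.0, 0.0], "do": [0.0, 1.0]}
ngram_vecs = {"xyz": [2.0, 4.0]}
toy_segment = lambda tok: {"undo": ["un", "do"]}.get(tok, [])

segmented = init_embedding("undo", toy_segment, morpheme_vecs, ngram_vecs)
fallback = init_embedding("xyz", toy_segment, morpheme_vecs, ngram_vecs)
penalty = deviation_penalty([1.0, 1.0], segmented, lam=2.0)
```

During adaptation the penalty would be added to the language modeling loss for the new embedding rows only, since the abstract says the original vocabulary and the rest of the model are left as they are.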
Experimental results show that LGSE consistently outperforms baseline methods across all tasks, demonstrating the effectiveness of morphologically grounded embedding initialization for improving representation quality in underrepresented languages. Project resources are available in the GitHub link.[16] Multi-Method Validation of Large Language Model Medical Translation Across High- and Low-Resource Languages
Chukwuebuka Anyaegbuna,Eduardo Juan Perez Guerrero,Jerry Liu,Timothy Keyes,April Liang,Natasha Steele,Stephen Ma,Jonathan Chen,Kevin Schulman
Main category: cs.CL
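TL;DR: This study evaluates four frontier large language models (GPT-5.1, Claude Opus 4.5, Gemini 3 Pro, Kimi K2) translating 22 medical documents into 8 languages spanning high-, medium-, and low-resource categories. Semantic fidelity is high and does not differ significantly by resource level, suggesting that LLMs could improve language access in healthcare.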
Details
Motivation: Professional medical translation is costly and often unavailable for the 27.3 million U.S. residents with a non-English language preference. Method: A five-layer validation framework evaluates four frontier LLMs translating 22 medical documents into 8 languages (high-, medium-, and low-resource), combining LaBSE scoring, cross-model back-translation, inter-model concordance analysis, and lexical borrowing analysis. Result: Across 704 translation pairs, all models achieve high semantic fidelity (LaBSE > 0.92); the difference between high- and low-resource languages is not significant (p = 0.066); cross-model back-translation shows no same-model circularity (delta = -0.0009); concordance across the four models is high (LaBSE = 0.946); lexical borrowing is uncorrelated with fidelity (rho = +0.018, p = 0.82). Conclusion: Frontier LLMs reliably preserve the meaning of medical text across language resource levels and have practical potential to support language access in healthcare. Abstract: Language barriers affect 27.3 million U.S. residents with non-English language preference, yet professional medical translation remains costly and often unavailable. We evaluated four frontier large language models (GPT-5.1, Claude Opus 4.5, Gemini 3 Pro, Kimi K2) translating 22 medical documents into 8 languages spanning high-resource (Spanish, Chinese, Russian, Vietnamese), medium-resource (Korean, Arabic), and low-resource (Tagalog, Haitian Creole) categories using a five-layer validation framework. Across 704 translation pairs, all models achieved high semantic preservation (LaBSE greater than 0.92), with no significant difference between high- and low-resource languages (p = 0.066). Cross-model back-translation confirmed results were not driven by same-model circularity (delta = -0.0009). Inter-model concordance across four independently trained models was high (LaBSE: 0.946), and lexical borrowing analysis showed no correlation between English term retention and fidelity scores in low-resource languages (rho = +0.018, p = 0.82). These converging results suggest frontier LLMs preserve medical meaning across resource levels, with implications for language access in healthcare.
[17] Improving LLM Predictions via Inter-Layer Structural Encoders
Tom Ulanovski,Eyal Blyachman,Maya Bechler-Speicher
Main category: cs.CL
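TL;DR: This paper proposes Inter-Layer Structural Encoders (ILSE), which use a Cayley-Encoder to extract and fuse information from the intermediate layers of an LLM, substantially improving performance across tasks, especially for small models and few-shot settings.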
Details
Motivation: LLMs typically use only the final-layer representation, yet studies show that different tasks may need features from different intermediate layers, and there is no systematic way to fuse information across layers. Method: Proposes the ILSE framework, built around Cayley-Encoder, a geometric encoder based on expander Cayley graphs that propagates and fuses structured information across layers. Result: Evaluated on 13 classification and semantic similarity tasks with 9 LLMs ranging from 14M to 8B parameters, ILSE improves accuracy by up to 44% and similarity metrics by up to 25%, is data-efficient in few-shot regimes, and lets small models compete with larger ones. Conclusion: ILSE offers a general, efficient, and scalable paradigm for fusing cross-layer representations, strengthening LLM representations, particularly in resource-constrained settings. Abstract: The standard practice in Large Language Models (LLMs) is to base predictions on the final-layer token representations. Recent studies, however, show that intermediate layers encode substantial information, which may contain more task-relevant features than the final-layer representations alone. Importantly, it was shown that for different tasks, different layers may be optimal. In this work we introduce Inter-Layer Structural Encoders (ILSE), a powerful structural approach to learn one effective representation from the LLM's internal layer representations all together. Central to ILSE is Cayley-Encoder, a mathematically grounded geometric encoder that leverages expander Cayley graphs for efficient inter-layer information propagation. We evaluate ILSE across 13 classification and semantic similarity tasks with 9 pre-trained LLMs ranging from 14 million to 8 billion parameters. ILSE consistently outperforms baselines and existing approaches, achieving up to 44% improvement in accuracy and 25% in similarity metrics. We further show that ILSE is data-efficient in few-shot regimes and can make small LLMs competitive with substantially larger models.
[18] Synthetic or Authentic? Building Mental Patient Simulators from Longitudinal Evidence
Baihan Li,Bingrui Jin,Kunyao Lan,Ming Wang,Mengyue Wu
Main category: cs.CL
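TL;DR: This paper proposes the DEPROFILE framework, which builds unified patient profiles from multiple real-world data sources and introduces a Chain-of-Change agent that turns longitudinal records into temporally grounded memory representations, markedly improving the realism, behavioral diversity, and event richness of patient simulation for mental health dialogue systems.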
Details
Motivation: Existing patient simulation methods rely on snapshot-style prompts with limited profile information, which leads to homogeneous behavior and incoherent disease progression in multi-turn interactions. Method: Proposes DEPROFILE, a data-grounded patient simulation framework that integrates demographic attributes, standardized clinical symptoms, counseling dialogues, and longitudinal life-event histories; a Chain-of-Change agent transforms noisy longitudinal records into structured, temporally aligned memory representations. Result: Experiments on multiple LLMs show that the more comprehensive profiles built by DEPROFILE improve dialogue realism, behavioral diversity, and event richness, exceeding state-of-the-art baselines. Conclusion: Grounding patient simulation in verifiable longitudinal evidence is important for improving mental health dialogue systems. Abstract: Patient simulation is essential for developing and evaluating mental health dialogue systems. As most existing approaches rely on snapshot-style prompts with limited profile information, homogeneous behaviors and incoherent disease progression in multi-turn interactions have become key challenges. In this work, we propose DEPROFILE, a data-grounded patient simulation framework that constructs unified, multi-source patient profiles by integrating demographic attributes, standardized clinical symptoms, counseling dialogues, and longitudinal life-event histories from real-world data. We further introduce a Chain-of-Change agent to transform noisy longitudinal records into structured, temporally grounded memory representations for simulation. Experiments across multiple large language model (LLM) backbones show that, with the more comprehensive profiles constructed by DEPROFILE, dialogue realism, behavioral diversity, and event richness consistently improve and exceed state-of-the-art baselines, highlighting the importance of grounding patient simulation in verifiable longitudinal evidence.
[19] Detecting Non-Membership in LLM Training Data via Rank Correlations
Pranav Shetty,Mirazul Haque,Zhiqiang Ma,Xiaomo Liu
Main category: cs.CL
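TL;DR: This paper proposes PRISM, a test that uses grey-box access to model logits and builds on the observation that two models that have not seen a dataset show higher rank correlation in their normalized token probabilities. It verifies dataset-level non-membership, confirming that a specific dataset was not used in LLM training.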
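The correlation idea in this entry can be illustrated with a small, dependency-free Spearman rank correlation. This is only a sketch under assumptions: the inputs are taken to be per-token log probabilities that have already been normalized (the entry does not say how), and both the decision rule `looks_like_non_member` and its `threshold` value are hypothetical placeholders, not the paper's test statistic.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences of length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


def looks_like_non_member(logp_model_a: Sequence[float],
                          logp_model_b: Sequence[float],
                          threshold: float = 0.8) -> bool:
    """Hypothetical rule: high rank agreement suggests neither model saw the data."""
    return spearman(logp_model_a, logp_model_b) >= threshold
```

In practice the threshold would have to be calibrated against reference models known to be trained with and without the dataset.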
Details
Motivation: As LLM training data grows larger and more opaque, verifying that a dataset was not used in training has become essential, but prior work focuses on membership inference and largely overlooks non-membership verification. Method: Proposes the PRISM test, based on the key observation that two models that have not seen a dataset exhibit higher rank correlation in their normalized token log probabilities; a correlation-based test built on this detects dataset-level non-membership using only grey-box access to model logits. Result: PRISM reliably rules out training membership on all datasets tested while avoiding false positives, verifying that specific datasets did not take part in training. Conclusion: PRISM provides the first effective, reliable framework for verifying that specific datasets were excluded from LLM training, filling the gap in non-membership verification. Abstract: As large language models (LLMs) are trained on increasingly vast and opaque text corpora, determining which data contributed to training has become essential for copyright enforcement, compliance auditing, and user trust. While prior work focuses on detecting whether a dataset was used in training (membership inference), the complementary problem -- verifying that a dataset was not used -- has received little attention. We address this gap by introducing PRISM, a test that detects dataset-level non-membership using only grey-box access to model logits. Our key insight is that two models that have not seen a dataset exhibit higher rank correlation in their normalized token log probabilities than when one model has been trained on that data. Using this observation, we construct a correlation-based test that detects non-membership. Empirically, PRISM reliably rules out membership in training data across all datasets tested while avoiding false positives, thus offering a framework for verifying that specific datasets were excluded from LLM training.
[20] Who Spoke What When? Evaluating Spoken Language Models for Conversational ASR with Semantic and Overlap-Aware Metrics
Naohiro Tawara,Samuele Cornell,Alexander Polok,Marc Delcroix,Lukáš Burget,Shinji Watanabe
Main category: cs.CL
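TL;DR: This paper systematically compares LLM-based and modular pipeline approaches to speech recognition in multi-speaker conversations and proposes tcpSemER, a new semantic error rate metric. LLM-based systems perform acceptably in two-speaker settings but degrade markedly as speaker count and overlap increase, whereas modular pipelines are more robust.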
Details
Motivation: LLM-based ASR systems perform well on single-speaker benchmarks, but their robustness in realistic conditions with multiple speakers, overlapping speech, and far-field noise is unknown, calling for systematic evaluation and new metrics. Method: Compares systems along four axes (overlap robustness, semantic fidelity, speaker count, and single- versus multi-channel input); proposes tcpSemER, which replaces the conventional Levenshtein distance with embedding-based semantic similarity; decomposes tcpWER into overlapping and non-overlapping components for finer-grained analysis. Result: LLM-based systems match modular approaches in two-speaker settings but degrade significantly as speaker count and overlap increase; modular pipelines are more robust overall under complex conditions. Conclusion: LLM-based ASR still struggles with overlapping multi-speaker conversation, and modular methods retain an advantage in robustness; tcpSemER captures semantic-level errors more effectively. Abstract: Conversational automatic speech recognition remains challenging due to overlapping speech, far-field noise, and varying speaker counts. While recent LLM-based systems perform well on single-speaker benchmarks, their robustness in multi-speaker settings is unclear. We systematically compare LLM-based and modular pipeline approaches along four axes: overlap robustness, semantic fidelity, speaker count, and single- versus multi-channel input. To capture meaning-altering errors that conventional metrics miss, we introduce tcpSemER, which extends tcpWER by replacing Levenshtein distance with embedding-based semantic similarity. We further decompose tcpWER into overlapping and non-overlapping components for finer-grained analysis. Experiments across three datasets show that LLM-based systems are competitive in two-speaker settings but degrade as speaker count and overlap increase, whereas modular pipelines remain more robust.
[21] How Utilitarian Are OpenAI's Models Really? Replicating and Reinterpreting Pfeffer, Krügel, and Uhl (2025)
Johannes Himmelreich
Main category: cs.CL
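TL;DR: This study replicates and extends Pfeffer et al. (2025) on LLM moral judgments in trolley-type dilemmas and finds that single-prompt results are unreliable. Testing multiple prompt variants shows that GPT-4o's low utilitarian response rate stems from safety refusals rather than a moral stance, while reasoning models lean more utilitarian on the footbridge dilemma but often refuse to answer or give non-utilitarian answers. The study argues that multi-prompt robustness testing should be standard in empirical research on LLM moral behavior.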
Details
Motivation: To test and deepen the conclusions of Pfeffer et al. (2025) about differences in LLM moral reasoning, and to examine whether their results depend on prompt wording, thereby assessing the reliability of single-prompt evaluation. Method: Replicates the original study with four current OpenAI models and systematically tests prompt variants (e.g., "Should I...?" vs. "Is it morally permissible...?"), analyzing response patterns (utilitarian, non-utilitarian, refusal). Result: On the trolley problem, the original conclusion that reasoning models are more utilitarian does not hold: GPT-4o's low utilitarian rate stems from safety refusals triggered by the prompt framing; with the neutral question its utilitarian response rate reaches 99%; all models converge on utilitarian answers once prompt confounds are removed. On the footbridge dilemma, reasoning models lean more utilitarian overall but frequently refuse to answer or give non-utilitarian answers. Conclusion: Single-prompt evaluation cannot reliably reflect LLM moral reasoning; multi-prompt robustness testing is needed to avoid misleading conclusions; prompt design is a key confounding variable in LLM moral responses. Abstract: Pfeffer, Krügel, and Uhl (2025) report that OpenAI's reasoning model o1-mini produces more utilitarian responses to the trolley problem and footbridge dilemma than the non-reasoning model GPT-4o. I replicate their study with four current OpenAI models and extend it with prompt variant testing. The trolley finding does not survive: GPT-4o's low utilitarian rate doesn't reflect a deontological commitment but safety refusals triggered by the prompt's advisory framing. When framed as "Is it morally permissible...?" instead of "Should I...?", GPT-4o gives 99% utilitarian responses. All models converge on utilitarian answers when prompt confounds are removed. The footbridge finding survives with blemishes. Reasoning models tend to give more utilitarian responses than non-reasoning models across prompt variations. But often they refuse to answer the dilemma or, when they answer, give a non-utilitarian rather than a utilitarian answer. These results demonstrate that single-prompt evaluations of LLM moral reasoning are unreliable: multi-prompt robustness testing should be standard practice for any empirical claim about LLM behavior.
[22] Explanation Generation for Contradiction Reconciliation with LLMs
Jason Chan,Zhixue Zhao,Robert Gaizauskas
Main category: cs.CL
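TL;DR: This paper proposes a new NLP task, reconciliatory explanation generation, in which models generate explanations that make apparently contradictory statements compatible. The authors build the task from existing natural language inference datasets and design scalable automatic evaluation metrics. Experiments show that current LLMs have limited success on the task and that the benefit of additional test-time compute plateaus as model size increases.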
Details
Motivation: People in social interactions and professional domains often need to hypothesize explanations that reconcile contradictions, but existing NLP work mostly treats contradictions as errors to eliminate, and the reconciliatory reasoning ability of LLMs remains unexplored. Method: Introduces the reconciliatory explanation generation task, builds training and test samples by repurposing existing natural language inference (NLI) datasets, and designs automatic quality metrics based on logical consistency, informativeness, and conciseness. Result: Experiments with 18 LLMs show that most perform poorly on the task; gains from extending test-time compute (e.g., "thinking" steps) plateau as model size grows. Conclusion: Reconciliatory explanation generation is an overlooked dimension of LLM reasoning whose limitations may affect downstream applications such as chatbots and scientific aids, and it calls for targeted improvement. Abstract: Existing NLP work commonly treats contradictions as errors to be resolved by choosing which statements to accept or discard. Yet a key aspect of human reasoning in social interactions and professional domains is the ability to hypothesize explanations that reconcile contradictions. For example, "Cassie hates coffee" and "She buys coffee every day" may appear contradictory, yet both are compatible if Cassie has the unenviable daily chore of buying coffee for all her coworkers. Despite the growing reasoning capabilities of large language models (LLMs), their ability to hypothesize such reconciliatory explanations remains largely unexplored. To address this gap, we introduce the task of reconciliatory explanation generation, where models must generate explanations that effectively render contradictory statements compatible. We propose a novel method of repurposing existing natural language inference (NLI) datasets, and introduce quality metrics that enable scalable automatic evaluation. Experiments with 18 LLMs show that most models achieve limited success in this task, and that the benefit of extending test-time compute by "thinking" plateaus as model size increases. Our results highlight an under-explored dimension of LLM reasoning and the need to address this limitation in enhancing LLMs' downstream applications such as chatbots and scientific aids.
[23] PRISM: A Dual View of LLM Reasoning through Semantic Flow and Latent Computation
Ruidi Chang,Jiawei Zhou,Hanjie Chen
Main category: cs.CL
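TL;DR: PRISM is a new framework for jointly analyzing how LLM reasoning changes across steps and across layers. It reveals typical patterns in failed reasoning trajectories (such as verification loops, overthinking, and premature commitment) and the deeper effects of prompting on reasoning behavior.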
Details
Motivation: Existing studies analyze LLM reasoning from a single perspective, either the token sequence (across steps) or the hidden-state vectors (across layers), and lack a unified multi-dimensional diagnostic tool. Method: Proposes the PRISM framework, which uses probabilistic modeling to jointly analyze how reasoning trajectories evolve at the semantic level (across steps) and the hidden-state level (across layers), with empirical diagnosis across multiple reasoning models and benchmarks. Result: Failed reasoning is more likely to get trapped in verification loops and to diverge into distinct modes such as overthinking and premature commitment; prompting affects not only accuracy but also systematically changes semantic transitions and internal computational patterns. Conclusion: PRISM models reasoning as a structured, observable process, offering a practical, interpretable tool for analyzing and diagnosing LLM reasoning. Abstract: Large language models (LLMs) solve complex problems by generating multi-step reasoning traces. Yet these traces are typically analyzed from only one of two perspectives: the sequence of tokens across different reasoning steps in the generated text, or the hidden-state vectors across model layers within one step. We introduce PRISM (Probabilistic Reasoning Inspection through Semantic and Implicit Modeling), a framework and diagnostic tool for jointly analyzing both levels, providing a unified view of how reasoning evolves across steps and layers. Across multiple reasoning models and benchmarks, PRISM uncovers systematic patterns in the reasoning process, showing that failed trajectories are more likely to become trapped in unproductive verification loops and further diverge into distinct modes such as overthinking and premature commitment, which behave differently once a candidate answer is reached. It further reveals how prompting reshapes reasoning behavior beyond aggregate accuracy by altering both semantic transitions and internal computational patterns. By modeling reasoning trajectories as structured processes, PRISM makes these behaviors observable and analyzable rather than relying solely on final-task accuracy. Taken together, these insights position PRISM as a practical tool for analyzing and diagnosing reasoning processes in LLMs.
[24] KALAVAI: Predicting When Independent Specialist Fusion Works -- A Quantitative Model for Post-Hoc Cooperative LLM Training
Ramchand Kumaresan
Main category: cs.CL
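TL;DR: This paper proposes the KALAVAI protocol, which fuses independently trained domain specialist models post hoc to improve performance with predictable gains. The protocol uses lightweight MoE routing, shows significant gains in multilingual settings and at larger model scales, and identifies three key requirements: shared initialization, frozen layers, and learned routing.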
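The linear relation quoted in this entry's abstract (gain = 0.82 × divergence − 2.72) can be turned into a small estimator. This is a sketch under assumptions: both quantities are taken to be in percent, and the names `predicted_gain` and `break_even_divergence` are hypothetical. The fit is reported for n=6 points over 3-26% divergence, so values outside that range are extrapolation.

```python
SLOPE = 0.82       # coefficient quoted in the abstract
INTERCEPT = -2.72  # intercept quoted in the abstract


def predicted_gain(divergence_pct: float) -> float:
    """Predicted fusion gain (percent) for a given specialist divergence (percent)."""
    return SLOPE * divergence_pct + INTERCEPT


def break_even_divergence() -> float:
    """Divergence at which the fitted line predicts zero gain."""
    return -INTERCEPT / SLOPE


# The fitted line crosses zero near 3.3% divergence, which matches the
# abstract's remark that gains approach zero below about 3.3%.
zero_point = break_even_divergence()
```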
Details
Motivation: Independently trained domain specialists are hard to combine effectively and the gain from fusion is unpredictable; the paper seeks a practical post-hoc fusion method that improves overall performance without retraining. Method: The KALAVAI protocol: contributors fine-tune independently from a shared checkpoint, and the specialists are then fused by a lightweight MoE router trained for only 500 steps; it requires shared initialization, optionally frozen layers, and learned routing rather than uniform averaging. Result: Fusion gain is linear in specialist divergence (gain = 0.82 × divergence − 2.72); gains are +7.72%/+7.49%/+6.53% at 410M/1B/6.9B; cross-lingual fusion reaches +21.76%, with Yoruba perplexity falling from 41.9 to 7.7; a 20-contributor federation gains +16.71%; the router matches domain-oracle routing within <10^{-5} nats. Conclusion: Post-hoc fusion is feasible and efficient, with gains that can be predicted quantitatively; KALAVAI offers a low-overhead, highly compatible paradigm for building a cooperative large-model ecosystem. Abstract: Independently trained domain specialists can be fused post-hoc into a single model that outperforms any individual specialist, and the gain is predictable: gain = 0.82 × divergence − 2.72 (R^2 = 0.856, n=6, 3-26% divergence). This enables practitioners to estimate cooperative value before committing compute. Below ~3.3% divergence, gains approach zero. In the KALAVAI protocol, contributors fine-tune copies of a shared checkpoint independently, then submit for lightweight MoE routing (500 steps). Gains are consistent: +7.72% at 410M (+/-0.02%, 3 seeds), +7.49% at 1B (+/-0.01%, 3 seeds), +6.53% at 6.9B, each over the best specialist. The router matches domain-oracle routing within <10^{-5} nats. Cross-lingual fusion (Tamil/Yoruba/Welsh/Code) achieves +21.76%, with Yoruba perplexity falling from 41.9 to 7.7. A 20-contributor federation achieves +16.71% (+/-0.07pp, 3 seeds). Three requirements bound the protocol. Shared initialisation is necessary: checkpoint mismatch degrades routing. Frozen layers are optional below ~10,000 steps and beneficial beyond. Learned routing is essential: uniform averaging degrades performance by 1.2% vs. the best specialist, while any trained router achieves oracle-optimal assignment.
[25] DALDALL: Data Augmentation for Lexical and Semantic Diverse in Legal Domain by leveraging LLM-Persona
Janghyeok Choi,Jaewon Lee,Sungzoon Cho
Main category: cs.CL
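TL;DR: This paper proposes the DALDALL framework, which uses legal professional personas (such as attorneys, prosecutors, and judges) for data augmentation, generating queries with greater lexical and semantic diversity and improving performance on low-resource legal information retrieval.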
Details
Motivation: Low-resource domains suffer from data scarcity, and existing LLM-based data augmentation methods favor quantity over quality and lack domain adaptation. Method: Proposes DALDALL, a persona-based data augmentation framework built on professional personas (attorneys, prosecutors, judges, and others) for legal information retrieval. Result: On the CLERC and COLIEE benchmarks, generated queries show improved lexical diversity (Self-BLEU) while preserving semantic fidelity; dense retrievers fine-tuned on the augmented data match or exceed baselines on recall. Conclusion: Persona-based prompting is an effective way to generate high-quality, domain-adapted training data, especially in specialized low-resource settings. Abstract: Data scarcity remains a persistent challenge in low-resource domains. While existing data augmentation methods leverage the generative capabilities of large language models (LLMs) to produce large volumes of synthetic data, these approaches often prioritize quantity over quality and lack domain-specific strategies. In this work, we introduce DALDALL, a persona-based data augmentation framework tailored for legal information retrieval (IR). Our method employs domain-specific professional personas--such as attorneys, prosecutors, and judges--to generate synthetic queries that exhibit substantially greater lexical and semantic diversity than vanilla prompting approaches. Experiments on the CLERC and COLIEE benchmarks demonstrate that persona-based augmentation improves lexical diversity as measured by Self-BLEU scores, while preserving semantic fidelity to the original queries. Furthermore, dense retrievers fine-tuned on persona-augmented data consistently achieve competitive or superior recall performance compared to those trained on original data or generic augmentations. These findings establish persona-based prompting as an effective strategy for generating high-quality training data in specialized, low-resource domains.
[26] Span Modeling for Idiomaticity and Figurative Language Detection with Span Contrastive Loss
Blake Matheny,Phuong Minh Nguyen,Minh Le Nguyen
Main category: cs.CL
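TL;DR: This paper proposes BERT- and RoBERTa-based models fine-tuned with a combination of slot loss and span contrastive loss (SCL) with hard negative reweighting, substantially improving idiom detection. The models reach state-of-the-art sequence accuracy on several datasets, and the geometric mean of F1 and sequence accuracy is introduced as a new evaluation metric.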
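The combined metric mentioned in this entry, the geometric mean of F1 and sequence accuracy (SA), is simple enough to write out. In this sketch both scores are assumed to lie in [0, 1], and the function name `span_aware_score` is hypothetical.

```python
from math import sqrt


def span_aware_score(f1: float, sequence_accuracy: float) -> float:
    """Geometric mean of F1 and sequence accuracy, both assumed to be in [0, 1]."""
    if not (0.0 <= f1 <= 1.0 and 0.0 <= sequence_accuracy <= 1.0):
        raise ValueError("scores must lie in [0, 1]")
    return sqrt(f1 * sequence_accuracy)
```

Unlike an arithmetic mean, the geometric mean drops to zero when either score is zero, so a model cannot compensate for missing every full sequence with a high token-level F1.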
Details
Motivation: Idioms and other non-compositional multi-word expressions challenge language models because their meaning is not the sum of the individual word meanings, and standard tokenization and contextual embeddings struggle to capture the meaning of the whole expression; large models mitigate this with larger phrase vocabularies but still often depend on few-shot prompting or instruction fine-tuning, while fine-tuned BERT or LSTM approaches perform better and leave room for improvement. Method: Proposes fine-tuned BERT/RoBERTa models that combine slot loss with span contrastive loss (SCL) with hard negative reweighting; ablation studies verify each component, and the geometric mean of F1 and sequence accuracy (SA) is introduced as a combined evaluation metric. Result: The models reach state-of-the-art sequence accuracy on existing idiom detection datasets; ablations show that SCL is effective and generalizes; the proposed geometric mean reflects both span awareness and overall performance. Conclusion: Combining slot loss with SCL strengthens pretrained language models' recognition of non-compositional idioms, and the method and metric offer practical tools for idiom understanding. Abstract: The category of figurative language contains many varieties, some of which are non-compositional in nature. This type of phrase or multi-word expression (MWE) includes idioms, which represent a single meaning that does not consist of the sum of its words. For language models, this presents a unique problem due to tokenization and adjacent contextual embeddings. Many large language models have overcome this issue with large phrase vocabulary, though immediate recognition frequently fails without one- or few-shot prompting or instruction finetuning. The best results have been achieved with BERT-based or LSTM finetuning approaches. The model in this paper contains one such variety. We propose BERT- and RoBERTa-based models finetuned with a combination of slot loss and span contrastive loss (SCL) with hard negative reweighting to improve idiomaticity detection, attaining state-of-the-art sequence accuracy performance on existing datasets. Comparative ablation studies show the effectiveness of SCL and its generalizability. The geometric mean of F1 and sequence accuracy (SA) is also proposed to assess a model's span awareness and general performance together.
[27] Efficient Hallucination Detection: Adaptive Bayesian Estimation of Semantic Entropy with Guided Semantic Exploration
Qiyao Sun,Xingming Li,Xixiang He,Ao Cheng,Xuanyu Ji,Hailun Lu,Runke Huang,Qingyong Hu
Main category: cs.CL
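TL;DR: This paper proposes an adaptive Bayesian estimation (ABE) framework for semantic entropy with guided semantic exploration, to detect hallucinations in LLM outputs more efficiently and accurately. The method adjusts the number of samples dynamically and uses a perturbation-based importance sampling strategy, improving detection performance and computational efficiency on several QA datasets.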
Details
Motivation: Existing hallucination detection methods rely on fixed sampling budgets that do not adapt to query complexity, which makes them computationally inefficient. Method: Proposes an adaptive Bayesian estimation (ABE) framework that models the semantic distribution with a hierarchical Bayesian model and uses variance-based thresholds to control the number of sampling iterations dynamically; a perturbation-based importance sampling strategy systematically explores the semantic space. Result: On four QA datasets, the method needs about 50% fewer samples than existing methods to reach comparable detection performance in low-budget settings, and improves average AUROC by 12.6% under the same sampling budget. Conclusion: The ABE framework balances hallucination detection accuracy with computational efficiency and offers a new approach to trustworthy LLM evaluation. Abstract: Large language models (LLMs) have achieved remarkable success in various natural language processing tasks, yet they remain prone to generating factually incorrect outputs known as hallucinations. While recent approaches have shown promise for hallucination detection by repeatedly sampling from LLMs and quantifying the semantic inconsistency among the generated responses, they rely on fixed sampling budgets that fail to adapt to query complexity, resulting in computational inefficiency. We propose an Adaptive Bayesian Estimation framework for Semantic Entropy with Guided Semantic Exploration, which dynamically adjusts sampling requirements based on observed uncertainty. Our approach employs a hierarchical Bayesian framework to model the semantic distribution, enabling dynamic control of sampling iterations through variance-based thresholds that terminate generation once sufficient certainty is achieved. We also develop a perturbation-based importance sampling strategy to systematically explore the semantic space. Extensive experiments on four QA datasets demonstrate that our method achieves superior hallucination detection performance with significant efficiency gains. In low-budget scenarios, our approach requires about 50% fewer samples to achieve comparable detection performance to existing methods, while delivering an average AUROC improvement of 12.6% under the same sampling budget.
[28] When AI Shows Its Work, Is It Actually Working? Step-Level Evaluation Reveals Frontier Language Models Frequently Bypass Their Own Reasoning
Abhinaba Basu,Pavan Chakraborty
Main category: cs.CL
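TL;DR: This paper proposes step-level evaluation, which removes the reasoning steps in a chain of thought (CoT) one at a time to test whether the model actually relies on them. Most frontier models (such as GPT-5.4 and Claude Opus) produce largely "decorative" reasoning steps across tasks, meaning the answer does not change when a step is removed; only a few models show genuine step dependence on specific tasks. The study also describes an "output rigidity" phenomenon and argues that training objectives matter more than model scale for whether reasoning is genuine.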
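The ablation test described in this entry can be sketched as two small functions: one measures how often removing a single step changes the answer, the other how often a single step alone recovers it. The `answer_fn` callable is a hypothetical wrapper around whatever model API is being tested (question plus a list of reasoning sentences in, answer string out); the entry does not specify how the model is re-queried, so that part is an assumption.

```python
from typing import Callable, Sequence

AnswerFn = Callable[[str, Sequence[str]], str]


def step_necessity(question: str, steps: Sequence[str],
                   answer_fn: AnswerFn) -> float:
    """Fraction of steps whose removal changes the final answer."""
    if not steps:
        raise ValueError("need at least one reasoning step")
    baseline = answer_fn(question, steps)
    changed = 0
    for i in range(len(steps)):
        ablated = list(steps[:i]) + list(steps[i + 1:])
        if answer_fn(question, ablated) != baseline:
            changed += 1
    return changed / len(steps)


def step_sufficiency(question: str, steps: Sequence[str],
                     answer_fn: AnswerFn) -> float:
    """Fraction of steps that, kept alone, still give the baseline answer."""
    if not steps:
        raise ValueError("need at least one reasoning step")
    baseline = answer_fn(question, steps)
    recovered = sum(1 for s in steps if answer_fn(question, [s]) == baseline)
    return recovered / len(steps)


def toy_model(question: str, steps: Sequence[str]) -> str:
    """Stand-in for a model call: answers B only if eosinophilia is mentioned."""
    return "B" if any("eosinophilia" in s for s in steps) else "A"


cot = ["eosinophilia noted", "livedo reticularis noted", "after catheterization"]
necessity = step_necessity("Which diagnosis?", cot, toy_model)
sufficiency = step_sufficiency("Which diagnosis?", cot, toy_model)
```

With the toy model only the first step matters, so one step in three is necessary and one in three is sufficient; a model whose reasoning is decorative in the entry's sense would score low on necessity and high on sufficiency.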
Details
Motivation: To test whether the chain-of-thought (CoT) steps a model writes are actually used in its decision or are only post-hoc decoration, and thereby assess the faithfulness of its reasoning. Method: Step-level evaluation: for each sample, remove one reasoning sentence at a time and check whether the final answer changes; the test needs only API access, not model weights; it is complemented by attention analysis (e.g., how much CoT attention decays) and a comparison of output length ("output rigidity"). Result: Across sentiment analysis, mathematics, topic classification, and medical QA, most of the 10 frontier models show low step necessity (<17%), i.e., largely decorative reasoning; smaller models (0.8-8B) reach 55% step necessity on math; MiniMax-M2.5 and Kimi-K2.5 stand out on some tasks but take shortcuts on others; Claude Opus writes many steps while GPT-OSS-120B is minimal, illustrating output rigidity; attention analysis shows that CoT attention drops more in late layers for decorative tasks (33% vs 20%). Conclusion: CoT reasoning from current mainstream large models is largely unfaithful, with the steps serving as a narrative wrapped around a decision already made; faithfulness depends on the specific combination of model and task and does not generalize; improving it requires better training objectives rather than simply scaling up models. Abstract: Language models increasingly "show their work" by writing step-by-step reasoning before answering. But are these reasoning steps genuinely used, or decorative narratives generated after the model has already decided? Consider: a medical AI writes "The patient's eosinophilia and livedo reticularis following catheterization suggest cholesterol embolization syndrome. Answer: B." If we remove the eosinophilia observation, does the diagnosis change? For most frontier models, the answer is no - the step was decorative. We introduce step-level evaluation: remove one reasoning sentence at a time and check whether the answer changes. This simple test requires only API access -- no model weights -- and costs approximately $1-2 per model per task. Testing 10 frontier models (GPT-5.4, Claude Opus, DeepSeek-V3.2, MiniMax-M2.5, Kimi-K2.5, and others) across sentiment, mathematics, topic classification, and medical QA (N=376-500 each), the majority produce decorative reasoning: removing any step changes the answer less than 17% of the time, while any single step alone recovers the answer. This holds even on math, where smaller models (0.8-8B) show genuine step dependence (55% necessity). Two models break the pattern: MiniMax-M2.5 on sentiment (37% necessity) and Kimi-K2.5 on topic classification (39%) - but both shortcut other tasks. Faithfulness is model-specific and task-specific. We also discover "output rigidity": on the same medical questions, Claude Opus writes 11 diagnostic steps while GPT-OSS-120B outputs a single token. Mechanistic analysis (attention patterns) confirms that CoT attention drops more in late layers for decorative tasks (33%) than faithful ones (20%). Implications: step-by-step explanations from frontier models are largely decorative, per-model per-domain evaluation is essential, and training objectives - not scale - determine whether reasoning is genuine.
[29] RadTimeline: Timeline Summarization for Longitudinal Radiological Lung Findings
Sitong Zhou,Meliha Yetisgen,Mari Ostendorf
Main category: cs.CL
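TL;DR: This paper proposes a structured summarization approach for longitudinal radiology reports, framed as a timeline generation task, and builds the RadTimeline dataset for evaluation. Experiments show that the intermediate step of group name generation is critical for grouping findings.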
Details
Motivation: Tracking how findings change across longitudinal radiology reports is essential for judging disease progression, but doing it by hand is time-consuming, so automatic summarization is needed. Method: Frames longitudinal report summarization as timeline generation, with findings organized in columns by date and temporally related findings grouped in rows; uses a three-step LLM process of extracting findings, generating group names, and grouping findings by those names. Result: Experiments on the purpose-built RadTimeline dataset reveal trade-offs between LLM sizes and prompting strategies; group name generation as an intermediate step proves critical; the best configuration has high recall with a few irrelevant findings, and its grouping performance is close to that of human annotators. Conclusion: The structured timeline format supports comparison across time and fact-checking; group name generation is the key step for grouping quality in the three-step LLM framework; the approach has practical potential for clinical radiology report summarization. Abstract: Tracking findings in longitudinal radiology reports is crucial for accurately identifying disease progression, and the time-consuming process would benefit from automatic summarization. This work introduces a structured summarization task, where we frame longitudinal report summarization as a timeline generation task, with dated findings organized in columns and temporally related findings grouped in rows. This structured summarization format enables straightforward comparison of findings across time and facilitates fact-checking against the associated reports. The timeline is generated using a 3-step LLM process of extracting findings, generating group names, and using the names to group the findings. To evaluate such systems, we create RadTimeline, a timeline dataset focused on tracking lung-related radiologic findings in chest-related imaging reports. Experiments on RadTimeline show tradeoffs of different-sized LLMs and prompting strategies. Our results highlight that group name generation as an intermediate step is critical for effective finding grouping. The best configuration has some irrelevant findings but very good recall, and grouping performance is comparable to human annotators.
[30] Analysing LLM Persona Generation and Fairness Interpretation in Polarised Geopolitical Contexts
Maida Aizaz,Quang Minh Nguyen
Main category: cs.CL
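TL;DR: This paper studies how five popular LLMs represent geopolitical identities when generating Palestinian and Israeli personas. It finds systematic bias across war and non-war contexts (Palestinian personas tend to be assigned lower socioeconomic status and survival-oriented roles, whereas Israeli personas are mostly middle-class professionals). Even when prompted to avoid harmful assumptions, the models struggle to remove the underlying socioeconomic differences, and although their reasoning mentions fairness concepts, the final outputs do not consistently reflect fair outcomes.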
Details
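Motivation: As LLMs are widely used for social simulation and persona generation, it is important to understand how they represent sensitive geopolitical identities, and in particular whether systematic bias appears in conflict contexts. Method: Generates Palestinian and Israeli personas with five popular LLMs under 640 experimental conditions (war vs non-war contexts, different assigned roles); analyzes the distribution of generated attributes and how it changes when an instruction to avoid harmful assumptions is added; traces the models' reasoning to compare their statements about fairness with their actual outputs. Result: Distributions are clearly skewed: in war contexts, Palestinian personas are mostly associated with lower socioeconomic status and survival-oriented roles, while Israeli personas retain middle-class and professional attributes; with the fairness instruction, models adjust in varied ways (e.g., more non-binary gender inferences, convergence toward generic occupations), but the core socioeconomic differences persist; fairness-related concepts appear often in the reasoning traces without translating reliably into fair outputs. Conclusion: LLM representations of geopolitical identities are structurally biased, and the models' understanding of fairness is not matched by their behavior, which suggests that prompt engineering alone cannot remove deep social bias and that deeper model-level interventions and evaluation frameworks are needed. Abstract: Large language models (LLMs) are increasingly utilised for social simulation and persona generation, necessitating an understanding of how they represent geopolitical identities. In this paper, we analyse personas generated for Palestinian and Israeli identities by five popular LLMs across 640 experimental conditions, varying context (war vs non-war) and assigned roles. We observe significant distributional patterns in the generated attributes: Palestinian profiles in war contexts are frequently associated with lower socioeconomic status and survival-oriented roles, whereas Israeli profiles predominantly retain middle-class status and specialised professional attributes. When prompted with explicit instructions to avoid harmful assumptions, models exhibit diverse distributional changes, e.g., marked increases in non-binary gender inferences or a convergence toward generic occupational roles (e.g., "student"), while the underlying socioeconomic distinctions often remain. Furthermore, analysis of reasoning traces reveals an interesting dynamic between model reasoning and generation: while rationales consistently mention fairness-related concepts, the final generated personas follow the aforementioned diverse distributional changes. These findings illustrate a picture of how models interpret geopolitical contexts, while suggesting that they process fairness and adjust in varied ways; there is no consistent, direct translation of fairness concepts into representative outcomes.
[31] Avoiding Over-smoothing in Social Media Rumor Detection with Pre-trained Propagation Tree Transformer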
Chaoqun Cui,Caiyan Jia
Main category: cs.CL
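TL;DR: This paper proposes P2T3, a rumor detection method based on a pure Transformer architecture. It extracts conversation chains along the propagation direction, introduces token-wise embeddings and inductive biases, and pre-trains on large-scale unlabeled data. This mitigates the over-smoothing and long-range dependency problems that graph neural networks (GNNs) face on rumor propagation trees, and the method achieves state-of-the-art performance on several benchmark datasets and in few-shot settings.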
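The chain-extraction step mentioned in this entry can be pictured as enumerating reply paths in the propagation tree. The sketch below assumes that a "conversation chain" is a root-to-leaf path that follows reply direction, which is an interpretation rather than something the entry states; the `children` mapping (post id to list of reply ids) and the function name `conversation_chains` are likewise hypothetical.

```python
from typing import Dict, List


def conversation_chains(children: Dict[str, List[str]],
                        root: str) -> List[List[str]]:
    """All root-to-leaf reply paths, found by iterative depth-first search.

    Assumes `children` describes a tree (no cycles, one parent per post).
    """
    chains: List[List[str]] = []
    stack: List[List[str]] = [[root]]
    while stack:
        path = stack.pop()
        replies = children.get(path[-1], [])
        if not replies:
            chains.append(path)  # a post with no replies ends a chain
            continue
        # Push in reverse so the first reply is explored first.
        for reply in reversed(replies):
            stack.append(path + [reply])
    return chains


# Source post with two direct replies; the first reply has one reply of its own.
tree = {"post": ["r1", "r2"], "r1": ["r3"]}
chains = conversation_chains(tree, "post")
# chains == [["post", "r1", "r3"], ["post", "r2"]]
```

Each chain is an ordinary sequence, so it can be fed to a sequence model without any message passing over the graph.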
Details
Motivation: Existing GNN-based rumor detection methods are prone to over-smoothing on rumor propagation trees and struggle to model long-range dependencies; the root cause is that most nodes in these trees are 1-level nodes, a structural property that degrades GNN performance. Method: Proposes the Pre-Trained Propagation Tree Transformer (P2T3), based on a pure Transformer architecture: it extracts all conversation chains from the propagation tree along the reply direction, encodes connection information with token-wise embeddings, introduces the necessary inductive bias, and pre-trains on large-scale unlabeled data. Result: P2T3 surpasses previous state-of-the-art methods on multiple benchmark datasets and performs well under few-shot conditions; it avoids the over-smoothing inherent in GNNs and could be extended into a large model or a unified multi-modal scheme. Conclusion: P2T3 offers a more robust and scalable paradigm for rumor detection, shows that a pure Transformer architecture can model social propagation structure effectively, and may support unified modeling in future social media analysis. Abstract: Deep learning techniques for rumor detection typically utilize Graph Neural Networks (GNNs) to analyze post relations. These methods, however, falter due to over-smoothing issues when processing rumor propagation structures, leading to declining performance. Our investigation into this issue reveals that over-smoothing is intrinsically tied to the structural characteristics of rumor propagation trees, in which the majority of nodes are 1-level nodes. Furthermore, GNNs struggle to capture long-range dependencies within these trees. To circumvent these challenges, we propose a Pre-Trained Propagation Tree Transformer (P2T3) method based on pure Transformer architecture. It extracts all conversation chains from a tree structure following the propagation direction of replies, utilizes token-wise embedding to infuse connection information and introduces necessary inductive bias, and pre-trains on large-scale unlabeled datasets. Experiments indicate that P2T3 surpasses previous state-of-the-art methods on multiple benchmark datasets and performs well under few-shot conditions. P2T3 not only avoids the over-smoothing issue inherent in GNNs but also potentially offers a large model or unified multi-modal scheme for future social media research.
[32] EchoKV: Efficient KV Cache Compression via Similarity-Based Reconstruction
Yixuan Wang,Shiyu Ji,Yijun Liu,Qingfu Zhu,Wanxiang Che
Main category: cs.CL
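TL;DR: This paper proposes EchoKV, a flexible KV cache compression scheme that supports on-demand switching between standard and compressed inference. It uses a lightweight network to reconstruct residual KV components and a two-stage fine-tuning strategy for fast, low-cost training, and it outperforms existing methods on several benchmarks.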
Details
Motivation: Existing low-rank compression methods rely on irreversible parameter transformations and cannot switch back to full-precision inference when memory is abundant, while the memory demand of the KV cache has become a key bottleneck for long-context LLM applications. Method: Proposes EchoKV, which uses a lightweight network to reconstruct residual KV components from a partial subset, exploiting inter-layer and intra-layer similarities among attention heads, and introduces a two-stage fine-tuning strategy for rapid, low-cost training. Result: Experiments on LongBench and RULER show that EchoKV consistently outperforms existing methods across compression ratios while maintaining high throughput in short-context scenarios. Conclusion: EchoKV combines flexibility and efficiency in KV cache compression, balancing long-context compression performance with short-context inference efficiency and offering a more practical memory-compute trade-off for LLM deployment. Abstract: The increasing memory demand of the Key-Value (KV) cache poses a significant bottleneck for Large Language Models (LLMs) in long-context applications. Existing low-rank compression methods often rely on irreversible parameter transformations, sacrificing the flexibility to switch back to full-precision inference when memory is abundant. In this paper, we propose EchoKV, a flexible KV cache compression scheme that enables on-demand transitions between standard and compressed inference. Unlike traditional compression-decompression paradigms, EchoKV utilizes a lightweight network to reconstruct the residual KV components from a partial subset, leveraging intrinsic inter-layer and intra-layer similarities among attention heads. We further introduce a two-stage fine-tuning strategy that allows for rapid, low-cost training (e.g., ~1 A100 GPU-hour for a 7B model). Experimental results on LongBench and RULER demonstrate that EchoKV consistently outperforms existing methods across various compression ratios while maintaining high throughput for short-context scenarios.
[33] Multilingual KokoroChat: A Multi-LLM Ensemble Translation Method for Creating a Multilingual Counseling Dialogue Dataset
Ryoma Suzuki,Zhiyang Qi,Michimasa Inaba
Main category: cs.CL
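TL;DR: This paper proposes a multi-LLM ensemble method for high-quality translation of the Japanese counseling dialogue dataset KokoroChat, producing English and Chinese versions as Multilingual KokoroChat; multiple models generate diverse hypotheses, a single LLM then analyzes them and produces the final output, and human evaluation shows the result significantly outperforms any single SOTA LLM.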
Details
Motivation: Addresses the severe scarcity of high-quality, publicly available counseling dialogue datasets, where cross-lingual translation in sensitive counseling settings demands the highest possible fidelity.
Method: Builds a multi-LLM ensemble translation framework: several distinct LLMs first generate diverse translation hypotheses, then a single LLM produces the final high-quality translation based on an analysis of each hypothesis's strengths and weaknesses.
Result: Human preference studies show that the English and Chinese translations from the ensemble are significantly preferred over the output of any individual state-of-the-art LLM; the Multilingual KokoroChat dataset is open-sourced.
Conclusion: A multi-LLM ensemble strategy improves the quality and reliability of cross-lingual data construction in the counseling domain and provides a high-quality multilingual resource for related research and applications.
Abstract: To address the critical scarcity of high-quality, publicly available counseling dialogue datasets, we created Multilingual KokoroChat by translating KokoroChat, a large-scale manually authored Japanese counseling corpus, into both English and Chinese. A key challenge in this process is that the optimal model for translation varies by input, making it impossible for any single model to consistently guarantee the highest quality. In a sensitive domain like counseling, where the highest possible translation fidelity is essential, relying on a single LLM is therefore insufficient. To overcome this challenge, we developed and employed a novel multi-LLM ensemble method. Our approach first generates diverse hypotheses from multiple distinct LLMs. A single LLM then produces a high-quality translation based on an analysis of the respective strengths and weaknesses of all presented hypotheses. The quality of "Multilingual KokoroChat" was rigorously validated through human preference studies. These evaluations confirmed that the translations produced by our ensemble method were preferred over those of any individual state-of-the-art LLM. This strong preference confirms the superior quality of our method's outputs. Multilingual KokoroChat is available at https://github.com/UEC-InabaLab/MultilingualKokoroChat.
[34] Quality Over Clicks: Intrinsic Quality-Driven Iterative Reinforcement Learning for Cold-Start E-Commerce Query Suggestion
Qi Sun,Kejun Xiao,Huaipeng Zhao,Tao Luo,Xiaoyi Zeng
Main category: cs.CL
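TL;DR: This paper proposes Cold-EQS, a framework that combines reinforcement learning with uncertainty estimation to improve e-commerce query suggestion (EQS) in cold-start scenarios, raising online chatUV (chat user visits) by 6.81%.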
Details
Motivation: Existing query suggestion methods built on large language models and click-through rate (CTR) models perform poorly in cold-start scenarios for lack of sufficient click data, so a method that does not depend on historical click signals is needed.
Method: Proposes Cold-EQS, an iterative reinforcement learning framework for cold-start e-commerce query suggestion; uses answerability, factuality, and information gain as reward signals; uses uncertainty estimation to select hard samples without click signals for continued optimization; and builds EQS-Benchmark, containing 16,949 real user queries.
Result: Offline and online experiments both confirm the effectiveness of Cold-EQS, notably a +6.81% gain in online chatUV, with a strong positive correlation between online and offline results.
Conclusion: Cold-EQS effectively mitigates the cold-start problem and offers a feasible, efficient solution for query suggestion without click feedback.
Abstract: Existing dialogue systems rely on Query Suggestion (QS) to enhance user engagement. Recent efforts typically employ large language models with a Click-Through Rate (CTR) model, yet fail in cold-start scenarios due to their heavy reliance on abundant online click data for effective CTR model training. To bridge this gap, we propose Cold-EQS, an iterative reinforcement learning framework for Cold-Start E-commerce Query Suggestion (EQS). Specifically, we leverage answerability, factuality, and information gain as rewards to continuously optimize the quality of suggested queries. To continuously optimize our QS model, we estimate uncertainty for grouped candidate suggested queries to select hard and ambiguous samples from online user queries lacking click signals. In addition, we provide an EQS-Benchmark comprising 16,949 online user queries for offline training and evaluation. Extensive offline and online experiments consistently demonstrate a strong positive correlation between online and offline effectiveness. Both offline and online experimental results demonstrate the superiority of our Cold-EQS, achieving a significant +6.81% improvement in online chatUV.
[35] Set-Valued Prediction for Large Language Models with Feasibility-Aware Coverage Guarantees
Ye Li,Anqi Hu,Yuanchang Ye,Shiyan Tong,Zhiyuan Wang,Bo Fu
Main category: cs.CL
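TL;DR: This paper proposes a set-valued prediction framework for large language models (LLMs) that builds a candidate response set by sampling and sets a threshold through data-driven calibration, providing statistically valid coverage guarantees at feasible risk levels.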
Details
Motivation: Conventional LLM usage reports only the most likely generation (MLG) as a point prediction, which underestimates the model's capability; correct answers may exist in the broader generation space and can be found through repeated sampling, which motivates a move to set-valued prediction.
Method: Proposes a set-valued prediction framework with feasibility-aware coverage guarantees; defines the minimum achievable risk level (MRL) and designs a data-driven calibration procedure that estimates a rigorous threshold from sampled responses to construct prediction sets.
Result: Extensive experiments on six language generation tasks with five LLMs show that the framework is both statistically valid (it meets the coverage requirement) and predictively efficient (prediction sets stay reasonably small).
Conclusion: Set-valued prediction is an effective paradigm for improving LLM reliability and interpretability; the framework balances theoretical guarantees with practical performance, and the MRL concept exposes a fundamental limit on achievable coverage.
Abstract: Large language models (LLMs) inherently operate over a large generation space, yet conventional usage typically reports the most likely generation (MLG) as a point prediction, which underestimates the model's capability: although the top-ranked response can be incorrect, valid answers may still exist within the broader output space and can potentially be discovered through repeated sampling. This observation motivates moving from point prediction to set-valued prediction, where the model produces a set of candidate responses rather than a single MLG. In this paper, we propose a principled framework for set-valued prediction, which provides feasibility-aware coverage guarantees. We show that, given the finite-sampling nature of LLM generation, coverage is not always achievable: even with multiple samplings, LLMs may fail to yield an acceptable response for certain questions within the sampled candidate set. To address this, we establish a minimum achievable risk level (MRL), below which statistical coverage guarantees cannot be satisfied. Building on this insight, we then develop a data-driven calibration procedure that constructs prediction sets from sampled responses by estimating a rigorous threshold, ensuring that the resulting set contains a correct answer with a desired probability whenever the target risk level is feasible. Extensive experiments on six language generation tasks with five LLMs demonstrate both the statistical validity and the predictive efficiency of our framework.
[36] DariMis: Harm-Aware Modeling for Dari Misinformation Detection on YouTube
Jawid Ahmad Baktash,Mosa Ebrahimi,Mohammad Zarif Joya,Mursal Dawodi
Main category: cs.CL
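TL;DR: This paper builds DariMis, the first misinformation detection dataset for Dari (the primary language of Afghanistan), comprising 9,224 YouTube videos annotated for information type (Misinformation / Partly True / True) and harm level (Low / Medium / High), and finds the two dimensions strongly coupled; it proposes a pair-input BERT encoding strategy that improves misinformation recall and shows that the Dari/Farsi-specialized ParsBERT performs best.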
Details
Motivation: Dari, the primary language of Afghanistan, has tens of millions of speakers yet is almost absent from misinformation detection research, so dedicated datasets and methods are needed.
Method: Builds DariMis, the first annotated Dari YouTube misinformation dataset (9,224 videos), labeled along two dimensions, information type and harm level; proposes a pair-input BERT encoding strategy that keeps title and description as separate segments; compares ParsBERT with XLM-RoBERTa-base; and runs an ablation study with bootstrap confidence intervals.
Result: 55.9% of misinformation carries medium or high harm versus only 1.0% of true content; pair-input encoding raises misinformation recall by 7.0 percentage points (60.1% to 67.1%); ParsBERT reaches 76.60% accuracy and 72.77% macro F1 on the test set.
Conclusion: Information type and harm level are structurally coupled, so an information-type classifier can act as an implicit harm-triage filter; pair-input encoding helps most on the safety-critical minority class (misinformation); the Dari/Farsi-specialized model beats the general multilingual model, underscoring the value of localized modeling.
Abstract: Dari, the primary language of Afghanistan, is spoken by tens of millions of people yet remains largely absent from the misinformation detection literature. We address this gap with DariMis, the first manually annotated dataset of 9,224 Dari-language YouTube videos, labeled across two dimensions: Information Type (Misinformation, Partly True, True) and Harm Level (Low, Medium, High). A central empirical finding is that these dimensions are structurally coupled, not independent: 55.9 percent of Misinformation carries at least Medium harm potential, compared with only 1.0 percent of True content. This enables Information Type classifiers to function as implicit harm-triage filters in content moderation pipelines. We further propose a pair-input encoding strategy that represents the video title and description as separate BERT segment inputs, explicitly modeling the semantic relationship between headline claims and body content, a key signal of misleading information. An ablation study against single-field concatenation shows that pair-input encoding yields a 7.0 percentage point gain in Misinformation recall (60.1 percent to 67.1 percent), the safety-critical minority class, despite modest overall macro F1 differences (0.09 percentage points). We benchmark a Dari/Farsi-specialized model (ParsBERT) against XLM-RoBERTa-base; ParsBERT achieves the best test performance with accuracy of 76.60 percent and macro F1 of 72.77 percent.
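The difference between pair-input encoding and single-field concatenation can be sketched in a few lines. This is a minimal illustration, not the authors' code: whitespace splitting stands in for a real WordPiece tokenizer, and the function names `encode_pair` and `encode_concat` and the example strings are hypothetical. What it shows is the standard BERT sentence-pair layout, in which the title and the description receive different segment IDs.

```python
def encode_pair(title, description):
    """Build a BERT-style pair input: [CLS] title [SEP] description [SEP]."""
    # Assumption: whitespace tokenization stands in for WordPiece.
    title_tokens = title.split()
    desc_tokens = description.split()
    tokens = ["[CLS]"] + title_tokens + ["[SEP]"] + desc_tokens + ["[SEP]"]
    # Segment 0 covers [CLS], the title, and the first [SEP];
    # segment 1 covers the description and the final [SEP].
    segment_ids = [0] * (len(title_tokens) + 2) + [1] * (len(desc_tokens) + 1)
    return tokens, segment_ids


def encode_concat(title, description):
    """Single-field baseline: title and description share one segment."""
    tokens = ["[CLS]"] + (title + " " + description).split() + ["[SEP]"]
    return tokens, [0] * len(tokens)


# Hypothetical example strings, not taken from the dataset.
tokens, segment_ids = encode_pair("breaking news", "full video description here")
concat_tokens, concat_segments = encode_concat("breaking news", "full video description here")
```

In the pair layout the model can tell from the segment IDs where the headline ends and the body begins; in the concatenated layout that boundary is not marked.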
Bootstrap 95 percent confidence intervals are reported for all metrics, and we discuss both the practical significance and statistical limitations of the results.
[37] Beyond Hate: Differentiating Uncivil and Intolerant Speech in Multimodal Content Moderation
Nils A. Herrmann,Tobias Eder,Jingyi He,Georg Groh
Main category: cs.CL
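TL;DR: This paper proposes a fine-grained multimodal toxicity annotation scheme that separates incivility (tone) from intolerance (content) and shows on the Hateful Memes dataset that it improves on the traditional coarse hatefulness label, raising model performance and balancing error profiles.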
Details
Motivation: Existing multimodal toxicity benchmarks use a single binary hatefulness label, which conflates two fundamentally different dimensions of expression, tone and content, and makes content moderation models less reliable.
Method: Builds a two-dimensional fine-grained annotation scheme (incivility and intolerance) grounded in communication science theory and applies it to 2,030 Hateful Memes samples; compares three paradigms: coarse-label training, transfer learning across label schemes, and joint learning on coarse and fine-grained labels.
Result: Using coarse and fine-grained labels jointly significantly improves model performance and yields a more balanced error distribution: FNR-FPR drops from 0.74 to 0.42 for LLaVA-1.6-Mistral-7B and from 0.54 to 0.28 for Qwen2.5-VL-7B.
Conclusion: Fine-grained annotations complement coarse labels and make moderation systems more reliable; combining the two is a practical route to more robust multimodal content moderation.
Abstract: Current multimodal toxicity benchmarks typically use a single binary hatefulness label. This coarse approach conflates two fundamentally different characteristics of expression: tone and content. Drawing on communication science theory, we introduce a fine-grained annotation scheme that distinguishes two separable dimensions: incivility (rude or dismissive tone) and intolerance (content that attacks pluralism and targets groups or identities) and apply it to 2,030 memes from the Hateful Memes dataset. We evaluate different vision-language models under coarse-label training, transfer learning across label schemes and a joint learning approach that combines the coarse hatefulness label with our fine-grained annotations. Our results show that fine-grained annotations complement existing coarse labels and, when used jointly, improve overall model performance. Moreover, models trained with the fine-grained scheme exhibit more balanced moderation-relevant error profiles and are less prone to under-detection of harmful content than models trained on hatefulness labels alone (FNR-FPR, the difference between false negative and false positive rates: 0.74 to 0.42 for LLaVA-1.6-Mistral-7B; 0.54 to 0.28 for Qwen2.5-VL-7B). This work contributes to data-centric approaches in content moderation by improving the reliability and accuracy of moderation systems through enhanced data quality. Overall, combining both coarse and fine-grained labels provides a practical route to more reliable multimodal moderation.
[38] PaperVoyager : Building Interactive Web with Visual Language Models
Dasen Dai,Biao Wu,Meng Fang,Wenhao Wang
Main category: cs.CL
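TL;DR: This paper proposes the Paper-to-Interactive-System Agent, which automatically converts research papers into interactive web systems, and designs the PaperVoyager framework to model mechanisms and interaction logic in a structured way; its effectiveness is validated on a new benchmark of 19 papers.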
Details
Motivation: Existing document agents produce only static summaries or webpages, which cannot express the dynamic mechanisms and state transitions in technical papers, limiting deep understanding and interactive exploration of complex research content.
Method: Proposes the Paper-to-Interactive-System Agent, which handles end-to-end understanding of a PDF paper, system modeling, and interactive webpage synthesis; and designs PaperVoyager, a structured generation framework that explicitly models the paper's mechanisms and interaction logic.
Result: On a new benchmark of 19 papers paired with expert-built interactive systems, PaperVoyager significantly improves the quality of generated interactive systems, which let users manipulate inputs and visualize dynamic behavior.
Conclusion: The work opens a new paradigm for interactive scientific paper understanding and advances visual language models toward deeper technical-document understanding and practical use.
Abstract: Recent advances in visual language models have enabled autonomous agents for complex reasoning, tool use, and document understanding. However, existing document agents mainly transform papers into static artifacts such as summaries, webpages, or slides, which are insufficient for technical papers involving dynamic mechanisms and state transitions. In this work, we propose a Paper-to-Interactive-System Agent that converts research papers into executable interactive web systems. Given a PDF paper, the agent performs end-to-end processing without human intervention, including paper understanding, system modeling, and interactive webpage synthesis, enabling users to manipulate inputs and observe dynamic behaviors. To evaluate this task, we introduce a benchmark of 19 research papers paired with expert-built interactive systems as ground truth. We further propose PaperVoyager, a structured generation framework that explicitly models mechanisms and interaction logic during synthesis. Experiments show that PaperVoyager significantly improves the quality of generated interactive systems, offering a new paradigm for interactive scientific paper understanding.
[39] Knowledge Access Beats Model Size: Memory Augmented Routing for Persistent AI Agents
Xunzhuo Liu,Bowei He,Xue Liu,Andy Luo,Haichen Zhang,Huamin Chen
Main category: cs.CL
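TL;DR: This paper proposes a lightweight inference framework based on conversational memory: using retrieved context from a user's past conversations, an 8B-parameter model recovers 69% of a 235B model's performance at 96% lower cost without additional training, showing that memory improves an AI agent's accuracy and efficiency more than model scale does.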
Details
Motivation: AI agents in production receive many semantically repetitive user queries (up to 47%) yet process each at the same high cost, wasting computation; the authors argue that conversational memory can turn this repetition into an efficiency advantage.
Method: Proposes a memory-augmented inference framework: a lightweight 8B model, combined with hybrid retrieval (BM25 plus cosine similarity), retrieves relevant context from past conversations and answers via a low-cost path; no additional training or labeled data is required; routing and confidence mechanisms handle query dispatch.
Result: On LoCoMo and LongMemEval, the method reaches 30.5% F1, recovering 69% of the full-context 235B model's performance while cutting effective cost by 96%; the 235B model without memory scores only 13.7% F1, below the standalone 8B model (15.4% F1); hybrid retrieval adds +7.7 F1.
Conclusion: For persistent user-facing AI agents, conversational memory is the core driver of accuracy and efficiency, far outweighing model scale alone; memory supplies user-specific knowledge that suppresses hallucination and improves correctness, and its benefit grows as it accumulates over time.
Abstract: Production AI agents frequently receive user-specific queries that are highly repetitive, with up to 47% being semantically similar to prior interactions, yet each query is typically processed with the same computational cost. We argue that this redundancy can be exploited through conversational memory, transforming repetition from a cost burden into an efficiency advantage. We propose a memory-augmented inference framework in which a lightweight 8B-parameter model leverages retrieved conversational context to answer all queries via a low-cost inference path. Without any additional training or labeled data, this approach achieves 30.5% F1, recovering 69% of the performance of a full-context 235B model while reducing effective cost by 96%. Notably, a 235B model without memory (13.7% F1) underperforms even the standalone 8B model (15.4% F1), indicating that for user-specific queries, access to relevant knowledge outweighs model scale. We further analyze the role of routing and confidence. At practical confidence thresholds, routing alone already directs 96% of queries to the small model, but yields poor accuracy (13.0% F1) due to confident hallucinations. Memory does not substantially alter routing decisions; instead, it improves correctness by grounding responses in retrieved user-specific information. As conversational memory accumulates over time, coverage of recurring topics increases, further narrowing the performance gap. We evaluate on 152 LoCoMo questions (Qwen3-8B/235B) and 500 LongMemEval questions.
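One way to combine BM25 with cosine similarity when retrieving from conversational memory is sketched below. The entry does not say how the paper fuses the two signals, so everything here is an assumption made for illustration: the weighted sum with `alpha`, the max-normalization of BM25 scores, the use of bag-of-words counts instead of dense embeddings for the cosine term, and the toy memory entries. The function names are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized document for a tokenized query."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine(a, b):
    """Cosine similarity between two bag-of-words count vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def hybrid_rank(query, docs, alpha=0.5):
    """Rank document indices by alpha * normalized BM25 + (1 - alpha) * cosine."""
    bm25 = bm25_scores(query, docs)
    top = max(bm25)
    bm25_norm = [s / top if top > 0 else 0.0 for s in bm25]
    fused = [alpha * s + (1 - alpha) * cosine(query, d) for s, d in zip(bm25_norm, docs)]
    return sorted(range(len(docs)), key=lambda i: fused[i], reverse=True)


# Toy memory entries, invented for illustration only.
memory = [
    "user prefers vegetarian recipes for dinner".split(),
    "user booked a flight to berlin in march".split(),
    "user asked about python list sorting".split(),
]
ranking = hybrid_rank("flight to berlin".split(), memory)
```

Normalizing BM25 by its maximum keeps both terms on a comparable 0 to 1 scale before mixing; other fusion rules, such as reciprocal rank fusion, would be equally plausible choices.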
Incorporating hybrid retrieval (BM25 + cosine similarity) improves performance by an additional +7.7 F1, demonstrating that retrieval quality directly enhances end-to-end system performance. Overall, our results highlight that memory, rather than model size, is the primary driver of accuracy and efficiency in persistent AI agents.
[40] Parametric Knowledge and Retrieval Behavior in RAG Fine-Tuning for Electronic Design Automation
Julian Oestreich,Maximilian Bley,Frank Binder,Lydia Müller,Maksym Sydorenko,André Alcalde
Main category: cs.CL
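TL;DR: This paper proposes the TriFEX evaluation pipeline and the Parametric Knowledge Precision (PKP) metric to assess factuality and knowledge internalization more accurately when RAG fine-tuning is applied to long-form generation in electronic design automation; it finds that conventional metrics such as ROUGE and BERTScore have clear shortcomings and shows that fine-tuned small models can outperform a larger one.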
Details
Motivation: Existing RAG fine-tuning studies focus mostly on document question answering and rely on standard NLP metrics (e.g., ROUGE, BERTScore) that can mask factual differences; evaluation methods suited to long-form generation that separate knowledge sources from degree of internalization are lacking.
Method: In electronic design automation, fine-tunes a 7B model for RAG under five context augmentation strategies; proposes TriFEX, a human-validated, triple-based evaluation pipeline that attributes claims to the user query, context, or reference; defines Parametric Knowledge Precision (PKP) to quantify the accuracy of the model's internal knowledge (excluding claims leaked in the prompt).
Result: ROUGE and BERTScore fail to detect the factual differences TriFEX reveals; an existing knowledge-internalization metric is highly retrieval-sensitive (about 75% of its variance comes from the expression rate PR rather than actual correctness PKP); the fine-tuned 7B models beat a 72B baseline on most metrics and generalize across conditions and to a related benchmark.
Conclusion: Current RAG evaluation metrics have serious limitations, and TriFEX and PKP measure factuality and knowledge internalization more reliably; small models fine-tuned for the task are practical and support low-cost, on-premises deployment.
Abstract: Retrieval-Augmented Generation (RAG) fine-tuning has shown substantial improvements over vanilla RAG, yet most studies target document question answering and often rely on standard NLP metrics that can obscure factual differences. We evaluate RAG fine-tuning for long-form text generation in electronic design automation, adapting a 7B model under five context augmentation strategies with varying retrieval conditions. We introduce TriFEX, a human-validated, triple-based evaluation pipeline that attributes generated claims to their origin (user query, context, or reference), and propose Parametric Knowledge Precision (PKP), which isolates internalized knowledge by filtering out claims leaked in the prompt. We show that ROUGE and BERTScore fail to detect factual differences that our triple-based evaluation reveals. Additionally, we demonstrate that an existing metric for knowledge internalization is retrieval-sensitive, with about 75% of its cross-condition variance driven by changes in the rate at which internal knowledge is expressed (PR), rather than by changes in its actual correctness (PKP). The fine-tuned 7B variants outperform a 72B baseline on most metrics, further showing generalization across conditions and on a related benchmark. These results underscore the limitations of available metrics in RAG evaluation and show that smaller models could be reasonably well adapted to specialized tasks for cost-efficient, on-premises deployment.
[41] AuthorMix: Modular Authorship Style Transfer via Layer-wise Adapter Mixing
Sarubi Thillainathan,Ji-Ung Lee,Michael Sullivan,Alexander Koller
Main category: cs.CL
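TL;DR: This paper proposes AuthorMix, a lightweight, modular, and interpretable authorship style transfer framework that trains author-specific LoRA adapters and mixes them layer-wise, achieving efficient, high-fidelity style transfer from only a few target-style examples.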
Details
Motivation: Existing methods train a single model to handle all author styles, which is costly and inflexible and often sacrifices meaning preservation for style transfer.
Method: Proposes the AuthorMix framework: train separate LoRA adapters for high-resource authors, then adapt quickly to a new target author through learned layer-wise adapter mixing, using only a few target-style examples.
Result: AuthorMix outperforms existing SOTA methods and GPT-5.1 on low-resource targets, achieving the highest overall score and substantially better meaning preservation.
Conclusion: AuthorMix is an efficient, flexible, lightweight authorship style transfer approach with high meaning preservation, suited to low-resource settings.
Abstract: The task of authorship style transfer involves rewriting text in the style of a target author while preserving the meaning of the original text. Existing style transfer methods train a single model on large corpora to model all target styles at once: this high-cost approach offers limited flexibility for target-specific adaptation, and often sacrifices meaning preservation for style transfer. In this paper, we propose AuthorMix: a lightweight, modular, and interpretable style transfer framework. We train individual, style-specific LoRA adapters on a small set of high-resource authors, allowing the rapid training of specialized adaptation models for each new target via learned, layer-wise adapter mixing, using only a handful of target style training examples. AuthorMix outperforms existing, SoTA style-transfer baselines -- as well as GPT-5.1 -- for low-resource targets, achieving the highest overall score and substantially improving meaning preservation.
[42] When Language Models Lose Their Mind: The Consequences of Brain Misalignment
Gabriele Merlin,Mariya Toneva
Main category: cs.CL
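TL;DR: This paper builds brain-misaligned models to study how brain alignment affects linguistic competence and finds that brain misalignment substantially impairs downstream language task performance, indicating that brain alignment is critical to LLM language understanding.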
Details
Motivation: Brain-aligned LLMs have attracted attention, but what brain alignment contributes to linguistic competence remains unclear; this paper aims to establish whether and how brain alignment affects language understanding and processing.
Method: Builds brain-misaligned models, deliberately trained to predict brain activity poorly while preserving language modeling performance, and systematically compares them with brain-aligned models on over 200 downstream tasks spanning semantics, syntax, discourse, reasoning, and morphology.
Result: Brain-misaligned models perform significantly worse than brain-aligned models on the large majority of downstream language tasks, indicating that brain alignment contributes substantively to robust linguistic competence.
Conclusion: Brain alignment is not merely an add-on for cognitive modeling or AI trustworthiness but a key functional ingredient of deep language understanding in LLMs; neural representations and language processing are tightly coupled functionally.
Abstract: While brain-aligned large language models (LLMs) have garnered attention for their potential as cognitive models and for enhanced safety and trustworthiness in AI, the role of this brain alignment for linguistic competence remains uncertain. In this work, we investigate the functional implications of brain alignment by introducing brain-misaligned models--LLMs intentionally trained to predict brain activity poorly while maintaining high language modeling performance. We evaluate these models on over 200 downstream tasks encompassing diverse linguistic domains, including semantics, syntax, discourse, reasoning, and morphology. By comparing brain-misaligned models with well-matched brain-aligned counterparts, we isolate the specific impact of brain alignment on language understanding. Our experiments reveal that brain misalignment substantially impairs downstream performance, highlighting the critical role of brain alignment in achieving robust linguistic competence. These findings underscore the importance of brain alignment in LLMs and offer novel insights into the relationship between neural representations and linguistic processing.
[43] HGNet: Scalable Foundation Model for Automated Knowledge Graph Generation from Scientific Literature
Devvrat Joshi,Islem Rekik
Main category: cs.CL
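TL;DR: This paper proposes Z-NERD+HGNet, a two-stage zero-shot framework for scientific knowledge graph construction that uses orthogonal semantic decomposition, multi-scale attention, and hierarchy-aware relation extraction; it is the first to model hierarchical abstraction as a continuous property in Euclidean space and significantly improves entity recognition and relation extraction on several benchmarks.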
Details
Motivation: Existing KG construction methods struggle to recognize long multi-word entities, generalize poorly across domains, and overlook the hierarchical nature of scientific knowledge; general-purpose LLMs are computationally expensive and inconsistent on specialized tasks, leaving current KGs shallow and inconsistent.
Method: Proposes a two-stage framework: stage one, Z-NERD, comprises Orthogonal Semantic Decomposition (OSD) and a Multi-Scale TCQK attention mechanism; stage two, HGNet, performs relation extraction with hierarchy-aware message passing; a Differentiable Hierarchy Loss and a Continuum Abstraction Field (CAF) Loss enforce global consistency; the multi-domain benchmark SPHERE is released.
Result: Sets a new SOTA on SciERC, SciER, and SPHERE, improving NER by 8.08% and RE by 5.99% on OOD tests; in zero-shot settings, NER improves by 10.76% and RE by 26.2%.
Conclusion: The framework enables scalable, zero-shot scientific KG construction and is the first to formalize hierarchical abstraction as a continuous property in Euclidean embeddings, offering a simpler alternative to hyperbolic methods.
Abstract: Automated knowledge graph (KG) construction is essential for navigating the rapidly expanding body of scientific literature. However, existing approaches struggle to recognize long multi-word entities, often fail to generalize across domains, and typically overlook the hierarchical nature of scientific knowledge. While general-purpose large language models (LLMs) offer adaptability, they are computationally expensive and yield inconsistent accuracy on specialized tasks. As a result, current KGs are shallow and inconsistent, limiting their utility for exploration and synthesis. We propose a two-stage framework for scalable, zero-shot scientific KG construction. The first stage, Z-NERD, introduces (i) Orthogonal Semantic Decomposition (OSD), which promotes domain-agnostic entity recognition by isolating semantic "turns" in text, and (ii) a Multi-Scale TCQK attention mechanism that captures coherent multi-word entities through n-gram-aware attention heads. The second stage, HGNet, performs relation extraction with hierarchy-aware message passing, explicitly modeling parent, child, and peer relations. To enforce global consistency, we introduce two complementary objectives: a Differentiable Hierarchy Loss to discourage cycles and shortcut edges, and a Continuum Abstraction Field (CAF) Loss that embeds abstraction levels along a learnable axis in Euclidean space. This is the first approach to formalize hierarchical abstraction as a continuous property within standard Euclidean embeddings, offering a simpler alternative to hyperbolic methods.
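To make the idea of reading abstraction levels off an axis in Euclidean space concrete, here is a small sketch. It is not the paper's CAF loss, whose exact form the abstract does not give; it substitutes a plain margin ranking penalty on projections, and `abstraction_level`, `axis_margin_loss`, the fixed axis, and the toy embeddings are all hypothetical.

```python
import math


def abstraction_level(embedding, axis):
    """Scalar projection of an embedding onto the normalized abstraction axis."""
    norm = math.sqrt(sum(a * a for a in axis))
    return sum(e * a for e, a in zip(embedding, axis)) / norm


def axis_margin_loss(parent, child, axis, margin=1.0):
    """Hinge penalty: zero once the parent sits at least `margin` above the child."""
    gap = abstraction_level(parent, axis) - abstraction_level(child, axis)
    return max(0.0, margin - gap)


axis = [0.0, 2.0]       # hypothetical axis; it would be learned in a real model
parent = [0.3, 3.0]     # toy embedding of a broader concept
child = [0.9, 1.5]      # toy embedding of a narrower concept

loss_ok = axis_margin_loss(parent, child, axis)    # ordering respected
loss_bad = axis_margin_loss(child, parent, axis)   # ordering violated
```

Because the level is an ordinary dot product, the embeddings stay in standard Euclidean space and the hierarchy is expressed only through positions along one direction.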
We release SPHERE (https://github.com/basiralab/SPHERE), a multi-domain benchmark for hierarchical relation extraction. Our framework establishes a new state of the art on SciERC, SciER, and SPHERE, improving NER by 8.08% and RE by 5.99% on out-of-distribution tests. In zero-shot settings, gains reach 10.76% for NER and 26.2% for RE.
[44] Why AI-Generated Text Detection Fails: Evidence from Explainable AI Beyond Benchmark Accuracy
Shushanta Pudasaini,Luis Miralles-Pechuán,David Lillis,Marisa Llorens Salvador
Main category: cs.CL
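TL;DR: This paper proposes an interpretable AI-generated text detection framework based on 30 linguistic features that reaches a high F1 score (0.9734) on benchmark datasets but generalizes poorly across domains and generators; the analysis shows that current detectors rely mostly on dataset-specific stylistic cues rather than stable signals of machine authorship, and that the most discriminative features are also the most vulnerable to distribution shift.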
Details
Motivation: Existing AI text detectors do well on benchmarks, but their reliability and interpretability in real-world settings are unclear; the question is whether they truly identify machine authorship or only exploit dataset-specific artefacts.
Method: Builds a detection framework combining linguistic feature engineering, machine learning, and explainable AI (e.g., SHAP), trained on the PAN CLEF 2025 and COLING 2025 datasets and subjected to cross-domain and cross-generator evaluation and error analysis.
Result: The model reaches an F1 of 0.9734 on the benchmarks, but performance drops sharply across domains; SHAP analysis shows that the key features differ by dataset; error analysis confirms that the most discriminative features are highly sensitive to domain shift, formatting changes, and text length.
Conclusion: Current linguistic-feature-based detectors generalize poorly, and their effectiveness often stems from dataset bias rather than universal signals of machine writing; the study stresses balancing discriminative power with robustness and releases an open-source Python package for prediction and instance-level explanation.
Abstract: The widespread adoption of Large Language Models (LLMs) has made the detection of AI-generated text a pressing and complex challenge. Although many detection systems report high benchmark accuracy, their reliability in real-world settings remains uncertain, and their interpretability is often unexplored. In this work, we investigate whether contemporary detectors genuinely identify machine authorship or merely exploit dataset-specific artefacts. We propose an interpretable detection framework that integrates linguistic feature engineering, machine learning, and explainable AI techniques. When evaluated on two prominent benchmark corpora, namely PAN CLEF 2025 and COLING 2025, our model trained on 30 linguistic features achieves leaderboard-competitive performance, attaining an F1 score of 0.9734. However, systematic cross-domain and cross-generator evaluation reveals substantial generalisation failure: classifiers that excel in-domain degrade significantly under distribution shift. Using SHAP-based explanations, we show that the most influential features differ markedly between datasets, indicating that detectors often rely on dataset-specific stylistic cues rather than stable signals of machine authorship. Further investigation with in-depth error analysis exposes a fundamental tension in linguistic-feature-based AI text detection: the features that are most discriminative on in-domain data are also the features most susceptible to domain shift, formatting variation, and text-length effects. We believe that this knowledge helps build AI detectors that are robust across different settings.
To support replication and practical use, we release an open-source Python package that returns both predictions and instance-level explanations for individual texts.
[45] UniDial-EvalKit: A Unified Toolkit for Evaluating Multi-Faceted Conversational Abilities
Qi Jia,Haodong Zhao,Dun Pei,Xiujie Song,Shibo Wang,Zijian Chen,Zicheng Zhang,Xiangyang Zhu,Guangtao Zhai
Main category: cs.CL
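TL;DR: This paper presents UniDial-EvalKit (UDE), a unified toolkit for evaluating multi-turn interactive AI systems that improves reproducibility, efficiency, and extensibility through standardized data formats, a modular evaluation pipeline, and a unified scoring interface.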
Details
Motivation: Evaluation protocols for interactive AI systems are highly heterogeneous, with inconsistent data formats, model interfaces, and evaluation pipelines, which impedes systematic comparison.
Method: Designs and implements UniDial-EvalKit (UDE), including conversion to a universal data schema, a modular evaluation architecture, a consistent scoring interface, parallel generation and scoring, and checkpoint-based caching.
Result: Validated on multiple multi-turn benchmarks, UDE significantly improves evaluation efficiency, reproducibility, and extensibility and supports efficient large-scale evaluation.
Conclusion: UDE provides a standardized evaluation framework for interactive AI systems and is released as open source to promote a standardized benchmarking ecosystem and advance the field.
Abstract: Benchmarking AI systems in multi-turn interactive scenarios is essential for understanding their practical capabilities in real-world applications. However, existing evaluation protocols are highly heterogeneous, differing significantly in dataset formats, model interfaces, and evaluation pipelines, which severely impedes systematic comparison. In this work, we present UniDial-EvalKit (UDE), a unified evaluation toolkit for assessing interactive AI systems. The core contribution of UDE lies in its holistic unification: it standardizes heterogeneous data formats into a universal schema, streamlines complex evaluation pipelines through a modular architecture, and aligns metric calculations under a consistent scoring interface. It also supports efficient large-scale evaluation through parallel generation and scoring, as well as checkpoint-based caching to eliminate redundant computation. Validated across diverse multi-turn benchmarks, UDE not only guarantees high reproducibility through standardized workflows and transparent logging, but also significantly improves evaluation efficiency and extensibility. We make the complete toolkit and evaluation scripts publicly available to foster a standardized benchmarking ecosystem and accelerate future breakthroughs in interactive AI.
[46] From Synthetic to Native: Benchmarking Multilingual Intent Classification in Logistics Customer Service
Haoyu He,Jinyu Zhuang,Haoran Chu,Shuhang Yu,J,T AI Group,Hao Wang,Kunpeng Han
Main category: cs.CL
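TL;DR: This paper presents a multilingual hierarchical intent classification benchmark built from real logistics customer-service logs, with real user queries in six languages including English, Spanish, and Arabic, and shows that machine-translated test sets substantially overestimate model performance on real, noisy data.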
Details
Motivation: Mainstream multilingual intent classification benchmarks rely mostly on machine-translated text, which is too clean and standardized to reflect real customer-service conditions (noisy, linguistically diverse, hierarchically labeled) and so distorts robustness estimates.
Method: Builds a dataset of 30K de-identified multilingual queries drawn from 600K real logistics customer-service logs through filtering, LLM-assisted quality control, and human verification; designs a two-level taxonomy (13 parent and 17 leaf intents); provides paired native and machine-translated test sets; and evaluates multilingual encoders, embedding models, and small language models under flat and hierarchical protocols.
Result: Machine-translated test sets substantially overestimate performance on real native queries, with larger gaps for long-tail intents and cross-lingual transfer.
Conclusion: Multilingual intent classification benchmarks based on real user data are needed to assess generalization and robustness in real business settings more accurately.
Abstract: Multilingual intent classification is central to customer-service systems on global logistics platforms, where models must process noisy user queries across languages and hierarchical label spaces. Yet most existing multilingual benchmarks rely on machine-translated text, which is typically cleaner and more standardized than native customer requests and can therefore overestimate real-world robustness. We present a public benchmark for hierarchical multilingual intent classification constructed from real logistics customer-service logs. The dataset contains approximately 30K de-identified, stand-alone user queries curated from 600K historical records through filtering, LLM-assisted quality control, and human verification, and is organized into a two-level taxonomy with 13 parent and 17 leaf intents. English, Spanish, and Arabic are included as seen languages, while Indonesian, Chinese, and additional test-only languages support zero-shot evaluation. To directly measure the gap between synthetic and real evaluation, we provide paired native and machine-translated test sets and benchmark multilingual encoders, embedding models, and small language models under flat and hierarchical protocols. Results show that translated test sets substantially overestimate performance on noisy native queries, especially for long-tail intents and cross-lingual transfer, underscoring the need for more realistic multilingual intent benchmarks.
[47] ImplicitRM: Unbiased Reward Modeling from Implicit Preference Data for LLM alignment
Hao Wang,Haocheng Yang,Licheng Pan,Lei Shen,Xiaoxi Li,Yinuo Wang,Zhichao Chen,Yuan Lu,Haoxuan Li,Zhouchen Lin
Main category: cs.CL
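TL;DR: This paper proposes ImplicitRM, which learns unbiased reward models from implicit human feedback (such as clicks and copies) and addresses two challenges of implicit preference data: the lack of definitive negative samples and user preference bias.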
Details
Motivation: In reinforcement learning from human feedback (RLHF), reward modeling currently depends on explicit experimental feedback data that is costly to collect, so a low-cost alternative is needed.
Method: Proposes ImplicitRM: a stratification model first divides training samples into four latent groups, then a theoretically unbiased learning objective is derived through likelihood maximization.
Result: Experiments show that ImplicitRM learns accurate reward models across several implicit preference datasets.
Conclusion: ImplicitRM offers an effective, provably unbiased solution for implicit reward modeling and substantially lowers the data cost of reward modeling.
Abstract: Reward modeling represents a long-standing challenge in reinforcement learning from human feedback (RLHF) for aligning language models. Current reward modeling is heavily contingent upon experimental feedback data with high collection costs. In this work, we study implicit reward modeling -- learning reward models from implicit human feedback (e.g., clicks and copies) -- as a cost-effective alternative. We identify two fundamental challenges in implicit reward modeling: (1) Implicit preference data lacks definitive negative samples, which makes standard positive-negative classification methods inapplicable; (2) Implicit preference data suffers from user preference bias, where different responses have different propensities to elicit user feedback actions, which exacerbates the difficulty of distinguishing definitive negative samples. To address these challenges, we propose ImplicitRM, which aims to learn unbiased reward models from implicit preference data. ImplicitRM stratifies training samples into four latent groups via a stratification model. Building on this, it derives a learning objective through likelihood maximization, which we prove is theoretically unbiased, effectively resolving both challenges. Experiments demonstrate that ImplicitRM learns accurate reward models across implicit preference datasets. Code is available on our project website.
[48] Decoding AI Authorship: Can LLMs Truly Mimic Human Style Across Literature and Politics?
Nasser A Alsadhan
Main category: cs.CL
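TL;DR: This study evaluates how well advanced LLMs such as GPT-4o, Gemini 1.5 Pro, and Claude Sonnet 3.5 imitate the writing styles of Whitman, Wordsworth, Trump, and Obama, and finds that AI-generated text remains highly detectable, with perplexity an especially effective discriminator; AI approaches humans on low-dimensional features such as syntactic complexity but still lags clearly in affective density and stylistic variance.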
Details
Motivation: As generative AI grows better at mimicking human writing styles, its real limits in authorial style imitation need systematic evaluation, particularly for authorship attribution in the digital humanities and social media.
Method: Generates imitation texts with a zero-shot prompting framework and detects them with a BERT classifier and an interpretable XGBoost model; combines LIWC markers, perplexity, readability indices, and other stylometric features, eight in all, for multi-dimensional evaluation.
Result: XGBoost with only eight features matches the detection accuracy of high-dimensional neural networks; perplexity is the most important discriminative feature; AI approaches humans on low-dimensional heuristic features (such as syntactic complexity) but falls clearly short on affective density and stylistic variance.
Conclusion: Current LLMs cannot yet reproduce the deeper stylistic traits of human authors; the study provides an interpretable benchmark for LLM style modeling and authorship attribution and an analysis of the key statistical gaps.
Abstract: Amidst the rising capabilities of generative AI to mimic specific human styles, this study investigates the ability of state-of-the-art large language models (LLMs), including GPT-4o, Gemini 1.5 Pro, and Claude Sonnet 3.5, to emulate the authorial signatures of prominent literary and political figures: Walt Whitman, William Wordsworth, Donald Trump, and Barack Obama. Utilizing a zero-shot prompting framework with strict thematic alignment, we generated synthetic corpora evaluated through a complementary framework combining transformer-based classification (BERT) and interpretable machine learning (XGBoost). Our methodology integrates Linguistic Inquiry and Word Count (LIWC) markers, perplexity, and readability indices to assess the divergence between AI-generated and human-authored text. Results demonstrate that AI-generated mimicry remains highly detectable, with XGBoost models trained on a restricted set of eight stylometric features achieving accuracy comparable to high-dimensional neural classifiers. Feature importance analyses identify perplexity as the primary discriminative metric, revealing a significant divergence in the stochastic regularity of AI outputs compared to the higher variability of human writing. While LLMs exhibit distributional convergence with human authors on low-dimensional heuristic features, such as syntactic complexity and readability, they do not yet fully replicate the nuanced affective density and stylistic variance inherent in the human-authored corpus.
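Since perplexity is singled out as the main discriminative feature, a short worked example of how it is computed may help. The sketch assumes per-token natural-log probabilities are already available from some scoring language model; which model the study used is not stated here, and the log-probabilities below are invented for illustration.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(-mean natural-log probability per token)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical log-probabilities, invented for illustration only.
predictable = [math.log(0.5)] * 4     # every token assigned probability 0.5
surprising = [math.log(0.05)] * 4     # every token assigned probability 0.05

ppl_predictable = perplexity(predictable)
ppl_surprising = perplexity(surprising)
```

Lower perplexity means the scoring model finds the text more predictable; a classifier such as XGBoost can take this single number as one of its input features.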
By isolating the specific statistical gaps in current generative mimicry, this study provides a comprehensive benchmark for LLM stylistic behavior and offers critical insights for authorship attribution in the digital humanities and social media.
[49] I Came, I Saw, I Explained: Benchmarking Multimodal LLMs on Figurative Meaning in Memes
Shijia Zhou,Saif M. Mohammad,Barbara Plank,Diego Frassinelli
Main category: cs.CL
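TL;DR: This paper evaluates eight state-of-the-art multimodal large language models (MLLMs) on identifying and explaining six types of figurative meaning in internet memes, and uses human evaluation to check whether their explanations are plausible and faithful. All models show a bias toward labeling memes as figurative, and correct predictions are not always accompanied by explanations faithful to the original content.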
Details
Motivation: It remains unclear how multimodal large language models (MLLMs) combine visual and textual information to identify figurative meaning in memes, so a systematic evaluation is needed. Method: Eight state-of-the-art generative MLLMs are evaluated on three datasets for their ability to detect and explain six types of figurative meaning; a human evaluation checks whether the explanations are plausible (supporting the predicted label) and faithful (true to the original meme content). Result: All models show a strong bias toward labeling memes without figurative meaning as figurative; qualitative analysis finds that correct predictions do not always come with faithful explanations. Conclusion: Current MLLMs show systematic bias and unreliable explanations when interpreting figurative meaning in memes, which points to a need for better multimodal reasoning and more consistent explanations. Abstract: Internet memes represent a popular form of multimodal online communication and often use figurative elements to convey layered meaning through the combination of text and images. However, it remains largely unclear how multimodal large language models (MLLMs) combine and interpret visual and textual information to identify figurative meaning in memes. To address this gap, we evaluate eight state-of-the-art generative MLLMs across three datasets on their ability to detect and explain six types of figurative meaning. In addition, we conduct a human evaluation of the explanations generated by these MLLMs, assessing whether the provided reasoning supports the predicted label and whether it remains faithful to the original meme content. Our findings indicate that all models exhibit a strong bias to associate a meme with figurative meaning, even when no such meaning is present. Qualitative analysis further shows that correct predictions are not always accompanied by faithful explanations.
[50] Is AI Catching Up to Human Expression? Exploring Emotion, Personality, Authorship, and Linguistic Style in English and Arabic with Six Large Language Models
Nasser A Alsadhan
Main category: cs.CL
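TL;DR: This study evaluates how human-like several large language models (LLMs) are in English emotional expression and Arabic personality markers. AI text can be identified with high accuracy, yet its representation of emotion and personality differs systematically from that of humans; adding AI-generated data improves Arabic personality classification, and GPT-4o and Gemini show stronger affective coherence.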
Details
Motivation: To examine whether LLMs can convincingly simulate complex human psychological traits in English (emotion) and Arabic (personality), with particular attention to modeling biases and detection challenges in under-resourced languages. Method: Two tasks are run across six models (Jais, Mistral, LLaMA, GPT-4o, Gemini, DeepSeek): (1) machine classifiers distinguish human from AI text; (2) emotion and personality classifiers are tested for generalization. These are supplemented by linguistic and psycholinguistic analyses, and the Arabic task adds AI-generated synthetic data for augmentation. Result: AI text is distinguishable with high F1 (>0.95), but robustness on paraphrased samples is poor; classifiers trained on human data and on AI data generalize weakly to each other, indicating representational differences; AI data augmentation significantly improves Arabic personality classification; GPT-4o and Gemini show the best affective coherence; human and AI texts differ measurably in tone, authenticity, and complexity. Conclusion: LLMs do not truly reproduce the psychological mechanisms behind human emotion and personality but instead form distinct representational patterns; in low-resource languages, synthetic data should be combined with targeted modeling to improve alignment and detection, with implications for affective computing, authorship attribution, and responsible AI deployment. Abstract: The advancing fluency of LLMs raises important questions about their ability to emulate complex human traits, including emotional expression and personality, across diverse linguistic and cultural contexts. This study investigates whether LLMs can convincingly mimic emotional nuance in English and personality markers in Arabic, a critical under-resourced language with unique linguistic and cultural characteristics. We conduct two tasks across six models: Jais, Mistral, LLaMA, GPT-4o, Gemini, and DeepSeek. First, we evaluate whether machine classifiers can reliably distinguish between human-authored and AI-generated texts. Second, we assess the extent to which LLM-generated texts exhibit emotional or personality traits comparable to those of humans. Our results demonstrate that AI-generated texts are distinguishable from human-authored ones (F1>0.95), though classification performance deteriorates on paraphrased samples, indicating a reliance on superficial stylistic cues. Emotion and personality classification experiments reveal significant generalization gaps: classifiers trained on human data perform poorly on AI-generated texts and vice versa, suggesting LLMs encode affective signals differently from humans. Importantly, augmenting training with AI-generated data enhances performance in the Arabic personality classification task, highlighting the potential of synthetic data to address challenges in under-resourced languages. Model-specific analyses show that GPT-4o and Gemini exhibit superior affective coherence.
Linguistic and psycholinguistic analyses reveal measurable divergences in tone, authenticity, and textual complexity between human and AI texts. These findings have implications for affective computing, authorship attribution, and responsible AI deployment, particularly within under-resourced language contexts where generative AI detection and alignment pose unique challenges.
[51] Steering LLMs for Culturally Localized Generation
Simran Khanuja,Hongbin Liu,Shujian Zhang,John Lambert,Mingqing Chen,Rajiv Mathews,Lun Wang
Main category: cs.CL
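TL;DR: This paper proposes Cultural Embeddings (CuE), a mechanistic-interpretability approach that uses sparse autoencoders to identify and manipulate cultural representations in large language models, improving cultural faithfulness and the generation of long-tail cultural concepts; it serves both as a diagnostic tool and as a means of controllable steering.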
Details
Motivation: Existing cultural localization methods, such as prompt engineering or post-training alignment, are black-box and hard to control, and cannot tell whether failures stem from missing knowledge or poor elicitation; the model's internal cultural representations need to be uncovered and manipulated. Method: Sparse autoencoders identify interpretable culture-related features, which are aggregated into Cultural Embeddings (CuE); CuE is used to analyze implicit cultural biases and to build white-box steering interventions. Result: CuE-based steering significantly improves cultural faithfulness and the generation of long-tail cultural concepts; it is complementary to black-box methods, with further gains when the two are stacked; the results suggest that model failures mostly reflect poor elicitation rather than missing knowledge, though this varies across cultures. Conclusion: CuE offers a new paradigm for understanding and controllably adjusting cultural representations in LLMs, with both diagnostic and practical value. Abstract: LLMs are deployed globally, yet produce responses biased towards cultures with abundant training data. Existing cultural localization approaches such as prompting or post-training alignment are black-box, hard to control, and do not reveal whether failures reflect missing knowledge or poor elicitation. In this paper, we address these gaps using mechanistic interpretability to uncover and manipulate cultural representations in LLMs. Leveraging sparse autoencoders, we identify interpretable features that encode culturally salient information and aggregate them into Cultural Embeddings (CuE). We use CuE both to analyze implicit cultural biases under underspecified prompts and to construct white-box steering interventions. Across multiple models, we show that CuE-based steering increases cultural faithfulness and elicits significantly rarer, long-tail cultural concepts than prompting alone. Notably, CuE-based steering is complementary to black-box localization methods, offering gains when applied on top of prompt-augmented inputs. This also suggests that models do benefit from better elicitation strategies, and don't necessarily lack long-tail knowledge representation, though this varies across cultures. Our results provide both diagnostic insight into cultural representations in LLMs and a controllable method to steer towards desired cultures.
[52] WISTERIA: Weak Implicit Signal-based Temporal Relation Extraction with Attention
Duy Dao Do,Anaïs Halftermeyer,Thi-Bich-Hanh Dao
Main category: cs.CL
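TL;DR: This paper proposes WISTERIA, a framework that analyzes the top-K attention components conditioned on each event pair to surface implicit temporal signals (such as lexical, syntactic, and morphological features), improving both the accuracy and the interpretability of temporal relation extraction.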
Details
Motivation: Existing attention-based temporal relation extraction models focus too heavily on globally salient tokens and overlook the pair-specific cues that determine the temporal relation; most methods also rely on explicit temporal markers (such as before and after) and ignore richer implicit linguistic signals. Method: WISTERIA combines multi-head attention with pair-conditioned top-K pooling to extract the most relevant contextual tokens for each event pair; a temporal signal is defined as any lexical, syntactic, or morphological element that implicitly expresses temporal order, and the selected tokens are analyzed for interpretability. Result: The framework achieves competitive accuracy on several benchmarks, including TimeBank-Dense, MATRES, TDDMan, and TDDAuto; linguistic analysis confirms that the top-K tokens align with known temporal linguistic cues, providing localized, interpretable evidence for temporal reasoning. Conclusion: WISTERIA improves temporal relation extraction performance and, by focusing on pair-specific implicit signals, strengthens interpretability and linguistic plausibility. Abstract: Temporal Relation Extraction (TRE) requires identifying how two events or temporal expressions are related in time. Existing attention-based models often highlight globally salient tokens but overlook the pair-specific cues that actually determine the temporal relation. We propose WISTERIA (Weak Implicit Signal-based Temporal Relation Extraction with Attention), a framework that examines whether the top-K attention components conditioned on each event pair truly encode interpretable evidence for temporal classification. Unlike prior works assuming explicit markers such as before, after, or when, WISTERIA considers signals as any lexical, syntactic, or morphological element implicitly expressing temporal order. By combining multi-head attention with pair-conditioned top-K pooling, the model isolates the most informative contextual tokens for each pair. We conduct extensive experiments on TimeBank-Dense, MATRES, TDDMan, and TDDAuto, including linguistic analyses of top-K tokens. Results show that WISTERIA achieves competitive accuracy and reveals pair-level rationales aligned with temporal linguistic cues, offering a localized and interpretable view of temporal reasoning.
[53] Failure of contextual invariance in gender inference with large language models
Sagar Kumar,Ariel Flint,Luca Maria Aiello,Andrea Baronchelli
Main category: cs.CL
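TL;DR: Using a gender inference task, this paper shows that large language model (LLM) outputs are not stable across contextually equivalent formulations of a task. Introducing minimal, theoretically uninformative context triggers systematic output shifts, weakens correlations with cultural gender stereotypes, and makes irrelevant contextual features (such as the gender of a pronoun for an unrelated referent) the strongest predictors; part of this dependence cannot be explained by marginal effects or simple repetition, indicating that LLMs seriously violate the assumption of contextual invariance.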
Details
Motivation: To test the key assumption, implicit in standard evaluation, that LLM outputs should remain stable across contextually equivalent tasks, with a focus on context sensitivity in gender inference. Method: A controlled pronoun selection task introduces minimal context variables that in theory carry no task-relevant information; output changes across contexts are measured systematically, and a Contextuality-by-Default analysis quantifies the non-decomposable dependence of outputs on context. Result: In 19–52% of cases, model outputs show a strong dependence on context that cannot be explained by marginal effects or pronoun repetition; correlations with cultural gender stereotypes weaken or disappear, while irrelevant contextual features (such as the pronoun gender of an unrelated person) become the most significant predictors. Conclusion: LLM outputs clearly violate contextual invariance even under near-identical syntactic structures, which poses a fundamental challenge to the validity of bias benchmarks and to model deployment in high-stakes settings. Abstract: Standard evaluation practices assume that large language model (LLM) outputs are stable under contextually equivalent formulations of a task. Here, we test this assumption in the setting of gender inference. Using a controlled pronoun selection task, we introduce minimal, theoretically uninformative discourse context and find that this induces large, systematic shifts in model outputs. Correlations with cultural gender stereotypes, present in decontextualized settings, weaken or disappear once context is introduced, while theoretically irrelevant features, such as the gender of a pronoun for an unrelated referent, become the most informative predictors of model behaviour. A Contextuality-by-Default analysis reveals that, in 19–52% of cases across models, this dependence persists after accounting for all marginal effects of context on individual outputs and cannot be attributed to simple pronoun repetition. These findings show that LLM outputs violate contextual invariance even under near-identical syntactic formulations, with implications for bias benchmarking and deployment in high-stakes settings.
cs.CV [Back]
[54] Founder effects shape the evolutionary dynamics of multimodality in open LLM families
Manuel Cebrian
Main category: cs.CV
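TL;DR: Using the ModelBiome dataset, this paper analyzes how multimodal capabilities (especially image-text tasks) evolve over time in open large language model families. Multimodality does not spread gradually; it enters families through rare "founder events" and is then rapidly amplified and diversified within existing vision-language model (VLM) lineages, producing punctuated adoption.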
Details
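Motivation: To examine how multimodal capabilities emerge and spread over time within open LLM families, filling a gap in the current understanding of multimodal evolutionary paths. Method: Using the ModelBiome AI Ecosystem dataset from Hugging Face (more than 1.8 million model entries with lineage relations), the study quantifies the distribution and evolution of multimodal tasks (especially image-text) over time and along lineage paths, computing cross-modal transfer rates, lineage-conditioned transition rates, and founder concentration. Result: Within major families, multimodality is rare through 2023 and most of 2024, then rises sharply in 2024–2025, concentrated in image-text tasks; the first multimodal variant in each major family lags the first text-generation release by anywhere from about one month to more than two years; among fine-tuning edges from text-generation parents, only 0.218% yield VLM descendants, while 94.5% of VLM-child edges originate from VLM parents; about 60% of VLM models are new roots without recorded parents and the rest are mostly VLM-derived, showing a pattern of founding, amplification, and diversification. Conclusion: Multimodality in open LLM families follows punctuated adoption: it depends on rare founder events to start and then expands rapidly within VLM lineages, so its growth is driven by within-lineage inheritance rather than cross-modal transfer, giving rise to transfer-limited scaling behavior distinct from that of text-only models. Abstract: Large language model (LLM) families are improving rapidly, yet it remains unclear how quickly multimodal capabilities emerge and propagate within open families. Using the ModelBiome AI Ecosystem dataset of Hugging Face model metadata and recorded lineage fields (>1.8x10^6 model entries), we quantify multimodality over time and along recorded parent-to-child relations. Cross-modal tasks are widespread in the broader ecosystem well before they become common within major open LLM families: within these families, multimodality remains rare through 2023 and most of 2024, then increases sharply in 2024-2025 and is dominated by image-text vision-language tasks. Across major families, the first vision-language model (VLM) variants typically appear months after the first text-generation releases, with lags ranging from ~1 month (Gemma) to more than a year for several families and ~26 months for GLM. Lineage-conditioned transition rates show weak cross-type transfer: among fine-tuning edges from text-generation parents, only 0.218% yield VLM descendants. Instead, multimodality expands primarily within existing VLM lineages: 94.5% of VLM-child fine-tuning edges originate from VLM parents, versus 4.7% from text-generation parents. At the model level, most VLM releases appear as new roots without recorded parents (~60%), while the remainder are predominantly VLM-derived; founder concentration analyses indicate rapid within-lineage amplification followed by diversification.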
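The lineage-conditioned rates quoted above are conditional proportions over parent-to-child edges, taken in two directions: conditioning on the parent type (the 0.218% figure) or on the child type (the 94.5% figure). A minimal sketch of both, assuming a hypothetical edge list of (parent_type, child_type) pairs; the type labels and toy counts are illustrative and are not drawn from the ModelBiome dataset.

```python
from collections import Counter

def transition_rate(edges, parent_type, child_type):
    """Share of edges from `parent_type` parents whose child is `child_type`."""
    children = [c for p, c in edges if p == parent_type]
    if not children:
        return 0.0
    return Counter(children)[child_type] / len(children)

def parent_share(edges, child_type, parent_type):
    """Share of `child_type` children whose parent is `parent_type`."""
    parents = [p for p, c in edges if c == child_type]
    if not parents:
        return 0.0
    return Counter(parents)[parent_type] / len(parents)

# Toy fine-tuning edges (hypothetical): 8 text->text, 2 text->vlm, 6 vlm->vlm.
toy_edges = [("text", "text")] * 8 + [("text", "vlm")] * 2 + [("vlm", "vlm")] * 6

print(transition_rate(toy_edges, "text", "vlm"))  # 2 of the 10 text-parent edges
print(parent_share(toy_edges, "vlm", "vlm"))      # 6 of the 8 VLM children
```

The two quantities answer different questions, which is why a low cross-type transition rate and a high within-lineage parent share can hold at the same time.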
Together, these results show that multimodality enters open LLM families through rare founder events and then expands rapidly within their descendant lineages, producing punctuated adoption dynamics that likely induce distinct, transfer-limited scaling behavior for multimodal capabilities.
[55] From Instructions to Assistance: a Dataset Aligning Instruction Manuals with Assembly Videos for Evaluating Multimodal LLMs
Federico Toschi,Nicolò Brunello,Andrea Sassella,Vincenzo Scotti,Mark James Carman
Main category: cs.CV
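TL;DR: This paper evaluates how suitable current open multimodal large language models (MLMs) are as real-time assistants for technical tasks such as furniture assembly. It builds the Manual to Action Dataset (M2AD) and tests MLMs on three abilities: reasoning with reduced annotation, step tracking, and manual page referencing. Performance is limited by architecture and hardware, and support for multi-image and interleaved text-image reasoning is needed.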
Details
Motivation: As large language models (LLMs) advance, research is extending to multimodal settings; to support users carrying out procedural tasks in real scenarios (such as VR/AR environments), the real-time assistance capabilities of existing MLMs on technical tasks need to be evaluated. Method: The authors build M2AD, a furniture assembly dataset with step-by-step labels and manual references, and systematically evaluate MLMs on three abilities: (1) using reasoning to reduce the need for detailed annotation; (2) tracking the progression of assembly steps; (3) correctly referring to the instruction manual pages. Result: Some MLMs show a basic understanding of procedural sequences, but overall performance is limited by architectural constraints (such as single-image input) and hardware constraints, making multi-image and interleaved text-image reasoning hard to handle. Conclusion: Current open MLMs are not yet adequate for complex real-time technical assistance; future work needs MLM architectures that support multi-image input and fine-grained joint text-image reasoning. Abstract: The recent advancements introduced by Large Language Models (LLMs) have transformed how Artificial Intelligence (AI) can support complex, real-world tasks, pushing research outside the text boundaries towards multimodal contexts and leading to Multimodal Large Language Models (MLMs). Given the current adoption of LLM-based assistants in solving technical or domain-specific problems, the natural continuation of this trend is to extend the input domains of these assistants exploiting MLMs. Ideally, these MLMs should be used as real-time assistants in procedural tasks, hopefully integrating a view of the environment where the user being assisted is, or even better sharing the same point of view via Virtual Reality (VR) or Augmented Reality (AR) supports, to reason over the same scenario the user is experiencing. With this work, we aim at evaluating the quality of currently openly available MLMs to provide this kind of assistance on technical tasks. To this end, we annotated a dataset of furniture assembly with step-by-step labels and manual references: the Manual to Action Dataset (M2AD). We used this dataset to assess (1) to what extent the reasoning abilities of MLMs can be used to reduce the need for detailed labelling, allowing for more efficient, cost-effective annotation practices, (2) whether MLMs are able to track the progression of assembly steps, and (3) whether MLMs can refer correctly to the instruction manual pages.
Our results showed that while some models understand procedural sequences, their performance is limited by architectural and hardware constraints, highlighting the need for multi-image and interleaved text-image reasoning.
[56] When Visuals Aren't the Problem: Evaluating Vision-Language Models on Misleading Data Visualizations
Harsh Nishant Lalai,Raj Sanjay Shah,Hanspeter Pfister,Sashank Varma,Grace Guo
Main category: cs.CV
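TL;DR: This paper introduces a fine-grained benchmark of misleading visualizations to evaluate how well multimodal models detect chart design errors (such as truncated axes) and reasoning errors (such as causal fallacies). Models are better at identifying design-level deception and much weaker at detecting deception that arises from semantic reasoning.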
Details
Motivation: Existing vision-language models perform well on chart understanding tasks, but their ability to detect misleading visualizations caused by subtle reasoning errors (such as causal inference in captions) remains unclear and needs systematic evaluation. Method: The authors build a benchmark of misleading visualization-caption pairs that combines real-world charts with human-written captions, annotated by fine-grained error type (reasoning errors and visualization design errors), and evaluate several commercial and open-source VLMs on it. Result: Models detect visualization design errors (such as truncated axes) markedly better than reasoning errors (such as cherry-picking and causal fallacies), and often misclassify non-misleading charts as misleading. Conclusion: Current VLMs have limited ability in attribution-level deception detection (that is, pinpointing the specific error type) and struggle in particular with semantic and logical deception; this work fills the gap between coarse-grained detection of misleading content and fine-grained error attribution. Abstract: Visualizations help communicate data insights, but deceptive data representations can distort their interpretation and propagate misinformation. While recent Vision Language Models (VLMs) perform well on many chart understanding tasks, their ability to detect misleading visualizations, especially when deception arises from subtle reasoning errors in captions, remains poorly understood. Here, we evaluate VLMs on misleading visualization-caption pairs grounded in a fine-grained taxonomy of reasoning errors (e.g., Cherry-picking, Causal inference) and visualization design errors (e.g., Truncated axis, Dual axis, inappropriate encodings). To this end, we develop a benchmark that combines real-world visualization with human-authored, curated misleading captions designed to elicit specific reasoning and visualization error types, enabling controlled analysis across error categories and modalities of misleadingness. Evaluating many commercial and open-source VLMs, we find that models detect visual design errors substantially more reliably than reasoning-based misinformation, and frequently misclassify non-misleading visualizations as deceptive. Overall, our work fills a gap between coarse detection of misleading content and the attribution of the specific reasoning or visualization errors that give rise to it.
[57] Efficient Universal Perception Encoder
Chenchen Zhu,Saksham Suri,Cijo Jose,Maxime Oquab,Marc Szafraniec,Wei Wen,Yunyang Xiong,Patrick Labatut,Piotr Bojanowski,Raghuraman Krishnamoorthi,Vikas Chandra
Main category: cs.CV
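TL;DR: This paper proposes the Efficient Universal Perception Encoder (EUPE), which first distills multiple domain-expert models into a large proxy teacher and then compresses that single teacher into a lightweight, efficient encoder, achieving strong cross-task generalization at a small size.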
Details
Motivation: Deploying AI models on resource-constrained edge devices requires both computational efficiency and multi-task capability, which calls for a vision encoder that is small yet has strong representations. Method: Unlike earlier approaches that aggregate and compress directly from multiple teachers, EUPE first distills multiple domain-expert encoders into one large proxy teacher and then compresses that single teacher to obtain an efficient universal encoder. Result: EUPE matches or exceeds same-size single-domain expert models on a range of downstream tasks and outperforms previous agglomerative encoders; the models and code will be released. Conclusion: EUPE validates the two-stage "scale up, then scale down" distillation recipe for building efficient universal vision encoders and offers a new approach for edge AI. Abstract: Running AI models on smart edge devices can unlock versatile user experiences, but presents challenges due to limited compute and the need to handle multiple tasks simultaneously. This requires a vision encoder with small size but powerful and versatile representations. We present our method, Efficient Universal Perception Encoder (EUPE), which offers both inference efficiency and universally good representations for diverse downstream tasks. We achieve this by distilling from multiple domain-expert foundation vision encoders. Unlike previous agglomerative methods that directly scale down from multiple teachers to an efficient encoder, we demonstrate the importance of first scaling up to a large proxy teacher and then scaling down from this single teacher. Experiments show that EUPE achieves on-par or better performance than individual domain experts of the same size on diverse task domains and also outperforms previous agglomerative encoders. We will release the full family of EUPE models and the code to foster future research.
[58] Spatially-Aware Evaluation Framework for Aerial LiDAR Point Cloud Semantic Segmentation: Distance-Based Metrics on Challenging Regions
Alex Salvatierra,José Antonio Sanz,Christian Gutiérrez,Mikel Galar
Main category: cs.CV
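TL;DR: This paper proposes a new evaluation framework for semantic segmentation of aerial LiDAR point clouds, comprising distance-based measures of error severity and a focused evaluation on hard-to-classify points, to overcome the tendency of traditional metrics (such as mIoU and OA) to ignore spatial context and to be dominated by easily classified points.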
Details
Motivation: Traditional semantic segmentation metrics (such as mIoU and OA) have two key shortcomings on aerial LiDAR data: they ignore the geometric severity of misclassifications (which affects the accuracy of geospatial products such as DTMs), and the high proportion of easily classified points masks real differences between models in difficult regions. Method: Two complementary evaluations are proposed: (1) distance-based metrics that quantify the spatial deviation between each misclassified point and the nearest ground-truth point of its predicted class; (2) a focused evaluation on the subset of "hard points" misclassified by at least one model, which reduces the bias from easily classified points. Result: Comparing three state-of-the-art models on three aerial LiDAR datasets, the new metrics reveal spatial error patterns that traditional measures cannot detect, giving a more reliable basis for model selection in Earth Observation applications. Conclusion: The framework evaluates point cloud semantic segmentation models more comprehensively and with a stronger application focus, and is especially suited to tasks that demand spatial consistency. Abstract: Semantic segmentation metrics for 3D point clouds, such as mean Intersection over Union (mIoU) and Overall Accuracy (OA), present two key limitations in the context of aerial LiDAR data. First, they treat all misclassifications equally regardless of their spatial context, overlooking cases where the geometric severity of errors directly impacts the quality of derived geospatial products such as Digital Terrain Models. Second, they are often dominated by the large proportion of easily classified points, which can mask meaningful differences between models and under-represent performance in challenging regions. To address these limitations, we propose a novel evaluation framework for comparing semantic segmentation models through two complementary approaches. First, we introduce distance-based metrics that account for the spatial deviation between each misclassified point and the nearest ground-truth point of the predicted class, capturing the geometric severity of errors. Second, we propose a focused evaluation on a common subset of hard points, defined as the points misclassified by at least one of the evaluated models, thereby reducing the bias introduced by easily classified points and better revealing differences in model performance in challenging regions. We validate our framework by comparing three state-of-the-art deep learning models on three aerial LiDAR datasets. Results demonstrate that the proposed metrics provide complementary information to traditional measures, revealing spatial error patterns that are critical for Earth Observation applications but invisible to conventional evaluation approaches.
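The two evaluation ideas described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how per-point distances are aggregated, so the mean used here is an assumption, as are the function names, the brute-force nearest-neighbor search, and the choice to skip errors whose predicted class never occurs in the ground truth.

```python
import math

def mean_error_distance(points, y_true, y_pred):
    """Mean distance from each misclassified point to the nearest
    ground-truth point of the class it was predicted as."""
    distances = []
    for point, true_label, pred_label in zip(points, y_true, y_pred):
        if true_label == pred_label:
            continue  # correctly classified points are not errors
        candidates = [q for q, t in zip(points, y_true) if t == pred_label]
        if not candidates:
            continue  # predicted class absent from ground truth: skipped (assumption)
        distances.append(min(math.dist(point, q) for q in candidates))
    return sum(distances) / len(distances) if distances else 0.0

def hard_point_indices(y_true, predictions_by_model):
    """Indices of points misclassified by at least one evaluated model."""
    return [i for i, t in enumerate(y_true)
            if any(pred[i] != t for pred in predictions_by_model)]

# Toy scene: three points on a line, two ground points and one building point.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
y_true = ["ground", "ground", "building"]
model_a = ["ground", "building", "building"]  # one error, 4 units from a true building point
model_b = ["ground", "ground", "building"]    # no errors

print(mean_error_distance(points, y_true, model_a))
print(hard_point_indices(y_true, [model_a, model_b]))
```

A small distance means the error sits next to real points of the predicted class, such as a boundary mistake, while a large one marks a geometrically severe error. For real point clouds, a spatial index such as a k-d tree would replace the quadratic search used here.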
The proposed framework enables more informed model selection for scenarios where spatial consistency is critical.
[59] OsteoFlow: Lyapunov-Guided Flow Distillation for Predicting Bone Remodeling after Mandibular Reconstruction
Hamidreza Aftabi,Faye Yu,Brooke Switzer,Zachary Fishman,Eitan Prisman,Antony Hodgson,Cari Whyne,Sidney Fels,Michael Hardisty
Main category: cs.CV
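TL;DR: This paper proposes OsteoFlow, a flow-based framework for predicting CT scans one year after mandibular reconstruction. Its core contribution is Lyapunov-guided trajectory distillation, which markedly improves the accuracy of long-term bone remodeling prediction.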
Details
Motivation: Predicting long-term bone remodeling after mandibular reconstruction has substantial clinical value, but existing generative models struggle to maintain trajectory consistency and anatomical fidelity over long horizons. Method: OsteoFlow uses Lyapunov-guided trajectory distillation to distill a continuous trajectory over transport time from a teacher based on a registration-derived stationary velocity field, combined with a resection-aware image loss that preserves geometric correspondence. Result: Evaluated on 344 paired regions of interest, OsteoFlow significantly outperforms state-of-the-art baselines, reducing mean absolute error in the surgical resection zone by about 20%. Conclusion: Trajectory distillation offers a new approach to long-term medical image prediction, and OsteoFlow demonstrates its potential for modeling bone remodeling. Abstract: Predicting long-term bone remodeling after mandibular reconstruction would be of great clinical benefit, yet standard generative models struggle to maintain trajectory-level consistency and anatomical fidelity over long horizons. We introduce OsteoFlow, a flow-based framework predicting Year-1 post-operative CT scans from Day-5 scans. Our core contribution is Lyapunov-guided trajectory distillation: Unlike one-step distillation, our method distills a continuous trajectory over transport time from a registration-derived stationary velocity field teacher. Combined with a resection-aware image loss, this enforces geometric correspondence without sacrificing generative capacity. Evaluated on 344 paired regions of interest, OsteoFlow significantly outperforms state-of-the-art baselines, reducing mean absolute error in the surgical resection zone by ~20%. This highlights the promise of trajectory distillation for long-term prediction. Code is available on GitHub: OsteoFlow.
[60] Static Scene Reconstruction from Dynamic Egocentric Videos
Qifei Cui,Patrick Chen
Main category: cs.CV
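TL;DR: This paper proposes a robust 3D reconstruction method for long-form egocentric video: a mask-aware reconstruction mechanism suppresses dynamic foreground (such as hands), and chunked reconstruction with pose-graph stitching markedly improves trajectory accuracy and the quality of the static geometry.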
Details
Motivation: Rapid camera motion and frequent dynamic interactions in egocentric video cause existing static reconstruction systems (such as MapAnything) to suffer severe trajectory drift and "ghost" geometry from hands. Method: A mask-aware reconstruction mechanism explicitly suppresses dynamic foreground in the attention layers; chunked reconstruction with pose-graph stitching ensures global consistency and eliminates long-term drift. Result: On HD-EPIC and indoor drone datasets, absolute trajectory error drops significantly and the static geometry is visually cleaner than with naive baselines. Conclusion: The method effectively extends the 3D reconstruction capability of foundation models to dynamic first-person scenes. Abstract: Egocentric videos present unique challenges for 3D reconstruction due to rapid camera motion and frequent dynamic interactions. State-of-the-art static reconstruction systems, such as MapAnything, often degrade in these settings, suffering from catastrophic trajectory drift and "ghost" geometry caused by moving hands. We bridge this gap by proposing a robust pipeline that adapts static reconstruction backbones to long-form egocentric video. Our approach introduces a mask-aware reconstruction mechanism that explicitly suppresses dynamic foreground in the attention layers, preventing hand artifacts from contaminating the static map. Furthermore, we employ a chunked reconstruction strategy with pose-graph stitching to ensure global consistency and eliminate long-term drift. Experiments on HD-EPIC and indoor drone datasets demonstrate that our pipeline significantly improves absolute trajectory error and yields visually clean static geometry compared to naive baselines, effectively extending the capability of foundation models to dynamic first-person scenes.
[61] MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding
Hejun Dong,Junbo Niu,Bin Wang,Weijun Zeng,Wentao Zhang,Conghui He
Main category: cs.CV
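TL;DR: This paper proposes MinerU-Diffusion, a diffusion-based, non-autoregressive document OCR framework that replaces conventional autoregressive decoding with parallel denoising under visual conditioning, improving both robustness and speed on long documents.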
Details
Motivation: Most existing OCR systems rely on autoregressive decoding, which causes sequential latency and error accumulation on long documents; the authors argue that, from an inverse rendering perspective, left-to-right generation is not intrinsic to the task but a constraint imposed by serialization. Method: MinerU-Diffusion uses a block-wise diffusion decoder and an uncertainty-driven curriculum learning strategy to perform parallel diffusion denoising under visual conditioning in place of autoregressive decoding. Result: Robustness improves across multiple benchmarks, and decoding is up to 3.2x faster than autoregressive baselines; the newly proposed Semantic Shuffle benchmark confirms weaker dependence on linguistic priors and stronger visual OCR capability. Conclusion: Diffusion models can effectively replace the autoregressive paradigm for document OCR, offering a more efficient and robust route to parsing long structured text. Abstract: Optical character recognition (OCR) has evolved from line-level transcription to structured document parsing, requiring models to recover long-form sequences containing layout, tables, and formulas. Despite recent advances in vision-language models, most existing systems rely on autoregressive decoding, which introduces sequential latency and amplifies error propagation in long documents. In this work, we revisit document OCR from an inverse rendering perspective, arguing that left-to-right causal generation is an artifact of serialization rather than an intrinsic property of the task. Motivated by this insight, we propose MinerU-Diffusion, a unified diffusion-based framework that replaces autoregressive sequential decoding with parallel diffusion denoising under visual conditioning. MinerU-Diffusion employs a block-wise diffusion decoder and an uncertainty-driven curriculum learning strategy to enable stable training and efficient long-sequence inference. Extensive experiments demonstrate that MinerU-Diffusion consistently improves robustness while achieving up to 3.2x faster decoding compared to autoregressive baselines. Evaluations on the proposed Semantic Shuffle benchmark further confirm its reduced dependence on linguistic priors and stronger visual OCR capability.
[62] Color When It Counts: Grayscale-Guided Online Triggering for Always-On Streaming Video Sensing
Weitong Cai,Hang Zhang,Yukai Huang,Shitong Sun,Jiankang Deng,Songcen Xu,Jifei Song,Zhensong Zhang
Main category: cs.CV
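TL;DR: This paper proposes ColorTrigger, a new paradigm for efficient streaming video understanding described as "grayscale-always, color-on-demand": color frames are captured only when needed, substantially reducing sensing and compute costs on edge devices.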
Details
Motivation: Continuous high-fidelity RGB video capture is too costly for resource-constrained edge and wearable devices, yet color information is not always necessary in practice, leaving substantial redundancy. Method: ColorTrigger is an online training-free trigger for real-time use, based on windowed grayscale affinity analysis; it combines lightweight quadratic programming, credit-budgeted control, and dynamic token routing to detect chromatic redundancy causally and capture color frames sparsely. Result: On streaming video understanding benchmarks, it reaches 91.6% of full-color baseline performance while using only 8.1% of the RGB frames. Conclusion: Natural videos contain substantial color redundancy, and the "grayscale-always, color-on-demand" paradigm makes always-on video sensing practical on resource-constrained devices. Abstract: Always-on sensing is essential for next-generation edge/wearable AI systems, yet continuous high-fidelity RGB video capture remains prohibitively expensive for resource-constrained mobile and edge platforms. We present a new paradigm for efficient streaming video understanding: grayscale-always, color-on-demand. Through preliminary studies, we discover that color is not always necessary. Sparse RGB frames suffice for comparable performance when temporal structure is preserved via continuous grayscale streams. Building on this insight, we propose ColorTrigger, an online training-free trigger that selectively activates color capture based on windowed grayscale affinity analysis. Designed for real-time edge deployment, ColorTrigger uses lightweight quadratic programming to detect chromatic redundancy causally, coupled with credit-budgeted control and dynamic token routing to jointly reduce sensing and inference costs. On streaming video understanding benchmarks, ColorTrigger achieves 91.6% of full-color baseline performance while using only 8.1% RGB frames, demonstrating substantial color redundancy in natural videos and enabling practical always-on video sensing on resource-constrained devices.
[63] Tiny Inference-Time Scaling with Latent Verifiers
Davide Bucciarelli,Evelyn Turri,Lorenzo Baraldi,Marcella Cornia,Lorenzo Baraldi,Rita Cucchiara
Main category: cs.CV
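TL;DR: This paper proposes Verifier on Hidden States (VHS), a verifier that operates directly on the intermediate hidden states of a Diffusion Transformer (DiT), avoiding the redundant step of decoding candidates to pixel space and re-encoding them; it substantially reduces inference-time verification cost while maintaining or improving generation quality.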
Details
Motivation: Verifiers based on multimodal large language models (MLLMs) improve generation performance but must decode candidates to pixel space and re-encode them into visual embeddings, which is costly at inference time; since diffusion models already operate in a latent space, this step is redundant. Method: VHS verifies candidates directly from the intermediate hidden-layer features of a DiT single-step generator, with no pixel-space decoding or re-encoding, enabling efficient inference-time scaling under tiny inference budgets (a small number of candidates). Result: Compared with a standard MLLM verifier, VHS reduces joint generation-and-verification time by 63.3%, FLOPs by 51%, and VRAM usage by 14.5%, and improves GenEval by 2.7% at the same inference-time budget. Conclusion: By verifying at the hidden-state level, VHS makes inference-time scaling more efficient, substantially cutting compute cost while matching or exceeding the performance of MLLM verifiers. Abstract: Inference-time scaling has emerged as an effective way to improve generative models at test time by using a verifier to score and select candidate outputs. A common choice is to employ Multimodal Large Language Models (MLLMs) as verifiers, which can improve performance but introduce substantial inference-time cost. Indeed, diffusion pipelines operate in an autoencoder latent space to reduce computation, yet MLLM verifiers still require decoding candidates to pixel space and re-encoding them into the visual embedding space, leading to redundant and costly operations. In this work, we propose Verifier on Hidden States (VHS), a verifier that operates directly on intermediate hidden representations of Diffusion Transformer (DiT) single-step generators. VHS analyzes generator features without decoding to pixel space, thereby reducing the per-candidate verification cost while improving or matching the performance of MLLM-based competitors. We show that, under tiny inference budgets with only a small number of candidates per prompt, VHS enables more efficient inference-time scaling reducing joint generation-and-verification time by 63.3%, compute FLOPs by 51% and VRAM usage by 14.5% with respect to a standard MLLM verifier, achieving a +2.7% improvement on GenEval at the same inference-time budget.
[64] Sketch2CT: Multimodal Diffusion for Structure-Aware 3D Medical Volume Generation
Delin An,Chaoli Wang
Main category: cs.CV
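TL;DR: This paper proposes Sketch2CT, a multimodal diffusion framework for structure-aware 3D medical volume generation that jointly uses a user-provided 2D sketch and a text description to generate anatomically consistent 3D CT volumes.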
Details
Motivation: To address the lack of anatomical consistency when generating 3D medical volumes under multimodal conditions, and to ease data scarcity in the medical field. Method: Sketch2CT first generates 3D segmentation masks conditioned on the sketch and text, introducing two new modules (local text-guided sketch feature refinement and global sketch-text representation fusion) built on a capsule-attention backbone; the segmentation masks then condition a latent diffusion model that synthesizes the 3D CT volume. Result: Experiments on public CT datasets show that Sketch2CT performs strongly on multimodal medical volume generation and supports controllable, low-cost data augmentation. Conclusion: Sketch2CT achieves high-quality, anatomically consistent 3D medical volume generation jointly guided by sketch and text, offering a new paradigm for medical data augmentation. Abstract: Diffusion probabilistic models have demonstrated significant potential in generating high-quality, realistic medical images, providing a promising solution to the persistent challenge of data scarcity in the medical field. Nevertheless, producing 3D medical volumes with anatomically consistent structures under multimodal conditions remains a complex and unresolved problem. We introduce Sketch2CT, a multimodal diffusion framework for structure-aware 3D medical volume generation, jointly guided by a user-provided 2D sketch and a textual description that captures 3D geometric semantics. The framework initially generates 3D segmentation masks of the target organ from random noise, conditioned on both modalities. To effectively align and fuse these inputs, we propose two key modules that refine sketch features with localized textual cues and integrate global sketch-text representations. Built upon a capsule-attention backbone, these modules leverage the complementary strengths of sketches and text to produce anatomically accurate organ shapes. The synthesized segmentation masks subsequently guide a latent diffusion model for 3D CT volume synthesis, enabling realistic reconstruction of organ appearances that are consistent with user-defined sketches and descriptions. Extensive experiments on public CT datasets demonstrate that Sketch2CT achieves superior performance in generating multimodal medical volumes. Its controllable, low-cost generation pipeline enables principled, efficient augmentation of medical datasets.
Code is available at https://github.com/adlsn/Sketch2CT.
[65] High Resolution Flood Extent Detection Using Deep Learning with Random Forest Derived Training Labels
Azizbek Nuriddinov,Ebrahim Ahmadisharaf,Mohammad Reza Alizadeh
Main category: cs.CV
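TL;DR: This paper proposes a flood-mapping framework that combines PlanetScope optical imagery with topographic features (HAND and slope), uses Random Forest-generated labels to train U-Net models, and validates its effectiveness under data-scarce conditions on the Hurricane Ida case.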
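The details for this entry report both F1 and IoU for the flood masks. The following is a minimal sketch, not the authors' evaluation code, of how the two scores can be computed from binary masks; the function name and the toy masks are illustrative assumptions. For a single pair of masks the two scores are tied by F1 = 2·IoU / (1 + IoU), so the reported pair (IoU = 0.85, F1 = 0.92) is internally consistent.

```python
import numpy as np

def f1_and_iou(pred, truth):
    """Return (F1, IoU) for two binary masks of equal shape.

    Assumes at least one flooded pixel in either mask, so the
    denominators are non-zero.
    """
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.logical_and(pred, truth).sum()    # flooded in both
    fp = np.logical_and(pred, ~truth).sum()   # predicted flooded only
    fn = np.logical_and(~pred, truth).sum()   # missed flooded pixels
    f1 = 2 * tp / (2 * tp + fp + fn)
    iou = tp / (tp + fp + fn)
    return float(f1), float(iou)

# Toy 2x3 masks (hypothetical data, 1 = flooded).
pred = np.array([[1, 1, 0], [0, 1, 0]])
truth = np.array([[1, 0, 0], [0, 1, 1]])
f1, iou = f1_and_iou(pred, truth)

# Consistency check on the reported numbers: IoU = 0.85 implies F1 of about 0.92.
f1_from_reported_iou = 2 * 0.85 / (1 + 0.85)
```

Because of this identity, reporting both metrics from the same masks adds no independent information; the pair is mainly useful for comparison with papers that report only one of them.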
Details
Motivation: Validating flood models is difficult because observations of extreme events are scarce; high-resolution optical imagery is promising but is limited by cloud cover and the lack of labeled data during disasters.
Method: Builds an ML/DL framework that integrates PlanetScope optical imagery (4 bands) with topographic features (HAND and slope, 6 bands in total): a Random Forest applied to expert-annotated flood masks generates training labels, and two U-Net models (ResNet18 backbone) are then trained and compared.
Result: The U-Net with topographic features performs almost identically to the optical-only model (F1=0.92, IoU=0.85), indicating that HAND and slope add little to inundation-extent detection; the framework is scalable and label-efficient.
Conclusion: The framework offers a feasible, efficient way to map inundation extent in data-scarce flood scenarios, although the auxiliary topographic features did not significantly improve accuracy in this experiment.
Abstract: Validation of flood models, used to support risk mitigation strategies, remains challenging due to limited observations during extreme events. High-frequency, high-resolution optical imagery (~3 m), such as PlanetScope, offers new opportunities for flood mapping, although applications remain limited by cloud cover and the lack of labeled training data during disasters. To address this, we develop a flood mapping framework that integrates PlanetScope optical imagery with topographic features using machine learning (ML) and deep learning (DL) algorithms. A Random Forest model was applied to expert-annotated flood masks to generate training labels for the DL models (U-Net). Two U-Net models with ResNet18 backbone were trained using optical imagery only (4 bands) and optical imagery combined with Height Above Nearest Drainage (HAND) and topographic slope (6 bands). Hurricane Ida (September 2021), which caused catastrophic flooding across the eastern United States, including the New York City metropolitan area, was used as an example to evaluate the framework. Results demonstrate that the U-Net model with topographic features achieved very close performance to the optical-only configuration (F1=0.92 and IoU=0.85 in both modeling scenarios), indicating that HAND and slope provide only marginal value to inundation extent detection. The proposed framework offers a scalable and label-efficient approach for mapping inundation extent that enables modeling under data-scarce flood scenarios.
[66] Ego2Web: A Web Agent Benchmark Grounded in Egocentric Videos
Shoubin Yu,Lei Shu,Antoine Yang,Yao Fu,Srinivas Sunkara,Maria Wang,Jindong Chen,Mohit Bansal,Boqing Gong
Main category: cs.CV
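TL;DR: This paper introduces Ego2Web, the first benchmark to combine egocentric (first-person) video perception with web-agent execution, for evaluating multimodal AI agents that must understand both the physical surroundings and digital-world tasks. It also designs Ego2WebJudge, an automatic evaluation method with high agreement with human judgment. Experiments show that current SoTA agents perform weakly on the benchmark, leaving substantial headroom.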
Details
Motivation: Existing web-agent benchmarks focus only on web interaction and perception and do not model the user's real physical surroundings (e.g., egocentric vision captured by AR glasses), so they cannot evaluate tasks that require joint physical perception and online action.
Method: Builds the Ego2Web benchmark by pairing real first-person video recordings with diverse web tasks (e-commerce, media retrieval, knowledge lookup, etc.); high-quality video-task pairs are curated with an automatic data-generation pipeline followed by human verification; an LLM-as-a-Judge automatic evaluation method, Ego2WebJudge, is proposed.
Result: Ego2WebJudge reaches approximately 84% agreement with human judgment; the SoTA agents tested on Ego2Web perform weakly overall, with substantial headroom in every task category; ablations confirm that accurate video understanding is key to task success and highlight the limitations of current agents.
Conclusion: Ego2Web fills the gap in jointly evaluating physical perception and web execution, providing a new benchmark and evaluation tool for developing AI assistants that connect the physical and digital worlds.
Abstract: Multimodal AI agents are increasingly automating complex real-world workflows that involve online web execution. However, current web-agent benchmarks suffer from a critical limitation: they focus entirely on web-based interaction and perception, lacking grounding in the user's real-world physical surroundings. This limitation prevents evaluation in crucial scenarios, such as when an agent must use egocentric visual perception (e.g., via AR glasses) to recognize an object in the user's surroundings and then complete a related task online. To address this gap, we introduce Ego2Web, the first benchmark designed to bridge egocentric video perception and web agent execution. Ego2Web pairs real-world first-person video recordings with web tasks that require visual understanding, web task planning, and interaction in an online environment for successful completion. We utilize an automatic data-generation pipeline combined with human verification and refinement to curate well-constructed, high-quality video-task pairs across diverse web task types, including e-commerce, media retrieval, knowledge lookup, etc. To facilitate accurate and scalable evaluation for our benchmark, we also develop a novel LLM-as-a-Judge automatic evaluation method, Ego2WebJudge, which achieves approximately 84% agreement with human judgment, substantially higher than existing evaluation methods. Experiments with diverse SoTA agents on our Ego2Web show that their performance is weak, with substantial headroom across all task categories.
We also conduct a comprehensive ablation study on task design, highlighting the necessity of accurate video understanding in the proposed task and the limitations of current agents. We hope Ego2Web can be a critical new resource for developing truly capable AI assistants that can seamlessly see, understand, and act across the physical and digital worlds.
[67] UrbanVGGT: Scalable Sidewalk Width Estimation from Street View Images
Kaizhen Tan,Fan Zhang
Main category: cs.CV
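TL;DR: This paper proposes UrbanVGGT, which estimates sidewalk width from a single street-view image by combining semantic segmentation, 3D reconstruction, adaptive ground-plane fitting, camera-height calibration, and directional width measurement, reaching a mean absolute error of 0.252 m on a Washington, D.C. benchmark.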
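The entry names ground-plane fitting and camera-height-based scale calibration but does not spell out the procedure. The sketch below shows one common way such a calibration can work, under stated assumptions: the reconstruction is only known up to scale, the camera sits at the origin of the reconstructed frame, and the true camera height above the ground is known. The function names, the toy points, and the 2.5 m camera height are all hypothetical and are not taken from the paper.

```python
import numpy as np

def fit_plane(points):
    """Least-squares plane through Nx3 points.

    Returns (unit normal n, offset d) such that n . x + d = 0 on the plane.
    """
    centroid = points.mean(axis=0)
    # The right singular vector with the smallest singular value is the normal.
    _, _, vt = np.linalg.svd(points - centroid)
    normal = vt[-1]
    return normal, -float(normal @ centroid)

def metric_scale(ground_points, camera_height_m):
    """Metres per reconstruction unit, assuming the camera is at the origin."""
    _, d = fit_plane(ground_points)
    # |d| is the origin-to-plane distance because the normal has unit length.
    return camera_height_m / abs(d)

# Toy ground points lying 0.5 reconstruction units below the camera.
ground = np.array([[0.0, 0.5, 1.0], [1.0, 0.5, 1.0],
                   [0.0, 0.5, 2.0], [1.0, 0.5, 3.0]])
scale = metric_scale(ground, camera_height_m=2.5)  # hypothetical camera height
width_m = 0.3 * scale  # a width of 0.3 units measured on the recovered plane
```

Under these assumptions every length on the plane scales linearly with the assumed camera height, so an error in that height carries straight into the width estimate. That matches the entry's observation that scale calibration is the most critical component.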
Details
Motivation: Sidewalk width is an important indicator of pedestrian accessibility, comfort, and network quality, yet most cities lack large-scale measured width data; existing methods rely on costly field surveys, high-resolution overhead imagery, or simplified geometric assumptions, making it hard to achieve both scalability and accuracy.
Method: The UrbanVGGT pipeline: semantic segmentation to extract the sidewalk region → feed-forward 3D reconstruction → adaptive ground-plane fitting → camera-height-based metric scale calibration → directional width measurement on the recovered plane.
Result: On the Washington, D.C. ground-truth benchmark, the mean absolute error is 0.252 m and 95.5% of estimates fall within 0.50 m of the reference; ablations show that scale calibration is the most critical component; a prototype dataset, SV-SideWidth, covering 527 OSM street segments was generated across three cities.
Conclusion: Street-view imagery can support large-scale generation of candidate sidewalk-width data, but broad cross-city validation and local ground-truth auditing are needed before it can serve as authoritative planning data.
Abstract: Sidewalk width is an important indicator of pedestrian accessibility, comfort, and network quality, yet large-scale width data remain scarce in most cities. Existing approaches typically rely on costly field surveys, high-resolution overhead imagery, or simplified geometric assumptions that limit scalability or introduce systematic error. To address this gap, we present UrbanVGGT, a measurement pipeline for estimating metric sidewalk width from a single street-view image. The method combines semantic segmentation, feed-forward 3D reconstruction, adaptive ground-plane fitting, camera-height-based scale calibration, and directional width measurement on the recovered plane. On a ground-truth benchmark from Washington, D.C., UrbanVGGT achieves a mean absolute error of 0.252 m, with 95.5% of estimates within 0.50 m of the reference width. Ablation experiments show that metric scale calibration is the most critical component, and controlled comparisons with alternative geometry backbones support the effectiveness of the overall design. As a feasibility demonstration, we further apply the pipeline to three cities and generate SV-SideWidth, a prototype sidewalk-width dataset covering 527 OpenStreetMap street segments. The results indicate that street-view imagery can support scalable generation of candidate sidewalk-width attributes, while broader cross-city validation and local ground-truth auditing remain necessary before deployment as authoritative planning data.
[68] Generalized multi-object classification and tracking with sparse feature resonator networks
Lazar Supic,Alec Mullen,E. Paxon Frady
Main category: cs.CV
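TL;DR: This paper proposes a visual scene understanding approach based on analysis-by-synthesis and resonator networks that captures both invariant and equivariant structure and handles arbitrary translations, multiple objects, and moving objects without data augmentation.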
Details
Motivation: Neural networks trained for invariance to transformations such as translation often lose equivariant information (e.g., an object's precise location); supervised learning alone does not naturally guarantee invariance, and generalization is poor.
Method: A generative model describes simple scenes containing MNIST digits and their transformations, such as color and position; a resonator network inverts this generative model, using a sparse feature basis set to disentangle shape from position; the modular design includes a shape module (translation factored out) and a translation module (retaining position information).
Result: Generalizes to unseen digit shapes; a classifier trained only on centered images recognizes objects at arbitrary translations; a natural attention-like mechanism analyzes multi-object scenes one object at a time; multiple moving objects can be tracked with a precision of a few pixels.
Conclusion: Combining generative modeling with resonator networks achieves strong invariance while preserving equivariance, improving generalization, data efficiency, and multi-object handling, and offers a new paradigm for scene understanding in embodied intelligence.
Abstract: In visual scene understanding tasks, it is essential to capture both invariant and equivariant structure. While neural networks are frequently trained to achieve invariance to transformations such as translation, this often comes at the cost of losing access to equivariant information - e.g., the precise location of an object. Moreover, invariance is not naturally guaranteed through supervised learning alone, and many architectures generalize poorly to input transformations not encountered during training. Here, we take an approach based on analysis-by-synthesis and factoring using resonator networks. A generative model describes the construction of simple scenes containing MNIST digits and their transformations, like color and position. The resonator network inverts the generative model, and provides both invariant and equivariant information about particular objects. Sparse features learned from training data act as a basis set to provide flexibility in representing variable shapes of objects, allowing the resonator network to handle previously unseen digit shapes from the test set. The modular structure provides a shape module which contains information about the object shape with translation factored out, allowing a simple classifier to operate on centered digits. The classification layer is trained solely on centered data, requiring much less training data, and the network as a whole can identify objects with arbitrary translations without data augmentation.
The natural attention-like mechanism of the resonator network also allows for analysis of scenes with multiple objects, where the network dynamics selects and centers only one object at a time. Further, the specific position information of a particular object can be extracted from the translation module, and we show that the resonator can be designed to track multiple moving objects with precision of a few pixels.
[69] CanViT: Toward Active-Vision Foundation Models
Yohaï-Eliel Berreby,Sabrina Du,Audrey Durand,B. Suresh Krishna
Main category: cs.CV
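TL;DR: This paper presents CanViT, the first task- and policy-agnostic Active-Vision Foundation Model (AVFM). It uses scene-relative RoPE to bind a retinotopic ViT backbone to a spatiotopic canvas workspace and introduces Canvas Attention for efficient, low-latency sequential inference. Pretrained at scale on ImageNet-21k with a label-free passive-to-active dense latent distillation scheme, it substantially outperforms existing active-vision models on ADE20K segmentation and reaches strong ImageNet classification accuracy.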
Details
Motivation: Active computer vision lacks scalable general-purpose architectures and pretraining pipelines, so Active-Vision Foundation Models (AVFMs) have remained unexplored.
Method: Proposes CanViT, which combines scene-relative RoPE, a retinotopic ViT backbone, and a spatiotopic canvas workspace, with Canvas Attention as an asymmetric cross-attention mechanism; decouples "thinking" (backbone level) from "memory" (canvas level) to reduce latency; and designs policy-agnostic passive-to-active dense latent distillation for pretraining.
Result: A frozen CanViT-B reaches 38.5% mIoU on ADE20K segmentation from a single low-resolution glimpse, beating the best active model (27.6%) with 19.5x fewer FLOPs; with additional glimpses it reaches 45.9%; it attains 81.2% top-1 accuracy on ImageNet-1k classification; and it generalizes to longer rollouts, larger scenes, and new policies.
Conclusion: CanViT is the first active-vision foundation model; it greatly narrows the gap between active and passive vision on tasks such as semantic segmentation and opens AVFMs as a new research direction.
Abstract: Active computer vision promises efficient, biologically plausible perception through sequential, localized glimpses, but lacks scalable general-purpose architectures and pretraining pipelines. As a result, Active-Vision Foundation Models (AVFMs) have remained unexplored. We introduce CanViT, the first task- and policy-agnostic AVFM. CanViT uses scene-relative RoPE to bind a retinotopic Vision Transformer backbone and a spatiotopic scene-wide latent workspace, the canvas. Efficient interaction with this high-capacity working memory is supported by Canvas Attention, a novel asymmetric cross-attention mechanism. We decouple thinking (backbone-level) and memory (canvas-level), eliminating canvas-side self-attention and fully-connected layers to achieve low-latency sequential inference and scalability to large scenes. We propose a label-free active vision pretraining scheme, policy-agnostic passive-to-active dense latent distillation: reconstructing scene-wide DINOv3 embeddings from sequences of low-resolution glimpses with randomized locations, zoom levels, and lengths. We pretrain CanViT-B from a random initialization on 13.2 million ImageNet-21k scenes -- an order of magnitude more than previous active models -- and 1 billion random glimpses, in 166 hours on a single H100.
On ADE20K segmentation, a frozen CanViT-B achieves 38.5% mIoU in a single low-resolution glimpse, outperforming the best active model's 27.6% with 19.5x fewer inference FLOPs and no fine-tuning, as well as its FLOP- or input-matched DINOv3 teacher. Given additional glimpses, CanViT-B reaches 45.9% ADE20K mIoU. On ImageNet-1k classification, CanViT-B reaches 81.2% top-1 accuracy with frozen teacher probes. CanViT generalizes to longer rollouts, larger scenes, and new policies. Our work closes the wide gap between passive and active vision on semantic segmentation and demonstrates the potential of AVFMs as a new research axis.
[70] FullCircle: Effortless 3D Reconstruction from Casual 360$^\circ$ Captures
Yalda Foroutan,Ipek Oztas,Daniel Rebain,Aysegul Dundar,Kwang Moo Yi,Lily Goli,Andrea Tagliasacchi
Main category: cs.CV
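TL;DR: This paper proposes a practical pipeline that reconstructs 3D scenes directly from raw 360° camera images, requires no special capture protocol or preprocessing, and is robust to the human operator who commonly appears in the frame. Validated on a newly built multi-tiered dataset of raw dual-fisheye captures, the method significantly outperforms existing baselines on both 360° and simulated-perspective tasks.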
Details
Motivation: Existing 360° radiance-field reconstruction methods depend on special capture protocols and preprocessing, which undermines the promise of effortless capture and reconstruction; meanwhile, the narrow field of view of conventional perspective cameras makes calibration and reconstruction difficult, and although 360° cameras offer broad coverage, a robust capture-and-go reconstruction solution has been missing.
Method: Proposes an end-to-end 360° radiance-field reconstruction pipeline that processes raw dual-fisheye images directly; introduces a robust optimization strategy against interference from the human operator; and builds a multi-tiered benchmark dataset of real 360° captures for evaluation.
Result: On the new 360° dataset, the method significantly outperforms vanilla 3DGS and robust baselines based on simulated perspective views, confirming the substantive advantage of 360° capture for casual reconstruction.
Conclusion: Raw 360° input can support efficient, robust, preprocessing-free radiance-field reconstruction, providing a viable path toward a true capture-and-reconstruct 3D workflow.
Abstract: Radiance fields have emerged as powerful tools for 3D scene reconstruction. However, casual capture remains challenging due to the narrow field of view of perspective cameras, which limits viewpoint coverage and feature correspondences necessary for reliable camera calibration and reconstruction. While commercially available 360$^\circ$ cameras offer significantly broader coverage than perspective cameras for the same capture effort, existing 360$^\circ$ reconstruction methods require special capture protocols and pre-processing steps that undermine the promise of radiance fields: effortless workflows to capture and reconstruct 3D scenes. We propose a practical pipeline for reconstructing 3D scenes directly from raw 360$^\circ$ camera captures. We require no special capture protocols or pre-processing, and exhibit robustness to a prevalent source of reconstruction errors: the human operator that is visible in all 360$^\circ$ imagery. To facilitate evaluation, we introduce a multi-tiered dataset of scenes captured as raw dual-fisheye images, establishing a benchmark for robust casual 360$^\circ$ reconstruction. Our method significantly outperforms not only vanilla 3DGS for 360$^\circ$ cameras but also robust perspective baselines when perspective cameras are simulated from the same capture, demonstrating the advantages of 360$^\circ$ capture for casual reconstruction. Additional results are available at: https://theialab.github.io/fullcircle
[71] A vision-language model and platform for temporally mapping surgery from video
Dani Kiyasseh
Main category: cs.CV
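TL;DR: This paper presents Halsted, a vision-language model trained on the large-scale surgical video dataset HSA to map surgical behaviour comprehensively and efficiently; it is made clinically accessible through the public HSA-27k dataset and an online platform, moving surgical AI toward clinical deployment and autonomous robotic surgery.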
Details
Motivation: Existing surgical AI models are narrow in scope, model behaviour incompletely, and lack clinical utility, so they are of little use to practising surgeons.
Method: Builds the large-scale, multi-specialty Halsted Surgical Atlas (HSA), grown through an iterative self-labelling framework to more than 650,000 videos; trains the Halsted vision-language model on it; and releases the public subset HSA-27k along with a web platform for surgeons.
Result: Halsted surpasses the previous SOTA models in mapping surgical activity while being more comprehensive and computationally efficient; HSA-27k is publicly released; the Halsted platform is live and lets surgeons worldwide automatically analyse their own surgical videos within minutes.
Conclusion: By standardizing surgical video data and providing tools directly to users, this work substantially narrows the translational gap of surgical AI and lays groundwork for clinical deployment and future autonomous robotic surgery.
Abstract: Mapping surgery is fundamental to developing operative guidelines and enabling autonomous robotic surgery. Recent advances in artificial intelligence (AI) have shown promise in mapping the behaviour of surgeons from videos, yet current models remain narrow in scope, capturing limited behavioural components within single procedures, and offer limited translational value, as they remain inaccessible to practising surgeons. Here we introduce Halsted, a vision-language model trained on the Halsted Surgical Atlas (HSA), one of the most comprehensive annotated video libraries grown through an iterative self-labelling framework and encompassing over 650,000 videos across eight surgical specialties. To facilitate benchmarking, we publicly release HSA-27k, a subset of the Halsted Surgical Atlas. Halsted surpasses previous state-of-the-art models in mapping surgical activity while offering greater comprehensiveness and computational efficiency. To bridge the longstanding translational gap of surgical AI, we develop the Halsted web platform (https://halstedhealth.ai/) to provide surgeons anywhere in the world with the previously-unavailable capability of automatically mapping their own procedures within minutes. By standardizing unstructured surgical video data and making these capabilities directly accessible to surgeons, our work brings surgical AI closer to clinical deployment and helps pave the way toward autonomous robotic surgery.
[72] Language Models Can Explain Visual Features via Steering
Javier Ferrando,Enrique Lopez-Cuena,Pablo Agustin Martin-Torres,Daniel Hinjos,Anna Arias-Duart,Dario Garcia-Gasulla
Main category: cs.CV
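TL;DR: This paper proposes Steering, a new method for explaining sparse autoencoder (SAE) features through causal intervention. Exploiting the structure of vision-language models (VLMs), it steers a single SAE feature on an empty image and prompts the language model to generate an explanation, yielding scalable, high-quality automated explanations of visual concepts. It also proposes Steering-informed Top-k, a hybrid that combines causal intervention with input examples.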
Details
Motivation: Sparse autoencoders uncover large numbers of visual features, but there is no automatic way to explain them without human intervention; existing explanations based on correlations with input samples have limitations, calling for a more principled, scalable paradigm.
Method: Steering applies a causal intervention to a single SAE feature in the vision encoder (activating the feature given an empty image) and then uses the VLM's language decoder to produce a natural-language explanation of the visual concept the feature represents; the Steering-informed Top-k hybrid further combines causal intervention with the traditional top-k input-example approach.
Result: Steering produces high-quality feature explanations at scale, and explanation quality improves consistently with language-model size; Steering-informed Top-k achieves state-of-the-art explanation quality without additional computational cost.
Conclusion: Steering adds a new axis to vision-model interpretability, showing that causal intervention combined with VLMs is an effective route to automated feature explanation, and the hybrid strategy moves it further toward practical use.
Abstract: Sparse Autoencoders uncover thousands of features in vision models, yet explaining these features without requiring human intervention remains an open challenge. While previous work has proposed generating correlation-based explanations based on top activating input examples, we present a fundamentally different alternative based on causal interventions. We leverage the structure of Vision-Language Models and steer individual SAE features in the vision encoder after providing an empty image. Then, we prompt the language model to explain what it ``sees'', effectively eliciting the visual concept represented by each feature. Results show that Steering offers a scalable alternative that complements traditional approaches based on input examples, serving as a new axis for automated interpretability in vision models. Moreover, the quality of explanations improves consistently with the scale of the language model, highlighting our method as a promising direction for future research. Finally, we propose Steering-informed Top-k, a hybrid approach that combines the strengths of causal interventions and input-based approaches to achieve state-of-the-art explanation quality without additional computational cost.
[73] TrajLoom: Dense Future Trajectory Generation from Video
Zewei Zhang,Jia Jun Cheng Xian,Kaiwen Liu,Ming Liang,Hang Chu,Jun Chen,Renjie Liao
Main category: cs.CV
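TL;DR: This paper proposes the TrajLoom framework, which uses three components (Grid-Anchor Offset Encoding, TrajLoom-VAE, and TrajLoom-Flow) to predict the future evolution of dense point trajectories from past trajectories and video context, and introduces the unified benchmark TrajLoomBench. Compared with SOTA methods, it extends the prediction horizon from 24 to 81 frames while markedly improving motion realism and stability.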
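Grid-Anchor Offset Encoding is described in this entry only as representing each point as an offset from its pixel-center anchor. The sketch below illustrates that idea under assumptions that are not from the paper: one trajectory per grid cell, coordinates stored as (x, y) in pixels, and anchors placed at (column + 0.5, row + 0.5). All function names are hypothetical.

```python
import numpy as np

def pixel_center_anchors(height, width):
    """Return an (H, W, 2) array of (x, y) pixel-center coordinates."""
    ys, xs = np.meshgrid(np.arange(height) + 0.5,
                         np.arange(width) + 0.5, indexing="ij")
    return np.stack([xs, ys], axis=-1)

def encode_offsets(trajectories, anchors):
    """Turn absolute positions (T, H, W, 2) into offsets from each anchor."""
    return trajectories - anchors[None]

def decode_offsets(offsets, anchors):
    """Invert encode_offsets by adding the anchors back."""
    return offsets + anchors[None]

# Toy example: every point drifts one pixel to the right per frame.
T, H, W = 4, 2, 3
anchors = pixel_center_anchors(H, W)
motion = np.arange(T, dtype=float)[:, None, None, None] * np.array([1.0, 0.0])
trajectories = anchors[None] + motion
offsets = encode_offsets(trajectories, anchors)
restored = decode_offsets(offsets, anchors)
```

In this toy case every grid cell gets the same offset sequence regardless of where it sits in the image, which is the sense in which an offset representation removes location-dependent bias from the values a model has to predict.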
Details
Motivation: Dense point trajectories are a compact, expressive motion representation, but modeling their future evolution from observed video remains challenging, and a unified, standardized evaluation benchmark is lacking.
Method: A three-module framework: (1) Grid-Anchor Offset Encoding, which encodes each point as an offset from its pixel-center anchor to reduce location-dependent bias; (2) TrajLoom-VAE, which learns a compact spatiotemporal latent space for trajectories via masked reconstruction and a spatiotemporal consistency regularizer; and (3) TrajLoom-Flow, which generates future trajectories in latent space via flow matching, with boundary cues and on-policy K-step fine-tuning for stable sampling. The unified benchmark TrajLoomBench is also built.
Result: Significantly outperforms SOTA on both real and synthetic videos, extending the prediction horizon from 24 to 81 frames with more realistic and stable motion; the predicted trajectories can be used directly for downstream video generation and editing.
Conclusion: TrajLoom provides an efficient, robust, and scalable paradigm for dense trajectory prediction, advancing video understanding and controllable generation, and supports community research through released code, models, and datasets.
Abstract: Predicting future motion is crucial in video understanding and controllable video generation. Dense point trajectories are a compact, expressive motion representation, but modeling their future evolution from observed video remains challenging. We propose a framework that predicts future trajectories and visibility from past trajectories and video context. Our method has three components: (1) Grid-Anchor Offset Encoding, which reduces location-dependent bias by representing each point as an offset from its pixel-center anchor; (2) TrajLoom-VAE, which learns a compact spatiotemporal latent space for dense trajectories with masked reconstruction and a spatiotemporal consistency regularizer; and (3) TrajLoom-Flow, which generates future trajectories in latent space via flow matching, with boundary cues and on-policy K-step fine-tuning for stable sampling. We also introduce TrajLoomBench, a unified benchmark spanning real and synthetic videos with a standardized setup aligned with video-generation benchmarks. Compared with state-of-the-art methods, our approach extends the prediction horizon from 24 to 81 frames while improving motion realism and stability across datasets. The predicted trajectories directly support downstream video generation and editing. Code, model checkpoints, and datasets are available at https://trajloom.github.io/.
[74] Dress-ED: Instruction-Guided Editing for Virtual Try-On and Try-Off
Fulvio Sanguigni,Davide Lobba,Bin Ren,Marcella Cornia,Nicu Sebe,Rita Cucchiara
Main category: cs.CV
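TL;DR: This paper introduces Dress-ED, the first large-scale instruction-driven garment editing dataset, covering virtual try-on (VTON), virtual try-off (VTOFF), and text-guided garment editing, and builds a unified multimodal diffusion model as a baseline.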
Details
Motivation: Existing virtual try-on/try-off datasets lack instruction-driven, controllable, interactive editing, which limits the flexibility and practicality of fashion generation.
Method: Builds a fully automated multimodal pipeline (MLLM-based garment understanding, diffusion-based editing, LLM-guided verification) to generate Dress-ED, with over 146k quadruplets (in-shop garment image, person image wearing the garment, edited counterparts, natural-language instruction); and proposes a unified multimodal diffusion framework that jointly models language instructions and visual garment cues.
Result: Presents Dress-ED, the first unified benchmark covering VTON, VTOFF, and text-guided editing (146k samples, 3 garment categories, 7 edit types), together with a multimodal diffusion baseline that supports instruction-driven editing.
Conclusion: Dress-ED fills the data gap for instruction-driven fashion editing, and the proposed framework provides a new paradigm and a solid foundation for controllable, interactive virtual fashion generation.
Abstract: Recent advances in Virtual Try-On (VTON) and Virtual Try-Off (VTOFF) have greatly improved photo-realistic fashion synthesis and garment reconstruction. However, existing datasets remain static, lacking instruction-driven editing for controllable and interactive fashion generation. In this work, we introduce the Dress Editing Dataset (Dress-ED), the first large-scale benchmark that unifies VTON, VTOFF, and text-guided garment editing within a single framework. Each sample in Dress-ED includes an in-shop garment image, the corresponding person image wearing the garment, their edited counterparts, and a natural-language instruction of the desired modification. Built through a fully automated multimodal pipeline that integrates MLLM-based garment understanding, diffusion-based editing, and LLM-guided verification, Dress-ED comprises over 146k verified quadruplets spanning three garment categories and seven edit types, including both appearance (e.g., color, pattern, material) and structural (e.g., sleeve length, neckline) modifications. Based on this benchmark, we further propose a unified multimodal diffusion framework that jointly reasons over linguistic instructions and visual garment cues, serving as a strong baseline for instruction-driven VTON and VTOFF. Dataset and code will be made publicly available.
[75] A Vision Language Model for Generating Procedural Plant Architecture Representations from Simulated Images
Heesup Yun,Isaac Kazuo Uyehara,Ioannis Droutsas,Earl Ranario,Christine H. Diepenbrock,Brian N. Bailey,J. Mason Earles
Main category: cs.CV
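TL;DR: This paper proposes a new algorithm that generates 3D plant architecture from an image, using a vision-language model (VLM) to extract organ-level structural parameters from synthetic images for procedural plant modeling.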
Details
Motivation: Measuring plant architectural parameters and nested structures at field scale with traditional methods is labor-intensive and costly, calling for an automated, low-cost alternative.
Method: Builds a dedicated tokenizer for the XML definition of plant architecture, converting structural information into token sequences a language model can predict; trains a vision-language model end to end using only synthetic cowpea images generated by Helios and their known ground-truth architecture.
Result: Token F1 reaches 0.73 under teacher-forced training; autoregressive generation achieves a BLEU-4 of 94.00% and a ROUGE-L of 0.5182, supporting the feasibility of recovering procedural architecture from images.
Conclusion: Shows that organ-level geometric and topological parameters can be extracted and modeled from images alone, laying the groundwork for extension to real imagery.
Abstract: Three-dimensional (3D) procedural plant architecture models have emerged as an important tool for simulation-based studies of plant structure and function, extracting plant architectural parameters from field measurements, and for generating realistic plants in computer graphics. However, measuring the architectural parameters and nested structures for these models at field scale remains prohibitively labor-intensive. We present a novel algorithm that generates a 3D plant architecture from an image, creating a functional structural plant model that reflects organ-level geometric and topological parameters and provides a more comprehensive representation of the plant's architecture. Instead of using 3D sensors or processing multi-view images with computer vision to obtain the 3D structure of plants, we propose a method that generates token sequences that encode a procedural definition of plant architecture. This work used only synthetic images for training and testing, with exact architectural parameters known, allowing testing of the hypothesis that organ-level architectural parameters could be extracted from image data using a vision-language model (VLM). A synthetic dataset of cowpea plant images was generated using the Helios 3D plant simulator, with the detailed plant architecture encoded in XML files. We developed a plant architecture tokenizer for the XML file defining plant architecture, converting it into a token sequence that a language model can predict. The model achieved a token F1 score of 0.73 during teacher-forced training.
Evaluation of the model was performed through autoregressive generation, achieving a BLEU-4 score of 94.00% and a ROUGE-L score of 0.5182. This led to the conclusion that such plant architecture model generation and parameter extraction were possible from synthetic images; thus, future work will extend the approach to real imagery data.
[76] To Agree or To Be Right? The Grounding-Sycophancy Tradeoff in Medical Vision-Language Models
OFM Riaz Rahman Aranya,Kevin Desai
Main category: cs.CV
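TL;DR: This paper studies the robustness of medical vision-language models (VLMs) to two critical failure modes, hallucination and sycophancy, and finds a tradeoff between them. It proposes three new evaluation metrics (L-VASE, CCS, CSI) and shows empirically that none of the evaluated 7-8B-parameter VLMs reaches a clinically usable level of safety.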
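This entry describes the Clinical Safety Index (CSI) as a geometric mean of grounding, autonomy, and calibration, but gives no exact formula. The sketch below assumes each component is already a score in [0, 1] and that the three are weighted equally; the function name, the range check, and the example values are illustrative assumptions, not the paper's definition.

```python
def clinical_safety_index(grounding, autonomy, calibration):
    """Equal-weight geometric mean of three component scores in [0, 1]."""
    for value in (grounding, autonomy, calibration):
        if not 0.0 <= value <= 1.0:
            raise ValueError("component scores are assumed to lie in [0, 1]")
    return (grounding * autonomy * calibration) ** (1.0 / 3.0)

# A model that is balanced across the three components ...
balanced = clinical_safety_index(0.6, 0.6, 0.6)
# ... versus one that is well grounded but gives in under pressure.
lopsided = clinical_safety_index(0.9, 0.1, 0.8)
# The arithmetic mean of the lopsided scores, for comparison.
arithmetic = (0.9 + 0.1 + 0.8) / 3
```

A geometric mean drops toward zero when any one component is weak, whereas an arithmetic mean lets a strong component compensate. This fits the entry's point that a model must be both well grounded and resistant to social pressure to score well.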
Details
Motivation: Medical VLMs perform well on visual question answering, but their robustness to the two failure modes of hallucination and sycophancy has not been systematically evaluated, especially when the two co-occur, which hinders clinical deployment.
Method: Evaluates six VLMs (three general-purpose, three medical-specialist) on three medical VQA datasets and proposes three new metrics: L-VASE (a logit-space reformulation of VASE that avoids double normalization), CCS (a confidence-calibrated sycophancy score), and CSI (a safety index that combines grounding, autonomy, and calibration via a geometric mean).
Result: Reveals a grounding-sycophancy tradeoff: the models with the lowest hallucination propensity are the most sycophantic, while the most pressure-resistant model hallucinates more than all medical-specialist models; across all 1,151 test cases, no model achieves a CSI above 0.35.
Conclusion: Optimizing a single metric (such as reducing hallucination) is not enough to ensure clinical safety; hallucination and sycophancy must be evaluated jointly, and the evaluated 7-8B-parameter VLMs are not ready for clinical deployment.
Abstract: Vision-language models (VLMs) adapted to the medical domain have shown strong performance on visual question answering benchmarks, yet their robustness against two critical failure modes, hallucination and sycophancy, remains poorly understood, particularly in combination. We evaluate six VLMs (three general-purpose, three medical-specialist) on three medical VQA datasets and uncover a grounding-sycophancy tradeoff: models with the lowest hallucination propensity are the most sycophantic, while the most pressure-resistant model hallucinates more than all medical-specialist models. To characterize this tradeoff, we propose three metrics: L-VASE, a logit-space reformulation of VASE that avoids its double-normalization; CCS, a confidence-calibrated sycophancy score that penalizes high-confidence capitulation; and Clinical Safety Index (CSI), a unified safety index that combines grounding, autonomy, and calibration via a geometric mean. Across 1,151 test cases, no model achieves a CSI above 0.35, indicating that none of the evaluated 7-8B parameter VLMs is simultaneously well-grounded and robust to social pressure. Our findings suggest that joint evaluation of both properties is necessary before these models can be considered for clinical use. Code is available at https://github.com/UTSA-VIRLab/AgreeOrRight
[77] Toward Faithful Segmentation Attribution via Benchmarking and Dual-Evidence Fusion
Abu Noman Md Sakib,OFM Riaz Rahman Aranya,Kevin Desai,Zijie Zhang
Main category: cs.CV
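TL;DR: This paper proposes a reproducible benchmarking framework for attribution maps in semantic segmentation, together with Dual-Evidence Attribution (DEA), which improves deletion-based faithfulness over gradient-only baselines while remaining robust, and reveals a faithfulness-stability tradeoff that visual evaluation cannot detect.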
Details
Motivation: Attribution maps for semantic segmentation are mostly judged by visual plausibility, with no quantitative evaluation of key properties such as whether the attribution truly drives the prediction or leaks outside the target region; a systematic, reproducible evaluation protocol is needed.
Method: Builds a multi-dimensional benchmark covering intervention-based faithfulness, off-target leakage, perturbation robustness, and runtime; proposes Dual-Evidence Attribution (DEA), which fuses gradient evidence with region-level intervention signals through agreement-weighted fusion.
Result: Across all completed runs, DEA consistently improves deletion-based faithfulness and remains strongly robust, at the cost of additional compute; the benchmark exposes a faithfulness-stability tradeoff among attribution families that visual evaluation hides.
Conclusion: The benchmark provides a principled basis for evaluating explainability in semantic segmentation, and DEA demonstrates the value of fusing multiple sources of evidence, moving attribution methods from "looks plausible" toward "actually reliable".
Abstract: Attribution maps for semantic segmentation are almost always judged by visual plausibility. Yet looking convincing does not guarantee that the highlighted pixels actually drive the model's prediction, nor that attribution credit stays within the target region. These questions require a dedicated evaluation protocol. We introduce a reproducible benchmark that tests intervention-based faithfulness, off-target leakage, perturbation robustness, and runtime on Pascal VOC and SBD across three pretrained backbones. To further demonstrate the benchmark, we propose Dual-Evidence Attribution (DEA), a lightweight correction that fuses gradient evidence with region-level intervention signals through agreement-weighted fusion. DEA increases emphasis where both sources agree and retains causal support when gradient responses are unstable. Across all completed runs, DEA consistently improves deletion-based faithfulness over gradient-only baselines and preserves strong robustness, at the cost of additional compute from intervention passes. The benchmark exposes a faithfulness-stability tradeoff among attribution families that is entirely hidden under visual evaluation, providing a foundation for principled method selection in segmentation explainability. Code is available at https://github.com/anmspro/DEA.
[78] PIVM: Diffusion-Based Prior-Integrated Variation Modeling for Anatomically Precise Abdominal CT Synthesis
Dinglun He,Baoming Zhang,Xu Wang,Yao Hao,Deshan Yang,Ye Duan
Main category: cs.CV
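TL;DR: This paper proposes PIVM, a diffusion-based framework for abdominal CT image synthesis that combines organ-specific intensity priors with segmentation labels and predicts voxel-wise intensity variations directly in image space, producing synthesis that is anatomically accurate, preserves the full HU range, and retains fine texture detail.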
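The core idea in this entry is that the model predicts a voxel-wise variation on top of an organ-specific intensity prior rather than a whole image. The sketch below shows only that composition step. It assumes the prior is a single constant HU value per label, which the entry does not confirm, and it replaces the diffusion model's output with a hand-written variation array. The label-to-HU table and all names are hypothetical.

```python
import numpy as np

# Hypothetical per-label prior intensities in HU (illustrative values only).
PRIOR_HU = {0: -1000.0, 1: 60.0, 2: 30.0}

def prior_image(labels, prior_hu):
    """Paint each labelled region with its prior intensity."""
    prior = np.zeros(labels.shape, dtype=np.float32)
    for label, hu in prior_hu.items():
        prior[labels == label] = hu
    return prior

def compose(labels, variation, prior_hu=PRIOR_HU):
    """Final image = label-derived prior + predicted voxel-wise variation."""
    return prior_image(labels, prior_hu) + variation

labels = np.array([[0, 1], [2, 1]])
# Stand-in for the variation a trained diffusion model would predict.
variation = np.array([[0.0, 5.0], [-3.0, -5.0]], dtype=np.float32)
image = compose(labels, variation)
```

Because the prior already places every region at a plausible intensity, the generative model only has to account for texture-scale deviations, and the sum stays in HU rather than in a normalized or latent range.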
Details
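Motivation: Abdominal CT data are scarce because of high annotation costs and privacy constraints, which hinders the development of segmentation and diagnostic models.
Method: Proposes the Prior-Integrated Variation Modeling (PIVM) framework: rather than generating a full image from noise, it predicts voxel-wise intensity variations relative to intensity priors derived from organ segmentation labels; the priors and labels jointly guide the diffusion process to ensure spatial alignment and realistic boundaries; the whole process operates in image space and preserves the full HU range.
Result: Synthesizes CT images with high anatomical accuracy, the original HU value distribution, and fine textures without smoothing artifacts, improving downstream task performance with limited labeled data.
Conclusion: PIVM offers an efficient, privacy-friendly, and anatomically credible synthesis paradigm for low-resource medical imaging, achieving both realism and structural fidelity without latent-space compression.
Abstract: Abdominal CT data are limited by high annotation costs and privacy constraints, which hinder the development of robust segmentation and diagnostic models. We present a Prior-Integrated Variation Modeling (PIVM) framework, a diffusion-based method for anatomically accurate CT image synthesis. Instead of generating full images from noise, PIVM predicts voxel-wise intensity variations relative to organ-specific intensity priors derived from segmentation labels. These priors and labels jointly guide the diffusion process, ensuring spatial alignment and realistic organ boundaries. Unlike latent-space diffusion models, our approach operates directly in image space while preserving the full Hounsfield Unit (HU) range, capturing fine anatomical textures without smoothing. Source code is available at https://github.com/BZNR3/PIVM.
[79] CAM3R: Camera-Agnostic Model for 3D Reconstruction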
Namitha Guruprasad,Abhay Yadav,Cheng Peng,Rama Chellappa
Main category: cs.CV
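TL;DR: This paper proposes CAM3R, a feed-forward 3D reconstruction model that needs no camera calibration and handles wide-angle imagery (e.g., fisheye, panorama). A Ray Module and a Cross-view Module jointly estimate ray directions and radial distance, and a Ray-Aware global alignment framework refines pose and scale, achieving SOTA performance on datasets spanning multiple camera models.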
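The entry pairs per-pixel ray directions with a radial distance instead of a pinhole-style depth. The sketch below shows how a pointmap follows from those two quantities. It is a generic geometric illustration, not CAM3R's code; the function name and toy values are assumptions.

```python
import numpy as np

def pointmap_from_rays(ray_dirs, radial_dist):
    """3D points = unit ray direction * distance along the ray.

    ray_dirs:    (H, W, 3) ray directions, not necessarily normalized
    radial_dist: (H, W) distances from the camera center along each ray
    """
    unit = ray_dirs / np.linalg.norm(ray_dirs, axis=-1, keepdims=True)
    return unit * radial_dist[..., None]

# Toy 1x3 "image"; the last ray points behind the image plane (negative z),
# which a wide-angle lens can observe but a pinhole z-depth cannot express.
rays = np.array([[[0.0, 0.0, 2.0], [3.0, 0.0, 4.0], [1.0, 0.0, -1.0]]])
dist = np.array([[2.0, 5.0, 1.0]])
points = pointmap_from_rays(rays, dist)
```

Radial distance stays well defined for any viewing direction, including rays at or beyond 90 degrees from the optical axis, which is one reason a ray-based parameterization suits fisheye and panoramic imagery.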
Details
Motivation: 3D reconstruction models trained on perspective images implicitly assume a pinhole camera and suffer severe geometric degradation on images from non-rectilinear optics (e.g., fisheye, panorama); a camera-agnostic reconstruction method is needed.
Method: CAM3R comprises a two-view network (a Ray Module that estimates per-pixel ray directions, and a Cross-view Module that predicts radial distance, confidence maps, pointmaps, and relative pose) and a Ray-Aware Global Alignment framework for pose refinement and scale optimization that preserves local geometric consistency.
Result: Experiments on datasets with various camera models, including panorama, fisheye, and pinhole imagery, show that CAM3R sets a new SOTA in both pose estimation and 3D reconstruction.
Conclusion: CAM3R achieves high-quality 3D reconstruction that works across camera models without prior calibration, validating the effectiveness of ray modeling and global geometric consistency constraints.
Abstract: Recovering dense 3D geometry from unposed images remains a foundational challenge in computer vision. Current state-of-the-art models are predominantly trained on perspective datasets, which implicitly constrains them to a standard pinhole camera geometry. As a result, these models suffer from significant geometric degradation when applied to wide-angle imagery captured via non-rectilinear optics, such as fisheye or panoramic sensors. To address this, we present CAM3R, a Camera-Agnostic, feed-forward Model for 3D Reconstruction capable of processing images from wide-angle camera models without prior calibration. Our framework consists of a two-view network which is bifurcated into a Ray Module (RM) to estimate per-pixel ray directions and a Cross-view Module (CVM) to infer radial distance with confidence maps, pointmaps, and relative poses. To unify these pairwise predictions into a consistent 3D scene, we introduce a Ray-Aware Global Alignment framework for pose refinement and scale optimization while strictly preserving the predicted local geometry. Extensive experiments on various camera model datasets, including panorama, fisheye and pinhole imagery, demonstrate that CAM3R establishes a new state-of-the-art in pose estimation and reconstruction.
[80] Q-Tacit: Image Quality Assessment via Latent Visual Reasoning
Yuxuan Jiang,Yixuan Li,Hanwei Zhu,Siyue Teng,Fan Zhang,David Bull
Main category: cs.CV
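TL;DR: This paper proposes Q-Tacit, a new paradigm in which vision-language models reason in a latent quality space rather than only in natural language. By injecting structured visual quality priors and calibrating latent reasoning trajectories, it substantially reduces token usage while achieving strong image quality assessment performance.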
Details
Motivation: Natural-language chain-of-thought methods cannot fully express the visual cues relevant to image quality, because of the gap between discrete text tokens and the quality perception space. This limits reasoning on visually intensive image quality assessment tasks.
Method: Proposes the two-stage Q-Tacit approach: (i) inject structural visual quality priors into the VLM's latent space; (ii) calibrate latent reasoning trajectories to strengthen quality assessment.
Result: Q-Tacit performs quality reasoning with significantly fewer tokens than previous reasoning-based methods while achieving strong overall performance.
Conclusion: Natural language is not the only compact representation suited to image quality reasoning. Latent-space reasoning offers a viable new route for image quality assessment and broadens research on visual quality modeling and reasoning.
Abstract: Vision-Language Model (VLM)-based image quality assessment (IQA) has been significantly advanced by incorporating Chain-of-Thought (CoT) reasoning. Recent work has refined image quality reasoning by applying reinforcement learning (RL) and leveraging active visual tools. However, such strategies are typically language-centric, with visual information being treated as static preconditions. Quality-related visual cues often cannot be abstracted into text in extenso due to the gap between discrete textual tokens and quality perception space, which in turn restricts the reasoning effectiveness for visually intensive IQA tasks. In this paper, we revisit this by asking the question, "Is natural language the ideal space for quality reasoning?" and, as a consequence, we propose Q-Tacit, a new paradigm that elicits VLMs to reason beyond natural language in the latent quality space. Our approach follows a synergistic two-stage process: (i) injecting structural visual quality priors into the latent space, and (ii) calibrating latent reasoning trajectories to improve quality assessment ability. Extensive experiments demonstrate that Q-Tacit can effectively perform quality reasoning with significantly fewer tokens than previous reasoning-based methods, while achieving strong overall performance. This paper validates the proposition that language is not the only compact representation suitable for visual quality, opening possibilities for further exploration of effective latent reasoning paradigms for IQA. Source code will be released to support future research.
[81] Pretext Matters: An Empirical Study of SSL Methods in Medical Imaging
Vedrana Ivezić,Mara Pleasure,Ashwath Radhachandran,Saarang Panchavati,Shreeram Athreya,Vivek Sant,Benjamin Emert,Gregory Fishbein,Corey Arnold,William Speier
Main category: cs.CV
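TL;DR: This paper systematically studies how well self-supervised learning (SSL) methods suit medical imaging. Joint embedding architectures (JEAs) are better suited to modalities with spatially localized signal (e.g., histopathology), whereas joint embedding predictive architectures (JEPAs) are better suited to modalities dominated by global structure (e.g., liver ultrasound). Clinical experts validated the clinical relevance of the learned features.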
Details
Motivation: SSL methods are widely used in medical imaging, but no prior work has systematically examined which SSL objective best matches the spatial organization of clinically relevant signal.
Method: Selects two representative imaging modalities with distinct noise profiles, ultrasound (global anatomical structure) and histopathology (localized signal), compares the representations learned by JEAs and JEPAs, and has radiologists and pathologists independently assess the clinical relevance of the learned features.
Result: JEAs perform better on histopathology, where their view-invariance objective suits localized signal, and JEPAs perform better on liver ultrasound, where diagnostically relevant information lies in the global structure of macroscopic anatomy. The difference is independently validated by board-certified clinical experts.
Conclusion: The choice of SSL objective should follow the structural and noise properties of the imaging modality. The paper provides a framework for matching SSL objectives to those properties.
Abstract: Though self-supervised learning (SSL) has demonstrated incredible ability to learn robust representations from unlabeled data, the choice of optimal SSL strategy can lead to vastly different performance outcomes in specialized domains. Joint embedding architectures (JEAs) and joint embedding predictive architectures (JEPAs) have shown robustness to noise and strong semantic feature learning compared to pixel reconstruction-based SSL methods, leading to widespread adoption in medical imaging. However, no prior work has systematically investigated which SSL objective is better aligned with the spatial organization of clinically relevant signal. In this work, we empirically investigate how the choice of SSL method impacts the learned representations in medical imaging. We select two representative imaging modalities characterized by unique noise profiles: ultrasound and histopathology. When informative signal is spatially localized, as in histopathology, JEAs are more effective due to their view-invariance objective. In contrast, when diagnostically relevant information is globally structured, such as the macroscopic anatomy present in liver ultrasounds, JEPAs are optimal. These differences are especially evident in the clinical relevance of the learned features, as independently validated by board-certified radiologists and pathologists. Together, our results provide a framework for matching SSL objectives to the structural and noise properties of medical imaging modalities.
[82] MAGICIAN: Efficient Long-Term Planning with Imagined Gaussians for Active Mapping
Shiyao Li,Antoine Guédon,Shizhe Chen,Vincent Lepetit
Main category: cs.CV
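TL;DR: This paper proposes MAGICIAN, a framework that markedly improves the efficiency and completeness of active mapping through the Imagined Gaussians scene representation and long-horizon planning.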
Details
Motivation: Most existing active mapping methods rely on greedy next-best-view prediction, which leads to inefficient exploration and incomplete reconstruction.
Method: Generates an Imagined Gaussians representation from a pre-trained occupancy network, computes coverage gain with fast volumetric rendering, and embeds this in a tree-search algorithm for long-horizon planning. The representation is updated and the trajectory refined in a closed loop.
Result: Achieves SOTA performance on multiple indoor and outdoor benchmarks, demonstrating the key advantage of long-horizon planning for active mapping.
Conclusion: Long-horizon planning combined with a scene representation that carries strong structural priors overcomes the limits of greedy strategies and improves overall active mapping performance.
Abstract: Active mapping aims to determine how an agent should move to efficiently reconstruct an unknown environment. Most existing approaches rely on greedy next-best-view prediction, resulting in inefficient exploration and incomplete scene reconstruction. To address this limitation, we introduce MAGICIAN, a novel long-term planning framework that maximizes accumulated surface coverage gain through Imagined Gaussians, a scene representation derived from a pre-trained occupancy network with strong structural priors. This representation enables efficient computation of coverage gain for any novel viewpoint via fast volumetric rendering, allowing its integration into a tree-search algorithm for long-horizon planning. We update Imagined Gaussians and refine the planned trajectory in a closed-loop manner. Our method achieves state-of-the-art performance across indoor and outdoor benchmarks with varying action spaces, demonstrating the critical advantage of long-term planning in active mapping.
[83] Large-Scale Avalanche Mapping from SAR Images with Deep Learning-based Change Detection
Mattia Gatti,Alberto Mariani,Ignazio Gallo,Fabiano Monti
Main category: cs.CV
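TL;DR: This study proposes a bi-temporal change detection method based on Sentinel-1 SAR imagery for large-scale avalanche mapping. It achieves high F1/F2 scores across multiple alpine ecoregions and releases an annotated multi-region dataset as a benchmark for SAR-based avalanche mapping.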
Details
Motivation: Accurately detecting change in satellite imagery is essential for monitoring rapid mass-movement hazards such as avalanches, which are growing in frequency and intensity, in order to protect human life, infrastructure, and ecosystems.
Method: Performs end-to-end unimodal change detection on bi-temporal Sentinel-1 synthetic aperture radar (SAR) imagery, using only pre- and post-event SAR images and no other modalities.
Result: The conservative (F1-optimized) configuration reaches an F1-score of 0.8061. The recall-oriented (F2-optimized) configuration reaches an F2-score of 0.8414 with an 80.36% avalanche-polygon hit rate. The results expose the trade-off between precision and completeness and show that threshold adjustment helps detect smaller or marginal avalanches.
Conclusion: Unimodal bi-temporal SAR change detection gives the most consistent performance, and the released annotated multi-region dataset provides a reproducible benchmark for SAR-based avalanche mapping.
Abstract: Accurate change detection from satellite imagery is essential for monitoring rapid mass-movement hazards such as snow avalanches, which increasingly threaten human life, infrastructure, and ecosystems due to their rising frequency and intensity. This study presents a systematic investigation of large-scale avalanche mapping through bi-temporal change detection using Sentinel-1 synthetic aperture radar (SAR) imagery. Extensive experiments across multiple alpine ecoregions with manually validated avalanche inventories show that treating the task as a unimodal change detection problem, relying solely on pre- and post-event SAR images, achieves the most consistent performance. The proposed end-to-end pipeline achieves an F1-score of 0.8061 in a conservative (F1-optimized) configuration and attains an F2-score of 0.8414 with 80.36% avalanche-polygon hit rate under a less conservative, recall-oriented (F2-optimized) tuning. These results highlight the trade-off between precision and completeness and demonstrate how threshold adjustment can improve the detection of smaller or marginal avalanches. The release of the annotated multi-region dataset establishes a reproducible benchmark for SAR-based avalanche mapping.
[84] GeoTikzBridge: Advancing Multimodal Code Generation for Geometric Perception and Reasoning
Jiayin Sun,Caixia Sun,Boyu Yang,Hailin Li,Xiao Chen,Yi Zhang,Errui Ding,Liang Li,Chao Deng,Junlan Feng
Main category: cs.CV
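TL;DR: This paper proposes GeoTikzBridge, a framework that strengthens the fine-grained geometric perception and visual reasoning of multimodal large language models (MLLMs) through tikz code generation. It builds two models with accompanying datasets and reaches SOTA performance among open-source MLLMs.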
Details
Motivation: Existing MLLMs struggle to perceive fine-grained geometric structures, which limits their geometric understanding and visual reasoning.
Method: Proposes the GeoTikzBridge framework with two models. GeoTikzBridge-Base is trained on the GeoTikz-Base dataset of 2.5M image-to-tikz pairs, built with iterative data expansion and a localized geometric transformation strategy. GeoTikzBridge-Instruct is fine-tuned on GeoTikz-Instruct, the first instruction-augmented tikz dataset, and supports geometric visual reasoning.
Result: The models achieve SOTA performance among open-source MLLMs and can serve as plug-and-play reasoning modules that improve the geometric problem-solving of any MLLM or LLM.
Conclusion: Tikz code generation is an effective way to improve the geometric perception and reasoning of MLLMs. GeoTikzBridge offers a new paradigm and high-quality open-source resources for multimodal geometric understanding.
Abstract: Multimodal Large Language Models (MLLMs) have recently demonstrated remarkable perceptual and reasoning abilities. However, they struggle to perceive fine-grained geometric structures, which constrains their geometric understanding and visual reasoning. To address this, we propose GeoTikzBridge, a framework that enhances local geometric perception and visual reasoning through tikz-based code generation. Within this framework, we build two models supported by two complementary datasets. The GeoTikzBridge-Base model is trained on the GeoTikz-Base dataset, the largest image-to-tikz dataset to date with 2.5M pairs (16× larger than existing open-sourced datasets). This process is achieved via iterative data expansion and a localized geometric transformation strategy. Subsequently, GeoTikzBridge-Instruct is fine-tuned on the GeoTikz-Instruct dataset, which is the first instruction-augmented tikz dataset supporting visual reasoning. Extensive experimental results demonstrate that our models achieve state-of-the-art performance among open-sourced MLLMs. Furthermore, GeoTikzBridge models can serve as plug-and-play reasoning modules for any MLLM (or LLM), enhancing reasoning performance in geometric problem-solving. Datasets and code are publicly available at: https://github.com/sjy-1995/GeoTikzBridge-Advancing-Multimodal-Code-Generation-for-Geometric-Perception-and-Reasoning.
[85] Think 360°: Evaluating the Width-centric Reasoning Capability of MLLMs Beyond Depth
Mingrui Chen,Hexiong Yang,Haogeng Liu,Huaibo Huang,Ran He
Main category: cs.CV
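TL;DR: This paper presents a holistic multimodal benchmark that evaluates the reasoning width of multimodal large language models (MLLMs) alongside their reasoning depth, together with a fine-grained tree-of-thought evaluation protocol. Evaluating more than 30 mainstream MLLMs on 1200+ multimodal cases shows that current models still face a significant bottleneck in combining deep chain-style reasoning with wide exploratory search.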
Details
Motivation: Existing work focuses mostly on reasoning depth and overlooks the equally important reasoning width, that is, a model's ability to search and prune systematically across parallel paths under multiple constraints. A unified evaluation framework covering both width and depth is needed.
Method: Builds a benchmark of 1200+ high-quality multimodal cases spanning heterogeneous domains, proposes a fine-grained tree-of-thought evaluation protocol that jointly quantifies reasoning width and depth, and systematically evaluates 12 model families (over 30 advanced MLLMs) across difficulty tiers, question types, and required skills.
Result: Current MLLMs perform well on general or common-sense VQA tasks but are markedly limited on insight-based reasoning that requires both long-chain reasoning and wide trial-and-error search. Characteristic failure modes are analyzed.
Conclusion: Reasoning width is a key new dimension for measuring the advanced cognitive ability of MLLMs. Future work should build models that reason both deeper and wider rather than strengthening depth alone.
Abstract: In this paper, we present a holistic multimodal benchmark that evaluates the reasoning capabilities of MLLMs with an explicit focus on reasoning width, a complementary dimension to the more commonly studied reasoning depth. Specifically, reasoning depth measures the model's ability to carry out long-chain, sequential reasoning in which each step is tightly and rigorously linked to the next. Reasoning width tends to focus more on the model's capacity for broad trial-and-error search or multi-constrained optimization: it must systematically traverse many possible and parallelized reasoning paths, apply diverse constraints to prune unpromising branches, and identify valid solution routes for efficient iteration or backtracking. To this end, we carefully curate 1200+ high-quality multimodal cases spanning heterogeneous domains, and propose a fine-grained tree-of-thought evaluation protocol that jointly quantifies reasoning width and depth. We evaluate 12 major model families (over 30 advanced MLLMs) across difficulty tiers, question types, and required skills. Results show that while current models exhibit strong performance on general or common-sense VQA tasks, they still struggle to combine deep sequential thought chains with wide exploratory search to perform genuine insight-based reasoning. Finally, we analyze characteristic failure modes to provide possible directions for building MLLMs that reason not only deeper but also wider.
[86] WiFi2Cap: Semantic Action Captioning from Wi-Fi CSI via Limb-Level Semantic Alignment
Tzu-Ti Wei,Chu-Yu Huang,Yu-Chee Tseng,Jen-Jee Chen
Main category: cs.CV
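TL;DR: WiFi2Cap is a three-stage framework that uses Wi-Fi channel state information (CSI) to generate natural-language descriptions of human activities, enabling fine-grained semantic understanding while preserving privacy.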
Details
Motivation: Existing Wi-Fi CSI-based systems focus mainly on pose estimation or predefined action classification and do not support natural-language description. There is also a semantic gap between wireless signals and language, along with direction-sensitive ambiguities such as left/right limb confusion.
Method: Proposes the WiFi2Cap framework: 1) a vision-language teacher learns transferable supervision from synchronized video-text pairs; 2) a CSI student is mapped to the teacher's visual space and text embeddings through cross-modal alignment, with a Mirror-Consistency Loss to reduce left/right ambiguity; 3) a prefix-tuned language model generates action descriptions from CSI embeddings. The paper also builds the WiFi2Cap Dataset, a synchronized CSI-RGB-sentence benchmark.
Result: Consistently outperforms baseline methods on BLEU-4, METEOR, ROUGE-L, CIDEr, and SPICE.
Conclusion: WiFi2Cap maps Wi-Fi signals to natural-language action descriptions effectively, offering a new paradigm for privacy-friendly indoor semantic sensing.
Abstract: Privacy-preserving semantic understanding of human activities is important for indoor sensing, yet existing Wi-Fi CSI-based systems mainly focus on pose estimation or predefined action classification rather than fine-grained language generation. Mapping CSI to natural-language descriptions remains challenging because of the semantic gap between wireless signals and language and direction-sensitive ambiguities such as left/right limb confusion. We propose WiFi2Cap, a three-stage framework for generating action captions directly from Wi-Fi CSI. A vision-language teacher learns transferable supervision from synchronized video-text pairs, and a CSI student is aligned to the teacher's visual space and text embeddings. To improve direction-sensitive captioning, we introduce a Mirror-Consistency Loss that reduces mirrored-action and left-right ambiguities during cross-modal alignment. A prefix-tuned language model then generates action descriptions from CSI embeddings. We also introduce the WiFi2Cap Dataset, a synchronized CSI-RGB-sentence benchmark for semantic captioning from Wi-Fi signals. Experimental results show that WiFi2Cap consistently outperforms baseline methods on BLEU-4, METEOR, ROUGE-L, CIDEr, and SPICE, demonstrating effective privacy-friendly semantic sensing.
[87] TimeWeaver: Age-Consistent Reference-Based Face Restoration with Identity Preservation
Teer Song,Yue Zhang,Yu Tian,Ziyang Wang,Xianlin Zhang,Guixuan Zhang,Xuan Liu,Xueming Li,Yasen Zhang
Main category: cs.CV
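TL;DR: This paper proposes TimeWeaver, the first reference-based face restoration framework that supports cross-age reference images. By decoupling identity and age modeling, it preserves identity consistency while precisely controlling target-age semantics.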
Details
Motivation: Existing reference-based face restoration methods assume the reference and input images are age-aligned. They cannot handle cases such as historical restoration or missing-person retrieval, where only cross-age references exist, and so produce age distortion.
Method: Proposes the TimeWeaver framework. During training, an ID-Fusion module fuses a global identity embedding with age-suppressed facial tokens to learn an age-robust identity representation. During inference, two training-free techniques, Age-Aware Gradient Guidance and Token-Targeted Attention Boost, steer sampling to match the target-age prompt.
Result: Surpasses existing methods in visual quality, identity preservation, and age consistency.
Conclusion: TimeWeaver is the first to achieve face restoration with cross-age references, decoupling and jointly optimizing identity fidelity and age controllability, and offers a new paradigm for practical applications.
Abstract: Recent progress in face restoration has shifted from visual fidelity to identity fidelity, driving a transition from reference-free to reference-based paradigms that condition restoration on reference images of the same person. However, these methods assume the reference and degraded input are age-aligned. When only cross-age references are available, as in historical restoration or missing-person retrieval, they fail to maintain age fidelity. To address this limitation, we propose TimeWeaver, the first reference-based face restoration framework supporting cross-age references. Given arbitrary reference images and a target-age prompt, TimeWeaver produces restorations with both identity fidelity and age consistency. Specifically, we decouple identity and age conditioning across training and inference. During training, the model learns an age-robust identity representation by fusing a global identity embedding with age-suppressed facial tokens via a transformer-based ID-Fusion module. During inference, two training-free techniques, Age-Aware Gradient Guidance and Token-Targeted Attention Boost, steer sampling toward desired age semantics, enabling precise adherence to the target-age prompt. Extensive experiments show that TimeWeaver surpasses existing methods in visual quality, identity preservation, and age consistency.
[88] How Far Can VLMs Go for Visual Bug Detection? Studying 19,738 Keyframes from 41 Hours of Gameplay Videos
Wentao Lu,Alexander Senchenko,Alan Sayle,Abram Hindle,Cor-Paul Bezemer
Main category: cs.CV
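TL;DR: This paper studies how well vision language models (VLMs) detect visual bugs in industrial-scale, long-form gameplay videos for quality assurance. Off-the-shelf VLMs already show some capability, but fine-tuning-free strategies such as a secondary judge model or metadata-augmented prompting bring only limited gains. Further progress likely requires hybrid approaches that better separate textual and visual anomaly detection.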
Details
Motivation: Video-based game quality assurance is time-consuming, laborious, and error-prone. VLMs offer general-purpose visual reasoning and could detect visual bugs directly from video frames, but existing studies mostly use curated datasets and lack validation in real industrial settings.
Method: Conducts an empirical study on 100 industrial QA gameplay videos totaling 41 hours and 19,738 keyframes. Starting from a single-prompt VLM baseline, it compares two fine-tuning-free enhancement strategies: a secondary judge model and metadata-augmented prompting that retrieves prior bug reports.
Result: The baseline VLM reaches a precision of 0.50 and an accuracy of 0.72. Both enhancement strategies bring only marginal improvements while adding computational cost and output variance.
Conclusion: Off-the-shelf VLMs can already detect some visual bugs in real QA gameplay videos, but further progress likely requires hybrid approaches that better separate textual and visual anomaly detection, rather than prompt optimization alone.
Abstract: Video-based quality assurance (QA) for long-form gameplay video is labor-intensive and error-prone, yet valuable for assessing game stability and visual correctness over extended play sessions. Vision language models (VLMs) promise general-purpose visual reasoning capabilities and thus appear attractive for detecting visual bugs directly from video frames. Recent benchmarks suggest that VLMs can achieve promising results in detecting visual glitches on curated datasets. Building on these findings, we conduct a real-world study using industrial QA gameplay videos to evaluate how well VLMs perform in practical scenarios. Our study samples keyframes from long gameplay videos and asks a VLM whether each keyframe contains a bug. Starting from a single-prompt baseline, the model achieves a precision of 0.50 and an accuracy of 0.72. We then examine two common enhancement strategies used to improve VLM performance without fine-tuning: (1) a secondary judge model that re-evaluates VLM outputs, and (2) metadata-augmented prompting through the retrieval of prior bug reports. Across 100 videos totaling 41 hours and 19,738 keyframes, these strategies provide only marginal improvements over the simple baseline, while introducing additional computational cost and output variance.
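The gap between the reported precision (0.50) and accuracy (0.72) is easier to read with a small worked example. The sketch below uses hypothetical confusion-matrix counts, chosen only so that the two metrics reproduce the reported values; they are an assumption for illustration and are not taken from the study.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of keyframes flagged as buggy that really contain a bug."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Fraction of all keyframes that received the correct label."""
    return (tp + tn) / (tp + fp + tn + fn)


# Hypothetical counts per 100 keyframes (illustrative assumption, not study data).
tp, fp, tn, fn = 10, 10, 62, 18

print(f"precision = {precision(tp, fp):.2f}")         # 0.50
print(f"accuracy  = {accuracy(tp, fp, tn, fn):.2f}")  # 0.72
```

When bug-free frames dominate, accuracy is carried largely by true negatives, so half of the flagged frames can still be false alarms.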
Our findings indicate that off-the-shelf VLMs are already capable of detecting a certain range of visual bugs in QA gameplay videos, but further progress likely requires hybrid approaches that better separate textual and visual anomaly detection.
[89] SOUPLE: Enhancing Audio-Visual Localization and Segmentation with Learnable Prompt Contexts
Khanh Binh Nguyen,Chae Jung Park
Main category: cs.CV
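TL;DR: This paper proposes Sound-aware Prompt Learning (SOUPLE), which replaces fixed prompts with learnable context tokens and uses visual features to generate conditional context, strengthening the correspondence between audio and visual semantics in audio-visual localization.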
Details
Motivation: CLIP performs poorly on audio-visual localization. A fixed prompt such as "a photo of a [V_A]" fails to establish meaningful connections between audio embeddings and context tokens, and the common practice of replacing [CLS] with [V_A] does not capture semantic cues well.
Method: Proposes SOUPLE, which replaces fixed prompts with learnable context tokens and incorporates visual features into those tokens to generate conditional context for a mask decoder, aligning audio and visual semantics.
Result: Experiments on VGGSound, SoundNet, and AVSBench show that SOUPLE improves audio-visual localization and segmentation performance.
Conclusion: By introducing a vision-aware learnable prompt mechanism, SOUPLE narrows CLIP's semantic gap in cross-modal audio-visual localization and offers a new direction for multimodal prompt learning.
Abstract: Large-scale pre-trained image-text models exhibit robust multimodal representations, yet applying the Contrastive Language-Image Pre-training (CLIP) model to audio-visual localization remains challenging. Replacing the classification token ([CLS]) with an audio-embedded token ([V_A]) struggles to capture semantic cues, and the prompt "a photo of a [V_A]" fails to establish meaningful connections between audio embeddings and context tokens. To address these issues, we propose Sound-aware Prompt Learning (SOUPLE), which replaces fixed prompts with learnable context tokens. These tokens incorporate visual features to generate conditional context for a mask decoder, effectively bridging semantic correspondence between audio and visual inputs. Experiments on VGGSound, SoundNet, and AVSBench demonstrate that SOUPLE improves localization and segmentation performance.
[90] MVPBench: A Multi-Video Perception Evaluation Benchmark for Multi-Modal Video Understanding
Purui Bai,Tao Wu,Jiayang Sun,Xinyue Liu,Huaibo Huang,Ran He
Main category: cs.CV
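TL;DR: This paper introduces MVPBench, a new benchmark for evaluating multi-video perception with 14 subtasks and 5K question-answering tests, and reveals significant limitations of current multimodal large models in multi-video understanding.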
Details
Motivation: Existing benchmarks are limited to static images or single videos and overlook the complex interactions across multiple videos, so a benchmark dedicated to multi-video perception is needed.
Method: Builds the Multi-Video Perception Evaluation Benchmark (MVPBench), covering 14 subtasks and 5K question-answering samples that draw on 2.7K video clips from existing datasets and manually annotated clips.
Result: Extensive evaluations show that current multimodal large models handle multi-video inputs poorly, exposing substantial limitations in multi-video comprehension.
Conclusion: MVPBench fills a gap in multi-video perception evaluation and is expected to drive progress in multi-video understanding.
Abstract: The rapid progress of Large Language Models (LLMs) has spurred growing interest in Multi-modal LLMs (MLLMs) and motivated the development of benchmarks to evaluate their perceptual and comprehension abilities. Existing benchmarks, however, are limited to static images or single videos, overlooking the complex interactions across multiple videos. To address this gap, we introduce the Multi-Video Perception Evaluation Benchmark (MVPBench), a new benchmark featuring 14 subtasks across diverse visual domains designed to evaluate models on extracting relevant information from video sequences to make informed decisions. MVPBench includes 5K question-answering tests involving 2.7K video clips sourced from existing datasets and manually annotated clips. Extensive evaluations reveal that current models struggle to process multi-video inputs effectively, underscoring substantial limitations in their multi-video comprehension. We anticipate MVPBench will drive advancements in multi-video perception.
[91] Multimodal Industrial Anomaly Detection via Geometric Prior
Min Li,Jinghui He,Gang Li,Jiachen Li,Jin Wan,Delong Han
Main category: cs.CV
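TL;DR: This paper proposes GPAD, a geometric prior-based network for multimodal industrial anomaly detection. A point cloud expert model extracts fine-grained geometric features, and a two-stage fusion strategy is combined with geometric-prior-guided attention fusion and anomaly region segmentation, markedly improving detection accuracy for defects with complex geometry (e.g., subtle surface deformations, irregular contours).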
Details
Motivation: Existing multimodal industrial anomaly detection methods make poor use of key geometric information such as surface normal vectors and 3D shape topology, which lowers detection accuracy on subtle geometric defects (e.g., surface deformations, irregular contours).
Method: Proposes the GPAD network: 1) a point cloud expert model that uses differential normal vector computation to enhance geometric detail and generate a geometric prior; 2) a two-stage fusion strategy that combines multimodal data with the geometric prior in the 3D point cloud; 3) attention fusion and anomaly region segmentation modules based on the geometric prior.
Result: GPAD surpasses current state-of-the-art (SOTA) methods in detection accuracy on the MVTec-3D AD and Eyecandies datasets.
Conclusion: Geometric priors markedly improve the recognition of complex geometric defects in multimodal industrial anomaly detection, and GPAD offers a new paradigm for high-accuracy anomaly detection with 3D geometric information.
Abstract: The purpose of multimodal industrial anomaly detection is to detect complex geometric shape defects, such as subtle surface deformations and irregular contours, that are difficult to detect with 2D-based methods. However, current multimodal industrial anomaly detection lacks the effective use of crucial geometric information like surface normal vectors and 3D shape topology, resulting in low detection accuracy. In this paper, we propose a novel Geometric Prior-based Anomaly Detection network (GPAD). First, we propose a point cloud expert model to perform fine-grained geometric feature extraction, employing differential normal vector computation to enhance the geometric details of the extracted features and generate a geometric prior. Second, we propose a two-stage fusion strategy to efficiently leverage the complementarity of multimodal data as well as the geometric prior inherent in 3D points. We further propose attention fusion and anomaly region segmentation based on the geometric prior, which enhance the model's ability to perceive geometric defects. Extensive experiments show that our multimodal industrial anomaly detection model outperforms state-of-the-art (SOTA) methods in detection accuracy on both the MVTec-3D AD and Eyecandies datasets.
[92] Reconstruction-Guided Slot Curriculum: Addressing Object Over-Fragmentation in Video Object-Centric Learning
WonJun Moon,Hyun Seok Seong,Jae-Pil Heo
Main category: cs.CV
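TL;DR: This paper proposes SlotCurri, which uses a reconstruction-guided slot curriculum, a structure-aware loss, and cyclic inference to mitigate slot over-fragmentation in video object-centric learning, with notable FG-ARI gains on YouTube-VIS and MOVi-C.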
Details
Motivation: Existing slot-attention models for video object decomposition are prone to severe over-fragmentation: the model is implicitly encouraged to fill every slot to minimize reconstruction error, so a single object ends up represented by several redundant slots.
Method: Proposes a reconstruction-guided slot curriculum (SlotCurri): 1) training starts with a few coarse slots and progressively adds new slots where reconstruction error is high; 2) a structure-aware loss is added to the MSE loss to preserve local contrast and edge information and sharpen semantic boundaries; 3) a cyclic inference scheme passes slots forward and then backward through the frame sequence to improve temporal consistency.
Result: FG-ARI improves by +6.8 on YouTube-VIS and +8.3 on MOVi-C, validating the method.
Conclusion: By expanding slot capacity only where needed and combining structural cues with cyclic temporal modeling, SlotCurri addresses slot over-fragmentation in video object-centric learning and improves both object decomposition quality and temporal consistency.
Abstract: Video Object-Centric Learning seeks to decompose raw videos into a small set of object slots, but existing slot-attention models often suffer from severe over-fragmentation. This is because the model is implicitly encouraged to occupy all slots to minimize the reconstruction objective, thereby representing a single object with multiple redundant slots. We tackle this limitation with a reconstruction-guided slot curriculum (SlotCurri). Training starts with only a few coarse slots and progressively allocates new slots where reconstruction error remains high, thus expanding capacity only where it is needed and preventing fragmentation from the outset. Yet, during slot expansion, meaningful sub-parts can emerge only if coarse-level semantics are already well separated; however, with a small initial slot budget and an MSE objective, semantic boundaries remain blurry. Therefore, we augment MSE with a structure-aware loss that preserves local contrast and edge information to encourage each slot to sharpen its semantic boundaries. Lastly, we propose a cyclic inference that rolls slots forward and then backward through the frame sequence, producing temporally consistent object representations even in the earliest frames. All combined, SlotCurri addresses object over-fragmentation by allocating representational capacity where reconstruction fails, further enhanced by structural cues and cyclic inference. Notable FG-ARI gains of +6.8 on YouTube-VIS and +8.3 on MOVi-C validate the effectiveness of SlotCurri.
Our code is available at github.com/wjun0830/SlotCurri.
[93] ENC-Bench: A Benchmark for Evaluating Multimodal Large Language Models in Electronic Navigational Chart Understanding
Ao Cheng,Xingming Li,Xuanyu Ji,Xixiang He,Qiyao Sun,Chunping Qiu,Runke Huang,Qingyong Hu
Main category: cs.CV
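TL;DR: This paper introduces ENC-Bench, the first professional benchmark for understanding Electronic Navigational Charts (ENCs). It contains 20,490 expert-validated samples from authentic NOAA ENCs, organized into three levels: perception, spatial reasoning, and maritime decision-making. In a zero-shot evaluation of 10 mainstream multimodal large models, the best accuracy is only 47.88%, exposing key weaknesses in symbolic grounding, spatial computation, and multi-constraint reasoning.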
Details
Motivation: Electronic Navigational Charts (ENCs) are critical infrastructure for modern maritime safety, yet it is unclear how well existing MLLMs understand their specialized, standardized, vector-based symbols and spatial semantics, and no dedicated benchmark exists.
Method: Builds ENC-Bench, the first professional ENC understanding benchmark. Samples are generated from raw S-57 data through a calibrated vector-to-image pipeline, with automated consistency checks and expert review. A three-level task hierarchy (perception, spatial reasoning, maritime decision-making) is used to evaluate 10 SOTA MLLMs under a unified zero-shot protocol.
Result: The best model reaches only 47.88% accuracy on ENC-Bench. Models show systematic weaknesses in symbol recognition, spatial computation, multi-constraint maritime reasoning, and robustness to lighting and scale.
Conclusion: ENC-Bench sets the first rigorous evaluation standard for professional chart understanding in a safety-critical domain, reveals fundamental limits of current MLLMs in symbolic, structured, high-reliability settings, and supports progress toward professional maritime applications.
Abstract: Electronic Navigational Charts (ENCs) are the safety-critical backbone of modern maritime navigation, yet it remains unclear whether multimodal large language models (MLLMs) can reliably interpret them. Unlike natural images or conventional charts, ENCs encode regulations, bathymetry, and route constraints via standardized vector symbols, scale-dependent rendering, and precise geometric structure, all of which require specialized maritime expertise to interpret. We introduce ENC-Bench, the first benchmark dedicated to professional ENC understanding. ENC-Bench contains 20,490 expert-validated samples from 840 authentic National Oceanic and Atmospheric Administration (NOAA) ENCs, organized into a three-level hierarchy: Perception (symbol and feature recognition), Spatial Reasoning (coordinate localization, bearing, distance), and Maritime Decision-Making (route legality, safety assessment, emergency planning under multiple constraints). All samples are generated from raw S-57 data through a calibrated vector-to-image pipeline with automated consistency checks and expert review. We evaluate 10 state-of-the-art MLLMs, including GPT-4o, Gemini 2.5, Qwen3-VL, InternVL-3, and GLM-4.5V, under a unified zero-shot protocol. The best model achieves only 47.88% accuracy, with systematic challenges in symbolic grounding, spatial computation, multi-constraint reasoning, and robustness to lighting and scale variations.
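To make the Spatial Reasoning level more concrete, the sketch below shows the kind of bearing and distance computation such questions involve. It is an illustration only: the abstract does not say how ENC-Bench derives its ground truth, and the spherical-Earth (haversine) approximation, the function names, and the sample coordinates are all assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0      # mean Earth radius (spherical approximation)
KM_PER_NAUTICAL_MILE = 1.852  # exact by definition


def distance_nm(lat1, lon1, lat2, lon2):
    """Great-circle distance in nautical miles (haversine formula)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to 1.0 to guard against rounding error near antipodal points.
    arc = 2 * math.asin(min(1.0, math.sqrt(a)))
    return EARTH_RADIUS_KM * arc / KM_PER_NAUTICAL_MILE


def initial_bearing_deg(lat1, lon1, lat2, lon2):
    """Initial true bearing from point 1 to point 2, in degrees in [0, 360)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlam = math.radians(lon2 - lon1)
    y = math.sin(dlam) * math.cos(phi2)
    x = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(dlam))
    return math.degrees(math.atan2(y, x)) % 360.0


# Hypothetical pair of chart positions (latitude, longitude in decimal degrees).
print(round(distance_nm(0.0, 0.0, 0.0, 1.0), 1), "nm")
print(round(initial_bearing_deg(0.0, 0.0, 0.0, 1.0), 1), "deg true")
```

One degree of longitude along the equator comes out at roughly 60 nautical miles on a bearing of 090, which is a quick sanity check for both functions.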
By establishing the first rigorous ENC benchmark, we open a new research frontier at the intersection of specialized symbolic reasoning and safety-critical AI, providing essential infrastructure for advancing MLLMs toward professional maritime applications.
[94] From Pixels to Semantics: A Multi-Stage AI Framework for Structural Damage Detection in Satellite Imagery
Bijay Shakya,Catherine Hoier,Khandaker Mamun Ahmed
Main category: cs.CV
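TL;DR: This paper proposes a hybrid framework that combines AI super-resolution, deep learning object detection, and vision-language models (VLMs) for post-disaster building damage assessment. VRT first raises image resolution, YOLOv11 then localizes buildings, and finally multiple VLMs acting as a jury, with reference-free CLIPScore evaluation, provide a four-level semantic damage analysis that supports emergency-response decisions.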
Details
Motivation: Remote sensing imagery often suffers from low spatial resolution, contextual ambiguity, and limited semantic interpretability, which makes post-disaster structural damage assessment unreliable. More robust and interpretable automated methods are needed.
Method: Uses a Video Restoration Transformer (VRT) for super-resolution (1024×1024 to 4096×4096), localizes buildings in pre-disaster imagery with a YOLOv11-based detector, feeds the cropped building regions to multiple vision-language models (VLM-as-a-Jury) for a four-level semantic damage assessment, and applies CLIPScore for semantic alignment evaluation without ground-truth captions.
Result: On the Moore Tornado and Hurricane Matthew subsets of the xBD dataset, the framework enhances the semantic interpretation of damaged buildings and generates recovery recommendations for first responders.
Conclusion: The hybrid framework addresses the semantic bottleneck of remote sensing imagery. Multi-model VLM collaboration and reference-free evaluation improve interpretability and reduce individual model bias in safety-critical decisions, making the framework suitable for real emergency-response scenarios.
Abstract: Rapid and accurate structural damage assessment following natural disasters is critical for effective emergency response and recovery. However, remote sensing imagery often suffers from low spatial resolution, contextual ambiguity, and limited semantic interpretability, reducing the reliability of traditional detection pipelines. In this work, we propose a novel hybrid framework that integrates AI-based super-resolution, deep learning object detection, and Vision-Language Models (VLMs) for comprehensive post-disaster building damage assessment. First, we enhance pre- and post-disaster satellite imagery using a Video Restoration Transformer (VRT) to upscale images from 1024x1024 to 4096x4096 resolution, improving structural detail visibility. Next, a YOLOv11-based detector localizes buildings in pre-disaster imagery, and cropped building regions are analyzed using VLMs to semantically assess structural damage across four severity levels. To ensure robust evaluation in the absence of ground-truth captions, we employ CLIPScore for reference-free semantic alignment and introduce a multi-model VLM-as-a-Jury strategy to reduce individual model bias in safety-critical decision making. Experiments on subsets of the xBD dataset, including the Moore Tornado and Hurricane Matthew events, demonstrate that the proposed framework enhances the semantic interpretation of damaged buildings.
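The two evaluation ideas named above can be sketched in a few lines. CLIPScore is commonly defined as w · max(cos(image, text), 0) with w = 2.5; the version below takes embeddings that are assumed to come from a CLIP model and does not call one. The jury aggregation rule (majority vote, ties resolved toward the more severe label) and the label names (borrowed from xBD's labeling scheme) are assumptions, since the abstract does not specify them.

```python
import math
from collections import Counter

# Assumed label set, ordered from least to most severe; the abstract only
# states that four severity levels are used.
SEVERITY = ["no-damage", "minor-damage", "major-damage", "destroyed"]


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def clip_score(image_emb, text_emb, w=2.5):
    """Reference-free CLIPScore: w * max(cos(image, text), 0)."""
    return w * max(cosine(image_emb, text_emb), 0.0)


def jury_verdict(votes):
    """Majority vote over per-model labels; ties go to the more severe label
    (a hypothetical, safety-biased tie-break)."""
    counts = Counter(votes)
    top = max(counts.values())
    tied = [label for label, n in counts.items() if n == top]
    return max(tied, key=SEVERITY.index)


# Toy 2-D embeddings and three hypothetical jury votes.
print(clip_score([1.0, 0.0], [1.0, 0.0]))
print(jury_verdict(["minor-damage", "major-damage", "major-damage"]))
```

Resolving ties toward the more severe label is only one possible design; a deployment could equally escalate tied cases to a human reviewer.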
In addition, our framework provides helpful recommendations to first responders for recovery based on damage analysis.
[95] Typography-Based Monocular Distance Estimation Framework for Vehicle Safety Systems
Manognya Lokesh Reddy,Zheng Liu
Main category: cs.CV
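TL;DR: This paper proposes a monocular distance estimation method based on license plate typography. It uses standardized typographic information such as character height, combined with geometric modeling, adaptive detection, multi-feature fusion, and Kalman filtering, to estimate inter-vehicle distance accurately, robustly, and in real time at low cost.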
Details
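Motivation: Monocular distance estimation suffers from scale ambiguity and sensitivity to the environment, while LiDAR and radar are too costly for mass-market vehicles. A low-cost alternative would help bring ADAS and autonomous driving to market.
Method: Treats license plate character height as a passive fiducial marker. Robust plate detection and character segmentation measure the character height, and distance is computed with the pinhole camera model. The system adds interactive calibration, dual-mode (strict and permissive) detection, multi-threshold character segmentation, camera pose compensation from lane-based horizon estimation, hybrid deep-learning fusion, temporal Kalman filtering, and fusion of further typographic cues (stroke width, character spacing, plate border thickness).
Result: In a controlled indoor setup with a calibrated camera, character height has a coefficient of variation of only 2.3% across consecutive frames and the mean absolute error is 7.7%. The framework runs in real time without a GPU. Compared with a plate-width based method, the standard deviation of estimates falls by 35%, giving smoother and more stable distance readings.
Conclusion: The typography-based monocular ranging framework balances accuracy, robustness, and real-time operation, and offers a practical new paradigm for low-cost in-vehicle systems.
Abstract: Accurate inter-vehicle distance estimation is a cornerstone of advanced driver assistance systems and autonomous driving. While LiDAR and radar provide high precision, their cost prohibits widespread adoption in mass-market vehicles. Monocular vision offers a low-cost alternative but suffers from scale ambiguity and sensitivity to environmental disturbances. This paper introduces a typography-based monocular distance estimation framework, which exploits the standardized typography of license plates as passive fiducial markers for metric distance estimation. The core geometric module uses robust plate detection and character segmentation to measure character height and computes distance via the pinhole camera model. The system incorporates interactive calibration, adaptive detection with strict and permissive modes, and multi-method character segmentation leveraging both adaptive and global thresholding. To enhance robustness, the framework further includes camera pose compensation using lane-based horizon estimation, hybrid deep-learning fusion, temporal Kalman filtering for velocity estimation, and multi-feature fusion that exploits additional typographic cues such as stroke width, character spacing, and plate border thickness. Experimental validation with a calibrated monocular camera in a controlled indoor setup achieved a coefficient of variation of 2.3% in character height across consecutive frames and a mean absolute error of 7.7%. The framework operates without GPU acceleration, demonstrating real-time feasibility.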
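The core geometric step, distance from character height under the pinhole camera model, reduces to one similar-triangles relation: distance = focal length (in pixels) × real character height / character height in the image (in pixels). The sketch below is a minimal illustration of that relation, not the authors' implementation; the function name and all numeric values (focal length, physical character height, measured pixel height) are hypothetical.

```python
def distance_from_char_height(focal_px: float,
                              char_height_m: float,
                              char_height_px: float) -> float:
    """Pinhole-model range estimate: Z = f * H / h.

    focal_px       -- focal length in pixels (from camera calibration)
    char_height_m  -- physical height of a plate character, in metres
    char_height_px -- measured height of that character in the image, in pixels
    """
    if char_height_px <= 0:
        raise ValueError("character height in pixels must be positive")
    return focal_px * char_height_m / char_height_px


# Hypothetical values: 1000 px focal length, 7 cm characters, 14 px in the image.
z = distance_from_char_height(1000.0, 0.07, 14.0)
print(f"estimated distance: {z:.2f} m")  # about 5 m

# A one-pixel measurement error shifts the estimate noticeably at this range,
# which is one reason temporal filtering helps.
z_off = distance_from_char_height(1000.0, 0.07, 13.0)
print(f"with a 1 px error:  {z_off:.2f} m")
```

Because the estimate is inversely proportional to the measured pixel height, relative error in the pixel measurement carries over almost directly to relative error in distance.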
A comprehensive comparison with a plate-width-based method shows that character-based ranging reduces the standard deviation of estimates by 35%, translating to smoother, more consistent distance readings in practice, where erratic estimates could trigger unnecessary braking or acceleration.
[96] Know3D: Prompting 3D Generation with Knowledge from Vision-Language Models
Wenyue Chen,Wenjue Chen,Peng Li,Qinghe Wang,Xu Jia,Heliang Zheng,Rongfei Jia,Yuan Liu,Ronggang Wang
Main category: cs.CV
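TL;DR: This paper proposes Know3D, a framework that injects knowledge from multimodal large language models into the 3D generation process, enabling language-controllable back-view reconstruction of 3D assets.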
Details
Motivation: Because single-view observations are ambiguous and global structural priors are lacking, existing single-view 3D generation methods produce unseen regions that are stochastic and hard to control, often contradicting user intent or yielding implausible geometry. Method: Know3D uses a joint VLM-diffusion architecture: the VLM handles semantic understanding and guidance, while the diffusion model acts as a bridge that transfers semantic knowledge to the 3D generation model; latent hidden-state injection enables language-driven back-view generation. Result: The framework achieves language-controllable 3D back-view reconstruction, turning the traditionally stochastic "hallucination" into a semantically guidable process and improving plausibility and user control. Conclusion: Know3D offers a new paradigm for combining semantic knowledge with 3D generation, advancing controllable and trustworthy 3D content generation. Abstract: Recent advances in 3D generation have improved the fidelity and geometric details of synthesized 3D assets. However, due to the inherent ambiguity of single-view observations and the lack of robust global structural priors caused by limited 3D training data, the unseen regions generated by existing models are often stochastic and difficult to control, which may sometimes fail to align with user intentions or produce implausible geometries. In this paper, we propose Know3D, a novel framework that incorporates rich knowledge from multimodal large language models into 3D generative processes via latent hidden-state injection, enabling language-controllable generation of the back-view for 3D assets. We utilize a VLM-diffusion-based model, where the VLM is responsible for semantic understanding and guidance. The diffusion model acts as a bridge that transfers semantic knowledge from the VLM to the 3D generation model. In this way, we successfully bridge the gap between abstract textual instructions and the geometric reconstruction of unobserved regions, transforming the traditionally stochastic back-view hallucination into a semantically controllable process, demonstrating a promising direction for future 3D generation models.
[97] Exposure-Normalized Bed and Chair Fall Rates via Continuous AI Monitoring
Paolo Gabriel,Peter Rehani,Zack Drumm,Tyler Troy,Tiffany Wyatt,Narinder Singh
Main category: cs.CV
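TL;DR: Using continuous AI monitoring, this study estimates fall rates by exposure time rather than occupied bed-days, finding 17.8 falls per 1,000 chair exposure-hours and 4.3 per 1,000 bed exposure-hours; the adjusted chair-versus-bed rate ratio is 2.35 but is not statistically significant; chair falls are mostly associated with footrest-positioning failures.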
Details
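Motivation: Fall rates computed per occupied bed-day measure exposure imprecisely; a more accurate risk assessment based on actual exposure time (e.g., time spent sitting or lying) is needed. Method: A retrospective cohort study using a continuous AI monitoring system that collected 292,914 hours of data from 3,980 monitoring units between August 2024 and December 2025; fall rates per unit of exposure time were computed with probability weighting; the primary analysis used a Poisson regression model to estimate the adjusted chair-versus-bed rate ratio; an event attribution analysis identified fall mechanisms. Result: The fall rate was 17.8 per 1,000 chair exposure-hours and 4.3 per 1,000 bed exposure-hours; the chair-versus-bed rate ratio was 2.35 (95% CI: 0.87 to 6.33, p=0.0907); among 32 deduplicated fall events, 7 were direct chair falls, 6 of which involved footrest-positioning failures. Conclusion: The fall rate was higher in chairs than in beds, but the difference did not reach statistical significance; the findings are hypothesis-generating and support improving chair setups (especially footrest positioning) rather than reducing chair use. Abstract: This retrospective cohort study used continuous AI monitoring to estimate fall rates by exposure time rather than occupied bed-days. From August 2024 to December 2025, 3,980 eligible monitoring units contributed 292,914 hourly rows, yielding probability-weighted rates of 17.8 falls per 1,000 chair exposure-hours and 4.3 per 1,000 bed exposure-hours. Within the study window, 43 adjudicated falls matched the monitoring pipeline, and 40 linked to eligible exposure hours for the primary Poisson model, producing an adjusted chair-versus-bed rate ratio of 2.35 (95% confidence interval 0.87 to 6.33; p=0.0907). In a separate broader observation cohort (n=32 deduplicated events), 6 of 7 direct chair falls involved footrest-positioning failures. Because this was an observational study in a single health system, these findings remain hypothesis-generating and support testing safer chair setups rather than using chairs less.
[98] Predictive Photometric Uncertainty in Gaussian Splatting for Novel View Synthesis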
Chamuditha Jayanga Galappaththige,Thomas Gottwald,Peter Stehr,Edgar Heinert,Niko Suenderhauf,Dimity Miller,Matthias Rottmann
Main category: cs.CV
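TL;DR: This paper proposes a lightweight, plug-and-play framework for pixel-wise, view-dependent predictive uncertainty estimation in 3D Gaussian Splatting, improving its trustworthiness and usefulness in safety-critical settings such as autonomous driving.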
Details
Motivation: 3D Gaussian Splatting renders well but cannot quantify the uncertainty of its representation, so it falls short of the spatial-map reliability that autonomous agents and safety-critical applications require. Method: A post-hoc method that formulates uncertainty as a Bayesian-regularized linear least-squares optimization over reconstruction residuals, extracting an uncertainty channel for each Gaussian primitive without modifying the underlying scene representation. Result: Without degrading visual fidelity, the method improves performance on three downstream perception tasks: active view selection, pose-agnostic scene change detection, and pose-agnostic anomaly detection. Conclusion: The proposed uncertainty estimation framework turns 3D Gaussian Splatting from a pure rendering engine into a trustworthy spatial map with potential for real-world deployment. Abstract: Recent advances in 3D Gaussian Splatting have enabled impressive photorealistic novel view synthesis. However, to transition from a pure rendering engine to a reliable spatial map for autonomous agents and safety-critical applications, knowing where the representation is uncertain is as important as the rendering fidelity itself. We bridge this critical gap by introducing a lightweight, plug-and-play framework for pixel-wise, view-dependent predictive uncertainty estimation. Our post-hoc method formulates uncertainty as a Bayesian-regularized linear least-squares optimization over reconstruction residuals. This architecture-agnostic approach extracts a per-primitive uncertainty channel without modifying the underlying scene representation or degrading baseline visual fidelity. Crucially, we demonstrate that providing this actionable reliability signal successfully translates 3D Gaussian splatting into a trustworthy spatial map, further improving state-of-the-art performance across three critical downstream perception tasks: active view selection, pose-agnostic scene change detection, and pose-agnostic anomaly detection.
[99] It Takes Two: A Duet of Periodicity and Directionality for Burst Flicker Removal
Lishen Qu,Shihao Zhou,Jie Liang,Hui Zeng,Lei Zhang,Jufeng Yang
Main category: cs.CV
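TL;DR: This paper proposes Flickerformer, a Transformer-based architecture that removes flicker artifacts in short-exposure photography; a phase-based fusion module, an autocorrelation feed-forward network, and a wavelet-based directional attention module deliver effective restoration without ghosting.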
Details
Motivation: Flicker artifacts have structured characteristics such as periodicity and directionality, which existing generic restoration frameworks do not model, leading to poor suppression and ghosting. Method: Flickerformer comprises a phase-based fusion module (PFM), an autocorrelation feed-forward network (AFFN), and a wavelet-based directional attention module (WDAM), which exploit inter-frame periodicity, intra-frame structural regularities, and directionality, respectively. Result: The method outperforms current state-of-the-art approaches in both quantitative metrics and visual quality. Conclusion: By modeling the intrinsic structure of flicker artifacts, Flickerformer achieves high-quality, ghosting-free flicker removal and suggests a new direction for modeling structured degradations. Abstract: Flicker artifacts, arising from unstable illumination and row-wise exposure inconsistencies, pose a significant challenge in short-exposure photography, severely degrading image quality. Unlike typical artifacts, e.g., noise and low-light, flicker is a structured degradation with specific spatial-temporal patterns, which are not accounted for in current generic restoration frameworks, leading to suboptimal flicker suppression and ghosting artifacts. In this work, we reveal that flicker artifacts exhibit two intrinsic characteristics, periodicity and directionality, and propose Flickerformer, a transformer-based architecture that effectively removes flicker without introducing ghosting. Specifically, Flickerformer comprises three key components: a phase-based fusion module (PFM), an autocorrelation feed-forward network (AFFN), and a wavelet-based directional attention module (WDAM). Based on the periodicity, PFM performs inter-frame phase correlation to adaptively aggregate burst features, while AFFN exploits intra-frame structural regularities through autocorrelation, jointly enhancing the network's ability to perceive spatially recurring patterns. Moreover, motivated by the directionality of flicker artifacts, WDAM leverages high-frequency variations in the wavelet domain to guide the restoration of low-frequency dark regions, yielding precise localization of flicker artifacts. Extensive experiments demonstrate that Flickerformer outperforms state-of-the-art approaches in both quantitative metrics and visual quality. The source code is available at https://github.com/qulishen/Flickerformer.
[100] PhotoAgent: A Robotic Photographer with Spatial and Aesthetic Understanding
Lirong Che,Zhenfeng Gan,Yanbo Chen,Junbo Tan,Xueqian Wang
Main category: cs.CV
TL;DR: PhotoAgent is an embodied agent for photography that uses Large Multimodal Models (LMMs) and 3D Gaussian Splatting to translate aesthetic language commands into geometric constraints and refine camera poses via mental simulation, achieving high-quality, aesthetically superior images efficiently.
Details
Motivation: Embodied agents for creative tasks like photography face a semantic gap between high-level language commands and precise geometric control; bridging this gap is essential for effective aesthetic execution. Method: PhotoAgent combines LMM-driven chain-of-thought reasoning to convert aesthetic goals into geometric constraints, an analytical solver for initial viewpoint computation, and iterative visual reflection in a photorealistic internal world model built with 3D Gaussian Splatting for pose refinement. Result: PhotoAgent demonstrates superior spatial reasoning and produces higher-quality final images compared to baseline methods, enabled by efficient mental simulation instead of physical trial-and-error. Conclusion: Integrating multimodal reasoning with photorealistic mental simulation enables embodied agents to effectively execute complex creative photography tasks with high aesthetic fidelity and efficiency. Abstract: Embodied agents for creative tasks like photography must bridge the semantic gap between high-level language commands and geometric control. We introduce PhotoAgent, an agent that achieves this by integrating Large Multimodal Model (LMM) reasoning with a novel control paradigm. PhotoAgent first translates subjective aesthetic goals into solvable geometric constraints via LMM-driven, chain-of-thought (CoT) reasoning, allowing an analytical solver to compute a high-quality initial viewpoint. This initial pose is then iteratively refined through visual reflection within a photorealistic internal world model built with 3D Gaussian Splatting (3DGS). This "mental simulation" replaces costly and slow physical trial-and-error, enabling rapid convergence to aesthetically superior results. Evaluations confirm that PhotoAgent excels in spatial reasoning and achieves superior final image quality.
[101] Focus, Don't Prune: Identifying Instruction-Relevant Regions for Information-Rich Image Understanding
Mincheol Kwon,Minseung Lee,Seonga Choi,Miso Choi,Kyeong-Jin Oh,Hyunyoung Lee,Cheonyoung Park,Yongho Song,Seunghyun Park,Jinkyu Kim
Main category: cs.CV
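TL;DR: This paper proposes PinPoint, a two-stage framework (localizing instruction-relevant image regions, then extracting fine-grained features) that improves the reasoning efficiency and accuracy of large vision-language models on complex images such as infographics and documents while reducing computational overhead.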
Details
Motivation: When processing information-dense images (such as infographics and document layouts), existing LVLMs must generate a large number of visual tokens, which is computationally expensive and makes it hard to locate the regions relevant to the instruction efficiently. Method: PinPoint is a two-stage framework: the first stage uses an Instruction-Region Alignment mechanism that combines visual and textual input to localize instruction-relevant image regions; the second stage extracts fine-grained visual features from those regions. New annotations covering InfographicVQA, MultiPageDocVQA, and SinglePageDocVQA provide richer supervision. Result: PinPoint surpasses existing methods in accuracy on several challenging VQA benchmarks while reducing the number of irrelevant visual tokens and the computational overhead. Conclusion: By explicitly modeling instruction-region alignment and region-level feature refinement, PinPoint balances performance and efficiency for large vision-language models on complex image understanding tasks. Abstract: Large Vision-Language Models (LVLMs) have shown strong performance across various multimodal tasks by leveraging the reasoning capabilities of Large Language Models (LLMs). However, processing visually complex and information-rich images, such as infographics or document layouts, requires these models to generate a large number of visual tokens, leading to significant computational overhead. To address this, we propose PinPoint, a novel two-stage framework that first identifies instruction-relevant image regions and then refines them to extract fine-grained visual features for improved reasoning and efficiency. Central to our approach is the Instruction-Region Alignment, which localizes relevant regions using both visual input and textual instructions. We further introduce new annotations that provide richer ground-truth supervision for instruction-relevant regions across challenging VQA benchmarks: InfographicVQA, MultiPageDocVQA, and SinglePageDocVQA. Experimental results show that PinPoint not only achieves superior accuracy compared to existing methods but also reduces computational overhead by minimizing irrelevant visual tokens.
[102] TDATR: Improving End-to-End Table Recognition via Table Detail-Aware Learning and Cell-Level Visual Alignment
Chunxia Qin,Chenyu Liu,Pengcheng Xia,Jun Du,Baocai Yin,Bing Yin,Cong Liu
Main category: cs.CV
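TL;DR: This paper proposes TDATR, which improves end-to-end table recognition through table detail-aware learning and cell-level visual alignment; it performs well in data-constrained settings and achieves state-of-the-art or near state-of-the-art results on seven benchmarks.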
Details
Motivation: Existing modular table recognition pipelines model structure and content separately, leading to poor integration and complex workflows; end-to-end methods depend on large amounts of annotated data and perform poorly when data is limited. Method: TDATR adopts a "perceive-then-fuse" strategy: it first performs table detail-aware learning, jointly modeling structure and content through multiple tasks under a language modeling paradigm that can draw on diverse document data; it then adds a structure-guided cell localization module to strengthen vision-language alignment; the final output is structured HTML. Result: State-of-the-art or highly competitive results on seven benchmark datasets without dataset-specific fine-tuning. Conclusion: Through detail-aware learning and fine-grained alignment, TDATR improves the robustness, accuracy, and interpretability of end-to-end table recognition in low-resource settings. Abstract: Tables are pervasive in diverse documents, making table recognition (TR) a fundamental task in document analysis. Existing modular TR pipelines separately model table structure and content, leading to suboptimal integration and complex workflows. End-to-end approaches rely heavily on large-scale TR data and struggle in data-constrained scenarios. To address these issues, we propose TDATR (Table Detail-Aware Table Recognition), which improves end-to-end TR through table detail-aware learning and cell-level visual alignment. TDATR adopts a "perceive-then-fuse" strategy. The model first performs table detail-aware learning to jointly perceive table structure and content through multiple structure understanding and content recognition tasks designed under a language modeling paradigm. These tasks can naturally leverage document data from diverse scenarios to enhance model robustness. The model then integrates implicit table details to generate structured HTML outputs, enabling more efficient TR modeling when trained with limited data. Furthermore, we design a structure-guided cell localization module integrated into the end-to-end TR framework, which efficiently locates cells and strengthens vision-language alignment. It enhances the interpretability and accuracy of TR. We achieve state-of-the-art or highly competitive performance on seven benchmarks without dataset-specific fine-tuning.
[103] Cross-Slice Knowledge Transfer via Masked Multi-Modal Heterogeneous Graph Contrastive Learning for Spatial Gene Expression Inference
Zhiceng Shi,Changmiao Wang,Jun Wan,Wenwen Min
Main category: cs.CV
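TL;DR: This paper proposes SpaHGC, a model that uses multi-modal heterogeneous graph learning to combine histopathology images with spatial transcriptomics data; cross-slice image embeddings and masked graph contrastive learning improve the accuracy and biological relevance of spatial gene expression prediction.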
Details
Motivation: Spatial transcriptomics (ST) experiments are expensive, which limits large-scale use, and existing methods that predict ST from pathology images struggle to model complex cross-slice spatial relationships. Method: SpaHGC builds a heterogeneous graph that combines intra-slice and inter-slice spot-spot relationships, uses a pathology foundation model to extract image embeddings for cross-slice knowledge transfer, and introduces masked graph contrastive learning to strengthen feature representation and the transfer of spatial gene expression knowledge. Result: In a comprehensive evaluation on seven matched datasets from different platforms, tissues, and cancer subtypes, SpaHGC significantly outperforms nine existing state-of-the-art methods; its predictions are significantly enriched in multiple cancer-related pathways. Conclusion: SpaHGC models complex spatial dependencies effectively and improves both prediction accuracy and biological interpretability, showing strong application potential. Abstract: While spatial transcriptomics (ST) has advanced our understanding of gene expression in tissue context, its high experimental cost limits its large-scale application. Predicting ST from pathology images is a promising, cost-effective alternative, but existing methods struggle to capture complex cross-slide spatial relationships. To address the challenge, we propose SpaHGC, a multi-modal heterogeneous graph-based model that captures both intra-slice and inter-slice spot-spot relationships from histology images. It integrates local spatial context within the target slide and cross-slide similarities computed from image embeddings extracted by a pathology foundation model. These embeddings enable inter-slice knowledge transfer, and SpaHGC further incorporates Masked Graph Contrastive Learning to enhance feature representation and transfer spatial gene expression knowledge from reference to target slides, enabling it to model complex spatial dependencies and significantly improve prediction accuracy. We conducted comprehensive benchmarking on seven matched histology-ST datasets from different platforms, tissues, and cancer subtypes. The results demonstrate that SpaHGC significantly outperforms nine existing state-of-the-art methods across all evaluation metrics. Additionally, the predictions are significantly enriched in multiple cancer-related pathways, thereby highlighting its strong biological relevance and application potential.
[104] MVRD-Bench: Multi-View Learning and Benchmarking for Dynamic Remote Photoplethysmography under Occlusion
Zuxian He,Xu Cheng,Zhaodong Sun,Haoyu Chen,Jingang Shi,Xiaobai Li,Guoying Zhao
Main category: cs.CV
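TL;DR: This paper proposes MVRD-rPPG, a robust rPPG method for multi-view video; by building the new MVRD dataset and designing motion compensation, a dual-stream network, and a multi-view attention mechanism, it substantially improves heart rate estimation accuracy under motion and occlusion.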
Details
Motivation: Existing rPPG methods degrade under facial motion and occlusion because they rely on static, single-view video; this work targets motion-induced occlusion in unconstrained multi-view video. Method: The authors build MVRD, a synchronized three-view, multi-scenario dataset, and propose the MVRD-rPPG framework, which includes Adaptive Temporal Optical Compensation (ATOC), a Rhythm-Visual Dual-Stream Network, Multi-View Correlation-Aware Attention (MVCA), and a Correlation Frequency Adversarial (CFA) learning strategy. Result: In the MVRD movement scenario, the method reaches an MAE of 0.90 and a Pearson correlation coefficient R of 0.99, clearly outperforming existing methods; ablation studies confirm the contribution of each module. Conclusion: Joint multi-view modeling and adversarial frequency-domain constraints improve the robustness and accuracy of rPPG in dynamic real-world scenes, offering a new paradigm for non-contact physiological monitoring. Abstract: Remote photoplethysmography (rPPG) is a non-contact technique that estimates physiological signals by analyzing subtle skin color changes in facial videos. Existing rPPG methods often encounter performance degradation under facial motion and occlusion scenarios due to their reliance on static and single-view facial videos. Thus, this work focuses on tackling the motion-induced occlusion problem for rPPG measurement in unconstrained multi-view facial videos. Specifically, we introduce a Multi-View rPPG Dataset (MVRD), a high-quality benchmark dataset featuring synchronized facial videos from three viewpoints under stationary, speaking, and head movement scenarios to better match real-world conditions. We also propose MVRD-rPPG, a unified multi-view rPPG learning framework that fuses complementary visual cues to maintain robust facial skin coverage, especially under motion conditions. Our method integrates an Adaptive Temporal Optical Compensation (ATOC) module for motion artifact suppression, a Rhythm-Visual Dual-Stream Network to disentangle rhythmic and appearance-related features, and a Multi-View Correlation-Aware Attention (MVCA) for adaptive view-wise signal aggregation. Furthermore, we introduce a Correlation Frequency Adversarial (CFA) learning strategy, which jointly enforces temporal accuracy, spectral consistency, and perceptual realism in the predicted signals. Extensive experiments and ablation studies on the MVRD dataset demonstrate the superiority of our approach. In the MVRD movement scenario, MVRD-rPPG achieves an MAE of 0.90 and a Pearson correlation coefficient (R) of 0.99.
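The two reported metrics, MAE and the Pearson correlation coefficient, can be computed as in the following minimal sketch. The heart-rate values are made-up illustrations (not data from MVRD), and the function names are hypothetical.

```python
import math


def mean_absolute_error(pred, ref):
    """Average absolute difference between predictions and references."""
    if len(pred) != len(ref) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(pred)


def pearson_r(pred, ref):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(pred) != len(ref) or len(pred) < 2:
        raise ValueError("need two equal-length sequences with >= 2 items")
    n = len(pred)
    mean_p = sum(pred) / n
    mean_r = sum(ref) / n
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(pred, ref))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_r = sum((r - mean_r) ** 2 for r in ref)
    return cov / math.sqrt(var_p * var_r)


if __name__ == "__main__":
    # Hypothetical heart rates in beats per minute.
    reference = [60.0, 72.0, 85.0, 90.0]
    predicted = [61.0, 71.0, 86.0, 91.0]
    print("MAE:", mean_absolute_error(predicted, reference))
    print("R:", round(pearson_r(predicted, reference), 4))
```

MAE captures the typical size of the error in the signal's own units, while R captures how well predictions track the reference across samples, so a method can score well on one and poorly on the other.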
The source code and dataset will be made available.
[105] MultiCam: On-the-fly Multi-Camera Pose Estimation Using Spatiotemporal Overlaps of Known Objects
Shiyu Li,Hannah Schieber,Kristoffer Waldow,Benjamin Busam,Julian Kreimeier,Daniel Roth
Main category: cs.CV
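TL;DR: This paper proposes a marker-less camera pose estimation method for multi-camera dynamic AR systems; it exploits spatiotemporal field-of-view (FoV) overlaps of known objects and an enhanced object pose estimator to build a spatiotemporal scene graph that relates even cameras with non-overlapping FoVs, and it is validated on several datasets.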
Details
Motivation: Existing marker-based multi-camera pose estimation relies on initial calibration or continuously visible markers, so markers must stay inside the field of view and deployment is difficult; using known objects in the scene allows more robust and flexible marker-less dynamic estimation. Method: A dynamic camera pose estimation method based on spatiotemporal FoV overlaps of known objects; a state-of-the-art object pose estimator is enhanced to update a spatiotemporal scene graph on the fly, relating cameras across non-overlapping FoVs; a new benchmark dataset covers static and dynamic cameras, multiple objects, and temporal FoV overlap. Result: In FoV-overlapping scenarios on YCB-V and T-LESS, camera pose accuracy exceeds the state of the art; results on the new multi-camera, multi-object dataset validate the method; code and dataset are publicly available. Conclusion: The marker-less, dynamic, multi-camera pose estimation framework improves the robustness and practicality of AR systems in complex dynamic environments and offers a new paradigm for multi-view collaborative AR in real scenes. Abstract: Multi-camera dynamic Augmented Reality (AR) applications require camera pose estimation to leverage individual information from each camera in one common system. This can be achieved by combining contextual information, such as markers or objects, across multiple views. While cameras are commonly calibrated in an initial step or updated through the constant use of markers, another option is to leverage information already present in the scene, like known objects. Another downside of marker-based tracking is that markers have to be tracked inside the field-of-view (FoV) of the cameras. To overcome these limitations, we propose a constant dynamic camera pose estimation leveraging spatiotemporal FoV overlaps of known objects on the fly. To achieve that, we enhance the state-of-the-art object pose estimator to update our spatiotemporal scene graph, enabling a relation even among non-overlapping FoV cameras. To evaluate our approach, we introduce a multi-camera, multi-object pose estimation dataset with temporal FoV overlap, including static and dynamic cameras. Furthermore, in FoV overlapping scenarios, we outperform the state-of-the-art on the widely used YCB-V and T-LESS datasets in camera pose accuracy. Our performance on both previous and our proposed datasets validates the effectiveness of our marker-less approach for AR applications. The code and dataset are available on https://github.com/roth-hex-lab/IEEE-VR-2026-MultiCam.
[106] URA-Net: Uncertainty-Integrated Anomaly Perception and Restoration Attention Network for Unsupervised Anomaly Detection
Wei Luo,Peng Xing,Yunkang Cao,Haiming Yao,Weiming Shen,Zechao Li
Main category: cs.CV
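TL;DR: This paper proposes URA-Net, which improves unsupervised anomaly detection by combining uncertainty modeling and anomaly perception with semantic feature reconstruction.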
Details
Motivation: Traditional reconstruction-based unsupervised anomaly detection methods tend to over-generalize, reconstructing anomalies well and thereby degrading detection performance. Method: URA-Net (1) uses multi-level semantic features extracted by a pre-trained CNN as the reconstruction target; (2) introduces a feature-level artificial anomaly synthesis module to generate training samples; (3) estimates anomalous regions and ambiguous boundaries with an uncertainty-integrated anomaly perception module based on Bayesian neural networks; (4) restores anomalous regions with a restoration attention mechanism that draws on global normal semantic information; and (5) uses residual maps between input and restored features for detection and localization. Result: Experiments on the MVTec AD and BTAD industrial datasets and the OCT-2017 medical image dataset show clear gains over existing methods. Conclusion: By explicitly modeling anomaly restoration and uncertainty-aware perception, URA-Net achieves more robust and precise detection and localization in unsupervised anomaly detection. Abstract: Unsupervised anomaly detection plays a pivotal role in industrial defect inspection and medical image analysis, with most methods relying on the reconstruction framework. However, these methods may suffer from over-generalization, enabling them to reconstruct anomalies well, which leads to poor detection performance. To address this issue, instead of focusing solely on normality reconstruction, we propose an innovative Uncertainty-Integrated Anomaly Perception and Restoration Attention Network (URA-Net), which explicitly restores abnormal patterns to their corresponding normality. First, unlike traditional image reconstruction methods, we utilize a pre-trained convolutional neural network to extract multi-level semantic features as the reconstruction target. To assist the URA-Net learning to restore anomalies, we introduce a novel feature-level artificial anomaly synthesis module to generate anomalous samples for training. Subsequently, a novel uncertainty-integrated anomaly perception module based on Bayesian neural networks is introduced to learn the distributions of anomalous and normal features. This facilitates the estimation of anomalous regions and ambiguous boundaries, laying the foundation for subsequent anomaly restoration. Then, we propose a novel restoration attention mechanism that leverages global normal semantic information to restore detected anomalous regions, thereby obtaining defect-free restored features. Finally, we employ residual maps between input features and restored features for anomaly detection and localization.
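The final step, scoring anomalies from residual maps between input and restored features, can be sketched as follows. This is only an illustration under stated assumptions: the per-location L2 distance and the max-pooled image score are common choices rather than the paper's confirmed formulation, and the tiny 2 x 2 feature grids are made up.

```python
import math


def residual_map(input_feats, restored_feats):
    """Per-location L2 distance between two H x W x C feature grids.

    Each grid is a nested list: rows -> columns -> feature vector.
    Larger values mark locations where restoration changed the features
    most, i.e. candidate anomalous regions.
    """
    return [
        [math.dist(f_in, f_out) for f_in, f_out in zip(row_in, row_out)]
        for row_in, row_out in zip(input_feats, restored_feats)
    ]


def image_score(res_map):
    """Image-level anomaly score: the largest residual (an assumption)."""
    return max(max(row) for row in res_map)


if __name__ == "__main__":
    # Hypothetical features: only the top-right location was altered
    # by restoration, so only it should receive a non-zero residual.
    inp = [[[1.0, 1.0], [4.0, 5.0]],
           [[0.0, 2.0], [3.0, 3.0]]]
    rec = [[[1.0, 1.0], [1.0, 1.0]],
           [[0.0, 2.0], [3.0, 3.0]]]
    res = residual_map(inp, rec)
    print(res)
    print("image score:", image_score(res))
```

The residual map gives localization (which locations differ), and reducing it to a single number gives image-level detection; the choice of distance and reduction affects sensitivity to small defects.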
The comprehensive experimental results on two industrial datasets, MVTec AD and BTAD, along with a medical image dataset, OCT-2017, unequivocally demonstrate the effectiveness and superiority of the proposed method.
[107] EVA: Efficient Reinforcement Learning for End-to-End Video Agent
Yaolun Zhang,Ruohui Wang,Jiahao Wang,Yepeng Tang,Xuanyu Zheng,Haonan Duan,Hao Lu,Hanming Deng,Lewei Lu
Main category: cs.CV
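TL;DR: This paper proposes EVA, a framework that achieves efficient video understanding through planning-before-perception iterative reasoning and a three-stage learning pipeline, with clear gains on multiple benchmarks.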
Details
Motivation: Existing video understanding methods are inefficient on long videos, lack adaptive reasoning, and depend on manually designed workflows and perception-first strategies. Method: EVA adopts an iterative summary-plan-action-reflection reasoning paradigm and a three-stage training pipeline comprising supervised fine-tuning (SFT), Kahneman-Tversky Optimization (KTO), and Generalized Reward Policy Optimization (GRPO). Result: On six video understanding benchmarks, EVA improves by 6-12% over general MLLM baselines and by a further 1-3% over prior adaptive agent methods. Conclusion: EVA delivers a query-driven, efficient, end-to-end trainable video agent and supports the effectiveness of the planning-before-perception paradigm for video understanding. Abstract: Video understanding with multimodal large language models (MLLMs) remains challenging due to the long token sequences of videos, which contain extensive temporal dependencies and redundant frames. Existing approaches typically treat MLLMs as passive recognizers, processing entire videos or uniformly sampled frames without adaptive reasoning. Recent agent-based methods introduce external tools, yet still depend on manually designed workflows and perception-first strategies, resulting in inefficiency on long videos. We present EVA, an Efficient Reinforcement Learning framework for End-to-End Video Agent, which enables planning-before-perception through iterative summary-plan-action-reflection reasoning. EVA autonomously decides what to watch, when to watch, and how to watch, achieving query-driven and efficient video understanding. To train such agents, we design a simple yet effective three-stage learning pipeline - comprising supervised fine-tuning (SFT), Kahneman-Tversky Optimization (KTO), and Generalized Reward Policy Optimization (GRPO) - that bridges supervised imitation and reinforcement learning. We further construct high-quality datasets for each stage, supporting stable and reproducible training. We evaluate EVA on six video understanding benchmarks, demonstrating its comprehensive capabilities. Compared with existing baselines, EVA achieves a substantial improvement of 6-12% over general MLLM baselines and a further 1-3% gain over prior adaptive agent methods. Our code and model are available at https://github.com/wangruohui/EfficientVideoAgent.
[108] UAV-DETR: DETR for Anti-Drone Target Detection
Jun Yang,Dong Wang,Hongxu Yin,Hongpeng Li,Jianxiong Yu
Main category: cs.CV
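TL;DR: UAV-DETR is an efficient real-time object detection framework for miniature drones; a WTConv-enhanced backbone, sliding window self-attention, a cross-scale feature recalibration and fusion network, and a hybrid loss give it a better balance between accuracy and computational efficiency.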
Details
Motivation: Existing deep learning-based drone detection methods struggle to balance robust feature representation with computational efficiency, especially when detecting miniature drones against complex backgrounds under strong environmental interference. Method: UAV-DETR comprises a WTConv-enhanced backbone, a Sliding Window Self-Attention encoder (SWSA-IFI), an Efficient Cross-Scale Feature Recalibration and Fusion Network (ECFRFN), and a hybrid Inner-CIoU and NWD loss strategy. Result: On the authors' custom UAV dataset, mAP50:95 improves by 6.61% with 39.8% fewer parameters; on the DUT-ANTI-UAV benchmark, Precision improves by 1.4% and F1-Score by 1.0%. Conclusion: UAV-DETR achieves a better trade-off between accuracy and efficiency for counter-UAV object detection and is a practically useful detection framework. Abstract: Drone detection is pivotal in numerous security and counter-UAV applications. However, existing deep learning-based methods typically struggle to balance robust feature representation with computational efficiency. This challenge is particularly acute when detecting miniature drones against complex backgrounds under severe environmental interference. To address these issues, we introduce UAV-DETR, a novel framework that integrates a small-target-friendly architecture with real-time detection capabilities. Specifically, UAV-DETR features a WTConv-enhanced backbone and a Sliding Window Self-Attention (SWSA-IFI) encoder, capturing the high-frequency structural details of tiny targets while drastically reducing parameter overhead. Furthermore, we propose an Efficient Cross-Scale Feature Recalibration and Fusion Network (ECFRFN) to suppress background noise and aggregate multi-scale semantics. To further enhance accuracy, UAV-DETR incorporates a hybrid Inner-CIoU and NWD loss strategy, mitigating the extreme sensitivity of standard IoU metrics to minor positional deviations in small objects. Extensive experiments demonstrate that UAV-DETR significantly outperforms the baseline RT-DETR on our custom UAV dataset (+6.61% in mAP50:95, with a 39.8% reduction in parameters) and the public DUT-ANTI-UAV benchmark (+1.4% in Precision, +1.0% in F1-Score). These results establish UAV-DETR as a superior trade-off between efficiency and precision in counter-UAV object detection.
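The sensitivity of IoU to small positional deviations, which motivates the NWD term in the loss, can be seen in a short sketch. It assumes the commonly cited Normalized Wasserstein Distance formulation (each box modeled as a 2D Gaussian, similarity exp(-W2 / C)); the constant `c = 12.8` and the box coordinates are illustrative assumptions, and this is not claimed to be UAV-DETR's exact implementation.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nwd(a, b, c=12.8):
    """Normalized Wasserstein Distance between two boxes.

    Each box becomes the vector (cx, cy, w/2, h/2); the 2-Wasserstein
    distance between the corresponding Gaussians is the Euclidean
    distance between these vectors. c is a dataset-dependent constant
    (the default here is an assumed, illustrative value).
    """
    def as_gaussian(box):
        x1, y1, x2, y2 = box
        return ((x1 + x2) / 2, (y1 + y2) / 2, (x2 - x1) / 2, (y2 - y1) / 2)

    w2 = math.dist(as_gaussian(a), as_gaussian(b))
    return math.exp(-w2 / c)


if __name__ == "__main__":
    # The same 2-pixel horizontal shift applied to a tiny and a large box.
    tiny, tiny_shifted = (0, 0, 4, 4), (2, 0, 6, 4)
    large, large_shifted = (0, 0, 40, 40), (2, 0, 42, 40)
    print("IoU tiny :", round(iou(tiny, tiny_shifted), 3))
    print("IoU large:", round(iou(large, large_shifted), 3))
    print("NWD tiny :", round(nwd(tiny, tiny_shifted), 3))
    print("NWD large:", round(nwd(large, large_shifted), 3))
```

In this toy case the 2-pixel shift cuts the tiny box's IoU to one third while barely affecting the large box, whereas NWD assigns both the same smooth penalty; a loss of the form 1 - NWD therefore gives small targets a less erratic training signal.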
The code is available at https://github.com/wd-sir/UAVDETR.
[109] YOLOv10 with Kolmogorov-Arnold networks and vision-language foundation models for interpretable object detection and trustworthy multimodal AI in computer vision perception
Marios Impraimakis,Daniel Vazquez,Feiyu Zhou
Main category: cs.CV
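TL;DR: This paper proposes an interpretable post-hoc surrogate model based on a Kolmogorov-Arnold network to assess the trustworthiness of YOLOv10 detection confidence, giving transparent, visualizable reliability judgments especially in degraded or ambiguous scenes; combined with BLIP-generated scene captions, it forms a lightweight, interpretable multimodal perception system.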
Details
Motivation: Vision systems such as those used in autonomous driving cannot transparently explain how reliable their detection confidence is in degraded or ambiguous scenes, which limits trustworthy deployment in safety-critical settings. Method: A Kolmogorov-Arnold network serves as an interpretable post-hoc surrogate that models the trustworthiness of YOLOv10 detections from seven geometric and semantic features; its additive spline-based structure allows direct visualization of each feature's influence; a BLIP model generates scene captions to support lightweight multimodal interaction. Result: On COCO and on images from the University of Bath campus, the framework accurately identifies low-trust detections caused by blur, occlusion, or low texture and provides an actionable basis for filtering, review, or risk mitigation. Conclusion: The framework delivers object detection with high interpretability and trustworthy confidence estimates, offering a practical, transparent perception component for autonomous and multimodal AI systems. Abstract: The interpretable object detection capabilities of a novel Kolmogorov-Arnold network framework are examined here. The approach addresses a key limitation in computer vision for autonomous vehicle perception and beyond. These systems offer limited transparency regarding the reliability of their confidence scores in visually degraded or ambiguous scenes. To address this limitation, a Kolmogorov-Arnold network is employed as an interpretable post-hoc surrogate to model the trustworthiness of the You Only Look Once (YOLOv10) detections using seven geometric and semantic features. The additive spline-based structure of the Kolmogorov-Arnold network enables direct visualisation of each feature's influence. This produces smooth and transparent functional mappings that reveal when the model's confidence is well supported and when it is unreliable. Experiments on both Common Objects in Context (COCO) and images from the University of Bath campus demonstrate that the framework accurately identifies low-trust predictions under blur, occlusion, or low texture. This provides actionable insights for filtering, review, or downstream risk mitigation. Furthermore, a bootstrapped language-image (BLIP) foundation model generates descriptive captions of each scene. This tool enables a lightweight multimodal interface without affecting the interpretability layer. The resulting system delivers interpretable object detection with trustworthy confidence estimates.
It offers a transparent and practical perception component for autonomous and multimodal artificial intelligence applications.
[110] Rethinking Token-Level Policy Optimization for Multimodal Chain-of-Thought
Yunheng Li,Hangyi Kuang,Hengrui Zhang,Jiangxia Cao,Zhaojie Liu,Qibin Hou,Ming-Ming Cheng
Main category: cs.CV
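TL;DR: This paper proposes Perception-Exploration Policy Optimization (PEPO), which uses fine-grained token-level analysis to combine a perception prior with an entropy-based gating mechanism, improving multimodal chain-of-thought reasoning without additional supervision or auxiliary branches.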
Details
Motivation: Existing Reinforcement Learning with Verifiable Rewards (RLVR) methods do not model differences in the degree of visual grounding within multimodal chain-of-thought reasoning, so they cannot distinguish the perceptual and inferential roles of different tokens in a reasoning trajectory. Method: PEPO builds a perception prior from hidden state similarity and fuses it with token entropy through a smooth gating mechanism to produce token-level advantages; it integrates seamlessly with existing RLVR frameworks such as GRPO and DAPO. Result: Across multimodal benchmarks covering geometry reasoning, visual grounding, visual puzzle solving, and few-shot classification, PEPO consistently and robustly outperforms strong RL baselines while keeping training stable. Conclusion: Modeling token-level perception-exploration dynamics is key to improving multimodal CoT reasoning, and PEPO offers a general, lightweight, and effective reinforcement learning optimization paradigm. Abstract: Multimodal Chain-of-Thought (CoT) reasoning requires large vision-language models to construct reasoning trajectories that interleave perceptual grounding with multi-step inference. However, existing Reinforcement Learning with Verifiable Rewards (RLVR) methods typically optimize reasoning at a coarse granularity, treating CoT uniformly without distinguishing their varying degrees of visual grounding. In this work, we conduct a token-level analysis of multimodal reasoning trajectories and show that successful reasoning is characterized by structured token dynamics reflecting both perceptual grounding and exploratory inference. Building upon this analysis, we propose Perception-Exploration Policy Optimization (PEPO), which derives a perception prior from hidden state similarity and integrates it with token entropy through a smooth gating mechanism to produce token-level advantages. PEPO integrates seamlessly with existing RLVR frameworks such as GRPO and DAPO, requiring neither additional supervision nor auxiliary branches. Extensive experiments across diverse multimodal benchmarks demonstrate consistent and robust improvements over strong RL baselines, spanning geometry reasoning, visual grounding, visual puzzle solving, and few-shot classification, while maintaining stable training dynamics. Code: https://github.com/xzxxntxdy/PEPO
[111] Unleashing Spatial Reasoning in Multimodal Large Language Models via Textual Representation Guided Reasoning
Jiacheng Hua,Yishu Yin,Yuhang Wu,Tai Wang,Yifei Huang,Miao Liu
Main category: cs.CV
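TL;DR: This paper proposes TRACE, a method that represents the 3D spatial information in video as text to improve the performance of multimodal large language models (MLLMs) on 3D spatial reasoning tasks.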
Details
Motivation: Existing MLLMs struggle with 3D spatial reasoning because they cannot build structured abstract representations of the 3D environment from video. Method: TRACE (Textual Representation of Allocentric Context from Egocentric Video) is a prompting method that guides MLLMs to generate text-based representations of the 3D environment, covering meta-context, camera trajectories, and object entities, as intermediate reasoning traces. Result: On VSI-Bench and OST-Bench, TRACE yields notable and consistent improvements over prior prompting strategies across MLLM backbones of different parameter scales and training schemas; ablation studies and detailed analyses validate the design and reveal current bottlenecks. Conclusion: TRACE provides an effective, general, and interpretable prompting framework for improving 3D spatial reasoning in MLLMs, advancing embodied intelligence and video understanding. Abstract: Existing Multimodal Large Language Models (MLLMs) struggle with 3D spatial reasoning, as they fail to construct structured abstractions of the 3D environment depicted in video inputs. To bridge this gap, drawing inspiration from cognitive theories of allocentric spatial reasoning, we investigate how to enable MLLMs to model and reason over text-based spatial representations of video. Specifically, we introduce Textual Representation of Allocentric Context from Egocentric Video (TRACE), a prompting method that induces MLLMs to generate text-based representations of 3D environments as intermediate reasoning traces for more accurate spatial question answering. TRACE encodes meta-context, camera trajectories, and detailed object entities to support structured spatial reasoning over egocentric videos. Extensive experiments on VSI-Bench and OST-Bench demonstrate that TRACE yields notable and consistent improvements over prior prompting strategies across a diverse range of MLLM backbones, spanning different parameter scales and training schemas. We further present ablation studies to validate our design choices, along with detailed analyses that probe the bottlenecks of 3D spatial reasoning in MLLMs.
[112] UniQueR: Unified Query-based Feedforward 3D Reconstruction
Chensheng Peng,Quentin Herau,Jiezhi Yang,Yichen Xie,Yihan Hu,Wenzhao Zheng,Matthew Strong,Masayoshi Tomizuka,Wei Zhan
Main category: cs.CV
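TL;DR: UniQueR is a unified feedforward framework based on sparse 3D queries for efficient, accurate 3D reconstruction from unposed images; it substantially improves geometric accuracy and rendering quality while greatly reducing compute and memory cost.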
Details
Motivation: Existing feedforward models (e.g., DUSt3R, VGGT, AnySplat) output only 2.5D results, cannot model occluded regions, and have limited geometric expressiveness. Method: Reconstruction is cast as sparse 3D query inference: the model learns a set of anchor points in global 3D space that act as explicit geometric queries, each query spawns 3D Gaussians for differentiable rendering, and unified query interactions over multi-view features with decoupled cross-attention keep inference efficient. Result: On Mip-NeRF 360 and VR-NeRF, UniQueR surpasses existing feedforward methods in both rendering quality and geometric accuracy while using an order of magnitude fewer primitives than dense methods. Conclusion: UniQueR shows that the sparse 3D query paradigm can reconstruct complete scene geometry effectively and efficiently in a single forward pass, offering a new approach to 3D reconstruction from unposed images. Abstract: We present UniQueR, a unified query-based feedforward framework for efficient and accurate 3D reconstruction from unposed images. Existing feedforward models such as DUSt3R, VGGT, and AnySplat typically predict per-pixel point maps or pixel-aligned Gaussians, which remain fundamentally 2.5D and limited to visible surfaces. In contrast, UniQueR formulates reconstruction as a sparse 3D query inference problem. Our model learns a compact set of 3D anchor points that act as explicit geometric queries, enabling the network to infer scene structure, including geometry in occluded regions, in a single forward pass. Each query encodes spatial and appearance priors directly in global 3D space (instead of per-frame camera space) and spawns a set of 3D Gaussians for differentiable rendering. By leveraging unified query interactions across multi-view features and a decoupled cross-attention design, UniQueR achieves strong geometric expressiveness while substantially reducing memory and computational cost. Experiments on Mip-NeRF 360 and VR-NeRF demonstrate that UniQueR surpasses state-of-the-art feedforward methods in both rendering quality and geometric accuracy, using an order of magnitude fewer primitives than dense alternatives.
[113] SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning
Haoyu Huang,Jinfa Huang,Zhongwei Wan,Xiawu Zheng,Rongrong Ji,Jiebo Luo
Main category: cs.CV
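TL;DR: This paper proposes SpecEyes, a framework that uses a lightweight model to predict the execution path, a cognitive gating mechanism, and a heterogeneous parallel funnel to substantially accelerate multimodal agent inference while preserving or even improving accuracy.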
Details
Motivation: Agentic multimodal LLMs that rely on iterative visual tool calls incur heavy sequential overhead (agentic depth), which leads to high latency and low concurrency. Method: SpecEyes (1) uses a lightweight, tool-free MLLM as a speculative planner to predict the execution trajectory; (2) introduces a cognitive gating mechanism based on answer separability for label-free self-verification; and (3) builds a heterogeneous parallel funnel that masks the serial execution cost of the large model. Result: On V* Bench, HR-Bench, and POPE, SpecEyes achieves a 1.1-3.35x speedup, improves accuracy by up to 6.7%, and significantly raises throughput under concurrent load. Conclusion: SpecEyes breaks the sequential bottleneck of agentic MLLMs, strikes a better balance between speed and accuracy, and offers a practical route to high-concurrency multimodal agent serving. Abstract: Agentic multimodal large language models (MLLMs) (e.g., OpenAI o3 and Gemini Agentic Vision) achieve remarkable reasoning capabilities through iterative visual tool invocation. However, the cascaded perception, reasoning, and tool-calling loops introduce significant sequential overhead. This overhead, termed agentic depth, incurs prohibitive latency and seriously limits system-level concurrency. To this end, we propose SpecEyes, an agentic-level speculative acceleration framework that breaks this sequential bottleneck. Our key insight is that a lightweight, tool-free MLLM can serve as a speculative planner to predict the execution trajectory, enabling early termination of expensive tool chains without sacrificing accuracy. To regulate this speculative planning, we introduce a cognitive gating mechanism based on answer separability, which quantifies the model's confidence for self-verification without requiring oracle labels. Furthermore, we design a heterogeneous parallel funnel that exploits the stateless concurrency of the small model to mask the stateful serial execution of the large model, maximizing system throughput. Extensive experiments on V* Bench, HR-Bench, and POPE demonstrate that SpecEyes achieves 1.1-3.35x speedup over the agentic baseline while preserving or even improving accuracy (up to +6.7%), thereby boosting serving throughput under concurrent workloads.
[114] Gau-Occ: Geometry-Completed Gaussians for Multi-Modal 3D Occupancy Prediction
Chengxin Lv,Yihui Li,Hongyu Yang,YunHong Wang
Main category: cs.CV
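TL;DR: Gau-Occ is an efficient multi-modal framework for 3D semantic occupancy prediction built on semantic 3D Gaussians; a LiDAR completion diffusion model and Gaussian anchor fusion give it high accuracy at low computational cost.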
Details
Motivation: Existing 3D semantic occupancy prediction methods rely on computationally expensive dense voxel or BEV tensors and struggle to balance accuracy and efficiency; multi-modal fusion improves performance but lacks a lightweight, compact representation. Method: Gau-Occ (1) models the scene as a compact set of semantic 3D Gaussians; (2) uses a LiDAR Completion Diffuser (LCD) to recover geometry from sparse LiDAR and initialize robust Gaussian anchors; and (3) introduces Gaussian Anchor Fusion (GAF), which fuses multi-view image semantics through geometry-aligned 2D sampling and cross-modal alignment. Result: State-of-the-art performance on several challenging benchmarks with significantly better computational efficiency. Conclusion: Semantic 3D Gaussians are an effective compact alternative to dense voxels; combined with LCD and GAF they deliver both geometric completeness and semantic discriminability, offering a new paradigm for real-time autonomous-driving perception. Abstract: 3D semantic occupancy prediction is crucial for autonomous driving. While multi-modal fusion improves accuracy over vision-only methods, it typically relies on computationally expensive dense voxel or BEV tensors. We present Gau-Occ, a multi-modal framework that bypasses dense volumetric processing by modeling the scene as a compact collection of semantic 3D Gaussians. To ensure geometric completeness, we propose a LiDAR Completion Diffuser (LCD) that recovers missing structures from sparse LiDAR to initialize robust Gaussian anchors. Furthermore, we introduce Gaussian Anchor Fusion (GAF), which efficiently integrates multi-view image semantics via geometry-aligned 2D sampling and cross-modal alignment. By refining these compact Gaussian descriptors, Gau-Occ captures both spatial consistency and semantic discriminability. Extensive experiments across challenging benchmarks demonstrate that Gau-Occ achieves state-of-the-art performance with significant computational efficiency.
[115] MedObvious: Exposing the Medical Moravec's Paradox in VLMs via Clinical Triage
Ufaq Khan,Umair Nawaz,L D M S S Teja,Numaan Saeed,Muhammad Bilal,Yutong Xie,Mohammad Yaqub,Muhammad Haris Khan
Main category: cs.CV
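TL;DR: This paper introduces MedObvious, a benchmark for evaluating how well medical vision-language models (VLMs) validate their inputs (e.g., consistency checks on modality, anatomy, and viewpoint), and finds that current VLMs remain unreliable at this safety-critical capability.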
Details
Motivation: Medical VLMs can generate fluent diagnostic text but lack the pre-diagnostic sanity checks that verify basic input validity (modality, anatomy, viewpoint, integrity); existing benchmarks skip this step, which hides safety-critical failures. Method: MedObvious is a 1,880-task benchmark focused on set-level consistency over small multi-panel image sets (deciding whether any panel violates expected coherence), spanning five progressive difficulty tiers and five evaluation formats; 17 VLMs are evaluated. Result: Most VLMs perform poorly at sanity checking: some hallucinate anomalies on normal inputs (false positives), performance degrades as the number of images grows, and accuracy differs substantially between multiple-choice and open-ended settings. Conclusion: Pre-diagnostic input validation is an unsolved, safety-critical capability for medical VLMs and must be evaluated and assured as a capability in its own right before clinical deployment is considered. Abstract: Vision Language Models (VLMs) are increasingly used for tasks like medical report generation and visual question answering. However, fluent diagnostic text does not guarantee safe visual understanding. In clinical practice, interpretation begins with pre-diagnostic sanity checks: verifying that the input is valid to read (correct modality and anatomy, plausible viewpoint and orientation, and no obvious integrity violations). Existing benchmarks largely assume this step is solved, and therefore miss a critical failure mode: a model can produce plausible narratives even when the input is inconsistent or invalid. We introduce MedObvious, a 1,880-task benchmark that isolates input validation as a set-level consistency capability over small multi-panel image sets: the model must identify whether any panel violates expected coherence. MedObvious spans five progressive tiers, from basic orientation/modality mismatches to clinically motivated anatomy/viewpoint verification and triage-style cues, and includes five evaluation formats to test robustness across interfaces. Evaluating 17 different VLMs, we find that sanity checking remains unreliable: several models hallucinate anomalies on normal (negative-control) inputs, performance degrades when scaling to larger image sets, and measured accuracy varies substantially between multiple-choice and open-ended settings. These results show that pre-diagnostic verification remains unsolved for medical VLMs and should be treated as a distinct, safety-critical capability before deployment.
[116] A Feature Shuffling and Restoration Strategy for Universal Unsupervised Anomaly Detection
Wei Luo,Haiming Yao,Zhenfeng Qiang,Xiaotian Zhang,Weihang Zhang
Main category: cs.CV
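TL;DR: This paper proposes FSR (Feature Shuffling and Restoration), an unsupervised method for universal anomaly detection that combines multi-scale feature reconstruction, random shuffling and restoration, and an adjustable shuffling rate to alleviate the identical shortcut problem of reconstruction-based methods and improve generalization across settings.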
Details
Motivation: Reconstruction-based unsupervised anomaly detection suffers from the identical shortcut problem: normal and anomalous regions are both reconstructed well, so anomalies are missed. The problem worsens as the normal data distribution grows more complex, and models transfer poorly across scenarios. This paper aims to build a universal anomaly detection model that works across settings. Method: The FSR framework (1) uses multi-scale semantic features rather than raw pixels as reconstruction targets; (2) partitions the features into non-overlapping blocks, shuffles them randomly, and restores them with a restoration network; (3) introduces an adjustable shuffling rate to control task difficulty; and (4) gives theoretical explanations from the perspectives of network structure and mutual information. Result: FSR shows superior and stable detection performance across anomaly detection settings, with markedly better cross-scenario generalization and good efficiency; the code is open source. Conclusion: FSR is a simple, general, and efficient unsupervised anomaly detection framework that alleviates an inherent weakness of reconstruction methods and offers a new approach to building anomaly detectors that are robust across domains. Abstract: Unsupervised anomaly detection is vital in industrial fields, with reconstruction-based methods favored for their simplicity and effectiveness. However, reconstruction methods often encounter an identical shortcut issue, where both normal and anomalous regions are reconstructed well, so the model fails to identify outliers. The severity of this problem increases with the complexity of the normal data distribution. Consequently, existing methods may exhibit excellent detection performance in a specific scenario, but their performance sharply declines when transferred to another scenario. This paper focuses on establishing a universal model applicable to anomaly detection tasks across different settings, termed universal anomaly detection. In this work, we introduce a novel, straightforward yet efficient framework for universal anomaly detection: Feature Shuffling and Restoration (FSR), which can alleviate the identical shortcut issue across different settings. First and foremost, FSR employs multi-scale features with rich semantic information as reconstruction targets, rather than raw image pixels. Subsequently, these multi-scale features are partitioned into non-overlapping feature blocks, which are randomly shuffled and then restored to their original state using a restoration network. This simple paradigm encourages the model to focus more on global contextual information. Additionally, we introduce a novel concept, the shuffling rate, to regulate the complexity of the FSR task, thereby alleviating the identical shortcut across different settings.
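The block shuffling step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the reading of the shuffling rate as "the fraction of blocks that take part in the permutation", and the oracle `restore_blocks` (FSR trains a restoration network instead) are all assumptions made for illustration.

```python
import random


def shuffle_blocks(feature_map, block, rate, seed=0):
    """Permute a fraction `rate` of the non-overlapping block x block tiles.

    feature_map: H x W grid (list of lists); each cell may be a scalar or a vector.
    Returns (shuffled_map, perm), where perm[dst] = src tile index.
    """
    h, w = len(feature_map), len(feature_map[0])
    assert h % block == 0 and w % block == 0, "grid must divide into whole tiles"
    cols = w // block
    n = (h // block) * cols
    rng = random.Random(seed)

    # Assumed meaning of the shuffling rate: how many tiles join the shuffle.
    chosen = rng.sample(range(n), round(rate * n))
    sources = chosen[:]
    rng.shuffle(sources)

    perm = list(range(n))  # identity for tiles that stay in place
    for dst, src in zip(chosen, sources):
        perm[dst] = src

    out = [[None] * w for _ in range(h)]
    for dst, src in enumerate(perm):
        dr, dc = divmod(dst, cols)
        sr, sc = divmod(src, cols)
        for i in range(block):
            for j in range(block):
                out[dr * block + i][dc * block + j] = feature_map[sr * block + i][sc * block + j]
    return out, perm


def restore_blocks(shuffled, block, perm):
    """Oracle inverse of shuffle_blocks (stands in for the learned restoration network)."""
    h, w = len(shuffled), len(shuffled[0])
    cols = w // block
    out = [[None] * w for _ in range(h)]
    for dst, src in enumerate(perm):
        dr, dc = divmod(dst, cols)
        sr, sc = divmod(src, cols)
        for i in range(block):
            for j in range(block):
                out[sr * block + i][sc * block + j] = shuffled[dr * block + i][dc * block + j]
    return out
```

With `rate=0.0` the map is returned unchanged, and with `rate=1.0` every tile is eligible to move, so the rate acts as a difficulty knob for the restoration task in this sketch.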
Furthermore, we provide theoretical explanations for the effectiveness of the FSR framework from two perspectives: network structure and mutual information. Extensive experimental results validate the superiority and efficiency of the FSR framework across different settings. Code is available at https://github.com/luow23/FSR.
[117] Designing to Forget: Deep Semi-parametric Models for Unlearning
Amber Yijia Zheng,Yu-Shan Tai,Raymond A. Yeh
Main category: cs.CV
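TL;DR: This paper proposes deep semi-parametric models (SPMs), whose fusion module allows training samples to be explicitly deleted at test time without modifying model parameters, making machine unlearning far more efficient while preserving task performance.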
Details
Motivation: Existing machine unlearning work focuses on how to remove specific samples from a trained model and overlooks the fact that models differ in how easy they are to unlearn. Method: Deep semi-parametric models (SPMs) add a fusion module that aggregates information from each training sample, which supports explicit test-time deletion of selected samples without updating model parameters. Result: On ImageNet classification, SPMs reduce the prediction gap relative to a retrained (oracle) baseline by 11% and unlearn more than 10× faster than existing approaches on parametric models, while matching parametric models on image classification and generation. Conclusion: The semi-parametric design enables efficient, flexible unlearning without sacrificing task performance, offering a new paradigm for trustworthy AI systems. Abstract: Recent advances in machine unlearning have focused on developing algorithms to remove specific training samples from a trained model. In contrast, we observe that not all models are equally easy to unlearn. Hence, we introduce a family of deep semi-parametric models (SPMs) that exhibit non-parametric behavior during unlearning. SPMs use a fusion module that aggregates information from each training sample, enabling explicit test-time deletion of selected samples without altering model parameters. Empirically, we demonstrate that SPMs achieve competitive task performance to parametric models in image classification and generation, while being significantly more efficient for unlearning. Notably, on ImageNet classification, SPMs reduce the prediction gap relative to a retrained (oracle) baseline by 11% and achieve over 10× faster unlearning compared to existing approaches on parametric models. The code is available at https://github.com/amberyzheng/spm_unlearning.
[118] ForeSea: AI Forensic Search with Multi-modal Queries for Video Surveillance
Hyojin Park,Yi Li,Janghoon Cho,Sungha Choi,Jungsoo Lee,Taotao Jing,Shuai Zhang,Munawar Hayat,Dashan Gao,Ning Bi,Fatih Porikli
Main category: cs.CV
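TL;DR: This paper introduces the ForeSeaQA benchmark and the ForeSea system for multimodal (image plus text) queries and temporal localization over long, multi-camera surveillance video.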
Details
Motivation: Existing surveillance retrieval methods handle multimodal queries (e.g., a person's image combined with a natural-language question) and temporal reasoning poorly, and no benchmark is suited to evaluating this task. Method: The authors build the ForeSeaQA benchmark (multimodal questions with timestamped annotations over long surveillance video) and design ForeSea, a three-stage system: tracking-based filtering, multimodal embedding indexing, and VideoLLM-driven candidate retrieval and event localization. Result: On ForeSeaQA, ForeSea improves accuracy by 3.5% and temporal IoU by 11.0 over prior VideoRAG models; ForeSeaQA is the first benchmark to support complex multimodal queries with precise temporal grounding. Conclusion: ForeSeaQA fills a gap among multimodal video QA benchmarks, and ForeSea is the first system to deliver efficient, accurate forensic-grade video search in this setting, moving surveillance analysis closer to real-world use. Abstract: Despite decades of work, surveillance still struggles to find specific targets across long, multi-camera video. Prior methods (tracking pipelines, CLIP-based models, and VideoRAG) require heavy manual filtering, capture only shallow attributes, and fail at temporal reasoning. Real-world searches are inherently multimodal (e.g., "When does this person join the fight?" with the person's image), yet this setting remains underexplored. In addition, there are no proper benchmarks for evaluating this setting, in which video is queried with multimodal inputs. To address this gap, we introduce ForeSeaQA, a new benchmark specifically designed for video QA with image-and-text queries and timestamped annotations of key events. The dataset consists of long-horizon surveillance footage paired with diverse multimodal questions, enabling systematic evaluation of retrieval, temporal grounding, and multimodal reasoning in realistic forensic conditions. Beyond the benchmark, we propose ForeSea, an AI forensic search system with a 3-stage, plug-and-play pipeline. (1) A tracking module filters irrelevant footage; (2) a multimodal embedding module indexes the remaining clips; and (3) during inference, the system retrieves top-K candidate clips for a Video Large Language Model (VideoLLM) to answer queries and localize events. On ForeSeaQA, ForeSea improves accuracy by 3.5% and temporal IoU by 11.0 over prior VideoRAG models.
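Two pieces of the pipeline above lend themselves to a short sketch: top-K retrieval over an embedding index, and the temporal IoU metric used for event localization. The abstract does not specify the similarity measure or the index structure, so cosine similarity over an in-memory list is an assumption here, and all function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def top_k_clips(query_vec, clip_index, k):
    """Return the ids of the k clips most similar to the query embedding.

    clip_index: list of (clip_id, embedding) pairs, assumed to be the output
    of the indexing stage after irrelevant footage has been filtered out.
    """
    ranked = sorted(clip_index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [clip_id for clip_id, _ in ranked[:k]]


def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) intervals, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0
```

In a full system the retrieved clips would then be passed to a VideoLLM, which is outside the scope of this sketch.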
To our knowledge, ForeSeaQA is the first benchmark to support complex multimodal queries with precise temporal grounding, and ForeSea is the first VideoRAG system built to excel in this setting.
[119] Template-Based Feature Aggregation Network for Industrial Anomaly Detection
Wei Luo,Haiming Yao,Wenyong Yu
Main category: cs.CV
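TL;DR: This paper proposes the Template-based Feature Aggregation Network (TFA-Net), which detects industrial anomalies by aggregating input-image features onto the features of a normal template image, avoiding the shortcut learning of conventional feature-reconstruction methods while achieving high accuracy and real-time speed.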
Details
Motivation: Existing feature-reconstruction methods for industrial anomaly detection suffer from shortcut learning, which lets anomalous features be reconstructed and undermines detection reliability. Method: TFA-Net (1) extracts multi-level features of a template image and an input image from a pre-trained CNN; (2) aggregates input features onto the template features by similarity, filtering out anomalous features; (3) reconstructs the feature map from the fused template features; (4) localizes defects from the difference between input and reconstructed features; and (5) applies random masking to input features for robustness. Result: State-of-the-art detection performance on several real-world industrial datasets while meeting industrial real-time requirements. Conclusion: Template-guided feature aggregation yields a more meaningful reconstruction task and, with a simple architecture, delivers efficient, robust, and practical industrial anomaly detection. Abstract: Industrial anomaly detection plays a crucial role in ensuring product quality control. Therefore, proposing an effective anomaly detection model is of great significance. While existing feature-reconstruction methods have demonstrated excellent performance, they face challenges with shortcut learning, which can lead to undesirable reconstruction of anomalous features. To address this concern, we present a novel feature-reconstruction model called the Template-based Feature Aggregation Network (TFA-Net) for anomaly detection via template-based feature aggregation. Specifically, TFA-Net first extracts multiple hierarchical features from a pre-trained convolutional neural network for a fixed template image and an input image. Instead of directly reconstructing input features, TFA-Net aggregates them onto the template features, effectively filtering out anomalous features that exhibit low similarity to normal template features. Next, TFA-Net utilizes the template features that have already fused normal features in the input features to refine feature details and obtain the reconstructed feature map. Finally, the defective regions can be located by comparing the differences between the input and reconstructed features. Additionally, a random masking strategy for input features is employed to enhance the overall inspection performance of the model. Our template-based feature aggregation schema yields a nontrivial and meaningful feature reconstruction task. The simple, yet efficient, TFA-Net exhibits state-of-the-art detection performance on various real-world industrial datasets.
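The aggregation idea above can be made concrete with a toy sketch. The abstract does not give the aggregation formula, so the softmax-weighted mixing by cosine similarity, the temperature `tau`, the scoring rule, and every function name below are assumptions for illustration, not TFA-Net itself. The sketch also assumes that template and input features are spatially aligned and equal in number.

```python
import math


def _cos(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def aggregate_onto_template(template, inputs, tau=0.1):
    """For each template location, mix input features weighted by similarity.

    Input features that resemble no template feature receive small softmax
    weights, so they contribute little to the aggregated result.
    """
    dim = len(inputs[0])
    aggregated = []
    for t in template:
        logits = [_cos(t, f) / tau for f in inputs]
        top = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(v - top) for v in logits]
        total = sum(weights)
        aggregated.append(
            [sum(w * f[d] for w, f in zip(weights, inputs)) / total for d in range(dim)]
        )
    return aggregated


def anomaly_scores(template, inputs, tau=0.1):
    """Score each location by how far the input feature is from its reconstruction."""
    recon = aggregate_onto_template(template, inputs, tau)
    return [1.0 - _cos(f, r) for f, r in zip(inputs, recon)]
```

In this toy setting a location whose input feature points away from every template feature ends up with a high score, which mirrors the filtering behaviour the abstract describes.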
Additionally, it fulfills the real-time demands of industrial scenarios, rendering it highly suitable for practical applications in the industry. Code is available at https://github.com/luow23/TFA-Net.
[120] Group Editing : Edit Multiple Images in One Go
Yue Ma,Xinyu Wang,Qianli Ma,Qinghe Wang,Mingzhe Zheng,Xiangpeng Yang,Hao Li,Chongbo Zhao,Jixuan Ying,Harry Yang,Hongyu Liu,Qifeng Chen
Main category: cs.CV
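TL;DR: This paper proposes GroupEditing, a framework that fuses explicit geometric correspondences (from VGGT) with implicit temporal relationships (from a pre-trained video model) to edit multiple images consistently, and introduces the GroupEditData dataset and the GroupEditBench benchmark.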
Details
Motivation: Editing a set of related images consistently and uniformly is hard because the images differ widely in pose, viewpoint, and spatial layout, so reliable cross-image semantic correspondences are needed for accurate edits. Method: GroupEditing (1) extracts explicit geometric correspondences with VGGT; (2) treats the image group as a pseudo-video and uses a pre-trained video model to capture implicit relationships; (3) injects the geometric cues into the video model through a new fusion mechanism; (4) builds the GroupEditData dataset; and (5) adds an alignment-enhanced RoPE module to preserve identity. Result: GroupEditing significantly outperforms existing methods in visual quality, cross-view consistency, and semantic alignment, and the GroupEditBench benchmark is released to validate it. Conclusion: By modeling relationships within an image group along explicit and implicit paths, and combining a new dataset, fusion mechanism, and alignment module, GroupEditing improves the consistency and controllability of multi-image editing. Abstract: In this paper, we tackle the problem of performing consistent and unified modifications across a set of related images. This task is particularly challenging because these images may vary significantly in pose, viewpoint, and spatial layout. Achieving coherent edits requires establishing reliable correspondences across the images, so that modifications can be applied accurately to semantically aligned regions. To address this, we propose GroupEditing, a novel framework that builds both explicit and implicit relationships among images within a group. On the explicit side, we extract geometric correspondences using VGGT, which provides spatial alignment based on visual features. On the implicit side, we reformulate the image group as a pseudo-video and leverage the temporal coherence priors learned by pre-trained video models to capture latent relationships. To effectively fuse these two types of correspondences, we inject the explicit geometric cues from VGGT into the video model through a novel fusion mechanism. To support large-scale training, we construct GroupEditData, a new dataset containing high-quality masks and detailed captions for numerous image groups. Furthermore, to ensure identity preservation during editing, we introduce an alignment-enhanced RoPE module, which improves the model's ability to maintain consistent appearance across multiple images. Finally, we present GroupEditBench, a dedicated benchmark designed to evaluate the effectiveness of group-level image editing.
Extensive experiments demonstrate that GroupEditing significantly outperforms existing methods in terms of visual quality, cross-view consistency, and semantic alignment.
[121] SLARM: Streaming and Language-Aligned Reconstruction Model for Dynamic Scenes
Zhicheng Qiu,Jiarui Meng,Tong-an Luo,Yican Huang,Xuan Feng,Xuanfu Li,ZHan Xu
Main category: cs.CV
TL;DR: SLARM is a feed-forward model that unifies dynamic scene reconstruction, semantic understanding, and real-time streaming inference using higher-order motion modeling, language-aligned semantic distillation, and causal attention.
Details
Motivation: To unify dynamic scene reconstruction, semantic understanding, and real-time streaming inference in a single efficient framework without requiring flow supervision. Method: SLARM uses higher-order motion modeling trained on differentiable renderings, distills semantic features from LSeg for language alignment, and employs window-based causal attention for streaming inference. Result: SLARM achieves state-of-the-art results: +21% motion accuracy, +1.6 dB PSNR in reconstruction, and +20% mIoU in segmentation over existing methods. Conclusion: SLARM demonstrates that tight coupling of semantics and geometry, combined with efficient streaming design, significantly improves accuracy, robustness, and efficiency in dynamic scene understanding. Abstract: We propose SLARM, a feed-forward model that unifies dynamic scene reconstruction, semantic understanding, and real-time streaming inference. SLARM captures complex, non-uniform motion through higher-order motion modeling, trained solely on differentiable renderings without any flow supervision. Besides, SLARM distills semantic features from LSeg to obtain language-aligned representations. This design enables semantic querying via natural language, and the tight coupling between semantics and geometry further enhances the accuracy and robustness of dynamic reconstruction. Moreover, SLARM processes image sequences using window-based causal attention, achieving stable, low-latency streaming inference without accumulating memory cost. Within this unified framework, SLARM achieves state-of-the-art results in dynamic estimation, rendering quality, and scene parsing, improving motion accuracy by 21%, reconstruction PSNR by 1.6 dB, and segmentation mIoU by 20% over existing methods.
[122] Dual-Teacher Distillation with Subnetwork Rectification for Black-Box Domain Adaptation
Zhe Zhang,Jing Li,Wanli Xue,Xu Cheng,Jianhua Zhang,Qinghua Hu,Shengyong Chen
Main category: cs.CV
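TL;DR: This paper proposes DDSR, a dual-teacher distillation and subnetwork rectification method for black-box domain adaptation. Without access to source data or the source model, it jointly uses the specific knowledge of the black-box source model and the general semantic information of a vision-language model (ViL) to generate reliable pseudo labels and mitigate overfitting to noisy supervision, then further optimizes the target model through class-prototype self-training.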
Details
Motivation: Existing black-box domain adaptation methods can only use the black-box source model's predictions on target samples, and their performance is limited by noisy pseudo labels or by insufficient use of ViL semantic priors. Method: The DDSR framework comprises (1) dual-teacher distillation, which fuses the complementary predictions of the black-box source model and a ViL to generate pseudo labels; (2) a subnetwork-driven regularization strategy that suppresses overfitting; (3) iterative refinement of the pseudo labels and ViL prompts; and (4) class-prototype self-training of the target model. Result: DDSR significantly outperforms existing state-of-the-art methods on multiple benchmark datasets, including methods that have access to source data or the source model. Conclusion: DDSR overcomes the knowledge-transfer bottleneck of the black-box setting; by combining model-specific knowledge with general semantic priors it achieves more robust and accurate source-free domain adaptation. Abstract: Assuming that neither source data nor the source model is accessible, black box domain adaptation represents a highly practical yet extremely challenging setting, as transferable information is restricted to the predictions of the black box source model, which can only be queried using target samples. Existing approaches attempt to extract transferable knowledge through pseudo label refinement or by leveraging external vision language models (ViLs), but they often suffer from noisy supervision or insufficient utilization of the semantic priors provided by ViLs, which ultimately hinder adaptation performance. To overcome these limitations, we propose a dual teacher distillation with subnetwork rectification (DDSR) model that jointly exploits the specific knowledge embedded in black box source models and the general semantic information of a ViL. DDSR adaptively integrates their complementary predictions to generate reliable pseudo labels for the target domain and introduces a subnetwork driven regularization strategy to mitigate overfitting caused by noisy supervision. Furthermore, the refined target predictions iteratively enhance both the pseudo labels and ViL prompts, enabling more accurate and semantically consistent adaptation. Finally, the target model is further optimized through self training with classwise prototypes. Extensive experiments on multiple benchmark datasets validate the effectiveness of our approach, demonstrating consistent improvements over state of the art methods, including those using source data or models.
[123] ForestPrune: High-ratio Visual Token Compression for Video Multimodal Large Language Models via Spatial-Temporal Forest Modeling
Shaobo Ju,Baiyang Song,Tao Chen,Jiapeng Zhang,Qiong Wu,Chao Chang,HuaiXi Wang,Yiyi Zhou,Rongrong Ji
Main category: cs.CV
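TL;DR: This paper proposes ForestPrune, a training-free token pruning method for video multimodal large language models (MLLMs) that uses spatial-temporal forest modeling to achieve high-ratio, high-accuracy token compression.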
Details
Motivation: Existing token compression methods for video MLLMs perform poorly at high compression ratios, mainly because they model the temporal and continuous content of video insufficiently. Method: ForestPrune builds token forests across video frames under semantic, spatial, and temporal constraints, then scores token importance by tree depth and node role to reach a globally optimal pruning decision. Result: Evaluated on LLaVA-Video and LLaVA-OneVision, ForestPrune retains 95.8% average accuracy while removing 90% of tokens; on MLVU it is 10.1% more accurate than the compared methods, and its pruning time is 81.4% lower than FrameFusion's. Conclusion: ForestPrune is an efficient, accurate, training-free token compression method for video MLLMs that clearly outperforms existing methods. Abstract: Due to the great saving of computation and memory overhead, token compression has become a research hot-spot for MLLMs and achieved remarkable progress in image-language tasks. However, for the video, existing methods still fall short of high-ratio token compression. We attribute this shortcoming to the insufficient modeling of temporal and continual video content, and propose a novel and training-free token pruning method for video MLLMs, termed ForestPrune, which achieves effective and high-ratio pruning via Spatial-temporal Forest Modeling. In practice, ForestPrune constructs token forests across video frames based on semantic, spatial, and temporal constraints, yielding an overall comprehension of the video. Afterwards, ForestPrune evaluates the importance of token trees and nodes based on tree depth and node roles, thereby obtaining a globally optimal pruning decision. To validate ForestPrune, we apply it to two representative video MLLMs, namely LLaVA-Video and LLaVA-OneVision, and conduct extensive experiments on a range of video benchmarks. The experimental results not only show its great effectiveness for video MLLMs, e.g., retaining 95.8% average accuracy while reducing 90% of tokens for LLaVA-OneVision, but also show its superior performance and efficiency over the compared token compression methods, e.g., +10.1% accuracy on MLVU and -81.4% pruning time relative to FrameFusion on LLaVA-Video.
[124] When AVSR Meets Video Conferencing: Dataset, Degradation, and the Hidden Mechanism Behind Performance Collapse
Yihuan Huang,Jun Xue,Liu Jiajun,Daixian Li,Tong Zhang,Zhuolin Yi,Yanzhen Ren,Kai Li
Main category: cs.CV
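TL;DR: This paper presents the first systematic evaluation of existing AVSR models under real video conferencing conditions and finds that transmission distortions and spontaneous human hyper-expression (e.g., the Lombard effect) cause significant performance drops. It builds MLD-VC, the first multimodal dataset for video conferencing, and shows that the acoustic distribution shift introduced by speech enhancement algorithms (especially in the first two formants) is the key problem. Training on Lombard-effect data improves robustness to VC distortions, and fine-tuning reduces average CER by 17.5%.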
Details
Motivation: Existing AVSR models perform well offline, but their robustness in real video conferencing (VC) is unknown, which calls for systematic evaluation and purpose-built data. Method: The authors build MLD-VC, the first multimodal dataset for video conferencing (31 speakers, 22.79 hours of audio-visual data, with the Lombard effect explicitly induced to simulate human hyper-expression), analyze how transmission distortions, speech enhancement, and the Lombard effect change acoustic features such as F1/F2, and fine-tune AVSR models on MLD-VC to test the robustness gain. Result: Speech enhancement algorithms are the main cause of distribution shift, and their effect on the first and second formants resembles that of the Lombard effect; models trained on Lombard data are more robust in VC; fine-tuning on MLD-VC reduces average CER by 17.5% across several VC platforms. Conclusion: The Lombard effect is an effective proxy for modeling VC distortions, and MLD-VC provides key data and methodological insight for improving the robustness and generalization of AVSR in real video conferencing. Abstract: Audio-Visual Speech Recognition (AVSR) has achieved remarkable progress in offline conditions, yet its robustness in real-world video conferencing (VC) remains largely unexplored. This paper presents the first systematic evaluation of state-of-the-art AVSR models across mainstream VC platforms, revealing severe performance degradation caused by transmission distortions and spontaneous human hyper-expression. To address this gap, we construct MLD-VC, the first multimodal dataset tailored for VC, comprising 31 speakers, 22.79 hours of audio-visual data, and explicit use of the Lombard effect to enhance human hyper-expression. Through comprehensive analysis, we find that speech enhancement algorithms are the primary source of distribution shift, which alters the first and second formants of audio. Interestingly, we find that the distribution shift induced by the Lombard effect closely resembles that introduced by speech enhancement, which explains why models trained on Lombard data exhibit greater robustness in VC. Fine-tuning AVSR models on MLD-VC mitigates this issue, achieving an average 17.5% reduction in CER across several VC platforms. Our findings and dataset provide a foundation for developing more robust and generalizable AVSR systems in real-world video conferencing. MLD-VC is available at https://huggingface.co/datasets/nccm2p2/MLD-VC.
[125] FixationFormer: Direct Utilization of Expert Gaze Trajectories for Chest X-Ray Classification
Daniel Beckmann,Benjamin Risse
Main category: cs.CV
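TL;DR: This paper proposes FixationFormer, a transformer-based architecture that represents expert gaze trajectories as token sequences and uses explicit cross-attention between image and gaze sequences to integrate expert diagnostic cues directly and at fine granularity, reaching state-of-the-art classification performance on three chest X-ray datasets.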
Details
Motivation: Expert eye movements carry rich domain knowledge, but their sequential nature, spatio-temporal sparsity, noise, and inter-expert variability make them hard to integrate directly into CNN-based systems, whereas gaze data fits transformer architectures naturally. Method: FixationFormer models expert gaze trajectories as token sequences, models gaze sequences jointly with image features, and fuses the two through cross-modal cross-attention. Result: State-of-the-art classification performance on three public chest X-ray benchmark datasets. Conclusion: Modeling gaze as a sequence within a transformer framework makes more effective use of expert diagnostic cues and significantly improves medical image analysis. Abstract: Expert eye movements provide a rich, passive source of domain knowledge in radiology, offering a powerful cue for integrating diagnostic reasoning into computer-aided analysis. However, direct integration into CNN-based systems, which historically have dominated the medical image analysis domain, is challenging: gaze recordings are sequential, temporally dense yet spatially sparse, noisy, and variable across experts. As a consequence, most existing image-based models utilize reduced representations such as heatmaps. In contrast, gaze naturally aligns with transformer architectures, as both are sequential in nature and rely on attention to highlight relevant input regions. In this work, we introduce FixationFormer, a transformer-based architecture that represents expert gaze trajectories as sequences of tokens, thereby preserving their temporal and spatial structure. By modeling gaze sequences jointly with image features, our approach addresses sparsity and variability in gaze data while enabling a more direct and fine-grained integration of expert diagnostic cues through explicit cross-attention between the image and gaze token sequences. We evaluate our method on three publicly available benchmark chest X-ray datasets and demonstrate that it achieves state-of-the-art classification performance, highlighting the value of representing gaze as a sequence in transformer-based medical image analysis.
[126] Caption Generation for Dongba Paintings via Prompt Learning and Semantic Fusion
Shuangwu Qian,Xiaochan Yuan,Pengfei Liu
Main category: cs.CV
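TL;DR: This paper proposes PVGF-DPC, a framework that combines a content prompt module with a visual semantic-generation fusion loss to make caption generation for Dongba paintings more accurate and culturally appropriate.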
Details
Motivation: Mainstream image captioning models suffer severe domain shift on Dongba paintings, and automatic textual description of these paintings remains largely unexplored. Method: PVGF-DPC is an encoder-decoder framework: a MobileNetV2 encoder extracts visual features, which are injected into a 10-layer Transformer decoder initialized from BERT; a content prompt module generates culture-aware labels (e.g., "deity", "ritual pattern"), and a visual semantic-generation fusion loss jointly optimizes prompt prediction and caption generation. Result: The authors build a Dongba painting captioning dataset of 9,408 augmented images across seven thematic categories and verify that PVGF-DPC is effective at culturally relevant description. Conclusion: PVGF-DPC bridges the gap between generic image captioning and the cultural specificity of Dongba paintings, providing a scalable technical path for the digital preservation of ethnic-minority art. Abstract: Dongba paintings, the treasured pictorial legacy of the Naxi people in southwestern China, feature richly layered visual elements, vivid color palettes, and pronounced ethnic and regional cultural symbolism, yet their automatic textual description remains largely unexplored owing to severe domain shift when mainstream captioning models are applied directly. This paper proposes PVGF-DPC (Prompt and Visual Semantic-Generation Fusion-based Dongba Painting Captioning), an encoder-decoder framework that integrates a content prompt module with a novel visual semantic-generation fusion loss to bridge the gap between generic natural-image captioning and the culturally specific imagery found in Dongba art. A MobileNetV2 encoder extracts discriminative visual features, which are injected into the layer normalization of a 10-layer Transformer decoder initialized with pretrained BERT weights; meanwhile, the content prompt module maps the image feature vector to culture-aware labels (such as deity, ritual pattern, or hell ghost) and constructs a post-prompt that steers the decoder toward thematically accurate descriptions. The visual semantic-generation fusion loss jointly optimizes the cross-entropy objectives of both the prompt predictor and the caption generator, encouraging the model to extract key cultural and visual cues and to produce captions that are semantically aligned with the input image.
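The joint objective described above can be written down as a small sketch. The abstract only says that the two cross-entropy terms are optimized jointly; the plain weighted sum, the `weight` parameter, the per-token averaging, and the function names are assumptions for illustration rather than the paper's definition.

```python
import math


def cross_entropy(probs, target):
    """Negative log-likelihood of the target class under a probability vector."""
    return -math.log(max(probs[target], 1e-12))  # clamp to avoid log(0)


def fusion_loss(prompt_probs, prompt_label, token_probs, token_ids, weight=1.0):
    """Caption cross-entropy plus a weighted prompt-classification cross-entropy.

    prompt_probs: predicted distribution over culture-aware labels.
    token_probs:  one predicted vocabulary distribution per caption position.
    token_ids:    the reference caption as vocabulary indices.
    weight:       hypothetical balancing factor between the two terms.
    """
    prompt_ce = cross_entropy(prompt_probs, prompt_label)
    caption_ce = sum(
        cross_entropy(p, t) for p, t in zip(token_probs, token_ids)
    ) / len(token_ids)
    return caption_ce + weight * prompt_ce
```

Because both terms share the image features, a gradient step on this sum pushes the encoder toward features that serve label prediction and caption generation at once, which is the intuition the abstract gives for the fusion loss.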
We construct a dedicated Dongba painting captioning dataset comprising 9,408 augmented images with culturally grounded annotations spanning seven thematic categories.
[127] Cluster-Wise Spatio-Temporal Masking for Efficient Video-Language Pretraining
Weijun Zhuang,Yuqing Huang,Weikang Meng,Xin Li,Ming Liu,Xiaopeng Hong,Yaowei Wang,Wangmeng Zuo
Main category: cs.CV
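TL;DR: This paper proposes ClusterSTM, a cluster-wise spatio-temporal masking strategy for efficient video-language pretraining. It clusters tokens within each frame and keeps the token with the highest temporal density in each cluster, which mitigates visual information loss and temporal leakage at high masking ratios, and it adds a video-text relevance reconstruction objective, significantly improving performance on several video-language tasks.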
Details
Motivation: Large-scale video-language pretraining is computationally expensive, and existing masked visual modeling methods have two fundamental limitations: severe visual information loss at high masking ratios and temporal information leakage caused by inter-frame correlations. Method: ClusterSTM first clusters the visual tokens of each frame semantically, then masks cluster-wise by retaining the token with the highest temporal density in each cluster; it also adds a video-text relevance reconstruction objective that aligns high-level multimodal semantics rather than only reconstructing pixels. Result: Strong performance on video-text retrieval, video question answering, and video captioning benchmarks, setting a new state of the art among efficient video-language models. Conclusion: Semantics-aware cluster-wise spatio-temporal masking and a high-level semantic reconstruction objective balance pretraining efficiency and representation quality, improving video-language understanding at low computational cost. Abstract: Large-scale video-language pretraining enables strong generalization across multimodal tasks but often incurs prohibitive computational costs. Although recent advances in masked visual modeling help mitigate this issue, they still suffer from two fundamental limitations: severe visual information loss under high masking ratios and temporal information leakage caused by inter-frame correlations. To address these challenges, we propose ClusterSTM, a Cluster-Wise Spatio-Temporal Masking strategy for efficient video-language pretraining. ClusterSTM first performs intra-frame clustering to partition visual tokens into multiple semantically independent clusters, then conducts cluster-wise masking by retaining the token with the highest temporal density within each cluster. Our masking strategy ensures that the retained tokens capture holistic video content while exhibiting strong temporal correlation. Additionally, we introduce a video-text relevance reconstruction objective that aligns high-level multimodal semantics beyond conventional visual reconstruction. Extensive experiments across multiple benchmarks demonstrate that ClusterSTM achieves superior performance on video-text retrieval, video question answering, and video captioning tasks, establishing a new state-of-the-art among efficient video-language models.
[128] Few-Shot Generative Model Adaption via Identity Injection and Preservation
Yeqi He,Liang Li,Jiehua Zhang,Yaoqi Sun,Xichun Sheng,Zhidong Zhao,Chenggang Yan
Main category: cs.CV
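TL;DR: This paper proposes I²P, which uses identity injection and consistency alignment to preserve source-domain identity knowledge during few-shot generative model adaptation and to mitigate mode collapse.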
Details
Motivation: Existing few-shot generative model adaptation methods tend to forget source-domain identity knowledge when transferring to the target domain, which degrades the quality of generated images.
Method: The Identity Injection and Preservation (I²P) framework comprises an identity injection module, which integrates source-domain identity knowledge into the target domain's latent space, and an identity substitution module containing a style-content decoupler and a reconstruction modulator; identity consistency constraints are applied to align features.
Result: Quantitative and qualitative experiments on multiple public datasets and 5 metrics show that the method substantially outperforms state-of-the-art methods.
Conclusion: I²P effectively mitigates forgetting of source-domain knowledge during few-shot adaptation, improving generation quality and identity fidelity in the target domain.
Abstract: Training generative models with limited data presents severe challenges of mode collapse. A common approach is to adapt a large pretrained generative model to a target domain with very few samples (fewer than 10), known as few-shot generative model adaptation. However, existing methods often suffer from forgetting source domain identity knowledge during adaptation, which degrades the quality of generated images in the target domain. To address this, we propose Identity Injection and Preservation (I²P), which leverages identity injection and consistency alignment to preserve the source identity knowledge. Specifically, we first introduce an identity injection module that integrates source domain identity knowledge into the target domain's latent space, ensuring the generated images retain key identity knowledge of the source domain. Second, we design an identity substitution module, which includes a style-content decoupler and a reconstruction modulator, to further enhance source domain identity preservation. We enforce identity consistency constraints by aligning features from identity substitution, thereby preserving identity knowledge. Both quantitative and qualitative experiments show that our method achieves substantial improvements over state-of-the-art methods on multiple public datasets and 5 metrics.
[129] FCL-COD: Weakly Supervised Camouflaged Object Detection with Frequency-aware and Contrastive Learning
Jingchen Ni,Quan Zhang,Dan Jiang,Keyu Lv,Ke Zhang,Chun Yuan
Main category: cs.CV
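TL;DR: This paper proposes FCL-COD, a weakly supervised camouflaged object detection framework based on frequency-aware and contrastive learning. Through the FoRA module, gradient-aware contrastive learning, and multi-scale frequency-aware representation learning, it markedly improves weakly supervised COD performance, even surpassing some fully supervised methods.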
Details
Motivation: Existing weakly supervised camouflaged object detection (WSCOD) methods perform far worse than fully supervised ones, and SAM applied to WSCOD suffers from non-camouflaged target responses, local responses, extreme responses, and insufficient boundary awareness.
Method: FCL-COD has three parts: 1) Frequency-aware Low-rank Adaptation (FoRA), which injects frequency-aware camouflage scene knowledge into SAM to suppress non-camouflaged responses; 2) gradient-aware contrastive learning, which sharpens foreground-background boundary discrimination; 3) multi-scale frequency-aware representation learning, which improves fine-grained boundary modeling.
Result: Experiments on three mainstream COD benchmarks show that the method outperforms state-of-the-art weakly supervised methods and even some fully supervised ones.
Conclusion: Combining frequency awareness with contrastive learning substantially alleviates key challenges in WSCOD, offering a new idea and a practical solution for weakly supervised COD.
Abstract: Existing camouflage object detection (COD) methods typically rely on fully-supervised learning guided by mask annotations. However, obtaining mask annotations is time-consuming and labor-intensive. Compared to fully-supervised methods, existing weakly-supervised COD methods exhibit significantly poorer performance. Even for the Segment Anything Model (SAM), there are still challenges in handling weakly-supervised camouflage object detection (WSCOD), such as (a) non-camouflage target responses, (b) local responses, (c) extreme responses, and (d) lack of refined boundary awareness, which lead to unsatisfactory results in camouflage scenes. To alleviate these issues, we propose a frequency-aware and contrastive learning-based WSCOD framework, named FCL-COD. To mitigate the problem of non-camouflaged object responses, we propose the Frequency-aware Low-rank Adaptation (FoRA) method, which incorporates frequency-aware camouflage scene knowledge into SAM. To overcome the challenges of local and extreme responses, we introduce a gradient-aware contrastive learning approach that effectively delineates precise foreground-background boundaries. Additionally, to address the lack of refined boundary perception, we present a multi-scale frequency-aware representation learning strategy that facilitates the modeling of more refined boundaries. We validate the effectiveness of our approach through extensive empirical experiments on three widely recognized COD benchmarks. The results confirm that our method surpasses state-of-the-art weakly supervised techniques and even fully supervised ones.
[130] WorldMesh: Generating Navigable Multi-Room 3D Scenes via Mesh-Conditioned Image Diffusion
Manuel-Andreas Schneider,Angela Dai
Main category: cs.CV
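TL;DR: This paper proposes a geometry-first approach to 3D scene generation that builds a mesh scaffold to keep large-scale scenes consistent and uses image synthesis models to generate highly realistic appearance.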
Details
Motivation: Existing text-to-image and text-to-video methods struggle to maintain scene- and object-level consistency in large-scale 3D scene generation because they lack explicit geometric modeling.
Method: 3D scene synthesis is decoupled into two steps, building the geometric structure (a mesh scaffold) and synthesizing appearance: an environment geometry mesh is first generated from text; image synthesis, segmentation, and object reconstruction then populate it with object layouts; and renderings of the mesh condition image synthesis.
Result: The approach generates scalable, arbitrarily sized 3D scenes with high object richness and diversity, combining 3D structural consistency with photorealistic detail.
Conclusion: This geometry-first paradigm is an important step toward generating truly environment-scale, immersive 3D worlds.
Abstract: Recent progress in image and video synthesis has inspired their use in advancing 3D scene generation. However, we observe that text-to-image and text-to-video approaches struggle to maintain scene- and object-level consistency beyond a limited environment scale due to the absence of explicit geometry. We thus present a geometry-first approach that decouples this complex problem of large-scale 3D scene synthesis into its structural composition, represented as a mesh scaffold, and realistic appearance synthesis, which leverages powerful image synthesis models conditioned on the mesh scaffold. From an input text description, we first construct a mesh capturing the environment's geometry (walls, floors, etc.), and then use image synthesis, segmentation and object reconstruction to populate the mesh structure with objects in realistic layouts. This mesh scaffold is then rendered to condition image synthesis, providing a structural backbone for consistent appearance generation. This enables scalable, arbitrarily-sized 3D scenes of high object richness and diversity, combining robust 3D consistency with photorealistic detail. We believe this marks a significant step toward generating truly environment-scale, immersive 3D worlds.
[131] VLA-IAP: Training-Free Visual Token Pruning via Interaction Alignment for Vision-Language-Action Models
Jintao Cheng,Haozhe Wang,Weibin Li,Gang Wang,Yipu Zhang,Xiaoyu Tang,Jin Wu,Xieyuanli Chen,Yunhui Liu,Wei Zhang
Main category: cs.CV
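TL;DR: This paper proposes VLA-IAP, a training-free visual token pruning method built on an interaction-first paradigm. Using a geometric prior and a dynamic scheduling strategy, it markedly speeds up VLA model inference while maintaining task success rates.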
Details
Motivation: Existing visual token pruning methods overlook continuous physical interaction, a key property of VLA tasks, and tend to prune structural regions critical to manipulation, causing unstable behavior in early task phases.
Method: The training-free VLA-IAP method introduces a geometric prior mechanism to preserve structural anchors and a dynamic pruning schedule based on semantic-motion alignment, giving adaptive pruning that moves from conservative to aggressive.
Result: On the LIBERO benchmark it reaches a 97.8% success rate with a 1.25× speedup, and up to a 1.54× speedup with performance comparable to the unpruned backbone; it performs well across multiple architectures, three simulation environments, and a real robot platform.
Conclusion: VLA-IAP validates the "interaction-first" paradigm, generalizes well, and offers an efficient, reliable way to deploy VLA models on resource-constrained platforms.
Abstract: Vision-Language-Action (VLA) models have rapidly advanced embodied intelligence, enabling robots to execute complex, instruction-driven tasks. However, as model capacity and visual context length grow, the inference cost of VLA systems becomes a major bottleneck for real-world deployment on resource-constrained platforms. Existing visual token pruning methods mainly rely on semantic saliency or simple temporal cues, overlooking continuous physical interaction, a fundamental property of VLA tasks. Consequently, current approaches often prune visually sparse yet structurally critical regions that support manipulation, leading to unstable behavior during early task phases. To overcome this, we propose a shift toward an explicit Interaction-First paradigm. Our proposed training-free method, VLA-IAP (Interaction-Aligned Pruning), introduces a geometric prior mechanism to preserve structural anchors and a dynamic scheduling strategy that adapts pruning intensity based on semantic-motion alignment. This enables a conservative-to-aggressive transition, ensuring robustness during early uncertainty and efficiency once interaction is locked. Extensive experiments show that VLA-IAP achieves a 97.8% success rate with a 1.25× speedup on the LIBERO benchmark, and up to a 1.54× speedup while maintaining performance comparable to the unpruned backbone. Moreover, the method demonstrates superior and consistent performance across multiple model architectures and three different simulation environments, as well as a real robot platform, validating its strong generalization capability and practical applicability. Our project website is: https://chengjt1999.github.io/VLA-IAP.github.io/.
[132] VQ-Jarvis: Retrieval-Augmented Video Restoration Agent with Sharp Vision and Fast Thought
Xuanyu Zhang,Weiqi Li,Qunliang Xing,Jingfen Xie,Bin Chen,Junlin Li,Li Zhang,Jian Zhang,Shijie Zhao
Main category: cs.CV
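TL;DR: This paper proposes VQ-Jarvis, a retrieval-augmented, all-in-one intelligent video restoration agent with sharper quality perception and a more efficient search strategy. By building the new VSR-Compare dataset and designing a hierarchical scheduling mechanism, it clearly outperforms existing methods on videos with complex degradations.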
Details
Motivation: Real-world video restoration faces heterogeneous degradations, where static architectures and fixed inference pipelines generalize poorly; existing agent-based methods still fall short in quality perception and search efficiency.
Method: The authors build VSR-Compare, the first large-scale video paired enhancement dataset (20K comparison pairs covering 7 degradation types, 11 enhancement operators, and diverse content); train a multi-operator judge model and a degradation perception model; and propose a hierarchical operator scheduling strategy in which easy videos get their optimal restoration trajectory in one step from a RAG library, while hard videos use step-by-step greedy search.
Result: On videos with complex degradations, VQ-Jarvis consistently outperforms existing methods, confirming its advantages in quality perception and search efficiency.
Conclusion: Through "sharp vision" (accurate perception of degradations and results) and "fast thought" (adaptive hierarchical scheduling), VQ-Jarvis achieves more robust and efficient video restoration and offers a new paradigm for intelligent visual agents.
Abstract: Video restoration in real-world scenarios is challenged by heterogeneous degradations, where static architectures and fixed inference pipelines often fail to generalize. Recent agent-based approaches offer dynamic decision making, yet existing video restoration agents remain limited by insufficient quality perception and inefficient search strategies. We propose VQ-Jarvis, a retrieval-augmented, all-in-one intelligent video restoration agent with sharper vision and faster thought. VQ-Jarvis is designed to accurately perceive degradations and subtle differences among paired restoration results, while efficiently discovering optimal restoration trajectories. To enable sharp vision, we construct VSR-Compare, the first large-scale video paired enhancement dataset with 20K comparison pairs covering 7 degradation types, 11 enhancement operators, and diverse content domains. Based on this dataset, we train a multi-operator judge model and a degradation perception model to guide agent decisions. To achieve fast thought, we introduce a hierarchical operator scheduling strategy that adapts to video difficulty: for easy cases, optimal restoration trajectories are retrieved in a one-step manner from a retrieval-augmented generation (RAG) library; for harder cases, a step-by-step greedy search is performed to balance efficiency and accuracy. Extensive experiments demonstrate that VQ-Jarvis consistently outperforms existing methods on complex degraded videos.
[133] Zero-Shot Personalization of Objects via Textual Inversion
Aniket Roy,Maitreya Suin,Rama Chellappa
Main category: cs.CV
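TL;DR: This paper proposes a new training-free framework for fast personalization of arbitrary objects in text-to-image diffusion generation. A learned network predicts object-specific textual inversion embeddings, which are injected into the UNet timesteps, enabling zero-shot customization in a single forward pass.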
Details
Motivation: Existing personalization methods based on identity embeddings generalize poorly to arbitrary object categories and lack the generality and efficiency that real-world use demands.
Method: A learnable network predicts the textual inversion embedding for an arbitrary object and injects it dynamically into the timesteps of the diffusion model's UNet, enabling fast text-conditioned customization.
Result: The method is validated across multiple tasks and settings and supports zero-shot, single-forward-pass, cross-category personalization.
Conclusion: This is the first work to achieve general-purpose, training-free, object-level personalization within diffusion models, opening a new direction for personalized image generation.
Abstract: Recent advances in text-to-image diffusion models have substantially improved the quality of image customization, enabling the synthesis of highly realistic images. Despite this progress, achieving fast and efficient personalization remains a key challenge, particularly for real-world applications. Existing approaches primarily accelerate customization for human subjects by injecting identity-specific embeddings into diffusion models, but these strategies do not generalize well to arbitrary object categories, limiting their applicability. To address this limitation, we propose a novel framework that employs a learned network to predict object-specific textual inversion embeddings, which are subsequently integrated into the UNet timesteps of a diffusion model for text-conditional customization. This design enables rapid, zero-shot personalization of a wide range of objects in a single forward pass, offering both flexibility and scalability. Extensive experiments across multiple tasks and settings demonstrate the effectiveness of our approach, highlighting its potential to support fast, versatile, and inclusive image customization. To the best of our knowledge, this work represents the first attempt to achieve such general-purpose, training-free personalization within diffusion models, paving the way for future research in personalized image generation.
[134] Concept-based explanations of Segmentation and Detection models in Natural Disaster Management
Samar Heydari,Jawher Said,Galip Ümit Yolcu,Evgenii Kortukov,Elena Golimblevskaia,Evgenios Vlachos,Vasileios Mygdalis,Ioannis Pitas,Sebastian Lapuschkin,Leila Arras
Main category: cs.CV
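TL;DR: This paper proposes an explainability framework for flood segmentation and car detection. It extends LRP to handle the sigmoid-gated fusion layers in PIDNet and combines it with PCX to provide local and global concept-level explanations, balancing explanation reliability with real-time operation, and is suitable for resource-constrained platforms such as drones.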
Details
Motivation: Deep learning models in disaster management lack decision transparency, which undermines the human trust needed in emergency response.
Method: A novel redistribution strategy extends Layer-wise Relevance Propagation (LRP) to sigmoid-gated element-wise fusion layers, and Prototypical Concept-based Explanations (PCX) provide local and global explanations at the concept level.
Result: Experiments on a public flood dataset show that the framework gives reliable, interpretable results while retaining near real-time inference, making it suitable for resource-constrained platforms such as drones.
Conclusion: The framework improves the transparency and trustworthiness of PIDNet and YOLO on disaster perception tasks and is feasible for embedded deployment.
Abstract: Deep learning models for flood and wildfire segmentation and object detection enable precise, real-time disaster localization when deployed on embedded drone platforms. However, in natural disaster management, the lack of transparency in their decision-making process hinders human trust required for emergency response. To address this, we present an explainability framework for understanding flood segmentation and car detection predictions on the widely used PIDNet and YOLO architectures. More specifically, we introduce a novel redistribution strategy that extends Layer-wise Relevance Propagation (LRP) explanations for sigmoid-gated element-wise fusion layers. This extension allows LRP relevances to flow through the fusion modules of PIDNet, covering the entire computation graph back to the input image. Furthermore, we apply Prototypical Concept-based Explanations (PCX) to provide both local and global explanations at the concept level, revealing which learned features drive the segmentation and detection of specific disaster semantic classes. Experiments on a publicly available flood dataset show that our framework provides reliable and interpretable explanations while maintaining near real-time inference capabilities, rendering it suitable for deployment on resource-constrained platforms, such as Unmanned Aerial Vehicles (UAVs).
[135] Cog3DMap: Multi-View Vision-Language Reasoning with 3D Cognitive Maps
Chanyoung Gwak,Yoonwoo Jeong,Byungwoo Jeon,Hyunseok Lee,Jinwoo Shin,Minsu Cho
Main category: cs.CV
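TL;DR: This paper proposes Cog3DMap, a framework that recurrently builds an explicit 3D memory carrying semantic and geometric information from multi-view images, improving the spatial understanding of multimodal large language models (MLLMs).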
Details
Motivation: The visual representations of existing MLLMs are mostly semantic and lack explicit geometric grounding; although some methods add geometric cues, the model must still infer 3D structure implicitly, which limits spatial reasoning.
Method: Cog3DMap recurrently builds an explicit 3D memory from multi-view images, in which every token is localized in 3D space and fuses semantic and geometric information, and feeds it to the MLLM for direct spatial reasoning.
Result: State-of-the-art performance on multiple spatial reasoning benchmarks.
Conclusion: Explicit 3D spatial memory markedly strengthens MLLM spatial understanding and reasoning, offering a new paradigm for joint geometric-semantic modeling from multiple views.
Abstract: Precise spatial understanding from multi-view images remains a fundamental challenge for Multimodal Large Language Models (MLLMs), as their visual representations are predominantly semantic and lack explicit geometric grounding. While existing approaches augment visual tokens with geometric cues from visual geometry models, the MLLM is still required to implicitly infer the underlying 3D structure of the scene from these augmented tokens, limiting its spatial reasoning capability. To address this issue, we introduce Cog3DMap, a framework that recurrently constructs an explicit 3D memory from multi-view images, where each token is grounded in 3D space and possesses both semantic and geometric information. By feeding these tokens into the MLLM, our framework enables direct reasoning over a spatially structured 3D map, achieving state-of-the-art performance on various spatial reasoning benchmarks. Code will be made publicly available.
[136] Looking Beyond the Window: Global-Local Aligned CLIP for Training-free Open-Vocabulary Semantic Segmentation
ByeongCheol Lee,Hyun Seok Seong,Sangeek Hyun,Gilhan Park,WonJun Moon,Jae-Pil Heo
Main category: cs.CV
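TL;DR: This paper proposes GLA-CLIP, a framework that uses global-local alignment to resolve semantic inconsistency across windows in sliding-window semantic segmentation. It introduces a proxy anchor and dynamic normalization to improve small-object segmentation, and can be plugged into existing training-free open-vocabulary segmentation methods.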
Details
Motivation: Sliding-window inference eases CLIP's limitations on high-resolution images but fragments semantics across windows; existing methods lack cross-window information exchange and do not attend effectively to outer-window tokens.
Method: GLA-CLIP 1) extends key-value tokens to windows across the whole image for cross-window information exchange; 2) builds a proxy anchor by similarity-based aggregation as a unified semantic reference; 3) adds dynamic normalization that adapts attention strength to object scale.
Result: It markedly improves training-free open-vocabulary semantic segmentation on multiple benchmarks, especially for small objects; the module is plug-and-play and extends the receptive field of existing methods.
Conclusion: GLA-CLIP alleviates the semantic inconsistency introduced by sliding windows and, through global alignment, proxy anchors, and dynamic attention, improves the robustness and generalization of training-free open-vocabulary segmentation.
Abstract: A sliding-window inference strategy is commonly adopted in recent training-free open-vocabulary semantic segmentation methods to overcome the limitations of CLIP in processing high-resolution images. However, this approach introduces a new challenge: each window is processed independently, leading to semantic discrepancy across windows. To address this issue, we propose Global-Local Aligned CLIP (GLA-CLIP), a framework that facilitates comprehensive information exchange across windows. Rather than limiting attention to tokens within individual windows, GLA-CLIP extends key-value tokens to incorporate contextual cues from all windows. Nevertheless, we observe a window bias: outer-window tokens are less likely to be attended, since query features are produced through interactions within the inner window patches, thereby lacking semantic grounding beyond their local context. To mitigate this, we introduce a proxy anchor, constructed by aggregating tokens highly similar to the given query from all windows, which provides a unified semantic reference for measuring similarity across both inner- and outer-window patches. Furthermore, we propose a dynamic normalization scheme that adjusts attention strength according to object scale by dynamically scaling and thresholding the attention map to cope with small-object scenarios. Moreover, GLA-CLIP can be plugged into existing methods to broaden their receptive field. Extensive experiments validate the effectiveness of GLA-CLIP in enhancing training-free open-vocabulary semantic segmentation performance. Code is available at https://github.com/2btlFe/GLA-CLIP.
[137] Generative Event Pretraining with Foundation Model Alignment
Jianwen Cao,Jiaxu Xing,Nico Messikommer,Davide Scaramuzza
Main category: cs.CV
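TL;DR: This paper proposes GEP (Generative Event Pretraining), a two-stage framework that transfers semantic knowledge from image data to event-camera data and models event-specific temporal dynamics, improving the generalization of event foundation models across a range of downstream tasks.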
Details
Motivation: Event cameras offer high temporal resolution and high dynamic range, but their unique sensing characteristics and scarce labeled data make it hard to train event-based visual foundation models (VFMs) that transfer across tasks.
Method: GEP uses two stages: first, a joint regression-contrastive objective aligns the event encoder with a frozen image VFM, mapping event features to image semantics; second, a transformer backbone is autoregressively pretrained on mixed event-image sequences to model the temporal structure of events.
Result: GEP outperforms existing event pretraining methods on diverse downstream tasks, including object recognition, segmentation, and depth estimation.
Conclusion: Combining VFM-guided semantic alignment with generative sequence modeling yields a semantically rich, temporally aware event model with strong cross-domain generalization.
Abstract: Event cameras provide robust visual signals under fast motion and challenging illumination conditions thanks to their microsecond latency and high dynamic range. However, their unique sensing characteristics and limited labeled data make it challenging to train event-based visual foundation models (VFMs), which are crucial for learning visual features transferable across tasks. To tackle this problem, we propose GEP (Generative Event Pretraining), a two-stage framework that transfers semantic knowledge learned from internet-scale image datasets to event data while learning event-specific temporal dynamics. First, an event encoder is aligned to a frozen VFM through a joint regression-contrastive objective, grounding event features in image semantics. Second, a transformer backbone is autoregressively pretrained on mixed event-image sequences to capture the temporal structure unique to events. Our approach outperforms state-of-the-art event pretraining methods on a diverse range of downstream tasks, including object recognition, segmentation, and depth estimation. Together, VFM-guided alignment and generative sequence modeling yield a semantically rich, temporally aware event model that generalizes robustly across domains.
[138] Traffic Sign Recognition in Autonomous Driving: Dataset, Benchmark, and Field Experiment
Guoyang Zhao,Weiqing Qi,Kai Zhang,Chenguang Zhang,Zeying Gong,Zhihai Bi,Kai Chen,Benshan Ma,Ming Liu,Jun Ma
Main category: cs.CV
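TL;DR: This paper presents TS-1M, a large-scale, globally diverse traffic sign recognition (TSR) dataset and diagnostic benchmark with over one million real-world images across 454 standardized sign categories, together with challenge settings for cross-region recognition, long-tailed category recognition, low-clarity robustness, and semantic text understanding. A unified evaluation of supervised learning, self-supervised pretraining, and multimodal vision-language models finds that semantic alignment is key to improving generalization and rare-category recognition; experiments validate its usefulness in real autonomous driving scenarios.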
Details
Motivation: Existing traffic sign datasets and benchmarks cannot diagnose practical challenges such as cross-region variation, long-tailed distributions, and semantic ambiguity, so they do not reveal the true capability boundaries of different modeling paradigms.
Method: The authors build TS-1M, a large-scale, globally diverse dataset (over 1M images, 454 categories); design a multi-dimensional diagnostic benchmark (cross-region, long-tail, low-clarity, semantic text understanding); run a unified evaluation across supervised learning, self-supervised pretraining, and multimodal VLMs; and validate end to end in real autonomous driving scenarios.
Result: Semantic alignment markedly improves cross-region generalization and rare-category recognition; purely visual models are sensitive to appearance changes and data imbalance; in real driving, TS-1M supports map-level decisions driven by semantic reasoning and spatial localization.
Conclusion: TS-1M establishes the first reference-level diagnostic benchmark for TSR and provides principled insights and a practical evaluation tool for building robust, semantics-aware traffic sign perception systems.
Abstract: Traffic Sign Recognition (TSR) is a core perception capability for autonomous driving, where robustness to cross-region variation, long-tailed categories, and semantic ambiguity is essential for reliable real-world deployment. Despite steady progress in recognition accuracy, existing traffic sign datasets and benchmarks offer limited diagnostic insight into how different modeling paradigms behave under these practical challenges. We present TS-1M, a large-scale and globally diverse traffic sign dataset comprising over one million real-world images across 454 standardized categories, together with a diagnostic benchmark designed to analyze model capability boundaries. Beyond standard train-test evaluation, we provide a suite of challenge-oriented settings, including cross-region recognition, rare-class identification, low-clarity robustness, and semantic text understanding, enabling systematic and fine-grained assessment of modern TSR models. Using TS-1M, we conduct a unified benchmark across three representative learning paradigms: classical supervised models, self-supervised pretrained models, and multimodal vision-language models (VLMs). Our analysis reveals consistent paradigm-dependent behaviors, showing that semantic alignment is a key factor for cross-region generalization and rare-category recognition, while purely visual models remain sensitive to appearance shift and data imbalance. Finally, we validate the practical relevance of TS-1M through real-scene autonomous driving experiments, where traffic sign recognition is integrated with semantic reasoning and spatial localization to support map-level decision constraints. Overall, TS-1M establishes a reference-level diagnostic benchmark for TSR and provides principled insights into robust and semantic-aware traffic sign perception. Project page: https://guoyangzhao.github.io/projects/ts1m.
[139] HUydra: Full-Range Lung CT Synthesis via Multiple HU Interval Generative Modelling
António Cardoso,Pedro Sousa,Tania Pereira,Hélder P. Oliveira
Main category: cs.CV
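TL;DR: This paper proposes a new decomposition strategy for lung CT image generation. By modelling separate HU intervals with a multi-head VQVAE architecture, it improves generation quality and efficiency while preserving anatomical consistency.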
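HU windowing, the operation this entry's decomposition builds on, clips intensities to an interval and rescales them. The sketch below is a minimal illustration, not the paper's pipeline: the interval bounds in `HU_INTERVALS` are hypothetical values chosen for the example, and `merge_fixed` is a fixed arithmetic inverse that only works for contiguous intervals, standing in for the learned reconstruction network the abstract describes.

```python
from typing import List, Sequence, Tuple

# Hypothetical, contiguous HU intervals (illustrative values only, not from the paper).
HU_INTERVALS: List[Tuple[float, float]] = [(-1000.0, -500.0), (-500.0, 0.0), (0.0, 400.0)]


def window_hu(values: Sequence[float], lo: float, hi: float) -> List[float]:
    """Clip HU values to [lo, hi] and rescale them linearly to [0, 1]."""
    span = hi - lo
    return [(min(max(v, lo), hi) - lo) / span for v in values]


def decompose(values: Sequence[float],
              intervals: Sequence[Tuple[float, float]] = HU_INTERVALS) -> List[List[float]]:
    """Produce one windowed channel per HU interval."""
    return [window_hu(values, lo, hi) for lo, hi in intervals]


def merge_fixed(channels: Sequence[Sequence[float]],
                intervals: Sequence[Tuple[float, float]] = HU_INTERVALS) -> List[float]:
    """Invert the windowing for contiguous intervals by summing each channel's
    contribution on top of the lowest bound. Values outside the covered range
    come back clipped to it."""
    base = intervals[0][0]
    merged = []
    for i in range(len(channels[0])):
        hu = base
        for channel, (lo, hi) in zip(channels, intervals):
            hu += channel[i] * (hi - lo)
        merged.append(hu)
    return merged


if __name__ == "__main__":
    pixels = [-1200.0, -750.0, -100.0, 40.0, 900.0]
    print(merge_fixed(decompose(pixels)))
```

Each channel spends its full [0, 1] range on one tissue band, which is the usual reason for windowing before modelling.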
Details
Motivation: Computer-aided diagnosis (CAD) models in medical imaging are bottlenecked by data scarcity, especially for lung cancer diagnosis, while directly generating full-HU-range CT images is computationally expensive and difficult.
Method: CT images are decomposed by HU interval, a generative model (such as a multi-head or multi-decoder VQVAE) is trained per interval, and a learnable reconstruction network merges the outputs into a full-range image; the focus is on texture representation and anatomical consistency.
Result: Against a conventional 2D full-range baseline, FID improves by 6.2% and MMD, Precision, and Recall are better in every HU interval; the multi-head VQVAE performs best, balancing visual fidelity, diversity, low complexity, and low computational cost.
Conclusion: The method establishes a new paradigm for structure-aware medical image synthesis that aligns generative modelling with clinical interpretation.
Abstract: Currently, a central challenge and bottleneck in the deployment and validation of computer-aided diagnosis (CAD) models within the field of medical imaging is data scarcity. For lung cancer, one of the most prevalent types worldwide, limited datasets can delay diagnosis and have an impact on patient outcome. Generative AI offers a promising solution for this issue, but dealing with the complex distribution of full Hounsfield Unit (HU) range lung CT scans is challenging and remains a highly computationally demanding task. This paper introduces a novel decomposition strategy that synthesizes CT images one HU interval at a time, rather than modelling the entire HU domain at once. This framework focuses on training generative architectures on individual tissue-focused HU windows, then merges their output into a full-range scan via a learned reconstruction network that effectively reverses the HU-windowing process. We further propose multi-head and multi-decoder models to better capture textures while preserving anatomical consistency, with a multi-head VQVAE achieving the best performance for the generative task. Quantitative evaluation shows this approach significantly outperforms conventional 2D full-range baselines, achieving a 6.2% improvement in FID and superior MMD, Precision, and Recall across all HU intervals. The best performance is achieved by a multi-head VQVAE variant, demonstrating that it is possible to enhance visual fidelity and variability while also reducing model complexity and computational cost. This work establishes a new paradigm for structure-aware medical image synthesis, aligning generative modelling with clinical interpretation.
[140] MLLM-HWSI: A Multimodal Large Language Model for Hierarchical Whole Slide Image Understanding
Basit Alawode,Arif Mahmood,Muaz Khalifa Al-Radi,Shahad Albastaki,Asim Khan,Muhammad Bilal,Moshira Ali Abdalla,Mohammed Bennamoun,Sajid Javed
Main category: cs.CV
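TL;DR: This paper proposes MLLM-HWSI, a hierarchical multimodal large language model for whole slide images (WSIs) that aligns visual features with pathology language at four scales (cell, patch, region, and whole WSI) to enable interpretable, evidence-grounded diagnostic reasoning.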
Details
Motivation: Existing computational pathology multimodal large models compress an entire WSI into a single embedding, which hampers fine-grained grounding and ignores how pathologists reach a diagnosis by integrating evidence across scales.
Method: A four-scale hierarchical representation is built (cell as word, patch as phrase, region as sentence, WSI as paragraph) with a hierarchical contrastive objective and a cross-scale consistency loss; a lightweight Cell-Cell Attention Fusion (CCAF) module aggregates cell-level features; multi-scale visual tokens are fused with text tokens and fed to an instruction-tuned LLM.
Result: State-of-the-art results on 13 WSI-level benchmarks across 6 computational pathology tasks; support for open-ended reasoning, visual question answering, report generation, and captioning.
Conclusion: Through multi-scale vision-language alignment, MLLM-HWSI improves the accuracy and interpretability of WSI understanding and better matches real diagnostic workflows.
Abstract: Whole Slide Images (WSIs) exhibit hierarchical structure, where diagnostic information emerges from cellular morphology, regional tissue organization, and global context. Existing Computational Pathology (CPath) Multimodal Large Language Models (MLLMs) typically compress an entire WSI into a single embedding, which hinders fine-grained grounding and ignores how pathologists synthesize evidence across different scales. We introduce MLLM-HWSI, a Hierarchical WSI-level MLLM that aligns visual features with pathology language at four distinct scales (cell as word, patch as phrase, region as sentence, and WSI as paragraph) to support interpretable evidence-grounded reasoning. MLLM-HWSI decomposes each WSI into multi-scale embeddings with scale-specific projectors and jointly enforces (i) a hierarchical contrastive objective and (ii) a cross-scale consistency loss, preserving semantic coherence from cells to the WSI. We compute diagnostically relevant patches and aggregate segmented cell embeddings into a compact cellular token per patch using a lightweight Cell-Cell Attention Fusion (CCAF) transformer. The projected multi-scale tokens are fused with text tokens and fed to an instruction-tuned LLM for open-ended reasoning, VQA, report, and caption generation tasks. Trained in three stages, MLLM-HWSI achieves new SOTA results on 13 WSI-level benchmarks across six CPath tasks. By aligning language with multi-scale visual evidence, MLLM-HWSI provides accurate, interpretable outputs that mirror diagnostic workflows and advance holistic WSI understanding. Code is available at: https://github.com/BasitAlawode/HWSI-MLLM.
[141] PolarAPP: Beyond Polarization Demosaicking for Polarimetric Applications
Yidong Luo,Chenggong Li,Yunfeng Song,Ping Wang,Boxin Shi,Junchao Zhang,Xin Yuan
Main category: cs.CV
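TL;DR: This paper proposes PolarAPP, the first framework to jointly optimize polarization image demosaicking and downstream tasks such as normal estimation and de-reflection. It aligns semantic features via meta-learning, introduces an equivalent imaging constraint to regress physically meaningful outputs directly, and adds a task-refinement stage to improve accuracy.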
Details
Motivation: The datasets used by existing polarimetric downstream tasks are built by simply extracting and aligning pixels of the same polarization angle, without proper demosaicking, which yields incomplete targets and limits performance; current demosaicking methods pursue only photometric fidelity and ignore downstream needs.
Method: PolarAPP comprises 1) a meta-learning-based feature alignment mechanism that makes the representations of the demosaicking and downstream networks semantically consistent; 2) an equivalent imaging constraint that supports direct regression to physically interpretable outputs; 3) a task-refinement stage that fine-tunes the downstream network on top of the stable demosaicking front-end.
Result: It clearly outperforms existing methods in both demosaicking quality and downstream performance (e.g., normal estimation and de-reflection).
Conclusion: PolarAPP achieves task-aware polarization demosaicking, validates joint optimization of reconstruction and downstream tasks, and gives polarimetric vision systems a better end-to-end solution.
Abstract: Polarimetric imaging enables advanced vision applications such as normal estimation and de-reflection by capturing unique surface-material interactions. However, existing applications (alternatively called downstream tasks) rely on datasets constructed by naively regrouping raw measurements from division-of-focal-plane sensors, where pixels of the same polarization angle are extracted and aligned into sparse images without proper demosaicking. This reconstruction strategy results in suboptimal, incomplete targets that limit downstream performance. Moreover, current demosaicking methods are task-agnostic, optimizing only for photometric fidelity rather than utility in downstream tasks. To this end, we propose PolarAPP, the first framework to jointly optimize demosaicking and its downstream tasks. PolarAPP introduces a feature alignment mechanism that semantically aligns the representations of demosaicking and downstream networks via meta-learning, guiding the reconstruction to be task-aware. It further employs an equivalent imaging constraint for demosaicking training, enabling direct regression to physically meaningful outputs without relying on rearranged data. Finally, a task-refinement stage fine-tunes the task network using the stable demosaicking front-end to further enhance accuracy. Extensive experimental results demonstrate that PolarAPP outperforms existing methods in both demosaicking quality and downstream performance. Code will be made available upon acceptance.
[142] A Synchronized Audio-Visual Multi-View Capture System
Xiangwei Shi,Era Dorta Perez,Ruud de Jong,Ojas Shirekar,Chirag Raman
Main category: cs.CV
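TL;DR: This paper presents a new audio-visual multi-view capture system that emphasizes strict synchronization of audio and video to support timing-sensitive, fine-grained analysis of conversational interaction.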
Details
Motivation: Most existing multi-view capture systems focus on video streams and offer little support for audio acquisition or strict audio-video alignment, both of which are essential for studying conversational interaction (turn-taking, overlap, and prosody).
Method: The authors build an audio-visual multi-view capture system that treats synchronized audio and synchronized video as first-class signals, combining multi-camera and multi-channel microphone recording under a unified timing architecture, with a practical workflow for calibration, acquisition, and quality control.
Result: Synchronization performance is quantified in a real deployment, showing that the recordings are temporally consistent enough for fine-grained analysis and data-driven modeling of conversational behavior.
Conclusion: The system fills a gap in synchronized audio-visual multi-view capture and gives reliable technical support for large-scale, repeatable studies of conversational behavior.
Abstract: Multi-view capture systems have been an important tool in research for recording human motion under controlled conditions. Most existing systems are specified around video streams and provide little or no support for audio acquisition and rigorous audio-video alignment, despite both being essential for studying conversational interaction where timing at the level of turn-taking, overlap, and prosody matters. In this technical report, we describe an audio-visual multi-view capture system that addresses this gap by treating synchronized audio and synchronized video as first-class signals. The system combines a multi-camera pipeline with multi-channel microphone recording under a unified timing architecture and provides a practical workflow for calibration, acquisition, and quality control that supports repeatable recordings at scale. We quantify synchronization performance in deployment and show that the resulting recordings are temporally consistent enough to support fine-grained analysis and data-driven modeling of conversation behavior.
[143] NeuroSeg Meets DINOv3: Transferring 2D Self-Supervised Visual Priors to 3D Neuron Segmentation via DINOv3 Initialization
Yik San Cheng,Runkai Zhao,Weidong Cai
Main category: cs.CV
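TL;DR: This paper proposes a method for adapting the 2D visual foundation model DINOv3 to 3D neuroimaging segmentation, using a filter inflation strategy and a topology-aware skeleton loss to improve the accuracy and morphological fidelity of neuronal structure reconstruction.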
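Filter inflation, the idea this entry relies on, can be illustrated with a common scheme from video models: repeat the 2D kernel along a new depth axis and divide by the depth. That this is the variant the authors use is an assumption; the abstract only says 2D filters are inflated into 3D operators. The helper names below are hypothetical, and plain nested lists are used to keep the sketch dependency-free.

```python
from typing import List

Kernel2D = List[List[float]]
Kernel3D = List[List[List[float]]]


def inflate_kernel(kernel_2d: Kernel2D, depth: int) -> Kernel3D:
    """Repeat a 2D kernel `depth` times along a new leading axis, scaled by 1/depth.

    The scaling keeps the response unchanged on a volume whose slices are all
    identical, so the pretrained 2D behaviour is preserved at initialization.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    return [[[w / depth for w in row] for row in kernel_2d] for _ in range(depth)]


def response_2d(patch: Kernel2D, kernel: Kernel2D) -> float:
    """Single-position correlation: sum of elementwise products."""
    return sum(p * w for prow, krow in zip(patch, kernel) for p, w in zip(prow, krow))


def response_3d(volume: Kernel3D, kernel: Kernel3D) -> float:
    """Single-position 3D correlation, accumulated slice by slice."""
    return sum(response_2d(v_slice, k_slice) for v_slice, k_slice in zip(volume, kernel))


if __name__ == "__main__":
    kernel = [[1.0, 2.0], [3.0, 4.0]]
    patch = [[0.5, 1.0], [1.5, 2.0]]
    volume = [patch] * 4  # a depth-constant volume built from the same slice
    print(response_2d(patch, kernel), response_3d(volume, inflate_kernel(kernel, 4)))
```

On real volumes the slices differ, so the inflated filter is a starting point that fine-tuning then adapts to 3D structure.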
Details
Motivation: There is no 3D foundation model for downstream volumetric neuroimaging analysis, mainly because 3D images are hard to acquire and high-quality annotations are scarce.
Method: An inflation-based adaptation strategy expands DINOv3's 2D filters into 3D operators, and a topology-aware skeleton loss enforces the geometric and topological fidelity of the neuronal arbor structure.
Result: On four neuronal imaging datasets (two from BigNeuron, plus NeuroFly and CWMBS), the method outperforms SoTA methods, with average gains of 2.9% in Entire Structure Average, 2.8% in Different Structure Average, and 3.8% in Percentage of Different Structure.
Conclusion: Transferring knowledge from a 2D visual foundation model to 3D neuroimaging segmentation is feasible and efficient, and the method has advantages in data efficiency and morphological fidelity.
Abstract: 2D visual foundation models, such as DINOv3, a self-supervised model trained on large-scale natural images, have demonstrated strong zero-shot generalization, capturing both rich global context and fine-grained structural cues. However, an analogous 3D foundation model for downstream volumetric neuroimaging remains lacking, largely due to the challenges of 3D image acquisition and the scarcity of high-quality annotations. To address this gap, we propose to adapt the 2D visual representations learned by DINOv3 to a 3D biomedical segmentation model, enabling more data-efficient and morphologically faithful neuronal reconstruction. Specifically, we design an inflation-based adaptation strategy that inflates 2D filters into 3D operators, preserving semantic priors from DINOv3 while adapting to 3D neuronal volume patches. In addition, we introduce a topology-aware skeleton loss to explicitly enforce structural fidelity of graph-based neuronal arbor reconstruction. Extensive experiments on four neuronal imaging datasets, including two from BigNeuron and two public datasets, NeuroFly and CWMBS, demonstrate consistent improvements in reconstruction accuracy over SoTA methods, with average gains of 2.9% in Entire Structure Average, 2.8% in Different Structure Average, and 3.8% in Percentage of Different Structure. Code: https://github.com/yy0007/NeurINO.
[144] AgentFoX: LLM Agent-Guided Fusion with eXplainability for AI-Generated Image Detection
Yangxin Yu,Yue Zhou,Bin Li,Kaiqing Lin,Haodong Li,Jiangqun Ni,Bo Cao
Main category: cs.CV
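TL;DR: This paper proposes AgentFoX, a framework that uses an LLM-driven multi-phase analysis pipeline, fusing expert knowledge with contextual clustering, to deliver explainable and trustworthy AI-generated image detection and to support intelligent integration of future forensic tools.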
Details
Motivation: Existing AI-generated image detectors are usually designed around specific forgery artifacts, which limits their performance and can yield conflicting judgments; a more reliable, explainable, general detection approach is needed. Method: AgentFoX, an LLM-based framework, performs multi-phase analysis: a high-level semantic assessment first, then fine-grained, context-aware synthesis of signal-level expert evidence, with contradictions resolved through structured reasoning. It relies on a knowledge base of Expert Profiles and Clustering Profiles and on a quick-integration fusion mechanism. Result: AgentFoX produces a detailed, human-readable forensic report rather than a bare binary verdict, improving interpretability and trustworthiness for deployment, and demonstrates the feasibility of a scalable agentic paradigm. Conclusion: Beyond a novel, explainable AIGI detection solution, the work introduces a scalable agentic paradigm that supports integration of emerging forensic tools. Abstract: The increasing realism of AI-Generated Images (AIGI) has created an urgent need for forensic tools capable of reliably distinguishing synthetic content from authentic imagery. Existing detectors are typically tailored to specific forgery artifacts, such as frequency-domain patterns or semantic inconsistencies, leading to specialized performance and, at times, conflicting judgments. To address these limitations, we present AgentFoX, a Large Language Model-driven framework that redefines AIGI detection as a dynamic, multi-phase analytical process. Our approach employs a quick-integration fusion mechanism guided by a curated knowledge base comprising calibrated Expert Profiles and contextual Clustering Profiles. During inference, the agent begins with high-level semantic assessment, then transitions to fine-grained, context-aware synthesis of signal-level expert evidence, resolving contradictions through structured reasoning. Instead of returning a coarse binary output, AgentFoX produces a detailed, human-readable forensic report that substantiates its verdict, enhancing interpretability and trustworthiness for real-world deployment. Beyond providing a novel detection solution, this work introduces a scalable agentic paradigm that facilitates intelligent integration of future and evolving forensic tools.
[145] Automatic Segmentation of 3D CT scans with SAM2 using a zero-shot approach
Miquel Lopez Escoriza,Pau Amargant Alvarez
Main category: cs.CV
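TL;DR: This paper studies whether Segment Anything Model 2 (SAM2) can segment 3D CT volumes zero-shot, without fine-tuning. It identifies the lack of inherent volumetric awareness as the main limitation and proposes inference-only architectural and procedural changes (such as treating CT slices as an ordered sequence to fit SAM2's video memory mechanism). A systematic evaluation on the TotalSegmentator dataset shows that SAM2 with frozen weights can produce coherent 3D segmentations.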
Details
Motivation: Foundation models segment natural images well but generalize poorly to 3D medical imaging such as CT, which motivates zero-shot transfer without fine-tuning. Method: SAM2 is modified only at inference time: its video-based memory mechanism is adapted to CT volumes by treating slices as ordered sequences, and ablations cover prompt strategies, memory propagation schemes, and multi-pass refinement. Result: Ablations on a 500-scan TotalSegmentator subset select the best configuration; the final evaluation on 2,500 CT scans shows that frozen-weight SAM2 produces coherent 3D segmentations. Conclusion: Inference-level adaptation lets SAM2 perform zero-shot 3D medical image segmentation, demonstrating that a fully zero-shot approach is feasible for this task. Abstract: Foundation models for image segmentation have shown strong generalization in natural images, yet their applicability to 3D medical imaging remains limited. In this work, we study the zero-shot use of Segment Anything Model 2 (SAM2) for automatic segmentation of volumetric CT data, without any fine-tuning or domain-specific training. We analyze how SAM2 should be applied to CT volumes and identify its main limitation: the lack of inherent volumetric awareness. To address this, we propose a set of inference-only architectural and procedural modifications that adapt SAM2's video-based memory mechanism to 3D data by treating CT slices as ordered sequences. We conduct a systematic ablation study on a subset of 500 CT scans from the TotalSegmentator dataset to evaluate prompt strategies, memory propagation schemes and multi-pass refinement. Based on these findings, we select the best-performing configuration and report final results on a larger sample of the TotalSegmentator dataset comprising 2,500 CT scans. Our results show that, even with frozen weights, SAM2 can produce coherent 3D segmentations when its inference pipeline is carefully structured, demonstrating the feasibility of a fully zero-shot approach for volumetric medical image segmentation.
[146] SMSP: A Plug-and-Play Strategy of Multi-Scale Perception for MLLMs to Perceive Visual Illusions
Jinzhe Tu,Ruilei Guo,Zihan Guo,Junxiao Yang,Shiyao Cui,Minlie Huang
Main category: cs.CV
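TL;DR: This paper introduces the IlluChar dataset and the SMSP framework, reveals a high-frequency attention bias in MLLM visual perception, and shows that a multi-scale perception strategy substantially improves models' recognition of hidden-pattern images.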
Details
Motivation: Current Multimodal Large Language Models (MLLMs) are fragile on hidden-pattern visual illusions that humans recognize easily, exposing a perceptual misalignment with humans and potential safety risks. Method: The authors build the IlluChar illusion dataset, trace model failures to a high-frequency attention bias, and propose the plug-and-play Strategy of Multi-Scale Perception (SMSP), which suppresses distracting high-frequency backgrounds so that the generated images are closer to human perception. Result: SMSP significantly improves all evaluated MLLMs on illusion images; for example, the accuracy of Qwen3-VL-8B-Instruct rises from 13.0% to 84.0%. Conclusion: The work exposes a key deficiency in MLLM visual perception and offers a practical, robust remedy that moves models toward human-aligned perception. Abstract: Recent works have shown that Multimodal Large Language Models (MLLMs) are highly vulnerable to hidden-pattern visual illusions, where the hidden content is imperceptible to models but obvious to humans. This deficiency highlights a perceptual misalignment between current MLLMs and humans, and also introduces potential safety concerns. To systematically investigate this failure, we introduce IlluChar, a comprehensive and challenging illusion dataset, and uncover a key underlying mechanism for the models' failure: high-frequency attention bias, where the models are easily distracted by high-frequency background textures in illusion images, causing them to overlook hidden patterns. To address the issue, we propose the Strategy of Multi-Scale Perception (SMSP), a plug-and-play framework that aligns with human visual perceptual strategies. By suppressing distracting high-frequency backgrounds, SMSP generates images closer to human perception. Our experiments demonstrate that SMSP significantly improves the performance of all evaluated MLLMs on illusion images, for instance, increasing the accuracy of Qwen3-VL-8B-Instruct from 13.0% to 84.0%. Our work provides novel insights into MLLMs' visual perception, and offers a practical and robust solution to enhance it. Our code is publicly available at https://github.com/Tujz2023/SMSP.
[147] PiCo: Active Manifold Canonicalization for Robust Robotic Visual Anomaly Detection
Teng Yan,Binkai Liu,Shuai Liu,Yue Yu,Bingzhuo Zhong
Main category: cs.CV
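TL;DR: This paper proposes PiCo, a framework that shifts to an active canonicalization paradigm to address the performance bottleneck that 6-DoF pose variation and unstable conditions such as illumination changes cause in industrial robotic visual anomaly detection.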
Details
Motivation: Industrially deployed robotic visual anomaly detection is constrained by passive perception: under diverse 6-DoF pose configurations and unstable operating conditions such as illumination changes and shadows, intrinsic semantic anomalies and physical disturbances coexist and interact. Method: The paper proposes the Active Canonicalization paradigm and PiCo (Pose-in-Condition Canonicalization), a unified two-stage framework. Stage one, Active Physical Canonicalization, has the robot reorient objects to reduce geometric uncertainty. Stage two, Neural Latent Canonicalization, uses a three-stage denoising hierarchy (photometric processing, latent feature refinement, semantic contextual reasoning) to remove nuisance factors progressively. Result: On the large-scale M2AD benchmark, PiCo reaches 93.7% O-AUROC (a 3.7% improvement over prior methods in static settings) and 98.5% accuracy in active closed-loop scenarios. Conclusion: Active manifold canonicalization is critical for robust embodied perception. Abstract: Industrial deployment of robotic visual anomaly detection (VAD) is fundamentally constrained by passive perception under diverse 6-DoF pose configurations and unstable operating conditions such as illumination changes and shadows, where intrinsic semantic anomalies and physical disturbances coexist and interact. To overcome these limitations, a paradigm shift from passive feature learning to Active Canonicalization is proposed. PiCo (Pose-in-Condition Canonicalization) is introduced as a unified framework that actively projects observations onto a condition-invariant canonical manifold. PiCo operates through a cascaded mechanism. The first stage, Active Physical Canonicalization, enables a robotic agent to reorient objects in order to reduce geometric uncertainty at its source. The second stage, Neural Latent Canonicalization, adopts a three-stage denoising hierarchy consisting of photometric processing at the input level, latent refinement at the feature level, and contextual reasoning at the semantic level, progressively eliminating nuisance factors across representational scales. Extensive evaluations on the large-scale M2AD benchmark demonstrate the superiority of this paradigm. PiCo achieves a state-of-the-art 93.7% O-AUROC, representing a 3.7% improvement over prior methods in static settings, and attains 98.5% accuracy in active closed-loop scenarios. These results demonstrate that active manifold canonicalization is critical for robust embodied perception.
[148] 3rd Place of MeViS-Audio Track of the 5th PVUW: VIRST-Audio
Jihwan Hong,Jaeyoung Do
Main category: cs.CV
TL;DR: This paper introduces VIRST-Audio, a framework for audio-based referring video object segmentation that converts audio to text via ASR and leverages a pretrained text-based RVOS model with an existence-aware gating mechanism to improve robustness and reduce hallucinations.
Details
Motivation: Bridging acoustic signals with spatio-temporal visual representations in audio-based referring video object segmentation is challenging; existing methods often lack robustness and suffer from hallucinated masks. Method: VIRST-Audio converts audio queries to text using ASR, reuses a pretrained vision-language RVOS model for text-driven segmentation, and introduces an existence-aware gating mechanism to suppress predictions when the target object is absent. Result: VIRST-Audio achieves 3rd place on the MeViS-Audio track of the 5th PVUW Challenge, showing strong generalization and stable performance. Conclusion: Converting audio to text and leveraging existing vision-language models, combined with existence-aware gating, is an effective and practical approach for ARVOS without requiring audio-specific training. Abstract: Audio-based Referring Video Object Segmentation (ARVOS) requires grounding audio queries into pixel-level object masks over time, posing challenges in bridging acoustic signals with spatio-temporal visual representations. In this report, we present VIRST-Audio, a practical framework built upon a pretrained RVOS model integrated with a vision-language architecture. Instead of relying on audio-specific training, we convert input audio into text using an ASR module and perform segmentation using text-based supervision, enabling effective transfer from text-based reasoning to audio-driven scenarios. To improve robustness, we further incorporate an existence-aware gating mechanism that estimates whether the referred target object is present in the video and suppresses predictions when it is absent, reducing hallucinated masks and stabilizing segmentation behavior. We evaluate our approach on the MeViS-Audio track of the 5th PVUW Challenge, where VIRST-Audio achieves 3rd place, demonstrating strong generalization and reliable performance in audio-based referring video segmentation.
[149] InterDyad: Interactive Dyadic Speech-to-Video Generation by Querying Intermediate Visual Guidance
Dongwei Pan,Longwei Guo,Jiazhi Guan,Luying Huang,Yiding Li,Haojie Liu,Haocheng Feng,Wei He,Kaisiyuan Wang,Hang Zhou
Main category: cs.CV
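TL;DR: This paper proposes InterDyad, a framework that synthesizes natural two-person interaction videos through structural motion guidance. It combines cross-identity motion priors, MetaQuery-based modality alignment, and a multimodal large language model that interprets linguistic intent, and adds role-aware Gaussian guidance to improve lip sync and spatial consistency, significantly outperforming existing methods.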
Details
Motivation: Existing speech-to-video synthesis methods struggle to model cross-individual dependencies in dyadic interaction and lack fine-grained control over reactive behaviors. Method: InterDyad comprises: 1) an Interactivity Injector that performs video reenactment from identity-agnostic motion priors; 2) a MetaQuery mechanism that aligns audio with those motion priors; 3) a Multimodal Large Language Model (MLLM) that extracts linguistic intent to govern the timing and appropriateness of reactions; 4) Role-aware Dyadic Gaussian Guidance (RoDG) that improves lip sync and spatial consistency under extreme head poses; 5) a dedicated evaluation suite for dyadic interaction. Result: InterDyad significantly outperforms state-of-the-art methods, producing more natural and contextually grounded two-person interaction videos; a public project page with demo videos is provided. Conclusion: InterDyad offers a unified, controllable, semantics-driven approach to dyadic interaction video synthesis that addresses cross-individual modeling and fine-grained reaction control. Abstract: Despite progress in speech-to-video synthesis, existing methods often struggle to capture cross-individual dependencies and provide fine-grained control over reactive behaviors in dyadic settings. To address these challenges, we propose InterDyad, a framework that enables naturalistic interactive dynamics synthesis via querying structural motion guidance. Specifically, we first design an Interactivity Injector that achieves video reenactment based on identity-agnostic motion priors extracted from reference videos. Building upon this, we introduce a MetaQuery-based modality alignment mechanism to bridge the gap between conversational audio and these motion priors. By leveraging a Multimodal Large Language Model (MLLM), our framework is able to distill linguistic intent from audio to dictate the precise timing and appropriateness of reactions. To further improve lip-sync quality under extreme head poses, we propose Role-aware Dyadic Gaussian Guidance (RoDG) for enhanced lip-synchronization and spatial consistency. Finally, we introduce a dedicated evaluation suite with newly designed metrics to quantify dyadic interaction. Comprehensive experiments demonstrate that InterDyad significantly outperforms state-of-the-art methods in producing natural and contextually grounded two-person interactions. Please refer to our project page for demo videos: https://interdyad.github.io/.
[150] VoDaSuRe: A Large-Scale Dataset Revealing Domain Shift in Volumetric Super-Resolution
August Leander Høeg,Sophia Wiinberg Bardenfleth,Hans Martin Kjer,Tim Bjørn Dyrby,Vedrana Andersen Dahl,Anders Bjorholm Dahl
Main category: cs.CV
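TL;DR: This paper shows that the performance of current volumetric super-resolution (SR) methods is overstated when they are trained on downsampled data, and introduces VoDaSuRe, a large-scale volumetric dataset of real paired high- and low-resolution scans. It argues that progress requires training and evaluating on real low-resolution scans rather than downsampled ones.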
Details
Motivation: Existing volumetric super-resolution methods mostly train on artificially downsampled low-resolution data because real paired high- and low-resolution 3D datasets are scarce, so reported performance is distorted and does not reflect behavior on real scans. Method: The authors build and release VoDaSuRe, a large-scale volumetric dataset of real paired high- and low-resolution scans, and compare SR models (transformer- and CNN-based) trained on downsampled data against models trained on VoDaSuRe, examining reconstruction behavior and structural fidelity on real low-resolution scans. Result: 1) Models trained on downsampled data produce sharper predictions that are inaccurate on real scans; 2) models trained on real low-resolution scans smooth fine structures; 3) current methods do not recover fine structures lost in low-resolution scans and instead predict a smoothed average. Conclusion: The performance of current deep learning-based volumetric SR is overstated; progress requires datasets with real, high-complexity paired scans, such as VoDaSuRe, for training and evaluation. Abstract: Recent advances in volumetric super-resolution (SR) have demonstrated strong performance in medical and scientific imaging, with transformer- and CNN-based approaches achieving impressive results even at extreme scaling factors. In this work, we show that much of this performance stems from training on downsampled data rather than real low-resolution scans. This reliance on downsampling is partly driven by the scarcity of paired high- and low-resolution 3D datasets. To address this, we introduce VoDaSuRe, a large-scale volumetric dataset containing paired high- and low-resolution scans. When training models on VoDaSuRe, we reveal a significant discrepancy: SR models trained on downsampled data produce substantially sharper predictions than those trained on real low-resolution scans, which smooth fine structures. Conversely, applying models trained on downsampled data to real scans preserves more structure but is inaccurate. Our findings suggest that current SR methods are overstated: when applied to real data, they do not recover structures lost in low-resolution scans and instead predict a smoothed average. We argue that progress in deep learning-based volumetric SR requires datasets with paired real scans of high complexity, such as VoDaSuRe. Our dataset and code are publicly available through: https://augusthoeg.github.io/VoDaSuRe/
[151] Conformal Cross-Modal Active Learning
Huy Hoang Nguyen,Cédric Jung,Shirin Salehi,Tobias Glück,Anke Schmeink,Andreas Kugi
Main category: cs.CV
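TL;DR: This paper proposes Conformal Cross-Modal Acquisition (CCMA), an active learning framework that uses a vision-language model (VLM) as a teacher to provide semantically grounded, conformally calibrated uncertainty estimates for a vision-only student model, improving data efficiency.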
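To make "conformally calibrated uncertainty" concrete, here is a minimal sketch of standard split conformal prediction used as an acquisition score. It is not the CCMA scoring rule, which this entry does not spell out: the nonconformity score (one minus the teacher's probability for the true class), the use of prediction-set size as uncertainty, and all function names are assumptions for illustration.

```python
import math

import numpy as np


def conformal_threshold(cal_probs: np.ndarray, cal_labels: np.ndarray,
                        alpha: float = 0.1) -> float:
    """Split-conformal quantile of the scores 1 - p(true class)."""
    n = len(cal_labels)
    scores = np.sort(1.0 - cal_probs[np.arange(n), cal_labels])
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        # Too few calibration points: every class must be kept in the set.
        return float("inf")
    return float(scores[rank - 1])


def prediction_set_sizes(probs: np.ndarray, qhat: float) -> np.ndarray:
    """Number of classes whose score 1 - p falls under the threshold."""
    return ((1.0 - probs) <= qhat).sum(axis=1)


def select_for_labeling(probs: np.ndarray, qhat: float, budget: int) -> np.ndarray:
    """Pick the `budget` pool samples with the largest prediction sets."""
    sizes = prediction_set_sizes(probs, qhat)
    return np.argsort(-sizes, kind="stable")[:budget]
```

A larger prediction set means the teacher cannot rule out as many classes at the chosen coverage level, so those samples are requested first. A diversity term, which the entry says CCMA also uses, would be combined with this score and is omitted here.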
Details
Motivation: Existing active learning methods overlook the rich multimodal knowledge in modern vision-language models, and the potential of vision foundation models for data-efficient learning remains largely untapped. Method: CCMA uses a pretrained VLM as a teacher, applies conformal prediction to produce semantically grounded uncertainty estimates, and combines them with diversity-aware selection strategies to choose samples for training a vision-only student model. Result: CCMA consistently outperforms state-of-the-art active learning baselines across multiple benchmarks, with clear advantages over methods that rely solely on uncertainty or diversity metrics. Conclusion: Combining multimodal conformal scoring with diversity-aware selection improves the data efficiency of active learning and shows that VLM prior knowledge can reduce the annotation needs of vision models. Abstract: Foundation models for vision have transformed visual recognition with powerful pretrained representations and strong zero-shot capabilities, yet their potential for data-efficient learning remains largely untapped. Active Learning (AL) aims to minimize annotation costs by strategically selecting the most informative samples for labeling, but existing methods largely overlook the rich multimodal knowledge embedded in modern vision-language models (VLMs). We introduce Conformal Cross-Modal Acquisition (CCMA), a novel AL framework that bridges vision and language modalities through a teacher-student architecture. CCMA employs a pretrained VLM as a teacher to provide semantically grounded uncertainty estimates, conformally calibrated to guide sample selection for a vision-only student model. By integrating multimodal conformal scoring with diversity-aware selection strategies, CCMA achieves superior data efficiency across multiple benchmarks. Our approach consistently outperforms state-of-the-art AL baselines, demonstrating clear advantages over methods relying solely on uncertainty or diversity metrics.
[152] Dual Contrastive Network for Few-Shot Remote Sensing Image Scene Classification
Zhong Ji,Liyuan Hou,Xuan Wang,Gang Wang,Yanwei Pang
Main category: cs.CV
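TL;DR: This paper proposes a transfer-based Dual Contrastive Network (DCN) whose two supervised contrastive learning branches, one context-guided and one detail-guided, strengthen inter-class discriminability and intra-class invariance respectively, addressing few-shot remote sensing scene classification where samples are scarce, inter-class variance is small, and intra-class variance is large.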
Details
Motivation: With only a few labeled samples, remote sensing scene classification faces the inherent challenges of small inter-class variance and large intra-class variance, and existing methods struggle to be both discriminative and robust. Method: The Dual Contrastive Network (DCN) has two branches: 1) a Context-guided Contrastive Learning (CCL) branch that extracts context features with a Condenser Network and applies supervised contrastive learning to them; 2) a Detail-guided Contrastive Learning (DCL) branch that highlights local details with a Smelter Network and applies supervised contrastive learning to the detail feature maps to learn spatially invariant features. Result: Extensive experiments on four public remote sensing datasets demonstrate the competitive performance of DCN. Conclusion: By jointly modeling context and detail through contrastive learning, DCN improves the accuracy and generalization of few-shot remote sensing scene classification. Abstract: Few-shot remote sensing image scene classification (FS-RSISC) aims at classifying remote sensing images with only a few labeled samples. The main challenges lie in small inter-class variances and large intra-class variances, which are inherent properties of remote sensing images. To address these challenges, we propose a transfer-based Dual Contrastive Network (DCN), which incorporates two auxiliary supervised contrastive learning branches during the training process. Specifically, one is a Context-guided Contrastive Learning (CCL) branch and the other is a Detail-guided Contrastive Learning (DCL) branch, which focus on inter-class discriminability and intra-class invariance, respectively. In the CCL branch, we first devise a Condenser Network to capture context features, and then apply supervised contrastive learning on top of the obtained context features to help the model learn more discriminative features. In the DCL branch, a Smelter Network is designed to highlight the significant local detail information. We then construct a supervised contrastive learning objective based on the detail feature maps to fully exploit the spatial information in each map, enabling the model to concentrate on invariant detail features. Extensive experiments on four public benchmark remote sensing datasets demonstrate the competitive performance of our proposed DCN.
[153] GSwap: Realistic Head Swapping with Dynamic Neural Gaussian Field
Jingtao Zhou,Xuan Gao,Dongyu Liu,Junhui Hou,Yudong Guo,Juyong Zhang
Main category: cs.CV
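TL;DR: GSwap is a video head-swapping system built on dynamic neural Gaussian portrait priors. By embedding a 3D Gaussian feature field in the surface of the full-body SMPL-X model, it achieves high-quality, 3D-consistent full-head replacement with natural motion.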
Details
Motivation: Existing methods, limited by 2D generative models or 3DMM, show clear deficiencies in 3D consistency, expression naturalness, synthesis quality, full-head modeling, and background blending, and are prone to artifacts and misalignments. Method: GSwap: 1) embeds dynamic neural Gaussian portrait priors in the SMPL-X surface to build an intrinsic 3D Gaussian feature field; 2) adapts a pretrained 2D portrait generative model to the source head domain using a few reference images; 3) uses a neural re-rendering strategy to blend foreground and background seamlessly. Result: GSwap surpasses existing methods in visual quality, temporal coherence, identity preservation, and 3D consistency. Conclusion: GSwap offers a more robust, realistic, and consistent approach to video head swapping, moving the field toward high-fidelity, 3D-controllable generation. Abstract: We present GSwap, a novel consistent and realistic video head-swapping system empowered by dynamic neural Gaussian portrait priors, which significantly advances the state of the art in face and head replacement. Unlike previous methods that rely primarily on 2D generative models or 3D Morphable Face Models (3DMM), our approach overcomes their inherent limitations, including poor 3D consistency, unnatural facial expressions, and restricted synthesis quality. Moreover, existing techniques struggle with full head-swapping tasks due to insufficient holistic head modeling and ineffective background blending, often resulting in visible artifacts and misalignments. To address these challenges, GSwap introduces an intrinsic 3D Gaussian feature field embedded within a full-body SMPL-X surface, effectively elevating 2D portrait videos into a dynamic neural Gaussian field. This innovation ensures high-fidelity, 3D-consistent portrait rendering while preserving natural head-torso relationships and seamless motion dynamics. To facilitate training, we adapt a pretrained 2D portrait generative model to the source head domain using only a few reference images, enabling efficient domain adaptation. Furthermore, we propose a neural re-rendering strategy that harmoniously integrates the synthesized foreground with the original background, eliminating blending artifacts and enhancing realism. Extensive experiments demonstrate that GSwap surpasses existing methods in multiple aspects, including visual quality, temporal coherence, identity preservation, and 3D consistency.
[154] Gimbal360: Differentiable Auto-Leveling for Canonicalized $360^\circ$ Panoramic Image Completion
Yuqin Lu,Haofeng Liu,Yang Zhou,Jun Liang,Shengfeng He,Jing Li
Main category: cs.CV
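TL;DR: This paper proposes Gimbal360, which introduces a canonical viewing space and a differentiable auto-leveling module to resolve the geometric and topological mismatch in generating structurally consistent 360° panoramas from arbitrary-view images, and reaches SOTA performance together with the Horizon360 dataset.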
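The topological issue comes from the left and right edges of an equirectangular (ERP) panorama being the same meridian. One common way to respect that periodicity, shown here purely as an illustration, is to wrap features horizontally before a convolution. This is not claimed to be Gimbal360's mechanism; `pad_erp` is a hypothetical helper, and replicating the top and bottom rows is a simplification of how the poles behave.

```python
import numpy as np


def pad_erp(feature_map: np.ndarray, pad: int) -> np.ndarray:
    """Pad an (H, W, C) equirectangular feature map.

    Longitude is periodic, so columns wrap around; latitude is not,
    so rows are replicated at the top and bottom (a simplification).
    """
    wrapped = np.pad(feature_map, ((0, 0), (pad, pad), (0, 0)), mode="wrap")
    return np.pad(wrapped, ((pad, pad), (0, 0), (0, 0)), mode="edge")


# Tiny 2 x 4 single-channel map so the wrap-around is easy to inspect.
erp = np.arange(8).reshape(2, 4, 1)
padded = pad_erp(erp, pad=1)
```

With this padding, a filter centred on the first column sees the last column as its left neighbour, so no artificial seam is introduced at the image border.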
Details
Motivation: Diffusion models excel at 2D outpainting but are hard to apply directly to generating 360° spherical panoramas from unposed perspective images, mainly because of the geometric mismatch (such as projective distortion) and topological mismatch (such as the periodic boundary of ERP images) between perspective projections and spherical panoramas. Method: Gimbal360: 1) builds a Canonical Viewing Space as a geometrically regularized intermediate representation between perspective views and spherical panoramas; 2) designs a Differentiable Auto-Leveling module that stabilizes feature orientation without camera parameters at inference; 3) enforces topological equivariance in the latent space to preserve the $S^1$ periodic continuity of ERP images; 4) introduces Horizon360, a curated large-scale dataset of gravity-aligned panoramas. Result: Gimbal360 achieves state-of-the-art performance in structurally consistent 360° scene completion. Conclusion: Explicitly standardizing geometric priors (viewing space plus auto-leveling) and topological priors (periodicity constraints) is key to high-quality, structurally consistent 360° panorama generation. Abstract: Diffusion models excel at 2D outpainting, but extending them to $360^\circ$ panoramic completion from unposed perspective images is challenging due to the geometric and topological mismatch between perspective projections and spherical panoramas. We present Gimbal360, a principled framework that explicitly bridges perspective observations and spherical panoramas. We introduce a Canonical Viewing Space that regularizes projective geometry and provides a consistent intermediate representation between the two domains. To anchor in-the-wild inputs to this space, we propose a Differentiable Auto-Leveling module that stabilizes feature orientation without requiring camera parameters at inference. Panoramic generation also introduces a topological challenge. Standard generative architectures assume a bounded Euclidean image plane, while Equirectangular Projection (ERP) panoramas exhibit intrinsic $S^1$ periodicity. Euclidean operations therefore break boundary continuity. We address this mismatch by enforcing topological equivariance in the latent space to preserve seamless periodic structure. To support this formulation, we introduce Horizon360, a curated large-scale dataset of gravity-aligned panoramic environments. Extensive experiments show that explicitly standardizing geometric and topological priors enables Gimbal360 to achieve state-of-the-art performance in structurally consistent $360^\circ$ scene completion.
[155] ViKey: Enhancing Temporal Understanding in Videos via Visual Prompting
Yeonkyung Lee,Dayun Ju,Youngmin Kim,Seil Kang,Seong Jae Hwang
Main category: cs.CV
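TL;DR: This paper proposes ViKey, a training-free framework that combines visual prompting (VP) with a Keyword-Frame Mapping (KFM) module to substantially improve the temporal reasoning of VideoLLMs under sparse frame sampling, approaching dense-frame performance on some datasets with as few as 20% of frames.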
Details
Motivation: Frame selection methods cut computational cost but hurt performance on tasks that require temporal reasoning; unlike humans, VideoLLMs struggle to infer event order from sparse frames. Method: ViKey is a training-free framework that combines visual prompting (annotating each frame with ordinal information) with a lightweight Keyword-Frame Mapping (KFM) module that uses frame indices as keys to anchor textual cues explicitly to the most relevant frames. Result: Temporal reasoning improves substantially, and on some datasets the dense-frame baseline performance is preserved with as few as 20% of frames. Conclusion: Visual prompting with explicit temporal anchoring is an effective, lightweight way to improve temporal understanding in VideoLLMs under sparse frame sampling. Abstract: Recent advancements in Video Large Language Models (VideoLLMs) have enabled strong performance across diverse multimodal video tasks. To reduce the high computational cost of processing dense video frames, efficiency-oriented methods such as frame selection have been widely adopted. While effective at minimizing redundancy, these methods often cause notable performance drops on tasks requiring temporal reasoning. Unlike humans, who can infer event progression from sparse visual cues, VideoLLMs frequently misinterpret temporal relations when intermediate frames are omitted. To address this limitation, we explore visual prompting (VP) as a lightweight yet effective way to enhance temporal understanding in VideoLLMs. Our analysis reveals that simply annotating each frame with explicit ordinal information helps the model perceive temporal continuity. This visual cue also supports frame-level referencing and mitigates positional ambiguity within a sparsely sampled sequence. Building on these insights, we introduce ViKey, a training-free framework that combines VP with a lightweight Keyword-Frame Mapping (KFM) module. KFM leverages frame indices as dictionary-like keys to link textual cues to the most relevant frames, providing explicit temporal anchors during inference. Despite its simplicity, our approach substantially improves temporal reasoning and, on some datasets, preserves dense-frame baseline performance with as few as 20% of frames.
[156] Gaze-Regularized VLMs for Ego-Centric Behavior Understanding
Anupam Pani,Yanchao Yang
Main category: cs.CV
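TL;DR: This paper proposes a gaze-regularized framework that incorporates eye-gaze information directly into Vision Language Models (VLMs) to improve egocentric behavior understanding and future event prediction. Using gaze-based queries and an attention alignment mechanism, the model improves semantic scores by nearly 13% over baselines that do not use gaze.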
Details
Motivation: Existing methods rely solely on visual data and ignore gaze information (fixations and saccades), which carries cues about human intention and future actions; this limits VLMs in egocentric behavior understanding and future prediction. Method: The gaze-regularized framework: 1) incorporates gaze information directly during VLM training; 2) generates gaze-based queries so the model focuses dynamically on gaze-highlighted regions; 3) applies a gaze-regularization mechanism that aligns model attention with human attention patterns; 4) systematically explores several strategies for integrating gaze. Result: Semantic scores improve by nearly 13% over baseline models without gaze, and gaze information is shown to strengthen VLM prediction of future events with detailed action descriptions. Conclusion: Gaze is a key signal for improving VLM behavior understanding and future prediction in egocentric settings, and this work provides a foundation for bringing human gaze into VLMs. Abstract: Eye gaze, encompassing fixations and saccades, provides critical insights into human intentions and future actions. This study introduces a gaze-regularized framework that enhances Vision Language Models (VLMs) for egocentric behavior understanding. Unlike existing methods that rely solely on visual data and overlook gaze information, our approach directly incorporates gaze information into the VLM architecture during training. By generating gaze-based queries, the model dynamically focuses on gaze-highlighted regions, while a gaze-regularization mechanism ensures the alignment of model attention with human attention patterns. To better understand how gaze can be effectively integrated into VLMs, we conducted extensive experiments exploring various strategies for incorporating gaze data. These innovations enable the prediction of future events with detailed action descriptions. Experimental results demonstrate a nearly 13% improvement in semantic scores compared to baseline models not leveraging gaze data, highlighting the effectiveness of our approach. This work establishes a foundation for leveraging the human gaze in VLMs, significantly boosting their predictive capabilities in applications requiring accurate and robust future event prediction.
[157] FDIF: Formula-Driven supervised Learning with Implicit Functions for 3D Medical Image Segmentation
Yukinori Yamamoto,Kazuya Nishimura,Tsukasa Fukusato,Hirokazu Nosato,Tetsuya Ogata,Hirokatsu Kataoka
Main category: cs.CV
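TL;DR: This paper proposes FDIF, a framework that uses implicit functions (SDFs) for formula-driven pre-training of 3D medical image segmentation without real data or expert annotations, reaching performance comparable to self-supervised methods pre-trained on large-scale real data across multiple benchmarks and models.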
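A signed distance function (SDF) gives both a shape and a label for free: the sign of the SDF is the segmentation mask, and the distance to the surface can drive the intensity. The sketch below shows this for a single sphere. It is an illustration of the general idea only; the shape family, the sigmoid intensity profile, and the names `sphere_sdf` and `synthesize_volume` are assumptions, not FDIF's actual generator.

```python
import numpy as np


def sphere_sdf(shape, center, radius: float) -> np.ndarray:
    """Signed distance to a sphere on a voxel grid (negative inside)."""
    axes = [np.arange(s) for s in shape]
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1)
    dist = np.linalg.norm(grid - np.asarray(center, dtype=float), axis=-1)
    return dist - radius


def synthesize_volume(sdf: np.ndarray, inside: float = 1.0,
                      outside: float = 0.0, edge_width: float = 1.5):
    """Derive a synthetic image and its label from one SDF.

    The label is the interior of the shape; the image blends smoothly
    from `inside` to `outside` intensity across the surface.
    """
    label = (sdf < 0).astype(np.uint8)
    blend = 1.0 / (1.0 + np.exp(sdf / edge_width))
    image = outside + (inside - outside) * blend
    return image, label


sdf = sphere_sdf((16, 16, 16), center=(8, 8, 8), radius=4.0)
image, label = synthesize_volume(sdf)
```

Because image and label come from the same formula, the pair is consistent by construction, which is what makes annotation-free pre-training possible.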
Details
Motivation: Deep learning for 3D medical image segmentation depends on large labeled datasets, which are hard to obtain because of privacy constraints and the high cost of expert annotation; existing voxel-based formula-driven methods have limited geometric expressiveness and cannot synthesize realistic textures. Method: FDIF uses an implicit-function representation based on signed distance functions (SDFs), which supports compact modeling of complex geometries and controllable synthesis of geometric and intensity textures, enabling scalable pre-training with no real data. Result: Across three segmentation benchmarks (AMOS, ACDC, KiTS) and three architectures (SwinUNETR, nnUNet ResEnc-L, nnUNet Primus-M), FDIF consistently improves over a formula-driven method and performs comparably to self-supervised pre-training on large-scale real data; it also benefits 3D classification tasks. Conclusion: Formula supervision based on implicit functions is a promising paradigm for data-free representation learning and offers an efficient, privacy-friendly pre-training route for medical image analysis. Abstract: Deep learning-based 3D medical image segmentation methods rely on large-scale labeled datasets, yet acquiring such data is difficult due to privacy constraints and the high cost of expert annotation. Formula-Driven Supervised Learning (FDSL) offers an appealing alternative by generating training data and labels directly from mathematical formulas. However, existing voxel-based approaches are limited in geometric expressiveness and cannot synthesize realistic textures. We introduce Formula-Driven supervised learning with Implicit Functions (FDIF), a framework that enables scalable pre-training without using any real data or medical expert annotations. FDIF introduces an implicit-function representation based on signed distance functions (SDFs), enabling compact modeling of complex geometries while exploiting the surface representation of SDFs to support controllable synthesis of both geometric and intensity textures. Across three medical image segmentation benchmarks (AMOS, ACDC, and KiTS) and three architectures (SwinUNETR, nnUNet ResEnc-L, and nnUNet Primus-M), FDIF consistently improves over a formula-driven method, and achieves performance comparable to self-supervised approaches pre-trained on large-scale real datasets. We further show that FDIF pre-training also benefits 3D classification tasks, highlighting implicit-function-based formula supervision as a promising paradigm for data-free representation learning. Code is available at https://github.com/yamanoko/FDIF.
[158] Gaze-Regularized Vision-Language-Action Models for Robotic Manipulation
Anupam Pani,Yanchao Yang
Main category: cs.CV
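TL;DR: This paper proposes a regularized training framework based on human gaze data that improves the performance and interpretability of Vision-Language-Action (VLA) models on fine-grained robotic manipulation, without modifying the model architecture or adding inference overhead.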
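The regularizer described in this entry compares a patch-level gaze distribution with the model's attention using KL divergence. The sketch below shows that computation on plain arrays. The sum-pooling of the heatmap into patches, the direction KL(gaze || attention), and the function names are assumptions; the entry does not fix these details.

```python
import numpy as np


def heatmap_to_patch_distribution(heatmap: np.ndarray, patch_size: int,
                                  eps: float = 1e-8) -> np.ndarray:
    """Sum-pool an (H, W) gaze heatmap into patches and normalize to sum to 1."""
    height, width = heatmap.shape
    if height % patch_size or width % patch_size:
        raise ValueError("heatmap size must be divisible by patch_size")
    pooled = heatmap.reshape(height // patch_size, patch_size,
                             width // patch_size, patch_size).sum(axis=(1, 3))
    flat = pooled.ravel() + eps
    return flat / flat.sum()


def kl_divergence(target: np.ndarray, predicted: np.ndarray,
                  eps: float = 1e-8) -> float:
    """KL(target || predicted) for two discrete distributions."""
    target = np.clip(target, eps, None)
    predicted = np.clip(predicted, eps, None)
    return float(np.sum(target * np.log(target / predicted)))


# Toy example: gaze concentrated in the top-left patch of a 2 x 2 patch grid.
heatmap = np.zeros((8, 8))
heatmap[2:4, 2:4] = 1.0
gaze = heatmap_to_patch_distribution(heatmap, patch_size=4)
uniform_attention = np.full(4, 0.25)
penalty = kl_divergence(gaze, uniform_attention)
```

In training, a term like `penalty` would be added to the task loss with some weight, so attention that ignores the gazed-at patches is penalized while inference stays unchanged.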
Details
Motivation: Current VLA models perform poorly on fine-grained manipulation because they lack a mechanism for active visual attention allocation, while human gaze naturally encodes intent, planning, and execution and can serve as a strong supervisory signal for robot perception. Method: Temporally aggregated human gaze heatmaps are converted into patch-level distributions, and the transformer's attention is regularized toward them with KL divergence, creating an inductive bias toward task-relevant features without changing the architecture or adding inference cost. Result: Performance improves by 4-12% across manipulation benchmarks; training converges faster; robustness holds under lighting variations and sensor noise; attention visualizations mirror human strategies and improve trust; no eye-tracking equipment is needed and the method applies directly to existing datasets. Conclusion: Human perceptual priors can significantly accelerate robot learning while improving task performance, interpretability, and deployment efficiency. Abstract: Despite advances in Vision-Language-Action (VLA) models, robotic manipulation struggles with fine-grained tasks because current models lack mechanisms for active visual attention allocation. Human gaze naturally encodes intent, planning, and execution patterns, offering a powerful supervisory signal for guiding robot perception. We introduce a gaze-regularized training framework that aligns VLA models' internal attention with human visual patterns without architectural modifications or inference-time overhead. Our method transforms temporally aggregated gaze heatmaps into patch-level distributions and regularizes the transformer's attention through KL divergence, creating an inductive bias toward task-relevant features while preserving deployment efficiency. When integrated into existing VLA architectures, our approach yields 4-12% improvements across manipulation benchmarks. The gaze-regularized models reach equivalent performance with fewer training steps and maintain robustness under lighting variations and sensor noise. Beyond performance metrics, the learned attention patterns produce interpretable visualizations that mirror human strategies, enhancing trust in robotic systems. Moreover, our framework requires no eye-tracking equipment and applies directly to existing datasets. These results demonstrate that human perceptual priors can significantly accelerate robot learning while improving both task performance and system interpretability.
[159] PoseDriver: A Unified Approach to Multi-Category Skeleton Detection for Autonomous Driving
Yasamin Borhani,Taylor Mordan,Yihan Wang,Reyhaneh Hosseininejad,Javad Khoramdel,Alexandre Alahi
Main category: cs.CV
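TL;DR: This paper proposes PoseDriver, a unified multi-category skeleton detection framework for autonomous driving that models each category as a separate task to address multi-task learning challenges, and evaluates it on lane detection (OpenLane dataset) and a newly built bicycle skeleton dataset.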
Details
Motivation: No existing unified skeleton detection architecture handles multiple instances and multiple categories from the input image alone, even though skeleton representations are important for understanding posture and orientation in autonomous driving. Method: PoseDriver adopts a bottom-up multi-category skeleton detection paradigm, models each category as a distinct task to address multi-task learning challenges, introduces a skeleton-based lane detection approach, and contributes a new bicycle skeleton dataset for assessing transfer to novel categories. Result: Lane detection reaches state-of-the-art performance on the OpenLane dataset; transferability to novel categories is assessed on the new bicycle skeleton dataset; overall, experiments validate the approach. Conclusion: PoseDriver is an extensible, transferable unified skeleton detection framework that handles multiple categories and instances and performs well on key autonomous driving tasks. Abstract: Object skeletons offer a concise representation of structural information, capturing essential aspects of posture and orientation that are crucial for autonomous driving applications. However, a unified architecture that simultaneously handles multiple instances and categories using only the input image remains elusive. In this paper, we introduce PoseDriver, a unified framework for bottom-up multi-category skeleton detection tailored to common objects in driving scenarios. We model each category as a distinct task to systematically address the challenges of multi-task learning. Specifically, we propose a novel approach for lane detection based on skeleton representations, achieving state-of-the-art performance on the OpenLane dataset. Moreover, we present a new dataset for bicycle skeleton detection and assess the transferability of our framework to novel categories. Experimental results validate the effectiveness of the proposed approach.
[160] GO-Renderer: Generative Object Rendering with 3D-aware Controllable Video Diffusion Models
Zekai Gu,Shuoxuan Feng,Yansong Wang,Hanzhuo Huang,Zhongshuo Du,Chengfeng Zhao,Chengwei Ren,Peng Wang,Yuan Liu
Main category: cs.CV
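TL;DR: This paper proposes GO-Renderer, a framework that combines reconstructed 3D proxies with a video diffusion generative model to render objects at high quality under arbitrary viewpoints and lighting conditions, without explicitly modeling complex materials and lighting.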
Details
Motivation: Existing feedforward 3D reconstruction methods struggle to model complex appearance accurately; diffusion generative models can synthesize realistic images and videos but lack precise viewpoint control.
Method: The reconstructed 3D proxy is used as guidance for a video diffusion generative model, enabling viewpoint-controllable, lighting-adaptive, high-quality rendering.
Result: Reaches SOTA performance on novel-view image synthesis, rendering under novel lighting environments, and object insertion into video.
Conclusion: GO-Renderer unifies the strengths of 3D geometric control and generative appearance modeling, offering a new paradigm for high-quality, controllable object rendering.
Abstract: Reconstructing a renderable 3D model from images is a useful but challenging task. Recent feedforward 3D reconstruction methods have demonstrated remarkable success in efficiently recovering geometry, but still cannot accurately model the complex appearances of these 3D reconstructed models. Recent diffusion-based generative models can synthesize realistic images or videos of an object using reference images without explicitly modeling its appearance, which provides a promising direction for object rendering, but they lack accurate control over the viewpoints. In this paper, we propose GO-Renderer, a unified framework that integrates the reconstructed 3D proxies to guide video generative models, achieving high-quality object rendering from arbitrary viewpoints under arbitrary lighting conditions. Our method not only provides accurate viewpoint control using the reconstructed 3D proxy but also enables high-quality rendering in different lighting environments using diffusion generative models without explicitly modeling complex materials and lighting. Extensive experiments demonstrate that GO-Renderer achieves state-of-the-art performance across the object rendering tasks, including synthesizing images at new viewpoints, rendering the objects in a novel lighting environment, and inserting an object into an existing video.
[161] Multi-Modal Image Fusion via Intervention-Stable Feature Learning
Xue Wang,Zheng Guan,Wenhua Qian,Chengchao Wang,Runzhuo Ma
Main category: cs.CV
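TL;DR: This paper proposes a multi-modal image fusion framework based on causal intervention. It identifies robust cross-modal dependencies through three intervention strategies (complementary masking, random masking, and modality dropout) and designs a Causal Feature Integrator (CFI) that learns intervention-stable features, significantly improving out-of-distribution generalization and downstream task performance.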
Details
Motivation: Existing methods mainly optimize statistical correlations between modalities and tend to capture dataset-induced spurious associations, so performance degrades under distribution shift.
Method: Inspired by Pearl's causal hierarchy, three causal intervention strategies are designed: complementary masking (spatially disjoint perturbations), random masking (perturbations of identical regions), and modality dropout. A Causal Feature Integrator (CFI) then uses adaptive invariance gating to learn the important features that stay stable under intervention.
Result: Reaches SOTA performance on multiple public benchmarks and on high-level downstream vision tasks.
Conclusion: A modeling paradigm based on causal intervention can separate robust cross-modal dependencies from spurious statistical associations, improving the generalization and reliability of multi-modal fusion models.
Abstract: Multi-modal image fusion integrates complementary information from different modalities into a unified representation. Current methods predominantly optimize statistical correlations between modalities, often capturing dataset-induced spurious associations that degrade under distribution shifts. In this paper, we propose an intervention-based framework inspired by causal principles to identify robust cross-modal dependencies. Drawing insights from Pearl's causal hierarchy, we design three principled intervention strategies to probe different aspects of modal relationships: i) complementary masking with spatially disjoint perturbations tests whether modalities can genuinely compensate for each other's missing information, ii) random masking of identical regions identifies feature subsets that remain informative under partial observability, and iii) modality dropout evaluates the irreplaceable contribution of each modality. Based on these interventions, we introduce a Causal Feature Integrator (CFI) that learns to identify and prioritize intervention-stable features maintaining importance across different perturbation patterns through adaptive invariance gating, thereby capturing robust modal dependencies rather than spurious correlations. Extensive experiments demonstrate that our method achieves SOTA performance on both public benchmarks and downstream high-level vision tasks.
[162] CCF: Complementary Collaborative Fusion for Domain Generalized Multi-Modal 3D Object Detection
Yuchen Wu,Kun Wang,Yining Pan,Na Zhao
Main category: cs.CV
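TL;DR: This paper proposes a new method for improving the cross-domain generalization of multi-modal 3D object detection. Using a query-decoupled loss, a LiDAR-guided depth prior, and complementary cross-modal masking, it mitigates single-modality degradation and LiDAR dominance, substantially improving performance in challenging target domains such as rain and nighttime while preserving source-domain performance.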
Details
Motivation: Existing dual-branch proposal-level multi-modal 3D detectors degrade sharply when deployed across domains. The main causes are severe degradation of a single modality (image or point cloud) in adverse conditions, and excessive dominance of the LiDAR branch, which leaves visual cues underused and hurts robustness.
Method: Three core components are proposed: 1) Query-Decoupled Loss supervises 2D-only, 3D-only, and fused queries separately to balance gradient flow; 2) LiDAR-Guided Depth Prior fuses image-predicted and LiDAR-derived depth distributions to give 2D queries instance-aware geometric priors; 3) Complementary Cross-Modal Masking applies complementary spatial masks to the image and point cloud so that queries from both modalities compete in the fused decoder, yielding adaptive fusion.
Result: Significantly outperforms the current best methods in multiple cross-domain scenarios (such as rain and nighttime) without reducing source-domain detection accuracy.
Conclusion: The proposed method improves the cross-domain robustness and generalization of multi-modal 3D detectors, addresses the two key problems of modality degradation and fusion bias, and has practical deployment value.
Abstract: Multi-modal fusion has emerged as a promising paradigm for accurate 3D object detection. However, performance degrades substantially when deployed in target domains different from training. In this work, focusing on dual-branch proposal-level detectors, we identify two factors that limit robust cross-domain generalization: 1) in challenging domains such as rain or nighttime, one modality may undergo severe degradation; 2) the LiDAR branch often dominates the detection process, leading to systematic underutilization of visual cues and vulnerability when point clouds are compromised. To address these challenges, we propose three components. First, Query-Decoupled Loss provides independent supervision for 2D-only, 3D-only, and fused queries, rebalancing gradient flow across modalities. Second, LiDAR-Guided Depth Prior augments 2D queries with instance-aware geometric priors through probabilistic fusion of image-predicted and LiDAR-derived depth distributions, improving their spatial initialization. Third, Complementary Cross-Modal Masking applies complementary spatial masks to the image and point cloud, encouraging queries from both modalities to compete within the fused decoder and thereby promoting adaptive fusion. Extensive experiments demonstrate substantial gains over state-of-the-art baselines while preserving source-domain performance. Code and models are publicly available at https://github.com/IMPL-Lab/CCF.
[163] WaveSFNet: A Wavelet-Based Codec and Spatial-Frequency Dual-Domain Gating Network for Spatiotemporal Prediction
Xinyong Cai,Runming Xie,Hu Chen,Yuankai Wu
Main category: cs.CV
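TL;DR: This paper proposes WaveSFNet, an efficient framework that combines a wavelet codec with a spatial-frequency dual-domain gated translator for unsupervised spatiotemporal prediction, balancing long-range dynamics modeling against preservation of high-frequency detail.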
Details
Motivation: Existing methods struggle to model long-range dynamics while retaining high-frequency detail, which blurs multi-step predictions. Purely spatial operators have difficulty balancing local interactions with global propagation, and strided convolutions or pooling tend to lose texture and boundary information.
Method: Proposes WaveSFNet: 1) a wavelet codec preserves high-frequency subband cues during downsampling and reconstruction; 2) a spatial-frequency dual-domain gated translator first injects adjacent-frame differences to enhance dynamic information, then performs dual-domain gated fusion of large-kernel spatial local modeling and frequency-domain global modulation, complemented by gated channel interaction for cross-channel feature exchange.
Result: Achieves competitive prediction accuracy on the Moving MNIST, TaxiBJ, and WeatherBench datasets while keeping computational complexity low.
Conclusion: By combining wavelet representations with a dual-domain gating mechanism, WaveSFNet achieves a better trade-off between efficiency and prediction quality and offers a new paradigm for unsupervised spatiotemporal prediction.
Abstract: Spatiotemporal predictive learning aims to forecast future frames from historical observations in an unsupervised manner, and is critical to a wide range of applications. The key challenge is to model long-range dynamics while preserving high-frequency details for sharp multi-step predictions. Existing efficient recurrent-free frameworks typically rely on strided convolutions or pooling for sampling, which tends to discard textures and boundaries, while purely spatial operators often struggle to balance local interactions with global propagation. To address these issues, we propose WaveSFNet, an efficient framework that unifies a wavelet-based codec with a spatial-frequency dual-domain gated spatiotemporal translator. The wavelet-based codec preserves high-frequency subband cues during downsampling and reconstruction. Meanwhile, the translator first injects adjacent-frame differences to explicitly enhance dynamic information, and then performs dual-domain gated fusion between large-kernel spatial local modeling and frequency-domain global modulation, together with gated channel interaction for cross-channel feature exchange. Extensive experiments demonstrate that WaveSFNet achieves competitive prediction accuracy on Moving MNIST, TaxiBJ, and WeatherBench, while maintaining low computational complexity. Our code is available at https://github.com/fhjdqaq/WaveSFNet.
[164] Knot-10: A Tightness-Stratified Benchmark for Real-World Knot Classification with Topological Difficulty Analysis
Shiheng Nie,Yunguang Yue
Main category: cs.CV
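TL;DR: This paper introduces Knots-10, a fine-grained visual classification benchmark for physical knot classification in which class identity depends only on crossing structure. Mainstream models reach close to 97% accuracy but generalize poorly across domains, mainly because they rely on rope appearance rather than topological structure. The proposed TACA regularization improves the agreement between the embedding space and topological distance but does not improve classification accuracy.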
Details
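Motivation: Physical knot classification is a fine-grained visual classification task in which appearance cues are deliberately suppressed and classes differ only in their topological crossing structure. Existing datasets and methods make it hard to assess whether models actually learn topological features.
Method: Builds the Knots-10 benchmark (1,440 images; training on loosely tied knots, testing on tightly dressed ones); compares mainstream models such as Swin-T, TransFG, and PMG; uses McNemar tests to assess the significance of performance differences; uses a Mantel permutation test to analyze the correlation between confusion patterns and topological distance; proposes TACA regularization to strengthen the agreement between the embedding space and topological distance, with a random-distance ablation to probe its mechanism; and runs a cross-domain test on phone photographs.
Result: Swin-T and TransFG reach 97.2% accuracy and PMG 94.5%; McNemar tests show that differences among most models are not significant; the Mantel test shows that the confusion patterns of three models correlate significantly with topological distance (p<0.01); TACA raises embedding-topology alignment (rho) from 0.46 to 0.65 without improving classification accuracy; random-distance regularization performs comparably, indicating that TACA's benefit comes from generic regularization; cross-domain accuracy drops by 58-69 percentage points.
Conclusion: Current models perform very well in controlled settings but rely heavily on rope appearance rather than knot topology and generalize poorly. Improving the agreement between the embedding space and topological distance does not necessarily improve classification performance. Future work needs more robust topology-aware representation learning.
Abstract: Physical knot classification is a fine-grained visual classification (FGVC) scenario in which appearance cues are deliberately suppressed: different classes share the same rope material, color, and background, and class identity resides primarily in crossing structure. We introduce the Knots-10 benchmark, comprising 1,440 images with a deployment-oriented split that trains on loosely tied knots and tests on tightly dressed ones. Swin-T and TransFG both average 97.2% accuracy; PMG scores 94.5%, consistent with the hypothesis that jigsaw shuffling disrupts crossing continuity. McNemar tests cannot separate four of the five general-purpose backbones, so small ranking margins should be interpreted with caution. A Mantel permutation test shows that topological distance significantly correlates with confusion patterns in three of the five models (p < 0.01). We propose TACA regularization, which improves embedding-topology alignment from rho=0.46 to rho=0.65 without improving classification accuracy; a random-distance ablation yields comparable alignment, indicating the benefit is likely driven by generic regularization. A pilot cross-domain test with 100 phone photographs reveals a 58-69 percentage-point accuracy drop, exposing rope appearance bias as the dominant failure mode.
[165] Mamba-driven MRI-to-CT Synthesis for MRI-only Radiotherapy Planning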
Konstantinos Barmpounakis,Theodoros P. Vagenas,Maria Vakalopoulou,George K. Matsopoulos
Main category: cs.CV
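TL;DR: This paper explores MRI-to-CT synthesis based on the Mamba architecture (a state-space model) for MRI-only radiotherapy planning. Compared with the prevailing nnU-Net, it improves CT synthesis accuracy and geometric consistency while keeping inference fast.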
Details
Motivation: Reduce patient exposure to ionizing radiation and avoid multi-modal registration errors, advancing MRI-only radiotherapy workflows; explore the potential of emerging state-space models such as Mamba for cross-modality image generation, compensating for the weakness of CNNs in modeling long-range dependencies.
Method: Adapts the segmentation-oriented U-Mamba and SegMamba architectures into 3D MRI-to-CT generation models; trains and validates on a subset of SynthRAD2025 (paired MRI-CT data for the head and neck, thorax, and abdomen); evaluates with HU-domain image similarity metrics (such as MAE and PSNR) and with TotalSegmentator segmentation to assess geometric consistency.
Result: The proposed 3D Mamba models achieve quantitative metrics better than or comparable to nnU-Net on CT synthesis, with faster inference; they capture volumetric features and long-range dependencies and preserve the geometric consistency of anatomical structures.
Conclusion: Mamba-style state-space models are strong candidates for MRI-to-CT synthesis and may help bring efficient, accurate MRI-only radiotherapy into clinical practice.
Abstract: Radiotherapy workflows for oncological patients increasingly rely on multi-modal medical imaging, commonly involving both Magnetic Resonance Imaging (MRI) and Computed Tomography (CT). MRI-only treatment planning has emerged as an attractive alternative, as it reduces patient exposure to ionizing radiation and avoids errors introduced by inter-modality registration. While nnU-Net-based frameworks are predominantly used for MRI-to-CT synthesis, we explore Mamba-based architectures for this task, aiming to showcase the advantages of state-space modeling for cross-modality translation compared to standard convolutional neural networks. Specifically, we adapt both the U-Mamba and the SegMamba architecture, originally proposed for segmentation, to perform cross-modality image generation. Our 3D Mamba architecture effectively captures complex volumetric features and long-range dependencies, thus allowing accurate CT synthesis while maintaining fast inference times. Experiments were conducted on a subset of the SynthRAD2025 dataset, comprising registered single-channel MRI-CT volume pairs across three anatomical regions. Quantitative evaluation is performed via a combination of image similarity metrics computed in Hounsfield Units (HU) and segmentation-based metrics obtained from TotalSegmentator to ensure geometric consistency is preserved. The findings pave the way for the integration of state-space models into radiotherapy workflows.
[166] Drop-In Perceptual Optimization for 3D Gaussian Splatting
Ezgi Ozyilkan,Zhiqi Chen,Oren Rippel,Jona Ballé,Kedar Tatwawadi
Main category: cs.CV
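TL;DR: This paper proposes WD-R, a regularized Wasserstein Distortion loss for optimizing the perceptual quality of 3D Gaussian Splatting (3DGS) methods. It significantly outperforms existing methods in a large-scale human subjective evaluation and reaches SOTA performance across multiple metrics and frameworks.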
Details
Motivation: Existing 3DGS methods mostly use heuristic pixel-level losses, which produce blurry renderings and lack systematic optimization for human visual perception.
Method: Systematically searches over a range of distortion losses and conducts the first large-scale human subjective study on 3DGS (39,320 pairwise ratings), proposing regularized Wasserstein Distortion (WD-R) as the optimization objective.
Result: WD-R reaches SOTA on LPIPS, DISTS, and FID; replacing the original loss in Mip-Splatting and Scaffold-GS significantly raises human preference rates (1.8× and 3.6×); in scene compression it saves about 50% bitrate.
Conclusion: WD-R is a general, efficient, perception-friendly loss for 3DGS that adapts to different 3DGS frameworks and significantly improves rendering quality and compression efficiency.
Abstract: Despite their output being ultimately consumed by human viewers, 3D Gaussian Splatting (3DGS) methods often rely on ad-hoc combinations of pixel-level losses, resulting in blurry renderings. To address this, we systematically explore perceptual optimization strategies for 3DGS by searching over a diverse set of distortion losses. We conduct the first-of-its-kind large-scale human subjective study on 3DGS, involving 39,320 pairwise ratings across several datasets and 3DGS frameworks. A regularized version of Wasserstein Distortion, which we call WD-R, emerges as the clear winner, excelling at recovering fine textures without incurring a higher splat count. WD-R is preferred by raters more than 2.3× over the original 3DGS loss, and 1.5× over the current best method, Perceptual-GS. WD-R also consistently achieves state-of-the-art LPIPS, DISTS, and FID scores across various datasets, and generalizes across recent frameworks, such as Mip-Splatting and Scaffold-GS, where replacing the original loss with WD-R consistently enhances perceptual quality within a similar resource budget (number of splats for Mip-Splatting, model size for Scaffold-GS), and leads to reconstructions being preferred by human raters 1.8× and 3.6×, respectively. We also find that this carries over to the task of 3DGS scene compression, with approximately 50% bitrate savings for comparable perceptual metric performance.
[167] Curriculum-Driven 3D CT Report Generation via Language-Free Visual Grafting and Zone-Constrained Compression
V. K. Cody Bumgardner,Mitchell A. Klusty,Mahmut S. Gokmen,Evan W. Damron
Main category: cs.CV
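TL;DR: This paper proposes Ker-VLJEPA-3B, a four-phase curriculum learning framework for generating free-text radiology reports from thoracic CT volumes. Its core idea is to align a frozen self-supervised visual encoder (LeJEPA ViT-Large) with a fine-tuned language model (Llama 3.2 3B) in a decoupled manner, using several techniques to mitigate ignored visual tokens, class imbalance, and posterior collapse. It reaches SOTA performance on CT-RATE.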
Details
Motivation: Address three challenges in automated report generation from 3D CT: extreme sequence lengths, severe class imbalance, and the tendency of large language models to rely on linguistic priors while ignoring visual information.
Method: Proposes the four-phase curriculum learning framework Ker-VLJEPA-3B: 1) self-supervised training of the LeJEPA ViT-Large visual backbone (no text supervision); 2) zone-constrained cross-attention that compresses slice embeddings into 32 spatially grounded visual tokens; 3) PCA whitening of LLM embeddings; 4) a positive-findings-only strategy to avoid posterior collapse; 5) warm initialization in the bridge phase, selective cross-attention freezing, and elastic weight consolidation to prevent catastrophic forgetting.
Result: On the CT-RATE benchmark (2,984 validation CT volumes, 18 classes), macro F1 reaches 0.429, surpassing the SOTA (U-VLM, 0.414) by 3.6%; with threshold optimization it reaches 0.448 (+8.2%); ablations indicate that 56.6% of generation quality derives from patient-specific visual content.
Conclusion: The modality-agnostic design lets any self-supervised visual encoder be connected to an LLM without paired text during foundation training. Vision-language alignment can be deferred to later curriculum phases, which significantly improves the visual faithfulness and diagnostic relevance of generated medical reports.
Abstract: Automated radiology report generation from 3D computed tomography (CT) volumes is challenging due to extreme sequence lengths, severe class imbalance, and the tendency of large language models (LLMs) to ignore visual tokens in favor of linguistic priors. We present Ker-VLJEPA-3B, a four-phase curriculum learning framework for free-text report generation from thoracic CT volumes. A phased training curriculum progressively adapts a Llama 3.2 3B decoder to ground its output in visual features from a frozen, self-supervised encoder. Our visual backbone (LeJEPA ViT-Large) is trained via self-supervised joint-embedding prediction on unlabeled CTs, without text supervision. Unlike contrastive models (CLIP, BiomedCLIP), this language-free backbone yields modality-pure representations. Vision-language alignment is deferred to the curriculum's bridge and generation phases. This modality-agnostic design can integrate any self-supervised encoder into an LLM without paired text during foundation training. Methodological innovations include: (1) zone-constrained cross-attention compressing slice embeddings into 32 spatially-grounded visual tokens; (2) PCA whitening of anisotropic LLM embeddings; (3) a positive-findings-only strategy eliminating posterior collapse; (4) warm bridge initialization transferring projection weights; and (5) selective cross-attention freezing with elastic weight consolidation to prevent catastrophic forgetting.
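Of the innovations listed above, item (2), PCA whitening, is a standard operation and can be sketched generically. The following is a minimal NumPy illustration on toy data, not the authors' implementation: the function name `pca_whiten`, the `eps` value, and the synthetic 4-dimensional "embeddings" are all assumptions made for the example.

```python
import numpy as np

def pca_whiten(embeddings, eps=1e-5):
    """Center the rows and rescale along principal axes so the output
    has approximately identity covariance (hypothetical helper)."""
    x = np.asarray(embeddings, dtype=np.float64)
    mean = x.mean(axis=0, keepdims=True)
    centered = x - mean
    cov = centered.T @ centered / (x.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)          # symmetric eigendecomposition
    transform = eigvecs / np.sqrt(eigvals + eps)    # scale each principal axis
    return centered @ transform, mean, transform

# Toy anisotropic data: per-dimension scales differ by a factor of 20.
rng = np.random.default_rng(0)
toy = rng.normal(size=(2000, 4)) * np.array([10.0, 3.0, 1.0, 0.5])
whitened, mean, transform = pca_whiten(toy)
whitened_cov = np.cov(whitened, rowvar=False)       # close to the identity matrix
```

New vectors would be whitened with the stored statistics as `(v - mean) @ transform`, so the same mapping is reused after it has been fitted once.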
Evaluated on the CT-RATE benchmark (2,984 validation volumes, 18 classes), Ker-VLJEPA-3B achieves a macro F1 of 0.429, surpassing the state-of-the-art (U-VLM, macro F1 = 0.414) by 3.6%, and reaching 0.448 (+8.2%) with threshold optimization. Ablation studies confirm 56.6% of generation quality derives from patient-specific visual content. Code and weights are available.
[168] ARGENT: Adaptive Hierarchical Image-Text Representations
Chuong Huynh,Hossein Souri,Abhinav Kumar,Vitali Petsiuk,Deen Dayal Mohan,Suren Kumar
Main category: cs.CV
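TL;DR: This paper proposes ARGENT, a new hyperbolic vision-language model that addresses the cone collapse problem of existing models with an adaptive entailment loss and norm regularization, and introduces an angle-based probabilistic entailment protocol (PEP) for more reliable evaluation of hierarchical understanding.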
Details
Motivation: Vision-language models such as CLIP operate in Euclidean space and cannot effectively model the hierarchical structure of concepts, while existing hyperbolic vision-language models use unstable entailment losses that lead to cone collapse. In addition, hierarchical evaluation methods are unreliable and prone to taxonomy dependence and ambiguous negatives.
Method: Proposes an adaptive entailment loss with norm regularization to prevent cone collapse; designs an angle-based probabilistic entailment protocol (PEP) for evaluation, scored with AUC-ROC and Average Precision.
Result: ARGENT improves on the SOTA hyperbolic VLM by 0.7, 1.1, and 0.8 absolute points on image classification, text-to-image retrieval, and the newly proposed hierarchical metrics, respectively.
Conclusion: ARGENT provides a stronger baseline for hyperbolic vision-language modeling, mitigates cone collapse, and establishes a more robust evaluation paradigm for hierarchical understanding.
Abstract: Large-scale Vision-Language Models (VLMs) such as CLIP learn powerful semantic representations but operate in Euclidean space, which fails to capture the inherent hierarchical structure of visual and linguistic concepts. Hyperbolic geometry, with its exponential volume growth, offers a principled alternative for embedding such hierarchies with low distortion. However, existing hyperbolic VLMs use entailment losses that are unstable: as parent embeddings contract toward the origin, their entailment cones widen toward a half-space, causing catastrophic cone collapse that destroys the intended hierarchy. Additionally, hierarchical evaluation of these models remains unreliable, relying largely on retrieval-based and correlation-based metrics that are prone to taxonomy dependence and ambiguous negatives. To address these limitations, we propose an adaptive entailment loss paired with a norm regularizer that prevents cone collapse without heuristic aperture clipping. We further introduce an angle-based probabilistic entailment protocol (PEP) for evaluating hierarchical understanding, scored with AUC-ROC and Average Precision. This paper introduces a stronger hyperbolic VLM baseline ARGENT, Adaptive hieRarchical imaGe-tExt represeNTation. ARGENT improves the SOTA hyperbolic VLM by 0.7, 1.1, and 0.8 absolute points on image classification, text-to-image retrieval, and proposed hierarchical metrics, respectively.
[169] Pose-Free Omnidirectional Gaussian Splatting for 360-Degree Videos with Consistent Depth Priors
Chuanqing Zhuang,Xin Lu,Zehui Deng,Zhengda Lu,Yiqun Wang,Junqi Diao,Jun Xiao
Main category: cs.CV
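TL;DR: This paper proposes PFGS360, an omnidirectional 3D Gaussian Splatting method that needs no camera pose priors. With a spherical consistency-aware pose estimation module and a depth-inlier-aware densification module, it achieves high-quality reconstruction and novel view synthesis from unposed panoramic videos.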
Details
Motivation: Existing omnidirectional 3D Gaussian Splatting methods depend on slow structure-from-motion (SfM) to supply camera poses and sparse point priors, which limits efficiency and practicality.
Method: Proposes PFGS360: 1) a spherical consistency-aware pose estimation module that uses the Gaussians' internal depth priors to establish 2D-3D correspondences and recover camera poses; 2) a depth-inlier-aware densification module that uses consistent monocular depth priors to select depth inliers and Gaussian outliers, improving densification efficiency and rendering quality.
Result: Significantly outperforms existing pose-free and pose-aware 3DGS methods on real-world and synthetic 360-degree videos, achieving high-quality novel view synthesis.
Conclusion: PFGS360 achieves omnidirectional 3D Gaussian Splatting reconstruction without initial poses, balancing pose estimation accuracy with rendering fidelity and making 3D scene representation more practical.
Abstract: Omnidirectional 3D Gaussian Splatting with panoramas is a key technique for 3D scene representation, and existing methods typically rely on slow SfM to provide camera poses and sparse point priors. In this work, we propose a pose-free omnidirectional 3DGS method, named PFGS360, that reconstructs 3D Gaussians from unposed omnidirectional videos. To achieve accurate camera pose estimation, we first construct a spherical consistency-aware pose estimation module, which recovers poses by establishing consistent 2D-3D correspondences between the reconstructed Gaussians and the unposed images using the Gaussians' internal depth priors. In addition, to enhance the fidelity of novel view synthesis, we introduce a depth-inlier-aware densification module to extract depth inliers and Gaussian outliers with consistent monocular depth priors, enabling efficient Gaussian densification and achieving photorealistic novel view synthesis. Experiments show that the method significantly outperforms existing pose-free and pose-aware 3DGS methods on both real-world and synthetic 360-degree videos. Code is available at https://github.com/zcq15/PFGS360.
[170] ViBe: Ultra-High-Resolution Video Synthesis Born from Pure Images
Yunfeng Wu,Hongying Cheng,Zihao He,Songhua Liu
Main category: cs.CV
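TL;DR: This paper proposes a pure image adaptation framework that uses two-stage LoRA fine-tuning (Relay LoRA) and a high-frequency-aware training objective so that a pre-trained video Diffusion Transformer can generate ultra-high-resolution videos without any video data, significantly improving detail and surpassing the previous SOTA.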
Details
Motivation: Transformer-based video diffusion models incur very high compute and memory costs because of 3D attention, which makes end-to-end training on ultra-high-resolution video impractical; fine-tuning directly on high-resolution images introduces noise because of the image-video modality gap.
Method: Proposes Relay LoRA, a two-stage adaptation: the first stage uses low-resolution images to align the image and video modalities; the second stage uses high-resolution images to learn spatial extrapolation. A high-frequency-aware training objective is also designed, with a dedicated reconstruction loss that recovers high-frequency components in the latent representation.
Result: Scores 0.8 higher on the VBench benchmark than the previous SOTA model trained on high-resolution videos, while using no video training data at all, and generates ultra-high-resolution videos with rich detail.
Conclusion: A pure image adaptation strategy can overcome the resolution bottleneck of video diffusion models and deliver high-quality ultra-high-resolution video generation without video data, offering a new paradigm for efficient video generation.
Abstract: Transformer-based video diffusion models rely on 3D attention over spatial and temporal tokens, which incurs quadratic time and memory complexity and makes end-to-end training for ultra-high-resolution videos prohibitively expensive. To overcome this bottleneck, we propose a pure image adaptation framework that upgrades a video Diffusion Transformer pre-trained at its native scale to synthesize higher-resolution videos. Unfortunately, naively fine-tuning with high-resolution images alone often introduces noticeable noise due to the image-video modality gap. To address this, we decouple the learning objective to separately handle modality alignment and spatial extrapolation. At the core of our approach is Relay LoRA, a two-stage adaptation strategy. In the first stage, the video diffusion model is adapted to the image domain using low-resolution images to bridge the modality gap. In the second stage, the model is further adapted with high-resolution images to acquire spatial extrapolation capability. During inference, only the high-resolution adaptation is retained to preserve the video generation modality while enabling high-resolution video synthesis. To enhance fine-grained detail synthesis, we further propose a High-Frequency-Awareness-Training-Objective, which explicitly encourages the model to recover high-frequency components from degraded latent representations via a dedicated reconstruction loss.
Extensive experiments demonstrate that our method produces ultra-high-resolution videos with rich visual details without requiring any video training data, even outperforming previous state-of-the-art models trained on high-resolution videos by 0.8 on the VBench benchmark. Code will be available at https://github.com/WillWu111/ViBe.
[171] An Explainable AI-Driven Framework for Automated Brain Tumor Segmentation Using an Attention-Enhanced U-Net
MD Rashidul Islam,Bakary Gibba
Main category: cs.CV
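TL;DR: This paper proposes a multi-region glioma MRI segmentation method based on a modified U-Net (with attention gates), Dice loss, and explainable AI (Grad-CAM with Gaussian-smoothed heatmaps). It reports strong performance on the BraTS 2020 dataset, balancing accuracy with clinical interpretability.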
Details
Motivation: Gliomas are highly heterogeneous and malignant, and manual segmentation is time-consuming and unreliable, so robust, automated, and explainable segmentation methods are needed to support clinical decisions.
Method: Builds on the U-Net architecture and adds attention gates; jointly optimizes Dice Loss, Categorical Dice Loss, and cross-entropy to mitigate class imbalance; uses Grad-CAM to generate attention heatmaps and Gaussian filtering to smooth the visualization; evaluation metrics include the Dice coefficient, IoU, sensitivity, and specificity.
Result: On the BraTS 2020 dataset, reaches an accuracy of 0.9919, Dice coefficient of 0.9901, mean IoU of 0.9873, sensitivity of 0.9908, and specificity of 0.9974, outperforming existing methods.
Conclusion: The combined design of attention mechanisms, customized loss functions, and explainable AI significantly improves the accuracy and trustworthiness of complex brain tumor MRI segmentation, providing reliable and explainable technical support for clinical applications.
Abstract: Computer-aided segmentation of brain tumors from MRI data is of crucial significance to clinical decision-making in diagnosis, treatment planning, and follow-up disease monitoring. Gliomas, owing to their high malignancy and heterogeneity, represent a very challenging task for accurate and reliable segmentation into intra-tumoral sub-regions. Manual segmentation is typically time-consuming and not reliable, which justifies the need for robust automated techniques. This research addresses the problem using the BraTS 2020 dataset, which provides labeled MRI scans of glioma patients with four significant classes: background/healthy tissue, necrotic/non-enhancing core, edema, and enhancing tumor. In this work, we present a new segmentation technique based on a U-Net model augmented with attention gates to focus on the most significant regions of images. To counter class imbalance, we employ manually designed loss functions like Dice Loss and Categorical Dice Loss, in conjunction with standard categorical cross-entropy. Other evaluation metrics, like sensitivity and specificity, were used to measure discriminability of the model between tumor classes. In addition, we introduce Grad-CAM-based explainable AI to enable visualizing attention regions and improve model interpretability, together with a smooth heatmap generation technique through Gaussian filtering. Our approach achieved superior performance with accuracy of 0.9919, Dice coefficient of 0.9901, mean IoU of 0.9873, sensitivity of 0.9908, and specificity of 0.9974.
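The abstract names Dice Loss without giving its formula, so the following is only a generic sketch of a per-class soft Dice loss, one common formulation of the idea. The function name `soft_dice_loss`, the smoothing constant `eps`, and the 2-class toy mask are assumptions; the paper's exact Categorical Dice Loss and its weighting against cross-entropy are not specified in the text.

```python
import numpy as np

def soft_dice_loss(probs, onehot, eps=1e-6):
    """Mean over classes of 1 - Dice, for arrays shaped (classes, H, W).
    Generic formulation; not taken from the paper."""
    probs = np.asarray(probs, dtype=np.float64)
    onehot = np.asarray(onehot, dtype=np.float64)
    spatial_axes = tuple(range(1, probs.ndim))
    intersection = (probs * onehot).sum(axis=spatial_axes)
    denom = probs.sum(axis=spatial_axes) + onehot.sum(axis=spatial_axes)
    dice_per_class = (2.0 * intersection + eps) / (denom + eps)
    return 1.0 - dice_per_class.mean()

# Toy 2x2 label map with two classes, one of them rare.
labels = np.array([[0, 1], [1, 1]])
onehot = np.stack([labels == c for c in range(2)]).astype(np.float64)

loss_perfect = soft_dice_loss(onehot, onehot)         # prediction equals target
loss_inverted = soft_dice_loss(1.0 - onehot, onehot)  # every pixel wrong
```

Because each class contributes equally to the mean regardless of how many pixels it covers, a small region influences the loss as much as a large one, which is the usual motivation for Dice-style losses under class imbalance.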
This study demonstrates that the use of attention mechanisms, personalized loss functions, and explainable AI significantly improves highly complex tumor structure segmentation precision in MRI scans, providing a reliable and explainable method for clinical applications.
[172] FHAvatar: Fast and High-Fidelity Reconstruction of Face-and-Hair Composable 3D Head Avatar from Few Casual Captures
Yujie Sun,Zhuoqiang Cai,Chaoyue Niu,Jianchuan Chen,Zhiwen Chen,Chengfei Lv,Fan Wu
Main category: cs.CV
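TL;DR: FHAvatar is a new framework that decouples the representations of the face (planar Gaussians) and the hair (strand-based Gaussians), enabling fast reconstruction of high-quality, animatable, editable 3D Gaussian avatars from only a few views.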
Details
Motivation: Existing methods model the face and hair jointly and depend on dense multi-view capture or costly per-identity optimization, which limits practicality and scalability.
Method: Proposes the FHAvatar framework: 1) explicitly decouples the face (planar Gaussians) and the hair (strand-based Gaussians) in texture space; 2) designs an aggregated transformer backbone that learns geometry-aware cross-view priors and head-hair structural coherence from multi-view data, supporting efficient feature extraction and fusion from few casual views.
Result: Reaches SOTA reconstruction quality from a few views of new identities (within minutes); supports real-time animation, convenient hairstyle transfer, and stylized editing.
Conclusion: FHAvatar significantly improves the efficiency, flexibility, and accessibility of 3D avatar reconstruction and broadens its potential for real-world use.
Abstract: We present FHAvatar, a novel framework for reconstructing 3D Gaussian avatars with composable face and hair components from an arbitrary number of views. Unlike previous approaches that couple facial and hair representations within a unified modeling process, we explicitly decouple the two components in texture space by representing the face with planar Gaussians and the hair with strand-based Gaussians. To overcome the limitations of existing methods that rely on dense multi-view captures or costly per-identity optimization, we propose an aggregated transformer backbone to learn geometry-aware cross-view priors and head-hair structural coherence from multi-view datasets, enabling effective and efficient feature extraction and fusion from few casual captures. Extensive quantitative and qualitative experiments demonstrate that FHAvatar achieves state-of-the-art reconstruction quality from only a few observations of new identities within minutes, while supporting real-time animation, convenient hairstyle transfer, and stylized editing, broadening the accessibility and applicability of digital avatar creation.
[173] Object Pose Transformer: Unifying Unseen Object Pose Estimation
Weihang Li,Lorenzo Garattoni,Fabien Despinoy,Nassir Navab,Benjamin Busam
Main category: cs.CV
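TL;DR: This paper proposes Object Pose Transformer (OPT), a unified feed-forward framework that uses task factorization to perform both category-level absolute pose estimation and relative pose estimation for unseen objects within a single model. It requires no semantic labels, supports RGB-only input with optional depth, and is camera-agnostic.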
Details
Motivation: Existing methods fall into two groups: category-level methods predict absolute pose but depend on predefined taxonomies, while relative pose methods cannot recover single-view absolute pose. The two are disjoint and hard to reconcile.
Method: Proposes Object Pose Transformer (OPT), which jointly predicts depth, point maps, camera parameters, and normalized object coordinates (NOCS); uses contrastive object-centric latent embeddings for label-free canonicalization, with point maps as a camera-space representation that supports multi-view geometric reasoning; and improves single-view absolute pose accuracy through cross-frame feature interaction and shared object embeddings.
Result: On multiple benchmarks including NOCS, HouseCat6D, Omni6DPose, and Toyota-Light, OPT reaches SOTA performance on both absolute and relative pose estimation, and supports camera adaptation and RGB-only settings.
Conclusion: OPT unifies the absolute and relative pose estimation paradigms and achieves model-free pose estimation for unseen objects, combining robustness, generalization, and practicality.
Abstract: Learning model-free object pose estimation for unseen instances remains a fundamental challenge in 3D vision. Existing methods typically fall into two disjoint paradigms: category-level approaches predict absolute poses in a canonical space but rely on predefined taxonomies, while relative pose methods estimate cross-view transformations but cannot recover single-view absolute pose. In this work, we propose Object Pose Transformer (OPT), a unified feed-forward framework that bridges these paradigms through task factorization within a single model. OPT jointly predicts depth, point maps, camera parameters, and normalized object coordinates (NOCS) from RGB inputs, enabling both category-level absolute SA(3) pose and unseen-object relative SE(3) pose. Our approach leverages contrastive object-centric latent embeddings for canonicalization without requiring semantic labels at inference time, and uses point maps as a camera-space representation to enable multi-view relative geometric reasoning. Through cross-frame feature interaction and shared object embeddings, our model leverages relative geometric consistency across views to improve absolute pose estimation, reducing ambiguity in single-view predictions. Furthermore, OPT is camera-agnostic, learning camera intrinsics on-the-fly and supporting optional depth input for metric-scale recovery, while remaining fully functional in RGB-only settings.
Extensive experiments on diverse benchmarks (NOCS, HouseCat6D, Omni6DPose, Toyota-Light) demonstrate state-of-the-art performance in both absolute and relative pose estimation tasks within a single unified architecture.
[174] ABot-PhysWorld: Interactive World Foundation Model for Robotic Manipulation with Physics Alignment
Yuzhi Chen,Ronghan Chen,Dongjie Huo,Yandan Yang,Dekang Qi,Haoyun Liu,Tong Lin,Shuang Zeng,Junjin Xiao,Xinyuan Chang,Feng Xiong,Xing Wei,Zhiheng Ma,Mu Xu
Main category: cs.CV
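TL;DR: This paper proposes ABot-PhysWorld, a 14B-parameter Diffusion Transformer video generation model. A physics-aware dataset and a DPO-based post-training framework significantly improve the physical plausibility and action controllability of generated videos, and the paper introduces EZSbench, the first training-independent zero-shot evaluation benchmark.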
Details
Motivation: Existing video world models trained on generic visual data with likelihood-based objectives often generate actions that violate physical laws (such as object penetration and anti-gravity motion) and lack any modeling of physical realism.
Method: Builds a dataset of three million manipulation videos with physics-aware annotation; proposes a DPO-based post-training framework with decoupled discriminators to suppress unphysical behavior; designs a parallel context block for spatially precise action injection to support cross-embodiment control; builds the training-independent zero-shot benchmark EZSbench, which uses a decoupled protocol to assess physical realism and action alignment separately.
Result: Reaches SOTA on PBench and the newly proposed EZSbench, surpassing Veo 3.1 and Sora v2 Pro in physical plausibility and trajectory consistency; EZSbench will be released to promote standardized evaluation of embodied video generation.
Conclusion: Physical constraints and action controllability can jointly improve the realism and practicality of embodied video world models; data quality, training objective design, and benchmark innovation together form the key path forward.
Abstract: Video-based world models offer a powerful paradigm for embodied simulation and planning, yet state-of-the-art models often generate physically implausible manipulations (such as object penetration and anti-gravity motion) due to training on generic visual data and likelihood-based objectives that ignore physical laws. We present ABot-PhysWorld, a 14B Diffusion Transformer model that generates visually realistic, physically plausible, and action-controllable videos. Built on a curated dataset of three million manipulation clips with physics-aware annotation, it uses a novel DPO-based post-training framework with decoupled discriminators to suppress unphysical behaviors while preserving visual quality. A parallel context block enables precise spatial action injection for cross-embodiment control. To better evaluate generalization, we introduce EZSbench, the first training-independent embodied zero-shot benchmark combining real and synthetic unseen robot-task-scene combinations. It employs a decoupled protocol to separately assess physical realism and action alignment. ABot-PhysWorld achieves new state-of-the-art performance on PBench and EZSbench, surpassing Veo 3.1 and Sora v2 Pro in physical plausibility and trajectory consistency. We will release EZSbench to promote standardized evaluation in embodied video generation.
[175] FG-Portrait: 3D Flow Guided Editable Portrait Animation
Yating Xu,Yunqi Miao,Evangelos Ververas,Jiankang Deng,Jifei Song
Main category: cs.CV
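TL;DR: This paper proposes a 3D-flow-based diffusion model approach that improves motion transfer quality in portrait animation. Using geometry-driven 3D flow encoding and depth-guided sampling, it models source-to-driving correspondences more accurately and supports expression and pose editing.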
Details
Motivation: Existing diffusion models condition only on the driving motion and therefore struggle to model pixel-level correspondences between the source image and the driving motion, while 2D optical flow estimation is inaccurate because the problem is ill-posed. Method: Introduces 3D flow, a learning-free geometric prior computed from a parametric 3D head model; designs a 3D flow encoding module that queries the 3D displacement of each target pixel; uses depth-guided sampling to align the 3D flow with 2D motion changes; and embeds this prior into a diffusion model. Result: Significantly outperforms existing methods in motion transfer consistency and source identity fidelity, and supports user-specified editing of facial expression and head pose. Conclusion: 3D flow provides a robust, interpretable geometric prior that compensates for the structural weakness of purely 2D diffusion models in motion transfer, improving the quality and controllability of portrait animation. Abstract: Motion transfer from the driving to the source portrait remains a key challenge in portrait animation. Current diffusion-based approaches condition only on the driving motion, which fails to capture source-to-driving correspondences and consequently yields suboptimal motion transfer. Although flow estimation provides an alternative, predicting dense correspondences from 2D input is ill-posed and often yields inaccurate animation. We address this problem by introducing 3D flows, a learning-free and geometry-driven motion correspondence directly computed from parametric 3D head models. To integrate this 3D prior into the diffusion model, we introduce 3D flow encoding to query potential 3D flows for each target pixel to indicate its displacement back to the source location. To obtain 3D flows aligned with 2D motion changes, we further propose depth-guided sampling to accurately locate the corresponding 3D points for each pixel. Beyond high-fidelity portrait animation, our model further supports user-specified editing of facial expression and head pose. Extensive experiments demonstrate the superiority of our method on consistent driving motion transfer as well as faithful source identity preservation.
[176] From Feature Learning to Spectral Basis Learning: A Unifying and Flexible Framework for Efficient and Robust Shape Matching
Feifan Luo,Hongyang Chen
Main category: cs.CV
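TL;DR: This paper proposes the Advanced Functional Maps framework, which replaces fixed basis functions with a learnable spectral basis optimized through inhibition functions, giving the first unsupervised spectral basis learning method and improving both the accuracy and the efficiency of non-rigid 3D shape matching.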
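The idea that reweighting basis functions behaves like a spectral filter can be illustrated on a toy example. The sketch below filters a signal on a 3-node path graph by scaling its Laplacian-eigenbasis coefficients with a heat kernel exp(-t * lambda). The graph, the fixed heat-kernel filter, and the names `spectral_filter` and `heat_kernel` are assumptions chosen for illustration; the paper's learned inhibition functions are not reproduced.

```python
import math

# Laplacian eigenpairs of the path graph 0 - 1 - 2 (eigenvalues 0, 1, 3).
EIGENVALUES = [0.0, 1.0, 3.0]
EIGENVECTORS = [
    [1 / math.sqrt(3), 1 / math.sqrt(3), 1 / math.sqrt(3)],
    [1 / math.sqrt(2), 0.0, -1 / math.sqrt(2)],
    [1 / math.sqrt(6), -2 / math.sqrt(6), 1 / math.sqrt(6)],
]


def heat_kernel(t):
    """Return a low-pass filter g(lam) = exp(-t * lam)."""
    return lambda lam: math.exp(-t * lam)


def spectral_filter(signal, filt):
    """Project onto the eigenbasis, scale each coefficient, reconstruct."""
    coeffs = [sum(p * s for p, s in zip(phi, signal)) for phi in EIGENVECTORS]
    scaled = [filt(lam) * c for lam, c in zip(EIGENVALUES, coeffs)]
    return [
        sum(scaled[k] * EIGENVECTORS[k][i] for k in range(len(scaled)))
        for i in range(len(signal))
    ]


spike = [1.0, 0.0, 0.0]
smoothed = spectral_filter(spike, heat_kernel(0.5))
```

Because the zero-eigenvalue component passes through unchanged, the filtered signal keeps the same total mass while the spike spreads toward its neighbours.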
Details
Motivation: Existing deep functional map methods neglect optimization of the spectral basis and rely on computationally expensive conventional solvers, leading to suboptimal matching and low efficiency. Method: Proposes a generalized functional map framework with a learnable spectral basis, optimized through learned inhibition functions; designs a heat diffusion module and an unsupervised loss; jointly optimizes feature extraction and the spectral basis end to end; and avoids conventional functional map solvers. Result: Significantly outperforms state-of-the-art feature-learning methods in challenging settings such as non-isometric deformation and topological noise while remaining computationally efficient; shows that optimizing the spectral basis is equivalent to spectral convolution, with inhibition functions acting as spectral filters. Conclusion: Optimizing the spectral basis is critical for functional maps; the proposed unsupervised spectral basis learning method is both accurate and efficient, and opens new directions for representation learning inspired by spectral graph networks. Abstract: Shape matching is a fundamental task in computer graphics and vision, with deep functional maps becoming a prominent paradigm. However, existing methods primarily focus on learning informative feature representations by constraining pointwise and functional maps, while neglecting the optimization of the spectral basis, a critical component of the functional map pipeline. This oversight often leads to suboptimal matching results. Furthermore, many current approaches rely on conventional, time-consuming functional map solvers, incurring significant computational overhead. To bridge these gaps, we introduce Advanced Functional Maps, a framework that generalizes standard functional maps by replacing fixed basis functions with learnable ones, supported by rigorous theoretical guarantees. Specifically, the spectral basis is optimized through a set of learned inhibition functions. Building on this, we propose the first unsupervised spectral basis learning method for robust non-rigid 3D shape matching, enabling the joint, end-to-end optimization of feature extraction and basis functions. Our approach incorporates a novel heat diffusion module and an unsupervised loss function, alongside a streamlined architecture that bypasses expensive solvers and auxiliary losses. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art feature-learning approaches, particularly in challenging non-isometric and topological noise scenarios, while maintaining high efficiency. Finally, we reveal that optimizing basis functions is equivalent to spectral convolution, where inhibition functions act as filters. This insight enables enhanced representations inspired by spectral graph networks, opening new avenues for future research. Our code is available at https://github.com/LuoFeifan77/Unsupervised-Spectral-Basis-Learning.
[177] SIMART: Decomposing Monolithic Meshes into Sim-ready Articulated Assets via MLLM
Chuanrui Zhang,Minghan Qin,Yuang Wang,Baifeng Xie,Hang Li,Ziwei Wang
Main category: cs.CV
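TL;DR: This paper proposes SIMART, a unified multimodal large language model framework that uses a Sparse 3D VQ-VAE to jointly model part-level decomposition and kinematic prediction, substantially reducing the number of 3D tokens and producing high-quality, sim-ready articulated 3D assets.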
Details
Motivation: Existing 3D generation methods focus on static meshes and lack the sim-ready, interactive articulated 3D assets needed for physical simulation and embodied AI; multi-stage pipelines accumulate errors, while unified MLLMs based on dense voxels suffer from long token sequences, high memory overhead, and poor scalability. Method: Proposes the SIMART framework, built around a Sparse 3D VQ-VAE that replaces dense voxel tokenization and greatly compresses the token count; on this basis it jointly models part decomposition and kinematic parameter prediction for end-to-end articulated object generation. Result: Achieves state-of-the-art performance on PartNet-Mobility and in-the-wild AIGC datasets, supports high-fidelity multi-part assemblies, and is successfully applied to physics-based robotic simulation. Conclusion: SIMART demonstrates the effectiveness of combining sparse 3D representations with a unified MLLM, offering a new paradigm for generating articulated 3D content that can be used directly for simulation and interaction. Abstract: High-quality articulated 3D assets are indispensable for embodied AI and physical simulation, yet 3D generation still focuses on static meshes, leaving a gap in "sim-ready" interactive objects. Most recent articulated object creation methods rely on multi-stage pipelines that accumulate errors across decoupled modules. Alternatively, unified MLLMs offer a single-stage path to joint static asset understanding and sim-ready asset generation. However, dense voxel-based 3D tokenization yields long 3D token sequences and high memory overhead, limiting scalability to complex articulated objects. To address this, we propose SIMART, a unified MLLM framework that jointly performs part-level decomposition and kinematic prediction. By introducing a Sparse 3D VQ-VAE, SIMART reduces token counts by 70% vs. dense voxel tokens, enabling high-fidelity multi-part assemblies. SIMART achieves state-of-the-art performance on PartNet-Mobility and in-the-wild AIGC datasets, and enables physics-based robotic simulation.
[178] Harnessing Lightweight Transformer with Contextual Synergic Enhancement for Efficient 3D Medical Image Segmentation
Xinyu Liu,Zhen Chen,Wuyang Li,Chenxin Li,Yixuan Yuan
Main category: cs.CV
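TL;DR: This paper proposes Light-UNETR, a lightweight Transformer that combines the LIDR module and CGLU for model efficiency and uses the CSE strategy (Attention-Guided Replacement and Spatial Masking Consistency) for data efficiency, achieving accurate 3D medical image segmentation with little labeled data and low computational cost.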
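Segmentation quality for this entry is reported as a Jaccard score, that is, the intersection-over-union of predicted and reference masks. A minimal sketch follows, assuming flat binary masks stored as Python lists; the function name `jaccard` and the convention of returning 1.0 for two empty masks are choices made here, not taken from the paper.

```python
def jaccard(pred, target):
    """Intersection-over-union of two equal-length binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Convention chosen for this sketch: two empty masks agree perfectly.
    return 1.0 if union == 0 else intersection / union


pred = [1, 1, 0, 0, 1]
target = [1, 0, 0, 1, 1]
score = jaccard(pred, target)  # 2 shared voxels out of 4 in the union
```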
Details
Motivation: Transformers perform well in 3D medical image segmentation, but their high computational cost and dependence on large amounts of labeled data limit practical use. Method: Proposes Light-UNETR, which includes a Lightweight Dimension Reductive Attention (LIDR) module that captures both global and local features while reducing spatial and channel dimensions; introduces a Compact Gated Linear Unit (CGLU) to control channel interaction; and designs a Contextual Synergic Enhancement (CSE) learning strategy that combines extrinsic (Attention-Guided Replacement) and intrinsic (Spatial Masking Consistency) contextual information to make better use of unlabeled data. Result: Validated on multiple benchmarks; on the Left Atrial Segmentation dataset with only 10% labeled data, it improves Jaccard by 1.43% over BCP while reducing FLOPs by 90.8% and parameters by 85.8%. Conclusion: Light-UNETR substantially improves model and data efficiency while maintaining segmentation accuracy, offering a practical solution for 3D medical image segmentation in resource-constrained settings. Abstract: Transformers have shown remarkable performance in 3D medical image segmentation, but their high computational requirements and need for large amounts of labeled data limit their applicability. To address these challenges, we consider two crucial aspects: model efficiency and data efficiency. Specifically, we propose Light-UNETR, a lightweight transformer designed to achieve model efficiency. Light-UNETR features a Lightweight Dimension Reductive Attention (LIDR) module, which reduces spatial and channel dimensions while capturing both global and local features via multi-branch attention. Additionally, we introduce a Compact Gated Linear Unit (CGLU) to selectively control channel interaction with minimal parameters. Furthermore, we introduce a Contextual Synergic Enhancement (CSE) learning strategy, which aims to boost the data efficiency of Transformers. It first leverages the extrinsic contextual information to support the learning of unlabeled data with Attention-Guided Replacement, then applies Spatial Masking Consistency that utilizes intrinsic contextual information to enhance the spatial context reasoning for unlabeled data. Extensive experiments on various benchmarks demonstrate the superiority of our approach in both performance and efficiency. For example, with only 10% labeled data on the Left Atrial Segmentation dataset, our method surpasses BCP by 1.43% Jaccard while drastically reducing the FLOPs by 90.8% and parameters by 85.8%. Code is released at https://github.com/CUHK-AIM-Group/Light-UNETR.
[179] GeoSANE: Learning Geospatial Representations from Models, Not Data
Joelle Hanna,Damian Falk,Stella X. Yu,Damian Borth
Main category: cs.CV
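TL;DR: This paper proposes GeoSANE, a geospatial model foundry that learns a unified neural representation from the weights of existing foundation models and task-specific models and generates, on demand, weights for target networks across tasks (classification, segmentation, detection) and modalities, significantly improving performance and generalization.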
Details
Motivation: Existing remote sensing foundation models perform well in their own domains but are complementary rather than unified, and none covers the full breadth of geospatial knowledge; a mechanism is needed to combine the strengths of multiple models and avoid having to choose among them. Method: Proposes the GeoSANE framework, which learns a unified neural representation from the weights of existing foundation and task-specific models and can generate finetuning-ready initial weights on demand for a given target architecture. Result: Models generated by GeoSANE outperform models trained from scratch on ten datasets and GEO-Bench, match or surpass state-of-the-art remote sensing foundation models, and outperform pruning and knowledge distillation when generating lightweight networks. Conclusion: By replacing the conventional pre-training paradigm with weight generation, GeoSANE provides a new framework for unifying and transferring geospatial knowledge across models and tasks. Abstract: Recent advances in remote sensing have led to an increase in the number of available foundation models, each trained on different modalities, datasets, and objectives, yet capturing only part of the vast geospatial knowledge landscape. While these models show strong results within their respective domains, their capabilities remain complementary rather than unified. Therefore, instead of choosing one model over another, we aim to combine their strengths into a single shared representation. We introduce GeoSANE, a geospatial model foundry that learns a unified neural representation from the weights of existing foundation models and task-specific models, able to generate novel neural network weights on demand. Given a target architecture, GeoSANE generates weights ready for finetuning for classification, segmentation, and detection tasks across multiple modalities. Models generated by GeoSANE consistently outperform their counterparts trained from scratch, match or surpass state-of-the-art remote sensing foundation models, and outperform models obtained through pruning or knowledge distillation when generating lightweight networks. Evaluations across ten diverse datasets and on GEO-Bench confirm its strong generalization capabilities. By shifting from pre-training to weight generation, GeoSANE introduces a new framework for unifying and transferring geospatial knowledge across models and tasks. Code is available at https://hsg-aiml.github.io/GeoSANE/.
[180] I3DM: Implicit 3D-aware Memory Retrieval and Injection for Consistent Video Scene Generation
Jia Li,Han Yan,Yihang Chen,Siqi Li,Xibin Song,Yifu Wang,Jianfei Cai,Tien-Tsin Wong,Pan Ji
Main category: cs.CV
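TL;DR: This paper proposes I3DM, an implicit 3D-aware memory mechanism that addresses long-term scene consistency in video generation without explicit 3D reconstruction; it uses intermediate features of a pre-trained FF-NVS model for robust view retrieval and a 3D-aligned memory injection module to improve revisit consistency and camera control precision.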
Details
Motivation: Existing video generation methods struggle to maintain long-term scene consistency when revisiting previously explored areas: methods based on explicit 3D geometry reconstruction suffer from error accumulation and scale ambiguity, while methods based on naive camera field-of-view retrieval tend to fail under complex occlusions. Method: Proposes the I3DM implicit 3D-aware memory mechanism, comprising (1) a 3D-aware memory retrieval strategy that scores view relevance using intermediate features of a pre-trained FF-NVS model, and (2) a 3D-aligned memory injection module that implicitly warps historical frame content to the target view and adaptively conditions generation on reliable warping regions. Result: Experiments show that the method outperforms state-of-the-art approaches in revisit consistency, generation fidelity, and camera control precision. Conclusion: By bypassing explicit 3D reconstruction and relying on implicit 3D-aware memory, I3DM achieves more robust, higher-fidelity long video scene generation, with particular advantages under complex occlusions. Abstract: Despite remarkable progress in video generation, maintaining long-term scene consistency upon revisiting previously explored areas remains challenging. Existing solutions rely either on explicitly constructing 3D geometry, which suffers from error accumulation and scale ambiguity, or on naive camera Field-of-View (FoV) retrieval, which typically fails under complex occlusions. To overcome these limitations, we propose I3DM, a novel implicit 3D-aware memory mechanism for consistent video scene generation that bypasses explicit 3D reconstruction. At the core of our approach is a 3D-aware memory retrieval strategy, which leverages the intermediate features of a pre-trained Feed-Forward Novel View Synthesis (FF-NVS) model to score view relevance, enabling robust retrieval even in highly occluded scenarios. Furthermore, to fully utilize the retrieved historical frames, we introduce a 3D-aligned memory injection module. This module implicitly warps historical content to the target view and adaptively conditions the generation on reliable warping regions, leading to improved revisit consistency and accurate camera control. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches, achieving superior revisit consistency, generation fidelity, and camera control precision.
[181] SIGMA: A Physics-Based Benchmark for Gas Chimney Understanding in Seismic Images
Bao Truong,Quang Nguyen,Baoru Huang,Jinpei Han,Van Nguyen,Ngan Le,Minh-Tan Pham,Doan Huy Hien,Anh Nguyen
Main category: cs.CV
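TL;DR: This paper introduces SIGMA, a new physics-based dataset for gas chimney detection and enhancement in seismic images, with pixel-level masks and paired degraded and ground-truth images, and validates it as a challenging benchmark.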
Details
Motivation: Accurate detection of gas chimneys is crucial for assessing hydrocarbon potential and avoiding drilling hazards, but strong seismic attenuation and scattering make it difficult; traditional physics-based methods are computationally expensive and sensitive to model errors, while deep learning lacks labeled data. Method: Builds SIGMA, a seismic dataset generated with physics-based modeling that covers a range of geological settings and acquisition conditions and provides pixel-level gas-chimney masks and paired degraded and ground-truth images for detection and enhancement tasks. Result: Experiments show that SIGMA is a challenging benchmark for gas chimney interpretation and benefits general seismic understanding. Conclusion: SIGMA fills the gap in high-quality labeled data for gas chimney research in seismic imaging and provides a solid foundation for methods that combine physical modeling with deep learning. Abstract: Seismic images reconstruct subsurface reflectivity from field recordings, guiding exploration and reservoir monitoring. Gas chimneys are vertical anomalies caused by subsurface fluid migration. Understanding these phenomena is crucial for assessing hydrocarbon potential and avoiding drilling hazards. However, accurate detection is challenging due to strong seismic attenuation and scattering. Traditional physics-based methods are computationally expensive and sensitive to model errors, while deep learning offers efficient alternatives, yet lacks labeled datasets. In this work, we introduce SIGMA, a new physics-based dataset for gas chimney understanding in seismic images, featuring (i) pixel-level gas-chimney masks for detection and (ii) paired degraded and ground-truth images for enhancement. We employed physics-based methods that cover a wide range of geological settings and data acquisition conditions. Comprehensive experiments demonstrate that SIGMA serves as a challenging benchmark for gas chimney interpretation and benefits general seismic understanding.
[182] 3DCity-LLM: Empowering Multi-modality Large Language Models for 3D City-scale Perception and Understanding
Yiping Chen,Jinpeng Li,Wenyu Ke,Yang Luo,Jie Ouyang,Zhongjie He,Li Liu,Hongchao Fan,Hao Wu
Main category: cs.CV
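TL;DR: This paper proposes 3DCity-LLM, a multi-modality large language model framework for 3D city-scale scenes that performs city-level vision-language perception and understanding through a coarse-to-fine, three-branch feature encoding strategy (target object, inter-object relationship, global scene); it also builds the 3DCity-LLM-1.2M dataset of about 1.2 million high-quality samples and a multi-dimensional evaluation protocol, and significantly outperforms existing methods on two benchmarks.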
Details
Motivation: Existing multi-modality large language models perform well in object-centric or indoor scenarios but face major challenges when scaled to 3D city-scale environments; a unified framework that supports spatial reasoning and urban intelligence is needed. Method: Proposes the unified 3DCity-LLM framework, which uses a coarse-to-fine feature encoding strategy with three parallel branches for the target object, inter-object relationships, and the global scene; builds the large-scale, high-quality 3DCity-LLM-1.2M dataset (about 1.2 million samples across seven task categories), which integrates explicit 3D numerical information and user-oriented simulations; and designs a multi-dimensional evaluation protocol based on text-similarity metrics and LLM-based semantic assessment. Result: On two city-scale benchmarks, 3DCity-LLM significantly outperforms existing state-of-the-art methods, confirming its effectiveness for spatial reasoning and urban intelligence. Conclusion: 3DCity-LLM offers a feasible and promising technical path for 3D city-scale vision-language understanding and advances spatial intelligence and urban AI. Abstract: While multi-modality large language models excel in object-centric or indoor scenarios, scaling them to 3D city-scale environments remains a formidable challenge. To bridge this gap, we propose 3DCity-LLM, a unified framework designed for 3D city-scale vision-language perception and understanding. 3DCity-LLM employs a coarse-to-fine feature encoding strategy comprising three parallel branches for target object, inter-object relationship, and global scene. To facilitate large-scale training, we introduce the 3DCity-LLM-1.2M dataset, which comprises approximately 1.2 million high-quality samples across seven representative task categories, ranging from fine-grained object analysis to multi-faceted scene planning. This strictly quality-controlled dataset integrates explicit 3D numerical information and diverse user-oriented simulations, enriching the question-answering diversity and realism of urban scenarios. Furthermore, we apply a multi-dimensional protocol based on text-similarity metrics and LLM-based semantic assessment to ensure faithful and comprehensive evaluations for all methods. Extensive experiments on two benchmarks demonstrate that 3DCity-LLM significantly outperforms existing state-of-the-art methods, offering a promising and meaningful direction for advancing spatial reasoning and urban intelligence. The source code and dataset are available at https://github.com/SYSU-3DSTAILab/3D-City-LLM.
[183] DetPO: In-Context Learning with Multi-Modal LLMs for Few-Shot Object Detection
Gautam Rajendrakumar Gare,Neehar Peri,Matvei Popov,Shruti Jain,John Galeotti,Deva Ramanan
Main category: cs.CV
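TL;DR: This paper proposes Detection Prompt Optimization (DetPO), a gradient-free test-time text prompt optimization method that improves the generalization of multi-modal large language models (MLLMs) in few-shot object detection, especially for out-of-distribution classes, tasks, and imaging modalities. It refines text-only prompts by maximizing detection accuracy on a few visual examples while calibrating prediction confidence, and significantly outperforms prior black-box methods on Roboflow20-VL and LVIS.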
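The gradient-free, test-time optimization described here can be pictured as a search over candidate text prompts scored on a handful of labeled examples. The sketch below is a generic greedy search under that assumption; `optimize_prompt`, the `score_fn` callback, and the toy scorer are hypothetical stand-ins (a real scorer would query the MLLM and measure detection accuracy on the few-shot examples), and DetPO's confidence-calibration step is omitted.

```python
def optimize_prompt(initial, candidate_edits, score_fn, rounds=3):
    """Greedy black-box search: keep an edit only if the score improves."""
    best_prompt, best_score = initial, score_fn(initial)
    for _ in range(rounds):
        improved = False
        for edit in candidate_edits:
            if edit in best_prompt:
                continue  # skip edits that are already part of the prompt
            trial = f"{best_prompt} {edit}"
            trial_score = score_fn(trial)
            if trial_score > best_score:
                best_prompt, best_score = trial, trial_score
                improved = True
        if not improved:
            break  # no edit helped in this round
    return best_prompt, best_score


def toy_score(prompt):
    """Hypothetical scorer: rewards two helpful phrases, penalizes length."""
    helpful = ("small metallic", "seen from above")
    return sum(phrase in prompt for phrase in helpful) - 0.01 * len(prompt.split())


edits = ["small metallic", "very very nice", "seen from above"]
prompt, score = optimize_prompt("Detect every bolt.", edits, toy_score)
```

Only the scorer touches the model, so the same loop works when the model is reachable solely through an API.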
Details
Motivation: Current MLLMs perform well on standard detection benchmarks but generalize poorly to out-of-distribution classes, tasks, and imaging modalities; existing few-shot prompting (such as in-context learning) often performs worse than prompting with class names alone, indicating that models cannot yet make effective use of visual examples and textual descriptions; and because of API-only access or the high cost of fine-tuning open-weights models, black-box prompt optimization is worth exploring. Method: Proposes DetPO, a gradient-free test-time optimization method that adjusts only the text prompt to maximize detection accuracy on a small number of annotated visual examples while calibrating the model's prediction confidence, without access to model parameters or gradients. Result: DetPO yields consistent improvements across several generalist MLLMs on Roboflow20-VL and LVIS, outperforming prior black-box methods by up to 9.7%. Conclusion: Text-only prompts, optimized in a black-box manner at test time, can significantly improve MLLM generalization in few-shot object detection; DetPO offers an efficient, practical prompt engineering solution when model access or compute is limited. Abstract: Multi-Modal LLMs (MLLMs) demonstrate strong visual grounding capabilities on popular object detection benchmarks like OdinW-13 and RefCOCO. However, state-of-the-art models still struggle to generalize to out-of-distribution classes, tasks and imaging modalities not typically found in their pre-training. While in-context prompting is a common strategy to improve performance across diverse tasks, we find that it often yields lower detection accuracy than prompting with class names alone. This suggests that current MLLMs cannot yet effectively leverage few-shot visual examples and rich textual descriptions for object detection. Since frontier MLLMs are typically only accessible via APIs, and state-of-the-art open-weights models are prohibitively expensive to fine-tune on consumer-grade hardware, we instead explore black-box prompt optimization for few-shot object detection. To this end, we propose Detection Prompt Optimization (DetPO), a gradient-free test-time optimization approach that refines text-only prompts by maximizing detection accuracy on few-shot visual training examples while calibrating prediction confidence. Our proposed approach yields consistent improvements across generalist MLLMs on Roboflow20-VL and LVIS, outperforming prior black-box approaches by up to 9.7%. Our code is available at https://github.com/ggare-cmu/DetPO
[184] RealMaster: Lifting Rendered Scenes into Photorealistic Video
Dana Cohen-Bar,Ido Sobol,Raphael Bensadoun,Shelly Sheynin,Oran Gafni,Or Patashnik,Daniel Cohen-Or,Amit Zohar
Main category: cs.CV
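TL;DR: RealMaster combines 3D engines with video diffusion models, using anchor-frame propagation and IC-LoRA fine-tuning to lift rendered video into photorealistic video that stays geometrically consistent and faithful to the original dynamics.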
Details
Motivation: Existing video generation models lack precise scene control and guarantees of 3D consistency, while 3D engines are structurally precise but fall short on visual realism; the sim-to-real gap needs to be bridged. Method: Proposes RealMaster: (1) builds a paired dataset with a geometry-guided propagation strategy based on anchor frames (the first and last frames); (2) trains an IC-LoRA model on this data for controllable translation from rendered video to photorealistic video, handling objects and characters that appear mid-sequence and supporting inference without anchor frames. Result: On complex GTA-V sequences it significantly outperforms existing video editing baselines, improving photorealism while strictly preserving the geometry, motion dynamics, and identity specified by the original 3D control. Conclusion: RealMaster combines the controllability of 3D engines with the realistic generation of diffusion models, offering a new paradigm for controllable video generation that balances structural precision with global semantic transformation. Abstract: State-of-the-art video generation models produce remarkable photorealism, but they lack the precise control required to align generated content with specific scene requirements. Furthermore, without an underlying explicit geometry, these models cannot guarantee 3D consistency. Conversely, 3D engines offer granular control over every scene element and provide native 3D consistency by design, yet their output often remains trapped in the "uncanny valley". Bridging this sim-to-real gap requires both structural precision, where the output must exactly preserve the geometry and dynamics of the input, and global semantic transformation, where materials, lighting, and textures must be holistically transformed to achieve photorealism. We present RealMaster, a method that leverages video diffusion models to lift rendered video into photorealistic video while maintaining full alignment with the output of the 3D engine. To train this model, we generate a paired dataset via an anchor-based propagation strategy, where the first and last frames are enhanced for realism and propagated across the intermediate frames using geometric conditioning cues. We then train an IC-LoRA on these paired videos to distill the high-quality outputs of the pipeline into a model that generalizes beyond the pipeline's constraints, handling objects and characters that appear mid-sequence and enabling inference without requiring anchor frames. Evaluated on complex GTA-V sequences, RealMaster significantly outperforms existing video editing baselines, improving photorealism while preserving the geometry, dynamics, and identity specified by the original 3D control.
[185] InverFill: One-Step Inversion for Enhanced Few-Step Diffusion Inpainting
Duc Vu,Kien Nguyen,Trong-Tung Nguyen,Ngan Nguyen,Phong Nguyen,Khoi Nguyen,Cuong Pham,Anh Tran
Main category: cs.CV
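TL;DR: This paper proposes InverFill, a one-step inversion method that injects semantic information from the masked image into the initial noise, enabling high-quality few-step image inpainting without costly retraining and with little inference overhead.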
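The blended sampling pipeline that InverFill builds on can be summarized by one operation: after a denoising step, generated content is kept inside the mask and the known image content is restored outside it. The sketch below shows that single blend on flat lists, assuming mask value 1 marks the region to inpaint; `blend` and the toy values are illustrative, and neither the one-step inversion network nor the re-noising of the known region is modeled.

```python
def blend(generated, known, mask):
    """Keep generated values where mask == 1 and known values elsewhere."""
    if not (len(generated) == len(known) == len(mask)):
        raise ValueError("all inputs must have the same length")
    return [m * g + (1 - m) * k for g, k, m in zip(generated, known, mask)]


known = [0.2, 0.4, 0.6, 0.8]      # pixels from the input image
generated = [0.9, 0.1, 0.5, 0.3]  # pixels proposed by the sampler
mask = [0, 1, 1, 0]               # 1 = hole to fill
result = blend(generated, known, mask)
```

With few sampling steps there are few chances to repeat this blend, which is one way to read the summary's point that a well-aligned initial noise matters more in the few-step regime.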
Details
Motivation: Existing diffusion models produce realistic inpainting results but need many sampling steps and are inefficient; few-step text-to-image models applied directly to inpainting suffer from semantic misalignment and poor blending because of random Gaussian noise initialization. Method: Proposes InverFill, a one-step inversion method tailored for inpainting that encodes semantic information from the input masked image into the initial noise and plugs into the blended sampling pipeline of a few-step text-to-image model, avoiding retraining or iterative optimization. Result: InverFill significantly improves the inpainting quality and text coherence of few-step baselines, matches specialized inpainting models at low NFEs, requires no real-image supervision, and adds minimal inference overhead. Conclusion: InverFill gives few-step diffusion models an efficient, plug-and-play inpainting capability, resolving the tension between semantic alignment and sampling efficiency and advancing practical image inpainting. Abstract: Recent diffusion-based models achieve photorealism in image inpainting but require many sampling steps, limiting practical use. Few-step text-to-image models offer faster generation, but naively applying them to inpainting yields poor harmonization and artifacts between the background and inpainted region. We trace this to random Gaussian noise initialization, which under a low number of function evaluations (NFEs) causes semantic misalignment and reduced fidelity. To overcome this, we propose InverFill, a one-step inversion method tailored for inpainting that injects semantic information from the input masked image into the initial noise, enabling high-fidelity few-step inpainting. Instead of training inpainting models, InverFill leverages few-step text-to-image models in a blended sampling pipeline with semantically aligned noise as input, significantly improving vanilla blended sampling and even matching specialized inpainting models at low NFEs. Moreover, InverFill does not require real-image supervision and only adds minimal inference overhead. Extensive experiments show that InverFill consistently boosts baseline few-step models, improving image quality and text coherence without costly retraining or heavy iterative optimization.
[186] UniFunc3D: Unified Active Spatial-Temporal Grounding for 3D Functionality Segmentation
Jiaying Lin,Dan Xu
Main category: cs.CV
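TL;DR: UniFunc3D is a unified, training-free framework that treats a multimodal large language model as an active observer; with a coarse-to-fine active spatial-temporal grounding strategy it performs semantic, temporal, and spatial reasoning jointly in a single forward pass, enabling functionality segmentation in 3D scenes from natural-language instructions with significantly improved performance.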
Details
Motivation: Existing methods rely on fragmented pipelines, suffer from visual blindness during initial task parsing, and are limited by single-scale, passive, heuristic frame selection. Method: Proposes UniFunc3D, which treats the multimodal large language model as an active observer, consolidates semantic, temporal, and spatial reasoning into a single forward pass, and introduces a coarse-to-fine active spatial-temporal grounding strategy that adaptively selects key video frames and focuses on high-detail interactive regions while preserving the global context needed for disambiguation. Result: Achieves state-of-the-art performance on SceneFun3D with a relative mIoU improvement of 59.9%, significantly surpassing both training-based and training-free methods, without any task-specific training. Conclusion: UniFunc3D demonstrates the effectiveness of training-free, unified modeling and active perception for 3D functionality segmentation, and offers a new paradigm for language-vision-action alignment in embodied intelligence. Abstract: Functionality segmentation in 3D scenes requires an agent to ground implicit natural-language instructions into precise masks of fine-grained interactive elements. Existing methods rely on fragmented pipelines that suffer from visual blindness during initial task parsing. We observe that these methods are limited by single-scale, passive and heuristic frame selection. We present UniFunc3D, a unified and training-free framework that treats the multimodal large language model as an active observer. By consolidating semantic, temporal, and spatial reasoning into a single forward pass, UniFunc3D performs joint reasoning to ground task decomposition in direct visual evidence. Our approach introduces active spatial-temporal grounding with a coarse-to-fine strategy. This allows the model to select correct video frames adaptively and focus on high-detail interactive parts while preserving the global context necessary for disambiguation. On SceneFun3D, UniFunc3D achieves state-of-the-art performance, surpassing both training-free and training-based methods by a large margin with a relative 59.9% mIoU improvement, without any task-specific training. Code will be released on our project page: https://jiaying.link/unifunc3d.
[187] TETO: Tracking Events with Teacher Observation for Motion Estimation and Frame Interpolation
Jini Yang,Eunbeen Hong,Soowon Son,Hyunkoo Lee,Sunghwan Hong,Sunok Kim,Seungryong Kim
Main category: cs.CV
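TL;DR: This paper proposes TETO, a framework that learns motion estimation from a small amount of real-world event camera data (about 25 minutes) through knowledge distillation, without large-scale synthetic data, and uses the result to improve frame interpolation with a video diffusion model.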
Details
Motivation: Existing event-based motion estimation methods depend on large-scale synthetic data and suffer a significant sim-to-real gap, while annotated real-world data is scarce and expensive. Method: Proposes TETO (Tracking Events with Teacher Observation), a teacher-student framework that uses a pretrained RGB tracker as the teacher to distill knowledge into a model trained on unannotated real event data; designs a motion-aware data curation and query sampling strategy that disentangles object motion from ego-motion; and jointly predicts point trajectories and dense optical flow, which serve as explicit motion priors to condition a pretrained video diffusion transformer for frame interpolation. Result: Achieves state-of-the-art point tracking on EVIMO2 and state-of-the-art optical flow on DSEC with far less training data than prior methods, and shows frame interpolation quality significantly better than existing methods on BS-ERGB and HQ-EVFI. Conclusion: A very small amount of real event data is enough to learn high-quality motion representations efficiently, and these representations transfer to downstream video generation, narrowing the gap between event-camera motion estimation and practical applications. Abstract: Event cameras capture per-pixel brightness changes with microsecond resolution, offering continuous motion information lost between RGB frames. However, existing event-based motion estimators depend on large-scale synthetic data that often suffers from a significant sim-to-real gap. We propose TETO (Tracking Events with Teacher Observation), a teacher-student framework that learns event motion estimation from only ~25 minutes of unannotated real-world recordings through knowledge distillation from a pretrained RGB tracker. Our motion-aware data curation and query sampling strategy maximizes learning from limited data by disentangling object motion from dominant ego-motion. The resulting estimator jointly predicts point trajectories and dense optical flow, which we leverage as explicit motion priors to condition a pretrained video diffusion transformer for frame interpolation. We achieve state-of-the-art point tracking on EVIMO2 and optical flow on DSEC using orders of magnitude less training data, and demonstrate that accurate motion estimation translates directly to superior frame interpolation quality on BS-ERGB and HQ-EVFI.
[188] One View Is Enough! Monocular Training for In-the-Wild Novel View Generation
Adrien Ramanana Rahary,Nicolas Dufour,Patrick Perez,David Picard
Main category: cs.CV
TL;DR: OVIE is a monocular novel-view synthesis method trained solely on unpaired internet images, using a monocular depth estimator as a geometric scaffold during training but operating geometry-free at inference, achieving state-of-the-art zero-shot performance and 600x speedup over the second-best baseline.
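The training-time scaffold described here (lift with depth, move the camera, project) can be sketched for a single pixel with a pinhole camera model. The sketch assumes known intrinsics and a translation-only change of view; `reproject_pixel`, the intrinsics, and the numbers are illustrative assumptions, and rotation, disocclusion handling, and the masked losses are left out.

```python
def reproject_pixel(u, v, depth, fx, fy, cx, cy, translation):
    """Lift pixel (u, v) to 3D using its depth, shift it, project it back.

    `translation` is the offset added to the point's coordinates in the
    camera frame; returns None if the point ends up behind the camera.
    """
    # Back-project through the pinhole model.
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    z = depth
    # Express the point in the frame of the new (pseudo-target) view.
    tx, ty, tz = translation
    x, y, z = x + tx, y + ty, z + tz
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


# A pixel at the principal point, 2 units away, seen after a 0.1-unit shift.
target = reproject_pixel(320, 240, 2.0, 500.0, 500.0, 320.0, 240.0, (0.1, 0.0, 0.0))
```

Applying this to every pixel of a source image leaves holes where no source pixel lands, which is the situation the masked training formulation in the summary is meant to handle.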
Details
Motivation: Monocular novel-view synthesis has traditionally relied on multi-view image pairs for supervision, restricting data scale and diversity; the authors argue that only one view is sufficient. Method: OVIE uses a monocular depth estimator to lift a source image into 3D, applies sampled camera transformations, and projects to generate pseudo-target views; it introduces a masked training formulation to restrict losses to valid regions, enabling scalable training on 30 million uncurated images. Result: OVIE outperforms prior methods in zero-shot novel-view synthesis and is 600x faster than the second-best baseline, trained exclusively on in-the-wild images. Conclusion: Single-view supervision is sufficient for high-quality monocular novel-view synthesis; OVIE demonstrates that leveraging unpaired data with geometric scaffolding and masked losses enables scalable, efficient, and generalizable learning. Abstract: Monocular novel-view synthesis has long required multi-view image pairs for supervision, limiting training data scale and diversity. We argue it is not necessary: one view is enough. We present OVIE, trained entirely on unpaired internet images. We leverage a monocular depth estimator as a geometric scaffold at training time: we lift a source image into 3D, apply a sampled camera transformation, and project to obtain a pseudo-target view. To handle disocclusions, we introduce a masked training formulation that restricts geometric, perceptual, and textural losses to valid regions, enabling training on 30 million uncurated images. At inference, OVIE is geometry-free, requiring no depth estimator or 3D representation. Trained exclusively on in-the-wild images, OVIE outperforms prior methods in a zero-shot setting, while being 600x faster than the second-best baseline. Code and models are publicly available at https://github.com/AdrienRR/ovie.
[189] AgentRVOS: Reasoning over Object Tracks for Zero-Shot Referring Video Object Segmentation
Woojeong Jin,Jaeho Lee,Heeseong Shin,Seungho Jang,Junhwan Heo,Seungryong Kim
Main category: cs.CV
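TL;DR: This paper proposes AgentRVOS, a training-free agentic method for referring video object segmentation that combines the strengths of SAM3 and an MLLM: SAM3 supplies mask tracks over the full spatio-temporal extent as object-level evidence, and the MLLM performs query-grounded reasoning and iterative pruning over them, improving temporal coverage and reasoning quality and reaching state-of-the-art results among training-free methods on multiple benchmarks.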
Details
Motivation: Existing training-free RVOS methods ask the MLLM to make temporal decisions before any object-level evidence is available, which limits reasoning quality and spatio-temporal coverage. Method: Proposes the AgentRVOS framework: SAM3 first generates spatio-temporal mask tracks over the whole video (providing reliable perception and object-level evidence), and the MLLM then performs query-grounded iterative reasoning and pruning over these tracks, guided by SAM3's temporal existence information. Result: Achieves state-of-the-art performance among training-free RVOS methods on multiple benchmarks, with consistent results across different MLLM backbones. Conclusion: Decoupling perception (SAM3) from reasoning (MLLM) and letting them interact in a closed loop overcomes the limitations of temporal decisions made by an MLLM alone, offering a new paradigm for training-free video understanding. Abstract: Referring Video Object Segmentation (RVOS) aims to segment a target object throughout a video given a natural language query. Training-free methods for this task follow a common pipeline: an MLLM selects keyframes, grounds the referred object within those frames, and a video segmentation model propagates the results. While intuitive, this design asks the MLLM to make temporal decisions before any object-level evidence is available, limiting both reasoning quality and spatio-temporal coverage. To overcome this, we propose AgentRVOS, a training-free agentic pipeline built on the complementary strengths of SAM3 and an MLLM. Given a concept derived from the query, SAM3 provides reliable perception over the full spatio-temporal extent through generated mask tracks. The MLLM then identifies the target through query-grounded reasoning over this object-level evidence, iteratively pruning guided by SAM3's temporal existence information. Extensive experiments show that AgentRVOS achieves state-of-the-art performance among training-free methods across multiple benchmarks, with consistent results across diverse MLLM backbones. Our project page is available at: https://cvlab-kaist.github.io/AgentRVOS/.
[190] Foveated Diffusion: Efficient Spatially Adaptive Image and Video Generation
Brian Chao,Lior Yariv,Howard Xiao,Gordon Wetzstein
Main category: cs.CV
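TL;DR: This paper proposes a foveation-based mixed-resolution generation method that exploits the high acuity of human vision at the gaze point and its low acuity in the periphery to allocate tokens non-uniformly, substantially reducing computation while preserving perceptual quality.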
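To see why non-uniform token allocation pays off when attention cost grows quadratically with token count, consider a toy patch grid in which blocks near the gaze point keep every patch as a token and peripheral blocks are merged into one token each. The grid size, block size, radius, and the function name `foveated_token_count` are hypothetical; the paper's actual mask construction is not reproduced.

```python
import math


def foveated_token_count(height, width, block, gaze, radius):
    """Count tokens when only blocks near the gaze keep full resolution.

    The grid is split into block x block groups of patches (height and
    width are assumed to be multiples of block). A group whose centre lies
    within `radius` of `gaze` contributes block * block tokens; any other
    group is merged into a single token.
    """
    tokens = 0
    for top in range(0, height, block):
        for left in range(0, width, block):
            centre = (top + block / 2, left + block / 2)
            if math.dist(centre, gaze) <= radius:
                tokens += block * block
            else:
                tokens += 1
    return tokens


full = 4 * 4  # uniform full-resolution grid
fov = foveated_token_count(4, 4, block=2, gaze=(1.0, 1.0), radius=1.0)
attention_cost_ratio = (fov / full) ** 2  # quadratic in token count
```

In this toy case 7 tokens replace 16, so pairwise attention cost falls to roughly a fifth of the uniform grid's, a larger saving than the token reduction alone.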
Details
Motivation: 扩散模型和流匹配模型在内容生成中表现出色,但高分辨率、高帧率、长上下文等需求导致计算复杂度急剧上升(与token数呈平方关系),亟需提升生成效率。 Method: 提出基于注视点的混合分辨率token建模:设计foveated mask,使token密度随视觉敏感度变化;从高分辨率数据中构建混合分辨率token;对已有基础扩散模型进行foveated后训练,保证跨分辨率内容一致性。 Result: 在保持生成结果感知质量与全分辨率相当的前提下,大幅降低token数量和生成时间;通过用户研究和大量分析验证了foveation作为高效生成新维度的有效性与可扩展性。 Conclusion: foveation是一种实用且可扩展的高效生成优化路径,为交互式高保真图像/视频生成提供了新的计算-感知协同设计范式。 Abstract: Diffusion and flow matching models have unlocked unprecedented capabilities for creative content creation, such as interactive image and streaming video generation. The growing demand for higher resolutions, frame rates, and context lengths, however, makes efficient generation increasingly challenging, as computational complexity grows quadratically with the number of generated tokens. Our work seeks to optimize the efficiency of the generation process in settings where the user's gaze location is known or can be estimated, for example, by using eye tracking. In these settings, we leverage the eccentricity-dependent acuity of human vision: while a user perceives very high-resolution visual information in a small region around their gaze location (the foveal region), the ability to resolve detail quickly degrades in the periphery of the visual field. Our approach starts with a mask modeling the foveated resolution to allocate tokens non-uniformly, assigning higher token density to foveal regions and lower density to peripheral regions. An image or video is generated in a mixed-resolution token setting, yielding results perceptually indistinguishable from full-resolution generation, while drastically reducing the token count and generation time. To this end, we develop a principled mechanism for constructing mixed-resolution tokens directly from high-resolution data, allowing a foveated diffusion model to be post-trained from an existing base model while maintaining content consistency across resolutions. 
We validate our approach through extensive analysis and a carefully designed user study, demonstrating the efficacy of foveation as a practical and scalable axis for efficient generation.
[191] VISion On Request: Enhanced VLLM efficiency with sparse, dynamically selected, vision-language interactions
Adrian Bulat,Alberto Baldrati,Ioannis Maniadis Metaxas,Yassine Ouali,Georgios Tzimiropoulos
Main category: cs.CV
TL;DR: VISOR is a novel method for improving the efficiency of Large Vision-Language Models (LVLMs) by sparsifying image-text token interactions instead of reducing visual tokens, enabling high-resolution reasoning without sacrificing performance.
Details
Motivation: Existing visual token reduction methods create an information bottleneck that harms performance on fine-grained visual understanding and reasoning tasks. Method: VISOR sparsifies image-text interaction by using efficient cross-attention for general context and dynamically selecting few self-attention layers to refine visual representations; it trains a universal network across computational budgets and adds a lightweight policy for per-sample visual computation allocation. Result: VISOR drastically reduces computational cost while matching or exceeding state-of-the-art performance across diverse benchmarks, especially excelling in challenging, detail-intensive visual tasks. Conclusion: Sparsifying image-text interaction—not discarding visual tokens—is a more effective paradigm for efficient and capable LVLMs. Abstract: Existing approaches for improving the efficiency of Large Vision-Language Models (LVLMs) are largely based on the concept of visual token reduction. This approach, however, creates an information bottleneck that impairs performance, especially on challenging tasks that require fine-grained understanding and reasoning. In this work, we challenge this paradigm by introducing VISion On Request (VISOR), a method that reduces inference cost without discarding visual information. Instead of compressing the image, VISOR improves efficiency by sparsifying the interaction between image and text tokens. Specifically, the language model attends to the full set of high-resolution visual tokens through a small, strategically placed set of attention layers: general visual context is provided by efficient cross-attention between text-image, while a few well-placed and dynamically selected self-attention layers refine the visual representations themselves, enabling complex, high-resolution reasoning when needed. 
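To make the efficiency argument concrete, the sketch below counts query-key pairs under a deliberately simplified cost model. It is not VISOR's architecture: it assumes that a "dense" layer runs joint self-attention over all text and visual tokens, that a "sparse" layer runs text self-attention plus text-to-image cross-attention only, and that every other cost is ignored. The function name `attention_pairs` and all parameters are hypothetical.

```python
def attention_pairs(n_text, n_vis, n_layers, n_dense_layers):
    """Compare query-key pair counts for dense vs. sparsified interaction.

    Assumed cost model (illustrative only):
      - dense layer:  joint self-attention over text + visual tokens
      - sparse layer: text self-attention plus text-to-image cross-attention,
                      with the visual tokens left unchanged
    Returns (all_dense_pairs, sparsified_pairs).
    """
    if not 0 <= n_dense_layers <= n_layers:
        raise ValueError("n_dense_layers must be between 0 and n_layers")
    dense_layer = (n_text + n_vis) ** 2
    sparse_layer = n_text * n_text + n_text * n_vis
    all_dense = n_layers * dense_layer
    sparsified = (n_dense_layers * dense_layer
                  + (n_layers - n_dense_layers) * sparse_layer)
    return all_dense, sparsified

# Usage: 64 text tokens, 2304 visual tokens, 32 layers, 4 of them dense.
baseline, sparse = attention_pairs(64, 2304, 32, 4)
print(f"pair-count ratio: {sparse / baseline:.3f}")
```

Under this model the visual tokens are never discarded; savings come only from how often they take part in full self-attention, and varying `n_dense_layers` corresponds to the range of computational budgets the abstract mentions.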
Based on this principle, we first train a single universal network on a range of computational budgets by varying the number of self-attention layers, and then introduce a lightweight policy mechanism that dynamically allocates visual computation based on per-sample complexity. Extensive experiments show that VISOR drastically reduces computational cost while matching or exceeding state-of-the-art results across a diverse suite of benchmarks, and excels in challenging tasks that require detailed visual understanding.
[192] WildWorld: A Large-Scale Dataset for Dynamic World Modeling with Actions and Explicit State toward Generative ARPG
Zhen Li,Zian Meng,Shuwei Shi,Wenshuo Peng,Yuwei Wu,Bo Zheng,Chuanhao Li,Kaipeng Zhang
Main category: cs.CV
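TL;DR: This paper introduces WildWorld, a large-scale action-conditioned world modeling dataset collected from the AAA game Monster Hunter: Wilds, containing over 100 million video frames, more than 450 semantically explicit actions, and synchronized skeleton, state, camera, and depth annotations; it also builds the WildBench benchmark, which exposes the limits of current models in semantic action modeling and long-horizon state consistency.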
Details
Motivation: Existing video world modeling datasets lack diverse, semantically clear action spaces, and actions are often entangled with pixel-level changes, making it hard to learn structured world dynamics that stay consistent over long horizons. Method: The authors automatically collect large-scale action-video pairs with explicit state annotations from a real game engine (WildWorld) and design WildBench, a benchmark with two tasks, Action Following and State Alignment, to systematically evaluate world models. Result: Experiments show that existing models face persistent challenges in modeling semantically rich actions and maintaining long-horizon state consistency, underscoring the need for state-aware video generation. Conclusion: WildWorld provides a high-quality benchmark for action-driven world modeling and pushes models from pixel-level modeling toward state-aware, semantically driven dynamics learning. Abstract: Dynamical systems theory and reinforcement learning view world evolution as latent-state dynamics driven by actions, with visual observations providing partial information about the state. Recent video world models attempt to learn this action-conditioned dynamics from data. However, existing datasets rarely match the requirement: they typically lack diverse and semantically meaningful action spaces, and actions are directly tied to visual observations rather than mediated by underlying states. As a result, actions are often entangled with pixel-level changes, making it difficult for models to learn structured world dynamics and maintain consistent evolution over long horizons. In this paper, we propose WildWorld, a large-scale action-conditioned world modeling dataset with explicit state annotations, automatically collected from a photorealistic AAA action role-playing game (Monster Hunter: Wilds). WildWorld contains over 108 million frames and features more than 450 actions, including movement, attacks, and skill casting, together with synchronized per-frame annotations of character skeletons, world states, camera poses, and depth maps. We further derive WildBench to evaluate models through Action Following and State Alignment. Extensive experiments reveal persistent challenges in modeling semantically rich actions and maintaining long-horizon state consistency, highlighting the need for state-aware video generation. The project page is https://shandaai.github.io/wildworld-project/.
[193] DA-Flow: Degradation-Aware Optical Flow Estimation with Diffusion Models
Jaewon Min,Jaeeun Lee,Yeji Choi,Paul Hyunbin Cho,Jin Hyeon Kim,Tae-Young Lee,Jongsik Ahn,Hwayeong Lee,Seonghyun Park,Seungryong Kim
Main category: cs.CV
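TL;DR: This paper proposes a new task, Degradation-Aware Optical Flow, and builds DA-Flow, a hybrid architecture that combines the degradation-aware intermediate representations of an image restoration diffusion model with spatio-temporal attention and convolutional features, achieving markedly better zero-shot optical flow estimation on severely degraded videos.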
Details
Motivation: Existing optical flow models degrade severely under real-world corruptions (such as blur, noise, and compression artifacts), so degradation-robust methods are needed. Method: DA-Flow extends the intermediate features of an image restoration diffusion model to cross-frame modeling via full spatio-temporal attention and fuses them with convolutional features within an iterative refinement framework. Result: DA-Flow substantially outperforms existing optical flow methods on multiple severely degraded benchmarks and shows strong zero-shot correspondence capability. Conclusion: The intermediate representations of diffusion models are inherently degradation-aware and, once enhanced with spatio-temporal attention, can support robust optical flow estimation, offering a new paradigm for motion analysis in degraded scenes. Abstract: Optical flow models trained on high-quality data often degrade severely when confronted with real-world corruptions such as blur, noise, and compression artifacts. To overcome this limitation, we formulate Degradation-Aware Optical Flow, a new task targeting accurate dense correspondence estimation from real-world corrupted videos. Our key insight is that the intermediate representations of image restoration diffusion models are inherently corruption-aware but lack temporal awareness. To address this limitation, we lift the model to attend across adjacent frames via full spatio-temporal attention, and empirically demonstrate that the resulting features exhibit zero-shot correspondence capabilities. Based on this finding, we present DA-Flow, a hybrid architecture that fuses these diffusion features with convolutional features within an iterative refinement framework. DA-Flow substantially outperforms existing optical flow methods under severe degradation across multiple benchmarks.
[194] UniGRPO: Unified Policy Optimization for Reasoning-Driven Visual Generation
Jie Liu,Zilyu Ye,Linxiao Yuan,Shenhan Zhu,Yu Gao,Jie Wu,Kunchang Li,Xionghui Wang,Xiaonan Nie,Weilin Huang,Wanli Ouyang
Main category: cs.CV
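TL;DR: This paper proposes UniGRPO, a unified reinforcement learning framework for text-image interleaved generation that models reasoning-driven image generation as a Markov Decision Process with sparse rewards and jointly optimizes the text reasoning and image synthesis policies; by simplifying FlowGRPO (removing CFG and replacing the KL penalty with a velocity MSE penalty), it enables scalable, reward-hacking-resistant training for multi-round interleaved generation.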
Details
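Motivation: Existing interleaved generation models mostly rely on separate training paradigms and lack a unified, scalable post-training framework that jointly optimizes text reasoning and image generation; in complex settings such as multi-turn interaction and conditional editing in particular, conventional techniques (such as CFG guidance and KL regularization) make rollouts complex and prone to reward hacking, which hinders scaling. Method: A single round of reasoning followed by image generation is modeled as a sparse-reward MDP; the proposed UniGRPO framework integrates standard GRPO (for the reasoning text policy) with a modified FlowGRPO (for the image flow matching policy): (1) classifier-free guidance is removed to keep rollouts linear; (2) the latent KL regularizer is replaced with an MSE penalty on the velocity field. The overall design is minimalist and reuses established training recipes for each modality. Result: Experiments show that this unified training significantly improves the quality of reasoning-driven image generation; the proposed modifications make FlowGRPO more robust and scalable, support multi-round interleaved generation and editing tasks, and provide a strong baseline for post-training fully interleaved models. Conclusion: Unified reinforcement learning is a viable path for advancing interleaved generation models; UniGRPO and its key modifications to FlowGRPO (removing CFG, velocity MSE regularization) balance simplicity, scalability, and stability, laying a foundation for truly end-to-end, multi-round interactive multimodal generation systems. Abstract: Unified models capable of interleaved generation have emerged as a promising paradigm, with the community increasingly converging on autoregressive modeling for text and flow matching for image generation. To advance this direction, we propose a unified reinforcement learning framework tailored for interleaved generation. We validate our approach on its fundamental unit: a single round of reasoning-driven image generation, where the model first expands the user prompt through reasoning, followed by image synthesis. Formulating this multimodal generation process as a Markov Decision Process with sparse terminal rewards, we introduce UniGRPO to jointly optimize text and image generation policies using GRPO. Adopting a minimalist methodology to avoid over-design, we leverage established training recipes for both modalities by seamlessly integrating standard GRPO for reasoning and FlowGRPO for visual synthesis. To ensure scalability to multi-round interleaved generation, we introduce two critical modifications to the original FlowGRPO: (1) eliminating classifier-free guidance to maintain linear, unbranched rollouts, which is essential for scaling to complex scenarios involving multi-turn interactions and multi-condition generation (e.g., editing); and (2) replacing the standard latent KL penalty with an MSE penalty directly on the velocity fields, providing a more robust and direct regularization signal to mitigate reward hacking effectively.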
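Two ingredients named above, GRPO's group-relative advantages and an MSE penalty on velocity fields, can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, the advantage normalization follows the commonly used GRPO form (reward minus group mean, divided by group standard deviation), and the abstract does not specify how the penalty is weighted or combined with the policy objective.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize terminal rewards within one group of sampled rollouts.

    Each rollout's advantage is its reward minus the group mean, divided by
    the group standard deviation (eps avoids division by zero).
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def velocity_mse_penalty(v_policy, v_ref):
    """Mean squared error between policy and reference velocity fields.

    Stands in for the regularizer that replaces the latent KL penalty;
    array shapes and the choice of reference model are assumptions here.
    """
    v_policy = np.asarray(v_policy, dtype=np.float64)
    v_ref = np.asarray(v_ref, dtype=np.float64)
    return float(np.mean((v_policy - v_ref) ** 2))

# Usage: four rollouts for one prompt, each with a sparse terminal reward.
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
rng = np.random.default_rng(0)
v_ref = rng.normal(size=(4, 16))                     # reference velocities
v_pol = v_ref + 0.1 * rng.normal(size=(4, 16))       # slightly drifted policy
print(adv, velocity_mse_penalty(v_pol, v_ref))
```

The penalty is zero when the policy's predicted velocities match the reference and grows as they drift apart, which is the sense in which it regularizes directly on the velocity field rather than through a KL term in latent space.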
Our experiments demonstrate that this unified training recipe significantly enhances image generation quality through reasoning, providing a robust and scalable baseline for the future post-training of fully interleaved models.
[195] OccAny: Generalized Unconstrained Urban 3D Occupancy
Anh-Quan Cao,Tuan-Hung Vu
Main category: cs.CV
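TL;DR: This paper proposes OccAny, the first generalized urban 3D occupancy prediction model that requires neither in-domain annotations nor precise sensor priors; it supports metric occupancy prediction and geometry completion in cross-domain, uncalibrated scenes and accepts sequential, monocular, and surround-view image inputs.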