cs.CL
[1] Speculative Decoding Scaling Laws (SDSL): Throughput Optimization Made Simple
Amirhossein Bozorgkhoo, Igor Molybog
Main category: cs.CL
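TL;DR: This paper proposes a theoretical framework that analytically connects the key hyperparameters of pre-trained large language models to the throughput efficiency of inference systems based on speculative decoding (SD), making it possible to predict throughput-optimal hyperparameter configurations before pre-training.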
Details
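Motivation: Previous work optimized the throughput of speculative-decoding inference pipelines experimentally, which requires training large models and is costly; this paper aims to establish an analytical theory that avoids trial-and-error training.
Method: A theoretical analysis that analytically models and relates the key hyperparameters of pre-trained LLMs to the throughput efficiency of an SD inference system.
Result: Throughput-optimal hyperparameters for each component of an SD system can be predicted before the models are pre-trained, providing theoretical guidance for designing efficient inference systems.
Conclusion: The theory offers an interpretable, low-cost path to hyperparameter optimization for speculative decoding and may substantially reduce the development cost of accelerating LLM inference.
Abstract: Speculative decoding is a technique that uses multiple language models to accelerate inference. Previous works have used an experimental approach to optimize the throughput of the inference pipeline, which involves LLM training and can be costly. This study of speculative decoding proposes a theory that analytically connects the key hyperparameters of pre-trained LLMs to the throughput efficiency of a downstream SD-based inference system. The theory allows the prediction of throughput-optimal hyperparameters for the components of an inference system before their pre-training.
[2] Summarize Before You Speak with ARACH: A Training-Free Inference-Time Plug-In for Enhancing LLMs via Global Attention Reallocation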
Jingtao Wang, Yucong Wang, Jun Ding, Rui Cai, Xun Wang
Main category: cs.CL
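TL;DR: This paper proposes ARACH, a training-free inference-time plug-in that improves large language model performance without updating parameters by aggregating context through an adaptive context hub and reallocating attention.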
Details
Motivation: Existing training-free methods mostly intervene at the input/output level (e.g., prompt engineering, repeated sampling) and lack a mechanism for intervening directly in a model's internal computation.
Method: ARACH (Attention Reallocation via an Adaptive Context Hub) introduces an adaptive context hub at inference time that dynamically aggregates context and reallocates attention weights, acting as a plug-and-play module.
Result: Consistent improvements on multiple language modeling tasks, with modest inference overhead and no parameter updates; attention analyses suggest that it mitigates the attention sink phenomenon.
Conclusion: Engineering a model's internal computation is a distinct inference-time optimization paradigm, different from both prompt engineering and post-training fine-tuning.
Abstract: Large language models (LLMs) achieve remarkable performance, yet further gains often require costly training. This has motivated growing interest in post-training techniques, especially training-free approaches that improve models at inference time without updating weights. Most training-free methods treat the model as a black box and improve outputs via input/output-level interventions, such as prompt design and test-time scaling through repeated sampling, reranking/verification, or search. In contrast, they rarely offer a plug-and-play mechanism to intervene in a model's internal computation. We propose ARACH (Attention Reallocation via an Adaptive Context Hub), a training-free inference-time plug-in that augments LLMs with an adaptive context hub to aggregate context and reallocate attention. Extensive experiments across multiple language modeling tasks show consistent improvements with modest inference overhead and no parameter updates. Attention analyses further suggest that ARACH mitigates the attention sink phenomenon. These results indicate that engineering a model's internal computation offers a distinct inference-time strategy, fundamentally different from both prompt-based test-time methods and training-based post-training approaches.
[3] DeReason: A Difficulty-Aware Curriculum Improves Decoupled SFT-then-RL Training for General Reasoning
Hanxu Hu, Yuxuan Wang, Maggie Huan, Jannis Vamvas, Yinya Huang, Zhijiang Guo, Rico Sennrich
Main category: cs.CL
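TL;DR: This paper proposes DeReason, a difficulty-based data decoupling strategy that optimizes the combined supervised fine-tuning (SFT) and reinforcement learning (RL) pipeline in general STEM domains, significantly improving the reasoning ability of large language models.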
Details
Motivation: The interplay between SFT and RL in general STEM domains has not been systematically explored, particularly with respect to how data is allocated and how the two stages are coordinated.
Method: DeReason uses LLM-based scoring to estimate each problem's reasoning intensity and partitions the training data into reasoning-intensive and non-reasoning-intensive subsets; the former is reserved for RL to cultivate complex reasoning, and the latter is used for SFT to establish foundational domain knowledge.
Result: On multiple general STEM and mathematical benchmarks, DeReason significantly outperforms SFT-only, RL-only, and random-split SFT+RL baselines, supporting its effectiveness and generality.
Conclusion: SFT and RL play complementary roles in general reasoning; decoupling data by reasoning difficulty and assigning it to the appropriate training stage is key to improving performance, and DeReason offers a generalizable post-training recipe.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm for eliciting reasoning capabilities in large language models, particularly in mathematics and coding. While recent efforts have extended this paradigm to broader general scientific (STEM) domains, the complex interplay between supervised fine-tuning (SFT) and RL in these contexts remains underexplored. In this paper, we conduct controlled experiments revealing a critical challenge: for general STEM domains, RL applied directly to base models is highly sample-inefficient and is consistently surpassed by SFT on moderate-quality responses. Yet sequential SFT followed by RL can further improve performance, suggesting that the two stages play complementary roles, and that how training data is allocated between them matters. Therefore, we propose DeReason, a difficulty-based data decoupling strategy for general reasoning. DeReason partitions training data by reasoning intensity estimated via LLM-based scoring into reasoning-intensive and non-reasoning-intensive subsets. It allocates broad-coverage, non-reasoning-intensive problems to SFT to establish foundational domain knowledge, and reserves a focused subset of difficult problems for RL to cultivate complex reasoning. We demonstrate that this principled decoupling yields better performance than randomly splitting the data for sequential SFT and RL. Extensive experiments on general STEM and mathematical benchmarks demonstrate that our decoupled curriculum training significantly outperforms SFT-only, RL-only, and random-split baselines. Our work provides a systematic study of the interplay between SFT and RL for general reasoning, offering a highly effective and generalized post-training recipe.
[4] MDER-DR: Multi-Hop Question Answering with Entity-Centric Summaries
Riccardo Campi, Nicolò Oreste Pinciroli Vago, Mathyas Giudici, Marco Brambilla, Piero Fraternali
Main category: cs.CL
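TL;DR: This paper proposes MDER-DR, a new knowledge-graph-based retrieval-augmented generation (RAG) framework that improves multi-hop question answering through a novel indexing method (MDER) and retrieval mechanism (DR), substantially outperforming existing RAG methods on multiple benchmarks.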
Details
Motivation: Existing RAG methods over knowledge graphs tend to lose contextual detail when text is converted into indexed triples, which degrades performance on complex tasks such as multi-hop QA.
Method: The MDER (Map-Disambiguate-Enrich-Reduce) indexing method generates context-derived triple descriptions and integrates them with entity summaries; the DR (Decompose-Resolve) retrieval mechanism decomposes queries into resolvable triples and grounds them in the KG via iterative reasoning. Together they form an LLM-driven end-to-end QA pipeline.
Result: On standard and domain-specific benchmarks, MDER-DR improves over baseline RAG methods by up to 66% and remains robust across languages.
Conclusion: MDER-DR is a domain-agnostic, robust KG-RAG framework that mitigates KG sparsity and incompleteness and substantially improves multi-hop QA performance.
Abstract: Retrieval-Augmented Generation (RAG) over Knowledge Graphs (KGs) suffers from the fact that indexing approaches may lose important contextual nuance when text is reduced to triples, thereby degrading performance in downstream Question-Answering (QA) tasks, particularly for multi-hop QA, which requires composing answers from multiple entities, facts, or relations. We propose a domain-agnostic, KG-based QA framework that covers both the indexing and retrieval/inference phases. A new indexing approach called Map-Disambiguate-Enrich-Reduce (MDER) generates context-derived triple descriptions and subsequently integrates them with entity-level summaries, thus avoiding the need for explicit traversal of edges in the graph during the QA retrieval phase. Complementing this, we introduce Decompose-Resolve (DR), a retrieval mechanism that decomposes user queries into resolvable triples and grounds them in the KG via iterative reasoning. Together, MDER and DR form an LLM-driven QA pipeline that is robust to sparse, incomplete, and complex relational data. Experiments show that on standard and domain-specific benchmarks, MDER-DR achieves substantial improvements over standard RAG baselines (up to 66%), while maintaining cross-lingual robustness. Our code is available at https://github.com/DataSciencePolimi/MDER-DR_RAG.
[5] Markovian Generation Chains in Large Language Models
Mingmeng Geng, Amr Mohamed, Guokan Shang, Michalis Vazirgiannis, Thierry Poibeau
Main category: cs.CL
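TL;DR: This paper studies how text evolves when large language models (LLMs) process it iteratively (e.g., repeated rephrasing or round-trip translation), models the process as a memoryless Markovian generation chain, and finds that outputs may converge or keep producing novel sentences, with diversity depending on the temperature parameter and the initial input.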
Details
Motivation: To understand how text evolves when LLMs repeatedly process it (iterative inference), and what these dynamics imply for multi-agent LLM systems.
Method: Proposes the "Markovian generation chain" modeling framework, runs iterative rephrasing and round-trip translation experiments, and combines sentence-level Markov chain modeling with analysis of simulated data.
Result: The iterative process may converge to a small recurrent set or keep generating novel sentences; sentence diversity may increase or decrease depending on the temperature parameter and the initial input.
Conclusion: Iterative LLM inference has complex dynamics, and its diversity does not evolve monotonically; multi-step or multi-agent LLM applications should be designed carefully to avoid bias accumulation or diversity collapse.
Abstract: The widespread use of large language models (LLMs) raises an important question: how do texts evolve when they are repeatedly processed by LLMs? In this paper, we define this iterative inference process as Markovian generation chains, where each step takes a specific prompt template and the previous output as input, without including any prior memory. In iterative rephrasing and round-trip translation experiments, the output either converges to a small recurrent set or continues to produce novel sentences over a finite horizon. Through sentence-level Markov chain modeling and analysis of simulated data, we show that the iterative process can either increase or reduce sentence diversity depending on factors such as the temperature parameter and the initial input sentence. These results offer valuable insights into the dynamics of iterative LLM inference and their implications for multi-agent LLM systems.
[6] Artificial Intelligence for Sentiment Analysis of Persian Poetry
Arash Zargar, Abolfazl Moshiri, Mitra Shafaei, Shabnam Rahimi-Golkhandan, Mohamad Tavakoli-Targhi, Farzad Khalvati
Main category: cs.CL
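TL;DR: This study uses BERT- and GPT-based language models to analyze the poetry of the Persian poets Rumi and Parvin E'tesami, investigating how well the models grasp the complexities of Persian poetry and how sentiment relates to meter; the results indicate that GPT-4o can be used reliably for analyzing Persian poetry, that Rumi's poems are more positive overall, and that his use of meter expresses a wider variety of sentiments.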
Details
Motivation: To investigate how well modern language models grasp the complexities of Persian poetry, and to explore potential correlations between a poem's sentiment and its meter.
Method: Several BERT- and GPT-based language models are applied to the poems of Rumi and Parvin E'tesami for sentiment analysis and a comparison of meter usage.
Result: GPT-4o can be used reliably for analyzing Persian poetry; Rumi's poems are generally more positive in sentiment than Parvin E'tesami's; Rumi's use of meter expresses a wider variety of sentiments.
Conclusion: Large language models can be applied effectively to computer-based semantic studies, reducing the bias introduced by human interpretation.
Abstract: Recent advancements in Artificial Intelligence (AI) have led to the development of large language models (LLMs) that are capable of understanding, analyzing, and creating textual data. These language models open a significant opportunity for analyzing literature and, more specifically, poetry. In the present work, we employ multiple Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-trained Transformer (GPT) based language models to analyze the works of two prominent Persian poets: Jalal al-Din Muhammad Rumi (Rumi) and Parvin E'tesami. The main objective of this research is to investigate the capability of modern language models to grasp the complexities of Persian poetry and to explore potential correlations between the poems' sentiment and their meters. Our findings indicate that the GPT-4o language model can reliably be used in the analysis of Persian poetry. Furthermore, the results of our sentiment analysis revealed that, in general, Rumi's poems express happier sentiments compared to Parvin E'tesami's poems. In addition, comparing the utilization of poetic meters highlighted the superiority of Rumi's poems in using meters to express a wider variety of sentiments. These findings are significant as they confirm that LLMs can be effectively applied in conducting computer-based semantic studies, where human interpretations are not required, thereby significantly reducing potential biases in the analysis.
[7] ThReadMed-QA: A Multi-Turn Medical Dialogue Benchmark from Real Patient Questions
Monica Munnangi, Saiph Savage
Main category: cs.CL
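TL;DR: This paper introduces ThReadMed-QA, the first medical question-answering benchmark built from real multi-turn patient-physician conversations; it shows that current large language models degrade markedly and propagate errors in multi-turn medical QA, and proposes two new metrics, CCS and EPR, to quantify this.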
Details
Motivation: Existing medical QA benchmarks are mostly single-turn and do not reflect the repeated clarification and multi-turn interaction of real patient consultations; they offer no assessment of multi-turn consistency or error propagation.
Method: Builds ThReadMed-QA, a multi-turn patient-physician conversation dataset from a real source (r/AskDocs) with 2,437 complete threads and 8,204 QA pairs; designs an LLM-as-a-judge evaluation framework grounded in physician answers; runs a stratified test of five mainstream LLMs; proposes two new metrics, Conversational Consistency Score (CCS) and Error Propagation Rate (EPR), to analyze multi-turn failure modes.
Result: The strongest model, GPT-5, gives fully correct responses only 41.2% of the time on the test set; all models degrade significantly from turn 0 to turn 2 (p < 0.001), with wrong-answer rates roughly tripling; stronger models start high but decline steeply (e.g., GPT-5 drops 16.2 points), while weaker models tend to plateau; CCS shows that in nearly one in three Claude Haiku conversations, answer quality swings sharply within the same thread; EPR shows that one wrong turn raises the probability of a subsequent wrong turn by 1.9-6.1x.
Conclusion: Strong single-turn QA does not imply reliable multi-turn medical dialogue; current LLMs show serious consistency failures and error accumulation in realistic patient-physician interactions, which calls for new modeling methods and evaluation paradigms aimed at multi-turn reasoning and state tracking.
Abstract: Medical question-answering benchmarks predominantly evaluate single-turn exchanges, failing to capture the iterative, clarification-seeking nature of real patient consultations. We introduce ThReadMed-QA, a benchmark of 2,437 fully-answered patient-physician conversation threads extracted from r/AskDocs, comprising 8,204 question-answer pairs across up to 9 turns. Unlike prior work relying on simulated dialogues, adversarial prompts, or exam-style questions, ThReadMed-QA captures authentic patient follow-up questions and verified physician responses, reflecting how patients naturally seek medical information online. We evaluate five state-of-the-art LLMs -- GPT-5, GPT-4o, Claude Haiku, Gemini 2.5 Flash, and Llama 3.3 70B -- on a stratified test split of 238 conversations (948 QA pairs) using a calibrated LLM-as-a-judge rubric grounded in physician ground truth. Even the strongest model, GPT-5, achieves only 41.2% fully-correct responses. All five models degrade significantly from turn 0 to turn 2 (p < 0.001), with wrong-answer rates roughly tripling by the third turn. We identify a fundamental tension between single-turn capability and multi-turn reliability: models with the strongest initial performance (GPT-5: 75.2; Claude Haiku: 72.3 out of 100) exhibit the steepest declines by turn 2 (dropping 16.2 and 25.0 points respectively), while weaker models plateau or marginally improve. We introduce two metrics to quantify multi-turn failure modes: Conversational Consistency Score (CCS) and Error Propagation Rate (EPR). CCS reveals that nearly one in three Claude Haiku conversations swings between a fully correct and a completely wrong response within the same thread. EPR shows that a single wrong turn raises the probability of a subsequent wrong turn by 1.9-6.1x across all models.
[8] Temporal Text Classification with Large Language Models
Nishat Raihan, Marcos Zampieri
Main category: cs.CL
TL;DR: This paper evaluates leading proprietary and open-source LLMs on Temporal Text Classification (TTC) — estimating text publication dates — across multiple languages and settings (zero-shot, few-shot, fine-tuning), finding proprietary models outperform open-source ones, especially with few-shot prompting.
Details
Motivation: Despite advancements in LLMs, their performance on automatic text dating (Temporal Text Classification) remains unexplored; this study fills that gap.
Method: Systematic evaluation of proprietary (Claude 3.5, GPT-4o, Gemini 1.5) and open-source (LLaMA 3.2, Gemma 2, Mistral, Nemotron 4) LLMs on TTC using three historical corpora (two English, one Portuguese), under zero-shot, few-shot, and fine-tuning settings.
Result: Proprietary models perform well, especially with few-shot prompting; fine-tuning improves open-source models significantly, but they still lag behind proprietary ones.
Conclusion: Proprietary LLMs currently hold a clear advantage in TTC tasks, highlighting a performance gap that fine-tuning alone does not close for open-source models.
Abstract: Languages change over time. Computational models can be trained to recognize such changes, enabling them to estimate the publication date of texts. Despite recent advancements in Large Language Models (LLMs), their performance on automatic dating of texts, also known as Temporal Text Classification (TTC), has not been explored. This study provides the first systematic evaluation of leading proprietary (Claude 3.5, GPT-4o, Gemini 1.5) and open-source (LLaMA 3.2, Gemma 2, Mistral, Nemotron 4) LLMs on TTC using three historical corpora, two in English and one in Portuguese. We test zero-shot and few-shot prompting, and fine-tuning settings. Our results indicate that proprietary models perform well, especially with few-shot prompting. They also indicate that fine-tuning substantially improves open-source models but that they still fail to match the performance delivered by proprietary LLMs.
[9] Evaluating Explainable AI Attribution Methods in Neural Machine Translation via Attention-Guided Knowledge Distillation
Aria Nourbakhsh, Salima Lamsiyah, Adelaide Danilov, Christoph Schommer
Main category: cs.CL
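TL;DR: This paper proposes a new framework for systematically evaluating explainability methods in Transformer-based seq2seq models: attribution maps generated by a teacher model guide a student model, and the usefulness of different attribution methods is quantified with metrics such as BLEU. Experiments find that Attention, Value Zeroing, and Layer Gradient × Activation work best, and the paper also introduces an Attributor Transformer that can reconstruct attribution maps.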
Details
Motivation: Systematic, automated evaluation of existing XAI methods in seq2seq models is still lacking, and there is no unified evaluation framework.
Method: Teacher-generated attribution maps serve as a supervisory signal: attribution scores are injected into the attention mechanism of a student Transformer model through four operators (addition, multiplication, averaging, replacement); the Inseq library is used to extract attribution scores over source-target sequence pairs, and experiments compare multiple language pairs and models; an Attributor Transformer is further proposed that learns to reconstruct the teacher's attribution maps.
Result: Attention, Value Zeroing, and Layer Gradient × Activation yield the largest and most consistent gains in BLEU and chrF, whereas other gradient-based methods such as Saliency give weaker and less consistent gains; the accuracy with which the Attributor reconstructs attribution maps correlates positively with downstream task performance.
Conclusion: Different attribution methods capture different information, and attention-based attributions fit source-target alignment in seq2seq models better; the quality of an attribution map directly affects its usefulness for downstream tasks, and the Attributor offers a new way to assess attribution quality.
Abstract: The study of the attribution of input features to the output of neural network models is an active area of research. While numerous Explainable AI (XAI) techniques have been proposed to interpret these models, the systematic and automated evaluation of these methods in sequence-to-sequence (seq2seq) models is less explored. This paper introduces a new approach for evaluating explainability methods in transformer-based seq2seq models. We use teacher-derived attribution maps as a structured side signal to guide a student model, and quantify the utility of different attribution methods through the student's ability to simulate targets. Using the Inseq library, we extract attribution scores over source-target sequence pairs and inject these scores into the attention mechanism of a student transformer model under four composition operators (addition, multiplication, averaging, and replacement). Across three language pairs (de-en, fr-en, ar-en) and attributions from Marian-MT and mBART models, Attention, Value Zeroing, and Layer Gradient $\times$ Activation consistently yield the largest gains in BLEU (and corresponding improvements in chrF) relative to baselines. In contrast, other gradient-based methods (Saliency, Integrated Gradients, DeepLIFT, Input $\times$ Gradient, GradientShap) lead to smaller and less consistent improvements. These results suggest that different attribution methods capture distinct signals and that attention-derived attributions better capture alignment between source and target representations in seq2seq models. Finally, we introduce an Attributor transformer that, given a source-target pair, learns to reconstruct the teacher's attribution map. Our findings demonstrate that the more accurately the Attributor can reproduce attribution maps, the more useful an injection of those maps is for the downstream task. The source code can be found on GitHub.
[10] Stop Listening to Me! How Multi-turn Conversations Can Degrade Diagnostic Reasoning
Kevin H. Guo, Chao Yan, Avinash Baidya, Katherine Brown, Xiang Gao, Juming Xiong, Zhijun Yin, Bradley A. Malin
Main category: cs.CL
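TL;DR: This paper evaluates the diagnostic reasoning of 17 large language models in multi-turn clinical conversations and proposes a "stick-or-switch" framework to measure model conviction and flexibility; it finds that multi-turn interaction degrades performance (the "conversation tax") and that models often abandon correct diagnoses to agree with incorrect user suggestions.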
Details
Motivation: Although large language models perform well on static diagnostic benchmarks, their effectiveness in multi-turn medical conversations, which are closer to real-world use, has not been studied in depth.
Method: Builds a "stick-or-switch" evaluation framework that quantifies how firmly a model holds to a correct diagnosis or safe abstention across a conversation (conviction) and how well it adapts to a newly introduced correct suggestion (flexibility); evaluates 17 LLMs on three clinical datasets.
Result: A marked "conversation tax": multi-turn interaction consistently degrades performance; models often abandon an initially correct diagnosis or safe abstention to adopt an incorrect user suggestion; several models exhibit "blind switching", failing to distinguish valid signal from incorrect suggestions.
Conclusion: The diagnostic reasoning of current LLMs is not robust enough in multi-turn clinical conversations; belief stability and the ability to recognize incorrect suggestions must improve before these models can be used safely in real medical settings.
Abstract: Patients and clinicians are increasingly using chatbots powered by large language models (LLMs) for healthcare inquiries. While state-of-the-art LLMs exhibit high performance on static diagnostic reasoning benchmarks, their efficacy across multi-turn conversations, which better reflect real-world usage, has been understudied. In this paper, we evaluate 17 LLMs across three clinical datasets to investigate how partitioning the decision-space into multiple simpler turns of conversation influences their diagnostic reasoning. Specifically, we develop a "stick-or-switch" evaluation framework to measure model conviction (i.e., defending a correct diagnosis or safe abstention against incorrect suggestions) and flexibility (i.e., recognizing a correct suggestion when it is introduced) across conversations. Our experiments reveal the conversation tax, where multi-turn interactions consistently degrade performance when compared to single-shot baselines. Notably, models frequently abandon initial correct diagnoses and safe abstentions to align with incorrect user suggestions. Additionally, several models exhibit blind switching, failing to distinguish between signal and incorrect suggestions.
[11] Algorithmic Consequences of Particle Filters for Sentence Processing: Amplified Garden-Paths and Digging-In Effects
Amani Maina-Kilaas, Roger Levy
Main category: cs.CL
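TL;DR: This paper examines the role of representing structural ambiguity in language processing. Surprisal estimates from large language models (LLMs) predict reading times robustly across languages but systematically underpredict difficulty when structural expectations are violated, suggesting that representations of ambiguity play a causal role. The authors propose particle filter models as an alternative that explicitly represents structural hypotheses, and prove algorithmic properties of these models (such as amplified garden-path effects), including that resampling produces real-time digging-in effects whose magnitude grows as the particle count decreases.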
Details
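Motivation: LLM-based surprisal predictions are broadly effective but cannot explain processing difficulty when structural expectations are violated, which suggests that the lack of structural ambiguity representations may limit how well they model human syntactic processing.
Method: Theoretical modeling and formal proof: a particle filter model explicitly represents hypotheses about structural ambiguity; algorithmic analysis establishes its key properties (amplified garden-path effects, digging-in effects induced by resampling); the behavior of a fully parallel model (infinitely many particles) is compared with that of finite-particle models.
Result: 1) Particle filter models naturally produce and amplify garden-path effects; 2) resampling necessarily produces real-time digging-in effects (the longer the ambiguous region, the harder the later disambiguation); 3) digging-in magnitude is inversely related to particle count, and fully parallel models show no such effect.
Conclusion: Explicit representation of structural ambiguity, rather than surprisal alone, is causally necessary for explaining specific difficulties in human sentence processing (such as digging-in and garden-path effects); particle filters offer a more suitable framework for connecting computational modeling with psycholinguistic phenomena.
Abstract: Under surprisal theory, linguistic representations affect processing difficulty only through the bottleneck of surprisal. Our best estimates of surprisal come from large language models, which have no explicit representation of structural ambiguity. While LLM surprisal robustly predicts reading times across languages, it systematically underpredicts difficulty when structural expectations are violated -- suggesting that representations of ambiguity are causally implicated in sentence processing. Particle filter models offer an alternative where structural hypotheses are explicitly represented as a finite set of particles. We prove several algorithmic consequences of particle filter models, including the amplification of garden-path effects. Most critically, we demonstrate that resampling, a common practice with these models, inherently produces real-time digging-in effects -- where disambiguation difficulty increases with ambiguous region length. Digging-in magnitude scales inversely with particle count: fully parallel models predict no such effect.
[12] MaterialFigBENCH: benchmark dataset with figures for evaluating college-level materials science problem-solving abilities of multimodal large language models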
Michiko Yoshitake, Yuta Suzuki, Ryo Igarashi, Yoshitaka Ushiku, Keisuke Nagato
Main category: cs.CL
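TL;DR: MaterialFigBench is a benchmark for multimodal large language models in materials science, focused on how well models understand and quantitatively interpret key figures such as phase diagrams and stress-strain curves; experiments show that current multimodal LLMs still rely heavily on memorized knowledge rather than on actually reading the figures.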
Details
Motivation: Existing benchmarks rely mostly on text and do not systematically assess understanding of the figures that are indispensable in materials science (phase diagrams, diffraction patterns, and so on); a domain-specific, figure-centered benchmark is needed.
Method: Constructs MaterialFigBench, a dataset of 137 free-response problems adapted from university materials science textbooks, covering core topics such as crystal structures, phase transformations, and mechanical properties; introduces expert-defined tolerance ranges for values read from images; evaluates several state-of-the-art multimodal LLMs (e.g., the GPT series) across problem types and analyzes where their answers come from.
Result: Overall accuracy improves with newer model versions, but visual understanding and quantitative interpretation remain weak; many correct answers stem from memorized prior knowledge rather than from reading the image; there are notable deficiencies in numerical precision, significant-digit handling, and certain figure types (such as Arrhenius plots).
Conclusion: MaterialFigBench exposes fundamental weaknesses of current multimodal LLMs in figure-based reasoning for materials science, and provides a domain-oriented evaluation standard and guidance for developing next-generation models with genuine figure understanding.
Abstract: We present MaterialFigBench, a benchmark dataset designed to evaluate the ability of multimodal large language models (LLMs) to solve university-level materials science problems that require accurate interpretation of figures. Unlike existing benchmarks that primarily rely on textual representations, MaterialFigBench focuses on problems in which figures such as phase diagrams, stress-strain curves, Arrhenius plots, diffraction patterns, and microstructural schematics are indispensable for deriving correct answers. The dataset consists of 137 free-response problems adapted from standard materials science textbooks, covering a broad range of topics including crystal structures, mechanical properties, diffusion, phase diagrams, phase transformations, and electronic properties of materials. To address unavoidable ambiguity in reading numerical values from images, expert-defined answer ranges are provided where appropriate. We evaluate several state-of-the-art multimodal LLMs, including ChatGPT and GPT models accessed via OpenAI APIs, and analyze their performance across problem categories and model versions. The results reveal that, although overall accuracy improves with model updates, current LLMs still struggle with genuine visual understanding and quantitative interpretation of materials science figures. In many cases, correct answers are obtained by relying on memorized domain knowledge rather than by reading the provided images. MaterialFigBench highlights persistent weaknesses in visual reasoning, numerical precision, and significant-digit handling, while also identifying problem types where performance has improved. This benchmark provides a systematic and domain-specific foundation for advancing multimodal reasoning capabilities in materials science and for guiding the development of future LLMs with stronger figure-based understanding.
[13] BLooP: Zero-Shot Abstractive Summarization using Large Language Models with Bigram Lookahead Promotion
Varun Iyer, Cornelia Caragea
Main category: cs.CL
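TL;DR: This paper proposes BLooP, a training-free decoding intervention that improves summary faithfulness by encouraging large language models to generate bigrams that appear in the source document; it significantly improves ROUGE and BARTScore, and human evaluation confirms the gain in faithfulness.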
Details
Motivation: In zero-shot abstractive summarization, existing large language models often omit key details and introduce extraneous information; a lightweight, training-free decoding strategy is needed to improve faithfulness.
Method: BLooP (Bigram Lookahead Promotion) performs a hash table lookup at each decoding step and favors candidate tokens that, together with the preceding token, form a bigram present in the source document; it requires no training, fine-tuning, or model modification.
Result: On CNN/DM, CCSum, Multi-News, and SciTLDR, BLooP significantly improves ROUGE and BARTScore for Llama-3.1-8B-Instruct, Mistral-Nemo-Instruct-2407, and Gemma-2-9b-it; human evaluation shows that it significantly improves faithfulness without harming readability.
Conclusion: BLooP is an efficient, general, training-free decoding enhancement that offers a new approach to improving the faithfulness of LLM summaries, with strong practicality and extensibility.
Abstract: Abstractive summarization requires models to generate summaries that convey information in the source document. While large language models can generate summaries without fine-tuning, they often miss key details and include extraneous information. We propose BLooP (Bigram Lookahead Promotion), a simple training-free decoding intervention that encourages large language models (LLMs) to generate tokens that form bigrams from the source document. BLooP operates through a hash table lookup at each decoding step, requiring no training, fine-tuning, or model modification. We demonstrate improvements in ROUGE and BARTScore for Llama-3.1-8B-Instruct, Mistral-Nemo-Instruct-2407, and Gemma-2-9b-it on CNN/DM, CCSum, Multi-News, and SciTLDR. Human evaluation shows that BLooP significantly improves faithfulness without reducing readability. We make the code available at https://github.com/varuniyer/BLooP
[14] LLM-Assisted Causal Structure Disambiguation and Factor Extraction for Legal Judgment Prediction
Yuzhi Liang, Lixiang Ma, Xinrong Zhu
Main category: cs.CL
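TL;DR: This paper proposes an enhanced causal inference framework that combines large language model priors with statistical causal discovery to improve the accuracy and robustness of legal judgment prediction; using a coarse-to-fine hybrid extraction mechanism and an LLM-assisted causal structure disambiguation mechanism, it significantly outperforms existing methods.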
Details
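Motivation: Existing legal judgment prediction methods based on pre-trained language models rely on statistical correlations between case facts and judgment results and do not explicitly model legal constituent elements or causal logic, so they tend to learn spurious correlations and lack robustness; existing causal methods face two bottlenecks on real legal texts: noisy extraction of legal factors, and high uncertainty in causal structure discovery caused by sparse features.
Method: An enhanced causal inference framework that combines LLM priors with statistical causal discovery: 1) a coarse-to-fine hybrid extraction mechanism that combines statistical sampling with LLM semantic reasoning to accurately identify and purify standard legal constituent elements; 2) an LLM-assisted causal structure disambiguation mechanism that uses the LLM as a constrained prior knowledge base to probabilistically evaluate and prune ambiguous causal directions, producing legally compliant candidate causal graphs; 3) a causal-aware judgment prediction model that uses the generated causal graphs to explicitly constrain text attention intensity.
Result: Extensive experiments on multiple benchmark datasets, including LEVEN, QA, and CAIL, show that the method significantly outperforms state-of-the-art baselines in predictive accuracy and robustness, particularly in distinguishing confusing charges.
Conclusion: Combining LLM priors with statistical causal discovery effectively mitigates noisy factor extraction and causal structure uncertainty in legal texts, improving the interpretability, accuracy, and robustness of legal judgment prediction models and offering a new paradigm for causal legal AI.
Abstract: Mainstream methods for Legal Judgment Prediction (LJP) based on Pre-trained Language Models (PLMs) heavily rely on the statistical correlation between case facts and judgment results. This paradigm lacks explicit modeling of legal constituent elements and underlying causal logic, making models prone to learning spurious correlations and suffering from poor robustness. While introducing causal inference can mitigate this issue, existing causal LJP methods face two critical bottlenecks in real-world legal texts: inaccurate legal factor extraction with severe noise, and significant uncertainty in causal structure discovery due to Markov equivalence under sparse features. To address these challenges, we propose an enhanced causal inference framework that integrates Large Language Model (LLM) priors with statistical causal discovery. First, we design a coarse-to-fine hybrid extraction mechanism combining statistical sampling and LLM semantic reasoning to accurately identify and purify standard legal constituent elements. Second, to resolve structural uncertainty, we introduce an LLM-assisted causal structure disambiguation mechanism. By utilizing the LLM as a constrained prior knowledge base, we conduct probabilistic evaluation and pruning on ambiguous causal directions to generate legally compliant candidate causal graphs. Finally, a causal-aware judgment prediction model is constructed by explicitly constraining text attention intensity via the generated causal graphs. Extensive experiments on multiple benchmark datasets, including LEVEN, QA, and CAIL, demonstrate that our proposed method significantly outperforms state-of-the-art baselines in both predictive accuracy and robustness, particularly in distinguishing confusing charges.
[15] Try, Check and Retry: A Divide-and-Conquer Framework for Boosting Long-context Tool-Calling Performance of LLMs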
Kunfeng Chen, Qihuang Zhong, Juhua Liu, Bo Du, Dacheng Tao
Main category: cs.CL
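TL;DR: This paper proposes Tool-DC, a framework that improves the tool-calling performance of large language models through a "Try-Check-Retry" paradigm, and performs especially well when the number of candidate tools is large.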
Details
Motivation: Existing methods struggle with the massive, noisy sets of candidate tools in long-context tool-calling tasks, which limits real-world application.
Method: Tool-DC is a divide-and-conquer framework with a training-free variant (TF) and a training-based variant (TB); it uses the LLM's self-reflection ability to reduce reasoning difficulty.
Result: Tool-DC (TF) brings average gains of up to +25.10% on the BFCL and ACEBench benchmarks; Tool-DC (TB) enables Qwen2.5-7B to match or even surpass proprietary models such as OpenAI o3 and Claude-Haiku-4.5.
Conclusion: Tool-DC markedly improves the performance and practicality of LLMs in complex tool-calling scenarios, offering both flexibility and inference efficiency.
Abstract: Tool-calling empowers Large Language Models (LLMs) to interact with external environments. However, current methods often struggle to handle massive and noisy candidate tools in long-context tool-calling tasks, limiting their real-world application. To this end, we propose Tool-DC, a Divide-and-Conquer framework for boosting tool-calling performance of LLMs. The core of Tool-DC is to reduce the reasoning difficulty and make full use of the self-reflection ability of LLMs via a "Try-Check-Retry" paradigm. Specifically, Tool-DC involves two variants: 1) the training-free Tool-DC (TF), which is plug-and-play and flexible; 2) the training-based Tool-DC (TB), which is more inference-efficient. Extensive experiments show that both Tool-DC methods outperform their counterparts by a clear margin. Tool-DC (TF) brings up to +25.10% average gains against the baseline on BFCL and ACEBench benchmarks, while Tool-DC (TB) enables Qwen2.5-7B to achieve comparable or even better performance than proprietary LLMs, e.g., OpenAI o3 and Claude-Haiku-4.5.
[16] Tiny Aya: Bridging Scale and Multilingual Depth
Alejandro R. Salamanca, Diana Abagyan, Daniel D'souza, Ammar Khairi, David Mora, Saurabh Dash, Viraat Aryabumi, Sara Rajaee, Mehrnaz Mofakhami, Ananya Sahu, Thomas Euyang, Brittawnya Prince, Madeline Smith, Hangyu Lin, Acyr Locatelli, Sara Hooker, Tom Kocmi, Aidan Gomez, Ivan Zhang, Phil Blunsom, Nick Frosst, Joelle Pineau, Beyza Ermis, Ahmet Üstün, Julia Kreutzer, Marzieh Fadaee
Main category: cs.CL
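TL;DR: Tiny Aya is a small multilingual language model with only 3.35B parameters, trained on 70 languages and refined through region-aware post-training; it reaches state-of-the-art results in translation quality, multilingual understanding, and target-language generation, and is released as a base model, a globally balanced instruction-tuned model, and three region-specialized models.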
Details
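Motivation: To explore an efficient alternative scaling path for multilingual AI that emphasizes model efficiency, balanced performance across languages, and practical deployability rather than simply increasing parameter count.
Method: A base model is trained on data from 70 languages and refined with a region-aware post-training strategy; several variants are released: a pretrained base model, a globally balanced instruction-tuned model, and region-specialized models targeting languages from Africa, South Asia, Europe, Asia-Pacific, and West Asia.
Result: State-of-the-art results on translation quality, multilingual understanding, and target-language generation, achieved with only 3.35B parameters.
Conclusion: Tiny Aya shows that a small multilingual model, built on high-quality data and region-adapted training, can match or even surpass larger models in multilingual capability, offering an efficient and practical option for resource-constrained settings.
Abstract: Tiny Aya redefines what a small multilingual language model can achieve. Trained on 70 languages and refined through region-aware post-training, it delivers state-of-the-art translation quality, strong multilingual understanding, and high-quality target-language generation, all with just 3.35B parameters. The release includes a pretrained foundation model, a globally balanced instruction-tuned variant, and three region-specialized models targeting languages from Africa, South Asia, Europe, Asia-Pacific, and West Asia. This report details the training strategy, data composition, and comprehensive evaluation framework behind Tiny Aya, and presents an alternative scaling path for multilingual AI: one centered on efficiency, balanced performance across languages, and practical deployment.
[17] Can Small Language Models Use What They Retrieve? An Empirical Study of Retrieval Utilization Across Model Scale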
Sanchit Pandey
Main category: cs.CL
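TL;DR: This paper studies how language models with 7B parameters or fewer perform in retrieval-augmented generation (RAG). It finds that the main bottleneck is the failure to use retrieved information rather than retrieval quality: even with perfect (oracle) retrieval, small models still fail in most cases, and retrieved context can interfere with knowledge they already have.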
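The parametric knowledge split summarized here can be illustrated with a small sketch. It assumes each question has already been graded twice, once closed-book and once with the oracle passage in context; the `QuestionResult` record and the `split_rates` helper are hypothetical names for illustration and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class QuestionResult:
    # Hypothetical per-question record; field names are assumptions.
    correct_closed_book: bool   # answered correctly with no retrieval
    correct_with_context: bool  # answered correctly with the oracle passage


def split_rates(results):
    """Return (utilization_failure_rate, distraction_rate).

    Utilization failure: share of questions the model could NOT answer alone
    that it still gets wrong when the answer is in the context.
    Distraction: share of questions the model COULD answer alone that it
    gets wrong once context is added.
    """
    known = [r for r in results if r.correct_closed_book]
    unknown = [r for r in results if not r.correct_closed_book]

    def failure_rate(group):
        if not group:
            return float("nan")
        return sum(not r.correct_with_context for r in group) / len(group)

    return failure_rate(unknown), failure_rate(known)


results = [
    QuestionResult(True, True),
    QuestionResult(True, False),
    QuestionResult(False, False),
    QuestionResult(False, False),
    QuestionResult(False, True),
]
util_fail, distraction = split_rates(results)
print(f"utilization failure: {util_fail:.2f}, distraction: {distraction:.2f}")
```

Splitting on closed-book correctness is what lets the two failure types be reported separately instead of being mixed into one accuracy number.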
Details
Motivation: Investigates whether language models of 7B parameters or fewer can effectively use the information retrieved in RAG, and clarifies whether the performance bottleneck lies in retrieval quality or in the model's ability to use context.
Method: Across five model sizes from 360M to 8B and three architecture families (SmolLM2, Qwen2.5, Llama 3.1), compares four conditions: no retrieval, BM25, dense retrieval with E5-large-v2, and oracle retrieval. Introduces a parametric knowledge split that separates questions the model can already answer from those requiring external knowledge, decoupling utilization failure from retrieval failure.
Result: (1) Even with oracle retrieval, models of 7B or smaller fail on 85 to 100% of questions that require external knowledge; (2) adding retrieved context causes errors on 42 to 100% of questions the model previously answered correctly (a distraction effect); (3) error analysis shows that the dominant failure mode is "irrelevant generation", in which the model ignores the context. The patterns are consistent across prompt templates and retrieval methods.
Conclusion: For models below 7B, the main limitation of RAG is context utilization rather than retrieval quality; deploying RAG at this scale may yield a net loss under standard evaluation.
Abstract: Retrieval-augmented generation (RAG) is widely deployed to improve factual accuracy in language models, yet it remains unclear whether smaller models of 7B parameters or fewer can effectively utilize retrieved information. To investigate this question, we evaluate five model sizes from 360M to 8B across three architecture families (SmolLM2, Qwen2.5, and Llama 3.1) under four retrieval conditions: no retrieval, BM25, dense retrieval using E5-large-v2, and oracle retrieval, where the retrieved passage is guaranteed to contain the answer. We introduce a parametric knowledge split that separates questions a model can already answer from those that require external knowledge, which allows us to isolate utilization failure from retrieval quality failure. We find three main results. First, even with oracle retrieval, models of 7B or smaller fail to extract the correct answer 85 to 100 percent of the time on questions they cannot answer alone, which indicates a fundamental utilization bottleneck. Second, adding retrieval context destroys 42 to 100 percent of answers the model previously knew, suggesting a distraction effect driven by the presence of context rather than its quality. Third, an error analysis of 2,588 oracle failures shows that the dominant failure mode is irrelevant generation, where the model ignores the provided context entirely. These patterns hold across multiple prompt templates and retrieval methods. The results indicate that for models below 7B parameters, the main limitation of RAG is context utilization rather than retrieval quality, and that deploying RAG at this scale can lead to a net negative trade-off under standard evaluation conditions.
[18] One Supervisor, Many Modalities: Adaptive Tool Orchestration for Autonomous Queries
Mayank Saini,Arit Kumar Bishwas
Main category: cs.CL
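TL;DR: This paper proposes an agentic AI framework for autonomous multimodal query processing. A central supervisor dynamically decomposes tasks and dispatches them to specialized tools suited to each modality (text, image, audio, video, document), using adaptive routing rather than fixed decision trees, which significantly improves efficiency and cost-effectiveness while maintaining accuracy.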
Details
Motivation: Addresses problems in existing multimodal AI systems, which lack intelligent centralized orchestration and rely on predefined decision trees, leading to low efficiency, high cost, and frequent conversational rework; the goal is to improve the practical economics and usability of multimodal AI deployment.
Method: Builds an agentic AI framework centered on a central Supervisor: learned routing via RouteLLM for text queries; SLM-assisted modality decomposition for non-text modalities; dynamic subtask delegation and adaptive result synthesis.
Result: Evaluated on 2,847 queries across 15 task categories; compared with the hierarchical baseline, time-to-accurate-answer drops by 72%, conversational rework by 85%, and cost by 67%, with accuracy unchanged.
Conclusion: Intelligent centralized orchestration fundamentally improves the deployment economics of multimodal AI systems and is a key route to better practical effectiveness.
Abstract: We present an agentic AI framework for autonomous multimodal query processing that coordinates specialized tools across text, image, audio, video, and document modalities. A central Supervisor dynamically decomposes user queries, delegates subtasks to modality-appropriate tools (e.g., object detection, OCR, speech transcription), and synthesizes results through adaptive routing strategies rather than predetermined decision trees. For text-only queries, the framework uses learned routing via RouteLLM, while non-text paths use SLM-assisted modality decomposition. Evaluated on 2,847 queries across 15 task categories, our framework achieves a 72% reduction in time-to-accurate-answer, an 85% reduction in conversational rework, and a 67% cost reduction compared to the matched hierarchical baseline while maintaining accuracy parity. These results demonstrate that intelligent centralized orchestration fundamentally improves multimodal AI deployment economics.
[19] Where Matters More Than What: Decoding-aligned KV Cache Compression via Position-aware Pseudo Queries
Zhenxu Tian,Yi Su,Juntao Li,Min Zhang
Main category: cs.CL
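TL;DR: This paper proposes DapQ, which uses position-aware pseudo queries to simulate decoding-stage queries and achieve decoding-aligned KV cache compression, retaining near-lossless performance even under strict memory limits.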
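As a rough illustration of observation-window eviction in general, the sketch below scores each cached key by the attention it receives from a set of query vectors and keeps the top entries within a budget. How DapQ actually builds its position-aware pseudo queries is not described in this summary, so the sketch simply takes `pseudo_queries` as given; `select_kv_to_keep` and all values are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def select_kv_to_keep(keys, pseudo_queries, budget):
    """Return sorted indices of the `budget` keys that receive the most
    total attention from the pseudo queries (scaled dot-product attention)."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    scores = [0.0] * len(keys)
    for q in pseudo_queries:
        weights = softmax([dot(q, k) * scale for k in keys])
        for i, w in enumerate(weights):
            scores[i] += w  # accumulate attention mass per cached token
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


# Toy 2-dimensional keys and queries (illustrative values only).
keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
pseudo_queries = [[2.0, 0.0], [1.5, 0.5]]
kept = select_kv_to_keep(keys, pseudo_queries, budget=2)
print(kept)
```

In this toy case the two keys most aligned with the queries (indices 0 and 2) are retained and the rest would be evicted.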
Details
Motivation: Existing KV cache compression methods assess token importance only from input-side attention patterns during the prefill stage, so they cannot accurately predict which tokens the decoding stage will actually attend to. True decoding queries are unavailable at inference time, so pseudo queries must be constructed, and the authors find that positional information matters more than semantic content.
Method: Proposes the DapQ framework, which uses position-aware pseudo queries to approximate the queries produced during generation, building a decoding-aligned observation window to assess token importance more precisely and perform lightweight cache eviction.
Result: Validated on multiple benchmarks and LLMs; under strict memory constraints (e.g., retaining only 3% of the KV cache), DapQ still achieves near-lossless performance (99.5% on the NIAH task).
Conclusion: Positional information is the key to building effective pseudo queries; through decoding-aligned compression, DapQ markedly improves the efficiency and accuracy of long-context inference and is an efficient, lightweight, and general KV cache optimization method.
Abstract: The Key-Value (KV) cache is crucial for efficient Large Language Model (LLM) inference, but excessively long contexts drastically increase the KV cache memory footprint. Existing KV cache compression methods typically rely on input-side attention patterns within a prompt observation window to estimate token importance during the prefill stage. They fail to preserve critical tokens for future generation since these assessments are not derived from the decoding process. Intuitively, an effective observation window should mirror the decoding-stage queries to accurately reflect which tokens the generation process will attend to. However, ground-truth decoding queries are inherently unavailable during inference. When constructing pseudo queries to approximate them, we find that positional information plays a more critical role than semantic content. Motivated by this insight, we propose decoding-aligned KV cache compression via position-aware pseudo queries (DapQ), a novel and lightweight eviction framework that leverages position-aware pseudo queries to simulate the output tokens, thereby establishing an effective observation window for importance assessment. It aligns closely with the actual generation context and enables precise token eviction. Extensive evaluations across multiple benchmarks and LLMs demonstrate that DapQ achieves superior performance, particularly under strict memory constraints (e.g., nearly lossless performance of 99.5% on NIAH with a 3% KV cache budget).
[20] Streaming Translation and Transcription Through Speech-to-Text Causal Alignment
Roman Koshkin,Jeon Haesung,Lianbo Liu,Hao Shi,Mengjie Zhao,Yusuke Fujita,Yui Sudo
Main category: cs.CL
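TL;DR: Hikari is a policy-free, end-to-end simultaneous speech-to-text translation model that significantly improves the quality-latency trade-off through a probabilistic WAIT mechanism and Decoder Time Dilation.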
Details
Motivation: Traditional simultaneous machine translation relies on offline models combined with hand-designed or learned policies, making it hard to deliver both translation quality and low latency.
Method: Proposes the Hikari model: (1) encodes READ/WRITE decisions as a probabilistic WAIT token mechanism; (2) introduces Decoder Time Dilation to reduce autoregressive overhead and balance the training distribution; (3) uses a supervised fine-tuning strategy that teaches the model to recover from delays.
Result: On English-to-Japanese, German, and Russian, Hikari achieves state-of-the-art BLEU scores in both low- and high-latency settings, surpassing recent baselines.
Conclusion: Hikari demonstrates the feasibility and advantages of modeling simultaneous speech translation purely end-to-end without an external policy, offering a new paradigm for real-time cross-lingual communication.
Abstract: Simultaneous machine translation (SiMT) has traditionally relied on offline machine translation models coupled with human-engineered heuristics or learned policies. We propose Hikari, a policy-free, fully end-to-end model that performs simultaneous speech-to-text translation and streaming transcription by encoding READ/WRITE decisions into a probabilistic WAIT token mechanism. We also introduce Decoder Time Dilation, a mechanism that reduces autoregressive overhead and ensures a balanced training distribution. Additionally, we present a supervised fine-tuning strategy that trains the model to recover from delays, significantly improving the quality-latency trade-off. Evaluated on English-to-Japanese, German, and Russian, Hikari achieves new state-of-the-art BLEU scores in both low- and high-latency regimes, outperforming recent baselines.
[21] UtilityMax Prompting: A Formal Framework for Multi-Objective Large Language Model Optimization
Ofir Marom
Main category: cs.CL
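TL;DR: This paper proposes the UtilityMax Prompting framework, which defines LLM tasks in formal mathematical language instead of natural language. It models the task as an influence diagram and optimizes the LLM's output with a utility function, significantly improving precision and NDCG on a multi-objective movie recommendation task.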
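The idea of choosing the answer that maximizes expected utility can be made concrete with a toy calculation. In the framework itself the LLM is instructed to carry out this reasoning from the prompt; the sketch below only illustrates the arithmetic of such an objective. The weighted-sum utility, the objective names, and all probabilities are assumptions for illustration, not the paper's definitions.

```python
def expected_utility(probabilities, weights):
    """Weighted sum of the estimated probability of each objective being met."""
    return sum(weights[name] * p for name, p in probabilities.items())


def best_answer(candidates, weights):
    """Pick the candidate answer (the sole decision variable) with the
    highest expected utility."""
    return max(candidates, key=lambda c: expected_utility(candidates[c], weights))


# Hypothetical estimates for two objectives per candidate recommendation.
candidates = {
    "Movie A": {"user_likes": 0.9, "is_novel": 0.2},
    "Movie B": {"user_likes": 0.7, "is_novel": 0.8},
    "Movie C": {"user_likes": 0.4, "is_novel": 0.9},
}
weights = {"user_likes": 1.0, "is_novel": 0.5}

choice = best_answer(candidates, weights)
print(choice)
```

Here "Movie B" wins because it balances both objectives, even though "Movie A" scores highest on the first objective alone, which is the kind of trade-off an ambiguous natural-language prompt leaves unspecified.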
Details
Motivation: Natural-language prompts are inherently ambiguous and struggle to satisfy multiple objectives at once, so a more precise way of defining tasks is needed.
Method: Builds a formal task representation based on an influence diagram, sets the LLM's output as the sole decision variable, and defines a utility function over the conditional probability distributions to guide the LLM toward maximizing expected utility.
Result: On the MovieLens 1M dataset with three frontier models (Claude Sonnet 4.6, GPT-5.4, Gemini 2.5 Pro), precision and NDCG on the multi-objective movie recommendation task consistently exceed the natural-language prompting baselines.
Conclusion: Formal mathematical prompting effectively constrains the LLM to reason explicitly about each component of the objective, improves multi-objective task performance, and offers a new paradigm for prompt engineering.
Abstract: The success of a Large Language Model (LLM) task depends heavily on its prompt. Most use-cases specify prompts using natural language, which is inherently ambiguous when multiple objectives must be simultaneously satisfied. In this paper we introduce UtilityMax Prompting, a framework that specifies tasks using formal mathematical language. We reconstruct the task as an influence diagram in which the LLM's answer is the sole decision variable. A utility function is defined over the conditional probability distributions within the diagram, and the LLM is instructed to find the answer that maximises expected utility. This constrains the LLM to reason explicitly about each component of the objective, directing its output toward a precise optimization target rather than a subjective natural language interpretation. We validate our approach on the MovieLens 1M dataset across three frontier models (Claude Sonnet 4.6, GPT-5.4, and Gemini 2.5 Pro), demonstrating consistent improvements in precision and Normalized Discounted Cumulative Gain (NDCG) over natural language baselines in a multi-objective movie recommendation task.
[22] Performance Evaluation of Open-Source Large Language Models for Assisting Pathology Report Writing in Japanese
Masataka Kawai,Singo Sakashita,Shumpei Ishikawa,Shogo Watanabe,Anna Matsuoka,Mikio Sakurai,Yasuto Fujimoto,Yoshiyuki Takahara,Atsushi Ohara,Hirohiko Miyake,Genichiro Ishii
Main category: cs.CL
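TL;DR: This paper evaluates seven open-source large language models (LLMs) on Japanese pathology report writing across three aspects: structured diagnostic text generation, Japanese typo correction, and expert subjective evaluation of explanatory text. Reasoning models and medical-specialized models perform better on structured tasks and typo correction, but preferences for explanatory text vary by rater. Overall, open-source LLMs can assist Japanese pathology report writing in limited but clinically relevant scenarios.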
Details
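Motivation: The performance of large language models (LLMs) in supporting Japanese pathology report writing has not been explored, and their suitability for this specific language and professional domain needs evaluation.
Method: Evaluates seven open-source LLMs from three perspectives: (A) generating and extracting pathology diagnosis text in predefined formats; (B) correcting typographical errors in Japanese pathology reports; (C) subjective scoring of model-generated explanatory text by pathologists and clinicians.
Result: Reasoning models and medical-specialized models perform better on structured reporting tasks and typo correction, but preferences for explanatory text differ substantially across raters; each model's utility varies by task.
Conclusion: Although open-source LLMs cannot fully replace human work, they have practical value in limited but clinically relevant scenarios (such as structured report generation and typo correction) and can assist Japanese pathology report writing.
Abstract: The performance of large language models (LLMs) for supporting pathology report writing in Japanese remains unexplored. We evaluated seven open-source LLMs from three perspectives: (A) generation and information extraction of pathology diagnosis text following predefined formats, (B) correction of typographical errors in Japanese pathology reports, and (C) subjective evaluation of model-generated explanatory text by pathologists and clinicians. Thinking models and medical-specialized models showed advantages in structured reporting tasks that required reasoning and in typo correction. In contrast, preferences for explanatory outputs varied substantially across raters. Although the utility of LLMs differed by task, our findings suggest that open-source LLMs can be useful for assisting Japanese pathology report writing in limited but clinically relevant scenarios.
[23] QChunker: Learning Question-Aware Text Chunking for Domain RAG via Multi-Agent Debate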
Jihao Zhao,Daixuan Li,Pengfei Li,Shuaishuai Zu,Biao Qin,Hongyan Liu
Main category: cs.CL
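TL;DR: This paper proposes QChunker, which restructures the RAG paradigm as "understanding-retrieval-generation" and combines a multi-agent debate framework with a new evaluation metric, ChunkScore, to improve the semantic integrity and information granularity of text chunks, significantly enhancing RAG performance.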
Details
Motivation: RAG effectiveness is limited by the semantic integrity and information granularity of the text chunks in the knowledge base; traditional chunking methods lack logical coherence, and evaluation depends on inefficient downstream QA tasks.
Method: Proposes QChunker: (1) models chunking as a composite task of text segmentation plus knowledge completion; (2) designs a question-driven, four-role multi-agent debate framework (question outline generator, text segmenter, integrity reviewer, knowledge completer); (3) constructs a high-quality chunking dataset of 45K entries and transfers the capability to small models; (4) proposes a direct evaluation metric, ChunkScore, which supports multi-path sampling and selection of the best chunking.
Result: Theory and experiments validate that ChunkScore can directly and efficiently discriminate chunk quality; in experiments across four heterogeneous domains, QChunker markedly improves the logical coherence and information richness of chunks, thereby improving overall RAG performance.
Conclusion: By introducing an understanding-first mechanism and a new evaluation paradigm, QChunker overcomes the chunk-quality bottleneck in traditional RAG and provides a systematic solution for building high-quality knowledge bases.
Abstract: The effectiveness upper bound of retrieval-augmented generation (RAG) is fundamentally constrained by the semantic integrity and information granularity of text chunks in its knowledge base. To address these challenges, this paper proposes QChunker, which restructures the RAG paradigm from retrieval-augmentation to understanding-retrieval-augmentation. First, QChunker models text chunking as a composite task of text segmentation and knowledge completion to ensure the logical coherence and integrity of text chunks. Drawing inspiration from Hal Gregersen's "Questions Are the Answer" theory, we design a multi-agent debate framework comprising four specialized components: a question outline generator, text segmenter, integrity reviewer, and knowledge completer. This framework operates on the principle that questions serve as catalysts for profound insights. Through this pipeline, we construct a high-quality dataset of 45K entries and transfer this capability to small language models. Additionally, to handle the long evaluation chains and low efficiency of existing chunking evaluation methods, which rely heavily on downstream QA tasks, we introduce a novel direct evaluation metric, ChunkScore. Both theoretical and experimental validations demonstrate that ChunkScore can directly and efficiently discriminate the quality of text chunks. Furthermore, during the text segmentation phase, we utilize document outlines for multi-path sampling to generate multiple candidate chunks and select the optimal solution using ChunkScore. Extensive experimental results across four heterogeneous domains show that QChunker effectively resolves the aforementioned issues by providing RAG with more logically coherent and information-rich text chunks.
[24] Multi-Task Reinforcement Learning for Enhanced Multimodal LLM-as-a-Judge
Junjie Wu,Xuan Kan,Zihao He,Shunwen Tan,Bo Pan,Kaitai Zhang
Main category: cs.CL
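TL;DR: This paper proposes MT-RL-Judge, a multi-task reinforcement learning framework for improving the generalization and consistency of multimodal large language models (MLLMs) acting as judges across diverse visual tasks.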
Details
Motivation: Existing MLLM-as-a-Judge methods are mostly optimized for a single task and generalize unreliably to diverse, out-of-distribution tasks, so a more robust multi-task judging framework is needed.
Method: Proposes the MT-RL-Judge framework, which jointly optimizes the MLLM judge model through multi-task reinforcement learning, using RL to strengthen cross-task generalization.
Result: Surpasses several strong baselines in both judgment consistency and correlation with human preferences, and shows strong generalization on out-of-distribution tasks.
Conclusion: MT-RL-Judge effectively improves the generality and reliability of MLLMs as judges, offering a new paradigm for multi-task automatic evaluation.
Abstract: Multimodal Large Language Models (MLLMs) have been widely adopted as judges (MLLM-as-a-Judge) due to their strong alignment with human judgment across various visual tasks. However, most existing judge models are optimized for single-task scenarios and struggle to generalize to diverse contexts, which is a critical requirement for reliable evaluation. To address this limitation, we propose Multi-Task Reinforcement Learning for MLLM-as-a-Judge (MT-RL-Judge), a framework that jointly optimizes the judge model across multiple tasks, leveraging the generalization capabilities of RL. Experimental results demonstrate that MT-RL-Judge outperforms several strong baselines in both judgment consistency and correlation with human preferences. Furthermore, our approach exhibits robust generalization on out-of-distribution tasks, further validating its effectiveness.
[25] A technology-oriented mapping of the language and translation industry: Analysing stakeholder values and their potential implication for translation pedagogy
María Isabel Rivas Ginel,Janiça Hackenbuchner,Alina Secară,Ralph Krüger,Caroline Rossi
Main category: cs.CL
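TL;DR: This paper examines how value is constructed and negotiated in the automated language and translation industry, showing that technological efficiency and human expertise are interdependent and that adaptability is the core value linking the two.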
Details
Motivation: Investigates how human and technological values are constructed, negotiated, and reconfigured in the increasingly automated language and translation industry.
Method: Qualitative analysis of interviews with 29 industry stakeholders from the LT-LiDER project, using Chesterman's framework of translation ethics.
Result: Efficiency-oriented technological values have become baseline expectations in automated production environments; human value is not replaced but repositioned through expertise, oversight, accountability, and contextual judgment; adaptability emerges as the key mediating value linking people and technology.
Conclusion: Automation does not replace translation value but reshapes it, creating an interdependent relationship in which technological efficiency enables human communicative work.
Abstract: This paper examines how value is constructed and negotiated in today's increasingly automated language and translation industry. Drawing on interview data from twenty-nine industry stakeholders collected within the LT-LiDER project, the study analyses how human value, technological value, efficiency, and adaptability are articulated across different professional roles. Using Chesterman's framework of translation ethics and associated values as an analytical lens, the paper shows that efficiency-oriented technological values aligned with the ethics of service have become baseline expectations in automated production environments, where speed, scalability, and deliverability dominate evaluation criteria. At the same time, human value is not displaced but repositioned, emerging primarily through expertise, oversight, accountability, and contextual judgment embedded within technology-mediated workflows. A central finding is the prominence of adaptability as a mediating value linking human and technological domains. Adaptability is constructed as a core professional requirement, reflecting expectations that translators continuously adjust their skills, roles, and identities in response to evolving tools and organisational demands. The paper argues that automation reshapes rather than replaces translation value, creating an interdependent configuration in which technological efficiency enables human communicative work.
[26] In the LLM era, Word Sense Induction remains unsolved
Anna Mosolova,Marie Candito,Carlos Ramisch
Main category: cs.CL
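TL;DR: This paper examines evaluation methodology for word sense induction (WSI) when annotated data is lacking, proposes a new SemCor-based evaluation dataset, and compares pre-trained embeddings, clustering algorithms, and LLM-based methods. It finds that the "one cluster per lemma" heuristic remains the strongest and that LLMs perform poorly, but data augmentation combined with Wiktionary improves performance, surpassing the previous SOTA by 3.3%.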
Details
Motivation: Current WSI evaluation has methodological problems; in particular, low-resource and domain-specific settings lack sound evaluation benchmarks that reflect real polysemy and frequency distributions.
Method: Builds an evaluation dataset that respects SemCor's original polysemy and frequency distributions; systematically evaluates pre-trained embeddings and clustering algorithms (grouped by part of speech); proposes and tests an LLM-based WSI method; explores several data augmentation sources (LLM-generated, corpus, lexicon) and semi-supervised settings (using Wiktionary to supply must-link constraints, the number of clusters, and so on).
Result: The "one cluster per lemma" (1cpl) heuristic still beats all unsupervised methods, including the paper's LLM method; the best method differs across parts of speech; LLMs perform poorly when doing WSI directly; data augmentation helps, and the semi-supervised setting with Wiktionary in particular surpasses the previous SOTA by 3.3%.
Conclusion: WSI remains unsolved; lexical resources and the lexical-semantic capabilities of LLMs need to be better integrated, and future work should focus on modeling the two jointly.
Abstract: In the absence of sense-annotated data, word sense induction (WSI) is a compelling alternative to word sense disambiguation, particularly in low-resource or domain-specific settings. In this paper, we emphasize methodological problems in current WSI evaluation. We propose an evaluation on a SemCor-derived dataset, respecting the original corpus polysemy and frequency distributions. We assess pre-trained embeddings and clustering algorithms across parts of speech, and propose and evaluate an LLM-based WSI method for English. We evaluate data augmentation sources (LLM-generated, corpus, and lexicon), and semi-supervised scenarios using Wiktionary for data augmentation, must-link constraints, and the number of clusters per lemma. We find that no unsupervised method (whether ours or previous) surpasses the strong "one cluster per lemma" heuristic (1cpl). We also show that (i) results and best systems may vary across POS, (ii) LLMs have trouble performing this task, (iii) data augmentation is beneficial, and (iv) capitalizing on Wiktionary does help: it surpasses the previous SOTA system on our test set by 3.3%. WSI is not solved, and calls for a better articulation of lexicons and LLMs' lexical semantics capabilities.
[27] SemBench: A Universal Semantic Framework for LLM Evaluation
Mikel Zubillaga,Naiara Perez,Oscar Sainz,German Rigau
Main category: cs.CL
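TL;DR: This paper proposes the SemBench framework, which uses dictionary sense definitions and a sentence encoder to automatically generate benchmarks for evaluating semantic understanding. It requires no manually curated example sentences, supports multiple languages, and is lightweight and efficient.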
Details
Motivation: Traditional semantic understanding benchmarks such as WiC are costly to build and depend on high-resource languages, making them hard to extend to low-resource languages; a lightweight, language-independent evaluation method that can be built automatically is needed.
Method: Uses dictionary sense definitions and a sentence encoder to automatically generate synthetic, WiC-like word sense disambiguation datasets without manually written example sentences; validates the approach on English, Spanish, and Basque and tests a range of LLMs.
Result: Model rankings produced by SemBench correlate strongly with those from standard WiC datasets; only a small number of examples is needed to obtain stable, reliable rankings.
Conclusion: SemBench is a lightweight, scalable, language-independent, and data-efficient framework for cross-lingual evaluation of semantic understanding, offering a new paradigm for assessing the semantic abilities of LLMs.
Abstract: Recent progress in Natural Language Processing (NLP) has been driven by the emergence of Large Language Models (LLMs), which exhibit remarkable generative and reasoning capabilities. However, despite their success, evaluating the true semantic understanding of these models remains a persistent challenge. Traditional benchmarks such as Word-in-Context (WiC) effectively probe this capability, but their creation is resource-intensive and often limited to high-resource languages. In this paper, we introduce SemBench, a framework for automatically generating synthetic benchmarks that assess the semantic competence of LLMs using only dictionary sense definitions and a sentence encoder. This approach eliminates the need for curated example sentences, making it both scalable and language-independent. We evaluate SemBench in three languages (English, Spanish, and Basque) spanning different levels of linguistic resources, and across a wide range of LLMs. Our results show that rankings derived from SemBench strongly correlate with those obtained from standard WiC datasets. Furthermore, our analysis demonstrates that only a small number of examples is required to achieve stable and meaningful rankings. Overall, SemBench provides a lightweight, adaptable, and data-efficient framework for cross-lingual evaluation of semantic understanding in LLMs.
[28] Semi-Synthetic Parallel Data for Translation Quality Estimation: A Case Study of Dataset Building for an Under-Resourced Language Pair
Assaf Siani,Anna Kernerman,Ilan Kernerman
Main category: cs.CL
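TL;DR: This paper presents a method for building a semi-synthetic parallel dataset for English-Hebrew quality estimation (QE), trains neural QE models such as BERT and XLM-R on it, and examines how dataset size, distribution balance, and error types affect model performance.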
Details
Motivation: Addresses the low accuracy, poor adaptability, and limited reliability of QE systems for low-resource language pairs (especially morphologically complex languages), caused mainly by scarce parallel corpora and language-specific factors such as gender and number agreement.
Method: Builds a semi-synthetic English-Hebrew QE dataset: English sentences are generated from typical usage examples, translated into Hebrew by multiple MT engines, filtered with BLEU, and then manually scored by a linguist; high-scoring professional translations are added, and errors such as gender and number agreement errors are injected in a controlled way; neural QE models including BERT and XLM-R are trained on the dataset.
Result: Dataset size, distribution balance, and the distribution of error types all have a marked effect on neural QE model performance; the proposed method improves QE for low-resource, morphologically rich language pairs such as English-Hebrew.
Conclusion: Semi-synthetic data construction can effectively ease the data bottleneck for low-resource QE, especially for morphologically complex languages; future work will focus on further improving error coverage and model generalization.
Abstract: Quality estimation (QE) plays a crucial role in machine translation (MT) workflows, as it serves to evaluate generated outputs that have no reference translations and to determine whether human post-editing or full retranslation is necessary. Yet, developing highly accurate, adaptable, and reliable QE systems for under-resourced language pairs remains largely unsolved, due mainly to limited parallel corpora and to diverse language-dependent factors, such as those arising in morphosyntactically complex languages. This study presents a semi-synthetic parallel dataset for English-to-Hebrew QE, generated by creating English sentences based on examples of usage that illustrate typical linguistic patterns, translating them to Hebrew using multiple MT engines, and filtering outputs via BLEU-based selection. Each translated segment was manually evaluated and scored by a linguist, and we also incorporated professionally translated English-Hebrew segments from our own resources, which were assigned the highest quality score. Controlled translation errors were introduced to address linguistic challenges, particularly regarding gender and number agreement, and we trained neural QE models, including BERT and XLM-R, on this dataset to assess sentence-level MT quality. Our findings highlight the impact of dataset size, distribution balance, and error distribution on model performance. We will describe the challenges, methodology, and results of our experiments, and specify future directions aimed at improving QE performance. This research contributes to advancing QE models for under-resourced language pairs, including morphology-rich languages.
[29] Compression Favors Consistency, Not Truth: When and Why Language Models Prefer Correct Information
Konstantin Krestnikov
Main category: cs.CL
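TL;DR: This paper proposes the "Compression-Consistency Principle": in next-token prediction, language models favor hypotheses that describe the training data more concisely and consistently. Models prefer correct statements only when false alternatives are structurally harder to compress, so this "truth bias" is essentially a by-product of compression pressure and a preference for internal consistency rather than an intrinsic pursuit of truth.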
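A paired evaluation of the kind reported for this work can be sketched as follows: for each prompt, the model's score for the correct completion is compared with its score for an incorrect one, and accuracy is the fraction of pairs where the correct completion scores higher. Using summed log-probabilities as the score is an assumption here, and the numbers are invented for illustration.

```python
def paired_accuracy(pairs):
    """pairs: list of (logprob_correct, logprob_incorrect) tuples.

    Returns the fraction of pairs in which the model assigns a higher
    log-probability to the correct completion. Chance level is 0.5.
    """
    wins = sum(1 for lp_correct, lp_incorrect in pairs if lp_correct > lp_incorrect)
    return wins / len(pairs)


# Hypothetical log-probabilities for four prompt pairs.
pairs = [(-3.1, -4.0), (-2.5, -2.2), (-5.0, -6.3), (-1.9, -2.4)]
acc = paired_accuracy(pairs)
print(f"paired accuracy: {acc:.2f}")
```

Under this reading, the reported 83.1% and 67.0% figures sit well above the 0.5 chance level, while "near-chance" means the model no longer separates correct from incorrect completions.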
Details
Motivation: Investigates why language models sometimes prefer correct statements even when trained on mixed-quality data.
Method: Proposes the Compression-Consistency Principle and tests it on controlled synthetic math corpora (mixing correct and incorrect rules) using small GPT-2-style character-level Transformers (3.5M to 86M parameters). Experiments cover random errors, systematic errors, and a natural-language-like setting, along with interventions such as embedding verification steps and increasing the number of consistent rules.
Result: With random errors, models clearly prefer correct completions in paired evaluation (83.1% on balanced data, and still 67.0% when only 10% of rules are correct); with systematic errors the preference almost disappears (near chance); in the natural-language-like setting the effect is weaker but present (57.7%); embedding verification steps restores the preference for correctness, and increasing the number of consistent rules yields a graded improvement in accuracy.
Conclusion: The so-called "truth bias" is mainly a by-product of compression pressure and a preference for internal consistency, not an intrinsic drive toward truth.
Abstract: Why do language models sometimes prefer correct statements even when trained on mixed-quality data? We introduce the Compression-Consistency Principle: next-token prediction favors hypotheses that allow shorter and more internally consistent descriptions of the training data. Truth bias emerges only when false alternatives are structurally harder to compress. We test this using small GPT-2-style character-level transformers (3.5M to 86M parameters) on synthetic math corpora with controlled mixtures of correct and incorrect rules. In the random-error setting, models strongly prefer correct completions in paired evaluation: 83.1% accuracy at balanced data and 67.0% even when correct rules appear in only 10% of the corpus. Replacing random errors with a coherent but mathematically incorrect rule system largely eliminates the preference (near-chance accuracy). In a more natural-language-like synthetic world, the effect is weaker but still present (57.7%). Additional experiments show that embedding verification steps can restore preference for correctness even at small scale, while increasing the number of consistent rules produces a graded improvement in accuracy. Our results suggest that what appears as a "truth bias" is largely a side effect of compression pressure and preference for internal consistency, rather than an intrinsic drive toward truth. Full code and data are available at https://github.com/Rai220/compression-drives-truth.
[30] Legal-DC: Benchmarking Retrieval-Augmented Generation for Legal Documents
Yaocong Li,Qiang Lan,Leihan Zhang,Le Zhang
Main category: cs.CL
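TL;DR: This paper presents the Legal-DC benchmark dataset and the LegRAG framework to address the lack of dedicated evaluation resources for retrieval-augmented generation (RAG) in Chinese legal scenarios and the difficulty of adapting RAG to the structure of legal provisions. LegRAG improves clause integrity and answer accuracy through legal adaptive indexing and a dual-path self-reflection mechanism, and the paper also introduces automated evaluation methods; LegRAG exceeds the best existing methods by 1.3% to 5.6% on several metrics.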
Details
Motivation: Existing Chinese legal RAG research has two main limitations: no dedicated benchmark supports joint retriever-generator evaluation, and mainstream RAG systems adapt poorly to the structured nature of legal provisions.
Method: Builds the Legal-DC benchmark (480 legal documents and 2,475 question-answer pairs with clause-level references); proposes the LegRAG framework, which combines legal adaptive indexing based on clause-boundary segmentation with a dual-path self-reflection mechanism; designs automated LLM evaluation methods for high-reliability requirements.
Result: LegRAG improves on existing SOTA methods by 1.3% to 5.6% on key evaluation metrics; Legal-DC is the first dedicated benchmark supporting joint evaluation of Chinese legal RAG; the code and data are open-sourced.
Conclusion: The study provides a dedicated evaluation benchmark, a practical technical framework, and empirical insights for Chinese legal RAG, advancing it toward higher accuracy and reliability.
Abstract: Retrieval-Augmented Generation (RAG) has emerged as a promising technology for legal document consultation, yet its application in Chinese legal scenarios faces two key limitations: existing benchmarks lack specialized support for joint retriever-generator evaluation, and mainstream RAG systems often fail to accommodate the structured nature of legal provisions. To address these gaps, this study makes the following contributions. First, we constructed the Legal-DC benchmark dataset, comprising 480 legal documents (covering areas such as market regulation and contract management) and 2,475 refined question-answer pairs, each annotated with clause-level references, filling the gap for specialized evaluation resources in Chinese legal RAG. Second, we propose the LegRAG framework, which integrates legal adaptive indexing (clause-boundary segmentation) with a dual-path self-reflection mechanism to ensure clause integrity while enhancing answer accuracy. Third, we introduce automated evaluation methods for large language models to meet the high-reliability demands of legal retrieval scenarios. LegRAG outperforms existing state-of-the-art methods by 1.3% to 5.6% across key evaluation metrics. This research provides a specialized benchmark, practical framework, and empirical insights to advance the development of Chinese legal RAG systems. Our code and data are available at https://github.com/legal-dc/Legal-DC.
[31] Trust Oriented Explainable AI for Fake News Detection
Krzysztof Siwek,Daniel Stankowski,Maciej Stodolski
Main category: cs.CL
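TL;DR: This paper examines the use of explainable artificial intelligence (XAI) in NLP-based fake news detection, compares three interpretability methods (SHAP, LIME, and Integrated Gradients), and shows that XAI improves model transparency and interpretability while maintaining high detection accuracy.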
Details
Motivation: Improve the reliability and trustworthiness of fake news detection systems, address the "black box" problem of deep learning models, and strengthen users' understanding of and trust in detection results.
Method: Applies three XAI techniques (SHAP, LIME, and Integrated Gradients) together with several neural network architectures to interpret models and analyze experiments on fake news detection.
Result: XAI methods markedly improve model interpretability without sacrificing detection accuracy; SHAP provides fine-grained local attributions, LIME produces simple and intuitive explanations, and Integrated Gradients is more efficient on convolutional models; limitations include high computational cost and sensitivity to parameters.
Conclusion: Deeply integrating XAI with NLP is an effective way to improve the transparency, reliability, and practicality of fake news detection systems, and provides methodological support for building trustworthy AI systems.
Abstract: This article examines the application of Explainable Artificial Intelligence (XAI) in NLP-based fake news detection and compares selected interpretability methods. The work outlines key aspects of disinformation, neural network architectures, and XAI techniques, with a focus on SHAP, LIME, and Integrated Gradients. In the experimental study, classification models were implemented and interpreted using these methods. The results show that XAI enhances model transparency and interpretability while maintaining high detection accuracy. Each method provides distinct explanatory value: SHAP offers detailed local attributions, LIME provides simple and intuitive explanations, and Integrated Gradients performs efficiently with convolutional models. The study also highlights limitations such as computational cost and sensitivity to parameterization. Overall, the findings demonstrate that integrating XAI with NLP is an effective approach to improving the reliability and trustworthiness of fake news detection systems.
[32] Large Language Models for Biomedical Article Classification
Jakub Proboszcz,Paweł Cichosz
Main category: cs.CL
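TL;DR: This study systematically evaluates the text classification ability of large language models (LLMs) on biomedical article classification, covering a range of open- and closed-source models, prompting styles, output processing methods, and few-shot settings. The best zero-shot and few-shot configurations reach average PR AUC of about 0.4 and 0.5 respectively, close to the performance of conventional classifiers.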
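One simple way to turn output token probabilities into a class probability, which is the kind of output processing this study recommends, is to renormalize the log-probabilities of the two answer tokens. The sketch below assumes a binary task answered with "yes" or "no" tokens and assumes the log-probabilities have already been obtained from the model; `class_probability` is a hypothetical helper, not the study's code.

```python
import math


def class_probability(logprob_yes, logprob_no):
    """Renormalize two answer-token log-probabilities into P(positive class).

    Subtracting the maximum before exponentiating keeps the computation
    numerically stable for very negative log-probabilities.
    """
    m = max(logprob_yes, logprob_no)
    p_yes = math.exp(logprob_yes - m)
    p_no = math.exp(logprob_no - m)
    return p_yes / (p_yes + p_no)


# Hypothetical case: the model puts 0.6 on "yes", 0.2 on "no", rest elsewhere.
p = class_probability(math.log(0.6), math.log(0.2))
print(f"P(relevant) = {p:.2f}")
```

A continuous score like this, unlike a bare yes/no label, allows documents to be ranked, which is what a threshold-free metric such as PR AUC requires.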
Details
Motivation: Explore the practicality of LLMs as text classifiers in a non-trivial domain (biomedical article classification) and address the limited range of configurations (prompt types, output processing, few-shot strategies) in prior work.
Method: Systematically evaluates multiple open- and closed-source LLMs of different sizes; tests several prompting styles (zero-shot and few-shot), output processing methods (producing class and class-probability predictions), and few-shot example counts and selection strategies; compares the best configurations with conventional methods such as naïve Bayes, random forest, and fine-tuned Transformers.
Result: Average PR AUC over 15 challenging datasets exceeds 0.4 with zero-shot prompting and approaches 0.5 with few-shot prompting, close to naïve Bayes (0.5), random forest with default parameters (0.5), tuned random forest (0.55), and fine-tuned Transformers (0.5); predicting class probabilities from output token probabilities is confirmed to be effective.
Conclusion: LLMs can serve as practical biomedical text classifiers, performing close to conventional methods especially in few-shot settings; specific configurations such as using output token probabilities for class probability prediction are recommended.
Abstract: This work presents a systematic and in-depth investigation of the utility of large language models as text classifiers for biomedical article classification. The study uses several small and mid-size open-source models, as well as selected closed-source ones, and is more comprehensive than most prior work with respect to the scope of evaluated configurations: different types of prompts, output processing methods for generating both class and class probability predictions, as well as few-shot example counts and selection methods. The performance of the most successful configurations is compared to that of conventional classification algorithms. The obtained average PR AUC over 15 challenging datasets, above 0.4 for zero-shot prompting and nearly 0.5 for few-shot prompting, comes close to that of the naïve Bayes classifier (0.5), the random forest algorithm (0.5 with default settings or 0.55 with hyperparameter tuning), and fine-tuned transformer models (0.5). These results confirm the utility of large language models as text classifiers for non-trivial domains and provide practical recommendations of the most promising setups, including in particular using output token probabilities for class probability prediction.
[33] DatedGPT: Preventing Lookahead Bias in Large Language Models with Time-Aware Pretraining
Yutong Yan,Raphael Tang,Zhenyu Gao,Wenxi Jiang,Yao Lu
Main category: cs.CL
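TL;DR: DatedGPT is a family of twelve 1.3B-parameter language models, each trained from scratch on temporally partitioned data with strict annual cutoffs (2013 to 2024) and then instruction-tuned on time-consistent data, to eliminate lookahead bias in financial backtesting. Experiments show clear knowledge boundaries and competitive performance, and an interactive comparison demo is provided.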
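The core data constraint behind a strict annual cutoff can be sketched as a date filter: a model with cutoff year Y may only see documents dated on or before the end of Y. The document schema and the `partition_by_cutoff` helper below are hypothetical, and whether each model's corpus includes all earlier years is an assumption made here for illustration.

```python
from datetime import date


def partition_by_cutoff(documents, cutoff_year):
    """Keep only documents published on or before 31 December of cutoff_year."""
    cutoff = date(cutoff_year, 12, 31)
    return [doc for doc in documents if doc["published"] <= cutoff]


# Hypothetical documents with publication dates.
documents = [
    {"id": "a", "published": date(2012, 6, 1)},
    {"id": "b", "published": date(2015, 1, 15)},
    {"id": "c", "published": date(2016, 3, 9)},
]

train_2015 = partition_by_cutoff(documents, 2015)
ids = [doc["id"] for doc in train_2015]
print(ids)
```

Applying the same filter to the instruction-tuning data is what keeps fine-tuning from reintroducing information from after the cutoff.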
Details
Motivation: Address the lookahead bias that LLMs introduce into financial backtesting when their pretraining data contains future information, ensuring that a model's knowledge is strictly limited to its training cutoff.
Method: Builds the DatedGPT family: twelve 1.3B-parameter models, each trained from scratch on about 100 billion tokens of temporally partitioned data with strict annual cutoffs (2013 to 2024), then fine-tuned on temporally aligned general-domain and finance instruction data; knowledge boundaries are verified by perplexity probing and performance is evaluated on standard benchmarks.
Result: Each model's knowledge is confirmed to be bounded by the cutoff year of its training data; performance on standard benchmarks is comparable to existing models of similar size; an interactive web demo supporting cross-year model comparison has been released.
Conclusion: DatedGPT provides a trustworthy, time-controlled language model paradigm for financial time-series modeling that mitigates lookahead bias while remaining practical and reproducible.
Abstract: In financial backtesting, large language models pretrained on internet-scale data risk introducing lookahead bias that undermines their forecasting validity, as they may have already seen the true outcome during training. To address this, we present DatedGPT, a family of twelve 1.3B-parameter language models, each trained from scratch on approximately 100 billion tokens of temporally partitioned data with strict annual cutoffs spanning 2013 to 2024. We further enhance each model with instruction fine-tuning on both general-domain and finance-specific datasets curated to respect the same temporal boundaries. Perplexity-based probing confirms that each model's knowledge is effectively bounded by its data cutoff year, while evaluation on standard benchmarks shows competitive performance with existing models of similar scale. We provide an interactive web demo that allows users to query and compare responses from models across different cutoff years.
[34] Bielik-Minitron-7B: Compressing Large Language Models via Structured Pruning and Knowledge Distillation for the Polish Language
Remigiusz Kinas,Paweł Kiszczak,Sergio P. Perez,Krzysztof Ociepa,Łukasz Flis,Krzysztof Wróbel,Adrian Gwoździej
Main category: cs.CL
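TL;DR: This report describes the construction of Bielik-Minitron-7B. A two-stage compression method (structured hybrid pruning plus knowledge distillation) shrinks Bielik-11B-v3.0 by 33.4% to 7.35B parameters; after alignment with SFT, DPO-P, and GRPO, the model recovers about 90% of baseline performance with up to 50% faster inference.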
Details
Motivation: To build efficient language models that are cheap to deploy for less-represented European languages, addressing the difficulty of using large models in resource-constrained settings.
Method: Uses a two-stage compression method inspired by NVIDIA Minitron: structured hybrid pruning with the NVIDIA Model Optimizer, followed by logit-based knowledge distillation with the NVIDIA NeMo Framework; the model is then aligned in three stages with SFT, DPO-P, and GRPO.
Result: Parameter count is reduced from 11.04B to 7.35B (a 33.4% reduction), inference is up to 50% faster, and performance recovers to about 90% of the baseline model.
Conclusion: The approach offers a viable path for less-represented languages that balances model quality against deployment efficiency and lowers inference cost.
Abstract: This report details the creation of Bielik-Minitron-7B, a compressed 7.35B parameter version of the Bielik-11B-v3.0 model, specifically optimized for European languages. By leveraging a two-stage compression methodology inspired by the NVIDIA Minitron approach, we combined structured hybrid pruning and knowledge distillation to reduce the model's parameter count by 33.4%, from 11.04B to 7.35B. We utilized the NVIDIA Model Optimizer for structural pruning and the NVIDIA NeMo Framework for logit-based distillation for quality recovery. Following distillation, the model underwent a rigorous alignment pipeline consisting of Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO-P), and Reinforcement Learning (GRPO). Our final model successfully recovered approximately 90% of the baseline model's performance while providing up to 50% inference speedup. This approach demonstrates an efficient pathway to create language models for less-represented languages, preserving the original model quality while reducing inference deployment costs.
[35] CoMMET: To What Extent Can LLMs Perform Theory of Mind Tasks?
Ruirui Chen,Weifeng Jiang,Chengwei Qin,Cheston Tan
Main category: cs.CL
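TL;DR: This paper proposes CoMMET, a new multimodal benchmark dataset for evaluating the theory of mind (ToM) abilities of large language models (LLMs) in multi-turn conversational settings, addressing the limits of existing benchmarks that are text-only and focused on belief tasks.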
Details
Motivation: Existing ToM benchmarks are limited to text-only input and belief-related tasks, so they cannot fully assess how well LLMs reason about mental states in realistic social interaction.
Method: Builds CoMMET, described as the first multimodal ToM evaluation dataset for multi-turn conversation; it is inspired by the Theory of Mind Booklet Task, covers a broader range of mental states, and introduces multi-turn testing.
Result: A systematic evaluation of LLMs from several families and sizes reveals the strengths and weaknesses of current models' ToM abilities.
Conclusion: CoMMET provides a new tool for understanding the social cognitive capabilities of modern LLMs and points to directions for improvement.
Abstract: Theory of Mind (ToM), the ability to reason about the mental states of oneself and others, is a cornerstone of human social intelligence. As Large Language Models (LLMs) become ubiquitous in real-world applications, validating their capacity for this level of social reasoning is essential for effective and natural interactions. However, existing benchmarks for assessing ToM in LLMs are limited; most rely solely on text inputs and focus narrowly on belief-related tasks. In this paper, we propose a new multimodal benchmark dataset, CoMMET, a Comprehensive Mental states and Moral Evaluation Task inspired by the Theory of Mind Booklet Task. CoMMET expands the scope of evaluation by covering a broader range of mental states and introducing multi-turn testing. To the best of our knowledge, this is the first multimodal dataset to evaluate ToM in a multi-turn conversational setting. Through a comprehensive assessment of LLMs across different families and sizes, we analyze the strengths and limitations of current models and identify directions for future improvement. Our work offers a deeper understanding of the social cognitive capabilities of modern LLMs.
[36] PersonaTrace: Synthesizing Realistic Digital Footprints with LLM Agents
Minjia Wang,Yunfeng Wang,Xiao Ma,Dexin Lv,Qifan Guo,Lynn Zheng,Benliang Wang,Lei Wang,Jiannan Li,Yongwei Xing,David Xu,Zheng Sun
Main category: cs.CL
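TL;DR: This paper proposes a method for synthesizing realistic digital footprints with large language model (LLM) agents. Starting from a structured user profile, it generates diverse, plausible sequences of user events and the corresponding digital artifacts (emails, messages, and so on). Experiments show the generated data is more diverse and realistic, and that it improves downstream models on real-world out-of-distribution tasks.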
Details
Motivation: Research is often limited by the scarcity of diverse, accessible digital footprint data, which hinders behavior analysis, personalized application development, and machine learning model training.
Method: Starting from a structured user profile, LLM agents generate diverse and plausible sequences of user events and then produce the corresponding digital artifacts (emails, messages, calendar entries, reminders, etc.).
Result: Intrinsic evaluation shows the generated data is more diverse and realistic than existing baselines; on real-world out-of-distribution tasks, models fine-tuned on this synthetic data outperform models fine-tuned on other synthetic data.
Conclusion: LLM agents can synthesize high-quality digital footprint data, providing a reliable and scalable data source for related research and applications.
Abstract: Digital footprints (records of individuals' interactions with digital systems) are essential for studying behavior, developing personalized applications, and training machine learning models. However, research in this area is often hindered by the scarcity of diverse and accessible data. To address this limitation, we propose a novel method for synthesizing realistic digital footprints using large language model (LLM) agents. Starting from a structured user profile, our approach generates diverse and plausible sequences of user events, ultimately producing corresponding digital artifacts such as emails, messages, calendar entries, reminders, etc. Intrinsic evaluation results demonstrate that the generated dataset is more diverse and realistic than existing baselines. Moreover, models fine-tuned on our synthetic data outperform those trained on other synthetic datasets when evaluated on real-world out-of-distribution tasks.
[37] CHiL(L)Grader: Calibrated Human-in-the-Loop Short-Answer Grading
Pranav Raikote,Korbinian Randl,Ioanna Miliou,Athanasios Lakes,Panagiotis Papapetrou
Main category: cs.CL
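TL;DR: CHiL(L)Grader is the first automated grading framework to build calibrated confidence estimation into a human-in-the-loop workflow. Using temperature scaling, confidence-based selective prediction, and continual learning, it maintains expert-level grading quality (QWK ≥ 0.80) on the responses it scores automatically, routes uncertain cases to human graders, and keeps improving from teacher feedback.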
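The two calibration steps named in this summary, post-hoc temperature scaling and confidence-based routing, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the grid search, the 0.8 threshold, and the function names `fit_temperature` and `route` are all hypothetical choices made for the example.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (T > 1 softens the distribution)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    loss = 0.0
    for logits, y in zip(logit_rows, labels):
        loss -= math.log(softmax(logits, temperature)[y])
    return loss / len(labels)

def fit_temperature(logit_rows, labels, grid=None):
    """Pick the temperature that minimises held-out NLL (simple grid search, an assumption)."""
    grid = grid or [0.5 + 0.1 * i for i in range(46)]  # candidate temperatures 0.5 .. 5.0
    return min(grid, key=lambda t: nll(logit_rows, labels, t))

def route(logits, temperature, threshold=0.8):
    """Return ("auto", grade) for confident predictions, else ("human", None)."""
    probs = softmax(logits, temperature)
    confidence = max(probs)
    if confidence >= threshold:
        return "auto", probs.index(confidence)
    return "human", None
```

In use, the temperature would be fitted once on held-out graded responses and then applied to every new prediction; raising the threshold sends more responses to human graders in exchange for higher quality on the automated share.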
Details
Motivation: Instruction-tuned models are often overconfident in educational assessment, and their reliability degrades as curricula evolve, so they cannot be deployed fully autonomously in high-stakes settings; a reliable uncertainty quantification mechanism is needed.
Method: Proposes the CHiL(L)Grader framework, which combines post-hoc temperature scaling, confidence-based selective prediction, and continual learning into a workflow where high-confidence predictions are automated and low-confidence cases go to human graders, with adaptation to unseen questions and updated rubrics.
Result: On three short-answer grading datasets, the framework automatically scores 35–65% of responses at expert-level quality (QWK ≥ 0.80); a QWK gap of 0.347 between accepted and rejected predictions confirms that confidence-based routing works; each round of teacher feedback improves the model's grading ability.
Conclusion: Uncertainty quantification is key to trustworthy AI-assisted grading, and CHiL(L)Grader offers a feasible path toward safe, evolvable educational assessment.
Abstract: Scaling educational assessment with large language models requires not just accuracy, but the ability to recognize when predictions are trustworthy. Instruction-tuned models tend to be overconfident, and their reliability deteriorates as curricula evolve, making fully autonomous deployment unsafe in high-stakes settings. We introduce CHiL(L)Grader, the first automated grading framework that incorporates calibrated confidence estimation into a human-in-the-loop workflow. Using post-hoc temperature scaling, confidence-based selective prediction, and continual learning, CHiL(L)Grader automates only high-confidence predictions while routing uncertain cases to human graders, and adapts to evolving rubrics and unseen questions. Across three short-answer grading datasets, CHiL(L)Grader automatically scores 35-65% of responses at expert-level quality (QWK >= 0.80). A QWK gap of 0.347 between accepted and rejected predictions confirms the effectiveness of the confidence-based routing. Each correction cycle strengthens the model's grading capability as it learns from teacher feedback. These results show that uncertainty quantification is key for reliable AI-assisted grading.
[38] BTZSC: A Benchmark for Zero-Shot Text Classification Across Cross-Encoders, Embedding Models, Rerankers and LLMs
Ilias Aarab
Main category: cs.CL
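TL;DR: This paper introduces the BTZSC benchmark and systematically compares four families of zero-shot text classifiers (NLI cross-encoders, embedding models, rerankers, and instruction-tuned LLMs). Modern rerankers such as Qwen3-Reranker-8B perform best, embedding models give the best balance of accuracy and latency, instruction-tuned LLMs perform well but trail rerankers slightly, and NLI models hit a performance ceiling.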
Details
Motivation: Existing evaluations such as MTEB often rely on supervised fine-tuning or probes and so do not reflect true zero-shot ability; a systematic, fair comparison of the diverse zero-shot approaches is missing.
Method: Builds BTZSC, a zero-shot classification benchmark of 22 public datasets covering sentiment, topic, intent, and emotion tasks; evaluates 38 public and custom models from four model families under a uniform zero-shot setting.
Result: (i) Qwen3-Reranker-8B reaches macro F1 = 0.72, a new state of the art; (ii) strong embedding models such as GTE-large-en-v1.5 come close to reranker accuracy at lower latency; (iii) instruction-tuned LLMs at 4–12B parameters reach up to F1 = 0.67 and are strongest on topic classification; (iv) NLI cross-encoders plateau as scale increases; (v) scaling mainly benefits rerankers and LLMs, with limited gains for embedding models.
Conclusion: Rerankers are currently the most effective architecture for zero-shot text classification, embedding models offer the best practical trade-off, and LLMs show potential but need further optimization; BTZSC is presented as the first standardized benchmark focused on purely zero-shot evaluation.
Abstract: Zero-shot text classification (ZSC) offers the promise of eliminating costly task-specific annotation by matching texts directly to human-readable label descriptions. While early approaches have predominantly relied on cross-encoder models fine-tuned for natural language inference (NLI), recent advances in text-embedding models, rerankers, and instruction-tuned large language models (LLMs) have challenged the dominance of NLI-based architectures. Yet, systematically comparing these diverse approaches remains difficult. Existing evaluations, such as MTEB, often incorporate labeled examples through supervised probes or fine-tuning, leaving genuine zero-shot capabilities underexplored. To address this, we introduce BTZSC, a comprehensive benchmark of 22 public datasets spanning sentiment, topic, intent, and emotion classification, capturing diverse domains, class cardinalities, and document lengths. Leveraging BTZSC, we conduct a systematic comparison across four major model families, NLI cross-encoders, embedding models, rerankers and instruction-tuned LLMs, encompassing 38 public and custom checkpoints.
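For readers unfamiliar with the macro F1 metric used to rank models in this entry, the following is a minimal sketch of how it is usually computed: per-class F1 averaged with equal weight per class. The benchmark's exact averaging over its 22 datasets is not specified here, so the function `macro_f1` and its handling of empty classes are assumptions for illustration only.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); a class with no support and no predictions scores 0
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)
```

Because every class counts equally, a model that ignores a rare class is penalised more heavily than it would be under plain accuracy, which matters for datasets with many or imbalanced classes.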
Our results show that: (i) modern rerankers, exemplified by Qwen3-Reranker-8B, set a new state-of-the-art with macro F1 = 0.72; (ii) strong embedding models such as GTE-large-en-v1.5 substantially close the accuracy gap while offering the best trade-off between accuracy and latency; (iii) instruction-tuned LLMs at 4–12B parameters achieve competitive performance (macro F1 up to 0.67), excelling particularly on topic classification but trailing specialized rerankers; (iv) NLI cross-encoders plateau even as backbone size increases; and (v) scaling primarily benefits rerankers and LLMs over embedding models. BTZSC and accompanying evaluation code are publicly released to support fair and reproducible progress in zero-shot text understanding.
[39] Just Use XML: Revisiting Joint Translation and Label Projection
Thennal D K,Chris Biemann,Hans Ole Hatzel
Main category: cs.CL
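TL;DR: This paper proposes LabelPigeon, a framework that performs machine translation and label projection jointly via XML tags, improving cross-lingual transfer without harming translation quality.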
Details
Motivation: Existing methods either separate label projection from machine translation or combine them at the cost of translation quality; this work tests whether joint modeling can improve label projection while preserving or even improving translation quality.
Method: Proposes LabelPigeon, which embeds label information in the source sentence with XML tags so that translation and label projection are modeled jointly end to end; designs a direct evaluation scheme for label projection quality and evaluates across many languages and tasks.
Result: LabelPigeon outperforms baselines and improves translation quality in 11 languages; translation quality improves consistently across 203 languages and varying annotation complexity; across 27 languages and three downstream tasks (such as NER), cross-lingual transfer gains reach up to +39.9 F1.
Conclusion: XML-tagged label projection is an effective and efficient method for cross-lingual label transfer that does not sacrifice translation quality and substantially improves downstream performance.
Abstract: Label projection is an effective technique for cross-lingual transfer, extending span-annotated datasets from a high-resource language to low-resource ones. Most approaches perform label projection as a separate step after machine translation, and prior work that combines the two reports degraded translation quality. We re-evaluate this claim with LabelPigeon, a novel framework that jointly performs translation and label projection via XML tags. We design a direct evaluation scheme for label projection, and find that LabelPigeon outperforms baselines and actively improves translation quality in 11 languages. We further assess translation quality across 203 languages and varying annotation complexity, finding consistent improvement attributed to additional fine-tuning. Finally, across 27 languages and three downstream tasks, we report substantial gains in cross-lingual transfer over comparable work, up to +39.9 F1 on NER. Overall, our results demonstrate that XML-tagged label projection provides effective and efficient label transfer without compromising translation quality.
[40] Translationese as a Rational Response to Translation Task Difficulty
Maria Kunilovskaya
Main category: cs.CL
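TL;DR: This paper proposes that cognitive load during translation is the underlying cause of translationese, quantifies translation task difficulty with information-theoretic metrics (such as LLM surprisal), and tests its association with translationese. Cross-lingual transfer difficulty predicts translationese better than source-text complexity, especially for English-to-German; syntactic complexity and translation-solution entropy are the strongest predictors.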
Details
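Motivation: Existing research attributes translationese to production tendencies, socio-cultural variables, and language-pair effects but lacks a unified explanatory framework; this work offers a new account based on cognitive load.
Method: Uses a bidirectional English-German corpus with written and spoken subcorpora; operationalizes translationese as a segment-level translatedness score from an automatic classifier; quantifies source-text complexity and cross-lingual transfer difficulty mainly with information-theoretic metrics based on LLM surprisal, complemented by traditional syntactic and semantic features.
Result: Translationese is partly explained by translation task difficulty, especially in the English-to-German direction; cross-lingual transfer difficulty generally contributes more than source-text complexity; information-theoretic indicators match or outperform traditional features in written mode but offer no advantage in spoken mode; source-text syntactic complexity and translation-solution entropy are the strongest predictors across language pairs and modes.
Conclusion: Translationese reflects the cognitive load inherent in the translation task and can be predicted from quantified task difficulty, which supports the cognitive load hypothesis and suggests new directions for translation quality assessment and machine translation.
Abstract: Translations systematically diverge from texts originally produced in the target language, a phenomenon widely referred to as translationese. Translationese has been attributed to production tendencies (e.g. interference, simplification), socio-cultural variables, and language-pair effects, yet a unified explanatory account is still lacking. We propose that translationese reflects cognitive load inherent in the translation task itself. We test whether observable translationese can be predicted from quantifiable measures of translation task difficulty. Translationese is operationalised as a segment-level translatedness score produced by an automatic classifier. Translation task difficulty is conceptualised as comprising source-text and cross-lingual transfer components, operationalised mainly through information-theoretic metrics based on LLM surprisal, complemented by established syntactic and semantic alternatives. We use a bidirectional English-German corpus comprising written and spoken subcorpora. Results indicate that translationese can be partly explained by translation task difficulty, especially in English-to-German. For most experiments, cross-lingual transfer difficulty contributes more than source-text complexity. Information-theoretic indicators match or outperform traditional features in written mode, but offer no advantage in spoken mode.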
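The LLM surprisal that this entry's difficulty metrics build on is the negative log-probability a model assigns to each token. A minimal sketch, assuming per-token probabilities have already been obtained from some language model (the function names and the choice of bits as the unit are illustrative, not taken from the paper):

```python
import math

def surprisal_bits(prob):
    """Surprisal of one token in bits: -log2 p(token | context)."""
    return -math.log2(prob)

def mean_surprisal(token_probs):
    """Average surprisal over a segment, given each token's model probability."""
    return sum(surprisal_bits(p) for p in token_probs) / len(token_probs)
```

A segment whose tokens the model finds predictable gets a low mean surprisal; higher values mark segments that are harder to predict, which is the sense in which surprisal can stand in for difficulty.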
Source-text syntactic complexity and translation-solution entropy emerged as the strongest predictors of translationese across language pairs and modes.
[41] To Words and Beyond: Probing Large Language Models for Sentence-Level Psycholinguistic Norms of Memorability and Reading Times
Thomas Hikaru Clark,Carlos Arriaga,Javier Conde,Gonzalo Martínez,Pedro Reviriego
Main category: cs.CL
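TL;DR: This paper examines how well large language models (LLMs) capture sentence-level psycholinguistic features such as sentence memorability and reading times. Fine-tuned models fit human data reasonably well, but zero-shot and few-shot prompting give unstable results.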
Details
Motivation: To extend LLM-based estimation of psycholinguistic norms from the word and multiword level to the sentence level (sentence memorability, reading times) and to test whether LLMs encode the relevant cognitive information.
Method: Fine-tunes LLMs with supervision to predict human-derived sentence memorability and reading times; compares this with zero-shot and few-shot prompting and with interpretable baseline predictors.
Result: Fine-tuned LLMs produce estimates that correlate with human norms and exceed the predictive power of the baselines; zero-shot and few-shot prompting performance varies widely and is unreliable.
Conclusion: Fine-tuned LLMs can model sentence-level cognitive features, which suggests they contain relevant information; direct prompting is not a robust substitute for human cognitive measures and should be used with care.
Abstract: Large Language Models (LLMs) have recently been shown to produce estimates of psycholinguistic norms, such as valence, arousal, or concreteness, for words and multiword expressions, that correlate with human judgments. These estimates are obtained by prompting an LLM, in zero-shot fashion, with a question similar to those used in human studies. Meanwhile, for other norms such as lexical decision time or age of acquisition, LLMs require supervised fine-tuning to obtain results that align with ground-truth values. In this paper, we extend this approach to the previously unstudied features of sentence memorability and reading times, which involve the relationship between multiple words in a sentence-level context. Our results show that via fine-tuning, models can provide estimates that correlate with human-derived norms and exceed the predictive power of interpretable baseline predictors, demonstrating that LLMs contain useful information about sentence-level features. At the same time, our results show very mixed zero-shot and few-shot performance, providing further evidence that care is needed when using LLM-prompting as a proxy for human cognitive measures.
[42] SommBench: Assessing Sommelier Expertise of Language Models
William Brach,Tomas Bedej,Jacob Nielsen,Jacob Pichna,Juraj Bedej,Eemeli Saarensilta,Julie Dupouy,Gianluca Barmina,Andrea Blasi Núñez,Peter Schneider-Kamp,Kristian Košťál,Michal Ries,Lukas Galke Poech
Main category: cs.CL
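TL;DR: This paper proposes SommBench, a multilingual, multicultural benchmark of sommelier expertise covering three tasks (wine theory question answering, wine feature completion, and food-wine pairing), to test whether large language models that learn only from text can reach expert-level sensory judgment.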
Details
Motivation: Existing cultural evaluation benchmarks focus on basic cultural knowledge that can be encoded in language and do not assess professional domains that depend on sensory experience such as smell and taste (for example, wine tasting); a multilingual benchmark is needed that separates language skill from domain expertise.
Method: Builds SommBench, a multilingual benchmark of sommelier expertise with three task types (WTQA, WFC, FWP) in 8 languages; the data were developed with a professional sommelier and native speakers of each language; popular closed-weights and open-weights models are evaluated.
Result: The strongest models do well on wine theory question answering (up to 97% accuracy) but find wine feature completion (up to 65%) and food-wine pairing (MCC between 0 and 0.39) much harder, which supports SommBench as a challenging new benchmark.
Conclusion: SommBench offers a new way to evaluate the multilingual abilities of LLMs in sensory-intensive professional domains and exposes limits in current models' reasoning from text to sensory judgment.
Abstract: With the rapid advances of large language models, it becomes increasingly important to systematically evaluate their multilingual and multicultural capabilities. Previous cultural evaluation benchmarks focus mainly on basic cultural knowledge that can be encoded in linguistic form. Here, we propose SommBench, a multilingual benchmark to assess sommelier expertise, a domain deeply grounded in the senses of smell and taste. While language models learn about sensory properties exclusively through textual descriptions, SommBench tests whether this textual grounding is sufficient to emulate expert-level sensory judgment. SommBench comprises three main tasks: Wine Theory Question Answering (WTQA), Wine Feature Completion (WFC), and Food-Wine Pairing (FWP). SommBench is available in multiple languages: English, Slovak, Swedish, Finnish, German, Danish, Italian, and Spanish. This helps separate a language model's wine expertise from its language skills. The benchmark datasets were developed in close collaboration with a professional sommelier and native speakers of the respective languages, resulting in 1,024 wine theory question-answering questions, 1,000 wine feature-completion examples, and 1,000 food-wine pairing examples. We provide results for the most popular language models, including closed-weights models such as Gemini 2.5, and open-weights models, such as GPT-OSS and Qwen 3.
Our results show that the most capable models perform well on wine theory question answering (up to 97% correct with a closed-weights model), yet feature completion (peaking at 65%) and food-wine pairing (MCC ranging between 0 and 0.39) turn out to be more challenging. These results position SommBench as an interesting and challenging benchmark for evaluating the sommelier expertise of language models. The benchmark is publicly available at https://github.com/sommify/sommbench.
[43] Cross-Context Review: Improving LLM Output Quality by Separating Production and Review Sessions
Tae-Eun Song
Main category: cs.CL
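TL;DR: This paper proposes Cross-Context Review (CCR), in which review takes place in a new session without the original generation context, substantially improving a large language model's ability to catch errors in its own output. Experiments show CCR beats several baselines on F1, and that the gain comes from context isolation rather than from simply reviewing again.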
Details
Motivation: Large language models struggle to find errors in their own output within the same session, so a simple and effective way to improve self-review is needed.
Method: Proposes Cross-Context Review (CCR), where review is done in a fresh session with no access to the original production history; a controlled experiment compares four conditions: same-session Self-Review (SR), repeated Self-Review (SR2), context-aware Subagent Review (SA), and CCR.
Result: Over 360 reviews CCR reaches F1 = 28.6%, significantly better than SR (24.6%), SR2 (21.7%), and SA (23.8%); SR2 is not significantly better than SR (p = 0.11), which indicates that the advantage comes from context separation rather than repetition.
Conclusion: Context isolation is the key to better self-review; CCR is general, needs no extra infrastructure, and costs only one additional session, making it practical and scalable.
Abstract: Large language models struggle to catch errors in their own outputs when the review happens in the same session that produced them. This paper introduces Cross-Context Review (CCR), a straightforward method where the review is conducted in a fresh session with no access to the production conversation history. We ran a controlled experiment: 30 artifacts (code, technical documents, presentation scripts) with 150 injected errors, tested under four review conditions: same-session Self-Review (SR), repeated Self-Review (SR2), context-aware Subagent Review (SA), and Cross-Context Review (CCR). Over 360 reviews, CCR reached an F1 of 28.6%, outperforming SR (24.6%, p=0.008, d=0.52), SR2 (21.7%, p<0.001, d=0.72), and SA (23.8%, p=0.004, d=0.57). The SR2 result matters most for interpretation: reviewing twice in the same session did not beat reviewing once (p=0.11), which rules out repetition as an explanation for CCR's advantage. The benefit comes from context separation itself. CCR works with any model, needs no infrastructure, and costs only one extra session.
[44] LifeSim: Long-Horizon User Life Simulator for Personalized Assistant Evaluation
Feiyu Duan,Xuanjing Huang,Zhongyu Wei
Main category: cs.CL
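TL;DR: This paper proposes the LifeSim user simulator and the LifeSim-Eval benchmark for evaluating large language models on personalized assistant tasks, with a focus on understanding implicit intentions and modeling long-term user preferences.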
Details
Motivation: Existing benchmarks for personalized assistants do not reflect the complexity of real user-assistant interaction; in particular they do not model the physical environment, external context, or the user's cognitive state.
Method: Builds the LifeSim user simulator on the Belief-Desire-Intention (BDI) model to generate coherent life trajectories and simulate intention-driven interaction; on top of it, builds LifeSim-Eval, a multi-turn interactive benchmark covering 8 life domains and 1,200 scenarios.
Result: Experiments show that current LLMs have significant limitations in recognizing implicit intentions and modeling long-term user preferences, in both single-scenario and long-horizon settings.
Conclusion: LifeSim-Eval provides a more realistic evaluation framework for personalized AI assistants and exposes key weaknesses of current LLMs in cognitive modeling and sustained interaction.
Abstract: The rapid advancement of large language models (LLMs) has accelerated progress toward universal AI assistants. However, existing benchmarks for personalized assistants remain misaligned with real-world user-assistant interactions, failing to capture the complexity of external contexts and users' cognitive states. To bridge this gap, we propose LifeSim, a user simulator that models user cognition through the Belief-Desire-Intention (BDI) model within physical environments to generate coherent life trajectories, and simulates intention-driven user interactive behaviors. Based on LifeSim, we introduce LifeSim-Eval, a comprehensive benchmark for multi-scenario, long-horizon personalized assistance. LifeSim-Eval covers 8 life domains and 1,200 diverse scenarios, and adopts a multi-turn interactive method to assess models' abilities to complete explicit and implicit intentions, recover user profiles, and produce high-quality responses. Under both single-scenario and long-horizon settings, our experiments reveal that current LLMs face significant limitations in handling implicit intention and long-term user preference modeling.
[45] QAQ: Bidirectional Semantic Coherence for Selecting High-Quality Synthetic Code Instructions
Jiayin Lei,Ming Ma,Yunxi Duan,Chenxi Li,Tianming Yang
Main category: cs.CL
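TL;DR: This paper proposes QAQ, a framework that scores synthetic code data with Reverse Mutual Information (RMI), selecting high-quality samples by how well the answer predicts the query (Q|A). On the WarriorCoder dataset, training on only 25% of the data matches full-data training.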
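One way to read the reverse-direction scoring described here is as a pointwise information gain: how much more likely the query becomes once the answer is known. The sketch below assumes the form RMI = log p(Q|A) − log p(Q) and a simple rank-band filter that drops both extremes; neither the formula nor the band boundaries are confirmed by this summary, and `rmi` and `select_by_rmi` are hypothetical names.

```python
def rmi(logp_q_given_a, logp_q):
    """Assumed pointwise form: log p(Q|A) - log p(Q), from sequence log-probabilities."""
    return logp_q_given_a - logp_q

def select_by_rmi(scores, lower=0.2, upper=0.8):
    """Keep sample indices whose RMI rank falls in [lower, upper) of the sorted order.

    Dropping the lowest scores removes pairs where the answer says little about
    the query; dropping the highest removes pairs that are suspiciously easy.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    lo = int(len(order) * lower)
    hi = int(len(order) * upper)
    return sorted(order[lo:hi])
```

The log-probabilities would come from scoring the query tokens with a language model, once with the answer in context and once without; that scoring step is outside this sketch.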
Details
Motivation: Existing data selection methods based on Instruction-Following Difficulty (IFD) struggle to tell intrinsic task difficulty apart from model hallucination in synthetic data, so noisy samples are hard to remove.
Method: Proposes the QAQ framework and defines Reverse Mutual Information (RMI) to measure how well the answer predicts the query; analyzes the quality problems associated with both RMI extremes (too low and too high); adds a selection strategy based on disagreement between strong and weak models to find samples that are valid yet challenging.
Result: On the WarriorCoder dataset, training on just 25% of the data selected with stratified RMI matches full-data training and significantly outperforms IFD and other existing methods.
Conclusion: Bidirectional semantic coherence (Q↔A) is a key dimension for curating synthetic data, and QAQ offers a scalable path to efficient, lower-cost training of code models.
Abstract: Synthetic data has become essential for training code generation models, yet it introduces significant noise and hallucinations that are difficult to detect with current metrics. Existing data selection methods like Instruction-Following Difficulty (IFD) typically assess how hard a model generates an answer given a query ($A|Q$). However, this metric is ambiguous on noisy synthetic data, where low probability cannot distinguish between intrinsic task complexity and model-generated hallucinations. Here, we propose QAQ, a novel data selection framework that evaluates data quality from the reverse direction: how well can the answer predict the query ($Q|A$)? We define Reverse Mutual Information (RMI) to quantify the information gain about the query conditioned on the answer. Our analyses reveal that both extremes of RMI signal quality issues: low RMI indicates semantic misalignment, while excessively high RMI may contain defect patterns that LLMs easily recognize. Furthermore, we introduce a selection strategy based on the disagreement between strong and weak models to identify samples that are valid yet challenging. Experiments on the WarriorCoder dataset demonstrate that selecting just 25% of data using stratified RMI achieves comparable performance to full-data training, significantly outperforming existing data selection methods. Our approach highlights the importance of bidirectional semantic coherence in synthetic data curation, offering a scalable pathway to reduce computational costs without sacrificing model capability.
[46] Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections
Łukasz Borchmann,Jordy Van Landeghem,Michał Turski,Shreyansh Padarha,Ryan Othniel Kearns,Adam Mahdi,Niels Rogge,Clémentine Fourrier,Siwei Han,Huaxiu Yao,Artemis Llabrés,Yiming Xu,Dimosthenis Karatzas,Hao Zhang,Anupam Datta
Main category: cs.CL
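TL;DR: This paper introduces the MADQA benchmark for evaluating the strategic reasoning of multimodal agents in complex document workflows. The best current agents match human searchers in accuracy but rely on brute-force search rather than strategic planning, leaving a gap of about 20% to oracle performance.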
Details
Motivation: To determine whether multimodal agents show genuine strategic reasoning or only stochastic trial-and-error search.
Method: Builds MADQA, a benchmark of 2,250 human-authored questions grounded in 800 heterogeneous PDF documents; designs it according to Classical Test Theory to increase discriminative power; proposes a new evaluation protocol that measures the accuracy-effort trade-off.
Result: The best agents match human searchers in accuracy but succeed on different questions and rely on brute-force search to make up for weak strategic planning; they fail to close the nearly 20% gap to oracle performance and often get stuck in unproductive loops.
Conclusion: Current multimodal agents do not yet achieve efficient, calibrated strategic reasoning and need to move from brute-force retrieval toward more efficient reasoning; the authors release the dataset and evaluation harness to support this.
Abstract: Multimodal agents offer a promising path to automating complex document-intensive workflows. Yet, a critical question remains: do these agents demonstrate genuine strategic reasoning, or merely stochastic trial-and-error search? To address this, we introduce MADQA, a benchmark of 2,250 human-authored questions grounded in 800 heterogeneous PDF documents. Guided by Classical Test Theory, we design it to maximize discriminative power across varying levels of agentic abilities. To evaluate agentic behaviour, we introduce a novel evaluation protocol measuring the accuracy-effort trade-off. Using this framework, we show that while the best agents can match human searchers in raw accuracy, they succeed on largely different questions and rely on brute-force search to compensate for weak strategic planning. They fail to close the nearly 20% gap to oracle performance, persisting in unproductive loops. We release the dataset and evaluation harness to help facilitate the transition from brute-force retrieval to calibrated, efficient reasoning.
[47] Long-Context Encoder Models for Polish Language Understanding
Sławomir Dadas,Rafał Poświata,Marek Kozłowski,Małgorzata Grębowiec,Michał Perełkiewicz,Paweł Klimiuk,Przemysław Boruta
Main category: cs.CL
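TL;DR: This paper presents a high-quality Polish encoder model that supports a context of 8192 tokens. Built with two-stage training (positional embedding adaptation followed by full-parameter continued pre-training) and accompanied by distilled compressed variants, it substantially improves long-document understanding across 25 tasks (including KLEJ and FinBench) while maintaining short-text performance.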
Details
Motivation: Classic encoders such as BERT have a short context window and cannot handle long documents, and no high-quality long-context encoder exists for Polish.
Method: Uses two-stage training: positional embedding adaptation first, then full-parameter continued pre-training; also builds compressed model variants through knowledge distillation.
Result: Achieves the best average performance on 25 tasks (including KLEJ, the new financial task suite FinBench, and long-document understanding tasks), clearly outperforming competitors on long-context tasks with comparable quality on short texts.
Conclusion: The work fills the gap in long-context encoders for Polish and shows that context extension and knowledge distillation are effective for efficient encoders in less-resourced languages.
Abstract: While decoder-only Large Language Models (LLMs) have recently dominated the NLP landscape, encoder-only architectures remain a cost-effective and parameter-efficient standard for discriminative tasks. However, classic encoders like BERT are limited by a short context window, which is insufficient for processing long documents. In this paper, we address this limitation for Polish by introducing a high-quality Polish model capable of processing sequences of up to 8192 tokens. The model was developed by employing a two-stage training procedure that involves positional embedding adaptation and full parameter continuous pre-training. Furthermore, we propose compressed model variants trained via knowledge distillation. The models were evaluated on 25 tasks, including the KLEJ benchmark, a newly introduced financial task suite (FinBench), and other classification and regression tasks, specifically those requiring long-document understanding. The results demonstrate that our model achieves the best average performance among Polish and multilingual models, significantly outperforming competitive solutions in long-context tasks while maintaining comparable quality on short texts.
[48] IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse
Yushi Bai,Qian Dong,Ting Jiang,Xin Lv,Zhengxiao Du,Aohan Zeng,Jie Tang,Juanzi Li
Main category: cs.CL
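TL;DR: This paper proposes IndexCache, which reuses the index results of some layers across multi-layer sparse attention. This sharply reduces the expensive indexer computation in DeepSeek Sparse Attention (DSA), speeding up inference and lowering cost with almost no loss in model quality.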
Details
Motivation: Sparse attention methods such as DSA reduce core attention complexity to O(Lk), but the lightweight indexer itself is still O(L²) and runs independently at every layer; in practice the top-k tokens selected by adjacent layers are highly similar, so there is cross-layer redundancy to exploit.
Method: Proposes IndexCache, which partitions Transformer layers into Full layers (which run their own indexer) and Shared layers (which reuse the index of the nearest Full layer); two configuration strategies are designed: a training-free greedy search that minimizes language modeling loss on a calibration set, and a training-aware multi-layer distillation loss that fits each retained indexer to the averaged attention distribution of the layers it serves.
Result: On a 30B DSA model, IndexCache removes 75% of indexer computation and speeds up prefill and decode by up to 1.82× and 1.48× with almost no quality loss; effectiveness is also confirmed on the large GLM-5 model.
Conclusion: IndexCache is an efficient, general, and easily deployed optimization that exploits the cross-layer index redundancy inherent in sparse attention to substantially improve long-context inference efficiency while preserving accuracy.
Abstract: Long-context agentic workflows have emerged as a defining use case for large language models, making attention efficiency critical for both inference speed and serving cost. Sparse attention addresses this challenge effectively, and DeepSeek Sparse Attention (DSA) is a representative production-grade solution: a lightweight lightning indexer selects the top-k most relevant tokens per query, reducing core attention from $O(L^2)$ to $O(Lk)$. However, the indexer itself retains $O(L^2)$ complexity and must run independently at every layer, despite the fact that the resulting top-k selections are highly similar across consecutive layers. We present IndexCache, which exploits this cross-layer redundancy by partitioning layers into a small set of Full layers that run their own indexers and a majority of Shared layers that simply reuse the nearest Full layer's top-k indices. We propose two complementary approaches to determine and optimize this configuration. Training-free IndexCache applies a greedy search algorithm that selects which layers to retain indexers by directly minimizing language modeling loss on a calibration set, requiring no weight updates. Training-aware IndexCache introduces a multi-layer distillation loss that trains each retained indexer against the averaged attention distributions of all layers it serves, enabling even simple interleaved patterns to match full-indexer accuracy.
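The Full/Shared layer split described above can be illustrated with a small sketch. It assumes that "nearest Full layer" means the most recent preceding Full layer in the forward pass, and it uses plain score lists in place of a real indexer; `topk_indices` and `run_layers` are hypothetical names for this example.

```python
def topk_indices(scores, k):
    """Indices of the k largest scores, returned in ascending index order (stand-in indexer)."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def run_layers(per_layer_scores, full_layers, k):
    """Return the top-k indices used at each layer and the number of indexer calls made."""
    used, cached, calls = [], None, 0
    for layer, scores in enumerate(per_layer_scores):
        # Full layers recompute their selection; Shared layers reuse the cached one.
        if layer in full_layers or cached is None:
            cached = topk_indices(scores, k)
            calls += 1
        used.append(cached)
    return used, calls
```

With one Full layer for every four layers, three of four indexer calls are skipped, which corresponds to the 75% reduction in indexer computation reported for this entry; whether quality holds at that ratio depends on how similar the per-layer selections really are.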
Experimental results on a 30B DSA model show that IndexCache can remove 75% of indexer computations with negligible quality degradation, achieving up to 1.82$\times$ prefill speedup and 1.48$\times$ decode speedup compared to standard DSA. These positive results are further confirmed by our preliminary experiments on the production-scale GLM-5 model (Figure 1).
[49] CLASP: Defending Hybrid Large Language Models Against Hidden State Poisoning Attacks
Alexandre Le Mercier,Thomas Demeester,Chris Develder
Main category: cs.CL
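TL;DR: This paper proposes CLASP, which analyzes the block output embeddings (BOEs) of the Mamba state space model and uses an XGBoost classifier to detect Hidden State Poisoning Attacks (HiSPAs) at the token level. In realistic scenarios such as résumé screening it achieves high F1 scores and strong generalization at low computational cost, making it suitable for deployment.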
Details
Motivation: Hidden State Poisoning Attacks (HiSPAs) seriously threaten the security of state space models such as Mamba and their hybrid architectures, so a lightweight, efficient, and robust defense is needed.
Method: Frames HiSPA detection as token-level binary classification; extracts discriminative patterns from Mamba's block output embeddings (BOEs) and uses an XGBoost classifier to identify malicious tokens; evaluates in a realistic résumé-screening scenario and tests generalization with several cross-validation schemes.
Result: On 2,483 résumés totaling 9.5M tokens, CLASP reaches 95.9% token-level F1 and 99.3% document-level F1; document-level F1 is 96.9% under leave-one-out cross-validation and averages 91.6% with structurally novel triggers; throughput is 1,032 tokens/s with under 4GB of VRAM.
Conclusion: CLASP is an efficient, lightweight HiSPA detector that generalizes well and can act as a front-line security module for SSM-based and hybrid architectures.
Abstract: State space models (SSMs) like Mamba have gained significant traction as efficient alternatives to Transformers, achieving linear complexity while maintaining competitive performance. However, Hidden State Poisoning Attacks (HiSPAs), a recently discovered vulnerability that corrupts SSM memory through adversarial strings, pose a critical threat to these architectures and their hybrid variants. Framing the HiSPA mitigation task as a binary classification problem at the token level, we introduce the CLASP model to defend against this threat. CLASP exploits distinct patterns in Mamba's block output embeddings (BOEs) and uses an XGBoost classifier to identify malicious tokens with minimal computational overhead. We consider a realistic scenario in which both SSMs and HiSPAs are likely to be used: an LLM screening résumés to identify the best candidates for a role. Evaluated on a corpus of 2,483 résumés totaling 9.5M tokens with controlled injections, CLASP achieves 95.9% token-level F1 score and 99.3% document-level F1 score on malicious token detection. Crucially, the model generalizes to unseen attack patterns: under leave-one-out cross-validation, performance remains high (96.9% document-level F1), while under clustered cross-validation with structurally novel triggers, it maintains useful detection capability (91.6% average document-level F1).
Operating independently of any downstream model, CLASP processes 1,032 tokens per second with under 4GB VRAM consumption, potentially making it suitable for real-world deployment as a lightweight front-line defense for SSM-based and hybrid architectures. All code and detailed results are available at https://anonymous.4open.science/r/hispikes-91C0.
[50] Sparking Scientific Creativity via LLM-Driven Interdisciplinary Inspiration
Priyanka Kargupta,Shuhaib Mehri,Dilek Hakkani-Tur,Jiawei Han
Main category: cs.CL
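TL;DR: This paper proposes Idea-Catalyst, a framework that supports creative reasoning by humans and large language models through systematic interdisciplinary insight, aiming to improve interdisciplinary work in scientific discovery rather than merely automate experiment design. Starting from an abstract research goal, it avoids premature anchoring on specific solutions during brainstorming: it generalizes domain challenges into conceptual problems that can be retrieved across disciplines, brings in analogous insights from other fields (such as psychology and sociology), and recontextualizes them in the original domain. Empirically, it improves novelty by 21% and insightfulness by 16%.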
Details
Motivation: Existing AI-driven approaches to scientific discovery mostly emphasize rapid experiment design and overlook the exploratory, collaborative reasoning needed for creative interdisciplinary breakthroughs; most research remains confined to a single discipline, with little support for interdisciplinary reasoning itself. Method: Idea-Catalyst has three steps: (1) decompose the abstract research goal into core research questions in the target domain; (2) reformulate these questions as domain-agnostic conceptual problems so that solutions to analogous challenges can be retrieved from other disciplines (e.g., psychology, sociology); (3) recontextualize and synthesize the external insights back into the target domain, ranking source disciplines by their interdisciplinary potential. Result: Empirically, the framework improves average novelty by 21% and insightfulness by 16% while staying highly relevant to the original research problem. Conclusion: Idea-Catalyst formalizes metacognitive interdisciplinary reasoning. It strengthens human researchers' divergent ideation and can also help large language models integrate interdisciplinary knowledge, offering a paradigm for AI research tools that augment rather than replace researchers. Abstract: Despite interdisciplinary research leading to larger and longer-term impact, most work remains confined to single-domain academic silos. Recent AI-based approaches to scientific discovery show promise for interdisciplinary research, but many prioritize rapidly designing experiments and solutions, bypassing the exploratory, collaborative reasoning processes that drive creative interdisciplinary breakthroughs. As a result, prior efforts largely prioritize automating scientific discovery rather than augmenting the reasoning processes that underlie scientific disruption. We present Idea-Catalyst, a novel framework that systematically identifies interdisciplinary insights to support creative reasoning in both humans and large language models. Starting from an abstract research goal, Idea-Catalyst is designed to assist the brainstorming stage, explicitly avoiding premature anchoring on specific solutions. The framework embodies key metacognitive features of interdisciplinary reasoning: (a) defining and assessing research goals, (b) awareness of a domain's opportunities and unresolved challenges, and (c) strategic exploration of interdisciplinary ideas based on impact potential. Concretely, Idea-Catalyst decomposes an abstract goal (e.g., improving human-AI collaboration) into core target-domain research questions that guide the analysis of progress and open challenges within that domain. These challenges are reformulated as domain-agnostic conceptual problems, enabling retrieval from external disciplines (e.g., Psychology, Sociology) that address analogous issues.
By synthesizing and recontextualizing insights from these domains back into the target domain, Idea-Catalyst ranks source domains by their interdisciplinary potential. Empirically, this targeted integration improves average novelty by 21% and insightfulness by 16%, while remaining grounded in the original research problem.
[51] SciMDR: Benchmarking and Advancing Scientific Multimodal Document Reasoning
Ziyu Chen,Yilun Zhao,Chengye Wang,Rilyn Han,Manasi Patwardhan,Arman Cohan
Main category: cs.CL
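TL;DR: This paper proposes the synthesize-and-reground framework for building scientific multimodal document reasoning datasets, addressing the trade-off among scale, faithfulness, and realism, and uses it to build the SciMDR training dataset and the SciMDR-Eval benchmark.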
Details
Motivation: Building scientific multimodal document reasoning datasets involves an inherent trade-off among scale, faithfulness, and realism, and a new approach is needed that balances all three. Method: A two-stage synthesize-and-reground framework is proposed. Stage one, Claim-Centric QA Synthesis, generates faithful, focused QA pairs with reasoning chains; stage two, Document-Scale Regrounding, programmatically re-embeds the QA pairs into full documents to restore realistic complexity. Result: The authors build SciMDR, a large-scale training dataset with 300K QA pairs across 20K papers, and SciMDR-Eval, an expert-annotated benchmark; fine-tuned models improve significantly on multiple scientific QA benchmarks, especially on tasks requiring complex document-level reasoning. Conclusion: The synthesize-and-reground framework balances the key trade-offs in dataset construction, and SciMDR and its benchmark advance models for scientific multimodal understanding. Abstract: Constructing scientific multimodal document reasoning datasets for foundation model training involves an inherent trade-off among scale, faithfulness, and realism. To address this challenge, we introduce the synthesize-and-reground framework, a two-stage pipeline comprising: (1) Claim-Centric QA Synthesis, which generates faithful, isolated QA pairs and reasoning on focused segments, and (2) Document-Scale Regrounding, which programmatically re-embeds these pairs into full-document tasks to ensure realistic complexity. Using this framework, we construct SciMDR, a large-scale training dataset for cross-modal comprehension, comprising 300K QA pairs with explicit reasoning chains across 20K scientific papers. We further construct SciMDR-Eval, an expert-annotated benchmark to evaluate multimodal comprehension within full-length scientific workflows. Experiments demonstrate that models fine-tuned on SciMDR achieve significant improvements across multiple scientific QA benchmarks, particularly in those tasks requiring complex document-level reasoning.
cs.CV [Back]
[52] RC-NF: Robot-Conditioned Normalizing Flow for Real-Time Anomaly Detection in Robotic Manipulation
Shijie Zhou,Bin Zhu,Jiarui Yang,Xiangyu Zhao,Jingjing Chen,Yu-Gang Jiang
Main category: cs.CV
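TL;DR: This paper proposes Robot-Conditioned Normalizing Flow (RC-NF), a real-time monitoring model for robotic anomaly detection and intervention. By decoupling task-aware robot and object states and training without supervision on positive samples only, it produces accurate anomaly scores; its effectiveness and low-latency response are validated in simulation and in real-world settings.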
Details
Motivation: Existing Vision-Language-Action (VLA) models trained by imitation learning are not robust in dynamic environments or under out-of-distribution (OOD) conditions and struggle to operate reliably. Method: RC-NF is built on a conditional normalizing flow that processes the robot state and the object motion trajectory in a decoupled way; it is trained without supervision using positive samples only; at inference it uses the probability density function to compute anomaly scores in real time; the authors also build LIBERO-Anomaly-10, a simulation benchmark for anomaly evaluation. Result: RC-NF achieves state-of-the-art performance on all anomaly types in LIBERO-Anomaly-10; in real-world experiments it is integrated as a plug-and-play module into VLA models (e.g., pi0) and provides an OOD signal with under 100 ms latency, supporting state-level rollback or task-level replanning. Conclusion: RC-NF noticeably improves the robustness and adaptability of VLA-driven robotic systems in dynamic environments and provides real-time monitoring suitable for practical deployment. Abstract: Recent advances in Vision-Language-Action (VLA) models have enabled robots to execute increasingly complex tasks. However, VLA models trained through imitation learning struggle to operate reliably in dynamic environments and often fail under Out-of-Distribution (OOD) conditions. To address this issue, we propose Robot-Conditioned Normalizing Flow (RC-NF), a real-time monitoring model for robotic anomaly detection and intervention that ensures the robot's state and the object's motion trajectory align with the task. RC-NF decouples the processing of task-aware robot and object states within the normalizing flow. It requires only positive samples for unsupervised training and calculates accurate robotic anomaly scores during inference through the probability density function. We further present LIBERO-Anomaly-10, a benchmark comprising three categories of robotic anomalies for simulation evaluation. RC-NF achieves state-of-the-art performance across all anomaly types compared to previous methods in monitoring robotic tasks. Real-world experiments demonstrate that RC-NF operates as a plug-and-play module for VLA models (e.g., pi0), providing a real-time OOD signal that enables state-level rollback or task-level replanning when necessary, with a response latency under 100 ms. These results demonstrate that RC-NF noticeably enhances the robustness and adaptability of VLA-based robotic systems in dynamic environments.
[53] GGPT: Geometry Grounded Point Transformer
Yutong Chen,Yiming Wang,Xucong Zhang,Sergey Prokudin,Siyu Tang
Main category: cs.CV
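TL;DR: This paper proposes the Geometry-Grounded Point Transformer (GGPT), a framework that adds sparse geometric guidance from an improved SfM pipeline to improve the geometric consistency and fine-grained accuracy of feed-forward 3D reconstruction from sparse views.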
Details
Motivation: Existing feed-forward networks suffer from geometric inconsistency and limited fine-grained accuracy in sparse-view 3D reconstruction because they lack explicit multi-view geometric constraints. Method: First, an improved SfM pipeline based on dense feature matching and lightweight geometric optimization estimates camera poses and partial point clouds; then a geometry-guided 3D point transformer refines the dense point maps using an optimized guidance encoding. Result: Trained on ScanNet++, GGPT substantially outperforms existing feed-forward methods, showing stronger geometric consistency, spatial completeness, and gap filling in textureless regions in both in-domain and out-of-domain settings. Conclusion: GGPT offers a principled mechanism for combining geometric priors with dense feed-forward predictions, improving the quality and generalization of sparse-view 3D reconstruction. Abstract: Recent feed-forward networks have achieved remarkable progress in sparse-view 3D reconstruction by predicting dense point maps directly from RGB images. However, they often suffer from geometric inconsistencies and limited fine-grained accuracy due to the absence of explicit multi-view constraints. We introduce the Geometry-Grounded Point Transformer (GGPT), a framework that augments feed-forward reconstruction with reliable sparse geometric guidance. We first propose an improved Structure-from-Motion pipeline based on dense feature matching and lightweight geometric optimisation to efficiently estimate accurate camera poses and partial 3D point clouds from sparse input views. Building on this foundation, we propose a geometry-guided 3D point transformer that refines dense point maps under explicit partial-geometry supervision using an optimised guidance encoding. Extensive experiments demonstrate that our method provides a principled mechanism for integrating geometric priors with dense feed-forward predictions, producing reconstructions that are both geometrically consistent and spatially complete, recovering fine structures and filling gaps in textureless areas. Trained solely on ScanNet++ with VGGT predictions, GGPT generalises across architectures and datasets, substantially outperforming state-of-the-art feed-forward 3D reconstruction models in both in-domain and out-of-domain settings.
[54] Evidential learning driven Breast Tumor Segmentation with Stage-divided Vision-Language Interaction
Jingxing Zhong,Qingtao Pan,Xuchang Zhou,Jiazhen Lin,Xinguo Zhuang
Main category: cs.CV
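TL;DR: This paper proposes TextBCS, a text-guided breast tumor segmentation model that uses stage-divided vision-language interaction and evidential learning to improve segmentation accuracy in low-contrast and blurred-boundary cases.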
Details
Motivation: Existing deep-learning methods for breast tumor segmentation struggle to locate tumor contours accurately under low contrast and blurred boundaries, which motivates using text prompts to improve segmentation. Method: The TextBCS model combines stage-divided vision-language interaction (visual and text features reinforce each other at every down-sampling stage) with evidential learning (a variational Dirichlet distribution quantifies the uncertainty of boundary segmentation). Result: On publicly available datasets, TextBCS outperforms other mainstream segmentation networks on breast tumor segmentation. Conclusion: Combining text guidance with uncertainty modeling improves the robustness and accuracy of tumor segmentation in complex medical images. Abstract: Breast cancer is one of the most common causes of death among women worldwide, with millions of fatalities annually. Magnetic Resonance Imaging (MRI) can provide various sequences for characterizing tumor morphology and internal patterns, and has become an effective tool for the detection and diagnosis of breast tumors. However, previous deep-learning-based tumor segmentation methods have limitations in accurately locating tumor contours because of the low contrast between cancerous and normal areas and blurred boundaries. Leveraging text prompt information holds promise for improving tumor segmentation by delineating segmentation regions. Inspired by this, we propose a text-guided Breast Tumor Segmentation model (TextBCS) with stage-divided vision-language interaction and evidential learning. Specifically, the proposed stage-divided vision-language interaction facilitates mutual information exchange between visual and text features at each stage of down-sampling, further exploiting the advantages of text prompts to assist in locating lesion areas in low-contrast scenarios. Moreover, evidential learning is adopted to quantify the segmentation uncertainty of the model at blurred boundaries. It utilizes a variational Dirichlet distribution to characterize the distribution of the segmentation probabilities, addressing the segmentation uncertainties of the boundaries. Extensive experiments validate the superiority of our TextBCS over other segmentation networks, showcasing the best breast tumor segmentation performance on publicly available datasets.
[55] A Simple Efficiency Incremental Learning Framework via Vision-Language Model with Nonlinear Multi-Adapters
Haihua Luo,Xuming Ran,Jiangrong Shen,Timo Hämäläinen,Zhonghua Chen,Qi Xu,Fengyu Cong
Main category: cs.CV
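TL;DR: This paper proposes SimE, a simple and efficient incremental learning framework built on a vision-language model (such as CLIP) with adapters. It finds a nonlinear relationship between the number of adapter connections and incremental learning ability, and it clearly outperforms existing methods on TinyImageNet and CIFAR-100.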
Details
Motivation: Existing incremental learning methods based on pre-trained vision-language models have three problems: low training efficiency, reliance on a memory bank, and the need for a strong backbone. Method: The SimE framework uses a vision-language model with task-specific adapters; it systematically analyzes how the number of adapter connections between and within transformer blocks affects incremental learning performance; it also explores replacing the encoder with a CLIP model trained on larger datasets (e.g., LAION2B) and stronger architectures (e.g., ViT-L/14) to strengthen zero-shot ability. Result: SimE improves on traditional methods by 9.6% on TinyImageNet and on other CLIP-based methods by 5.3% on CIFAR-100; the relationship between adapter connections and IL ability is nonlinear: more connections between blocks help, while too many within blocks hurt performance. Conclusion: Adapter design is critical when using vision-language models for incremental learning; a well-chosen adapter configuration improves efficiency and performance without a memory bank or complex training strategy, and a stronger CLIP encoder can further unlock zero-shot potential. Abstract: Incremental Learning (IL) aims to learn new tasks while preserving previously acquired knowledge. Integrating the zero-shot learning capabilities of pre-trained vision-language models into IL methods has marked a significant advancement. However, these methods face three primary challenges: (1) the need for improved training efficiency; (2) reliance on a memory bank to store previous data; and (3) the necessity of a strong backbone to augment the model's capabilities. In this paper, we propose SimE, a Simple and Efficient framework that employs a vision-language model with adapters designed specifically for the IL task. We report a remarkable phenomenon: there is a nonlinear correlation between the number of adaptive adapter connections and the model's IL capabilities. While increasing adapter connections between transformer blocks improves model performance, adding more adaptive connections within transformer blocks during smaller incremental steps does not enhance, and may even degrade the model's IL ability. Extensive experimental results show that SimE surpasses traditional methods by 9.6% on TinyImageNet and outperforms other CLIP-based methods by 5.3% on CIFAR-100. Furthermore, we conduct a systematic study to enhance the utilization of the zero-shot capabilities of CLIP. We suggest replacing SimE's encoder with a CLIP model trained on larger datasets (e.g., LAION2B) and stronger architectures (e.g., ViT-L/14).
[56] Senna-2: Aligning VLM and End-to-End Driving Policy for Consistent Decision Making and Planning
Yuehao Song,Shaoyu Chen,Hao Gao,Yifan Zhu,Weixiang Yue,Jialv Zou,Bo Jiang,Zihao Lu,Yu Wang,Qian Zhang,Xinggang Wang
Main category: cs.CV
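TL;DR: This paper proposes Senna-2, a driving policy that explicitly aligns the high-level decisions of a vision-language model (VLM) with the low-level planning of an end-to-end (E2E) policy through a three-stage consistency-oriented training paradigm, improving dual-system consistency and driving safety.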
Details
Motivation: Existing VLM-E2E driving policies overlook the dual-system consistency between the VLM's high-level decisions and the E2E policy's low-level planning, so trajectories can mismatch decisions, which weakens top-down guidance and decision following. Method: Senna-2 uses a consistency-oriented three-stage training scheme: stage one is driving pre-training, in which a decision adapter passes VLM decisions to the E2E policy as implicit embeddings; stage two aligns the VLM and the E2E policy in an open-loop setting; stage three performs closed-loop alignment in 3DGS environments via bottom-up hierarchical reinforcement learning to reinforce safety and efficiency. Result: Senna-2 improves the dual-system consistency F1 score by 19.3%, reduces FDE by 5.7% in the open-loop setting, and reduces AF-CR by 30.6% in the closed-loop setting. Conclusion: By explicitly aligning the VLM with the E2E policy, Senna-2 improves decision-planning consistency and overall safety, which supports the importance of modeling dual-system consistency. Abstract: Vision-language models (VLMs) enhance the planning capability of end-to-end (E2E) driving policy by leveraging high-level semantic reasoning. However, existing approaches often overlook the dual-system consistency between VLM's high-level decision and E2E's low-level planning. As a result, the generated trajectories may misalign with the intended driving decisions, leading to weakened top-down guidance and decision-following ability of the system. To address this issue, we propose Senna-2, an advanced VLM-E2E driving policy that explicitly aligns the two systems for consistent decision-making and planning. Our method follows a consistency-oriented three-stage training paradigm. In the first stage, we conduct driving pre-training to achieve preliminary decision-making and planning, with a decision adapter transmitting VLM decisions to E2E policy in the form of implicit embeddings. In the second stage, we align the VLM and the E2E policy in an open-loop setting. In the third stage, we perform closed-loop alignment via bottom-up Hierarchical Reinforcement Learning in 3DGS environments to reinforce safety and efficiency. Extensive experiments demonstrate that Senna-2 achieves superior dual-system consistency (19.3% F1 score improvement) and significantly enhances driving safety in both open-loop (5.7% FDE reduction) and closed-loop settings (30.6% AF-CR reduction).
[57] Frequency-Modulated Visual Restoration for Matryoshka Large Multimodal Models
Qingtao Pan,Zhihao Dou,Shuo Li
Main category: cs.CV
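TL;DR: This paper proposes FMVR, a frequency-modulated visual restoration strategy that preserves and strengthens visual semantics while reducing the number of visual tokens, substantially cutting compute with almost no loss of accuracy.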
Details
Motivation: Large multimodal models (LMMs) struggle to adapt to different computational budgets because they use so many visual tokens, and existing compression methods inevitably lose visual semantics. Method: FMVR decomposes the representation of the reduced visual tokens into a low-frequency component (MaxPool) and a high-frequency component (AvgPool), each modulated by lightweight learnable parameters; the high-frequency component acts as a saliency filter that strengthens key semantics, and the low-frequency component acts as an anti-saliency filter that strengthens weak semantics; combined with Matryoshka Representation Learning, the number of tokens can be adjusted elastically at inference time. Result: On 10 image and 4 video benchmarks, FMVR-LLaVA reduces the FLOPs of LLaVA-1.5-7B by 89% while keeping almost 100% of the original accuracy. Conclusion: FMVR is a plug-and-play, very simple method for visual token compression and semantic restoration that combines efficiency with strong performance and suits settings with varying compute budgets. Abstract: Large Multimodal Models (LMMs) struggle to adapt to varying computational budgets due to numerous visual tokens. Previous methods attempted to reduce the number of visual tokens before or within LLMs. However, these strategies inevitably result in the loss of visual semantics. To address these issues, we introduce FMVR, a plug-and-play and extremely simple Frequency-Modulated Visual Restoration strategy to boost the reasoning ability of LMMs under visual token reduction. Specifically, FMVR disentangles the visual representation of fewer visual tokens into low- and high-frequency components through AvgPool and MaxPool. The derived frequencies are subsequently modulated using lightweight learnable parameters. The high-frequency component from AvgPool acts as a saliency filter to enhance salient visual semantics, while the low-frequency component from MaxPool acts as an anti-saliency filter to strengthen weak visual semantics. It enables the preservation of visual semantics dominated by few visual tokens and the restoration of diluted visual semantics. Additionally, we inject FMVR into Matryoshka Representation Learning to learn coarse-to-fine visual token sets, enabling elastic adjustment of the number of visual tokens during inference while maintaining comparable performance. Experiments across 10 image-based and 4 video-based benchmarks demonstrate that FMVR-LLaVA reduces the FLOPs of LLaVA-1.5-7B by 89%, while maintaining almost 100% of the original accuracy. The code will be released.
[58] When Slots Compete: Slot Merging in Object-Centric Learning
Christos Chatzisavvas,Panagiotis Rigas,George Ioannakis,Vassilis Katsouros,Nikolaos Mitianoudis
Main category: cs.CV
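TL;DR: This paper proposes slot merging, a lightweight operation that merges overlapping slots in slot-based object-centric learning, improving object factorization and segmentation quality.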
Details
Motivation: Existing slot-based methods usually use a fixed number of slots, so several slots compete for overlapping regions of the same entity instead of focusing on distinct regions. Method: The slot merging operation measures overlap between slot-attention maps with a Soft-IoU score and merges selected slot pairs with a barycentric update; the merging policy infers its threshold from overlap statistics and needs no additional learnable modules. Result: Integrated into the DINOSAUR feature-reconstruction pipeline, the method outperforms other adaptive methods on object discovery and segmentation benchmarks, improving object factorization and mask quality. Conclusion: Slot merging is a drop-in improvement with no extra parameters that reduces competition between overlapping slots and yields better-disentangled, more practical object-centric representations. Abstract: Slot-based object-centric learning represents an image as a set of latent slots with a decoder that combines them into an image or features. The decoder specifies how slots are combined into an output, but the slot set is typically fixed: the number of slots is chosen upfront and slots are only refined. This can lead to multiple slots competing for overlapping regions of the same entity rather than focusing on distinct regions. We introduce slot merging: a drop-in, lightweight operation on the slot set that merges overlapping slots during training. We quantify overlap with a Soft-IoU score between slot-attention maps and combine selected pairs via a barycentric update that preserves gradient flow. Merging follows a fixed policy, with the decision threshold inferred from overlap statistics, requiring no additional learnable modules. Integrated into the established feature-reconstruction pipeline of DINOSAUR, the proposed method improves object factorization and mask quality, surpassing other adaptive methods in object discovery and segmentation benchmarks.
[59] Radiometric fingerprinting of object surfaces using mobile laser scanning and semantic 3D road space models
Benedikt Schwab,Thomas H. Kolbe
Main category: cs.CV
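TL;DR: This paper proposes a method that combines multi-source mobile LiDAR scans with a semantic 3D city model to extract "radiometric fingerprints" of object surfaces and infer material properties, and validates it on the A2D2 dataset.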
Details
Motivation: Existing semantic 3D city models lack material information, while multi-temporal, multi-sensor mobile LiDAR data carries rich information about surface reflectance; a structured method is needed to fuse the two and strengthen the analytical capabilities of digital twins. Method: Using a CityGML 3.0 LOD3 semantic city model, about 312 million laser beams from 5 LiDAR sensors and 4 scanning campaigns are automatically associated with 6368 semantic objects; reflection intensities are normalized and grouped across distances, incidence angles, environmental conditions, and sensors to build a radiometric fingerprint for each object surface. Result: The extracted fingerprints show consistent intra-class patterns and reveal distinguishable reflectance signatures of the dominant materials of different semantic classes (e.g., asphalt roads, concrete sidewalks, glass facades); the semantic model, the implementations, and the new geodatabase 3DSensorDB are released as open source. Conclusion: Radiometric fingerprints offer a feasible way to add a material dimension to semantic 3D city models, extending the use of urban digital twins in energy modeling, thermal analysis, and sustainability assessment. Abstract: Although semantic 3D city models are internationally available and becoming increasingly detailed, the incorporation of material information remains largely untapped. However, a structured representation of materials and their physical properties could substantially broaden the application spectrum and analytical capabilities for urban digital twins. At the same time, the growing number of repeated mobile laser scans of cities and their street spaces yields a wealth of observations influenced by the material characteristics of the corresponding surfaces. To leverage this information, we propose radiometric fingerprints of object surfaces by grouping LiDAR observations reflected from the same semantic object under varying distances, incident angles, environmental conditions, sensors, and scanning campaigns. Our study demonstrates how 312.4 million individual beams acquired across four campaigns using five LiDAR sensors on the Audi Autonomous Driving Dataset (A2D2) vehicle can be automatically associated with 6368 individual objects of the semantic 3D city model. The model comprises a comprehensive and semantic representation of four inner-city streets at Level of Detail (LOD) 3 with centimeter-level accuracy. It is based on the CityGML 3.0 standard and enables fine-grained sub-differentiation of objects. The extracted radiometric fingerprints for object surfaces reveal recurring intra-class patterns that indicate class-dominant materials.
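As a rough illustration of the grouping idea, the sketch below collects LiDAR returns per semantic object and summarises intensity per incidence-angle bin. The tuple layout, the 15-degree bin width, the per-bin mean, and all names are hypothetical choices for this example; they are not the authors' implementation, which also accounts for distance, environmental conditions, sensors, and campaigns.

```python
from collections import defaultdict
from statistics import mean


def radiometric_fingerprints(observations, angle_bin_deg=15.0):
    """Group LiDAR returns per object and average intensity per incidence-angle bin.

    observations: iterable of (object_id, incidence_angle_deg, intensity) tuples
                  (a hypothetical layout chosen for this sketch).
    Returns {object_id: {bin_start_deg: mean_intensity}}.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for object_id, angle_deg, intensity in observations:
        # Lower edge of the angle bin that this return falls into.
        bin_start = int(angle_deg // angle_bin_deg) * angle_bin_deg
        grouped[object_id][bin_start].append(intensity)
    return {
        obj: {b: mean(vals) for b, vals in sorted(bins.items())}
        for obj, bins in grouped.items()
    }


if __name__ == "__main__":
    obs = [
        ("road_1", 5.0, 0.25),
        ("road_1", 10.0, 0.75),
        ("road_1", 20.0, 0.125),
        ("window_7", 40.0, 1.0),
    ]
    print(radiometric_fingerprints(obs))
```

Comparing such per-object curves across objects of the same class is one simple way to look for the recurring intra-class patterns the abstract mentions.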
The semantic model, the method implementations, and the developed geodatabase solution 3DSensorDB are released under: https://github.com/tum-gis/sensordb
[60] Towards Automated Initial Probe Placement in Transthoracic Teleultrasound Using Human Mesh and Skeleton Recovery
Yu Chung Lee,David G. Black,Ryan S. Yeung,Septimiu E. Salcudean
Main category: cs.CV
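TL;DR: This paper proposes an RGB-image-based framework for automated patient registration and anatomy-informed initial probe placement guidance (PIPG) to assist cardiac and lung ultrasound examinations, especially in teleultrasound settings.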
Details
Motivation: Cardiac and lung ultrasound is technically demanding, especially in teleultrasound, where a novice or robot has no expert on site and struggles to locate the intercostal acoustic windows and navigate to standard views. Method: A mixed reality (MR) head-mounted display captures RGB images of the patient; an edge server reconstructs a patient-specific body-surface and skeleton model, locates the intercostal region from predicted bony landmarks, and projects the probe guidance pose back onto the body surface; the result is validated with real-time MR visualization. Result: Pilot experiments with healthy volunteers indicate that the method gives consistent initial probe placement within an anatomically acceptable range, with quantitative placement error adequate for teleultrasound setup. Conclusion: The PIPG framework achieves anatomy-aware automatic initial probe positioning from RGB images alone, offering a feasible, low-barrier path for teleultrasound and robot-assisted ultrasound. Abstract: Cardiac and lung ultrasound are technically demanding because operators must identify patient-specific intercostal acoustic windows and then navigate between standard views by adjusting probe position, rotation, and force across different imaging planes. These challenges are amplified in teleultrasound when a novice or robot faces the difficult task of first placing the probe on the patient without in-person expert assistance. We present a framework for automating Patient registration and anatomy-informed Initial Probe placement Guidance (PIPG) using only RGB images from a calibrated camera. The novice first captures the patient using the camera on a mixed reality (MR) head-mounted display (HMD). An edge server then infers a patient-specific body-surface and skeleton model, with spatial smoothing across multiple views. Using bony landmarks from the predicted skeleton, we estimate the intercostal region and project the guidance back onto the reconstructed body surface. To validate the framework, we overlaid the reconstructed body mesh and the virtual probe pose guidance across multiple transthoracic echocardiography scan planes in situ and measured the quantitative placement error. Pilot experiments with healthy volunteers suggest that the proposed probe placement prediction and MR guidance yield consistent initial placement within anatomical variability acceptable for teleultrasound setup.
[61] InstantHDR: Single-forward Gaussian Splatting for High Dynamic Range 3D Reconstruction
Dingqiang Ye,Jiacong Xu,Jianglu Ping,Yuxiang Guo,Chao Fan,Vishal M. Patel
Main category: cs.CV
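TL;DR: InstantHDR is a feed-forward network that reconstructs 3D HDR scenes from uncalibrated multi-exposure LDR image collections in a single forward pass, without known camera poses or dense point cloud initialization, giving a large speedup while keeping high synthesis quality.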
Details
Motivation: Existing HDR novel view synthesis methods rely on known camera poses, dense point cloud initialization, and slow per-scene optimization, while existing feed-forward methods ignore HDR by assuming appearance is independent of exposure; a method is needed that is both efficient and able to model HDR. Method: InstantHDR combines (1) geometry-guided multi-exposure appearance modeling for fusion, (2) a meta-network for generalizable scene-specific tone mapping, and (3) HDR-Pretrain, a pre-training dataset of 168 Blender-rendered scenes covering diverse lighting and camera response functions. Result: InstantHDR matches the synthesis quality of state-of-the-art optimization-based methods while being about 700x faster in the single-forward setting and about 20x faster with post-optimization; despite the lack of real HDR scene data, the self-built pre-training dataset gives good generalization. Conclusion: InstantHDR provides efficient, general, end-to-end feed-forward HDR novel view synthesis that removes the dependence on poses, initialization, and per-scene optimization, offering a new approach to real-time HDR 3D reconstruction. Abstract: High dynamic range (HDR) novel view synthesis (NVS) aims to reconstruct HDR scenes from multi-exposure low dynamic range (LDR) images. Existing HDR pipelines heavily rely on known camera poses, well-initialized dense point clouds, and time-consuming per-scene optimization. Current feed-forward alternatives overlook the HDR problem by assuming exposure-invariant appearance. To bridge this gap, we propose InstantHDR, a feed-forward network that reconstructs 3D HDR scenes from uncalibrated multi-exposure LDR collections in a single forward pass. Specifically, we design a geometry-guided appearance modeling for multi-exposure fusion, and a meta-network for generalizable scene-specific tone mapping. Due to the lack of HDR scene data, we build a pre-training dataset, called HDR-Pretrain, for generalizable feed-forward HDR models, featuring 168 Blender-rendered scenes, diverse lighting types, and multiple camera response functions. Comprehensive experiments show that our InstantHDR delivers comparable synthesis performance to the state-of-the-art optimization-based HDR methods while enjoying $\sim700\times$ and $\sim20\times$ reconstruction speed improvement with our single-forward and post-optimization settings. All code, models, and datasets will be released after the review process.
[62] Hierarchical Granularity Alignment and State Space Modeling for Robust Multimodal AU Detection in the Wild
Jun Yu,Yunxiang Zhang,Naixiang Zheng,Lingsi Zhu,Guoyuan Wang
Main category: cs.CV
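TL;DR: This paper proposes a multimodal framework based on hierarchical granularity alignment and state space models for facial action unit (AU) detection in the wild. It uses foundation models such as DINOv2 and WavLM to extract high-quality audio-visual features, dynamically aligns global semantics with local detail, models long temporal dependencies with Vision-Mamba, and synchronizes audio and vision through an asymmetric cross-modal attention mechanism. It reaches state-of-the-art performance on Aff-Wild2 and placed at the top of the AU track of the ABAW10 competition.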
Details
Motivation: Facial action unit (AU) detection in the wild faces severe spatial-temporal heterogeneity, unconstrained poses, and complex audio-visual dependencies; existing multimodal methods are limited by encoder capacity and shallow fusion and fail to capture fine-grained semantic shifts and very long temporal context. Method: The proposed multimodal framework (1) uses DINOv2 and WavLM as robust visual and audio feature extractors; (2) adds a hierarchical granularity alignment module that dynamically aligns global facial semantics with locally active regions; (3) replaces conventional TCNs with a Vision-Mamba architecture for long-range temporal modeling with O(N) linear complexity; (4) introduces an asymmetric cross-attention mechanism that strengthens the synchronization of paralinguistic audio cues with subtle visual movements. Result: The method significantly outperforms existing baselines on the challenging Aff-Wild2 dataset, reaching state-of-the-art performance, and placed at the top of the AU Detection track of the 10th Affective Behavior Analysis in-the-wild competition (ABAW10). Conclusion: The framework addresses key bottlenecks of in-the-wild AU detection and supports combining strong foundation models, hierarchical semantic alignment, and efficient state space modeling for multimodal affective understanding. Abstract: Facial Action Unit (AU) detection in in-the-wild environments remains a formidable challenge due to severe spatial-temporal heterogeneity, unconstrained poses, and complex audio-visual dependencies. While recent multimodal approaches have made progress, they often rely on capacity-limited encoders and shallow fusion mechanisms that fail to capture fine-grained semantic shifts and ultra-long temporal contexts. To bridge this gap, we propose a novel multimodal framework driven by Hierarchical Granularity Alignment and State Space Models. Specifically, we leverage powerful foundation models, namely DINOv2 and WavLM, to extract robust and high-fidelity visual and audio representations, effectively replacing traditional feature extractors. To handle extreme facial variations, our Hierarchical Granularity Alignment module dynamically aligns global facial semantics with fine-grained local active patches. Furthermore, we overcome the receptive field limitations of conventional temporal convolutional networks by introducing a Vision-Mamba architecture. This approach enables temporal modeling with O(N) linear complexity, effectively capturing ultra-long-range dynamics without performance degradation. A novel asymmetric cross-attention mechanism is also introduced to deeply synchronize paralinguistic audio cues with subtle visual movements. Extensive experiments on the challenging Aff-Wild2 dataset demonstrate that our approach significantly outperforms existing baselines, achieving state-of-the-art performance.
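To make the O(N) claim concrete, here is a minimal sketch of a linear state space recurrence processed in a single pass. It is a fixed-coefficient scalar toy model, not the Vision-Mamba architecture (which uses learned, input-dependent parameters and vector states); the function name and the default coefficients are assumptions for illustration only.

```python
def linear_state_space_scan(xs, a=0.9, b=0.1, c=1.0):
    """Run h_t = a * h_{t-1} + b * x_t and y_t = c * h_t over a sequence.

    Each step costs a constant amount of work, so a sequence of length N
    is processed in O(N) time while the state h carries long-range context.
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x  # state update: decay old context, mix in the new input
        ys.append(c * h)   # readout for this time step
    return ys


if __name__ == "__main__":
    # An impulse at t = 0 decays geometrically through the state.
    print(linear_state_space_scan([1.0, 0.0, 0.0], a=0.5, b=1.0))
```

By contrast, full self-attention compares every time step with every other one, which is why its cost grows quadratically with sequence length.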
Notably, this framework secured top rankings in the AU Detection track of the 10th Affective Behavior Analysis in-the-wild Competition.
[63] UniCompress: Token Compression for Unified Vision-Language Understanding and Generation
Ziyao Wang,Chen Chen,Jingtao Li,Weiming Zhuang,Jiabo Huang,Ang Li,Lingjuan Lyu
Main category: cs.CV
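TL;DR: This paper proposes UniCompress, a unified visual token compression algorithm. Its compression and decompression mechanism, guided by learnable global meta tokens, greatly reduces the number of visual tokens while preserving performance on image understanding and generation, which suits resource-constrained settings.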
Details
Motivation: Unified multimodal models need many visual tokens, which causes high compute and memory overhead and makes them hard to deploy in resource-constrained settings such as embodied AI systems. Method: UniCompress introduces a lightweight, modular plug-in compression and decompression mechanism guided by learnable global meta tokens, which can be integrated into existing unified models without full retraining. Result: Visual tokens are reduced by up to 4x, with clearly lower inference latency and training cost and only minor performance degradation. Conclusion: UniCompress shows that token-efficient unified multimodal modeling is feasible and promising for real-world applications. Abstract: Unified models aim to support both understanding and generation by encoding images into discrete tokens and processing them alongside text within a single autoregressive framework. This unified design offers architectural simplicity and cross-modal synergy, which facilitates shared parameterization, consistent training objectives, and seamless transfer between modalities. However, the large number of visual tokens required by such models introduces substantial computation and memory overhead, and this inefficiency directly hinders deployment in resource-constrained scenarios such as embodied AI systems. In this work, we propose a unified token compression algorithm, UniCompress, that significantly reduces visual token count while preserving performance on both image understanding and generation tasks. Our method introduces a plug-in compression and decompression mechanism guided by learnable global meta tokens. The framework is lightweight and modular, enabling efficient integration into existing models without full retraining. Experimental results show that our approach reduces image tokens by up to 4 times, achieves substantial gains in inference latency and training cost, and incurs only minimal performance degradation, which demonstrates the promise of token-efficient unified modeling for real-world multimodal applications.
[64] UNet-AF: An alias-free UNet for image restoration
Jérémy Scanvic,Quentin Barthélemy,Julián Tachella
Main category: cs.CV
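TL;DR: This paper proposes an alias-free UNet architecture that uses strongly translation-equivariant layers to improve the model's equivariance to translations. On image restoration tasks it performs on par with baseline models while showing much higher measured equivariance.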
Details
Motivation: UNets are widely assumed to be translation-equivariant, but their traditional layers are prone to aliasing, which harms equivariance in practice. Method: An alias-free UNet is built from a careful selection of state-of-the-art translation-equivariant layers. Result: On image restoration tasks the new architecture performs on par with non-equivariant baselines while its measured equivariance is significantly higher; ablation studies show that each change is crucial for equivariance. Conclusion: Alias-free design improves the practical translation equivariance of UNets without sacrificing performance, offering a practical route for equivariant deep learning. Abstract: The simplicity and effectiveness of the UNet architecture make it ubiquitous in image restoration, image segmentation, and diffusion models. UNets are often assumed to be equivariant to translations, yet they traditionally consist of layers that are known to be prone to aliasing, which hinders their equivariance in practice. To overcome this limitation, we propose a new alias-free UNet designed from a careful selection of state-of-the-art translation-equivariant layers. We evaluate the proposed equivariant architecture against non-equivariant baselines on image restoration tasks and observe competitive performance with a significant increase in measured equivariance. Through extensive ablation studies, we also demonstrate that each change is crucial for its empirical equivariance. Our implementation is available at https://github.com/jscanvic/UNet-AF
[65] Towards Trustworthy Selective Generation: Reliability-Guided Diffusion for Ultra-Low-Field to High-Field MRI Synthesis
Zhenxuan Zhang,Peiyuan Jing,Ruicheng Yuan,Liwei Hu,Anbang Wang,Fanwen Wang,Yinzhe Wu,Kh Tohidul Islam,Zhaolin Chen,Zi Wang,Peter Lally,Guang Yang
Main category: cs.CV
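TL;DR: This paper proposes ReDiff, a reliability-aware diffusion model for low-field to high-field MRI synthesis that uses reliability-guided sampling and uncertainty-aware multi-candidate selection to improve structural fidelity and reduce anatomically inconsistent artifacts.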
Details
Motivation: Existing diffusion models for low-field to high-field MRI synthesis struggle to balance detail recovery with structural fidelity and tend to produce anatomically inconsistent artifacts (such as spurious edges and abnormal textures) in structurally ambiguous regions, which affects downstream quantitative analysis and clinical trust. Method: The ReDiff framework combines (1) a reliability-guided sampling strategy that suppresses unreliable responses during denoising and (2) an uncertainty-aware multi-candidate selection scheme that improves the spatial reliability and anatomical consistency of the final prediction. Result: Experiments on multi-center MRI datasets show that ReDiff improves structural fidelity and reduces artifacts compared with state-of-the-art methods. Conclusion: By modeling reliability at both the sampling and post-generation stages, ReDiff produces more robust and anatomically consistent MRI synthesis, pointing toward clinically trustworthy generative models. Abstract: Low-field to high-field MRI synthesis has emerged as a cost-effective strategy to enhance image quality under hardware and acquisition constraints, particularly in scenarios where access to high-field scanners is limited or impractical. Despite recent progress in diffusion models, diffusion-based approaches often struggle to balance fine-detail recovery and structural fidelity. In particular, the uncontrolled generation of high-resolution details in structurally ambiguous regions may introduce anatomically inconsistent patterns, such as spurious edges or artificial texture variations. These artifacts can bias downstream quantitative analysis. For example, they may cause inaccurate tissue boundary delineation or erroneous volumetric estimation, ultimately reducing clinical trust in synthesized images. These limitations highlight the need for generative models that are not only visually accurate but also spatially reliable and anatomically consistent. To address this issue, we propose a reliability-aware diffusion framework (ReDiff) that improves synthesis robustness at both the sampling and post-generation stages. Specifically, we introduce a reliability-guided sampling strategy to suppress unreliable responses during the denoising process. We further develop an uncertainty-aware multi-candidate selection scheme to enhance the reliability of the final prediction. Experiments on multi-center MRI datasets demonstrate improved structural fidelity and reduced artifacts compared with state-of-the-art methods.
[66] Learning to Assist: Physics-Grounded Human-Human Control via Multi-Agent Reinforcement Learning
Yuto Shibata,Kashu Yamazaki,Lalit Jayanti,Yoshimitsu Aoki,Mariko Isogawa,Katerina Fragkiadaki
Main category: cs.CV
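TL;DR: This paper proposes AssistMimic, which formulates the imitation of assistive human-human interaction motions as a multi-agent reinforcement learning problem and jointly trains supporter and recipient policies in a physics simulator; with partner-policy initialization, dynamic reference retargeting, and a contact-promoting reward, it is the first method to successfully track assistive interaction motions on established benchmarks.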
Details
Motivation: Existing general motion tracking methods struggle in assistive scenarios, which require continuous awareness of a human partner's posture and dynamics and rapid adaptation to them. Method: Formulate the imitation of assistive human-human interaction motions as a multi-agent reinforcement learning problem, and introduce a partner-policy initialization scheme, a dynamic reference retargeting mechanism, and a contact-promoting reward. Result: AssistMimic is the first method to successfully track assistive interaction motions on established benchmarks, demonstrating the effectiveness of multi-agent RL for physically grounded, socially aware humanoid control. Conclusion: A multi-agent RL framework combined with physics-guided reward design and policy initialization can improve the real-time adaptability and physical plausibility of humanoid robots in close human interaction tasks. Abstract: Humanoid robotics has strong potential to transform daily service and caregiving applications. Although recent advances in general motion tracking within physics engines (GMT) have enabled virtual characters and humanoid robots to reproduce a broad range of human motions, these behaviors are primarily limited to contact-less social interactions or isolated movements. Assistive scenarios, by contrast, require continuous awareness of a human partner and rapid adaptation to their evolving posture and dynamics. In this paper, we formulate the imitation of closely interacting, force-exchanging human-human motion sequences as a multi-agent reinforcement learning problem. We jointly train partner-aware policies for both the supporter (assistant) agent and the recipient agent in a physics simulator to track assistive motion references. To make this problem tractable, we introduce a partner-policy initialization scheme that transfers priors from single-human motion-tracking controllers, greatly improving exploration. We further propose dynamic reference retargeting and a contact-promoting reward, which adapt the assistant's reference motion to the recipient's real-time pose and encourage physically meaningful support. We show that AssistMimic is the first method capable of successfully tracking assistive interaction motions on established benchmarks, demonstrating the benefits of a multi-agent RL formulation for physically grounded and socially aware humanoid control.
[67] DriveXQA: Cross-modal Visual Question Answering for Adverse Driving Scene Understanding
Mingzhe Tao,Ruiping Liu,Junwei Zheng,Yufan Chen,Kedi Ying,M. Saquib Sarfraz,Kailun Yang,Jiaming Zhang,Rainer Stiefelhagen
Main category: cs.CV
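TL;DR: This paper presents the DriveXQA multimodal dataset and the MVX-LLM model for understanding adverse driving scenes in autonomous driving; a Dual Cross-Attention mechanism fuses multiple visual modalities and markedly improves performance under challenging conditions such as bad weather.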
Details
Motivation: How MLLMs can use multi-sensor information to understand adverse driving scenes in autonomous driving remains underexplored. Method: Build DriveXQA, a multimodal dataset with 102,505 QA pairs, and design MVX-LLM, which uses a Dual Cross-Attention (DCA) projector to fuse multiple visual modalities. Result: DCA markedly improves performance under challenging conditions such as fog (GPTScore: 53.5 vs. 25.1 for the baseline). Conclusion: DriveXQA and MVX-LLM provide a new benchmark and an effective method for multimodal autonomous-driving understanding; code and data will be released. Abstract: Fusing sensors with complementary modalities is crucial for maintaining a stable and comprehensive understanding of abnormal driving scenes. However, Multimodal Large Language Models (MLLMs) are underexplored for leveraging multi-sensor information to understand adverse driving scenarios in autonomous vehicles. To address this gap, we propose DriveXQA, a multimodal dataset for autonomous driving VQA. In addition to four visual modalities, five sensor failure cases, and five weather conditions, it includes 102,505 QA pairs categorized into three types: global scene level, allocentric level, and ego-vehicle centric level. Since no existing MLLM framework adopts multiple complementary visual modalities as input, we design MVX-LLM, a token-efficient architecture with a Dual Cross-Attention (DCA) projector that fuses the modalities to alleviate information redundancy. Experiments demonstrate that our DCA achieves improved performance under challenging conditions such as fog (GPTScore: 53.5 vs. 25.1 for the baseline). The established dataset and source code will be made publicly available.
[68] High-Precision 6DOF Pose Estimation via Global Phase Retrieval in Fringe Projection Profilometry for 3D Mapping
Sehoon Tak,Keunhee Cho,Sangpil Kim,Jae-Sang Hyun
Main category: cs.CV
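TL;DR: This paper proposes a high-precision pose estimation method that adds a fixed, intrinsically calibrated global projector to a moving DFP system; using the projector's phase-derived pixel constraints and a PnP-style reprojection objective, it estimates the DFP system pose in a fixed reference frame without deterministic feature extraction, and experiments demonstrate its sampling invariance.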
Details
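Motivation: Digital fringe projection (DFP) achieves micrometer-level 3D reconstruction, but large-scale mapping is limited by insufficient six-degree-of-freedom pose precision. Conventional ICP registration is inefficient on multi-million-point clouds and relies on downsampling or feature selection, which loses detail and degrades pose precision, and existing drift-correction methods do not address the sampling sensitivity of dense DFP point clouds. Method: Introduce a fixed global projector with known intrinsics, use its phase information to generate pixel-level geometric constraints, and build a PnP-style reprojection objective that directly optimizes the moving DFP system's pose in a fixed reference frame, avoiding explicit feature extraction; sampling invariance is verified under coordinate-preserving subsampling. Result: Experiments show sub-millimeter pose accuracy (with quantified uncertainty bounds), high repeatability under aggressive subsampling, robustness on homogeneous surfaces and low-overlap views, and reduced error accumulation in ICP trajectories. Conclusion: The method moves DFP toward high-precision, large-scale 3D mapping in quasi-static scenarios such as inspection and metrology, at the cost of additional time-multiplexed acquisition of global-projector data. Abstract: Digital fringe projection (DFP) enables micrometer-level 3D reconstruction, yet extending it to large-scale mapping remains challenging because six-degree-of-freedom pose estimation often cannot match the reconstruction's precision. Conventional iterative closest point (ICP) registration becomes inefficient on multi-million-point clouds and typically relies on downsampling or feature-based selection, which can reduce local detail and degrade pose precision. Drift-correction methods improve long-term consistency but do not resolve sampling sensitivity in dense DFP point clouds. We propose a high-precision pose estimation method that augments a moving DFP system with a fixed, intrinsically calibrated global projector. Using the global projector's phase-derived pixel constraints and a PnP-style reprojection objective, the method estimates the DFP system pose in a fixed reference frame without relying on deterministic feature extraction, and we experimentally demonstrate sampling invariance under coordinate-preserving subsampling. Experiments demonstrate sub-millimeter pose accuracy against a reference with quantified uncertainty bounds, high repeatability under aggressive subsampling, robust operation on homogeneous surfaces and low-overlap views, and reduced error accumulation when used to correct ICP-based trajectories.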
The method extends DFP toward accurate 3D mapping in quasi-static scenarios such as inspection and metrology, with the trade-off of time-multiplexed acquisition for the additional projector measurements.
[69] DeepHistoViT: An Interpretable Vision Transformer Framework for Histopathological Cancer Classification
Ravi Mosalpuri,Mohammed Abdelsamea,Ahmed Karam Eldaly
Main category: cs.CV
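TL;DR: This paper proposes DeepHistoViT, an interpretable Vision Transformer framework for automated classification of histopathological images, which reaches close to 100% on multiple metrics on lung cancer, colon cancer, and acute lymphoblastic leukaemia datasets.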
Details
Motivation: Conventional histopathological diagnosis is time-consuming, labour-intensive, and subject to inter-observer variability, creating a need for reliable, interpretable computer-assisted diagnostic tools. Method: DeepHistoViT is a customized Vision Transformer architecture with an integrated attention mechanism that captures fine-grained cellular structures and provides interpretability through attention-based localization of diagnostically relevant regions. Result: It achieves SOTA performance on three public histopathology datasets (lung cancer, colon cancer, acute lymphoblastic leukaemia): accuracy, precision, recall, F1-score, and ROC-AUC all reach 100% on the lung and colon cancer datasets; on the ALL dataset every metric exceeds 99.8%, with ROC-AUC at 99.99%. All results are reported with 95% confidence intervals. Conclusion: Transformer architectures are highly effective for histopathological image analysis; DeepHistoViT combines high accuracy with interpretability and could serve as a practical aid to pathologists' clinical decision-making. Abstract: Histopathology remains the gold standard for cancer diagnosis because it provides detailed cellular-level assessment of tissue morphology. However, manual histopathological examination is time-consuming, labour-intensive, and subject to inter-observer variability, creating a demand for reliable computer-assisted diagnostic tools. Recent advances in deep learning, particularly transformer-based architectures, have shown strong potential for modelling complex spatial dependencies in medical images. In this work, we propose DeepHistoViT, a transformer-based framework for automated classification of histopathological images. The model employs a customized Vision Transformer architecture with an integrated attention mechanism designed to capture fine-grained cellular structures while improving interpretability through attention-based localization of diagnostically relevant regions. The framework is evaluated on three publicly available histopathology datasets covering lung cancer, colon cancer, and acute lymphoblastic leukaemia. Experimental results demonstrate state-of-the-art performance across all datasets, with classification accuracy, precision, recall, F1-score, and ROC-AUC reaching 100 percent on the lung and colon cancer datasets, and 99.85 percent, 99.84 percent, 99.86 percent, 99.85 percent, and 99.99 percent respectively on the acute lymphoblastic leukaemia dataset. All performance metrics are reported with 95 percent confidence intervals.
These results highlight the effectiveness of transformer-based architectures for histopathological image analysis and demonstrate the potential of DeepHistoViT as an interpretable computer-assisted diagnostic tool to support pathologists in clinical decision-making.
[70] Seeing Isn't Orienting: A Cognitively Grounded Benchmark Reveals Systematic Orientation Failures in MLLMs Supplementary
Nazia Tasnim,Keanu Nichols,Yuting Yang,Nicholas Ikechukwu,Elva Zou,Deepti Ghadiyaram,Bryan A. Plummer
Main category: cs.CV
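TL;DR: This paper introduces DORI, a cognitively grounded hierarchical benchmark that specifically evaluates how well vision-language models understand object orientation; current state-of-the-art models perform poorly on it, showing that orientation understanding remains an unsolved problem for multimodal systems.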
Details
Motivation: Most existing vision-language benchmarks conflate orientation with position and general scene understanding and do not evaluate object orientation understanding separately and systematically; since human orientation cognition develops in stages, a dedicated benchmark that follows this cognitive progression is needed. Method: DORI, inspired by stages of human orientation cognition, decomposes orientation into four dimensions, each evaluated at coarse (categorical) and granular (metric) levels; it comprises 13,652 images and 33,656 multiple-choice questions covering 67 object categories, and controls confounds via bounding-box isolation, standardized spatial reference frames, and structured prompts. Result: Across 24 state-of-the-art vision-language models, the best models reach only 54.2% on coarse and 45.0% on granular tasks, and models that do well on general spatial benchmarks are near-random on object-centric orientation; failures are largest on compound rotations and reference-frame shifts, and coarse-to-granular performance gaps are widespread. Conclusion: Object orientation understanding is a major weakness of current multimodal AI that existing benchmarks conceal; DORI reveals that models rely on categorical heuristics rather than geometric reasoning, with implications for robotic manipulation, 3D reconstruction, and human-AI interaction. Abstract: Humans learn object orientation progressively, from recognizing which way an object faces, to mentally rotating it, to reasoning about orientations between objects. Current vision-language benchmarks largely conflate orientation with position and general scene understanding. We introduce Discriminative Orientation Reasoning Intelligence (DORI), a cognitively grounded hierarchical benchmark that makes object orientation the primary target. Inspired by stages of human orientation cognition, DORI decomposes orientation into four dimensions, each evaluated at coarse (categorical) and granular (metric) levels. Composed from 13,652 images across 14 sources, DORI provides 33,656 multiple-choice questions covering 67 object categories in real-world and synthetic settings. Its coarse-to-granular design isolates orientation from confounds such as object recognition difficulty, scene clutter, and linguistic ambiguity via bounding-box isolation, standardized spatial reference frames, and structured prompts. Evaluating 24 state-of-the-art vision-language models shows a clear pattern: models that perform well on general spatial benchmarks are near-random on object-centric orientation tasks. The best models reach only 54.2% on coarse and 45.0% on granular judgments, with largest failures on compound rotations and shifts in inter-object reference frames. Large coarse-to-granular gaps reveal reliance on categorical heuristics rather than geometric reasoning, a limitation hidden by existing benchmarks.
These results identify orientation understanding as an unsolved challenge for multimodal systems, with implications for robotic manipulation, 3D scene reconstruction, and human-AI interaction.
[71] Zero-Shot Cross-City Generalization in End-to-End Autonomous Driving: Self-Supervised versus Supervised Representations
Fatemeh Naeinian,Ali Hamza,Haoran Zhu,Anna Choromanska
Main category: cs.CV
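TL;DR: This paper studies zero-shot cross-city generalization of end-to-end autonomous driving models to unseen cities; conventional supervised pretrained backbones tend to rely on city-specific cues and degrade severely across domains, whereas self-supervised visual representations (I-JEPA, DINOv2, MAE) substantially narrow this gap and improve robustness in both open-loop and closed-loop evaluation.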
Details
Motivation: End-to-end autonomous driving models are usually trained on multi-city datasets, but their generalization to unseen cities has not been adequately tested; geographically mixed training can mask failure modes under real domain shift, so zero-shot cross-city transfer needs to be evaluated. Method: Integrate several self-supervised backbones (I-JEPA, DINOv2, MAE) into end-to-end trajectory planning frameworks and evaluate under strict geographic splits on nuScenes (open-loop) and NAVSIM (closed-loop). Result: With a supervised backbone, transfer from Boston to Singapore inflates L2 displacement error by 9.77x and collision rate by 19.43x; self-supervised pretraining reduces these ratios to 1.20x and 0.75x respectively; in closed-loop evaluation PDMS improves by up to 4%. Conclusion: Self-supervised representation learning markedly improves the cross-city robustness of end-to-end driving models, and zero-shot geographic transfer should be a key test for such systems. Abstract: End-to-end autonomous driving models are typically trained on multi-city datasets using supervised ImageNet-pretrained backbones, yet their ability to generalize to unseen cities remains largely unexamined. When training and evaluation data are geographically mixed, models may implicitly rely on city-specific cues, masking failure modes that would occur under real domain shifts when generalizing to new locations. In this work we investigate zero-shot cross-city generalization in end-to-end trajectory planning and ask whether self-supervised visual representations improve transfer across cities. We conduct a comprehensive study by integrating self-supervised backbones (I-JEPA, DINOv2, and MAE) into planning frameworks. We evaluate performance under strict geographic splits on nuScenes in the open-loop setting and on NAVSIM in the closed-loop evaluation protocol. Our experiments reveal a substantial generalization gap when transferring models relying on traditional supervised backbones across cities with different road topologies and driving conventions, particularly when transferring from right-side to left-side driving environments. Self-supervised representation learning reduces this gap. In open-loop evaluation, a supervised backbone exhibits severe inflation when transferring from Boston to Singapore (L2 displacement ratio 9.77x, collision ratio 19.43x), whereas domain-specific self-supervised pretraining reduces this to 1.20x and 0.75x respectively. In closed-loop evaluation, self-supervised pretraining improves PDMS by up to 4 percent for all single-city training cities.
These results show that representation learning strongly influences the robustness of cross-city planning and establish zero-shot geographic transfer as a necessary test for evaluating end-to-end autonomous driving systems.
[72] ShotVerse: Advancing Cinematic Camera Control for Text-Driven Multi-Shot Video Creation
Songlin Yang,Zhe Wang,Xuyi Yang,Songchun Zhang,Xianghao Kong,Taiyi Wu,Xiaotong Zhao,Ran Zhang,Alan Zhao,Anyi Rao
Main category: cs.CV
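TL;DR: This paper proposes ShotVerse, a framework that addresses camera control in text-driven multi-shot video generation through a data-driven "Plan-then-Control" paradigm: a VLM-based planner generates globally aligned trajectories, and a camera-adapter controller renders them into multi-shot video.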
Details
Motivation: Existing text-driven video generation methods struggle to control camera motion precisely in multi-shot scenarios: implicit text prompts lack precision, while explicit trajectory conditioning imposes excessive manual cost and often causes execution failures. Method: ShotVerse is a "Plan-then-Control" two-agent framework: 1) a VLM-based Planner uses spatial priors to generate cinematic, globally aligned camera trajectories from text; 2) a Controller renders the trajectories into multi-shot video via a camera adapter; 3) an automated multi-shot camera calibration pipeline establishes a unified global coordinate system and supports the construction of the ShotVerse-Bench dataset. Result: ShotVerse achieves high camera accuracy and cross-shot consistency in multi-shot video generation, significantly outperforms baselines in cinematic aesthetics and control reliability, and bridges the gap between unreliable textual control and costly manual plotting. Conclusion: Data-centric modeling of aligned (Caption, Trajectory, Video) triplets is a key route to better multi-shot camera control, and ShotVerse validates the effectiveness and scalability of this paradigm. Abstract: Text-driven video generation has democratized film creation, but camera control in cinematic multi-shot scenarios remains a significant block. Implicit textual prompts lack precision, while explicit trajectory conditioning imposes prohibitive manual overhead and often triggers execution failures in current models. To overcome this bottleneck, we propose a data-centric paradigm shift, positing that aligned (Caption, Trajectory, Video) triplets form an inherent joint distribution that can connect automated plotting and precise execution. Guided by this insight, we present ShotVerse, a "Plan-then-Control" framework that decouples generation into two collaborative agents: a VLM (Vision-Language Model)-based Planner that leverages spatial priors to obtain cinematic, globally aligned trajectories from text, and a Controller that renders these trajectories into multi-shot video content via a camera adapter. Central to our approach is the construction of a data foundation: we design an automated multi-shot camera calibration pipeline that aligns disjoint single-shot trajectories into a unified global coordinate system. This facilitates the curation of ShotVerse-Bench, a high-fidelity cinematic dataset with a three-track evaluation protocol that serves as the bedrock for our framework.
Extensive experiments demonstrate that ShotVerse effectively bridges the gap between unreliable textual control and labor-intensive manual plotting, achieving superior cinematic aesthetics and generating multi-shot videos that are both camera-accurate and cross-shot consistent.
[73] Beyond Single-Sample: Reliable Multi-Sample Distillation for Video Understanding
Songlin Li,Xin Zhu,Zechao Guan,Peipeng Chen,Jian Yao
Main category: cs.CV
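TL;DR: This paper proposes R-MSD (Reliable Multi-Sample Distillation), a framework that models teacher sampling variance, builds a task-adaptive teacher pool, and combines quality-aware signal matching with an adversarial distillation objective to make knowledge distillation for LVLMs more stable and effective, outperforming single-sample distillation on several video understanding benchmarks.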
Details
Motivation: Conventional black-box distillation relies on a single teacher response, which easily leads to high variance and format inconsistencies and makes supervision unreliable, especially in multimodal or temporal scenarios. Method: R-MSD 1) builds a task-adaptive teacher pool in place of a single teacher response; 2) applies quality-aware signal matching to filter noise; 3) uses an adversarial distillation objective for more robust knowledge transfer. Result: With a 4B student model, gains are +1.5% on VideoMME, +3.2% on Video-MMMU, and +3.6% on MathVerse; R-MSD outperforms single-sample distillation and an SFT+RL baseline trained under the same budget. Conclusion: Multi-sample teacher modeling with quality-aware adversarial distillation mitigates teacher response variance and offers a new paradigm for efficient, stable LVLM distillation. Abstract: Traditional black-box distillation for Large Vision-Language Models (LVLMs) typically relies on a single teacher response per input, which often yields high-variance responses and format inconsistencies in multimodal or temporal scenarios. To mitigate this unreliable supervision, we propose R-MSD (Reliable Multi-Sample Distillation), a framework that explicitly models teacher sampling variance to enhance distillation stability. Rather than relying on a single teacher response, our approach leverages a task-adaptive teacher pool to provide robust supervision tailored to both closed-ended and open-ended reasoning. By integrating quality-aware signal matching with an adversarial distillation objective, our approach effectively filters teacher noise while maximizing knowledge transfer. Extensive evaluations across comprehensive video understanding benchmarks demonstrate that R-MSD consistently outperforms single-sample distillation methods. We additionally include an original SFT+RL 4B baseline under the same training budget, which shows only marginal gains, while our method achieves significant improvements. With a 4B student model, our approach delivers gains on VideoMME (+1.5%), Video-MMMU (+3.2%), and MathVerse (+3.6%).
[74] Stay in your Lane: Role Specific Queries with Overlap Suppression Loss for Dense Video Captioning
Seung Hyup Baek,Jimin Lee,Hyeongkeun Lee,Jae Won Cho
Main category: cs.CV
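TL;DR: This paper proposes a dense video captioning method based on role-specific queries; by separating the localization and captioning tasks and adding contrastive alignment, overlap suppression, and a concept enhancement module, it improves multi-event localization accuracy and the semantic richness of captions.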
Details
Motivation: Existing query-based dense video captioning methods use shared queries, which causes severe interference between localization and captioning as well as temporal redundancy. Method: Introduce role-specific queries to decouple localization and captioning; use contrastive alignment to ensure semantic consistency; design an overlap suppression mechanism that penalizes temporal overlap between queries; add a lightweight concept extraction module to enrich caption semantics. Result: Experiments on two major benchmarks, YouCook2 and ActivityNet Captions, validate the method, with improved localization accuracy and caption quality. Conclusion: Role separation, contrastive alignment, overlap suppression, and concept enhancement together improve the precision of event localization and the semantic richness of captions in DVC. Abstract: Dense Video Captioning (DVC) is a challenging multimodal task that involves temporally localizing multiple events within a video and describing them with natural language. While query-based frameworks enable the simultaneous, end-to-end processing of localization and captioning, their reliance on shared queries often leads to significant multi-task interference between the two tasks, as well as temporal redundancy in localization. In this paper, we propose utilizing role-specific queries that separate localization and captioning into independent components, allowing each to exclusively learn its role. We then employ contrastive alignment to enforce semantic consistency between the corresponding outputs, ensuring coherent behavior across the separated queries. Furthermore, we design a novel suppression mechanism in which mutual temporal overlaps across queries are penalized to tackle temporal redundancy, supervising the model to learn distinct, non-overlapping event regions for more precise localization. Additionally, we introduce a lightweight module that captures core event concepts to further enhance semantic richness in captions through concept-level representations. We demonstrate the effectiveness of our method through extensive experiments on major DVC benchmarks YouCook2 and ActivityNet Captions.
[75] Detect Anything in Real Time: From Single-Prompt Segmentation to Multi-Class Detection
Mehmet Kerem Turkcan
Main category: cs.CV
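TL;DR: This paper proposes DART, a training-free framework that turns SAM3 into a real-time multi-class detector; by sharing the visual backbone computation and batching decoding, among other optimizations, it achieves substantial speedups without modifying model weights and reaches SOTA performance on COCO.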
Details
Motivation: Existing vision-language detection systems such as SAM3 process only one text prompt per forward pass, so detecting N categories requires N independent executions, which is too costly for real-time use. Method: Exploit the class-agnostic nature of SAM3's visual backbone to share image feature extraction across classes; combine this with batched multi-class decoding, detection-only inference, and TensorRT FP16 deployment; additionally use adapter distillation for extreme low-latency targets. Result: On a single RTX 4080, on COCO val2017, DART reaches 55.8 AP at 15.8 FPS (4 classes, 1008x1008 input); cumulative speedup reaches 25x at 80 classes; under extreme latency targets (13.9 ms backbone) it still obtains 38.7 AP. Conclusion: DART is a training-free, plug-and-play inference framework that markedly improves the real-time practicality of open-vocabulary detection without modifying model weights. Abstract: Recent advances in vision-language modeling have produced promptable detection and segmentation systems that accept arbitrary natural language queries at inference time. Among these, SAM3 achieves state-of-the-art accuracy by combining a ViT-H/14 backbone with cross-modal transformer decoding and learned object queries. However, SAM3 processes a single text prompt per forward pass. Detecting N categories requires N independent executions, each dominated by the 439M-parameter backbone. We present Detect Anything in Real Time (DART), a training-free framework that converts SAM3 into a real-time multi-class detector by exploiting a structural invariant: the visual backbone is class-agnostic, producing image features independent of the text prompt. This allows the backbone computation to be shared between all classes, reducing its cost from O(N) to O(1). Combined with batched multi-class decoding, detection-only inference, and TensorRT FP16 deployment, these optimizations yield 5.6x cumulative speedup at 3 classes, scaling to 25x at 80 classes, without modifying any model weight. On COCO val2017 (5,000 images, 80 classes), DART achieves 55.8 AP at 15.8 FPS (4 classes, 1008x1008) on a single RTX 4080, surpassing purpose-built open-vocabulary detectors trained on millions of box annotations. For extreme latency targets, adapter distillation with a frozen encoder-decoder achieves 38.7 AP with a 13.9 ms backbone. Code and models are available at https://github.com/mkturkcan/DART.
[76] Follow the Saliency: Supervised Saliency for Retrieval-augmented Dense Video Captioning
Seung hee Choi,MinJu Jeon,Hyunwoo Oh,Jihwan Lee,Dong-Jin Kim
Main category: cs.CV
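TL;DR: This paper proposes STaRC, a framework that introduces frame-level saliency supervision derived from ground-truth annotations (highlight detection) together with saliency-guided retrieval and caption generation, improving temporal segmentation accuracy and caption quality in dense video captioning.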
Details
Motivation: Existing dense video captioning (DVC) methods rely on heuristic strategies for temporal segmentation, which align poorly with true event boundaries and limit retrieval and generation quality. Method: STaRC 1) automatically derives binary saliency labels from DVC ground-truth annotations to train a highlight detection module; 2) uses the saliency scores as a unified temporal signal that drives saliency-guided segmentation and Saliency Prompts injected into the decoder to improve caption generation. Result: Achieves SOTA performance on most metrics on the YouCook2 and ViTT benchmarks. Conclusion: Combining saliency supervision with a unified saliency signal improves temporal segmentation accuracy and context-aware caption generation, offering a new paradigm for DVC. Abstract: Existing retrieval-augmented approaches for Dense Video Captioning (DVC) often fail to achieve accurate temporal segmentation aligned with true event boundaries, as they rely on heuristic strategies that overlook ground truth event boundaries. The proposed framework, STaRC, overcomes this limitation by supervising frame-level saliency through a highlight detection module. Note that the highlight detection module is trained on binary labels derived directly from DVC ground truth annotations without the need for additional annotation. We also propose to utilize the saliency scores as a unified temporal signal that drives retrieval via saliency-guided segmentation and informs caption generation through explicit Saliency Prompts injected into the decoder. By enforcing saliency-constrained segmentation, our method produces temporally coherent segments that align closely with actual event transitions, leading to more accurate retrieval and contextually grounded caption generation. We conduct comprehensive evaluations on the YouCook2 and ViTT benchmarks, where STaRC achieves state-of-the-art performance across most of the metrics. Our code is available at https://github.com/ermitaju1/STaRC
[77] INFACT: A Diagnostic Benchmark for Induced Faithfulness and Factuality Hallucinations in Video-LLMs
Junqi Yang,Yuecong Min,Jie Zhang,Shiguang Shan,Xilin Chen
Main category: cs.CV
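TL;DR: This paper introduces INFACT, a benchmark for diagnosing faithfulness and factuality hallucinations in video large language models (Video-LLMs) and for evaluating their reliability under several kinds of induced perturbation.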
Details
Motivation: Existing benchmarks cover factuality hallucinations poorly and mostly evaluate in clean settings, so they cannot fully reflect the reliability of Video-LLMs in realistic, complex conditions. Method: Build INFACT, a diagnostic benchmark of 9,800 QA instances with fine-grained taxonomies for faithfulness and factuality, supporting four evaluation modes (Base, Visual Degradation, Evidence Corruption, Temporal Intervention) and quantifying reliability with Resist Rate (RR) and Temporal Sensitivity Score (TSS). Result: Experiments on 14 representative Video-LLMs show that high Base accuracy does not guarantee high reliability under perturbation; evidence corruption reduces stability; temporal intervention causes the largest degradation; and many open-source models have near-zero TSS on factuality, indicating pronounced temporal inertia. Conclusion: INFACT reveals serious deficiencies in factuality and temporal sensitivity in current Video-LLMs and underlines the need to look beyond accuracy to multi-dimensional reliability evaluation. Abstract: Despite rapid progress, Video Large Language Models (Video-LLMs) remain unreliable due to hallucinations, which are outputs that contradict either video evidence (faithfulness) or verifiable world knowledge (factuality). Existing benchmarks provide limited coverage of factuality hallucinations and predominantly evaluate models only in clean settings. We introduce INFACT, a diagnostic benchmark comprising 9,800 QA instances with fine-grained taxonomies for faithfulness and factuality, spanning real and synthetic videos. INFACT evaluates models in four modes: Base (clean), Visual Degradation, Evidence Corruption, and Temporal Intervention for order-sensitive items. Reliability under induced modes is quantified using Resist Rate (RR) and Temporal Sensitivity Score (TSS). Experiments on 14 representative Video-LLMs reveal that higher Base-mode accuracy does not reliably translate to higher reliability in the induced modes, with evidence corruption reducing stability and temporal intervention yielding the largest degradation. Notably, many open-source baselines exhibit near-zero TSS on factuality, indicating pronounced temporal inertia on order-sensitive questions.
[78] SPEGC: Continual Test-Time Adaptation via Semantic-Prompt-Enhanced Graph Clustering for Medical Image Segmentation
Xiaogang Du,Jiawei Zhang,Tongfei Liu,Tao Lei,Yingbo Wang
Main category: cs.CV
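TL;DR: This paper proposes SPEGC, a continual test-time adaptation method that uses semantic prompt enhancement and differentiable graph clustering to improve the robustness and adaptability of pre-trained models in cross-domain medical image segmentation.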
Details
Motivation: In medical image segmentation, the domain gap between training and test data seriously hinders clinical deployment of pre-trained models; existing continual test-time adaptation (CTTA) methods are prone to error accumulation and performance collapse because of unreliable supervisory signals. Method: SPEGC has three parts: 1) a semantic prompt feature enhancement mechanism uses decoupled commonality and heterogeneity prompt pools to inject global context and suppress noise; 2) a differentiable graph clustering solver reframes global edge sparsification as an optimal transport problem and produces a high-order structural representation end to end; 3) this representation guides model adaptation, enforcing cluster-level prediction consistency and dynamically adjusting decision boundaries. Result: On two medical image segmentation benchmarks, SPEGC outperforms state-of-the-art CTTA methods. Conclusion: Through semantic prompt enhancement and differentiable graph clustering, SPEGC alleviates feature noise sensitivity and error accumulation under domain shift, improving the stability and generalization of CTTA for medical image segmentation. Abstract: In medical image segmentation tasks, the domain gap caused by the difference in data collection between training and testing data seriously hinders the deployment of pre-trained models in clinical practice. Continual Test-Time Adaptation (CTTA) aims to enable pre-trained models to adapt to continuously changing unlabeled domains, providing an effective approach to solving this problem. However, existing CTTA methods often rely on unreliable supervisory signals, igniting a self-reinforcing cycle of error accumulation that culminates in catastrophic performance degradation. To overcome these challenges, we propose a CTTA method via Semantic-Prompt-Enhanced Graph Clustering (SPEGC) for medical image segmentation. First, we design a semantic prompt feature enhancement mechanism that utilizes decoupled commonality and heterogeneity prompt pools to inject global contextual information into local features, alleviating their susceptibility to noise interference under domain shift. Second, based on these enhanced features, we design a differentiable graph clustering solver. This solver reframes global edge sparsification as an optimal transport problem, allowing it to distill a raw similarity matrix into a refined and high-order structural representation in an end-to-end manner. Finally, this robust structural representation is used to guide model adaptation, ensuring predictions are consistent at a cluster-level and dynamically adjusting decision boundaries. Extensive experiments demonstrate that SPEGC outperforms other state-of-the-art CTTA methods on two medical image segmentation benchmarks.
The source code is available at https://github.com/Jwei-Z/SPEGC-for-MIS.
[79] OrthoEraser: Coupled-Neuron Orthogonal Projection for Concept Erasure
Chuancheng Shi,Wenhua Wu,Fei Shen,Xiaogang Zhu,Kun Hu,Zhiyong Wang
Main category: cs.CV
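TL;DR: This paper proposes OrthoEraser, which uses sparse autoencoders (SAE) for high-resolution feature disentanglement and performs concept erasure through an analytical orthogonal projection, removing harmful content without damaging benign semantics.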
Details
Motivation: Existing concept erasure methods for text-to-image models tend to damage benign attributes when they fully suppress selected neurons, because sensitive and benign semantics are superposed non-orthogonally and are highly entangled in activation subspaces. Method: OrthoEraser first decomposes dense activations with a sparse autoencoder and isolates sensitive neurons; it then uses coupled neuron detection to identify non-sensitive features vulnerable to intervention; finally, an analytical gradient orthogonalization strategy projects erasure vectors onto the null space of the coupled neurons, orthogonally decoupling sensitive concepts from the critical benign subspace. Result: In safety experiments, OrthoEraser achieves high erasure precision, removes harmful content while preserving the integrity of the generative manifold, and significantly outperforms SOTA baselines. Conclusion: OrthoEraser's orthogonal projection mechanism addresses the erasure side effects caused by semantic entanglement and offers a new paradigm for safe, controllable text-to-image generation. Abstract: Text-to-image (T2I) models face significant safety risks from adversarial induction, yet current concept erasure methods often cause collateral damage to benign attributes when suppressing selected neurons entirely. This occurs because sensitive and benign semantics exhibit non-orthogonal superposition, sharing activation subspaces where their respective vectors are inherently entangled. To address this issue, we propose OrthoEraser, which leverages sparse autoencoders (SAE) to achieve high-resolution feature disentanglement and subsequently redefines erasure as an analytical orthogonalization projection that preserves the benign manifold's invariance. OrthoEraser first employs SAE to decompose dense activations and segregate sensitive neurons. It then uses coupled neuron detection to identify non-sensitive features vulnerable to intervention. The key novelty lies in an analytical gradient orthogonalization strategy that projects erasure vectors onto the null space of the coupled neurons. This orthogonally decouples the sensitive concepts from the identified critical benign subspace, effectively preserving non-sensitive semantics. Experimental results on safety demonstrate that OrthoEraser achieves high erasure precision, effectively removing harmful content while preserving the integrity of the generative manifold, and significantly outperforming SOTA baselines. This paper contains results of unsafe models.
[80] ActiveFreq: Integrating Active Learning and Frequency Domain Analysis for Interactive Segmentation
Lijun Guo,Qian Zhou,Zidi Shi,Hua Zou,Gang Ke
Main category: cs.CV
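TL;DR: This paper proposes ActiveFreq, a framework that combines active learning with frequency-domain analysis: the AcSelect module selects the most informative mislabeled regions, and the FreqFormer backbone (with a Fourier transform module) extracts richer spatial-frequency features, reducing the number of user interactions and improving interactive medical image segmentation accuracy.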
Details
Motivation: Existing interactive segmentation methods do not fully exploit the knowledge in user input and ignore frequency-domain information; mislabeled regions are selected randomly rather than prioritized by impact, which limits performance gains. Method: ActiveFreq comprises 1) AcSelect, a module that autonomously selects the most informative mislabeled regions using an active learning strategy; 2) FreqFormer, a backbone with a Fourier transform module for joint spatial-frequency feature extraction. Result: On ISIC-2017 and OAI-ZIB, NoC@90 reaches 3.74 and 9.27, improvements of 23.5% and 12.8% over the previous best; with only 2 clicks, mIoU reaches 85.29% (ISIC-2017) and 75.76% (OAI-ZIB). Conclusion: ActiveFreq combines active learning with frequency-domain modeling, reducing user interaction while improving segmentation accuracy and robustness, and offers a new paradigm for efficient, accurate interactive medical image segmentation. Abstract: Interactive segmentation is commonly used in medical image analysis to obtain precise, pixel-level labeling, typically involving iterative user input to correct mislabeled regions. However, existing approaches often fail to fully utilize user knowledge from interactive inputs and achieve comprehensive feature extraction. Specifically, these methods tend to treat all mislabeled regions equally, selecting them randomly for refinement without evaluating each region's potential impact on segmentation quality. Additionally, most models rely solely on spatial domain features, overlooking frequency domain information that could enhance feature extraction and improve performance. To address these limitations, we propose ActiveFreq, a novel interactive segmentation framework that integrates active learning and frequency domain analysis to minimize human intervention while achieving high-quality labeling. ActiveFreq introduces AcSelect, an autonomous module that prioritizes the most informative mislabeled regions, ensuring maximum performance gain from each click. Moreover, we develop FreqFormer, a segmentation backbone incorporating a Fourier transform module to map features from the spatial to the frequency domain, enabling richer feature extraction. Evaluations on the ISIC-2017 and OAI-ZIB datasets demonstrate that ActiveFreq achieves high performance with reduced user interaction, achieving 3.74 NoC@90 on ISIC-2017 and 9.27 NoC@90 on OAI-ZIB, with 23.5% and 12.8% improvements over previous best results, respectively.
Under minimal input conditions, such as two clicks, ActiveFreq reaches mIoU scores of 85.29% and 75.76% on ISIC-2017 and OAI-ZIB, highlighting its efficiency and accuracy in interactive medical segmentation.
[81] Gen-Fab: A Variation-Aware Generative Model for Predicting Fabrication Variations in Nanophotonic Devices
Rambod Azimi,Yuri Grinberg,Dan-Xia Xu,Odile Liboiron-Ladouceur
Main category: cs.CV
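TL;DR: This paper proposes Gen-Fab, a Pix2Pix-based conditional generative adversarial network for modeling nanometer-scale process variations in silicon photonic device fabrication. It generates diverse SEM-style predictions from GDS design layouts and outperforms several U-Net baselines in both accuracy and uncertainty modeling.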
Details
Motivation: Silicon photonic fabrication exhibits non-uniform process deviations (such as over-etching, under-etching, and corner rounding) that affect device performance, so high-fidelity digital twin models are needed to predict the distribution of possible fabrication outcomes.
Method: Proposes Gen-Fab, a modified Pix2Pix conditional GAN that injects latent noise at the bottleneck to enable one-to-many mapping; it takes a GDS layout as input and outputs high-resolution SEM-like images of the fabricated result that capture process variation.
Result: On an out-of-distribution test set, Gen-Fab achieves the highest IoU of 89.8%, clearly ahead of the deterministic U-Net (85.3%), the MC-Dropout U-Net (83.4%), and the U-Net ensemble (85.8%); it also attains lower KL divergence and Wasserstein distance, indicating a better distribution match.
Conclusion: Gen-Fab effectively models the complex, non-uniform process variations in photonic fabrication and generalizes strongly, offering a new paradigm for digital twins of photonic chips.
Abstract: Silicon photonic devices often exhibit fabrication-induced variations such as over-etching, under-etching, and corner rounding, which can significantly alter device performance. These variations are non-uniform and are influenced by feature size and shape. Accurate digital twins are therefore needed to predict the range of possible fabricated outcomes for a given design. In this paper, we introduce Gen-Fab, a conditional generative adversarial network (cGAN) based on Pix2Pix to predict and model uncertainty in photonic fabrication outcomes. The proposed method takes a design layout (in GDS format) as input and produces diverse high-resolution predictions similar to scanning electron microscope (SEM) images of fabricated devices, capturing the range of process variations at the nanometer scale. To enable one-to-many mapping, we inject a latent noise vector at the model bottleneck. We compare Gen-Fab against three baselines: (1) a deterministic U-Net predictor, (2) an inference-time Monte Carlo Dropout U-Net, and (3) an ensemble of varied U-Nets. Evaluations on an out-of-distribution dataset of fabricated photonic test structures demonstrate that Gen-Fab outperforms all baselines in both accuracy and uncertainty modeling. An additional distribution shift analysis further confirms its strong generalization to unseen fabrication geometries. Gen-Fab achieves the highest intersection-over-union (IoU) score of 89.8%, outperforming the deterministic U-Net (85.3%), the MC-Dropout U-Net (83.4%), and the ensemble of varied U-Nets (85.8%).
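The IoU scores above compare a predicted binary mask with the mask of the fabricated structure. A minimal sketch of that comparison, assuming masks are given as nested lists of 0/1 values (the mask layout and names are illustrative, not taken from the Gen-Fab code):

```python
def iou(pred, target):
    """Intersection-over-union of two equally sized binary masks."""
    inter = 0
    union = 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks agree perfectly by convention.
    return inter / union if union else 1.0


# Hypothetical 2x3 example: the prediction keeps a corner pixel that
# the fabricated structure lost to corner rounding.
predicted = [[1, 1, 0],
             [1, 1, 0]]
fabricated = [[1, 1, 0],
              [1, 0, 0]]
print(iou(predicted, fabricated))
```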
It also better aligns with the distribution of real fabrication outcomes, achieving lower Kullback-Leibler divergence and Wasserstein distance.
[82] Manifold-Optimal Guidance: A Unified Riemannian Control View of Diffusion Guidance
Zexi Jia,Pengcheng Luo,Zhengyao Fang,Jinchao Zhang,Jie Zhou
Main category: cs.CV
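TL;DR: This paper proposes the Manifold-Optimal Guidance (MOG) framework, which uses Riemannian geometry to correct the off-manifold drift caused by classifier-free guidance (CFG) extrapolating in Euclidean space. It further introduces Auto-MOG, an adaptive guidance-strength schedule, improving generation quality without retraining and with virtually no added computational overhead.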
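The "Euclidean extrapolation" that MOG sets out to correct is the standard CFG combination rule, sketched below. This shows only the well-known CFG baseline under one common scale convention (scale 1 recovers the conditional prediction); it is not the MOG update, whose closed form is not given in this digest. The function name is illustrative.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Standard classifier-free guidance on noise predictions.

    Moves from the unconditional prediction along the straight line
    towards (and, for scale > 1, past) the conditional prediction.
    """
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


eps_u = [0.0, 0.0]
eps_c = [1.0, 2.0]
print(cfg_combine(eps_u, eps_c, 1.0))  # conditional prediction itself
print(cfg_combine(eps_u, eps_c, 3.0))  # extrapolated beyond it
```

Because the step is a straight line in ambient space, a large scale can carry the sample far from where the data actually lie, which is the failure mode the paper attributes to high guidance scales.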
Details
Motivation: At high guidance scales, standard Classifier-Free Guidance (CFG) tends to cause oversaturation, texture artifacts, and structural collapse. The root cause is that it performs Euclidean extrapolation in ambient space, which pushes sampling trajectories off the high-density data manifold.
Method: Recasts guidance as a local optimal control problem and derives a closed-form, geometry-aware Riemannian update that corrects off-manifold drift; also proposes Auto-MOG, a dynamic guidance-strength schedule based on energy balancing.
Result: MOG clearly outperforms baselines in fidelity and conditional alignment with almost no added computational cost; Auto-MOG removes the need for manual tuning.
Conclusion: MOG improves CFG from a manifold-geometry perspective, yielding a guidance mechanism for diffusion models that is more principled in theory and more robust in practice.
Abstract: Classifier-Free Guidance (CFG) serves as the de facto control mechanism for conditional diffusion, yet high guidance scales notoriously induce oversaturation, texture artifacts, and structural collapse. We attribute this failure to a geometric mismatch: standard CFG performs Euclidean extrapolation in ambient space, inadvertently driving sampling trajectories off the high-density data manifold. To resolve this, we present Manifold-Optimal Guidance (MOG), a framework that reformulates guidance as a local optimal control problem. MOG yields a closed-form, geometry-aware Riemannian update that corrects off-manifold drift without requiring retraining. Leveraging this perspective, we further introduce Auto-MOG, a dynamic energy-balancing schedule that adaptively calibrates guidance strength, effectively eliminating the need for manual hyperparameter tuning. Extensive validation demonstrates that MOG yields superior fidelity and alignment compared to baselines, with virtually no added computational overhead.
[83] FBCIR: Balancing Cross-Modal Focuses in Composed Image Retrieval
Chenchen Zhao,Jianhuan Zhuo,Muxi Chen,Zhaohua Zhang,Wenyu Jiang,Tianwen Jiang,Qiuyong Xiao,Jihong Zhang,Qiang Xu
Main category: cs.CV
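TL;DR: This paper proposes FBCIR, a method for interpreting the focus imbalances of multimodal models in composed image retrieval (CIR), and designs a data augmentation strategy targeting hard negatives to improve model robustness.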
Details
Motivation: Existing CIR models degrade when faced with hard negatives that are semantically aligned with the query image or text; the authors attribute this to a focus imbalance between the image and text modalities.
Method: Proposes FBCIR, a multimodal focus interpretation method that identifies the image and text components most critical to retrieval decisions; based on the analysis, builds a hard-negative data augmentation workflow that encourages balanced cross-modal reasoning.
Result: FBCIR shows that focus imbalance is widespread in existing CIR models, especially under hard-negative settings; the proposed augmentation significantly improves performance in challenging cases across multiple CIR models while maintaining performance on standard benchmarks.
Conclusion: Starting from model interpretability, the paper exposes an inherent weakness of CIR models and improves their robustness through targeted data augmentation, offering a new perspective on CIR model diagnosis and optimization.
Abstract: Composed image retrieval (CIR) requires multi-modal models to jointly reason over visual content and semantic modifications presented in text-image input pairs. While current CIR models achieve strong performance on common benchmark cases, their accuracy often degrades in more challenging scenarios where negative candidates are semantically aligned with the query image or text. In this paper, we attribute this degradation to focus imbalances, where models disproportionately attend to one modality while neglecting the other. To validate this claim, we propose FBCIR, a multi-modal focus interpretation method that identifies the most crucial visual and textual input components to a model's retrieval decisions. Using FBCIR, we report that focus imbalances are prevalent in existing CIR models, especially under hard negative settings. Building on the analyses, we further propose a CIR data augmentation workflow that augments existing CIR datasets with curated hard negatives designed to encourage balanced cross-modal reasoning. Extensive experiments across multiple CIR models demonstrate that the proposed augmentation consistently improves performance in challenging cases, while maintaining their capabilities on standard benchmarks. Together, our interpretation method and data augmentation workflow provide a new perspective on CIR model diagnosis and robustness improvements.
[84] EReCu: Pseudo-label Evolution Fusion and Refinement with Multi-Cue Learning for Unsupervised Camouflage Detection
Shuo Jiang,Gaojia Zhang,Min Tan,Yufei Yin,Gang Pan
Main category: cs.CV
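TL;DR: This paper proposes a unified framework for unsupervised camouflaged object detection (UCOD) that improves pseudo-label reliability and feature fidelity through a multi-cue native perception module, pseudo-label evolution fusion, and local pseudo-label refinement, achieving state-of-the-art performance on multiple datasets.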
Details
Motivation: Existing UCOD methods are limited by the high similarity between targets and background and by noisy pseudo-labels, leading to boundary overflow, structural ambiguity, or loss of detail; pseudo-label guidance tends to introduce noise, while learning without guidance lacks fine-grained texture.
Method: Proposes a unified UCOD framework: (1) the Multi-Cue Native Perception module fuses low-level texture with mid-level semantics to align masks with native object information; (2) Pseudo-Label Evolution Fusion combines teacher-student interaction with depthwise separable convolution for semantic denoising; (3) Spectral Tensor Attention Fusion balances semantic and structural information through compact spectral aggregation of multi-layer attention maps; (4) Local Pseudo-Label Refinement uses attention diversity to refine local detail and boundary fidelity.
Result: Achieves state-of-the-art performance on multiple UCOD datasets, with clear gains in detail perception, robustness of boundary alignment, and generalization under complex camouflage scenarios.
Conclusion: The framework jointly improves pseudo-label quality and feature fidelity, offering a new paradigm for UCOD that balances reliability and fine detail.
Abstract: Unsupervised Camouflaged Object Detection (UCOD) remains a challenging task due to the high intrinsic similarity between target objects and their surroundings, as well as the reliance on noisy pseudo-labels that hinder fine-grained texture learning. While existing refinement strategies aim to alleviate label noise, they often overlook intrinsic perceptual cues, leading to boundary overflow and structural ambiguity. In contrast, learning without pseudo-label guidance yields coarse features with significant detail loss. To address these issues, we propose a unified UCOD framework that enhances both the reliability of pseudo-labels and the fidelity of features. Our approach introduces the Multi-Cue Native Perception module, which extracts intrinsic visual priors by integrating low-level texture cues with mid-level semantics, enabling precise alignment between masks and native object information. Additionally, Pseudo-Label Evolution Fusion refines labels through teacher-student interaction and utilizes depthwise separable convolution for efficient semantic denoising. It also incorporates Spectral Tensor Attention Fusion to balance semantic and structural information through compact spectral aggregation across multi-layer attention maps. Finally, Local Pseudo-Label Refinement optimizes local detail by leveraging attention diversity to restore fine textures and enhance boundary fidelity.
Extensive experiments on multiple UCOD datasets demonstrate that our method achieves state-of-the-art performance, characterized by superior detail perception, robust boundary alignment, and strong generalization under complex camouflage scenarios.
[85] MDS-VQA: Model-Informed Data Selection for Video Quality Assessment
Jian Zou,Xiaoyu Xu,Zhihua Wang,Yilin Wang,Balu Adsumilli,Kede Ma
Main category: cs.CV
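TL;DR: This paper proposes MDS-VQA, a model-informed data selection mechanism that, under a limited labeling budget, picks unlabeled videos that are both difficult for the base VQA model and diverse in content for active fine-tuning, significantly improving cross-domain generalization.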
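The idea of greedily trading off difficulty against diversity under a budget can be sketched as below. This is a generic illustration under stated assumptions, not the MDS-VQA procedure itself: the difficulty scores stand in for the output of a failure predictor, the feature vectors stand in for deep semantic video features, and the additive score with weight `lam` is a hypothetical choice.

```python
import math


def greedy_select(difficulty, features, budget, lam=1.0):
    """Pick `budget` items that are hard for the model and spread out.

    Each step adds the item maximizing
        difficulty[i] + lam * (distance to nearest already-selected item).
    """
    selected = []
    remaining = set(range(len(difficulty)))
    while remaining and len(selected) < budget:
        def gain(i):
            if not selected:
                return difficulty[i]
            nearest = min(math.dist(features[i], features[j]) for j in selected)
            return difficulty[i] + lam * nearest

        best = max(sorted(remaining), key=gain)  # sorted() makes ties deterministic
        selected.append(best)
        remaining.remove(best)
    return selected


# Items 0 and 1 are both hard but nearly identical in content;
# item 2 is easier but very different.
difficulty = [0.90, 0.85, 0.20]
features = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
print(greedy_select(difficulty, features, budget=2, lam=1.0))
print(greedy_select(difficulty, features, budget=2, lam=0.0))
```

With the diversity term switched off (`lam=0.0`) the selection collapses onto the two near-duplicate hard items, which is the redundancy a diversity term is meant to avoid.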
Details
Motivation: Learning-based video quality assessment (VQA) is held back by a disconnect between model design and dataset construction: model-centric approaches iterate on fixed benchmarks, while data-centric approaches lack a systematic labeling strategy that targets the current model's weaknesses.
Method: Proposes MDS-VQA: a failure predictor trained with a ranking objective estimates sample difficulty; deep semantic video features measure content diversity; a greedy algorithm jointly optimizes difficulty and diversity under a labeling budget.
Result: Validated across multiple VQA datasets and models; active fine-tuning on a selected 5% subset per target domain raises mean SRCC from 0.651 to 0.722 and achieves the top gMAD rank.
Conclusion: Model-informed data selection efficiently identifies highly informative samples and substantially strengthens the domain adaptation and generalization of VQA models, offering a new paradigm for co-evolving data and models.
Abstract: Learning-based video quality assessment (VQA) has advanced rapidly, yet progress is increasingly constrained by a disconnect between model design and dataset curation. Model-centric approaches often iterate on fixed benchmarks, while data-centric efforts collect new human labels without systematically targeting the weaknesses of existing VQA models. Here, we describe MDS-VQA, a model-informed data selection mechanism for curating unlabeled videos that are both difficult for the base VQA model and diverse in content. Difficulty is estimated by a failure predictor trained with a ranking objective, and diversity is measured using deep semantic video features, with a greedy procedure balancing the two under a constrained labeling budget. Experiments across multiple VQA datasets and models demonstrate that MDS-VQA identifies diverse, challenging samples that are particularly informative for active fine-tuning. With only a 5% selected subset per target domain, the fine-tuned model improves mean SRCC from 0.651 to 0.722 and achieves the top gMAD rank, indicating strong adaptation and generalization.
[86] Mobile-GS: Real-time Gaussian Splatting for Mobile Devices
Xiaobiao Du,Yida Wang,Kun Zhan,Xin Yu
Main category: cs.CV
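TL;DR: This paper proposes Mobile-GS, a real-time 3D Gaussian Splatting rendering method for mobile devices. Through depth-aware order-independent rendering, neural view-dependent enhancement, spherical harmonics distillation, neural vector quantization, and contribution-based pruning, it substantially reduces compute and storage costs while preserving high rendering quality.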
Details
Motivation: 3D Gaussian Splatting (3DGS) renders at high quality but is computationally intensive and storage-heavy, making it hard to deploy on resource-constrained mobile devices.
Method: Proposes depth-aware order-independent rendering to remove the Gaussian depth-sorting bottleneck; introduces neural view-dependent enhancement to reduce transparency artifacts; applies first-order spherical harmonics distillation, neural vector quantization, and contribution-based pruning to compress the Gaussian representation.
Result: Achieves real-time rendering on mobile devices (e.g., >30 FPS) with a much smaller model, while keeping visual quality close to the original 3DGS.
Conclusion: Mobile-GS offers an efficient and practical solution for high-quality real-time novel view synthesis on edge devices, balancing speed, memory, and image quality.
Abstract: 3D Gaussian Splatting (3DGS) has emerged as a powerful representation for high-quality rendering across a wide range of applications. However, its high computational demands and large storage costs pose significant challenges for deployment on mobile devices. In this work, we propose a mobile-tailored real-time Gaussian Splatting method, dubbed Mobile-GS, enabling efficient inference of Gaussian Splatting on edge devices. Specifically, we first identify alpha blending as the primary computational bottleneck, since it relies on the time-consuming Gaussian depth sorting process. To solve this issue, we propose a depth-aware order-independent rendering scheme that eliminates the need for sorting, thereby substantially accelerating rendering. Although this order-independent rendering improves rendering speed, it may introduce transparency artifacts in regions with overlapping geometry due to the lack of rendering-order information. To address this problem, we propose a neural view-dependent enhancement strategy, enabling more accurate modeling of view-dependent effects conditioned on viewing direction, 3D Gaussian geometry, and appearance attributes. In this way, Mobile-GS can achieve both high-quality and real-time rendering. Furthermore, to facilitate deployment on memory-constrained mobile platforms, we also introduce first-order spherical harmonics distillation, a neural vector quantization technique, and a contribution-based pruning strategy to reduce the number of Gaussian primitives and compress the 3D Gaussian representation with the assistance of neural networks.
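Why alpha blending needs a depth sort, and what an order-independent alternative looks like, can be seen in a single-pixel sketch. The order-independent variant below is a generic depth-weighted average in the spirit of weighted blended transparency; the exponential weight and the `falloff` parameter are assumptions for illustration and are not the Mobile-GS scheme.

```python
import math


def composite_in_order(splats):
    """Standard alpha compositing in the order given.

    Each splat is (depth, alpha, colour) for one pixel; the result
    depends on the order, so splats must be depth-sorted first.
    """
    colour, transmittance = 0.0, 1.0
    for _depth, alpha, c in splats:
        colour += transmittance * alpha * c
        transmittance *= 1.0 - alpha
    return colour


def composite_sorted(splats):
    """Front-to-back compositing: the sort is the costly step."""
    return composite_in_order(sorted(splats, key=lambda s: s[0]))


def composite_order_independent(splats, falloff=1.0):
    """Depth-weighted average that needs no sort (assumed weighting)."""
    num = den = 0.0
    for depth, alpha, c in splats:
        w = alpha * math.exp(-falloff * depth)  # nearer splats weigh more
        num += w * c
        den += w
    return num / den if den > 0 else 0.0


near = (1.0, 0.5, 1.0)  # bright splat close to the camera
far = (2.0, 0.5, 0.0)   # dark splat behind it
print(composite_in_order([near, far]), composite_in_order([far, near]))
print(composite_order_independent([near, far]),
      composite_order_independent([far, near]))
```

The weighted form is only an approximation of true occlusion, which is consistent with the abstract's remark that order-independent rendering can introduce transparency artifacts where geometry overlaps.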
Extensive experiments demonstrate that our proposed Mobile-GS achieves real-time rendering and compact model size while preserving high visual quality, making it well-suited for mobile applications.
[87] Risk-Controllable Multi-View Diffusion for Driving Scenario Generation
Hongyi Lin,Wenxiu Shi,Heye Huang,Dingyi Zhuang,Song Zhang,Yang Liu,Xiaobo Qu,Jinhua Zhao
Main category: cs.CV
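TL;DR: This paper proposes RiskMV-DPO, a physically informed, risk-controllable method for multi-view driving scenario generation. It combines target risk levels with physical modeling to generate high-risk dynamic trajectories, and improves generation quality with a geometry-appearance alignment module and region-aware direct preference optimization (RA-DPO); on nuScenes it substantially raises 3D detection mAP and lowers FID.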
Details
Motivation: Long-tail, high-risk driving scenarios are rare in real data and hard to design by hand; existing generative methods treat risk as an after-the-fact label and struggle to maintain multi-view geometric consistency.
Method: Proposes the RiskMV-DPO framework: (1) risk-controllable trajectory generation driven by physical modeling, serving as geometric anchors; (2) diffusion-based video generation; (3) a geometry-appearance alignment module; (4) region-aware direct preference optimization (RA-DPO) with motion-aware masking.
Result: On nuScenes, 3D detection mAP improves from 18.17 to 30.50 and FID drops to 15.70, supporting high-quality, diverse generation of long-tail risk scenarios.
Conclusion: RiskMV-DPO shifts world models from passive prediction to proactive, risk-controllable synthesis, providing a scalable toolchain for the safe development of embodied intelligence.
Abstract: Generating safety-critical driving scenarios is crucial for evaluating and improving autonomous driving systems, but long-tail risky situations are rarely observed in real-world data and difficult to specify through manual scenario design. Existing generative approaches typically treat risk as an after-the-fact label and struggle to maintain geometric consistency in multi-view driving scenes. We present RiskMV-DPO, a general and systematic pipeline for physically-informed, risk-controllable multi-view scenario generation. By integrating target risk levels with physically-grounded risk modeling, we autonomously synthesize diverse and high-stakes dynamic trajectories that serve as explicit geometric anchors for a diffusion-based video generator. To ensure spatial-temporal coherence and geometric fidelity, we introduce a geometry-appearance alignment module and a region-aware direct preference optimization (RA-DPO) strategy with motion-aware masking to focus learning on localized dynamic regions. Experiments on the nuScenes dataset show that RiskMV-DPO can freely generate a wide spectrum of diverse long-tail scenarios while maintaining state-of-the-art visual quality, improving 3D detection mAP from 18.17 to 30.50 and reducing FID to 15.70. Our work shifts the role of world models from passive environment prediction to proactive, risk-controllable synthesis, providing a scalable toolchain for the safety-oriented development of embodied intelligence.
[88] ReHARK: Refined Hybrid Adaptive RBF Kernels for Robust One-Shot Vision-Language Adaptation
Md Jahidul Islam
Main category: cs.CV
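TL;DR: This paper proposes the ReHARK framework, which introduces global proximal regularization in a reproducing kernel Hilbert space to address the stability-plasticity dilemma of large models in the one-shot setting, significantly improving one-shot vision-language adaptation.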
Details
Motivation: Large vision-language models such as CLIP struggle to balance stability and plasticity on downstream tasks with extremely little data, especially one-shot; existing training-free methods such as Tip-Adapter suffer from boundary bias and lack global structural regularization.
Method: Proposes ReHARK, a training-free synergistic framework with a four-stage refinement pipeline (hybrid prior construction, support set augmentation (bridging), adaptive distribution rectification, and a multi-scale RBF kernel ensemble), all built on global proximal regularization in an RKHS.
Result: Across 11 benchmarks, ReHARK sets a new state of the art for one-shot adaptation with an average accuracy of 65.83%, clearly ahead of existing baselines.
Conclusion: By combining semantic priors, cross-modal bridging, distribution alignment, and multi-scale kernel modeling, ReHARK eases the stability-plasticity trade-off in one-shot VLM adaptation and offers a new paradigm for training-free few-shot transfer.
Abstract: The adaptation of large-scale Vision-Language Models (VLMs) like CLIP to downstream tasks with extremely limited data -- specifically in the one-shot regime -- is often hindered by a significant "Stability-Plasticity" dilemma. While efficient caching mechanisms have been introduced by training-free methods such as Tip-Adapter, these approaches often function as local Nadaraya-Watson estimators. Such estimators are characterized by inherent boundary bias and a lack of global structural regularization. In this paper, ReHARK (Refined Hybrid Adaptive RBF Kernels) is proposed as a synergistic training-free framework that reinterprets few-shot adaptation through global proximal regularization in a Reproducing Kernel Hilbert Space (RKHS). A multistage refinement pipeline is introduced, consisting of: (1) Hybrid Prior Construction, where zero-shot textual knowledge from CLIP and GPT-3 is fused with visual class prototypes to form a robust semantic-visual anchor; (2) Support Set Augmentation (Bridging), where intermediate samples are generated to smooth the transition between visual and textual modalities; (3) Adaptive Distribution Rectification, where test feature statistics are aligned with the augmented support set to mitigate domain shifts; and (4) Multi-Scale RBF Kernels, where an ensemble of kernels is employed to capture complex feature geometries across diverse scales. Superior stability and accuracy are demonstrated through extensive experiments on 11 diverse benchmarks.
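Stage (4) above, an ensemble of RBF kernels at several scales, can be sketched as follows. This is a minimal illustration, assuming a uniform average over a hand-picked set of bandwidths; the bandwidth values, the averaging rule, and the function names are assumptions and not the ReHARK configuration.

```python
import math


def rbf(x, y, sigma):
    """Gaussian RBF kernel exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def multi_scale_rbf(x, y, sigmas=(0.5, 1.0, 2.0)):
    """Uniform average of RBF kernels at several bandwidths.

    Narrow kernels respond only to very close neighbours, wide ones
    keep some similarity at longer range; the average blends both.
    """
    return sum(rbf(x, y, s) for s in sigmas) / len(sigmas)


query = (0.0, 0.0)
for support in [(0.0, 0.0), (0.5, 0.0), (3.0, 0.0)]:
    print(support, round(multi_scale_rbf(query, support), 4))
```

A non-negative combination of valid kernels is itself a valid kernel, so an ensemble of this kind can be used wherever a single RBF kernel could.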
A new state-of-the-art for one-shot adaptation is established by ReHARK, which achieves an average accuracy of 65.83%, significantly outperforming existing baselines. Code is available at https://github.com/Jahid12012021/ReHARK.
[89] Mango-GS: Enhancing Spatio-Temporal Consistency in Dynamic Scenes Reconstruction using Multi-Frame Node-Guided 4D Gaussian Splatting
Tingxuan Huang,Haowei Zhu,Jun-hai Yong,Hao Pan,Bin Wang
Main category: cs.CV
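TL;DR: Mango-GS is a multi-frame, node-guided framework for high-fidelity 4D reconstruction. It uses a temporal Transformer to model short-term inter-frame motion dependencies and sparse control nodes to achieve efficient, stable, and consistent dynamic scene reconstruction with real-time rendering.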
Details
Motivation: Existing Gaussian splatting methods for dynamic scenes mostly optimize frame by frame, which tends to overfit instantaneous states and fails to capture the underlying motion dynamics, causing temporal inconsistency and correspondence drift.
Method: Proposes the Mango-GS framework: a temporal Transformer models motion dependencies within a time window; sparse control nodes (each with a decoupled canonical position and latent code) act as semantic anchors; training is end-to-end with an input masking strategy and two multi-frame losses.
Result: Achieves state-of-the-art reconstruction quality on multiple dynamic scene datasets while supporting real-time rendering, with clear gains in temporal consistency, robustness, and stability under large motion.
Conclusion: Mango-GS addresses the trade-off among fidelity, temporal coherence, and efficiency in dynamic 3D scene reconstruction, offering a new paradigm for high-fidelity 4D reconstruction and interactive rendering.
Abstract: Reconstructing dynamic 3D scenes with photorealistic detail and strong temporal coherence remains a significant challenge. Existing Gaussian splatting approaches for dynamic scene modeling often rely on per-frame optimization, which can overfit to instantaneous states instead of capturing underlying motion dynamics. To address this, we present Mango-GS, a multi-frame, node-guided framework for high-fidelity 4D reconstruction. Mango-GS leverages a temporal Transformer to model motion dependencies within a short window of frames, producing temporally consistent deformations. For efficiency, temporal modeling is confined to a sparse set of control nodes. Each node is represented by a decoupled canonical position and a latent code, providing a stable semantic anchor for motion propagation and preventing correspondence drift under large motion. Our framework is trained end-to-end, enhanced by an input masking strategy and two multi-frame losses to improve robustness. Extensive experiments demonstrate that Mango-GS achieves state-of-the-art reconstruction quality and real-time rendering speed, enabling high-fidelity reconstruction and interactive rendering of dynamic scenes.
[90] PCA-Enhanced Probabilistic U-Net for Effective Ambiguous Medical Image Segmentation
Xiangyu Li,Chenglin Wang,Qiantong Shen,Fanding Li,Wei Wang,Kuanquan Wang,Yi Shen,Baochun Zhao,Gongning Luo
Main category: cs.CV
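TL;DR: This paper proposes a PCA-enhanced Probabilistic U-Net (PEP U-Net) that applies PCA dimensionality reduction and inverse-PCA reconstruction in the posterior network to reduce latent-space redundancy and improve expressiveness and computational efficiency, achieving a better balance between segmentation accuracy and predictive variability while preserving generative diversity.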
Details
Motivation: Existing cVAE-based medical image segmentation methods suffer from redundancy in high-dimensional latent spaces and from the limited expressiveness of a single posterior network.
Method: Proposes the PCA-enhanced Probabilistic U-Net (PEP U-Net): PCA is applied in the posterior network to reduce dimensionality and redundancy, and an inverse PCA operation reconstructs critical information to strengthen the latent representation.
Result: Compared with conventional generative models, the method keeps the ability to generate diverse segmentation hypotheses while markedly improving the balance between segmentation accuracy and predictive variability, advancing generative modeling for medical image segmentation.
Conclusion: PEP U-Net overcomes inherent weaknesses of cVAE-style methods and offers a more efficient, more expressive probabilistic modeling paradigm for ambiguous medical image segmentation.
Abstract: Ambiguous Medical Image Segmentation (AMIS) is important for addressing the inherent uncertainties that arise from image ambiguities, noise, and subjective annotations. Existing conditional variational autoencoder (cVAE)-based methods effectively capture uncertainty but face limitations including redundancy in high-dimensional latent spaces and limited expressiveness of single posterior networks. To overcome these issues, we introduce a novel PCA-Enhanced Probabilistic U-Net (PEP U-Net). Our method effectively incorporates Principal Component Analysis (PCA) for dimensionality reduction in the posterior network to mitigate redundancy and improve computational efficiency. Additionally, we further employ an inverse PCA operation to reconstruct critical information, enhancing the latent space's representational capacity. Compared to conventional generative models, our method preserves the ability to generate diverse segmentation hypotheses while achieving a superior balance between segmentation accuracy and predictive variability, thereby advancing the performance of generative modeling in medical image segmentation.
[91] MANSION: Multi-floor lANguage-to-3D Scene generatIOn for loNg-horizon tasks
Lirong Che,Shuo Wen,Shan Huang,Chuang Wang,Yuzhe Yang,Gregory Dudek,Xueqian Wang,Jian Su
Main category: cs.CV
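TL;DR: This paper proposes the MANSION framework for generating building-scale, multi-floor 3D environments that support real-world, cross-floor, long-horizon robotic tasks. It also releases the MansionWorld dataset and a semantic scene editing agent, and shows that current state-of-the-art agents degrade sharply on the new benchmark.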
Details
Motivation: Real-world robotic tasks are often long-horizon and span multiple floors, requiring rich spatial reasoning, yet most existing embodied AI benchmarks are confined to single-floor indoor environments and do not reflect real-world complexity.
Method: Proposes MANSION, the first language-driven framework for generating building-scale, multi-floor 3D environments that accounts for vertical structural constraints; on top of it, builds the MansionWorld dataset (over 1,000 diverse buildings) and a task-semantic, open-vocabulary scene editing agent.
Result: Evaluation on the MANSION benchmark shows that current state-of-the-art agents degrade sharply, supporting the framework's value as a key testbed for next-generation spatial reasoning and planning.
Conclusion: MANSION fills the gap in benchmarks for multi-floor, building-scale embodied tasks and pushes embodied AI towards more realistic and complex spatial understanding and long-horizon planning.
Abstract: Real-world robotic tasks are long-horizon and often span multiple floors, demanding rich spatial reasoning. However, existing embodied benchmarks are largely confined to single-floor in-house environments, failing to reflect the complexity of real-world tasks. We introduce MANSION, the first language-driven framework for generating building-scale, multi-floor 3D environments. Being aware of vertical structural constraints, MANSION generates realistic, navigable whole-building structures with diverse, human-friendly scenes, enabling the development and evaluation of cross-floor long-horizon tasks. Building on this framework, we release MansionWorld, a dataset of over 1,000 diverse buildings ranging from hospitals to offices, alongside a Task-Semantic Scene Editing Agent that customizes these environments using open-vocabulary commands to meet specific user needs. Benchmarking reveals that state-of-the-art agents degrade sharply in our settings, establishing MANSION as a critical testbed for the next generation of spatial reasoning and planning.
[92] Enhancing Image Aesthetics with Dual-Conditioned Diffusion Models Guided by Multimodal Perception
Xinyu Nan,Ning Wang,Yuyao Zhai,Mei Yang
Main category: cs.CV
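TL;DR: This paper proposes DIAE, a diffusion-based, dual-supervised method for image aesthetic enhancement. It uses Multimodal Aesthetic Perception (MAP) to turn ambiguous aesthetic instructions into explicit guidance, and builds the weakly paired dataset IIAEData together with a dual-branch supervision framework to address the difficulty of instruction following and the scarcity of high-quality paired data.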
Details
Motivation: Existing image editing models perform poorly at aesthetic enhancement, mainly because aesthetics-aware editing instructions are hard to follow and because "perfectly paired" images with consistent content but different aesthetic quality are scarce.
Method: Proposes the DIAE model: (1) Multimodal Aesthetic Perception (MAP), which uses standardized multi-attribute aesthetic text instructions and corresponding text-image multimodal control signals; (2) the weakly paired dataset IIAEData, with a dual-branch supervision framework for weakly supervised training.
Result: DIAE outperforms baseline methods on both image aesthetic scores and content consistency scores.
Conclusion: DIAE effectively improves image aesthetic enhancement, validating multimodal aesthetic perception and the weakly supervised dual-branch framework for this task.
Abstract: Image aesthetic enhancement aims to perceive aesthetic deficiencies in images and perform corresponding editing operations, which is highly challenging and requires the model to possess creativity and aesthetic perception capabilities. Although recent advancements in image editing models have significantly enhanced their controllability and flexibility, they struggle with enhancing image aesthetics. The primary challenges are twofold: first, following editing instructions with aesthetic perception is difficult, and second, there is a scarcity of "perfectly-paired" images that have consistent content but distinct aesthetic qualities. In this paper, we propose Dual-supervised Image Aesthetic Enhancement (DIAE), a diffusion-based generative model with multimodal aesthetic perception. First, DIAE incorporates Multimodal Aesthetic Perception (MAP) to convert the ambiguous aesthetic instruction into explicit guidance by (i) employing detailed, standardized aesthetic instructions across multiple aesthetic attributes, and (ii) utilizing multimodal control signals derived from text-image pairs that maintain consistency within the same aesthetic attribute. Second, to mitigate the lack of "perfectly-paired" images, we collect an "imperfectly-paired" dataset called IIAEData, consisting of images with varying aesthetic qualities while sharing identical semantics. To better leverage the weak matching characteristics of IIAEData during training, a dual-branch supervision framework is also introduced for weakly supervised image aesthetic enhancement.
Experimental results demonstrate that DIAE outperforms the baselines and obtains superior image aesthetic scores and image content consistency scores.
[93] TornadoNet: Real-Time Building Damage Detection with Ordinal Supervision
Robinson Umeike,Cuong Pham,Ryan Hausen,Thang Dao,Shane Crawford,Tanya Brown-Giammanco,Gerard Lemson,John van de Lindt,Blythe Johnston,Arik Mitschang,Trung Do
Main category: cs.CV
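TL;DR: This paper presents the TornadoNet benchmark for evaluating real-time object detectors on multi-level building damage assessment from street-view imagery. It compares YOLO-family CNN models with Transformer models such as RT-DETR and introduces ordinal-aware supervision to improve the accuracy of damage severity estimation.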
Details
Motivation: Existing work lacks a systematic evaluation of multi-level building damage detection under realistic post-disaster conditions, and in particular lacks controlled experiments on how model architecture and loss function interact; a benchmark is needed that accounts for both detection accuracy and ordinal consistency of damage levels.
Method: Builds the TornadoNet benchmark with 3,333 street-view images and 8,890 annotated building instances; adopts the five-level IN-CORE damage classification framework; compares YOLO-family CNNs with Transformer models such as RT-DETR; proposes soft ordinal classification targets and an ordinal-distance penalty loss.
Result: YOLO models lead in detection accuracy (up to 46.05% mAP@0.5) and speed (66-276 FPS); RT-DETR is better on ordinal consistency (88.13% Ordinal Top-1 Accuracy, MAOE = 0.65); with ordinal supervision, RT-DETR's mAP rises to 44.70%, Ordinal Top-1 Accuracy reaches 91.15%, and MAOE falls to 0.56.
Conclusion: Ordinal-aware supervision markedly improves the reliability of damage severity estimation, with its benefit depending on a good match with the detector architecture; TornadoNet provides deployable methods and a benchmark for post-disaster response.
Abstract: We present TornadoNet, a comprehensive benchmark for automated street-level building damage assessment evaluating how modern real-time object detection architectures and ordinal-aware supervision strategies perform under realistic post-disaster conditions. TornadoNet provides the first controlled benchmark demonstrating how architectural design and loss formulation jointly influence multi-level damage detection from street-view imagery, delivering methodological insights and deployable tools for disaster response. Using 3,333 high-resolution geotagged images and 8,890 annotated building instances from the 2021 Midwest tornado outbreak, we systematically compare CNN-based detectors from the YOLO family against transformer-based models (RT-DETR) for multi-level damage detection. Models are trained under standardized protocols using a five-level damage classification framework based on IN-CORE damage states, validated through expert cross-annotation. Baseline experiments reveal complementary architectural strengths. CNN-based YOLO models achieve highest detection accuracy and throughput, with larger variants reaching 46.05% mAP@0.5 at 66-276 FPS on A100 GPUs. Transformer-based RT-DETR models exhibit stronger ordinal consistency, achieving 88.13% Ordinal Top-1 Accuracy and MAOE of 0.65, indicating more reliable severity grading despite lower baseline mAP.
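The distinction between plain accuracy and ordinal consistency can be made concrete with a small sketch. It assumes that MAOE stands for the mean absolute error between predicted and true damage levels on the five-level scale; the digest does not expand the acronym, so this reading and the function names are assumptions.

```python
def top1_accuracy(true_levels, pred_levels):
    """Fraction of instances whose predicted level is exactly right."""
    hits = sum(1 for t, p in zip(true_levels, pred_levels) if t == p)
    return hits / len(true_levels)


def mean_abs_ordinal_error(true_levels, pred_levels):
    """Average distance, in damage levels, between prediction and truth."""
    total = sum(abs(t - p) for t, p in zip(true_levels, pred_levels))
    return total / len(true_levels)


# Damage levels 0 (none) to 4 (destroyed); hypothetical predictions.
truth = [0, 1, 2, 3]
model_a = [0, 1, 3, 4]  # misses by one level each time it is wrong
model_b = [0, 1, 0, 0]  # misses by several levels when it is wrong

for name, pred in [("A", model_a), ("B", model_b)]:
    print(name, top1_accuracy(truth, pred), mean_abs_ordinal_error(truth, pred))
```

Both hypothetical models have the same exact-match accuracy, but model A's mistakes stay within one level, which is the kind of "more reliable severity grading" an ordinal metric is designed to reward.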
To align supervision with the ordered nature of damage severity, we introduce soft ordinal classification targets and evaluate explicit ordinal-distance penalties. RT-DETR trained with calibrated ordinal supervision achieves 44.70% mAP@0.5, a 4.8 percentage-point improvement, with gains in ordinal metrics (91.15% Ordinal Top-1 Accuracy, MAOE = 0.56). These findings establish that ordinal-aware supervision improves damage severity estimation when aligned with detector architecture. Model & Data: https://github.com/crumeike/TornadoNet
[94] SVLL: Staged Vision-Language Learning for Physically Grounded Embodied Task Planning
Yuyuan Yang,Junkun Hong,Hongrong Wang,Honghao Cai,Xunpeng Ren,Ge Wang,Mingcong Lei,Shenhao Yan,Jiahao Yang,Chengsi Yao,Xi Li,Yiming Zhao,Yatong Han,Jinke Ren
Main category: cs.CV
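TL;DR: This paper proposes SVLL, a staged vision-language learning framework that decouples spatial grounding from temporal reasoning and introduces a new alignment objective, Bias-DPO, to improve the physical plausibility and causal consistency of embodied task planning.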
Details
Motivation: Existing end-to-end training tends to cause premature temporal binding, while reinforcement learning methods suffer from unstable optimization; standard DPO also ignores absolute likelihood constraints on the optimal path, which can lead to unsafe or hallucinated behaviors.
Method: Proposes the three-stage SVLL framework: the first two stages model visual dependency and action sequencing respectively; the third stage introduces Bias-DPO, which explicitly maximizes the likelihood of expert trajectories and penalizes overconfident hallucinated actions.
Result: On the AI2-THOR benchmark and in real-world robot deployments, SVLL surpasses state-of-the-art models such as Qwen2.5-VL-7B, GPT-4o, and Gemini-2.0-flash in task success rate and significantly reduces physical constraint violations.
Conclusion: SVLL combined with Bias-DPO anchors the policy to the expert manifold, mitigates causal misalignment, enforces strict adherence to environmental affordances, and suppresses physically infeasible shortcuts.
Abstract: Embodied task planning demands vision-language models to generate action sequences that are both visually grounded and causally coherent over time. However, existing training paradigms face a critical trade-off: joint end-to-end training often leads to premature temporal binding, while standard reinforcement learning methods suffer from optimization instability. To bridge this gap, we present Staged Vision-Language Learning (SVLL), a unified three-stage framework for robust, physically-grounded embodied planning. In the first two stages, SVLL decouples spatial grounding from temporal reasoning, establishing robust visual dependency before introducing sequential action history. In the final stage, we identify a key limitation of standard Direct Preference Optimization (DPO): its purely relative nature, which optimizes only the preference gap between winning and losing trajectories while neglecting absolute likelihood constraints on the optimal path, often yields unsafe or hallucinated behaviors. To address this, we further introduce Bias-DPO, a novel alignment objective that injects an inductive bias toward expert trajectories by explicitly maximizing likelihood on ground-truth actions while penalizing overconfident hallucinations. By anchoring the policy to the expert manifold and mitigating causal misalignment, SVLL, powered by Bias-DPO, ensures strict adherence to environmental affordances and effectively suppresses physically impossible shortcuts.
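The "purely relative nature" of DPO can be checked numerically: the standard loss depends only on the gap between the winning and losing log-probabilities, so it does not change if both fall by the same amount. The sketch below uses the standard DPO loss; the `anchored_dpo_loss` variant, which simply adds a negative log-likelihood term on the winning (expert) trajectory with a hypothetical weight `alpha`, is an assumed illustration of an absolute-likelihood anchor and not the paper's Bias-DPO objective.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair: -log sigmoid(margin)."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return math.log1p(math.exp(-margin))


def anchored_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                      beta=0.1, alpha=1.0):
    """DPO plus an absolute-likelihood term on the winner (assumed form)."""
    return dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta) + alpha * (-logp_w)


# Same preference gap, but in the second case the policy assigns far
# lower probability to BOTH trajectories, including the expert one.
healthy = dict(logp_w=-1.0, logp_l=-3.0, ref_logp_w=-2.0, ref_logp_l=-2.0)
drifted = dict(logp_w=-6.0, logp_l=-8.0, ref_logp_w=-2.0, ref_logp_l=-2.0)

print(dpo_loss(**healthy), dpo_loss(**drifted))
print(anchored_dpo_loss(**healthy), anchored_dpo_loss(**drifted))
```

Plain DPO scores the two cases identically, whereas any objective with an absolute term on the expert trajectory separates them.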
Finally, extensive experiments on the interactive AI2-THOR benchmark and real-world robotic deployments demonstrate that SVLL outperforms both state-of-the-art open-source (e.g., Qwen2.5-VL-7B) and closed-source models (e.g., GPT-4o, Gemini-2.0-flash) in task success rate, while significantly reducing physical constraint violations.
[95] R4Det: 4D Radar-Camera Fusion for High-Performance 3D Object Detection
Zhongyu Xia,Yousen Tang,Yongtao Wang,Zhifeng Wang,Weijun Qin
Main category: cs.CV
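TL;DR: R4Det is a new 4D radar-camera fusion method for 3D object detection. With panoramic depth fusion, deformable gated temporal fusion, and instance-guided dynamic refinement modules, it addresses inaccurate depth estimation, dependence on ego-vehicle pose, and weak radar returns from small objects, reaching state-of-the-art results on the TJ4DRadSet and VoD datasets.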
Details
Motivation: Existing 4D radar-camera fusion methods for 3D object detection have three main problems: absolute depth estimation is neither robust nor accurate, temporal fusion depends heavily on unreliable ego-vehicle pose, and sparse radar point clouds on small objects cause missed detections.
Method: Proposes the R4Det framework with three core modules: (1) Panoramic Depth Fusion, which jointly optimizes absolute and relative depth; (2) Deformable Gated Temporal Fusion, which removes the dependence on ego-vehicle pose; (3) Instance-Guided Dynamic Refinement, which extracts semantic prototypes from 2D instances to strengthen detection.
Result: R4Det achieves state-of-the-art 3D object detection performance on the two mainstream 4D radar datasets, TJ4DRadSet and VoD.
Conclusion: R4Det eases the depth estimation, temporal modeling, and small-object perception bottlenecks in 4D radar-camera fusion detection, improving the robustness and practicality of multimodal fusion and offering a new direction for autonomous driving perception.
Abstract: The 4D radar-camera sensing configuration has gained increasing importance in autonomous driving. However, existing 3D object detection methods that fuse 4D radar and camera data confront several challenges. First, their absolute depth estimation module is not robust and accurate enough, leading to inaccurate 3D localization. Second, the performance of their temporal fusion module will degrade dramatically or even fail when the ego vehicle's pose is missing or inaccurate. Third, for some small objects, the sparse radar point cloud may contain no returns from their surfaces at all. In such cases, detection must rely solely on visual unimodal priors. To address these limitations, we propose R4Det, which enhances depth estimation quality via the Panoramic Depth Fusion module, enabling mutual reinforcement between absolute and relative depth. For temporal fusion, we design a Deformable Gated Temporal Fusion module that does not rely on the ego vehicle's pose. In addition, we built an Instance-Guided Dynamic Refinement module that extracts semantic prototypes from 2D instance guidance. Experiments show that R4Det achieves state-of-the-art 3D object detection results on the TJ4DRadSet and VoD datasets.
[96] WeEdit: A Dataset, Benchmark and Glyph-Guided Framework for Text-centric Image Editing
Hui Zhang,Juntao Liu,Zongkai Liu,Liqiang Niu,Fandong Meng,Zuxuan Wu,Yu-Gang Jiang
Main category: cs.CV
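TL;DR: This paper proposes WeEdit, a systematic solution for text-centric image editing that comprises data construction, two benchmarks, and a two-stage training strategy, significantly improving the precision and clarity of complex text edits.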
Details
Motivation: Existing models perform poorly on complex text editing and often produce blurry or hallucinated characters, mainly because they lack training paradigms specialized for text editing, large-scale datasets, and standardized benchmarks.
Method: Proposes an HTML-based automatic editing pipeline that generates 330K multilingual training pairs; designs bilingual and multilingual benchmarks; adopts a two-stage training strategy of glyph-guided supervised fine-tuning followed by multi-objective reinforcement learning.
Result: WeEdit clearly surpasses existing open-source models across a range of editing tasks, with advantages in text clarity, instruction adherence, and background preservation.
Conclusion: WeEdit systematically addresses the key bottlenecks of text-centric image editing and provides a scalable framework of data, evaluation, and algorithms for the field.
Abstract: Instruction-based image editing aims to modify specific content within existing images according to user-provided instructions while preserving non-target regions. Beyond traditional object- and style-centric manipulation, text-centric image editing focuses on modifying, translating, or rearranging textual elements embedded within images. However, existing leading models often struggle to execute complex text editing precisely, frequently producing blurry or hallucinated characters. We attribute these failures primarily to the lack of specialized training paradigms tailored for text-centric editing, as well as the absence of large-scale datasets and standardized benchmarks necessary for a closed-loop training and evaluation system. To address these limitations, we present WeEdit, a systematic solution encompassing a scalable data construction pipeline, two benchmarks, and a tailored two-stage training strategy. Specifically, we propose a novel HTML-based automatic editing pipeline, which generates 330K training pairs covering diverse editing operations and 15 languages, accompanied by standardized bilingual and multilingual benchmarks for comprehensive evaluation. On the algorithmic side, we employ glyph-guided supervised fine-tuning to inject explicit spatial and content priors, followed by a multi-objective reinforcement learning stage to align generation with instruction adherence, text clarity, and background preservation. Extensive experiments demonstrate that WeEdit outperforms previous open-source models by a clear margin across diverse editing operations.
[97] LaMoGen: Language to Motion Generation Through LLM-Guided Symbolic Inference
Junkun Jiang,Ho Yin Au,Jingyu Xiang,Jie Chen
Main category: cs.CV
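TL;DR: This paper proposes the LabanLite motion representation and the LaMoGen generation framework, which use symbolic reasoning to achieve interpretable, controllable language-driven motion synthesis.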
Details
Motivation: Existing methods based on text-motion embeddings struggle to generate temporally accurate, detailed motions and lack interpretability.
Method: Proposes LabanLite, a discrete symbolic motion representation adapted from Labanotation; builds the LaMoGen framework, in which a large language model composes motion sequences through symbolic reasoning; establishes a Labanotation-based benchmark with multi-dimensional evaluation metrics.
Result: LaMoGen outperforms prior methods on the authors' benchmark and on two public datasets, with markedly better interpretability and controllability.
Conclusion: Symbolic reasoning and agent-based design offer a better route to language-driven motion synthesis.
Abstract: Human motion is highly expressive and naturally aligned with language, yet prevailing methods relying heavily on joint text-motion embeddings struggle to synthesize temporally accurate, detailed motions and often lack explainability. To address these limitations, we introduce LabanLite, a motion representation developed by adapting and extending the Labanotation system. Unlike black-box text-motion embeddings, LabanLite encodes each atomic body-part action (e.g., a single left-foot step) as a discrete Laban symbol paired with a textual template. This abstraction decomposes complex motions into interpretable symbol sequences and body-part instructions, establishing a symbolic link between high-level language and low-level motion trajectories. Building on LabanLite, we present LaMoGen, a Text-to-LabanLite-to-Motion Generation framework that enables large language models (LLMs) to compose motion sequences through symbolic reasoning. The LLM interprets motion patterns, relates them to textual descriptions, and recombines symbols into executable plans, producing motions that are both interpretable and linguistically grounded. To support rigorous evaluation, we introduce a Labanotation-based benchmark with structured description-motion pairs and three metrics that jointly measure text-motion alignment across symbolic, temporal, and harmony dimensions. Experiments demonstrate that LaMoGen establishes a new baseline for both interpretability and controllability, outperforming prior methods on our benchmark and two public datasets. These results highlight the advantages of symbolic reasoning and agent-based design for language-driven motion synthesis.
[98] Articulat3D: Reconstructing Articulated Digital Twins From Monocular Videos with Geometric and Motion Constraints
Lijun Guo,Haoyu Zhao,Xingyue Zhao,Rong Fu,Linghao Zhuang,Siteng Huang,Zhongyu Li,Hua Zou
Main category: cs.CV
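TL;DR: This paper proposes Articulat3D, a framework that builds high-fidelity digital twins of articulated objects from monocular videos, using motion-prior-driven initialization followed by refinement under geometric and motion constraints to obtain geometrically accurate, temporally consistent reconstructions.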
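The kinematic primitives in this entry are parameterized by a joint axis, a pivot point, and per-frame motion scalars. A minimal sketch of how such a primitive could move a point is given below, assuming a revolute joint whose per-frame scalar is a rotation angle in radians (a prismatic joint would translate along the axis instead). The function names are hypothetical and this is not the authors' implementation; it only applies Rodrigues' rotation formula about an axis passing through the pivot.

```python
import math

def rotate_about_joint(point, axis, pivot, angle):
    """Rotate a 3D point by `angle` radians about `axis` passing through `pivot`.

    Assumption: revolute joint; uses Rodrigues' rotation formula.
    """
    norm = math.sqrt(sum(a * a for a in axis))
    k = [a / norm for a in axis]                 # unit joint axis
    v = [p - c for p, c in zip(point, pivot)]    # point relative to the pivot
    cos_t, sin_t = math.cos(angle), math.sin(angle)
    k_cross_v = [
        k[1] * v[2] - k[2] * v[1],
        k[2] * v[0] - k[0] * v[2],
        k[0] * v[1] - k[1] * v[0],
    ]
    k_dot_v = sum(a * b for a, b in zip(k, v))
    rotated = [
        v[i] * cos_t + k_cross_v[i] * sin_t + k[i] * k_dot_v * (1.0 - cos_t)
        for i in range(3)
    ]
    return [r + c for r, c in zip(rotated, pivot)]

def articulate(point, axis, pivot, per_frame_angles):
    """Apply one motion scalar per frame to obtain the point's trajectory."""
    return [rotate_about_joint(point, axis, pivot, a) for a in per_frame_angles]

# Toy example: a point on a door swinging about a vertical hinge at x = 1.
trajectory = articulate(
    point=[2.0, 0.0, 0.0],
    axis=[0.0, 0.0, 1.0],
    pivot=[1.0, 0.0, 0.0],
    per_frame_angles=[0.0, math.pi / 4, math.pi / 2],
)
```

Because only the per-frame angle changes over time while the axis and pivot are shared by all frames, every frame of the trajectory is consistent with a single rigid hinge, which is the kind of physical plausibility the refinement stage is described as enforcing.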
Details
Motivation: Existing methods rely on multi-view captures of static states and scale poorly to real-world scenes; digital twins of articulated objects need to be built efficiently from casually captured monocular videos.
Method: Combines Motion Prior-Driven Initialization (soft grouping into rigidly moving parts using 3D point tracks and a compact set of motion bases) with Geometric and Motion Constraints Refinement (based on learnable kinematic primitives comprising a joint axis, a pivot point, and per-frame motion scalars).
Result: Achieves SOTA performance on synthetic benchmarks and real monocular videos, making digital twin creation considerably more feasible under uncontrolled real-world conditions.
Conclusion: Articulat3D provides the first end-to-end framework for articulated-object digital twins that works on casually captured monocular videos, balancing geometric accuracy with motion consistency.
Abstract: Building high-fidelity digital twins of articulated objects from visual data remains a central challenge. Existing approaches depend on multi-view captures of the object in discrete, static states, which severely constrains their real-world scalability. In this paper, we introduce Articulat3D, a novel framework that constructs such digital twins from casually captured monocular videos by jointly enforcing explicit 3D geometric and motion constraints. We first propose Motion Prior-Driven Initialization, which leverages 3D point tracks to exploit the low-dimensional structure of articulated motion. By modeling scene dynamics with a compact set of motion bases, we facilitate soft decomposition of the scene into multiple rigidly-moving groups. Building on this initialization, we introduce Geometric and Motion Constraints Refinement, which enforces physically plausible articulation through learnable kinematic primitives parameterized by a joint axis, a pivot point, and per-frame motion scalars, yielding reconstructions that are both geometrically accurate and temporally coherent. Extensive experiments demonstrate that Articulat3D achieves state-of-the-art performance on synthetic benchmarks and real-world casually captured monocular videos, significantly advancing the feasibility of digital twin creation under uncontrolled real-world conditions. Our project page is at https://maxwell-zhao.github.io/Articulat3D.
[99] DyWeight: Dynamic Gradient Weighting for Few-Step Diffusion Sampling
Tong Zhao,Mingkun Lei,Liangyu Yuan,Yanming Yang,Chenxi Song,Yang Wang,Beier Zhu,Chi Zhang
Main category: cs.CV
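TL;DR: This paper proposes DyWeight, a lightweight, learned multi-step solver that accelerates diffusion model sampling by dynamically weighting historical gradients and implicitly calibrating the time step, improving generation quality and stability while substantially reducing the number of function evaluations.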
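The idea of weighting historical gradients can be illustrated with a generic explicit multi-step update. The sketch below is an assumption-laden toy, not DyWeight itself: the weight values are made up, the state is a scalar, and no learning takes place. It only shows why dropping the classical constraint that the weights sum to 1 also rescales the effective step size.

```python
def multistep_update(x, grads, weights, step):
    """One explicit multi-step update: x_next = x + step * sum_j weights[j] * grads[j].

    `grads` holds the current and previous gradient evaluations (newest first);
    `weights` holds one coefficient per stored gradient.
    """
    combined = sum(w * g for w, g in zip(weights, grads))
    return x + step * combined

def effective_step_scale(weights):
    """For a constant gradient the update equals step * sum(weights) * g,
    so the weight sum acts as a multiplier on the nominal step size."""
    return sum(weights)

# Classical two-step Adams-Bashforth coefficients: handcrafted, sum to 1.
ab2_weights = [1.5, -0.5]
# Hypothetical unconstrained weights for one time step (illustrative values only).
free_weights = [1.8, -0.6]

x_classical = multistep_update(1.0, [2.0, 1.0], ab2_weights, step=0.1)
x_free = multistep_update(1.0, [2.0, 1.0], free_weights, step=0.1)
```

With the classical coefficients the step size is untouched, whereas the unconstrained weights sum to about 1.2 and so take a step roughly 20% longer along the aggregated direction. A learned solver would hold a separate weight vector for each sampling step.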
Details
Motivation: Diffusion sampling is slow, and existing multi-step ODE solvers rely on handcrafted coefficients that cannot adapt to the non-stationary dynamics of the diffusion process.
Method: Proposes Dynamic Gradient Weighting (DyWeight), a learning-driven multi-step solver that relaxes classical numerical constraints and learns time-varying parameters to adaptively aggregate historical gradients, while implicitly adjusting the effective step size for time calibration.
Result: On CIFAR-10, FFHQ, AFHQv2, ImageNet64, LSUN-Bedroom, Stable Diffusion, and FLUX.1-dev, DyWeight achieves higher visual fidelity and stability with fewer function evaluations, setting a new SOTA among efficient diffusion solvers.
Conclusion: By introducing a learnable, time-varying, implicitly coupled gradient weighting mechanism, DyWeight bridges the gap between numerical solver efficiency and the intrinsic denoising dynamics of diffusion models, offering a new paradigm for efficient diffusion sampling.
Abstract: Diffusion Models (DMs) have achieved state-of-the-art generative performance across multiple modalities, yet their sampling process remains prohibitively slow due to the need for hundreds of function evaluations. Recent progress in multi-step ODE solvers has greatly improved efficiency by reusing historical gradients, but existing methods rely on handcrafted coefficients that fail to adapt to the non-stationary dynamics of diffusion sampling. To address this limitation, we propose Dynamic Gradient Weighting (DyWeight), a lightweight, learning-based multi-step solver that introduces a streamlined implicit coupling paradigm. By relaxing classical numerical constraints, DyWeight learns unconstrained time-varying parameters that adaptively aggregate historical gradients while intrinsically scaling the effective step size. This implicit time calibration accurately aligns the solver's numerical trajectory with the model's internal denoising dynamics under large integration steps, avoiding complex decoupled parameterizations and optimizations. Extensive experiments on CIFAR-10, FFHQ, AFHQv2, ImageNet64, LSUN-Bedroom, Stable Diffusion and FLUX.1-dev demonstrate that DyWeight achieves superior visual fidelity and stability with significantly fewer function evaluations, establishing a new state-of-the-art among efficient diffusion solvers. Code is available at https://github.com/Westlake-AGI-Lab/DyWeight
[100] SemiTooth: a Generalizable Semi-supervised Framework for Multi-Source Tooth Segmentation
Muyi Sun,Yifan Gao,Ziang Jia,Xingqun Qi,Qianli Zhang,Qian Liu,Tianzheng Deng
Main category: cs.CV
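TL;DR: This paper proposes SemiTooth, a semi-supervised tooth structure segmentation framework for multi-source CBCT data. It builds the multi-source semi-supervised dataset MS3Toothset and designs a multi-teacher, multi-student architecture with a stricter weighted-confidence constraint, improving both the utilization of unlabeled and cross-institution data and segmentation accuracy.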
Details
Motivation: CBCT tooth segmentation is hampered by the difficulty of obtaining fully annotated data and by acquisition differences across sources, which lead to poor data utilization, voxel-level inconsistency, and domain bias; multi-source unlabeled data must be used efficiently.
Method: Builds the multi-source semi-supervised tooth dataset MS3Toothset (with data at three annotation levels); designs the multi-teacher, multi-student semi-supervised framework SemiTooth, in which each student network learns from the unlabeled data of its own source under the supervision of a dedicated teacher; introduces a stricter weighted-confidence constraint to improve multi-source accuracy.
Result: Experiments on MS3Toothset confirm the effectiveness of SemiTooth, which achieves SOTA performance on semi-supervised and multi-source tooth segmentation.
Conclusion: SemiTooth offers a general, robust, and efficient solution for multi-source semi-supervised tooth segmentation on clinical CBCT, markedly improving the utilization of cross-institution data and segmentation consistency.
Abstract: With the rapid advancement of artificial intelligence, intelligent dentistry for clinical diagnosis and treatment has become increasingly promising. As the primary clinical dentistry task, tooth structure segmentation for Cone-Beam Computed Tomography (CBCT) has made significant progress in recent years. However, challenges arise from the difficulty of obtaining fully annotated data and the acquisition variability of multi-source data across different institutions, which have caused low-quality utilization, voxel-level inconsistency, and domain-specific disparity in CBCT slices. Thus, the rational and efficient utilization of multi-source and unlabeled data represents a pivotal problem. In this paper, we propose SemiTooth, a generalizable semi-supervised framework for multi-source tooth segmentation. Specifically, we first compile MS3Toothset, a Multi-Source Semi-Supervised Tooth DataSet for clinical dental CBCT, which contains data from three sources with different-level annotations. Then, we design a multi-teacher and multi-student framework, i.e., SemiTooth, which promotes semi-supervised learning for multi-source data. SemiTooth employs distinct student networks that learn from unlabeled data from different sources, each supervised by its respective teacher. Furthermore, a Stricter Weighted-Confidence Constraint is introduced for multiple teachers to improve the multi-source accuracy. Extensive experiments are conducted on MS3Toothset to verify the feasibility and superiority of the SemiTooth framework, which achieves SOTA performance in the semi-supervised and multi-source tooth segmentation scenario.
[101] Noise-aware few-shot learning through bi-directional multi-view prompt alignment
Lu Niu,Cheng Xue
Main category: cs.CV
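TL;DR: This paper proposes the NA-MVP framework, which achieves noise-aware few-shot learning through bi-directional multi-view prompt alignment and improves the robustness of vision-language models to label noise.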
Details
Motivation: Existing vision-language few-shot methods are easily disrupted by noisy labels and lack the ability to model fine-grained semantic cues and to adaptively separate clean from noisy signals.
Method: The NA-MVP framework comprises (1) multi-view prompts combined with unbalanced optimal transport for fine-grained region-level alignment that suppresses unreliable regions; (2) a bi-directional prompt design that captures clean-oriented and noise-aware cues separately; and (3) an alignment-guided selective refinement strategy that corrects only mislabeled samples.
Result: NA-MVP significantly outperforms existing SOTA methods on synthetic and real-world noisy datasets, confirming its robustness for few-shot learning under noisy supervision.
Conclusion: Through region-aware alignment, bi-directional prompts, and selective refinement, NA-MVP effectively improves the robustness of few-shot vision-language models to label noise and offers a new paradigm for cross-modal learning in noisy settings.
Abstract: Vision-language models offer strong few-shot capability through prompt tuning but remain vulnerable to noisy labels, which can corrupt prompts and degrade cross-modal alignment. Existing approaches struggle because they often lack the ability to model fine-grained semantic cues and to adaptively separate clean from noisy signals. To address these challenges, we propose NA-MVP, a framework for Noise-Aware few-shot learning through bi-directional Multi-View Prompt alignment. NA-MVP is built upon a key conceptual shift: robust prompt learning requires moving from global matching to region-aware alignment that explicitly distinguishes clean cues from noisy ones. To realize this, NA-MVP employs (1) multi-view prompts combined with unbalanced optimal transport to achieve fine-grained patch-to-prompt correspondence while suppressing unreliable regions; (2) a bi-directional prompt design that captures complementary clean-oriented and noise-aware cues, enabling the model to focus on stable semantics; and (3) an alignment-guided selective refinement strategy that uses optimal transport to correct only mislabeled samples while retaining reliable data. Experiments on synthetic and real-world noisy benchmarks demonstrate that NA-MVP consistently outperforms state-of-the-art baselines, confirming its effectiveness in enabling robust few-shot learning under noisy supervision.
[102] Shape-of-You: Fused Gromov-Wasserstein Optimal Transport for Semantic Correspondence in-the-Wild
Jiin Im,Sisung Liu,Je Hyeong Hong
Main category: cs.CV
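TL;DR: This paper proposes the Shape-of-You (SoY) framework, which improves unsupervised semantic correspondence learning by combining Fused Gromov-Wasserstein (FGW) optimization with geometric priors from a 3D foundation model. It overcomes the tendency of nearest-neighbor pseudo-labeling to ignore structural relationships and geometric ambiguities, uses anchor-based linearization and a soft-target loss for robustness, and reaches SOTA on SPair-71k and AP-10k.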
Details
Motivation: Existing unsupervised semantic correspondence methods built on 2D foundation models and nearest-neighbor pseudo-labels are limited to local appearance matching; they cannot handle geometric ambiguities caused by symmetry or repetitive textures and ignore structural relationships within an image.
Method: Formulates pseudo-label generation as a Fused Gromov-Wasserstein (FGW) problem that jointly optimizes inter-feature similarity and intra-structural consistency; uses a 3D foundation model to define the intrinsic structure in geometric space; approximates the computationally expensive FGW problem through anchor-based linearization; designs a soft-target loss that dynamically blends the noisy pseudo-labels with network predictions.
Result: Achieves the best reported performance on SPair-71k and AP-10k, markedly improving semantic correspondence accuracy without explicit geometric annotations.
Conclusion: SoY is the first to combine FGW with 3D geometric priors for unsupervised semantic correspondence, confirming that modeling structural consistency is key to resolving geometric ambiguity and offering a new paradigm for the task.
Abstract: Semantic correspondence is essential for handling diverse in-the-wild images lacking explicit correspondence annotations. While recent 2D foundation models offer powerful features, adapting them for unsupervised learning via nearest-neighbor pseudo-labels has key limitations: it operates locally, ignoring structural relationships, and consequently its reliance on 2D appearance fails to resolve geometric ambiguities arising from symmetries or repetitive features. In this work, we address this by reformulating pseudo-label generation as a Fused Gromov-Wasserstein (FGW) problem, which jointly optimizes inter-feature similarity and intra-structural consistency. Our framework, Shape-of-You (SoY), leverages a 3D foundation model to define this intra-structure in the geometric space, resolving the abovementioned ambiguity. However, since FGW is a computationally prohibitive quadratic problem, we approximate it through anchor-based linearization. The resulting probabilistic transport plan provides a structurally consistent but noisy supervisory signal. Thus, we introduce a soft-target loss dynamically blending guidance from this plan with network predictions to build a learning framework robust to this noise. SoY achieves state-of-the-art performance on SPair-71k and AP-10k datasets, establishing a new benchmark in semantic correspondence without explicit geometric annotations. Code is available at Shape-of-You.
[103] MedPruner: Training-Free Hierarchical Token Pruning for Efficient 3D Medical Image Understanding in Vision-Language Models
Shengyuan Liu,Zanting Ye,Yunrui Lin,Chen Hu,Wanting Geng,Xu Han,Bulat Ibragimov,Yefeng Zheng,Yixuan Yuan
Main category: cs.CV
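TL;DR: This paper proposes MedPruner, a training-free, model-agnostic hierarchical visual token pruning framework for efficient 3D medical image processing. Its two-stage strategy (inter-slice anchor-based filtering followed by dynamic information nucleus selection) removes redundant tokens, compressing visual tokens to under 5% while maintaining or even improving performance and sharply cutting computational cost.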
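The second stage selects tokens by quantifying cumulative attention weights, which resembles nucleus (top-p) selection. The sketch below is one plausible reading under that assumption, not the released MedPruner code: the function name, the `mass` threshold, and the toy attention scores are all hypothetical.

```python
def nucleus_select(attention, mass=0.9):
    """Keep the smallest set of tokens whose normalized attention reaches `mass`.

    `attention` is a list of non-negative scores, one per visual token.
    Returns the kept token indices in their original order.
    """
    total = sum(attention)
    # Visit tokens from most to least attended.
    order = sorted(range(len(attention)), key=lambda i: attention[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += attention[i] / total
        if cumulative >= mass:
            break
    return sorted(kept)

# A slice with concentrated attention keeps few tokens ...
peaked = nucleus_select([50, 5, 30, 5, 10], mass=0.75)
# ... while a slice with evenly spread attention keeps more.
flat = nucleus_select([1, 1, 1, 1], mass=0.75)
```

The number of retained tokens is not fixed in advance: it adapts to how concentrated the attention is on each slice, which is the property the entry contrasts with fixed pruning ratios.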
Details
Motivation: Existing 3D medical vision-language models concatenate 2D slices directly, which causes severe anatomical redundancy, and fixed pruning ratios cannot adapt to the varying information density across slices; the resulting inefficiency limits clinical deployment.
Method: The MedPruner framework has two stages. The first is an inter-slice anchor-based filtering module that removes slice-level temporal redundancy; the second is a dynamic information nucleus selection strategy that performs adaptive token-level compression based on cumulative attention weights. The method requires no training and is compatible with multiple models.
Result: Validation on three 3D medical benchmarks and three different medical VLMs reveals massive token redundancy in existing models; MedPruner lets models such as MedGemma maintain or exceed their original performance while retaining fewer than 5% of visual tokens, substantially reducing computational cost.
Conclusion: Dynamic token selection is essential for 3D medical image understanding, and MedPruner offers a new paradigm for efficient, practical clinical deployment of medical VLMs.
Abstract: While specialized Medical Vision-Language Models (VLMs) have achieved remarkable success in interpreting 2D and 3D medical modalities, their deployment for 3D volumetric data remains constrained by significant computational inefficiencies. Current architectures typically suffer from massive anatomical redundancy due to the direct concatenation of consecutive 2D slices and lack the flexibility to handle heterogeneous information densities across different slices using fixed pruning ratios. To address these challenges, we propose MedPruner, a training-free and model-agnostic hierarchical token pruning framework specifically designed for efficient 3D medical image understanding. MedPruner introduces a two-stage mechanism: an Inter-slice Anchor-based Filtering module to eliminate slice-level temporal redundancy, followed by a Dynamic Information Nucleus Selection strategy that achieves adaptive token-level compression by quantifying cumulative attention weights. Extensive experiments on three 3D medical benchmarks and across three diverse medical VLMs reveal massive token redundancy in existing architectures. Notably, MedPruner enables models such as MedGemma to maintain or even exceed their original performance while retaining fewer than 5% of visual tokens, thereby drastically reducing computational overhead and validating the necessity of dynamic token selection for practical clinical deployment. Our code will be released.
[104] Developing Foundation Models for Universal Segmentation from 3D Whole-Body Positron Emission Tomography
Yichi Zhang,Le Xue,Wenbo Zhang,Lanlan Li,Feiyang Xiao,Yuchen Liu,Xiaohui Zhang,Hongwei Zhang,Shuqi Wang,Gang Feng,Liling Peng,Xin Gao,Yuanfan Xu,Yuan Qi,Kuangyu Shi,Hong Zhang,Yuan Cheng,Mei Tian,Zixin Hu
Main category: cs.CV
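TL;DR: This paper proposes SegAnyPET, a universal segmentation foundation model for 3D whole-body PET imaging, trained on the largest and most comprehensive PET dataset to date (11041 scans, 59831 masks). It supports zero-shot organ and lesion segmentation across centers, tracers, and diseases, as well as human-in-the-loop clinical workflows.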
Details
Motivation: PET images have low anatomical contrast and are costly to annotate, which has held back deep learning for quantitative PET analysis; a general, scalable segmentation foundation model is needed.
Method: Builds a large-scale 3D whole-body PET dataset; designs SegAnyPET, a universal segmentation foundation model based on a 3D architecture and prompt engineering that supports diverse segmentation tasks and fast human correction.
Result: Shows strong zero-shot segmentation performance on multi-center, multi-tracer, multi-disease datasets and supports an efficient human-in-the-loop clinical workflow.
Conclusion: SegAnyPET offers a new paradigm for general, scalable, clinic-ready automatic segmentation of PET images and may advance the clinical application of molecular imaging.
Abstract: Positron emission tomography (PET) is a key nuclear medicine imaging modality that visualizes radiotracer distributions to quantify in vivo physiological and metabolic processes, playing an irreplaceable role in disease management. Despite its clinical importance, the development of deep learning models for quantitative PET image analysis remains severely limited, driven by both the inherent segmentation challenge from PET's paucity of anatomical contrast and the high costs of data acquisition and annotation. To bridge this gap, we develop generalist foundational models for universal segmentation from 3D whole-body PET imaging. We first build the largest and most comprehensive PET dataset to date, comprising 11041 3D whole-body PET scans with 59831 segmentation masks for model development. Based on this dataset, we present SegAnyPET, an innovative foundational model with general-purpose applicability to diverse segmentation tasks. Built on a 3D architecture with a prompt engineering strategy for mask generation, SegAnyPET enables universal and scalable organ and lesion segmentation, supports efficient human correction with minimal effort, and enables a clinical human-in-the-loop workflow. Extensive evaluations on multi-center, multi-tracer, multi-disease datasets demonstrate that SegAnyPET achieves strong zero-shot performance across a wide range of segmentation tasks, highlighting its potential to advance the clinical applications of molecular imaging.
[105] MV-SAM3D: Adaptive Multi-View Fusion for Layout-Aware 3D Generation
Baicheng Li,Dong Wu,Jun Li,Shunkai Zhou,Zecui Zeng,Lusong Li,Hongbin Zha
Main category: cs.CV
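TL;DR: This paper proposes MV-SAM3D, a training-free, multi-view, layout-aware 3D generation framework that uses multi-diffusion fusion and physics-aware optimization to improve the fidelity and layout plausibility of multi-object scene reconstruction.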
Details
Motivation: Existing single-view layout-aware 3D generation methods cannot exploit complementary multi-view information, and estimating object poses independently often yields physically implausible layouts such as interpenetration and floating objects.
Method: Proposes a Multi-Diffusion process for multi-view fusion (with attention-entropy weighting and visibility weighting) and physics-aware optimization (with collision and contact constraints), all without additional training.
Result: Significantly improves reconstruction fidelity and layout plausibility on standard benchmarks and real-world multi-object scenes.
Conclusion: MV-SAM3D confirms the effectiveness of jointly modeling training-free multi-view fusion and physical constraints, suggesting a new direction for practical scene-level 3D generation.
Abstract: Recent unified 3D generation models have made remarkable progress in producing high-quality 3D assets from a single image. Notably, layout-aware approaches such as SAM3D can reconstruct multiple objects while preserving their spatial arrangement, opening the door to practical scene-level 3D generation. However, current methods are limited to single-view input and cannot leverage complementary multi-view observations, while independently estimated object poses often lead to physically implausible layouts such as interpenetration and floating artifacts. We present MV-SAM3D, a training-free framework that extends layout-aware 3D generation with multi-view consistency and physical plausibility. We formulate multi-view fusion as a Multi-Diffusion process in 3D latent space and propose two adaptive weighting strategies, attention-entropy weighting and visibility weighting, that enable confidence-aware fusion, ensuring each viewpoint contributes according to its local observation reliability. For multi-object composition, we introduce physics-aware optimization that injects collision and contact constraints both during and after generation, yielding physically plausible object arrangements. Experiments on standard benchmarks and real-world multi-object scenes demonstrate significant improvements in reconstruction fidelity and layout plausibility, all without any additional training. Code is available at https://github.com/devinli123/MV-SAM3D.
[106] Tokenization Allows Multimodal Large Language Models to Understand, Generate and Edit Architectural Floor Plans
Sizhong Qin,Ramon Elias Weber,Xinzheng Lu
Main category: cs.CV
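TL;DR: This paper proposes HouseMind, a multimodal large language model that uses discrete room-instance tokens to form a unified vocabulary, enabling understanding, generation, and editing of architectural floor plans with markedly better geometric validity and controllability.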
Details
Motivation: Current AI systems for floor plan design struggle to reason jointly over geometry, semantics, and spatial hierarchy, and fall short in spatial consistency and controllable generation.
Method: Proposes the HouseMind model, which introduces discrete room-instance tokens to build a unified vocabulary and combines multimodal alignment with instruction tuning to synthesize coherent, controllable layouts from text instructions.
Result: Experiments show that the framework outperforms existing methods in geometric validity and controllability while remaining efficient and locally deployable.
Conclusion: HouseMind unifies floor plan understanding, generation, and editing, giving architectural AI a more robust, controllable, and practical solution.
Abstract: Architectural floor plan design demands joint reasoning over geometry, semantics, and spatial hierarchy, which remains a major challenge for current AI systems. Although recent diffusion and language models improve visual fidelity, they still struggle with coherent spatial reasoning and controllable generation. We present HouseMind, a multimodal large language model that unifies floor plan understanding, generation, and editing in one framework. We introduce discrete room-instance tokens to construct a unified vocabulary that bridges layouts and symbolic reasoning. With multimodal alignment and instruction tuning, the model synthesizes coherent, controllable layouts from text instructions. Experiments show that the framework achieves superior geometric validity and controllability while remaining efficient and locally deployable.
[107] IDRL: An Individual-Aware Multimodal Depression-Related Representation Learning Framework for Depression Diagnosis
Chongxiao Wang,Junjie Liang,Peng Cao,Jinzhu Yang,Osmar R. Zaiane
Main category: cs.CV
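TL;DR: This paper proposes the IDRL framework, which disentangles multimodal representations and adds an individual-aware modality-fusion module to address cross-modal inconsistency, interfering information, and individual differences, improving the robustness and accuracy of depression detection.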
Details
Motivation: Existing methods suffer from inconsistent depression cues across modalities, large amounts of irrelevant interfering information, and wide variation in how depression presents across individuals, which changes the importance of each modality; these issues undermine multimodal fusion and diagnostic reliability.
Method: The IDRL framework (1) disentangles multimodal representations into a modality-common depression space, a modality-specific depression space, and a depression-unrelated space; and (2) introduces an individual-aware modality-fusion module (IAF) that dynamically weights the depression-related features for adaptive cross-modal fusion.
Result: Extensive experiments show that IDRL achieves better and more robust performance on multimodal depression detection.
Conclusion: Through disentangled representations and individual-aware fusion, IDRL mitigates the three challenges of modality conflict, interference, and individual differences, offering a new paradigm for reliable depression identification.
Abstract: Depression is a severe mental disorder, and reliable identification plays a critical role in early intervention and treatment. Multimodal depression detection aims to improve diagnostic performance by jointly modeling complementary information from multiple modalities. Recently, numerous multimodal learning approaches have been proposed for depression analysis; however, these methods suffer from the following limitations: 1) inter-modal inconsistency and depression-unrelated interference, where depression-related cues may conflict across modalities while substantial irrelevant content obscures critical depressive signals, and 2) diverse individual depressive presentations, leading to individual differences in modality and cue importance that hinder reliable fusion. To address these issues, we propose the Individual-aware Multimodal Depression-related Representation Learning Framework (IDRL) for robust depression diagnosis. Specifically, IDRL 1) disentangles multimodal representations into a modality-common depression space, a modality-specific depression space, and a depression-unrelated space to enhance modality alignment while suppressing irrelevant information, and 2) introduces an individual-aware modality-fusion module (IAF) that dynamically adjusts the weights of disentangled depression-related features based on their predictive significance, thereby achieving adaptive cross-modal fusion for different individuals. Extensive experiments demonstrate that IDRL achieves superior and robust performance for multimodal depression detection.
[108] OSCBench: Benchmarking Object State Change in Text-to-Video Generation
Xianjing Han,Bin Zhu,Shiqi Hu,Franklin Mingzhe Li,Patrick Carrington,Roger Zimmermann,Jingjing Chen
Main category: cs.CV
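TL;DR: This paper proposes OSCBench, a benchmark that evaluates how well text-to-video (T2V) models handle action-induced object state change (OSC). Existing models perform poorly on OSC, especially in novel and compositional scenarios, which identifies OSC as a key bottleneck in current T2V generation.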
Details
Motivation: Existing T2V benchmarks overlook object state change (OSC) explicitly specified in the text, a key dimension of action understanding that is essential for modeling real-world actions.
Method: Builds OSCBench from instructional cooking data, covering regular, novel, and compositional action-object interaction scenarios; systematically evaluates six mainstream open-source and commercial T2V models using both human evaluation and automatic evaluation with a multimodal large language model (MLLM).
Result: All evaluated T2V models perform poorly on OSC, struggling in particular to produce accurate, temporally consistent state changes in novel and compositional scenarios, even though they do well on semantic alignment and scene consistency.
Conclusion: Object state change (OSC) is a core bottleneck in current T2V generation, and OSCBench provides an important diagnostic tool for developing state-aware video generation models.
Abstract: Text-to-video (T2V) generation models have made rapid progress in producing visually high-quality and temporally coherent videos. However, existing benchmarks primarily focus on perceptual quality, text-video alignment, or physical plausibility, leaving a critical aspect of action understanding largely unexplored: object state change (OSC) explicitly specified in the text prompt. OSC refers to the transformation of an object's state induced by an action, such as peeling a potato or slicing a lemon. In this paper, we introduce OSCBench, a benchmark specifically designed to assess OSC performance in T2V models. OSCBench is constructed from instructional cooking data and systematically organizes action-object interactions into regular, novel, and compositional scenarios to probe both in-distribution performance and generalization. We evaluate six representative open-source and proprietary T2V models using both a human user study and multimodal large language model (MLLM)-based automatic evaluation. Our results show that, despite strong performance on semantic and scene alignment, current T2V models consistently struggle with accurate and temporally consistent object state changes, especially in novel and compositional settings. These findings position OSC as a key bottleneck in text-to-video generation and establish OSCBench as a diagnostic benchmark for advancing state-aware video generation models.
[109] FL-MedSegBench: A Comprehensive Benchmark for Federated Learning on Medical Image Segmentation
Meilu Zhu,Zhiwei Wang,Axiu Mao,Yuxing Li,Xiaohan Xing,Yixuan Yuan,Edmund Y. Lam
Main category: cs.CV
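TL;DR: This paper presents FL-MedSegBench, the first federated learning benchmark for medical image segmentation, covering 9 tasks, 10 modalities, and both 2D and 3D data. It systematically evaluates 8 generic and 5 personalized FL methods on accuracy, fairness, communication efficiency, convergence, and domain generalization. Key findings: personalized methods such as FedBN have a clear advantage, no method is best everywhere, and methods differ in communication robustness and fairness.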
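FedBN, the example of client-specific batch normalization named in this entry, averages the shared weights across clients while leaving each client's batch-normalization parameters local. The toy sketch below illustrates that aggregation rule under simplifying assumptions: parameters are plain floats rather than tensors, averaging is unweighted, and BN parameters are identified by a hypothetical naming convention (`"bn"` in the parameter name). It is not the benchmark's toolkit.

```python
def fedbn_aggregate(client_states, is_bn=lambda name: "bn" in name):
    """Average non-BN parameters across clients; keep BN parameters per client.

    `client_states` is a list of dicts mapping parameter names to floats.
    Returns one updated state dict per client.
    """
    shared_names = [name for name in client_states[0] if not is_bn(name)]
    averaged = {
        name: sum(state[name] for state in client_states) / len(client_states)
        for name in shared_names
    }
    # Each client receives the averaged shared weights but keeps its own BN values.
    return [{**state, **averaged} for state in client_states]

clients = [
    {"conv.weight": 1.0, "bn.weight": 10.0},
    {"conv.weight": 3.0, "bn.weight": 20.0},
]
updated = fedbn_aggregate(clients)
```

After aggregation both clients share the same `conv.weight`, while their `bn.weight` values still reflect local feature statistics. Keeping normalization local is the usual explanation for why this family of methods copes with heterogeneous scanners and sites.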
Details
Motivation: The lack of a standardized federated learning benchmark for medical image segmentation makes method evaluation unfair and incomplete.
Method: Builds the FL-MedSegBench benchmark with nine segmentation tasks, ten imaging modalities, 2D and 3D data, and clinical heterogeneity; systematically evaluates eight generic FL and five personalized FL methods on segmentation accuracy, fairness, communication efficiency, convergence behavior, and generalization to unseen domains.
Result: (i) Personalized FL (e.g., FedBN) consistently outperforms generic FL; (ii) no single method dominates on all datasets; (iii) normalization-based personalization methods are highly robust to reduced communication frequency; (iv) methods such as Ditto and FedRDN better protect underperforming clients; (v) a method's generalization to unseen domains correlates strongly with its overall performance on participating clients.
Conclusion: FL-MedSegBench provides the first comprehensive, reproducible evaluation benchmark for federated learning on medical image segmentation, reveals key practical patterns, and releases an open-source toolkit to advance clinically usable FL solutions.
Abstract: Federated learning (FL) offers a privacy-preserving paradigm for collaborative medical image analysis without sharing raw data. However, the absence of standardized benchmarks for medical image segmentation hinders fair and comprehensive evaluation of FL methods. To address this gap, we introduce FL-MedSegBench, the first comprehensive benchmark for federated learning on medical image segmentation. Our benchmark encompasses nine segmentation tasks across ten imaging modalities, covering both 2D and 3D formats with realistic clinical heterogeneity. We systematically evaluate eight generic FL (gFL) and five personalized FL (pFL) methods across multiple dimensions: segmentation accuracy, fairness, communication efficiency, convergence behavior, and generalization to unseen domains. Extensive experiments reveal several key insights: (i) pFL methods, particularly those with client-specific batch normalization (e.g., FedBN), consistently outperform generic approaches; (ii) no single method universally dominates, with performance being dataset-dependent; (iii) communication frequency analysis shows normalization-based personalization methods exhibit remarkable robustness to reduced communication frequency; (iv) fairness evaluation identifies methods like Ditto and FedRDN that protect underperforming clients; (v) a method's generalization to unseen domains is strongly tied to its ability to perform well across participating clients. We will release an open-source toolkit to foster reproducible research and accelerate clinically applicable FL solutions, providing empirically grounded guidelines for real-world clinical deployment. The source code is available at https://github.com/meiluzhu/FL-MedSegBench.
[110] Think While Watching: Online Streaming Segment-Level Memory for Multi-Turn Video Reasoning in Multimodal Large Language Models
Lu Wang,Zhuoran Jin,Yupu Hao,Yubo Chen,Kang Liu,Yulong Ao,Jun Zhao
Main category: cs.CV
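TL;DR: This paper proposes the Think While Watching framework, a memory-anchored streaming video reasoning method that lets watching and thinking proceed in parallel, improves long-range dependency modeling in multi-turn interaction, and delivers gains on several benchmarks.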
Details
Motivation: Existing MLLMs for streaming video understanding are limited to offline inference or reason weakly online, cannot perceive and generate concurrently, and lose memory as the stream grows, so they struggle to support multi-turn interaction over a continuous video stream.
Method: Proposes Think While Watching, a memory-anchored streaming video reasoning framework; builds a three-stage, multi-round chain-of-thought dataset; adopts a stage-matched training strategy; enforces strict causality with a segment-level streaming causal mask and streaming positional encoding; at inference, uses an efficient pipeline that overlaps watching and thinking and adaptively selects the best attention backend.
Result: Built on Qwen3-VL, the method improves single-round accuracy by 2.6% on StreamingBench and 3.79% on OVO-Bench; in the multi-round setting it maintains performance while reducing output tokens by 56%.
Conclusion: Think While Watching effectively addresses memory retention and long-range dependency modeling in multi-turn streaming video interaction, markedly improving the performance and efficiency of online video understanding.
Abstract: Multimodal large language models (MLLMs) have shown strong performance on offline video understanding, but most are limited to offline inference or have weak online reasoning, making multi-turn interaction over continuously arriving video streams difficult. Existing streaming methods typically use an interleaved perception-generation paradigm, which prevents concurrent perception and generation and leads to early memory decay as streams grow, hurting long-range dependency modeling. We propose Think While Watching, a memory-anchored streaming video reasoning framework that preserves continuous segment-level memory during multi-turn interaction. We build a three-stage, multi-round chain-of-thought dataset and adopt a stage-matched training strategy, while enforcing strict causality through a segment-level streaming causal mask and streaming positional encoding. During inference, we introduce an efficient pipeline that overlaps watching and thinking and adaptively selects the best attention backend. Under both single-round and multi-round streaming input protocols, our method achieves strong results. Built on Qwen3-VL, it improves single-round accuracy by 2.6% on StreamingBench and by 3.79% on OVO-Bench. In the multi-round setting, it maintains performance while reducing output tokens by 56%. Code is available at: https://github.com/wl666hhh/Think_While_Watching/
[111] BackdoorIDS: Zero-shot Backdoor Detection for Pretrained Vision Encoder
Siquan Huang,Yijiang Li,Ningzhi Gao,Xingfu Yan,Leyu Shi
Main category: cs.CV
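TL;DR: This paper proposes BackdoorIDS, a zero-shot, inference-time method for detecting backdoor samples in pretrained vision encoders. It exploits the attention hijacking and restoration phenomenon, identifying backdoor samples from changes in the embedding sequence along an input masking trajectory using density-based clustering; it needs no retraining and is compatible with many architectures.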
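The detection rule (cluster the embeddings collected along the masking trajectory and flag the input if more than one cluster appears) can be sketched as follows. This is a minimal illustration under stated assumptions: the embeddings are toy 2D vectors rather than real encoder outputs, the small pure-Python DBSCAN stands in for a library implementation, and the `eps` and `min_samples` values are arbitrary, not the paper's settings.

```python
import math

def dbscan(points, eps, min_samples):
    """Minimal DBSCAN. Returns one label per point; -1 marks noise."""
    n = len(points)
    labels = [None] * n  # None means not yet visited

    def neighbors(i):
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point is a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_samples:  # j is a core point: keep expanding
                queue.extend(reachable)
    return labels

def is_backdoored(embedding_sequence, eps=0.15, min_samples=2):
    """Flag an input whose masking-trajectory embeddings form more than one cluster."""
    labels = dbscan(embedding_sequence, eps, min_samples)
    return len({label for label in labels if label != -1}) > 1

# Toy trajectories: one embedding per masking ratio (hypothetical values).
clean = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (0.3, 0.0), (0.4, 0.0)]
hijacked = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (5.0, 5.0), (5.1, 5.0), (5.2, 5.0)]
```

The clean trajectory drifts smoothly and stays in one cluster, whereas the hijacked trajectory jumps once the trigger is masked out and therefore splits into two. In practice each embedding would come from running the encoder on a progressively masked copy of the image.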
Details
Motivation: Downstream users often rely on third-party pretrained vision encoders of uncertain provenance and are exposed to backdoor attacks; most existing defenses require retraining or cannot be applied zero-shot.
Method: Building on the attention hijacking and restoration phenomenon, the input image is progressively masked, the image embedding is extracted at each masking ratio, and a density-based clustering algorithm such as DBSCAN is applied to the embedding sequence; if the sequence forms more than one cluster, the input is flagged as a backdoor sample.
Result: BackdoorIDS consistently outperforms existing defenses across attack types, datasets, and model families; it is zero-shot, requires no retraining, is plug-and-play, and is compatible with architectures including CNNs, ViTs, CLIP, and LLaVA-1.5.
Conclusion: BackdoorIDS is an efficient, general, and practical zero-shot backdoor detection method that provides a reliable safeguard when deploying third-party vision encoders.
Abstract: Self-supervised and multimodal vision encoders learn strong visual representations that are widely adopted in downstream vision tasks and large vision-language models (LVLMs). However, downstream users often rely on third-party pretrained encoders with uncertain provenance, exposing them to backdoor attacks. In this work, we propose BackdoorIDS, a simple yet effective zero-shot, inference-time backdoor sample detection method for pretrained vision encoders. BackdoorIDS is motivated by two observations: Attention Hijacking and Restoration. Under progressive input masking, a backdoored image initially concentrates attention on malicious trigger features. Once the masking ratio exceeds the trigger's robustness threshold, the trigger is deactivated, and attention rapidly shifts to benign content. This transition induces a pronounced change in the image embedding, whereas embeddings of clean images evolve more smoothly across masking progress. BackdoorIDS operationalizes this signal by extracting an embedding sequence along the masking trajectory and applying density-based clustering such as DBSCAN. An input is flagged as backdoored if its embedding sequence forms more than one cluster. Extensive experiments show that BackdoorIDS consistently outperforms existing defenses across diverse attack types, datasets, and model families. Notably, it is a plug-and-play approach that requires no retraining and operates fully zero-shot at inference time, making it compatible with a wide range of encoder architectures, including CNNs, ViTs, CLIP, and LLaVA-1.5.
[112] Linking Perception, Confidence and Accuracy in MLLMs
Yuetian Du,Yucheng Wang,Rongyu Zhang,Zhijie Xu,Boyu Yang,Ming Kong,Jie Liu,Qiang Zhu
Main category: cs.CV
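TL;DR: This paper proposes Confidence-Driven Reinforcement Learning (CDRL) and Confidence-Aware Test-Time Scaling (CA-TTS) to address the poor confidence calibration common in multimodal large language models (MLLMs), achieving consistent gains of 8.8% across several benchmarks.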
Details
Motivation: Existing MLLMs have improved visual perception accuracy but lack awareness of their own uncertainty, that is, the ability to know when they do not know, and show severe confidence miscalibration.
Method: Proposes Confidence-Driven Reinforcement Learning (CDRL), which uses original-noise image pairs and a novel confidence-based reward; further designs Confidence-Aware Test-Time Scaling (CA-TTS), in which an expert model dynamically coordinates self-consistency, self-reflection, and visual self-check modules.
Result: Achieves consistent 8.8% gains on four benchmarks; ablation studies confirm the effectiveness of each module and of the scaling strategy.
Conclusion: Confidence calibration is not only a training-time optimization but also a free lunch that enables test-time scaling; the proposed framework offers a new paradigm for improving the reliability and robustness of MLLMs.
Abstract: Recent advances in Multi-modal Large Language Models (MLLMs) have predominantly focused on enhancing visual perception to improve accuracy. However, a critical question remains unexplored: Do models know when they do not know? Through a probing experiment, we reveal a severe confidence miscalibration problem in MLLMs. To address this, we propose Confidence-Driven Reinforcement Learning (CDRL), which uses original-noise image pairs and a novel confidence-based reward to enhance perceptual sensitivity and robustly calibrate the model's confidence. Beyond training benefits, calibrated confidence enables more effective test-time scaling as a free lunch. We further propose Confidence-Aware Test-Time Scaling (CA-TTS), which dynamically coordinates Self-Consistency, Self-Reflection, and Visual Self-Check modules guided by confidence signals. An Expert Model acts in multiple roles (e.g., Planner, Critic, Voter) to schedule these modules and provide external verification. Our integrated framework establishes new state-of-the-art results with consistent 8.8% gains across four benchmarks. More ablation studies demonstrate the effectiveness of each module and scaling superiority.
[113] PROMO: Promptable Outfitting for Efficient High-Fidelity Virtual Try-On
Haohua Chen,Tianze Zhou,Wei Zhu,Runqi Wang,Yandong Guan,Dejia Song,Yibo Chen,Xu Tang,Yao Hu,Lu Sheng,Zhiyong Wu
Main category: cs.CV
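TL;DR: This paper proposes PROMO, a framework built on a Flow Matching DiT architecture with latent multi-modal conditional concatenation and a self-reference mechanism, which delivers high-quality, efficient virtual try-on (VTON) and a better balance between fidelity and inference speed.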
Details
Motivation: Addresses the complex architectures, slow sampling, and difficulty of reconciling fidelity with efficiency in existing diffusion-based VTON methods; frames VTON as a structured image editing task and exploits the supervisory value of its paired data for general image editing.
Method: Proposes the PROMO framework: a Flow Matching DiT backbone, multi-modal conditional concatenation in latent space (e.g., pose, garment, and human mask), and a self-reference mechanism to accelerate inference; the training strategy emphasizes subject preservation, faithful texture transfer, and seamless harmonization.
Result: On standard VTON benchmarks, PROMO surpasses prior VTON methods and general image editing models in visual fidelity while substantially reducing inference overhead, achieving a better trade-off between quality and speed.
Conclusion: Flow-matching transformers combined with latent multi-modal conditioning and a self-reference mechanism are an effective paradigm for efficient, high-quality virtual try-on, with potential to transfer to general image editing.
Abstract: Virtual Try-on (VTON) has become a core capability for online retail, where realistic try-on results provide reliable fit guidance, reduce returns, and benefit both consumers and merchants. Diffusion-based VTON methods achieve photorealistic synthesis, yet often rely on intricate architectures such as auxiliary reference networks and suffer from slow sampling, making the trade-off between fidelity and efficiency a persistent challenge. We approach VTON as a structured image editing problem that demands strong conditional generation under three key requirements: subject preservation, faithful texture transfer, and seamless harmonization. Under this perspective, our training framework is generic and transfers to broader image editing tasks. Moreover, the paired data produced by VTON constitutes a rich supervisory resource for training general-purpose editors. We present PROMO, a promptable virtual try-on framework built upon a Flow Matching DiT backbone with latent multi-modal conditional concatenation. By leveraging conditioning efficiency and self-reference mechanisms, our approach substantially reduces inference overhead. On standard benchmarks, PROMO surpasses both prior VTON methods and general image editing models in visual fidelity while delivering a competitive balance between quality and speed. These results demonstrate that flow-matching transformers, coupled with latent multi-modal conditioning and self-reference acceleration, offer an effective and training-efficient solution for high-quality virtual try-on.
[114] EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models
Xuanlang Dai,Yujie Zhou,Long Xing,Jiazi Bu,Xilin Wei,Yuhong Liu,Beichen Zhang,Kai Chen,Yuhang Zang
Main category: cs.CV
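TL;DR: This paper proposes the EndoCoT framework, which uses an iterative thought guidance module and a terminal thought grounding module to activate the chain-of-thought ability of multimodal large language models (MLLMs) and dynamically align it with the denoising process of a diffusion transformer (DiT), markedly improving accuracy on complex tasks such as spatial reasoning.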
Details
Motivation: Using an MLLM as the text encoder of a diffusion model has two major shortcomings: single-step encoding cannot elicit chain-of-thought reasoning, so reasoning depth is insufficient; and the guidance signal stays constant during decoding, so complex instructions cannot be decomposed step by step.
Method: Proposes the Endogenous Chain-of-Thought (EndoCoT) framework with two core modules: (1) an iterative thought guidance module that activates the MLLM's deeper reasoning by repeatedly refining latent thought states and maps them onto the DiT's denoising process; (2) a terminal thought grounding module that aligns the final thought state with ground-truth answers so that the reasoning trajectory stays under textual supervision.
Result: Average accuracy of 92.1% across spatial reasoning benchmarks including Maze, TSP, VSP, and Sudoku, 8.3 percentage points above the strongest baseline.
Conclusion: EndoCoT bridges the gap between MLLM reasoning and the diffusion generation process, enabling step-by-step, interpretable, high-accuracy solving of complex tasks.
Abstract: Recently, Multimodal Large Language Models (MLLMs) have been widely integrated into diffusion frameworks primarily as text encoders to tackle complex tasks such as spatial reasoning. However, this paradigm suffers from two critical limitations: (i) The MLLM text encoder exhibits insufficient reasoning depth. Single-step encoding fails to activate the Chain-of-Thought process, which is essential for MLLMs to provide accurate guidance for complex tasks. (ii) The guidance remains invariant during the decoding process. Invariant guidance during decoding prevents DiT from progressively decomposing complex instructions into actionable denoising steps, even with correct MLLM encodings. To this end, we propose Endogenous Chain-of-Thought (EndoCoT), a novel framework that first activates MLLMs' reasoning potential by iteratively refining latent thought states through an iterative thought guidance module, and then bridges these states to the DiT's denoising process. Second, a terminal thought grounding module is applied to ensure the reasoning trajectory remains grounded in textual supervision by aligning the final state with ground-truth answers. With these two components, the MLLM text encoder delivers meticulously reasoned guidance, enabling the DiT to execute it progressively and ultimately solve complex tasks in a step-by-step manner. Extensive evaluations across diverse benchmarks (e.g., Maze, TSP, VSP, and Sudoku) achieve an average accuracy of 92.1%, outperforming the strongest baseline by 8.3 percentage points.
[115] UCAN: Unified Convolutional Attention Network for Expansive Receptive Fields in Lightweight Super-Resolution
Cao Thien Tan,Phan Thi Thu Trang,Do Nghiem Duc,Ho Ngoc Anh,Hanyang Zhuang,Nguyen Duc Dung
Main category: cs.CV
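TL;DR: This paper proposes UCAN, a lightweight hybrid CNN-Transformer network that unifies convolution and attention, introduces Hedgehog Attention and a distillation-based large-kernel module, and uses cross-layer parameter sharing to substantially cut computational cost while keeping accuracy high, making it suitable for image super-resolution on resource-constrained devices.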
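The PSNR figures quoted for this entry follow the standard definition, PSNR = 10 log10(MAX^2 / MSE). A minimal sketch on flat pixel lists is given here; the helper name `psnr` and the 8-bit peak value of 255 are assumptions, and the evaluation protocol behind the reported numbers (colour channel, border cropping) is not stated in the abstract.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists.

    Hypothetical helper for illustration; assumes 8-bit intensities by default.
    """
    if not reference or len(reference) != len(restored):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - s) ** 2 for r, s in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Made-up pixels: each value is off by 10, so MSE = 100.
print(round(psnr([100, 100], [110, 90]), 2))  # prints: 28.13
```

Because the scale is logarithmic, a gain of a few tenths of a dB corresponds to a reduction in mean squared error of several percent.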
Details
Motivation: Existing hybrid CNN-Transformer models perform well on image super-resolution, but enlarging attention windows or convolution kernels significantly increases computational cost, making deployment on resource-constrained devices difficult.
Method: Proposes the UCAN network: (1) combines window-based spatial attention with Hedgehog Attention to capture both local texture and long-range dependencies; (2) designs a distillation-based large-kernel module that preserves high-frequency structure without heavy computation; (3) uses cross-layer parameter sharing to further reduce model complexity.
Result: On Manga109 (4×), UCAN-L reaches 31.63 dB PSNR with only 48.4G MACs; on BSDS100 it reaches 27.79 dB, outperforming models with far more parameters; experiments show a better balance among accuracy, efficiency, and scalability.
Conclusion: By efficiently fusing convolution and attention, UCAN greatly reduces the computational burden without sacrificing performance, offering a lightweight, scalable solution for practical high-resolution image reconstruction.
Abstract: Hybrid CNN-Transformer architectures achieve strong results in image super-resolution, but scaling attention windows or convolution kernels significantly increases computational cost, limiting deployment on resource-constrained devices. We present UCAN, a lightweight network that unifies convolution and attention to expand the effective receptive field efficiently. UCAN combines window-based spatial attention with a Hedgehog Attention mechanism to model both local texture and long-range dependencies, and introduces a distillation-based large-kernel module to preserve high-frequency structure without heavy computation. In addition, we employ cross-layer parameter sharing to further reduce complexity. On Manga109 ($4\times$), UCAN-L achieves 31.63 dB PSNR with only 48.4G MACs, surpassing recent lightweight models. On BSDS100, UCAN attains 27.79 dB, outperforming methods with significantly larger models. Extensive experiments show that UCAN achieves a superior trade-off between accuracy, efficiency, and scalability, making it well-suited for practical high-resolution image restoration.
[116] PolyCrysDiff: Controllable Generation of Three-Dimensional Computable Polycrystalline Material Structures
Chi Chen,Tianle Jiang,Xiaodong Wei,Yanming Wang
Main category: cs.CV
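TL;DR: This paper proposes PolyCrysDiff, a framework based on a conditional latent diffusion model for controllable generation of computable 3D polycrystalline microstructures; it performs strongly on morphology, orientation distribution, and spatial correlation, and its physical validity is verified with CPFEM, supporting structure-property studies and materials design.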
Details
Motivation: Realistic, controllable construction of the 3D microstructures of polycrystalline materials is essential for revealing structure-property relationships, yet remains challenging.
Method: Proposes PolyCrysDiff, a framework based on conditional latent diffusion that generates computable 3D polycrystalline microstructures end to end, with physical validity verified through CPFEM simulations.
Result: PolyCrysDiff is highly faithful in grain morphology, orientation distribution, and 3D spatial correlation; control of grain attributes (e.g., size and sphericity) achieves R² above 0.972, outperforming mainstream MRF- and CNN-based methods; CPFEM confirms that the generated structures are computable and physically plausible.
Conclusion: PolyCrysDiff provides a key tool for analyzing structure-property relationships in polycrystalline materials and for accelerated, data-driven optimization and design.
Abstract: The three-dimensional (3D) microstructures of polycrystalline materials exert a critical influence on their mechanical and physical properties. Realistic, controllable construction of these microstructures is a key step toward elucidating structure-property relationships, yet remains a formidable challenge. Herein, we propose PolyCrysDiff, a framework based on conditional latent diffusion that enables the end-to-end generation of computable 3D polycrystalline microstructures. Comprehensive qualitative and quantitative evaluations demonstrate that PolyCrysDiff faithfully reproduces target grain morphologies, orientation distributions, and 3D spatial correlations, while achieving an $R^2$ over 0.972 for the control of grain attributes (e.g., size and sphericity), thereby outperforming mainstream approaches such as Markov random field (MRF)- and convolutional neural network (CNN)-based methods. The computability and physical validity of the generated microstructures are verified through a series of crystal plasticity finite element method (CPFEM) simulations. Leveraging PolyCrysDiff's controllable generative capability, we systematically elucidate how grain-level microstructural characteristics affect the mechanical properties of polycrystalline materials. This development is expected to be a key step toward accelerated, data-driven optimization and design of polycrystalline materials.
[117] COTONET: A custom cotton detection algorithm based on YOLO11 for stage of growth cotton boll detection
Guillem González,Guillem Alenyà,Sergi Foix
Main category: cs.CV
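TL;DR: This paper proposes COTONET, an enhanced custom YOLO11 model that integrates attention-oriented modules (such as SimAM, PHAM, CARAFE, and SE blocks) for accurate multi-stage cotton boll detection; it balances light weight with high accuracy and suits edge computing and agricultural robots.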
Details
Motivation: Physical handling during cotton harvesting can degrade the fibre, so harvesting must emulate gentle manual grasping; automated harvesting depends on robust recognition of cotton bolls at different growth stages.
Method: Proposes COTONET, based on YOLO11, which introduces Squeeze-and-Excitation blocks, Content Aware Reassembly of Features (CARAFE), Simple Attention Modules (SimAM), and Parallel Hybrid Attention Mechanisms (PHAM), and incorporates gradients in non-learnable operations to enhance shape and feature extraction.
Result: COTONET has 7.6M parameters and 27.8 GFLOPS, and reaches a mAP50 of 81.1% and a mAP50-95 of 60.6%, outperforming standard YOLO baselines and delivering strong performance among lightweight models.
Conclusion: Through multi-dimensional attention enhancement and structural optimization, COTONET markedly improves the accuracy and robustness of cotton boll detection while remaining deployable on low-resource hardware, offering a feasible visual perception solution for intelligent cotton harvesting.
Abstract: Cotton harvesting is a critical phase in which cotton capsules are physically manipulated, which can lead to fibre degradation. To maintain the highest quality, harvesting methods must emulate delicate manual grasping to preserve cotton's intrinsic properties. Automating this process requires systems capable of recognising cotton capsules across various phenological stages. To address this challenge, we propose COTONET, an enhanced custom YOLO11 model tailored with attention mechanisms to improve the detection of difficult instances. The architecture incorporates gradients in non-learnable operations to enhance shape and feature extraction. Key architectural modifications include: the replacement of convolutional blocks with Squeeze-and-Excitation blocks, a redesigned backbone integrating attention mechanisms, and the substitution of standard upsampling operations with Content Aware Reassembly of Features (CARAFE). Additionally, we integrate Simple Attention Modules (SimAM) for primary feature aggregation and Parallel Hybrid Attention Mechanisms (PHAM) for channel-wise, spatial-wise and coordinate-wise attention in the downward neck path. This configuration offers increased flexibility and robustness for interpreting the complexity of cotton crop growth. COTONET aligns with small-to-medium YOLO models, using 7.6M parameters and 27.8 GFLOPS, making it suitable for low-resource edge computing and mobile robotics. COTONET outperforms the standard YOLO baselines, achieving a mAP50 of 81.1% and a mAP50-95 of 60.6%.
[118] Cross-Resolution Attention Network for High-Resolution PM2.5 Prediction
Ammar Kheder,Helmi Toropainen,Wenqing Peng,Samuel Antão,Zhi-Song Liu,Michael Boy
Main category: cs.CV
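TL;DR: CRAN-PM is a dual-branch Vision Transformer that uses cross-resolution attention to efficiently fuse global meteorological data (25 km) with local high-resolution PM2.5 data (1 km); it introduces elevation-aware self-attention and wind-guided cross-attention to improve physical consistency and forecast accuracy, runs single-GPU inference on a 29-million-pixel European air-quality map in 1.8 seconds, and clearly outperforms baselines in RMSE and in bias over complex terrain.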
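The RMSE improvements reported for this entry are relative reductions against a baseline. The sketch below shows that arithmetic on made-up numbers; `rmse` and `relative_reduction` are hypothetical helper names, and none of the values come from the paper.

```python
import math


def rmse(predicted, observed):
    """Root-mean-square error between two equal-length sequences."""
    if not predicted or len(predicted) != len(observed):
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


def relative_reduction(baseline_error, model_error):
    """Fraction by which model_error improves on baseline_error."""
    return (baseline_error - model_error) / baseline_error


# Made-up station readings and two made-up forecasts.
observed = [10, 12, 9, 15]
baseline = [12, 10, 11, 13]  # every forecast is off by 2
model = [11, 11, 10, 14]     # every forecast is off by 1

gain = relative_reduction(rmse(baseline, observed), rmse(model, observed))
print(f"{gain:.1%}")  # prints: 50.0%
```

A reported figure such as "4.7% lower RMSE" is this same ratio, so the absolute error change depends on the size of the baseline error.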
Details
Motivation: Vision Transformers excel at spatio-temporal prediction, but in ultra-high-resolution, continent-scale settings (such as a 1 km European air-quality map with 29 million pixels) they are limited by the computational complexity of self-attention and are hard to scale; in addition, existing methods do not fully incorporate physical priors to improve the physical consistency of predictions.
Method: Proposes CRAN-PM, a dual-branch Vision Transformer with (1) cross-resolution attention that fuses 25 km meteorological data with current 1 km PM2.5; (2) elevation-aware self-attention that models terrain effects; (3) wind-guided cross-attention that models pollutant transport; the overall architecture is memory-efficient and end-to-end trainable.
Result: On daily PM2.5 forecasting for Europe in 2022 (362 days, 2,971 EEA stations), RMSE is reduced by 4.7% at T+1 and 10.7% at T+3 relative to the best single-scale baseline, and bias in complex terrain is reduced by 36%; the full 29-million-pixel map is predicted in 1.8 seconds on a single GPU.
Conclusion: With physics-informed cross-resolution attention, CRAN-PM addresses the scalability and physical-consistency challenges of very large-scale spatio-temporal prediction, offering an efficient, accurate, and deployable deep learning solution for applications such as environmental monitoring.
Abstract: Vision Transformers have achieved remarkable success in spatio-temporal prediction, but their scalability remains limited for ultra-high-resolution, continent-scale domains required in real-world environmental monitoring. A single European air-quality map at 1 km resolution comprises 29 million pixels, far beyond the limits of naive self-attention. We introduce CRAN-PM, a dual-branch Vision Transformer that leverages cross-resolution attention to efficiently fuse global meteorological data (25 km) with local high-resolution PM2.5 at the current time (1 km). Instead of including physically driven factors like temperature and topography as input, we further introduce elevation-aware self-attention and wind-guided cross-attention to force the network to learn physically consistent feature representations for PM2.5 forecasting. CRAN-PM is fully trainable and memory-efficient, generating the complete 29-million-pixel European map in 1.8 seconds on a single GPU. Evaluated on daily PM2.5 forecasting throughout Europe in 2022 (362 days, 2,971 European Environment Agency (EEA) stations), it reduces RMSE by 4.7% at T+1 and 10.7% at T+3 compared to the best single-scale baseline, while reducing bias in complex terrain by 36%.
[119] VTEdit-Bench: A Comprehensive Benchmark for Multi-Reference Image Editing Models in Virtual Try-On
Xiaoye Liang,Zhiyuan Qu,Mingye Zou,Jiaxin Liu,Lai Jiang,Mai Xu,Yiheng Zhu
Main category: cs.CV
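TL;DR: This paper introduces the VTEdit-Bench benchmark and the VTEdit-QA evaluator to systematically assess universal multi-reference image editing models on virtual try-on (VTON) tasks, finding that they approach specialized models on conventional tasks but still struggle in complex multi-cloth conditioning scenarios.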
Details
Motivation: Existing specialized VTON models generalize poorly, while universal multi-reference image editing models show promise but lack a systematic evaluation benchmark, so their strengths and limitations on VTON remain unclear.
Method: Builds VTEdit-Bench, a benchmark of 24,220 image pairs covering five VTON tasks of progressively increasing complexity; proposes VTEdit-QA, a reference-aware evaluator based on a vision-language model that scores model consistency, cloth consistency, and overall image quality.
Result: A systematic evaluation of eight universal editing models and seven specialized VTON models shows that top universal models match specialized ones on conventional VTON tasks and generalize more stably to hard scenarios, but still fall short on complex reference configurations such as multi-cloth conditioning.
Conclusion: Universal multi-reference image editing models are a viable route to flexible VTON systems but need better modeling of complex reference relationships, especially multiple garments; VTEdit-Bench and VTEdit-QA provide a standardized evaluation basis for future work.
Abstract: As virtual try-on (VTON) continues to advance, a growing number of real-world scenarios have emerged, pushing beyond the ability of the existing specialized VTON models. Meanwhile, universal multi-reference image editing models have progressed rapidly and exhibit strong generalization in visual editing, suggesting a promising route toward more flexible VTON systems. However, despite their strong capabilities, the strengths and limitations of universal editors for VTON remain insufficiently explored due to the lack of systematic evaluation benchmarks. To address this gap, we introduce VTEdit-Bench, a comprehensive benchmark designed to evaluate universal multi-reference image editing models across various realistic VTON scenarios. VTEdit-Bench contains 24,220 test image pairs spanning five representative VTON tasks with progressively increasing complexity, enabling systematic analysis of robustness and generalization. We further propose VTEdit-QA, a reference-aware VLM-based evaluator that assesses VTON performance from three key aspects: model consistency, cloth consistency, and overall image quality. Through this framework, we systematically evaluate eight universal editing models and compare them with seven specialized VTON models. Results show that top universal editors are competitive on conventional tasks and generalize more stably to harder scenarios, but remain challenged by complex reference configurations, particularly multi-cloth conditioning.
[120] SoulX-LiveAct: Towards Hour-Scale Real-Time Human Animation with Neighbor Forcing and ConvKV Memory
Dingcheng Zhen,Xu Zheng,Ruixin Zhang,Zhiqi Jiang,Yichao Yan,Ming Tao,Shunshun Yin
Main category: cs.CV
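TL;DR: This paper proposes Neighbor Forcing and a ConvKV memory mechanism to resolve the inconsistent learning signals and unbounded growth of historical state that autoregressive diffusion models face in long-horizon human animation, enabling hour-scale real-time generation and 20 FPS streaming inference.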
Details
Motivation: Existing autoregressive diffusion models face two key challenges in hour-scale real-time human animation: forcing strategies cause mismatched diffusion states and unstable learning signals, and historical representations grow without bound and lack structure, limiting inference efficiency.
Method: Proposes Neighbor Forcing, a diffusion-step-consistent autoregressive formulation that propagates temporally adjacent frames as latent neighbors under the same noise condition, and designs a structured ConvKV memory mechanism that compresses the keys and values in causal attention into a fixed-length representation for constant-memory inference.
Result: LiveAct achieves 20 FPS real-time streaming inference for hour-scale realistic human animation (on as few as two H100/H200 GPUs), reaches state-of-the-art lip-sync accuracy, animation quality, and emotional expressiveness, and has the lowest inference cost.
Conclusion: Neighbor Forcing and ConvKV together remove the stability and efficiency bottlenecks of AR diffusion models in long-sequence generation, providing a scalable, high-fidelity, low-overhead paradigm for infinite video generation.
Abstract: Autoregressive (AR) diffusion models offer a promising framework for sequential generation tasks such as video synthesis by combining diffusion modeling with causal inference. Although they support streaming generation, existing AR diffusion methods struggle to scale efficiently. In this paper, we identify two key challenges in hour-scale real-time human animation. First, most forcing strategies propagate sample-level representations with mismatched diffusion states, causing inconsistent learning signals and unstable convergence. Second, historical representations grow unbounded and lack structure, preventing effective reuse of cached states and severely limiting inference efficiency. To address these challenges, we propose Neighbor Forcing, a diffusion-step-consistent AR formulation that propagates temporally adjacent frames as latent neighbors under the same noise condition. This design provides a distribution-aligned and stable learning signal while preserving drifting throughout the AR chain. Building upon this, we introduce a structured ConvKV memory mechanism that compresses the keys and values in causal attention into a fixed-length representation, enabling constant-memory inference and truly infinite video generation without relying on short-term motion-frame memory. Extensive experiments demonstrate that our approach significantly improves training convergence, hour-scale generation quality, and inference efficiency compared to existing AR diffusion methods. Numerically, LiveAct enables hour-scale real-time human animation and supports 20 FPS real-time streaming inference on as few as two NVIDIA H100 or H200 GPUs. Quantitative results demonstrate that our method attains state-of-the-art performance in lip-sync accuracy, human animation quality, and emotional expressiveness, with the lowest inference cost.
[121] Controllable Egocentric Video Generation via Occlusion-Aware Sparse 3D Hand Joints
Chenyangguang Zhang,Botao Ye,Boqi Chen,Alexandros Delitzas,Fangjinhua Wang,Marc Pollefeys,Xi Wang
Main category: cs.CV
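TL;DR: This paper proposes a motion-controllable video generation framework driven by sparse 3D hand joints, addressing the 3D inconsistency, artifacts, and poor cross-embodiment generalization that occlusion causes for existing methods in egocentric scenes.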
Details
Motivation: Existing methods rely on 2D trajectories or implicit poses; under severe egocentric occlusion they tend to produce motion inconsistencies and hallucinated artifacts, and they generalize poorly to other embodiments such as robotic hands.
Method: Proposes a new framework that takes a single reference frame as input and uses sparse 3D hand joints as embodiment-agnostic control signals; designs an efficient control module that suppresses unreliable visual signals from occluded joints, introduces a 3D-based weighting mechanism for dynamic occlusion, and injects 3D geometric embeddings directly into the latent space to enforce structural consistency; builds a dataset of over one million video clips with precise hand trajectories and a cross-embodiment benchmark.
Result: Significantly outperforms state-of-the-art methods on egocentric video generation, producing high-fidelity videos with realistic interactions and showing strong cross-embodiment generalization (e.g., to robotic hands).
Conclusion: Sparse 3D joints, as embodiment-agnostic control signals with clear semantics and geometry, combined with an occlusion-aware, 3D-geometry-constrained control module, are an effective way to improve the quality and generalization of motion-controllable egocentric video generation.
Abstract: Motion-controllable video generation is crucial for egocentric applications in virtual reality and embodied AI. However, existing methods often struggle to achieve 3D-consistent fine-grained hand articulation. By relying on 2D trajectories or implicit poses, they collapse 3D geometry into spatially ambiguous signals or over-rely on human-centric priors. Under severe egocentric occlusions, this causes motion inconsistencies and hallucinated artifacts, and prevents cross-embodiment generalization to robotic hands. To address these limitations, we propose a novel framework that generates egocentric videos from a single reference frame, leveraging sparse 3D hand joints as embodiment-agnostic control signals with clear semantic and geometric structures. We introduce an efficient control module that resolves occlusion ambiguities while fully preserving 3D information. Specifically, it extracts occlusion-aware features from the source reference frame by penalizing unreliable visual signals from hidden joints, and employs a 3D-based weighting mechanism to robustly handle dynamically occluded target joints during motion propagation. Concurrently, the module directly injects 3D geometric embeddings into the latent space to strictly enforce structural consistency. To facilitate robust training and evaluation, we develop an automated annotation pipeline that yields over one million high-quality egocentric video clips paired with precise hand trajectories. Additionally, we register humanoid kinematic and camera data to construct a cross-embodiment benchmark. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art baselines, generating high-fidelity egocentric videos with realistic interactions and exhibiting exceptional cross-embodiment generalization to robotic hands.
[122] HELM: Hierarchical and Explicit Label Modeling with Graph Learning for Multi-Label Image Classification
Marjan Stoimchev,Boshko Koloski,Jurica Levatić,Dragi Kocev,Sašo Džeroski
Main category: cs.CV
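TL;DR: This paper proposes the HELM framework, which uses hierarchy-specific class tokens, graph convolutional networks, and a self-supervised branch to handle multi-path hierarchies and the under-use of unlabeled data in remote sensing imagery, achieving state-of-the-art performance on multiple datasets.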
Details
Motivation: Existing methods struggle with multi-path hierarchies in remote sensing imagery, where instances belong to multiple branches, and rarely exploit unlabeled data.
Method: HELM has three parts: (i) hierarchy-specific class tokens within a Vision Transformer to capture label interactions; (ii) graph convolutional networks that explicitly encode the hierarchy and generate hierarchy-aware embeddings; (iii) a self-supervised branch that makes effective use of unlabeled remote sensing imagery.
Result: On four remote sensing image datasets (UCM, AID, DFC-15, and MLRSNet), HELM outperforms strong baselines in both supervised and semi-supervised settings and is particularly strong in low-label scenarios.
Conclusion: HELM overcomes the challenges of multi-path hierarchy modeling and unlabeled-data use, offering a new paradigm for hierarchical multi-label classification of remote sensing images.
Abstract: Hierarchical multi-label classification (HMLC) is essential for modeling complex label dependencies in remote sensing. Existing methods, however, struggle with multi-path hierarchies where instances belong to multiple branches, and they rarely exploit unlabeled data. We introduce HELM (Hierarchical and Explicit Label Modeling), a novel framework that overcomes these limitations. HELM: (i) uses hierarchy-specific class tokens within a Vision Transformer to capture nuanced label interactions; (ii) employs graph convolutional networks to explicitly encode the hierarchical structure and generate hierarchy-aware embeddings; and (iii) integrates a self-supervised branch to effectively leverage unlabeled imagery. We perform a comprehensive evaluation on four remote sensing image (RSI) datasets (UCM, AID, DFC-15, MLRSNet). HELM achieves state-of-the-art performance, consistently outperforming strong baselines in both supervised and semi-supervised settings, demonstrating particular strength in low-label scenarios.
[123] Locating Demographic Bias at the Attention-Head Level in CLIP's Vision Encoder
Alaa Yasser,Kittipat Phunjanna,Marcos Escudero Viñolo,Catarina Barata,Jenny Benois-Pineau
Main category: cs.CV
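TL;DR: This paper proposes a mechanistic fairness audit that combines projected residual-stream decomposition, zero-shot Concept Activation Vectors, and bias-augmented TextSpan analysis to locate gender and age bias at the level of individual attention heads in vision transformers; on the CLIP ViT-L-14 model, gender bias proves fairly localisable, whereas age bias is more diffuse.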
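The global bias measure cited for this entry, Cramér's V, is computed from a contingency table as V = sqrt(chi^2 / (n (min(r, c) - 1))). The sketch below is a plain-Python version without bias correction; which variables the authors cross-tabulate is not stated in the abstract, so the example table (predicted class by demographic group) and its counts are assumptions.

```python
import math


def cramers_v(table):
    """Cramér's V for a contingency table given as a list of rows of counts.

    Hypothetical helper for illustration; applies no bias correction.
    """
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    if n == 0 or 0 in row_totals or 0 in col_totals:
        raise ValueError("table must have no empty rows or columns")
    if min(len(row_totals), len(col_totals)) < 2:
        raise ValueError("table needs at least two rows and two columns")
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(row_totals), len(col_totals)) - 1
    return math.sqrt(chi2 / (n * k))


# Made-up counts: rows are two predicted classes, columns are two groups.
print(cramers_v([[30, 10], [10, 30]]))  # prints: 0.5
```

V is 0 when predictions are independent of group membership and 1 when they are perfectly associated, so a lower value after ablation indicates weaker association.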
Details
Motivation: Standard fairness audits can only quantify whether a model is biased and cannot point to where inside the network the bias resides; this work aims to localise bias mechanistically to support targeted interventions.
Method: Combines projected residual-stream decomposition, zero-shot Concept Activation Vectors, and bias-augmented TextSpan analysis to audit gender and age bias at the attention-head level in the CLIP ViT-L-14 encoder on 42 profession classes of the FACET benchmark.
Result: For gender bias, four terminal-layer attention heads are identified whose ablation reduces global bias (Cramér's V) from 0.381 to 0.362 while accuracy rises slightly by 0.42%; a single head accounts for most of the bias reduction in the most stereotyped profession classes; age bias shows no consistent localisability, suggesting it is encoded more diffusely.
Conclusion: Head-level bias localisation is feasible in discriminative vision encoders, but the degree of localisation differs across protected attributes (e.g., gender vs. age).
Abstract: Standard fairness audits of foundation models quantify that a model is biased, but not where inside the network the bias resides. We propose a mechanistic fairness audit that combines projected residual-stream decomposition, zero-shot Concept Activation Vectors, and bias-augmented TextSpan analysis to locate demographic bias at the level of individual attention heads in vision transformers. As a feasibility case study, we apply this pipeline to the CLIP ViT-L-14 encoder on 42 profession classes of the FACET benchmark, auditing both gender and age bias. For gender, the pipeline identifies four terminal-layer heads whose ablation reduces global bias (Cramér's V: from 0.381 to 0.362) while marginally improving accuracy (+0.42%); a layer-matched random control confirms that this effect is specific to the identified heads. A single head in the final layer contributes the majority of the reduction in the most stereotyped classes, and class-level analysis shows that corrected predictions shift toward the correct occupation. For age, the same pipeline identifies candidate heads, but ablation produces weaker and less consistent effects, suggesting that age bias is encoded more diffusely than gender bias in this model. These results provide preliminary evidence that head-level bias localisation is feasible for discriminative vision encoders and that the degree of localisability may vary across protected attributes.
Keywords: Bias, CLIP, Mechanistic Interpretability, Vision Transformer, Fairness
[124] Intrinsic Concept Extraction Based on Compositional Interpretability
Hanyu Shi,Hong Tao,Guoheng Huang,Jianbin Jiang,Xuhang Chen,Chi-Man Pun,Shanhu Wang,Pan Pan
Main category: cs.CV
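TL;DR: This paper introduces a new task, CI-ICE, which aims to extract composable, interpretable intrinsic concepts (object-level and attribute-level) from a single image, and designs the HyperExpress method, which uses hyperbolic-space modeling and concept-wise optimization to achieve accurate disentanglement and reconstruction.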
Details
Motivation: Existing unsupervised concept extraction methods struggle to extract composable intrinsic concepts, which limits the interpretability and usefulness of the concepts.
Method: Proposes HyperExpress: (1) uses the hierarchical modeling capability of hyperbolic space to disentangle concepts while preserving their hierarchical structure and relational dependencies; (2) introduces concept-wise optimization that maps the concept embedding space to maintain complex inter-concept relationships and ensure composability.
Result: Compositional, interpretable intrinsic concepts are successfully extracted from a single image, with outstanding performance.
Conclusion: The CI-ICE task and the HyperExpress method offer a new paradigm for image concept understanding and markedly improve the composability and interpretability of concepts.
Abstract: Unsupervised Concept Extraction aims to extract concepts from a single image; however, existing methods suffer from the inability to extract composable intrinsic concepts. To address this, this paper introduces a new task called Compositional and Interpretable Intrinsic Concept Extraction (CI-ICE). The CI-ICE task aims to leverage diffusion-based text-to-image models to extract composable object-level and attribute-level concepts from a single image, such that the original concept can be reconstructed through the combination of these concepts. To achieve this goal, we propose a method called HyperExpress, which addresses the CI-ICE task through two core aspects. Specifically, first, we propose a concept learning approach that leverages the inherent hierarchical modeling capability of hyperbolic space to achieve accurate concept disentanglement while preserving the hierarchical structure and relational dependencies among concepts; second, we introduce a concept-wise optimization method that maps the concept embedding space to maintain complex inter-concept relationships while ensuring concept composability. Our method demonstrates outstanding performance in extracting compositionally interpretable intrinsic concepts from a single image.
[125] OSM-based Domain Adaptation for Remote Sensing VLMs
Stefan Maria Ailuro,Mario Markov,Mohammad Mahdi,Delyan Boychev,Luc Van Gool,Danda Pani Paudel
Main category: cs.CV
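TL;DR: OSMDA is a domain adaptation framework for remote sensing vision-language models that needs neither manual annotation nor an external large model; it uses a base VLM together with OpenStreetMap data to automatically generate high-quality image-text pairs for fine-tuning, reaching state-of-the-art performance on 10 benchmarks at lower cost.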
Details
Motivation: High-quality image-text annotations in remote sensing are scarce and expensive; existing pseudo-labeling methods depend on large teacher models, which makes them costly, hard to scale, and capped by the teacher's performance.
Method: Proposes the OSMDA framework: using the base VLM's own capabilities, aerial images are paired with rendered OpenStreetMap (OSM) tiles, and the model's OCR and chart comprehension abilities, together with OSM's rich auxiliary metadata, are used to generate captions; the model is then fine-tuned using satellite imagery alone, yielding OSMDA-VLM.
Result: Evaluated on 10 image-text-to-text benchmarks against 9 baselines; when mixed in equal proportion with real data, the method reaches state-of-the-art performance at substantially lower training cost than teacher-dependent methods.
Conclusion: Given a strong foundation model, alignment with crowd-sourced geographic data such as OSM is a practical, scalable path to remote sensing domain adaptation; OSMDA needs no manual annotation and no stronger external model, and the dataset and model weights will be released publicly.
Abstract: Vision-Language Models (VLMs) adapted to remote sensing rely heavily on domain-specific image-text supervision, yet high-quality annotations for satellite and aerial imagery remain scarce and expensive to produce. Prevailing pseudo-labeling pipelines address this gap by distilling knowledge from large frontier models, but this dependence on large teachers is costly, limits scalability, and caps achievable performance at the ceiling of the teacher. We propose OSMDA: a self-contained domain adaptation framework that eliminates this dependency. Our key insight is that a capable base VLM can serve as its own annotation engine: by pairing aerial images with rendered OpenStreetMap (OSM) tiles, we leverage optical character recognition and chart comprehension capabilities of the model to generate captions enriched by OSM's vast auxiliary metadata. The model is then fine-tuned on the resulting corpus with satellite imagery alone, yielding OSMDA-VLM, a domain-adapted VLM that requires no manual labeling and no stronger external model. We conduct exhaustive evaluations spanning 10 benchmarks across image-text-to-text tasks and comparing against 9 competitive baselines. When equally mixed with real data, our method achieves state-of-the-art results, while being substantially cheaper to train than teacher-dependent alternatives. These results suggest that, given a strong foundation model, alignment with crowd-sourced geographic data is a practical and scalable path towards remote sensing domain adaptation. Dataset and model weights will be made publicly available.
[126] CEI-3D: Collaborative Explicit-Implicit 3D Reconstruction for Realistic and Fine-Grained Object Editing
Yue Shi,Rui Shi,Yuxuan Xiong,Bingbing Ni,Wenjun Zhang
Main category: cs.CV
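TL;DR: This paper proposes CEI-3D, an editing-oriented 3D reconstruction pipeline that uses collaborative explicit-implicit reconstruction and physical-property disentangling to enable more realistic, fine-grained, and efficient 3D editing.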
Details
Motivation: Existing 3D editing methods produce unrealistic, coarse results because their reconstruction networks are tightly coupled.
Method: Proposes collaborative explicit-implicit reconstruction (an SDF network plus differentiably sampled handler points), a physical properties disentangling module (including a dual-diffuse-albedo network), and a spatial-aware editing module based on 3D segmentation with cross-view propagation.
Result: Outperforms state-of-the-art methods on both real and synthetic datasets, with more realistic and finer-grained edits and less editing time.
Conclusion: Through collaborative explicit-implicit representation and property disentangling, CEI-3D markedly improves the quality and controllability of 3D editing, offering a new paradigm for high-quality interactive 3D content creation.
Abstract: Existing 3D editing methods often produce unrealistic and unrefined results due to the deeply integrated nature of their reconstruction networks. To address the challenge, this paper introduces CEI-3D, an editing-oriented reconstruction pipeline designed to facilitate realistic and fine-grained editing. Specifically, we propose a collaborative explicit-implicit reconstruction approach, which represents the target object using an implicit SDF network and a differentially sampled, locally controllable set of handler points. The implicit network provides a smooth and continuous geometry prior, while the explicit handler points offer localized control, enabling mutual guidance between the global 3D structure and user-specified local editing regions. To independently control each attribute of the handler points, we design a physical properties disentangling module to decouple the color of the handler points into separate physical properties. We also propose a dual-diffuse-albedo network in this module to process the edited and non-edited regions through separate branches, thereby preventing undesired interference from editing operations. Building on the reconstructed collaborative explicit-implicit representation with disentangled properties, we introduce a spatial-aware editing module that enables part-wise adjustment of relevant handler points. This module employs a cross-view propagation-based 3D segmentation strategy, which helps users to edit the specified physical attributes of a target part efficiently. Extensive experiments on both real and synthetic datasets demonstrate that our approach achieves more realistic and fine-grained editing results than the state-of-the-art (SOTA) methods while requiring less editing time. Our code is available at https://github.com/shiyue001/CEI-3D.
[127] Multimodal classification of Radiation-Induced Contrast Enhancements and tumor recurrence using deep learning
Robin Peretzke,Marlin Hanstein,Maximilian Fischer,Lars Badhi Wessel,Obada Alhalabi,Sebastian Regnery,Andreas Kudak,Maximilian Deng,Tanja Eichkorn,Philipp Hoegen Saßmannshausen,Fabian Allmendinger,Jan-Hendrik Bolten,Philipp Schröter,Christine Jungk,Jürgen Peter Debus,Peter Neher,Laila König,Klaus Maier-Hein
Main category: cs.CV
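TL;DR: RICE-NET is a 3D deep learning model that combines longitudinal MRI with radiotherapy dose maps to distinguish tumor recurrence from radiation-induced contrast enhancement after glioblastoma treatment; using a cohort of 92 patients it reaches an F1 score of 0.92, and the results indicate that the dose map is critical to classification.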
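For readers less familiar with the F1 score quoted for this entry, it is the harmonic mean of precision and recall, which simplifies to 2TP / (2TP + FP + FN). The sketch below uses made-up confusion counts; the helper name `f1_score_from_counts` is hypothetical, and the abstract does not report the underlying counts or which class is treated as positive.

```python
def f1_score_from_counts(true_pos, false_pos, false_neg):
    """F1 from confusion counts: 2TP / (2TP + FP + FN).

    Hypothetical helper for illustration; returns 0.0 when undefined.
    """
    denominator = 2 * true_pos + false_pos + false_neg
    if denominator == 0:
        return 0.0
    return 2 * true_pos / denominator


# Made-up counts, not taken from the paper: 8 hits, 2 false alarms, 2 misses.
print(f1_score_from_counts(8, 2, 2))  # prints: 0.8
```

Unlike accuracy, F1 ignores true negatives, so it stays informative when one of the two classes is much rarer than the other.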
Details
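Motivation: Distinguishing tumor recurrence from radiation-induced contrast enhancement after glioblastoma treatment is clinically difficult; existing methods either rely on diffusion MRI, which is sparsely available in clinical practice, or ignore radiotherapy dose maps, which are attracting growing interest.
Method: Proposes RICE-NET, a multimodal 3D deep learning model that fuses longitudinal T1-weighted MRI with radiotherapy dose distributions, and uses ablation experiments and occlusion-based interpretability analysis to verify the contribution of each input and the regions the model attends to.
Result: Using a cohort of 92 patients, the model reaches an F1 score of 0.92 on an independent test set; ablation experiments show that the radiotherapy dose map is key to reliable classification; interpretability analysis shows that the model focuses on clinically relevant regions.
Conclusion: Multimodal deep learning, particularly when it integrates radiotherapy dose maps, has the potential to improve diagnostic accuracy and clinical decision support in neuro-oncology.
Abstract: The differentiation between tumor recurrence and radiation-induced contrast enhancements in post-treatment glioblastoma patients remains a major clinical challenge. Existing approaches rely on clinically sparsely available diffusion MRI or do not consider radiation maps, which are gaining increasing interest in the tumor board for this differentiation. We introduce RICE-NET, a multimodal 3D deep learning model that integrates longitudinal MRI data with radiotherapy dose distributions for automated lesion classification using conventional T1-weighted MRI data. Using a cohort of 92 patients, the model achieved an F1 score of 0.92 on an independent test set. During extensive ablation experiments, we quantified the contribution of each timepoint and modality and showed that reliable classification largely depends on the radiation map. Occlusion-based interpretability analyses further confirmed the model's focus on clinically relevant regions. These findings highlight the potential of multimodal deep learning to enhance diagnostic accuracy and support clinical decision-making in neuro-oncology.
[128] Towards High-Fidelity CAD Generation via LLM-Driven Program Generation and Text-Based B-Rep Primitive Grounding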
Jiahao Li,Qingwang Zhang,Qiuyu Chen,Guozhan Qiu,Yunzhong Lou,Xiangdong Zhou
Main category: cs.CV
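TL;DR: This paper presents FutureCAD, a text-to-CAD framework that combines a large language model (LLM) with a B-Rep grounding transformer (BRepGround). It generates executable CadQuery scripts and uses natural-language queries to select and ground B-Rep geometric primitives, with the aim of improving AI-driven CAD modeling of complex industrial products.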
Details
Motivation: Existing CAD generation methods fall into two separate categories, parametric modeling and B-Rep synthesis, whereas the two are tightly coupled in real industrial CAD systems. This paradigm gap limits the use of AI in complex product design.
Method: The FutureCAD framework: 1) an LLM generates CadQuery scripts and describes geometric selections in natural language; 2) the BRepGround transformer grounds those natural-language queries to the target B-Rep primitives; 3) on a dataset of real-world CAD models, the LLM is first trained with supervised fine-tuning and then with reinforcement learning to improve generalization.
Result: FutureCAD achieves state-of-the-art (SOTA) CAD generation performance, produces high-fidelity, executable parametric CAD models, and supports natural-language-driven geometric selection and editing.
Conclusion: Unifying parametric modeling with the B-Rep representation is a key path for AI-driven CAD. FutureCAD shows that an LLM and a geometry-aware module can model CAD jointly and offers a new paradigm for intelligent CAD systems.
Abstract: The field of Computer-Aided Design (CAD) generation has made significant progress in recent years. Existing methods typically fall into two separate categories: parametric CAD modeling and direct boundary representation (B-Rep) synthesis. In modern feature-based CAD systems, parametric modeling and B-Rep are inherently intertwined, as advanced parametric operations (e.g., fillet and chamfer) require explicit selection of B-Rep geometric primitives, and the B-Rep itself is derived from parametric operations. Consequently, this paradigm gap remains a critical factor limiting AI-driven CAD modeling for complex industrial product design. This paper presents FutureCAD, a novel text-to-CAD framework that leverages large language models (LLMs) and a B-Rep grounding transformer (BRepGround) for high-fidelity CAD generation. Our method generates executable CadQuery scripts and introduces a text-based query mechanism that enables the LLM to specify geometric selections via natural language, which BRepGround then grounds to the target primitives. To train our framework, we construct a new dataset comprising real-world CAD models. For the LLM, we apply supervised fine-tuning (SFT) to establish fundamental CAD generation capabilities, followed by reinforcement learning (RL) to improve generalization. Experiments show that FutureCAD achieves state-of-the-art CAD generation performance.
[129] A Decade of Generative Adversarial Networks for Porous Material Reconstruction
Ali Sadeghkhani,Brandon Bennett,Masoud Babaei,Arash Rabbani
Main category: cs.CV
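TL;DR: This review covers 96 papers published from 2017 to early 2026 and systematically analyzes the evolution and application of generative adversarial networks (GANs) for digital reconstruction of porous materials. It groups GAN architectures into six classes and assesses progress and open challenges in porosity accuracy, permeability prediction, and reconstruction volume.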
Details
Motivation: Traditional methods such as micro-CT and statistical reconstruction have limitations, while deep learning, and GANs in particular, offers a new route to high-fidelity, efficient reconstruction of porous materials. A systematic account of how the field has developed and where each approach applies is needed.
Method: A systematic literature review of 96 peer-reviewed papers that classifies GANs into six architectural classes and quantitatively analyzes their performance on porosity, permeability prediction, and reconstruction scale.
Result: Six classes of GAN architecture are identified. Porosity is reproduced to within 1% of the original samples, mean relative error in permeability prediction is reduced by up to 79%, and the largest reconstruction volume has grown from $64^3$ to $2{,}200^3$ voxels. Remaining challenges include computational efficiency, memory limits, and structural continuity in 2D-to-3D transformations.
Conclusion: The review provides a framework for selecting a GAN architecture according to application requirements, offering guidance for research and engineering practice in digital reconstruction of porous materials.
Abstract: Digital reconstruction of porous materials has become increasingly critical for applications ranging from geological reservoir characterization to tissue engineering and electrochemical device design. While traditional methods such as micro-computed tomography and statistical reconstruction approaches have established foundations in this field, the emergence of deep learning techniques, particularly Generative Adversarial Networks (GANs), has revolutionized porous media reconstruction capabilities. This review systematically analyzes 96 peer-reviewed articles published from 2017 to early 2026, examining the evolution and applications of GAN-based approaches for porous material image reconstruction. We categorize GAN architectures into six distinct classes, namely Vanilla GANs, Multi-Scale GANs, Conditional GANs, Attention-Enhanced GANs, Style-based GANs, and Hybrid Architecture GANs. Our analysis reveals substantial progress including improvements in porosity accuracy (within 1% of original samples), permeability prediction (up to 79% reduction in mean relative errors), and achievable reconstruction volumes (from initial $64^3$ to current $2{,}200^3$ voxels). Despite these advances, persistent challenges remain in computational efficiency, memory constraints for large-scale reconstruction, and maintaining structural continuity in 2D-to-3D transformations. This systematic analysis provides a comprehensive framework for selecting appropriate GAN architectures based on specific application requirements.
[130] ZeroSense:How Vision matters in Long Context Compression
Yonghan Gao,Zehong Chen,Lijian Xu,Jingzhi Chen,Jingwei Guan,Xingyu Zeng
Main category: cs.CV
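TL;DR: This paper proposes a new framework for evaluating visual-text compression (VTC) quality that decouples the capabilities of multimodal large language models (MLLMs), and designs the ZeroSense benchmark to remove semantic correlation and contextual dependencies so that text preservation is measured more accurately. Experiments show that existing downstream-task metrics do not faithfully reflect VTC quality.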
Details
Motivation: Existing VTC methods such as DeepSeek-OCR report high token compression ratios, but their evaluation depends heavily on downstream task performance, which is confounded by the inherent linguistic priors of MLLMs and therefore cannot measure text preservation accurately.
Method: A new VTC evaluation framework that decouples MLLM capabilities, together with the ZeroSense benchmark, which uses test samples with low semantic correlation and removes contextual dependencies so that results reflect only the compression quality of the VTC method itself.
Result: Experiments across multiple datasets show that VTC quality and downstream task accuracy diverge significantly, confirming both the bias of conventional evaluation and the need for the proposed framework.
Conclusion: Downstream task performance cannot replace direct evaluation of text preservation in VTC. The decoupled framework and the ZeroSense benchmark give VTC research a more reliable measure of quality.
Abstract: Recent visual-text compression (VTC) methods, typified by DeepSeek-OCR, report impressively high token compression ratios for long-context modeling tasks by leveraging text-to-image rendering. However, existing evaluation protocols heavily rely on downstream task performance. Such evaluation metrics fail to accurately measure text preservation due to the strong inherent linguistic priors of Multimodal Large Language Models (MLLMs). In this work, we introduce a new evaluation framework that decouples MLLMs' capabilities to faithfully assess VTC quality. Within this framework, we further introduce the ZeroSense Benchmark to ensure low semantic correlation of testing samples. By eliminating contextual dependencies, our benchmark guarantees that the evaluation results are purely reflective of VTC quality, unaffected by the semantic inference capabilities of downstream models. Extensive experiments across multiple datasets demonstrate that VTC quality and downstream task accuracy diverge significantly, highlighting the necessity of our decoupled evaluation framework.
[131] Derain-Agent: A Plug-and-Play Agent Framework for Rainy Image Restoration
Zhaocheng Yu,Xiang Chen,Runzhe Li,Zihan Geng,Guanglu Sun,Haipeng Li,Kui Jiang
Main category: cs.CV
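TL;DR: This paper proposes Derain-Agent, a plug-and-play dynamic refinement framework for deraining. A planning network and a strength modulation mechanism provide adaptive, region-specific correction of the complex degradations found in real rain scenes (such as noise, blur, and color deviation), improving existing deraining models on both synthetic and real-world data.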
Details
Motivation: Existing single-image deraining models use a static inference paradigm that cannot adapt to the coupled degradations of real rain scenes, such as noise, blur, and color deviation, so restored images show residual artifacts and inconsistent perceptual quality.
Method: The Derain-Agent framework has two core components: 1) a planning network that schedules an optimal sequence of restoration tools for each sample; 2) a strength modulation mechanism that applies those tools with spatially adaptive intensity for efficient, precise regional correction.
Result: The method generalizes well and consistently improves state-of-the-art deraining models on both synthetic and real-world benchmarks.
Conclusion: Derain-Agent moves deraining from static processing to dynamic, agent-based restoration, offering a new way to handle complex degradations in real scenes.
Abstract: While deep learning has advanced single-image deraining, existing models suffer from a fundamental limitation: they employ a static inference paradigm that fails to adapt to the complex, coupled degradations (e.g., noise artifacts, blur, and color deviation) of real-world rain. Consequently, restored images often exhibit residual artifacts and inconsistent perceptual quality. In this work, we present Derain-Agent, a plug-and-play refinement framework that transitions deraining from static processing to dynamic, agent-based restoration. Derain-Agent equips a base deraining model with two core capabilities: 1) a Planning Network that intelligently schedules an optimal sequence of restoration tools for each instance, and 2) a Strength Modulation mechanism that applies these tools with spatially adaptive intensity. This design enables precise, region-specific correction of residual errors without the prohibitive cost of iterative search. Our method demonstrates strong generalization, consistently boosting the performance of state-of-the-art deraining models on both synthetic and real-world benchmarks.
[132] Single-View Rolling-Shutter SfM
Sofía Errázuriz Muñoz,Kim Kiehn,Petr Hruby,Kathlén Kohn
Main category: cs.CV
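TL;DR: This paper proposes a single-view geometric model for rolling-shutter (RS) cameras, systematically analyzes which motion and scene parameters can be recovered from a single RS image, derives several minimal reconstruction problems, and evaluates their feasibility and practical limitations experimentally.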
Details
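Motivation: Rolling-shutter cameras are ubiquitous, but RS structure-from-motion (SfM) has not been fully solved, so the single-view geometry of RS cameras needs to be modeled and analyzed.
Method: Characterize the geometry of world points and lines observed in a single RS view, use it to analyze which motion and scene parameters are recoverable, and systematically derive minimal reconstruction problems. Proof-of-concept solvers are designed and evaluated for several representative cases.
Result: The paper identifies which parameters can be recovered from a single RS image, derives several minimal reconstruction problems, and shows experimentally that some of them are feasible while also exposing practical limitations.
Conclusion: RS single-view geometry provides a theoretical basis for RS SfM. The proposed minimal problems and solvers show that single-image RS reconstruction is partly feasible, while also revealing challenges such as inherent ambiguities and numerical sensitivity.
Abstract: Rolling-shutter (RS) cameras are ubiquitous, but RS SfM (structure-from-motion) has not been fully solved yet. This work suggests an approach to remedy this: We characterize RS single-view geometry of observed world points or lines. Exploiting this geometry, we describe which motion and scene parameters can be recovered from a single RS image and systematically derive minimal reconstruction problems. We evaluate several representative cases with proof-of-concept solvers, highlighting both feasibility and practical limitations.
[133] InSpatio-WorldFM: An Open-Source Real-Time Generative Frame Model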
InSpatio Team,Xiaoyu Zhang,Weihong Pan,Zhichao Ye,Jialin Liu,Yipeng Chen,Nan Wang,Xiaojun Xiang,Weijian Xie,Yifu Wang,Haoyu Ji,Siji Pan,Zhewen Le,Jing Guo,Xianbin Liu,Donghui Shen,Ziqiang Zhao,Haomin Liu,Guofeng Zhang
Main category: cs.CV
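TL;DR: InSpatio-WorldFM is an open-source real-time frame model that generates each frame independently and combines explicit 3D anchors with implicit spatial memory to achieve low-latency, multi-view-consistent spatial inference.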
Details
Motivation: Video-based world models incur high latency because of window-level sequential processing, which does not meet the needs of real-time spatial inference.
Method: A frame-independent generation paradigm; explicit 3D anchors and implicit spatial memory to enforce multi-view consistency; and a progressive three-stage training pipeline (image diffusion model, then controllable frame model, then few-step distilled real-time generator).
Result: The model supports interactive exploration on consumer-grade GPUs, with strong multi-view consistency and low-latency real-time generation.
Conclusion: InSpatio-WorldFM offers an efficient alternative to traditional video-based world models for real-time world simulation.
Abstract: We present InSpatio-WorldFM, an open-source real-time frame model for spatial intelligence. Unlike video-based world models that rely on sequential frame generation and incur substantial latency due to window-level processing, InSpatio-WorldFM adopts a frame-based paradigm that generates each frame independently, enabling low-latency real-time spatial inference. By enforcing multi-view spatial consistency through explicit 3D anchors and implicit spatial memory, the model preserves global scene geometry while maintaining fine-grained visual details across viewpoint changes. We further introduce a progressive three-stage training pipeline that transforms a pretrained image diffusion model into a controllable frame model and finally into a real-time generator through few-step distillation. Experimental results show that InSpatio-WorldFM achieves strong multi-view consistency while supporting interactive exploration on consumer-grade GPUs, providing an efficient alternative to traditional video-based world models for real-time world simulation.
[134] PicoSAM3: Real-Time In-Sensor Region-of-Interest Segmentation
Pietro Bonazzi,Nicola Farronato,Stefan Zihlmann,Haotong Qin,Michele Magno
Main category: cs.CV
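TL;DR: PicoSAM3 is a lightweight, promptable visual segmentation model designed for real-time edge and in-sensor deployment. With only 1.3M parameters, it reaches 65.45% and 64.01% mIoU on COCO and LVIS, respectively, and supports INT8 inference on the IMX500 sensor at 11.82 ms latency.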
Details
Motivation: Smart glasses, IoT devices, and similar applications need real-time on-device segmentation with low latency and privacy protection.
Method: A dense CNN architecture combined with region-of-interest prompt encoding and Efficient Channel Attention, trained with knowledge distillation from SAM2 and SAM3.
Result: mIoU of 65.45% on COCO and 64.01% on LVIS, outperforming comparable edge-oriented models. After INT8 quantization the model runs in real time at 11.82 ms on the IMX500 while fully complying with its memory and operator constraints. Ablations show that distillation yields up to +14.5% mIoU.
Conclusion: High-quality, spatially flexible promptable segmentation is feasible directly at the sensor level, and PicoSAM3 offers an efficient and practical approach for edge AI vision tasks.
Abstract: Real-time, on-device segmentation is critical for latency-sensitive and privacy-aware applications such as smart glasses and Internet-of-Things devices. We introduce PicoSAM3, a lightweight promptable visual segmentation model optimized for edge and in-sensor execution, including deployment on the Sony IMX500 vision sensor. PicoSAM3 has 1.3 M parameters and combines a dense CNN architecture with region of interest prompt encoding, Efficient Channel Attention, and knowledge distillation from SAM2 and SAM3. On COCO and LVIS, PicoSAM3 achieves 65.45% and 64.01% mIoU, respectively, outperforming existing SAM-based and edge-oriented baselines at similar or lower complexity. The INT8 quantized model preserves accuracy with negligible degradation while enabling real-time in-sensor inference at 11.82 ms latency on the IMX500, fully complying with its memory and operator constraints. Ablation studies show that distillation from large SAM models yields up to +14.5% mIoU improvement over supervised training and demonstrate that high-quality, spatially flexible promptable segmentation is feasible directly at the sensor level.
[135] Preliminary analysis of RGB-NIR Image Registration techniques for off-road forestry environments
Pankaj Deoli,Karthik Ranganath,Karsten Berns
Main category: cs.CV
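TL;DR: This paper evaluates classical and deep learning methods for RGB-NIR image registration in off-road forestry applications. NeMAR shows geometric consistency problems, and MURF aligns large-scale features well but handles fine details poorly, so further improvement is needed for robust multi-scale registration.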
Details
Motivation: RGB-NIR image registration is important for sensor fusion, image enhancement, and off-road autonomous systems, yet targeted evaluation is lacking for complex off-road forestry scenes.
Method: An experimental evaluation of classical and deep learning methods, including NeMAR and MURF. NeMAR is trained under 6 different configurations, and MURF is tested for feature alignment on off-road forestry data.
Result: NeMAR is partially successful despite unstable GAN loss. MURF aligns large-scale features well but struggles to capture fine details in dense vegetation.
Conclusion: Current methods do not yet meet the requirements of robust multi-scale registration in off-road forest environments and need further refinement.
Abstract: RGB-NIR image registration plays an important role in sensor fusion, image enhancement and off-road autonomy. In this work, we evaluate both classical and Deep Learning (DL) based image registration techniques to assess their suitability for off-road forestry applications. NeMAR, trained under 6 different configurations, demonstrates partial success; however, its GAN loss instability suggests challenges in preserving geometric consistency. MURF, when tested on off-road forestry data, shows promising large-scale feature alignment during shared information extraction but struggles with fine details in dense vegetation. Even though this is just a preliminary evaluation, our study indicates that further refinements are needed for robust, multi-scale registration in off-road forest applications.
[136] AstroSplat: Physics-Based Gaussian Splatting for Rendering and Reconstruction of Small Celestial Bodies
Jennifer Nolan,Travis Driver,John Christian
Main category: cs.CV
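TL;DR: This paper proposes AstroSplat, a physics-based Gaussian splatting framework that integrates planetary reflectance models to improve autonomous reconstruction and photometric characterization of small-body surfaces from in-situ imagery. Validation shows that it outperforms the conventional spherical harmonic parameterization.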
Details
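Motivation: Existing Gaussian-splatting surface reconstruction relies on a purely appearance-based spherical harmonic intensity parameterization that does not explicitly model material properties or light-surface interactions, which falls short of the physical consistency and scientific characterization that small-body missions require.
Method: The AstroSplat framework embeds planetary reflectance models (e.g., the Hapke model) into the Gaussian splatting representation for physics-driven radiometric modeling and optimization, and is trained and validated end to end on real imagery from NASA's Dawn mission.
Result: On real Dawn data, AstroSplat improves rendering fidelity and surface reconstruction accuracy over the standard spherical harmonic parameterization, while also supporting photometric inversion and estimation of material properties.
Conclusion: AstroSplat shows that incorporating physical reflectance models into neural-radiance-field-style methods is effective, providing a more reliable and interpretable approach to visual reconstruction for autonomous perception and scientific analysis of small bodies in deep-space exploration.
Abstract: Image-based surface reconstruction and characterization are crucial for missions to small celestial bodies (e.g., asteroids), as they inform mission planning, navigation, and scientific analysis. Recent advances in Gaussian splatting enable high-fidelity neural scene representations but typically rely on a spherical harmonic intensity parameterization that is strictly appearance-based and does not explicitly model material properties or light-surface interactions. We introduce AstroSplat, a physics-based Gaussian splatting framework that integrates planetary reflectance models to improve the autonomous reconstruction and photometric characterization of small-body surfaces from in-situ imagery. The proposed framework is validated on real imagery taken by NASA's Dawn mission, where we demonstrate superior rendering performance and surface reconstruction accuracy compared to the typical spherical harmonic parameterization.
[137] Multimodal Emotion Recognition via Bi-directional Cross-Attention and Temporal Modeling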
Junhyeong Byeon,Jeongyeol Kim,Sejoon Lim
Main category: cs.CV
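TL;DR: This paper proposes a multimodal framework for emotion recognition in in-the-wild video. It combines frozen CLIP (visual) and Wav2Vec 2.0 (audio) models, a Temporal Convolutional Network (TCN) that models facial dynamics, a bi-directional cross-attention fusion module, and a contrastive objective based on CLIP text features, and improves on unimodal modeling for the ABAW 10th EXPR task.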
Details
Motivation: A single modality, such as facial expressions or speech, struggles with the variation found in in-the-wild video (appearance, pose, illumination, background noise, and the dynamic nature of affect), so the modalities need to be modeled jointly.
Method: Frozen CLIP and Wav2Vec 2.0 extract visual and audio features, respectively. A TCN models temporal facial changes within fixed-length video windows. Bi-directional cross-attention provides symmetric audio-visual fusion. A text-guided contrastive loss based on CLIP text embeddings strengthens semantic alignment.
Result: On the ABAW 10th EXPR benchmark, the framework outperforms unimodal methods and provides a strong multimodal baseline, supporting the value of temporal visual modeling, audio representation learning, and cross-modal fusion.
Conclusion: A framework that combines pre-trained multimodal representations, temporal modeling, and symmetric cross-attention recognizes complex emotions in real-world scenes more robustly.
Abstract: Emotion recognition in in-the-wild video data remains a challenging problem due to large variations in facial appearance, head pose, illumination, background noise, and the inherently dynamic nature of human affect. Relying on a single modality, such as facial expressions or speech, is often insufficient to capture these complex emotional cues. To address this issue, we propose a multimodal emotion recognition framework for the Expression (EXPR) Recognition task in the 10th Affective Behavior Analysis in-the-wild (ABAW) Challenge. Our approach leverages large-scale pre-trained models, namely CLIP for visual encoding and Wav2Vec 2.0 for audio representation learning, as frozen backbone networks. To model temporal dependencies in facial expression sequences, we employ a Temporal Convolutional Network (TCN) over fixed-length video windows. In addition, we introduce a bi-directional cross-attention fusion module, in which visual and audio features interact symmetrically to enhance cross-modal contextualization and capture complementary emotional information. A lightweight classification head is then used for final emotion prediction. We further incorporate a text-guided contrastive objective based on CLIP text features to encourage semantically aligned visual representations. Experimental results on the ABAW 10th EXPR benchmark show that the proposed framework provides a strong multimodal baseline and achieves improved performance over unimodal modeling.
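To make the idea of symmetric cross-modal interaction concrete, the following is a minimal sketch of bi-directional cross-attention in plain Python. It is an illustration only, not the authors' module: the function names (`cross_attend`, `bidirectional_fusion`), the absence of learned projection matrices, the residual addition, and the mean-pool-then-concatenate step are all assumptions made for the example.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, keys_values):
    """Scaled dot-product attention: each query attends over keys_values,
    which serve as both keys and values (no learned projections here)."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, keys_values))
                    for j in range(d)])
    return out

def bidirectional_fusion(visual, audio):
    """Hypothetical fusion: visual attends to audio, audio attends to visual,
    each result is added residually, mean-pooled over time, then concatenated."""
    v_ctx = cross_attend(visual, audio)   # visual queries, audio keys/values
    a_ctx = cross_attend(audio, visual)   # audio queries, visual keys/values
    v = [[x + y for x, y in zip(r, c)] for r, c in zip(visual, v_ctx)]
    a = [[x + y for x, y in zip(r, c)] for r, c in zip(audio, a_ctx)]

    def pool(seq):
        return [sum(col) / len(seq) for col in zip(*seq)]

    return pool(v) + pool(a)

# Toy sequences: 2 visual frames and 3 audio frames, feature size 2.
fused = bidirectional_fusion([[1.0, 0.0], [0.0, 1.0]],
                             [[1.0, 1.0], [0.0, 0.0], [1.0, -1.0]])
print(len(fused))
```

In a real system the two sequences would first pass through learned query, key, and value projections, and the fused vector would feed the classification head.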
These results demonstrate the effectiveness of combining temporal visual modeling, audio representation learning, and cross-modal fusion for robust emotion recognition in unconstrained real-world environments.
[138] HomeSafe-Bench: Evaluating Vision-Language Models on Unsafe Action Detection for Embodied Agents in Household Scenarios
Jiayue Pu,Zhongxiang Sun,Zilu Zhang,Xiao Zhang,Jun Xu
Main category: cs.CV
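TL;DR: This paper introduces the HomeSafe-Bench benchmark and the HD-Guard safety monitoring architecture for evaluating and improving real-time detection of unsafe actions by vision-language models in household scenarios.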
Details
Motivation: Existing safety evaluations cope poorly with the dynamic, unpredictable safety risks of household environments, are limited in particular by perception latency and missing common-sense knowledge, and lack a benchmark for dynamic unsafe action detection in the home.
Method: HomeSafe-Bench is built with a pipeline that combines physical simulation and advanced video generation (438 cases, six functional areas, fine-grained multidimensional annotations). HD-Guard is a hierarchical streaming architecture in which a lightweight, high-frequency FastBrain works alongside an asynchronous large-scale SlowBrain.
Result: HD-Guard achieves a better trade-off between latency and performance, and the analysis identifies critical bottlenecks of current VLMs in household safety detection.
Conclusion: HomeSafe-Bench and HD-Guard provide a new benchmark and a practical architecture for household robot safety, supporting reliable deployment of VLMs in real, dynamic home environments.
Abstract: The rapid evolution of embodied agents has accelerated the deployment of household robots in real-world environments. However, unlike structured industrial settings, household spaces introduce unpredictable safety risks, where system limitations such as perception latency and lack of common sense knowledge can lead to dangerous errors. Current safety evaluations, often restricted to static images, text, or general hazards, fail to adequately benchmark dynamic unsafe action detection in these specific contexts. To bridge this gap, we introduce HomeSafe-Bench, a challenging benchmark designed to evaluate Vision-Language Models (VLMs) on unsafe action detection in household scenarios. HomeSafe-Bench is constructed via a hybrid pipeline combining physical simulation with advanced video generation and features 438 diverse cases across six functional areas with fine-grained multidimensional annotations. Beyond benchmarking, we propose Hierarchical Dual-Brain Guard for Household Safety (HD-Guard), a hierarchical streaming architecture for real-time safety monitoring. HD-Guard coordinates a lightweight FastBrain for continuous high-frequency screening with an asynchronous large-scale SlowBrain for deep multimodal reasoning, effectively balancing inference efficiency with detection accuracy. Evaluations demonstrate that HD-Guard achieves a superior trade-off between latency and performance, while our analysis identifies critical bottlenecks in current VLM-based safety detection.
[139] Ada3Drift: Adaptive Training-Time Drifting for One-Step 3D Visuomotor Robotic Manipulation
Chongyang Xu,Yixian Zou,Ziliang Feng,Fanman Meng,Shuaicheng Liu
Main category: cs.CV
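TL;DR: Ada3Drift proposes a training-time drifting field that recovers the multimodal action fidelity of diffusion policies in single-step generation (1 NFE), balancing real-time operation with physical feasibility. It reaches SOTA on simulated and real-world robot tasks with $10\times$ fewer function evaluations than diffusion methods.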
Details
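Motivation: Diffusion models can represent multimodal action distributions but have high inference latency. Flow matching and consistency models are fast but collapse modes and produce physically infeasible trajectories. Because robot training can run offline while inference must be real-time, iterative refinement should move from inference to training.
Method: Ada3Drift: 1) learns a drifting field at training time that attracts predicted actions toward expert modes and repels them from other generated samples; 2) uses a sigmoid-scheduled loss that transitions from coarse distribution learning to fine-grained mode sharpening; 3) applies multi-scale field aggregation to capture action modes at different spatial granularities. The input is a 3D point cloud.
Result: On three simulation benchmarks (Adroit, Meta-World, and RoboTwin) and on real-world manipulation tasks, the method reaches SOTA with a single function evaluation (1 NFE), $10\times$ fewer than diffusion-based alternatives.
Conclusion: By moving multimodal refinement to the training stage, Ada3Drift achieves both high-fidelity multimodal modeling and real-time operation in single-step generation, offering an efficient and reliable visuomotor policy for embodied intelligence.
Abstract: Diffusion-based visuomotor policies effectively capture multimodal action distributions through iterative denoising, but their high inference latency limits real-time robotic control. Recent flow matching and consistency-based methods achieve single-step generation, yet sacrifice the ability to preserve distinct action modes, collapsing multimodal behaviors into averaged, often physically infeasible trajectories. We observe that the compute budget asymmetry in robotics (offline training vs. real-time inference) naturally motivates recovering this multimodal fidelity by shifting iterative refinement from inference time to training time. Building on this insight, we propose Ada3Drift, which learns a training-time drifting field that attracts predicted actions toward expert demonstration modes while repelling them from other generated samples, enabling high-fidelity single-step generation (1 NFE) from 3D point cloud observations. To handle the few-shot robotic regime, Ada3Drift further introduces a sigmoid-scheduled loss transition from coarse distribution learning to mode-sharpening refinement, and multi-scale field aggregation that captures action modes at varying spatial granularities.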
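A sigmoid-scheduled transition between two loss terms can be sketched as follows. This is a generic illustration under stated assumptions, not the paper's formulation: the names `sigmoid_weight` and `blended_loss`, the `midpoint` and `sharpness` parameters, and the convex blend of a "coarse" and a "refinement" loss are all hypothetical.

```python
import math

def sigmoid_weight(step, total_steps, midpoint=0.5, sharpness=10.0):
    """Weight in (0, 1) that rises smoothly as training progresses.
    `midpoint` and `sharpness` are illustrative hyperparameters."""
    progress = step / total_steps
    return 1.0 / (1.0 + math.exp(-sharpness * (progress - midpoint)))

def blended_loss(coarse_loss, refine_loss, step, total_steps):
    """Convex blend: mostly the coarse term early, mostly the refinement term late."""
    w = sigmoid_weight(step, total_steps)
    return (1.0 - w) * coarse_loss + w * refine_loss

# Print the weight on the refinement term at a few points in training.
for step in (0, 25, 50, 75, 100):
    print(step, round(sigmoid_weight(step, 100), 3))
```

The appeal of a sigmoid over a hard switch is that gradients from both objectives overlap around the midpoint, which avoids an abrupt change in the training signal.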
Experiments on three simulation benchmarks (Adroit, Meta-World, and RoboTwin) and real-world robotic manipulation tasks demonstrate that Ada3Drift achieves state-of-the-art performance while requiring $10\times$ fewer function evaluations than diffusion-based alternatives.
[140] CrossEarth-SAR: A SAR-Centric and Billion-Scale Geospatial Foundation Model for Domain Generalizable Semantic Segmentation
Ziqi Ye,Ziyang Gong,Ning Liao,Xiaoxing Hu,Di Wang,Hongruixuan Chen,Chen Huang,Yiguo He,Yuru Jia,Xiaoxing Wang,Haipeng Wang,Xue Yang,Junchi Yan
Main category: cs.CV
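TL;DR: This paper presents CrossEarth-SAR, the first billion-scale vision foundation model for cross-domain semantic segmentation of synthetic aperture radar (SAR) imagery. It uses a physics-guided sparse mixture-of-experts (MoE) architecture and comes with the large-scale CrossEarth-SAR-200K dataset and the first unified SAR domain generalization benchmark suite (22 sub-benchmarks).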
Details
Motivation: Because SAR imaging mechanisms are diverse, domain shifts across sensors and regions are large and severely limit semantic generalization.
Method: A physics-guided sparse mixture-of-experts (MoE) architecture that incorporates physical descriptors; the large-scale CrossEarth-SAR-200K dataset, which combines weak and full supervision; and a unified evaluation suite of 22 sub-benchmarks covering 8 types of domain gap.
Result: SOTA performance on 20 of the 22 benchmarks, with mIoU gains of more than 10% on some benchmarks under multi-gap transfer.
Conclusion: CrossEarth-SAR shows that combining large-scale pre-training with physical priors substantially improves cross-domain generalization in SAR semantic segmentation and offers a new paradigm for SAR vision foundation models.
Abstract: Synthetic Aperture Radar (SAR) enables global, all-weather earth observation. However, owing to diverse imaging mechanisms, domain shifts across sensors and regions severely hinder its semantic generalization. To address this, we present CrossEarth-SAR, the first billion-scale SAR vision foundation model built upon a novel physics-guided sparse mixture-of-experts (MoE) architecture incorporating physical descriptors, explicitly designed for cross-domain semantic segmentation. To facilitate large-scale pre-training, we develop CrossEarth-SAR-200K, a weakly and fully supervised dataset that unifies public and private SAR imagery. We also introduce a benchmark suite comprising 22 sub-benchmarks across 8 distinct domain gaps, establishing the first unified standard for domain generalization semantic segmentation on SAR imagery. Extensive experiments demonstrate that CrossEarth-SAR achieves state-of-the-art results on 20 benchmarks, surpassing previous methods by over 10% mIoU on some benchmarks under multi-gap transfer. All code, benchmarks and datasets will be publicly available.
[141] Pano360: Perspective to Panoramic Vision with Geometric Consistency
Zhengdong Zhu,Weiyi Xue,Zuyuan Yang,Wenlve Zhou,Zhiheng Zhou
Main category: cs.CV
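TL;DR: This paper proposes a panorama stitching method based on the 3D photogrammetric space and a transformer architecture. Global alignment in 3D space and multi-feature joint optimization improve stitching accuracy and visual quality in challenging scenes with weak textures or large parallax.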
Details
Motivation: Existing panorama stitching methods rely on pairwise 2D feature matching and do not model geometric consistency across multiple views, which causes severe distortion and misalignment in scenes with weak textures, large parallax, or repetitive patterns.
Method: Extend image alignment to the 3D photogrammetric space; use a novel transformer architecture for 3D awareness and global information aggregation; use camera poses to guide image warping for global alignment in 3D space; compute seams with a multi-feature joint optimization strategy; and build a large-scale dataset of real-world scenes for training and evaluation.
Result: The method significantly outperforms existing approaches in alignment accuracy and perceptual quality, especially in challenging scenes with weak textures, large parallax, and repetitive patterns.
Conclusion: Modeling multi-view geometric consistency in 3D space, combined with transformer-based global optimization, is an effective way to improve the robustness and quality of panorama stitching.
Abstract: Prior panorama stitching approaches heavily rely on pairwise feature correspondences and are unable to leverage geometric consistency across multiple views. This leads to severe distortion and misalignment, especially in challenging scenes with weak textures, large parallax, and repetitive patterns. Given that multi-view geometric correspondences can be directly constructed in 3D space, making them more accurate and globally consistent, we extend the 2D alignment task to the 3D photogrammetric space. We adopt a novel transformer-based architecture to achieve 3D awareness and aggregate global information across all views. It directly utilizes camera poses to guide image warping for global alignment in 3D space and employs a multi-feature joint optimization strategy to compute the seams. Additionally, to establish an evaluation benchmark and train our network, we constructed a large-scale dataset of real-world scenes. Extensive experiments show that our method significantly outperforms existing alternatives in alignment accuracy and perceptual quality.
[142] Nyxus: A Next Generation Image Feature Extraction Library for the Big Data and AI Era
Nicholas Schaub,Andriy Kharchenko,Hamdah Abbasi,Sameeul Samee,Hythem Sidky,Nathan Hotaling
Main category: cs.CV
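TL;DR: Nyxus is a scalable feature extraction library with out-of-core computation, designed for large-scale 2D and 3D image data. It covers multiple biomedical domains, offers several interfaces (Python, CLI, Napari plugin, OCI container), and supports programmatic tuning for machine learning applications.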
Details
Motivation: Modern imaging instruments produce massive image datasets, and traditional feature extraction algorithms struggle to combine efficiency, robustness, and accuracy. Deep learning has improved segmentation accuracy, but feature libraries across disciplines lack a common basis for comparison, so an efficient, scalable, easy-to-use, and standardized tool is needed.
Method: Nyxus is designed from the ground up as a feature extraction library with scalable, out-of-core computation for 2D and 3D images. It covers biomedical features including radiomics and cellular analysis; ships as a Python package, a command line tool, a Napari plugin, and an OCI container; and supports programmatic tuning of feature sets to balance computational efficiency against feature coverage.
Result: Nyxus is rigorously tested against established standards, scales across CPUs and GPUs, and fits a range of user workflows (development, visualization, cloud and supercomputing), improving the efficiency and flexibility of feature extraction from large image datasets.
Conclusion: Nyxus fills a gap between efficiency, standardization, and ease of use in large-scale image feature extraction, and provides a solid, reproducible, and scalable foundation for biomedical image analysis and downstream AI modeling.
Abstract: Modern imaging instruments can produce terabytes to petabytes of data for a single experiment. The biggest barrier to processing big image datasets has been computational, where image analysis algorithms often lack the efficiency needed to process such large datasets or make tradeoffs in robustness and accuracy. Deep learning algorithms have vastly improved the accuracy of the first step in an analysis workflow (region segmentation), but the expansion of domain-specific feature extraction libraries across scientific disciplines has made it difficult to compare the performance and accuracy of extracted features. To address these needs, we developed a novel feature extraction library called Nyxus. Nyxus is designed from the ground up for scalable out-of-core feature extraction for 2D and 3D image data and rigorously tested against established standards. The comprehensive feature set of Nyxus covers multiple biomedical domains including radiomics and cellular analysis, and is designed for computational scalability across CPUs and GPUs. Nyxus has been packaged to be accessible to users of various skill sets and needs: as a Python package for code developers, a command line tool, as a Napari plugin for low to no-code users or users that want to visualize results, and as an Open Container Initiative (OCI) compliant container that can be used in cloud or super-computing workflows aimed at processing large data sets.
Further, Nyxus enables a new methodological approach to feature extraction, allowing programmatic tuning of many feature sets for optimal computational efficiency or coverage in novel machine learning and deep learning applications.
[143] Single Pixel Image Classification using an Ultrafast Digital Light Projector
Aisha Kanwal,Graeme E. Johnstone,Fahimeh Dehkhoda,Johannes H. Herrnsdorf,Robert K. Henderson,Martin D. Dawson,Xavier Porte,Michael J. Strain
Main category: cs.CV
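TL;DR: This paper proposes an ultrafast image classification method that combines single pixel imaging (SPI) with low-complexity machine learning models (an ELM and a shallow DNN). It classifies MNIST digits in real time at multi-kHz frame rates and bypasses conventional image reconstruction, which suits real-time vision tasks such as autonomous driving as well as ultrafast anomaly detection.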
Details
Motivation: Applications such as autonomous driving must process and classify complex visual information in real time in a changing environment, and conventional image acquisition and processing pipelines are limited by speed and computational overhead.
Method: A microLED-on-CMOS digital light projector provides ultrafast SPI. An extreme learning machine (ELM) and a lightweight backpropagation-trained deep neural network, both of very low complexity, classify the spatiotemporally encoded single-pixel measurements directly, without image reconstruction.
Result: Real-time classification at multi-kHz frame rates on the MNIST benchmark. The abstract states that the two models are compared and that inference overhead is comparable to the image generation time. Used as a binary classifier, the ELM shows potential for efficient ultrafast anomaly detection.
Conclusion: End-to-end classification based on SPI and low-complexity models avoids the cost of image reconstruction and opens a path for ultrafast, low-latency machine vision tasks such as autonomous driving perception and online anomaly detection.
Abstract: Pattern recognition and image classification are essential tasks in machine vision. Autonomous vehicles, for example, must be able to collect the complex information contained in a changing environment and classify it in real time. Here, we experimentally demonstrate image classification at multi-kHz frame rates combining the technique of single pixel imaging (SPI) with a low complexity machine learning model. The use of a microLED-on-CMOS digital light projector for SPI enables ultrafast pattern generation for sub-ms image encoding. We investigate the classification accuracy of our experimental system against the broadly accepted benchmarking task of the MNIST digits classification. We compare the classification performance of two machine learning models: an extreme learning machine (ELM) and a backpropagation-trained deep neural network. The complexity of both models is kept low so the overhead added to the inference time is comparable to the image generation time. Crucially, our single pixel image classification approach is based on a spatiotemporal transformation of the information, entirely bypassing the need for image reconstruction. By exploring the performance of our SPI-based ELM as a binary classifier, we demonstrate its potential for efficient anomaly detection in ultrafast imaging scenarios.
[144] Continual Learning with Vision-Language Models via Semantic-Geometry Preservation
Chiyuan He,Zihuan Qiu,Fanman Meng,Runtong Zhang,Linfeng Xu,Qingbo Wu,Hongliang Li
Main category: cs.CV
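TL;DR: This paper proposes SeGP-CL, an exemplar-free continual learning method that mitigates catastrophic forgetting in pretrained vision-language models by preserving cross-modal semantic geometry. It focuses on drift-prone regions near the boundary between old and new semantics, and uses adversarial-anchor-guided geometry distillation and text geometry regularization to improve stability and forward transfer.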
Details
Motivation: Existing continual learning methods adapt to new tasks without explicitly preserving the cross-modal semantic geometry inherited from pretraining and earlier stages, so new-task supervision distorts that geometry, with the largest semantic drift in vulnerable regions near the old-new semantic interface.
Method: Semantic Geometry Preservation for Continual Learning (SeGP-CL): 1) dual-targeted projected gradient descent (DPGD) builds a compact set of adversarial anchors that probe drift-prone regions; 2) anchor-guided cross-modal geometry distillation (ACGD) preserves cross-modal structure; 3) a lightweight text semantic-geometry regularization (TSGR) stabilizes the textual reference frame; 4) after training, anchor-based estimates of raw-space drift are used to transfer old visual prototypes, and inference fuses two paths.
Result: On five continual learning benchmarks, SeGP-CL consistently improves stability and forward transfer, achieves state-of-the-art performance, and better preserves the semantic geometry of VLMs.
Conclusion: Keeping cross-modal semantic geometry consistent is key to mitigating catastrophic forgetting in continual learning of VLMs, and SeGP-CL protects that structure through a geometry-aware, exemplar-free mechanism.
Abstract: Continual learning of pretrained vision-language models (VLMs) is prone to catastrophic forgetting, yet current approaches adapt to new tasks without explicitly preserving the cross-modal semantic geometry inherited from pretraining and previous stages, allowing new-task supervision to induce geometric distortion. We observe that the most pronounced drift tends to concentrate in vulnerable neighborhoods near the old-new semantic interface, where shared visual patterns are easily re-explained by new textual semantics. To address this under an exemplar-free constraint, we propose Semantic Geometry Preservation for Continual Learning (SeGP-CL). SeGP-CL first probes the drift-prone region by constructing a compact set of adversarial anchors with dual-targeted projected gradient descent (DPGD), which drives selected new-task seeds toward old-class semantics while remaining faithful in raw visual space. During training, we preserve cross-modal structure by anchor-guided cross-modal geometry distillation (ACGD), and stabilize the textual reference frame across tasks via a lightweight text semantic-geometry regularization (TSGR). After training, we estimate anchor-induced raw-space drift to transfer old visual prototypes and perform dual-path inference by fusing cross-modal and visual cues.
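For readers unfamiliar with projected gradient descent, the sketch below shows the generic mechanism on a toy problem: a seed point is pushed toward a target while being clipped back into a small L-infinity ball around its starting value. It is not the paper's DPGD; the quadratic objective, the identity "encoder", the function name `pgd_toward_target`, and the values of `eps` and `step` are all assumptions chosen for illustration.

```python
def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)

def pgd_toward_target(seed, target, eps=0.1, step=0.02, iters=50):
    """Generic signed-gradient PGD on the toy objective 0.5 * ||x - target||^2,
    constrained to the L-infinity ball of radius eps around seed."""
    x = list(seed)
    for _ in range(iters):
        # Gradient of the toy objective with respect to x is (x - target).
        grad = [xi - ti for xi, ti in zip(x, target)]
        # Signed descent step toward the target.
        x = [xi - step * sign(g) for xi, g in zip(x, grad)]
        # Projection: clip each coordinate into [seed - eps, seed + eps].
        x = [min(max(xi, si - eps), si + eps) for xi, si in zip(x, seed)]
    return x

anchor = pgd_toward_target([0.0, 0.0], [1.0, -0.05])
print(anchor)
```

The projection step is what keeps the perturbed point close to the original input; in the abstract's terms, this plays the role of "remaining faithful in raw visual space", although the actual method optimizes toward old-class semantics in an embedding space rather than toward a fixed point.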
Extensive experiments on five continual learning benchmarks demonstrate that SeGP-CL consistently improves stability and forward transfer, achieving state-of-the-art performance while better preserving the semantic geometry of VLMs.
[145] Coarse-Guided Visual Generation via Weighted h-Transform Sampling
Yanghao Wang,Ziqi Jiang,Zhen Wang,Long Chen
Main category: cs.CV
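TL;DR: This paper proposes a training-free coarse-to-fine visual generation method that uses the h-transform to impose conditional guidance during diffusion-model sampling, with a noise-level-aware schedule that balances guidance strength against generation quality.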
Details
Motivation: Training-based methods are costly and generalize poorly; training-free methods either depend on a known forward transformation operator or struggle to balance guidance against generation quality. Method: Proposes an h-transform-based guidance method that modifies the transition probability at every step of diffusion sampling by adding a drift term that steers generation, together with a noise-level-aware schedule that decays the term's weight to mitigate approximation error. Result: The method's effectiveness and strong generalization are validated on a range of image and video generation tasks. Conclusion: The method achieves high-quality, flexible coarse-to-fine visual generation without relying on paired data or a preset forward operator. Abstract: Coarse-guided visual generation, which synthesizes fine visual samples from degraded or low-fidelity coarse references, is essential for various real-world applications. While training-based approaches are effective, they are inherently limited by high training costs and restricted generalization due to paired data collection. Accordingly, recent training-free works propose to leverage pretrained diffusion models and incorporate guidance during the sampling process. However, these training-free methods either require knowing the forward (fine-to-coarse) transformation operator, e.g., bicubic downsampling, or struggle to balance guidance against synthesis quality. To address these challenges, we propose a novel guided method using the h-transform, a tool that can constrain stochastic processes (e.g., the sampling process) under desired conditions. Specifically, we modify the transition probability at each sampling timestep by adding a drift function to the original differential equation, which approximately steers the generation toward the ideal fine sample. To address unavoidable approximation errors, we introduce a noise-level-aware schedule that gradually de-weights the term as the error increases, ensuring both guidance adherence and high-quality synthesis. Extensive experiments across diverse image and video generation tasks demonstrate the effectiveness and generalization of our method.
[146] NBAvatar: Neural Billboards Avatars with Realistic Hand-Face Interaction
David Svitov,Mahtab Dahaghin
Main category: cs.CV
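TL;DR: NBAvatar is a new method that combines oriented planar primitives with neural rendering to render high-quality, realistic head avatars under non-rigid deformations such as hand-face interaction.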
Details
Motivation: Existing methods struggle to keep both geometric consistency and appearance detail under complex non-rigid deformations such as hand-face interaction, and fall short especially in novel-view and novel-pose rendering. Method: Proposes NBAvatar, which fuses explicit oriented planar primitives with implicit neural rendering to obtain temporally and pose-consistent geometry with fine-grained appearance synthesis, and implicitly learns the color changes caused by hand-face interaction. Result: Under high-resolution (megapixel) rendering, LPIPS drops by up to 30% relative to Gaussian-based avatar methods, with improved PSNR and SSIM; structural similarity is higher than that of InteractAvatar. Conclusion: By combining explicit and implicit representations, NBAvatar improves the rendering quality of head avatars with hand-face interaction, balancing geometric consistency and appearance fidelity. Abstract: We present NBAvatar, a method for realistic rendering of head avatars that handles non-rigid deformations caused by hand-face interaction. We introduce a novel representation for animated avatars by combining the training of oriented planar primitives with neural rendering. Such a combination of explicit and implicit representations enables NBAvatar to handle temporally and pose-consistent geometry, along with fine-grained appearance details provided by the neural rendering technique. In our experiments, we demonstrate that NBAvatar implicitly learns color transformations caused by face-hand interactions and surpasses existing approaches in terms of novel-view and novel-pose rendering quality. Specifically, NBAvatar achieves up to 30% LPIPS reduction under high-resolution megapixel rendering compared to Gaussian-based avatar methods, while also improving PSNR and SSIM, and achieves higher structural similarity compared to the state-of-the-art hand-face interaction method InteractAvatar.
[147] Dense Dynamic Scene Reconstruction and Camera Pose Estimation from Multi-View Videos
Shuo Sun,Unal Artan,Malcolm Mielle,Achim J. Lilienthal,Martin Magnusson
Main category: cs.CV
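TL;DR: This paper proposes a two-stage optimization framework for dense dynamic scene reconstruction and camera pose estimation from multiple freely moving cameras; a spatiotemporal connection graph and a wide-baseline initialization strategy improve robustness, and the method is validated on the newly introduced real-world MultiCamRobolab dataset.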
Details
Motivation: Existing methods handle only single-camera input or require rigidly calibrated camera rigs, making them hard to apply when multiple freely moving cameras capture a shared event. Method: A two-stage optimization framework: the first stage extends single-camera visual SLAM to the multi-camera setting by building a spatiotemporal connection graph and adding a wide-baseline initialization based on feed-forward reconstruction models; the second stage refines depth and poses by optimizing dense inter- and intra-camera consistency with wide-baseline optical flow. Result: Significantly outperforms state-of-the-art feed-forward models on synthetic and real-world benchmarks while using less memory; also validated on the new real-world MultiCamRobolab dataset (with motion-capture ground-truth poses). Conclusion: The method effectively addresses dynamic scene reconstruction and pose estimation with multiple freely moving cameras, combining robustness, accuracy, and practicality. Abstract: We address the challenging problem of dense dynamic scene reconstruction and camera pose estimation from multiple freely moving cameras -- a setting that arises naturally when multiple observers capture a shared event. Prior approaches either handle only single-camera input or require rigidly mounted, pre-calibrated camera rigs, limiting their practical applicability. We propose a two-stage optimization framework that decouples the task into robust camera tracking and dense depth refinement. In the first stage, we extend single-camera visual SLAM to the multi-camera setting by constructing a spatiotemporal connection graph that exploits both intra-camera temporal continuity and inter-camera spatial overlap, enabling consistent scale and robust tracking. To ensure robustness under limited overlap, we introduce a wide-baseline initialization strategy using feed-forward reconstruction models. In the second stage, we refine depth and camera poses by optimizing dense inter- and intra-camera consistency using wide-baseline optical flow. Additionally, we introduce MultiCamRobolab, a new real-world dataset with ground-truth poses from a motion capture system. Finally, we demonstrate that our method significantly outperforms state-of-the-art feed-forward models on both synthetic and real-world benchmarks, while requiring less memory.
[148] Beyond Convolution: A Taxonomy of Structured Operators for Learning-Based Image Processing
Simone Cammarasana
Main category: cs.CV
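TL;DR: This paper systematically presents five families of operators that extend or replace the standard convolution (decomposition-based, adaptive weighted, basis-adaptive, integral/kernel, and attention-based) and compares them by structural properties, suitable tasks, and computational cost, with an outlook on open directions.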
Details
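Motivation: The standard convolution at the core of CNNs is translation-equivariant and efficient, but its fixed, linear, locally averaging structure struggles to model signal properties such as low-rank structure, adaptive basis representations, and non-uniform spatial dependencies. Method: Presents five families of operators that replace or extend convolution: (i) decomposition-based (SVD/tensor decompositions), (ii) adaptive weighted (position- or content-driven weight modulation), (iii) basis-adaptive (analysis bases optimized jointly with network weights), (iv) integral and kernel (generalized to position-dependent and non-linear kernels), and (v) attention-based (locality assumption fully relaxed); each family is given a formal definition, a structural comparison, and a task-suitability analysis. Result: Builds the first systematic taxonomy of convolution alternatives, compares the families quantitatively and qualitatively across dimensions (linearity, locality, equivariance, computational cost, suitability for image-to-image and image-to-label tasks), and points out open challenges and future directions. Conclusion: The standard convolution has inherent limitations, and each of the five alternative families has advantages under different modeling needs; a unified taxonomy helps advance more flexible, structure-aware image processing models. Abstract: The convolution operator is the fundamental building block of modern convolutional neural networks (CNNs), owing to its simplicity, translational equivariance, and efficient implementation. However, its structure as a fixed, linear, locally-averaging operator limits its ability to capture structured signal properties such as low-rank decompositions, adaptive basis representations, and non-uniform spatial dependencies. This paper presents a systematic taxonomy of operators that extend or replace the standard convolution in learning-based image processing pipelines. We organise the landscape of alternative operators into five families: (i) decomposition-based operators, which separate structural and noise components through singular value or tensor decompositions; (ii) adaptive weighted operators, which modulate kernel contributions as a function of spatial position or signal content; (iii) basis-adaptive operators, which optimise the analysis bases together with the network weights; (iv) integral and kernel operators, which generalise the convolution to position-dependent and non-linear kernels; and (v) attention-based operators, which relax the locality assumption entirely. For each family, we provide a formal definition, a discussion of its structural properties with respect to the convolution, and a critical analysis of the tasks for which the operator is most appropriate.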
We further provide a comparative analysis of all families across relevant dimensions -- linearity, locality, equivariance, computational cost, and suitability for image-to-image and image-to-label tasks -- and outline the open challenges and future directions of this research area.
[149] LoV3D: Grounding Cognitive Prognosis Reasoning in Longitudinal 3D Brain MRI via Regional Volume Assessments
Zhaoyang Jiang,Zhizhong Fu,David McAllister,Yunsoo Kim,Honghan Wu
Main category: cs.CV
TL;DR: LoV3D是一种用于纵向脑MRI分析的3D视觉-语言模型训练流程,可实现区域级解剖评估、纵向对比及三类诊断(正常、轻度认知障碍、痴呆),并通过临床加权验证器和直接偏好优化提升诊断准确性与可靠性。
Details
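Motivation: Existing deep-learning tools fragment longitudinal brain MRI analysis: classifiers output only a label, volumetric segmentation methods lack interpretation, and vision-language models tend to hallucinate; an end-to-end solution is needed that combines diagnostic accuracy, biological plausibility, and clinical interpretability. Method: Proposes LoV3D, a stepped 3D vision-language modeling pipeline that takes longitudinal T1 MRI as input, performs region-level anatomical assessment, compares against the prior scan, and outputs a three-class diagnosis with a diagnostic summary; a clinically weighted Verifier scores candidate outputs automatically from standardized volume metrics and drives Direct Preference Optimization (DPO) without human annotation. Result: On the ADNI test set, three-class diagnostic accuracy reaches 93.7% (+34.8% over the no-grounding baseline), two-class accuracy 97.2% (+4% over the SOTA), and region-level anatomical classification 82.6% (+33.1% over VLM baselines); zero-shot transfer to MIRIAD and AIBL retains high accuracy (95.4% and 82.9%), supporting generalization across sites, scanners, and populations. Conclusion: Through a structured reasoning path, longitudinal consistency constraints, and guidance toward biological plausibility, LoV3D lowers hallucination risk and improves clinical trustworthiness and interpretability while keeping diagnostic performance high, offering a new paradigm for longitudinal assessment of neurodegenerative disease. Abstract: Longitudinal brain MRI is essential for characterizing the progression of neurological diseases such as Alzheimer's disease. However, current deep-learning tools fragment this process: classifiers reduce a scan to a label, volumetric pipelines produce uninterpreted measurements, and vision-language models (VLMs) may generate fluent but potentially hallucinated conclusions. We present LoV3D, a pipeline for training 3D vision-language models, which reads longitudinal T1-weighted brain MRI, produces a region-level anatomical assessment, conducts longitudinal comparison with the prior scan, and finally outputs a three-class diagnosis (Cognitively Normal, Mild Cognitive Impairment, or Dementia) along with a synthesized diagnostic summary. The stepped pipeline grounds the final diagnosis by enforcing label consistency, longitudinal coherence, and biological plausibility, thereby reducing the risk of hallucinations. The training process introduces a clinically-weighted Verifier that scores candidate outputs automatically against normative references derived from standardized volume metrics, driving Direct Preference Optimization without a single human annotation. On a subject-level held-out ADNI test set (479 scans, 258 subjects), LoV3D achieves 93.7% three-class diagnostic accuracy (+34.8% over the no-grounding baseline), 97.2% two-class diagnostic accuracy (+4% over the SOTA), and 82.6% region-level anatomical classification accuracy (+33.1% over VLM baselines).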
Zero-shot transfer yields 95.4% on MIRIAD (100% Dementia recall) and 82.9% three-class accuracy on AIBL, confirming high generalizability across sites, scanners, and populations. Code is available at https://github.com/Anonymous-TEVC/LoV-3D.
[150] Node-RF: Learning Generalized Continuous Space-Time Scene Dynamics with Neural ODE-based NeRFs
Hiran Sarkar,Liming Kuang,Yordanka Velikova,Benjamin Busam
Main category: cs.CV
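TL;DR: Node-RF combines neural ordinary differential equations (NODEs) with dynamic neural radiance fields (NeRFs) to model scene dynamics continuously in space and time, supporting long-range extrapolation at constant memory cost.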
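As a rough illustration of the idea summarized here, evolving a latent scene state with an ODE solver while holding only the current state in memory, the sketch below integrates a small latent vector with a fixed-step Runge-Kutta scheme. It is not Node-RF's implementation: `dynamics` is a hypothetical hand-written stand-in for a learned network, and the names `rk4_integrate`, `z0`, and `steps` are assumptions made for the example.

```python
import numpy as np

def rk4_integrate(f, z0, t0, t1, steps):
    """Integrate dz/dt = f(t, z) from t0 to t1 with classic fixed-step RK4.

    Only the current state is kept, so memory use does not grow with the
    integration horizon.
    """
    z = np.asarray(z0, dtype=float)
    h = (t1 - t0) / steps
    t = t0
    for _ in range(steps):
        k1 = f(t, z)
        k2 = f(t + h / 2, z + h / 2 * k1)
        k3 = f(t + h / 2, z + h / 2 * k2)
        k4 = f(t + h, z + h * k3)
        z = z + (h / 6) * (k1 + 2 * k2 + 2 * k3 + k4)
        t += h
    return z

def dynamics(t, z):
    # Hypothetical stand-in for a learned dynamics network:
    # a damped rotation of a 2-D latent state.
    A = np.array([[-0.1, -1.0],
                  [1.0, -0.1]])
    return A @ z

z0 = np.array([1.0, 0.0])
# "Extrapolating" simply means integrating to a later time than before.
z_near = rk4_integrate(dynamics, z0, 0.0, 2.0, steps=200)
z_far = rk4_integrate(dynamics, z0, 0.0, 20.0, steps=2000)
```

In a full system the integrated state would be passed to a renderer; here it is only a vector, which keeps the solver logic easy to check against the closed-form solution of this linear system.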
Details
Motivation: Existing methods model scene dynamics only within the observed range and struggle to extrapolate beyond the training sequence; a continuous spatiotemporal representation that generalizes to unseen motion patterns is needed. Method: Combines neural ODEs with dynamic NeRF: an ODE solver implicitly evolves the scene state (feature embeddings), and a NeRF renderer synthesizes images from arbitrary viewpoints, supporting long-range extrapolation; training is done jointly on multiple motion sequences that share dynamics. Result: Node-RF characterizes abstract system behavior without an explicit physical model, identifies dynamical critical points that matter for future prediction, and extrapolates long-range dynamics better than existing methods. Conclusion: Node-RF provides a memory-efficient, well-generalizing continuous spatiotemporal representation framework and opens a new path for vision-driven modeling and prediction of dynamic scenes. Abstract: Predicting scene dynamics from visual observations is challenging. Existing methods capture dynamics only within observed boundaries, failing to extrapolate far beyond the training sequence. Node-RF (Neural ODE-based NeRF) overcomes this limitation by integrating Neural Ordinary Differential Equations (NODEs) with dynamic Neural Radiance Fields (NeRFs), enabling a continuous-time, spatiotemporal representation that generalizes beyond observed trajectories at constant memory cost. From visual input, Node-RF learns an implicit scene state that evolves over time via an ODE solver, propagating feature embeddings via differential calculus. A NeRF-based renderer interprets the calculated embeddings to synthesize arbitrary views for long-range extrapolation. Training on multiple motion sequences with shared dynamics allows for generalization to unseen conditions. Our experiments demonstrate that Node-RF can characterize abstract system behavior without an explicit model, identifying critical points for future predictions.
[151] Towards Universal Computational Aberration Correction in Photographic Cameras: A Comprehensive Benchmark Analysis
Xiaolong Qian,Qi Jiang,Yao Gao,Lei Sun,Zhonghua Yi,Kailun Yang,Luc Van Gool,Kaiwei Wang
Main category: cs.CV
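TL;DR: This paper introduces UniCAC, a large-scale benchmark of photographic lenses, and the Optical Degradation Evaluator (ODE); it systematically evaluates 24 CAC algorithms and identifies prior utilization, network architecture, and training strategy as the three key factors affecting performance.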
Details
Motivation: Existing computational aberration correction (CAC) methods generalize poorly and must be retrained for new lenses; there is no comprehensive benchmark covering a wide range of optical aberrations, and the key factors affecting CAC performance remain unclear. Method: Builds the large-scale UniCAC benchmark via automatic optical design; proposes the Optical Degradation Evaluator (ODE) to quantify aberration difficulty; runs systematic experiments and comparisons on 24 image restoration and CAC algorithms. Result: Identifies prior utilization, network architecture, and training strategy as the three most important factors affecting CAC performance and analyzes how each acts; provides a reproducible benchmark, dataset, code, and Zemax files. Conclusion: The UniCAC benchmark and the ODE framework give cross-lens CAC research a reliable basis for evaluation, and the identified factors offer guidance for designing future universal CAC methods. Abstract: Prevalent Computational Aberration Correction (CAC) methods are typically tailored to specific optical systems, leading to poor generalization and labor-intensive re-training for new lenses. Developing CAC paradigms capable of generalizing across diverse photographic lenses offers a promising solution to these challenges. However, efforts to achieve such cross-lens universality within consumer photography are still in their early stages due to the lack of a comprehensive benchmark that encompasses a sufficiently wide range of optical aberrations. Furthermore, it remains unclear which specific factors influence existing CAC methods and how these factors affect their performance. In this paper, we present comprehensive experiments and evaluations involving 24 image restoration and CAC algorithms, utilizing our newly proposed UniCAC, a large-scale benchmark for photographic cameras constructed via automatic optical design. The Optical Degradation Evaluator (ODE) is introduced as a novel framework to objectively assess the difficulty of CAC tasks, offering credible quantification of optical aberrations and enabling reliable evaluation. Drawing on our comparative analysis, we identify three key factors -- prior utilization, network architecture, and training strategy -- that most significantly influence CAC performance, and further investigate their respective effects. We believe that our benchmark, dataset, and observations contribute foundational insights to related areas and lay the groundwork for future investigations.
Benchmarks, code, and Zemax files will be available at https://github.com/XiaolongQian/UniCAC.
[152] EvoTok: A Unified Image Tokenizer via Residual Latent Evolution for Visual Understanding and Generation
Yan Li,Ning Liao,Xiangyu Zhao,Shaofeng Zhang,Xiaoxing Wang,Yifan Yang,Junchi Yan,Xue Yang
Main category: cs.CV
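TL;DR: This paper proposes EvoTok, an image tokenizer that unifies visual understanding and generation in a shared latent space through a residual evolution process, addressing the interference or inconsistency that granularity differences cause in existing methods.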
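For readers unfamiliar with residual vector quantization, the general technique this entry builds on, the following is a minimal generic sketch and not EvoTok's tokenizer. The codebooks are random rather than learned, and the names `residual_vector_quantize`, `codebooks`, and the sizes used are assumptions chosen for illustration.

```python
import numpy as np

def residual_vector_quantize(x, codebooks):
    """Quantize x with a cascade of codebooks, one token per stage.

    Each stage picks the codeword nearest to the residual left by the
    previous stages, so the token sequence refines the approximation.
    """
    residual = np.asarray(x, dtype=float).copy()
    reconstruction = np.zeros_like(residual)
    indices = []
    for codebook in codebooks:
        # Nearest codeword to what is still unexplained.
        distances = np.linalg.norm(codebook - residual, axis=1)
        idx = int(np.argmin(distances))
        indices.append(idx)
        reconstruction += codebook[idx]
        residual -= codebook[idx]
    return indices, reconstruction, residual

rng = np.random.default_rng(0)
# Four stages, 16 codewords each, 8-dimensional vectors (illustrative sizes).
codebooks = [rng.normal(scale=1.0 / (stage + 1), size=(16, 8)) for stage in range(4)]
x = rng.normal(size=8)
indices, reconstruction, residual = residual_vector_quantize(x, codebooks)
```

The returned `indices` form the cascaded token sequence; how the stages are trained so that later ones carry more semantic content is specific to the paper and is not modeled here.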
Details
Motivation: Unified multimodal large language models (MLLMs) face a fundamental granularity gap between visual understanding and generation: understanding needs high-level semantic abstraction, while generation needs fine-grained pixel-level representation; existing methods either force both kinds of supervision onto one representation, causing interference, or decouple them into separate feature spaces, causing inconsistency. Method: Proposes EvoTok, which uses residual vector quantization to encode an image as a cascaded sequence of residual tokens, building an evolution trajectory from low-level detail to high-level semantics in a shared latent space rather than maintaining separate pixel and semantic token spaces. Result: Trained on only 13M images, EvoTok reaches a reconstruction quality of 0.43 rFID on ImageNet-1K at 256×256 resolution; integrated with a large language model, it performs well on 7 of 9 visual understanding benchmarks and achieves notable results on image generation benchmarks such as GenEval and GenAI-Bench. Conclusion: Modeling visual representations as an evolution trajectory is an effective and principled way to unify visual understanding and generation. Abstract: The development of unified multimodal large language models (MLLMs) is fundamentally challenged by the granularity gap between visual understanding and generation: understanding requires high-level semantic abstractions, while image generation demands fine-grained pixel-level representations. Existing approaches usually either enforce both kinds of supervision on the same set of representations or decouple them into separate feature spaces, leading to interference and inconsistency, respectively. In this work, we propose EvoTok, a unified image tokenizer that reconciles these requirements through a residual evolution process within a shared latent space. Instead of maintaining separate token spaces for pixels and semantics, EvoTok encodes an image into a cascaded sequence of residual tokens via residual vector quantization. This residual sequence forms an evolution trajectory where earlier stages capture low-level details and deeper stages progressively transition toward high-level semantic representations. Despite being trained on a relatively modest dataset of 13M images, far smaller than the billion-scale datasets used by many previous unified tokenizers, EvoTok achieves a strong reconstruction quality of 0.43 rFID on ImageNet-1K at 256x256 resolution. When integrated with a large language model, EvoTok shows promising performance across 7 out of 9 visual understanding benchmarks, and remarkable results on image generation benchmarks such as GenEval and GenAI-Bench.
These results demonstrate that modeling visual representations as an evolving trajectory provides an effective and principled solution for unifying visual understanding and generation.
[153] Hoi3DGen: Generating High-Quality Human-Object-Interactions in 3D
Agniv Sharma,Xianghui Xie,Tom Fischer,Eddy Ilg,Gerard Pons-Moll
Main category: cs.CV
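TL;DR: Hoi3DGen is a framework that generates high-quality, high-fidelity 3D human-object interaction models from text; it curates high-quality interaction data with multimodal large language models and builds an end-to-end text-to-3D pipeline, substantially improving text consistency and 3D quality.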
Details
Motivation: Existing methods based on score distillation from image diffusion models suffer from the Janus problem and fail to follow text prompts faithfully, mainly because high-quality 3D human-object interaction data is scarce. Method: Proposes the Hoi3DGen framework: it first uses multimodal large language models to curate a realistic, high-quality human-object interaction dataset, then builds an end-to-end text-to-3D pipeline that directly generates textured 3D meshes. Result: Surpasses baselines by 4-15x in text consistency and 3-7x in 3D model quality, generalizes strongly across categories and interaction types, and maintains high-quality 3D generation. Conclusion: Hoi3DGen addresses the fidelity and data bottlenecks of text-driven 3D human-object interaction generation and offers a dependable technical route for AR, XR, and gaming applications. Abstract: Modeling and generating 3D human-object interactions from text is crucial for applications in AR, XR, and gaming. Existing approaches often rely on score distillation from text-to-image models, but their results suffer from the Janus problem and do not follow text prompts faithfully due to the scarcity of high-quality interaction data. We introduce Hoi3DGen, a framework that generates high-quality textured meshes of human-object interaction that follow the input interaction descriptions precisely. We first curate realistic and high-quality interaction data leveraging multimodal large language models, and then create a full text-to-3D pipeline, which achieves orders-of-magnitude improvements in interaction fidelity. Our method surpasses baselines by 4-15x in text consistency and 3-7x in 3D model quality, exhibiting strong generalization to diverse categories and interaction types, while maintaining high-quality 3D generation.
[154] HATS: Hardness-Aware Trajectory Synthesis for GUI Agents
Rui Shao,Ruize Gao,Bin Xie,Yixing Li,Kaiwen Zhou,Shuai Wang,Weili Guan,Gongwei Chen
Main category: cs.CV
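TL;DR: This paper proposes HATS, a hardness-aware trajectory synthesis framework that addresses the poor generalization caused by semantically ambiguous actions in GUI agent training; its two modules, hardness-driven exploration and alignment-guided refinement, significantly improve agent performance in GUI environments.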
Details
Motivation: Existing trajectory synthesis pipelines for GUI agents overlook semantically ambiguous actions (context-dependent, sequentially dependent, or visually ambiguous ones), which leads to poor generalization and semantic misalignment between instructions and execution. Method: Proposes the HATS (Hardness-Aware Trajectory Synthesis) framework, defines hardness as the degree of semantic ambiguity of an action, and designs two closed-loop modules: (1) hardness-driven exploration, which focuses data collection on difficult but informative interactions; (2) alignment-guided refinement, which iteratively validates and repairs the semantic alignment between instruction and execution. Result: Across multiple benchmark GUI environments, agents trained with HATS consistently outperform state-of-the-art baselines. Conclusion: By explicitly modeling and handling semantic ambiguity, HATS improves the robustness and generalization of GUI agents and offers a new paradigm for synthesizing high-quality trajectory data. Abstract: Graphical user interface (GUI) agents powered by large vision-language models (VLMs) have shown remarkable potential in automating digital tasks, highlighting the need for high-quality trajectory data to support effective agent training. Yet existing trajectory synthesis pipelines often yield agents that fail to generalize beyond simple interactions. We identify this limitation as stemming from the neglect of semantically ambiguous actions, whose meanings are context-dependent, sequentially dependent, or visually ambiguous. Such actions are crucial for real-world robustness but are under-represented and poorly processed in current datasets, leading to semantic misalignment between task instructions and execution. To address these issues, we propose HATS, a Hardness-Aware Trajectory Synthesis framework designed to mitigate the impact of semantic ambiguity. We define hardness as the degree of semantic ambiguity associated with an action and develop two complementary modules: (1) hardness-driven exploration, which guides data collection toward ambiguous yet informative interactions, and (2) alignment-guided refinement, which iteratively validates and repairs instruction-execution alignment. The two modules operate in a closed loop: exploration supplies refinement with challenging trajectories, while refinement feedback updates the hardness signal to guide future exploration. Extensive experiments show that agents trained with HATS consistently outperform state-of-the-art baselines across benchmark GUI environments.
[155] O3N: Omnidirectional Open-Vocabulary Occupancy Prediction
Mengfei Duan,Hao Shi,Fei Teng,Guoqiang Zhao,Yuheng Zhang,Zhiyong Li,Kailun Yang
Main category: cs.CV
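TL;DR: O3N is the first purely visual, end-to-end omnidirectional open-vocabulary occupancy prediction framework; using a Polar-spiral Mamba module, an Occupancy Cost Aggregation module, and a Natural Modality Alignment mechanism, it models continuous 360° space and reconstructs geometry consistent with semantics, reaching state-of-the-art results on multiple benchmarks with strong generalization.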
Details
Motivation: Existing 3D occupancy prediction methods are limited by narrow-view inputs and predefined training distributions, so they cannot meet the need of embodied agents for panoramic, safe scene perception during open-world exploration. Method: Proposes the O3N framework with three core modules: (1) the Polar-spiral Mamba (PsM) module, which embeds omnidirectional voxels in a polar-spiral topology for continuous spatial representation and long-range context modeling; (2) the Occupancy Cost Aggregation (OCA) module, which unifies geometric and semantic supervision; (3) the Natural Modality Alignment (NMA) module, which yields a consistent pixel-voxel-text representation triad. Result: Reaches state-of-the-art performance on the QuadOcc and Human360Occ benchmarks and shows strong cross-scene generalization and semantic scalability. Conclusion: O3N offers a new paradigm for universal 3D world modeling and moves embodied intelligence and autonomous agents toward more comprehensive, safe, and open understanding of their environment. Abstract: Understanding and reconstructing the 3D world through omnidirectional perception is an inevitable trend in the development of autonomous agents and embodied intelligence. However, existing 3D occupancy prediction methods are constrained by limited perspective inputs and predefined training distributions, making them difficult to apply to embodied agents that require comprehensive and safe perception of scenes in open-world exploration. To address this, we present O3N, the first purely visual, end-to-end Omnidirectional Open-vocabulary Occupancy predictioN framework. O3N embeds omnidirectional voxels in a polar-spiral topology via the Polar-spiral Mamba (PsM) module, enabling continuous spatial representation and long-range context modeling across 360°. The Occupancy Cost Aggregation (OCA) module introduces a principled mechanism for unifying geometric and semantic supervision within the voxel space, ensuring consistency between the reconstructed geometry and the underlying semantic structure. Moreover, Natural Modality Alignment (NMA) establishes a gradient-free alignment pathway that harmonizes visual features, voxel embeddings, and text semantics, forming a consistent "pixel-voxel-text" representation triad. Extensive experiments on multiple models demonstrate that our method not only achieves state-of-the-art performance on QuadOcc and Human360Occ benchmarks but also exhibits remarkable cross-scene generalization and semantic scalability, paving the way toward universal 3D world modeling.
The source code will be made publicly available at https://github.com/MengfeiD/O3N.
[156] FlashMotion: Few-Step Controllable Video Generation with Trajectory Guidance
Quanhao Li,Zhen Xing,Rui Wang,Haidong Cao,Qi Dai,Daoguo Dong,Zuxuan Wu
Main category: cs.CV
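TL;DR: This paper proposes FlashMotion, a framework for few-step trajectory-controllable video generation that uses multi-step training, distillation, and hybrid finetuning to improve trajectory accuracy while keeping video quality high.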
Details
Motivation: Existing trajectory-controllable video generation methods rely on multi-step denoising, which is computationally expensive, while directly applying video distillation methods noticeably degrades video quality and trajectory accuracy. Method: FlashMotion has three stages: (1) train a trajectory adapter on a multi-step video generator; (2) distill the generator into a few-step version; (3) finetune the adapter with a hybrid strategy that combines diffusion and adversarial objectives so that it matches the few-step generator. A new benchmark, FlashBench, is also built to evaluate long-sequence trajectory-controllable video generation. Result: On two adapter architectures, FlashMotion outperforms existing video distillation methods and multi-step models in both visual quality and trajectory consistency. Conclusion: FlashMotion resolves the quality-accuracy trade-off of few-step trajectory-controllable video generation and offers a new paradigm for efficient controllable video generation. Abstract: Recent advances in trajectory-controllable video generation have achieved remarkable progress. Previous methods mainly use adapter-based architectures for precise motion control along predefined trajectories. However, all these methods rely on a multi-step denoising process, leading to substantial time redundancy and computational overhead. While existing video distillation methods successfully distill multi-step generators into few-step ones, directly applying these approaches to trajectory-controllable video generation results in noticeable degradation in both video quality and trajectory accuracy. To bridge this gap, we introduce FlashMotion, a novel training framework designed for few-step trajectory-controllable video generation. We first train a trajectory adapter on a multi-step video generator for precise trajectory control. Then, we distill the generator into a few-step version to accelerate video generation. Finally, we finetune the adapter using a hybrid strategy that combines diffusion and adversarial objectives, aligning it with the few-step generator to produce high-quality, trajectory-accurate videos. For evaluation, we introduce FlashBench, a benchmark for long-sequence trajectory-controllable video generation that measures both video quality and trajectory accuracy across varying numbers of foreground objects. Experiments on two adapter architectures show that FlashMotion surpasses existing video distillation methods and previous multi-step models in both visual quality and trajectory consistency.
[157] EgoIntent: An Egocentric Step-level Benchmark for Understanding What, Why, and Next
Ye Pan,Chi Kit Wong,Yuanhuiyi Lyu,Hanqian Li,Jiahao Huo,Jiacheng Chen,Lutao Jiang,Xu Zheng,Xuming Hu
Main category: cs.CV
TL;DR: 本文提出了EgoIntent基准,用于评估多模态大语言模型在自我中心视频中细粒度步骤级意图理解能力,涵盖局部意图(What)、全局意图(Why)和下一步计划(Next)三个维度,实验表明当前MLLMs在此任务上性能仍十分有限。
Details
Motivation: Existing benchmarks cover only episode-level intent reasoning and overlook finer step-level intent understanding, yet applications such as intelligent assistants, robotic imitation learning, and augmented reality guidance need to understand what a person is doing at each step, why, and what comes next. Method: Builds the EgoIntent benchmark: 3,014 steps covering 15 indoor and outdoor daily-life scenarios; each clip is truncated before the key action occurs to prevent future-frame leakage; evaluation covers local intent (What), global intent (Why), and next-step plan (Next). Result: Across 15 mainstream MLLMs (closed-source and open-source), the best model averages only 33.31 over the three dimensions, showing that the task is highly challenging. Conclusion: Step-level intent understanding in egocentric video remains underexplored and difficult, and calls for new methods and further research. Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable video reasoning capabilities across diverse tasks. However, their ability to understand human intent at a fine-grained level in egocentric videos remains largely unexplored. Existing benchmarks focus primarily on episode-level intent reasoning, overlooking the finer granularity of step-level intent understanding. Yet applications such as intelligent assistants, robotic imitation learning, and augmented reality guidance require understanding not only what a person is doing at each step, but also why and what comes next, in order to provide timely and context-aware support. To this end, we introduce EgoIntent, a step-level intent understanding benchmark for egocentric videos. It comprises 3,014 steps spanning 15 diverse indoor and outdoor daily-life scenarios, and evaluates models on three complementary dimensions: local intent (What), global intent (Why), and next-step plan (Next). Crucially, each clip is truncated immediately before the key outcome of the queried step (e.g., contact or grasp) occurs and contains no frames from subsequent steps, preventing future-frame leakage and enabling a clean evaluation of anticipatory step understanding and next-step planning. We evaluate 15 MLLMs, including both state-of-the-art closed-source and open-source models.
Even the best-performing model achieves an average score of only 33.31 across the three intent dimensions, underscoring that step-level intent understanding in egocentric videos remains a highly challenging problem that calls for further investigation.
[158] GlyphBanana: Advancing Precise Text Rendering Through Agentic Workflows
Zexuan Yan,Jiarui Jin,Yue Ma,Shijian Wang,Jiahui Hu,Wenxiang Jiao,Yuan Lu,Linfeng Zhang
Main category: cs.CV
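TL;DR: This paper proposes GlyphBanana, a training-free agentic workflow that uses auxiliary tools to inject glyph templates into the latent space and attention maps, improving how accurately text-to-image models render complex text and mathematical formulas.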
Details
Motivation: Current generative models have limited instruction-following ability on out-of-distribution prompts, which makes their rendering of complex text and mathematical formulas inaccurate. Method: Proposes GlyphBanana, a training-free agentic workflow that uses auxiliary tools to inject glyph templates into the latent space and attention maps, enabling iterative refinement of generated images. Result: The method plugs into a variety of Text-to-Image models and clearly outperforms existing baselines on rendering complex characters and formulas. Conclusion: GlyphBanana is a general, efficient, training-free way to enhance text rendering and suggests a new direction for improving how T2I models understand and generate structured symbolic content. Abstract: Despite recent advances in generative models driving significant progress in text rendering, accurately generating complex text and mathematical formulas remains a formidable challenge. This difficulty primarily stems from the limited instruction-following capabilities of current models when encountering out-of-distribution prompts. To address this, we introduce GlyphBanana, alongside a corresponding benchmark specifically designed for rendering complex characters and formulas. GlyphBanana employs an agentic workflow that integrates auxiliary tools to inject glyph templates into both the latent space and attention maps, facilitating the iterative refinement of generated images. Notably, our training-free approach can be seamlessly applied to various Text-to-Image (T2I) models, achieving superior precision compared to existing baselines. Extensive experiments demonstrate the effectiveness of our proposed workflow. Associated code is publicly available at https://github.com/yuriYanZeXuan/GlyphBanana.
[159] LatentGeo: Learnable Auxiliary Constructions in Latent Space for Multimodal Geometric Reasoning
Haiying Xu,Zihan Wang,Song Dai,Zhengxuan Zhang,Kairan Dou,Xuming Hu
Main category: cs.CV
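TL;DR: This paper proposes LatentGeo, a framework that learns continuous latent visual representations to internalize auxiliary geometric constructions, avoiding pixel-level rendering and external executors; combined with a three-stage curriculum and the latent-aware reinforcement learning method LaGDPO, it substantially improves geometric reasoning on the new GeoAux benchmark and on MathVerse.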
Details
Motivation: Existing ways of representing auxiliary geometric constructions model spatial relationships inaccurately, suffer a representation mismatch between symbols and geometric structure, or depend on external tools that hinder end-to-end optimization. Method: Proposes the LatentGeo framework: (1) learn continuous latent visual representations that implicitly encode auxiliary constructions; (2) use a three-stage curriculum to align and internalize the latent representations; (3) introduce LaGDPO, a latent-aware reinforcement learning algorithm that stabilizes training and improves task correctness. Result: On the newly built GeoAux benchmark and on MathVerse, LatentGeo achieves substantial gains on geometric reasoning tasks that require auxiliary constructions; ablations confirm the contribution of each module. Conclusion: LatentGeo gives multimodal large models an efficient, end-to-end optimizable latent-representation paradigm for auxiliary constructions in geometric reasoning, overcoming inherent limitations of explicit construction methods. Abstract: Despite recent advances in multimodal reasoning, representing auxiliary geometric constructions remains a fundamental challenge for multimodal large language models (MLLMs). Such constructions are absent from the original diagram and must be introduced before theorems apply. Existing approaches predominantly rely on explicit construction paradigms, including text-based geometric specification, visual-token interleaving during reasoning, and tool-augmented geometric execution. However, these methods either fail to faithfully represent complex spatial relationships, incur representation mismatch between discrete symbols and continuous geometric structures, or rely on external capabilities that hinder end-to-end optimization. To address these limitations, we propose LatentGeo, a framework that learns continuous latent visual representations to internalize auxiliary geometric constructions without pixel-level rendering or external executors. We design a three-stage curriculum that progressively aligns and internalizes these latent representations through auxiliary visual supervision, followed by LaGDPO, a latent-aware reinforcement learning procedure that stabilizes latent representations during policy optimization while improving end-task correctness. To systematically evaluate construction-centric representation quality, we introduce GeoAux, a new benchmark targeting visually dependent geometry problems, and conduct experiments on GeoAux and MathVerse. Results show that LatentGeo achieves substantial gains on geometric reasoning tasks, particularly those requiring auxiliary constructions.
Extensive analyses and ablation studies further validate the effectiveness of each component in our framework.
[160] BehaviorVLM: Unified Finetuning-Free Behavioral Understanding with Vision-Language Reasoning
Jingyang Ke,Weihan Li,Amartya Pradhan,Jeffrey Markowitz,Anqi Wu
Main category: cs.CV
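TL;DR: This paper proposes BehaviorVLM, a unified vision-language framework for pose estimation and behavioral understanding of freely moving animals that needs no task-specific finetuning and very little human labeling; it guides pretrained vision-language models through explicit, verifiable reasoning steps.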
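One of the geometric checks this entry mentions for exposing low-confidence pose labels is reprojection error. The sketch below shows the standard pinhole-camera version of that check; it is not BehaviorVLM's pipeline. The camera matrix, the 3-pixel threshold, and the function names are assumptions chosen for the example.

```python
import numpy as np

def project(P, points_3d):
    """Project (N, 3) world points to (N, 2) pixels with a 3x4 camera matrix P."""
    homogeneous = np.hstack([points_3d, np.ones((points_3d.shape[0], 1))])
    image_h = homogeneous @ P.T
    return image_h[:, :2] / image_h[:, 2:3]

def reprojection_error(P, points_3d, observed_px):
    """Per-keypoint pixel distance between projected and labeled positions."""
    return np.linalg.norm(project(P, points_3d) - observed_px, axis=1)

# Hypothetical camera: focal length 500 px, principal point (320, 240),
# placed at the world origin looking down +z.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
P = K @ np.hstack([np.eye(3), np.zeros((3, 1))])

points_3d = np.array([[0.0, 0.0, 2.0],
                      [0.2, -0.1, 4.0]])
labels_px = np.array([[320.0, 240.0],    # agrees with the geometry
                      [350.0, 227.5]])   # 5 px off along x

errors = reprojection_error(P, points_3d, labels_px)
low_confidence = errors > 3.0  # illustrative threshold, in pixels
```

Labels flagged this way could then be filtered or sent for correction, which is the role the abstract assigns to such checks.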
Details
Motivation: Existing pose estimation and behavioral understanding methods depend heavily on human annotation or unstable unsupervised pipelines, which limits scalability and reproducibility. Method: Proposes the BehaviorVLM framework: the pose estimation part combines quantum-dot-grounded data with multi-stage temporal, spatial, and cross-view reasoning and exposes low-confidence labels through geometric checks such as reprojection error; the behavioral understanding part combines deep embedded clustering, VLM-based per-clip video captioning, and LLM-based reasoning that merges and labels segments, and can segment behavior without keypoints. Result: Greatly reduces the need for human annotation and produces pose labels that can be verified, corrected, and used to fine-tune downstream models; enables end-to-end behavior discovery and semantic labeling without keypoints; supports scalable, interpretable, label-light analysis of multi-animal behavior. Conclusion: BehaviorVLM offers a more robust, efficient, and interpretable paradigm for analyzing free behavior in neuroscience research, supporting reproducible modeling from neural activity to natural behavior. Abstract: Understanding freely moving animal behavior is central to neuroscience, where pose estimation and behavioral understanding form the foundation for linking neural activity to natural actions. Yet both tasks still depend heavily on human annotation or unstable unsupervised pipelines, limiting scalability and reproducibility. We present BehaviorVLM, a unified vision-language framework for pose estimation and behavioral understanding that requires no task-specific finetuning and minimal human labeling by guiding pretrained Vision-Language Models (VLMs) through detailed, explicit, and verifiable reasoning steps. For pose estimation, we leverage quantum-dot-grounded behavioral data and propose a multi-stage pipeline that integrates temporal, spatial, and cross-view reasoning. This design greatly reduces human annotation effort, exposes low-confidence labels through geometric checks such as reprojection error, and produces labels that can later be filtered, corrected, or used to fine-tune downstream pose models. For behavioral understanding, we propose a pipeline that integrates deep embedded clustering for over-segmented behavior discovery, VLM-based per-clip video captioning, and LLM-based reasoning to merge and semantically label behavioral segments. The behavioral pipeline can operate directly from visual information and does not require keypoints to segment behavior. Together, these components enable scalable, interpretable, and label-light analysis of multi-animal behavior.
[161] ForensicZip: More Tokens are Better but Not Necessary in Forensic Vision-Language Models
Yingxin Lai,Zitong Yu,Jun Wang,Linlin Shen,Yong Xu,Xiaochun Cao
Main category: cs.CV
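TL;DR: This paper proposes ForensicZip, a training-free visual token compression framework built from a forgery-driven perspective. It models temporal token evolution as a Birth-Death Optimal Transport problem and combines it with high-frequency priors, preserving the physical-discontinuity cues needed for forgery detection and keeping performance high under heavy compression.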
Details
Motivation: Existing visual token pruning methods are mostly semantic-driven and tend to discard background regions that carry forgery traces such as high-frequency anomalies and temporal jitter, making it hard to achieve both acceleration and forensic detection performance.
Method: ForensicZip models token evolution along the temporal dimension as a Birth-Death Optimal Transport problem with a slack dummy node to quantify physical discontinuities, then fuses a transport-based novelty score with high-frequency priors to separate forensic evidence from semantic content.
Result: On deepfake and AIGC benchmarks, at 10% token retention the method achieves a 2.97× speedup and over 90% FLOPs reduction while maintaining state-of-the-art detection performance.
Conclusion: ForensicZip shows that forgery-driven token compression outperforms conventional semantic-driven approaches, offering a new paradigm for efficient, interpretable multimedia forensics.
Abstract: Multimodal Large Language Models (MLLMs) enable interpretable multimedia forensics by generating textual rationales for forgery detection. However, processing dense visual sequences incurs high computational costs, particularly for high-resolution images and videos. Visual token pruning is a practical acceleration strategy, yet existing methods are largely semantic-driven, retaining salient objects while discarding background regions where manipulation traces such as high-frequency anomalies and temporal jitters often reside. To address this issue, we introduce ForensicZip, a training-free framework that reformulates token compression from a forgery-driven perspective. ForensicZip models temporal token evolution as a Birth-Death Optimal Transport problem with a slack dummy node, quantifying physical discontinuities indicating transient generative artifacts. The forensic scoring further integrates transport-based novelty with high-frequency priors to separate forensic evidence from semantic content under large-ratio compression. Experiments on deepfake and AIGC benchmarks show that at 10% token retention, ForensicZip achieves 2.97× speedup and over 90% FLOPs reduction while maintaining state-of-the-art detection performance.
[162] RDNet: Region Proportion-Aware Dynamic Adaptive Salient Object Detection Network in Optical Remote Sensing Images
Bin Wan,Runmin Cong,Xiaofei Zhou,Hao Fang,Yaoqi Sun,Sam Kwong
Main category: cs.CV
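TL;DR: This paper proposes RDNet, which replaces the CNN backbone with a SwinTransformer and adds three key modules (DAD, FCE, RPL) to improve robustness to scale variation and localization accuracy in salient object detection for remote sensing images.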
Details
Motivation: Salient object detection in remote sensing images is challenged by large variations in object size, the high computational cost of self-attention, and the difficulty CNNs have in modeling global context and long-range dependencies. Methods with fixed convolution kernels adapt poorly to multi-scale objects, causing detail loss or aggregation of irrelevant features.
Method: The Region Proportion-Aware Dynamic Adaptive Salient Object Detection Network (RDNet) replaces the CNN backbone with the SwinTransformer to model global context and adds three modules: the Dynamic Adaptive Detail-aware (DAD) module, which selects convolution kernels dynamically according to region proportion; the Frequency-matching Context Enhancement (FCE) module, which enriches context by combining wavelet transforms with attention; and the Region Proportion-aware Localization (RPL) module, which uses cross-attention and a Proportion Guidance block to assist the DAD module.
Result: RDNet is more robust to scale variation and localizes more precisely on remote sensing salient object detection, outperforming state-of-the-art methods on multiple metrics.
Conclusion: By dynamically sensing region proportions and modeling multi-scale context, RDNet mitigates the large scale differences among salient objects and the underuse of global information in remote sensing images, offering a new idea and an efficient architecture for the field.
Abstract: Salient object detection (SOD) in remote sensing images faces significant challenges due to large variations in object sizes, the computational cost of self-attention mechanisms, and the limitations of CNN-based extractors in capturing global context and long-range dependencies. Existing methods that rely on fixed convolution kernels often struggle to adapt to diverse object scales, leading to detail loss or irrelevant feature aggregation. To address these issues, this work aims to enhance robustness to scale variations and achieve precise object localization. We propose the Region Proportion-Aware Dynamic Adaptive Salient Object Detection Network (RDNet), which replaces the CNN backbone with the SwinTransformer for global context modeling and introduces three key modules: (1) the Dynamic Adaptive Detail-aware (DAD) module, which applies varied convolution kernels guided by object region proportions; (2) the Frequency-matching Context Enhancement (FCE) module, which enriches contextual information through wavelet interactions and attention; and (3) the Region Proportion-aware Localization (RPL) module, which employs cross-attention to highlight semantic details and integrates a Proportion Guidance (PG) block to assist the DAD module. By combining these modules, RDNet achieves robustness against scale variations and accurate localization, delivering superior detection performance compared with state-of-the-art methods.
[163] Real-World Point Tracking with Verifier-Guided Pseudo-Labeling
Görkay Aydemir,Fatma Güney,Weidi Xie
Main category: cs.CV
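TL;DR: This paper proposes the verifier, a meta-model that assesses the reliability of trajectories from multiple pretrained trackers and guides the generation of high-quality pseudo-labels, improving long-term point tracking on unlabeled real-world videos.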
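The per-frame selection step described in this entry can be sketched as follows. This is a minimal sketch that assumes the reliability scores are already available as plain numbers; in the paper they come from a learned verifier, which is not reproduced here, and the function and variable names are hypothetical.

```python
def select_pseudo_labels(candidates, scores):
    """Pick, per frame, the candidate with the highest reliability score.

    candidates[k][t] is tracker k's (x, y) prediction at frame t.
    scores[k][t] is a reliability score for that prediction; here it is
    supplied by the caller rather than predicted by a learned verifier.
    """
    num_trackers = len(candidates)
    num_frames = len(candidates[0])
    trajectory = []
    for t in range(num_frames):
        best = max(range(num_trackers), key=lambda k: scores[k][t])
        trajectory.append(candidates[best][t])
    return trajectory

candidates = [
    [(10, 10), (11, 10), (30, 40)],   # tracker A drifts at frame 2
    [(10, 11), (12, 11), (13, 11)],   # tracker B
]
scores = [
    [0.9, 0.6, 0.1],
    [0.8, 0.7, 0.9],
]
pseudo_labels = select_pseudo_labels(candidates, scores)
```

Because the choice is made independently at each frame, the resulting trajectory can switch between trackers whenever their relative reliability changes.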
Details
Motivation: Long-term point tracking models are trained on synthetic data, and their performance degrades on real videos because of differing characteristics and the lack of dense ground-truth annotations. Self-training has been explored, but pseudo-label quality depends on the reliability of the teacher models, which varies considerably across frames and scenes.
Method: The verifier meta-model takes candidate trajectories from multiple pretrained trackers, evaluates them frame by frame, and selects the most trustworthy predictions, producing high-quality pseudo-label trajectories for subsequent fine-tuning.
Result: Experiments on four real-world benchmarks show state-of-the-art performance with less data than prior self-training methods.
Conclusion: The verifier effectively improves pseudo-label quality and enables efficient, data-economical adaptation to real-world videos.
Abstract: Models for long-term point tracking are typically trained on large synthetic datasets. The performance of these models degrades in real-world videos due to different characteristics and the absence of dense ground-truth annotations. Self-training on unlabeled videos has been explored as a practical solution, but the quality of pseudo-labels strongly depends on the reliability of teacher models, which vary across frames and scenes. In this paper, we address the problem of real-world fine-tuning and introduce verifier, a meta-model that learns to assess the reliability of tracker predictions and guide pseudo-label generation. Given candidate trajectories from multiple pretrained trackers, the verifier evaluates them per frame and selects the most trustworthy predictions, resulting in high-quality pseudo-label trajectories. When applied for fine-tuning, verifier-guided pseudo-labeling substantially improves the quality of supervision and enables data-efficient adaptation to unlabeled videos. Extensive experiments on four real-world benchmarks demonstrate that our approach achieves state-of-the-art results while requiring less data than prior self-training methods. Project page: https://kuis-ai.github.io/track_on_r
[164] A Two-Stage Dual-Modality Model for Facial Emotional Expression Recognition
Jiajun Sun,Zhe Gao
Main category: cs.CV
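TL;DR: This paper proposes a two-stage audio-visual dual-modality model for frame-level recognition of eight facial expressions in unconstrained videos for the ABAW competition. It combines a DINOv2 visual encoder, PadAug augmentation, an MoE classification head, multi-scale re-cropping, Wav2Vec 2.0 audio features, and gated fusion, substantially improving recognition performance.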
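To make the inference-time temporal smoothing step in this entry concrete, here is a minimal sketch that averages per-frame class probabilities over a centered window before taking the argmax. The entry does not say which smoothing scheme the authors use, so the moving average, the window radius, and the two-class toy data are all assumptions.

```python
def smooth_predictions(frame_probs, radius=1):
    """Average class probabilities over a centered window, then take argmax."""
    num_frames = len(frame_probs)
    num_classes = len(frame_probs[0])
    labels = []
    for t in range(num_frames):
        lo, hi = max(0, t - radius), min(num_frames, t + radius + 1)
        mean = [sum(frame_probs[s][c] for s in range(lo, hi)) / (hi - lo)
                for c in range(num_classes)]
        labels.append(max(range(num_classes), key=lambda c: mean[c]))
    return labels

# Hypothetical per-frame probabilities; frame 2 is an isolated flip.
probs = [[0.9, 0.1], [0.8, 0.2], [0.4, 0.6], [0.9, 0.1], [0.7, 0.3]]
raw = [max(range(2), key=lambda c: p[c]) for p in probs]
smoothed = smooth_predictions(probs, radius=1)
```

In this toy case the single-frame flip at frame 2 is removed, which is the kind of temporal instability that frame-level expression recognition has to cope with.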
Details
Motivation: Expression recognition in unconstrained videos for the ABAW challenge suffers from inaccurate face localization, large pose and scale variations, motion blur, and temporal instability across frames.
Method: A two-stage dual-modal model. Stage I uses DINOv2 ViT-L/14 as the visual backbone, with the PadAug image-padding augmentation strategy and a mixture-of-experts (MoE) training head. Stage II re-crops faces at multiple scales to improve visual robustness, extracts frame-aligned Wav2Vec 2.0 audio features, fuses the two modalities through a lightweight gated fusion module, and applies temporal smoothing at inference time.
Result: Macro-F1 reaches 0.5368 on the official ABAW validation set and 0.5122 ± 0.0277 under 5-fold cross-validation, both above the official baselines.
Conclusion: The two-stage audio-visual fusion framework alleviates the difficulties of in-the-wild expression recognition and confirms the key role of robust visual representation learning combined with complementary audio cues.
Abstract: This paper addresses the expression (EXPR) recognition challenge in the 10th Affective Behavior Analysis in-the-Wild (ABAW) workshop and competition, which requires frame-level classification of eight facial emotional expressions from unconstrained videos. This task is challenging due to inaccurate face localization, large pose and scale variations, motion blur, temporal instability, and other confounding factors across adjacent frames. We propose a two-stage dual-modal (audio-visual) model to address these difficulties. Stage I focuses on robust visual feature extraction with a pretrained DINOv2-based encoder. Specifically, DINOv2 ViT-L/14 is used as the backbone, a padding-aware augmentation (PadAug) strategy is employed for image padding and data preprocessing from raw videos, and a mixture-of-experts (MoE) training head is introduced to enhance classifier diversity. Stage II addresses modality fusion and temporal consistency. For the visual modality, faces are re-cropped from raw videos at multiple scales, and the extracted visual features are averaged to form a robust frame-level representation. Concurrently, frame-aligned Wav2Vec 2.0 audio features are derived from short audio windows to provide complementary acoustic cues. These dual-modal features are integrated via a lightweight gated fusion module, followed by inference-time temporal smoothing. Experiments on the ABAW dataset demonstrate the effectiveness of the proposed method. The two-stage model achieves a Macro-F1 score of 0.5368 on the official validation set and 0.5122 ± 0.0277 under 5-fold cross-validation, outperforming the official baselines.
[165] HiAP: A Multi-Granular Stochastic Auto-Pruning Framework for Vision Transformers
Andy Li,Aiden Durrant,Milan Markovic,Georgios Leontidis
Main category: cs.CV
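TL;DR: This paper proposes Hierarchical Auto-Pruning (HiAP), an end-to-end continuous-relaxation pruning framework. Using stochastic gates at multiple granularities (macro and micro), it discovers efficient sub-networks in a single training phase without manual sparsity targets or importance heuristics, simplifying deployment and improving the efficiency-accuracy trade-off on edge devices.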
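The stochastic Gumbel-Sigmoid gates named in this entry can be illustrated with a small sampler. This is a sketch of the generic Gumbel-Sigmoid (binary concrete) relaxation, not HiAP's actual implementation; the logits, temperature, and helper names are hypothetical.

```python
import math
import random

def gumbel_sigmoid(logit, temperature=1.0, rng=random):
    """Sample a relaxed binary gate in [0, 1].

    The difference of two Gumbel(0, 1) samples is Logistic(0, 1) noise,
    drawn here as log(u) - log(1 - u) with u ~ Uniform(0, 1).
    """
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    noise = math.log(u) - math.log(1.0 - u)
    return 1.0 / (1.0 + math.exp(-(logit + noise) / temperature))

def expected_keep_ratio(logits):
    """Noise-free keep probability per gate, averaged over all gates."""
    return sum(1.0 / (1.0 + math.exp(-l)) for l in logits) / len(logits)

rng = random.Random(0)
head_logits = [3.0, -3.0, 0.5]   # hypothetical learned logits, one per attention head
gates = [gumbel_sigmoid(l, temperature=0.5, rng=rng) for l in head_logits]
keep_ratio = expected_keep_ratio(head_logits)
```

Lower temperatures push samples toward 0 or 1, so a gate behaves more like a hard keep/prune decision while the relaxation stays differentiable with respect to the logit. A quantity like `keep_ratio` is the sort of term a FLOPs-aware loss could penalize, though the exact loss used by HiAP is not given in this entry.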
Details
Motivation: Deploying Vision Transformers on edge devices is limited by high compute and memory-bandwidth demands, and existing structured pruning methods are single-granularity, multi-stage, and dependent on post-hoc thresholding.
Method: HiAP introduces stochastic Gumbel-Sigmoid gates at multiple granularities (macro-gates prune attention heads and FFN blocks; micro-gates prune intra-head dimensions and FFN neurons) and optimizes them jointly. A loss function with structural feasibility penalties and analytical FLOPs gives stable convergence.
Result: Experiments on ImageNet show that HiAP discovers efficient architectures automatically and, on models such as DeiT-Small, reaches an accuracy-efficiency Pareto frontier comparable to complex multi-stage methods while greatly simplifying the deployment pipeline.
Conclusion: HiAP is a simple, unified, end-to-end structured pruning method that addresses both memory and compute bottlenecks, matching conventional multi-stage pruning on the accuracy-efficiency trade-off with a simpler pipeline.
Abstract: Vision Transformers require significant computational resources and memory bandwidth, severely limiting their deployment on edge devices. While recent structured pruning methods successfully reduce theoretical FLOPs, they typically operate at a single structural granularity and rely on complex, multi-stage pipelines with post-hoc thresholding to satisfy sparsity budgets. In this paper, we propose Hierarchical Auto-Pruning (HiAP), a continuous relaxation framework that discovers optimal sub-networks in a single end-to-end training phase without requiring manual importance heuristics or predefined per-layer sparsity targets. HiAP introduces stochastic Gumbel-Sigmoid gates at multiple granularities: macro-gates to prune entire attention heads and FFN blocks, and micro-gates to selectively prune intra-head dimensions and FFN neurons. By optimizing both levels simultaneously, HiAP addresses both the memory-bound overhead of loading large matrices and the compute-bound mathematical operations. HiAP naturally converges to stable sub-networks using a loss function that incorporates both structural feasibility penalties and analytical FLOPs. Extensive experiments on ImageNet demonstrate that HiAP organically discovers highly efficient architectures, and achieves a competitive accuracy-efficiency Pareto frontier for models like DeiT-Small, matching the performance of sophisticated multi-stage methods while significantly simplifying the deployment pipeline.
[166] SceneAssistant: A Visual Feedback Agent for Open-Vocabulary 3D Scene Generation
Jun Luo,Jiaxiang Tang,Ruijie Lu,Gang Zeng
Main category: cs.CV
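TL;DR: This paper proposes SceneAssistant, a visual-feedback-driven agent for open-vocabulary text-to-3D scene generation. It combines 3D generation models with the spatial reasoning of vision-language models (VLMs) and iteratively refines scenes through atomic operations, enabling high-quality, diverse 3D scene synthesis and editing.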
Details
Motivation: Existing text-to-3D scene generation methods are restricted to specific domains or predefined spatial relationships and struggle to support open-vocabulary, unconstrained 3D scene synthesis.
Method: The SceneAssistant framework pairs a modern 3D object generation model with a VLM, gives the VLM a set of atomic operations (such as Scale, Rotate, and FocusOn), and builds and refines the scene over multiple interaction steps based on visual feedback from rendered images.
Result: Experiments show the method generates diverse, open-vocabulary, high-quality 3D scenes; qualitative analysis and quantitative human evaluation both favor it over existing methods; and it supports editing existing scenes from natural language.
Conclusion: Through a VLM agent driven by visual feedback, SceneAssistant improves the flexibility, spatial coherence, and semantic alignment of open-vocabulary 3D scene generation, offering a new paradigm for digital content creation.
Abstract: Text-to-3D scene generation from natural language is highly desirable for digital content creation. However, existing methods are largely domain-restricted or reliant on predefined spatial relationships, limiting their capacity for unconstrained, open-vocabulary 3D scene synthesis. In this paper, we introduce SceneAssistant, a visual-feedback-driven agent designed for open-vocabulary 3D scene generation. Our framework leverages a modern 3D object generation model along with the spatial reasoning and planning capabilities of Vision-Language Models (VLMs). To enable open-vocabulary scene composition, we provide the VLMs with a comprehensive set of atomic operations (e.g., Scale, Rotate, FocusOn). At each interaction step, the VLM receives rendered visual feedback and takes actions accordingly, iteratively refining the scene to achieve more coherent spatial arrangements and better alignment with the input text. Experimental results demonstrate that our method can generate diverse, open-vocabulary, and high-quality 3D scenes. Both qualitative analysis and quantitative human evaluations demonstrate the superiority of our approach over existing methods. Furthermore, our method allows users to instruct the agent to edit existing scenes based on natural language commands. Our code is available at https://github.com/ROUJINN/SceneAssistant
[167] BiGain: Unified Token Compression for Joint Generation and Classification
Jiacheng Liu,Shengkun Tang,Jiacheng Cui,Dongkuan Xu,Zhiqiang Shen
Main category: cs.CV
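TL;DR: This paper proposes BiGain, a framework built on frequency separation with two frequency-aware token compression operators (Laplacian-gated token merging and Interpolate-Extrapolate KV Downsampling). It accelerates diffusion models while maintaining generation quality and improving classification performance, and is described as the first to advance generation and discrimination together.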
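The idea of a controllable blend between nearest and average pooling, which this entry attributes to Interpolate-Extrapolate KV Downsampling, can be sketched on a 1D list of scalars. The blending formula, the choice of the first element as the "nearest" sample, and the parameter name `alpha` are assumptions for illustration; the real operator acts on key/value token tensors and its exact form is not given in this entry.

```python
def interp_extrap_pool(values, window, alpha):
    """Downsample a 1D sequence by blending two pooling results per window.

    alpha = 0 gives average pooling, alpha = 1 gives nearest pooling
    (here: the first element of each window); values of alpha outside
    [0, 1] extrapolate beyond either endpoint.
    """
    pooled = []
    for start in range(0, len(values) - window + 1, window):
        chunk = values[start:start + window]
        average = sum(chunk) / window
        nearest = chunk[0]
        pooled.append(average + alpha * (nearest - average))
    return pooled

keys = [1.0, 3.0, 2.0, 6.0]   # stand-ins for key/value features
avg_pooled = interp_extrap_pool(keys, window=2, alpha=0.0)
nearest_pooled = interp_extrap_pool(keys, window=2, alpha=1.0)
extrapolated = interp_extrap_pool(keys, window=2, alpha=1.5)
```

Average pooling smooths away detail within a window while nearest pooling keeps one sample unchanged, so a single scalar gives a knob between the two behaviors, and values above 1 push past nearest pooling.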
Details
Motivation: Existing acceleration methods for diffusion models (such as token merging or downsampling) mostly trade generation quality against compute and overlook discriminative capacity (such as classification performance); generation and discrimination are not optimized jointly.
Method: BiGain is a training-free, plug-and-play framework based on frequency separation. (1) Laplacian-gated token merging controls merging according to spectral smoothness, retaining high-frequency edges and textures. (2) Interpolate-Extrapolate KV Downsampling keeps queries intact and downsamples keys/values through a controllable interpolation-extrapolation between nearest and average pooling, preserving attention precision.
Result: On DiT and U-Net architectures and several datasets (ImageNet-1K/100, Oxford-IIIT Pets, COCO-2017), BiGain improves classification accuracy and generation quality under acceleration (for example, with 70% token merging on ImageNet-1K, classification accuracy rises by 7.15% and FID improves by 0.34). Analysis indicates that balanced retention of high- and low-frequency information is an effective design rule for token compression.
Conclusion: BiGain is presented as the first general framework to improve both generation quality and classification ability in accelerated diffusion models, showing that frequency-aware compression helps joint optimization and supports lower-cost deployment.
Abstract: Acceleration methods for diffusion models (e.g., token merging or downsampling) typically optimize synthesis quality under reduced compute, yet often ignore discriminative capacity. We revisit token compression with a joint objective and present BiGain, a training-free, plug-and-play framework that preserves generation quality while improving classification in accelerated diffusion models. Our key insight is frequency separation: mapping feature-space signals into a frequency-aware representation disentangles fine detail from global semantics, enabling compression that respects both generative fidelity and discriminative utility. BiGain reflects this principle with two frequency-aware operators: (1) Laplacian-gated token merging, which encourages merges among spectrally smooth tokens while discouraging merges of high-contrast tokens, thereby retaining edges and textures; and (2) Interpolate-Extrapolate KV Downsampling, which downsamples keys/values via a controllable interextrapolation between nearest and average pooling while keeping queries intact, thereby conserving attention precision. Across DiT- and U-Net-based backbones and ImageNet-1K, ImageNet-100, Oxford-IIIT Pets, and COCO-2017, our operators consistently improve the speed-accuracy trade-off for diffusion-based classification, while maintaining or enhancing generation quality under comparable acceleration. For instance, on ImageNet-1K, with 70% token merging on Stable Diffusion 2.0, BiGain increases classification accuracy by 7.15% while improving FID by 0.34 (1.85%). Our analyses indicate that balanced spectral retention, preserving high-frequency detail and low/mid-frequency semantics, is a reliable design rule for token compression in diffusion models. To our knowledge, BiGain is the first framework to jointly study and advance both generation and classification under accelerated diffusion, supporting lower-cost deployment.
[168] One Model, Many Budgets: Elastic Latent Interfaces for Diffusion Transformers
Moayed Haji-Ali,Willi Menapace,Ivan Skorokhodov,Dogyun Park,Anil Kag,Michael Vasilkovsky,Sergey Tulyakov,Vicente Ordonez,Aliaksandr Siarohin
Main category: cs.CV
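TL;DR: This paper proposes the Elastic Latent Interface Transformer (ELIT), a lightweight, drop-in mechanism that introduces a variable-length latent interface with lightweight Read and Write cross-attention layers, decoupling input image size from compute and prioritizing important regions. Tail latents are randomly dropped during training to learn importance-ordered representations, and the number of latents can be adjusted at inference to fit compute constraints. The method improves generation quality across several datasets and DiT-style architectures.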
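The tail-dropping scheme in this entry can be sketched in a few lines: during training a random-length prefix of the latent sequence is kept, and at inference the prefix length is set by the compute budget. The uniform sampling of the prefix length, the `min_keep` floor, and the function names are assumptions; the entry only states that tail latents are dropped at random.

```python
import random

def drop_tail_latents(latents, min_keep, rng):
    """Training-time sketch: keep a random-length prefix of the latents."""
    keep = rng.randint(min_keep, len(latents))
    return latents[:keep]

def select_latents_for_budget(latents, budget):
    """Inference-time sketch: keep as many leading latents as the budget allows."""
    return latents[:max(1, min(budget, len(latents)))]

rng = random.Random(0)
latents = [f"z{i}" for i in range(8)]   # stand-ins for latent token vectors
train_view = drop_tail_latents(latents, min_keep=2, rng=rng)
cheap_view = select_latents_for_budget(latents, budget=3)
```

Because only the tail is ever removed, earlier latents are present in every training step, which is what encourages them to carry the global structure while later ones refine details.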
Details
Motivation: Diffusion transformers (DiTs) have two key problems: (1) compute (FLOPs) is locked to image resolution, preventing principled latency-quality trade-offs; and (2) computation is allocated uniformly across all spatial tokens, wasting resources on unimportant regions.
Method: ELIT inserts a learnable, variable-length sequence of latent interface tokens into a DiT. Lightweight Read and Write cross-attention layers move information between spatial tokens and latents and focus on important regions. Randomly dropping tail latents during training pushes the model toward importance-ordered representations (earlier latents capture global structure, later ones refine details). At inference, the number of latents can be adjusted to match compute limits. Only two cross-attention layers are added; the DiT stack and rectified flow objective are unchanged.
Result: On ImageNet-1K 512px, ELIT improves FID by 35.3% and FDD by 39.6% on average, with consistent gains across datasets and DiT variants (U-ViT, HDiT, MM-DiT); it is drop-in and architecture-agnostic.
Conclusion: ELIT is a simple, efficient, and broadly compatible enhancement for DiTs that decouples image resolution from compute and supports dynamic compute allocation and importance-aware modeling.
Abstract: Diffusion transformers (DiTs) achieve high generative quality but lock FLOPs to image resolution, limiting principled latency-quality trade-offs, and allocate computation uniformly across input spatial tokens, wasting resources on unimportant regions. We introduce Elastic Latent Interface Transformer (ELIT), a drop-in, DiT-compatible mechanism that decouples input image size from compute. Our approach inserts a latent interface, a learnable variable-length token sequence on which standard transformer blocks can operate. Lightweight Read and Write cross-attention layers move information between spatial tokens and latents and prioritize important input regions. By training with random dropping of tail latents, ELIT learns to produce importance-ordered representations with earlier latents capturing global structure while later ones contain information to refine details. At inference, the number of latents can be dynamically adjusted to match compute constraints. ELIT is deliberately minimal, adding two cross-attention layers while leaving the rectified flow objective and the DiT stack unchanged. Across datasets and architectures (DiT, U-ViT, HDiT, MM-DiT), ELIT delivers consistent gains. On ImageNet-1K 512px, ELIT delivers an average gain of 35.3% and 39.6% in FID and FDD scores. Project page: https://snap-research.github.io/elit/
[169] Trust Your Critic: Robust Reward Modeling and Reinforcement Learning for Faithful Image Editing and Generation
Xiangyu Zhao,Peiyuan Zhang,Junming Lin,Tianhao Liang,Yuchen Duan,Shengyuan Ding,Changyao Tian,Yuhang Zang,Junchi Yan,Xue Yang
Main category: cs.CV
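TL;DR: This paper proposes the FIRM framework, which builds high-quality datasets, trains specialized reward models, and introduces a "Base-and-Bonus" strategy to make reinforcement learning for image editing and text-to-image generation more accurate and reliable.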
Details
Motivation: Current reward models suffer from hallucinations and noisy scores, which misguide reinforcement learning optimization.
Method: Tailored data curation pipelines build high-quality scoring datasets (FIRM-Edit-370K and FIRM-Gen-293K), which are used to train specialized reward models (FIRM-Edit-8B and FIRM-Gen-8B). A "Base-and-Bonus" reward strategy is proposed (CME for editing, QMA for generation), along with the FIRM-Bench benchmark.
Result: The FIRM reward models align better with human judgment than existing metrics. The resulting models FIRM-Qwen-Edit and FIRM-SD3.5 show substantial gains in fidelity and instruction following and mitigate hallucinations.
Conclusion: FIRM provides a more reliable and faithful reward-modeling approach for reinforcement learning in image editing and text-to-image generation, setting a new standard for fidelity and instruction adherence.
Abstract: Reinforcement learning (RL) has emerged as a promising paradigm for enhancing image editing and text-to-image (T2I) generation. However, current reward models, which act as critics during RL, often suffer from hallucinations and assign noisy scores, inherently misguiding the optimization process. In this paper, we present FIRM (Faithful Image Reward Modeling), a comprehensive framework that develops robust reward models to provide accurate and reliable guidance for faithful image generation and editing. First, we design tailored data curation pipelines to construct high-quality scoring datasets. Specifically, we evaluate editing using both execution and consistency, while generation is primarily assessed via instruction following. Using these pipelines, we collect the FIRM-Edit-370K and FIRM-Gen-293K datasets, and train specialized reward models (FIRM-Edit-8B and FIRM-Gen-8B) that accurately reflect these criteria. Second, we introduce FIRM-Bench, a comprehensive benchmark specifically designed for editing and generation critics. Evaluations demonstrate that our models achieve superior alignment with human judgment compared to existing metrics. Furthermore, to seamlessly integrate these critics into the RL pipeline, we formulate a novel "Base-and-Bonus" reward strategy that balances competing objectives: Consistency-Modulated Execution (CME) for editing and Quality-Modulated Alignment (QMA) for generation. Empowered by this framework, our resulting models FIRM-Qwen-Edit and FIRM-SD3.5 achieve substantial performance breakthroughs. Comprehensive experiments demonstrate that FIRM mitigates hallucinations, establishing a new standard for fidelity and instruction adherence over existing general models. All of our datasets, models, and code are publicly available at https://firm-reward.github.io.
[170] DVD: Deterministic Video Depth Estimation with Generative Priors
Hongfei Zhang,Harold Haodong Chen,Chenfei Liao,Jing He,Zixin Zhang,Haodong Li,Yihao Liang,Kanghao Chen,Bin Ren,Xu Zheng,Shuai Yang,Kun Zhou,Yinchuan Li,Nicu Sebe,Ying-Cong Chen
Main category: cs.CV
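TL;DR: DVD is a video depth estimation framework that deterministically adapts pre-trained video diffusion models into single-pass depth regressors, addressing both the geometric hallucinations of generative models and the heavy labeled-data requirements of discriminative models.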
Details
Motivation: Existing video depth estimation faces a fundamental trade-off: generative models are prone to stochastic geometric hallucinations and scale drift, while discriminative models need massive labeled datasets to resolve semantic ambiguities.
Method: DVD has three core designs: (i) the diffusion timestep is repurposed as a structural anchor to balance global stability with high-frequency details; (ii) latent manifold rectification (LMR) enforces differential constraints to restore sharp boundaries and coherent motion; and (iii) global affine coherence bounds inter-window divergence, enabling long-video inference without complex temporal alignment.
Result: DVD achieves state-of-the-art zero-shot performance across benchmarks and unlocks the geometric priors implicit in video foundation models using 163x less task-specific data than leading baselines.
Conclusion: DVD is presented as the first framework to deterministically adapt pre-trained video diffusion models into single-pass depth regressors. It reduces dependence on labeled data, improves geometric fidelity and long-video consistency, and is fully open-sourced.
Abstract: Existing video depth estimation faces a fundamental trade-off: generative models suffer from stochastic geometric hallucinations and scale drift, while discriminative models demand massive labeled datasets to resolve semantic ambiguities. To break this impasse, we present DVD, the first framework to deterministically adapt pre-trained video diffusion models into single-pass depth regressors. Specifically, DVD features three core designs: (i) repurposing the diffusion timestep as a structural anchor to balance global stability with high-frequency details; (ii) latent manifold rectification (LMR) to mitigate regression-induced over-smoothing, enforcing differential constraints to restore sharp boundaries and coherent motion; and (iii) global affine coherence, an inherent property bounding inter-window divergence, which enables seamless long-video inference without requiring complex temporal alignment. Extensive experiments demonstrate that DVD achieves state-of-the-art zero-shot performance across benchmarks. Furthermore, DVD successfully unlocks the profound geometric priors implicit in video foundation models using 163x less task-specific data than leading baselines. Notably, we fully release our pipeline, providing the whole training suite for SOTA video depth estimation to benefit the open-source community.
[171] Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing
Baifeng Shi,Stephanie Fu,Long Lian,Hanrong Ye,David Eigen,Aaron Reite,Boyi Li,Jan Kautz,Song Han,David M. Chan,Pavlo Molchanov,Trevor Darrell,Hongxu Yin
Main category: cs.CV
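TL;DR: AutoGaze is a lightweight module that autoregressively selects multi-scale visual patches, sharply reducing redundant visual tokens in video input while staying within an error threshold, which improves the efficiency and performance of MLLMs on long, high-resolution videos.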
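The stopping criterion in this entry (select patches until the video can be reconstructed within a user-specified error) can be illustrated with a simple greedy loop. This is a stand-in, not AutoGaze itself: the real module is a learned autoregressive selector trained with next-token prediction and reinforcement learning, whereas the sketch assumes each patch's error reduction is known in advance and additive. All names and numbers are hypothetical.

```python
def select_patches(error_reduction, initial_error, threshold):
    """Greedily pick patches until the remaining error meets the threshold.

    error_reduction maps a patch id to how much reconstruction error it
    removes (assumed additive and known, which a real system would not have).
    """
    selected = []
    remaining = initial_error
    for patch, gain in sorted(error_reduction.items(), key=lambda kv: -kv[1]):
        if remaining <= threshold:
            break
        selected.append(patch)
        remaining -= gain
    return selected, remaining

gains = {"p0": 0.05, "p1": 0.40, "p2": 0.30, "p3": 0.10}
chosen, residual = select_patches(gains, initial_error=1.0, threshold=0.35)
```

Loosening the threshold shortens the selected list, which is the mechanism behind the token reduction the entry reports; tightening it keeps more patches.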
Details
Motivation: Existing MLLMs are inefficient on long, high-resolution videos because their vision transformers or LLMs treat every pixel equally and ignore the heavy spatiotemporal redundancy in video.
Method: AutoGaze is trained with next-token prediction and reinforcement learning to autoregressively select a minimal set of multi-scale patches that can reconstruct the original video within a specified error, removing redundant information.
Result: Visual tokens are reduced by 4x-100x and ViT and MLLM inference is accelerated by up to 19x. MLLMs can be scaled to 1K-frame, 4K-resolution videos, reaching 67.0% on VideoMME. On the newly built HLVid benchmark (QA on 5-minute 4K videos), the method improves over the baseline by 10.1% and outperforms the previous best MLLM by 4.5%.
Conclusion: AutoGaze relieves the token bottleneck MLLMs face on long, high-resolution videos, offering a new paradigm for efficient, scalable video understanding.
Abstract: Multi-modal large language models (MLLMs) have advanced general-purpose video understanding but struggle with long, high-resolution videos: they process every pixel equally in their vision transformers (ViTs) or LLMs despite significant spatiotemporal redundancy. We introduce AutoGaze, a lightweight module that removes redundant patches before they are processed by a ViT or an MLLM. Trained with next-token prediction and reinforcement learning, AutoGaze autoregressively selects a minimal set of multi-scale patches that can reconstruct the video within a user-specified error threshold, eliminating redundancy while preserving information. Empirically, AutoGaze reduces visual tokens by 4x-100x and accelerates ViTs and MLLMs by up to 19x, enabling scaling MLLMs to 1K-frame 4K-resolution videos and achieving superior results on video benchmarks (e.g., 67.0% on VideoMME). Furthermore, we introduce HLVid: the first high-resolution, long-form video QA benchmark with 5-minute 4K-resolution videos, where an MLLM scaled with AutoGaze improves over the baseline by 10.1% and outperforms the previous best MLLM by 4.5%. Project page: https://autogaze.github.io/.
[172] Spatial-TTT: Streaming Visual-based Spatial Intelligence with Test-Time Training
Fangfu Liu,Diankun Wu,Jiawei Chi,Yimo Cai,Yi-Hsin Hung,Xumin Yu,Hao Li,Han Hu,Yongming Rao,Yueqi Duan
Main category: cs.CV
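TL;DR: This paper proposes Spatial-TTT, a test-time training (TTT) method for streaming visual spatial intelligence. Through dynamically updated fast weights, a hybrid architecture, sliding-window attention, and a spatial-predictive mechanism with 3D spatiotemporal convolution, it efficiently models and organizes global 3D spatial information in long videos and reaches state-of-the-art results on video spatial understanding tasks.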
Details
Motivation: Humans understand real-world space through continuous visual observation, so models need to keep selecting, organizing, and retaining spatial evidence from unbounded video streams rather than merely extending the context window.
Method: Spatial-TTT uses test-time training to update a subset of parameters (fast weights) dynamically. A hybrid architecture combines large-chunk updates with sliding-window attention; a spatial-predictive mechanism based on 3D spatiotemporal convolution strengthens modeling of geometric correspondence and temporal continuity; and a new dataset with dense 3D spatial descriptions guides the fast weights to memorize global 3D spatial signals in a structured way.
Result: The method achieves state-of-the-art performance on several video spatial understanding benchmarks and improves long-horizon spatial understanding.
Conclusion: Spatial-TTT demonstrates that test-time adaptation is effective for streaming spatial intelligence, offering a new paradigm for long-horizon visual spatial modeling.
Abstract: Humans perceive and understand real-world spaces through a stream of visual observations. Therefore, the ability to maintain and update spatial evidence in a streaming fashion from potentially unbounded video streams is essential for spatial intelligence. The core challenge is not simply longer context windows but how spatial information is selected, organized, and retained over time. In this paper, we propose Spatial-TTT towards streaming visual-based spatial intelligence with test-time training (TTT), which adapts a subset of parameters (fast weights) to capture and organize spatial evidence over long-horizon scene videos. Specifically, we design a hybrid architecture and adopt large-chunk updates parallel with sliding-window attention for efficient spatial video processing. To further promote spatial awareness, we introduce a spatial-predictive mechanism applied to TTT layers with 3D spatiotemporal convolution, which encourages the model to capture geometric correspondence and temporal continuity across frames. Beyond architecture design, we construct a dataset with dense 3D spatial descriptions, which guides the model to update its fast weights to memorize and organize global 3D spatial signals in a structured manner. Extensive experiments demonstrate that Spatial-TTT improves long-horizon spatial understanding and achieves state-of-the-art performance on video spatial benchmarks. Project page: https://liuff19.github.io/Spatial-TTT.
[173] DreamVideo-Omni: Omni-Motion Controlled Multi-Subject Video Customization with Latent Identity Reinforcement Learning
Yujie Wei,Xinyu Liu,Shiwei Zhang,Hangjie Yuan,Jinbo Xing,Zhekai Chen,Xiang Wang,Haonan Qiu,Rui Zhao,Yutong Feng,Ruihang Chu,Yingya Zhang,Yike Guo,Xihui Liu,Hongming Shan
Main category: cs.CV
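TL;DR: This paper proposes DreamVideo-Omni, a framework that achieves multi-subject identity customization and omni-granularity motion control through two-stage training. It introduces a condition-aware 3D rotary positional embedding, hierarchical motion injection, and group and role embeddings to address control ambiguity and identity degradation, and designs a latent identity reward feedback mechanism to improve identity preservation.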
Details
Motivation: Existing large models struggle to control multi-subject identity and multi-granularity motion precisely at the same time in video synthesis, suffering from limited motion granularity, control ambiguity, and identity degradation.
Method: A two-stage training paradigm. Stage one fuses multiple control signals (appearance, global and local motion, camera motion) and introduces a condition-aware 3D rotary positional embedding, hierarchical motion injection, and group and role embeddings. Stage two builds a latent identity reward model for motion-aware identity-preservation feedback learning.
Result: On a curated large-scale dataset and the DreamOmni Bench, DreamVideo-Omni outperforms existing methods on multi-subject and omni-motion control, generating high-quality, highly controllable videos.
Conclusion: DreamVideo-Omni provides a unified, controllable, and robust solution for multi-subject video synthesis, advancing joint control of identity and motion in video generation.
Abstract: While large-scale diffusion models have revolutionized video synthesis, achieving precise control over both multi-subject identity and multi-granularity motion remains a significant challenge. Recent attempts to bridge this gap often suffer from limited motion granularity, control ambiguity, and identity degradation, leading to suboptimal performance on identity preservation and motion control. In this work, we present DreamVideo-Omni, a unified framework enabling harmonious multi-subject customization with omni-motion control via a progressive two-stage training paradigm. In the first stage, we integrate comprehensive control signals for joint training, encompassing subject appearances, global motion, local dynamics, and camera movements. To ensure robust and precise controllability, we introduce a condition-aware 3D rotary positional embedding to coordinate heterogeneous inputs and a hierarchical motion injection strategy to enhance global motion guidance. Furthermore, to resolve multi-subject ambiguity, we introduce group and role embeddings to explicitly anchor motion signals to specific identities, effectively disentangling complex scenes into independent controllable instances. In the second stage, to mitigate identity degradation, we design a latent identity reward feedback learning paradigm by training a latent identity reward model upon a pretrained video diffusion backbone. This provides motion-aware identity rewards in the latent space, prioritizing identity preservation aligned with human preferences. Supported by our curated large-scale dataset and the comprehensive DreamOmni Bench for multi-subject and omni-motion control evaluation, DreamVideo-Omni demonstrates superior performance in generating high-quality videos with precise controllability.
[174] Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously
Yiran Guan,Liang Yin,Dingkang Liang,Jianzhong Ju,Zhenbo Luo,Jian Luan,Yuliang Liu,Xiang Bai
Main category: cs.CV
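TL;DR: This paper proposes Video Streaming Thinking (VST), a new paradigm for streaming video understanding that supports thinking while watching. It activates reasoning in step with the video stream, combines structured fine-tuning with reinforcement-learning post-training, and uses video knowledge graphs to synthesize high-quality streaming QA data, improving both responsiveness and reasoning ability.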
Details
Motivation: Existing online video LLMs focus only on streaming perception and lack a synchronized logical reasoning stream, while directly applying test-time scaling causes unacceptable response latency. Real-time responsiveness and deep reasoning need to be balanced.
Method: The VST paradigm includes (1) a thinking-while-watching mechanism that activates reasoning during the video stream to amortize LLM latency; (2) two-stage post-training, with VST-SFT adapting the model structurally to causal streaming reasoning and VST-RL optimizing end-to-end in a multi-turn video interaction environment; and (3) an automated data synthesis pipeline based on video knowledge graphs that generates streaming Chain-of-Thought QA pairs grounded in entities and relations.
Result: VST-7B reaches 79.5% on StreamingBench and 59.3% on OVO-Bench. Compared with Video-R1 it responds 15.7 times faster and improves by 5.4% on VideoHolmes, while remaining competitive on offline long-form and reasoning benchmarks.
Conclusion: VST couples streaming video perception with logical reasoning, balancing real-time response with coherent multi-evidence cognition across efficiency, accuracy, and generalization.
Abstract: Online Video Large Language Models (VideoLLMs) play a critical role in supporting responsive, real-time interaction. Existing methods focus on streaming perception, lacking a synchronized logical reasoning stream. However, directly applying test-time scaling methods incurs unacceptable response latency. To address this trade-off, we propose Video Streaming Thinking (VST), a novel paradigm for streaming video understanding. It supports a thinking-while-watching mechanism, which activates reasoning over incoming video clips during streaming. This design improves timely comprehension and coherent cognition while preserving real-time responsiveness by amortizing LLM reasoning latency over video playback. Furthermore, we introduce a comprehensive post-training pipeline that integrates VST-SFT, which structurally adapts the offline VideoLLM to causal streaming reasoning, and VST-RL, which provides end-to-end improvement through self-exploration in a multi-turn video interaction environment. Additionally, we devise an automated training-data synthesis pipeline that uses video knowledge graphs to generate high-quality streaming QA pairs, with an entity-relation grounded streaming Chain-of-Thought to enforce multi-evidence reasoning and sustained attention to the video stream. Extensive evaluations show that VST-7B performs strongly on online benchmarks, e.g., 79.5% on StreamingBench and 59.3% on OVO-Bench. Meanwhile, VST remains competitive on offline long-form or reasoning benchmarks. Compared with Video-R1, VST responds 15.7 times faster and achieves +5.4% improvement on VideoHolmes, demonstrating higher efficiency and strong generalization across diverse video understanding tasks. Code, data, and models will be released at https://github.com/1ranGuan/VST.
[175] GRADE: Benchmarking Discipline-Informed Reasoning in Image Editing
Mingxin Liu,Ziqian Fan,Zhaokai Wang,Leyao Gu,Zirun Zhu,Yiguo He,Yuchen Yang,Changyao Tian,Xiangyu Zhao,Ning Liao,Shaofeng Zhang,Qibing Ren,Zhihang Zhong,Xuanhe Zhou,Junchi Yan,Xue Yang
Main category: cs.CV
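TL;DR: This paper proposes GRADE, a benchmark for evaluating discipline-informed knowledge and reasoning in image editing. It covers 520 samples across 10 academic domains with a multi-dimensional evaluation protocol, and reveals significant limitations of current multimodal models on knowledge-intensive editing tasks.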
Details
Motivation: Existing image editing benchmarks are confined to natural images and shallow commonsense reasoning, so they cannot effectively assess the joint understanding, reasoning, and generation abilities of unified multimodal models under structured, domain-specific constraints.
Method: The authors build the GRADE benchmark (520 samples across 10 disciplines) and propose a multi-dimensional evaluation protocol that jointly measures Discipline Reasoning, Visual Consistency, and Logical Readability. They evaluate 20 state-of-the-art open-source and closed-source models and conduct in-depth analyses and ablations.
Result: Current models show large performance gaps on implicit, knowledge-intensive image editing tasks, exposing many shortcomings under disciplinary editing constraints.
Conclusion: GRADE pinpoints key directions for developing unified multimodal models and advances research on discipline-informed image editing and reasoning. The benchmark and evaluation code are publicly released.
Abstract: Unified multimodal models target joint understanding, reasoning, and generation, but current image editing benchmarks are largely confined to natural images and shallow commonsense reasoning, offering limited assessment of this capability under structured, domain-specific constraints. In this work, we introduce GRADE, the first benchmark to assess discipline-informed knowledge and reasoning in image editing. GRADE comprises 520 carefully curated samples across 10 academic domains, spanning from natural science to social science. To support rigorous evaluation, we propose a multi-dimensional evaluation protocol that jointly assesses Discipline Reasoning, Visual Consistency, and Logical Readability. Extensive experiments on 20 state-of-the-art open-source and closed-source models reveal substantial limitations in current models under implicit, knowledge-intensive editing settings, leading to large performance gaps. Beyond quantitative scores, we conduct rigorous analyses and ablations to expose model shortcomings and identify the constraints within disciplinary editing. Together, GRADE pinpoints key directions for the future development of unified multimodal models, advancing the research on discipline-informed image editing and reasoning. Our benchmark and evaluation code are publicly released.
[176] OmniStream: Mastering Perception, Reconstruction and Action in Continuous Streams
Yibin Yan,Jilan Xu,Shangzhe Di,Haoning Wu,Weidi Xie
Main category: cs.CV
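TL;DR: This paper proposes OmniStream, a unified streaming visual backbone that uses causal spatiotemporal attention and 3D rotary positional embeddings for frame-by-frame online video processing and, after multi-task pre-training, generalizes across semantic, spatial, and temporal reasoning.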
Details
Motivation: Modern visual agents need general, causal, and physically structured representations to operate in real-time streaming environments, but existing vision foundation models are fragmented and struggle to cover image semantics, temporal modeling, and spatial geometry together. Method: Proposes the OmniStream model, which introduces causal spatiotemporal attention and 3D-RoPE positional embeddings and uses a persistent KV-cache to support frame-by-frame streaming; it is pre-trained on 29 datasets with a multi-task framework that combines static and temporal representation learning, streaming geometric reconstruction, and vision-language alignment. Result: Even with the backbone fully frozen, OmniStream performs on par with specialized models on image and video probing, streaming geometric reconstruction, complex spatiotemporal reasoning, and unseen robotic manipulation tasks. Conclusion: Shows that a single, general-purpose vision backbone can unify semantic, spatial, and temporal reasoning, an important step toward general visual understanding for interactive and embodied agents. Abstract: Modern visual agents require representations that are general, causal, and physically structured to operate in real-time streaming environments. However, current vision foundation models remain fragmented, specializing narrowly in image semantic perception, offline temporal modeling, or spatial geometry. This paper introduces OmniStream, a unified streaming visual backbone that effectively perceives, reconstructs, and acts from diverse visual inputs. By incorporating causal spatiotemporal attention and 3D rotary positional embeddings (3D-RoPE), our model supports efficient, frame-by-frame online processing of video streams via a persistent KV-cache. We pre-train OmniStream using a synergistic multi-task framework coupling static and temporal representation learning, streaming geometric reconstruction, and vision-language alignment on 29 datasets. Extensive evaluations show that, even with a strictly frozen backbone, OmniStream achieves consistently competitive performance with specialized experts across image and video probing, streaming geometric reconstruction, complex video and spatial reasoning, as well as robotic manipulation (unseen at training). Rather than pursuing benchmark-specific dominance, our work demonstrates the viability of training a single, versatile vision backbone that generalizes across semantic, spatial, and temporal reasoning, i.e., a more meaningful step toward general-purpose visual understanding for interactive and embodied agents.
[177] MM-CondChain: A Programmatically Verified Benchmark for Visually Grounded Deep Compositional Reasoning
Haozhan Shen,Shilin Yan,Hongwei Xue,Shuaiqi Lu,Xiaojun Tang,Guannan Zhang,Tiancheng Zhao,Jianwei Yin
Main category: cs.CV
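TL;DR: This paper introduces the MM-CondChain benchmark for evaluating the visually grounded deep compositional conditional reasoning of multimodal large language models (MLLMs), and designs an agent-based synthesis pipeline that generates verifiable multi-layer reasoning-chain data.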
Details
Motivation: Existing benchmarks focus on shallow compositions or independent constraints, so they cannot adequately evaluate how MLLMs handle deeply chained compositional conditions (such as joint judgments over multiple objects, attributes, and relations) in real visual workflows like GUI navigation. Method: Proposes the MM-CondChain benchmark and an accompanying agentic synthesis pipeline: a Planner generates compositional conditions layer by layer, VPIR ensures each layer's condition is mechanically verifiable, and a Composer assembles them into complete instructions; it covers three visual domains: natural images, data charts, and GUI trajectories. Result: Experiments on multiple MLLMs show the strongest model reaches only 53.33 Path F1, with performance dropping sharply on hard negatives and as reasoning depth and predicate complexity increase. Conclusion: Deep compositional conditional reasoning remains a fundamental challenge for MLLMs, and MM-CondChain provides a rigorous, scalable benchmark for this capability. Abstract: Multimodal Large Language Models (MLLMs) are increasingly used to carry out visual workflows such as navigating GUIs, where the next step depends on verified visual compositional conditions (e.g., "if a permission dialog appears and the color of the interface is green, click Allow") and the process may branch or terminate early. Yet this capability remains under-evaluated: existing benchmarks focus on shallow-compositions or independent-constraints rather than deeply chained compositional conditionals. In this paper, we introduce MM-CondChain, a benchmark for visually grounded deep compositional reasoning. Each benchmark instance is organized as a multi-layer reasoning chain, where every layer contains a non-trivial compositional condition grounded in visual evidence and built from multiple objects, attributes, or relations. To answer correctly, an MLLM must perceive the image in detail, reason over multiple visual elements at each step, and follow the resulting execution path to the final outcome. To scalably construct such workflow-style data, we propose an agentic synthesis pipeline: a Planner orchestrates layer-by-layer generation of compositional conditions, while a Verifiable Programmatic Intermediate Representation (VPIR) ensures each layer's condition is mechanically verifiable. A Composer then assembles these verified layers into complete instructions. Using this pipeline, we construct benchmarks across three visual domains: natural images, data charts, and GUI trajectories.
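As a rough illustration of what a multi-layer chain with mechanically checkable conditions could look like, the sketch below evaluates layered predicates over a structured scene description and stops at the first layer that fails. This is not the authors' VPIR or pipeline: the `Layer` class, the `follow_chain` function, and the dictionary-based scene format are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical stand-in for verified visual facts extracted from an image.
Scene = Dict[str, Any]


@dataclass
class Layer:
    """One layer of the chain (illustrative, not the paper's format)."""
    description: str                    # natural-language form of the condition
    condition: Callable[[Scene], bool]  # programmatic check over the scene
    on_false: str                       # outcome if this layer's condition fails


def follow_chain(scene: Scene, layers: List[Layer], final_action: str) -> Tuple[List[bool], str]:
    """Evaluate layers in order; terminate early at the first false condition."""
    path: List[bool] = []
    for layer in layers:
        holds = layer.condition(scene)
        path.append(holds)
        if not holds:
            return path, layer.on_false
    return path, final_action


layers = [
    Layer(
        "a permission dialog appears and the interface color is green",
        lambda s: s["dialog"] == "permission" and s["interface_color"] == "green",
        "stop",
    ),
    Layer(
        "both an Allow and a Deny button are visible",
        lambda s: {"Allow", "Deny"} <= set(s["buttons"]),
        "report missing button",
    ),
]

scene = {"dialog": "permission", "interface_color": "green", "buttons": ["Allow", "Deny"]}
path, outcome = follow_chain(scene, layers, "click Allow")

# A near-miss scene: one attribute changed, so the chain ends at layer 1.
near_miss = dict(scene, interface_color="red")
neg_path, neg_outcome = follow_chain(near_miss, layers, "click Allow")
```

Because each condition is an ordinary predicate, the expected execution path can be computed without a model and then compared against the path a model reports; the near-miss scene shows how changing a single attribute alters the path.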
Experiments on a range of MLLMs show that even the strongest model attains only 53.33 Path F1, with sharp drops on hard negatives and as depth or predicate complexity grows, confirming that deep compositional reasoning remains a fundamental challenge.
[178] EVATok: Adaptive Length Video Tokenization for Efficient Visual Autoregressive Generation
Tianwei Xiong,Jun Hao Liew,Zilong Huang,Zhijie Lin,Jiashi Feng,Xihui Liu
Main category: cs.CV
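TL;DR: This paper proposes the EVATok framework, which uses an adaptive video tokenizer to optimize token allocation in video generation, improving reconstruction quality and generation efficiency.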