Skip to content

Table of Contents

cs.CL [Back]

[1] Speculative Decoding Scaling Laws (SDSL): Throughput Optimization Made Simple

Amirhossein Bozorgkhoo,Igor Molybog

Main category: cs.CL

TL;DR: 本文提出了一种理论框架,用于分析性地连接预训练大语言模型的关键超参数与基于推测解码(Speculative Decoding, SD)的推理系统吞吐效率,从而在预训练前预测吞吐最优的超参数配置。

Details Motivation: 以往通过实验方法优化推测解码推理流水线吞吐量需进行大模型训练,成本高昂;本文旨在建立可解析的理论,避免试错式训练。 Method: 提出一种理论分析方法,将预训练LLM的关键超参数与SD推理系统的吞吐效率进行解析建模与关联。 Result: 实现了在模型预训练前即可预测SD系统各组件的吞吐最优超参数,为高效推理系统设计提供理论指导。 Conclusion: 该理论为推测解码提供了可解释、低成本的超参数优化路径,有望显著降低LLM推理加速的研发开销。 Abstract: Speculative decoding is a technique that uses multiple language models to accelerate infer- ence. Previous works have used an experi- mental approach to optimize the throughput of the inference pipeline, which involves LLM training and can be costly. This study of spec- ulative decoding proposes a theory that ana- lytically connects the key hyperparameters of pre-trained LLMs to the throughput efficiency of a downstream SD-based inference system. The theory allows the prediction of throughput- optimal hyperparameters for the components of an inference system before their pre-training.

[2] Summarize Before You Speak with ARACH: A Training-Free Inference-Time Plug-In for Enhancing LLMs via Global Attention Reallocation

Jingtao Wang,Yucong Wang,Jun Ding,Rui Cai,Xun Wang

Main category: cs.CL

TL;DR: 本文提出ARACH,一种无需训练的推理时插件,通过自适应上下文中心聚合上下文并重分配注意力,提升大语言模型性能,且不更新参数。

Details Motivation: 现有训练后方法多为黑箱式输入/输出干预,缺乏对模型内部计算的即插即用干预机制。 Method: 提出ARACH(Attention Reallocation via an Adaptive Context Hub),在推理时引入自适应上下文中心,聚合上下文并动态重分配注意力。 Result: 在多个语言建模任务上实现稳定提升,推理开销小,无需参数更新;注意力分析显示其缓解了attention sink现象。 Conclusion: 对模型内部计算的工程化干预是一种区别于提示工程和训练式后训练的新颖推理时策略。 Abstract: Large language models (LLMs) achieve remarkable performance, yet further gains often require costly training. This has motivated growing interest in post-training techniques-especially training-free approaches that improve models at inference time without updating weights. Most training-free methods treat the model as a black box and improve outputs via input/output-level interventions, such as prompt design and test-time scaling through repeated sampling, reranking/verification, or search. In contrast, they rarely offer a plug-and-play mechanism to intervene in a model's internal computation. We propose ARACH(Attention Reallocation via an Adaptive Context Hub), a training-free inference-time plug-in that augments LLMs with an adaptive context hub to aggregate context and reallocate attention. Extensive experiments across multiple language modeling tasks show consistent improvements with modest inference overhead and no parameter updates. Attention analyses further suggest that ARACH mitigates the attention sink phenomenon. These results indicate that engineering a model's internal computation offers a distinct inference-time strategy, fundamentally different from both prompt-based test-time methods and training-based post-training approaches.

[3] DeReason: A Difficulty-Aware Curriculum Improves Decoupled SFT-then-RL Training for General Reasoning

Hanxu Hu,Yuxuan Wang,Maggie Huan,Jannis Vamvas,Yinya Huang,Zhijiang Guo,Rico Sennrich

Main category: cs.CL

TL;DR: 本文提出DeReason方法,通过基于难度的数据解耦策略,在通用STEM领域中优化监督微调(SFT)与强化学习(RL)的协同训练流程,显著提升大模型的复杂推理能力。

Details Motivation: 现有RLVR范式在通用STEM领域中SFT与RL的交互机制尚不清晰,直接对基模型应用RL样本效率低,而SFT与RL顺序组合效果更好,但数据分配方式影响性能。 Method: 提出DeReason:利用LLM打分估计问题的推理强度,将训练数据划分为推理密集型与非推理密集型子集;前者用于RL训练以增强复杂推理,后者用于SFT以夯实基础领域知识。 Result: 在多个通用STEM和数学基准上,DeReason显著优于SFT-only、RL-only及随机划分的SFT+RL基线,验证了难度感知数据解耦的有效性。 Conclusion: SFT与RL在通用推理中具有互补作用,合理依据推理难度分配训练数据可大幅提升模型性能,DeReason提供了一种通用、高效的后训练范式。 Abstract: Reinforcement learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm for eliciting reasoning capabilities in large language models, particularly in mathematics and coding. While recent efforts have extended this paradigm to broader general scientific (STEM) domains, the complex interplay between supervised fine-tuning (SFT) and RL in these contexts remains underexplored. In this paper, we conduct controlled experiments revealing a critical challenge: for general STEM domains, RL applied directly to base models is highly sample-inefficient and is consistently surpassed by supervised fine-tuning (SFT) on moderate-quality responses. Yet sequential SFT followed by RL can further improve performance, suggesting that the two stages play complementary roles, and that how training data is allocated between them matters. Therefore, we propose DeReason, a difficulty-based data decoupling strategy for general reasoning. DeReason partitions training data by reasoning intensity estimated via LLM-based scoring into reasoning-intensive and non-reasoning-intensive subsets. It allocates broad-coverage, non-reasoning-intensive problems to SFT to establish foundational domain knowledge, and reserves a focused subset of difficult problems for RL to cultivate complex reasoning. We demonstrate that this principled decoupling yields better performance than randomly splitting the data for sequential SFT and RL. Extensive experiments on general STEM and mathematical benchmarks demonstrate that our decoupled curriculum training significantly outperforms SFT-only, RL-only, and random-split baselines. Our work provides a systematic study of the interplay between SFT and RL for general reasoning, offering a highly effective and generalized post-training recipe.

[4] MDER-DR: Multi-Hop Question Answering with Entity-Centric Summaries

Riccardo Campi,Nicolò Oreste Pinciroli Vago,Mathyas Giudici,Marco Brambilla,Piero Fraternali

Main category: cs.CL

TL;DR: 本文提出了一种基于知识图谱的检索增强生成(RAG)框架MDER-DR,通过新索引方法MDER和检索机制DR提升多跳问答性能,在标准与领域特定基准上显著优于基线方法。

Details Motivation: 现有基于知识图谱的RAG方法在将文本索引为三元组时易丢失上下文细节,导致多跳问答等下游任务性能下降。 Method: 提出Map-Disambiguate-Enrich-Reduce(MDER)索引方法,生成上下文驱动的三元组描述并融合实体级摘要;并设计Decompose-Resolve(DR)检索机制,将用户查询分解为可解析三元组并通过迭代推理在KG中定位。 Result: 在标准及领域特定问答基准上,MDER-DR相较传统RAG基线最高提升66%,且具备跨语言鲁棒性。 Conclusion: MDER-DR是一种领域无关、LLM驱动的知识图谱问答框架,对稀疏、不完整和复杂关系数据具有强鲁棒性。 Abstract: Retrieval-Augmented Generation (RAG) over Knowledge Graphs (KGs) suffers from the fact that indexing approaches may lose important contextual nuance when text is reduced to triples, thereby degrading performance in downstream Question-Answering (QA) tasks, particularly for multi-hop QA, which requires composing answers from multiple entities, facts, or relations. We propose a domain-agnostic, KG-based QA framework that covers both the indexing and retrieval/inference phases. A new indexing approach called Map-Disambiguate-Enrich-Reduce (MDER) generates context-derived triple descriptions and subsequently integrates them with entity-level summaries, thus avoiding the need for explicit traversal of edges in the graph during the QA retrieval phase. Complementing this, we introduce Decompose-Resolve (DR), a retrieval mechanism that decomposes user queries into resolvable triples and grounds them in the KG via iterative reasoning. Together, MDER and DR form an LLM-driven QA pipeline that is robust to sparse, incomplete, and complex relational data. Experiments show that on standard and domain specific benchmarks, MDER-DR achieves substantial improvements over standard RAG baselines (up to 66%), while maintaining cross-lingual robustness. Our code is available at https://github.com/DataSciencePolimi/MDER-DR_RAG.

[5] Markovian Generation Chains in Large Language Models

Mingmeng Geng,Amr Mohamed,Guokan Shang,Michalis Vazirgiannis,Thierry Poibeau

Main category: cs.CL

TL;DR: 本文研究了大语言模型(LLMs)在多次迭代推理中文本的演化规律,提出‘马尔可夫生成链’模型,发现输出可能收敛或持续创新,并受温度参数和初始输入影响多样性。

Details Motivation: 探究大语言模型反复处理文本时的演化规律,理解迭代推理对文本多样性的影响及其对多智能体LLM系统的启示。 Method: 定义并建模‘马尔可夫生成链’,开展迭代释义与往返翻译实验,结合句子级马尔可夫链建模与模拟数据分析。 Result: 迭代过程可能导致输出收敛至小的循环集合,或持续生成新句子;句子多样性可能增加或减少,取决于温度参数和初始输入。 Conclusion: 迭代LLM推理具有复杂动态特性,其多样性变化非单调,需在多智能体系统设计中谨慎考虑参数与初始化设置。 Abstract: The widespread use of large language models (LLMs) raises an important question: how do texts evolve when they are repeatedly processed by LLMs? In this paper, we define this iterative inference process as Markovian generation chains, where each step takes a specific prompt template and the previous output as input, without including any prior memory. In iterative rephrasing and round-trip translation experiments, the output either converges to a small recurrent set or continues to produce novel sentences over a finite horizon. Through sentence-level Markov chain modeling and analysis of simulated data, we show that iterative process can either increase or reduce sentence diversity depending on factors such as the temperature parameter and the initial input sentence. These results offer valuable insights into the dynamics of iterative LLM inference and their implications for multi-agent LLM systems.

[6] Artificial Intelligence for Sentiment Analysis of Persian Poetry

Arash Zargar,Abolfazl Moshiri,Mitra Shafaei,Shabnam Rahimi-Golkhandan,Mohamad Tavakoli-Targhi,Farzad Khalvati

Main category: cs.CL

TL;DR: 本研究利用BERT和GPT等大语言模型分析波斯诗人鲁米与帕尔文·埃特萨米的诗歌,探究其对波斯诗歌复杂性的理解能力,并考察诗作情感与格律之间的关联;结果表明GPT-4o可可靠用于波斯诗歌分析,鲁米诗歌整体情感更积极,且格律运用更丰富以表达多元情感。

Details Motivation: 探索现代大语言模型在理解波斯诗歌复杂性(如语义、情感与格律)方面的能力,并检验其在减少人为解释偏差的计算机辅助语义研究中的适用性。 Method: 采用多个基于BERT和GPT(特别是GPT-4o)的语言模型,对鲁米和帕尔文·埃特萨米的诗歌进行情感分析与格律使用比较分析。 Result: GPT-4o能可靠分析波斯诗歌;鲁米诗歌整体情感更积极;鲁米在格律运用上更丰富,能表达更广泛的情感;LLMs可用于无须人工干预的客观语义研究。 Conclusion: 大语言模型(尤其是GPT-4o)可有效应用于波斯诗歌的自动化语义分析,有助于降低人为偏见,推动计算人文研究的发展。 Abstract: Recent advancements of the Artificial Intelligence (AI) have led to the development of large language models (LLMs) that are capable of understanding, analysing, and creating textual data. These language models open a significant opportunity in analyzing the literature and more specifically poetry. In the present work, we employ multiple Bidirectional encoder representations from transformers (BERT) and Generative Pre-trained Transformer (GPT) based language models to analyze the works of two prominent Persian poets: Jalal al-Din Muhammad Rumi (Rumi) and Parvin E'tesami. The main objective of this research is to investigate the capability of the modern language models in grasping complexities of the Persian poetry and explore potential correlations between the poems' sentiment and their meters. Our findings in this study indicates that GPT4o language model can reliably be used in analysis of Persian poetry. Furthermore, the results of our sentiment analysis revealed that in general, Rumi's poems express happier sentiments compared to Parvin E'tesami's poems. Furthermore, comparing the utilization of poetic meters highlighted Rumi's poems superiority in using meters to express a wider variety of sentiments. These findings are significant as they confirm that LLMs can be effectively applied in conducting computer-based semantic studies, where human interpretations are not required, and thereby significantly reducing potential biases in the analysis.

[7] ThReadMed-QA: A Multi-Turn Medical Dialogue Benchmark from Real Patient Questions

Monica Munnangi,Saiph Savage

Main category: cs.CL

TL;DR: 本文提出了ThReadMed-QA——首个基于真实患者-医生在线对话的多轮医学问答基准,揭示当前主流大模型在多轮医学问答中性能显著退化,尤其随对话轮次增加错误率激增,并提出CCS与EPR两个新指标刻画其一致性与错误传播问题。

Details Motivation: 现有医学问答基准多为单轮问答,无法反映真实医患咨询中反复澄清、多轮交互的特点;缺乏基于真实对话、经医生验证的数据集。 Method: 构建ThReadMed-QA数据集(2437个完整对话线程,8204个QA对,最多9轮),采用LLM-as-a-judge方式(基于医师标注真值校准)评估5个SOTA大模型在分层测试集(238个对话)上的表现,并提出Conversational Consistency Score (CCS) 和 Error Propagation Rate (EPR) 量化多轮失败模式。 Result: GPT-5表现最优但仅41.2%全对;所有模型在第0轮到第2轮间性能显著下降(p<0.001),错误率约增至3倍;强模型初始得分高但衰减剧烈(如GPT-5降16.2分),弱模型反而趋稳;CCS显示Claude Haiku近1/3对话答案正误剧烈波动;EPR表明单次错误使后续错误概率提升1.9–6.1倍。 Conclusion: 当前大模型在单轮医学问答上能力尚可,但在真实多轮临床对话场景下可靠性严重不足,凸显了对话一致性与错误鲁棒性的关键挑战,亟需面向多轮交互的建模与评估新范式。 Abstract: Medical question-answering benchmarks predominantly evaluate single-turn exchanges, failing to capture the iterative, clarification-seeking nature of real patient consultations. We introduce ThReadMed-QA, a benchmark of 2,437 fully-answered patient-physician conversation threads extracted from r/AskDocs, comprising 8,204 question-answer pairs across up to 9 turns. Unlike prior work relying on simulated dialogues, adversarial prompts, or exam-style questions, ThReadMed-QA captures authentic patient follow-up questions and verified physician responses, reflecting how patients naturally seek medical information online. We evaluate five state-of-the-art LLMs -- GPT-5, GPT-4o, Claude Haiku, Gemini 2.5 Flash, and Llama 3.3 70B -- on a stratified test split of 238 conversations (948 QA pairs) using a calibrated LLM-as-a-judge rubric grounded in physician ground truth. Even the strongest model, GPT-5, achieves only 41.2% fully-correct responses. All five models degrade significantly from turn 0 to turn 2 (p < 0.001), with wrong-answer rates roughly tripling by the third turn. We identify a fundamental tension between single-turn capability and multi-turn reliability: models with the strongest initial performance (GPT-5: 75.2; Claude Haiku: 72.3 out of 100) exhibit the steepest declines by turn 2 (dropping 16.2 and 25.0 points respectively), while weaker models plateau or marginally improve. We introduce two metrics to quantify multi-turn failure modes: Conversational Consistency Score (CCS) and Error Propagation Rate (EPR). CCS reveals that nearly one in three Claude Haiku conversations swings between a fully correct and a completely wrong response within the same thread. EPR shows that a single wrong turn raises the probability of a subsequent wrong turn by 1.9-6.1x across all models.

[8] Temporal Text Classification with Large Language Models

Nishat Raihan,Marcos Zampieri

Main category: cs.CL

TL;DR: 本文首次系统评估了主流闭源和开源大语言模型在时间文本分类(TTC)任务上的表现,发现闭源模型(如GPT-4o、Claude 3.5)在少样本提示下效果优异,而开源模型经微调后虽有提升,仍不及闭源模型。

Details Motivation: 尽管大语言模型(LLMs)近年发展迅速,但其在自动文本断代(即时间文本分类,TTC)任务上的性能尚未被系统研究。 Method: 在三个历史语料库(两份英文、一份葡萄牙文)上,对领先闭源(Claude 3.5、GPT-4o、Gemini 1.5)和开源(LLaMA 3.2、Gemma 2、Mistral、Nemotron 4)LLMs进行零样本、少样本提示及微调三种设置下的系统性评估。 Result: 闭源模型(尤其GPT-4o等)在少样本提示下表现优异;开源模型经微调后性能显著提升,但仍明显落后于闭源模型。 Conclusion: 当前闭源LLMs在TTC任务上具有明显优势,开源模型需进一步优化才能达到同等水平;少样本提示是高效利用闭源模型的关键策略。 Abstract: Languages change over time. Computational models can be trained to recognize such changes enabling them to estimate the publication date of texts. Despite recent advancements in Large Language Models (LLMs), their performance on automatic dating of texts, also known as Temporal Text Classification (TTC), has not been explored. This study provides the first systematic evaluation of leading proprietary (Claude 3.5, GPT-4o, Gemini 1.5) and open-source (LLaMA 3.2, Gemma 2, Mistral, Nemotron 4) LLMs on TTC using three historical corpora, two in English and one in Portuguese. We test zero-shot and few-shot prompting, and fine-tuning settings. Our results indicate that proprietary models perform well, especially with few-shot prompting. They also indicate that fine-tuning substantially improves open-source models but that they still fail to match the performance delivered by proprietary LLMs.

[9] Evaluating Explainable AI Attribution Methods in Neural Machine Translation via Attention-Guided Knowledge Distillation

Aria Nourbakhsh,Salima Lamsiyah,Adelaide Danilov,Christoph Schommer

Main category: cs.CL

TL;DR: 本文提出了一种在Transformer-based seq2seq模型中系统评估可解释性(XAI)方法的新框架,利用教师模型生成的归因图指导学生模型,并通过BLEU等指标量化不同归因方法的有效性;结果表明Attention、Value Zeroing和Layer Gradient×Activation效果最优,且提出了能重建归因图的Attributor Transformer模型。

Details Motivation: 现有XAI方法在seq2seq模型中的系统化、自动化评估不足,缺乏统一、可量化的评估框架。 Method: 以教师模型生成的归因图为监督信号,将其通过四种组合算子(加法、乘法、平均、替换)注入学生Transformer模型的注意力机制;使用Inseq库提取源-目标序列对的归因得分,并在多语言对上评估下游任务性能提升;进一步设计Attributor Transformer模型学习重建教师归因图。 Result: Attention、Value Zeroing和Layer Gradient×Activation在BLEU和chrF指标上带来最显著且稳定的提升;而其他梯度类方法(如Saliency、Integrated Gradients等)效果较弱且不一致;Attributor重建归因图的准确性与下游任务增益正相关。 Conclusion: 不同归因方法捕获的信息存在本质差异;基于注意力机制的归因更能反映seq2seq中源-目标表征对齐;归因图重建质量可作为其解释效用的有效代理指标。 Abstract: The study of the attribution of input features to the output of neural network models is an active area of research. While numerous Explainable AI (XAI) techniques have been proposed to interpret these models, the systematic and automated evaluation of these methods in sequence-to-sequence (seq2seq) models is less explored. This paper introduces a new approach for evaluating explainability methods in transformer-based seq2seq models. We use teacher-derived attribution maps as a structured side signal to guide a student model, and quantify the utility of different attribution methods through the student's ability to simulate targets. Using the Inseq library, we extract attribution scores over source-target sequence pairs and inject these scores into the attention mechanism of a student transformer model under four composition operators (addition, multiplication, averaging, and replacement). Across three language pairs (de-en, fr-en, ar-en) and attributions from Marian-MT and mBART models, Attention, Value Zeroing, and Layer Gradient $\times$ Activation consistently yield the largest gains in BLEU (and corresponding improvements in chrF) relative to baselines. In contrast, other gradient-based methods (Saliency, Integrated Gradients, DeepLIFT, Input $\times$ Gradient, GradientShap) lead to smaller and less consistent improvements. These results suggest that different attribution methods capture distinct signals and that attention-derived attributions better capture alignment between source and target representations in seq2seq models. Finally, we introduce an Attributor transformer that, given a source-target pair, learns to reconstruct the teacher's attribution map. Our findings demonstrate that the more accurately the Attributor can reproduce attribution maps, the more useful an injection of those maps is for the downstream task. The source code can be found on GitHub.

[10] Stop Listening to Me! How Multi-turn Conversations Can Degrade Diagnostic Reasoning

Kevin H. Guo,Chao Yan,Avinash Baidya,Katherine Brown,Xiang Gao,Juming Xiong,Zhijun Yin,Bradley A. Malin

Main category: cs.CL

TL;DR: 本文提出了一种“坚持或切换”(stick-or-switch)评估框架,用于评估17个大语言模型(LLMs)在多轮临床诊断对话中的推理能力;研究发现多轮交互会显著降低模型性能(即‘对话税’),模型常放弃初始正确诊断或安全拒答,盲目跟随错误用户建议。

Details Motivation: 尽管当前大语言模型在静态诊断推理基准上表现优异,但其在更贴近现实的多轮临床对话场景下的诊断推理能力尚缺乏系统研究。 Method: 构建了'stick-or-switch'评估框架,用于量化模型在多轮对话中的信念坚定性(如坚持正确诊断或安全拒答)与灵活性(如识别并采纳后续出现的正确建议);在三个临床数据集上对17个LLM进行实验评估。 Result: 发现显著的'对话税'现象——多轮交互持续削弱模型性能;模型常放弃初始正确诊断或安全拒答以迎合错误用户建议;部分模型存在'盲目切换'行为,无法区分信号与错误建议。 Conclusion: 当前LLMs在多轮临床对话中诊断推理能力受限,亟需提升其信念稳定性与辨别力,以保障真实医疗场景下的安全可靠应用。 Abstract: Patients and clinicians are increasingly using chatbots powered by large language models (LLMs) for healthcare inquiries. While state-of-the-art LLMs exhibit high performance on static diagnostic reasoning benchmarks, their efficacy across multi-turn conversations, which better reflect real-world usage, has been understudied. In this paper, we evaluate 17 LLMs across three clinical datasets to investigate how partitioning the decision-space into multiple simpler turns of conversation influences their diagnostic reasoning. Specifically, we develop a "stick-or-switch" evaluation framework to measure model conviction (i.e., defending a correct diagnosis or safe abstention against incorrect suggestions) and flexibility (i.e., recognizing a correct suggestion when it is introduced) across conversations. Our experiments reveal the conversation tax, where multi-turn interactions consistently degrade performance when compared to single-shot baselines. Notably, models frequently abandon initial correct diagnoses and safe abstentions to align with incorrect user suggestions. Additionally, several models exhibit blind switching, failing to distinguish between signal and incorrect suggestions.

[11] Algorithmic Consequences of Particle Filters for Sentence Processing: Amplified Garden-Paths and Digging-In Effects

Amani Maina-Kilaas,Roger Levy

Main category: cs.CL

TL;DR: 本文探讨了语言处理中的意外性理论(surprisal theory)及其局限性,指出大语言模型(LLM)虽能较好预测阅读时间,但在结构预期被违反时系统性低估难度,暗示结构歧义表征在句子加工中起因果作用;作者引入粒子滤波模型(particle filter models),显式表征结构假设,并证明其算法特性(如放大花园路径效应),尤其揭示重采样(resampling)机制会自然导致实时‘挖坑效应’(digging-in effects),且该效应强度随粒子数量减少而增强。

Details Motivation: 现有基于大语言模型的surprisal预测虽跨语言稳健,但无法解释结构预期违反时的加工困难,暗示其缺乏结构歧义表征可能造成理论缺陷,需引入能显式建模歧义的计算框架。 Method: 采用粒子滤波模型作为替代框架,显式维护有限数量的结构假设(粒子);通过理论推导与模拟分析,证明其关键算法性质(如花园路径效应放大、重采样引发挖坑效应),并量化挖坑效应与粒子数的反比关系。 Result: 证明粒子滤波模型中重采样操作必然导致实时挖坑效应(即歧义区域越长,后续消歧越难),且该效应强度随粒子数减少而增强;完全并行模型(无限粒子)则不产生此效应。 Conclusion: 结构歧义的显式表征对解释人类句子加工现象(尤其是违反预期时的困难和挖坑效应)具有因果必要性;surprisal理论需扩展以纳入歧义动态追踪机制,粒子滤波为符合认知约束的可行建模路径。 Abstract: Under surprisal theory, linguistic representations affect processing difficulty only through the bottleneck of surprisal. Our best estimates of surprisal come from large language models, which have no explicit representation of structural ambiguity. While LLM surprisal robustly predicts reading times across languages, it systematically underpredicts difficulty when structural expectations are violated -- suggesting that representations of ambiguity are causally implicated in sentence processing. Particle filter models offer an alternative where structural hypotheses are explicitly represented as a finite set of particles. We prove several algorithmic consequences of particle filter models, including the amplification of garden-path effects. Most critically, we demonstrate that resampling, a common practice with these models, inherently produces real-time digging-in effects -- where disambiguation difficulty increases with ambiguous region length. Digging-in magnitude scales inversely with particle count: fully parallel models predict no such effect.

[12] MaterialFigBENCH: benchmark dataset with figures for evaluating college-level materials science problem-solving abilities of multimodal large language models

Michiko Yoshitake,Yuta Suzuki,Ryo Igarashi,Yoshitaka Ushiku,Keisuke Nagato

Main category: cs.CL

TL;DR: MaterialFigBench 是一个面向材料科学领域的多模态大语言模型评测基准,聚焦于模型对相图、应力-应变曲线等关键图表的理解与定量解读能力;实验表明当前多模态LLMs仍严重依赖记忆而非真实视觉推理,暴露出数值精度、有效数字处理和视觉推理的显著短板。

Details Motivation: 现有基准多依赖文本,难以评估模型对材料科学中不可或缺的图表(如相图、衍射图谱等)的真实理解能力;需构建领域专用、图驱动的评测基准以揭示多模态LLM在视觉推理与定量分析上的真实瓶颈。 Method: 构建包含137道大学材料科学教材改编题目的MaterialFigBench数据集,覆盖晶体结构、相图、力学性能等核心主题,每题均需结合特定图表作答;为应对图像读数模糊性,引入专家定义的答案容差范围;系统评测多个SOTA多模态LLM(如ChatGPT、GPT系列)在各题型与版本上的表现。 Result: 模型整体准确率随版本更新有所提升,但普遍未能真正解析图表——正确答案多源于记忆知识而非图像阅读;在定量解读(如读取坐标值、斜率、相区边界)、有效数字处理及复杂图式(如三元相图、电子能带结构)理解上表现薄弱;部分简单图表(如线性Arrhenius图)识别能力有所改善。 Conclusion: MaterialFigBench揭示了当前多模态大语言模型在材料科学图解推理中的根本性缺陷,强调必须加强视觉-数值联合建模能力;该基准为推动领域专用多模态AI发展提供了可复现、细粒度的评测标准与改进方向。 Abstract: We present MaterialFigBench, a benchmark dataset designed to evaluate the ability of multimodal large language models (LLMs) to solve university-level materials science problems that require accurate interpretation of figures. Unlike existing benchmarks that primarily rely on textual representations, MaterialFigBench focuses on problems in which figures such as phase diagrams, stress-strain curves, Arrhenius plots, diffraction patterns, and microstructural schematics are indispensable for deriving correct answers. The dataset consists of 137 free-response problems adapted from standard materials science textbooks, covering a broad range of topics including crystal structures, mechanical properties, diffusion, phase diagrams, phase transformations, and electronic properties of materials. To address unavoidable ambiguity in reading numerical values from images, expert-defined answer ranges are provided where appropriate. We evaluate several state-of-the-art multimodal LLMs, including ChatGPT and GPT models accessed via OpenAI APIs, and analyze their performance across problem categories and model versions. The results reveal that, although overall accuracy improves with model updates, current LLMs still struggle with genuine visual understanding and quantitative interpretation of materials science figures. In many cases, correct answers are obtained by relying on memorized domain knowledge rather than by reading the provided images. MaterialFigBench highlights persistent weaknesses in visual reasoning, numerical precision, and significant-digit handling, while also identifying problem types where performance has improved. This benchmark provides a systematic and domain-specific foundation for advancing multimodal reasoning capabilities in materials science and for guiding the development of future LLMs with stronger figure-based understanding.

[13] BLooP: Zero-Shot Abstractive Summarization using Large Language Models with Bigram Lookahead Promotion

Varun Iyer,Cornelia Caragea

Main category: cs.CL

TL;DR: 本文提出了一种无需训练的解码干预方法BLooP,通过鼓励大语言模型生成源文档中出现的二元组(bigram)来提升摘要的忠实度和质量。

Details Motivation: 现有大语言模型在无微调情况下进行抽象式摘要时,常遗漏关键信息并引入无关内容,需提升其生成摘要的忠实度与准确性。 Method: 提出BLooP(Bigram Lookahead Promotion),一种基于哈希表查找的训练-free解码干预方法,在每一步解码中优先选择能构成源文档中已存在bigram的token,无需训练、微调或修改模型。 Result: 在多个模型(Llama-3.1-8B-Instruct等)和数据集(CNN/DM、CCSum等)上,BLooP显著提升了ROUGE和BARTScore指标;人工评估证实其显著提高摘要忠实度,且不损害可读性。 Conclusion: BLooP是一种轻量、通用、即插即用的解码策略,有效增强大模型摘要的忠实性,为训练-free摘要优化提供了新思路。 Abstract: Abstractive summarization requires models to generate summaries that convey information in the source document. While large language models can generate summaries without fine-tuning, they often miss key details and include extraneous information. We propose BLooP (Bigram Lookahead Promotion), a simple training-free decoding intervention that encourages large language models (LLMs) to generate tokens that form bigrams from the source document. BLooP operates through a hash table lookup at each decoding step, requiring no training, fine-tuning, or model modification. We demonstrate improvements in ROUGE and BARTScore for Llama-3.1-8B-Instruct, Mistral-Nemo-Instruct-2407, and Gemma-2-9b-it on CNN/DM, CCSum, Multi-News, and SciTLDR. Human evaluation shows that BLooP significantly improves faithfulness without reducing readability. We make the code available at https://github.com/varuniyer/BLooP

Yuzhi Liang,Lixiang Ma,Xinrong Zhu

Main category: cs.CL

TL;DR: 本文提出了一种结合大语言模型先验与统计因果发现的增强型因果推理框架,用于提升法律判决预测的准确性和鲁棒性,通过细粒度法律要素提取与因果结构消歧,显著优于现有方法。

Details Motivation: 现有基于预训练语言模型的法律判决预测方法依赖统计相关性,缺乏对法律构成要件和因果逻辑的显式建模,易学得虚假相关、鲁棒性差;而现有因果方法在真实法律文本中面临法律因子提取噪声大、因果结构发现因稀疏特征导致不确定性高等瓶颈。 Method: 提出融合LLM先验与统计因果发现的增强因果框架:1)设计粗到细混合提取机制(统计采样+LLM语义推理)精准识别并净化法律构成要素;2)引入LLM辅助因果结构消歧机制,利用LLM作为约束性先验知识库,对模糊因果方向进行概率评估与剪枝,生成合法合规候选因果图;3)基于因果图显式约束文本注意力强度,构建因果感知判决预测模型。 Result: 在LEVEN、QA、CAIL等多个基准数据集上实验表明,该方法在预测精度和鲁棒性上显著超越SOTA基线,尤其在区分易混淆罪名方面表现突出。 Conclusion: 将LLM的语义理解能力与因果推理有机结合,可有效缓解法律文本中要素提取噪声与结构不确定性问题,为可解释、鲁棒的法律AI提供新范式。 Abstract: Mainstream methods for Legal Judgment Prediction (LJP) based on Pre-trained Language Models (PLMs) heavily rely on the statistical correlation between case facts and judgment results. This paradigm lacks explicit modeling of legal constituent elements and underlying causal logic, making models prone to learning spurious correlations and suffering from poor robustness. While introducing causal inference can mitigate this issue, existing causal LJP methods face two critical bottlenecks in real-world legal texts: inaccurate legal factor extraction with severe noise, and significant uncertainty in causal structure discovery due to Markov equivalence under sparse features. To address these challenges, we propose an enhanced causal inference framework that integrates Large Language Model (LLM) priors with statistical causal discovery. First, we design a coarse-to-fine hybrid extraction mechanism combining statistical sampling and LLM semantic reasoning to accurately identify and purify standard legal constituent elements. Second, to resolve structural uncertainty, we introduce an LLM-assisted causal structure disambiguation mechanism. By utilizing the LLM as a constrained prior knowledge base, we conduct probabilistic evaluation and pruning on ambiguous causal directions to generate legally compliant candidate causal graphs. Finally, a causal-aware judgment prediction model is constructed by explicitly constraining text attention intensity via the generated causal graphs. Extensive experiments on multiple benchmark datasets, including LEVEN , QA, and CAIL, demonstrate that our proposed method significantly outperforms state-of-the-art baselines in both predictive accuracy and robustness, particularly in distinguishing confusing charges.

[15] Try, Check and Retry: A Divide-and-Conquer Framework for Boosting Long-context Tool-Calling Performance of LLMs

Kunfeng Chen,Qihuang Zhong,Juhua Liu,Bo Du,Dacheng Tao

Main category: cs.CL

TL;DR: 本文提出Tool-DC框架,通过'尝试-检查-重试'范式提升大语言模型在工具调用任务中的性能,分为无需训练(TF)和需训练(TB)两种变体,在多个基准上显著优于基线方法。

Details Motivation: 现有方法在面对大规模、高噪声候选工具的长上下文工具调用任务时表现不佳,限制了实际应用。 Method: 提出Tool-DC分而治之框架,包含训练自由的Tool-DC(TF)和训练驱动的Tool-DC(TB),均基于'Try-Check-Retry'范式以降低推理难度并增强模型自省能力。 Result: Tool-DC(TF)在BFCL和ACEBench上平均提升达+25.10%;Tool-DC(TB)使Qwen2.5-7B达到甚至超越OpenAI o3和Claude-Haiku-4.5等闭源模型性能。 Conclusion: Tool-DC有效提升了LLM在复杂工具调用场景下的鲁棒性与实用性,兼顾灵活性与推理效率。 Abstract: Tool-calling empowers Large Language Models (LLMs) to interact with external environments. However, current methods often struggle to handle massive and noisy candidate tools in long-context tool-calling tasks, limiting their real-world application. To this end, we propose Tool-DC, a Divide-and-Conquer framework for boosting tool-calling performance of LLMs. The core of Tool-DC is to reduce the reasoning difficulty and make full use of self-reflection ability of LLMs via a "Try-Check-Retry" paradigm. Specifically, Tool-DC involves two variants: 1) the training-free Tool-DC (TF), which is plug-and-play and flexible; 2) the training-based Tool-DC (TB), which is more inference-efficient. Extensive experiments show that both Tool-DC methods outperform their counterparts by a clear margin. Tool-DC (TF) brings up to +25.10% average gains against the baseline on BFCL and ACEBench benchmarks, while Tool-DC (TB) enables Qwen2.5-7B to achieve comparable or even better performance than proprietary LLMs, e.g., OpenAI o3 and Claude-Haiku-4.5.

[16] Tiny Aya: Bridging Scale and Multilingual Depth

Alejandro R. Salamanca,Diana Abagyan,Daniel D'souza,Ammar Khairi,David Mora,Saurabh Dash,Viraat Aryabumi,Sara Rajaee,Mehrnaz Mofakhami,Ananya Sahu,Thomas Euyang,Brittawnya Prince,Madeline Smith,Hangyu Lin,Acyr Locatelli,Sara Hooker,Tom Kocmi,Aidan Gomez,Ivan Zhang,Phil Blunsom,Nick Frosst,Joelle Pineau,Beyza Ermis,Ahmet Üstün,Julia Kreutzer,Marzieh Fadaee

Main category: cs.CL

TL;DR: Tiny Aya 是一个仅含 3.35B 参数的小型多语言大模型,在70种语言上训练并经区域感知后训练优化,在翻译质量、多语言理解与目标语言生成方面达到SOTA,同时发布基础模型、全局指令微调模型及三个区域专用模型。

Details Motivation: 探索高效、语言性能均衡、便于实际部署的多语言AI扩展路径,弥补小模型在多语言能力上的不足。 Method: 在70种语言数据上预训练,并采用区域感知(region-aware)的后训练策略;发布基础模型、全局平衡指令微调模型及面向非洲、南亚、欧洲、亚太和西亚的三个区域专业化模型。 Result: 在翻译质量、多语言理解和目标语言生成方面达到当前最优(SOTA),同时保持仅3.35B参数规模。 Conclusion: Tiny Aya 证明了小型多语言模型可通过精细化训练策略(如区域感知后训练)实现高性能与高实用性,为多语言AI提供了一条以效率和语言公平性为导向的新扩展路径。 Abstract: Tiny Aya redefines what a small multilingual language model can achieve. Trained on 70 languages and refined through region-aware posttraining, it delivers state-of-the-art in translation quality, strong multilingual understanding, and high-quality target-language generation, all with just 3.35B parameters. The release includes a pretrained foundation model, a globally balanced instruction-tuned variant, and three region-specialized models targeting languages from Africa, South Asia, Europe, Asia-Pacific, and West Asia. This report details the training strategy, data composition, and comprehensive evaluation framework behind Tiny Aya, and presents an alternative scaling path for multilingual AI: one centered on efficiency, balanced performance across languages, and practical deployment.

[17] Can Small Language Models Use What They Retrieve? An Empirical Study of Retrieval Utilization Across Model Scale

Sanchit Pandey

Main category: cs.CL

TL;DR: 本文研究了7B参数以下的小型语言模型在检索增强生成(RAG)中的表现,发现其主要瓶颈在于无法有效利用检索到的信息,而非检索质量差;即使提供完美检索结果(oracle),小模型仍大量忽略上下文、答错或被干扰,导致RAG反而损害性能。

Details Motivation: 探究7B及更小参数规模的语言模型能否有效利用RAG中检索到的信息,厘清性能瓶颈究竟来自检索质量还是模型对上下文的利用能力。 Method: 在360M–8B五种规模、SmolLM2/Qwen2.5/Llama3.1三种架构上,对比无检索、BM25、E5-large-v2稠密检索、oracle检索四种条件;提出参数化知识划分方法,区分模型本可回答与必须依赖外部知识的问题,以解耦利用失败与检索失败。 Result: 1)即使oracle检索下,≤7B模型在需外部知识的问题上错误率达85–100%;2)加入检索上下文后,模型原本能答对的问题错误率上升42–100%,表明存在显著干扰效应;3)错误分析显示主导失败模式是‘无关生成’(忽略上下文)。现象跨提示模板和检索方法稳定。 Conclusion: 对<7B模型而言,RAG的主要限制是上下文利用能力不足,而非检索质量;在此规模部署RAG在标准评估下常带来净负收益。 Abstract: Retrieval augmented generation RAG is widely deployed to improve factual accuracy in language models yet it remains unclear whether smaller models of size 7B parameters or less can effectively utilize retrieved information. To investigate this question we evaluate five model sizes from 360M to 8B across three architecture families SmolLM2 Qwen2.5 and Llama 3.1 under four retrieval conditions including no retrieval BM25 dense retrieval using E5 large v2 and oracle retrieval where the retrieved passage is guaranteed to contain the answer. We introduce a parametric knowledge split that separates questions a model can already answer from those that require external knowledge which allows us to isolate utilization failure from retrieval quality failure. We find three main results. First even with oracle retrieval models of size 7B or smaller fail to extract the correct answer 85 to 100 percent of the time on questions they cannot answer alone which indicates a fundamental utilization bottleneck. Second adding retrieval context destroys 42 to 100 percent of answers the model previously knew suggesting a distraction effect driven by the presence of context rather than its quality. Third an error analysis of 2588 oracle failures shows that the dominant failure mode is irrelevant generation where the model ignores the provided context entirely. These patterns hold across multiple prompt templates and retrieval methods. The results indicate that for models below 7B parameters the main limitation of RAG is context utilization rather than retrieval quality and that deploying RAG at this scale can lead to a net negative trade off under standard evaluation conditions.

[18] One Supervisor, Many Modalities: Adaptive Tool Orchestration for Autonomous Queries

Mayank Saini Arit Kumar Bishwas

Main category: cs.CL

TL;DR: 本文提出了一种用于自主多模态查询处理的智能体AI框架,通过动态分解查询、调度模态专用工具并自适应融合结果,在时间、交互轮次和成本上显著优化,同时保持准确率。

Details Motivation: 解决现有多模态AI系统中工具调度僵化、跨模态协同效率低、部署成本高等问题,提升端到端查询处理的效率与经济性。 Method: 构建以中央Supervisor为核心的代理式框架:对文本查询采用RouteLLM学习路由;对非文本模态使用小型语言模型(SLM)辅助模态分解;支持图像、音频、视频、文档等多模态专用工具的动态调用与结果合成。 Result: 在2847个查询、15类任务上评估,相比分层基线方法,实现准确回答耗时降低72%、对话返工减少85%、成本下降67%,且准确率持平。 Conclusion: 集中式智能编排可显著改善多模态AI的实际部署效益,是提升系统经济性与实用性的关键路径。 Abstract: We present an agentic AI framework for autonomous multimodal query processing that coordinates specialized tools across text, image, audio, video, and document modalities. A central Supervisor dynamically decomposes user queries, delegates subtasks to modality-appropriate tools (e.g., object detection, OCR, speech transcription), and synthesizes results through adaptive routing strategies rather than predetermined decision trees. For text-only queries, the framework uses learned routing via RouteLLM, while non-text paths use SLM-assisted modality decomposition. Evaluated on 2,847 queries across 15 task categories, our framework achieves 72% reduction in time-to-accurate-answer, 85% reduction in conversational rework, and 67% cost reduction compared to the matched hierarchical baseline while maintaining accuracy parity. These results demonstrate that intelligent centralized orchestration fundamentally improves multimodal AI deployment economics.

[19] Where Matters More Than What: Decoding-aligned KV Cache Compression via Position-aware Pseudo Queries

Zhenxu Tian,Yi Su,Juntao Li,Min Zhang

Main category: cs.CL

TL;DR: 本文提出DapQ方法,通过位置感知的伪查询模拟解码阶段的注意力模式,实现与生成过程对齐的KV缓存压缩,在严格内存限制下保持近乎无损性能。

Details Motivation: 现有KV缓存压缩方法仅基于prefill阶段的输入注意力模式评估token重要性,无法反映解码阶段真实关注的token;而解码阶段的真实查询在推理时不可知,需构建有效伪查询。 Method: 提出DapQ框架:利用位置信息主导的伪查询来近似解码阶段的query,构建与生成过程对齐的观察窗口,从而更准确评估token重要性并进行轻量级缓存淘汰。 Result: 在多个基准和LLM上验证,DapQ在严苛内存约束下(如仅3% KV缓存预算)仍实现高达99.5%的NIAH任务性能,显著优于现有方法。 Conclusion: 位置信息比语义内容更能有效建模解码阶段的注意力行为;DapQ通过位置感知伪查询实现了更精准、更高效的KV缓存压缩,为长上下文LLM推理提供了实用解决方案。 Abstract: The Key-Value (KV) cache is crucial for efficient Large Language Models (LLMs) inference, but excessively long contexts drastically increase KV cache memory footprint. Existing KV cache compression methods typically rely on input-side attention patterns within a prompt observation window to estimate token importance during the prefill stage. They fail to preserve critical tokens for future generation since these assessments are not derived from the decoding process. Intuitively, an effective observation window should mirror the decoding-stage queries to accurately reflect which tokens the generation process will attend to. However, ground-truth decoding queries are inherently unavailable during inference. For constructing pseudo queries to approximate them, we find that positional information plays a more critical role than semantic content. Motivated by this insight, we propose decoding-aligned KV cache compression via position-aware pseudo queries (DapQ), a novel and lightweight eviction framework that leverages position-aware pseudo queries to simulate the output tokens, thereby establishing an effective observation window for importance assessment. It aligns closely with the actual generation context and enables precise token eviction. Extensive evaluations across multiple benchmarks and LLMs demonstrate that DapQ achieves superior performance, particularly under strict memory constraints (e.g., up to nearly lossless performance 99.5% on NIAH with 3% KV cache budgets).

[20] Streaming Translation and Transcription Through Speech-to-Text Causal Alignment

Roman Koshkin,Jeon Haesung,Lianbo Liu,Hao Shi,Mengjie Zhao,Yusuke Fujita,Yui Sudo

Main category: cs.CL

TL;DR: Hikari是一种无需策略、端到端的同步语音到文本翻译模型,通过概率性WAIT机制和解码器时间膨胀技术提升质量与延迟权衡,在多语言任务上达到SOTA性能。

Details Motivation: 传统同步机器翻译依赖离线模型与人工设计或学习的策略,缺乏端到端、策略无关的统一建模方法。 Method: 提出Hikari模型:1)用概率性WAIT token编码READ/WRITE决策,实现策略无关的流式语音转译与转录;2)引入Decoder Time Dilation降低自回归开销并平衡训练分布;3)采用监督微调策略,使模型具备延迟恢复能力。 Result: 在英→日/德/俄任务上,Hikari在低延迟与高延迟场景均取得新的BLEU SOTA结果,显著优于近期基线。 Conclusion: Hikari证明了无需外部策略、完全端到端建模同步语音翻译的可行性与优越性,为低延迟高质量实时翻译提供了新范式。 Abstract: Simultaneous machine translation (SiMT) has traditionally relied on offline machine translation models coupled with human-engineered heuristics or learned policies. We propose Hikari, a policy-free, fully end-to-end model that performs simultaneous speech-to-text translation and streaming transcription by encoding READ/WRITE decisions into a probabilistic WAIT token mechanism. We also introduce Decoder Time Dilation, a mechanism that reduces autoregressive overhead and ensures a balanced training distribution. Additionally, we present a supervised fine-tuning strategy that trains the model to recover from delays, significantly improving the quality-latency trade-off. Evaluated on English-to-Japanese, German, and Russian, Hikari achieves new state-of-the-art BLEU scores in both low- and high-latency regimes, outperforming recent baselines.

[21] UtilityMax Prompting: A Formal Framework for Multi-Objective Large Language Model Optimization

Ofir Marom

Main category: cs.CL

TL;DR: 本文提出UtilityMax Prompting框架,使用形式化数学语言替代自然语言来定义大语言模型(LLM)任务,将任务建模为影响图并以效用函数优化LLM输出,在多目标电影推荐任务中显著提升精度和NDCG。

Details Motivation: 自然语言提示存在固有歧义性,难以同时满足多个目标,限制了LLM在复杂任务中的性能。 Method: 构建基于影响图的任务表示,将LLM输出设为唯一决策变量,并定义作用于图中条件概率分布的效用函数,指导LLM最大化期望效用。 Result: 在MovieLens 1M数据集及Claude Sonnet 4.6、GPT-5.4、Gemini 2.5 Pro三个前沿模型上,该方法在多目标电影推荐任务中一致优于自然语言提示基线,提升了精度与NDCG。 Conclusion: 形式化、数学化的提示方法(UtilityMax Prompting)能更精确引导LLM优化多目标任务,缓解自然语言提示的模糊性问题。 Abstract: The success of a Large Language Model (LLM) task depends heavily on its prompt. Most use-cases specify prompts using natural language, which is inherently ambiguous when multiple objectives must be simultaneously satisfied. In this paper we introduce UtilityMax Prompting, a framework that specifies tasks using formal mathematical language. We reconstruct the task as an influence diagram in which the LLM's answer is the sole decision variable. A utility function is defined over the conditional probability distributions within the diagram, and the LLM is instructed to find the answer that maximises expected utility. This constrains the LLM to reason explicitly about each component of the objective, directing its output toward a precise optimization target rather than a subjective natural language interpretation. We validate our approach on the MovieLens 1M dataset across three frontier models (Claude Sonnet 4.6, GPT-5.4, and Gemini 2.5 Pro), demonstrating consistent improvements in precision and Normalized Discounted Cumulative Gain (NDCG) over natural language baselines in a multi-objective movie recommendation task.

[22] Performance Evaluation of Open-Source Large Language Models for Assisting Pathology Report Writing in Japanese

Masataka Kawai,Singo Sakashita,Shumpei Ishikawa,Shogo Watanabe,Anna Matsuoka,Mikio Sakurai,Yasuto Fujimoto,Yoshiyuki Takahara,Atsushi Ohara,Hirohiko Miyake,Genichiro Ishii

Main category: cs.CL

TL;DR: 本文评估了七种开源大语言模型(LLM)在日语病理报告撰写中的性能,涵盖结构化诊断文本生成与信息提取、日语病理报告错字纠正、以及病理医生和临床医生对模型生成解释性文本的主观评价三方面。结果显示,具备推理能力的思维模型和医学专用模型在结构化报告和错字纠正任务中表现更优,但解释性文本偏好因评审者而异;总体表明开源LLM可在有限但具临床意义的场景中辅助日语病理报告撰写。

Details Motivation: 大型语言模型(LLM)在日语病理报告撰写中的性能尚未被探索,亟需评估其在该特定临床语言场景下的实用性。 Method: 评估七种开源LLM,从三方面展开:(A) 按预定义格式生成和提取病理诊断文本;(B) 纠正日语病理报告中的错别字;(C) 由病理医生和临床医生对模型生成的解释性文本进行主观评分。 Result: 思维类模型和医学专用模型在结构化报告生成与错字纠正任务中表现更优;但在解释性文本生成上,不同评审者的偏好差异显著。 Conclusion: 尽管LLM效用因任务而异,但开源LLM可在有限但临床相关的场景中有效辅助日语病理报告撰写。 Abstract: The performance of large language models (LLMs) for supporting pathology report writing in Japanese remains unexplored. We evaluated seven open-source LLMs from three perspectives: (A) generation and information extraction of pathology diagnosis text following predefined formats, (B) correction of typographical errors in Japanese pathology reports, and (C) subjective evaluation of model-generated explanatory text by pathologists and clinicians. Thinking models and medical-specialized models showed advantages in structured reporting tasks that required reasoning and in typo correction. In contrast, preferences for explanatory outputs varied substantially across raters. Although the utility of LLMs differed by task, our findings suggest that open-source LLMs can be useful for assisting Japanese pathology report writing in limited but clinically relevant scenarios.

[23] QChunker: Learning Question-Aware Text Chunking for Domain RAG via Multi-Agent Debate

Jihao Zhao,Daixuan Li,Pengfei Li,Shuaishuai Zu,Biao Qin,Hongyan Liu

Main category: cs.CL

TL;DR: 本文提出QChunker,通过理解-检索-增强范式改进RAG中的文本分块质量,结合多智能体辩论框架与新评估指标ChunkScore,提升语义完整性与信息粒度。

Details Motivation: 现有RAG受限于知识库中文本块的语义完整性与信息粒度;传统分块方法缺乏逻辑连贯性,且评估依赖下游问答任务、效率低。 Method: 提出QChunker:将分块建模为文本切分+知识补全的复合任务;设计四智能体辩论框架(问题提纲生成器、文本切分器、完整性审查员、知识补全器);构建45K高质量分块数据集并蒸馏至小模型;提出直接评估指标ChunkScore,并用于多路径采样下的最优分块选择。 Result: ChunkScore被理论与实验验证可直接高效判别分块质量;在四个异构领域实验中,QChunker显著提升分块的逻辑连贯性与信息丰富度,从而增强RAG效果。 Conclusion: QChunker通过引入‘理解先行’机制与可量化评估,从根本上优化RAG的知识输入质量,为文本分块提供了新范式与实用工具。 Abstract: The effectiveness upper bound of retrieval-augmented generation (RAG) is fundamentally constrained by the semantic integrity and information granularity of text chunks in its knowledge base. To address these challenges, this paper proposes QChunker, which restructures the RAG paradigm from retrieval-augmentation to understanding-retrieval-augmentation. Firstly, QChunker models the text chunking as a composite task of text segmentation and knowledge completion to ensure the logical coherence and integrity of text chunks. Drawing inspiration from Hal Gregersen's "Questions Are the Answer" theory, we design a multi-agent debate framework comprising four specialized components: a question outline generator, text segmenter, integrity reviewer, and knowledge completer. This framework operates on the principle that questions serve as catalysts for profound insights. Through this pipeline, we successfully construct a high-quality dataset of 45K entries and transfer this capability to small language models. Additionally, to handle long evaluation chains and low efficiency in existing chunking evaluation methods, which overly rely on downstream QA tasks, we introduce a novel direct evaluation metric, ChunkScore. Both theoretical and experimental validations demonstrate that ChunkScore can directly and efficiently discriminate the quality of text chunks. Furthermore, during the text segmentation phase, we utilize document outlines for multi-path sampling to generate multiple candidate chunks and select the optimal solution employing ChunkScore. Extensive experimental results across four heterogeneous domains exhibit that QChunker effectively resolves aforementioned issues by providing RAG with more logically coherent and information-rich text chunks.

[24] Multi-Task Reinforcement Learning for Enhanced Multimodal LLM-as-a-Judge

Junjie Wu,Xuan Kan,Zihao He,Shunwen Tan,Bo Pan,Kaitai Zhang

Main category: cs.CL

TL;DR: 本文提出了一种名为MT-RL-Judge的多任务强化学习框架,用于提升多模态大语言模型(MLLM)作为评判者(Judge)在多种视觉任务中的泛化能力与判断一致性。

Details Motivation: 现有MLLM-as-a-Judge模型多为单任务优化,难以在多样化、分布外任务中可靠泛化,亟需提升其跨任务鲁棒性与人类偏好对齐能力。 Method: 提出MT-RL-Judge框架,通过多任务强化学习联合优化MLLM评判器,在多个视觉评判任务上共享策略并利用RL增强泛化能力。 Result: 在多个强基线上,MT-RL-Judge显著提升了判断一致性及与人类偏好的相关性,并在分布外任务上展现出优异的泛化性能。 Conclusion: 多任务强化学习是提升MLLM作为通用评判者能力的有效范式,为构建更可靠、可泛化的自动评估系统提供了新思路。 Abstract: Multimodal Large Language Models (MLLMs) have been widely adopted as MLLM-as-a-Judges due to their strong alignment with human judgment across various visual tasks. However, most existing judge models are optimized for single-task scenarios and struggle to generalize to diverse contexts, which is a critical requirement for reliable evaluation. To address this limitation, we propose Multi-Task Reinforcement Learning for MLLM-as-a-Judge (MT-RL-Judge), a framework that jointly optimizes the judge model across multiple tasks, leveraging the generalization capabilities of RL. Experimental results against several strong baselines demonstrate that MT-RL-Judge outperforms strong baselines in both judgment consistency and correlation with human preferences. Furthermore, our approach exhibits robust generalization on out-of-distribution tasks, further validating its effectiveness.

[25] A technology-oriented mapping of the language and translation industry: Analysing stakeholder values and their potential implication for translation pedagogy

María Isabel Rivas Ginel,Janiça Hackenbuchner,Alina Secară,Ralph Krüger,Caroline Rossi

Main category: cs.CL

TL;DR: 本文探讨了自动化语言与翻译行业中价值的构建与协商,指出技术效率已成为基础期望,而人类价值则通过专业技能、监督、问责和情境判断重新定位,适应性成为连接人机价值的核心中介。

Details Motivation: 探究在日益自动化的语言与翻译行业中,人类价值与技术价值如何被构建、协商与共存。 Method: 基于LT-LiDER项目中对29位行业利益相关者的访谈数据,运用Chesterman翻译伦理框架进行分析。 Result: 发现效率导向的技术价值已成自动化生产环境中的基准预期;人类价值未被取代,而是以专业知识、监督、问责与情境判断形式嵌入技术流程;适应性成为关键中介价值,体现译员持续调整技能、角色与身份的能力。 Conclusion: 自动化并非取代翻译价值,而是重塑其形态,形成技术效率赋能人类交际工作的相互依存关系。 Abstract: This paper examines how value is constructed and negotiated in today's increasingly automated language and translation industry. Drawing on interview data from twenty-nine industry stakeholders collected within the LT-LiDER project, the study analyses how human value, technological value, efficiency, and adaptability are articulated across different professional roles. Using Chesterman's framework of translation ethics and associated values as an analytical lens, the paper shows that efficiency-oriented technological values aligned with the ethics of service have become baseline expectations in automated production environments, where speed, scalability, and deliverability dominate evaluation criteria. At the same time, human value is not displaced but repositioned, emerging primarily through expertise, oversight, accountability, and contextual judgment embedded within technology-mediated workflows. A central finding is the prominence of adaptability as a mediating value linking human and technological domains. Adaptability is constructed as a core professional requirement, reflecting expectations that translators continuously adjust their skills, roles, and identities in response to evolving tools and organisational demands. The paper argues that automation reshapes rather than replaces translation value, creating an interdependent configuration in which technological efficiency enables human communicative work.

[26] In the LLM era, Word Sense Induction remains unsolved

Anna Mosolova,Marie Candito,Carlos Ramisch

Main category: cs.CL

TL;DR: 本文探讨了词义归纳(WSI)在缺乏标注数据时的评估方法问题,提出基于SemCor的新评估数据集,并系统评估了预训练嵌入、聚类算法及LLM方法;发现‘每词一簇’启发式方法仍是最强基线,LLM直接用于WSI效果有限,但结合Wiktionary的数据增强可提升性能并超越此前SOTA。

Details Motivation: 当前WSI评估存在方法学问题,尤其在低资源或领域特定场景下缺乏合理、具代表性的评估基准,且对预训练模型和LLM在词义归纳中的实际能力认识不足。 Method: 构建尊重原始语料多义性与频率分布的SemCor派生评估数据集;系统评测预训练词向量与聚类算法(按词性分组);提出并评估基于LLM的WSI方法;探索多种数据增强来源(LLM生成、语料库、词典)及半监督设置(利用Wiktionary提供must-link约束、簇数先验等)。 Result: 1) 无监督方法(包括本文方法及以往工作)均未超越‘每词一簇’(1cpl)启发式基线;2) 最优方法随词性变化;3) LLM直接执行WSI表现不佳;4) 数据增强有效,尤其是利用Wiktionary的半监督设置,使性能提升3.3%,超越此前SOTA。 Conclusion: WSI问题尚未解决,需更深入地整合结构化词典知识与LLM的词汇语义建模能力,推动二者协同而非替代。 Abstract: In the absence of sense-annotated data, word sense induction (WSI) is a compelling alternative to word sense disambiguation, particularly in low-resource or domain-specific settings. In this paper, we emphasize methodological problems in current WSI evaluation. We propose an evaluation on a SemCor-derived dataset, respecting the original corpus polysemy and frequency distributions. We assess pre-trained embeddings and clustering algorithms across parts of speech, and propose and evaluate an LLM-based WSI method for English. We evaluate data augmentation sources (LLM-generated, corpus and lexicon), and semi-supervised scenarios using Wiktionary for data augmentation, must-link constraints, number of clusters per lemma. We find that no unsupervised method (whether ours or previous) surpasses the strong "one cluster per lemma" heuristic (1cpl). We also show that (i) results and best systems may vary across POS, (ii) LLMs have troubles performing this task, (iii) data augmentation is beneficial and (iv) capitalizing on Wiktionary does help. It surpasses previous SOTA system on our test set by 3.3\%. WSI is not solved, and calls for a better articulation of lexicons and LLMs' lexical semantics capabilities.

[27] SemBench: A Universal Semantic Framework for LLM Evaluation

Mikel Zubillaga,Naiara Perez,Oscar Sainz,German Rigau

Main category: cs.CL

TL;DR: 本文提出SemBench框架,利用词典义项定义和句子编码器自动生成语义理解评估基准,无需人工标注例句,具备跨语言、可扩展、数据高效的特点,并在多语言和多模型上验证了其有效性与稳定性。

Details Motivation: 传统语义理解评估基准(如WiC)构建成本高、依赖高资源语言,难以满足大语言模型(LLM)跨语言、规模化评估需求。 Method: 提出SemBench框架,仅基于词典义项定义和句子编码器自动生成合成语义评测数据,无需人工编写上下文例句。 Result: 在英语、西班牙语和巴斯克语三种语言及多种LLM上验证,SemBench生成的模型排名与标准WiC高度相关,且仅需少量样本即可获得稳定排名。 Conclusion: SemBench是一种轻量级、可适配、数据高效的跨语言语义理解评估框架,为LLM语义能力评测提供了新范式。 Abstract: Recent progress in Natural Language Processing (NLP) has been driven by the emergence of Large Language Models (LLMs), which exhibit remarkable generative and reasoning capabilities. However, despite their success, evaluating the true semantic understanding of these models remains a persistent challenge. Traditional benchmarks such as Word-in-Context (WiC) effectively probe this capability, but their creation is resource-intensive and often limited to high-resource languages. In this paper, we introduce SemBench, a framework for automatically generating synthetic benchmarks that assess the semantic competence of LLMs using only dictionary sense definitions and a sentence encoder. This approach eliminates the need for curated example sentences, making it both scalable and language-independent. We evaluate SemBench in three languages (English, Spanish, and Basque) spanning different levels of linguistic resources, and across a wide range of LLMs. Our results show that rankings derived from SemBench strongly correlate with those obtained from standard WiC datasets. Furthermore, our analysis demonstrates that only a small number of examples is required to achieve stable and meaningful rankings. Overall, SemBench provides a lightweight, adaptable, and data-efficient framework for cross-lingual evaluation of semantic understanding in LLMs.

[28] Semi-Synthetic Parallel Data for Translation Quality Estimation: A Case Study of Dataset Building for an Under-Resourced Language Pair

Assaf Siani,Anna Kernerman,Ilan Kernerman

Main category: cs.CL

TL;DR: 本文提出了一种用于英语到希伯来语质量估计(QE)的半合成平行数据集构建方法,并基于该数据集训练了BERT和XLM-R等神经QE模型,探讨了数据规模、分布平衡与错误分布对模型性能的影响。

Details Motivation: 为解决低资源语言对(尤其是形态丰富语言)的质量估计(QE)系统准确率低、适应性差和可靠性不足的问题,由于缺乏平行语料及语言特异性因素(如复杂的形态句法)影响。 Method: 构建半合成英-希伯来语QE数据集:基于典型语言模式生成英文句子,经多个MT引擎翻译为希伯来语,通过BLEU筛选;人工标注质量得分,并加入高分专业译文;有针对性地注入性别与数的一致性错误;在该数据集上训练BERT和XLM-R等神经QE模型。 Result: 实验揭示了数据集规模、分布均衡性及错误类型分布对QE模型性能具有显著影响;所提方法提升了低资源、形态丰富语言对的QE建模能力。 Conclusion: 本研究为低资源、形态复杂语言对的QE任务提供了可复现的数据构建范式与模型评估结果,推动了面向真实应用场景的QE系统发展。 Abstract: Quality estimation (QE) plays a crucial role in machine translation (MT) workflows, as it serves to evaluate generated outputs that have no reference translations and to determine whether human post-editing or full retranslation is necessary. Yet, developing highly accurate, adaptable and reliable QE systems for under-resourced language pairs remains largely unsolved, due mainly to limited parallel corpora and to diverse language-dependent factors, such as with morphosyntactically complex languages. This study presents a semi-synthetic parallel dataset for English-to-Hebrew QE, generated by creating English sentences based on examples of usage that illustrate typical linguistic patterns, translating them to Hebrew using multiple MT engines, and filtering outputs via BLEU-based selection. Each translated segment was manually evaluated and scored by a linguist, and we also incorporated professionally translated English-Hebrew segments from our own resources, which were assigned the highest quality score. Controlled translation errors were introduced to address linguistic challenges, particularly regarding gender and number agreement, and we trained neural QE models, including BERT and XLM-R, on this dataset to assess sentence-level MT quality. Our findings highlight the impact of dataset size, distributed balance, and error distribution on model performance. We will describe the challenges, methodology and results of our experiments, and specify future directions aimed at improving QE performance. This research contributes to advancing QE models for under resourced language pairs, including morphology-rich languages.

[29] Compression Favors Consistency, Not Truth: When and Why Language Models Prefer Correct Information

Konstantin Krestnikov

Main category: cs.CL

TL;DR: 本文提出“压缩-一致性原则”,认为语言模型在下一词预测中倾向于选择能更简洁、更一致地描述训练数据的假设;当错误选项在结构上更难压缩时,模型才会表现出对正确陈述的偏好。

Details Motivation: 解释为何语言模型即使在混合质量数据上训练,仍有时偏好正确陈述。 Method: 基于小型GPT-2风格字符级Transformer(3.5M–86M参数),在控制正确与错误规则比例的合成数学语料上进行实验,并设计随机错误与系统性错误等不同设置。 Result: 在随机错误设定下,模型在平衡数据中达83.1%正确率,即使正确规则仅占10%也保持67.0%;而系统性错误下准确率接近随机;自然语言类合成任务中仍有57.7%准确率;嵌入验证步骤或增加一致规则数可提升正确率。 Conclusion: 模型对‘真实’的偏好本质上是压缩压力与内部一致性偏好的副产品,而非内在追求真理的倾向。 Abstract: Why do language models sometimes prefer correct statements even when trained on mixed-quality data? We introduce the Compression--Consistency Principle: next-token prediction favors hypotheses that allow shorter and more internally consistent descriptions of the training data. Truth bias emerges only when false alternatives are structurally harder to compress. We test this using small GPT-2-style character-level transformers (3.5M--86M parameters) on synthetic math corpora with controlled mixtures of correct and incorrect rules. In the random-error setting, models strongly prefer correct completions in paired evaluation: 83.1% accuracy at balanced data and 67.0% even when correct rules appear in only 10% of the corpus. Replacing random errors with a coherent but mathematically incorrect rule system largely eliminates the preference (near-chance accuracy). In a more natural-language-like synthetic world, the effect is weaker but still present (57.7%). Additional experiments show that embedding verification steps can restore preference for correctness even at small scale, while increasing the number of consistent rules produces a graded improvement in accuracy. Our results suggest that what appears as a "truth bias" is largely a side effect of compression pressure and preference for internal consistency, rather than an intrinsic drive toward truth. Full code and data are available at https://github.com/Rai220/compression-drives-truth.

Yaocong Li,Qiang Lan,Leihan Zhang,Le Zhang

Main category: cs.CL

TL;DR: 本文提出了Legal-DC基准数据集和LegRAG框架,以解决中文法律场景中检索增强生成(RAG)系统缺乏专用评估资源与难以适配法律条文结构化特性的两大问题。

Details Motivation: 现有中文法律RAG基准缺乏对检索器与生成器联合评估的支持,且主流RAG系统难以适配法律条文的结构化特性。 Method: 构建了含480份法律文件和2475个问答对的Legal-DC基准数据集(带条款级引用标注);提出LegRAG框架,融合法律自适应索引(按条款边界切分)与双路径自反思机制;引入面向高可靠性需求的自动化大模型评估方法。 Result: LegRAG在关键指标上较现有最优方法提升1.3%–5.6%;开源代码与数据。 Conclusion: 本研究为中文法律RAG提供了专用基准、实用框架与实证洞见,推动其实际落地与发展。 Abstract: Retrieval-Augmented Generation (RAG) has emerged as a promising technology for legal document consultation, yet its application in Chinese legal scenarios faces two key limitations: existing benchmarks lack specialized support for joint retriever-generator evaluation, and mainstream RAG systems often fail to accommodate the structured nature of legal provisions. To address these gaps, this study advances two core contributions: First, we constructed the Legal-DC benchmark dataset, comprising 480 legal documents (covering areas such as market regulation and contract management) and 2,475 refined question-answer pairs, each annotated with clause-level references, filling the gap for specialized evaluation resources in Chinese legal RAG. Second, we propose the LegRAG framework, which integrates legal adaptive indexing (clause-boundary segmentation) with a dual-path self-reflection mechanism to ensure clause integrity while enhancing answer accuracy. Third, we introduce automated evaluation methods for large language models to meet the high-reliability demands of legal retrieval scenarios. LegRAG outperforms existing state-of-the-art methods by 1.3% to 5.6% across key evaluation metrics. This research provides a specialized benchmark, practical framework, and empirical insights to advance the development of Chinese legal RAG systems. Our code and data are available at https://github.com/legal-dc/Legal-DC.

[31] Trust Oriented Explainable AI for Fake News Detection

Krzysztof Siwek,Daniel Stankowski,Maciej Stodolski

Main category: cs.CL

TL;DR: 本文探讨了可解释人工智能(XAI)在基于NLP的虚假新闻检测中的应用,比较了SHAP、LIME和Integrated Gradients三种解释方法,结果表明XAI在保持高检测准确率的同时提升了模型透明度与可解释性,但也存在计算开销大和参数敏感等局限。

Details Motivation: 提升虚假新闻检测系统的可靠性与可信度,解决深度学习模型‘黑箱’问题,增强用户对检测结果的信任。 Method: 采用SHAP、LIME和Integrated Gradients三种XAI技术,结合多种神经网络架构,在虚假新闻检测任务上进行模型训练与解释性分析。 Result: XAI方法显著提升了模型可解释性且未明显降低检测准确率;SHAP提供精细局部归因,LIME生成简洁直观解释,Integrated Gradients在卷积模型中效率更优;但存在计算成本高和参数敏感等限制。 Conclusion: 将XAI与NLP融合是提升虚假新闻检测系统透明性、可靠性与实用性的有效路径,未来需优化计算效率并增强鲁棒性。 Abstract: This article examines the application of Explainable Artificial Intelligence (XAI) in NLP based fake news detection and compares selected interpretability methods. The work outlines key aspects of disinformation, neural network architectures, and XAI techniques, with a focus on SHAP, LIME, and Integrated Gradients. In the experimental study, classification models were implemented and interpreted using these methods. The results show that XAI enhances model transparency and interpretability while maintaining high detection accuracy. Each method provides distinct explanatory value: SHAP offers detailed local attributions, LIME provides simple and intuitive explanations, and Integrated Gradients performs efficiently with convolutional models. The study also highlights limitations such as computational cost and sensitivity to parameterization. Overall, the findings demonstrate that integrating XAI with NLP is an effective approach to improving the reliability and trustworthiness of fake news detection systems.

[32] Large Language Models for Biomedical Article Classification

Jakub Proboszcz,Paweł Cichosz

Main category: cs.CL

TL;DR: 本文系统研究了大语言模型(LLM)在生物医学文献分类任务中的文本分类能力,涵盖多种开源与闭源模型、不同提示方式、输出处理方法及少样本设置,并与传统分类器(如朴素贝叶斯、随机森林和微调Transformer)对比,结果表明LLM在零样本和少样本下性能接近传统方法,验证其在复杂领域中的实用性。

Details Motivation: 探索大语言模型作为文本分类器在非平凡领域(如生物医学文献分类)中的实际效用,弥补以往研究在配置广度(如提示类型、输出处理、少样本策略)上的不足。 Method: 采用多个小型至中型开源及部分闭源大语言模型,系统评估不同提示方式(零样本/少样本)、输出处理方法(用于生成类别及类别概率)、少样本示例数量与选择策略;并与朴素贝叶斯、随机森林及微调Transformer等传统方法对比性能。 Result: 在15个具挑战性的生物医学数据集上,零样本平均PR AUC达0.4以上,少样本接近0.5;该性能接近朴素贝叶斯(0.5)、默认随机森林(0.5)、调参后随机森林(0.55)及微调Transformer(0.5);证实LLM可作为有效分类器,尤其推荐使用输出token概率进行类别概率预测。 Conclusion: 大语言模型在生物医学文本分类任务中展现出与传统机器学习方法相当的实用性能,尤其在少样本设定下表现突出;研究为LLM在专业领域文本分类中的落地提供了可复现、可推广的最佳实践配置建议。 Abstract: This work presents a systematic and in-depth investigation of the utility of large language models as text classifiers for biomedical article classification. The study uses several small and mid-size open source models, as well as selected closed source ones, and is more comprehensive than most prior work with respect to the scope of evaluated configurations: different types of prompts, output processing methods for generating both class and class probability predictions, as well as few-shot example counts and selection methods. The performance of the most successful configurations is compared to that of conventional classification algorithms. The obtained average PR AUC over 15 challenging datasets above 0.4 for zero-shot prompting and nearly 0.5 for few-shot prompting comes close to that of the naïve Bayes classifier (0.5), the random forest algorithm (0.5 with default settings or 0.55 with hyperparameter tuning) and fine-tuned transformer models (0.5). These results confirm the utility of large language models as text classifiers for non-trivial domains and provide practical recommendations of the most promising setups, including in particular using output token probabilities for class probability prediction.

[33] DatedGPT: Preventing Lookahead Bias in Large Language Models with Time-Aware Pretraining

Yutong Yan,Raphael Tang,Zhenyu Gao,Wenxi Jiang,Yao Lu

Main category: cs.CL

TL;DR: 本文提出DatedGPT系列模型,通过严格按年份划分训练数据并进行指令微调,有效避免金融回测中的前瞻性偏差,同时在标准基准测试中表现具有竞争力。

Details Motivation: 解决大型语言模型在金融回测中因预训练数据包含未来信息而引入的前瞻性偏差问题。 Method: 构建12个1.3B参数的模型DatedGPT,每个模型均从头开始、基于2013–2024年间严格按年切分(无跨年数据泄露)的约1000亿token数据训练,并分别在通用与金融领域指令数据上进行时间对齐的微调;采用困惑度探测验证知识截止性。 Result: 各模型的知识范围被有效限制在其训练数据截止年份内,在标准基准测试中性能媲美同规模现有模型,并提供可交互的跨年代模型对比网页演示。 Conclusion: DatedGPT为时间敏感型任务(如金融预测)提供了可控时间边界的大语言模型范式,兼顾真实性与实用性。 Abstract: In financial backtesting, large language models pretrained on internet-scale data risk introducing lookahead bias that undermines their forecasting validity, as they may have already seen the true outcome during training. To address this, we present DatedGPT, a family of twelve 1.3B-parameter language models, each trained from scratch on approximately 100 billion tokens of temporally partitioned data with strict annual cutoffs spanning 2013 to 2024. We further enhance each model with instruction fine-tuning on both general-domain and finance-specific datasets curated to respect the same temporal boundaries. Perplexity-based probing confirms that each model's knowledge is effectively bounded by its data cutoff year, while evaluation on standard benchmarks shows competitive performance with existing models of similar scale. We provide an interactive web demo that allows users to query and compare responses from models across different cutoff years.

[34] Bielik-Minitron-7B: Compressing Large Language Models via Structured Pruning and Knowledge Distillation for the Polish Language

Remigiusz Kinas,Paweł Kiszczak,Sergio P. Perez,Krzysztof Ociepa,Łukasz Flis,Krzysztof Wróbel,Adrian Gwoździej

Main category: cs.CL

TL;DR: 本文提出了Bielik-Minitron-7B,一种基于Bielik-11B-v3.0的压缩版7.35B参数模型,专为欧洲语言优化;采用两阶段压缩(结构化混合剪枝+知识蒸馏)降低33.4%参数量,并通过SFT、DPO-P和GRPO对齐提升性能,最终恢复约90%基线性能并实现最高50%推理加速。

Details Motivation: 为欧洲等代表性不足的语言高效构建高质量、低成本部署的语言模型。 Method: 采用受NVIDIA Minitron启发的两阶段压缩方法:第一阶段用NVIDIA Model Optimizer进行结构化混合剪枝,第二阶段用NVIDIA NeMo Framework进行logit级知识蒸馏;随后通过监督微调(SFT)、直接偏好优化(DPO-P)和强化学习(GRPO)进行对齐优化。 Result: 模型参数从11.04B压缩至7.35B(降幅33.4%),恢复约90%原始模型性能,并获得最高50%推理速度提升。 Conclusion: 该两阶段压缩与对齐流程为资源受限语言提供了兼顾质量与效率的模型压缩范式,显著降低推理部署成本。 Abstract: This report details the creation of Bielik-Minitron-7B, a compressed 7.35B parameter version of the Bielik-11B-v3.0 model, specifically optimized for European languages. By leveraging a two-stage compression methodology inspired by the NVIDIA Minitron approach, we combined structured hybrid pruning and knowledge distillation to reduce the model's parameter count by 33.4%, from 11.04B to 7.35B. We utilized the NVIDIA Model Optimizer for structural pruning and the NVIDIA NeMo Framework for logit-based distillation for quality recovery. Following distillation, the model underwent a rigorous alignment pipeline consisting of Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO-P), and Reinforcement Learning (GRPO). Our final model successfully recovered approximately 90% of the baseline model's performance while providing up to 50% inference speedup. This approach demonstrates an efficient pathway to create language models for less-represented languages, preserving the original model quality while reducing inference deployment costs.

[35] CoMMET: To What Extent Can LLMs Perform Theory of Mind Tasks?

Ruirui Chen,Weifeng Jiang,Chengwei Qin,Cheston Tan

Main category: cs.CL

TL;DR: 本文提出了一种新的多模态基准数据集CoMMET,用于评估大语言模型(LLMs)在多轮对话场景下的心智理论(ToM)能力,涵盖更广泛的心理状态和道德判断任务,是首个此类多模态多轮ToM评测数据集。

Details Motivation: 现有ToM评测基准局限于文本输入和信念相关任务,无法全面评估LLM在真实社交交互中所需的社会推理能力。 Method: 构建了名为CoMMET的多模态、多轮心智状态与道德评估基准数据集,灵感源自ToM手册任务,并对不同家族和规模的LLM进行了系统性评测。 Result: 通过实验揭示了当前LLM在ToM能力上的优势与不足,明确了其在多模态、多轮社会推理任务中的表现边界。 Conclusion: CoMMET为评估和提升LLM的社会认知能力提供了新工具和新视角,推动其向更自然、可靠的人机交互发展。 Abstract: Theory of Mind (ToM)-the ability to reason about the mental states of oneself and others-is a cornerstone of human social intelligence. As Large Language Models (LLMs) become ubiquitous in real-world applications, validating their capacity for this level of social reasoning is essential for effective and natural interactions. However, existing benchmarks for assessing ToM in LLMs are limited; most rely solely on text inputs and focus narrowly on belief-related tasks. In this paper, we propose a new multimodal benchmark dataset, CoMMET, a Comprehensive Mental states and Moral Evaluation Task inspired by the Theory of Mind Booklet Task. CoMMET expands the scope of evaluation by covering a broader range of mental states and introducing multi-turn testing. To the best of our knowledge, this is the first multimodal dataset to evaluate ToM in a multi-turn conversational setting. Through a comprehensive assessment of LLMs across different families and sizes, we analyze the strengths and limitations of current models and identify directions for future improvement. Our work offers a deeper understanding of the social cognitive capabilities of modern LLMs.

[36] PersonaTrace: Synthesizing Realistic Digital Footprints with LLM Agents

Minjia Wang,Yunfeng Wang,Xiao Ma,Dexin Lv,Qifan Guo,Lynn Zheng,Benliang Wang,Lei Wang,Jiannan Li,Yongwei Xing,David Xu,Zheng Sun

Main category: cs.CL

TL;DR: 本文提出了一种利用大语言模型(LLM)代理合成真实数字足迹的新方法,从结构化用户画像出发生成多样化、合理的用户事件序列及对应数字产物(如邮件、日程等),实验证明其生成数据更具多样性与真实性,并在真实分布外任务中提升下游模型性能。

Details Motivation: 数字足迹研究受限于多样且易获取数据的稀缺性。 Method: 基于结构化用户画像,利用大语言模型(LLM)代理生成多样化、合理的用户事件序列及对应的数字产物(如邮件、消息、日历条目、提醒等)。 Result: 内在评估显示生成数据比现有基线更丰富、更真实;在真实世界分布外任务上,用该合成数据微调的模型优于其他合成数据训练的模型。 Conclusion: LLM代理可有效合成高质量、多样化的数字足迹数据,为行为建模、个性化应用与机器学习提供新数据源。 Abstract: Digital footprints (records of individuals' interactions with digital systems) are essential for studying behavior, developing personalized applications, and training machine learning models. However, research in this area is often hindered by the scarcity of diverse and accessible data. To address this limitation, we propose a novel method for synthesizing realistic digital footprints using large language model (LLM) agents. Starting from a structured user profile, our approach generates diverse and plausible sequences of user events, ultimately producing corresponding digital artifacts such as emails, messages, calendar entries, reminders, etc. Intrinsic evaluation results demonstrate that the generated dataset is more diverse and realistic than existing baselines. Moreover, models fine-tuned on our synthetic data outperform those trained on other synthetic datasets when evaluated on real-world out-of-distribution tasks.

[37] CHiL(L)Grader: Calibrated Human-in-the-Loop Short-Answer Grading

Pranav Raikote,Korbinian Randl,Ioanna Miliou,Athanasios Lakes,Panagiotis Papapetrou

Main category: cs.CL

TL;DR: CHiL(L)Grader 是首个将校准置信度估计融入人机协同流程的自动评分框架,通过后验温度缩放、基于置信度的选择性预测和持续学习,在保证专家级评分质量(QWK ≥ 0.80)的同时,将不确定样本交由人工处理,并随教师反馈持续提升能力。

Details Motivation: 指令微调大模型在教育评估中常过度自信,且随课程更新可靠性下降,难以在高风险场景中全自动部署;亟需能识别预测可信度的可靠AI评分方法。 Method: 提出CHiL(L)Grader框架,结合后验温度缩放进行置信度校准、基于置信度的选择性预测实现人机分工,并引入持续学习机制以适应动态评分标准和新题型。 Result: 在三个简答题评分数据集上,自动评分覆盖35–65%作答,达到专家级质量(QWK ≥ 0.80);接受与拒绝样本间QWK差达0.347,验证了置信路由有效性;每次人工修正均增强模型评分能力。 Conclusion: 不确定性量化是实现可靠AI辅助评分的关键,CHiL(L)Grader为人机协同、安全可演化的智能教育评估提供了可行路径。 Abstract: Scaling educational assessment with large language models requires not just accuracy, but the ability to recognize when predictions are trustworthy. Instruction-tuned models tend to be overconfident, and their reliability deteriorates as curricula evolve, making fully autonomous deployment unsafe in high-stakes settings. We introduce CHiL(L)Grader, the first automated grading framework that incorporates calibrated confidence estimation into a human-in-the-loop workflow. Using post-hoc temperature scaling, confidence-based selective prediction, and continual learning, CHiL(L)Grader automates only high-confidence predictions while routing uncertain cases to human graders, and adapts to evolving rubrics and unseen questions. Across three short-answer grading datasets, CHiL(L)Grader automatically scores 35-65% of responses at expert-level quality (QWK >= 0.80). A QWK gap of 0.347 between accepted and rejected predictions confirms the effectiveness of the confidence-based routing. Each correction cycle strengthens the model's grading capability as it learns from teacher feedback. These results show that uncertainty quantification is key for reliable AI-assisted grading.

[38] BTZSC: A Benchmark for Zero-Shot Text Classification Across Cross-Encoders, Embedding Models, Rerankers and LLMs

Ilias Aarab

Main category: cs.CL

TL;DR: 本文提出了BTZSC基准,系统比较了四种零样本文本分类方法(NLI交叉编码器、嵌入模型、重排序器和指令微调大语言模型),发现现代重排序器(如Qwen3-Reranker-8B)达到新SOTA,嵌入模型在精度与延迟间取得最佳平衡,指令微调LLM表现具竞争力但略逊于重排序器,而NLI模型存在性能瓶颈。

Details Motivation: 现有评估基准(如MTEB)常依赖有监督微调或探针,未能真实反映零样本能力;缺乏对多样化零样本方法的系统性、公平比较。 Method: 构建包含22个公开数据集的零样本文本分类基准BTZSC,覆盖情感、主题、意图和情绪等任务;在统一零样本设定下,系统评测4类共38个模型(NLI交叉编码器、嵌入模型、重排序器、指令微调LLM)。 Result: (i)Qwen3-Reranker-8B以macro F1=0.72创SOTA;(ii)GTE-large-en-v1.5等强嵌入模型精度接近最优且延迟最低;(iii)4–12B参数指令LLM最高达macro F1=0.67,主题分类突出;(iv)NLI模型随规模扩大性能趋于饱和;(v)缩放收益主要集中于重排序器和LLM,而非嵌入模型。 Conclusion: 重排序器是当前零样本文本分类最有效架构,嵌入模型最具实用性价比,LLM展现潜力但尚未全面超越专用模型;BTZSC为后续研究提供了公平、可复现的评估平台。 Abstract: Zero-shot text classification (ZSC) offers the promise of eliminating costly task-specific annotation by matching texts directly to human-readable label descriptions. While early approaches have predominantly relied on cross-encoder models fine-tuned for natural language inference (NLI), recent advances in text-embedding models, rerankers, and instruction-tuned large language models (LLMs) have challenged the dominance of NLI-based architectures. Yet, systematically comparing these diverse approaches remains difficult. Existing evaluations, such as MTEB, often incorporate labeled examples through supervised probes or fine-tuning, leaving genuine zero-shot capabilities underexplored. To address this, we introduce BTZSC, a comprehensive benchmark of 22 public datasets spanning sentiment, topic, intent, and emotion classification, capturing diverse domains, class cardinalities, and document lengths. Leveraging BTZSC, we conduct a systematic comparison across four major model families, NLI cross-encoders, embedding models, rerankers and instruction-tuned LLMs, encompassing 38 public and custom checkpoints. Our results show that: (i) modern rerankers, exemplified by Qwen3-Reranker-8B, set a new state-of-the-art with macro F1 = 0.72; (ii) strong embedding models such as GTE-large-en-v1.5 substantially close the accuracy gap while offering the best trade-off between accuracy and latency; (iii) instruction-tuned LLMs at 4--12B parameters achieve competitive performance (macro F1 up to 0.67), excelling particularly on topic classification but trailing specialized rerankers; (iv) NLI cross-encoders plateau even as backbone size increases; and (v) scaling primarily benefits rerankers and LLMs over embedding models. BTZSC and accompanying evaluation code are publicly released to support fair and reproducible progress in zero-shot text understanding.

[39] Just Use XML: Revisiting Joint Translation and Label Projection

Thennal D K,Chris Biemann,Hans Ole Hatzel

Main category: cs.CL

TL;DR: 本文提出LabelPigeon框架,通过XML标签联合执行机器翻译与标签投影,在提升跨语言迁移效果的同时不损害翻译质量。

Details Motivation: 现有方法通常将标签投影作为机器翻译后的独立步骤,而联合建模的尝试导致翻译质量下降;本文旨在重新评估该结论并提出更优联合建模方案。 Method: 提出LabelPigeon框架,利用XML标签联合建模翻译与标签投影,并设计直接评估标签投影效果的方法。 Result: 在11种语言上标签投影性能优于基线且提升翻译质量;203种语言上的翻译质量评估显示一致改进;27种语言、3个下游任务中跨语言迁移F1最高提升39.9。 Conclusion: XML标记的标签投影能高效、有效地实现标签迁移,且不牺牲翻译质量。 Abstract: Label projection is an effective technique for cross-lingual transfer, extending span-annotated datasets from a high-resource language to low-resource ones. Most approaches perform label projection as a separate step after machine translation, and prior work that combines the two reports degraded translation quality. We re-evaluate this claim with LabelPigeon, a novel framework that jointly performs translation and label projection via XML tags. We design a direct evaluation scheme for label projection, and find that LabelPigeon outperforms baselines and actively improves translation quality in 11 languages. We further assess translation quality across 203 languages and varying annotation complexity, finding consistent improvement attributed to additional fine-tuning. Finally, across 27 languages and three downstream tasks, we report substantial gains in cross-lingual transfer over comparable work, up to +39.9 F1 on NER. Overall, our results demonstrate that XML-tagged label projection provides effective and efficient label transfer without compromising translation quality.

[40] Translationese as a Rational Response to Translation Task Difficulty

Maria Kunilovskaya

Main category: cs.CL

TL;DR: 本文提出翻译过程中的认知负荷是导致翻译腔(translationese)的根本原因,并通过信息论指标(如大语言模型的惊讶度)和传统句法/语义特征,量化翻译任务难度,验证其对翻译腔的预测能力;结果表明跨语言迁移难度比源文本复杂度影响更大,且句法复杂度与翻译解熵是最强预测因子。

Details Motivation: 现有研究将翻译腔归因于生产倾向、社会文化变量及语言对效应,但缺乏统一解释框架;本文旨在从认知负荷角度提供系统性解释。 Method: 基于英德双向语料库(含书面与口语子语料),使用自动分类器计算段级翻译度得分作为翻译腔指标;将翻译任务难度分解为源文本复杂度与跨语言迁移难度两部分,主要采用基于大语言模型惊奇度的信息论指标,并辅以传统句法和语义特征进行建模分析。 Result: 翻译腔可被翻译任务难度部分解释,尤其在英译德方向;跨语言迁移难度的贡献普遍大于源文本复杂度;信息论指标在书面语中表现等于或优于传统特征,但在口语中无优势;源文本句法复杂度和翻译解熵是跨语言对和语体中最稳定的强预测因子。 Conclusion: 翻译腔本质上反映了翻译任务固有的认知负荷,其可观测表现可通过量化任务难度有效预测,支持认知负荷假说,并为翻译质量评估与机器翻译改进提供了新路径。 Abstract: Translations systematically diverge from texts originally produced in the target language, a phenomenon widely referred to as translationese. Translationese has been attributed to production tendencies (e.g. interference, simplification), socio-cultural variables, and language-pair effects, yet a unified explanatory account is still lacking. We propose that translationese reflects cognitive load inherent in the translation task itself. We test whether observable translationese can be predicted from quantifiable measures of translation task difficulty. Translationese is operationalised as a segment-level translatedness score produced by an automatic classifier. Translation task difficulty is conceptualised as comprising source-text and cross-lingual transfer components, operationalised mainly through information-theoretic metrics based on LLM surprisal, complemented by established syntactic and semantic alternatives. We use a bidirectional English-German corpus comprising written and spoken subcorpora. Results indicate that translationese can be partly explained by translation task difficulty, especially in English-to-German. For most experiments, cross-lingual transfer difficulty contributes more than source-text complexity. Information-theoretic indicators match or outperform traditional features in written mode, but offer no advantage in spoken mode. Source-text syntactic complexity and translation-solution entropy emerged as the strongest predictors of translationese across language pairs and modes.

[41] To Words and Beyond: Probing Large Language Models for Sentence-Level Psycholinguistic Norms of Memorability and Reading Times

Thomas Hikaru Clark,Carlos Arriaga,Javier Conde,Gonzalo Martínez,Pedro Reviriego

Main category: cs.CL

TL;DR: 本文探讨了大语言模型(LLMs)在句子层面心理语言学特征(如句子可记忆性和阅读时间)估计中的能力,发现微调后模型能较好拟合人类标注数据并超越基线预测器,但零样本/少样本提示效果不稳定,警示其作为人类认知指标代理的局限性。

Details Motivation: 探索LLMs是否能扩展应用于句子层面的心理语言学特征(如句子可记忆性、阅读时间),而不仅限于词级特征;并评估零样本/少样本提示与监督微调两种范式在此类任务上的有效性差异。 Method: 对LLMs进行监督微调,以预测人类实测的句子可记忆性和阅读时间;同时对比零样本和少样本提示下的表现,并与可解释的基线预测器(如长度、频率等)比较相关性和预测力。 Result: 微调后的LLMs能显著相关于人类标注的句子可记忆性和阅读时间,且预测性能优于传统基线;但零样本和少样本提示结果高度不稳定、相关性弱。 Conclusion: LLMs经监督微调后可有效建模句子级认知特征,蕴含丰富句法语义信息;但直接提示(尤其零/少样本)不能可靠替代人类认知测量,需谨慎使用。 Abstract: Large Language Models (LLMs) have recently been shown to produce estimates of psycholinguistic norms, such as valence, arousal, or concreteness, for words and multiword expressions, that correlate with human judgments. These estimates are obtained by prompting an LLM, in zero-shot fashion, with a question similar to those used in human studies. Meanwhile, for other norms such as lexical decision time or age of acquisition, LLMs require supervised fine-tuning to obtain results that align with ground-truth values. In this paper, we extend this approach to the previously unstudied features of sentence memorability and reading times, which involve the relationship between multiple words in a sentence-level context. Our results show that via fine-tuning, models can provide estimates that correlate with human-derived norms and exceed the predictive power of interpretable baseline predictors, demonstrating that LLMs contain useful information about sentence-level features. At the same time, our results show very mixed zero-shot and few-shot performance, providing further evidence that care is needed when using LLM-prompting as a proxy for human cognitive measures.

[42] SommBench: Assessing Sommelier Expertise of Language Models

William Brach,Tomas Bedej,Jacob Nielsen,Jacob Pichna,Juraj Bedej,Eemeli Saarensilta,Julie Dupouy,Gianluca Barmina,Andrea Blasi Núñez,Peter Schneider-Kamp,Kristian Košťál,Michal Ries,Lukas Galke Poech

Main category: cs.CL

TL;DR: 本文提出SommBench,一个面向多语言、多文化场景的品酒师专业能力评估基准,涵盖葡萄酒理论问答、风味特征补全和餐酒搭配三类任务,用于检验大语言模型通过文本学习感官判断的能力。

Details Motivation: 现有文化评估基准主要关注可语言编码的基础文化知识,而缺乏对依赖嗅觉与味觉等感官体验的专业领域(如品酒)的评估;需构建能区分模型语言能力与领域专业知识的多语言基准。 Method: 构建多语言品酒师能力评估基准SommBench,包含WTQA、WFC、FWP三类任务,覆盖8种语言;数据由专业侍酒师与母语者协作构建;在主流闭源与开源大模型上进行评测。 Result: 最先进模型在葡萄酒理论问答上准确率达97%,但在风味特征补全(最高65%)和餐酒搭配(MCC 0–0.39)上表现显著更差,表明文本训练难以充分支撑感官推理。 Conclusion: SommBench是一个新颖且具挑战性的基准,揭示了当前大语言模型在依赖感官经验的专业领域中的局限性,并为多语言、多模态对齐研究提供新方向。 Abstract: With the rapid advances of large language models, it becomes increasingly important to systematically evaluate their multilingual and multicultural capabilities. Previous cultural evaluation benchmarks focus mainly on basic cultural knowledge that can be encoded in linguistic form. Here, we propose SommBench, a multilingual benchmark to assess sommelier expertise, a domain deeply grounded in the senses of smell and taste. While language models learn about sensory properties exclusively through textual descriptions, SommBench tests whether this textual grounding is sufficient to emulate expert-level sensory judgment. SommBench comprises three main tasks: Wine Theory Question Answering (WTQA), Wine Feature Completion (WFC), and Food-Wine Pairing (FWP). SommBench is available in multiple languages: English, Slovak, Swedish, Finnish, German, Danish, Italian, and Spanish. This helps separate a language model's wine expertise from its language skills. The benchmark datasets were developed in close collaboration with a professional sommelier and native speakers of the respective languages, resulting in 1,024 wine theory question-answering questions, 1,000 wine feature-completion examples, and 1,000 food-wine pairing examples. We provide results for the most popular language models, including closed-weights models such as Gemini 2.5, and open-weights models, such as GPT-OSS and Qwen 3. Our results show that the most capable models perform well on wine theory question answering (up to 97% correct with a closed-weights model), yet feature completion (peaking at 65%) and food-wine pairing show (MCC ranging between 0 and 0.39) turn out to be more challenging. These results position SommBench as an interesting and challenging benchmark for evaluating the sommelier expertise of language models. The benchmark is publicly available at https://github.com/sommify/sommbench.

[43] Cross-Context Review: Improving LLM Output Quality by Separating Production and Review Sessions

Tae-Eun Song

Main category: cs.CL

TL;DR: 本文提出了一种名为Cross-Context Review(CCR)的新方法,通过在无原始上下文的新会话中进行审查,显著提升了大语言模型对自身输出错误的检测能力。实验表明,CCR在F1指标上优于多种基线方法,其优势源于上下文隔离,而非简单重复审查。

Details Motivation: 大语言模型在同一会话中难以发现自身输出中的错误,亟需一种无需额外训练或基础设施、简单有效的审查机制。 Method: 提出Cross-Context Review(CCR):在全新会话中、不访问原始生成对话历史的前提下进行审查;与Same-session Self-Review(SR)、Repeated Self-Review(SR2)和Context-aware Subagent Review(SA)进行对照实验。 Result: 在360次审查中,CCR达到F1=28.6%,显著优于SR(24.6%,p=0.008)、SR2(21.7%,p<0.001)和SA(23.8%,p=0.004);SR2未显著优于SR(p=0.11),证实优势来自上下文分离而非重复。 Conclusion: 上下文隔离是提升LLM自我审查效果的关键;CCR通用、轻量、零成本改造,适用于任意模型。 Abstract: Large language models struggle to catch errors in their own outputs when the review happens in the same session that produced them. This paper introduces Cross-Context Review (CCR), a straightforward method where the review is conducted in a fresh session with no access to the production conversation history. We ran a controlled experiment: 30 artifacts (code, technical documents, presentation scripts) with 150 injected errors, tested under four review conditions -- same-session Self-Review (SR), repeated Self-Review (SR2), context-aware Subagent Review (SA), and Cross-Context Review (CCR). Over 360 reviews, CCR reached an F1 of 28.6%, outperforming SR (24.6%, p=0.008, d=0.52), SR2 (21.7%, p<0.001, d=0.72), and SA (23.8%, p=0.004, d=0.57). The SR2 result matters most for interpretation: reviewing twice in the same session did not beat reviewing once (p=0.11), which rules out repetition as an explanation for CCR's advantage. The benefit comes from context separation itself. CCR works with any model, needs no infrastructure, and costs only one extra session.

[44] LifeSim: Long-Horizon User Life Simulator for Personalized Assistant Evaluation

Feiyu Duan,Xuanjing Huang,Zhongyu Wei

Main category: cs.CL

TL;DR: 本文提出LifeSim用户模拟器和LifeSim-Eval基准,用于评估大语言模型在个性化助手任务中的表现,尤其关注隐式意图理解和长期用户偏好建模能力。

Details Motivation: 现有个性化助手评测基准与真实人机交互脱节,无法反映外部环境和用户认知状态的复杂性。 Method: 基于信念-欲望-意图(BDI)模型构建LifeSim用户模拟器,在物理环境中生成连贯生活轨迹并模拟意图驱动的交互行为;据此构建覆盖8个生活领域、1200个场景的多轮交互式基准LifeSim-Eval。 Result: 实验表明当前大语言模型在隐式意图识别和长期用户偏好建模方面存在显著不足。 Conclusion: LifeSim-Eval为个性化AI助手提供了更贴近现实的评测框架,揭示了现有LLMs的关键短板,并为未来研究指明方向。 Abstract: The rapid advancement of large language models (LLMs) has accelerated progress toward universal AI assistants. However, existing benchmarks for personalized assistants remain misaligned with real-world user-assistant interactions, failing to capture the complexity of external contexts and users' cognitive states. To bridge this gap, we propose LifeSim, a user simulator that models user cognition through the Belief-Desire-Intention (BDI) model within physical environments for coherent life trajectories generation, and simulates intention-driven user interactive behaviors. Based on LifeSim, we introduce LifeSim-Eval, a comprehensive benchmark for multi-scenario, long-horizon personalized assistance. LifeSim-Eval covers 8 life domains and 1,200 diverse scenarios, and adopts a multi-turn interactive method to assess models' abilities to complete explicit and implicit intentions, recover user profiles, and produce high-quality responses. Under both single-scenario and long-horizon settings, our experiments reveal that current LLMs face significant limitations in handling implicit intention and long-term user preference modeling.

[45] QAQ: Bidirectional Semantic Coherence for Selecting High-Quality Synthetic Code Instructions

Jiayin Lei,Ming Ma,Yunxi Duan,Chenxi Li,Tianming Yang

Main category: cs.CL

TL;DR: 本文提出QAQ框架,通过反向互信息(RMI)评估合成代码数据质量,从答案预测查询(Q|A)角度筛选高质量样本,在 WarriorCoder 数据集上仅用25%数据即达全量训练效果。

Details Motivation: 现有基于指令遵循难度(IFD)的数据选择方法难以区分合成数据中的任务固有难度与模型幻觉,导致噪声难检、选择不准。 Method: 提出QAQ框架,定义反向互信息(RMI)衡量答案对查询的预测能力;分析RMI双极端(过低/过高)对应语义错位与缺陷模式;引入强弱模型分歧策略筛选既有效又具挑战性的样本。 Result: 在WarriorCoder数据集上,仅选取25%数据进行分层RMI筛选,性能媲美全量训练,显著优于IFD等现有方法。 Conclusion: 双向语义一致性(Q↔A)是合成数据质量的关键指标;QAQ为高效、低成本的代码模型训练提供了可扩展的数据筛选新范式。 Abstract: Synthetic data has become essential for training code generation models, yet it introduces significant noise and hallucinations that are difficult to detect with current metrics. Existing data selection methods like Instruction-Following Difficulty (IFD) typically assess how hard a model generates an answer given a query ($A|Q$). However, this metric is ambiguous on noisy synthetic data, where low probability can distinguish between intrinsic task complexity and model-generated hallucinations. Here, we propose QAQ, a novel data selection framework that evaluates data quality from the reverse direction: how well can the answer predict the query ($Q|A$)? We define Reverse Mutual Information (RMI) to quantify the information gain about the query conditioned on the answer. Our analyses reveal that both extremes of RMI signal quality issues: low RMI indicates semantic misalignment, while excessively high RMI may contain defect patterns that LLMs easily recognize. Furthermore, we introduce a selection strategy based on the disagreement between strong and weak models to identify samples that are valid yet challenging. Experiments on the WarriorCoder dataset demonstrate that selecting just 25% of data using stratified RMI achieves comparable performance to full-data training, significantly outperforming existing data selection methods. Our approach highlights the importance of bidirectional semantic coherence in synthetic data curation, offering a scalable pathway to reduce computational costs without sacrificing model capability.

[46] Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections

Łukasz Borchmann,Jordy Van Landeghem,Michał Turski,Shreyansh Padarha,Ryan Othniel Kearns,Adam Mahdi,Niels Rogge,Clémentine Fourrier,Siwei Han,Huaxiu Yao,Artemis Llabrés,Yiming Xu,Dimosthenis Karatzas,Hao Zhang,Anupam Datta

Main category: cs.CL

TL;DR: 本文提出MADQA基准测试,用于评估多模态代理在复杂文档工作流中的战略推理能力,发现当前最佳代理虽能匹配人类搜索准确率,但依赖暴力搜索而非策略性规划。

Details Motivation: 探究多模态代理是否具备真正的战略推理能力,而非仅靠随机试错。 Method: 构建包含2250个人工编写问题、基于800份异构PDF文档的MADQA基准;依据经典测试理论设计以增强区分度;提出衡量准确率-努力权衡的新评估协议。 Result: 当前最优代理在准确率上可媲美人类搜索者,但解决的问题集不同,且依赖暴力搜索弥补策略规划薄弱;未能缩小近20%的oracle性能差距,常陷入无效循环。 Conclusion: 现有多模态代理尚未实现高效、校准的战略推理,需从暴力检索转向更智能的推理范式;作者开源数据集与评估工具以推动该方向发展。 Abstract: Multimodal agents offer a promising path to automating complex document-intensive workflows. Yet, a critical question remains: do these agents demonstrate genuine strategic reasoning, or merely stochastic trial-and-error search? To address this, we introduce MADQA, a benchmark of 2,250 human-authored questions grounded in 800 heterogeneous PDF documents. Guided by Classical Test Theory, we design it to maximize discriminative power across varying levels of agentic abilities. To evaluate agentic behaviour, we introduce a novel evaluation protocol measuring the accuracy-effort trade-off. Using this framework, we show that while the best agents can match human searchers in raw accuracy, they succeed on largely different questions and rely on brute-force search to compensate for weak strategic planning. They fail to close the nearly 20% gap to oracle performance, persisting in unproductive loops. We release the dataset and evaluation harness to help facilitate the transition from brute-force retrieval to calibrated, efficient reasoning.

[47] Long-Context Encoder Models for Polish Language Understanding

Sławomir Dadas,Rafał Poświata,Marek Kozłowski,Małgorzata Grębowiec,Michał Perełkiewicz,Paweł Klimiuk,Przemysław Boruta

Main category: cs.CL

TL;DR: 本文提出了一种支持8192长上下文的高质量波兰语编码器模型,通过两阶段训练(位置编码适配+全参数持续预训练)及知识蒸馏压缩变体,在25项任务(含KLEJ、FinBench及长文档理解任务)上显著优于现有波兰语和多语言模型,尤其在长文本任务中表现突出,同时保持短文本性能。

Details Motivation: 经典编码器(如BERT)上下文窗口短,难以处理长文档;而波兰语高质量长上下文编码器尚属空白。 Method: 采用两阶段训练:先进行位置编码适配,再进行全参数持续预训练;并基于知识蒸馏构建轻量压缩变体。 Result: 在25项任务(含KLEJ、FinBench及长文档理解任务)上平均性能最优,长上下文任务显著优于竞品,短文本性能相当。 Conclusion: 该工作填补了波兰语长上下文编码器的空白,验证了扩展上下文能力与保持判别性能可兼顾,为资源受限语言的高效编码器设计提供了新范式。 Abstract: While decoder-only Large Language Models (LLMs) have recently dominated the NLP landscape, encoder-only architectures remain a cost-effective and parameter-efficient standard for discriminative tasks. However, classic encoders like BERT are limited by a short context window, which is insufficient for processing long documents. In this paper, we address this limitation for the Polish by introducing a high-quality Polish model capable of processing sequences of up to 8192 tokens. The model was developed by employing a two-stage training procedure that involves positional embedding adaptation and full parameter continuous pre-training. Furthermore, we propose compressed model variants trained via knowledge distillation. The models were evaluated on 25 tasks, including the KLEJ benchmark, a newly introduced financial task suite (FinBench), and other classification and regression tasks, specifically those requiring long-document understanding. The results demonstrate that our model achieves the best average performance among Polish and multilingual models, significantly outperforming competitive solutions in long-context tasks while maintaining comparable quality on short texts.

[48] IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse

Yushi Bai,Qian Dong,Ting Jiang,Xin Lv,Zhengxiao Du,Aohan Zeng,Jie Tang,Juanzi Li

Main category: cs.CL

TL;DR: 本文提出IndexCache,通过在多层稀疏注意力中复用部分层的索引结果,显著减少DeepSeek Sparse Attention(DSA)中 indexer的计算开销,同时几乎不损失模型性能。

Details Motivation: DSA虽降低了注意力复杂度,但其indexer仍为O(L^2)且每层独立运行,而实际各层top-k索引高度相似,存在跨层冗余可挖掘。 Method: 将模型层划分为Full层(运行独立indexer)和Shared层(复用最近Full层的top-k索引);提出训练无关的贪婪搜索法(基于校准集loss最小化选层)与训练相关的多层蒸馏损失(使保留的indexer拟合所服务层的平均注意力分布)。 Result: 在30B DSA模型上移除75% indexer计算,语言质量几乎无损,prefill加速1.82×,decode加速1.48×;在GLM-5上也验证有效。 Conclusion: IndexCache是一种高效、通用且易于部署的优化方案,显著提升长上下文稀疏注意力推理效率,适用于大规模生产模型。 Abstract: Long-context agentic workflows have emerged as a defining use case for large language models, making attention efficiency critical for both inference speed and serving cost. Sparse attention addresses this challenge effectively, and DeepSeek Sparse Attention (DSA) is a representative production-grade solution: a lightweight lightning indexer selects the top-k most relevant tokens per query, reducing core attention from $O(L^2)$ to $O(Lk)$. However, the indexer itself retains $O(L^2)$ complexity and must run independently at every layer, despite the fact that the resulting top-k selections are highly similar across consecutive layers. We present IndexCache, which exploits this cross-layer redundancy by partitioning layers into a small set of Full layers that run their own indexers and a majority of Shared layers that simply reuse the nearest Full layer's top-k indices. We propose two complementary approaches to determine and optimize this configuration. Training-free IndexCache applies a greedy search algorithm that selects which layers to retain indexers by directly minimizing language modeling loss on a calibration set, requiring no weight updates. Training-aware IndexCache introduces a multi-layer distillation loss that trains each retained indexer against the averaged attention distributions of all layers it serves, enabling even simple interleaved patterns to match full-indexer accuracy. Experimental results on a 30B DSA model show that IndexCache can remove 75% of indexer computations with negligible quality degradation, achieving up to 1.82$\times$ prefill speedup and 1.48$\times$ decode speedup compared to standard DSA. These positive results are further confirmed by our preliminary experiments on the production-scale GLM-5 model (Figure 1).

[49] CLASP: Defending Hybrid Large Language Models Against Hidden State Poisoning Attacks

Alexandre Le Mercier,Thomas Demeester,Chris Develder

Main category: cs.CL

TL;DR: 本文提出CLASP模型,通过分析Mamba状态空间模型的块输出嵌入(BOEs)并结合XGBoost分类器,在token级别高效检测隐藏状态中毒攻击(HiSPA),在简历筛选等真实场景中实现了高F1分数和强泛化能力,且计算开销低,适合实际部署。

Details Motivation: 隐藏状态中毒攻击(HiSPA)严重威胁状态空间模型(如Mamba)及其混合架构的安全性,亟需轻量、高效、泛化性强的防御方法。 Method: 将HiSPA检测建模为token级二分类任务;提取Mamba模型块输出嵌入(BOEs)作为特征,采用XGBoost分类器识别恶意token;在真实简历筛选场景下评估,并进行多种交叉验证以检验泛化性。 Result: 在9.5M token的2483份简历数据集上,token级F1达95.9%,文档级F1达99.3%;leave-one-out交叉验证下文档级F1为96.9%,结构新颖触发器下仍达91.6%;推理速度1032 tokens/s,显存占用<4GB。 Conclusion: CLASP是一种与下游模型无关、低开销、高鲁棒性的HiSPA检测方案,可作为SSM及混合架构的实用前端防御机制。 Abstract: State space models (SSMs) like Mamba have gained significant traction as efficient alternatives to Transformers, achieving linear complexity while maintaining competitive performance. However, Hidden State Poisoning Attacks (HiSPAs), a recently discovered vulnerability that corrupts SSM memory through adversarial strings, pose a critical threat to these architectures and their hybrid variants. Framing the HiSPA mitigation task as a binary classification problem at the token level, we introduce the CLASP model to defend against this threat. CLASP exploits distinct patterns in Mamba's block output embeddings (BOEs) and uses an XGBoost classifier to identify malicious tokens with minimal computational overhead. We consider a realistic scenario in which both SSMs and HiSPAs are likely to be used: an LLM screening résumés to identify the best candidates for a role. Evaluated on a corpus of 2,483 résumés totaling 9.5M tokens with controlled injections, CLASP achieves 95.9% token-level F1 score and 99.3% document-level F1 score on malicious tokens detection. Crucially, the model generalizes to unseen attack patterns: under leave-one-out cross-validation, performance remains high (96.9% document-level F1), while under clustered cross-validation with structurally novel triggers, it maintains useful detection capability (91.6% average document-level F1). Operating independently of any downstream model, CLASP processes 1,032 tokens per second with under 4GB VRAM consumption, potentially making it suitable for real-world deployment as a lightweight front-line defense for SSM-based and hybrid architectures. All code and detailed results are available at https://anonymous.4open.science/r/hispikes-91C0.

[50] Sparking Scientific Creativity via LLM-Driven Interdisciplinary Inspiration

Priyanka Kargupta,Shuhaib Mehri,Dilek Hakkani-Tur,Jiawei Han

Main category: cs.CL

TL;DR: 本文提出Idea-Catalyst框架,旨在通过系统识别跨学科洞见来增强人类与大语言模型的创造性推理能力,提升科研突破潜力。

Details Motivation: 现有AI辅助科研方法多聚焦于快速实验设计与自动化发现,忽视了驱动创造性跨学科突破所需的探索性、协作性推理过程;同时,学术研究仍普遍受限于单一学科壁垒。 Method: Idea-Catalyst从抽象研究目标出发,在头脑风暴阶段避免过早锁定具体方案;其核心步骤包括:分解目标为领域内关键研究问题 → 将挑战转化为领域无关的概念性问题 → 跨学科检索类比解决方案(如从心理学、社会学等)→ 综合并重构洞见以评估各源领域的跨学科潜力。 Result: 实证表明,该方法使产出的新颖性提升21%、洞察力提升16%,且始终锚定原始研究问题,保持问题相关性。 Conclusion: Idea-Catalyst并非替代科学家,而是通过模拟元认知层面的跨学科推理机制,有效增强人与AI协同下的创造性科学发现能力,为打破学科壁垒提供了可扩展的方法论支持。 Abstract: Despite interdisciplinary research leading to larger and longer-term impact, most work remains confined to single-domain academic silos. Recent AI-based approaches to scientific discovery show promise for interdisciplinary research, but many prioritize rapidly designing experiments and solutions, bypassing the exploratory, collaborative reasoning processes that drive creative interdisciplinary breakthroughs. As a result, prior efforts largely prioritize automating scientific discovery rather than augmenting the reasoning processes that underlie scientific disruption. We present Idea-Catalyst, a novel framework that systematically identifies interdisciplinary insights to support creative reasoning in both humans and large language models. Starting from an abstract research goal, Idea-Catalyst is designed to assist the brainstorming stage, explicitly avoiding premature anchoring on specific solutions. The framework embodies key metacognitive features of interdisciplinary reasoning: (a) defining and assessing research goals, (b) awareness of a domain's opportunities and unresolved challenges, and (c) strategic exploration of interdisciplinary ideas based on impact potential. Concretely, Idea-Catalyst decomposes an abstract goal (e.g., improving human-AI collaboration) into core target-domain research questions that guide the analysis of progress and open challenges within that domain. These challenges are reformulated as domain-agnostic conceptual problems, enabling retrieval from external disciplines (e.g., Psychology, Sociology) that address analogous issues. By synthesizing and recontextualizing insights from these domains back into the target domain, Idea-Catalyst ranks source domains by their interdisciplinary potential. Empirically, this targeted integration improves average novelty by 21% and insightfulness by 16%, while remaining grounded in the original research problem.

[51] SciMDR: Benchmarking and Advancing Scientific Multimodal Document Reasoning

Ziyu Chen,Yilun Zhao,Chengye Wang,Rilyn Han,Manasi Patwardhan,Arman Cohan

Main category: cs.CL

TL;DR: 本文提出synthesis-and-reground框架,用于构建大规模、忠实且真实的科学多模态文档推理数据集SciMDR(300K QA对)及其评测基准SciMDR-Eval(专家标注),显著提升模型在复杂文档级科学问答任务上的性能。

Details Motivation: 构建科学多模态文档推理数据集面临规模、忠实性和真实性之间的固有权衡,现有方法难以兼顾三者。 Method: 提出两阶段的synthesize-and-reground框架:第一阶段为Claim-Centric QA Synthesis,生成聚焦于局部段落的忠实QA对及推理链;第二阶段为Document-Scale Regrounding,将这些QA对程序化地重嵌入完整文档中,以恢复真实文档复杂性。基于此构建SciMDR训练集和SciMDR-Eval评测集。 Result: SciMDR包含300K带显式推理链的QA对,覆盖20K科学论文;SciMDR-Eval为专家标注的全长度科学工作流评测基准;实验表明,基于SciMDR微调的模型在多个科学QA基准上显著提升,尤其在需复杂文档级推理的任务中。 Conclusion: synthesize-and-reground框架有效缓解了多模态科学文档数据集构建中的规模-忠实-真实三难困境,SciMDR及其评测基准为跨模态科学理解提供了高质量资源与可靠评估标准。 Abstract: Constructing scientific multimodal document reasoning datasets for foundation model training involves an inherent trade-off among scale, faithfulness, and realism. To address this challenge, we introduce the synthesize-and-reground framework, a two-stage pipeline comprising: (1) Claim-Centric QA Synthesis, which generates faithful, isolated QA pairs and reasoning on focused segments, and (2) Document-Scale Regrounding, which programmatically re-embeds these pairs into full-document tasks to ensure realistic complexity. Using this framework, we construct SciMDR, a large-scale training dataset for cross-modal comprehension, comprising 300K QA pairs with explicit reasoning chains across 20K scientific papers. We further construct SciMDR-Eval, an expert-annotated benchmark to evaluate multimodal comprehension within full-length scientific workflows. Experiments demonstrate that models fine-tuned on SciMDR achieve significant improvements across multiple scientific QA benchmarks, particularly in those tasks requiring complex document-level reasoning.

cs.CV [Back]

[52] RC-NF: Robot-Conditioned Normalizing Flow for Real-Time Anomaly Detection in Robotic Manipulation

Shijie Zhou,Bin Zhu,Jiarui Yang,Xiangyu Zhao,Jingjing Chen,Yu-Gang Jiang

Main category: cs.CV

TL;DR: 本文提出Robot-Conditioned Normalizing Flow(RC-NF),一种用于机器人异常检测与干预的实时监控模型,通过解耦任务感知的机器人与物体状态,在仅需正样本的无监督训练下实现高精度异常评分,并在仿真与真实场景中验证其有效性与低延迟响应能力。

Details Motivation: 现有基于模仿学习的视觉-语言-动作(VLA)模型在动态环境和分布外(OOD)条件下鲁棒性差,难以可靠运行。 Method: 提出Robot-Conditioned Normalizing Flow(RC-NF),在归一化流中解耦处理任务相关的机器人状态与物体运动轨迹;仅用正样本进行无监督训练;利用概率密度函数实时计算异常分数;并构建LIBERO-Anomaly-10仿真异常评测基准。 Result: RC-NF在LIBERO-Anomaly-10所有异常类型上达到SOTA性能;真实实验中作为即插即用模块(如集成于pi0)可提供<100ms延迟的OOD信号,支持状态级回滚或任务级重规划。 Conclusion: RC-NF显著提升了VLA驱动机器人系统在动态环境中的鲁棒性与适应性,为实时异常监控与干预提供了有效解决方案。 Abstract: Recent advances in Vision-Language-Action (VLA) models have enabled robots to execute increasingly complex tasks. However, VLA models trained through imitation learning struggle to operate reliably in dynamic environments and often fail under Out-of-Distribution (OOD) conditions. To address this issue, we propose Robot-Conditioned Normalizing Flow (RC-NF), a real-time monitoring model for robotic anomaly detection and intervention that ensures the robot's state and the object's motion trajectory align with the task. RC-NF decouples the processing of task-aware robot and object states within the normalizing flow. It requires only positive samples for unsupervised training and calculates accurate robotic anomaly scores during inference through the probability density function. We further present LIBERO-Anomaly-10, a benchmark comprising three categories of robotic anomalies for simulation evaluation. RC-NF achieves state-of-the-art performance across all anomaly types compared to previous methods in monitoring robotic tasks. Real-world experiments demonstrate that RC-NF operates as a plug-and-play module for VLA models (e.g., pi0), providing a real-time OOD signal that enables state-level rollback or task-level replanning when necessary, with a response latency under 100 ms. These results demonstrate that RC-NF noticeably enhances the robustness and adaptability of VLA-based robotic systems in dynamic environments.

[53] GGPT: Geometry Grounded Point Transformer

Yutong Chen,Yiming Wang,Xucong Zhang,Sergey Prokudin,Siyu Tang

Main category: cs.CV

TL;DR: 本文提出Geometry-Grounded Point Transformer (GGPT),通过引入基于稀疏几何引导的Transformer架构,将显式多视图几何约束融入前馈式稀疏视角3D重建中,在保持高效性的同时显著提升几何一致性与细节完整性。

Details Motivation: 现有前馈网络在稀疏视角3D重建中虽能直接预测稠密点云,但因缺乏显式多视角几何约束,常出现几何不一致和细节精度不足的问题。 Method: 提出两阶段方法:1)改进的Structure-from-Motion流程,利用稠密特征匹配与轻量几何优化获取准确相机位姿与稀疏点云;2)设计几何引导的3D点Transformer,以优化后的稀疏几何作为显式监督信号进行稠密点云精化。 Result: 在ScanNet++上仅用VGGT预测训练,GGPT在域内与跨域设置下均显著超越当前最优前馈式3D重建方法,重建结果兼具几何一致性、空间完整性,并能恢复无纹理区域的细部结构。 Conclusion: GGPT为融合几何先验与前馈预测提供了一种原理清晰、泛化性强的有效范式,推动了稀疏视角重建向高保真、高鲁棒方向发展。 Abstract: Recent feed-forward networks have achieved remarkable progress in sparse-view 3D reconstruction by predicting dense point maps directly from RGB images. However, they often suffer from geometric inconsistencies and limited fine-grained accuracy due to the absence of explicit multi-view constraints. We introduce the Geometry-Grounded Point Transformer (GGPT), a framework that augments feed-forward reconstruction with reliable sparse geometric guidance. We first propose an improved Structure-from-Motion pipeline based on dense feature matching and lightweight geometric optimisation to efficiently estimate accurate camera poses and partial 3D point clouds from sparse input views. Building on this foundation, we propose a geometry-guided 3D point transformer that refines dense point maps under explicit partial-geometry supervision using an optimised guidance encoding. Extensive experiments demonstrate that our method provides a principled mechanism for integrating geometric priors with dense feed-forward predictions, producing reconstructions that are both geometrically consistent and spatially complete, recovering fine structures and filling gaps in textureless areas. Trained solely on ScanNet++ with VGGT predictions, GGPT generalises across architectures and datasets, substantially outperforming state-of-the-art feed-forward 3D reconstruction models in both in-domain and out-of-domain settings.

[54] Evidential learning driven Breast Tumor Segmentation with Stage-divided Vision-Language Interaction

Jingxing Zhong,Qingtao Pan,Xuchang Zhou,Jiazhen Lin,Xinguo Zhuang

Main category: cs.CV

TL;DR: 本文提出了一种文本引导的乳腺肿瘤分割模型TextBCS,通过分阶段的视觉-语言交互和证据学习,提升低对比度和边界模糊场景下的肿瘤分割精度。

Details Motivation: 现有基于深度学习的乳腺肿瘤分割方法在低对比度和边界模糊情况下难以准确定位肿瘤轮廓,而文本提示信息有望改善分割效果。 Method: 提出TextBCS模型,包含分阶段视觉-语言交互机制(在下采样各阶段实现图文特征互信息)和证据学习(采用变分狄利克雷分布量化分割不确定性,尤其针对模糊边界)。 Result: 在公开数据集上实验表明,TextBCS优于其他分割网络,实现了当前最优的乳腺肿瘤分割性能。 Conclusion: 文本引导与不确定性建模相结合可有效提升乳腺MRI图像中肿瘤分割的鲁棒性和精度,为临床辅助诊断提供了新思路。 Abstract: Breast cancer is one of the most common causes of death among women worldwide, with millions of fatalities annually. Magnetic Resonance Imaging (MRI) can provide various sequences for characterizing tumor morphology and internal patterns, and becomes an effective tool for detection and diagnosis of breast tumors. However, previous deep-learning based tumor segmentation methods have limitations in accurately locating tumor contours due to the challenge of low contrast between cancer and normal areas and blurred boundaries. Leveraging text prompt information holds promise in ameliorating tumor segmentation effect by delineating segmentation regions. Inspired by this, we propose text-guided Breast Tumor Segmentation model (TextBCS) with stage-divided vision-language interaction and evidential learning. Specifically, the proposed stage-divided vision-language interaction facilitates information mutual between visual and text features at each stage of down-sampling, further exerting the advantages of text prompts to assist in locating lesion areas in low contrast scenarios. Moreover, the evidential learning is adopted to quantify the segmentation uncertainty of the model for blurred boundary. It utilizes the variational Dirichlet to characterize the distribution of the segmentation probabilities, addressing the segmentation uncertainties of the boundaries. Extensive experiments validate the superiority of our TextBCS over other segmentation networks, showcasing the best breast tumor segmentation performance on publicly available datasets.

[55] A Simple Efficiency Incremental Learning Framework via Vision-Language Model with Nonlinear Multi-Adapters

Haihua Luo,Xuming Ran,Jiangrong Shen,Timo Hämäläinen,Zhonghua Chen,Qi Xu,Fengyu Cong

Main category: cs.CV

TL;DR: 本文提出了一种简单高效的增量学习框架SimE,利用带适配器的视觉-语言模型(如CLIP),揭示了适配器连接数与增量学习能力之间存在非线性关系,并在TinyImageNet和CIFAR-100上显著超越现有方法。

Details Motivation: 现有基于预训练视觉-语言模型的增量学习方法存在训练效率低、依赖记忆库、需强骨干网络三大问题。 Method: 提出SimE框架,采用带定制适配器的视觉-语言模型;发现并利用适配器连接方式(跨块 vs 块内)对增量学习性能的非线性影响;系统探索提升CLIP零样本能力的策略,如替换为更大规模数据(LAION2B)和更强架构(ViT-L/14)训练的CLIP编码器。 Result: SimE在TinyImageNet上比传统方法高9.6%,在CIFAR-100上比其他CLIP-based方法高5.3%;验证了跨transformer块添加适配器有益,而小步长下块内增加适配器连接反而损害性能。 Conclusion: 适配器结构设计对视觉-语言模型用于增量学习至关重要;合理利用大规模预训练CLIP的零样本能力可显著提升增量学习性能,无需记忆库或复杂训练机制。 Abstract: Incremental Learning (IL) aims to learn new tasks while preserving previously acquired knowledge. Integrating the zero-shot learning capabilities of pre-trained vision-language models into IL methods has marked a significant advancement. However, these methods face three primary challenges: (1) the need for improved training efficiency; (2) reliance on a memory bank to store previous data; and (3) the necessity of a strong backbone to augment the model's capabilities. In this paper, we propose SimE, a Simple and Efficient framework that employs a vision-language model with adapters designed specifically for the IL task. We report a remarkable phenomenon: there is a nonlinear correlation between the number of adaptive adapter connections and the model's IL capabilities. While increasing adapter connections between transformer blocks improves model performance, adding more adaptive connections within transformer blocks during smaller incremental steps does not enhance, and may even degrade the model's IL ability. Extensive experimental results show that SimE surpasses traditional methods by 9.6% on TinyImageNet and outperforms other CLIP-based methods by 5.3% on CIFAR-100. Furthermore, we conduct a systematic study to enhance the utilization of the zero-shot capabilities of CLIP. We suggest replacing SimE's encoder with a CLIP model trained on larger datasets (e.g., LAION2B) and stronger architectures (e.g., ViT-L/14).

[56] Senna-2: Aligning VLM and End-to-End Driving Policy for Consistent Decision Making and Planning

Yuehao Song,Shaoyu Chen,Hao Gao,Yifan Zhu,Weixiang Yue,Jialv Zou,Bo Jiang,Zihao Lu,Yu Wang,Qian Zhang,Xinggang Wang

Main category: cs.CV

TL;DR: 本文提出Senna-2,一种通过三阶段一致性训练范式显式对齐视觉语言模型(VLM)高层决策与端到端(E2E)低层规划的新型驾驶策略,显著提升双系统一致性与驾驶安全性。

Details Motivation: 现有VLM-E2E驾驶策略忽视VLM高层决策与E2E低层规划之间的双系统一致性,导致轨迹与决策不匹配,削弱自上而下的指导与决策遵循能力。 Method: 提出Senna-2,采用一致性导向的三阶段训练:1)驾驶预训练,通过决策适配器将VLM决策以隐式嵌入形式传递给E2E策略;2)开环设置下对齐VLM与E2E策略;3)在3DGS环境中通过自底向上的分层强化学习进行闭环对齐,增强安全与效率。 Result: 实验表明,Senna-2在双系统一致性上F1分数提升19.3%,开环设置下最终位移误差(FDE)降低5.7%,闭环设置下平均碰撞率(AF-CR)降低30.6%。 Conclusion: Senna-2通过显式对齐VLM与E2E策略,有效解决了双系统不一致问题,在保持语义推理能力的同时显著提升了驾驶策略的安全性与可靠性。 Abstract: Vision-language models (VLMs) enhance the planning capability of end-to-end (E2E) driving policy by leveraging high-level semantic reasoning. However, existing approaches often overlook the dual-system consistency between VLM's high-level decision and E2E's low-level planning. As a result, the generated trajectories may misalign with the intended driving decisions, leading to weakened top-down guidance and decision-following ability of the system. To address this issue, we propose Senna-2, an advanced VLM-E2E driving policy that explicitly aligns the two systems for consistent decision-making and planning. Our method follows a consistency-oriented three-stage training paradigm. In the first stage, we conduct driving pre-training to achieve preliminary decision-making and planning, with a decision adapter transmitting VLM decisions to E2E policy in the form of implicit embeddings. In the second stage, we align the VLM and the E2E policy in an open-loop setting. In the third stage, we perform closed-loop alignment via bottom-up Hierarchical Reinforcement Learning in 3DGS environments to reinforce the safety and efficiency. Extensive experiments demonstrate that Senna-2 achieves superior dual-system consistency (19.3% F1 score improvement) and significantly enhances driving safety in both open-loop (5.7% FDE reduction) and closed-loop settings (30.6% AF-CR reduction).

[57] Frequency-Modulated Visual Restoration for Matryoshka Large Multimodal Models

Qingtao Pan,Zhihao Dou,Shuo Li

Main category: cs.CV

TL;DR: 本文提出FMVR,一种即插即用、极简的频率调制视觉恢复策略,通过分离和调制视觉表征的高低频分量,在减少视觉token数量的同时保留并恢复视觉语义,显著降低计算开销而不牺牲性能。

Details Motivation: 大型多模态模型(LMMs)因视觉token过多而难以适应不同计算预算;现有token压缩方法会不可避免地损失视觉语义。 Method: FMVR将少量视觉token的表征通过AvgPool和MaxPool解耦为低频与高频分量,并用轻量可学习参数进行调制:AvgPool高频作为显著性滤波器增强显著语义,MaxPool低频作为反显著性滤波器强化弱语义;进一步结合Matryoshka表示学习实现推理时弹性调整token数量。 Result: 在10个图像和4个视频基准上,FMVR-LLaVA将LLaVA-1.5-7B的FLOPs降低89%,同时保持近100%原始精度。 Conclusion: FMVR是一种高效、通用且易于集成的视觉token压缩与语义恢复方法,显著提升LMMs在不同计算预算下的推理效率与鲁棒性。 Abstract: Large Multimodal Models (LMMs) struggle to adapt varying computational budgets due to numerous visual tokens. Previous methods attempted to reduce the number of visual tokens before or within LLMs. However, these strategies inevitably result in the loss of visual semantic. To address these issues, we introduce FMVR, a plug-and-play and extremely simple Frequency-Modulated Visual Restoration strategy to boost the reasoning ability of LMMs under visual token reduction. Specifically, FMVR disentangles the visual representation of fewer visual tokens into low- and high-frequency components through AvgPool and MaxPool. The derived frequencies are subsequently modulated using lightweight learnable parameters. The high-frequency from AvgPool acts as a saliency filter to enhance saliency visual semantics, while the low-frequency from MaxPool acts as an anti-saliency filter to strengthen weak visual semantics. It enables the preservation of visual semantics dominated by few visual tokens and the restoration of diluted visual semantics. Additionally, we inject FMVR into Matryoshka Representation Learning to learn coarse-to-fine visual token sets, thus enabling to elastically adjust the number of visual tokens during inference while maintaining comparable performance. Experiments across 10 image-based and 4 video-based bench marks demonstrate that FMVR-LLaVA reduce the FLOPs of LLaVA-1.5-7B by 89%, while maintaining almost 100% of the original accuracy. The code will be open.

[58] When Slots Compete: Slot Merging in Object-Centric Learning

Christos Chatzisavvas,Panagiotis Rigas,George Ioannakis,Vassilis Katsouros,Nikolaos Mitianoudis

Main category: cs.CV

TL;DR: 本文提出了一种名为'slot merging'的轻量级操作,用于在基于槽(slot)的对象中心学习中合并重叠的槽,从而提升对象分解和分割质量。

Details Motivation: 现有基于槽的方法通常固定槽的数量,导致多个槽竞争同一物体的重叠区域,而非聚焦于不同区域。 Method: 通过Soft-IoU度量槽注意力图之间的重叠程度,并采用重心更新方式合并选定槽对;合并策略固定,阈值由重叠统计推断得出,无需额外可学习模块。 Result: 在DINOSAUR特征重建框架中集成该方法后,在对象发现与分割基准上优于其他自适应方法,提升了对象分解能力和掩码质量。 Conclusion: Slot merging是一种即插即用、无需额外参数的改进机制,有效缓解了槽间冗余竞争问题,增强了对象中心表征能力。 Abstract: Slot-based object-centric learning represents an image as a set of latent slots with a decoder that combines them into an image or features. The decoder specifies how slots are combined into an output, but the slot set is typically fixed: the number of slots is chosen upfront and slots are only refined. This can lead to multiple slots competing for overlapping regions of the same entity rather than focusing on distinct regions. We introduce slot merging: a drop-in, lightweight operation on the slot set that merges overlapping slots during training. We quantify overlap with a Soft-IoU score between slot-attention maps and combine selected pairs via a barycentric update that preserves gradient flow. Merging follows a fixed policy, with the decision threshold inferred from overlap statistics, requiring no additional learnable modules. Integrated into the established feature-reconstruction pipeline of DINOSAUR, the proposed method improves object factorization and mask quality, surpassing other adaptive methods in object discovery and segmentation benchmarks.

[59] Radiometric fingerprinting of object surfaces using mobile laser scanning and semantic 3D road space models

Benedikt Schwab,Thomas H. Kolbe

Main category: cs.CV

TL;DR: 本文提出了一种利用多源移动激光雷达(LiDAR)数据提取语义3D城市模型中物体表面辐射指纹的方法,以推断其材料特性,并在A2D2数据集上验证了该方法对6368个LOD3级城市对象的有效性。

Details Motivation: 现有语义3D城市模型缺乏材料信息,而材料及其物理属性对城市数字孪生的应用拓展至关重要;同时,重复的移动LiDAR扫描蕴含大量受表面材料影响的辐射观测数据,亟待有效利用。 Method: 提出基于距离、入射角、环境条件、传感器类型及扫描批次等多维变量对LiDAR反射观测进行聚类,构建语义对象表面的辐射指纹;并实现A2D2数据中3.12亿条激光束与CityGML 3.0标准构建的LOD3级语义城市模型(含4条街道、6368个对象)的自动关联。 Result: 成功提取了大量对象表面的辐射指纹,发现同类对象内部存在可复现的辐射模式,表明其具有类主导材料特性;开源了语义模型、方法代码及新型地理数据库系统3DSensorDB。 Conclusion: 辐射指纹可作为连接LiDAR辐射测量与城市模型材料语义的桥梁,为城市数字孪生提供可扩展的材料感知能力,推动三维城市建模向物理可解释方向发展。 Abstract: Although semantic 3D city models are internationally available and becoming increasingly detailed, the incorporation of material information remains largely untapped. However, a structured representation of materials and their physical properties could substantially broaden the application spectrum and analytical capabilities for urban digital twins. At the same time, the growing number of repeated mobile laser scans of cities and their street spaces yields a wealth of observations influenced by the material characteristics of the corresponding surfaces. To leverage this information, we propose radiometric fingerprints of object surfaces by grouping LiDAR observations reflected from the same semantic object under varying distances, incident angles, environmental conditions, sensors, and scanning campaigns. Our study demonstrates how 312.4 million individual beams acquired across four campaigns using five LiDAR sensors on the Audi Autonomous Driving Dataset (A2D2) vehicle can be automatically associated with 6368 individual objects of the semantic 3D city model. The model comprises a comprehensive and semantic representation of four inner-city streets at Level of Detail (LOD) 3 with centimeter-level accuracy. It is based on the CityGML 3.0 standard and enables fine-grained sub-differentiation of objects. The extracted radiometric fingerprints for object surfaces reveal recurring intra-class patterns that indicate class-dominant materials. The semantic model, the method implementations, and the developed geodatabase solution 3DSensorDB are released under: https://github.com/tum-gis/sensordb

[60] Towards Automated Initial Probe Placement in Transthoracic Teleultrasound Using Human Mesh and Skeleton Recovery

Yu Chung Lee,David G. Black,Ryan S. Yeung,Septimiu E. Salcudean

Main category: cs.CV

TL;DR: 本文提出了一种基于RGB图像的自动化患者注册与解剖信息引导的初始探头放置(PIPG)框架,用于辅助新手或机器人在远程超声中准确定位探头。

Details Motivation: 心脏和肺部超声操作技术要求高,尤其在远程超声中,新手或机器人缺乏现场专家指导,难以准确定位肋间声窗并完成标准切面导航。 Method: 利用混合现实(MR)头戴设备采集患者RGB图像,边缘服务器重建患者体表与骨骼模型,并通过预测的骨性标志定位肋间区域,将探头引导姿态投影回体表;最终在真实超声扫描中叠加虚拟引导进行验证。 Result: 在健康志愿者上的试点实验表明,该方法可实现解剖学上可接受范围内的稳定初始探头放置,误差满足远程超声设置需求。 Conclusion: 该PIPG框架仅依赖RGB图像即可提供解剖信息驱动的探头初始定位指导,有望提升远程超声的操作可行性与标准化水平。 Abstract: Cardiac and lung ultrasound are technically demanding because operators must identify patient-specific intercostal acoustic windows and then navigate between standard views by adjusting probe position, rotation, and force across different imaging planes. These challenges are amplified in teleultrasound when a novice or robot faces the difficult task of first placing the probe on the patient without in-person expert assistance. We present a framework for automating Patient registration and anatomy-informed Initial Probe placement Guidance (PIPG) using only RGB images from a calibrated camera. The novice first captures the patient using the camera on a mixed reality (MR) head-mounted display (HMD). An edge server then infers a patient-specific body-surface and skeleton model, with spatial smoothing across multiple views. Using bony landmarks from the predicted skeleton, we estimate the intercostal region and project the guidance back onto the reconstructed body surface. To validate the framework, we overlaid the reconstructed body mesh and the virtual probe pose guidance across multiple transthoracic echocardiography scan planes in situ and measured the quantitative placement error. Pilot experiments with healthy volunteers suggest that the proposed probe placement prediction and MR guidance yield consistent initial placement within anatomical variability acceptable for teleultrasound setup

[61] InstantHDR: Single-forward Gaussian Splatting for High Dynamic Range 3D Reconstruction

Dingqiang Ye,Jiacong Xu,Jianglu Ping,Yuxiang Guo,Chao Fan,Vishal M. Patel

Main category: cs.CV

TL;DR: InstantHDR是一种前馈神经网络,能从未经校准的多曝光LDR图像集合中单次前向传播重建3D HDR场景,无需已知相机位姿或密集点云初始化,显著提升速度并保持高质量合成效果。

Details Motivation: 现有HDR新视角合成方法依赖已知相机位姿、稠密点云初始化和耗时的逐场景优化;而现有前馈方法忽略HDR特性,假设外观与曝光无关,亟需兼顾效率与HDR建模能力的新方法。 Method: 提出InstantHDR:1)几何引导的多曝光外观建模用于融合;2)元网络实现可泛化的场景特定色调映射;3)构建含168个Blender渲染场景的HDR-Pretrain预训练数据集,覆盖多样光照与相机响应函数。 Result: InstantHDR在合成质量上媲美当前最优基于优化的HDR方法,在单次前向推理下提速约700倍,加入后优化后提速约20倍。 Conclusion: InstantHDR首次实现了高效、通用、前馈式的HDR新视角合成,解决了对位姿与初始化的依赖问题,并通过新数据集和网络设计推动了HDR NVS的发展。 Abstract: High dynamic range (HDR) novel view synthesis (NVS) aims to reconstruct HDR scenes from multi-exposure low dynamic range (LDR) images. Existing HDR pipelines heavily rely on known camera poses, well-initialized dense point clouds, and time-consuming per-scene optimization. Current feed-forward alternatives overlook the HDR problem by assuming exposure-invariant appearance. To bridge this gap, we propose InstantHDR, a feed-forward network that reconstructs 3D HDR scenes from uncalibrated multi-exposure LDR collections in a single forward pass. Specifically, we design a geometry-guided appearance modeling for multi-exposure fusion, and a meta-network for generalizable scene-specific tone mapping. Due to the lack of HDR scene data, we build a pre-training dataset, called HDR-Pretrain, for generalizable feed-forward HDR models, featuring 168 Blender-rendered scenes, diverse lighting types, and multiple camera response functions. Comprehensive experiments show that our InstantHDR delivers comparable synthesis performance to the state-of-the-art optimization-based HDR methods while enjoying $\sim700\times$ and $\sim20\times$ reconstruction speed improvement with our single-forward and post-optimization settings. All code, models, and datasets will be released after the review process.

[62] Hierarchical Granularity Alignment and State Space Modeling for Robust Multimodal AU Detection in the Wild

Jun Yu,Yunxiang Zhang,Naixiang Zheng,Lingsi Zhu,Guoyuan Wang

Main category: cs.CV

TL;DR: 本文提出了一种基于分层粒度对齐与状态空间模型的新型多模态框架,利用DINOv2和WavLM提取高质量视听特征,并通过动态对齐、Vision-Mamba建模长时序依赖及非对称交叉注意力实现细粒度音视频协同,显著提升野外环境下面部动作单元(AU)检测性能。

Details Motivation: 野外环境中AU检测面临空间-时间异质性强、姿态不受控、音视频依赖复杂等挑战;现有方法受限于编码器容量和浅层融合机制,难以建模细粒度语义变化和超长时序上下文。 Method: 采用DINOv2和WavLM作为基础模型提取视听表征;设计分层粒度对齐模块动态融合全局语义与局部活跃区域;引入Vision-Mamba替代传统TCN以线性复杂度建模超长时序;提出非对称交叉注意力机制深度同步副语言音频线索与细微视觉运动。 Result: 在Aff-Wild2数据集上显著超越现有方法,达到SOTA性能,并获第十届Affective Behavior Analysis in-the-wild竞赛AU检测赛道第一名。 Conclusion: 所提多模态框架通过强表征能力、动态粒度对齐与高效长时序建模,有效应对野外AU检测的核心难点,为复杂真实场景下的情感行为分析提供了新范式。 Abstract: Facial Action Unit (AU) detection in in-the-wild environments remains a formidable challenge due to severe spatial-temporal heterogeneity, unconstrained poses, and complex audio-visual dependencies. While recent multimodal approaches have made progress, they often rely on capacity-limited encoders and shallow fusion mechanisms that fail to capture fine-grained semantic shifts and ultra-long temporal contexts. To bridge this gap, we propose a novel multimodal framework driven by Hierarchical Granularity Alignment and State Space Models.Specifically, we leverage powerful foundation models, namely DINOv2 and WavLM, to extract robust and high-fidelity visual and audio representations, effectively replacing traditional feature extractors. To handle extreme facial variations, our Hierarchical Granularity Alignment module dynamically aligns global facial semantics with fine-grained local active patches. Furthermore, we overcome the receptive field limitations of conventional temporal convolutional networks by introducing a Vision-Mamba architecture. This approach enables temporal modeling with O(N) linear complexity, effectively capturing ultra-long-range dynamics without performance degradation. A novel asymmetric cross-attention mechanism is also introduced to deeply synchronize paralinguistic audio cues with subtle visual movements.Extensive experiments on the challenging Aff-Wild2 dataset demonstrate that our approach significantly outperforms existing baselines, achieving state-of-the-art performance. Notably, this framework secured top rankings in the AU Detection track of the 10th Affective Behavior Analysis in-the-wild Competition.

[63] UniCompress: Token Compression for Unified Vision-Language Understanding and Generation

Ziyao Wang,Chen Chen,Jingtao Li,Weiming Zhuang,Jiabo Huang,Ang Li,Lingjuan Lyu

Main category: cs.CV

TL;DR: 本文提出了一种名为UniCompress的统一视觉令牌压缩算法,通过可学习的全局元令牌引导的压缩与解压机制,在大幅减少视觉令牌数量的同时,保持图像理解和生成任务的性能,提升了推理速度和训练效率,适用于资源受限场景。

Details Motivation: 现有统一多模态模型因需大量视觉令牌而导致计算和内存开销大,难以部署于具身AI等资源受限场景。 Method: 提出UniCompress算法,引入可学习的全局元令牌作为指导,设计轻量、模块化的插件式令牌压缩与解压机制,无需全模型重训练即可集成到现有统一模型中。 Result: 视觉令牌最多减少4倍,显著降低推理延迟和训练成本,仅带来轻微性能下降。 Conclusion: UniCompress验证了高效令牌压缩对统一多模态建模的可行性与实用性,为现实世界多模态应用提供了新路径。 Abstract: Unified models aim to support both understanding and generation by encoding images into discrete tokens and processing them alongside text within a single autoregressive framework. This unified design offers architectural simplicity and cross-modal synergy, which facilitates shared parameterization, consistent training objectives, and seamless transfer between modalities. However, the large number of visual tokens required by such models introduces substantial computation and memory overhead, and this inefficiency directly hinders deployment in resource constrained scenarios such as embodied AI systems. In this work, we propose a unified token compression algorithm UniCompress that significantly reduces visual token count while preserving performance on both image understanding and generation tasks. Our method introduces a plug-in compression and decompression mechanism guided with learnable global meta tokens. The framework is lightweight and modular, enabling efficient integration into existing models without full retraining. Experimental results show that our approach reduces image tokens by up to 4 times, achieves substantial gains in inference latency and training cost, and incurs only minimal performance degradation, which demonstrates the promise of token-efficient unified modeling for real world multimodal applications.

[64] UNet-AF: An alias-free UNet for image restoration

Jérémy Scanvic,Quentin Barthélemy,Julián Tachella

Main category: cs.CV

TL;DR: 本文提出了一种无混叠(alias-free)的UNet架构,通过选用具有平移等变性的先进层来提升模型对平移变换的等变性,并在图像恢复任务中验证了其有效性与更强的等变性。

Details Motivation: 传统UNet虽被假设为平移等变,但由于使用易产生混叠的层,实际等变性受限。 Method: 精心选取并组合当前先进的平移等变层,构建无混叠UNet;通过消融实验验证各组件作用。 Result: 在图像恢复任务上达到与非等变基线相当的性能,同时显著提升实测等变性。 Conclusion: 无混叠设计能有效增强UNet的平移等变性,且各改进模块均对等变性提升至关重要。 Abstract: The simplicity and effectiveness of the UNet architecture makes it ubiquitous in image restoration, image segmentation, and diffusion models. They are often assumed to be equivariant to translations, yet they traditionally consist of layers that are known to be prone to aliasing, which hinders their equivariance in practice. To overcome this limitation, we propose a new alias-free UNet designed from a careful selection of state-of-the-art translation-equivariant layers. We evaluate the proposed equivariant architecture against non-equivariant baselines on image restoration tasks and observe competitive performance with a significant increase in measured equivariance. Through extensive ablation studies, we also demonstrate that each change is crucial for its empirical equivariance. Our implementation is available at https://github.com/jscanvic/UNet-AF

[65] Towards Trustworthy Selective Generation: Reliability-Guided Diffusion for Ultra-Low-Field to High-Field MRI Synthesis

Zhenxuan Zhang,Peiyuan Jing,Ruicheng Yuan,Liwei Hu,Anbang Wang,Fanwen Wang,Yinzhe Wu,Kh Tohidul Islam,Zhaolin Chen,Zi Wang,Peter Lally,Guang Yang

Main category: cs.CV

TL;DR: 本文提出了一种可靠性感知的扩散模型ReDiff,用于低场到高场MRI图像合成,通过可靠性引导采样和不确定性感知的多候选选择策略,提升结构保真度并减少解剖不一致伪影。

Details Motivation: 现有扩散模型在低场到高场MRI合成中难以兼顾细节恢复与结构保真,易在结构模糊区域生成解剖不一致伪影(如虚假边缘、纹理异常),影响下游定量分析和临床信任。 Method: 提出ReDiff框架:1)可靠性引导的采样策略,在去噪过程中抑制不可靠响应;2)不确定性感知的多候选选择机制,提升最终预测的可靠性。 Result: 在多中心MRI数据集上实验表明,相比SOTA方法,ReDiff显著提升了结构保真度,减少了伪影。 Conclusion: ReDiff通过建模和利用生成过程中的可靠性与不确定性,实现了更空间可靠、解剖一致的MRI合成,为临床应用提供了更可信的生成结果。 Abstract: Low-field to high-field MRI synthesis has emerged as a cost-effective strategy to enhance image quality under hardware and acquisition constraints, particularly in scenarios where access to high-field scanners is limited or impractical. Despite recent progress in diffusion models, diffusion-based approaches often struggle to balance fine-detail recovery and structural fidelity. In particular, the uncontrolled generation of high-resolution details in structurally ambiguous regions may introduce anatomically inconsistent patterns, such as spurious edges or artificial texture variations. These artifacts can bias downstream quantitative analysis. For example, they may cause inaccurate tissue boundary delineation or erroneous volumetric estimation, ultimately reducing clinical trust in synthesized images. These limitations highlight the need for generative models that are not only visually accurate but also spatially reliable and anatomically consistent. To address this issue, we propose a reliability-aware diffusion framework (ReDiff) that improves synthesis robustness at both the sampling and post-generation stages. Specifically, we introduce a reliability-guided sampling strategy to suppress unreliable responses during the denoising process. We further develop an uncertainty-aware multi-candidate selection scheme to enhance the reliability of the final prediction. Experiments on multi-center MRI datasets demonstrate improved structural fidelity and reduced artifacts compared with state-of-the-art methods.

[66] Learning to Assist: Physics-Grounded Human-Human Control via Multi-Agent Reinforcement Learning

Yuto Shibata,Kashu Yamazaki,Lalit Jayanti,Yoshimitsu Aoki,Mariko Isogawa,Katerina Fragkiadaki

Main category: cs.CV

TL;DR: 本文提出AssistMimic,一种基于多智能体强化学习的方法,用于模仿人与人之间紧密互动、力交换的辅助动作,首次在标准基准上成功实现对辅助交互动作的跟踪。

Details Motivation: 现有通用运动跟踪方法难以应对辅助场景中需持续感知人类伙伴姿态与动态并快速适应的需求。 Method: 将辅助交互动作模仿建模为多智能体强化学习问题,在物理仿真器中联合训练支持者(助手)与接受者两个智能体的伙伴感知策略;引入基于单人运动控制器先验的策略初始化方案、动态参考重定向机制及促进接触的奖励函数。 Result: AssistMimic成为首个在标准基准上成功跟踪辅助交互动作的方法,验证了多智能体RL在具身化、社会感知人形控制中的有效性。 Conclusion: 多智能体强化学习框架结合伙伴感知策略设计与物理引导的奖励机制,可有效解决人机协作中动态、力耦合的运动模仿难题。 Abstract: Humanoid robotics has strong potential to transform daily service and caregiving applications. Although recent advances in general motion tracking within physics engines (GMT) have enabled virtual characters and humanoid robots to reproduce a broad range of human motions, these behaviors are primarily limited to contact-less social interactions or isolated movements. Assistive scenarios, by contrast, require continuous awareness of a human partner and rapid adaptation to their evolving posture and dynamics. In this paper, we formulate the imitation of closely interacting, force-exchanging human-human motion sequences as a multi-agent reinforcement learning problem. We jointly train partner-aware policies for both the supporter (assistant) agent and the recipient agent in a physics simulator to track assistive motion references. To make this problem tractable, we introduce a partner policies initialization scheme that transfers priors from single-human motion-tracking controllers, greatly improving exploration. We further propose dynamic reference retargeting and contact-promoting reward, which adapt the assistant's reference motion to the recipient's real-time pose and encourage physically meaningful support. We show that AssistMimic is the first method capable of successfully tracking assistive interaction motions on established benchmarks, demonstrating the benefits of a multi-agent RL formulation for physically grounded and socially aware humanoid control.

[67] DriveXQA: Cross-modal Visual Question Answering for Adverse Driving Scene Understanding

Mingzhe Tao,Ruiping Liu,Junwei Zheng,Yufan Chen,Kedi Ying,M. Saquib Sarfraz,Kailun Yang,Jiaming Zhang,Rainer Stiefelhagen

Main category: cs.CV

TL;DR: 本文提出DriveXQA多模态数据集和MVX-LLM模型,用于提升自动驾驶中异常驾驶场景的理解能力,通过双交叉注意力机制融合多种视觉模态,在恶劣天气等挑战性条件下显著提升性能。

Details Motivation: 现有MLLMs未充分探索利用多传感器信息理解自动驾驶中的异常驾驶场景,存在研究空白。 Method: 构建包含102,505个问答对的DriveXQA多模态数据集,并设计MVX-LLM模型,采用双交叉注意力(DCA)投影器实现多视觉模态融合。 Result: DCA在雾天等挑战性条件下显著提升性能(GPTScore:53.5 vs. 基线25.1)。 Conclusion: DriveXQA数据集与MVX-LLM模型为多模态自动驾驶理解提供了新基准和有效方法,代码与数据将开源。 Abstract: Fusing sensors with complementary modalities is crucial for maintaining a stable and comprehensive understanding of abnormal driving scenes. However, Multimodal Large Language Models (MLLMs) are underexplored for leveraging multi-sensor information to understand adverse driving scenarios in autonomous vehicles. To address this gap, we propose the DriveXQA, a multimodal dataset for autonomous driving VQA. In addition to four visual modalities, five sensor failure cases, and five weather conditions, it includes $102,505$ QA pairs categorized into three types: global scene level, allocentric level, and ego-vehicle centric level. Since no existing MLLM framework adopts multiple complementary visual modalities as input, we design MVX-LLM, a token-efficient architecture with a Dual Cross-Attention (DCA) projector that fuses the modalities to alleviate information redundancy. Experiments demonstrate that our DCA achieves improved performance under challenging conditions such as foggy (GPTScore: $53.5$ vs. $25.1$ for the baseline). The established dataset and source code will be made publicly available.

[68] High-Precision 6DOF Pose Estimation via Global Phase Retrieval in Fringe Projection Profilometry for 3D Mapping

Sehoon Tak,Keunhee Cho,Sangpil Kim,Jae-Sang Hyun

Main category: cs.CV

TL;DR: 本文提出一种高精度姿态估计方法,通过在移动DFP系统上增加一个固定且内参标定的全局投影仪,结合相位像素约束与PnP式重投影目标,实现无需特征提取的姿态估计,并验证了其在子采样下的不变性与亚毫米级精度。

Details Motivation: 传统ICP配准在大规模DFP点云中效率低、依赖降采样或特征提取,导致细节丢失和姿态精度下降;漂移校正方法无法解决密集DFP点云对采样的敏感性。 Method: 在移动DFP系统中引入固定、内参已知的全局投影仪,利用其相位解算的像素约束与PnP风格的重投影优化目标,在固定参考系中估计DFP系统姿态,不依赖确定性特征提取,并验证坐标保持型子采样的鲁棒性。 Result: 实验表明该方法达到亚毫米级姿态精度(含量化不确定度边界),在强子采样下具有高重复性,对均质表面和低重叠视角鲁棒,并能有效降低ICP轨迹的误差累积。 Conclusion: 该方法推动DFP向准静态场景(如检测与计量)中的高精度三维建图拓展,代价是需时分复用额外投影仪测量。 Abstract: Digital fringe projection (DFP) enables micrometer-level 3D reconstruction, yet extending it to large-scale mapping remains challenging because six-degree-of-freedom pose estimation often cannot match the reconstruction's precision. Conventional iterative closest point (ICP) registration becomes inefficient on multi-million-point clouds and typically relies on downsampling or feature-based selection, which can reduce local detail and degrade pose precision. Drift-correction methods improve long-term consistency but do not resolve sampling sensitivity in dense DFP point clouds.We propose a high-precision pose estimation method that augments a moving DFP system with a fixed, intrinsically calibrated global projector. Using the global projector's phase-derived pixel constraints and a PnP-style reprojection objective, the method estimates the DFP system pose in a fixed reference frame without relying on deterministic feature extraction, and we experimentally demonstrate sampling invariance under coordinate-preserving subsampling. Experiments demonstrate sub-millimeter pose accuracy against a reference with quantified uncertainty bounds, high repeatability under aggressive subsampling, robust operation on homogeneous surfaces and low-overlap views, and reduced error accumulation when used to correct ICP-based trajectories. The method extends DFP toward accurate 3D mapping in quasi-static scenarios such as inspection and metrology, with the trade-off of time-multiplexed acquisition for the additional projector measurements.

[69] DeepHistoViT: An Interpretable Vision Transformer Framework for Histopathological Cancer Classification

Ravi Mosalpuri,Mohammed Abdelsamea,Ahmed Karam Eldaly

Main category: cs.CV

TL;DR: 本文提出DeepHistoViT,一种基于Vision Transformer的可解释框架,用于组织病理图像自动分类,在多个癌症数据集上达到接近100%的性能指标。

Details Motivation: 手动组织病理学检查耗时、劳动密集且存在观察者间差异,亟需可靠的计算机辅助诊断工具。 Method: 提出DeepHistoViT,一种定制化Vision Transformer架构,集成注意力机制以捕获细粒度细胞结构,并通过注意力定位实现可解释性。 Result: 在肺癌、结肠癌和急性淋巴细胞白血病三个公开数据集上均达到SOTA性能:肺癌和结肠癌数据集各项指标达100%,ALL数据集指标均超99.8%(含95%置信区间)。 Conclusion: Transformer架构在组织病理图像分析中非常有效,DeepHistoViT具备临床应用潜力,可作为可解释的辅助诊断工具支持病理医生决策。 Abstract: Histopathology remains the gold standard for cancer diagnosis because it provides detailed cellular-level assessment of tissue morphology. However, manual histopathological examination is time-consuming, labour-intensive, and subject to inter-observer variability, creating a demand for reliable computer-assisted diagnostic tools. Recent advances in deep learning, particularly transformer-based architectures, have shown strong potential for modelling complex spatial dependencies in medical images. In this work, we propose DeepHistoViT, a transformer-based framework for automated classification of histopathological images. The model employs a customized Vision Transformer architecture with an integrated attention mechanism designed to capture fine-grained cellular structures while improving interpretability through attention-based localization of diagnostically relevant regions. The framework is evaluated on three publicly available histopathology datasets covering lung cancer, colon cancer, and acute lymphoblastic leukaemia. Experimental results demonstrate state-of-the-art performance across all datasets, with classification accuracy, precision, recall, F1-score, and ROC-AUC reaching 100 percent on the lung and colon cancer datasets, and 99.85 percent, 99.84 percent, 99.86 percent, 99.85 percent, and 99.99 percent respectively on the acute lymphoblastic leukaemia dataset. All performance metrics are reported with 95 percent confidence intervals. These results highlight the effectiveness of transformer-based architectures for histopathological image analysis and demonstrate the potential of DeepHistoViT as an interpretable computer-assisted diagnostic tool to support pathologists in clinical decision-making.

[70] Seeing Isn't Orienting: A Cognitively Grounded Benchmark Reveals Systematic Orientation Failures in MLLMs Supplementary

Nazia Tasnim,Keanu Nichols,Yuting Yang,Nicholas Ikechukwu,Elva Zou,Deepti Ghadiyaram,Bryan A. Plummer

Main category: cs.CV

TL;DR: 本文提出了一个名为DORI的认知导向分层基准,专门用于评估视觉-语言模型对物体朝向的理解能力,发现当前最先进模型在此任务上表现不佳,揭示了朝向理解仍是多模态系统的一大未解挑战。

Details Motivation: 现有视觉-语言基准大多将朝向与位置及整体场景理解混淆,缺乏对物体朝向这一核心能力的独立、细粒度评估;而人类对朝向的理解具有渐进性与层次性,亟需符合认知规律的新基准。 Method: 提出DORI基准:基于人类朝向认知四阶段理论,将朝向分解为四个维度,每维包含粗粒度(分类)和细粒度(度量)两个层级;构建含13,652张图像、33,656道多选题的大规模数据集,采用包围框隔离、统一空间参考系和结构化提示来控制混杂变量。 Result: 在24个SOTA视觉-语言模型上的评测显示:模型在通用空间基准上表现良好,但在物体朝向任务上接近随机水平;最优模型粗/细粒度准确率仅54.2%/45.0%,尤其难以处理复合旋转和参照系转换;存在显著的粗-细粒度性能鸿沟。 Conclusion: 朝向理解是当前多模态AI的重大短板,DORI揭示了模型依赖类别启发式而非几何推理的本质缺陷,该问题被现有基准所掩盖;结果对机器人操作、3D场景重建与人机交互具重要启示。 Abstract: Humans learn object orientation progressively, from recognizing which way an object faces, to mentally rotating it, to reasoning about orientations between objects. Current vision-language benchmarks largely conflate orientation with position and general scene understanding. We introduce Discriminative Orientation Reasoning Intelligence (DORI), a cognitively grounded hierarchical benchmark that makes object orientation the primary target. Inspired by stages of human orientation cognition, DORI decomposes orientation into four dimensions, each evaluated at coarse (categorical) and granular (metric) levels. Composed from 13,652 images across 14 sources, DORI provides 33,656 multiple-choice questions covering 67 object categories in real-world and synthetic settings. Its coarse-to-granular design isolates orientation from confounds such as object recognition difficulty, scene clutter, and linguistic ambiguity via bounding-box isolation, standardized spatial reference frames, and structured prompts. Evaluating 24 state-of-the-art vision-language models shows a clear pattern: models that perform well on general spatial benchmarks are near-random on object-centric orientation tasks. The best models reach only 54.2% on coarse and 45.0% on granular judgments, with largest failures on compound rotations and shifts in inter-object reference frames. Large coarse-to-granular gaps reveal reliance on categorical heuristics rather than geometric reasoning, a limitation hidden by existing benchmarks. These results identify orientation understanding as an unsolved challenge for multimodal systems, with implications for robotic manipulation, 3D scene reconstruction, and human-AI interaction.

[71] Zero-Shot Cross-City Generalization in End-to-End Autonomous Driving: Self-Supervised versus Supervised Representations

Fatemeh Naeinian,Ali Hamza,Haoran Zhu,Anna Choromanska

Main category: cs.CV

TL;DR: 本文研究端到端自动驾驶模型在未见过城市的零样本跨城泛化能力,发现监督预训练主干网络易依赖城市特有线索,导致地理迁移性能严重下降;而自监督视觉表征(如I-JEPA、DINOv2、MAE)可显著缩小该泛化差距,在开环和闭环评估中均提升跨城鲁棒性。

Details Motivation: 现有端到端自动驾驶模型多在混合多城市数据上训练,其在真实地理域偏移(如不同道路结构或驾驶规则)下的泛化能力未被充分检验,可能掩盖实际部署中的失败模式。 Method: 将多种自监督视觉主干网络(I-JEPA、DINOv2、MAE)集成到端到端轨迹规划框架中,在nuScenes(开环)和NAVSIM(闭环)数据集上采用严格的地理划分进行零样本跨城评估。 Result: 监督主干在波士顿→新加坡迁移时L2位移比达9.77×、碰撞率19.43×;自监督预训练分别降至1.20×和0.75×;闭环评估中PDMS最高提升4%。 Conclusion: 自监督表征学习显著增强跨城规划鲁棒性,零样本地理迁移应成为评估端到端自动驾驶系统的必要基准。 Abstract: End-to-end autonomous driving models are typically trained on multi-city datasets using supervised ImageNet-pretrained backbones, yet their ability to generalize to unseen cities remains largely unexamined. When training and evaluation data are geographically mixed, models may implicitly rely on city-specific cues, masking failure modes that would occur under real domain shifts when generalizing to new locations. In this work we investigate zero-shot cross-city generalization in end-to-end trajectory planning and ask whether self-supervised visual representations improve transfer across cities. We conduct a comprehensive study by integrating self-supervised backbones (I-JEPA, DINOv2, and MAE) into planning frameworks. We evaluate performance under strict geographic splits on nuScenes in the open-loop setting and on NAVSIM in the closed-loop evaluation protocol. Our experiments reveal a substantial generalization gap when transferring models relying on traditional supervised backbones across cities with different road topologies and driving conventions, particularly when transferring from right-side to left-side driving environments. Self-supervised representation learning reduces this gap. In open-loop evaluation, a supervised backbone exhibits severe inflation when transferring from Boston to Singapore (L2 displacement ratio 9.77x, collision ratio 19.43x), whereas domain-specific self-supervised pretraining reduces this to 1.20x and 0.75x respectively. In closed-loop evaluation, self-supervised pretraining improves PDMS by up to 4 percent for all single-city training cities. These results show that representation learning strongly influences the robustness of cross-city planning and establish zero-shot geographic transfer as a necessary test for evaluating end-to-end autonomous driving systems.

[72] ShotVerse: Advancing Cinematic Camera Control for Text-Driven Multi-Shot Video Creation

Songlin Yang,Zhe Wang,Xuyi Yang,Songchun Zhang,Xianghao Kong,Taiyi Wu,Xiaotong Zhao,Ran Zhang,Alan Zhao,Anyi Rao

Main category: cs.CV

TL;DR: 本文提出ShotVerse框架,通过数据驱动的'规划-控制'范式解决文本驱动视频生成中多镜头场景下的相机控制难题,利用VLM规划器和控制器协同生成高保真、跨镜头一致的多镜头视频。

Details Motivation: 现有文本驱动视频生成在多镜头场景下相机控制不精确且手动轨迹标注成本高、易失败,亟需一种兼顾自动化与精度的新范式。 Method: 提出'Plan-then-Control'框架:1)VLM-based Planner从文本生成全局对齐的电影级轨迹;2)Controller通过相机适配器将轨迹渲染为多镜头视频;3)构建自动化多镜头相机标定流程与ShotVerse-Bench数据集支撑训练与评估。 Result: ShotVerse显著提升多镜头视频的相机准确性与跨镜头一致性,在美学质量与可控性上优于现有方法,有效弥合了纯文本控制与手动绘图之间的鸿沟。 Conclusion: 数据为中心的对齐三元组(Caption, Trajectory, Video)建模是实现精准多镜头相机控制的关键路径,ShotVerse验证了该范式的有效性与可扩展性。 Abstract: Text-driven video generation has democratized film creation, but camera control in cinematic multi-shot scenarios remains a significant block. Implicit textual prompts lack precision, while explicit trajectory conditioning imposes prohibitive manual overhead and often triggers execution failures in current models. To overcome this bottleneck, we propose a data-centric paradigm shift, positing that aligned (Caption, Trajectory, Video) triplets form an inherent joint distribution that can connect automated plotting and precise execution. Guided by this insight, we present ShotVerse, a "Plan-then-Control" framework that decouples generation into two collaborative agents: a VLM (Vision-Language Model)-based Planner that leverages spatial priors to obtain cinematic, globally aligned trajectories from text, and a Controller that renders these trajectories into multi-shot video content via a camera adapter. Central to our approach is the construction of a data foundation: we design an automated multi-shot camera calibration pipeline aligns disjoint single-shot trajectories into a unified global coordinate system. This facilitates the curation of ShotVerse-Bench, a high-fidelity cinematic dataset with a three-track evaluation protocol that serves as the bedrock for our framework. Extensive experiments demonstrate that ShotVerse effectively bridges the gap between unreliable textual control and labor-intensive manual plotting, achieving superior cinematic aesthetics and generating multi-shot videos that are both camera-accurate and cross-shot consistent.

[73] Beyond Single-Sample: Reliable Multi-Sample Distillation for Video Understanding

Songlin Li,Xin Zhu,Zechao Guan,Peipeng Chen,Jian Yao

Main category: cs.CV

TL;DR: 本文提出R-MSD(可靠多样本蒸馏)框架,通过建模教师采样方差、构建任务自适应教师池、结合质量感知信号匹配与对抗蒸馏目标,提升LVLMs知识蒸馏的稳定性与效果,在多个视频理解基准上显著优于单样本蒸馏方法。

Details Motivation: 传统黑箱蒸馏依赖单个教师响应,导致高方差和格式不一致,尤其在多模态或时序场景下监督不可靠。 Method: 提出R-MSD框架:1)使用任务自适应教师池替代单一教师;2)引入质量感知信号匹配过滤噪声;3)设计对抗蒸馏目标以增强知识迁移鲁棒性。 Result: 在VideoMME、Video-MMMU、MathVerse等视频理解基准上,4B学生模型分别提升+1.5%、+3.2%、+3.6%;显著优于单样本蒸馏及SFT+RL基线。 Conclusion: 多教师样本建模与质量感知对抗蒸馏可有效缓解教师响应方差问题,为LVLMs高效稳定蒸馏提供了新范式。 Abstract: Traditional black-box distillation for Large Vision-Language Models (LVLMs) typically relies on a single teacher response per input, which often yields high-variance responses and format inconsistencies in multimodal or temporal scenarios. To mitigate this unreliable supervision, we propose R-MSD (Reliable Multi-Sample Distillation), a framework that explicitly models teacher sampling variance to enhance distillation stability. Rather than relying on a single teacher response, our approach leverages a task-adaptive teacher pool to provide robust supervision tailored to both closed-ended and open-ended reasoning. By integrating quality-aware signal matching with an adversarial distillation objective, our approach effectively filters teacher noise while maximizing knowledge transfer. Extensive evaluations across comprehensive video understanding benchmarks demonstrate that R-MSD consistently outperforms single sample distillation methods. We additionally include an original SFT+RL 4B baseline under the same training budget, which shows only marginal gains, while our method achieves significant improvements. With a 4B student model, our approach delivers gains on VideoMME (+1.5%), Video-MMMU (+3.2%), and MathVerse (+3.6%).

[74] Stay in your Lane: Role Specific Queries with Overlap Suppression Loss for Dense Video Captioning

Seung Hyup Baek,Jimin Lee,Hyeongkeun Lee,Jae Won Cho

Main category: cs.CV

TL;DR: 本文提出了一种基于角色特异性查询的密集视频描述生成方法,通过分离定位与描述任务、对比对齐、重叠抑制和概念增强模块,显著提升了事件定位精度与描述语义丰富性。

Details Motivation: 现有基于查询的密集视频描述方法因共享查询导致定位与描述任务间干扰严重,且存在时间冗余问题。 Method: 引入角色特异性查询分别处理定位与描述;采用对比对齐保证语义一致性;设计重叠抑制机制惩罚查询间时间重叠;加入轻量级概念捕捉模块增强描述语义。 Result: 在YouCook2和ActivityNet Captions两个主流基准上验证了方法有效性,提升了定位准确性和描述质量。 Conclusion: 角色解耦、对比对齐、重叠抑制与概念增强协同作用,有效缓解多任务干扰与时间冗余,推动密集视频描述性能提升。 Abstract: Dense Video Captioning (DVC) is a challenging multimodal task that involves temporally localizing multiple events within a video and describing them with natural language. While query-based frameworks enable the simultaneous, end-to-end processing of localization and captioning, their reliance on shared queries often leads to significant multi-task interference between the two tasks, as well as temporal redundancy in localization. In this paper, we propose utilizing role-specific queries that separate localization and captioning into independent components, allowing each to exclusively learn its role. We then employ contrastive alignment to enforce semantic consistency between the corresponding outputs, ensuring coherent behavior across the separated queries. Furthermore, we design a novel suppression mechanism in which mutual temporal overlaps across queries are penalized to tackle temporal redundancy, supervising the model to learn distinct, non-overlapping event regions for more precise localization. Additionally, we introduce a lightweight module that captures core event concepts to further enhance semantic richness in captions through concept-level representations. We demonstrate the effectiveness of our method through extensive experiments on major DVC benchmarks YouCook2 and ActivityNet Captions.

[75] Detect Anything in Real Time: From Single-Prompt Segmentation to Multi-Class Detection

Mehmet Kerem Turkcan

Main category: cs.CV

TL;DR: 本文提出DART框架,无需训练即可将SAM3转化为实时多类检测器,通过共享类无关视觉骨干计算和批量解码等优化,在不修改模型权重的情况下显著提升推理速度,并在COCO上达到高精度与高帧率。

Details Motivation: 现有SAM3等模型每次仅能处理一个文本提示,检测N个类别需N次独立前向传播,计算开销大,难以满足实时多类检测需求。 Method: 利用SAM3视觉骨干的类无关性,共享其图像特征计算;结合批量多类解码、仅检测模式推理及TensorRT FP16部署;进一步引入适配器蒸馏以降低极端低延迟场景下的骨干延迟。 Result: 在单张RTX 4080上,COCO val2017(4类,1008x1008)达55.8 AP与15.8 FPS;3类提速5.6倍,80类提速25倍;极低延迟下(13.9ms骨干)仍获38.7 AP。 Conclusion: DART是一种训练-free、权重不变的高效优化框架,显著提升开放词汇检测器的实时性与实用性,为视觉-语言模型的实际部署提供新范式。 Abstract: Recent advances in vision-language modeling have produced promptable detection and segmentation systems that accept arbitrary natural language queries at inference time. Among these, SAM3 achieves state-of-the-art accuracy by combining a ViT-H/14 backbone with cross-modal transformer decoding and learned object queries. However, SAM3 processes a single text prompt per forward pass. Detecting N categories requires N independent executions, each dominated by the 439M-parameter backbone. We present Detect Anything in Real Time (DART), a training-free framework that converts SAM3 into a real-time multi-class detector by exploiting a structural invariant: the visual backbone is class-agnostic, producing image features independent of the text prompt. This allows the backbone computation to be shared between all classes, reducing its cost from O(N) to O(1). Combined with batched multi-class decoding, detection-only inference, and TensorRT FP16 deployment, these optimizations yield 5.6x cumulative speedup at 3 classes, scaling to 25x at 80 classes, without modifying any model weight. On COCO val2017 (5,000 images, 80 classes), DART achieves 55.8 AP at 15.8 FPS (4 classes, 1008x1008) on a single RTX 4080, surpassing purpose-built open-vocabulary detectors trained on millions of box annotations. For extreme latency targets, adapter distillation with a frozen encoder-decoder achieves 38.7 AP with a 13.9 ms backbone. Code and models are available at https://github.com/mkturkcan/DART.

[76] Follow the Saliency: Supervised Saliency for Retrieval-augmented Dense Video Captioning

Seung hee Choi,MinJu Jeon,Hyunwoo Oh,Jihwan Lee,Dong-Jin Kim

Main category: cs.CV

TL;DR: 本文提出STaRC框架,通过引入基于真实标注的帧级显著性监督(highlight detection)和显著性引导的检索与字幕生成机制,显著提升密集视频字幕生成中时间分段的准确性与上下文相关性,在YouCook2和ViTT上达到SOTA。

Details Motivation: 现有稠密视频字幕(DVC)的检索增强方法依赖启发式时间分割策略,难以对齐真实事件边界,导致检索与生成质量受限。 Method: 提出STaRC框架:1)设计无需额外标注的二值化高亮检测模块,监督帧级显著性;2)将显著性分数作为统一时间信号,驱动显著性引导的时间分割与检索;3)在解码器中注入显著性提示(Saliency Prompts)以增强字幕生成的上下文一致性。 Result: 在YouCook2和ViTT基准上全面超越现有方法,多数指标达SOTA;显著性约束分割提升了时间片段与真实事件边界的对齐度,进而改善检索准确率与字幕生成质量。 Conclusion: 帧级显著性可作为有效且可学习的时间先验,显著性驱动的联合分割-检索-生成范式是提升DVC性能的有效途径;该方法无需额外人工标注,具备良好实用性与泛化性。 Abstract: Existing retrieval-augmented approaches for Dense Video Captioning (DVC) often fail to achieve accurate temporal segmentation aligned with true event boundaries, as they rely on heuristic strategies that overlook ground truth event boundaries. The proposed framework, \textbf{STaRC}, overcomes this limitation by supervising frame-level saliency through a highlight detection module. Note that the highlight detection module is trained on binary labels derived directly from DVC ground truth annotations without the need for additional annotation. We also propose to utilize the saliency scores as a unified temporal signal that drives retrieval via saliency-guided segmentation and informs caption generation through explicit Saliency Prompts injected into the decoder. By enforcing saliency-constrained segmentation, our method produces temporally coherent segments that align closely with actual event transitions, leading to more accurate retrieval and contextually grounded caption generation. We conduct comprehensive evaluations on the YouCook2 and ViTT benchmarks, where STaRC achieves state-of-the-art performance across most of the metrics. Our code is available at https://github.com/ermitaju1/STaRC

[77] INFACT: A Diagnostic Benchmark for Induced Faithfulness and Factuality Hallucinations in Video-LLMs

Junqi Yang,Yuecong Min,Jie Zhang,Shiguang Shan,Xilin Chen

Main category: cs.CV

TL;DR: 本文提出INFACT基准,用于诊断视频大语言模型(Video-LLMs)在忠实性与事实性方面的幻觉问题,并评估其在多种干扰模式下的可靠性。

Details Motivation: 现有基准对事实性幻觉覆盖不足,且多仅在干净环境下评估模型,缺乏对模型在现实复杂场景中可靠性的系统评测。 Method: 构建包含9800个问答实例的INFACT基准,涵盖真实与合成视频,细粒度划分忠实性与事实性幻觉;设计四种评估模式(基础、视觉退化、证据污染、时序干预),并引入抵抗率(RR)与时间敏感性得分(TSS)量化可靠性。 Result: 实验表明高基础准确率不保证高可靠性;证据污染显著降低稳定性;时序干预导致最大性能下降;多个开源模型在事实性TSS上接近零,显示其对时序敏感问题存在严重时间惯性。 Conclusion: INFACT揭示了当前Video-LLMs在复杂干扰下可靠性严重不足,尤其在时序敏感的事实性推理方面存在根本缺陷,亟需针对性改进。 Abstract: Despite rapid progress, Video Large Language Models (Video-LLMs) remain unreliable due to hallucinations, which are outputs that contradict either video evidence (faithfulness) or verifiable world knowledge (factuality). Existing benchmarks provide limited coverage of factuality hallucinations and predominantly evaluate models only in clean settings. We introduce \textsc{INFACT}, a diagnostic benchmark comprising 9{,}800 QA instances with fine-grained taxonomies for faithfulness and factuality, spanning real and synthetic videos. \textsc{INFACT} evaluates models in four modes: Base (clean), Visual Degradation, Evidence Corruption, and Temporal Intervention for order-sensitive items. Reliability under induced modes is quantified using Resist Rate (RR) and Temporal Sensitivity Score (TSS). Experiments on 14 representative Video-LLMs reveal that higher Base-mode accuracy does not reliably translate to higher reliability in the induced modes, with evidence corruption reducing stability and temporal intervention yielding the largest degradation. Notably, many open-source baselines exhibit near-zero TSS on factuality, indicating pronounced temporal inertia on order-sensitive questions.

[78] SPEGC: Continual Test-Time Adaptation via Semantic-Prompt-Enhanced Graph Clustering for Medical Image Segmentation

Xiaogang Du,Jiawei Zhang,Tongfei Liu,Tao Lei,Yingbo Wang

Main category: cs.CV

TL;DR: 本文提出了一种名为SPEGC的持续测试时自适应(CTTA)方法,用于解决医学图像分割中因训练与测试数据域差异导致的性能下降问题。该方法通过语义提示增强局部特征,并结合可微图聚类求解器构建高阶结构表征,从而提升模型在未知域上的鲁棒性与一致性。

Details Motivation: 现有CTTA方法依赖不可靠监督信号,易引发错误累积和性能崩溃;医学图像分割中训练与测试数据间存在显著域差距,阻碍预训练模型临床落地。 Method: 提出SPEGC框架:1)设计语义提示特征增强机制,利用解耦的共性/异质性提示池注入全局上下文信息;2)构建可微图聚类求解器,将边稀疏化建模为最优传输问题,端到端生成高阶结构表征;3)用该结构表征指导模型自适应,实现簇级预测一致性与动态边界调整。 Result: 在两个医学图像分割基准上,SPEGC显著优于当前最先进的CTTA方法。 Conclusion: SPEGC通过语义提示增强与图聚类结构建模,有效缓解了域偏移下的噪声敏感性和错误累积问题,提升了持续测试时自适应的鲁棒性与泛化能力。 Abstract: In medical image segmentation tasks, the domain gap caused by the difference in data collection between training and testing data seriously hinders the deployment of pre-trained models in clinical practice. Continual Test-Time Adaptation (CTTA) aims to enable pre-trained models to adapt to continuously changing unlabeled domains, providing an effective approach to solving this problem. However, existing CTTA methods often rely on unreliable supervisory signals, igniting a self-reinforcing cycle of error accumulation that culminates in catastrophic performance degradation. To overcome these challenges, we propose a CTTA via Semantic-Prompt-Enhanced Graph Clustering (SPEGC) for medical image segmentation. First, we design a semantic prompt feature enhancement mechanism that utilizes decoupled commonality and heterogeneity prompt pools to inject global contextual information into local features, alleviating their susceptibility to noise interference under domain shift. Second, based on these enhanced features, we design a differentiable graph clustering solver. This solver reframes global edge sparsification as an optimal transport problem, allowing it to distill a raw similarity matrix into a refined and high-order structural representation in an end-to-end manner. Finally, this robust structural representation is used to guide model adaptation, ensuring predictions are consistent at a cluster-level and dynamically adjusting decision boundaries. Extensive experiments demonstrate that SPEGC outperforms other state-of-the-art CTTA methods on two medical image segmentation benchmarks. The source code is available at https://github.com/Jwei-Z/SPEGC-for-MIS.

[79] OrthoEraser: Coupled-Neuron Orthogonal Projection for Concept Erasure

Chuancheng Shi,Wenhua Wu,Fei Shen,Xiaogang Zhu,Kun Hu,Zhiyong Wang

Main category: cs.CV

TL;DR: 本文提出OrthoEraser方法,利用稀疏自编码器(SAE)实现高分辨率特征解耦,并通过解析正交化投影进行概念擦除,在消除有害内容的同时保留生成流形的完整性,显著优于现有方法。

Details Motivation: 现有文本到图像模型的概念擦除方法在抑制敏感神经元时易损害良性属性,因敏感与良性语义在激活子空间中非正交叠加、相互纠缠。 Method: 提出OrthoEraser:先用稀疏自编码器(SAE)分解密集激活并分离敏感神经元;再通过耦合神经元检测识别易受干预的非敏感特征;最后采用解析梯度正交化策略,将擦除向量投影到耦合神经元的零空间,实现敏感概念与关键良性子空间的正交解耦。 Result: 实验表明OrthoEraser具有高擦除精度,能有效移除有害内容并保持生成流形完整性,在安全性任务上显著优于SOTA基线。 Conclusion: OrthoEraser通过正交化投影机制实现了更精细、更安全的概念擦除,在保障模型安全性的同时避免了对良性语义的损伤。 Abstract: Text-to-image (T2I) models face significant safety risks from adversarial induction, yet current concept erasure methods often cause collateral damage to benign attributes when suppressing selected neurons entirely. This occurs because sensitive and benign semantics exhibit non-orthogonal superposition, sharing activation subspaces where their respective vectors are inherently entangled. To address this issue, we propose OrthoEraser, which leverages sparse autoencoders (SAE) to achieve high-resolution feature disentanglement and subsequently redefines erasure as an analytical orthogonalization projection that preserves the benign manifold's invariance. OrthoEraser first employs SAE to decompose dense activations and segregate sensitive neurons. It then uses coupled neuron detection to identify non-sensitive features vulnerable to intervention. The key novelty lies in an analytical gradient orthogonalization strategy that projects erasure vectors onto the null space of the coupled neurons. This orthogonally decouples the sensitive concepts from the identified critical benign subspace, effectively preserving non-sensitive semantics. Experimental results on safety demonstrate that OrthoEraser achieves high erasure precision, effectively removing harmful content while preserving the integrity of the generative manifold, and significantly outperforming SOTA baselines. This paper contains results of unsafe models.

[80] ActiveFreq: Integrating Active Learning and Frequency Domain Analysis for Interactive Segmentation

Lijun Guo,Qian Zhou,Zidi Shi,Hua Zou,Gang Ke

Main category: cs.CV

TL;DR: 本文提出ActiveFreq框架,结合主动学习与频域分析,通过AcSelect模块选择最具信息量的误标区域,并利用FreqFormer骨干网络引入傅里叶变换增强特征提取,在减少人工交互的同时提升医学图像交互式分割精度。

Details Motivation: 现有交互式分割方法未能充分利用用户交互知识,且忽视频域信息,对误标区域无差别处理,导致效率和性能受限。 Method: 提出ActiveFreq框架,包含AcSelect模块(基于主动学习选择高信息量误标区域)和FreqFormer骨干网络(集成傅里叶变换模块以挖掘频域特征)。 Result: 在ISIC-2017和OAI-ZIB数据集上NoC@90分别达3.74和9.27,较此前最优结果提升23.5%和12.8%;仅用2次点击即实现85.29%和75.76%的mIoU。 Conclusion: ActiveFreq有效融合主动学习与频域建模,显著降低人工干预需求,同时提升交互式医学图像分割的精度与鲁棒性。 Abstract: Interactive segmentation is commonly used in medical image analysis to obtain precise, pixel-level labeling, typically involving iterative user input to correct mislabeled regions. However, existing approaches often fail to fully utilize user knowledge from interactive inputs and achieve comprehensive feature extraction. Specifically, these methods tend to treat all mislabeled regions equally, selecting them randomly for refinement without evaluating each region's potential impact on segmentation quality. Additionally, most models rely solely on spatial domain features, overlooking frequency domain information that could enhance feature extraction and improve performance. To address these limitations, we propose ActiveFreq, a novel interactive segmentation framework that integrates active learning and frequency domain analysis to minimize human intervention while achieving high-quality labeling. ActiveFreq introduces AcSelect, an autonomous module that prioritizes the most informative mislabeled regions, ensuring maximum performance gain from each click. Moreover, we develop FreqFormer, a segmentation backbone incorporating a Fourier transform module to map features from the spatial to the frequency domain, enabling richer feature extraction. Evaluations on the ISIC-2017 and OAI-ZIB datasets demonstrate that ActiveFreq achieves high performance with reduced user interaction, achieving 3.74 NoC@90 on ISIC-2017 and 9.27 NoC@90 on OAI-ZIB, with 23.5% and 12.8% improvements over previous best results, respectively. Under minimal input conditions, such as two clicks, ActiveFreq reaches mIoU scores of 85.29% and 75.76% on ISIC-2017 and OAI-ZIB, highlighting its efficiency and accuracy in interactive medical segmentation.

[81] Gen-Fab: A Variation-Aware Generative Model for Predicting Fabrication Variations in Nanophotonic Devices

Rambod Azimi,Yuri Grinberg,Dan-Xia Xu,Odile Liboiron-Ladouceur

Main category: cs.CV

TL;DR: 本文提出Gen-Fab,一种基于Pix2Pix的条件生成对抗网络(cGAN),用于预测硅光子器件制造中的纳米级工艺变化,输入为GDS版图,输出为类SEM图像,能建模不确定性并实现一对多映射;在精度(IoU达89.8%)和不确定性建模(KL散度、Wasserstein距离更低)上均优于多种U-Net基线方法,并具备对未见几何结构的良好泛化能力。

Details Motivation: 硅光子器件制造中存在非均匀的工艺偏差(如过刻蚀、欠刻蚀、拐角圆化等),其影响依赖于特征尺寸与形状,亟需高保真数字孪生模型来预测实际制造结果的分布范围。 Method: 提出Gen-Fab:基于Pix2Pix架构的条件生成对抗网络,在编码器-解码器瓶颈处注入随机噪声向量以支持一对多映射,输入GDS版图,输出高分辨率类SEM图像,模拟纳米级工艺变异。 Result: 在离分布测试集上,Gen-Fab取得最高IoU(89.8%),显著优于确定性U-Net(85.3%)、MC-Dropout U-Net(83.4%)和U-Net集成(85.8%);同时在KL散度和Wasserstein距离上更贴近真实制造结果分布,并在分布偏移分析中展现出对新几何结构的强泛化能力。 Conclusion: Gen-Fab是一种有效建模光子制造不确定性并生成高保真数字孪生的生成式方法,为设计鲁棒光子器件提供了可靠工具。 Abstract: Silicon photonic devices often exhibit fabrication-induced variations such as over-etching, underetching, and corner rounding, which can significantly alter device performance. These variations are non-uniform and are influenced by feature size and shape. Accurate digital twins are therefore needed to predict the range of possible fabricated outcomes for a given design. In this paper, we introduce Gen-Fab, a conditional generative adversarial network (cGAN) based on Pix2Pix to predict and model uncertainty in photonic fabrication outcomes. The proposed method takes a design layout (in GDS format) as input and produces diverse high-resolution predictions similar to scanning electron microscope (SEM) images of fabricated devices, capturing the range of process variations at the nanometer scale. To enable one-to-many mapping, we inject a latent noise vector at the model bottleneck. We compare Gen-Fab against three baselines: (1) a deterministic U-Net predictor, (2) an inference-time Monte Carlo Dropout U-Net, and (3) an ensemble of varied U-Nets. Evaluations on an out-of-distribution dataset of fabricated photonic test structures demonstrate that Gen-Fab outperforms all baselines in both accuracy and uncertainty modeling. An additional distribution shift analysis further confirms its strong generalization to unseen fabrication geometries. Gen-Fab achieves the highest intersection-over-union (IoU) score of 89.8%, outperforming the deterministic U-Net (85.3%), the MC-Dropout U-Net (83.4%), and varying U-Nets (85.8%). It also better aligns with the distribution of real fabrication outcomes, achieving lower Kullback-Leibler divergence and Wasserstein distance.

[82] Manifold-Optimal Guidance: A Unified Riemannian Control View of Diffusion Guidance

Zexi Jia,Pengcheng Luo,Zhengyao Fang,Jinchao Zhang,Jie Zhou

Main category: cs.CV

TL;DR: 本文提出Manifold-Optimal Guidance (MOG)框架,通过将引导建模为局部最优控制问题,在流形上进行几何感知的黎曼更新,解决Classifier-Free Guidance中因欧氏外推导致的流形偏离问题;并进一步提出Auto-MOG动态调度机制,自动平衡能量、消除人工调参需求,显著提升生成质量且无额外计算开销。

Details Motivation: Classifier-Free Guidance (CFG)在高引导尺度下易导致过饱和、纹理伪影和结构崩溃,根本原因在于其在环境空间中进行欧氏外推,使采样轨迹偏离高密度数据流形。 Method: 提出Manifold-Optimal Guidance (MOG),将引导重新表述为局部最优控制问题,导出闭式、几何感知的黎曼更新;并设计Auto-MOG,一种基于动态能量平衡的自适应引导强度调度策略。 Result: MOG在保真度与条件对齐性上显著优于基线方法,几乎不增加计算开销;Auto-MOG消除了手动调节引导尺度的需要。 Conclusion: MOG从流形几何视角重构扩散引导机制,提供理论更严谨、实践更鲁棒的替代方案,Auto-MOG进一步提升易用性与泛化性。 Abstract: Classifier-Free Guidance (CFG) serves as the de facto control mechanism for conditional diffusion, yet high guidance scales notoriously induce oversaturation, texture artifacts, and structural collapse. We attribute this failure to a geometric mismatch: standard CFG performs Euclidean extrapolation in ambient space, inadvertently driving sampling trajectories off the high-density data manifold. To resolve this, we present Manifold-Optimal Guidance (MOG), a framework that reformulates guidance as a local optimal control problem. MOG yields a closed-form, geometry-aware Riemannian update that corrects off-manifold drift without requiring retraining. Leveraging this perspective, we further introduce Auto-MOG, a dynamic energy-balancing schedule that adaptively calibrates guidance strength, effectively eliminating the need for manual hyperparameter tuning. Extensive validation demonstrates that MOG yields superior fidelity and alignment compared to baselines, with virtually no added computational overhead.

[83] FBCIR: Balancing Cross-Modal Focuses in Composed Image Retrieval

Chenchen Zhao,Jianhuan Zhuo,Muxi Chen,Zhaohua Zhang,Wenyu Jiang,Tianwen Jiang,Qiuyong Xiao,Jihong Zhang,Qiang Xu

Main category: cs.CV

TL;DR: 本文提出FBCIR方法来解释多模态模型在组合图像检索(CIR)任务中的注意力失衡问题,并设计了一种针对困难负样本的数据增强策略,以提升模型在挑战性场景下的鲁棒性与性能。

Details Motivation: 现有CIR模型在面对语义上与查询图像或文本高度对齐的困难负样本时性能显著下降,作者认为这是由于模型在图文模态间存在注意力失衡所致。 Method: 提出FBCIR——一种多模态注意力解释方法,用于识别影响检索决策的关键图文组件;并基于分析结果构建面向困难负样本的数据增强流程,以促进平衡的跨模态推理。 Result: FBCIR验证了现有CIR模型普遍存在注意力失衡现象,尤其在困难负样本场景下;所提数据增强方法在多个CIR模型上一致提升了困难场景下的检索准确率,同时不损害标准基准性能。 Conclusion: 注意力失衡是制约CIR模型鲁棒性的关键因素;通过可解释性分析驱动的数据增强,可有效提升模型对复杂语义干扰的抵抗能力,为CIR模型诊断与优化提供了新范式。 Abstract: Composed image retrieval (CIR) requires multi-modal models to jointly reason over visual content and semantic modifications presented in text-image input pairs. While current CIR models achieve strong performance on common benchmark cases, their accuracies often degrades in more challenging scenarios where negative candidates are semantically aligned with the query image or text. In this paper, we attribute this degradation to focus imbalances, where models disproportionately attend to one modality while neglecting the other. To validate this claim, we propose FBCIR, a multi-modal focus interpretation method that identifies the most crucial visual and textual input components to a model's retrieval decisions. Using FBCIR, we report that focus imbalances are prevalent in existing CIR models, especially under hard negative settings. Building on the analyses, we further propose a CIR data augmentation workflow that facilitates existing CIR datasets with curated hard negatives designed to encourage balanced cross-modal reasoning. Extensive experiments across multiple CIR models demonstrate that the proposed augmentation consistently improves performance in challenging cases, while maintaining their capabilities on standard benchmarks. Together, our interpretation method and data augmentation workflow provide a new perspective on CIR model diagnosis and robustness improvements.

[84] EReCu: Pseudo-label Evolution Fusion and Refinement with Multi-Cue Learning for Unsupervised Camouflage Detection

Shuo Jiang,Gaojia Zhang,Min Tan,Yufei Yin,Gang Pan

Main category: cs.CV

TL;DR: 本文提出了一种统一的无监督伪装目标检测(UCOD)框架,通过多线索原生感知模块、伪标签演化融合和局部伪标签优化等创新设计,显著提升了伪标签可靠性与特征保真度,在多个数据集上达到SOTA性能。

Details Motivation: 现有UCOD方法受限于目标与背景高度相似性及噪声伪标签,导致边界溢出、结构模糊或细节丢失;传统伪标签优化策略忽视内在感知线索,而无监督学习又易产生粗粒度特征。 Method: 提出统一UCOD框架,包含:1)Multi-Cue Native Perception模块,融合低层纹理与中层语义提取视觉先验;2)Pseudo-Label Evolution Fusion,通过师生交互与深度可分离卷积实现语义去噪,并引入Spectral Tensor Attention Fusion进行多层注意力图的谱域信息聚合;3)Local Pseudo-Label Refinement,利用注意力多样性优化局部细节与边界保真度。 Result: 在多个UCOD数据集上达到最先进(SOTA)性能,显著提升细节感知能力、边界对齐鲁棒性及复杂伪装场景下的泛化能力。 Conclusion: 融合多层级感知线索与精细化伪标签演化机制,是提升无监督伪装目标检测性能的有效范式;所提框架兼顾语义准确性与结构完整性,为UCOD提供了新思路。 Abstract: Unsupervised Camouflaged Object Detection (UCOD) remains a challenging task due to the high intrinsic similarity between target objects and their surroundings, as well as the reliance on noisy pseudo-labels that hinder fine-grained texture learning. While existing refinement strategies aim to alleviate label noise, they often overlook intrinsic perceptual cues, leading to boundary overflow and structural ambiguity. In contrast, learning without pseudo-label guidance yields coarse features with significant detail loss. To address these issues, we propose a unified UCOD framework that enhances both the reliability of pseudo-labels and the fidelity of features. Our approach introduces the Multi-Cue Native Perception module, which extracts intrinsic visual priors by integrating low-level texture cues with mid-level semantics, enabling precise alignment between masks and native object information. Additionally, Pseudo-Label Evolution Fusion intelligently refines labels through teacher-student interaction and utilizes depthwise separable convolution for efficient semantic denoising. It also incorporates Spectral Tensor Attention Fusion to effectively balance semantic and structural information through compact spectral aggregation across multi-layer attention maps. Finally, Local Pseudo-Label Refinement plays a pivotal role in local detail optimization by leveraging attention diversity to restore fine textures and enhance boundary fidelity. Extensive experiments on multiple UCOD datasets demonstrate that our method achieves state-of-the-art performance, characterized by superior detail perception, robust boundary alignment, and strong generalization under complex camouflage scenarios.

[85] MDS-VQA: Model-Informed Data Selection for Video Quality Assessment

Jian Zou,Xiaoyu Xu,Zhihua Wang,Yilin Wang,Balu Adsumilli,Kede Ma

Main category: cs.CV

TL;DR: 本文提出MDS-VQA,一种模型驱动的数据选择方法,用于筛选对现有VQA模型既困难又内容多样的未标注视频,在有限标注预算下提升模型细调效果。

Details Motivation: 现有学习型视频质量评估(VQA)方法存在模型设计与数据构建脱节的问题:模型中心法依赖固定基准,数据中心法缺乏对模型弱点的系统性覆盖。 Method: MDS-VQA通过训练一个基于排序目标的失败预测器估计样本难度,并利用深度语义视频特征度量多样性,再以贪心策略在标注预算约束下平衡二者进行数据筛选。 Result: 在多个VQA数据集和模型上验证表明,仅用5%的目标域样本进行主动细调,模型平均SRCC从0.651提升至0.722,并取得最优gMAD排名。 Conclusion: MDS-VQA能高效识别高信息量样本,显著提升VQA模型在目标域的适应性与泛化能力,弥合了模型与数据协同优化的鸿沟。 Abstract: Learning-based video quality assessment (VQA) has advanced rapidly, yet progress is increasingly constrained by a disconnect between model design and dataset curation. Model-centric approaches often iterate on fixed benchmarks, while data-centric efforts collect new human labels without systematically targeting the weaknesses of existing VQA models. Here, we describe MDS-VQA, a model-informed data selection mechanism for curating unlabeled videos that are both difficult for the base VQA model and diverse in content. Difficulty is estimated by a failure predictor trained with a ranking objective, and diversity is measured using deep semantic video features, with a greedy procedure balancing the two under a constrained labeling budget. Experiments across multiple VQA datasets and models demonstrate that MDS-VQA identifies diverse, challenging samples that are particularly informative for active fine-tuning. With only a 5% selected subset per target domain, the fine-tuned model improves mean SRCC from 0.651 to 0.722 and achieves the top gMAD rank, indicating strong adaptation and generalization.

[86] Mobile-GS: Real-time Gaussian Splatting for Mobile Devices

Xiaobiao Du,Yida Wang,Kun Zhan,Xin Yu

Main category: cs.CV

TL;DR: 本文提出Mobile-GS,一种面向移动设备的实时3D高斯泼溅渲染方法,通过深度感知的顺序无关渲染、神经视图相关增强、球谐蒸馏、神经矢量量化和基于贡献的剪枝等技术,在保证高画质的同时显著降低计算与存储开销。

Details Motivation: 3D高斯泼溅(3DGS)虽渲染质量高,但计算开销大、存储成本高,难以部署于资源受限的移动设备。 Method: 提出深度感知的顺序无关渲染以消除高斯深度排序瓶颈;引入神经视图相关增强缓解透明度伪影;采用一阶球谐蒸馏、神经矢量量化和贡献驱动的高斯剪枝压缩表示。 Result: 在移动设备上实现高质量、实时渲染,模型体积显著减小,且视觉质量保持优异。 Conclusion: Mobile-GS有效平衡了渲染质量、速度与资源消耗,为边缘端3DGS部署提供了实用可行的解决方案。 Abstract: 3D Gaussian Splatting (3DGS) has emerged as a powerful representation for high-quality rendering across a wide range of applications.However, its high computational demands and large storage costs pose significant challenges for deployment on mobile devices. In this work, we propose a mobile-tailored real-time Gaussian Splatting method, dubbed Mobile-GS, enabling efficient inference of Gaussian Splatting on edge devices. Specifically, we first identify alpha blending as the primary computational bottleneck, since it relies on the time-consuming Gaussian depth sorting process. To solve this issue, we propose a depth-aware order-independent rendering scheme that eliminates the need for sorting, thereby substantially accelerating rendering. Although this order-independent rendering improves rendering speed, it may introduce transparency artifacts in regions with overlapping geometry due to the scarcity of rendering order. To address this problem, we propose a neural view-dependent enhancement strategy, enabling more accurate modeling of view-dependent effects conditioned on viewing direction, 3D Gaussian geometry, and appearance attributes. In this way, Mobile-GS can achieve both high-quality and real-time rendering. Furthermore, to facilitate deployment on memory-constrained mobile platforms, we also introduce first-order spherical harmonics distillation, a neural vector quantization technique, and a contribution-based pruning strategy to reduce the number of Gaussian primitives and compress the 3D Gaussian representation with the assistance of neural networks. Extensive experiments demonstrate that our proposed Mobile-GS achieves real-time rendering and compact model size while preserving high visual quality, making it well-suited for mobile applications.

[87] Risk-Controllable Multi-View Diffusion for Driving Scenario Generation

Hongyi Lin,Wenxiu Shi,Heye Huang,Dingyi Zhuang,Song Zhang,Yang Liu,Xiaobo Qu,Jinhua Zhao

Main category: cs.CV

TL;DR: 本文提出RiskMV-DPO,一种物理信息驱动、风险可控的多视角驾驶场景生成方法,通过融合目标风险等级与物理建模生成高风险动态轨迹,并结合几何-外观对齐与区域感知直接偏好优化(RA-DPO),在nuScenes上显著提升3D检测性能(mAP从18.17升至30.50)并保证视觉质量。

Details Motivation: 长尾高风险驾驶场景在真实数据中稀少且难以手动设计,现有生成方法将风险视为后验标签,且难以保持多视角几何一致性。 Method: 提出RiskMV-DPO框架:1)物理建模驱动的风险可控轨迹生成;2)以轨迹为几何锚点的扩散视频生成;3)几何-外观对齐模块;4)运动感知掩码引导的区域感知直接偏好优化(RA-DPO)。 Result: 在nuScenes上实现多样化长尾场景生成,3D检测mAP提升12.33,FID降至15.70,视觉质量达SOTA。 Conclusion: RiskMV-DPO将世界模型从被动预测转向主动、风险可控的合成,为具身智能的安全开发提供可扩展工具链。 Abstract: Generating safety-critical driving scenarios is crucial for evaluating and improving autonomous driving systems, but long-tail risky situations are rarely observed in real-world data and difficult to specify through manual scenario design. Existing generative approaches typically treat risk as an after-the-fact label and struggle to maintain geometric consistency in multi-view driving scenes. We present RiskMV-DPO, a general and systematic pipeline for physically-informed, risk-controllable multi-view scenario generation. By integrating target risk levels with physically-grounded risk modeling, we autonomously synthesize diverse and high-stakes dynamic trajectories that serve as explicit geometric anchors for a diffusion-based video generator. To ensure spatial-temporal coherence and geometric fidelity, we introduce a geometry-appearance alignment module and a region-aware direct preference optimization (RA-DPO) strategy with motion-aware masking to focus learning on localized dynamic regions.Experiments on the nuScenes dataset show that RiskMV-DPO can freely generate a wide spectrum of diverse long-tail scenarios while maintaining state-of-the-art visual quality, improving 3D detection mAP from 18.17 to 30.50 and reducing FID to 15.70. Our work shifts the role of world models from passive environment prediction to proactive, risk-controllable synthesis, providing a scalable toolchain for the safety-oriented development of embodied intelligence.

[88] ReHARK: Refined Hybrid Adaptive RBF Kernels for Robust One-Shot Vision-Language Adaptation

Md Jahidul Islam

Main category: cs.CV

TL;DR: 本文提出ReHARK框架,通过在再生核希尔伯特空间中引入全局邻近正则化,解决大模型在单样本场景下的稳定性-可塑性困境,显著提升单样本视觉语言迁移性能。

Details Motivation: 大型视觉语言模型(如CLIP)在极低数据(尤其单样本)下游任务中面临稳定性与可塑性难以兼顾的问题;现有无训练方法(如Tip-Adapter)存在边界偏差和缺乏全局结构正则化等缺陷。 Method: 提出ReHARK无训练框架,包含四阶段精细化流程:(1)混合先验构建(融合CLIP文本知识与GPT-3及视觉原型);(2)支撑集增强(跨模态插值生成中间样本);(3)自适应分布校正(对齐测试特征统计与增强支撑集);(4)多尺度RBF核集成(建模多尺度特征几何)。 Result: 在11个基准上实验验证,ReHARK以平均65.83%准确率创下单样本适应新SOTA,显著优于现有基线。 Conclusion: ReHARK通过全局RKHS正则化与多阶段语义-视觉协同优化,有效缓解了单样本VLM适配中的稳定性-可塑性矛盾,为无训练小样本迁移提供了新范式。 Abstract: The adaptation of large-scale Vision-Language Models (VLMs) like CLIP to downstream tasks with extremely limited data -- specifically in the one-shot regime -- is often hindered by a significant "Stability-Plasticity" dilemma. While efficient caching mechanisms have been introduced by training-free methods such as Tip-Adapter, these approaches often function as local Nadaraya-Watson estimators. Such estimators are characterized by inherent boundary bias and a lack of global structural regularization. In this paper, ReHARK (Refined Hybrid Adaptive RBF Kernels) is proposed as a synergistic training-free framework that reinterprets few-shot adaptation through global proximal regularization in a Reproducing Kernel Hilbert Space (RKHS). A multistage refinement pipeline is introduced, consisting of: (1) Hybrid Prior Construction, where zero-shot textual knowledge from CLIP and GPT-3 is fused with visual class prototypes to form a robust semantic-visual anchor; (2) Support Set Augmentation (Bridging), where intermediate samples are generated to smooth the transition between visual and textual modalities; (3) Adaptive Distribution Rectification, where test feature statistics are aligned with the augmented support set to mitigate domain shifts; and (4) Multi-Scale RBF Kernels, where an ensemble of kernels is employed to capture complex feature geometries across diverse scales. Superior stability and accuracy are demonstrated through extensive experiments on 11 diverse benchmarks. A new state-of-the-art for one-shot adaptation is established by ReHARK, which achieves an average accuracy of 65.83%, significantly outperforming existing baselines. Code is available at https://github.com/Jahid12012021/ReHARK.

[89] Mango-GS: Enhancing Spatio-Temporal Consistency in Dynamic Scenes Reconstruction using Multi-Frame Node-Guided 4D Gaussian Splatting

Tingxuan Huang,Haowei Zhu,Jun-hai Yong,Hao Pan,Bin Wang

Main category: cs.CV

TL;DR: Mango-GS是一种基于多帧、节点引导的高保真4D重建方法,利用时序Transformer建模短时帧间运动依赖,并通过稀疏控制节点实现高效、稳定、一致的动态场景重建与实时渲染。

Details Motivation: 现有高斯溅射动态建模方法多采用逐帧优化,易过拟合瞬时状态,难以捕捉底层运动动力学,导致时间一致性差。 Method: 提出Mango-GS框架:引入时序Transformer建模短时窗口内运动依赖;用稀疏控制节点(含解耦规范位置和潜在码)作为语义锚点;结合输入掩码策略与两个多帧损失,端到端训练。 Result: 在多个数据集上达到重建质量与实时渲染速度的SOTA水平,支持高保真动态场景重建与交互式渲染。 Conclusion: Mango-GS通过节点引导的多帧建模有效提升了动态3D场景重建的时序一致性、稳定性与效率,为4D内容生成提供了新范式。 Abstract: Reconstructing dynamic 3D scenes with photorealistic detail and strong temporal coherence remains a significant challenge. Existing Gaussian splatting approaches for dynamic scene modeling often rely on per-frame optimization, which can overfit to instantaneous states instead of capturing underlying motion dynamics. To address this, we present Mango-GS, a multi-frame, node-guided framework for high-fidelity 4D reconstruction. Mango-GS leverages a temporal Transformer to model motion dependencies within a short window of frames, producing temporally consistent deformations. For efficiency, temporal modeling is confined to a sparse set of control nodes. Each node is represented by a decoupled canonical position and a latent code, providing a stable semantic anchor for motion propagation and preventing correspondence drift under large motion. Our framework is trained end-to-end, enhanced by an input masking strategy and two multi-frame losses to improve robustness. Extensive experiments demonstrate that Mango-GS achieves state-of-the-art reconstruction quality and real-time rendering speed, enabling high-fidelity reconstruction and interactive rendering of dynamic scenes.

[90] PCA-Enhanced Probabilistic U-Net for Effective Ambiguous Medical Image Segmentation

Xiangyu Li,Chenglin Wang,Qiantong Shen,Fanding Li,Wei Wang,Kuanquan Wang,Yi Shen,Baochun Zhao,Gongning Luo

Main category: cs.CV

TL;DR: 本文提出了一种PCA增强的概率U-Net(PEP U-Net),通过在后验网络中引入PCA降维与逆PCA重建,缓解潜在空间冗余、提升表达能力与计算效率,在保持生成多样性的同时改善分割精度与不确定性建模的平衡。

Details Motivation: 解决现有基于cVAE的模糊医学图像分割方法存在的高维潜在空间冗余和单后验网络表达能力有限的问题。 Method: 提出PCA增强的概率U-Net(PEP U-Net):在后验网络中引入PCA进行降维以减少冗余,并通过逆PCA重建关键信息以增强潜在空间表征能力。 Result: 相比传统生成模型,在保持多样分割假设生成能力的同时,显著提升了分割精度与预测变异性之间的平衡,增强了医学图像分割中生成建模的性能。 Conclusion: PEP U-Net有效缓解了潜在空间冗余问题,提升了计算效率与表征能力,为模糊医学图像分割提供了更优的不确定性建模方案。 Abstract: Ambiguous Medical Image Segmentation (AMIS) is significant to address the challenges of inherent uncertainties from image ambiguities, noise, and subjective annotations. Existing conditional variational autoencoder (cVAE)-based methods effectively capture uncertainty but face limitations including redundancy in high-dimensional latent spaces and limited expressiveness of single posterior networks. To overcome these issues, we introduce a novel PCA-Enhanced Probabilistic U-Net (\textbf{PEP U-Net}). Our method effectively incorporates Principal Component Analysis (PCA) for dimensionality reduction in the posterior network to mitigate redundancy and improve computational efficiency. Additionally, we further employ an inverse PCA operation to reconstruct critical information, enhancing the latent space's representational capacity. Compared to conventional generative models, our method preserves the ability to generate diverse segmentation hypotheses while achieving a superior balance between segmentation accuracy and predictive variability, thereby advancing the performance of generative modeling in medical image segmentation.

[91] MANSION: Multi-floor lANguage-to-3D Scene generatIOn for loNg-horizon tasks

Lirong Che,Shuo Wen,Shan Huang,Chuang Wang,Yuzhe Yang,Gregory Dudek,Xueqian Wang,Jian Su

Main category: cs.CV

TL;DR: 本文提出了MANSION框架,用于生成多楼层、建筑规模的3D环境,以支持真实世界中跨楼层、长时程的具身智能任务,并发布了包含1000+建筑的MansionWorld数据集及语义场景编辑智能体,揭示了现有最先进智能体在该场景下的性能显著下降。

Details Motivation: 现有具身智能基准局限于单层室内环境,无法反映真实世界中多楼层、长时程任务所需的复杂空间推理能力。 Method: 提出MANSION语言驱动框架,建模垂直结构约束,生成可导航、人类友好的全楼3D环境;构建MansionWorld数据集,并设计基于开放词汇命令的Task-Semantic Scene Editing Agent进行场景定制。 Result: 发布了含1000多个多样化建筑(如医院、办公楼)的MansionWorld数据集和配套编辑智能体;基准测试表明当前SOTA智能体在跨楼层长时程任务上性能急剧下降。 Conclusion: MANSION为评估和推动下一代空间推理与规划能力提供了关键且更具现实挑战性的测试平台。 Abstract: Real-world robotic tasks are long-horizon and often span multiple floors, demanding rich spatial reasoning. However, existing embodied benchmarks are largely confined to single-floor in-house environments, failing to reflect the complexity of real-world tasks. We introduce MANSION, the first language-driven framework for generating building-scale, multi-floor 3D environments. Being aware of vertical structural constraints, MANSION generates realistic, navigable whole-building structures with diverse, human-friendly scenes, enabling the development and evaluation of cross-floor long-horizon tasks. Building on this framework, we release MansionWorld, a dataset of over 1,000 diverse buildings ranging from hospitals to offices, alongside a Task-Semantic Scene Editing Agent that customizes these environments using open-vocabulary commands to meet specific user needs. Benchmarking reveals that state-of-the-art agents degrade sharply in our settings, establishing MANSION as a critical testbed for the next generation of spatial reasoning and planning.

[92] Enhancing Image Aesthetics with Dual-Conditioned Diffusion Models Guided by Multimodal Perception

Xinyu Nan,Ning Wang,Yuyao Zhai,Mei Yang

Main category: cs.CV

TL;DR: 本文提出了一种基于扩散模型的双监督图像美学增强方法DIAE,通过多模态美学感知(MAP)将模糊的美学指令转化为明确指导,并构建弱配对数据集IIAEData及双分支监督框架,以解决美学增强中指令理解难和高质量配对数据稀缺的问题。

Details Motivation: 现有图像编辑模型在美学增强方面表现不佳,主要受限于:1)难以准确理解并遵循具有美学感知的编辑指令;2)缺乏内容一致但美学质量不同的“完美配对”图像数据。 Method: 提出Dual-supervised Image Aesthetic Enhancement(DIAE):1)引入Multimodal Aesthetic Perception(MAP),利用标准化多属性文本指令与对应文本-图像控制信号;2)构建弱配对数据集IIAEData;3)设计双分支监督框架实现弱监督训练。 Result: DIAE在图像美学评分和内容一致性评分上均优于基线模型。 Conclusion: DIAE有效提升了图像美学增强能力,验证了多模态美学感知与弱监督学习策略在该任务中的有效性。 Abstract: Image aesthetic enhancement aims to perceive aesthetic deficiencies in images and perform corresponding editing operations, which is highly challenging and requires the model to possess creativity and aesthetic perception capabilities. Although recent advancements in image editing models have significantly enhanced their controllability and flexibility, they struggle with enhancing image aesthetic. The primary challenges are twofold: first, following editing instructions with aesthetic perception is difficult, and second, there is a scarcity of "perfectly-paired" images that have consistent content but distinct aesthetic qualities. In this paper, we propose Dual-supervised Image Aesthetic Enhancement (DIAE), a diffusion-based generative model with multimodal aesthetic perception. First, DIAE incorporates Multimodal Aesthetic Perception (MAP) to convert the ambiguous aesthetic instruction into explicit guidance by (i) employing detailed, standardized aesthetic instructions across multiple aesthetic attributes, and (ii) utilizing multimodal control signals derived from text-image pairs that maintain consistency within the same aesthetic attribute. Second, to mitigate the lack of "perfectly-paired" images, we collect "imperfectly-paired" dataset called IIAEData, consisting of images with varying aesthetic qualities while sharing identical semantics. To better leverage the weak matching characteristics of IIAEData during training, a dual-branch supervision framework is also introduced for weakly supervised image aesthetic enhancement. Experimental results demonstrate that DIAE outperforms the baselines and obtains superior image aesthetic scores and image content consistency scores.

[93] TornadoNet: Real-Time Building Damage Detection with Ordinal Supervision

Robinson Umeike,Cuong Pham,Ryan Hausen,Thang Dao,Shane Crawford,Tanya Brown-Giammanco,Gerard Lemson,John van de Lindt,Blythe Johnston,Arik Mitschang,Trung Do

Main category: cs.CV

TL;DR: 本文提出了TornadoNet基准,用于评估实时目标检测模型在街景图像中进行多级建筑损毁评估的性能,比较了YOLO系列CNN模型与RT-DETR等Transformer模型,并引入序数感知监督策略以提升损毁严重程度估计的准确性。

Details Motivation: 现有方法缺乏在真实灾后条件下对建筑损毁多级检测的系统性评估,尤其缺少对模型架构与损失函数协同影响的可控分析。 Method: 构建包含3333张街景图像和8890个标注建筑实例的TornadoNet基准;采用IN-CORE五级损毁分类框架;对比YOLO系列CNN与RT-DETR等Transformer模型;提出软序数分类目标与序数距离惩罚等序数感知监督策略。 Result: YOLO模型在检测精度(最高46.05% mAP@0.5)与推理速度(66–276 FPS)上占优;RT-DETR在序数一致性上更优(88.13% Ordinal Top-1 Accuracy,MAOE=0.65);引入序数监督后,RT-DETR的mAP提升4.8个百分点,Ordinal Top-1 Accuracy达91.15%,MAOE降至0.56。 Conclusion: 序数感知监督能显著提升损毁严重程度估计的可靠性,其效果依赖于与检测器架构的合理匹配;TornadoNet为灾后响应提供了可部署的方法与工具。 Abstract: We present TornadoNet, a comprehensive benchmark for automated street-level building damage assessment evaluating how modern real-time object detection architectures and ordinal-aware supervision strategies perform under realistic post-disaster conditions. TornadoNet provides the first controlled benchmark demonstrating how architectural design and loss formulation jointly influence multi-level damage detection from street-view imagery, delivering methodological insights and deployable tools for disaster response. Using 3,333 high-resolution geotagged images and 8,890 annotated building instances from the 2021 Midwest tornado outbreak, we systematically compare CNN-based detectors from the YOLO family against transformer-based models (RT-DETR) for multi-level damage detection. Models are trained under standardized protocols using a five-level damage classification framework based on IN-CORE damage states, validated through expert cross-annotation. Baseline experiments reveal complementary architectural strengths. CNN-based YOLO models achieve highest detection accuracy and throughput, with larger variants reaching 46.05% mAP@0.5 at 66-276 FPS on A100 GPUs. Transformer-based RT-DETR models exhibit stronger ordinal consistency, achieving 88.13% Ordinal Top-1 Accuracy and MAOE of 0.65, indicating more reliable severity grading despite lower baseline mAP. To align supervision with the ordered nature of damage severity, we introduce soft ordinal classification targets and evaluate explicit ordinal-distance penalties. RT-DETR trained with calibrated ordinal supervision achieves 44.70% mAP@0.5, a 4.8 percentage-point improvement, with gains in ordinal metrics (91.15% Ordinal Top-1 Accuracy, MAOE = 0.56). These findings establish that ordinal-aware supervision improves damage severity estimation when aligned with detector architecture. Model & Data: https://github.com/crumeike/TornadoNet

[94] SVLL: Staged Vision-Language Learning for Physically Grounded Embodied Task Planning

Yuyuan Yang,Junkun Hong,Hongrong Wang,Honghao Cai,Xunpeng Ren,Ge Wang,Mingcong Lei,Shenhao Yan,Jiahao Yang,Chengsi Yao,Xi Li,Yiming Zhao,Yatong Han,Jinke Ren

Main category: cs.CV

TL;DR: 本文提出了一种分阶段视觉语言学习框架(SVLL)和一种改进的偏好优化方法(Bias-DPO),以提升具身任务规划中视觉接地性与因果一致性的平衡,显著提高任务成功率并减少物理约束违规。

Details Motivation: 现有具身任务规划方法在端到端训练中易出现过早时间绑定,而强化学习方法又存在优化不稳定问题;同时标准DPO忽略最优路径的绝对似然约束,导致不安全或幻觉行为。 Method: 提出三阶段SVLL框架:前两阶段解耦空间接地与时间推理,第三阶段引入Bias-DPO——在DPO基础上显式最大化专家轨迹似然、惩罚过度自信的幻觉动作。 Result: 在AI2-THOR基准和真实机器人部署中,SVLL超越Qwen2.5-VL-7B、GPT-4o、Gemini-2.0-flash等SOTA模型,任务成功率更高,物理约束违反显著减少。 Conclusion: SVLL结合Bias-DPO能有效锚定策略于专家流形,缓解因果错位,确保严格遵循环境可供性,抑制物理上不可能的捷径。 Abstract: Embodied task planning demands vision-language models to generate action sequences that are both visually grounded and causally coherent over time. However, existing training paradigms face a critical trade-off: joint end-to-end training often leads to premature temporal binding, while standard reinforcement learning methods suffer from optimization instability. To bridge this gap, we present Staged Vision-Language Learning (SVLL), a unified three-stage framework for robust, physically-grounded embodied planning. In the first two stages, SVLL decouples spatial grounding from temporal reasoning, establishing robust visual dependency before introducing sequential action history. In the final stage, we identify a key limitation of standard Direct Preference Optimization (DPO), its purely relative nature -- optimizing only the preference gap between winning and losing trajectories while neglecting absolute likelihood constraints on optimal path, often yields unsafe or hallucinated behaviors. To address this, we further introduce Bias-DPO, a novel alignment objective that injects an inductive bias toward expert trajectories by explicitly maximizing likelihood on ground-truth actions while penalizing overconfident hallucinations. By anchoring the policy to the expert manifold and mitigating causal misalignment, SVLL, powered by Bias-DPO, ensures strict adherence to environmental affordances and effectively suppresses physically impossible shortcuts. Finally, extensive experiments on the interactive AI2-THOR benchmark and real-world robotic deployments demonstrate that SVLL outperforms both state-of-the-art open-source (e.g., Qwen2.5-VL-7B) and closed-source models (e.g., GPT-4o, Gemini-2.0-flash) in task success rate, while significantly reducing physical constraint violations.

[95] R4Det: 4D Radar-Camera Fusion for High-Performance 3D Object Detection

Zhongyu Xia,Yousen Tang,Yongtao Wang,Zhifeng Wang,Weijun Qin

Main category: cs.CV

TL;DR: 本文提出R4Det,通过全景深度融合模块提升深度估计质量,设计可变形门控时间融合模块避免依赖自车姿态,并引入实例引导的动态细化模块提取2D语义原型,显著提升4D雷达-相机融合的3D目标检测性能。

Details Motivation: 现有4D雷达-相机融合的3D目标检测方法存在三大问题:绝对深度估计不鲁棒准确、时间融合模块严重依赖不稳定的自车姿态、稀疏雷达点云对小物体反射失败导致仅能依赖视觉单模态先验。 Method: 提出R4Det框架,包含三个核心模块:1)全景深度融合模块(Panoramic Depth Fusion),实现绝对与相对深度互增强;2)可变形门控时间融合模块(Deformable Gated Temporal Fusion),摆脱对自车姿态的依赖;3)实例引导动态细化模块(Instance-Guided Dynamic Refinement),从2D实例中提取语义原型。 Result: 在TJ4DRadSet和VoD数据集上达到SOTA的3D目标检测性能。 Conclusion: R4Det有效克服了当前4D雷达-相机融合检测方法的关键缺陷,在深度估计、时间融合鲁棒性和小物体检测方面均有显著提升,验证了多模态协同与结构化先验建模的有效性。 Abstract: 4D radar-camera sensing configuration has gained increasing importance in autonomous driving. However, existing 3D object detection methods that fuse 4D Radar and camera data confront several challenges. First, their absolute depth estimation module is not robust and accurate enough, leading to inaccurate 3D localization. Second, the performance of their temporal fusion module will degrade dramatically or even fail when the ego vehicle's pose is missing or inaccurate. Third, for some small objects, the sparse radar point clouds may completely fail to reflect from their surfaces. In such cases, detection must rely solely on visual unimodal priors. To address these limitations, we propose R4Det, which enhances depth estimation quality via the Panoramic Depth Fusion module, enabling mutual reinforcement between absolute and relative depth. For temporal fusion, we design a Deformable Gated Temporal Fusion module that does not rely on the ego vehicle's pose. In addition, we built an Instance-Guided Dynamic Refinement module that extracts semantic prototypes from 2D instance guidance. Experiments show that R4Det achieves state-of-the-art 3D object detection results on the TJ4DRadSet and VoD datasets.

[96] WeEdit: A Dataset, Benchmark and Glyph-Guided Framework for Text-centric Image Editing

Hui Zhang,Juntao Liu,Zongkai Liu,Liqiang Niu,Fandong Meng,Zuxuan Wu,Yu-Gang Jiang

Main category: cs.CV

TL;DR: 本文提出WeEdit,一个面向文本中心图像编辑的系统性解决方案,包含数据构建、双基准测试和两阶段训练策略,显著提升复杂文本编辑的精度与清晰度。

Details Motivation: 现有模型在执行复杂文本编辑时表现不佳,常产生模糊或幻觉字符,主要原因是缺乏针对文本编辑的专门训练范式、大规模数据集及标准化评测基准。 Method: 提出基于HTML的自动编辑流水线生成330K多语言训练对;设计双语/多语言基准;采用字形引导监督微调注入空间与内容先验,再通过多目标强化学习优化指令遵循性、文本清晰度与背景保持。 Result: WeEdit在多种编辑任务上显著超越现有开源模型。 Conclusion: WeEdit通过系统性数据、评测与算法设计,有效解决了文本中心图像编辑的关键挑战,为该领域提供了可扩展、可复现的范式。 Abstract: Instruction-based image editing aims to modify specific content within existing images according to user-provided instructions while preserving non-target regions. Beyond traditional object- and style-centric manipulation, text-centric image editing focuses on modifying, translating, or rearranging textual elements embedded within images. However, existing leading models often struggle to execute complex text editing precisely, frequently producing blurry or hallucinated characters. We attribute these failures primarily to the lack of specialized training paradigms tailored for text-centric editing, as well as the absence of large-scale datasets and standardized benchmarks necessary for a closed-loop training and evaluation system. To address these limitations, we present WeEdit, a systematic solution encompassing a scalable data construction pipeline, two benchmarks, and a tailored two-stage training strategy. Specifically, we propose a novel HTML-based automatic editing pipeline, which generates 330K training pairs covering diverse editing operations and 15 languages, accompanied by standardized bilingual and multilingual benchmarks for comprehensive evaluation. On the algorithmic side, we employ glyph-guided supervised fine-tuning to inject explicit spatial and content priors, followed by a multi-objective reinforcement learning stage to align generation with instruction adherence, text clarity, and background preservation. Extensive experiments demonstrate that WeEdit outperforms previous open-source models by a clear margin across diverse editing operations.

[97] LaMoGen: Language to Motion Generation Through LLM-Guided Symbolic Inference

Junkun Jiang,Ho Yin Au,Jingyu Xiang,Jie Chen

Main category: cs.CV

TL;DR: 本文提出LabanLite运动表示法和LaMoGen生成框架,通过将人体动作分解为可解释的Laban符号序列,结合大语言模型进行符号推理,实现高可控性、可解释的语言驱动运动合成。

Details Motivation: 现有基于联合文本-运动嵌入的方法难以生成时间准确、细节丰富的运动,且缺乏可解释性。 Method: 提出LabanLite运动表示法(基于Labanotation系统),将原子级身体动作编码为离散Laban符号+文本模板;构建LaMoGen框架,利用大语言模型进行符号推理生成运动序列;建立基于Labanotation的基准测试集及三维度评估指标。 Result: LaMoGen在自建基准及两个公开数据集上均优于先前方法,显著提升运动合成的可解释性与可控性。 Conclusion: 符号推理与基于智能体的设计范式能有效提升语言驱动运动合成的质量与可解释性。 Abstract: Human motion is highly expressive and naturally aligned with language, yet prevailing methods relying heavily on joint text-motion embeddings struggle to synthesize temporally accurate, detailed motions and often lack explainability. To address these limitations, we introduce LabanLite, a motion representation developed by adapting and extending the Labanotation system. Unlike black-box text-motion embeddings, LabanLite encodes each atomic body-part action (e.g., a single left-foot step) as a discrete Laban symbol paired with a textual template. This abstraction decomposes complex motions into interpretable symbol sequences and body-part instructions, establishing a symbolic link between high-level language and low-level motion trajectories. Building on LabanLite, we present LaMoGen, a Text-to-LabanLite-to-Motion Generation framework that enables large language models (LLMs) to compose motion sequences through symbolic reasoning. The LLM interprets motion patterns, relates them to textual descriptions, and recombines symbols into executable plans, producing motions that are both interpretable and linguistically grounded. To support rigorous evaluation, we introduce a Labanotation-based benchmark with structured description-motion pairs and three metrics that jointly measure text-motion alignment across symbolic, temporal, and harmony dimensions. Experiments demonstrate that LaMoGen establishes a new baseline for both interpretability and controllability, outperforming prior methods on our benchmark and two public datasets. These results highlight the advantages of symbolic reasoning and agent-based design for language-driven motion synthesis.

[98] Articulat3D: Reconstructing Articulated Digital Twins From Monocular Videos with Geometric and Motion Constraints

Lijun Guo,Haoyu Zhao,Xingyue Zhao,Rong Fu,Linghao Zhuang,Siteng Huang,Zhongyu Li,Hua Zou

Main category: cs.CV

TL;DR: 本文提出Articulat3D框架,利用单目视频构建高保真关节物体数字孪生体,通过运动先验驱动初始化与几何/运动约束优化,实现几何准确且时间一致的重建。

Details Motivation: 现有方法依赖多视角、静态离散状态捕捉,难以在真实世界中大规模应用;亟需从随意拍摄的单目视频中构建关节物体数字孪生。 Method: 提出Motion Prior-Driven Initialization(利用3D点轨迹与紧凑运动基实现软刚性分组)和Geometric and Motion Constraints Refinement(基于可学习运动学原语施加物理合理的关节约束)。 Result: 在合成数据集和真实单目视频上达到SOTA性能,显著提升无控现实场景下数字孪生构建的可行性。 Conclusion: Articulat3D实现了仅用单目视频即可高精度、高一致性地重建关节物体三维结构与运动,推动数字孪生技术走向实用化。 Abstract: Building high-fidelity digital twins of articulated objects from visual data remains a central challenge. Existing approaches depend on multi-view captures of the object in discrete, static states, which severely constrains their real-world scalability. In this paper, we introduce Articulat3D, a novel framework that constructs such digital twins from casually captured monocular videos by jointly enforcing explicit 3D geometric and motion constraints. We first propose Motion Prior-Driven Initialization, which leverages 3D point tracks to exploit the low-dimensional structure of articulated motion. By modeling scene dynamics with a compact set of motion bases, we facilitate soft decomposition of the scene into multiple rigidly-moving groups. Building on this initialization, we introduce Geometric and Motion Constraints Refinement, which enforces physically plausible articulation through learnable kinematic primitives parameterized by a joint axis, a pivot point, and per-frame motion scalars, yielding reconstructions that are both geometrically accurate and temporally coherent. Extensive experiments demonstrate that Articulat3D achieves state-of-the-art performance on synthetic benchmarks and real-world casually captured monocular videos, significantly advancing the feasibility of digital twin creation under uncontrolled real-world conditions. Our project page is at https://maxwell-zhao.github.io/Articulat3D.

[99] DyWeight: Dynamic Gradient Weighting for Few-Step Diffusion Sampling

Tong Zhao,Mingkun Lei,Liangyu Yuan,Yanming Yang,Chenxi Song,Yang Wang,Beier Zhu,Chi Zhang

Main category: cs.CV

TL;DR: 本文提出了一种名为DyWeight的轻量级、学习型多步ODE求解器,通过动态加权历史梯度并隐式校准时间步长,显著提升扩散模型采样效率与生成质量。

Details Motivation: 现有扩散模型采样速度慢,多步ODE求解器虽有改进,但其手工设计的系数无法适应扩散过程中的非平稳动力学特性。 Method: 提出Dynamic Gradient Weighting(DyWeight),采用学习驱动的多步求解范式,放松传统数值约束,学习时变参数以自适应聚合历史梯度,并隐式调节有效步长,实现与模型去噪动力学对齐。 Result: 在CIFAR-10、FFHQ、AFHQv2、ImageNet64、LSUN-Bedroom、Stable Diffusion和FLUX.1-dev等多个基准上,DyWeight以更少函数评估次数实现了更高视觉保真度与采样稳定性,达到高效扩散求解器的新SOTA。 Conclusion: DyWeight通过数据驱动的动态梯度加权与隐式时间校准,克服了传统多步求解器的适应性瓶颈,在效率与质量间取得更好平衡,为扩散模型快速采样提供了新范式。 Abstract: Diffusion Models (DMs) have achieved state-of-the-art generative performance across multiple modalities, yet their sampling process remains prohibitively slow due to the need for hundreds of function evaluations. Recent progress in multi-step ODE solvers has greatly improved efficiency by reusing historical gradients, but existing methods rely on handcrafted coefficients that fail to adapt to the non-stationary dynamics of diffusion sampling. To address this limitation, we propose Dynamic Gradient Weighting (DyWeight), a lightweight, learning-based multi-step solver that introduces a streamlined implicit coupling paradigm. By relaxing classical numerical constraints, DyWeight learns unconstrained time-varying parameters that adaptively aggregate historical gradients while intrinsically scaling the effective step size. This implicit time calibration accurately aligns the solver's numerical trajectory with the model's internal denoising dynamics under large integration steps, avoiding complex decoupled parameterizations and optimizations. Extensive experiments on CIFAR-10, FFHQ, AFHQv2, ImageNet64, LSUN-Bedroom, Stable Diffusion and FLUX.1-dev demonstrate that DyWeight achieves superior visual fidelity and stability with significantly fewer function evaluations, establishing a new state-of-the-art among efficient diffusion solvers. Code is available at https://github.com/Westlake-AGI-Lab/DyWeight

[100] SemiTooth: a Generalizable Semi-supervised Framework for Multi-Source Tooth Segmentation

Muyi Sun,Yifan Gao,Ziang Jia,Xingqun Qi,Qianli Zhang,Qian Liu,Tianzheng Deng

Main category: cs.CV

TL;DR: 本文提出SemiTooth,一种面向多源CBCT数据的半监督牙齿结构分割框架,通过构建多源半监督数据集MS3Toothset,并设计多教师-多学生架构及严格加权置信约束,显著提升无标注数据利用效率与跨域分割精度。

Details Motivation: CBCT牙齿分割面临全标注数据获取难、多机构数据采集差异大导致的标注质量低、体素级不一致和域间差异等问题,亟需高效利用多源未标注数据。 Method: 构建包含三类标注水平的多源数据集MS3Toothset;提出多教师-多学生半监督框架SemiTooth,各学生网络分别学习对应来源的无标注数据并受各自教师监督;引入更严格的加权置信约束提升多源准确性。 Result: 在MS3Toothset上实验表明,SemiTooth在半监督与多源牙齿分割任务中达到当前最优(SOTA)性能。 Conclusion: SemiTooth为临床CBCT多源半监督牙齿分割提供了通用且高效的解决方案,提升了模型泛化性与实用性。 Abstract: With the rapid advancement of artificial intelligence, intelligent dentistry for clinical diagnosis and treatment has become increasingly promising. As the primary clinical dentistry task, tooth structure segmentation for Cone-Beam Computed Tomography (CBCT) has made significant progress in recent years. However, challenges arise from the obtainment difficulty of full-annotated data, and the acquisition variability of multi-source data across different institutions, which have caused low-quality utilization, voxel-level inconsistency, and domain-specific disparity in CBCT slices. Thus, the rational and efficient utilization of multi-source and unlabeled data represents a pivotal problem. In this paper, we propose SemiTooth, a generalizable semi-supervised framework for multi-source tooth segmentation. Specifically, we first compile MS3Toothset, Multi-Source Semi-Supervised Tooth DataSet for clinical dental CBCT, which contains data from three sources with different-level annotations. Then, we design a multi-teacher and multi-student framework, i.e., SemiTooth, which promotes semi-supervised learning for multi-source data. SemiTooth employs distinct student networks that learn from unlabeled data with different sources, supervised by its respective teachers. Furthermore, a Stricter Weighted-Confidence Constraint is introduced for multiple teachers to improve the multi-source accuracy.Extensive experiments are conducted on MS3Toothset to verify the feasibility and superiority of the SemiTooth framework, which achieves SOTA performance on the semi-supervised and multi-source tooth segmentation scenario.

[101] OSCBench: Benchmarking Object State Change in Text-to-Video Generation

Xianjing Han,Bin Zhu,Shiqi Hu,Franklin Mingzhe Li,Patrick Carrington,Roger Zimmermann,Jingjing Chen

Main category: cs.CV

TL;DR: 本文提出OSCBench基准,专门评估文本到视频(T2V)模型对物体状态变化(OSC)的理解能力,发现现有模型在OSC任务上表现薄弱,尤其在新颖和组合场景中,揭示OSC是T2V发展的关键瓶颈。

Details Motivation: 现有T2V评测基准忽视了文本中明确指定的物体状态变化(OSC)这一关键动作理解维度,而OSC对真实世界动作建模至关重要。 Method: 构建基于烹饪教学数据的OSCBench基准,涵盖常规、新颖和组合三类动作-物体交互场景;结合人工用户研究与多模态大语言模型(MLLM)自动评估,对六种主流T2V模型进行系统评测。 Result: 当前T2V模型在语义与场景对齐上表现良好,但在准确且时序一致地生成物体状态变化方面普遍不足,尤其在新颖和组合场景下错误率显著升高。 Conclusion: 物体状态变化(OSC)是制约T2V模型实用化的关键瓶颈;OSCBench为推动具备状态感知能力的视频生成模型发展提供了可诊断、可扩展的基准工具。 Abstract: Text-to-video (T2V) generation models have made rapid progress in producing visually high-quality and temporally coherent videos. However, existing benchmarks primarily focus on perceptual quality, text-video alignment, or physical plausibility, leaving a critical aspect of action understanding largely unexplored: object state change (OSC) explicitly specified in the text prompt. OSC refers to the transformation of an object's state induced by an action, such as peeling a potato or slicing a lemon. In this paper, we introduce OSCBench, a benchmark specifically designed to assess OSC performance in T2V models. OSCBench is constructed from instructional cooking data and systematically organizes action-object interactions into regular, novel, and compositional scenarios to probe both in-distribution performance and generalization. We evaluate six representative open-source and proprietary T2V models using both human user study and multimodal large language model (MLLM)-based automatic evaluation. Our results show that, despite strong performance on semantic and scene alignment, current T2V models consistently struggle with accurate and temporally consistent object state changes, especially in novel and compositional settings. These findings position OSC as a key bottleneck in text-to-video generation and establish OSCBench as a diagnostic benchmark for advancing state-aware video generation models.

[102] Noise-aware few-shot learning through bi-directional multi-view prompt alignment

Lu Niu,Cheng Xue

Main category: cs.CV

TL;DR: 本文提出NA-MVP框架,通过双向多视角提示对齐实现噪声感知的少样本学习,提升视觉-语言模型在标签噪声下的鲁棒性。

Details Motivation: 现有视觉-语言少样本方法易受噪声标签影响,缺乏建模细粒度语义线索和自适应分离干净/噪声信号的能力。 Method: NA-MVP包含三部分:(1) 多视角提示+非平衡最优传输实现区域级细粒度对齐并抑制不可靠区域;(2) 双向提示设计,分别捕获清洁导向与噪声感知线索;(3) 对齐引导的选择性精炼策略,仅修正误标样本。 Result: 在合成与真实噪声基准上,NA-MVP持续超越现有最优方法。 Conclusion: 区域感知、双向提示对齐与选择性精炼可有效提升少样本视觉-语言模型在噪声标签下的鲁棒性与跨模态对齐能力。 Abstract: Vision-language models offer strong few-shot capability through prompt tuning but remain vulnerable to noisy labels, which can corrupt prompts and degrade cross-modal alignment. Existing approaches struggle because they often lack the ability to model fine-grained semantic cues and to adaptively separate clean from noisy signals. To address these challenges, we propose NA-MVP, a framework for Noise-Aware few-shot learning through bi-directional Multi-View Prompt alignment. NA-MVP is built upon a key conceptual shift: robust prompt learning requires moving from global matching to region-aware alignment that explicitly distinguishes clean cues from noisy ones. To realize this, NA-MVP employs (1) multi-view prompts combined with unbalanced optimal transport to achieve fine-grained patch-to-prompt correspondence while suppressing unreliable regions; (2) a bi-directional prompt design that captures complementary clean-oriented and noise-aware cues, enabling the model to focus on stable semantics; and (3) an alignment-guided selective refinement strategy that uses optimal transport to correct only mislabeled samples while retaining reliable data. Experiments on synthetic and real-world noisy benchmarks demonstrate that NA-MVP consistently outperforms state-of-the-art baselines, confirming its effectiveness in enabling robust few-shot learning under noisy supervision.

[103] Think While Watching: Online Streaming Segment-Level Memory for Multi-Turn Video Reasoning in Multimodal Large Language Models

Lu Wang,Zhuoran Jin,Yupu Hao,Yubo Chen,Kang Liu,Yulong Ao,Jun Zhao

Main category: cs.CV

TL;DR: 本文提出Think While Watching框架,解决多模态大语言模型在流式视频理解中难以兼顾实时感知与生成、长程依赖建模弱的问题,通过内存锚定、分段因果掩码与阶段匹配训练,在单轮和多轮流式设置下显著提升性能并减少输出token。

Details Motivation: 现有MLLMs在流式视频理解中受限于离线推理或在线推理能力弱,且主流流式方法采用交替感知-生成范式,导致无法并发处理、早期记忆衰减,损害长程依赖建模。 Method: 提出内存锚定的流式视频推理框架Think While Watching,构建三阶段多轮思维链数据集,采用阶段匹配训练策略,并引入分段级流式因果掩码与流式位置编码;推理时设计观看与思考重叠的高效流水线,并自适应选择最优注意力后端。 Result: 在StreamingBench单轮设置下准确率提升2.6%,OVO-Bench提升3.79%;多轮设置下维持性能同时输出token减少56%。 Conclusion: Think While Watching有效缓解了流式视频多轮交互中的记忆衰减与延迟问题,实现了感知与生成的真正并发,为在线多模态推理提供了可扩展新范式。 Abstract: Multimodal large language models (MLLMs) have shown strong performance on offline video understanding, but most are limited to offline inference or have weak online reasoning, making multi-turn interaction over continuously arriving video streams difficult. Existing streaming methods typically use an interleaved perception-generation paradigm, which prevents concurrent perception and generation and leads to early memory decay as streams grow, hurting long-range dependency modeling. We propose Think While Watching, a memory-anchored streaming video reasoning framework that preserves continuous segment-level memory during multi-turn interaction. We build a three-stage, multi-round chain-of-thought dataset and adopt a stage-matched training strategy, while enforcing strict causality through a segment-level streaming causal mask and streaming positional encoding. During inference, we introduce an efficient pipeline that overlaps watching and thinking and adaptively selects the best attention backend. Under both single-round and multi-round streaming input protocols, our method achieves strong results. Built on Qwen3-VL, it improves single-round accuracy by 2.6% on StreamingBench and by 3.79% on OVO-Bench. In the multi-round setting, it maintains performance while reducing output tokens by 56%. Code is available at: https://github.com/wl666hhh/Think_While_Watching/

[104] Shape-of-You: Fused Gromov-Wasserstein Optimal Transport for Semantic Correspondence in-the-Wild

Jiin Im,Sisung Liu,Je Hyeong Hong

Main category: cs.CV

TL;DR: 本文提出Shape-of-You (SoY)框架,通过融合Gromov-Wasserstein(FGW)优化和3D基础模型,提升无监督语义对应任务的性能,解决2D特征在几何歧义(如对称性、重复纹理)下的局限性,并在SPair-71k和AP-10k上达到SOTA。

Details Motivation: 现有基于2D基础模型和最近邻伪标签的无监督语义对应方法局限于局部外观匹配,难以处理由对称性或重复结构引起的几何歧义,缺乏对图像内在结构关系的建模。 Method: 将伪标签生成建模为Fused Gromov-Wasserstein(FGW)优化问题,联合优化跨图像特征相似性与图像内几何结构一致性;引入3D基础模型定义几何结构;采用锚点线性化近似求解高复杂度FGW;设计软目标损失,动态融合噪声伪标签与网络预测。 Result: 在SPair-71k和AP-10k数据集上取得当前最优性能,显著优于以往无显式几何标注的方法。 Conclusion: 结构感知的伪标签生成(通过FGW与3D几何先验)可有效缓解纯2D外观驱动方法的几何歧义问题,为无监督语义对应提供了新范式。 Abstract: Semantic correspondence is essential for handling diverse in-the-wild images lacking explicit correspondence annotations. While recent 2D foundation models offer powerful features, adapting them for unsupervised learning via nearest-neighbor pseudo-labels has key limitations: it operates locally, ignoring structural relationships, and consequently its reliance on 2D appearance fails to resolve geometric ambiguities arising from symmetries or repetitive features. In this work, we address this by reformulating pseudo-label generation as a Fused Gromov-Wasserstein (FGW) problem, which jointly optimizes inter-feature similarity and intra-structural consistency. Our framework, Shape-of-You (SoY), leverages a 3D foundation model to define this intra-structure in the geometric space, resolving abovementioned ambiguity. However, since FGW is a computationally prohibitive quadratic problem, we approximate it through anchor-based linearization. The resulting probabilistic transport plan provides a structurally consistent but noisy supervisory signal. Thus, we introduce a soft-target loss dynamically blending guidance from this plan with network predictions to build a learning framework robust to this noise. SoY achieves state-of-the-art performance on SPair-71k and AP-10k datasets, establishing a new benchmark in semantic correspondence without explicit geometric annotations. Code is available at Shape-of-You.

[105] Linking Perception, Confidence and Accuracy in MLLMs

Yuetian Du,Yucheng Wang,Rongyu Zhang,Zhijie Xu,Boyu Yang,Ming Kong,Jie Liu,Qiang Zhu

Main category: cs.CV

TL;DR: 本文提出了一种基于置信度驱动的强化学习方法(CDRL)和置信度感知的测试时扩展方法(CA-TTS),以解决多模态大语言模型(MLLMs)中普遍存在的置信度校准问题,并在多个基准上取得显著性能提升。

Details Motivation: 现有MLLMs虽提升了视觉感知精度,但缺乏对自身不确定性的认知能力,即存在严重的置信度误校准问题。 Method: 提出CDRL方法,利用原始-噪声图像对和新颖的置信度奖励函数来增强感知敏感性并校准置信度;进一步设计CA-TTS框架,在测试时动态协调Self-Consistency、Self-Reflection和Visual Self-Check模块,由专家模型担任多种角色进行调度与外部验证。 Result: 在四个基准上实现一致8.8%的性能提升,达到新SOTA;消融实验验证各模块有效性及扩展策略优越性。 Conclusion: 置信度校准不仅提升模型可靠性,还能作为‘免费午餐’增强测试时扩展效果;所提CDRL与CA-TTS框架为MLLMs可信推理提供了新范式。 Abstract: Recent advances in Multi-modal Large Language Models (MLLMs) have predominantly focused on enhancing visual perception to improve accuracy. However, a critical question remains unexplored: Do models know when they do not know? Through a probing experiment, we reveal a severe confidence miscalibration problem in MLLMs. To address this, we propose Confidence-Driven Reinforcement Learning (CDRL), which uses original-noise image pairs and a novel confidence-based reward to enhance perceptual sensitivity and robustly calibrate the model's confidence. Beyond training benefits, calibrated confidence enables more effective test-time scaling as a free lunch. We further propose Confidence-Aware Test-Time Scaling (CA-TTS), which dynamically coordinates Self-Consistency, Self-Reflection, and Visual Self-Check modules guided by confidence signals. An Expert Model acts in multiple roles (e.g., Planner, Critic, Voter) to schedule these modules and provide external verification. Our integrated framework establishes new state-of-the-art results with consistent 8.8% gains across four benchmarks. More ablation studies demonstrate the effectiveness of each module and scaling superiority.

[106] MedPruner: Training-Free Hierarchical Token Pruning for Efficient 3D Medical Image Understanding in Vision-Language Models

Shengyuan Liu,Zanting Ye,Yunrui Lin,Chen Hu,Wanting Geng,Xu Han,Bulat Ibragimov,Yefeng Zheng,Yixuan Yuan

Main category: cs.CV

TL;DR: 本文提出MedPruner,一种无需训练、模型无关的分层token剪枝框架,用于高效3D医学图像理解,通过两阶段机制(片间锚点过滤+动态信息核选择)显著减少视觉token数量(<5%),同时保持或提升性能。

Details Motivation: 现有3D医学VLM因直接拼接2D切片导致解剖冗余严重,且固定剪枝比无法适应不同切片的信息密度差异,计算效率低,难以临床部署。 Method: 提出MedPruner框架:第一阶段为片间锚点过滤模块,消除切片级时间冗余;第二阶段为动态信息核选择策略,基于累积注意力权重实现自适应token级压缩。 Result: 在三个3D医学基准和三种VLM上验证,MedPruner使MedGemma等模型仅保留少于5%视觉token时仍维持甚至超越原性能,大幅降低计算开销。 Conclusion: 3D医学VLM中存在大量token冗余,动态、分层的token剪枝对提升计算效率与临床实用性至关重要,MedPruner为此提供了有效且通用的解决方案。 Abstract: While specialized Medical Vision-Language Models (VLMs) have achieved remarkable success in interpreting 2D and 3D medical modalities, their deployment for 3D volumetric data remains constrained by significant computational inefficiencies. Current architectures typically suffer from massive anatomical redundancy due to the direct concatenation of consecutive 2D slices and lack the flexibility to handle heterogeneous information densities across different slices using fixed pruning ratios. To address these challenges, we propose MedPruner, a training-free and model-agnostic hierarchical token pruning framework specifically designed for efficient 3D medical image understanding. MedPruner introduces a two-stage mechanism: an Inter-slice Anchor-based Filtering module to eliminate slice-level temporal redundancy, followed by a Dynamic Information Nucleus Selection strategy that achieves adaptive token-level compression by quantifying cumulative attention weights. Extensive experiments on three 3D medical benchmarks and across three diverse medical VLMs reveal massive token redundancy in existing architectures. Notably, MedPruner enables models such as MedGemma to maintain or even exceed their original performance while retaining fewer than 5% of visual tokens, thereby drastically reducing computational overhead and validating the necessity of dynamic token selection for practical clinical deployment. Our code will be released.

[107] EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models

Xuanlang Dai,Yujie Zhou,Long Xing,Jiazi Bu,Xilin Wei,Yuhong Liu,Beichen Zhang,Kai Chen,Yuhang Zang

Main category: cs.CV

TL;DR: 本文提出Endogenous Chain-of-Thought(EndoCoT)框架,通过迭代优化隐式思维状态并将其与扩散Transformer(DiT)去噪过程对齐,提升多模态大语言模型(MLLM)在空间推理等复杂任务中的指导能力,显著提升准确率。

Details Motivation: 现有MLLM作为文本编码器嵌入扩散模型时存在两大问题:一是单步编码无法激发链式思维(Chain-of-Thought),导致推理深度不足;二是解码过程中指导信号固定不变,无法支持逐步分解复杂指令。 Method: 提出EndoCoT框架,包含两个核心模块:(1)迭代思维引导模块,持续优化MLLM的隐式思维状态以激活深层推理;(2)终端思维锚定模块,将最终思维状态与真实答案对齐,确保推理过程受文本监督。 Result: 在Maze、TSP、VSP、Sudoku等多个空间推理基准上平均准确率达92.1%,较最强基线提升8.3个百分点。 Conclusion: EndoCoT有效释放MLLM的推理潜力,并实现其与扩散模型去噪过程的动态协同,为复杂多步推理任务提供可解释、渐进式的生成范式。 Abstract: Recently, Multimodal Large Language Models (MLLMs) have been widely integrated into diffusion frameworks primarily as text encoders to tackle complex tasks such as spatial reasoning. However, this paradigm suffers from two critical limitations: (i) MLLMs text encoder exhibits insufficient reasoning depth. Single-step encoding fails to activate the Chain-of-Thought process, which is essential for MLLMs to provide accurate guidance for complex tasks. (ii) The guidance remains invariant during the decoding process. Invariant guidance during decoding prevents DiT from progressively decomposing complex instructions into actionable denoising steps, even with correct MLLM encodings. To this end, we propose Endogenous Chain-of-Thought (EndoCoT), a novel framework that first activates MLLMs' reasoning potential by iteratively refining latent thought states through an iterative thought guidance module, and then bridges these states to the DiT's denoising process. Second, a terminal thought grounding module is applied to ensure the reasoning trajectory remains grounded in textual supervision by aligning the final state with ground-truth answers. With these two components, the MLLM text encoder delivers meticulously reasoned guidance, enabling the DiT to execute it progressively and ultimately solve complex tasks in a step-by-step manner. Extensive evaluations across diverse benchmarks (e.g., Maze, TSP, VSP, and Sudoku) achieve an average accuracy of 92.1%, outperforming the strongest baseline by 8.3 percentage points.

[108] Developing Foundation Models for Universal Segmentation from 3D Whole-Body Positron Emission Tomography

Yichi Zhang,Le Xue,Wenbo Zhang,Lanlan Li,Feiyang Xiao,Yuchen Liu,Xiaohui Zhang,Hongwei Zhang,Shuqi Wang,Gang Feng,Liling Peng,Xin Gao,Yuanfan Xu,Yuan Qi,Kuangyu Shi,Hong Zhang,Yuan Cheng,Mei Tian,Zixin Hu

Main category: cs.CV

TL;DR: 本文提出SegAnyPET,一种基于3D全身体PET影像的通用分割基础模型,利用迄今最大最全面的PET数据集(11041例扫描、59831个掩码)训练,支持零样本跨中心、跨示踪剂、跨疾病器官与病灶分割,并支持人机协同临床工作流。

Details Motivation: PET图像解剖对比度低、数据获取与标注成本高,导致深度学习在定量PET分析中的发展严重受限,亟需通用基础模型解决多任务分割问题。 Method: 构建大规模3D全身体PET分割数据集;设计基于3D架构与提示工程的通用分割基础模型SegAnyPET,支持器官/病灶分割及人机交互修正。 Result: 在多中心、多示踪剂、多疾病数据集上展现出优异的零样本分割性能,验证了其泛化性与临床适用性。 Conclusion: SegAnyPET为PET影像定量分析提供了可扩展、通用且临床友好的基础模型框架,有望推动分子影像的临床应用发展。 Abstract: Positron emission tomography (PET) is a key nuclear medicine imaging modality that visualizes radiotracer distributions to quantify in vivo physiological and metabolic processes, playing an irreplaceable role in disease management. Despite its clinical importance, the development of deep learning models for quantitative PET image analysis remains severely limited, driven by both the inherent segmentation challenge from PET's paucity of anatomical contrast and the high costs of data acquisition and annotation. To bridge this gap, we develop generalist foundational models for universal segmentation from 3D whole-body PET imaging. We first build the largest and most comprehensive PET dataset to date, comprising 11041 3D whole-body PET scans with 59831 segmentation masks for model development. Based on this dataset, we present SegAnyPET, an innovative foundational model with general-purpose applicability to diverse segmentation tasks. Built on a 3D architecture with a prompt engineering strategy for mask generation, SegAnyPET enables universal and scalable organ and lesion segmentation, supports efficient human correction with minimal effort, and enables a clinical human-in-the-loop workflow. Extensive evaluations on multi-center, multi-tracer, multi-disease datasets demonstrate that SegAnyPET achieves strong zero-shot performance across a wide range of segmentation tasks, highlighting its potential to advance the clinical applications of molecular imaging.

[109] MV-SAM3D: Adaptive Multi-View Fusion for Layout-Aware 3D Generation

Baicheng Li,Dong Wu,Jun Li,Shunkai Zhou,Zecui Zeng,Lusong Li,Hongbin Zha

Main category: cs.CV

TL;DR: 本文提出了MV-SAM3D,一种无需训练的多视角一致且物理合理的布局感知3D生成框架,通过3D潜空间中的多扩散融合与物理感知优化,显著提升了重建保真度和布局合理性。

Details Motivation: 现有单视角布局感知3D生成方法无法利用多视角互补信息,且独立估计的对象位姿易导致穿透、悬浮等物理不合理布局。 Method: 提出MV-SAM3D框架:1)将多视角融合建模为3D潜空间中的Multi-Diffusion过程;2)设计注意力熵加权与可见性加权两种自适应权重策略实现置信度感知融合;3)引入物理感知优化,在生成中及生成后注入碰撞与接触约束。 Result: 在标准基准和真实多物体场景实验中,显著提升了重建保真度与布局合理性,且完全无需额外训练。 Conclusion: MV-SAM3D验证了无需训练即可通过多视角融合与物理约束提升布局感知3D生成质量的有效性,为实用化场景级3D生成提供了新思路。 Abstract: Recent unified 3D generation models have made remarkable progress in producing high-quality 3D assets from a single image. Notably, layout-aware approaches such as SAM3D can reconstruct multiple objects while preserving their spatial arrangement, opening the door to practical scene-level 3D generation. However, current methods are limited to single-view input and cannot leverage complementary multi-view observations, while independently estimated object poses often lead to physically implausible layouts such as interpenetration and floating artifacts. We present MV-SAM3D, a training-free framework that extends layout-aware 3D generation with multi-view consistency and physical plausibility. We formulate multi-view fusion as a Multi-Diffusion process in 3D latent space and propose two adaptive weighting strategies -- attention-entropy weighting and visibility weighting -- that enable confidence-aware fusion, ensuring each viewpoint contributes according to its local observation reliability. For multi-object composition, we introduce physics-aware optimization that injects collision and contact constraints both during and after generation, yielding physically plausible object arrangements. Experiments on standard benchmarks and real-world multi-object scenes demonstrate significant improvements in reconstruction fidelity and layout plausibility, all without any additional training. Code is available at https://github.com/devinli123/MV-SAM3D.

[110] Tokenization Allows Multimodal Large Language Models to Understand, Generate and Edit Architectural Floor Plans

Sizhong Qin,Ramon Elias Weber,Xinzheng Lu

Main category: cs.CV

TL;DR: 本文提出HouseMind,一种多模态大语言模型,统一处理建筑平面图的理解、生成和编辑,通过离散房间实例标记实现几何、语义与空间层次的联合推理,显著提升布局的有效性与可控性。

Details Motivation: 现有AI系统在建筑平面图设计中难以联合推理几何、语义和空间层次,且扩散模型与语言模型在空间一致性与可控生成方面仍存在不足。 Method: 提出HouseMind模型,引入离散房间实例标记构建统一词汇表,结合多模态对齐与指令微调,实现从文本指令到连贯、可控平面布局的生成。 Result: 实验表明该框架在几何有效性与可控性上优于现有方法,同时保持高效性和本地可部署性。 Conclusion: HouseMind为建筑平面图的智能设计提供了统一、可控、可部署的多模态解决方案,推动了AI在空间推理任务中的实际应用。 Abstract: Architectural floor plan design demands joint reasoning over geometry, semantics, and spatial hierarchy, which remains a major challenge for current AI systems. Although recent diffusion and language models improve visual fidelity, they still struggle with coherent spatial reasoning and controllable generation. We present HouseMind, a multimodal large language model that unifies floor plan understanding, generation, and editing in one framework. We introduce discrete room-instance tokens to construct a unified vocabulary that bridges layouts and symbolic reasoning. With multimodal alignment and instruction tuning, the model synthesizes coherent, controllable layouts from text instructions. Experiments show how the framework achieves superior geometric validity and controllability while remaining efficient and locally deployable.

Chongxiao Wang,Junjie Liang,Peng Cao,Jinzhu Yang,Osmar R. Zaiane

Main category: cs.CV

TL;DR: 本文提出IDRL框架,通过解耦多模态表征并引入个体感知的模态融合模块,解决跨模态不一致与个体差异问题,提升抑郁检测鲁棒性。

Details Motivation: 现有方法存在跨模态抑郁线索冲突、无关信息干扰以及个体间抑郁表现差异大导致模态重要性不同等问题,影响融合可靠性。 Method: IDRL框架包含两部分:1)将多模态表征解耦为模态共有的抑郁空间、模态特有的抑郁空间和抑郁无关空间;2)设计个体感知模态融合模块(IAF),动态加权抑郁相关特征以实现自适应跨模态融合。 Result: 在多个数据集上实验表明,IDRL在多模态抑郁检测任务中性能优越且鲁棒性强。 Conclusion: IDRL有效缓解了模态间不一致性与个体差异带来的挑战,为可靠抑郁诊断提供了新思路。 Abstract: Depression is a severe mental disorder, and reliable identification plays a critical role in early intervention and treatment. Multimodal depression detection aims to improve diagnostic performance by jointly modeling complementary information from multiple modalities. Recently, numerous multimodal learning approaches have been proposed for depression analysis; however, these methods suffer from the following limitations: 1) inter-modal inconsistency and depression-unrelated interference, where depression-related cues may conflict across modalities while substantial irrelevant content obscures critical depressive signals, and 2) diverse individual depressive presentations, leading to individual differences in modality and cue importance that hinder reliable fusion. To address these issues, we propose Individual-aware Multimodal Depression-related Representation Learning Framework (IDRL) for robust depression diagnosis. Specifically, IDRL 1) disentangles multimodal representations into a modality-common depression space, a modality-specific depression space, and a depression-unrelated space to enhance modality alignment while suppressing irrelevant information, and 2) introduces an individual-aware modality-fusion module (IAF) that dynamically adjusts the weights of disentangled depression-related features based on their predictive significance, thereby achieving adaptive cross-modal fusion for different individuals. Extensive experiments demonstrate that IDRL achieves superior and robust performance for multimodal depression detection.

[112] FL-MedSegBench: A Comprehensive Benchmark for Federated Learning on Medical Image Segmentation

Meilu Zhu,Zhiwei Wang,Axiu Mao,Yuxing Li,Xiaohan Xing,Yixuan Yuan,Edmund Y. Lam

Main category: cs.CV

TL;DR: 本文提出了首个面向医学图像分割的联邦学习基准FL-MedSegBench,涵盖9个任务、10种模态、2D/3D数据,并系统评估了8种通用FL与5种个性化FL方法在精度、公平性、通信效率、收敛性及域泛化等方面的表现,揭示了个性化方法(如FedBN)优势显著、无绝对最优方法、通信鲁棒性与公平性等关键发现。

Details Motivation: 缺乏标准化的医学图像分割联邦学习基准,导致方法评估不公且不全面。 Method: 构建FL-MedSegBench基准,包含九个分割任务、十种成像模态、2D/3D数据及临床异质性;系统评估八种通用FL和五种个性化FL方法,在分割精度、公平性、通信效率、收敛行为和跨域泛化五个维度进行实验分析。 Result: (i)个性化FL(尤其是FedBN)持续优于通用FL;(ii)无单一方法在所有数据集上占优;(iii)基于归一化的个性化方法对降低通信频率具有强鲁棒性;(iv)Ditto和FedRDN等方法能更好保护表现差的客户端;(v)方法在未见域上的泛化能力与其在参与客户端上的整体性能强相关。 Conclusion: FL-MedSegBench为医学图像分割联邦学习提供了首个全面、可复现的评估平台,揭示了关键实践规律,并将开源工具包以推动临床可用FL方案的发展。 Abstract: Federated learning (FL) offers a privacy-preserving paradigm for collaborative medical image analysis without sharing raw data. However, the absence of standardized benchmarks for medical image segmentation hinders fair and comprehensive evaluation of FL methods. To address this gap, we introduce FL-MedSegBench, the first comprehensive benchmark for federated learning on medical image segmentation. Our benchmark encompasses nine segmentation tasks across ten imaging modalities, covering both 2D and 3D formats with realistic clinical heterogeneity. We systematically evaluate eight generic FL (gFL) and five personalized FL (pFL) methods across multiple dimensions: segmentation accuracy, fairness, communication efficiency, convergence behavior, and generalization to unseen domains. Extensive experiments reveal several key insights: (i) pFL methods, particularly those with client-specific batch normalization (\textit{e.g.}, FedBN), consistently outperform generic approaches; (ii) No single method universally dominates, with performance being dataset-dependent; (iii) Communication frequency analysis shows normalization-based personalization methods exhibit remarkable robustness to reduced communication frequency; (iv) Fairness evaluation identifies methods like Ditto and FedRDN that protect underperforming clients; (v) A method's generalization to unseen domains is strongly tied to its ability to perform well across participating clients. We will release an open-source toolkit to foster reproducible research and accelerate clinically applicable FL solutions, providing empirically grounded guidelines for real-world clinical deployment. The source code is available at https://github.com/meiluzhu/FL-MedSegBench.

[113] BackdoorIDS: Zero-shot Backdoor Detection for Pretrained Vision Encoder

Siquan Huang,Yijiang Li,Ningzhi Gao,Xingfu Yan,Leyu Shi

Main category: cs.CV

TL;DR: 本文提出BackdoorIDS,一种零样本、推理时检测预训练视觉编码器后门样本的方法,利用注意力劫持与恢复现象,通过输入掩码轨迹下的嵌入序列聚类来识别后门样本,无需重训练,兼容多种架构。

Details Motivation: 下游用户常依赖来源不明的第三方预训练视觉编码器,面临后门攻击风险;现有防御方法多需训练或不适用于零样本场景。 Method: 基于‘注意力劫持与恢复’现象:对图像逐步掩码,观察编码器注意力从触发器特征转向良性内容时嵌入的突变;提取掩码轨迹上的嵌入序列,用DBSCAN等密度聚类判断是否形成多个簇以判定后门。 Result: BackdoorIDS在多种攻击类型、数据集和模型族(CNN、ViT、CLIP、LLaVA-1.5)上显著优于现有防御方法,具备零样本、免重训练、即插即用特性。 Conclusion: BackdoorIDS是一种通用、高效、无需修改模型的后门检测方法,为保障预训练视觉编码器安全提供了实用新范式。 Abstract: Self-supervised and multimodal vision encoders learn strong visual representations that are widely adopted in downstream vision tasks and large vision-language models (LVLMs). However, downstream users often rely on third-party pretrained encoders with uncertain provenance, exposing them to backdoor attacks. In this work, we propose BackdoorIDS, a simple yet effective zero-shot, inference-time backdoor samples detection method for pretrained vision encoders. BackdoorIDS is motivated by two observations: Attention Hijacking and Restoration. Under progressive input masking, a backdoored image initially concentrates attention on malicious trigger features. Once the masking ratio exceeds the trigger's robustness threshold, the trigger is deactivated, and attention rapidly shifts to benign content. This transition induces a pronounced change in the image embedding, whereas embeddings of clean images evolve more smoothly across masking progress. BackdoorIDS operationalizes this signal by extracting an embedding sequence along the masking trajectory and applying density-based clustering such as DBSCAN. An input is flagged as backdoored if its embedding sequence forms more than one cluster. Extensive experiments show that BackdoorIDS consistently outperforms existing defenses across diverse attack types, datasets, and model families. Notably, it is a plug-and-play approach that requires no retraining and operates fully zero-shot at inference time, making it compatible with a wide range of encoder architectures, including CNNs, ViTs, CLIP, and LLaVA-1.5.

[114] PROMO: Promptable Outfitting for Efficient High-Fidelity Virtual Try-On

Haohua Chen,Tianze Zhou,Wei Zhu,Runqi Wang,Yandong Guan,Dejia Song,Yibo Chen,Xu Tang,Yao Hu,Lu Sheng,Zhiyong Wu

Main category: cs.CV

TL;DR: 本文提出PROMO框架,基于Flow Matching DiT主干网络与潜在多模态条件拼接,将虚拟试穿(VTON)建模为结构化图像编辑任务,在保持主体、精准纹理迁移和无缝融合三方面实现高效高质量合成,显著提升推理效率并超越现有方法。

Details Motivation: 扩散模型虽能生成高保真虚拟试穿结果,但架构复杂、采样慢,难以兼顾质量与效率;同时VTON的配对数据可作为通用图像编辑的优质监督资源。 Method: 将VTON视为结构化图像编辑问题,提出PROMO框架:采用Flow Matching DiT作为主干,引入潜在空间多模态条件拼接与自参考机制,提升条件建模效率与推理速度。 Result: 在标准基准上,PROMO在视觉保真度上超越既有VTON方法和通用图像编辑模型,同时在质量与速度间取得更好平衡。 Conclusion: 基于流匹配的Transformer结合潜在多模态条件与自参考加速,是实现高质量、训练高效且推理快速的虚拟试穿的有效方案。 Abstract: Virtual Try-on (VTON) has become a core capability for online retail, where realistic try-on results provide reliable fit guidance, reduce returns, and benefit both consumers and merchants. Diffusion-based VTON methods achieve photorealistic synthesis, yet often rely on intricate architectures such as auxiliary reference networks and suffer from slow sampling, making the trade-off between fidelity and efficiency a persistent challenge. We approach VTON as a structured image editing problem that demands strong conditional generation under three key requirements: subject preservation, faithful texture transfer, and seamless harmonization. Under this perspective, our training framework is generic and transfers to broader image editing tasks. Moreover, the paired data produced by VTON constitutes a rich supervisory resource for training general-purpose editors. We present PROMO, a promptable virtual try-on framework built upon a Flow Matching DiT backbone with latent multi-modal conditional concatenation. By leveraging conditioning efficiency and self-reference mechanisms, our approach substantially reduces inference overhead. On standard benchmarks, PROMO surpasses both prior VTON methods and general image editing models in visual fidelity while delivering a competitive balance between quality and speed. These results demonstrate that flow-matching transformers, coupled with latent multi-modal conditioning and self-reference acceleration, offer an effective and training-efficient solution for high-quality virtual try-on.

[115] UCAN: Unified Convolutional Attention Network for Expansive Receptive Fields in Lightweight Super-Resolution

Cao Thien Tan,Phan Thi Thu Trang,Do Nghiem Duc,Ho Ngoc Anh,Hanyang Zhuang,Nguyen Duc Dung

Main category: cs.CV

TL;DR: 本文提出UCAN,一种轻量级混合CNN-Transformer网络,通过统一卷积与注意力机制、引入Hedgehog Attention和蒸馏大核模块,并采用跨层参数共享,在保持高精度的同时显著降低计算开销,适用于资源受限设备的图像超分辨率任务。

Details Motivation: 现有混合CNN-Transformer模型在图像超分辨率中虽性能优异,但扩大注意力窗口或卷积核会显著增加计算成本,难以部署于资源受限设备。 Method: 提出UCAN网络:1)结合窗口化空间注意力与Hedgehog Attention以兼顾局部纹理与长程依赖;2)设计蒸馏式大核模块保留高频结构且避免高计算负担;3)采用跨层参数共享进一步压缩模型复杂度。 Result: 在Manga109(4×)上UCAN-L达31.63 dB PSNR,仅需48.4G MACs;在BSDS100上达27.79 dB,优于参数量大得多的同类方法;实验验证其在精度、效率与可扩展性间取得更优平衡。 Conclusion: UCAN是一种高效、轻量且可扩展的图像超分辨率模型,特别适合实际高分辨率图像恢复应用。 Abstract: Hybrid CNN-Transformer architectures achieve strong results in image super-resolution, but scaling attention windows or convolution kernels significantly increases computational cost, limiting deployment on resource-constrained devices. We present UCAN, a lightweight network that unifies convolution and attention to expand the effective receptive field efficiently. UCAN combines window-based spatial attention with a Hedgehog Attention mechanism to model both local texture and long-range dependencies, and introduces a distillation-based large-kernel module to preserve high-frequency structure without heavy computation. In addition, we employ cross-layer parameter sharing to further reduce complexity. On Manga109 ($4\times$), UCAN-L achieves 31.63 dB PSNR with only 48.4G MACs, surpassing recent lightweight models. On BSDS100, UCAN attains 27.79 dB, outperforming methods with significantly larger models. Extensive experiments show that UCAN achieves a superior trade-off between accuracy, efficiency, and scalability, making it well-suited for practical high-resolution image restoration.

[116] PolyCrysDiff: Controllable Generation of Three-Dimensional Computable Polycrystalline Material Structures

Chi Chen,Tianle Jiang,Xiaodong Wei,Yanming Wang

Main category: cs.CV

TL;DR: 本文提出PolyCrysDiff框架,基于条件潜在扩散模型实现可控、可计算的三维多晶微观结构生成,在形态、取向分布与空间相关性上高度保真,并通过CPFEM验证其物理有效性,进而揭示微观结构对力学性能的影响。

Details Motivation: 真实、可控地构建多晶材料三维微观结构是阐明结构-性能关系的关键,但目前仍具挑战性。 Method: 提出基于条件潜在扩散模型(conditional latent diffusion)的PolyCrysDiff框架,实现端到端、可控的3D多晶微观结构生成,并结合晶体塑性有限元(CPFEM)验证其物理有效性。 Result: PolyCrysDiff在晶粒形貌、取向分布和三维空间相关性上高度保真,晶粒属性(如尺寸、球形度)控制R²超0.972,优于MRF和CNN等主流方法;生成结构经CPFEM验证具备计算可行性与物理合理性。 Conclusion: PolyCrysDiff为加速数据驱动的多晶材料优化与设计提供了关键工具,推动了结构-性能关系的系统性解析。 Abstract: The three-dimensional (3D) microstructures of polycrystalline materials exert a critical influence on their mechanical and physical properties. Realistic, controllable construction of these microstructures is a key step toward elucidating structure-property relationships, yet remains a formidable challenge. Herein, we propose PolyCrysDiff, a framework based on conditional latent diffusion that enables the end-to-end generation of computable 3D polycrystalline microstructures. Comprehensive qualitative and quantitative evaluations demonstrate that PolyCrysDiff faithfully reproduces target grain morphologies, orientation distributions, and 3D spatial correlations, while achieving an $R^2$ over 0.972 on grain attributes (e.g., size and sphericity) control, thereby outperforming mainstream approaches such as Markov random field (MRF)- and convolutional neural network (CNN)-based methods. The computability and physical validity of the generated microstructures are verified through a series of crystal plasticity finite element method (CPFEM) simulations. Leveraging PolyCrysDiff's controllable generative capability, we systematically elucidate how grain-level microstructural characteristics affect the mechanical properties of polycrystalline materials. This development is expected to pave a key step toward accelerated, data-driven optimization and design of polycrystalline materials.

[117] COTONET: A custom cotton detection algorithm based on YOLO11 for stage of growth cotton boll detection

Guillem González,Guillem Alenyà,Sergi Foix

Main category: cs.CV

TL;DR: 本文提出COTONET,一种基于YOLO11改进的轻量级目标检测模型,专用于多阶段棉花铃识别;通过引入多种注意力机制(如SimAM、PHAM)和结构优化(如CARAFE、SE块),在保持低计算开销(7.6M参数,27.8 GFLOPS)的同时显著提升检测精度(mAP50达81.1%,mAP50-95达60.6%),适用于边缘设备与采棉机器人。

Details Motivation: 棉花采摘过程中物理操作易导致纤维降解,需精准识别不同生育期的棉铃以实现类人工的轻柔抓取,而现有检测方法对复杂田间环境下多阶段棉铃识别效果不佳。 Method: 提出COTONET模型:基于YOLO11定制,引入Squeeze-and-Excitation块替代卷积块、CARAFE替代标准上采样、SimAM用于主干特征聚合、PHAM用于下采样路径的通道/空间/坐标三重注意力,并利用非可学习梯度操作增强形状与特征提取能力。 Result: COTONET在棉铃检测任务中达到mAP50=81.1%、mAP50-95=60.6%,优于标准YOLO基线;模型仅含7.6M参数、27.8 GFLOPS,适配边缘计算与移动机器人平台。 Conclusion: COTONET通过融合多种注意力机制与轻量化结构设计,在保证实时性与部署可行性的同时,显著提升了田间多阶段棉铃检测的准确性与鲁棒性,为智能棉花采摘提供了可行的技术方案。 Abstract: Cotton harvesting is a critical phase where cotton capsules are physically manipulated and can lead to fibre degradation. To maintain the highest quality, harvesting methods must emulate delicate manual grasping, to preserve cotton's intrinsic properties. Automating this process requires systems capable of recognising cotton capsules across various phenological stages. To address this challenge, we propose COTONET, an enhanced custom YOLO11 model tailored with attention mechanisms to improve the detection of difficult instances. The architecture incorporates gradients in non-learnable operations to enhance shape and feature extraction. Key architectural modifications include: the replacement of convolutional blocks with Squeeze-and-Exitation blocks, a redesigned backbone integrating attention mechanisms, and the substitution of standard upsampling operations for Content Aware Reassembly of Features (CARAFE). Additionally, we integrate Simple Attention Modules (SimAM) for primary feature aggregation and Parallel Hybrid Attention Mechanisms (PHAM) for channel-wise, spatial-wise and coordinate-wise attention in the downward neck path. This configuration offers increased flexibility and robustness for interpreting the complexity of cotton crop growth. COTONET aligns with small-to-medium YOLO models utilizing 7.6M parameters and 27.8 GFLOPS, making it suitable for low-resource edge computing and mobile robotics. COTONET outperforms the standard YOLO baselines, achieving a mAP50 of 81.1% and a mAP50-95 of 60.6%.

[118] Cross-Resolution Attention Network for High-Resolution PM2.5 Prediction

Ammar Kheder,Helmi Toropainen,Wenqing Peng,Samuel Antão,Zhi-Song Liu,Michael Boy

Main category: cs.CV

TL;DR: CRAN-PM是一种双分支视觉Transformer,通过跨分辨率注意力机制高效融合全球气象数据(25 km)与本地高分辨率PM2.5数据(1 km),引入高程感知自注意力和风向引导的交叉注意力,提升物理一致性与预测精度,在欧洲大陆尺度(2900万像素)上实现快速、内存高效的PM2.5预测。

Details Motivation: Vision Transformer在时空预测中表现优异,但在超高清、洲际尺度环境监测任务中受限于自注意力机制的计算复杂度;需解决高分辨率输入(如1 km空气质图)带来的可扩展性瓶颈,并增强模型对物理规律(如地形、风场)的建模能力。 Method: 提出CRAN-PM:双分支ViT架构,采用跨分辨率注意力融合粗粒度气象数据与细粒度PM2.5观测;设计 elevation-aware self-attention 建模地形影响,wind-guided cross-attention 引入风场先验以约束特征学习;端到端可训练且内存高效。 Result: 在2022年欧洲每日PM2.5预测任务(362天、2971个EEA站点)上,相比最优单尺度基线,T+1和T+3预测RMSE分别降低4.7%和10.7%,复杂地形偏差降低36%;单GPU可在1.8秒内生成完整2900万像素欧洲地图。 Conclusion: CRAN-PM通过物理信息引导的注意力机制与跨分辨率建模,显著提升了超大规模环境预测任务的精度、效率与物理一致性,为洲际尺度高分辨率空气质量预报提供了可行方案。 Abstract: Vision Transformers have achieved remarkable success in spatio-temporal prediction, but their scalability remains limited for ultra-high-resolution, continent-scale domains required in real-world environmental monitoring. A single European air-quality map at 1 km resolution comprises 29 million pixels, far beyond the limits of naive self-attention. We introduce CRAN-PM, a dual-branch Vision Transformer that leverages cross-resolution attention to efficiently fuse global meteorological data (25 km) with local high-resolution PM2.5 at the current time (1 km). Instead of including physically driven factors like temperature and topography as input, we further introduce elevation-aware self-attention and wind-guided cross-attention to force the network to learn physically consistent feature representations for PM2.5 forecasting. CRAN-PM is fully trainable and memory-efficient, generating the complete 29-million-pixel European map in 1.8 seconds on a single GPU. Evaluated on daily PM2.5 forecasting throughout Europe in 2022 (362 days, 2,971 European Environment Agency (EEA) stations), it reduces RMSE by 4.7% at T+1 and 10.7% at T+3 compared to the best single-scale baseline, while reducing bias in complex terrain by 36%.

[119] VTEdit-Bench: A Comprehensive Benchmark for Multi-Reference Image Editing Models in Virtual Try-On

Xiaoye Liang,Zhiyuan Qu,Mingye Zou,Jiaxin Liu,Lai Jiang,Mai Xu,Yiheng Zhu

Main category: cs.CV

TL;DR: 本文提出VTEdit-Bench基准和VTEdit-QA评估器,系统评测通用多参考图像编辑模型在虚拟试穿(VTON)任务中的性能,发现顶尖通用编辑模型在常规任务上媲美专用模型,且在复杂场景中泛化更稳,但在多衣物条件等复杂参考配置下仍有挑战。

Details Motivation: 现有专用VTON模型难以应对日益增长的真实场景需求,而通用多参考图像编辑模型虽展现出强泛化能力,但其在VTON任务中的优势与局限尚缺乏系统性评估基准。 Method: 构建了包含24,220对测试图像、覆盖五类典型VTON任务的综合基准VTEdit-Bench;提出基于参考感知视觉语言模型的评估器VTEdit-QA,从模型一致性、衣物一致性和整体图像质量三方面评估VTON效果;对8个通用编辑模型和7个专用VTON模型进行系统评测。 Result: 顶尖通用编辑模型在常规VTON任务上性能接近专用模型,在更难场景中泛化更稳定,但在多衣物条件等复杂参考配置下表现仍受限。 Conclusion: 通用多参考图像编辑模型是构建灵活VTON系统的可行路径,但需进一步提升对复杂参考(尤其是多衣物)的建模能力;VTEdit-Bench和VTEdit-QA为该方向提供了重要评估工具。 Abstract: As virtual try-on (VTON) continues to advance, a growing number of real-world scenarios have emerged, pushing beyond the ability of the existing specialized VTON models. Meanwhile, universal multi-reference image editing models have progressed rapidly and exhibit strong generalization in visual editing, suggesting a promising route toward more flexible VTON systems. However, despite their strong capabilities, the strengths and limitations of universal editors for VTON remain insufficiently explored due to the lack of systematic evaluation benchmarks. To address this gap, we introduce VTEdit-Bench, a comprehensive benchmark designed to evaluate universal multi-reference image editing models across various realistic VTON scenarios. VTEdit-Bench contains 24,220 test image pairs spanning five representative VTON tasks with progressively increasing complexity, enabling systematic analysis of robustness and generalization. We further propose VTEdit-QA, a reference-aware VLM-based evaluator that assesses VTON performance from three key aspects: model consistency, cloth consistency, and overall image quality. Through this framework, we systematically evaluate eight universal editing models and compare them with seven specialized VTON models. Results show that top universal editors are competitive on conventional tasks and generalize more stably to harder scenarios, but remain challenged by complex reference configurations, particularly multi-cloth conditioning.

[120] SoulX-LiveAct: Towards Hour-Scale Real-Time Human Animation with Neighbor Forcing and ConvKV Memory

Dingcheng Zhen,Xu Zheng,Ruixin Zhang,Zhiqi Jiang,Yichao Yan,Ming Tao,Shunshun Yin

Main category: cs.CV

TL;DR: 本文提出Neighbor Forcing和ConvKV内存机制,解决自回归扩散模型在长时序人类动画生成中的学习信号不一致与历史状态无界增长问题,实现小时级实时生成与高效推理。

Details Motivation: 现有自回归扩散模型在小时级实时人类动画任务中面临两个关键挑战:一是强制策略导致样本级表征与扩散状态不匹配,造成学习信号不一致和收敛不稳定;二是历史表征无界增长且缺乏结构,阻碍缓存复用并严重限制推理效率。 Method: 提出Neighbor Forcing——一种扩散步一致的自回归建模方法,将时间邻近帧作为相同噪声条件下的潜在邻居进行传播;并设计结构化ConvKV内存机制,将因果注意力中的键值压缩为固定长度表示,实现常数内存推理和真正无限视频生成。 Result: 实验表明该方法显著提升训练收敛性、小时级生成质量与推理效率;LiveAct可在仅2块NVIDIA H100/H200 GPU上实现小时级实时人类动画与20 FPS流式推理;在唇动同步精度、人体动画质量与情感表现力方面达到SOTA,且推理成本最低。 Conclusion: Neighbor Forcing与ConvKV共同构成了一种高效、稳定、可扩展的自回归扩散框架,为长时序、高保真、低延迟的视频生成提供了新范式。 Abstract: Autoregressive (AR) diffusion models offer a promising framework for sequential generation tasks such as video synthesis by combining diffusion modeling with causal inference. Although they support streaming generation, existing AR diffusion methods struggle to scale efficiently. In this paper, we identify two key challenges in hour-scale real-time human animation. First, most forcing strategies propagate sample-level representations with mismatched diffusion states, causing inconsistent learning signals and unstable convergence. Second, historical representations grow unbounded and lack structure, preventing effective reuse of cached states and severely limiting inference efficiency. To address these challenges, we propose Neighbor Forcing, a diffusion-step-consistent AR formulation that propagates temporally adjacent frames as latent neighbors under the same noise condition. This design provides a distribution-aligned and stable learning signal while preserving drifting throughout the AR chain. Building upon this, we introduce a structured ConvKV memory mechanism that compresses the keys and values in causal attention into a fixed-length representation, enabling constant-memory inference and truly infinite video generation without relying on short-term motion-frame memory. Extensive experiments demonstrate that our approach significantly improves training convergence, hour-scale generation quality, and inference efficiency compared to existing AR diffusion methods. Numerically, LiveAct enables hour-scale real-time human animation and supports 20 FPS real-time streaming inference on as few as two NVIDIA H100 or H200 GPUs. Quantitative results demonstrate that our method attains state-of-the-art performance in lip-sync accuracy, human animation quality, and emotional expressiveness, with the lowest inference cost.

[121] Controllable Egocentric Video Generation via Occlusion-Aware Sparse 3D Hand Joints

Chenyangguang Zhang,Botao Ye,Boqi Chen,Alexandros Delitzas,Fangjinhua Wang,Marc Pollefeys,Xi Wang

Main category: cs.CV

TL;DR: 本文提出了一种基于稀疏3D手部关节点控制的新型框架,用于生成高保真、3D一致的自视角视频,解决了现有方法在严重自视角遮挡下运动不一致、幻觉伪影及跨具身泛化能力差的问题。

Details Motivation: 现有自视角视频生成方法依赖2D轨迹或隐式姿态,在严重自视角遮挡下易导致3D几何坍缩、运动不一致、伪影,并难以泛化到机器人手等非人具身形态。 Method: 提出以稀疏3D手部关节点为具身无关控制信号的新框架;设计高效控制模块:通过惩罚被遮挡关节的不可靠视觉信号提取遮挡感知特征,并采用基于3D的加权机制处理动态遮挡;将3D几何嵌入直接注入潜在空间以保证结构一致性;构建百万级高质量自视角视频数据集及跨具身基准。 Result: 在多个指标上显著超越SOTA方法,生成视频具有高保真度与真实交互感,并展现出优异的跨具身泛化能力(如迁移至机器人手)。 Conclusion: 稀疏3D关节点作为明确语义与几何结构的控制信号,结合遮挡感知与3D嵌入机制,可有效提升自视角视频生成的3D一致性与泛化性。 Abstract: Motion-controllable video generation is crucial for egocentric applications in virtual reality and embodied AI. However, existing methods often struggle to achieve 3D-consistent fine-grained hand articulation. By adopting on 2D trajectories or implicit poses, they collapse 3D geometry into spatially ambiguous signals or over rely on human-centric priors. Under severe egocentric occlusions, this causes motion inconsistencies and hallucinated artifacts, as well as preventing cross-embodiment generalization to robotic hands. To address these limitations, we propose a novel framework that generates egocentric videos from a single reference frame, leveraging sparse 3D hand joints as embodiment-agnostic control signals with clear semantic and geometric structures. We introduce an efficient control module that resolves occlusion ambiguities while fully preserving 3D information. Specifically, it extracts occlusion-aware features from the source reference frame by penalizing unreliable visual signals from hidden joints, and employs a 3D-based weighting mechanism to robustly handle dynamically occluded target joints during motion propagation. Concurrently, the module directly injects 3D geometric embeddings into the latent space to strictly enforce structural consistency. To facilitate robust training and evaluation, we develop an automated annotation pipeline that yields over one million high-quality egocentric video clips paired with precise hand trajectories. Additionally, we register humanoid kinematic and camera data to construct a cross-embodiment benchmark. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art baselines, generating high-fidelity egocentric videos with realistic interactions and exhibiting exceptional cross-embodiment generalization to robotic hands.

[122] HELM: Hierarchical and Explicit Label Modeling with Graph Learning for Multi-Label Image Classification

Marjan Stoimchev,Boshko Koloski,Jurica Levatić,Dragi Kocev,Sašo Džeroski

Main category: cs.CV

TL;DR: HELM是一个用于遥感图像层次多标签分类的新框架,通过引入层次特定的类标记、图卷积网络和自监督分支,有效处理多路径层次结构并利用无标签数据,在多个数据集上达到SOTA性能。

Details Motivation: 现有方法难以处理多路径层次结构,且很少利用无标签数据。 Method: HELM使用Vision Transformer中的层次特定类标记捕捉标签交互,用图卷积网络显式编码层次结构,并集成自监督分支利用无标签影像。 Result: 在UCM、AID、DFC-15和MLRSNet四个遥感图像数据集上,HELM在监督和半监督设置下均超越强基线,尤其在低标签场景中表现突出。 Conclusion: HELM成功解决了HMLC中多路径层次建模和无标签数据利用的关键挑战,显著提升了遥感图像分类性能。 Abstract: Hierarchical multi-label classification (HMLC) is essential for modeling complex label dependencies in remote sensing. Existing methods, however, struggle with multi-path hierarchies where instances belong to multiple branches, and they rarely exploit unlabeled data. We introduce HELM (\textit{Hierarchical and Explicit Label Modeling}), a novel framework that overcomes these limitations. HELM: (i) uses hierarchy-specific class tokens within a Vision Transformer to capture nuanced label interactions; (ii) employs graph convolutional networks to explicitly encode the hierarchical structure and generate hierarchy-aware embeddings; and (iii) integrates a self-supervised branch to effectively leverage unlabeled imagery. We perform a comprehensive evaluation on four remote sensing image (RSI) datasets (UCM, AID, DFC-15, MLRSNet). HELM achieves state-of-the-art performance, consistently outperforming strong baselines in both supervised and semi-supervised settings, demonstrating particular strength in low-label scenarios.

[123] Locating Demographic Bias at the Attention-Head Level in CLIP's Vision Encoder

Alaa Yasser,Kittipat Phunjanna,Marcos Escudero Viñolo,Catarina Barata,Jenny Benois-Pineau

Main category: cs.CV

TL;DR: 本文提出了一种机制性公平性审计方法,通过结合投影残差流分解、零样本概念激活向量和偏差增强的TextSpan分析,在视觉Transformer中定位个体注意力头级别的性别与年龄偏差,并在CLIP ViT-L-14模型上验证了其可行性。

Details Motivation: 标准公平性审计只能量化模型是否存在偏差,而无法定位偏差在模型内部的具体位置;本文旨在实现偏差的细粒度(注意力头级别)定位,以支持更精准的干预。 Method: 融合投影残差流分解(projected residual-stream decomposition)、零样本Concept Activation Vectors(CAVs)和偏差增强的TextSpan分析,构建机制性公平性审计流程,并应用于CLIP ViT-L-14在FACET基准42职业类上的性别与年龄偏差审计。 Result: 成功识别出4个终端层注意力头,消融后显著降低全局性别偏差(Cramer's V从0.381降至0.362),且准确率微升;其中单个最终层头主导对刻板印象最严重类别的修正;而年龄偏差则表现出更弥散的编码模式,消融效果较弱且不一致。 Conclusion: 注意力头级别的偏差定位在判别式视觉编码器中是可行的,但不同受保护属性(如性别 vs. 年龄)的偏差局部化程度存在差异,提示偏差机制具有属性依赖性。 Abstract: Standard fairness audits of foundation models quantify that a model is biased, but not where inside the network the bias resides. We propose a mechanistic fairness audit that combines projected residual-stream decomposition, zero-shot Concept Activation Vectors, and bias-augmented TextSpan analysis to locate demographic bias at the level of individual attention heads in vision transformers. As a feasibility case study, we apply this pipeline to the CLIP ViT-L-14 encoder on 42 profession classes of the FACET benchmark, auditing both gender and age bias. For gender, the pipeline identifies four terminal-layer heads whose ablation reduces global bias (Cramer's V: 0.381 -> 0.362) while marginally improving accuracy (+0.42%); a layer-matched random control confirms that this effect is specific to the identified heads. A single head in the final layer contributes to the majority of the reduction in the most stereotyped classes, and class-level analysis shows that corrected predictions shift toward the correct occupation. For age, the same pipeline identifies candidate heads, but ablation produces weaker and less consistent effects, suggesting that age bias is encoded more diffusely than gender bias in this model. These results provide preliminary evidence that head-level bias localisation is feasible for discriminative vision encoders and that the degree of localisability may vary across protected attributes. keywords: Bias . CLIP . Mechanistic Interpretability . Vision Transformer . Fairness

[124] Intrinsic Concept Extraction Based on Compositional Interpretability

Hanyu Shi,Hong Tao,Guoheng Huang,Jianbin Jiang,Xuhang Chen,Chi-Man Pun,Shanhu Wang,Pan Pan

Main category: cs.CV

TL;DR: 本文提出CI-ICE新任务,旨在从单张图像中提取可组合、可解释的内在概念,并设计HyperExpress方法,利用双曲空间建模与概念级优化实现概念解耦与组合重构。

Details Motivation: 现有无监督概念提取方法无法提取可组合的内在概念,限制了概念的可解释性与重建能力。 Method: 提出HyperExpress方法:1)利用双曲空间的层次建模能力实现概念解耦并保持层次与关系结构;2)引入概念级优化,映射概念嵌入空间以维持复杂概念关系并确保可组合性。 Result: 在单图中成功提取出具备组合性与可解释性的内在概念,性能优异。 Conclusion: CI-ICE任务及HyperExpress方法为图像概念理解提供了新范式,提升了概念的可组合性、可解释性与重建能力。 Abstract: Unsupervised Concept Extraction aims to extract concepts from a single image; however, existing methods suffer from the inability to extract composable intrinsic concepts. To address this, this paper introduces a new task called Compositional and Interpretable Intrinsic Concept Extraction (CI-ICE). The CI-ICE task aims to leverage diffusion-based text-to-image models to extract composable object-level and attribute-level concepts from a single image, such that the original concept can be reconstructed through the combination of these concepts. To achieve this goal, we propose a method called HyperExpress, which addresses the CI-ICE task through two core aspects. Specifically, first, we propose a concept learning approach that leverages the inherent hierarchical modeling capability of hyperbolic space to achieve accurate concept disentanglement while preserving the hierarchical structure and relational dependencies among concepts; second, we introduce a concept-wise optimization method that maps the concept embedding space to maintain complex inter-concept relationships while ensuring concept composability. Our method demonstrates outstanding performance in extracting compositionally interpretable intrinsic concepts from a single image.

[125] OSM-based Domain Adaptation for Remote Sensing VLMs

Stefan Maria Ailuro,Mario Markov,Mohammad Mahdi,Delyan Boychev,Luc Van Gool,Danda Pani Paudel

Main category: cs.CV

TL;DR: 本文提出OSMDA框架,通过将航拍图像与OpenStreetMap(OSM)瓦片配对,利用VLM自身的OCR和图表理解能力自动生成富含地理元数据的文本描述,从而实现无需人工标注、无需强教师模型的遥感领域自适应。

Details Motivation: 遥感视觉-语言模型(VLMs)严重依赖高质量图像-文本标注,但此类标注稀缺且昂贵;现有伪标签方法依赖大型教师模型,成本高、可扩展性差且性能受限于教师上限。 Method: 提出OSMDA:以基础VLM为自身标注引擎,将航拍图像与渲染的OSM瓦片配对,利用其OCR与图表理解能力生成带OSM元数据的文本描述,并仅用卫星图像微调得到OSMDA-VLM。 Result: 在10个跨模态遥感基准上全面评估,与9个强基线对比;等量混合真实数据时达到SOTA性能,训练成本显著低于依赖教师模型的方法。 Conclusion: 基于强基础模型与众包地理数据(如OSM)对齐,是遥感领域自适应的一条实用、可扩展路径;代码、数据集与模型权重将开源。 Abstract: Vision-Language Models (VLMs) adapted to remote sensing rely heavily on domain-specific image-text supervision, yet high-quality annotations for satellite and aerial imagery remain scarce and expensive to produce. Prevailing pseudo-labeling pipelines address this gap by distilling knowledge from large frontier models, but this dependence on large teachers is costly, limits scalability, and caps achievable performance at the ceiling of the teacher. We propose OSMDA: a self-contained domain adaptation framework that eliminates this dependency. Our key insight is that a capable base VLM can serve as its own annotation engine: by pairing aerial images with rendered OpenStreetMap (OSM) tiles, we leverage optical character recognition and chart comprehension capabilities of the model to generate captions enriched by OSM's vast auxiliary metadata. The model is then fine-tuned on the resulting corpus with satellite imagery alone, yielding OSMDA-VLM, a domain-adapted VLM that requires no manual labeling and no stronger external model. We conduct exhaustive evaluations spanning 10 benchmarks across image-text-to-text tasks and comparing against 9 competitive baselines. When equally mixed with real data, our method achieves state-of-the-art results, while being substantially cheaper to train than teacher-dependent alternatives. These results suggest that, given a strong foundation model, alignment with crowd-sourced geographic data is a practical and scalable path towards remote sensing domain adaptation. Dataset and model weights will be made publicly available.

[126] CEI-3D: Collaborative Explicit-Implicit 3D Reconstruction for Realistic and Fine-Grained Object Editing

Yue Shi,Rui Shi,Yuxuan Xiong,Bingbing Ni,Wenjun Zhang

Main category: cs.CV

TL;DR: 本文提出CEI-3D,一种面向编辑的3D重建流程,通过协同显式-隐式重建与物理属性解耦,实现更真实、细粒度且高效的3D编辑。

Details Motivation: 现有3D编辑方法因重建网络深度耦合,导致编辑结果不真实、不精细。 Method: 提出协同显式-隐式重建(SDF隐式网络 + 可微采样局部可控handler点),设计物理属性解耦模块(含双扩散-反照率网络)与空间感知编辑模块(基于跨视图传播的3D分割)。 Result: 在真实与合成数据集上均优于SOTA方法,编辑结果更真实、更精细,且耗时更少。 Conclusion: CEI-3D通过显隐协同表征与属性解耦,显著提升了3D编辑的真实性、可控性与效率。 Abstract: Existing 3D editing methods often produce unrealistic and unrefined results due to the deeply integrated nature of their reconstruction networks. To address the challenge, this paper introduces CEI-3D, an editing-oriented reconstruction pipeline designed to facilitate realistic and fine-grained editing. Specifically, we propose a collaborative explicit-implicit reconstruction approach, which represents the target object using an implicit SDF network and a differentially sampled, locally controllable set of handler points. The implicit network provides a smooth and continuous geometry prior, while the explicit handler points offer localized control, enabling mutual guidance between the global 3D structure and user-specified local editing regions. To independently control each attribute of the handler points, we design a physical properties disentangling module to decouple the color of the handler points into separate physical properties. We also propose a dual-diffuse-albedo network in this module to process the edited and non-edited regions through separate branches, thereby preventing undesired interference from editing operations. Building on the reconstructed collaborative explicit-implicit representation with disentangled properties, we introduce a spatial-aware editing module that enables part-wise adjustment of relevant handler points. This module employs a cross-view propagation-based 3D segmentation strategy, which helps users to edit the specified physical attributes of a target part efficiently. Extensive experiments on both real and synthetic datasets demonstrate that our approach achieves more realistic and fine-grained editing results than the state-of-the-art (SOTA) methods while requiring less editing time. Our code is available on https://github.com/shiyue001/CEI-3D.

[127] Multimodal classification of Radiation-Induced Contrast Enhancements and tumor recurrence using deep learning

Robin Peretzke,Marlin Hanstein,Maximilian Fischer,Lars Badhi Wessel,Obada Alhalabi,Sebastian Regnery,Andreas Kudak,Maximilian Deng,Tanja Eichkorn,Philipp Hoegen Saßmannshausen,Fabian Allmendinger,Jan-Hendrik Bolten,Philipp Schröter,Christine Jungk,Jürgen Peter Debus,Peter Neher,Laila König,Klaus Maier-Hein

Main category: cs.CV

TL;DR: 本文提出RICE-NET模型,结合纵向MRI与放疗剂量图,利用常规T1加权MRI自动区分胶质母细胞瘤治疗后复发与放射性增强,F1达0.92,验证了放疗图的关键作用及模型临床可解释性。

Details Motivation: 胶质母细胞瘤治疗后肿瘤复发与放射性增强的鉴别是临床难题;现有方法依赖稀少的扩散MRI或未纳入日益受重视的放疗剂量图。 Method: 提出RICE-NET——一种融合纵向MRI与放疗剂量分布的多模态3D深度学习模型,仅需常规T1加权MRI数据进行病灶分类,并通过消融实验和基于遮挡的可解释性分析评估各模态贡献与空间关注区域。 Result: 在92例患者队列上,模型在独立测试集上F1分数达0.92;消融实验证实放疗图对可靠分类起主导作用;可解释性分析显示模型聚焦于临床相关区域。 Conclusion: 多模态深度学习(尤其整合放疗剂量图)有望显著提升神经肿瘤学中诊断准确性与临床决策支持能力。 Abstract: The differentiation between tumor recurrence and radiation-induced contrast enhancements in post-treatment glioblastoma patients remains a major clinical challenge. Existing approaches rely on clinically sparsely available diffusion MRI or do not consider radiation maps, which are gaining increasing interest in the tumor board for this differentiation. We introduce RICE-NET, a multimodal 3D deep learning model that integrates longitudinal MRI data with radiotherapy dose distributions for automated lesion classification using conventional T1-weighted MRI data. Using a cohort of 92 patients, the model achieved an F1 score of 0.92 on an independent test set. During extensive ablation experiments, we quantified the contribution of each timepoint and modality and showed that reliable classification largely depends on the radiation map. Occlusion-based interpretability analyses further confirmed the model's focus on clinically relevant regions. These findings highlight the potential of multimodal deep learning to enhance diagnostic accuracy and support clinical decision-making in neuro-oncology.

[128] Towards High-Fidelity CAD Generation via LLM-Driven Program Generation and Text-Based B-Rep Primitive Grounding

Jiahao Li,Qingwang Zhang,Qiuyu Chen,Guozhan Qiu,Yunzhong Lou,Xiangdong Zhou

Main category: cs.CV

TL;DR: FutureCAD 是一个结合大语言模型(LLM)与B-Rep接地变换器(BRepGround)的文本到CAD生成框架,可生成可执行的CadQuery脚本,并通过自然语言实现几何元素选择与定位,显著提升AI驱动CAD建模能力。

Details Motivation: 现有CAD生成方法分为参数化建模与B-Rep合成两类,二者割裂导致难以支持复杂工业产品设计;而实际CAD系统中二者紧密耦合,亟需统一建模范式。 Method: 提出FutureCAD框架:1)用LLM生成CadQuery脚本并以自然语言描述几何选择;2)BRepGround将语言描述接地至B-Rep原始几何体;3)基于真实CAD数据集,先监督微调LLM,再用强化学习优化泛化能力。 Result: 在多项指标上达到SOTA性能,验证了文本驱动高保真、可执行CAD生成的可行性与有效性。 Conclusion: FutureCAD弥合了参数化建模与B-Rep表示之间的范式鸿沟,为AI驱动的工业级CAD设计提供了新路径。 Abstract: The field of Computer-Aided Design (CAD) generation has made significant progress in recent years. Existing methods typically fall into two separate categorie: parametric CAD modeling and direct boundary representation (B-Rep) synthesis. In modern feature-based CAD systems, parametric modeling and B-Rep are inherently intertwined, as advanced parametric operations (e.g., fillet and chamfer) require explicit selection of B-Rep geometric primitives, and the B-Rep itself is derived from parametric operations. Consequently, this paradigm gap remains a critical factor limiting AI-driven CAD modeling for complex industrial product design. This paper present FutureCAD, a novel text-to-CAD framework that leverages large language models (LLMs) and a B-Rep grounding transformer (BRepGround) for high-fidelity CAD generation. Our method generates executable CadQuery scripts, and introduces a text-based query mechanism that enables the LLM to specify geometric selections via natural language, which BRepGround then grounds to the target primitives. To train our framework, we construct a new dataset comprising real-world CAD models. For the LLM, we apply supervised fine-tuning (SFT) to establish fundamental CAD generation capabilities, followed by reinforcement learning (RL) to improve generalization. Experiments show that FutureCAD achieves state-of-the-art CAD generation performance.

[129] A Decade of Generative Adversarial Networks for Porous Material Reconstruction

Ali Sadeghkhani,Brandon Bennett,Masoud Babaei,Arash Rabbani

Main category: cs.CV

TL;DR: 本文综述了2017年至2026年初发表的96篇论文,系统分析了基于生成对抗网络(GAN)的多孔材料数字重建方法的发展与应用,归纳出六类GAN架构,并总结了在孔隙率精度、渗透率预测和重建体积等方面的进展及现存挑战。

Details Motivation: 传统多孔材料重建方法(如微CT和统计重建)存在局限,而深度学习尤其是GAN技术为高精度、大规模重建带来新机遇,亟需系统梳理其发展脉络与适用场景。 Method: 对96篇同行评议论文进行系统性文献综述,按架构将GAN分为六类(Vanilla、多尺度、条件、注意力增强、风格化、混合架构),并定量评估其在孔隙率、渗透率预测和重建尺度等指标上的性能。 Result: 识别出六类GAN架构;孔隙率重建误差控制在1%以内;渗透率预测平均相对误差降低达79%;最大重建体积从64³提升至2200³体素;揭示计算效率、内存限制和2D到3D结构连续性等关键挑战。 Conclusion: GAN在多孔材料重建中展现出显著优势,但需根据具体应用场景(如精度、尺度、计算资源)选择合适架构;未来研究应聚焦于提升效率、扩展规模及保障结构真实性。 Abstract: Digital reconstruction of porous materials has become increasingly critical for applications ranging from geological reservoir characterization to tissue engineering and electrochemical device design. While traditional methods such as micro-computed tomography and statistical reconstruction approaches have established foundations in this field, the emergence of deep learning techniques, particularly Generative Adversarial Networks (GANs), has revolutionized porous media reconstruction capabilities. This review systematically analyzes 96 peer-reviewed articles published from 2017 to early 2026, examining the evolution and applications of GAN-based approaches for porous material image reconstruction. We categorize GAN architectures into six distinct classes, namely Vanilla GANs, Multi-Scale GANs, Conditional GANs, Attention-Enhanced GANs, Style-based GANs, and Hybrid Architecture GANs. Our analysis reveals substantial progress including improvements in porosity accuracy (within 1% of original samples), permeability prediction (up to 79% reduction in mean relative errors), and achievable reconstruction volumes (from initial $64^3$ to current $2{,}200^3$ voxels). Despite these advances, persistent challenges remain in computational efficiency, memory constraints for large-scale reconstruction, and maintaining structural continuity in 2D-to-3D transformations. This systematic analysis provides a comprehensive framework for selecting appropriate GAN architectures based on specific application requirements.

[130] ZeroSense:How Vision matters in Long Context Compression

Yonghan Gao,Zehong Chen,Lijian Xu,Jingzhi Chen,Jingwei Guan,Xingyu Zeng

Main category: cs.CV

TL;DR: 本文提出了一种新的视觉-文本压缩(VTC)质量评估框架,通过解耦多模态大语言模型(MLLM)的能力,并设计ZeroSense基准来消除语义相关性与上下文依赖,从而更准确地衡量文本保真度。实验表明,现有基于下游任务的评估方式高估了VTC性能,VTC质量与下游准确率存在显著偏差。

Details Motivation: 现有VTC方法(如DeepSeek-OCR)虽在下游任务上表现优异,但其评估严重依赖下游MLLM的强语言先验,无法真实反映文本压缩后的保真度。 Method: 提出解耦MLLM能力的评估框架,并构建ZeroSense基准——该基准采用低语义相关性的测试样本,排除上下文推理干扰,使评估结果仅反映VTC本身的重建质量。 Result: 在多个数据集上的实验显示,VTC重建质量与下游任务准确率显著不一致,验证了传统评估方式的偏差及新框架的必要性。 Conclusion: 下游任务性能不能替代VTC质量评估;本文提出的解耦评估框架和ZeroSense基准为VTC研究提供了更可靠、更本质的评测标准。 Abstract: Recent visual-text compression (VTC) methods, typified by DeepSeek-OCR, report impressive high token compression ratios for long-context modeling tasks by leveraging text-to-image rendering. However, existing evaluation protocols heavily rely on downstream task performance. Such evaluation metrics fail to accurately measure text preservation due to the strong inherent linguistic priors of Multimodal Large Language Models (MLLMs). In this work, we introduce a new evaluation framework that decouples MLLMs' capabilities to faithfully assess VTC quality. Within this framework, we further introduce the ZeroSense Benchmark to ensure low semantic correlation of testing samples. By eliminating contextual dependencies, our benchmark guarantees that the evaluation results are purely reflective of VTC quality, unaffected by the semantic inference capabilities of downstream models. Extensive experiments across multiple datasets demonstrate that VTC quality and downstream task accuracy diverge significantly, highlighting the necessity of our decoupled evaluation framework.

[131] Derain-Agent: A Plug-and-Play Agent Framework for Rainy Image Restoration

Zhaocheng Yu,Xiang Chen,Runzhe Li,Zihan Geng,Guanglu Sun,Haipeng Li,Kui Jiang

Main category: cs.CV

TL;DR: 本文提出Derain-Agent,一种即插即用的动态去雨框架,通过规划网络和强度调制机制,实现对单张图像去雨结果的自适应精细化修复。

Details Motivation: 现有深度学习去雨模型采用静态推理范式,难以适应真实雨天图像中噪声、模糊、色彩偏差等复杂耦合退化,导致复原图像存在残余伪影和感知质量不一致问题。 Method: 提出Derain-Agent框架,包含两个核心模块:1)规划网络,为每个图像实例智能调度最优的修复工具序列;2)强度调制机制,实现工具在空间上的自适应强度应用。 Result: 该方法展现出强泛化能力,在合成与真实世界数据集上均能持续提升当前最优去雨模型的性能。 Conclusion: Derain-Agent成功将单图像去雨从静态处理转向动态、基于智能体的修复范式,显著改善了复原质量与鲁棒性。 Abstract: While deep learning has advanced single-image deraining, existing models suffer from a fundamental limitation: they employ a static inference paradigm that fails to adapt to the complex, coupled degradations (e.g., noise artifacts, blur, and color deviation) of real-world rain. Consequently, restored images often exhibit residual artifacts and inconsistent perceptual quality. In this work, we present Derain-Agent, a plug-and-play refinement framework that transitions deraining from static processing to dynamic, agent-based restoration. Derain-Agent equips a base deraining model with two core capabilities: 1) a Planning Network that intelligently schedules an optimal sequence of restoration tools for each instance, and 2) a Strength Modulation mechanism that applies these tools with spatially adaptive intensity. This design enables precise, region-specific correction of residual errors without the prohibitive cost of iterative search. Our method demonstrates strong generalization, consistently boosting the performance of state-of-the-art deraining models on both synthetic and real-world benchmarks.

[132] Single-View Rolling-Shutter SfM

Sofía Errázuriz Muñoz,Kim Kiehn,Petr Hruby,Kathlén Kohn

Main category: cs.CV

TL;DR: 本文提出了一种针对滚动快门(RS)相机的单视图几何建模方法,系统分析了从单张RS图像中可恢复的运动与场景参数,并推导出若干最小重建问题,通过实验验证了其可行性与实际限制。

Details Motivation: 滚动快门相机广泛存在,但其结构光三维重建(RS SfM)问题尚未完全解决。 Method: 刻画RS相机单视图下世界点/线的几何关系,据此分析可恢复的运动与场景参数,并系统推导最小重建问题。 Result: 提出了多个代表性最小重建问题,并通过概念验证求解器进行了评估,证实了方法的可行性,也揭示了实际应用中的局限性。 Conclusion: 该工作为RS SfM提供了理论基础和可行路径,明确了单图像RS重建的能力边界。 Abstract: Rolling-shutter (RS) cameras are ubiquitous, but RS SfM (structure-from-motion) has not been fully solved yet. This work suggests an approach to remedy this: We characterize RS single-view geometry of observed world points or lines. Exploiting this geometry, we describe which motion and scene parameters can be recovered from a single RS image and systematically derive minimal reconstruction problems. We evaluate several representative cases with proof-of-concept solvers, highlighting both feasibility and practical limitations.

[133] InSpatio-WorldFM: An Open-Source Real-Time Generative Frame Model

InSpatio Team,Xiaoyu Zhang,Weihong Pan,Zhichao Ye,Jialin Liu,Yipeng Chen,Nan Wang,Xiaojun Xiang,Weijian Xie,Yifu Wang,Haoyu Ji,Siji Pan,Zhewen Le,Jing Guo,Xianbin Liu,Donghui Shen,Ziqiang Zhao,Haomin Liu,Guofeng Zhang

Main category: cs.CV

TL;DR: InSpatio-WorldFM 是一种开源实时帧模型,通过独立生成每帧并结合显式3D锚点与隐式空间记忆,实现低延迟、多视角一致的空间智能推理。

Details Motivation: 解决视频式世界模型因窗口级处理导致的高延迟问题,满足实时空间推理需求。 Method: 采用帧独立生成范式,引入显式3D锚点和隐式空间记忆保障多视角一致性,并设计三阶段渐进训练流程(从图像扩散模型经可控帧模型到少步蒸馏为实时生成器)。 Result: 在消费级GPU上支持交互式探索,实现强多视角一致性与低延迟实时世界模拟。 Conclusion: InSpatio-WorldFM 为实时世界模拟提供了比传统视频式世界模型更高效的替代方案。 Abstract: We present InSpatio-WorldFM, an open-source real-time frame model for spatial intelligence. Unlike video-based world models that rely on sequential frame generation and incur substantial latency due to window-level processing, InSpatio-WorldFM adopts a frame-based paradigm that generates each frame independently, enabling low-latency real-time spatial inference. By enforcing multi-view spatial consistency through explicit 3D anchors and implicit spatial memory, the model preserves global scene geometry while maintaining fine-grained visual details across viewpoint changes. We further introduce a progressive three-stage training pipeline that transforms a pretrained image diffusion model into a controllable frame model and finally into a real-time generator through few-step distillation. Experimental results show that InSpatio-WorldFM achieves strong multi-view consistency while supporting interactive exploration on consumer-grade GPUs, providing an efficient alternative to traditional video-based world models for real-time world simulation.

[134] PicoSAM3: Real-Time In-Sensor Region-of-Interest Segmentation

Pietro Bonazzi,Nicola Farronato,Stefan Zihlmann,Haotong Qin,Michele Magno

Main category: cs.CV

TL;DR: PicoSAM3是一个轻量级、可提示的视觉分割模型,专为边缘和传感器端实时部署设计,在保持高精度的同时显著降低参数量和延迟。

Details Motivation: 满足低延迟、高隐私要求的应用(如智能眼镜和物联网设备)对实时、端侧分割的需求。 Method: 提出PicoSAM3模型,采用密集CNN架构、ROI提示编码、高效通道注意力机制,并通过知识蒸馏从SAM2/SAM3中学习;支持INT8量化并在索尼IMX500视觉传感器上部署。 Result: 在COCO和LVIS数据集上分别达到65.45%和64.01% mIoU,优于同类边缘模型;INT8量化后在IMX500上实现11.82ms延迟,且内存与算子约束完全合规;消融实验显示知识蒸馏带来最高+14.5% mIoU提升。 Conclusion: 验证了高质量、空间灵活的可提示分割可在传感器端直接实现,为边缘AI视觉任务提供了新范式。 Abstract: Real-time, on-device segmentation is critical for latency-sensitive and privacy-aware applications such as smart glasses and Internet-of-Things devices. We introduce PicoSAM3, a lightweight promptable visual segmentation model optimized for edge and in-sensor execution, including deployment on the Sony IMX500 vision sensor. PicoSAM3 has 1.3 M parameters and combines a dense CNN architecture with region of interest prompt encoding, Efficient Channel Attention, and knowledge distillation from SAM2 and SAM3. On COCO and LVIS, PicoSAM3 achieves 65.45% and 64.01% mIoU, respectively, outperforming existing SAM-based and edge-oriented baselines at similar or lower complexity. The INT8 quantized model preserves accuracy with negligible degradation while enabling real-time in-sensor inference at 11.82 ms latency on the IMX500, fully complying with its memory and operator constraints. Ablation studies show that distillation from large SAM models yields up to +14.5% mIoU improvement over supervised training and demonstrate that high-quality, spatially flexible promptable segmentation is feasible directly at the sensor level.

[135] Preliminary analysis of RGB-NIR Image Registration techniques for off-road forestry environments

Pankaj Deoli,Karthik Ranganath,Karsten Berns

Main category: cs.CV

TL;DR: 本文评估了经典和深度学习方法在林区越野应用中的RGB-NIR图像配准性能,发现NeMAR存在GAN损失不稳定问题,MURF在大尺度特征对齐上表现良好但难以处理密集植被的细节,需进一步改进以实现鲁棒的多尺度配准。

Details Motivation: RGB-NIR图像配准在传感器融合、图像增强和越野自主系统中至关重要,尤其在林区越野场景下亟需适配的配准方法。 Method: 评估了经典图像配准方法和基于深度学习的方法(包括NeMAR和MURF),其中NeMAR在6种配置下训练,MURF用于测试林区越野数据上的特征对齐能力。 Result: NeMAR部分成功但GAN损失不稳定,影响几何一致性;MURF在大尺度特征对齐上表现良好,但在密集植被区域的细节配准上表现不佳。 Conclusion: 当前方法尚不能满足林区越野应用中鲁棒、多尺度图像配准的需求,需进一步优化与改进。 Abstract: RGB-NIR image registration plays an important role in sensor-fusion, image enhancement and off-road autonomy. In this work, we evaluate both classical and Deep Learning (DL) based image registration techniques to access their suitability for off-road forestry applications. NeMAR, trained under 6 different configurations, demonstrates partial success however, its GAN loss instability suggests challenges in preserving geometric consistency. MURF, when tested on off-road forestry data shows promising large scale feature alignment during shared information extraction but struggles with fine details in dense vegetation. Even though this is just a preliminary evaluation, our study necessitates further refinements for robust, multi-scale registration for off-road forest applications.

[136] AstroSplat: Physics-Based Gaussian Splatting for Rendering and Reconstruction of Small Celestial Bodies

Jennifer Nolan,Travis Driver,John Christian

Main category: cs.CV

TL;DR: 本文提出AstroSplat,一种基于物理的高斯点绘框架,融合行星反射模型,提升小天体表面从原位图像中自主重建与光度表征的精度,验证显示其优于传统球谐参数化方法。

Details Motivation: 现有基于高斯点绘的表面重建方法依赖纯外观的球谐强度参数化,未显式建模材料属性和光-面相互作用,难以满足小天体任务对物理一致性和科学表征的需求。 Method: 提出AstroSplat框架,将行星反射模型(如Hapke模型)嵌入高斯点绘表示中,实现物理驱动的辐射度建模与优化;在NASA黎明号任务真实影像上进行端到端训练与验证。 Result: 在黎明号真实数据上,AstroSplat相比标准球谐参数化显著提升了渲染保真度与表面几何重建精度,同时支持光度反演与材质特性估计。 Conclusion: AstroSplat证明了将物理反射模型融入神经辐射场类方法的有效性,为深空探测中小天体的自主感知与科学分析提供了更可靠、可解释的视觉重建新范式。 Abstract: Image-based surface reconstruction and characterization are crucial for missions to small celestial bodies (e.g., asteroids), as it informs mission planning, navigation, and scientific analysis. Recent advances in Gaussian splatting enable high-fidelity neural scene representations but typically rely on a spherical harmonic intensity parameterization that is strictly appearance-based and does not explicitly model material properties or light-surface interactions. We introduce AstroSplat, a physics-based Gaussian splatting framework that integrates planetary reflectance models to improve the autonomous reconstruction and photometric characterization of small-body surfaces from in-situ imagery. The proposed framework is validated on real imagery taken by NASA's Dawn mission, where we demonstrate superior rendering performance and surface reconstruction accuracy compared to the typical spherical harmonic parameterization.

[137] Multimodal Emotion Recognition via Bi-directional Cross-Attention and Temporal Modeling

Junhyeong Byeon,Jeongyeol Kim,Sejoon Lim

Main category: cs.CV

TL;DR: 本文提出了一种用于野外视频情感识别的多模态框架,结合冻结的CLIP(视觉)和Wav2Vec 2.0(音频)主干网络、时序卷积网络(TCN)建模面部动态、双向交叉注意力融合模块,以及基于CLIP文本特征的对比学习目标,显著提升了ABAW第10届EXPR挑战赛上的性能。

Details Motivation: 单一模态(如面部或语音)难以应对野外视频中表情、姿态、光照、背景噪声及情感动态性等复杂变化,需多模态协同建模。 Method: 采用冻结的CLIP和Wav2Vec 2.0分别提取视觉与音频特征;用TCN建模固定长度视频窗口内的时序面部变化;引入双向交叉注意力融合视觉与音频特征;添加文本引导的对比损失以对齐视觉语义表征;最后通过轻量分类头预测情绪。 Result: 在ABAW第10届EXPR基准上,该框架优于单模态方法,提供了强多模态基线,验证了时序视觉建模、音频表征学习与跨模态融合的有效性。 Conclusion: 融合时序视觉建模、预训练音频表征与对称交叉注意力融合,并辅以文本引导对比学习,可显著提升野外环境下鲁棒的情感识别性能。 Abstract: Emotion recognition in in-the-wild video data remains a challenging problem due to large variations in facial appearance, head pose, illumination, background noise, and the inherently dynamic nature of human affect. Relying on a single modality, such as facial expressions or speech, is often insufficient to capture these complex emotional cues. To address this issue, we propose a multimodal emotion recognition framework for the Expression (EXPR) Recognition task in the 10th Affective Behavior Analysis in-the-wild (ABAW) Challenge. Our approach leverages large-scale pre-trained models, namely CLIP for visual encoding and Wav2Vec 2.0 for audio representation learning, as frozen backbone networks. To model temporal dependencies in facial expression sequences, we employ a Temporal Convolutional Network (TCN) over fixed-length video windows. In addition, we introduce a bi-directional cross-attention fusion module, in which visual and audio features interact symmetrically to enhance cross-modal contextualization and capture complementary emotional information. A lightweight classification head is then used for final emotion prediction. We further incorporate a text-guided contrastive objective based on CLIP text features to encourage semantically aligned visual representations. Experimental results on the ABAW 10th EXPR benchmark show that the proposed framework provides a strong multimodal baseline and achieves improved performance over unimodal modeling. These results demonstrate the effectiveness of combining temporal visual modeling, audio representation learning, and cross-modal fusion for robust emotion recognition in unconstrained real-world environments.

[138] HomeSafe-Bench: Evaluating Vision-Language Models on Unsafe Action Detection for Embodied Agents in Household Scenarios

Jiayue Pu,Zhongxiang Sun,Zilu Zhang,Xiao Zhang,Jun Xu

Main category: cs.CV

TL;DR: 本文提出了HomeSafe-Bench基准和HD-Guard安全监控架构,用于评估和提升家用场景下视觉语言模型对动态不安全行为的实时检测能力。

Details Motivation: 现有安全评估方法难以覆盖家庭环境中因感知延迟、常识缺失等导致的动态不安全行为,缺乏针对性基准与高效实时检测方案。 Method: 构建了融合物理仿真与视频生成的HomeSafe-Bench基准(含438个案例、六类功能区、多维细粒度标注);提出分层流式HD-Guard架构,包含高频轻量FastBrain与异步大模型SlowBrain协同工作。 Result: HD-Guard在延迟与性能间取得更优权衡;实验揭示了当前VLM在家庭安全检测中的关键瓶颈。 Conclusion: HomeSafe-Bench和HD-Guard为家庭机器人安全提供了新基准与实用架构,推动VLM向真实居家安全应用落地。 Abstract: The rapid evolution of embodied agents has accelerated the deployment of household robots in real-world environments. However, unlike structured industrial settings, household spaces introduce unpredictable safety risks, where system limitations such as perception latency and lack of common sense knowledge can lead to dangerous errors. Current safety evaluations, often restricted to static images, text, or general hazards, fail to adequately benchmark dynamic unsafe action detection in these specific contexts. To bridge this gap, we introduce \textbf{HomeSafe-Bench}, a challenging benchmark designed to evaluate Vision-Language Models (VLMs) on unsafe action detection in household scenarios. HomeSafe-Bench is contrusted via a hybrid pipeline combining physical simulation with advanced video generation and features 438 diverse cases across six functional areas with fine-grained multidimensional annotations. Beyond benchmarking, we propose \textbf{Hierarchical Dual-Brain Guard for Household Safety (HD-Guard)}, a hierarchical streaming architecture for real-time safety monitoring. HD-Guard coordinates a lightweight FastBrain for continuous high-frequency screening with an asynchronous large-scale SlowBrain for deep multimodal reasoning, effectively balancing inference efficiency with detection accuracy. Evaluations demonstrate that HD-Guard achieves a superior trade-off between latency and performance, while our analysis identifies critical bottlenecks in current VLM-based safety detection.

[139] Ada3Drift: Adaptive Training-Time Drifting for One-Step 3D Visuomotor Robotic Manipulation

Chongyang Xu,Yixian Zou,Ziliang Feng,Fanman Meng,Shuaicheng Liu

Main category: cs.CV

TL;DR: Ada3Drift是一种面向机器人控制的单步生成式视觉运动策略方法,通过在训练阶段引入漂移场(drifting field)实现高保真多模态动作生成,显著降低推理延迟,同时保持动作模式多样性。

Details Motivation: 扩散模型虽能建模多模态动作分布,但推理延迟高;流匹配与一致性模型虽快,却导致模态坍缩、轨迹不物理可行;而机器人场景中训练计算资源充裕、推理需实时,因此应将迭代优化从推理前移到训练阶段。 Method: 提出Ada3Drift:1)设计训练时漂移场,吸引预测动作向专家演示模态靠拢、排斥其他样本;2)采用sigmoid调度损失,从粗粒度分布学习逐步过渡到细粒度模态锐化;3)引入多尺度场聚合,捕捉不同空间粒度的动作模态;输入为3D点云。 Result: 在Adroit、Meta-World、RoboTwin三个仿真基准及真实机器人操作任务上达到SOTA性能,函数评估次数(NFE)仅为扩散模型的1/10(即10×加速)。 Conclusion: 通过将迭代精炼迁移至训练阶段,Ada3Drift在保证单步(1 NFE)高效推理的同时,成功恢复并保留了多模态动作结构,解决了速度与模态保真度之间的权衡难题。 Abstract: Diffusion-based visuomotor policies effectively capture multimodal action distributions through iterative denoising, but their high inference latency limits real-time robotic control. Recent flow matching and consistency-based methods achieve single-step generation, yet sacrifice the ability to preserve distinct action modes, collapsing multimodal behaviors into averaged, often physically infeasible trajectories. We observe that the compute budget asymmetry in robotics (offline training vs.\ real-time inference) naturally motivates recovering this multimodal fidelity by shifting iterative refinement from inference time to training time. Building on this insight, we propose Ada3Drift, which learns a training-time drifting field that attracts predicted actions toward expert demonstration modes while repelling them from other generated samples, enabling high-fidelity single-step generation (1 NFE) from 3D point cloud observations. To handle the few-shot robotic regime, Ada3Drift further introduces a sigmoid-scheduled loss transition from coarse distribution learning to mode-sharpening refinement, and multi-scale field aggregation that captures action modes at varying spatial granularities. Experiments on three simulation benchmarks (Adroit, Meta-World, and RoboTwin) and real-world robotic manipulation tasks demonstrate that Ada3Drift achieves state-of-the-art performance while requiring $10\times$ fewer function evaluations than diffusion-based alternatives.

[140] CrossEarth-SAR: A SAR-Centric and Billion-Scale Geospatial Foundation Model for Domain Generalizable Semantic Segmentation

Ziqi Ye,Ziyang Gong,Ning Liao,Xiaoxing Hu,Di Wang,Hongruixuan Chen,Chen Huang,Yiguo He,Yuru Jia,Xiaoxing Wang,Haipeng Wang,Xue Yang,Junchi Yan

Main category: cs.CV

TL;DR: 本文提出了CrossEarth-SAR,首个面向合成孔径雷达(SAR)图像跨域语义分割的十亿级视觉基础模型,采用物理引导的稀疏混合专家(MoE)架构,并构建了大规模数据集CrossEarth-SAR-200K与首个统一SAR领域泛化基准套件(22个子基准)。

Details Motivation: SAR图像因成像机制多样、传感器与地域差异导致严重域偏移,限制其语义泛化能力。 Method: 提出物理引导的稀疏MoE架构;构建弱监督与全监督融合的大规模数据集CrossEarth-SAR-200K;设计覆盖8类域差距的22子基准统一评估套件。 Result: 在22个基准中的20个上达到SOTA性能,多域迁移下部分基准mIoU提升超10%。 Conclusion: CrossEarth-SAR显著提升了SAR图像跨域语义分割的泛化能力,为SAR视觉基础模型研究奠定新基础。 Abstract: Synthetic Aperture Radar (SAR) enables global, all-weather earth observation. However, owing to diverse imaging mechanisms, domain shifts across sensors and regions severely hinder its semantic generalization. To address this, we present CrossEarth-SAR, the first billion-scale SAR vision foundation model built upon a novel physics-guided sparse mixture-of-experts (MoE) architecture incorporating physical descriptors, explicitly designed for cross-domain semantic segmentation. To facilitate large-scale pre-training, we develop CrossEarth-SAR-200K, a weakly and fully supervised dataset that unifies public and private SAR imagery. We also introduce a benchmark suite comprising 22 sub-benchmarks across 8 distinct domain gaps, establishing the first unified standard for domain generalization semantic segmentation on SAR imagery. Extensive experiments demonstrate that CrossEarth-SAR achieves state-of-the-art results on 20 benchmarks, surpassing previous methods by over 10\% mIoU on some benchmarks under multi-gap transfer. All code, benchmark and datasets will be publicly available.

[141] Pano360: Perspective to Panoramic Vision with Geometric Consistency

Zhengdong Zhu,Weiyi Xue,Zuyuan Yang,Wenlve Zhou,Zhiheng Zhou

Main category: cs.CV

TL;DR: 本文提出了一种基于3D光摄影测量空间和Transformer架构的全景拼接新方法,通过利用相机位姿引导3D空间中的全局对齐,并结合多特征联合优化计算接缝,显著提升了弱纹理、大视差和重复图案等挑战性场景下的拼接精度与视觉质量。

Details Motivation: 现有全景拼接方法严重依赖两两图像间的2D特征匹配,难以保证多视角间的几何一致性,导致在弱纹理、大视差和重复纹理等复杂场景中出现严重畸变和错位。 Method: 将拼接任务扩展至3D光摄影测量空间,采用新型Transformer架构实现3D感知与全局信息聚合;利用相机位姿指导图像在3D空间中的扭曲与全局对齐,并引入多特征联合优化策略计算最优接缝。 Result: 在自建的大规模真实场景数据集上实验表明,该方法在对齐精度和感知质量上均显著优于现有方法。 Conclusion: 在3D空间中进行多视角几何建模与对齐是提升全景拼接鲁棒性与精度的有效途径,所提Transformer架构与多特征联合优化策略具有良好的泛化性和实用性。 Abstract: Prior panorama stitching approaches heavily rely on pairwise feature correspondences and are unable to leverage geometric consistency across multiple views. This leads to severe distortion and misalignment, especially in challenging scenes with weak textures, large parallax, and repetitive patterns. Given that multi-view geometric correspondences can be directly constructed in 3D space, making them more accurate and globally consistent, we extend the 2D alignment task to the 3D photogrammetric space. We adopt a novel transformer-based architecture to achieve 3D awareness and aggregate global information across all views. It directly utilizes camera poses to guide image warping for global alignment in 3D space and employs a multi-feature joint optimization strategy to compute the seams. Additionally, to establish an evaluation benchmark and train our network, we constructed a large-scale dataset of real-world scenes. Extensive experiments show that our method significantly outperforms existing alternatives in alignment accuracy and perceptual quality.

[142] Nyxus: A Next Generation Image Feature Extraction Library for the Big Data and AI Era

Nicholas Schaub,Andriy Kharchenko,Hamdah Abbasi,Sameeul Samee,Hythem Sidky,Nathan Hotaling

Main category: cs.CV

TL;DR: Nyxus是一个为大规模2D/3D图像数据设计的可扩展、支持内存外计算的特征提取库,覆盖多生物医学领域,提供多种接口(Python、CLI、Napari插件、OCI容器),并支持程序化调优以适配机器学习应用。

Details Motivation: 现代成像设备产生海量图像数据,传统分析算法在效率、鲁棒性和准确性上难以兼顾;深度学习提升了分割精度,但领域专用特征提取库繁多且缺乏统一性能评估标准。 Method: 从零设计Nyuxs特征提取库,支持CPU/GPU可扩展计算与内存外处理;通过严格基准测试验证;提供Python包、命令行工具、Napari插件和OCI容器等多种部署形式;支持程序化调优特征集以平衡计算效率与覆盖度。 Result: Nyxus实现了对放射组学和细胞分析等多领域的全覆盖,具备高可扩展性与跨平台兼容性,并已集成至可视化、低代码及超算/云工作流中。 Conclusion: Nyxus解决了大规模图像特征提取中的可扩展性、互操作性与易用性瓶颈,为科学图像分析提供了统一、高效、可定制的新范式。 Abstract: Modern imaging instruments can produce terabytes to petabytes of data for a single experiment. The biggest barrier to processing big image datasets has been computational, where image analysis algorithms often lack the efficiency needed to process such large datasets or make tradeoffs in robustness and accuracy. Deep learning algorithms have vastly improved the accuracy of the first step in an analysis workflow (region segmentation), but the expansion of domain specific feature extraction libraries across scientific disciplines has made it difficult to compare the performance and accuracy of extracted features. To address these needs, we developed a novel feature extraction library called Nyxus. Nyxus is designed from the ground up for scalable out-of-core feature extraction for 2D and 3D image data and rigorously tested against established standards. The comprehensive feature set of Nyxus covers multiple biomedical domains including radiomics and cellular analysis, and is designed for computational scalability across CPUs and GPUs. Nyxus has been packaged to be accessible to users of various skill sets and needs: as a Python package for code developers, a command line tool, as a Napari plugin for low to no-code users or users that want to visualize results, and as an Open Container Initiative (OCI) compliant container that can be used in cloud or super-computing workflows aimed at processing large data sets. Further, Nyxus enables a new methodological approach to feature extraction allowing for programmatic tuning of many features sets for optimal computational efficiency or coverage for use in novel machine learning and deep learning applications.

[143] Single Pixel Image Classification using an Ultrafast Digital Light Projector

Aisha Kanwal,Graeme E. Johnstone,Fahimeh Dehkhoda,Johannes H. Herrnsdorf,Robert K. Henderson,Martin D. Dawson,Xavier Porte,Michael J. Strain

Main category: cs.CV

TL;DR: 本文提出了一种结合单像素成像(SPI)与低复杂度机器学习模型(如ELM和浅层DNN)的超高速图像分类方法,实现了多kHz帧率下的实时MNIST数字分类,并跳过传统图像重建步骤,适用于自动驾驶和异常检测等场景。

Details Motivation: 为满足自动驾驶等应用对复杂动态环境信息实时分类的需求,需突破传统图像处理在速度与计算开销上的瓶颈。 Method: 采用microLED-on-CMOS数字光投影仪实现超快单像素成像(SPI),结合极简结构的极限学习机(ELM)和反向传播训练的浅层深度神经网络,在不重建图像的前提下直接进行时空域特征分类。 Result: 在MNIST基准任务上实现了多kHz帧率的实时分类;ELM在保持低计算开销的同时表现出良好准确率,并验证了其作为二分类器用于超快异常检测的有效性。 Conclusion: 基于SPI与低复杂度模型的端到端分类范式可显著提升图像识别速度与能效,尤其适用于资源受限或需超快响应的机器视觉任务。 Abstract: Pattern recognition and image classification are essential tasks in machine vision. Autonomous vehicles, for example, require being able to collect the complex information contained in a changing environment and classify it in real time. Here, we experimentally demonstrate image classification at multi-kHz frame rates combining the technique of single pixel imaging (SPI) with a low complexity machine learning model. The use of a microLED-on-CMOS digital light projector for SPI enables ultrafast pattern generation for sub-ms image encoding. We investigate the classification accuracy of our experimental system against the broadly accepted benchmarking task of the MNIST digits classification. We compare the classification performance of two machine learning models: An extreme learning machine (ELM) and a backpropagation trained deep neural network. The complexity of both models is kept low so the overhead added to the inference time is comparable to the image generation time. Crucially, our single pixel image classification approach is based on a spatiotemporal transformation of the information, entirely bypassing the need for image reconstruction. By exploring the performance of our SPI based ELM as binary classifier we demonstrate its potential for efficient anomaly detection in ultrafast imaging scenarios.

[144] Continual Learning with Vision-Language Models via Semantic-Geometry Preservation

Chiyuan He,Zihuan Qiu,Fanman Meng,Runtong Zhang,Linfeng Xu,Qingbo Wu,Hongliang Li

Main category: cs.CV

TL;DR: 本文提出SeGP-CL方法,在无样本持续学习中通过对抗锚点探测语义漂移区域,并利用锚点引导的跨模态几何蒸馏与文本语义几何正则化,有效缓解预训练视觉语言模型在新任务微调中的灾难性遗忘与语义几何畸变。

Details Motivation: 现有持续学习方法在适配新任务时未显式保持预训练及历史阶段继承的跨模态语义几何结构,导致新任务监督引发几何畸变,尤其在新旧语义交界处出现显著漂移。 Method: 提出Semantic Geometry Preservation for Continual Learning(SeGP-CL):1)用双目标投影梯度下降(DPGD)构建紧凑对抗锚点集以探测易漂移区域;2)锚点引导的跨模态几何蒸馏(ACGD)维持跨模态结构;3)轻量级文本语义几何正则化(TSGR)稳定文本参考系;4)训练后基于锚点估计原始空间漂移,迁移旧视觉原型并进行双路径推理融合。 Result: 在五个持续学习基准上广泛实验表明,SeGP-CL持续提升稳定性与前向迁移能力,达到当前最优性能,并更好保持VLM的语义几何结构。 Conclusion: 显式建模和保护跨模态语义几何结构是缓解VLM持续学习中灾难性遗忘的关键,SeGP-CL为无样本设定下VLM持续学习提供了有效且可扩展的几何一致性范式。 Abstract: Continual learning of pretrained vision-language models (VLMs) is prone to catastrophic forgetting, yet current approaches adapt to new tasks without explicitly preserving the cross-modal semantic geometry inherited from pretraining and previous stages, allowing new-task supervision to induce geometric distortion. We observe that the most pronounced drift tends to concentrate in vulnerable neighborhoods near the old-new semantic interface, where shared visual patterns are easily re-explained by new textual semantics. To address this under an exemplar-free constraint, we propose Semantic Geometry Preservation for Continual Learning (SeGP-CL). SeGP-CL first probes the drift-prone region by constructing a compact set of adversarial anchors with dual-targeted projected gradient descent (DPGD), which drives selected new-task seeds toward old-class semantics while remaining faithful in raw visual space. During training, we preserve cross-modal structure by anchor-guided cross-modal geometry distillation (ACGD), and stabilize the textual reference frame across tasks via a lightweight text semantic-geometry regularization (TSGR). After training, we estimate anchor-induced raw-space drift to transfer old visual prototypes and perform dual-path inference by fusing cross-modal and visual cues. Extensive experiments on five continual learning benchmarks demonstrate that SeGP-CL consistently improves stability and forward transfer, achieving state-of-the-art performance while better preserving semantic geometry of VLMs.

[145] Coarse-Guided Visual Generation via Weighted h-Transform Sampling

Yanghao Wang,Ziqi Jiang,Zhen Wang,Long Chen

Main category: cs.CV

TL;DR: 本文提出了一种无需训练的粗到细视觉生成方法,利用h-transform约束扩散采样过程,在未知前向变换的情况下实现高质量、高保真度的生成。

Details Motivation: 现有基于训练的方法成本高、泛化差;而训练-free方法要么依赖已知前向变换(如双三次下采样),要么难以平衡引导强度与生成质量。 Method: 提出基于h-transform的引导机制,在扩散采样的每一步修改转移概率,通过添加漂移项引导生成方向,并引入噪声水平感知的权重衰减策略来缓解近似误差。 Result: 在多种图像和视频生成任务上验证了方法的有效性和强泛化能力,无需配对数据或预设前向算子。 Conclusion: h-transform提供了一种灵活、鲁棒且无需训练的粗到细生成新范式,兼顾引导精度与合成质量。 Abstract: Coarse-guided visual generation, which synthesizes fine visual samples from degraded or low-fidelity coarse references, is essential for various real-world applications. While training-based approaches are effective, they are inherently limited by high training costs and restricted generalization due to paired data collection. Accordingly, recent training-free works propose to leverage pretrained diffusion models and incorporate guidance during the sampling process. However, these training-free methods either require knowing the forward (fine-to-coarse) transformation operator, e.g., bicubic downsampling, or are difficult to balance between guidance and synthetic quality. To address these challenges, we propose a novel guided method by using the h-transform, a tool that can constrain stochastic processes (e.g., sampling process) under desired conditions. Specifically, we modify the transition probability at each sampling timestep by adding to the original differential equation with a drift function, which approximately steers the generation toward the ideal fine sample. To address unavoidable approximation errors, we introduce a noise-level-aware schedule that gradually de-weights the term as the error increases, ensuring both guidance adherence and high-quality synthesis. Extensive experiments across diverse image and video generation tasks demonstrate the effectiveness and generalization of our method.

[146] NBAvatar: Neural Billboards Avatars with Realistic Hand-Face Interaction

David Svitov,Mahtab Dahaghin

Main category: cs.CV

TL;DR: NBAvatar是一种结合定向平面基元与神经渲染的新方法,用于高质量、真实感地渲染受手脸交互影响的头部虚拟人,显著提升新视角与新姿态渲染质量。

Details Motivation: 现有方法难以同时处理手脸交互引起的非刚性形变与保持高保真外观细节,尤其在新视角和新姿态渲染中表现不足。 Method: 提出NBAvatar方法,融合显式的定向平面基元建模(保障时序与姿态一致的几何结构)与隐式的神经渲染(捕获精细外观及手脸交互导致的颜色变化)。 Result: 在高分辨率百万像素渲染下,相比基于高斯的虚拟人方法LPIPS降低最多30%,PSNR和SSIM提升;相比InteractAvatar,在结构相似性上更优。 Conclusion: NBAvatar通过显隐结合表征有效建模复杂手脸交互下的动态头部几何与外观,为实时高质量头像渲染提供了新范式。 Abstract: We present NBAvatar - a method for realistic rendering of head avatars handling non-rigid deformations caused by hand-face interaction. We introduce a novel representation for animated avatars by combining the training of oriented planar primitives with neural rendering. Such a combination of explicit and implicit representations enables NBAvatar to handle temporally and pose-consistent geometry, along with fine-grained appearance details provided by the neural rendering technique. In our experiments, we demonstrate that NBAvatar implicitly learns color transformations caused by face-hand interactions and surpasses existing approaches in terms of novel-view and novel-pose rendering quality. Specifically, NBAvatar achieves up to 30% LPIPS reduction under high-resolution megapixel rendering compared to Gaussian-based avatar methods, while also improving PSNR and SSIM, and achieves higher structural similarity compared to the state-of-the-art hand-face interaction method InteractAvatar.

[147] Dense Dynamic Scene Reconstruction and Camera Pose Estimation from Multi-View Videos

Shuo Sun,Unal Artan,Malcolm Mielle,Achim J. Lilienthaland,Martin Magnusson

Main category: cs.CV

TL;DR: 本文提出了一种用于多自由移动相机下稠密动态场景重建与相机位姿估计的两阶段优化框架,通过构建时空连接图和宽基线初始化策略提升鲁棒性,并在新提出的MultiCamRobolab真实数据集上验证了其优越性能。

Details Motivation: 现有方法仅支持单相机输入或需刚性校准的相机阵列,难以应对多自由移动相机拍摄共享事件的实际场景。 Method: 采用两阶段优化框架:第一阶段扩展单相机视觉SLAM,构建利用时序连续性和跨相机空间重叠的时空连接图,并引入基于前馈重建模型的宽基线初始化;第二阶段利用宽基线光流优化稠密跨相机与单相机一致性以联合 refine 深度和位姿。 Result: 在合成与真实世界基准(含新发布的MultiCamRobolab数据集)上显著超越现有前馈模型,且内存占用更低。 Conclusion: 所提方法有效解决了多自由移动相机下的动态场景重建与位姿估计难题,在鲁棒性、精度和效率上均取得实质性提升。 Abstract: We address the challenging problem of dense dynamic scene reconstruction and camera pose estimation from multiple freely moving cameras -- a setting that arises naturally when multiple observers capture a shared event. Prior approaches either handle only single-camera input or require rigidly mounted, pre-calibrated camera rigs, limiting their practical applicability. We propose a two-stage optimization framework that decouples the task into robust camera tracking and dense depth refinement. In the first stage, we extend single-camera visual SLAM to the multi-camera setting by constructing a spatiotemporal connection graph that exploits both intra-camera temporal continuity and inter-camera spatial overlap, enabling consistent scale and robust tracking. To ensure robustness under limited overlap, we introduce a wide-baseline initialization strategy using feed-forward reconstruction models. In the second stage, we refine depth and camera poses by optimizing dense inter- and intra-camera consistency using wide-baseline optical flow. Additionally, we introduce MultiCamRobolab, a new real-world dataset with ground-truth poses from a motion capture system. Finally, we demonstrate that our method significantly outperforms state-of-the-art feed-forward models on both synthetic and real-world benchmarks, while requiring less memory.

[148] Beyond Convolution: A Taxonomy of Structured Operators for Learning-Based Image Processing

Simone Cammarasana

Main category: cs.CV

TL;DR: 本文提出了一种系统性的分类法,将替代或扩展标准卷积的算子分为五类,并对每类的定义、性质及适用任务进行了分析与比较。

Details Motivation: 标准卷积作为CNN的核心组件,虽具简洁性与平移等变性,但其固定、线性、局部平均的结构限制了对低秩分解、自适应基表示和非均匀空间依赖等结构化信号特性的建模能力。 Method: 构建了一个涵盖五大家族的算子分类体系:基于分解的算子、自适应加权算子、基自适应算子、积分与核算子、注意力机制算子;并对每类进行形式化定义、结构性对比和任务适用性分析,最后从线性、局部性、等变性、计算开销及任务类型等维度开展横向比较。 Result: 提出了首个系统、全面的卷积替代算子分类框架,明确了各类算子的理论特性与实际适用场景,并指出了当前研究的开放挑战与未来方向。 Conclusion: 扩展或替代卷积的算子具有多样性和互补性,需根据具体任务需求(如是否需等变性、是否强调全局建模)选择合适算子家族;统一分类有助于推动图像处理中算子设计的理论化与工程化发展。 Abstract: The convolution operator is the fundamental building block of modern convolutional neural networks (CNNs), owing to its simplicity, translational equivariance, and efficient implementation. However, its structure as a fixed, linear, locally-averaging operator limits its ability to capture structured signal properties such as low-rank decompositions, adaptive basis representations, and non-uniform spatial dependencies. This paper presents a systematic taxonomy of operators that extend or replace the standard convolution in learning-based image processing pipelines. We organise the landscape of alternative operators into five families: (i) decomposition-based operators, which separate structural and noise components through singular value or tensor decompositions; (ii) adaptive weighted operators, which modulate kernel contributions as a function of spatial position or signal content; (iii) basis-adaptive operators, which optimise the analysis bases together with the network weights; (iv) integral and kernel operators, which generalise the convolution to position-dependent and non-linear kernels; and (v) attention-based operators, which relax the locality assumption entirely. For each family, we provide a formal definition, a discussion of its structural properties with respect to the convolution, and a critical analysis of the tasks for which the operator is most appropriate. We further provide a comparative analysis of all families across relevant dimensions -- linearity, locality, equivariance, computational cost, and suitability for image-to-image and image-to-label tasks -- and outline the open challenges and future directions of this research area.

[149] Paper Title: LoV3D: Grounding Cognitive Prognosis Reasoning in Longitudinal 3D Brain MRI via Regional Volume Assessments

Zhaoyang Jiang,Zhizhong Fu,David McAllister,Yunsoo Kim,Honghan Wu

Main category: cs.CV

TL;DR: LoV3D是一种用于纵向脑MRI分析的3D视觉-语言模型训练流程,能进行区域级解剖评估、纵向对比,并输出三类诊断及诊断摘要,通过临床加权验证器实现无标注偏好优化,在多个数据集上显著提升诊断准确率与泛化能力。

Details Motivation: 现有深度学习工具在纵向脑MRI分析中存在碎片化问题:分类器仅输出标签,体积分割缺乏解释性,视觉-语言模型易产生幻觉;亟需一种兼具可解释性、纵向一致性与生物合理性的端到端诊断框架。 Method: 提出LoV3D流水线:输入纵向T1加权MRI,依次完成区域级解剖评估、与前序扫描的纵向比较,最终输出三类诊断(正常/轻度认知障碍/痴呆)及诊断摘要;引入临床加权Verifier,基于标准化体积指标自动评分候选输出,驱动无监督的Direct Preference Optimization。 Result: 在ADNI测试集上三类诊断准确率达93.7%(较无约束基线+34.8%),二类诊断97.2%(较SOTA+4%),区域解剖分类82.6%(较VLM基线+33.1%);零样本迁移至MIRIAD和AIBL数据集分别达95.4%和82.9%准确率,且MIRIAD上痴呆召回率达100%。 Conclusion: LoV3D通过分步结构化推理与临床先验引导的无标注优化,显著提升了纵向神经影像诊断的准确性、可靠性与跨中心泛化能力,为可信AI辅助诊断提供了新范式。 Abstract: Longitudinal brain MRI is essential for characterizing the progression of neurological diseases such as Alzheimer's disease assessment. However, current deep-learning tools fragment this process: classifiers reduce a scan to a label, volumetric pipelines produce uninterpreted measurements, and vision-language models (VLMs) may generate fluent but potentially hallucinated conclusions. We present LoV3D, a pipeline for training 3D vision-language models, which reads longitudinal T1-weighted brain MRI, produces a region-level anatomical assessment, conducts longitudinal comparison with the prior scan, and finally outputs a three-class diagnosis (Cognitively Normal, Mild Cognitive Impairment, or Dementia) along with a synthesized diagnostic summary. The stepped pipeline grounds the final diagnosis by enforcing label consistency, longitudinal coherence, and biological plausibility, thereby reducing the risks of hallucinations. The training process introduces a clinically-weighted Verifier that scores candidate outputs automatically against normative references derived from standardized volume metrics, driving Direct Preference Optimization without a single human annotation. On a subject-level held-out ADNI test set (479 scans, 258 subjects), LoV3D achieves 93.7% three-class diagnostic accuracy (+34.8% over the no-grounding baseline), 97.2% on two-class diagnosis accuracy (+4% over the SOTA) and 82.6% region-level anatomical classification accuracy (+33.1% over VLM baselines). Zero-shot transfer yields 95.4% on MIRIAD (100% Dementia recall) and 82.9% three-class accuracy on AIBL, confirming high generalizability across sites, scanners, and populations. Code is available at https://github.com/Anonymous-TEVC/LoV-3D.

[150] Node-RF: Learning Generalized Continuous Space-Time Scene Dynamics with Neural ODE-based NeRFs

Hiran Sarkar,Liming Kuang,Yordanka Velikova,Benjamin Busam

Main category: cs.CV

TL;DR: Node-RF 结合神经ODE与动态NeRF,实现连续时空建模,支持长时序外推且内存开销恒定。

Details Motivation: 现有方法仅能在观测边界内建模场景动态,难以外推到训练序列之外;需一种能泛化至未见运动模式的连续时空表示方法。 Method: 将神经ODE(NODE)嵌入动态NeRF框架,通过ODE求解器隐式演化场景状态特征;利用NeRF渲染器将演化后的特征映射为任意视角图像,支持长程外推;多序列联合训练以共享动力学先验。 Result: Node-RF在多个运动序列上训练后,可泛化至未见运动条件,准确外推远超训练长度的场景动态,并能识别系统关键点以支撑未来预测。 Conclusion: Node-RF提供了一种内存高效、具泛化能力的连续时空建模范式,无需显式物理模型即可刻画抽象系统行为,显著提升视觉动态预测的外推能力。 Abstract: Predicting scene dynamics from visual observations is challenging. Existing methods capture dynamics only within observed boundaries failing to extrapolate far beyond the training sequence. Node-RF (Neural ODE-based NeRF) overcomes this limitation by integrating Neural Ordinary Differential Equations (NODEs) with dynamic Neural Radiance Fields (NeRFs), enabling a continuous-time, spatiotemporal representation that generalizes beyond observed trajectories at constant memory cost. From visual input, Node-RF learns an implicit scene state that evolves over time via an ODE solver, propagating feature embeddings via differential calculus. A NeRF-based renderer interprets calculated embeddings to synthesize arbitrary views for long-range extrapolation. Training on multiple motion sequences with shared dynamics allows for generalization to unseen conditions. Our experiments demonstrate that Node-RF can characterize abstract system behavior without explicit model to identify critical points for future predictions.

[151] Towards Universal Computational Aberration Correction in Photographic Cameras: A Comprehensive Benchmark Analysis

Xiaolong Qian,Qi Jiang,Yao Gao,Lei Sun,Zhonghua Yi,Kailun Yang,Luc Van Gool,Kaiwei Wang

Main category: cs.CV

TL;DR: 本文提出UniCAC基准和ODE评估框架,系统评估24种CAC算法,揭示影响性能的三大关键因素:先验利用、网络架构和训练策略。

Details Motivation: 现有计算像差校正(CAC)方法泛化能力差、针对特定光学系统需重复训练,且缺乏涵盖广泛光学像差的综合基准,难以评估和提升跨镜头通用性。 Method: 构建大规模摄影相机基准UniCAC(基于自动光学设计),提出光学退化评估器(ODE)量化像差难度,并对24种图像恢复与CAC算法进行系统实验与对比分析。 Result: 识别出影响CAC性能的三个最关键因素(先验利用、网络架构、训练策略),并通过实验验证其各自影响;提供了可公开访问的基准、代码和Zemax文件。 Conclusion: UniCAC基准与ODE框架为CAC研究提供了可靠评估基础,所揭示的关键影响因素为未来通用CAC方法的设计与优化提供了重要指导。 Abstract: Prevalent Computational Aberration Correction (CAC) methods are typically tailored to specific optical systems, leading to poor generalization and labor-intensive re-training for new lenses. Developing CAC paradigms capable of generalizing across diverse photographic lenses offers a promising solution to these challenges. However, efforts to achieve such cross-lens universality within consumer photography are still in their early stages due to the lack of a comprehensive benchmark that encompasses a sufficiently wide range of optical aberrations. Furthermore, it remains unclear which specific factors influence existing CAC methods and how these factors affect their performance. In this paper, we present comprehensive experiments and evaluations involving 24 image restoration and CAC algorithms, utilizing our newly proposed UniCAC, a large-scale benchmark for photographic cameras constructed via automatic optical design. The Optical Degradation Evaluator (ODE) is introduced as a novel framework to objectively assess the difficulty of CAC tasks, offering credible quantification of optical aberrations and enabling reliable evaluation. Drawing on our comparative analysis, we identify three key factors -- prior utilization, network architecture, and training strategy -- that most significantly influence CAC performance, and further investigate their respective effects. We believe that our benchmark, dataset, and observations contribute foundational insights to related areas and lay the groundwork for future investigations. Benchmarks, codes, and Zemax files will be available at https://github.com/XiaolongQian/UniCAC.

[152] EvoTok: A Unified Image Tokenizer via Residual Latent Evolution for Visual Understanding and Generation

Yan Li,Ning Liao,Xiangyu Zhao,Shaofeng Zhang,Xiaoxing Wang,Yifan Yang,Junchi Yan,Xue Yang

Main category: cs.CV

TL;DR: 本文提出EvoTok,一种通过残差向量量化在共享潜在空间中实现视觉理解与生成统一的图像分词器,解决了二者在表征粒度上的根本矛盾。

Details Motivation: 现有方法难以协调视觉理解(需高层语义)与图像生成(需像素级细节)之间的粒度差异,导致表征干扰或不一致。 Method: 提出EvoTok:采用残差向量量化将图像编码为级联残差token序列,在共享潜在空间中构建从低层细节到高层语义的演化轨迹。 Result: 仅用1300万图像训练即在ImageNet-1K上达0.43 rFID重建质量;集成大语言模型后在9项视觉理解基准中7项表现优异,并在GenEval和GenAI-Bench等生成基准上取得显著效果。 Conclusion: 将视觉表征建模为演化轨迹,是统一视觉理解与生成的一种有效且原理清晰的解决方案。 Abstract: The development of unified multimodal large language models (MLLMs) is fundamentally challenged by the granularity gap between visual understanding and generation: understanding requires high-level semantic abstractions, while image generation demands fine-grained pixel-level representations. Existing approaches usually enforce the two supervision on the same set of representation or decouple these two supervision on separate feature spaces, leading to interference and inconsistency, respectively. In this work, we propose EvoTok, a unified image tokenizer that reconciles these requirements through a residual evolution process within a shared latent space. Instead of maintaining separate token spaces for pixels and semantics, EvoTok encodes an image into a cascaded sequence of residual tokens via residual vector quantization. This residual sequence forms an evolution trajectory where earlier stages capture low-level details and deeper stages progressively transition toward high-level semantic representations. Despite being trained on a relatively modest dataset of 13M images, far smaller than the billion-scale datasets used by many previous unified tokenizers, EvoTok achieves a strong reconstruction quality of 0.43 rFID on ImageNet-1K at 256x256 resolution. When integrated with a large language model, EvoTok shows promising performance across 7 out of 9 visual understanding benchmarks, and remarkable results on image generation benchmarks such as GenEval and GenAI-Bench. These results demonstrate that modeling visual representations as an evolving trajectory provides an effective and principled solution for unifying visual understanding and generation.

[153] Hoi3DGen: Generating High-Quality Human-Object-Interactions in 3D

Agniv Sharma,Xianghui Xie,Tom Fischer,Eddy Ilg,Gerard Pons-Moll

Main category: cs.CV

TL;DR: 本文提出Hoi3DGen框架,通过多模态大语言模型构建高质量人-物交互数据,并设计端到端文本到3D生成流程,显著提升文本一致性与3D模型质量。

Details Motivation: 现有文本生成3D人-物交互方法受限于高质量交互数据稀缺,导致生成结果存在Janus问题且文本遵循度低。 Method: 首先利用多模态大语言模型构建真实、高质量的人-物交互数据集;然后构建完整的文本到3D生成管线。 Result: 在文本一致性上超越基线4–15倍,在3D模型质量上提升3–7倍,具备跨类别和交互类型的强泛化能力。 Conclusion: Hoi3DGen能精准依据文本描述生成高质量带纹理的人-物交互3D网格,有效解决现有方法的 fidelity 和 consistency 瓶颈。 Abstract: Modeling and generating 3D human-object interactions from text is crucial for applications in AR, XR, and gaming. Existing approaches often rely on score distillation from text-to-image models, but their results suffer from the Janus problem and do not follow text prompts faithfully due to the scarcity of high-quality interaction data. We introduce Hoi3DGen, a framework that generates high-quality textured meshes of human-object interaction that follow the input interaction descriptions precisely. We first curate realistic and high-quality interaction data leveraging multimodal large language models, and then create a full text-to-3D pipeline, which achieves orders-of-magnitude improvements in interaction fidelity. Our method surpasses baselines by 4-15x in text consistency and 3-7x in 3D model quality, exhibiting strong generalization to diverse categories and interaction types, while maintaining high-quality 3D generation.

[154] HATS: Hardness-Aware Trajectory Synthesis for GUI Agents

Rui Shao,Ruize Gao,Bin Xie,Yixing Li,Kaiwen Zhou,Shuai Wang,Weili Guan,Gongwei Chen

Main category: cs.CV

TL;DR: 本文提出HATS框架,通过硬度感知的轨迹合成解决GUI代理训练中语义模糊动作导致的泛化能力不足问题,提升代理在真实场景中的鲁棒性。

Details Motivation: 现有GUI代理轨迹合成方法忽视语义模糊动作(如上下文依赖、时序依赖或视觉模糊的动作),导致训练数据语义失准、代理泛化能力差。 Method: 提出HATS框架,包含两个闭环模块:(1) 硬度驱动探索——聚焦采集高语义模糊但信息丰富的交互轨迹;(2) 对齐引导精炼——迭代验证并修复指令与执行之间的语义对齐。 Result: 在多个GUI基准环境中,HATS训练的代理持续优于当前最优基线方法。 Conclusion: 语义模糊性是影响GUI代理泛化能力的关键因素,HATS通过显式建模和处理动作硬度,有效缓解语义失准问题,显著提升代理鲁棒性与性能。 Abstract: Graphical user interface (GUI) agents powered by large vision-language models (VLMs) have shown remarkable potential in automating digital tasks, highlighting the need for high-quality trajectory data to support effective agent training. Yet existing trajectory synthesis pipelines often yield agents that fail to generalize beyond simple interactions. We identify this limitation as stemming from the neglect of semantically ambiguous actions, whose meanings are context-dependent, sequentially dependent, or visually ambiguous. Such actions are crucial for real-world robustness but are under-represented and poorly processed in current datasets, leading to semantic misalignment between task instructions and execution. To address these issues, we propose HATS, a Hardness-Aware Trajectory Synthesis framework designed to mitigate the impact of semantic ambiguity. We define hardness as the degree of semantic ambiguity associated with an action and develop two complementary modules: (1) hardness-driven exploration, which guides data collection toward ambiguous yet informative interactions, and (2) alignment-guided refinement, which iteratively validates and repairs instruction-execution alignment. The two modules operate in a closed loop: exploration supplies refinement with challenging trajectories, while refinement feedback updates the hardness signal to guide future exploration. Extensive experiments show that agents trained with HATS consistently outperform state-of-the-art baselines across benchmark GUI environments.

[155] O3N: Omnidirectional Open-Vocabulary Occupancy Prediction

Mengfei Duan,Hao Shi,Fei Teng,Guoqiang Zhao,Yuheng Zhang,Zhiyong Li,Kailun Yang

Main category: cs.CV

TL;DR: 本文提出了O3N,首个纯视觉、端到端的全向开放词汇占用预测框架,通过极螺旋Mamba(PsM)、占用代价聚合(OCA)和自然模态对齐(NMA)模块,实现360°连续空间建模、几何语义一致性重建与像素-体素-文本三元统一表征,在多个基准上达到SOTA并具备强跨场景泛化与语义扩展能力。

Details Motivation: 现有3D占用预测方法受限于窄视角输入和预定义训练分布,难以满足具身智能体在开放世界探索中对全面、安全场景感知的需求。 Method: 提出O3N框架,包含:1)Polar-spiral Mamba(PsM)模块,以极螺旋拓扑嵌入全向体素,支持连续空间表示与长程上下文建模;2)Occupancy Cost Aggregation(OCA)模块,统一几何与语义监督;3)Natural Modality Alignment(NMA)模块,实现无梯度的视觉-体素-文本特征对齐。 Result: 在QuadOcc和Human360Occ基准上达到SOTA性能,并展现出优异的跨场景泛化能力和语义可扩展性。 Conclusion: O3N为通用3D世界建模提供了新范式,推动了具身智能与自主代理在开放环境中三维理解与重建的发展。 Abstract: Understanding and reconstructing the 3D world through omnidirectional perception is an inevitable trend in the development of autonomous agents and embodied intelligence. However, existing 3D occupancy prediction methods are constrained by limited perspective inputs and predefined training distribution, making them difficult to apply to embodied agents that require comprehensive and safe perception of scenes in open world exploration. To address this, we present O3N, the first purely visual, end-to-end Omnidirectional Open-vocabulary Occupancy predictioN framework. O3N embeds omnidirectional voxels in a polar-spiral topology via the Polar-spiral Mamba (PsM) module, enabling continuous spatial representation and long-range context modeling across 360°. The Occupancy Cost Aggregation (OCA) module introduces a principled mechanism for unifying geometric and semantic supervision within the voxel space, ensuring consistency between the reconstructed geometry and the underlying semantic structure. Moreover, Natural Modality Alignment (NMA) establishes a gradient-free alignment pathway that harmonizes visual features, voxel embeddings, and text semantics, forming a consistent "pixel-voxel-text" representation triad. Extensive experiments on multiple models demonstrate that our method not only achieves state-of-the-art performance on QuadOcc and Human360Occ benchmarks but also exhibits remarkable cross-scene generalization and semantic scalability, paving the way toward universal 3D world modeling. The source code will be made publicly available at https://github.com/MengfeiD/O3N.

[156] FlashMotion: Few-Step Controllable Video Generation with Trajectory Guidance

Quanhao Li,Zhen Xing,Rui Wang,Haidong Cao,Qi Dai,Daoguo Dong,Zuxuan Wu

Main category: cs.CV

TL;DR: 本文提出FlashMotion框架,用于少步长轨迹可控视频生成,通过训练轨迹适配器、蒸馏多步生成器为少步版本,并结合扩散与对抗目标微调适配器,显著提升生成速度、视频质量与轨迹精度;同时构建新基准FlashBench进行综合评估。

Details Motivation: 现有轨迹可控视频生成方法依赖多步去噪过程,计算开销大;而直接将视频蒸馏方法应用于该任务会导致视频质量与轨迹精度明显下降。 Method: 提出FlashMotion训练框架:1)在多步视频生成器上训练轨迹适配器实现精确控制;2)将生成器蒸馏为少步版本以加速;3)采用融合扩散与对抗目标的混合策略微调适配器,使其适配少步生成器。同时构建FlashBench基准评测长序列轨迹可控视频生成性能。 Result: 在两种适配器架构上的实验表明,FlashMotion在视频视觉质量和轨迹一致性上均优于现有视频蒸馏方法及多步模型。 Conclusion: FlashMotion有效解决了少步长下轨迹可控视频生成中质量与精度难以兼顾的问题,实现了高效、高质、高精度的生成。 Abstract: Recent advances in trajectory-controllable video generation have achieved remarkable progress. Previous methods mainly use adapter-based architectures for precise motion control along predefined trajectories. However, all these methods rely on a multi-step denoising process, leading to substantial time redundancy and computational overhead. While existing video distillation methods successfully distill multi-step generators into few-step, directly applying these approaches to trajectory-controllable video generation results in noticeable degradation in both video quality and trajectory accuracy. To bridge this gap, we introduce FlashMotion, a novel training framework designed for few-step trajectory-controllable video generation. We first train a trajectory adapter on a multi-step video generator for precise trajectory control. Then, we distill the generator into a few-step version to accelerate video generation. Finally, we finetune the adapter using a hybrid strategy that combines diffusion and adversarial objectives, aligning it with the few-step generator to produce high-quality, trajectory-accurate videos. For evaluation, we introduce FlashBench, a benchmark for long-sequence trajectory-controllable video generation that measures both video quality and trajectory accuracy across varying numbers of foreground objects. Experiments on two adapter architectures show that FlashMotion surpasses existing video distillation methods and previous multi-step models in both visual quality and trajectory consistency.

[157] EgoIntent: An Egocentric Step-level Benchmark for Understanding What, Why, and Next

Ye Pan,Chi Kit Wong,Yuanhuiyi Lyu,Hanqian Li,Jiahao Huo,Jiacheng Chen,Lutao Jiang,Xu Zheng,Xuming Hu

Main category: cs.CV

TL;DR: 本文提出了EgoIntent基准,用于评估多模态大语言模型(MLLMs)在第一人称视频中细粒度步骤级意图理解能力,涵盖局部意图(What)、全局意图(Why)和下一步计划(Next)三个维度,实验表明当前模型在此任务上表现仍很有限。

Details Motivation: 现有基准仅关注片段级意图推理,缺乏对更精细的步骤级意图理解(如每一步的‘做什么’、‘为什么做’及‘接下来做什么’)的评估,而智能助手、机器人模仿学习和增强现实指导等应用亟需此类能力。 Method: 构建了EgoIntent基准:包含3014个步骤、覆盖15种室内外日常场景;每个视频片段在关键动作发生前即截断,避免未来帧泄露,确保评估的是真正的前瞻性步骤理解与规划能力;在15个主流MLLM上进行三维度评测。 Result: 所有被测模型表现欠佳,最佳模型在三个意图维度上的平均得分仅为33.31。 Conclusion: 步骤级意图理解在第一人称视频中仍是极具挑战性的开放问题,亟需进一步研究。 Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable video reasoning capabilities across diverse tasks. However, their ability to understand human intent at a fine-grained level in egocentric videos remains largely unexplored. Existing benchmarks focus primarily on episode-level intent reasoning, overlooking the finer granularity of step-level intent understanding. Yet applications such as intelligent assistants, robotic imitation learning, and augmented reality guidance require understanding not only what a person is doing at each step, but also why and what comes next, in order to provide timely and context-aware support. To this end, we introduce EgoIntent, a step-level intent understanding benchmark for egocentric videos. It comprises 3,014 steps spanning 15 diverse indoor and outdoor daily-life scenarios, and evaluates models on three complementary dimensions: local intent (What), global intent (Why), and next-step plan (Next). Crucially, each clip is truncated immediately before the key outcome of the queried step (e.g., contact or grasp) occurs and contains no frames from subsequent steps, preventing future-frame leakage and enabling a clean evaluation of anticipatory step understanding and next-step planning. We evaluate 15 MLLMs, including both state-of-the-art closed-source and open-source models. Even the best-performing model achieves an average score of only 33.31 across the three intent dimensions, underscoring that step-level intent understanding in egocentric videos remains a highly challenging problem that calls for further investigation.

[158] GlyphBanana: Advancing Precise Text Rendering Through Agentic Workflows

Zexuan Yan,Jiarui Jin,Yue Ma,Shijian Wang,Jiahui Hu,Wenxiang Jiao,Yuan Lu,Linfeng Zhang

Main category: cs.CV

TL;DR: 本文提出GlyphBanana方法,通过无需训练的智能体工作流,在潜在空间和注意力图中注入字形模板,提升文本与数学公式渲染精度,并构建专用基准进行评估。

Details Motivation: 当前生成模型在处理分布外提示时指令遵循能力有限,导致复杂文本和数学公式渲染不准确。 Method: 提出GlyphBanana,采用无需训练的智能体工作流,结合辅助工具将字形模板注入潜在空间和注意力图,实现图像迭代优化。 Result: 该方法可无缝适配多种文本到图像模型,在复杂字符与公式渲染任务上显著优于现有基线。 Conclusion: GlyphBanana是一种通用、高效且无需训练的文本渲染增强方法,配合专用基准推动了该领域发展。 Abstract: Despite recent advances in generative models driving significant progress in text rendering, accurately generating complex text and mathematical formulas remains a formidable challenge. This difficulty primarily stems from the limited instruction-following capabilities of current models when encountering out-of-distribution prompts. To address this, we introduce GlyphBanana, alongside a corresponding benchmark specifically designed for rendering complex characters and formulas. GlyphBanana employs an agentic workflow that integrates auxiliary tools to inject glyph templates into both the latent space and attention maps, facilitating the iterative refinement of generated images. Notably, our training-free approach can be seamlessly applied to various Text-to-Image (T2I) models, achieving superior precision compared to existing baselines. Extensive experiments demonstrate the effectiveness of our proposed workflow. Associated code is publicly available at https://github.com/yuriYanZeXuan/GlyphBanana.

[159] LatentGeo: Learnable Auxiliary Constructions in Latent Space for Multimodal Geometric Reasoning

Haiying Xu,Zihan Wang,Song Dai,Zhengxuan Zhang,Kairan Dou,Xuming Hu

Main category: cs.CV

TL;DR: 本文提出LatentGeo框架,通过学习连续潜在视觉表示来内化辅助几何构造,避免像素级渲染和外部执行器,并设计三阶段课程学习与Latent-aware强化学习(LaGDPO)提升几何推理性能。

Details Motivation: 现有方法在表示辅助几何构造时存在空间关系建模不准确、符号与几何结构表征不匹配、依赖外部工具阻碍端到端优化等问题。 Method: 提出LatentGeo框架:1)学习连续潜在视觉表示以隐式编码辅助构造;2)三阶段课程学习(含辅助视觉监督)对齐并内化表征;3)引入LaGDPO潜变量感知强化学习稳定训练并提升任务正确率。 Result: 在新基准GeoAux及MathVerse上显著提升需辅助构造的几何推理任务性能;消融实验验证各组件有效性。 Conclusion: LatentGeo有效解决了MLLM中辅助几何构造的表征难题,实现了无需外部工具、端到端可优化的几何推理能力提升。 Abstract: Despite recent advances in multimodal reasoning, representing auxiliary geometric constructions remains a fundamental challenge for multimodal large language models (MLLMs). Such constructions are absent from the original diagram and must be introduced before theorems apply. Existing approaches predominantly rely on explicit construction paradigms, including text-based geometric specification, visual-token interleaving during reasoning, and tool-augmented geometric execution. However, these methods either fail to faithfully represent complex spatial relationships, incur representation mismatch between discrete symbols and continuous geometric structures, or rely on external capabilities that hinder end-to-end optimization. To address these limitations, we propose LatentGeo, a framework that learns continuous latent visual representations to internalize auxiliary geometric constructions without pixel-level rendering or external executors. We design a three-stage curriculum that progressively aligns and internalizes these latent representations through auxiliary visual supervision, followed by LaGDPO, a latent-aware reinforcement learning procedure that stabilizes latent representations during policy optimization while improving end-task correctness. To systematically evaluate construction-centric representation quality, we introduce GeoAux, a new benchmark targeting visually dependent geometry problems, and conduct experiments on GeoAux and MathVerse. Results show that LatentGeo achieves substantial gains on geometric reasoning tasks, particularly those requiring auxiliary constructions. Extensive analyses and ablation studies further validate the effectiveness of each component in our framework.

[160] BehaviorVLM: Unified Finetuning-Free Behavioral Understanding with Vision-Language Reasoning

Jingyang Ke,Weihan Li,Amartya Pradhan,Jeffrey Markowitz,Anqi Wu

Main category: cs.CV

TL;DR: 本文提出BehaviorVLM,一种无需任务特定微调、仅需极少人工标注的统一视觉-语言框架,用于自由移动动物的姿态估计与行为理解。其通过引导预训练视觉-语言模型进行显式、可验证的多步推理,结合量子点标注数据与几何校验提升姿态估计鲁棒性,并融合嵌入聚类、视频字幕与大语言模型推理实现端到端行为发现与语义标注。

Details Motivation: 现有动物姿态估计与行为理解方法严重依赖人工标注或不稳定的无监督流程,制约了可扩展性与可复现性。 Method: 提出BehaviorVLM框架:姿态估计部分采用基于量子点数据的多阶段 pipeline,融合时序、空间与跨视角推理,并通过重投影误差等几何检验暴露低置信度标签;行为理解部分结合深度嵌入聚类发现过分割行为片段、VLM逐片段视频描述、LLM语义融合与标注,全程无需关键点输入。 Result: 显著降低人工标注需求;生成可验证、可过滤、可修正的姿态标签;实现无需关键点的端到端行为发现与语义标注;支持多动物、可解释、轻标注的大规模行为分析。 Conclusion: BehaviorVLM为神经科学研究中自由行为分析提供了高可扩展性、强可解释性与低标注依赖的新范式,推动从神经活动到自然行为的可靠建模。 Abstract: Understanding freely moving animal behavior is central to neuroscience, where pose estimation and behavioral understanding form the foundation for linking neural activity to natural actions. Yet both tasks still depend heavily on human annotation or unstable unsupervised pipelines, limiting scalability and reproducibility. We present BehaviorVLM, a unified vision-language framework for pose estimation and behavioral understanding that requires no task-specific finetuning and minimal human labeling by guiding pretrained Vision-Language Models (VLMs) through detailed, explicit, and verifiable reasoning steps. For pose estimation, we leverage quantum-dot-grounded behavioral data and propose a multi-stage pipeline that integrates temporal, spatial, and cross-view reasoning. This design greatly reduces human annotation effort, exposes low-confidence labels through geometric checks such as reprojection error, and produces labels that can later be filtered, corrected, or used to fine-tune downstream pose models. For behavioral understanding, we propose a pipeline that integrates deep embedded clustering for over-segmented behavior discovery, VLM-based per-clip video captioning, and LLM-based reasoning to merge and semantically label behavioral segments. The behavioral pipeline can operate directly from visual information and does not require keypoints to segment behavior. Together, these components enable scalable, interpretable, and label-light analysis of multi-animal behavior.

[161] ForensicZip: More Tokens are Better but Not Necessary in Forensic Vision-Language Models

Yingxin Lai,Zitong Yu,Jun Wang,Linlin Shen,Yong Xu,Xiaochun Cao

Main category: cs.CV

TL;DR: 本文提出ForensicZip,一种无需训练的视觉令牌压缩框架,从伪造驱动的角度出发,通过建模时间令牌演化为出生-死亡最优传输问题,并结合高频先验,有效保留伪造痕迹,在大幅降低计算开销的同时保持高检测性能。

Details Motivation: 现有视觉令牌剪枝方法多基于语义驱动,易丢弃包含伪造痕迹(如高频异常、时序抖动)的背景区域,难以兼顾加速与 forensic 有效性。 Method: 提出ForensicZip框架:将时间维度令牌演化建模为带松弛虚拟节点的Birth-Death Optimal Transport问题,量化物理不连续性;融合传输新颖性得分与高频先验,实现伪造证据与语义内容的分离。 Result: 在deepfake和AIGC基准上,仅保留10%令牌时,实现2.97倍加速与超90% FLOPs下降,同时维持SOTA检测性能。 Conclusion: ForensicZip验证了伪造驱动的令牌压缩范式优于传统语义驱动方法,为高效、可解释的多媒体取证提供了新思路。 Abstract: Multimodal Large Language Models (MLLMs) enable interpretable multimedia forensics by generating textual rationales for forgery detection. However, processing dense visual sequences incurs high computational costs, particularly for high-resolution images and videos. Visual token pruning is a practical acceleration strategy, yet existing methods are largely semantic-driven, retaining salient objects while discarding background regions where manipulation traces such as high-frequency anomalies and temporal jitters often reside. To address this issue, we introduce ForensicZip, a training-free framework that reformulates token compression from a forgery-driven perspective. ForensicZip models temporal token evolution as a Birth-Death Optimal Transport problem with a slack dummy node, quantifying physical discontinuities indicating transient generative artifacts. The forensic scoring further integrates transport-based novelty with high-frequency priors to separate forensic evidence from semantic content under large-ratio compression. Experiments on deepfake and AIGC benchmarks show that at 10\% token retention, ForensicZip achieves $2.97\times$ speedup and over 90\% FLOPs reduction while maintaining state-of-the-art detection performance.

[162] RDNet: Region Proportion-Aware Dynamic Adaptive Salient Object Detection Network in Optical Remote Sensing Images

Bin Wan,Runmin Cong,Xiaofei Zhou,Hao Fang,Yaoqi Sun,Sam Kwong

Main category: cs.CV

TL;DR: 本文提出RDNet网络,通过引入SwinTransformer替代CNN主干,并设计三个关键模块(DAD、FCE、RPL)来提升遥感图像显著目标检测在多尺度鲁棒性和精确定位方面的性能。

Details Motivation: 遥感图像显著目标检测面临目标尺度变化大、自注意力计算开销高、CNN难以建模全局上下文和长程依赖等挑战,现有固定卷积核方法难以适应多样尺度,导致细节丢失或无关特征聚合。 Method: 提出Region Proportion-Aware Dynamic Adaptive Salient Object Detection Network(RDNet),以SwinTransformer替代CNN主干;设计动态自适应细节感知(DAD)模块(依据区域比例调节卷积核)、频域匹配上下文增强(FCE)模块(结合小波变换与注意力)和区域比例感知定位(RPL)模块(含交叉注意力与比例引导PG块)。 Result: RDNet在多尺度鲁棒性和精确定位上表现优异,检测性能超越当前最先进方法。 Conclusion: RDNet有效缓解了遥感图像中显著目标尺度差异大和全局建模能力弱的问题,验证了动态卷积、频域上下文建模与比例感知定位协同设计的有效性。 Abstract: Salient object detection (SOD) in remote sensing images faces significant challenges due to large variations in object sizes, the computational cost of self-attention mechanisms, and the limitations of CNN-based extractors in capturing global context and long-range dependencies. Existing methods that rely on fixed convolution kernels often struggle to adapt to diverse object scales, leading to detail loss or irrelevant feature aggregation. To address these issues, this work aims to enhance robustness to scale variations and achieve precise object localization. We propose the Region Proportion-Aware Dynamic Adaptive Salient Object Detection Network (RDNet), which replaces the CNN backbone with the SwinTransformer for global context modeling and introduces three key modules: (1) the Dynamic Adaptive Detail-aware (DAD) module, which applies varied convolution kernels guided by object region proportions; (2) the Frequency-matching Context Enhancement (FCE) module, which enriches contextual information through wavelet interactions and attention; and (3) the Region Proportion-aware Localization (RPL) module, which employs cross-attention to highlight semantic details and integrates a Proportion Guidance (PG) block to assist the DAD module. By combining these modules, RDNet achieves robustness against scale variations and accurate localization, delivering superior detection performance compared with state-of-the-art methods.

[163] Real-World Point Tracking with Verifier-Guided Pseudo-Labeling

Görkay Aydemir,Fatma Güney,Weidi Xie

Main category: cs.CV

TL;DR: 本文提出Verifier元模型,用于评估跟踪器预测的可靠性并指导伪标签生成,从而提升真实世界视频中长期点跟踪模型的微调效果。

Details Motivation: 现有长期点跟踪模型在合成数据上训练,但在真实视频中性能下降,且缺乏密集真值标注;自训练虽被探索,但伪标签质量依赖教师模型的可靠性,而该可靠性在不同帧和场景中变化较大。 Method: 提出Verifier元模型,接收多个预训练跟踪器产生的候选轨迹,逐帧评估其可靠性,并选择最可信的预测生成高质量伪标签轨迹,用于后续模型微调。 Result: 在四个真实世界基准上实验表明,该方法在更少数据下达到当前最优性能。 Conclusion: Verifier能有效提升伪标签质量,实现数据高效的真实世界适应,为长期点跟踪的域迁移提供新思路。 Abstract: Models for long-term point tracking are typically trained on large synthetic datasets. The performance of these models degrades in real-world videos due to different characteristics and the absence of dense ground-truth annotations. Self-training on unlabeled videos has been explored as a practical solution, but the quality of pseudo-labels strongly depends on the reliability of teacher models, which vary across frames and scenes. In this paper, we address the problem of real-world fine-tuning and introduce verifier, a meta-model that learns to assess the reliability of tracker predictions and guide pseudo-label generation. Given candidate trajectories from multiple pretrained trackers, the verifier evaluates them per frame and selects the most trustworthy predictions, resulting in high-quality pseudo-label trajectories. When applied for fine-tuning, verifier-guided pseudo-labeling substantially improves the quality of supervision and enables data-efficient adaptation to unlabeled videos. Extensive experiments on four real-world benchmarks demonstrate that our approach achieves state-of-the-art results while requiring less data than prior self-training methods. Project page: https://kuis-ai.github.io/track_on_r

[164] A Two-Stage Dual-Modality Model for Facial Emotional Expression Recognition

Jiajun Sun,Zhe Gao

Main category: cs.CV

TL;DR: 本文提出了一种两阶段音视频双模态模型,用于解决ABAW竞赛中野外环境下的面部表情识别(EXPR)任务,通过DINOv2视觉编码器、PadAug增强、MoE分类头、多尺度人脸重裁、Wav2Vec 2.0音频特征及门控融合等技术,显著提升了帧级表情分类性能。

Details Motivation: 野外视频中存在人脸定位不准、姿态与尺度变化大、运动模糊、时序不稳定等问题,导致表情识别困难。 Method: 两阶段双模态方法:第一阶段用DINOv2 ViT-L/14作为视觉骨干,引入PadAug数据增强和MoE训练头;第二阶段进行多尺度人脸重裁与视觉特征平均,并融合Wav2Vec 2.0音频特征,采用轻量级门控融合模块与时序平滑。 Result: 在ABAW官方验证集上Macro-F1达0.5368,5折交叉验证结果为0.5122±0.0277,优于官方基线。 Conclusion: 所提两阶段音视频融合框架能有效缓解野外场景下表情识别的多种挑战,提升鲁棒性与时序一致性。 Abstract: This paper addresses the expression (EXPR) recognition challenge in the 10th Affective Behavior Analysis in-the-Wild (ABAW) workshop and competition, which requires frame-level classification of eight facial emotional expressions from unconstrained videos. This task is challenging due to inaccurate face localization, large pose and scale variations, motion blur, temporal instability, and other confounding factors across adjacent frames. We propose a two-stage dual-modal (audio-visual) model to address these difficulties. Stage I focuses on robust visual feature extraction with a pretrained DINOv2-based encoder. Specifically, DINOv2 ViT-L/14 is used as the backbone, a padding-aware augmentation (PadAug) strategy is employed for image padding and data preprocessing from raw videos, and a mixture-of-experts (MoE) training head is introduced to enhance classifier diversity. Stage II addresses modality fusion and temporal consistency. For the visual modality, faces are re-cropped from raw videos at multiple scales, and the extracted visual features are averaged to form a robust frame-level representation. Concurrently, frame-aligned Wav2Vec 2.0 audio features are derived from short audio windows to provide complementary acoustic cues. These dual-modal features are integrated via a lightweight gated fusion module, followed by inference-time temporal smoothing. Experiments on the ABAW dataset demonstrate the effectiveness of the proposed method. The two-stage model achieves a Macro-F1 score of 0.5368 on the official validation set and 0.5122 +/- 0.0277 under 5-fold cross-validation, outperforming the official baselines.

[165] HiAP: A Multi-Granular Stochastic Auto-Pruning Framework for Vision Transformers

Andy Li,Aiden Durrant,Milan Markovic,Georgios Leontidis

Main category: cs.CV

TL;DR: 本文提出了一种名为Hierarchical Auto-Pruning (HiAP) 的端到端连续松弛剪枝框架,用于高效压缩视觉Transformer,支持多粒度(宏/微)结构剪枝,无需人工设定稀疏度或分阶段流程,在ImageNet上实现了精度与效率的帕累托最优。

Details Motivation: Vision Transformers在边缘设备部署受限于高计算资源和内存带宽需求;现有结构化剪枝方法多为单粒度、多阶段、依赖启发式阈值,难以兼顾效率与易用性。 Method: 提出HiAP框架:引入多粒度随机Gumbel-Sigmoid门控机制(宏观门剪注意力头/FFN块,微观门剪头内维度/FFN神经元),联合优化;采用含结构可行性惩罚与解析FLOPs建模的损失函数,实现单阶段端到端训练。 Result: 在ImageNet上验证HiAP能自动发现高效子网络,对DeiT-Small等模型达到与复杂多阶段方法相当的精度-效率权衡,同时大幅简化部署流程。 Conclusion: HiAP通过统一、可微、多粒度的剪枝机制,有效缓解视觉Transformer的内存与计算瓶颈,为边缘部署提供了简洁高效的自动化压缩方案。 Abstract: Vision Transformers require significant computational resources and memory bandwidth, severely limiting their deployment on edge devices. While recent structured pruning methods successfully reduce theoretical FLOPs, they typically operate at a single structural granularity and rely on complex, multi-stage pipelines with post-hoc thresholding to satisfy sparsity budgets. In this paper, we propose Hierarchical Auto-Pruning (HiAP), a continuous relaxation framework that discovers optimal sub-networks in a single end-to-end training phase without requiring manual importance heuristics or predefined per-layer sparsity targets. HiAP introduces stochastic Gumbel-Sigmoid gates at multiple granularities: macro-gates to prune entire attention heads and FFN blocks, and micro-gates to selectively prune intra-head dimensions and FFN neurons. By optimizing both levels simultaneously, HiAP addresses both the memory-bound overhead of loading large matrices and the compute-bound mathematical operations. HiAP naturally converges to stable sub-networks using a loss function that incorporates both structural feasibility penalties and analytical FLOPs. Extensive experiments on ImageNet demonstrate that HiAP organically discovers highly efficient architectures, and achieves a competitive accuracy-efficiency Pareto frontier for models like DeiT-Small, matching the performance of sophisticated multi-stage methods while significantly simplifying the deployment pipeline.

[166] SceneAssistant: A Visual Feedback Agent for Open-Vocabulary 3D Scene Generation

Jun Luo,Jiaxiang Tang,Ruijie Lu,Gang Zeng

Main category: cs.CV

TL;DR: 本文提出SceneAssistant,一种基于视觉反馈的智能体,用于开放词汇的文本到3D场景生成,结合VLM的空间推理与3D生成模型,通过原子操作迭代优化场景布局。

Details Motivation: 现有文本到3D场景生成方法受限于特定领域或预定义空间关系,难以支持开放词汇、无约束的3D场景合成。 Method: 提出SceneAssistant框架,利用视觉语言模型(VLM)进行空间推理与规划,并集成现代3D对象生成模型;引入一组原子操作(如Scale、Rotate、FocusOn),VLM基于渲染的视觉反馈迭代执行动作以优化场景。 Result: 实验表明该方法能生成多样、高质量、开放词汇的3D场景;定性与定量人工评估均优于现有方法;并支持自然语言驱动的已有场景编辑。 Conclusion: SceneAssistant实现了更灵活、可控、符合语义的开放词汇3D场景生成与编辑,显著提升了文本到3D场景合成的能力与实用性。 Abstract: Text-to-3D scene generation from natural language is highly desirable for digital content creation. However, existing methods are largely domain-restricted or reliant on predefined spatial relationships, limiting their capacity for unconstrained, open-vocabulary 3D scene synthesis. In this paper, we introduce SceneAssistant, a visual-feedback-driven agent designed for open-vocabulary 3D scene generation. Our framework leverages modern 3D object generation model along with the spatial reasoning and planning capabilities of Vision-Language Models (VLMs). To enable open-vocabulary scene composition, we provide the VLMs with a comprehensive set of atomic operations (e.g., Scale, Rotate, FocusOn). At each interaction step, the VLM receives rendered visual feedback and takes actions accordingly, iteratively refining the scene to achieve more coherent spatial arrangements and better alignment with the input text. Experimental results demonstrate that our method can generate diverse, open-vocabulary, and high-quality 3D scenes. Both qualitative analysis and quantitative human evaluations demonstrate the superiority of our approach over existing methods. Furthermore, our method allows users to instruct the agent to edit existing scenes based on natural language commands. Our code is available at https://github.com/ROUJINN/SceneAssistant

[167] BiGain: Unified Token Compression for Joint Generation and Classification

Jiacheng Liu,Shengkun Tang,Jiacheng Cui,Dongkuan Xu,Zhiqiang Shen

Main category: cs.CV

TL;DR: 本文提出BiGain框架,通过频域分离思想设计两种频率感知的token压缩算子(Laplacian-gated token merging 和 Interpolate-Extrapolate KV Downsampling),在不牺牲生成质量的前提下显著提升加速扩散模型的分类性能。

Details Motivation: 现有扩散模型加速方法(如token合并或下采样)多关注生成质量与计算效率的平衡,却忽视其判别能力(如分类精度);本文旨在联合优化生成质量与判别能力。 Method: 提出无训练、即插即用的BiGain框架,基于频域分离思想:1)Laplacian-gated token merging——依据谱平滑性控制token合并,保留边缘与纹理;2)Interpolate-Extrapolate KV Downsampling——对KV进行可控的近邻/均值混合下采样,保持Query不变以维持注意力精度。 Result: 在多个骨干网络(DiT/U-Net)和数据集(ImageNet-1K等)上,BiGain在同等加速下提升分类准确率(如ImageNet-1K上+7.15%)并改善FID(+0.34);分析表明均衡保留高低频信息是有效压缩的关键准则。 Conclusion: BiGain首次统一提升加速扩散模型的生成与判别能力,为低成本部署提供新范式;频域感知的token压缩是一种可靠且可推广的设计原则。 Abstract: Acceleration methods for diffusion models (e.g., token merging or downsampling) typically optimize synthesis quality under reduced compute, yet often ignore discriminative capacity. We revisit token compression with a joint objective and present BiGain, a training-free, plug-and-play framework that preserves generation quality while improving classification in accelerated diffusion models. Our key insight is frequency separation: mapping feature-space signals into a frequency-aware representation disentangles fine detail from global semantics, enabling compression that respects both generative fidelity and discriminative utility. BiGain reflects this principle with two frequency-aware operators: (1) Laplacian-gated token merging, which encourages merges among spectrally smooth tokens while discouraging merges of high-contrast tokens, thereby retaining edges and textures; and (2) Interpolate-Extrapolate KV Downsampling, which downsamples keys/values via a controllable interextrapolation between nearest and average pooling while keeping queries intact, thereby conserving attention precision. Across DiT- and U-Net-based backbones and ImageNet-1K, ImageNet-100, Oxford-IIIT Pets, and COCO-2017, our operators consistently improve the speed-accuracy trade-off for diffusion-based classification, while maintaining or enhancing generation quality under comparable acceleration. For instance, on ImageNet-1K, with 70% token merging on Stable Diffusion 2.0, BiGain increases classification accuracy by 7.15% while improving FID by 0.34 (1.85%). Our analyses indicate that balanced spectral retention, preserving high-frequency detail and low/mid-frequency semantics, is a reliable design rule for token compression in diffusion models. To our knowledge, BiGain is the first framework to jointly study and advance both generation and classification under accelerated diffusion, supporting lower-cost deployment.

[168] One Model, Many Budgets: Elastic Latent Interfaces for Diffusion Transformers

Moayed Haji-Ali,Willi Menapace,Ivan Skorokhodov,Dogyun Park,Anil Kag,Michael Vasilkovsky,Sergey Tulyakov,Vicente Ordonez,Aliaksandr Siarohin

Main category: cs.CV

TL;DR: 本文提出Elastic Latent Interface Transformer(ELIT),一种轻量级、即插即用的机制,用于解耦输入图像尺寸与计算量,在保持DiT架构不变的前提下,通过可变长潜在接口和重要性排序的跨注意力机制,实现动态计算-质量权衡。

Details Motivation: 现有扩散Transformer(DiTs)存在两个关键问题:计算量(FLOPs)与图像分辨率强耦合,难以进行有原则的延迟-质量权衡;且对所有空间token均匀分配计算资源,造成重要区域识别不足和资源浪费。 Method: 引入一个可学习、可变长度的潜在接口(latent interface)作为中间表征;设计轻量级Read/Write跨注意力层在空间token与潜在token间传递信息,并隐式学习区域重要性;通过随机丢弃尾部潜在token进行训练,使模型学会按重要性排序表征(前部捕获全局结构,后部细化细节);推理时可动态调整潜在token数量以适配计算约束。 Result: 在ImageNet-1K 512px上,ELIT平均提升FID 35.3%、FDD 39.6%;在多个数据集及DiT变体(U-ViT、HDiT、MM-DiT)上均取得一致性能增益;仅增加两个跨注意力层,不改变原有DiT结构与rectified flow目标。 Conclusion: ELIT是一种简洁、通用、即插即用的改进方案,有效解耦分辨率与计算量,实现细粒度计算控制与重要性感知的资源分配,在不牺牲DiT高质量生成能力的同时显著提升效率-质量平衡能力。 Abstract: Diffusion transformers (DiTs) achieve high generative quality but lock FLOPs to image resolution, limiting principled latency-quality trade-offs, and allocate computation uniformly across input spatial tokens, wasting resource allocation to unimportant regions. We introduce Elastic Latent Interface Transformer (ELIT), a drop-in, DiT-compatible mechanism that decouples input image size from compute. Our approach inserts a latent interface, a learnable variable-length token sequence on which standard transformer blocks can operate. Lightweight Read and Write cross-attention layers move information between spatial tokens and latents and prioritize important input regions. By training with random dropping of tail latents, ELIT learns to produce importance-ordered representations with earlier latents capturing global structure while later ones contain information to refine details. At inference, the number of latents can be dynamically adjusted to match compute constraints. ELIT is deliberately minimal, adding two cross-attention layers while leaving the rectified flow objective and the DiT stack unchanged. Across datasets and architectures (DiT, U-ViT, HDiT, MM-DiT), ELIT delivers consistent gains. On ImageNet-1K 512px, ELIT delivers an average gain of $35.3\%$ and $39.6\%$ in FID and FDD scores. Project page: https://snap-research.github.io/elit/

[169] Trust Your Critic: Robust Reward Modeling and Reinforcement Learning for Faithful Image Editing and Generation

Xiangyu Zhao,Peiyuan Zhang,Junming Lin,Tianhao Liang,Yuchen Duan,Shengyuan Ding,Changyao Tian,Yuhang Zang,Junchi Yan,Xue Yang

Main category: cs.CV

TL;DR: 本文提出FIRM框架,通过高质量数据集构建、专用奖励模型训练及创新的'基础+奖励'奖励策略,显著提升图像编辑与文本到图像生成中强化学习的忠实度和指令遵循能力。

Details Motivation: 现有奖励模型存在幻觉和评分噪声问题,导致强化学习优化过程被误导。 Method: 设计专门的数据整理流程构建高质量评分数据集(FIRM-Edit-370K和FIRM-Gen-293K),训练专用8B参数奖励模型,并提出'Base-and-Bonus'奖励策略(CME用于编辑,QMA用于生成),同时构建FIRM-Bench基准进行评估。 Result: FIRM-Edit-8B和FIRM-Gen-8B模型在人类判断对齐性上优于现有指标;集成后模型FIRM-Qwen-Edit和FIRM-SD3.5在保真度和指令遵循方面取得显著突破。 Conclusion: FIRM有效缓解了奖励模型幻觉问题,为图像生成与编辑任务中的忠实性建模树立了新标准,并开源全部数据、模型与代码。 Abstract: Reinforcement learning (RL) has emerged as a promising paradigm for enhancing image editing and text-to-image (T2I) generation. However, current reward models, which act as critics during RL, often suffer from hallucinations and assign noisy scores, inherently misguiding the optimization process. In this paper, we present FIRM (Faithful Image Reward Modeling), a comprehensive framework that develops robust reward models to provide accurate and reliable guidance for faithful image generation and editing. First, we design tailored data curation pipelines to construct high-quality scoring datasets. Specifically, we evaluate editing using both execution and consistency, while generation is primarily assessed via instruction following. Using these pipelines, we collect the FIRM-Edit-370K and FIRM-Gen-293K datasets, and train specialized reward models (FIRM-Edit-8B and FIRM-Gen-8B) that accurately reflect these criteria. Second, we introduce FIRM-Bench, a comprehensive benchmark specifically designed for editing and generation critics. Evaluations demonstrate that our models achieve superior alignment with human judgment compared to existing metrics. Furthermore, to seamlessly integrate these critics into the RL pipeline, we formulate a novel "Base-and-Bonus" reward strategy that balances competing objectives: Consistency-Modulated Execution (CME) for editing and Quality-Modulated Alignment (QMA) for generation. Empowered by this framework, our resulting models FIRM-Qwen-Edit and FIRM-SD3.5 achieve substantial performance breakthroughs. Comprehensive experiments demonstrate that FIRM mitigates hallucinations, establishing a new standard for fidelity and instruction adherence over existing general models. All of our datasets, models, and code have been publicly available at https://firm-reward.github.io.

[170] DVD: Deterministic Video Depth Estimation with Generative Priors

Hongfei Zhang,Harold Haodong Chen,Chenfei Liao,Jing He,Zixin Zhang,Haodong Li,Yihao Liang,Kanghao Chen,Bin Ren,Xu Zheng,Shuai Yang,Kun Zhou,Yinchuan Li,Nicu Sebe,Ying-Cong Chen

Main category: cs.CV

TL;DR: DVD是一种新型视频深度估计框架,通过确定性地适配预训练视频扩散模型,解决了生成式模型几何幻觉与判别式模型数据依赖的固有矛盾。

Details Motivation: 现有视频深度估计方法存在生成模型几何幻觉与尺度漂移、判别模型依赖大量标注数据的根本权衡问题。 Method: DVD提出三项核心技术:(i) 将扩散时间步作为结构锚点以平衡全局稳定性与高频细节;(ii) 潜在流形校正(LMR),通过微分约束缓解回归导致的过平滑,恢复清晰边界和连贯运动;(iii) 全局仿射一致性,利用其内在特性实现无需复杂时序对齐的长视频无缝推理。 Result: DVD在多个基准上实现零样本SOTA性能,并仅用领先基线1/163的任务特定数据即挖掘出视频基础模型中隐含的深层几何先验。 Conclusion: DVD首次实现了预训练视频扩散模型向单通深度回归器的确定性迁移,显著降低数据依赖并提升几何一致性,且已开源完整训练流程。 Abstract: Existing video depth estimation faces a fundamental trade-off: generative models suffer from stochastic geometric hallucinations and scale drift, while discriminative models demand massive labeled datasets to resolve semantic ambiguities. To break this impasse, we present DVD, the first framework to deterministically adapt pre-trained video diffusion models into single-pass depth regressors. Specifically, DVD features three core designs: (i) repurposing the diffusion timestep as a structural anchor to balance global stability with high-frequency details; (ii) latent manifold rectification (LMR) to mitigate regression-induced over-smoothing, enforcing differential constraints to restore sharp boundaries and coherent motion; and (iii) global affine coherence, an inherent property bounding inter-window divergence, which enables seamless long-video inference without requiring complex temporal alignment. Extensive experiments demonstrate that DVD achieves state-of-the-art zero-shot performance across benchmarks. Furthermore, DVD successfully unlocks the profound geometric priors implicit in video foundation models using 163x less task-specific data than leading baselines. Notably, we fully release our pipeline, providing the whole training suite for SOTA video depth estimation to benefit the open-source community.

[171] Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing

Baifeng Shi,Stephanie Fu,Long Lian,Hanrong Ye,David Eigen,Aaron Reite,Boyi Li,Jan Kautz,Song Han,David M. Chan,Pavlo Molchanov,Trevor Darrell,Hongxu Yin

Main category: cs.CV

TL;DR: AutoGaze是一种轻量级模块,通过自回归选择多尺度视觉块,在满足误差阈值前提下大幅减少视频输入冗余,显著提升MLLM处理长时高分辨率视频的效率与性能。

Details Motivation: 现有MLLM在处理长时、高分辨率视频时因ViT或LLM对所有像素一视同仁而受限于时空冗余,导致计算开销大、可扩展性差。 Method: 提出AutoGaze模块,结合下一词预测与强化学习进行训练,自回归地选取最小必要多尺度图像块以重建视频,并满足用户设定的误差阈值。 Result: 视觉token减少4–100倍,ViT/MLLM加速最高达19倍;支持1K帧、4K分辨率视频理解,在VideoMME达67.0%;在新构建的HLVid(5分钟4K视频QA)基准上相对基线提升10.1%,超越此前最优MLLM 4.5%。 Conclusion: AutoGaze有效缓解了视频理解中的冗余问题,为MLLM高效处理高分辨率长视频提供了可行路径,并推动了评测基准(HLVid)的发展。 Abstract: Multi-modal large language models (MLLMs) have advanced general-purpose video understanding but struggle with long, high-resolution videos -- they process every pixel equally in their vision transformers (ViTs) or LLMs despite significant spatiotemporal redundancy. We introduce AutoGaze, a lightweight module that removes redundant patches before processed by a ViT or an MLLM. Trained with next-token prediction and reinforcement learning, AutoGaze autoregressively selects a minimal set of multi-scale patches that can reconstruct the video within a user-specified error threshold, eliminating redundancy while preserving information. Empirically, AutoGaze reduces visual tokens by 4x-100x and accelerates ViTs and MLLMs by up to 19x, enabling scaling MLLMs to 1K-frame 4K-resolution videos and achieving superior results on video benchmarks (e.g., 67.0% on VideoMME). Furthermore, we introduce HLVid: the first high-resolution, long-form video QA benchmark with 5-minute 4K-resolution videos, where an MLLM scaled with AutoGaze improves over the baseline by 10.1% and outperforms the previous best MLLM by 4.5%. Project page: https://autogaze.github.io/.

[172] Spatial-TTT: Streaming Visual-based Spatial Intelligence with Test-Time Training

Fangfu Liu,Diankun Wu,Jiawei Chi,Yimo Cai,Yi-Hsin Hung,Xumin Yu,Hao Li,Han Hu,Yongming Rao,Yueqi Duan

Main category: cs.CV

TL;DR: 本文提出Spatial-TTT,一种基于测试时训练(TTT)的流式视觉空间智能方法,通过快速权重更新、混合架构、滑动窗口注意力与3D时空卷积的空间预测机制,结合密集3D空间描述数据集,有效提升长时序视频中的空间理解能力。

Details Motivation: 人类通过连续视觉观察理解现实空间,因此模型需在无限视频流中动态维护和更新空间证据;核心挑战在于如何随时间选择、组织和保留空间信息,而非仅扩大上下文窗口。 Method: 提出Spatial-TTT框架:采用测试时训练动态更新部分参数(快速权重);设计混合架构,结合大块并行更新与滑动窗口注意力;引入基于3D时空卷积的空间预测机制以建模几何对应与时间连续性;构建含密集3D空间描述的新数据集,引导快速权重结构化记忆全局3D空间信号。 Result: 在多个视频空间理解基准上达到SOTA性能,显著提升长时序空间理解能力。 Conclusion: Spatial-TTT验证了测试时训练在流式空间智能中的有效性,其架构设计与数据构造策略为持续空间感知建模提供了新范式。 Abstract: Humans perceive and understand real-world spaces through a stream of visual observations. Therefore, the ability to streamingly maintain and update spatial evidence from potentially unbounded video streams is essential for spatial intelligence. The core challenge is not simply longer context windows but how spatial information is selected, organized, and retained over time. In this paper, we propose Spatial-TTT towards streaming visual-based spatial intelligence with test-time training (TTT), which adapts a subset of parameters (fast weights) to capture and organize spatial evidence over long-horizon scene videos. Specifically, we design a hybrid architecture and adopt large-chunk updates parallel with sliding-window attention for efficient spatial video processing. To further promote spatial awareness, we introduce a spatial-predictive mechanism applied to TTT layers with 3D spatiotemporal convolution, which encourages the model to capture geometric correspondence and temporal continuity across frames. Beyond architecture design, we construct a dataset with dense 3D spatial descriptions, which guides the model to update its fast weights to memorize and organize global 3D spatial signals in a structured manner. Extensive experiments demonstrate that Spatial-TTT improves long-horizon spatial understanding and achieves state-of-the-art performance on video spatial benchmarks. Project page: https://liuff19.github.io/Spatial-TTT.

[173] DreamVideo-Omni: Omni-Motion Controlled Multi-Subject Video Customization with Latent Identity Reinforcement Learning

Yujie Wei,Xinyu Liu,Shiwei Zhang,Hangjie Yuan,Jinbo Xing,Zhekai Chen,Xiang Wang,Haonan Qiu,Rui Zhao,Yutong Feng,Ruihang Chu,Yingya Zhang,Yike Guo,Xihui Liu,Hongming Shan

Main category: cs.CV

TL;DR: 本文提出DreamVideo-Omni框架,通过两阶段训练实现多主体身份定制与全粒度运动控制,引入条件感知3D旋转位置编码、分层运动注入及群组/角色嵌入解决控制模糊与身份退化问题,并设计潜在身份奖励反馈机制提升身份保持能力。

Details Motivation: 现有大模型在视频合成中难以同时精确控制多主体身份和多粒度运动,存在运动粒度有限、控制模糊和身份退化等问题。 Method: 提出两阶段训练范式:第一阶段融合外观、全局/局部运动及相机运动等多维控制信号,引入条件感知3D旋转位置编码、分层运动注入策略及群组/角色嵌入;第二阶段构建潜在身份奖励模型,提供运动感知的身份奖励以优化身份保持。 Result: 在自建大规模数据集和DreamOmni Bench评测基准上,DreamVideo-Omni在多主体身份保持与全粒度运动控制方面显著优于现有方法,生成高质量可控视频。 Conclusion: DreamVideo-Omni实现了多主体身份与全粒度运动的协同可控视频生成,为复杂场景下高保真可控视频合成提供了新范式。 Abstract: While large-scale diffusion models have revolutionized video synthesis, achieving precise control over both multi-subject identity and multi-granularity motion remains a significant challenge. Recent attempts to bridge this gap often suffer from limited motion granularity, control ambiguity, and identity degradation, leading to suboptimal performance on identity preservation and motion control. In this work, we present DreamVideo-Omni, a unified framework enabling harmonious multi-subject customization with omni-motion control via a progressive two-stage training paradigm. In the first stage, we integrate comprehensive control signals for joint training, encompassing subject appearances, global motion, local dynamics, and camera movements. To ensure robust and precise controllability, we introduce a condition-aware 3D rotary positional embedding to coordinate heterogeneous inputs and a hierarchical motion injection strategy to enhance global motion guidance. Furthermore, to resolve multi-subject ambiguity, we introduce group and role embeddings to explicitly anchor motion signals to specific identities, effectively disentangling complex scenes into independent controllable instances. In the second stage, to mitigate identity degradation, we design a latent identity reward feedback learning paradigm by training a latent identity reward model upon a pretrained video diffusion backbone. This provides motion-aware identity rewards in the latent space, prioritizing identity preservation aligned with human preferences. Supported by our curated large-scale dataset and the comprehensive DreamOmni Bench for multi-subject and omni-motion control evaluation, DreamVideo-Omni demonstrates superior performance in generating high-quality videos with precise controllability.

[174] Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously

Yiran Guan,Liang Yin,Dingkang Liang,Jianzhong Ju,Zhenbo Luo,Jian Luan,Yuliang Liu,Xiang Bai

Main category: cs.CV

TL;DR: 本文提出Video Streaming Thinking (VST),一种支持边看边想的流式视频理解新范式,通过在视频流中同步激活推理、结合结构化微调与强化学习后训练,并利用视频知识图谱合成高质量流式QA数据,显著提升实时性与推理能力。

Details Motivation: 现有在线视频大模型仅关注流式感知,缺乏同步的逻辑推理流;而直接应用测试时扩展方法会导致不可接受的响应延迟,亟需平衡实时性与深度推理能力。 Method: 提出VST范式,包含:1)边看边想机制,在视频流中动态激活推理以摊销LLM延迟;2)两阶段后训练:VST-SFT实现因果流式推理结构适配,VST-RL在多轮视频交互环境中端到端优化;3)基于视频知识图谱的自动化数据合成 pipeline,生成带实体-关系锚定的流式思维链QA对。 Result: VST-7B在StreamingBench达79.5%,OVO-Bench达59.3%;相比Video-R1响应快15.7倍,在VideoHolmes上提升+5.4%;同时在离线长视频/推理基准上保持竞争力。 Conclusion: VST成功实现了流式视频感知与逻辑推理的协同,兼顾实时响应与深度理解,为在线视频大模型提供了可扩展、高效且泛化性强的新架构。 Abstract: Online Video Large Language Models (VideoLLMs) play a critical role in supporting responsive, real-time interaction. Existing methods focus on streaming perception, lacking a synchronized logical reasoning stream. However, directly applying test-time scaling methods incurs unacceptable response latency. To address this trade-off, we propose Video Streaming Thinking (VST), a novel paradigm for streaming video understanding. It supports a thinking while watching mechanism, which activates reasoning over incoming video clips during streaming. This design improves timely comprehension and coherent cognition while preserving real-time responsiveness by amortizing LLM reasoning latency over video playback. Furthermore, we introduce a comprehensive post-training pipeline that integrates VST-SFT, which structurally adapts the offline VideoLLM to causal streaming reasoning, and VST-RL, which provides end-to-end improvement through self-exploration in a multi-turn video interaction environment. Additionally, we devise an automated training-data synthesis pipeline that uses video knowledge graphs to generate high-quality streaming QA pairs, with an entity-relation grounded streaming Chain-of-Thought to enforce multi-evidence reasoning and sustained attention to the video stream. Extensive evaluations show that VST-7B performs strongly on online benchmarks, e.g. 79.5% on StreamingBench and 59.3% on OVO-Bench. Meanwhile, VST remains competitive on offline long-form or reasoning benchmarks. Compared with Video-R1, VST responds 15.7 times faster and achieves +5.4% improvement on VideoHolmes, demonstrating higher efficiency and strong generalization across diverse video understanding tasks. Code, data, and models will be released at https://github.com/1ranGuan/VST.

[175] GRADE: Benchmarking Discipline-Informed Reasoning in Image Editing

Mingxin Liu,Ziqian Fan,Zhaokai Wang,Leyao Gu,Zirun Zhu,Yiguo He,Yuchen Yang,Changyao Tian,Xiangyu Zhao,Ning Liao,Shaofeng Zhang,Qibing Ren,Zhihang Zhong,Xuanhe Zhou,Junchi Yan,Xue Yang

Main category: cs.CV

TL;DR: 本文提出了GRADE基准,用于评估图像编辑中学科知识和推理能力,涵盖10个学术领域共520个样本,并设计了多维评估协议,揭示了现有模型在知识密集型编辑任务中的显著缺陷。

Details Motivation: 当前图像编辑基准局限于自然图像和浅层常识推理,难以评估统一多模态模型在结构化、领域特定约束下的联合理解、推理与生成能力。 Method: 构建了首个面向学科知识和推理的图像编辑基准GRADE,包含10个学术领域的520个样本,并提出多维评估协议(学科推理、视觉一致性、逻辑可读性)。 Result: 在20个SOTA开源与闭源模型上的实验表明,现有模型在隐式、知识密集型编辑任务中存在显著性能差距;深入分析揭示了模型在学科编辑中的具体短板与约束。 Conclusion: GRADE为统一多模态模型的发展指明了关键方向,推动了学科导向的图像编辑与推理研究,相关资源已开源。 Abstract: Unified multimodal models target joint understanding, reasoning, and generation, but current image editing benchmarks are largely confined to natural images and shallow commonsense reasoning, offering limited assessment of this capability under structured, domain-specific constraints. In this work, we introduce GRADE, the first benchmark to assess discipline-informed knowledge and reasoning in image editing. GRADE comprises 520 carefully curated samples across 10 academic domains, spanning from natural science to social science. To support rigorous evaluation, we propose a multi-dimensional evaluation protocol that jointly assesses Discipline Reasoning, Visual Consistency, and Logical Readability. Extensive experiments on 20 state-of-the-art open-source and closed-source models reveal substantial limitations in current models under implicit, knowledge-intensive editing settings, leading to large performance gaps. Beyond quantitative scores, we conduct rigorous analyses and ablations to expose model shortcomings and identify the constraints within disciplinary editing. Together, GRADE pinpoints key directions for the future development of unified multimodal models, advancing the research on discipline-informed image editing and reasoning. Our benchmark and evaluation code are publicly released.

[176] OmniStream: Mastering Perception, Reconstruction and Action in Continuous Streams

Yibin Yan,Jilan Xu,Shangzhe Di,Haoning Wu,Weidi Xie

Main category: cs.CV

TL;DR: 本文提出OmniStream,一种统一的流式视觉骨干网络,通过因果时空注意力和3D旋转位置编码实现帧级在线视频处理,并在多任务预训练下展现出跨语义、空间与时间推理的泛化能力。

Details Motivation: 现代视觉智能体需要具备通用性、因果性和物理结构化的表征以适应实时流式环境,但现有视觉基础模型功能割裂,难以兼顾图像语义、时序建模与空间几何。 Method: 提出OmniStream模型,引入因果时空注意力机制与3D-RoPE位置编码,结合持久KV缓存支持帧级流式处理;采用融合静态/时序表征学习、流式几何重建与视觉-语言对齐的多任务预训练框架,在29个数据集上训练。 Result: 即使骨干网络完全冻结,OmniStream在图像/视频探针、流式几何重建、复杂时空推理及未见机器人操控任务中均达到与专用模型相当的性能。 Conclusion: 证明了单一同质化视觉骨干网络可有效统一语义、空间与时序推理能力,是迈向通用视觉理解与具身智能的重要一步。 Abstract: Modern visual agents require representations that are general, causal, and physically structured to operate in real-time streaming environments. However, current vision foundation models remain fragmented, specializing narrowly in image semantic perception, offline temporal modeling, or spatial geometry. This paper introduces OmniStream, a unified streaming visual backbone that effectively perceives, reconstructs, and acts from diverse visual inputs. By incorporating causal spatiotemporal attention and 3D rotary positional embeddings (3D-RoPE), our model supports efficient, frame-by-frame online processing of video streams via a persistent KV-cache. We pre-train OmniStream using a synergistic multi-task framework coupling static and temporal representation learning, streaming geometric reconstruction, and vision-language alignment on 29 datasets. Extensive evaluations show that, even with a strictly frozen backbone, OmniStream achieves consistently competitive performance with specialized experts across image and video probing, streaming geometric reconstruction, complex video and spatial reasoning, as well as robotic manipulation (unseen at training). Rather than pursuing benchmark-specific dominance, our work demonstrates the viability of training a single, versatile vision backbone that generalizes across semantic, spatial, and temporal reasoning, i.e., a more meaningful step toward general-purpose visual understanding for interactive and embodied agents.

[177] MM-CondChain: A Programmatically Verified Benchmark for Visually Grounded Deep Compositional Reasoning

Haozhan Shen,Shilin Yan,Hongwei Xue,Shuaiqi Lu,Xiaojun Tang,Guannan Zhang,Tiancheng Zhao,Jianwei Yin

Main category: cs.CV

TL;DR: 本文提出MM-CondChain基准,用于评估多模态大语言模型(MLLMs)在视觉引导下的深度组合条件推理能力,涵盖自然图像、数据图表和GUI轨迹三类场景;通过代理合成流水线(含Planner、VPIR和Composer)构建可验证的多层推理链;实验表明当前最强MLLM在此任务上表现仍有限(Path F1仅53.33%),凸显该能力仍是根本性挑战。

Details Motivation: 现有基准未能充分评估MLLMs在GUI导航等视觉工作流中处理深度链式、视觉接地的组合条件(如多对象/属性/关系联合判断)的能力,多聚焦于浅层或独立约束。 Method: 提出MM-CondChain基准及配套的代理合成流水线:Planner分层生成组合条件,Verifiable Programmatic Intermediate Representation(VPIR)确保每层条件机械可验证,Composer整合为完整指令;覆盖自然图像、数据图表与GUI轨迹三类视觉域。 Result: 在多个MLLM上实验显示,最强模型Path F1仅为53.33%,且在难负样本、推理深度增加或谓词复杂度升高时性能显著下降。 Conclusion: 深度视觉组合推理仍是MLLM的核心瓶颈,MM-CondChain为该能力提供了首个系统性、可扩展、可验证的评估基准。 Abstract: Multimodal Large Language Models (MLLMs) are increasingly used to carry out visual workflows such as navigating GUIs, where the next step depends on verified visual compositional conditions (e.g., "if a permission dialog appears and the color of the interface is green, click Allow") and the process may branch or terminate early. Yet this capability remains under-evaluated: existing benchmarks focus on shallow-compositions or independent-constraints rather than deeply chained compositional conditionals. In this paper, we introduce MM-CondChain, a benchmark for visually grounded deep compositional reasoning. Each benchmark instance is organized as a multi-layer reasoning chain, where every layer contains a non-trivial compositional condition grounded in visual evidence and built from multiple objects, attributes, or relations. To answer correctly, an MLLM must perceive the image in detail, reason over multiple visual elements at each step, and follow the resulting execution path to the final outcome. To scalably construct such workflow-style data, we propose an agentic synthesis pipeline: a Planner orchestrates layer-by-layer generation of compositional conditions, while a Verifiable Programmatic Intermediate Representation (VPIR) ensures each layer's condition is mechanically verifiable. A Composer then assembles these verified layers into complete instructions. Using this pipeline, we construct benchmarks across three visual domains: natural images, data charts, and GUI trajectories. Experiments on a range of MLLMs show that even the strongest model attains only 53.33 Path F1, with sharp drops on hard negatives and as depth or predicate complexity grows, confirming that deep compositional reasoning remains a fundamental challenge.

[178] EVATok: Adaptive Length Video Tokenization for Efficient Visual Autoregressive Generation

Tianwei Xiong,Jun Hao Liew,Zilong Huang,Zhijie Lin,Jiashi Feng,Xihui Liu

Main category: cs.CV

TL;DR: 本文提出EVATok框架,通过自适应视频分词器优化视频生成中的token分配,提升重建质量和生成效率。

Details Motivation: 传统视频分词器对不同视频的时序块采用固定token分配,导致简单/静态片段浪费token、复杂/动态片段token不足,效率低下。 Method: EVATok框架包含三部分:1)估计每段视频的最优token分配;2)设计轻量级路由器快速预测该分配;3)训练能根据路由器预测结果进行自适应编码的分词器,并结合视频语义编码器改进训练策略。 Result: 在UCF-101数据集上,EVATok相比LARP和固定长度基线,平均token使用量减少至少24.4%,重建质量更优,且在class-to-video生成任务中达到SOTA性能。 Conclusion: EVATok通过视频自适应token分配显著提升了AR视频生成模型的效率与质量,为高效视频建模提供了新范式。 Abstract: Autoregressive (AR) video generative models rely on video tokenizers that compress pixels into discrete token sequences. The length of these token sequences is crucial for balancing reconstruction quality against downstream generation computational cost. Traditional video tokenizers apply a uniform token assignment across temporal blocks of different videos, often wasting tokens on simple, static, or repetitive segments while underserving dynamic or complex ones. To address this inefficiency, we introduce $\textbf{EVATok}$, a framework to produce $\textbf{E}$fficient $\textbf{V}$ideo $\textbf{A}$daptive $\textbf{Tok}$enizers. Our framework estimates optimal token assignments for each video to achieve the best quality-cost trade-off, develops lightweight routers for fast prediction of these optimal assignments, and trains adaptive tokenizers that encode videos based on the assignments predicted by routers. We demonstrate that EVATok delivers substantial improvements in efficiency and overall quality for video reconstruction and downstream AR generation. Enhanced by our advanced training recipe that integrates video semantic encoders, EVATok achieves superior reconstruction and state-of-the-art class-to-video generation on UCF-101, with at least 24.4% savings in average token usage compared to the prior state-of-the-art LARP and our fixed-length baseline.