cs.CL

[1] Speculative Decoding Scaling Laws (SDSL): Throughput Optimization Made Simple

Amirhossein Bozorgkhoo,Igor Molybog

Main category: cs.CL

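TL;DR: This paper proposes a theoretical framework that analytically connects the key hyperparameters of pre-trained large language models (LLMs) to the throughput efficiency of inference systems based on speculative decoding (SD), so that throughput-optimal hyperparameter configurations can be predicted before pre-training.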

Details

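Motivation: Previous work optimized the throughput of speculative decoding inference pipelines experimentally, which requires training LLMs and is costly; this paper aims to build a theoretical model that guides hyperparameter design cheaply and ahead of training.
Method: Proposes an analytical theory that mathematically relates the key hyperparameters of pre-trained LLMs to the throughput efficiency of a downstream SD-based inference system.
Result: The theory predicts throughput-optimal hyperparameters for each component of the inference system before pre-training, avoiding expensive experimental tuning.
Conclusion: A theory-driven approach can replace experimental trial and error and offers interpretable, predictive guidance for designing efficient speculative decoding systems.
Abstract: Speculative decoding is a technique that uses multiple language models to accelerate inference. Previous works have used an experimental approach to optimize the throughput of the inference pipeline, which involves LLM training and can be costly. This study of speculative decoding proposes a theory that analytically connects the key hyperparameters of pre-trained LLMs to the throughput efficiency of a downstream SD-based inference system. The theory allows the prediction of throughput-optimal hyperparameters for the components of an inference system before their pre-training.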
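The abstract does not state the theory's formulas, so the following is only a minimal sketch of the kind of throughput quantity such an analysis works with. It uses the commonly cited simplified model of speculative decoding, in which each drafted token is accepted independently with probability `alpha`; this is an assumption here, not the SDSL paper's own theory. The names `alpha`, `gamma`, `cost_ratio`, and `best_gamma` are illustrative.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model verification pass.

    alpha: assumed i.i.d. probability that a drafted token is accepted.
    gamma: number of tokens the draft model proposes per pass.
    """
    if not 0.0 <= alpha <= 1.0 or gamma < 0:
        raise ValueError("need 0 <= alpha <= 1 and gamma >= 0")
    if alpha == 1.0:
        # Every draft token is accepted, plus one token from the target model.
        return float(gamma + 1)
    # Geometric sum 1 + alpha + ... + alpha**gamma.
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Expected speedup over plain autoregressive decoding.

    cost_ratio: draft-model time per token divided by target-model time per pass.
    """
    return expected_tokens_per_step(alpha, gamma) / (gamma * cost_ratio + 1.0)


def best_gamma(alpha: float, cost_ratio: float, max_gamma: int = 32) -> int:
    """Draft length in 1..max_gamma that maximizes the modeled speedup."""
    return max(range(1, max_gamma + 1), key=lambda g: speedup(alpha, g, cost_ratio))


if __name__ == "__main__":
    for alpha in (0.6, 0.8, 0.9):
        g = best_gamma(alpha, cost_ratio=0.05)
        print(f"alpha={alpha}: best gamma={g}, speedup={speedup(alpha, g, 0.05):.2f}")
```

Under this simplified model, a higher acceptance rate or a cheaper draft model favors longer drafts, which is the sort of trade-off a pre-training-time theory would need to predict.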

[2] Summarize Before You Speak with ARACH: A Training-Free Inference-Time Plug-In for Enhancing LLMs via Global Attention Reallocation

Jingtao Wang,Yucong Wang,Jun Ding,Rui Cai,Xun Wang

Main category: cs.CL

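TL;DR: This paper proposes ARACH, a training-free inference-time plug-in that uses an adaptive context hub to aggregate context and reallocate attention, improving LLM performance without updating any parameters.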

Details

Motivation: Existing post-training techniques are mostly black-box input/output interventions (such as prompt engineering and test-time sampling with reranking) and lack a plug-and-play mechanism for intervening in a model's internal computation.
Method: Proposes ARACH (Attention Reallocation via an Adaptive Context Hub), which introduces an adaptive context hub at inference time to aggregate context dynamically and reallocate attention, with no parameter updates.
Result: Consistent improvements across multiple language modeling tasks with modest inference overhead; attention analyses suggest that ARACH mitigates the attention sink phenomenon.
Conclusion: Engineering a model's internal computation is a distinct inference-time strategy, different from both prompt engineering and training-based post-training.
Abstract: Large language models (LLMs) achieve remarkable performance, yet further gains often require costly training. This has motivated growing interest in post-training techniques, especially training-free approaches that improve models at inference time without updating weights. Most training-free methods treat the model as a black box and improve outputs via input/output-level interventions, such as prompt design and test-time scaling through repeated sampling, reranking/verification, or search. In contrast, they rarely offer a plug-and-play mechanism to intervene in a model's internal computation. We propose ARACH (Attention Reallocation via an Adaptive Context Hub), a training-free inference-time plug-in that augments LLMs with an adaptive context hub to aggregate context and reallocate attention. Extensive experiments across multiple language modeling tasks show consistent improvements with modest inference overhead and no parameter updates. Attention analyses further suggest that ARACH mitigates the attention sink phenomenon. These results indicate that engineering a model's internal computation offers a distinct inference-time strategy, fundamentally different from both prompt-based test-time methods and training-based post-training approaches.

[3] DeReason: A Difficulty-Aware Curriculum Improves Decoupled SFT-then-RL Training for General Reasoning

Hanxu Hu,Yuxuan Wang,Maggie Huan,Jannis Vamvas,Yinya Huang,Zhijiang Guo,Rico Sennrich

Main category: cs.CL

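TL;DR: This paper proposes DeReason, a difficulty-based data decoupling strategy that improves the combined supervised fine-tuning (SFT) and reinforcement learning (RL) pipeline in general STEM domains and significantly strengthens LLM reasoning.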

Details

Motivation: The interaction between SFT and RL in general STEM domains has not been explored systematically, and key challenges remain in sample efficiency and in how data is allocated between the two stages.
Method: Proposes DeReason: estimate each problem's reasoning intensity with LLM-based scoring and partition the training data into reasoning-intensive and non-reasoning-intensive subsets; the former is used for RL to cultivate complex reasoning, the latter for SFT to establish foundational domain knowledge.
Result: On multiple general STEM and mathematical benchmarks, DeReason significantly outperforms SFT-only, RL-only, and random-split SFT+RL baselines.
Conclusion: SFT and RL play complementary roles in general reasoning; decoupling the training data by reasoning difficulty and assigning each part to the appropriate stage is a key design principle for improving performance.
Abstract: Reinforcement learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm for eliciting reasoning capabilities in large language models, particularly in mathematics and coding. While recent efforts have extended this paradigm to broader general scientific (STEM) domains, the complex interplay between supervised fine-tuning (SFT) and RL in these contexts remains underexplored. In this paper, we conduct controlled experiments revealing a critical challenge: for general STEM domains, RL applied directly to base models is highly sample-inefficient and is consistently surpassed by supervised fine-tuning (SFT) on moderate-quality responses. Yet sequential SFT followed by RL can further improve performance, suggesting that the two stages play complementary roles, and that how training data is allocated between them matters. Therefore, we propose DeReason, a difficulty-based data decoupling strategy for general reasoning. DeReason partitions training data by reasoning intensity estimated via LLM-based scoring into reasoning-intensive and non-reasoning-intensive subsets. It allocates broad-coverage, non-reasoning-intensive problems to SFT to establish foundational domain knowledge, and reserves a focused subset of difficult problems for RL to cultivate complex reasoning. We demonstrate that this principled decoupling yields better performance than randomly splitting the data for sequential SFT and RL. Extensive experiments on general STEM and mathematical benchmarks demonstrate that our decoupled curriculum training significantly outperforms SFT-only, RL-only, and random-split baselines.
Our work provides a systematic study of the interplay between SFT and RL for general reasoning, offering a highly effective and generalized post-training recipe.

[4] MDER-DR: Multi-Hop Question Answering with Entity-Centric Summaries

Riccardo Campi,Nicolò Oreste Pinciroli Vago,Mathyas Giudici,Marco Brambilla,Piero Fraternali

Main category: cs.CL

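TL;DR: This paper proposes MDER-DR, a knowledge-graph-based retrieval-augmented generation (RAG) framework. The MDER indexing method preserves contextual information and the DR retrieval mechanism performs iterative reasoning for multi-hop QA, substantially outperforming standard RAG on standard and domain-specific benchmarks (by up to 66%).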

Details

Motivation: Existing RAG methods over knowledge graphs tend to lose contextual detail when text is reduced to triples for indexing, which degrades multi-hop QA performance.
Method: Proposes the Map-Disambiguate-Enrich-Reduce (MDER) indexing method, which generates context-derived triple descriptions and integrates them with entity-level summaries, and the Decompose-Resolve (DR) retrieval mechanism, which decomposes queries into resolvable triples and grounds them in the KG via iterative reasoning.
Result: On standard and domain-specific QA benchmarks, MDER-DR improves over baseline RAG methods by up to 66% while remaining robust across languages.
Conclusion: MDER-DR is a domain-agnostic, LLM-driven knowledge-graph QA framework that copes with sparse, incomplete, and complex relational data and substantially improves multi-hop QA performance.
Abstract: Retrieval-Augmented Generation (RAG) over Knowledge Graphs (KGs) suffers from the fact that indexing approaches may lose important contextual nuance when text is reduced to triples, thereby degrading performance in downstream Question-Answering (QA) tasks, particularly for multi-hop QA, which requires composing answers from multiple entities, facts, or relations. We propose a domain-agnostic, KG-based QA framework that covers both the indexing and retrieval/inference phases. A new indexing approach called Map-Disambiguate-Enrich-Reduce (MDER) generates context-derived triple descriptions and subsequently integrates them with entity-level summaries, thus avoiding the need for explicit traversal of edges in the graph during the QA retrieval phase. Complementing this, we introduce Decompose-Resolve (DR), a retrieval mechanism that decomposes user queries into resolvable triples and grounds them in the KG via iterative reasoning. Together, MDER and DR form an LLM-driven QA pipeline that is robust to sparse, incomplete, and complex relational data. Experiments show that on standard and domain-specific benchmarks, MDER-DR achieves substantial improvements over standard RAG baselines (up to 66%), while maintaining cross-lingual robustness. Our code is available at https://github.com/DataSciencePolimi/MDER-DR_RAG.

[5] Markovian Generation Chains in Large Language Models

Mingmeng Geng,Amr Mohamed,Guokan Shang,Michalis Vazirgiannis,Thierry Poibeau

Main category: cs.CL

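TL;DR: This paper studies how text evolves when LLMs process it repeatedly (e.g., rephrasing or round-trip translation), models the process as a memoryless Markovian generation chain, and finds that outputs may either converge or keep producing novel sentences, with diversity depending on the temperature parameter and the initial sentence.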

Details

Motivation: To investigate how text evolves over repeated LLM inference, in order to understand the underlying dynamics and their implications for multi-agent LLM systems.
Method: Introduces the "Markovian generation chain" framework, runs iterative rephrasing and round-trip translation experiments, and combines sentence-level Markov chain modeling with analysis of simulated data.
Result: Iterative generation may converge to a small recurrent set or keep producing novel sentences; diversity can increase or decrease depending on the temperature parameter and the initial sentence.
Conclusion: Iterative LLM inference has complex dynamics and its effect on diversity is not monotonic, so multi-turn and multi-agent applications should be designed with care.
Abstract: The widespread use of large language models (LLMs) raises an important question: how do texts evolve when they are repeatedly processed by LLMs? In this paper, we define this iterative inference process as Markovian generation chains, where each step takes a specific prompt template and the previous output as input, without including any prior memory. In iterative rephrasing and round-trip translation experiments, the output either converges to a small recurrent set or continues to produce novel sentences over a finite horizon. Through sentence-level Markov chain modeling and analysis of simulated data, we show that the iterative process can either increase or reduce sentence diversity depending on factors such as the temperature parameter and the initial input sentence. These results offer valuable insights into the dynamics of iterative LLM inference and their implications for multi-agent LLM systems.
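The chain structure described in the abstract, where each step sees only the previous output, can be sketched with a toy stand-in for the LLM call. Everything here is hypothetical: `make_toy_step` is not the paper's setup, and `p_novel` only loosely plays the role of temperature by controlling how often the step produces a sentence it has not produced before.

```python
import random


def run_chain(step, initial: str, n_steps: int, seed: int = 0) -> list[str]:
    """Iterate a memoryless generation step: each output depends only on the last one."""
    rng = random.Random(seed)
    outputs = [initial]
    for _ in range(n_steps):
        outputs.append(step(outputs[-1], rng))
    return outputs


def novelty_curve(outputs: list[str]) -> list[int]:
    """Number of distinct sentences seen after each step."""
    seen: set[str] = set()
    curve = []
    for sentence in outputs:
        seen.add(sentence)
        curve.append(len(seen))
    return curve


def make_toy_step(p_novel: float):
    """Toy stand-in for an LLM rephrasing call (an assumption, not a real model)."""
    def step(prev: str, rng: random.Random) -> str:
        if rng.random() < p_novel:
            # "Creative" move: extend the sentence, yielding a longer variant.
            return prev + " indeed"
        # "Canonicalizing" move: collapse back to a fixed normal form.
        return prev.lower().replace(" indeed", "")
    return step


if __name__ == "__main__":
    for p in (0.0, 0.3, 1.0):
        outs = run_chain(make_toy_step(p), "The cat sat.", n_steps=50)
        print(f"p_novel={p}: {novelty_curve(outs)[-1]} distinct sentences")
```

At `p_novel = 0.0` the chain falls into a single fixed point after one step, while at `p_novel = 1.0` every output is new, which mirrors the two regimes (small recurrent set versus continued novelty) the abstract reports.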

[6] Artificial Intelligence for Sentiment Analysis of Persian Poetry

Arash Zargar,Abolfazl Moshiri,Mitra Shafaei,Shabnam Rahimi-Golkhandan,Mohamad Tavakoli-Targhi,Farzad Khalvati

Main category: cs.CL

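TL;DR: This study uses BERT- and GPT-based language models to analyze the poetry of the Persian poets Rumi and Parvin E'tesami, examining how well the models grasp the complexities of Persian poetry and whether a poem's sentiment correlates with its meter. The results indicate that GPT-4o can be used reliably to analyze Persian poetry, that Rumi's poems are more positive in sentiment overall, and that Rumi's use of meter expresses a wider range of sentiments.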

Details

Motivation: To explore how well modern LLMs grasp the complexities of Persian poetry (semantics, meter, sentiment) and to test whether a poem's sentiment correlates with its meter.
Method: Applies multiple BERT- and GPT-based language models (notably GPT-4o) to the works of Rumi and Parvin E'tesami for sentiment analysis and a comparison of meter usage.
Result: GPT-4o can be used reliably to analyze Persian poetry; Rumi's poems are more positive in sentiment overall than Parvin E'tesami's; Rumi uses meter to express a wider variety of sentiments; LLMs can support computer-based semantic studies with reduced human bias.
Conclusion: LLMs, GPT-4o in particular, can support automated semantic analysis of Persian poetry, showing their potential to replace or assist human interpretation in the humanities and to make poetry research more objective and scalable.
Abstract: Recent advances in Artificial Intelligence (AI) have led to the development of large language models (LLMs) that are capable of understanding, analysing, and creating textual data. These language models open a significant opportunity in analyzing literature and more specifically poetry. In the present work, we employ multiple Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-trained Transformer (GPT) based language models to analyze the works of two prominent Persian poets: Jalal al-Din Muhammad Rumi (Rumi) and Parvin E'tesami. The main objective of this research is to investigate the capability of modern language models in grasping the complexities of Persian poetry and to explore potential correlations between the poems' sentiment and their meters. Our findings in this study indicate that the GPT-4o language model can reliably be used in the analysis of Persian poetry. Furthermore, the results of our sentiment analysis revealed that, in general, Rumi's poems express happier sentiments than Parvin E'tesami's poems. In addition, comparing the utilization of poetic meters highlighted the superiority of Rumi's poems in using meters to express a wider variety of sentiments. These findings are significant as they confirm that LLMs can be effectively applied in conducting computer-based semantic studies, where human interpretations are not required, thereby significantly reducing potential biases in the analysis.

[7] ThReadMed-QA: A Multi-Turn Medical Dialogue Benchmark from Real Patient Questions

Monica Munnangi,Saiph Savage

Main category: cs.CL

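TL;DR: This paper presents ThReadMed-QA, a multi-turn medical QA benchmark built from real online patient-physician conversations. It shows that current mainstream LLMs degrade substantially in multi-turn medical QA and introduces two new metrics, CCS and EPR, to quantify their inconsistency and error propagation.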

Details

Motivation: Existing medical QA benchmarks are mostly single-turn and do not reflect the repeated clarification and multi-turn interaction of real patient consultations; a multi-turn benchmark built from real conversations is needed.
Method: Extracts 2,437 fully answered patient-physician threads (8,204 QA pairs) from r/AskDocs to build ThReadMed-QA; evaluates five state-of-the-art LLMs on a stratified test split (238 conversations) with an LLM-as-a-judge rubric grounded in physician ground truth; proposes the Conversational Consistency Score (CCS) and Error Propagation Rate (EPR) to analyze multi-turn failure modes.
Result: GPT-5 has the highest first-turn score (75.2/100) but drops 16.2 points by turn 2; all models degrade significantly by turn 2 (p<0.001), with wrong-answer rates roughly tripling; CCS shows that nearly one in three Claude Haiku conversations swings between fully correct and completely wrong; EPR shows that a single wrong turn raises the probability of a subsequent wrong turn by 1.9-6.1x.
Conclusion: Current LLMs are fairly capable at single-turn medical QA but seriously unreliable in multi-turn interaction, with systematic inconsistency and error propagation; new modeling and evaluation approaches for multi-turn medical dialogue are needed.
Abstract: Medical question-answering benchmarks predominantly evaluate single-turn exchanges, failing to capture the iterative, clarification-seeking nature of real patient consultations. We introduce ThReadMed-QA, a benchmark of 2,437 fully-answered patient-physician conversation threads extracted from r/AskDocs, comprising 8,204 question-answer pairs across up to 9 turns. Unlike prior work relying on simulated dialogues, adversarial prompts, or exam-style questions, ThReadMed-QA captures authentic patient follow-up questions and verified physician responses, reflecting how patients naturally seek medical information online. We evaluate five state-of-the-art LLMs -- GPT-5, GPT-4o, Claude Haiku, Gemini 2.5 Flash, and Llama 3.3 70B -- on a stratified test split of 238 conversations (948 QA pairs) using a calibrated LLM-as-a-judge rubric grounded in physician ground truth. Even the strongest model, GPT-5, achieves only 41.2% fully-correct responses. All five models degrade significantly from turn 0 to turn 2 (p < 0.001), with wrong-answer rates roughly tripling by the third turn. We identify a fundamental tension between single-turn capability and multi-turn reliability: models with the strongest initial performance (GPT-5: 75.2; Claude Haiku: 72.3 out of 100) exhibit the steepest declines by turn 2 (dropping 16.2 and 25.0 points respectively), while weaker models plateau or marginally improve.
We introduce two metrics to quantify multi-turn failure modes: Conversational Consistency Score (CCS) and Error Propagation Rate (EPR). CCS reveals that nearly one in three Claude Haiku conversations swings between a fully correct and a completely wrong response within the same thread. EPR shows that a single wrong turn raises the probability of a subsequent wrong turn by 1.9-6.1x across all models.
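The abstract names the two metrics but does not give their formulas, so the sketch below is one plausible reading rather than the paper's definition. It assumes each turn carries a judged label of `"correct"`, `"partial"`, or `"wrong"`; `swing_rate` counts conversations that contain both a fully correct and a completely wrong turn, and `error_propagation_ratio` compares the chance of a wrong turn after a wrong turn with the chance after a non-wrong turn. Both function names and the label set are hypothetical.

```python
def swing_rate(conversations: list[list[str]]) -> float:
    """Fraction of conversations with both a fully correct and a completely wrong turn."""
    swings = sum(1 for turns in conversations if "correct" in turns and "wrong" in turns)
    return swings / len(conversations)


def error_propagation_ratio(conversations: list[list[str]]) -> float:
    """P(next turn wrong | this turn wrong) / P(next turn wrong | this turn not wrong)."""
    wrong_after_wrong = total_after_wrong = 0
    wrong_after_ok = total_after_ok = 0
    for turns in conversations:
        # Walk over consecutive (previous, next) turn pairs within one thread.
        for prev, nxt in zip(turns, turns[1:]):
            if prev == "wrong":
                total_after_wrong += 1
                wrong_after_wrong += nxt == "wrong"
            else:
                total_after_ok += 1
                wrong_after_ok += nxt == "wrong"
    if total_after_wrong == 0 or total_after_ok == 0 or wrong_after_ok == 0:
        raise ValueError("not enough transitions to form the ratio")
    return (wrong_after_wrong / total_after_wrong) / (wrong_after_ok / total_after_ok)


if __name__ == "__main__":
    judged = [
        ["correct", "wrong", "wrong"],
        ["correct", "correct", "correct"],
        ["correct", "partial", "wrong"],
        ["wrong", "wrong", "correct"],
    ]
    print("swing rate:", swing_rate(judged))
    print("propagation ratio:", round(error_propagation_ratio(judged), 3))
```

A ratio above 1 under this reading means a wrong turn makes the following turn more likely to be wrong, which is the direction of the 1.9-6.1x effect the abstract reports.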

[8] Temporal Text Classification with Large Language Models

Nishat Raihan,Marcos Zampieri

Main category: cs.CL

TL;DR: This paper evaluates leading proprietary and open-source LLMs on Temporal Text Classification (TTC), i.e., estimating text publication dates, using zero-shot, few-shot, and fine-tuning approaches across three historical corpora.

Details

Motivation: Despite recent progress in LLMs, their performance on automatic text dating (Temporal Text Classification) remains unexplored; this study fills that gap.
Method: Systematic evaluation of proprietary (Claude 3.5, GPT-4o, Gemini 1.5) and open-source (LLaMA 3.2, Gemma 2, Mistral, Nemotron 4) LLMs on TTC using three historical corpora (two English, one Portuguese), under zero-shot, few-shot, and fine-tuning settings.
Result: Proprietary models perform well, especially with few-shot prompting; fine-tuning improves open-source models significantly but they still underperform compared to proprietary ones.
Conclusion: Proprietary LLMs currently outperform open-source alternatives on TTC, with few-shot prompting being particularly effective; fine-tuning helps open-source models but does not close the performance gap.
Abstract: Languages change over time. Computational models can be trained to recognize such changes, enabling them to estimate the publication date of texts. Despite recent advancements in Large Language Models (LLMs), their performance on automatic dating of texts, also known as Temporal Text Classification (TTC), has not been explored. This study provides the first systematic evaluation of leading proprietary (Claude 3.5, GPT-4o, Gemini 1.5) and open-source (LLaMA 3.2, Gemma 2, Mistral, Nemotron 4) LLMs on TTC using three historical corpora, two in English and one in Portuguese. We test zero-shot and few-shot prompting, and fine-tuning settings. Our results indicate that proprietary models perform well, especially with few-shot prompting. They also indicate that fine-tuning substantially improves open-source models but that they still fail to match the performance delivered by proprietary LLMs.

[9] Evaluating Explainable AI Attribution Methods in Neural Machine Translation via Attention-Guided Knowledge Distillation

Aria Nourbakhsh,Salima Lamsiyah,Adelaide Danilov,Christoph Schommer

Main category: cs.CL

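TL;DR: This paper proposes a new framework for evaluating explainability (XAI) methods in sequence-to-sequence (seq2seq) models, in which teacher-derived attribution maps guide a student Transformer. Experiments show that Attention, Value Zeroing, and Layer Gradient × Activation attributions yield the largest gains in BLEU and related metrics, and that the accuracy with which attribution maps can be reconstructed correlates positively with downstream gains.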

Details

Motivation: Systematic, automated evaluation of explainable AI (XAI) methods in seq2seq models, Transformers in particular, remains underdeveloped, and there is no unified, quantifiable evaluation paradigm.
Method: Uses attribution maps from teacher models (Marian-MT, mBART) as a side signal and injects them into the student Transformer's attention mechanism under four composition operators (addition, multiplication, averaging, replacement); extracts source-target attribution scores with the Inseq library; introduces an "Attributor" model that learns to reconstruct the teacher's attribution maps.
Result: Attention, Value Zeroing, and Layer Gradient × Activation bring the largest BLEU/chrF gains on the de-en, fr-en, and ar-en language pairs, while Saliency, Integrated Gradients, and other gradient-based methods give smaller and less consistent improvements; the Attributor's reconstruction accuracy correlates positively with the usefulness of injecting the attribution.
Conclusion: Different attribution methods capture fundamentally different information; attention-based attributions better reflect source-target alignment in seq2seq models; how well an attribution map can be reconstructed is a useful proxy for its explanatory value.
Abstract: The study of the attribution of input features to the output of neural network models is an active area of research. While numerous Explainable AI (XAI) techniques have been proposed to interpret these models, the systematic and automated evaluation of these methods in sequence-to-sequence (seq2seq) models is less explored. This paper introduces a new approach for evaluating explainability methods in transformer-based seq2seq models. We use teacher-derived attribution maps as a structured side signal to guide a student model, and quantify the utility of different attribution methods through the student's ability to simulate targets. Using the Inseq library, we extract attribution scores over source-target sequence pairs and inject these scores into the attention mechanism of a student transformer model under four composition operators (addition, multiplication, averaging, and replacement). Across three language pairs (de-en, fr-en, ar-en) and attributions from Marian-MT and mBART models, Attention, Value Zeroing, and Layer Gradient $\times$ Activation consistently yield the largest gains in BLEU (and corresponding improvements in chrF) relative to baselines. In contrast, other gradient-based methods (Saliency, Integrated Gradients, DeepLIFT, Input $\times$ Gradient, GradientShap) lead to smaller and less consistent improvements.
These results suggest that different attribution methods capture distinct signals and that attention-derived attributions better capture alignment between source and target representations in seq2seq models. Finally, we introduce an Attributor transformer that, given a source-target pair, learns to reconstruct the teacher's attribution map. Our findings demonstrate that the more accurately the Attributor can reproduce attribution maps, the more useful an injection of those maps is for the downstream task. The source code can be found on GitHub.
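The four composition operators named in the abstract can be illustrated on a single row of attention scores. This is a minimal sketch under stated assumptions: the abstract does not say where in the attention computation the injection happens, so the code assumes the attribution row is combined with the pre-softmax scores and then normalized. The `compose` function and its argument names are hypothetical, not the authors' code.

```python
import math


def softmax(row: list[float]) -> list[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def compose(scores: list[float], attribution: list[float], op: str) -> list[float]:
    """Combine pre-softmax attention scores with an attribution row, then normalize.

    Assumption: injection happens before the softmax; the source does not specify this.
    """
    if len(scores) != len(attribution):
        raise ValueError("scores and attribution must have the same length")
    if op == "addition":
        mixed = [s + a for s, a in zip(scores, attribution)]
    elif op == "multiplication":
        mixed = [s * a for s, a in zip(scores, attribution)]
    elif op == "averaging":
        mixed = [(s + a) / 2.0 for s, a in zip(scores, attribution)]
    elif op == "replacement":
        mixed = list(attribution)
    else:
        raise ValueError(f"unknown operator: {op}")
    return softmax(mixed)


if __name__ == "__main__":
    scores = [1.0, 0.0, -1.0]        # student's own attention scores over 3 source tokens
    attribution = [0.0, 2.0, 0.0]    # teacher attribution favoring the second token
    for op in ("addition", "multiplication", "averaging", "replacement"):
        print(op, [round(w, 3) for w in compose(scores, attribution, op)])
```

Applying the operators before the softmax keeps addition and averaging distinct; if they were applied to already-normalized weights and renormalized, the two would give identical results.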

[10] Stop Listening to Me! How Multi-turn Conversations Can Degrade Diagnostic Reasoning

Kevin H. Guo,Chao Yan,Avinash Baidya,Katherine Brown,Xiang Gao,Juming Xiong,Zhijun Yin,Bradley A. Malin

Main category: cs.CL

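TL;DR: This paper introduces a "stick-or-switch" evaluation framework to assess the reasoning of 17 LLMs in multi-turn clinical diagnostic conversations. It finds that multi-turn interaction consistently degrades performance (the "conversation tax"), that models often abandon correct diagnoses or safe abstentions to follow incorrect user suggestions, and that some models switch blindly.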

Details

Motivation: Although LLMs perform well on static diagnostic benchmarks, their diagnostic reasoning in multi-turn medical conversations, which better reflect real-world use, has not been studied systematically.
Method: Builds a "stick-or-switch" evaluation framework that measures conviction (defending a correct diagnosis or safe abstention) and flexibility (recognizing and adopting a correct suggestion when it is introduced) across conversations, and evaluates 17 LLMs on three clinical datasets.
Result: Experiments reveal a clear "conversation tax": multi-turn interaction consistently degrades performance; models often abandon initially correct diagnoses or safe abstentions to align with incorrect user suggestions; several models exhibit "blind switching", failing to distinguish correct from incorrect suggestions.
Conclusion: Current LLMs are not robust enough at diagnostic reasoning in multi-turn clinical conversations; they need better conviction and better discrimination between signal and incorrect suggestions to be reliable and safe in real clinical settings.
Abstract: Patients and clinicians are increasingly using chatbots powered by large language models (LLMs) for healthcare inquiries. While state-of-the-art LLMs exhibit high performance on static diagnostic reasoning benchmarks, their efficacy across multi-turn conversations, which better reflect real-world usage, has been understudied. In this paper, we evaluate 17 LLMs across three clinical datasets to investigate how partitioning the decision-space into multiple simpler turns of conversation influences their diagnostic reasoning. Specifically, we develop a "stick-or-switch" evaluation framework to measure model conviction (i.e., defending a correct diagnosis or safe abstention against incorrect suggestions) and flexibility (i.e., recognizing a correct suggestion when it is introduced) across conversations. Our experiments reveal the conversation tax, where multi-turn interactions consistently degrade performance when compared to single-shot baselines. Notably, models frequently abandon initial correct diagnoses and safe abstentions to align with incorrect user suggestions. Additionally, several models exhibit blind switching, failing to distinguish between signal and incorrect suggestions.

[11] Algorithmic Consequences of Particle Filters for Sentence Processing: Amplified Garden-Paths and Digging-In Effects

Amani Maina-Kilaas,Roger Levy

Main category: cs.CL

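TL;DR: This paper examines the role of representations of structural ambiguity in sentence processing. LLM surprisal robustly predicts reading times but systematically underpredicts difficulty when structural expectations are violated, suggesting that representations of structural ambiguity play a causal role. The authors analyze particle filter models as an alternative that represents structural hypotheses explicitly, and prove that resampling inherently produces real-time "digging-in" effects whose magnitude grows as the number of particles shrinks.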

Details

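Motivation: Surprisal-based predictions from LLMs fall short when structural expectations are violated, suggesting that the lack of an explicit representation of structural ambiguity may limit how well they model human syntactic processing.
Method: Theoretical modeling and algorithmic analysis: builds and analyzes a particle-filter model of sentence processing, rigorously derives its algorithmic properties (such as amplified garden-path effects and resampling-induced real-time digging-in), and contrasts it with fully parallel models.
Result: Proves that resampling in particle filter models necessarily produces a digging-in effect, where a longer ambiguous region makes later disambiguation harder; the effect's magnitude scales inversely with particle count; fully parallel models predict no such effect.
Conclusion: Explicit probabilistic representations of structural ambiguity, such as those in a particle filter, are important for explaining specific dynamic phenomena in human syntactic processing (such as digging-in); LLM surprisal alone does not capture all of the cognitive mechanisms involved.
Abstract: Under surprisal theory, linguistic representations affect processing difficulty only through the bottleneck of surprisal. Our best estimates of surprisal come from large language models, which have no explicit representation of structural ambiguity. While LLM surprisal robustly predicts reading times across languages, it systematically underpredicts difficulty when structural expectations are violated -- suggesting that representations of ambiguity are causally implicated in sentence processing. Particle filter models offer an alternative where structural hypotheses are explicitly represented as a finite set of particles. We prove several algorithmic consequences of particle filter models, including the amplification of garden-path effects. Most critically, we demonstrate that resampling, a common practice with these models, inherently produces real-time digging-in effects -- where disambiguation difficulty increases with ambiguous region length. Digging-in magnitude scales inversely with particle count: fully parallel models predict no such effect.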
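The intuition behind resampling-induced digging-in can be shown with a toy Monte Carlo simulation. It is a hedged illustration, not the paper's proof or model: it assumes two competing parses, "A" and "B", that are equally likely at every word of the ambiguous region, and multinomial resampling of a fixed-size particle set after each word. The function name and parameters are hypothetical.

```python
import random


def survival_probability(n_particles: int, region_length: int,
                         trials: int = 500, seed: int = 0) -> float:
    """Estimate how often parse "B" still has at least one particle after the ambiguous region.

    If "B" turns out to be the correct parse but no particle carries it,
    disambiguation fails outright, which is the extreme form of digging-in.
    """
    rng = random.Random(seed)
    survived = 0
    for _ in range(trials):
        n_b = n_particles // 2  # start with particles split evenly between "A" and "B"
        for _ in range(region_length):
            # Ambiguous word: equal likelihoods, so each resampled particle is "B"
            # with probability equal to B's current share of the particle set.
            share_b = n_b / n_particles
            n_b = sum(1 for _ in range(n_particles) if rng.random() < share_b)
        survived += n_b > 0
    return survived / trials


if __name__ == "__main__":
    for n in (4, 16, 64):
        for length in (2, 10, 20):
            p = survival_probability(n, length)
            print(f"particles={n:3d} region={length:2d} P(B survives)={p:.2f}")
```

In this toy setting the chance of losing the eventually correct parse grows with the length of the ambiguous region and shrinks as particles are added, matching the qualitative pattern stated in the abstract; with infinitely many particles (a fully parallel model) no parse would ever be lost.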

[12] MaterialFigBENCH: benchmark dataset with figures for evaluating college-level materials science problem-solving abilities of multimodal large language models

Michiko Yoshitake,Yuta Suzuki,Ryo Igarashi,Yoshitaka Ushiku,Keisuke Nagato

Main category: cs.CL

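TL;DR: MaterialFigBench is a benchmark for multimodal LLMs in materials science that focuses on understanding and quantitatively reading key figures such as phase diagrams and stress-strain curves. Experiments show that current multimodal LLMs still rely heavily on memorized knowledge rather than on genuinely reading the figures.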

Details

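Motivation: Existing benchmarks rely mostly on text and do not systematically evaluate understanding of the figures that are indispensable in materials science (phase diagrams, diffraction patterns, and so on); a domain-specific, figure-centered benchmark is needed.
Method: Builds the MaterialFigBench dataset of 137 free-response problems adapted from university materials science textbooks, covering core topics such as crystal structures, phase transformations, and mechanical properties; provides expert-defined answer ranges to handle ambiguity in reading values from images; evaluates several state-of-the-art multimodal LLMs (such as the GPT series) across problem types and analyzes their error patterns.
Result: Overall accuracy improves with newer model versions, but models generally fail to truly understand the figures and often answer from memorized knowledge instead of reading the image; there are clear weaknesses in visual reasoning, numerical precision (such as significant digits), and quantitative interpretation, with improvement only on some problem types.
Conclusion: MaterialFigBench exposes fundamental limitations of current multimodal LLMs in understanding materials science figures and provides a domain-specific evaluation basis and direction for developing models with stronger figure understanding.
Abstract: We present MaterialFigBench, a benchmark dataset designed to evaluate the ability of multimodal large language models (LLMs) to solve university-level materials science problems that require accurate interpretation of figures. Unlike existing benchmarks that primarily rely on textual representations, MaterialFigBench focuses on problems in which figures such as phase diagrams, stress-strain curves, Arrhenius plots, diffraction patterns, and microstructural schematics are indispensable for deriving correct answers. The dataset consists of 137 free-response problems adapted from standard materials science textbooks, covering a broad range of topics including crystal structures, mechanical properties, diffusion, phase diagrams, phase transformations, and electronic properties of materials. To address unavoidable ambiguity in reading numerical values from images, expert-defined answer ranges are provided where appropriate. We evaluate several state-of-the-art multimodal LLMs, including ChatGPT and GPT models accessed via OpenAI APIs, and analyze their performance across problem categories and model versions. The results reveal that, although overall accuracy improves with model updates, current LLMs still struggle with genuine visual understanding and quantitative interpretation of materials science figures. In many cases, correct answers are obtained by relying on memorized domain knowledge rather than by reading the provided images.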
MaterialFigBench highlights persistent weaknesses in visual reasoning, numerical precision, and significant-digit handling, while also identifying problem types where performance has improved. This benchmark provides a systematic and domain-specific foundation for advancing multimodal reasoning capabilities in materials science and for guiding the development of future LLMs with stronger figure-based understanding.

[13] BLooP: Zero-Shot Abstractive Summarization using Large Language Models with Bigram Lookahead Promotion

Varun Iyer,Cornelia Caragea

Main category: cs.CL

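TL;DR: This paper proposes BLooP, a training-free decoding intervention that improves the faithfulness and quality of summaries by encouraging LLMs to generate bigrams that appear in the source document.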

Details

Motivation: When LLMs perform abstractive summarization without fine-tuning, they often miss key details and add extraneous information, so the faithfulness and accuracy of their summaries need to improve.
Method: Proposes BLooP (Bigram Lookahead Promotion), a training-free decoding strategy based on a hash table lookup: at each decoding step it favors candidate tokens that complete a bigram already present in the source document, with no training, fine-tuning, or model modification.
Result: On CNN/DM, CCSum, Multi-News, and SciTLDR, BLooP improves ROUGE and BARTScore for Llama-3.1-8B-Instruct, Mistral-Nemo-Instruct-2407, and Gemma-2-9b-it; human evaluation confirms that it significantly improves faithfulness without reducing readability.
Conclusion: BLooP is a lightweight, general, plug-and-play decoding enhancement that improves the factual consistency and performance of LLMs on abstractive summarization with no additional training cost.
Abstract: Abstractive summarization requires models to generate summaries that convey information in the source document. While large language models can generate summaries without fine-tuning, they often miss key details and include extraneous information. We propose BLooP (Bigram Lookahead Promotion), a simple training-free decoding intervention that encourages large language models (LLMs) to generate tokens that form bigrams from the source document. BLooP operates through a hash table lookup at each decoding step, requiring no training, fine-tuning, or model modification. We demonstrate improvements in ROUGE and BARTScore for Llama-3.1-8B-Instruct, Mistral-Nemo-Instruct-2407, and Gemma-2-9b-it on CNN/DM, CCSum, Multi-News, and SciTLDR. Human evaluation shows that BLooP significantly improves faithfulness without reducing readability. We make the code available at https://github.com/varuniyer/BLooP
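The hash-table idea in the abstract can be sketched in a few lines. This is a simplified illustration and not the released BLooP code: it assumes whitespace tokens instead of model token IDs, a fixed additive `bonus` on the logits (the abstract does not specify how promotion is applied), and greedy selection. The names `build_bigram_table`, `promote`, and `greedy_next` are hypothetical.

```python
def build_bigram_table(source_tokens: list[str]) -> dict[str, set[str]]:
    """Map each source token to the set of tokens that follow it in the source."""
    table: dict[str, set[str]] = {}
    for first, second in zip(source_tokens, source_tokens[1:]):
        table.setdefault(first, set()).add(second)
    return table


def promote(logits: dict[str, float], prev_token: str,
            table: dict[str, set[str]], bonus: float = 2.0) -> dict[str, float]:
    """Raise the score of candidates that would complete a source bigram.

    The additive bonus is an assumption made for this sketch.
    """
    followers = table.get(prev_token, set())
    return {
        token: (score + bonus if token in followers else score)
        for token, score in logits.items()
    }


def greedy_next(logits: dict[str, float]) -> str:
    """Pick the highest-scoring candidate token."""
    return max(logits, key=logits.get)


if __name__ == "__main__":
    source = "the cat sat on the mat".split()
    table = build_bigram_table(source)
    # Hypothetical next-token scores from a language model after generating "the".
    logits = {"dog": 1.0, "cat": 0.5, "mat": 0.4}
    print("without promotion:", greedy_next(logits))
    print("with promotion:   ", greedy_next(promote(logits, "the", table)))
```

Because the lookup is a dictionary access per step, the intervention adds little decoding overhead, which is consistent with the "no training, fine-tuning, or model modification" claim.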

[14] LLM-Assisted Causal Structure Disambiguation and Factor Extraction for Legal Judgment Prediction

Yuzhi Liang,Lixiang Ma,Xinrong Zhu

Main category: cs.CL

TL;DR: 本文提出了一种结合大语言模型先验与统计因果发现的增强型因果推理框架，用于提升法律判决预测的准确性和鲁棒性，通过细粒度法律要素提取与因果结构消歧，显著优于现有方法。

Details

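Motivation: Existing legal judgment prediction (LJP) methods based on pre-trained language models rely on statistical correlations and do not explicitly model legal constituent elements or causal logic, so they tend to learn spurious correlations and are not robust; existing causal methods face two bottlenecks on real legal texts: noisy legal factor extraction and high uncertainty in causal structure discovery.
Method: Proposes an enhanced causal framework that integrates LLM priors with statistical causal discovery: 1) a coarse-to-fine hybrid extraction mechanism (statistical sampling plus LLM semantic reasoning) that identifies and purifies legal constituent elements; 2) an LLM-assisted causal structure disambiguation mechanism that uses the LLM as a constrained prior knowledge base to evaluate ambiguous causal directions probabilistically and prune them, producing legally compliant candidate causal graphs; 3) a causal-aware judgment prediction model that explicitly constrains text attention intensity using the causal graphs.
Result: On multiple benchmark datasets including LEVEN, QA, and CAIL, the method significantly outperforms state-of-the-art baselines in predictive accuracy and robustness, particularly in distinguishing confusing charges.
Conclusion: Combining LLM priors with causal discovery mitigates factor noise and structural uncertainty in legal texts and improves the interpretability, legal soundness, and generalization of LJP models.
Abstract: Mainstream methods for Legal Judgment Prediction (LJP) based on Pre-trained Language Models (PLMs) heavily rely on the statistical correlation between case facts and judgment results. This paradigm lacks explicit modeling of legal constituent elements and underlying causal logic, making models prone to learning spurious correlations and suffering from poor robustness. While introducing causal inference can mitigate this issue, existing causal LJP methods face two critical bottlenecks in real-world legal texts: inaccurate legal factor extraction with severe noise, and significant uncertainty in causal structure discovery due to Markov equivalence under sparse features. To address these challenges, we propose an enhanced causal inference framework that integrates Large Language Model (LLM) priors with statistical causal discovery. First, we design a coarse-to-fine hybrid extraction mechanism combining statistical sampling and LLM semantic reasoning to accurately identify and purify standard legal constituent elements. Second, to resolve structural uncertainty, we introduce an LLM-assisted causal structure disambiguation mechanism. By utilizing the LLM as a constrained prior knowledge base, we conduct probabilistic evaluation and pruning on ambiguous causal directions to generate legally compliant candidate causal graphs. Finally, a causal-aware judgment prediction model is constructed by explicitly constraining text attention intensity via the generated causal graphs.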
Extensive experiments on multiple benchmark datasets, including LEVEN, QA, and CAIL, demonstrate that our proposed method significantly outperforms state-of-the-art baselines in both predictive accuracy and robustness, particularly in distinguishing confusing charges.

[15] Try, Check and Retry: A Divide-and-Conquer Framework for Boosting Long-context Tool-Calling Performance of LLMs

Kunfeng Chen,Qihuang Zhong,Juhua Liu,Bo Du,Dacheng Tao

Main category: cs.CL

TL;DR: 本文提出Tool-DC框架，通过'尝试-检查-重试'范式提升大语言模型在工具调用任务中的性能，分为无需训练（TF）和需训练（TB）两种变体，在多个基准上显著优于基线方法。

Details

Motivation: Existing methods perform poorly on long-context tool-calling tasks with large, noisy sets of candidate tools, which limits real-world use.
Method: Proposes Tool-DC, a divide-and-conquer framework with a training-free variant, Tool-DC (TF), and a training-based variant, Tool-DC (TB); both build on the "Try-Check-Retry" paradigm to reduce reasoning difficulty and make use of the model's self-reflection ability.
Result: Tool-DC (TF) yields average gains of up to +25.10% on the BFCL and ACEBench benchmarks; Tool-DC (TB) enables Qwen2.5-7B to match or exceed proprietary models such as OpenAI o3 and Claude-Haiku-4.5.
Conclusion: Tool-DC improves the robustness and efficiency of LLMs on complex tool-calling tasks, offering both a plug-and-play option and an inference-efficient option, and is of practical value.
Abstract: Tool-calling empowers Large Language Models (LLMs) to interact with external environments. However, current methods often struggle to handle massive and noisy candidate tools in long-context tool-calling tasks, limiting their real-world application. To this end, we propose Tool-DC, a Divide-and-Conquer framework for boosting tool-calling performance of LLMs. The core of Tool-DC is to reduce the reasoning difficulty and make full use of self-reflection ability of LLMs via a "Try-Check-Retry" paradigm. Specifically, Tool-DC involves two variants: 1) the training-free Tool-DC (TF), which is plug-and-play and flexible; 2) the training-based Tool-DC (TB), which is more inference-efficient. Extensive experiments show that both Tool-DC methods outperform their counterparts by a clear margin. Tool-DC (TF) brings up to +25.10% average gains against the baseline on BFCL and ACEBench benchmarks, while Tool-DC (TB) enables Qwen2.5-7B to achieve comparable or even better performance than proprietary LLMs, e.g., OpenAI o3 and Claude-Haiku-4.5.

[16] Tiny Aya: Bridging Scale and Multilingual Depth

Alejandro R. Salamanca,Diana Abagyan,Daniel D'souza,Ammar Khairi,David Mora,Saurabh Dash,Viraat Aryabumi,Sara Rajaee,Mehrnaz Mofakhami,Ananya Sahu,Thomas Euyang,Brittawnya Prince,Madeline Smith,Hangyu Lin,Acyr Locatelli,Sara Hooker,Tom Kocmi,Aidan Gomez,Ivan Zhang,Phil Blunsom,Nick Frosst,Joelle Pineau,Beyza Ermis,Ahmet Üstün,Julia Kreutzer,Marzieh Fadaee

Main category: cs.CL

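TL;DR: Tiny Aya is a small multilingual language model with only 3.35B parameters, trained on 70 languages and refined with region-aware post-training. It reaches state-of-the-art translation quality along with strong multilingual understanding and target-language generation, and is released as a base model, a globally balanced instruction-tuned model, and three region-specialized models.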

Details

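Motivation: To explore a path for scaling multilingual AI that is efficient, balanced across languages, and easy to deploy, and to close the gap in multilingual ability of small models.
Method: Pre-trains on data in 70 languages and applies a region-aware post-training strategy; releases a base model, a globally balanced instruction-tuned model, and three region-specialized models targeting languages from Africa, South Asia, Europe, Asia-Pacific, and West Asia.
Result: Achieves state-of-the-art translation quality, strong multilingual understanding, and high-quality target-language generation while staying at 3.35B parameters.
Conclusion: Tiny Aya shows that a small model can reach strong and balanced multilingual ability through careful data composition and region-adapted training, offering an efficiency- and deployment-oriented alternative for multilingual AI.
Abstract: Tiny Aya redefines what a small multilingual language model can achieve. Trained on 70 languages and refined through region-aware post-training, it delivers state-of-the-art in translation quality, strong multilingual understanding, and high-quality target-language generation, all with just 3.35B parameters. The release includes a pretrained foundation model, a globally balanced instruction-tuned variant, and three region-specialized models targeting languages from Africa, South Asia, Europe, Asia-Pacific, and West Asia. This report details the training strategy, data composition, and comprehensive evaluation framework behind Tiny Aya, and presents an alternative scaling path for multilingual AI: one centered on efficiency, balanced performance across languages, and practical deployment.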

[17] Can Small Language Models Use What They Retrieve? An Empirical Study of Retrieval Utilization Across Model Scale

Sanchit Pandey

Main category: cs.CL

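TL;DR: This paper studies how language models of 7B parameters or fewer perform in retrieval-augmented generation (RAG) and finds that the main bottleneck is their inability to use retrieved information rather than retrieval quality: even with perfect (oracle) retrieval, small models fail most of the time, and retrieval context often interferes with knowledge the model already has, lowering performance.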

Details

Motivation: To determine whether language models of 7B parameters or fewer can effectively use the information retrieved in RAG, and whether RAG failures stem from poor retrieval or from the model's inability to use context. Method: Systematically evaluates five model sizes from 360M to 8B across three architecture families (SmolLM2, Qwen2.5, Llama 3.1) under four retrieval conditions (no retrieval, BM25, E5 dense retrieval, oracle); introduces a parametric knowledge split that separates questions the model can already answer from those requiring external knowledge, so that utilization failure can be distinguished from retrieval failure. Result: (1) Even with oracle retrieval, models of 7B or smaller fail 85–100% of the time on questions requiring external knowledge; (2) adding retrieval context causes errors on 42–100% of previously answerable questions, indicating a strong distraction effect; (3) error analysis shows that the dominant failure mode is irrelevant generation, in which the model ignores the context. The patterns hold across prompt templates and retrieval methods. Conclusion: For models below 7B parameters, the main limitation of RAG is context utilization rather than retrieval quality, and deploying RAG can be a net negative under standard evaluation conditions. Abstract: Retrieval-augmented generation (RAG) is widely deployed to improve factual accuracy in language models, yet it remains unclear whether smaller models of 7B parameters or fewer can effectively utilize retrieved information. To investigate this question, we evaluate five model sizes from 360M to 8B across three architecture families (SmolLM2, Qwen2.5, and Llama 3.1) under four retrieval conditions: no retrieval, BM25, dense retrieval using E5-large-v2, and oracle retrieval, where the retrieved passage is guaranteed to contain the answer. We introduce a parametric knowledge split that separates questions a model can already answer from those that require external knowledge, which allows us to isolate utilization failure from retrieval quality failure. We find three main results. First, even with oracle retrieval, models of size 7B or smaller fail to extract the correct answer 85 to 100 percent of the time on questions they cannot answer alone, which indicates a fundamental utilization bottleneck. Second, adding retrieval context destroys 42 to 100 percent of answers the model previously knew, suggesting a distraction effect driven by the presence of context rather than its quality. Third, an error analysis of 2,588 oracle failures shows that the dominant failure mode is irrelevant generation, where the model ignores the provided context entirely. These patterns hold across multiple prompt templates and retrieval methods. The results indicate that for models below 7B parameters the main limitation of RAG is context utilization rather than retrieval quality, and that deploying RAG at this scale can lead to a net negative trade-off under standard evaluation conditions.
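
The parametric knowledge split described above can be illustrated with a small sketch. This is not the paper's code: the `Record` fields and the `utilization_report` helper are hypothetical names, and the sketch assumes every question has already been graded as correct or incorrect both without retrieval and with a retrieved passage in the prompt.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """Graded outcomes for one question (field names are hypothetical)."""
    question_id: str
    correct_closed_book: bool   # answered correctly with no retrieval
    correct_with_context: bool  # answered correctly with a passage in the prompt


def _failure_rate(records):
    """Fraction of records answered incorrectly once context is added."""
    if not records:
        return None
    return sum(not r.correct_with_context for r in records) / len(records)


def utilization_report(records):
    """Split questions by parametric knowledge, then measure both failure rates."""
    known = [r for r in records if r.correct_closed_book]
    unknown = [r for r in records if not r.correct_closed_book]
    return {
        "n_known": len(known),
        "n_unknown": len(unknown),
        # Known questions that become wrong with context: the distraction effect.
        "destroyed_rate": _failure_rate(known),
        # Unknown questions still wrong with context: with oracle passages this
        # isolates utilization failure, since the answer is in the prompt.
        "utilization_failure_rate": _failure_rate(unknown),
    }


records = [
    Record("q1", True, True),
    Record("q2", True, False),
    Record("q3", False, False),
    Record("q4", False, True),
    Record("q5", False, False),
]
report = utilization_report(records)
print(report)
```

Under this reading, a high `utilization_failure_rate` measured with oracle passages cannot be blamed on the retriever, which is the distinction the abstract draws between utilization failure and retrieval quality failure.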

[18] One Supervisor, Many Modalities: Adaptive Tool Orchestration for Autonomous Queries

Mayank Saini,Arit Kumar Bishwas

Main category: cs.CL

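TL;DR: This paper proposes an agentic AI framework for autonomous multimodal query processing that dynamically decomposes queries, delegates subtasks by modality, and adaptively synthesizes results, outperforming a matched hierarchical baseline in time-to-answer, cost, and conversational rework.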

Details

Motivation: To address rigid tool invocation, weak cross-modal coordination, high deployment cost, and high response latency in existing multimodal AI systems. Method: Builds an agentic framework around a central Supervisor that supports text, image, audio, video, and document modalities; text-only queries use learned routing via RouteLLM, while non-text queries use SLM-assisted modality decomposition; end-to-end processing combines dynamic task decomposition, modality-specific tool dispatch, and adaptive result synthesis. Result: Evaluated on 2,847 queries across 15 task categories; compared with the matched hierarchical baseline, the framework reduces time-to-accurate-answer by 72%, conversational rework by 85%, and cost by 67% while maintaining accuracy parity. Conclusion: Intelligent centralized orchestration can substantially improve the deployment economics of multimodal AI, supporting the effectiveness and scalability of agentic architectures in complex multimodal settings. Abstract: We present an agentic AI framework for autonomous multimodal query processing that coordinates specialized tools across text, image, audio, video, and document modalities. A central Supervisor dynamically decomposes user queries, delegates subtasks to modality-appropriate tools (e.g., object detection, OCR, speech transcription), and synthesizes results through adaptive routing strategies rather than predetermined decision trees. For text-only queries, the framework uses learned routing via RouteLLM, while non-text paths use SLM-assisted modality decomposition. Evaluated on 2,847 queries across 15 task categories, our framework achieves 72% reduction in time-to-accurate-answer, 85% reduction in conversational rework, and 67% cost reduction compared to the matched hierarchical baseline while maintaining accuracy parity. These results demonstrate that intelligent centralized orchestration fundamentally improves multimodal AI deployment economics.

[19] Where Matters More Than What: Decoding-aligned KV Cache Compression via Position-aware Pseudo Queries

Zhenxu Tian,Yi Su,Juntao Li,Min Zhang

Main category: cs.CL

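TL;DR: This paper proposes DapQ, which uses position-aware pseudo queries to simulate decoding-stage attention patterns, yielding KV cache compression that is aligned with the generation process and near-lossless under strict memory limits.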

Details

Motivation: Existing KV cache compression methods estimate token importance only from input-side attention patterns in the prefill stage, which do not reflect the tokens actually attended to during decoding; the true decoding queries are unavailable at inference time, so effective pseudo queries must be constructed. Method: Proposes the DapQ framework: pseudo queries dominated by positional information approximate the decoding-stage queries and form an observation window aligned with generation, giving more accurate token-importance estimates for lightweight cache eviction. Result: Across multiple benchmarks and LLMs, DapQ reaches near-lossless performance under strict memory constraints (e.g., 99.5% on NIAH with only a 3% KV cache budget). Conclusion: Positional information models decoding-stage attention more effectively than semantic content; DapQ achieves decoding-aligned KV compression via position-aware pseudo queries, improving the balance between efficiency and accuracy in long-context inference. Abstract: The Key-Value (KV) cache is crucial for efficient Large Language Model (LLM) inference, but excessively long contexts drastically increase the KV cache memory footprint. Existing KV cache compression methods typically rely on input-side attention patterns within a prompt observation window to estimate token importance during the prefill stage. They fail to preserve critical tokens for future generation since these assessments are not derived from the decoding process. Intuitively, an effective observation window should mirror the decoding-stage queries to accurately reflect which tokens the generation process will attend to. However, ground-truth decoding queries are inherently unavailable during inference. For constructing pseudo queries to approximate them, we find that positional information plays a more critical role than semantic content. Motivated by this insight, we propose decoding-aligned KV cache compression via position-aware pseudo queries (DapQ), a novel and lightweight eviction framework that leverages position-aware pseudo queries to simulate the output tokens, thereby establishing an effective observation window for importance assessment. It aligns closely with the actual generation context and enables precise token eviction. Extensive evaluations across multiple benchmarks and LLMs demonstrate that DapQ achieves superior performance, particularly under strict memory constraints (e.g., nearly lossless performance of up to 99.5% on NIAH with a 3% KV cache budget).
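
To make the observation-window idea concrete, here is a minimal sketch of attention-based KV eviction for a single head. It is an illustration under stated assumptions, not DapQ itself: the function name `select_kv_to_keep` is hypothetical, scoring by summed softmax attention is one common choice rather than something the abstract specifies, and the step that distinguishes DapQ (building `window_queries` from position-aware pseudo queries) is left to the caller.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def select_kv_to_keep(keys, window_queries, budget):
    """Return the sorted indices of the `budget` cached positions to keep.

    keys: one key vector per cached token (a single attention head).
    window_queries: query vectors forming the observation window; in a
        decoding-aligned method these would be pseudo queries standing in
        for future decoding steps (an assumption of this sketch).
    """
    dim = len(keys[0])
    scores = [0.0] * len(keys)
    for query in window_queries:
        # Scaled dot-product attention logits of this query over all keys.
        logits = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in keys
        ]
        # Accumulate the attention mass each cached position receives.
        for index, weight in enumerate(softmax(logits)):
            scores[index] += weight
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
window_queries = [[2.0, 0.0], [0.0, 2.0]]
kept = select_kv_to_keep(keys, window_queries, budget=2)
print(kept)
```

The eviction result depends entirely on which queries populate the window, which is why the abstract argues that the window should resemble decoding-stage queries rather than prompt-side ones.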

[20] Streaming Translation and Transcription Through Speech-to-Text Causal Alignment

Roman Koshkin,Jeon Haesung,Lianbo Liu,Hao Shi,Mengjie Zhao,Yusuke Fujita,Yui Sudo

Main category: cs.CL

TL;DR: Hikari is a policy-free, end-to-end simultaneous speech-to-text translation model using a probabilistic WAIT token and Decoder Time Dilation to improve quality-latency trade-off, achieving SOTA BLEU scores.

Details

Motivation: Traditional SiMT relies on offline models and hand-crafted or learned policies; there's a need for simpler, fully end-to-end approaches that better balance translation quality and latency. Method: Introduces Hikari: (1) a probabilistic WAIT token mechanism to encode READ/WRITE decisions, (2) Decoder Time Dilation to reduce autoregressive overhead and balance training distribution, and (3) supervised fine-tuning for delay recovery. Result: Achieves new state-of-the-art BLEU scores on English-to-Japanese, German, and Russian in both low- and high-latency regimes, outperforming recent baselines. Conclusion: Hikari demonstrates that policy-free, end-to-end modeling with carefully designed architectural and training innovations can significantly advance simultaneous speech-to-text translation. Abstract: Simultaneous machine translation (SiMT) has traditionally relied on offline machine translation models coupled with human-engineered heuristics or learned policies. We propose Hikari, a policy-free, fully end-to-end model that performs simultaneous speech-to-text translation and streaming transcription by encoding READ/WRITE decisions into a probabilistic WAIT token mechanism. We also introduce Decoder Time Dilation, a mechanism that reduces autoregressive overhead and ensures a balanced training distribution. Additionally, we present a supervised fine-tuning strategy that trains the model to recover from delays, significantly improving the quality-latency trade-off. Evaluated on English-to-Japanese, German, and Russian, Hikari achieves new state-of-the-art BLEU scores in both low- and high-latency regimes, outperforming recent baselines.

[21] UtilityMax Prompting: A Formal Framework for Multi-Objective Large Language Model Optimization

Ofir Marom

Main category: cs.CL

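TL;DR: This paper proposes UtilityMax Prompting, a framework that specifies LLM tasks in formal mathematical language rather than natural language: the task is modeled as an influence diagram with a utility function that directs the LLM to optimize multiple objectives explicitly, consistently improving precision and NDCG on a movie recommendation task.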

Details

Motivation: Natural-language prompts are inherently ambiguous and struggle to satisfy several objectives at once, which limits LLM performance on multi-objective tasks. Method: Proposes the UtilityMax Prompting framework: the task is modeled as an influence diagram in which the LLM's answer is the sole decision variable, a utility function is defined over the conditional probability distributions in the diagram, and the LLM is instructed to find the answer that maximizes expected utility. Result: Validated on the MovieLens 1M dataset with three frontier models (Claude Sonnet 4.6, GPT-5.4, Gemini 2.5 Pro); the method consistently outperforms natural-language baselines on a multi-objective movie recommendation task, improving precision and NDCG. Conclusion: Formal, utility-based prompting mitigates the ambiguity of natural-language prompts and strengthens an LLM's ability to optimize precisely for complex multi-objective tasks. Abstract: The success of a Large Language Model (LLM) task depends heavily on its prompt. Most use-cases specify prompts using natural language, which is inherently ambiguous when multiple objectives must be simultaneously satisfied. In this paper we introduce UtilityMax Prompting, a framework that specifies tasks using formal mathematical language. We reconstruct the task as an influence diagram in which the LLM's answer is the sole decision variable. A utility function is defined over the conditional probability distributions within the diagram, and the LLM is instructed to find the answer that maximises expected utility. This constrains the LLM to reason explicitly about each component of the objective, directing its output toward a precise optimization target rather than a subjective natural language interpretation. We validate our approach on the MovieLens 1M dataset across three frontier models (Claude Sonnet 4.6, GPT-5.4, and Gemini 2.5 Pro), demonstrating consistent improvements in precision and Normalized Discounted Cumulative Gain (NDCG) over natural language baselines in a multi-objective movie recommendation task.
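
The objective that the prompt asks the LLM to optimize, expected utility over a single decision variable, can be written out numerically. The sketch below is only an illustration of that objective: in the framework the LLM performs the reasoning itself, and the candidate names, outcome labels, probabilities, and utilities here are all made up for the example.

```python
def expected_utility(outcome_probs, utilities):
    """Sum of P(outcome | answer) * U(outcome) over all outcomes."""
    return sum(prob * utilities[outcome] for outcome, prob in outcome_probs.items())


def best_answer(candidates, utilities):
    """Pick the answer (the sole decision variable) with maximal expected utility.

    candidates: {answer: {outcome: probability}} with each inner dict summing to 1.
    """
    return max(candidates, key=lambda answer: expected_utility(candidates[answer], utilities))


# Hypothetical two-objective recommendation: the user enjoys the film, and the
# film matches a requested genre. Outcomes are joint events over both objectives.
utilities = {
    "enjoyed_and_on_genre": 1.0,
    "enjoyed_off_genre": 0.5,
    "not_enjoyed": 0.0,
}
candidates = {
    "film_a": {"enjoyed_and_on_genre": 0.5, "enjoyed_off_genre": 0.0, "not_enjoyed": 0.5},
    "film_b": {"enjoyed_and_on_genre": 0.0, "enjoyed_off_genre": 0.75, "not_enjoyed": 0.25},
}
choice = best_answer(candidates, utilities)
print(choice, expected_utility(candidates[choice], utilities))
```

Writing the utility down explicitly fixes the trade-off between objectives, which is the ambiguity the abstract attributes to natural-language prompts.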

[22] Performance Evaluation of Open-Source Large Language Models for Assisting Pathology Report Writing in Japanese

Masataka Kawai,Singo Sakashita,Shumpei Ishikawa,Shogo Watanabe,Anna Matsuoka,Mikio Sakurai,Yasuto Fujimoto,Yoshiyuki Takahara,Atsushi Ohara,Hirohiko Miyake,Genichiro Ishii

Main category: cs.CL

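TL;DR: This paper evaluates seven open-source large language models (LLMs) for Japanese pathology report writing from three perspectives: generation and information extraction of structured diagnosis text, correction of typographical errors in Japanese pathology reports, and subjective evaluation of model-generated explanatory text by pathologists and clinicians. Thinking models and medical-specialized models performed better on structured reporting and typo correction, whereas preferences for explanatory text varied across raters. Overall, open-source LLMs can assist Japanese pathology report writing in limited but clinically relevant scenarios.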

Details

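Motivation: The performance of large language models (LLMs) for Japanese pathology report writing has not been explored, and their suitability for real clinical use needs to be evaluated. Method: Evaluates seven open-source LLMs from three perspectives: (A) generation and extraction of pathology diagnosis text following predefined formats; (B) correction of typographical errors in Japanese pathology reports; (C) subjective scoring of model-generated explanatory text by pathologists and clinicians. Result: Thinking models and medical-specialized models performed better on structured report generation and typo correction, while subjective preferences for explanatory text differed substantially across raters. Conclusion: Although the utility of LLMs varies by task, open-source LLMs can assist Japanese pathology report writing in limited but clinically relevant scenarios. Abstract: The performance of large language models (LLMs) for supporting pathology report writing in Japanese remains unexplored. We evaluated seven open-source LLMs from three perspectives: (A) generation and information extraction of pathology diagnosis text following predefined formats, (B) correction of typographical errors in Japanese pathology reports, and (C) subjective evaluation of model-generated explanatory text by pathologists and clinicians. Thinking models and medical-specialized models showed advantages in structured reporting tasks that required reasoning and in typo correction. In contrast, preferences for explanatory outputs varied substantially across raters. Although the utility of LLMs differed by task, our findings suggest that open-source LLMs can be useful for assisting Japanese pathology report writing in limited but clinically relevant scenarios.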

[23] QChunker: Learning Question-Aware Text Chunking for Domain RAG via Multi-Agent Debate

Jihao Zhao,Daixuan Li,Pengfei Li,Shuaishuai Zu,Biao Qin,Hongyan Liu

Main category: cs.CL

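TL;DR: This paper proposes QChunker, which improves text chunking quality in RAG through an understanding-retrieval-augmentation paradigm, combining a multi-agent debate framework with a new evaluation metric, ChunkScore, to improve the semantic integrity and information granularity of chunks.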

Details

Motivation: RAG is limited by the semantic integrity and information granularity of the text chunks in its knowledge base; conventional chunking lacks logical coherence, and its evaluation relies on downstream QA tasks and is inefficient. Method: Proposes the QChunker framework: chunking is modeled as a composite task of text segmentation and knowledge completion; a four-agent debate system (question outline generator, text segmenter, integrity reviewer, knowledge completer) is designed; a high-quality dataset of 45K chunking entries is built and distilled into small models; a direct evaluation metric, ChunkScore, is introduced and used to select the best chunking under multi-path sampling. Result: ChunkScore is validated theoretically and experimentally as an efficient, direct discriminator of chunk quality; across four heterogeneous domains, QChunker produces more logically coherent and information-rich chunks, which improves RAG performance. Conclusion: By placing understanding before retrieval and adding a structured evaluation metric, QChunker improves the quality of the knowledge fed into RAG and offers a new paradigm for retrieval-augmented generation. Abstract: The effectiveness upper bound of retrieval-augmented generation (RAG) is fundamentally constrained by the semantic integrity and information granularity of text chunks in its knowledge base. To address these challenges, this paper proposes QChunker, which restructures the RAG paradigm from retrieval-augmentation to understanding-retrieval-augmentation. Firstly, QChunker models the text chunking as a composite task of text segmentation and knowledge completion to ensure the logical coherence and integrity of text chunks. Drawing inspiration from Hal Gregersen's "Questions Are the Answer" theory, we design a multi-agent debate framework comprising four specialized components: a question outline generator, text segmenter, integrity reviewer, and knowledge completer. This framework operates on the principle that questions serve as catalysts for profound insights. Through this pipeline, we successfully construct a high-quality dataset of 45K entries and transfer this capability to small language models. Additionally, to handle long evaluation chains and low efficiency in existing chunking evaluation methods, which overly rely on downstream QA tasks, we introduce a novel direct evaluation metric, ChunkScore. Both theoretical and experimental validations demonstrate that ChunkScore can directly and efficiently discriminate the quality of text chunks. Furthermore, during the text segmentation phase, we utilize document outlines for multi-path sampling to generate multiple candidate chunks and select the optimal solution employing ChunkScore. Extensive experimental results across four heterogeneous domains show that QChunker effectively resolves the aforementioned issues by providing RAG with more logically coherent and information-rich text chunks.

[24] Multi-Task Reinforcement Learning for Enhanced Multimodal LLM-as-a-Judge

Junjie Wu,Xuan Kan,Zihao He,Shunwen Tan,Bo Pan,Kaitai Zhang

Main category: cs.CL

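TL;DR: This paper proposes MT-RL-Judge, a multi-task reinforcement learning framework that improves the generalization and judgment consistency of multimodal large language models (MLLMs) acting as judges across a range of visual tasks, and verifies its robustness on out-of-distribution tasks.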

Details

Motivation: Existing MLLM-as-a-Judge models are mostly optimized for a single task and generalize poorly to diverse evaluation scenarios, lacking the cross-task adaptability that reliable evaluation requires. Method: Proposes the MT-RL-Judge framework, which jointly optimizes an MLLM judge across multiple tasks with reinforcement learning, using the generalization capability of RL to improve cross-task performance. Result: Compared with several strong baselines, MT-RL-Judge achieves better judgment consistency and higher correlation with human preferences, and generalizes well to out-of-distribution tasks. Conclusion: Multi-task reinforcement learning can strengthen the generality and robustness of MLLM judges, offering a route to reliable and generalizable automatic evaluation. Abstract: Multimodal Large Language Models (MLLMs) have been widely adopted as MLLM-as-a-Judge models due to their strong alignment with human judgment across various visual tasks. However, most existing judge models are optimized for single-task scenarios and struggle to generalize to diverse contexts, which is a critical requirement for reliable evaluation. To address this limitation, we propose Multi-Task Reinforcement Learning for MLLM-as-a-Judge (MT-RL-Judge), a framework that jointly optimizes the judge model across multiple tasks, leveraging the generalization capabilities of RL. Experimental results demonstrate that MT-RL-Judge outperforms several strong baselines in both judgment consistency and correlation with human preferences. Furthermore, our approach exhibits robust generalization on out-of-distribution tasks, further validating its effectiveness.

[25] A technology-oriented mapping of the language and translation industry: Analysing stakeholder values and their potential implication for translation pedagogy

María Isabel Rivas Ginel,Janiça Hackenbuchner,Alina Secară,Ralph Krüger,Caroline Rossi

Main category: cs.CL

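TL;DR: This paper examines how value is constructed and negotiated in the increasingly automated language and translation industry, finding that technological efficiency and human expertise are interdependent and that adaptability has become the core value linking the two.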

Details

Motivation: To examine how human value and technological value are constructed, negotiated, and reshaped in the translation industry under automation. Method: Qualitative analysis of interviews with 29 industry stakeholders from the LT-LiDER project, using Chesterman's framework of translation ethics. Result: Efficiency-oriented technological values have become baseline expectations in automated production environments; human value is not displaced but repositioned through expertise, oversight, accountability, and contextual judgment; adaptability emerges as the key mediating value linking human and technological domains. Conclusion: Automation reshapes rather than replaces translation value, producing an interdependent configuration in which technological efficiency enables human communicative work. Abstract: This paper examines how value is constructed and negotiated in today's increasingly automated language and translation industry. Drawing on interview data from twenty-nine industry stakeholders collected within the LT-LiDER project, the study analyses how human value, technological value, efficiency, and adaptability are articulated across different professional roles. Using Chesterman's framework of translation ethics and associated values as an analytical lens, the paper shows that efficiency-oriented technological values aligned with the ethics of service have become baseline expectations in automated production environments, where speed, scalability, and deliverability dominate evaluation criteria. At the same time, human value is not displaced but repositioned, emerging primarily through expertise, oversight, accountability, and contextual judgment embedded within technology-mediated workflows. A central finding is the prominence of adaptability as a mediating value linking human and technological domains. Adaptability is constructed as a core professional requirement, reflecting expectations that translators continuously adjust their skills, roles, and identities in response to evolving tools and organisational demands. The paper argues that automation reshapes rather than replaces translation value, creating an interdependent configuration in which technological efficiency enables human communicative work.

[26] In the LLM era, Word Sense Induction remains unsolved

Anna Mosolova,Marie Candito,Carlos Ramisch

Main category: cs.CL

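TL;DR: This paper addresses methodological problems in evaluating word sense induction (WSI) when sense-annotated data is unavailable, proposes a new SemCor-derived evaluation dataset, and systematically evaluates pre-trained embeddings, clustering algorithms, and LLM-based methods. The "one cluster per lemma" heuristic remains the strongest baseline for unsupervised methods, LLMs applied directly to WSI perform poorly, and data augmentation with Wiktionary improves results beyond the previous SOTA.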

Details

Motivation: Current WSI evaluation has methodological flaws and lacks benchmarks that respect the polysemy and frequency distributions of real corpora; low-resource and domain-specific settings need more reliable unsupervised or weakly supervised WSI. Method: Builds a SemCor-derived evaluation dataset; compares pre-trained embeddings and clustering algorithms across parts of speech; proposes and evaluates an LLM-based WSI method; explores three data augmentation sources (LLM-generated, corpus, lexicon) and semi-supervised settings that use Wiktionary (data augmentation, must-link constraints, number of clusters per lemma). Result: No unsupervised method surpasses the "one cluster per lemma" (1cpl) baseline; the best method differs across parts of speech; LLMs perform poorly when applied directly to WSI; data augmentation helps; the Wiktionary-based semi-supervised approach surpasses the previous SOTA on the test set by 3.3%. Conclusion: WSI remains unsolved and calls for better integration of traditional lexical resources with the lexical semantic capabilities of LLMs. Abstract: In the absence of sense-annotated data, word sense induction (WSI) is a compelling alternative to word sense disambiguation, particularly in low-resource or domain-specific settings. In this paper, we emphasize methodological problems in current WSI evaluation. We propose an evaluation on a SemCor-derived dataset, respecting the original corpus polysemy and frequency distributions. We assess pre-trained embeddings and clustering algorithms across parts of speech, and propose and evaluate an LLM-based WSI method for English. We evaluate data augmentation sources (LLM-generated, corpus and lexicon), and semi-supervised scenarios using Wiktionary for data augmentation, must-link constraints, number of clusters per lemma. We find that no unsupervised method (whether ours or previous) surpasses the strong "one cluster per lemma" heuristic (1cpl). We also show that (i) results and best systems may vary across POS, (ii) LLMs have trouble performing this task, (iii) data augmentation is beneficial and (iv) capitalizing on Wiktionary does help. It surpasses the previous SOTA system on our test set by 3.3%. WSI is not solved, and calls for a better articulation of lexicons and LLMs' lexical semantics capabilities.

[27] SemBench: A Universal Semantic Framework for LLM Evaluation

Mikel Zubillaga,Naiara Perez,Oscar Sainz,German Rigau

Main category: cs.CL

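TL;DR: This paper proposes SemBench, a framework that automatically generates benchmarks for semantic understanding from dictionary sense definitions and a sentence encoder, with no manually curated example sentences. It is multilingual, lightweight, and data-efficient, and its results correlate strongly with the traditional WiC benchmark.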

Details

Motivation: Traditional semantic understanding benchmarks such as WiC are costly to build and tied to high-resource languages, which makes them hard to extend to low-resource languages; a scalable, language-independent, automated evaluation method is needed. Method: Proposes the SemBench framework, which automatically generates synthetic semantic discrimination tasks using only dictionary sense definitions and a sentence encoder, with no hand-written example sentences; validated on three languages (English, Spanish, Basque) and a range of LLMs. Result: Model rankings derived from SemBench correlate strongly with those from standard WiC datasets; only a small number of examples is needed for stable, meaningful rankings; the results support its cross-lingual applicability and data efficiency. Conclusion: SemBench is a lightweight, adaptable, and data-efficient framework for cross-lingual evaluation of semantic understanding in LLMs. Abstract: Recent progress in Natural Language Processing (NLP) has been driven by the emergence of Large Language Models (LLMs), which exhibit remarkable generative and reasoning capabilities. However, despite their success, evaluating the true semantic understanding of these models remains a persistent challenge. Traditional benchmarks such as Word-in-Context (WiC) effectively probe this capability, but their creation is resource-intensive and often limited to high-resource languages. In this paper, we introduce SemBench, a framework for automatically generating synthetic benchmarks that assess the semantic competence of LLMs using only dictionary sense definitions and a sentence encoder. This approach eliminates the need for curated example sentences, making it both scalable and language-independent. We evaluate SemBench in three languages (English, Spanish, and Basque) spanning different levels of linguistic resources, and across a wide range of LLMs. Our results show that rankings derived from SemBench strongly correlate with those obtained from standard WiC datasets. Furthermore, our analysis demonstrates that only a small number of examples is required to achieve stable and meaningful rankings. Overall, SemBench provides a lightweight, adaptable, and data-efficient framework for cross-lingual evaluation of semantic understanding in LLMs.

[28] Semi-Synthetic Parallel Data for Translation Quality Estimation: A Case Study of Dataset Building for an Under-Resourced Language Pair

Assaf Siani,Anna Kernerman,Ilan Kernerman

Main category: cs.CL

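TL;DR: This paper presents a method for building a semi-synthetic parallel dataset for English-to-Hebrew quality estimation (QE), trains neural QE models such as BERT and XLM-R on it, and examines how dataset size, distribution balance, and error types affect model performance.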

Details

Motivation: To address quality estimation for under-resourced language pairs, especially morphologically complex languages, where scarce parallel corpora and language-specific factors (such as gender and number agreement) limit the accuracy, adaptability, and reliability of existing QE systems. Method: Builds a semi-synthetic English-to-Hebrew QE dataset: English sentences are generated from typical linguistic patterns, translated with multiple MT engines, and filtered by BLEU; each segment is manually scored; professionally translated segments are added with the highest score; gender and number agreement errors are injected in a controlled way; neural QE models including BERT and XLM-R are trained on the result. Result: Dataset size, distribution balance, and error distribution all affect QE model performance. Conclusion: The semi-synthetic construction method offers a workable path to QE modeling for under-resourced, morphologically complex language pairs and contributes to reference-free translation quality evaluation. Abstract: Quality estimation (QE) plays a crucial role in machine translation (MT) workflows, as it serves to evaluate generated outputs that have no reference translations and to determine whether human post-editing or full retranslation is necessary. Yet, developing highly accurate, adaptable and reliable QE systems for under-resourced language pairs remains largely unsolved, due mainly to limited parallel corpora and to diverse language-dependent factors, such as with morphosyntactically complex languages. This study presents a semi-synthetic parallel dataset for English-to-Hebrew QE, generated by creating English sentences based on examples of usage that illustrate typical linguistic patterns, translating them to Hebrew using multiple MT engines, and filtering outputs via BLEU-based selection. Each translated segment was manually evaluated and scored by a linguist, and we also incorporated professionally translated English-Hebrew segments from our own resources, which were assigned the highest quality score. Controlled translation errors were introduced to address linguistic challenges, particularly regarding gender and number agreement, and we trained neural QE models, including BERT and XLM-R, on this dataset to assess sentence-level MT quality. Our findings highlight the impact of dataset size, distributed balance, and error distribution on model performance. We will describe the challenges, methodology and results of our experiments, and specify future directions aimed at improving QE performance. This research contributes to advancing QE models for under-resourced language pairs, including morphology-rich languages.

[29] Compression Favors Consistency, Not Truth: When and Why Language Models Prefer Correct Information

Konstantin Krestnikov

Main category: cs.CL

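TL;DR: This paper introduces the Compression–Consistency Principle: in next-token prediction, language models favor hypotheses that describe the training data more compactly and more consistently, so a preference for correct statements appears only when false alternatives are structurally harder to compress.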

Details

Motivation: To explain why language models sometimes prefer correct statements even when trained on mixed-quality data. Method: Trains small GPT-2-style character-level transformers (3.5M–86M parameters) on synthetic math corpora with controlled mixtures of correct and incorrect rules, using several error types (random errors versus a coherent but incorrect rule system) and verification mechanisms. Result: With random errors, models strongly prefer correct completions in paired evaluation (83.1% with balanced data, and 67.0% even when correct rules make up only 10% of the corpus); with a coherent but incorrect rule system the preference largely disappears (near-chance accuracy); in a more natural-language-like synthetic world the effect is weaker but present (57.7%); embedding verification steps or increasing the number of consistent rules raises the preference for correctness. Conclusion: The apparent "truth bias" is largely a side effect of compression pressure and a preference for internal consistency, not an intrinsic drive toward truth. Abstract: Why do language models sometimes prefer correct statements even when trained on mixed-quality data? We introduce the Compression–Consistency Principle: next-token prediction favors hypotheses that allow shorter and more internally consistent descriptions of the training data. Truth bias emerges only when false alternatives are structurally harder to compress. We test this using small GPT-2-style character-level transformers (3.5M–86M parameters) on synthetic math corpora with controlled mixtures of correct and incorrect rules. In the random-error setting, models strongly prefer correct completions in paired evaluation: 83.1% accuracy at balanced data and 67.0% even when correct rules appear in only 10% of the corpus. Replacing random errors with a coherent but mathematically incorrect rule system largely eliminates the preference (near-chance accuracy). In a more natural-language-like synthetic world, the effect is weaker but still present (57.7%). Additional experiments show that embedding verification steps can restore preference for correctness even at small scale, while increasing the number of consistent rules produces a graded improvement in accuracy. Our results suggest that what appears as a "truth bias" is largely a side effect of compression pressure and preference for internal consistency, rather than an intrinsic drive toward truth. Full code and data are available at https://github.com/Rai220/compression-drives-truth.
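
The paired-evaluation accuracies quoted above can be read against a 50% chance level. The sketch below shows one way such a score could be computed; it is an assumption, not the paper's procedure: the abstract does not say how completions are scored, so summed per-character log-probabilities are used here, and the function names and numbers are illustrative.

```python
def sequence_logprob(char_logprobs):
    """Total log-probability of a completion from per-character log-probs."""
    return sum(char_logprobs)


def paired_accuracy(pairs):
    """Fraction of pairs where the correct completion scores strictly higher.

    pairs: list of (correct_char_logprobs, incorrect_char_logprobs).
    A model with no preference lands near 0.5 on this metric.
    """
    wins = sum(
        1
        for correct, incorrect in pairs
        if sequence_logprob(correct) > sequence_logprob(incorrect)
    )
    return wins / len(pairs)


# Made-up per-character log-probabilities for four completion pairs.
pairs = [
    ([-0.5, -0.5], [-1.0, -1.0]),    # correct preferred
    ([-1.5, -1.5], [-1.0, -1.5]),    # incorrect preferred
    ([-0.25, -0.25], [-2.0, -2.0]),  # correct preferred
    ([-1.0, -1.0], [-1.0, -1.5]),    # correct preferred
]
accuracy = paired_accuracy(pairs)
print(accuracy)
```

On this scale, the reported 83.1% and 67.0% indicate a clear preference for correct completions, while "near-chance accuracy" means the model no longer separates the two.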

[30] Legal-DC: Benchmarking Retrieval-Augmented Generation for Legal Documents

Yaocong Li,Qiang Lan,Leihan Zhang,Le Zhang

Main category: cs.CL

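TL;DR: This paper presents the Legal-DC benchmark dataset and the LegRAG framework to address two problems for retrieval-augmented generation (RAG) in Chinese legal settings: the lack of dedicated evaluation resources and the difficulty of adapting to the structured nature of legal provisions.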

Details

Motivation: Existing Chinese legal RAG benchmarks lack support for joint retriever-generator evaluation, and mainstream RAG systems adapt poorly to the structured nature of legal provisions. Method: Builds the Legal-DC benchmark dataset (480 legal documents and 2,475 question-answer pairs with clause-level references); proposes the LegRAG framework, which combines legal adaptive indexing (segmentation at clause boundaries) with a dual-path self-reflection mechanism; introduces automated LLM-based evaluation methods for high-reliability requirements. Result: LegRAG outperforms existing state-of-the-art methods by 1.3% to 5.6% on key evaluation metrics; code and data are released. Conclusion: The work provides a dedicated benchmark, a practical framework, and empirical insights for Chinese legal RAG, supporting its use in real legal consultation scenarios. Abstract: Retrieval-Augmented Generation (RAG) has emerged as a promising technology for legal document consultation, yet its application in Chinese legal scenarios faces two key limitations: existing benchmarks lack specialized support for joint retriever-generator evaluation, and mainstream RAG systems often fail to accommodate the structured nature of legal provisions. To address these gaps, this study advances two core contributions: First, we constructed the Legal-DC benchmark dataset, comprising 480 legal documents (covering areas such as market regulation and contract management) and 2,475 refined question-answer pairs, each annotated with clause-level references, filling the gap for specialized evaluation resources in Chinese legal RAG. Second, we propose the LegRAG framework, which integrates legal adaptive indexing (clause-boundary segmentation) with a dual-path self-reflection mechanism to ensure clause integrity while enhancing answer accuracy. Third, we introduce automated evaluation methods for large language models to meet the high-reliability demands of legal retrieval scenarios. LegRAG outperforms existing state-of-the-art methods by 1.3% to 5.6% across key evaluation metrics. This research provides a specialized benchmark, practical framework, and empirical insights to advance the development of Chinese legal RAG systems. Our code and data are available at https://github.com/legal-dc/Legal-DC.

[31] Trust Oriented Explainable AI for Fake News Detection

Krzysztof Siwek,Daniel Stankowski,Maciej Stodolski

Main category: cs.CL

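TL;DR: This paper examines the use of Explainable Artificial Intelligence (XAI) in NLP-based fake news detection and compares three interpretability methods: SHAP, LIME, and Integrated Gradients. XAI improves model transparency and interpretability while maintaining high detection accuracy, but has limitations such as computational cost and sensitivity to parameterization.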

Details

Motivation: To improve the reliability and trustworthiness of fake news detection systems by addressing the black-box nature of deep learning models and helping users understand and trust detection results. Method: Applies three XAI methods (SHAP, LIME, and Integrated Gradients) to fake news classification models with different neural network architectures and compares their explanations, efficiency, and applicability experimentally. Result: XAI methods improve model transparency and interpretability while maintaining high detection accuracy; SHAP offers detailed local attributions, LIME produces simple and intuitive explanations, and Integrated Gradients performs efficiently with convolutional models; limitations include computational cost and sensitivity to parameter settings. Conclusion: Combining XAI with NLP is an effective way to improve the reliability and trustworthiness of fake news detection systems; computational efficiency and robustness need further work. Abstract: This article examines the application of Explainable Artificial Intelligence (XAI) in NLP-based fake news detection and compares selected interpretability methods. The work outlines key aspects of disinformation, neural network architectures, and XAI techniques, with a focus on SHAP, LIME, and Integrated Gradients. In the experimental study, classification models were implemented and interpreted using these methods. The results show that XAI enhances model transparency and interpretability while maintaining high detection accuracy. Each method provides distinct explanatory value: SHAP offers detailed local attributions, LIME provides simple and intuitive explanations, and Integrated Gradients performs efficiently with convolutional models. The study also highlights limitations such as computational cost and sensitivity to parameterization. Overall, the findings demonstrate that integrating XAI with NLP is an effective approach to improving the reliability and trustworthiness of fake news detection systems.

[32] Large Language Models for Biomedical Article Classification

Jakub Proboszcz,Paweł Cichosz

Main category: cs.CL

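TL;DR: This study systematically evaluates large language models (LLMs) as text classifiers for biomedical article classification, covering a range of open- and closed-source models, prompt designs, output processing methods, and few-shot settings, and compares them with conventional classifiers. Zero-shot and few-shot LLM performance comes close to that of conventional methods, supporting their use in specialized domains.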

Details

Motivation: To assess the practical utility of LLMs as text classifiers in a non-trivial domain such as biomedicine, and to cover configurations (prompt types, output processing, few-shot strategies) that prior studies left largely unexplored. Method: Systematically evaluates several open- and closed-source LLMs of different sizes; tests different prompt types (zero-shot and few-shot), output processing methods (for class and class probability predictions), and few-shot example counts and selection strategies; compares the best configurations with naïve Bayes, random forest, and fine-tuned transformer models. Result: Over 15 challenging biomedical datasets, average PR AUC is above 0.4 for zero-shot prompting and nearly 0.5 for few-shot prompting, close to naïve Bayes (0.5), random forest with default settings (0.5), tuned random forest (0.55), and fine-tuned transformers (0.5); using output token probabilities for class probability prediction works best. Conclusion: LLMs are viable text classifiers for biomedical literature, particularly in few-shot settings; the study offers practical configuration recommendations and supports LLMs as an alternative or complement to conventional classifiers in specialized domains. Abstract: This work presents a systematic and in-depth investigation of the utility of large language models as text classifiers for biomedical article classification. The study uses several small and mid-size open source models, as well as selected closed source ones, and is more comprehensive than most prior work with respect to the scope of evaluated configurations: different types of prompts, output processing methods for generating both class and class probability predictions, as well as few-shot example counts and selection methods. The performance of the most successful configurations is compared to that of conventional classification algorithms. The obtained average PR AUC over 15 challenging datasets above 0.4 for zero-shot prompting and nearly 0.5 for few-shot prompting comes close to that of the naïve Bayes classifier (0.5), the random forest algorithm (0.5 with default settings or 0.55 with hyperparameter tuning) and fine-tuned transformer models (0.5). These results confirm the utility of large language models as text classifiers for non-trivial domains and provide practical recommendations of the most promising setups, including in particular using output token probabilities for class probability prediction.
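
The recommendation to use output token probabilities for class probability prediction can be sketched as follows. This is a minimal illustration under assumptions: the label tokens `"yes"` and `"no"`, the log-probability values, and both helper names are hypothetical, and average precision is used as one common estimator of PR AUC rather than the estimator the study necessarily used.

```python
import math


def class_probability(token_logprobs, positive="yes", negative="no"):
    """Renormalize the two label tokens' probabilities into P(positive)."""
    p_pos = math.exp(token_logprobs[positive])
    p_neg = math.exp(token_logprobs[negative])
    return p_pos / (p_pos + p_neg)


def average_precision(labels, scores):
    """Mean of precision@k taken at the rank k of each positive example."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, precision_sum = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


# Made-up first-token log-probabilities for four articles (1 = relevant).
outputs = [
    {"yes": math.log(0.6), "no": math.log(0.2)},
    {"yes": math.log(0.5), "no": math.log(0.5)},
    {"yes": math.log(0.3), "no": math.log(0.6)},
    {"yes": math.log(0.1), "no": math.log(0.8)},
]
labels = [1, 0, 1, 0]
scores = [class_probability(o) for o in outputs]
print(scores, average_precision(labels, scores))
```

A continuous score of this kind lets articles be ranked, which a hard "yes"/"no" generation does not, and ranking is what a precision-recall curve needs.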

[33] DatedGPT: Preventing Lookahead Bias in Large Language Models with Time-Aware Pretraining

Yutong Yan,Raphael Tang,Zhenyu Gao,Wenxi Jiang,Yao Lu

Main category: cs.CL

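TL;DR: DatedGPT is a family of 1.3B-parameter language models trained on data with strict annual cutoffs to eliminate lookahead bias in financial backtesting, with temporally aligned instruction fine-tuning and empirical checks that each model's knowledge stops at its cutoff.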

Details

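Motivation: In financial backtesting, LLMs whose pretraining data contains future information can introduce lookahead bias that undermines forecasting validity. Method: Builds twelve 1.3B-parameter models (DatedGPT), each trained from scratch on approximately 100 billion tokens of temporally partitioned data with an annual cutoff between 2013 and 2024, and instruction fine-tuned on general-domain and finance-specific data that respect the same temporal boundaries; knowledge cutoffs are verified with perplexity-based probing, and performance is evaluated on standard benchmarks. Result: Each model's knowledge is effectively bounded by its data cutoff year, and benchmark performance is competitive with existing models of similar scale; an interactive web demo supports querying and comparing models across cutoff years. Conclusion: Strict temporal isolation in pretraining and fine-tuning can mitigate lookahead bias when LLMs are used for financial time-series tasks, offering a workable approach to trustworthy financial AI modeling. Abstract: In financial backtesting, large language models pretrained on internet-scale data risk introducing lookahead bias that undermines their forecasting validity, as they may have already seen the true outcome during training. To address this, we present DatedGPT, a family of twelve 1.3B-parameter language models, each trained from scratch on approximately 100 billion tokens of temporally partitioned data with strict annual cutoffs spanning 2013 to 2024. We further enhance each model with instruction fine-tuning on both general-domain and finance-specific datasets curated to respect the same temporal boundaries. Perplexity-based probing confirms that each model's knowledge is effectively bounded by its data cutoff year, while evaluation on standard benchmarks shows competitive performance with existing models of similar scale. We provide an interactive web demo that allows users to query and compare responses from models across different cutoff years.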

[34] Bielik-Minitron-7B: Compressing Large Language Models via Structured Pruning and Knowledge Distillation for the Polish Language

Remigiusz Kinas,Paweł Kiszczak,Sergio P. Perez,Krzysztof Ociepa,Łukasz Flis,Krzysztof Wróbel,Adrian Gwoździej

Main category: cs.CL

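TL;DR: Bielik-Minitron-7B is a 7.35B-parameter model obtained by compressing Bielik-11B-v3.0 by 33.4% with a two-stage method combining structured hybrid pruning and knowledge distillation; optimized for European languages, it retains about 90% of the original model's performance while delivering up to 50% faster inference.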

Details

Motivation: To reduce the deployment cost of language models for less-represented European languages while maintaining quality, efficient compression methods are needed that balance performance, speed, and language coverage. Method: Uses a two-stage compression pipeline inspired by NVIDIA Minitron: structured hybrid pruning with the NVIDIA Model Optimizer, followed by logit-based knowledge distillation with the NVIDIA NeMo Framework; the model is then aligned with SFT, DPO-P, and GRPO. Result: The parameter count drops from 11.04B to 7.35B (33.4% compression), inference is up to 50% faster, and performance recovers to about 90% of the original model. Conclusion: The approach offers a feasible and efficient paradigm for building high-quality, low-cost, multilingual language models in resource-constrained settings. Abstract: This report details the creation of Bielik-Minitron-7B, a compressed 7.35B parameter version of the Bielik-11B-v3.0 model, specifically optimized for European languages. By leveraging a two-stage compression methodology inspired by the NVIDIA Minitron approach, we combined structured hybrid pruning and knowledge distillation to reduce the model's parameter count by 33.4%, from 11.04B to 7.35B. We utilized the NVIDIA Model Optimizer for structural pruning and the NVIDIA NeMo Framework for logit-based distillation for quality recovery. Following distillation, the model underwent a rigorous alignment pipeline consisting of Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO-P), and Reinforcement Learning (GRPO). Our final model successfully recovered approximately 90% of the baseline model's performance while providing up to 50% inference speedup. This approach demonstrates an efficient pathway to create language models for less-represented languages, preserving the original model quality while reducing inference deployment costs.

[35] CoMMET: To What Extent Can LLMs Perform Theory of Mind Tasks?

Ruirui Chen,Weifeng Jiang,Chengwei Qin,Cheston Tan

Main category: cs.CL

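TL;DR: This paper proposes CoMMET, described as the first benchmark dataset for evaluating the Theory of Mind (ToM) abilities of large language models (LLMs) in multimodal, multi-turn conversations; it addresses the limits of existing text-only, single-turn belief-task evaluations and systematically analyzes the performance and limitations of different LLMs on social cognition.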

Details

Motivation: Existing ToM benchmarks are limited to text-only input and a narrow set of belief tasks, making it hard to assess LLMs' social reasoning in realistic multimodal, interactive settings. Method: Builds CoMMET, a multimodal, multi-turn conversational ToM benchmark inspired by the Theory of Mind Booklet Task that covers a broader range of mental states, and systematically evaluates LLMs from multiple families and sizes. Result: Current LLMs still show clear limitations on multimodal, multi-turn ToM tasks, with substantial performance differences between models, exposing the boundaries of their social cognitive abilities. Conclusion: CoMMET offers a new paradigm for evaluating the social intelligence of LLMs and should inform future work on multimodal understanding, dynamic mental-state modeling, and moral reasoning. Abstract: Theory of Mind (ToM), the ability to reason about the mental states of oneself and others, is a cornerstone of human social intelligence. As Large Language Models (LLMs) become ubiquitous in real-world applications, validating their capacity for this level of social reasoning is essential for effective and natural interactions. However, existing benchmarks for assessing ToM in LLMs are limited; most rely solely on text inputs and focus narrowly on belief-related tasks. In this paper, we propose a new multimodal benchmark dataset, CoMMET, a Comprehensive Mental states and Moral Evaluation Task inspired by the Theory of Mind Booklet Task. CoMMET expands the scope of evaluation by covering a broader range of mental states and introducing multi-turn testing. To the best of our knowledge, this is the first multimodal dataset to evaluate ToM in a multi-turn conversational setting. Through a comprehensive assessment of LLMs across different families and sizes, we analyze the strengths and limitations of current models and identify directions for future improvement. Our work offers a deeper understanding of the social cognitive capabilities of modern LLMs.

[36] PersonaTrace: Synthesizing Realistic Digital Footprints with LLM Agents

Minjia Wang,Yunfeng Wang,Xiao Ma,Dexin Lv,Qifan Guo,Lynn Zheng,Benliang Wang,Lei Wang,Jiannan Li,Yongwei Xing,David Xu,Zheng Sun

Main category: cs.CL

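TL;DR: This paper proposes a new method that uses large language model (LLM) agents to synthesize realistic digital footprints: starting from a structured user profile, it generates diverse, plausible sequences of user events and the corresponding digital artifacts (emails, calendar entries, etc.). Experiments show the generated data is more diverse and realistic, and that it improves model generalization on real-world downstream tasks.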

Details

Motivation: Research on digital footprints is often constrained by data that lacks diversity and accessibility, so high-quality synthesis methods are needed. Method: Starting from a structured user profile, LLM agents generate diverse, plausible sequences of user events and the corresponding digital artifacts (emails, messages, calendar entries, etc.). Result: Intrinsic evaluation shows the generated data is more diverse and realistic than existing baselines; on real-world out-of-distribution tasks, models fine-tuned on this synthetic data outperform models trained on other synthetic datasets. Conclusion: LLM-driven synthesis can produce high-fidelity, diverse digital footprint data, providing a reliable data foundation for behavior modeling and personalized applications. Abstract: Digital footprints (records of individuals' interactions with digital systems) are essential for studying behavior, developing personalized applications, and training machine learning models. However, research in this area is often hindered by the scarcity of diverse and accessible data. To address this limitation, we propose a novel method for synthesizing realistic digital footprints using large language model (LLM) agents. Starting from a structured user profile, our approach generates diverse and plausible sequences of user events, ultimately producing corresponding digital artifacts such as emails, messages, calendar entries, and reminders. Intrinsic evaluation results demonstrate that the generated dataset is more diverse and realistic than existing baselines. Moreover, models fine-tuned on our synthetic data outperform those trained on other synthetic datasets when evaluated on real-world out-of-distribution tasks.

[37] CHiL(L)Grader: Calibrated Human-in-the-Loop Short-Answer Grading

Pranav Raikote,Korbinian Randl,Ioanna Miliou,Athanasios Lakes,Panagiotis Papapetrou

Main category: cs.CL

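TL;DR: CHiL(L)Grader is presented as the first automated grading framework to integrate calibrated confidence estimation into a human-in-the-loop workflow. Using post-hoc temperature scaling, confidence-based selective prediction, and continual learning, it maintains expert-level grading quality while routing uncertain cases to human graders and adapting to evolving rubrics.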

Details

Motivation: Instruction-tuned LLMs tend to be overconfident in educational assessment, and their reliability degrades as curricula change, which makes fully autonomous deployment in high-stakes settings unsafe. Method: Proposes the CHiL(L)Grader framework, which combines post-hoc temperature scaling, confidence-based selective prediction, and continual learning to support human-in-the-loop grading and adaptive model updates. Result: Across three short-answer datasets, 35-65% of responses are graded automatically at expert-level quality (QWK ≥ 0.80); a QWK gap of 0.347 between accepted and rejected predictions supports the effectiveness of confidence-based routing; each round of teacher feedback improves the model. Conclusion: Uncertainty quantification is key to reliable AI-assisted grading, and CHiL(L)Grader offers a feasible path toward safe, adaptable educational assessment at scale. Abstract: Scaling educational assessment with large language models requires not just accuracy, but the ability to recognize when predictions are trustworthy. Instruction-tuned models tend to be overconfident, and their reliability deteriorates as curricula evolve, making fully autonomous deployment unsafe in high-stakes settings. We introduce CHiL(L)Grader, the first automated grading framework that incorporates calibrated confidence estimation into a human-in-the-loop workflow. Using post-hoc temperature scaling, confidence-based selective prediction, and continual learning, CHiL(L)Grader automates only high-confidence predictions while routing uncertain cases to human graders, and adapts to evolving rubrics and unseen questions. Across three short-answer grading datasets, CHiL(L)Grader automatically scores 35-65% of responses at expert-level quality (QWK >= 0.80). A QWK gap of 0.347 between accepted and rejected predictions confirms the effectiveness of the confidence-based routing. Each correction cycle strengthens the model's grading capability as it learns from teacher feedback. These results show that uncertainty quantification is key for reliable AI-assisted grading.
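
The combination of post-hoc temperature scaling and confidence-based selective prediction can be sketched as follows. This is a generic illustration rather than the authors' implementation: the logits, the temperature `2.0`, and the threshold `0.7` are hypothetical values, and in practice both parameters would be fit on held-out data.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; temperature > 1 flattens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def route(logits, temperature, threshold):
    """Return (predicted_grade_index, confidence, decision)."""
    probs = softmax(logits, temperature)
    confidence = max(probs)
    prediction = probs.index(confidence)
    decision = "auto" if confidence >= threshold else "human"
    return prediction, confidence, decision

# Hypothetical grade logits for two student answers.
confident = route([4.0, 0.5, 0.2], temperature=2.0, threshold=0.7)
uncertain = route([1.2, 1.0, 0.8], temperature=2.0, threshold=0.7)
```

Raising the threshold lowers the share of automatically graded answers and sends more cases to human graders, which is the coverage-quality trade-off behind the reported 35-65% range.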

[38] BTZSC: A Benchmark for Zero-Shot Text Classification Across Cross-Encoders, Embedding Models, Rerankers and LLMs

Ilias Aarab

Main category: cs.CL

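TL;DR: This paper introduces the BTZSC benchmark and systematically compares NLI cross-encoders, embedding models, rerankers, and instruction-tuned LLMs on zero-shot text classification. Modern rerankers (e.g., Qwen3-Reranker-8B) set a new state of the art, embedding models offer the best accuracy-latency trade-off, and instruction-tuned LLMs are competitive but trail specialized rerankers slightly.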

Details

Motivation: Existing evaluation benchmarks such as MTEB often rely on supervised probes or fine-tuning and so do not reflect genuine zero-shot ability; a purely zero-shot benchmark covering diverse tasks and data characteristics is needed to compare mainstream approaches fairly. Method: Builds BTZSC, a zero-shot classification benchmark of 22 public datasets covering sentiment, topic, intent, and emotion; systematically evaluates four model families (NLI cross-encoders, embedding models, rerankers, instruction-tuned LLMs) across 38 public and custom checkpoints. Result: (i) Qwen3-Reranker-8B reaches macro F1 = 0.72, a new state of the art; (ii) strong embedding models such as GTE-large-en-v1.5 come close to the best accuracy with the lowest latency; (iii) instruction-tuned LLMs at 4-12B parameters reach macro F1 up to 0.67, doing especially well on topic classification but trailing rerankers overall; (iv) NLI cross-encoders plateau as the backbone grows; (v) scaling mainly benefits rerankers and LLMs rather than embedding models. Conclusion: Rerankers are now the most accurate and an efficient approach to zero-shot text classification; embedding models are a practical choice balancing accuracy and efficiency; instruction-tuned LLMs show potential but need further optimization; the NLI paradigm is showing its age; BTZSC provides a standardized, purely zero-shot evaluation platform for future work. Abstract: Zero-shot text classification (ZSC) offers the promise of eliminating costly task-specific annotation by matching texts directly to human-readable label descriptions. While early approaches have predominantly relied on cross-encoder models fine-tuned for natural language inference (NLI), recent advances in text-embedding models, rerankers, and instruction-tuned large language models (LLMs) have challenged the dominance of NLI-based architectures. Yet, systematically comparing these diverse approaches remains difficult. Existing evaluations, such as MTEB, often incorporate labeled examples through supervised probes or fine-tuning, leaving genuine zero-shot capabilities underexplored. To address this, we introduce BTZSC, a comprehensive benchmark of 22 public datasets spanning sentiment, topic, intent, and emotion classification, capturing diverse domains, class cardinalities, and document lengths. Leveraging BTZSC, we conduct a systematic comparison across four major model families (NLI cross-encoders, embedding models, rerankers, and instruction-tuned LLMs), encompassing 38 public and custom checkpoints. Our results show that: (i) modern rerankers, exemplified by Qwen3-Reranker-8B, set a new state-of-the-art with macro F1 = 0.72; (ii) strong embedding models such as GTE-large-en-v1.5 substantially close the accuracy gap while offering the best trade-off between accuracy and latency; (iii) instruction-tuned LLMs at 4-12B parameters achieve competitive performance (macro F1 up to 0.67), excelling particularly on topic classification but trailing specialized rerankers; (iv) NLI cross-encoders plateau even as backbone size increases; and (v) scaling primarily benefits rerankers and LLMs over embedding models. BTZSC and accompanying evaluation code are publicly released to support fair and reproducible progress in zero-shot text understanding.
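
Since the headline numbers are macro F1 scores, a small reference computation may help in reading them. The sketch below is a plain-Python version of the standard definition (unweighted mean of per-class F1 over all labels seen in the references or predictions); the toy labels are invented, and the benchmark's own evaluation code may differ in details such as label handling.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy three-class example with one misclassified "pos" item.
y_true = ["pos", "pos", "neg", "neu"]
y_pred = ["pos", "neg", "neg", "neu"]
score = macro_f1(y_true, y_pred)
```

Because every class counts equally, macro F1 penalizes models that do well only on frequent classes, which matters for datasets with many labels.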

[39] Just Use XML: Revisiting Joint Translation and Label Projection

Thennal D K,Chris Biemann,Hans Ole Hatzel

Main category: cs.CL

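TL;DR: This paper proposes LabelPigeon, a framework that performs machine translation and label projection jointly via XML tags, improving cross-lingual transfer while also improving translation quality.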

Details

Motivation: Existing methods usually treat label projection as a separate step after machine translation, and joint approaches have previously been reported to hurt translation quality; this paper re-evaluates that conclusion and proposes a better alternative. Method: Proposes LabelPigeon, which models translation and label projection jointly using XML tags; designs a direct evaluation scheme for label projection; runs systematic experiments across many languages and tasks. Result: In 11 languages, label projection outperforms baselines and translation quality improves; translation quality improves consistently across 203 languages; across 27 languages and three downstream tasks, cross-lingual transfer improves by up to +39.9 F1. Conclusion: XML-tagged label projection transfers labels effectively and efficiently without hurting translation quality, and can even improve it. Abstract: Label projection is an effective technique for cross-lingual transfer, extending span-annotated datasets from a high-resource language to low-resource ones. Most approaches perform label projection as a separate step after machine translation, and prior work that combines the two reports degraded translation quality. We re-evaluate this claim with LabelPigeon, a novel framework that jointly performs translation and label projection via XML tags. We design a direct evaluation scheme for label projection, and find that LabelPigeon outperforms baselines and actively improves translation quality in 11 languages. We further assess translation quality across 203 languages and varying annotation complexity, finding consistent improvement attributed to additional fine-tuning. Finally, across 27 languages and three downstream tasks, we report substantial gains in cross-lingual transfer over comparable work, up to +39.9 F1 on NER. Overall, our results demonstrate that XML-tagged label projection provides effective and efficient label transfer without compromising translation quality.
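
The general idea of carrying span labels through translation with XML tags can be sketched as a pair of helpers: one wraps labeled spans in tags before translation, the other strips tags from the translated output and recovers the projected spans. The function names, the tag format, and the German output are assumptions made for illustration; the sketch handles only flat, non-overlapping spans and does not call a translation model.

```python
import re

def tag_spans(text, spans):
    """Wrap (start, end, label) character spans in XML-style tags."""
    out, cursor = [], 0
    for start, end, label in sorted(spans):
        out.append(text[cursor:start])
        out.append(f"<{label}>{text[start:end]}</{label}>")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)

TAG = re.compile(r"<(\w+)>(.*?)</\1>")

def extract_spans(tagged):
    """Strip tags and return (plain_text, [(start, end, label), ...])."""
    plain, spans, cursor, offset = [], [], 0, 0
    for match in TAG.finditer(tagged):
        plain.append(tagged[cursor:match.start()])
        offset += match.start() - cursor   # untagged text before this span
        inner = match.group(2)
        plain.append(inner)
        spans.append((offset, offset + len(inner), match.group(1)))
        offset += len(inner)
        cursor = match.end()
    plain.append(tagged[cursor:])
    return "".join(plain), spans

source = "Angela Merkel visited Paris."
source_spans = [(0, 13, "PER"), (22, 27, "LOC")]
tagged_source = tag_spans(source, source_spans)

# Hypothetical model output: a translation that preserved the tags.
tagged_target = "<PER>Angela Merkel</PER> besuchte <LOC>Paris</LOC>."
target_text, target_spans = extract_spans(tagged_target)
```

The projected spans come directly from where the tags land in the output, so no separate word-alignment step is needed.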

[40] Translationese as a Rational Response to Translation Task Difficulty

Maria Kunilovskaya

Main category: cs.CL

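TL;DR: This paper argues that the cognitive load of the translation process is the underlying cause of translationese, and tests how well quantified measures of translation task difficulty, such as information-theoretic metrics based on LLM surprisal, predict it.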

Details

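Motivation: Prior work attributes translationese to production tendencies, socio-cultural variables, and language-pair effects, but lacks a unified explanatory framework; this paper offers a new account based on cognitive load. Method: Uses a bidirectional English-German corpus with written and spoken subcorpora; translationese is operationalised as a segment-level translatedness score from an automatic classifier; translation difficulty is split into source-text complexity and cross-lingual transfer difficulty, measured mainly with information-theoretic metrics based on LLM surprisal and complemented by traditional syntactic and semantic features. Result: Translationese is partly explained by translation task difficulty, more so in the English-to-German direction; cross-lingual transfer difficulty usually contributes more than source-text complexity; information-theoretic indicators match or outperform traditional features in written mode but offer no advantage in spoken mode; source-text syntactic complexity and translation-solution entropy are the most stable predictors across language pairs and modes. Conclusion: Translationese reflects the cognitive load inherent in the translation task, and its observable form can be predicted from quantified task difficulty (especially cross-lingual transfer difficulty and source-text syntactic complexity), which provides empirical support for accounts of the cognitive mechanisms of translation. Abstract: Translations systematically diverge from texts originally produced in the target language, a phenomenon widely referred to as translationese. Translationese has been attributed to production tendencies (e.g. interference, simplification), socio-cultural variables, and language-pair effects, yet a unified explanatory account is still lacking. We propose that translationese reflects cognitive load inherent in the translation task itself. We test whether observable translationese can be predicted from quantifiable measures of translation task difficulty. Translationese is operationalised as a segment-level translatedness score produced by an automatic classifier. Translation task difficulty is conceptualised as comprising source-text and cross-lingual transfer components, operationalised mainly through information-theoretic metrics based on LLM surprisal, complemented by established syntactic and semantic alternatives. We use a bidirectional English-German corpus comprising written and spoken subcorpora. Results indicate that translationese can be partly explained by translation task difficulty, especially in English-to-German. For most experiments, cross-lingual transfer difficulty contributes more than source-text complexity. Information-theoretic indicators match or outperform traditional features in written mode, but offer no advantage in spoken mode. Source-text syntactic complexity and translation-solution entropy emerged as the strongest predictors of translationese across language pairs and modes.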

[41] To Words and Beyond: Probing Large Language Models for Sentence-Level Psycholinguistic Norms of Memorability and Reading Times

Thomas Hikaru Clark,Carlos Arriaga,Javier Conde,Gonzalo Martínez,Pedro Reviriego

Main category: cs.CL

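TL;DR: This paper examines how well large language models (LLMs) capture sentence-level psycholinguistic features such as sentence memorability and reading times; fine-tuned models fit human-annotated data well, while zero-shot and few-shot prompting gives unstable results.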

Details

Motivation: To extend the use of LLMs for estimating psycholinguistic norms from the word and multiword level to the sentence level (sentence memorability, reading times), and to test whether they can stand in for human cognitive measures without fine-tuning. Method: Applies supervised fine-tuning to LLMs to predict sentence memorability and reading times; compares this against zero-shot and few-shot prompting and against interpretable baseline predictors. Result: Fine-tuned LLMs produce estimates that correlate with human-derived sentence memorability and reading times and exceed the predictive power of the baselines, whereas zero-shot and few-shot prompting results are highly unstable. Conclusion: LLMs contain useful information about sentence-level cognitive features, but prompting should be used with care as a proxy for human cognitive measures; fine-tuning remains the reliable route. Abstract: Large Language Models (LLMs) have recently been shown to produce estimates of psycholinguistic norms, such as valence, arousal, or concreteness, for words and multiword expressions, that correlate with human judgments. These estimates are obtained by prompting an LLM, in zero-shot fashion, with a question similar to those used in human studies. Meanwhile, for other norms such as lexical decision time or age of acquisition, LLMs require supervised fine-tuning to obtain results that align with ground-truth values. In this paper, we extend this approach to the previously unstudied features of sentence memorability and reading times, which involve the relationship between multiple words in a sentence-level context. Our results show that via fine-tuning, models can provide estimates that correlate with human-derived norms and exceed the predictive power of interpretable baseline predictors, demonstrating that LLMs contain useful information about sentence-level features. At the same time, our results show very mixed zero-shot and few-shot performance, providing further evidence that care is needed when using LLM-prompting as a proxy for human cognitive measures.

[42] SommBench: Assessing Sommelier Expertise of Language Models

William Brach,Tomas Bedej,Jacob Nielsen,Jacob Pichna,Juraj Bedej,Eemeli Saarensilta,Julie Dupouy,Gianluca Barmina,Andrea Blasi Núñez,Peter Schneider-Kamp,Kristian Košťál,Michal Ries,Lukas Galke Poech

Main category: cs.CL

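TL;DR: This paper proposes SommBench, a multilingual, multicultural benchmark of sommelier expertise covering three tasks (wine theory question answering, wine feature completion, and food-wine pairing) to test whether large language models can learn sensory judgment from text alone.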

Details

Motivation: Existing cultural evaluation benchmarks focus on basic cultural knowledge that can be encoded linguistically and do not assess expert domains grounded in smell and taste, such as wine tasting; a multilingual benchmark is needed that separates a model's language skills from its domain expertise. Method: Builds the multilingual SommBench benchmark with three tasks (WTQA, WFC, FWP) in 8 languages; the data was developed with a professional sommelier and native speakers of each language; popular closed-weights and open-weights models are evaluated. Result: The best closed-weights model reaches 97% accuracy on WTQA, but WFC peaks at only 65% and the Matthews correlation coefficient (MCC) on FWP ranges from 0 to 0.39, indicating that sensory reasoning tasks are harder. Conclusion: SommBench is a novel and challenging multilingual benchmark of expert ability that exposes clear limitations of current LLMs in learning sensory judgment from text. Abstract: With the rapid advances of large language models, it becomes increasingly important to systematically evaluate their multilingual and multicultural capabilities. Previous cultural evaluation benchmarks focus mainly on basic cultural knowledge that can be encoded in linguistic form. Here, we propose SommBench, a multilingual benchmark to assess sommelier expertise, a domain deeply grounded in the senses of smell and taste. While language models learn about sensory properties exclusively through textual descriptions, SommBench tests whether this textual grounding is sufficient to emulate expert-level sensory judgment. SommBench comprises three main tasks: Wine Theory Question Answering (WTQA), Wine Feature Completion (WFC), and Food-Wine Pairing (FWP). SommBench is available in multiple languages: English, Slovak, Swedish, Finnish, German, Danish, Italian, and Spanish. This helps separate a language model's wine expertise from its language skills. The benchmark datasets were developed in close collaboration with a professional sommelier and native speakers of the respective languages, resulting in 1,024 wine theory question-answering questions, 1,000 wine feature-completion examples, and 1,000 food-wine pairing examples. We provide results for the most popular language models, including closed-weights models such as Gemini 2.5, and open-weights models, such as GPT-OSS and Qwen 3. Our results show that the most capable models perform well on wine theory question answering (up to 97% correct with a closed-weights model), yet feature completion (peaking at 65%) and food-wine pairing (MCC ranging between 0 and 0.39) turn out to be more challenging. These results position SommBench as an interesting and challenging benchmark for evaluating the sommelier expertise of language models. The benchmark is publicly available at https://github.com/sommify/sommbench.

[43] Cross-Context Review: Improving LLM Output Quality by Separating Production and Review Sessions

Tae-Eun Song

Main category: cs.CL

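TL;DR: This paper proposes Cross-Context Review (CCR), in which an LLM's output is reviewed in a fresh session without the original context, improving the model's ability to detect errors in its own output. In the experiments CCR achieves a higher F1 than several baselines, and the advantage comes from context isolation rather than from simply reviewing again.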

Details

Motivation: LLMs struggle to catch errors in their own output within the same session, so a simple and effective review mechanism that needs no extra training or infrastructure is desirable. Method: Proposes Cross-Context Review (CCR), in which the review runs in a fresh session with no access to the conversation history that produced the output; compares it with same-session Self-Review (SR), repeated Self-Review (SR2), and context-aware Subagent Review (SA) in a controlled experiment. Result: Over 360 reviews, CCR reaches F1 = 28.6%, significantly better than SR (24.6%, p=0.008), SR2 (21.7%, p<0.001), and SA (23.8%, p=0.004); SR2 does not significantly outperform SR (p=0.11), indicating that the advantage comes from context separation rather than repetition. Conclusion: Context isolation is the key to better LLM self-review; CCR is model-agnostic, needs no changes to existing systems, and costs only one extra session, which makes it practical and scalable. Abstract: Large language models struggle to catch errors in their own outputs when the review happens in the same session that produced them. This paper introduces Cross-Context Review (CCR), a straightforward method where the review is conducted in a fresh session with no access to the production conversation history. We ran a controlled experiment: 30 artifacts (code, technical documents, presentation scripts) with 150 injected errors, tested under four review conditions: same-session Self-Review (SR), repeated Self-Review (SR2), context-aware Subagent Review (SA), and Cross-Context Review (CCR). Over 360 reviews, CCR reached an F1 of 28.6%, outperforming SR (24.6%, p=0.008, d=0.52), SR2 (21.7%, p<0.001, d=0.72), and SA (23.8%, p=0.004, d=0.57). The SR2 result matters most for interpretation: reviewing twice in the same session did not beat reviewing once (p=0.11), which rules out repetition as an explanation for CCR's advantage. The benefit comes from context separation itself. CCR works with any model, needs no infrastructure, and costs only one extra session.

[44] LifeSim: Long-Horizon User Life Simulator for Personalized Assistant Evaluation

Feiyu Duan,Xuanjing Huang,Zhongyu Wei

Main category: cs.CL

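TL;DR: This paper proposes the LifeSim user simulator and the LifeSim-Eval benchmark for evaluating large language models on personalized assistant tasks, with a focus on understanding implicit intentions and modeling long-term user preferences.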

Details

Motivation: Existing benchmarks for personalized assistants do not reflect the complexity of real-world interactions between users and AI assistants, in particular the external context and the user's cognitive state. Method: Builds the LifeSim user simulator on the Belief-Desire-Intention (BDI) model, simulating user cognition and intention-driven interactive behavior within physical environments; on top of it, constructs LifeSim-Eval, a multi-turn interactive benchmark covering 8 life domains and 1,200 scenarios. Result: Experiments show that current LLMs have significant limitations in recognizing implicit intentions and modeling long-term user preferences, in both single-scenario and long-horizon settings. Conclusion: LifeSim-Eval provides an evaluation framework for personalized AI assistants that is closer to real-world use, exposes key weaknesses of current LLMs in cognitive modeling and long-term interaction, and points to directions for future research. Abstract: The rapid advancement of large language models (LLMs) has accelerated progress toward universal AI assistants. However, existing benchmarks for personalized assistants remain misaligned with real-world user-assistant interactions, failing to capture the complexity of external contexts and users' cognitive states. To bridge this gap, we propose LifeSim, a user simulator that models user cognition through the Belief-Desire-Intention (BDI) model within physical environments to generate coherent life trajectories, and simulates intention-driven user interactive behaviors. Based on LifeSim, we introduce LifeSim-Eval, a comprehensive benchmark for multi-scenario, long-horizon personalized assistance. LifeSim-Eval covers 8 life domains and 1,200 diverse scenarios, and adopts a multi-turn interactive method to assess models' abilities to complete explicit and implicit intentions, recover user profiles, and produce high-quality responses. Under both single-scenario and long-horizon settings, our experiments reveal that current LLMs face significant limitations in handling implicit intention and long-term user preference modeling.

[45] QAQ: Bidirectional Semantic Coherence for Selecting High-Quality Synthetic Code Instructions

Jiayin Lei,Ming Ma,Yunxi Duan,Chenxi Li,Tianming Yang

Main category: cs.CL

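TL;DR: This paper proposes the QAQ framework, which assesses the quality of synthetic code data with Reverse Mutual Information (RMI), selecting high-quality samples by how well the answer predicts the query (Q|A); on the WarriorCoder dataset, training on only 25% of the data matches full-data training.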

Details

Motivation: Existing data selection methods based on Instruction-Following Difficulty (IFD) cannot separate intrinsic task difficulty from model hallucination in synthetic data, so noise and hallucinations are hard to detect. Method: Proposes the QAQ framework, defining Reverse Mutual Information (RMI) to measure how well the answer predicts the query; analyzes the quality problems associated with both very low and very high RMI; introduces a selection strategy based on disagreement between strong and weak models to find samples that are valid yet challenging. Result: On the WarriorCoder dataset, training on a selected 25% of the data achieves performance comparable to full-data training and significantly outperforms IFD and other existing methods. Conclusion: Bidirectional semantic coherence (Q↔A) is a key indicator of synthetic data quality, and QAQ offers a scalable data selection route to efficient, lower-cost training of code generation models. Abstract: Synthetic data has become essential for training code generation models, yet it introduces significant noise and hallucinations that are difficult to detect with current metrics. Existing data selection methods like Instruction-Following Difficulty (IFD) typically assess how hard it is for a model to generate an answer given a query ($A|Q$). However, this metric is ambiguous on noisy synthetic data, where low probability cannot distinguish between intrinsic task complexity and model-generated hallucinations. Here, we propose QAQ, a novel data selection framework that evaluates data quality from the reverse direction: how well can the answer predict the query ($Q|A$)? We define Reverse Mutual Information (RMI) to quantify the information gain about the query conditioned on the answer. Our analyses reveal that both extremes of RMI signal quality issues: low RMI indicates semantic misalignment, while excessively high RMI may contain defect patterns that LLMs easily recognize. Furthermore, we introduce a selection strategy based on the disagreement between strong and weak models to identify samples that are valid yet challenging. Experiments on the WarriorCoder dataset demonstrate that selecting just 25% of data using stratified RMI achieves comparable performance to full-data training, significantly outperforming existing data selection methods. Our approach highlights the importance of bidirectional semantic coherence in synthetic data curation, offering a scalable pathway to reduce computational costs without sacrificing model capability.

Łukasz Borchmann,Jordy Van Landeghem,Michał Turski,Shreyansh Padarha,Ryan Othniel Kearns,Adam Mahdi,Niels Rogge,Clémentine Fourrier,Siwei Han,Huaxiu Yao,Artemis Llabrés,Yiming Xu,Dimosthenis Karatzas,Hao Zhang,Anupam Datta

Main category: cs.CL

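TL;DR: This paper proposes the MADQA benchmark for evaluating the strategic reasoning of multimodal agents in complex document workflows, and finds that existing agents can match the accuracy of human searchers but rely on brute-force search rather than strategic planning.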

Details

Motivation: To determine whether multimodal agents show genuine strategic reasoning or merely rely on stochastic trial-and-error search. Method: Builds MADQA, a benchmark of 2,250 human-authored questions grounded in 800 heterogeneous PDF documents; designs it according to Classical Test Theory to increase discriminative power; proposes a new evaluation protocol that measures the accuracy-effort trade-off. Result: The best agents match human searchers in accuracy but solve different questions and rely on brute-force search to compensate for weak strategic planning; they fail to close the nearly 20% gap to oracle performance and tend to get stuck in unproductive loops. Conclusion: Current multimodal agents have not yet achieved efficient, calibrated strategic reasoning and need to move from brute-force retrieval toward smarter reasoning. Abstract: Multimodal agents offer a promising path to automating complex document-intensive workflows. Yet, a critical question remains: do these agents demonstrate genuine strategic reasoning, or merely stochastic trial-and-error search? To address this, we introduce MADQA, a benchmark of 2,250 human-authored questions grounded in 800 heterogeneous PDF documents. Guided by Classical Test Theory, we design it to maximize discriminative power across varying levels of agentic abilities. To evaluate agentic behaviour, we introduce a novel evaluation protocol measuring the accuracy-effort trade-off. Using this framework, we show that while the best agents can match human searchers in raw accuracy, they succeed on largely different questions and rely on brute-force search to compensate for weak strategic planning. They fail to close the nearly 20% gap to oracle performance, persisting in unproductive loops. We release the dataset and evaluation harness to help facilitate the transition from brute-force retrieval to calibrated, efficient reasoning.

[47] Long-Context Encoder Models for Polish Language Understanding

Sławomir Dadas,Rafał Poświata,Marek Kozłowski,Małgorzata Grębowiec,Michał Perełkiewicz,Paweł Klimiuk,Przemysław Boruta

Main category: cs.CL

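TL;DR: This paper presents a high-quality Polish encoder model supporting contexts of up to 8192 tokens, built with two-stage training (positional embedding adaptation followed by full-parameter continued pre-training) plus compressed variants obtained by knowledge distillation; across 25 tasks, including KLEJ, FinBench, and long-document understanding, it outperforms existing Polish and multilingual models on average, most clearly on long-text tasks.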

Details

Motivation: Classic encoders such as BERT have a short context window and struggle with long documents, and no high-quality long-context encoder existed for Polish. Method: Uses two-stage training, first adapting the positional embeddings and then running full-parameter continued pre-training; also builds lightweight compressed variants via knowledge distillation. Result: Achieves the best average performance across 25 tasks (including KLEJ, FinBench, and long-document understanding tasks), significantly outperforming existing Polish and multilingual models, with the largest advantage on long-context tasks and comparable quality on short texts. Conclusion: The work fills the gap in long-context encoders for Polish, shows that extended context and model efficiency can be combined, and provides an effective solution for long-document understanding in a lower-resource language. Abstract: While decoder-only Large Language Models (LLMs) have recently dominated the NLP landscape, encoder-only architectures remain a cost-effective and parameter-efficient standard for discriminative tasks. However, classic encoders like BERT are limited by a short context window, which is insufficient for processing long documents. In this paper, we address this limitation for Polish by introducing a high-quality Polish model capable of processing sequences of up to 8192 tokens. The model was developed by employing a two-stage training procedure that involves positional embedding adaptation and full parameter continuous pre-training. Furthermore, we propose compressed model variants trained via knowledge distillation. The models were evaluated on 25 tasks, including the KLEJ benchmark, a newly introduced financial task suite (FinBench), and other classification and regression tasks, specifically those requiring long-document understanding. The results demonstrate that our model achieves the best average performance among Polish and multilingual models, significantly outperforming competitive solutions in long-context tasks while maintaining comparable quality on short texts.

[48] IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse

Yushi Bai,Qian Dong,Ting Jiang,Xin Lv,Zhengxiao Du,Aohan Zeng,Jie Tang,Juanzi Li

Main category: cs.CL

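TL;DR: This paper proposes IndexCache, which reuses the index results of a subset of layers across multi-layer sparse attention, substantially cutting the indexer cost in DeepSeek Sparse Attention (DSA) and speeding up inference with almost no quality loss.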

Details

Motivation: DSA reduces attention complexity, but its indexer is still $O(L^2)$ and runs independently at every layer, even though the top-k tokens selected by different layers are highly similar, which means there is cross-layer redundancy. Method: Proposes IndexCache, which partitions layers into Full layers (each running its own indexer) and Shared layers (reusing the nearest Full layer's top-k indices); designs a training-free greedy search that selects Full layers by loss on a calibration set, and a training-aware multi-layer distillation loss that fits each retained indexer to the averaged attention distribution of all layers it serves. Result: On a 30B DSA model, 75% of indexer computation is removed with almost no loss in language quality, giving up to 1.82× prefill speedup and 1.48× decode speedup; preliminary experiments on the production-scale GLM-5 model confirm the effect. Conclusion: IndexCache is an efficient, lightweight, and general optimization for sparse attention that exploits cross-layer index redundancy to cut computation substantially while preserving quality, making it suitable for long-context LLM inference. Abstract: Long-context agentic workflows have emerged as a defining use case for large language models, making attention efficiency critical for both inference speed and serving cost. Sparse attention addresses this challenge effectively, and DeepSeek Sparse Attention (DSA) is a representative production-grade solution: a lightweight lightning indexer selects the top-k most relevant tokens per query, reducing core attention from $O(L^2)$ to $O(Lk)$. However, the indexer itself retains $O(L^2)$ complexity and must run independently at every layer, despite the fact that the resulting top-k selections are highly similar across consecutive layers. We present IndexCache, which exploits this cross-layer redundancy by partitioning layers into a small set of Full layers that run their own indexers and a majority of Shared layers that simply reuse the nearest Full layer's top-k indices. We propose two complementary approaches to determine and optimize this configuration. Training-free IndexCache applies a greedy search algorithm that selects which layers to retain indexers by directly minimizing language modeling loss on a calibration set, requiring no weight updates. Training-aware IndexCache introduces a multi-layer distillation loss that trains each retained indexer against the averaged attention distributions of all layers it serves, enabling even simple interleaved patterns to match full-indexer accuracy. Experimental results on a 30B DSA model show that IndexCache can remove 75% of indexer computations with negligible quality degradation, achieving up to 1.82$\times$ prefill speedup and 1.48$\times$ decode speedup compared to standard DSA. These positive results are further confirmed by our preliminary experiments on the production-scale GLM-5 model (Figure 1).
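
The Full/Shared layer partition can be pictured with a small sketch that records, for each layer, which Full layer's top-k indices it would use. It assumes that "nearest" means the nearest preceding Full layer and that layer 0 is always Full; both are assumptions for illustration, and the function name is hypothetical.

```python
def index_sources(num_layers, full_layers):
    """For each layer, return the Full layer whose top-k indices it uses.

    Assumption: a Shared layer reuses the nearest *preceding* Full layer,
    so layer 0 has to be a Full layer in this sketch.
    """
    full_set = set(full_layers)
    if 0 not in full_set:
        raise ValueError("layer 0 must be a Full layer in this sketch")
    sources, current = [], 0
    for layer in range(num_layers):
        if layer in full_set:
            current = layer        # this layer runs its own indexer
        sources.append(current)    # Shared layers inherit the cached indices
    return sources

# Interleaved pattern keeping every fourth indexer.
sources = index_sources(8, full_layers=[0, 4])
removed_fraction = 1 - len(set(sources)) / len(sources)
```

With one Full layer in every four, three quarters of the indexer calls are skipped, which corresponds to the 75% figure quoted in the abstract.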

[49] CLASP: Defending Hybrid Large Language Models Against Hidden State Poisoning Attacks

Alexandre Le Mercier,Thomas Demeester,Chris Develder

Main category: cs.CL

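TL;DR: This paper proposes CLASP, which analyzes the block output embeddings (BOEs) of the Mamba state space model and uses an XGBoost classifier to detect Hidden State Poisoning Attacks (HiSPAs) at the token level; in a realistic résumé-screening scenario it achieves high F1 and strong generalization at low computational cost, making it suitable for practical deployment.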

Details

Motivation: Hidden State Poisoning Attacks (HiSPAs) are a serious threat to state space models such as Mamba and to hybrid architectures built on them, so a lightweight, efficient, and robust defense is needed. Method: Frames HiSPA detection as token-level binary classification; extracts discriminative patterns from Mamba's block output embeddings (BOEs) and identifies malicious tokens with an XGBoost classifier; evaluates in a realistic résumé-screening scenario and tests generalization with leave-one-out and clustered cross-validation. Result: On a résumé corpus of 9.5M tokens, reaches 95.9% token-level F1 and 99.3% document-level F1; generalizes well across attack patterns (96.9% document-level F1 under leave-one-out, 91.6% average document-level F1 with structurally novel triggers); processes 1,032 tokens per second with under 4GB of VRAM. Conclusion: CLASP is a computationally efficient, well-generalizing, plug-and-play front-line defense suited to secure deployment of SSM-based and hybrid architectures. Abstract: State space models (SSMs) like Mamba have gained significant traction as efficient alternatives to Transformers, achieving linear complexity while maintaining competitive performance. However, Hidden State Poisoning Attacks (HiSPAs), a recently discovered vulnerability that corrupts SSM memory through adversarial strings, pose a critical threat to these architectures and their hybrid variants. Framing the HiSPA mitigation task as a binary classification problem at the token level, we introduce the CLASP model to defend against this threat. CLASP exploits distinct patterns in Mamba's block output embeddings (BOEs) and uses an XGBoost classifier to identify malicious tokens with minimal computational overhead. We consider a realistic scenario in which both SSMs and HiSPAs are likely to be used: an LLM screening résumés to identify the best candidates for a role. Evaluated on a corpus of 2,483 résumés totaling 9.5M tokens with controlled injections, CLASP achieves 95.9% token-level F1 score and 99.3% document-level F1 score on malicious token detection. Crucially, the model generalizes to unseen attack patterns: under leave-one-out cross-validation, performance remains high (96.9% document-level F1), while under clustered cross-validation with structurally novel triggers, it maintains useful detection capability (91.6% average document-level F1). Operating independently of any downstream model, CLASP processes 1,032 tokens per second with under 4GB VRAM consumption, potentially making it suitable for real-world deployment as a lightweight front-line defense for SSM-based and hybrid architectures. All code and detailed results are available at https://anonymous.4open.science/r/hispikes-91C0.

[50] Sparking Scientific Creativity via LLM-Driven Interdisciplinary Inspiration

Priyanka Kargupta,Shuhaib Mehri,Dilek Hakkani-Tur,Jiawei Han

Main category: cs.CL

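TL;DR: This paper proposes Idea-Catalyst, a framework that supports creative reasoning in both humans and large language models through systematic interdisciplinary insights, improving the originality and insightfulness of scientific ideas.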

Details

Motivation: Existing AI-assisted research methods mostly focus on rapid experiment design and automation, neglecting the exploratory, collaborative reasoning that drives interdisciplinary breakthroughs. Interdisciplinary research is crucial for long-term academic impact, yet it is often constrained by disciplinary silos. Method: Starting from an abstract research goal, Idea-Catalyst decomposes it into core target-domain questions, identifies unresolved challenges, generalizes them into domain-agnostic conceptual problems, retrieves analogous solutions from other disciplines (e.g., psychology, sociology), synthesizes and recontextualizes them back into the original domain, and ranks source domains by interdisciplinary potential. Result: Empirically, the method improves the novelty of outputs by 21% and their insightfulness by 16% while remaining closely grounded in the original problem. Conclusion: Idea-Catalyst does not replace researchers. It strengthens their metacognitive, interdisciplinary reasoning, offering a paradigm of augmenting reasoning rather than automating discovery for AI-enabled science. Abstract: Despite interdisciplinary research leading to larger and longer-term impact, most work remains confined to single-domain academic silos. Recent AI-based approaches to scientific discovery show promise for interdisciplinary research, but many prioritize rapidly designing experiments and solutions, bypassing the exploratory, collaborative reasoning processes that drive creative interdisciplinary breakthroughs. As a result, prior efforts largely prioritize automating scientific discovery rather than augmenting the reasoning processes that underlie scientific disruption. We present Idea-Catalyst, a novel framework that systematically identifies interdisciplinary insights to support creative reasoning in both humans and large language models. Starting from an abstract research goal, Idea-Catalyst is designed to assist the brainstorming stage, explicitly avoiding premature anchoring on specific solutions. The framework embodies key metacognitive features of interdisciplinary reasoning: (a) defining and assessing research goals, (b) awareness of a domain's opportunities and unresolved challenges, and (c) strategic exploration of interdisciplinary ideas based on impact potential. Concretely, Idea-Catalyst decomposes an abstract goal (e.g., improving human-AI collaboration) into core target-domain research questions that guide the analysis of progress and open challenges within that domain. These challenges are reformulated as domain-agnostic conceptual problems, enabling retrieval from external disciplines (e.g., Psychology, Sociology) that address analogous issues. 
By synthesizing and recontextualizing insights from these domains back into the target domain, Idea-Catalyst ranks source domains by their interdisciplinary potential. Empirically, this targeted integration improves average novelty by 21% and insightfulness by 16%, while remaining grounded in the original research problem.

[51] SciMDR: Benchmarking and Advancing Scientific Multimodal Document Reasoning

Ziyu Chen,Yilun Zhao,Chengye Wang,Rilyn Han,Manasi Patwardhan,Arman Cohan

Main category: cs.CL

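TL;DR: This paper proposes the synthesize-and-reground framework for building SciMDR, a large-scale, faithful, and realistic scientific multimodal document reasoning dataset (300K QA pairs), together with the expert-annotated benchmark SciMDR-Eval, and reports significant gains on complex document-level scientific QA.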

Details

Motivation: Building scientific multimodal document reasoning datasets involves an inherent trade-off among scale, faithfulness, and realism, and existing methods struggle to achieve all three. Method: A two-stage synthesize-and-reground framework is proposed. Stage one, Claim-Centric QA Synthesis, generates faithful QA pairs and reasoning chains on focused segments. Stage two, Document-Scale Regrounding, programmatically re-embeds those QA pairs into full documents to preserve realistic complexity. The SciMDR training set and the SciMDR-Eval benchmark are built with this framework. Result: SciMDR contains 300K QA pairs with explicit reasoning chains across 20K scientific papers. SciMDR-Eval is an expert-annotated benchmark for multimodal comprehension within full-length scientific workflows. Models fine-tuned on SciMDR improve significantly on multiple scientific QA benchmarks, especially on tasks that require complex document-level reasoning. Conclusion: The synthesize-and-reground framework eases the scale-faithfulness-realism trilemma in constructing multimodal scientific document datasets, and SciMDR with its benchmark provides a high-quality resource for training and evaluating scientific multimodal foundation models. Abstract: Constructing scientific multimodal document reasoning datasets for foundation model training involves an inherent trade-off among scale, faithfulness, and realism. To address this challenge, we introduce the synthesize-and-reground framework, a two-stage pipeline comprising: (1) Claim-Centric QA Synthesis, which generates faithful, isolated QA pairs and reasoning on focused segments, and (2) Document-Scale Regrounding, which programmatically re-embeds these pairs into full-document tasks to ensure realistic complexity. Using this framework, we construct SciMDR, a large-scale training dataset for cross-modal comprehension, comprising 300K QA pairs with explicit reasoning chains across 20K scientific papers. We further construct SciMDR-Eval, an expert-annotated benchmark to evaluate multimodal comprehension within full-length scientific workflows. Experiments demonstrate that models fine-tuned on SciMDR achieve significant improvements across multiple scientific QA benchmarks, particularly in those tasks requiring complex document-level reasoning.

cs.CV [Back]

[52] RC-NF: Robot-Conditioned Normalizing Flow for Real-Time Anomaly Detection in Robotic Manipulation

Shijie Zhou,Bin Zhu,Jiarui Yang,Xiangyu Zhao,Jingjing Chen,Yu-Gang Jiang

Main category: cs.CV

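TL;DR: This paper proposes Robot-Conditioned Normalizing Flow (RC-NF), a real-time monitoring model for robotic anomaly detection and intervention. By decoupling task-aware robot and object states, it produces accurate anomaly scores after unsupervised training on positive samples only, and its effectiveness and low-latency response are validated in simulation and real-world settings.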

Details

Motivation: Existing Vision-Language-Action (VLA) models trained through imitation learning are not robust in dynamic environments or under Out-of-Distribution (OOD) conditions and struggle to operate reliably. Method: Robot-Conditioned Normalizing Flow (RC-NF) decouples the processing of task-relevant robot states and object motion trajectories within a normalizing flow, trains in an unsupervised manner on positive samples only, and computes anomaly scores in real time from the probability density function. A simulation benchmark, LIBERO-Anomaly-10, is also introduced. Result: RC-NF achieves state-of-the-art performance on all anomaly types in LIBERO-Anomaly-10. In real-world experiments, as a plug-and-play module (e.g., integrated with pi0), it provides an OOD signal with under 100 ms latency, enabling state-level rollback or task-level replanning. Conclusion: RC-NF noticeably improves the robustness and adaptability of VLA-driven robotic systems in dynamic environments and offers an effective solution for real-time anomaly monitoring and intervention. Abstract: Recent advances in Vision-Language-Action (VLA) models have enabled robots to execute increasingly complex tasks. However, VLA models trained through imitation learning struggle to operate reliably in dynamic environments and often fail under Out-of-Distribution (OOD) conditions. To address this issue, we propose Robot-Conditioned Normalizing Flow (RC-NF), a real-time monitoring model for robotic anomaly detection and intervention that ensures the robot's state and the object's motion trajectory align with the task. RC-NF decouples the processing of task-aware robot and object states within the normalizing flow. It requires only positive samples for unsupervised training and calculates accurate robotic anomaly scores during inference through the probability density function. We further present LIBERO-Anomaly-10, a benchmark comprising three categories of robotic anomalies for simulation evaluation. RC-NF achieves state-of-the-art performance across all anomaly types compared to previous methods in monitoring robotic tasks. Real-world experiments demonstrate that RC-NF operates as a plug-and-play module for VLA models (e.g., pi0), providing a real-time OOD signal that enables state-level rollback or task-level replanning when necessary, with a response latency under 100 ms. These results demonstrate that RC-NF noticeably enhances the robustness and adaptability of VLA-based robotic systems in dynamic environments.
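
To make the idea of scoring anomalies "through the probability density function" concrete, here is a minimal sketch of a one-layer conditional affine flow in plain Python. It is not RC-NF itself: the function names, the scalar state, the fixed weights `w_mu` and `w_log_sigma`, and the threshold are all hypothetical, and a real flow would learn its parameters from positive samples.

```python
import math

def affine_flow_nll(x, cond, w_mu=1.0, w_log_sigma=0.0):
    """Negative log-likelihood of scalar x under z = (x - mu(cond)) / sigma(cond).

    mu and log-sigma are (hypothetical) linear functions of the condition;
    the base density is a standard normal.
    """
    mu = w_mu * cond
    log_sigma = w_log_sigma * cond
    z = (x - mu) * math.exp(-log_sigma)
    log_base = -0.5 * (z * z + math.log(2.0 * math.pi))
    log_det = -log_sigma  # log |dz/dx| from the change of variables
    return -(log_base + log_det)

def is_anomalous(x, cond, threshold):
    """Flag a state whose NLL (the anomaly score) exceeds the threshold."""
    return affine_flow_nll(x, cond) > threshold
```

A state close to what the condition predicts receives a low score, and one far from it receives a high score, so thresholding the score yields a simple OOD signal.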

[53] GGPT: Geometry Grounded Point Transformer

Yutong Chen,Yiming Wang,Xucong Zhang,Sergey Prokudin,Siyu Tang

Main category: cs.CV

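TL;DR: This paper proposes the Geometry-Grounded Point Transformer (GGPT), which uses a Transformer guided by sparse geometry to bring explicit multi-view geometric constraints into feed-forward sparse-view 3D reconstruction, markedly improving geometric consistency and detail completeness while remaining efficient.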

Details

Motivation: Existing feed-forward networks can predict dense point maps directly in sparse-view 3D reconstruction, but without explicit multi-view geometric constraints they often show geometric inconsistencies and limited fine-grained accuracy. Method: A two-stage approach is proposed: (1) an improved Structure-from-Motion pipeline that uses dense feature matching and lightweight geometric optimisation to obtain accurate camera poses and sparse point clouds; (2) a geometry-guided 3D point transformer that refines dense point maps under explicit partial-geometry supervision using an optimised guidance encoding. Result: Trained only on ScanNet++ with VGGT predictions, GGPT substantially outperforms state-of-the-art feed-forward 3D reconstruction methods in both in-domain and out-of-domain settings. Its reconstructions are geometrically consistent and spatially complete, recovering fine structures and filling textureless regions. Conclusion: GGPT offers a principled and general framework for combining geometric priors with feed-forward prediction and confirms the key role of explicit geometric guidance in improving sparse-view reconstruction quality. Abstract: Recent feed-forward networks have achieved remarkable progress in sparse-view 3D reconstruction by predicting dense point maps directly from RGB images. However, they often suffer from geometric inconsistencies and limited fine-grained accuracy due to the absence of explicit multi-view constraints. We introduce the Geometry-Grounded Point Transformer (GGPT), a framework that augments feed-forward reconstruction with reliable sparse geometric guidance. We first propose an improved Structure-from-Motion pipeline based on dense feature matching and lightweight geometric optimisation to efficiently estimate accurate camera poses and partial 3D point clouds from sparse input views. Building on this foundation, we propose a geometry-guided 3D point transformer that refines dense point maps under explicit partial-geometry supervision using an optimised guidance encoding. Extensive experiments demonstrate that our method provides a principled mechanism for integrating geometric priors with dense feed-forward predictions, producing reconstructions that are both geometrically consistent and spatially complete, recovering fine structures and filling gaps in textureless areas. Trained solely on ScanNet++ with VGGT predictions, GGPT generalises across architectures and datasets, substantially outperforming state-of-the-art feed-forward 3D reconstruction models in both in-domain and out-of-domain settings.

[54] Evidential learning driven Breast Tumor Segmentation with Stage-divided Vision-Language Interaction

Jingxing Zhong,Qingtao Pan,Xuchang Zhou,Jiazhen Lin,Xinguo Zhuang

Main category: cs.CV

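TL;DR: This paper proposes TextBCS, a text-guided breast tumor segmentation model that uses stage-divided vision-language interaction and evidential learning to improve segmentation accuracy under low contrast and blurred boundaries.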

Details

Motivation: Existing deep-learning-based breast tumor segmentation methods struggle to locate tumor contours accurately under low contrast and blurred boundaries, and text prompt information may improve segmentation. Method: TextBCS combines a stage-divided vision-language interaction mechanism (mutual exchange between image and text features at every down-sampling stage) with evidential learning (a variational Dirichlet distribution that quantifies segmentation uncertainty, especially at blurred boundaries). Result: Experiments on public datasets show that TextBCS outperforms other segmentation networks and achieves the best breast tumor segmentation performance among the compared methods. Conclusion: Combining text guidance with uncertainty modeling improves the robustness and accuracy of tumor segmentation in breast MRI and suggests a new direction for computer-aided clinical diagnosis. Abstract: Breast cancer is one of the most common causes of death among women worldwide, with millions of fatalities annually. Magnetic Resonance Imaging (MRI) can provide various sequences for characterizing tumor morphology and internal patterns, and has become an effective tool for detection and diagnosis of breast tumors. However, previous deep-learning-based tumor segmentation methods have limitations in accurately locating tumor contours due to the challenge of low contrast between cancer and normal areas and blurred boundaries. Leveraging text prompt information holds promise for improving tumor segmentation by delineating segmentation regions. Inspired by this, we propose a text-guided Breast Tumor Segmentation model (TextBCS) with stage-divided vision-language interaction and evidential learning. Specifically, the proposed stage-divided vision-language interaction facilitates mutual information exchange between visual and text features at each stage of down-sampling, further exerting the advantages of text prompts to assist in locating lesion areas in low-contrast scenarios. Moreover, evidential learning is adopted to quantify the segmentation uncertainty of the model for blurred boundaries. It utilizes a variational Dirichlet distribution to characterize the distribution of the segmentation probabilities, addressing the segmentation uncertainties of the boundaries. Extensive experiments validate the superiority of our TextBCS over other segmentation networks, showcasing the best breast tumor segmentation performance on publicly available datasets.

[55] A Simple Efficiency Incremental Learning Framework via Vision-Language Model with Nonlinear Multi-Adapters

Haihua Luo,Xuming Ran,Jiangrong Shen,Timo Hämäläinen,Zhonghua Chen,Qi Xu,Fengyu Cong

Main category: cs.CV

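TL;DR: This paper proposes SimE, a simple and efficient incremental learning framework built on a vision-language model with adapters (e.g., CLIP). It finds a nonlinear relationship between the number of adapter connections and incremental learning ability, and it clearly outperforms existing methods on TinyImageNet and CIFAR-100.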

Details

Motivation: Incremental learning methods based on pre-trained vision-language models suffer from three problems: low training efficiency, reliance on a memory bank, and the need for a strong backbone. Method: The SimE framework uses a vision-language model with adapters tailored to incremental learning. It systematically analyzes how the number of adapter connections between and within Transformer blocks affects incremental learning performance, and it explores replacing the encoder with CLIP models trained on larger datasets (e.g., LAION2B) and stronger architectures (e.g., ViT-L/14) to strengthen zero-shot capability. Result: SimE surpasses traditional methods by 9.6% on TinyImageNet and other CLIP-based methods by 5.3% on CIFAR-100. The relationship between adapter connections and IL ability is nonlinear: adding connections between blocks helps, while adding too many within blocks can hurt performance. Conclusion: Adapter structure is critical when using vision-language models for incremental learning. A well-chosen adapter configuration improves efficiency and performance without a memory bank or complex training strategies, and a stronger CLIP encoder can further unlock zero-shot potential. Abstract: Incremental Learning (IL) aims to learn new tasks while preserving previously acquired knowledge. Integrating the zero-shot learning capabilities of pre-trained vision-language models into IL methods has marked a significant advancement. However, these methods face three primary challenges: (1) the need for improved training efficiency; (2) reliance on a memory bank to store previous data; and (3) the necessity of a strong backbone to augment the model's capabilities. In this paper, we propose SimE, a Simple and Efficient framework that employs a vision-language model with adapters designed specifically for the IL task. We report a remarkable phenomenon: there is a nonlinear correlation between the number of adaptive adapter connections and the model's IL capabilities. While increasing adapter connections between transformer blocks improves model performance, adding more adaptive connections within transformer blocks during smaller incremental steps does not enhance, and may even degrade, the model's IL ability. Extensive experimental results show that SimE surpasses traditional methods by 9.6% on TinyImageNet and outperforms other CLIP-based methods by 5.3% on CIFAR-100. Furthermore, we conduct a systematic study to enhance the utilization of the zero-shot capabilities of CLIP. We suggest replacing SimE's encoder with a CLIP model trained on larger datasets (e.g., LAION2B) and stronger architectures (e.g., ViT-L/14).

[56] Senna-2: Aligning VLM and End-to-End Driving Policy for Consistent Decision Making and Planning

Yuehao Song,Shaoyu Chen,Hao Gao,Yifan Zhu,Weixiang Yue,Jialv Zou,Bo Jiang,Zihao Lu,Yu Wang,Qian Zhang,Xinggang Wang

Main category: cs.CV

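TL;DR: This paper proposes Senna-2, a driving policy that explicitly aligns the high-level decisions of a vision-language model (VLM) with the low-level planning of an end-to-end (E2E) policy through a three-stage consistency-oriented training paradigm, markedly improving dual-system consistency and driving safety.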

Details

Motivation: Existing methods that use vision-language models (VLMs) to enhance end-to-end (E2E) driving policies overlook the dual-system consistency between the VLM's high-level decisions and the E2E policy's low-level planning. Trajectories can therefore mismatch the intended decisions, weakening top-down guidance and decision-following ability. Method: Senna-2 follows a consistency-oriented three-stage training paradigm. Stage one is driving pre-training, in which a decision adapter passes VLM decisions to the E2E policy as implicit embeddings. Stage two aligns the VLM and the E2E policy in an open-loop setting. Stage three performs closed-loop alignment in 3DGS environments through bottom-up hierarchical reinforcement learning to reinforce safety and efficiency. Result: Senna-2 improves dual-system consistency by 19.3% (F1 score), reduces final displacement error (FDE) by 5.7% in the open-loop setting, and reduces AF-CR by 30.6% in the closed-loop setting. Conclusion: By explicitly aligning the VLM with the E2E policy, Senna-2 addresses dual-system inconsistency and markedly improves the safety, consistency, and decision-following ability of the driving policy. Abstract: Vision-language models (VLMs) enhance the planning capability of end-to-end (E2E) driving policy by leveraging high-level semantic reasoning. However, existing approaches often overlook the dual-system consistency between VLM's high-level decision and E2E's low-level planning. As a result, the generated trajectories may misalign with the intended driving decisions, leading to weakened top-down guidance and decision-following ability of the system. To address this issue, we propose Senna-2, an advanced VLM-E2E driving policy that explicitly aligns the two systems for consistent decision-making and planning. Our method follows a consistency-oriented three-stage training paradigm. In the first stage, we conduct driving pre-training to achieve preliminary decision-making and planning, with a decision adapter transmitting VLM decisions to E2E policy in the form of implicit embeddings. In the second stage, we align the VLM and the E2E policy in an open-loop setting. In the third stage, we perform closed-loop alignment via bottom-up Hierarchical Reinforcement Learning in 3DGS environments to reinforce the safety and efficiency. Extensive experiments demonstrate that Senna-2 achieves superior dual-system consistency (19.3% F1 score improvement) and significantly enhances driving safety in both open-loop (5.7% FDE reduction) and closed-loop settings (30.6% AF-CR reduction).
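
For readers unfamiliar with the open-loop metric quoted above, final displacement error is conventionally the Euclidean distance between the last predicted waypoint and the last ground-truth waypoint. The sketch below assumes 2D waypoints given as `(x, y)` tuples; the function name is illustrative, and the paper's exact evaluation protocol (horizon, averaging) is not stated in the abstract.

```python
import math

def final_displacement_error(pred, gt):
    """Euclidean distance between the final waypoints of two 2D trajectories."""
    if not pred or not gt:
        raise ValueError("trajectories must be non-empty")
    (px, py), (gx, gy) = pred[-1], gt[-1]
    return math.hypot(px - gx, py - gy)
```

Only the endpoint matters for this metric, so two trajectories that diverge mid-way but end at the same point score zero.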

[57] Frequency-Modulated Visual Restoration for Matryoshka Large Multimodal Models

Qingtao Pan,Zhihao Dou,Shuo Li

Main category: cs.CV

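TL;DR: This paper proposes FMVR, a plug-and-play Frequency-Modulated Visual Restoration strategy that separates and modulates the low- and high-frequency components of visual representations, preserving and restoring visual semantics while reducing visual tokens, which sharply cuts computation without losing accuracy.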

Details

Motivation: Large Multimodal Models (LMMs) struggle to adapt to different computational budgets because of their many visual tokens, and existing token-reduction methods inevitably lose visual semantics. Method: FMVR disentangles the representation of a reduced set of visual tokens into low- and high-frequency components through AvgPool and MaxPool and modulates them with lightweight learnable parameters. The high-frequency component from AvgPool acts as a saliency filter that enhances salient semantics, and the low-frequency component from MaxPool acts as an anti-saliency filter that strengthens weak semantics. FMVR is further combined with Matryoshka Representation Learning so that the number of tokens can be adjusted elastically at inference time. Result: Across 10 image benchmarks and 4 video benchmarks, FMVR-LLaVA reduces the FLOPs of LLaVA-1.5-7B by 89% while maintaining almost 100% of the original accuracy. Conclusion: FMVR is a simple, efficient, plug-and-play way to improve visual token compression that balances computational efficiency with semantic fidelity and applies to a range of LMM architectures. Abstract: Large Multimodal Models (LMMs) struggle to adapt to varying computational budgets due to numerous visual tokens. Previous methods attempted to reduce the number of visual tokens before or within LLMs. However, these strategies inevitably result in the loss of visual semantics. To address these issues, we introduce FMVR, a plug-and-play and extremely simple Frequency-Modulated Visual Restoration strategy to boost the reasoning ability of LMMs under visual token reduction. Specifically, FMVR disentangles the visual representation of fewer visual tokens into low- and high-frequency components through AvgPool and MaxPool. The derived frequencies are subsequently modulated using lightweight learnable parameters. The high-frequency from AvgPool acts as a saliency filter to enhance salient visual semantics, while the low-frequency from MaxPool acts as an anti-saliency filter to strengthen weak visual semantics. It enables the preservation of visual semantics dominated by few visual tokens and the restoration of diluted visual semantics. Additionally, we inject FMVR into Matryoshka Representation Learning to learn coarse-to-fine visual token sets, thus enabling elastic adjustment of the number of visual tokens during inference while maintaining comparable performance. Experiments across 10 image-based and 4 video-based benchmarks demonstrate that FMVR-LLaVA reduces the FLOPs of LLaVA-1.5-7B by 89%, while maintaining almost 100% of the original accuracy. The code will be open.

[58] When Slots Compete: Slot Merging in Object-Centric Learning

Christos Chatzisavvas,Panagiotis Rigas,George Ioannakis,Vassilis Katsouros,Nikolaos Mitianoudis

Main category: cs.CV

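TL;DR: This paper proposes slot merging, a lightweight operation that merges overlapping slots in slot-based object-centric learning, improving object decomposition and segmentation quality.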

Details

Motivation: Existing slot-based methods usually use a fixed number of slots, so multiple slots compete for overlapping regions of the same object instead of attending to distinct regions, which harms object decomposition. Method: Slot merging measures the overlap between slot-attention maps with Soft-IoU and merges selected slot pairs through a barycentric update. The merge policy sets its threshold automatically from overlap statistics and needs no additional learnable modules. Result: Integrated into the DINOSAUR feature-reconstruction framework, the method outperforms other adaptive methods on object discovery and segmentation benchmarks, improving object factorization and mask quality. Conclusion: Slot merging is a simple, effective, drop-in technique that noticeably improves object factorization in slot-based models without adding model complexity. Abstract: Slot-based object-centric learning represents an image as a set of latent slots with a decoder that combines them into an image or features. The decoder specifies how slots are combined into an output, but the slot set is typically fixed: the number of slots is chosen upfront and slots are only refined. This can lead to multiple slots competing for overlapping regions of the same entity rather than focusing on distinct regions. We introduce slot merging: a drop-in, lightweight operation on the slot set that merges overlapping slots during training. We quantify overlap with a Soft-IoU score between slot-attention maps and combine selected pairs via a barycentric update that preserves gradient flow. Merging follows a fixed policy, with the decision threshold inferred from overlap statistics, requiring no additional learnable modules. Integrated into the established feature-reconstruction pipeline of DINOSAUR, the proposed method improves object factorization and mask quality, surpassing other adaptive methods in object discovery and segmentation benchmarks.
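
The merge step described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation: Soft-IoU is taken here as the ratio of summed element-wise minima to summed element-wise maxima (one common soft definition), the barycentric weights are the slots' total attention mass, and the threshold is passed in by hand rather than inferred from overlap statistics. All function names are hypothetical, and gradient flow is out of scope for plain lists.

```python
def soft_iou(a, b, eps=1e-8):
    """Soft IoU between two attention maps given as flat lists of weights."""
    inter = sum(min(x, y) for x, y in zip(a, b))
    union = sum(max(x, y) for x, y in zip(a, b))
    return inter / (union + eps)

def barycentric_merge(slot_a, slot_b, mass_a, mass_b):
    """Mass-weighted average of two slot vectors."""
    w = mass_a / (mass_a + mass_b)
    return [w * x + (1.0 - w) * y for x, y in zip(slot_a, slot_b)]

def merge_overlapping(slots, attn, threshold):
    """Merge the single most-overlapping slot pair if it exceeds the threshold."""
    best = None
    for i in range(len(slots)):
        for j in range(i + 1, len(slots)):
            score = soft_iou(attn[i], attn[j])
            if score > threshold and (best is None or score > best[0]):
                best = (score, i, j)
    if best is None:
        return slots, attn
    _, i, j = best
    merged_slot = barycentric_merge(slots[i], slots[j], sum(attn[i]), sum(attn[j]))
    merged_attn = [x + y for x, y in zip(attn[i], attn[j])]
    keep = [k for k in range(len(slots)) if k not in (i, j)]
    return ([slots[k] for k in keep] + [merged_slot],
            [attn[k] for k in keep] + [merged_attn])
```

With three slots where two attend to the same pixels, one call reduces the set to two slots and leaves the non-overlapping slot untouched.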

[59] Radiometric fingerprinting of object surfaces using mobile laser scanning and semantic 3D road space models

Benedikt Schwab,Thomas H. Kolbe

Main category: cs.CV

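TL;DR: This paper proposes a method that uses multi-source mobile LiDAR data to extract "radiometric fingerprints" of object surfaces in semantic 3D city models in order to infer material properties. It automatically associates 6,368 LOD3 semantic objects with 312.4 million LiDAR returns and reveals consistent intra-class material patterns.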

Details

Motivation: Existing semantic 3D city models lack material information, while multi-temporal, multi-sensor mobile LiDAR data carry rich surface-material response characteristics. A structured way of combining the two is needed to strengthen digital-twin analysis. Method: The concept of a radiometric fingerprint is introduced: LiDAR intensity observations are aggregated per semantic object across different distances, incident angles, environmental conditions, sensors, and scanning campaigns. A high-accuracy (centimeter-level) LOD3 semantic city model is built on CityGML 3.0, and the geodatabase 3DSensorDB is developed for data association and management. Result: 312.4 million beams acquired in four campaigns with five LiDAR sensors on the A2D2 vehicle were automatically matched to 6,368 semantic objects. The fingerprints reveal recurring intensity-response patterns within object classes (e.g., asphalt roads, concrete walls), indicating that class-dominant materials can be identified. Conclusion: Radiometric fingerprints are an effective bridge between semantic 3D city models and physical material properties. The method can be scaled up and can support urban digital twins in areas such as thermal modeling, acoustic simulation, and sustainability assessment. The model, code, and database are openly released. Abstract: Although semantic 3D city models are internationally available and becoming increasingly detailed, the incorporation of material information remains largely untapped. However, a structured representation of materials and their physical properties could substantially broaden the application spectrum and analytical capabilities for urban digital twins. At the same time, the growing number of repeated mobile laser scans of cities and their street spaces yields a wealth of observations influenced by the material characteristics of the corresponding surfaces. To leverage this information, we propose radiometric fingerprints of object surfaces by grouping LiDAR observations reflected from the same semantic object under varying distances, incident angles, environmental conditions, sensors, and scanning campaigns. Our study demonstrates how 312.4 million individual beams acquired across four campaigns using five LiDAR sensors on the Audi Autonomous Driving Dataset (A2D2) vehicle can be automatically associated with 6368 individual objects of the semantic 3D city model. The model comprises a comprehensive and semantic representation of four inner-city streets at Level of Detail (LOD) 3 with centimeter-level accuracy. It is based on the CityGML 3.0 standard and enables fine-grained sub-differentiation of objects. The extracted radiometric fingerprints for object surfaces reveal recurring intra-class patterns that indicate class-dominant materials. 
The semantic model, the method implementations, and the developed geodatabase solution 3DSensorDB are released under: https://github.com/tum-gis/sensordb
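
The grouping step at the heart of a radiometric fingerprint can be illustrated with a few lines of standard-library Python. This is a minimal sketch, not the 3DSensorDB implementation: the tuple layout `(object_id, intensity, range_m, incidence_deg)`, the function name, and the choice of summary statistics are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, median

def radiometric_fingerprints(observations):
    """Group LiDAR intensity observations by semantic object and summarize them.

    observations: iterable of (object_id, intensity, range_m, incidence_deg).
    Range and incidence angle are carried along but unused in this sketch.
    """
    grouped = defaultdict(list)
    for object_id, intensity, _range_m, _incidence_deg in observations:
        grouped[object_id].append(intensity)
    return {
        object_id: {"n": len(values), "mean": mean(values), "median": median(values)}
        for object_id, values in grouped.items()
    }
```

A fuller fingerprint would presumably keep the intensity distribution conditioned on range, incidence angle, sensor, and campaign instead of collapsing it to two numbers.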

[60] Towards Automated Initial Probe Placement in Transthoracic Teleultrasound Using Human Mesh and Skeleton Recovery

Yu Chung Lee,David G. Black,Ryan S. Yeung,Septimiu E. Salcudean

Main category: cs.CV

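TL;DR: This paper proposes an RGB-image-based framework for automated Patient registration and anatomy-informed Initial Probe placement Guidance (PIPG) to assist cardiac and lung ultrasound, particularly in teleultrasound settings.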

Details

Motivation: Cardiac and lung ultrasound are technically demanding. In teleultrasound especially, a novice or robot has no in-person expert guidance and struggles to locate the intercostal acoustic windows and navigate to the standard views. Method: A mixed reality (MR) head-mounted display captures RGB images of the patient, an edge server reconstructs a patient-specific body-surface and skeleton model, the intercostal region is estimated from key bony landmarks of the predicted skeleton, and the probe guidance pose is projected back onto the reconstructed body surface. Result: Pilot experiments with healthy volunteers suggest that the method yields consistent initial probe placement within an anatomically acceptable range, and the measured placement error supports its feasibility. Conclusion: The framework provides anatomy-aware initial probe placement guidance from RGB images alone, offering a practical, low-cost route for teleultrasound and robot-assisted ultrasound. Abstract: Cardiac and lung ultrasound are technically demanding because operators must identify patient-specific intercostal acoustic windows and then navigate between standard views by adjusting probe position, rotation, and force across different imaging planes. These challenges are amplified in teleultrasound when a novice or robot faces the difficult task of first placing the probe on the patient without in-person expert assistance. We present a framework for automating Patient registration and anatomy-informed Initial Probe placement Guidance (PIPG) using only RGB images from a calibrated camera. The novice first captures the patient using the camera on a mixed reality (MR) head-mounted display (HMD). An edge server then infers a patient-specific body-surface and skeleton model, with spatial smoothing across multiple views. Using bony landmarks from the predicted skeleton, we estimate the intercostal region and project the guidance back onto the reconstructed body surface. To validate the framework, we overlaid the reconstructed body mesh and the virtual probe pose guidance across multiple transthoracic echocardiography scan planes in situ and measured the quantitative placement error. Pilot experiments with healthy volunteers suggest that the proposed probe placement prediction and MR guidance yield consistent initial placement within anatomical variability acceptable for teleultrasound setup.

[61] InstantHDR: Single-forward Gaussian Splatting for High Dynamic Range 3D Reconstruction

Dingqiang Ye,Jiacong Xu,Jianglu Ping,Yuxiang Guo,Chao Fan,Vishal M. Patel

Main category: cs.CV

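TL;DR: This paper proposes InstantHDR, a feed-forward network that reconstructs 3D HDR scenes from uncalibrated multi-exposure LDR image collections in a single forward pass, removing the reliance of existing HDR methods on known camera poses, dense point-cloud initialization, and time-consuming per-scene optimization.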

Details

Motivation: Existing HDR NVS methods rely heavily on known camera poses, well-initialized dense point clouds, and time-consuming per-scene optimization, while current feed-forward methods ignore the HDR problem by assuming exposure-invariant appearance, leaving a clear gap. Method: The approach combines geometry-guided appearance modeling for multi-exposure fusion with a meta-network for generalizable scene-specific tone mapping. A pre-training dataset, HDR-Pretrain, is built from 168 Blender-rendered scenes covering diverse lighting types and multiple camera response functions. Result: InstantHDR matches the synthesis quality of state-of-the-art optimization-based HDR methods while reconstructing about 700 times faster in the single-forward setting and about 20 times faster in the post-optimization setting. Conclusion: InstantHDR enables efficient, generalizable HDR novel view synthesis and points toward HDR reconstruction without camera calibration or per-scene optimization. Abstract: High dynamic range (HDR) novel view synthesis (NVS) aims to reconstruct HDR scenes from multi-exposure low dynamic range (LDR) images. Existing HDR pipelines heavily rely on known camera poses, well-initialized dense point clouds, and time-consuming per-scene optimization. Current feed-forward alternatives overlook the HDR problem by assuming exposure-invariant appearance. To bridge this gap, we propose InstantHDR, a feed-forward network that reconstructs 3D HDR scenes from uncalibrated multi-exposure LDR collections in a single forward pass. Specifically, we design a geometry-guided appearance modeling for multi-exposure fusion, and a meta-network for generalizable scene-specific tone mapping. Due to the lack of HDR scene data, we build a pre-training dataset, called HDR-Pretrain, for generalizable feed-forward HDR models, featuring 168 Blender-rendered scenes, diverse lighting types, and multiple camera response functions. Comprehensive experiments show that our InstantHDR delivers comparable synthesis performance to the state-of-the-art optimization-based HDR methods while enjoying $\sim700\times$ and $\sim20\times$ reconstruction speed improvement with our single-forward and post-optimization settings. All code, models, and datasets will be released after the review process.

[62] Hierarchical Granularity Alignment and State Space Modeling for Robust Multimodal AU Detection in the Wild

Jun Yu,Yunxiang Zhang,Naixiang Zheng,Lingsi Zhu,Guoyuan Wang

Main category: cs.CV

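TL;DR: This paper proposes a multimodal framework based on Hierarchical Granularity Alignment and state space models for facial Action Unit (AU) detection in the wild. It extracts audio-visual features with foundation models such as DINOv2 and WavLM, dynamically aligns global semantics with locally active regions, models ultra-long temporal dependencies with Vision-Mamba, and synchronizes audio and video through asymmetric cross-attention. It achieves state-of-the-art performance on Aff-Wild2 and secured top rankings in the AU Detection track of the ABAW10 competition.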

Details

Motivation: Facial Action Unit (AU) detection in the wild faces severe spatial-temporal heterogeneity, unconstrained poses, and complex audio-visual dependencies. Existing multimodal methods are limited by encoder capacity and shallow fusion, so they struggle to capture fine-grained semantic shifts and ultra-long temporal context. Method: The proposed multimodal framework (1) uses DINOv2 and WavLM as visual and audio feature extractors; (2) adds a Hierarchical Granularity Alignment module that dynamically aligns global facial semantics with locally active patches; (3) replaces conventional TCNs with a Vision-Mamba architecture for ultra-long temporal modeling at O(N) linear complexity; and (4) introduces an asymmetric cross-attention mechanism that fuses paralinguistic audio cues with subtle visual movements. Result: The framework significantly outperforms existing baselines on Aff-Wild2, reaching state-of-the-art performance, and secured top rankings in the AU Detection track of the 10th Affective Behavior Analysis in-the-wild Competition (ABAW10). Conclusion: By combining strong foundation-model representations, hierarchical alignment, efficient state space modeling, and deep audio-visual interaction, the framework addresses the core difficulties of in-the-wild AU detection. Abstract: Facial Action Unit (AU) detection in in-the-wild environments remains a formidable challenge due to severe spatial-temporal heterogeneity, unconstrained poses, and complex audio-visual dependencies. While recent multimodal approaches have made progress, they often rely on capacity-limited encoders and shallow fusion mechanisms that fail to capture fine-grained semantic shifts and ultra-long temporal contexts. To bridge this gap, we propose a novel multimodal framework driven by Hierarchical Granularity Alignment and State Space Models. Specifically, we leverage powerful foundation models, namely DINOv2 and WavLM, to extract robust and high-fidelity visual and audio representations, effectively replacing traditional feature extractors. To handle extreme facial variations, our Hierarchical Granularity Alignment module dynamically aligns global facial semantics with fine-grained local active patches. Furthermore, we overcome the receptive field limitations of conventional temporal convolutional networks by introducing a Vision-Mamba architecture. This approach enables temporal modeling with O(N) linear complexity, effectively capturing ultra-long-range dynamics without performance degradation. A novel asymmetric cross-attention mechanism is also introduced to deeply synchronize paralinguistic audio cues with subtle visual movements. Extensive experiments on the challenging Aff-Wild2 dataset demonstrate that our approach significantly outperforms existing baselines, achieving state-of-the-art performance. 
Notably, this framework secured top rankings in the AU Detection track of the 10th Affective Behavior Analysis in-the-wild Competition.
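
The O(N) claim for state space temporal modeling follows from the recurrence itself: each step updates a fixed-size state from the previous state and the current input, so a sequence of N frames costs N updates. The sketch below shows a scalar, time-invariant recurrence h_t = a·h_{t-1} + b·x_t, y_t = c·h_t. It is a deliberate simplification: Mamba-style models use vector states and input-dependent (selective) parameters, and the coefficient values here are arbitrary placeholders.

```python
def linear_ssm_scan(xs, a=0.9, b=0.1, c=1.0):
    """Run a scalar linear state space recurrence over a sequence in O(N) time."""
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x  # state update: decays old context, mixes in new input
        ys.append(c * h)   # readout
    return ys
```

Because the state carries a decayed summary of everything seen so far, the output at step t depends on all earlier inputs without any pairwise comparison between time steps.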

[63] UniCompress: Token Compression for Unified Vision-Language Understanding and Generation

Ziyao Wang,Chen Chen,Jingtao Li,Weiming Zhuang,Jiabo Huang,Ang Li,Lingjuan Lyu

Main category: cs.CV

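TL;DR: This paper proposes UniCompress, a unified visual token compression algorithm whose compression and decompression mechanism is guided by learnable global meta tokens. It greatly reduces the number of visual tokens while preserving performance on image understanding and generation, significantly lowering computation and memory overhead.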

Details

Motivation: Unified multimodal models require many visual tokens, which makes computation and memory costly and hinders deployment in resource-constrained scenarios such as embodied AI systems. Method: UniCompress introduces a lightweight, modular, plug-in compression and decompression mechanism guided by learnable global meta tokens, which can be integrated into existing unified models without full retraining. Result: Visual tokens are reduced by up to 4 times, inference latency and training cost drop substantially, and performance degrades only slightly. Conclusion: UniCompress demonstrates that efficient token compression is feasible and practical for unified multimodal modeling and opens a path toward real-world multimodal applications. Abstract: Unified models aim to support both understanding and generation by encoding images into discrete tokens and processing them alongside text within a single autoregressive framework. This unified design offers architectural simplicity and cross-modal synergy, which facilitates shared parameterization, consistent training objectives, and seamless transfer between modalities. However, the large number of visual tokens required by such models introduces substantial computation and memory overhead, and this inefficiency directly hinders deployment in resource-constrained scenarios such as embodied AI systems. In this work, we propose a unified token compression algorithm UniCompress that significantly reduces visual token count while preserving performance on both image understanding and generation tasks. Our method introduces a plug-in compression and decompression mechanism guided by learnable global meta tokens. The framework is lightweight and modular, enabling efficient integration into existing models without full retraining. Experimental results show that our approach reduces image tokens by up to 4 times, achieves substantial gains in inference latency and training cost, and incurs only minimal performance degradation, which demonstrates the promise of token-efficient unified modeling for real-world multimodal applications.

[64] UNet-AF: An alias-free UNet for image restoration

Jérémy Scanvic,Quentin Barthélemy,Julián Tachella

Main category: cs.CV

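TL;DR: This paper proposes an alias-free UNet architecture that improves equivariance to translations by selecting state-of-the-art translation-equivariant layers, achieving performance comparable to baselines on image restoration while significantly increasing measured equivariance.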

Details

Motivation: UNets are often assumed to be translation-equivariant, but their traditional components are prone to aliasing, which harms equivariance in practice. Method: Several state-of-the-art translation-equivariant layers are designed into an alias-free UNet, and ablation studies verify the role of each component. Result: On image restoration tasks, the model performs on par with non-equivariant baselines while its measured equivariance increases significantly. Conclusion: Careful selection of equivariant layers improves the translation equivariance of UNets without sacrificing performance, and every modified component contributes critically to equivariance. Abstract: The simplicity and effectiveness of the UNet architecture make it ubiquitous in image restoration, image segmentation, and diffusion models. UNets are often assumed to be equivariant to translations, yet they traditionally consist of layers that are known to be prone to aliasing, which hinders their equivariance in practice. To overcome this limitation, we propose a new alias-free UNet designed from a careful selection of state-of-the-art translation-equivariant layers. We evaluate the proposed equivariant architecture against non-equivariant baselines on image restoration tasks and observe competitive performance with a significant increase in measured equivariance. Through extensive ablation studies, we also demonstrate that each change is crucial for its empirical equivariance. Our implementation is available at https://github.com/jscanvic/UNet-AF
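
Why a plain strided layer breaks translation equivariance can be seen on a one-dimensional toy signal. The sketch below compares naive stride-2 subsampling with blur-then-subsample on an input that alternates at the Nyquist frequency. It is a generic anti-aliasing illustration with hypothetical helper names and a fixed [1, 2, 1]/4 binomial filter; the abstract does not say which layers UNet-AF actually uses.

```python
def circular_shift(x, k):
    """Shift a list to the right by k positions with wrap-around."""
    k %= len(x)
    return x[-k:] + x[:-k] if k else list(x)

def subsample(x, stride=2):
    """Naive strided downsampling with no low-pass filtering."""
    return x[::stride]

def blur(x):
    """Circular convolution with the binomial low-pass filter [1, 2, 1] / 4."""
    n = len(x)
    return [0.25 * x[i - 1] + 0.5 * x[i] + 0.25 * x[(i + 1) % n] for i in range(n)]

def shift_sensitivity(down, x):
    """Largest change in the downsampled output after a one-sample input shift."""
    a = down(x)
    b = down(circular_shift(x, 1))
    return max(abs(p - q) for p, q in zip(a, b))
```

For the alternating input `[1.0, 0.0] * 4`, naive subsampling returns all ones before the shift and all zeros after it, while the binomial filter removes the Nyquist component first, so the filtered output stays constant at 0.5 under the same shift.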

[65] Towards Trustworthy Selective Generation: Reliability-Guided Diffusion for Ultra-Low-Field to High-Field MRI Synthesis

Zhenxuan Zhang,Peiyuan Jing,Ruicheng Yuan,Liwei Hu,Anbang Wang,Fanwen Wang,Yinzhe Wu,Kh Tohidul Islam,Zhaolin Chen,Zi Wang,Peter Lally,Guang Yang

Main category: cs.CV

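TL;DR: This paper proposes ReDiff, a reliability-aware diffusion model for low-field to high-field MRI synthesis that uses reliability-guided sampling and uncertainty-aware multi-candidate selection to improve structural fidelity and reduce anatomically inconsistent artifacts.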

Details

Motivation: Existing diffusion models for low-field to high-field MRI synthesis struggle to balance detail recovery with structural fidelity and tend to generate anatomically inconsistent artifacts (such as spurious edges or abnormal textures) in structurally ambiguous regions, which affects downstream quantitative analysis and clinical trust. Method: The ReDiff framework combines (1) a reliability-guided sampling strategy that suppresses unreliable responses during denoising with (2) an uncertainty-aware multi-candidate selection scheme that improves the reliability of the final prediction. Result: Experiments on multi-center MRI datasets show improved structural fidelity and reduced artifacts compared with state-of-the-art methods. Conclusion: By modeling reliability at both the sampling and post-generation stages, ReDiff produces MRI synthesis that is more spatially reliable and anatomically consistent, which strengthens its clinical usefulness. Abstract: Low-field to high-field MRI synthesis has emerged as a cost-effective strategy to enhance image quality under hardware and acquisition constraints, particularly in scenarios where access to high-field scanners is limited or impractical. Despite recent progress in diffusion models, diffusion-based approaches often struggle to balance fine-detail recovery and structural fidelity. In particular, the uncontrolled generation of high-resolution details in structurally ambiguous regions may introduce anatomically inconsistent patterns, such as spurious edges or artificial texture variations. These artifacts can bias downstream quantitative analysis. For example, they may cause inaccurate tissue boundary delineation or erroneous volumetric estimation, ultimately reducing clinical trust in synthesized images. These limitations highlight the need for generative models that are not only visually accurate but also spatially reliable and anatomically consistent. To address this issue, we propose a reliability-aware diffusion framework (ReDiff) that improves synthesis robustness at both the sampling and post-generation stages. Specifically, we introduce a reliability-guided sampling strategy to suppress unreliable responses during the denoising process. We further develop an uncertainty-aware multi-candidate selection scheme to enhance the reliability of the final prediction. Experiments on multi-center MRI datasets demonstrate improved structural fidelity and reduced artifacts compared with state-of-the-art methods.
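
One plausible reading of "uncertainty-aware multi-candidate selection" is to draw several samples, estimate per-pixel uncertainty from their disagreement, and keep the sample closest to the consensus. The sketch below implements that reading in plain Python on flattened images. It is an assumption made for illustration only: the abstract does not specify ReDiff's actual selection rule, and the function name and distance measure are hypothetical.

```python
def select_reliable_candidate(candidates):
    """Pick the candidate closest to the per-pixel mean of all candidates.

    candidates: list of equally sized flat lists of pixel values.
    Returns (index of chosen candidate, per-pixel variance map).
    """
    n = len(candidates)
    size = len(candidates[0])
    mean = [sum(c[i] for c in candidates) / n for i in range(size)]
    var = [sum((c[i] - mean[i]) ** 2 for c in candidates) / n for i in range(size)]

    def distance(c):
        # Squared distance from the consensus image
        return sum((c[i] - mean[i]) ** 2 for i in range(size))

    best = min(range(n), key=lambda k: distance(candidates[k]))
    return best, var
```

The variance map doubles as a simple uncertainty estimate: pixels where the samples agree have near-zero variance, and pixels in ambiguous regions stand out.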

[66] Learning to Assist: Physics-Grounded Human-Human Control via Multi-Agent Reinforcement Learning

Yuto Shibata,Kashu Yamazaki,Lalit Jayanti,Yoshimitsu Aoki,Mariko Isogawa,Katerina Fragkiadaki

Main category: cs.CV

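TL;DR: This paper proposes AssistMimic, a multi-agent reinforcement learning method for imitating assistive motions that involve close, force-exchanging human-human interaction; it is the first method to successfully track assistive interaction motions on established benchmarks.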

Details

Motivation: Existing general motion tracking methods struggle with assistive scenarios, which require continuous awareness of the human partner's posture and dynamics and rapid adaptation to them. Method: Imitation of assistive interaction motions is formulated as a multi-agent reinforcement learning problem in which partner-aware policies for a supporter (assistant) agent and a recipient agent are trained jointly in a physics simulator; the method introduces a policy initialization scheme based on single-human motion controllers, dynamic reference retargeting, and a contact-promoting reward. Result: AssistMimic is the first method to successfully track assistive interaction motions on established benchmarks, demonstrating the effectiveness of multi-agent RL for embodied, socially aware humanoid control. Conclusion: A multi-agent reinforcement learning framework can effectively improve the adaptability and robustness of humanoid robots in physical interaction and social collaboration tasks. Abstract: Humanoid robotics has strong potential to transform daily service and caregiving applications. Although recent advances in general motion tracking within physics engines (GMT) have enabled virtual characters and humanoid robots to reproduce a broad range of human motions, these behaviors are primarily limited to contact-less social interactions or isolated movements. Assistive scenarios, by contrast, require continuous awareness of a human partner and rapid adaptation to their evolving posture and dynamics. In this paper, we formulate the imitation of closely interacting, force-exchanging human-human motion sequences as a multi-agent reinforcement learning problem. We jointly train partner-aware policies for both the supporter (assistant) agent and the recipient agent in a physics simulator to track assistive motion references. To make this problem tractable, we introduce a partner-policy initialization scheme that transfers priors from single-human motion-tracking controllers, greatly improving exploration. We further propose dynamic reference retargeting and a contact-promoting reward, which adapt the assistant's reference motion to the recipient's real-time pose and encourage physically meaningful support. We show that AssistMimic is the first method capable of successfully tracking assistive interaction motions on established benchmarks, demonstrating the benefits of a multi-agent RL formulation for physically grounded and socially aware humanoid control.

Mingzhe Tao,Ruiping Liu,Junwei Zheng,Yufan Chen,Kedi Ying,M. Saquib Sarfraz,Kailun Yang,Jiaming Zhang,Rainer Stiefelhagen

Main category: cs.CV

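TL;DR: This paper introduces the DriveXQA multimodal dataset and the MVX-LLM model to improve understanding of abnormal driving scenes in autonomous driving, with strong results under sensor failures and adverse weather.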

Details

Motivation: Existing MLLMs have not been sufficiently explored for using multi-sensor information to understand abnormal driving scenes in autonomous driving, leaving a research gap. Method: The DriveXQA dataset is built with four visual modalities, five sensor failure cases, and five weather conditions (102,505 QA pairs); the MVX-LLM model uses a Dual Cross-Attention (DCA) projector to fuse the modalities efficiently. Result: DCA markedly improves performance under challenging conditions such as fog (GPTScore: 53.5 vs. 25.1 for the baseline); the DriveXQA dataset and source code will be open-sourced. Conclusion: DriveXQA and MVX-LLM provide a new benchmark and an effective architecture for multimodal autonomous driving understanding, particularly for complex and abnormal driving scenes. Abstract: Fusing sensors with complementary modalities is crucial for maintaining a stable and comprehensive understanding of abnormal driving scenes. However, Multimodal Large Language Models (MLLMs) are underexplored for leveraging multi-sensor information to understand adverse driving scenarios in autonomous vehicles. To address this gap, we propose DriveXQA, a multimodal dataset for autonomous driving VQA. In addition to four visual modalities, five sensor failure cases, and five weather conditions, it includes 102,505 QA pairs categorized into three types: global scene level, allocentric level, and ego-vehicle centric level. Since no existing MLLM framework adopts multiple complementary visual modalities as input, we design MVX-LLM, a token-efficient architecture with a Dual Cross-Attention (DCA) projector that fuses the modalities to alleviate information redundancy. Experiments demonstrate that our DCA achieves improved performance under challenging conditions such as fog (GPTScore: 53.5 vs. 25.1 for the baseline). The established dataset and source code will be made publicly available.

[68] High-Precision 6DOF Pose Estimation via Global Phase Retrieval in Fringe Projection Profilometry for 3D Mapping

Sehoon Tak,Keunhee Cho,Sangpil Kim,Jae-Sang Hyun

Main category: cs.CV

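TL;DR: This paper proposes a high-precision pose estimation method that adds a fixed, intrinsically calibrated global projector to a moving DFP system. Using the projector's phase-derived pixel constraints and a PnP-style reprojection objective, it estimates the DFP system pose in a fixed reference frame without relying on deterministic feature extraction, and sampling invariance is demonstrated experimentally.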

Details

Motivation: Conventional DFP suffers from insufficient six-degree-of-freedom pose estimation precision in large-scale 3D reconstruction; ICP registration is inefficient on large point clouds and depends on downsampling or feature extraction, which can lose detail and reduce precision; existing drift-correction methods do not resolve the sampling sensitivity of dense DFP point clouds. Method: A fixed global projector with known intrinsics is introduced, and the pixel constraints provided by its phase information are combined with a PnP-style reprojection objective to estimate the pose of the moving DFP system in a fixed reference frame with high precision, without hand-crafted feature extraction; sampling invariance is verified under coordinate-preserving subsampling. Result: Experiments show sub-millimeter pose accuracy (with quantified uncertainty bounds), high repeatability under aggressive subsampling, robustness on homogeneous surfaces and low-overlap views, and reduced error accumulation in ICP trajectories. Conclusion: The method extends DFP toward high-precision 3D mapping in quasi-static scenarios such as inspection and metrology, at the cost of time-multiplexed acquisition of the additional projector data. Abstract: Digital fringe projection (DFP) enables micrometer-level 3D reconstruction, yet extending it to large-scale mapping remains challenging because six-degree-of-freedom pose estimation often cannot match the reconstruction's precision. Conventional iterative closest point (ICP) registration becomes inefficient on multi-million-point clouds and typically relies on downsampling or feature-based selection, which can reduce local detail and degrade pose precision. Drift-correction methods improve long-term consistency but do not resolve sampling sensitivity in dense DFP point clouds. We propose a high-precision pose estimation method that augments a moving DFP system with a fixed, intrinsically calibrated global projector. Using the global projector's phase-derived pixel constraints and a PnP-style reprojection objective, the method estimates the DFP system pose in a fixed reference frame without relying on deterministic feature extraction, and we experimentally demonstrate sampling invariance under coordinate-preserving subsampling. Experiments demonstrate sub-millimeter pose accuracy against a reference with quantified uncertainty bounds, high repeatability under aggressive subsampling, robust operation on homogeneous surfaces and low-overlap views, and reduced error accumulation when used to correct ICP-based trajectories. The method extends DFP toward accurate 3D mapping in quasi-static scenarios such as inspection and metrology, with the trade-off of time-multiplexed acquisition for the additional projector measurements.
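
A PnP-style reprojection objective of the kind named in this abstract can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the projector is modelled as an ideal pinhole with intrinsics `K`, the observed pixel coordinates (which in the paper would come from decoded phase) are synthetic here, and the function names `project` and `reprojection_residuals` are hypothetical.

```python
import numpy as np

def project(K, R, t, points_3d):
    """Pinhole projection x ~ K (R X + t) of (N, 3) points into pixel coordinates."""
    cam = points_3d @ R.T + t          # points expressed in the projector frame
    uvw = cam @ K.T                    # homogeneous pixel coordinates
    return uvw[:, :2] / uvw[:, 2:3]    # perspective division

def reprojection_residuals(K, R, t, points_3d, observed_px):
    """Stacked pixel residuals; a pose solver would minimise their squared norm."""
    return (project(K, R, t, points_3d) - observed_px).ravel()

# Synthetic setup (all values are assumptions chosen for illustration).
rng = np.random.default_rng(0)
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
angle = 0.1
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.05, -0.02, 0.5])
points_3d = rng.uniform([-0.5, -0.5, 1.0], [0.5, 0.5, 2.0], size=(1000, 3))
observed_px = project(K, R_true, t_true, points_3d)

# Residuals vanish at the true pose and grow when the pose is perturbed.
res_true = reprojection_residuals(K, R_true, t_true, points_3d, observed_px)
res_wrong = reprojection_residuals(
    K, R_true, t_true + np.array([0.01, 0.0, 0.0]), points_3d, observed_px)

# Coordinate-preserving subsampling: keep every 10th correspondence unchanged.
res_sub = reprojection_residuals(
    K, R_true, t_true, points_3d[::10], observed_px[::10])
```

Because subsampling here drops correspondences without altering the coordinates of the ones that remain, the true pose is still a zero-residual solution, which is one intuition for why such an objective can be insensitive to sampling density. In practice the residual function would be handed to a nonlinear least-squares solver.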

[69] DeepHistoViT: An Interpretable Vision Transformer Framework for Histopathological Cancer Classification

Ravi Mosalpuri,Mohammed Abdelsamea,Ahmed Karam Eldaly

Main category: cs.CV

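TL;DR: This paper proposes DeepHistoViT, an interpretable Vision Transformer framework for automated classification of histopathological images, reporting close to 100% on multiple metrics across lung cancer, colon cancer, and acute lymphoblastic leukaemia datasets.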

Details

Motivation: Conventional histopathological diagnosis is time-consuming, labour-intensive, and subject to inter-observer variability, creating a need for reliable, interpretable computer-assisted diagnostic tools. Method: DeepHistoViT is a customized Vision Transformer architecture with an integrated attention mechanism that captures fine-grained cellular structures and provides interpretability through attention-based localization of diagnostically relevant regions. Result: SOTA performance is reported on three public histopathology datasets (lung cancer, colon cancer, acute lymphoblastic leukaemia): all metrics (accuracy, precision, recall, F1-score, ROC-AUC) reach 100% on the lung and colon cancer datasets; on the ALL dataset all metrics exceed 99.8%, with ROC-AUC at 99.99%; all results are reported with 95% confidence intervals. Conclusion: Transformer architectures are highly effective for histopathological image analysis, and DeepHistoViT combines high accuracy with interpretability, making it a promising practical aid for pathologists' clinical decision-making. Abstract: Histopathology remains the gold standard for cancer diagnosis because it provides detailed cellular-level assessment of tissue morphology. However, manual histopathological examination is time-consuming, labour-intensive, and subject to inter-observer variability, creating a demand for reliable computer-assisted diagnostic tools. Recent advances in deep learning, particularly transformer-based architectures, have shown strong potential for modelling complex spatial dependencies in medical images. In this work, we propose DeepHistoViT, a transformer-based framework for automated classification of histopathological images. The model employs a customized Vision Transformer architecture with an integrated attention mechanism designed to capture fine-grained cellular structures while improving interpretability through attention-based localization of diagnostically relevant regions. The framework is evaluated on three publicly available histopathology datasets covering lung cancer, colon cancer, and acute lymphoblastic leukaemia. Experimental results demonstrate state-of-the-art performance across all datasets, with classification accuracy, precision, recall, F1-score, and ROC-AUC reaching 100 percent on the lung and colon cancer datasets, and 99.85 percent, 99.84 percent, 99.86 percent, 99.85 percent, and 99.99 percent respectively on the acute lymphoblastic leukaemia dataset. All performance metrics are reported with 95 percent confidence intervals. These results highlight the effectiveness of transformer-based architectures for histopathological image analysis and demonstrate the potential of DeepHistoViT as an interpretable computer-assisted diagnostic tool to support pathologists in clinical decision-making.

[70] Seeing Isn't Orienting: A Cognitively Grounded Benchmark Reveals Systematic Orientation Failures in MLLMs Supplementary

Nazia Tasnim,Keanu Nichols,Yuting Yang,Nicholas Ikechukwu,Elva Zou,Deepti Ghadiyaram,Bryan A. Plummer

Main category: cs.CV

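TL;DR: This paper introduces DORI, a cognitively grounded hierarchical benchmark designed specifically to evaluate how well vision-language models understand object orientation, and shows that current models are seriously deficient at orientation reasoning.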

Details

Motivation: Most existing vision-language benchmarks conflate orientation with position and scene understanding and lack a dedicated evaluation of object orientation as a core cognitive ability; human understanding of orientation develops progressively, which current models fail to reflect. Method: The DORI benchmark, based on four stages of human orientation cognition, evaluates orientation understanding at coarse (categorical) and granular (metric) levels; the dataset contains 13,652 images and 33,656 multiple-choice questions covering 67 object categories, and controls confounds via bounding-box isolation, standardized spatial reference frames, and structured prompts. Result: Evaluation of 24 SOTA vision-language models shows that models perform well on general spatial tasks but are near-random on object orientation tasks; the best models reach only 54.2% and 45.0% on coarse and granular tasks respectively, failing most on compound rotations and reference-frame shifts; there is a marked coarse-to-granular performance gap. Conclusion: Object orientation understanding remains a key unsolved challenge for multimodal systems; existing benchmarks hide the fundamental flaw that models rely on categorical heuristics rather than geometric reasoning, which matters for robotic manipulation, 3D scene reconstruction, and human-AI interaction. Abstract: Humans learn object orientation progressively, from recognizing which way an object faces, to mentally rotating it, to reasoning about orientations between objects. Current vision-language benchmarks largely conflate orientation with position and general scene understanding. We introduce Discriminative Orientation Reasoning Intelligence (DORI), a cognitively grounded hierarchical benchmark that makes object orientation the primary target. Inspired by stages of human orientation cognition, DORI decomposes orientation into four dimensions, each evaluated at coarse (categorical) and granular (metric) levels. Composed from 13,652 images across 14 sources, DORI provides 33,656 multiple-choice questions covering 67 object categories in real-world and synthetic settings. Its coarse-to-granular design isolates orientation from confounds such as object recognition difficulty, scene clutter, and linguistic ambiguity via bounding-box isolation, standardized spatial reference frames, and structured prompts. Evaluating 24 state-of-the-art vision-language models shows a clear pattern: models that perform well on general spatial benchmarks are near-random on object-centric orientation tasks. The best models reach only 54.2% on coarse and 45.0% on granular judgments, with largest failures on compound rotations and shifts in inter-object reference frames. Large coarse-to-granular gaps reveal reliance on categorical heuristics rather than geometric reasoning, a limitation hidden by existing benchmarks. These results identify orientation understanding as an unsolved challenge for multimodal systems, with implications for robotic manipulation, 3D scene reconstruction, and human-AI interaction.

[71] Zero-Shot Cross-City Generalization in End-to-End Autonomous Driving: Self-Supervised versus Supervised Representations

Fatemeh Naeinian,Ali Hamza,Haoran Zhu,Anna Choromanska

Main category: cs.CV

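TL;DR: This paper studies zero-shot cross-city generalization of end-to-end autonomous driving models to unseen cities. It finds that supervised pretrained backbones tend to rely on city-specific cues, causing severe degradation under cross-city transfer, whereas self-supervised visual representations (such as I-JEPA, DINOv2, and MAE) substantially narrow this generalization gap and improve the robustness of trajectory planning.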

Details

Motivation: The ability of end-to-end autonomous driving models trained on multi-city data to generalize to unseen cities has not been adequately tested; geographically mixed training may mask failure modes under real domain shift, so zero-shot cross-city transfer needs to be evaluated. Method: Several self-supervised backbones (I-JEPA, DINOv2, MAE) are integrated into end-to-end trajectory planning frameworks and evaluated under strict geographic splits on nuScenes (open-loop) and NAVSIM (closed-loop). Result: With a supervised backbone, Boston-to-Singapore transfer yields an L2 displacement ratio of 9.77x and a collision ratio of 19.43x; self-supervised pretraining reduces these to 1.20x and 0.75x respectively; in closed-loop evaluation PDMS improves by up to 4%. Conclusion: Self-supervised representation learning substantially strengthens the robustness of cross-city planning; zero-shot geographic transfer should be a required benchmark when evaluating end-to-end autonomous driving systems. Abstract: End-to-end autonomous driving models are typically trained on multi-city datasets using supervised ImageNet-pretrained backbones, yet their ability to generalize to unseen cities remains largely unexamined. When training and evaluation data are geographically mixed, models may implicitly rely on city-specific cues, masking failure modes that would occur under real domain shifts when generalizing to new locations. In this work we investigate zero-shot cross-city generalization in end-to-end trajectory planning and ask whether self-supervised visual representations improve transfer across cities. We conduct a comprehensive study by integrating self-supervised backbones (I-JEPA, DINOv2, and MAE) into planning frameworks. We evaluate performance under strict geographic splits on nuScenes in the open-loop setting and on NAVSIM in the closed-loop evaluation protocol. Our experiments reveal a substantial generalization gap when transferring models relying on traditional supervised backbones across cities with different road topologies and driving conventions, particularly when transferring from right-side to left-side driving environments. Self-supervised representation learning reduces this gap. In open-loop evaluation, a supervised backbone exhibits severe inflation when transferring from Boston to Singapore (L2 displacement ratio 9.77x, collision ratio 19.43x), whereas domain-specific self-supervised pretraining reduces this to 1.20x and 0.75x respectively. In closed-loop evaluation, self-supervised pretraining improves PDMS by up to 4 percent for all single-city training cities. These results show that representation learning strongly influences the robustness of cross-city planning and establish zero-shot geographic transfer as a necessary test for evaluating end-to-end autonomous driving systems.

[72] ShotVerse: Advancing Cinematic Camera Control for Text-Driven Multi-Shot Video Creation

Songlin Yang,Zhe Wang,Xuyi Yang,Songchun Zhang,Xianghao Kong,Taiyi Wu,Xiaotong Zhao,Ran Zhang,Alan Zhao,Anyi Rao

Main category: cs.CV

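TL;DR: This paper proposes the ShotVerse framework, which tackles camera control in multi-shot scenarios for text-driven video generation through a data-driven "Plan-then-Control" paradigm: a VLM-based planner generates globally aligned trajectories, and a controller renders them into multi-shot video.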

Details

Motivation: Text-driven video generation lacks precise camera control in multi-shot cinematic scenarios: implicit text prompts are imprecise, while explicit trajectory conditioning imposes excessive manual cost and often causes execution failures. Method: ShotVerse is a "Plan-then-Control" framework with a VLM-based Planner (which uses spatial priors to generate globally aligned cinematic trajectories from text) and a Controller (which renders the trajectories into multi-shot video via a camera adapter); an automated multi-shot camera calibration pipeline establishes a unified global coordinate system, and the high-quality ShotVerse-Bench dataset is released. Result: Experiments show that ShotVerse markedly improves camera accuracy and cross-shot consistency in multi-shot video and outperforms existing methods in both cinematic aesthetics and control reliability. Conclusion: Data-driven modeling of the joint distribution of (Caption, Trajectory, Video) triplets is an effective route to resolving the camera control bottleneck in text-to-multi-shot video, and ShotVerse offers a new paradigm for controllable video generation. Abstract: Text-driven video generation has democratized film creation, but camera control in cinematic multi-shot scenarios remains a significant block. Implicit textual prompts lack precision, while explicit trajectory conditioning imposes prohibitive manual overhead and often triggers execution failures in current models. To overcome this bottleneck, we propose a data-centric paradigm shift, positing that aligned (Caption, Trajectory, Video) triplets form an inherent joint distribution that can connect automated plotting and precise execution. Guided by this insight, we present ShotVerse, a "Plan-then-Control" framework that decouples generation into two collaborative agents: a VLM (Vision-Language Model)-based Planner that leverages spatial priors to obtain cinematic, globally aligned trajectories from text, and a Controller that renders these trajectories into multi-shot video content via a camera adapter. Central to our approach is the construction of a data foundation: we design an automated multi-shot camera calibration pipeline that aligns disjoint single-shot trajectories into a unified global coordinate system. This facilitates the curation of ShotVerse-Bench, a high-fidelity cinematic dataset with a three-track evaluation protocol that serves as the bedrock for our framework. Extensive experiments demonstrate that ShotVerse effectively bridges the gap between unreliable textual control and labor-intensive manual plotting, achieving superior cinematic aesthetics and generating multi-shot videos that are both camera-accurate and cross-shot consistent.

[73] Beyond Single-Sample: Reliable Multi-Sample Distillation for Video Understanding

Songlin Li,Xin Zhu,Zechao Guan,Peipeng Chen,Jian Yao

Main category: cs.CV

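TL;DR: This paper proposes R-MSD (Reliable Multi-Sample Distillation), a framework that models teacher sampling variance, builds a task-adaptive teacher pool, and combines quality-aware signal matching with an adversarial distillation objective to improve the stability and effectiveness of knowledge distillation for LVLMs, clearly outperforming single-sample distillation on several video understanding benchmarks.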

Details

Motivation: Traditional black-box distillation relies on a single teacher response, which tends to produce high variance and format inconsistencies and makes supervision unreliable, especially in multimodal or temporal scenarios. Method: The R-MSD framework (1) builds a task-adaptive teacher pool in place of a single teacher response, (2) introduces quality-aware signal matching to filter noise, and (3) designs an adversarial distillation objective to make knowledge transfer more robust. Result: On video understanding benchmarks including VideoMME, Video-MMMU, and MathVerse, a 4B student model gains +1.5%, +3.2%, and +3.6% respectively, clearly outperforming single-sample distillation and an SFT+RL baseline under the same training budget. Conclusion: Modeling multiple teacher samples together with quality-aware adversarial distillation effectively mitigates teacher response variance and improves the stability and performance of LVLM distillation. Abstract: Traditional black-box distillation for Large Vision-Language Models (LVLMs) typically relies on a single teacher response per input, which often yields high-variance responses and format inconsistencies in multimodal or temporal scenarios. To mitigate this unreliable supervision, we propose R-MSD (Reliable Multi-Sample Distillation), a framework that explicitly models teacher sampling variance to enhance distillation stability. Rather than relying on a single teacher response, our approach leverages a task-adaptive teacher pool to provide robust supervision tailored to both closed-ended and open-ended reasoning. By integrating quality-aware signal matching with an adversarial distillation objective, our approach effectively filters teacher noise while maximizing knowledge transfer. Extensive evaluations across comprehensive video understanding benchmarks demonstrate that R-MSD consistently outperforms single sample distillation methods. We additionally include an original SFT+RL 4B baseline under the same training budget, which shows only marginal gains, while our method achieves significant improvements. With a 4B student model, our approach delivers gains on VideoMME (+1.5%), Video-MMMU (+3.2%), and MathVerse (+3.6%).

[74] Stay in your Lane: Role Specific Queries with Overlap Suppression Loss for Dense Video Captioning

Seung Hyup Baek,Jimin Lee,Hyeongkeun Lee,Jae Won Cho

Main category: cs.CV

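TL;DR: This paper proposes a dense video captioning method based on role-specific queries; by separating the localization and captioning tasks and adding contrastive alignment, overlap suppression, and a concept enhancement module, it improves multi-event localization accuracy and the semantic richness of captions.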

Details

Motivation: Existing query-based dense video captioning methods suffer from severe interference between the localization and captioning tasks because of shared queries, and also from temporal redundancy. Method: Role-specific queries decouple localization from captioning; contrastive alignment enforces semantic consistency; an overlap suppression mechanism penalizes temporal overlap between queries; a lightweight concept extraction module enriches caption semantics. Result: The method is validated on two major benchmarks, YouCook2 and ActivityNet Captions, with clear improvements in localization accuracy and caption quality. Conclusion: Role decoupling, contrastive alignment, overlap suppression, and concept enhancement jointly improve overall dense video captioning performance and offer an effective remedy for multi-task interference and temporal redundancy. Abstract: Dense Video Captioning (DVC) is a challenging multimodal task that involves temporally localizing multiple events within a video and describing them with natural language. While query-based frameworks enable the simultaneous, end-to-end processing of localization and captioning, their reliance on shared queries often leads to significant multi-task interference between the two tasks, as well as temporal redundancy in localization. In this paper, we propose utilizing role-specific queries that separate localization and captioning into independent components, allowing each to exclusively learn its role. We then employ contrastive alignment to enforce semantic consistency between the corresponding outputs, ensuring coherent behavior across the separated queries. Furthermore, we design a novel suppression mechanism in which mutual temporal overlaps across queries are penalized to tackle temporal redundancy, supervising the model to learn distinct, non-overlapping event regions for more precise localization. Additionally, we introduce a lightweight module that captures core event concepts to further enhance semantic richness in captions through concept-level representations. We demonstrate the effectiveness of our method through extensive experiments on major DVC benchmarks YouCook2 and ActivityNet Captions.
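
One plausible reading of "mutual temporal overlaps across queries are penalized" is a penalty on the pairwise temporal IoU between predicted segments. The sketch below assumes that reading; the abstract does not give the exact form of the loss, and the names `pairwise_temporal_iou` and `overlap_suppression_penalty` are hypothetical. NumPy is used for clarity; a training implementation would need a differentiable framework.

```python
import numpy as np

def pairwise_temporal_iou(segments):
    """segments: (Q, 2) array of [start, end] times. Returns the (Q, Q) IoU matrix."""
    start, end = segments[:, 0], segments[:, 1]
    # Length of the intersection of every pair of intervals (zero if disjoint).
    inter = np.clip(
        np.minimum(end[:, None], end[None, :])
        - np.maximum(start[:, None], start[None, :]),
        0.0, None)
    length = end - start
    union = length[:, None] + length[None, :] - inter
    return inter / np.maximum(union, 1e-8)

def overlap_suppression_penalty(segments):
    """Mean IoU over all ordered pairs of distinct queries (0 means no overlap)."""
    q = len(segments)
    if q < 2:
        return 0.0
    iou = pairwise_temporal_iou(segments)
    off_diagonal = iou[~np.eye(q, dtype=bool)]
    return float(off_diagonal.mean())

disjoint = np.array([[0.0, 1.0], [2.0, 3.0]])
partial = np.array([[0.0, 2.0], [1.0, 3.0]])    # intersection 1, union 3
identical = np.array([[0.0, 1.0], [0.0, 1.0]])
```

The penalty is zero for disjoint segments, 1/3 for the partially overlapping pair, and one for identical segments, so minimising it pushes queries toward distinct event regions.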

[75] Detect Anything in Real Time: From Single-Prompt Segmentation to Multi-Class Detection

Mehmet Kerem Turkcan

Main category: cs.CV

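TL;DR: This paper proposes DART, a training-free framework that converts SAM3 into a real-time multi-class detector; optimizations such as shared visual backbone computation and batched decoding substantially speed up inference without modifying model weights and deliver strong AP and FPS on COCO.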

Details

Motivation: Existing methods such as SAM3 handle only one text prompt at a time, so detecting N categories requires N independent forward passes, which is computationally expensive and inefficient. Method: The class-agnostic visual backbone is exploited as a structural invariant so that its computation is shared; this is combined with batched multi-class decoding, detection-only inference, and TensorRT FP16 deployment; for extreme low-latency settings, adapter distillation with a frozen encoder-decoder is used. Result: On COCO val2017, 4-class detection reaches 55.8 AP at 15.8 FPS (RTX 4080); cumulative speedup reaches 25x at 80 classes; under extreme latency targets it achieves 38.7 AP with a 13.9 ms backbone latency. Conclusion: DART is a training-free, efficient, and scalable framework for real-time open-vocabulary detection that clearly outperforms comparable training-intensive methods while preserving the model's original accuracy. Abstract: Recent advances in vision-language modeling have produced promptable detection and segmentation systems that accept arbitrary natural language queries at inference time. Among these, SAM3 achieves state-of-the-art accuracy by combining a ViT-H/14 backbone with cross-modal transformer decoding and learned object queries. However, SAM3 processes a single text prompt per forward pass. Detecting N categories requires N independent executions, each dominated by the 439M-parameter backbone. We present Detect Anything in Real Time (DART), a training-free framework that converts SAM3 into a real-time multi-class detector by exploiting a structural invariant: the visual backbone is class-agnostic, producing image features independent of the text prompt. This allows the backbone computation to be shared between all classes, reducing its cost from O(N) to O(1). Combined with batched multi-class decoding, detection-only inference, and TensorRT FP16 deployment, these optimizations yield 5.6x cumulative speedup at 3 classes, scaling to 25x at 80 classes, without modifying any model weight. On COCO val2017 (5,000 images, 80 classes), DART achieves 55.8 AP at 15.8 FPS (4 classes, 1008x1008) on a single RTX 4080, surpassing purpose-built open-vocabulary detectors trained on millions of box annotations. For extreme latency targets, adapter distillation with a frozen encoder-decoder achieves 38.7 AP with a 13.9 ms backbone. Code and models are available at https://github.com/mkturkcan/DART.
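
The O(N) to O(1) backbone argument can be made concrete with a toy sketch. Everything here is a stand-in chosen for illustration: `CountingBackbone` and `decode` are hypothetical placeholders, not SAM3 or DART components, and the "decoder" is just a dot product between image features and prompt embeddings.

```python
import numpy as np

class CountingBackbone:
    """Toy class-agnostic backbone that records how often it is executed."""
    def __init__(self):
        self.calls = 0

    def __call__(self, image):
        self.calls += 1
        return image.mean(axis=(0, 1))   # (C,) features, independent of any prompt

def decode(features, prompt_embeddings):
    """Toy batched decoder: one score per prompt, (N, C) @ (C,) -> (N,)."""
    return prompt_embeddings @ features

def detect_per_prompt(backbone, image, prompt_embeddings):
    """Baseline: one full forward pass (backbone + decoder) per prompt."""
    return np.array([decode(backbone(image), p[None, :])[0]
                     for p in prompt_embeddings])

def detect_shared(backbone, image, prompt_embeddings):
    """Shared variant: run the backbone once, then decode all prompts in a batch."""
    features = backbone(image)
    return decode(features, prompt_embeddings)

rng = np.random.default_rng(0)
image = rng.random((32, 32, 8))
prompts = rng.random((80, 8))            # 80 hypothetical class embeddings

baseline_backbone = CountingBackbone()
shared_backbone = CountingBackbone()
baseline_scores = detect_per_prompt(baseline_backbone, image, prompts)
shared_scores = detect_shared(shared_backbone, image, prompts)
```

Both paths return the same scores, but the baseline runs the backbone 80 times and the shared variant once. This works only because the backbone output does not depend on the prompt, which is the structural invariant the abstract names.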

[76] Follow the Saliency: Supervised Saliency for Retrieval-augmented Dense Video Captioning

Seung hee Choi,MinJu Jeon,Hyunwoo Oh,Jihwan Lee,Dong-Jin Kim

Main category: cs.CV

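TL;DR: This paper proposes the STaRC framework, which supervises frame-level saliency through a highlight detection module and uses the saliency scores as a unified signal guiding both retrieval and caption generation, improving the accuracy and contextual relevance of temporal segmentation in dense video captioning.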

Details

Motivation: Existing retrieval-augmented dense video captioning (DVC) methods have difficulty aligning temporal segmentation with true event boundaries because they rely on heuristic strategies that ignore ground-truth annotations. Method: The STaRC framework (1) trains a highlight detection module on binary labels derived from DVC ground truth to supervise frame-level saliency, (2) uses the saliency scores as a unified temporal signal for saliency-guided temporal segmentation and for Saliency Prompts injected into the decoder, and (3) applies saliency-constrained segmentation to improve temporal coherence. Result: In comprehensive evaluations on the YouCook2 and ViTT benchmarks, STaRC achieves state-of-the-art performance on most metrics. Conclusion: By introducing learnable, ground-truth-driven saliency modeling, STaRC addresses inaccurate temporal segmentation and the disconnect between retrieval and generation in DVC, markedly improving the temporal accuracy and semantic coherence of generated captions. Abstract: Existing retrieval-augmented approaches for Dense Video Captioning (DVC) often fail to achieve accurate temporal segmentation aligned with true event boundaries, as they rely on heuristic strategies that overlook ground truth event boundaries. The proposed framework, STaRC, overcomes this limitation by supervising frame-level saliency through a highlight detection module. Note that the highlight detection module is trained on binary labels derived directly from DVC ground truth annotations without the need for additional annotation. We also propose to utilize the saliency scores as a unified temporal signal that drives retrieval via saliency-guided segmentation and informs caption generation through explicit Saliency Prompts injected into the decoder. By enforcing saliency-constrained segmentation, our method produces temporally coherent segments that align closely with actual event transitions, leading to more accurate retrieval and contextually grounded caption generation. We conduct comprehensive evaluations on the YouCook2 and ViTT benchmarks, where STaRC achieves state-of-the-art performance across most of the metrics. Our code is available at https://github.com/ermitaju1/STaRC

[77] INFACT: A Diagnostic Benchmark for Induced Faithfulness and Factuality Hallucinations in Video-LLMs

Junqi Yang,Yuecong Min,Jie Zhang,Shiguang Shan,Xilin Chen

Main category: cs.CV

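TL;DR: This paper introduces the INFACT benchmark for diagnosing faithfulness and factuality hallucinations in Video Large Language Models (Video-LLMs) and for evaluating their reliability under several induced interference modes.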

Details

Motivation: Existing benchmarks offer limited coverage of factuality hallucinations and evaluate mainly in clean settings, so there is no systematic assessment of model reliability in realistic, complex conditions. Method: The INFACT benchmark contains 9,800 QA instances spanning real and synthetic videos, with fine-grained taxonomies of faithfulness and factuality hallucinations; it defines four evaluation modes (Base, Visual Degradation, Evidence Corruption, Temporal Intervention) and quantifies reliability with Resist Rate (RR) and Temporal Sensitivity Score (TSS). Result: Experiments on 14 mainstream Video-LLMs show that high Base accuracy does not guarantee reliability under interference; evidence corruption markedly reduces stability; temporal intervention causes the largest degradation; several open-source models have near-zero TSS on factuality, indicating severe temporal inertia on order-sensitive questions. Conclusion: INFACT reveals serious reliability deficiencies in the factuality and temporal sensitivity of current Video-LLMs and provides an evaluation tool and empirical evidence for improving model robustness and trustworthiness. Abstract: Despite rapid progress, Video Large Language Models (Video-LLMs) remain unreliable due to hallucinations, which are outputs that contradict either video evidence (faithfulness) or verifiable world knowledge (factuality). Existing benchmarks provide limited coverage of factuality hallucinations and predominantly evaluate models only in clean settings. We introduce INFACT, a diagnostic benchmark comprising 9,800 QA instances with fine-grained taxonomies for faithfulness and factuality, spanning real and synthetic videos. INFACT evaluates models in four modes: Base (clean), Visual Degradation, Evidence Corruption, and Temporal Intervention for order-sensitive items. Reliability under induced modes is quantified using Resist Rate (RR) and Temporal Sensitivity Score (TSS). Experiments on 14 representative Video-LLMs reveal that higher Base-mode accuracy does not reliably translate to higher reliability in the induced modes, with evidence corruption reducing stability and temporal intervention yielding the largest degradation. Notably, many open-source baselines exhibit near-zero TSS on factuality, indicating pronounced temporal inertia on order-sensitive questions.

[78] SPEGC: Continual Test-Time Adaptation via Semantic-Prompt-Enhanced Graph Clustering for Medical Image Segmentation

Xiaogang Du,Jiawei Zhang,Tongfei Liu,Tao Lei,Yingbo Wang

Main category: cs.CV

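TL;DR: This paper proposes SPEGC, a continual test-time adaptation (CTTA) method that addresses performance degradation caused by domain shift in medical image segmentation. It uses semantic-prompt-enhanced features and a differentiable graph clustering solver to improve robustness and adaptability on unlabeled, continuously changing domains.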

Details

Motivation: In medical image segmentation, the domain gap between training and test data hinders clinical deployment of pre-trained models; existing CTTA methods rely on unreliable supervisory signals, which leads to error accumulation and performance collapse. Method: SPEGC (1) designs a semantic prompt feature enhancement mechanism that uses decoupled commonality and heterogeneity prompt pools to inject global contextual information, (2) builds a differentiable graph clustering solver that models edge sparsification as an optimal transport problem and produces a high-order structural representation end to end, and (3) uses this structural representation to guide model adaptation, enforcing cluster-level prediction consistency and dynamically adjusting decision boundaries. Result: On two medical image segmentation benchmarks, SPEGC clearly outperforms existing state-of-the-art CTTA methods. Conclusion: Through semantic prompt enhancement and graph clustering optimization, SPEGC mitigates noise interference and error accumulation under domain shift and improves the robustness and practicality of CTTA for medical image segmentation. Abstract: In medical image segmentation tasks, the domain gap caused by the difference in data collection between training and testing data seriously hinders the deployment of pre-trained models in clinical practice. Continual Test-Time Adaptation (CTTA) aims to enable pre-trained models to adapt to continuously changing unlabeled domains, providing an effective approach to solving this problem. However, existing CTTA methods often rely on unreliable supervisory signals, igniting a self-reinforcing cycle of error accumulation that culminates in catastrophic performance degradation. To overcome these challenges, we propose a CTTA via Semantic-Prompt-Enhanced Graph Clustering (SPEGC) for medical image segmentation. First, we design a semantic prompt feature enhancement mechanism that utilizes decoupled commonality and heterogeneity prompt pools to inject global contextual information into local features, alleviating their susceptibility to noise interference under domain shift. Second, based on these enhanced features, we design a differentiable graph clustering solver. This solver reframes global edge sparsification as an optimal transport problem, allowing it to distill a raw similarity matrix into a refined and high-order structural representation in an end-to-end manner. Finally, this robust structural representation is used to guide model adaptation, ensuring predictions are consistent at a cluster-level and dynamically adjusting decision boundaries. Extensive experiments demonstrate that SPEGC outperforms other state-of-the-art CTTA methods on two medical image segmentation benchmarks. The source code is available at https://github.com/Jwei-Z/SPEGC-for-MIS.

[79] OrthoEraser: Coupled-Neuron Orthogonal Projection for Concept Erasure

Chuancheng Shi,Wenhua Wu,Fei Shen,Xiaogang Zhu,Kun Hu,Zhiyong Wang

Main category: cs.CV

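TL;DR: This paper proposes OrthoEraser, which uses sparse autoencoders (SAE) for high-resolution feature disentanglement and performs concept erasure via an analytical orthogonalization projection, removing harmful content while preserving the integrity of the generative manifold and clearly outperforming existing methods.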

Details

Motivation: Existing concept erasure methods for text-to-image models tend to damage benign attributes when suppressing sensitive neurons, because sensitive and benign semantics are superposed non-orthogonally and entangled within shared activation subspaces. Method: OrthoEraser first uses a sparse autoencoder (SAE) to decompose dense activations and separate out sensitive neurons; it then applies coupled neuron detection to identify non-sensitive features that are vulnerable to intervention; finally, an analytical gradient orthogonalization strategy projects the erasure vectors onto the null space of the coupled neurons, orthogonally decoupling sensitive concepts from the critical benign subspace. Result: Experiments show that OrthoEraser achieves high erasure precision, effectively removes harmful content while preserving the integrity of the generative manifold, and clearly outperforms SOTA baselines on safety tasks. Conclusion: Through its orthogonal projection mechanism, OrthoEraser achieves finer-grained and safer concept erasure, protecting model safety without damaging benign semantics. Abstract: Text-to-image (T2I) models face significant safety risks from adversarial induction, yet current concept erasure methods often cause collateral damage to benign attributes when suppressing selected neurons entirely. This occurs because sensitive and benign semantics exhibit non-orthogonal superposition, sharing activation subspaces where their respective vectors are inherently entangled. To address this issue, we propose OrthoEraser, which leverages sparse autoencoders (SAE) to achieve high-resolution feature disentanglement and subsequently redefines erasure as an analytical orthogonalization projection that preserves the benign manifold's invariance. OrthoEraser first employs SAE to decompose dense activations and segregate sensitive neurons. It then uses coupled neuron detection to identify non-sensitive features vulnerable to intervention. The key novelty lies in an analytical gradient orthogonalization strategy that projects erasure vectors onto the null space of the coupled neurons. This orthogonally decouples the sensitive concepts from the identified critical benign subspace, effectively preserving non-sensitive semantics. Experimental results on safety demonstrate that OrthoEraser achieves high erasure precision, effectively removing harmful content while preserving the integrity of the generative manifold, and significantly outperforming SOTA baselines. This paper contains results of unsafe models.
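
Projecting a vector onto the null space of a set of protected directions is standard linear algebra and can be sketched as below. This shows the generic operation only, assuming the coupled neurons can be summarised as rows of a matrix `coupled`; how OrthoEraser actually constructs those directions and applies the projection to gradients is not specified in the abstract, and `project_onto_null_space` is a hypothetical name.

```python
import numpy as np

def project_onto_null_space(erasure_vec, coupled):
    """Remove from erasure_vec every component lying in the row space of `coupled`.

    coupled: (k, d) matrix whose rows are the directions to protect.
    The result v satisfies coupled @ v == 0 (up to floating-point error).
    """
    # pinv(coupled) @ coupled is the orthogonal projector onto the row space.
    row_space_projector = np.linalg.pinv(coupled) @ coupled
    return erasure_vec - row_space_projector @ erasure_vec

# Two protected directions along the first two axes of a 3-D space.
coupled = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0]])
erasure_vec = np.array([1.0, 2.0, 3.0])
safe_vec = project_onto_null_space(erasure_vec, coupled)

# A vector already orthogonal to the protected directions is left unchanged.
already_safe = project_onto_null_space(np.array([0.0, 0.0, 5.0]), coupled)
```

In the toy case only the third coordinate survives, so an edit along `safe_vec` leaves any readout along the protected directions unchanged to first order.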

[80] ActiveFreq: Integrating Active Learning and Frequency Domain Analysis for Interactive Segmentation

Lijun Guo,Qian Zhou,Zidi Shi,Hua Zou,Gang Ke

Main category: cs.CV

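TL;DR: This paper proposes the ActiveFreq framework, which combines active learning with frequency domain analysis: the AcSelect module selects the most informative mislabeled regions, and the FreqFormer backbone adds a Fourier transform to enrich feature extraction, reducing manual interaction while improving the accuracy of interactive medical image segmentation.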

Details

Motivation: Existing interactive segmentation methods do not make full use of the knowledge in user interactions and ignore frequency domain information, treating all mislabeled regions alike, which limits efficiency and performance. Method: The ActiveFreq framework comprises the AcSelect module (which uses active learning to select key mislabeled regions) and the FreqFormer backbone (which adds a Fourier transform module for joint spatial and frequency domain feature extraction). Result: NoC@90 reaches 3.74 on ISIC-2017 and 9.27 on OAI-ZIB, improvements of 23.5% and 12.8% over the previous SOTA; with only two clicks, mIoU reaches 85.29% and 75.76%. Conclusion: ActiveFreq effectively combines active learning with frequency domain modeling, markedly reducing the need for manual intervention while improving the accuracy and robustness of interactive medical image segmentation. Abstract: Interactive segmentation is commonly used in medical image analysis to obtain precise, pixel-level labeling, typically involving iterative user input to correct mislabeled regions. However, existing approaches often fail to fully utilize user knowledge from interactive inputs and achieve comprehensive feature extraction. Specifically, these methods tend to treat all mislabeled regions equally, selecting them randomly for refinement without evaluating each region's potential impact on segmentation quality. Additionally, most models rely solely on spatial domain features, overlooking frequency domain information that could enhance feature extraction and improve performance. To address these limitations, we propose ActiveFreq, a novel interactive segmentation framework that integrates active learning and frequency domain analysis to minimize human intervention while achieving high-quality labeling. ActiveFreq introduces AcSelect, an autonomous module that prioritizes the most informative mislabeled regions, ensuring maximum performance gain from each click. Moreover, we develop FreqFormer, a segmentation backbone incorporating a Fourier transform module to map features from the spatial to the frequency domain, enabling richer feature extraction. Evaluations on the ISIC-2017 and OAI-ZIB datasets demonstrate that ActiveFreq achieves high performance with reduced user interaction, achieving 3.74 NoC@90 on ISIC-2017 and 9.27 NoC@90 on OAI-ZIB, with 23.5% and 12.8% improvements over previous best results, respectively. Under minimal input conditions, such as two clicks, ActiveFreq reaches mIoU scores of 85.29% and 75.76% on ISIC-2017 and OAI-ZIB, highlighting its efficiency and accuracy in interactive medical segmentation.
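
NoC@90 (number of clicks needed to reach 90% IoU) is a common interactive segmentation metric, and a minimal computation might look like the sketch below. The click cap of 20 and the convention of charging the cap to samples that never reach the threshold are assumptions; the abstract does not state which protocol was used, and the IoU curves shown are made up for illustration.

```python
def number_of_clicks(iou_after_each_click, threshold=0.90, max_clicks=20):
    """Smallest click count at which IoU reaches the threshold.

    Samples that never reach it within max_clicks are charged max_clicks
    (an assumed convention).
    """
    for clicks, iou in enumerate(iou_after_each_click[:max_clicks], start=1):
        if iou >= threshold:
            return clicks
    return max_clicks

def noc_at(iou_curves, threshold=0.90, max_clicks=20):
    """Mean number of clicks over a dataset of per-sample IoU curves."""
    counts = [number_of_clicks(curve, threshold, max_clicks)
              for curve in iou_curves]
    return sum(counts) / len(counts)

# Hypothetical IoU values after click 1, 2, 3, ... for three samples.
curves = [
    [0.50, 0.80, 0.92],   # reaches 90% at click 3
    [0.95],               # reaches 90% at click 1
    [0.10, 0.20],         # never reaches 90%, charged the cap
]
noc90 = noc_at(curves, threshold=0.90, max_clicks=5)   # (3 + 1 + 5) / 3
```

Lower values are better: a method that picks more informative regions per click reaches the threshold sooner and so lowers the average.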

[81] Gen-Fab: A Variation-Aware Generative Model for Predicting Fabrication Variations in Nanophotonic Devices

Rambod Azimi,Yuri Grinberg,Dan-Xia Xu,Odile Liboiron-Ladouceur

Main category: cs.CV

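TL;DR: This paper proposes Gen-Fab, a Pix2Pix-based conditional generative adversarial network (cGAN) for predicting nanometer-scale process variations in silicon photonic device fabrication. It takes a GDS layout as input and outputs SEM-like images, models uncertainty, and achieves high accuracy and strong generalization.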

Details

Motivation: Fabrication deviations in silicon photonic devices (such as over-etching, under-etching, and corner rounding) cause performance fluctuations, and these deviations are non-uniform and geometry-dependent, so digital-twin methods that accurately model fabrication uncertainty are needed. Method: Proposes Gen-Fab, a conditional GAN built on the Pix2Pix architecture; it takes a GDS layout as the conditioning input and injects a latent noise vector to realize a one-to-many mapping, generating diverse high-resolution SEM-like images; it is compared against three baselines: a deterministic U-Net, an MC-Dropout U-Net, and a U-Net ensemble. Result: On an out-of-distribution test set, Gen-Fab attains the highest IoU (89.8%), clearly ahead of every baseline; its KL divergence and Wasserstein distance are lower, so it matches the distribution of real fabrication outcomes more closely; a distribution-shift analysis shows good generalization to unseen geometries. Conclusion: Gen-Fab provides the first digital-twin solution for silicon photonic fabrication that generates diverse, high-fidelity, uncertainty-aware predictions, combining high predictive accuracy with strong robustness, and it may advance photonic design automation and design-process co-optimization. Abstract: Silicon photonic devices often exhibit fabrication-induced variations such as over-etching, under-etching, and corner rounding, which can significantly alter device performance. These variations are non-uniform and are influenced by feature size and shape. Accurate digital twins are therefore needed to predict the range of possible fabricated outcomes for a given design. In this paper, we introduce Gen-Fab, a conditional generative adversarial network (cGAN) based on Pix2Pix to predict and model uncertainty in photonic fabrication outcomes. The proposed method takes a design layout (in GDS format) as input and produces diverse high-resolution predictions similar to scanning electron microscope (SEM) images of fabricated devices, capturing the range of process variations at the nanometer scale. To enable one-to-many mapping, we inject a latent noise vector at the model bottleneck. We compare Gen-Fab against three baselines: (1) a deterministic U-Net predictor, (2) an inference-time Monte Carlo Dropout U-Net, and (3) an ensemble of varied U-Nets. Evaluations on an out-of-distribution dataset of fabricated photonic test structures demonstrate that Gen-Fab outperforms all baselines in both accuracy and uncertainty modeling. An additional distribution shift analysis further confirms its strong generalization to unseen fabrication geometries. Gen-Fab achieves the highest intersection-over-union (IoU) score of 89.8%, outperforming the deterministic U-Net (85.3%), the MC-Dropout U-Net (83.4%), and varying U-Nets (85.8%). 
It also better aligns with the distribution of real fabrication outcomes, achieving lower Kullback-Leibler divergence and Wasserstein distance.
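
For readers unfamiliar with the headline metric, the intersection-over-union (IoU) score quoted above can be sketched as follows. This is a minimal illustration on flattened binary masks, not the authors' evaluation code; the function name `iou` and the convention of returning 1.0 for two empty masks are assumptions made here.

```python
def iou(pred, target):
    """Intersection-over-union of two flattened binary masks (0/1 values)."""
    # Pixels set in both masks.
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    # Pixels set in at least one mask.
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


predicted = [1, 1, 0, 0]
fabricated = [1, 0, 1, 0]
print(iou(predicted, fabricated))  # one shared pixel out of three covered
```

A score of 89.8% therefore means that, on average, the predicted and real device outlines share about nine tenths of the area covered by either of them.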

[82] Manifold-Optimal Guidance: A Unified Riemannian Control View of Diffusion Guidance

Zexi Jia,Pengcheng Luo,Zhengyao Fang,Jinchao Zhang,Jie Zhou

Main category: cs.CV

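TL;DR: This paper proposes the Manifold-Optimal Guidance (MOG) framework, which uses Riemannian geometry to correct the off-manifold drift caused by Classifier-Free Guidance's extrapolation in Euclidean space, and introduces Auto-MOG for adaptive scheduling of guidance strength, improving generation quality without retraining or extra computational overhead.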

Details

Motivation: At high guidance scales, standard Classifier-Free Guidance (CFG) tends to cause oversaturation, texture artifacts, and structural collapse; the root cause is that it performs Euclidean extrapolation in ambient space, pushing sampling trajectories off the high-density data manifold. Method: Recasts guidance as a local optimal control problem and derives a geometry-aware Riemannian update; further proposes Auto-MOG, a dynamic energy-balancing schedule that adaptively adjusts guidance strength. Result: MOG clearly outperforms baselines in fidelity and conditional alignment with almost no added computational overhead; Auto-MOG removes the need for manual tuning. Conclusion: MOG reframes diffusion guidance from a manifold-optimization perspective, offering a theoretically better-grounded and practically more robust paradigm for high-fidelity controllable generation. Abstract: Classifier-Free Guidance (CFG) serves as the de facto control mechanism for conditional diffusion, yet high guidance scales notoriously induce oversaturation, texture artifacts, and structural collapse. We attribute this failure to a geometric mismatch: standard CFG performs Euclidean extrapolation in ambient space, inadvertently driving sampling trajectories off the high-density data manifold. To resolve this, we present Manifold-Optimal Guidance (MOG), a framework that reformulates guidance as a local optimal control problem. MOG yields a closed-form, geometry-aware Riemannian update that corrects off-manifold drift without requiring retraining. Leveraging this perspective, we further introduce Auto-MOG, a dynamic energy-balancing schedule that adaptively calibrates guidance strength, effectively eliminating the need for manual hyperparameter tuning. Extensive validation demonstrates that MOG yields superior fidelity and alignment compared to baselines, with virtually no added computational overhead.
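
The "Euclidean extrapolation" that the abstract criticizes is the standard CFG combination rule, which can be sketched as below. This shows only the baseline that MOG modifies, not MOG's Riemannian update, whose closed form the abstract does not give; the function name `cfg_combine` and the toy two-component vectors are illustrative assumptions.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Standard CFG: move from the unconditional prediction toward the
    conditional one along a straight line, by a factor `scale`."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


eps_u = [0.0, 1.0]  # toy unconditional noise prediction
eps_c = [1.0, 1.0]  # toy conditional noise prediction

print(cfg_combine(eps_u, eps_c, 1.0))  # scale 1 returns the conditional prediction
print(cfg_combine(eps_u, eps_c, 3.0))  # scale > 1 overshoots past it
```

With a scale above 1 the result lies outside the segment joining the two predictions, and nothing in the rule keeps it near the data manifold, which is the geometric mismatch the abstract describes.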

Chenchen Zhao,Jianhuan Zhuo,Muxi Chen,Zhaohua Zhang,Wenyu Jiang,Tianwen Jiang,Qiuyong Xiao,Jihong Zhang,Qiang Xu

Main category: cs.CV

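TL;DR: This paper proposes FBCIR, a method for explaining attention (focus) imbalance in multi-modal models on composed image retrieval (CIR), and designs a data augmentation strategy targeting hard negatives to improve robustness and performance in challenging scenarios.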

Details

Motivation: Existing CIR models lose accuracy when faced with hard negatives that are semantically aligned with the query image or text; the authors attribute this to an attention imbalance between the image and text modalities. Method: Proposes FBCIR, a multi-modal attention interpretation method that identifies the image and text components most critical to a retrieval decision; building on this analysis, constructs a data augmentation workflow targeting hard negatives to encourage balanced cross-modal reasoning. Result: FBCIR confirms that attention imbalance is widespread in existing CIR models, especially with hard negatives; the proposed augmentation strategy clearly improves retrieval accuracy in challenging scenarios across multiple CIR models without hurting performance on standard benchmarks. Conclusion: Attention imbalance is a key factor limiting the robustness of CIR models; the FBCIR interpretation method and the targeted data augmentation together offer a new approach to diagnosing and improving CIR models. Abstract: Composed image retrieval (CIR) requires multi-modal models to jointly reason over visual content and semantic modifications presented in text-image input pairs. While current CIR models achieve strong performance on common benchmark cases, their accuracy often degrades in more challenging scenarios where negative candidates are semantically aligned with the query image or text. In this paper, we attribute this degradation to focus imbalances, where models disproportionately attend to one modality while neglecting the other. To validate this claim, we propose FBCIR, a multi-modal focus interpretation method that identifies the most crucial visual and textual input components to a model's retrieval decisions. Using FBCIR, we report that focus imbalances are prevalent in existing CIR models, especially under hard negative settings. Building on the analyses, we further propose a CIR data augmentation workflow that augments existing CIR datasets with curated hard negatives designed to encourage balanced cross-modal reasoning. Extensive experiments across multiple CIR models demonstrate that the proposed augmentation consistently improves performance in challenging cases, while maintaining their capabilities on standard benchmarks. Together, our interpretation method and data augmentation workflow provide a new perspective on CIR model diagnosis and robustness improvements.

Shuo Jiang,Gaojia Zhang,Min Tan,Yufei Yin,Gang Pan

Main category: cs.CV

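TL;DR: This paper proposes a unified framework for unsupervised camouflaged object detection (UCOD) that improves pseudo-label reliability and feature fidelity through a multi-cue native perception module, pseudo-label evolution fusion, spectral tensor attention fusion, and local pseudo-label refinement, reaching state-of-the-art performance on multiple datasets.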

Details

Motivation: Existing UCOD methods are limited by the high similarity between targets and backgrounds and by noisy pseudo-labels, which lead to boundary overflow, structural ambiguity, or loss of detail. Method: Proposes a Multi-Cue Native Perception module that fuses low-level texture with mid-level semantics; Pseudo-Label Evolution Fusion, which combines teacher-student interaction with depthwise separable convolution for semantic denoising; Spectral Tensor Attention Fusion, which balances semantic and structural information through spectral aggregation of multi-layer attention maps; and Local Pseudo-Label Refinement, which uses attention diversity to refine local detail and boundaries. Result: Achieves state-of-the-art performance on multiple UCOD datasets, with clear gains in detail perception, robustness of boundary alignment, and generalization under complex camouflage. Conclusion: The proposed framework effectively couples pseudo-label refinement with feature fidelity, offering a reliable and fine-grained new paradigm for unsupervised camouflaged object detection. Abstract: Unsupervised Camouflaged Object Detection (UCOD) remains a challenging task due to the high intrinsic similarity between target objects and their surroundings, as well as the reliance on noisy pseudo-labels that hinder fine-grained texture learning. While existing refinement strategies aim to alleviate label noise, they often overlook intrinsic perceptual cues, leading to boundary overflow and structural ambiguity. In contrast, learning without pseudo-label guidance yields coarse features with significant detail loss. To address these issues, we propose a unified UCOD framework that enhances both the reliability of pseudo-labels and the fidelity of features. Our approach introduces the Multi-Cue Native Perception module, which extracts intrinsic visual priors by integrating low-level texture cues with mid-level semantics, enabling precise alignment between masks and native object information. Additionally, Pseudo-Label Evolution Fusion intelligently refines labels through teacher-student interaction and utilizes depthwise separable convolution for efficient semantic denoising. It also incorporates Spectral Tensor Attention Fusion to effectively balance semantic and structural information through compact spectral aggregation across multi-layer attention maps. Finally, Local Pseudo-Label Refinement plays a pivotal role in local detail optimization by leveraging attention diversity to restore fine textures and enhance boundary fidelity. 
Extensive experiments on multiple UCOD datasets demonstrate that our method achieves state-of-the-art performance, characterized by superior detail perception, robust boundary alignment, and strong generalization under complex camouflage scenarios.

[85] MDS-VQA: Model-Informed Data Selection for Video Quality Assessment

Jian Zou,Xiaoyu Xu,Zhihua Wang,Yilin Wang,Balu Adsumilli,Kede Ma

Main category: cs.CV

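TL;DR: This paper proposes MDS-VQA, a model-informed data selection mechanism that, under a limited labeling budget, selects unlabeled videos that are both difficult for the base VQA model and diverse in content for active fine-tuning, substantially improving cross-domain VQA performance.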

Details

Motivation: Research on learning-based video quality assessment (VQA) is held back by a disconnect between model design and dataset construction: model-centric methods iterate on fixed benchmarks, while data-centric methods lack systematic sample selection that targets the current model's weaknesses. Method: Proposes MDS-VQA, a model-informed data selection mechanism: (1) a failure predictor trained with a ranking objective estimates sample difficulty; (2) deep semantic video features measure content diversity; (3) a greedy procedure jointly optimizes difficulty and diversity under a labeling budget constraint. Result: Validated across multiple VQA datasets and models; active fine-tuning on a selected subset of only 5% raises mean SRCC from 0.651 to 0.722 and achieves the top gMAD rank, indicating strong adaptation and generalization. Conclusion: MDS-VQA enables data-efficient selection driven by model capability, bridging the gap between model and data co-evolution and offering a new paradigm for data-efficient VQA model optimization. Abstract: Learning-based video quality assessment (VQA) has advanced rapidly, yet progress is increasingly constrained by a disconnect between model design and dataset curation. Model-centric approaches often iterate on fixed benchmarks, while data-centric efforts collect new human labels without systematically targeting the weaknesses of existing VQA models. Here, we describe MDS-VQA, a model-informed data selection mechanism for curating unlabeled videos that are both difficult for the base VQA model and diverse in content. Difficulty is estimated by a failure predictor trained with a ranking objective, and diversity is measured using deep semantic video features, with a greedy procedure balancing the two under a constrained labeling budget. Experiments across multiple VQA datasets and models demonstrate that MDS-VQA identifies diverse, challenging samples that are particularly informative for active fine-tuning. With only a 5% selected subset per target domain, the fine-tuned model improves mean SRCC from 0.651 to 0.722 and achieves the top gMAD rank, indicating strong adaptation and generalization.
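
The greedy balance between difficulty and diversity can be illustrated with a small sketch. The abstract does not give the exact objective, so everything here is an assumption made for illustration: the function name `greedy_select`, the additive score, the `trade_off` weight, and the use of distance to the nearest already-selected sample as the diversity term.

```python
import math


def greedy_select(difficulty, features, budget, trade_off=1.0):
    """Pick `budget` sample indices, one at a time, maximizing
    difficulty + trade_off * (distance to the nearest selected sample)."""
    selected = []
    candidates = set(range(len(difficulty)))
    while candidates and len(selected) < budget:

        def gain(i):
            if not selected:
                return difficulty[i]  # first pick: difficulty only
            nearest = min(math.dist(features[i], features[j]) for j in selected)
            return difficulty[i] + trade_off * nearest

        # Iterate in sorted order so ties break deterministically.
        best = max(sorted(candidates), key=gain)
        selected.append(best)
        candidates.remove(best)
    return selected


difficulty = [0.9, 0.85, 0.2]                      # hypothetical failure-predictor scores
features = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]]    # hypothetical semantic features
print(greedy_select(difficulty, features, budget=2))                 # [0, 2]
print(greedy_select(difficulty, features, budget=2, trade_off=0.0))  # [0, 1]
```

In this toy case, samples 0 and 1 are near-duplicates, so the diversity term steers the second pick to sample 2 even though it is easier; setting `trade_off` to zero reduces the procedure to picking the hardest samples.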

[86] Mobile-GS: Real-time Gaussian Splatting for Mobile Devices

Xiaobiao Du,Yida Wang,Kun Zhan,Xin Yu

Main category: cs.CV

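TL;DR: This paper proposes Mobile-GS, a real-time 3D Gaussian Splatting rendering method for mobile devices. Through depth-aware order-independent rendering, neural view-dependent enhancement, spherical harmonics distillation, neural vector quantization, and contribution-based pruning, it substantially reduces compute and storage costs and delivers high-quality real-time rendering on edge devices.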

Details

Motivation: 3D Gaussian Splatting (3DGS) renders at high quality but is computationally intensive and storage-heavy, which makes it hard to deploy on resource-constrained mobile devices. Method: (1) Proposes depth-aware order-independent rendering to remove the costly Gaussian depth sorting in alpha blending; (2) introduces a neural view-dependent enhancement strategy to model view-dependent effects; (3) compresses the representation with first-order spherical harmonics distillation, neural vector quantization, and contribution-driven Gaussian pruning. Result: Achieves real-time rendering on mobile devices and a compact model size while maintaining high visual quality; experiments confirm its efficiency and practicality. Conclusion: Mobile-GS effectively balances rendering quality, speed, and model size, offering a practical route to deploying 3DGS on mobile and edge devices. Abstract: 3D Gaussian Splatting (3DGS) has emerged as a powerful representation for high-quality rendering across a wide range of applications. However, its high computational demands and large storage costs pose significant challenges for deployment on mobile devices. In this work, we propose a mobile-tailored real-time Gaussian Splatting method, dubbed Mobile-GS, enabling efficient inference of Gaussian Splatting on edge devices. Specifically, we first identify alpha blending as the primary computational bottleneck, since it relies on the time-consuming Gaussian depth sorting process. To solve this issue, we propose a depth-aware order-independent rendering scheme that eliminates the need for sorting, thereby substantially accelerating rendering. Although this order-independent rendering improves rendering speed, it may introduce transparency artifacts in regions with overlapping geometry due to the lack of rendering order. To address this problem, we propose a neural view-dependent enhancement strategy, enabling more accurate modeling of view-dependent effects conditioned on viewing direction, 3D Gaussian geometry, and appearance attributes. In this way, Mobile-GS can achieve both high-quality and real-time rendering. Furthermore, to facilitate deployment on memory-constrained mobile platforms, we also introduce first-order spherical harmonics distillation, a neural vector quantization technique, and a contribution-based pruning strategy to reduce the number of Gaussian primitives and compress the 3D Gaussian representation with the assistance of neural networks. 
Extensive experiments demonstrate that our proposed Mobile-GS achieves real-time rendering and compact model size while preserving high visual quality, making it well-suited for mobile applications.

[87] Risk-Controllable Multi-View Diffusion for Driving Scenario Generation

Hongyi Lin,Wenxiu Shi,Heye Huang,Dingyi Zhuang,Song Zhang,Yang Liu,Xiaobo Qu,Jinhua Zhao

Main category: cs.CV

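TL;DR: This paper proposes RiskMV-DPO, a physically informed, risk-controllable method for multi-view driving scenario generation. It combines target risk levels with physical modeling to generate high-risk dynamic trajectories and adds geometry-appearance alignment and region-aware direct preference optimization (RA-DPO), substantially improving the quality of generated long-tail hazardous scenarios and 3D detection performance.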

Details

Motivation: Long-tail risky scenarios are rare in real-world data and hard to cover by manual design; existing generative methods treat risk as an after-the-fact label and struggle to keep multi-view scenes geometrically consistent. Method: Proposes the RiskMV-DPO framework: (1) risk-controllable trajectory generation driven by physical modeling, serving as geometric anchors for diffusion-based video generation; (2) a geometry-appearance alignment module that ensures spatial-temporal consistency; (3) region-aware direct preference optimization (RA-DPO) with motion-aware masking to focus learning on dynamic regions. Result: On nuScenes, 3D detection mAP rises from 18.17 to 30.50 and FID drops to 15.70; the method can freely generate diverse long-tail, high-risk multi-view scenarios with state-of-the-art visual quality. Conclusion: RiskMV-DPO shifts world models from passive environment prediction to proactive, risk-controllable scene synthesis, providing a scalable methodology and toolchain for the safe development of embodied intelligence. Abstract: Generating safety-critical driving scenarios is crucial for evaluating and improving autonomous driving systems, but long-tail risky situations are rarely observed in real-world data and difficult to specify through manual scenario design. Existing generative approaches typically treat risk as an after-the-fact label and struggle to maintain geometric consistency in multi-view driving scenes. We present RiskMV-DPO, a general and systematic pipeline for physically-informed, risk-controllable multi-view scenario generation. By integrating target risk levels with physically-grounded risk modeling, we autonomously synthesize diverse and high-stakes dynamic trajectories that serve as explicit geometric anchors for a diffusion-based video generator. To ensure spatial-temporal coherence and geometric fidelity, we introduce a geometry-appearance alignment module and a region-aware direct preference optimization (RA-DPO) strategy with motion-aware masking to focus learning on localized dynamic regions. Experiments on the nuScenes dataset show that RiskMV-DPO can freely generate a wide spectrum of diverse long-tail scenarios while maintaining state-of-the-art visual quality, improving 3D detection mAP from 18.17 to 30.50 and reducing FID to 15.70. Our work shifts the role of world models from passive environment prediction to proactive, risk-controllable synthesis, providing a scalable toolchain for the safety-oriented development of embodied intelligence.

[88] ReHARK: Refined Hybrid Adaptive RBF Kernels for Robust One-Shot Vision-Language Adaptation

Md Jahidul Islam

Main category: cs.CV

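TL;DR: This paper proposes the ReHARK framework, which introduces global proximal regularization in a reproducing kernel Hilbert space to resolve the stability-plasticity dilemma that large models face in the one-shot setting, substantially improving one-shot vision-language transfer performance.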

Details

Motivation: Large vision-language models such as CLIP struggle to balance stability and plasticity on downstream tasks with extremely little data, especially one-shot tasks; existing training-free methods such as Tip-Adapter suffer from boundary bias and a lack of global structural regularization. Method: Proposes ReHARK, a training-free framework with four stages: (1) hybrid prior construction (fusing CLIP textual knowledge with GPT-3 and visual prototypes); (2) support set augmentation (generating intermediate samples by cross-modal interpolation); (3) adaptive distribution rectification (aligning test features with the statistics of the augmented support set); (4) multi-scale RBF kernels (an ensemble of kernels at several scales to model complex feature geometry). Result: Validated on 11 benchmarks, with an average one-shot accuracy of 65.83%, a new state of the art. Conclusion: Through global RKHS regularization and a multistage refinement design, ReHARK effectively eases the stability-plasticity conflict in one-shot VLM adaptation and offers a new paradigm for training-free few-shot transfer. Abstract: The adaptation of large-scale Vision-Language Models (VLMs) like CLIP to downstream tasks with extremely limited data, specifically in the one-shot regime, is often hindered by a significant "Stability-Plasticity" dilemma. While efficient caching mechanisms have been introduced by training-free methods such as Tip-Adapter, these approaches often function as local Nadaraya-Watson estimators. Such estimators are characterized by inherent boundary bias and a lack of global structural regularization. In this paper, ReHARK (Refined Hybrid Adaptive RBF Kernels) is proposed as a synergistic training-free framework that reinterprets few-shot adaptation through global proximal regularization in a Reproducing Kernel Hilbert Space (RKHS). A multistage refinement pipeline is introduced, consisting of: (1) Hybrid Prior Construction, where zero-shot textual knowledge from CLIP and GPT-3 is fused with visual class prototypes to form a robust semantic-visual anchor; (2) Support Set Augmentation (Bridging), where intermediate samples are generated to smooth the transition between visual and textual modalities; (3) Adaptive Distribution Rectification, where test feature statistics are aligned with the augmented support set to mitigate domain shifts; and (4) Multi-Scale RBF Kernels, where an ensemble of kernels is employed to capture complex feature geometries across diverse scales. Superior stability and accuracy are demonstrated through extensive experiments on 11 diverse benchmarks. 
A new state-of-the-art for one-shot adaptation is established by ReHARK, which achieves an average accuracy of 65.83%, significantly outperforming existing baselines. Code is available at https://github.com/Jahid12012021/ReHARK.
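
As a rough picture of stage (4), an ensemble of RBF kernels at several bandwidths can be sketched as below. The abstract does not specify how ReHARK weights or chooses its kernels, so the uniform average, the bandwidth values, and the function names `rbf` and `multi_scale_rbf` are assumptions for illustration only; the linked repository is the authoritative source.

```python
import math


def rbf(x, y, bandwidth):
    """Gaussian RBF kernel: exp(-||x - y||^2 / (2 * bandwidth^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def multi_scale_rbf(x, y, bandwidths=(0.5, 1.0, 2.0)):
    """Uniform average of RBF kernels at several bandwidths (assumed weighting)."""
    return sum(rbf(x, y, s) for s in bandwidths) / len(bandwidths)


query = [0.2, 0.4]
support = [0.3, 0.1]
print(multi_scale_rbf(query, query))    # identical features give similarity 1.0
print(multi_scale_rbf(query, support))  # similarity decays with distance
```

Narrow bandwidths react only to very close neighbours while wide ones retain a signal from distant samples, so the average responds across several feature scales at once.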

[89] Mango-GS: Enhancing Spatio-Temporal Consistency in Dynamic Scenes Reconstruction using Multi-Frame Node-Guided 4D Gaussian Splatting

Tingxuan Huang,Haowei Zhu,Jun-hai Yong,Hao Pan,Bin Wang

Main category: cs.CV

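TL;DR: Mango-GS is a multi-frame, node-guided method for high-fidelity 4D scene reconstruction. It uses a temporal Transformer to model motion dependencies across a short window of frames and sparse control nodes to estimate deformations efficiently and consistently, substantially improving reconstruction quality and real-time rendering performance for dynamic 3D scenes.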

Details

Motivation: Existing Gaussian-splatting methods for dynamic scene modeling mostly optimize per frame, which tends to overfit instantaneous states and fails to capture the underlying motion dynamics, resulting in poor temporal consistency. Method: Proposes the Mango-GS framework: a temporal Transformer models inter-frame motion dependencies within a short window; sparse control nodes (each with a decoupled canonical position and latent code) act as semantic anchors that stabilize motion propagation; training is end-to-end with an input masking strategy and two multi-frame losses. Result: Reaches state-of-the-art reconstruction quality on multiple dynamic scene datasets and supports real-time rendering, enabling high-fidelity reconstruction and interactive rendering. Conclusion: Through node-guided multi-frame modeling, Mango-GS effectively balances fidelity, temporal consistency, and efficiency in dynamic scene reconstruction, offering a new paradigm for 4D content generation. Abstract: Reconstructing dynamic 3D scenes with photorealistic detail and strong temporal coherence remains a significant challenge. Existing Gaussian splatting approaches for dynamic scene modeling often rely on per-frame optimization, which can overfit to instantaneous states instead of capturing underlying motion dynamics. To address this, we present Mango-GS, a multi-frame, node-guided framework for high-fidelity 4D reconstruction. Mango-GS leverages a temporal Transformer to model motion dependencies within a short window of frames, producing temporally consistent deformations. For efficiency, temporal modeling is confined to a sparse set of control nodes. Each node is represented by a decoupled canonical position and a latent code, providing a stable semantic anchor for motion propagation and preventing correspondence drift under large motion. Our framework is trained end-to-end, enhanced by an input masking strategy and two multi-frame losses to improve robustness. Extensive experiments demonstrate that Mango-GS achieves state-of-the-art reconstruction quality and real-time rendering speed, enabling high-fidelity reconstruction and interactive rendering of dynamic scenes.

[90] PCA-Enhanced Probabilistic U-Net for Effective Ambiguous Medical Image Segmentation

Xiangyu Li,Chenglin Wang,Qiantong Shen,Fanding Li,Wei Wang,Kuanquan Wang,Yi Shen,Baochun Zhao,Gongning Luo

Main category: cs.CV

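TL;DR: This paper proposes a PCA-enhanced Probabilistic U-Net (PEP U-Net) for uncertainty modeling in ambiguous medical image segmentation. It refines the latent space with PCA dimensionality reduction and inverse-PCA reconstruction, balancing segmentation accuracy and predictive diversity.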

Details

Motivation: To address the redundancy of high-dimensional latent spaces and the limited expressiveness of a single posterior network in existing cVAE methods, and to better model the inherent uncertainty in medical image segmentation. Method: Proposes the PCA-enhanced Probabilistic U-Net (PEP U-Net): PCA is applied in the posterior network for dimensionality reduction to cut redundancy and improve computational efficiency, and an inverse PCA reconstructs critical information to strengthen the latent space's representational capacity. Result: Compared with conventional generative models, PEP U-Net retains the ability to generate diverse segmentation hypotheses while achieving higher segmentation accuracy and a better balance with predictive variability. Conclusion: PEP U-Net effectively improves the performance of generative models on ambiguous medical image segmentation and offers a new approach to uncertainty modeling. Abstract: Ambiguous Medical Image Segmentation (AMIS) is important for addressing the challenges of inherent uncertainties from image ambiguities, noise, and subjective annotations. Existing conditional variational autoencoder (cVAE)-based methods effectively capture uncertainty but face limitations including redundancy in high-dimensional latent spaces and limited expressiveness of single posterior networks. To overcome these issues, we introduce a novel PCA-Enhanced Probabilistic U-Net (PEP U-Net). Our method effectively incorporates Principal Component Analysis (PCA) for dimensionality reduction in the posterior network to mitigate redundancy and improve computational efficiency. Additionally, we further employ an inverse PCA operation to reconstruct critical information, enhancing the latent space's representational capacity. Compared to conventional generative models, our method preserves the ability to generate diverse segmentation hypotheses while achieving a superior balance between segmentation accuracy and predictive variability, thereby advancing the performance of generative modeling in medical image segmentation.

[91] MANSION: Multi-floor lANguage-to-3D Scene generatIOn for loNg-horizon tasks

Lirong Che,Shuo Wen,Shan Huang,Chuang Wang,Yuzhe Yang,Gregory Dudek,Xueqian Wang,Jian Su

Main category: cs.CV

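TL;DR: This paper proposes the MANSION framework for generating multi-floor, building-scale 3D environments that support cross-floor, long-horizon embodied tasks as found in the real world. It releases the MansionWorld dataset of more than 1,000 buildings and a semantic scene editing agent, and shows that current state-of-the-art agents degrade sharply in this setting.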

Details

Motivation: Existing embodied-AI benchmarks are confined to single-floor indoor environments and cannot reflect the complex spatial reasoning required by multi-floor, long-horizon tasks in the real world. Method: Proposes MANSION, a language-driven framework that models vertical structural constraints to generate navigable, human-friendly whole-building 3D environments; builds the MansionWorld dataset and designs a Task-Semantic Scene Editing Agent that customizes scenes from open-vocabulary commands. Result: Releases the MansionWorld dataset of more than 1,000 diverse buildings (such as hospitals and offices) together with the editing agent; benchmarking shows that current state-of-the-art agents degrade sharply on cross-floor, long-horizon tasks. Conclusion: MANSION provides a critical and more realistic testbed for evaluating and advancing next-generation spatial reasoning and planning. Abstract: Real-world robotic tasks are long-horizon and often span multiple floors, demanding rich spatial reasoning. However, existing embodied benchmarks are largely confined to single-floor in-house environments, failing to reflect the complexity of real-world tasks. We introduce MANSION, the first language-driven framework for generating building-scale, multi-floor 3D environments. Being aware of vertical structural constraints, MANSION generates realistic, navigable whole-building structures with diverse, human-friendly scenes, enabling the development and evaluation of cross-floor long-horizon tasks. Building on this framework, we release MansionWorld, a dataset of over 1,000 diverse buildings ranging from hospitals to offices, alongside a Task-Semantic Scene Editing Agent that customizes these environments using open-vocabulary commands to meet specific user needs. Benchmarking reveals that state-of-the-art agents degrade sharply in our settings, establishing MANSION as a critical testbed for the next generation of spatial reasoning and planning.

[92] Enhancing Image Aesthetics with Dual-Conditioned Diffusion Models Guided by Multimodal Perception

Xinyu Nan,Ning Wang,Yuyao Zhai,Mei Yang

Main category: cs.CV

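TL;DR: This paper proposes DIAE, a diffusion-based, dual-supervised method for image aesthetic enhancement. It uses Multimodal Aesthetic Perception (MAP) to turn ambiguous aesthetic instructions into explicit guidance, and builds the weakly paired dataset IIAEData and a dual-branch supervision framework to address hard-to-follow instructions and the scarcity of high-quality paired data.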

Details

Motivation: Existing image editing models perform poorly at aesthetic enhancement, mainly because (1) they struggle to understand and follow editing instructions that require aesthetic perception, and (2) "perfectly paired" images with consistent content but clearly different aesthetic quality are lacking. Method: Proposes Dual-supervised Image Aesthetic Enhancement (DIAE): (1) introduces Multimodal Aesthetic Perception (MAP), which uses standardized multi-attribute aesthetic text instructions with corresponding text-image control signals; (2) builds the weakly paired dataset IIAEData; (3) designs a dual-branch supervision framework for weakly supervised training. Result: Experiments show that DIAE outperforms the baselines on both image aesthetic scores and content consistency scores. Conclusion: DIAE effectively improves image aesthetic enhancement, confirming the value of combining multimodal aesthetic perception with weakly supervised learning and offering a new approach to the instruction-following and data-scarcity problems in aesthetic enhancement. Abstract: Image aesthetic enhancement aims to perceive aesthetic deficiencies in images and perform corresponding editing operations, which is highly challenging and requires the model to possess creativity and aesthetic perception capabilities. Although recent advancements in image editing models have significantly enhanced their controllability and flexibility, they struggle with enhancing image aesthetics. The primary challenges are twofold: first, following editing instructions with aesthetic perception is difficult, and second, there is a scarcity of "perfectly-paired" images that have consistent content but distinct aesthetic qualities. In this paper, we propose Dual-supervised Image Aesthetic Enhancement (DIAE), a diffusion-based generative model with multimodal aesthetic perception. First, DIAE incorporates Multimodal Aesthetic Perception (MAP) to convert the ambiguous aesthetic instruction into explicit guidance by (i) employing detailed, standardized aesthetic instructions across multiple aesthetic attributes, and (ii) utilizing multimodal control signals derived from text-image pairs that maintain consistency within the same aesthetic attribute. Second, to mitigate the lack of "perfectly-paired" images, we collect an "imperfectly-paired" dataset called IIAEData, consisting of images with varying aesthetic qualities while sharing identical semantics. To better leverage the weak matching characteristics of IIAEData during training, a dual-branch supervision framework is also introduced for weakly supervised image aesthetic enhancement. 
Experimental results demonstrate that DIAE outperforms the baselines and obtains superior image aesthetic scores and image content consistency scores.

[93] TornadoNet: Real-Time Building Damage Detection with Ordinal Supervision

Robinson Umeike,Cuong Pham,Ryan Hausen,Thang Dao,Shane Crawford,Tanya Brown-Giammanco,Gerard Lemson,John van de Lindt,Blythe Johnston,Arik Mitschang,Trung Do

Main category: cs.CV

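TL;DR: TornadoNet is a comprehensive benchmark for post-disaster, street-level building damage assessment. Using 3,333 high-resolution street-view images and 8,890 annotated instances, it systematically compares YOLO-family CNN models with transformer models such as RT-DETR on five-level damage classification, and proposes soft ordinal classification targets and ordinal-distance penalties that substantially improve the accuracy and consistency of damage severity estimation.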

Details

Motivation: Existing methods cannot model the ordered, multi-level severity of building damage, and there is no controlled benchmark for jointly evaluating detection architectures and supervision strategies under realistic post-disaster street-view conditions. Method: Builds the TornadoNet benchmark dataset (from the 2021 US Midwest tornado outbreak) with expert cross-annotation under the five-level IN-CORE damage framework; compares YOLO-family CNNs against transformer models such as RT-DETR; introduces soft ordinal classification targets and an explicit ordinal-distance loss to strengthen ordinal consistency. Result: YOLO models lead in detection accuracy (up to 46.05% mAP@0.5) and inference speed (66-276 FPS); RT-DETR is better on ordinal consistency (88.13% Ordinal Top-1 Accuracy, MAOE = 0.65); with ordinal-aware supervision, RT-DETR's mAP rises by 4.8 percentage points, Ordinal Top-1 Accuracy reaches 91.15%, and MAOE falls to 0.56. Conclusion: Ordinal-aware supervision effectively improves the reliability of damage severity estimation, and its benefit depends on co-design with the detector architecture; TornadoNet provides deployable tools and methodological insights for disaster response. Abstract: We present TornadoNet, a comprehensive benchmark for automated street-level building damage assessment evaluating how modern real-time object detection architectures and ordinal-aware supervision strategies perform under realistic post-disaster conditions. TornadoNet provides the first controlled benchmark demonstrating how architectural design and loss formulation jointly influence multi-level damage detection from street-view imagery, delivering methodological insights and deployable tools for disaster response. Using 3,333 high-resolution geotagged images and 8,890 annotated building instances from the 2021 Midwest tornado outbreak, we systematically compare CNN-based detectors from the YOLO family against transformer-based models (RT-DETR) for multi-level damage detection. Models are trained under standardized protocols using a five-level damage classification framework based on IN-CORE damage states, validated through expert cross-annotation. Baseline experiments reveal complementary architectural strengths. CNN-based YOLO models achieve highest detection accuracy and throughput, with larger variants reaching 46.05% mAP@0.5 at 66-276 FPS on A100 GPUs. Transformer-based RT-DETR models exhibit stronger ordinal consistency, achieving 88.13% Ordinal Top-1 Accuracy and MAOE of 0.65, indicating more reliable severity grading despite lower baseline mAP. 
To align supervision with the ordered nature of damage severity, we introduce soft ordinal classification targets and evaluate explicit ordinal-distance penalties. RT-DETR trained with calibrated ordinal supervision achieves 44.70% mAP@0.5, a 4.8 percentage-point improvement, with gains in ordinal metrics (91.15% Ordinal Top-1 Accuracy, MAOE = 0.56). These findings establish that ordinal-aware supervision improves damage severity estimation when aligned with detector architecture. Model & Data: https://github.com/crumeike/TornadoNet
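
One common way to build soft ordinal targets is to spread label mass over neighbouring severity levels so that near misses cost less than far ones. The sketch below uses an exponential decay in ordinal distance; this form, the `temperature` parameter, and the reading of MAOE as the mean absolute difference between predicted and true levels are all assumptions, since the abstract does not define them. The linked repository is the authoritative source.

```python
import math


def soft_ordinal_target(true_level, num_levels=5, temperature=1.0):
    """Soft label over damage levels: weight decays with distance from the truth."""
    weights = [math.exp(-abs(j - true_level) / temperature) for j in range(num_levels)]
    total = sum(weights)
    return [w / total for w in weights]


def mean_absolute_ordinal_error(predicted, actual):
    """Average absolute gap between predicted and true levels (assumed MAOE)."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


target = soft_ordinal_target(true_level=2)
print([round(t, 3) for t in target])  # peak at level 2, tapering toward 0 and 4

print(mean_absolute_ordinal_error([0, 2, 4], [0, 3, 2]))  # 1.0
```

Under such targets, predicting level 3 for a true level 2 is penalized less than predicting level 0, which is the ordered structure a plain one-hot label ignores.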

[94] SVLL: Staged Vision-Language Learning for Physically Grounded Embodied Task Planning

Yuyuan Yang,Junkun Hong,Hongrong Wang,Honghao Cai,Xunpeng Ren,Ge Wang,Mingcong Lei,Shenhao Yan,Jiahao Yang,Chengsi Yao,Xi Li,Yiming Zhao,Yatong Han,Jinke Ren

Main category: cs.CV

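TL;DR: This paper proposes a Staged Vision-Language Learning framework (SVLL) and an improved preference optimization method (Bias-DPO) to strengthen the visual grounding and causal coherence of action sequences in embodied task planning, substantially raising task success rates and reducing physical constraint violations.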

Details

Motivation: In existing embodied task planning methods, end-to-end training is prone to premature temporal binding, while reinforcement learning methods suffer from unstable optimization; in addition, standard DPO ignores absolute likelihood constraints on the optimal trajectory, which leads to unsafe or hallucinated behaviors. Method: Proposes the three-stage SVLL framework: the first two stages decouple spatial grounding from temporal reasoning, and the third stage introduces Bias-DPO, which extends DPO by explicitly maximizing the likelihood of expert actions and penalizing overconfident hallucinations, anchoring the policy to the expert manifold and mitigating causal misalignment. Result: On the AI2-THOR benchmark and in real-world robot deployments, SVLL surpasses state-of-the-art models including Qwen2.5-VL-7B, GPT-4o, and Gemini-2.0-flash, with higher task success rates and markedly fewer physical constraint violations. Conclusion: SVLL combined with Bias-DPO achieves physically grounded, causally consistent embodied planning and addresses key shortcomings of conventional methods in temporal modeling and policy alignment. Abstract: Embodied task planning demands vision-language models to generate action sequences that are both visually grounded and causally coherent over time. However, existing training paradigms face a critical trade-off: joint end-to-end training often leads to premature temporal binding, while standard reinforcement learning methods suffer from optimization instability. To bridge this gap, we present Staged Vision-Language Learning (SVLL), a unified three-stage framework for robust, physically-grounded embodied planning. In the first two stages, SVLL decouples spatial grounding from temporal reasoning, establishing robust visual dependency before introducing sequential action history. In the final stage, we identify a key limitation of standard Direct Preference Optimization (DPO): its purely relative nature, which optimizes only the preference gap between winning and losing trajectories while neglecting absolute likelihood constraints on the optimal path, often yields unsafe or hallucinated behaviors. To address this, we further introduce Bias-DPO, a novel alignment objective that injects an inductive bias toward expert trajectories by explicitly maximizing likelihood on ground-truth actions while penalizing overconfident hallucinations. By anchoring the policy to the expert manifold and mitigating causal misalignment, SVLL, powered by Bias-DPO, ensures strict adherence to environmental affordances and effectively suppresses physically impossible shortcuts. 
Finally, extensive experiments on the interactive AI2-THOR benchmark and real-world robotic deployments demonstrate that SVLL outperforms both state-of-the-art open-source (e.g., Qwen2.5-VL-7B) and closed-source models (e.g., GPT-4o, Gemini-2.0-flash) in task success rate, while significantly reducing physical constraint violations.
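
The "purely relative" limitation of standard DPO can be seen in a small numeric sketch. The first function is the usual per-pair DPO loss. The second adds a negative log-likelihood term on the preferred (expert) trajectory, which is only one plausible reading of "explicitly maximizing likelihood on ground-truth actions": the weight `alpha` and the name `anchored_dpo_loss` are assumptions, the hallucination penalty is not modeled, and none of this should be read as the paper's Bias-DPO objective.

```python
import math


def dpo_loss(logp_win, logp_lose, ref_win, ref_lose, beta=0.1):
    """Standard DPO loss for one preference pair: -log sigmoid(margin)."""
    margin = beta * ((logp_win - ref_win) - (logp_lose - ref_lose))
    return math.log1p(math.exp(-margin))


def anchored_dpo_loss(logp_win, logp_lose, ref_win, ref_lose, beta=0.1, alpha=1.0):
    """DPO plus an assumed likelihood anchor on the expert (winning) trajectory."""
    return dpo_loss(logp_win, logp_lose, ref_win, ref_lose, beta) + alpha * (-logp_win)


# Lowering both log-probabilities by the same amount leaves the DPO loss
# unchanged, because only the gap between winner and loser matters.
print(dpo_loss(-1.0, -2.0, -1.0, -1.0))
print(dpo_loss(-5.0, -6.0, -1.0, -1.0))

# The anchored variant notices that the expert trajectory became less likely.
print(anchored_dpo_loss(-1.0, -2.0, -1.0, -1.0))
print(anchored_dpo_loss(-5.0, -6.0, -1.0, -1.0))
```

The first two printed values are identical even though the policy in the second case assigns far lower probability to the expert trajectory, which is the gap an absolute likelihood term is meant to close.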

[95] R4Det: 4D Radar-Camera Fusion for High-Performance 3D Object Detection

Zhongyu Xia,Yousen Tang,Yongtao Wang,Zhifeng Wang,Weijun Qin

Main category: cs.CV

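TL;DR: This paper proposes R4Det, which improves depth estimation quality with a Panoramic Depth Fusion module, designs a Deformable Gated Temporal Fusion module that avoids reliance on ego-vehicle pose, and introduces an Instance-Guided Dynamic Refinement module that extracts semantic prototypes, substantially improving 3D object detection with 4D radar-camera fusion.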

Details

Motivation: 现有4D雷达-相机融合的3D目标检测方法存在三大问题：绝对深度估计不鲁棒准确、时间融合模块严重依赖不准或缺失的自车位姿、稀疏雷达点云对小物体反射失败导致仅能依赖视觉单模态先验。 Method: 提出R4Det框架，包含三个核心模块：（1）全景深度融合模块（Panoramic Depth Fusion），增强绝对与相对深度的相互强化；（2）可变形门控时间融合模块（Deformable Gated Temporal Fusion），摆脱对自车姿态的依赖；（3）实例引导动态细化模块（Instance-Guided Dynamic Refinement），从2D实例引导中提取语义原型。 Result: 在TJ4DRadSet和VoD数据集上，R4Det实现了4D雷达-相机融合3D目标检测的最先进性能。 Conclusion: R4Det有效解决了当前多模态融合检测中的深度估计、时间建模与小目标感知瓶颈，为鲁棒、高精度的4D雷达-相机联合感知提供了新范式。 Abstract: 4D radar-camera sensing configuration has gained increasing importance in autonomous driving. However, existing 3D object detection methods that fuse 4D Radar and camera data confront several challenges. First, their absolute depth estimation module is not robust and accurate enough, leading to inaccurate 3D localization. Second, the performance of their temporal fusion module will degrade dramatically or even fail when the ego vehicle's pose is missing or inaccurate. Third, for some small objects, the sparse radar point clouds may completely fail to reflect from their surfaces. In such cases, detection must rely solely on visual unimodal priors. To address these limitations, we propose R4Det, which enhances depth estimation quality via the Panoramic Depth Fusion module, enabling mutual reinforcement between absolute and relative depth. For temporal fusion, we design a Deformable Gated Temporal Fusion module that does not rely on the ego vehicle's pose. In addition, we built an Instance-Guided Dynamic Refinement module that extracts semantic prototypes from 2D instance guidance. Experiments show that R4Det achieves state-of-the-art 3D object detection results on the TJ4DRadSet and VoD datasets.

[96] WeEdit: A Dataset, Benchmark and Glyph-Guided Framework for Text-centric Image Editing

Hui Zhang,Juntao Liu,Zongkai Liu,Liqiang Niu,Fandong Meng,Zuxuan Wu,Yu-Gang Jiang

Main category: cs.CV

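TL;DR: WeEdit proposes an image editing system for text-centric editing. With 330K training pairs generated automatically from HTML, bilingual and multilingual benchmarks, and two-stage training (glyph-guided fine-tuning followed by multi-objective reinforcement learning), it significantly improves the accuracy and clarity of text modification, translation, and rearrangement.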

Details

Motivation: Existing models often produce blurry or hallucinated characters in text-centric image editing, mainly because they lack a dedicated training paradigm, large-scale datasets, and standardized benchmarks. Method: The WeEdit system (1) uses an HTML-based automatic editing pipeline to generate 330K multilingual training pairs; (2) builds bilingual and multilingual evaluation benchmarks; and (3) applies glyph-guided supervised fine-tuning followed by multi-objective reinforcement learning that balances instruction adherence, text clarity, and background preservation. Result: WeEdit clearly outperforms existing open-source models across a range of text editing tasks, confirming its effectiveness and robustness for text modification, translation, and rearrangement. Conclusion: By combining innovations in data, benchmarks, and algorithms, WeEdit establishes the first closed-loop training and evaluation system for text-centric image editing and offers a systematic solution for this direction. Abstract: Instruction-based image editing aims to modify specific content within existing images according to user-provided instructions while preserving non-target regions. Beyond traditional object- and style-centric manipulation, text-centric image editing focuses on modifying, translating, or rearranging textual elements embedded within images. However, existing leading models often struggle to execute complex text editing precisely, frequently producing blurry or hallucinated characters. We attribute these failures primarily to the lack of specialized training paradigms tailored for text-centric editing, as well as the absence of large-scale datasets and standardized benchmarks necessary for a closed-loop training and evaluation system. To address these limitations, we present WeEdit, a systematic solution encompassing a scalable data construction pipeline, two benchmarks, and a tailored two-stage training strategy. Specifically, we propose a novel HTML-based automatic editing pipeline, which generates 330K training pairs covering diverse editing operations and 15 languages, accompanied by standardized bilingual and multilingual benchmarks for comprehensive evaluation. On the algorithmic side, we employ glyph-guided supervised fine-tuning to inject explicit spatial and content priors, followed by a multi-objective reinforcement learning stage to align generation with instruction adherence, text clarity, and background preservation. Extensive experiments demonstrate that WeEdit outperforms previous open-source models by a clear margin across diverse editing operations.

[97] LaMoGen: Language to Motion Generation Through LLM-Guided Symbolic Inference

Junkun Jiang,Ho Yin Au,Jingyu Xiang,Jie Chen

Main category: cs.CV

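TL;DR: This paper proposes the LabanLite motion representation and the LaMoGen generation framework, which use symbolic reasoning to achieve interpretable, controllable language-driven motion synthesis.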

Details

Motivation: Existing methods based on text-motion embeddings struggle to generate temporally accurate, detailed motions and lack interpretability. Method: The paper proposes LabanLite, a discrete symbolic motion representation that extends Labanotation; builds the LaMoGen framework, in which a large language model generates motion sequences through symbolic reasoning; and establishes a Labanotation-based evaluation benchmark with multi-dimensional metrics. Result: LaMoGen outperforms prior methods on the authors' benchmark and on two public datasets, with markedly better interpretability and controllability. Conclusion: Symbolic reasoning and agent-based design offer a better path for language-driven motion synthesis. Abstract: Human motion is highly expressive and naturally aligned with language, yet prevailing methods relying heavily on joint text-motion embeddings struggle to synthesize temporally accurate, detailed motions and often lack explainability. To address these limitations, we introduce LabanLite, a motion representation developed by adapting and extending the Labanotation system. Unlike black-box text-motion embeddings, LabanLite encodes each atomic body-part action (e.g., a single left-foot step) as a discrete Laban symbol paired with a textual template. This abstraction decomposes complex motions into interpretable symbol sequences and body-part instructions, establishing a symbolic link between high-level language and low-level motion trajectories. Building on LabanLite, we present LaMoGen, a Text-to-LabanLite-to-Motion Generation framework that enables large language models (LLMs) to compose motion sequences through symbolic reasoning. The LLM interprets motion patterns, relates them to textual descriptions, and recombines symbols into executable plans, producing motions that are both interpretable and linguistically grounded. To support rigorous evaluation, we introduce a Labanotation-based benchmark with structured description-motion pairs and three metrics that jointly measure text-motion alignment across symbolic, temporal, and harmony dimensions. Experiments demonstrate that LaMoGen establishes a new baseline for both interpretability and controllability, outperforming prior methods on our benchmark and two public datasets. These results highlight the advantages of symbolic reasoning and agent-based design for language-driven motion synthesis.

[98] Articulat3D: Reconstructing Articulated Digital Twins From Monocular Videos with Geometric and Motion Constraints

Lijun Guo,Haoyu Zhao,Xingyue Zhao,Rong Fu,Linghao Zhuang,Siteng Huang,Zhongyu Li,Hua Zou

Main category: cs.CV

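TL;DR: This paper proposes Articulat3D, a framework that builds high-fidelity digital twins of articulated objects from monocular video alone, using motion-prior-driven initialization and refinement under geometric and motion constraints to obtain geometrically accurate, temporally consistent reconstructions.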

Details

Motivation: Existing methods depend on multi-view captures of discrete static states, which makes them hard to apply at scale in the real world; an efficient way to build digital twins of articulated objects from casually captured monocular video is needed. Method: The paper jointly optimizes Motion Prior-Driven Initialization (which uses 3D point tracks and a compact set of motion bases for soft rigid grouping) and Geometric and Motion Constraints Refinement (based on learnable kinematic primitives parameterized by a joint axis, a pivot point, and per-frame motion scalars). Result: The method reaches state-of-the-art performance on synthetic datasets and real monocular videos, making digital twin construction markedly more feasible in uncontrolled real-world scenes. Conclusion: Articulat3D removes the dependence on multi-view and static captures and offers a new paradigm for lightweight, robust construction of articulated-object digital twins in real scenes. Abstract: Building high-fidelity digital twins of articulated objects from visual data remains a central challenge. Existing approaches depend on multi-view captures of the object in discrete, static states, which severely constrains their real-world scalability. In this paper, we introduce Articulat3D, a novel framework that constructs such digital twins from casually captured monocular videos by jointly enforcing explicit 3D geometric and motion constraints. We first propose Motion Prior-Driven Initialization, which leverages 3D point tracks to exploit the low-dimensional structure of articulated motion. By modeling scene dynamics with a compact set of motion bases, we facilitate soft decomposition of the scene into multiple rigidly-moving groups. Building on this initialization, we introduce Geometric and Motion Constraints Refinement, which enforces physically plausible articulation through learnable kinematic primitives parameterized by a joint axis, a pivot point, and per-frame motion scalars, yielding reconstructions that are both geometrically accurate and temporally coherent. Extensive experiments demonstrate that Articulat3D achieves state-of-the-art performance on synthetic benchmarks and real-world casually captured monocular videos, significantly advancing the feasibility of digital twin creation under uncontrolled real-world conditions. Our project page is at https://maxwell-zhao.github.io/Articulat3D.
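
To make the kinematic-primitive parameterization concrete, the following is a minimal sketch of a revolute joint described by a joint axis, a pivot point, and one motion scalar (an angle) per frame. It uses Rodrigues' rotation formula; the function names `rotate_about_axis` and `articulate` are hypothetical, and this is an illustration of the parameterization rather than the authors' implementation.

```python
import math

def rotate_about_axis(point, axis, pivot, angle):
    """Rotate a 3D point by `angle` radians about the line through `pivot`
    with direction `axis`, using Rodrigues' rotation formula."""
    norm = math.sqrt(sum(a * a for a in axis))
    k = [a / norm for a in axis]                   # unit joint axis
    v = [p - c for p, c in zip(point, pivot)]      # point relative to pivot
    cos_t, sin_t = math.cos(angle), math.sin(angle)
    k_dot_v = sum(ki * vi for ki, vi in zip(k, v))
    k_cross_v = [
        k[1] * v[2] - k[2] * v[1],
        k[2] * v[0] - k[0] * v[2],
        k[0] * v[1] - k[1] * v[0],
    ]
    rotated = [
        v[i] * cos_t + k_cross_v[i] * sin_t + k[i] * k_dot_v * (1.0 - cos_t)
        for i in range(3)
    ]
    return [r + c for r, c in zip(rotated, pivot)]

def articulate(points, axis, pivot, per_frame_angles):
    """Apply one shared (axis, pivot) primitive with a different scalar per frame."""
    return [
        [rotate_about_axis(p, axis, pivot, angle) for p in points]
        for angle in per_frame_angles
    ]

# A door-like part hinged at x = 1, swung open over three frames.
frames = articulate([[2.0, 0.0, 0.0]], axis=[0.0, 0.0, 1.0],
                    pivot=[1.0, 0.0, 0.0],
                    per_frame_angles=[0.0, math.pi / 4, math.pi / 2])
```

Because the axis and pivot are shared across frames and only the scalar varies, the motion of a whole rigid group is described by very few parameters, which is what makes the constraint temporally coherent.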

[99] DyWeight: Dynamic Gradient Weighting for Few-Step Diffusion Sampling

Tong Zhao,Mingkun Lei,Liangyu Yuan,Yanming Yang,Chenxi Song,Yang Wang,Beier Zhu,Chi Zhang

Main category: cs.CV

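TL;DR: This paper proposes DyWeight, a lightweight, learning-based multi-step ODE solver that dynamically weights historical gradients and implicitly calibrates the time step, significantly improving both sampling efficiency and generation quality for diffusion models.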

Details

Motivation: Diffusion model sampling is slow, and although multi-step ODE solvers improve it, their handcrafted coefficients cannot adapt to the non-stationary dynamics of the diffusion process. Method: Dynamic Gradient Weighting (DyWeight) adopts a learning-driven multi-step solver paradigm that relaxes classical numerical constraints and learns time-varying parameters to adaptively aggregate historical gradients while implicitly adjusting the effective step size, aligning the solver with the model's denoising dynamics. Result: On CIFAR-10, FFHQ, AFHQv2, ImageNet64, LSUN-Bedroom, Stable Diffusion, and FLUX.1-dev, DyWeight achieves higher visual fidelity and stability with fewer function evaluations, setting a new state of the art among efficient diffusion solvers. Conclusion: Through data-driven dynamic gradient weighting and implicit time calibration, DyWeight provides an efficient, stable, and easily deployed solver framework for diffusion models. Abstract: Diffusion Models (DMs) have achieved state-of-the-art generative performance across multiple modalities, yet their sampling process remains prohibitively slow due to the need for hundreds of function evaluations. Recent progress in multi-step ODE solvers has greatly improved efficiency by reusing historical gradients, but existing methods rely on handcrafted coefficients that fail to adapt to the non-stationary dynamics of diffusion sampling. To address this limitation, we propose Dynamic Gradient Weighting (DyWeight), a lightweight, learning-based multi-step solver that introduces a streamlined implicit coupling paradigm. By relaxing classical numerical constraints, DyWeight learns unconstrained time-varying parameters that adaptively aggregate historical gradients while intrinsically scaling the effective step size. This implicit time calibration accurately aligns the solver's numerical trajectory with the model's internal denoising dynamics under large integration steps, avoiding complex decoupled parameterizations and optimizations. Extensive experiments on CIFAR-10, FFHQ, AFHQv2, ImageNet64, LSUN-Bedroom, Stable Diffusion and FLUX.1-dev demonstrate that DyWeight achieves superior visual fidelity and stability with significantly fewer function evaluations, establishing a new state-of-the-art among efficient diffusion solvers. Code is available at https://github.com/Westlake-AGI-Lab/DyWeight
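
As a rough illustration of how weighting historical gradients can also rescale the step, the sketch below implements a generic linear multi-step update. With the classical second-order Adams-Bashforth weights (1.5, -0.5) the weights sum to 1; if the weights are unconstrained, their sum acts as a multiplier on the step size. The function names are hypothetical, the weights here are fixed numbers rather than learned parameters, and this is one reading of the abstract, not DyWeight's actual parameterization.

```python
def multistep_update(x, t_cur, t_next, grad_history, weights):
    """One linear multi-step update:
    x_next = x + (t_next - t_cur) * sum_i weights[i] * grad_history[i],
    where grad_history[0] is the most recent gradient."""
    h = t_next - t_cur
    combined = [0.0] * len(x)
    for w, grad in zip(weights, grad_history):
        for j, g in enumerate(grad):
            combined[j] += w * g
    return [xi + h * c for xi, c in zip(x, combined)]

def effective_step(t_cur, t_next, weights):
    """For a constant gradient, the update equals one Euler step of this size."""
    return (t_next - t_cur) * sum(weights)

history = [[2.0], [2.0]]  # newest gradient first; constant field for clarity

# Classical Adams-Bashforth 2 weights: sum to 1, nominal step is preserved.
x_ab2 = multistep_update([0.0], 0.0, 0.5, history, [1.5, -0.5])

# Unconstrained weights summing to 2: the same call behaves like a doubled step.
x_free = multistep_update([0.0], 0.0, 0.5, history, [3.0, -1.0])
```

In a learned solver the weight vector would differ per sampling step, so both the mixing of past gradients and the implied step size can change along the trajectory.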

[100] SemiTooth: a Generalizable Semi-supervised Framework for Multi-Source Tooth Segmentation

Muyi Sun,Yifan Gao,Ziang Jia,Xingqun Qi,Qianli Zhang,Qian Liu,Tianzheng Deng

Main category: cs.CV

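TL;DR: This paper proposes SemiTooth, a semi-supervised framework for tooth structure segmentation on multi-source CBCT data. It builds the multi-source semi-supervised dataset MS3Toothset and designs a multi-teacher, multi-student architecture with a Stricter Weighted-Confidence Constraint, significantly improving the use of unlabeled data and cross-domain segmentation accuracy.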

Details

Motivation: CBCT tooth segmentation suffers from the difficulty of obtaining fully annotated data and from acquisition differences across sources, which lead to poor use of annotations, voxel-level inconsistency, and domain bias; multi-source unlabeled data must be used more efficiently. Method: The paper builds MS3Toothset, a multi-source dataset with three annotation levels; proposes SemiTooth, a multi-teacher, multi-student semi-supervised framework in which each student network learns from the unlabeled data of its own source under the supervision of its own teacher; and introduces a Stricter Weighted-Confidence Constraint to improve multi-source accuracy. Result: Experiments on MS3Toothset show that SemiTooth reaches state-of-the-art performance on semi-supervised, multi-source tooth segmentation. Conclusion: SemiTooth offers a general, robust, and efficient solution for semi-supervised tooth segmentation on multi-source clinical CBCT, significantly improving generalization and cross-institution applicability. Abstract: With the rapid advancement of artificial intelligence, intelligent dentistry for clinical diagnosis and treatment has become increasingly promising. As the primary clinical dentistry task, tooth structure segmentation for Cone-Beam Computed Tomography (CBCT) has made significant progress in recent years. However, challenges arise from the difficulty of obtaining fully annotated data and from the acquisition variability of multi-source data across different institutions, which have caused low-quality utilization, voxel-level inconsistency, and domain-specific disparity in CBCT slices. Thus, the rational and efficient utilization of multi-source and unlabeled data represents a pivotal problem. In this paper, we propose SemiTooth, a generalizable semi-supervised framework for multi-source tooth segmentation. Specifically, we first compile MS3Toothset, a Multi-Source Semi-Supervised Tooth DataSet for clinical dental CBCT, which contains data from three sources with different-level annotations. Then, we design a multi-teacher and multi-student framework, i.e., SemiTooth, which promotes semi-supervised learning for multi-source data. SemiTooth employs distinct student networks that learn from unlabeled data with different sources, supervised by their respective teachers. Furthermore, a Stricter Weighted-Confidence Constraint is introduced for multiple teachers to improve the multi-source accuracy. Extensive experiments are conducted on MS3Toothset to verify the feasibility and superiority of the SemiTooth framework, which achieves SOTA performance on the semi-supervised and multi-source tooth segmentation scenario.

[101] Noise-aware few-shot learning through bi-directional multi-view prompt alignment

Lu Niu,Cheng Xue

Main category: cs.CV

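TL;DR: This paper proposes the NA-MVP framework, which achieves noise-aware few-shot learning through bi-directional multi-view prompt alignment and improves the robustness of vision-language models under label noise.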

Details

Motivation: Existing vision-language models are easily affected by noisy labels in few-shot learning; they struggle to model fine-grained semantics and to adaptively separate clean from noisy signals. Method: NA-MVP has three components: (1) multi-view prompts combined with unbalanced optimal transport for fine-grained region-to-prompt alignment that suppresses unreliable regions; (2) a bi-directional prompt design that captures complementary clean-oriented and noise-aware cues; and (3) an alignment-guided selective refinement strategy that corrects only mislabeled samples and keeps reliable data. Result: Experiments on synthetic and real-world noisy benchmarks show that NA-MVP consistently outperforms state-of-the-art methods. Conclusion: Through region awareness, bi-directional prompts, and selective refinement, NA-MVP improves the robustness and cross-modal alignment of few-shot vision-language models under noisy supervision. Abstract: Vision-language models offer strong few-shot capability through prompt tuning but remain vulnerable to noisy labels, which can corrupt prompts and degrade cross-modal alignment. Existing approaches struggle because they often lack the ability to model fine-grained semantic cues and to adaptively separate clean from noisy signals. To address these challenges, we propose NA-MVP, a framework for Noise-Aware few-shot learning through bi-directional Multi-View Prompt alignment. NA-MVP is built upon a key conceptual shift: robust prompt learning requires moving from global matching to region-aware alignment that explicitly distinguishes clean cues from noisy ones. To realize this, NA-MVP employs (1) multi-view prompts combined with unbalanced optimal transport to achieve fine-grained patch-to-prompt correspondence while suppressing unreliable regions; (2) a bi-directional prompt design that captures complementary clean-oriented and noise-aware cues, enabling the model to focus on stable semantics; and (3) an alignment-guided selective refinement strategy that uses optimal transport to correct only mislabeled samples while retaining reliable data. Experiments on synthetic and real-world noisy benchmarks demonstrate that NA-MVP consistently outperforms state-of-the-art baselines, confirming its effectiveness in enabling robust few-shot learning under noisy supervision.
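
To show why unbalanced optimal transport can suppress unreliable regions, here is a minimal sketch of entropic unbalanced Sinkhorn scaling with KL-relaxed marginals. Because the marginals are only softly enforced, a patch whose cost to every prompt is high receives little transport mass instead of being forced to match something. The cost matrix, the parameter values, and the function name `unbalanced_sinkhorn` are illustrative assumptions; the abstract does not specify the solver NA-MVP uses.

```python
import math

def unbalanced_sinkhorn(cost, a, b, eps=0.5, rho=1.0, n_iters=200):
    """Entropic unbalanced OT with KL marginal penalties of strength `rho`.
    cost[i][j] is the cost of matching patch i to prompt j;
    a and b are the (softly enforced) patch and prompt marginals."""
    n, m = len(a), len(b)
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    tau = rho / (rho + eps)   # exponent < 1 relaxes the marginal constraints
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        kv = [sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        u = [(a[i] / kv[i]) ** tau for i in range(n)]
        ktu = [sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
        v = [(b[j] / ktu[j]) ** tau for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

# Hypothetical costs (e.g. 1 - cosine similarity): patches 0 and 1 each match
# one prompt well; patch 2 is a background or noisy region far from both.
cost = [[0.1, 1.0],
        [1.0, 0.1],
        [5.0, 5.0]]
plan = unbalanced_sinkhorn(cost, a=[1 / 3, 1 / 3, 1 / 3], b=[0.5, 0.5])
row_mass = [sum(row) for row in plan]  # mass each patch actually transports
```

In balanced OT every patch would have to ship its full marginal; here the third row's mass shrinks, which is the suppression effect the abstract refers to.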

[102] Shape-of-You: Fused Gromov-Wasserstein Optimal Transport for Semantic Correspondence in-the-Wild

Jiin Im,Sisung Liu,Je Hyeong Hong

Main category: cs.CV

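TL;DR: This paper proposes the Shape-of-You (SoY) framework, which improves pseudo-label generation for unsupervised semantic correspondence by combining Fused Gromov-Wasserstein (FGW) optimization with geometric priors from a 3D foundation model, overcoming the tendency of conventional 2D nearest-neighbor methods to ignore structural relationships and geometric ambiguity. It approximates FGW through anchor-based linearization, designs a soft-target loss that is robust to noisy pseudo-labels, and reaches state-of-the-art results on SPair-71k and AP-10k.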

Details

Motivation: Existing unsupervised semantic correspondence methods based on 2D foundation models and nearest-neighbor pseudo-labels are limited to local appearance matching; they cannot handle geometric ambiguity caused by symmetry or repetitive texture, and they ignore structural relationships among features. Method: Pseudo-label generation is formulated as a Fused Gromov-Wasserstein (FGW) optimization problem that jointly optimizes cross-image feature similarity and intra-image geometric structural consistency; a 3D foundation model defines the intrinsic structure in geometric space; anchor-based linearization reduces the computational cost of FGW; and a soft-target loss dynamically blends network predictions with the FGW transport plan. Result: The method achieves state-of-the-art performance on SPair-71k and AP-10k, significantly improving semantic correspondence accuracy without explicit geometric annotations. Conclusion: SoY shows that geometric structural priors and structured pseudo-label generation can effectively reduce geometric ambiguity in unsupervised semantic correspondence, offering a new paradigm for work that combines 3D and 2D foundation models. Abstract: Semantic correspondence is essential for handling diverse in-the-wild images lacking explicit correspondence annotations. While recent 2D foundation models offer powerful features, adapting them for unsupervised learning via nearest-neighbor pseudo-labels has key limitations: it operates locally, ignoring structural relationships, and consequently its reliance on 2D appearance fails to resolve geometric ambiguities arising from symmetries or repetitive features. In this work, we address this by reformulating pseudo-label generation as a Fused Gromov-Wasserstein (FGW) problem, which jointly optimizes inter-feature similarity and intra-structural consistency. Our framework, Shape-of-You (SoY), leverages a 3D foundation model to define this intra-structure in the geometric space, resolving the abovementioned ambiguity. However, since FGW is a computationally prohibitive quadratic problem, we approximate it through anchor-based linearization. The resulting probabilistic transport plan provides a structurally consistent but noisy supervisory signal. Thus, we introduce a soft-target loss dynamically blending guidance from this plan with network predictions to build a learning framework robust to this noise. SoY achieves state-of-the-art performance on SPair-71k and AP-10k datasets, establishing a new benchmark in semantic correspondence without explicit geometric annotations. Code is available at Shape-of-You.

[103] MedPruner: Training-Free Hierarchical Token Pruning for Efficient 3D Medical Image Understanding in Vision-Language Models

Shengyuan Liu,Zanting Ye,Yunrui Lin,Chen Hu,Wanting Geng,Xu Han,Bulat Ibragimov,Yefeng Zheng,Yixuan Yuan

Main category: cs.CV

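TL;DR: This paper proposes MedPruner, a training-free, model-agnostic hierarchical token pruning framework for efficient 3D medical image understanding. Its two-stage mechanism (inter-slice anchor-based filtering followed by dynamic information nucleus selection) cuts the number of visual tokens to under 5% while maintaining or improving performance.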

Details

Motivation: Existing 3D medical vision-language models concatenate 2D slices directly, which causes severe anatomical redundancy, and fixed pruning ratios cannot adapt to differing information density across slices; the resulting inefficiency limits clinical deployment. Method: MedPruner has two stages. The first is an Inter-slice Anchor-based Filtering module that removes slice-level temporal redundancy; the second is a Dynamic Information Nucleus Selection strategy that performs adaptive token-level compression based on cumulative attention weights. The method requires no training and works with multiple models. Result: Experiments on three 3D medical benchmarks and three medical VLMs reveal massive token redundancy in existing models; with MedPruner, models such as MedGemma retain fewer than 5% of visual tokens while maintaining or exceeding their original performance, greatly reducing computational cost. Conclusion: Dynamic token selection is essential for the inference efficiency and clinical practicality of 3D medical VLMs, and MedPruner provides a general, lightweight, and effective solution for efficient, deployable medical multimodal understanding. Abstract: While specialized Medical Vision-Language Models (VLMs) have achieved remarkable success in interpreting 2D and 3D medical modalities, their deployment for 3D volumetric data remains constrained by significant computational inefficiencies. Current architectures typically suffer from massive anatomical redundancy due to the direct concatenation of consecutive 2D slices and lack the flexibility to handle heterogeneous information densities across different slices using fixed pruning ratios. To address these challenges, we propose MedPruner, a training-free and model-agnostic hierarchical token pruning framework specifically designed for efficient 3D medical image understanding. MedPruner introduces a two-stage mechanism: an Inter-slice Anchor-based Filtering module to eliminate slice-level temporal redundancy, followed by a Dynamic Information Nucleus Selection strategy that achieves adaptive token-level compression by quantifying cumulative attention weights. Extensive experiments on three 3D medical benchmarks and across three diverse medical VLMs reveal massive token redundancy in existing architectures. Notably, MedPruner enables models such as MedGemma to maintain or even exceed their original performance while retaining fewer than 5% of visual tokens, thereby drastically reducing computational overhead and validating the necessity of dynamic token selection for practical clinical deployment. Our code will be released.
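
The idea of selecting tokens by cumulative attention weight can be sketched as a nucleus-style (top-p) selection: keep the smallest set of tokens whose normalized attention adds up to a target mass. Unlike a fixed pruning ratio, the number of tokens kept then varies with how concentrated the attention is. The function name `nucleus_select` and the `mass` threshold are hypothetical; MedPruner's actual scoring and thresholds are not given in the abstract.

```python
def nucleus_select(attn_scores, mass=0.9):
    """Return indices (in original order) of the smallest set of tokens whose
    normalized attention scores sum to at least `mass`."""
    total = sum(attn_scores)
    order = sorted(range(len(attn_scores)),
                   key=lambda i: attn_scores[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += attn_scores[i] / total
        if cumulative >= mass:
            break
    return sorted(kept)

# A slice with one dominant token keeps a single token ...
peaked = nucleus_select([97, 1, 1, 1], mass=0.9)
# ... while a slice with evenly spread attention keeps all of them.
flat = nucleus_select([25, 25, 25, 25], mass=0.9)
```

This adaptivity is the point of contrast with fixed pruning ratios: information-dense slices retain more tokens, redundant slices retain fewer.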

[104] Developing Foundation Models for Universal Segmentation from 3D Whole-Body Positron Emission Tomography

Yichi Zhang,Le Xue,Wenbo Zhang,Lanlan Li,Feiyang Xiao,Yuchen Liu,Xiaohui Zhang,Hongwei Zhang,Shuqi Wang,Gang Feng,Liling Peng,Xin Gao,Yuanfan Xu,Yuan Qi,Kuangyu Shi,Hong Zhang,Yuan Cheng,Mei Tian,Zixin Hu

Main category: cs.CV

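TL;DR: This paper proposes SegAnyPET, a general-purpose segmentation foundation model trained on a large-scale 3D whole-body PET dataset. It targets the low anatomical contrast and high annotation cost of PET imaging, supports zero-shot organ and lesion segmentation across tasks, and enables a clinical human-in-the-loop workflow.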

Details

Motivation: PET images lack anatomical contrast, and data acquisition and annotation are costly, which has held back deep learning for quantitative PET analysis. Method: The authors build the largest and most comprehensive 3D whole-body PET dataset to date (11041 scans with 59831 segmentation masks) and, on that basis, propose SegAnyPET, a general-purpose segmentation foundation model that combines a 3D architecture with prompt engineering. Result: SegAnyPET shows strong zero-shot segmentation performance on multi-center, multi-tracer, multi-disease data and supports efficient manual correction and a clinical human-in-the-loop workflow. Conclusion: SegAnyPET offers a new paradigm for general, scalable, clinic-ready PET segmentation and may help advance the clinical application of molecular imaging. Abstract: Positron emission tomography (PET) is a key nuclear medicine imaging modality that visualizes radiotracer distributions to quantify in vivo physiological and metabolic processes, playing an irreplaceable role in disease management. Despite its clinical importance, the development of deep learning models for quantitative PET image analysis remains severely limited, driven by both the inherent segmentation challenge from PET's paucity of anatomical contrast and the high costs of data acquisition and annotation. To bridge this gap, we develop generalist foundational models for universal segmentation from 3D whole-body PET imaging. We first build the largest and most comprehensive PET dataset to date, comprising 11041 3D whole-body PET scans with 59831 segmentation masks for model development. Based on this dataset, we present SegAnyPET, an innovative foundational model with general-purpose applicability to diverse segmentation tasks. Built on a 3D architecture with a prompt engineering strategy for mask generation, SegAnyPET enables universal and scalable organ and lesion segmentation, supports efficient human correction with minimal effort, and enables a clinical human-in-the-loop workflow. Extensive evaluations on multi-center, multi-tracer, multi-disease datasets demonstrate that SegAnyPET achieves strong zero-shot performance across a wide range of segmentation tasks, highlighting its potential to advance the clinical applications of molecular imaging.

[105] MV-SAM3D: Adaptive Multi-View Fusion for Layout-Aware 3D Generation

Baicheng Li,Dong Wu,Jun Li,Shunkai Zhou,Zecui Zeng,Lusong Li,Hongbin Zha

Main category: cs.CV

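TL;DR: This paper proposes MV-SAM3D, a training-free framework for layout-aware 3D generation that is multi-view consistent and physically plausible. It combines Multi-Diffusion fusion in 3D latent space with adaptive weighting strategies and physics-constrained optimization, significantly improving reconstruction fidelity and layout plausibility.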

Details

Motivation: Existing single-view unified 3D generation methods cannot exploit complementary multi-view information, and independently estimated object poses often produce physically implausible layouts such as interpenetration and floating objects. Method: The MV-SAM3D framework (1) formulates multi-view fusion as a Multi-Diffusion process in 3D latent space; (2) designs two adaptive weighting strategies, attention-entropy weighting and visibility weighting, for confidence-aware fusion; and (3) introduces physics-aware optimization that injects collision and contact constraints during and after generation. Result: Experiments on standard benchmarks and real multi-object scenes show significant gains in reconstruction fidelity and layout plausibility with no additional training. Conclusion: MV-SAM3D shows that multi-view consistency modeling and physical constraint injection can improve the quality and plausibility of layout-aware 3D generation without training, pointing to a practical route toward scene-level 3D generation. Abstract: Recent unified 3D generation models have made remarkable progress in producing high-quality 3D assets from a single image. Notably, layout-aware approaches such as SAM3D can reconstruct multiple objects while preserving their spatial arrangement, opening the door to practical scene-level 3D generation. However, current methods are limited to single-view input and cannot leverage complementary multi-view observations, while independently estimated object poses often lead to physically implausible layouts such as interpenetration and floating artifacts. We present MV-SAM3D, a training-free framework that extends layout-aware 3D generation with multi-view consistency and physical plausibility. We formulate multi-view fusion as a Multi-Diffusion process in 3D latent space and propose two adaptive weighting strategies -- attention-entropy weighting and visibility weighting -- that enable confidence-aware fusion, ensuring each viewpoint contributes according to its local observation reliability. For multi-object composition, we introduce physics-aware optimization that injects collision and contact constraints both during and after generation, yielding physically plausible object arrangements. Experiments on standard benchmarks and real-world multi-object scenes demonstrate significant improvements in reconstruction fidelity and layout plausibility, all without any additional training. Code is available at https://github.com/devinli123/MV-SAM3D.
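
One way to picture attention-entropy weighting is to treat a view whose attention distribution is sharply peaked (low entropy) as more confident than a view whose attention is diffuse, and to fuse per-view predictions with weights derived from that entropy. The sketch below does this with a softmax over negative entropies. The exact weighting rule, the `temperature` parameter, and the function names are assumptions made for illustration, not details taken from MV-SAM3D.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_weights(attn_per_view, temperature=1.0):
    """Softmax over negative entropies: lower-entropy views get larger weights."""
    scores = [-entropy(p) / temperature for p in attn_per_view]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]   # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

def fuse(per_view_predictions, weights):
    """Weighted average of per-view prediction vectors."""
    dim = len(per_view_predictions[0])
    return [
        sum(w * pred[j] for w, pred in zip(weights, per_view_predictions))
        for j in range(dim)
    ]

# View 0 attends sharply to one location; view 1 is uncertain (uniform).
weights = entropy_weights([[0.97, 0.01, 0.01, 0.01],
                           [0.25, 0.25, 0.25, 0.25]])
fused = fuse([[1.0], [0.0]], weights)
```

In a Multi-Diffusion setting the same kind of weights would be applied to per-view denoising predictions at each step, so a confident view dominates only where its observation is reliable.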

[106] Tokenization Allows Multimodal Large Language Models to Understand, Generate and Edit Architectural Floor Plans

Sizhong Qin,Ramon Elias Weber,Xinzheng Lu

Main category: cs.CV

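TL;DR: This paper proposes HouseMind, a multimodal large language model that unifies the understanding, generation, and editing of architectural floor plans. Discrete room-instance tokens enable joint reasoning over geometry, semantics, and spatial hierarchy, significantly improving layout validity and controllability.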

Details

Motivation: Existing AI systems for floor plan design struggle with joint reasoning over geometry, semantics, and spatial hierarchy, and fall short on spatial consistency and controllable generation in particular. Method: The HouseMind model introduces discrete room-instance tokens to build a unified symbolic vocabulary and combines multimodal alignment with instruction tuning to generate coherent, controllable floor plans end to end from text instructions. Result: Experiments show that the framework surpasses existing methods in geometric validity and controllability while remaining efficient and locally deployable. Conclusion: HouseMind provides a unified, controllable, deployable paradigm for intelligent floor plan design and bridges the gap between symbolic reasoning and visual generation. Abstract: Architectural floor plan design demands joint reasoning over geometry, semantics, and spatial hierarchy, which remains a major challenge for current AI systems. Although recent diffusion and language models improve visual fidelity, they still struggle with coherent spatial reasoning and controllable generation. We present HouseMind, a multimodal large language model that unifies floor plan understanding, generation, and editing in one framework. We introduce discrete room-instance tokens to construct a unified vocabulary that bridges layouts and symbolic reasoning. With multimodal alignment and instruction tuning, the model synthesizes coherent, controllable layouts from text instructions. Experiments show that the framework achieves superior geometric validity and controllability while remaining efficient and locally deployable.

Chongxiao Wang,Junjie Liang,Peng Cao,Jinzhu Yang,Osmar R. Zaiane

Main category: cs.CV

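TL;DR: This paper proposes the IDRL framework, which disentangles multimodal representations and introduces an individual-aware modality-fusion module to make depression detection more robust and accurate.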

Details

Motivation: Existing methods suffer from inter-modal inconsistency, depression-unrelated interference, and large individual differences in how depression presents, all of which make fusion unreliable. Method: The IDRL framework has two parts: (1) it disentangles multimodal representations into a modality-common depression space, a modality-specific depression space, and a depression-unrelated space; and (2) it uses an individual-aware modality-fusion module (IAF) that dynamically adjusts the weights of depression-related features for adaptive cross-modal fusion. Result: Experiments on multiple datasets show that IDRL delivers superior and robust performance on multimodal depression detection. Conclusion: IDRL reduces modality conflict and interference and adapts to individual differences, offering a new approach to reliable depression diagnosis. Abstract: Depression is a severe mental disorder, and reliable identification plays a critical role in early intervention and treatment. Multimodal depression detection aims to improve diagnostic performance by jointly modeling complementary information from multiple modalities. Recently, numerous multimodal learning approaches have been proposed for depression analysis; however, these methods suffer from the following limitations: 1) inter-modal inconsistency and depression-unrelated interference, where depression-related cues may conflict across modalities while substantial irrelevant content obscures critical depressive signals, and 2) diverse individual depressive presentations, leading to individual differences in modality and cue importance that hinder reliable fusion. To address these issues, we propose the Individual-aware Multimodal Depression-related Representation Learning Framework (IDRL) for robust depression diagnosis. Specifically, IDRL 1) disentangles multimodal representations into a modality-common depression space, a modality-specific depression space, and a depression-unrelated space to enhance modality alignment while suppressing irrelevant information, and 2) introduces an individual-aware modality-fusion module (IAF) that dynamically adjusts the weights of disentangled depression-related features based on their predictive significance, thereby achieving adaptive cross-modal fusion for different individuals. Extensive experiments demonstrate that IDRL achieves superior and robust performance for multimodal depression detection.

[108] OSCBench: Benchmarking Object State Change in Text-to-Video Generation

Xianjing Han,Bin Zhu,Shiqi Hu,Franklin Mingzhe Li,Patrick Carrington,Roger Zimmermann,Jingjing Chen

Main category: cs.CV

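TL;DR: This paper proposes OSCBench, a benchmark that evaluates how well text-to-video (T2V) models handle object state change (OSC). It finds that existing models perform poorly on OSC, especially in novel and compositional scenarios, which marks OSC as a key bottleneck in T2V generation.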

Details

Motivation: Existing T2V benchmarks focus on perceptual quality, text-video alignment, or physical plausibility, and overlook object state change (OSC) explicitly specified in the text, which is a key dimension of action understanding. Method: The authors build OSCBench from instructional cooking data, covering regular, novel, and compositional action-object interaction scenarios, and evaluate six mainstream open-source and proprietary T2V models using both a human user study and automatic evaluation with a multimodal large language model (MLLM). Result: Current T2V models do well on semantic and scene alignment but are generally weak at generating accurate, temporally consistent object state changes, with performance dropping sharply in novel and compositional settings. Conclusion: Object state change is a core bottleneck in current T2V generation, and OSCBench provides a diagnostic benchmark for advancing state-aware video generation models. Abstract: Text-to-video (T2V) generation models have made rapid progress in producing visually high-quality and temporally coherent videos. However, existing benchmarks primarily focus on perceptual quality, text-video alignment, or physical plausibility, leaving a critical aspect of action understanding largely unexplored: object state change (OSC) explicitly specified in the text prompt. OSC refers to the transformation of an object's state induced by an action, such as peeling a potato or slicing a lemon. In this paper, we introduce OSCBench, a benchmark specifically designed to assess OSC performance in T2V models. OSCBench is constructed from instructional cooking data and systematically organizes action-object interactions into regular, novel, and compositional scenarios to probe both in-distribution performance and generalization. We evaluate six representative open-source and proprietary T2V models using both a human user study and multimodal large language model (MLLM)-based automatic evaluation. Our results show that, despite strong performance on semantic and scene alignment, current T2V models consistently struggle with accurate and temporally consistent object state changes, especially in novel and compositional settings. These findings position OSC as a key bottleneck in text-to-video generation and establish OSCBench as a diagnostic benchmark for advancing state-aware video generation models.

[109] FL-MedSegBench: A Comprehensive Benchmark for Federated Learning on Medical Image Segmentation

Meilu Zhu,Zhiwei Wang,Axiu Mao,Yuxing Li,Xiaohan Xing,Yixuan Yuan,Edmund Y. Lam

Main category: cs.CV

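TL;DR: This paper proposes FL-MedSegBench, the first federated learning benchmark for medical image segmentation, covering 9 tasks, 10 modalities, and both 2D and 3D data. It systematically evaluates 8 generic FL and 5 personalized FL methods on accuracy, fairness, communication efficiency, convergence, and cross-domain generalization, and finds that personalized methods (such as FedBN) have a clear advantage, that no method is best everywhere, and that communication robustness and generalization ability are related.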

Details

Motivation: The lack of a standardized federated learning benchmark for medical image segmentation makes method evaluation unfair and incomplete. Method: The authors build FL-MedSegBench, which includes nine segmentation tasks, ten imaging modalities, 2D and 3D data, and clinical heterogeneity, and systematically evaluate eight generic FL and five personalized FL methods along five dimensions: segmentation accuracy, fairness, communication efficiency, convergence behavior, and cross-domain generalization. Result: (i) Personalized FL methods (such as FedBN) consistently outperform generic methods; (ii) no single method dominates on all datasets; (iii) normalization-based personalization methods are highly robust to reduced communication frequency; (iv) methods such as Ditto and FedRDN protect underperforming clients and improve fairness; (v) a method's generalization to unseen domains is strongly correlated with its overall performance on participating clients. Conclusion: FL-MedSegBench provides the first comprehensive, open-source, reproducible evaluation benchmark for federated medical image segmentation, supports evidence-based guidelines for clinical deployment, and advances FL research and applications for real medical settings. Abstract: Federated learning (FL) offers a privacy-preserving paradigm for collaborative medical image analysis without sharing raw data. However, the absence of standardized benchmarks for medical image segmentation hinders fair and comprehensive evaluation of FL methods. To address this gap, we introduce FL-MedSegBench, the first comprehensive benchmark for federated learning on medical image segmentation. Our benchmark encompasses nine segmentation tasks across ten imaging modalities, covering both 2D and 3D formats with realistic clinical heterogeneity. We systematically evaluate eight generic FL (gFL) and five personalized FL (pFL) methods across multiple dimensions: segmentation accuracy, fairness, communication efficiency, convergence behavior, and generalization to unseen domains. Extensive experiments reveal several key insights: (i) pFL methods, particularly those with client-specific batch normalization (e.g., FedBN), consistently outperform generic approaches; (ii) no single method universally dominates, with performance being dataset-dependent; (iii) communication frequency analysis shows normalization-based personalization methods exhibit remarkable robustness to reduced communication frequency; (iv) fairness evaluation identifies methods like Ditto and FedRDN that protect underperforming clients; (v) a method's generalization to unseen domains is strongly tied to its ability to perform well across participating clients.
We will release an open-source toolkit to foster reproducible research and accelerate clinically applicable FL solutions, providing empirically grounded guidelines for real-world clinical deployment. The source code is available at https://github.com/meiluzhu/FL-MedSegBench.
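
The client-specific batch normalization idea behind FedBN can be sketched as federated averaging that skips normalization parameters: shared layers are averaged across clients, while each client keeps its own batch-norm statistics and affine parameters. The state-dict layout, the key-naming rule `"bn" in key`, and the function name below are assumptions for illustration; real frameworks identify normalization layers differently.

```python
def fedbn_aggregate(client_states, is_bn_key=lambda key: "bn" in key):
    """Average all non-batch-norm parameters across clients (unweighted),
    leaving every client's batch-norm entries untouched.
    client_states: list of dicts mapping parameter name -> list of floats."""
    n_clients = len(client_states)
    shared_keys = [k for k in client_states[0] if not is_bn_key(k)]
    averaged = {
        key: [sum(vals) / n_clients
              for vals in zip(*(state[key] for state in client_states))]
        for key in shared_keys
    }
    new_states = []
    for state in client_states:
        updated = dict(state)                      # keeps local BN entries
        for key in shared_keys:
            updated[key] = list(averaged[key])     # overwrite shared layers
        new_states.append(updated)
    return new_states

hospital_a = {"conv.weight": [1.0, 3.0], "bn.weight": [0.5]}
hospital_b = {"conv.weight": [3.0, 5.0], "bn.weight": [1.5]}
state_a, state_b = fedbn_aggregate([hospital_a, hospital_b])
```

Keeping normalization local lets each site absorb its own scanner and protocol statistics, which is one plausible reason such methods tolerate heterogeneous clinical data.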

[110] Think While Watching: Online Streaming Segment-Level Memory for Multi-Turn Video Reasoning in Multimodal Large Language Models

Lu Wang,Zhuoran Jin,Yupu Hao,Yubo Chen,Kang Liu,Yulong Ao,Jun Zhao

Main category: cs.CV

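TL;DR: This paper proposes Think While Watching, a memory-anchored streaming video reasoning framework that processes watching and thinking in parallel, improves long-range dependency modeling in multi-turn interaction, and significantly improves accuracy and efficiency on several benchmarks.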

Details

Motivation: Existing MLLMs perform well on offline video understanding but are limited in online streaming video reasoning and multi-turn interaction; in particular, the interleaved perception-generation paradigm causes early memory decay, prevents concurrent processing, and weakens long-range dependency modeling. Method: The paper proposes Think While Watching, a memory-anchored streaming video reasoning framework; builds a three-stage, multi-round chain-of-thought dataset; adopts a stage-matched training strategy; and enforces strict causality with a segment-level streaming causal mask and streaming positional encoding. At inference, an efficient pipeline overlaps watching and thinking and adaptively selects the best attention backend. Result: In the single-round setting, accuracy improves by 2.6% on StreamingBench and by 3.79% on OVO-Bench; in the multi-round setting, performance is maintained while output tokens fall by 56%. Conclusion: Think While Watching addresses memory retention and concurrent processing in multi-turn streaming video interaction and significantly improves long-range dependency modeling and inference efficiency. Abstract: Multimodal large language models (MLLMs) have shown strong performance on offline video understanding, but most are limited to offline inference or have weak online reasoning, making multi-turn interaction over continuously arriving video streams difficult. Existing streaming methods typically use an interleaved perception-generation paradigm, which prevents concurrent perception and generation and leads to early memory decay as streams grow, hurting long-range dependency modeling. We propose Think While Watching, a memory-anchored streaming video reasoning framework that preserves continuous segment-level memory during multi-turn interaction. We build a three-stage, multi-round chain-of-thought dataset and adopt a stage-matched training strategy, while enforcing strict causality through a segment-level streaming causal mask and streaming positional encoding. During inference, we introduce an efficient pipeline that overlaps watching and thinking and adaptively selects the best attention backend. Under both single-round and multi-round streaming input protocols, our method achieves strong results. Built on Qwen3-VL, it improves single-round accuracy by 2.6% on StreamingBench and by 3.79% on OVO-Bench. In the multi-round setting, it maintains performance while reducing output tokens by 56%. Code is available at: https://github.com/wl666hhh/Think_While_Watching/
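
The abstract does not spell out the segment-level streaming causal mask, so the following is only one plausible reading: a block-causal mask in which a token may attend to every token in its own segment and in earlier segments, but never to a segment that has not yet arrived. The segment assignment and the function name `segment_causal_mask` are hypothetical.

```python
def segment_causal_mask(segment_ids):
    """Build a boolean attention mask from per-token segment indices.
    mask[q][k] is True when query token q may attend to key token k,
    i.e. when k's segment is not later than q's segment."""
    return [
        [key_seg <= query_seg for key_seg in segment_ids]
        for query_seg in segment_ids
    ]

# Five tokens: two in segment 0, two in segment 1, one in segment 2.
mask = segment_causal_mask([0, 0, 1, 1, 2])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Compared with a token-level causal mask, this variant allows bidirectional attention inside a segment while still guaranteeing that nothing from a future segment leaks into the current one.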

[111] BackdoorIDS: Zero-shot Backdoor Detection for Pretrained Vision Encoder

Siquan Huang,Yijiang Li,Ningzhi Gao,Xingfu Yan,Leyu Shi

Main category: cs.CV

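TL;DR: This paper proposes BackdoorIDS, a zero-shot, inference-time method for detecting backdoor samples in pretrained vision encoders. It exploits the attention hijacking and restoration phenomenon, identifying backdoor samples from changes in the embedding sequence along an input masking trajectory using density-based clustering; it requires no retraining and works with many architectures.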

Details

Motivation: Downstream users often rely on third-party pretrained vision encoders of uncertain provenance and so face backdoor attack risks; most existing defenses need retraining or fail to generalize, so a zero-shot, plug-and-play inference-time detector is needed. Method: Based on the attention hijacking and restoration phenomenon, the input image is progressively masked and an image embedding is extracted at each masking ratio; the embedding sequence is clustered with a density-based algorithm such as DBSCAN, and the input is judged to be a backdoor sample if more than one cluster forms. Result: BackdoorIDS consistently outperforms existing defenses across attack types, datasets, and model families; it is zero-shot, needs no retraining, is plug-and-play, and is compatible with encoder architectures including CNNs, ViTs, CLIP, and LLaVA-1.5. Conclusion: BackdoorIDS is a simple, effective, and highly general zero-shot backdoor detection method that provides a reliable safeguard when deploying third-party vision encoders. Abstract: Self-supervised and multimodal vision encoders learn strong visual representations that are widely adopted in downstream vision tasks and large vision-language models (LVLMs). However, downstream users often rely on third-party pretrained encoders with uncertain provenance, exposing them to backdoor attacks. In this work, we propose BackdoorIDS, a simple yet effective zero-shot, inference-time backdoor sample detection method for pretrained vision encoders. BackdoorIDS is motivated by two observations: Attention Hijacking and Restoration. Under progressive input masking, a backdoored image initially concentrates attention on malicious trigger features. Once the masking ratio exceeds the trigger's robustness threshold, the trigger is deactivated, and attention rapidly shifts to benign content. This transition induces a pronounced change in the image embedding, whereas embeddings of clean images evolve more smoothly across masking progress. BackdoorIDS operationalizes this signal by extracting an embedding sequence along the masking trajectory and applying density-based clustering such as DBSCAN. An input is flagged as backdoored if its embedding sequence forms more than one cluster. Extensive experiments show that BackdoorIDS consistently outperforms existing defenses across diverse attack types, datasets, and model families. Notably, it is a plug-and-play approach that requires no retraining and operates fully zero-shot at inference time, making it compatible with a wide range of encoder architectures, including CNNs, ViTs, CLIP, and LLaVA-1.5.
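
The decision rule described above (cluster the embedding sequence and flag the input if more than one cluster appears) can be sketched as follows. The embeddings here are hand-made 2D points standing in for encoder outputs at increasing masking ratios, the DBSCAN implementation is a small standard-library version written for this example, and the `eps` and `min_samples` values are arbitrary assumptions rather than the paper's settings.

```python
import math

def dbscan(points, eps, min_samples):
    """Minimal DBSCAN; returns one label per point (-1 means noise)."""
    n = len(points)
    labels = [None] * n

    def neighbors(i):
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = -1                 # provisional noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster        # noise becomes a border point
            if labels[j] is not None:
                continue                   # already assigned, do not expand
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is a core point, keep growing
    return labels

def is_backdoored(embedding_sequence, eps=0.15, min_samples=2):
    """Flag the input if its masking-trajectory embeddings form > 1 cluster."""
    labels = dbscan(embedding_sequence, eps, min_samples)
    return len({label for label in labels if label != -1}) > 1

# Clean image: embeddings drift smoothly as the masking ratio grows.
clean = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (0.3, 0.0), (0.4, 0.0)]
# Backdoored image: embeddings jump once the trigger is masked out.
hijacked = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (3.0, 0.0), (3.1, 0.0), (3.2, 0.0)]
```

In practice the points would be high-dimensional embeddings from the encoder under test, and a library implementation of DBSCAN would normally replace the hand-written one.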

[112] Linking Perception, Confidence and Accuracy in MLLMs

Yuetian Du,Yucheng Wang,Rongyu Zhang,Zhijie Xu,Boyu Yang,Ming Kong,Jie Liu,Qiang Zhu

Main category: cs.CV

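TL;DR: This paper proposes Confidence-Driven Reinforcement Learning (CDRL) and Confidence-Aware Test-Time Scaling (CA-TTS) to address the confidence miscalibration that is widespread in multimodal large language models (MLLMs), and substantially improves their performance.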

Details

Motivation: Although existing MLLMs have made progress in visual perception accuracy, they lack awareness of their own uncertainty (that is, "knowing what they do not know") and suffer from severe confidence miscalibration. Method: Proposes Confidence-Driven Reinforcement Learning (CDRL), which uses original-noise image pairs and a novel confidence-based reward; further designs Confidence-Aware Test-Time Scaling (CA-TTS), in which an expert model dynamically coordinates the Self-Consistency, Self-Reflection, and Visual Self-Check modules. Result: Achieves consistent 8.8% gains across four benchmarks, setting a new state of the art; ablation studies confirm the effectiveness of each module and the scaling advantage. Conclusion: Confidence calibration is not only a reliability safeguard but also a "free lunch" that strengthens test-time reasoning, offering a new paradigm for trustworthy MLLM reasoning. Abstract: Recent advances in Multi-modal Large Language Models (MLLMs) have predominantly focused on enhancing visual perception to improve accuracy. However, a critical question remains unexplored: Do models know when they do not know? Through a probing experiment, we reveal a severe confidence miscalibration problem in MLLMs. To address this, we propose Confidence-Driven Reinforcement Learning (CDRL), which uses original-noise image pairs and a novel confidence-based reward to enhance perceptual sensitivity and robustly calibrate the model's confidence. Beyond training benefits, calibrated confidence enables more effective test-time scaling as a free lunch. We further propose Confidence-Aware Test-Time Scaling (CA-TTS), which dynamically coordinates Self-Consistency, Self-Reflection, and Visual Self-Check modules guided by confidence signals. An Expert Model acts in multiple roles (e.g., Planner, Critic, Voter) to schedule these modules and provide external verification. Our integrated framework establishes new state-of-the-art results with consistent 8.8% gains across four benchmarks. Further ablation studies demonstrate the effectiveness of each module and scaling superiority.

Haohua Chen,Tianze Zhou,Wei Zhu,Runqi Wang,Yandong Guan,Dejia Song,Yibo Chen,Xu Tang,Yao Hu,Lu Sheng,Zhiyong Wu

Main category: cs.CV

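TL;DR: This paper proposes PROMO, a framework built on a Flow Matching DiT backbone with latent multi-modal conditional concatenation that treats virtual try-on (VTON) as a structured image editing task. It achieves efficient, high-quality synthesis with respect to subject preservation, faithful texture transfer, and seamless harmonization, substantially improving inference efficiency and surpassing existing methods.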

Details

Motivation: Diffusion models can produce high-fidelity virtual try-on results, but their architectures are complex and sampling is slow, making it hard to balance quality and efficiency; meanwhile, the paired data from VTON can serve as a high-quality supervisory resource for general image editing. Method: Treats VTON as a structured image editing problem and proposes the PROMO framework: a Flow Matching DiT backbone with multi-modal conditional concatenation in latent space (e.g., person image, garment image, pose map), combined with a self-reference mechanism to improve conditioning efficiency and inference speed. Result: On standard benchmarks, PROMO surpasses prior VTON methods and general image editing models in visual fidelity while delivering a competitive balance between quality and speed. Conclusion: A flow-matching Transformer architecture, coupled with latent multi-modal conditioning and a self-reference acceleration strategy, is an effective route to high-quality VTON that is efficient in both training and inference, and it has the potential to transfer to general image editing. Abstract: Virtual Try-on (VTON) has become a core capability for online retail, where realistic try-on results provide reliable fit guidance, reduce returns, and benefit both consumers and merchants. Diffusion-based VTON methods achieve photorealistic synthesis, yet often rely on intricate architectures such as auxiliary reference networks and suffer from slow sampling, making the trade-off between fidelity and efficiency a persistent challenge. We approach VTON as a structured image editing problem that demands strong conditional generation under three key requirements: subject preservation, faithful texture transfer, and seamless harmonization. Under this perspective, our training framework is generic and transfers to broader image editing tasks. Moreover, the paired data produced by VTON constitutes a rich supervisory resource for training general-purpose editors. We present PROMO, a promptable virtual try-on framework built upon a Flow Matching DiT backbone with latent multi-modal conditional concatenation. By leveraging conditioning efficiency and self-reference mechanisms, our approach substantially reduces inference overhead. On standard benchmarks, PROMO surpasses both prior VTON methods and general image editing models in visual fidelity while delivering a competitive balance between quality and speed. These results demonstrate that flow-matching transformers, coupled with latent multi-modal conditioning and self-reference acceleration, offer an effective and training-efficient solution for high-quality virtual try-on.

[114] EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models

Xuanlang Dai,Yujie Zhou,Long Xing,Jiazi Bu,Xilin Wei,Yuhong Liu,Beichen Zhang,Kai Chen,Yuhang Zang

Main category: cs.CV

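TL;DR: This paper proposes the Endogenous Chain-of-Thought (EndoCoT) framework, which uses iterative thought guidance and terminal thought grounding modules to activate the reasoning ability of multimodal large language models (MLLMs) and dynamically align their deep reasoning process with the denoising steps of a diffusion transformer (DiT), substantially improving accuracy on complex tasks such as spatial reasoning.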

Details

Motivation: The prevailing paradigm of embedding an MLLM in a diffusion model as a text encoder has two major flaws: first, the MLLM's text encoding lacks sufficient reasoning depth and cannot activate chain-of-thought; second, the guidance signal stays static during decoding, which prevents the DiT from progressively decomposing complex instructions. Method: Proposes the EndoCoT framework: (1) an iterative thought guidance module that activates the MLLM's intrinsic reasoning ability by repeatedly refining latent thought states; (2) a terminal thought grounding module that aligns the final thought state with ground-truth answers so that the reasoning trajectory remains under textual supervision. Together they dynamically couple MLLM reasoning guidance with the DiT denoising process. Result: Achieves an average accuracy of 92.1% across spatial reasoning benchmarks including Maze, TSP, VSP, and Sudoku, 8.3 percentage points above the strongest baseline. Conclusion: EndoCoT overcomes the shallow reasoning and rigid guidance of MLLMs in diffusion models, demonstrating the feasibility and advantages of bringing deep, dynamic, groundable chain-of-thought into multimodal generation frameworks. Abstract: Recently, Multimodal Large Language Models (MLLMs) have been widely integrated into diffusion frameworks primarily as text encoders to tackle complex tasks such as spatial reasoning. However, this paradigm suffers from two critical limitations: (i) The MLLM text encoder exhibits insufficient reasoning depth. Single-step encoding fails to activate the Chain-of-Thought process, which is essential for MLLMs to provide accurate guidance for complex tasks. (ii) The guidance remains invariant during the decoding process. Invariant guidance during decoding prevents DiT from progressively decomposing complex instructions into actionable denoising steps, even with correct MLLM encodings. To this end, we propose Endogenous Chain-of-Thought (EndoCoT), a novel framework that first activates MLLMs' reasoning potential by iteratively refining latent thought states through an iterative thought guidance module, and then bridges these states to the DiT's denoising process. Second, a terminal thought grounding module is applied to ensure the reasoning trajectory remains grounded in textual supervision by aligning the final state with ground-truth answers. With these two components, the MLLM text encoder delivers meticulously reasoned guidance, enabling the DiT to execute it progressively and ultimately solve complex tasks in a step-by-step manner. Extensive evaluations across diverse benchmarks (e.g., Maze, TSP, VSP, and Sudoku) achieve an average accuracy of 92.1%, outperforming the strongest baseline by 8.3 percentage points.

[115] UCAN: Unified Convolutional Attention Network for Expansive Receptive Fields in Lightweight Super-Resolution

Cao Thien Tan,Phan Thi Thu Trang,Do Nghiem Duc,Ho Ngoc Anh,Hanyang Zhuang,Nguyen Duc Dung

Main category: cs.CV

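TL;DR: UCAN is a lightweight image super-resolution network that unifies convolution and attention, introduces Hedgehog Attention and a distillation-based large-kernel module, and uses cross-layer parameter sharing, substantially reducing computational cost while maintaining high accuracy.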

Details

Motivation: Existing hybrid CNN-Transformer architectures perform well in image super-resolution, but enlarging attention windows or convolution kernels significantly increases computational cost, making deployment on resource-constrained devices difficult. Method: Proposes the UCAN network: it combines window-based spatial attention with Hedgehog Attention to capture both local texture and long-range dependencies; designs a distillation-based large-kernel module to preserve high-frequency structure; and adopts cross-layer parameter sharing to reduce model complexity. Result: On Manga109 (4×), UCAN-L reaches 31.63 dB PSNR with only 48.4G MACs, outperforming recent lightweight models; on BSDS100 it reaches 27.79 dB, surpassing methods with far more parameters. Conclusion: UCAN achieves a better balance among accuracy, efficiency, and scalability, and is suited to practical high-resolution image restoration tasks. Abstract: Hybrid CNN-Transformer architectures achieve strong results in image super-resolution, but scaling attention windows or convolution kernels significantly increases computational cost, limiting deployment on resource-constrained devices. We present UCAN, a lightweight network that unifies convolution and attention to expand the effective receptive field efficiently. UCAN combines window-based spatial attention with a Hedgehog Attention mechanism to model both local texture and long-range dependencies, and introduces a distillation-based large-kernel module to preserve high-frequency structure without heavy computation. In addition, we employ cross-layer parameter sharing to further reduce complexity. On Manga109 ($4\times$), UCAN-L achieves 31.63 dB PSNR with only 48.4G MACs, surpassing recent lightweight models. On BSDS100, UCAN attains 27.79 dB, outperforming methods with significantly larger models. Extensive experiments show that UCAN achieves a superior trade-off between accuracy, efficiency, and scalability, making it well-suited for practical high-resolution image restoration.

[116] PolyCrysDiff: Controllable Generation of Three-Dimensional Computable Polycrystalline Material Structures

Chi Chen,Tianle Jiang,Xiaodong Wei,Yanming Wang

Main category: cs.CV

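TL;DR: This paper proposes PolyCrysDiff, a framework based on a conditional latent diffusion model that generates computable, controllable 3D polycrystalline microstructures. It outperforms MRF- and CNN-based methods, its physical validity is verified with CPFEM simulations, and it supports data-driven materials design.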

Details

Motivation: Realistic, controllable construction of the 3D microstructures of polycrystalline materials is crucial for revealing structure-property relationships, but it remains challenging. Method: Proposes PolyCrysDiff, a framework based on a conditional latent diffusion model that enables end-to-end generation of computable 3D polycrystalline microstructures. Result: PolyCrysDiff accurately reproduces target grain morphologies, orientation distributions, and 3D spatial correlations; it achieves an R² above 0.972 for control of grain attributes (e.g., size and sphericity), outperforming MRF- and CNN-based methods; CPFEM simulations verify its physical validity. Conclusion: PolyCrysDiff provides a key tool for accelerating data-driven optimization and design of polycrystalline materials. Abstract: The three-dimensional (3D) microstructures of polycrystalline materials exert a critical influence on their mechanical and physical properties. Realistic, controllable construction of these microstructures is a key step toward elucidating structure-property relationships, yet remains a formidable challenge. Herein, we propose PolyCrysDiff, a framework based on conditional latent diffusion that enables the end-to-end generation of computable 3D polycrystalline microstructures. Comprehensive qualitative and quantitative evaluations demonstrate that PolyCrysDiff faithfully reproduces target grain morphologies, orientation distributions, and 3D spatial correlations, while achieving an $R^2$ over 0.972 on grain attributes (e.g., size and sphericity) control, thereby outperforming mainstream approaches such as Markov random field (MRF)- and convolutional neural network (CNN)-based methods. The computability and physical validity of the generated microstructures are verified through a series of crystal plasticity finite element method (CPFEM) simulations. Leveraging PolyCrysDiff's controllable generative capability, we systematically elucidate how grain-level microstructural characteristics affect the mechanical properties of polycrystalline materials. This development is expected to be a key step toward accelerated, data-driven optimization and design of polycrystalline materials.

[117] COTONET: A custom cotton detection algorithm based on YOLO11 for stage of growth cotton boll detection

Guillem González,Guillem Alenyà,Sergi Foix

Main category: cs.CV

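TL;DR: This paper proposes COTONET, an enhanced custom YOLO11 model that incorporates several attention and feature-reassembly modules (SE blocks, CARAFE, SimAM, and PHAM) to improve detection accuracy for cotton bolls at different growth stages. It balances light weight with robustness and is suitable for edge devices and agricultural robots.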

Details

Motivation: Mechanical handling during cotton harvesting can degrade the fibre, so harvesting must emulate gentle manual grasping; automated harvesting depends on accurately recognising cotton bolls across multiple growth stages, and existing methods detect difficult samples poorly. Method: Proposes the COTONET model: built on the YOLO11 architecture, it replaces convolutional blocks with Squeeze-and-Excitation blocks, redesigns the backbone to integrate attention mechanisms, substitutes CARAFE for standard upsampling, and introduces SimAM and PHAM in the backbone and neck paths, respectively, for multi-dimensional attention aggregation; gradients are incorporated in non-learnable operations to support end-to-end training. Result: With 7.6M parameters and 27.8 GFLOPS, COTONET stays lightweight while reaching mAP50 = 81.1% and mAP50-95 = 60.6%, outperforming standard YOLO baselines. Conclusion: Through structured attention design, COTONET improves the robustness and accuracy of cotton boll detection in complex field scenes, offering a viable technical path to intelligent cotton harvesting on low-resource edge platforms. Abstract: Cotton harvesting is a critical phase where cotton capsules are physically manipulated and can lead to fibre degradation. To maintain the highest quality, harvesting methods must emulate delicate manual grasping, to preserve cotton's intrinsic properties. Automating this process requires systems capable of recognising cotton capsules across various phenological stages. To address this challenge, we propose COTONET, an enhanced custom YOLO11 model tailored with attention mechanisms to improve the detection of difficult instances. The architecture incorporates gradients in non-learnable operations to enhance shape and feature extraction. Key architectural modifications include: the replacement of convolutional blocks with Squeeze-and-Excitation blocks, a redesigned backbone integrating attention mechanisms, and the substitution of standard upsampling operations for Content Aware Reassembly of Features (CARAFE). Additionally, we integrate Simple Attention Modules (SimAM) for primary feature aggregation and Parallel Hybrid Attention Mechanisms (PHAM) for channel-wise, spatial-wise and coordinate-wise attention in the downward neck path. This configuration offers increased flexibility and robustness for interpreting the complexity of cotton crop growth. COTONET aligns with small-to-medium YOLO models utilizing 7.6M parameters and 27.8 GFLOPS, making it suitable for low-resource edge computing and mobile robotics. COTONET outperforms the standard YOLO baselines, achieving a mAP50 of 81.1% and a mAP50-95 of 60.6%.

[118] Cross-Resolution Attention Network for High-Resolution PM2.5 Prediction

Ammar Kheder,Helmi Toropainen,Wenqing Peng,Samuel Antão,Zhi-Song Liu,Michael Boy

Main category: cs.CV

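TL;DR: CRAN-PM is a dual-branch Vision Transformer that uses cross-resolution attention to efficiently fuse global meteorological data (25 km) with local high-resolution PM2.5 data (1 km), and introduces elevation-aware self-attention and wind-guided cross-attention to improve physical consistency and prediction accuracy. It generates a 29-million-pixel European air-quality map in 1.8 seconds on a single GPU and reduces RMSE by 4.7% at T+1 and 10.7% at T+3 relative to the best single-scale baseline.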

Details

Motivation: Vision Transformers perform well in spatio-temporal prediction but face a scalability bottleneck in ultra-high-resolution, continent-scale environmental monitoring; for example, a European air-quality map at 1 km resolution contains 29 million pixels, far beyond what naive self-attention can handle. Method: Proposes CRAN-PM, a dual-branch Vision Transformer that uses cross-resolution attention to fuse 25 km meteorological data with 1 km current-time PM2.5 data; introduces elevation-aware self-attention and wind-guided cross-attention so that the model learns physically consistent feature representations; the overall architecture is fully trainable and memory-efficient. Result: On daily PM2.5 forecasting across Europe in 2022 (362 days, 2,971 EEA stations), RMSE falls by 4.7% at T+1 and 10.7% at T+3 relative to the best single-scale baseline, and bias in complex terrain falls by 36%; a single GPU generates the complete 29-million-pixel European map in 1.8 seconds. Conclusion: By embedding physical information and modeling across resolutions, CRAN-PM improves the accuracy and physical consistency of very-large-scale environmental prediction while keeping inference efficient, offering a new paradigm for continent-scale, high-resolution environmental modeling. Abstract: Vision Transformers have achieved remarkable success in spatio-temporal prediction, but their scalability remains limited for ultra-high-resolution, continent-scale domains required in real-world environmental monitoring. A single European air-quality map at 1 km resolution comprises 29 million pixels, far beyond the limits of naive self-attention. We introduce CRAN-PM, a dual-branch Vision Transformer that leverages cross-resolution attention to efficiently fuse global meteorological data (25 km) with local high-resolution PM2.5 at the current time (1 km). Instead of including physically driven factors like temperature and topography as input, we further introduce elevation-aware self-attention and wind-guided cross-attention to force the network to learn physically consistent feature representations for PM2.5 forecasting. CRAN-PM is fully trainable and memory-efficient, generating the complete 29-million-pixel European map in 1.8 seconds on a single GPU. Evaluated on daily PM2.5 forecasting throughout Europe in 2022 (362 days, 2,971 European Environment Agency (EEA) stations), it reduces RMSE by 4.7% at T+1 and 10.7% at T+3 compared to the best single-scale baseline, while reducing bias in complex terrain by 36%.
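
To make the "far beyond the limits of naive self-attention" claim concrete, the back-of-the-envelope calculation below estimates the size of a single dense attention matrix over a 29-million-pixel map. The tokenizations (one token per pixel, or one token per 16×16 patch), the 2-byte entries, and the helper name `attention_matrix_bytes` are assumptions made for illustration; the abstract does not state how CRAN-PM tokenizes its input.

```python
def attention_matrix_bytes(num_tokens, bytes_per_entry=2):
    """Memory for one dense num_tokens x num_tokens attention matrix.

    bytes_per_entry=2 assumes half-precision storage (an assumption).
    """
    return num_tokens ** 2 * bytes_per_entry


pixels = 29_000_000  # map size quoted in the abstract

# Assumption 1: every pixel is its own token.
per_pixel_bytes = attention_matrix_bytes(pixels)

# Assumption 2: ViT-style 16x16 patches, ignoring the partial patch.
patch_tokens = pixels // (16 * 16)
per_patch_bytes = attention_matrix_bytes(patch_tokens)

print(f"pixel tokens: {per_pixel_bytes / 1e15:.2f} PB per attention matrix")
print(f"patch tokens: {per_patch_bytes / 1e9:.1f} GB per attention matrix")
```

Under these assumptions a per-pixel attention matrix needs about 1.68 PB, and even 16×16 patching leaves roughly 25.7 GB per head and layer, which is why a cross-resolution design is attractive at this scale.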

[119] VTEdit-Bench: A Comprehensive Benchmark for Multi-Reference Image Editing Models in Virtual Try-On

Xiaoye Liang,Zhiyuan Qu,Mingye Zou,Jiaxin Liu,Lai Jiang,Mai Xu,Yiheng Zhu

Main category: cs.CV

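TL;DR: This paper introduces the VTEdit-Bench benchmark and the VTEdit-QA evaluator to systematically assess universal multi-reference image editing models on virtual try-on (VTON) tasks. It finds that top universal editors are competitive with specialized models on conventional tasks and generalize more stably to complex scenarios, but still struggle with complex reference configurations such as multi-cloth conditioning.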

Details

Motivation: Existing specialized VTON models struggle to meet the growing demands of real-world scenarios, while universal multi-reference image editing models show strong generalization; however, their strengths and limitations on VTON tasks have not been systematically evaluated, mainly because no suitable benchmark exists. Method: Builds VTEdit-Bench, a comprehensive benchmark of 24,220 test image pairs covering five representative VTON tasks; proposes VTEdit-QA, an automatic evaluator based on a reference-aware vision-language model that quantifies performance in terms of model consistency, cloth consistency, and overall image quality; systematically evaluates eight universal editing models and seven specialized VTON models. Result: Top universal editing models perform close to specialized models on conventional VTON tasks and generalize more stably to harder scenarios, but their performance drops markedly under complex reference configurations such as multi-cloth conditioning. Conclusion: Universal multi-reference image editing models have the potential to replace specialized VTON models, with a clear advantage in generalization, but their modeling of complex references (such as multiple garments) needs further improvement; VTEdit-Bench and VTEdit-QA provide a reliable evaluation framework for future research. Abstract: As virtual try-on (VTON) continues to advance, a growing number of real-world scenarios have emerged, pushing beyond the ability of the existing specialized VTON models. Meanwhile, universal multi-reference image editing models have progressed rapidly and exhibit strong generalization in visual editing, suggesting a promising route toward more flexible VTON systems. However, despite their strong capabilities, the strengths and limitations of universal editors for VTON remain insufficiently explored due to the lack of systematic evaluation benchmarks. To address this gap, we introduce VTEdit-Bench, a comprehensive benchmark designed to evaluate universal multi-reference image editing models across various realistic VTON scenarios. VTEdit-Bench contains 24,220 test image pairs spanning five representative VTON tasks with progressively increasing complexity, enabling systematic analysis of robustness and generalization. We further propose VTEdit-QA, a reference-aware VLM-based evaluator that assesses VTON performance from three key aspects: model consistency, cloth consistency, and overall image quality. Through this framework, we systematically evaluate eight universal editing models and compare them with seven specialized VTON models. Results show that top universal editors are competitive on conventional tasks and generalize more stably to harder scenarios, but remain challenged by complex reference configurations, particularly multi-cloth conditioning.

[120] SoulX-LiveAct: Towards Hour-Scale Real-Time Human Animation with Neighbor Forcing and ConvKV Memory

Dingcheng Zhen,Xu Zheng,Ruixin Zhang,Zhiqi Jiang,Yichao Yan,Ming Tao,Shunshun Yin

Main category: cs.CV

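TL;DR: This paper proposes Neighbor Forcing and a ConvKV memory mechanism to address inconsistent learning signals and unbounded growth of historical state in autoregressive diffusion models for long-horizon human animation, enabling hour-scale real-time generation and efficient inference.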

Details

Motivation: Existing autoregressive diffusion models face two major challenges in hour-scale real-time human animation: first, forcing strategies lead to mismatched diffusion states and unstable learning signals; second, historical representations grow without bound and lack structure, making cached states hard to reuse efficiently. Method: Proposes Neighbor Forcing, a diffusion-step-consistent autoregressive formulation that propagates temporally adjacent frames as latent neighbors under the same noise condition; and designs a structured ConvKV memory mechanism that compresses the keys and values in causal attention into a fixed-length representation. Result: LiveAct achieves hour-scale real-time human animation and 20 FPS streaming inference on two H100 or H200 GPUs, reaching state-of-the-art lip-sync accuracy, animation quality, and emotional expressiveness at the lowest inference cost. Conclusion: Neighbor Forcing and ConvKV together improve training convergence, long-horizon generation quality, and inference efficiency, offering a new paradigm for making AR diffusion models practical for long video generation. Abstract: Autoregressive (AR) diffusion models offer a promising framework for sequential generation tasks such as video synthesis by combining diffusion modeling with causal inference. Although they support streaming generation, existing AR diffusion methods struggle to scale efficiently. In this paper, we identify two key challenges in hour-scale real-time human animation. First, most forcing strategies propagate sample-level representations with mismatched diffusion states, causing inconsistent learning signals and unstable convergence. Second, historical representations grow unbounded and lack structure, preventing effective reuse of cached states and severely limiting inference efficiency. To address these challenges, we propose Neighbor Forcing, a diffusion-step-consistent AR formulation that propagates temporally adjacent frames as latent neighbors under the same noise condition. This design provides a distribution-aligned and stable learning signal while preserving drifting throughout the AR chain. Building upon this, we introduce a structured ConvKV memory mechanism that compresses the keys and values in causal attention into a fixed-length representation, enabling constant-memory inference and truly infinite video generation without relying on short-term motion-frame memory. Extensive experiments demonstrate that our approach significantly improves training convergence, hour-scale generation quality, and inference efficiency compared to existing AR diffusion methods. Numerically, LiveAct enables hour-scale real-time human animation and supports 20 FPS real-time streaming inference on as few as two NVIDIA H100 or H200 GPUs. Quantitative results demonstrate that our method attains state-of-the-art performance in lip-sync accuracy, human animation quality, and emotional expressiveness, with the lowest inference cost.
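
The idea of a fixed-length key/value memory that keeps inference memory constant can be illustrated with the toy buffer below. It is only a sketch of the general principle: the abstract does not specify the ConvKV compression operator, so the class name `FixedLengthKVMemory` and the rule of averaging the two oldest entries are assumptions standing in for the learned, convolution-based compression.

```python
class FixedLengthKVMemory:
    """Toy key/value cache whose length never exceeds `capacity`.

    Hypothetical stand-in for a structured KV memory: when the buffer
    overflows, the two oldest entries are merged by element-wise averaging.
    """

    def __init__(self, capacity):
        if capacity < 2:
            raise ValueError("capacity must be at least 2")
        self.capacity = capacity
        self.keys = []
        self.values = []

    def append(self, key, value):
        """Add the newest key/value pair, compressing history if needed."""
        self.keys.append(list(key))
        self.values.append(list(value))
        if len(self.keys) > self.capacity:
            self._compress_oldest()

    def _compress_oldest(self):
        # Merge the two oldest slots so recent frames keep full resolution.
        for buf in (self.keys, self.values):
            merged = [(a + b) / 2 for a, b in zip(buf[0], buf[1])]
            buf[0:2] = [merged]


memory = FixedLengthKVMemory(capacity=4)
for step in range(100):
    # One-dimensional dummy vectors; a real cache would hold per-head tensors.
    memory.append([float(step)], [float(step)])
```

After 100 appended steps the buffer still holds exactly four slots, and the most recent key is stored uncompressed, so attention cost per step does not grow with the length of the generated video.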

[121] Controllable Egocentric Video Generation via Occlusion-Aware Sparse 3D Hand Joints

Chenyangguang Zhang,Botao Ye,Boqi Chen,Alexandros Delitzas,Fangjinhua Wang,Marc Pollefeys,Xi Wang

Main category: cs.CV

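TL;DR: This paper proposes a motion-controllable video generation framework driven by sparse 3D hand joints, addressing the 3D inconsistency, artifacts, and poor cross-embodiment generalization that occlusion causes for existing methods in egocentric scenes.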

Details

Motivation: Existing methods rely on 2D trajectories or implicit poses; under severe egocentric occlusion they tend to produce inconsistent motion and hallucinated artifacts, and they generalize poorly to non-human embodiments such as robotic hands. Method: Proposes a new framework that takes a single reference frame as input and uses sparse 3D hand joints as embodiment-agnostic control signals; designs an efficient control module that penalizes unreliable visual signals from occluded joints, applies a 3D-based weighting mechanism to handle dynamic occlusion, and injects 3D geometric embeddings directly into the latent space to enforce structural consistency; builds an automatically annotated dataset of over one million clips and a cross-embodiment benchmark. Result: Significantly outperforms state-of-the-art methods on egocentric video generation, producing high-fidelity videos with realistic interactions and showing strong cross-embodiment generalization (e.g., transfer to robotic hands). Conclusion: Sparse 3D joints, as embodiment-agnostic control signals with clear semantic and geometric structure, combined with occlusion-aware feature extraction and 3D geometric embeddings, improve the 3D consistency and generalization of egocentric video generation. Abstract: Motion-controllable video generation is crucial for egocentric applications in virtual reality and embodied AI. However, existing methods often struggle to achieve 3D-consistent fine-grained hand articulation. By relying on 2D trajectories or implicit poses, they collapse 3D geometry into spatially ambiguous signals or over-rely on human-centric priors. Under severe egocentric occlusions, this causes motion inconsistencies and hallucinated artifacts, and prevents cross-embodiment generalization to robotic hands. To address these limitations, we propose a novel framework that generates egocentric videos from a single reference frame, leveraging sparse 3D hand joints as embodiment-agnostic control signals with clear semantic and geometric structures. We introduce an efficient control module that resolves occlusion ambiguities while fully preserving 3D information. Specifically, it extracts occlusion-aware features from the source reference frame by penalizing unreliable visual signals from hidden joints, and employs a 3D-based weighting mechanism to robustly handle dynamically occluded target joints during motion propagation. Concurrently, the module directly injects 3D geometric embeddings into the latent space to strictly enforce structural consistency. To facilitate robust training and evaluation, we develop an automated annotation pipeline that yields over one million high-quality egocentric video clips paired with precise hand trajectories. Additionally, we register humanoid kinematic and camera data to construct a cross-embodiment benchmark. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art baselines, generating high-fidelity egocentric videos with realistic interactions and exhibiting exceptional cross-embodiment generalization to robotic hands.

[122] HELM: Hierarchical and Explicit Label Modeling with Graph Learning for Multi-Label Image Classification

Marjan Stoimchev,Boshko Koloski,Jurica Levatić,Dragi Kocev,Sašo Džeroski

Main category: cs.CV

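TL;DR: This paper proposes the HELM framework, which uses hierarchy-specific class tokens, graph convolutional networks, and a self-supervised branch to model multi-path hierarchies and exploit unlabeled data in remote sensing imagery, achieving state-of-the-art performance on multiple datasets.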

Details

Motivation: Existing methods struggle with multi-path hierarchies in remote sensing imagery, where an instance belongs to multiple branches, and rarely exploit unlabeled data. Method: The HELM framework has three parts: (i) hierarchy-specific class tokens within a Vision Transformer to capture fine-grained label interactions; (ii) graph convolutional networks that explicitly encode the hierarchical structure and produce hierarchy-aware embeddings; (iii) a self-supervised branch that makes effective use of unlabeled remote sensing imagery. Result: On four remote sensing image datasets (UCM, AID, DFC-15, and MLRSNet), HELM outperforms strong baselines in both supervised and semi-supervised settings, with particular strength in low-label scenarios. Conclusion: HELM addresses the key challenges of multi-path hierarchy modeling and unlabeled-data use in HMLC, offering a new paradigm for remote sensing image classification. Abstract: Hierarchical multi-label classification (HMLC) is essential for modeling complex label dependencies in remote sensing. Existing methods, however, struggle with multi-path hierarchies where instances belong to multiple branches, and they rarely exploit unlabeled data. We introduce HELM (Hierarchical and Explicit Label Modeling), a novel framework that overcomes these limitations. HELM: (i) uses hierarchy-specific class tokens within a Vision Transformer to capture nuanced label interactions; (ii) employs graph convolutional networks to explicitly encode the hierarchical structure and generate hierarchy-aware embeddings; and (iii) integrates a self-supervised branch to effectively leverage unlabeled imagery. We perform a comprehensive evaluation on four remote sensing image (RSI) datasets (UCM, AID, DFC-15, MLRSNet). HELM achieves state-of-the-art performance, consistently outperforming strong baselines in both supervised and semi-supervised settings, demonstrating particular strength in low-label scenarios.
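
Component (ii) of HELM encodes the label hierarchy with graph convolutional networks. The sketch below shows one standard GCN propagation step, ReLU(Â H W) with the symmetrically normalized adjacency Â = D^(-1/2) (A + I) D^(-1/2), applied to a small multi-path hierarchy. It is a generic illustration, not HELM's architecture: the four-label graph, the identity features and weights, and all function names are assumptions.

```python
import math


def normalized_adjacency(num_labels, edges):
    """Return D^(-1/2) (A + I) D^(-1/2) for an undirected label graph."""
    a = [[1.0 if i == j else 0.0 for j in range(num_labels)]
         for i in range(num_labels)]  # start from I (self-loops)
    for u, v in edges:
        a[u][v] = a[v][u] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(num_labels)]
            for i in range(num_labels)]


def matmul(x, y):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def gcn_layer(a_hat, features, weights):
    """One GCN step: ReLU(a_hat @ features @ weights)."""
    out = matmul(matmul(a_hat, features), weights)
    return [[max(0.0, v) for v in row] for row in out]


# Hypothetical multi-path hierarchy: label 3 has two parents (1 and 2),
# which both descend from the root label 0.
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
a_hat = normalized_adjacency(4, edges)
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
label_embeddings = gcn_layer(a_hat, identity, identity)
```

After one step, the embedding of label 3 mixes information from both of its parents, which is how a graph layer can represent an instance that sits on several branches at once.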

[123] Locating Demographic Bias at the Attention-Head Level in CLIP's Vision Encoder

Alaa Yasser,Kittipat Phunjanna,Marcos Escudero Viñolo,Catarina Barata,Jenny Benois-Pineau

Main category: cs.CV

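TL;DR: This paper proposes a mechanistic fairness audit that combines projected residual-stream decomposition, zero-shot Concept Activation Vectors, and bias-augmented TextSpan analysis to locate demographic bias at the level of individual attention heads in vision transformers, and shows on the CLIP ViT-L-14 model that gender and age bias differ in how localisable they are.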

Details

Motivation: Standard fairness audits can only quantify whether a model is biased, not where the bias resides inside the network; this paper aims to localise bias mechanistically and at fine granularity (e.g., individual attention heads). Method: Combines projected residual-stream decomposition, zero-shot Concept Activation Vectors, and bias-augmented TextSpan analysis to carry out a mechanistic audit of gender and age bias in the CLIP ViT-L-14 encoder on 42 profession classes of the FACET benchmark. Result: Identifies four terminal-layer attention heads whose ablation reduces gender bias (Cramer's V falls from 0.381 to 0.362, with accuracy rising by 0.42%), an effect specific to those heads; age bias, by contrast, appears to be encoded more diffusely, with weaker and less consistent ablation effects. Conclusion: Head-level bias localisation is feasible in discriminative vision encoders, but the degree of localisability varies across protected attributes (e.g., gender vs. age). Abstract: Standard fairness audits of foundation models quantify that a model is biased, but not where inside the network the bias resides. We propose a mechanistic fairness audit that combines projected residual-stream decomposition, zero-shot Concept Activation Vectors, and bias-augmented TextSpan analysis to locate demographic bias at the level of individual attention heads in vision transformers. As a feasibility case study, we apply this pipeline to the CLIP ViT-L-14 encoder on 42 profession classes of the FACET benchmark, auditing both gender and age bias. For gender, the pipeline identifies four terminal-layer heads whose ablation reduces global bias (Cramer's V: 0.381 -> 0.362) while marginally improving accuracy (+0.42%); a layer-matched random control confirms that this effect is specific to the identified heads. A single head in the final layer contributes to the majority of the reduction in the most stereotyped classes, and class-level analysis shows that corrected predictions shift toward the correct occupation. For age, the same pipeline identifies candidate heads, but ablation produces weaker and less consistent effects, suggesting that age bias is encoded more diffusely than gender bias in this model. These results provide preliminary evidence that head-level bias localisation is feasible for discriminative vision encoders and that the degree of localisability may vary across protected attributes. Keywords: Bias, CLIP, Mechanistic Interpretability, Vision Transformer, Fairness
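
The bias metric quoted above, Cramer's V, measures the strength of association between two categorical variables: V = sqrt(chi² / (n · min(r − 1, c − 1))) for an r × c contingency table with n observations. The sketch below computes it from scratch. The function name and the toy tables are illustrative assumptions and do not reproduce the paper's data; every row and column is assumed to have a non-zero total.

```python
import math


def cramers_v(table):
    """Cramer's V for a contingency table given as a list of rows.

    Assumes every row and column total is non-zero.
    """
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]

    # Pearson chi-squared statistic against the independence expectation.
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi2 / (n * k))


# Toy tables (hypothetical): rows are a protected attribute, columns are
# predicted classes.
independent = [[10, 10], [10, 10]]  # predictions unrelated to the attribute
fully_biased = [[20, 0], [0, 20]]   # predictions determined by the attribute
```

A value of 0 means predictions are statistically independent of the attribute and 1 means they are fully determined by it, so the reported drop from 0.381 to 0.362 corresponds to a modest weakening of that association.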

[124] Intrinsic Concept Extraction Based on Compositional Interpretability

Hanyu Shi,Hong Tao,Guoheng Huang,Jianbin Jiang,Xuhang Chen,Chi-Man Pun,Shanhu Wang,Pan Pan

Main category: cs.CV

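TL;DR: This paper introduces a new task, CI-ICE, which aims to extract composable, interpretable intrinsic concepts from a single image, and proposes HyperExpress, a method that uses hyperbolic-space modeling and concept-wise optimization to disentangle concepts and reconstruct them compositionally.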

Details

Motivation: Existing unsupervised concept extraction methods cannot extract composable intrinsic concepts, which limits the interpretability and reconstruction ability of the extracted concepts. Method: Proposes HyperExpress: (1) it uses the hierarchical modeling capability of hyperbolic space to disentangle concepts while preserving their hierarchical and relational structure; (2) it introduces concept-wise optimization that maps the concept embedding space to maintain complex inter-concept relationships and ensure composability. Result: Extracts composable and interpretable intrinsic concepts from a single image, with strong reported performance. Conclusion: The CI-ICE task and the HyperExpress method offer a new paradigm for image concept understanding, improving the composability, interpretability, and reconstruction ability of concepts. Abstract: Unsupervised Concept Extraction aims to extract concepts from a single image; however, existing methods suffer from the inability to extract composable intrinsic concepts. To address this, this paper introduces a new task called Compositional and Interpretable Intrinsic Concept Extraction (CI-ICE). The CI-ICE task aims to leverage diffusion-based text-to-image models to extract composable object-level and attribute-level concepts from a single image, such that the original concept can be reconstructed through the combination of these concepts. To achieve this goal, we propose a method called HyperExpress, which addresses the CI-ICE task through two core aspects. Specifically, first, we propose a concept learning approach that leverages the inherent hierarchical modeling capability of hyperbolic space to achieve accurate concept disentanglement while preserving the hierarchical structure and relational dependencies among concepts; second, we introduce a concept-wise optimization method that maps the concept embedding space to maintain complex inter-concept relationships while ensuring concept composability. Our method demonstrates outstanding performance in extracting compositionally interpretable intrinsic concepts from a single image.
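
The appeal of hyperbolic space for hierarchies can be seen from its distance function. The sketch below uses the Poincaré ball model, where d(u, v) = arcosh(1 + 2‖u − v‖² / ((1 − ‖u‖²)(1 − ‖v‖²))). The choice of the Poincaré model and the example points are assumptions for illustration; the abstract says only that HyperExpress works in hyperbolic space.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball."""
    def squared_norm(x):
        return sum(c * c for c in x)

    diff = squared_norm([a - b for a, b in zip(u, v)])
    denom = (1.0 - squared_norm(u)) * (1.0 - squared_norm(v))
    return math.acosh(1.0 + 2.0 * diff / denom)


# Two pairs with the same Euclidean gap (0.05) at different radii.
near_origin = poincare_distance((0.0, 0.0), (0.05, 0.0))
near_boundary = poincare_distance((0.9, 0.0), (0.95, 0.0))
```

The same Euclidean gap is far longer near the boundary than near the origin, so general concepts can sit near the centre while many fine-grained attributes spread out toward the rim without crowding each other.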

[125] OSM-based Domain Adaptation for Remote Sensing VLMs

Stefan Maria Ailuro,Mario Markov,Mohammad Mahdi,Delyan Boychev,Luc Van Gool,Danda Pani Paudel

Main category: cs.CV

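TL;DR: This paper proposes OSMDA, a framework in which a base vision-language model (VLM) uses OpenStreetMap (OSM) data to generate high-quality pseudo-labels automatically, enabling remote sensing domain adaptation with no manual annotation and no large teacher model.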

Details

Motivation: Remote sensing lacks high-quality paired image-text annotations; existing pseudo-labeling methods depend on large teacher models, which is costly, scales poorly, and caps performance at the teacher's ability. Method: Proposes OSMDA: using the base VLM's own OCR and chart comprehension capabilities, it pairs aerial images with rendered OSM tiles to generate captions rich in geographic metadata; the model is then fine-tuned on this self-built corpus using satellite imagery alone, yielding OSMDA-VLM. Result: Evaluated on 10 remote sensing benchmarks spanning image-text-to-text tasks and compared against 9 strong baselines; when mixed equally with real data it achieves state-of-the-art results at substantially lower training cost than teacher-dependent methods. Conclusion: Starting from a strong foundation model and aligning it with crowd-sourced geographic data such as OSM is a practical, scalable path to remote sensing domain adaptation. Abstract: Vision-Language Models (VLMs) adapted to remote sensing rely heavily on domain-specific image-text supervision, yet high-quality annotations for satellite and aerial imagery remain scarce and expensive to produce. Prevailing pseudo-labeling pipelines address this gap by distilling knowledge from large frontier models, but this dependence on large teachers is costly, limits scalability, and caps achievable performance at the ceiling of the teacher. We propose OSMDA: a self-contained domain adaptation framework that eliminates this dependency. Our key insight is that a capable base VLM can serve as its own annotation engine: by pairing aerial images with rendered OpenStreetMap (OSM) tiles, we leverage optical character recognition and chart comprehension capabilities of the model to generate captions enriched by OSM's vast auxiliary metadata. The model is then fine-tuned on the resulting corpus with satellite imagery alone, yielding OSMDA-VLM, a domain-adapted VLM that requires no manual labeling and no stronger external model. We conduct exhaustive evaluations spanning 10 benchmarks across image-text-to-text tasks and comparing against 9 competitive baselines. When equally mixed with real data, our method achieves state-of-the-art results, while being substantially cheaper to train than teacher-dependent alternatives. These results suggest that, given a strong foundation model, alignment with crowd-sourced geographic data is a practical and scalable path towards remote sensing domain adaptation. Dataset and model weights will be made publicly available.

[126] CEI-3D: Collaborative Explicit-Implicit 3D Reconstruction for Realistic and Fine-Grained Object Editing

Yue Shi,Rui Shi,Yuxuan Xiong,Bingbing Ni,Wenjun Zhang

Main category: cs.CV

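TL;DR: This paper proposes CEI-3D, an editing-oriented 3D reconstruction pipeline that combines collaborative explicit-implicit reconstruction with physical property disentanglement to achieve more realistic, fine-grained, and efficient 3D editing.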

Details

Motivation: Because their reconstruction networks are tightly coupled, existing 3D editing methods produce unrealistic, coarse results and lack fine, controllable editing of local regions. Method: Proposes collaborative explicit-implicit reconstruction (an implicit SDF network plus differentially sampled, locally controllable handler points); designs a physical properties disentangling module (decoupling the color of the handler points into separate physical properties) with a dual-diffuse-albedo network (separate branches for edited and non-edited regions); introduces a spatial-aware editing module that uses cross-view propagation-based 3D segmentation for part-level editing. Result: Experiments on real and synthetic datasets show more realistic, finer-grained editing results than SOTA methods with less editing time. Conclusion: Through the collaborative explicit-implicit representation and property disentanglement, CEI-3D decouples global structure from local edits and improves the realism, controllability, and efficiency of 3D editing. Abstract: Existing 3D editing methods often produce unrealistic and unrefined results due to the deeply integrated nature of their reconstruction networks. To address the challenge, this paper introduces CEI-3D, an editing-oriented reconstruction pipeline designed to facilitate realistic and fine-grained editing. Specifically, we propose a collaborative explicit-implicit reconstruction approach, which represents the target object using an implicit SDF network and a differentially sampled, locally controllable set of handler points. The implicit network provides a smooth and continuous geometry prior, while the explicit handler points offer localized control, enabling mutual guidance between the global 3D structure and user-specified local editing regions. To independently control each attribute of the handler points, we design a physical properties disentangling module to decouple the color of the handler points into separate physical properties. We also propose a dual-diffuse-albedo network in this module to process the edited and non-edited regions through separate branches, thereby preventing undesired interference from editing operations. Building on the reconstructed collaborative explicit-implicit representation with disentangled properties, we introduce a spatial-aware editing module that enables part-wise adjustment of relevant handler points. This module employs a cross-view propagation-based 3D segmentation strategy, which helps users to edit the specified physical attributes of a target part efficiently. Extensive experiments on both real and synthetic datasets demonstrate that our approach achieves more realistic and fine-grained editing results than the state-of-the-art (SOTA) methods while requiring less editing time. Our code is available at https://github.com/shiyue001/CEI-3D.

[127] Multimodal classification of Radiation-Induced Contrast Enhancements and tumor recurrence using deep learning

Robin Peretzke,Marlin Hanstein,Maximilian Fischer,Lars Badhi Wessel,Obada Alhalabi,Sebastian Regnery,Andreas Kudak,Maximilian Deng,Tanja Eichkorn,Philipp Hoegen Saßmannshausen,Fabian Allmendinger,Jan-Hendrik Bolten,Philipp Schröter,Christine Jungk,Jürgen Peter Debus,Peter Neher,Laila König,Klaus Maier-Hein

Main category: cs.CV

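TL;DR: This paper proposes RICE-NET, a model that combines longitudinal MRI with radiotherapy dose maps and uses conventional T1-weighted MRI data to automatically distinguish tumor recurrence from radiation-induced contrast enhancement in post-treatment glioblastoma, reaching an F1 score of 0.92 and confirming that the radiation map is crucial for classification.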

Details

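Motivation: Distinguishing tumor recurrence from radiation-induced contrast enhancement after glioblastoma treatment is clinically difficult; existing methods either rely on scarcely available diffusion MRI or do not incorporate radiotherapy dose maps, which are gaining attention in tumor boards. Method: Proposes RICE-NET, a multimodal 3D deep learning model that fuses longitudinal MRI with radiotherapy dose distributions and classifies lesions using only conventional T1-weighted MRI; the study uses a cohort of 92 patients and includes ablation experiments and occlusion-based interpretability analyses. Result: F1 score of 0.92 on an independent test set; ablations confirm that the radiation map plays the dominant role in classification performance; occlusion analysis shows the model attends to clinically relevant regions. Conclusion: Multimodal deep learning, especially when it integrates radiotherapy dose maps, has the potential to improve diagnostic accuracy and clinical decision support in neuro-oncology. Abstract: The differentiation between tumor recurrence and radiation-induced contrast enhancements in post-treatment glioblastoma patients remains a major clinical challenge. Existing approaches rely on diffusion MRI, which is only sparsely available in clinical practice, or do not consider radiation maps, which are gaining increasing interest in the tumor board for this differentiation. We introduce RICE-NET, a multimodal 3D deep learning model that integrates longitudinal MRI data with radiotherapy dose distributions for automated lesion classification using conventional T1-weighted MRI data. Using a cohort of 92 patients, the model achieved an F1 score of 0.92 on an independent test set. During extensive ablation experiments, we quantified the contribution of each timepoint and modality and showed that reliable classification largely depends on the radiation map. Occlusion-based interpretability analyses further confirmed the model's focus on clinically relevant regions. These findings highlight the potential of multimodal deep learning to enhance diagnostic accuracy and support clinical decision-making in neuro-oncology.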

[128] Towards High-Fidelity CAD Generation via LLM-Driven Program Generation and Text-Based B-Rep Primitive Grounding

Jiahao Li,Qingwang Zhang,Qiuyu Chen,Guozhan Qiu,Yunzhong Lou,Xiangdong Zhou

Main category: cs.CV

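TL;DR: FutureCAD is a text-to-CAD generation framework that combines a large language model (LLM) with a B-Rep grounding transformer (BRepGround); it generates executable CadQuery scripts and lets geometric elements be selected and located through natural language, improving AI-driven CAD modeling.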

Details

Motivation: Existing CAD generation methods fall into two categories, parametric modeling and B-Rep synthesis; this split makes it hard to support complex industrial product design, whereas in real CAD systems the two are tightly coupled, so the paradigm gap needs to be bridged. Method: Proposes the FutureCAD framework: 1) an LLM generates CadQuery scripts and describes geometric selections in natural language; 2) a BRepGround module grounds the natural-language queries to B-Rep geometric primitives; 3) a new dataset is built from real-world CAD models, and the LLM is trained with supervised fine-tuning (SFT) and reinforcement learning (RL). Result: Achieves state-of-the-art (SOTA) CAD generation performance across multiple metrics, supporting the feasibility of text-driven, high-fidelity, executable CAD modeling. Conclusion: FutureCAD unifies parametric modeling and B-Rep representation; the LLM and BRepGround together enable semantically controllable, geometrically precise end-to-end CAD generation, offering a new paradigm for AI-assisted industrial design. Abstract: The field of Computer-Aided Design (CAD) generation has made significant progress in recent years. Existing methods typically fall into two separate categories: parametric CAD modeling and direct boundary representation (B-Rep) synthesis. In modern feature-based CAD systems, parametric modeling and B-Rep are inherently intertwined, as advanced parametric operations (e.g., fillet and chamfer) require explicit selection of B-Rep geometric primitives, and the B-Rep itself is derived from parametric operations. Consequently, this paradigm gap remains a critical factor limiting AI-driven CAD modeling for complex industrial product design. This paper presents FutureCAD, a novel text-to-CAD framework that leverages large language models (LLMs) and a B-Rep grounding transformer (BRepGround) for high-fidelity CAD generation. Our method generates executable CadQuery scripts, and introduces a text-based query mechanism that enables the LLM to specify geometric selections via natural language, which BRepGround then grounds to the target primitives. To train our framework, we construct a new dataset comprising real-world CAD models. For the LLM, we apply supervised fine-tuning (SFT) to establish fundamental CAD generation capabilities, followed by reinforcement learning (RL) to improve generalization. Experiments show that FutureCAD achieves state-of-the-art CAD generation performance.

[129] A Decade of Generative Adversarial Networks for Porous Material Reconstruction

Ali Sadeghkhani,Brandon Bennett,Masoud Babaei,Arash Rabbani

Main category: cs.CV

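TL;DR: This review covers 96 papers published from 2017 to early 2026 and systematically analyzes the development and application of generative adversarial networks (GANs) for digital reconstruction of porous materials; it groups GAN architectures into six classes and summarizes progress in porosity accuracy, permeability prediction, and reconstruction volume, along with the remaining challenges.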

Details

Motivation: Traditional porous-material reconstruction methods (such as micro-CT and statistical reconstruction) have limitations, while deep learning, and GANs in particular, offers new opportunities for accurate, efficient reconstruction; a systematic account of how the field has developed and where each approach applies is needed. Method: A systematic literature review of 96 peer-reviewed papers that groups GANs into six architectural classes and quantitatively assesses their performance on key metrics such as porosity, permeability prediction, and reconstruction scale. Result: Identifies six classes of GAN architecture; porosity is reproduced to within 1% of the original samples, mean relative error in permeability prediction is reduced by up to 79%, and the maximum reconstruction volume has grown from $64^3$ to $2{,}200^3$ voxels; the core challenges identified are computational efficiency, memory limits, and structural continuity in 2D-to-3D reconstruction. Conclusion: GANs have substantially improved porous-material reconstruction, but the architecture should be chosen to suit the application (accuracy, scale, hardware); the review provides a systematic framework for method selection in this field. Abstract: Digital reconstruction of porous materials has become increasingly critical for applications ranging from geological reservoir characterization to tissue engineering and electrochemical device design. While traditional methods such as micro-computed tomography and statistical reconstruction approaches have established foundations in this field, the emergence of deep learning techniques, particularly Generative Adversarial Networks (GANs), has revolutionized porous media reconstruction capabilities. This review systematically analyzes 96 peer-reviewed articles published from 2017 to early 2026, examining the evolution and applications of GAN-based approaches for porous material image reconstruction. We categorize GAN architectures into six distinct classes, namely Vanilla GANs, Multi-Scale GANs, Conditional GANs, Attention-Enhanced GANs, Style-based GANs, and Hybrid Architecture GANs. Our analysis reveals substantial progress including improvements in porosity accuracy (within 1% of original samples), permeability prediction (up to 79% reduction in mean relative errors), and achievable reconstruction volumes (from initial $64^3$ to current $2{,}200^3$ voxels). Despite these advances, persistent challenges remain in computational efficiency, memory constraints for large-scale reconstruction, and maintaining structural continuity in 2D-to-3D transformations. This systematic analysis provides a comprehensive framework for selecting appropriate GAN architectures based on specific application requirements.
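
The porosity and relative-error figures quoted above can be made concrete with a small sketch. It assumes a binary voxel volume in which 1 marks a pore and 0 marks solid, and it reads "within 1%" as an absolute difference in porosity; both are assumptions for illustration, since the abstract does not define the metric. The function names and the toy volumes are hypothetical.

```python
def porosity(volume):
    """Fraction of pore voxels in a 3D binary volume (nested lists, 1 = pore)."""
    total = 0
    pores = 0
    for plane in volume:
        for row in plane:
            for voxel in row:
                total += 1
                pores += voxel
    return pores / total


def relative_error(reference, estimate):
    """Relative error of an estimate against a non-zero reference value."""
    return abs(estimate - reference) / abs(reference)


# Toy 2x2x2 volumes (hypothetical data, not from any reviewed paper).
original = [[[1, 0], [0, 0]], [[1, 1], [0, 0]]]       # 3 pore voxels of 8
reconstructed = [[[1, 0], [0, 1]], [[1, 1], [0, 0]]]  # 4 pore voxels of 8

phi_ref = porosity(original)
phi_rec = porosity(reconstructed)
# Assumed acceptance rule: absolute porosity difference of at most 0.01.
within_one_percent = abs(phi_rec - phi_ref) <= 0.01
```

The same `relative_error` helper applies to permeability values, where the review reports reductions in the mean of this quantity over samples.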

[130] ZeroSense: How Vision matters in Long Context Compression

Yonghan Gao,Zehong Chen,Lijian Xu,Jingzhi Chen,Jingwei Guan,Xingyu Zeng

Main category: cs.CV

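TL;DR: This paper proposes a new framework for evaluating the quality of visual-text compression (VTC) that decouples the capabilities of multimodal large language models (MLLMs), and introduces the ZeroSense benchmark to measure text fidelity more accurately and avoid the bias introduced by downstream task performance.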

Details

Motivation: Existing VTC evaluation relies too heavily on downstream task performance, which is influenced by the inherent linguistic priors of MLLMs and so cannot accurately measure text fidelity. Method: Proposes an evaluation framework that decouples MLLM capabilities, and builds the ZeroSense benchmark with low semantic correlation, removing contextual dependencies so that the evaluation isolates VTC quality. Result: Experiments show that VTC quality and downstream task accuracy diverge significantly, supporting the need for and effectiveness of the new framework. Conclusion: A decoupled evaluation framework, rather than downstream task performance, is needed to reflect how faithfully a VTC method preserves text. Abstract: Recent visual-text compression (VTC) methods, typified by DeepSeek-OCR, report impressively high token compression ratios for long-context modeling tasks by leveraging text-to-image rendering. However, existing evaluation protocols heavily rely on downstream task performance. Such evaluation metrics fail to accurately measure text preservation due to the strong inherent linguistic priors of Multimodal Large Language Models (MLLMs). In this work, we introduce a new evaluation framework that decouples MLLMs' capabilities to faithfully assess VTC quality. Within this framework, we further introduce the ZeroSense Benchmark to ensure low semantic correlation of testing samples. By eliminating contextual dependencies, our benchmark guarantees that the evaluation results are purely reflective of VTC quality, unaffected by the semantic inference capabilities of downstream models. Extensive experiments across multiple datasets demonstrate that VTC quality and downstream task accuracy diverge significantly, highlighting the necessity of our decoupled evaluation framework.

[131] Derain-Agent: A Plug-and-Play Agent Framework for Rainy Image Restoration

Zhaocheng Yu,Xiang Chen,Runzhe Li,Zihan Geng,Guanglu Sun,Haipeng Li,Kui Jiang

Main category: cs.CV

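TL;DR: This paper proposes Derain-Agent, a plug-and-play dynamic refinement framework for deraining that uses a planning network and a strength modulation mechanism to adaptively correct residual errors in different regions, improving existing deraining models on synthetic and real-world datasets.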

Details

Motivation: Existing deep-learning single-image deraining models use a static inference paradigm that cannot adapt to the complex, coupled degradations (such as noise, blur, and color deviation) in real rainy images, leaving residual artifacts and inconsistent perceptual quality in the restored images. Method: Proposes the Derain-Agent framework with two core components: 1) a planning network that schedules a sequence of restoration tools for each input image; 2) a strength modulation mechanism that applies these tools with spatially adaptive intensity, giving efficient and precise region-level error correction. Result: The method consistently improves state-of-the-art deraining models on both synthetic and real-world benchmarks, generalizes well, and needs no costly iterative search. Conclusion: Derain-Agent moves deraining from static processing to a dynamic, agent-based restoration paradigm, offers a new approach to complex real-world degradations, and is practical as a plug-and-play module. Abstract: While deep learning has advanced single-image deraining, existing models suffer from a fundamental limitation: they employ a static inference paradigm that fails to adapt to the complex, coupled degradations (e.g., noise artifacts, blur, and color deviation) of real-world rain. Consequently, restored images often exhibit residual artifacts and inconsistent perceptual quality. In this work, we present Derain-Agent, a plug-and-play refinement framework that transitions deraining from static processing to dynamic, agent-based restoration. Derain-Agent equips a base deraining model with two core capabilities: 1) a Planning Network that intelligently schedules an optimal sequence of restoration tools for each instance, and 2) a Strength Modulation mechanism that applies these tools with spatially adaptive intensity. This design enables precise, region-specific correction of residual errors without the prohibitive cost of iterative search. Our method demonstrates strong generalization, consistently boosting the performance of state-of-the-art deraining models on both synthetic and real-world benchmarks.

[132] Single-View Rolling-Shutter SfM

Sofía Errázuriz Muñoz,Kim Kiehn,Petr Hruby,Kathlén Kohn

Main category: cs.CV

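TL;DR: This paper proposes a single-view geometric model for rolling-shutter (RS) cameras, systematically analyzes which motion and scene parameters can be recovered from a single RS image, derives several minimal reconstruction problems, and evaluates their feasibility and limitations experimentally.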

Details

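Motivation: Rolling-shutter cameras are ubiquitous, but RS structure-from-motion (RS SfM) has not been fully solved, so the single-view geometry of RS cameras needs to be modeled and analyzed. Method: Characterizes the single-view geometry of world points and lines observed in one RS image, uses it to analyze which motion and scene parameters are recoverable, and systematically derives minimal reconstruction problems; proof-of-concept solvers are designed and implemented for several representative cases. Result: Identifies the combinations of motion and structure parameters that can be solved from a single RS image, formulates several minimal problems, and demonstrates experimentally both their feasibility and their practical limitations. Conclusion: RS single-view geometry provides a theoretical basis for RS SfM, and the proposed minimal problems and solvers offer new ideas and practical tools for designing RS reconstruction algorithms. Abstract: Rolling-shutter (RS) cameras are ubiquitous, but RS SfM (structure-from-motion) has not been fully solved yet. This work suggests an approach to remedy this: We characterize RS single-view geometry of observed world points or lines. Exploiting this geometry, we describe which motion and scene parameters can be recovered from a single RS image and systematically derive minimal reconstruction problems. We evaluate several representative cases with proof-of-concept solvers, highlighting both feasibility and practical limitations.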

[133] InSpatio-WorldFM: An Open-Source Real-Time Generative Frame Model

InSpatio Team,Xiaoyu Zhang,Weihong Pan,Zhichao Ye,Jialin Liu,Yipeng Chen,Nan Wang,Xiaojun Xiang,Weijian Xie,Yifu Wang,Haoyu Ji,Siji Pan,Zhewen Le,Jing Guo,Xianbin Liu,Donghui Shen,Ziqiang Zhao,Haomin Liu,Guofeng Zhang

Main category: cs.CV

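TL;DR: InSpatio-WorldFM is an open-source real-time frame model that generates each frame independently and combines 3D anchors with spatial memory to deliver low-latency, multi-view-consistent spatial inference, enabling real-time interactive world simulation on consumer-grade GPUs.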

Details

Motivation: To address the high latency that existing video-based world models incur from window-level sequential frame generation, and to meet the needs of real-time spatial inference. Method: Adopts a frame-independent generation paradigm; introduces explicit 3D anchors and implicit spatial memory to maintain multi-view spatial consistency; and designs a progressive three-stage training pipeline (from a pretrained image diffusion model, to a controllable frame model, to a real-time generator obtained by few-step distillation). Result: Maintains strong multi-view consistency while supporting real-time interactive exploration on consumer-grade GPUs, with lower latency. Conclusion: InSpatio-WorldFM offers an efficient, low-latency alternative to traditional video-based world models for real-time world simulation. Abstract: We present InSpatio-WorldFM, an open-source real-time frame model for spatial intelligence. Unlike video-based world models that rely on sequential frame generation and incur substantial latency due to window-level processing, InSpatio-WorldFM adopts a frame-based paradigm that generates each frame independently, enabling low-latency real-time spatial inference. By enforcing multi-view spatial consistency through explicit 3D anchors and implicit spatial memory, the model preserves global scene geometry while maintaining fine-grained visual details across viewpoint changes. We further introduce a progressive three-stage training pipeline that transforms a pretrained image diffusion model into a controllable frame model and finally into a real-time generator through few-step distillation. Experimental results show that InSpatio-WorldFM achieves strong multi-view consistency while supporting interactive exploration on consumer-grade GPUs, providing an efficient alternative to traditional video-based world models for real-time world simulation.

[134] PicoSAM3: Real-Time In-Sensor Region-of-Interest Segmentation

Pietro Bonazzi,Nicola Farronato,Stefan Zihlmann,Haotong Qin,Michele Magno

Main category: cs.CV

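TL;DR: PicoSAM3 is a lightweight, promptable visual segmentation model designed for real-time edge and in-sensor deployment; it has only 1.3 M parameters, reaches 65.45% and 64.01% mIoU on COCO and LVIS respectively, and supports INT8 inference at 11.82 ms latency on the IMX500 sensor.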

Details

Motivation: To meet the need for real-time, on-device, sensor-level image segmentation in settings such as smart glasses and IoT devices, which demand low latency and privacy protection. Method: Proposes PicoSAM3, which combines a dense CNN architecture, ROI prompt encoding, and Efficient Channel Attention, and learns from SAM2/SAM3 via knowledge distillation; it supports INT8 quantization to fit the IMX500 hardware constraints. Result: Reaches 65.45% and 64.01% mIoU on COCO and LVIS respectively, outperforming SAM-based and edge-oriented baselines of similar or lower complexity; after INT8 quantization it runs real-time inference at 11.82 ms on the IMX500 with negligible accuracy loss; ablations show knowledge distillation yields up to +14.5% mIoU. Conclusion: Shows that high-quality, spatially flexible promptable segmentation is feasible directly at the image sensor, offering a workable lightweight solution for privacy-sensitive and low-latency applications. Abstract: Real-time, on-device segmentation is critical for latency-sensitive and privacy-aware applications such as smart glasses and Internet-of-Things devices. We introduce PicoSAM3, a lightweight promptable visual segmentation model optimized for edge and in-sensor execution, including deployment on the Sony IMX500 vision sensor. PicoSAM3 has 1.3 M parameters and combines a dense CNN architecture with region of interest prompt encoding, Efficient Channel Attention, and knowledge distillation from SAM2 and SAM3. On COCO and LVIS, PicoSAM3 achieves 65.45% and 64.01% mIoU, respectively, outperforming existing SAM-based and edge-oriented baselines at similar or lower complexity. The INT8 quantized model preserves accuracy with negligible degradation while enabling real-time in-sensor inference at 11.82 ms latency on the IMX500, fully complying with its memory and operator constraints. Ablation studies show that distillation from large SAM models yields up to +14.5% mIoU improvement over supervised training and demonstrate that high-quality, spatially flexible promptable segmentation is feasible directly at the sensor level.
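
For readers unfamiliar with the mIoU figures quoted above, the following is a minimal sketch of the metric. It assumes binary masks flattened to lists of 0/1 and averages the IoU over (prediction, ground truth) pairs; the exact averaging protocol used in the paper is not stated in the abstract, so this is only one common convention. All names and the toy masks are hypothetical.

```python
def iou(pred, target):
    """Intersection over union of two binary masks given as flat 0/1 lists."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Two empty masks agree perfectly; treat that case as IoU = 1.
    return intersection / union if union else 1.0


def mean_iou(pairs):
    """Average IoU over a list of (prediction, ground truth) mask pairs."""
    return sum(iou(pred, target) for pred, target in pairs) / len(pairs)


# Toy masks (hypothetical): one half-correct prediction, one perfect one.
pairs = [
    ([1, 1, 0, 0], [1, 0, 0, 0]),  # intersection 1, union 2 -> 0.5
    ([1, 1, 1, 0], [1, 1, 1, 0]),  # identical masks        -> 1.0
]
score = mean_iou(pairs)
```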

[135] Preliminary analysis of RGB-NIR Image Registration techniques for off-road forestry environments

Pankaj Deoli,Karthik Ranganath,Karsten Berns

Main category: cs.CV

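TL;DR: This paper evaluates classical and deep-learning methods for RGB-NIR image registration in off-road forestry applications; it finds that NeMAR suffers from unstable GAN loss, and that MURF aligns large-scale features well but handles fine details poorly, so further work is needed for robust multi-scale registration.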

Details

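Motivation: RGB-NIR image registration is important for sensor fusion, image enhancement, and off-road autonomous systems, yet its suitability for complex off-road forestry scenes has not been adequately evaluated. Method: An empirical evaluation of classical and deep learning (DL) image registration methods, focusing on NeMAR (6 training configurations) and MURF on real off-road forestry data. Result: NeMAR is partially successful, but its GAN loss is unstable and geometric consistency is hard to maintain; MURF aligns large-scale features reasonably well but matches fine details poorly in dense vegetation. Conclusion: Existing methods do not yet meet the need for robust multi-scale registration in off-road forest environments and require targeted improvement. Abstract: RGB-NIR image registration plays an important role in sensor fusion, image enhancement and off-road autonomy. In this work, we evaluate both classical and Deep Learning (DL) based image registration techniques to assess their suitability for off-road forestry applications. NeMAR, trained under 6 different configurations, demonstrates partial success; however, its GAN loss instability suggests challenges in preserving geometric consistency. MURF, when tested on off-road forestry data, shows promising large-scale feature alignment during shared information extraction but struggles with fine details in dense vegetation. Even though this is just a preliminary evaluation, our study indicates that further refinement is needed for robust, multi-scale registration in off-road forest applications.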

[136] AstroSplat: Physics-Based Gaussian Splatting for Rendering and Reconstruction of Small Celestial Bodies

Jennifer Nolan,Travis Driver,John Christian

Main category: cs.CV

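TL;DR: This paper proposes AstroSplat, a physics-based Gaussian splatting framework that incorporates planetary reflectance models to improve autonomous reconstruction and photometric characterization of small-body surfaces; on real imagery from NASA's Dawn mission it shows better rendering and reconstruction performance than the conventional spherical harmonic parameterization.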

Details

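Motivation: Image-based surface reconstruction and characterization are crucial for missions to small celestial bodies (e.g., asteroids), but existing Gaussian splatting methods rely on an appearance-based spherical harmonic intensity parameterization and do not explicitly model material properties or light-surface interactions. Method: Proposes the AstroSplat framework, which embeds planetary reflectance models in Gaussian splatting to obtain a physics-based neural scene representation; validated on real imagery from NASA's Dawn mission. Result: On real small-body imagery, AstroSplat shows better rendering performance and surface reconstruction accuracy than the conventional spherical harmonic parameterization. Conclusion: By introducing physical reflectance models, AstroSplat improves autonomous reconstruction and photometric characterization of small-body surfaces, providing a more reliable visual perception basis for deep-space missions. Abstract: Image-based surface reconstruction and characterization are crucial for missions to small celestial bodies (e.g., asteroids), as they inform mission planning, navigation, and scientific analysis. Recent advances in Gaussian splatting enable high-fidelity neural scene representations but typically rely on a spherical harmonic intensity parameterization that is strictly appearance-based and does not explicitly model material properties or light-surface interactions. We introduce AstroSplat, a physics-based Gaussian splatting framework that integrates planetary reflectance models to improve the autonomous reconstruction and photometric characterization of small-body surfaces from in-situ imagery. The proposed framework is validated on real imagery taken by NASA's Dawn mission, where we demonstrate superior rendering performance and surface reconstruction accuracy compared to the typical spherical harmonic parameterization.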

[137] Multimodal Emotion Recognition via Bi-directional Cross-Attention and Temporal Modeling

Junhyeong Byeon,Jeongyeol Kim,Sejoon Lim

Main category: cs.CV

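TL;DR: This paper proposes a multimodal framework for emotion recognition in in-the-wild video that combines a frozen CLIP visual encoder, a Wav2Vec 2.0 audio encoder, and a temporal convolutional network (TCN) for facial dynamics, and adds a bi-directional cross-attention fusion module and a text-guided contrastive objective, improving performance on the ABAW 10th EXPR task.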

Details

Motivation: A single modality (such as face or speech) struggles with the variation in appearance, pose, illumination, and background noise, and with the dynamic nature of emotion, in in-the-wild video; robust multimodal modeling is needed. Method: Uses frozen CLIP and Wav2Vec 2.0 to extract visual and audio features respectively; models temporal facial changes within fixed-length video windows with a TCN; designs a bi-directional cross-attention fusion module for symmetric audio-visual interaction; and adds a text-guided contrastive loss based on CLIP text features to strengthen semantic alignment. Result: On the ABAW 10th EXPR benchmark, the framework outperforms unimodal modeling and provides a strong multimodal baseline. Conclusion: Combining temporal visual modeling, audio representation learning, and cross-modal fusion improves the robustness and accuracy of emotion recognition in in-the-wild settings. Abstract: Emotion recognition in in-the-wild video data remains a challenging problem due to large variations in facial appearance, head pose, illumination, background noise, and the inherently dynamic nature of human affect. Relying on a single modality, such as facial expressions or speech, is often insufficient to capture these complex emotional cues. To address this issue, we propose a multimodal emotion recognition framework for the Expression (EXPR) Recognition task in the 10th Affective Behavior Analysis in-the-wild (ABAW) Challenge. Our approach leverages large-scale pre-trained models, namely CLIP for visual encoding and Wav2Vec 2.0 for audio representation learning, as frozen backbone networks. To model temporal dependencies in facial expression sequences, we employ a Temporal Convolutional Network (TCN) over fixed-length video windows. In addition, we introduce a bi-directional cross-attention fusion module, in which visual and audio features interact symmetrically to enhance cross-modal contextualization and capture complementary emotional information. A lightweight classification head is then used for final emotion prediction. We further incorporate a text-guided contrastive objective based on CLIP text features to encourage semantically aligned visual representations. Experimental results on the ABAW 10th EXPR benchmark show that the proposed framework provides a strong multimodal baseline and achieves improved performance over unimodal modeling. These results demonstrate the effectiveness of combining temporal visual modeling, audio representation learning, and cross-modal fusion for robust emotion recognition in unconstrained real-world environments.
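
The symmetric interaction described above can be illustrated with a bare-bones sketch of bi-directional cross-attention. It is not the authors' module: it assumes a single head, no learned projections, keys equal to values, and visual and audio tokens that already share one feature dimension. All names and the toy features are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, context):
    """Scaled dot-product attention; `context` serves as both keys and values."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, context)) for j in range(dim)])
    return outputs


def bidirectional_cross_attention(visual, audio):
    """Each modality queries the other; returns (visual context, audio context)."""
    visual_ctx = attend(visual, audio)  # visual tokens attend to audio tokens
    audio_ctx = attend(audio, visual)   # audio tokens attend to visual tokens
    return visual_ctx, audio_ctx


# Toy features (hypothetical): two visual tokens, one audio token, dimension 2.
visual = [[1.0, 0.0], [0.0, 1.0]]
audio = [[1.0, 1.0]]
visual_ctx, audio_ctx = bidirectional_cross_attention(visual, audio)
```

The two sequences may differ in length, as here; each output keeps the length of its own query sequence, which is what allows the fused features to be concatenated per modality before a classification head.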

[138] HomeSafe-Bench: Evaluating Vision-Language Models on Unsafe Action Detection for Embodied Agents in Household Scenarios

Jiayue Pu,Zhongxiang Sun,Zilu Zhang,Xiao Zhang,Jun Xu

Main category: cs.CV

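TL;DR: This paper presents the HomeSafe-Bench benchmark and the HD-Guard safety monitoring architecture, aimed at improving real-time detection of unsafe actions by VLMs in household environments.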

Details

Motivation: Existing safety evaluations struggle with the dynamic, unpredictable risks of household environments, and VLMs are error-prone under perception latency and missing common-sense knowledge. Method: Builds HomeSafe-Bench (438 cases) with a pipeline that combines physical simulation and video generation; proposes HD-Guard, a hierarchical streaming architecture that pairs a lightweight FastBrain with an asynchronous large-scale SlowBrain for real-time multimodal safety monitoring. Result: HD-Guard achieves a better trade-off between latency and performance; the analysis reveals critical bottlenecks in current VLM-based household safety detection. Conclusion: HomeSafe-Bench and HD-Guard provide a new benchmark and a practical architecture for household robot safety, moving VLMs toward safe deployment in real-world settings. Abstract: The rapid evolution of embodied agents has accelerated the deployment of household robots in real-world environments. However, unlike structured industrial settings, household spaces introduce unpredictable safety risks, where system limitations such as perception latency and lack of common-sense knowledge can lead to dangerous errors. Current safety evaluations, often restricted to static images, text, or general hazards, fail to adequately benchmark dynamic unsafe action detection in these specific contexts. To bridge this gap, we introduce HomeSafe-Bench, a challenging benchmark designed to evaluate Vision-Language Models (VLMs) on unsafe action detection in household scenarios. HomeSafe-Bench is constructed via a hybrid pipeline combining physical simulation with advanced video generation and features 438 diverse cases across six functional areas with fine-grained multidimensional annotations. Beyond benchmarking, we propose Hierarchical Dual-Brain Guard for Household Safety (HD-Guard), a hierarchical streaming architecture for real-time safety monitoring. HD-Guard coordinates a lightweight FastBrain for continuous high-frequency screening with an asynchronous large-scale SlowBrain for deep multimodal reasoning, effectively balancing inference efficiency with detection accuracy. Evaluations demonstrate that HD-Guard achieves a superior trade-off between latency and performance, while our analysis identifies critical bottlenecks in current VLM-based safety detection.

[139] Ada3Drift: Adaptive Training-Time Drifting for One-Step 3D Visuomotor Robotic Manipulation

Chongyang Xu,Yixian Zou,Ziliang Feng,Fanman Meng,Shuaicheng Liu

Main category: cs.CV

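TL;DR: This paper proposes Ada3Drift, which moves iterative refinement from inference to training and achieves high-fidelity multimodal action prediction with single-step generation (1 NFE), reducing latency and improving robot control performance.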

Details

Motivation: Diffusion models can capture multimodal action distributions but have high inference latency; flow matching and consistency models are fast but sacrifice multimodal fidelity. The authors exploit the asymmetry between training and inference compute budgets in robotics and move iterative refinement to training time to get both speed and multimodality. Method: Ada3Drift learns a training-time drifting field that attracts predicted actions toward expert demonstration modes and repels them from other generated samples; it introduces a sigmoid-scheduled loss that transitions from coarse distribution learning to fine-grained mode sharpening, and uses multi-scale field aggregation to capture action modes at different spatial granularities. The input is 3D point cloud observations. Result: On three simulation benchmarks (Adroit, Meta-World, and RoboTwin) and real-world robotic manipulation tasks, Ada3Drift achieves SOTA performance while using one tenth as many function evaluations as diffusion-based alternatives. Conclusion: Ada3Drift recovers multimodal action fidelity under single-step generation, addressing the tension between speed and behavioral diversity in real-time robot control. Abstract: Diffusion-based visuomotor policies effectively capture multimodal action distributions through iterative denoising, but their high inference latency limits real-time robotic control. Recent flow matching and consistency-based methods achieve single-step generation, yet sacrifice the ability to preserve distinct action modes, collapsing multimodal behaviors into averaged, often physically infeasible trajectories. We observe that the compute budget asymmetry in robotics (offline training vs. real-time inference) naturally motivates recovering this multimodal fidelity by shifting iterative refinement from inference time to training time. Building on this insight, we propose Ada3Drift, which learns a training-time drifting field that attracts predicted actions toward expert demonstration modes while repelling them from other generated samples, enabling high-fidelity single-step generation (1 NFE) from 3D point cloud observations. To handle the few-shot robotic regime, Ada3Drift further introduces a sigmoid-scheduled loss transition from coarse distribution learning to mode-sharpening refinement, and multi-scale field aggregation that captures action modes at varying spatial granularities. Experiments on three simulation benchmarks (Adroit, Meta-World, and RoboTwin) and real-world robotic manipulation tasks demonstrate that Ada3Drift achieves state-of-the-art performance while requiring $10\times$ fewer function evaluations than diffusion-based alternatives.
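
The sigmoid-scheduled transition mentioned above can be sketched as a weight that moves smoothly from 0 to 1 over training and blends two loss terms. The abstract gives neither the schedule's parameters nor the form of the two losses, so the midpoint, the sharpness, and the linear blend below are assumptions chosen for illustration, and every name is hypothetical.

```python
import math


def sigmoid_weight(step, total_steps, midpoint=0.5, sharpness=10.0):
    """Weight in (0, 1) that rises smoothly as training progresses.

    `midpoint` (fraction of training at which the weight is 0.5) and
    `sharpness` (steepness of the transition) are assumed hyperparameters.
    """
    progress = step / total_steps
    return 1.0 / (1.0 + math.exp(-sharpness * (progress - midpoint)))


def blended_loss(coarse_loss, sharpen_loss, step, total_steps):
    """Shift emphasis from a coarse distribution loss to a mode-sharpening loss."""
    w = sigmoid_weight(step, total_steps)
    return (1.0 - w) * coarse_loss + w * sharpen_loss


# Early training is dominated by the coarse term, late training by sharpening.
early = sigmoid_weight(0, 100)
middle = sigmoid_weight(50, 100)
late = sigmoid_weight(100, 100)
```

A smooth schedule of this kind avoids an abrupt switch between objectives, which is one plausible reason to prefer it over a hard step when demonstrations are scarce.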

[140] CrossEarth-SAR: A SAR-Centric and Billion-Scale Geospatial Foundation Model for Domain Generalizable Semantic Segmentation

Ziqi Ye,Ziyang Gong,Ning Liao,Xiaoxing Hu,Di Wang,Hongruixuan Chen,Chen Huang,Yiguo He,Yuru Jia,Xiaoxing Wang,Haipeng Wang,Xue Yang,Junchi Yan

Main category: cs.CV

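TL;DR: This paper presents CrossEarth-SAR, the first billion-scale vision foundation model for cross-domain semantic segmentation of synthetic aperture radar (SAR) imagery; it uses a physics-guided sparse mixture-of-experts (MoE) architecture and comes with the large-scale CrossEarth-SAR-200K dataset and the first unified SAR domain-generalization benchmark suite (22 sub-benchmarks).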

Details

Motivation: Diverse imaging mechanisms and differences across sensors and regions cause severe domain shift in SAR imagery, limiting semantic generalization. Method: Proposes a physics-guided sparse mixture-of-experts (MoE) architecture that incorporates physical descriptors; builds CrossEarth-SAR-200K, a large-scale dataset combining weak and full supervision; and designs a unified evaluation suite of 22 sub-benchmarks covering 8 types of domain gap. Result: Achieves SOTA performance on 20 of the 22 benchmarks, with mIoU gains of over 10% on some benchmarks under multi-gap transfer. Conclusion: CrossEarth-SAR improves cross-domain generalization for SAR semantic segmentation and establishes a new benchmark and open resources for research on SAR vision foundation models. Abstract: Synthetic Aperture Radar (SAR) enables global, all-weather earth observation. However, owing to diverse imaging mechanisms, domain shifts across sensors and regions severely hinder its semantic generalization. To address this, we present CrossEarth-SAR, the first billion-scale SAR vision foundation model built upon a novel physics-guided sparse mixture-of-experts (MoE) architecture incorporating physical descriptors, explicitly designed for cross-domain semantic segmentation. To facilitate large-scale pre-training, we develop CrossEarth-SAR-200K, a weakly and fully supervised dataset that unifies public and private SAR imagery. We also introduce a benchmark suite comprising 22 sub-benchmarks across 8 distinct domain gaps, establishing the first unified standard for domain generalization semantic segmentation on SAR imagery. Extensive experiments demonstrate that CrossEarth-SAR achieves state-of-the-art results on 20 benchmarks, surpassing previous methods by over 10% mIoU on some benchmarks under multi-gap transfer. All code, benchmarks, and datasets will be publicly available.
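
As background for the sparse MoE architecture named above, here is a minimal sketch of generic top-k expert routing. It is not the CrossEarth-SAR design: how physical descriptors enter the gate is not described in the abstract, so the gate logits are simply given as inputs, and the experts are toy scalar functions. All names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(gate_logits, k):
    """Pick the k highest-scoring experts and renormalize their weights."""
    ranked = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)
    chosen = ranked[:k]
    weights = softmax([gate_logits[i] for i in chosen])
    return list(zip(chosen, weights))


def sparse_moe(x, gate_logits, experts, k=2):
    """Evaluate only the selected experts and mix their outputs."""
    return sum(weight * experts[index](x) for index, weight in top_k_route(gate_logits, k))


# Toy experts (hypothetical). In a physics-guided gate, the logits would
# presumably depend on physical descriptors of the input; here they are fixed.
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x, lambda x: x * x]
gate_logits = [0.0, 2.0, -1.0, 2.0]
routes = top_k_route(gate_logits, k=2)
output = sparse_moe(3.0, gate_logits, experts, k=2)
```

Only two of the four experts run for this input, which is the property that lets parameter count grow to the billion scale without a proportional increase in per-sample compute.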

[141] Pano360: Perspective to Panoramic Vision with Geometric Consistency

Zhengdong Zhu,Weiyi Xue,Zuyuan Yang,Wenlve Zhou,Zhiheng Zhou

Main category: cs.CV

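TL;DR: This paper proposes a new panoramic image stitching method built on a 3D photogrammetric space and a transformer architecture; by performing global alignment and multi-feature joint optimization in 3D space, it improves stitching accuracy and visual quality in scenes with weak textures, large parallax, and repetitive patterns.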

Details

Motivation: Existing panorama stitching methods rely on pairwise 2D feature matching and do not model multi-view geometric consistency, leading to severe distortion and misalignment in challenging scenes with weak textures, large parallax, and repetitive patterns. Method: Extends the image alignment task to 3D photogrammetric space, using camera poses to guide image warping and global alignment in 3D; designs a transformer-based network for 3D awareness and global cross-view information aggregation; and introduces a multi-feature joint optimization strategy to compute the seams. Result: Experiments on a self-built large-scale real-world dataset show that the method significantly outperforms existing methods in alignment accuracy and perceptual quality. Conclusion: Modeling multi-view geometric consistency in 3D space and combining it with transformer-based global optimization is an effective way to improve the robustness and quality of panorama stitching in complex scenes. Abstract: Prior panorama stitching approaches heavily rely on pairwise feature correspondences and are unable to leverage geometric consistency across multiple views. This leads to severe distortion and misalignment, especially in challenging scenes with weak textures, large parallax, and repetitive patterns. Given that multi-view geometric correspondences can be directly constructed in 3D space, making them more accurate and globally consistent, we extend the 2D alignment task to the 3D photogrammetric space. We adopt a novel transformer-based architecture to achieve 3D awareness and aggregate global information across all views. It directly utilizes camera poses to guide image warping for global alignment in 3D space and employs a multi-feature joint optimization strategy to compute the seams. Additionally, to establish an evaluation benchmark and train our network, we constructed a large-scale dataset of real-world scenes. Extensive experiments show that our method significantly outperforms existing alternatives in alignment accuracy and perceptual quality.

[142] Nyxus: A Next Generation Image Feature Extraction Library for the Big Data and AI Era

Nicholas Schaub,Andriy Kharchenko,Hamdah Abbasi,Sameeul Samee,Hythem Sidky,Nathan Hotaling

Main category: cs.CV

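TL;DR: This paper introduces Nyxus, a scalable feature extraction library for large-scale 2D/3D image data that supports out-of-core computation, scales across CPUs and GPUs, offers several user interfaces (Python package, command-line tool, Napari plugin, OCI container), and allows programmatic tuning for machine learning and deep learning applications.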

Details

Motivation: Modern imaging instruments produce terabytes to petabytes of image data, and traditional algorithms struggle to balance efficiency, robustness, and accuracy; deep learning has improved segmentation accuracy, but domain-specific feature extraction libraries have proliferated without a unified standard for comparing their performance. Method: Nyxus is designed from the ground up for scalable, out-of-core feature extraction from 2D/3D images; it is built to scale across CPUs and GPUs and is tested rigorously against established standards; it is distributed as a Python package, a CLI, a Napari plugin, and an OCI container; and it supports programmatic tuning of feature sets to balance computational efficiency and coverage. Result: Nyxus provides a comprehensive feature set covering multiple biomedical domains, including radiomics and cellular analysis; it advances computational scalability, accessibility (for users of different skill levels), and methodological flexibility (ML/DL customization); and it has been rigorously tested against established standards. Conclusion: Nyxus addresses the scalability, consistency, and usability bottlenecks of feature extraction in large-scale image analysis, providing a unified, efficient, reproducible approach to scientific image analysis. Abstract: Modern imaging instruments can produce terabytes to petabytes of data for a single experiment. The biggest barrier to processing big image datasets has been computational, where image analysis algorithms often lack the efficiency needed to process such large datasets or make tradeoffs in robustness and accuracy. Deep learning algorithms have vastly improved the accuracy of the first step in an analysis workflow (region segmentation), but the expansion of domain specific feature extraction libraries across scientific disciplines has made it difficult to compare the performance and accuracy of extracted features. To address these needs, we developed a novel feature extraction library called Nyxus. Nyxus is designed from the ground up for scalable out-of-core feature extraction for 2D and 3D image data and rigorously tested against established standards. The comprehensive feature set of Nyxus covers multiple biomedical domains including radiomics and cellular analysis, and is designed for computational scalability across CPUs and GPUs. Nyxus has been packaged to be accessible to users of various skill sets and needs: as a Python package for code developers, a command line tool, as a Napari plugin for low to no-code users or users that want to visualize results, and as an Open Container Initiative (OCI) compliant container that can be used in cloud or super-computing workflows aimed at processing large data sets. Further, Nyxus enables a new methodological approach to feature extraction, allowing programmatic tuning of many feature sets for optimal computational efficiency or coverage in novel machine learning and deep learning applications.

[143] Single Pixel Image Classification using an Ultrafast Digital Light Projector

Aisha Kanwal,Graeme E. Johnstone,Fahimeh Dehkhoda,Johannes H. Herrnsdorf,Robert K. Henderson,Martin D. Dawson,Xavier Porte,Michael J. Strain

Main category: cs.CV

TL;DR: 本文提出了一种结合单像素成像（SPI）与低复杂度机器学习模型（如ELM）的超高速图像分类方法，实现多kHz帧率下的实时MNIST数字分类，无需传统图像重建，适用于自动驾驶和异常检测等场景。

Details

Motivation: Applications such as autonomous driving need real-time classification of information from complex, changing environments, which requires overcoming the speed and computational bottlenecks of conventional image processing. Method: Uses a microLED-on-CMOS digital light projector for ultrafast single pixel imaging (SPI), combined with a simple extreme learning machine (ELM) and a lightweight backpropagation-trained deep neural network, classifying directly on spatiotemporally transformed data without image reconstruction. Result: The system performs real-time classification on MNIST at multi-kHz frame rates; the ELM shows potential for anomaly detection when used as a binary classifier; the inference overhead is comparable to the image encoding time. Conclusion: An end-to-end classification paradigm based on SPI and low-complexity models can bypass image reconstruction and improve the efficiency and practicality of ultrafast vision tasks. Abstract: Pattern recognition and image classification are essential tasks in machine vision. Autonomous vehicles, for example, require being able to collect the complex information contained in a changing environment and classify it in real time. Here, we experimentally demonstrate image classification at multi-kHz frame rates combining the technique of single pixel imaging (SPI) with a low complexity machine learning model. The use of a microLED-on-CMOS digital light projector for SPI enables ultrafast pattern generation for sub-ms image encoding. We investigate the classification accuracy of our experimental system against the broadly accepted benchmarking task of the MNIST digits classification. We compare the classification performance of two machine learning models: an extreme learning machine (ELM) and a backpropagation-trained deep neural network. The complexity of both models is kept low so the overhead added to the inference time is comparable to the image generation time. Crucially, our single pixel image classification approach is based on a spatiotemporal transformation of the information, entirely bypassing the need for image reconstruction. By exploring the performance of our SPI-based ELM as a binary classifier, we demonstrate its potential for efficient anomaly detection in ultrafast imaging scenarios.
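
The idea of classifying without reconstruction rests on the fact that a single-pixel detector records one number per projected pattern, and that the resulting vector of numbers can be fed to a classifier directly. The sketch below illustrates only that measurement step. It assumes Hadamard patterns with entries of +1 and -1 (the abstract does not name the pattern basis, and a real projector would typically realize the signs through pairs of binary patterns); all names and the toy image are hypothetical.

```python
def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    matrix = [[1]]
    while len(matrix) < n:
        top = [row + row for row in matrix]
        bottom = [row + [-value for value in row] for row in matrix]
        matrix = top + bottom
    return matrix


def single_pixel_measurements(image, patterns):
    """One detector reading per pattern: the inner product of pattern and scene."""
    return [sum(p * x for p, x in zip(pattern, image)) for pattern in patterns]


# Toy 2x2 scene flattened to 4 pixels (hypothetical data).
image = [1, 0, 0, 1]
patterns = hadamard(4)
measurements = single_pixel_measurements(image, patterns)
# `measurements` would be the classifier's input; the image itself is never
# reconstructed in this pipeline.
```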

[144] Continual Learning with Vision-Language Models via Semantic-Geometry Preservation

Chiyuan He,Zihuan Qiu,Fanman Meng,Runtong Zhang,Linfeng Xu,Qingbo Wu,Hongliang Li

Main category: cs.CV

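TL;DR: This paper proposes SeGP-CL for exemplar-free continual learning; it uses adversarial anchors to probe regions prone to semantic drift, and applies anchor-guided cross-modal geometry distillation and text semantic-geometry regularization to mitigate catastrophic forgetting and geometric distortion when pretrained vision-language models are fine-tuned on new tasks.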

Details

Motivation: Existing continual learning methods do not explicitly preserve the cross-modal semantic geometry inherited from pretraining and earlier stages, so new-task supervision induces geometric distortion, with pronounced drift near the interface between old and new semantics. Method: Proposes Semantic Geometry Preservation for Continual Learning (SeGP-CL): 1) dual-targeted projected gradient descent (DPGD) builds a compact set of adversarial anchors to probe drift-prone regions; 2) anchor-guided cross-modal geometry distillation (ACGD) preserves cross-modal structure; 3) a lightweight text semantic-geometry regularization (TSGR) stabilizes the textual reference frame; 4) anchor-induced raw-space drift is estimated to transfer old visual prototypes, followed by dual-path inference that fuses cross-modal and visual cues. Result: Consistently improves stability and forward transfer on five continual learning benchmarks, achieving SOTA performance while better preserving the semantic geometry of VLMs. Conclusion: Explicitly modeling and protecting cross-modal semantic geometry is key to mitigating catastrophic forgetting and geometric distortion in continual learning with VLMs, and SeGP-CL offers an effective approach to exemplar-free continual vision-language learning. Abstract: Continual learning of pretrained vision-language models (VLMs) is prone to catastrophic forgetting, yet current approaches adapt to new tasks without explicitly preserving the cross-modal semantic geometry inherited from pretraining and previous stages, allowing new-task supervision to induce geometric distortion. We observe that the most pronounced drift tends to concentrate in vulnerable neighborhoods near the old-new semantic interface, where shared visual patterns are easily re-explained by new textual semantics. To address this under an exemplar-free constraint, we propose Semantic Geometry Preservation for Continual Learning (SeGP-CL). SeGP-CL first probes the drift-prone region by constructing a compact set of adversarial anchors with dual-targeted projected gradient descent (DPGD), which drives selected new-task seeds toward old-class semantics while remaining faithful in raw visual space. During training, we preserve cross-modal structure by anchor-guided cross-modal geometry distillation (ACGD), and stabilize the textual reference frame across tasks via a lightweight text semantic-geometry regularization (TSGR). After training, we estimate anchor-induced raw-space drift to transfer old visual prototypes and perform dual-path inference by fusing cross-modal and visual cues. Extensive experiments on five continual learning benchmarks demonstrate that SeGP-CL consistently improves stability and forward transfer, achieving state-of-the-art performance while better preserving semantic geometry of VLMs.

[145] Coarse-Guided Visual Generation via Weighted h-Transform Sampling

Yanghao Wang,Ziqi Jiang,Zhen Wang,Long Chen

Main category: cs.CV

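TL;DR: This paper proposes a training-free coarse-to-fine visual generation method that uses the h-transform to introduce conditional guidance into diffusion sampling; by adding a drift term and de-weighting it with a noise-level-aware schedule, it generates high-quality, well-generalizing images and videos without paired data or a known degradation operator.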

Details

Motivation: Training-based methods suffer from high cost and limited generalization, while training-free methods either depend on a known forward degradation operator (e.g., bicubic downsampling) or struggle to balance guidance strength against generation quality. Method: Proposes an h-transform-based guidance mechanism that modifies the transition probability during diffusion sampling: a drift function is added to the original SDE to steer generation, and a noise-level-aware decay schedule gradually lowers the weight of the drift term to suppress approximation error. Result: The method is shown to be effective and to generalize well across a range of coarse-to-fine image and video generation tasks; it requires neither training nor a known degradation operator and produces higher quality than existing training-free methods. Conclusion: The h-transform offers a new paradigm for training-free conditional generation; the proposed method markedly improves flexibility and applicability while maintaining generation quality. Abstract: Coarse-guided visual generation, which synthesizes fine visual samples from degraded or low-fidelity coarse references, is essential for various real-world applications. While training-based approaches are effective, they are inherently limited by high training costs and restricted generalization due to paired data collection. Accordingly, recent training-free works propose to leverage pretrained diffusion models and incorporate guidance during the sampling process. However, these training-free methods either require knowing the forward (fine-to-coarse) transformation operator, e.g., bicubic downsampling, or struggle to balance guidance and synthesis quality. To address these challenges, we propose a novel guided method using the h-transform, a tool that can constrain stochastic processes (e.g., the sampling process) under desired conditions. Specifically, we modify the transition probability at each sampling timestep by adding a drift function to the original differential equation, which approximately steers the generation toward the ideal fine sample. To address unavoidable approximation errors, we introduce a noise-level-aware schedule that gradually de-weights the term as the error increases, ensuring both guidance adherence and high-quality synthesis. Extensive experiments across diverse image and video generation tasks demonstrate the effectiveness and generalization of our method.
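
For intuition on how an h-transform adds a drift that steers a stochastic process toward a condition, the textbook case is Brownian motion conditioned to end at a point y at time T: Doob's h-transform yields the Brownian bridge, whose drift is (y - x)/(T - t). The sketch below simulates that case with Euler-Maruyama and multiplies the drift by a weight function, loosely mirroring the de-weighting idea. It is not the paper's sampler; the `weight` callable, `guided_path`, and all parameter names are hypothetical.

```python
import math
import random


def guided_path(x0, y, n_steps=1000, total_time=1.0, noise_scale=1.0,
                weight=lambda frac_done: 1.0, seed=0):
    """Euler-Maruyama simulation of dX = w * (y - X) / (T - t) dt + noise_scale dW.

    With w == 1 and noise_scale == 1 this is the Brownian bridge, i.e. the
    h-transform of Brownian motion conditioned on X_T = y.
    `weight` maps the fraction of elapsed time to a drift weight (an assumption).
    """
    rng = random.Random(seed)
    dt = total_time / n_steps
    x = x0
    for k in range(n_steps):
        remaining = (n_steps - k) * dt          # T - t_k, always positive
        drift = weight(k / n_steps) * (y - x) / remaining
        x += drift * dt + noise_scale * math.sqrt(dt) * rng.gauss(0.0, 1.0)
    return x


end_guided = guided_path(0.0, 2.0)                       # ends close to y = 2
end_free = guided_path(0.0, 2.0, weight=lambda f: 0.0)   # plain Brownian motion
end_noiseless = guided_path(0.0, 2.0, noise_scale=0.0)   # deterministic check
```

Passing something like `weight=lambda f: f` would down-weight the guidance early in the trajectory and restore it later; whether that matches the paper's schedule is not stated in the abstract.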

[146] NBAvatar: Neural Billboards Avatars with Realistic Hand-Face Interaction

David Svitov,Mahtab Dahaghin

Main category: cs.CV

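TL;DR: NBAvatar is a new method that combines oriented planar primitives with neural rendering to produce high-quality, realistic renderings of head avatars under hand-face interaction, markedly improving novel-view and novel-pose rendering quality.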

Details

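Motivation: Existing methods struggle to model the non-rigid deformations caused by hand-face interaction together with high-fidelity appearance detail, and their rendering quality is limited at high resolution in particular. Method: Proposes NBAvatar, which combines an explicit representation (oriented planar primitives) with an implicit one (neural rendering): the explicit part models temporally and pose-consistent geometry, while the implicit part learns the color changes caused by hand-face interaction and fine appearance detail. Result: Under high-resolution (megapixel) rendering, LPIPS drops by up to 30% relative to Gaussian-based avatar methods, with PSNR and SSIM also improving; structural similarity is higher than that of InteractAvatar. Conclusion: Through its hybrid representation, NBAvatar balances geometric consistency against appearance realism and offers a new paradigm for head-avatar rendering under hand-face interaction. Abstract: We present NBAvatar - a method for realistic rendering of head avatars handling non-rigid deformations caused by hand-face interaction. We introduce a novel representation for animated avatars by combining the training of oriented planar primitives with neural rendering. Such a combination of explicit and implicit representations enables NBAvatar to handle temporally and pose-consistent geometry, along with fine-grained appearance details provided by the neural rendering technique. In our experiments, we demonstrate that NBAvatar implicitly learns color transformations caused by face-hand interactions and surpasses existing approaches in terms of novel-view and novel-pose rendering quality. Specifically, NBAvatar achieves up to 30% LPIPS reduction under high-resolution megapixel rendering compared to Gaussian-based avatar methods, while also improving PSNR and SSIM, and achieves higher structural similarity compared to the state-of-the-art hand-face interaction method InteractAvatar.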

[147] Dense Dynamic Scene Reconstruction and Camera Pose Estimation from Multi-View Videos

Shuo Sun,Unal Artan,Malcolm Mielle,Achim J. Lilienthal,Martin Magnusson

Main category: cs.CV

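TL;DR: This paper proposes a two-stage optimization framework for dense dynamic scene reconstruction and camera pose estimation from multiple freely moving cameras; a spatiotemporal connection graph and a wide-baseline initialization strategy improve robustness, and the method's advantage is validated on the newly introduced real-world MultiCamRobolab dataset.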

Details

Motivation: Existing methods either handle only single-camera input or require rigidly calibrated camera rigs, which makes them hard to apply when multiple freely moving cameras capture a shared event. Method: A two-stage optimization framework: the first stage extends single-camera visual SLAM to the multi-camera setting, building a spatiotemporal connection graph and introducing a wide-baseline initialization based on feed-forward reconstruction models; the second stage refines depth and poses by optimizing dense inter- and intra-camera consistency with wide-baseline optical flow. Result: The method significantly outperforms state-of-the-art feed-forward models on synthetic and real-world benchmarks while using less memory, and is also validated on the newly built real-world MultiCamRobolab dataset (with motion-capture ground-truth poses). Conclusion: The method effectively addresses dynamic scene reconstruction and pose estimation with multiple freely moving cameras, combining robustness, accuracy, and practicality. Abstract: We address the challenging problem of dense dynamic scene reconstruction and camera pose estimation from multiple freely moving cameras -- a setting that arises naturally when multiple observers capture a shared event. Prior approaches either handle only single-camera input or require rigidly mounted, pre-calibrated camera rigs, limiting their practical applicability. We propose a two-stage optimization framework that decouples the task into robust camera tracking and dense depth refinement. In the first stage, we extend single-camera visual SLAM to the multi-camera setting by constructing a spatiotemporal connection graph that exploits both intra-camera temporal continuity and inter-camera spatial overlap, enabling consistent scale and robust tracking. To ensure robustness under limited overlap, we introduce a wide-baseline initialization strategy using feed-forward reconstruction models. In the second stage, we refine depth and camera poses by optimizing dense inter- and intra-camera consistency using wide-baseline optical flow. Additionally, we introduce MultiCamRobolab, a new real-world dataset with ground-truth poses from a motion capture system. Finally, we demonstrate that our method significantly outperforms state-of-the-art feed-forward models on both synthetic and real-world benchmarks, while requiring less memory.

[148] Beyond Convolution: A Taxonomy of Structured Operators for Learning-Based Image Processing

Simone Cammarasana

Main category: cs.CV

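TL;DR: This paper proposes a systematic taxonomy that groups operators replacing or extending standard convolution into five families (decomposition-based, adaptive weighted, basis-adaptive, integral and kernel, and attention-based operators) and analyzes their properties, suitable tasks, and open challenges.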

Details

Motivation: Standard convolution, the core of CNNs, is simple and translation-equivariant, but its fixed, linear, locally averaging nature limits its ability to model signal properties such as low-rank structure, adaptive basis representations, and non-uniform spatial dependencies. Method: Builds a systematic taxonomy of five families of operators that replace or extend convolution; for each family it gives a formal definition, a structural comparison with convolution, and an analysis of suitable tasks, and it compares the families across several dimensions (linearity, locality, equivariance, computational cost, task type). Result: A unified taxonomy covering the main convolution alternatives, which clarifies the theoretical properties and practical use cases of each family and identifies open problems and future directions. Conclusion: Research on convolution alternatives is moving from empirical design toward systematic understanding; each of the five families has strengths and limitations and should be selected and combined according to task requirements and constraints, and future work needs to explore theoretical guarantees, efficient implementations, and cross-modal generalization. Abstract: The convolution operator is the fundamental building block of modern convolutional neural networks (CNNs), owing to its simplicity, translational equivariance, and efficient implementation. However, its structure as a fixed, linear, locally-averaging operator limits its ability to capture structured signal properties such as low-rank decompositions, adaptive basis representations, and non-uniform spatial dependencies. This paper presents a systematic taxonomy of operators that extend or replace the standard convolution in learning-based image processing pipelines. We organise the landscape of alternative operators into five families: (i) decomposition-based operators, which separate structural and noise components through singular value or tensor decompositions; (ii) adaptive weighted operators, which modulate kernel contributions as a function of spatial position or signal content; (iii) basis-adaptive operators, which optimise the analysis bases together with the network weights; (iv) integral and kernel operators, which generalise the convolution to position-dependent and non-linear kernels; and (v) attention-based operators, which relax the locality assumption entirely. For each family, we provide a formal definition, a discussion of its structural properties with respect to the convolution, and a critical analysis of the tasks for which the operator is most appropriate. 
We further provide a comparative analysis of all families across relevant dimensions -- linearity, locality, equivariance, computational cost, and suitability for image-to-image and image-to-label tasks -- and outline the open challenges and future directions of this research area.

[149] LoV3D: Grounding Cognitive Prognosis Reasoning in Longitudinal 3D Brain MRI via Regional Volume Assessments

Zhaoyang Jiang,Zhizhong Fu,David McAllister,Yunsoo Kim,Honghan Wu

Main category: cs.CV

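TL;DR: LoV3D is a pipeline for training 3D vision-language models on longitudinal brain MRI; it performs region-level anatomical assessment, longitudinal comparison, and three-class diagnosis, and uses a clinically weighted verifier to improve diagnostic accuracy and generalization.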

Details

Motivation: Existing deep-learning methods fragment longitudinal brain MRI analysis: classifiers output only labels, volumetric analysis lacks interpretation, and vision-language models are prone to hallucinated results. Method: Proposes the LoV3D pipeline, which combines 3D vision-language modeling, region-level anatomical assessment, and comparison across longitudinal scans, and introduces a clinically weighted Verifier that drives preference optimization without any human annotation. Result: Three-class accuracy reaches 93.7% on the ADNI test set, significantly above the baseline; zero-shot transfer to the MIRIAD and AIBL datasets also shows high generalizability. Conclusion: Through multi-stage structured reasoning and biomedical constraints, LoV3D effectively reduces hallucination, strikes a good balance between accuracy and interpretability, and shows potential for clinical deployment. Abstract: Longitudinal brain MRI is essential for characterizing the progression of neurological diseases such as Alzheimer's disease. However, current deep-learning tools fragment this process: classifiers reduce a scan to a label, volumetric pipelines produce uninterpreted measurements, and vision-language models (VLMs) may generate fluent but potentially hallucinated conclusions. We present LoV3D, a pipeline for training 3D vision-language models, which reads longitudinal T1-weighted brain MRI, produces a region-level anatomical assessment, conducts longitudinal comparison with the prior scan, and finally outputs a three-class diagnosis (Cognitively Normal, Mild Cognitive Impairment, or Dementia) along with a synthesized diagnostic summary. The stepped pipeline grounds the final diagnosis by enforcing label consistency, longitudinal coherence, and biological plausibility, thereby reducing the risks of hallucinations. The training process introduces a clinically-weighted Verifier that scores candidate outputs automatically against normative references derived from standardized volume metrics, driving Direct Preference Optimization without a single human annotation. On a subject-level held-out ADNI test set (479 scans, 258 subjects), LoV3D achieves 93.7% three-class diagnostic accuracy (+34.8% over the no-grounding baseline), 97.2% two-class diagnostic accuracy (+4% over the SOTA) and 82.6% region-level anatomical classification accuracy (+33.1% over VLM baselines). Zero-shot transfer yields 95.4% on MIRIAD (100% Dementia recall) and 82.9% three-class accuracy on AIBL, confirming high generalizability across sites, scanners, and populations. 
Code is available at https://github.com/Anonymous-TEVC/LoV-3D.
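
The training signal described above pairs an automatic verifier with Direct Preference Optimization. A minimal sketch of that pairing, assuming a hypothetical scalar `verifier_score` per candidate: the best- and worst-scored candidates form a (chosen, rejected) pair, and the standard DPO loss is evaluated from sequence log-probabilities under the policy and a frozen reference model. The candidate records, scores, and `beta` value are made up for illustration; the paper's Verifier and its clinical weighting are not reproduced here.

```python
import math


def pick_preference_pair(candidates):
    """Return (chosen, rejected) as the highest- and lowest-scored candidates."""
    ranked = sorted(candidates, key=lambda c: c["verifier_score"])
    return ranked[-1], ranked[0]


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss: -log sigmoid(beta * (policy margin - reference margin))."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # -log(sigmoid(z)) equals log(1 + exp(-z)); for very negative z this is
    # approximately -z, and the guard avoids overflow in exp.
    return math.log1p(math.exp(-z)) if z > -30 else -z


# Hypothetical candidates with made-up scores and log-probabilities.
candidates = [
    {"text": "report A", "verifier_score": 0.9, "logp": -12.0, "ref_logp": -13.0},
    {"text": "report B", "verifier_score": 0.2, "logp": -11.0, "ref_logp": -10.5},
    {"text": "report C", "verifier_score": 0.5, "logp": -14.0, "ref_logp": -14.0},
]
chosen, rejected = pick_preference_pair(candidates)
loss = dpo_loss(chosen["logp"], rejected["logp"],
                chosen["ref_logp"], rejected["ref_logp"])
```

When the policy and the reference agree on both candidates the margin is zero and the loss equals log 2; it falls below that once the policy favors the verifier-preferred output more than the reference does.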

[150] Node-RF: Learning Generalized Continuous Space-Time Scene Dynamics with Neural ODE-based NeRFs

Hiran Sarkar,Liming Kuang,Yordanka Velikova,Benjamin Busam

Main category: cs.CV

TL;DR: Node-RF combines Neural ODEs with dynamic NeRFs to model space-time continuously, supporting long-range extrapolation at constant memory cost.

Details

Motivation: Existing methods capture scene dynamics only within the observed range and generalize poorly to long-horizon prediction beyond the training sequence. Method: Embeds a Neural ODE in a dynamic NeRF framework: an ODE solver evolves an implicit scene state, and a NeRF renderer synthesizes views from arbitrary viewpoints. Result: After training on multiple motion sequences the model generalizes to unseen conditions; it can identify critical points of the system, characterizing abstract behavior and predicting the future without an explicit physical model. Conclusion: Node-RF provides a memory-efficient, generalizable continuous space-time representation that markedly improves vision-driven long-horizon prediction of scene dynamics. Abstract: Predicting scene dynamics from visual observations is challenging. Existing methods capture dynamics only within observed boundaries, failing to extrapolate far beyond the training sequence. Node-RF (Neural ODE-based NeRF) overcomes this limitation by integrating Neural Ordinary Differential Equations (NODEs) with dynamic Neural Radiance Fields (NeRFs), enabling a continuous-time, spatiotemporal representation that generalizes beyond observed trajectories at constant memory cost. From visual input, Node-RF learns an implicit scene state that evolves over time via an ODE solver, propagating feature embeddings via differential calculus. A NeRF-based renderer interprets calculated embeddings to synthesize arbitrary views for long-range extrapolation. Training on multiple motion sequences with shared dynamics allows for generalization to unseen conditions. Our experiments demonstrate that Node-RF can characterize abstract system behavior without an explicit model to identify critical points for future predictions.
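
The core mechanism, a latent state advanced by an ODE solver, can be sketched without any learned component. In the example below a hand-written harmonic oscillator stands in for the learned dynamics network, and a fixed-step Runge-Kutta integrator carries the state forward while holding only the current state in memory. The function names and the choice of RK4 are assumptions for illustration; the abstract does not say which solver Node-RF uses.

```python
import math


def rk4_step(f, z, t, dt):
    """One classical Runge-Kutta step for dz/dt = f(z, t); z is a list of floats."""
    def axpy(a, x, y):
        return [a * xi + yi for xi, yi in zip(x, y)]
    k1 = f(z, t)
    k2 = f(axpy(dt / 2, k1, z), t + dt / 2)
    k3 = f(axpy(dt / 2, k2, z), t + dt / 2)
    k4 = f(axpy(dt, k3, z), t + dt)
    return [zi + dt / 6 * (a + 2 * b + 2 * c + d)
            for zi, a, b, c, d in zip(z, k1, k2, k3, k4)]


def evolve(f, z0, t_end, dt=0.01):
    """Integrate from t = 0 to t_end, keeping only the current state in memory."""
    z, t = list(z0), 0.0
    for _ in range(round(t_end / dt)):
        z = rk4_step(f, z, t, dt)
        t += dt
    return z


def oscillator(z, t):
    """Stand-in for a learned dynamics network: a harmonic oscillator."""
    position, velocity = z
    return [velocity, -position]


# Extrapolate well past any "observed" window; the exact solution is (cos t, -sin t).
z_at_10 = evolve(oscillator, [1.0, 0.0], t_end=10.0)
```

In a setup like the one described, the returned state would be handed to a renderer to produce an image at the queried time; that part is omitted here.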

[151] Towards Universal Computational Aberration Correction in Photographic Cameras: A Comprehensive Benchmark Analysis

Xiaolong Qian,Qi Jiang,Yao Gao,Lei Sun,Zhonghua Yi,Kailun Yang,Luc Van Gool,Kaiwei Wang

Main category: cs.CV

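TL;DR: This paper presents the UniCAC benchmark and the ODE evaluation framework, systematically evaluates 24 CAC algorithms, and identifies three key factors affecting performance (prior utilization, network architecture, and training strategy), advancing research on cross-lens universal computational aberration correction.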

Details

Motivation: Existing computational aberration correction (CAC) methods generalize poorly and must be retrained for each new lens, and there is no comprehensive benchmark covering a wide range of optical aberrations, which makes cross-lens universality hard to evaluate and improve. Method: Builds UniCAC, a large-scale benchmark for photographic cameras based on automatic optical design; proposes the Optical Degradation Evaluator (ODE) to quantify aberration difficulty; and runs systematic experiments and comparisons on 24 image restoration and CAC algorithms. Result: Identifies three key factors that influence CAC performance (prior utilization, network architecture, and training strategy) and quantifies the effect of each; provides the first comprehensive cross-lens CAC benchmark and evaluation framework for consumer photographic lenses. Conclusion: The UniCAC benchmark and ODE framework give CAC research a reproducible, quantifiable foundation, and the identified factors point the way toward more general and robust CAC methods and toward practical use. Abstract: Prevalent Computational Aberration Correction (CAC) methods are typically tailored to specific optical systems, leading to poor generalization and labor-intensive re-training for new lenses. Developing CAC paradigms capable of generalizing across diverse photographic lenses offers a promising solution to these challenges. However, efforts to achieve such cross-lens universality within consumer photography are still in their early stages due to the lack of a comprehensive benchmark that encompasses a sufficiently wide range of optical aberrations. Furthermore, it remains unclear which specific factors influence existing CAC methods and how these factors affect their performance. In this paper, we present comprehensive experiments and evaluations involving 24 image restoration and CAC algorithms, utilizing our newly proposed UniCAC, a large-scale benchmark for photographic cameras constructed via automatic optical design. The Optical Degradation Evaluator (ODE) is introduced as a novel framework to objectively assess the difficulty of CAC tasks, offering credible quantification of optical aberrations and enabling reliable evaluation. Drawing on our comparative analysis, we identify three key factors -- prior utilization, network architecture, and training strategy -- that most significantly influence CAC performance, and further investigate their respective effects. We believe that our benchmark, dataset, and observations contribute foundational insights to related areas and lay the groundwork for future investigations. Benchmarks, codes, and Zemax files will be available at https://github.com/XiaolongQian/UniCAC.

[152] EvoTok: A Unified Image Tokenizer via Residual Latent Evolution for Visual Understanding and Generation

Yan Li,Ning Liao,Xiangyu Zhao,Shaofeng Zhang,Xiaoxing Wang,Yifan Yang,Junchi Yan,Xue Yang

Main category: cs.CV

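TL;DR: This paper proposes EvoTok, an image tokenizer that unifies visual understanding and generation in a shared latent space through a residual evolution process, resolving the interference and inconsistency that granularity differences cause in existing methods.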

Details

Motivation: Unified multimodal large language models (MLLMs) face a granularity gap between visual understanding and generation: understanding needs high-level semantic abstraction, while generation needs pixel-level detail. Existing methods either force one representation to carry both kinds of supervision or separate the representation spaces, causing interference and inconsistency, respectively. Method: Proposes EvoTok, which uses residual vector quantization to encode an image as a cascaded sequence of residual tokens, building an evolution trajectory from low-level detail to high-level semantics within a shared latent space instead of maintaining separate pixel and semantic token spaces. Result: Trained on only 13M images, EvoTok reaches a reconstruction quality of 0.43 rFID on ImageNet-1K (256x256); integrated with a large language model, it performs strongly on 7 of 9 visual understanding benchmarks and achieves notable results on image generation benchmarks such as GenEval and GenAI-Bench. Conclusion: Modeling visual representations as an evolution trajectory is an effective and principled way to unify visual understanding and generation. Abstract: The development of unified multimodal large language models (MLLMs) is fundamentally challenged by the granularity gap between visual understanding and generation: understanding requires high-level semantic abstractions, while image generation demands fine-grained pixel-level representations. Existing approaches usually enforce both kinds of supervision on the same set of representations or decouple them into separate feature spaces, leading to interference and inconsistency, respectively. In this work, we propose EvoTok, a unified image tokenizer that reconciles these requirements through a residual evolution process within a shared latent space. Instead of maintaining separate token spaces for pixels and semantics, EvoTok encodes an image into a cascaded sequence of residual tokens via residual vector quantization. This residual sequence forms an evolution trajectory where earlier stages capture low-level details and deeper stages progressively transition toward high-level semantic representations. Despite being trained on a relatively modest dataset of 13M images, far smaller than the billion-scale datasets used by many previous unified tokenizers, EvoTok achieves a strong reconstruction quality of 0.43 rFID on ImageNet-1K at 256x256 resolution. When integrated with a large language model, EvoTok shows promising performance across 7 out of 9 visual understanding benchmarks, and remarkable results on image generation benchmarks such as GenEval and GenAI-Bench. 
These results demonstrate that modeling visual representations as an evolving trajectory provides an effective and principled solution for unifying visual understanding and generation.
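
Residual vector quantization, the building block named above, can be shown in a few lines: each stage quantizes whatever the previous stages left unexplained, and the decoder sums the selected codes. The sketch below uses tiny hand-written 2-D codebooks and is generic RVQ only; how EvoTok trains its codebooks or ties stages to semantic levels is not covered, and all names and values are hypothetical.

```python
def nearest_code(vector, codebook):
    """Index of the codebook entry closest to `vector` in squared Euclidean distance."""
    def sq_dist(code):
        return sum((v - c) ** 2 for v, c in zip(vector, code))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def rvq_encode(vector, codebooks):
    """Quantise `vector` stage by stage; each stage encodes the previous residual."""
    residual = list(vector)
    indices, residual_norms = [], []
    for codebook in codebooks:
        idx = nearest_code(residual, codebook)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
        indices.append(idx)
        residual_norms.append(sum(r * r for r in residual) ** 0.5)
    return indices, residual_norms


def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected code from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical 2-D codebooks, coarse to fine; each contains the zero vector so
# that adding a stage can never increase the residual.
codebooks = [
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.5, 0.0], [0.0, 0.5], [-0.5, 0.0], [0.0, -0.5]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, 0.0], [0.0, -0.25]],
]
indices, residual_norms = rvq_encode([0.8, 0.35], codebooks)
reconstruction = rvq_decode(indices, codebooks)
```

The list `indices` plays the role of the cascaded token sequence: truncating it gives a coarser reconstruction, and each extra stage refines it.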

[153] Hoi3DGen: Generating High-Quality Human-Object-Interactions in 3D

Agniv Sharma,Xianghui Xie,Tom Fischer,Eddy Ilg,Gerard Pons-Moll

Main category: cs.CV

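TL;DR: Hoi3DGen is a framework that generates high-quality, high-fidelity 3D human-object interaction models from text; it uses multimodal large language models to curate high-quality interaction data and builds an end-to-end text-to-3D pipeline, markedly improving text consistency and 3D quality.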

Details

Motivation: Existing methods based on score distillation from image diffusion models suffer from the Janus problem and fail to follow text prompts faithfully, mainly because high-quality 3D human-object interaction data is scarce. Method: Proposes the Hoi3DGen framework: it first uses multimodal large language models to curate a realistic, high-quality human-object interaction dataset, then builds a complete text-to-3D pipeline that directly generates textured 3D meshes. Result: Surpasses baselines by 4-15x in text consistency and 3-7x in 3D model quality, generalizes strongly across categories and interaction types, and maintains high-quality 3D generation. Conclusion: Hoi3DGen effectively addresses the fidelity and data bottlenecks of text-driven 3D human-object interaction generation, offering a more reliable and controllable solution for applications such as AR, XR, and gaming. Abstract: Modeling and generating 3D human-object interactions from text is crucial for applications in AR, XR, and gaming. Existing approaches often rely on score distillation from text-to-image models, but their results suffer from the Janus problem and do not follow text prompts faithfully due to the scarcity of high-quality interaction data. We introduce Hoi3DGen, a framework that generates high-quality textured meshes of human-object interaction that follow the input interaction descriptions precisely. We first curate realistic and high-quality interaction data leveraging multimodal large language models, and then create a full text-to-3D pipeline, which achieves orders-of-magnitude improvements in interaction fidelity. Our method surpasses baselines by 4-15x in text consistency and 3-7x in 3D model quality, exhibiting strong generalization to diverse categories and interaction types, while maintaining high-quality 3D generation.

[154] HATS: Hardness-Aware Trajectory Synthesis for GUI Agents

Rui Shao,Ruize Gao,Bin Xie,Yixing Li,Kaiwen Zhou,Shuai Wang,Weili Guan,Gongwei Chen

Main category: cs.CV

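TL;DR: This paper proposes the HATS framework, which uses hardness-aware trajectory synthesis to address the poor generalization that semantically ambiguous actions cause in GUI agent training, improving agent robustness in real-world scenarios.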

Details

Motivation: Existing trajectory synthesis methods for GUI agents neglect semantically ambiguous actions (context-dependent, sequentially dependent, or visually ambiguous ones), which leads to semantically misaligned training data and agents that generalize poorly. Method: Proposes the HATS framework with two closed-loop modules: (1) hardness-driven exploration, which actively collects informative ambiguous interactions according to the degree of semantic ambiguity of an action (defined as its "hardness"); (2) alignment-guided refinement, which iteratively validates and repairs the semantic alignment between instruction and execution. Result: Across multiple benchmark GUI environments, agents trained on HATS-generated data consistently outperform state-of-the-art baselines. Conclusion: Semantic ambiguity is a key factor limiting the generalization of GUI agents; HATS mitigates it through hardness modeling and closed-loop alignment, improving trajectory data quality and agent robustness. Abstract: Graphical user interface (GUI) agents powered by large vision-language models (VLMs) have shown remarkable potential in automating digital tasks, highlighting the need for high-quality trajectory data to support effective agent training. Yet existing trajectory synthesis pipelines often yield agents that fail to generalize beyond simple interactions. We identify this limitation as stemming from the neglect of semantically ambiguous actions, whose meanings are context-dependent, sequentially dependent, or visually ambiguous. Such actions are crucial for real-world robustness but are under-represented and poorly processed in current datasets, leading to semantic misalignment between task instructions and execution. To address these issues, we propose HATS, a Hardness-Aware Trajectory Synthesis framework designed to mitigate the impact of semantic ambiguity. We define hardness as the degree of semantic ambiguity associated with an action and develop two complementary modules: (1) hardness-driven exploration, which guides data collection toward ambiguous yet informative interactions, and (2) alignment-guided refinement, which iteratively validates and repairs instruction-execution alignment. The two modules operate in a closed loop: exploration supplies refinement with challenging trajectories, while refinement feedback updates the hardness signal to guide future exploration. Extensive experiments show that agents trained with HATS consistently outperform state-of-the-art baselines across benchmark GUI environments.

[155] O3N: Omnidirectional Open-Vocabulary Occupancy Prediction

Mengfei Duan,Hao Shi,Fei Teng,Guoqiang Zhao,Yuheng Zhang,Zhiyong Li,Kailun Yang

Main category: cs.CV

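TL;DR: This paper proposes O3N, the first purely visual, end-to-end omnidirectional open-vocabulary occupancy prediction framework; its three modules (Polar-spiral Mamba, Occupancy Cost Aggregation, and Natural Modality Alignment) provide continuous 360° spatial representation, geometry-semantics consistency, and a unified pixel-voxel-text representation, reaching state-of-the-art results on several benchmarks with strong generalization and semantic scalability.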

Details

Motivation: Existing 3D occupancy prediction methods are limited by narrow-view inputs and predefined training distributions, and cannot meet the need of embodied agents for comprehensive, safe scene perception during open-world exploration. Method: Proposes the O3N framework, comprising: (1) the Polar-spiral Mamba (PsM) module, which embeds omnidirectional voxels in a polar-spiral topology to support continuous spatial representation and long-range context modeling; (2) the Occupancy Cost Aggregation (OCA) module, which unifies geometric and semantic supervision; (3) the Natural Modality Alignment (NMA) module, which aligns visual, voxel, and text features without gradients. Result: Reaches state-of-the-art performance on the QuadOcc and Human360Occ benchmarks while showing strong cross-scene generalization and semantic scalability. Conclusion: O3N offers a new paradigm for universal 3D world modeling, helping embodied intelligence and autonomous agents achieve more robust and semantically richer 3D understanding and reconstruction in open environments. Abstract: Understanding and reconstructing the 3D world through omnidirectional perception is an inevitable trend in the development of autonomous agents and embodied intelligence. However, existing 3D occupancy prediction methods are constrained by limited perspective inputs and predefined training distribution, making them difficult to apply to embodied agents that require comprehensive and safe perception of scenes in open world exploration. To address this, we present O3N, the first purely visual, end-to-end Omnidirectional Open-vocabulary Occupancy predictioN framework. O3N embeds omnidirectional voxels in a polar-spiral topology via the Polar-spiral Mamba (PsM) module, enabling continuous spatial representation and long-range context modeling across 360°. The Occupancy Cost Aggregation (OCA) module introduces a principled mechanism for unifying geometric and semantic supervision within the voxel space, ensuring consistency between the reconstructed geometry and the underlying semantic structure. Moreover, Natural Modality Alignment (NMA) establishes a gradient-free alignment pathway that harmonizes visual features, voxel embeddings, and text semantics, forming a consistent "pixel-voxel-text" representation triad. Extensive experiments on multiple models demonstrate that our method not only achieves state-of-the-art performance on QuadOcc and Human360Occ benchmarks but also exhibits remarkable cross-scene generalization and semantic scalability, paving the way toward universal 3D world modeling. The source code will be made publicly available at https://github.com/MengfeiD/O3N.

[156] FlashMotion: Few-Step Controllable Video Generation with Trajectory Guidance

Quanhao Li,Zhen Xing,Rui Wang,Haidong Cao,Qi Dai,Daoguo Dong,Zuxuan Wu

Main category: cs.CV

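TL;DR: This paper proposes FlashMotion, a framework for few-step trajectory-controllable video generation: it first trains a trajectory adapter on a multi-step generator, then distills the generator into a few-step version, and finally fine-tunes the adapter with a hybrid objective, markedly improving generation efficiency and trajectory accuracy.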

Details

Motivation: Existing trajectory-controllable video generation methods rely on multi-step denoising, which is computationally expensive, while applying video distillation directly to this task noticeably degrades video quality and trajectory accuracy. Method: Proposes the FlashMotion training framework: (1) train a trajectory adapter on a multi-step video generator; (2) distill the generator into a few-step version; (3) fine-tune the adapter with a joint diffusion and adversarial objective so that it matches the few-step generator. It also builds a new benchmark, FlashBench, to evaluate long-sequence trajectory control. Result: On two adapter architectures, FlashMotion surpasses existing video distillation methods and multi-step models in both video quality and trajectory consistency. Conclusion: FlashMotion effectively resolves the trade-off between quality and accuracy in few-step trajectory-controllable video generation, offering a new paradigm for efficient controllable video generation. Abstract: Recent advances in trajectory-controllable video generation have achieved remarkable progress. Previous methods mainly use adapter-based architectures for precise motion control along predefined trajectories. However, all these methods rely on a multi-step denoising process, leading to substantial time redundancy and computational overhead. While existing video distillation methods successfully distill multi-step generators into few-step ones, directly applying these approaches to trajectory-controllable video generation results in noticeable degradation in both video quality and trajectory accuracy. To bridge this gap, we introduce FlashMotion, a novel training framework designed for few-step trajectory-controllable video generation. We first train a trajectory adapter on a multi-step video generator for precise trajectory control. Then, we distill the generator into a few-step version to accelerate video generation. Finally, we finetune the adapter using a hybrid strategy that combines diffusion and adversarial objectives, aligning it with the few-step generator to produce high-quality, trajectory-accurate videos. For evaluation, we introduce FlashBench, a benchmark for long-sequence trajectory-controllable video generation that measures both video quality and trajectory accuracy across varying numbers of foreground objects. Experiments on two adapter architectures show that FlashMotion surpasses existing video distillation methods and previous multi-step models in both visual quality and trajectory consistency.

[157] EgoIntent: An Egocentric Step-level Benchmark for Understanding What, Why, and Next

Ye Pan,Chi Kit Wong,Yuanhuiyi Lyu,Hanqian Li,Jiahao Huo,Jiacheng Chen,Lutao Jiang,Xu Zheng,Xuming Hu

Main category: cs.CV

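TL;DR: This paper presents the EgoIntent benchmark for evaluating fine-grained, step-level intent understanding by multimodal large language models in egocentric video, covering three dimensions: local intent (What), global intent (Why), and next-step plan (Next). Experiments show that current models still perform poorly on this task.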

Details

Motivation: Existing benchmarks focus only on episode-level intent reasoning and overlook finer step-level intent understanding, whereas practical applications such as intelligent assistants, robotic imitation learning, and augmented reality guidance need to understand what is being done at each step, why, and what comes next. Method: Builds the EgoIntent benchmark, with 3,014 steps covering 15 indoor and outdoor daily-life scenarios; each clip is truncated before the key action occurs, which prevents future-frame leakage and gives a clean evaluation of anticipatory step understanding and next-step planning. Result: In an evaluation of 15 mainstream multimodal large language models, the best model averages only 33.31 across the three intent dimensions. Conclusion: Step-level intent understanding in egocentric video remains a highly challenging problem that calls for further research. Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable video reasoning capabilities across diverse tasks. However, their ability to understand human intent at a fine-grained level in egocentric videos remains largely unexplored. Existing benchmarks focus primarily on episode-level intent reasoning, overlooking the finer granularity of step-level intent understanding. Yet applications such as intelligent assistants, robotic imitation learning, and augmented reality guidance require understanding not only what a person is doing at each step, but also why and what comes next, in order to provide timely and context-aware support. To this end, we introduce EgoIntent, a step-level intent understanding benchmark for egocentric videos. It comprises 3,014 steps spanning 15 diverse indoor and outdoor daily-life scenarios, and evaluates models on three complementary dimensions: local intent (What), global intent (Why), and next-step plan (Next). Crucially, each clip is truncated immediately before the key outcome of the queried step (e.g., contact or grasp) occurs and contains no frames from subsequent steps, preventing future-frame leakage and enabling a clean evaluation of anticipatory step understanding and next-step planning. We evaluate 15 MLLMs, including both state-of-the-art closed-source and open-source models. Even the best-performing model achieves an average score of only 33.31 across the three intent dimensions, underscoring that step-level intent understanding in egocentric videos remains a highly challenging problem that calls for further investigation.

[158] GlyphBanana: Advancing Precise Text Rendering Through Agentic Workflows

Zexuan Yan,Jiarui Jin,Yue Ma,Shijian Wang,Jiahui Hu,Wenxiang Jiao,Yuan Lu,Linfeng Zhang

Main category: cs.CV

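TL;DR: This paper proposes GlyphBanana, a training-free agentic workflow that injects glyph templates into the latent space and attention maps to improve the rendering accuracy of text and mathematical formulas, and releases a dedicated benchmark alongside it.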

Details

Motivation: Current generative models follow instructions poorly on out-of-distribution prompts, which leads to inaccurate rendering of complex text and mathematical formulas. Method: GlyphBanana uses an agentic workflow built on auxiliary tools to inject glyph templates into the latent space and attention maps, refining images iteratively; the method needs no training and can be plugged into a variety of text-to-image models. Result: It clearly outperforms existing baselines on the authors' own benchmark, supporting its accuracy and generality for rendering complex characters and formulas. Conclusion: GlyphBanana is a general, efficient, training-free approach to improving text rendering, and offers a new direction for strengthening how T2I models understand and generate complex symbols. Abstract: Despite recent advances in generative models driving significant progress in text rendering, accurately generating complex text and mathematical formulas remains a formidable challenge. This difficulty primarily stems from the limited instruction-following capabilities of current models when encountering out-of-distribution prompts. To address this, we introduce GlyphBanana, alongside a corresponding benchmark specifically designed for rendering complex characters and formulas. GlyphBanana employs an agentic workflow that integrates auxiliary tools to inject glyph templates into both the latent space and attention maps, facilitating the iterative refinement of generated images. Notably, our training-free approach can be seamlessly applied to various Text-to-Image (T2I) models, achieving superior precision compared to existing baselines. Extensive experiments demonstrate the effectiveness of our proposed workflow. Associated code is publicly available at https://github.com/yuriYanZeXuan/GlyphBanana.

[159] LatentGeo: Learnable Auxiliary Constructions in Latent Space for Multimodal Geometric Reasoning

Haiying Xu,Zihan Wang,Song Dai,Zhengxuan Zhang,Kairan Dou,Xuming Hu

Main category: cs.CV

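TL;DR: This paper proposes LatentGeo, a framework that internalizes auxiliary geometric constructions by learning continuous latent visual representations, avoiding pixel-level rendering and external executors; with a three-stage curriculum and the LaGDPO reinforcement learning method, it substantially improves geometric reasoning on benchmarks such as GeoAux.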

Details

Motivation: Existing ways of representing auxiliary geometric constructions model spatial relationships inaccurately, suffer a mismatch between symbolic and geometric representations, or depend on external tools that hinder end-to-end optimization. Method: Proposes the LatentGeo framework: (1) learn continuous latent visual representations that implicitly encode auxiliary constructions; (2) use a three-stage curriculum (with auxiliary visual supervision) to align and internalize these representations; (3) apply LaGDPO, a latent-aware reinforcement learning method for policy optimization. Result: Substantially improves geometric reasoning on the newly built GeoAux benchmark and on MathVerse, with the largest gains on tasks that require auxiliary constructions; ablations confirm the contribution of each module. Conclusion: By learning continuous latent geometric representations end to end, LatentGeo overcomes the limits of the explicit-construction paradigm and offers a new paradigm for geometric reasoning in multimodal models. Abstract: Despite recent advances in multimodal reasoning, representing auxiliary geometric constructions remains a fundamental challenge for multimodal large language models (MLLMs). Such constructions are absent from the original diagram and must be introduced before theorems apply. Existing approaches predominantly rely on explicit construction paradigms, including text-based geometric specification, visual-token interleaving during reasoning, and tool-augmented geometric execution. However, these methods either fail to faithfully represent complex spatial relationships, incur representation mismatch between discrete symbols and continuous geometric structures, or rely on external capabilities that hinder end-to-end optimization. To address these limitations, we propose LatentGeo, a framework that learns continuous latent visual representations to internalize auxiliary geometric constructions without pixel-level rendering or external executors. We design a three-stage curriculum that progressively aligns and internalizes these latent representations through auxiliary visual supervision, followed by LaGDPO, a latent-aware reinforcement learning procedure that stabilizes latent representations during policy optimization while improving end-task correctness. To systematically evaluate construction-centric representation quality, we introduce GeoAux, a new benchmark targeting visually dependent geometry problems, and conduct experiments on GeoAux and MathVerse. Results show that LatentGeo achieves substantial gains on geometric reasoning tasks, particularly those requiring auxiliary constructions. 
Extensive analyses and ablation studies further validate the effectiveness of each component in our framework.

[160] BehaviorVLM: Unified Finetuning-Free Behavioral Understanding with Vision-Language Reasoning

Jingyang Ke,Weihan Li,Amartya Pradhan,Jeffrey Markowitz,Anqi Wu

Main category: cs.CV

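TL;DR: BehaviorVLM is a unified vision-language framework for pose estimation and behavioral understanding of freely moving animals that needs no task-specific fine-tuning and very little human labeling; by guiding pretrained vision-language models (VLMs) through explicit, verifiable multi-step reasoning, it markedly improves scalability, interpretability, and labeling efficiency.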

Details

Motivation: Existing pose estimation and behavioral understanding methods depend heavily on human annotation or unstable unsupervised pipelines, which limits scalability and reproducibility. Method: Proposes the BehaviorVLM framework. For pose estimation, it uses quantum-dot-grounded data and multi-stage temporal, spatial, and cross-view reasoning, and exposes low-confidence labels through geometric checks such as reprojection error. For behavioral understanding, it combines deep embedded clustering, VLM-based per-clip video captioning, and LLM-based reasoning for semantic integration, with no keypoint input required. Result: Greatly reduces the need for human annotation and produces pose labels that can be filtered, corrected, or used to fine-tune downstream models; discovers and semantically labels interpretable behavior segments end to end without keypoints; supports scalable, label-light analysis of multi-animal behavior. Conclusion: BehaviorVLM offers a unified, robust, highly interpretable paradigm with low annotation dependence for analyzing free behavior in neuroscience research. Abstract: Understanding freely moving animal behavior is central to neuroscience, where pose estimation and behavioral understanding form the foundation for linking neural activity to natural actions. Yet both tasks still depend heavily on human annotation or unstable unsupervised pipelines, limiting scalability and reproducibility. We present BehaviorVLM, a unified vision-language framework for pose estimation and behavioral understanding that requires no task-specific finetuning and minimal human labeling by guiding pretrained Vision-Language Models (VLMs) through detailed, explicit, and verifiable reasoning steps. For pose estimation, we leverage quantum-dot-grounded behavioral data and propose a multi-stage pipeline that integrates temporal, spatial, and cross-view reasoning. This design greatly reduces human annotation effort, exposes low-confidence labels through geometric checks such as reprojection error, and produces labels that can later be filtered, corrected, or used to fine-tune downstream pose models. For behavioral understanding, we propose a pipeline that integrates deep embedded clustering for over-segmented behavior discovery, VLM-based per-clip video captioning, and LLM-based reasoning to merge and semantically label behavioral segments. The behavioral pipeline can operate directly from visual information and does not require keypoints to segment behavior. Together, these components enable scalable, interpretable, and label-light analysis of multi-animal behavior.
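
The geometric check mentioned above, reprojection error, is simple to state in code. The sketch below assumes an ideal pinhole camera with a single focal length and no distortion, and flags labels whose 2-D position disagrees with the projection of the corresponding 3-D estimate. The camera parameters, the 3-pixel threshold, and the function names are hypothetical; the paper's actual check may differ.

```python
import math


def project(point_3d, focal, principal_point):
    """Pinhole projection of a camera-frame 3-D point (X, Y, Z) to pixel coordinates."""
    x, y, z = point_3d
    if z <= 0:
        raise ValueError("point must lie in front of the camera")
    cx, cy = principal_point
    return (focal * x / z + cx, focal * y / z + cy)


def reprojection_error(point_3d, labeled_2d, focal, principal_point):
    """Pixel distance between a 2-D label and the projection of its 3-D estimate."""
    u, v = project(point_3d, focal, principal_point)
    return math.hypot(u - labeled_2d[0], v - labeled_2d[1])


def flag_low_confidence(errors, threshold_px=3.0):
    """Indices of labels whose reprojection error exceeds a (hypothetical) threshold."""
    return [i for i, e in enumerate(errors) if e > threshold_px]


keypoint_3d = (0.0, 0.0, 2.0)              # triangulated keypoint in the camera frame
labels = [(50.0, 50.0), (53.0, 54.0)]      # two candidate 2-D labels for that keypoint
errors = [reprojection_error(keypoint_3d, lab, 100.0, (50.0, 50.0)) for lab in labels]
suspect = flag_low_confidence(errors)      # indices to send back for review
```

Labels flagged this way are the ones a pipeline could filter out or route to a human for correction.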

[161] ForensicZip: More Tokens are Better but Not Necessary in Forensic Vision-Language Models

Yingxin Lai,Zitong Yu,Jun Wang,Linlin Shen,Yong Xu,Xiaochun Cao

Main category: cs.CV

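TL;DR: This paper proposes ForensicZip, a training-free visual token compression framework designed from a forgery-driven perspective. It models temporal token evolution as a Birth-Death Optimal Transport problem and combines it with high-frequency priors, keeping forgery detection accurate while substantially speeding up inference and reducing computation.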

Details

Motivation: Existing visual token pruning methods are mostly semantic-driven and tend to discard background regions that contain forgery traces such as high-frequency anomalies and temporal jitter, which hurts multimedia forensics. Method: ForensicZip models temporal token evolution as a Birth-Death Optimal Transport problem with a slack dummy node to quantify physical discontinuities, and fuses a transport-based novelty measure with high-frequency priors to separate forensic evidence from semantic content. Result: On deepfake and AIGC benchmarks, retaining only 10% of visual tokens yields a 2.97× speedup and over 90% FLOPs reduction while maintaining state-of-the-art detection performance. Conclusion: ForensicZip shows that forgery-driven token compression outperforms conventional semantic-driven methods, substantially improving the efficiency of MLLMs on multimedia forensics tasks without sacrificing detection accuracy. Abstract: Multimodal Large Language Models (MLLMs) enable interpretable multimedia forensics by generating textual rationales for forgery detection. However, processing dense visual sequences incurs high computational costs, particularly for high-resolution images and videos. Visual token pruning is a practical acceleration strategy, yet existing methods are largely semantic-driven, retaining salient objects while discarding background regions where manipulation traces such as high-frequency anomalies and temporal jitters often reside. To address this issue, we introduce ForensicZip, a training-free framework that reformulates token compression from a forgery-driven perspective. ForensicZip models temporal token evolution as a Birth-Death Optimal Transport problem with a slack dummy node, quantifying physical discontinuities indicating transient generative artifacts. The forensic scoring further integrates transport-based novelty with high-frequency priors to separate forensic evidence from semantic content under large-ratio compression. Experiments on deepfake and AIGC benchmarks show that at 10% token retention, ForensicZip achieves 2.97× speedup and over 90% FLOPs reduction while maintaining state-of-the-art detection performance.
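
The scoring-then-pruning step can be sketched as follows. The abstract only states that transport-based novelty is integrated with high-frequency priors; the convex combination below, the `alpha` weight, and both function names are assumptions for illustration, and the optimal-transport computation that would produce the novelty values is not shown.

```python
def forensic_scores(novelty, high_freq, alpha=0.5):
    """Fuse two per-token cues into one score (linear fusion is an assumption)."""
    return [alpha * n + (1.0 - alpha) * h for n, h in zip(novelty, high_freq)]

def retain_tokens(scores, keep_ratio=0.10):
    """Keep the top-scoring fraction of tokens and return their indices in order."""
    k = max(1, round(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])
```

With `keep_ratio=0.10`, a sequence of 20 tokens would be reduced to the 2 tokens with the highest fused score, mirroring the 10% retention setting reported in the abstract.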

[162] RDNet: Region Proportion-Aware Dynamic Adaptive Salient Object Detection Network in Optical Remote Sensing Images

Bin Wan,Runmin Cong,Xiaofei Zhou,Hao Fang,Yaoqi Sun,Sam Kwong

Main category: cs.CV

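TL;DR: This paper proposes RDNet, which replaces the CNN backbone with the SwinTransformer and adds three key modules (DAD, FCE, and RPL) to improve robustness to scale variation and localization accuracy in salient object detection for remote sensing images.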

Details

Motivation: Salient object detection in remote sensing images must cope with large variations in object size, the high computational cost of self-attention, and the difficulty CNNs have in modeling global context and long-range dependencies. Method: The Region Proportion-Aware Dynamic Adaptive Salient Object Detection Network (RDNet) replaces the CNN backbone with the SwinTransformer and integrates three modules: Dynamic Adaptive Detail-aware (DAD), Frequency-matching Context Enhancement (FCE), and Region Proportion-aware Localization (RPL). Result: RDNet is more robust to scale variation and localizes objects more precisely in remote sensing salient object detection, outperforming existing state-of-the-art methods. Conclusion: Through dynamic adaptive convolution, frequency-domain context enhancement, and region-proportion-aware localization, RDNet mitigates the challenges of scale diversity and weak global modeling in remote sensing images, improving detection accuracy and generalization. Abstract: Salient object detection (SOD) in remote sensing images faces significant challenges due to large variations in object sizes, the computational cost of self-attention mechanisms, and the limitations of CNN-based extractors in capturing global context and long-range dependencies. Existing methods that rely on fixed convolution kernels often struggle to adapt to diverse object scales, leading to detail loss or irrelevant feature aggregation. To address these issues, this work aims to enhance robustness to scale variations and achieve precise object localization. We propose the Region Proportion-Aware Dynamic Adaptive Salient Object Detection Network (RDNet), which replaces the CNN backbone with the SwinTransformer for global context modeling and introduces three key modules: (1) the Dynamic Adaptive Detail-aware (DAD) module, which applies varied convolution kernels guided by object region proportions; (2) the Frequency-matching Context Enhancement (FCE) module, which enriches contextual information through wavelet interactions and attention; and (3) the Region Proportion-aware Localization (RPL) module, which employs cross-attention to highlight semantic details and integrates a Proportion Guidance (PG) block to assist the DAD module. By combining these modules, RDNet achieves robustness against scale variations and accurate localization, delivering superior detection performance compared with state-of-the-art methods.
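
The idea of choosing convolution kernels from the object region proportion can be illustrated with a toy sketch. Everything concrete here is hypothetical: the binary coarse mask, the two thresholds, and the candidate kernel sizes 3, 5, and 7 are placeholders, not the DAD module's actual design.

```python
def region_proportion(mask):
    """Fraction of pixels marked salient in a binary mask (list of rows of 0/1)."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total

def pick_kernel_size(proportion, small=0.05, large=0.30):
    """Map region proportion to a kernel size (thresholds are hypothetical)."""
    if proportion < small:
        return 3   # tiny objects: small receptive field keeps detail
    if proportion < large:
        return 5
    return 7       # large objects: wider receptive field
```

The sketch only captures the dependency direction stated in the abstract: a proportion estimate guides which kernel is applied.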

[163] Real-World Point Tracking with Verifier-Guided Pseudo-Labeling

Görkay Aydemir,Fatma Güney,Weidi Xie

Main category: cs.CV

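TL;DR: This paper proposes the verifier, a meta-model that assesses the reliability of tracker predictions and guides pseudo-label generation, improving the fine-tuning of long-term point tracking models on real-world videos.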

Details

Motivation: Long-term point tracking models are trained on synthetic data and degrade on real videos, which lack dense ground-truth annotations. Self-training has been explored, but pseudo-label quality depends on the reliability of the teacher models, which varies considerably across frames and scenes. Method: The verifier meta-model takes candidate trajectories from multiple pretrained trackers, evaluates them frame by frame, and selects the most trustworthy predictions, producing high-quality pseudo-label trajectories for subsequent fine-tuning. Result: Experiments on four real-world benchmarks show state-of-the-art performance with less data than prior self-training methods. Conclusion: The verifier improves pseudo-label quality and enables data-efficient real-world adaptation, suggesting a new direction for unsupervised and semi-supervised point tracking. Abstract: Models for long-term point tracking are typically trained on large synthetic datasets. The performance of these models degrades in real-world videos due to different characteristics and the absence of dense ground-truth annotations. Self-training on unlabeled videos has been explored as a practical solution, but the quality of pseudo-labels strongly depends on the reliability of teacher models, which vary across frames and scenes. In this paper, we address the problem of real-world fine-tuning and introduce the verifier, a meta-model that learns to assess the reliability of tracker predictions and guide pseudo-label generation. Given candidate trajectories from multiple pretrained trackers, the verifier evaluates them per frame and selects the most trustworthy predictions, resulting in high-quality pseudo-label trajectories. When applied for fine-tuning, verifier-guided pseudo-labeling substantially improves the quality of supervision and enables data-efficient adaptation to unlabeled videos. Extensive experiments on four real-world benchmarks demonstrate that our approach achieves state-of-the-art results while requiring less data than prior self-training methods. Project page: https://kuis-ai.github.io/track_on_r
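
The per-frame selection step can be sketched as below. The learned verifier is replaced by a precomputed table of reliability scores, which is an assumption made so that the example stays self-contained; the function name and data layout are hypothetical.

```python
def select_pseudo_labels(candidates, reliability):
    """Build one pseudo-label trajectory from several trackers.

    candidates[t][k] is the (x, y) prediction of tracker k at frame t.
    reliability[t][k] is the score a verifier would assign to that prediction
    (here supplied directly instead of being predicted by a model).
    """
    trajectory = []
    for frame_preds, frame_scores in zip(candidates, reliability):
        best = max(range(len(frame_preds)), key=lambda k: frame_scores[k])
        trajectory.append(frame_preds[best])
    return trajectory
```

Because the choice is made independently per frame, the resulting trajectory can switch between trackers whenever their reliability changes.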

[164] A Two-Stage Dual-Modality Model for Facial Emotional Expression Recognition

Jiajun Sun,Zhe Gao

Main category: cs.CV

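TL;DR: This paper proposes a two-stage audio-visual dual-modality model for frame-level recognition of eight facial expression classes in unconstrained videos for the ABAW competition. It combines a DINOv2 visual encoder, PadAug augmentation, an MoE classification head, multi-scale re-cropping, Wav2Vec 2.0 audio features, and gated fusion to improve robustness and temporal consistency, reaching a Macro-F1 of 0.5368 on the ABAW validation set.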

Details

Motivation: Expression recognition in unconstrained videos for the ABAW challenge suffers from inaccurate face localization, large pose and scale variations, motion blur, and temporal instability across frames. Method: A two-stage dual-modality approach. Stage I uses DINOv2 ViT-L/14 as the visual backbone together with Padding-aware Augmentation (PadAug) and a Mixture-of-Experts (MoE) training head. Stage II re-crops faces at multiple scales to build a robust visual representation, fuses it with frame-aligned Wav2Vec 2.0 audio features through a lightweight gated fusion module, and applies temporal smoothing at inference time. Result: Macro-F1 reaches 0.5368 on the official ABAW validation set and 0.5122 ± 0.0277 under 5-fold cross-validation, outperforming the official baselines. Conclusion: The two-stage audio-visual fusion framework improves the robustness and temporal consistency of expression recognition in complex real-world scenes, and confirms the value of jointly modeling pretrained DINOv2 visual representations and Wav2Vec 2.0 audio features. Abstract: This paper addresses the expression (EXPR) recognition challenge in the 10th Affective Behavior Analysis in-the-Wild (ABAW) workshop and competition, which requires frame-level classification of eight facial emotional expressions from unconstrained videos. This task is challenging due to inaccurate face localization, large pose and scale variations, motion blur, temporal instability, and other confounding factors across adjacent frames. We propose a two-stage dual-modal (audio-visual) model to address these difficulties. Stage I focuses on robust visual feature extraction with a pretrained DINOv2-based encoder. Specifically, DINOv2 ViT-L/14 is used as the backbone, a padding-aware augmentation (PadAug) strategy is employed for image padding and data preprocessing from raw videos, and a mixture-of-experts (MoE) training head is introduced to enhance classifier diversity. Stage II addresses modality fusion and temporal consistency. For the visual modality, faces are re-cropped from raw videos at multiple scales, and the extracted visual features are averaged to form a robust frame-level representation. Concurrently, frame-aligned Wav2Vec 2.0 audio features are derived from short audio windows to provide complementary acoustic cues. These dual-modal features are integrated via a lightweight gated fusion module, followed by inference-time temporal smoothing. Experiments on the ABAW dataset demonstrate the effectiveness of the proposed method. The two-stage model achieves a Macro-F1 score of 0.5368 on the official validation set and 0.5122 ± 0.0277 under 5-fold cross-validation, outperforming the official baselines.
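
Gated fusion followed by inference-time temporal smoothing can be sketched in a few lines. The sketch assumes an element-wise sigmoid gate whose logits are passed in directly (in a trained model they would be predicted from the features) and a centered moving average as the smoother; both choices and all names are illustrative rather than the paper's exact modules.

```python
import math

def gated_fusion(visual, audio, gate_logits):
    """Blend two feature vectors per dimension: g * visual + (1 - g) * audio."""
    fused = []
    for v, a, z in zip(visual, audio, gate_logits):
        g = 1.0 / (1.0 + math.exp(-z))  # sigmoid gate in (0, 1)
        fused.append(g * v + (1.0 - g) * a)
    return fused

def smooth_probs(frame_probs, window=3):
    """Centered moving average over per-frame class probabilities."""
    half = window // 2
    smoothed = []
    for t in range(len(frame_probs)):
        lo, hi = max(0, t - half), min(len(frame_probs), t + half + 1)
        n = hi - lo
        smoothed.append(
            [sum(p[c] for p in frame_probs[lo:hi]) / n
             for c in range(len(frame_probs[t]))]
        )
    return smoothed
```

Smoothing of this kind suppresses isolated single-frame label flips, which is one way to address the temporal instability the abstract describes.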

[165] HiAP: A Multi-Granular Stochastic Auto-Pruning Framework for Vision Transformers

Andy Li,Aiden Durrant,Milan Markovic,Georgios Leontidis

Main category: cs.CV

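TL;DR: This paper proposes Hierarchical Auto-Pruning (HiAP), an end-to-end structured pruning framework that uses multi-granular stochastic gates (macro and micro) to discover efficient sub-networks automatically, without manually set sparsity targets or complex multi-stage pipelines, and achieves a strong accuracy-efficiency trade-off on ImageNet.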

Details

Motivation: Deploying Vision Transformers on edge devices is limited by their high compute and memory-bandwidth demands; existing structured pruning methods usually work at a single granularity, involve complex pipelines, and rely on post-hoc thresholding. Method: HiAP uses stochastic Gumbel-Sigmoid gates to prune jointly at two levels, macro (attention heads and FFN blocks) and micro (intra-head dimensions and FFN neurons), and is optimized end to end with a loss that combines structural feasibility penalties and analytical FLOPs. Result: On ImageNet, HiAP automatically discovers efficient architectures and, for models such as DeiT-Small, reaches an accuracy-efficiency Pareto frontier comparable to complex multi-stage methods while greatly simplifying the deployment pipeline. Conclusion: HiAP is a single-stage, multi-granular, end-to-end trainable structured pruning method that needs no manual priors and addresses both memory and compute bottlenecks, which makes it well suited to edge deployment. Abstract: Vision Transformers require significant computational resources and memory bandwidth, severely limiting their deployment on edge devices. While recent structured pruning methods successfully reduce theoretical FLOPs, they typically operate at a single structural granularity and rely on complex, multi-stage pipelines with post-hoc thresholding to satisfy sparsity budgets. In this paper, we propose Hierarchical Auto-Pruning (HiAP), a continuous relaxation framework that discovers optimal sub-networks in a single end-to-end training phase without requiring manual importance heuristics or predefined per-layer sparsity targets. HiAP introduces stochastic Gumbel-Sigmoid gates at multiple granularities: macro-gates to prune entire attention heads and FFN blocks, and micro-gates to selectively prune intra-head dimensions and FFN neurons. By optimizing both levels simultaneously, HiAP addresses both the memory-bound overhead of loading large matrices and the compute-bound mathematical operations. HiAP naturally converges to stable sub-networks using a loss function that incorporates both structural feasibility penalties and analytical FLOPs. Extensive experiments on ImageNet demonstrate that HiAP organically discovers highly efficient architectures, and achieves a competitive accuracy-efficiency Pareto frontier for models like DeiT-Small, matching the performance of sophisticated multi-stage methods while significantly simplifying the deployment pipeline.
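
A Gumbel-Sigmoid gate can be sketched with the standard binary-concrete relaxation: logistic noise (the difference of two Gumbel samples) is added to a learnable logit and the result is squashed by a temperature-scaled sigmoid. The FLOPs proxy below, which weights each unit's cost by its keep probability, is an assumption about what an analytical FLOPs term could look like; the names are hypothetical.

```python
import math
import random

def gumbel_sigmoid(logit, tau=1.0, rng=random):
    """Sample a relaxed binary gate in (0, 1) for one prunable unit."""
    u = min(max(rng.random(), 1e-9), 1.0 - 1e-9)
    noise = math.log(u) - math.log(1.0 - u)  # logistic noise
    return 1.0 / (1.0 + math.exp(-(logit + noise) / tau))

def hard_gate(logit):
    """Deterministic keep/prune decision once training is done."""
    return 1 if logit > 0 else 0

def expected_kept_flops(unit_flops, logits):
    """FLOPs proxy: each unit's cost weighted by sigmoid(logit) (assumption)."""
    return sum(f / (1.0 + math.exp(-z)) for f, z in zip(unit_flops, logits))
```

In a two-level setup of the kind the abstract describes, a micro-gate's output would typically be multiplied by its parent macro-gate, so pruning a whole head also removes every dimension inside it.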

[166] SceneAssistant: A Visual Feedback Agent for Open-Vocabulary 3D Scene Generation

Jun Luo,Jiaxiang Tang,Ruijie Lu,Gang Zeng

Main category: cs.CV

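TL;DR: This paper proposes SceneAssistant, a visual-feedback-driven agent for open-vocabulary text-to-3D scene generation that combines the spatial reasoning of VLMs with 3D generation models and iteratively refines the scene layout through atomic operations.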

Details

Motivation: Existing text-to-3D scene generation methods are restricted to specific domains or predefined spatial relationships and struggle with open-vocabulary, unconstrained 3D scene synthesis. Method: SceneAssistant is a visual-feedback-driven agent that uses a VLM for spatial reasoning and planning and exposes atomic operations such as Scale, Rotate, and FocusOn; at each step it receives rendered image feedback and iteratively adjusts the 3D scene. Result: Experiments show that the method generates diverse, open-vocabulary, high-quality 3D scenes, outperforms existing methods in both qualitative analysis and quantitative human evaluation, and supports scene editing driven by natural language. Conclusion: SceneAssistant overcomes the limits of open-vocabulary 3D scene generation and offers a more flexible and controllable approach to digital content creation. Abstract: Text-to-3D scene generation from natural language is highly desirable for digital content creation. However, existing methods are largely domain-restricted or reliant on predefined spatial relationships, limiting their capacity for unconstrained, open-vocabulary 3D scene synthesis. In this paper, we introduce SceneAssistant, a visual-feedback-driven agent designed for open-vocabulary 3D scene generation. Our framework leverages a modern 3D object generation model along with the spatial reasoning and planning capabilities of Vision-Language Models (VLMs). To enable open-vocabulary scene composition, we provide the VLMs with a comprehensive set of atomic operations (e.g., Scale, Rotate, FocusOn). At each interaction step, the VLM receives rendered visual feedback and takes actions accordingly, iteratively refining the scene to achieve more coherent spatial arrangements and better alignment with the input text. Experimental results demonstrate that our method can generate diverse, open-vocabulary, and high-quality 3D scenes. Both qualitative analysis and quantitative human evaluations demonstrate the superiority of our approach over existing methods. Furthermore, our method allows users to instruct the agent to edit existing scenes based on natural language commands. Our code is available at https://github.com/ROUJINN/SceneAssistant
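
The observe-act-refine loop can be sketched as follows. Only the operation names `Scale` and `Rotate` come from the abstract; the scene representation, the action tuple format, and the `policy` callable that stands in for the VLM looking at a rendered image are all hypothetical.

```python
def apply_action(scene, action):
    """Apply one atomic operation and return a new scene (input is not mutated)."""
    name, target, amount = action
    obj = dict(scene[target])
    if name == "Scale":
        obj["scale"] *= amount
    elif name == "Rotate":
        obj["yaw_deg"] = (obj["yaw_deg"] + amount) % 360
    else:
        raise ValueError(f"unsupported operation in this sketch: {name}")
    updated = dict(scene)
    updated[target] = obj
    return updated

def refine(scene, policy, max_steps=10):
    """Repeatedly ask the policy for an action until it returns None."""
    for _ in range(max_steps):
        action = policy(scene)  # stand-in for a VLM call on rendered feedback
        if action is None:
            break
        scene = apply_action(scene, action)
    return scene
```

A scripted policy such as "double the chair's scale until it reaches 1.0" is enough to exercise the loop; in the described system the decision would instead come from the VLM.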

[167] BiGain: Unified Token Compression for Joint Generation and Classification

Jiacheng Liu,Shengkun Tang,Jiacheng Cui,Dongkuan Xu,Zhiqiang Shen

Main category: cs.CV

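TL;DR: This paper proposes BiGain, a framework built on frequency separation that introduces two frequency-aware token compression operators (Laplacian-gated token merging and Interpolate-Extrapolate KV Downsampling), substantially improving the classification performance of accelerated diffusion models without sacrificing generation quality.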

Details

Motivation: Existing acceleration methods for diffusion models, such as token merging or downsampling, usually consider only the trade-off between generation quality and compute, overlooking discriminative capacity such as classification performance; generation and discrimination are not optimized jointly. Method: BiGain is a training-free, plug-and-play framework centered on frequency separation: features are mapped to a frequency-aware representation that disentangles detail from semantics. It comprises (1) Laplacian-gated token merging, which preserves high-frequency edges and textures, and (2) Interpolate-Extrapolate KV Downsampling, which compresses keys and values in a controllable way while leaving queries intact. Result: Across backbones (DiT and U-Net) and datasets (ImageNet-1K and others), BiGain consistently raises classification accuracy under acceleration (for example, +7.15% on ImageNet-1K) while maintaining or improving generation quality (FID lower by 0.34); the analysis indicates that balanced retention of high- and low-frequency components is the key criterion for effective compression. Conclusion: BiGain is the first framework to jointly improve the generation and classification abilities of accelerated diffusion models, supporting lower-cost deployment; frequency-aware design is an important guiding principle for token compression. Abstract: Acceleration methods for diffusion models (e.g., token merging or downsampling) typically optimize synthesis quality under reduced compute, yet often ignore discriminative capacity. We revisit token compression with a joint objective and present BiGain, a training-free, plug-and-play framework that preserves generation quality while improving classification in accelerated diffusion models. Our key insight is frequency separation: mapping feature-space signals into a frequency-aware representation disentangles fine detail from global semantics, enabling compression that respects both generative fidelity and discriminative utility. BiGain reflects this principle with two frequency-aware operators: (1) Laplacian-gated token merging, which encourages merges among spectrally smooth tokens while discouraging merges of high-contrast tokens, thereby retaining edges and textures; and (2) Interpolate-Extrapolate KV Downsampling, which downsamples keys/values via a controllable interextrapolation between nearest and average pooling while keeping queries intact, thereby conserving attention precision. Across DiT- and U-Net-based backbones and ImageNet-1K, ImageNet-100, Oxford-IIIT Pets, and COCO-2017, our operators consistently improve the speed-accuracy trade-off for diffusion-based classification, while maintaining or enhancing generation quality under comparable acceleration. For instance, on ImageNet-1K, with 70% token merging on Stable Diffusion 2.0, BiGain increases classification accuracy by 7.15% while improving FID by 0.34 (1.85%). Our analyses indicate that balanced spectral retention, preserving high-frequency detail and low/mid-frequency semantics, is a reliable design rule for token compression in diffusion models. To our knowledge, BiGain is the first framework to jointly study and advance both generation and classification under accelerated diffusion, supporting lower-cost deployment.
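
The second operator, a controllable blend between nearest and average pooling, can be sketched in one dimension. The parameterization `nearest + lam * (average - nearest)` is an assumption: `lam` in [0, 1] interpolates between the two pooling modes and values outside that range extrapolate. Taking the first token of each window as the "nearest" sample is also an illustrative choice.

```python
def interp_extrap_downsample(tokens, stride=2, lam=0.5):
    """Downsample a sequence of key or value vectors by `stride`.

    lam = 0 reproduces nearest pooling, lam = 1 reproduces average pooling,
    and lam outside [0, 1] extrapolates past either endpoint.
    """
    out = []
    for start in range(0, len(tokens) - stride + 1, stride):
        window = tokens[start:start + stride]
        nearest = window[0]
        average = [sum(col) / stride for col in zip(*window)]
        out.append([n + lam * (a - n) for n, a in zip(nearest, average)])
    return out
```

Only keys and values would be passed through this function; queries keep their full length, so the attention output still has one row per original token.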

[168] One Model, Many Budgets: Elastic Latent Interfaces for Diffusion Transformers

Moayed Haji-Ali,Willi Menapace,Ivan Skorokhodov,Dogyun Park,Anil Kag,Michael Vasilkovsky,Sergey Tulyakov,Vicente Ordonez,Aliaksandr Siarohin

Main category: cs.CV

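TL;DR: This paper proposes the Elastic Latent Interface Transformer (ELIT), a lightweight, drop-in mechanism that introduces a variable-length latent interface and importance-ordered attention to decouple image resolution from compute, enabling a dynamic compute-quality trade-off while leaving the DiT architecture unchanged.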

Details

Motivation: Existing diffusion transformers (DiTs) tie FLOPs to image resolution and spread computation uniformly over all spatial tokens, so they cannot flexibly trade generation quality against inference latency and waste compute. Method: ELIT introduces a learnable, variable-length latent interface as an intermediate token sequence; lightweight Read and Write cross-attention layers exchange information between spatial tokens and latents; training with random dropping of tail latents yields importance-ordered representations; at inference the number of latents can be adjusted dynamically to fit compute constraints. Result: On ImageNet-1K 512px, FID and FDD improve by 35.3% and 39.6% on average; gains are consistent across DiT variants (U-ViT, HDiT, MM-DiT) and datasets. Conclusion: ELIT is a minimal-change, broadly compatible enhancement for DiTs that provides compute scalability and importance-aware modeling for high-quality controllable generation. Abstract: Diffusion transformers (DiTs) achieve high generative quality but lock FLOPs to image resolution, limiting principled latency-quality trade-offs, and allocate computation uniformly across input spatial tokens, wasting resources on unimportant regions. We introduce Elastic Latent Interface Transformer (ELIT), a drop-in, DiT-compatible mechanism that decouples input image size from compute. Our approach inserts a latent interface, a learnable variable-length token sequence on which standard transformer blocks can operate. Lightweight Read and Write cross-attention layers move information between spatial tokens and latents and prioritize important input regions. By training with random dropping of tail latents, ELIT learns to produce importance-ordered representations with earlier latents capturing global structure while later ones contain information to refine details. At inference, the number of latents can be dynamically adjusted to match compute constraints. ELIT is deliberately minimal, adding two cross-attention layers while leaving the rectified flow objective and the DiT stack unchanged. Across datasets and architectures (DiT, U-ViT, HDiT, MM-DiT), ELIT delivers consistent gains. On ImageNet-1K 512px, ELIT delivers an average gain of 35.3% and 39.6% in FID and FDD scores. Project page: https://snap-research.github.io/elit/
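
Random tail dropping during training and budget-driven truncation at inference can both be written as prefix selection on the latent sequence. The uniform sampling of the kept length and the `budget_fraction` parameter are assumptions for illustration; the abstract does not specify either.

```python
import random

def drop_tail_latents(latents, min_keep=1, rng=random):
    """Training: keep a random-length prefix so early latents carry the most."""
    keep = rng.randint(min_keep, len(latents))  # inclusive on both ends
    return latents[:keep]

def latents_for_budget(latents, budget_fraction):
    """Inference: keep the prefix that fits the available compute budget."""
    keep = max(1, int(len(latents) * budget_fraction))
    return latents[:keep]
```

Because the model never knows how many tail latents will survive, information that matters most has to be placed in the earliest positions, which is what makes truncation at inference safe.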

[169] Trust Your Critic: Robust Reward Modeling and Reinforcement Learning for Faithful Image Editing and Generation

Xiangyu Zhao,Peiyuan Zhang,Junming Lin,Tianhao Liang,Yuchen Duan,Shengyuan Ding,Changyao Tian,Yuhang Zang,Junchi Yan,Xue Yang

Main category: cs.CV

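TL;DR: This paper proposes the FIRM framework, which builds high-quality datasets, trains specialized reward models, and introduces a "Base-and-Bonus" strategy to substantially improve fidelity and instruction following in image editing and text-to-image generation.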

Details

Motivation: Reward models used in reinforcement learning suffer from hallucinations and noisy scores, which misguide optimization. Method: (1) Tailored data curation pipelines build the high-quality scoring datasets FIRM-Edit-370K and FIRM-Gen-293K; (2) specialized 8B-parameter reward models, FIRM-Edit-8B and FIRM-Gen-8B, are trained on them; (3) a "Base-and-Bonus" reward strategy is proposed, with CME for editing and QMA for generation; (4) the FIRM-Bench evaluation benchmark is constructed. Result: The FIRM reward models align better with human judgment than existing metrics; FIRM-Qwen-Edit and FIRM-SD3.5 achieve substantial gains in fidelity and instruction following; hallucinations are mitigated. Conclusion: FIRM provides a more accurate and reliable reward modeling approach for image editing and generation and sets a new standard for fidelity and instruction adherence. Abstract: Reinforcement learning (RL) has emerged as a promising paradigm for enhancing image editing and text-to-image (T2I) generation. However, current reward models, which act as critics during RL, often suffer from hallucinations and assign noisy scores, inherently misguiding the optimization process. In this paper, we present FIRM (Faithful Image Reward Modeling), a comprehensive framework that develops robust reward models to provide accurate and reliable guidance for faithful image generation and editing. First, we design tailored data curation pipelines to construct high-quality scoring datasets. Specifically, we evaluate editing using both execution and consistency, while generation is primarily assessed via instruction following. Using these pipelines, we collect the FIRM-Edit-370K and FIRM-Gen-293K datasets, and train specialized reward models (FIRM-Edit-8B and FIRM-Gen-8B) that accurately reflect these criteria. Second, we introduce FIRM-Bench, a comprehensive benchmark specifically designed for editing and generation critics. Evaluations demonstrate that our models achieve superior alignment with human judgment compared to existing metrics. Furthermore, to seamlessly integrate these critics into the RL pipeline, we formulate a novel "Base-and-Bonus" reward strategy that balances competing objectives: Consistency-Modulated Execution (CME) for editing and Quality-Modulated Alignment (QMA) for generation. Empowered by this framework, our resulting models FIRM-Qwen-Edit and FIRM-SD3.5 achieve substantial performance breakthroughs. Comprehensive experiments demonstrate that FIRM mitigates hallucinations, establishing a new standard for fidelity and instruction adherence over existing general models. All of our datasets, models, and code are publicly available at https://firm-reward.github.io.

[170] DVD: Deterministic Video Depth Estimation with Generative Priors

Hongfei Zhang,Harold Haodong Chen,Chenfei Liao,Jing He,Zixin Zhang,Haodong Li,Yihao Liang,Kanghao Chen,Bin Ren,Xu Zheng,Shuai Yang,Kun Zhou,Yinchuan Li,Nicu Sebe,Ying-Cong Chen

Main category: cs.CV

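TL;DR: DVD is a video depth estimation framework that deterministically adapts pretrained video diffusion models, resolving the tension between the geometric hallucinations of generative models and the data dependence of discriminative models, and achieving zero-shot state-of-the-art performance.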

Details

Motivation: Existing video depth estimation faces a dilemma: generative models suffer from geometric hallucinations and scale drift, while discriminative models depend on large amounts of labeled data. A new approach is needed that offers both geometric consistency and data efficiency. Method: DVD has three core designs: (i) the diffusion timestep is used as a structural anchor to balance global stability with high-frequency detail; (ii) latent manifold rectification (LMR) imposes differential constraints to counter regression-induced over-smoothing and restore sharp boundaries and coherent motion; (iii) global affine coherence bounds divergence between windows, enabling long-video inference without complex temporal alignment. Result: DVD achieves zero-shot state-of-the-art performance on multiple benchmarks and unlocks the geometric priors implicit in video foundation models using only 1/163 of the task-specific data required by leading baselines. Conclusion: DVD is the first to deterministically convert pretrained video diffusion models into single-pass depth regressors, providing an efficient, robust, and open-source path for video geometry understanding. Abstract: Existing video depth estimation faces a fundamental trade-off: generative models suffer from stochastic geometric hallucinations and scale drift, while discriminative models demand massive labeled datasets to resolve semantic ambiguities. To break this impasse, we present DVD, the first framework to deterministically adapt pre-trained video diffusion models into single-pass depth regressors. Specifically, DVD features three core designs: (i) repurposing the diffusion timestep as a structural anchor to balance global stability with high-frequency details; (ii) latent manifold rectification (LMR) to mitigate regression-induced over-smoothing, enforcing differential constraints to restore sharp boundaries and coherent motion; and (iii) global affine coherence, an inherent property bounding inter-window divergence, which enables seamless long-video inference without requiring complex temporal alignment. Extensive experiments demonstrate that DVD achieves state-of-the-art zero-shot performance across benchmarks. Furthermore, DVD successfully unlocks the profound geometric priors implicit in video foundation models using 163x less task-specific data than leading baselines. Notably, we fully release our pipeline, providing the whole training suite for SOTA video depth estimation to benefit the open-source community.
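
One possible reading of global affine coherence is that depth predictions from adjacent windows differ by roughly a scale and a shift, in which case a closed-form least-squares fit on the overlapping frames would be enough to stitch windows together. The sketch below illustrates that reading only; it is an assumption about how the property could be exploited, not the paper's inference procedure, and the function names are hypothetical.

```python
def fit_scale_shift(source, target):
    """Least-squares (scale, shift) such that scale * source + shift ≈ target."""
    n = len(source)
    mean_s = sum(source) / n
    mean_t = sum(target) / n
    var_s = sum((s - mean_s) ** 2 for s in source)
    if var_s == 0.0:
        return 1.0, mean_t - mean_s  # constant overlap: only a shift is identifiable
    cov = sum((s - mean_s) * (t - mean_t) for s, t in zip(source, target))
    scale = cov / var_s
    return scale, mean_t - scale * mean_s

def align_window(depths, scale, shift):
    """Map a window's depth values into the reference window's frame."""
    return [scale * d + shift for d in depths]
```

Here `source` would hold the new window's depths on the overlapping frames and `target` the previous window's depths on the same frames.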

[171] Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing

Baifeng Shi,Stephanie Fu,Long Lian,Hanrong Ye,David Eigen,Aaron Reite,Boyi Li,Jan Kautz,Song Han,David M. Chan,Pavlo Molchanov,Trevor Darrell,Hongxu Yin

Main category: cs.CV

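TL;DR: AutoGaze is a lightweight module that autoregressively selects multi-scale visual patches, sharply reducing the number of visual tokens for video input while keeping reconstruction error within a bound, which significantly improves the efficiency and performance of MLLMs on long, high-resolution videos.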

Details

Motivation: Existing MLLMs model every pixel uniformly, so on long, high-resolution videos they are limited by redundant computation and GPU memory. Method: AutoGaze combines next-token prediction with reinforcement learning to train a lightweight module that autoregressively selects the smallest set of multi-scale patches able to reconstruct the original video within a user-specified error threshold. Result: Visual tokens are reduced by 4x to 100x and ViT/MLLM inference is accelerated by up to 19x; the approach supports understanding of 1K-frame, 4K-resolution videos; it reaches 67.0% on VideoMME, and on the newly proposed HLVid (a QA benchmark of 5-minute 4K videos) it improves over the baseline by 10.1% and over the previous best MLLM by 4.5%. Conclusion: AutoGaze alleviates spatiotemporal redundancy in video understanding, offers a practical path for scaling MLLMs to long, high-resolution videos, and contributes a new evaluation benchmark to the field. Abstract: Multi-modal large language models (MLLMs) have advanced general-purpose video understanding but struggle with long, high-resolution videos: they process every pixel equally in their vision transformers (ViTs) or LLMs despite significant spatiotemporal redundancy. We introduce AutoGaze, a lightweight module that removes redundant patches before they are processed by a ViT or an MLLM. Trained with next-token prediction and reinforcement learning, AutoGaze autoregressively selects a minimal set of multi-scale patches that can reconstruct the video within a user-specified error threshold, eliminating redundancy while preserving information. Empirically, AutoGaze reduces visual tokens by 4x-100x and accelerates ViTs and MLLMs by up to 19x, enabling scaling MLLMs to 1K-frame 4K-resolution videos and achieving superior results on video benchmarks (e.g., 67.0% on VideoMME). Furthermore, we introduce HLVid: the first high-resolution, long-form video QA benchmark with 5-minute 4K-resolution videos, where an MLLM scaled with AutoGaze improves over the baseline by 10.1% and outperforms the previous best MLLM by 4.5%. Project page: https://autogaze.github.io/.
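
The goal of keeping only the patches needed to reconstruct the video within an error threshold can be illustrated with a hand-written heuristic. This is not the learned autoregressive policy from the paper: the sketch uses scalar stand-ins for patches, a single scale, and a "copy the last kept patch" reconstruction, all of which are simplifying assumptions.

```python
def select_patches(frames, threshold):
    """Return (frame, patch) indices to keep.

    frames[t][p] is a scalar stand-in for patch p of frame t. A patch is kept
    only when reusing the last kept patch at that position would exceed the
    allowed reconstruction error.
    """
    last_kept = list(frames[0])
    selected = [(0, p) for p in range(len(frames[0]))]  # first frame kept whole
    for t in range(1, len(frames)):
        for p, value in enumerate(frames[t]):
            if abs(value - last_kept[p]) > threshold:
                selected.append((t, p))
                last_kept[p] = value
    return selected
```

On mostly static footage this kind of selection keeps a small fraction of patches, which is the source of the token savings; the threshold plays the role of the user-specified error bound.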

[172] Spatial-TTT: Streaming Visual-based Spatial Intelligence with Test-Time Training

Fangfu Liu,Diankun Wu,Jiawei Chi,Yimo Cai,Yi-Hsin Hung,Xumin Yu,Hao Li,Han Hu,Yongming Rao,Yueqi Duan

Main category: cs.CV

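TL;DR: This paper proposes Spatial-TTT, a test-time training method for streaming visual spatial intelligence. It dynamically updates fast weights to model 3D spatial information in long-horizon videos efficiently, and combines a hybrid architecture, sliding-window attention, and a spatial-predictive mechanism to reach state-of-the-art results on video spatial understanding tasks.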

Details

Motivation: Humans understand real-world spaces through continuous visual observation, so models need to maintain and update spatial evidence continually over unbounded video streams; the core challenge is how to select, organize, and retain spatial information over time rather than simply enlarging the context window. Method: The Spatial-TTT framework uses test-time training (TTT) to update a subset of parameters (fast weights) that model spatial evidence; a hybrid architecture combines large-chunk updates with sliding-window attention; a spatial-predictive mechanism based on 3D spatiotemporal convolution strengthens the modeling of geometric correspondence and temporal continuity; and a new dataset with dense 3D spatial descriptions guides the fast weights to memorize global 3D spatial signals in a structured way. Result: Long-horizon spatial understanding improves markedly on multiple video spatial understanding benchmarks, reaching state-of-the-art performance. Conclusion: Spatial-TTT validates test-time adaptation for streaming spatial intelligence and offers a new approach to long-horizon visual spatial modeling, contributing on architecture, mechanism design, and data construction. Abstract: Humans perceive and understand real-world spaces through a stream of visual observations. Therefore, the ability to maintain and update spatial evidence in a streaming fashion from potentially unbounded video streams is essential for spatial intelligence. The core challenge is not simply longer context windows but how spatial information is selected, organized, and retained over time. In this paper, we propose Spatial-TTT towards streaming visual-based spatial intelligence with test-time training (TTT), which adapts a subset of parameters (fast weights) to capture and organize spatial evidence over long-horizon scene videos. Specifically, we design a hybrid architecture and adopt large-chunk updates in parallel with sliding-window attention for efficient spatial video processing. To further promote spatial awareness, we introduce a spatial-predictive mechanism applied to TTT layers with 3D spatiotemporal convolution, which encourages the model to capture geometric correspondence and temporal continuity across frames. Beyond architecture design, we construct a dataset with dense 3D spatial descriptions, which guides the model to update its fast weights to memorize and organize global 3D spatial signals in a structured manner. Extensive experiments demonstrate that Spatial-TTT improves long-horizon spatial understanding and achieves state-of-the-art performance on video spatial benchmarks. Project page: https://liuff19.github.io/Spatial-TTT.
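
The general idea of updating fast weights chunk by chunk at test time can be shown with a toy linear memory. The squared-error objective, the learning rate, and the (key, value) pair format are assumptions chosen for clarity; the paper's TTT layers, spatial-predictive objective, and 3D convolution are not reproduced here.

```python
def predict(weights, key):
    """Linear fast-weight readout."""
    return sum(w * k for w, k in zip(weights, key))

def chunk_loss(weights, chunk):
    """Mean squared error of the fast weights on one chunk of (key, value) pairs."""
    return sum((predict(weights, k) - v) ** 2 for k, v in chunk) / len(chunk)

def ttt_chunk_update(weights, chunk, lr=0.1):
    """One gradient step on the chunk loss; returns the updated fast weights."""
    grad = [0.0] * len(weights)
    for key, value in chunk:
        err = predict(weights, key) - value
        for i, k in enumerate(key):
            grad[i] += 2.0 * err * k / len(chunk)
    return [w - lr * g for w, g in zip(weights, grad)]
```

Processing a stream then amounts to calling `ttt_chunk_update` once per incoming chunk, so the memory cost stays fixed no matter how long the video runs.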

[173] DreamVideo-Omni: Omni-Motion Controlled Multi-Subject Video Customization with Latent Identity Reinforcement Learning

Yujie Wei,Xinyu Liu,Shiwei Zhang,Hangjie Yuan,Jinbo Xing,Zhekai Chen,Xiang Wang,Haonan Qiu,Rui Zhao,Yutong Feng,Ruihang Chu,Yingya Zhang,Yike Guo,Xihui Liu,Hongming Shan

Main category: cs.CV

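TL;DR: DreamVideo-Omni proposes a two-stage training framework that addresses joint control of multi-subject identity and multi-granularity motion through multi-signal joint control, condition-aware 3D positional embedding, hierarchical motion injection, and group/role embeddings, and adds a latent identity reward feedback mechanism to improve identity preservation.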

Details

Motivation: Existing large models for video generation struggle to control multi-subject identity and multi-granularity motion precisely at the same time, and suffer from limited motion granularity, control ambiguity, and identity degradation. Method: DreamVideo-Omni trains in two stages. The first stage fuses control signals for appearance, global and local motion, and camera movement, introduces a condition-aware 3D rotary positional embedding and hierarchical motion injection, and uses group and role embeddings to disentangle multiple subjects. The second stage designs a latent-space identity reward feedback learning paradigm, training a reward model that guides identity preservation. Result: On the authors' curated large-scale dataset and the DreamOmni Bench evaluation benchmark, DreamVideo-Omni clearly outperforms existing methods on multi-subject customization and omni-motion control, generating high-quality, highly controllable videos. Conclusion: Through a unified framework and new modules, DreamVideo-Omni achieves coordinated, precise control of multi-subject identity and multi-granularity motion, offering a new approach to controllable video generation. Abstract: While large-scale diffusion models have revolutionized video synthesis, achieving precise control over both multi-subject identity and multi-granularity motion remains a significant challenge. Recent attempts to bridge this gap often suffer from limited motion granularity, control ambiguity, and identity degradation, leading to suboptimal performance on identity preservation and motion control. In this work, we present DreamVideo-Omni, a unified framework enabling harmonious multi-subject customization with omni-motion control via a progressive two-stage training paradigm. In the first stage, we integrate comprehensive control signals for joint training, encompassing subject appearances, global motion, local dynamics, and camera movements. To ensure robust and precise controllability, we introduce a condition-aware 3D rotary positional embedding to coordinate heterogeneous inputs and a hierarchical motion injection strategy to enhance global motion guidance. Furthermore, to resolve multi-subject ambiguity, we introduce group and role embeddings to explicitly anchor motion signals to specific identities, effectively disentangling complex scenes into independent controllable instances. In the second stage, to mitigate identity degradation, we design a latent identity reward feedback learning paradigm by training a latent identity reward model upon a pretrained video diffusion backbone. This provides motion-aware identity rewards in the latent space, prioritizing identity preservation aligned with human preferences. Supported by our curated large-scale dataset and the comprehensive DreamOmni Bench for multi-subject and omni-motion control evaluation, DreamVideo-Omni demonstrates superior performance in generating high-quality videos with precise controllability.

[174] Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously

Yiran Guan,Liang Yin,Dingkang Liang,Jianzhong Ju,Zhenbo Luo,Jian Luan,Yuliang Liu,Xiang Bai

Main category: cs.CV

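TL;DR: This paper proposes Video Streaming Thinking (VST), a new paradigm for streaming video understanding that thinks while watching: reasoning is activated in step with video playback, balancing real-time responsiveness with depth of understanding. It also designs a matching post-training pipeline (VST-SFT and VST-RL) and an automated data synthesis method based on video knowledge graphs, achieving efficient and well-generalizing gains on multiple online and offline benchmarks.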

Details

Motivation: Existing online video LLMs focus only on streaming perception and lack a synchronized stream of logical reasoning, while directly applying test-time scaling methods causes unacceptable response latency; a balance between real-time responsiveness and reasoning depth is needed. Method: The VST paradigm enables thinking-while-watching streaming video understanding. Its post-training pipeline comprises VST-SFT, which structurally adapts the model to causal streaming reasoning, and VST-RL, end-to-end reinforcement learning in a multi-turn interaction environment. An automated data synthesis pipeline based on video knowledge graphs generates streaming chain-of-thought QA pairs grounded in entity relations. Result: VST-7B reaches 79.5% on StreamingBench and 59.3% on OVO-Bench; compared with Video-R1 it responds 15.7 times faster and improves by +5.4% on VideoHolmes; it also remains competitive on offline long-video and reasoning tasks. Conclusion: VST resolves the tension between real-time response and deep reasoning in streaming video understanding. By amortizing reasoning latency, applying structured post-training, and synthesizing high-quality streaming data, it delivers efficient, coherent, and well-generalizing online video understanding. Abstract: Online Video Large Language Models (VideoLLMs) play a critical role in supporting responsive, real-time interaction. Existing methods focus on streaming perception, lacking a synchronized logical reasoning stream. However, directly applying test-time scaling methods incurs unacceptable response latency. To address this trade-off, we propose Video Streaming Thinking (VST), a novel paradigm for streaming video understanding. It supports a thinking-while-watching mechanism, which activates reasoning over incoming video clips during streaming. This design improves timely comprehension and coherent cognition while preserving real-time responsiveness by amortizing LLM reasoning latency over video playback. Furthermore, we introduce a comprehensive post-training pipeline that integrates VST-SFT, which structurally adapts the offline VideoLLM to causal streaming reasoning, and VST-RL, which provides end-to-end improvement through self-exploration in a multi-turn video interaction environment. Additionally, we devise an automated training-data synthesis pipeline that uses video knowledge graphs to generate high-quality streaming QA pairs, with an entity-relation grounded streaming Chain-of-Thought to enforce multi-evidence reasoning and sustained attention to the video stream. Extensive evaluations show that VST-7B performs strongly on online benchmarks, e.g., 79.5% on StreamingBench and 59.3% on OVO-Bench. Meanwhile, VST remains competitive on offline long-form or reasoning benchmarks. Compared with Video-R1, VST responds 15.7 times faster and achieves +5.4% improvement on VideoHolmes, demonstrating higher efficiency and strong generalization across diverse video understanding tasks. Code, data, and models will be released at https://github.com/1ranGuan/VST.

[175] GRADE: Benchmarking Discipline-Informed Reasoning in Image Editing

Mingxin Liu,Ziqian Fan,Zhaokai Wang,Leyao Gu,Zirun Zhu,Yiguo He,Yuchen Yang,Changyao Tian,Xiangyu Zhao,Ning Liao,Shaofeng Zhang,Qibing Ren,Zhihang Zhong,Xuanhe Zhou,Junchi Yan,Xue Yang

Main category: cs.CV

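TL;DR: This paper proposes GRADE, a benchmark for evaluating discipline-informed knowledge and reasoning in image editing. It covers 10 academic domains with 520 samples, comes with a multi-dimensional evaluation protocol, and reveals significant limitations of current multimodal models on knowledge-intensive editing tasks.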

Details

Motivation: Existing image editing benchmarks are confined to natural images and shallow commonsense reasoning, and cannot adequately assess the joint understanding, reasoning, and generation abilities of unified multimodal models under structured, domain-specific constraints. Method: The authors build the GRADE benchmark (520 samples across 10 disciplines), propose a multi-dimensional evaluation protocol (Discipline Reasoning, Visual Consistency, and Logical Readability), and run experiments and ablations on 20 state-of-the-art models. Result: Current models show large performance gaps in implicit, knowledge-intensive editing settings; the analysis identifies specific shortcomings and constraints in disciplinary editing. Conclusion: GRADE points to key directions for the development of unified multimodal models and advances research on discipline-informed image editing and reasoning; the benchmark and code are open source. Abstract: Unified multimodal models target joint understanding, reasoning, and generation, but current image editing benchmarks are largely confined to natural images and shallow commonsense reasoning, offering limited assessment of this capability under structured, domain-specific constraints. In this work, we introduce GRADE, the first benchmark to assess discipline-informed knowledge and reasoning in image editing. GRADE comprises 520 carefully curated samples across 10 academic domains, spanning from natural science to social science. To support rigorous evaluation, we propose a multi-dimensional evaluation protocol that jointly assesses Discipline Reasoning, Visual Consistency, and Logical Readability. Extensive experiments on 20 state-of-the-art open-source and closed-source models reveal substantial limitations in current models under implicit, knowledge-intensive editing settings, leading to large performance gaps. Beyond quantitative scores, we conduct rigorous analyses and ablations to expose model shortcomings and identify the constraints within disciplinary editing. Together, GRADE pinpoints key directions for the future development of unified multimodal models, advancing the research on discipline-informed image editing and reasoning. Our benchmark and evaluation code are publicly released.

[176] OmniStream: Mastering Perception, Reconstruction and Action in Continuous Streams

Yibin Yan,Jilan Xu,Shangzhe Di,Haoning Wu,Weidi Xie

Main category: cs.CV

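TL;DR: This paper proposes OmniStream, a unified streaming visual backbone that uses causal spatiotemporal attention and 3D rotary positional embeddings for frame-by-frame online video processing and, after multi-task pre-training, generalizes across semantic, spatial, and temporal reasoning.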

Details

Motivation: Modern visual agents need general, causal, and physically structured representations to operate in real-time streaming environments, but existing vision foundation models are fragmented and struggle to cover image semantics, temporal modeling, and spatial geometry together. Method: Propose OmniStream, which introduces causal spatiotemporal attention and 3D-RoPE positional embeddings and uses a persistent KV-cache to support frame-by-frame streaming; the model is trained on 29 datasets with a multi-task pre-training framework that combines static and temporal representation learning, streaming geometric reconstruction, and vision-language alignment. Result: Even with a fully frozen backbone, OmniStream performs on par with specialized models on image and video probing, streaming geometric reconstruction, complex spatiotemporal reasoning, and robotic manipulation tasks unseen during training. Conclusion: A single, homogeneous vision backbone can effectively unify semantic, spatial, and temporal reasoning, which is an important step toward general-purpose visual understanding and embodied intelligence. Abstract: Modern visual agents require representations that are general, causal, and physically structured to operate in real-time streaming environments. However, current vision foundation models remain fragmented, specializing narrowly in image semantic perception, offline temporal modeling, or spatial geometry. This paper introduces OmniStream, a unified streaming visual backbone that effectively perceives, reconstructs, and acts from diverse visual inputs. By incorporating causal spatiotemporal attention and 3D rotary positional embeddings (3D-RoPE), our model supports efficient, frame-by-frame online processing of video streams via a persistent KV-cache. We pre-train OmniStream using a synergistic multi-task framework coupling static and temporal representation learning, streaming geometric reconstruction, and vision-language alignment on 29 datasets. Extensive evaluations show that, even with a strictly frozen backbone, OmniStream achieves consistently competitive performance with specialized experts across image and video probing, streaming geometric reconstruction, complex video and spatial reasoning, as well as robotic manipulation (unseen at training). Rather than pursuing benchmark-specific dominance, our work demonstrates the viability of training a single, versatile vision backbone that generalizes across semantic, spatial, and temporal reasoning, i.e., a more meaningful step toward general-purpose visual understanding for interactive and embodied agents.
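
As a rough illustration of what frame-level causality could look like, the sketch below builds a block-causal attention mask in which every token of a frame may attend to all tokens of the same frame and of earlier frames, but never to later frames. This is only one plausible reading of "causal spatiotemporal attention"; the abstract does not specify the mask, and the function name `frame_causal_mask` and its parameters are hypothetical.

```python
from typing import List


def frame_causal_mask(num_frames: int, tokens_per_frame: int) -> List[List[bool]]:
    """Return mask[q][k] == True when query token q may attend to key token k.

    Assumption (not taken from the paper): tokens are laid out frame by frame,
    and attention is bidirectional within a frame but causal across frames.
    """
    total = num_frames * tokens_per_frame
    # Frame index of each flattened token position.
    frame_of = [i // tokens_per_frame for i in range(total)]
    return [
        [frame_of[k] <= frame_of[q] for k in range(total)]
        for q in range(total)
    ]


if __name__ == "__main__":
    # Two frames with two tokens each: frame 0 cannot see frame 1.
    for row in frame_causal_mask(num_frames=2, tokens_per_frame=2):
        print(["x" if allowed else "." for allowed in row])
```

Under a mask of this shape, the keys and values of earlier frames never change when a new frame arrives, which is the property that makes a persistent KV-cache usable for streaming input.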

[177] MM-CondChain: A Programmatically Verified Benchmark for Visually Grounded Deep Compositional Reasoning

Haozhan Shen,Shilin Yan,Hongwei Xue,Shuaiqi Lu,Xiaojun Tang,Guannan Zhang,Tiancheng Zhao,Jianwei Yin

Main category: cs.CV

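TL;DR: This paper introduces MM-CondChain, a benchmark for evaluating multimodal large language models (MLLMs) on visually grounded deep compositional conditional reasoning, with emphasis on understanding and executing complex branching conditions in real visual workflows such as GUI navigation.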

Details

Motivation: Existing benchmarks focus on shallow compositions or independent constraints and cannot measure how MLLMs actually perform on tasks that require multi-step visual evidence and deeply chained conditionals (e.g., nested conditions over attributes and relations in a GUI). Method: Build the MM-CondChain benchmark: design multi-layer reasoning chains in which each layer contains a verifiable visual compositional condition built from multiple objects, attributes, or relations; propose an agentic synthesis pipeline (Planner + VPIR + Composer) for scalable, mechanically verifiable data generation; cover three visual domains: natural images, data charts, and GUI trajectories. Result: Across multiple MLLMs, the strongest model reaches a Path F1 of only 53.33%, and performance drops sharply on hard negatives and as reasoning depth or predicate complexity increases. Conclusion: Deep visual compositional reasoning remains a core bottleneck for MLLMs, and MM-CondChain provides the first systematic, verifiable evaluation benchmark for this direction. Abstract: Multimodal Large Language Models (MLLMs) are increasingly used to carry out visual workflows such as navigating GUIs, where the next step depends on verified visual compositional conditions (e.g., "if a permission dialog appears and the color of the interface is green, click Allow") and the process may branch or terminate early. Yet this capability remains under-evaluated: existing benchmarks focus on shallow compositions or independent constraints rather than deeply chained compositional conditionals. In this paper, we introduce MM-CondChain, a benchmark for visually grounded deep compositional reasoning. Each benchmark instance is organized as a multi-layer reasoning chain, where every layer contains a non-trivial compositional condition grounded in visual evidence and built from multiple objects, attributes, or relations. To answer correctly, an MLLM must perceive the image in detail, reason over multiple visual elements at each step, and follow the resulting execution path to the final outcome. To scalably construct such workflow-style data, we propose an agentic synthesis pipeline: a Planner orchestrates layer-by-layer generation of compositional conditions, while a Verifiable Programmatic Intermediate Representation (VPIR) ensures each layer's condition is mechanically verifiable. A Composer then assembles these verified layers into complete instructions. Using this pipeline, we construct benchmarks across three visual domains: natural images, data charts, and GUI trajectories. Experiments on a range of MLLMs show that even the strongest model attains only 53.33 Path F1, with sharp drops on hard negatives and as depth or predicate complexity grows, confirming that deep compositional reasoning remains a fundamental challenge.
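
To make the idea of a multi-layer conditional chain with early termination concrete, here is a minimal sketch. It is not the benchmark's data format or the authors' VPIR: the `Layer` class, the `run_chain` function, and the scene dictionary keys are all hypothetical, and the scene stands in for facts that a model would have to read off an image.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Scene = Dict[str, object]  # hypothetical stand-in for verified visual facts


@dataclass
class Layer:
    name: str
    condition: Callable[[Scene], bool]  # compositional predicate over the scene
    on_fail: str  # outcome returned if the condition is false (early exit)


def run_chain(layers: List[Layer], scene: Scene, final_outcome: str) -> Tuple[List[Tuple[str, bool]], str]:
    """Evaluate layers in order; stop at the first false condition."""
    path: List[Tuple[str, bool]] = []
    for layer in layers:
        holds = layer.condition(scene)
        path.append((layer.name, holds))
        if not holds:
            return path, layer.on_fail
    return path, final_outcome


# Example modeled on the abstract's permission-dialog condition.
LAYERS = [
    Layer("dialog_and_color",
          lambda s: s["dialog"] == "permission" and s["interface_color"] == "green",
          on_fail="dismiss"),
    Layer("allow_visible",
          lambda s: bool(s["allow_button_visible"]),
          on_fail="scroll"),
]

if __name__ == "__main__":
    scene = {"dialog": "permission", "interface_color": "green", "allow_button_visible": True}
    print(run_chain(LAYERS, scene, final_outcome="click Allow"))
```

Because each layer's predicate is an ordinary function of the scene, the ground-truth execution path can be computed mechanically and compared with the path a model follows.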

[178] EVATok: Adaptive Length Video Tokenization for Efficient Visual Autoregressive Generation

Tianwei Xiong,Jun Hao Liew,Zilong Huang,Zhijie Lin,Jiashi Feng,Xihui Liu

Main category: cs.CV

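TL;DR: This paper proposes the EVATok framework, which optimizes token assignment in video generation through adaptive video tokenizers, improving reconstruction quality and generation efficiency.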

Details

Motivation: Traditional video tokenizers use a fixed token assignment for the temporal blocks of different videos, which wastes tokens on simple segments and under-allocates them to complex ones. Method: The EVATok framework has three parts: (1) estimate the optimal token assignment for each video; (2) design lightweight routers that predict this assignment quickly; (3) train adaptive tokenizers that encode videos according to the routers' predictions, combined with an advanced training recipe that integrates video semantic encoders. Result: On UCF-101, EVATok achieves better reconstruction quality and class-to-video generation than baselines such as LARP, while reducing average token usage by at least 24.4%. Conclusion: Through dynamic, video-adaptive token assignment, EVATok significantly improves the efficiency and quality of AR video generation and offers a new paradigm for efficient video modeling. Abstract: Autoregressive (AR) video generative models rely on video tokenizers that compress pixels into discrete token sequences. The length of these token sequences is crucial for balancing reconstruction quality against downstream generation computational cost. Traditional video tokenizers apply a uniform token assignment across temporal blocks of different videos, often wasting tokens on simple, static, or repetitive segments while underserving dynamic or complex ones. To address this inefficiency, we introduce $\textbf{EVATok}$, a framework to produce $\textbf{E}$fficient $\textbf{V}$ideo $\textbf{A}$daptive $\textbf{Tok}$enizers. Our framework estimates optimal token assignments for each video to achieve the best quality-cost trade-off, develops lightweight routers for fast prediction of these optimal assignments, and trains adaptive tokenizers that encode videos based on the assignments predicted by routers. We demonstrate that EVATok delivers substantial improvements in efficiency and overall quality for video reconstruction and downstream AR generation. Enhanced by our advanced training recipe that integrates video semantic encoders, EVATok achieves superior reconstruction and state-of-the-art class-to-video generation on UCF-101, with at least 24.4% savings in average token usage compared to the prior state-of-the-art LARP and our fixed-length baseline.
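
The quality-cost trade-off behind adaptive token assignment can be sketched with a toy selection rule: for each temporal block, pick the candidate token count that maximizes reconstruction quality minus a per-token penalty. This is an illustrative assumption, not the estimation procedure used by EVATok; the function names, the linear penalty `cost_weight`, and the quality numbers are all made up for the example.

```python
from typing import Dict, List


def pick_token_count(quality_by_length: Dict[int, float], cost_weight: float) -> int:
    """Choose the token count with the best (quality - cost_weight * tokens).

    Candidates are scanned in increasing order, so ties go to the shorter length.
    """
    return max(
        sorted(quality_by_length),
        key=lambda n: quality_by_length[n] - cost_weight * n,
    )


def assign_tokens(blocks: List[Dict[int, float]], cost_weight: float) -> List[int]:
    """Apply the per-block rule to every temporal block of a video."""
    return [pick_token_count(block, cost_weight) for block in blocks]


# Hypothetical quality scores for a static block and a dynamic block.
STATIC_BLOCK = {64: 0.90, 128: 0.91, 256: 0.92}
DYNAMIC_BLOCK = {64: 0.60, 128: 0.80, 256: 0.92}

if __name__ == "__main__":
    print(assign_tokens([STATIC_BLOCK, DYNAMIC_BLOCK], cost_weight=0.001))
```

In this toy setup the static block gains almost nothing from extra tokens and receives the shortest length, while the dynamic block receives more. A lightweight router, as described in the abstract, would be trained to predict such assignments directly instead of searching over candidates at inference time.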

Table of Contents

cs.CL

[1] Speculative Decoding Scaling Laws (SDSL): Throughput Optimization Made Simple

[2] Summarize Before You Speak with ARACH: A Training-Free Inference-Time Plug-In for Enhancing LLMs via Global Attention Reallocation

[3] DeReason: A Difficulty-Aware Curriculum Improves Decoupled SFT-then-RL Training for General Reasoning

[4] MDER-DR: Multi-Hop Question Answering with Entity-Centric Summaries

[5] Markovian Generation Chains in Large Language Models

[6] Artificial Intelligence for Sentiment Analysis of Persian Poetry

[7] ThReadMed-QA: A Multi-Turn Medical Dialogue Benchmark from Real Patient Questions

[8] Temporal Text Classification with Large Language Models

[9] Evaluating Explainable AI Attribution Methods in Neural Machine Translation via Attention-Guided Knowledge Distillation

[10] Stop Listening to Me! How Multi-turn Conversations Can Degrade Diagnostic Reasoning

[11] Algorithmic Consequences of Particle Filters for Sentence Processing: Amplified Garden-Paths and Digging-In Effects

[12] MaterialFigBENCH: benchmark dataset with figures for evaluating college-level materials science problem-solving abilities of multimodal large language models

[13] BLooP: Zero-Shot Abstractive Summarization using Large Language Models with Bigram Lookahead Promotion

[14] LLM-Assisted Causal Structure Disambiguation and Factor Extraction for Legal Judgment Prediction

[15] Try, Check and Retry: A Divide-and-Conquer Framework for Boosting Long-context Tool-Calling Performance of LLMs

[16] Tiny Aya: Bridging Scale and Multilingual Depth

[17] Can Small Language Models Use What They Retrieve? An Empirical Study of Retrieval Utilization Across Model Scale

[18] One Supervisor, Many Modalities: Adaptive Tool Orchestration for Autonomous Queries

[19] Where Matters More Than What: Decoding-aligned KV Cache Compression via Position-aware Pseudo Queries

[20] Streaming Translation and Transcription Through Speech-to-Text Causal Alignment

[21] UtilityMax Prompting: A Formal Framework for Multi-Objective Large Language Model Optimization

[22] Performance Evaluation of Open-Source Large Language Models for Assisting Pathology Report Writing in Japanese

[23] QChunker: Learning Question-Aware Text Chunking for Domain RAG via Multi-Agent Debate

[24] Multi-Task Reinforcement Learning for Enhanced Multimodal LLM-as-a-Judge

[25] A technology-oriented mapping of the language and translation industry: Analysing stakeholder values and their potential implication for translation pedagogy

[26] In the LLM era, Word Sense Induction remains unsolved

[27] SemBench: A Universal Semantic Framework for LLM Evaluation

[28] Semi-Synthetic Parallel Data for Translation Quality Estimation: A Case Study of Dataset Building for an Under-Resourced Language Pair

[29] Compression Favors Consistency, Not Truth: When and Why Language Models Prefer Correct Information

[30] Legal-DC: Benchmarking Retrieval-Augmented Generation for Legal Documents

[31] Trust Oriented Explainable AI for Fake News Detection

[32] Large Language Models for Biomedical Article Classification

[33] DatedGPT: Preventing Lookahead Bias in Large Language Models with Time-Aware Pretraining

[34] Bielik-Minitron-7B: Compressing Large Language Models via Structured Pruning and Knowledge Distillation for the Polish Language

[35] CoMMET: To What Extent Can LLMs Perform Theory of Mind Tasks?

[36] PersonaTrace: Synthesizing Realistic Digital Footprints with LLM Agents

[37] CHiL(L)Grader: Calibrated Human-in-the-Loop Short-Answer Grading

[38] BTZSC: A Benchmark for Zero-Shot Text Classification Across Cross-Encoders, Embedding Models, Rerankers and LLMs

[39] Just Use XML: Revisiting Joint Translation and Label Projection

[40] Translationese as a Rational Response to Translation Task Difficulty

[41] To Words and Beyond: Probing Large Language Models for Sentence-Level Psycholinguistic Norms of Memorability and Reading Times

[42] SommBench: Assessing Sommelier Expertise of Language Models

[43] Cross-Context Review: Improving LLM Output Quality by Separating Production and Review Sessions

[44] LifeSim: Long-Horizon User Life Simulator for Personalized Assistant Evaluation

[45] QAQ: Bidirectional Semantic Coherence for Selecting High-Quality Synthetic Code Instructions

[46] Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections

[47] Long-Context Encoder Models for Polish Language Understanding

[48] IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse

[49] CLASP: Defending Hybrid Large Language Models Against Hidden State Poisoning Attacks

[50] Sparking Scientific Creativity via LLM-Driven Interdisciplinary Inspiration

[51] SciMDR: Benchmarking and Advancing Scientific Multimodal Document Reasoning

cs.CV

[52] RC-NF: Robot-Conditioned Normalizing Flow for Real-Time Anomaly Detection in Robotic Manipulation

[53] GGPT: Geometry Grounded Point Transformer

[54] Evidential learning driven Breast Tumor Segmentation with Stage-divided Vision-Language Interaction

[55] A Simple Efficiency Incremental Learning Framework via Vision-Language Model with Nonlinear Multi-Adapters

[56] Senna-2: Aligning VLM and End-to-End Driving Policy for Consistent Decision Making and Planning

[57] Frequency-Modulated Visual Restoration for Matryoshka Large Multimodal Models

[58] When Slots Compete: Slot Merging in Object-Centric Learning

[59] Radiometric fingerprinting of object surfaces using mobile laser scanning and semantic 3D road space models

[60] Towards Automated Initial Probe Placement in Transthoracic Teleultrasound Using Human Mesh and Skeleton Recovery

[61] InstantHDR: Single-forward Gaussian Splatting for High Dynamic Range 3D Reconstruction

[62] Hierarchical Granularity Alignment and State Space Modeling for Robust Multimodal AU Detection in the Wild

[63] UniCompress: Token Compression for Unified Vision-Language Understanding and Generation

[64] UNet-AF: An alias-free UNet for image restoration

[65] Towards Trustworthy Selective Generation: Reliability-Guided Diffusion for Ultra-Low-Field to High-Field MRI Synthesis

[66] Learning to Assist: Physics-Grounded Human-Human Control via Multi-Agent Reinforcement Learning

[67] DriveXQA: Cross-modal Visual Question Answering for Adverse Driving Scene Understanding

[68] High-Precision 6DOF Pose Estimation via Global Phase Retrieval in Fringe Projection Profilometry for 3D Mapping

[69] DeepHistoViT: An Interpretable Vision Transformer Framework for Histopathological Cancer Classification

[70] Seeing Isn't Orienting: A Cognitively Grounded Benchmark Reveals Systematic Orientation Failures in MLLMs Supplementary

[71] Zero-Shot Cross-City Generalization in End-to-End Autonomous Driving: Self-Supervised versus Supervised Representations

[72] ShotVerse: Advancing Cinematic Camera Control for Text-Driven Multi-Shot Video Creation

[73] Beyond Single-Sample: Reliable Multi-Sample Distillation for Video Understanding

[74] Stay in your Lane: Role Specific Queries with Overlap Suppression Loss for Dense Video Captioning

[75] Detect Anything in Real Time: From Single-Prompt Segmentation to Multi-Class Detection

[76] Follow the Saliency: Supervised Saliency for Retrieval-augmented Dense Video Captioning

[77] INFACT: A Diagnostic Benchmark for Induced Faithfulness and Factuality Hallucinations in Video-LLMs