cs.CL

[1] TabReX : Tabular Referenceless eXplainable Evaluation

Tejas Anvekar,Juhna Park,Aparna Garimella,Vivek Gupta

Main category: cs.CL

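TL;DR: Proposes TabReX, a reference-less, property-driven framework for evaluating tabular generation that produces interpretable scores through graph alignment, released together with the large-scale benchmark TabReX-Bench.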

Details

Motivation: Existing metrics for evaluating generated tables either flatten tables into text and ignore structure, or rely on fixed reference tables and therefore generalize poorly; a reliable, interpretable assessment of structural and factual fidelity is missing. Method: Converts the source text and the generated table into canonical knowledge graphs, aligns them through an LLM-guided matching process, computes interpretable rubric-based scores for structural and factual fidelity, and builds TabReX-Bench, a benchmark covering six domains and twelve perturbation types, for systematic evaluation. Result: TabReX achieves the highest correlation with expert rankings, is more robust to complex perturbations, supports fine-grained model and prompt comparisons, and provides cell-level error tracing. Conclusion: TabReX offers a trustworthy, explainable evaluation paradigm for table generation that combines high human agreement with diagnostic power, advancing reliable evaluation of structured generation systems. Abstract: Evaluating the quality of tables generated by large language models (LLMs) remains an open challenge: existing metrics either flatten tables into text, ignoring structure, or rely on fixed references that limit generalization. We present TabReX, a reference-less, property-driven framework for evaluating tabular generation via graph-based reasoning. TabReX converts both source text and generated tables into canonical knowledge graphs, aligns them through an LLM-guided matching process, and computes interpretable, rubric-aware scores that quantify structural and factual fidelity. The resulting metric provides controllable trade-offs between sensitivity and specificity, yielding human-aligned judgments and cell-level error traces. To systematically assess metric robustness, we introduce TabReX-Bench, a large-scale benchmark spanning six domains and twelve planner-driven perturbation types across three difficulty tiers. Empirical results show that TabReX achieves the highest correlation with expert rankings, remains stable under harder perturbations, and enables fine-grained model-vs-prompt analysis, establishing a new paradigm for trustworthy, explainable evaluation of structured generation systems.

Joel Mire,Maria Antoniak,Steven R. Wilson,Zexin Ma,Achyutarama R. Ganti,Andrew Piper,Maarten Sap

Main category: cs.CL

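TL;DR: Introduces SocialStoryFrames, a formalism for capturing inferences about reader response to stories, and develops two models, SSF-Generator and SSF-Classifier, validated through human surveys and expert annotations. Applied to a dataset of 6,140 social media stories, the models reveal the frequency and interdependence of storytelling intents and the diversity of narrative practices across communities.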

Details

Motivation: Existing computational models fall short in capturing readers' rich interpretive, affective, and evaluative responses to stories and cannot support fine-grained analysis, so a method that formally models reader response is needed. Method: Builds a taxonomy grounded in narrative theory, linguistic pragmatics, and psychology; proposes the SocialStoryFrames formalism; develops two models, SSF-Generator and SSF-Classifier; validates them through a human survey (N=382) and expert annotations; and applies them to SSF-Corpus, a dataset of 6,140 social media stories, for empirical analysis. Result: The models identify perceived author intent, explanatory and predictive reasoning, affective responses, and value judgments; the analysis reveals how storytelling intents are distributed and interdependent, and compares the differences and diversity of narrative practices across communities. Conclusion: By combining fine-grained, context-sensitive modeling with a generic taxonomy of reader responses, SocialStoryFrames provides a new tool for studying storytelling in online communities at scale, advancing computational narrative studies and social computing. Abstract: Reading stories evokes rich interpretive, affective, and evaluative responses, such as inferences about narrative intent or judgments about characters. Yet, computational models of reader response are limited, preventing nuanced analyses. To address this gap, we introduce SocialStoryFrames, a formalism for distilling plausible inferences about reader response, such as perceived author intent, explanatory and predictive reasoning, affective responses, and value judgments, using conversational context and a taxonomy grounded in narrative theory, linguistic pragmatics, and psychology. We develop two models, SSF-Generator and SSF-Classifier, validated through human surveys (N=382 participants) and expert annotations, respectively. We conduct pilot analyses to showcase the utility of the formalism for studying storytelling at scale. Specifically, applying our models to SSF-Corpus, a curated dataset of 6,140 social media stories from diverse contexts, we characterize the frequency and interdependence of storytelling intents, and we compare and contrast narrative practices (and their diversity) across communities. By linking fine-grained, context-sensitive modeling with a generic taxonomy of reader responses, SocialStoryFrames enable new research into storytelling in online communities.

[3] BRAID: Bounded Reasoning for Autonomous Inference and Decisions

Armağan Amcalar,Eyup Cinar

Main category: cs.CL

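TL;DR: Proposes BRAID, a structured prompting framework based on Mermaid instruction graphs that bounds the reasoning process to improve the reasoning accuracy and cost efficiency of large language models in autonomous agent systems.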

Details

Motivation: Because performance, cost, and token usage are related nonlinearly in large language models, conventional unbounded natural-language reasoning can be inefficient and costly, so a more efficient structured reasoning method is needed. Method: Introduces the BRAID framework, which constrains reasoning with machine-readable Mermaid graphs, and evaluates it across multiple GPT model tiers on benchmark datasets such as AdvancedIF, GSM-Hard, and SCALE MultiChallenge. Result: Experiments show that, compared with conventional approaches, BRAID substantially improves reasoning accuracy and cost efficiency, particularly for agents in production systems. Conclusion: BRAID is an effective and scalable technique for optimizing inference efficiency in autonomous agent systems, with potential for practical use. Abstract: Large Language Models (LLMs) exhibit nonlinear relationships between performance, cost, and token usage. This paper presents a quantitative study on structured prompting using BRAID (Bounded Reasoning for Autonomous Inference and Decisions) across multiple GPT model tiers, evaluated on the AdvancedIF, GSM-Hard, and SCALE MultiChallenge benchmark datasets. BRAID introduces a bounded reasoning framework using Mermaid-based instruction graphs that enable models to reason structurally rather than through unbounded natural-language token expansion. We show that structured machine-readable prompts substantially increase reasoning accuracy and cost efficiency for agents in production systems. The findings establish BRAID as an effective and scalable technique for optimizing inference efficiency in autonomous agent systems. All datasets and detailed result logs are available at https://benchmark.openserv.ai.

Kieran Henderson,Kian Omoomi,Vasudha Varadarajan,Allison Lahnala,Charles Welch

Main category: cs.CL

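TL;DR: This study categorizes self-disclosure sentences and builds annotator models to predict judgments of social norms. It finds that demographics are more impactful than attitudes, relationships, and experiences; that theory-based approaches outperform automatic clustering; that only a small number of related comments is needed; and that a more diverse sample of self-disclosures performs best.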

Details

Motivation: To explore which kinds of personal information are most informative for predicting annotator labels in subjective tasks, specifically judgments of social norms. Method: Categorizes self-disclosure sentences, builds annotator models from them, and runs several ablations and analyses to assess how the type of information affects the prediction of annotation patterns. Result: Demographics are more predictive than attitudes, relationships, and experiences; theory-based approaches outperform automatic clustering; a small number of related comments is enough to perform well; and a more diverse sample of self-disclosures gives the best performance. Conclusion: When modeling annotator judgments, demographic attributes and diverse self-disclosure content should be prioritized, and theory-guided categorization is the more effective strategy. Abstract: Recent work has explored the use of personal information in the form of persona sentences or self-disclosures to improve modeling of individual characteristics and prediction of annotator labels for subjective tasks. The volume of personal information has historically been restricted and thus little exploration has gone into understanding what kind of information is most informative for predicting annotator labels. In this work, we categorize self-disclosure sentences and use them to build annotator models for predicting judgments of social norms. We perform several ablations and analyses to examine the impact of the type of information on our ability to predict annotation patterns. We find that demographics are more impactful than attitudes, relationships, and experiences. Generally, theory-based approaches worked better than automatic clusters. Contrary to previous work, only a small number of related comments are needed. Lastly, having a more diverse sample of annotator self-disclosures leads to the best performance.

[5] Are We on the Right Way to Assessing LLM-as-a-Judge?

Yuanning Feng,Sinan Wang,Zhengxiang Cheng,Yao Wan,Dongping Chen

Main category: cs.CL

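TL;DR: Proposes Sage, an evaluation suite that assesses the quality of LLM-as-a-Judge without human annotation. Using two new metrics, local self-consistency and global logical consistency, it reveals significant inconsistency in current mainstream LLMs acting as judges and identifies a phenomenon called "situational preference".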

Details

Motivation: Existing evaluations of LLM-as-a-Judge depend on human-annotated ground truth, which carries human bias and is hard to scale, so a more reliable and scalable automated evaluation method is needed. Method: Inspired by axioms of rational choice theory, proposes two evaluation dimensions, local self-consistency (pairwise preference stability) and global logical consistency (transitivity across preferences), and builds a dataset of 650 questions that is evaluated entirely without human annotation. Result: Sage's metrics are stable and correlate highly with supervised benchmarks such as LLMBar and RewardBench2; top models such as Gemini-2.5-Pro and GPT-5 fail to keep consistent preferences in nearly a quarter of difficult cases; the "situational preference" phenomenon explains why explicit rubrics help consistency; fine-tuning, panel-based judging, and deep reasoning improve judging consistency; and human judgments are also substantially inconsistent. Conclusion: Sage is a reliable, annotation-free tool for evaluating LLM-as-a-Judge that exposes reliability problems in current LLM judges, challenges the assumption that human annotation is a gold standard, and points to ways of improving LLM judging. Abstract: LLM-as-a-Judge has been widely adopted as an evaluation method and has served as supervised rewards in model training. However, existing benchmarks for LLM-as-a-Judge mainly rely on human-annotated ground truth, which introduces human bias that undermines the assessment of reliability and imposes scalability constraints. To overcome these limitations, we introduce Sage, a novel evaluation suite that assesses the quality of LLM judges without necessitating any human annotation. Inspired by axioms of rational choice theory, Sage introduces two new lenses for measuring LLM-as-a-Judge: local self-consistency (pair-wise preference stability) and global logical consistency (transitivity across a full set of preferences). We curate a dataset of 650 questions by combining structured benchmark problems with real-world user queries. Our experiments demonstrate both the stability of our metrics and their high correlation with supervised benchmarks like LLMBar and RewardBench2, confirming Sage's reliability as an evaluation suite for the robustness and accuracy of LLM-as-a-Judge. Based on Sage, we reveal that current state-of-the-art LLMs exhibit significant reliability problems when acting as judges in both scoring and pairwise settings; even the top-performing models, Gemini-2.5-Pro and GPT-5, fail to maintain consistent preferences in nearly a quarter of difficult cases. We attribute this to a new phenomenon called situational preference, which explains why explicit rubrics or criteria can help the model judge consistently across answer pairs. Our further analysis shows that finetuned LLM-as-a-Judge is a feasible method to boost performance, and the panel-based judge as well as deep reasoning can enhance the judging consistency. We also find substantial inconsistency in human judgments, which indicates that human annotation may not be a reliable gold standard.
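
The two consistency notions named in the abstract can be illustrated with a small sketch. This is not Sage's actual protocol, which the abstract does not spell out: it assumes a hypothetical `judge(a, b)` callable that returns the preferred answer, measures local stability by swapping presentation order, and measures global consistency by counting preference cycles among triples.

```python
from itertools import combinations


def local_consistency(judge, pairs):
    """Fraction of pairs whose winner is unchanged when the order is swapped.

    Assumption: order swapping stands in for 'pairwise preference stability'.
    """
    stable = sum(judge(a, b) == judge(b, a) for a, b in pairs)
    return stable / len(pairs)


def intransitive_triples(judge, items):
    """Count triples whose three pairwise preferences form a cycle."""
    cycles = 0
    for a, b, c in combinations(items, 3):
        wins = {a: 0, b: 0, c: 0}
        for x, y in ((a, b), (b, c), (a, c)):
            wins[judge(x, y)] += 1
        # A transitive triple has win counts (2, 1, 0); a cycle has (1, 1, 1).
        if sorted(wins.values()) == [1, 1, 1]:
            cycles += 1
    return cycles


def toy_judge(a, b):
    """Hypothetical judge: prefers the longer answer, first position on ties."""
    return a if len(a) >= len(b) else b


def rps_judge(a, b):
    """Hypothetical judge with deliberately cyclic preferences."""
    beats = {("rock", "scissors"), ("scissors", "paper"), ("paper", "rock")}
    return a if (a, b) in beats else b


if __name__ == "__main__":
    answers = ["aa", "bb", "cccc"]
    print(local_consistency(toy_judge, list(combinations(answers, 2))))
    print(intransitive_triples(rps_judge, ["rock", "paper", "scissors"]))
```

The toy judge's position bias on ties makes one of its three pairs unstable, while the cyclic judge produces a triple that no ranking can explain; a real study would replace both with calls to an actual LLM judge.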

[6] Convolutional Lie Operator for Sentence Classification

Daniela N. Rim,Heeyoul Choi

Main category: cs.CL

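TL;DR: Proposes SCLie and DPCLie, new convolutional sentence classifiers based on Lie convolutions that aim to improve performance by capturing complex non-Euclidean symmetries in language; experiments show they outperform traditional convolutional models.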

Details

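Motivation: Traditional CNNs capture local, position-invariant features in text, but their capacity to model complex transformations in language is limited, so new methods that better model such structure are worth exploring. Method: Integrates Lie convolutions into convolution-based sentence classifiers, yielding two models, SCLie and DPCLie, that use Lie group operations to capture complex symmetries and transformations in language. Result: The proposed models outperform traditional convolutional sentence classifiers in experiments, showing higher accuracy. Conclusion: Lie-based models can capture transformations not commonly associated with language, which helps sentence classification and motivates further exploration of new paradigms in language modeling. Abstract: Traditional Convolutional Neural Networks have been successful in capturing local, position-invariant features in text, but their capacity to model complex transformations within language can be further explored. In this work, we explore a novel approach by integrating Lie Convolutions into Convolutional-based sentence classifiers, inspired by the ability of Lie group operations to capture complex, non-Euclidean symmetries. Our proposed models SCLie and DPCLie empirically outperform traditional Convolutional-based sentence classifiers, suggesting that Lie-based models relatively improve the accuracy by capturing transformations not commonly associated with language. Our findings motivate more exploration of new paradigms in language modeling.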

[7] MRG-R1: Reinforcement Learning for Clinically Aligned Medical Report Generation

Pengyu Wang,Shuchang Ye,Usman Naseem,Jinman Kim

Main category: cs.CL

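TL;DR: Proposes a semantic-driven reinforcement learning (SRL) method for medical report generation that improves clinical correctness through a report-level reward and structured outputs, achieving state-of-the-art clinical efficacy on the IU X-Ray and MIMIC-CXR datasets.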

Details

Motivation: Existing medical report generation methods mostly rely on token-level training objectives, which imitate radiologists' linguistic style but cannot guarantee clinical correctness, so a training mechanism that directly optimizes clinical accuracy is needed. Method: Proposes a semantic-driven reinforcement learning (SRL) framework built on a large vision-language model (LVLM); it uses Group Relative Policy Optimization (GRPO) to optimize a report-level reward, a margin-based cosine similarity (MCCS) over extracted key radiological findings, and adds a lightweight reasoning format constraint to produce structured "thinking report" outputs. Result: On IU X-Ray and MIMIC-CXR, MRG-R1 achieves CE-F1 scores of 51.88 and 40.39 respectively, outperforming existing methods and supporting the effectiveness of semantic-driven reinforcement learning for clinical correctness. Conclusion: Optimizing a report-level reward based on clinical semantic alignment improves clinical correctness more than conventional token-level supervision, suggesting a new direction for training medical large vision-language models (Med-LVLMs). Abstract: Medical report generation (MRG) aims to automatically derive radiology-style reports from medical images to aid in clinical decision-making. However, existing methods often generate text that mimics the linguistic style of radiologists but fails to guarantee clinical correctness, because they are trained on token-level objectives which focus on word-choice and sentence structure rather than actual medical accuracy. We propose a semantic-driven reinforcement learning (SRL) method for medical report generation, applied to a large vision-language model (LVLM). SRL adopts Group Relative Policy Optimization (GRPO) to encourage clinical-correctness-guided learning beyond imitation of language style. Specifically, we optimise a report-level reward: a margin-based cosine similarity (MCCS) computed between key radiological findings extracted from generated and reference reports, thereby directly aligning clinical-label agreement and improving semantic correctness. A lightweight reasoning format constraint further guides the model to generate structured "thinking report" outputs. We evaluate Medical Report Generation with Semantic-driven Reinforcement Learning (MRG-R1) on two datasets, IU X-Ray and MIMIC-CXR, using clinical efficacy (CE) metrics. MRG-R1 achieves state-of-the-art performance with CE-F1 51.88 on IU X-Ray and 40.39 on MIMIC-CXR. We find that label-semantic reinforcement outperforms conventional token-level supervision. These results indicate that optimizing a clinically grounded, report-level reward rather than token overlap meaningfully improves clinical correctness. This work is an early exploration of semantic reinforcement for supervising medical correctness in medical large vision-language model (Med-LVLM) training.
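
A report-level reward combined with GRPO can be sketched roughly as follows. This is an illustration under stated assumptions, not the MRG-R1 implementation: findings are encoded as hypothetical binary label vectors, the reward is plain cosine similarity (the abstract does not define the margin term, so it is omitted), and advantages are normalized within the sampled group using the population standard deviation, a choice on which implementations differ.

```python
import math
from statistics import mean, pstdev


def cosine(u, v):
    """Cosine similarity between two equal-length label vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: rewards standardized within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical finding labels (1 = finding present) for one reference
    # report and three sampled reports for the same image.
    reference = [1, 0, 1, 0]
    samples = [[1, 0, 1, 0], [1, 0, 0, 0], [0, 1, 0, 1]]
    rewards = [cosine(s, reference) for s in samples]
    print(rewards)
    print(group_relative_advantages(rewards))
```

Because advantages are computed relative to the group mean, a sampled report is reinforced only when its findings agree with the reference better than its siblings do, which is the sense in which the reward operates at report level rather than token level.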

[8] Decoding Fake Narratives in Spreading Hateful Stories: A Dual-Head RoBERTa Model with Multi-Task Learning

Yash Bhaskar,Sankalp Bahad,Parameswari Krishnamurthy

Main category: cs.CL

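TL;DR: Presents a system for detecting Faux-Hate, hate speech driven by fake narratives, in code-mixed Hindi-English social media text. It combines multi-task learning with domain-specific pretraining and performs well on binary detection and on target and severity prediction.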

Details

Motivation: Misinformation and hate speech spread together on social media, and conventional detection methods struggle with hate speech triggered by fake narratives, so the Faux-Hate phenomenon needs dedicated modeling. Method: Uses natural language processing techniques together with a domain-specific pretrained model in a multi-task learning framework that jointly handles two sub-tasks: binary Faux-Hate detection, and target and severity prediction. Result: The system achieved competitive results in the Faux-Hate shared task, supporting the effectiveness of the approach for recognizing complex hateful content. Conclusion: Combining domain-specific pretraining with multi-task learning improves the detection of hate speech driven by fake narratives, particularly in code-mixed social media text. Abstract: Social media platforms, while enabling global connectivity, have become hubs for the rapid spread of harmful content, including hate speech and fake narratives \cite{davidson2017automated, shu2017fake}. The Faux-Hate shared task focuses on detecting a specific phenomenon: the generation of hate speech driven by fake narratives, termed Faux-Hate. Participants are challenged to identify such instances in code-mixed Hindi-English social media text. This paper describes our system developed for the shared task, addressing two primary sub-tasks: (a) Binary Faux-Hate detection, involving fake and hate speech classification, and (b) Target and Severity prediction, categorizing the intended target and severity of hateful content. Our approach combines advanced natural language processing techniques with domain-specific pretraining to enhance performance across both tasks. The system achieved competitive results, demonstrating the efficacy of leveraging multi-task learning for this complex problem.

Mengfan Shen,Kangqi Song,Xindi Wang,Wei Jia,Tao Wang,Ziqiang Han

Main category: cs.CL

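TL;DR: Proposes a domain-adapted information extraction pipeline that fine-tunes Qwen2.5-7B with LoRA and uses targeted prompt engineering to extract 15 key fields from police briefings on Weibo, achieving high-precision multi-task structured extraction with accuracy above 98% on mortality detection.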

Details

Motivation: Accurately extracting structured information from informal, heterogeneous social media text, such as police briefings on Weibo, is challenging, and efficient, robust methods are needed to support data processing in social science research. Method: Applies parameter-efficient fine-tuning of Qwen2.5-7B with Low-Rank Adaptation (LoRA), combined with targeted prompt engineering, to build a domain-adapted extraction pipeline; the training data are 4,933 manually annotated, high-quality instances. Result: Accuracy exceeds 98.36% for mortality detection, and Exact Match Rates reach 95.31% for fatality counts and 95.54% for province-level location extraction, significantly better than the base and instruction-tuned models. Conclusion: The method offers an efficient and reliable solution for multi-task information extraction in specialized domains, turning unstructured social media text into structured data usable for research. Abstract: Structured information extraction from police incident announcements is crucial for timely and accurate data processing, yet presents considerable challenges due to the variability and informal nature of textual sources such as social media posts. To address these challenges, we developed a domain-adapted extraction pipeline that leverages targeted prompt engineering with parameter-efficient fine-tuning of the Qwen2.5-7B model using Low-Rank Adaptation (LoRA). This approach enables the model to handle noisy, heterogeneous text while reliably extracting 15 key fields, including location, event characteristics, and impact assessment, from a high-quality, manually annotated dataset of 4,933 instances derived from 27,822 police briefing posts on Chinese Weibo (2019-2020). Experimental results demonstrated that LoRA-based fine-tuning significantly improved performance over both the base and instruction-tuned models, achieving an accuracy exceeding 98.36% for mortality detection and Exact Match Rates of 95.31% for fatality counts and 95.54% for province-level location extraction. The proposed pipeline thus provides a validated and efficient solution for multi-task structured information extraction in specialized domains, offering a practical framework for transforming unstructured text into reliable structured data in social science research.

[10] Mitigating Hallucinations in Healthcare LLMs with Granular Fact-Checking and Domain-Specific Adaptation

Musarrat Zeba,Abdullah Al Mamun,Kishoar Jahan Tithee,Debopom Sutradhar,Mohaimenul Azam Khan Raiaan,Saddam Mukta,Reem E. Mohamed,Md Rafiqul Islam,Yakub Sebastian,Mukhtar Hussain,Sami Azam

Main category: cs.CL

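TL;DR: Proposes a medical fact-checking module that operates independently of any LLM, paired with a summarization model fine-tuned on MIMIC-III with low-rank adaptation. Numerical tests and fine-grained logical checks reduce hallucinations, and experiments show good factual accuracy and summary quality.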

Details

Motivation: LLMs can generate hallucinated content in healthcare, threatening decision safety, so the reliability and accuracy of their outputs must be improved. Method: Fine-tunes a domain-specific summarization model on the MIMIC-III dataset with LoRA and designs an LLM-independent fact-checking module that validates propositions against electronic health records using numerical tests and discrete-logic NLP checks. Result: On 3,786 propositions extracted from 104 summaries, the fact-checking module achieves a precision of 0.8904, a recall of 0.8234, and an F1-score of 0.8556; the summarization model achieves a ROUGE-1 score of 0.5797 and a BERTScore of 0.9120. Conclusion: The proposed method reduces hallucination in medical text generation and improves factual consistency, strengthening reliability while preserving summary quality. Abstract: In healthcare, it is essential for any LLM-generated output to be reliable and accurate, particularly in cases involving decision-making and patient safety. However, the outputs are often unreliable in such critical areas due to the risk of hallucinated outputs from the LLMs. To address this issue, we propose a fact-checking module that operates independently of any LLM, along with a domain-specific summarization model designed to minimize hallucination rates. Our model is fine-tuned using Low-Rank Adaptation (LoRA) on the MIMIC-III dataset and is paired with the fact-checking module, which uses numerical tests for correctness and logical checks at a granular level through discrete logic in natural language processing (NLP) to validate facts against electronic health records (EHRs). We trained the LLM on the full MIMIC-III dataset. For evaluation of the fact-checking module, we sampled 104 summaries, extracted them into 3,786 propositions, and used these as facts. The fact-checking module achieves a precision of 0.8904, a recall of 0.8234, and an F1-score of 0.8556. Additionally, the LLM summary model achieves a ROUGE-1 score of 0.5797 and a BERTScore of 0.9120 for summary quality.
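
As a quick consistency check on the reported figures, the F1-score is the harmonic mean of precision and recall, $F_1 = 2PR / (P + R)$. The short sketch below recomputes it from the precision and recall quoted in the abstract; the function name is illustrative only.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Precision and recall as reported in the abstract.
    print(round(f1_score(0.8904, 0.8234), 4))  # 0.8556
```

The recomputed value matches the reported F1-score of 0.8556 to four decimal places.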

[11] An Information-Theoretic Framework for Robust Large Language Model Editing

Qizhou Chen,Chengyu Wang,Taolin Zhang,Xiaofeng He

Main category: cs.CL

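TL;DR: Proposes IBKE, a knowledge editing framework for large language models grounded in information bottleneck theory, enabling efficient, generalizable knowledge updates with little interference.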

Details

Motivation: Existing model editing methods struggle to correct knowledge accurately and generalizably without retraining, and they tend to cause side effects. Method: Building on information bottleneck theory, designs the IBKE framework, which uses compact latent representations to guide gradient updates so that the essential information is isolated and the impact on other behaviors is minimized. Result: IBKE is validated across multiple LLM architectures and benchmark tasks, showing leading accuracy and better generality and specificity of edits. Conclusion: IBKE offers a theoretically principled and practical paradigm for open-domain knowledge editing, improving the utility and trustworthiness of LLMs in real-world applications. Abstract: Large Language Models (LLMs) have become indispensable tools in science, technology, and society, enabling transformative advances across diverse fields. However, errors or outdated information within these models can undermine their accuracy and restrict their safe deployment. Developing efficient strategies for updating model knowledge without the expense and disruption of full retraining remains a critical challenge. Current model editing techniques frequently struggle to generalize corrections beyond narrow domains, leading to unintended consequences and limiting their practical impact. Here, we introduce a novel framework for editing LLMs, grounded in information bottleneck theory. This approach precisely compresses and isolates the essential information required for generalizable knowledge correction while minimizing disruption to unrelated model behaviors. Building upon this foundation, we present the Information Bottleneck Knowledge Editor (IBKE), which leverages compact latent representations to guide gradient-based updates, enabling robust and broadly applicable model editing. We validate IBKE's effectiveness across multiple LLM architectures and standard benchmark tasks, demonstrating state-of-the-art accuracy and improved generality and specificity of edits. These findings establish a theoretically principled and practical paradigm for open-domain knowledge editing, advancing the utility and trustworthiness of LLMs in real-world applications.

[12] LoPA: Scaling dLLM Inference via Lookahead Parallel Decoding

Chenkai Xu,Yijie Jin,Jiajun Li,Yi Tu,Guoping Long,Dandan Tu,Tianqi Hou,Junchi Yan,Zhijie Deng

Main category: cs.CL

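TL;DR: Proposes LoPA, a training-free, plug-and-play algorithm that substantially improves the parallel decoding efficiency of diffusion large language models (dLLMs) by optimizing the Token Filling Order (TFO); combined with a multi-device inference system, it reaches up to 10.1 TPF and a throughput of 1073.9 tokens per second.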

Details

Motivation: Current confidence-driven decoding strategies for dLLMs offer limited parallelism, typically generating only 1-3 tokens per forward pass, which limits decoding speed. The authors find that decoding parallelism is highly sensitive to the Token Filling Order (TFO) and therefore propose optimizing the TFO to accelerate inference. Method: Proposes Lookahead PArallel Decoding (LoPA), which explores multiple candidate TFOs concurrently through parallel branches and selects the order with the highest potential for future parallelism based on branch confidence; also designs a multi-device inference system with Branch Parallelism (BP) to support highly concurrent decoding. Result: Applied to the D2F model, LoPA raises the TPF of D2F-Dream on GSM8K to 10.1 while performance stays above the Dream baseline; under multi-GPU deployment, single-sample throughput reaches 1073.9 tokens per second. Conclusion: By optimizing the TFO, LoPA substantially improves dLLM decoding efficiency and, together with a specialized inference system, reaches an unprecedented degree of parallelism, offering an effective solution for efficient inference. Abstract: Diffusion Large Language Models (dLLMs) have demonstrated significant potential for high-speed inference. However, current confidence-driven decoding strategies are constrained by limited parallelism, typically achieving only 1-3 tokens per forward pass (TPF). In this work, we identify that the degree of parallelism during dLLM inference is highly sensitive to the Token Filling Order (TFO). Then, we introduce Lookahead PArallel Decoding (LoPA), a training-free, plug-and-play algorithm, to identify a superior TFO and hence accelerate inference. LoPA concurrently explores distinct candidate TFOs via parallel branches, and selects the one with the highest potential for future parallelism based on branch confidence. We apply LoPA to the state-of-the-art D2F model and observe a substantial enhancement in decoding efficiency. Notably, LoPA increases the TPF of D2F-Dream to 10.1 on GSM8K while maintaining performance superior to the Dream baseline. Furthermore, to facilitate this unprecedented degree of parallelism, we develop a specialized multi-device inference system featuring Branch Parallelism (BP), which achieves a single-sample throughput of 1073.9 tokens per second under multi-GPU deployment. The code is available at https://github.com/zhijie-group/LoPA.

[13] Sigma-MoE-Tiny Technical Report

Qingguo Hu,Zhenghao Lin,Ziyue Yang,Yucheng Ding,Xiao Liu,Yuting Jiang,Ruizhe Wang,Tianyu Chen,Zhongxin Guo,Yifan Xiong,Rui Gao,Lei Qu,Jinsong Su,Peng Cheng,Yeyun Gong

Main category: cs.CL

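TL;DR: Presents Sigma-MoE-Tiny, a highly sparse Mixture-of-Experts language model with up to 96 experts per layer but only one expert activated per token, giving 20B total parameters with just 0.5B activated. To address the expert load balancing problem caused by this extreme sparsity, it proposes a progressive sparsification schedule, and with stable pre-training followed by post-training it reaches top-tier performance for its class at a very small activated-parameter budget.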

Details

Motivation: Existing MoE models scale well, but at very high sparsity (for example, many experts per layer) they suffer from unbalanced expert load; in particular, the commonly used load balancing loss becomes ineffective in the lower layers, limiting efficiency and stability. More efficient sparse MoE structures and matching training mechanisms are therefore needed. Method: Uses fine-grained expert segmentation with up to 96 experts per layer and one activated expert per token; introduces a progressive sparsification schedule to improve expert utilization and training stability; and pre-trains on a high-quality corpus, followed by post-training to further unlock the model's capabilities. Result: Reaches very high sparsity, with only 0.5B of 20B total parameters activated; training is stable with no irrecoverable loss spikes; across evaluations, performance matches or exceeds that of comparable or significantly larger models; and load imbalance under extreme sparsity is effectively mitigated. Conclusion: Through a carefully designed expert architecture and a progressive sparsification schedule, Sigma-MoE-Tiny achieves efficient training and strong performance at high sparsity, providing a practical example and insights for advancing sparsity in future MoE models. Abstract: Mixture-of-Experts (MoE) has emerged as a promising paradigm for foundation models due to its efficient and powerful scalability. In this work, we present Sigma-MoE-Tiny, an MoE language model that achieves the highest sparsity compared to existing open-source models. Sigma-MoE-Tiny employs fine-grained expert segmentation with up to 96 experts per layer, while activating only one expert for each token, resulting in 20B total parameters with just 0.5B activated. The major challenge introduced by such extreme sparsity lies in expert load balancing. We find that the widely-used load balancing loss tends to become ineffective in the lower layers under this setting. To address this issue, we propose a progressive sparsification schedule aiming to balance expert utilization and training stability. Sigma-MoE-Tiny is pre-trained on a diverse and high-quality corpus, followed by post-training to further unlock its capabilities. The entire training process remains remarkably stable, with no occurrence of irrecoverable loss spikes. Comprehensive evaluations reveal that, despite activating only 0.5B parameters, Sigma-MoE-Tiny achieves top-tier performance among counterparts of comparable or significantly larger scale. In addition, we provide an in-depth discussion of load balancing in highly sparse MoE models, offering insights for advancing sparsity in future MoE architectures. Project page: https://qghuxmu.github.io/Sigma-MoE-Tiny Code: https://github.com/microsoft/ltp-megatron-lm
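
To make the load balancing concern concrete, the following sketch routes each token to a single expert out of 96 and measures how unevenly tokens are spread. It is a toy illustration, not the Sigma-MoE-Tiny router: the random logits, the function names, and the imbalance measure (busiest expert's share divided by the uniform share) are all assumptions made for the example.

```python
import random
from collections import Counter

NUM_EXPERTS = 96  # experts per layer, as stated in the abstract


def top1_route(router_logits):
    """Index of the single expert with the highest router logit."""
    return max(range(len(router_logits)), key=router_logits.__getitem__)


def expert_load(batch_logits, num_experts=NUM_EXPERTS):
    """Fraction of tokens in the batch routed to each expert."""
    counts = Counter(top1_route(logits) for logits in batch_logits)
    total = len(batch_logits)
    return [counts.get(e, 0) / total for e in range(num_experts)]


def load_imbalance(load):
    """Busiest expert's share relative to a uniform share (1.0 = balanced)."""
    return max(load) * len(load)


if __name__ == "__main__":
    random.seed(0)
    # Hypothetical router outputs for 4096 tokens; real logits come from a
    # learned gating layer, not from random noise.
    batch = [[random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
             for _ in range(4096)]
    load = expert_load(batch)
    print(f"imbalance: {load_imbalance(load):.2f}")
    print(f"idle experts: {sum(share == 0 for share in load)}")
```

With one active expert per token, any skew in the router leaves some experts overloaded and others nearly idle, which is the failure mode the progressive sparsification schedule is meant to avoid. The figures in the abstract also imply that 0.5B / 20B = 2.5% of parameters are active per token.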

[14] Evaluating OpenAI GPT Models for Translation of Endangered Uralic Languages: A Comparison of Reasoning and Non-Reasoning Architectures

Yehor Tereshchenko,Mika Hämäläinen,Svitlana Myroniuk

Main category: cs.CL

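TL;DR: This study compares OpenAI's GPT models on translation between Finnish and four low-resource Uralic languages (Komi-Zyrian, Moksha, Erzya, and Udmurt) and finds that reasoning models have refusal rates 16 percentage points lower than non-reasoning models.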

Details

Motivation: Evaluations of LLM translation ability have focused mainly on high-resource languages, leaving little understanding of performance on low-resource and endangered languages; this study aims to fill that gap. Method: Uses a parallel corpus of literary texts to analyze the refusal rates of different GPT model architectures on translation tasks, comparing reasoning and non-reasoning models. Result: Reasoning models show markedly lower refusal rates when translating low-resource Uralic languages, 16 percentage points lower than non-reasoning models, indicating a greater willingness to attempt translation. Conclusion: Reasoning architectures make LLMs more usable for translating low-resource and endangered languages, providing a useful reference for Uralic language preservation and related research. Abstract: The evaluation of Large Language Models (LLMs) for translation tasks has primarily focused on high-resource languages, leaving a significant gap in understanding their performance on low-resource and endangered languages. This study presents a comprehensive comparison of OpenAI's GPT models, specifically examining the differences between reasoning and non-reasoning architectures for translating between Finnish and four low-resource Uralic languages: Komi-Zyrian, Moksha, Erzya, and Udmurt. Using a parallel corpus of literary texts, we evaluate model willingness to attempt translation through refusal rate analysis across different model architectures. Our findings reveal significant performance variations between reasoning and non-reasoning models, with reasoning models showing 16 percentage points lower refusal rates. The results provide valuable insights for researchers and practitioners working with Uralic languages and contribute to the broader understanding of reasoning model capabilities for endangered language preservation.

[15] Hacking Neural Evaluation Metrics with Single Hub Text

Hiroyuki Deguchi,Katsuki Chousa,Yusuke Sakai

Main category: cs.CL

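TL;DR: Proposes a method for finding a single adversarial text in the discrete space that is rated as high-quality regardless of the test case, exposing vulnerabilities in evaluation metrics.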

Details

Motivation: Embedding-based neural text evaluation metrics such as COMET are widely used, but because neural networks are black boxes their reliability is not guaranteed, so the reliability and safety of these metrics deserve attention. Method: Proposes a method for finding a single adversarial text (a hub text) in the discrete space that consistently receives high scores, and uses it to probe the robustness of evaluation metrics. Result: The hub text reaches 79.1 COMET% and 67.8 COMET% on the WMT'24 English-to-Japanese and English-to-German translation tasks respectively, outperforming translations generated individually for each source sentence by M2M100, and it generalizes across multiple language pairs. Conclusion: Current neural text evaluation metrics are vulnerable: a single adversarial text can manipulate evaluation results, so the robustness and transparency of these metrics need to improve. Abstract: Strongly human-correlated evaluation metrics serve as an essential compass for the development and improvement of generation models and must be highly reliable and robust. Recent embedding-based neural text evaluation metrics, such as COMET for translation tasks, are widely used in both research and development fields. However, there is no guarantee that they yield reliable evaluation results due to the black-box nature of neural networks. To raise concerns about the reliability and safety of such metrics, we propose a method for finding a single adversarial text in the discrete space that is consistently evaluated as high-quality, regardless of the test cases, to identify the vulnerabilities in evaluation metrics. The single hub text found with our method achieved 79.1 COMET% and 67.8 COMET% in the WMT'24 English-to-Japanese (En-Ja) and English-to-German (En-De) translation tasks, respectively, outperforming translations generated individually for each source sentence by using M2M100, a general translation model. Furthermore, we also confirmed that the hub text found with our method generalizes across multiple language pairs such as Ja-En and De-En.

[16] Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs

Sara Papi,Javier Garcia Gilabert,Zachary Hopton,Vilém Zouhar,Carlos Escolano,Gerard I. Gállego,Jorge Iranzo-Sánchez,Ahrii Kim,Dominik Macháček,Patricia Schmidtova,Maike Züfle

Main category: cs.CL

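TL;DR: Presents Hearing to Translate, a test suite that systematically benchmarks 5 state-of-the-art SpeechLLMs against 16 cascaded and direct systems across 16 benchmarks, 13 language pairs, and 9 challenging conditions. Cascaded architectures remain better overall than SpeechLLMs, showing that integrating an LLM into the speech translation pipeline is essential for quality.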

Details

Motivation: To determine whether SpeechLLMs outperform traditional cascaded architectures on speech-to-text translation and to establish the true performance level of current models. Method: Builds Hearing to Translate, the first comprehensive benchmark suite, comparing 5 state-of-the-art SpeechLLMs with 16 direct and cascaded systems that combine leading speech foundation models (SFMs) with multilingual LLMs, evaluated across 16 benchmarks, 13 language pairs, and 9 challenging conditions. Result: Cascaded systems are the most reliable overall; current SpeechLLMs match cascades only in selected settings; standalone SFMs lag behind both; and integrating an LLM, whether inside the model or in a pipeline, is essential for high-quality speech translation. Conclusion: Although SpeechLLMs show promise, cascaded architectures remain the more reliable option for speech translation, and effectively integrating an LLM into the speech processing pipeline is key to high performance. Abstract: As Large Language Models (LLMs) expand beyond text, integrating speech as a native modality has given rise to SpeechLLMs, which aim to translate spoken language directly, thereby bypassing traditional transcription-based pipelines. Whether this integration improves speech-to-text translation quality over established cascaded architectures, however, remains an open question. We present Hearing to Translate, the first comprehensive test suite rigorously benchmarking 5 state-of-the-art SpeechLLMs against 16 strong direct and cascade systems that couple leading speech foundation models (SFMs) with multilingual LLMs. Our analysis spans 16 benchmarks, 13 language pairs, and 9 challenging conditions, including disfluent, noisy, and long-form speech. Across this extensive evaluation, we find that cascaded systems remain the most reliable overall, while current SpeechLLMs only match cascades in selected settings and SFMs lag behind both, highlighting that integrating an LLM, either within the model or in a pipeline, is essential for high-quality speech translation.

[17] Bridging the Reality Gap: Efficient Adaptation of ASR systems for Challenging Low-Resource Domains

Darshil Chauhan,Adityasinh Solanki,Vansh Patel,Kanav Kapoor,Ritvik Jain,Aditya Bansal,Dhruv Kumar,Prateek Narang

Main category: cs.CL

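TL;DR: Proposes an efficient, privacy-preserving adaptation framework based on Low-Rank Adaptation (LoRA) and multi-domain experience replay to address domain shift, data privacy, and limited compute in clinical speech recognition for resource-constrained settings. On real-world clinical audio it yields a 17.1% relative reduction in word error rate and reduces catastrophic forgetting by 47%.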

Details

Motivation: Because of data privacy constraints, limited computational resources, and severe acoustic domain shift, existing ASR models perform poorly in real clinical settings such as rural healthcare and are hard to deploy. The authors aim to overcome these barriers so that ASR can run reliably in high-impact real-world scenarios. Method: Uses Low-Rank Adaptation (LoRA) for continual learning on edge devices, protecting patient data privacy; adds multi-domain experience replay to mitigate catastrophic forgetting; and adapts the multilingual IndicWav2Vec model to the target domain. Result: On real-world clinical audio (Gram Vaani), where the unadapted model reaches a 40.94% WER, LoRA adaptation gives a 17.1% relative improvement in word error rate (WER), and adding experience replay reduces catastrophic forgetting by 47% compared with naive adaptation. Conclusion: The proposed framework offers a viable path to deploying reliable, self-improving ASR systems in privacy-sensitive, resource-constrained real-world environments, especially high-impact areas such as rural healthcare. Abstract: Automatic Speech Recognition (ASR) holds immense potential to streamline clinical documentation, such as digitizing handwritten prescriptions and reports, thereby increasing patient throughput and reducing costs in resource-constrained sectors like rural healthcare. However, realizing this utility is currently obstructed by significant technical barriers: strict data privacy constraints, limited computational resources, and severe acoustic domain shifts. We quantify this gap by showing that a robust multilingual model (IndicWav2Vec) degrades to a stark 40.94% Word Error Rate (WER) when deployed on real-world clinical audio (Gram Vaani), rendering it unusable for practical applications. To address these challenges and bring ASR closer to deployment, we propose an efficient, privacy-preserving adaptation framework. We employ Low-Rank Adaptation (LoRA) to enable continual learning from incoming data streams directly on edge devices, ensuring patient data confidentiality. Our strategy yields a 17.1% relative improvement in WER on the target domain. Furthermore, by integrating multi-domain experience replay, we reduce catastrophic forgetting by 47% compared to naive adaptation. These results demonstrate a viable pathway for building reliable, self-improving ASR systems that can operate effectively within the constraints of high-impact real-world environments.
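
The abstract reports a relative improvement in WER, which is easy to confuse with an absolute drop in percentage points. The sketch below shows the difference; the WER values in it are hypothetical and chosen only to make the arithmetic easy to follow, since the abstract does not state the adapted model's WER.

```python
def relative_improvement(baseline_wer, adapted_wer):
    """Relative WER reduction, as a fraction of the baseline WER."""
    return (baseline_wer - adapted_wer) / baseline_wer


def absolute_improvement(baseline_wer, adapted_wer):
    """Absolute WER reduction, in percentage points."""
    return baseline_wer - adapted_wer


if __name__ == "__main__":
    # Hypothetical values for illustration; not results from the paper.
    baseline, adapted = 50.0, 40.0
    print(relative_improvement(baseline, adapted))  # 0.2, i.e. 20% relative
    print(absolute_improvement(baseline, adapted))  # 10.0 percentage points
```

Read this way, a 17.1% relative improvement means the adapted WER is 17.1% smaller than the WER it is compared against, not 17.1 percentage points lower.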

[18] Plain language adaptations of biomedical text using LLMs: Comparision of evaluation metrics

Primoz Kocbek,Leon Kopitar,Gregor Stiglic

Main category: cs.CL

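TL;DR: This study examines how large language models (LLMs) can simplify biomedical texts to improve health literacy. It compares prompt-based, two-AI-agent, and fine-tuning strategies, and finds that gpt-4o-mini performs best and that G-Eval tracks the human evaluation.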

Details

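Motivation: Biomedical literature is often hard to understand, which limits public health literacy, so effective text simplification methods are needed to make the information more accessible. Method: Using a public dataset containing plain language adaptations of biomedical abstracts, the authors design and evaluate three approaches: a baseline using a prompt template, a two-AI-agent approach, and a fine-tuning approach. gpt-4o and gpt-4o-mini serve as baseline models, and evaluation combines automatic metrics (Flesch-Kincaid, SMOG, SARI, BERTScore, G-Eval) with 5-point Likert scales for quantitative and qualitative assessment. Result: gpt-4o-mini outperformed the other approaches across metrics, while the fine-tuning approach underperformed; G-Eval, an LLM-based automatic metric, ranked the approaches in close agreement with the human ratings. Conclusion: Lightweight LLMs perform strongly on biomedical text simplification, and LLM-driven automatic metrics such as G-Eval can stand in for part of the human evaluation, offering an efficient evaluation tool for future work. Abstract: This study investigated the application of Large Language Models (LLMs) for simplifying biomedical texts to enhance health literacy. Using a public dataset, which included plain language adaptations of biomedical abstracts, we developed and evaluated several approaches, specifically a baseline approach using a prompt template, a two AI agent approach, and a fine-tuning approach. We selected OpenAI gpt-4o and gpt-4o-mini models as baselines for further research. We evaluated our approaches with quantitative metrics, such as Flesch-Kincaid grade level, SMOG Index, SARI, BERTScore, and G-Eval, as well as with a qualitative metric, more precisely 5-point Likert scales for simplicity, accuracy, completeness, and brevity. Results showed a superior performance of gpt-4o-mini and an underperformance of FT approaches. G-Eval, an LLM-based quantitative metric, showed promising results, ranking the approaches similarly to the qualitative metric.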
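Of the metrics listed above, the Flesch-Kincaid grade level is a closed-form formula, so a small sketch may help. The coefficients are the standard ones; the syllable counter is a crude vowel-group heuristic chosen for illustration and is an assumption, not the tooling used in the study.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: one syllable per run of vowels, at least one per word."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_kincaid_grade(text: str) -> float:
    """0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)


plain = "The cat sat on the mat."
dense = "Hyperglycemia exacerbates cardiovascular complications."
# A lower grade means easier text; the plain sentence should score far lower.
print(flesch_kincaid_grade(plain), flesch_kincaid_grade(dense))
```

Because the formula looks only at sentence length and syllable counts, it rewards short words regardless of accuracy, which is one reason studies like this pair it with SARI, BERTScore, and human ratings.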

[19] UM_FHS at the CLEF 2025 SimpleText Track: Comparing No-Context and Fine-Tune Approaches for GPT-4.1 Models in Sentence and Document-Level Text Simplification

Primoz Kocbek,Gregor Stiglic

Main category: cs.CL

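TL;DR: This paper describes a submission to CLEF 2025 SimpleText Task 1 that uses the gpt-4.1 model family for sentence-level and document-level simplification of scientific texts, comparing no-context prompt engineering with fine-tuning.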

Details

Motivation: To improve the readability of scientific texts at the sentence and document levels so that non-expert readers can understand complex content. Method: Using gpt-4.1, gpt-4.1-mini, and gpt-4.1-nano, the authors compare no-context prompt engineering against fine-tuning for sentence-level and document-level simplification. Result: gpt-4.1-mini performs well in the no-context setting; fine-tuned models give mixed results, with gpt-4.1-nano-ft standing out on one document-level task. Conclusion: Prompt engineering is effective in most cases, but fine-tuning may have an edge in specific scenarios, and simplification at different granularities remains challenging. Abstract: This work describes our submission to the CLEF 2025 SimpleText track Task 1, addressing both sentence- and document-level simplification of scientific texts. The methodology centered on using the gpt-4.1, gpt-4.1-mini, and gpt-4.1-nano models from OpenAI. Two distinct approaches were compared: a no-context method relying on prompt engineering and a fine-tuned (FT) method across models. The gpt-4.1-mini model with no-context demonstrated robust performance at both levels of simplification, while the fine-tuned models showed mixed results, highlighting the complexities of simplifying text at different granularities, where gpt-4.1-nano-ft performance stands out at document-level simplification in one case.

[20] Refusal Steering: Fine-grained Control over LLM Refusal Behaviour for Sensitive Topics

Iker García-Ferrero,David Montero,Roman Orus

Main category: cs.CL

TL;DR: This paper proposes Refusal Steering, an inference-time method that gives fine-grained control over LLM refusal behaviour on politically sensitive topics without retraining.

Details

Motivation: To address over-refusal by LLMs on politically sensitive topics while preserving safety alignment against harmful content. Method: An LLM-as-a-judge scores refusal confidence, and a ridge-regularized computation yields more precise steering vectors that isolate the refusal-compliance direction. Result: On models including Qwen3-Next-80B-A3B-Thinking, the method removes refusal behaviour on politically sensitive topics while maintaining safety on JailbreakBench and near-baseline performance on general benchmarks. Analysis of the steering vectors shows that refusal signals concentrate in the deeper transformer layers and are distributed across many dimensions. Conclusion: Activation steering is a feasible and practical way to achieve controllable, transparent moderation at inference time, balancing political sensitivity against safety alignment. Abstract: We introduce Refusal Steering, an inference-time method to exercise fine-grained control over Large Language Models' refusal behaviour on politically sensitive topics without retraining. We replace fragile pattern-based refusal detection with an LLM-as-a-judge that assigns refusal confidence scores and we propose a ridge-regularized variant to compute steering vectors that better isolate the refusal-compliance direction. On Qwen3-Next-80B-A3B-Thinking, our method removes the refusal behaviour of the model around politically sensitive topics while maintaining safety on JailbreakBench and near-baseline performance on general benchmarks. The approach generalizes across 4B and 80B models and can also induce targeted refusals when desired. We analyze the steering vectors and show that refusal signals concentrate in deeper layers of the transformer and are distributed across many dimensions. Together, these results demonstrate that activation steering can remove political refusal behaviour while retaining safety alignment for harmful content, offering a practical path to controllable, transparent moderation at inference time.

[21] JustRL: Scaling a 1.5B LLM with a Simple RL Recipe

Bingxiang He,Zekai Qu,Zeyuan Liu,Yinghao Chen,Yuxin Zuo,Cheng Qian,Kaiyan Zhang,Weize Chen,Chaojun Xiao,Ganqu Cui,Ning Ding,Zhiyuan Liu

Main category: cs.CL

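TL;DR: This paper proposes JustRL, a simplified reinforcement learning recipe for LLM training that uses a single stage with fixed hyperparameters, reaches state-of-the-art performance on mathematical reasoning with half the compute, and suggests the field may be overcomplicating its methods.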

Details

Motivation: Reinforcement learning for LLMs has grown increasingly complex in recent years, with multi-stage training, dynamic hyperparameters, and curriculum learning, yet it is unclear whether this complexity is necessary. The authors ask whether a minimal framework is enough to reach strong performance, challenging the current reliance on complexity. Method: JustRL uses single-stage training with fixed hyperparameters, does no per-model tuning, and avoids common "standard tricks" such as length penalties and robust verifiers. It is tested on two 1.5B language models and evaluated on nine mathematical benchmarks. Result: JustRL reaches average accuracies of 54.9% and 64.3% across the nine benchmarks, at state-of-the-art level, while using half the compute; training is stable with no collapses or plateaus; ablations show that adding common tricks may hurt performance. Conclusion: Much of the complexity in current RL training may be unnecessary. JustRL, as a simple and stable baseline, suggests that a suitably scaled-up minimal approach can be both more efficient and effective, and calls on the community to re-examine whether complexity is needed. Abstract: Recent advances in reinforcement learning for large language models have converged on increasing complexity: multi-stage training pipelines, dynamic hyperparameter schedules, and curriculum learning strategies. This raises a fundamental question: Is this complexity necessary? We present JustRL, a minimal approach using single-stage training with fixed hyperparameters that achieves state-of-the-art performance on two 1.5B reasoning models (54.9% and 64.3% average accuracy across nine mathematical benchmarks) while using 2× less compute than sophisticated approaches. The same hyperparameters transfer across both models without tuning, and training exhibits smooth, monotonic improvement over 4,000+ steps without the collapses or plateaus that typically motivate interventions. Critically, ablations reveal that adding "standard tricks" like explicit length penalties and robust verifiers may degrade performance by collapsing exploration. These results suggest that the field may be adding complexity to solve problems that disappear with a stable, scaled-up baseline. We release our models and code to establish a simple, validated baseline for the community.

[22] GinSign: Grounding Natural Language Into System Signatures for Temporal Logic Translation

William English,Chase Walker,Dominic Simon,Rickard Ewetz

Main category: cs.CL

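TL;DR: This paper proposes GinSign, a framework that accurately maps natural language (NL) onto the atomic propositions of a system signature, improving the semantic correctness of generated temporal logic (TL). A hierarchical classification strategy makes the grounding efficient and precise, and the method clearly outperforms existing approaches.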

Details

Motivation: Without accurate atom grounding, existing NL-to-TL translation frameworks often produce formal specifications that are not semantically equivalent to the intended ones, which limits their use in trustworthy autonomous systems. Method: GinSign grounds hierarchically: it first predicts predicate labels, then selects constant arguments of the matching type. This turns grounding from free-form generation into structured classification, which small masked language models can handle, reducing the reliance on LLMs. Result: Experiments show that frameworks which skip grounding produce syntactically correct LTL formulas that are not semantically equivalent to the target expressions, whereas GinSign reaches a grounded logical-equivalence score of 95.5% across multiple domains, a 1.4× improvement over the current state of the art. Conclusion: Through structured grounding, GinSign markedly improves the semantic accuracy of NL-to-TL translation and supports downstream model checking, giving a more reliable way to generate formal specifications for trustworthy autonomous systems. Abstract: Natural language (NL) to temporal logic (TL) translation enables engineers to specify, verify, and enforce system behaviors without manually crafting formal specifications, an essential capability for building trustworthy autonomous systems. While existing NL-to-TL translation frameworks have demonstrated encouraging initial results, these systems either explicitly assume access to accurate atom grounding or suffer from low grounded translation accuracy. In this paper, we propose a framework for Grounding Natural Language Into System Signatures for Temporal Logic translation called GinSign. The framework introduces a grounding model that learns the abstract task of mapping NL spans onto a given system signature: given a lifted NL specification and a system signature $\mathcal{S}$, the classifier must assign each lifted atomic proposition to an element of the set of signature-defined atoms $\mathcal{P}$. We decompose the grounding task hierarchically, first predicting predicate labels, then selecting the appropriately typed constant arguments. Decomposing this task from a free-form generation problem into a structured classification problem permits the use of smaller masked language models and eliminates the reliance on expensive LLMs. Experiments across multiple domains show that frameworks which omit grounding tend to produce syntactically correct lifted LTL that is semantically nonequivalent to grounded target expressions, whereas our framework supports downstream model checking and achieves grounded logical-equivalence scores of 95.5%, a 1.4× improvement over SOTA.
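The two-level decomposition described above (predicate first, then type-compatible constants) can be sketched as follows. Everything here is a toy assumption made for illustration: the signature, the score dictionaries standing in for classifier outputs, and the `ground_atom` helper are hypothetical and not taken from the paper.

```python
# Hypothetical system signature: typed constants and predicate argument types.
SIGNATURE = {
    "types": {"Robot": ["r1", "r2"], "Room": ["kitchen", "lab"]},
    "predicates": {"at": ["Robot", "Room"], "charged": ["Robot"]},
}


def ground_atom(predicate_scores, argument_scores, signature=SIGNATURE):
    """Pick a predicate, then fill each slot with a constant of the required type.

    predicate_scores: dict mapping predicate name -> classifier score.
    argument_scores: one dict per argument slot, mapping constant -> score.
    """
    missing = float("-inf")
    # Level 1: classify the predicate among those the signature defines.
    predicate = max(signature["predicates"],
                    key=lambda p: predicate_scores.get(p, missing))
    # Level 2: per slot, only constants of the declared type are candidates.
    arguments = []
    for slot, type_name in enumerate(signature["predicates"][predicate]):
        candidates = signature["types"][type_name]
        slot_scores = argument_scores[slot]
        arguments.append(max(candidates,
                             key=lambda c: slot_scores.get(c, missing)))
    return f"{predicate}({', '.join(arguments)})"


atom = ground_atom(
    {"at": 2.0, "charged": 0.5},
    [{"r1": 0.1, "r2": 0.9, "kitchen": 5.0},   # "kitchen" is not a Robot
     {"kitchen": 0.7, "lab": 0.2, "r2": 3.0}],  # "r2" is not a Room
)
print(atom)  # at(r2, kitchen)
```

The point of the type filter is that an ill-typed constant can never be selected, however high its raw score, which is what makes the task a closed classification problem instead of open generation.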

[23] From Facts to Conclusions : Integrating Deductive Reasoning in Retrieval-Augmented LLMs

Shubham Mishra,Samyek Jain,Gorang Mehrishi,Shiv Tiwari,Harsh Sharma,Pratik Narang,Dhruv Kumar

Main category: cs.CL

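TL;DR: Proposes a reasoning-trace-augmented RAG framework that produces interpretable answers through three stages (document-level adjudication, conflict analysis, and grounded synthesis), and introduces the CATS evaluation pipeline to measure a system's trustworthiness and behavioral adherence.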

Details

Motivation: Existing RAG systems handle conflicting, outdated, or subjective sources poorly and lack unified reasoning supervision. Method: The authors build a reasoning-augmented framework with three stages: document-level adjudication, conflict analysis, and grounded synthesis. It produces citation-linked answers or justified refusals, and is paired with CATS, an evaluation pipeline based on LLM-as-a-Judge. Result: Experiments on a 539-query dataset show substantial gains over baselines; after supervised fine-tuning, a Qwen model improves end-to-end correctness from 0.069 to 0.883 and behavioral adherence from 0.074 to 0.722. Conclusion: The framework improves the accuracy, trustworthiness, and interpretability of RAG systems when sources conflict, pointing to a new direction for building reliable retrieval-augmented systems. Abstract: Retrieval-Augmented Generation (RAG) grounds large language models (LLMs) in external evidence, but fails when retrieved sources conflict or contain outdated or subjective information. Prior work addresses these issues independently but lacks unified reasoning supervision. We propose a reasoning-trace-augmented RAG framework that adds structured, interpretable reasoning across three stages: (1) document-level adjudication, (2) conflict analysis, and (3) grounded synthesis, producing citation-linked answers or justified refusals. A Conflict-Aware Trust-Score (CATS) pipeline is introduced which evaluates groundedness, factual correctness, refusal accuracy, and conflict-behavior alignment using an LLM-as-a-Judge. Our 539-query reasoning dataset and evaluation pipeline establish a foundation for conflict-aware, interpretable RAG systems. Experimental results demonstrate substantial gains over baselines, most notably with Qwen, where Supervised Fine-Tuning improved End-to-End answer correctness from 0.069 to 0.883 and behavioral adherence from 0.074 to 0.722.

Primož Kocbek,Azra Frkatović-Hodžić,Dora Lalić,Vivian Hui,Gordan Lauc,Gregor Štiglic

Main category: cs.CL

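TL;DR: This study compares strategies for multi-modal retrieval-augmented generation (MM-RAG) in biomedical question answering, focusing on glycobiology, a visually dense domain. For mid-size models, converting figures and tables to text beats OCR-free visual retrieval; for frontier models such as GPT-4o and the GPT-5 family, OCR-free retrieval becomes competitive. ColFlor matches heavier retrievers at a smaller footprint, making it an efficient choice.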

Details

Motivation: To determine when an MM-RAG system should convert figures and tables into text and when OCR-free visual retrieval works better, particularly in visually dense biomedical domains. Method: The authors build a glycobiology benchmark of 120 multiple-choice questions stratified by retrieval difficulty and implement four augmentations: none, text RAG, multi-modal conversion, and late-interaction visual retrieval (such as ColPali). Using Docling parsing and Qdrant indexing, they evaluate several open-source and proprietary models. Result: With Gemma-3-27B-IT, text and multi-modal conversion beat the OCR-free approach (accuracy 0.722-0.740 vs. 0.510); with GPT-4o the three are close (multi-modal highest at 0.808); in the GPT-5 family ColPali and ColFlor reach 0.828, with no significant difference among visual retrievers. GPT-5-nano trails by roughly 8-10%. Conclusion: The choice of retrieval strategy depends on model capacity: small and mid-size models do better when visual content is converted to text, which lowers the interpretation burden, while frontier models can handle OCR-free visual retrieval effectively. ColFlor keeps performance high at a smaller footprint, making it a sensible default. Abstract: Multi-modal retrieval-augmented generation (MM-RAG) promises grounded biomedical QA, but it is unclear when to (i) convert figures/tables into text versus (ii) use optical character recognition (OCR)-free visual retrieval that returns page images and leaves interpretation to the generator. We study this trade-off in glycobiology, a visually dense domain. We built a benchmark of 120 multiple-choice questions (MCQs) from 25 papers, stratified by retrieval difficulty (easy text, medium figures/tables, hard cross-evidence). We implemented four augmentations (None, Text RAG, Multi-modal conversion, and late-interaction visual retrieval with ColPali) using Docling parsing and Qdrant indexing. We evaluated mid-size open-source and frontier proprietary models (e.g., Gemma-3-27B-IT, GPT-4o family). Additional testing used the GPT-5 family and multiple visual retrievers (ColPali/ColQwen/ColFlor). Accuracy with Agresti-Coull 95% confidence intervals (CIs) was computed over 5 runs per configuration. With Gemma-3-27B-IT, Text and Multi-modal augmentation outperformed OCR-free retrieval (0.722-0.740 vs. 0.510 average accuracy). With GPT-4o, Multi-modal achieved 0.808, with Text 0.782 and ColPali 0.745 close behind; within-model differences were small. In follow-on experiments with the GPT-5 family, the best results with ColPali and ColFlor improved by ~2% to 0.828 in both cases. In general, across the GPT-5 family, ColPali, ColQwen, and ColFlor were statistically indistinguishable. GPT-5-nano trailed larger GPT-5 variants by roughly 8-10%. Pipeline choice is capacity-dependent: converting visuals to text lowers the reader burden and is more reliable for mid-size models, whereas OCR-free visual retrieval becomes competitive under frontier models. Among retrievers, ColFlor offers parity with heavier options at a smaller footprint, making it an efficient default when strong generators are available.
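The abstract reports accuracy with Agresti-Coull 95% confidence intervals. As a reminder of what that interval is, here is a minimal sketch of the standard formula. The example count (485 correct out of 600, i.e. 120 questions times 5 runs pooled) is a hypothetical illustration close to the reported 0.808; how the authors actually pooled runs is not stated in the abstract.

```python
import math


def agresti_coull_interval(successes: int, trials: int,
                           z: float = 1.959963984540054):
    """Agresti-Coull interval: add z^2 pseudo-trials, half of them successes."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    n_adj = trials + z * z
    p_adj = (successes + z * z / 2.0) / n_adj
    half_width = z * math.sqrt(p_adj * (1.0 - p_adj) / n_adj)
    # Clip to [0, 1] because the adjusted interval can overshoot near the ends.
    return max(0.0, p_adj - half_width), min(1.0, p_adj + half_width)


# Hypothetical pooled count, not a figure from the paper.
low, high = agresti_coull_interval(485, 600)
print(round(low, 3), round(high, 3))
```

With a few hundred trials the interval is about three percentage points either side of the point estimate, which gives a sense of scale for the abstract's remark that within-model differences were small.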

[25] Grammar-Forced Translation of Natural Language to Temporal Logic using LLMs

William English,Dominic Simon,Sumit Kumar Jha,Rickard Ewetz

Main category: cs.CL

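TL;DR: This paper proposes Grammar Forced Translation (GraFT), a framework for translating natural language into temporal logic that reduces task complexity by restricting the valid output tokens at each step, improving both end-to-end and out-of-domain translation accuracy.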

Details

Motivation: Existing methods struggle with lifting atomic propositions, resolving co-references, and learning from limited data, which makes accurate translation from natural language to a formal language such as temporal logic difficult. Method: GraFT exploits the properties of each problem to restrict, step by step, the set of valid tokens the language model may output. This shrinks the solution space for both the lifting and the translation steps, simplifying the task and making learning more efficient. Result: On the CW, GLTL, and Navi benchmarks, GraFT improves end-to-end translation accuracy by 5.49% and out-of-domain translation accuracy by 14.06% on average over existing methods. Conclusion: By constraining the solution space, GraFT improves NL-to-TL translation, especially out of domain, and offers an efficient approach to formal translation when data is limited. Abstract: Translating natural language (NL) into a formal language such as temporal logic (TL) is integral for human communication with robots and autonomous systems. State-of-the-art approaches decompose the task into a lifting of atomic propositions (APs) phase and a translation phase. However, existing methods struggle with accurate lifting, the existence of co-references, and learning from limited data. In this paper, we propose a framework for NL to TL translation called Grammar Forced Translation (GraFT). The framework is based on the observation that previous work solves both the lifting and translation steps by letting a language model iteratively predict tokens from its full vocabulary. In contrast, GraFT reduces the complexity of both tasks by restricting the set of valid output tokens from the full vocabulary to only a handful in each step. The solution space reduction is obtained by exploiting the unique properties of each problem. We also provide a theoretical justification for why the solution space reduction leads to more efficient learning. We evaluate the effectiveness of GraFT using the CW, GLTL, and Navi benchmarks. Compared with state-of-the-art translation approaches, it can be observed that GraFT improves the end-to-end translation accuracy by 5.49% and out-of-domain translation accuracy by 14.06% on average.
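The core idea above, restricting each decoding step to a handful of valid tokens, can be illustrated with a toy constrained decoder. This is a sketch under stated assumptions: the prefix-notation LTL vocabulary, the slot-counting grammar, and all function names are hypothetical and are not the grammar or implementation used by GraFT.

```python
UNARY = {"G", "F", "X", "!"}
BINARY = {"&", "|", "U", "->"}
ATOMS = {"prop_1", "prop_2"}
EOS = "<eos>"


def open_slots(prefix):
    """Number of operands still missing in a prefix-notation formula."""
    slots = 1
    for token in prefix:
        if slots == 0:
            raise ValueError("tokens after a complete formula")
        if token in BINARY:
            slots += 1          # fills one slot, opens two
        elif token in ATOMS:
            slots -= 1          # fills one slot, opens none
        elif token not in UNARY:
            raise ValueError(f"unknown token: {token}")
    return slots


def allowed_next(prefix):
    """Only <eos> after a complete formula; otherwise any LTL token."""
    return {EOS} if open_slots(prefix) == 0 else UNARY | BINARY | ATOMS


def constrained_decode(step_logits):
    """Greedy decoding where each step's argmax is taken over allowed tokens only."""
    output = []
    for logits in step_logits:
        allowed = allowed_next(output)
        candidates = {t: s for t, s in logits.items() if t in allowed}
        if not candidates:
            raise ValueError("no allowed token has a score at this step")
        token = max(candidates, key=candidates.get)
        if token == EOS:
            break
        output.append(token)
    return output  # may be incomplete if the logits run out before <eos>


# Made-up per-step scores; off-grammar tokens sometimes score highest.
steps = [
    {"G": 1.0, "the": 5.0, "prop_1": 0.2},
    {"->": 2.0, EOS: 3.0, "prop_1": 0.1},
    {"prop_1": 1.0, EOS: 9.0},
    {"prop_2": 0.5, "prop_1": 0.1, EOS: 4.0},
    {EOS: 0.1, "G": 7.0},
]
formula = constrained_decode(steps)
print(formula)  # ['G', '->', 'prop_1', 'prop_2']
```

Even though an invalid token has the highest raw score at every step, the mask guarantees a well-formed formula, here G (prop_1 -> prop_2) in prefix form.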

[26] What Do Prosody and Text Convey? Characterizing How Meaningful Information is Distributed Across Multiple Channels

Aditya Yadavalli,Tiago Pimentel,Tamar I Regev,Ethan Wilcox,Alex Warstadt

Main category: cs.CL

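TL;DR: This paper proposes an information-theoretic method that uses large models to quantify how much information prosody carries beyond the text alone. When long-term context is unavailable, the audio channel conveys over an order of magnitude more information than text about emotion and sarcasm, while prosody adds comparatively little for questionhood.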

Details

Motivation: Prosody carries important meaning that text cannot express (such as emotion and sarcasm), but the amount of information involved has not been quantified, so a method is needed to measure precisely what prosody conveys independently of text and what that information is about. Method: Large speech and language models are used to estimate the mutual information between a dimension of an utterance's meaning (such as emotion, sarcasm, or questionhood) and a communication channel (audio or text), quantifying how much each channel conveys. Result: Without long-term context, the audio channel (and by implication prosody) conveys over an order of magnitude more information about sarcasm and emotion than the text channel; for questionhood, prosody adds comparatively little. Conclusion: Prosody plays a dominant role in conveying meanings such as emotion and sarcasm, especially when context is limited; the method can be extended to more dimensions of meaning, communication channels, and languages. Abstract: Prosody -- the melody of speech -- conveys critical information often not captured by the words or text of a message. In this paper, we propose an information-theoretic approach to quantify how much information is expressed by prosody alone and not by text, and crucially, what that information is about. Our approach applies large speech and language models to estimate the mutual information between a particular dimension of an utterance's meaning (e.g., its emotion) and any of its communication channels (e.g., audio or text). We then use this approach to quantify how much information is conveyed by audio and text about sarcasm, emotion, and questionhood, using speech from television and podcasts. We find that for sarcasm and emotion the audio channel -- and by implication the prosodic channel -- transmits over an order of magnitude more information about these features than the text channel alone, at least when long-term context beyond the current sentence is unavailable. For questionhood, prosody provides comparatively less additional information. We conclude by outlining a program applying our approach to more dimensions of meaning, communication channels, and languages.
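One common way to estimate mutual information with a trained model, sketched below, is to subtract the model's cross-entropy on held-out labels from the label entropy: I(Y; X) = H(Y) - H(Y | X), with the cross-entropy standing in for H(Y | X). This is an assumption about the general recipe, not a description of the paper's exact estimator, and the predicted probabilities are made-up toy values.

```python
import math
from collections import Counter


def label_entropy(labels):
    """Plug-in estimate of H(Y) in bits from label frequencies."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def cross_entropy(labels, predicted):
    """Average -log2 p(y | x) under the model; upper-bounds H(Y | X) in expectation."""
    return -sum(math.log2(p[y]) for y, p in zip(labels, predicted)) / len(labels)


def mutual_information_estimate(labels, predicted):
    """H(Y) minus model cross-entropy: roughly a lower bound on I(Y; X)."""
    return label_entropy(labels) - cross_entropy(labels, predicted)


labels = ["sarcastic", "sincere", "sarcastic", "sincere"]
# Toy predictions: an audio model that is fairly confident, a text model at chance.
audio_probs = [{"sarcastic": 0.9, "sincere": 0.1} if y == "sarcastic"
               else {"sarcastic": 0.1, "sincere": 0.9} for y in labels]
text_probs = [{"sarcastic": 0.5, "sincere": 0.5} for _ in labels]

audio_mi = mutual_information_estimate(labels, audio_probs)
text_mi = mutual_information_estimate(labels, text_probs)
print(round(audio_mi, 3), round(text_mi, 3))
```

In this toy setup the audio channel carries about 0.85 bits about the label and the text channel none; comparing such per-channel estimates is the kind of contrast the abstract reports for sarcasm and emotion.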

[27] LLMCache: Layer-Wise Caching Strategies for Accelerated Reuse in Transformer Inference

Harsh Vardhan Bansal

Main category: cs.CL

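TL;DR: This paper presents LLMCache, a layer-wise caching framework that speeds up transformer inference by reusing intermediate activations for semantically similar input sequences. The method is model-agnostic, works for both encoder and decoder architectures, and can cache at arbitrary transformer layers. Experiments on BERT and GPT-2 show up to 3.1× faster inference with less than 0.5% accuracy loss.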

Details

Motivation: Transformer models perform well across many tasks, but their high inference latency limits real-time and large-scale deployment. Existing caching mechanisms such as key-value caches apply only to autoregressive decoding, so a more general and efficient caching scheme is needed. Method: LLMCache is a layer-wise caching framework that (1) matches semantically similar inputs with a lightweight fingerprinting mechanism, (2) supports caching at arbitrary transformer layers across encoder and decoder architectures, and (3) manages cache staleness with adaptive eviction strategies. Result: On BERT and GPT-2 across SQuAD, WikiText-103, and OpenBookQA, inference is up to 3.1× faster with less than 0.5% accuracy loss. Conclusion: LLMCache is a practical, general-purpose optimization for transformer inference that substantially reduces latency while keeping accuracy high, and is suited to real-world applications. Abstract: Transformer-based language models have achieved remarkable performance across a wide range of tasks, yet their high inference latency poses a significant challenge for real-time and large-scale deployment. While existing caching mechanisms, such as token-level key-value caches, offer speedups in autoregressive decoding, they are limited in scope and applicability. In this paper, we present LLMCache, a novel layer-wise caching framework that accelerates transformer inference by reusing intermediate activations based on semantic similarity of input sequences. Unlike prior work, LLMCache is model-agnostic, operates across both encoder and decoder architectures, and supports caching at arbitrary transformer layers. We introduce a lightweight fingerprinting mechanism for matching semantically similar inputs and propose adaptive eviction strategies to manage cache staleness. Experiments on BERT and GPT-2 across SQuAD, WikiText-103, and OpenBookQA show up to 3.1× speedup in inference time with <0.5% accuracy degradation. Our results highlight LLMCache as a practical and general-purpose solution for optimizing transformer inference in real-world applications.
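A similarity-keyed activation cache of the kind described above might look like the following sketch. The abstract does not specify the fingerprint, the similarity measure, or the eviction policy, so everything here is an assumption: fingerprints are plain vectors, matching uses cosine similarity with a threshold, and least-recently-used eviction stands in for the paper's adaptive strategies. `LayerCache` is a hypothetical name.

```python
import math
from collections import OrderedDict


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class LayerCache:
    """Caches one layer's activations, keyed by an input fingerprint."""

    def __init__(self, threshold=0.95, capacity=2):
        self.threshold = threshold
        self.capacity = capacity
        self.entries = OrderedDict()  # fingerprint tuple -> cached activation

    def lookup(self, fingerprint):
        """Return the activation of the most similar entry above the threshold."""
        best_key, best_sim = None, self.threshold
        for key in self.entries:
            sim = cosine(fingerprint, key)
            if sim >= best_sim:
                best_key, best_sim = key, sim
        if best_key is None:
            return None  # cache miss: the caller runs the layer as usual
        self.entries.move_to_end(best_key)  # mark as recently used
        return self.entries[best_key]

    def store(self, fingerprint, activation):
        key = tuple(fingerprint)
        self.entries[key] = activation
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used


cache = LayerCache()
cache.store((1.0, 0.0, 0.0), "activation_A")
cache.store((0.0, 1.0, 0.0), "activation_B")
hit = cache.lookup((0.99, 0.05, 0.0))   # close to A, so A is reused
cache.store((0.0, 0.0, 1.0), "activation_C")  # evicts B, the least recently used
print(hit, cache.lookup((0.0, 1.0, 0.0)))
```

The threshold is the knob behind the speed versus accuracy trade-off: a looser match reuses more activations but risks returning one computed for a meaningfully different input.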

[28] AdaSearch: Balancing Parametric Knowledge and Search in Large Language Models via Reinforcement Learning

Tzu-Han Lin,Wei-Lin Chen,Chen-An Li,Hung-yi Lee,Yun-Nung Chen,Yu Meng

Main category: cs.CL

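TL;DR: This paper proposes AdaSearch, a two-stage reinforcement learning framework that improves an LLM's self-knowledge when deciding whether to call search. It adaptively balances parametric knowledge with external search, reduces unnecessary search calls while preserving performance, and makes the decision more transparent.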

Details

Motivation: Existing search agents rely too heavily on search or on reward engineering and lack awareness of their own knowledge boundaries, which leads to unnecessary searches or hallucination, and their decision process is opaque. Method: AdaSearch is a two-stage, outcome-driven reinforcement learning framework that separates problem solving from the decision of whether to invoke search, making that decision explicit and interpretable. Result: Experiments show that AdaSearch substantially improves knowledge-boundary awareness, reduces unnecessary search calls, preserves task performance, and yields more transparent, interpretable decision behavior. Conclusion: With an explicit decision mechanism, AdaSearch balances internal knowledge against external search, giving more efficient, reliable, and interpretable search invocation across multiple models, which matters for high-stakes domains such as finance and medicine. Abstract: Equipping large language models (LLMs) with search engines via reinforcement learning (RL) has emerged as an effective approach for building search agents. However, overreliance on search introduces unnecessary cost and risks exposure to noisy or malicious content, while relying solely on parametric knowledge risks hallucination. The central challenge is to develop agents that adaptively balance parametric knowledge with external search, invoking search only when necessary. Prior work mitigates search overuse by shaping rewards around the number of tool calls. However, these penalties require substantial reward engineering, provide ambiguous credit assignment, and can be exploited by agents that superficially reduce calls. Moreover, evaluating performance solely through call counts conflates necessary and unnecessary search, obscuring the measurement of true adaptive behavior. To address these limitations, we first quantify the self-knowledge awareness of existing search agents via an F1-based decision metric, revealing that methods such as Search-R1 often overlook readily available parametric knowledge. Motivated by these findings, we propose AdaSearch, a simple two-stage, outcome-driven RL framework that disentangles problem solving from the decision of whether to invoke search, and makes this decision process explicit and interpretable. This transparency is crucial for high-stakes domains such as finance and medical question answering, yet is largely neglected by prior approaches. Experiments across multiple model families and sizes demonstrate that AdaSearch substantially improves knowledge-boundary awareness, reduces unnecessary search calls, preserves strong task performance, and offers more transparent, interpretable decision behaviors.
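The abstract mentions an F1-based decision metric without defining it. One plausible reading, sketched here purely as an assumption, treats "search was actually needed" as the positive class and scores the agent's decision to invoke search against it. The function name and the toy labels are hypothetical.

```python
def search_decision_f1(needs_search, invoked_search):
    """F1 of the search decision, with 'search needed' as the positive class.

    needs_search: per-question booleans, True if parametric knowledge alone fails.
    invoked_search: per-question booleans, True if the agent called search.
    """
    pairs = list(zip(needs_search, invoked_search))
    tp = sum(1 for need, call in pairs if need and call)
    fp = sum(1 for need, call in pairs if not need and call)
    fn = sum(1 for need, call in pairs if need and not call)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy agent that over-searches: it calls search on 4 of 5 questions,
# although only 2 of them actually require it.
needs = [True, True, False, False, False]
calls = [True, True, True, True, False]
print(round(search_decision_f1(needs, calls), 3))  # 0.667
```

Unlike a raw call count, this score distinguishes necessary from unnecessary searches: the over-searching agent has perfect recall but only 0.5 precision.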

[29] Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image

Yushi Hu,Reyhane Askari-Hemmat,Melissa Hall,Emily Dinan,Luke Zettlemoyer,Marjan Ghazvininejad

Main category: cs.CL

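TL;DR: This paper introduces Multimodal RewardBench 2 (MMRB2), the first comprehensive benchmark for reward models on multimodal understanding and interleaved generation. It covers four tasks with 1,000 expert-annotated preference pairs per task, evaluates existing judge models, and identifies directions for improvement.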

Details

Motivation: Reward models are essential for training language models but remain underexplored for omni models that handle interleaved image and text sequences, so a comprehensive benchmark targeting multimodal settings is needed. Method: MMRB2 covers four tasks (text-to-image, image editing, interleaved generation, and reasoning with images), each with 1,000 expert-annotated preference pairs drawn from 23 models and agents; an ensemble filtering strategy ensures high quality and strong consensus. Several existing reward models are evaluated, and the relationship between benchmark scores and downstream tasks is analyzed. Result: The latest Gemini 3 Pro reaches 75-80% accuracy; GPT-5 and Gemini 2.5 Pro reach 66-75%, above GPT-4o (59%) but below humans (>90%); the open-source Qwen3-VL-32B reaches about 64%; and MMRB2 scores correlate strongly with downstream task performance. Conclusion: MMRB2 is a challenging and practical benchmark for multimodal reward models. It shows that current models have substantial room to improve and provides direction and an evaluation standard for future multimodal reward modeling. Abstract: Reward models (RMs) are essential for training large language models (LLMs), but remain underexplored for omni models that handle interleaved image and text sequences. We introduce Multimodal RewardBench 2 (MMRB2), the first comprehensive benchmark for reward models on multimodal understanding and (interleaved) generation. MMRB2 spans four tasks: text-to-image, image editing, interleaved generation, and multimodal reasoning ("thinking-with-images"), providing 1,000 expert-annotated preference pairs per task from 23 models and agents across 21 source tasks. MMRB2 is designed with: (1) practical but challenging prompts; (2) responses from state-of-the-art models and agents; and (3) preference pairs with strong human-expert consensus, curated via an ensemble filtering strategy. Using MMRB2, we study existing judges for each subtask, including multimodal LLM-as-a-judge and models trained with human preferences. The latest Gemini 3 Pro attains 75-80% accuracy. GPT-5 and Gemini 2.5 Pro reach 66-75% accuracy, compared to >90% for humans, yet surpass the widely used GPT-4o (59%). The best performing open-source model Qwen3-VL-32B achieves similar accuracies as Gemini 2.5 Flash (64%). We also show that MMRB2 performance strongly correlates with downstream task success using Best-of-N sampling and conduct an in-depth analysis that shows key areas to improve the reward models going forward.

[30] In-Context Algebra

Eric Todd,Jannik Brinkmann,Rohit Gandikota,David Bau

Main category: cs.CL

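TL;DR: Studies how transformers learn arithmetic in context when symbol meanings are not fixed, and identifies three mechanisms: commutative copying, identity element recognition, and closure-based cancellation.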

Details

Motivation: To explore how transformers reason arithmetically over sequences in which symbol meanings change, going beyond prior work that assumes fixed symbols. Method: The authors design a new task in which the mapping from symbols to algebraic group elements differs for every sequence, and use targeted data distributions to run causal tests of candidate mechanisms. Result: Transformers reach near-perfect accuracy on the task and generalize to unseen algebraic groups; three mechanisms are learned consistently: commutative copying, identity element recognition, and closure-based cancellation. Conclusion: Models develop symbolic reasoning mechanisms when variable meanings are not fixed, complementing findings on geometric representations in fixed-symbol settings. Abstract: We investigate the mechanisms that arise when transformers are trained to solve arithmetic on sequences where tokens are variables whose meaning is determined only through their interactions. While prior work has found that transformers develop geometric embeddings that mirror algebraic structure, those previous findings emerge from settings where arithmetic-valued tokens have fixed meanings. We devise a new task in which the assignment of symbols to specific algebraic group elements varies from one sequence to another. Despite this challenging setup, transformers achieve near-perfect accuracy on the task and even generalize to unseen algebraic groups. We develop targeted data distributions to create causal tests of a set of hypothesized mechanisms, and we isolate three mechanisms models consistently learn: commutative copying where a dedicated head copies answers, identity element recognition that distinguishes identity-containing facts, and closure-based cancellation that tracks group membership to constrain valid answers. Complementary to the geometric representations found in fixed-symbol settings, our findings show that models develop symbolic reasoning mechanisms when trained to reason in-context with variables whose meanings are not fixed.
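To picture the task setup described above, here is a small sketch of how one sequence of in-context facts could be generated with a fresh symbol assignment each time. The choice of cyclic groups, the fact format, and the function name are assumptions made for illustration; the abstract does not specify which groups or encodings the authors use.

```python
import random


def sample_sequence(order, num_facts, vocabulary, rng):
    """Draw facts 'x * y = z' from the cyclic group Z_order under a random naming.

    The symbol for group element k is symbols[k]; the assignment is re-drawn for
    every sequence, so a symbol's meaning is only recoverable from the facts.
    """
    symbols = rng.sample(vocabulary, order)
    facts = []
    for _ in range(num_facts):
        a, b = rng.randrange(order), rng.randrange(order)
        facts.append((symbols[a], symbols[b], symbols[(a + b) % order]))
    return symbols, facts


rng = random.Random(0)
symbols, facts = sample_sequence(order=5, num_facts=20,
                                 vocabulary=list("abcdefghij"), rng=rng)
for x, y, z in facts[:3]:
    print(f"{x} * {y} = {z}")
```

Because the group is abelian, every fact `x * y = z` implies `y * x = z`, which is the kind of regularity a commutative-copying mechanism can exploit; likewise, the symbol at index 0 acts as the identity in every fact it appears in.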

[31] Constructive Circuit Amplification: Improving Math Reasoning in LLMs via Targeted Sub-Network Updates

Nikhil Prakash,Donghao Ren,Dominik Moritz,Yannick Assogba

Main category: cs.CL

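TL;DR: Proposes Constructive Circuit Amplification, a method that identifies the key components tied to a specific task and updates only those, substantially improving accuracy on mathematical reasoning while leaving other abilities almost untouched.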

Details

Motivation: Prior work has found sparse subnetworks (circuits) inside LLMs that are responsible for specific tasks, and has shown that fine-tuning improves performance mainly by strengthening existing circuits. The authors therefore ask whether intervening directly on these circuits can yield precise capability gains. Method: The method identifies pivotal tokens from the model's reasoning traces together with the model components responsible for the target task, and updates only those components, amplifying the targeted capability. Result: Applied to mathematical reasoning across multiple models, it improves accuracy by up to +11.4% while modifying as little as 1.59% of model components, with minimal impact on other abilities as measured by MMLU, TriviaQA, and TruthfulQA. Conclusion: Selectively updating a small set of model components can reliably strengthen a targeted capability of an LLM, supporting the effectiveness and precision of circuit-level intervention. Abstract: Prior studies investigating the internal workings of LLMs have uncovered sparse subnetworks, often referred to as circuits, that are responsible for performing specific tasks. Additionally, it has been shown that model performance improvement through fine-tuning often results from the strengthening of existing circuits in the model. Taken together, these findings suggest the possibility of intervening directly on such circuits to make precise, task-targeted updates. Motivated by these findings, we propose a novel method called Constructive Circuit Amplification which identifies pivotal tokens from model reasoning traces as well as model components responsible for the desired task, and updates only those components. Applied to mathematical reasoning, it improves accuracy by up to +11.4% across multiple models while modifying as little as 1.59% of model components, with minimal impact on other abilities as measured by MMLU, TriviaQA, and TruthfulQA. These results demonstrate that targeted capabilities can be reliably enhanced by selectively updating a sparse set of model components.

cs.CV [Back]

[32] Two-Step Data Augmentation for Masked Face Detection and Recognition: Turning Fake Masks to Real

Yan Yang,George Bebis,Mircea Nicolescu

Main category: cs.CV

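TL;DR: Proposes a two-step generative data augmentation framework that combines rule-based mask warping with unpaired image-to-image translation using GANs to generate more realistic masked-face samples.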

Details

Motivation: To address data scarcity and distribution shift in masked face recognition. Method: Rule-based mask warping is combined with unpaired image-to-image translation using GANs, together with a non-mask preservation loss and stochastic noise injection. Result: Compared with rule-based warping alone, the proposed method gives consistent qualitative improvements and greater sample diversity. Conclusion: The framework improves data augmentation for masked face recognition and suggests directions for future data-centric augmentation methods. Abstract: Data scarcity and distribution shift pose major challenges for masked face detection and recognition. We propose a two-step generative data augmentation framework that combines rule-based mask warping with unpaired image-to-image translation using GANs, enabling the generation of realistic masked-face samples beyond purely synthetic transformations. Compared to rule-based warping alone, the proposed approach yields consistent qualitative improvements and complements existing GAN-based masked face generation methods such as IAMGAN. We introduce a non-mask preservation loss and stochastic noise injection to stabilize training and enhance sample diversity. Experimental observations highlight the effectiveness of the proposed components and suggest directions for future improvements in data-centric augmentation for face recognition tasks.

[33] Seeing Beyond Words: Self-Supervised Visual Learning for Multimodal Large Language Models

Davide Caffagni,Sara Sarto,Marcella Cornia,Lorenzo Baraldi,Pier Luigi Dovesi,Shaghayegh Roohi,Mark Granroth-Wilding,Rita Cucchiara

Main category: cs.CV

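TL;DR: This paper proposes JARVIS, a self-supervised framework that brings the I-JEPA learning paradigm into multimodal large language models (MLLMs) to strengthen visual understanding and reduce reliance on language priors, and validates it on a range of benchmarks.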

Details

Motivation: Existing MLLMs learn visual understanding mainly from textual descriptions, which gives a subjective and incomplete supervisory signal. Multimodal instruction tuning is also small in scale, so models tend to overfit language priors and overlook visual details. Method: The I-JEPA learning paradigm is integrated into the vision-language alignment stage of MLLM training. Frozen vision foundation models serve as context and target encoders, and a predictor built from the early layers of the LLM is trained to learn structural and semantic regularities from images, reducing reliance on language supervision. Result: On standard MLLM benchmarks, JARVIS consistently improves vision-centric performance across different LLM families without degrading multimodal reasoning. Conclusion: JARVIS strengthens the visual understanding of MLLMs through self-supervision and mitigates overfitting to language priors, offering an effective way to improve MLLMs on fundamental visual reasoning tasks. Abstract: Multimodal Large Language Models (MLLMs) have recently demonstrated impressive capabilities in connecting vision and language, yet their proficiency in fundamental visual reasoning tasks remains limited. This limitation can be attributed to the fact that MLLMs learn visual understanding primarily from textual descriptions, which constitute a subjective and inherently incomplete supervisory signal. Furthermore, the modest scale of multimodal instruction tuning compared to massive text-only pre-training leads MLLMs to overfit language priors while overlooking visual details. To address these issues, we introduce JARVIS, a JEPA-inspired framework for self-supervised visual enhancement in MLLMs. Specifically, we integrate the I-JEPA learning paradigm into the standard vision-language alignment pipeline of MLLMs training. Our approach leverages frozen vision foundation models as context and target encoders, while training the predictor, implemented as the early layers of an LLM, to learn structural and semantic regularities from images without relying exclusively on language supervision. Extensive experiments on standard MLLM benchmarks show that JARVIS consistently improves performance on vision-centric benchmarks across different LLM families, without degrading multimodal reasoning abilities. Our source code is publicly available at: https://github.com/aimagelab/JARVIS.

Dwip Dalal,Utkarsh Mishra,Narendra Ahuja,Nebojsa Jojic

Main category: cs.CV

TL;DR: This paper introduces a new task, Sparsely Grounded Visual Navigation, and builds the CityNav benchmark to evaluate how well multimodal large language models (MLLMs) navigate real urban environments.

Details

Motivation: Most existing benchmarks are language-centric or depend on simulated environments, so they rarely test the knowledge-intensive reasoning MLLMs need in complex real-world scenes; a more realistic evaluation is required. Method: The CityNav benchmark spans four global cities and more than 50 decision points, and requires agents to navigate using only visual inputs and internal multimodal reasoning. The authors also propose VoP, which strengthens reasoning by probing an explicit cognitive map (key landmarks and directions) from the MLLM. Result: Experiments show that current state-of-the-art MLLMs and standard reasoning techniques (such as Chain-of-Thought and Reflection) perform poorly on this task, while VoP substantially raises navigation success. Conclusion: Knowledge-intensive navigation in real environments is an effective way to evaluate MLLM agents, and VoP offers an effective strategy for improving their spatial reasoning and route planning. Abstract: Leveraging multimodal large language models (MLLMs) to develop embodied agents offers significant promise for addressing complex real-world tasks. However, current evaluation benchmarks remain predominantly language-centric or heavily reliant on simulated environments, rarely probing the nuanced, knowledge-intensive reasoning essential for practical, real-world scenarios. To bridge this critical gap, we introduce the task of Sparsely Grounded Visual Navigation, explicitly designed to evaluate the sequential decision-making abilities of MLLMs in challenging, knowledge-intensive real-world environments. We operationalize this task with CityNav, a comprehensive benchmark encompassing four diverse global cities, specifically constructed to assess raw MLLM-driven agents in city navigation. Agents are required to rely solely on visual inputs and internal multimodal reasoning to sequentially navigate 50+ decision points without additional environmental annotations or specialized architectural modifications. Crucially, agents must autonomously achieve localization through interpreting city-specific cues and recognizing landmarks, perform spatial reasoning, and strategically plan and execute routes to their destinations. Through extensive evaluations, we demonstrate that current state-of-the-art MLLMs and standard reasoning techniques (e.g., Chain-of-Thought, Reflection) significantly underperform in this challenging setting. To address this, we propose Verbalization of Path (VoP), which explicitly grounds the agent's internal reasoning by probing an explicit cognitive map (key landmarks and directions toward the destination) from the MLLMs, substantially enhancing navigation success. Project Webpage: https://dwipddalal.github.io/AgentNav/

[35] R4: Retrieval-Augmented Reasoning for Vision-Language Models in 4D Spatio-Temporal Space

Tin Stribor Sohn,Maximilian Dillitzer,Jason J. Corso,Eric Sax

Main category: cs.CV

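TL;DR: R4 is a training-free framework that strengthens the reasoning of vision-language models by building structured, persistent memory in 4D spatio-temporal space; it supports a world model shared across agents and substantially outperforms baselines on embodied question answering and navigation tasks.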

Details

Motivation: Inspired by how humans build multimodal memories in four-dimensional space-time for perception and reasoning, the work aims to give vision-language models persistent, structured spatio-temporal memory so that they can better handle complex tasks in dynamic environments. Method: The R4 framework continuously builds a 4D knowledge database by anchoring object-level semantic descriptions in metric space and time. At inference, a natural language query is decomposed into semantic, spatial, and temporal keys for retrieval, and the retrieved results are integrated into the VLM's reasoning. Result: On embodied question answering and navigation benchmarks, R4 substantially improves retrieval and reasoning over spatio-temporal information and outperforms existing baselines. Conclusion: R4 achieves training-free, retrieval-augmented reasoning in 4D space-time and offers a new paradigm for collaboration and long-term memory in embodied agents operating in dynamic environments. Abstract: Humans perceive and reason about their surroundings in four dimensions by building persistent, structured internal representations that encode semantic meaning, spatial layout, and temporal dynamics. These multimodal memories enable them to recall past events, infer unobserved states, and integrate new information into context-dependent reasoning. Inspired by this capability, we introduce R4, a training-free framework for retrieval-augmented reasoning in 4D spatio-temporal space that equips vision-language models (VLMs) with structured, lifelong memory. R4 continuously constructs a 4D knowledge database by anchoring object-level semantic descriptions in metric space and time, yielding a persistent world model that can be shared across agents. At inference, natural language queries are decomposed into semantic, spatial, and temporal keys to retrieve relevant observations, which are integrated into the VLM's reasoning. Unlike classical retrieval-augmented generation methods, retrieval in R4 operates directly in 4D space, enabling episodic and collaborative reasoning without training. Experiments on embodied question answering and navigation benchmarks demonstrate that R4 substantially improves retrieval and reasoning over spatio-temporal information compared to baselines, advancing a new paradigm for embodied 4D reasoning in dynamic environments.
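
The retrieval step described for R4 (semantic, spatial, and temporal keys over observations anchored in metric space and time) can be pictured with a small sketch. This is not the authors' implementation: the `Observation` record, the `retrieve` function, the keyword-overlap score, and the sample memory are all hypothetical stand-ins chosen to keep the example self-contained.

```python
import math
from dataclasses import dataclass


@dataclass
class Observation:
    """Hypothetical memory record: a description anchored in space and time."""
    description: str   # object-level semantic description
    position: tuple    # (x, y, z) in metres, in a shared metric frame
    timestamp: float   # seconds since the start of the episode


def keyword_overlap(query_terms, description):
    """Toy semantic score: fraction of query terms found in the description."""
    words = set(description.lower().split())
    terms = {t.lower() for t in query_terms}
    return len(words & terms) / len(terms) if terms else 0.0


def retrieve(memory, semantic_terms, near=None, radius=None,
             time_window=None, top_k=3):
    """Filter by the spatial and temporal keys, then rank by the semantic key."""
    hits = []
    for obs in memory:
        if near is not None and radius is not None:
            if math.dist(obs.position, near) > radius:
                continue  # outside the spatial key
        if time_window is not None:
            start, end = time_window
            if not (start <= obs.timestamp <= end):
                continue  # outside the temporal key
        score = keyword_overlap(semantic_terms, obs.description)
        if score > 0:
            hits.append((score, obs))
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [obs for _, obs in hits[:top_k]]


memory = [
    Observation("red mug on kitchen table", (1.0, 2.0, 0.8), 10.0),
    Observation("red mug in sink", (4.0, 2.5, 0.9), 95.0),
    Observation("blue chair near window", (0.5, 6.0, 0.0), 40.0),
]

# "Where was the red mug during the last minute?" -> temporal key 60..120 s
recent = retrieve(memory, ["red", "mug"], time_window=(60.0, 120.0))
```

In a real system the keyword score would be replaced by an embedding similarity, and the retrieved observations would be passed to the VLM as context; both are outside the scope of this sketch.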

[36] The Perceptual Observatory Characterizing Robustness and Grounding in MLLMs

Tejas Anvekar,Fenil Bardoliya,Pavan K. Turaga,Chitta Baral,Vivek Gupta

Main category: cs.CV

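TL;DR: Proposes The Perceptual Observatory, a framework that systematically evaluates the visual perception of multimodal large language models under controlled perturbations, going beyond conventional end-task accuracy.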

Details

Motivation: Existing evaluation methods focus too heavily on end-task accuracy and neglect the robustness of visual perception, attribution fidelity, and reasoning in multimodal large language models. Method: An evaluation framework is built from several vertical tasks, using ground-truth datasets that are systematically perturbed with pixel-level augmentations and stylized illusions generated by diffusion models. Result: On tasks including face matching, text recognition, image matching, and attribute localization, the framework reveals how well models preserve perception under different perturbations and where they fall short. Conclusion: The framework provides a principled foundation for analyzing the perceptual abilities of current and future MLLMs and underlines the need to look beyond accuracy alone. Abstract: Recent advances in multimodal large language models (MLLMs) have yielded increasingly powerful models, yet their perceptual capacities remain poorly characterized. In practice, most model families scale the language component while reusing nearly identical vision encoders (e.g., Qwen2.5-VL 3B/7B/72B), which raises pivotal concerns about whether progress reflects genuine visual grounding or reliance on internet-scale textual world knowledge. Existing evaluation methods emphasize end-task accuracy, overlooking robustness, attribution fidelity, and reasoning under controlled perturbations. We present The Perceptual Observatory, a framework that characterizes MLLMs across verticals such as (i) simple vision tasks, including face matching and text-in-vision comprehension; and (ii) local-to-global understanding, encompassing image matching, grid pointing game, and attribute localization, which tests general visual grounding. Each vertical is instantiated with ground-truth datasets of faces and words, systematically perturbed through pixel-based augmentations and diffusion-based stylized illusions. The Perceptual Observatory moves beyond leaderboard accuracy to yield insights into how MLLMs preserve perceptual grounding and relational structure under perturbations, providing a principled foundation for analyzing strengths and weaknesses of current and future models.

[37] Seeing is Believing (and Predicting): Context-Aware Multi-Human Behavior Prediction with Vision Language Models

Utsav Panchal,Yuchen Liu,Luigi Palmieri,Ilche Georgievski,Marco Aiello

Main category: cs.CV

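TL;DR: This paper proposes CAMP-VLM, a vision-language-model-based framework for multi-human behavior prediction that combines visual context with spatial awareness from scene graphs and is fine-tuned on synthetic data, substantially improving prediction accuracy for human-scene interactions from a third-person view.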

Details

Motivation: Prior work mainly addresses egocentric action prediction in single-human scenarios, whereas many robotic applications need to understand the behavior of multiple humans from a third-person view, for which suitable datasets and effective methods are lacking. Method: The CAMP-VLM framework combines a vision-language model with spatial information from scene graphs and is trained with Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) on synthetic data generated by a photorealistic simulator. Result: Evaluated on synthetic and real-world sequences, CAMP-VLM improves prediction accuracy over the best baseline by up to 66.9%. Conclusion: CAMP-VLM performs well on third-person multi-human behavior prediction, supporting the effectiveness of training on synthetic data and of context-aware modeling, and it generalizes well. Abstract: Accurately predicting human behaviors is crucial for mobile robots operating in human-populated environments. While prior research primarily focuses on predicting actions in single-human scenarios from an egocentric view, several robotic applications require understanding multiple human behaviors from a third-person perspective. To this end, we present CAMP-VLM (Context-Aware Multi-human behavior Prediction): a Vision Language Model (VLM)-based framework that incorporates contextual features from visual input and spatial awareness from scene graphs to enhance prediction of human-scene interactions. Due to the lack of suitable datasets for multi-human behavior prediction from an observer view, we perform fine-tuning of CAMP-VLM with synthetic human behavior data generated by a photorealistic simulator, and evaluate the resulting models on both synthetic and real-world sequences to assess their generalization capabilities. Leveraging Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO), CAMP-VLM outperforms the best-performing baseline by up to 66.9% in prediction accuracy.
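
Since the entry names Direct Preference Optimization without spelling it out, the sketch below shows the standard per-pair DPO objective for a single preference pair. It is a generic illustration, not CAMP-VLM's training code: the function name `dpo_loss`, the default `beta=0.1`, and the example log-probabilities are assumptions made for the example.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of responses.

    Inputs are sequence log-probabilities under the trained policy and under
    the frozen reference model. beta is a temperature; 0.1 is an assumed
    default, not a value taken from the paper.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) == softplus(-logits), computed in a stable form.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


# The policy equals the reference: no preference has been learned yet.
neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)
# The policy has shifted probability mass toward the chosen response.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

The loss falls as the policy raises the chosen response's likelihood relative to the reference and lowers the rejected one's, which is the behavior preference fine-tuning relies on.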

[38] From Words to Wavelengths: VLMs for Few-Shot Multispectral Object Detection

Manuel Nkegoum,Minh-Tan Pham,Élisa Fromont,Bruno Avignon,Sébastien Lefèvre

Main category: cs.CV

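TL;DR: This study explores the potential of vision-language models (VLMs) for few-shot multispectral object detection, proposes a method that effectively fuses text, visible, and thermal modalities, and validates its strong performance on the FLIR and M3FD datasets.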

Details

Motivation: Annotated multispectral data is scarce, which limits the training of deep detectors, so semantic supervision such as textual class information is needed to improve data efficiency. Motivated by the success of VLMs in computer vision, the paper explores their use for multispectral detection. Method: Two representative VLM-based detectors, Grounding DINO and YOLO-World, are extended to accept multispectral inputs, and an effective mechanism is designed to fuse text, visual, and thermal information. Result: Experiments on the FLIR and M3FD benchmarks show that VLM-based detectors significantly outperform specialized multispectral models trained with comparable data in few-shot settings, and achieve competitive or superior results under full supervision. Conclusion: The semantic priors learned by large-scale VLMs transfer effectively to unseen spectral modalities, offering a strong route toward data-efficient multispectral perception. Abstract: Multispectral object detection is critical for safety-sensitive applications such as autonomous driving and surveillance, where robust perception under diverse illumination conditions is essential. However, the limited availability of annotated multispectral data severely restricts the training of deep detectors. In such data-scarce scenarios, textual class information can serve as a valuable source of semantic supervision. Motivated by the recent success of Vision-Language Models (VLMs) in computer vision, we explore their potential for few-shot multispectral object detection. Specifically, we adapt two representative VLM-based detectors, Grounding DINO and YOLO-World, to handle multispectral inputs and propose an effective mechanism to integrate text, visual and thermal modalities. Through extensive experiments on two popular multispectral image benchmarks, FLIR and M3FD, we demonstrate that VLM-based detectors not only excel in few-shot regimes, significantly outperforming specialized multispectral models trained with comparable data, but also achieve competitive or superior results under fully supervised settings. Our findings reveal that the semantic priors learned by large-scale VLMs effectively transfer to unseen spectral modalities, offering a powerful pathway toward data-efficient multispectral perception.

[39] Are vision-language models ready to zero-shot replace supervised classification models in agriculture?

Earl Ranario,Mason J. Earles

Main category: cs.CV

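TL;DR: This paper evaluates the zero-shot performance of a range of vision-language models (VLMs) on agricultural classification tasks and finds that they fall far short of a specialized supervised model (YOLO11), especially under open-ended generative prompting. Semantic judging makes the evaluation more accurate, but overall the results indicate that current off-the-shelf VLMs are not yet suitable as standalone agricultural diagnostic tools, though they can serve as assistive components.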

Details

Motivation: To examine how applicable and reliable existing multimodal large models are for agricultural visual recognition, and to fill the gap in understanding of their performance for agricultural decision support. Method: Open-source and closed-source VLMs are evaluated zero-shot on 27 agricultural classification tasks from the AgML collection. The study compares prompting styles (multiple-choice versus open-ended generation) and evaluation methods (such as LLM-based semantic judging), and contrasts the VLMs with a specialized model, YOLO11. Result: All VLMs trail YOLO11 substantially in the zero-shot setting. The best closed-source model, Gemini-3 Pro, reaches about 62% average accuracy with multiple-choice prompting, while open-ended accuracy is typically below 25%. Semantic judging raises the open-ended accuracy of top models from 21% to 30%. Qwen-VL-72B is the best open-source model. Plant and weed species classification is comparatively easy, and pest and damage identification is the hardest. Conclusion: Current off-the-shelf VLMs are not yet adequate as standalone agricultural diagnostic systems, but they can be used as assistive tools when combined with constrained interfaces, explicit label ontologies, and domain-aware evaluation strategies. Abstract: Vision-language models (VLMs) are increasingly proposed as general-purpose solutions for visual recognition tasks, yet their reliability for agricultural decision support remains poorly understood. We benchmark a diverse set of open-source and closed-source VLMs on 27 agricultural classification datasets from the AgML collection, spanning 162 classes across plant disease, pest and damage, and plant and weed species identification. Across all tasks, zero-shot VLMs substantially underperform a supervised task-specific baseline (YOLO11), which consistently achieves markedly higher accuracy than any foundation model. Under multiple-choice prompting, the best-performing VLM (Gemini-3 Pro) reaches approximately 62% average accuracy, while open-ended prompting yields much lower performance, with raw accuracies typically below 25%. Applying LLM-based semantic judging increases open-ended accuracy (for example, from 21% to 30% for top models) and alters model rankings, demonstrating that evaluation methodology meaningfully affects reported conclusions. Among open-source models, Qwen-VL-72B performs best, approaching closed-source performance under constrained prompting but still trailing top proprietary systems. Task-level analysis shows that plant and weed species classification is consistently easier than pest and damage identification, which remains the most challenging category across models. Overall, these results indicate that current off-the-shelf VLMs are not yet suitable as standalone agricultural diagnostic systems, but can function as assistive components when paired with constrained interfaces, explicit label ontologies, and domain-aware evaluation strategies.

[40] Eyes on the Grass: Biodiversity-Increasing Robotic Mowing Using Deep Visual Embeddings

Lars Beckers,Arno Waes,Aaron Van Campenhout,Toon Goedemé

Main category: cs.CV

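TL;DR: Proposes a robotic mowing framework that actively enhances garden biodiversity through visual perception and adaptive decision-making, using deep feature-space analysis to selectively preserve vegetation patches.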

Details

Motivation: Conventional mowing harms biodiversity and passive rewilding has limited effect, so a smarter, active form of management is needed to raise the ecological value of urban green space. Method: A pretrained ResNet50 network extracts ecologically meaningful embeddings from plant images; a global deviation metric estimates biodiversity without species-level supervision, and these estimates drive a selective mowing algorithm that dynamically switches between mowing and conservation. Result: The system is validated on a mock-up lawn and on real garden datasets, and embedding-space dispersion correlates strongly with expert biodiversity assessments. Conclusion: Deep visual diversity can serve as an effective proxy for ecological richness, and the proposed mowing decision method can promote urban biodiversity, with the prospect of turning monocultural lawns into valuable habitats. Abstract: This paper presents a robotic mowing framework that actively enhances garden biodiversity through visual perception and adaptive decision-making. Unlike passive rewilding approaches, the proposed system uses deep feature-space analysis to identify and preserve visually diverse vegetation patches in camera images by selectively deactivating the mower blades. A ResNet50 network pretrained on PlantNet300K provides ecologically meaningful embeddings, from which a global deviation metric estimates biodiversity without species-level supervision. These estimates drive a selective mowing algorithm that dynamically alternates between mowing and conservation behavior. The system was implemented on a modified commercial robotic mower and validated both in a controlled mock-up lawn and on real garden datasets. Results demonstrate a strong correlation between embedding-space dispersion and expert biodiversity assessment, confirming the feasibility of deep visual diversity as a proxy for ecological richness and the effectiveness of the proposed mowing decision approach. Widespread adoption of such systems will turn ecologically worthless, monocultural lawns into vibrant, valuable biotopes that boost urban biodiversity.
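
The abstract does not define its global deviation metric, so the sketch below uses one plausible stand-in: the mean Euclidean distance of patch embeddings from their centroid, thresholded to decide whether the blades stay on. The functions `dispersion` and `blades_on`, the threshold value, and the two-dimensional toy embeddings are all assumptions for illustration; real embeddings would come from the ResNet50 backbone.

```python
import math


def dispersion(embeddings):
    """Mean Euclidean distance of embeddings from their centroid.

    An assumed stand-in for embedding-space dispersion, not the paper's metric.
    """
    n = len(embeddings)
    dim = len(embeddings[0])
    centroid = [sum(vec[i] for vec in embeddings) / n for i in range(dim)]
    return sum(math.dist(vec, centroid) for vec in embeddings) / n


def blades_on(patch_embeddings, threshold):
    """Mow only where the patch looks visually uniform (low dispersion)."""
    return dispersion(patch_embeddings) < threshold


# Toy 2-D embeddings; a real system would use high-dimensional CNN features.
uniform_lawn = [[1.0, 0.0], [1.0, 0.1], [0.9, 0.0]]
mixed_patch = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5]]

THRESHOLD = 0.5  # hypothetical value; it would need tuning on real data
```

With these toy values the uniform lawn falls below the threshold and is mowed, while the mixed patch exceeds it and is preserved.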

Liudi Yang,Yang Bai,George Eskandar,Fengyi Shen,Mohammad Altillawi,Dong Chen,Ziyuan Liu,Abhinav Valada

Main category: cs.CV

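TL;DR: Proposes a method that generates video-action pairs from a text instruction, an initial image, and the robot state by extending a pretrained video diffusion model with a Bridge Attention mechanism, producing high-quality videos and precise actions for robot policy learning.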

Details

Motivation: Existing methods mostly use two-stage pipelines or adapt a single-modal model, which restricts cross-modal information exchange; action annotations are also scarce, so the knowledge in pretrained video models is hard to exploit fully. Method: (1) Extend a pretrained video diffusion model with a dedicated action diffusion model that preserves the pretrained knowledge; (2) introduce a Bridge Attention mechanism for effective cross-modal interaction; (3) design an action refinement module that converts coarse actions into precise controls for low-resolution datasets. Result: On multiple public benchmarks and real-world datasets, the method generates higher-quality videos and more accurate actions and significantly outperforms existing baselines. Conclusion: The method offers a scalable framework for using large-scale video data in robot learning and addresses both the lack of action annotations and weak cross-modal coupling. Abstract: We present a method to generate video-action pairs that follow text instructions, starting from an initial image observation and the robot's joint states. Our approach automatically provides action labels for video diffusion models, overcoming the common lack of action annotations and enabling their full use for robotic policy learning. Existing methods either adopt two-stage pipelines, which limit tightly coupled cross-modal information sharing, or rely on adapting a single-modal diffusion model for a joint distribution that cannot fully leverage pretrained video knowledge. To overcome these limitations, we (1) extend a pretrained video diffusion model with a parallel, dedicated action diffusion model that preserves pretrained knowledge, (2) introduce a Bridge Attention mechanism to enable effective cross-modal interaction, and (3) design an action refinement module to convert coarse actions into precise controls for low-resolution datasets. Extensive evaluations on multiple public benchmarks and real-world datasets demonstrate that our method generates higher-quality videos, more accurate actions, and significantly outperforms existing baselines, offering a scalable framework for leveraging large-scale video data for robotic learning.

[42] Driving in Corner Case: A Real-World Adversarial Closed-Loop Evaluation Platform for End-to-End Autonomous Driving

Jiaheng Geng,Jiatong Du,Xinyu Zhang,Ye Li,Panqu Wang,Yanjun Huang

Main category: cs.CV

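TL;DR: Proposes a closed-loop evaluation platform for end-to-end autonomous driving that generates adversarial interactions in real-world scenes to detect performance degradation of models in safety-critical corner cases.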

Details

Motivation: Existing adversarial evaluation methods are mostly built on simplified simulation environments and are hard to apply to real-world end-to-end autonomous driving systems, leaving no effective way to evaluate safety-critical corner cases in real scenes. Method: A closed-loop evaluation platform combines a flow-matching-based real-world image generator with an adversarial traffic policy to test end-to-end driving models adversarially in real-world scenes. Result: The platform generates realistic driving images efficiently and stably, and tests on models such as UniAD and VAD confirm that it can expose performance degradation in corner cases. Conclusion: The platform can effectively uncover potential problems in end-to-end autonomous driving models and helps improve system safety and robustness. Abstract: Safety-critical corner cases, difficult to collect in the real world, are crucial for evaluating end-to-end autonomous driving. Adversarial interaction is an effective method to generate such safety-critical corner cases. While existing adversarial evaluation methods are built for models operating in simplified simulation environments, adversarial evaluation for real-world end-to-end autonomous driving has been little explored. To address this challenge, we propose a closed-loop evaluation platform for end-to-end autonomous driving, which can generate adversarial interactions in real-world scenes. In our platform, the real-world image generator cooperates with an adversarial traffic policy to evaluate various end-to-end models trained on real-world data. The generator, based on flow matching, efficiently and stably generates real-world images according to the traffic environment information. The efficient adversarial surrounding vehicle policy is designed to model challenging interactions and create corner cases that current autonomous driving systems struggle to handle. Experimental results demonstrate that the platform can generate realistic driving images efficiently. By evaluating end-to-end models such as UniAD and VAD, we demonstrate that, based on the adversarial policy, our platform evaluates the performance degradation of the tested model in corner cases. This result indicates that this platform can effectively detect the model's potential issues, which will facilitate the safety and robustness of end-to-end autonomous driving.

[43] FOD-Diff: 3D Multi-Channel Patch Diffusion Model for Fiber Orientation Distribution

Hao Tang,Hanyu Liu,Alessandro Perelli,Xi Chen,Chao Li

Main category: cs.CV

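TL;DR: Proposes a 3D multi-channel patch diffusion model that combines a brain anatomy prior with a spherical harmonic attention mechanism to predict high angular resolution FOD from low angular resolution dMRI.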

Details

Motivation: FOD estimated from low angular resolution dMRI is not accurate enough, high angular resolution acquisition requires long scans, and existing diffusion models struggle to handle the large number of spherical harmonic coefficients efficiently. Method: An FOD-patch adapter introduces a brain anatomy prior for patch-based learning; a voxel-level conditional coordinating module improves global understanding; and a spherical harmonic (SH) attention module learns the complex correlations among SH coefficients. Result: The method outperforms existing state-of-the-art methods on HAR-FOD prediction, with higher accuracy and efficiency. Conclusion: The method generates HAR-FOD from LAR-FOD efficiently and accurately and may increase the clinical value of dMRI. Abstract: Diffusion MRI (dMRI) is a critical non-invasive technique to estimate fiber orientation distribution (FOD) for characterizing white matter integrity. Estimating FOD from single-shell low angular resolution dMRI (LAR-FOD) is limited by accuracy, whereas estimating FOD from multi-shell high angular resolution dMRI (HAR-FOD) requires a long scanning time, which limits its applicability. Diffusion models have shown promise in estimating HAR-FOD based on LAR-FOD. However, using diffusion models to efficiently generate HAR-FOD is challenging due to the large number of spherical harmonic (SH) coefficients in FOD. Here, we propose a 3D multi-channel patch diffusion model to predict HAR-FOD from LAR-FOD. We design the FOD-patch adapter by introducing the prior brain anatomy for more efficient patch-based learning. Furthermore, we introduce a voxel-level conditional coordinating module to enhance the global understanding of the model. We design the SH attention module to effectively learn the complex correlations of the SH coefficients. Our experimental results show that our method achieves the best performance in HAR-FOD prediction and outperforms other state-of-the-art methods.
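
To give a sense of why the number of SH coefficients drives the channel count, the short sketch below counts coefficients per voxel for an even-order spherical harmonic basis, which is the usual choice for antipodally symmetric FODs. The abstract does not state the maximum order used, so the `l_max` values shown are illustrative assumptions, and `num_sh_coefficients` is a hypothetical helper.

```python
def num_sh_coefficients(l_max):
    """Count coefficients of an even-order SH basis up to order l_max.

    Each even order l contributes 2*l + 1 terms, so the total equals
    (l_max + 1) * (l_max + 2) / 2.
    """
    if l_max < 0 or l_max % 2:
        raise ValueError("l_max must be a non-negative even integer")
    return sum(2 * l + 1 for l in range(0, l_max + 1, 2))


# Illustrative orders only; the paper's actual l_max is not given here.
channels_by_order = {l_max: num_sh_coefficients(l_max) for l_max in (2, 4, 6, 8)}
```

Under this assumption, each voxel of a 3D volume carries tens of channels at higher orders, which is the kind of load a multi-channel patch design has to cope with.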

[44] Auto-Vocabulary 3D Object Detection

Haomeng Zhang,Kuan-Chuan Peng,Suhas Lohit,Raymond A. Yeh

Main category: cs.CV

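TL;DR: Proposes AV3DOD, an auto-vocabulary 3D object detection framework that needs no user input, uses 2D vision-language models to generate class names, and reaches SOTA in both localization and semantic quality.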

Details

Motivation: Existing open-vocabulary 3D detection methods depend on user-specified classes and cannot generate classes automatically. Method: 2D vision-language models generate semantic candidates through image captioning, pseudo 3D box generation, and feature-space semantics expansion, and a Semantic Score (SS) is introduced to evaluate the quality of the generated class names. Result: On the ScanNetV2 and SUNRGB-D datasets, AV3DOD reaches SOTA on both mAP and SS; on ScanNetV2 it improves mAP by 3.48 and SS by 24.5% in relative terms. Conclusion: AV3DOD generates vocabulary automatically without human intervention and delivers strong localization and semantic quality in open-vocabulary 3D detection. Abstract: Open-vocabulary 3D object detection methods are able to localize 3D boxes of classes unseen during training. Despite the name, existing methods rely on user-specified classes both at training and inference. We propose to study Auto-Vocabulary 3D Object Detection (AV3DOD), where the classes are automatically generated for the detected objects without any user input. To this end, we introduce Semantic Score (SS) to evaluate the quality of the generated class names. We then develop a novel framework, AV3DOD, which leverages 2D vision-language models (VLMs) to generate rich semantic candidates through image captioning, pseudo 3D box generation, and feature-space semantics expansion. AV3DOD achieves state-of-the-art (SOTA) performance on both localization (mAP) and semantic quality (SS) on the ScanNetV2 and SUNRGB-D datasets. Notably, it surpasses the SOTA, CoDA, by 3.48 overall mAP and attains a 24.5% relative improvement in SS on ScanNetV2.

[45] LAPX: Lightweight Hourglass Network with Global Context

Haopeng Zhao,Marsha Mariya Kappan,Mahdi Bamdad,Francisco Cruz

Main category: cs.CV

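TL;DR: Proposes LAPX, a lightweight human pose estimation model based on the Hourglass network and self-attention that achieves competitive accuracy and real-time speed with only 2.3M parameters, making it suitable for edge devices.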

Details

Motivation: Existing SOTA pose estimation models are computationally expensive, and lightweight models deployed on edge devices trade efficiency against accuracy poorly. Method: Building on the earlier LAP work, LAPX introduces self-attention to capture global context, improves the stage design, and refines the lightweight attention modules to form a more efficient Hourglass network. Result: LAPX achieves competitive results on the MPII and COCO benchmarks with only 2.3M parameters and runs in real time. Conclusion: LAPX balances accuracy, speed, and model size well, which supports its suitability for efficient deployment on edge devices. Abstract: Human pose estimation is a crucial task in computer vision. Methods with state-of-the-art (SOTA) accuracy often involve a large number of parameters and incur substantial computational cost. Many lightweight variants have been proposed to reduce their model size and computational cost. However, several of these methods still contain components that are not well suited for efficient deployment on edge devices. Moreover, models that primarily emphasize inference speed on edge devices often suffer from limited accuracy due to their overly simplified designs. To address these limitations, we propose LAPX, an Hourglass network with self-attention that captures global contextual information, based on previous work, LAP. In addition to adopting the self-attention module, LAPX advances the stage design and refines the lightweight attention modules. It achieves competitive results on two benchmark datasets, MPII and COCO, with only 2.3M parameters, and demonstrates real-time performance, confirming its edge-device suitability.

[46] Collimator-assisted high-precision calibration method for event cameras

Zibin Liu,Shunkun Liang,Banglei Guan,Dongcai Tan,Yang Shang,Qifeng Yu

Main category: cs.CV

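TL;DR: Proposes an event camera calibration method based on a collimator with flickering star patterns, using a linear solution followed by nonlinear optimization to calibrate intrinsic and extrinsic parameters with high precision.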

Details

Motivation: Geometric calibration of event cameras in long-range measurement scenarios remains challenging, and existing methods struggle to meet long-distance and high-precision requirements at the same time. Method: Using a collimator with flickering star-based patterns, the camera parameters are first solved linearly from the collimator's sphere motion model and then refined by nonlinear optimization. Result: Real-world experiments under varying conditions validate the method, which outperforms existing methods in calibration accuracy and reliability. Conclusion: The method improves event camera calibration for long-range, high-precision measurement and has good application prospects. Abstract: Event cameras are a new type of brain-inspired visual sensor with advantages such as high dynamic range and high temporal resolution. The geometric calibration of event cameras, which involves determining their intrinsic and extrinsic parameters, particularly in long-range measurement scenarios, remains a significant challenge. To address the dual requirements of long-distance and high-precision measurement, we propose an event camera calibration method utilizing a collimator with flickering star-based patterns. The proposed method first linearly solves camera parameters using the sphere motion model of the collimator, followed by nonlinear optimization to refine these parameters with high precision. Through comprehensive real-world experiments across varying conditions, we demonstrate that the proposed method consistently outperforms existing event camera calibration methods in terms of accuracy and reliability.

[47] TurboDiffusion: Accelerating Video Diffusion Models by 100-200 Times

Jintao Zhang,Kaiwen Zheng,Kai Jiang,Haoxu Wang,Ion Stoica,Joseph E. Gonzalez,Jianfei Chen,Jun Zhu

Main category: cs.CV

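TL;DR: TurboDiffusion is a video generation acceleration framework that combines attention acceleration, step distillation, and W8A8 quantization to speed up end-to-end diffusion generation by 100-200x while maintaining video quality.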

Details

Motivation: Existing diffusion models are computationally expensive and slow at inference for video generation, which makes real-time use difficult; an acceleration method that is efficient and preserves quality is needed. Method: Low-bit SageAttention and trainable Sparse-Linear Attention (SLA) accelerate attention computation, rCM provides efficient step distillation, and W8A8 quantization reduces model parameters and activations to 8 bits, alongside several other engineering optimizations. Result: Experiments on several Wan-series video generation models show a 100-200x generation speedup on a single RTX 5090 GPU with video quality comparable to the original models. Conclusion: TurboDiffusion substantially improves the efficiency of video generation with diffusion models through several key techniques and offers a practical solution for high-performance video generation. Abstract: We introduce TurboDiffusion, a video generation acceleration framework that can speed up end-to-end diffusion generation by 100-200x while maintaining video quality. TurboDiffusion mainly relies on several components for acceleration: (1) Attention acceleration: TurboDiffusion uses low-bit SageAttention and trainable Sparse-Linear Attention (SLA) to speed up attention computation. (2) Step distillation: TurboDiffusion adopts rCM for efficient step distillation. (3) W8A8 quantization: TurboDiffusion quantizes model parameters and activations to 8 bits to accelerate linear layers and compress the model. In addition, TurboDiffusion incorporates several other engineering optimizations. We conduct experiments on the Wan2.2-I2V-14B-720P, Wan2.1-T2V-1.3B-480P, Wan2.1-T2V-14B-720P, and Wan2.1-T2V-14B-480P models. Experimental results show that TurboDiffusion achieves 100-200x speedup for video generation even on a single RTX 5090 GPU, while maintaining comparable video quality. The GitHub repository, which includes model checkpoints and easy-to-use code, is available at https://github.com/thu-ml/TurboDiffusion.
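
W8A8 quantization, as described above, stores both weights and activations in 8 bits so that linear layers can run on integer arithmetic. The sketch below shows the general idea with symmetric per-tensor quantization of a single dot product. It is a toy illustration under assumed choices (per-tensor scales, round-to-nearest, the helper names `quantize_int8` and `w8a8_dot`), not TurboDiffusion's kernels, whose granularity and calibration are not described here.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values) or 1.0  # avoid a zero scale
    scale = max_abs / 127.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def w8a8_dot(weights, activations):
    """Dot product with 8-bit weights (W8) and 8-bit activations (A8)."""
    q_w, scale_w = quantize_int8(weights)
    q_a, scale_a = quantize_int8(activations)
    # Accumulate in integers, then rescale once back to floating point.
    accumulator = sum(w * a for w, a in zip(q_w, q_a))
    return accumulator * scale_w * scale_a


weights = [0.5, -1.0, 0.25]
activations = [2.0, 4.0, -1.0]

exact = sum(w * a for w, a in zip(weights, activations))
approx = w8a8_dot(weights, activations)
```

The quantized result tracks the floating-point one closely while the multiply-accumulate itself uses only small integers, which is where the speed and memory savings of such schemes come from.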

[48] Flexible Camera Calibration using a Collimator System

Shunkun Liang,Banglei Guan,Zhenbao Yu,Dongcai Tan,Pengju Sun,Zibin Liu,Qifeng Yu,Yang Shang

Main category: cs.CV

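TL;DR: This paper proposes a novel camera calibration method based on a purpose-designed collimator system. An angle invariance constraint reduces the relative motion between target and camera to a spherical motion model with fewer degrees of freedom, and the paper presents closed-form linear solvers for multiple images and for two images, plus a calibration algorithm that needs only a single collimator image.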

Details

Motivation: To provide a reliable and controllable environment for camera calibration and to address the complexity and instability that high degrees of freedom cause in conventional methods. Method: The unique optical geometry of the collimator system yields an angle invariance constraint, from which the relative motion between the calibration target and the camera is proven to follow a spherical motion model. Based on this constraint, the paper proposes a closed-form linear solver for multiple images, a minimal solver for two images, and a calibration algorithm that uses a single collimator image. Result: Synthetic and real-world experiments verify the feasibility of the approach, show that it outperforms existing baseline methods, and demonstrate fast, flexible calibration without camera motion. Conclusion: By introducing the angle invariance constraint, the collimator-based calibration method substantially reduces the complexity of calibration and improves accuracy and flexibility, offering an efficient solution for practical use. Abstract: Camera calibration is a crucial step in photogrammetry and 3D vision applications. This paper introduces a novel camera calibration method using a designed collimator system. Our collimator system provides a reliable and controllable calibration environment for the camera. Exploiting the unique optical geometry property of our collimator system, we introduce an angle invariance constraint and further prove that the relative motion between the calibration target and camera conforms to a spherical motion model. This constraint reduces the original 6DOF relative motion between target and camera to a 3DOF pure rotation motion. Using the spherical motion constraint, a closed-form linear solver for multiple images and a minimal solver for two images are proposed for camera calibration. Furthermore, we propose a single collimator image calibration algorithm based on the angle invariance constraint. This algorithm eliminates the requirement for camera motion, providing a novel solution for flexible and fast calibration. The performance of our method is evaluated in both synthetic and real-world experiments, which verify the feasibility of calibration using the collimator system and demonstrate that our method is superior to existing baseline methods. Demo code is available at https://github.com/LiangSK98/CollimatorCalibration
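
One way to build intuition for the drop from 6DOF to 3DOF is the textbook pinhole fact that a collimated ray behaves like a point at infinity, whose image depends on camera rotation but not on translation. The sketch below demonstrates only that fact. It is not the authors' derivation or solver, and the intrinsics `K`, the 5 degree rotation, the zero-skew assumption, and the helper `project` are all illustrative choices.

```python
import math


def project(K, R, t, point_h):
    """Pinhole projection of a homogeneous 3D point (x, y, z, w); zero skew."""
    x, y, z, w = point_h
    cam = [R[i][0] * x + R[i][1] * y + R[i][2] * z + t[i] * w for i in range(3)]
    u = K[0][0] * cam[0] / cam[2] + K[0][2]
    v = K[1][1] * cam[1] / cam[2] + K[1][2]
    return (u, v)


# Assumed intrinsics: 800 px focal length, principal point at (320, 240).
K = [[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]]

# Rotation of 5 degrees about the y axis.
angle = math.radians(5.0)
c, s = math.cos(angle), math.sin(angle)
R = [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

t_zero = (0.0, 0.0, 0.0)
t_shift = (0.3, -0.2, 0.5)

collimated_ray = (0.1, -0.05, 1.0, 0.0)  # w = 0: a direction (point at infinity)
finite_point = (0.1, -0.05, 1.0, 1.0)    # w = 1: an ordinary nearby point

ray_a = project(K, R, t_zero, collimated_ray)
ray_b = project(K, R, t_shift, collimated_ray)
near_a = project(K, R, t_zero, finite_point)
near_b = project(K, R, t_shift, finite_point)
```

The two projections of the collimated ray coincide regardless of translation, whereas the nearby point moves by many pixels, so only the three rotational degrees of freedom remain observable for collimated targets.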

[49] Interaction-via-Actions: Cattle Interaction Detection with Joint Learning of Action-Interaction Latent Space

Ren Nakagawa,Yang Yang,Risa Shinoda,Hiroaki Santo,Kenji Oyama,Fumio Okura,Takenao Ohkawa

Main category: cs.CV

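TL;DR: This paper proposes CattleAct, a method for automatically detecting behavioral interactions between grazing cattle from a single image. It decomposes interactions into combinations of individual actions, learns a latent space from a large-scale action dataset, and fine-tunes it with contrastive learning into a unified action-interaction latent space, giving data-efficient and accurate interaction detection.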

Details

Motivation: Interactions between grazing cattle are rare and no comprehensive behavioral dataset that includes interactions exists, which makes cattle interaction detection a non-trivial challenge; existing human interaction detection methods are hard to apply directly to livestock. Method: An action latent space is first learned from a large-scale cattle action dataset. The pretrained latent space is then fine-tuned with contrastive learning to embed rare interactions, producing a unified latent space of actions and interactions that supports interaction detection from a single image. Result: Experiments on a commercial-scale pasture show that the method detects interactions between cattle more accurately than the baselines. A practical system integrating video and GPS inputs was also developed, and the code is publicly available. Conclusion: CattleAct is a data-efficient approach to cattle interaction detection that addresses the difficulty of detecting rare interaction events and has practical potential for smart livestock management, such as estrus detection. Abstract: This paper introduces a method and application for automatically detecting behavioral interactions between grazing cattle from a single image, which is essential for smart livestock management in the cattle industry, such as for detecting estrus. Although interaction detection for humans has been actively studied, a non-trivial challenge lies in cattle interaction detection, specifically the lack of a comprehensive behavioral dataset that includes interactions, as the interactions of grazing cattle are rare events. We, therefore, propose CattleAct, a data-efficient method for interaction detection by decomposing interactions into combinations of actions by individual cattle. Specifically, we first learn an action latent space from a large-scale cattle action dataset. Then, we embed rare interactions via the fine-tuning of the pre-trained latent space using contrastive learning, thereby constructing a unified latent space of actions and interactions. On top of the proposed method, we develop a practical working system integrating video and GPS inputs. Experiments on a commercial-scale pasture demonstrate the accurate interaction detection achieved by our method compared to the baselines. Our implementation is available at https://github.com/rakawanegan/CattleAct.

[50] ResDynUNet++: A nested U-Net with residual dynamic convolution blocks for dual-spectral CT

Ze Yuan,Wenbin Li,Shusen Zhao

Main category: cs.CV

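TL;DR: Proposes a dual-spectral CT reconstruction framework that combines iterative methods with deep learning, pairing a knowledge-driven module based on OPMT with a data-driven module based on ResDynUNet++ to improve reconstruction quality.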

Details

Motivation: Conventional dual-spectral CT reconstruction struggles with channel imbalance and large artifacts near interfaces, and needs more efficient decomposition and refinement. Method: OPMT quickly performs basis material decomposition to produce an intermediate result, which is then refined by ResDynUNet++, a network that uses residual dynamic convolution to strengthen feature extraction. Result: The method performs well on both synthetic phantoms and real clinical data, suppressing artifacts and improving image quality. Conclusion: The hybrid framework combines the strengths of model-driven and data-driven methods and achieves high accuracy and robustness in dual-spectral CT reconstruction. Abstract: We propose a hybrid reconstruction framework for dual-spectral CT (DSCT) that integrates iterative methods with deep learning models. The reconstruction process consists of two complementary components: a knowledge-driven module and a data-driven module. In the knowledge-driven phase, we employ the oblique projection modification technique (OPMT) to reconstruct an intermediate solution of the basis material images from the projection data. We select OPMT for this role because of its fast convergence, which allows it to rapidly generate an intermediate solution that successfully achieves basis material decomposition. Subsequently, in the data-driven phase, we introduce a novel neural network, ResDynUNet++, to refine this intermediate solution. The ResDynUNet++ is built upon a UNet++ backbone by replacing standard convolutions with residual dynamic convolution blocks, which combine the adaptive, input-specific feature extraction of dynamic convolution with the stable training of residual connections. This architecture is designed to address challenges like channel imbalance and near-interface large artifacts in DSCT, producing clean and accurate final solutions. Extensive experiments on both synthetic phantoms and real clinical datasets validate the efficacy and superior performance of the proposed method.

[51] SegGraph: Leveraging Graphs of SAM Segments for Few-Shot 3D Part Segmentation

Yueyang Hu,Haiyong Jiang,Haoxuan Song,Jun Xiao,Hao Pan

Main category: cs.CV

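TL;DR: Proposes SegGraph, a graph propagation method over SAM segments for few-shot 3D part segmentation that models spatial relationships between segments and fuses 2D foundation model features, substantially improving segmentation performance.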

Details

Motivation: When transferring knowledge from 2D foundation models to 3D segmentation, existing methods either overlook geometric structure or ignore the high-quality grouping cues provided by SAM, which leads to under-segmentation and inconsistent labels. Method: A segment graph is built with segments as nodes and spatial relationships (overlap/adjacency) as edges; a graph neural network propagates 2D foundation model features, and a view-direction-weighted fusion strategy preserves semantic consistency within each segment. Result: On PartNet-E the method outperforms all baselines by at least 6.9 percentage points of mIoU, with particularly strong results on small parts and boundary regions. Conclusion: SegGraph effectively integrates the geometric and semantic information from SAM and substantially improves the accuracy and robustness of few-shot 3D part segmentation. Abstract: This work presents a novel framework for few-shot 3D part segmentation. Recent advances have demonstrated the significant potential of 2D foundation models for low-shot 3D part segmentation. However, how to effectively aggregate 2D knowledge from foundation models into 3D remains an open problem. Existing methods either ignore geometric structures for 3D feature learning or neglect the high-quality grouping clues from SAM, leading to under-segmentation and inconsistent part labels. We devise a novel SAM segment graph-based propagation method, named SegGraph, to explicitly learn geometric features encoded within SAM's segmentation masks. Our method encodes geometric features by modeling mutual overlap and adjacency between segments while preserving intra-segment semantic consistency. We construct a segment graph, conceptually similar to an atlas, where nodes represent segments and edges capture their spatial relationships (overlap/adjacency). Each node adaptively modulates 2D foundation model features, which are then propagated via a graph neural network to learn global geometric structures. To enforce intra-segment semantic consistency, we map segment features to 3D points with a novel view-direction-weighted fusion attenuating contributions from low-quality segments. Extensive experiments on PartNet-E demonstrate that our method outperforms all competing baselines by at least 6.9 percent mIoU. Further analysis reveals that SegGraph achieves particularly strong performance on small components and part boundaries, demonstrating its superior geometric understanding. The code is available at: https://github.com/YueyangHu2000/SegGraph.
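
The segment graph described above (segments as nodes, overlap or adjacency as edges) can be sketched on toy 2D masks. This is an illustration only, not SegGraph's construction: segments are represented here as sets of pixel coordinates, adjacency is taken to mean 4-neighbour contact, and the function `build_segment_graph` and the part names are hypothetical.

```python
from itertools import combinations

NEIGHBOUR_OFFSETS = ((1, 0), (-1, 0), (0, 1), (0, -1))


def build_segment_graph(segments):
    """Return {(name_a, name_b): relation} for overlapping or touching segments.

    `segments` maps a segment name to a set of (row, col) pixels. Overlap
    takes priority over adjacency when both would apply.
    """
    edges = {}
    for (a, pix_a), (b, pix_b) in combinations(segments.items(), 2):
        if pix_a & pix_b:
            edges[(a, b)] = "overlap"
        elif any((r + dr, c + dc) in pix_b
                 for r, c in pix_a
                 for dr, dc in NEIGHBOUR_OFFSETS):
            edges[(a, b)] = "adjacency"
    return edges


# Toy masks, loosely resembling parts of a chair plus an unrelated object.
segments = {
    "seat": {(0, 0), (0, 1), (1, 0), (1, 1)},
    "back": {(1, 1), (2, 1)},   # shares pixel (1, 1) with the seat
    "leg": {(3, 1)},            # touches the back at (2, 1)
    "lamp": {(9, 9)},           # isolated, so it gets no edges
}

graph = build_segment_graph(segments)
```

In the method itself, features attached to each node would then be propagated along these edges by a graph neural network; the sketch stops at graph construction.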

[52] C-DGPA: Class-Centric Dual-Alignment Generative Prompt Adaptation

Chao Li,Dasha Hu,Chengyang Li,Yuming Jiang,Yuncheng Shen

Main category: cs.CV

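TL;DR: This paper proposes C-DGPA, a new unsupervised domain adaptation method that jointly optimizes marginal and conditional distribution alignment, substantially improving vision-language models on domain adaptation tasks and achieving state-of-the-art results on several benchmarks.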

Details

Motivation: Existing prompt-tuning methods focus mainly on marginal distribution alignment and neglect conditional distribution discrepancies, which causes class prototype misalignment and degraded semantic discriminability. A prompt learning method that handles both kinds of discrepancy is therefore needed. Method: Propose C-DGPA, a dual-branch architecture: one branch uses dynamic adversarial training for marginal distribution alignment; the other introduces a Class Mapping Mechanism (CMM) for conditional distribution alignment, standardizing semantic understanding and preventing over-reliance on the source domain. Result: In extensive experiments on OfficeHome, Office31, and VisDA-2017, C-DGPA achieves new state-of-the-art performance on all benchmarks. Conclusion: By synergistically optimizing marginal and conditional distribution alignment, C-DGPA effectively integrates domain knowledge into prompt learning, produces domain-invariant and semantically discriminative representations, and markedly improves VLM performance under unsupervised domain adaptation. Abstract: Unsupervised Domain Adaptation transfers knowledge from a labeled source domain to an unlabeled target domain. Directly deploying Vision-Language Models (VLMs) with prompt tuning in downstream UDA tasks faces the significant challenge of mitigating domain discrepancies. Existing prompt-tuning strategies primarily align marginal distribution, but neglect conditional distribution discrepancies, leading to critical issues such as class prototype misalignment and degraded semantic discriminability. To address these limitations, the work proposes C-DGPA: Class-Centric Dual-Alignment Generative Prompt Adaptation. C-DGPA synergistically optimizes marginal distribution alignment and conditional distribution alignment through a novel dual-branch architecture. The marginal distribution alignment branch employs a dynamic adversarial training framework to bridge marginal distribution discrepancies. Simultaneously, the conditional distribution alignment branch introduces a Class Mapping Mechanism (CMM) to align conditional distribution discrepancies by standardizing semantic prompt understanding and preventing source domain over-reliance. This dual alignment strategy effectively integrates domain knowledge into prompt learning via synergistic optimization, ensuring domain-invariant and semantically discriminative representations. Extensive experiments on OfficeHome, Office31, and VisDA-2017 validate the superiority of C-DGPA. It achieves new state-of-the-art results on all benchmarks.

[53] Towards Closing the Domain Gap with Event Cameras

M. Oltan Sevinc,Liao Wu,Francisco Cruz

Main category: cs.CV

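TL;DR: This paper explores event cameras as an alternative to traditional cameras for addressing the performance drop caused by domain gaps across lighting conditions (such as day and night). The results show that event cameras perform more consistently across lighting conditions and provide better baseline performance than grayscale frames in cross-domain scenarios.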

Details

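Motivation: Traditional cameras suffer a significant performance drop when the training data does not match the deployment environment, especially across a domain gap such as day-night lighting changes. A sensor that maintains good performance under varied lighting conditions is therefore needed. Method: Propose using event cameras instead of traditional cameras for end-to-end driving, and compare the two experimentally under different lighting conditions. Result: Event cameras perform more consistently across lighting conditions; their domain-shift penalties are generally comparable to or smaller than those of grayscale frames, and they provide better baseline performance in cross-domain scenarios. Conclusion: Event cameras are a promising way to mitigate the domain gap caused by lighting changes and can give autonomous driving systems more reliable perception. Abstract: Although traditional cameras are the primary sensor for end-to-end driving, their performance suffers greatly when the conditions of the data they were trained on do not match the deployment environment, a problem known as the domain gap. In this work, we consider the day-night lighting difference domain gap. Instead of traditional cameras, we propose event cameras as a potential alternative which can maintain performance across lighting condition domain gaps without requiring additional adjustments. Our results show that event cameras maintain more consistent performance across lighting conditions, exhibiting domain-shift penalties that are generally comparable to or smaller than grayscale frames and providing superior baseline performance in cross-domain scenarios.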

[54] Avatar4D: Synthesizing Domain-Specific 4D Humans for Real-World Pose Estimation

Jerrin Bright,Zhibo Wang,Dmytro Klepachevskyi,Yuhao Chen,Sirisha Rambhatla,David Clausi,John Zelek

Main category: cs.CV

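TL;DR: Avatar4D is a transferable pipeline for generating customizable synthetic human motion datasets for domain-specific applications. It offers fine-grained control over body pose, appearance, viewpoint, and environment without manual annotation, and its effectiveness for pose estimation, zero-shot transfer, and cross-sport generalization is validated with the sports dataset Syn2Sport.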

Details

Motivation: Existing methods focus on general, everyday motions, lack support for the complex movement patterns of specific domains such as sports, and offer limited flexibility and controllability, so they fall short of practical needs. Method: Propose Avatar4D, a high-fidelity 4D human motion synthesis pipeline that requires no manual annotation and supports fine-grained control over body pose, appearance, camera viewpoint, and environment; build Syn2Sport, a large-scale synthetic sports dataset, to evaluate models under supervised learning and zero-shot transfer. Result: Several state-of-the-art pose estimation models are benchmarked on Syn2Sport, demonstrating effectiveness for supervised learning, zero-shot transfer to real-world data, and generalization across sports; the closeness of synthetic and real data in feature space is also evaluated. Conclusion: Avatar4D can generate scalable, controllable, and transferable domain-specific human motion datasets without relying on real data from the target domain, offering a new data generation paradigm for specific tasks. Abstract: We present Avatar4D, a real-world transferable pipeline for generating customizable synthetic human motion datasets tailored to domain-specific applications. Unlike prior works, which focus on general, everyday motions and offer limited flexibility, our approach provides fine-grained control over body pose, appearance, camera viewpoint, and environmental context, without requiring any manual annotations. To validate the impact of Avatar4D, we focus on sports, where domain-specific human actions and movement patterns pose unique challenges for motion understanding. In this setting, we introduce Syn2Sport, a large-scale synthetic dataset spanning sports, including baseball and ice hockey. Avatar4D features high-fidelity 4D (3D geometry over time) human motion sequences with varying player appearances rendered in diverse environments. We benchmark several state-of-the-art pose estimation models on Syn2Sport and demonstrate their effectiveness for supervised learning, zero-shot transfer to real-world data, and generalization across sports. Furthermore, we evaluate how closely the generated synthetic data aligns with real-world datasets in feature space. Our results highlight the potential of such systems to generate scalable, controllable, and transferable human datasets for diverse domain-specific tasks without relying on domain-specific real data.

[55] Visual Alignment of Medical Vision-Language Models for Grounded Radiology Report Generation

Sarosij Bose,Ravi K. Rajendran,Biplob Debnath,Konstantinos Karydis,Amit K. Roy-Chowdhury,Srimat Chakradhar

Main category: cs.CV

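TL;DR: This paper proposes VALOR, a new method that uses reinforcement learning and a two-stage alignment framework to improve the visual grounding and clinical accuracy of medical vision-language models in radiology report generation.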

Details

Motivation: Existing radiology report generation methods suffer from insufficient cross-modal alignment, which leads to hallucinated content and makes clinical accuracy and visual relevance hard to guarantee. Method: Propose VALOR, a reinforcement learning-based post-alignment framework built on Group-Relative Proximal Optimization (GRPO) that runs in two stages: the first uses textual rewards to improve the clinical precision of terminology, and the second aligns the vision projection module with disease findings to strengthen attention to key image regions. Result: Experiments on multiple benchmarks show that VALOR substantially improves factual accuracy and visual grounding and outperforms current state-of-the-art report generation methods. Conclusion: VALOR effectively addresses cross-modal alignment in medical vision-language models for radiology report generation, yielding reports that are more accurate and visually interpretable. Abstract: Radiology Report Generation (RRG) is a critical step toward automating healthcare workflows, facilitating accurate patient assessments, and reducing the workload of medical professionals. Despite recent progress in Large Medical Vision-Language Models (Med-VLMs), generating radiology reports that are both visually grounded and clinically accurate remains a significant challenge. Existing approaches often rely on large labeled corpora for pre-training, costly task-specific preference data, or retrieval-based methods. However, these strategies do not adequately mitigate hallucinations arising from poor cross-modal alignment between visual and linguistic representations. To address these limitations, we propose VALOR: Visual Alignment of Medical Vision-Language Models for GrOunded Radiology Report Generation. Our method introduces a reinforcement learning-based post-alignment framework utilizing Group-Relative Proximal Optimization (GRPO). The training proceeds in two stages: (1) improving the Med-VLM with textual rewards to encourage clinically precise terminology, and (2) aligning the vision projection module of the textually grounded model with disease findings, thereby guiding attention toward image regions most relevant to the diagnostic task. Extensive experiments on multiple benchmarks demonstrate that VALOR substantially improves factual accuracy and visual grounding, achieving significant performance gains over state-of-the-art report generation methods.

[56] Open Ad-hoc Categorization with Contextualized Feature Learning

Zilin Wang,Sangwoo Mo,Stella X. Yu,Sima Behpour,Liu Ren

Main category: cs.CV

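TL;DR: This paper proposes OAK, a new model for open ad-hoc categorization of visual scenes. By combining the strengths of CLIP and GCD, it achieves state-of-the-art performance on multiple datasets and produces interpretable saliency maps.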

Details

Motivation: Existing categorization methods struggle with dynamically changing task requirements, whereas ad-hoc categories must be built on the fly to serve specific goals; a method that adaptively discovers the semantic context and expands categories is therefore needed. Method: OAK introduces a small set of learnable context tokens at the input of a frozen CLIP and jointly optimizes CLIP's image-text alignment objective and GCD's visual clustering objective, expanding ad-hoc categories through semantic extension and visual clustering. Result: On the Stanford and Clevr-4 datasets, OAK reaches state-of-the-art results across multiple categorizations, for example 87.4% novel accuracy on Stanford Mood, surpassing CLIP and GCD by over 50%, and produces interpretable saliency maps that focus on hands (Action), faces (Mood), and backgrounds (Location). Conclusion: OAK effectively combines general perceptual mechanisms with the needs of ad-hoc categorization, delivering efficient, interpretable, and generalizable open ad-hoc visual categorization and offering a new approach for AI agents adapting to changing tasks. Abstract: Adaptive categorization of visual scenes is essential for AI agents to handle changing tasks. Unlike fixed common categories for plants or animals, ad-hoc categories are created dynamically to serve specific goals. We study open ad-hoc categorization: Given a few labeled exemplars and abundant unlabeled data, the goal is to discover the underlying context and to expand ad-hoc categories through semantic extension and visual clustering around it. Building on the insight that ad-hoc and common categories rely on similar perceptual mechanisms, we propose OAK, a simple model that introduces a small set of learnable context tokens at the input of a frozen CLIP and optimizes with both CLIP's image-text alignment objective and GCD's visual clustering objective. On Stanford and Clevr-4 datasets, OAK achieves state-of-the-art in accuracy and concept discovery across multiple categorizations, including 87.4% novel accuracy on Stanford Mood, surpassing CLIP and GCD by over 50%. Moreover, OAK produces interpretable saliency maps, focusing on hands for Action, faces for Mood, and backgrounds for Location, promoting transparency and trust while enabling adaptive and generalizable categorization.

[57] Enhanced 3D Shape Analysis via Information Geometry

Amit Vishwakarma,K. S. Subrahamanian Moosath

Main category: cs.CV

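TL;DR: Proposes an information-geometric framework for 3D point cloud shape analysis that represents point clouds as Gaussian Mixture Models (GMMs) and introduces a Modified Symmetric KL (MSKL) divergence with theoretical upper and lower bounds, yielding a stable distance measure that reflects geometric variation.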

Details

Motivation: Traditional geometric metrics such as Hausdorff and Chamfer distances fail to capture global statistical structure and are sensitive to outliers, while existing KL divergence approximations can be unbounded or numerically unstable. Method: Model point clouds as Gaussian Mixture Models (GMMs) on a statistical manifold, construct the Modified Symmetric KL (MSKL) divergence, and prove its boundedness and numerical stability. Result: Experiments on the MPI-FAUST and G-PCD datasets show that MSKL stably reflects geometric variation and outperforms traditional distances and existing KL approximations on human pose discrimination and animal shape comparison. Conclusion: MSKL provides a robust, stable measure for comparing point clouds, suited to applications that require precise shape analysis. Abstract: Three-dimensional point clouds provide highly accurate digital representations of objects, essential for applications in computer graphics, photogrammetry, computer vision, and robotics. However, comparing point clouds faces significant challenges due to their unstructured nature and the complex geometry of the surfaces they represent. Traditional geometric metrics such as Hausdorff and Chamfer distances often fail to capture global statistical structure and exhibit sensitivity to outliers, while existing Kullback-Leibler (KL) divergence approximations for Gaussian Mixture Models can produce unbounded or numerically unstable values. This paper introduces an information geometric framework for 3D point cloud shape analysis by representing point clouds as Gaussian Mixture Models (GMMs) on a statistical manifold. We prove that the space of GMMs forms a statistical manifold and propose the Modified Symmetric Kullback-Leibler (MSKL) divergence with theoretically guaranteed upper and lower bounds, ensuring numerical stability for all GMM comparisons. Through comprehensive experiments on human pose discrimination (MPI-FAUST dataset) and animal shape comparison (G-PCD dataset), we demonstrate that MSKL provides stable and monotonically varying values that directly reflect geometric variation, outperforming traditional distances and existing KL approximations.

[58] Learning High-Quality Initial Noise for Single-View Synthesis with Diffusion Models

Zhihao Zhang,Xuejun Yang,Weihua Liu,Mouquan Shen

Main category: cs.CV

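TL;DR: This paper proposes a learning framework based on an encoder-decoder network (EDN) that transforms random noise into high-quality initial noise, improving single-view diffusion models for novel view synthesis.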

Details

Motivation: Existing diffusion models for single-view novel view synthesis lack a dedicated framework for learning high-quality initial noise, which limits generation quality. Method: Design a discretized Euler inversion method to construct paired data of random and high-quality noise, and propose a noise transformation framework based on an encoder-decoder network. Result: The proposed EDN plugs seamlessly into models such as SV3D and MV-Adapter and significantly improves performance across multiple datasets. Conclusion: Learning to generate high-quality initial noise effectively improves novel view synthesis with single-view diffusion models, confirming the importance of noise optimization. Abstract: Single-view novel view synthesis (NVS) models based on diffusion models have recently attracted increasing attention, as they can generate a series of novel view images from a single image prompt and camera pose information as conditions. It has been observed that in diffusion models, certain high-quality initial noise patterns lead to better generation results than others. However, there remains a lack of dedicated learning frameworks that enable NVS models to learn such high-quality noise. To obtain high-quality initial noise from random Gaussian noise, we make the following contributions. First, we design a discretized Euler inversion method to inject image semantic information into random noise, thereby constructing paired datasets of random and high-quality noise. Second, we propose a learning framework based on an encoder-decoder network (EDN) that directly transforms random noise into high-quality noise. Experiments demonstrate that the proposed EDN can be seamlessly plugged into various NVS models, such as SV3D and MV-Adapter, achieving significant performance improvements across multiple datasets. Code is available at: https://github.com/zhihao0512/EDN.

[59] Image Compression Using Singular Value Decomposition

Justin Jiang

Main category: cs.CV

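TL;DR: This study examines image compression with Singular Value Decomposition (SVD) and low-rank matrix approximation. Although reconstructed images look similar to the originals, compression efficiency is consistently worse than standard formats such as JPEG, JPEG2000, and WEBP, and at low error tolerances the compressed representation can even be larger than the original image.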

Details

Motivation: Efficient image compression is essential for reducing storage and bandwidth demands. The study aims to assess the potential of SVD for image compression. Method: Compress grayscale and multichannel images with Singular Value Decomposition and low-rank matrix approximation, and evaluate performance using relative Frobenius-norm error and compression ratio. Result: Low-rank approximations produce images that are visually close to the originals, but at the same error level compression efficiency is lower than JPEG, JPEG2000, and WEBP; at low error tolerances, the SVD-compressed data can exceed the size of the original image. Conclusion: SVD is not competitive with industry-standard codecs for practical image compression and is unsuitable for efficient compression scenarios. Abstract: Images are a substantial portion of the internet, making efficient compression important for reducing storage and bandwidth demands. This study investigates the use of Singular Value Decomposition and low-rank matrix approximations for image compression, evaluating performance using relative Frobenius error and compression ratio. The approach is applied to both grayscale and multichannel images to assess its generality. Results show that the low-rank approximations often produce images that appear visually similar to the originals, but the compression efficiency remains consistently worse than established formats such as JPEG, JPEG2000, and WEBP at comparable error levels. At low tolerated error levels, the compressed representation produced by Singular Value Decomposition can even exceed the size of the original image, indicating that this method is not competitive with industry-standard codecs for practical image compression.
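A short NumPy sketch shows how a rank-k SVD approximation and the two metrics named above could be computed, and why the stored representation can end up larger than the image itself. The function name `svd_compress` and the storage model (counting k(m + n + 1) stored numbers at the same precision as the m x n original) are assumptions for illustration; the paper's exact compression-ratio definition is not given in the abstract.

```python
import numpy as np

def svd_compress(image, rank):
    """Rank-`rank` approximation of a 2D array, with error and size metrics."""
    u, s, vt = np.linalg.svd(image, full_matrices=False)
    # Keep only the leading `rank` singular triplets.
    approx = (u[:, :rank] * s[:rank]) @ vt[:rank, :]
    # Relative Frobenius error (np.linalg.norm defaults to Frobenius for 2D input).
    rel_error = np.linalg.norm(image - approx) / np.linalg.norm(image)
    m, n = image.shape
    # Assumed storage model: rank columns of U, rank rows of V^T, rank singular values.
    stored = rank * (m + n + 1)
    ratio = (m * n) / stored
    return approx, rel_error, ratio

# Synthetic stand-in for a grayscale image (random data, not a real photograph).
rng = np.random.default_rng(0)
image = rng.random((64, 48))
for k in (4, 16, 48):
    _, err, ratio = svd_compress(image, k)
    print(f"rank={k:2d}  relative error={err:.3f}  compression ratio={ratio:.2f}")
```

Under this storage model the ratio falls below 1 once k exceeds mn / (m + n + 1), which is one way to read the abstract's observation that low error tolerances can make the SVD representation larger than the original.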

[60] ARMFlow: AutoRegressive MeanFlow for Online 3D Human Reaction Generation

Zichen Geng,Zeeshan Hayder,Wei Liu,Hesheng Wang,Ajmal Mian

Main category: cs.CV

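TL;DR: This paper proposes ARMFlow, a MeanFlow-based autoregressive framework for 3D human reaction generation that addresses the challenges of high fidelity, real-time inference, and autoregressive adaptability for online use.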

Details

Motivation: Existing methods cannot simultaneously meet three key requirements: high motion fidelity, real-time inference, and autoregressive adaptability in online scenarios. Method: Propose ARMFlow, which comprises a causal context encoder and an MLP velocity predictor, and introduce the Bootstrap Contextual Encoding (BSCE) training strategy to reduce error accumulation; an offline variant, ReMFlow, is also proposed. Result: On the InterHuman and InterX datasets, ARMFlow improves FID over existing online methods by more than 40%, offers fast single-step inference, and matches state-of-the-art offline methods. Conclusion: ARMFlow achieves high-accuracy 3D reaction generation while keeping latency low, with BSCE effectively mitigating error accumulation in autoregressive generation, making it suitable for online applications. Abstract: 3D human reaction generation faces three main challenges: (1) high motion fidelity, (2) real-time inference, and (3) autoregressive adaptability for online scenarios. Existing methods fail to meet all three simultaneously. We propose ARMFlow, a MeanFlow-based autoregressive framework that models temporal dependencies between actor and reactor motions. It consists of a causal context encoder and an MLP-based velocity predictor. We introduce Bootstrap Contextual Encoding (BSCE) in training, encoding generated history instead of the ground-truth one, to alleviate error accumulation in autoregressive generation. We further introduce the offline variant ReMFlow, achieving state-of-the-art performance with the fastest inference among offline methods. Our ARMFlow addresses key limitations of online settings by: (1) enhancing semantic alignment via a global contextual encoder; (2) achieving high accuracy and low latency in a single-step inference; and (3) reducing accumulated errors through BSCE. Our single-step online generation surpasses existing online methods on InterHuman and InterX by over 40% in FID, while matching offline state-of-the-art performance despite using only partial sequence conditions.

[61] AI-Powered Dermatological Diagnosis: From Interpretable Models to Clinical Implementation A Comprehensive Framework for Accessible and Trustworthy Skin Disease Detection

Satya Narayana Panda,Vaishnavi Kukkala,Spandana Iyer

Main category: cs.CV

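TL;DR: This study proposes a multi-modal AI framework that combines family history with clinical imaging to improve the accuracy of dermatological diagnosis, particularly for hereditary skin conditions.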

Details

Motivation: Dermatological diagnosis is challenging because specialists are scarce and clinical presentations are complex, and family history, though important, is often underused; an AI-assisted diagnostic approach that integrates multiple data sources is therefore needed. Method: Develop a multi-modal AI system that combines interpretable convolutional neural networks with clinical decision trees, fusing image analysis with structured clinical data including family history; validation was carried out with healthcare professionals, and prospective clinical trials are proposed as future work. Result: Incorporating family history improves the system's diagnostic accuracy for hereditary skin conditions such as melanoma, psoriasis, and atopic dermatitis, and expert feedback indicates potential for earlier detection and more personalized recommendations. Conclusion: The multi-modal AI framework can improve dermatological diagnostic performance, offers clinical interpretability and practical potential, and supports future clinical trials and real-world deployment. Abstract: Dermatological conditions affect 1.9 billion people globally, yet accurate diagnosis remains challenging due to limited specialist availability and complex clinical presentations. Family history significantly influences skin disease susceptibility and treatment responses, but is often underutilized in diagnostic processes. This research addresses the critical question: How can AI-powered systems integrate family history data with clinical imaging to enhance dermatological diagnosis while supporting clinical trial validation and real-world implementation? We developed a comprehensive multi-modal AI framework that combines deep learning-based image analysis with structured clinical data, including detailed family history patterns. Our approach employs interpretable convolutional neural networks integrated with clinical decision trees that incorporate hereditary risk factors. The methodology includes prospective clinical trials across diverse healthcare settings to validate AI-assisted diagnosis against traditional clinical assessment. In this work, validation was conducted with healthcare professionals to assess AI-assisted outputs against clinical expectations; prospective clinical trials across diverse healthcare settings are proposed as future work. The integrated AI system demonstrates enhanced diagnostic accuracy when family history data is incorporated, particularly for hereditary skin conditions such as melanoma, psoriasis, and atopic dermatitis. Expert feedback indicates potential for improved early detection and more personalized recommendations; formal clinical trials are planned. The framework is designed for integration into clinical workflows while maintaining interpretability through explainable AI mechanisms.

[62] Semi-Supervised Multi-View Crowd Counting by Ranking Multi-View Fusion Models

Qi Zhang,Yunfei Gong,Zhidan Xie,Zhizi Wang,Antoni B. Chan,Hui Huang

Main category: cs.CV

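TL;DR: This paper proposes two semi-supervised multi-view crowd counting frameworks that address limited labeled data by ranking multi-view fusion models according to either model predictions or model uncertainties.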

Details

Motivation: Because multi-view images are hard to collect and annotate, existing datasets contain a limited number of multi-view frames and scenes, so methods that depend less on large amounts of labeled data are needed. Method: The first method ranks model predictions obtained with different numbers of input views, requiring that predictions from fewer views do not exceed predictions from more views; the second ranks model uncertainties, guided by prediction errors, requiring that uncertainty with more views does not exceed uncertainty with fewer views. Both constraints are introduced into training in a semi-supervised manner. Result: Experiments show that the proposed multi-view model ranking methods outperform other semi-supervised counting methods on multi-view crowd counting. Conclusion: By introducing prediction-based or uncertainty-based ranking constraints, the proposed semi-supervised frameworks effectively improve multi-view crowd counting when labeled data is limited. Abstract: Multi-view crowd counting has been proposed to deal with the severe occlusion issue of crowd counting in large and wide scenes. However, due to the difficulty of collecting and annotating multi-view images, the datasets for multi-view counting have a limited number of multi-view frames and scenes. To solve the problem of limited data, one approach is to collect synthetic data to bypass the annotating step, while another is to propose semi- or weakly-supervised or unsupervised methods that demand less multi-view data. In this paper, we propose two semi-supervised multi-view crowd counting frameworks by ranking the multi-view fusion models of different numbers of input views, in terms of the model predictions or the model uncertainties. Specifically, for the first method (vanilla model), we rank the multi-view fusion models' prediction results of different numbers of camera-view inputs, namely, the model's predictions with fewer camera views shall not be larger than the predictions with more camera views. For the second method, we rank the estimated model uncertainties of the multi-view fusion models with a variable number of view inputs, guided by the multi-view fusion models' prediction errors, namely, the model uncertainties with more camera views shall not be larger than those with fewer camera views. These constraints are introduced into the model training in a semi-supervised fashion for multi-view counting with limited labeled data. The experiments demonstrate the advantages of the proposed multi-view model ranking methods compared with other semi-supervised counting methods.
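The two ranking constraints in this abstract can be written as simple penalty terms. The sketch below uses a hinge (max(0, ·)) form, which is an assumption: the abstract states only the ordering that must hold, not the loss the authors actually use. The function names are hypothetical.

```python
def count_ranking_loss(counts_by_views):
    """Penalize predictions from fewer views that exceed those from more views.

    counts_by_views: predicted crowd counts ordered by increasing number of
    input camera views (for example 1 view, 2 views, 3 views).
    """
    loss = 0.0
    for fewer, more in zip(counts_by_views, counts_by_views[1:]):
        # Constraint from the abstract: count(fewer views) <= count(more views).
        loss += max(0.0, fewer - more)
    return loss

def uncertainty_ranking_loss(uncertainties_by_views):
    """Penalize uncertainty that grows when more camera views are added.

    uncertainties_by_views: estimated model uncertainties ordered by
    increasing number of input camera views.
    """
    loss = 0.0
    for fewer, more in zip(uncertainties_by_views, uncertainties_by_views[1:]):
        # Constraint from the abstract: uncertainty(more views) <= uncertainty(fewer views).
        loss += max(0.0, more - fewer)
    return loss

# A consistent ordering incurs no penalty; a violated one is penalized by the gap.
print(count_ranking_loss([10.0, 12.0, 15.0]))
print(count_ranking_loss([12.0, 10.0, 15.0]))
print(uncertainty_ranking_loss([3.0, 2.0, 1.0]))
```

Because neither term needs a ground-truth count, penalties of this kind can be evaluated on unlabeled frames, which is what makes the setting semi-supervised.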

[63] Pixel Super-Resolved Fluorescence Lifetime Imaging Using Deep Learning

Paloma Casteleiro Costa,Parnian Ghapandar Kashani,Xuhui Liu,Alexander Chen,Ary Portes,Julien Bec,Laura Marcu,Aydogan Ozcan

Main category: cs.CV

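TL;DR: This paper proposes FLIM_PSR_k, a deep learning-based multi-channel pixel super-resolution framework that reconstructs high-resolution fluorescence lifetime imaging (FLIM) images from data acquired with large pixel sizes. It achieves 5-fold super-resolution, significantly improves imaging speed and space-bandwidth product, and advances the use of FLIM in clinical diagnostics.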

Details

Motivation: Fluorescence lifetime imaging (FLIM) offers molecular and metabolic contrast, but long pixel dwell times and low signal-to-noise ratio impose a severe trade-off between resolution and imaging speed, limiting clinical use. A technique that overcomes this bottleneck is therefore needed. Method: Propose FLIM_PSR_k, a multi-channel pixel super-resolution (PSR) framework based on a conditional generative adversarial network (cGAN) that reconstructs high-resolution FLIM images from large-pixel input data, and verify its performance through blind testing. Result: On patient-derived tumor tissue samples, the method achieves a super-resolution factor of k=5, a 25-fold increase in the space-bandwidth product of the output images, recovery of fine structures lost in the low-resolution inputs, and significant improvements across image quality metrics. Compared with diffusion models, inference is faster and reconstruction is more robust. Conclusion: By raising the effective spatial resolution of FLIM, FLIM_PSR_k moves fluorescence lifetime imaging toward faster, higher-resolution, and hardware-flexible implementations, especially for low-numerical-aperture and miniaturized platforms, and strengthens its potential in translational medicine. Abstract: Fluorescence lifetime imaging microscopy (FLIM) is a powerful quantitative technique that provides metabolic and molecular contrast, offering strong translational potential for label-free, real-time diagnostics. However, its clinical adoption remains limited by long pixel dwell times and low signal-to-noise ratio (SNR), which impose a stricter resolution-speed trade-off than conventional optical imaging approaches. Here, we introduce FLIM_PSR_k, a deep learning-based multi-channel pixel super-resolution (PSR) framework that reconstructs high-resolution FLIM images from data acquired with up to a 5-fold increased pixel size. The model is trained using the conditional generative adversarial network (cGAN) framework, which, compared to diffusion model-based alternatives, delivers a more robust PSR reconstruction with substantially shorter inference times, a crucial advantage for practical deployment. FLIM_PSR_k not only enables faster image acquisition but can also alleviate SNR limitations in autofluorescence-based FLIM. Blind testing on held-out patient-derived tumor tissue samples demonstrates that FLIM_PSR_k reliably achieves a super-resolution factor of k = 5, resulting in a 25-fold increase in the space-bandwidth product of the output images and revealing fine architectural features lost in lower-resolution inputs, with statistically significant improvements across various image quality metrics. By increasing FLIM's effective spatial resolution, FLIM_PSR_k advances lifetime imaging toward faster, higher-resolution, and hardware-flexible implementations compatible with low-numerical-aperture and miniaturized platforms, better positioning FLIM for translational applications.
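The 25-fold figure follows from the super-resolution factor if the space-bandwidth product (SBP) is taken to be proportional to the number of resolvable pixels in the field of view. The symbols below (field-of-view dimensions L_x and L_y, input pixel size p) are introduced here for illustration and do not appear in the abstract; the relation is a sketch under that assumption.

```latex
\mathrm{SBP} \propto \frac{L_x L_y}{p^2}, \qquad
\frac{\mathrm{SBP}_{\text{out}}}{\mathrm{SBP}_{\text{in}}}
  = \frac{L_x L_y / (p/k)^2}{L_x L_y / p^2}
  = k^2 = 5^2 = 25 .
```

Reducing the effective pixel size by a factor of k in each of the two lateral dimensions therefore multiplies the pixel count, and with it the SBP, by k squared.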

[64] TextEditBench: Evaluating Reasoning-aware Text Editing Beyond Rendering

Rui Gui,Yang Wan,Haochen Han,Dongxing Mao,Fangming Liu,Min Li,Alex Jinpeng Wang

Main category: cs.CV

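TL;DR: This paper proposes TextEditBench, a comprehensive evaluation benchmark focused on editing text regions in images. It introduces a Semantic Expectation (SE) dimension to measure a model's reasoning ability in text editing and reveals the shortcomings of existing models in contextual reasoning, physical consistency, and layout awareness.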

Details

Motivation: Text editing within images remains underexplored: it requires generating legible text while preserving semantic, geometric, and contextual coherence, and existing methods cannot evaluate complex reasoning scenarios. Method: Propose the TextEditBench evaluation benchmark, which focuses on text-centric regions in images, designs reasoning-intensive editing tasks that require understanding physical plausibility, linguistic meaning, and cross-modal dependencies, and introduces a new evaluation dimension, Semantic Expectation (SE). Result: Extensive experiments on state-of-the-art editing systems show that current models can follow simple textual instructions but still perform poorly on context-dependent reasoning, physical consistency, and layout-aware integration. Conclusion: TextEditBench provides a new testbed for advancing text-guided image editing and reasoning in multimodal generation, and highlights the need for future work on semantic consistency and editing in complex scenes. Abstract: Text rendering has recently emerged as one of the most challenging frontiers in visual generation, drawing significant attention from large-scale diffusion and multimodal models. However, text editing within images remains largely unexplored, as it requires generating legible characters while preserving semantic, geometric, and contextual coherence. To fill this gap, we introduce TextEditBench, a comprehensive evaluation benchmark that explicitly focuses on text-centric regions in images. Beyond basic pixel manipulations, our benchmark emphasizes reasoning-intensive editing scenarios that require models to understand physical plausibility, linguistic meaning, and cross-modal dependencies. We further propose a novel evaluation dimension, Semantic Expectation (SE), which measures a model's reasoning ability to maintain semantic consistency, contextual coherence, and cross-modal alignment during text editing. Extensive experiments on state-of-the-art editing systems reveal that while current models can follow simple textual instructions, they still struggle with context-dependent reasoning, physical consistency, and layout-aware integration. By focusing evaluation on this long-overlooked yet fundamental capability, TextEditBench establishes a new testing ground for advancing text-guided image editing and reasoning in multimodal generation.

[65] GFLAN: Generative Functional Layouts

Mohamed Abouagour,Eleftherios Garyfallidis

Main category: cs.CV

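TL;DR: This paper proposes GFLAN, a generative framework that decomposes floor plan generation into two stages, topological planning and geometric realization, to address the failure of existing methods to capture architectural reasoning.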

Details

Motivation: Existing deep learning methods struggle to capture core reasoning in architectural design, such as the precedence of topology, the propagation of functional constraints, and the emergence of circulation patterns. Method: Use a two-stage approach: the first stage generates room centroids with a dual-encoder convolutional network; the second builds a heterogeneous graph linking rooms to boundary vertices and regresses room boundaries with a Transformer-augmented graph neural network. Result: Given a building's exterior boundary and entrance location, the method effectively generates floor layouts consistent with functional and topological relationships. Conclusion: By explicitly decoupling the topological and geometric steps, GFLAN improves the plausibility and controllability of generated floor plans and moves automated design closer to the logic of real architectural practice. Abstract: Automated floor plan generation lies at the intersection of combinatorial search, geometric constraint satisfaction, and functional design requirements -- a confluence that has historically resisted a unified computational treatment. While recent deep learning approaches have improved the state of the art, they often struggle to capture architectural reasoning: the precedence of topological relationships over geometric instantiation, the propagation of functional constraints through adjacency networks, and the emergence of circulation patterns from local connectivity decisions. To address these fundamental challenges, this paper introduces GFLAN, a generative framework that restructures floor plan synthesis through explicit factorization into topological planning and geometric realization. Given a single exterior boundary and a front-door location, our approach departs from direct pixel-to-pixel or wall-tracing generation in favor of a principled two-stage decomposition. Stage A employs a specialized convolutional architecture with dual encoders -- separating invariant spatial context from evolving layout state -- to sequentially allocate room centroids within the building envelope via discrete probability maps over feasible placements. Stage B constructs a heterogeneous graph linking room nodes to boundary vertices, then applies a Transformer-augmented graph neural network (GNN) that jointly regresses room boundaries.

[66] MACL: Multi-Label Adaptive Contrastive Learning Loss for Remote Sensing Image Retrieval

Amna Amir,Erchan Aptoula

Main category: cs.CV

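TL;DR: This paper proposes Multi-Label Adaptive Contrastive Learning (MACL) to address semantic overlap, imbalanced label distributions, and complex class co-occurrence in remote sensing image retrieval.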

Details

Motivation: Semantic overlap among land-cover categories, highly imbalanced label distributions, and complex multi-label co-occurrence patterns limit existing methods in multi-label remote sensing image retrieval. Method: MACL combines label-aware sampling, frequency-sensitive weighting, and dynamic temperature scaling to balance representation learning between common and rare categories. Result: Experiments on three benchmark datasets, DLRSD, ML-AID, and WHDLD, show that MACL consistently outperforms contrastive-loss baselines. Conclusion: MACL effectively mitigates semantic imbalance and improves retrieval reliability in large-scale remote sensing image archives. Abstract: Semantic overlap among land-cover categories, highly imbalanced label distributions, and complex inter-class co-occurrence patterns constitute significant challenges for multi-label remote-sensing image retrieval. In this article, Multi-Label Adaptive Contrastive Learning (MACL) is introduced as an extension of contrastive learning to address them. It integrates label-aware sampling, frequency-sensitive weighting, and dynamic-temperature scaling to achieve balanced representation learning across both common and rare categories. Extensive experiments on three benchmark datasets (DLRSD, ML-AID, and WHDLD) show that MACL consistently outperforms contrastive-loss based baselines, effectively mitigating semantic imbalance and delivering more reliable retrieval performance in large-scale remote-sensing archives. Code, pretrained models, and evaluation scripts will be released at https://github.com/amna/MACL upon acceptance.

[67] PixelArena: A benchmark for Pixel-Precision Visual Intelligence

Feng Liang,Sizhe Cheng,Chenqi Yi

Main category: cs.CV

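TL;DR: Proposes PixelArena, which uses semantic segmentation tasks to evaluate the fine-grained generation capabilities of multi-modal large models, and finds that Gemini 3 Pro can generate high-fidelity semantic masks zero-shot, showing a level of visual intelligence not seen before.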

Details

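Motivation: Existing image generation benchmarks mostly focus on aesthetics and lack an objective assessment of fine-grained generation capability, so a new way to measure the generative intelligence of multi-modal models precisely is needed. Method: Introduce PixelArena, an evaluation framework based on semantic segmentation tasks that analyzes zero-shot generation with pixel-level precision and compares models qualitatively and quantitatively. Result: Gemini 3 Pro generates high-fidelity semantic masks and shows strong zero-shot generalization, outperforming other models; the study also documents its success cases and failure modes. Conclusion: The study signals significant progress in the fine-grained visual intelligence of multi-modal generative models and suggests directions for future research on multimodality, reasoning, interpretability, and benchmarking. Abstract: Multi-modal large language models that have image output are emerging. Many image generation benchmarks focus on aesthetics instead of fine-grained generation capabilities. In PixelArena, we propose using semantic segmentation tasks to objectively examine their fine-grained generative intelligence with pixel precision. We find the latest Gemini 3 Pro Image has emergent image generation capabilities that generate semantic masks with high fidelity under zero-shot settings, showcasing visual intelligence unseen before and true generalization in new image generation tasks. We further investigate its results, compare them qualitatively and quantitatively with those of other models, and present failure cases. The findings not only signal exciting progress in the field but also provide insights into future research related to multimodality, reasoning, interpretability and benchmarking.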

[68] LaverNet: Lightweight All-in-one Video Restoration via Selective Propagation

Haiyu Zhao,Yiwen Shan,Yuanbiao Gou,Xi Peng

Main category: cs.CV

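TL;DR: Proposes LaverNet, a lightweight all-in-one video restoration network with only 362K parameters that selectively propagates degradation-agnostic features to reduce the impact of degradations on temporal modeling, matching or even surpassing existing large models.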

Details

Motivation: When handling time-varying degradations, existing all-in-one video restoration methods are easily disturbed by the degradation and rely on large models, which makes temporal modeling ineffective and conceals the underlying difficulties. Method: Design a lightweight network, LaverNet, with a new propagation mechanism that transmits only degradation-agnostic features across frames to strengthen temporal modeling. Result: With less than 1% of the parameters of existing models, LaverNet achieves comparable or even better restoration performance on multiple benchmarks. Conclusion: Experiments show that a lightweight network can also deliver strong all-in-one video restoration; the key is an effective feature propagation mechanism rather than model scale. Abstract: Recent studies have explored all-in-one video restoration, which handles multiple degradations with a unified model. However, these approaches still face two challenges when dealing with time-varying degradations. First, the degradation can dominate temporal modeling, confusing the model to focus on artifacts rather than the video content. Second, current methods typically rely on large models to handle all-in-one restoration, concealing those underlying difficulties. To address these challenges, we propose a lightweight all-in-one video restoration network, LaverNet, with only 362K parameters. To mitigate the impact of degradations on temporal modeling, we introduce a novel propagation mechanism that selectively transmits only degradation-agnostic features across frames. Through LaverNet, we demonstrate that strong all-in-one restoration can be achieved with a compact network. Despite its small size, less than 1% of the parameters of existing models, LaverNet achieves comparable, even superior performance across benchmarks.

[69] Ridge Estimation-Based Vision and Laser Ranging Fusion Localization Method for UAVs

Huayu Huang,Chen Chen,Banglei Guan,Ze Tan,Yang Shang,Zhang Li,Qifeng Yu

Main category: cs.CV

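TL;DR: Proposes a ridge estimation-based fusion localization method that combines the rich scene information of sequential imagery with the high precision of laser ranging, improving target localization accuracy and robustness under limited observation conditions.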

Details

Motivation: Under limited conditions such as long distances, small intersection angles, and large inclination angles, severe multicollinearity among the column vectors of the design matrix makes conventional least squares estimation ill-conditioned, so localization results are unstable and lack robustness. Method: Use ridge estimation to suppress multicollinearity and fuse sequential imagery with laser ranging data for joint localization. Result: Experiments show that the method achieves higher localization accuracy than ground localization algorithms based on a single source of information, with markedly better robustness under limited observation conditions. Conclusion: The ridge estimation-based fusion localization method effectively resolves ill-conditioning under limited observation conditions, improves localization accuracy and stability, and suits multi-sensor UAV target tracking. Abstract: Tracking and measuring targets using a variety of sensors mounted on UAVs is an effective means to quickly and accurately locate the target. This paper proposes a fusion localization method based on ridge estimation, combining the advantages of rich scene information from sequential imagery with the high precision of laser ranging to enhance localization accuracy. Under limited conditions such as long distances, small intersection angles, and large inclination angles, the column vectors of the design matrix have serious multicollinearity when using the least squares estimation algorithm. The multicollinearity will lead to ill-conditioned problems, resulting in significant instability and low robustness. Ridge estimation is introduced to mitigate the serious multicollinearity under the condition of limited observation. Experimental results demonstrate that our method achieves higher localization accuracy compared to ground localization algorithms based on a single source of information. Moreover, the introduction of ridge estimation effectively enhances the robustness, particularly under limited observation conditions.
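The effect of ridge estimation on a multicollinear design matrix can be seen in a small NumPy sketch. It uses the textbook ridge estimator, (AᵀA + λI)⁻¹Aᵀb, on synthetic data; the paper's actual design matrix (built from image and laser-ranging observations) and its choice of ridge parameter are not given in the abstract, so the matrix `A`, the noise levels, and `lam` below are illustrative assumptions.

```python
import numpy as np

def least_squares(A, b):
    """Ordinary least squares solution of A x = b."""
    return np.linalg.lstsq(A, b, rcond=None)[0]

def ridge(A, b, lam):
    """Textbook ridge estimate: solve (A^T A + lam * I) x = A^T b."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ b)

# Synthetic design matrix whose two columns are almost identical
# (a stand-in for the near-collinear geometry described in the abstract).
rng = np.random.default_rng(0)
t = rng.random(50)
A = np.column_stack([t, t + 1e-6 * rng.standard_normal(50)])
x_true = np.array([1.0, 2.0])
b = A @ x_true + 1e-3 * rng.standard_normal(50)

lam = 1e-3  # hypothetical ridge parameter, chosen only for this demo
x_ls = least_squares(A, b)
x_ridge = ridge(A, b, lam)

print("cond(A^T A)          :", np.linalg.cond(A.T @ A))
print("cond(A^T A + lam * I):", np.linalg.cond(A.T @ A + lam * np.eye(2)))
print("least squares        :", x_ls)
print("ridge                :", x_ridge)
```

The ridge term trades a small bias for a far better conditioned system: in this toy problem the individual coefficients are not identifiable from such data, but their well-determined sum is recovered stably, whereas the least squares solution is dominated by amplified noise.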

[70] QUIDS: Quality-informed Incentive-driven Multi-agent Dispatching System for Mobile Crowdsensing

Nan Zhou,Zuxin Li,Fanhang Man,Xuecheng Chen,Susu Xu,Fan Dang,Chaopeng Hong,Yunhao Liu,Xiao-Ping Zhang,Xinlei Chen

Main category: cs.CV

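TL;DR: This paper proposes QUIDS, a multi-agent dispatching system that improves Quality of Information (QoI) in non-dedicated vehicular mobile crowdsensing (NVMCS). By introducing the Aggregated Sensing Quality (ASQ) metric and a Mutually Assisted Belief-aware Vehicle Dispatching algorithm, it jointly optimizes coverage and reliability and significantly improves sensing performance.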

Details

Motivation: In non-dedicated vehicular mobile crowdsensing systems, the dynamic participation of vehicles makes it hard to guarantee good sensing coverage and reliability at the same time, which degrades overall Quality of Information (QoI). Existing methods fail to jointly optimize these two key factors. Method: Proposes the QUIDS system, which introduces a new QoI metric, Aggregated Sensing Quality (ASQ), that measures coverage and reliability together, and designs a Mutually Assisted Belief-aware Vehicle Dispatching algorithm that estimates reliability and allocates incentives under uncertainty to maximize ASQ under a budget constraint. Result: Experiments on real-world city data show that QUIDS improves ASQ by 38% over the non-dispatching scenario and by 10% over state-of-the-art methods, and reduces map reconstruction error by 39-74% across reconstruction algorithms. Conclusion: By jointly optimizing coverage and reliability through a quality-informed incentive mechanism, QUIDS enables low-cost, high-quality urban sensing without dedicated infrastructure and applies to smart-city uses such as traffic and environmental monitoring. Abstract: This paper addresses the challenge of achieving optimal Quality of Information (QoI) in non-dedicated vehicular mobile crowdsensing (NVMCS) systems. The key obstacles are the interrelated issues of sensing coverage, sensing reliability, and the dynamic participation of vehicles. To tackle these, we propose QUIDS, a QUality-informed Incentive-driven multi-agent Dispatching System, which ensures high sensing coverage and reliability under budget constraints. QUIDS introduces a novel metric, Aggregated Sensing Quality (ASQ), to quantitatively capture QoI by integrating both coverage and reliability. We also develop a Mutually Assisted Belief-aware Vehicle Dispatching algorithm that estimates sensing reliability and allocates incentives under uncertainty, further improving ASQ. Evaluation using real-world data from a metropolitan NVMCS deployment shows QUIDS improves ASQ by 38% over non-dispatching scenarios and by 10% over state-of-the-art methods. It also reduces map reconstruction errors by 39-74% across algorithms. By jointly optimizing coverage and reliability via a quality-informed incentive mechanism, QUIDS enables low-cost, high-quality urban monitoring without dedicated infrastructure, applicable to smart-city scenarios like traffic and environmental sensing.

[71] Collaborative Edge-to-Server Inference for Vision-Language Models

Soochang Song,Yongjune Kim

Main category: cs.CV

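TL;DR: Proposes an edge-to-server collaborative inference framework that reduces communication cost through selective retransmission of image regions while maintaining the inference accuracy of vision-language models.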

Details

Motivation: In typical deployments, images captured at edge devices are resized before being sent to the server for inference, which discards detail and lowers accuracy. Solving this requires preserving key visual information while reducing communication overhead. Method: Designs a two-stage inference framework. In the first stage, the server runs inference on the global image, uses the VLM's internal attention to identify a region of interest (RoI), and uses the min-entropy of the output tokens to decide whether retransmission is needed. If the min-entropy exceeds a threshold, the server asks the edge device to send a high-fidelity local image of the RoI, which is then used together with the global image to refine the result. Result: Experiments on multiple VLM architectures show that the framework significantly reduces communication cost while keeping inference accuracy high. Conclusion: The proposed collaborative inference framework balances communication efficiency and model performance and suits resource-constrained edge-server scenarios. Abstract: We propose a collaborative edge-to-server inference framework for vision-language models (VLMs) that reduces the communication cost while maintaining inference accuracy. In typical deployments, visual data captured at edge devices (clients) is transmitted to the server for VLM inference. However, resizing the original image (global image) to match the vision encoder's input resolution often discards fine-grained details, leading to accuracy degradation. To overcome this limitation, we design a two-stage framework. In the first stage, the server performs inference on the global image and identifies a region of interest (RoI) using the VLM's internal attention. The min-entropy of the output tokens is then computed as a confidence measure to determine whether retransmission is required. If the min-entropy exceeds a predefined threshold, the server requests the edge device to send a detail-preserved local image of the RoI. The server then refines its inference by jointly leveraging the global and local images. This selective retransmission strategy ensures that only essential visual content is transmitted. Experiments across multiple VLM architectures show that the proposed framework significantly reduces communication cost while maintaining inference accuracy.
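
The retransmission decision described in this abstract can be sketched in a few lines. Min-entropy of a distribution is the negative base-2 logarithm of its largest probability, so it is low when the model is confident. Everything else here is an assumption made for illustration: the abstract does not say how per-token values are aggregated (this sketch averages them), and the threshold value, the function names, and the toy distributions are hypothetical.

```python
import math


def min_entropy(probs):
    """Min-entropy in bits: -log2 of the largest probability."""
    return -math.log2(max(probs))


def needs_retransmission(token_distributions, threshold):
    """Return True when the server should request the RoI crop.

    Assumption: per-token min-entropies are averaged. The source only
    states that the min-entropy of the output tokens is thresholded.
    """
    scores = [min_entropy(p) for p in token_distributions]
    return sum(scores) / len(scores) > threshold


# Toy next-token distributions for two generated tokens each.
confident = [[0.9, 0.05, 0.05], [0.8, 0.1, 0.1]]
uncertain = [[0.4, 0.3, 0.3], [0.35, 0.35, 0.3]]

THRESHOLD_BITS = 1.0  # hypothetical value, not taken from the paper
print(needs_retransmission(confident, THRESHOLD_BITS))
print(needs_retransmission(uncertain, THRESHOLD_BITS))
```

With this rule, only the uncertain answer triggers a second upload, which is where the communication saving would come from.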

Tao Hu,Weiyu Zhou,Yanjie Tu,Peng Wu,Wei Dong,Qingsen Yan,Yanning Zhang

Main category: cs.CV

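TL;DR: This paper proposes GMODiff, a gain map-driven one-step diffusion framework for multi-exposure high dynamic range (HDR) reconstruction. It overcomes the limitations of conventional latent diffusion models in HDR by estimating a gain map that preserves the extended dynamic range, and it uses regression priors to improve reconstruction quality and efficiency.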

Details

Motivation: Directly applying pre-trained latent diffusion models to HDR reconstruction is challenging because of the limited dynamic-range representation caused by 8-bit latent compression, the high inference cost of multi-step denoising, and the content hallucination inherent to generative models. Method: Reformulates HDR reconstruction as a conditionally guided gain map (GM) estimation task, uses a regression model to provide an informative initial estimate that enables one-step denoising, and uses regression priors to guide both the denoising process and the latent decoding of the LDM, suppressing hallucinations while preserving structural accuracy. Result: Experiments show that GMODiff performs favorably against several state-of-the-art methods and runs inference 100 times faster than previous LDM-based methods. Conclusion: GMODiff addresses the key bottlenecks of LDMs in HDR reconstruction, markedly improving efficiency and content fidelity while keeping high perceptual quality, and offers a new direction for diffusion-based HDR techniques. Abstract: Pre-trained Latent Diffusion Models (LDMs) have recently shown strong perceptual priors for low-level vision tasks, making them a promising direction for multi-exposure High Dynamic Range (HDR) reconstruction. However, directly applying LDMs to HDR remains challenging due to: (1) limited dynamic-range representation caused by 8-bit latent compression, (2) high inference cost from multi-step denoising, and (3) content hallucination inherent to their generative nature. To address these challenges, we introduce GMODiff, a gain map-driven one-step diffusion framework for multi-exposure HDR reconstruction. Instead of reconstructing full HDR content, we reformulate HDR reconstruction as a conditionally guided Gain Map (GM) estimation task, where the GM encodes the extended dynamic range while retaining the same bit depth as LDR images. We initialize the denoising process from an informative regression-based estimate rather than pure noise, enabling the model to generate high-quality GMs in a single denoising step. Furthermore, recognizing that regression-based models excel in content fidelity while LDMs favor perceptual quality, we leverage regression priors to guide both the denoising process and latent decoding of the LDM, suppressing hallucinations while preserving structural accuracy. Extensive experiments demonstrate that our GMODiff performs favorably against several state-of-the-art methods and is 100x faster than previous LDM-based methods.

[73] EverybodyDance: Bipartite Graph-Based Identity Correspondence for Multi-Character Animation

Haotian Ling,Zequn Chen,Qiuying Chen,Donglin Di,Yongjia Ma,Hao Li,Chen Wei,Zhulin Tao,Xun Yang

Main category: cs.CV

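TL;DR: This paper proposes EverybodyDance, a systematic approach built on an Identity Matching Graph (IMG) and Mask-Query Attention (MQA) that addresses Identity Correspondence (IC) correctness in multi-character animation. Several targeted strategies and a new evaluation benchmark confirm its advantage in both IC and visual quality.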

Details

Motivation: Existing single-character pose-driven animation techniques are hard to extend to multi-character scenes, and in particular cannot guarantee correct Identity Correspondence (IC) between generated and reference frames when characters swap positions. Method: Proposes the Identity Matching Graph (IMG), which models the characters in the generated and reference frames as the two node sets of a weighted complete bipartite graph and uses Mask-Query Attention (MQA) to compute the affinity between characters. IC correctness is formalized as a graph structural metric and optimized during training, together with identity-embedded guidance, a multi-scale matching strategy, and pre-classified sampling. Result: On a self-built identity correspondence benchmark, EverybodyDance substantially outperforms state-of-the-art baselines in both IC correctness and visual fidelity. Conclusion: By combining graph modeling with attention, EverybodyDance resolves identity confusion in multi-character animation and offers a workable approach to controllable character animation in complex scenes. Abstract: Consistent pose-driven character animation has achieved remarkable progress in single-character scenarios. However, extending these advances to multi-character settings is non-trivial, especially when position swaps are involved. Beyond mere scaling, the core challenge lies in enforcing correct Identity Correspondence (IC) between characters in reference and generated frames. To address this, we introduce EverybodyDance, a systematic solution targeting IC correctness in multi-character animation. EverybodyDance is built around the Identity Matching Graph (IMG), which models characters in the generated and reference frames as two node sets in a weighted complete bipartite graph. Edge weights, computed via our proposed Mask-Query Attention (MQA), quantify the affinity between each pair of characters. Our key insight is to formalize IC correctness as a graph structural metric and to optimize it during training. We also propose a series of targeted strategies tailored for multi-character animation, including identity-embedded guidance, a multi-scale matching strategy, and pre-classified sampling, which work synergistically. Finally, to evaluate IC performance, we curate the Identity Correspondence Evaluation benchmark, dedicated to multi-character IC correctness. Extensive experiments demonstrate that EverybodyDance substantially outperforms state-of-the-art baselines in both IC and visual fidelity.

[74] Factorized Video Generation: Decoupling Scene Construction and Temporal Synthesis in Text-to-Video Diffusion Models

Mariam Hassan,Bastien Van Delft,Wuyang Li,Alexandre Alahi

Main category: cs.CV

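TL;DR: This paper proposes Factorized Video Generation (FVG), which decomposes text-to-video generation into three stages (reasoning, composition, and temporal synthesis) and improves the logical consistency and efficiency of generated videos by introducing a semantically correct initial frame.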

Details

Motivation: Existing text-to-video models perform poorly at constructing complex scenes and following temporal logic, and motion failures often stem from a semantically incorrect initial frame. Method: Splits generation into three steps: (1) a large language model rewrites the prompt to specify the initial scene; (2) a text-to-image model generates a high-quality anchor frame; (3) a video model fine-tuned on anchor frames animates the scene. Result: FVG reaches the state of the art on T2V CompBench, significantly improves every tested model on VBench2, and cuts the number of sampling steps by 70% without loss in performance. Conclusion: Factorized video generation offers a more efficient, robust, and controllable path to video synthesis. Abstract: State-of-the-art Text-to-Video (T2V) diffusion models can generate visually impressive results, yet they still frequently fail to compose complex scenes or follow logical temporal instructions. In this paper, we argue that many errors, including apparent motion failures, originate from the model's inability to construct a semantically correct or logically consistent initial frame. We introduce Factorized Video Generation (FVG), a pipeline that decouples these tasks by decomposing the Text-to-Video generation into three specialized stages: (1) Reasoning, where a Large Language Model (LLM) rewrites the video prompt to describe only the initial scene, resolving temporal ambiguities; (2) Composition, where a Text-to-Image (T2I) model synthesizes a high-quality, compositionally-correct anchor frame from this new prompt; and (3) Temporal Synthesis, where a video model, finetuned to understand this anchor, focuses its entire capacity on animating the scene and following the prompt. Our decomposed approach sets a new state-of-the-art on the T2V CompBench benchmark and significantly improves all tested models on VBench2. Furthermore, we show that visual anchoring allows us to cut the number of sampling steps by 70% without any loss in performance, leading to a substantial speed-up in sampling. Factorized Video Generation offers a simple yet practical path toward more efficient, robust, and controllable video synthesis.

[75] Adaptive Frequency Domain Alignment Network for Medical image segmentation

Zhanwei Li,Liang Li,Jiawan Zhang

Main category: cs.CV

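TL;DR: Proposes AFDAN, an adaptive frequency domain alignment network that tackles the scarcity of annotated data in medical image segmentation. It transfers knowledge across domains by aligning features in the frequency domain and achieves better segmentation than existing methods on the VITILIGO2025 and DRIVE datasets.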

Details

Motivation: High-quality annotated data for medical image segmentation is scarce, and manual annotation is time-consuming and laborious, which limits model performance, so an effective domain adaptation method is needed. Method: Proposes the Adaptive Frequency Domain Alignment Network (AFDAN) with three core modules: an Adversarial Domain Learning Module that transfers features from the source to the target domain, a Source-Target Frequency Fusion Module that blends frequency representations across domains, and a Spatial-Frequency Integration Module that combines spatial and frequency features to improve segmentation accuracy. Result: Reaches 90.9% IoU on the VITILIGO2025 dataset and 82.6% IoU on the DRIVE dataset, both better than existing state-of-the-art methods. Conclusion: Through frequency domain alignment, AFDAN achieves effective cross-domain medical image segmentation and markedly improves segmentation with few annotations, showing strong practicality and generalization. Abstract: High-quality annotated data plays a crucial role in achieving accurate segmentation. However, such data for medical image segmentation are often scarce due to the time-consuming and labor-intensive nature of manual annotation. To address this challenge, we propose the Adaptive Frequency Domain Alignment Network (AFDAN), a novel domain adaptation framework designed to align features in the frequency domain and alleviate data scarcity. AFDAN integrates three core components to enable robust cross-domain knowledge transfer: an Adversarial Domain Learning Module that transfers features from the source to the target domain; a Source-Target Frequency Fusion Module that blends frequency representations across domains; and a Spatial-Frequency Integration Module that combines both frequency and spatial features to further enhance segmentation accuracy across domains. Extensive experiments demonstrate the effectiveness of AFDAN: it achieves an Intersection over Union (IoU) of 90.9% for vitiligo segmentation in the newly constructed VITILIGO2025 dataset and a competitive IoU of 82.6% on the retinal vessel segmentation benchmark DRIVE, surpassing existing state-of-the-art approaches.
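
For readers unfamiliar with the metric quoted in this abstract, Intersection over Union for a binary segmentation mask is the number of pixels marked in both prediction and ground truth divided by the number marked in either. A minimal sketch follows, assuming masks are given as nested lists of 0/1 values. The function name, the toy masks, and the convention of returning 1.0 for two empty masks are illustrative choices, not details from the paper.

```python
def iou(pred, target):
    """Intersection over Union of two equally sized binary masks."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]

# Two pixels overlap and four pixels are marked in at least one mask.
print(iou(pred, target))  # 0.5
```

A reported IoU of 90.9% therefore means that the overlap covers about nine tenths of the combined predicted and annotated area.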

[76] Using Gaussian Splats to Create High-Fidelity Facial Geometry and Texture

Haodi He,Jihun Yu,Ronald Fedkiw

Main category: cs.CV

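TL;DR: This paper proposes a unified 3D face reconstruction method based on Gaussian Splatting. Using a small number of uncalibrated images and semantic segmentation for alignment, it reconstructs a neutral-pose face without a full video and produces high-quality textures and relighting compatible with a standard graphics pipeline.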

Details

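Motivation: Implicit representations such as NeRF are weak in terms of constraints and control and are hard to fit into a standard graphics pipeline, while conventional methods usually need many consecutive video frames or strictly calibrated data, which limits practical use. A more explicit, more easily constrained 3D face reconstruction method that is compatible with standard rendering is therefore needed. Method: Uses Gaussian Splatting as an explicit 3D representation, aligns the semantic regions of the face with segmentation annotations, soft-constrains the Gaussians to a triangulated surface for a more structured reconstruction, and iteratively refines the surface geometry. The Gaussians are then transformed into texture space to form a view-dependent neural texture, and a relightable model disentangles lighting from reflectance to produce a high-resolution albedo map usable in a standard graphics pipeline. Result: Only 11 uncalibrated face images with inconsistent lighting suffice to reconstruct a well-structured triangulated surface and high-quality textures. The reconstruction has high visual fidelity and integrates with existing graphics assets and renderers, and the method is demonstrated in a text-driven asset creation pipeline. Conclusion: By jointly optimizing structured Gaussian Splats and a triangulated surface, the method efficiently reconstructs an editable, high-fidelity 3D face model from a few images and brings it into a standard graphics pipeline, giving content creators a flexible and practical solution. Abstract: We leverage increasingly popular three-dimensional neural representations in order to construct a unified and consistent explanation of a collection of uncalibrated images of the human face. Our approach utilizes Gaussian Splatting, since it is more explicit and thus more amenable to constraints than NeRFs. We leverage segmentation annotations to align the semantic regions of the face, facilitating the reconstruction of a neutral pose from only 11 images (as opposed to requiring a long video). We soft constrain the Gaussians to an underlying triangulated surface in order to provide a more structured Gaussian Splat reconstruction, which in turn informs subsequent perturbations to increase the accuracy of the underlying triangulated surface. The resulting triangulated surface can then be used in a standard graphics pipeline. In addition, and perhaps most impactful, we show how accurate geometry enables the Gaussian Splats to be transformed into texture space where they can be treated as a view-dependent neural texture. This allows one to use high visual fidelity Gaussian Splatting on any asset in a scene without the need to modify any other asset or any other aspect (geometry, lighting, renderer, etc.) of the graphics pipeline. We utilize a relightable Gaussian model to disentangle texture from lighting in order to obtain a delit high-resolution albedo texture that is also readily usable in a standard graphics pipeline.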
The flexibility of our system allows for training with disparate images, even with incompatible lighting, facilitating robust regularization. Finally, we demonstrate the efficacy of our approach by illustrating its use in a text-driven asset creation pipeline.

[77] BrepLLM: Native Boundary Representation Understanding with Large Language Models

Liyuan Deng,Hao Guo,Yunpeng Bai,Yongkang Dai,Huaxi Huang,Yilei Shi

Main category: cs.CV

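TL;DR: BrepLLM is the first framework that enables large language models to parse and reason over raw Brep data, bridging the modality gap between 3D geometry and natural language through two-stage training.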

Details

Motivation: Existing token-sequence-based large language models struggle to process 3D Brep models, which contain complex geometric and topological information, and there is no effective way to connect 3D geometry with natural language. Method: Proposes the BrepLLM framework with two-stage training: cross-modal alignment pre-training and multi-stage LLM fine-tuning. Breps are first converted into a graph representation by adaptive UV sampling, and a hierarchical BrepEncoder extracts geometric and topological features. Contrastive learning then aligns the global feature with CLIP text embeddings, and in the second stage a three-step strategy integrates the node sequence into the LLM: an MLP semantic mapping, LLM fine-tuning, and an MQE module that improves modeling of geometric diversity. Result: Builds Brep2Text, a dataset of 269,444 Brep-text question-answer pairs. Experiments show that BrepLLM reaches SOTA performance on 3D object classification and captioning. Conclusion: BrepLLM achieves understanding of raw Brep data and natural language generation from it, providing an effective way to combine 3D geometry with large language models. Abstract: Current token-sequence-based Large Language Models (LLMs) are not well-suited for directly processing 3D Boundary Representation (Brep) models that contain complex geometric and topological information. We propose BrepLLM, the first framework that enables LLMs to parse and reason over raw Brep data, bridging the modality gap between structured 3D geometry and natural language. BrepLLM employs a two-stage training pipeline: Cross-modal Alignment Pre-training and Multi-stage LLM Fine-tuning. In the first stage, an adaptive UV sampling strategy converts Breps into graph representations with geometric and topological information. We then design a hierarchical BrepEncoder to extract features from geometry (i.e., faces and edges) and topology, producing both a single global token and a sequence of node tokens. Then we align the global token with text embeddings from a frozen CLIP text encoder (ViT-L/14) via contrastive learning. In the second stage, we integrate the pretrained BrepEncoder into an LLM. We then align its sequence of node tokens using a three-stage progressive training strategy: (1) training an MLP-based semantic mapping from Brep representation to 2D with 2D-LLM priors; (2) fine-tuning the LLM; and (3) designing a Mixture-of-Query Experts (MQE) to enhance geometric diversity modeling. We also construct Brep2Text, a dataset comprising 269,444 Brep-text question-answer pairs. Experiments show that BrepLLM achieves state-of-the-art (SOTA) results on 3D object classification and captioning tasks.

[78] CountZES: Counting via Zero-Shot Exemplar Selection

Muhammad Ibraheem Siddiqui,Muhammad Haris Khan

Main category: cs.CV

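TL;DR: Proposes CountZES, a training-free zero-shot object counting framework that selects exemplars in three synergistic stages and generalizes effectively across natural, aerial, and medical images.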

Details

Motivation: Existing zero-shot object counting methods rely either on open-vocabulary detectors, which often yield multi-instance candidates, or on random sampling, which cannot accurately delineate object instances. Method: Proposes the CountZES framework, which progressively discovers diverse exemplars in three stages: Detection-Anchored Exemplar (DAE), Density-Guided Exemplar (DGE), and Feature-Consensus Exemplar (FCE). Result: Experiments on multiple datasets show that CountZES outperforms existing methods on zero-shot object counting. Conclusion: With its synergistic three-stage strategy, CountZES improves counting accuracy and generalization in the zero-shot setting. Abstract: Object counting in complex scenes remains challenging, particularly in the zero-shot setting, where the goal is to count instances of unseen categories specified only by a class name. Existing zero-shot object counting (ZOC) methods that infer exemplars from text either rely on open-vocabulary detectors, which often yield multi-instance candidates, or on random patch sampling, which fails to accurately delineate object instances. To address this, we propose CountZES, a training-free framework for object counting via zero-shot exemplar selection. CountZES progressively discovers diverse exemplars through three synergistic stages: Detection-Anchored Exemplar (DAE), Density-Guided Exemplar (DGE), and Feature-Consensus Exemplar (FCE). DAE refines open-vocabulary detections to isolate precise single-instance exemplars. DGE introduces a density-driven, self-supervised paradigm to identify statistically consistent and semantically compact exemplars, while FCE reinforces visual coherence through feature-space clustering. Together, these stages yield a diverse, complementary exemplar set that balances textual grounding, count consistency, and feature representativeness. Experiments on diverse datasets demonstrate CountZES's superior performance among ZOC methods and its effective generalization across natural, aerial, and medical domains.

[79] Geometric Disentanglement of Text Embeddings for Subject-Consistent Text-to-Image Generation using A Single Prompt

Shangxun Li,Youngjung Uh

Main category: cs.CV

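TL;DR: Proposes a simple, effective, training-free method that refines text embeddings from a geometric perspective to suppress unwanted semantics, improving subject consistency and text alignment in text-to-image generation.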

Details

Motivation: Existing text-to-image generation methods fall short on subject consistency and often need expensive fine-tuning or image conditioning. 1Prompt1Story is training-free but suffers from semantic leakage. Method: Takes a geometric view and rescales and refines text embeddings to suppress semantic entanglement across frames, avoiding semantic leakage and giving training-free consistency control. Result: Extensive experiments show that the method clearly outperforms existing baselines in both subject consistency and text alignment. Conclusion: The method is simple and efficient, mitigates semantic entanglement without training, and improves text-to-image generation for visual storytelling. Abstract: Text-to-image diffusion models excel at generating high-quality images from natural language descriptions but often fail to preserve subject consistency across multiple outputs, limiting their use in visual storytelling. Existing approaches rely on model fine-tuning or image conditioning, which are computationally expensive and require per-subject optimization. 1Prompt1Story, a training-free approach, concatenates all scene descriptions into a single prompt and rescales token embeddings, but it suffers from semantic leakage, where embeddings across frames become entangled, causing text misalignment. In this paper, we propose a simple yet effective training-free approach that addresses semantic entanglement from a geometric perspective by refining text embeddings to suppress unwanted semantics. Extensive experiments show that our approach significantly improves both subject consistency and text alignment over existing baselines.

[80] Prime and Reach: Synthesising Body Motion for Gaze-Primed Object Reach

Masashi Hatano,Saptarshi Sinha,Jacob Chalk,Wei-Hong Li,Hideo Saito,Dima Damen

Main category: cs.CV

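TL;DR: This paper proposes a human motion generation method based on a text-conditioned diffusion model, focused on the gaze priming that precedes picking up or putting down a target. The authors curate, for the first time, 23.7K human motion sequences that include gaze priming, and train and evaluate on several public datasets. On the HD-EPIC dataset the model achieves 60% prime success and 89% reach success.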

Details

Motivation: Before picking up or putting down an object, people prime the target location by looking at it from a distance. This natural behavior is not well modeled in existing motion generation, so a dedicated dataset and a generative model that imitates the behavior are needed. Method: Combines motion sequences from five public datasets (HD-EPIC, MoGaze, HOT3D, ADT, GIMO) into the first human motion dataset with 23.7K gaze-primed sequences. A text-conditioned diffusion model is pre-trained and then fine-tuned conditioned on goal pose or location to generate more natural approach and reach motion. Result: On HD-EPIC, the model conditioned on the goal location reaches 60% prime success and 89% reach success, supporting the claim that the generated motion imitates natural human behavior. Conclusion: The method generates human motion with natural gaze priming, and the new dataset and the newly introduced 'Prime Success' metric provide a basis for future research. Abstract: Human motion generation is a challenging task that aims to create realistic motion imitating natural human behaviour. We focus on the well-studied behaviour of priming an object/location for pick-up or put-down, that is, the spotting of an object/location from a distance, known as gaze priming, followed by the motion of approaching and reaching the target location. To that end, we curate, for the first time, 23.7K gaze-primed human motion sequences for reaching target object locations from five publicly available datasets, i.e., HD-EPIC, MoGaze, HOT3D, ADT, and GIMO. We pre-train a text-conditioned diffusion-based motion generation model, then fine-tune it conditioned on goal pose or location, on our curated sequences. Importantly, we evaluate the ability of the generated motion to imitate natural human movement through several metrics, including the 'Reach Success' and a newly introduced 'Prime Success' metric. On the largest dataset, HD-EPIC, our model achieves 60% prime success and 89% reach success when conditioned on the goal object location.

[81] SNOW: Spatio-Temporal Scene Understanding with World Knowledge for Open-World Embodied Reasoning

Tin Stribor Sohn,Maximilian Dillitzer,Jason J. Corso,Eric Sax

Main category: cs.CV

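TL;DR: SNOW is a training-free, backbone-agnostic framework that achieves unified 4D scene understanding by fusing the semantics of vision-language models with point cloud geometry and temporal consistency.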

Details

Motivation: Existing vision-language models lack awareness of 3D geometry and temporal dynamics, while geometry-aware methods carry too little semantic information, so a method that unifies semantic, geometric, and temporal information is needed to support spatial reasoning in autonomous robotic systems. Method: SNOW uses HDBSCAN clustering to generate object-level proposals that guide SAM2 segmentation, proposes Spatio-Temporal Tokenized Patch Encoding (STEP) to encode the semantic, geometric, and temporal features of each region, builds a 4D Scene Graph (4DSG), and uses a lightweight SLAM backend for spatial anchoring and global alignment. Result: Reaches state-of-the-art performance on several benchmarks, with precise 4D scene understanding and spatially grounded inference. Conclusion: Structured 4D priors are essential for embodied reasoning and autonomous robotics, and SNOW offers an effective route to a queryable, unified world model in open-world settings. Abstract: Autonomous robotic systems require spatio-temporal understanding of dynamic environments to ensure reliable navigation and interaction. While Vision-Language Models (VLMs) provide open-world semantic priors, they lack grounding in 3D geometry and temporal dynamics. Conversely, geometric perception captures structure and motion but remains semantically sparse. We propose SNOW (Scene Understanding with Open-World Knowledge), a training-free and backbone-agnostic framework for unified 4D scene understanding that integrates VLM-derived semantics with point cloud geometry and temporal consistency. SNOW processes synchronized RGB images and 3D point clouds, using HDBSCAN clustering to generate object-level proposals that guide SAM2-based segmentation. Each segmented region is encoded through our proposed Spatio-Temporal Tokenized Patch Encoding (STEP), producing multimodal tokens that capture localized semantic, geometric, and temporal attributes. These tokens are incrementally integrated into a 4D Scene Graph (4DSG), which serves as a 4D prior for downstream reasoning. A lightweight SLAM backend anchors all STEP tokens spatially in the environment, providing the global reference alignment and ensuring unambiguous spatial grounding across time. The resulting 4DSG forms a queryable, unified world model through which VLMs can directly interpret spatial scene structure and temporal dynamics.
Experiments on a diverse set of benchmarks demonstrate that SNOW enables precise 4D scene understanding and spatially grounded inference, thereby setting new state-of-the-art performance in several settings, highlighting the importance of structured 4D priors for embodied reasoning and autonomous robotics.

[82] StageVAR: Stage-Aware Acceleration for Visual Autoregressive Models

Senmao Li,Kai Wang,Salman Khan,Fahad Shahbaz Khan,Jian Yang,Yaxing Wang

Main category: cs.CV

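TL;DR: Proposes StageVAR, a stage-aware acceleration framework for visual autoregressive models that achieves up to 3.4x inference speedup while preserving generation quality.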

Details

Motivation: Conventional VAR models have high computational complexity at large-scale steps, and existing acceleration methods depend on manual tuning and ignore the differing importance of stages in the generation process. Method: Analysis shows that early steps are critical for semantic and structural consistency while later steps mainly refine details. Based on this, a training-free, plug-and-play acceleration strategy prunes or approximates late-stage computation by exploiting its semantic irrelevance and low-rank properties. Result: Achieves up to 3.4x speedup with only a 0.01 drop on GenEval and a 0.26 drop on DPG, outperforming existing acceleration methods. Conclusion: Stage-aware design is an effective principle for efficient visual autoregressive image generation. Abstract: Visual Autoregressive (VAR) modeling departs from the next-token prediction paradigm of traditional Autoregressive (AR) models through next-scale prediction, enabling high-quality image generation. However, the VAR paradigm suffers from sharply increased computational complexity and running time at large-scale steps. Although existing acceleration methods reduce runtime for large-scale steps, they rely on manual step selection and overlook the varying importance of different stages in the generation process. To address this challenge, we present StageVAR, a systematic study and stage-aware acceleration framework for VAR models. Our analysis shows that early steps are critical for preserving semantic and structural consistency and should remain intact, while later steps mainly refine details and can be pruned or approximated for acceleration. Building on these insights, StageVAR introduces a plug-and-play acceleration strategy that exploits semantic irrelevance and low-rank properties in late-stage computations, without requiring additional training. Our proposed StageVAR achieves up to 3.4x speedup with only a 0.01 drop on GenEval and a 0.26 decrease on DPG, consistently outperforming existing acceleration baselines. These results highlight stage-aware design as a powerful principle for efficient visual autoregressive image generation.

Yuan Li,Yahan Yu,Youyuan Lin,Yong-Hao Yang,Chenhui Chu,Shin'ya Nishida

Main category: cs.CV

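TL;DR: This paper proposes a reinforcement-learning-based method for blind image quality assessment (BIQA) that uses human annotations as reward signals so that the model acquires both human-like perception and self-consistent reasoning.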

Details

Motivation: Existing BIQA models do not model the human perception-reasoning process, which makes it hard to obtain assessments that are interpretable and consistent with human judgment. Method: Collects annotations that reflect the human perception-reasoning chain, adopts a reinforcement learning framework with human ratings as the reward signal, and designs a reward that requires the model to infer quality from its own generated descriptions, pushing it to internalize self-consistent reasoning. Result: Score prediction is comparable to current state-of-the-art BIQA systems (Pearson and Spearman correlation coefficients), and on more than 1,000 samples the model reaches a ROUGE-1 score of 0.512, clearly above the baseline's 0.443, indicating that its reasoning chains are closer to human explanations. Conclusion: The method delivers accurate image quality prediction and makes progress toward interpretable, human-like reasoning, advancing interpretable BIQA. Abstract: Humans assess image quality through a perception-reasoning cascade, integrating sensory cues with implicit reasoning to form self-consistent judgments. In this work, we investigate how a model can acquire both human-like and self-consistent reasoning capability for blind image quality assessment (BIQA). We first collect human evaluation data that capture several aspects of the human perception-reasoning pipeline. Then, we adopt reinforcement learning, using human annotations as reward signals to guide the model toward human-like perception and reasoning. To enable the model to internalize self-consistent reasoning capability, we design a reward that drives the model to infer the image quality purely from self-generated descriptions. Empirically, our approach achieves score prediction performance comparable to state-of-the-art BIQA systems under general metrics, including Pearson and Spearman correlation coefficients. In addition to the rating score, we assess human-model alignment using ROUGE-1 to measure the similarity between model-generated and human perception-reasoning chains. On over 1,000 human-annotated samples, our model reaches a ROUGE-1 score of 0.512 (cf. 0.443 for baseline), indicating substantial coverage of human explanations and marking a step toward human-like interpretable reasoning in BIQA.

[84] Smile on the Face, Sadness in the Eyes: Bridging the Emotion Gap with a Multimodal Dataset of Eye and Facial Behaviors

Kejun Liu,Yuanyuan Liu,Lin Wei,Chang Tang,Yibing Zhan,Zijing Chen,Zhe Chen

Main category: cs.CV

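TL;DR: This paper proposes Eye-behavior-aided Multimodal Emotion Recognition (EMER), builds a dataset with genuine emotion labels, and designs the EMERT model to close the gap between facial expressions and genuine emotion. Experiments show that eye behaviors matter for robust emotion recognition.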

Details

Motivation: Facial expressions often serve as social tools rather than reflections of genuine emotion, so there is a gap between facial expression recognition (FER) and genuine emotion recognition (ER). More reliable emotional cues such as eye behaviors are needed to make ER more accurate. Method: Uses a spontaneous emotion induction paradigm to collect non-invasive eye behavior data (such as eye movement sequences and fixation maps) together with facial videos, building the EMER dataset. Proposes the EMERT model, which combines modality-adversarial feature decoupling with a multitask Transformer to fuse eye behaviors and facial expressions for emotion recognition. Result: Evaluated under seven multimodal benchmark protocols, EMERT clearly outperforms existing state-of-the-art methods, confirming the value of eye behaviors as a complement to facial expressions. Conclusion: Eye behaviors are an important complementary cue for robust emotion recognition, and the EMER dataset and EMERT model help address the gap between FER and ER. Abstract: Emotion Recognition (ER) is the process of analyzing and identifying human emotions from sensing data. Currently, the field heavily relies on facial expression recognition (FER) because the visual channel conveys rich emotional cues. However, facial expressions are often used as social tools rather than manifestations of genuine inner emotions. To understand and bridge this gap between FER and ER, we introduce eye behaviors as an important emotional cue and construct an Eye-behavior-aided Multimodal Emotion Recognition (EMER) dataset. To collect data with genuine emotions, a spontaneous emotion induction paradigm with stimulus material is used, during which non-invasive eye behavior data, like eye movement sequences and eye fixation maps, is captured together with facial expression videos. To better illustrate the gap between ER and FER, multi-view emotion labels for multimodal ER and FER are separately annotated. Furthermore, based on the new dataset, we design a simple yet effective Eye-behavior-aided MER Transformer (EMERT) that enhances ER by bridging the emotion gap. EMERT leverages modality-adversarial feature decoupling and a multitask Transformer to model eye behaviors as a strong complement to facial expressions. In the experiment, we introduce seven multimodal benchmark protocols for a variety of comprehensive evaluations of the EMER dataset. The results show that EMERT outperforms other state-of-the-art multimodal methods by a large margin, revealing the importance of modeling eye behaviors for robust ER.
To sum up, we provide a comprehensive analysis of the importance of eye behaviors in ER, advancing the study on addressing the gap between FER and ER for more robust ER performance. Our EMER dataset and the trained EMERT models will be publicly available at https://github.com/kejun1/EMER.

[85] YOLO11-4K: An Efficient Architecture for Real-Time Small Object Detection in 4K Panoramic Images

Huma Hafeez,Matthew Garratt,Jo Plested,Sankaran Iyer,Arcot Sowmya

Main category: cs.CV

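TL;DR: This paper proposes YOLO11-4K, an efficient real-time object detection framework for 4K panoramic images. A multi-scale detection head and a GhostConv backbone lower latency while improving small-object detection accuracy, and a new annotated dataset, CVIP360, is released for evaluation.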

Details

Motivation: Conventional detectors such as YOLO face spatial distortion and heavy computation when processing high-resolution, wide-field 360-degree images and struggle to detect small objects, so a framework optimized for 4K panoramic images is needed. Method: Proposes the YOLO11-4K framework, which uses a multi-scale detection head with a P2 layer to increase sensitivity to small objects and GhostConv to reduce computational complexity. Builds and releases the CVIP360 dataset with 6,876 frame-level bounding boxes for 4K panoramic detection evaluation. Result: YOLO11-4K reaches 0.95 mAP at 0.50 IoU with 28.3 milliseconds inference per frame, a 75% latency reduction relative to YOLO11 (112.3 milliseconds), while mAP rises from 0.908 to 0.95. Conclusion: YOLO11-4K markedly improves inference efficiency while keeping accuracy high, and applies broadly to high-resolution panoramic vision in autonomous driving, surveillance, and augmented reality. Abstract: The processing of omnidirectional 360-degree images poses significant challenges for object detection due to inherent spatial distortions, wide fields of view, and ultra-high-resolution inputs. Conventional detectors such as YOLO are optimised for standard image sizes (for example, 640x640 pixels) and often struggle with the computational demands of 4K or higher-resolution imagery typical of 360-degree vision. To address these limitations, we introduce YOLO11-4K, an efficient real-time detection framework tailored for 4K panoramic images. The architecture incorporates a novel multi-scale detection head with a P2 layer to improve sensitivity to small objects often missed at coarser scales, and a GhostConv-based backbone to reduce computational complexity without sacrificing representational power. To enable evaluation, we manually annotated the CVIP360 dataset, generating 6,876 frame-level bounding boxes and producing a publicly available, detection-ready benchmark for 4K panoramic scenes. YOLO11-4K achieves 0.95 mAP at 0.50 IoU with 28.3 milliseconds inference per frame, representing a 75 percent latency reduction compared to YOLO11 (112.3 milliseconds), while also improving accuracy (mAP at 0.50 of 0.95 versus 0.908). This balance of efficiency and precision enables robust object detection in expansive 360-degree environments, making the framework suitable for real-world high-resolution panoramic applications.
While this work focuses on 4K omnidirectional images, the approach is broadly applicable to high-resolution detection tasks in autonomous navigation, surveillance, and augmented reality.
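
The latency and accuracy figures quoted in the abstract can be cross-checked with a few lines of arithmetic. The following is a minimal sketch that uses only the numbers stated above; the function name `relative_reduction` is illustrative and does not come from the paper.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Return the fractional reduction of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline


# Figures quoted in the abstract.
yolo11_latency_ms = 112.3
yolo11_4k_latency_ms = 28.3

latency_cut = relative_reduction(yolo11_latency_ms, yolo11_4k_latency_ms)
map_gain = 0.95 - 0.908  # absolute gain in mAP at 0.50 IoU

print(f"latency reduction: {latency_cut:.1%}")  # close to the quoted 75 percent
print(f"frames per second: {1000 / yolo11_4k_latency_ms:.1f}")
print(f"mAP@0.50 gain: {map_gain:.3f}")
```

The computed reduction is slightly under 75 percent, so the abstract's figure is a rounding of the two latencies it reports.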

[86] PoseMoE: Mixture-of-Experts Network for Monocular 3D Human Pose Estimation

Mengyuan Liu,Jiajie Liu,Jinyan Zhang,Wenhao Li,Junsong Yuan

Main category: cs.CV

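TL;DR: This paper proposes PoseMoE, a mixture-of-experts network for monocular 3D human pose estimation that improves accuracy by disentangling the encoding of 2D pose and depth features.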

Details

Motivation: Existing lifting-based methods entangle the detected 2D pose with uncertain depth information during encoding, which hurts overall estimation accuracy; this work aims to remove that entanglement. Method: PoseMoE uses a mixture-of-experts network to process 2D pose and depth features separately, and adds a cross-expert knowledge aggregation module to enrich spatio-temporal context. Result: Experiments on Human3.6M, MPI-INF-3DHP, and 3DPW show that PoseMoE outperforms conventional lifting-based methods. Conclusion: Disentangling the encoding of 2D pose and depth features, together with a reliable depth initialisation, substantially improves monocular 3D human pose estimation. Abstract: The lifting-based methods have dominated monocular 3D human pose estimation by leveraging detected 2D poses as intermediate representations. The 2D component of the final 3D human pose benefits from the detected 2D poses, whereas its depth counterpart must be estimated from scratch. The lifting-based methods encode the detected 2D pose and unknown depth in an entangled feature space, explicitly introducing depth uncertainty to the detected 2D pose, thereby limiting overall estimation accuracy. This work reveals that the depth representation is pivotal for the estimation process. Specifically, when depth is in an initial, completely unknown state, jointly encoding depth features with 2D pose features is detrimental to the estimation process. In contrast, when depth is initially refined to a more dependable state via network-based estimation, encoding it together with 2D pose information is beneficial. To address this limitation, we present a Mixture-of-Experts network for monocular 3D pose estimation named PoseMoE. Our approach introduces: (1) A mixture-of-experts network where specialized expert modules refine the well-detected 2D pose features and learn the depth features. This mixture-of-experts design disentangles the feature encoding process for 2D pose and depth, therefore reducing the explicit influence of uncertain depth features on 2D pose features. (2) A cross-expert knowledge aggregation module is proposed to aggregate cross-expert spatio-temporal contextual information. This step enhances features through bidirectional mapping between 2D pose and depth.
Extensive experiments show that our proposed PoseMoE outperforms the conventional lifting-based methods on three widely used datasets: Human3.6M, MPI-INF-3DHP, and 3DPW.

[87] VenusBench-GD: A Comprehensive Multi-Platform GUI Benchmark for Diverse Grounding Tasks

Beitong Zhou,Zhexiao Huang,Yuan Guo,Zhangxuan Gu,Tianyu Xia,Zichen Luo,Fei Tang,Dehan Kong,Yanyi Shang,Suling Ou,Zhenlin Guo,Changhua Meng,Shuheng Shen

Main category: cs.CV

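TL;DR: This paper presents VenusBench-GD, a large-scale, cross-platform, bilingual benchmark for GUI element grounding. A hierarchical task taxonomy and a high-quality data construction process enable comprehensive evaluation of multimodal models on both basic and advanced grounding tasks.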

Details

Motivation: Existing GUI grounding benchmarks suffer from insufficient data volume, narrow domain coverage, or over-reliance on a single platform, and no comprehensive evaluation framework suited to real-world scenarios is available. Method: The authors build a large-scale bilingual benchmark covering multiple platforms, many applications, and diverse UI elements; design a high-quality data annotation pipeline; and propose a hierarchical task taxonomy that divides grounding into basic and advanced categories with six subtasks. Result: Experiments show that general-purpose multimodal models now match or even surpass specialized GUI models on basic tasks but still lag on advanced tasks, while specialized models suffer from overfitting and poor robustness. Conclusion: Multi-tiered, comprehensive evaluation frameworks are needed to advance GUI agents, and VenusBench-GD provides a more complete testbed for evaluating and improving models. Abstract: GUI grounding is a critical component in building capable GUI agents. However, existing grounding benchmarks suffer from significant limitations: they either provide insufficient data volume and narrow domain coverage, or focus excessively on a single platform and require highly specialized domain knowledge. In this work, we present VenusBench-GD, a comprehensive, bilingual benchmark for GUI grounding that spans multiple platforms, enabling hierarchical evaluation for real-world applications. VenusBench-GD contributes as follows: (i) we introduce a large-scale, cross-platform benchmark with extensive coverage of applications, diverse UI elements, and rich annotated data, (ii) we establish a high-quality data construction pipeline for grounding tasks, achieving higher annotation accuracy than existing benchmarks, and (iii) we extend the scope of element grounding by proposing a hierarchical task taxonomy that divides grounding into basic and advanced categories, encompassing six distinct subtasks designed to evaluate models from complementary perspectives. Our experimental findings reveal critical insights: general-purpose multimodal models now match or even surpass specialized GUI models on basic grounding tasks. In contrast, advanced tasks still favor GUI-specialized models, though they exhibit significant overfitting and poor robustness. These results underscore the necessity of comprehensive, multi-tiered evaluation frameworks.

[88] Skeleton-Snippet Contrastive Learning with Multiscale Feature Fusion for Action Localization

Qiushuo Cheng,Jingjing Liu,Catherine Morgan,Alan Whone,Majid Mirmehdi

Main category: cs.CV

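TL;DR: This paper proposes a self-supervised pretraining method for skeleton-based action localization. A snippet discrimination pretext task and a U-shaped module sharpen temporal feature resolution, clearly improving existing contrastive learning methods on the BABEL and PKUMMD datasets.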

Details

Motivation: Existing self-supervised pretraining methods have succeeded at skeleton-based action recognition, but the finer-grained task of detecting action boundaries remains challenging and needs features that are sensitive to temporal change. Method: A snippet discrimination pretext task splits skeleton sequences into non-overlapping segments and uses contrastive learning to distinguish snippets across videos; a U-shaped structure fuses intermediate features to raise the resolution available for frame-level localization. Result: Across multiple BABEL subsets and evaluation protocols, the method consistently improves existing contrastive learning methods, and after pretraining on NTU RGB+D and BABEL it achieves state-of-the-art transfer performance on PKUMMD. Conclusion: The proposed method improves skeleton representations for temporal action localization and advances self-supervised learning for fine-grained action analysis. Abstract: The self-supervised pretraining paradigm has achieved great success in learning 3D action representations for skeleton-based action recognition using contrastive learning. However, learning effective representations for skeleton-based temporal action localization remains challenging and underexplored. Unlike video-level action recognition, detecting action boundaries requires temporally sensitive features that capture subtle differences between adjacent frames where labels change. To this end, we formulate a snippet discrimination pretext task for self-supervised pretraining, which densely projects skeleton sequences into non-overlapping segments and promotes features that distinguish them across videos via contrastive learning. Additionally, we build on strong backbones of skeleton-based action recognition models by fusing intermediate features with a U-shaped module to enhance feature resolution for frame-level localization. Our approach consistently improves existing skeleton-based contrastive learning methods for action localization on BABEL across diverse subsets and evaluation protocols. We also achieve state-of-the-art transfer learning performance on PKUMMD with pretraining on NTU RGB+D and BABEL.
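
The snippet discrimination idea can be illustrated with a small sketch: split a sequence into non-overlapping snippets, then score one positive pair against negatives with a contrastive loss. The abstract does not specify the loss, so the InfoNCE form, the temperature value, and the choice to drop trailing frames are all assumptions made for illustration; both function names are hypothetical.

```python
import math
from typing import List, Sequence


def split_into_snippets(frames: Sequence, snippet_len: int) -> List[list]:
    """Split a frame sequence into consecutive non-overlapping snippets.

    Trailing frames that do not fill a whole snippet are dropped (an assumption).
    """
    count = len(frames) // snippet_len
    return [list(frames[i * snippet_len:(i + 1) * snippet_len]) for i in range(count)]


def info_nce(pos_sim: float, neg_sims: List[float], temperature: float = 0.1) -> float:
    """InfoNCE loss for one anchor, given similarities to one positive and several negatives."""
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    peak = max(logits)  # subtract the maximum for numerical stability
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum - logits[0]


snippets = split_into_snippets(list(range(10)), snippet_len=4)
print(snippets)                      # two snippets of four frames each
print(info_nce(0.9, [0.1, 0.2]))     # low loss: positive is clearly the closest
print(info_nce(0.1, [0.9, 0.8]))     # high loss: negatives are closer than the positive
```

In a real pipeline the similarities would come from encoder features of augmented snippets; here they are plain numbers so that the loss behaviour is easy to inspect.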

[89] Multi-scale Attention-Guided Intrinsic Decomposition and Rendering Pass Prediction for Facial Images

Hossein Javidnia

Main category: cs.CV

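TL;DR: This paper proposes MAGINet, a multi-scale attention-guided intrinsic decomposition network that accurately estimates a light-normalized diffuse albedo from a single face image. Combined with RefinementNet and a Pix2PixHD generator, it outputs a complete six-pass physically based rendering decomposition and achieves state-of-the-art performance for relighting and material editing of real face images.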

Details

Motivation: Accurate intrinsic decomposition of face images under unconstrained lighting is essential for high-quality relighting, digital humans, and augmented-reality effects. Existing methods fall short in detail sharpness and lighting invariance, so a stronger model is needed to improve decomposition quality. Method: MAGINet uses hierarchical residual encoding, spatial-and-channel attention, and adaptive multi-scale feature fusion. It first predicts a 512×512 albedo map, which a lightweight RefinementNet upsamples to 1024×1024 and refines. Conditioned on the refined albedo, a Pix2PixHD architecture then generates the remaining five rendering passes: ambient occlusion, surface normal, specular reflectance, translucency, and raw diffuse colour. Training combines masked-MSE, VGG, edge, and patch-LPIPS losses. Result: Trained on the FFHQ-UV-Intrinsics dataset, the method reaches state-of-the-art diffuse albedo estimation, delivers significantly higher fidelity for the full set of rendering passes than prior methods, and produces decompositions with sharper boundaries and stronger lighting invariance. Conclusion: The method performs a complete intrinsic decomposition of a single portrait efficiently and accurately, supports high-quality face relighting and material editing, and suits practical applications such as digital humans and AR. Abstract: Accurate intrinsic decomposition of face images under unconstrained lighting is a prerequisite for photorealistic relighting, high-fidelity digital doubles, and augmented-reality effects. This paper introduces MAGINet, a Multi-scale Attention-Guided Intrinsics Network that predicts a $512\times512$ light-normalized diffuse albedo map from a single RGB portrait. MAGINet employs hierarchical residual encoding, spatial-and-channel attention in a bottleneck, and adaptive multi-scale feature fusion in the decoder, yielding sharper albedo boundaries and stronger lighting invariance than prior U-Net variants. The initial albedo prediction is upsampled to $1024\times1024$ and refined by a lightweight three-layer CNN (RefinementNet). Conditioned on this refined albedo, a Pix2PixHD-based translator then predicts a comprehensive set of five additional physically based rendering passes: ambient occlusion, surface normal, specular reflectance, translucency, and raw diffuse colour (with residual lighting). Together with the refined albedo, these six passes form the complete intrinsic decomposition. Trained with a combination of masked-MSE, VGG, edge, and patch-LPIPS losses on the FFHQ-UV-Intrinsics dataset, the full pipeline achieves state-of-the-art performance for diffuse albedo estimation and demonstrates significantly improved fidelity for the complete rendering stack compared to prior methods. The resulting passes enable high-quality relighting and material editing of real faces.

[90] TTP: Test-Time Padding for Adversarial Detection and Robust Adaptation on Vision-Language Models

Zhiwei Li,Yitian Pang,Weining Wang,Zhenan Sun,Qi Li

Main category: cs.CV

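TL;DR: This paper proposes Test-Time Padding (TTP), a lightweight defense framework that improves the robustness of vision-language models such as CLIP to adversarial perturbations at inference time. Without retraining, it detects adversarial inputs and adaptively repairs attention patterns, markedly improving adversarial robustness while preserving clean accuracy.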

Details

Motivation: Existing adversarial defenses either depend on training-time fine-tuning or cannot reliably separate clean from adversarial samples, so robustness and accuracy are hard to achieve together, which is a risk in safety-critical settings. Method: TTP first detects adversarial samples from the change in cosine similarity between CLIP feature embeddings before and after spatial padding, using a universal threshold. For detected adversarial samples it applies trainable padding to restore attention patterns and a similarity-aware ensemble strategy to make predictions more robust; clean samples are left unchanged or optionally passed to existing test-time adaptation techniques for further accuracy gains. Result: Experiments on several CLIP backbones and fine-grained datasets show that TTP clearly outperforms state-of-the-art test-time defenses in adversarial robustness without sacrificing clean accuracy. Conclusion: TTP is an efficient, general test-time defense framework that reliably detects and adapts to adversarial inputs without retraining, giving stronger safety guarantees for deploying VLMs. Abstract: Vision-Language Models (VLMs), such as CLIP, have achieved impressive zero-shot recognition performance but remain highly susceptible to adversarial perturbations, posing significant risks in safety-critical scenarios. Previous training-time defenses rely on adversarial fine-tuning, which requires labeled data and costly retraining, while existing test-time strategies fail to reliably distinguish between clean and adversarial inputs, thereby preventing both adversarial robustness and clean accuracy from reaching their optimum. To address these limitations, we propose Test-Time Padding (TTP), a lightweight defense framework that performs adversarial detection followed by targeted adaptation at inference. TTP identifies adversarial inputs via the cosine similarity shift between CLIP feature embeddings computed before and after spatial padding, yielding a universal threshold for reliable detection across architectures and datasets. For detected adversarial cases, TTP employs trainable padding to restore disrupted attention patterns, coupled with a similarity-aware ensemble strategy for a more robust final prediction. For clean inputs, TTP leaves them unchanged by default or optionally integrates existing test-time adaptation techniques for further accuracy gains. Comprehensive experiments on diverse CLIP backbones and fine-grained benchmarks show that TTP consistently surpasses state-of-the-art test-time defenses, delivering substantial improvements in adversarial robustness without compromising clean accuracy. The code for this paper will be released soon.
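
The detection step described above can be sketched as a threshold on how much an embedding moves when the image is padded. This is a minimal sketch under stated assumptions: the embeddings are supplied by the caller (for example from a CLIP image encoder, which is not invoked here), a larger shift is assumed to indicate an adversarial input, and the threshold value 0.2 is a hypothetical placeholder rather than the paper's universal threshold.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def similarity_shift(emb_original: Sequence[float], emb_padded: Sequence[float]) -> float:
    """How far the embedding moves when the input is spatially padded."""
    return 1.0 - cosine_similarity(emb_original, emb_padded)


def is_adversarial(emb_original, emb_padded, threshold: float = 0.2) -> bool:
    """Flag the input when the shift exceeds a (hypothetical) threshold."""
    return similarity_shift(emb_original, emb_padded) > threshold


# Toy embeddings: the first pair barely moves, the second moves a lot.
print(is_adversarial([1.0, 0.0, 0.0], [0.99, 0.10, 0.0]))  # small shift
print(is_adversarial([1.0, 0.0, 0.0], [0.20, 1.00, 0.0]))  # large shift
```

Only flagged inputs would then go through the trainable-padding adaptation; that stage is not sketched here because the abstract gives no detail on how the padding is optimised.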

[91] N3D-VLM: Native 3D Grounding Enables Accurate Spatial Reasoning in Vision-Language Models

Yuxin Wang,Lei Ke,Boqiang Zhang,Tianyuan Qu,Hanxun Yu,Zhenpeng Huang,Meng Yu,Dan Xu,Dong Yu

Main category: cs.CV

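TL;DR: This paper proposes N3D-VLM, a unified multimodal framework that integrates native 3D object perception with 3D-aware visual reasoning to achieve precise 3D grounding and interpretable spatial understanding.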

Details

Motivation: Existing models lack an intrinsic understanding of spatial relationships and depth cues in 3D scenes, which limits their use in complex 3D environments. Method: The N3D-VLM framework uses depth estimation to lift large-scale 2D annotations into 3D space and builds training data that supports 3D object grounding and spatial reasoning; the model localizes objects directly in 3D space from textual descriptions and then reasons explicitly in 3D. Result: Experiments show state-of-the-art performance on 3D grounding tasks and clear gains over existing methods in 3D spatial reasoning. Conclusion: By introducing native 3D perception and structured reasoning, N3D-VLM improves the understanding and interpretability of multimodal models in 3D scenes. Abstract: While current multimodal models can answer questions based on 2D images, they lack intrinsic 3D object perception, limiting their ability to comprehend spatial relationships and depth cues in 3D scenes. In this work, we propose N3D-VLM, a novel unified framework that seamlessly integrates native 3D object perception with 3D-aware visual reasoning, enabling both precise 3D grounding and interpretable spatial understanding. Unlike conventional end-to-end models that directly predict answers from RGB/RGB-D inputs, our approach equips the model with native 3D object perception capabilities, enabling it to directly localize objects in 3D space based on textual descriptions. Building upon accurate 3D object localization, the model further performs explicit reasoning in 3D, achieving more interpretable and structured spatial understanding. To support robust training for these capabilities, we develop a scalable data construction pipeline that leverages depth estimation to lift large-scale 2D annotations into 3D space, significantly increasing the diversity and coverage of 3D object grounding data and yielding a dataset over six times larger than the largest existing single-image 3D detection dataset. Moreover, the pipeline generates spatial question-answering datasets that target chain-of-thought (CoT) reasoning in 3D, facilitating joint training for both 3D object localization and 3D spatial reasoning. Experimental results demonstrate that our unified framework not only achieves state-of-the-art performance on 3D grounding tasks, but also consistently surpasses existing methods in 3D spatial reasoning in vision-language models.

[92] 4D Primitive-Mâché: Glueing Primitives for Persistent 4D Scene Reconstruction

Kirill Mazur,Marwan Taher,Andrew J. Davison

Main category: cs.CV

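TL;DR: The paper presents a dynamic reconstruction system that takes monocular RGB video as input and produces a complete, persistent 4D reconstruction of the scene, so that the full dynamic 3D structure can be replayed across timesteps.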

Details

Motivation: Existing methods struggle to keep reconstructing the parts of a dynamic scene that are no longer visible and to model object permanence, and they lack spatio-temporal consistency and motion continuity. Method: The scene is decomposed into a set of rigid 3D primitives whose rigid motions are jointly optimised from dense 2D correspondences, and a motion-grouping mechanism extrapolates the motion of occluded objects, yielding a 4D (3D plus time) reconstruction. Result: The system significantly outperforms existing methods on object scanning and multi-object datasets and supports replayable 3D reconstruction, multi-object scanning, and object permanence. Conclusion: The method achieves high-quality 4D reconstruction of dynamic scenes with good spatio-temporal consistency and motion prediction for occluded objects, advancing monocular video understanding and persistent scene modelling. Abstract: We present a dynamic reconstruction system that receives a casual monocular RGB video as input, and outputs a complete and persistent reconstruction of the scene. In other words, we reconstruct not only the currently visible parts of the scene, but also all previously viewed parts, which enables replaying the complete reconstruction across all timesteps. Our method decomposes the scene into a set of rigid 3D primitives, which are assumed to be moving throughout the scene. Using estimated dense 2D correspondences, we jointly infer the rigid motion of these primitives through an optimisation pipeline, yielding a 4D reconstruction of the scene, i.e. providing 3D geometry dynamically moving through time. To achieve this, we also introduce a mechanism to extrapolate motion for objects that become invisible, employing motion-grouping techniques to maintain continuity. The resulting system enables 4D spatio-temporal awareness, offering capabilities such as replayable 3D reconstructions of articulated objects through time, multi-object scanning, and object permanence. On object scanning and multi-object datasets, our system significantly outperforms existing methods both quantitatively and qualitatively.

[93] Causal-Tune: Mining Causal Factors from Vision Foundation Models for Domain Generalized Semantic Segmentation

Yin Zhang,Yongqiang Zhang,Yaoyue Zheng,Bogdan Raducanu,Dan Liu

Main category: cs.CV

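TL;DR: This paper proposes Causal-Tune, a new fine-tuning method that separates and suppresses non-causal factors in vision foundation models (such as low- and high-frequency artifacts) in the frequency domain. It improves domain generalized semantic segmentation and clearly outperforms the baseline under adverse weather conditions.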

Details

Motivation: Existing methods for domain generalized semantic segmentation overlook the artifacts that long-term pre-training leaves in vision foundation models. These artifacts are associated with non-causal factors, degrade the feature representations, and limit generalization to unseen domains. Method: Features are converted to the frequency domain with the Discrete Cosine Transform, a Gaussian band-pass filter separates causal from non-causal components, learnable causal-aware tokens refine the causal part in the frequency domain, the non-causal part is discarded, and an inverse DCT returns the result to the spatial domain for subsequent processing. Result: Experiments on several cross-domain tasks show that Causal-Tune clearly improves performance, with a gain of 4.8% mIoU over the baseline in snow conditions. Conclusion: By explicitly modelling causal and non-causal factors in the frequency domain, Causal-Tune improves the robustness and performance of vision foundation models for domain generalized semantic segmentation. Abstract: Fine-tuning Vision Foundation Models (VFMs) with a small number of parameters has shown remarkable performance in Domain Generalized Semantic Segmentation (DGSS). Most existing works either train lightweight adapters or refine intermediate features to achieve better generalization on unseen domains. However, they both overlook the fact that long-term pre-trained VFMs often exhibit artifacts, which hinder the utilization of valuable representations and ultimately degrade DGSS performance. Inspired by causal mechanisms, we observe that these artifacts are associated with non-causal factors, which usually reside in the low- and high-frequency components of the VFM spectrum. In this paper, we explicitly examine the causal and non-causal factors of features within VFMs for DGSS, and propose a simple yet effective method to identify and disentangle them, enabling more robust domain generalization. Specifically, we propose Causal-Tune, a novel fine-tuning strategy designed to extract causal factors and suppress non-causal ones from the features of VFMs. First, we extract the frequency spectrum of features from each layer using the Discrete Cosine Transform (DCT). A Gaussian band-pass filter is then applied to separate the spectrum into causal and non-causal components. To further refine the causal components, we introduce a set of causal-aware learnable tokens that operate in the frequency domain, while the non-causal components are discarded. Finally, refined features are transformed back into the spatial domain via inverse DCT and passed to the next layer.
Extensive experiments conducted on various cross-domain tasks demonstrate the effectiveness of Causal-Tune. In particular, our method achieves superior performance under adverse weather conditions, improving mIoU by 4.8% over the baseline in snow conditions.
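
The DCT, band-pass, inverse-DCT sequence described in the abstract can be sketched with NumPy on a single 2D feature map. This is a rough illustration, not the authors' implementation: the radial Gaussian mask, the `center` and `sigma` values, and all function names are assumptions, and the causal-aware learnable tokens are omitted.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis as an n x n matrix (rows are frequencies)."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    mat = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    mat[0, :] = np.sqrt(1.0 / n)
    return mat


def gaussian_bandpass_mask(h: int, w: int, center: float, sigma: float) -> np.ndarray:
    """Gaussian weight over normalised radial frequency; peaks at `center`."""
    u = np.arange(h)[:, None] / h
    v = np.arange(w)[None, :] / w
    radius = np.sqrt(u ** 2 + v ** 2)
    return np.exp(-((radius - center) ** 2) / (2 * sigma ** 2))


def keep_band(feature: np.ndarray, center: float = 0.3, sigma: float = 0.15) -> np.ndarray:
    """Keep the mid-frequency band of a 2D feature map and drop the rest."""
    h, w = feature.shape
    ch, cw = dct_matrix(h), dct_matrix(w)
    spectrum = ch @ feature @ cw.T          # forward 2D DCT
    mask = gaussian_bandpass_mask(h, w, center, sigma)
    kept = spectrum * mask                  # band treated as "causal" in this sketch
    return ch.T @ kept @ cw                 # inverse 2D DCT


rng = np.random.default_rng(0)
feature = rng.normal(size=(8, 8))
filtered = keep_band(feature)
print(filtered.shape)
```

Because the basis is orthonormal, the inverse transform is just the transpose, which keeps the sketch free of extra dependencies.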

[94] CRONOS: Continuous Time Reconstruction for 4D Medical Longitudinal Series

Nico Albert Disch,Saikat Roy,Constantin Ulrich,Yannick Kirchhoff,Maximilian Rokuss,Robin Peretzke,David Zimmerer,Klaus Maier-Hein

Main category: cs.CV

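TL;DR: CRONOS is a unified framework for forecasting 3D medical scans over time. It supports many-to-one prediction from multiple past scans and is the first to achieve continuous-time, voxel-level sequence-to-image forecasting in 3D.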

Details

Motivation: Existing models struggle with voxel-level temporal forecasting under irregular sampling and mostly rely on a single prior scan or a fixed time grid, which limits clinical use. Method: CRONOS learns a spatio-temporal velocity field that transports context volumes to an output volume at an arbitrary target time, operates directly in 3D voxel space, and supports both discrete and continuous timestamps. Result: On three public datasets covering Cine-MRI, perfusion CT, and longitudinal MRI, CRONOS outperforms the baselines while remaining computationally efficient. Conclusion: CRONOS is the first framework to support multi-context, continuous-time forecasting of 3D medical scans, advancing disease progression modelling and reproducible multi-dataset benchmarking. Abstract: Forecasting how 3D medical scans evolve over time is important for disease progression, treatment planning, and developmental assessment. Yet existing models either rely on a single prior scan, fixed grid times, or target global labels, which limits voxel-level forecasting under irregular sampling. We present CRONOS, a unified framework for many-to-one prediction from multiple past scans that supports both discrete (grid-based) and continuous (real-valued) timestamps in one model, to the best of our knowledge the first to achieve continuous sequence-to-image forecasting for 3D medical data. CRONOS learns a spatio-temporal velocity field that transports context volumes toward a target volume at an arbitrary time, while operating directly in 3D voxel space. Across three public datasets spanning Cine-MRI, perfusion CT, and longitudinal MRI, CRONOS outperforms other baselines, while remaining computationally competitive. We will release code and evaluation protocols to enable reproducible, multi-dataset benchmarking of multi-context, continuous-time forecasting.

[95] Sketch-in-Latents: Eliciting Unified Reasoning in MLLMs

Jintao Tong,Jiaqi Gu,Yujing Lou,Lubin Fan,Yixiong Zou,Yue Wu,Jieping Ye,Ruixuan Li

Main category: cs.CV

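TL;DR: The paper proposes Sketch-in-Latents (SkiLa), a new paradigm in which a multimodal large model generates latent sketch tokens in a unified feature space as visual thoughts. This enables joint textual and visual reasoning and improves both performance on vision-centric tasks and generalization.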

Details

Motivation: Current multimodal large language models fall short on tasks that require visual imagination, whereas humans can imagine flexibly across vision and text in a unified space without predefined tools. Inspired by this, the authors use the feature space that MLLMs already share across modalities to insert visual tokens seamlessly into textual reasoning and achieve unified multimodal thinking. Method: The SkiLa paradigm extends the auto-regressive capability of MLLMs to natively generate continuous visual embeddings (latent sketch tokens), switches dynamically between a textual thinking mode and a visual sketching mode during multi-step reasoning, and adds a latent visual semantics reconstruction mechanism so that the generated sketch tokens stay semantically consistent. Result: Experiments show that SkiLa performs strongly on vision-centric tasks and generalizes well across a range of general multimodal benchmarks. Conclusion: SkiLa achieves collaborative reasoning over text and vision in a unified feature space, shows that generating visual thoughts in latent space is effective, and offers a new route to visual imagination in multimodal models. Abstract: While Multimodal Large Language Models (MLLMs) excel at visual understanding tasks through text reasoning, they often fall short in scenarios requiring visual imagination. Unlike current works that rely on predefined external toolkits or generate images during thinking, humans can form flexible visual-text imagination and interactions during thinking without predefined toolkits; one important reason is that humans construct the visual-text thinking process in a unified space inside the brain. Inspired by this capability, given that current MLLMs already encode visual and text information in the same feature space, we hold that visual tokens can be seamlessly inserted into the reasoning process carried by text tokens, where ideally, all visual imagination processes can be encoded by the latent features. To achieve this goal, we propose Sketch-in-Latents (SkiLa), a novel paradigm for unified multi-modal reasoning that expands the auto-regressive capabilities of MLLMs to natively generate continuous visual embeddings, termed latent sketch tokens, as visual thoughts. During multi-step reasoning, the model dynamically alternates between textual thinking mode for generating textual think tokens and visual sketching mode for generating latent sketch tokens. A latent visual semantics reconstruction mechanism is proposed to ensure these latent sketch tokens are semantically grounded. Extensive experiments demonstrate that SkiLa achieves superior performance on vision-centric tasks while exhibiting strong generalization to diverse general multi-modal benchmarks.
Code will be released at https://github.com/TungChintao/SkiLa.

[96] Yuan-TecSwin: A text conditioned Diffusion model with Swin-transformer blocks

Shaohua Wu,Tong Yu,Shenling Wang,Xudong Zhao

Main category: cs.CV

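TL;DR: The paper proposes Yuan-TecSwin, a text-conditioned diffusion model built on Swin-Transformer blocks. Replacing the CNN blocks strengthens non-local modelling; the model reaches a state-of-the-art FID of 1.37 on ImageNet and markedly improves the realism of generated images.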

Details

Motivation: The locality of convolutional neural networks (CNNs) in diffusion models limits their grasp of long-range semantic information, which affects the quality and semantic consistency of generated images. Method: Swin-Transformer blocks replace the CNN blocks in the U-shaped architecture to improve non-local feature extraction and image restoration in the encoder and decoder; the text encoder, the use of text embeddings, and the way the text condition is incorporated are tuned to improve text-image alignment; and an adapted time step search improves inference performance. Result: The model achieves an FID of 1.37 on ImageNet generation, the current best result, without additional models at different denoising stages; human participants find it hard to tell generated images from human-painted ones; and inference performance improves by 10%. Conclusion: By introducing Swin-Transformer blocks and refining text-condition fusion, Yuan-TecSwin markedly improves the quality and realism of text-to-image generation without added model complexity, with strong non-local modelling and semantic alignment. Abstract: Diffusion models have shown remarkable capacity in image synthesis based on their U-shaped architecture and convolutional neural networks (CNN) as basic blocks. The locality of the convolution operation in CNN may limit the model's ability to understand long-range semantic information. To address this issue, we propose Yuan-TecSwin, a text-conditioned diffusion model with Swin-transformer in this work. The Swin-transformer blocks take the place of CNN blocks in the encoder and decoder, to improve the non-local modeling ability in feature extraction and image restoration. The text-image alignment is improved with a well-chosen text encoder, effective utilization of text embedding, and careful design in the incorporation of text condition. Using an adapted time step to search in different diffusion stages, inference performance is further improved by 10%. Yuan-TecSwin achieves the state-of-the-art FID score of 1.37 on ImageNet generation benchmark, without any additional models at different denoising stages. In a side-by-side comparison, we find it difficult for human interviewees to tell the model-generated images from the human-painted ones.

[97] Hazedefy: A Lightweight Real-Time Image and Video Dehazing Pipeline for Practical Deployment

Ayush Bhavsar

Main category: cs.CV

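TL;DR: This paper presents Hazedefy, a lightweight, application-oriented pipeline for real-time video dehazing and enhancement. It builds on the Dark Channel Prior and the atmospheric scattering model, is computationally simple, and can be deployed on consumer-grade hardware.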

Details

Motivation: The goal is efficient dehazing of real-time video and live camera feeds on consumer-grade hardware, improving visibility and contrast without depending on GPU acceleration. Method: The dehazing pipeline combines gamma-adaptive reconstruction, a fast transmission approximation (with a lower bound for numerical stability), a stabilized atmospheric light estimate based on fractional top-pixel averaging, and an optional color balance stage. Result: Experiments on real-world images and videos show that the method improves visibility and contrast and runs without a GPU. Conclusion: Hazedefy is a real-time dehazing solution for mobile and embedded devices that balances performance and quality and is well suited to practical deployment. Abstract: This paper introduces Hazedefy, a lightweight and application-focused dehazing pipeline intended for real-time video and live camera feed enhancement. Hazedefy prioritizes computational simplicity and practical deployability on consumer-grade hardware, building upon the Dark Channel Prior (DCP) concept and the atmospheric scattering model. Key elements include gamma-adaptive reconstruction, a fast transmission approximation with lower bounds for numerical stability, a stabilized atmospheric light estimator based on fractional top-pixel averaging, and an optional color balance stage. The pipeline is suitable for mobile and embedded applications, as experimental demonstrations on real-world images and videos show improved visibility and contrast without requiring GPU acceleration.
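
For orientation, a classical Dark Channel Prior pipeline covering three of the listed elements (transmission with a lower bound, atmospheric light from the brightest fraction of dark-channel pixels, and scene recovery) might look like the sketch below. It is not Hazedefy itself: the parameter values `omega`, `t_min`, `top_fraction`, and `patch` are assumed defaults, all function names are hypothetical, and the gamma-adaptive reconstruction and color balance stages are left out because the abstract does not describe them.

```python
import numpy as np


def dark_channel(img: np.ndarray, patch: int = 3) -> np.ndarray:
    """Per-pixel minimum over colour channels and a patch x patch window (patch odd)."""
    min_rgb = img.min(axis=2)
    pad = patch // 2
    padded = np.pad(min_rgb, pad, mode="edge")
    h, w = min_rgb.shape
    out = np.full((h, w), np.inf)
    for dy in range(patch):
        for dx in range(patch):
            out = np.minimum(out, padded[dy:dy + h, dx:dx + w])
    return out


def atmospheric_light(img: np.ndarray, dark: np.ndarray, top_fraction: float = 0.001) -> np.ndarray:
    """Average the colours of the pixels with the largest dark-channel values."""
    flat_dark = dark.ravel()
    k = max(1, int(flat_dark.size * top_fraction))
    idx = np.argpartition(flat_dark, -k)[-k:]
    return img.reshape(-1, 3)[idx].mean(axis=0)


def dehaze(img: np.ndarray, omega: float = 0.95, t_min: float = 0.1, patch: int = 3) -> np.ndarray:
    """Invert the scattering model I = J * t + A * (1 - t) for an RGB image in [0, 1]."""
    dark = dark_channel(img, patch)
    light = np.maximum(atmospheric_light(img, dark), 1e-6)
    transmission = 1.0 - omega * dark_channel(img / light, patch)
    transmission = np.maximum(transmission, t_min)   # lower bound avoids division blow-up
    scene = (img - light) / transmission[..., None] + light
    return np.clip(scene, 0.0, 1.0)


rng = np.random.default_rng(0)
hazy = 0.6 * rng.random((16, 16, 3)) + 0.4 * 0.9   # synthetic haze with t = 0.6, A = 0.9
restored = dehaze(hazy)
print(restored.shape, float(restored.min()), float(restored.max()))
```

Everything here is plain array arithmetic with a small fixed window, which is consistent with the abstract's emphasis on running without GPU acceleration.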

[98] Trainable Log-linear Sparse Attention for Efficient Diffusion Transformers

Yifan Zhou,Zeqi Xiao,Tianyi Wei,Shuai Yang,Xingang Pan

Main category: cs.CV

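TL;DR: This paper proposes Log-linear Sparse Attention (LLSA), an efficient sparse attention mechanism for long-sequence diffusion Transformers. A hierarchical structure reduces the cost of both selection and attention from quadratic to log-linear, substantially speeding up training and inference while preserving generation quality.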

Details

Motivation: Existing Top-K sparse attention methods still incur a quadratic selection cost on long sequences and need a larger K to maintain quality, because their single-level design cannot capture global structure effectively, which limits how far DiTs can scale. Method: LLSA uses hierarchical Top-K sparse selection that refines the choice of key blocks level by level, adds a Hierarchical KV Enrichment mechanism to preserve global context at multiple granularities, and comes with a high-performance GPU implementation that uses only sparse indices in the forward and backward passes. Result: On 256x256 pixel token sequences, LLSA speeds up attention inference by 28.27x and DiT training by 6.09x while maintaining generation quality, and it enables high-resolution image generation without patchification or VAE encoding. Conclusion: LLSA offers a viable path to training long-sequence Diffusion Transformers efficiently, substantially reducing computational cost and supporting visual generation over longer sequences. Abstract: Diffusion Transformers (DiTs) set the state of the art in visual generation, yet their quadratic self-attention cost fundamentally limits scaling to long token sequences. Recent Top-K sparse attention approaches reduce the computation of DiTs by compressing tokens into block-wise representation and selecting a small set of relevant key blocks, but still suffer from (i) quadratic selection cost on compressed tokens and (ii) increasing K required to maintain model quality as sequences grow. We identify that their inefficiency is due to the single-level design, as a single coarse level is insufficient to represent the global structure. In this paper, we introduce Log-linear Sparse Attention (LLSA), a trainable sparse attention mechanism for extremely long token sequences that reduces both selection and attention costs from quadratic to log-linear complexity by utilizing a hierarchical structure. LLSA performs hierarchical Top-K selection, progressively adopting sparse Top-K selection with the indices found at the previous level, and introduces a Hierarchical KV Enrichment mechanism that preserves global context while using fewer tokens of different granularity during attention computation. To support efficient training, we develop a high-performance GPU implementation that uses only sparse indices for both the forward and backward passes, eliminating the need for dense attention masks. We evaluate LLSA on high-resolution pixel-space image generation without using patchification and VAE encoding.
LLSA accelerates attention inference by 28.27x and DiT training by 6.09x on 256x256 pixel token sequences, while maintaining generation quality. The results demonstrate that LLSA offers a promising direction for training long-sequence DiTs efficiently. Code is available at: https://github.com/SingleZombie/LLSA
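
The coarse-to-fine selection idea can be illustrated with a two-level toy in NumPy: pick the best coarse blocks first, then search only their children at the finer level. This is a simplified sketch and not the released implementation: mean pooling, dot-product scoring, the block size, and `k` are all assumptions, the function name is hypothetical, and Hierarchical KV Enrichment is omitted.

```python
import numpy as np


def hierarchical_topk(query: np.ndarray, keys: np.ndarray, block: int = 4, k: int = 2) -> np.ndarray:
    """Return indices of the k fine key blocks chosen by a two-level coarse-to-fine search."""
    n, d = keys.shape
    if n % (block * block) != 0:
        raise ValueError("sequence length must be divisible by block ** 2")

    # Level 1 (fine): mean-pool tokens into blocks; level 2 (coarse): pool blocks again.
    fine = keys.reshape(n // block, block, d).mean(axis=1)
    coarse = fine.reshape(-1, block, d).mean(axis=1)

    # Select the top-k coarse blocks for this query.
    top_coarse = np.argsort(coarse @ query)[-k:]

    # Only the children of the selected coarse blocks are scored at the fine level.
    candidates = (top_coarse[:, None] * block + np.arange(block)[None, :]).ravel()
    fine_scores = fine[candidates] @ query
    top_fine = candidates[np.argsort(fine_scores)[-k:]]
    return np.sort(top_fine)


keys = np.zeros((64, 2))
keys[20:24] = [10.0, 0.0]            # tokens 20..23 form fine block 5
query = np.array([1.0, 0.0])
selected = hierarchical_topk(query, keys)
print(selected)
```

At each level the query is scored against only `k * block` candidates instead of every block, which is the source of the savings that a deeper hierarchy turns into log-linear cost.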

[99] Plug to Place: Indoor Multimedia Geolocation from Electrical Sockets for Digital Investigation

Kanwal Aftab,Graham Adams,Mark Scanlon

Main category: cs.CV

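TL;DR: This paper proposes a three-stage deep learning pipeline that uses electrical socket types for indoor multimedia geolocation, with significant forensic value in combating crimes such as human trafficking and child exploitation.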

Details

Motivation: Indoor multimedia geolocation is comparatively under-researched because indoor environments have similar layouts, highly variable lighting, and unreliable GPS signals, and existing methods do not meet practical digital forensics needs. Method: YOLOv11 detects sockets (mAP@0.5 = 0.843), an Xception model classifies 12 socket types (accuracy 0.912), and the socket types are then mapped to countries (accuracy 0.96 at >90% confidence). Two dedicated datasets were built and augmented. Result: The method was validated under real-world conditions on the TraffickCam data, which is closer to realistic scenarios than travel-website images; the whole framework is open source and has practical potential. Conclusion: Using standardised electrical sockets as stable indoor landmarks is feasible and effective, and offers a practical solution for indoor geolocation in digital forensics. Abstract: Computer vision is a rapidly evolving field, giving rise to powerful new tools and techniques in digital forensic investigation, and shows great promise for novel digital forensic applications. One such application, indoor multimedia geolocation, has the potential to become a crucial aid for law enforcement in the fight against human trafficking, child exploitation, and other serious crimes. While outdoor multimedia geolocation has been widely explored, its indoor counterpart remains underdeveloped due to challenges such as similar room layouts, frequent renovations, visual ambiguity, indoor lighting variability, unreliable GPS signals, and limited datasets in sensitive domains. This paper introduces a pipeline that uses electric sockets as consistent indoor markers for geolocation, since plug socket types are standardised by country or region. The three-stage deep learning pipeline detects plug sockets (YOLOv11, mAP@0.5 = 0.843), classifies them into one of 12 plug socket types (Xception, accuracy = 0.912), and maps the detected socket types to countries (accuracy = 0.96 at >90% threshold confidence). To address data scarcity, two dedicated datasets were created: a socket detection dataset of 2,328 annotated images expanded to 4,072 through augmentation, and a classification dataset of 3,187 images across 12 plug socket classes. The pipeline was evaluated on the Hotels-50K dataset, focusing on the TraffickCam subset of crowd-sourced hotel images, which capture real-world conditions such as poor lighting and amateur angles.
This dataset provides a more realistic evaluation than using professional, well-lit, often wide-angle images from travel websites. This framework demonstrates a practical step toward real-world digital forensic applications. The code, trained models, and the data for this paper are available open source.
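
The third stage, mapping a classified socket type to candidate countries once the classifier is confident enough, can be sketched as a filtered lookup. The table below is a small, deliberately incomplete illustration and not the paper's 12-class mapping; the function name and the input format of `(socket_type, confidence)` pairs are assumptions.

```python
from typing import Dict, List, Tuple

# Illustrative, incomplete lookup table (not the mapping used in the paper).
SOCKET_TO_COUNTRIES: Dict[str, List[str]] = {
    "G": ["United Kingdom", "Ireland", "Malta", "Singapore"],
    "I": ["Australia", "New Zealand", "Argentina"],
    "J": ["Switzerland", "Liechtenstein"],
    "L": ["Italy", "Chile"],
}


def candidate_countries(
    predictions: List[Tuple[str, float]], threshold: float = 0.9
) -> List[str]:
    """Map the most confident socket prediction above `threshold` to candidate countries."""
    confident = [(kind, conf) for kind, conf in predictions if conf > threshold]
    if not confident:
        return []  # abstain rather than guess from low-confidence detections
    best_kind, _ = max(confident, key=lambda pair: pair[1])
    return SOCKET_TO_COUNTRIES.get(best_kind, [])


print(candidate_countries([("G", 0.97), ("L", 0.55)]))
print(candidate_countries([("I", 0.62)]))
```

Abstaining below the threshold mirrors the abstract's reporting of accuracy only at above 90% confidence, and returning a list reflects that one socket type is typically shared by several countries.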

[100] DeContext as Defense: Safe Image Editing in Diffusion Transformers

Linghui Shen,Mingyue Cui,Xingyi Yang

Main category: cs.CV

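TL;DR: This paper proposes DeContext, a method that defends against unauthorized image editing in in-context diffusion models by disrupting multimodal attention pathways, breaking the link between input and output while preserving image quality.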

Details

Motivation: In-context diffusion models make images easy to modify, raising concerns about privacy leakage and malicious forgery, so a defense is needed that protects personal images from use without consent. Method: Analysis shows that contextual information propagates mainly through multimodal attention layers, so small, targeted perturbations are injected at the key denoising steps and Transformer blocks to weaken the cross-attention pathways and cut off context transfer. Result: Experiments on Flux Kontext and Step1X-Edit show that DeContext consistently blocks unauthorized image edits while preserving good visual quality. Conclusion: Attention-based perturbation is an efficient and robust defense for protecting image privacy in modern large-scale in-context diffusion models. Abstract: In-context diffusion models allow users to modify images with remarkable ease and realism. However, the same power raises serious privacy concerns: personal images can be easily manipulated for identity impersonation, misinformation, or other malicious uses, all without the owner's consent. While prior work has explored input perturbations to protect against misuse in personalized text-to-image generation, the robustness of modern, large-scale in-context DiT-based models remains largely unexamined. In this paper, we propose DeContext, a new method to safeguard input images from unauthorized in-context editing. Our key insight is that contextual information from the source image propagates to the output primarily through multimodal attention layers. By injecting small, targeted perturbations that weaken these cross-attention pathways, DeContext breaks this flow, effectively decoupling the link between input and output. This simple defense is both efficient and robust. We further show that early denoising steps and specific transformer blocks dominate context propagation, which allows us to concentrate perturbations where they matter most. Experiments on Flux Kontext and Step1X-Edit show that DeContext consistently blocks unwanted image edits while preserving visual quality. These results highlight the effectiveness of attention-based perturbations as a powerful defense against image manipulation.

[101] SARMAE: Masked Autoencoder for SAR Representation Learning

Danxu Liu,Di Wang,Hebaixu Wang,Haoyang Chen,Wentao Jiang,Yilin Cheng,Haonan Guo,Wei Cui,Jing Zhang

Main category: cs.CV

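TL;DR: The paper proposes SARMAE, a noise-aware masked autoencoder for self-supervised SAR representation learning. It builds SAR-1M, the first million-scale SAR dataset, and designs Speckle-Aware Representation Enhancement (SARE) and Semantic Anchor Representation Constraint (SARC) to improve the robustness and semantic consistency of the learned representations.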

Details

Motivation: Deep learning for SAR is constrained by data scarcity, and the speckle noise inherent in SAR imagery hampers fine-grained semantic representation learning. Method: The authors build the large-scale SAR-1M dataset and propose the SARMAE framework, which contains a SARE module (injecting SAR-specific speckle noise) and a SARC module (using paired optical image priors for semantic alignment). Result: Across multiple SAR datasets, SARMAE achieves state-of-the-art performance on classification, detection, and segmentation tasks. Conclusion: Through its noise-aware and semantically consistent design, SARMAE effectively improves self-supervised representation learning for SAR imagery. Abstract: Synthetic Aperture Radar (SAR) imagery plays a critical role in all-weather, day-and-night remote sensing applications. However, existing SAR-oriented deep learning is constrained by data scarcity, while the physically grounded speckle noise in SAR imagery further hampers fine-grained semantic representation learning. To address these challenges, we propose SARMAE, a Noise-Aware Masked Autoencoder for self-supervised SAR representation learning. Specifically, we construct SAR-1M, the first million-scale SAR dataset, with additional paired optical images, to enable large-scale pre-training. Building upon this, we design Speckle-Aware Representation Enhancement (SARE), which injects SAR-specific speckle noise into masked autoencoders to facilitate noise-aware and robust representation learning. Furthermore, we introduce Semantic Anchor Representation Constraint (SARC), which leverages paired optical priors to align SAR features and ensure semantic consistency. Extensive experiments across multiple SAR datasets demonstrate that SARMAE achieves state-of-the-art performance on classification, detection, and segmentation tasks. Code and models will be available at https://github.com/MiliLab/SARMAE.
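
Speckle injection of the kind SARE describes is commonly simulated as multiplicative noise. The sketch below uses the standard multi-look intensity model, in which the noise follows a gamma distribution with unit mean; whether SARMAE uses this exact model is not stated in the abstract, so the model choice, the `looks` parameter, and the function name are assumptions.

```python
import numpy as np


def inject_speckle(intensity: np.ndarray, looks: float = 4.0, rng=None) -> np.ndarray:
    """Multiply an intensity image by unit-mean gamma noise (multi-look speckle model).

    Fewer looks means stronger speckle: the noise variance is 1 / looks.
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.gamma(shape=looks, scale=1.0 / looks, size=intensity.shape)
    return intensity * noise


rng = np.random.default_rng(0)
clean = np.ones((256, 256))
noisy = inject_speckle(clean, looks=4.0, rng=rng)
print(float(noisy.mean()), float(noisy.var()))  # mean near 1, variance near 1 / looks
```

Because the noise is multiplicative, bright regions receive proportionally larger perturbations, which is the property that distinguishes speckle from the additive Gaussian noise often used in optical image augmentation.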

[102] REGLUE Your Latents with Global and Local Semantics for Entangled Diffusion

Giorgos Petsangourakis,Christos Sgouropoulos,Bill Psomas,Theodoros Giannakopoulos,Giorgos Sfikas,Ioannis Kakogeorgiou

Main category: cs.CV

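TL;DR: This paper proposes REGLUE, a unified latent diffusion framework that jointly models VAE latents, local Vision Foundation Model (VFM) semantics, and a global [CLS] token within a single SiT backbone for more efficient image generation. A lightweight convolutional semantic compressor nonlinearly aggregates multi-layer VFM features and, combined with an external alignment loss, significantly improves FID on ImageNet and accelerates convergence.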

Details

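Motivation: Existing latent diffusion models lack direct high-level semantic supervision, which slows training and limits sample quality; current methods that incorporate vision foundation models do not fully exploit their rich multi-layer spatial semantics. Method: Proposes the REGLUE framework, which jointly models VAE image latents, compact local (patch-level) VFM semantics, and a global (image-level) [CLS] token within a single SiT backbone; a lightweight convolutional semantic compressor nonlinearly aggregates multi-layer VFM features into a low-dimensional, spatially structured representation that is entangled with the VAE latents during diffusion; an external alignment loss regularizes internal representations toward frozen VFM targets. Result: On ImageNet 256x256, REGLUE outperforms the SiT-B/2 and SiT-XL/2 baselines as well as REPA, ReDi, and REG in both FID and convergence speed; experiments show that spatial VFM semantics, nonlinear compression, global tokens, and external alignment are all key factors. Conclusion: Through joint global-local-latent modeling, REGLUE effectively integrates the multi-level semantics of vision foundation models, improving the generation quality and training efficiency of latent diffusion models and confirming the importance of spatial semantics, nonlinear compression, and multi-scale alignment. Abstract: Latent diffusion models (LDMs) achieve state-of-the-art image synthesis, yet their reconstruction-style denoising objective provides only indirect semantic supervision: high-level semantics emerge slowly, requiring longer training and limiting sample quality. Recent works inject semantics from Vision Foundation Models (VFMs) either externally via representation alignment or internally by jointly modeling only a narrow slice of VFM features inside the diffusion process, under-utilizing the rich, nonlinear, multi-layer spatial semantics available. We introduce REGLUE (Representation Entanglement with Global-Local Unified Encoding), a unified latent diffusion framework that jointly models (i) VAE image latents, (ii) compact local (patch-level) VFM semantics, and (iii) a global (image-level) [CLS] token within a single SiT backbone. A lightweight convolutional semantic compressor nonlinearly aggregates multi-layer VFM features into a low-dimensional, spatially structured representation, which is entangled with the VAE latents in the diffusion process. An external alignment loss further regularizes internal representations toward frozen VFM targets. On ImageNet 256x256, REGLUE consistently improves FID and accelerates convergence over SiT-B/2 and SiT-XL/2 baselines, as well as over REPA, ReDi, and REG. 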
Extensive experiments show that (a) spatial VFM semantics are crucial, (b) non-linear compression is key to unlocking their full benefit, and (c) global tokens and external alignment act as complementary, lightweight enhancements within our global-local-latent joint modeling framework. The code is available at https://github.com/giorgospets/reglue .

[103] FrameDiffuser: G-Buffer-Conditioned Diffusion for Neural Forward Frame Rendering

Ole Beisswenger,Jan-Niklas Dihlmann,Hendrik P. A. Lensch

Main category: cs.CV

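TL;DR: This paper proposes FrameDiffuser, an autoregressive neural rendering framework that generates temporally consistent, photorealistic images from G-buffer data for interactive applications.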

Details

Motivation: Existing single-frame diffusion models lack temporal consistency, while video models are computationally expensive and unsuited to real-time interactive scenarios. Method: Generates frames autoregressively, combining ControlNet and ControlLoRA to control structural and temporal consistency, with a three-stage training strategy for stable generation. Result: Achieves high-quality rendering with accurate lighting in specific environments, with good temporal consistency and fast inference. Conclusion: FrameDiffuser is well suited to interactive applications and, with environment-specific training, outperforms generalized approaches. Abstract: Neural rendering for interactive applications requires translating geometric and material properties (G-buffer) to photorealistic images with realistic lighting on a frame-by-frame basis. While recent diffusion-based approaches show promise for G-buffer-conditioned image synthesis, they face critical limitations: single-image models like RGBX generate frames independently without temporal consistency, while video models like DiffusionRenderer are too computationally expensive for most consumer gaming setups and require complete sequences upfront, making them unsuitable for interactive applications where future frames depend on user input. We introduce FrameDiffuser, an autoregressive neural rendering framework that generates temporally consistent, photorealistic frames by conditioning on G-buffer data and the model's own previous output. After an initial frame, FrameDiffuser operates purely on incoming G-buffer data, comprising geometry, materials, and surface properties, while using its previously generated frame for temporal guidance, maintaining stable, temporally consistent generation over hundreds to thousands of frames. Our dual-conditioning architecture combines ControlNet for structural guidance with ControlLoRA for temporal coherence. A three-stage training strategy enables stable autoregressive generation. We specialize our model to individual environments, prioritizing consistency and inference speed over broad generalization, demonstrating that environment-specific training achieves superior photorealistic quality with accurate lighting, shadows, and reflections compared to generalized approaches.

[104] Few-Shot Fingerprinting Subject Re-Identification in 3D-MRI and 2D-X-Ray

Gonçalo Gaspar Alves,Shekoufeh Gorgi Zadeh,Andreas Husch,Ben Bausch

Main category: cs.CV

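TL;DR: Proposes a subject fingerprinting method based on ResNet-50 and triplet loss to identify duplicate subjects across datasets, mitigating data leakage and achieving high retrieval accuracy on several kinds of medical imaging data.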

Details

Motivation: When open-source datasets are merged, the same subject may appear in more than one of them, causing data leakage and inflated performance estimates; duplicate subjects must be identified and handled. Method: Trains a ResNet-50 with triplet margin loss so that different images of the same subject map to nearby regions of latent space, enabling few-shot subject fingerprinting and similarity matching. Result: Achieves Mean Recall@K of 99.10% (20-way 1-shot) on ChestXray-14 and 99.20% (20-way 1-shot) on BraTS-2021, showing strong few-shot identification ability. Conclusion: Subject fingerprinting effectively identifies duplicate individuals across datasets, substantially reduces the risk of data leakage, applies to multiple medical imaging modalities, and has practical potential. Abstract: Combining open-source datasets can introduce data leakage if the same subject appears in multiple sets, leading to inflated model performance. To address this, we explore subject fingerprinting, mapping all images of a subject to a distinct region in latent space, to enable subject re-identification via similarity matching. Using a ResNet-50 trained with triplet margin loss, we evaluate few-shot fingerprinting on 3D MRI and 2D X-ray data in both standard (20-way 1-shot) and challenging (1000-way 1-shot) scenarios. The model achieves high Mean-Recall-@-K scores: 99.10% (20-way 1-shot) and 90.06% (500-way 5-shot) on ChestXray-14; 99.20% (20-way 1-shot) and 98.86% (100-way 3-shot) on BraTS-2021.
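
The two ingredients named in this abstract, triplet margin loss and Recall@K over a few-shot gallery, can be sketched on toy embeddings. This is an illustrative sketch only: the embeddings are plain tuples rather than ResNet-50 features, and the function names, the Euclidean metric, and the margin of 1.0 are assumptions rather than details from the paper.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Zero once the negative is at least `margin` farther from the anchor than the positive."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def recall_at_k(query, query_subject, gallery, k=1):
    """Return 1.0 if the query's subject is among the k nearest gallery entries.

    gallery: list of (subject_id, embedding), e.g. one entry per subject in an
    N-way 1-shot episode.
    """
    ranked = sorted(gallery, key=lambda item: euclidean(query, item[1]))
    return 1.0 if any(sid == query_subject for sid, _ in ranked[:k]) else 0.0


# Toy 3-way 1-shot episode with 2D embeddings.
gallery = [("s1", (0.0, 0.0)), ("s2", (5.0, 5.0)), ("s3", (-4.0, 3.0))]
hit = recall_at_k((0.2, -0.1), "s1", gallery, k=1)
loss_separated = triplet_margin_loss((0.0, 0.0), (0.0, 1.0), (3.0, 4.0))
loss_violated = triplet_margin_loss((0.0, 0.0), (0.0, 1.0), (1.0, 0.0))
```

Averaging `recall_at_k` over many query images and episodes would give a mean Recall@K figure of the kind reported above.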

[105] Detecting Localized Deepfakes: How Well Do Synthetic Image Detectors Handle Inpainting?

Serafino Pandolfini,Lorenzo Pellegrini,Matteo Ferrara,Davide Maltoni

Main category: cs.CV

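TL;DR: This paper systematically evaluates how well state-of-the-art deepfake detectors generalize to localized inpainting detection, finding that models trained on a large set of generators can effectively detect medium- and large-area edits or regeneration-style inpainting.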

Details

Motivation: As generative AI advances, localized image editing is increasingly used in cybersecurity threat scenarios, yet how well existing detectors handle such localized tampering remains unclear. Method: Uses multiple datasets spanning different generators, mask sizes, and inpainting techniques to run a cross-task evaluation of state-of-the-art detectors originally built for fully synthetic image detection. Result: Models trained on a large set of generators transfer partially to localized inpainting detection; they reliably detect medium- and large-area manipulations or regeneration-style inpainting and outperform many dedicated detection methods. Conclusion: Current state-of-the-art detectors show some cross-task generalization, particularly for more conspicuous localized tampering, which supports the feasibility of image authenticity verification in practice. Abstract: The rapid progress of generative AI has enabled highly realistic image manipulations, including inpainting and region-level editing. These approaches preserve most of the original visual context and are increasingly exploited in cybersecurity-relevant threat scenarios. While numerous detectors have been proposed for identifying fully synthetic images, their ability to generalize to localized manipulations remains insufficiently characterized. This work presents a systematic evaluation of state-of-the-art detectors, originally trained for deepfake detection on fully synthetic images, when applied to a distinct challenge: localized inpainting detection. The study leverages multiple datasets spanning diverse generators, mask sizes, and inpainting techniques. Our experiments show that models trained on a large set of generators exhibit partial transferability to inpainting-based edits and can reliably detect medium- and large-area manipulations or regeneration-style inpainting, outperforming many existing ad hoc detection approaches.

[106] SDFoam: Signed-Distance Foam for explicit surface reconstruction

Antonella Rech,Nicola Conci,Nicola Garau

Main category: cs.CV

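TL;DR: This paper proposes SDFoam, which jointly learns an explicit Voronoi diagram and an implicit signed distance field (SDF), substantially improving the mesh reconstruction accuracy and surface quality of neural radiance fields while retaining the rendering efficiency of RadiantFoam.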

Details

Motivation: Methods such as NeRF, 3DGS, and RadiantFoam perform well at view synthesis but still fall short on precise mesh reconstruction, suffering from blurry surfaces, floaters, and topological errors. Method: Proposes SDFoam, which combines an explicit Voronoi diagram (for efficient ray tracing) with an implicit SDF (for geometric regularization); the scene is optimized via ray tracing with an Eikonal constraint for geometric consistency, so that Voronoi cell faces align with the zero level set and yield sharper surfaces. Result: SDFoam substantially improves the Chamfer distance of reconstructed meshes across diverse scenes while keeping PSNR and SSIM comparable to RadiantFoam, with no loss in training speed. Conclusion: Through joint implicit-explicit modeling, SDFoam improves surface quality and topology without sacrificing rendering efficiency, offering a new approach to high-quality view synthesis and accurate 3D reconstruction. Abstract: Neural radiance fields (NeRF) have driven impressive progress in view synthesis by using ray-traced volumetric rendering. Splatting-based methods such as 3D Gaussian Splatting (3DGS) provide faster rendering by rasterizing 3D primitives. RadiantFoam (RF) brought ray tracing back, achieving throughput comparable to Gaussian Splatting by organizing radiance with an explicit Voronoi Diagram (VD). Yet, all the mentioned methods still struggle with precise mesh reconstruction. We address this gap by jointly learning an explicit VD with an implicit Signed Distance Field (SDF). The scene is optimized via ray tracing and regularized by an Eikonal objective. The SDF introduces metric-consistent isosurfaces, which, in turn, bias near-surface Voronoi cell faces to align with the zero level set. The resulting model produces crisper, view-consistent surfaces with fewer floaters and improved topology, while preserving photometric quality and maintaining training speed on par with RadiantFoam. Across diverse scenes, our hybrid implicit-explicit formulation, which we name SDFoam, substantially improves mesh reconstruction accuracy (Chamfer distance) with comparable appearance (PSNR, SSIM), without sacrificing efficiency.
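
An Eikonal objective penalizes deviation of the SDF gradient norm from 1, which is what makes the level sets metrically consistent. The sketch below evaluates that penalty with central finite differences on an analytic sphere SDF. It is a toy illustration with hypothetical function names; SDFoam itself presumably computes gradients of a learned field, and nothing here is taken from its implementation.

```python
import math


def sphere_sdf(p, radius=1.0):
    """Exact signed distance from point p to a sphere centred at the origin."""
    return math.sqrt(sum(c * c for c in p)) - radius


def gradient(f, p, eps=1e-4):
    """Central finite-difference gradient of a scalar field f at point p."""
    grad = []
    for i in range(len(p)):
        hi, lo = list(p), list(p)
        hi[i] += eps
        lo[i] -= eps
        grad.append((f(hi) - f(lo)) / (2.0 * eps))
    return grad


def eikonal_loss(f, points):
    """Mean of (|grad f| - 1)^2 over the sample points."""
    total = 0.0
    for p in points:
        norm = math.sqrt(sum(g * g for g in gradient(f, p)))
        total += (norm - 1.0) ** 2
    return total / len(points)


samples = [(0.5, 0.2, -0.3), (1.5, -0.4, 0.8), (-2.0, 1.0, 0.1)]
loss_true_sdf = eikonal_loss(sphere_sdf, samples)
# A field scaled by 2 has gradient norm 2 everywhere, so each term is (2 - 1)^2.
loss_scaled = eikonal_loss(lambda p: 2.0 * sphere_sdf(p), samples)
```

A true distance field gives a loss near zero, while the scaled field, which has the same zero level set but wrong metric, is penalized; this is the sense in which the regularizer constrains geometry beyond the surface location.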

[107] A multi-centre, multi-device benchmark dataset for landmark-based comprehensive fetal biometry

Chiara Di Vece,Zhehua Mao,Netanell Avisdris,Brian Dromey,Raffaele Napolitano,Dafna Ben Bashat,Francisco Vasconcelos,Danail Stoyanov,Leo Joskowicz,Sophia Bano

Main category: cs.CV

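TL;DR: This paper introduces an open, multi-centre, multi-device benchmark dataset of fetal ultrasound images with expert-annotated anatomical landmarks covering the main fetal biometric measurements, intended to advance research on AI-assisted fetal growth assessment.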

Details

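Motivation: Manually marking anatomical landmarks in ultrasound images is time-consuming and operator-dependent, and it varies across devices and centres, which limits the reproducibility of automated methods; a multi-source annotated dataset is needed to advance AI-assisted fetal growth assessment. Method: Collects 4,513 de-identified ultrasound images acquired at three clinical centres with seven different ultrasound devices, has experts annotate the key anatomical landmarks, and provides standardised train/test splits, evaluation code, and baseline results; an automatic biometry model is used to quantify the effect of domain shift. Result: This is the first public multi-centre, multi-device, landmark-annotated dataset covering all primary fetal biometric measurements; experiments show that training and testing within a single centre overestimates model performance, and that cross-centre testing is a better measure of generalisation. Conclusion: The dataset provides a reliable benchmark for domain adaptation and multi-centre generalisation in fetal biometry and supports more reliable AI-assisted fetal growth assessment across institutions; all data, annotations, training code, and evaluation pipelines are publicly available. Abstract: Accurate fetal growth assessment from ultrasound (US) relies on precise biometry measured by manually identifying anatomical landmarks in standard planes. Manual landmarking is time-consuming, operator-dependent, and sensitive to variability across scanners and sites, limiting the reproducibility of automated approaches. There is a need for multi-source annotated datasets to develop artificial intelligence-assisted fetal growth assessment methods. To address this bottleneck, we present an open, multi-centre, multi-device benchmark dataset of fetal US images with expert anatomical landmark annotations for clinically used fetal biometric measurements. These measurements include head bi-parietal and occipito-frontal diameters, abdominal transverse and antero-posterior diameters, and femoral length. The dataset contains 4,513 de-identified US images from 1,904 subjects acquired at three clinical sites using seven different US devices. We provide standardised, subject-disjoint train/test splits, evaluation code, and baseline results to enable fair and reproducible comparison of methods. Using an automatic biometry model, we quantify domain shift and demonstrate that training and evaluation confined to a single centre substantially overestimate performance relative to multi-centre testing. 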
To the best of our knowledge, this is the first publicly available multi-centre, multi-device, landmark-annotated dataset that covers all primary fetal biometry measures, providing a robust benchmark for domain adaptation and multi-centre generalisation in fetal biometry and enabling more reliable AI-assisted fetal growth assessment across centres. All data, annotations, training code, and evaluation pipelines are made publicly available.
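
A subject-disjoint split, as mentioned in the abstract, assigns whole subjects rather than individual images to train or test, so no subject contributes images to both sides. The sketch below shows one simple way to do this; the record layout, the 20% test fraction, and the function name are assumptions for illustration and do not describe the dataset's released splits.

```python
import random


def subject_disjoint_split(records, test_fraction=0.2, seed=0):
    """Split (subject_id, image_id) records so that no subject spans both sets."""
    subjects = sorted({subject_id for subject_id, _ in records})
    random.Random(seed).shuffle(subjects)
    n_test = max(1, round(len(subjects) * test_fraction))
    test_subjects = set(subjects[:n_test])
    train = [r for r in records if r[0] not in test_subjects]
    test = [r for r in records if r[0] in test_subjects]
    return train, test


# Toy data: 10 subjects with 3 images each.
records = [(f"subj{s:02d}", f"img{s:02d}_{i}") for s in range(10) for i in range(3)]
train, test = subject_disjoint_split(records, test_fraction=0.2, seed=0)
train_subjects = {subject_id for subject_id, _ in train}
test_subjects = {subject_id for subject_id, _ in test}
```

Splitting by image instead of by subject would let near-duplicate views of the same fetus leak into the test set, which is one way performance can be overestimated.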

[108] OMG-Bench: A New Challenging Benchmark for Skeleton-based Online Micro Hand Gesture Recognition

Haochen Chang,Pengfei Ren,Buyuan Zhang,Da Li,Tianhao Han,Haoyang Zhang,Liang Xie,Hongbo Chen,Erwei Yin

Main category: cs.CV

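TL;DR: This paper presents a multi-view self-supervised data generation pipeline for skeleton-based online micro gesture recognition, releases OMG-Bench, the first large-scale public benchmark for the task, and proposes the HMATr model, which unifies gesture detection and classification through hierarchical memory banks and learnable position-aware queries and clearly outperforms existing methods.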

Details

Motivation: Skeleton-based online micro gesture recognition is held back by the lack of public datasets and of algorithms tailored to micro gestures; the subtle motion patterns make it especially hard to build datasets with precise skeletons and frame-level annotations. Method: Develops a multi-view self-supervised pipeline that automatically generates skeleton data, with heuristic rules and expert refinement for semi-automatic annotation; proposes HMATr, a hierarchical memory-augmented Transformer that keeps historical context in frame-level and window-level memory banks and uses learnable position-aware queries to implicitly encode gesture positions and semantics. Result: Builds the OMG-Bench benchmark with 40 fine-grained gesture classes, 13,948 instances, and 1,272 sequences; HMATr improves detection rate by 7.6% over the best existing methods. Conclusion: HMATr addresses the subtle motions, rapid dynamics, and continuous execution that make online micro gesture recognition difficult and establishes a strong baseline, and OMG-Bench is expected to stimulate further research in this area. Abstract: Online micro gesture recognition from hand skeletons is critical for VR/AR interaction but faces challenges due to limited public datasets and task-specific algorithms. Micro gestures involve subtle motion patterns, which make constructing datasets with precise skeletons and frame-level annotations difficult. To this end, we develop a multi-view self-supervised pipeline to automatically generate skeleton data, complemented by heuristic rules and expert refinement for semi-automatic annotation. Based on this pipeline, we introduce OMG-Bench, the first large-scale public benchmark for skeleton-based online micro gesture recognition. It features 40 fine-grained gesture classes with 13,948 instances across 1,272 sequences, characterized by subtle motions, rapid dynamics, and continuous execution. To tackle these challenges, we propose Hierarchical Memory-Augmented Transformer (HMATr), an end-to-end framework that unifies gesture detection and classification by leveraging hierarchical memory banks which store frame-level details and window-level semantics to preserve historical context. In addition, it employs learnable position-aware queries initialized from the memory to implicitly encode gesture positions and semantics. Experiments show that HMATr outperforms state-of-the-art methods by 7.6% in detection rate, establishing a strong baseline for online micro gesture recognition. Project page: https://omg-bench.github.io/

[109] Task-Oriented Data Synthesis and Control-Rectify Sampling for Remote Sensing Semantic Segmentation

Yunkai Yang,Yudong Zhang,Kunquan Zhang,Jinxiao Zhang,Xinying Chen,Haohuan Fu,Runmin Dong

Main category: cs.CV

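TL;DR: Proposes TODSynth, a controllable data synthesis framework for remote sensing semantic segmentation that combines a multimodal diffusion Transformer with a task-feedback-driven sampling strategy, markedly improving the usefulness of synthetic data in few-shot and complex scenes.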

Details

Motivation: Existing data synthesis methods for remote sensing semantic segmentation struggle with complex semantic mask control and unstable sampling quality, which limits their value for downstream tasks. Method: Proposes the TODSynth framework, comprising a Multimodal Diffusion Transformer (MM-DiT) with unified triple attention and a plug-and-play sampling strategy guided by task feedback; introduces control-rectify flow matching (CRFM), which dynamically adjusts the sampling direction according to a semantic loss during the early, high-plasticity stage of generation. Result: Experiments show that the method clearly outperforms existing controllable generation methods in few-shot and complex scenes, producing data that is more stable and better matched to downstream segmentation. Conclusion: By combining joint attention with task-oriented sampling optimization, TODSynth improves the quality and practical value of synthetic remote sensing data and offers a reliable way to reduce manual annotation. Abstract: With the rapid progress of controllable generation, training data synthesis has become a promising way to expand labeled datasets and alleviate manual annotation in remote sensing (RS). However, the complexity of semantic mask control and the uncertainty of sampling quality often limit the utility of synthetic data in downstream semantic segmentation tasks. To address these challenges, we propose a task-oriented data synthesis framework (TODSynth), including a Multimodal Diffusion Transformer (MM-DiT) with unified triple attention and a plug-and-play sampling strategy guided by task feedback. Built upon the powerful DiT-based generative foundation model, we systematically evaluate different control schemes, showing that a text-image-mask joint attention scheme combined with full fine-tuning of the image and mask branches significantly enhances the effectiveness of RS semantic segmentation data synthesis, particularly in few-shot and complex-scene scenarios. Furthermore, we propose a control-rectify flow matching (CRFM) method, which dynamically adjusts sampling directions guided by semantic loss during the early high-plasticity stage, mitigating the instability of generated images and bridging the gap between synthetic data and downstream segmentation tasks. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art controllable generation methods, producing more stable and task-oriented synthetic data for RS semantic segmentation.

[110] TreeNet: A Light Weight Model for Low Bitrate Image Compression

Mahadev Prasad Panda,Purnachandra Rao Makkena,Srivatsa Prativadibhayankaram,Siegfried Fößel,André Kaup

Main category: cs.CV

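TL;DR: This paper proposes TreeNet, a new low-complexity image compression model with a binary tree-structured encoder-decoder and an attentional feature fusion mechanism, which outperforms JPEG AI at low bitrates while reducing computational complexity.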

Details

Motivation: Reduce the computational complexity of learning-based image compression so that it can be widely adopted. Method: Proposes TreeNet, which uses a binary tree-structured encoder-decoder and an attentional feature fusion mechanism to integrate features from multiple branches. Result: Evaluated on three benchmark datasets, TreeNet improves BD-rate over JPEG AI by 4.83% on average at low bitrates while reducing model complexity by 87.82%; ablation studies analyze the influence of the latent representations. Conclusion: TreeNet delivers strong compression performance at substantially lower model complexity, providing an effective solution for efficient image compression. Abstract: Reducing computational complexity remains a critical challenge for the widespread adoption of learning-based image compression techniques. In this work, we propose TreeNet, a novel low-complexity image compression model that leverages a binary tree-structured encoder-decoder architecture to achieve efficient representation and reconstruction. We employ an attentional feature fusion mechanism to effectively integrate features from multiple branches. We evaluate TreeNet on three widely used benchmark datasets and compare its performance against competing methods including JPEG AI, a recent standard in learning-based image compression. At low bitrates, TreeNet achieves an average improvement of 4.83% in BD-rate over JPEG AI, while reducing model complexity by 87.82%. Furthermore, we conduct extensive ablation studies to investigate the influence of various latent representations within TreeNet, offering deeper insights into the factors contributing to reconstruction.

[111] Make-It-Poseable: Feed-forward Latent Posing Model for 3D Humanoid Character Animation

Zhiyang Guo,Ori Zhang,Jax Xiang,Alan Zhao,Wengang Zhou,Houqiang Li

Main category: cs.CV

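TL;DR: This paper proposes Make-It-Poseable, a new framework that recasts 3D character posing as a transformation in latent space, directly manipulating the character's latent representation to achieve high-fidelity, topology-adaptive pose generation and editing.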

Details

Motivation: Existing methods fall short in skinning weight prediction, topology handling, and pose conformance, which limits their robustness and generalization; a more effective approach to 3D character posing is needed. Method: Proposes the Make-It-Poseable framework, which replaces traditional vertex deformation with latent-space transformation; its core is a latent posing transformer that manipulates shape tokens based on skeletal motion, supported by a dense pose representation for precise control, a latent-space supervision strategy, and an adaptive completion module. Result: The method delivers superior posing quality, handles topological changes while preserving high-fidelity geometry, and naturally supports 3D editing applications such as part replacement and refinement. Conclusion: By modeling in latent space, Make-It-Poseable overcomes the limitations of traditional methods and shows greater robustness, generality, and application potential in 3D character posing and editing. Abstract: Posing 3D characters is a fundamental task in computer graphics and vision. However, existing methods like auto-rigging and pose-conditioned generation often struggle with challenges such as inaccurate skinning weight prediction, topological imperfections, and poor pose conformance, limiting their robustness and generalizability. To overcome these limitations, we introduce Make-It-Poseable, a novel feed-forward framework that reformulates character posing as a latent-space transformation problem. Instead of deforming mesh vertices as in traditional pipelines, our method reconstructs the character in new poses by directly manipulating its latent representation. At the core of our method is a latent posing transformer that manipulates shape tokens based on skeletal motion. This process is facilitated by a dense pose representation for precise control. To ensure high-fidelity geometry and accommodate topological changes, we also introduce a latent-space supervision strategy and an adaptive completion module. Our method demonstrates superior performance in posing quality. It also naturally extends to 3D editing applications like part replacement and refinement.

[112] FlowDet: Unifying Object Detection and Generative Transport Flows

Enis Baty,C. P. Bridges,Simon Hadfield

Main category: cs.CV

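TL;DR: This paper proposes FlowDet, the first application of modern conditional flow matching to object detection, which outperforms diffusion-based methods across a range of experimental settings.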

Details

Motivation: Inspired by DiffusionDet, the authors aim to generalise object detection to a broader class of generative transport problems and to improve inference efficiency and performance. Method: Uses Conditional Flow Matching to learn simpler, straighter transport paths from noise to ground-truth bounding boxes, while still allowing the number of boxes and inference steps to be varied freely. Result: Achieves gains of up to +3.6% AP on COCO and +4.2% AP$_{rare}$ on LVIS over DiffusionDet, with the advantage most pronounced in recall-constrained settings. Conclusion: By modeling more efficient generative paths, FlowDet markedly improves the performance and scalability of generative object detection and is a strong alternative to diffusion-based methods. Abstract: We present FlowDet, the first formulation of object detection using modern Conditional Flow Matching techniques. This work follows from DiffusionDet, which originally framed detection as a generative denoising problem in the bounding box space via diffusion. We revisit and generalise this formulation to a broader class of generative transport problems, while maintaining the ability to vary the number of boxes and inference steps without re-training. In contrast to the curved stochastic transport paths induced by diffusion, FlowDet learns simpler and straighter paths resulting in faster scaling of detection performance as the number of inference steps grows. We find that this reformulation enables us to outperform diffusion-based detection systems (as well as non-generative baselines) across a wide range of experiments, including various precision/recall operating points using multiple feature backbones and datasets. In particular, when evaluating under recall-constrained settings, we can highlight the effects of the generative transport without over-compensating with large numbers of proposals. This provides gains of up to +3.6% AP and +4.2% AP$_{rare}$ over DiffusionDet on the COCO and LVIS datasets, respectively.
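
The "straighter paths" idea can be made concrete with the standard linear conditional flow matching construction: a noisy box x0 and a ground-truth box x1 are joined by x_t = (1 - t) x0 + t x1, and the regression target is the constant velocity x1 - x0. The sketch below uses an oracle velocity in place of a trained network, so it illustrates only the path geometry and Euler sampling; the box format and all function names are assumptions, not FlowDet's implementation.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from x0 (noise box) to x1 (target box) at time t."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the linear path; this is what a network would be trained to predict."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(x0, velocity_fn, steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with `steps` Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Boxes as (cx, cy, w, h); the values are arbitrary.
noise_box = [0.9, 0.1, 0.5, 0.5]
gt_box = [0.4, 0.6, 0.2, 0.3]
oracle = lambda x, t: target_velocity(noise_box, gt_box)  # stand-in for a learned model

one_step = euler_sample(noise_box, oracle, steps=1)
eight_steps = euler_sample(noise_box, oracle, steps=8)
midpoint = interpolate(noise_box, gt_box, 0.5)
```

With a perfectly straight path, a single Euler step already lands on the target, which gives some intuition for why straighter learned paths can need fewer inference steps than curved diffusion trajectories.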

[113] Kling-Omni Technical Report

Kling Team,Jialu Chen,Yuanzheng Ci,Xiangyu Du,Zipeng Feng,Kun Gai,Sainan Guo,Feng Han,Jingbin He,Kang He,Xiao Hu,Xiaohua Hu,Boyuan Jiang,Fangyuan Kong,Hang Li,Jie Li,Qingyu Li,Shen Li,Xiaohan Li,Yan Li,Jiajun Liang,Borui Liao,Yiqiao Liao,Weihong Lin,Quande Liu,Xiaokun Liu,Yilun Liu,Yuliang Liu,Shun Lu,Hangyu Mao,Yunyao Mao,Haodong Ouyang,Wenyu Qin,Wanqi Shi,Xiaoyu Shi,Lianghao Su,Haozhi Sun,Peiqin Sun,Pengfei Wan,Chao Wang,Chenyu Wang,Meng Wang,Qiulin Wang,Runqi Wang,Xintao Wang,Xuebo Wang,Zekun Wang,Min Wei,Tiancheng Wen,Guohao Wu,Xiaoshi Wu,Zhenhua Wu,Da Xie,Yingtong Xiong,Yulong Xu,Sile Yang,Zikang Yang,Weicai Ye,Ziyang Yuan,Shenglong Zhang,Shuaiyu Zhang,Yuanxing Zhang,Yufan Zhang,Wenzheng Zhao,Ruiliang Zhou,Yan Zhou,Guosheng Zhu,Yongjie Zhu

Main category: cs.CV

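TL;DR: Kling-Omni is an end-to-end generalist video generation framework that produces high-quality, high-fidelity video from multimodal inputs (such as text, images, and video) and supports intelligent reasoning and editing.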

Details

Motivation: Existing video generation systems are typically functionally separated, making it hard to handle generation, editing, and reasoning in a unified way, and they lack a holistic understanding of complex multimodal inputs. Method: Proposes the Kling-Omni framework, an end-to-end architecture that encodes multimodal inputs such as text, reference images, and video context into a joint representation; builds a data system that supports large-scale pre-training and optimizes the inference infrastructure for efficiency. Result: Experiments show that Kling-Omni performs strongly in in-context generation, reasoning-based editing, and multimodal instruction following, and can generate cinematic-quality video. Conclusion: Kling-Omni is more than a video creation tool; it is a key step toward multimodal world simulators capable of perceiving, reasoning, generating, and interacting. Abstract: We present Kling-Omni, a generalist generative framework designed to synthesize high-fidelity videos directly from multimodal visual language inputs. Adopting an end-to-end perspective, Kling-Omni bridges the functional separation among diverse video generation, editing, and intelligent reasoning tasks, integrating them into a holistic system. Unlike disjointed pipeline approaches, Kling-Omni supports a diverse range of user inputs, including text instructions, reference images, and video contexts, processing them into a unified multimodal representation to deliver cinematic-quality and highly intelligent video content creation. To support these capabilities, we constructed a comprehensive data system that serves as the foundation for multimodal video creation. The framework is further empowered by efficient large-scale pre-training strategies and infrastructure optimizations for inference. Comprehensive evaluations reveal that Kling-Omni demonstrates exceptional capabilities in in-context generation, reasoning-based editing, and multimodal instruction following. Moving beyond a content creation tool, we believe Kling-Omni is a pivotal advancement toward multimodal world simulators capable of perceiving, reasoning, generating and interacting with the dynamic and complex worlds.

[114] R3ST: A Synthetic 3D Dataset With Realistic Trajectories

Simone Teglia,Claudia Melis Tonti,Francesco Pro,Leonardo Russo,Andrea Alfarano,Leonardo Pentassuglia,Irene Amerini

Main category: cs.CV

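TL;DR: This paper presents R3ST, a realistic synthetic 3D dataset that incorporates real-world trajectories to advance research on road vehicle trajectory forecasting, combining precise annotations with authentic driving behavior.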

Details

Motivation: Existing real datasets lack precise ground-truth annotations, while synthetic datasets lack realistic vehicle trajectories; a dataset is needed that offers both realism and high-quality annotations. Method: Builds a synthetic 3D environment and integrates real-world trajectories from SinD, a dataset recorded from drone footage, to produce the synthetic dataset R3ST with authentic human driving behavior. Result: R3ST combines real vehicle trajectories with accurate multimodal ground-truth annotations, narrowing the gap between synthetic data and real scenes. Conclusion: R3ST provides a better data resource for traffic analysis and trajectory forecasting and advances the use of computer vision for road safety. Abstract: Datasets are essential to train and evaluate computer vision models used for traffic analysis and to enhance road safety. Existing real datasets fit real-world scenarios, capturing authentic road object behaviors; however, they typically lack precise ground-truth annotations. In contrast, synthetic datasets play a crucial role, allowing for the annotation of a large number of frames without additional costs or extra time. However, a general drawback of synthetic datasets is the lack of realistic vehicle motion, since trajectories are generated using AI models or rule-based systems. In this work, we introduce R3ST (Realistic 3D Synthetic Trajectories), a synthetic dataset that overcomes this limitation by generating a synthetic 3D environment and integrating real-world trajectories derived from SinD, a bird's-eye-view dataset recorded from drone footage. The proposed dataset closes the gap between synthetic data and realistic trajectories, advancing the research in trajectory forecasting of road vehicles, offering both accurate multimodal ground-truth annotations and authentic human-driven vehicle trajectories.

[115] KineST: A Kinematics-guided Spatiotemporal State Space Model for Human Motion Tracking from Sparse Signals

Shuting Zhao,Zeyu Xiao,Xinrong Chen

Main category: cs.CV

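TL;DR: This paper proposes KineST, a new kinematics-guided state space model that reconstructs high-quality full-body poses from sparse signals, achieving high accuracy and good temporal consistency for AR/VR at low computational cost.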

Details

Motivation: When reconstructing full-body poses from the sparse signals of head-mounted devices, existing methods struggle to balance accuracy, temporal coherence, and computational efficiency, and most model spatial and temporal dependencies separately. Method: Proposes the KineST model: (1) within the State Space Duality framework, a kinematics-guided bidirectional scanning strategy embeds kinematic priors to capture complex relationships between joints; (2) a mixed spatiotemporal representation learning approach tightly couples spatial and temporal context; (3) a geometric angular velocity loss imposes physically meaningful constraints on rotational changes. Result: Experiments show that KineST outperforms existing methods on multiple metrics, combining high accuracy with good temporal consistency while remaining lightweight and suitable for AR/VR applications. Conclusion: By combining kinematic priors with compact spatiotemporal modeling, KineST achieves efficient, stable, and realistic full-body motion tracking from sparse inputs, offering an effective solution for full-body motion capture in AR/VR. Abstract: Full-body motion tracking plays an essential role in AR/VR applications, bridging physical and virtual interactions. However, it is challenging to reconstruct realistic and diverse full-body poses based on sparse signals obtained by head-mounted displays, which are the main devices in AR/VR scenarios. Existing methods for pose reconstruction often incur high computational costs or rely on separately modeling spatial and temporal dependencies, making it difficult to balance accuracy, temporal coherence, and efficiency. To address this problem, we propose KineST, a novel kinematics-guided state space model, which effectively extracts spatiotemporal dependencies while integrating local and global pose perception. The innovation comes from two core ideas. Firstly, in order to better capture intricate joint relationships, the scanning strategy within the State Space Duality framework is reformulated into kinematics-guided bidirectional scanning, which embeds kinematic priors. Secondly, a mixed spatiotemporal representation learning approach is employed to tightly couple spatial and temporal contexts, balancing accuracy and smoothness. Additionally, a geometric angular velocity loss is introduced to impose physically meaningful constraints on rotational variations for further improving motion stability. Extensive experiments demonstrate that KineST has superior performance in both accuracy and temporal consistency within a lightweight framework. Project page: https://kaka-1314.github.io/KineST/

[116] GeoPredict: Leveraging Predictive Kinematics and 3D Gaussian Geometry for Precise VLA Manipulation

Jingjing Qian,Boyao Han,Chen Shi,Lei Xiao,Long Yang,Shaoshuai Shi,Li Jiang

Main category: cs.CV

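TL;DR: Proposes GeoPredict, a geometry-aware vision-language-action framework that augments a continuous-action policy with predictive kinematic and geometric priors, improving robot performance on tasks that require 3D spatial reasoning.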

Details

Motivation: Existing VLA models are mostly reactive and 2D-centric, so they are unreliable on tasks that require precise 3D reasoning and lack a deeper understanding of spatial structure and motion trajectories. Method: Introduces a trajectory-level module that predicts multi-step 3D keypoint trajectories of the robot arm, and a predictive 3D Gaussian geometry module that forecasts workspace geometry with refinement guided by the future trajectories; depth rendering provides supervision during training, while inference needs only lightweight query tokens and no additional 3D decoding. Result: Outperforms strong VLA baselines on RoboCasa Human-50, LIBERO, and real-world manipulation tasks, especially in geometry-intensive and spatially demanding scenarios. Conclusion: By introducing geometric and motion-prediction priors, GeoPredict improves the generalization and manipulation precision of VLA models in complex 3D environments and gives robots a more reliable spatial reasoning mechanism. Abstract: Vision-Language-Action (VLA) models achieve strong generalization in robotic manipulation but remain largely reactive and 2D-centric, making them unreliable in tasks that require precise 3D reasoning. We propose GeoPredict, a geometry-aware VLA framework that augments a continuous-action policy with predictive kinematic and geometric priors. GeoPredict introduces a trajectory-level module that encodes motion history and predicts multi-step 3D keypoint trajectories of robot arms, and a predictive 3D Gaussian geometry module that forecasts workspace geometry with track-guided refinement along future keypoint trajectories. These predictive modules serve exclusively as training-time supervision through depth-based rendering, while inference requires only lightweight additional query tokens without invoking any 3D decoding. Experiments on RoboCasa Human-50, LIBERO, and real-world manipulation tasks show that GeoPredict consistently outperforms strong VLA baselines, especially in geometry-intensive and spatially demanding scenarios.

[117] DenseBEV: Transforming BEV Grid Cells into 3D Objects

Marius Dähling,Sebastian Krebs,J. Marius Zöllner

Main category: cs.CV

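TL;DR: This paper proposes DenseBEV, a new two-stage anchor generation method that uses the BEV feature grid directly as queries and combines non-maximum suppression with temporal modeling, markedly improving multi-camera 3D object detection, particularly for small objects, and reaching state-of-the-art results on the nuScenes and Waymo datasets.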

Details

Motivation: Traditional methods generate anchors from random queries or auxiliary networks, which is inefficient and unintuitive; in addition, a large number of queries makes attention computationally expensive and training inefficient. Method: Treats each grid cell of the BEV feature map directly as an object query (that is, an anchor), applies BEV-based NMS so that gradients flow only through non-suppressed objects to make training efficient, and integrates prior detections for hybrid temporal modeling to further improve detection. Result: NDS and mAP improve significantly on nuScenes, with pedestrian mAP up by 3.8%; on Waymo, LET-mAP reaches 60.7%, surpassing the previous best by 5.4%; performance remains strong even with sparser BEV grids. Conclusion: With dense BEV queries and an efficient training strategy, DenseBEV achieves better multi-camera 3D detection, is especially good at small objects, and obtains leading results on multiple datasets. Abstract: In current research, Bird's-Eye-View (BEV)-based transformers are increasingly utilized for multi-camera 3D object detection. Traditional models often employ random queries as anchors, optimizing them successively. Recent advancements complement or replace these random queries with detections from auxiliary networks. We propose a more intuitive and efficient approach by using BEV feature cells directly as anchors. This end-to-end approach leverages the dense grid of BEV queries, considering each cell as a potential object for the final detection task. As a result, we introduce a novel two-stage anchor generation method specifically designed for multi-camera 3D object detection. To address the scaling issues of attention with a large number of queries, we apply BEV-based Non-Maximum Suppression, allowing gradients to flow only through non-suppressed objects. This ensures efficient training without the need for post-processing. By using BEV features from encoders such as BEVFormer directly as object queries, temporal BEV information is inherently embedded. Building on the temporal BEV information already embedded in our object queries, we introduce a hybrid temporal modeling approach by integrating prior detections to further enhance detection performance. Evaluating our method on the nuScenes dataset shows consistent and significant improvements in NDS and mAP over the baseline, even with sparser BEV grids and therefore fewer initial anchors. It is particularly effective for small objects, enhancing pedestrian detection with a 3.8% mAP increase on nuScenes and an 8% increase in LET-mAP on Waymo. 
Applying our method, named DenseBEV, to the challenging Waymo Open dataset yields state-of-the-art performance, achieving a LET-mAP of 60.7%, surpassing the previous best by 5.4%. Code is available at https://github.com/mdaehl/DenseBEV.
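
To give a feel for non-maximum suppression over BEV cells, the sketch below runs a greedy, centre-distance NMS on scored grid positions. The abstract does not state which suppression criterion DenseBEV uses, so the distance threshold, the `(score, x, y)` layout, and the function name are all assumptions made purely for illustration.

```python
def bev_nms(cells, radius=1.5):
    """Greedy NMS over scored BEV positions (assumed centre-distance criterion).

    cells: list of (score, x, y) candidates, one per BEV grid cell.
    Returns the indices of kept candidates, highest score first.
    """
    order = sorted(range(len(cells)), key=lambda i: cells[i][0], reverse=True)
    kept = []
    for i in order:
        _, xi, yi = cells[i]
        # Keep a candidate only if it is farther than `radius` from every kept one.
        if all(
            (xi - cells[j][1]) ** 2 + (yi - cells[j][2]) ** 2 > radius ** 2
            for j in kept
        ):
            kept.append(i)
    return kept


candidates = [
    (0.90, 0.0, 0.0),
    (0.80, 1.0, 0.0),   # within radius of the first candidate, lower score
    (0.70, 5.0, 5.0),   # within radius of the last candidate, lower score
    (0.95, 5.0, 6.0),
]
kept_indices = bev_nms(candidates, radius=1.5)
```

Only the kept indices would be passed on as object queries, which is what keeps the number of queries seen by attention manageable when every grid cell starts as a potential object.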

[118] Next-Generation License Plate Detection and Recognition System using YOLOv8

Arslan Amin,Rafia Mumtaz,Muhammad Jawad Bashir,Syed Mohammad Hassan Zaidi

Main category: cs.CV

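TL;DR: This study evaluates the YOLOv8 family on license plate recognition and character recognition, proposes a character-sequencing method based on x-axis positions, and designs an efficient detection pipeline combining YOLOv8 Nano and Small that offers high accuracy and computational efficiency for Intelligent Transportation Systems applications on edge devices.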

Details

Motivation: To improve the real-time performance and accuracy of license plate detection and recognition in complex environments and advance Intelligent Transportation Systems. Method: YOLOv8 Nano is used for license plate detection and YOLOv8 Small for character recognition, with a character-sequencing method based on x-axis coordinates, forming an efficient LPR pipeline. Result: YOLOv8 Nano achieves a precision of 0.964 and mAP50 of 0.918 on the LPR task; YOLOv8 Small achieves a precision of 0.92 and mAP50 of 0.91 on the character recognition task. Conclusion: The proposed combination of YOLOv8 variants achieves high recognition accuracy while remaining computationally efficient, is suitable for deployment on edge devices, and offers a viable technical path for smart transportation infrastructure. Abstract: In the evolving landscape of traffic management and vehicle surveillance, efficient license plate detection and recognition are indispensable. Historically, many methodologies have tackled this challenge, but consistent real-time accuracy, especially in diverse environments, remains elusive. This study examines the performance of YOLOv8 variants on License Plate Recognition (LPR) and Character Recognition tasks, crucial for advancing Intelligent Transportation Systems. Two distinct datasets were employed for training and evaluation, yielding notable findings. The YOLOv8 Nano variant demonstrated a precision of 0.964 and mAP50 of 0.918 on the LPR task, while the YOLOv8 Small variant exhibited a precision of 0.92 and mAP50 of 0.91 on the Character Recognition task. A custom method for character sequencing was introduced, effectively sequencing the detected characters based on their x-axis positions. An optimized pipeline, utilizing YOLOv8 Nano for LPR and YOLOv8 Small for Character Recognition, is proposed. This configuration not only maintains computational efficiency but also ensures high accuracy, establishing a robust foundation for future real-world deployments on edge devices within Intelligent Transportation Systems. This effort marks a significant stride towards the development of smarter and more efficient urban infrastructures.
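
The character-sequencing step can be sketched in a few lines: a detector returns character boxes in arbitrary order, and sorting them by horizontal position recovers the reading order. The detection tuple layout and the function name below are hypothetical, chosen only for this example; the paper's actual data format is not given in the abstract.

```python
from typing import List, Tuple

# Hypothetical detection format: (label, x_min, y_min, x_max, y_max).
Detection = Tuple[str, float, float, float, float]


def sequence_characters(detections: List[Detection]) -> str:
    """Order detected characters left to right by box centre on the x-axis."""
    ordered = sorted(detections, key=lambda d: (d[1] + d[3]) / 2.0)
    return "".join(label for label, *_ in ordered)


# Detections arrive in arbitrary order from the character detector.
detections = [
    ("C", 50.0, 0.0, 60.0, 20.0),
    ("A", 10.0, 0.0, 20.0, 20.0),
    ("B", 30.0, 0.0, 40.0, 20.0),
]
plate_text = sequence_characters(detections)
print(plate_text)  # ABC
```

This assumes a single-row plate. For two-row plates, the boxes would first have to be grouped into rows (for example by y-coordinate) before sorting each row by x.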

[119] Radiology Report Generation with Layer-Wise Anatomical Attention

Emmanuel D. Muñiz-De-León,Jorge A. Rosales-de-Golferichs,Ana S. Muñoz-Rodríguez,Alejandro I. Trejo-Castro,Eduardo de Avila-Armenta,Antonio Martínez-Torteya

Main category: cs.CV

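TL;DR: Proposes a compact image-to-text architecture that generates the Findings section of radiology reports from a single chest X-ray, combining a frozen DINOv3 ViT encoder with a GPT-2 decoder that uses layer-wise anatomical attention, improving pathology detection and report coherence without adding trainable parameters.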

Details

Motivation: Current state-of-the-art multimodal models depend on large-scale training, clinical metadata, and multi-view imaging, which makes them resource-intensive and hard to deploy widely; a lightweight solution based on a single image is needed. Method: A frozen DINOv3 Vision Transformer serves as the image encoder and GPT-2 as the text decoder; layer-wise anatomical attention uses lung and heart segmentation masks, processed with hierarchical Gaussian smoothing, to steer attention toward clinically relevant regions. Result: On MIMIC-CXR, CheXpert Macro-F1 for five key pathologies rises by 168% (0.083 -> 0.238), Micro-F1 by 146% (0.137 -> 0.337), overall performance across 14 observations by 86% (0.170 -> 0.316), and RadGraph F1 by 9.7%. Conclusion: Although the model is small and conditioned only on the image, decoder-level anatomical guidance markedly improves spatial grounding and report coherence in clinically relevant regions, demonstrating the effectiveness of a lightweight design for radiology report generation. Abstract: Automatic radiology report generation is a promising application of multimodal deep learning, aiming to reduce reporting workload and improve consistency. However, current state-of-the-art (SOTA) systems - such as Multimodal AI for Radiology Applications (MAIRA-2) and Medical Pathways Language Model-Multimodal (MedPaLM-M) - depend on large-scale multimodal training, clinical metadata, and multiple imaging views, making them resource-intensive and inaccessible for most settings. We introduce a compact image-to-text architecture that generates the Findings section of chest X-ray reports from a single frontal image. The model combines a frozen Self-Distillation with No Labels v3 (DINOv3) Vision Transformer (ViT) encoder with a Generative Pre-trained Transformer 2 (GPT-2) decoder enhanced by layer-wise anatomical attention. This mechanism integrates lung and heart segmentation masks through hierarchical Gaussian smoothing, biasing attention toward clinically relevant regions without adding trainable parameters. Evaluated on the official Medical Information Mart for Intensive Care-Chest X-ray (MIMIC-CXR) dataset using Chest Radiograph Expert (CheXpert) and Radiology Graph (RadGraph) metrics, our approach achieved substantial gains: CheXpert Macro-F1 for five key pathologies increased by 168% (0.083 -> 0.238) and Micro-F1 by 146% (0.137 -> 0.337), while broader performance across 14 observations improved by 86% (0.170 -> 0.316). Structural coherence also improved, with RadGraph F1 rising by 9.7%.
Despite its small size and purely image-conditioned design, the model demonstrates that decoder-level anatomical guidance improves spatial grounding and enhances coherence in clinically relevant regions. The source code is publicly available at: https://github.com/devMuniz02/UDEM-CXR-Reporting-Thesis-2025.

[120] OPENTOUCH: Bringing Full-Hand Touch to Real-World Interaction

Yuxin Ray Song,Jinzhou Li,Rao Fu,Devin Murphy,Kaichen Zhou,Rishi Shiv,Yaqi Li,Haoyu Xiong,Crystal Elaine Owens,Yilun Du,Yiyue Luo,Xianyi Cheng,Antonio Torralba,Wojciech Matusik,Paul Pu Liang

Main category: cs.CV

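TL;DR: Presents OpenTouch, the first in-the-wild egocentric full-hand tactile dataset, containing 5.1 hours of synchronized video-touch-pose data and 2,900 clips with text annotations, to advance multimodal perception, embodied learning, and robotic manipulation research.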

Details

Motivation: Egocentric perception currently struggles to determine when, where, and how forcefully the hand contacts objects; robust wearable tactile sensors are scarce, and no existing data pairs first-person video with full-hand touch. Method: Build the OpenTouch dataset by collecting synchronized video, touch, and hand-pose data in the wild, and design retrieval and classification benchmarks to study the role of tactile signals in grasp understanding and cross-modal alignment. Result: Tactile signals improve grasp understanding, strengthen cross-modal alignment, and can be reliably retrieved from in-the-wild video; the dataset contains 5.1 hours of synchronized data and 2,900 annotated clips. Conclusion: OpenTouch provides an important foundation for multimodal egocentric perception, embodied learning, and contact-rich robotic manipulation, and shows the key role of touch in understanding physical interaction. Abstract: The human hand is our primary interface to the physical world, yet egocentric perception rarely knows when, where, or how forcefully it makes contact. Robust wearable tactile sensors are scarce, and no existing in-the-wild datasets align first-person video with full-hand touch. To bridge the gap between visual perception and physical interaction, we present OpenTouch, the first in-the-wild egocentric full-hand tactile dataset, containing 5.1 hours of synchronized video-touch-pose data and 2,900 curated clips with detailed text annotations. Using OpenTouch, we introduce retrieval and classification benchmarks that probe how touch grounds perception and action. We show that tactile signals provide a compact yet powerful cue for grasp understanding, strengthen cross-modal alignment, and can be reliably retrieved from in-the-wild video queries. By releasing this annotated vision-touch-pose dataset and benchmark, we aim to advance multimodal egocentric perception, embodied learning, and contact-rich robotic manipulation.

[121] GenEval 2: Addressing Benchmark Drift in Text-to-Image Evaluation

Amita Kamath,Kai-Wei Chang,Ranjay Krishna,Luke Zettlemoyer,Yushi Hu,Marjan Ghazvininejad

Main category: cs.CV

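TL;DR: This paper shows that existing text-to-image (T2I) evaluation benchmarks such as GenEval drift over time and are no longer aligned with human judgment, and proposes a new benchmark, GenEval 2, and an evaluation method, Soft-TIFA, to improve coverage, compositionality, and stability, stressing the importance of continually auditing and improving automated evaluation benchmarks.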

Details

Motivation: Rapid model progress has caused "benchmark drift" in existing T2I benchmarks, so automated scores diverge severely from human judgment and evaluation validity suffers; a more stable and more challenging benchmark is needed. Method: Propose GenEval 2, with broader coverage of primitive visual concepts and a more compositional design; introduce Soft-TIFA, which scores by decomposing prompts into visual primitives, and compare it with holistic judges such as VQAScore to verify that it is closer to human judgment and more resistant to drift. Result: Experiments show that GenEval has drifted severely (absolute error of up to 17.7%) and is saturated by current models; GenEval 2 is more challenging, and Soft-TIFA agrees better with human judgment and is argued to be less prone to drift over time. Conclusion: Benchmark drift is a serious problem in T2I evaluation; GenEval 2 and Soft-TIFA offer a better solution, but long-term stability still requires continual monitoring and iteration, which underscores the importance of actively maintaining automated evaluation. Abstract: Automating Text-to-Image (T2I) model evaluation is challenging; a judge model must be used to score correctness, and test prompts must be selected to be challenging for current T2I models but not the judge. We argue that satisfying these constraints can lead to benchmark drift over time, where the static benchmark judges fail to keep up with newer model capabilities. We show that benchmark drift is a significant problem for GenEval, one of the most popular T2I benchmarks. Although GenEval was well-aligned with human judgment at the time of its release, it has drifted far from human judgment over time -- resulting in an absolute error of as much as 17.7% for current models. This level of drift strongly suggests that GenEval has been saturated for some time, as we verify via a large-scale human study. To help fill this benchmarking gap, we introduce a new benchmark, GenEval 2, with improved coverage of primitive visual concepts and higher degrees of compositionality, which we show is more challenging for current models. We also introduce Soft-TIFA, an evaluation method for GenEval 2 that combines judgments for visual primitives, which we show is better aligned with human judgment and argue is less likely to drift from human alignment over time (as compared to more holistic judges such as VQAScore). Although we hope GenEval 2 will provide a strong benchmark for many years, avoiding benchmark drift is far from guaranteed and our work, more generally, highlights the importance of continual audits and improvement for T2I and related automated model evaluation benchmarks.

[122] RePlan: Reasoning-guided Region Planning for Complex Instruction-based Image Editing

Tianyuan Qu,Lei Ke,Xiaohang Zhan,Longxiang Tang,Yuqi Liu,Bohao Peng,Bei Yu,Dong Yu,Jiaya Jia

Main category: cs.CV

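TL;DR: This paper proposes RePlan, an image-editing framework for complex instructions and scenes that couples a vision-language planner with a diffusion editor to perform precise, parallel multi-region edits, and performs strongly on the newly proposed IV-Edit benchmark.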

Details

Motivation: Existing instruction-based image editing models perform poorly when complex language instructions meet cluttered or ambiguous visual scenes (IV-Complexity), lacking precise control over fine-grained regions and reliable reasoning. Method: Propose the RePlan framework: a vision-language planner first decomposes the instruction through step-by-step reasoning and explicitly grounds each operation to a target region; a training-free attention-region injection mechanism then executes multi-region edits in parallel within the diffusion model; the planner is further optimized with GRPO-based reinforcement learning on 1K instruction-only examples. Result: On the new IV-Edit benchmark, which focuses on fine-grained grounding and knowledge-intensive edits, RePlan clearly outperforms strong baselines trained on larger datasets in regional precision and overall fidelity. Conclusion: RePlan addresses instruction-visual complexity through explicitly region-aligned planning and an efficient editing strategy, yielding more precise and reliable natural-language-driven image editing. Abstract: Instruction-based image editing enables natural-language control over visual modifications, yet existing models falter under Instruction-Visual Complexity (IV-Complexity), where intricate instructions meet cluttered or ambiguous scenes. We introduce RePlan (Region-aligned Planning), a plan-then-execute framework that couples a vision-language planner with a diffusion editor. The planner decomposes instructions via step-by-step reasoning and explicitly grounds them to target regions; the editor then applies changes using a training-free attention-region injection mechanism, enabling precise, parallel multi-region edits without iterative inpainting. To strengthen planning, we apply GRPO-based reinforcement learning using 1K instruction-only examples, yielding substantial gains in reasoning fidelity and format reliability. We further present IV-Edit, a benchmark focused on fine-grained grounding and knowledge-intensive edits. Across IV-Complex settings, RePlan consistently outperforms strong baselines trained on far larger datasets, improving regional precision and overall fidelity. Our project page: https://replan-iv-edit.github.io

[123] Pixel Seal: Adversarial-only training for invisible image and video watermarking

Tomáš Souček,Pierre Fernandez,Hady Elsahar,Sylvestre-Alvise Rebuffi,Valeriu Lacatusu,Tuan Tran,Tom Sander,Alexandre Mourachko

Main category: cs.CV

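TL;DR: This paper proposes Pixel Seal, a new image and video watermarking method that uses adversarial-only training, a three-stage training schedule, and high-resolution adaptation to achieve a better balance between robustness and imperceptibility, clearly outperforming existing methods.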

Details

Motivation: Existing watermarking methods for digital content struggle to balance robustness and imperceptibility and perform poorly on high-resolution images and videos, so a more effective approach is needed. Method: Propose an adversarial-only training paradigm that eliminates unreliable pixel-wise imperceptibility losses; introduce a three-stage training schedule to stabilize convergence; and achieve high-resolution adaptation through JND-based attenuation and training-time inference simulation. Result: Pixel Seal shows stronger robustness and imperceptibility across image types and transformations, clearly outperforming the state of the art, and extends effectively to video. Conclusion: Pixel Seal offers a practical and scalable solution for provenance tracing of images and video, substantially increasing the practical value of watermarking. Abstract: Invisible watermarking is essential for tracing the provenance of digital content. However, training state-of-the-art models remains notoriously difficult, with current approaches often struggling to balance robustness against true imperceptibility. This work introduces Pixel Seal, which sets a new state-of-the-art for image and video watermarking. We first identify three fundamental issues of existing methods: (i) the reliance on proxy perceptual losses such as MSE and LPIPS that fail to mimic human perception and result in visible watermark artifacts; (ii) the optimization instability caused by conflicting objectives, which necessitates exhaustive hyperparameter tuning; and (iii) reduced robustness and imperceptibility of watermarks when scaling models to high-resolution images and videos. To overcome these issues, we first propose an adversarial-only training paradigm that eliminates unreliable pixel-wise imperceptibility losses. Second, we introduce a three-stage training schedule that stabilizes convergence by decoupling robustness and imperceptibility. Third, we address the resolution gap via high-resolution adaptation, employing JND-based attenuation and training-time inference simulation to eliminate upscaling artifacts. We thoroughly evaluate the robustness and imperceptibility of Pixel Seal on different image types and across a wide range of transformations, and show clear improvements over the state-of-the-art. We finally demonstrate that the model efficiently adapts to video via temporal watermark pooling, positioning Pixel Seal as a practical and scalable solution for reliable provenance in real-world image and video settings.

[124] Memory-Enhanced SAM3 for Occlusion-Robust Surgical Instrument Segmentation

Valay Bundele,Mehran Hosseinzadeh,Hendrik P. A. Lensch

Main category: cs.CV

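TL;DR: Proposes ReMeDI-SAM3, a training-free memory-enhanced extension of SAM3 that uses relevance-aware memory filtering, piecewise interpolation, and a feature-based re-identification module to substantially improve surgical instrument segmentation after occlusions.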

Details

Motivation: Methods such as SAM3 suffer from indiscriminate memory updates, fixed memory capacity, and weak identity recovery after occlusion in surgical video segmentation, making it hard to handle the frequent occlusions and rapid motion of surgical scenes. Method: Propose ReMeDI-SAM3, with three core components: (i) relevance-aware memory filtering with a dedicated occlusion-aware memory; (ii) a piecewise interpolation scheme that expands the effective memory capacity; and (iii) a feature-based re-identification module with temporal voting for reliable identity recovery after occlusion. Result: In the zero-shot setting on EndoVis17 and EndoVis18, mcIoU improves by about 7% and 16%, respectively, over vanilla SAM3, outperforming prior training-based methods. Conclusion: ReMeDI-SAM3 mitigates error accumulation and markedly improves the robustness and accuracy of instrument segmentation in surgical video, especially for handling occlusion and identity recovery. Abstract: Accurate surgical instrument segmentation in endoscopic videos is crucial for computer-assisted interventions, yet remains challenging due to frequent occlusions, rapid motion, specular artefacts, and long-term instrument re-entry. While SAM3 provides a powerful spatio-temporal framework for video object segmentation, its performance in surgical scenes is limited by indiscriminate memory updates, fixed memory capacity, and weak identity recovery after occlusions. We propose ReMeDI-SAM3, a training-free, memory-enhanced extension of SAM3 that addresses these limitations through three components: (i) relevance-aware memory filtering with a dedicated occlusion-aware memory for storing pre-occlusion frames, (ii) a piecewise interpolation scheme that expands the effective memory capacity, and (iii) a feature-based re-identification module with temporal voting for reliable post-occlusion identity disambiguation. Together, these components mitigate error accumulation and enable reliable recovery after occlusions. Evaluations on EndoVis17 and EndoVis18 under a zero-shot setting show absolute mcIoU improvements of around 7% and 16%, respectively, over vanilla SAM3, outperforming even prior training-based approaches. Project page: https://valaybundele.github.io/remedi-sam3/.
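
Temporal voting for re-identification can be illustrated with a simple majority vote over the frames that follow an occlusion. This is a minimal sketch under stated assumptions: the per-frame identity guesses are taken as given (in practice they would come from a feature-matching step), and the function name and `min_votes` parameter are hypothetical, not part of ReMeDI-SAM3.

```python
from collections import Counter


def vote_identity(frame_predictions, min_votes=2):
    """Pick the identity that most post-occlusion frames agree on.

    frame_predictions: per-frame best-matching instrument ID, or None when
    the matcher abstained on that frame. Returns None if no identity
    gathers at least `min_votes` votes (an assumed confidence rule).
    """
    votes = Counter(p for p in frame_predictions if p is not None)
    if not votes:
        return None
    identity, count = votes.most_common(1)[0]
    return identity if count >= min_votes else None


# One noisy frame ("scissors") is outvoted by the consistent majority.
recovered = vote_identity(["grasper", "scissors", "grasper", None, "grasper"])
print(recovered)  # grasper
```

Voting over several frames makes the decision less sensitive to a single bad match, at the cost of delaying the identity assignment by the length of the voting window.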

[125] M-PhyGs: Multi-Material Object Dynamics from Video

Norika Wada,Kohei Yamashita,Ryo Kawahara,Ko Nishino

Main category: cs.CV

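TL;DR: Proposes Multi-material Physical Gaussians (M-PhyGs), which jointly segments multi-material regions and estimates their physical parameters from videos captured in natural settings, for multi-material physical modeling of complex natural objects such as flowers.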

Details

Motivation: Existing methods assume a single homogeneous material or simple structure and cannot handle real-world objects with complex materials and varied geometry (such as flowers); a more general method for multi-material physical parameter estimation is needed. Method: Propose M-PhyGs, which uses cascaded 3D and 2D losses and temporal mini-batching to jointly perform multi-material segmentation and estimation of continuum mechanical parameters (accounting for gravity) from a short video. Result: Experiments on the newly built Phlowers dataset validate the effectiveness of M-PhyGs, and each component clearly improves the accuracy of multi-material physical parameter estimation. Conclusion: M-PhyGs handles physical modeling of complex multi-material natural objects in real scenes and offers a new direction for vision-driven physical understanding. Abstract: Knowledge of the physical material properties governing the dynamics of a real-world object becomes necessary to accurately anticipate its response to unseen interactions. Existing methods for estimating such physical material parameters from visual data assume homogeneous single-material objects, pre-learned dynamics, or simplistic topologies. Real-world objects, however, are often complex in material composition and geometry, lying outside the realm of these assumptions. In this paper, we particularly focus on flowers as a representative common object. We introduce Multi-material Physical Gaussians (M-PhyGs) to estimate the material composition and parameters of such multi-material complex natural objects from video. From a short video captured in a natural setting, M-PhyGs jointly segments the object into similar materials and recovers their continuum mechanical parameters while accounting for gravity. M-PhyGs achieves this efficiently with newly introduced cascaded 3D and 2D losses, and by leveraging temporal mini-batching. We introduce a dataset, Phlowers, of people interacting with flowers as a novel platform to evaluate the accuracy of this challenging task of multi-material physical parameter estimation. Experimental results on the Phlowers dataset demonstrate the accuracy and effectiveness of M-PhyGs and its components.

[126] LinkedOut: Linking World Knowledge Representation Out of Video LLM for Next-Generation Video Recommendation

Haichao Zhang,Yao Lu,Lichen Wang,Yunzhe Li,Daiwei Chen,Yunpeng Xu,Yun Fu

Main category: cs.CV

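TL;DR: This paper proposes LinkedOut, a new representation based on video large language models (VLLMs) that enables video recommendation without handcrafted labels, with multi-video input and low latency.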

Details

Motivation: Although VLLMs show promise for video understanding, high decoding latency, lack of multi-video input support, and the restriction to language output make them hard to apply to downstream tasks such as video recommendation; there is no representation that both preserves pixel-level detail and leverages world knowledge. Method: Propose the LinkedOut representation, which extracts VLLM world knowledge from raw frames, generates semantically grounded, knowledge-aware tokens guided by promptable queries and auxiliary modalities, and introduces a cross-layer knowledge fusion MoE to select the appropriate level of abstraction. Result: LinkedOut achieves state-of-the-art video recommendation performance on standard benchmarks, supports fast inference and multi-video histories, and removes the language bottleneck; ablations confirm the benefits of layer diversity and layer-wise fusion. Conclusion: LinkedOut is the first VLLM-based video recommendation method that operates directly on raw frames without handcrafted labels, offering a practical path to fully leverage VLLM world-knowledge priors and visual reasoning. Abstract: Video Large Language Models (VLLMs) unlock world-knowledge-aware video understanding through pretraining on internet-scale data and have already shown promise on tasks such as movie analysis and video question answering. However, deploying VLLMs for downstream tasks such as video recommendation remains challenging, since real systems require multi-video inputs, lightweight backbones, low-latency sequential inference, and rapid response. In practice, (1) decode-only generation yields high latency for sequential inference, (2) typical interfaces do not support multi-video inputs, and (3) constraining outputs to language discards fine-grained visual details that matter for downstream vision tasks. We argue that these limitations stem from the absence of a representation that preserves pixel-level detail while leveraging world knowledge. We present LinkedOut, a representation that extracts VLLM world knowledge directly from video to enable fast inference, supports multi-video histories, and removes the language bottleneck. LinkedOut extracts semantically grounded, knowledge-aware tokens from raw frames using VLLMs, guided by promptable queries and optional auxiliary modalities. We introduce a cross-layer knowledge fusion MoE that selects the appropriate level of abstraction from the rich VLLM features, enabling personalized, interpretable, and low-latency recommendation. To our knowledge, LinkedOut is the first VLLM-based video recommendation method that operates on raw frames without handcrafted labels, achieving state-of-the-art results on standard benchmarks.
Interpretability studies and ablations confirm the benefits of layer diversity and layer-wise fusion, pointing to a practical path that fully leverages VLLM world-knowledge priors and visual reasoning for downstream vision tasks such as recommendation.
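
The abstract does not describe the internals of the cross-layer knowledge fusion MoE. The sketch below shows only the simplest form of the underlying idea, a softmax gate that mixes feature vectors taken from different layers, so that the gate effectively chooses a level of abstraction. The function names, the dense (non-sparse) gating, and the toy inputs are all assumptions for illustration and should not be read as the LinkedOut design.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_layers(layer_features, gate_logits):
    """Weighted sum of per-layer feature vectors using softmax gate weights.

    layer_features: one feature vector per layer, all of equal length.
    gate_logits: one score per layer (assumed to come from a learned gate).
    """
    weights = softmax(gate_logits)
    dim = len(layer_features[0])
    return [
        sum(w * feat[d] for w, feat in zip(weights, layer_features))
        for d in range(dim)
    ]


# Two toy "layers": equal gate scores give an even mix of both.
shallow, deep = [1.0, 0.0], [0.0, 1.0]
fused_even = fuse_layers([shallow, deep], [0.0, 0.0])
# A gate that strongly prefers the first layer returns almost only `shallow`.
fused_shallow = fuse_layers([shallow, deep], [100.0, 0.0])
print(fused_even, fused_shallow)
```

Inspecting the gate weights per item is one way such a fusion could support the interpretability studies the abstract mentions, since the weights show which layers a recommendation relied on.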

[127] Instant Expressive Gaussian Head Avatar via 3D-Aware Expression Distillation

Kaiwen Jiang,Xueting Li,Seonwook Park,Ravi Ramamoorthi,Shalini De Mello,Koki Nagano

Main category: cs.CV

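TL;DR: This paper proposes an efficient portrait animation framework that combines the strengths of 2D diffusion models and 3D-aware feed-forward methods, using knowledge distillation to transfer the expression detail of a 2D diffusion model into a fast, 3D-consistent representation, achieving high-quality, high-frame-rate facial animation.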

Details

Motivation: Existing 2D video diffusion models generate high-quality animation but lack 3D consistency and are slow, whereas 3D-aware methods are fast and 3D-consistent but lack expression detail. This paper aims to combine the two for facial animation that is fast, expressive, and 3D-consistent. Method: A knowledge distillation strategy transfers expression information from a 2D diffusion model into a feed-forward encoder, building a decoupled 3D animation representation that does not depend on pre-defined parametric models and that fuses 3D structure and motion information through a lightweight local fusion mechanism. Result: From a single in-the-wild image, the method animates in real time (107.31 FPS) while maintaining animation quality comparable to the state of the art, striking a better balance between speed and quality. Conclusion: The method combines the expressivity of 2D diffusion models with the speed and consistency of 3D feed-forward models, providing an efficient, high-quality facial animation solution for applications such as digital twins and telepresence. Abstract: Portrait animation has witnessed tremendous quality improvements thanks to recent advances in video diffusion models. However, these 2D methods often compromise 3D consistency and speed, limiting their applicability in real-world scenarios, such as digital twins or telepresence. In contrast, 3D-aware facial animation feedforward methods -- built upon explicit 3D representations, such as neural radiance fields or Gaussian splatting -- ensure 3D consistency and achieve faster inference speed, but come with inferior expression details. In this paper, we aim to combine their strengths by distilling knowledge from a 2D diffusion-based method into a feed-forward encoder, which instantly converts an in-the-wild single image into a 3D-consistent, fast yet expressive animatable representation. Our animation representation is decoupled from the face's 3D representation and learns motion implicitly from data, eliminating the dependency on pre-defined parametric models that often constrain animation capabilities. Unlike previous computationally intensive global fusion mechanisms (e.g., multiple attention layers) for fusing 3D structural and animation information, our design employs an efficient lightweight local fusion strategy to achieve high animation expressivity. As a result, our method runs at 107.31 FPS for animation and pose control while achieving comparable animation quality to the state-of-the-art, surpassing alternative designs that trade speed for quality or vice versa. Project website: https://research.nvidia.com/labs/amri/projects/instant4d

[128] FlashPortrait: 6x Faster Infinite Portrait Animation with Adaptive Latent Prediction

Shuyuan Tu,Yueming Pan,Yinming Huang,Xintong Han,Zhen Xing,Qi Dai,Kai Qiu,Chong Luo,Zuxuan Wu

Main category: cs.CV

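TL;DR: FlashPortrait is an end-to-end method based on a video diffusion transformer that generates identity-preserving, infinite-length portrait animation with up to 6x faster inference.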

Details

Motivation: Existing diffusion-based methods for long portrait animation fall short on identity (ID) consistency and cannot meet the demands of high-quality, long-duration animation. Method: FlashPortrait first computes identity-agnostic facial expression features with an off-the-shelf extractor; introduces a Normalized Facial Expression Block that aligns facial features with diffusion latents via their means and variances; adopts a dynamic sliding-window scheme with weighted blending in overlapping areas for smooth transitions; and uses higher-order latent derivatives to skip several denoising steps for acceleration. Result: Experiments on multiple benchmarks validate FlashPortrait, with qualitative and quantitative results better than existing methods and up to 6x inference acceleration while maintaining good ID consistency and visual quality. Conclusion: FlashPortrait addresses identity inconsistency and low inference efficiency in long portrait animation, offering a viable approach to efficient, high-quality personalized video generation. Abstract: Current diffusion-based acceleration methods for long-portrait animation struggle to ensure identity (ID) consistency. This paper presents FlashPortrait, an end-to-end video diffusion transformer capable of synthesizing ID-preserving, infinite-length videos while achieving up to 6x acceleration in inference speed. In particular, FlashPortrait begins by computing the identity-agnostic facial expression features with an off-the-shelf extractor. It then introduces a Normalized Facial Expression Block to align facial features with diffusion latents by normalizing them with their respective means and variances, thereby improving identity stability in facial modeling. During inference, FlashPortrait adopts a dynamic sliding-window scheme with weighted blending in overlapping areas, ensuring smooth transitions and ID consistency in long animations. In each context window, based on the latent variation rate at particular timesteps and the derivative magnitude ratio among diffusion layers, FlashPortrait utilizes higher-order latent derivatives at the current timestep to directly predict latents at future timesteps, thereby skipping several denoising steps and achieving 6x speed acceleration. Experiments on benchmarks show the effectiveness of FlashPortrait both qualitatively and quantitatively.
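
Sliding-window generation with weighted blending in the overlap can be sketched with a linear cross-fade. The example below uses one scalar per frame in place of a latent tensor, and the linear ramp is an assumed weighting; the abstract does not state which weights FlashPortrait uses, and the function name is hypothetical.

```python
def blend_windows(windows, overlap):
    """Concatenate windows, cross-fading `overlap` frames between neighbours.

    windows: list of per-window frame lists; each window is assumed to hold
    at least `overlap` frames. In the overlap, the weight of the new window
    ramps up linearly while the weight of the previous one ramps down.
    """
    output = list(windows[0])
    for window in windows[1:]:
        start = len(output) - overlap
        for i in range(overlap):
            w_new = (i + 1) / (overlap + 1)
            output[start + i] = (1.0 - w_new) * output[start + i] + w_new * window[i]
        # Frames after the overlap come from the new window unchanged.
        output.extend(window[overlap:])
    return output


# Two 4-frame windows sharing 2 frames: values fade from 0.0 towards 1.0.
blended = blend_windows([[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]], overlap=2)
print(blended)
```

With real latents the same arithmetic applies elementwise per frame. The cross-fade avoids a visible jump at window boundaries, which is the stated purpose of the blending step.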

[129] Alchemist: Unlocking Efficiency in Text-to-Image Model Training via Meta-Gradient Data Selection

Kaixin Ding,Yang Zhou,Xi Chen,Miao Yang,Jiarong Ou,Rui Chen,Xin Tao,Hengshuang Zhao

Main category: cs.CV

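TL;DR: This paper proposes Alchemist, a meta-gradient data selection framework that improves the training efficiency and visual quality of text-to-image generation models by automatically assessing the influence of each sample to select data efficiently.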

Details

Motivation: Text-to-image models are limited by training data quality; web-crawled and synthetic datasets often contain low-quality or redundant samples that cause unstable training and inefficient computation, so effective data selection is needed. Method: Propose the Alchemist framework, with two stages, data rating and data pruning: a lightweight rater estimates sample influence from gradient information enhanced with multi-granularity perception, and the Shift-Gsampling strategy selects an informative subset for training. Result: Experiments on synthetic and web-crawled datasets show that training on the 50% of data selected by Alchemist can outperform training on the full dataset, clearly improving visual quality and downstream performance. Conclusion: Alchemist is the first automatic, scalable, meta-gradient-based data selection framework for text-to-image model training and improves data efficiency and model performance. Abstract: Recent advances in Text-to-Image (T2I) generative models, such as Imagen, Stable Diffusion, and FLUX, have led to remarkable improvements in visual quality. However, their performance is fundamentally limited by the quality of training data. Web-crawled and synthetic image datasets often contain low-quality or redundant samples, which lead to degraded visual fidelity, unstable training, and inefficient computation. Hence, effective data selection is crucial for improving data efficiency. Existing approaches rely on costly manual curation or heuristic scoring based on single-dimensional features in Text-to-Image data filtering. Although meta-learning-based methods have been explored for LLMs, they have not been adapted to image modalities. To this end, we propose Alchemist, a meta-gradient-based framework to select a suitable subset from large-scale text-image data pairs. Our approach automatically learns to assess the influence of each sample by iteratively optimizing the model from a data-centric perspective. Alchemist consists of two key stages: data rating and data pruning. We train a lightweight rater to estimate each sample's influence based on gradient information, enhanced with multi-granularity perception. We then use the Shift-Gsampling strategy to select informative subsets for efficient model training. Alchemist is the first automatic, scalable, meta-gradient-based data selection framework for Text-to-Image model training. Experiments on both synthetic and web-crawled datasets demonstrate that Alchemist consistently improves visual quality and downstream performance.
Training on an Alchemist-selected 50% of the data can outperform training on the full dataset.

[130] VIVA: VLM-Guided Instruction-Based Video Editing with Reward Optimization

Xiaoyan Cong,Haotian Yang,Angtian Wang,Yizhi Wang,Yiding Yang,Canyu Zhang,Chongyang Ma

Main category: cs.CV

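TL;DR: This paper proposes VIVA, a scalable framework for instruction-based video editing that uses VLM-guided encoding and reward optimization to improve generalization to complex real-world instructions.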

Details

Motivation: Existing diffusion-based methods are typically trained on paired data of simple editing operations and generalize poorly to diverse, complex real-world instructions, leaving a generalization gap. Method: Introduce a VLM-based instructor that encodes the textual instruction, the first frame of the source video, and a reference image into visually grounded instruction representations; design an Edit-GRPO post-training stage that applies Group Relative Policy Optimization and directly optimizes the model with relative rewards; and build a data pipeline that synthetically generates diverse, high-quality paired data. Result: Experiments show that VIVA outperforms state-of-the-art methods in instruction following, generalization, and editing quality. Conclusion: Through VLM-guided representation learning and reward optimization, VIVA improves the adaptability of video editing models to complex natural-language instructions and their editing quality, with good potential for practical use. Abstract: Instruction-based video editing aims to modify an input video according to a natural-language instruction while preserving content fidelity and temporal coherence. However, existing diffusion-based approaches are often trained on paired data of simple editing operations, which fundamentally limits their ability to generalize to diverse and complex, real-world instructions. To address this generalization gap, we propose VIVA, a scalable framework for instruction-based video editing that leverages VLM-guided encoding and reward optimization. First, we introduce a VLM-based instructor that encodes the textual instruction, the first frame of the source video, and an optional reference image into visually-grounded instruction representations, providing fine-grained spatial and semantic context for the diffusion transformer backbone. Second, we propose a post-training stage, Edit-GRPO, which adapts Group Relative Policy Optimization to the domain of video editing, directly optimizing the model for instruction-faithful, content-preserving, and aesthetically pleasing edits using relative rewards. Furthermore, we propose a data construction pipeline designed to synthetically generate diverse, high-fidelity paired video-instruction data of basic editing operations. Extensive experiments show that VIVA achieves superior instruction following, generalization, and editing quality over state-of-the-art methods. Website: https://viva-paper.github.io
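
The "relative rewards" in Group Relative Policy Optimization are usually obtained by normalizing each sample's reward against the other samples generated for the same prompt. The sketch below shows that normalization in its commonly described form. Whether Edit-GRPO uses exactly this formula, and how its rewards are computed, is not stated in the abstract, so the function name, the `eps` term, and the toy rewards are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so samples
    are ranked against their siblings rather than on an absolute scale.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three candidate edits for one instruction, scored by some reward model.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
print(advantages)
```

Because the baseline is the group mean, no separate value network is needed: edits scoring above their group average are reinforced and those below are discouraged. When all rewards in a group are equal, every advantage is zero and the group contributes no gradient.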

[131] Flowing from Reasoning to Motion: Learning 3D Hand Trajectory Prediction from Egocentric Human Interaction Videos

Mingfei Chen,Yifan Wang,Zhengqin Li,Homanga Bharadhwaj,Yujin Chen,Chuan Qin,Ziyi Kou,Yuan Tian,Eric Whitmire,Rajinder Sodhi,Hrvoje Benko,Eli Shlizerman,Yue Liu

Main category: cs.CV

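TL;DR: This paper presents the EgoMAN dataset and model for interaction stage-aware 3D hand trajectory prediction, combining vision-language reasoning with motion generation.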

Details

Motivation: Existing work on 3D hand trajectory prediction is limited by datasets that lack semantic supervision and by models that only weakly link reasoning and action. Method: Present the EgoMAN dataset, with 219K 6DoF trajectories and 3M structured QA pairs, and design the EgoMAN model, a reasoning-to-motion framework built on a trajectory-token interface and trained progressively to align reasoning with motion dynamics. Result: The model performs well in trajectory prediction accuracy and interaction-stage awareness and generalizes to real-world scenes. Conclusion: EgoMAN jointly models semantic, spatial, and motion reasoning, advancing hand motion prediction in embodied interaction. Abstract: Prior works on 3D hand trajectory prediction are constrained by datasets that decouple motion from semantic supervision and by models that weakly link reasoning and action. To address these, we first present the EgoMAN dataset, a large-scale egocentric dataset for interaction stage-aware 3D hand trajectory prediction with 219K 6DoF trajectories and 3M structured QA pairs for semantic, spatial, and motion reasoning. We then introduce the EgoMAN model, a reasoning-to-motion framework that links vision-language reasoning and motion generation via a trajectory-token interface. Trained progressively to align reasoning with motion dynamics, our approach yields accurate and stage-aware trajectories with generalization across real-world scenes.

[132] SceneDiff: A Benchmark and Method for Multiview Object Change Detection

Yuqun Wu,Chih-hao Lin,Henry Che,Aditi Tiwari,Chuhang Zou,Shenlong Wang,Derek Hoiem

Main category: cs.CV

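TL;DR: This paper presents the SceneDiff method and benchmark for multiview object change detection; by combining pretrained 3D, segmentation, and image encoding models, it identifies objects added, removed, or moved in a scene across viewpoints and substantially outperforms existing methods.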

Details

Motivation: Detecting object changes (additions, removals, or moves) between images or videos of the same scene captured at different times is important, but viewpoint changes can make unchanged objects appear changed, so a more robust method is needed. Method: Propose SceneDiff, a training-free method that uses pretrained 3D, segmentation, and image encoding models to align the two captures in 3D, extract object regions, and compare their spatial and semantic features to detect changes. Result: Experiments on multi-view and two-view benchmarks show that the method outperforms existing approaches by large margins, with relative AP improvements of 94% and 37.4%, respectively. Conclusion: SceneDiff is a training-free object change detection method that is robust to viewpoint changes, performs strongly on multiple benchmarks, and advances the field. Abstract: We investigate the problem of identifying objects that have been added, removed, or moved between a pair of captures (images or videos) of the same scene at different times. Detecting such changes is important for many applications, such as robotic tidying or construction progress and safety monitoring. A major challenge is that varying viewpoints can cause objects to falsely appear changed. We introduce SceneDiff Benchmark, the first multiview change detection benchmark with object instance annotations, comprising 350 diverse video pairs with thousands of changed objects. We also introduce the SceneDiff method, a new training-free approach for multiview object change detection that leverages pretrained 3D, segmentation, and image encoding models to robustly predict across multiple benchmarks. Our method aligns the captures in 3D, extracts object regions, and compares spatial and semantic region features to detect changes. Experiments on multi-view and two-view benchmarks demonstrate that our method outperforms existing approaches by large margins (94% and 37.4% relative AP improvements). The benchmark and code will be publicly released.

[133] MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Model for Embodied Task Planning

Yuanchen Ju,Yongyuan Liang,Yen-Jen Wang,Nandiraju Gireesh,Yuanliang Ju,Seungjae Lee,Qiao Gu,Elvis Hsieh,Furong Huang,Koushil Sreenath

Main category: cs.CV

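TL;DR: This paper proposes MomaGraph, a unified scene representation for embodied agents that combines spatial-functional relationships with part-level interactive elements, and releases MomaGraph-Scenes, the first large-scale, task-driven scene graph dataset for household environments, together with the MomaGraph-Bench evaluation suite. Building on these, MomaGraph-R1 (a 7B vision-language model) is trained with reinforcement learning for zero-shot task planning, reaching 71.6% accuracy on the new benchmark, 11.4% above the best baseline, and transferring well to real-robot tasks.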

Details

Motivation: Existing scene representations usually separate spatial and functional relations, ignore object state changes and temporal updates, and lack information relevant to the current task, so they do not meet the needs of mobile manipulators in households. A compact, semantically rich, task-oriented unified scene representation is needed. Method: Propose MomaGraph, which integrates spatial-functional relationships with interactive parts; build the MomaGraph-Scenes dataset and the MomaGraph-Bench evaluation suite; and train the 7B vision-language model MomaGraph-R1 on this dataset with reinforcement learning, using a Graph-then-Plan framework for zero-shot task planning. Result: MomaGraph-R1 reaches 71.6% accuracy on MomaGraph-Bench, 11.4% above the best baseline; it performs well on several public benchmarks and transfers to real-robot experiments, confirming its generalization and practicality. Conclusion: MomaGraph provides a more comprehensive, dynamic, and task-oriented scene representation for mobile manipulators in households and, by advancing data, model, and evaluation together, markedly improves the reasoning and planning abilities of embodied agents in complex household scenes. Abstract: Mobile manipulators in households must both navigate and manipulate. This requires a compact, semantically rich scene representation that captures where objects are, how they function, and which parts are actionable. Scene graphs are a natural choice, yet prior work often separates spatial and functional relations, treats scenes as static snapshots without object states or temporal updates, and overlooks information most relevant for accomplishing the current task. To address these limitations, we introduce MomaGraph, a unified scene representation for embodied agents that integrates spatial-functional relationships and part-level interactive elements. However, advancing such a representation requires both suitable data and rigorous evaluation, which have been largely missing. We thus contribute MomaGraph-Scenes, the first large-scale dataset of richly annotated, task-driven scene graphs in household environments, along with MomaGraph-Bench, a systematic evaluation suite spanning six reasoning capabilities from high-level planning to fine-grained scene understanding. Built upon this foundation, we further develop MomaGraph-R1, a 7B vision-language model trained with reinforcement learning on MomaGraph-Scenes. MomaGraph-R1 predicts task-oriented scene graphs and serves as a zero-shot task planner under a Graph-then-Plan framework.
Extensive experiments demonstrate that our model achieves state-of-the-art results among open-source models, reaching 71.6% accuracy on the benchmark (+11.4% over the best baseline), while generalizing across public benchmarks and transferring effectively to real-robot experiments.
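
To make the idea of a scene graph with spatial and functional relations, object states, and actionable parts more concrete, here is a minimal sketch. It is not the MomaGraph schema: the class names (`Node`, `Edge`, `SceneGraph`), the relation labels, and the toy `plan_fetch` planner are all assumptions made for illustration, and the abstract does not describe the actual data format or planner.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    state: str = "unknown"                       # hypothetical state label, e.g. "closed"
    parts: list = field(default_factory=list)    # actionable parts, e.g. ["handle"]


@dataclass
class Edge:
    src: str
    dst: str
    relation: str    # hypothetical label, e.g. "inside", "controls"
    kind: str        # "spatial" or "functional"


@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_node(self, node):
        self.nodes[node.name] = node

    def add_edge(self, src, dst, relation, kind):
        self.edges.append(Edge(src, dst, relation, kind))


def plan_fetch(graph, target):
    """Toy planner: read the graph first, then emit steps for fetching `target`."""
    steps = []
    for edge in graph.edges:
        if edge.src == target and edge.kind == "spatial" and edge.relation == "inside":
            container = graph.nodes[edge.dst]
            steps.append(f"navigate to {container.name}")
            # Use the object state and its actionable part to decide on an extra step.
            if container.state == "closed" and container.parts:
                steps.append(f"pull {container.parts[0]} of {container.name}")
            break
    steps.append(f"pick up {target}")
    return steps


graph = SceneGraph()
graph.add_node(Node("fridge", state="closed", parts=["handle"]))
graph.add_node(Node("milk"))
graph.add_node(Node("switch", state="off", parts=["toggle"]))
graph.add_node(Node("lamp", state="off"))
graph.add_edge("milk", "fridge", "inside", "spatial")
graph.add_edge("switch", "lamp", "controls", "functional")

print(plan_fetch(graph, "milk"))
```

The planner only consults the graph, never the raw scene, which is the general shape of a graph-first, plan-second pipeline; a learned model would replace both the hand-built graph and the rule-based planner.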

[134] SFTok: Bridging the Performance Gap in Discrete Tokenizers

Qihang Rao,Borui Zhang,Wenzhao Zheng,Jie Zhou,Jiwen Lu

Main category: cs.CV

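TL;DR: SFTok is a discrete image tokenizer that introduces a multi-step iterative mechanism and self-forcing guided reconstruction, substantially improving reconstruction quality for high-resolution image generation and achieving state-of-the-art performance at a high compression rate.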

Details

Motivation: Discrete tokenizers lag behind continuous ones in multimodal systems, in part because of a mismatch between training and inference, which limits their adoption. Method: SFTok combines self-forcing guided visual reconstruction with a debias-and-fitting training strategy to resolve the training-inference inconsistency of the multi-step process. Result: At a high compression rate of only 64 tokens per image, SFTok reaches an rFID of 1.21 on ImageNet and a gFID of 2.29 in class-to-image generation. Conclusion: SFTok improves discrete tokenizers and narrows the gap to continuous ones, which may encourage their wider use in multimodal systems. Abstract: Recent advances in multimodal models highlight the pivotal role of image tokenization in high-resolution image generation. By compressing images into compact latent representations, tokenizers enable generative models to operate in lower-dimensional spaces, thereby improving computational efficiency and reducing complexity. Discrete tokenizers naturally align with the autoregressive paradigm but still lag behind continuous ones, limiting their adoption in multimodal systems. To address this, we propose SFTok, a discrete tokenizer that incorporates a multi-step iterative mechanism for precise reconstruction. By integrating self-forcing guided visual reconstruction and a debias-and-fitting training strategy, SFTok resolves the training-inference inconsistency in the multi-step process, significantly enhancing image reconstruction quality. At a high compression rate of only 64 tokens per image, SFTok achieves state-of-the-art reconstruction quality on ImageNet (rFID = 1.21) and demonstrates exceptional performance in class-to-image generation tasks (gFID = 2.29).

[135] Depth Any Panoramas: A Foundation Model for Panoramic Depth Estimation

Xin Lin,Meixi Song,Dizhe Zhang,Wenxuan Lu,Haodong Li,Bo Du,Ming-Hsuan Yang,Truong Nguyen,Lu Qi

Main category: cs.CV

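TL;DR: Proposes a panoramic metric depth foundation model that uses a data-in-the-loop paradigm and a three-stage pseudo-label curation pipeline to achieve strong generalization and zero-shot transfer across scenes of varying scale.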

Details

Motivation: Existing depth estimation models suffer from large domain gaps, geometric inconsistency, and weak generalization when scene distances vary, especially across indoor/outdoor and synthetic/real data, so a panoramic depth model that works broadly on diverse real scenes is needed. Method: The model uses DINOv3-Large as the backbone and adds a plug-and-play range mask head together with sharpness-centric and geometry-centric optimization; the authors also build a large-scale mixed dataset and design a three-stage pseudo-label curation pipeline to reduce domain gaps. Result: The model performs strongly on benchmarks such as Stanford2D3D, Matterport3D, and Deep360 and shows strong zero-shot generalization, with robust and stable metric depth predictions in diverse real-world scenes. Conclusion: Co-designing data and model improves the adaptability and accuracy of panoramic depth estimation in complex real environments and supports the data-in-the-loop paradigm. Abstract: In this work, we present a panoramic metric depth foundation model that generalizes across diverse scene distances. We explore a data-in-the-loop paradigm from the view of both data construction and framework design. We collect a large-scale dataset by combining public datasets, high-quality synthetic data from our UE5 simulator and text-to-image models, and real panoramic images from the web. To reduce domain gaps between indoor/outdoor and synthetic/real data, we introduce a three-stage pseudo-label curation pipeline to generate reliable ground truth for unlabeled images. For the model, we adopt DINOv3-Large as the backbone for its strong pre-trained generalization, and introduce a plug-and-play range mask head, sharpness-centric optimization, and geometry-centric optimization to improve robustness to varying distances and enforce geometric consistency across views. Experiments on multiple benchmarks (e.g., Stanford2D3D, Matterport3D, and Deep360) demonstrate strong performance and zero-shot generalization, with particularly robust and stable metric predictions in diverse real-world scenes. The project page can be found at: https://insta360-research-team.github.io/DAP_website/

[136] StereoPilot: Learning Unified and Efficient Stereo Conversion via Generative Priors

Guibao Shen,Yihua Du,Wenhang Ge,Jing He,Chirui Chang,Donghao Zhou,Zhen Yang,Luozhou Wang,Xin Tao,Ying-Cong Chen

Main category: cs.CV

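TL;DR: This paper introduces the UniStereo dataset and the StereoPilot model to overcome the shortcomings of multi-stage pipelines in monocular-to-stereo video conversion, enabling efficient, high-quality stereo view synthesis.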

Details

Motivation: Existing multi-stage Depth-Warp-Inpaint (DWI) methods suffer from error propagation, depth ambiguity, and format inconsistency, and cannot meet the demand for high-quality stereo video content. Method: The authors build UniStereo, a large-scale unified stereo video dataset, and propose StereoPilot, a feed-forward model that synthesizes the target view directly without explicit depth maps or diffusion sampling, using a learnable domain switcher and a cycle consistency loss to handle different stereo formats. Result: Experiments show that StereoPilot significantly outperforms state-of-the-art methods in both visual quality and computational efficiency. Conclusion: Together with UniStereo, StereoPilot offers a more efficient and consistent solution for monocular-to-stereo video conversion. Abstract: The rapid growth of stereoscopic displays, including VR headsets and 3D cinemas, has led to increasing demand for high-quality stereo video content. However, producing 3D videos remains costly and complex, while automatic Monocular-to-Stereo conversion is hindered by the limitations of the multi-stage "Depth-Warp-Inpaint" (DWI) pipeline. This paradigm suffers from error propagation, depth ambiguity, and format inconsistency between parallel and converged stereo configurations. To address these challenges, we introduce UniStereo, the first large-scale unified dataset for stereo video conversion, covering both stereo formats to enable fair benchmarking and robust model training. Building upon this dataset, we propose StereoPilot, an efficient feed-forward model that directly synthesizes the target view without relying on explicit depth maps or iterative diffusion sampling. Equipped with a learnable domain switcher and a cycle consistency loss, StereoPilot adapts seamlessly to different stereo formats and achieves improved consistency. Extensive experiments demonstrate that StereoPilot significantly outperforms state-of-the-art methods in both visual fidelity and computational efficiency. Project page: https://hit-perfect.github.io/StereoPilot/.
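
The cycle consistency loss mentioned above can be illustrated generically: map the left view to the right view, map the result back, and penalize the difference from the original. The sketch below is an assumption-laden toy, not StereoPilot's loss. The "views" are 1-D pixel rows, the "synthesis" functions are circular shifts, and the L1 distance is chosen only for simplicity; the abstract does not specify any of these.

```python
def l1(a, b):
    """Mean absolute difference between two equal-length pixel rows."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def shift(row, d):
    """Toy stand-in for view synthesis: circularly shift a row by d pixels."""
    d %= len(row)
    return row[-d:] + row[:-d] if d else list(row)


def cycle_consistency_loss(left, to_right, to_left):
    """Left -> right -> left should reproduce the input; return the L1 gap."""
    right_hat = to_right(left)
    left_cycled = to_left(right_hat)
    return l1(left, left_cycled)


left_row = [0.0, 1.0, 2.0, 3.0, 4.0]

# A consistent pair of mappings: the backward shift undoes the forward one.
consistent = cycle_consistency_loss(
    left_row, lambda r: shift(r, 2), lambda r: shift(r, -2)
)

# An inconsistent pair: the backward mapping undoes only part of the shift.
inconsistent = cycle_consistency_loss(
    left_row, lambda r: shift(r, 2), lambda r: shift(r, -1)
)

print(consistent, inconsistent)
```

The loss is zero only when the two mappings invert each other, which is what pushes a pair of learned view-synthesis directions toward mutual consistency.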

[137] AdaTooler-V: Adaptive Tool-Use for Images and Videos

Chaoyang Wang,Kaituo Feng,Dongyang Chen,Zhongyu Wang,Zhixun Li,Sicheng Gao,Meng Meng,Xu Zhou,Manyuan Zhang,Yuzhang Shang,Xiangyu Yue

Main category: cs.CV

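TL;DR: Proposes AdaTooler-V, a multimodal large language model that uses vision tools adaptively; with the AT-GRPO reinforcement learning algorithm and large-scale training data, it clearly outperforms existing methods on a range of visual reasoning tasks.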

Details

Motivation: Existing open-source multimodal large models often invoke vision tools blindly, which raises inference overhead and degrades performance; they lack the ability to judge whether a tool is actually needed. Method: The authors propose the AdaTooler-V model and the AT-GRPO reinforcement learning algorithm, which adjusts reward scales according to each sample's Tool Benefit Score, and construct two datasets, AdaTooler-V-CoT-100k for SFT cold start and AdaTooler-V-300k for reinforcement learning. Result: Experiments on twelve benchmarks show strong results on single-image, multi-image, and video reasoning; AdaTooler-V-7B reaches 89.8% accuracy on the high-resolution benchmark V*, surpassing GPT-4o and Gemini 1.5 Pro. Conclusion: AdaTooler-V achieves efficient adaptive tool use, improving multimodal reasoning while reducing unnecessary computation. Abstract: Recent advances have shown that multimodal large language models (MLLMs) benefit from multimodal interleaved chain-of-thought (CoT) with vision tool interactions. However, existing open-source models often exhibit blind tool-use reasoning patterns, invoking vision tools even when they are unnecessary, which significantly increases inference overhead and degrades model performance. To this end, we propose AdaTooler-V, an MLLM that performs adaptive tool-use by determining whether a visual problem truly requires tools. First, we introduce AT-GRPO, a reinforcement learning algorithm that adaptively adjusts reward scales based on the Tool Benefit Score of each sample, encouraging the model to invoke tools only when they provide genuine improvements. Moreover, we construct two datasets to support training: AdaTooler-V-CoT-100k for SFT cold start and AdaTooler-V-300k for RL with verifiable rewards across single-image, multi-image, and video data. Experiments across twelve benchmarks demonstrate the strong reasoning capability of AdaTooler-V, outperforming existing methods in diverse visual reasoning tasks. Notably, AdaTooler-V-7B achieves an accuracy of 89.8% on the high-resolution benchmark V*, surpassing the commercial proprietary models GPT-4o and Gemini 1.5 Pro. All code, models, and data are released.
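
As a rough illustration of scaling rewards by a per-sample tool benefit, the sketch below pairs a hypothetical benefit score with the group-relative advantage normalization commonly used in GRPO-style training. The abstract does not define the Tool Benefit Score or the AT-GRPO reward, so `tool_benefit`, `shaped_reward`, and the weight `alpha` are assumptions, not the paper's formulas.

```python
from statistics import mean, pstdev


def tool_benefit(acc_with_tool, acc_without_tool):
    """Hypothetical benefit score: accuracy gain from using tools on a sample."""
    return acc_with_tool - acc_without_tool


def shaped_reward(correct, used_tool, benefit, alpha=0.5):
    """Correctness reward, plus a tool term scaled by the sample's benefit.

    A negative benefit penalizes tool calls; a positive one rewards them.
    """
    base = 1.0 if correct else 0.0
    bonus = alpha * benefit if used_tool else 0.0
    return base + bonus


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: standardize rewards within one rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One sample where tools hurt on average (benefit < 0), with four rollouts.
benefit = tool_benefit(acc_with_tool=0.5, acc_without_tool=0.7)
rollouts = [
    (True, True),    # correct, used a tool
    (True, False),   # correct, no tool
    (False, True),   # wrong, used a tool
    (False, False),  # wrong, no tool
]
rewards = [shaped_reward(c, t, benefit) for c, t in rollouts]
advantages = group_advantages(rewards)
print(rewards, advantages)
```

With a negative benefit, the correct tool-free rollout receives the largest advantage in the group, so the policy is nudged away from unnecessary tool calls on that sample.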

[138] DVGT: Driving Visual Geometry Transformer

Sicheng Zuo,Zixun Xie,Wenzhao Zheng,Shaoqing Xu,Fang Li,Shengyin Jiang,Long Chen,Zhi-Xin Yang,Jiwen Lu

Main category: cs.CV

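TL;DR: Proposes the Driving Visual Geometry Transformer (DVGT), which reconstructs a global dense 3D point map from unposed multi-view image sequences and supports flexible camera configurations and multi-scenario geometry perception for autonomous driving.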

Details

Motivation: A dense geometry perception model that targets autonomous driving and adapts to different scenarios and camera configurations is still missing. Method: DVGT extracts image features with a DINO backbone, models geometric relations across images with alternating intra-view local attention, cross-view spatial attention, and cross-frame temporal attention, and uses multiple heads to decode a global point map and the ego poses. Result: Trained on a large mixture of driving datasets including nuScenes, OpenScene, Waymo, KITTI, and DDAD, DVGT significantly outperforms existing models across scenarios and outputs metric-scaled geometry without precise camera parameters or post-alignment with external sensors. Conclusion: DVGT is a driving-scene geometry perception model free of explicit 3D geometric priors, with good cross-scenario generalization and flexibility in camera configuration. Abstract: Perceiving and reconstructing 3D scene geometry from visual inputs is crucial for autonomous driving. However, a driving-targeted dense geometry perception model that can adapt to different scenarios and camera configurations is still lacking. To bridge this gap, we propose a Driving Visual Geometry Transformer (DVGT), which reconstructs a global dense 3D point map from a sequence of unposed multi-view visual inputs. We first extract visual features for each image using a DINO backbone, and employ alternating intra-view local attention, cross-view spatial attention, and cross-frame temporal attention to infer geometric relations across images. We then use multiple heads to decode a global point map in the ego coordinate of the first frame and the ego poses for each frame. Unlike conventional methods that rely on precise camera parameters, DVGT is free of explicit 3D geometric priors, enabling flexible processing of arbitrary camera configurations. DVGT directly predicts metric-scaled geometry from image sequences, eliminating the need for post-alignment with external sensors. Trained on a large mixture of driving datasets including nuScenes, OpenScene, Waymo, KITTI, and DDAD, DVGT significantly outperforms existing models on various scenarios. Code is available at https://github.com/wzzheng/DVGT.

[139] EasyV2V: A High-quality Instruction-based Video Editing Framework

Jinjie Mai,Chaoyang Wang,Guocheng Gordon Qian,Willi Menapace,Sergey Tulyakov,Bernard Ghanem,Peter Wonka,Ashkan Mirzaei

Main category: cs.CV

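TL;DR: This paper proposes EasyV2V, a simple and effective instruction-based video editing framework that studies the design space of data, model, and control and achieves state-of-the-art video editing results.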

Details

Motivation: Video editing still faces challenges in consistency, control, and generalization, and existing methods are not effective enough, which calls for a simpler yet stronger editing framework. Method: On the data side, the authors compose existing expert models to build diverse video pairs, lift image edit pairs into videos, mine dense-captioned clips, and add transition supervision; on the model side, they start from a pretrained text-to-video model and use sequence concatenation with light LoRA fine-tuning; for control, they unify spatiotemporal control through a single mask mechanism and support reference images. Result: EasyV2V works well with various input combinations (e.g., video+text, video+mask+text) and surpasses concurrent work and commercial systems, achieving state-of-the-art video editing results. Conclusion: With a simple design, EasyV2V balances data construction, model training, and user control, providing an efficient and general solution for instruction-driven video editing. Abstract: While image editing has advanced rapidly, video editing remains less explored, facing challenges in consistency, control, and generalization. We study the design space of data, architecture, and control, and introduce EasyV2V, a simple and effective framework for instruction-based video editing. On the data side, we compose existing experts with fast inverses to build diverse video pairs, lift image edit pairs into videos via single-frame supervision and pseudo pairs with shared affine motion, mine dense-captioned clips for video pairs, and add transition supervision to teach how edits unfold. On the model side, we observe that pretrained text-to-video models possess editing capability, motivating a simplified design. Simple sequence concatenation for conditioning with light LoRA fine-tuning suffices to train a strong model. For control, we unify spatiotemporal control via a single mask mechanism and support optional reference images. Overall, EasyV2V works with flexible inputs, e.g., video+text, video+mask+text, video+mask+reference+text, and achieves state-of-the-art video editing results, surpassing concurrent and commercial systems. Project page: https://snap-research.github.io/easyv2v/

[140] Differences That Matter: Auditing Models for Capability Gap Discovery and Rectification

Qihao Liu,Chengzhi Mao,Yaojie Liu,Alan Yuille,Wen-Sheng Chu

Main category: cs.CV

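TL;DR: AuditDM is an automated framework that trains a multimodal large model as an auditor via reinforcement learning; the auditor generates questions and counterfactual images that maximize disagreement among target models, exposing their failure modes, and the findings are then used to improve the models.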

Details

Motivation: Existing evaluation methods for multimodal large language models (MLLMs) lack interpretability and do not adequately expose capability gaps between models. Method: AuditDM fine-tunes an MLLM as an auditor with reinforcement learning so that it generates questions and counterfactual images that maximize disagreement among target models, thereby uncovering failure modes, and uses the resulting examples as annotation-free data for rectification. Result: On SoTA models such as Gemma-3 and PaliGemma-2, AuditDM discovers more than 20 distinct failure types; fine-tuning on these discoveries consistently improves all models across 16 benchmarks and even lets a 3B model surpass its 28B counterpart. Conclusion: As data scaling hits diminishing returns, targeted model auditing offers an effective path to model diagnosis and improvement. Abstract: Conventional evaluation methods for multimodal LLMs (MLLMs) lack interpretability and are often insufficient to fully disclose significant capability gaps across models. To address this, we introduce AuditDM, an automated framework that actively discovers and rectifies MLLM failure modes by auditing their divergence. AuditDM fine-tunes an MLLM as an auditor via reinforcement learning to generate challenging questions and counterfactual images that maximize disagreement among target models. Once trained, the auditor uncovers diverse, interpretable exemplars that reveal model weaknesses and serve as annotation-free data for rectification. When applied to SoTA models like Gemma-3 and PaliGemma-2, AuditDM discovers more than 20 distinct failure types. Fine-tuning on these discoveries consistently improves all models across 16 benchmarks, and enables a 3B model to surpass its 28B counterpart. Our results suggest that as data scaling hits diminishing returns, targeted model auditing offers an effective path to model diagnosis and improvement.
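
One simple way to picture "maximizing disagreement among target models" is to score each candidate question by how split the models' answers are and keep the most divisive one. The sketch below is an illustration under assumptions: the `disagreement` measure (one minus the majority share) and the stand-in models are hypothetical, and the abstract does not state how AuditDM's reward is actually computed.

```python
from collections import Counter


def disagreement(answers):
    """1 minus the share of the most common answer; 0.0 when all models agree."""
    counts = Counter(answers)
    majority = counts.most_common(1)[0][1]
    return 1.0 - majority / len(answers)


def pick_most_divisive(candidates, models):
    """Score each candidate question by model disagreement and return the top one."""
    scored = [(disagreement([m(q) for m in models]), q) for q in candidates]
    return max(scored)


# Hypothetical stand-ins for target models: callables from question to answer.
def model_a(question):
    return "yes"


def model_b(question):
    return "no" if "count" in question else "yes"


candidates = ["is it red?", "count the cups"]
score, question = pick_most_divisive(candidates, [model_a, model_b])
print(score, question)
```

In an RL setting, a score of this kind could serve as the auditor's reward signal, so that questions on which the target models split are reinforced.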

[141] Next-Embedding Prediction Makes Strong Vision Learners

Sihan Xu,Ziqiao Ma,Wenhao Chai,Xuweiyi Chen,Weiyang Jin,Joyce Chai,Saining Xie,Stella X. Yu

Main category: cs.CV

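TL;DR: This paper proposes NEPA (Next-Embedding Predictive Autoregression), a self-supervised visual learning method that performs generative pretraining by predicting patch embeddings, with no pixel reconstruction, discrete tokens, or contrastive loss, and achieves strong results on tasks such as ImageNet and ADE20K.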

Details

Motivation: Inspired by the success of generative pretraining in natural language, the authors ask whether the same principles can be applied to vision, moving away from the conventional representation-learning paradigm and building general visual models by predicting embeddings directly. Method: NEPA trains a Transformer with causal masking and stop gradient to predict future patch embeddings from past ones, using autoregressive prediction in embedding space as the sole pretraining objective. Result: After fine-tuning, ViT-B and ViT-L reach 83.8% and 85.3% top-1 accuracy on ImageNet-1K, and the model transfers well to semantic segmentation on ADE20K. Conclusion: Generative pretraining from embeddings is a simple, scalable, and potentially modality-agnostic approach to visual self-supervised learning. Abstract: Inspired by the success of generative pretraining in natural language, we ask whether the same principles can yield strong self-supervised visual learners. Instead of training models to output features for downstream use, we train them to generate embeddings to perform predictive tasks directly. This work explores such a shift from learning representations to learning models. Specifically, models learn to predict future patch embeddings conditioned on past ones, using causal masking and stop gradient, which we refer to as Next-Embedding Predictive Autoregression (NEPA). We demonstrate that a simple Transformer pretrained on ImageNet-1k with next embedding prediction as its sole learning objective is effective - no pixel reconstruction, discrete tokens, contrastive loss, or task-specific heads. This formulation retains architectural simplicity and scalability, without requiring additional design complexity. NEPA achieves strong results across tasks, attaining 83.8% and 85.3% top-1 accuracy on ImageNet-1K with ViT-B and ViT-L backbones after fine-tuning, and transferring effectively to semantic segmentation on ADE20K. We believe generative pretraining from embeddings provides a simple, scalable, and potentially modality-agnostic alternative to visual self-supervised learning.
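
The two structural ingredients named above, causal masking and predicting the next embedding from past ones, can be sketched without any deep learning library. The following is a toy illustration, not the NEPA objective: the cosine-distance loss is an assumption (the abstract does not name the loss), and stop gradient is only indicated in a comment because plain Python has no autograd.

```python
import math


def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def next_embedding_loss(predicted, targets):
    """Mean cosine distance between predicted[i] and targets[i + 1].

    In a real training setup the targets would be detached (stop gradient),
    so that only the predictor receives gradients; here they are plain lists.
    The last prediction has no target and is dropped.
    """
    pairs = list(zip(predicted[:-1], targets[1:]))
    return sum(1.0 - cosine(p, t) for p, t in pairs) / len(pairs)


# Three patch embeddings in sequence order (toy 2-D vectors).
targets = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# A predictor that guesses every next embedding exactly.
perfect = [[0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]

# A predictor that simply copies its current input.
copycat = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

print(causal_mask(3))
print(next_embedding_loss(perfect, targets), next_embedding_loss(copycat, targets))
```

The shifted pairing is the same trick used in next-token language modeling, applied to continuous embeddings instead of discrete tokens.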

[142] Generative Refocusing: Flexible Defocus Control from a Single Image

Chun-Wei Tuan Mu,Jia-Bin Huang,Yu-Lun Liu

Main category: cs.CV

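TL;DR: This paper proposes Generative Refocusing, a two-step method that uses DeblurNet to recover an all-in-focus image from a single input and BokehNet to synthesize controllable bokeh. Its main innovation is semi-supervised training that combines synthetic paired data with unpaired real bokeh images and uses EXIF metadata to improve realism; it also supports text-guided adjustments and custom aperture shapes.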

Details

Motivation: Single-image refocusing requires both recovering sharp content and generating realistic bokeh; existing methods need all-in-focus inputs, depend on synthetic data, offer limited control, and struggle to reflect real optical characteristics. Method: Generative Refocusing consists of DeblurNet, which recovers all-in-focus images from various inputs, and BokehNet, which generates controllable bokeh; it is trained in a semi-supervised manner on synthetic paired data and unpaired real bokeh images, using EXIF metadata to capture real optical characteristics. Result: The method achieves top performance on defocus deblurring, bokeh synthesis, and refocusing benchmarks, and supports text-guided editing and custom aperture shapes. Conclusion: By combining real and synthetic data through semi-supervised training, the method overcomes limitations of earlier single-image refocusing approaches and delivers high-quality, controllable refocusing. Abstract: Depth-of-field control is essential in photography, but getting the perfect focus often takes several tries or special equipment. Single-image refocusing is still difficult. It involves recovering sharp content and creating realistic bokeh. Current methods have significant drawbacks. They need all-in-focus inputs, depend on synthetic data from simulators, and have limited control over aperture. We introduce Generative Refocusing, a two-step process that uses DeblurNet to recover all-in-focus images from various inputs and BokehNet for creating controllable bokeh. Our main innovation is semi-supervised training. This method combines synthetic paired data with unpaired real bokeh images, using EXIF metadata to capture real optical characteristics beyond what simulators can provide. Our experiments show we achieve top performance in defocus deblurring, bokeh synthesis, and refocusing benchmarks. Additionally, our Generative Refocusing allows text-guided adjustments and custom aperture shapes.

[143] The World is Your Canvas: Painting Promptable Events with Reference Images, Trajectories, and Text

Hanlin Wang,Hao Ouyang,Qiuyu Wang,Yue Yu,Yihao Meng,Wen Wang,Ka Leong Cheng,Shuailei Ma,Qingyan Bai,Yixuan Li,Cheng Chen,Yanhong Zeng,Xing Zhu,Yujun Shen,Qifeng Chen

Main category: cs.CV

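TL;DR: WorldCanvas is a multimodal framework that combines text, trajectories, and reference images to generate coherent, controllable videos of world events, supporting multi-agent interactions, object entry and exit, appearance preservation, and counterintuitive events, and making world models more interactive and user-controllable.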

Details

Motivation: Existing text-to-video and trajectory-only control methods struggle to simulate complex world events that are both semantically rich and highly controllable, and they lack support for multi-agent interactions, consistent object identity, and objects entering or leaving the scene. Method: The WorldCanvas framework combines trajectories (encoding motion, timing, and visibility) with natural language (semantic intent) and reference images (visual identity), using multimodal conditioning for fine-grained control over complex world events. Result: The generated videos are temporally coherent and show emergent consistency of object identity and scene, preserving appearance even when an object temporarily disappears; the framework supports multi-agent interactions, object entry and exit, reference-guided appearance control, and counterintuitive events. Conclusion: WorldCanvas moves world models from passive predictors toward interactive, user-shaped simulators. Abstract: We present WorldCanvas, a framework for promptable world events that enables rich, user-directed simulation by combining text, trajectories, and reference images. Unlike text-only approaches and existing trajectory-controlled image-to-video methods, our multimodal approach combines trajectories (encoding motion, timing, and visibility) with natural language for semantic intent and reference images for visual grounding of object identity, enabling the generation of coherent, controllable events that include multi-agent interactions, object entry/exit, reference-guided appearance, and counterintuitive events. The resulting videos demonstrate not only temporal coherence but also emergent consistency, preserving object identity and scene despite temporary disappearance. By supporting expressive world events generation, WorldCanvas advances world models from passive predictors to interactive, user-shaped simulators. Our project page is available at: https://worldcanvas.github.io/.
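
A trajectory that encodes motion, timing, and visibility can be pictured as a list of timestamped points with a visibility flag, bundled with a text prompt and a reference image. The sketch below is a hypothetical container, not the WorldCanvas input format: `TrajectoryPoint`, `EventPrompt`, the file name, and `visibility_spans` are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class TrajectoryPoint:
    t: float        # time in seconds
    x: float        # normalized image coordinates
    y: float
    visible: bool   # False while the object is off-screen or occluded


@dataclass
class EventPrompt:
    text: str              # semantic intent
    reference_image: str   # hypothetical path to the identity reference
    trajectory: list       # list of TrajectoryPoint


def visibility_spans(trajectory):
    """Return (start_t, end_t) intervals during which the object is visible."""
    spans = []
    start = None
    last_t = None
    for point in trajectory:
        if point.visible and start is None:
            start = point.t                  # object enters
        if not point.visible and start is not None:
            spans.append((start, last_t))    # object exited after the last point
            start = None
        last_t = point.t
    if start is not None:
        spans.append((start, last_t))
    return spans


prompt = EventPrompt(
    text="a dog runs behind the tree and comes back out",
    reference_image="dog.png",
    trajectory=[
        TrajectoryPoint(0.0, 0.1, 0.5, True),
        TrajectoryPoint(1.0, 0.4, 0.5, True),
        TrajectoryPoint(2.0, 0.5, 0.5, False),
        TrajectoryPoint(3.0, 0.7, 0.5, True),
    ],
)
print(visibility_spans(prompt.trajectory))
```

Entry and exit events fall out of the visibility flags, which is one way a single trajectory can carry timing information in addition to the path itself.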

Table of Contents

cs.CL [Back]

[1] TabReX : Tabular Referenceless eXplainable Evaluation

[2] Social Story Frames: Contextual Reasoning about Narrative Intent and Reception

[3] BRAID: Bounded Reasoning for Autonomous Inference and Decisions

[4] Examining the Utility of Self-disclosure Types for Modeling Annotators of Social Norms

[5] Are We on the Right Way to Assessing LLM-as-a-Judge?

[6] Convolutional Lie Operator for Sentence Classification

[7] MRG-R1: Reinforcement Learning for Clinically Aligned Medical Report Generation

[8] Decoding Fake Narratives in Spreading Hateful Stories: A Dual-Head RoBERTa Model with Multi-Task Learning

[9] A Domain-Adapted Pipeline for Structured Information Extraction from Police Incident Announcements on Social Media

[10] Mitigating Hallucinations in Healthcare LLMs with Granular Fact-Checking and Domain-Specific Adaptation

[11] An Information-Theoretic Framework for Robust Large Language Model Editing

[12] LoPA: Scaling dLLM Inference via Lookahead Parallel Decoding

[13] Sigma-Moe-Tiny Technical Report

[14] Evaluating OpenAI GPT Models for Translation of Endangered Uralic Languages: A Comparison of Reasoning and Non-Reasoning Architectures

[15] Hacking Neural Evaluation Metrics with Single Hub Text

[16] Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs

[17] Bridging the Reality Gap: Efficient Adaptation of ASR systems for Challenging Low-Resource Domains

[18] Plain language adaptations of biomedical text using LLMs: Comparision of evaluation metrics

[19] UM_FHS at the CLEF 2025 SimpleText Track: Comparing No-Context and Fine-Tune Approaches for GPT-4.1 Models in Sentence and Document-Level Text Simplification

[20] Refusal Steering: Fine-grained Control over LLM Refusal Behaviour for Sensitive Topics

[21] JustRL: Scaling a 1.5B LLM with a Simple RL Recipe

[22] GinSign: Grounding Natural Language Into System Signatures for Temporal Logic Translation

[23] From Facts to Conclusions : Integrating Deductive Reasoning in Retrieval-Augmented LLMs

[24] Exploration of Augmentation Strategies in Multi-modal Retrieval-Augmented Generation for the Biomedical Domain: A Case Study Evaluating Question Answering in Glycobiology

[25] Grammar-Forced Translation of Natural Language to Temporal Logic using LLMs

[26] What Do Prosody and Text Convey? Characterizing How Meaningful Information is Distributed Across Multiple Channels

[27] LLMCache: Layer-Wise Caching Strategies for Accelerated Reuse in Transformer Inference

[28] AdaSearch: Balancing Parametric Knowledge and Search in Large Language Models via Reinforcement Learning

[29] Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image

[30] In-Context Algebra

[31] Constructive Circuit Amplification: Improving Math Reasoning in LLMs via Targeted Sub-Network Updates

cs.CV [Back]

[32] Two-Step Data Augmentation for Masked Face Detection and Recognition: Turning Fake Masks to Real

[33] Seeing Beyond Words: Self-Supervised Visual Learning for Multimodal Large Language Models

[34] City Navigation in the Wild: Exploring Emergent Navigation from Web-Scale Knowledge in MLLMs

[35] R4: Retrieval-Augmented Reasoning for Vision-Language Models in 4D Spatio-Temporal Space

[36] The Perceptual Observatory Characterizing Robustness and Grounding in MLLMs

[37] Seeing is Believing (and Predicting): Context-Aware Multi-Human Behavior Prediction with Vision Language Models

[38] From Words to Wavelengths: VLMs for Few-Shot Multispectral Object Detection

[39] Are vision-language models ready to zero-shot replace supervised classification models in agriculture?

[40] Eyes on the Grass: Biodiversity-Increasing Robotic Mowing Using Deep Visual Embeddings

[41] CoVAR: Co-generation of Video and Action for Robotic Manipulation via Multi-Modal Diffusion

[42] Driving in Corner Case: A Real-World Adversarial Closed-Loop Evaluation Platform for End-to-End Autonomous Driving

[43] FOD-Diff: 3D Multi-Channel Patch Diffusion Model for Fiber Orientation Distribution

[44] Auto-Vocabulary 3D Object Detection

[45] LAPX: Lightweight Hourglass Network with Global Context

[46] Collimator-assisted high-precision calibration method for event cameras

[47] TurboDiffusion: Accelerating Video Diffusion Models by 100-200 Times

[48] Flexible Camera Calibration using a Collimator System

[49] Interaction-via-Actions: Cattle Interaction Detection with Joint Learning of Action-Interaction Latent Space

[50] ResDynUNet++: A nested U-Net with residual dynamic convolution blocks for dual-spectral CT

[51] SegGraph: Leveraging Graphs of SAM Segments for Few-Shot 3D Part Segmentation

[52] C-DGPA: Class-Centric Dual-Alignment Generative Prompt Adaptation

[53] Towards Closing the Domain Gap with Event Cameras

[54] Avatar4D: Synthesizing Domain-Specific 4D Humans for Real-World Pose Estimation

[55] Visual Alignment of Medical Vision-Language Models for Grounded Radiology Report Generation

[56] Open Ad-hoc Categorization with Contextualized Feature Learning

[57] Enhanced 3D Shape Analysis via Information Geometry

[58] Learning High-Quality Initial Noise for Single-View Synthesis with Diffusion Models

[59] Image Compression Using Singular Value Decomposition

[60] ARMFlow: AutoRegressive MeanFlow for Online 3D Human Reaction Generation

[61] AI-Powered Dermatological Diagnosis: From Interpretable Models to Clinical Implementation A Comprehensive Framework for Accessible and Trustworthy Skin Disease Detection

[62] Semi-Supervised Multi-View Crowd Counting by Ranking Multi-View Fusion Models

[63] Pixel Super-Resolved Fluorescence Lifetime Imaging Using Deep Learning

[64] TextEditBench: Evaluating Reasoning-aware Text Editing Beyond Rendering

[65] GFLAN: Generative Functional Layouts

[66] MACL: Multi-Label Adaptive Contrastive Learning Loss for Remote Sensing Image Retrieval

[67] PixelArena: A benchmark for Pixel-Precision Visual Intelligence

[68] LaverNet: Lightweight All-in-one Video Restoration via Selective Propagation

[69] Ridge Estimation-Based Vision and Laser Ranging Fusion Localization Method for UAVs

[70] QUIDS: Quality-informed Incentive-driven Multi-agent Dispatching System for Mobile Crowdsensing

[71] Collaborative Edge-to-Server Inference for Vision-Language Models

[72] GMODiff: One-Step Gain Map Refinement with Diffusion Priors for HDR Reconstruction

[73] EverybodyDance: Bipartite Graph-Based Identity Correspondence for Multi-Character Animation

[74] Factorized Video Generation: Decoupling Scene Construction and Temporal Synthesis in Text-to-Video Diffusion Models

[75] Adaptive Frequency Domain Alignment Network for Medical image segmentation

[76] Using Gaussian Splats to Create High-Fidelity Facial Geometry and Texture

[77] BrepLLM: Native Boundary Representation Understanding with Large Language Models