cs.CL

[1] TabReX: Tabular Referenceless eXplainable Evaluation

Tejas Anvekar,Juhna Park,Aparna Garimella,Vivek Gupta

Main category: cs.CL

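TL;DR: Proposes TabReX, a reference-less, property-driven framework for evaluating table generation that uses graph-based reasoning to produce interpretable structural and factual fidelity scores, together with TabReX-Bench, a large-scale benchmark used to validate its robustness and advantages.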

Details

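Motivation: Existing evaluation methods for table generation either ignore structural information or flatten tables into text, leaving no reliable, interpretable way to assess the quality of structured generation. Method: The source text and the generated table are converted into canonical knowledge graphs and aligned through an LLM-guided matching process; structural and factual fidelity scores are then computed by graph-based reasoning. This yields TabReX, an interpretable evaluation framework that needs no reference table, along with TabReX-Bench, a multi-domain benchmark with multiple perturbation levels for systematic evaluation. Result: TabReX achieves the highest correlation with expert rankings, remains stable under complex perturbations, supports fine-grained model and prompt analysis, and offers a controllable sensitivity-specificity trade-off and cell-level error tracing. Conclusion: TabReX offers a trustworthy, interpretable new paradigm for evaluating table generation, significantly outperforming existing methods and advancing reliable evaluation of structured generation systems. Abstract: Evaluating the quality of tables generated by large language models (LLMs) remains an open challenge: existing metrics either flatten tables into text, ignoring structure, or rely on fixed references that limit generalization. We present TabReX, a reference-less, property-driven framework for evaluating tabular generation via graph-based reasoning. TabReX converts both source text and generated tables into canonical knowledge graphs, aligns them through an LLM-guided matching process, and computes interpretable, rubric-aware scores that quantify structural and factual fidelity. The resulting metric provides controllable trade-offs between sensitivity and specificity, yielding human-aligned judgments and cell-level error traces. To systematically assess metric robustness, we introduce TabReX-Bench, a large-scale benchmark spanning six domains and twelve planner-driven perturbation types across three difficulty tiers. Empirical results show that TabReX achieves the highest correlation with expert rankings, remains stable under harder perturbations, and enables fine-grained model-vs-prompt analysis, establishing a new paradigm for trustworthy, explainable evaluation of structured generation systems.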
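To make the graph comparison in the TabReX abstract more concrete, here is a minimal sketch of the general idea only. It assumes facts are represented as (subject, attribute, value) triples and it replaces the LLM-guided matching step with plain exact matching; the function names `table_to_triples` and `compare_triples` and the sample data are hypothetical and are not taken from the paper.

```python
def table_to_triples(rows, key_column):
    """Turn table rows (dicts) into (subject, attribute, value) triples.

    Assumption: one column identifies the row subject; every other
    cell becomes one triple.
    """
    triples = set()
    for row in rows:
        subject = row[key_column]
        for column, value in row.items():
            if column != key_column:
                triples.add((subject, column, value))
    return triples


def compare_triples(source_triples, table_triples):
    """Exact-match stand-in for the alignment step (the paper uses an LLM)."""
    return {
        "supported": table_triples & source_triples,    # cells backed by the source
        "unsupported": table_triples - source_triples,  # cells with no source support
        "missing": source_triples - table_triples,      # source facts absent from the table
    }


# Hypothetical facts extracted from a source text.
source = {
    ("Alice", "age", "30"),
    ("Alice", "city", "Paris"),
    ("Bob", "age", "25"),
}
# Hypothetical generated table with one wrong cell.
table = [
    {"name": "Alice", "age": "30", "city": "Rome"},
    {"name": "Bob", "age": "25"},
]
report = compare_triples(source, table_to_triples(table, "name"))
```

The `unsupported` set plays the role of a cell-level error trace: each entry points at one specific cell that the source does not back up.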

Joel Mire,Maria Antoniak,Steven R. Wilson,Zexin Ma,Achyutarama R. Ganti,Andrew Piper,Maarten Sap

Main category: cs.CL

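TL;DR: Introduces SocialStoryFrames, a computational formalism for capturing inferences about how readers respond to stories, implements it with two models, SSF-Generator and SSF-Classifier, and demonstrates its usefulness for large-scale analysis of social media stories.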

Details

Motivation: Existing computational models struggle to capture the rich interpretive, affective, and evaluative responses that readers have to stories, so a method that models reader response in finer detail is needed. Method: Drawing on narrative theory, linguistic pragmatics, and psychology, the authors build a taxonomy and propose the SocialStoryFrames formalism. They develop two models, SSF-Generator and SSF-Classifier, analyze SSF-Corpus, a dataset of 6,140 social media stories, and validate the models through a human survey (N=382) and expert annotation, respectively. Result: The models effectively identify author intent, explanatory and predictive reasoning, affective responses, and value judgments; the analysis reveals the frequency and interdependence of storytelling intents across online communities, as well as differences in narrative practices and their diversity. Conclusion: By combining fine-grained, context-sensitive modeling with a generic taxonomy of reader responses, SocialStoryFrames provides a new tool for studying storytelling in online communities. Abstract: Reading stories evokes rich interpretive, affective, and evaluative responses, such as inferences about narrative intent or judgments about characters. Yet, computational models of reader response are limited, preventing nuanced analyses. To address this gap, we introduce SocialStoryFrames, a formalism for distilling plausible inferences about reader response, such as perceived author intent, explanatory and predictive reasoning, affective responses, and value judgments, using conversational context and a taxonomy grounded in narrative theory, linguistic pragmatics, and psychology. We develop two models, SSF-Generator and SSF-Classifier, validated through human surveys (N=382 participants) and expert annotations, respectively. We conduct pilot analyses to showcase the utility of the formalism for studying storytelling at scale. Specifically, applying our models to SSF-Corpus, a curated dataset of 6,140 social media stories from diverse contexts, we characterize the frequency and interdependence of storytelling intents, and we compare and contrast narrative practices (and their diversity) across communities. By linking fine-grained, context-sensitive modeling with a generic taxonomy of reader responses, SocialStoryFrames enable new research into storytelling in online communities.

[3] BRAID: Bounded Reasoning for Autonomous Inference and Decisions

Armağan Amcalar,Eyup Cinar

Main category: cs.CL

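TL;DR: Presents BRAID, a structured prompting framework based on Mermaid instruction graphs that bounds the reasoning process to improve the reasoning accuracy and cost efficiency of large language models in autonomous agent systems.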

Details

Motivation: Large language models show nonlinear relationships between performance, cost, and token usage, and conventional unconstrained natural-language reasoning tends to be costly and inefficient, so a more efficient structured reasoning method is needed. Method: The authors design and implement the BRAID framework, which uses Mermaid syntax to build machine-readable, bounded reasoning graphs that steer the model toward structured rather than freely expanding reasoning, and evaluate several GPT models on multiple benchmark datasets. Result: Compared with conventional prompting, BRAID substantially improves reasoning accuracy and cost efficiency, particularly on challenging tasks such as AdvancedIF, GSM-Hard, and SCALE MultiChallenge. Conclusion: BRAID is an effective and scalable technique for optimizing inference efficiency in autonomous agent systems and points to a new direction for deploying LLMs in production systems. Abstract: Large Language Models (LLMs) exhibit nonlinear relationships between performance, cost, and token usage. This paper presents a quantitative study on structured prompting using BRAID (Bounded Reasoning for Autonomous Inference and Decisions) across multiple GPT model tiers, evaluated on the AdvancedIF, GSM-Hard, and the SCALE MultiChallenge benchmark datasets. BRAID introduces a bounded reasoning framework using Mermaid-based instruction graphs that enable models to reason structurally rather than through unbounded natural-language token expansion. We show that structured machine-readable prompts substantially increase reasoning accuracy and cost efficiency for agents in production systems. The findings establish BRAID as an effective and scalable technique for optimizing inference efficiency in autonomous agent systems. All datasets and detailed result logs are available at https://benchmark.openserv.ai.
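The BRAID abstract does not show what a Mermaid instruction graph looks like, so the following is only an illustrative guess at the general shape of such a prompt. The helper `render_mermaid`, the node labels, and the prompt wording are all hypothetical; only the Mermaid `flowchart TD` syntax itself is standard.

```python
def render_mermaid(nodes, edges):
    """Render a Mermaid flowchart from node labels and (src, dst, condition) edges.

    nodes: dict mapping node id -> label text
    edges: list of (source id, target id, condition or None)
    """
    lines = ["flowchart TD"]
    for node_id, label in nodes.items():
        lines.append(f'    {node_id}["{label}"]')
    for src, dst, condition in edges:
        arrow = f"-->|{condition}|" if condition else "-->"
        lines.append(f"    {src} {arrow} {dst}")
    return "\n".join(lines)


# Hypothetical bounded reasoning plan for a math word problem.
nodes = {
    "A": "Read the problem",
    "B": "Extract quantities",
    "C": "Compute the answer",
    "D": "Answer matches units?",
    "E": "Return final answer",
}
edges = [
    ("A", "B", None),
    ("B", "C", None),
    ("C", "D", None),
    ("D", "E", "yes"),
    ("D", "B", "no"),
]
graph = render_mermaid(nodes, edges)
# Hypothetical prompt wording; the paper's actual prompts are not given in the abstract.
prompt = "Follow this instruction graph step by step:\n" + graph
```

The point of the graph form is that the allowed reasoning steps and their order are fixed in advance, so the model has less room to expand into long free-form text.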

Kieran Henderson,Kian Omoomi,Vasudha Varadarajan,Allison Lahnala,Charles Welch

Main category: cs.CL

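TL;DR: This study categorizes self-disclosure sentences and builds annotator models to predict judgments of social norms, finding that demographics are more influential than attitudes, relationships, and experiences, and that a small number of related comments plus a diverse sample of self-disclosures gives the best performance.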

Details

Motivation: To explore which kinds of personal information are most informative for predicting annotator labels in subjective tasks, specifically judgments of social norms. Method: Self-disclosure sentences are categorized and used to build annotator models with both theory-driven and automatic clustering approaches, followed by several ablations and analyses. Result: Demographic information has the largest impact on predicting annotation patterns; theory-driven approaches outperform automatic clustering; only a small number of related comments are needed for good results; and a more diverse sample of self-disclosures yields the best predictive performance. Conclusion: When modeling annotator judgments, demographic features and diverse self-disclosure content should be prioritized, and simple theory-based categorization is already effective enough. Abstract: Recent work has explored the use of personal information in the form of persona sentences or self-disclosures to improve modeling of individual characteristics and prediction of annotator labels for subjective tasks. The volume of personal information has historically been restricted and thus little exploration has gone into understanding what kind of information is most informative for predicting annotator labels. In this work, we categorize self-disclosure sentences and use them to build annotator models for predicting judgments of social norms. We perform several ablations and analyses to examine the impact of the type of information on our ability to predict annotation patterns. We find that demographics are more impactful than attitudes, relationships, and experiences. Generally, theory-based approaches worked better than automatic clusters. Contrary to previous work, only a small number of related comments are needed. Lastly, having a more diverse sample of annotator self-disclosures leads to the best performance.

[5] Are We on the Right Way to Assessing LLM-as-a-Judge?

Yuanning Feng,Sinan Wang,Zhengxiang Cheng,Yao Wan,Dongping Chen

Main category: cs.CL

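TL;DR: Introduces Sage, a new framework that assesses the quality of LLM-as-a-Judge without human annotation. Based on axioms of rational choice theory, it adds two new metrics, local self-consistency and global logical consistency, and finds that current mainstream LLMs have significant reliability problems when acting as judges.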

Details

Motivation: Existing evaluations of LLM-as-a-Judge depend on human-annotated ground truth, which introduces human bias and limits scalability, so a more reliable and scalable evaluation method that does not rely on human annotation is needed. Method: Inspired by rational choice theory, the authors propose two new evaluation dimensions, local self-consistency (stability of pairwise preferences) and global logical consistency (transitivity across preferences), and validate them experimentally on 650 questions that combine structured tasks with real user queries. Result: The Sage metrics are stable and correlate highly with supervised benchmarks such as LLMBar and RewardBench2. Top models such as Gemini-2.5-Pro and GPT-5 fail to keep their preferences consistent in nearly a quarter of difficult cases. The study identifies a "situational preference" phenomenon and shows that explicit rubrics help improve consistency; fine-tuning, panel-based judging, and deep reasoning also strengthen judging consistency; and human judgments themselves show substantial inconsistency. Conclusion: Sage provides a reliable evaluation scheme for LLM-as-a-Judge that needs no human annotation, exposes reliability flaws in current LLM judges, shows that design improvements (explicit criteria, fine-tuning, and multi-model panels) can raise consistency, and questions the reliability of human annotation as a gold standard. Abstract: LLM-as-a-Judge has been widely adopted as an evaluation method and served as supervised rewards in model training. However, existing benchmarks for LLM-as-a-Judge mainly rely on human-annotated ground truth, which introduces human bias that undermines the assessment of reliability and imposes scalability constraints. To overcome these limitations, we introduce Sage, a novel evaluation suite that assesses the quality of LLM judges without necessitating any human annotation. Inspired by axioms of rational choice theory, Sage introduces two new lenses for measuring LLM-as-a-Judge: local self-consistency (pair-wise preference stability) and global logical consistency (transitivity across a full set of preferences). We curate a dataset of 650 questions by combining structured benchmark problems with real-world user queries. Our experiments demonstrate both the stability of our metrics and their high correlation with supervised benchmarks like LLMBar and RewardBench2, confirming Sage's reliability as an evaluation suite for the robustness and accuracy of LLM-as-a-Judge. Based on Sage, we reveal that current state-of-the-art LLMs exhibit significant reliability problems when acting as judges in both scoring and pairwise settings; even the top-performing models, Gemini-2.5-Pro and GPT-5, fail to maintain consistent preferences in nearly a quarter of difficult cases. We attribute this to a new phenomenon called situational preference, which explains why explicit rubrics or criteria can help the model judge consistently across answer pairs. Our further analysis shows that finetuned LLM-as-a-Judge is a feasible method to boost performance, and the panel-based judge as well as deep reasoning can enhance the judging consistency. We also find substantial inconsistency in human judgments, which indicates that human annotation may not be a reliable gold standard.
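The two consistency notions named in the Sage abstract can be illustrated with a small sketch. The abstract does not give the exact metric definitions, so the formulas below are assumptions: local self-consistency is taken as the fraction of answer pairs whose repeated verdicts all agree, and global logical consistency is checked by counting preference cycles (a beats b, b beats c, c beats a). Function names and data are hypothetical.

```python
from itertools import permutations


def local_self_consistency(verdicts):
    """Fraction of pairs whose repeated verdicts all name the same winner.

    verdicts: dict mapping (answer_a, answer_b) -> list of winners
              collected from repeated or order-swapped judge queries.
    """
    stable = sum(1 for winners in verdicts.values() if len(set(winners)) == 1)
    return stable / len(verdicts)


def count_preference_cycles(prefers):
    """Count 3-cycles in a set of (winner, loser) preferences.

    A cycle a > b > c > a violates transitivity.
    """
    items = {x for pair in prefers for x in pair}
    hits = 0
    for a, b, c in permutations(items, 3):
        if (a, b) in prefers and (b, c) in prefers and (c, a) in prefers:
            hits += 1
    # Each cycle is found once per rotation (a,b,c), (b,c,a), (c,a,b).
    return hits // 3


# Hypothetical judge outputs.
verdicts = {
    ("A", "B"): ["A", "A"],  # stable preference
    ("B", "C"): ["B", "C"],  # flips between queries
}
consistency = local_self_consistency(verdicts)

cyclic = {("A", "B"), ("B", "C"), ("C", "A")}
transitive = {("A", "B"), ("B", "C"), ("A", "C")}
```

Neither check needs a human label: both look only at whether the judge agrees with itself, which is the property that lets this style of evaluation scale.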

[6] Convolutional Lie Operator for Sentence Classification

Daniela N. Rim,Heeyoul Choi

Main category: cs.CL

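TL;DR: Proposes SCLie and DPCLie, new sentence classification models based on Lie convolutions, which improve on conventional convolutional neural networks by capturing complex non-Euclidean symmetries in language.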

Details

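Motivation: Conventional convolutional neural networks are limited in modeling complex transformations in language and struggle to capture symmetry structure in non-Euclidean spaces. Method: Lie convolutions are integrated into convolution-based sentence classifiers to build the SCLie and DPCLie models, which use Lie group operations to capture complex, position-invariant transformation patterns in language. Result: The proposed models outperform conventional convolutional sentence classifiers in experiments, suggesting that Lie-based methods model unconventional transformations in language more effectively and thereby improve classification accuracy. Conclusion: Lie convolutions offer a new paradigm for language modeling and encourage further exploration of geometric deep learning in natural language processing. Abstract: Traditional Convolutional Neural Networks have been successful in capturing local, position-invariant features in text, but their capacity to model complex transformations within language can be further explored. In this work, we explore a novel approach by integrating Lie Convolutions into Convolutional-based sentence classifiers, inspired by the ability of Lie group operations to capture complex, non-Euclidean symmetries. Our proposed models SCLie and DPCLie empirically outperform traditional Convolutional-based sentence classifiers, suggesting that Lie-based models relatively improve the accuracy by capturing transformations not commonly associated with language. Our findings motivate more exploration of new paradigms in language modeling.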

[7] MRG-R1: Reinforcement Learning for Clinically Aligned Medical Report Generation

Pengyu Wang,Shuchang Ye,Usman Naseem,Jinman Kim

Main category: cs.CL

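TL;DR: Proposes a semantic-driven reinforcement learning (SRL) method for medical report generation that optimizes a large vision-language model with a report-level clinical-correctness reward, markedly improving the clinical accuracy of generated reports.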

Details

Motivation: Most existing medical report generation methods are trained on token-level objectives that only imitate the linguistic style of radiologists and give no guarantee of clinical correctness, so a training paradigm that directly optimizes medical accuracy is needed. Method: The authors propose semantic-driven reinforcement learning (SRL), which uses Group Relative Policy Optimization (GRPO) and a report-level reward based on margin-based cosine similarity (MCCS) to measure semantic agreement between generated and reference reports on key radiological findings, and add a lightweight reasoning format constraint to produce structured "thinking report" outputs. Result: On the IU X-Ray and MIMIC-CXR datasets, MRG-R1 achieves CE-F1 scores of 51.88 and 40.39, respectively, reaching state-of-the-art performance and showing that semantic reinforcement outperforms conventional token-level supervision. Conclusion: Optimizing a report-level reward based on clinical semantic alignment improves the clinical correctness of medical report generation more effectively than conventional token overlap, offering a new supervision direction for training large medical models. Abstract: Medical report generation (MRG) aims to automatically derive radiology-style reports from medical images to aid in clinical decision-making. However, existing methods often generate text that mimics the linguistic style of radiologists but fails to guarantee clinical correctness, because they are trained on token-level objectives which focus on word-choice and sentence structure rather than actual medical accuracy. We propose a semantic-driven reinforcement learning (SRL) method for medical report generation, applied to a large vision-language model (LVLM). SRL adopts Group Relative Policy Optimization (GRPO) to encourage clinical-correctness-guided learning beyond imitation of language style. Specifically, we optimise a report-level reward: a margin-based cosine similarity (MCCS) computed between key radiological findings extracted from generated and reference reports, thereby directly aligning clinical-label agreement and improving semantic correctness. A lightweight reasoning format constraint further guides the model to generate structured "thinking report" outputs. We evaluate Medical Report Generation with Semantic-driven Reinforcement Learning (MRG-R1) on two datasets, IU X-Ray and MIMIC-CXR, using clinical efficacy (CE) metrics. MRG-R1 achieves state-of-the-art performance with CE-F1 51.88 on IU X-Ray and 40.39 on MIMIC-CXR. We found that the label-semantic reinforcement is better than conventional token-level supervision. These results indicate that optimizing a clinically grounded, report-level reward, rather than token overlap, meaningfully improves clinical correctness. This work is an early exploration of semantic reinforcement for supervising medical correctness in medical large vision-language model (Med-LVLM) training.
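The abstract names GRPO but gives no formulas, and the exact form of the MCCS reward is not stated, so it is not reproduced here. As a rough sketch of the group-relative idea only: several reports are sampled for the same image, each gets a scalar reward, and each reward is normalized against its own group. The use of the population standard deviation and the zero-variance fallback are assumptions of this sketch, and the reward values are made up.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards):
    """Normalize each reward against the mean and spread of its sampling group.

    rewards: list of scalar report-level rewards for samples drawn
             from the same input.
    """
    centre = mean(rewards)
    spread = pstdev(rewards)  # assumption: population standard deviation
    if spread == 0:
        # Assumption: identical rewards carry no learning signal.
        return [0.0 for _ in rewards]
    return [(r - centre) / spread for r in rewards]


# Hypothetical rewards for four sampled reports of one image.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
flat = group_relative_advantages([0.3, 0.3])
```

Because the advantage is relative to the group, a report is rewarded for being more clinically consistent than its siblings, without needing a separately trained value model.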

[8] Decoding Fake Narratives in Spreading Hateful Stories: A Dual-Head RoBERTa Model with Multi-Task Learning

Yash Bhaskar,Sankalp Bahad,Parameswari Krishnamurthy

Main category: cs.CL

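TL;DR: For the Faux-Hate shared task, this paper presents a system that combines advanced natural language processing techniques with domain-specific pretraining to detect hate speech driven by fake narratives, particularly in code-mixed Hindi-English social media text.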

Details

Motivation: Because misinformation and hate speech spread quickly on social media, identifying hate speech triggered by fake narratives (Faux-Hate) has become a key challenge, especially in multilingual settings. Method: The system uses advanced natural language processing techniques together with domain-specific pretrained models in a multi-task learning framework that handles two sub-tasks at once: binary Faux-Hate detection, and target and severity prediction. Result: The system achieved competitive results in the shared task, showing that multi-task learning can improve detection performance on complex text. Conclusion: Combining domain pretraining with multi-task learning is effective for detecting code-mixed Faux-Hate content and offers a workable approach to harmful content on multilingual social media. Abstract: Social media platforms, while enabling global connectivity, have become hubs for the rapid spread of harmful content, including hate speech and fake narratives. The Faux-Hate shared task focuses on detecting a specific phenomenon: the generation of hate speech driven by fake narratives, termed Faux-Hate. Participants are challenged to identify such instances in code-mixed Hindi-English social media text. This paper describes our system developed for the shared task, addressing two primary sub-tasks: (a) Binary Faux-Hate detection, involving fake and hate speech classification, and (b) Target and Severity prediction, categorizing the intended target and severity of hateful content. Our approach combines advanced natural language processing techniques with domain-specific pretraining to enhance performance across both tasks. The system achieved competitive results, demonstrating the efficacy of leveraging multi-task learning for this complex problem.

Mengfan Shen,Kangqi Song,Xindi Wang,Wei Jia,Tao Wang,Ziqiang Han

Main category: cs.CL

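TL;DR: Proposes a domain-adapted information extraction pipeline built on LoRA fine-tuning of the Qwen2.5-7B model to extract structured information from police announcements on Weibo, achieving high accuracy.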

Details

Motivation: Accurately extracting structured information from informal, heterogeneous social media text such as police announcements on Weibo is challenging but essential for social science research. Method: Targeted prompt engineering is combined with parameter-efficient fine-tuning of the Qwen2.5-7B model using Low-Rank Adaptation (LoRA) to build a multi-task information extraction pipeline; 4,933 samples are manually annotated from 27,822 Weibo posts, and 15 key fields are extracted. Result: LoRA fine-tuning significantly outperforms the base and instruction-tuned models, with accuracy above 98.36% for mortality detection and Exact Match Rates of 95.31% for fatality counts and 95.54% for province-level location extraction. Conclusion: The method provides an efficient, reliable solution for multi-task structured information extraction in specialized domains and can turn unstructured text into structured data usable for research. Abstract: Structured information extraction from police incident announcements is crucial for timely and accurate data processing, yet presents considerable challenges due to the variability and informal nature of textual sources such as social media posts. To address these challenges, we developed a domain-adapted extraction pipeline that leverages targeted prompt engineering with parameter-efficient fine-tuning of the Qwen2.5-7B model using Low-Rank Adaptation (LoRA). This approach enables the model to handle noisy, heterogeneous text while reliably extracting 15 key fields, including location, event characteristics, and impact assessment, from a high-quality, manually annotated dataset of 4,933 instances derived from 27,822 police briefing posts on Chinese Weibo (2019-2020). Experimental results demonstrated that LoRA-based fine-tuning significantly improved performance over both the base and instruction-tuned models, achieving an accuracy exceeding 98.36% for mortality detection and Exact Match Rates of 95.31% for fatality counts and 95.54% for province-level location extraction. The proposed pipeline thus provides a validated and efficient solution for multi-task structured information extraction in specialized domains, offering a practical framework for transforming unstructured text into reliable structured data in social science research.

[10] Mitigating Hallucinations in Healthcare LLMs with Granular Fact-Checking and Domain-Specific Adaptation

Musarrat Zeba,Abdullah Al Mamun,Kishoar Jahan Tithee,Debopom Sutradhar,Mohaimenul Azam Khan Raiaan,Saddam Mukta,Reem E. Mohamed,Md Rafiqul Islam,Yakub Sebastian,Mukhtar Hussain,Sami Azam

Main category: cs.CL

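TL;DR: Proposes a fact-checking module that operates independently of any large language model, paired with a domain-specific summarization model fine-tuned on the MIMIC-III dataset, to reduce hallucinations in generated medical text; factual accuracy is verified through logical and numerical checks.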

Details

Motivation: Large language models are prone to hallucinated outputs in medical settings, which threatens decision safety, so the reliability of generated content needs to improve. Method: A summarization model is fine-tuned with LoRA on the MIMIC-III dataset, and an independent fact-checking module based on discrete logic and numerical tests is designed to verify that propositions in the generated content are consistent with electronic health records. Result: The fact-checking module reaches a precision of 0.8904, a recall of 0.8234, and an F1-score of 0.8556; the summarization model obtains a ROUGE-1 score of 0.5797 and a BERTScore of 0.9120. Conclusion: The proposed fact-checking module improves the accuracy of generated medical text, lowers the risk of hallucination, and shows potential for clinical use. Abstract: In healthcare, it is essential for any LLM-generated output to be reliable and accurate, particularly in cases involving decision-making and patient safety. However, the outputs are often unreliable in such critical areas due to the risk of hallucinated outputs from the LLMs. To address this issue, we propose a fact-checking module that operates independently of any LLM, along with a domain-specific summarization model designed to minimize hallucination rates. Our model is fine-tuned using Low-Rank Adaptation (LoRA) on the MIMIC-III dataset and is paired with the fact-checking module, which uses numerical tests for correctness and logical checks at a granular level through discrete logic in natural language processing (NLP) to validate facts against electronic health records (EHRs). We trained the LLM on the full MIMIC-III dataset. For evaluation of the fact-checking module, we sampled 104 summaries, extracted them into 3,786 propositions, and used these as facts. The fact-checking module achieves a precision of 0.8904, a recall of 0.8234, and an F1-score of 0.8556. Additionally, the LLM summary model achieves a ROUGE-1 score of 0.5797 and a BERTScore of 0.9120 for summary quality.
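As a quick consistency check on the reported numbers, the F1-score is the harmonic mean of precision and recall. The snippet below recomputes it from the two reported values; the function name `f1_score` is just an illustrative helper, not code from the paper.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Reported precision and recall of the fact-checking module.
reported_f1 = round(f1_score(0.8904, 0.8234), 4)
```

The recomputed value rounds to 0.8556, which agrees with the F1-score stated in the abstract.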

[11] An Information-Theoretic Framework for Robust Large Language Model Editing

Qizhou Chen,Chengyu Wang,Taolin Zhang,Xiaofeng He

Main category: cs.CL

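TL;DR: Proposes IBKE, a knowledge editing framework for large models grounded in information bottleneck theory, which uses compact latent representations to achieve efficient knowledge correction that generalizes well.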

Details

Motivation: Existing model editing techniques struggle to update knowledge across domains without retraining and tend to cause unintended behavior, which limits their practical use. Method: Building on information bottleneck theory, the authors design a mechanism for compressing and isolating information, use compact latent representations to guide gradient-based updates, and develop the IBKE editing framework. Result: IBKE is validated on multiple LLM architectures and benchmark tasks, showing state-of-the-art accuracy along with better generality and specificity of edits. Conclusion: IBKE offers a theoretically principled and practical new paradigm for open-domain knowledge editing, improving the utility and trustworthiness of large models in real-world applications. Abstract: Large Language Models (LLMs) have become indispensable tools in science, technology, and society, enabling transformative advances across diverse fields. However, errors or outdated information within these models can undermine their accuracy and restrict their safe deployment. Developing efficient strategies for updating model knowledge without the expense and disruption of full retraining remains a critical challenge. Current model editing techniques frequently struggle to generalize corrections beyond narrow domains, leading to unintended consequences and limiting their practical impact. Here, we introduce a novel framework for editing LLMs, grounded in information bottleneck theory. This approach precisely compresses and isolates the essential information required for generalizable knowledge correction while minimizing disruption to unrelated model behaviors. Building upon this foundation, we present the Information Bottleneck Knowledge Editor (IBKE), which leverages compact latent representations to guide gradient-based updates, enabling robust and broadly applicable model editing. We validate IBKE's effectiveness across multiple LLM architectures and standard benchmark tasks, demonstrating state-of-the-art accuracy and improved generality and specificity of edits. These findings establish a theoretically principled and practical paradigm for open-domain knowledge editing, advancing the utility and trustworthiness of LLMs in real-world applications.

[12] LoPA: Scaling dLLM Inference via Lookahead Parallel Decoding

Chenkai Xu,Yijie Jin,Jiajun Li,Yi Tu,Guoping Long,Dandan Tu,Tianqi Hou,Junchi Yan,Zhijie Deng

Main category: cs.CL

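TL;DR: Proposes LoPA, a training-free, plug-and-play inference acceleration algorithm that markedly improves the parallel decoding efficiency of diffusion large language models (dLLMs) by optimizing the Token Filling Order (TFO), and pairs it with a multi-device inference system to reach up to 10.1 TPF and a throughput of 1073.9 tokens per second.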

Details

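Motivation: Existing confidence-based decoding strategies for dLLMs offer limited parallelism, typically generating only 1-3 tokens per forward pass, which caps inference speed. The authors find that the degree of parallelism is highly sensitive to the Token Filling Order (TFO) used during decoding and therefore argue that the TFO should be optimized systematically. Method: The Lookahead PArallel Decoding (LoPA) algorithm explores different candidate TFOs simultaneously through multiple parallel branches and selects the filling order with the greatest potential for future parallelism based on each branch's confidence. The method needs no training and can be applied directly to existing models. A multi-device inference system with Branch Parallelism (BP) is also designed to support the high concurrency. Result: Applied to the state-of-the-art D2F model, LoPA raises the TPF of D2F-Dream to 10.1 on GSM8K while keeping performance above the Dream baseline; under multi-GPU deployment, single-sample throughput reaches 1073.9 tokens per second. Conclusion: By optimizing the Token Filling Order, LoPA substantially increases the decoding parallelism and inference efficiency of dLLMs, and together with a dedicated multi-device system it delivers state-of-the-art high-speed decoding, offering an efficient solution for practical dLLM use. Abstract: Diffusion Large Language Models (dLLMs) have demonstrated significant potential for high-speed inference. However, current confidence-driven decoding strategies are constrained by limited parallelism, typically achieving only 1-3 tokens per forward pass (TPF). In this work, we identify that the degree of parallelism during dLLM inference is highly sensitive to the Token Filling Order (TFO). Then, we introduce Lookahead PArallel Decoding (LoPA), a training-free, plug-and-play algorithm, to identify a superior TFO and hence accelerate inference. LoPA concurrently explores distinct candidate TFOs via parallel branches, and selects the one with the highest potential for future parallelism based on branch confidence. We apply LoPA to the state-of-the-art D2F model and observe a substantial enhancement in decoding efficiency. Notably, LoPA increases the TPF of D2F-Dream to 10.1 on GSM8K while maintaining performance superior to the Dream baseline. Furthermore, to facilitate this unprecedented degree of parallelism, we develop a specialized multi-device inference system featuring Branch Parallelism (BP), which achieves a single-sample throughput of 1073.9 tokens per second under multi-GPU deployment. The code is available at https://github.com/zhijie-group/LoPA.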

[13] Sigma-MoE-Tiny Technical Report

Qingguo Hu,Zhenghao Lin,Ziyue Yang,Yucheng Ding,Xiao Liu,Yuting Jiang,Ruizhe Wang,Tianyu Chen,Zhongxin Guo,Yifan Xiong,Rui Gao,Lei Qu,Jinsong Su,Peng Cheng,Yeyun Gong

Main category: cs.CL

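TL;DR: Presents Sigma-MoE-Tiny, a Mixture-of-Experts language model with extremely high sparsity: each layer has up to 96 experts but only one is activated per token, so only 0.5B of the 20B total parameters are active. To handle the expert load-balancing problem caused by this extreme sparsity, the authors propose a progressive sparsification strategy and use stable training plus post-training to raise performance, matching or surpassing larger models on a range of evaluations.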

Details

Motivation: Existing MoE models scale well and efficiently, but at very high sparsity (many experts per layer with very few activated) they suffer from unbalanced expert load and unstable training, so new methods are needed for efficient, stable sparse training. Method: The model uses fine-grained expert segmentation with up to 96 experts per layer and activates only one expert per token; a progressive sparsification schedule addresses load balancing in the lower layers; pre-training on a high-quality corpus is followed by post-training to strengthen the model. Result: Sigma-MoE-Tiny reaches 20B total parameters with only 0.5B activated, trains stably with no irrecoverable loss spikes, and shows markedly improved expert utilization; on multiple benchmarks it ranks among the top performers compared with models of comparable or significantly larger scale. Conclusion: Through a carefully designed expert structure and a progressive sparsification strategy, Sigma-MoE-Tiny achieves the highest sparsity among current open-source models while keeping strong performance and training stability, offering a practical path and detailed insights for future highly sparse MoE architectures. Abstract: Mixture-of-Experts (MoE) has emerged as a promising paradigm for foundation models due to its efficient and powerful scalability. In this work, we present Sigma-MoE-Tiny, an MoE language model that achieves the highest sparsity compared to existing open-source models. Sigma-MoE-Tiny employs fine-grained expert segmentation with up to 96 experts per layer, while activating only one expert for each token, resulting in 20B total parameters with just 0.5B activated. The major challenge introduced by such extreme sparsity lies in expert load balancing. We find that the widely-used load balancing loss tends to become ineffective in the lower layers under this setting. To address this issue, we propose a progressive sparsification schedule aiming to balance expert utilization and training stability. Sigma-MoE-Tiny is pre-trained on a diverse and high-quality corpus, followed by post-training to further unlock its capabilities. The entire training process remains remarkably stable, with no occurrence of irrecoverable loss spikes. Comprehensive evaluations reveal that, despite activating only 0.5B parameters, Sigma-MoE-Tiny achieves top-tier performance among counterparts of comparable or significantly larger scale. In addition, we provide an in-depth discussion of load balancing in highly sparse MoE models, offering insights for advancing sparsity in future MoE architectures. Project page: https://qghuxmu.github.io/Sigma-MoE-Tiny Code: https://github.com/microsoft/ltp-megatron-lm
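To illustrate what "activating only one expert for each token" and "expert load balancing" mean in practice, here is a minimal top-1 routing sketch in plain Python. The abstract does not describe the router, so picking the expert with the highest score and measuring load as the share of tokens per expert are assumptions of this sketch; the names `top1_route` and `expert_load` and the toy scores are hypothetical.

```python
from collections import Counter


def top1_route(router_scores):
    """Return the index of the highest-scoring expert for one token."""
    return max(range(len(router_scores)), key=lambda e: router_scores[e])


def expert_load(all_scores, num_experts):
    """Share of tokens sent to each expert under top-1 routing."""
    counts = Counter(top1_route(scores) for scores in all_scores)
    total = len(all_scores)
    return [counts.get(e, 0) / total for e in range(num_experts)]


# Toy example: 4 tokens routed among 2 experts (the model uses up to 96 per layer).
token_scores = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]
load = expert_load(token_scores, num_experts=2)

# Activated share of parameters from the figures in the abstract: 0.5B of 20B.
activated_fraction = 0.5 / 20
```

A perfectly balanced router would give every expert the same share of tokens; in the toy example one expert receives three quarters of the tokens, which is the kind of skew a load-balancing mechanism tries to prevent. The activated fraction works out to 2.5% of the total parameters.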

[14] Evaluating OpenAI GPT Models for Translation of Endangered Uralic Languages: A Comparison of Reasoning and Non-Reasoning Architectures

Yehor Tereshchenko,Mika Hämäläinen,Svitlana Myroniuk

Main category: cs.CL

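TL;DR: This study compares OpenAI's GPT models on translation between Finnish and four low-resource Uralic languages (Komi-Zyrian, Moksha, Erzya, and Udmurt), focusing on how refusal rates differ between reasoning and non-reasoning architectures. Reasoning models show refusal rates 16 percentage points lower, suggesting greater potential for low-resource language translation.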

Details

Motivation: Translation evaluations of large language models have concentrated on high-resource languages, leaving little understanding of performance on low-resource and endangered languages; Uralic languages in particular face data scarcity, so model suitability for them needs to be assessed. Method: Using a parallel corpus of literary texts, reasoning and non-reasoning models in the GPT family are evaluated on translation tasks, and refusal rates are analyzed to measure the models' willingness to attempt translation. Result: Reasoning models show refusal rates 16 percentage points lower than non-reasoning models, indicating a greater willingness to attempt translation. Conclusion: Reasoning architectures lower refusal rates for low-resource Uralic translation, which offers a workable path for endangered language preservation and technology and highlights the importance of model architecture choice. Abstract: The evaluation of Large Language Models (LLMs) for translation tasks has primarily focused on high-resource languages, leaving a significant gap in understanding their performance on low-resource and endangered languages. This study presents a comprehensive comparison of OpenAI's GPT models, specifically examining the differences between reasoning and non-reasoning architectures for translating between Finnish and four low-resource Uralic languages: Komi-Zyrian, Moksha, Erzya, and Udmurt. Using a parallel corpus of literary texts, we evaluate model willingness to attempt translation through refusal rate analysis across different model architectures. Our findings reveal significant performance variations between reasoning and non-reasoning models, with reasoning models showing 16 percentage points lower refusal rates. The results provide valuable insights for researchers and practitioners working with Uralic languages and contribute to the broader understanding of reasoning model capabilities for endangered language preservation.

[15] Hacking Neural Evaluation Metrics with Single Hub Text

Hiroyuki Deguchi,Katsuki Chousa,Yusuke Sakai

Main category: cs.CL

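TL;DR: Proposes a method for finding adversarial text in the discrete space that exposes the fragility of neural text evaluation metrics such as COMET; the single hub text it finds receives abnormally high scores on several translation tasks and generalizes across language pairs.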

Details

Motivation: Current neural text evaluation metrics such as COMET are black boxes with no guarantee of reliability or safety, so their stability under adversarial input needs to be examined. Method: A method is designed to search the discrete text space for a single adversarial hub text that is consistently rated as high quality, in order to expose vulnerabilities in evaluation metrics. Result: The hub text achieves 79.1 COMET% and 67.8 COMET% on the WMT'24 En-Ja and En-De tasks, respectively, outperforming translations that the M2M100 model generates individually for each sentence, and it is shown to generalize to other language pairs such as Ja-En and De-En. Conclusion: Existing neural evaluation metrics have serious vulnerabilities: a single adversarial text can score highly and remain effective across languages, which calls for a fresh look at their reliability and safety. Abstract: Strongly human-correlated evaluation metrics serve as an essential compass for the development and improvement of generation models and must be highly reliable and robust. Recent embedding-based neural text evaluation metrics, such as COMET for translation tasks, are widely used in both research and development fields. However, there is no guarantee that they yield reliable evaluation results due to the black-box nature of neural networks. To raise concerns about the reliability and safety of such metrics, we propose a method for finding a single adversarial text in the discrete space that is consistently evaluated as high-quality, regardless of the test cases, to identify the vulnerabilities in evaluation metrics. The single hub text found with our method achieved 79.1 COMET% and 67.8 COMET% in the WMT'24 English-to-Japanese (En-Ja) and English-to-German (En-De) translation tasks, respectively, outperforming translations generated individually for each source sentence by using M2M100, a general translation model. Furthermore, we also confirmed that the hub text found with our method generalizes across multiple language pairs such as Ja-En and De-En.

[16] Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs

Sara Papi,Javier Garcia Gilabert,Zachary Hopton,Vilém Zouhar,Carlos Escolano,Gerard I. Gállego,Jorge Iranzo-Sánchez,Ahrii Kim,Dominik Macháček,Patricia Schmidtova,Maike Züfle

Main category: cs.CL

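TL;DR: Presents the "Hearing to Translate" test suite, which systematically benchmarks 5 state-of-the-art speech large language models (SpeechLLMs) against 16 cascaded and direct translation systems; the results show that cascaded architectures still outperform SpeechLLMs in most scenarios.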

Details

Motivation: It remains unclear whether SpeechLLMs deliver better speech-to-text translation quality than traditional cascaded architectures, especially across many languages and under difficult speech conditions. Method: The authors build a comprehensive test suite called "Hearing to Translate" and rigorously benchmark 5 advanced SpeechLLMs against 16 direct and cascaded systems that combine leading speech foundation models (SFMs) with multilingual large language models (LLMs), covering 16 benchmarks, 13 language pairs, and 9 challenging speech conditions. Result: Cascaded systems are the most reliable overall in most cases; current SpeechLLMs match cascades only in selected settings, while pure SFMs lag behind; integrating an LLM, whether inside the model or in a pipeline, is essential for high-quality speech translation. Conclusion: Although SpeechLLMs could simplify the pipeline, cascaded systems that include an LLM remain the better choice for cross-lingual speech translation today, and SpeechLLMs need further gains in robustness and translation quality. Abstract: As Large Language Models (LLMs) expand beyond text, integrating speech as a native modality has given rise to SpeechLLMs, which aim to translate spoken language directly, thereby bypassing traditional transcription-based pipelines. Whether this integration improves speech-to-text translation quality over established cascaded architectures, however, remains an open question. We present Hearing to Translate, the first comprehensive test suite rigorously benchmarking 5 state-of-the-art SpeechLLMs against 16 strong direct and cascade systems that couple leading speech foundation models (SFMs) with multilingual LLMs. Our analysis spans 16 benchmarks, 13 language pairs, and 9 challenging conditions, including disfluent, noisy, and long-form speech. Across this extensive evaluation, we find that cascaded systems remain the most reliable overall, while current SpeechLLMs only match cascades in selected settings and SFMs lag behind both, highlighting that integrating an LLM, either within the model or in a pipeline, is essential for high-quality speech translation.

[17] Bridging the Reality Gap: Efficient Adaptation of ASR systems for Challenging Low-Resource Domains

Darshil Chauhan,Adityasinh Solanki,Vansh Patel,Kanav Kapoor,Ritvik Jain,Aditya Bansal,Dhruv Kumar,Prateek Narang

Main category: cs.CL

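TL;DR: Proposes an efficient, privacy-preserving adaptation framework based on Low-Rank Adaptation (LoRA) and multi-domain experience replay to address data privacy, limited compute, and acoustic domain shift in clinical speech recognition under resource constraints. The method substantially improves ASR performance in real-world conditions and reduces catastrophic forgetting.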

Details

Motivation: Because of data privacy constraints, limited computational resources, and severe acoustic domain shift, existing multilingual ASR models perform very poorly in real clinical settings (WER as high as 40.94%) and are hard to deploy, so an adaptation method that learns continually on edge devices while protecting privacy is needed. Method: Low-Rank Adaptation (LoRA) enables continual learning on edge devices, and a multi-domain experience replay mechanism mitigates catastrophic forgetting, so the IndicWav2Vec model can be adapted to the domain without transmitting raw data. Result: The approach achieves a 17.1% relative WER reduction on the target domain (Gram Vaani clinical audio) and reduces catastrophic forgetting by 47% compared with naive adaptation. Conclusion: The proposed framework offers a viable path to reliable, self-improving ASR systems in high-impact real-world settings such as rural healthcare, balancing performance, privacy, and resource constraints. Abstract: Automatic Speech Recognition (ASR) holds immense potential to streamline clinical documentation, such as digitizing handwritten prescriptions and reports, thereby increasing patient throughput and reducing costs in resource-constrained sectors like rural healthcare. However, realizing this utility is currently obstructed by significant technical barriers: strict data privacy constraints, limited computational resources, and severe acoustic domain shifts. We quantify this gap by showing that a robust multilingual model (IndicWav2Vec) degrades to a stark 40.94% Word Error Rate (WER) when deployed on real-world clinical audio (Gram Vaani), rendering it unusable for practical applications. To address these challenges and bring ASR closer to deployment, we propose an efficient, privacy-preserving adaptation framework. We employ Low-Rank Adaptation (LoRA) to enable continual learning from incoming data streams directly on edge devices, ensuring patient data confidentiality. Our strategy yields a 17.1% relative improvement in WER on the target domain. Furthermore, by integrating multi-domain experience replay, we reduce catastrophic forgetting by 47% compared to naive adaptation. These results demonstrate a viable pathway for building reliable, self-improving ASR systems that can operate effectively within the constraints of high-impact real-world environments.
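For readers unfamiliar with the metric, Word Error Rate is the word-level edit distance between a reference transcript and the ASR output, divided by the number of reference words, and a "relative improvement" compares two WER values. The sketch below shows both calculations on made-up inputs; the function names and example sentences are hypothetical, and no figures from the paper are recomputed here.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline_wer, adapted_wer):
    """Fraction by which the adapted WER is lower than the baseline WER."""
    return (baseline_wer - adapted_wer) / baseline_wer


# Made-up transcripts: one substitution and one deletion over four reference words.
example_wer = word_error_rate("a b c d", "a x c")
example_gain = relative_improvement(40.0, 30.0)
```

Note that a relative improvement is a ratio of WER values, not a difference in percentage points, so a 25% relative gain from a 40% baseline means 30% WER, not 15%.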

[18] Plain language adaptations of biomedical text using LLMs: Comparision of evaluation metrics

Primoz Kocbek,Leon Kopitar,Gregor Stiglic

Main category: cs.CL

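TL;DR: This study examines the use of large language models (LLMs) to simplify biomedical text and improve health literacy, comparing a prompt-template baseline, a two-AI-agent approach, and fine-tuning, and evaluating the performance of GPT-4o and GPT-4o-mini.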

Details

Motivation: To help the public understand complex biomedical information by simplifying text and lowering the reading barrier, thereby improving health literacy. Method: Using biomedical abstracts and their plain-language versions from a public dataset, the authors design and compare three approaches: a prompt-template baseline, a two-AI-agent approach, and fine-tuning. GPT-4o and GPT-4o-mini serve as baseline models, evaluated with automatic metrics (Flesch-Kincaid, SMOG, SARI, BERTScore, G-Eval) and a qualitative assessment on 5-point Likert scales. Result: GPT-4o-mini outperformed the other approaches, while fine-tuning underperformed. G-Eval, an LLM-based automatic metric, ranked the approaches similarly to the human assessment, showing good potential. Conclusion: Lightweight models such as GPT-4o-mini perform well on biomedical text simplification without fine-tuning, and the LLM-based metric G-Eval can substitute for part of the human evaluation. Abstract: This study investigated the application of Large Language Models (LLMs) for simplifying biomedical texts to enhance health literacy. Using a public dataset, which included plain language adaptations of biomedical abstracts, we developed and evaluated several approaches, specifically a baseline approach using a prompt template, a two AI agent approach, and a fine-tuning approach. We selected OpenAI gpt-4o and gpt-4o-mini models as baselines for further research. We evaluated our approaches with quantitative metrics, such as Flesch-Kincaid grade level, SMOG Index, SARI, BERTScore, and G-Eval, as well as with a qualitative metric, more precisely 5-point Likert scales for simplicity, accuracy, completeness, and brevity. Results showed a superior performance of gpt-4o-mini and an underperformance of FT approaches. G-Eval, an LLM-based quantitative metric, showed promising results, ranking the approaches similarly to the qualitative metric.
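Two of the readability metrics named above have simple closed forms. The sketch below applies the standard Flesch-Kincaid grade level and SMOG Index formulas to pre-computed counts. Counting syllables and sentences is left out, since the abstract does not say which tokenizer or syllable counter was used.

```python
import math


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid grade level from raw counts."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def smog_index(polysyllables: int, sentences: int) -> float:
    """SMOG Index; polysyllables are words with three or more syllables."""
    return 1.0430 * math.sqrt(polysyllables * (30 / sentences)) + 3.1291
```

Both scores approximate a school grade level, so a successful plain-language adaptation should lower them relative to the source abstract.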

[19] UM_FHS at the CLEF 2025 SimpleText Track: Comparing No-Context and Fine-Tune Approaches for GPT-4.1 Models in Sentence and Document-Level Text Simplification

Primoz Kocbek,Gregor Stiglic

Main category: cs.CL

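TL;DR: This paper describes a submission to Task 1 of the CLEF 2025 SimpleText track, using the gpt-4.1 model family for sentence-level and document-level simplification of scientific text and comparing no-context prompt engineering with fine-tuning.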

Details

Motivation: To improve the readability of scientific text at both the sentence and document level so that non-expert readers can understand complex content. Method: Uses gpt-4.1, gpt-4.1-mini, and gpt-4.1-nano, comparing a no-context prompt-engineering method with fine-tuning for sentence-level and document-level simplification. Result: gpt-4.1-mini was robust in the no-context setting. Fine-tuned models gave mixed results, with gpt-4.1-nano-ft standing out on one document-level task. Conclusion: Prompt engineering beats fine-tuning in most cases, simplification at different granularities poses distinct challenges, and model choice should be weighed against task requirements. Abstract: This work describes our submission to the CLEF 2025 SimpleText track Task 1, addressing both sentence- and document-level simplification of scientific texts. The methodology centered on using the gpt-4.1, gpt-4.1-mini, and gpt-4.1-nano models from OpenAI. Two distinct approaches were compared: a no-context method relying on prompt engineering and a fine-tuned (FT) method across models. The gpt-4.1-mini model with the no-context method demonstrated robust performance at both levels of simplification, while the fine-tuned models showed mixed results, highlighting the complexities of simplifying text at different granularities, with gpt-4.1-nano-ft standing out at document-level simplification in one case.

[20] Refusal Steering: Fine-grained Control over LLM Refusal Behaviour for Sensitive Topics

Iker García-Ferrero,David Montero,Roman Orus

Main category: cs.CL

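TL;DR: Proposes Refusal Steering, an inference-time method for fine-grained control over large language models' refusal behaviour on politically sensitive topics without retraining.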

Details

Motivation: Existing pattern-based refusal detection is fragile and imprecise, making it hard to control a model's refusal behaviour flexibly while preserving safety alignment, especially on politically sensitive topics. Method: Scores refusal behaviour with an LLM-as-a-judge, computes more precise steering vectors with a ridge-regularized variant, and adjusts model behaviour at inference time through activation steering. Result: Removes refusals on politically sensitive topics on models such as Qwen3-Next-80B-A3B-Thinking while maintaining safety on JailbreakBench and near-baseline performance on general benchmarks. The method generalizes across model sizes and can also induce targeted refusals. Conclusion: Activation steering is a viable way to achieve fine-grained, transparent control over refusal behaviour at inference time, balancing content safety with flexibility on politically sensitive topics. Abstract: We introduce Refusal Steering, an inference-time method to exercise fine-grained control over Large Language Models' refusal behaviour on politically sensitive topics without retraining. We replace fragile pattern-based refusal detection with an LLM-as-a-judge that assigns refusal confidence scores and we propose a ridge-regularized variant to compute steering vectors that better isolate the refusal--compliance direction. On Qwen3-Next-80B-A3B-Thinking, our method removes the refusal behaviour of the model around politically sensitive topics while maintaining safety on JailbreakBench and near-baseline performance on general benchmarks. The approach generalizes across 4B and 80B models and can also induce targeted refusals when desired. We analyze the steering vectors and show that refusal signals concentrate in deeper layers of the transformer and are distributed across many dimensions. Together, these results demonstrate that activation steering can remove political refusal behaviour while retaining safety alignment for harmful content, offering a practical path to controllable, transparent moderation at inference time.

[21] JustRL: Scaling a 1.5B LLM with a Simple RL Recipe

Bingxiang He,Zekai Qu,Zeyuan Liu,Yinghao Chen,Yuxin Zuo,Cheng Qian,Kaiyan Zhang,Weize Chen,Chaojun Xiao,Ganqu Cui,Ning Ding,Zhiyuan Liu

Main category: cs.CL

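TL;DR: This paper presents JustRL, a simplified reinforcement learning recipe that uses single-stage training with fixed hyperparameters and reaches state-of-the-art performance with less compute, questioning whether today's complex training strategies are necessary.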

Details

Motivation: To question whether the growing complexity in reinforcement learning (multi-stage training, dynamic hyperparameters, and so on) is really necessary, and to test whether a simpler, stable method can perform equally well. Method: Proposes JustRL, which uses single-stage training with fixed hyperparameters and no complex schedules or curriculum learning, and trains two 1.5B-parameter language models on mathematical reasoning. Result: JustRL reaches 54.9% and 64.3% average accuracy across nine mathematical benchmarks, matching the state of the art with 2× less compute. Training is stable, without collapses or plateaus, the hyperparameters transfer without tuning, and adding common tricks may hurt performance. Conclusion: The field may be adding complexity to solve problems that a stable baseline avoids. JustRL provides a simple, reproducible baseline and argues for a return to simple, effective training recipes. Abstract: Recent advances in reinforcement learning for large language models have converged on increasing complexity: multi-stage training pipelines, dynamic hyperparameter schedules, and curriculum learning strategies. This raises a fundamental question: Is this complexity necessary? We present JustRL, a minimal approach using single-stage training with fixed hyperparameters that achieves state-of-the-art performance on two 1.5B reasoning models (54.9% and 64.3% average accuracy across nine mathematical benchmarks) while using 2× less compute than sophisticated approaches. The same hyperparameters transfer across both models without tuning, and training exhibits smooth, monotonic improvement over 4,000+ steps without the collapses or plateaus that typically motivate interventions. Critically, ablations reveal that adding "standard tricks" like explicit length penalties and robust verifiers may degrade performance by collapsing exploration. These results suggest that the field may be adding complexity to solve problems that disappear with a stable, scaled-up baseline. We release our models and code to establish a simple, validated baseline for the community.

[22] GinSign: Grounding Natural Language Into System Signatures for Temporal Logic Translation

William English,Chase Walker,Dominic Simon,Rickard Ewetz

Main category: cs.CL

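TL;DR: This paper proposes GinSign, a framework that grounds natural language into system signatures for temporal logic translation. A hierarchical decomposition improves grounded translation accuracy, reaching 95.5% logical equivalence across multiple domains, a 1.4× improvement over the best existing method.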

Details

Motivation: Existing natural-language-to-temporal-logic translation frameworks either assume accurate atom grounding or suffer from low translation accuracy, which limits their use in trustworthy autonomous systems. Method: Proposes the GinSign framework with a hierarchical grounding model that first predicts predicate labels and then selects appropriately typed constant arguments. This turns free-form generation into structured classification, allowing smaller masked language models and reducing reliance on large language models. Result: Frameworks that ignore grounding produce syntactically correct LTL that is not semantically equivalent to the target expression, whereas GinSign supports downstream model checking and achieves a grounded logical-equivalence score of 95.5%, a 1.4× improvement over SOTA. Conclusion: Through structured classification and hierarchical grounding, GinSign improves the semantic accuracy and practicality of natural-language-to-temporal-logic translation for trustworthy autonomous systems. Abstract: Natural language (NL) to temporal logic (TL) translation enables engineers to specify, verify, and enforce system behaviors without manually crafting formal specifications, an essential capability for building trustworthy autonomous systems. While existing NL-to-TL translation frameworks have demonstrated encouraging initial results, these systems either explicitly assume access to accurate atom grounding or suffer from low grounded translation accuracy. In this paper, we propose a framework for Grounding Natural Language Into System Signatures for Temporal Logic translation called GinSign. The framework introduces a grounding model that learns the abstract task of mapping NL spans onto a given system signature: given a lifted NL specification and a system signature $\mathcal{S}$, the classifier must assign each lifted atomic proposition to an element of the set of signature-defined atoms $\mathcal{P}$. We decompose the grounding task hierarchically -- first predicting predicate labels, then selecting the appropriately typed constant arguments. Decomposing this task from a free-form generation problem into a structured classification problem permits the use of smaller masked language models and eliminates the reliance on expensive LLMs. Experiments across multiple domains show that frameworks which omit grounding tend to produce syntactically correct lifted LTL that is semantically nonequivalent to grounded target expressions, whereas our framework supports downstream model checking and achieves grounded logical-equivalence scores of 95.5%, a 1.4× improvement over SOTA.

[23] From Facts to Conclusions : Integrating Deductive Reasoning in Retrieval-Augmented LLMs

Shubham Mishra,Samyek Jain,Gorang Mehrishi,Shiv Tiwari,Harsh Sharma,Pratik Narang,Dhruv Kumar

Main category: cs.CL

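TL;DR: Proposes a reasoning-trace-augmented RAG framework that handles conflicting, outdated, or subjective retrieved content through three stages of structured reasoning (document adjudication, conflict analysis, and grounded synthesis), and introduces the CATS evaluation pipeline to improve interpretability and accuracy.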

Details

Motivation: Existing RAG methods perform poorly when retrieved results conflict or contain outdated or subjective information, and they lack unified reasoning supervision. Method: Builds a three-stage reasoning-augmented RAG framework (document-level adjudication, conflict analysis, grounded synthesis) and designs the CATS pipeline for multi-dimensional evaluation (factual correctness, refusal accuracy, and more) using an LLM-as-a-Judge. Result: Experiments on a 539-query dataset show substantial gains. For Qwen in particular, supervised fine-tuning raises end-to-end answer correctness from 0.069 to 0.883 and behavioral adherence from 0.074 to 0.722. Conclusion: The framework improves the accuracy and interpretability of RAG systems on conflicting and uncertain information, offering a viable path to trustworthy, traceable question answering. Abstract: Retrieval-Augmented Generation (RAG) grounds large language models (LLMs) in external evidence, but fails when retrieved sources conflict or contain outdated or subjective information. Prior work addresses these issues independently but lacks unified reasoning supervision. We propose a reasoning-trace-augmented RAG framework that adds structured, interpretable reasoning across three stages: (1) document-level adjudication, (2) conflict analysis, and (3) grounded synthesis, producing citation-linked answers or justified refusals. A Conflict-Aware Trust-Score (CATS) pipeline is introduced which evaluates groundedness, factual correctness, refusal accuracy, and conflict-behavior alignment using an LLM-as-a-Judge. Our 539-query reasoning dataset and evaluation pipeline establish a foundation for conflict-aware, interpretable RAG systems. Experimental results demonstrate substantial gains over baselines, most notably with Qwen, where Supervised Fine-Tuning improved End-to-End answer correctness from 0.069 to 0.883 and behavioral adherence from 0.074 to 0.722.

Primož Kocbek,Azra Frkatović-Hodžić,Dora Lalić,Vivian Hui,Gordan Lauc,Gregor Štiglic

Main category: cs.CL

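TL;DR: This paper studies the trade-off in biomedical multi-modal QA between converting figures and tables to text and using OCR-free visual retrieval. Text conversion works better for mid-size models, OCR-free retrieval comes close under frontier models, and ColFlor performs well as a lightweight retriever.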

Details

Motivation: In multi-modal retrieval-augmented generation (MM-RAG), it is unclear when figures and tables should be converted to text and when page images should be retrieved directly, particularly in glycobiology, a visually dense domain. Method: Builds a benchmark of 120 multiple-choice questions stratified by retrieval difficulty, implements four augmentations (none, text RAG, multi-modal conversion, and ColPali-based visual retrieval) using Docling parsing and Qdrant indexing, and evaluates several open-source and proprietary models. Result: With Gemma-3-27B-IT, text and multi-modal conversion outperform OCR-free retrieval (accuracy 0.722-0.740 vs. 0.510). With GPT-4o the three are close (multi-modal 0.808, text 0.782, ColPali 0.745). In the GPT-5 family, ColPali and ColFlor reach 0.828, the visual retrievers are statistically indistinguishable, and GPT-5-nano trails by 8-10%. Conclusion: Model capacity determines the best pipeline: mid-size models benefit from converting visuals to text, while OCR-free visual retrieval becomes competitive under frontier models. ColFlor matches heavier retrievers at a smaller footprint, making it an efficient default with strong generators. Abstract: Multi-modal retrieval-augmented generation (MM-RAG) promises grounded biomedical QA, but it is unclear when to (i) convert figures/tables into text versus (ii) use optical character recognition (OCR)-free visual retrieval that returns page images and leaves interpretation to the generator. We study this trade-off in glycobiology, a visually dense domain. We built a benchmark of 120 multiple-choice questions (MCQs) from 25 papers, stratified by retrieval difficulty (easy text, medium figures/tables, hard cross-evidence). We implemented four augmentations (None, Text RAG, Multi-modal conversion, and late-interaction visual retrieval with ColPali) using Docling parsing and Qdrant indexing. We evaluated mid-size open-source and frontier proprietary models (e.g., Gemma-3-27B-IT, GPT-4o family). Additional testing used the GPT-5 family and multiple visual retrievers (ColPali/ColQwen/ColFlor). Accuracy with Agresti-Coull 95% confidence intervals (CIs) was computed over 5 runs per configuration. With Gemma-3-27B-IT, Text and Multi-modal augmentation outperformed OCR-free retrieval (0.722-0.740 vs. 0.510 average accuracy). With GPT-4o, Multi-modal achieved 0.808, with Text 0.782 and ColPali 0.745 close behind; within-model differences were small. In follow-on experiments with the GPT-5 family, the best results with ColPali and ColFlor improved by ~2% to 0.828 in both cases. In general, across the GPT-5 family, ColPali, ColQwen, and ColFlor were statistically indistinguishable. 
GPT-5-nano trailed larger GPT-5 variants by roughly 8-10%. Pipeline choice is capacity-dependent: converting visuals to text lowers the reader burden and is more reliable for mid-size models, whereas OCR-free visual retrieval becomes competitive under frontier models. Among retrievers, ColFlor offers parity with heavier options at a smaller footprint, making it an efficient default when strong generators are available.
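The abstract reports accuracy with Agresti-Coull 95% confidence intervals. The sketch below shows the standard Agresti-Coull interval for a binomial proportion. How the authors pooled the 5 runs per configuration is not stated, so the inputs here are simply a count of correct answers and a count of trials.

```python
import math


def agresti_coull_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Agresti-Coull interval for a binomial proportion (z = 1.96 gives about 95%)."""
    n_adjusted = trials + z ** 2
    p_adjusted = (successes + z ** 2 / 2) / n_adjusted
    half_width = z * math.sqrt(p_adjusted * (1 - p_adjusted) / n_adjusted)
    # Clip to [0, 1] because the adjusted interval can overshoot near the boundaries.
    return max(0.0, p_adjusted - half_width), min(1.0, p_adjusted + half_width)
```

With only 120 questions the interval is wide, which is consistent with the abstract's remark that within-model differences were small.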

[25] Grammar-Forced Translation of Natural Language to Temporal Logic using LLMs

William English,Dominic Simon,Sumit Kumar Jha,Rickard Ewetz

Main category: cs.CL

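TL;DR: This paper proposes Grammar Forced Translation (GraFT), a framework for translating natural language into temporal logic that lowers task complexity by restricting the set of valid output tokens at each step, improving end-to-end and out-of-domain translation accuracy on several benchmarks.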

Details

Motivation: Existing methods struggle with atomic proposition lifting, co-reference resolution, and learning from limited data, which makes accurate translation from natural language to a formal language difficult. Method: Proposes the GraFT framework, which exploits properties of each task to restrict the set of valid output tokens at every step, shrinking the search space of both the lifting and translation phases. Result: On the CW, GLTL, and Navi benchmarks, GraFT improves end-to-end translation accuracy by 5.49% and out-of-domain translation accuracy by 14.06% on average over state-of-the-art methods. Conclusion: By constraining the output grammar space, GraFT learns the mapping from natural language to temporal logic more efficiently and beats existing methods in accuracy and generalization. Abstract: Translating natural language (NL) into a formal language such as temporal logic (TL) is integral for human communication with robots and autonomous systems. State-of-the-art approaches decompose the task into a lifting of atomic propositions (APs) phase and a translation phase. However, existing methods struggle with accurate lifting, the existence of co-references, and learning from limited data. In this paper, we propose a framework for NL to TL translation called Grammar Forced Translation (GraFT). The framework is based on the observation that previous work solves both the lifting and translation steps by letting a language model iteratively predict tokens from its full vocabulary. In contrast, GraFT reduces the complexity of both tasks by restricting the set of valid output tokens from the full vocabulary to only a handful in each step. The solution space reduction is obtained by exploiting the unique properties of each problem. We also provide a theoretical justification for why the solution space reduction leads to more efficient learning. We evaluate the effectiveness of GraFT using the CW, GLTL, and Navi benchmarks. Compared with state-of-the-art translation approaches, it can be observed that GraFT improves the end-to-end translation accuracy by 5.49% and out-of-domain translation accuracy by 14.06% on average.
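The core idea, restricting each decoding step to a handful of valid tokens, can be sketched as follows. This is not GraFT's implementation: the toy parenthesis-free grammar, the token sets, and the function names are all assumptions made for illustration.

```python
# Hypothetical token classes for a tiny, parenthesis-free temporal-logic fragment.
UNARY = {"G", "F", "!"}
BINARY = {"&", "|", "->"}
ATOMS = {"p1", "p2"}
EOS = "<eos>"


def allowed_next(prefix: list[str]) -> set[str]:
    """Tokens the toy grammar permits after the given prefix."""
    if not prefix or prefix[-1] in UNARY | BINARY:
        return UNARY | ATOMS   # an operand (or a unary operator) must follow
    return BINARY | {EOS}      # after an atom: continue with a binary operator or stop


def constrained_greedy_step(logits: dict[str, float], prefix: list[str]) -> str:
    """Pick the highest-scoring token among those the grammar allows."""
    allowed = allowed_next(prefix)
    return max(allowed, key=lambda token: logits.get(token, float("-inf")))
```

Even if the model assigns its highest score to an invalid token, the step can only return a grammatical continuation, which is the sense in which the solution space shrinks.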

[26] What Do Prosody and Text Convey? Characterizing How Meaningful Information is Distributed Across Multiple Channels

Aditya Yadavalli,Tiago Pimentel,Tamar I Regev,Ethan Wilcox,Alex Warstadt

Main category: cs.CL

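TL;DR: This paper proposes an information-theoretic method that uses large models to quantify how much information prosody carries (for example about emotion and sarcasm) beyond the text. Without long-term context, prosody conveys over an order of magnitude more information about emotion and sarcasm than text.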

Details

Motivation: Prosody carries important meaning that text does not capture, but how to quantify its independent contribution is unclear. Method: Combines large speech and language models to estimate the mutual information between a dimension of an utterance's meaning (such as emotion) and a communication channel (such as audio or text). Result: For sarcasm and emotion, the audio channel (and by implication prosody) conveys over ten times more information than text. For questionhood, prosody contributes comparatively little extra. Conclusion: Prosody plays a dominant role in conveying emotion and sarcasm, and the method can be extended to more dimensions of meaning, channels, and languages. Abstract: Prosody -- the melody of speech -- conveys critical information often not captured by the words or text of a message. In this paper, we propose an information-theoretic approach to quantify how much information is expressed by prosody alone and not by text, and crucially, what that information is about. Our approach applies large speech and language models to estimate the mutual information between a particular dimension of an utterance's meaning (e.g., its emotion) and any of its communication channels (e.g., audio or text). We then use this approach to quantify how much information is conveyed by audio and text about sarcasm, emotion, and questionhood, using speech from television and podcasts. We find that for sarcasm and emotion the audio channel -- and by implication the prosodic channel -- transmits over an order of magnitude more information about these features than the text channel alone, at least when long-term context beyond the current sentence is unavailable. For questionhood, prosody provides comparatively less additional information. We conclude by outlining a program applying our approach to more dimensions of meaning, communication channels, and languages.
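One common way to estimate mutual information between a label Y (say, emotion) and a channel X (audio or text) is I(Y; X) = H(Y) - H(Y | X), with H(Y | X) approximated by a model's cross-entropy on held-out data. The sketch below assumes that estimator. The paper's exact procedure may differ, and the function names are illustrative.

```python
import math
from collections import Counter


def label_entropy_bits(labels: list[str]) -> float:
    """Empirical entropy H(Y) of the label distribution, in bits."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())


def conditional_cross_entropy_bits(true_label_probs: list[float]) -> float:
    """Mean negative log2-probability a model assigns to the true label given X."""
    return -sum(math.log2(p) for p in true_label_probs) / len(true_label_probs)


def mutual_information_estimate(labels: list[str], true_label_probs: list[float]) -> float:
    """Estimate I(Y; X) as H(Y) minus the model's cross-entropy given channel X."""
    return label_entropy_bits(labels) - conditional_cross_entropy_bits(true_label_probs)
```

Running the estimator once with an audio-based model and once with a text-based model on the same utterances gives the per-channel figures that can then be compared.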

[27] LLMCache: Layer-Wise Caching Strategies for Accelerated Reuse in Transformer Inference

Harsh Vardhan Bansal

Main category: cs.CL

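TL;DR: This paper proposes LLMCache, a layer-wise caching framework that speeds up transformer inference by reusing intermediate activations based on the semantic similarity of input sequences. The method is model-agnostic, works for encoder and decoder architectures, supports caching at any transformer layer, and delivers up to a 3.1× speedup with under 0.5% accuracy loss.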

Details

Motivation: Existing caching mechanisms such as token-level KV caches apply only to autoregressive decoding and have limited scope, while the high inference latency of transformer models hinders real-time and large-scale deployment, so a more general and efficient caching scheme is needed. Method: Proposes LLMCache, a layer-level caching framework that uses a lightweight fingerprinting mechanism to match semantically similar inputs and reuses intermediate activations at arbitrary transformer layers. Adaptive eviction strategies handle cache staleness, and the framework is compatible with encoder and decoder architectures. Result: Experiments with BERT and GPT-2 on SQuAD, WikiText-103, and OpenBookQA show up to a 3.1× inference speedup with less than 0.5% accuracy loss. Conclusion: LLMCache is a practical, general-purpose optimization for transformer inference that lowers latency with almost no loss in accuracy and suits real-world applications. Abstract: Transformer-based language models have achieved remarkable performance across a wide range of tasks, yet their high inference latency poses a significant challenge for real-time and large-scale deployment. While existing caching mechanisms, such as token-level key-value caches, offer speedups in autoregressive decoding, they are limited in scope and applicability. In this paper, we present LLMCache, a novel layer-wise caching framework that accelerates transformer inference by reusing intermediate activations based on semantic similarity of input sequences. Unlike prior work, LLMCache is model-agnostic, operates across both encoder and decoder architectures, and supports caching at arbitrary transformer layers. We introduce a lightweight fingerprinting mechanism for matching semantically similar inputs and propose adaptive eviction strategies to manage cache staleness. Experiments on BERT and GPT-2 across SQuAD, WikiText-103, and OpenBookQA show up to 3.1× speedup in inference time with <0.5% accuracy degradation. Our results highlight LLMCache as a practical and general-purpose solution for optimizing transformer inference in real-world applications.
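A layer-wise activation cache can be pictured as a map from (layer, input fingerprint) to a stored activation. The sketch below is only an illustration of that data structure: the bag-of-words fingerprint and the plain LRU eviction are stand-ins chosen for brevity, not the paper's fingerprinting mechanism or its adaptive eviction strategies.

```python
from collections import OrderedDict


def fingerprint(text: str) -> frozenset:
    """Hypothetical stand-in fingerprint: an order-insensitive bag of lowercased words."""
    return frozenset(text.lower().split())


class LayerCache:
    """Maps (layer index, fingerprint) to a cached activation, with LRU eviction."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._store = OrderedDict()

    def get(self, layer: int, key: frozenset):
        entry = (layer, key)
        if entry not in self._store:
            return None
        self._store.move_to_end(entry)  # mark as most recently used
        return self._store[entry]

    def put(self, layer: int, key: frozenset, activation) -> None:
        entry = (layer, key)
        self._store[entry] = activation
        self._store.move_to_end(entry)
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict the least recently used entry
```

On a hit at layer k, inference could resume from layer k + 1 with the stored activation instead of recomputing the earlier layers, which is where the speedup would come from.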

[28] AdaSearch: Balancing Parametric Knowledge and Search in Large Language Models via Reinforcement Learning

Tzu-Han Lin,Wei-Lin Chen,Chen-An Li,Hung-yi Lee,Yun-Nung Chen,Yu Meng

Main category: cs.CL

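TL;DR: This paper proposes AdaSearch, a two-stage reinforcement learning framework that lets large language models adaptively balance parametric knowledge against external search, reducing unnecessary search calls and improving decision transparency and awareness of knowledge boundaries.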

Details

Motivation: Existing search agents tend either to over-rely on search or to ignore what they already know, which raises cost or causes hallucination. Existing reward schemes need heavy engineering, give ambiguous credit assignment, and cannot measure truly adaptive behavior. Method: Proposes AdaSearch, a two-stage, outcome-driven reinforcement learning framework that separates problem solving from the decision of whether to invoke search, making that decision explicit and interpretable. Result: Experiments across multiple model families and sizes show that AdaSearch improves knowledge-boundary awareness, reduces unnecessary search calls, preserves strong task performance, and gives more transparent, interpretable decisions. Conclusion: With an explicit search decision, AdaSearch adaptively balances parametric knowledge and external search and suits high-stakes domains such as finance and medicine. Abstract: Equipping large language models (LLMs) with search engines via reinforcement learning (RL) has emerged as an effective approach for building search agents. However, overreliance on search introduces unnecessary cost and risks exposure to noisy or malicious content, while relying solely on parametric knowledge risks hallucination. The central challenge is to develop agents that adaptively balance parametric knowledge with external search, invoking search only when necessary. Prior work mitigates search overuse by shaping rewards around the number of tool calls. However, these penalties require substantial reward engineering, provide ambiguous credit assignment, and can be exploited by agents that superficially reduce calls. Moreover, evaluating performance solely through call counts conflates necessary and unnecessary search, obscuring the measurement of true adaptive behavior. To address these limitations, we first quantify the self-knowledge awareness of existing search agents via an F1-based decision metric, revealing that methods such as Search-R1 often overlook readily available parametric knowledge. Motivated by these findings, we propose AdaSearch, a simple two-stage, outcome-driven RL framework that disentangles problem solving from the decision of whether to invoke search, and makes this decision process explicit and interpretable. This transparency is crucial for high-stakes domains such as finance and medical question answering, yet is largely neglected by prior approaches. 
Experiments across multiple model families and sizes demonstrate that AdaSearch substantially improves knowledge-boundary awareness, reduces unnecessary search calls, preserves strong task performance, and offers more transparent, interpretable decision behaviors.
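An F1-based decision metric of the kind mentioned above can be sketched by treating "search is needed" as the positive class and scoring the agent's search decisions against it. The abstract does not give the exact definition, so the labelling convention and the function name here are assumptions.

```python
def decision_f1(search_needed: list[bool], searched: list[bool]) -> float:
    """F1 of the agent's search decisions, with 'search needed' as the positive class."""
    pairs = list(zip(search_needed, searched))
    true_pos = sum(needed and did for needed, did in pairs)
    false_pos = sum((not needed) and did for needed, did in pairs)
    false_neg = sum(needed and (not did) for needed, did in pairs)
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)
```

Unlike a raw call count, this score penalizes both searching when the answer was already known (false positives) and skipping a search that was needed (false negatives).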

[29] Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image

Yushi Hu,Reyhane Askari-Hemmat,Melissa Hall,Emily Dinan,Luke Zettlemoyer,Marjan Ghazvininejad

Main category: cs.CL

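TL;DR: This paper introduces Multimodal RewardBench 2 (MMRB2), the first comprehensive benchmark for reward models on multimodal understanding and generation. It covers four tasks (text-to-image, image editing, interleaved generation, and multimodal reasoning), evaluates existing models on expert-annotated preference data, and finds that the best current models still fall well short of humans.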

Details

Motivation: Reward models are essential for training large language models but have not been systematically evaluated for multimodal models that handle interleaved image and text sequences, so a comprehensive and challenging multimodal reward-model benchmark is needed. Method: Builds the MMRB2 benchmark with 1,000 expert-annotated preference pairs per task across four multimodal tasks, using responses from 23 models and agents and an ensemble filtering strategy to ensure strong human consensus. Evaluates existing judges (multimodal LLM-as-a-judge and models trained on human preferences) and analyzes correlation with downstream tasks. Result: Gemini 3 Pro reaches 75-80% accuracy. GPT-5 and Gemini 2.5 Pro reach 66-75%, above GPT-4o (59%) but below the >90% of humans. The open-source Qwen3-VL-32B is close to Gemini 2.5 Flash (64%). MMRB2 scores correlate strongly with downstream Best-of-N performance. Conclusion: MMRB2 provides a reliable evaluation standard for multimodal reward models, exposes the gap between current models and human judgment, and points to directions for improvement. Abstract: Reward models (RMs) are essential for training large language models (LLMs), but remain underexplored for omni models that handle interleaved image and text sequences. We introduce Multimodal RewardBench 2 (MMRB2), the first comprehensive benchmark for reward models on multimodal understanding and (interleaved) generation. MMRB2 spans four tasks: text-to-image, image editing, interleaved generation, and multimodal reasoning ("thinking-with-images"), providing 1,000 expert-annotated preference pairs per task from 23 models and agents across 21 source tasks. MMRB2 is designed with: (1) practical but challenging prompts; (2) responses from state-of-the-art models and agents; and (3) preference pairs with strong human-expert consensus, curated via an ensemble filtering strategy. Using MMRB2, we study existing judges for each subtask, including multimodal LLM-as-a-judge and models trained with human preferences. The latest Gemini 3 Pro attains 75-80% accuracy. GPT-5 and Gemini 2.5 Pro reach 66-75% accuracy, compared to >90% for humans, yet surpass the widely used GPT-4o (59%). The best performing open-source model Qwen3-VL-32B achieves similar accuracies to Gemini 2.5 Flash (64%). We also show that MMRB2 performance strongly correlates with downstream task success using Best-of-N sampling and conduct an in-depth analysis that shows key areas to improve the reward models going forward.
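The two quantities discussed above, judge accuracy on preference pairs and Best-of-N selection, reduce to a few lines once a scoring function is available. In the sketch below `score` is a placeholder for any reward model or judge, and the function names are illustrative rather than taken from the benchmark's code.

```python
from typing import Callable, Sequence, TypeVar

Response = TypeVar("Response")


def pairwise_accuracy(pairs: Sequence[tuple[Response, Response]],
                      score: Callable[[Response], float]) -> float:
    """Fraction of (preferred, rejected) pairs where the judge scores the preferred one higher."""
    correct = sum(score(preferred) > score(rejected) for preferred, rejected in pairs)
    return correct / len(pairs)


def best_of_n(candidates: Sequence[Response],
              score: Callable[[Response], float]) -> Response:
    """Best-of-N sampling: keep the candidate the reward model scores highest."""
    return max(candidates, key=score)
```

A judge with higher pairwise accuracy should pick better candidates in `best_of_n`, which is the correlation with downstream success that the abstract reports.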

[30] In-Context Algebra

Eric Todd,Jannik Brinkmann,Rohit Gandikota,David Bau

Main category: cs.CL

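TL;DR: This study investigates how transformers develop symbolic reasoning mechanisms on algebra tasks where the meaning of symbols is not fixed and changes from sequence to sequence. The models still solve the arithmetic with high accuracy and generalize to unseen algebraic groups.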

Details

Motivation: To explore whether transformers can still develop effective reasoning mechanisms, rather than relying on fixed geometric embeddings, when symbol meanings are determined dynamically by context. Method: Designs a new sequence arithmetic task in which the mapping from symbols to algebraic group elements is randomized in every sequence, and uses targeted data distributions for causal analysis of the model's internal mechanisms. Result: Finds three consistent mechanisms: commutative copying (a dedicated attention head copies answers), identity element recognition (distinguishing facts that contain the identity), and closure-based cancellation (tracking group membership to constrain valid answers). The models reach near-perfect accuracy on both training groups and unseen groups. Conclusion: Transformers do not rely only on geometric embeddings. In settings where variable meanings are not fixed, they develop mechanisms resembling symbolic manipulation that support in-context algebraic reasoning. Abstract: We investigate the mechanisms that arise when transformers are trained to solve arithmetic on sequences where tokens are variables whose meaning is determined only through their interactions. While prior work has found that transformers develop geometric embeddings that mirror algebraic structure, those previous findings emerge from settings where arithmetic-valued tokens have fixed meanings. We devise a new task in which the assignment of symbols to specific algebraic group elements varies from one sequence to another. Despite this challenging setup, transformers achieve near-perfect accuracy on the task and even generalize to unseen algebraic groups. We develop targeted data distributions to create causal tests of a set of hypothesized mechanisms, and we isolate three mechanisms models consistently learn: commutative copying where a dedicated head copies answers, identity element recognition that distinguishes identity-containing facts, and closure-based cancellation that tracks group membership to constrain valid answers. Complementary to the geometric representations found in fixed-symbol settings, our findings show that models develop symbolic reasoning mechanisms when trained to reason in-context with variables whose meanings are not fixed.
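To make the task setup concrete, the sketch below samples one sequence in which symbols are freshly assigned to the elements of a cyclic group and every fact has the form "x · y = z". The choice of cyclic groups, the fact format, and the function name are assumptions for illustration, since the abstract does not specify the paper's groups or sequence format.

```python
import random


def sample_sequence(order: int, symbols: list[str], num_facts: int,
                    rng: random.Random) -> tuple[list[str], list[tuple[str, str, str]]]:
    """Sample facts over the cyclic group Z_order with a fresh symbol assignment."""
    # assignment[k] is the symbol standing for group element k in this sequence only.
    assignment = rng.sample(symbols, order)
    facts = []
    for _ in range(num_facts):
        a, b = rng.randrange(order), rng.randrange(order)
        facts.append((assignment[a], assignment[b], assignment[(a + b) % order]))
    return assignment, facts
```

Because the assignment is redrawn for every sequence, a symbol such as `"c"` may be the identity in one sequence and a generator in the next, so a model can only infer its role from the facts given in context.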

[31] Constructive Circuit Amplification: Improving Math Reasoning in LLMs via Targeted Sub-Network Updates

Nikhil Prakash,Donghao Ren,Dominik Moritz,Yannick Assogba

Main category: cs.CL

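TL;DR: Proposes Constructive Circuit Amplification, a method that identifies pivotal tokens in reasoning traces and the model components responsible for a task, then updates only those sparse components to improve a specific capability such as mathematical reasoning. It raises accuracy by up to 11.4% while modifying as little as 1.59% of model components, with minimal effect on other abilities.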

Details

Motivation: Prior work shows that LLMs contain sparse subnetworks (circuits) responsible for specific tasks and that fine-tuning often improves performance by strengthening existing circuits. This motivates asking whether intervening directly on such circuits can yield more precise, task-targeted improvements. Method: Proposes Constructive Circuit Amplification, which identifies pivotal tokens in the model's reasoning traces and the components responsible for the target task, and updates only those components. Result: On mathematical reasoning, accuracy improves by up to 11.4% across multiple models while as little as 1.59% of model components are modified, with minimal impact on other abilities as measured by MMLU, TriviaQA, and TruthfulQA. Conclusion: Selectively updating a sparse set of model components can reliably enhance specific capabilities of large language models, supporting direct intervention on functional circuits. Abstract: Prior studies investigating the internal workings of LLMs have uncovered sparse subnetworks, often referred to as circuits, that are responsible for performing specific tasks. Additionally, it has been shown that model performance improvement through fine-tuning often results from the strengthening of existing circuits in the model. Taken together, these findings suggest the possibility of intervening directly on such circuits to make precise, task-targeted updates. Motivated by these findings, we propose a novel method called Constructive Circuit Amplification which identifies pivotal tokens from model reasoning traces as well as model components responsible for the desired task, and updates only those components. Applied to mathematical reasoning, it improves accuracy by up to +11.4% across multiple models while modifying as little as 1.59% of model components, with minimal impact on other abilities as measured by MMLU, TriviaQA, and TruthfulQA. These results demonstrate that targeted capabilities can be reliably enhanced by selectively updating a sparse set of model components.

cs.CV [Back]

[32] Two-Step Data Augmentation for Masked Face Detection and Recognition: Turning Fake Masks to Real

Yan Yang,George Bebis,Mircea Nicolescu

Main category: cs.CV

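TL;DR: Proposes a two-step generative data augmentation framework that combines rule-based mask warping with unpaired image-to-image translation (GANs) to generate more realistic masked-face samples, alleviating data scarcity and distribution shift in masked face recognition.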

Details

Motivation: To address data scarcity and distribution shift in masked face recognition, where existing methods generate masked faces that lack realism or diversity. Method: Uses a two-step framework: rule-based mask warping followed by GAN-based unpaired image-to-image translation, with a non-mask preservation loss and stochastic noise injection to stabilize training and increase sample diversity. Result: Compared with rule-based warping alone, the method gives consistent qualitative improvements and complements existing GAN-based methods such as IAMGAN. Experiments support the effectiveness of each component. Conclusion: The framework improves the realism and diversity of masked-face data augmentation and suggests a viable direction for data-centric augmentation in face recognition. Abstract: Data scarcity and distribution shift pose major challenges for masked face detection and recognition. We propose a two-step generative data augmentation framework that combines rule-based mask warping with unpaired image-to-image translation using GANs, enabling the generation of realistic masked-face samples beyond purely synthetic transformations. Compared to rule-based warping alone, the proposed approach yields consistent qualitative improvements and complements existing GAN-based masked face generation methods such as IAMGAN. We introduce a non-mask preservation loss and stochastic noise injection to stabilize training and enhance sample diversity. Experimental observations highlight the effectiveness of the proposed components and suggest directions for future improvements in data-centric augmentation for face recognition tasks.

[33] Seeing Beyond Words: Self-Supervised Visual Learning for Multimodal Large Language Models

Davide Caffagni,Sara Sarto,Marcella Cornia,Lorenzo Baraldi,Pier Luigi Dovesi,Shaghayegh Roohi,Mark Granroth-Wilding,Rita Cucchiara

Main category: cs.CV

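TL;DR: This paper proposes JARVIS, an I-JEPA-inspired self-supervised framework that strengthens the visual understanding of multimodal large language models (MLLMs). It uses vision foundation models as context and target encoders to reduce reliance on language priors and improves vision-centric performance on multiple benchmarks.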

Details

Motivation: Existing MLLMs learn visual content mainly from textual descriptions, which gives a subjective and incomplete supervisory signal, and the small scale of multimodal instruction tuning leads them to overfit language priors and overlook visual details. Method: Integrates the I-JEPA learning paradigm into the standard vision-language alignment pipeline of MLLM training, using frozen vision foundation models as context and target encoders and training a predictor built from the early layers of the LLM to learn structural and semantic regularities from images with less reliance on language supervision. Result: On standard MLLM benchmarks, JARVIS consistently improves vision-centric performance across different LLM families without degrading multimodal reasoning. Conclusion: By adding self-supervised visual representation learning, JARVIS strengthens the visual understanding of MLLMs and mitigates overfitting to language priors, offering a viable path to multimodal models with stronger visual reasoning. Abstract: Multimodal Large Language Models (MLLMs) have recently demonstrated impressive capabilities in connecting vision and language, yet their proficiency in fundamental visual reasoning tasks remains limited. This limitation can be attributed to the fact that MLLMs learn visual understanding primarily from textual descriptions, which constitute a subjective and inherently incomplete supervisory signal. Furthermore, the modest scale of multimodal instruction tuning compared to massive text-only pre-training leads MLLMs to overfit language priors while overlooking visual details. To address these issues, we introduce JARVIS, a JEPA-inspired framework for self-supervised visual enhancement in MLLMs. Specifically, we integrate the I-JEPA learning paradigm into the standard vision-language alignment pipeline of MLLM training. Our approach leverages frozen vision foundation models as context and target encoders, while training the predictor, implemented as the early layers of an LLM, to learn structural and semantic regularities from images without relying exclusively on language supervision. Extensive experiments on standard MLLM benchmarks show that JARVIS consistently improves performance on vision-centric benchmarks across different LLM families, without degrading multimodal reasoning abilities. Our source code is publicly available at: https://github.com/aimagelab/JARVIS.

Dwip Dalal,Utkarsh Mishra,Narendra Ahuja,Nebojsa Jojic

Main category: cs.CV

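TL;DR: This paper introduces CityNav, a new benchmark for evaluating sparsely grounded visual navigation by multimodal large language models (MLLMs) in real city environments, and proposes Verbalization of Path (VoP) to improve MLLM reasoning and navigation performance.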

Details

Motivation: Existing benchmarks focus mainly on language or simulated environments and rarely test the knowledge-intensive, reasoning-heavy decision-making of embodied agents in the real world, so a new, more challenging task and benchmark is needed to measure how MLLMs perform in realistic scenarios. Method: Proposes the Sparsely Grounded Visual Navigation task and builds the CityNav benchmark covering four global cities. Agents must navigate using only visual input and internal multimodal reasoning. The VoP method strengthens reasoning by probing an explicit cognitive map (key landmarks and directions) from the MLLM. Result: Current state-of-the-art MLLMs and standard reasoning techniques (such as Chain-of-Thought and Reflection) perform poorly on the task, while VoP substantially raises navigation success. Conclusion: CityNav offers a new standard for evaluating embodied navigation by MLLMs in complex real-world environments, and VoP shows the potential of explicit cognitive mapping to improve MLLM reasoning. Abstract: Leveraging multimodal large language models (MLLMs) to develop embodied agents offers significant promise for addressing complex real-world tasks. However, current evaluation benchmarks remain predominantly language-centric or heavily reliant on simulated environments, rarely probing the nuanced, knowledge-intensive reasoning essential for practical, real-world scenarios. To bridge this critical gap, we introduce the task of Sparsely Grounded Visual Navigation, explicitly designed to evaluate the sequential decision-making abilities of MLLMs in challenging, knowledge-intensive real-world environments. We operationalize this task with CityNav, a comprehensive benchmark encompassing four diverse global cities, specifically constructed to assess raw MLLM-driven agents in city navigation. Agents are required to rely solely on visual inputs and internal multimodal reasoning to sequentially navigate 50+ decision points without additional environmental annotations or specialized architectural modifications. Crucially, agents must autonomously achieve localization through interpreting city-specific cues and recognizing landmarks, perform spatial reasoning, and strategically plan and execute routes to their destinations. Through extensive evaluations, we demonstrate that current state-of-the-art MLLMs and standard reasoning techniques (e.g., Chain-of-Thought, Reflection) significantly underperform in this challenging setting. 
To address this, we propose Verbalization of Path (VoP), which explicitly grounds the agent's internal reasoning by probing an explicit cognitive map (key landmarks and directions toward the destination) from the MLLMs, substantially enhancing navigation success. Project Webpage: https://dwipddalal.github.io/AgentNav/

[35] R4: Retrieval-Augmented Reasoning for Vision-Language Models in 4D Spatio-Temporal Space

Tin Stribor Sohn,Maximilian Dillitzer,Jason J. Corso,Eric Sax

Main category: cs.CV

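TL;DR: This paper proposes R4, a training-free retrieval-augmented reasoning framework operating in 4D spatio-temporal space that builds structured, persistent memory for vision-language models and improves their understanding of and reasoning about spatio-temporal information in dynamic environments.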

Details

Motivation: Inspired by how humans build semantic, persistent internal representations in four dimensions, the authors aim to give vision-language models a similar capability so they can perceive and reason effectively about dynamic spatio-temporal environments. Method: R4 continuously builds a 4D knowledge database by anchoring object-level semantic descriptions in metric space and time; at inference, natural language queries are decomposed into semantic, spatial, and temporal keys, which are used to retrieve relevant information from the database and integrate it into the VLM's reasoning. Result: Experiments on embodied question answering and navigation benchmarks show that R4 substantially outperforms baselines, improving both retrieval and reasoning over spatio-temporal information. Conclusion: R4 offers a training-free paradigm for 4D retrieval-augmented reasoning whose memory can be shared across agents, advancing embodied 4D reasoning in dynamic environments. Abstract: Humans perceive and reason about their surroundings in four dimensions by building persistent, structured internal representations that encode semantic meaning, spatial layout, and temporal dynamics. These multimodal memories enable them to recall past events, infer unobserved states, and integrate new information into context-dependent reasoning. Inspired by this capability, we introduce R4, a training-free framework for retrieval-augmented reasoning in 4D spatio-temporal space that equips vision-language models (VLMs) with structured, lifelong memory. R4 continuously constructs a 4D knowledge database by anchoring object-level semantic descriptions in metric space and time, yielding a persistent world model that can be shared across agents. At inference, natural language queries are decomposed into semantic, spatial, and temporal keys to retrieve relevant observations, which are integrated into the VLM's reasoning. Unlike classical retrieval-augmented generation methods, retrieval in R4 operates directly in 4D space, enabling episodic and collaborative reasoning without training. Experiments on embodied question answering and navigation benchmarks demonstrate that R4 substantially improves retrieval and reasoning over spatio-temporal information compared to baselines, advancing a new paradigm for embodied 4D reasoning in dynamic environments.
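The retrieval step described above (a query split into semantic, spatial, and temporal keys) can be illustrated with a toy example. This is a minimal sketch, not R4's implementation: the `Observation` record, the `retrieve` function, the keyword-overlap score, and the sample data are all hypothetical, and a real system would likely use learned embeddings instead of word matching.

```python
from dataclasses import dataclass
import math


@dataclass
class Observation:
    """Hypothetical memory entry: an object description anchored in space and time."""
    description: str   # object-level semantic description
    position: tuple    # (x, y, z) in a metric world frame
    timestamp: float   # seconds since the start of the episode


def retrieve(memory, semantic_key, center, radius, t_start, t_end, top_k=3):
    """Filter by the temporal and spatial keys, then rank by keyword overlap."""
    query_words = set(semantic_key.lower().split())
    hits = []
    for obs in memory:
        if not (t_start <= obs.timestamp <= t_end):
            continue  # outside the temporal key
        if math.dist(obs.position, center) > radius:
            continue  # outside the spatial key
        overlap = len(query_words & set(obs.description.lower().split()))
        if overlap > 0:
            hits.append((overlap, obs))
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [obs for _, obs in hits[:top_k]]


memory = [
    Observation("red mug on kitchen table", (1.0, 2.0, 0.8), 10.0),
    Observation("red mug in sink", (1.5, 2.5, 0.9), 300.0),
    Observation("blue chair near window", (6.0, 1.0, 0.0), 20.0),
]

# "Where was the red mug during the first minute, near the table?"
results = retrieve(memory, "red mug", center=(1.0, 2.0, 1.0), radius=2.0,
                   t_start=0.0, t_end=60.0)
```

Filtering on space and time before ranking on semantics keeps the candidate set small; the retrieved observations would then be passed to the VLM as context.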

[36] The Perceptual Observatory Characterizing Robustness and Grounding in MLLMs

Tejas Anvekar,Fenil Bardoliya,Pavan K. Turaga,Chitta Baral,Vivek Gupta

Main category: cs.CV

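TL;DR: This paper proposes The Perceptual Observatory, a framework for systematically evaluating the visual perception of multimodal large language models (MLLMs) that goes beyond conventional task accuracy to examine robustness, attribution fidelity, and reasoning under controlled perturbations.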

Details

Motivation: Although existing MLLMs are powerful, their visual perception has not been evaluated in depth; most model families scale only the language component while reusing similar vision encoders, which makes it hard to tell whether performance gains come from genuine visual understanding or from textual priors. Method: The framework spans several verticals, including simple vision tasks (such as face matching and text-in-vision comprehension) and local-to-global understanding (such as image matching, the grid pointing game, and attribute localization); it uses ground-truth datasets of faces and words, perturbed systematically through pixel-based augmentations and diffusion-based stylized illusions. Result: The framework reveals differences in how well current MLLMs preserve perceptual grounding and relational structure under different perturbations, exposes sensitivity and fragility to specific visual features, and gives a finer-grained analysis than conventional benchmarks. Conclusion: The Perceptual Observatory provides a principled tool for assessing the genuine visual understanding of MLLMs and argues for moving beyond end-to-end accuracy toward more interpretable perceptual analysis to guide the design of more reliable multimodal models. Abstract: Recent advances in multimodal large language models (MLLMs) have yielded increasingly powerful models, yet their perceptual capacities remain poorly characterized. In practice, most model families scale the language component while reusing nearly identical vision encoders (e.g., Qwen2.5-VL 3B/7B/72B), which raises pivotal concerns about whether progress reflects genuine visual grounding or reliance on internet-scale textual world knowledge. Existing evaluation methods emphasize end-task accuracy, overlooking robustness, attribution fidelity, and reasoning under controlled perturbations. We present The Perceptual Observatory, a framework that characterizes MLLMs across verticals such as: (i) simple vision tasks, such as face matching and text-in-vision comprehension capabilities; (ii) local-to-global understanding, encompassing image matching, grid pointing game, and attribute localization, which tests general visual grounding. Each vertical is instantiated with ground-truth datasets of faces and words, systematically perturbed through pixel-based augmentations and diffusion-based stylized illusions. The Perceptual Observatory moves beyond leaderboard accuracy to yield insights into how MLLMs preserve perceptual grounding and relational structure under perturbations, providing a principled foundation for analyzing strengths and weaknesses of current and future models.

[37] Seeing is Believing (and Predicting): Context-Aware Multi-Human Behavior Prediction with Vision Language Models

Utsav Panchal,Yuchen Liu,Luigi Palmieri,Ilche Georgievski,Marco Aiello

Main category: cs.CV

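TL;DR: This paper proposes CAMP-VLM, a vision-language-model-based framework for predicting the behavior of multiple humans from a third-person view; it combines visual context with spatial awareness from scene graphs, is fine-tuned on synthetic data, and improves prediction accuracy over the best existing method by up to 66.9%.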

Details

Motivation: Prior work focuses mainly on egocentric behavior prediction in single-human scenarios, whereas many robotic applications need to understand the behavior of multiple humans from a third-person view, and suitable datasets and methods for multi-human behavior prediction are lacking. Method: The proposed CAMP-VLM framework combines a vision-language model with spatial information from scene graphs and contextual features from visual input; synthetic human behavior data are generated with a photorealistic simulator, and the model is trained with Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). Result: Evaluated on synthetic and real-world sequences, CAMP-VLM improves prediction accuracy over the best baseline by up to 66.9% and generalizes well. Conclusion: By fusing context awareness with scene-structure information, CAMP-VLM improves multi-human behavior prediction from a third-person view and supports the use of synthetic training data and a VLM framework for this task. Abstract: Accurately predicting human behaviors is crucial for mobile robots operating in human-populated environments. While prior research primarily focuses on predicting actions in single-human scenarios from an egocentric view, several robotic applications require understanding multiple human behaviors from a third-person perspective. To this end, we present CAMP-VLM (Context-Aware Multi-human behavior Prediction): a Vision Language Model (VLM)-based framework that incorporates contextual features from visual input and spatial awareness from scene graphs to enhance prediction of human-scene interactions. Due to the lack of suitable datasets for multi-human behavior prediction from an observer view, we perform fine-tuning of CAMP-VLM with synthetic human behavior data generated by a photorealistic simulator, and evaluate the resulting models on both synthetic and real-world sequences to assess their generalization capabilities. Leveraging Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO), CAMP-VLM outperforms the best-performing baseline by up to 66.9% in prediction accuracy.

[38] From Words to Wavelengths: VLMs for Few-Shot Multispectral Object Detection

Manuel Nkegoum,Minh-Tan Pham,Élisa Fromont,Bruno Avignon,Sébastien Lefèvre

Main category: cs.CV

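TL;DR: This paper explores the potential of vision-language models (VLMs) for few-shot multispectral object detection and proposes a way to fuse text, visible, and thermal modalities; the resulting detectors significantly outperform specialized models when data are scarce and are competitive or superior under full supervision.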

Details

Motivation: Annotated multispectral data are scarce, which limits the training of deep detectors, while textual class information can provide valuable semantic supervision; this motivates applying VLMs to few-shot multispectral detection. Method: Two representative VLM-based detectors (Grounding DINO and YOLO-World) are adapted to handle multispectral inputs, and a mechanism is proposed to fuse the text, visual, and thermal modalities. Result: Experiments on the FLIR and M3FD benchmarks show that the VLM-based detectors match or outperform specialized multispectral models in both few-shot and fully supervised settings. Conclusion: The semantic priors learned by large-scale VLMs transfer effectively to unseen spectral modalities, offering a strong path toward data-efficient multispectral perception. Abstract: Multispectral object detection is critical for safety-sensitive applications such as autonomous driving and surveillance, where robust perception under diverse illumination conditions is essential. However, the limited availability of annotated multispectral data severely restricts the training of deep detectors. In such data-scarce scenarios, textual class information can serve as a valuable source of semantic supervision. Motivated by the recent success of Vision-Language Models (VLMs) in computer vision, we explore their potential for few-shot multispectral object detection. Specifically, we adapt two representative VLM-based detectors, Grounding DINO and YOLO-World, to handle multispectral inputs and propose an effective mechanism to integrate text, visual and thermal modalities. Through extensive experiments on two popular multispectral image benchmarks, FLIR and M3FD, we demonstrate that VLM-based detectors not only excel in few-shot regimes, significantly outperforming specialized multispectral models trained with comparable data, but also achieve competitive or superior results under fully supervised settings. Our findings reveal that the semantic priors learned by large-scale VLMs effectively transfer to unseen spectral modalities, offering a powerful pathway toward data-efficient multispectral perception.

[39] Are vision-language models ready to zero-shot replace supervised classification models in agriculture?

Earl Ranario,Mason J. Earles

Main category: cs.CV

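TL;DR: This paper evaluates a range of vision-language models (VLMs) on agricultural classification tasks and finds that current VLMs cannot yet replace dedicated supervised models for agricultural diagnosis, although they can serve as assistive tools under constrained conditions.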

Details

Motivation: The paper examines how reliable current vision-language models are for agricultural decision support and fills a gap in benchmark studies for this domain. Method: Open-source and closed-source VLMs are benchmarked in a zero-shot setting on 27 agricultural classification datasets from the AgML collection, covering plant disease, pest and damage, and plant and weed species identification; the study also compares prompting styles (multiple-choice versus open-ended) and evaluation methods (such as LLM-based semantic judging). Result: Zero-shot VLMs are substantially less accurate than the supervised task-specific baseline YOLO11; the best closed-source model, Gemini-3 Pro, averages about 62% accuracy under multiple-choice prompting, while open-ended prompting typically yields below 25%; LLM-based semantic judging raises open-ended accuracy (for example, from 21% to 30% for top models); Qwen-VL-72B is the best open-source model; plant and weed species classification is comparatively easy, and pest and damage identification is the hardest. Conclusion: Off-the-shelf VLMs are not yet suitable as standalone agricultural diagnostic systems, but they can serve as assistive components when paired with constrained interfaces, explicit label ontologies, and domain-aware evaluation strategies. Abstract: Vision-language models (VLMs) are increasingly proposed as general-purpose solutions for visual recognition tasks, yet their reliability for agricultural decision support remains poorly understood. We benchmark a diverse set of open-source and closed-source VLMs on 27 agricultural classification datasets from the AgML collection, spanning 162 classes across plant disease, pest and damage, and plant and weed species identification. Across all tasks, zero-shot VLMs substantially underperform a supervised task-specific baseline (YOLO11), which consistently achieves markedly higher accuracy than any foundation model. Under multiple-choice prompting, the best-performing VLM (Gemini-3 Pro) reaches approximately 62% average accuracy, while open-ended prompting yields much lower performance, with raw accuracies typically below 25%. Applying LLM-based semantic judging increases open-ended accuracy (for example, from 21% to 30% for top models) and alters model rankings, demonstrating that evaluation methodology meaningfully affects reported conclusions. Among open-source models, Qwen-VL-72B performs best, approaching closed-source performance under constrained prompting but still trailing top proprietary systems. Task-level analysis shows that plant and weed species classification is consistently easier than pest and damage identification, which remains the most challenging category across models. Overall, these results indicate that current off-the-shelf VLMs are not yet suitable as standalone agricultural diagnostic systems, but can function as assistive components when paired with constrained interfaces, explicit label ontologies, and domain-aware evaluation strategies.

[40] Eyes on the Grass: Biodiversity-Increasing Robotic Mowing Using Deep Visual Embeddings

Lars Beckers,Arno Waes,Aaron Van Campenhout,Toon Goedemé

Main category: cs.CV

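TL;DR: The paper proposes a robotic mowing framework that actively enhances garden biodiversity through visual perception and adaptive decision-making, using deep feature-space analysis to identify and preserve vegetation patches and estimating biodiversity without species-level supervision.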

Details

Motivation: Conventional mowing destroys vegetation diversity and passive rewilding has limited effect, so an intelligent approach that actively protects biodiversity in urban green spaces is needed. Method: A pretrained ResNet50 network extracts ecologically meaningful embeddings from plant images; a global deviation metric measures visual diversity in the embedding space and drives a selective mowing algorithm that switches the blades on and off dynamically. Result: The system was validated on a mock-up lawn and on real garden datasets, and embedding-space dispersion correlates strongly with expert biodiversity assessment. Conclusion: Deep visual diversity can serve as a proxy for ecological richness, and the proposed mowing decision approach can promote urban biodiversity, with the prospect of turning monocultural lawns into valuable habitats. Abstract: This paper presents a robotic mowing framework that actively enhances garden biodiversity through visual perception and adaptive decision-making. Unlike passive rewilding approaches, the proposed system uses deep feature-space analysis to identify and preserve visually diverse vegetation patches in camera images by selectively deactivating the mower blades. A ResNet50 network pretrained on PlantNet300K provides ecologically meaningful embeddings, from which a global deviation metric estimates biodiversity without species-level supervision. These estimates drive a selective mowing algorithm that dynamically alternates between mowing and conservation behavior. The system was implemented on a modified commercial robotic mower and validated both in a controlled mock-up lawn and on real garden datasets. Results demonstrate a strong correlation between embedding-space dispersion and expert biodiversity assessment, confirming the feasibility of deep visual diversity as a proxy for ecological richness and the effectiveness of the proposed mowing decision approach. Widespread adoption of such systems will turn ecologically worthless, monocultural lawns into vibrant, valuable biotopes that boost urban biodiversity.
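The idea of scoring visual diversity from embedding-space dispersion can be sketched as follows. The abstract does not define the global deviation metric, so this example assumes one plausible form (mean Euclidean distance of patch embeddings from their centroid); the function names `dispersion` and `blades_on`, the threshold, and the two-dimensional toy embeddings are hypothetical.

```python
import math


def dispersion(embeddings):
    """Assumed diversity score: mean distance of embeddings from their centroid."""
    n = len(embeddings)
    dim = len(embeddings[0])
    centroid = [sum(e[i] for e in embeddings) / n for i in range(dim)]
    return sum(math.dist(e, centroid) for e in embeddings) / n


def blades_on(embeddings, threshold):
    """Mow only when the view looks visually uniform (low dispersion)."""
    return dispersion(embeddings) < threshold


# Toy 2-D embeddings standing in for ResNet50 features of image patches.
uniform_patch = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
diverse_patch = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

mow_uniform = blades_on(uniform_patch, threshold=0.5)
mow_diverse = blades_on(diverse_patch, threshold=0.5)
```

A single scalar per camera view keeps the decision rule simple enough for an embedded controller; the threshold would need tuning against expert assessments.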

Liudi Yang,Yang Bai,George Eskandar,Fengyi Shen,Mohammad Altillawi,Dong Chen,Ziyuan Liu,Abhinav Valada

Main category: cs.CV

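TL;DR: The paper proposes a method that generates video-action pairs from a text instruction, an initial image, and the robot state; it extends a pretrained video diffusion model and introduces a Bridge Attention mechanism to produce high-quality videos and precise actions for robotic policy learning.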

Details

Motivation: Existing methods lack tightly coupled cross-modal interaction between the video and action modalities and cannot fully exploit the knowledge in pretrained video diffusion models, while missing action annotations limit their use in robotic policy learning. Method: (1) Extend a pretrained video diffusion model with a parallel, dedicated action diffusion model that preserves pretrained knowledge; (2) introduce a Bridge Attention mechanism for effective cross-modal interaction; (3) design an action refinement module that converts coarse actions into precise controls for low-resolution datasets. Result: On multiple public benchmarks and real-world datasets, the method generates higher-quality videos and more accurate actions and significantly outperforms existing baselines. Conclusion: The method provides a scalable framework for using large-scale video data in robot learning and addresses both missing action annotations and insufficient cross-modal fusion. Abstract: We present a method to generate video-action pairs that follow text instructions, starting from an initial image observation and the robot's joint states. Our approach automatically provides action labels for video diffusion models, overcoming the common lack of action annotations and enabling their full use for robotic policy learning. Existing methods either adopt two-stage pipelines, which limit tightly coupled cross-modal information sharing, or rely on adapting a single-modal diffusion model for a joint distribution that cannot fully leverage pretrained video knowledge. To overcome these limitations, we (1) extend a pretrained video diffusion model with a parallel, dedicated action diffusion model that preserves pretrained knowledge, (2) introduce a Bridge Attention mechanism to enable effective cross-modal interaction, and (3) design an action refinement module to convert coarse actions into precise controls for low-resolution datasets. Extensive evaluations on multiple public benchmarks and real-world datasets demonstrate that our method generates higher-quality videos and more accurate actions and significantly outperforms existing baselines, offering a scalable framework for leveraging large-scale video data for robotic learning.

[42] Driving in Corner Case: A Real-World Adversarial Closed-Loop Evaluation Platform for End-to-End Autonomous Driving

Jiaheng Geng,Jiatong Du,Xinyu Zhang,Ye Li,Panqu Wang,Yanjun Huang

Main category: cs.CV

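TL;DR: The paper proposes a closed-loop evaluation platform for end-to-end autonomous driving that generates adversarial interactions in real-world scenes to detect performance degradation in safety-critical corner cases.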

Details

Motivation: Existing adversarial evaluation methods are mostly built on simplified simulation environments and are hard to apply to real-world end-to-end autonomous driving systems, leaving safety-critical corner cases in real scenes without effective evaluation. Method: The authors build a closed-loop evaluation platform that pairs a flow-matching-based real-world image generator with an adversarial traffic policy, generating realistic driving images and simulating challenging interactions to evaluate end-to-end models trained on real-world data. Result: The platform generates realistic driving images efficiently and stably, and the adversarial policy produces corner cases that current autonomous driving systems struggle to handle; experiments on models such as UniAD and VAD show that the platform exposes their performance degradation. Conclusion: The platform can detect potential problems of end-to-end autonomous driving models in safety-critical scenarios, which helps improve system safety and robustness. Abstract: Safety-critical corner cases, difficult to collect in the real world, are crucial for evaluating end-to-end autonomous driving. Adversarial interaction is an effective method to generate such safety-critical corner cases. While existing adversarial evaluation methods are built for models operating in simplified simulation environments, adversarial evaluation for real-world end-to-end autonomous driving has been little explored. To address this challenge, we propose a closed-loop evaluation platform for end-to-end autonomous driving, which can generate adversarial interactions in real-world scenes. In our platform, the real-world image generator cooperates with an adversarial traffic policy to evaluate various end-to-end models trained on real-world data. The generator, based on flow matching, efficiently and stably generates real-world images according to the traffic environment information. The efficient adversarial surrounding vehicle policy is designed to model challenging interactions and create corner cases that current autonomous driving systems struggle to handle. Experimental results demonstrate that the platform can generate realistic driving images efficiently. By evaluating end-to-end models such as UniAD and VAD, we demonstrate that, with the adversarial policy, our platform reveals the performance degradation of the tested model in corner cases. This result indicates that this platform can effectively detect the model's potential issues, which will facilitate the safety and robustness of end-to-end autonomous driving.

[43] FOD-Diff: 3D Multi-Channel Patch Diffusion Model for Fiber Orientation Distribution

Hao Tang,Hanyu Liu,Alessandro Perelli,Xi Chen,Chao Li

Main category: cs.CV

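TL;DR: The paper proposes a 3D multi-channel patch diffusion model that combines anatomical priors with an attention mechanism over spherical harmonic coefficients to predict high angular resolution FOD from low angular resolution dMRI.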

Details

Motivation: Low angular resolution dMRI gives limited accuracy when estimating the fiber orientation distribution (FOD), while high angular resolution acquisition requires long scans, which limits its use; existing diffusion models struggle to generate high-quality FOD efficiently because FOD contains a large number of spherical harmonic coefficients. Method: The authors propose a 3D multi-channel patch diffusion model, design a FOD-patch adapter that introduces prior brain anatomy for more efficient patch-based learning, add a voxel-level conditional coordinating module to strengthen global understanding, and design a spherical harmonic (SH) attention module to capture the complex correlations among SH coefficients. Result: Experiments show that the method outperforms existing state-of-the-art methods on HAR-FOD prediction. Conclusion: Through its architectural design and use of prior knowledge, the method estimates high-resolution FOD from low-resolution dMRI efficiently and accurately and shows promise for clinical use. Abstract: Diffusion MRI (dMRI) is a critical non-invasive technique to estimate fiber orientation distribution (FOD) for characterizing white matter integrity. Estimating FOD from single-shell low angular resolution dMRI (LAR-FOD) is limited by accuracy, whereas estimating FOD from multi-shell high angular resolution dMRI (HAR-FOD) requires a long scanning time, which limits its applicability. Diffusion models have shown promise in estimating HAR-FOD based on LAR-FOD. However, using diffusion models to efficiently generate HAR-FOD is challenging due to the large number of spherical harmonic (SH) coefficients in FOD. Here, we propose a 3D multi-channel patch diffusion model to predict HAR-FOD from LAR-FOD. We design the FOD-patch adapter by introducing the prior brain anatomy for more efficient patch-based learning. Furthermore, we introduce a voxel-level conditional coordinating module to enhance the global understanding of the model. We design the SH attention module to effectively learn the complex correlations of the SH coefficients. Our experimental results show that our method achieves the best performance in HAR-FOD prediction and outperforms other state-of-the-art methods.

[44] Auto-Vocabulary 3D Object Detection

Haomeng Zhang,Kuan-Chuan Peng,Suhas Lohit,Raymond A. Yeh

Main category: cs.CV

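TL;DR: This paper proposes Auto-Vocabulary 3D Object Detection (AV3DOD), which needs no user-specified classes; it uses 2D vision-language models to generate semantic candidates automatically, introduces the Semantic Score (SS) to assess the quality of generated class names, and achieves state-of-the-art localization and semantic quality on ScanNetV2 and SUNRGB-D.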

Details

Motivation: Existing open-vocabulary 3D detection methods still rely on user-specified classes, which limits how far they truly generalize, so a fully automatic way of generating classes is needed. Method: The AV3DOD framework uses 2D vision-language models to generate rich semantic candidates through image captioning, pseudo 3D box generation, and feature-space semantics expansion, and introduces the Semantic Score (SS) to evaluate the quality of the generated class names. Result: On ScanNetV2 and SUNRGB-D, AV3DOD reaches state-of-the-art mAP and SS; on ScanNetV2 it surpasses CoDA by 3.48 overall mAP and attains a 24.5% relative improvement in SS. Conclusion: AV3DOD performs auto-vocabulary 3D detection without user input and outperforms existing methods in both localization accuracy and semantic quality, moving open-vocabulary detection toward greater automation. Abstract: Open-vocabulary 3D object detection methods are able to localize 3D boxes of classes unseen during training. Despite the name, existing methods rely on user-specified classes both at training and inference. We propose to study Auto-Vocabulary 3D Object Detection (AV3DOD), where the classes are automatically generated for the detected objects without any user input. To this end, we introduce Semantic Score (SS) to evaluate the quality of the generated class names. We then develop a novel framework, AV3DOD, which leverages 2D vision-language models (VLMs) to generate rich semantic candidates through image captioning, pseudo 3D box generation, and feature-space semantics expansion. AV3DOD achieves state-of-the-art (SOTA) performance on both localization (mAP) and semantic quality (SS) on the ScanNetV2 and SUNRGB-D datasets. Notably, it surpasses the SOTA, CoDA, by 3.48 overall mAP and attains a 24.5% relative improvement in SS on ScanNetV2.

[45] LAPX: Lightweight Hourglass Network with Global Context

Haopeng Zhao,Marsha Mariya Kappan,Mahdi Bamdad,Francisco Cruz

Main category: cs.CV

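TL;DR: The paper proposes LAPX, a lightweight, accurate human pose estimation model based on the Hourglass network and self-attention; with only 2.3M parameters it achieves competitive results on MPII and COCO and runs in real time on edge devices.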

Details

Motivation: Existing SOTA pose estimation models are computationally expensive, and lightweight models suffer from compatibility problems or reduced accuracy when deployed on edge devices, so a new model that balances accuracy, efficiency, and device suitability is needed. Method: LAPX is an Hourglass-based model built on the earlier LAP; it introduces self-attention to capture global contextual information, advances the stage design, and refines the lightweight attention modules. Result: With only 2.3M parameters, LAPX achieves competitive performance on the MPII and COCO benchmarks and runs in real time, making it suitable for edge deployment. Conclusion: LAPX balances accuracy and computational efficiency and is a high-performing lightweight pose estimation option for edge devices. Abstract: Human pose estimation is a crucial task in computer vision. Methods that have SOTA (State-of-the-Art) accuracy often involve a large number of parameters and incur substantial computational cost. Many lightweight variants have been proposed to reduce the model size and computational cost of these methods. However, several of these variants still contain components that are not well suited for efficient deployment on edge devices. Moreover, models that primarily emphasize inference speed on edge devices often suffer from limited accuracy due to their overly simplified designs. To address these limitations, we propose LAPX, an Hourglass network with self-attention that captures global contextual information, based on previous work, LAP. In addition to adopting the self-attention module, LAPX advances the stage design and refines the lightweight attention modules. It achieves competitive results on two benchmark datasets, MPII and COCO, with only 2.3M parameters, and demonstrates real-time performance, confirming its edge-device suitability.

[46] Collimator-assisted high-precision calibration method for event cameras

Zibin Liu,Shunkun Liang,Banglei Guan,Dongcai Tan,Yang Shang,Qifeng Yu

Main category: cs.CV

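TL;DR: The paper proposes an event camera calibration method based on a collimator with flickering star-based patterns, using a linear solution followed by nonlinear optimization to calibrate intrinsic and extrinsic parameters with high precision.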

Details

Motivation: Geometric calibration of event cameras in long-range measurement scenarios is challenging, particularly when long distance and high precision are required together. Method: Using a collimator with flickering star-based patterns, the camera parameters are first solved linearly under a spherical motion model and then refined by nonlinear optimization. Result: Across a range of real-world experimental conditions, the method outperforms existing methods in calibration accuracy and reliability. Conclusion: The method addresses the calibration problem for event cameras in long-range, high-precision measurement and offers higher accuracy and robustness. Abstract: Event cameras are a new type of brain-inspired visual sensor with advantages such as high dynamic range and high temporal resolution. The geometric calibration of event cameras, which involves determining their intrinsic and extrinsic parameters, particularly in long-range measurement scenarios, remains a significant challenge. To address the dual requirements of long-distance and high-precision measurement, we propose an event camera calibration method utilizing a collimator with flickering star-based patterns. The proposed method first linearly solves camera parameters using the sphere motion model of the collimator, followed by nonlinear optimization to refine these parameters with high precision. Through comprehensive real-world experiments across varying conditions, we demonstrate that the proposed method consistently outperforms existing event camera calibration methods in terms of accuracy and reliability.

[47] TurboDiffusion: Accelerating Video Diffusion Models by 100-200 Times

Jintao Zhang,Kaiwen Zheng,Kai Jiang,Haoxu Wang,Ion Stoica,Joseph E. Gonzalez,Jianfei Chen,Jun Zhu

Main category: cs.CV

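TL;DR: TurboDiffusion is a framework for accelerating video generation that combines attention acceleration, step distillation, and W8A8 quantization to achieve a 100-200x speedup while maintaining video quality.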

Details

Motivation: Existing diffusion models are computationally expensive and slow for video generation and cannot meet real-time application needs, so an efficient end-to-end acceleration framework is required. Method: The framework accelerates attention with low-bit SageAttention and trainable Sparse-Linear Attention (SLA), uses rCM for step distillation, applies W8A8 quantization to compress model parameters and activations to 8 bits, and adds several engineering optimizations. Result: Experiments on several Wan-series models show that TurboDiffusion achieves a 100-200x generation speedup even on a single RTX 5090 GPU, with video quality comparable to the original models. Conclusion: TurboDiffusion improves the video generation efficiency of diffusion models through several key techniques and offers a practical solution for high-performance video generation. Abstract: We introduce TurboDiffusion, a video generation acceleration framework that can speed up end-to-end diffusion generation by 100-200x while maintaining video quality. TurboDiffusion mainly relies on several components for acceleration: (1) Attention acceleration: TurboDiffusion uses low-bit SageAttention and trainable Sparse-Linear Attention (SLA) to speed up attention computation. (2) Step distillation: TurboDiffusion adopts rCM for efficient step distillation. (3) W8A8 quantization: TurboDiffusion quantizes model parameters and activations to 8 bits to accelerate linear layers and compress the model. In addition, TurboDiffusion incorporates several other engineering optimizations. We conduct experiments on the Wan2.2-I2V-14B-720P, Wan2.1-T2V-1.3B-480P, Wan2.1-T2V-14B-720P, and Wan2.1-T2V-14B-480P models. Experimental results show that TurboDiffusion achieves 100-200x speedup for video generation even on a single RTX 5090 GPU, while maintaining comparable video quality. The GitHub repository, which includes model checkpoints and easy-to-use code, is available at https://github.com/thu-ml/TurboDiffusion.
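To make the W8A8 idea concrete, the sketch below quantizes both the weights and the activations of a single dot product to 8-bit integers, accumulates in integers, and rescales the result. It is a generic illustration of symmetric per-tensor int8 quantization, not TurboDiffusion's kernels; the abstract does not state the quantization granularity or rounding scheme, so those choices, along with the helper names `quantize` and `w8a8_dot`, are assumptions.

```python
def quantize(values):
    """Symmetric per-tensor int8 quantization; returns (integer codes, scale)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def w8a8_dot(weights, activations):
    """Dot product with 8-bit weights (W8) and 8-bit activations (A8)."""
    q_w, scale_w = quantize(weights)
    q_a, scale_a = quantize(activations)
    accumulator = sum(w * a for w, a in zip(q_w, q_a))  # integer arithmetic only
    return accumulator * scale_w * scale_a              # rescale to real units


weights = [0.5, -1.0, 0.25]
activations = [2.0, 1.0, -4.0]

exact = sum(w * a for w, a in zip(weights, activations))
approx = w8a8_dot(weights, activations)
error = abs(approx - exact)
```

The integer accumulation is the part that low-precision hardware units speed up; the two scale factors are applied once per output, so their cost is small relative to the multiply-accumulate work.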

[48] Flexible Camera Calibration using a Collimator System

Shunkun Liang,Banglei Guan,Zhenbao Yu,Dongcai Tan,Pengju Sun,Zibin Liu,Qifeng Yu,Yang Shang

Main category: cs.CV

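TL;DR: The paper proposes a new camera calibration method based on a purpose-designed collimator system; an angle invariance constraint reduces the relative motion to a pure rotation, enabling efficient and flexible calibration.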

Details

Motivation: Conventional camera calibration methods are limited in how well the environment and the degrees of freedom can be controlled, so a more reliable and controllable calibration setup is needed. Method: The optical geometry of the collimator system yields an angle invariance constraint, and the relative motion between target and camera is proven to follow a spherical motion model, which reduces it from 6DOF to a 3DOF pure rotation; on this basis the authors propose a closed-form linear solver for multiple images, a minimal solver for two images, and a calibration algorithm that uses a single collimator image. Result: Synthetic and real-world experiments verify the feasibility of the method and show that it outperforms existing baseline methods. Conclusion: With the collimator system and the spherical motion constraint, the method provides accurate, flexible, and fast camera calibration suitable for practical use. Abstract: Camera calibration is a crucial step in photogrammetry and 3D vision applications. This paper introduces a novel camera calibration method using a designed collimator system. Our collimator system provides a reliable and controllable calibration environment for the camera. Exploiting the unique optical geometry property of our collimator system, we introduce an angle invariance constraint and further prove that the relative motion between the calibration target and camera conforms to a spherical motion model. This constraint reduces the original 6DOF relative motion between target and camera to a 3DOF pure rotation. Using the spherical motion constraint, a closed-form linear solver for multiple images and a minimal solver for two images are proposed for camera calibration. Furthermore, we propose a single collimator image calibration algorithm based on the angle invariance constraint. This algorithm eliminates the requirement for camera motion, providing a novel solution for flexible and fast calibration. The performance of our method is evaluated in both synthetic and real-world experiments, which verify the feasibility of calibration using the collimator system and demonstrate that our method is superior to existing baseline methods. Demo code is available at https://github.com/LiangSK98/CollimatorCalibration
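One way to build intuition for why translation can drop out of the motion model is the pinhole projection of a point at infinity. The sketch below is not the paper's derivation: it assumes an idealised collimator whose target points behave like directions (homogeneous coordinate W = 0), and the intrinsic matrix, poses, and helper names are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(matrix[r][c] * vector[c] for c in range(3)) for r in range(3)]


def project(K, R, t, point_h):
    """Pinhole projection of a homogeneous point (X, Y, Z, W) under pose [R | t]."""
    X, Y, Z, W = point_h
    rotated = matvec(R, [X, Y, Z])
    in_camera = [r + W * ti for r, ti in zip(rotated, t)]  # t is scaled by W
    u, v, w = matvec(K, in_camera)
    return (u / w, v / w)


K = [[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]]  # assumed intrinsics
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]          # identity rotation
t_zero = [0.0, 0.0, 0.0]
t_moved = [0.5, -0.2, 1.0]

collimated = (0.1, -0.05, 1.0, 0.0)  # point at infinity: only a direction
nearby = (0.1, -0.05, 1.0, 1.0)      # ordinary finite point

pixel_a = project(K, R, t_zero, collimated)
pixel_b = project(K, R, t_moved, collimated)
near_a = project(K, R, t_zero, nearby)
near_b = project(K, R, t_moved, nearby)
```

Under this idealisation `pixel_a` and `pixel_b` coincide while `near_a` and `near_b` differ, so only the rotation between target and camera affects the image of a collimated pattern.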

[49] Interaction-via-Actions: Cattle Interaction Detection with Joint Learning of Action-Interaction Latent Space

Ren Nakagawa,Yang Yang,Risa Shinoda,Hiroaki Santo,Kenji Oyama,Fumio Okura,Takenao Ohkawa

Main category: cs.CV

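TL;DR: This paper proposes CattleAct, a method for automatically detecting behavioral interactions between grazing cattle from a single image, aimed at smart livestock management.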

Details

Motivation: Detecting interactions between cattle is difficult because no comprehensive behavioral dataset includes interactions and because interactions between grazing cattle are rare events. Method: Interactions are decomposed into combinations of actions by individual cattle; an action latent space is first learned from a large-scale cattle action dataset, and the pretrained latent space is then fine-tuned with contrastive learning to embed rare interactions, producing a unified latent space of actions and interactions. Result: Experiments on a commercial-scale pasture show that the method detects interactions more accurately than the baselines. Conclusion: CattleAct is a data-efficient method that detects behavioral interactions between cattle, and its implementation is open source. Abstract: This paper introduces a method and application for automatically detecting behavioral interactions between grazing cattle from a single image, which is essential for smart livestock management in the cattle industry, such as for detecting estrus. Although interaction detection for humans has been actively studied, a non-trivial challenge lies in cattle interaction detection, specifically the lack of a comprehensive behavioral dataset that includes interactions, as the interactions of grazing cattle are rare events. We, therefore, propose CattleAct, a data-efficient method for interaction detection by decomposing interactions into the combinations of actions by individual cattle. Specifically, we first learn an action latent space from a large-scale cattle action dataset. Then, we embed rare interactions via the fine-tuning of the pre-trained latent space using contrastive learning, thereby constructing a unified latent space of actions and interactions. On top of the proposed method, we develop a practical working system integrating video and GPS inputs. Experiments on a commercial-scale pasture demonstrate the accurate interaction detection achieved by our method compared to the baselines. Our implementation is available at https://github.com/rakawanegan/CattleAct.

[50] ResDynUNet++: A nested U-Net with residual dynamic convolution blocks for dual-spectral CT

Ze Yuan,Wenbin Li,Shusen Zhao

Main category: cs.CV

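TL;DR: The paper proposes a hybrid reconstruction framework for dual-spectral CT (DSCT) that combines iterative methods with deep learning: a knowledge-driven module (OPMT) quickly produces an intermediate solution, and a new network, ResDynUNet++, refines the reconstruction.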

Details

Motivation: The work targets reconstruction challenges in dual-spectral CT such as channel imbalance and large artifacts near interfaces by combining the stability of conventional iterative methods with the learning capacity of deep networks. Method: The framework has two parts: (1) a knowledge-driven module that uses the oblique projection modification technique (OPMT) to reconstruct an intermediate solution of the basis material images from the projection data; (2) a data-driven module that refines the intermediate result with ResDynUNet++, a UNet++-based network with residual dynamic convolution blocks. Result: Experiments on synthetic phantoms and real clinical datasets show that the method suppresses artifacts, improves image quality, and outperforms existing reconstruction methods. Conclusion: The hybrid framework markedly improves DSCT reconstruction accuracy while remaining computationally efficient, providing reliable technical support for clinical use of dual-spectral CT. Abstract: We propose a hybrid reconstruction framework for dual-spectral CT (DSCT) that integrates iterative methods with deep learning models. The reconstruction process consists of two complementary components: a knowledge-driven module and a data-driven module. In the knowledge-driven phase, we employ the oblique projection modification technique (OPMT) to reconstruct an intermediate solution of the basis material images from the projection data. We select OPMT for this role because of its fast convergence, which allows it to rapidly generate an intermediate solution that successfully achieves basis material decomposition. Subsequently, in the data-driven phase, we introduce a novel neural network, ResDynUNet++, to refine this intermediate solution. The ResDynUNet++ is built upon a UNet++ backbone by replacing standard convolutions with residual dynamic convolution blocks, which combine the adaptive, input-specific feature extraction of dynamic convolution with the stable training of residual connections. This architecture is designed to address challenges like channel imbalance and near-interface large artifacts in DSCT, producing clean and accurate final solutions. Extensive experiments on both synthetic phantoms and real clinical datasets validate the efficacy and superior performance of the proposed method.

[51] SegGraph: Leveraging Graphs of SAM Segments for Few-Shot 3D Part Segmentation

Yueyang Hu,Haiyong Jiang,Haoxuan Song,Jun Xiao,Hao Pan

Main category: cs.CV

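TL;DR: This paper proposes SegGraph, a new method for few-shot 3D part segmentation that builds a segment graph from SAM segmentation masks to learn geometric features explicitly and propagates 2D foundation model features with a graph neural network to learn global geometric structure.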

Details

Motivation: 现有方法在3D特征学习中忽略了几何结构或未充分利用SAM提供的高质量分组线索，导致分割不足和部件标签不一致的问题。 Method: 设计了一种基于SAM段图的传播方法，构建节点表示段、边捕捉空间关系（重叠/邻接）的段图；每个节点自适应调节2D基础模型特征，并通过图神经网络进行传播以学习全局几何结构；采用新的视方向加权融合策略映射段特征到3D点，减弱低质量段的贡献。 Result: 在PartNet-E上的大量实验表明，该方法至少比所有竞争基线高出6.9个百分点mIoU；进一步分析显示，SegGraph在小部件和部件边界上表现尤为出色，证明了其优越的几何理解能力。 Conclusion: SegGraph有效整合了2D基础模型的知识与3D几何结构信息，在少样本3D部件分割任务中实现了显著性能提升。 Abstract: This work presents a novel framework for few-shot 3D part segmentation. Recent advances have demonstrated the significant potential of 2D foundation models for low-shot 3D part segmentation. However, it is still an open problem that how to effectively aggregate 2D knowledge from foundation models to 3D. Existing methods either ignore geometric structures for 3D feature learning or neglects the high-quality grouping clues from SAM, leading to under-segmentation and inconsistent part labels. We devise a novel SAM segment graph-based propagation method, named SegGraph, to explicitly learn geometric features encoded within SAM's segmentation masks. Our method encodes geometric features by modeling mutual overlap and adjacency between segments while preserving intra-segment semantic consistency. We construct a segment graph, conceptually similar to an atlas, where nodes represent segments and edges capture their spatial relationships (overlap/adjacency). Each node adaptively modulates 2D foundation model features, which are then propagated via a graph neural network to learn global geometric structures. To enforce intra-segment semantic consistency, we map segment features to 3D points with a novel view-direction-weighted fusion attenuating contributions from low-quality segments. Extensive experiments on PartNet-E demonstrate that our method outperforms all competing baselines by at least 6.9 percent mIoU. Further analysis reveals that SegGraph achieves particularly strong performance on small components and part boundaries, demonstrating its superior geometric understanding. 
The code is available at: https://github.com/YueyangHu2000/SegGraph.
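The segment graph described in the abstract (nodes are segments, edges are overlap or adjacency) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each SAM segment has already been back-projected to a set of 3D point ids, and that a point-neighbor table is available. The function name `build_segment_graph` and both input structures are hypothetical.

```python
from itertools import combinations


def build_segment_graph(segments, neighbors):
    """Build labeled edges between segments.

    segments:  dict mapping a segment name to the set of 3D point ids it covers
               (assumed input, e.g. back-projected 2D masks).
    neighbors: dict mapping a point id to the set of its neighboring point ids
               (assumed input, e.g. from a k-nearest-neighbor query).
    Returns a dict mapping (segment_a, segment_b) to "overlap" or "adjacency".
    """
    edges = {}
    for a, b in combinations(sorted(segments), 2):
        pts_a, pts_b = segments[a], segments[b]
        if pts_a & pts_b:
            # The two segments share at least one 3D point.
            edges[(a, b)] = "overlap"
        elif any(neighbors.get(p, set()) & pts_b for p in pts_a):
            # No shared point, but some point of a neighbors a point of b.
            edges[(a, b)] = "adjacency"
    return edges


# Toy example: s0 and s1 share point 2; s1 and s2 touch through points 3 and 4.
toy_segments = {"s0": {0, 1, 2}, "s1": {2, 3}, "s2": {4, 5}, "s3": {9}}
toy_neighbors = {3: {4}, 4: {3}}
toy_edges = build_segment_graph(toy_segments, toy_neighbors)
```

In the paper, each node would additionally carry 2D foundation model features that a graph neural network propagates along these edges; that part is omitted here.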

[52] C-DGPA: Class-Centric Dual-Alignment Generative Prompt Adaptation

Chao Li,Dasha Hu,Chengyang Li,Yuming Jiang,Yuncheng Shen

Main category: cs.CV

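TL;DR: This paper proposes C-DGPA, a new unsupervised domain adaptation method that improves vision-language models on downstream tasks through a dual-alignment strategy (marginal and conditional distributions). It markedly improves class prototype alignment and semantic discriminability and achieves state-of-the-art results on several benchmarks.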

Details

Motivation: Existing prompt-tuning methods focus mainly on aligning marginal distributions and neglect conditional distribution discrepancies, which causes class prototype misalignment and degraded semantic discriminability. Method: C-DGPA uses a dual-branch architecture: one branch aligns marginal distributions with dynamic adversarial training, and the other introduces a Class Mapping Mechanism (CMM) to align conditional distributions, standardizing semantic understanding and preventing over-reliance on the source domain. Result: Extensive experiments on OfficeHome, Office31, and VisDA-2017 show that C-DGPA achieves state-of-the-art performance on all benchmarks. Conclusion: By jointly optimizing marginal and conditional distribution alignment, C-DGPA integrates domain knowledge into prompt learning, produces domain-invariant and semantically discriminative representations, and significantly improves unsupervised domain adaptation. Abstract: Unsupervised Domain Adaptation (UDA) transfers knowledge from a labeled source domain to an unlabeled target domain. Directly deploying Vision-Language Models (VLMs) with prompt tuning in downstream UDA tasks faces the significant challenge of mitigating domain discrepancies. Existing prompt-tuning strategies primarily align marginal distributions but neglect conditional distribution discrepancies, leading to critical issues such as class prototype misalignment and degraded semantic discriminability. To address these limitations, this work proposes C-DGPA: Class-Centric Dual-Alignment Generative Prompt Adaptation. C-DGPA synergistically optimizes marginal distribution alignment and conditional distribution alignment through a novel dual-branch architecture. The marginal distribution alignment branch employs a dynamic adversarial training framework to bridge marginal distribution discrepancies. Simultaneously, the conditional distribution alignment branch introduces a Class Mapping Mechanism (CMM) to align conditional distribution discrepancies by standardizing semantic prompt understanding and preventing source domain over-reliance. This dual alignment strategy effectively integrates domain knowledge into prompt learning via synergistic optimization, ensuring domain-invariant and semantically discriminative representations. Extensive experiments on OfficeHome, Office31, and VisDA-2017 validate the superiority of C-DGPA. It achieves new state-of-the-art results on all benchmarks.

[53] Towards Closing the Domain Gap with Event Cameras

M. Oltan Sevinc,Liao Wu,Francisco Cruz

Main category: cs.CV

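TL;DR: This paper studies event cameras for the domain gap caused by day-night lighting differences. Compared with traditional cameras, event cameras perform more consistently across lighting conditions and give better baseline performance than grayscale frames in cross-domain scenarios.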

Details

Motivation: Traditional cameras lose performance when the training data does not match the deployment environment, in particular under the domain gap caused by day-night lighting changes. Method: Event cameras are proposed as an alternative to reduce the effect of domain shift caused by changing lighting conditions. Result: Experiments show that event cameras keep more stable performance across lighting conditions; their domain-shift penalty is generally comparable to or smaller than that of grayscale frames, and they perform better in cross-domain settings. Conclusion: Event cameras are a promising way to overcome the day-night lighting domain gap without additional adjustments. Abstract: Although traditional cameras are the primary sensor for end-to-end driving, their performance suffers greatly when the conditions of the data they were trained on do not match the deployment environment, a problem known as the domain gap. In this work, we consider the day-night lighting difference domain gap. Instead of traditional cameras, we propose event cameras as a potential alternative that can maintain performance across lighting condition domain gaps without requiring additional adjustments. Our results show that event cameras maintain more consistent performance across lighting conditions, exhibiting domain-shift penalties that are generally comparable to or smaller than those of grayscale frames, and provide superior baseline performance in cross-domain scenarios.

[54] Avatar4D: Synthesizing Domain-Specific 4D Humans for Real-World Pose Estimation

Jerrin Bright,Zhibo Wang,Dmytro Klepachevskyi,Yuhao Chen,Sirisha Rambhatla,David Clausi,John Zelek

Main category: cs.CV

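TL;DR: Avatar4D is a transferable pipeline for generating customized synthetic human motion datasets for domain-specific applications. It gives fine-grained control over pose, appearance, viewpoint, and environment without manual annotation, and its effectiveness in sports is validated with the Syn2Sport dataset.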

Details

Motivation: Existing methods focus on general, everyday motions and lack flexibility and domain adaptability, so they cannot meet the need for fine-grained human motion modeling in specific settings such as sports. Method: Avatar4D is a pipeline that generates high-fidelity 4D human motion without manual annotation and supports fine-grained control over body pose, appearance, camera viewpoint, and environment; a large-scale synthetic sports dataset, Syn2Sport, is built for evaluation. Result: Several state-of-the-art pose estimation models are tested on Syn2Sport, validating the synthetic data for supervised learning, zero-shot transfer to real data, and generalization across sports, and showing that the generated data lies close to real data in feature space. Conclusion: Avatar4D can generate scalable, controllable, and transferable domain-specific human motion datasets, offering a practical option for tasks that cannot rely on real in-domain data. Abstract: We present Avatar4D, a real-world transferable pipeline for generating customizable synthetic human motion datasets tailored to domain-specific applications. Unlike prior works, which focus on general, everyday motions and offer limited flexibility, our approach provides fine-grained control over body pose, appearance, camera viewpoint, and environmental context, without requiring any manual annotations. To validate the impact of Avatar4D, we focus on sports, where domain-specific human actions and movement patterns pose unique challenges for motion understanding. In this setting, we introduce Syn2Sport, a large-scale synthetic dataset spanning sports, including baseball and ice hockey. Avatar4D features high-fidelity 4D (3D geometry over time) human motion sequences with varying player appearances rendered in diverse environments. We benchmark several state-of-the-art pose estimation models on Syn2Sport and demonstrate their effectiveness for supervised learning, zero-shot transfer to real-world data, and generalization across sports. Furthermore, we evaluate how closely the generated synthetic data aligns with real-world datasets in feature space. Our results highlight the potential of such systems to generate scalable, controllable, and transferable human datasets for diverse domain-specific tasks without relying on domain-specific real data.

[55] Visual Alignment of Medical Vision-Language Models for Grounded Radiology Report Generation

Sarosij Bose,Ravi K. Rajendran,Biplob Debnath,Konstantinos Karydis,Amit K. Roy-Chowdhury,Srimat Chakradhar

Main category: cs.CV

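TL;DR: This paper proposes VALOR, a new method that uses reinforcement learning and a two-stage alignment framework to improve the visual grounding and clinical accuracy of medical vision-language models in radiology report generation.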

Details

Motivation: Existing radiology report generation methods often depend on large labeled datasets or retrieval, and suffer from hallucinations caused by weak cross-modal alignment, so they struggle to guarantee clinical accuracy and visual grounding. Method: The VALOR framework applies reinforcement learning-based post-alignment in two stages: first, the model is optimized with textual rewards to produce more accurate clinical terminology; then the vision projection module is aligned so that it attends to the image regions relevant to diagnosis. Result: Experiments on multiple benchmarks show that VALOR significantly improves factual accuracy and visual grounding and outperforms current state-of-the-art methods. Conclusion: VALOR addresses the cross-modal alignment problem of medical vision-language models in radiology report generation, reduces hallucinations, and yields more reliable and interpretable reports. Abstract: Radiology Report Generation (RRG) is a critical step toward automating healthcare workflows, facilitating accurate patient assessments, and reducing the workload of medical professionals. Despite recent progress in Large Medical Vision-Language Models (Med-VLMs), generating radiology reports that are both visually grounded and clinically accurate remains a significant challenge. Existing approaches often rely on large labeled corpora for pre-training, costly task-specific preference data, or retrieval-based methods. However, these strategies do not adequately mitigate hallucinations arising from poor cross-modal alignment between visual and linguistic representations. To address these limitations, we propose VALOR: Visual Alignment of Medical Vision-Language Models for GrOunded Radiology Report Generation. Our method introduces a reinforcement learning-based post-alignment framework utilizing Group-Relative Proximal Optimization (GRPO). The training proceeds in two stages: (1) improving the Med-VLM with textual rewards to encourage clinically precise terminology, and (2) aligning the vision projection module of the textually grounded model with disease findings, thereby guiding attention toward image regions most relevant to the diagnostic task. Extensive experiments on multiple benchmarks demonstrate that VALOR substantially improves factual accuracy and visual grounding, achieving significant performance gains over state-of-the-art report generation methods.

[56] Open Ad-hoc Categorization with Contextualized Feature Learning

Zilin Wang,Sangwoo Mo,Stella X. Yu,Sima Behpour,Liu Ren

Main category: cs.CV

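TL;DR: This paper proposes OAK, a new model for open ad-hoc categorization of visual scenes. By combining the strengths of CLIP and GCD, it achieves state-of-the-art performance on the Stanford and Clevr-4 datasets and produces interpretable saliency maps.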

Details

Motivation: Ad-hoc categories are created dynamically to serve specific goals, unlike fixed common categories, so a method is needed that can adaptively discover the context and expand the categories. Method: A small set of learnable context tokens is introduced at the input of a frozen CLIP model, and training optimizes both CLIP's image-text alignment objective and GCD's visual clustering objective. Result: On the Stanford and Clevr-4 datasets, OAK reaches state-of-the-art results across multiple categorizations, for example 87.4% novel accuracy on Stanford Mood, more than 50% above CLIP and GCD, and it produces interpretable saliency maps that focus on hands (Action), faces (Mood), and backgrounds (Location). Conclusion: OAK combines the perceptual mechanisms of common and ad-hoc categories and delivers interpretable, adaptive, and generalizable visual categorization, advancing scene understanding for AI agents in changing tasks. Abstract: Adaptive categorization of visual scenes is essential for AI agents to handle changing tasks. Unlike fixed common categories for plants or animals, ad-hoc categories are created dynamically to serve specific goals. We study open ad-hoc categorization: Given a few labeled exemplars and abundant unlabeled data, the goal is to discover the underlying context and to expand ad-hoc categories through semantic extension and visual clustering around it. Building on the insight that ad-hoc and common categories rely on similar perceptual mechanisms, we propose OAK, a simple model that introduces a small set of learnable context tokens at the input of a frozen CLIP and optimizes with both CLIP's image-text alignment objective and GCD's visual clustering objective. On Stanford and Clevr-4 datasets, OAK achieves state-of-the-art in accuracy and concept discovery across multiple categorizations, including 87.4% novel accuracy on Stanford Mood, surpassing CLIP and GCD by over 50%. Moreover, OAK produces interpretable saliency maps, focusing on hands for Action, faces for Mood, and backgrounds for Location, promoting transparency and trust while enabling adaptive and generalizable categorization.

[57] Enhanced 3D Shape Analysis via Information Geometry

Amit Vishwakarma,K. S. Subrahamanian Moosath

Main category: cs.CV

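TL;DR: This paper proposes an information-geometric framework for 3D point cloud shape analysis. Point clouds are represented as Gaussian Mixture Models (GMMs), and a Modified Symmetric KL (MSKL) divergence with theoretical upper and lower bounds gives point cloud comparisons that are stable and reflect geometric variation.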

Details

Motivation: Traditional point cloud comparison measures such as the Hausdorff and Chamfer distances struggle to capture global statistical structure and are sensitive to outliers, while existing KL divergence approximations can be unbounded or numerically unstable. Method: Point clouds are modeled as Gaussian Mixture Models (GMMs) on a statistical manifold; the space of GMMs is proved to form a statistical manifold, and the Modified Symmetric KL (MSKL) divergence, with theoretically guaranteed upper and lower bounds, is proposed as the similarity measure. Result: Experiments on the MPI-FAUST and G-PCD datasets show that MSKL is numerically stable, that its values vary monotonically with geometric change, and that it outperforms traditional distances and existing KL approximations on human pose discrimination and animal shape comparison. Conclusion: MSKL offers a robust and stable measure for point cloud comparison, and the proposed information-geometric framework opens a new path for 3D shape analysis. Abstract: Three-dimensional point clouds provide highly accurate digital representations of objects, essential for applications in computer graphics, photogrammetry, computer vision, and robotics. However, comparing point clouds faces significant challenges due to their unstructured nature and the complex geometry of the surfaces they represent. Traditional geometric metrics such as Hausdorff and Chamfer distances often fail to capture global statistical structure and exhibit sensitivity to outliers, while existing Kullback-Leibler (KL) divergence approximations for Gaussian Mixture Models can produce unbounded or numerically unstable values. This paper introduces an information geometric framework for 3D point cloud shape analysis by representing point clouds as Gaussian Mixture Models (GMMs) on a statistical manifold. We prove that the space of GMMs forms a statistical manifold and propose the Modified Symmetric Kullback-Leibler (MSKL) divergence with theoretically guaranteed upper and lower bounds, ensuring numerical stability for all GMM comparisons. Through comprehensive experiments on human pose discrimination (MPI-FAUST dataset) and animal shape comparison (G-PCD dataset), we demonstrate that MSKL provides stable and monotonically varying values that directly reflect geometric variation, outperforming traditional distances and existing KL approximations.

[58] Learning High-Quality Initial Noise for Single-View Synthesis with Diffusion Models

Zhihao Zhang,Xuejun Yang,Weihua Liu,Mouquan Shen

Main category: cs.CV

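TL;DR: This paper proposes a learning framework based on an encoder-decoder network (EDN) that transforms random noise into high-quality initial noise, improving single-view diffusion models for novel view synthesis.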

Details

Motivation: Existing diffusion models for single-view novel view synthesis lack a dedicated framework for learning high-quality initial noise, which limits generation quality. Method: A discretized Euler inversion method is designed to build paired data of random and high-quality noise, and a noise transformation framework based on an encoder-decoder network (EDN) is proposed. Result: The proposed EDN plugs into models such as SV3D and MV-Adapter and significantly improves generation performance on multiple datasets. Conclusion: Learning a mapping from random noise to high-quality noise effectively strengthens novel view synthesis with single-view diffusion models. Abstract: Single-view novel view synthesis (NVS) models based on diffusion models have recently attracted increasing attention, as they can generate a series of novel view images from a single image prompt and camera pose information as conditions. It has been observed that in diffusion models, certain high-quality initial noise patterns lead to better generation results than others. However, there remains a lack of dedicated learning frameworks that enable NVS models to learn such high-quality noise. To obtain high-quality initial noise from random Gaussian noise, we make the following contributions. First, we design a discretized Euler inversion method to inject image semantic information into random noise, thereby constructing paired datasets of random and high-quality noise. Second, we propose a learning framework based on an encoder-decoder network (EDN) that directly transforms random noise into high-quality noise. Experiments demonstrate that the proposed EDN can be seamlessly plugged into various NVS models, such as SV3D and MV-Adapter, achieving significant performance improvements across multiple datasets. Code is available at: https://github.com/zhihao0512/EDN.

[59] Image Compression Using Singular Value Decomposition

Justin Jiang

Main category: cs.CV

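TL;DR: This study examines image compression with singular value decomposition (SVD) and low-rank matrix approximation. Although reconstructed images look similar to the originals, compression efficiency is far below established formats such as JPEG, JPEG2000, and WEBP, and at low error tolerances the result can even be larger than the original image.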

Details

Motivation: To reduce storage and bandwidth demands, the study investigates the potential of SVD for image compression. Method: SVD and low-rank matrix approximation are used to compress grayscale and multichannel images, with performance evaluated by relative Frobenius error and compression ratio. Result: Reconstructed images are visually close to the originals, but at the same error level the compression ratio is consistently lower than that of mainstream formats, and at low error tolerances the file size can exceed the original. Conclusion: SVD-based low-rank approximation is not competitive for practical image compression and is not a suitable replacement for existing industry-standard formats. Abstract: Images are a substantial portion of the internet, making efficient compression important for reducing storage and bandwidth demands. This study investigates the use of Singular Value Decomposition and low-rank matrix approximations for image compression, evaluating performance using relative Frobenius error and compression ratio. The approach is applied to both grayscale and multichannel images to assess its generality. Results show that the low-rank approximations often produce images that appear visually similar to the originals, but the compression efficiency remains consistently worse than established formats such as JPEG, JPEG2000, and WEBP at comparable error levels. At low tolerated error levels, the compressed representation produced by Singular Value Decomposition can even exceed the size of the original image, indicating that this method is not competitive with industry-standard codecs for practical image compression.
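The claim that a low-rank representation can exceed the size of the original follows from simple storage accounting, sketched below. The sketch assumes a single-channel m x n image, a rank-k truncation stored as k left vectors, k right vectors, and k singular values, and the same number of bytes per stored value as the original; the paper's exact accounting may differ, and the function names are illustrative. The error formula is the standard Eckart-Young result for truncated SVD.

```python
import math


def svd_storage(m, n, k):
    """Values stored by a rank-k truncated SVD of an m x n matrix."""
    # k columns of U (m values each), k rows of V^T (n values each), k singular values.
    return k * (m + n + 1)


def compression_ratio(m, n, k):
    """Original value count divided by the rank-k storage count."""
    return (m * n) / svd_storage(m, n, k)


def relative_frobenius_error(singular_values, k):
    """Relative Frobenius error of the best rank-k approximation.

    By the Eckart-Young theorem this is sqrt(sum of discarded s_i^2 / sum of all s_i^2).
    """
    s = sorted(singular_values, reverse=True)
    total = sum(v * v for v in s)
    discarded = sum(v * v for v in s[k:])
    return math.sqrt(discarded / total)


# For a 512 x 512 image the break-even rank is 512*512 / 1025, about 255.75,
# so any rank of 256 or more stores more values than the raw pixels.
break_even_rank = (512 * 512) / (512 + 512 + 1)
```

Because a low tolerated error forces a large k, the ratio drops below 1 well before the approximation becomes exact, which is consistent with the behavior the abstract reports.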

[60] ARMFlow: AutoRegressive MeanFlow for Online 3D Human Reaction Generation

Zichen Geng,Zeeshan Hayder,Wei Liu,Hesheng Wang,Ajmal Mian

Main category: cs.CV

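TL;DR: This paper proposes ARMFlow, a MeanFlow-based autoregressive framework for 3D human reaction generation. Causal context encoding and Bootstrap Contextual Encoding improve the accuracy and real-time performance of online generation, and an offline variant, ReMFlow, is also proposed; together they achieve state-of-the-art performance.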

Details

Motivation: Existing methods cannot satisfy three key requirements at once: high motion fidelity, real-time inference, and autoregressive adaptability in online scenarios. Method: The ARMFlow framework consists of a causal context encoder and an MLP velocity predictor; a Bootstrap Contextual Encoding (BSCE) training strategy reduces error accumulation; an offline version, ReMFlow, is also designed. Result: On the InterHuman and InterX datasets, ARMFlow improves FID by more than 40% over existing online methods, has the fastest inference, and matches offline state-of-the-art methods. Conclusion: ARMFlow balances high fidelity, low latency, and autoregressive stability in 3D reaction generation and performs well both online and offline. Abstract: 3D human reaction generation faces three main challenges: (1) high motion fidelity, (2) real-time inference, and (3) autoregressive adaptability for online scenarios. Existing methods fail to meet all three simultaneously. We propose ARMFlow, a MeanFlow-based autoregressive framework that models temporal dependencies between actor and reactor motions. It consists of a causal context encoder and an MLP-based velocity predictor. We introduce Bootstrap Contextual Encoding (BSCE) in training, encoding generated history instead of ground-truth history, to alleviate error accumulation in autoregressive generation. We further introduce the offline variant ReMFlow, achieving state-of-the-art performance with the fastest inference among offline methods. Our ARMFlow addresses key limitations of online settings by: (1) enhancing semantic alignment via a global contextual encoder; (2) achieving high accuracy and low latency in a single-step inference; and (3) reducing accumulated errors through BSCE. Our single-step online generation surpasses existing online methods on InterHuman and InterX by over 40% in FID, while matching offline state-of-the-art performance despite using only partial sequence conditions.

[61] AI-Powered Dermatological Diagnosis: From Interpretable Models to Clinical Implementation A Comprehensive Framework for Accessible and Trustworthy Skin Disease Detection

Satya Narayana Panda,Vaishnavi Kukkala,Spandana Iyer

Main category: cs.CV

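TL;DR: This study develops a multi-modal AI framework that combines clinical imaging with family history to improve the accuracy of skin disease diagnosis, with particularly strong results for hereditary skin conditions. The framework is interpretable and designed for clinical integration.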

Details

Motivation: Dermatological diagnosis is difficult because specialists are scarce and clinical presentations are complex; family history strongly affects disease susceptibility and treatment response but is often overlooked, so an AI-assisted diagnostic system that integrates family history with imaging data is needed. Method: A multi-modal AI framework combines deep learning-based image analysis with structured clinical data (including detailed family history), using interpretable convolutional neural networks together with clinical decision trees that incorporate hereditary risk factors; multi-center prospective clinical trials are planned to validate diagnostic performance in real-world settings. Result: With family history integrated, the AI system shows higher diagnostic accuracy for hereditary skin conditions such as melanoma, psoriasis, and atopic dermatitis; expert assessment indicates that it helps with early detection and personalized recommendations, and the system is interpretable and suited to clinical workflow integration. Conclusion: The multi-modal AI framework integrates family history with imaging data and improves dermatological diagnosis, especially for hereditary skin diseases; after validation in clinical trials it could see broad clinical use. Abstract: Dermatological conditions affect 1.9 billion people globally, yet accurate diagnosis remains challenging due to limited specialist availability and complex clinical presentations. Family history significantly influences skin disease susceptibility and treatment responses, but is often underutilized in diagnostic processes. This research addresses the critical question: How can AI-powered systems integrate family history data with clinical imaging to enhance dermatological diagnosis while supporting clinical trial validation and real-world implementation? We developed a comprehensive multi-modal AI framework that combines deep learning-based image analysis with structured clinical data, including detailed family history patterns. Our approach employs interpretable convolutional neural networks integrated with clinical decision trees that incorporate hereditary risk factors. The methodology includes prospective clinical trials across diverse healthcare settings to validate AI-assisted diagnosis against traditional clinical assessment. In this work, validation was conducted with healthcare professionals to assess AI-assisted outputs against clinical expectations; prospective clinical trials across diverse healthcare settings are proposed as future work. The integrated AI system demonstrates enhanced diagnostic accuracy when family history data is incorporated, particularly for hereditary skin conditions such as melanoma, psoriasis, and atopic dermatitis. 
Expert feedback indicates potential for improved early detection and more personalized recommendations; formal clinical trials are planned. The framework is designed for integration into clinical workflows while maintaining interpretability through explainable AI mechanisms.

[62] Semi-Supervised Multi-View Crowd Counting by Ranking Multi-View Fusion Models

Qi Zhang,Yunfei Gong,Zhidan Xie,Zhizi Wang,Antoni B. Chan,Hui Huang

Main category: cs.CV

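TL;DR: This paper proposes two semi-supervised multi-view crowd counting frameworks that address limited labeled data by ranking multi-view fusion models according to either model predictions or model uncertainties.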

Details

Motivation: Because multi-view images are hard to collect and annotate, existing datasets contain few multi-view frames and scenes, so multi-view crowd counting suffers from data scarcity. Method: The first method ranks model predictions for different numbers of input views, requiring that predictions with fewer views not exceed predictions with more views; the second ranks model uncertainties, requiring that uncertainty with more views not exceed uncertainty with fewer views; both constraints are introduced into training in a semi-supervised fashion. Result: Experiments show that the proposed methods outperform other semi-supervised crowd counting methods when labeled data is limited. Conclusion: By adding ranking constraints based on predictions or uncertainties, the proposed semi-supervised frameworks improve multi-view crowd counting under data scarcity. Abstract: Multi-view crowd counting has been proposed to deal with the severe occlusion issue of crowd counting in large and wide scenes. However, due to the difficulty of collecting and annotating multi-view images, the datasets for multi-view counting have a limited number of multi-view frames and scenes. To solve the problem of limited data, one approach is to collect synthetic data to bypass the annotating step, while another is to propose semi- or weakly-supervised or unsupervised methods that demand less multi-view data. In this paper, we propose two semi-supervised multi-view crowd counting frameworks by ranking the multi-view fusion models of different numbers of input views, in terms of the model predictions or the model uncertainties. Specifically, for the first method (vanilla model), we rank the multi-view fusion models' prediction results of different numbers of camera-view inputs, namely, the model's predictions with fewer camera views shall not be larger than the predictions with more camera views. For the second method, we rank the estimated model uncertainties of the multi-view fusion models with a variable number of view inputs, guided by the multi-view fusion models' prediction errors, namely, the model uncertainties with more camera views shall not be larger than those with fewer camera views. These constraints are introduced into the model training in a semi-supervised fashion for multi-view counting with limited labeled data. The experiments demonstrate the advantages of the proposed multi-view model ranking methods compared with other semi-supervised counting methods.
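The two ranking constraints can be written as hinge penalties on unlabeled data, as in the sketch below. This is an illustration under stated assumptions, not the paper's loss: it uses scalar counts and uncertainties, penalizes only adjacent pairs in a list ordered by increasing number of views, and the function names and the optional `margin` parameter are hypothetical.

```python
def ranking_penalty(smaller, larger, margin=0.0):
    """Hinge penalty that is zero when smaller + margin <= larger."""
    return max(0.0, smaller - larger + margin)


def prediction_ranking_loss(counts_by_views):
    """Penalize a predicted count that drops when more camera views are added.

    counts_by_views: predicted counts ordered by increasing number of input views.
    Constraint from the abstract: count(fewer views) <= count(more views).
    """
    pairs = zip(counts_by_views, counts_by_views[1:])
    return sum(ranking_penalty(few, more) for few, more in pairs)


def uncertainty_ranking_loss(uncertainty_by_views):
    """Penalize an uncertainty that grows when more camera views are added.

    uncertainty_by_views: uncertainties ordered by increasing number of input views.
    Constraint from the abstract: uncertainty(more views) <= uncertainty(fewer views).
    """
    pairs = zip(uncertainty_by_views, uncertainty_by_views[1:])
    return sum(ranking_penalty(more, few) for few, more in pairs)
```

Neither penalty needs ground-truth counts, which is what lets the constraints act on unlabeled multi-view frames alongside a supervised loss on the labeled ones.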

[63] Pixel Super-Resolved Fluorescence Lifetime Imaging Using Deep Learning

Paloma Casteleiro Costa,Parnian Ghapandar Kashani,Xuhui Liu,Alexander Chen,Ary Portes,Julien Bec,Laura Marcu,Aydogan Ozcan

Main category: cs.CV

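TL;DR: This study proposes FLIM_PSR_k, a deep learning-based multi-channel pixel super-resolution framework that reconstructs high-resolution fluorescence lifetime imaging (FLIM) images from data acquired with large pixel sizes. It reaches a super-resolution factor of 5, significantly improves spatial resolution and image quality, and moves FLIM closer to clinical diagnostic use.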

Details

Motivation: FLIM offers molecular and metabolic contrast, but slow imaging, low signal-to-noise ratio, and limited resolution hinder wide clinical use; a method that overcomes the resolution-speed trade-off is needed for practical deployment. Method: FLIM_PSR_k is a multi-channel pixel super-resolution (PSR) framework based on a conditional generative adversarial network (cGAN); the model is trained to recover high-resolution FLIM images from input data with pixel sizes enlarged by up to 5 times and is validated by blind testing on patient-derived tumor tissue samples. Result: A super-resolution factor of k=5 is achieved, the space-bandwidth product of the output images increases 25-fold, fine structures lost in the low-resolution input are recovered, several image quality metrics improve with statistical significance, and inference is fast enough for practical use. Conclusion: By raising FLIM's effective spatial resolution, FLIM_PSR_k moves it toward faster, higher-resolution implementations compatible with low-numerical-aperture and miniaturized platforms, strengthening its suitability for translational medicine. Abstract: Fluorescence lifetime imaging microscopy (FLIM) is a powerful quantitative technique that provides metabolic and molecular contrast, offering strong translational potential for label-free, real-time diagnostics. However, its clinical adoption remains limited by long pixel dwell times and low signal-to-noise ratio (SNR), which impose a stricter resolution-speed trade-off than conventional optical imaging approaches. Here, we introduce FLIM_PSR_k, a deep learning-based multi-channel pixel super-resolution (PSR) framework that reconstructs high-resolution FLIM images from data acquired with up to a 5-fold increased pixel size. The model is trained using the conditional generative adversarial network (cGAN) framework, which, compared to diffusion model-based alternatives, delivers a more robust PSR reconstruction with substantially shorter inference times, a crucial advantage for practical deployment. FLIM_PSR_k not only enables faster image acquisition but can also alleviate SNR limitations in autofluorescence-based FLIM. Blind testing on held-out patient-derived tumor tissue samples demonstrates that FLIM_PSR_k reliably achieves a super-resolution factor of k = 5, resulting in a 25-fold increase in the space-bandwidth product of the output images and revealing fine architectural features lost in lower-resolution inputs, with statistically significant improvements across various image quality metrics. 
By increasing FLIM's effective spatial resolution, FLIM_PSR_k advances lifetime imaging toward faster, higher-resolution, and hardware-flexible implementations compatible with low-numerical-aperture and miniaturized platforms, better positioning FLIM for translational applications.
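The 25-fold figure is consistent with the super-resolution factor k = 5 under a simple counting argument. The sketch below assumes the space-bandwidth product (SBP) is proportional to the number of resolvable pixels over a fixed field of view; the symbols N_x and N_y (input pixel counts along each axis) are illustrative and not taken from the paper.

```latex
\mathrm{SBP}_{\text{in}} \propto N_x N_y, \qquad
\mathrm{SBP}_{\text{out}} \propto (k N_x)(k N_y) = k^2 N_x N_y, \qquad
\frac{\mathrm{SBP}_{\text{out}}}{\mathrm{SBP}_{\text{in}}} = k^2 = 5^2 = 25 .
```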

[64] TextEditBench: Evaluating Reasoning-aware Text Editing Beyond Rendering

Rui Gui,Yang Wan,Haochen Han,Dongxing Mao,Fangming Liu,Min Li,Alex Jinpeng Wang

Main category: cs.CV

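TL;DR: This paper proposes TextEditBench, a comprehensive evaluation benchmark focused on editing text regions in images. It emphasizes reasoning-intensive editing scenarios that require understanding physical plausibility, linguistic meaning, and cross-modal dependencies, and introduces Semantic Expectation (SE) as a new evaluation dimension that measures semantic consistency, contextual coherence, and cross-modal alignment during text editing.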

Details

Motivation: Text editing in images remains underexplored; existing models struggle to generate legible text while keeping semantic, geometric, and contextual consistency, so a new evaluation benchmark is needed to advance this direction. Method: The TextEditBench benchmark focuses on text-centric image regions, designs reasoning-intensive editing tasks, and introduces a new evaluation dimension, Semantic Expectation (SE), to quantify a model's semantic consistency, contextual coherence, and cross-modal alignment. Result: Experiments on several state-of-the-art image editing systems show that current models can follow simple textual instructions but still perform poorly on context-dependent reasoning, physical consistency, and layout-aware integration. Conclusion: TextEditBench provides a new testing ground for reasoning in text-guided image editing and multimodal generation and should help advance the field. Abstract: Text rendering has recently emerged as one of the most challenging frontiers in visual generation, drawing significant attention from large-scale diffusion and multimodal models. However, text editing within images remains largely unexplored, as it requires generating legible characters while preserving semantic, geometric, and contextual coherence. To fill this gap, we introduce TextEditBench, a comprehensive evaluation benchmark that explicitly focuses on text-centric regions in images. Beyond basic pixel manipulations, our benchmark emphasizes reasoning-intensive editing scenarios that require models to understand physical plausibility, linguistic meaning, and cross-modal dependencies. We further propose a novel evaluation dimension, Semantic Expectation (SE), which measures a model's reasoning ability to maintain semantic consistency, contextual coherence, and cross-modal alignment during text editing. Extensive experiments on state-of-the-art editing systems reveal that while current models can follow simple textual instructions, they still struggle with context-dependent reasoning, physical consistency, and layout-aware integration. By focusing evaluation on this long-overlooked yet fundamental capability, TextEditBench establishes a new testing ground for advancing text-guided image editing and reasoning in multimodal generation.

[65] GFLAN: Generative Functional Layouts

Mohamed Abouagour,Eleftherios Garyfallidis

Main category: cs.CV

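TL;DR: This paper proposes GFLAN, a generative framework that produces architectural floor plans by explicitly factorizing the task into topological planning and geometric realization. It works in two stages: room centroids are allocated first, and room boundaries are then regressed with a graph neural network.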

Details

Motivation: Existing deep learning methods fall short in capturing architectural design reasoning, such as the precedence of topological relationships, the propagation of functional constraints, and the emergence of circulation patterns, and they lack a model of architectural logic. Method: A two-stage approach is used: the first stage applies a convolutional network with dual encoders to generate probability distributions over room centroids; the second stage builds a heterogeneous graph linking room nodes to boundary vertices and uses a Transformer-augmented graph neural network to jointly regress room boundaries. Result: The method models topological and functional constraints better and, given an exterior boundary and a door location, generates plausible layouts that follow design logic. Conclusion: Through explicit structural decomposition, GFLAN improves the controllability and plausibility of floor plan generation and offers a new way to combine architectural knowledge with deep learning. Abstract: Automated floor plan generation lies at the intersection of combinatorial search, geometric constraint satisfaction, and functional design requirements -- a confluence that has historically resisted a unified computational treatment. While recent deep learning approaches have improved the state of the art, they often struggle to capture architectural reasoning: the precedence of topological relationships over geometric instantiation, the propagation of functional constraints through adjacency networks, and the emergence of circulation patterns from local connectivity decisions. To address these fundamental challenges, this paper introduces GFLAN, a generative framework that restructures floor plan synthesis through explicit factorization into topological planning and geometric realization. Given a single exterior boundary and a front-door location, our approach departs from direct pixel-to-pixel or wall-tracing generation in favor of a principled two-stage decomposition. Stage A employs a specialized convolutional architecture with dual encoders -- separating invariant spatial context from evolving layout state -- to sequentially allocate room centroids within the building envelope via discrete probability maps over feasible placements. Stage B constructs a heterogeneous graph linking room nodes to boundary vertices, then applies a Transformer-augmented graph neural network (GNN) that jointly regresses room boundaries.

[66] MACL: Multi-Label Adaptive Contrastive Learning Loss for Remote Sensing Image Retrieval

Amna Amir,Erchan Aptoula

Main category: cs.CV

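TL;DR: This paper proposes Multi-Label Adaptive Contrastive Learning (MACL) to address semantic overlap, imbalanced label distributions, and complex inter-class co-occurrence in remote sensing image retrieval.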

Details

Motivation: Remote sensing images exhibit semantic overlap between categories, highly imbalanced labels, and complex inter-class co-occurrence patterns, which make multi-label retrieval difficult. Method: MACL combines label-aware sampling, frequency-sensitive weighting, and dynamic temperature scaling to learn balanced representations across common and rare categories. Result: Experiments on three benchmark datasets (DLRSD, ML-AID, and WHDLD) show that MACL outperforms contrastive-loss-based baselines and mitigates semantic imbalance. Conclusion: MACL improves multi-label retrieval in large-scale remote sensing image archives and shows good generalization and application potential. Abstract: Semantic overlap among land-cover categories, highly imbalanced label distributions, and complex inter-class co-occurrence patterns constitute significant challenges for multi-label remote-sensing image retrieval. In this article, Multi-Label Adaptive Contrastive Learning (MACL) is introduced as an extension of contrastive learning to address them. It integrates label-aware sampling, frequency-sensitive weighting, and dynamic-temperature scaling to achieve balanced representation learning across both common and rare categories. Extensive experiments on three benchmark datasets (DLRSD, ML-AID, and WHDLD) show that MACL consistently outperforms contrastive-loss-based baselines, effectively mitigating semantic imbalance and delivering more reliable retrieval performance in large-scale remote-sensing archives. Code, pretrained models, and evaluation scripts will be released at https://github.com/amna/MACL upon acceptance.

[67] PixelArena: A benchmark for Pixel-Precision Visual Intelligence

Feng Liang,Sizhe Cheng,Chenqi Yi

Main category: cs.CV

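TL;DR: PixelArena uses semantic segmentation tasks to evaluate the fine-grained generation ability of multimodal large models. It finds that Gemini 3 Pro Image generates with high fidelity in zero-shot settings, showing visual intelligence not seen before.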

Details

Motivation: Existing image generation benchmarks mostly focus on aesthetics and lack an objective evaluation of fine-grained generation ability, so a new way to measure the real generative intelligence of multimodal models is needed. Method: PixelArena is an evaluation framework based on semantic segmentation that analyzes zero-shot generation with pixel precision and compares models qualitatively and quantitatively. Result: Gemini 3 Pro Image generates semantic masks with high fidelity and generalizes strongly; it outperforms the other models compared, and several failure cases are identified. Conclusion: Multimodal models have made notable progress in image generation, and PixelArena points to new directions for research on multimodality, reasoning, interpretability, and benchmarking. Abstract: Multi-modal large language models that have image output are emerging. Many image generation benchmarks focus on aesthetics instead of fine-grained generation capabilities. In PixelArena, we propose using semantic segmentation tasks to objectively examine their fine-grained generative intelligence with pixel precision. We find the latest Gemini 3 Pro Image has emergent image generation capabilities that generate semantic masks with high fidelity under zero-shot settings, showcasing visual intelligence unseen before and true generalization in new image generation tasks. We further investigate its results, compare them qualitatively and quantitatively with those of other models, and present failure cases. The findings not only signal exciting progress in the field but also provide insights into future research related to multimodality, reasoning, interpretability and benchmarking.

[68] LaverNet: Lightweight All-in-one Video Restoration via Selective Propagation

Haiyu Zhao,Yiwen Shan,Yuanbiao Gou,Xi Peng

Main category: cs.CV

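TL;DR: LaverNet is a lightweight all-in-one video restoration network that handles time-varying degradations with only 362K parameters. By selectively propagating degradation-agnostic features, it matches or exceeds existing large models.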

Details

Motivation: When handling time-varying degradations, existing all-in-one video restoration methods are easily dominated by the degradation and neglect the video content, and their reliance on large models hides the underlying difficulties. Method: The lightweight network LaverNet introduces a new propagation mechanism that passes only degradation-agnostic features between frames, reducing the interference of degradation with temporal modeling. Result: LaverNet has only 362K parameters, less than 1% of existing models, yet performs comparably or better on multiple benchmarks. Conclusion: A lightweight network can also deliver strong all-in-one video restoration; the key is a well-designed feature propagation mechanism. Abstract: Recent studies have explored all-in-one video restoration, which handles multiple degradations with a unified model. However, these approaches still face two challenges when dealing with time-varying degradations. First, the degradation can dominate temporal modeling, leading the model to focus on artifacts rather than the video content. Second, current methods typically rely on large models to handle all-in-one restoration, concealing those underlying difficulties. To address these challenges, we propose a lightweight all-in-one video restoration network, LaverNet, with only 362K parameters. To mitigate the impact of degradations on temporal modeling, we introduce a novel propagation mechanism that selectively transmits only degradation-agnostic features across frames. Through LaverNet, we demonstrate that strong all-in-one restoration can be achieved with a compact network. Despite its small size, less than 1% of the parameters of existing models, LaverNet achieves comparable, even superior performance across benchmarks.

[69] Ridge Estimation-Based Vision and Laser Ranging Fusion Localization Method for UAVs

Huayu Huang,Chen Chen,Banglei Guan,Ze Tan,Yang Shang,Zhang Li,Qifeng Yu

Main category: cs.CV

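TL;DR: This paper proposes a fusion localization method based on ridge estimation that combines the rich scene information of sequential imagery with the high precision of laser ranging, improving target localization accuracy and robustness under limited observation conditions.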

Details

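Motivation: Under limited observation conditions such as long distances, small intersection angles, and large inclination angles, the column vectors of the design matrix are severely multicollinear, which makes conventional least squares estimation ill-conditioned and hurts the stability and robustness of localization. Method: A fusion localization method based on ridge estimation combines visual information from sequential imagery with laser ranging data; ridge estimation suppresses multicollinearity, eases the ill-conditioning, and stabilizes the estimate. Result: Experiments show that the method achieves higher localization accuracy than ground localization algorithms that use a single information source, and significantly improves robustness under limited observation conditions. Conclusion: Ridge estimation effectively mitigates multicollinearity under difficult observation conditions, and the proposed fusion method beats conventional methods in accuracy and robustness, making it suitable for multi-sensor UAV localization. Abstract: Tracking and measuring targets using a variety of sensors mounted on UAVs is an effective means to quickly and accurately locate the target. This paper proposes a fusion localization method based on ridge estimation, combining the advantages of rich scene information from sequential imagery with the high precision of laser ranging to enhance localization accuracy. Under limited conditions such as long distances, small intersection angles, and large inclination angles, the column vectors of the design matrix have serious multicollinearity when using the least squares estimation algorithm. The multicollinearity will lead to ill-conditioned problems, resulting in significant instability and low robustness. Ridge estimation is introduced to mitigate the serious multicollinearity under the condition of limited observation. Experimental results demonstrate that our method achieves higher localization accuracy compared to ground localization algorithms based on single information. Moreover, the introduction of ridge estimation effectively enhances the robustness, particularly under limited observation conditions.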
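The effect of multicollinearity on least squares, and how ridge estimation damps it, can be seen on a two-parameter toy problem. The sketch below is not the paper's localization model: the design matrix, noise values, and ridge parameter are made up for illustration, and `solve_normal_equations` is a hypothetical helper that solves (AᵀA + λI)x = Aᵀb in closed form for two unknowns.

```python
def solve_normal_equations(rows, targets, ridge=0.0):
    """Solve (A^T A + ridge * I) x = A^T b for a two-column design matrix A.

    rows:    list of (a1, a2) tuples, one per observation.
    targets: list of observed values b.
    ridge:   ridge parameter; 0.0 gives ordinary least squares.
    """
    # Entries of the 2 x 2 matrix A^T A, with the ridge term on the diagonal.
    s11 = sum(a1 * a1 for a1, _ in rows) + ridge
    s12 = sum(a1 * a2 for a1, a2 in rows)
    s22 = sum(a2 * a2 for _, a2 in rows) + ridge
    # Entries of the vector A^T b.
    t1 = sum(a1 * b for (a1, _), b in zip(rows, targets))
    t2 = sum(a2 * b for (_, a2), b in zip(rows, targets))
    # Closed-form inverse of a 2 x 2 system (Cramer's rule).
    det = s11 * s22 - s12 * s12
    return ((s22 * t1 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det)


# Nearly collinear columns: the second column differs from the first by 1e-4.
rows = [(1.0, 1.0), (1.0, 1.0001), (1.0, 0.9999)]
# Noise-free targets for x = (1, 1) would be 2.0, 2.0001, 1.9999; add small noise.
noisy_targets = [2.0 + 0.01, 2.0001 - 0.01, 1.9999]

least_squares = solve_normal_equations(rows, noisy_targets)
ridge_estimate = solve_normal_equations(rows, noisy_targets, ridge=0.01)
```

With these numbers the least squares solution lands tens of units away from (1, 1), while the ridge solution stays within about 0.01 of it. The price is a small bias toward zero, which is the usual trade-off of ridge estimation.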

[70] QUIDS: Quality-informed Incentive-driven Multi-agent Dispatching System for Mobile Crowdsensing

Nan Zhou,Zuxin Li,Fanhang Man,Xuecheng Chen,Susu Xu,Fan Dang,Chaopeng Hong,Yunhao Liu,Xiao-Ping Zhang,Xinlei Chen

Main category: cs.CV

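TL;DR: This paper proposes QUIDS, a multi-agent dispatching system that optimizes Quality of Information (QoI) in non-dedicated vehicular mobile crowdsensing (NVMCS). By introducing the Aggregated Sensing Quality (ASQ) metric and a Mutually Assisted Belief-aware dispatching algorithm, it balances coverage, reliability, and budget, and substantially improves sensing performance.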

Details

Motivation: In NVMCS systems, vehicle participation is dynamic and uncertain, which leads to insufficient sensing coverage and low reliability and makes QoI hard to guarantee; existing methods do not jointly optimize the trade-off between coverage and reliability effectively. Method: QUIDS introduces Aggregated Sensing Quality (ASQ) as a QoI metric that combines coverage and reliability, and a Mutually Assisted Belief-aware Vehicle Dispatching algorithm that models reliability through belief estimation under uncertainty and allocates incentives to maximize ASQ under a budget constraint. Result: Evaluation on real-world urban data shows that QUIDS improves ASQ by 38% over non-dispatching scenarios and by 10% over state-of-the-art methods, and reduces map reconstruction errors by 39-74%. Conclusion: By jointly optimizing coverage and reliability through a quality-informed incentive mechanism, QUIDS enables low-cost, high-quality urban sensing without dedicated infrastructure, suitable for smart-city applications such as traffic and environmental monitoring. Abstract: This paper addresses the challenge of achieving optimal Quality of Information (QoI) in non-dedicated vehicular mobile crowdsensing (NVMCS) systems. The key obstacles are the interrelated issues of sensing coverage, sensing reliability, and the dynamic participation of vehicles. To tackle these, we propose QUIDS, a QUality-informed Incentive-driven multi-agent Dispatching System, which ensures high sensing coverage and reliability under budget constraints. QUIDS introduces a novel metric, Aggregated Sensing Quality (ASQ), to quantitatively capture QoI by integrating both coverage and reliability. We also develop a Mutually Assisted Belief-aware Vehicle Dispatching algorithm that estimates sensing reliability and allocates incentives under uncertainty, further improving ASQ. Evaluation using real-world data from a metropolitan NVMCS deployment shows QUIDS improves ASQ by 38% over non-dispatching scenarios and by 10% over state-of-the-art methods. It also reduces map reconstruction errors by 39-74% across algorithms. By jointly optimizing coverage and reliability via a quality-informed incentive mechanism, QUIDS enables low-cost, high-quality urban monitoring without dedicated infrastructure, applicable to smart-city scenarios like traffic and environmental sensing.

[71] Collaborative Edge-to-Server Inference for Vision-Language Models

Soochang Song,Yongjune Kim

Main category: cs.CV

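TL;DR: Proposes a collaborative edge-to-server inference framework for vision-language models that selectively retransmits detail-preserved local images, lowering communication cost while maintaining inference accuracy.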

Details

Motivation: In typical deployments, downsampling images captured at edge devices before sending them to the server discards detail and lowers the inference accuracy of vision-language models. Method: A two-stage framework is designed. In the first stage, the server runs inference on the global image, uses the model's attention to identify a region of interest (RoI), and uses the min-entropy of the output tokens to decide whether retransmission is needed; if the threshold is exceeded, it requests a detail-preserved local image from the edge device for a second stage of joint inference. Result: Experiments on multiple vision-language models show that the framework significantly reduces communication overhead while maintaining high inference accuracy. Conclusion: The proposed collaborative inference framework balances communication efficiency and model performance, and suits resource-constrained edge-cloud settings. Abstract: We propose a collaborative edge-to-server inference framework for vision-language models (VLMs) that reduces the communication cost while maintaining inference accuracy. In typical deployments, visual data captured at edge devices (clients) is transmitted to the server for VLM inference. However, resizing the original image (global image) to match the vision encoder's input resolution often discards fine-grained details, leading to accuracy degradation. To overcome this limitation, we design a two-stage framework. In the first stage, the server performs inference on the global image and identifies a region of interest (RoI) using the VLM's internal attention. The min-entropy of the output tokens is then computed as a confidence measure to determine whether retransmission is required. If the min-entropy exceeds a predefined threshold, the server requests the edge device to send a detail-preserved local image of the RoI. The server then refines its inference by jointly leveraging the global and local images. This selective retransmission strategy ensures that only essential visual content is transmitted. Experiments across multiple VLM architectures show that the proposed framework significantly reduces communication cost while maintaining inference accuracy.
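
The retransmission decision described above can be sketched as follows. Min-entropy of a distribution is the standard quantity -log2(max p); how the paper aggregates it over output tokens is not stated in the abstract, so taking the worst (largest) per-token value is an assumption, as are the function names, the toy distributions, and the 1-bit threshold.

```python
import math


def min_entropy(probs):
    """Min-entropy in bits of one token distribution: -log2(max_i p_i)."""
    return -math.log2(max(probs))


def needs_retransmission(token_distributions, threshold_bits):
    """Request the local RoI image when the least confident token is too uncertain.

    Aggregating with max over tokens is an assumption made for this sketch.
    """
    worst = max(min_entropy(p) for p in token_distributions)
    return worst > threshold_bits


# Toy next-token distributions for two generated tokens (hypothetical values).
confident = [[0.90, 0.05, 0.05], [0.80, 0.10, 0.10]]
uncertain = [[0.90, 0.05, 0.05], [0.40, 0.30, 0.30]]

print(needs_retransmission(confident, threshold_bits=1.0))
print(needs_retransmission(uncertain, threshold_bits=1.0))
```

A higher threshold trades accuracy for bandwidth: fewer answers trigger the second stage, so fewer local images are sent.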

Tao Hu,Weiyu Zhou,Yanjie Tu,Peng Wu,Wei Dong,Qingsen Yan,Yanning Zhang

Main category: cs.CV

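TL;DR: This paper proposes GMODiff, a gain map-driven one-step diffusion framework for multi-exposure high dynamic range (HDR) reconstruction. It overcomes the limitations of conventional latent diffusion models in HDR applications by estimating a gain map that keeps a low bit depth, and uses regression priors to improve generation quality and inference efficiency.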

Details

Motivation: Directly applying pre-trained latent diffusion models to HDR reconstruction remains challenging because of the limited dynamic-range representation caused by 8-bit latent compression, the high inference cost of multi-step denoising, and the content hallucination inherent to generative models. Method: HDR reconstruction is reformulated as a conditionally guided Gain Map (GM) estimation task; a regression model provides an informative initial estimate that enables single-step denoising, and regression priors guide both the LDM denoising process and latent decoding to suppress hallucinations while preserving structural accuracy. Result: Experiments show that GMODiff outperforms existing state-of-the-art methods on multiple metrics and runs 100× faster than previous LDM-based methods. Conclusion: Through gain map estimation and a one-step diffusion strategy guided by regression priors, GMODiff addresses the dynamic-range limitation, computational cost, and content hallucination of LDMs in HDR reconstruction, achieving efficient, high-quality results. Abstract: Pre-trained Latent Diffusion Models (LDMs) have recently shown strong perceptual priors for low-level vision tasks, making them a promising direction for multi-exposure High Dynamic Range (HDR) reconstruction. However, directly applying LDMs to HDR remains challenging due to: (1) limited dynamic-range representation caused by 8-bit latent compression, (2) high inference cost from multi-step denoising, and (3) content hallucination inherent to their generative nature. To address these challenges, we introduce GMODiff, a gain map-driven one-step diffusion framework for multi-exposure HDR reconstruction. Instead of reconstructing full HDR content, we reformulate HDR reconstruction as a conditionally guided Gain Map (GM) estimation task, where the GM encodes the extended dynamic range while retaining the same bit depth as LDR images. We initialize the denoising process from an informative regression-based estimate rather than pure noise, enabling the model to generate high-quality GMs in a single denoising step. Furthermore, recognizing that regression-based models excel in content fidelity while LDMs favor perceptual quality, we leverage regression priors to guide both the denoising process and latent decoding of the LDM, suppressing hallucinations while preserving structural accuracy. Extensive experiments demonstrate that our GMODiff performs favorably against several state-of-the-art methods and is 100× faster than previous LDM-based methods.

[73] EverybodyDance: Bipartite Graph-Based Identity Correspondence for Multi-Character Animation

Haotian Ling,Zequn Chen,Qiuying Chen,Donglin Di,Yongjia Ma,Hao Li,Chen Wei,Zhulin Tao,Xun Yang

Main category: cs.CV

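TL;DR: This paper proposes EverybodyDance, a systematic approach based on the Identity Matching Graph (IMG) and Mask-Query Attention (MQA) that targets Identity Correspondence (IC) correctness in multi-character animation, substantially outperforming existing methods in both IC accuracy and visual quality.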

Details

Motivation: Pose-driven character animation has advanced markedly in single-character scenarios, but in multi-character scenarios involving position swaps it struggles to keep identities in generated frames consistent with the reference frame, and effective modeling and evaluation mechanisms are lacking. Method: The Identity Matching Graph (IMG) models characters in the generated and reference frames as a weighted complete bipartite graph, with Mask-Query Attention (MQA) computing the affinity between characters; IC correctness is formalized as a graph structural metric and optimized during training, supported by identity-embedded guidance, a multi-scale matching strategy, and pre-classified sampling. Result: On the authors' own identity correspondence benchmark, EverybodyDance substantially outperforms state-of-the-art baselines in both IC accuracy and visual quality. Conclusion: Through graph-structured modeling and dedicated training strategies, EverybodyDance addresses the identity correspondence problem in multi-character animation and advances identity-consistent multi-character animation. Abstract: Consistent pose-driven character animation has achieved remarkable progress in single-character scenarios. However, extending these advances to multi-character settings is non-trivial, especially when position swaps are involved. Beyond mere scaling, the core challenge lies in enforcing correct Identity Correspondence (IC) between characters in reference and generated frames. To address this, we introduce EverybodyDance, a systematic solution targeting IC correctness in multi-character animation. EverybodyDance is built around the Identity Matching Graph (IMG), which models characters in the generated and reference frames as two node sets in a weighted complete bipartite graph. Edge weights, computed via our proposed Mask-Query Attention (MQA), quantify the affinity between each pair of characters. Our key insight is to formalize IC correctness as a graph structural metric and to optimize it during training. We also propose a series of targeted strategies tailored for multi-character animation, including identity-embedded guidance, a multi-scale matching strategy, and pre-classified sampling, which work synergistically. Finally, to evaluate IC performance, we curate the Identity Correspondence Evaluation benchmark, dedicated to multi-character IC correctness. Extensive experiments demonstrate that EverybodyDance substantially outperforms state-of-the-art baselines in both IC and visual fidelity.
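
To make the bipartite-graph view concrete, the sketch below treats a small affinity matrix as the edge weights between generated and reference characters and checks whether the maximum-weight perfect matching pairs each character with itself. This is only one plausible reading of "IC correctness as a graph structural metric"; the abstract does not define the metric, and the affinity values, function names, and the convention that index i denotes the same identity on both sides are assumptions.

```python
from itertools import permutations


def best_matching(affinity):
    """Maximum-weight perfect matching by brute force (fine for a few characters).

    affinity[i][j] is the edge weight between generated character i and
    reference character j. Returns a tuple m with m[i] = matched reference index.
    """
    n = len(affinity)
    return max(
        permutations(range(n)),
        key=lambda perm: sum(affinity[i][perm[i]] for i in range(n)),
    )


def identity_preserved(affinity):
    """True when every generated character is matched to its own reference."""
    return best_matching(affinity) == tuple(range(len(affinity)))


# Hypothetical affinities for two characters.
consistent = [[0.9, 0.1],
              [0.2, 0.8]]
swapped = [[0.1, 0.9],
           [0.8, 0.2]]

print(identity_preserved(consistent))
print(identity_preserved(swapped))
```

Brute force over permutations keeps the example dependency-free; for many characters an assignment solver such as the Hungarian algorithm would be the usual choice.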

[74] Factorized Video Generation: Decoupling Scene Construction and Temporal Synthesis in Text-to-Video Diffusion Models

Mariam Hassan,Bastien Van Delft,Wuyang Li,Alexandre Alahi

Main category: cs.CV

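TL;DR: This paper proposes Factorized Video Generation (FVG), which splits text-to-video generation into three stages (reasoning, composition, and temporal synthesis) to improve the logical consistency and efficiency of generated videos.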

Details

Motivation: Existing text-to-video models perform poorly at composing complex scenes and following temporal logic, largely because of semantic errors in the initial frame. Method: Factorized Video Generation (FVG) uses a large language model to rewrite the prompt into an initial scene description, a text-to-image model to generate a high-quality anchor frame, and a video model to animate the scene from that anchor frame. Result: FVG reaches the state of the art on T2V CompBench, significantly improves every tested model on VBench2, and can cut sampling steps by 70% without loss of performance. Conclusion: FVG offers a more efficient, robust, and controllable path to video generation. Abstract: State-of-the-art Text-to-Video (T2V) diffusion models can generate visually impressive results, yet they still frequently fail to compose complex scenes or follow logical temporal instructions. In this paper, we argue that many errors, including apparent motion failures, originate from the model's inability to construct a semantically correct or logically consistent initial frame. We introduce Factorized Video Generation (FVG), a pipeline that decouples these tasks by decomposing the Text-to-Video generation into three specialized stages: (1) Reasoning, where a Large Language Model (LLM) rewrites the video prompt to describe only the initial scene, resolving temporal ambiguities; (2) Composition, where a Text-to-Image (T2I) model synthesizes a high-quality, compositionally-correct anchor frame from this new prompt; and (3) Temporal Synthesis, where a video model, finetuned to understand this anchor, focuses its entire capacity on animating the scene and following the prompt. Our decomposed approach sets a new state-of-the-art on the T2V CompBench benchmark and significantly improves all tested models on VBench2. Furthermore, we show that visual anchoring allows us to cut the number of sampling steps by 70% without any loss in performance, leading to a substantial speed-up in sampling. Factorized Video Generation offers a simple yet practical path toward more efficient, robust, and controllable video synthesis.

[75] Adaptive Frequency Domain Alignment Network for Medical image segmentation

Zhanwei Li,Liang Li,Jiawan Zhang

Main category: cs.CV

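TL;DR: This paper proposes AFDAN, an adaptive frequency domain alignment network that addresses the scarcity of annotated data in medical image segmentation by aligning features in the frequency domain for cross-domain knowledge transfer, and outperforms existing methods on two datasets.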

Details

Motivation: High-quality annotated data is essential for medical image segmentation but is costly and time-consuming to obtain, which makes it scarce; effective domain adaptation methods are therefore needed. Method: The AFDAN framework has three core modules: an Adversarial Domain Learning Module that transfers features from the source to the target domain, a Source-Target Frequency Fusion Module that blends frequency representations across domains, and a Spatial-Frequency Integration Module that combines frequency and spatial features to improve segmentation accuracy. Result: AFDAN reaches an IoU of 90.9% on the newly constructed VITILIGO2025 dataset and 82.6% on the DRIVE retinal vessel segmentation dataset, both better than existing state-of-the-art methods. Conclusion: By aligning features in the frequency domain, AFDAN alleviates data scarcity in medical image segmentation and achieves strong cross-domain segmentation performance, with good practical potential. Abstract: High-quality annotated data plays a crucial role in achieving accurate segmentation. However, such data for medical image segmentation are often scarce due to the time-consuming and labor-intensive nature of manual annotation. To address this challenge, we propose the Adaptive Frequency Domain Alignment Network (AFDAN)--a novel domain adaptation framework designed to align features in the frequency domain and alleviate data scarcity. AFDAN integrates three core components to enable robust cross-domain knowledge transfer: an Adversarial Domain Learning Module that transfers features from the source to the target domain; a Source-Target Frequency Fusion Module that blends frequency representations across domains; and a Spatial-Frequency Integration Module that combines both frequency and spatial features to further enhance segmentation accuracy across domains. Extensive experiments demonstrate the effectiveness of AFDAN: it achieves an Intersection over Union (IoU) of 90.9% for vitiligo segmentation in the newly constructed VITILIGO2025 dataset and a competitive IoU of 82.6% on the retinal vessel segmentation benchmark DRIVE, surpassing existing state-of-the-art approaches.
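
For readers unfamiliar with the reported metric, Intersection over Union for a binary segmentation mask can be computed as in this minimal sketch. The flat 0/1 list representation, the toy masks, and the convention of returning 1.0 when both masks are empty are assumptions of the example, not details taken from the paper.

```python
def iou(pred_mask, true_mask):
    """Intersection over Union of two equally sized binary masks (flat 0/1 lists)."""
    pairs = list(zip(pred_mask, true_mask))
    intersection = sum(1 for p, t in pairs if p and t)
    union = sum(1 for p, t in pairs if p or t)
    # Convention assumed here: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


pred = [1, 1, 0, 0]
truth = [1, 0, 1, 0]
print(f"IoU = {iou(pred, truth):.3f}")
```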

[76] Using Gaussian Splats to Create High-Fidelity Facial Geometry and Texture

Haodi He,Jihun Yu,Ronald Fedkiw

Main category: cs.CV

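TL;DR: This paper proposes a unified framework based on Gaussian Splatting that reconstructs a 3D face with high visual fidelity and accurate geometry from a small number of uncalibrated face images, and that supports texture disentanglement and use in a standard graphics pipeline.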

Details

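Motivation: Existing 3D face reconstruction methods usually depend on large amounts of video data or complex lighting assumptions, and struggle to produce results from a few images that are both geometrically accurate and usable in a standard graphics pipeline. This work aims to use emerging 3D neural representations to achieve high-quality, structured reconstruction from few images that integrates into conventional rendering workflows. Method: Gaussian Splatting is used as the base representation because it is more explicit than NeRF and easier to constrain; segmentation annotations align the semantic regions of the face so that a neutral pose can be reconstructed from only 11 images; soft constraints bind the Gaussians to a triangulated surface, giving a more structured reconstruction that in turn refines the mesh geometry; the Gaussians are then converted into a view-dependent neural texture, and a relightable model separates texture from lighting to obtain a delit, high-resolution albedo texture. Result: High-quality 3D face reconstruction from only 11 uncalibrated images; the resulting triangle mesh can be used directly in a standard graphics pipeline; the Gaussians are converted into an editable neural texture with a high-fidelity, delit texture output; the system accepts heterogeneous images captured under different lighting and regularizes well. Conclusion: The paper demonstrates the potential of Gaussian Splatting for structured 3D face reconstruction and texture modeling from few, unconstrained images. The method improves reconstruction accuracy and structure, integrates with existing graphics pipelines, and advances the use of neural rendering in practical content creation. Abstract: We leverage increasingly popular three-dimensional neural representations in order to construct a unified and consistent explanation of a collection of uncalibrated images of the human face. Our approach utilizes Gaussian Splatting, since it is more explicit and thus more amenable to constraints than NeRFs. We leverage segmentation annotations to align the semantic regions of the face, facilitating the reconstruction of a neutral pose from only 11 images (as opposed to requiring a long video). We soft constrain the Gaussians to an underlying triangulated surface in order to provide a more structured Gaussian Splat reconstruction, which in turn informs subsequent perturbations to increase the accuracy of the underlying triangulated surface. The resulting triangulated surface can then be used in a standard graphics pipeline. In addition, and perhaps most impactful, we show how accurate geometry enables the Gaussian Splats to be transformed into texture space where they can be treated as a view-dependent neural texture. This allows one to use high visual fidelity Gaussian Splatting on any asset in a scene without the need to modify any other asset or any other aspect (geometry, lighting, renderer, etc.) of the graphics pipeline. We utilize a relightable Gaussian model to disentangle texture from lighting in order to obtain a delit high-resolution albedo texture that is also readily usable in a standard graphics pipeline.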
The flexibility of our system allows for training with disparate images, even with incompatible lighting, facilitating robust regularization. Finally, we demonstrate the efficacy of our approach by illustrating its use in a text-driven asset creation pipeline.

[77] BrepLLM: Native Boundary Representation Understanding with Large Language Models

Liyuan Deng,Hao Guo,Yunpeng Bai,Yongkang Dai,Huaxi Huang,Yilei Shi

Main category: cs.CV

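TL;DR: BrepLLM is the first framework that enables large language models to parse and reason over raw Brep data. A two-stage training pipeline aligns 3D geometry with natural language across modalities, reaching SOTA on classification and captioning tasks.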

Details

Motivation: Existing token-sequence-based large language models struggle to process 3D Brep models directly, since these contain complex geometric and topological information, and there is no effective way to connect structured 3D geometry with natural language. Method: BrepLLM uses two-stage training. In the first stage, adaptive UV sampling converts Breps into graph representations, a hierarchical BrepEncoder extracts geometric and topological features, and contrastive learning aligns the global feature with CLIP text embeddings. In the second stage, the BrepEncoder is integrated into an LLM and the node token sequence is aligned through three-stage progressive training (MLP semantic mapping, LLM fine-tuning, and MQE for geometric diversity). Result: SOTA results on 3D object classification and captioning; the Brep2Text dataset with 269,444 question-answer pairs was built for the experiments. Conclusion: BrepLLM bridges the modality gap between 3D Brep data and large language models, enabling effective parsing of and reasoning over raw Brep data. Abstract: Current token-sequence-based Large Language Models (LLMs) are not well-suited for directly processing 3D Boundary Representation (Brep) models that contain complex geometric and topological information. We propose BrepLLM, the first framework that enables LLMs to parse and reason over raw Brep data, bridging the modality gap between structured 3D geometry and natural language. BrepLLM employs a two-stage training pipeline: Cross-modal Alignment Pre-training and Multi-stage LLM Fine-tuning. In the first stage, an adaptive UV sampling strategy converts Breps into graph representations with geometric and topological information. We then design a hierarchical BrepEncoder to extract features from geometry (i.e., faces and edges) and topology, producing both a single global token and a sequence of node tokens. Then we align the global token with text embeddings from a frozen CLIP text encoder (ViT-L/14) via contrastive learning. In the second stage, we integrate the pretrained BrepEncoder into an LLM. We then align its sequence of node tokens using a three-stage progressive training strategy: (1) training an MLP-based semantic mapping from Brep representation to 2D with 2D-LLM priors; (2) fine-tuning the LLM; and (3) designing a Mixture-of-Query Experts (MQE) to enhance geometric diversity modeling. We also construct Brep2Text, a dataset comprising 269,444 Brep-text question-answer pairs. Experiments show that BrepLLM achieves state-of-the-art (SOTA) results on 3D object classification and captioning tasks.

[78] CountZES: Counting via Zero-Shot Exemplar Selection

Muhammad Ibraheem Siddiqui,Muhammad Haris Khan

Main category: cs.CV

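TL;DR: Proposes CountZES, a training-free zero-shot object counting framework that selects precise and diverse exemplars through three synergistic stages, substantially improving counting of unseen categories in complex scenes.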

Details

Motivation: Existing zero-shot object counting methods handle unseen categories either with open-vocabulary detectors, which often yield multi-instance candidates, or with random sampling, which fails to delineate object instances accurately; a more precise way to extract single-instance exemplars is needed. Method: CountZES has three stages: Detection-Anchored Exemplar (DAE), Density-Guided Exemplar (DGE), and Feature-Consensus Exemplar (FCE). DAE refines open-vocabulary detections to obtain precise single-instance exemplars; DGE selects semantically compact exemplars in a density-driven, self-supervised way; FCE strengthens visual coherence through feature-space clustering. Result: CountZES is validated on natural, aerial, and medical image datasets, where it outperforms existing zero-shot object counting methods and generalizes well across domains. Conclusion: Through its three cooperating stages, CountZES achieves high-quality, training-free exemplar selection that balances textual grounding, count consistency, and feature representativeness, offering an effective solution for zero-shot object counting. Abstract: Object counting in complex scenes remains challenging, particularly in the zero-shot setting, where the goal is to count instances of unseen categories specified only by a class name. Existing zero-shot object counting (ZOC) methods that infer exemplars from text either rely on open-vocabulary detectors, which often yield multi-instance candidates, or on random patch sampling, which fails to accurately delineate object instances. To address this, we propose CountZES, a training-free framework for object counting via zero-shot exemplar selection. CountZES progressively discovers diverse exemplars through three synergistic stages: Detection-Anchored Exemplar (DAE), Density-Guided Exemplar (DGE), and Feature-Consensus Exemplar (FCE). DAE refines open-vocabulary detections to isolate precise single-instance exemplars. DGE introduces a density-driven, self-supervised paradigm to identify statistically consistent and semantically compact exemplars, while FCE reinforces visual coherence through feature-space clustering. Together, these stages yield a diverse, complementary exemplar set that balances textual grounding, count consistency, and feature representativeness. Experiments on diverse datasets demonstrate CountZES's superior performance among ZOC methods and its effective generalization across natural, aerial, and medical domains.

[79] Geometric Disentanglement of Text Embeddings for Subject-Consistent Text-to-Image Generation using A Single Prompt

Shangxun Li,Youngjung Uh

Main category: cs.CV

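TL;DR: Proposes a simple, effective, training-free method that refines text embeddings from a geometric perspective to suppress unwanted semantics, improving subject consistency and text alignment in text-to-image generation.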

Details

Motivation: Existing text-to-image generation methods preserve subject consistency poorly and often require costly fine-tuning or image conditioning, while existing training-free methods suffer from semantic leakage. Method: Taking a geometric view, the method refines text embeddings to suppress irrelevant semantics across frames, which reduces semantic entanglement and enables training-free consistent generation. Result: Experiments show that the method clearly outperforms existing baselines on multiple benchmarks in both subject consistency and text alignment. Conclusion: The proposed method is an efficient, training-free solution to semantic leakage and subject inconsistency in text-to-image generation. Abstract: Text-to-image diffusion models excel at generating high-quality images from natural language descriptions but often fail to preserve subject consistency across multiple outputs, limiting their use in visual storytelling. Existing approaches rely on model fine-tuning or image conditioning, which are computationally expensive and require per-subject optimization. 1Prompt1Story, a training-free approach, concatenates all scene descriptions into a single prompt and rescales token embeddings, but it suffers from semantic leakage, where embeddings across frames become entangled, causing text misalignment. In this paper, we propose a simple yet effective training-free approach that addresses semantic entanglement from a geometric perspective by refining text embeddings to suppress unwanted semantics. Extensive experiments demonstrate that our approach significantly improves both subject consistency and text alignment over existing baselines.

[80] Prime and Reach: Synthesising Body Motion for Gaze-Primed Object Reach

Masashi Hatano,Saptarshi Sinha,Jacob Chalk,Wei-Hong Li,Hideo Saito,Dima Damen

Main category: cs.CV

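TL;DR: This paper proposes a human motion generation method based on a text-conditioned diffusion model that focuses on gaze priming before picking up or putting down an object. A dataset of 23.7K gaze-primed motion sequences is curated for training and evaluation, and the model reaches 60% prime success and 89% reach success on HD-EPIC.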

Details

Motivation: The goal is to generate more natural, realistic human motion, specifically the behaviour in pick-up and put-down tasks of spotting a target from a distance (gaze priming) and then approaching and reaching the target location. Method: Five public datasets (HD-EPIC, MoGaze, HOT3D, ADT, GIMO) are combined into a new dataset of 23.7K gaze-primed motion sequences; a text-conditioned diffusion model is pre-trained and then fine-tuned conditioned on goal pose or location to generate motion sequences. Result: On HD-EPIC, conditioned on the goal object location, the model reaches 60% Prime Success and 89% Reach Success; a new 'Prime Success' metric is introduced to measure how natural the gaze priming is. Conclusion: The method generates human motion with natural gaze priming and performs well in realism and task completion, supporting the value of high-level cues such as gaze for improving generated motion. Abstract: Human motion generation is a challenging task that aims to create realistic motion imitating natural human behaviour. We focus on the well-studied behaviour of priming an object/location for pick up or put down -- that is, the spotting of an object/location from a distance, known as gaze priming, followed by the motion of approaching and reaching the target location. To that end, we curate, for the first time, 23.7K gaze-primed human motion sequences for reaching target object locations from five publicly available datasets, i.e., HD-EPIC, MoGaze, HOT3D, ADT, and GIMO. We pre-train a text-conditioned diffusion-based motion generation model, then fine-tune it conditioned on goal pose or location, on our curated sequences. Importantly, we evaluate the ability of the generated motion to imitate natural human movement through several metrics, including the 'Reach Success' and a newly introduced 'Prime Success' metric. On the largest dataset, HD-EPIC, our model achieves 60% prime success and 89% reach success when conditioned on the goal object location.

[81] SNOW: Spatio-Temporal Scene Understanding with World Knowledge for Open-World Embodied Reasoning

Tin Stribor Sohn,Maximilian Dillitzer,Jason J. Corso,Eric Sax

Main category: cs.CV

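TL;DR: SNOW is a training-free, backbone-agnostic framework that fuses vision-language model semantics with point cloud geometry and temporal consistency for unified 4D scene understanding, supporting precise spatially grounded reasoning and autonomous navigation.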

Details

Motivation: Existing vision-language models lack grounding in 3D geometry and temporal dynamics, while geometry-aware methods are semantically sparse, so neither meets the spatio-temporal understanding needs of autonomous robots in dynamic environments. Method: SNOW combines synchronized RGB images and 3D point clouds, using HDBSCAN clustering to generate object-level proposals that guide SAM2 segmentation; it proposes Spatio-Temporal Tokenized Patch Encoding (STEP) to encode local semantic, geometric, and temporal features and incrementally builds a 4D Scene Graph (4DSG); a lightweight SLAM backend provides spatial anchoring and global alignment. Result: State-of-the-art performance on several benchmarks, validating SNOW for 4D scene understanding and spatially grounded reasoning. Conclusion: Structured 4D priors such as the 4DSG are essential for embodied reasoning and autonomous robotic systems, and SNOW offers a new paradigm for training-free, open-world 4D scene understanding. Abstract: Autonomous robotic systems require spatio-temporal understanding of dynamic environments to ensure reliable navigation and interaction. While Vision-Language Models (VLMs) provide open-world semantic priors, they lack grounding in 3D geometry and temporal dynamics. Conversely, geometric perception captures structure and motion but remains semantically sparse. We propose SNOW (Scene Understanding with Open-World Knowledge), a training-free and backbone-agnostic framework for unified 4D scene understanding that integrates VLM-derived semantics with point cloud geometry and temporal consistency. SNOW processes synchronized RGB images and 3D point clouds, using HDBSCAN clustering to generate object-level proposals that guide SAM2-based segmentation. Each segmented region is encoded through our proposed Spatio-Temporal Tokenized Patch Encoding (STEP), producing multimodal tokens that capture localized semantic, geometric, and temporal attributes. These tokens are incrementally integrated into a 4D Scene Graph (4DSG), which serves as a 4D prior for downstream reasoning. A lightweight SLAM backend anchors all STEP tokens spatially in the environment, providing the global reference alignment and ensuring unambiguous spatial grounding across time. The resulting 4DSG forms a queryable, unified world model through which VLMs can directly interpret spatial scene structure and temporal dynamics.
Experiments on a diverse set of benchmarks demonstrate that SNOW enables precise 4D scene understanding and spatially grounded inference, thereby setting new state-of-the-art performance in several settings, highlighting the importance of structured 4D priors for embodied reasoning and autonomous robotics.

[82] StageVAR: Stage-Aware Acceleration for Visual Autoregressive Models

Senmao Li,Kai Wang,Salman Khan,Fahad Shahbaz Khan,Jian Yang,Yaxing Wang

Main category: cs.CV

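TL;DR: Proposes StageVAR, a stage-aware acceleration framework for visual autoregressive models that achieves up to 3.4x speedup while preserving generation quality.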

Details

Motivation: Conventional VAR models have high computational complexity at large-scale steps, and existing acceleration methods rely on manual settings and ignore the differing importance of stages in the generation process. Method: Analysis shows that early steps are critical for semantic and structural consistency while later steps mainly refine details; on this basis, a training-free, plug-and-play acceleration strategy prunes or approximates late-stage computation by exploiting its semantic irrelevance and low-rank properties. Result: Up to 3.4x speedup, with GenEval dropping by only 0.01 and DPG by 0.26, better than existing acceleration methods. Conclusion: Stage-aware design is an effective principle for efficient visual autoregressive image generation. Abstract: Visual Autoregressive (VAR) modeling departs from the next-token prediction paradigm of traditional Autoregressive (AR) models through next-scale prediction, enabling high-quality image generation. However, the VAR paradigm suffers from sharply increased computational complexity and running time at large-scale steps. Although existing acceleration methods reduce runtime for large-scale steps, they rely on manual step selection and overlook the varying importance of different stages in the generation process. To address this challenge, we present StageVAR, a systematic study and stage-aware acceleration framework for VAR models. Our analysis shows that early steps are critical for preserving semantic and structural consistency and should remain intact, while later steps mainly refine details and can be pruned or approximated for acceleration. Building on these insights, StageVAR introduces a plug-and-play acceleration strategy that exploits semantic irrelevance and low-rank properties in late-stage computations, without requiring additional training. Our proposed StageVAR achieves up to 3.4x speedup with only a 0.01 drop on GenEval and a 0.26 decrease on DPG, consistently outperforming existing acceleration baselines. These results highlight stage-aware design as a powerful principle for efficient visual autoregressive image generation.

Yuan Li,Yahan Yu,Youyuan Lin,Yong-Hao Yang,Chenhui Chu,Shin'ya Nishida

Main category: cs.CV

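TL;DR: This paper proposes a reinforcement-learning-based method for blind image quality assessment (BIQA) that uses human annotations as reward signals so that the model acquires both human-like perception and self-consistent reasoning.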

Details

Motivation: The aim is for a BIQA model not only to predict an image quality score but also to produce a perception-reasoning process that resembles human reasoning and is logically self-consistent. Method: Human evaluation data are collected and a reinforcement learning framework is adopted, with a reward that requires the model to infer quality from its own generated descriptions, guiding it to learn the human perception-reasoning chain. Result: Score prediction is on par with the state of the art on standard metrics, and ROUGE-1 reaches 0.512 (baseline 0.443), a clear gain in agreement between model and human explanations. Conclusion: The method advances interpretable BIQA, producing models that score accurately and also generate human-like, self-consistent reasoning. Abstract: Humans assess image quality through a perception-reasoning cascade, integrating sensory cues with implicit reasoning to form self-consistent judgments. In this work, we investigate how a model can acquire both human-like and self-consistent reasoning capability for blind image quality assessment (BIQA). We first collect human evaluation data that capture several aspects of the human perception-reasoning pipeline. Then, we adopt reinforcement learning, using human annotations as reward signals to guide the model toward human-like perception and reasoning. To enable the model to internalize self-consistent reasoning capability, we design a reward that drives the model to infer the image quality purely from self-generated descriptions. Empirically, our approach achieves score prediction performance comparable to state-of-the-art BIQA systems under general metrics, including Pearson and Spearman correlation coefficients. In addition to the rating score, we assess human-model alignment using ROUGE-1 to measure the similarity between model-generated and human perception-reasoning chains. On over 1,000 human-annotated samples, our model reaches a ROUGE-1 score of 0.512 (cf. 0.443 for baseline), indicating substantial coverage of human explanations and marking a step toward human-like interpretable reasoning in BIQA.
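
ROUGE-1, the alignment measure cited above, is based on clipped unigram overlap between a generated text and a reference. The sketch below computes precision, recall, and F1 with plain whitespace tokenization; the abstract does not say which variant or tokenizer the authors report, so those choices, the function name, and the example sentences are assumptions.

```python
from collections import Counter


def rouge1(candidate, reference):
    """Return (precision, recall, f1) from clipped unigram overlap."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per word (clipping).
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical human and model explanations.
human = "the image is slightly blurry"
model = "the image looks blurry"
p, r, f = rouge1(model, human)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Recall reflects how much of the human explanation the model covers, which matches the "coverage of human explanations" reading in the abstract.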

[84] Smile on the Face, Sadness in the Eyes: Bridging the Emotion Gap with a Multimodal Dataset of Eye and Facial Behaviors

Kejun Liu,Yuanyuan Liu,Lin Wei,Chang Tang,Yibing Zhan,Zijing Chen,Zhe Chen

Main category: cs.CV

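TL;DR: This paper presents an Eye-behavior-aided Multimodal Emotion Recognition (EMER) dataset and a new EMERT model, which introduce eye behaviors as a complementary cue to bridge the gap between facial expression recognition and genuine emotion recognition.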

Details

Motivation: Facial expressions are often used as social tools rather than as reflections of genuine emotion, so emotion recognition based on facial expressions is limited; more reliable cues such as eye behaviors are needed to make emotion recognition robust. Method: The EMER dataset is built under a spontaneous emotion induction paradigm and contains eye behavior data (such as eye movement sequences and fixation maps) together with facial expression videos; the EMERT model fuses eye behaviors and facial expressions using modality-adversarial feature decoupling and a multitask Transformer. Result: Experiments show that EMERT clearly outperforms existing multimodal methods under all seven benchmark protocols, confirming that eye behaviors are an important complement for emotion recognition. Conclusion: Eye behaviors are key to closing the gap between facial expression recognition and genuine emotion recognition, and the EMER dataset and EMERT model offer an effective route to more robust emotion recognition. Abstract: Emotion Recognition (ER) is the process of analyzing and identifying human emotions from sensing data. Currently, the field heavily relies on facial expression recognition (FER) because the visual channel conveys rich emotional cues. However, facial expressions are often used as social tools rather than manifestations of genuine inner emotions. To understand and bridge this gap between FER and ER, we introduce eye behaviors as an important emotional cue and construct an Eye-behavior-aided Multimodal Emotion Recognition (EMER) dataset. To collect data with genuine emotions, a spontaneous emotion induction paradigm with stimulus material is used, during which non-invasive eye behavior data, like eye movement sequences and eye fixation maps, is captured together with facial expression videos. To better illustrate the gap between ER and FER, multi-view emotion labels for multimodal ER and FER are separately annotated. Furthermore, based on the new dataset, we design a simple yet effective Eye-behavior-aided MER Transformer (EMERT) that enhances ER by bridging the emotion gap. EMERT leverages modality-adversarial feature decoupling and a multitask Transformer to model eye behaviors as a strong complement to facial expressions. In the experiment, we introduce seven multimodal benchmark protocols for a variety of comprehensive evaluations of the EMER dataset. The results show that EMERT outperforms other state-of-the-art multimodal methods by a large margin, revealing the importance of modeling eye behaviors for robust ER.
To sum up, we provide a comprehensive analysis of the importance of eye behaviors in ER, advancing the study on addressing the gap between FER and ER for more robust ER performance. Our EMER dataset and the trained EMERT models will be publicly available at https://github.com/kejun1/EMER.

[85] YOLO11-4K: An Efficient Architecture for Real-Time Small Object Detection in 4K Panoramic Images

Huma Hafeez,Matthew Garratt,Jo Plested,Sankaran Iyer,Arcot Sowmya

Main category: cs.CV

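TL;DR: This paper proposes YOLO11-4K, an efficient real-time object detection framework for 4K panoramic images. A multi-scale detection head and a GhostConv backbone keep accuracy high while sharply reducing latency, and a newly annotated dataset, CVIP360, is released for evaluation.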

Details

Motivation: Existing object detectors such as YOLO are optimized for standard-resolution images and struggle with the computational load and missed small objects that come with the high resolution, wide field of view, and spatial distortion of 360-degree panoramic images. Method: The YOLO11-4K framework uses a multi-scale detection head with a P2 layer to improve sensitivity to small objects and GhostConv to reduce computational complexity; the CVIP360 dataset is annotated and released, containing 6,876 frame-level bounding boxes for 4K panoramic scenes. Result: YOLO11-4K reaches 0.95 mAP at 0.50 IoU with 28.3 milliseconds of inference per frame, a 75% latency reduction relative to YOLO11 (112.3 milliseconds), with higher accuracy (mAP rising from 0.908 to 0.95). Conclusion: YOLO11-4K balances efficiency and accuracy well and suits high-resolution panoramic vision tasks such as autonomous driving, surveillance, and augmented reality; the approach is broadly applicable. Abstract: The processing of omnidirectional 360-degree images poses significant challenges for object detection due to inherent spatial distortions, wide fields of view, and ultra-high-resolution inputs. Conventional detectors such as YOLO are optimised for standard image sizes (for example, 640x640 pixels) and often struggle with the computational demands of 4K or higher-resolution imagery typical of 360-degree vision. To address these limitations, we introduce YOLO11-4K, an efficient real-time detection framework tailored for 4K panoramic images. The architecture incorporates a novel multi-scale detection head with a P2 layer to improve sensitivity to small objects often missed at coarser scales, and a GhostConv-based backbone to reduce computational complexity without sacrificing representational power. To enable evaluation, we manually annotated the CVIP360 dataset, generating 6,876 frame-level bounding boxes and producing a publicly available, detection-ready benchmark for 4K panoramic scenes. YOLO11-4K achieves 0.95 mAP at 0.50 IoU with 28.3 milliseconds inference per frame, representing a 75 percent latency reduction compared to YOLO11 (112.3 milliseconds), while also improving accuracy (mAP at 0.50 of 0.95 versus 0.908). This balance of efficiency and precision enables robust object detection in expansive 360-degree environments, making the framework suitable for real-world high-resolution panoramic applications. While this work focuses on 4K omnidirectional images, the approach is broadly applicable to high-resolution detection tasks in autonomous navigation, surveillance, and augmented reality.
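
The reported latency figures can be checked with a few lines of arithmetic. The sketch below uses only the two per-frame timings quoted in the abstract; the helper names are hypothetical, and the frames-per-second figure is a derived estimate that assumes frames are processed one at a time with no other overhead.

```python
def latency_reduction(baseline_ms, new_ms):
    """Fractional reduction in per-frame latency relative to the baseline."""
    return (baseline_ms - new_ms) / baseline_ms


def frames_per_second(latency_ms):
    """Throughput implied by a per-frame latency, ignoring any other overhead."""
    return 1000.0 / latency_ms


reduction = latency_reduction(baseline_ms=112.3, new_ms=28.3)
fps = frames_per_second(28.3)
print(f"latency reduction: {reduction * 100:.1f}%")
print(f"implied throughput: {fps:.1f} frames per second")
```

The reduction works out to just under 75 percent, consistent with the rounded figure in the abstract.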

[86] PoseMoE: Mixture-of-Experts Network for Monocular 3D Human Pose Estimation

Mengyuan Liu,Jiajie Liu,Jinyan Zhang,Wenhao Li,Junsong Yuan

Main category: cs.CV

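TL;DR: The paper proposes PoseMoE, a mixture-of-experts network for monocular 3D human pose estimation that improves accuracy by disentangling the encoding of 2D pose features from that of depth features.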

Details

Motivation: Existing lifting-based methods encode the 2D pose and the unknown depth in an entangled feature space, so depth uncertainty degrades 2D pose accuracy and limits overall performance. Method: A mixture-of-experts network (PoseMoE) with expert modules dedicated to refining 2D pose features and to learning depth features, plus a cross-expert knowledge aggregation module that fuses bidirectional spatio-temporal context between 2D pose and depth. Result: Experiments show that PoseMoE outperforms conventional lifting-based methods on three widely used datasets: Human3.6M, MPI-INF-3DHP, and 3DPW. Conclusion: Disentangling the encoding of 2D pose and depth features and then optimising them jointly reduces the propagation of uncertainty and improves monocular 3D human pose estimation. Abstract: Lifting-based methods have dominated monocular 3D human pose estimation by leveraging detected 2D poses as intermediate representations. The 2D component of the final 3D human pose benefits from the detected 2D poses, whereas its depth counterpart must be estimated from scratch. Lifting-based methods encode the detected 2D pose and unknown depth in an entangled feature space, explicitly introducing depth uncertainty to the detected 2D pose, thereby limiting overall estimation accuracy. This work reveals that the depth representation is pivotal for the estimation process. Specifically, when depth is in an initial, completely unknown state, jointly encoding depth features with 2D pose features is detrimental to the estimation process. In contrast, when depth is initially refined to a more dependable state via network-based estimation, encoding it together with 2D pose information is beneficial. To address this limitation, we present a Mixture-of-Experts network for monocular 3D pose estimation named PoseMoE. Our approach introduces: (1) A mixture-of-experts network where specialized expert modules refine the well-detected 2D pose features and learn the depth features. This mixture-of-experts design disentangles the feature encoding process for 2D pose and depth, therefore reducing the explicit influence of uncertain depth features on 2D pose features. (2) A cross-expert knowledge aggregation module that aggregates cross-expert spatio-temporal contextual information. This step enhances features through bidirectional mapping between 2D pose and depth. Extensive experiments show that our proposed PoseMoE outperforms the conventional lifting-based methods on three widely used datasets: Human3.6M, MPI-INF-3DHP, and 3DPW.

[87] VenusBench-GD: A Comprehensive Multi-Platform GUI Benchmark for Diverse Grounding Tasks

Beitong Zhou,Zhexiao Huang,Yuan Guo,Zhangxuan Gu,Tianyu Xia,Zichen Luo,Fei Tang,Dehan Kong,Yanyi Shang,Suling Ou,Zhenlin Guo,Changhua Meng,Shuheng Shen

Main category: cs.CV

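TL;DR: The paper presents VenusBench-GD, a large-scale, cross-platform, bilingual benchmark for GUI element grounding that uses a hierarchical task taxonomy and a high-quality annotation pipeline to evaluate multimodal models on both basic and advanced grounding tasks.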

Details

Motivation: Existing GUI grounding benchmarks offer too little data, cover narrow domains, or depend too heavily on a single platform, and no comprehensive evaluation framework exists for realistic scenarios. Method: The authors build a large-scale bilingual dataset spanning multiple platforms and applications, design a high-quality annotation pipeline, and propose a hierarchical task taxonomy that divides grounding into basic and advanced categories with six subtasks in total. Result: General-purpose multimodal models now match or even surpass specialized GUI models on basic tasks; advanced tasks still favor GUI-specialized models, although those models show significant overfitting and poor robustness. Conclusion: Multi-tiered, comprehensive evaluation frameworks are needed to advance GUI agents, and VenusBench-GD provides a more complete and realistic testbed for future models. Abstract: GUI grounding is a critical component in building capable GUI agents. However, existing grounding benchmarks suffer from significant limitations: they either provide insufficient data volume and narrow domain coverage, or focus excessively on a single platform and require highly specialized domain knowledge. In this work, we present VenusBench-GD, a comprehensive, bilingual benchmark for GUI grounding that spans multiple platforms, enabling hierarchical evaluation for real-world applications. VenusBench-GD contributes as follows: (i) we introduce a large-scale, cross-platform benchmark with extensive coverage of applications, diverse UI elements, and rich annotated data, (ii) we establish a high-quality data construction pipeline for grounding tasks, achieving higher annotation accuracy than existing benchmarks, and (iii) we extend the scope of element grounding by proposing a hierarchical task taxonomy that divides grounding into basic and advanced categories, encompassing six distinct subtasks designed to evaluate models from complementary perspectives. Our experimental findings reveal critical insights: general-purpose multimodal models now match or even surpass specialized GUI models on basic grounding tasks. In contrast, advanced tasks still favor GUI-specialized models, though they exhibit significant overfitting and poor robustness. These results underscore the necessity of comprehensive, multi-tiered evaluation frameworks.

[88] Skeleton-Snippet Contrastive Learning with Multiscale Feature Fusion for Action Localization

Qiushuo Cheng,Jingjing Liu,Catherine Morgan,Alan Whone,Majid Mirmehdi

Main category: cs.CV

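TL;DR: The paper proposes a self-supervised pretraining method for skeleton-based action localization that combines a snippet discrimination pretext task with a U-shaped module for fusing intermediate features, improving temporal action localization.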

Details

Motivation: Self-supervised learning has succeeded for skeleton-based action recognition but remains challenging for action localization, mainly because it does not learn frame-level, temporally sensitive features well. Method: A snippet discrimination pretext task divides skeleton sequences into non-overlapping snippets and uses contrastive learning to distinguish snippets across videos; a U-shaped module fuses intermediate backbone features to strengthen frame-level localization. Result: The approach consistently improves several contrastive learning methods for action localization on BABEL and achieves state-of-the-art transfer learning performance on PKUMMD with pretraining on NTU RGB+D and BABEL. Conclusion: The method strengthens the temporal representation of skeleton sequences and offers a new direction for self-supervised learning in temporal action localization. Abstract: The self-supervised pretraining paradigm has achieved great success in learning 3D action representations for skeleton-based action recognition using contrastive learning. However, learning effective representations for skeleton-based temporal action localization remains challenging and underexplored. Unlike video-level action recognition, detecting action boundaries requires temporally sensitive features that capture subtle differences between adjacent frames where labels change. To this end, we formulate a snippet discrimination pretext task for self-supervised pretraining, which densely projects skeleton sequences into non-overlapping segments and promotes features that distinguish them across videos via contrastive learning. Additionally, we build on strong backbones of skeleton-based action recognition models by fusing intermediate features with a U-shaped module to enhance feature resolution for frame-level localization. Our approach consistently improves existing skeleton-based contrastive learning methods for action localization on BABEL across diverse subsets and evaluation protocols. We also achieve state-of-the-art transfer learning performance on PKUMMD with pretraining on NTU RGB+D and BABEL.

[89] Multi-scale Attention-Guided Intrinsic Decomposition and Rendering Pass Prediction for Facial Images

Hossein Javidnia

Main category: cs.CV

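TL;DR: The paper proposes MAGINet, a multi-scale attention-guided network that performs intrinsic decomposition of a single face image into six rendering passes, including diffuse albedo, enabling high-quality face relighting and material editing.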

Details

Motivation: Accurately decomposing the intrinsic properties of face images under unconstrained lighting is essential for photorealistic relighting and augmented-reality applications, yet existing methods fall short on detail preservation and lighting invariance. Method: MAGINet uses hierarchical residual encoding, spatial-and-channel attention, and adaptive multi-scale feature fusion; a RefinementNet upsamples and refines the albedo map, and a Pix2PixHD-based translator generates the remaining five rendering passes. Result: Trained on the FFHQ-UV-Intrinsics dataset with a combination of losses, the pipeline achieves state-of-the-art diffuse albedo estimation and significantly better fidelity for the full rendering stack than prior methods. Conclusion: The method produces high-quality facial intrinsic decompositions that support realistic relighting and material editing. Abstract: Accurate intrinsic decomposition of face images under unconstrained lighting is a prerequisite for photorealistic relighting, high-fidelity digital doubles, and augmented-reality effects. This paper introduces MAGINet, a Multi-scale Attention-Guided Intrinsics Network that predicts a $512\times512$ light-normalized diffuse albedo map from a single RGB portrait. MAGINet employs hierarchical residual encoding, spatial-and-channel attention in a bottleneck, and adaptive multi-scale feature fusion in the decoder, yielding sharper albedo boundaries and stronger lighting invariance than prior U-Net variants. The initial albedo prediction is upsampled to $1024\times1024$ and refined by a lightweight three-layer CNN (RefinementNet). Conditioned on this refined albedo, a Pix2PixHD-based translator then predicts a comprehensive set of five additional physically based rendering passes: ambient occlusion, surface normal, specular reflectance, translucency, and raw diffuse colour (with residual lighting). Together with the refined albedo, these six passes form the complete intrinsic decomposition. Trained with a combination of masked-MSE, VGG, edge, and patch-LPIPS losses on the FFHQ-UV-Intrinsics dataset, the full pipeline achieves state-of-the-art performance for diffuse albedo estimation and demonstrates significantly improved fidelity for the complete rendering stack compared to prior methods. The resulting passes enable high-quality relighting and material editing of real faces.

[90] TTP: Test-Time Padding for Adversarial Detection and Robust Adaptation on Vision-Language Models

Zhiwei Li,Yitian Pang,Weining Wang,Zhenan Sun,Qi Li

Main category: cs.CV

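TL;DR: The paper proposes Test-Time Padding (TTP), a lightweight defense framework that makes vision-language models such as CLIP more robust to adversarial perturbations by detecting adversarial inputs and adapting to them selectively, improving adversarial robustness without sacrificing clean accuracy.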

Details

Motivation: Training-time defenses depend on labeled data and costly retraining, while test-time strategies cannot reliably separate clean from adversarial inputs, so robustness and accuracy cannot both be optimised. Method: TTP first detects adversarial samples from the change in cosine similarity between CLIP feature embeddings computed before and after spatial padding, using a universal threshold; for detected adversarial samples it applies trainable padding to restore attention patterns, combined with a similarity-aware ensemble strategy for more robust predictions; clean samples are left unchanged or optionally passed to existing test-time techniques for further accuracy gains. Result: Experiments on several CLIP backbones and fine-grained benchmarks show that TTP consistently outperforms state-of-the-art test-time defenses, substantially improving adversarial robustness without harming clean accuracy. Conclusion: TTP is an effective, lightweight test-time defense that provides reliable adversarial detection and adaptive response without retraining, preserving both robustness and original performance. Abstract: Vision-Language Models (VLMs), such as CLIP, have achieved impressive zero-shot recognition performance but remain highly susceptible to adversarial perturbations, posing significant risks in safety-critical scenarios. Previous training-time defenses rely on adversarial fine-tuning, which requires labeled data and costly retraining, while existing test-time strategies fail to reliably distinguish between clean and adversarial inputs, thereby preventing both adversarial robustness and clean accuracy from reaching their optimum. To address these limitations, we propose Test-Time Padding (TTP), a lightweight defense framework that performs adversarial detection followed by targeted adaptation at inference. TTP identifies adversarial inputs via the cosine similarity shift between CLIP feature embeddings computed before and after spatial padding, yielding a universal threshold for reliable detection across architectures and datasets. For detected adversarial cases, TTP employs trainable padding to restore disrupted attention patterns, coupled with a similarity-aware ensemble strategy for a more robust final prediction. For clean inputs, TTP leaves them unchanged by default or optionally integrates existing test-time adaptation techniques for further accuracy gains. Comprehensive experiments on diverse CLIP backbones and fine-grained benchmarks show that TTP consistently surpasses state-of-the-art test-time defenses, delivering substantial improvements in adversarial robustness without compromising clean accuracy. The code for this paper will be released soon.
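The detection step described above, comparing embeddings before and after spatial padding, might be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_embed` is a hypothetical stand-in for a CLIP image encoder, the pad width and threshold are placeholders, and the rule that a larger shift signals an adversarial input is an assumption the abstract does not confirm.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def padding_shift(image, embed, pad=16):
    """Return 1 - cos(embed(image), embed(zero-padded image))."""
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="constant")
    return 1.0 - cosine_similarity(embed(image), embed(padded))

def is_adversarial(image, embed, threshold, pad=16):
    # Assumption: a shift above the threshold marks the input as adversarial.
    return padding_shift(image, embed, pad) > threshold

def toy_embed(image):
    """Hypothetical encoder stand-in: per-channel means and standard deviations."""
    flat = image.reshape(-1, image.shape[-1])
    return np.concatenate([flat.mean(axis=0), flat.std(axis=0)])

image = np.random.default_rng(0).random((32, 32, 3))
print(padding_shift(image, toy_embed))
```

In practice `embed` would wrap a real image encoder, and the threshold would be calibrated on held-out clean and attacked samples.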

[91] N3D-VLM: Native 3D Grounding Enables Accurate Spatial Reasoning in Vision-Language Models

Yuxin Wang,Lei Ke,Boqiang Zhang,Tianyuan Qu,Hanxun Yu,Zhenpeng Huang,Meng Yu,Dan Xu,Dong Yu

Main category: cs.CV

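TL;DR: The paper proposes N3D-VLM, a unified multimodal framework that integrates native 3D object perception with 3D-aware visual reasoning to achieve precise 3D grounding and interpretable spatial understanding.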

Details

Motivation: Existing multimodal models built on 2D images lack intrinsic 3D object perception and struggle to understand spatial relationships and depth cues in 3D scenes. Method: The N3D-VLM framework uses depth estimation to lift large-scale 2D annotations into 3D space, builds a scalable data construction pipeline, and supports joint training for 3D object grounding and 3D spatial reasoning. Result: The model achieves state-of-the-art performance on 3D grounding tasks and consistently surpasses existing methods in 3D spatial reasoning. Conclusion: N3D-VLM delivers native 3D perception and interpretable 3D vision-language understanding, opening a new direction for multimodal models in 3D scenes. Abstract: While current multimodal models can answer questions based on 2D images, they lack intrinsic 3D object perception, limiting their ability to comprehend spatial relationships and depth cues in 3D scenes. In this work, we propose N3D-VLM, a novel unified framework that seamlessly integrates native 3D object perception with 3D-aware visual reasoning, enabling both precise 3D grounding and interpretable spatial understanding. Unlike conventional end-to-end models that directly predict answers from RGB/RGB-D inputs, our approach equips the model with native 3D object perception capabilities, enabling it to directly localize objects in 3D space based on textual descriptions. Building upon accurate 3D object localization, the model further performs explicit reasoning in 3D, achieving more interpretable and structured spatial understanding. To support robust training for these capabilities, we develop a scalable data construction pipeline that leverages depth estimation to lift large-scale 2D annotations into 3D space, significantly increasing the diversity and coverage of 3D object grounding data and yielding a dataset over six times larger than the largest existing single-image 3D detection dataset. Moreover, the pipeline generates spatial question-answering datasets that target chain-of-thought (CoT) reasoning in 3D, facilitating joint training for both 3D object localization and 3D spatial reasoning. Experimental results demonstrate that our unified framework not only achieves state-of-the-art performance on 3D grounding tasks, but also consistently surpasses existing methods in 3D spatial reasoning in vision-language models.

[92] 4D Primitive-Mâché: Glueing Primitives for Persistent 4D Scene Reconstruction

Kirill Mazur,Marwan Taher,Andrew J. Davison

Main category: cs.CV

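TL;DR: The paper proposes a dynamic reconstruction system that takes a monocular RGB video as input and outputs a complete, persistent 4D scene reconstruction by jointly optimising rigid 3D primitives with dense 2D correspondences, with a motion extrapolation mechanism that keeps occluded objects continuous.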

Details

Motivation: Existing methods struggle to produce complete, persistent reconstructions of dynamic scenes and lack spatio-temporal consistency and object permanence, especially under occlusion and multi-object motion. Method: The scene is decomposed into rigid 3D primitives whose rigid-body motion is jointly optimised from dense 2D correspondences; motion-grouping techniques extrapolate the motion of occluded objects, yielding a 4D (3D plus time) reconstruction. Result: The system provides replayable, spatio-temporally aware 4D reconstruction with support for articulated-object replay, multi-object scanning, and object permanence, and significantly outperforms existing methods on object scanning and multi-object datasets. Conclusion: Primitive-based modelling combined with motion extrapolation enables complete 4D scene reconstruction from monocular video, improving the completeness and practicality of dynamic scene reconstruction. Abstract: We present a dynamic reconstruction system that receives a casual monocular RGB video as input, and outputs a complete and persistent reconstruction of the scene. In other words, we reconstruct not only the currently visible parts of the scene, but also all previously viewed parts, which enables replaying the complete reconstruction across all timesteps. Our method decomposes the scene into a set of rigid 3D primitives, which are assumed to be moving throughout the scene. Using estimated dense 2D correspondences, we jointly infer the rigid motion of these primitives through an optimisation pipeline, yielding a 4D reconstruction of the scene, i.e. providing 3D geometry dynamically moving through time. To achieve this, we also introduce a mechanism to extrapolate motion for objects that become invisible, employing motion-grouping techniques to maintain continuity. The resulting system enables 4D spatio-temporal awareness, offering capabilities such as replayable 3D reconstructions of articulated objects through time, multi-object scanning, and object permanence. On object scanning and multi-object datasets, our system significantly outperforms existing methods both quantitatively and qualitatively.

[93] Causal-Tune: Mining Causal Factors from Vision Foundation Models for Domain Generalized Semantic Segmentation

Yin Zhang,Yongqiang Zhang,Yaoyue Zheng,Bogdan Raducanu,Dan Liu

Main category: cs.CV

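TL;DR: The paper proposes Causal-Tune, a fine-tuning method that separates causal from non-causal feature components in the frequency domain, improving how well vision foundation models generalise to unseen domains in semantic segmentation.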

Details

Motivation: Existing domain generalized semantic segmentation methods overlook the artifacts present in pre-trained vision models; these artifacts stem from non-causal factors and hinder effective use of the features. Method: The feature spectrum is extracted with the Discrete Cosine Transform, a Gaussian band-pass filter separates causal from non-causal components, learnable causal-aware tokens refine the causal part in the frequency domain, and an inverse transform maps the result back to the spatial domain. Result: The method is validated on several cross-domain tasks, with a gain of +4.8% mIoU over the baseline under adverse conditions such as snow. Conclusion: Causal-Tune identifies and disentangles causal and non-causal factors in vision foundation models, improving robustness and generalisation to unseen domains. Abstract: Fine-tuning Vision Foundation Models (VFMs) with a small number of parameters has shown remarkable performance in Domain Generalized Semantic Segmentation (DGSS). Most existing works either train lightweight adapters or refine intermediate features to achieve better generalization on unseen domains. However, they both overlook the fact that long-term pre-trained VFMs often exhibit artifacts, which hinder the utilization of valuable representations and ultimately degrade DGSS performance. Inspired by causal mechanisms, we observe that these artifacts are associated with non-causal factors, which usually reside in the low- and high-frequency components of the VFM spectrum. In this paper, we explicitly examine the causal and non-causal factors of features within VFMs for DGSS, and propose a simple yet effective method to identify and disentangle them, enabling more robust domain generalization. Specifically, we propose Causal-Tune, a novel fine-tuning strategy designed to extract causal factors and suppress non-causal ones from the features of VFMs. First, we extract the frequency spectrum of features from each layer using the Discrete Cosine Transform (DCT). A Gaussian band-pass filter is then applied to separate the spectrum into causal and non-causal components. To further refine the causal components, we introduce a set of causal-aware learnable tokens that operate in the frequency domain, while the non-causal components are discarded. Finally, refined features are transformed back into the spatial domain via inverse DCT and passed to the next layer. Extensive experiments conducted on various cross-domain tasks demonstrate the effectiveness of Causal-Tune. In particular, our method achieves superior performance under adverse weather conditions, with a gain of +4.8% mIoU over the baseline in snow conditions.
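The DCT, band-pass, inverse-DCT sequence described in the abstract can be illustrated on a single 2-D feature map. The sketch below is a simplified reading, not the authors' code: the filter `center` and `sigma` values are hypothetical, the radial frequency normalisation is an assumption, and the learnable causal-aware tokens are omitted entirely.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix C: C @ x is the DCT of x, and C.T inverts it."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    return c

def gaussian_bandpass(h, w, center=0.3, sigma=0.15):
    """Mask close to 1 for mid frequencies and close to 0 for very low or high ones."""
    u = np.arange(h)[:, None] / h
    v = np.arange(w)[None, :] / w
    radius = np.sqrt(u ** 2 + v ** 2) / np.sqrt(2.0)   # normalised to [0, 1)
    return np.exp(-((radius - center) ** 2) / (2.0 * sigma ** 2))

def split_causal(feature, center=0.3, sigma=0.15):
    """Split a 2-D feature map into band-passed (kept) and residual (discarded) spectra."""
    h, w = feature.shape
    ch, cw = dct_matrix(h), dct_matrix(w)
    spectrum = ch @ feature @ cw.T              # 2-D DCT
    mask = gaussian_bandpass(h, w, center, sigma)
    causal = mask * spectrum                    # mid-band part that is kept
    non_causal = (1.0 - mask) * spectrum        # low/high-band part that is dropped
    filtered = ch.T @ causal @ cw               # inverse 2-D DCT of the kept part
    return filtered, causal, non_causal

feature = np.random.default_rng(0).normal(size=(8, 8))
filtered, causal, non_causal = split_causal(feature)
print(filtered.shape)
```

Because the mask is soft, the two components always sum back to the original spectrum, which makes it easy to inspect how much energy the filter removes.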

[94] CRONOS: Continuous Time Reconstruction for 4D Medical Longitudinal Series

Nico Albert Disch,Saikat Roy,Constantin Ulrich,Yannick Kirchhoff,Maximilian Rokuss,Robin Peretzke,David Zimmerer,Klaus Maier-Hein

Main category: cs.CV

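TL;DR: CRONOS is, to the authors' knowledge, the first unified sequence-to-image forecasting framework for 3D medical scans that supports both discrete and continuous timestamps, enabling voxel-level, multi-context, continuous-time prediction under irregular sampling.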

Details

Motivation: Existing models of how 3D medical scans evolve over time are limited to a single input scan, fixed time grids, or global label prediction, which prevents voxel-level forecasting under irregular sampling. Method: CRONOS learns a spatio-temporal velocity field that maps context volumes from multiple past time points to a 3D voxel volume at an arbitrary target time, giving continuous-time sequence-to-image prediction. Result: On three public datasets covering Cine-MRI, perfusion CT, and longitudinal MRI, CRONOS outperforms the baselines while remaining computationally competitive. Conclusion: CRONOS offers an effective, unified solution for multi-context, continuous-time 3D medical image forecasting and supports disease progression modelling and reproducible multi-dataset benchmarking. Abstract: Forecasting how 3D medical scans evolve over time is important for disease progression, treatment planning, and developmental assessment. Yet existing models rely on a single prior scan, assume fixed grid times, or target global labels, which limits voxel-level forecasting under irregular sampling. We present CRONOS, a unified framework for many-to-one prediction from multiple past scans that supports both discrete (grid-based) and continuous (real-valued) timestamps in one model, to the best of our knowledge the first to achieve continuous sequence-to-image forecasting for 3D medical data. CRONOS learns a spatio-temporal velocity field that transports context volumes toward a target volume at an arbitrary time, while operating directly in 3D voxel space. Across three public datasets spanning Cine-MRI, perfusion CT, and longitudinal MRI, CRONOS outperforms other baselines, while remaining computationally competitive. We will release code and evaluation protocols to enable reproducible, multi-dataset benchmarking of multi-context, continuous-time forecasting.

[95] Sketch-in-Latents: Eliciting Unified Reasoning in MLLMs

Jintao Tong,Jiaqi Gu,Yujing Lou,Lubin Fan,Yixiong Zou,Yue Wu,Jieping Ye,Ruixuan Li

Main category: cs.CV

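TL;DR: The paper proposes Sketch-in-Latents (SkiLa), a paradigm that inserts latent sketch tokens into a unified feature space so that multimodal large language models can combine visual imagination with textual reasoning.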

Details

Motivation: Current MLLMs perform poorly on tasks that require visual imagination, whereas humans interleave visual and textual imagination flexibly without predefined toolkits; the goal is to build a similar unified multimodal thinking process inside the model. Method: Using the visual-text feature space that MLLMs already share, the model dynamically inserts continuous latent sketch tokens during auto-regressive reasoning, alternating between textual reasoning and visual sketch generation, with a latent visual semantics reconstruction mechanism keeping the tokens semantically grounded. Result: SkiLa achieves superior performance on vision-centric tasks and generalises well across diverse general multimodal benchmarks. Conclusion: SkiLa models textual and visual thinking in a unified way, letting MLLMs natively generate visual thoughts and advancing their capacity for visual imagination. Abstract: While Multimodal Large Language Models (MLLMs) excel at visual understanding tasks through text reasoning, they often fall short in scenarios requiring visual imagination. Unlike current approaches, which rely on predefined external toolkits or generate images during thinking, humans can form flexible visual-text imagination and interactions during thinking without predefined toolkits; one important reason is that humans construct the visual-text thinking process in a unified space inside the brain. Inspired by this capability, given that current MLLMs already encode visual and text information in the same feature space, we hold that visual tokens can be seamlessly inserted into the reasoning process carried by text tokens, where ideally, all visual imagination processes can be encoded by the latent features. To achieve this goal, we propose Sketch-in-Latents (SkiLa), a novel paradigm for unified multi-modal reasoning that expands the auto-regressive capabilities of MLLMs to natively generate continuous visual embeddings, termed latent sketch tokens, as visual thoughts. During multi-step reasoning, the model dynamically alternates between textual thinking mode for generating textual think tokens and visual sketching mode for generating latent sketch tokens. A latent visual semantics reconstruction mechanism is proposed to ensure these latent sketch tokens are semantically grounded. Extensive experiments demonstrate that SkiLa achieves superior performance on vision-centric tasks while exhibiting strong generalization to diverse general multi-modal benchmarks. Codes will be released at https://github.com/TungChintao/SkiLa.

[96] Yuan-TecSwin: A text conditioned Diffusion model with Swin-transformer blocks

Shaohua Wu,Tong Yu,Shenling Wang,Xudong Zhao

Main category: cs.CV

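TL;DR: The paper proposes Yuan-TecSwin, a text-conditioned diffusion model built on Swin-transformer blocks, which replaces CNN modules to strengthen non-local modelling and reports a state-of-the-art FID of 1.37 on ImageNet.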

Details

Motivation: The locality of convolutional neural networks (CNNs) in diffusion models limits their grasp of long-range semantic information, so a stronger non-local modelling mechanism is needed. Method: Swin-transformer blocks replace the CNN blocks in the encoder and decoder of the U-shaped architecture; the text encoder is chosen carefully, text embeddings are used more effectively, and the incorporation of the text condition is redesigned; an adapted time-step search further improves inference performance. Result: The model reports an FID of 1.37 on the ImageNet generation benchmark, described as state of the art, without additional models at different denoising stages; human participants found it difficult to tell generated images from human-painted ones. Conclusion: By introducing Swin-transformer blocks and improving text alignment, Yuan-TecSwin raises text-to-image generation quality and non-local feature modelling ability. Abstract: Diffusion models have shown remarkable capacity in image synthesis based on their U-shaped architecture and convolutional neural networks (CNN) as basic blocks. The locality of the convolution operation in CNN may limit the model's ability to understand long-range semantic information. To address this issue, we propose Yuan-TecSwin, a text-conditioned diffusion model with Swin-transformer blocks. The Swin-transformer blocks take the place of CNN blocks in the encoder and decoder, to improve the non-local modeling ability in feature extraction and image restoration. The text-image alignment is improved with a well-chosen text encoder, effective utilization of text embedding, and careful design in the incorporation of text condition. Using an adapted time step to search in different diffusion stages, inference performance is further improved by 10%. Yuan-TecSwin achieves the state-of-the-art FID score of 1.37 on ImageNet generation benchmark, without any additional models at different denoising stages. In a side-by-side comparison, we find it difficult for human interviewees to tell the model-generated images from the human-painted ones.

[97] Hazedefy: A Lightweight Real-Time Image and Video Dehazing Pipeline for Practical Deployment

Ayush Bhavsar

Main category: cs.CV

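TL;DR: The paper presents Hazedefy, a lightweight, application-focused real-time dehazing pipeline based on the dark channel prior and the atmospheric scattering model, designed for consumer-grade hardware.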

Details

Motivation: The goal is efficient real-time dehazing of video and live camera feeds on consumer-grade hardware, improving visibility and contrast in practical settings. Method: The pipeline combines gamma-adaptive reconstruction, a fast transmission approximation, a stabilized atmospheric light estimator based on fractional top-pixel averaging, and an optional color balance step. Result: Experiments show improved visibility and contrast on real-world images and videos without GPU acceleration. Conclusion: Hazedefy is a computationally simple, practically deployable dehazing solution suited to real-time use on mobile and embedded devices. Abstract: This paper introduces Hazedefy, a lightweight and application-focused dehazing pipeline intended for real-time video and live camera feed enhancement. Hazedefy prioritizes computational simplicity and practical deployability on consumer-grade hardware, building upon the Dark Channel Prior (DCP) concept and the atmospheric scattering model. Key elements include gamma-adaptive reconstruction, a fast transmission approximation with lower bounds for numerical stability, a stabilized atmospheric light estimator based on fractional top-pixel averaging, and an optional color balance stage. The pipeline is suitable for mobile and embedded applications, as experimental demonstrations on real-world images and videos show improved visibility and contrast without requiring GPU acceleration.
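For readers unfamiliar with the dark channel prior, the sketch below shows a generic DCP-style dehazing step under the atmospheric scattering model I = J·t + A·(1 − t). It is not the Hazedefy implementation: the per-pixel dark channel (with no patch minimum), the `omega`, `t_min`, and `top_fraction` values, and the function names are all assumptions, and the gamma-adaptive reconstruction and color balance stages are left out.

```python
import numpy as np

def estimate_atmospheric_light(image, dark, top_fraction=0.01):
    """Average the colours of the brightest fraction of dark-channel pixels."""
    flat_dark = dark.ravel()
    k = max(1, int(top_fraction * flat_dark.size))
    idx = np.argsort(flat_dark)[-k:]
    return image.reshape(-1, 3)[idx].mean(axis=0)

def dehaze(image, omega=0.95, t_min=0.1, top_fraction=0.01):
    """Dehaze a float image of shape (H, W, 3) with values in [0, 1]."""
    dark = image.min(axis=2)                    # per-pixel dark channel (simplified)
    A = np.maximum(estimate_atmospheric_light(image, dark, top_fraction), 1e-6)
    transmission = 1.0 - omega * (image / A).min(axis=2)
    transmission = np.clip(transmission, t_min, 1.0)   # lower bound for stability
    # Invert I = J * t + A * (1 - t) for the scene radiance J.
    recovered = (image - A) / transmission[..., None] + A
    return np.clip(recovered, 0.0, 1.0), transmission

hazy = np.random.default_rng(0).random((16, 16, 3)) * 0.5 + 0.45
clear, t = dehaze(hazy)
print(clear.shape, float(t.min()))
```

The lower bound on the transmission prevents division by values near zero in dense haze, which is the role the abstract attributes to its own lower bounds.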

[98] Trainable Log-linear Sparse Attention for Efficient Diffusion Transformers

Yifan Zhou,Zeqi Xiao,Tianyi Wei,Shuai Yang,Xingang Pan

Main category: cs.CV

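TL;DR: The paper proposes Log-linear Sparse Attention (LLSA), an efficient sparse attention mechanism for long-sequence diffusion transformers whose hierarchical structure reduces selection and attention cost from quadratic to log-linear, substantially accelerating training and inference while maintaining generation quality.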

Details

Motivation: Existing Top-K sparse attention methods still incur quadratic selection cost on long sequences and need a larger K as sequences grow to maintain quality; the root cause is that a single-level design cannot capture global structure. Method: LLSA uses hierarchical Top-K sparse selection that refines the choice of key blocks level by level, adds a Hierarchical KV Enrichment mechanism that preserves global context at several granularities, and ships an efficient GPU implementation whose forward and backward passes use only sparse indices, avoiding dense attention masks. Result: On high-resolution image generation with 256x256 pixel token sequences, LLSA accelerates attention inference by 28.27x and DiT training by 6.09x while maintaining generation quality. Conclusion: The hierarchical sparse design enables efficient training and inference for long-sequence DiTs and offers a viable path to longer sequences in visual generation models. Abstract: Diffusion Transformers (DiTs) set the state of the art in visual generation, yet their quadratic self-attention cost fundamentally limits scaling to long token sequences. Recent Top-K sparse attention approaches reduce the computation of DiTs by compressing tokens into block-wise representations and selecting a small set of relevant key blocks, but still suffer from (i) quadratic selection cost on compressed tokens and (ii) increasing K required to maintain model quality as sequences grow. We identify that their inefficiency is due to the single-level design, as a single coarse level is insufficient to represent the global structure. In this paper, we introduce Log-linear Sparse Attention (LLSA), a trainable sparse attention mechanism for extremely long token sequences that reduces both selection and attention costs from quadratic to log-linear complexity by utilizing a hierarchical structure. LLSA performs hierarchical Top-K selection, progressively adopting sparse Top-K selection with the indices found at the previous level, and introduces a Hierarchical KV Enrichment mechanism that preserves global context while using fewer tokens of different granularity during attention computation. To support efficient training, we develop a high-performance GPU implementation that uses only sparse indices for both the forward and backward passes, eliminating the need for dense attention masks. We evaluate LLSA on high-resolution pixel-space image generation without using patchification and VAE encoding. LLSA accelerates attention inference by 28.27x and DiT training by 6.09x on 256x256 pixel token sequences, while maintaining generation quality. The results demonstrate that LLSA offers a promising direction for training long-sequence DiTs efficiently. Code is available at: https://github.com/SingleZombie/LLSA
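The idea of hierarchical Top-K selection, where each level only scores the children of blocks kept at the coarser level, can be sketched for a single query as follows. This is a toy illustration loosely modelled on the description in the abstract, not the released implementation: mean pooling as the block summary, the `branch` factor, and the function names are assumptions, the sequence length is assumed to be a power of `branch`, and Hierarchical KV Enrichment and the GPU kernels are omitted.

```python
import numpy as np

def build_pyramid(keys, branch):
    """Mean-pool keys repeatedly; levels[0] is the token level, levels[-1] the coarsest."""
    levels = [keys]
    while levels[-1].shape[0] > branch:
        cur = levels[-1]
        n = cur.shape[0] // branch
        levels.append(cur[: n * branch].reshape(n, branch, -1).mean(axis=1))
    return levels

def hierarchical_topk(query, keys, branch=4, k=2):
    """Select k key indices for one query by coarse-to-fine Top-K search."""
    levels = build_pyramid(keys, branch)
    candidates = np.arange(levels[-1].shape[0])      # every block at the coarsest level
    for level in range(len(levels) - 1, -1, -1):
        scores = levels[level][candidates] @ query
        keep = candidates[np.argsort(scores)[::-1][:k]]
        if level == 0:
            return np.sort(keep)
        # Only the children of the kept blocks are scored at the next finer level.
        candidates = (keep[:, None] * branch + np.arange(branch)).ravel()

rng = np.random.default_rng(0)
keys = rng.normal(size=(64, 8))
query = rng.normal(size=8)
print(hierarchical_topk(query, keys, branch=4, k=2))
```

With 64 keys, a branching factor of 4, and k = 2, the query is compared against 4 + 8 + 8 = 20 candidates instead of all 64, and the number of levels grows logarithmically with sequence length.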

[99] Plug to Place: Indoor Multimedia Geolocation from Electrical Sockets for Digital Investigation

Kanwal Aftab,Graham Adams,Mark Scanlon

Main category: cs.CV

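TL;DR: The paper proposes a three-stage deep learning pipeline that geolocates indoor multimedia from electrical socket types, with forensic potential in combating crimes such as human trafficking and child exploitation.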

Details

Motivation: Indoor multimedia geolocation lags behind its outdoor counterpart because of similar room layouts, frequent renovations, visual ambiguity, lighting variation, and scarce datasets, even though it matters greatly for digital forensics. Method: YOLOv11 detects sockets (mAP@0.5 = 0.843), an Xception model classifies them into 12 socket types (accuracy = 0.912), and the socket types are mapped to countries (accuracy = 0.96 at >90% threshold confidence); two dedicated datasets were built and augmented. Result: The method was validated on the TraffickCam subset of hotel images taken under real-world conditions, which is a more realistic test than professionally shot images. Conclusion: The framework is a practical step toward real-world digital forensic use, and the code, models, and data are released as open source. Abstract: Computer vision is a rapidly evolving field, giving rise to powerful new tools and techniques in digital forensic investigation, and shows great promise for novel digital forensic applications. One such application, indoor multimedia geolocation, has the potential to become a crucial aid for law enforcement in the fight against human trafficking, child exploitation, and other serious crimes. While outdoor multimedia geolocation has been widely explored, its indoor counterpart remains underdeveloped due to challenges such as similar room layouts, frequent renovations, visual ambiguity, indoor lighting variability, unreliable GPS signals, and limited datasets in sensitive domains. This paper introduces a pipeline that uses electric sockets as consistent indoor markers for geolocation, since plug socket types are standardised by country or region. The three-stage deep learning pipeline detects plug sockets (YOLOv11, mAP@0.5 = 0.843), classifies them into one of 12 plug socket types (Xception, accuracy = 0.912), and maps the detected socket types to countries (accuracy = 0.96 at >90% threshold confidence). To address data scarcity, two dedicated datasets were created: a socket detection dataset of 2,328 annotated images expanded to 4,072 through augmentation, and a classification dataset of 3,187 images across 12 plug socket classes. The pipeline was evaluated on the Hotels-50K dataset, focusing on the TraffickCam subset of crowd-sourced hotel images, which capture real-world conditions such as poor lighting and amateur angles. This dataset provides a more realistic evaluation than using professional, well-lit, often wide-angle images from travel websites. This framework demonstrates a practical step toward real-world digital forensic applications. The code, trained models, and the data for this paper are available open source.

[100] DeContext as Defense: Safe Image Editing in Diffusion Transformers

Linghui Shen,Mingyue Cui,Xingyi Yang

Main category: cs.CV

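TL;DR: The paper proposes DeContext, a method that defends against unauthorized image editing by in-context diffusion models by disrupting multimodal attention pathways, breaking the link between input and output while preserving image quality.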

Details

Motivation: In-context diffusion models can modify images easily and realistically, which raises privacy risks such as identity impersonation and malicious misinformation, so an effective mechanism is needed to stop personal images from being edited without consent. Method: DeContext injects small, targeted perturbations at key denoising steps and in specific transformer blocks to weaken cross-attention pathways and block the propagation of contextual information. Result: Experiments on Flux Kontext and Step1X-Edit show that DeContext consistently blocks unauthorized image edits while preserving visual quality. Conclusion: Attention-based perturbation is an efficient and robust defense for protecting images against misuse by modern large-scale in-context diffusion models. Abstract: In-context diffusion models allow users to modify images with remarkable ease and realism. However, the same power raises serious privacy concerns: personal images can be easily manipulated for identity impersonation, misinformation, or other malicious uses, all without the owner's consent. While prior work has explored input perturbations to protect against misuse in personalized text-to-image generation, the robustness of modern, large-scale in-context DiT-based models remains largely unexamined. In this paper, we propose DeContext, a new method to safeguard input images from unauthorized in-context editing. Our key insight is that contextual information from the source image propagates to the output primarily through multimodal attention layers. By injecting small, targeted perturbations that weaken these cross-attention pathways, DeContext breaks this flow, effectively decoupling the link between input and output. This simple defense is both efficient and robust. We further show that early denoising steps and specific transformer blocks dominate context propagation, which allows us to concentrate perturbations where they matter most. Experiments on Flux Kontext and Step1X-Edit show that DeContext consistently blocks unwanted image edits while preserving visual quality. These results highlight the effectiveness of attention-based perturbations as a powerful defense against image manipulation.

[101] SARMAE: Masked Autoencoder for SAR Representation Learning

Danxu Liu,Di Wang,Hebaixu Wang,Haoyang Chen,Wentao Jiang,Yilin Cheng,Haonan Guo,Wei Cui,Jing Zhang

Main category: cs.CV

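TL;DR: The paper proposes SARMAE, a noise-aware masked autoencoder for self-supervised SAR representation learning, builds SAR-1M, the first million-scale SAR dataset, and designs the SARE and SARC components to improve robustness and semantic consistency.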

Details

Motivation: Deep learning for SAR is constrained by data scarcity and by speckle noise, which hampers fine-grained semantic representation learning. Method: The authors construct the large-scale SAR-1M dataset, propose Speckle-Aware Representation Enhancement (SARE), which injects SAR-specific speckle noise, and propose Semantic Anchor Representation Constraint (SARC), which uses priors from paired optical images to align features. Result: Experiments on multiple SAR datasets show that SARMAE achieves state-of-the-art performance on classification, detection, and segmentation tasks. Conclusion: Through noise-aware and semantically consistent design, SARMAE improves representation learning for SAR imagery and advances deep learning in the SAR domain. Abstract: Synthetic Aperture Radar (SAR) imagery plays a critical role in all-weather, day-and-night remote sensing applications. However, existing SAR-oriented deep learning is constrained by data scarcity, while the physically grounded speckle noise in SAR imagery further hampers fine-grained semantic representation learning. To address these challenges, we propose SARMAE, a Noise-Aware Masked Autoencoder for self-supervised SAR representation learning. Specifically, we construct SAR-1M, the first million-scale SAR dataset, with additional paired optical images, to enable large-scale pre-training. Building upon this, we design Speckle-Aware Representation Enhancement (SARE), which injects SAR-specific speckle noise into masked autoencoders to facilitate noise-aware and robust representation learning. Furthermore, we introduce Semantic Anchor Representation Constraint (SARC), which leverages paired optical priors to align SAR features and ensure semantic consistency. Extensive experiments across multiple SAR datasets demonstrate that SARMAE achieves state-of-the-art performance on classification, detection, and segmentation tasks. Code and models will be available at https://github.com/MiliLab/SARMAE.
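The abstract does not specify how SARE generates speckle, so the sketch below uses the standard textbook model as an assumption: multiplicative noise on an intensity image, gamma-distributed with unit mean for a given number of looks. The function name `add_speckle` and the `looks` parameter are illustrative and do not come from the paper.

```python
import numpy as np

def add_speckle(intensity, looks=4, rng=None):
    """Multiply an intensity image by unit-mean gamma noise (assumed speckle model)."""
    rng = np.random.default_rng() if rng is None else rng
    # Gamma(shape=L, scale=1/L) has mean 1 and variance 1/L.
    noise = rng.gamma(shape=looks, scale=1.0 / looks, size=intensity.shape)
    return intensity * noise

clean = np.ones((100, 100))
noisy = add_speckle(clean, looks=4, rng=np.random.default_rng(0))
print(noisy.mean(), noisy.std())
```

Fewer looks give stronger speckle, so varying `looks` during pretraining would be one simple way to expose an encoder to a range of noise levels.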

[102] REGLUE Your Latents with Global and Local Semantics for Entangled Diffusion

Giorgos Petsangourakis,Christos Sgouropoulos,Bill Psomas,Theodoros Giannakopoulos,Giorgos Sfikas,Ioannis Kakogeorgiou

Main category: cs.CV

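TL;DR: Proposes REGLUE, a unified latent diffusion framework that jointly models VAE latents, local Vision Foundation Model (VFM) semantics, and a global [CLS] token, improving image generation quality and training efficiency.

Details

Motivation: Existing latent diffusion models lack direct semantic supervision, which leads to long training and limited sample quality; meanwhile, current methods that use VFM semantics fail to fully exploit their multi-layer, nonlinear spatial semantic information. Method: Introduces the REGLUE framework, which models VAE image latents, compact local (patch-level) VFM semantics, and a global (image-level) [CLS] token within a single SiT backbone; a lightweight convolutional semantic compressor nonlinearly aggregates multi-layer VFM features, which are entangled with the VAE latents during diffusion; an external alignment loss regularizes the internal representations. Result: On ImageNet 256x256, REGLUE outperforms the SiT-B/2 and SiT-XL/2 baselines as well as REPA, ReDi, and REG in both FID and convergence speed; experiments show that spatial VFM semantics, nonlinear compression, the global token, and external alignment all play key roles. Conclusion: Through global-local-latent joint modeling, REGLUE effectively integrates the rich semantics of vision foundation models, improves the generation performance and training efficiency of latent diffusion models, and offers a new paradigm for using VFM semantics efficiently. Abstract: Latent diffusion models (LDMs) achieve state-of-the-art image synthesis, yet their reconstruction-style denoising objective provides only indirect semantic supervision: high-level semantics emerge slowly, requiring longer training and limiting sample quality. Recent works inject semantics from Vision Foundation Models (VFMs) either externally via representation alignment or internally by jointly modeling only a narrow slice of VFM features inside the diffusion process, under-utilizing the rich, nonlinear, multi-layer spatial semantics available. We introduce REGLUE (Representation Entanglement with Global-Local Unified Encoding), a unified latent diffusion framework that jointly models (i) VAE image latents, (ii) compact local (patch-level) VFM semantics, and (iii) a global (image-level) [CLS] token within a single SiT backbone. A lightweight convolutional semantic compressor nonlinearly aggregates multi-layer VFM features into a low-dimensional, spatially structured representation, which is entangled with the VAE latents in the diffusion process. An external alignment loss further regularizes internal representations toward frozen VFM targets. On ImageNet 256x256, REGLUE consistently improves FID and accelerates convergence over SiT-B/2 and SiT-XL/2 baselines, as well as over REPA, ReDi, and REG.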
Extensive experiments show that (a) spatial VFM semantics are crucial, (b) non-linear compression is key to unlocking their full benefit, and (c) global tokens and external alignment act as complementary, lightweight enhancements within our global-local-latent joint modeling framework. The code is available at https://github.com/giorgospets/reglue .

[103] FrameDiffuser: G-Buffer-Conditioned Diffusion for Neural Forward Frame Rendering

Ole Beisswenger,Jan-Niklas Dihlmann,Hendrik P. A. Lensch

Main category: cs.CV

TL;DR: Proposes FrameDiffuser, an autoregressive neural rendering framework for interactive applications that generates temporally consistent, photorealistic images by conditioning on G-buffer data and the previous output frame.

Details

Motivation: Existing diffusion-based neural rendering methods fall short in either temporal consistency or computational efficiency, making them hard to use in interactive applications. Method: Proposes a dual-conditioning architecture that combines ControlNet for structural guidance with ControlLoRA for temporal coherence, and uses a three-stage training strategy to achieve stable autoregressive generation. The model is specialized to individual environments. Result: FrameDiffuser maintains stable temporal consistency over hundreds to thousands of frames and generates high-quality photorealistic images with accurate lighting, shadows, and reflections, outperforming generalized approaches. Conclusion: Through environment-specific training and an autoregressive mechanism, FrameDiffuser achieves high realism and temporal consistency while preserving inference speed, making it suitable for interactive rendering. Abstract: Neural rendering for interactive applications requires translating geometric and material properties (G-buffer) to photorealistic images with realistic lighting on a frame-by-frame basis. While recent diffusion-based approaches show promise for G-buffer-conditioned image synthesis, they face critical limitations: single-image models like RGBX generate frames independently without temporal consistency, while video models like DiffusionRenderer are too computationally expensive for most consumer gaming setups and require complete sequences upfront, making them unsuitable for interactive applications where future frames depend on user input. We introduce FrameDiffuser, an autoregressive neural rendering framework that generates temporally consistent, photorealistic frames by conditioning on G-buffer data and the model's own previous output. After an initial frame, FrameDiffuser operates purely on incoming G-buffer data, comprising geometry, materials, and surface properties, while using its previously generated frame for temporal guidance, maintaining stable, temporally consistent generation over hundreds to thousands of frames. Our dual-conditioning architecture combines ControlNet for structural guidance with ControlLoRA for temporal coherence. A three-stage training strategy enables stable autoregressive generation. We specialize our model to individual environments, prioritizing consistency and inference speed over broad generalization, demonstrating that environment-specific training achieves superior photorealistic quality with accurate lighting, shadows, and reflections compared to generalized approaches.

[104] Few-Shot Fingerprinting Subject Re-Identification in 3D-MRI and 2D-X-Ray

Gonçalo Gaspar Alves,Shekoufeh Gorgi Zadeh,Andreas Husch,Ben Bausch

Main category: cs.CV

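TL;DR: Proposes a subject fingerprinting method based on ResNet-50 and a triplet loss to detect subject overlap across multi-source datasets and thereby mitigate data leakage, achieving high recall on several medical imaging datasets.

Details

Motivation: To prevent the data leakage that arises when the same subject appears in different open-source datasets, which leads to overestimated model performance. Method: Trains a ResNet-50 with a triplet margin loss so that all images of the same subject map to a distinct region of the latent space, performs subject re-identification by similarity matching, and evaluates performance in few-shot settings. Result: Strong results on both the ChestXray-14 and BraTS-2021 medical imaging datasets: on ChestXray-14, Mean Recall@K reaches 99.10% for 20-way 1-shot and 90.06% for 500-way 5-shot; on BraTS-2021, it reaches 99.20% for 20-way 1-shot and 98.86% for 100-way 3-shot. Conclusion: Subject fingerprinting can effectively identify the same subject across datasets, shows strong few-shot recognition ability, and helps to detect and mitigate data leakage in medical image analysis. Abstract: Combining open-source datasets can introduce data leakage if the same subject appears in multiple sets, leading to inflated model performance. To address this, we explore subject fingerprinting, mapping all images of a subject to a distinct region in latent space, to enable subject re-identification via similarity matching. Using a ResNet-50 trained with triplet margin loss, we evaluate few-shot fingerprinting on 3D MRI and 2D X-ray data in both standard (20-way 1-shot) and challenging (1000-way 1-shot) scenarios. The model achieves high Mean-Recall-@-K scores: 99.10% (20-way 1-shot) and 90.06% (500-way 5-shot) on ChestXray-14; 99.20% (20-way 1-shot) and 98.86% (100-way 3-shot) on BraTS-2021.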
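
The two ingredients named in the abstract, a triplet margin loss and recall-based matching, can be sketched in plain Python. This is an illustration under assumptions: Euclidean distance, a margin of 1.0, and the single-query `recall_at_k` helper are choices made here, and the paper's exact Mean-Recall-@-K protocol is not given in the abstract.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Zero once the negative is at least `margin` farther than the positive."""
    return max(0.0, euclidean(anchor, positive)
               - euclidean(anchor, negative) + margin)

def recall_at_k(query, query_subject, gallery, k=1):
    """1.0 if the query's subject is among the k nearest gallery embeddings.

    gallery: list of (embedding, subject_id) pairs.
    """
    ranked = sorted(gallery, key=lambda item: euclidean(query, item[0]))
    return 1.0 if any(sid == query_subject for _, sid in ranked[:k]) else 0.0

# Toy 2-D embeddings: subject "A" sits near the origin, subject "B" far away.
gallery = [([0.0, 1.0], "A"), ([3.0, 4.0], "B")]
hit = recall_at_k([0.0, 0.0], "A", gallery, k=1)
```

Averaging `recall_at_k` over many queries gives a mean recall figure; in an N-way K-shot episode the gallery would hold K embeddings for each of N subjects.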

[105] Detecting Localized Deepfakes: How Well Do Synthetic Image Detectors Handle Inpainting?

Serafino Pandolfini,Lorenzo Pellegrini,Matteo Ferrara,Davide Maltoni

Main category: cs.CV

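TL;DR: Systematically evaluates how well state-of-the-art deepfake detectors generalize to localized inpainting detection, and finds that models trained on many generators detect medium- and large-area or regeneration-style inpainting reasonably well.

Details

Motivation: Advances in generative AI have made localized image editing more realistic, but existing detectors mainly target fully synthetic content, and their ability to detect localized manipulations has not been adequately evaluated. Method: Uses multiple datasets covering different generators, mask sizes, and inpainting techniques to perform a cross-task evaluation of models originally built for detecting fully synthetic images. Result: Experiments show that models trained on a large set of generators transfer partially to localized inpainting detection, perform well on medium- and large-area edits and regeneration-style inpainting, and outperform many dedicated detection approaches. Conclusion: Current state-of-the-art detectors have some ability to recognize localized edits, especially larger or specific types of inpainting, showing potential for cross-task generalization. Abstract: The rapid progress of generative AI has enabled highly realistic image manipulations, including inpainting and region-level editing. These approaches preserve most of the original visual context and are increasingly exploited in cybersecurity-relevant threat scenarios. While numerous detectors have been proposed for identifying fully synthetic images, their ability to generalize to localized manipulations remains insufficiently characterized. This work presents a systematic evaluation of state-of-the-art detectors, originally trained for deepfake detection on fully synthetic images, when applied to a distinct challenge: localized inpainting detection. The study leverages multiple datasets spanning diverse generators, mask sizes, and inpainting techniques. Our experiments show that models trained on a large set of generators exhibit partial transferability to inpainting-based edits and can reliably detect medium- and large-area manipulations or regeneration-style inpainting, outperforming many existing ad hoc detection approaches.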

[106] SDFoam: Signed-Distance Foam for explicit surface reconstruction

Antonella Rech,Nicola Conci,Nicola Garau

Main category: cs.CV

TL;DR: Proposes SDFoam, a hybrid implicit-explicit method that jointly learns an explicit Voronoi diagram and an implicit signed distance field (SDF), substantially improving the mesh reconstruction accuracy of NeRF-style methods while preserving rendering efficiency.

Details

Motivation: Neural radiance fields (NeRF) and splatting-based methods (such as 3DGS and RadiantFoam) perform well in view synthesis but still fall short in precise mesh reconstruction, notably with blurry surfaces, many floaters, and topological errors. Method: Proposes SDFoam, which combines an explicit Voronoi Diagram (VD) for efficient ray-traced rendering with an implicit Signed Distance Field (SDF) for geometric regularization; optimization is constrained by an Eikonal loss, and near-surface Voronoi cell faces are aligned with the SDF zero level set to obtain more precise surfaces. Result: Across diverse scenes, SDFoam substantially improves mesh reconstruction accuracy (Chamfer distance) while keeping training speed and photometric quality (PSNR, SSIM) comparable to RadiantFoam, with fewer floaters and improved topology. Conclusion: By fusing an explicit Voronoi structure with an implicit SDF, SDFoam addresses the mesh reconstruction weaknesses of existing methods without sacrificing rendering efficiency, offering a new solution for high-quality view synthesis and geometry recovery. Abstract: Neural radiance fields (NeRF) have driven impressive progress in view synthesis by using ray-traced volumetric rendering. Splatting-based methods such as 3D Gaussian Splatting (3DGS) provide faster rendering by rasterizing 3D primitives. RadiantFoam (RF) brought ray tracing back, achieving throughput comparable to Gaussian Splatting by organizing radiance with an explicit Voronoi Diagram (VD). Yet, all the mentioned methods still struggle with precise mesh reconstruction. We address this gap by jointly learning an explicit VD with an implicit Signed Distance Field (SDF). The scene is optimized via ray tracing and regularized by an Eikonal objective. The SDF introduces metric-consistent isosurfaces, which, in turn, bias near-surface Voronoi cell faces to align with the zero level set. The resulting model produces crisper, view-consistent surfaces with fewer floaters and improved topology, while preserving photometric quality and maintaining training speed on par with RadiantFoam. Across diverse scenes, our hybrid implicit-explicit formulation, which we name SDFoam, substantially improves mesh reconstruction accuracy (Chamfer distance) with comparable appearance (PSNR, SSIM), without sacrificing efficiency.
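
The Eikonal objective mentioned above penalizes deviation of the SDF gradient norm from 1, since a true signed distance field changes by one unit per unit of distance. The sketch below checks this with finite differences on an analytic sphere; the function names, the sample points, and the use of finite differences (rather than automatic differentiation on a learned field) are assumptions for illustration only.

```python
import math

def sphere_sdf(p, radius=1.0):
    """Exact signed distance from point p to a sphere centered at the origin."""
    return math.sqrt(sum(c * c for c in p)) - radius

def grad_norm(f, p, eps=1e-4):
    """Norm of the central-difference gradient of f at p."""
    grad = []
    for i in range(len(p)):
        hi, lo = list(p), list(p)
        hi[i] += eps
        lo[i] -= eps
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return math.sqrt(sum(g * g for g in grad))

def eikonal_loss(f, points):
    """Mean squared deviation of the gradient norm from 1 over sample points."""
    return sum((grad_norm(f, p) - 1.0) ** 2 for p in points) / len(points)

points = [(0.5, 0.2, -0.3), (1.5, 0.0, 0.0), (-0.7, 0.9, 0.4)]
loss_true_sdf = eikonal_loss(sphere_sdf, points)                 # near zero
loss_scaled = eikonal_loss(lambda p: 2 * sphere_sdf(p), points)  # near one
```

The scaled field has the same zero level set as the true SDF but a gradient norm of 2, so it is penalized; this is how the term keeps distances metrically consistent away from the surface.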

[107] A multi-centre, multi-device benchmark dataset for landmark-based comprehensive fetal biometry

Chiara Di Vece,Zhehua Mao,Netanell Avisdris,Brian Dromey,Raffaele Napolitano,Dafna Ben Bashat,Francisco Vasconcelos,Danail Stoyanov,Leo Joskowicz,Sophia Bano

Main category: cs.CV

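TL;DR: Introduces an open, multi-centre, multi-device benchmark dataset of fetal ultrasound images with expert-annotated anatomical landmarks for clinically used fetal biometry. The dataset contains 4,513 de-identified ultrasound images from three clinical centres and seven different devices, and comes with standardised train/test splits, evaluation code, and baseline results to support research on AI-assisted fetal growth assessment across centres and devices.

Details

Motivation: Manually annotating anatomical landmarks in fetal ultrasound images is time-consuming and operator-dependent, and varies across devices and centres, which limits the reproducibility of automated methods. A multi-source annotated dataset is therefore needed to advance AI-assisted fetal growth assessment. Method: Collected 4,513 fetal ultrasound images acquired at three clinical centres with seven different ultrasound devices, with key anatomical landmarks annotated by experts. Standard subject-disjoint train/test splits, evaluation code, and baseline model results are provided to support fair and reproducible comparison. An automatic biometry model is also used to quantify the effect of domain shift. Result: The dataset covers all primary fetal biometry measures and is the first public multi-centre, multi-device, landmark-annotated fetal ultrasound dataset. Experiments show that training and evaluating within a single centre overestimates model performance, whereas multi-centre testing better reflects true generalisation. Conclusion: The dataset provides a robust benchmark for domain adaptation and multi-centre generalisation in fetal biometry and helps in developing more reliable cross-centre AI-assisted fetal growth assessment; all data and code are publicly available. Abstract: Accurate fetal growth assessment from ultrasound (US) relies on precise biometry measured by manually identifying anatomical landmarks in standard planes. Manual landmarking is time-consuming, operator-dependent, and sensitive to variability across scanners and sites, limiting the reproducibility of automated approaches. There is a need for multi-source annotated datasets to develop artificial intelligence-assisted fetal growth assessment methods. To address this bottleneck, we present an open, multi-centre, multi-device benchmark dataset of fetal US images with expert anatomical landmark annotations for clinically used fetal biometric measurements. These measurements include head bi-parietal and occipito-frontal diameters, abdominal transverse and antero-posterior diameters, and femoral length. The dataset contains 4,513 de-identified US images from 1,904 subjects acquired at three clinical sites using seven different US devices. We provide standardised, subject-disjoint train/test splits, evaluation code, and baseline results to enable fair and reproducible comparison of methods. Using an automatic biometry model, we quantify domain shift and demonstrate that training and evaluation confined to a single centre substantially overestimate performance relative to multi-centre testing.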
To the best of our knowledge, this is the first publicly available multi-centre, multi-device, landmark-annotated dataset that covers all primary fetal biometry measures, providing a robust benchmark for domain adaptation and multi-centre generalisation in fetal biometry and enabling more reliable AI-assisted fetal growth assessment across centres. All data, annotations, training code, and evaluation pipelines are made publicly available.

[108] OMG-Bench: A New Challenging Benchmark for Skeleton-based Online Micro Hand Gesture Recognition

Haochen Chang,Pengfei Ren,Buyuan Zhang,Da Li,Tianhao Han,Haoyang Zhang,Liang Xie,Hongbo Chen,Erwei Yin

Main category: cs.CV

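TL;DR: Proposes a multi-view self-supervised pipeline for generating skeleton data for online micro gesture recognition, releases OMG-Bench, the first large-scale public benchmark for the task, and introduces the HMATr model, which unifies gesture detection and classification through hierarchical memory banks and learnable position-aware queries and clearly outperforms existing methods.

Details

Motivation: Online micro gesture recognition is hampered by the lack of public high-quality datasets and general algorithms; micro gestures involve subtle motion and are hard to annotate, which has limited progress in the field. Method: Builds a multi-view self-supervised pipeline to generate hand skeleton data automatically, combined with heuristic rules and expert refinement for semi-automatic annotation; proposes HMATr, a hierarchical memory-augmented Transformer that uses frame-level and window-level memory banks to preserve historical context and learnable position-aware queries to implicitly encode gesture positions and semantics. Result: Releases the OMG-Bench dataset with 40 fine-grained micro gesture classes, 13,948 instances, and 1,272 sequences; HMATr improves the detection rate by 7.6% over the previous best methods, establishing a new baseline. Conclusion: The proposed pipeline and HMATr model advance skeleton-based online micro gesture recognition; OMG-Bench is an important resource for the task, and HMATr's architecture offers a new approach to handling subtle, rapid, continuous gestures. Abstract: Online micro gesture recognition from hand skeletons is critical for VR/AR interaction but faces challenges due to limited public datasets and task-specific algorithms. Micro gestures involve subtle motion patterns, which make constructing datasets with precise skeletons and frame-level annotations difficult. To this end, we develop a multi-view self-supervised pipeline to automatically generate skeleton data, complemented by heuristic rules and expert refinement for semi-automatic annotation. Based on this pipeline, we introduce OMG-Bench, the first large-scale public benchmark for skeleton-based online micro gesture recognition. It features 40 fine-grained gesture classes with 13,948 instances across 1,272 sequences, characterized by subtle motions, rapid dynamics, and continuous execution. To tackle these challenges, we propose Hierarchical Memory-Augmented Transformer (HMATr), an end-to-end framework that unifies gesture detection and classification by leveraging hierarchical memory banks which store frame-level details and window-level semantics to preserve historical context. In addition, it employs learnable position-aware queries initialized from the memory to implicitly encode gesture positions and semantics. Experiments show that HMATr outperforms state-of-the-art methods by 7.6% in detection rate, establishing a strong baseline for online micro gesture recognition. Project page: https://omg-bench.github.io/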

[109] Task-Oriented Data Synthesis and Control-Rectify Sampling for Remote Sensing Semantic Segmentation

Yunkai Yang,Yudong Zhang,Kunquan Zhang,Jinxiao Zhang,Xinying Chen,Haohuan Fu,Runmin Dong

Main category: cs.CV

TL;DR: Proposes TODSynth, a task-oriented data synthesis framework for remote sensing semantic segmentation that combines a multimodal diffusion Transformer with a task-feedback-guided sampling strategy, markedly improving synthetic data quality in few-shot and complex-scene settings.

Details

Motivation: Existing controllable generation methods for remote sensing semantic segmentation suffer from complex semantic mask control and unstable sampling quality, which limits the usefulness of synthetic data. Method: Proposes the TODSynth framework, comprising a Multimodal Diffusion Transformer (MM-DiT) with unified triple attention and a plug-and-play sampling strategy guided by task feedback; introduces control-rectify flow matching (CRFM), which dynamically adjusts sampling directions early in generation to reduce instability. Result: Experiments show that the method outperforms state-of-the-art approaches on several remote sensing semantic segmentation tasks and produces more stable, more task-oriented synthetic data, especially in few-shot and complex scenes. Conclusion: Through joint attention and task-oriented sampling optimization, TODSynth improves both synthetic data quality and downstream task performance, offering a new approach to remote sensing data augmentation. Abstract: With the rapid progress of controllable generation, training data synthesis has become a promising way to expand labeled datasets and alleviate manual annotation in remote sensing (RS). However, the complexity of semantic mask control and the uncertainty of sampling quality often limit the utility of synthetic data in downstream semantic segmentation tasks. To address these challenges, we propose a task-oriented data synthesis framework (TODSynth), including a Multimodal Diffusion Transformer (MM-DiT) with unified triple attention and a plug-and-play sampling strategy guided by task feedback. Built upon the powerful DiT-based generative foundation model, we systematically evaluate different control schemes, showing that a text-image-mask joint attention scheme combined with full fine-tuning of the image and mask branches significantly enhances the effectiveness of RS semantic segmentation data synthesis, particularly in few-shot and complex-scene scenarios. Furthermore, we propose a control-rectify flow matching (CRFM) method, which dynamically adjusts sampling directions guided by semantic loss during the early high-plasticity stage, mitigating the instability of generated images and bridging the gap between synthetic data and downstream segmentation tasks. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art controllable generation methods, producing more stable and task-oriented synthetic data for RS semantic segmentation.

[110] TreeNet: A Light Weight Model for Low Bitrate Image Compression

Mahadev Prasad Panda,Purnachandra Rao Makkena,Srivatsa Prativadibhayankaram,Siegfried Fößel,André Kaup

Main category: cs.CV

TL;DR: Proposes TreeNet, a low-complexity image compression model built on a binary tree structure with an attentional feature fusion mechanism; at low bitrates it improves average BD-rate by 4.83% over JPEG AI while reducing model complexity by 87.82%.

Details

Motivation: To reduce the computational complexity of learning-based image compression so that it can be adopted more widely. Method: Proposes TreeNet, which uses a binary tree-structured encoder-decoder architecture and an attentional feature fusion mechanism to integrate features from multiple branches. Result: Evaluated on three widely used datasets, TreeNet improves average BD-rate by 4.83% over JPEG AI at low bitrates while reducing model complexity by 87.82%; ablation studies examine the influence of different latent representations. Conclusion: TreeNet substantially reduces model complexity while retaining efficient reconstruction, offering a practical solution for learning-based image compression. Abstract: Reducing computational complexity remains a critical challenge for the widespread adoption of learning-based image compression techniques. In this work, we propose TreeNet, a novel low-complexity image compression model that leverages a binary tree-structured encoder-decoder architecture to achieve efficient representation and reconstruction. We employ an attentional feature fusion mechanism to effectively integrate features from multiple branches. We evaluate TreeNet on three widely used benchmark datasets and compare its performance against competing methods including JPEG AI, a recent standard in learning-based image compression. At low bitrates, TreeNet achieves an average improvement of 4.83% in BD-rate over JPEG AI, while reducing model complexity by 87.82%. Furthermore, we conduct extensive ablation studies to investigate the influence of various latent representations within TreeNet, offering deeper insights into the factors contributing to reconstruction.

[111] Make-It-Poseable: Feed-forward Latent Posing Model for 3D Humanoid Character Animation

Zhiyang Guo,Ori Zhang,Jax Xiang,Alan Zhao,Wengang Zhou,Houqiang Li

Main category: cs.CV

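TL;DR: Proposes Make-It-Poseable, a framework that recasts 3D character posing as a transformation in latent space, directly manipulating the character's latent representation to produce high-fidelity, topologically consistent poses and to support a range of 3D editing applications.

Details

Motivation: Existing 3D character posing methods fall short in skinning weight prediction, topological integrity, and pose conformance, which limits their robustness and generalizability; this work aims to address these problems. Method: Proposes the Make-It-Poseable framework, which replaces traditional vertex deformation with latent-space deformation; its core is a latent posing transformer that manipulates shape tokens based on skeletal motion, combined with a dense pose representation, a latent-space supervision strategy, and an adaptive completion module. Result: The method delivers superior posing quality, handles topological changes while preserving geometric detail, and naturally supports 3D editing tasks such as part replacement and refinement. Conclusion: By moving character posing into latent space, Make-It-Poseable markedly improves generation quality and robustness and offers a new, effective route for 3D character animation and editing. Abstract: Posing 3D characters is a fundamental task in computer graphics and vision. However, existing methods like auto-rigging and pose-conditioned generation often struggle with challenges such as inaccurate skinning weight prediction, topological imperfections, and poor pose conformance, limiting their robustness and generalizability. To overcome these limitations, we introduce Make-It-Poseable, a novel feed-forward framework that reformulates character posing as a latent-space transformation problem. Instead of deforming mesh vertices as in traditional pipelines, our method reconstructs the character in new poses by directly manipulating its latent representation. At the core of our method is a latent posing transformer that manipulates shape tokens based on skeletal motion. This process is facilitated by a dense pose representation for precise control. To ensure high-fidelity geometry and accommodate topological changes, we also introduce a latent-space supervision strategy and an adaptive completion module. Our method demonstrates superior performance in posing quality. It also naturally extends to 3D editing applications like part replacement and refinement.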

[112] FlowDet: Unifying Object Detection and Generative Transport Flows

Enis Baty,C. P. Bridges,Simon Hadfield

Main category: cs.CV

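TL;DR: FlowDet is the first work to apply modern conditional flow matching to object detection; with simpler, straighter transport paths, its detection performance scales faster with the number of inference steps than that of diffusion models, and it outperforms both generative and non-generative detectors across multiple datasets and settings.

Details

Motivation: Inspired by DiffusionDet, this work seeks to recast object detection as a broader generative transport problem, to overcome the limitations of the curved stochastic paths in diffusion models and to improve the balance between detection performance and inference efficiency. Method: FlowDet uses Conditional Flow Matching to learn straight transport paths from a noise distribution to the distribution of ground-truth bounding boxes, and allows the number of boxes and inference steps to be changed without re-training, yielding an efficient and scalable detection framework. Result: Experiments show gains of up to +3.6% AP and +4.2% AP$_{rare}$ over DiffusionDet on COCO and LVIS, respectively, with particularly strong results in recall-constrained settings and superior performance across multiple backbones and datasets. Conclusion: FlowDet successfully brings conditional flow matching to object detection, providing a generative detection framework that is more efficient and scalable than diffusion models and demonstrating the benefits of its transport path design. Abstract: We present FlowDet, the first formulation of object detection using modern Conditional Flow Matching techniques. This work follows from DiffusionDet, which originally framed detection as a generative denoising problem in the bounding box space via diffusion. We revisit and generalise this formulation to a broader class of generative transport problems, while maintaining the ability to vary the number of boxes and inference steps without re-training. In contrast to the curved stochastic transport paths induced by diffusion, FlowDet learns simpler and straighter paths resulting in faster scaling of detection performance as the number of inference steps grows. We find that this reformulation enables us to outperform diffusion-based detection systems (as well as non-generative baselines) across a wide range of experiments, including various precision/recall operating points using multiple feature backbones and datasets. In particular, when evaluating under recall-constrained settings, we can highlight the effects of the generative transport without over-compensating with large numbers of proposals. This provides gains of up to +3.6% AP and +4.2% AP$_{rare}$ over DiffusionDet on the COCO and LVIS datasets, respectively.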
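
The straight transport paths described above can be illustrated with the standard linear conditional flow matching construction, where a noisy box x0 is moved toward a ground-truth box x1 along x_t = (1 - t) x0 + t x1 with constant target velocity x1 - x0. This is a generic sketch, not FlowDet's implementation: the box encoding (cx, cy, w, h), the helper names, and the oracle velocity function standing in for a learned, image-conditioned network are all assumptions.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from x0 (noise) to x1 (target box) at time t."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Regression target for the velocity field; constant along a straight path."""
    return [b - a for a, b in zip(x0, x1)]

def euler_sample(x0, velocity_fn, steps):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps."""
    x, dt = list(x0), 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

noise_box = [0.9, 0.1, 0.5, 0.5]   # random (cx, cy, w, h) proposal
true_box = [0.4, 0.6, 0.2, 0.3]    # ground-truth box
oracle = lambda x, t: target_velocity(noise_box, true_box)

one_step = euler_sample(noise_box, oracle, steps=1)
ten_steps = euler_sample(noise_box, oracle, steps=10)
```

With a perfectly straight path, one Euler step and ten Euler steps land on the same box, which gives some intuition for why straighter learned paths need fewer inference steps than curved diffusion trajectories.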

[113] Kling-Omni Technical Report

Kling Team,Jialu Chen,Yuanzheng Ci,Xiangyu Du,Zipeng Feng,Kun Gai,Sainan Guo,Feng Han,Jingbin He,Kang He,Xiao Hu,Xiaohua Hu,Boyuan Jiang,Fangyuan Kong,Hang Li,Jie Li,Qingyu Li,Shen Li,Xiaohan Li,Yan Li,Jiajun Liang,Borui Liao,Yiqiao Liao,Weihong Lin,Quande Liu,Xiaokun Liu,Yilun Liu,Yuliang Liu,Shun Lu,Hangyu Mao,Yunyao Mao,Haodong Ouyang,Wenyu Qin,Wanqi Shi,Xiaoyu Shi,Lianghao Su,Haozhi Sun,Peiqin Sun,Pengfei Wan,Chao Wang,Chenyu Wang,Meng Wang,Qiulin Wang,Runqi Wang,Xintao Wang,Xuebo Wang,Zekun Wang,Min Wei,Tiancheng Wen,Guohao Wu,Xiaoshi Wu,Zhenhua Wu,Da Xie,Yingtong Xiong,Yulong Xu,Sile Yang,Zikang Yang,Weicai Ye,Ziyang Yuan,Shenglong Zhang,Shuaiyu Zhang,Yuanxing Zhang,Yufan Zhang,Wenzheng Zhao,Ruiliang Zhou,Yan Zhou,Guosheng Zhu,Yongjie Zhu

Main category: cs.CV

TL;DR: Kling-Omni is an end-to-end generalist video generation framework that produces high-quality video directly from multimodal visual language inputs, accepts text, image, and video context as input, and offers strong reasoning and editing capabilities.

Details

Motivation: Existing video generation systems usually separate their functions and lack a unified framework that integrates generation, editing, and intelligent reasoning, which limits the potential of multimodal content creation. Method: Proposes the Kling-Omni framework, an end-to-end design that encodes diverse inputs (text, images, video) into a unified multimodal representation, combined with large-scale pre-training strategies and infrastructure optimized for inference. Result: Experiments show that Kling-Omni performs strongly in in-context generation, reasoning-based editing, and multimodal instruction following, and can generate intelligent video content of cinematic quality. Conclusion: Kling-Omni is not only a content creation tool but also an important step toward multimodal world simulators that can perceive, reason about, generate, and interact with dynamic, complex worlds. Abstract: We present Kling-Omni, a generalist generative framework designed to synthesize high-fidelity videos directly from multimodal visual language inputs. Adopting an end-to-end perspective, Kling-Omni bridges the functional separation among diverse video generation, editing, and intelligent reasoning tasks, integrating them into a holistic system. Unlike disjointed pipeline approaches, Kling-Omni supports a diverse range of user inputs, including text instructions, reference images, and video contexts, processing them into a unified multimodal representation to deliver cinematic-quality and highly-intelligent video content creation. To support these capabilities, we constructed a comprehensive data system that serves as the foundation for multimodal video creation. The framework is further empowered by efficient large-scale pre-training strategies and infrastructure optimizations for inference. Comprehensive evaluations reveal that Kling-Omni demonstrates exceptional capabilities in in-context generation, reasoning-based editing, and multimodal instruction following. Moving beyond a content creation tool, we believe Kling-Omni is a pivotal advancement toward multimodal world simulators capable of perceiving, reasoning, generating and interacting with the dynamic and complex worlds.

[114] R3ST: A Synthetic 3D Dataset With Realistic Trajectories

Simone Teglia,Claudia Melis Tonti,Francesco Pro,Leonardo Russo,Andrea Alfarano,Leonardo Pentassuglia,Irene Amerini

Main category: cs.CV

TL;DR: Proposes R3ST, a synthetic dataset that combines real-world trajectories with a synthetic 3D environment to address the unrealistic vehicle motion of existing synthetic datasets, providing more accurate and realistic multimodal annotations for research on road vehicle trajectory forecasting.

Details

Motivation: Existing real datasets lack precise ground-truth annotations, while synthetic datasets offer rich annotations but unrealistic vehicle trajectories, which holds back trajectory forecasting models. Method: Builds a synthetic 3D environment and integrates real-world vehicle trajectories from SinD, a dataset recorded from drone footage, to produce R3ST, a synthetic dataset with realistic behavior. Result: R3ST successfully combines real trajectories with a synthetic environment, providing precise multimodal ground-truth annotations and realistic vehicle motion patterns. Conclusion: R3ST closes the gap between synthetic data and realistic trajectories and advances research on vehicle trajectory forecasting in traffic scenes. Abstract: Datasets are essential to train and evaluate computer vision models used for traffic analysis and to enhance road safety. Existing real datasets fit real-world scenarios, capturing authentic road object behaviors; however, they typically lack precise ground-truth annotations. In contrast, synthetic datasets play a crucial role, allowing for the annotation of a large number of frames without additional costs or extra time. However, a general drawback of synthetic datasets is the lack of realistic vehicle motion, since trajectories are generated using AI models or rule-based systems. In this work, we introduce R3ST (Realistic 3D Synthetic Trajectories), a synthetic dataset that overcomes this limitation by generating a synthetic 3D environment and integrating real-world trajectories derived from SinD, a bird's-eye-view dataset recorded from drone footage. The proposed dataset closes the gap between synthetic data and realistic trajectories, advancing the research in trajectory forecasting of road vehicles, offering both accurate multimodal ground-truth annotations and authentic human-driven vehicle trajectories.

[115] KineST: A Kinematics-guided Spatiotemporal State Space Model for Human Motion Tracking from Sparse Signals

Shuting Zhao,Zeyu Xiao,Xinrong Chen

Main category: cs.CV

TL;DR: Proposes KineST, a novel kinematics-guided state space model that reconstructs accurate, temporally coherent full-body poses from sparse signals for AR/VR scenarios.

Details

Motivation: Existing full-body pose reconstruction methods either incur high computational cost or model spatial and temporal dependencies separately, making it hard to balance accuracy, temporal coherence, and efficiency. Method: Proposes the KineST model, which embeds kinematic priors through a kinematics-guided bidirectional scanning strategy, tightly couples spatial and temporal context through mixed spatiotemporal representation learning, and adds a geometric angular velocity loss to improve motion stability. Result: Experiments show that KineST achieves better pose reconstruction accuracy and temporal consistency than existing methods within a lightweight framework. Conclusion: By combining kinematic priors with joint spatiotemporal modeling, KineST delivers efficient, accurate, and stable full-body tracking for AR/VR. Abstract: Full-body motion tracking plays an essential role in AR/VR applications, bridging physical and virtual interactions. However, it is challenging to reconstruct realistic and diverse full-body poses based on sparse signals obtained by head-mounted displays, which are the main devices in AR/VR scenarios. Existing methods for pose reconstruction often incur high computational costs or rely on separately modeling spatial and temporal dependencies, making it difficult to balance accuracy, temporal coherence, and efficiency. To address this problem, we propose KineST, a novel kinematics-guided state space model, which effectively extracts spatiotemporal dependencies while integrating local and global pose perception. The innovation comes from two core ideas. Firstly, in order to better capture intricate joint relationships, the scanning strategy within the State Space Duality framework is reformulated into kinematics-guided bidirectional scanning, which embeds kinematic priors. Secondly, a mixed spatiotemporal representation learning approach is employed to tightly couple spatial and temporal contexts, balancing accuracy and smoothness. Additionally, a geometric angular velocity loss is introduced to impose physically meaningful constraints on rotational variations for further improving motion stability. Extensive experiments demonstrate that KineST has superior performance in both accuracy and temporal consistency within a lightweight framework. Project page: https://kaka-1314.github.io/KineST/

[116] GeoPredict: Leveraging Predictive Kinematics and 3D Gaussian Geometry for Precise VLA Manipulation

Jingjing Qian,Boyao Han,Chen Shi,Lei Xiao,Long Yang,Shaoshuai Shi,Li Jiang

Main category: cs.CV

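TL;DR: GeoPredict is a geometry-aware Vision-Language-Action (VLA) framework that augments a continuous-action policy with predictive kinematic and geometric priors, improving robot performance on tasks that require precise 3D reasoning.

Details

Motivation: Existing VLA models rely mainly on 2D perception and reactive decision-making and struggle with manipulation tasks that need fine-grained 3D spatial understanding, so geometric and motion prediction capabilities are needed to improve reliability and generalization. Method: Proposes the GeoPredict framework, comprising a trajectory-level module that encodes motion history and predicts multi-step 3D keypoint trajectories, and a predictive 3D Gaussian geometry module that forecasts workspace geometry with track-guided refinement along future keypoint trajectories; these modules serve only as training-time supervision through depth-based rendering, and inference needs only lightweight additional query tokens. Result: Experiments on RoboCasa Human-50, LIBERO, and real-world manipulation tasks show that GeoPredict consistently outperforms strong VLA baselines, especially in geometry-intensive and spatially demanding scenarios. Conclusion: By introducing geometric and motion-prediction priors, GeoPredict markedly improves VLA performance on complex 3D manipulation tasks without adding inference cost, providing an effective route to reliable spatial reasoning. Abstract: Vision-Language-Action (VLA) models achieve strong generalization in robotic manipulation but remain largely reactive and 2D-centric, making them unreliable in tasks that require precise 3D reasoning. We propose GeoPredict, a geometry-aware VLA framework that augments a continuous-action policy with predictive kinematic and geometric priors. GeoPredict introduces a trajectory-level module that encodes motion history and predicts multi-step 3D keypoint trajectories of robot arms, and a predictive 3D Gaussian geometry module that forecasts workspace geometry with track-guided refinement along future keypoint trajectories. These predictive modules serve exclusively as training-time supervision through depth-based rendering, while inference requires only lightweight additional query tokens without invoking any 3D decoding. Experiments on RoboCasa Human-50, LIBERO, and real-world manipulation tasks show that GeoPredict consistently outperforms strong VLA baselines, especially in geometry-intensive and spatially demanding scenarios.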

[117] DenseBEV: Transforming BEV Grid Cells into 3D Objects

Marius Dähling,Sebastian Krebs,J. Marius Zöllner

Main category: cs.CV

TL;DR: Proposes DenseBEV, a novel two-stage anchor generation method for multi-camera 3D object detection that uses BEV feature cells directly as anchors and combines BEV-based non-maximum suppression with hybrid temporal modeling, clearly improving detection performance, especially for small objects, and achieving leading results on nuScenes and Waymo.

Details

Motivation: Traditional models generate anchors from random queries or auxiliary networks, which is inefficient and unintuitive; this work aims for a more efficient and intuitive anchor generation scheme that makes full use of the spatial structure and temporal information in BEV features. Method: Treats every grid cell of the BEV feature map as an object query (anchor), applies BEV-based NMS to select useful queries and keep training efficient, and introduces a hybrid temporal modeling mechanism that integrates prior detections to strengthen temporal consistency. Result: On nuScenes, NDS and mAP improve significantly, with pedestrian mAP up by 3.8%; on Waymo, LET-mAP reaches 60.7%, 5.4 percentage points above the previous best method, with especially strong gains on small objects. Conclusion: Through dense BEV queries and an efficient selection mechanism, DenseBEV achieves end-to-end, high-performance multi-camera 3D object detection without post-processing and reaches state-of-the-art results on multiple benchmarks. Abstract: In current research, Bird's-Eye-View (BEV)-based transformers are increasingly utilized for multi-camera 3D object detection. Traditional models often employ random queries as anchors, optimizing them successively. Recent advancements complement or replace these random queries with detections from auxiliary networks. We propose a more intuitive and efficient approach by using BEV feature cells directly as anchors. This end-to-end approach leverages the dense grid of BEV queries, considering each cell as a potential object for the final detection task. As a result, we introduce a novel two-stage anchor generation method specifically designed for multi-camera 3D object detection. To address the scaling issues of attention with a large number of queries, we apply BEV-based Non-Maximum Suppression, allowing gradients to flow only through non-suppressed objects. This ensures efficient training without the need for post-processing. By using BEV features from encoders such as BEVFormer directly as object queries, temporal BEV information is inherently embedded. Building on the temporal BEV information already embedded in our object queries, we introduce a hybrid temporal modeling approach by integrating prior detections to further enhance detection performance. Evaluating our method on the nuScenes dataset shows consistent and significant improvements in NDS and mAP over the baseline, even with sparser BEV grids and therefore fewer initial anchors.
It is particularly effective for small objects, enhancing pedestrian detection with a 3.8% mAP increase on nuScenes and an 8% increase in LET-mAP on Waymo. Applying our method, named DenseBEV, to the challenging Waymo Open dataset yields state-of-the-art performance, achieving a LET-mAP of 60.7%, surpassing the previous best by 5.4%. Code is available at https://github.com/mdaehl/DenseBEV.

[118] Next-Generation License Plate Detection and Recognition System using YOLOv8

Arslan Amin,Rafia Mumtaz,Muhammad Jawad Bashir,Syed Mohammad Hassan Zaidi

Main category: cs.CV

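TL;DR: This study examines the performance of YOLOv8 variants on license plate recognition (LPR) and character recognition, and proposes an efficient detection-and-recognition pipeline suited to intelligent transportation systems.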

Details

Motivation: To improve the real-time performance and accuracy of license plate detection and recognition in complex environments and thereby advance intelligent transportation systems. Method: YOLOv8 Nano is used for license plate detection and YOLOv8 Small for character recognition; detected characters are ordered by their x-axis positions, yielding an optimized recognition pipeline. Result: YOLOv8 Nano reaches a precision of 0.964 and an mAP50 of 0.918 on the LPR task; YOLOv8 Small reaches a precision of 0.92 and an mAP50 of 0.91 on character recognition. Conclusion: The proposed YOLOv8 pipeline achieves high accuracy while remaining computationally efficient, providing a reliable basis for real-world deployment on edge devices and supporting smart-city development. Abstract: In the evolving landscape of traffic management and vehicle surveillance, efficient license plate detection and recognition are indispensable. Historically, many methodologies have tackled this challenge, but consistent real-time accuracy, especially in diverse environments, remains elusive. This study examines the performance of YOLOv8 variants on License Plate Recognition (LPR) and Character Recognition tasks, crucial for advancing Intelligent Transportation Systems. Two distinct datasets were employed for training and evaluation, yielding notable findings. The YOLOv8 Nano variant demonstrated a precision of 0.964 and mAP50 of 0.918 on the LPR task, while the YOLOv8 Small variant exhibited a precision of 0.92 and mAP50 of 0.91 on the Character Recognition task. A custom method for character sequencing was introduced, effectively sequencing the detected characters based on their x-axis positions. An optimized pipeline, utilizing YOLOv8 Nano for LPR and YOLOv8 Small for Character Recognition, is proposed. This configuration not only maintains computational efficiency but also ensures high accuracy, establishing a robust foundation for future real-world deployments on edge devices within Intelligent Transportation Systems. This effort marks a significant stride towards the development of smarter and more efficient urban infrastructures.
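The character sequencing step described above can be sketched as a sort over detection boxes. The tuple layout `(label, x_min, y_min, x_max, y_max)` and the function name are assumptions made for illustration, not the authors' code, and the sketch assumes a single-row plate.

```python
from typing import List, Tuple

# Hypothetical detection format: (label, x_min, y_min, x_max, y_max)
Detection = Tuple[str, float, float, float, float]


def sequence_characters(detections: List[Detection]) -> str:
    """Order detected characters left to right by the x-center of their boxes."""
    ordered = sorted(detections, key=lambda d: (d[1] + d[3]) / 2.0)
    return "".join(label for label, *_ in ordered)


# Detections arrive in arbitrary order from the character detector.
detections = [
    ("B", 30.0, 5.0, 40.0, 25.0),
    ("1", 50.0, 5.0, 60.0, 25.0),
    ("A", 10.0, 5.0, 20.0, 25.0),
]
plate_text = sequence_characters(detections)
```

Two-row plates would need an extra step, for example grouping boxes by y-center before sorting each row.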

[119] Radiology Report Generation with Layer-Wise Anatomical Attention

Emmanuel D. Muñiz-De-León,Jorge A. Rosales-de-Golferichs,Ana S. Muñoz-Rodríguez,Alejandro I. Trejo-Castro,Eduardo de Avila-Armenta,Antonio Martínez-Torteya

Main category: cs.CV

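TL;DR: Proposes a compact image-to-text architecture that pairs a frozen DINOv3 ViT encoder with a GPT-2 decoder using layer-wise anatomical attention to generate the Findings section of a radiology report from a single frontal chest X-ray, substantially improving pathology detection and structural coherence metrics on MIMIC-CXR.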

Details

Motivation: State-of-the-art multimodal radiology report generation systems depend on large-scale training, clinical metadata, and multi-view imaging, which makes them resource-intensive and hard to adopt widely; a lightweight method that needs only a single-view image would improve accessibility. Method: A frozen DINOv3 Vision Transformer serves as the visual encoder and GPT-2 as the text decoder; a layer-wise anatomical attention mechanism built from lung and heart segmentation masks uses hierarchical Gaussian smoothing to steer the model toward clinically relevant regions without adding trainable parameters. Result: On MIMIC-CXR, CheXpert Macro-F1 for five key pathologies rises by 168% (0.083 -> 0.238), Micro-F1 by 146% (0.137 -> 0.337), performance across 14 observations by 86% (0.170 -> 0.316), and RadGraph F1 by 9.7%. Conclusion: Despite the model's small size and image-only input, decoder-level anatomical guidance improves spatial grounding and report coherence in clinically relevant regions, showing that a lightweight design is feasible and effective for radiology report generation. Abstract: Automatic radiology report generation is a promising application of multimodal deep learning, aiming to reduce reporting workload and improve consistency. However, current state-of-the-art (SOTA) systems - such as Multimodal AI for Radiology Applications (MAIRA-2) and Medical Pathways Language Model-Multimodal (MedPaLM-M) - depend on large-scale multimodal training, clinical metadata, and multiple imaging views, making them resource-intensive and inaccessible for most settings. We introduce a compact image-to-text architecture that generates the Findings section of chest X-ray reports from a single frontal image. The model combines a frozen Self-Distillation with No Labels v3 (DINOv3) Vision Transformer (ViT) encoder with a Generative Pre-trained Transformer 2 (GPT-2) decoder enhanced by layer-wise anatomical attention. This mechanism integrates lung and heart segmentation masks through hierarchical Gaussian smoothing, biasing attention toward clinically relevant regions without adding trainable parameters. Evaluated on the official Medical Information Mart for Intensive Care-Chest X-ray (MIMIC-CXR) dataset using Chest Radiograph Expert (CheXpert) and Radiology Graph (RadGraph) metrics, our approach achieved substantial gains: CheXpert Macro-F1 for five key pathologies increased by 168% (0.083 -> 0.238) and Micro-F1 by 146% (0.137 -> 0.337), while broader performance across 14 observations improved by 86% (0.170 -> 0.316). Structural coherence also improved, with RadGraph F1 rising by 9.7%. 
Despite its small size and purely image-conditioned design, the model demonstrates that decoder-level anatomical guidance improves spatial grounding and enhances coherence in clinically relevant regions. The source code is publicly available at: https://github.com/devMuniz02/UDEM-CXR-Reporting-Thesis-2025.
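One way to picture mask-driven attention biasing without trainable parameters is the 1D sketch below. It smooths a binary anatomical mask with a Gaussian whose width depends on the decoder layer, then adds the log of the soft mask to the attention logits. The 1D layout, the per-layer sigma schedule, the `strength` parameter, and all function names are assumptions for illustration; the paper's actual mechanism operates on image patches and may be formulated differently.

```python
import math
from typing import List


def gaussian_smooth(mask: List[float], sigma: float) -> List[float]:
    """Blur a 1D mask with a normalized Gaussian kernel (borders are clamped)."""
    radius = max(1, int(3 * sigma))
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-(k * k) / (2.0 * sigma * sigma)) for k in offsets]
    total = sum(kernel)
    kernel = [w / total for w in kernel]
    n = len(mask)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in zip(offsets, kernel):
            j = min(max(i + k, 0), n - 1)
            acc += w * mask[j]
        out.append(acc)
    return out


def biased_attention(
    logits: List[float], soft_mask: List[float], strength: float = 1.0
) -> List[float]:
    """Softmax over logits plus a fixed (non-learned) log-mask bias."""
    biased = [l + strength * math.log(m + 1e-6) for l, m in zip(logits, soft_mask)]
    peak = max(biased)
    exps = [math.exp(b - peak) for b in biased]
    z = sum(exps)
    return [e / z for e in exps]


# Binary mask over 8 patch positions: 1 marks lung/heart patches (toy data).
mask = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
# Assumed schedule: broad guidance in early layers, sharper in later ones.
layer_sigmas = [2.0, 1.0, 0.5]
uniform_logits = [0.0] * len(mask)
per_layer_attention = [
    biased_attention(uniform_logits, gaussian_smooth(mask, s)) for s in layer_sigmas
]
```

Because the bias is computed from the segmentation masks alone, nothing in this sketch needs to be learned, which matches the "no added trainable parameters" property stated above.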

[120] OPENTOUCH: Bringing Full-Hand Touch to Real-World Interaction

Yuxin Ray Song,Jinzhou Li,Rao Fu,Devin Murphy,Kaichen Zhou,Rishi Shiv,Yaqi Li,Haoyu Xiong,Crystal Elaine Owens,Yilun Du,Yiyue Luo,Xianyi Cheng,Antonio Torralba,Wojciech Matusik,Paul Pu Liang

Main category: cs.CV

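TL;DR: This paper presents OpenTouch, the first in-the-wild egocentric full-hand tactile dataset, containing 5.1 hours of synchronized video-touch-pose data and 2,900 text-annotated clips, to advance research on multimodal perception and embodied learning.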

Details

Motivation: Egocentric visual perception lacks reliable tactile data on full-hand contact (when, where, and how forcefully), and no existing in-the-wild dataset aligns first-person video with full-hand touch signals. Method: The authors build the OpenTouch dataset by capturing 5.1 hours of synchronized egocentric video, full-hand touch, and hand-pose data in the wild and curating 2,900 finely annotated clips; on top of it they define tactile retrieval and classification benchmarks. Result: Experiments show that tactile signals are a compact yet powerful cue for grasp understanding, strengthen cross-modal alignment, and can be reliably retrieved from in-the-wild video queries. Conclusion: OpenTouch is a valuable resource for multimodal egocentric perception, embodied learning, and contact-rich robotic manipulation, and is expected to advance these fields. Abstract: The human hand is our primary interface to the physical world, yet egocentric perception rarely knows when, where, or how forcefully it makes contact. Robust wearable tactile sensors are scarce, and no existing in-the-wild datasets align first-person video with full-hand touch. To bridge the gap between visual perception and physical interaction, we present OpenTouch, the first in-the-wild egocentric full-hand tactile dataset, containing 5.1 hours of synchronized video-touch-pose data and 2,900 curated clips with detailed text annotations. Using OpenTouch, we introduce retrieval and classification benchmarks that probe how touch grounds perception and action. We show that tactile signals provide a compact yet powerful cue for grasp understanding, strengthen cross-modal alignment, and can be reliably retrieved from in-the-wild video queries. By releasing this annotated vision-touch-pose dataset and benchmark, we aim to advance multimodal egocentric perception, embodied learning, and contact-rich robotic manipulation.

[121] GenEval 2: Addressing Benchmark Drift in Text-to-Image Evaluation

Amita Kamath,Kai-Wei Chang,Ranjay Krishna,Luke Zettlemoyer,Yushi Hu,Marjan Ghazvininejad

Main category: cs.CV

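TL;DR: This paper shows that existing text-to-image (T2I) evaluation benchmarks such as GenEval suffer from "benchmark drift", diverging from human judgment over time; it proposes a new benchmark, GenEval 2, and an evaluation method, Soft-TIFA, to improve coverage and agreement with human judgment, and stresses the importance of continually auditing automated evaluation benchmarks.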

Details

Motivation: Existing T2I evaluation benchmarks are static and cannot keep pace with rapidly improving models, which causes benchmark drift and evaluation results that diverge from human judgment; a more robust and sustainable evaluation setup is needed. Method: The authors analyze drift in GenEval, build a new benchmark, GenEval 2, with richer visual primitives and higher compositionality, and propose Soft-TIFA, an evaluation method that combines judgments over visual primitives and reduces reliance on holistic scoring models such as VQAScore. Result: Experiments show that GenEval has drifted far from human judgment (absolute error of up to 17.7%), while GenEval 2 is more challenging and Soft-TIFA agrees better with human judgment and is less prone to drift. Conclusion: Benchmark drift is a major hazard for automated T2I evaluation; GenEval 2 and Soft-TIFA offer an effective mitigation, but continual maintenance and updating of evaluation benchmarks remains essential. Abstract: Automating Text-to-Image (T2I) model evaluation is challenging; a judge model must be used to score correctness, and test prompts must be selected to be challenging for current T2I models but not the judge. We argue that satisfying these constraints can lead to benchmark drift over time, where the static benchmark judges fail to keep up with newer model capabilities. We show that benchmark drift is a significant problem for GenEval, one of the most popular T2I benchmarks. Although GenEval was well-aligned with human judgment at the time of its release, it has drifted far from human judgment over time -- resulting in an absolute error of as much as 17.7% for current models. This level of drift strongly suggests that GenEval has been saturated for some time, as we verify via a large-scale human study. To help fill this benchmarking gap, we introduce a new benchmark, GenEval 2, with improved coverage of primitive visual concepts and higher degrees of compositionality, which we show is more challenging for current models. We also introduce Soft-TIFA, an evaluation method for GenEval 2 that combines judgments for visual primitives, which we show is better aligned with human judgment and argue is less likely to drift from human alignment over time (as compared to more holistic judges such as VQAScore). Although we hope GenEval 2 will provide a strong benchmark for many years, avoiding benchmark drift is far from guaranteed and our work, more generally, highlights the importance of continual audits and improvement for T2I and related automated model evaluation benchmarks.
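The idea of combining per-primitive judgments into one prompt-level score can be illustrated as follows. This is a sketch under stated assumptions: each visual primitive in a prompt is checked by a judge that returns a probability, and the probabilities are aggregated by an arithmetic or geometric mean. The aggregation rules, function names, and example probabilities are illustrative assumptions and are not taken from the Soft-TIFA definition.

```python
import math
from typing import List


def soft_score_arithmetic(probs: List[float]) -> float:
    """Average judge probability over all primitives (lenient aggregate)."""
    return sum(probs) / len(probs)


def soft_score_geometric(probs: List[float]) -> float:
    """Geometric mean: a single failed primitive pulls the score down sharply."""
    log_sum = sum(math.log(max(p, 1e-12)) for p in probs)
    return math.exp(log_sum / len(probs))


# Hypothetical judge outputs for the prompt "two red cubes left of a blue ball":
# P(count is two), P(cubes are red), P(left-of relation holds)
primitive_probs = [0.9, 0.8, 0.1]
am = soft_score_arithmetic(primitive_probs)
gm = soft_score_geometric(primitive_probs)
```

Scoring primitives separately keeps each judge question simple, which is one plausible reason such a metric would be less sensitive to the capabilities of a single holistic judge.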

[122] RePlan: Reasoning-guided Region Planning for Complex Instruction-based Image Editing

Tianyuan Qu,Lei Ke,Xiaohang Zhan,Longxiang Tang,Yuqi Liu,Bohao Peng,Bei Yu,Dong Yu,Jiaya Jia

Main category: cs.CV

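TL;DR: This paper proposes RePlan, an image editing framework for complex instructions and scenes that couples a vision-language planner with a diffusion editor to perform precise, parallel multi-region edits, and performs strongly on the newly introduced IV-Edit benchmark.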

Details

Motivation: Existing instruction-based image editing models perform poorly when complex instructions meet cluttered or ambiguous scenes (IV-Complexity), lacking precise control over fine-grained regions and reliable reasoning. Method: RePlan follows a plan-then-execute design: a vision-language planner decomposes the instruction through step-by-step reasoning and grounds it to target regions, and a diffusion editor applies the edits in parallel using a training-free attention-region injection mechanism. The planner is optimized with GRPO-based reinforcement learning on 1K instruction-only examples. A new benchmark, IV-Edit, is built to evaluate fine-grained editing. Result: In IV-Complex settings, RePlan clearly outperforms strong baselines trained on far larger datasets, improving regional precision and overall fidelity; its effectiveness is validated on the new IV-Edit benchmark. Conclusion: With explicit region-aligned planning and a training-free editing strategy, RePlan handles instruction-visual complexity effectively and offers an efficient, precise solution for instruction-driven image editing in complex scenes. Abstract: Instruction-based image editing enables natural-language control over visual modifications, yet existing models falter under Instruction-Visual Complexity (IV-Complexity), where intricate instructions meet cluttered or ambiguous scenes. We introduce RePlan (Region-aligned Planning), a plan-then-execute framework that couples a vision-language planner with a diffusion editor. The planner decomposes instructions via step-by-step reasoning and explicitly grounds them to target regions; the editor then applies changes using a training-free attention-region injection mechanism, enabling precise, parallel multi-region edits without iterative inpainting. To strengthen planning, we apply GRPO-based reinforcement learning using 1K instruction-only examples, yielding substantial gains in reasoning fidelity and format reliability. We further present IV-Edit, a benchmark focused on fine-grained grounding and knowledge-intensive edits. Across IV-Complex settings, RePlan consistently outperforms strong baselines trained on far larger datasets, improving regional precision and overall fidelity. Our project page: https://replan-iv-edit.github.io

[123] Pixel Seal: Adversarial-only training for invisible image and video watermarking

Tomáš Souček,Pierre Fernandez,Hady Elsahar,Sylvestre-Alvise Rebuffi,Valeriu Lacatusu,Tuan Tran,Tom Sander,Alexandre Mourachko

Main category: cs.CV

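TL;DR: This paper presents Pixel Seal, a new invisible watermarking technique for images and video that markedly improves robustness and imperceptibility through adversarial-only training, a three-stage training schedule, and high-resolution adaptation.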

Details

Motivation: Existing watermarking methods struggle to balance robustness against imperceptibility and perform poorly on high-resolution images and video. Method: The authors propose an adversarial-only training approach that removes unreliable pixel-wise imperceptibility losses, introduce a three-stage training schedule to stabilize convergence, and address high-resolution adaptation with JND-based attenuation and training-time inference simulation. Result: A thorough evaluation of Pixel Seal across image types and transformations shows clear improvements in robustness and imperceptibility over state-of-the-art methods. Conclusion: Pixel Seal provides an efficient and scalable solution for image and video provenance tracing in real-world settings. Abstract: Invisible watermarking is essential for tracing the provenance of digital content. However, training state-of-the-art models remains notoriously difficult, with current approaches often struggling to balance robustness against true imperceptibility. This work introduces Pixel Seal, which sets a new state-of-the-art for image and video watermarking. We first identify three fundamental issues of existing methods: (i) the reliance on proxy perceptual losses such as MSE and LPIPS that fail to mimic human perception and result in visible watermark artifacts; (ii) the optimization instability caused by conflicting objectives, which necessitates exhaustive hyperparameter tuning; and (iii) reduced robustness and imperceptibility of watermarks when scaling models to high-resolution images and videos. To overcome these issues, we first propose an adversarial-only training paradigm that eliminates unreliable pixel-wise imperceptibility losses. Second, we introduce a three-stage training schedule that stabilizes convergence by decoupling robustness and imperceptibility. Third, we address the resolution gap via high-resolution adaptation, employing JND-based attenuation and training-time inference simulation to eliminate upscaling artifacts. We thoroughly evaluate the robustness and imperceptibility of Pixel Seal on different image types and across a wide range of transformations, and show clear improvements over the state-of-the-art. We finally demonstrate that the model efficiently adapts to video via temporal watermark pooling, positioning Pixel Seal as a practical and scalable solution for reliable provenance in real-world image and video settings.

[124] Memory-Enhanced SAM3 for Occlusion-Robust Surgical Instrument Segmentation

Valay Bundele,Mehran Hosseinzadeh,Hendrik P. A. Lensch

Main category: cs.CV

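TL;DR: Proposes ReMeDI-SAM3, a training-free, memory-enhanced extension of SAM3 that tackles occlusion, indiscriminate memory updates, and weak identity recovery in surgical instrument segmentation for endoscopic video, clearly outperforming SAM3 on the EndoVis datasets.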

Details

Motivation: Methods such as SAM3 update memory indiscriminately, have fixed memory capacity, and recover identities poorly after occlusion when segmenting surgical video, so they struggle with frequent occlusions, rapid motion, and specular reflections. Method: ReMeDI-SAM3 has three core components: (i) relevance-aware memory filtering together with a dedicated occlusion-aware memory that stores pre-occlusion frames; (ii) a piecewise interpolation scheme that enlarges the effective memory capacity; and (iii) a feature-based re-identification module with temporal voting for reliable identity disambiguation after occlusion. Result: In the zero-shot setting on EndoVis17 and EndoVis18, mcIoU improves by about 7% and 16% (absolute), respectively, over vanilla SAM3, outperforming even prior training-based approaches. Conclusion: By improving memory management and identity recovery, ReMeDI-SAM3 markedly increases the accuracy and robustness of surgical instrument segmentation in complex video scenes without any additional training, and shows good potential for practical use. Abstract: Accurate surgical instrument segmentation in endoscopic videos is crucial for computer-assisted interventions, yet remains challenging due to frequent occlusions, rapid motion, specular artefacts, and long-term instrument re-entry. While SAM3 provides a powerful spatio-temporal framework for video object segmentation, its performance in surgical scenes is limited by indiscriminate memory updates, fixed memory capacity, and weak identity recovery after occlusions. We propose ReMeDI-SAM3, a training-free memory-enhanced extension of SAM3, that addresses these limitations through three components: (i) relevance-aware memory filtering with a dedicated occlusion-aware memory for storing pre-occlusion frames, (ii) a piecewise interpolation scheme that expands the effective memory capacity, and (iii) a feature-based re-identification module with temporal voting for reliable post-occlusion identity disambiguation. Together, these components mitigate error accumulation and enable reliable recovery after occlusions. Evaluations on EndoVis17 and EndoVis18 under a zero-shot setting show absolute mcIoU improvements of around 7% and 16%, respectively, over vanilla SAM3, outperforming even prior training-based approaches. Project page: https://valaybundele.github.io/remedi-sam3/.
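Temporal voting for post-occlusion identity assignment can be sketched as a sliding-window majority vote. The class below is an assumption-laden illustration: it takes the per-frame candidate identity (for example, from a feature-similarity match) and only commits to an identity once it has collected `min_votes` within the last `window` frames. The class name, window size, and vote threshold are hypothetical and not taken from ReMeDI-SAM3.

```python
from collections import Counter, deque
from typing import Deque, Optional


class TemporalVoter:
    """Commit to an instrument identity only after consistent recent votes."""

    def __init__(self, window: int = 5, min_votes: int = 3) -> None:
        self.history: Deque[str] = deque(maxlen=window)  # drops oldest vote automatically
        self.min_votes = min_votes

    def update(self, candidate_id: str) -> Optional[str]:
        """Add this frame's candidate and return the agreed identity, if any."""
        self.history.append(candidate_id)
        best_id, votes = Counter(self.history).most_common(1)[0]
        return best_id if votes >= self.min_votes else None


# Per-frame candidates after an instrument re-enters the view (toy sequence).
voter = TemporalVoter(window=5, min_votes=3)
decisions = [voter.update(c) for c in ["A", "B", "A", "A"]]
```

Delaying the decision by a few frames trades a little latency for protection against a single noisy match right after the occlusion ends.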

[125] M-PhyGs: Multi-Material Object Dynamics from Video

Norika Wada,Kohei Yamashita,Ryo Kawahara,Ko Nishino

Main category: cs.CV

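TL;DR: This paper proposes M-PhyGs, a new method that estimates the material composition and physical parameters of complex multi-material natural objects (such as flowers) from videos captured in natural settings, combining cascaded 3D and 2D losses with temporal mini-batching, and validates it on the newly built Phlowers dataset.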

Details

Motivation: Existing methods cannot accurately estimate the physical parameters of real objects with complex material composition and geometry, especially in heterogeneous, multi-material cases, so a method that handles natural complexity is needed. Method: Multi-material Physical Gaussians (M-PhyGs) jointly performs material segmentation and continuum mechanical parameter estimation from a short natural video, introducing cascaded 3D and 2D losses and using temporal mini-batching for efficiency. Result: Experiments on the newly built Phlowers dataset show that M-PhyGs effectively segments multi-material regions and accurately recovers mechanical parameters, outperforming existing methods. Conclusion: M-PhyGs estimates the physical properties of multi-material natural objects from real video efficiently and accurately, opening a new route to understanding the physical interactions of complex objects. Abstract: Knowledge of the physical material properties governing the dynamics of a real-world object becomes necessary to accurately anticipate its response to unseen interactions. Existing methods for estimating such physical material parameters from visual data assume homogeneous single-material objects, pre-learned dynamics, or simplistic topologies. Real-world objects, however, are often complex in material composition and geometry, lying outside the realm of these assumptions. In this paper, we particularly focus on flowers as a representative common object. We introduce Multi-material Physical Gaussians (M-PhyGs) to estimate the material composition and parameters of such multi-material complex natural objects from video. From a short video captured in a natural setting, M-PhyGs jointly segments the object into similar materials and recovers their continuum mechanical parameters while accounting for gravity. M-PhyGs achieves this efficiently with newly introduced cascaded 3D and 2D losses, and by leveraging temporal mini-batching. We introduce a dataset, Phlowers, of people interacting with flowers as a novel platform to evaluate the accuracy of this challenging task of multi-material physical parameter estimation. Experimental results on the Phlowers dataset demonstrate the accuracy and effectiveness of M-PhyGs and its components.

[126] LinkedOut: Linking World Knowledge Representation Out of Video LLM for Next-Generation Video Recommendation

Haichao Zhang,Yao Lu,Lichen Wang,Yunzhe Li,Daiwei Chen,Yunpeng Xu,Yun Fu

Main category: cs.CV

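TL;DR: Proposes LinkedOut, a new representation based on video large language models (VLLMs) that enables efficient video recommendation with multi-video input and no handcrafted labels.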

Details

Motivation: VLLMs show promise for video understanding, but high latency, no support for multi-video input, and language-only outputs make them hard to apply to downstream tasks such as video recommendation; a representation that preserves pixel-level detail while exploiting world knowledge is missing. Method: LinkedOut uses a VLLM to extract semantically grounded, knowledge-aware tokens from raw frames, guided by promptable queries and auxiliary modalities, and introduces a cross-layer knowledge-fusion MoE that selects features at the appropriate level of abstraction. Result: The method achieves state-of-the-art video recommendation performance on standard benchmarks, supports fast inference and multi-video histories, and removes the language bottleneck; interpretability analyses confirm the value of layer diversity and layer-wise fusion. Conclusion: LinkedOut is the first VLLM-based video recommendation method that operates directly on raw frames without handcrafted labels, offering a practical path to fully exploiting VLLM world-knowledge priors and visual reasoning. Abstract: Video Large Language Models (VLLMs) unlock world-knowledge-aware video understanding through pretraining on internet-scale data and have already shown promise on tasks such as movie analysis and video question answering. However, deploying VLLMs for downstream tasks such as video recommendation remains challenging, since real systems require multi-video inputs, lightweight backbones, low-latency sequential inference, and rapid response. In practice, (1) decode-only generation yields high latency for sequential inference, (2) typical interfaces do not support multi-video inputs, and (3) constraining outputs to language discards fine-grained visual details that matter for downstream vision tasks. We argue that these limitations stem from the absence of a representation that preserves pixel-level detail while leveraging world knowledge. We present LinkedOut, a representation that extracts VLLM world knowledge directly from video to enable fast inference, supports multi-video histories, and removes the language bottleneck. LinkedOut extracts semantically grounded, knowledge-aware tokens from raw frames using VLLMs, guided by promptable queries and optional auxiliary modalities. We introduce a cross-layer knowledge fusion MoE that selects the appropriate level of abstraction from the rich VLLM features, enabling personalized, interpretable, and low-latency recommendation. To our knowledge, LinkedOut is the first VLLM-based video recommendation method that operates on raw frames without handcrafted labels, achieving state-of-the-art results on standard benchmarks. 
Interpretability studies and ablations confirm the benefits of layer diversity and layer-wise fusion, pointing to a practical path that fully leverages VLLM world-knowledge priors and visual reasoning for downstream vision tasks such as recommendation.
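The notion of selecting an abstraction level across layers can be illustrated with a soft gate over per-layer features. In this sketch each layer contributes one feature vector and a softmax over gate logits decides how much each layer counts. In a real mixture-of-experts the logits would come from a learned router; here they are plain inputs, and all names and values are illustrative assumptions rather than LinkedOut's implementation.

```python
import math
from typing import List


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_layers(layer_feats: List[List[float]], gate_logits: List[float]) -> List[float]:
    """Weighted sum of per-layer feature vectors, with weights from a softmax gate."""
    weights = softmax(gate_logits)
    dim = len(layer_feats[0])
    return [
        sum(w * feats[d] for w, feats in zip(weights, layer_feats))
        for d in range(dim)
    ]


# Two layers with 2-dimensional features (toy values).
shallow = [1.0, 0.0]   # e.g. low-level visual detail
deep = [3.0, 2.0]      # e.g. high-level semantics
balanced = fuse_layers([shallow, deep], gate_logits=[0.0, 0.0])
deep_heavy = fuse_layers([shallow, deep], gate_logits=[0.0, 20.0])
```

Inspecting the gate weights per user or per item is one simple way such a design could support the interpretability analyses mentioned above.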

[127] Instant Expressive Gaussian Head Avatar via 3D-Aware Expression Distillation

Kaiwen Jiang,Xueting Li,Seonwook Park,Ravi Ramamoorthi,Shalini De Mello,Koki Nagano

Main category: cs.CV

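TL;DR: This paper distills knowledge from a 2D diffusion model into a feed-forward encoder to produce 3D-consistent, highly expressive portrait animation from a single image, substantially improving animation quality while keeping inference fast.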

Details

Motivation: 现有2D肖像动画方法在3D一致性与速度上存在不足，而3D感知方法虽快且具3D一致性，但表情细节较差。本文旨在结合两者优势，实现在真实场景（如数字孪生、远程临场）中可用的高质量实时动画。 Method: 通过将基于2D扩散模型的知识蒸馏到前馈编码器中，将野外单张图像转化为3D一致、快速且富有表现力的可动画表示；采用解耦设计，隐式学习动作，避免依赖预定义参数模型，并使用轻量级局部融合策略高效融合3D结构与动画信息。 Result: 该方法在107.31 FPS下实现动画与姿态控制，速度远超现有扩散模型方法，同时动画质量媲美当前最优方法，优于其他在速度与质量间折衷的设计。 Conclusion: 本文方法成功融合了2D扩散模型的高质量表达与3D前馈模型的高效性，在保持3D一致性与高帧率的同时实现了富有表现力的肖像动画，为实际应用提供了可行方案。 Abstract: Portrait animation has witnessed tremendous quality improvements thanks to recent advances in video diffusion models. However, these 2D methods often compromise 3D consistency and speed, limiting their applicability in real-world scenarios, such as digital twins or telepresence. In contrast, 3D-aware facial animation feedforward methods -- built upon explicit 3D representations, such as neural radiance fields or Gaussian splatting -- ensure 3D consistency and achieve faster inference speed, but come with inferior expression details. In this paper, we aim to combine their strengths by distilling knowledge from a 2D diffusion-based method into a feed-forward encoder, which instantly converts an in-the-wild single image into a 3D-consistent, fast yet expressive animatable representation. Our animation representation is decoupled from the face's 3D representation and learns motion implicitly from data, eliminating the dependency on pre-defined parametric models that often constrain animation capabilities. Unlike previous computationally intensive global fusion mechanisms (e.g., multiple attention layers) for fusing 3D structural and animation information, our design employs an efficient lightweight local fusion strategy to achieve high animation expressivity. As a result, our method runs at 107.31 FPS for animation and pose control while achieving comparable animation quality to the state-of-the-art, surpassing alternative designs that trade speed for quality or vice versa. Project website is https://research.nvidia.com/labs/amri/projects/instant4d

[128] FlashPortrait: 6x Faster Infinite Portrait Animation with Adaptive Latent Prediction

Shuyuan Tu,Yueming Pan,Yinming Huang,Xintong Han,Zhen Xing,Qi Dai,Kai Qiu,Chong Luo,Zuxuan Wu

Main category: cs.CV

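TL;DR: FlashPortrait is an end-to-end method based on a diffusion transformer that generates identity-preserving, infinite-length portrait videos with up to 6x faster inference.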

Details

Motivation: Existing diffusion-based methods for long portrait animation struggle to keep identity consistent, so a more efficient and stable approach is needed. Method: A Normalized Facial Expression Block aligns facial features with the diffusion latents; a dynamic sliding-window scheme with weighted blending handles long sequences; and higher-order latent derivatives are used to skip denoising steps and speed up inference. Result: On multiple benchmarks the method achieves better ID preservation and up to 6x faster inference, producing smooth and coherent videos. Conclusion: FlashPortrait addresses both identity consistency and inference efficiency in long portrait animation and is suited to high-quality, infinite-length portrait video generation. Abstract: Current diffusion-based acceleration methods for long-portrait animation struggle to ensure identity (ID) consistency. This paper presents FlashPortrait, an end-to-end video diffusion transformer capable of synthesizing ID-preserving, infinite-length videos while achieving up to 6x acceleration in inference speed. In particular, FlashPortrait begins by computing the identity-agnostic facial expression features with an off-the-shelf extractor. It then introduces a Normalized Facial Expression Block to align facial features with diffusion latents by normalizing them with their respective means and variances, thereby improving identity stability in facial modeling. During inference, FlashPortrait adopts a dynamic sliding-window scheme with weighted blending in overlapping areas, ensuring smooth transitions and ID consistency in long animations. In each context window, based on the latent variation rate at particular timesteps and the derivative magnitude ratio among diffusion layers, FlashPortrait utilizes higher-order latent derivatives at the current timestep to directly predict latents at future timesteps, thereby skipping several denoising steps and achieving 6x speed acceleration. Experiments on benchmarks show the effectiveness of FlashPortrait both qualitatively and quantitatively.
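Weighted blending across overlapping sliding windows can be sketched with linear cross-fade weights. Each "frame" is reduced to a single float so the arithmetic is easy to follow; the fixed stride, the linear ramps, and the function name are assumptions for illustration, whereas the scheme described above is dynamic and operates on video latents.

```python
from typing import List


def blend_windows(windows: List[List[float]], stride: int) -> List[float]:
    """Stitch equal-length windows, cross-fading linearly inside overlaps."""
    window_len = len(windows[0])
    overlap = window_len - stride
    total = stride * (len(windows) - 1) + window_len
    acc = [0.0] * total    # weighted sum of frame values
    wsum = [0.0] * total   # sum of weights per output frame
    for i, win in enumerate(windows):
        start = i * stride
        for k, value in enumerate(win):
            w = 1.0
            if i > 0 and k < overlap:
                # fade in where this window overlaps the previous one
                w = (k + 1) / (overlap + 1)
            if i < len(windows) - 1 and k >= stride:
                # fade out where the next window takes over
                w = min(w, (window_len - k) / (overlap + 1))
            acc[start + k] += w * value
            wsum[start + k] += w
    return [a / s for a, s in zip(acc, wsum)]


# Two windows of 4 frames with a 2-frame overlap (toy values).
blended = blend_windows([[0.0, 0.0, 0.0, 0.0], [3.0, 3.0, 3.0, 3.0]], stride=2)
```

With these inputs the two overlapping frames move gradually from the first window's value to the second's instead of jumping, which is the purpose of blending in the overlap region.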

[129] Alchemist: Unlocking Efficiency in Text-to-Image Model Training via Meta-Gradient Data Selection

Kaixin Ding,Yang Zhou,Xi Chen,Miao Yang,Jiarong Ou,Rui Chen,Xin Tao,Hengshuang Zhao

Main category: cs.CV

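TL;DR: This paper proposes Alchemist, a meta-gradient-based automatic data selection framework that improves the data efficiency of text-to-image model training. In two stages, data rating and data pruning, Alchemist uses a lightweight rater with multi-granularity perception and the Shift-Gsampling strategy to select high-quality image-text pairs, outperforming full-dataset training with only 50% of the data.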

Details

Motivation: Text-to-image models are limited by training data quality, and conventional data filtering relies on manual curation or single heuristic rules, lacking an efficient, automated selection mechanism; a scalable, automatic data selection method is needed to improve training efficiency and generation quality. Method: Alchemist has two stages: (1) data rating, in which a lightweight rater estimates each sample's influence from gradient information, enhanced with multi-granularity perception; and (2) data pruning, in which the Shift-Gsampling strategy selects a high-value subset for training. The framework is based on meta-gradient optimization and iteratively optimizes the model from a data-centric perspective. Result: Experiments on synthetic and web-crawled datasets show that training on the 50% of data selected by Alchemist yields better visual quality and downstream performance than training on the full dataset, confirming the method's effectiveness and data-efficiency gains. Conclusion: Alchemist is the first automatic, scalable, meta-gradient-based data selection framework for text-to-image generation; it markedly improves training efficiency and generation quality and offers a new approach to filtering large-scale image-text data. Abstract: Recent advances in Text-to-Image (T2I) generative models, such as Imagen, Stable Diffusion, and FLUX, have led to remarkable improvements in visual quality. However, their performance is fundamentally limited by the quality of training data. Web-crawled and synthetic image datasets often contain low-quality or redundant samples, which lead to degraded visual fidelity, unstable training, and inefficient computation. Hence, effective data selection is crucial for improving data efficiency. Existing approaches rely on costly manual curation or heuristic scoring based on single-dimensional features in Text-to-Image data filtering. Although meta-learning-based methods have been explored for LLMs, they have not been adapted to image modalities. To this end, we propose Alchemist, a meta-gradient-based framework to select a suitable subset from large-scale text-image data pairs. Our approach automatically learns to assess the influence of each sample by iteratively optimizing the model from a data-centric perspective. Alchemist consists of two key stages: data rating and data pruning. We train a lightweight rater to estimate each sample's influence based on gradient information, enhanced with multi-granularity perception. We then use the Shift-Gsampling strategy to select informative subsets for efficient model training. Alchemist is the first automatic, scalable, meta-gradient-based data selection framework for Text-to-Image model training. 
Experiments on both synthetic and web-crawled datasets demonstrate that Alchemist consistently improves visual quality and downstream performance. Training on an Alchemist-selected 50% of the data can outperform training on the full dataset.

[130] VIVA: VLM-Guided Instruction-Based Video Editing with Reward Optimization

Xiaoyan Cong,Haotian Yang,Angtian Wang,Yizhi Wang,Yiding Yang,Canyu Zhang,Chongyang Ma

Main category: cs.CV

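TL;DR: This paper proposes VIVA, a scalable framework for instruction-based video editing built on vision-language-model-guided encoding and reward optimization, which markedly improves generalization to complex natural-language instructions and editing quality.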

Details

Motivation: Existing diffusion-based video editing methods are usually trained on paired data of simple editing operations and generalize poorly to diverse, complex real-world instructions. Method: VIVA (1) uses a vision-language model (VLM) to encode the textual instruction, the first frame of the source video, and a reference image into visually grounded instruction representations, giving the diffusion transformer fine-grained spatial and semantic context; (2) adds a post-training stage, Edit-GRPO, which applies Group Relative Policy Optimization to directly optimize instruction faithfulness, content preservation, and aesthetic quality using relative rewards; and (3) introduces a pipeline that synthetically generates diverse, high-fidelity paired video-instruction data. Result: Experiments show that VIVA surpasses state-of-the-art methods in instruction following, generalization, and editing quality, particularly on complex real-world instructions. Conclusion: Through VLM-guided representation learning and relative-reward policy optimization, VIVA addresses the generalization problem in instruction-driven video editing and offers a viable path toward more capable and flexible video editing systems. Abstract: Instruction-based video editing aims to modify an input video according to a natural-language instruction while preserving content fidelity and temporal coherence. However, existing diffusion-based approaches are often trained on paired data of simple editing operations, which fundamentally limits their ability to generalize to diverse and complex, real-world instructions. To address this generalization gap, we propose VIVA, a scalable framework for instruction-based video editing that leverages VLM-guided encoding and reward optimization. First, we introduce a VLM-based instructor that encodes the textual instruction, the first frame of the source video, and an optional reference image into visually-grounded instruction representations, providing fine-grained spatial and semantic context for the diffusion transformer backbone. Second, we propose a post-training stage, Edit-GRPO, which adapts Group Relative Policy Optimization to the domain of video editing, directly optimizing the model for instruction-faithful, content-preserving, and aesthetically pleasing edits using relative rewards. Furthermore, we propose a data construction pipeline designed to synthetically generate diverse, high-fidelity paired video-instruction data of basic editing operations. Extensive experiments show that VIVA achieves superior instruction following, generalization, and editing quality over state-of-the-art methods. Website: https://viva-paper.github.io
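The "relative rewards" at the heart of Group Relative Policy Optimization can be shown with a few lines of arithmetic. In the standard GRPO formulation, several outputs are sampled for the same prompt and each output's advantage is its reward standardized against the group's mean and standard deviation. The sketch below shows only that normalization step; the reward values are made up, and how Edit-GRPO defines or combines its rewards is not specified here.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group of samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four edited videos generated from one instruction.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

Because advantages are computed relative to the group, no separate value network is needed, and only the ranking and spread of rewards within a group drive the policy update.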

[131] Flowing from Reasoning to Motion: Learning 3D Hand Trajectory Prediction from Egocentric Human Interaction Videos

Mingfei Chen,Yifan Wang,Zhengqin Li,Homanga Bharadhwaj,Yujin Chen,Chuan Qin,Ziyi Kou,Yuan Tian,Eric Whitmire,Rajinder Sodhi,Hrvoje Benko,Eli Shlizerman,Yue Liu

Main category: cs.CV

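TL;DR: This paper presents the EgoMAN dataset and model for interaction-stage-aware 3D hand trajectory prediction, combining vision-language reasoning with motion generation.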

Details

Motivation: Research on 3D hand trajectory prediction is limited by datasets that lack semantic supervision and by models that only weakly link reasoning and action. Method: The authors present the EgoMAN dataset (219K 6DoF trajectories and 3M structured QA pairs) and the EgoMAN model, a reasoning-to-motion framework that connects vision-language reasoning and motion generation through a trajectory-token interface and is trained progressively. Result: The method produces accurate, stage-aware 3D hand trajectories and generalizes well across real-world scenes. Conclusion: EgoMAN effectively combines reasoning and motion for semantically enriched trajectory prediction, improving performance in complex interaction scenarios. Abstract: Prior works on 3D hand trajectory prediction are constrained by datasets that decouple motion from semantic supervision and by models that weakly link reasoning and action. To address these, we first present the EgoMAN dataset, a large-scale egocentric dataset for interaction stage-aware 3D hand trajectory prediction with 219K 6DoF trajectories and 3M structured QA pairs for semantic, spatial, and motion reasoning. We then introduce the EgoMAN model, a reasoning-to-motion framework that links vision-language reasoning and motion generation via a trajectory-token interface. Trained progressively to align reasoning with motion dynamics, our approach yields accurate and stage-aware trajectories with generalization across real-world scenes.

[132] SceneDiff: A Benchmark and Method for Multiview Object Change Detection

Yuqun Wu,Chih-hao Lin,Henry Che,Aditi Tiwari,Chuhang Zou,Shenlong Wang,Derek Hoiem

Main category: cs.CV

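TL;DR: This paper presents the SceneDiff method and an accompanying multiview change detection benchmark; by leveraging pretrained 3D, segmentation, and image encoding models, it robustly detects object changes in a scene across viewpoints and clearly outperforms existing methods.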

Details

Motivation: Detecting objects that have been added, removed, or moved between images or videos of the same scene captured at different times is important, but viewpoint changes can make unchanged objects appear changed, and existing methods handle this poorly. Method: SceneDiff aligns the two captures in 3D, extracts object regions, and compares spatial and semantic features to detect changes; the authors also build the first multiview change detection benchmark with instance annotations, comprising 350 video pairs. Result: On the multi-view and two-view benchmarks, the method improves AP over existing approaches by 94% and 37.4% (relative), respectively. Conclusion: SceneDiff effectively solves object change detection across viewpoints, clearly outperforms existing methods, and shows good generalization and application potential. Abstract: We investigate the problem of identifying objects that have been added, removed, or moved between a pair of captures (images or videos) of the same scene at different times. Detecting such changes is important for many applications, such as robotic tidying or construction progress and safety monitoring. A major challenge is that varying viewpoints can cause objects to falsely appear changed. We introduce SceneDiff Benchmark, the first multiview change detection benchmark with object instance annotations, comprising 350 diverse video pairs with thousands of changed objects. We also introduce the SceneDiff method, a new training-free approach for multiview object change detection that leverages pretrained 3D, segmentation, and image encoding models to robustly predict across multiple benchmarks. Our method aligns the captures in 3D, extracts object regions, and compares spatial and semantic region features to detect changes. Experiments on multi-view and two-view benchmarks demonstrate that our method outperforms existing approaches by large margins (94% and 37.4% relative AP improvements). The benchmark and code will be publicly released.
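The comparison of spatial and semantic region features can be sketched as a matching problem between two sets of object regions that already share a 3D coordinate frame. In the sketch below a region is a `(centroid, feature)` pair; a region counts as unchanged if the other capture has a region nearby in 3D with a similar feature vector. The greedy matching, the thresholds `max_dist` and `min_sim`, and all names are assumptions for illustration, not the SceneDiff algorithm.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical region format: (3D centroid in the aligned frame, semantic feature)
Region = Tuple[Tuple[float, float, float], Sequence[float]]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, returning 0.0 for zero-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def detect_changes(
    before: List[Region], after: List[Region],
    max_dist: float = 0.2, min_sim: float = 0.8,
) -> Tuple[List[int], List[int]]:
    """Return (indices removed from `before`, indices added in `after`)."""
    matched_after = set()
    removed = []
    for i, (pos_b, feat_b) in enumerate(before):
        hit = None
        for j, (pos_a, feat_a) in enumerate(after):
            if j in matched_after:
                continue
            # unchanged only if both location and appearance agree
            if math.dist(pos_b, pos_a) <= max_dist and cosine(feat_b, feat_a) >= min_sim:
                hit = j
                break
        if hit is None:
            removed.append(i)
        else:
            matched_after.add(hit)
    added = [j for j in range(len(after)) if j not in matched_after]
    return removed, added


before = [((0.0, 0.0, 0.0), [1.0, 0.0]), ((1.0, 0.0, 0.0), [0.0, 1.0])]
after = [((0.05, 0.0, 0.0), [0.9, 0.1]), ((3.0, 0.0, 0.0), [0.0, 1.0])]
removed, added = detect_changes(before, after)
```

In this toy example the second object appears in both lists, which is how a moved object shows up under this simple rule: it is missing at its old location and present at a new one.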

[133] MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Model for Embodied Task Planning

Yuanchen Ju,Yongyuan Liang,Yen-Jen Wang,Nandiraju Gireesh,Yuanliang Ju,Seungjae Lee,Qiao Gu,Elvis Hsieh,Furong Huang,Koushil Sreenath

Main category: cs.CV

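TL;DR: This paper presents MomaGraph, a unified scene representation for mobile manipulation robots in household environments that combines spatial-functional relationships with part-level interactive elements, and releases MomaGraph-Scenes, a large-scale task-driven scene graph dataset, along with the evaluation suite MomaGraph-Bench. A 7B-parameter vision-language model, MomaGraph-R1, trained on this foundation achieves state-of-the-art performance among open-source models on several tasks.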

Details

Motivation: Existing scene representations usually separate spatial and functional relations, ignore object states and temporal updates, and omit the information most relevant to the current task, so they fall short of the compact, semantically rich scene understanding that household mobile manipulators need. Method: The authors propose MomaGraph, a unified scene representation that integrates spatial-functional relationships with interactive part information; build the MomaGraph-Scenes dataset and the MomaGraph-Bench evaluation suite; and develop MomaGraph-R1, a 7B vision-language model trained with reinforcement learning that performs zero-shot task planning under a Graph-then-Plan framework. Result: MomaGraph-R1 reaches 71.6% accuracy on MomaGraph-Bench (+11.4% over the best baseline), generalizes well across public benchmarks, and transfers successfully to real-robot experiments. Conclusion: MomaGraph gives embodied agents in household environments a richer, dynamic, and task-oriented scene representation and advances the practical use of scene graphs in robot navigation and manipulation. Abstract: Mobile manipulators in households must both navigate and manipulate. This requires a compact, semantically rich scene representation that captures where objects are, how they function, and which parts are actionable. Scene graphs are a natural choice, yet prior work often separates spatial and functional relations, treats scenes as static snapshots without object states or temporal updates, and overlooks information most relevant for accomplishing the current task. To address these limitations, we introduce MomaGraph, a unified scene representation for embodied agents that integrates spatial-functional relationships and part-level interactive elements. However, advancing such a representation requires both suitable data and rigorous evaluation, which have been largely missing. We thus contribute MomaGraph-Scenes, the first large-scale dataset of richly annotated, task-driven scene graphs in household environments, along with MomaGraph-Bench, a systematic evaluation suite spanning six reasoning capabilities from high-level planning to fine-grained scene understanding. Built upon this foundation, we further develop MomaGraph-R1, a 7B vision-language model trained with reinforcement learning on MomaGraph-Scenes. MomaGraph-R1 predicts task-oriented scene graphs and serves as a zero-shot task planner under a Graph-then-Plan framework. 
Extensive experiments demonstrate that our model achieves state-of-the-art results among open-source models, reaching 71.6% accuracy on the benchmark (+11.4% over the best baseline), while generalizing across public benchmarks and transferring effectively to real-robot experiments.
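As a rough illustration of the kind of representation the MomaGraph entry describes, the sketch below models a task-oriented scene graph with object states, part-level elements, and edges typed as spatial or functional. Every class name, field, and relation label here is a hypothetical choice made for illustration; none of it is taken from the MomaGraph paper, dataset, or code.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Part:
    """A part of an object, e.g. a handle (hypothetical schema)."""
    name: str
    actionable: bool = True


@dataclass
class ObjectNode:
    """An object with a mutable state and optional parts."""
    name: str
    state: str = "unknown"
    parts: List[Part] = field(default_factory=list)


@dataclass
class Edge:
    """A typed relation; kind must be 'spatial' or 'functional'."""
    source: str
    target: str
    relation: str
    kind: str


@dataclass
class SceneGraph:
    """A scene graph tied to one task description."""
    task: str
    nodes: Dict[str, ObjectNode] = field(default_factory=dict)
    edges: List[Edge] = field(default_factory=list)

    def add_node(self, node: ObjectNode) -> None:
        self.nodes[node.name] = node

    def add_edge(self, edge: Edge) -> None:
        if edge.kind not in ("spatial", "functional"):
            raise ValueError(f"unknown edge kind: {edge.kind}")
        self.edges.append(edge)

    def update_state(self, name: str, state: str) -> None:
        # Temporal update: overwrite the stored state of one object.
        self.nodes[name].state = state

    def actionable_parts(self) -> List[Tuple[str, str]]:
        # (object, part) pairs a planner could act on.
        return [(n.name, p.name)
                for n in self.nodes.values()
                for p in n.parts if p.actionable]


graph = SceneGraph(task="get milk from the fridge")
graph.add_node(ObjectNode("fridge", state="closed",
                          parts=[Part("handle"), Part("shelf", actionable=False)]))
graph.add_node(ObjectNode("milk"))
graph.add_edge(Edge("milk", "fridge", "inside", "spatial"))
graph.add_edge(Edge("handle", "fridge", "opens", "functional"))
```

A planner in the spirit of Graph-then-Plan would first produce such a graph and then read off actionable parts and current states before choosing actions; how MomaGraph-R1 actually encodes its graphs is not specified in the abstract.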

[134] SFTok: Bridging the Performance Gap in Discrete Tokenizers

Qihang Rao,Borui Zhang,Wenzhao Zheng,Jie Zhou,Jiwen Lu

Main category: cs.CV

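TL;DR: SFTok is a discrete image tokenizer that achieves state-of-the-art image reconstruction quality at high compression rates by introducing self-forcing guided reconstruction and a debias-and-fitting training strategy.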

Details

Motivation: Existing discrete tokenizers lag behind continuous tokenizers in multimodal systems and suffer from inconsistency between training and inference. Method: The paper proposes SFTok, which uses a multi-step iterative mechanism, self-forcing guided visual reconstruction, and a debias-and-fitting training strategy to resolve the training-inference inconsistency of discrete tokenizers. Result: At a high compression rate of only 64 tokens per image, SFTok reaches a reconstruction quality of 1.21 rFID on ImageNet and a gFID of 2.29 on class-to-image generation. Conclusion: SFTok substantially improves discrete tokenizer performance, making discrete tokenizers more competitive for high-resolution image generation and encouraging their use in multimodal systems. Abstract: Recent advances in multimodal models highlight the pivotal role of image tokenization in high-resolution image generation. By compressing images into compact latent representations, tokenizers enable generative models to operate in lower-dimensional spaces, thereby improving computational efficiency and reducing complexity. Discrete tokenizers naturally align with the autoregressive paradigm but still lag behind continuous ones, limiting their adoption in multimodal systems. To address this, we propose SFTok, a discrete tokenizer that incorporates a multi-step iterative mechanism for precise reconstruction. By integrating self-forcing guided visual reconstruction and a debias-and-fitting training strategy, SFTok resolves the training-inference inconsistency in the multi-step process, significantly enhancing image reconstruction quality. At a high compression rate of only 64 tokens per image, SFTok achieves state-of-the-art reconstruction quality on ImageNet (rFID = 1.21) and demonstrates exceptional performance in class-to-image generation tasks (gFID = 2.29).

[135] Depth Any Panoramas: A Foundation Model for Panoramic Depth Estimation

Xin Lin,Meixi Song,Dizhe Zhang,Wenxuan Lu,Haodong Li,Bo Du,Ming-Hsuan Yang,Truong Nguyen,Lu Qi

Main category: cs.CV

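TL;DR: This paper presents a panoramic metric depth foundation model that, through a data-in-the-loop paradigm and a new framework design, generalizes well across diverse scene distances.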

Details

Motivation: Existing depth estimation models face domain gaps when handling diverse scenes (especially mixtures of indoor and outdoor, and of synthetic and real data), which makes stable and accurate metric depth prediction difficult. A panoramic depth foundation model that generalizes across domains is therefore needed. Method: The authors build a large-scale mixed dataset that combines public datasets, high-quality synthetic data generated with a UE5 simulator and text-to-image models, and real panoramic images from the web; use a three-stage pseudo-label curation pipeline to reduce domain gaps; adopt DINOv3-Large as the backbone; and introduce a plug-and-play range mask head along with sharpness-centric and geometry-centric optimization to strengthen robustness across distances and geometric consistency across views. Result: The model performs strongly on multiple benchmarks (such as Stanford2D3D, Matterport3D, and Deep360) and shows strong zero-shot generalization, with particularly robust and stable metric depth predictions in diverse real-world scenes. Conclusion: By co-designing data and model, the proposed method narrows inter-domain differences and improves the generalization and reliability of panoramic depth estimation in complex real-world scenes, offering a feasible path toward general-purpose depth perception models. Abstract: In this work, we present a panoramic metric depth foundation model that generalizes across diverse scene distances. We explore a data-in-the-loop paradigm from the view of both data construction and framework design. We collect a large-scale dataset by combining public datasets, high-quality synthetic data from our UE5 simulator and text-to-image models, and real panoramic images from the web. To reduce domain gaps between indoor/outdoor and synthetic/real data, we introduce a three-stage pseudo-label curation pipeline to generate reliable ground truth for unlabeled images. For the model, we adopt DINOv3-Large as the backbone for its strong pre-trained generalization, and introduce a plug-and-play range mask head, sharpness-centric optimization, and geometry-centric optimization to improve robustness to varying distances and enforce geometric consistency across views. Experiments on multiple benchmarks (e.g., Stanford2D3D, Matterport3D, and Deep360) demonstrate strong performance and zero-shot generalization, with particularly robust and stable metric predictions in diverse real-world scenes. The project page can be found at: https://insta360-research-team.github.io/DAP_website/

[136] StereoPilot: Learning Unified and Efficient Stereo Conversion via Generative Priors

Guibao Shen,Yihua Du,Wenhang Ge,Jing He,Chirui Chang,Donghao Zhou,Zhen Yang,Luozhou Wang,Xin Tao,Ying-Cong Chen

Main category: cs.CV

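TL;DR: The paper introduces the UniStereo dataset and the StereoPilot model, which together enable high-quality, efficient monocular-to-stereo video conversion with support for multiple stereo formats.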

Details

Motivation: Existing multi-stage "Depth-Warp-Inpaint" methods suffer from error accumulation, depth ambiguity, and format inconsistency, and no unified large-scale dataset exists to support research on stereo video conversion. Method: The authors build UniStereo, a large-scale unified stereo video dataset, and propose StereoPilot, a feed-forward network that needs neither explicit depth maps nor diffusion sampling, with a learnable domain switcher and a cycle consistency loss to handle different stereo formats. Result: StereoPilot significantly outperforms existing state-of-the-art methods in both visual quality and computational efficiency, and adapts seamlessly to parallel and converged stereo formats. Conclusion: With the UniStereo dataset and the StereoPilot model, the work offers a more efficient, consistent, and practical solution for monocular-to-stereo video conversion and advances content generation for stereoscopic displays. Abstract: The rapid growth of stereoscopic displays, including VR headsets and 3D cinemas, has led to increasing demand for high-quality stereo video content. However, producing 3D videos remains costly and complex, while automatic Monocular-to-Stereo conversion is hindered by the limitations of the multi-stage "Depth-Warp-Inpaint" (DWI) pipeline. This paradigm suffers from error propagation, depth ambiguity, and format inconsistency between parallel and converged stereo configurations. To address these challenges, we introduce UniStereo, the first large-scale unified dataset for stereo video conversion, covering both stereo formats to enable fair benchmarking and robust model training. Building upon this dataset, we propose StereoPilot, an efficient feed-forward model that directly synthesizes the target view without relying on explicit depth maps or iterative diffusion sampling. Equipped with a learnable domain switcher and a cycle consistency loss, StereoPilot adapts seamlessly to different stereo formats and achieves improved consistency. Extensive experiments demonstrate that StereoPilot significantly outperforms state-of-the-art methods in both visual fidelity and computational efficiency. Project page: https://hit-perfect.github.io/StereoPilot/.

[137] AdaTooler-V: Adaptive Tool-Use for Images and Videos

Chaoyang Wang,Kaituo Feng,Dongyang Chen,Zhongyu Wang,Zhixun Li,Sicheng Gao,Meng Meng,Xu Zhou,Manyuan Zhang,Yuzhang Shang,Xiangyu Yue

Main category: cs.CV

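TL;DR: This paper proposes AdaTooler-V, a multimodal large language model that invokes vision tools adaptively and, using reinforcement learning and newly constructed datasets, achieves leading performance on a range of visual reasoning tasks.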

Details

Motivation: Existing open-source multimodal large language models often invoke vision tools blindly, which increases inference overhead and degrades performance; they lack the ability to judge whether tool use is necessary. Method: The paper proposes AT-GRPO, a reinforcement learning algorithm that dynamically adjusts reward scales based on each sample's Tool Benefit Score, and builds two datasets, AdaTooler-V-CoT-100k for SFT cold start and AdaTooler-V-300k for RL training. Result: The model performs strongly across twelve benchmarks; AdaTooler-V-7B reaches 89.8% accuracy on the high-resolution benchmark V*, surpassing GPT-4o and Gemini 1.5 Pro. Conclusion: AdaTooler-V achieves adaptive tool use effectively, improving inference efficiency while strengthening multimodal reasoning, and advances the development of efficient multimodal models. Abstract: Recent advances have shown that multimodal large language models (MLLMs) benefit from multimodal interleaved chain-of-thought (CoT) with vision tool interactions. However, existing open-source models often exhibit blind tool-use reasoning patterns, invoking vision tools even when they are unnecessary, which significantly increases inference overhead and degrades model performance. To this end, we propose AdaTooler-V, an MLLM that performs adaptive tool-use by determining whether a visual problem truly requires tools. First, we introduce AT-GRPO, a reinforcement learning algorithm that adaptively adjusts reward scales based on the Tool Benefit Score of each sample, encouraging the model to invoke tools only when they provide genuine improvements. Moreover, we construct two datasets to support training: AdaTooler-V-CoT-100k for SFT cold start and AdaTooler-V-300k for RL with verifiable rewards across single-image, multi-image, and video data. Experiments across twelve benchmarks demonstrate the strong reasoning capability of AdaTooler-V, outperforming existing methods in diverse visual reasoning tasks. Notably, AdaTooler-V-7B achieves an accuracy of 89.8% on the high-resolution benchmark V*, surpassing the commercial proprietary models GPT-4o and Gemini 1.5 Pro. All code, models, and data are released.
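The abstract above does not give the AT-GRPO formula, so the following is only a loose sketch of the general idea: a per-sample benefit score scales a tool-use term in the reward, and rewards are then normalized within a group of sampled responses, as in group-relative policy optimization. The function names, the `alpha` weight, and the additive form of the reward are all assumptions made for illustration, not the authors' definition.

```python
from statistics import mean, pstdev
from typing import List


def shaped_reward(correct: bool, used_tool: bool,
                  tool_benefit: float, alpha: float = 0.5) -> float:
    """Hypothetical reward: correctness plus a tool term scaled by benefit.

    tool_benefit > 0 means tools tend to help on this sample,
    tool_benefit < 0 means they tend to hurt (assumed convention).
    """
    base = 1.0 if correct else 0.0
    bonus = alpha * tool_benefit if used_tool else 0.0
    return base + bonus


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantages: standardize rewards within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two correct responses to a sample where tools are assumed unhelpful:
# the response that skipped the tool ends up with the larger advantage.
rewards = [
    shaped_reward(correct=True, used_tool=True, tool_benefit=-0.4),
    shaped_reward(correct=True, used_tool=False, tool_benefit=-0.4),
]
advantages = group_advantages(rewards)
```

Under these assumptions, a negative benefit score penalizes unnecessary tool calls and a positive one rewards useful calls, which matches the stated goal of invoking tools only when they provide genuine improvements.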

[138] DVGT: Driving Visual Geometry Transformer

Sicheng Zuo,Zixun Xie,Wenzhao Zheng,Shaoqing Xu,Fang Li,Shengyin Jiang,Long Chen,Zhi-Xin Yang,Jiwen Lu

Main category: cs.CV

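TL;DR: The paper proposes the Driving Visual Geometry Transformer (DVGT), a dense geometry perception model for autonomous driving that reconstructs a global dense 3D point map from unposed multi-view image sequences, requires neither precise camera parameters nor alignment with external sensors, supports arbitrary camera configurations, and significantly outperforms existing methods across diverse scenarios.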

Details

Motivation: A dense 3D geometry perception model that targets autonomous driving and adapts to different scenarios and camera configurations is still lacking. Method: The model extracts image features with a DINO backbone, models geometric relations across images through alternating intra-view local attention, cross-view spatial attention, and cross-frame temporal attention, and uses multiple decoder heads to predict a global point map and the ego poses. Result: Trained and validated on several large-scale driving datasets, including nuScenes, OpenScene, Waymo, KITTI, and DDAD, DVGT significantly outperforms existing methods across different scenarios. Conclusion: DVGT is a general visual geometry perception framework that needs no explicit 3D geometric priors or camera parameters, predicts metric-scaled 3D structure directly from image sequences, and suits diverse autonomous driving deployments. Abstract: Perceiving and reconstructing 3D scene geometry from visual inputs is crucial for autonomous driving. However, a driving-targeted dense geometry perception model that can adapt to different scenarios and camera configurations is still lacking. To bridge this gap, we propose a Driving Visual Geometry Transformer (DVGT), which reconstructs a global dense 3D point map from a sequence of unposed multi-view visual inputs. We first extract visual features for each image using a DINO backbone, and employ alternating intra-view local attention, cross-view spatial attention, and cross-frame temporal attention to infer geometric relations across images. We then use multiple heads to decode a global point map in the ego coordinate of the first frame and the ego poses for each frame. Unlike conventional methods that rely on precise camera parameters, DVGT is free of explicit 3D geometric priors, enabling flexible processing of arbitrary camera configurations. DVGT directly predicts metric-scaled geometry from image sequences, eliminating the need for post-alignment with external sensors. Trained on a large mixture of driving datasets including nuScenes, OpenScene, Waymo, KITTI, and DDAD, DVGT significantly outperforms existing models on various scenarios. Code is available at https://github.com/wzzheng/DVGT.

[139] EasyV2V: A High-quality Instruction-based Video Editing Framework

Jinjie Mai,Chaoyang Wang,Guocheng Gordon Qian,Willi Menapace,Sergey Tulyakov,Bernard Ghanem,Peter Wonka,Ashkan Mirzaei

Main category: cs.CV

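TL;DR: This paper proposes EasyV2V, a simple and effective instruction-based video editing framework with contributions in data, model, and control that achieves state-of-the-art video editing results.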

Details

Motivation: Video editing still faces challenges in consistency, control, and generalization, and existing methods are not flexible or efficient enough. Method: The authors compose existing expert models to generate diverse video pairs, lift image edit pairs into videos, add dense-captioned clips and transition supervision, and build on a pretrained text-to-video model trained with sequence concatenation and lightweight LoRA fine-tuning. Result: EasyV2V supports several input forms, such as video+text and video+mask+text, and delivers superior editing results, surpassing concurrent systems and commercial tools on multiple benchmarks. Conclusion: Through a simplified design and a unified control mechanism, EasyV2V provides an efficient, general solution for instruction-driven video editing. Abstract: While image editing has advanced rapidly, video editing remains less explored, facing challenges in consistency, control, and generalization. We study the design space of data, architecture, and control, and introduce EasyV2V, a simple and effective framework for instruction-based video editing. On the data side, we compose existing experts with fast inverses to build diverse video pairs, lift image edit pairs into videos via single-frame supervision and pseudo pairs with shared affine motion, mine dense-captioned clips for video pairs, and add transition supervision to teach how edits unfold. On the model side, we observe that pretrained text-to-video models possess editing capability, motivating a simplified design. Simple sequence concatenation for conditioning with light LoRA fine-tuning suffices to train a strong model. For control, we unify spatiotemporal control via a single mask mechanism and support optional reference images. Overall, EasyV2V works with flexible inputs, e.g., video+text, video+mask+text, video+mask+reference+text, and achieves state-of-the-art video editing results, surpassing concurrent and commercial systems. Project page: https://snap-research.github.io/easyv2v/

[140] Differences That Matter: Auditing Models for Capability Gap Discovery and Rectification

Qihao Liu,Chengzhi Mao,Yaojie Liu,Alan Yuille,Wen-Sheng Chu

Main category: cs.CV

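TL;DR: The paper proposes AuditDM, a framework that trains a multimodal large model as an auditor via reinforcement learning to actively discover and rectify model divergence, revealing many failure modes that can then be used to improve the models.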

Details

Motivation: Existing evaluation methods lack interpretability and do not adequately expose capability gaps in multimodal large models. Method: A multimodal large model is fine-tuned with reinforcement learning to act as an auditor that generates challenging questions and counterfactual images maximizing disagreement among target models, automatically uncovering failure modes. Result: On SoTA models such as Gemma-3 and PaliGemma-2, AuditDM finds more than 20 distinct failure types; fine-tuning on these findings consistently improves performance across 16 benchmarks and lets a 3B model surpass a 28B model. Conclusion: As the returns from data scaling diminish, targeted model auditing offers an effective path to model diagnosis and improvement. Abstract: Conventional evaluation methods for multimodal LLMs (MLLMs) lack interpretability and are often insufficient to fully disclose significant capability gaps across models. To address this, we introduce AuditDM, an automated framework that actively discovers and rectifies MLLM failure modes by auditing their divergence. AuditDM fine-tunes an MLLM as an auditor via reinforcement learning to generate challenging questions and counterfactual images that maximize disagreement among target models. Once trained, the auditor uncovers diverse, interpretable exemplars that reveal model weaknesses and serve as annotation-free data for rectification. When applied to SoTA models like Gemma-3 and PaliGemma-2, AuditDM discovers more than 20 distinct failure types. Fine-tuning on these discoveries consistently improves all models across 16 benchmarks, and enables a 3B model to surpass its 28B counterpart. Our results suggest that as data scaling hits diminishing returns, targeted model auditing offers an effective path to model diagnosis and improvement.
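The entry above says the auditor is rewarded for questions that maximize disagreement among target models but does not define the disagreement measure. One simple stand-in, shown below purely as an assumption, is the fraction of model pairs that give different answers to the same question; the function name and the exact-match comparison are hypothetical and not taken from AuditDM.

```python
from itertools import combinations
from typing import List


def pairwise_disagreement(answers: List[str]) -> float:
    """Fraction of model pairs whose answers differ (0.0 = full agreement).

    `answers` holds one answer string per target model for a single
    auditor-generated question. Exact string match is an assumed,
    deliberately crude notion of agreement.
    """
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)


# Hypothetical answers from three target models to one question.
score = pairwise_disagreement(["cat", "cat", "dog"])
```

A score like this could serve as a scalar reward for the auditor in a reinforcement learning loop; a real system would likely need a more tolerant comparison than exact string equality.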

[141] Next-Embedding Prediction Makes Strong Vision Learners

Sihan Xu,Ziqiao Ma,Wenhao Chai,Xuweiyi Chen,Weiyang Jin,Joyce Chai,Saining Xie,Stella X. Yu

Main category: cs.CV

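TL;DR: This paper proposes Next-Embedding Predictive Autoregression (NEPA), a visual self-supervised learning method that performs generative pretraining by predicting image patch embeddings, without pixel reconstruction, discretization, or contrastive loss, and achieves strong results on tasks such as ImageNet and ADE20K.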

Details

Motivation: Inspired by the success of generative pretraining in natural language, the authors explore whether similar principles can be applied in vision, shifting from learning representations to learning models. Method: Using causal masking and stop gradient, the model is trained to predict future patch embeddings directly from past ones; a plain Transformer is pretrained on ImageNet-1k with next-embedding prediction as the only learning objective. Result: After fine-tuning, the ViT-B and ViT-L models reach 83.8% and 85.3% top-1 accuracy on ImageNet-1K, respectively, and transfer well to semantic segmentation on ADE20K. Conclusion: Generative pretraining from embeddings offers a simple, scalable, and potentially modality-agnostic paradigm for visual self-supervised learning. Abstract: Inspired by the success of generative pretraining in natural language, we ask whether the same principles can yield strong self-supervised visual learners. Instead of training models to output features for downstream use, we train them to generate embeddings to perform predictive tasks directly. This work explores such a shift from learning representations to learning models. Specifically, models learn to predict future patch embeddings conditioned on past ones, using causal masking and stop gradient, which we refer to as Next-Embedding Predictive Autoregression (NEPA). We demonstrate that a simple Transformer pretrained on ImageNet-1k with next embedding prediction as its sole learning objective is effective - no pixel reconstruction, discrete tokens, contrastive loss, or task-specific heads. This formulation retains architectural simplicity and scalability, without requiring additional design complexity. NEPA achieves strong results across tasks, attaining 83.8% and 85.3% top-1 accuracy on ImageNet-1K with ViT-B and ViT-L backbones after fine-tuning, and transferring effectively to semantic segmentation on ADE20K. We believe generative pretraining from embeddings provides a simple, scalable, and potentially modality-agnostic alternative to visual self-supervised learning.
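To make the next-embedding objective described above more tangible, here is a small pure-Python sketch of its shape: a causal mask that lets position i see only positions up to i, and a loss that compares the prediction made at position t with the embedding at position t+1. The use of negative cosine similarity is an assumption (the abstract does not name the distance), and copying the targets merely stands in for stop gradient, since this sketch has no autograd. All function names are illustrative.

```python
import math
from typing import List

Vector = List[float]


def causal_mask(n: int) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def next_embedding_loss(predictions: List[Vector],
                        embeddings: List[Vector]) -> float:
    """Mean negative cosine similarity between predictions[t] and embeddings[t+1].

    Targets are copied so they act as fixed constants, a stand-in for
    stop gradient in a framework with automatic differentiation.
    The last prediction has no target and is ignored.
    """
    targets = [list(e) for e in embeddings[1:]]
    sims = [cosine(p, t) for p, t in zip(predictions[:-1], targets)]
    return -sum(sims) / len(sims)


patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# A predictor that outputs exactly the next embedding at each position.
perfect = patches[1:] + [[1.0, 0.0]]
loss = next_embedding_loss(perfect, patches)
```

With a perfect predictor the loss approaches its minimum of -1; in an actual training setup the predictions would come from a Transformer applied under the causal mask.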

[142] Generative Refocusing: Flexible Defocus Control from a Single Image

Chun-Wei Tuan Mu,Jia-Bin Huang,Yu-Lun Liu

Main category: cs.CV

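TL;DR: This paper proposes Generative Refocusing, a two-step method that uses DeblurNet and BokehNet to recover an all-in-focus image from a single image and generate controllable depth-of-field effects; its semi-supervised training combines synthetic paired data with real unpaired bokeh images, improving realism and control flexibility.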

Details

Motivation: Existing single-image refocusing methods depend on all-in-focus inputs and synthetic data and offer limited aperture control, which makes realistic bokeh hard to produce; a more flexible and realistic method is needed. Method: The paper proposes a two-step approach: DeblurNet first recovers an all-in-focus image from various blurry inputs, and BokehNet then generates controllable bokeh. Training is semi-supervised, combining synthetic paired data with unpaired real bokeh images whose EXIF metadata is used to capture real optical characteristics. Result: The method achieves state-of-the-art performance on defocus deblurring, bokeh synthesis, and refocusing benchmarks, and supports text-guided adjustments and custom aperture shapes. Conclusion: Generative Refocusing delivers high-quality, controllable single-image refocusing, removes the dependence on all-in-focus inputs and purely synthetic data, and markedly improves realism and flexibility. Abstract: Depth-of-field control is essential in photography, but getting the perfect focus often takes several tries or special equipment. Single-image refocusing is still difficult. It involves recovering sharp content and creating realistic bokeh. Current methods have significant drawbacks. They need all-in-focus inputs, depend on synthetic data from simulators, and have limited control over aperture. We introduce Generative Refocusing, a two-step process that uses DeblurNet to recover all-in-focus images from various inputs and BokehNet for creating controllable bokeh. Our main innovation is semi-supervised training. This method combines synthetic paired data with unpaired real bokeh images, using EXIF metadata to capture real optical characteristics beyond what simulators can provide. Our experiments show we achieve top performance in defocus deblurring, bokeh synthesis, and refocusing benchmarks. Additionally, our Generative Refocusing allows text-guided adjustments and custom aperture shapes.

[143] The World is Your Canvas: Painting Promptable Events with Reference Images, Trajectories, and Text

Hanlin Wang,Hao Ouyang,Qiuyu Wang,Yue Yu,Yihao Meng,Wen Wang,Ka Leong Cheng,Shuailei Ma,Qingyan Bai,Yixuan Li,Cheng Chen,Yanhong Zeng,Xing Zhu,Yujun Shen,Qifeng Chen

Main category: cs.CV

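TL;DR: WorldCanvas is a multimodal framework that combines text, trajectories, and reference images to generate promptable, controllable videos of rich world events, supporting multi-agent interactions, object entry and exit, and identity consistency, and moving world models from passive prediction toward user-directed interactive simulation.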

Details

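Motivation: Existing methods are limited when generating videos of complex world events that involve semantic intent, motion control, and visual consistency; in particular, they offer weak support for multi-agent interactions, objects entering and leaving the scene, and identity preservation. A more powerful, controllable, and semantically rich generation framework is therefore needed. Method: The paper proposes the WorldCanvas framework, which combines trajectories (encoding motion, timing, and visibility) with natural language (expressing semantic intent) and reference images (anchoring visual identity), achieves fine-grained control over complex world events through multimodal conditioning, and uses a trajectory-guided diffusion model to generate spatiotemporally coherent videos. Result: The generated videos are temporally coherent and exhibit emergent consistency, preserving identity and scene consistency even when objects temporarily disappear; the framework handles complex scenarios such as multi-agent interactions, object entry and exit, reference-guided appearance control, and counterintuitive events. Conclusion: By fusing multimodal inputs, WorldCanvas achieves highly controllable generation of world events, turning world models from passive prediction tools into interactive simulators that users can actively shape, and broadening their potential in simulation and creative applications. Abstract: We present WorldCanvas, a framework for promptable world events that enables rich, user-directed simulation by combining text, trajectories, and reference images. Unlike text-only approaches and existing trajectory-controlled image-to-video methods, our multimodal approach combines trajectories (encoding motion, timing, and visibility) with natural language for semantic intent and reference images for visual grounding of object identity, enabling the generation of coherent, controllable events that include multi-agent interactions, object entry/exit, reference-guided appearance, and counterintuitive events. The resulting videos demonstrate not only temporal coherence but also emergent consistency, preserving object identity and scene despite temporary disappearance. By supporting expressive world event generation, WorldCanvas advances world models from passive predictors to interactive, user-shaped simulators. Our project page is available at: https://worldcanvas.github.io/.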

Table of Contents

cs.CL

[1] TabReX : Tabular Referenceless eXplainable Evaluation

[2] Social Story Frames: Contextual Reasoning about Narrative Intent and Reception

[3] BRAID: Bounded Reasoning for Autonomous Inference and Decisions

[4] Examining the Utility of Self-disclosure Types for Modeling Annotators of Social Norms

[5] Are We on the Right Way to Assessing LLM-as-a-Judge?

[6] Convolutional Lie Operator for Sentence Classification

[7] MRG-R1: Reinforcement Learning for Clinically Aligned Medical Report Generation

[8] Decoding Fake Narratives in Spreading Hateful Stories: A Dual-Head RoBERTa Model with Multi-Task Learning

[9] A Domain-Adapted Pipeline for Structured Information Extraction from Police Incident Announcements on Social Media

[10] Mitigating Hallucinations in Healthcare LLMs with Granular Fact-Checking and Domain-Specific Adaptation

[11] An Information-Theoretic Framework for Robust Large Language Model Editing

[12] LoPA: Scaling dLLM Inference via Lookahead Parallel Decoding

[13] Sigma-Moe-Tiny Technical Report

[14] Evaluating OpenAI GPT Models for Translation of Endangered Uralic Languages: A Comparison of Reasoning and Non-Reasoning Architectures

[15] Hacking Neural Evaluation Metrics with Single Hub Text

[16] Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs

[17] Bridging the Reality Gap: Efficient Adaptation of ASR systems for Challenging Low-Resource Domains

[18] Plain language adaptations of biomedical text using LLMs: Comparision of evaluation metrics

[19] UM_FHS at the CLEF 2025 SimpleText Track: Comparing No-Context and Fine-Tune Approaches for GPT-4.1 Models in Sentence and Document-Level Text Simplification

[20] Refusal Steering: Fine-grained Control over LLM Refusal Behaviour for Sensitive Topics

[21] JustRL: Scaling a 1.5B LLM with a Simple RL Recipe

[22] GinSign: Grounding Natural Language Into System Signatures for Temporal Logic Translation

[23] From Facts to Conclusions : Integrating Deductive Reasoning in Retrieval-Augmented LLMs

[24] Exploration of Augmentation Strategies in Multi-modal Retrieval-Augmented Generation for the Biomedical Domain: A Case Study Evaluating Question Answering in Glycobiology

[25] Grammar-Forced Translation of Natural Language to Temporal Logic using LLMs

[26] What Do Prosody and Text Convey? Characterizing How Meaningful Information is Distributed Across Multiple Channels

[27] LLMCache: Layer-Wise Caching Strategies for Accelerated Reuse in Transformer Inference

[28] AdaSearch: Balancing Parametric Knowledge and Search in Large Language Models via Reinforcement Learning

[29] Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image

[30] In-Context Algebra

[31] Constructive Circuit Amplification: Improving Math Reasoning in LLMs via Targeted Sub-Network Updates

cs.CV

[32] Two-Step Data Augmentation for Masked Face Detection and Recognition: Turning Fake Masks to Real

[33] Seeing Beyond Words: Self-Supervised Visual Learning for Multimodal Large Language Models

[34] City Navigation in the Wild: Exploring Emergent Navigation from Web-Scale Knowledge in MLLMs

[35] R4: Retrieval-Augmented Reasoning for Vision-Language Models in 4D Spatio-Temporal Space

[36] The Perceptual Observatory Characterizing Robustness and Grounding in MLLMs

[37] Seeing is Believing (and Predicting): Context-Aware Multi-Human Behavior Prediction with Vision Language Models

[38] From Words to Wavelengths: VLMs for Few-Shot Multispectral Object Detection

[39] Are vision-language models ready to zero-shot replace supervised classification models in agriculture?

[40] Eyes on the Grass: Biodiversity-Increasing Robotic Mowing Using Deep Visual Embeddings

[41] CoVAR: Co-generation of Video and Action for Robotic Manipulation via Multi-Modal Diffusion

[42] Driving in Corner Case: A Real-World Adversarial Closed-Loop Evaluation Platform for End-to-End Autonomous Driving

[43] FOD-Diff: 3D Multi-Channel Patch Diffusion Model for Fiber Orientation Distribution

[44] Auto-Vocabulary 3D Object Detection

[45] LAPX: Lightweight Hourglass Network with Global Context

[46] Collimator-assisted high-precision calibration method for event cameras

[47] TurboDiffusion: Accelerating Video Diffusion Models by 100-200 Times

[48] Flexible Camera Calibration using a Collimator System

[49] Interaction-via-Actions: Cattle Interaction Detection with Joint Learning of Action-Interaction Latent Space

[50] ResDynUNet++: A nested U-Net with residual dynamic convolution blocks for dual-spectral CT

[51] SegGraph: Leveraging Graphs of SAM Segments for Few-Shot 3D Part Segmentation

[52] C-DGPA: Class-Centric Dual-Alignment Generative Prompt Adaptation

[53] Towards Closing the Domain Gap with Event Cameras

[54] Avatar4D: Synthesizing Domain-Specific 4D Humans for Real-World Pose Estimation

[55] Visual Alignment of Medical Vision-Language Models for Grounded Radiology Report Generation

[56] Open Ad-hoc Categorization with Contextualized Feature Learning

[57] Enhanced 3D Shape Analysis via Information Geometry

[58] Learning High-Quality Initial Noise for Single-View Synthesis with Diffusion Models

[59] Image Compression Using Singular Value Decomposition

[60] ARMFlow: AutoRegressive MeanFlow for Online 3D Human Reaction Generation

[61] AI-Powered Dermatological Diagnosis: From Interpretable Models to Clinical Implementation A Comprehensive Framework for Accessible and Trustworthy Skin Disease Detection

[62] Semi-Supervised Multi-View Crowd Counting by Ranking Multi-View Fusion Models

[63] Pixel Super-Resolved Fluorescence Lifetime Imaging Using Deep Learning

[64] TextEditBench: Evaluating Reasoning-aware Text Editing Beyond Rendering

[65] GFLAN: Generative Functional Layouts

[66] MACL: Multi-Label Adaptive Contrastive Learning Loss for Remote Sensing Image Retrieval

[67] PixelArena: A benchmark for Pixel-Precision Visual Intelligence

[68] LaverNet: Lightweight All-in-one Video Restoration via Selective Propagation

[69] Ridge Estimation-Based Vision and Laser Ranging Fusion Localization Method for UAVs

[70] QUIDS: Quality-informed Incentive-driven Multi-agent Dispatching System for Mobile Crowdsensing

[71] Collaborative Edge-to-Server Inference for Vision-Language Models

[72] GMODiff: One-Step Gain Map Refinement with Diffusion Priors for HDR Reconstruction

[73] EverybodyDance: Bipartite Graph-Based Identity Correspondence for Multi-Character Animation

[74] Factorized Video Generation: Decoupling Scene Construction and Temporal Synthesis in Text-to-Video Diffusion Models

[75] Adaptive Frequency Domain Alignment Network for Medical image segmentation

[76] Using Gaussian Splats to Create High-Fidelity Facial Geometry and Texture

[77] BrepLLM: Native Boundary Representation Understanding with Large Language Models