cs.CL

[1] TabReX : Tabular Referenceless eXplainable Evaluation

Tejas Anvekar,Juhna Park,Aparna Garimella,Vivek Gupta

Main category: cs.CL

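TL;DR: Proposes TabReX, a reference-less, property-driven graph-reasoning framework for evaluating the quality of tables generated by large language models.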

Details

Motivation: Existing metrics for evaluating LLM-generated tables either ignore structure or depend on fixed references, and so lack generality and interpretability.

Method: Converts the source text and the generated table into canonical knowledge graphs, aligns them through LLM-guided matching, and computes interpretable scores that measure structural and factual fidelity.

Result: TabReX achieves the highest correlation with expert rankings, is more stable under complex perturbations, and supports fine-grained analysis of models and prompts.

Conclusion: TabReX establishes a new paradigm for trustworthy, explainable evaluation of structured generation systems.

Abstract: Evaluating the quality of tables generated by large language models (LLMs) remains an open challenge: existing metrics either flatten tables into text, ignoring structure, or rely on fixed references that limit generalization. We present TabReX, a reference-less, property-driven framework for evaluating tabular generation via graph-based reasoning. TabReX converts both source text and generated tables into canonical knowledge graphs, aligns them through an LLM-guided matching process, and computes interpretable, rubric-aware scores that quantify structural and factual fidelity. The resulting metric provides controllable trade-offs between sensitivity and specificity, yielding human-aligned judgments and cell-level error traces. To systematically assess metric robustness, we introduce TabReX-Bench, a large-scale benchmark spanning six domains and twelve planner-driven perturbation types across three difficulty tiers. Empirical results show that TabReX achieves the highest correlation with expert rankings, remains stable under harder perturbations, and enables fine-grained model-vs-prompt analysis, establishing a new paradigm for trustworthy, explainable evaluation of structured generation systems.
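The graph-based comparison described in the TabReX abstract can be illustrated with a very rough sketch. Everything here is an assumption made for illustration: the triple layout `(row_key, column, value)`, the function names, and above all the use of exact set matching in place of the LLM-guided alignment the paper describes. It is not the authors' implementation, only a way to see how precision, recall, and cell-level error traces could fall out of a triple comparison.

```python
def table_to_triples(header, rows):
    """Flatten a table into (row_key, column, value) triples.

    Assumption: the first column identifies the row.
    """
    triples = set()
    for row in rows:
        key = row[0]
        for column, value in zip(header[1:], row[1:]):
            triples.add((key, column, value))
    return triples


def fidelity_report(source_triples, table_triples):
    """Compare table triples with source triples by exact match.

    Exact matching is a stand-in for LLM-guided alignment (hypothetical).
    """
    matched = source_triples & table_triples
    return {
        "precision": len(matched) / len(table_triples) if table_triples else 0.0,
        "recall": len(matched) / len(source_triples) if source_triples else 0.0,
        # Cells in the table that the source does not support.
        "unsupported_cells": sorted(table_triples - source_triples),
        # Source facts that the table omits.
        "missing_facts": sorted(source_triples - table_triples),
    }


# Toy data: facts assumed to have been extracted from the source text.
source = {("Alice", "age", "30"), ("Alice", "city", "Paris"), ("Bob", "age", "25")}
table = table_to_triples(["name", "age", "city"], [["Alice", "30", "Rome"]])
report = fidelity_report(source, table)
print(report["precision"], report["unsupported_cells"])
```

In this toy case the table gets one of its two cells right, so the unsupported cell `("Alice", "city", "Rome")` serves as the error trace.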

Joel Mire,Maria Antoniak,Steven R. Wilson,Zexin Ma,Achyutarama R. Ganti,Andrew Piper,Maarten Sap

Main category: cs.CL

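TL;DR: This paper introduces SocialStoryFrames, a formalism for capturing inferences about how readers respond to stories, and develops two models, SSF-Generator and SSF-Classifier, validated through human surveys and expert annotations. Applied to a dataset of 6,140 social media stories, the models are used to analyze the frequency and interdependence of storytelling intents and the differences in narrative practices across communities.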

Details

Motivation: Existing computational models struggle to capture the rich interpretive, affective, and evaluative responses that stories evoke in readers; the lack of fine-grained modeling of reader response limits in-depth analysis of narrative behavior.

Method: Builds a taxonomy of reader responses grounded in narrative theory, linguistic pragmatics, and psychology, and proposes the SocialStoryFrames formalism; designs two models, SSF-Generator and SSF-Classifier, validated through human surveys (N=382) and expert annotations, respectively; and conducts an empirical analysis on SSF-Corpus (6,140 social media stories).

Result: The models effectively identify author intent, causal inferences, affective responses, and value judgments; the analysis reveals the distribution and interdependence of storytelling intents across online communities, as well as differences in the diversity of narrative practices.

Conclusion: By combining fine-grained, context-sensitive modeling with a generic taxonomy of reader responses, SocialStoryFrames provides new tools and a methodological basis for studying storytelling in online communities at scale.

Abstract: Reading stories evokes rich interpretive, affective, and evaluative responses, such as inferences about narrative intent or judgments about characters. Yet, computational models of reader response are limited, preventing nuanced analyses. To address this gap, we introduce SocialStoryFrames, a formalism for distilling plausible inferences about reader response, such as perceived author intent, explanatory and predictive reasoning, affective responses, and value judgments, using conversational context and a taxonomy grounded in narrative theory, linguistic pragmatics, and psychology. We develop two models, SSF-Generator and SSF-Classifier, validated through human surveys (N=382 participants) and expert annotations, respectively. We conduct pilot analyses to showcase the utility of the formalism for studying storytelling at scale. Specifically, applying our models to SSF-Corpus, a curated dataset of 6,140 social media stories from diverse contexts, we characterize the frequency and interdependence of storytelling intents, and we compare and contrast narrative practices (and their diversity) across communities. By linking fine-grained, context-sensitive modeling with a generic taxonomy of reader responses, SocialStoryFrames enable new research into storytelling in online communities.

[3] BRAID: Bounded Reasoning for Autonomous Inference and Decisions

Armağan Amcalar,Eyup Cinar

Main category: cs.CL

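TL;DR: This paper proposes BRAID, a structured prompting framework based on Mermaid instruction graphs that bounds the reasoning process to improve the reasoning accuracy and cost efficiency of large language models across multiple benchmarks.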

Details

Motivation: Large language models exhibit nonlinear relationships between performance, cost, and token usage during reasoning. Conventional unbounded natural-language reasoning tends to be costly and inefficient, so a more efficient structured reasoning method is needed.

Method: Proposes the BRAID framework, which uses Mermaid syntax to build machine-readable bounded reasoning graphs that guide the model through structured reasoning, and evaluates several GPT model tiers on the AdvancedIF, GSM-Hard, and SCALE MultiChallenge benchmarks.

Result: Experiments show that, compared with conventional prompting, BRAID substantially improves reasoning accuracy and cost efficiency, particularly for autonomous agent systems in production environments.

Conclusion: BRAID is an effective and scalable technique for optimizing the inference efficiency of autonomous agent systems, offering a cost-effective option for practical LLM deployment.

Abstract: Large Language Models (LLMs) exhibit nonlinear relationships between performance, cost, and token usage. This paper presents a quantitative study on structured prompting using BRAID (Bounded Reasoning for Autonomous Inference and Decisions) across multiple GPT model tiers, evaluated on the AdvancedIF, GSM-Hard, and the SCALE MultiChallenge benchmark datasets. BRAID introduces a bounded reasoning framework using Mermaid-based instruction graphs that enable models to reason structurally rather than through unbounded natural-language token expansion. We show that structured machine-readable prompts substantially increase reasoning accuracy and cost efficiency for agents in production systems. The findings establish BRAID as an effective and scalable technique for optimizing inference efficiency in autonomous agent systems. All datasets and detailed result logs are available at https://benchmark.openserv.ai.

Kieran Henderson,Kian Omoomi,Vasudha Varadarajan,Allison Lahnala,Charles Welch

Main category: cs.CL

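TL;DR: This paper studies how different types of self-disclosed information affect the prediction of annotators' judgments of social norms, finding that demographics are more impactful than attitudes, relationships, and experiences, and that a small number of related comments is enough to achieve good results.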

Details

Motivation: To explore which types of personal information are most helpful for predicting annotator labels in subjective tasks, in order to improve the modeling of individual characteristics.

Method: Categorizes self-disclosure sentences and builds annotator models, using several ablations and analyses to assess the impact of each type of information.

Result: Demographic information has the largest impact; theory-based approaches outperform automatic clustering; only a small number of related comments are needed; and a more diverse sample of self-disclosures yields the best performance.

Conclusion: When predicting judgments of social norms, diverse self-disclosures that include demographic information are the most effective.

Abstract: Recent work has explored the use of personal information in the form of persona sentences or self-disclosures to improve modeling of individual characteristics and prediction of annotator labels for subjective tasks. The volume of personal information has historically been restricted and thus little exploration has gone into understanding what kind of information is most informative for predicting annotator labels. In this work, we categorize self-disclosure sentences and use them to build annotator models for predicting judgments of social norms. We perform several ablations and analyses to examine the impact of the type of information on our ability to predict annotation patterns. We find that demographics are more impactful than attitudes, relationships, and experiences. Generally, theory-based approaches worked better than automatic clusters. Contrary to previous work, only a small number of related comments are needed. Lastly, having a more diverse sample of annotator self-disclosures leads to the best performance.

[5] Are We on the Right Way to Assessing LLM-as-a-Judge?

Yuanning Feng,Sinan Wang,Zhengxiang Cheng,Yao Wan,Dongping Chen

Main category: cs.CL

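TL;DR: This paper proposes Sage, a new framework for assessing the quality of LLM-as-a-Judge without human annotation, built on two axioms from rational choice theory: local self-consistency and global logical consistency. Experiments show that the method is stable and highly correlated with existing benchmarks, and reveal significant reliability problems in current mainstream LLMs when they act as judges.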

Details

Motivation: Existing evaluations of LLM-as-a-Judge rely on human-annotated ground truth, which carries human bias and is hard to scale, so a more reliable and scalable evaluation method that needs no human intervention is required.

Method: Inspired by rational choice theory, proposes two new measures: local self-consistency (stability of pairwise preferences) and global logical consistency (transitivity of preferences). Builds a dataset of 650 questions that combines structured benchmark problems with real-world user queries, and runs systematic experiments.

Result: The Sage metrics are highly stable and correlate strongly with supervised benchmarks such as LLMBar and RewardBench2. Top models such as Gemini-2.5-Pro and GPT-5 fail to keep their preferences consistent in about a quarter of difficult cases. The work identifies a "situational preference" phenomenon, which explains why explicit rubrics help consistency; fine-tuning, panel-based judging, and deep reasoning improve judging consistency; and human judgments are also found to be substantially inconsistent.

Conclusion: Sage offers a reliable, annotation-free paradigm for evaluating LLM-as-a-Judge, exposes the limits of both current models and humans in judgment consistency, and encourages more robust automatic evaluation methods.

Abstract: LLM-as-a-Judge has been widely adopted as an evaluation method and has served as a source of supervised rewards in model training. However, existing benchmarks for LLM-as-a-Judge mainly rely on human-annotated ground truth, which introduces human bias that undermines the assessment of reliability and imposes scalability constraints. To overcome these limitations, we introduce Sage, a novel evaluation suite that assesses the quality of LLM judges without necessitating any human annotation. Inspired by axioms of rational choice theory, Sage introduces two new lenses for measuring LLM-as-a-Judge: local self-consistency (pair-wise preference stability) and global logical consistency (transitivity across a full set of preferences). We curate a dataset of 650 questions by combining structured benchmark problems with real-world user queries. Our experiments demonstrate both the stability of our metrics and their high correlation with supervised benchmarks like LLMBar and RewardBench2, confirming Sage's reliability as an evaluation suite for the robustness and accuracy of LLM-as-a-Judge. Based on Sage, we reveal that current state-of-the-art LLMs exhibit significant reliability problems when acting as judges in both scoring and pairwise settings; even the top-performing models, Gemini-2.5-Pro and GPT-5, fail to maintain consistent preferences in nearly a quarter of difficult cases. We attribute this to a new phenomenon called situational preference, which explains why explicit rubrics or criteria can help the model judge consistently across answer pairs. Our further analysis shows that finetuned LLM-as-a-Judge is a feasible method to boost performance, and the panel-based judge as well as deep reasoning can enhance the judging consistency. We also find substantial inconsistency in human judgments, which indicates that human annotation may not be a reliable gold standard.
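The two consistency notions named in the Sage abstract can be made concrete with a small sketch. The abstract does not spell out how Sage measures them, so the definitions below are assumptions: local self-consistency is taken to mean that the verdict on a pair survives swapping the presentation order, and global logical consistency is checked by counting transitivity violations over ordered triples. The `judge` and `prefers` callables are hypothetical stand-ins for an LLM judge.

```python
from itertools import permutations


def local_self_consistency(judge, pairs):
    """Fraction of pairs whose verdict is unchanged when the order is swapped.

    `judge(a, b)` returns the preferred answer (hypothetical interface).
    """
    stable = sum(judge(a, b) == judge(b, a) for a, b in pairs)
    return stable / len(pairs)


def transitivity_violations(prefers, items):
    """Count ordered triples (a, b, c) with a > b and b > c but not a > c."""
    return sum(
        1
        for a, b, c in permutations(items, 3)
        if prefers(a, b) and prefers(b, c) and not prefers(a, c)
    )


# A judge that prefers the longer answer is stable under order swaps.
def length_judge(a, b):
    return a if len(a) > len(b) else b


# A position-biased judge always picks whichever answer is shown first.
def first_position_judge(a, b):
    return a


pairs = [("aa", "b"), ("ccc", "dd")]
print(local_self_consistency(length_judge, pairs))          # 1.0
print(local_self_consistency(first_position_judge, pairs))  # 0.0

# A cyclic preference relation violates transitivity.
beats = {("rock", "scissors"), ("scissors", "paper"), ("paper", "rock")}
cyclic_violations = transitivity_violations(
    lambda a, b: (a, b) in beats, ["rock", "paper", "scissors"]
)
print(cyclic_violations)  # 3
```

Neither check needs a gold label, which is the property the abstract emphasizes: both are computed from the judge's own outputs.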

[6] Convolutional Lie Operator for Sentence Classification

Daniela N. Rim,Heeyoul Choi

Main category: cs.CL

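TL;DR: This paper proposes SCLie and DPCLie, new convolutional sentence classifiers based on Lie Convolutions. By capturing complex non-Euclidean symmetries in language, they outperform traditional convolutional models on sentence classification, suggesting that introducing geometric structure helps text modeling.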

Details

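Motivation: Traditional CNNs are effective at extracting local, position-invariant features from text, but their ability to model complex transformations in language is limited. The capacity of Lie groups to capture non-Euclidean symmetries motivates exploring their potential in natural language processing.

Method: Integrates Lie Convolutions into convolution-based sentence classifiers, building two new models, SCLie and DPCLie, that use Lie group operations to model complex transformations in language.

Result: SCLie and DPCLie outperform traditional convolutional sentence classifiers in experiments, supporting the effectiveness of Lie Convolutions at capturing transformations not commonly associated with language.

Conclusion: Introducing Lie group structure helps sentence classification performance, suggesting that new paradigms with geometric priors are promising for language modeling.

Abstract: Traditional Convolutional Neural Networks have been successful in capturing local, position-invariant features in text, but their capacity to model complex transformations within language can be further explored. In this work, we explore a novel approach by integrating Lie Convolutions into convolution-based sentence classifiers, inspired by the ability of Lie group operations to capture complex, non-Euclidean symmetries. Our proposed models, SCLie and DPCLie, empirically outperform traditional convolution-based sentence classifiers, suggesting that Lie-based models improve accuracy by capturing transformations not commonly associated with language. Our findings motivate more exploration of new paradigms in language modeling.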

[7] MRG-R1: Reinforcement Learning for Clinically Aligned Medical Report Generation

Pengyu Wang,Shuchang Ye,Usman Naseem,Jinman Kim

Main category: cs.CL

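TL;DR: Proposes a semantic-driven reinforcement learning (SRL) method for medical report generation that uses a report-level clinical-correctness reward to improve the accuracy of a large vision-language model, achieving state-of-the-art clinical efficacy on the IU X-Ray and MIMIC-CXR datasets.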

Details

Motivation: Existing medical report generation methods depend on token-level training objectives, so the generated text follows the linguistic style of radiologists but falls short on clinical accuracy and lacks correct semantic alignment with key medical findings.

Method: Applies semantic-driven reinforcement learning (SRL) to a large vision-language model (LVLM) using Group Relative Policy Optimization (GRPO). The model is optimized with a report-level reward, a margin-based cosine similarity (MCCS) between key radiological findings extracted from the generated and reference reports, and a lightweight reasoning format constraint guides it to produce structured "thinking report" outputs.

Result: On the IU X-Ray and MIMIC-CXR datasets, MRG-R1 achieves CE-F1 scores of 51.88 and 40.39, respectively, clearly outperforming conventional token-level supervision and supporting the effectiveness of semantic-driven reinforcement learning for clinical correctness.

Conclusion: Optimizing a report-level reward grounded in clinical semantic alignment improves the clinical correctness of generated reports more effectively than conventional token-level overlap, pointing to semantic reinforcement as a new direction for training medical large vision-language models.

Abstract: Medical report generation (MRG) aims to automatically derive radiology-style reports from medical images to aid in clinical decision-making. However, existing methods often generate text that mimics the linguistic style of radiologists but fails to guarantee clinical correctness, because they are trained on token-level objectives which focus on word choice and sentence structure rather than actual medical accuracy. We propose a semantic-driven reinforcement learning (SRL) method for medical report generation, applied to a large vision-language model (LVLM). SRL adopts Group Relative Policy Optimization (GRPO) to encourage clinical-correctness-guided learning beyond imitation of language style. Specifically, we optimize a report-level reward: a margin-based cosine similarity (MCCS) computed between key radiological findings extracted from generated and reference reports, thereby directly aligning clinical-label agreement and improving semantic correctness. A lightweight reasoning format constraint further guides the model to generate structured "thinking report" outputs. We evaluate Medical Report Generation with Semantic-driven Reinforcement Learning (MRG-R1) on two datasets, IU X-Ray and MIMIC-CXR, using clinical efficacy (CE) metrics. MRG-R1 achieves state-of-the-art performance with CE-F1 51.88 on IU X-Ray and 40.39 on MIMIC-CXR. We found that the label-semantic reinforcement is better than conventional token-level supervision. These results indicate that optimizing a clinically grounded, report-level reward, rather than token overlap, meaningfully improves clinical correctness. This work is an early exploration of semantic reinforcement for supervising medical correctness in the training of medical large vision-language models (Med-LVLMs).
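A rough sketch may help show how a report-level reward can feed a group-relative update. The abstract does not define how the margin enters MCCS, so `margin_reward` below is purely hypothetical: it subtracts an assumed margin from the cosine similarity of two binary finding vectors. The advantage step follows the commonly described GRPO idea of normalizing each sampled report's reward by the mean and standard deviation of its group; the paper's exact recipe may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def margin_reward(generated_labels, reference_labels, margin=0.5):
    """Hypothetical reward: cosine similarity of finding vectors minus a margin.

    The margin value and the subtraction are assumptions, not the paper's MCCS.
    """
    return cosine(generated_labels, reference_labels) - margin


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group (mean 0, unit scale)."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


# Binary vectors over three illustrative findings (names are made up):
# [cardiomegaly, effusion, pneumothorax]
reference = [1, 0, 1]
group = [[1, 0, 1], [1, 1, 0], [0, 1, 0]]  # three sampled reports
rewards = [margin_reward(g, reference) for g in group]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

The report that matches the reference findings receives the largest advantage, regardless of how its wording compares with the reference text.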

[8] Decoding Fake Narratives in Spreading Hateful Stories: A Dual-Head RoBERTa Model with Multi-Task Learning

Yash Bhaskar,Sankalp Bahad,Parameswari Krishnamurthy

Main category: cs.CL

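TL;DR: This paper presents a system for detecting hate speech driven by fake narratives (Faux-Hate) in code-mixed Hindi-English social media text. Using multi-task learning and domain-specific pretraining, it achieves competitive results on binary detection and on target and severity prediction.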

Details

Motivation: Misinformation and hate speech spread together on social media, so harmful content propagates quickly; this calls for effective identification of hate speech triggered by fake narratives (Faux-Hate).

Method: Combines advanced natural language processing techniques with domain-specific pretraining in a multi-task learning framework that handles two sub-tasks at once: binary Faux-Hate detection, and target and severity prediction.

Result: The system achieved competitive results in the Faux-Hate shared task, supporting the effectiveness of multi-task learning for this complex problem.

Conclusion: Multi-task learning and domain-specific pretraining can improve the detection of fake-narrative-driven hate speech in code-mixed text.

Abstract: Social media platforms, while enabling global connectivity, have become hubs for the rapid spread of harmful content, including hate speech and fake narratives (Davidson et al., 2017; Shu et al., 2017). The Faux-Hate shared task focuses on detecting a specific phenomenon: the generation of hate speech driven by fake narratives, termed Faux-Hate. Participants are challenged to identify such instances in code-mixed Hindi-English social media text. This paper describes our system developed for the shared task, addressing two primary sub-tasks: (a) Binary Faux-Hate detection, involving fake and hate speech classification, and (b) Target and Severity prediction, categorizing the intended target and severity of hateful content. Our approach combines advanced natural language processing techniques with domain-specific pretraining to enhance performance across both tasks. The system achieved competitive results, demonstrating the efficacy of leveraging multi-task learning for this complex problem.

Mengfan Shen,Kangqi Song,Xindi Wang,Wei Jia,Tao Wang,Ziqiang Han

Main category: cs.CL

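TL;DR: This paper proposes a domain-adapted information extraction pipeline that combines LoRA fine-tuning of the Qwen2.5-7B model with prompt engineering to extract structured information from police announcements posted on Chinese Weibo.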

Details

Motivation: Social media text is noisy and varied in form, so traditional methods struggle to extract key information from unstructured police announcements efficiently and accurately.

Method: Applies Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning of the Qwen2.5-7B model, combined with targeted prompt engineering, training on 4,933 manually annotated samples to extract 15 key fields.

Result: Accuracy exceeds 98.36% for mortality detection, and Exact Match Rates reach 95.31% for fatality counts and 95.54% for province-level location extraction, significantly better than the base and instruction-tuned models.

Conclusion: The pipeline provides an efficient, validated solution for multi-task information extraction in specialized domains and helps turn unstructured social text into reliable structured data.

Abstract: Structured information extraction from police incident announcements is crucial for timely and accurate data processing, yet presents considerable challenges due to the variability and informal nature of textual sources such as social media posts. To address these challenges, we developed a domain-adapted extraction pipeline that leverages targeted prompt engineering with parameter-efficient fine-tuning of the Qwen2.5-7B model using Low-Rank Adaptation (LoRA). This approach enables the model to handle noisy, heterogeneous text while reliably extracting 15 key fields, including location, event characteristics, and impact assessment, from a high-quality, manually annotated dataset of 4,933 instances derived from 27,822 police briefing posts on Chinese Weibo (2019-2020). Experimental results demonstrated that LoRA-based fine-tuning significantly improved performance over both the base and instruction-tuned models, achieving an accuracy exceeding 98.36% for mortality detection and Exact Match Rates of 95.31% for fatality counts and 95.54% for province-level location extraction. The proposed pipeline thus provides a validated and efficient solution for multi-task structured information extraction in specialized domains, offering a practical framework for transforming unstructured text into reliable structured data in social science research.
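For readers unfamiliar with Low-Rank Adaptation, the following minimal sketch shows the general idea in plain Python: the frozen weight matrix W is left untouched and a low-rank product B·A, scaled by alpha/rank, is added to its output. The layer size of 4096, the rank of 8, and the helper names are illustrative assumptions and are not taken from the paper or from Qwen2.5-7B.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(w, a, b, x, alpha, rank):
    """Compute W x + (alpha / rank) * B (A x).

    w: frozen weights (d_out x d_in); a: rank x d_in; b: d_out x rank.
    Only a and b would be trained.
    """
    base = matvec(w, x)
    update = matvec(b, matvec(a, x))
    scale = alpha / rank
    return [h + scale * u for h, u in zip(base, update)]


def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters: full fine-tuning vs. the LoRA factors."""
    return d_out * d_in, rank * (d_out + d_in)


# With B initialised to zeros, the adapted layer starts identical to the base.
w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 1.0]]
x = [1.0, 2.0]
print(lora_forward(w, a, [[0.0], [0.0]], x, alpha=2, rank=1))  # [1.0, 2.0]
print(lora_forward(w, a, [[1.0], [2.0]], x, alpha=2, rank=1))  # [7.0, 14.0]

# Illustrative sizes only: one 4096 x 4096 projection adapted at rank 8.
full, lora = lora_param_counts(4096, 4096, 8)
print(full, lora)  # 16777216 65536
```

At these illustrative sizes the adapter trains under 0.4% of the parameters of the layer it modifies, which is why the approach is described as parameter-efficient.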

[10] Mitigating Hallucinations in Healthcare LLMs with Granular Fact-Checking and Domain-Specific Adaptation

Musarrat Zeba,Abdullah Al Mamun,Kishoar Jahan Tithee,Debopom Sutradhar,Mohaimenul Azam Khan Raiaan,Saddam Mukta,Reem E. Mohamed,Md Rafiqul Islam,Yakub Sebastian,Mukhtar Hussain,Sami Azam

Main category: cs.CL

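TL;DR: Proposes a medical fact-checking module that operates independently of any LLM, paired with a domain-specific summarization model to reduce hallucinations. The model is fine-tuned with LoRA on the MIMIC-III dataset, and fine-grained verification is performed through discrete logic in natural language processing; experiments show high precision, recall, and F1 scores.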

Details

Motivation: Large language models are prone to hallucinated output in medical decision-making, which threatens patient safety, so the reliability and accuracy of generated content must be improved.

Method: Fine-tunes a summarization model with LoRA on the MIMIC-III dataset and designs a fact-checking module that does not depend on an LLM, using numerical tests and discrete logic over natural language, grounded in electronic health records, for fine-grained fact verification.

Result: The fact-checking module reaches a precision of 0.8904, a recall of 0.8234, and an F1-score of 0.8556; the summarization model obtains a ROUGE-1 score of 0.5797 and a BERTScore of 0.9120.

Conclusion: The proposed approach reduces hallucinations in medical text generation; the fact-checking module performs well and improves the trustworthiness of LLMs in critical medical applications.

Abstract: In healthcare, it is essential for any LLM-generated output to be reliable and accurate, particularly in cases involving decision-making and patient safety. However, the outputs are often unreliable in such critical areas due to the risk of hallucinated outputs from the LLMs. To address this issue, we propose a fact-checking module that operates independently of any LLM, along with a domain-specific summarization model designed to minimize hallucination rates. Our model is fine-tuned using Low-Rank Adaptation (LoRA) on the MIMIC-III dataset and is paired with the fact-checking module, which uses numerical tests for correctness and logical checks at a granular level through discrete logic in natural language processing (NLP) to validate facts against electronic health records (EHRs). We trained the LLM on the full MIMIC-III dataset. For evaluation of the fact-checking module, we sampled 104 summaries, extracted them into 3,786 propositions, and used these as facts. The fact-checking module achieves a precision of 0.8904, a recall of 0.8234, and an F1-score of 0.8556. Additionally, the LLM summary model achieves a ROUGE-1 score of 0.5797 and a BERTScore of 0.9120 for summary quality.
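As a quick sanity check on the reported figures, the F1-score can be recomputed from the stated precision and recall with the standard harmonic-mean formula. The snippet uses only numbers given in the abstract; the function name is ours.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


f1 = f1_score(0.8904, 0.8234)
print(round(f1, 4))  # 0.8556

# Average number of propositions extracted per sampled summary.
propositions_per_summary = 3786 / 104
print(round(propositions_per_summary, 1))  # 36.4
```

The recomputed value matches the reported F1-score of 0.8556, so the three figures are internally consistent.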

[11] An Information-Theoretic Framework for Robust Large Language Model Editing

Qizhou Chen,Chengyu Wang,Taolin Zhang,Xiaofeng He

Main category: cs.CL

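TL;DR: Proposes IBKE, a knowledge editing framework for large language models grounded in information bottleneck theory, enabling efficient knowledge updates that generalize well.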

Details

Motivation: Existing model editing methods struggle to correct knowledge across domains without retraining and tend to cause side effects, so a more general and precise editing method is needed.

Method: Building on information bottleneck theory, compresses and isolates the essential information and uses compact latent representations to guide gradient-based updates, resulting in the IBKE editing framework.

Result: IBKE is validated across multiple LLM architectures and benchmark tasks, showing state-of-the-art accuracy along with better generality and specificity of edits.

Conclusion: IBKE provides a theoretically principled and practical paradigm for open-domain knowledge editing, improving the utility and trustworthiness of LLMs in real-world applications.

Abstract: Large Language Models (LLMs) have become indispensable tools in science, technology, and society, enabling transformative advances across diverse fields. However, errors or outdated information within these models can undermine their accuracy and restrict their safe deployment. Developing efficient strategies for updating model knowledge without the expense and disruption of full retraining remains a critical challenge. Current model editing techniques frequently struggle to generalize corrections beyond narrow domains, leading to unintended consequences and limiting their practical impact. Here, we introduce a novel framework for editing LLMs, grounded in information bottleneck theory. This approach precisely compresses and isolates the essential information required for generalizable knowledge correction while minimizing disruption to unrelated model behaviors. Building upon this foundation, we present the Information Bottleneck Knowledge Editor (IBKE), which leverages compact latent representations to guide gradient-based updates, enabling robust and broadly applicable model editing. We validate IBKE's effectiveness across multiple LLM architectures and standard benchmark tasks, demonstrating state-of-the-art accuracy and improved generality and specificity of edits. These findings establish a theoretically principled and practical paradigm for open-domain knowledge editing, advancing the utility and trustworthiness of LLMs in real-world applications.

[12] LoPA: Scaling dLLM Inference via Lookahead Parallel Decoding

Chenkai Xu,Yijie Jin,Jiajun Li,Yi Tu,Guoping Long,Dandan Tu,Tianqi Hou,Junchi Yan,Zhijie Deng

Main category: cs.CL

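TL;DR: This paper proposes LoPA, a training-free, plug-and-play decoding algorithm that optimizes the Token Filling Order (TFO) to substantially speed up parallel inference in diffusion large language models (dLLMs). It reaches a decoding efficiency of up to 10.1 TPF on the D2F model and, combined with a multi-device inference system, a throughput of over one thousand tokens per second.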

Details

Motivation: Current confidence-driven decoding strategies offer limited parallelism in dLLM inference, typically producing only 1 to 3 tokens per forward pass, which limits inference speed. The authors find that decoding parallelism depends heavily on the Token Filling Order (TFO) and therefore seek a better TFO to improve efficiency.

Method: Proposes the Lookahead PArallel Decoding (LoPA) algorithm, which explores several candidate TFOs at once through parallel branches and selects the order with the highest potential for future parallelism based on branch confidence; also designs a multi-device inference system with Branch Parallelism (BP) to support highly concurrent decoding.

Result: On GSM8K, LoPA raises the TPF of the D2F-Dream model from 1-3 to 10.1 while outperforming the Dream baseline; under multi-GPU deployment, single-sample throughput reaches 1073.9 tokens per second.

Conclusion: By optimizing the TFO, LoPA substantially improves the decoding parallelism and inference efficiency of dLLMs and, together with a dedicated inference system, delivers training-free acceleration, offering an effective route to practical dLLM deployment.

Abstract: Diffusion Large Language Models (dLLMs) have demonstrated significant potential for high-speed inference. However, current confidence-driven decoding strategies are constrained by limited parallelism, typically achieving only 1-3 tokens per forward pass (TPF). In this work, we identify that the degree of parallelism during dLLM inference is highly sensitive to the Token Filling Order (TFO). Then, we introduce Lookahead PArallel Decoding (LoPA), a training-free, plug-and-play algorithm, to identify a superior TFO and hence accelerate inference. LoPA concurrently explores distinct candidate TFOs via parallel branches, and selects the one with the highest potential for future parallelism based on branch confidence. We apply LoPA to the state-of-the-art D2F model and observe a substantial enhancement in decoding efficiency. Notably, LoPA increases the TPF of D2F-Dream to 10.1 on GSM8K while maintaining performance superior to the Dream baseline. Furthermore, to facilitate this unprecedented degree of parallelism, we develop a specialized multi-device inference system featuring Branch Parallelism (BP), which achieves a single-sample throughput of 1073.9 tokens per second under multi-GPU deployment. The code is available at https://github.com/zhijie-group/LoPA.

[13] Sigma-MoE-Tiny Technical Report

Qingguo Hu,Zhenghao Lin,Ziyue Yang,Yucheng Ding,Xiao Liu,Yuting Jiang,Ruizhe Wang,Tianyu Chen,Zhongxin Guo,Yifan Xiong,Rui Gao,Lei Qu,Jinsong Su,Peng Cheng,Yeyun Gong

Main category: cs.CL

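TL;DR: This paper presents Sigma-MoE-Tiny, a Mixture-of-Experts language model with extremely high sparsity: each layer has up to 96 experts but activates only one per token, so only 0.5B of the 20B total parameters are active. To address the expert load-balancing problem caused by this extreme sparsity, it proposes a progressive sparsification strategy and, with stable training and post-training, reaches leading performance at a very small number of activated parameters.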

Details

Motivation: To achieve higher efficiency and scalability in Mixture-of-Experts (MoE) models, explore the performance limits of extremely sparse settings, and address the failure of existing load-balancing mechanisms in the lower layers.

Method: Uses fine-grained expert segmentation with up to 96 experts per layer and only one expert activated per token; introduces a progressive sparsification schedule to improve expert load balancing; pre-trains on a high-quality corpus and then applies post-training to strengthen capabilities.

Result: Delivers a highly sparse model with 20B total parameters and only 0.5B activated; training is stable, with no irrecoverable loss spikes; and the model shows top-tier performance among models of comparable or larger scale.

Conclusion: Sigma-MoE-Tiny shows that high performance can be maintained under extreme sparsity; the proposed progressive sparsification strategy eases the load-balancing problem and offers practical guidance for future highly sparse MoE architectures.

Abstract: Mixture-of-Experts (MoE) has emerged as a promising paradigm for foundation models due to its efficient and powerful scalability. In this work, we present Sigma-MoE-Tiny, an MoE language model that achieves the highest sparsity compared to existing open-source models. Sigma-MoE-Tiny employs fine-grained expert segmentation with up to 96 experts per layer, while activating only one expert for each token, resulting in 20B total parameters with just 0.5B activated. The major challenge introduced by such extreme sparsity lies in expert load balancing. We find that the widely-used load balancing loss tends to become ineffective in the lower layers under this setting. To address this issue, we propose a progressive sparsification schedule aiming to balance expert utilization and training stability. Sigma-MoE-Tiny is pre-trained on a diverse and high-quality corpus, followed by post-training to further unlock its capabilities. The entire training process remains remarkably stable, with no occurrence of irrecoverable loss spikes. Comprehensive evaluations reveal that, despite activating only 0.5B parameters, Sigma-MoE-Tiny achieves top-tier performance among counterparts of comparable or significantly larger scale. In addition, we provide an in-depth discussion of load balancing in highly sparse MoE models, offering insights for advancing sparsity in future MoE architectures.

Project page: https://qghuxmu.github.io/Sigma-MoE-Tiny

Code: https://github.com/microsoft/ltp-megatron-lm
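To make the load-balancing issue concrete, here is a minimal sketch of top-1 routing with one common auxiliary loss, the Switch Transformer form N · Σ f_i · P_i, where f_i is the fraction of tokens sent to expert i and P_i is the mean router probability for that expert. The abstract does not say which loss Sigma-MoE-Tiny uses, so this choice is an assumption, and the four-expert toy router stands in for the 96 experts per layer mentioned above.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's router logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top1_route(router_logits):
    """Pick the single highest-probability expert for every token."""
    probs = [softmax(token_logits) for token_logits in router_logits]
    choices = [max(range(len(p)), key=p.__getitem__) for p in probs]
    return choices, probs


def load_balance_loss(choices, probs, num_experts):
    """Switch-style auxiliary loss: num_experts * sum_i f_i * P_i."""
    n_tokens = len(choices)
    f = [choices.count(e) / n_tokens for e in range(num_experts)]
    p = [sum(token[e] for token in probs) / n_tokens for e in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))


# Balanced routing: each of four tokens prefers a different expert.
balanced_logits = [[10.0 if e == t else 0.0 for e in range(4)] for t in range(4)]
balanced_choices, balanced_probs = top1_route(balanced_logits)
balanced_loss = load_balance_loss(balanced_choices, balanced_probs, 4)

# Collapsed routing: every token prefers expert 0.
collapsed_logits = [[10.0, 0.0, 0.0, 0.0] for _ in range(4)]
collapsed_choices, collapsed_probs = top1_route(collapsed_logits)
collapsed_loss = load_balance_loss(collapsed_choices, collapsed_probs, 4)

print(balanced_choices, balanced_loss)
print(collapsed_choices, collapsed_loss)

# Share of parameters active per token, from the abstract's 0.5B of 20B.
print(0.5 / 20)  # 0.025
```

The loss is close to 1 under balanced routing and approaches the number of experts when routing collapses onto one expert, which is the signal such a loss relies on to spread tokens out.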

[14] Evaluating OpenAI GPT Models for Translation of Endangered Uralic Languages: A Comparison of Reasoning and Non-Reasoning Architectures

Yehor Tereshchenko,Mika Hämäläinen,Svitlana Myroniuk

Main category: cs.CL

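TL;DR: This study compares GPT reasoning and non-reasoning models on translation between Finnish and four low-resource Uralic languages (Komi-Zyrian, Moksha, Erzya, and Udmurt), finding that reasoning models have refusal rates 16 percentage points lower.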

Details

Motivation: Evaluations of LLM translation have focused mostly on high-resource languages, leaving little understanding of performance on low-resource and endangered languages, especially those of the Uralic family.

Method: Uses a parallel corpus of literary texts to analyze the refusal rates of different GPT architectures on translation tasks, comparing reasoning and non-reasoning models.

Result: Reasoning models show markedly lower refusal rates than non-reasoning models when translating low-resource Uralic languages, a gap of 16 percentage points.

Conclusion: Reasoning architectures are more willing to attempt translation of low-resource languages, providing a useful reference for endangered language preservation and related research.

Abstract: The evaluation of Large Language Models (LLMs) for translation tasks has primarily focused on high-resource languages, leaving a significant gap in understanding their performance on low-resource and endangered languages. This study presents a comprehensive comparison of OpenAI's GPT models, specifically examining the differences between reasoning and non-reasoning architectures for translating between Finnish and four low-resource Uralic languages: Komi-Zyrian, Moksha, Erzya, and Udmurt. Using a parallel corpus of literary texts, we evaluate model willingness to attempt translation through refusal rate analysis across different model architectures. Our findings reveal significant performance variations between reasoning and non-reasoning models, with reasoning models showing 16 percentage points lower refusal rates. The results provide valuable insights for researchers and practitioners working with Uralic languages and contribute to the broader understanding of reasoning model capabilities for endangered language preservation.

[15] Hacking Neural Evaluation Metrics with Single Hub Text

Hiroyuki Deguchi,Katsuki Chousa,Yusuke Sakai

Main category: cs.CL

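TL;DR: Proposes a method for finding adversarial text in the discrete space, exposing the vulnerability of neural text evaluation metrics such as COMET. The single hub text it finds receives abnormally high scores across multiple translation tasks and generalizes across language pairs.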

Details

Motivation: Current neural text evaluation metrics may lack reliability and safety because of their black-box nature, so their potential vulnerabilities need to be exposed.

Method: Designs a method that searches the discrete text space for a single adversarial hub text that is consistently rated as high quality, in order to probe the reliability of evaluation metrics.

Result: The hub text found by the method reaches 79.1 COMET% and 67.8 COMET% on the En-Ja and En-De translation tasks, respectively, exceeding translations generated individually for each sentence by the M2M100 model; it also generalizes to other language pairs such as Ja-En and De-En.

Conclusion: Current neural evaluation metrics have a serious vulnerability: a single adversarial text can score highly and remain effective across languages, which calls for a re-examination of their reliability and safety.

Abstract: Strongly human-correlated evaluation metrics serve as an essential compass for the development and improvement of generation models and must be highly reliable and robust. Recent embedding-based neural text evaluation metrics, such as COMET for translation tasks, are widely used in both research and development fields. However, there is no guarantee that they yield reliable evaluation results due to the black-box nature of neural networks. To raise concerns about the reliability and safety of such metrics, we propose a method for finding a single adversarial text in the discrete space that is consistently evaluated as high-quality, regardless of the test cases, to identify the vulnerabilities in evaluation metrics. The single hub text found with our method achieved 79.1 COMET% and 67.8 COMET% in the WMT'24 English-to-Japanese (En-Ja) and English-to-German (En-De) translation tasks, respectively, outperforming translations generated individually for each source sentence by using M2M100, a general translation model. Furthermore, we also confirmed that the hub text found with our method generalizes across multiple language pairs such as Ja-En and De-En.

[16] Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs

Sara Papi,Javier Garcia Gilabert,Zachary Hopton,Vilém Zouhar,Carlos Escolano,Gerard I. Gállego,Jorge Iranzo-Sánchez,Ahrii Kim,Dominik Macháček,Patricia Schmidtova,Maike Züfle

Main category: cs.CL

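TL;DR: This paper presents the "Hearing to Translate" test suite, which systematically benchmarks 5 state-of-the-art speech large language models (SpeechLLMs) against 16 cascaded and direct systems. Across 16 benchmarks, 13 language pairs, and 9 challenging conditions, cascaded systems remain the most reliable overall, SpeechLLMs match them only in selected settings, and speech foundation models (SFMs) lag behind.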

Details

Motivation: To examine whether speech large language models (SpeechLLMs) deliver better speech-to-text translation quality than traditional cascaded architectures, a question that remains unsettled.

Method: Builds Hearing to Translate, the first comprehensive test suite of its kind, and rigorously benchmarks 5 advanced SpeechLLMs against 16 direct and cascaded systems that combine leading speech foundation models with multilingual LLMs, covering many language pairs and challenging speech conditions.

Result: Across the evaluation, cascaded approaches are the most reliable overall; current SpeechLLMs match cascades only in some settings; standalone SFMs perform worse; and integrating an LLM, whether inside the model or in a pipeline, is essential for high-quality speech translation.

Conclusion: Despite the rapid progress of SpeechLLMs, cascades that combine speech models with LLMs are currently the more reliable approach to speech translation, and models that integrate speech and language capabilities directly still need improvement.

Abstract: As Large Language Models (LLMs) expand beyond text, integrating speech as a native modality has given rise to SpeechLLMs, which aim to translate spoken language directly, thereby bypassing traditional transcription-based pipelines. Whether this integration improves speech-to-text translation quality over established cascaded architectures, however, remains an open question. We present Hearing to Translate, the first comprehensive test suite rigorously benchmarking 5 state-of-the-art SpeechLLMs against 16 strong direct and cascade systems that couple leading speech foundation models (SFMs) with multilingual LLMs. Our analysis spans 16 benchmarks, 13 language pairs, and 9 challenging conditions, including disfluent, noisy, and long-form speech. Across this extensive evaluation, we find that cascaded systems remain the most reliable overall, while current SpeechLLMs only match cascades in selected settings and SFMs lag behind both, highlighting that integrating an LLM, either within the model or in a pipeline, is essential for high-quality speech translation.

[17] Bridging the Reality Gap: Efficient Adaptation of ASR systems for Challenging Low-Resource Domains

Darshil Chauhan,Adityasinh Solanki,Vansh Patel,Kanav Kapoor,Ritvik Jain,Aditya Bansal,Dhruv Kumar,Prateek Narang

Main category: cs.CL

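TL;DR: This paper proposes an efficient, privacy-preserving adaptation framework based on Low-Rank Adaptation (LoRA) for improving multilingual speech recognition on real-world clinical audio in resource-constrained, privacy-sensitive settings, substantially lowering word error rate and mitigating catastrophic forgetting.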

Details

Motivation: Because of data privacy restrictions, limited computational resources, and severe acoustic domain shift, existing automatic speech recognition models degrade badly in real clinical environments and are hard to deploy in practice.

Method: Uses Low-Rank Adaptation (LoRA) to enable continual learning on edge devices and combines it with a multi-domain experience replay strategy to reduce catastrophic forgetting, while protecting patient data privacy.

Result: On real-world clinical audio (Gram Vaani), the method achieves a 17.1% relative reduction in word error rate (WER) and reduces catastrophic forgetting by 47% compared with naive adaptation.

Conclusion: The framework offers a viable path toward reliable, self-improving ASR systems in high-impact real-world settings, especially resource-constrained environments such as rural healthcare.

Abstract: Automatic Speech Recognition (ASR) holds immense potential to streamline clinical documentation, such as digitizing handwritten prescriptions and reports, thereby increasing patient throughput and reducing costs in resource-constrained sectors like rural healthcare. However, realizing this utility is currently obstructed by significant technical barriers: strict data privacy constraints, limited computational resources, and severe acoustic domain shifts. We quantify this gap by showing that a robust multilingual model (IndicWav2Vec) degrades to a stark 40.94% Word Error Rate (WER) when deployed on real-world clinical audio (Gram Vaani), rendering it unusable for practical applications. To address these challenges and bring ASR closer to deployment, we propose an efficient, privacy-preserving adaptation framework. We employ Low-Rank Adaptation (LoRA) to enable continual learning from incoming data streams directly on edge devices, ensuring patient data confidentiality. Our strategy yields a 17.1% relative improvement in WER on the target domain. Furthermore, by integrating multi-domain experience replay, we reduce catastrophic forgetting by 47% compared to naive adaptation. These results demonstrate a viable pathway for building reliable, self-improving ASR systems that can operate effectively within the constraints of high-impact real-world environments.
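The two quantities the abstract reports, Word Error Rate and relative improvement, can be sketched as follows. WER is computed here in the standard way, as the word-level edit distance divided by the number of reference words; the example sentences and the baseline/adapted values passed to `relative_improvement` are invented for illustration and are not results from the paper.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # previous[j] = edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + (ref_word != hyp_word),  # substitution
                )
            )
        previous = current
    return previous[-1] / len(ref)


def relative_improvement(baseline_wer, adapted_wer):
    """Relative WER reduction, as a fraction of the baseline WER."""
    return (baseline_wer - adapted_wer) / baseline_wer


# One substitution ("two" -> "to") and one deletion ("daily") over 4 words.
wer = word_error_rate("take two tablets daily", "take to tablets")
print(wer)  # 0.5

# Made-up values, only to show the arithmetic.
print(relative_improvement(40.0, 30.0))  # 0.25
```

A relative improvement is measured against the baseline error rate, so it should not be read as a drop of the same number of percentage points.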

[18] Plain language adaptations of biomedical text using LLMs: Comparison of evaluation metrics

Primoz Kocbek,Leon Kopitar,Gregor Stiglic

Main category: cs.CL

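TL;DR: This study explores the use of large language models (LLMs) to simplify biomedical text and improve health literacy, comparing strategies based on a prompt template, two AI agents, and fine-tuning, and evaluating the GPT-4o and GPT-4o-mini models.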

Details

Motivation: Biomedical literature is usually hard to understand, which limits public health literacy, so effective text simplification is needed to make key information accessible to lay readers. Method: Using a public dataset of plain-language adaptations of biomedical abstracts, the authors develop and evaluate three approaches: a prompt-template baseline, a two-AI-agent approach, and fine-tuning. gpt-4o and gpt-4o-mini serve as base models, evaluated with quantitative metrics (Flesch-Kincaid grade level, SMOG Index, SARI, BERTScore, G-Eval) and a qualitative assessment on 5-point Likert scales. Result: gpt-4o-mini performs best across metrics, while the fine-tuned approaches underperform; G-Eval, an LLM-based automatic metric, ranks the approaches similarly to the human ratings. Conclusion: A lightweight model such as gpt-4o-mini performs well on biomedical text simplification without fine-tuning, and LLM-based metrics such as G-Eval show promise as a partial substitute for human evaluation. Abstract: This study investigated the application of Large Language Models (LLMs) for simplifying biomedical texts to enhance health literacy. Using a public dataset, which included plain language adaptations of biomedical abstracts, we developed and evaluated several approaches, specifically a baseline approach using a prompt template, a two AI agent approach, and a fine-tuning approach. We selected OpenAI gpt-4o and gpt-4o-mini models as baselines for further research. We evaluated our approaches with quantitative metrics, such as Flesch-Kincaid grade level, SMOG Index, SARI, BERTScore, and G-Eval, as well as with a qualitative metric, more precisely 5-point Likert scales for simplicity, accuracy, completeness, and brevity. Results showed superior performance of gpt-4o-mini and underperformance of the fine-tuning (FT) approaches. G-Eval, an LLM-based quantitative metric, showed promising results, ranking the approaches similarly to the qualitative metric.
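Of the readability metrics listed above, the Flesch-Kincaid grade level has a simple closed form: 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The sketch below applies it with a deliberately crude vowel-group syllable counter; that counter and the tokenization rules are assumptions for illustration, not the tooling used in the study.

```python
import re


def count_syllables(word: str) -> int:
    """Crude heuristic: one syllable per run of vowels, at least one per word."""
    return max(len(re.findall(r"[aeiouy]+", word.lower())), 1)


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level; lower values indicate easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word and one sentence")
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)


if __name__ == "__main__":
    original = "Hypertension necessitates pharmacological intervention."
    simplified = "High blood pressure needs medicine."
    print(flesch_kincaid_grade(original), flesch_kincaid_grade(simplified))
```

A score like this measures surface readability only, which is presumably why the study pairs it with meaning-oriented metrics such as SARI, BERTScore, and G-Eval.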

[19] UM_FHS at the CLEF 2025 SimpleText Track: Comparing No-Context and Fine-Tune Approaches for GPT-4.1 Models in Sentence and Document-Level Text Simplification

Primoz Kocbek,Gregor Stiglic

Main category: cs.CL

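TL;DR: This submission to Task 1 of the CLEF 2025 SimpleText track compares a no-context, prompt-engineering method with fine-tuning for scientific text simplification, using the gpt-4.1 family of models.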

Details

Motivation: To explore how effective different models and methods are for sentence-level and document-level simplification of scientific text. Method: The gpt-4.1, gpt-4.1-mini, and gpt-4.1-nano models are used to compare a no-context prompt-engineering method with a fine-tuned method. Result: gpt-4.1-mini is robust in the no-context setting, the fine-tuned models give mixed results, and gpt-4.1-nano-ft stands out at document-level simplification in one case. Conclusion: Simplification at different granularities is complex, and a simple model with a good prompt can match or beat fine-tuning. Abstract: This work describes our submission to the CLEF 2025 SimpleText track Task 1, addressing both sentence- and document-level simplification of scientific texts. The methodology centered on using the gpt-4.1, gpt-4.1-mini, and gpt-4.1-nano models from OpenAI. Two distinct approaches were compared: a no-context method relying on prompt engineering and a fine-tuned (FT) method across models. The gpt-4.1-mini model with no-context demonstrated robust performance at both levels of simplification, while the fine-tuned models showed mixed results, highlighting the complexities of simplifying text at different granularities, where gpt-4.1-nano-ft performance stands out at document-level simplification in one case.

[20] Refusal Steering: Fine-grained Control over LLM Refusal Behaviour for Sensitive Topics

Iker García-Ferrero,David Montero,Roman Orus

Main category: cs.CL

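TL;DR: This paper proposes Refusal Steering, an inference-time method that gives fine-grained control over a large language model's refusal behaviour on politically sensitive topics without retraining.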

Details

Motivation: Existing pattern-based refusal detection is fragile and imprecise, making it hard to control refusal behaviour flexibly while preserving safety. Method: An LLM-as-a-judge assigns refusal confidence scores, and a ridge-regularized method computes more precise steering vectors that isolate the refusal-compliance direction. Result: The method removes refusals on politically sensitive topics while remaining safe on JailbreakBench and near baseline on general benchmarks, and it generalizes across model sizes. Conclusion: Activation steering can remove politically motivated refusals while retaining safety alignment for harmful content, offering a practical path to controllable, transparent moderation at inference time. Abstract: We introduce Refusal Steering, an inference-time method to exercise fine-grained control over Large Language Models' refusal behaviour on politically sensitive topics without retraining. We replace fragile pattern-based refusal detection with an LLM-as-a-judge that assigns refusal confidence scores and we propose a ridge-regularized variant to compute steering vectors that better isolate the refusal--compliance direction. On Qwen3-Next-80B-A3B-Thinking, our method removes the refusal behaviour of the model around politically sensitive topics while maintaining safety on JailbreakBench and near-baseline performance on general benchmarks. The approach generalizes across 4B and 80B models and can also induce targeted refusals when desired. We analyze the steering vectors and show that refusal signals concentrate in deeper layers of the transformer and are distributed across many dimensions. Together, these results demonstrate that activation steering can remove political refusal behaviour while retaining safety alignment for harmful content, offering a practical path to controllable, transparent moderation at inference time.
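The abstract does not spell out the ridge-regularized formulation, so the following is only one plausible reading: regress the judge's refusal confidence scores on centered hidden activations with an L2 penalty, and use the normalized weight vector as the steering direction. The function names, the penalty `lam`, and the additive steering rule are all hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def ridge_direction(acts, scores, lam=1.0):
    """Unit vector w from ridge regression of judge scores on centered activations.

    Solves (X^T X + lam * I) w = X^T y; assumes the scores are not all equal.
    """
    n, d = len(acts), len(acts[0])
    mean = [sum(a[k] for a in acts) / n for k in range(d)]
    s_mean = sum(scores) / n
    X = [[a[k] - mean[k] for k in range(d)] for a in acts]
    y = [s - s_mean for s in scores]
    gram = [[sum(x[i] * x[j] for x in X) + (lam if i == j else 0.0)
             for j in range(d)] for i in range(d)]
    rhs = [sum(x[i] * t for x, t in zip(X, y)) for i in range(d)]
    w = solve(gram, rhs)
    norm = sum(v * v for v in w) ** 0.5
    return [v / norm for v in w]


def steer(hidden, direction, alpha):
    """Shift a hidden state along the direction; negative alpha reduces refusal."""
    return [h + alpha * u for h, u in zip(hidden, direction)]
```

Compared with a plain difference of class means, the ridge penalty damps directions that are poorly supported by the data, which is one way a regularized variant could isolate the refusal-compliance axis more cleanly.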

[21] JustRL: Scaling a 1.5B LLM with a Simple RL Recipe

Bingxiang He,Zekai Qu,Zeyuan Liu,Yinghao Chen,Yuxin Zuo,Cheng Qian,Kaiyan Zhang,Weize Chen,Chaojun Xiao,Ganqu Cui,Ning Ding,Zhiyuan Liu

Main category: cs.CL

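TL;DR: This paper proposes JustRL, a simplified reinforcement learning recipe for large language models that uses single-stage training with fixed hyperparameters, reaches state-of-the-art performance at lower compute cost, and suggests that much existing complexity may be unnecessary.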

Details

Motivation: Reinforcement learning for large language models is trending toward complex designs (multi-stage training, dynamic hyperparameter schedules, and so on); this paper asks whether that complexity is necessary. Method: JustRL uses single-stage training with fixed hyperparameters, applies no per-model tuning, and avoids common "standard tricks" such as length penalties and robust verifiers. Result: On two 1.5B-parameter reasoning models, JustRL reaches 54.9% and 64.3% average accuracy across nine math benchmarks with half the compute of more complex approaches; training is stable, without collapses or plateaus, and adding the common tricks can hurt performance. Conclusion: Many mechanisms introduced to fix training instability may be unnecessary; a simple, stable, sufficiently scaled baseline is effective, and the community should re-examine the value of complexity. Abstract: Recent advances in reinforcement learning for large language models have converged on increasing complexity: multi-stage training pipelines, dynamic hyperparameter schedules, and curriculum learning strategies. This raises a fundamental question: Is this complexity necessary? We present JustRL, a minimal approach using single-stage training with fixed hyperparameters that achieves state-of-the-art performance on two 1.5B reasoning models (54.9% and 64.3% average accuracy across nine mathematical benchmarks) while using 2× less compute than sophisticated approaches. The same hyperparameters transfer across both models without tuning, and training exhibits smooth, monotonic improvement over 4,000+ steps without the collapses or plateaus that typically motivate interventions. Critically, ablations reveal that adding "standard tricks" like explicit length penalties and robust verifiers may degrade performance by collapsing exploration. These results suggest that the field may be adding complexity to solve problems that disappear with a stable, scaled-up baseline. We release our models and code to establish a simple, validated baseline for the community.

[22] GinSign: Grounding Natural Language Into System Signatures for Temporal Logic Translation

William English,Chase Walker,Dominic Simon,Rickard Ewetz

Main category: cs.CL

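TL;DR: This paper proposes GinSign, a framework that grounds natural language into system signatures for temporal logic translation; its hierarchical classification approach markedly improves grounded translation accuracy, reaching 95.5% grounded logical equivalence across multiple domains, a 1.4× improvement over the state of the art.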

Details

Motivation: Existing natural-language-to-temporal-logic translation frameworks either assume accurate atom grounding or achieve low grounded translation accuracy, which limits their use in trustworthy autonomous systems. Method: GinSign introduces a hierarchical grounding model that first predicts predicate labels and then selects appropriately typed constant arguments, turning free-form generation into structured classification so that smaller masked language models can be used and reliance on large language models is reduced. Result: Frameworks that omit grounding tend to produce syntactically correct lifted LTL that is not semantically equivalent to the target expressions, whereas GinSign supports downstream model checking and reaches a grounded logical-equivalence score of 95.5% across multiple domains, a 1.4× improvement over the best existing method. Conclusion: Structured, hierarchical grounding addresses the semantic mismatch in natural-language-to-temporal-logic translation and improves its accuracy and practicality for building trustworthy autonomous systems. Abstract: Natural language (NL) to temporal logic (TL) translation enables engineers to specify, verify, and enforce system behaviors without manually crafting formal specifications, an essential capability for building trustworthy autonomous systems. While existing NL-to-TL translation frameworks have demonstrated encouraging initial results, these systems either explicitly assume access to accurate atom grounding or suffer from low grounded translation accuracy. In this paper, we propose a framework for Grounding Natural Language Into System Signatures for Temporal Logic translation called GinSign. The framework introduces a grounding model that learns the abstract task of mapping NL spans onto a given system signature: given a lifted NL specification and a system signature $\mathcal{S}$, the classifier must assign each lifted atomic proposition to an element of the set of signature-defined atoms $\mathcal{P}$. We decompose the grounding task hierarchically -- first predicting predicate labels, then selecting the appropriately typed constant arguments. Decomposing this task from a free-form generation problem into a structured classification problem permits the use of smaller masked language models and eliminates the reliance on expensive LLMs. Experiments across multiple domains show that frameworks which omit grounding tend to produce syntactically correct lifted LTL that is semantically nonequivalent to grounded target expressions, whereas our framework supports downstream model checking and achieves grounded logical-equivalence scores of 95.5%, a 1.4× improvement over SOTA.

[23] From Facts to Conclusions : Integrating Deductive Reasoning in Retrieval-Augmented LLMs

Shubham Mishra,Samyek Jain,Gorang Mehrishi,Shiv Tiwari,Harsh Sharma,Pratik Narang,Dhruv Kumar

Main category: cs.CL

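TL;DR: Proposes a reasoning-trace-augmented RAG framework that handles conflicting and inaccurate retrieved evidence through three stages of structured reasoning (document adjudication, conflict analysis, grounded synthesis), and introduces the CATS evaluation pipeline to improve answer correctness and behavioral consistency.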

Details

Motivation: Existing RAG methods perform poorly when retrieved sources conflict or contain outdated or subjective information, and they lack unified reasoning supervision. Method: A three-stage reasoning-augmented framework (document-level adjudication, conflict analysis, grounded synthesis) produces citation-linked answers or justified refusals; the CATS pipeline uses an LLM-as-a-Judge to evaluate groundedness, factual correctness, refusal accuracy, and conflict-behavior alignment. Result: On a 539-query dataset the framework substantially outperforms baselines; with supervised fine-tuning, Qwen's end-to-end answer correctness rises from 0.069 to 0.883 and behavioral adherence from 0.074 to 0.722. Conclusion: The framework improves the reasoning ability and interpretability of RAG systems under complex, conflicting evidence, offering a viable path to trustworthy, auditable retrieval-augmented generation. Abstract: Retrieval-Augmented Generation (RAG) grounds large language models (LLMs) in external evidence, but fails when retrieved sources conflict or contain outdated or subjective information. Prior work addresses these issues independently but lacks unified reasoning supervision. We propose a reasoning-trace-augmented RAG framework that adds structured, interpretable reasoning across three stages: (1) document-level adjudication, (2) conflict analysis, and (3) grounded synthesis, producing citation-linked answers or justified refusals. A Conflict-Aware Trust-Score (CATS) pipeline is introduced which evaluates groundedness, factual correctness, refusal accuracy, and conflict-behavior alignment using an LLM-as-a-Judge. Our 539-query reasoning dataset and evaluation pipeline establish a foundation for conflict-aware, interpretable RAG systems. Experimental results demonstrate substantial gains over baselines, most notably with Qwen, where Supervised Fine-Tuning improved End-to-End answer correctness from 0.069 to 0.883 and behavioral adherence from 0.074 to 0.722.

Primož Kocbek,Azra Frkatović-Hodžić,Dora Lalić,Vivian Hui,Gordan Lauc,Gregor Štiglic

Main category: cs.CL

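TL;DR: This study compares converting figures and tables to text against OCR-free visual retrieval for multi-modal retrieval-augmented generation (MM-RAG) in biomedical question answering. Text conversion works better for mid-size models, while with frontier models visual retrieval approaches or exceeds the traditional pipeline, and lightweight retrievers such as ColFlor prove efficient.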

Details

Motivation: To determine when multi-modal RAG should convert figures and tables into text and when it should use OCR-free visual retrieval, particularly in image-dense domains such as glycobiology. Method: The authors build a benchmark of 120 multiple-choice questions stratified by retrieval difficulty; implement four augmentation modes (none, text RAG, multi-modal conversion, and ColPali-based late-interaction visual retrieval) with Docling parsing and Qdrant indexing; and evaluate several open-source and proprietary models, extending the tests to the GPT-5 family and multiple visual retrievers. Result: With Gemma-3-27B-IT, text and multi-modal augmentation outperform OCR-free visual retrieval (accuracy 0.722-0.740 vs. 0.510); with GPT-4o the three are close (multi-modal 0.808, text 0.782, ColPali 0.745); within the GPT-5 family ColPali and ColFlor reach 0.828 with no significant difference between retrievers, while GPT-5-nano trails by 8-10%. Conclusion: The choice of retrieval strategy depends on model capacity: mid-size models benefit from converting visual content to text, whereas OCR-free visual retrieval becomes competitive with frontier models; ColFlor matches heavier retrievers at a smaller footprint, making it an efficient default with strong generators. Abstract: Multi-modal retrieval-augmented generation (MM-RAG) promises grounded biomedical QA, but it is unclear when to (i) convert figures/tables into text versus (ii) use optical character recognition (OCR)-free visual retrieval that returns page images and leaves interpretation to the generator. We study this trade-off in glycobiology, a visually dense domain. We built a benchmark of 120 multiple-choice questions (MCQs) from 25 papers, stratified by retrieval difficulty (easy text, medium figures/tables, hard cross-evidence). We implemented four augmentations (None, Text RAG, Multi-modal conversion, and late-interaction visual retrieval with ColPali) using Docling parsing and Qdrant indexing. We evaluated mid-size open-source and frontier proprietary models (e.g., Gemma-3-27B-IT, GPT-4o family). Additional testing used the GPT-5 family and multiple visual retrievers (ColPali/ColQwen/ColFlor). Accuracy with Agresti-Coull 95% confidence intervals (CIs) was computed over 5 runs per configuration. With Gemma-3-27B-IT, Text and Multi-modal augmentation outperformed OCR-free retrieval (0.722-0.740 vs. 0.510 average accuracy). With GPT-4o, Multi-modal achieved 0.808, with Text 0.782 and ColPali 0.745 close behind; within-model differences were small. In follow-on experiments with the GPT-5 family, the best results with ColPali and ColFlor improved by ~2% to 0.828 in both cases. In general, across the GPT-5 family, ColPali, ColQwen, and ColFlor were statistically indistinguishable. GPT-5-nano trailed larger GPT-5 variants by roughly 8-10%. Pipeline choice is capacity-dependent: converting visuals to text lowers the reader burden and is more reliable for mid-size models, whereas OCR-free visual retrieval becomes competitive under frontier models. Among retrievers, ColFlor offers parity with heavier options at a smaller footprint, making it an efficient default when strong generators are available.
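The Agresti-Coull interval mentioned in the abstract has a standard closed form: add z²/2 pseudo-successes and z² pseudo-trials, then build a normal-approximation interval around the adjusted proportion. The sketch below implements that formula; how the authors pooled the 5 runs per configuration is not stated, so the counts in the usage line are purely illustrative.

```python
from statistics import NormalDist


def agresti_coull(successes: int, trials: int, confidence: float = 0.95):
    """Agresti-Coull confidence interval for a binomial proportion."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials and trials > 0")
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # about 1.96 for 95%
    n_adj = trials + z * z
    p_adj = (successes + z * z / 2.0) / n_adj
    half_width = z * (p_adj * (1.0 - p_adj) / n_adj) ** 0.5
    # Clamp to [0, 1], since the raw interval can overshoot near the edges.
    return max(0.0, p_adj - half_width), min(1.0, p_adj + half_width)


if __name__ == "__main__":
    # Illustrative counts only: 97 correct answers out of 120 questions.
    print(agresti_coull(97, 120))
```

With only 120 questions per run, intervals of this kind are several points wide, which is consistent with the abstract describing the visual retrievers as statistically indistinguishable.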

[25] Grammar-Forced Translation of Natural Language to Temporal Logic using LLMs

William English,Dominic Simon,Sumit Kumar Jha,Rickard Ewetz

Main category: cs.CL

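TL;DR: This paper proposes Grammar Forced Translation (GraFT), a natural-language-to-temporal-logic translation framework that reduces task complexity by restricting the set of valid output tokens at each step, markedly improving end-to-end and out-of-domain translation accuracy.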

Details

Motivation: Existing methods struggle with atomic proposition lifting, co-reference, and learning from limited data, which makes accurate translation from natural language to a formal language difficult. Method: GraFT exploits the properties of each subproblem to restrict the tokens the language model may emit at each step, shrinking the solution space of both the lifting and translation steps and making learning more efficient. Result: On the CW, GLTL, and Navi benchmarks, GraFT improves end-to-end translation accuracy by 5.49% and out-of-domain translation accuracy by 14.06% on average over existing methods. Conclusion: Constraining the output to the grammar improves NL-to-TL translation performance and generalization, offering an efficient approach to formal language generation when data is limited. Abstract: Translating natural language (NL) into a formal language such as temporal logic (TL) is integral for human communication with robots and autonomous systems. State-of-the-art approaches decompose the task into a lifting of atomic propositions (APs) phase and a translation phase. However, existing methods struggle with accurate lifting, the existence of co-references, and learning from limited data. In this paper, we propose a framework for NL to TL translation called Grammar Forced Translation (GraFT). The framework is based on the observation that previous work solves both the lifting and translation steps by letting a language model iteratively predict tokens from its full vocabulary. In contrast, GraFT reduces the complexity of both tasks by restricting the set of valid output tokens from the full vocabulary to only a handful in each step. The solution space reduction is obtained by exploiting the unique properties of each problem. We also provide a theoretical justification for why the solution space reduction leads to more efficient learning. We evaluate the effectiveness of GraFT using the CW, GLTL, and Navi benchmarks. Compared with state-of-the-art translation approaches, GraFT improves end-to-end translation accuracy by 5.49% and out-of-domain translation accuracy by 14.06% on average.
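The idea of restricting valid output tokens at each step can be illustrated with a toy decoder. This is not GraFT's actual grammar or model: the prefix-notation operator set, the `prop_k` atom names, and the `score_fn` interface are assumptions chosen to keep the sketch small. The decoder tracks how many operand slots are still open and only considers tokens the grammar allows at that point.

```python
import math

UNARY = {"G", "F", "X", "!"}       # temporal and logical unary operators
BINARY = {"&", "|", "U", "->"}     # binary operators, written in prefix form


def valid_tokens(open_slots, atoms):
    """Tokens the toy prefix grammar permits given the open operand slots."""
    if open_slots == 0:
        return {"<eos>"}           # formula is complete, only stopping is legal
    return UNARY | BINARY | set(atoms)


def constrained_decode(score_fn, atoms, max_len=32):
    """Greedy decoding that ignores every token outside the valid set.

    score_fn(prefix) returns a dict of token scores; it stands in for a
    language model. A formula cut off at max_len may be incomplete.
    """
    out, open_slots = [], 1
    while len(out) < max_len:
        logits = score_fn(out)
        candidates = sorted(valid_tokens(open_slots, atoms))
        token = max(candidates, key=lambda t: logits.get(t, -math.inf))
        if token == "<eos>":
            break
        out.append(token)
        if token in BINARY:
            open_slots += 1        # consumes one slot, opens two
        elif token not in UNARY:
            open_slots -= 1        # an atom fills one slot
    return out
```

Even when the stand-in model assigns its highest score to an out-of-grammar token, the mask keeps the output well formed, which is the property the abstract credits for easier learning.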

[26] What Do Prosody and Text Convey? Characterizing How Meaningful Information is Distributed Across Multiple Channels

Aditya Yadavalli,Tiago Pimentel,Tamar I Regev,Ethan Wilcox,Alex Warstadt

Main category: cs.CL

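TL;DR: Proposes an information-theoretic method that uses large models to quantify how much information prosody conveys beyond text about features such as emotion and sarcasm, finding that without long-range context prosody carries over an order of magnitude more information about emotion and sarcasm than text does.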

Details

Motivation: Prosody carries important meaning that text does not (for example emotion and sarcasm), but there has been no way to quantify its contribution; a framework is needed to measure how much information each channel (audio vs. text) conveys about a given dimension of meaning. Method: Large speech and language models are used to estimate the mutual information between a specific dimension of an utterance's meaning (emotion, sarcasm, questionhood) and a communication channel (audio or text), quantifying each channel's contribution. Result: Without long-range context, the audio channel (and by implication prosody) conveys over an order of magnitude more information about sarcasm and emotion than the text channel; for questionhood, prosody adds comparatively little. Conclusion: Prosody plays a dominant role in conveying emotion and sarcasm, especially when context is limited, and the approach can be extended to more dimensions of meaning, channels, and languages. Abstract: Prosody -- the melody of speech -- conveys critical information often not captured by the words or text of a message. In this paper, we propose an information-theoretic approach to quantify how much information is expressed by prosody alone and not by text, and crucially, what that information is about. Our approach applies large speech and language models to estimate the mutual information between a particular dimension of an utterance's meaning (e.g., its emotion) and any of its communication channels (e.g., audio or text). We then use this approach to quantify how much information is conveyed by audio and text about sarcasm, emotion, and questionhood, using speech from television and podcasts. We find that for sarcasm and emotion the audio channel -- and by implication the prosodic channel -- transmits over an order of magnitude more information about these features than the text channel alone, at least when long-term context beyond the current sentence is unavailable. For questionhood, prosody provides comparatively less additional information. We conclude by outlining a program applying our approach to more dimensions of meaning, communication channels, and languages.
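One common way to estimate mutual information with a predictive model, sketched here under the assumption that it resembles the paper's setup, uses I(Y; X) = H(Y) − H(Y | X) and replaces the conditional entropy with the model's average cross-entropy on held-out data. Because cross-entropy upper-bounds H(Y | X), the result approximately lower-bounds the true value. The probability lists below stand in for the outputs of an audio-based and a text-based classifier; they are invented for illustration.

```python
import math
from collections import Counter


def label_entropy(labels):
    """Empirical entropy H(Y) of the labels, in bits."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def cross_entropy(labels, probs):
    """Average negative log2-probability the model gives the true label."""
    return -sum(math.log2(p[y]) for y, p in zip(labels, probs)) / len(labels)


def mutual_information(labels, probs):
    """Model-based estimate of I(Y; X) = H(Y) - H(Y | X), in bits."""
    return label_entropy(labels) - cross_entropy(labels, probs)


if __name__ == "__main__":
    labels = ["sarcastic", "sincere", "sarcastic", "sincere"]
    audio = [{"sarcastic": 0.9, "sincere": 0.1}, {"sarcastic": 0.2, "sincere": 0.8},
             {"sarcastic": 0.8, "sincere": 0.2}, {"sarcastic": 0.1, "sincere": 0.9}]
    text = [{"sarcastic": 0.55, "sincere": 0.45}] * 4
    print(mutual_information(labels, audio), mutual_information(labels, text))
```

Comparing the two estimates for the same labels gives a rough picture of how much more one channel tells the model than the other.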

[27] LLMCache: Layer-Wise Caching Strategies for Accelerated Reuse in Transformer Inference

Harsh Vardhan Bansal

Main category: cs.CL

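TL;DR: This paper proposes LLMCache, a layer-wise caching framework that reuses intermediate activations based on semantic similarity, achieving up to 3.1× faster inference on models such as BERT and GPT-2 with less than 0.5% accuracy loss.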

Details

Motivation: Transformer models perform well across many tasks, but their high inference latency limits real-time and large-scale deployment, and existing caching mechanisms are narrow in scope. Method: LLMCache is a model-agnostic layer-wise caching framework that supports both encoder and decoder architectures and can cache at arbitrary transformer layers; it adds a lightweight fingerprinting mechanism to match semantically similar inputs and adaptive eviction strategies to handle cache staleness. Result: On SQuAD, WikiText-103, and OpenBookQA, LLMCache achieves up to a 3.1× inference speedup with accuracy loss under 0.5%. Conclusion: LLMCache is a practical, general-purpose way to optimize transformer inference. Abstract: Transformer-based language models have achieved remarkable performance across a wide range of tasks, yet their high inference latency poses a significant challenge for real-time and large-scale deployment. While existing caching mechanisms, such as token-level key-value caches, offer speedups in autoregressive decoding, they are limited in scope and applicability. In this paper, we present LLMCache, a novel layer-wise caching framework that accelerates transformer inference by reusing intermediate activations based on semantic similarity of input sequences. Unlike prior work, LLMCache is model-agnostic, operates across both encoder and decoder architectures, and supports caching at arbitrary transformer layers. We introduce a lightweight fingerprinting mechanism for matching semantically similar inputs and propose adaptive eviction strategies to manage cache staleness. Experiments on BERT and GPT-2 across SQuAD, WikiText-103, and OpenBookQA show up to 3.1× speedup in inference time with <0.5% accuracy degradation. Our results highlight LLMCache as a practical and general-purpose solution for optimizing transformer inference in real-world applications.
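The abstract does not describe the fingerprinting or eviction mechanisms in detail, so the sketch below substitutes two well-known techniques as stand-ins: a SimHash-style fingerprint (signs of random projections of an input embedding) and plain least-recently-used eviction. The class name, parameters, and the exact-match lookup are hypothetical and are not LLMCache's API.

```python
import random
from collections import OrderedDict


class LayerCache:
    """Toy layer-wise activation cache keyed by (layer, input fingerprint)."""

    def __init__(self, dim, bits=16, capacity=128, seed=0):
        rng = random.Random(seed)
        # Random hyperplanes; inputs on the same side of all planes collide.
        self.planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                       for _ in range(bits)]
        self.capacity = capacity
        self.store = OrderedDict()  # (layer, fingerprint) -> cached activation

    def fingerprint(self, embedding):
        """SimHash-style signature: the sign of each random projection."""
        return tuple(sum(p * x for p, x in zip(plane, embedding)) >= 0
                     for plane in self.planes)

    def get(self, layer, embedding):
        key = (layer, self.fingerprint(embedding))
        if key not in self.store:
            return None
        self.store.move_to_end(key)          # mark as recently used
        return self.store[key]

    def put(self, layer, embedding, activation):
        key = (layer, self.fingerprint(embedding))
        self.store[key] = activation
        self.store.move_to_end(key)
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)   # evict the least recently used
```

On a hit, a forward pass could skip recomputing that layer and resume from the cached activation; the trade-off is that coarser fingerprints raise the hit rate but also the risk of reusing an activation for an input that is not truly similar.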

[28] AdaSearch: Balancing Parametric Knowledge and Search in Large Language Models via Reinforcement Learning

Tzu-Han Lin,Wei-Lin Chen,Chen-An Li,Hung-yi Lee,Yun-Nung Chen,Yu Meng

Main category: cs.CL

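TL;DR: This paper proposes AdaSearch, a two-stage, outcome-driven reinforcement learning framework that makes large language models more aware of their own knowledge boundaries when using search tools, reducing unnecessary search calls while preserving task performance and making decisions more interpretable.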

Details

Motivation: RL-trained search agents tend either to over-rely on search or to ignore their parametric knowledge, leading to extra cost, exposure to noise, or hallucination. Existing methods reduce search calls through reward shaping, but this requires heavy reward engineering, gives ambiguous credit assignment, and can be gamed. A clearer, more interpretable way to balance internal knowledge and external search is needed. Method: AdaSearch is a two-stage RL framework that separates problem solving from the decision of whether to search and models that decision explicitly. Agents' awareness of their own knowledge is measured with an F1-based decision metric, and training optimizes task outcomes rather than call counts. Result: AdaSearch substantially improves knowledge-boundary awareness and reduces unnecessary search calls while preserving strong task performance; it holds up across multiple model families and sizes and yields more transparent, interpretable decisions. Conclusion: By decoupling problem solving from the search decision, AdaSearch enables more adaptive search invocation, a step toward trustworthy AI agents in high-stakes domains such as finance and medicine. Abstract: Equipping large language models (LLMs) with search engines via reinforcement learning (RL) has emerged as an effective approach for building search agents. However, overreliance on search introduces unnecessary cost and risks exposure to noisy or malicious content, while relying solely on parametric knowledge risks hallucination. The central challenge is to develop agents that adaptively balance parametric knowledge with external search, invoking search only when necessary. Prior work mitigates search overuse by shaping rewards around the number of tool calls. However, these penalties require substantial reward engineering, provide ambiguous credit assignment, and can be exploited by agents that superficially reduce calls. Moreover, evaluating performance solely through call counts conflates necessary and unnecessary search, obscuring the measurement of true adaptive behavior. To address these limitations, we first quantify the self-knowledge awareness of existing search agents via an F1-based decision metric, revealing that methods such as Search-R1 often overlook readily available parametric knowledge. Motivated by these findings, we propose AdaSearch, a simple two-stage, outcome-driven RL framework that disentangles problem solving from the decision of whether to invoke search, and makes this decision process explicit and interpretable. This transparency is crucial for high-stakes domains such as finance and medical question answering, yet is largely neglected by prior approaches. Experiments across multiple model families and sizes demonstrate that AdaSearch substantially improves knowledge-boundary awareness, reduces unnecessary search calls, preserves strong task performance, and offers more transparent, interpretable decision behaviors.
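The exact definition of the F1-based decision metric is not given in the abstract. One plausible reading, sketched below, treats "search is needed" as the positive class: a question needs search if the model cannot answer it from parametric knowledge alone, and the agent's decision is whether it actually invoked search. The function name and the labeling scheme are assumptions.

```python
def decision_f1(needs_search, did_search):
    """Precision, recall, and F1 of the agent's search decisions.

    needs_search[i]: True if question i cannot be answered without search.
    did_search[i]:   True if the agent invoked search on question i.
    """
    pairs = list(zip(needs_search, did_search))
    tp = sum(1 for need, did in pairs if need and did)
    fp = sum(1 for need, did in pairs if not need and did)   # unnecessary search
    fn = sum(1 for need, did in pairs if need and not did)   # missed search
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


if __name__ == "__main__":
    # An agent that always searches gets perfect recall but poor precision.
    print(decision_f1([True, False, False, False], [True, True, True, True]))
```

Unlike a raw count of tool calls, a metric of this shape separates necessary from unnecessary searches, which is the distinction the abstract says call counts obscure.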

[29] Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image

Yushi Hu,Reyhane Askari-Hemmat,Melissa Hall,Emily Dinan,Luke Zettlemoyer,Marjan Ghazvininejad

Main category: cs.CL

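TL;DR: This paper introduces Multimodal RewardBench 2 (MMRB2), the first comprehensive benchmark for reward models on omni models that handle interleaved image and text sequences, covering four tasks with high-quality expert-annotated preference pairs for evaluating and improving multimodal reward models.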

Details

Motivation: Existing reward models mainly target text-only large language models; multimodal settings, especially interleaved image and text, are underexplored and lack a standardized, challenging benchmark. Method: MMRB2 provides 1,000 expert-annotated preference pairs per task across text-to-image generation, image editing, interleaved generation, and multimodal reasoning, with responses drawn from 23 models and agents across 21 source tasks; an ensemble filtering strategy ensures annotation consensus, and both proprietary and open-source judges are evaluated. Result: Gemini 3 Pro reaches 75-80% accuracy; GPT-5 and Gemini 2.5 Pro reach 66-75%, above GPT-4o at 59%; the open-source Qwen3-VL-32B reaches 64%, close to Gemini 2.5 Flash; humans exceed 90%; and MMRB2 scores correlate strongly with downstream task performance. Conclusion: MMRB2 is an effective, challenging benchmark for multimodal reward models that shows current models still fall well short of human judgment and points to directions for future multimodal reward modeling. Abstract: Reward models (RMs) are essential for training large language models (LLMs), but remain underexplored for omni models that handle interleaved image and text sequences. We introduce Multimodal RewardBench 2 (MMRB2), the first comprehensive benchmark for reward models on multimodal understanding and (interleaved) generation. MMRB2 spans four tasks: text-to-image, image editing, interleaved generation, and multimodal reasoning ("thinking-with-images"), providing 1,000 expert-annotated preference pairs per task from 23 models and agents across 21 source tasks. MMRB2 is designed with: (1) practical but challenging prompts; (2) responses from state-of-the-art models and agents; and (3) preference pairs with strong human-expert consensus, curated via an ensemble filtering strategy. Using MMRB2, we study existing judges for each subtask, including multimodal LLM-as-a-judge and models trained with human preferences. The latest Gemini 3 Pro attains 75-80% accuracy. GPT-5 and Gemini 2.5 Pro reach 66-75% accuracy, compared to >90% for humans, yet surpass the widely used GPT-4o (59%). The best-performing open-source model, Qwen3-VL-32B, achieves accuracy similar to Gemini 2.5 Flash (64%). We also show that MMRB2 performance strongly correlates with downstream task success using Best-of-N sampling and conduct an in-depth analysis that shows key areas to improve the reward models going forward.

[30] In-Context Algebra

Eric Todd,Jannik Brinkmann,Rohit Gandikota,David Bau

Main category: cs.CL

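TL;DR: This paper studies how transformers reason in context to solve arithmetic problems when symbol meanings are not fixed, and identifies three mechanisms: commutative copying, identity element recognition, and closure-based cancellation.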

Details

Motivation: To explore whether transformers can still develop effective reasoning mechanisms when the meaning of symbols changes from sequence to sequence. Method: The authors design a new task in which the mapping from symbols to algebraic group elements varies across sequences, and use targeted data distributions to run causal tests of hypothesized mechanisms. Result: Models reach near-perfect accuracy and generalize to unseen algebraic groups; three consistently learned mechanisms are found: a dedicated head that copies answers, recognition of facts containing the identity element, and use of group closure to constrain valid answers. Conclusion: When trained to reason in context over variables without fixed meanings, models develop explicit symbolic reasoning mechanisms, complementing earlier findings of geometric representations in fixed-symbol settings. Abstract: We investigate the mechanisms that arise when transformers are trained to solve arithmetic on sequences where tokens are variables whose meaning is determined only through their interactions. While prior work has found that transformers develop geometric embeddings that mirror algebraic structure, those previous findings emerge from settings where arithmetic-valued tokens have fixed meanings. We devise a new task in which the assignment of symbols to specific algebraic group elements varies from one sequence to another. Despite this challenging setup, transformers achieve near-perfect accuracy on the task and even generalize to unseen algebraic groups. We develop targeted data distributions to create causal tests of a set of hypothesized mechanisms, and we isolate three mechanisms models consistently learn: commutative copying where a dedicated head copies answers, identity element recognition that distinguishes identity-containing facts, and closure-based cancellation that tracks group membership to constrain valid answers. Complementary to the geometric representations found in fixed-symbol settings, our findings show that models develop symbolic reasoning mechanisms when trained to reason in-context with variables whose meanings are not fixed.
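The task setup can be pictured with a small data generator. This is a sketch under assumptions: it uses only cyclic groups Z_n under addition, a hypothetical symbol vocabulary, and facts written as (a, b, a·b) triples; the paper's actual groups and sequence format may differ. The key property is that each sequence draws a fresh random assignment of symbols to group elements, so a symbol's meaning can only be inferred from the facts in context.

```python
import random


def sample_sequence(order, symbols, n_facts, rng):
    """Sample facts over the cyclic group Z_order with a per-sequence relabeling.

    Returns (facts, to_symbol), where each fact is a triple of symbols
    (a, b, c) meaning a * b = c under the hidden assignment to_symbol.
    """
    # Fresh random injection from group elements to symbols for this sequence.
    assigned = rng.sample(symbols, order)
    to_symbol = dict(enumerate(assigned))
    facts = []
    for _ in range(n_facts):
        x, y = rng.randrange(order), rng.randrange(order)
        facts.append((to_symbol[x], to_symbol[y], to_symbol[(x + y) % order]))
    return facts, to_symbol


if __name__ == "__main__":
    rng = random.Random(0)
    vocabulary = list("abcdefgh")
    for _ in range(2):
        facts, _ = sample_sequence(3, vocabulary, 4, rng)
        print(" , ".join(f"{a}{b}={c}" for a, b, c in facts))
```

Under such a setup, the symbol playing the identity role can be spotted from any fact of the form x·e = x, and commutativity lets a model copy the answer to a·b from an earlier b·a, which lines up with two of the mechanisms the abstract names.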

[31] Constructive Circuit Amplification: Improving Math Reasoning in LLMs via Targeted Sub-Network Updates

Nikhil Prakash,Donghao Ren,Dominik Moritz,Yannick Assogba

Main category: cs.CL

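TL;DR: Proposes Constructive Circuit Amplification, a method that identifies pivotal tokens in reasoning traces and the model components responsible for a task, then updates only that sparse sub-network; on math reasoning it improves accuracy by up to 11.4% while modifying as little as 1.59% of model components, with minimal impact on other abilities.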

Details

Motivation: Prior work shows that large models contain sparse sub-networks (circuits) responsible for specific tasks and that fine-tuning mostly improves performance by strengthening existing circuits, which raises the question of whether intervening on these circuits directly can deliver precise capability gains. Method: Constructive Circuit Amplification identifies pivotal tokens in model reasoning traces and the components responsible for the target task, and updates only those components. Result: Applied to math reasoning across multiple models, accuracy improves by up to +11.4% while as little as 1.59% of model components are modified, with minimal impact on other abilities as measured by MMLU, TriviaQA, and TruthfulQA. Conclusion: Selectively updating a sparse set of model components can reliably enhance a targeted capability, supporting circuit-level intervention as an effective and precise form of model editing. Abstract: Prior studies investigating the internal workings of LLMs have uncovered sparse subnetworks, often referred to as circuits, that are responsible for performing specific tasks. Additionally, it has been shown that model performance improvement through fine-tuning often results from the strengthening of existing circuits in the model. Taken together, these findings suggest the possibility of intervening directly on such circuits to make precise, task-targeted updates. Motivated by these findings, we propose a novel method called Constructive Circuit Amplification which identifies pivotal tokens from model reasoning traces as well as model components responsible for the desired task, and updates only those components. Applied to mathematical reasoning, it improves accuracy by up to +11.4% across multiple models while modifying as little as 1.59% of model components, with minimal impact on other abilities as measured by MMLU, TriviaQA, and TruthfulQA. These results demonstrate that targeted capabilities can be reliably enhanced by selectively updating a sparse set of model components.
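Updating only a chosen set of components amounts to masking the optimizer step. The sketch below shows that idea with plain gradient descent on a dictionary of named parameter groups; the component names, the learning rate, and the use of vanilla SGD are assumptions, and the step that actually identifies the responsible components (the core of the paper's method) is not reproduced here.

```python
def masked_sgd_step(params, grads, selected, lr=0.1):
    """Apply one SGD step to selected components and leave the rest untouched.

    params, grads: dicts mapping a component name to a list of floats.
    selected: names of the components identified as task-relevant.
    """
    return {
        name: ([w - lr * g for w, g in zip(weights, grads[name])]
               if name in selected else list(weights))
        for name, weights in params.items()
    }


def fraction_updated(params, selected):
    """Share of scalar parameters that belong to the selected components."""
    total = sum(len(w) for w in params.values())
    chosen = sum(len(w) for name, w in params.items() if name in selected)
    return chosen / total


if __name__ == "__main__":
    # Hypothetical component names; real models have many more.
    params = {"layer3.head1": [1.0, 2.0], "layer7.mlp": [0.5] * 6}
    grads = {"layer3.head1": [1.0, 1.0], "layer7.mlp": [1.0] * 6}
    print(masked_sgd_step(params, grads, {"layer3.head1"}))
    print(fraction_updated(params, {"layer3.head1"}))
```

Because every unselected weight is left exactly as it was, any change in behaviour on unrelated benchmarks can only come through the small updated subset, which is consistent with the limited side effects the abstract reports.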

cs.CV

[32] Two-Step Data Augmentation for Masked Face Detection and Recognition: Turning Fake Masks to Real

Yan Yang,George Bebis,Mircea Nicolescu

Main category: cs.CV

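TL;DR: Proposes a two-step generative data augmentation framework that combines rule-based mask warping with GAN-based unpaired image-to-image translation to generate more realistic masked-face samples.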

Details

Motivation: To address data scarcity and distribution shift in masked face detection and recognition. Method: Rule-based mask warping is combined with GAN-based unpaired image-to-image translation, with a non-mask preservation loss and stochastic noise injection. Result: Compared with rule-based warping alone, the method gives consistent qualitative improvements and increases sample diversity. Conclusion: The method strengthens data augmentation for masked face recognition and suggests directions for future data-centric augmentation. Abstract: Data scarcity and distribution shift pose major challenges for masked face detection and recognition. We propose a two-step generative data augmentation framework that combines rule-based mask warping with unpaired image-to-image translation using GANs, enabling the generation of realistic masked-face samples beyond purely synthetic transformations. Compared to rule-based warping alone, the proposed approach yields consistent qualitative improvements and complements existing GAN-based masked face generation methods such as IAMGAN. We introduce a non-mask preservation loss and stochastic noise injection to stabilize training and enhance sample diversity. Experimental observations highlight the effectiveness of the proposed components and suggest directions for future improvements in data-centric augmentation for face recognition tasks.

[33] Seeing Beyond Words: Self-Supervised Visual Learning for Multimodal Large Language Models

Davide Caffagni,Sara Sarto,Marcella Cornia,Lorenzo Baraldi,Pier Luigi Dovesi,Shaghayegh Roohi,Mark Granroth-Wilding,Rita Cucchiara

Main category: cs.CV

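TL;DR: This paper proposes JARVIS, an I-JEPA-inspired self-supervised framework that strengthens visual understanding in multimodal large language models (MLLMs) by using vision foundation models as context and target encoders, reducing reliance on text supervision and improving performance on vision-centric tasks.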

Details

Motivation: MLLMs learn visual understanding mainly from textual descriptions, a subjective and incomplete supervisory signal, and because multimodal instruction tuning is small in scale, they tend to overfit language priors and overlook visual details. Method: The I-JEPA learning paradigm is integrated into the standard vision-language alignment pipeline: frozen vision foundation models act as context and target encoders, and a predictor built from the early layers of the LLM is trained to learn structural and semantic regularities from images, reducing dependence on language supervision. Result: On standard MLLM benchmarks, JARVIS consistently improves vision-centric performance across different LLM families without degrading multimodal reasoning. Conclusion: Self-supervised visual enhancement reduces MLLMs' reliance on language priors and strengthens visual understanding, suggesting a route to more robust multimodal models. Abstract: Multimodal Large Language Models (MLLMs) have recently demonstrated impressive capabilities in connecting vision and language, yet their proficiency in fundamental visual reasoning tasks remains limited. This limitation can be attributed to the fact that MLLMs learn visual understanding primarily from textual descriptions, which constitute a subjective and inherently incomplete supervisory signal. Furthermore, the modest scale of multimodal instruction tuning compared to massive text-only pre-training leads MLLMs to overfit language priors while overlooking visual details. To address these issues, we introduce JARVIS, a JEPA-inspired framework for self-supervised visual enhancement in MLLMs. Specifically, we integrate the I-JEPA learning paradigm into the standard vision-language alignment pipeline of MLLMs training. Our approach leverages frozen vision foundation models as context and target encoders, while training the predictor, implemented as the early layers of an LLM, to learn structural and semantic regularities from images without relying exclusively on language supervision. Extensive experiments on standard MLLM benchmarks show that JARVIS consistently improves performance on vision-centric benchmarks across different LLM families, without degrading multimodal reasoning abilities. Our source code is publicly available at: https://github.com/aimagelab/JARVIS.

Dwip Dalal,Utkarsh Mishra,Narendra Ahuja,Nebojsa Jojic

Main category: cs.CV

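TL;DR: This paper introduces CityNav, a new benchmark for evaluating the visual navigation ability of multimodal large language models in sparsely annotated real-world city environments, and proposes the VoP method to improve their reasoning and navigation performance.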

Details

Motivation: Existing benchmarks for embodied agents are mostly language-centric or rely on simulated environments, so they rarely probe the reasoning that complex, knowledge-intensive real-world tasks demand; a new task and benchmark are needed to evaluate MLLMs' sequential decision-making in real environments. Method: The authors define the Sparsely Grounded Visual Navigation task and build the CityNav benchmark, spanning four global cities with 50+ decision points, where agents navigate using only visual input and internal multimodal reasoning; they also introduce Verbalization of Path (VoP), which strengthens reasoning by eliciting an explicit cognitive map (key landmarks and directions) from the MLLM. Result: Current state-of-the-art MLLMs and standard reasoning techniques (such as Chain-of-Thought and Reflection) perform poorly on this task, while VoP substantially improves navigation success. Conclusion: CityNav offers a new standard for evaluating MLLMs on knowledge-intensive real-world tasks, and VoP shows that an explicit cognitive map improves embodied agents' reasoning. Abstract: Leveraging multimodal large language models (MLLMs) to develop embodied agents offers significant promise for addressing complex real-world tasks. However, current evaluation benchmarks remain predominantly language-centric or heavily reliant on simulated environments, rarely probing the nuanced, knowledge-intensive reasoning essential for practical, real-world scenarios. To bridge this critical gap, we introduce the task of Sparsely Grounded Visual Navigation, explicitly designed to evaluate the sequential decision-making abilities of MLLMs in challenging, knowledge-intensive real-world environments. We operationalize this task with CityNav, a comprehensive benchmark encompassing four diverse global cities, specifically constructed to assess raw MLLM-driven agents in city navigation. Agents are required to rely solely on visual inputs and internal multimodal reasoning to sequentially navigate 50+ decision points without additional environmental annotations or specialized architectural modifications. Crucially, agents must autonomously achieve localization through interpreting city-specific cues and recognizing landmarks, perform spatial reasoning, and strategically plan and execute routes to their destinations. Through extensive evaluations, we demonstrate that current state-of-the-art MLLMs and standard reasoning techniques (e.g., Chain-of-Thought, Reflection) significantly underperform in this challenging setting. To address this, we propose Verbalization of Path (VoP), which explicitly grounds the agent's internal reasoning by probing an explicit cognitive map (key landmarks and directions toward the destination) from the MLLMs, substantially enhancing navigation success. Project Webpage: https://dwipddalal.github.io/AgentNav/

[35] R4: Retrieval-Augmented Reasoning for Vision-Language Models in 4D Spatio-Temporal Space

Tin Stribor Sohn,Maximilian Dillitzer,Jason J. Corso,Eric Sax

Main category: cs.CV

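TL;DR: Proposes R4, a training-free framework for retrieval-augmented reasoning in 4D spatio-temporal space that gives vision-language models structured, lifelong memory by anchoring object-level semantic descriptions in metric space and time.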

Details

Motivation: Inspired by the human ability to build persistent, structured internal representations in four dimensions, the work aims to improve how vision-language models perceive and reason about spatio-temporal information. Method: Continuously builds a 4D knowledge database, decomposes natural language queries into semantic, spatial, and temporal keys for retrieval, and performs retrieval directly in 4D space so that reasoning needs no training. Result: Substantially outperforms baselines on embodied question answering and navigation benchmarks, with better retrieval and reasoning over spatio-temporal information. Conclusion: R4 offers a new paradigm for embodied 4D reasoning in dynamic environments, supports a persistent world model shared across agents, and enables training-free episodic and collaborative reasoning. Abstract: Humans perceive and reason about their surroundings in four dimensions by building persistent, structured internal representations that encode semantic meaning, spatial layout, and temporal dynamics. These multimodal memories enable them to recall past events, infer unobserved states, and integrate new information into context-dependent reasoning. Inspired by this capability, we introduce R4, a training-free framework for retrieval-augmented reasoning in 4D spatio-temporal space that equips vision-language models (VLMs) with structured, lifelong memory. R4 continuously constructs a 4D knowledge database by anchoring object-level semantic descriptions in metric space and time, yielding a persistent world model that can be shared across agents. At inference, natural language queries are decomposed into semantic, spatial, and temporal keys to retrieve relevant observations, which are integrated into the VLM's reasoning. Unlike classical retrieval-augmented generation methods, retrieval in R4 operates directly in 4D space, enabling episodic and collaborative reasoning without training. Experiments on embodied question answering and navigation benchmarks demonstrate that R4 substantially improves retrieval and reasoning over spatio-temporal information compared to baselines, advancing a new paradigm for embodied 4D reasoning in dynamic environments.
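
The retrieval step described in the R4 abstract (a query split into semantic, spatial, and temporal keys) can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the paper: the `Observation` record, the `retrieve` function, substring matching as a stand-in for semantic matching, and the radius and time-window filters are all assumptions chosen to keep the example small.

```python
from dataclasses import dataclass
from math import dist


@dataclass
class Observation:
    """One memory entry (hypothetical layout, not the paper's schema)."""
    description: str   # object-level semantic description
    position: tuple    # (x, y, z) in a metric world frame
    timestamp: float   # seconds since the start of the episode


def retrieve(memory, keyword, center, radius, t_start, t_end):
    """Return observations that satisfy all three keys at once."""
    hits = []
    for obs in memory:
        # Semantic key: plain substring match stands in for real semantic matching.
        semantic_ok = keyword.lower() in obs.description.lower()
        # Spatial key: keep entries within `radius` of the query location.
        spatial_ok = dist(obs.position, center) <= radius
        # Temporal key: keep entries inside the query time window.
        temporal_ok = t_start <= obs.timestamp <= t_end
        if semantic_ok and spatial_ok and temporal_ok:
            hits.append(obs)
    # Most recent matching observations first.
    return sorted(hits, key=lambda o: o.timestamp, reverse=True)


# Toy memory with invented entries.
memory = [
    Observation("red mug on the kitchen table", (1.0, 2.0, 0.8), 10.0),
    Observation("red mug in the sink", (4.0, 2.5, 0.9), 95.0),
    Observation("blue chair near the window", (1.2, 2.1, 0.0), 12.0),
]
results = retrieve(memory, "mug", center=(1.0, 2.0, 1.0), radius=1.0,
                   t_start=0.0, t_end=60.0)
```

In this toy setup only the first entry passes all three filters; in a real system the retrieved observations would then be passed to the VLM as context.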

[36] The Perceptual Observatory Characterizing Robustness and Grounding in MLLMs

Tejas Anvekar,Fenil Bardoliya,Pavan K. Turaga,Chitta Baral,Vivek Gupta

Main category: cs.CV

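TL;DR: This paper proposes The Perceptual Observatory, a framework for systematically evaluating the visual perception of multimodal large language models (MLLMs). It goes beyond conventional task-accuracy metrics, using controlled perturbations and ground-truth datasets to analyze visual grounding, relational structure, and robustness.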

Details

Motivation: Existing evaluations lean too heavily on end-task accuracy and overlook whether MLLMs are truly grounded in visual perception. Because vision encoders stay nearly unchanged while language components keep growing, it is hard to tell whether gains come from genuine visual understanding or from textual knowledge. Method: Proposes The Perceptual Observatory, a framework with several test verticals such as face matching, text recognition, and local-to-global understanding, built on datasets with ground-truth labels that are systematically perturbed through pixel-level augmentations and diffusion-based stylized illusions. Result: The framework reveals how well MLLMs preserve perception under different perturbations and where their reasoning breaks down, and it distinguishes how well a model understands visual structure and how faithful its attributions are. Conclusion: The Perceptual Observatory provides a finer-grained, more interpretable benchmark for the visual perception of MLLMs and supports future progress toward genuine visual grounding. Abstract: Recent advances in multimodal large language models (MLLMs) have yielded increasingly powerful models, yet their perceptual capacities remain poorly characterized. In practice, most model families scale the language component while reusing nearly identical vision encoders (e.g., Qwen2.5-VL 3B/7B/72B), which raises pivotal concerns about whether progress reflects genuine visual grounding or reliance on internet-scale textual world knowledge. Existing evaluation methods emphasize end-task accuracy, overlooking robustness, attribution fidelity, and reasoning under controlled perturbations. We present The Perceptual Observatory, a framework that characterizes MLLMs across verticals such as: (i) simple vision tasks, such as face matching and text-in-vision comprehension capabilities; (ii) local-to-global understanding, encompassing image matching, grid pointing game, and attribute localization, which tests general visual grounding. Each vertical is instantiated with ground-truth datasets of faces and words, systematically perturbed through pixel-based augmentations and diffusion-based stylized illusions. The Perceptual Observatory moves beyond leaderboard accuracy to yield insights into how MLLMs preserve perceptual grounding and relational structure under perturbations, providing a principled foundation for analyzing strengths and weaknesses of current and future models.

[37] Seeing is Believing (and Predicting): Context-Aware Multi-Human Behavior Prediction with Vision Language Models

Utsav Panchal,Yuchen Liu,Luigi Palmieri,Ilche Georgievski,Marco Aiello

Main category: cs.CV

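TL;DR: This paper proposes CAMP-VLM, a context-aware multi-human behavior prediction framework based on vision language models. It is fine-tuned on synthetic data and substantially improves the accuracy of multi-human behavior prediction from a third-person view.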

Details

Motivation: Prior work focuses mainly on behavior prediction in single-human scenarios from an egocentric view, whereas real robotic applications need to understand multiple humans from a third-person view, and suitable datasets and methods are lacking. Method: Proposes the CAMP-VLM framework, which combines contextual features from visual input with spatial awareness from scene graphs and uses a vision language model to predict multi-human behavior; the model is trained with supervised fine-tuning and direct preference optimization on synthetic data generated by a photorealistic simulator. Result: Evaluated on synthetic and real-world sequences, CAMP-VLM improves prediction accuracy over the best baseline by up to 66.9%. Conclusion: CAMP-VLM improves multi-human behavior prediction from a third-person view and shows the potential of combining synthetic data with VLMs for this task. Abstract: Accurately predicting human behaviors is crucial for mobile robots operating in human-populated environments. While prior research primarily focuses on predicting actions in single-human scenarios from an egocentric view, several robotic applications require understanding multiple human behaviors from a third-person perspective. To this end, we present CAMP-VLM (Context-Aware Multi-human behavior Prediction): a Vision Language Model (VLM)-based framework that incorporates contextual features from visual input and spatial awareness from scene graphs to enhance prediction of human-scene interactions. Due to the lack of suitable datasets for multi-human behavior prediction from an observer view, we perform fine-tuning of CAMP-VLM with synthetic human behavior data generated by a photorealistic simulator, and evaluate the resulting models on both synthetic and real-world sequences to assess their generalization capabilities. Leveraging Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO), CAMP-VLM outperforms the best-performing baseline by up to 66.9% in prediction accuracy.
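
For readers unfamiliar with the DPO stage mentioned in the abstract, the standard DPO objective is reproduced below as background. The mapping to this paper is an assumption: here $x$ would be the visual and scene-graph context, $y_w$ a preferred behavior prediction, and $y_l$ a dispreferred one; the abstract does not say how the preference pairs are built or which $\beta$ is used.

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

Here $\pi_\theta$ is the model being tuned, $\pi_{\mathrm{ref}}$ is a frozen reference model (typically the SFT checkpoint), $\sigma$ is the logistic function, and $\beta$ controls how far the tuned model may drift from the reference.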

[38] From Words to Wavelengths: VLMs for Few-Shot Multispectral Object Detection

Manuel Nkegoum,Minh-Tan Pham,Élisa Fromont,Bruno Avignon,Sébastien Lefèvre

Main category: cs.CV

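TL;DR: This study explores the potential of vision-language models (VLMs) for few-shot multispectral object detection. It adapts Grounding DINO and YOLO-World to fuse text, visible, and thermal modalities, significantly outperforming existing methods when data is scarce and achieving competitive or superior results under full supervision.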

Details

Motivation: Annotated multispectral data is scarce, which limits the training of deep detectors, while textual semantic information can serve as a complementary supervisory signal; the work therefore explores using VLMs to make multispectral detection more data-efficient. Method: Adapts Grounding DINO and YOLO-World to multispectral inputs and proposes an effective cross-modal fusion mechanism that integrates text, visual, and thermal information. Result: On the FLIR and M3FD datasets, the VLM-based detectors significantly outperform specialized multispectral models in few-shot settings and match or exceed them under full supervision. Conclusion: The semantic priors learned by large-scale VLMs transfer effectively to unseen spectral modalities, opening a new path toward data-efficient multispectral perception. Abstract: Multispectral object detection is critical for safety-sensitive applications such as autonomous driving and surveillance, where robust perception under diverse illumination conditions is essential. However, the limited availability of annotated multispectral data severely restricts the training of deep detectors. In such data-scarce scenarios, textual class information can serve as a valuable source of semantic supervision. Motivated by the recent success of Vision-Language Models (VLMs) in computer vision, we explore their potential for few-shot multispectral object detection. Specifically, we adapt two representative VLM-based detectors, Grounding DINO and YOLO-World, to handle multispectral inputs and propose an effective mechanism to integrate text, visual and thermal modalities. Through extensive experiments on two popular multispectral image benchmarks, FLIR and M3FD, we demonstrate that VLM-based detectors not only excel in few-shot regimes, significantly outperforming specialized multispectral models trained with comparable data, but also achieve competitive or superior results under fully supervised settings. Our findings reveal that the semantic priors learned by large-scale VLMs effectively transfer to unseen spectral modalities, offering a powerful pathway toward data-efficient multispectral perception.

[39] Are vision-language models ready to zero-shot replace supervised classification models in agriculture?

Earl Ranario,Mason J. Earles

Main category: cs.CV

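TL;DR: This paper evaluates the zero-shot performance of several vision-language models (VLMs) on agricultural classification tasks and finds that they fall far short of a task-specific supervised model (YOLO11), especially under open-ended generative prompting. Constrained prompting and semantic judging improve results, but current off-the-shelf VLMs are not yet suitable as standalone agricultural diagnostic systems.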

Details

Motivation: Examines how reliable off-the-shelf vision-language models (VLMs) are for agricultural decision support, filling a gap in what is known about their performance in this domain. Method: Benchmarks a range of open-source and closed-source VLMs on 27 agricultural classification datasets from the AgML collection, covering plant disease, pest and damage, and plant and weed species identification; uses a zero-shot setting, compares multiple-choice with open-ended prompting, and applies LLM-based semantic judging to score open-ended answers more accurately. Result: All zero-shot VLMs trail the supervised YOLO11 baseline by a wide margin. The best VLM (Gemini-3 Pro) reaches about 62% average accuracy under multiple-choice prompting, while open-ended prompting typically stays below 25%; semantic judging raises open-ended accuracy (for example, from 21% to 30%). Qwen-VL-72B is the best open-source model. Plant and weed species classification is easier, and pest and damage identification is the hardest. Conclusion: Current off-the-shelf VLMs are not yet adequate as standalone agricultural diagnostic tools, but they can serve as assistive components when paired with constrained interfaces, explicit label ontologies, and domain-aware evaluation. Abstract: Vision-language models (VLMs) are increasingly proposed as general-purpose solutions for visual recognition tasks, yet their reliability for agricultural decision support remains poorly understood. We benchmark a diverse set of open-source and closed-source VLMs on 27 agricultural classification datasets from the AgML collection, spanning 162 classes across plant disease, pest and damage, and plant and weed species identification. Across all tasks, zero-shot VLMs substantially underperform a supervised task-specific baseline (YOLO11), which consistently achieves markedly higher accuracy than any foundation model. Under multiple-choice prompting, the best-performing VLM (Gemini-3 Pro) reaches approximately 62% average accuracy, while open-ended prompting yields much lower performance, with raw accuracies typically below 25%. Applying LLM-based semantic judging increases open-ended accuracy (for example, from 21% to 30% for top models) and alters model rankings, demonstrating that evaluation methodology meaningfully affects reported conclusions. Among open-source models, Qwen-VL-72B performs best, approaching closed-source performance under constrained prompting but still trailing top proprietary systems. Task-level analysis shows that plant and weed species classification is consistently easier than pest and damage identification, which remains the most challenging category across models. Overall, these results indicate that current off-the-shelf VLMs are not yet suitable as standalone agricultural diagnostic systems, but can function as assistive components when paired with constrained interfaces, explicit label ontologies, and domain-aware evaluation strategies.
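
The gap between raw and judged open-ended accuracy reported in the abstract comes down to how a free-text answer is counted as correct. The sketch below contrasts strict string matching with a more lenient judge. It is only an illustration: the labels and answers are invented toy data, and the alias table is a hand-written stand-in for the LLM-based judge, whose actual prompt and behavior the abstract does not describe.

```python
def accuracy(predictions, labels, is_correct):
    """Fraction of predictions accepted by the given correctness rule."""
    hits = sum(1 for p, y in zip(predictions, labels) if is_correct(p, y))
    return hits / len(labels)


def exact_match(pred, label):
    """Raw scoring: the answer must equal the label (case-insensitive)."""
    return pred.strip().lower() == label.strip().lower()


# Hypothetical alias table standing in for an LLM judge's semantic decision.
ALIASES = {"tomato late blight": {"late blight", "late blight of tomato"}}


def lenient_judge(pred, label):
    """Judged scoring: also accept answers listed as equivalent to the label."""
    p, y = pred.strip().lower(), label.strip().lower()
    return p == y or p in ALIASES.get(y, set())


# Invented toy data, not drawn from AgML.
labels = ["tomato late blight", "aphid", "healthy"]
answers = ["Late blight", "spider mite", "healthy"]

raw = accuracy(answers, labels, exact_match)
judged = accuracy(answers, labels, lenient_judge)
```

With this toy data the lenient rule accepts one extra answer, which is the same kind of shift the abstract reports when semantic judging is applied.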

[40] Eyes on the Grass: Biodiversity-Increasing Robotic Mowing Using Deep Visual Embeddings

Lars Beckers,Arno Waes,Aaron Van Campenhout,Toon Goedemé

Main category: cs.CV

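TL;DR: Proposes a robotic mowing framework that actively enhances garden biodiversity through visual perception and adaptive decision-making, using deep feature-space analysis to selectively preserve vegetation patches.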

Details

Motivation: Conventional mowing harms biodiversity and passive rewilding has limited effect, so an intelligent mowing system that actively promotes ecological richness is needed. Method: Uses a pretrained ResNet50 network to extract ecologically meaningful embeddings from plant images, estimates biodiversity with a global deviation metric that requires no species-level supervision, and uses these estimates to drive a selective mowing algorithm that adjusts mowing behavior dynamically. Result: Validated in a controlled mock-up lawn and on real garden datasets; embedding-space dispersion correlates strongly with expert biodiversity assessment, supporting the approach. Conclusion: The framework can turn monocultural lawns into ecologically valuable habitats, and large-scale adoption could noticeably increase urban biodiversity. Abstract: This paper presents a robotic mowing framework that actively enhances garden biodiversity through visual perception and adaptive decision-making. Unlike passive rewilding approaches, the proposed system uses deep feature-space analysis to identify and preserve visually diverse vegetation patches in camera images by selectively deactivating the mower blades. A ResNet50 network pretrained on PlantNet300K provides ecologically meaningful embeddings, from which a global deviation metric estimates biodiversity without species-level supervision. These estimates drive a selective mowing algorithm that dynamically alternates between mowing and conservation behavior. The system was implemented on a modified commercial robotic mower and validated both in a controlled mock-up lawn and on real garden datasets. Results demonstrate a strong correlation between embedding-space dispersion and expert biodiversity assessment, confirming the feasibility of deep visual diversity as a proxy for ecological richness and the effectiveness of the proposed mowing decision approach. Widespread adoption of such systems will turn ecologically worthless, monocultural lawns into vibrant, valuable biotopes that boost urban biodiversity.
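
The idea of using embedding-space dispersion as a biodiversity proxy can be sketched in a few lines. The abstract does not define its global deviation metric, so the version below (mean Euclidean distance of patch embeddings from their centroid) is an assumption, as are the function names, the threshold rule, and the 2-D toy vectors that stand in for ResNet50 embeddings.

```python
from math import sqrt


def global_deviation(embeddings):
    """Mean Euclidean distance of patch embeddings from their centroid.

    Assumed definition for illustration; the paper's metric may differ.
    """
    n = len(embeddings)
    dim = len(embeddings[0])
    centroid = [sum(e[i] for e in embeddings) / n for i in range(dim)]
    distances = [
        sqrt(sum((e[i] - centroid[i]) ** 2 for i in range(dim)))
        for e in embeddings
    ]
    return sum(distances) / n


def blades_on(embeddings, threshold):
    """Mow when the view looks uniform; stop the blades when it looks diverse."""
    return global_deviation(embeddings) < threshold


# Toy 2-D vectors standing in for high-dimensional image embeddings.
uniform_lawn = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
mixed_patch = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

mow_uniform = blades_on(uniform_lawn, threshold=0.5)
mow_mixed = blades_on(mixed_patch, threshold=0.5)
```

A threshold on a single scalar keeps the decision rule easy to tune on the robot, though a deployed system would likely smooth the estimate over time before switching the blades.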

Liudi Yang,Yang Bai,George Eskandar,Fengyi Shen,Mohammad Altillawi,Dong Chen,Ziyuan Liu,Abhinav Valada

Main category: cs.CV

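TL;DR: Proposes a method for generating video-action pairs that extends a pretrained video diffusion model and introduces a Bridge Attention mechanism, enabling robot policy learning from text instructions and an initial observation and significantly outperforming existing methods.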

Details

Motivation: Existing methods couple the video and action modalities only loosely and lack precise action annotations, which limits the use of video data for robot policy learning. Method: Extends a pretrained video diffusion model with a dedicated action diffusion model, designs a Bridge Attention mechanism for cross-modal interaction, and introduces an action refinement module that converts coarse actions into precise controls. Result: On multiple public benchmarks and real-world datasets, the method generates higher-quality videos and more accurate actions and significantly outperforms existing baselines. Conclusion: The method jointly generates high-quality videos and precise actions, providing a scalable framework for robot learning from large-scale video data. Abstract: We present a method to generate video-action pairs that follow text instructions, starting from an initial image observation and the robot's joint states. Our approach automatically provides action labels for video diffusion models, overcoming the common lack of action annotations and enabling their full use for robotic policy learning. Existing methods either adopt two-stage pipelines, which limit tightly coupled cross-modal information sharing, or rely on adapting a single-modal diffusion model for a joint distribution that cannot fully leverage pretrained video knowledge. To overcome these limitations, we (1) extend a pretrained video diffusion model with a parallel, dedicated action diffusion model that preserves pretrained knowledge, (2) introduce a Bridge Attention mechanism to enable effective cross-modal interaction, and (3) design an action refinement module to convert coarse actions into precise controls for low-resolution datasets. Extensive evaluations on multiple public benchmarks and real-world datasets demonstrate that our method generates higher-quality videos and more accurate actions and significantly outperforms existing baselines, offering a scalable framework for leveraging large-scale video data for robotic learning.

[42] Driving in Corner Case: A Real-World Adversarial Closed-Loop Evaluation Platform for End-to-End Autonomous Driving

Jiaheng Geng,Jiatong Du,Xinyu Zhang,Ye Li,Panqu Wang,Yanjun Huang

Main category: cs.CV

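TL;DR: Proposes a closed-loop evaluation platform for end-to-end autonomous driving that generates adversarial interactions in real-world scenes to expose performance degradation in safety-critical corner cases.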

Details

Motivation: Existing adversarial evaluation methods mostly target models in simplified simulation environments, and there is no effective way to evaluate end-to-end autonomous driving systems trained on real-world data. Method: Builds a closed-loop evaluation platform that combines a flow-matching-based real-world image generator with an efficient adversarial traffic policy to test end-to-end driving models adversarially in real-world scenes. Result: The platform generates realistic driving images efficiently and stably, and experiments with models such as UniAD and VAD show that it detects performance degradation in corner cases. Conclusion: The platform identifies safety issues in end-to-end autonomous driving models and helps improve their safety and robustness in real environments. Abstract: Safety-critical corner cases, difficult to collect in the real world, are crucial for evaluating end-to-end autonomous driving. Adversarial interaction is an effective method to generate such safety-critical corner cases. While existing adversarial evaluation methods are built for models operating in simplified simulation environments, adversarial evaluation for real-world end-to-end autonomous driving has been little explored. To address this challenge, we propose a closed-loop evaluation platform for end-to-end autonomous driving, which can generate adversarial interactions in real-world scenes. In our platform, the real-world image generator cooperates with an adversarial traffic policy to evaluate various end-to-end models trained on real-world data. The generator, based on flow matching, efficiently and stably generates real-world images according to the traffic environment information. The efficient adversarial surrounding vehicle policy is designed to model challenging interactions and create corner cases that current autonomous driving systems struggle to handle. Experimental results demonstrate that the platform can generate realistic driving images efficiently. By evaluating end-to-end models such as UniAD and VAD, we demonstrate that, with the adversarial policy, our platform exposes the performance degradation of the tested models in corner cases. This result indicates that the platform can effectively detect a model's potential issues, which will facilitate the safety and robustness of end-to-end autonomous driving.

[43] FOD-Diff: 3D Multi-Channel Patch Diffusion Model for Fiber Orientation Distribution

Hao Tang,Hanyu Liu,Alessandro Perelli,Xi Chen,Chao Li

Main category: cs.CV

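TL;DR: Proposes a 3D multi-channel patch diffusion model that combines anatomical priors with a spherical harmonic attention mechanism to predict high angular resolution FOD accurately from low angular resolution dMRI.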

Details

Motivation: Estimating FOD from single-shell low angular resolution dMRI is not accurate enough, while multi-shell high angular resolution acquisition takes a long time and limits clinical use; diffusion models are promising but struggle to handle the large number of spherical harmonic coefficients efficiently. Method: Designs a 3D multi-channel patch diffusion model, with an FOD-patch adapter that injects prior brain anatomy to make learning more efficient, a voxel-level conditional coordinating module to strengthen global understanding, and a spherical harmonic attention module to capture the complex correlations among SH coefficients. Result: Experiments show that the method outperforms existing state-of-the-art methods on HAR-FOD prediction. Conclusion: The method generates high angular resolution FOD from low angular resolution dMRI efficiently and accurately and has good prospects for clinical use. Abstract: Diffusion MRI (dMRI) is a critical non-invasive technique to estimate fiber orientation distribution (FOD) for characterizing white matter integrity. Estimating FOD from single-shell low angular resolution dMRI (LAR-FOD) is limited in accuracy, whereas estimating FOD from multi-shell high angular resolution dMRI (HAR-FOD) requires a long scanning time, which limits its applicability. Diffusion models have shown promise in estimating HAR-FOD based on LAR-FOD. However, using diffusion models to efficiently generate HAR-FOD is challenging due to the large number of spherical harmonic (SH) coefficients in FOD. Here, we propose a 3D multi-channel patch diffusion model to predict HAR-FOD from LAR-FOD. We design the FOD-patch adapter by introducing the prior brain anatomy for more efficient patch-based learning. Furthermore, we introduce a voxel-level conditional coordinating module to enhance the global understanding of the model. We design the SH attention module to effectively learn the complex correlations of the SH coefficients. Our experimental results show that our method achieves the best performance in HAR-FOD prediction and outperforms other state-of-the-art methods.

[44] Auto-Vocabulary 3D Object Detection

Haomeng Zhang,Kuan-Chuan Peng,Suhas Lohit,Raymond A. Yeh

Main category: cs.CV

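TL;DR: This paper proposes AV3DOD, an auto-vocabulary 3D object detection method that needs no user-specified classes. It uses 2D vision-language models to generate class names, introduces a Semantic Score (SS) to evaluate their quality, and achieves state-of-the-art localization and semantic quality on ScanNetV2 and SUNRGB-D.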

Details

Motivation: Existing open-vocabulary 3D detection methods still depend on user-specified classes, which limits automated use in real scenes, so a fully automatic class generation mechanism is needed. Method: Proposes the AV3DOD framework, which combines image captioning, pseudo 3D box generation, and feature-space semantics expansion, uses 2D vision-language models to generate candidate classes, and introduces a Semantic Score (SS) to evaluate the quality of generated classes. Result: Achieves SOTA performance on the ScanNetV2 and SUNRGB-D datasets, improving on CoDA by 3.48 mAP and raising SS by 24.5% in relative terms on ScanNetV2. Conclusion: AV3DOD performs open-vocabulary 3D detection without any user-specified classes while improving both detection performance and semantic quality. Abstract: Open-vocabulary 3D object detection methods are able to localize 3D boxes of classes unseen during training. Despite the name, existing methods rely on user-specified classes both at training and inference. We propose to study Auto-Vocabulary 3D Object Detection (AV3DOD), where the classes are automatically generated for the detected objects without any user input. To this end, we introduce Semantic Score (SS) to evaluate the quality of the generated class names. We then develop a novel framework, AV3DOD, which leverages 2D vision-language models (VLMs) to generate rich semantic candidates through image captioning, pseudo 3D box generation, and feature-space semantics expansion. AV3DOD achieves state-of-the-art (SOTA) performance on both localization (mAP) and semantic quality (SS) on the ScanNetV2 and SUNRGB-D datasets. Notably, it surpasses the SOTA, CoDA, by 3.48 overall mAP and attains a 24.5% relative improvement in SS on ScanNetV2.

[45] LAPX: Lightweight Hourglass Network with Global Context

Haopeng Zhao,Marsha Mariya Kappan,Mahdi Bamdad,Francisco Cruz

Main category: cs.CV

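TL;DR: Proposes LAPX, a lightweight human pose estimation model based on an Hourglass network with self-attention that achieves competitive accuracy and real-time speed with only 2.3M parameters, making it suitable for edge devices.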

Details

Motivation: Existing SOTA pose estimation models are computationally expensive, while lightweight models are either awkward to deploy on edge devices or not accurate enough; both efficiency and accuracy are needed. Method: Building on the earlier LAP work, introduces self-attention to capture global context, improves the stage design, and refines the lightweight attention modules to form the Hourglass-based LAPX model. Result: Achieves competitive results on the MPII and COCO benchmarks with only 2.3M parameters and runs in real time. Conclusion: LAPX keeps accuracy competitive while keeping computational cost low, making it suitable for edge deployment and balancing model size, speed, and performance. Abstract: Human pose estimation is a crucial task in computer vision. Methods with SOTA (state-of-the-art) accuracy often involve a large number of parameters and incur substantial computational cost. Many lightweight variants have been proposed to reduce their model size and computational cost. However, several of these methods still contain components that are not well suited for efficient deployment on edge devices. Moreover, models that primarily emphasize inference speed on edge devices often suffer from limited accuracy due to their overly simplified designs. To address these limitations, we propose LAPX, an Hourglass network with self-attention that captures global contextual information, based on previous work, LAP. In addition to adopting the self-attention module, LAPX advances the stage design and refines the lightweight attention modules. It achieves competitive results on two benchmark datasets, MPII and COCO, with only 2.3M parameters, and demonstrates real-time performance, confirming its edge-device suitability.

[46] Collimator-assisted high-precision calibration method for event cameras

Zibin Liu,Shunkun Liang,Banglei Guan,Dongcai Tan,Yang Shang,Qifeng Yu

Main category: cs.CV

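TL;DR: Proposes a new geometric calibration method for event cameras that uses a collimator with flickering star-based patterns, combining a linear solution with nonlinear optimization to calibrate parameters with high precision at long range.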

Details

Motivation: Geometric calibration of event cameras in long-range measurement scenarios remains challenging, particularly the accurate determination of intrinsic and extrinsic parameters. Method: Uses a collimator with flickering star-based patterns, solves the camera parameters linearly with a spherical motion model, and then refines them through nonlinear optimization. Result: Across a range of real-world experimental conditions, the method outperforms existing calibration methods in accuracy and reliability. Conclusion: The proposed method improves event camera calibration for long-range, high-precision measurement. Abstract: Event cameras are a new type of brain-inspired visual sensor with advantages such as high dynamic range and high temporal resolution. The geometric calibration of event cameras, which involves determining their intrinsic and extrinsic parameters, particularly in long-range measurement scenarios, remains a significant challenge. To address the dual requirements of long-distance and high-precision measurement, we propose an event camera calibration method utilizing a collimator with flickering star-based patterns. The proposed method first linearly solves camera parameters using the sphere motion model of the collimator, followed by nonlinear optimization to refine these parameters with high precision. Through comprehensive real-world experiments across varying conditions, we demonstrate that the proposed method consistently outperforms existing event camera calibration methods in terms of accuracy and reliability.

[47] TurboDiffusion: Accelerating Video Diffusion Models by 100-200 Times

Jintao Zhang,Kaiwen Zheng,Kai Jiang,Haoxu Wang,Ion Stoica,Joseph E. Gonzalez,Jianfei Chen,Jun Zhu

Main category: cs.CV

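TL;DR: TurboDiffusion is a video generation acceleration framework that combines attention acceleration, step distillation, and W8A8 quantization to speed up end-to-end diffusion generation by 100-200x while maintaining video quality.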

Details

Motivation: To make diffusion-based video generation far more efficient, addressing the heavy computation and slow generation of conventional methods. Method: Uses low-bit SageAttention and trainable Sparse-Linear Attention (SLA) to accelerate attention, adopts rCM for efficient step distillation, applies W8A8 quantization to reduce model parameters and activations to 8 bits, and adds several engineering optimizations. Result: Experiments on several Wan-series models show that TurboDiffusion achieves a 100-200x generation speedup on a single RTX 5090 GPU with comparable video quality. Conclusion: By combining several key techniques, TurboDiffusion greatly improves the efficiency of diffusion-based video generation and is practical to use. Abstract: We introduce TurboDiffusion, a video generation acceleration framework that can speed up end-to-end diffusion generation by 100-200x while maintaining video quality. TurboDiffusion mainly relies on several components for acceleration: (1) Attention acceleration: TurboDiffusion uses low-bit SageAttention and trainable Sparse-Linear Attention (SLA) to speed up attention computation. (2) Step distillation: TurboDiffusion adopts rCM for efficient step distillation. (3) W8A8 quantization: TurboDiffusion quantizes model parameters and activations to 8 bits to accelerate linear layers and compress the model. In addition, TurboDiffusion incorporates several other engineering optimizations. We conduct experiments on the Wan2.2-I2V-14B-720P, Wan2.1-T2V-1.3B-480P, Wan2.1-T2V-14B-720P, and Wan2.1-T2V-14B-480P models. Experimental results show that TurboDiffusion achieves 100-200x speedup for video generation even on a single RTX 5090 GPU, while maintaining comparable video quality. The GitHub repository, which includes model checkpoints and easy-to-use code, is available at https://github.com/thu-ml/TurboDiffusion.
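
To make the W8A8 component concrete, here is a minimal sketch of 8-bit quantization for both weights (W8) and activations (A8) in a single dot product. It is a pure-Python illustration under stated assumptions: symmetric per-tensor scaling and the helper names `quantize` and `w8a8_dot` are choices made for this example, and the abstract does not specify the quantization granularity or kernels that TurboDiffusion actually uses.

```python
def quantize(values):
    """Symmetric per-tensor quantization of floats to the int8 range.

    Returns the integer codes and the single scale that maps them back.
    """
    max_abs = max(abs(v) for v in values) or 1.0  # avoid a zero scale
    scale = max_abs / 127.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def w8a8_dot(weights, activations):
    """Dot product computed on int8 codes, then rescaled to floating point."""
    qw, scale_w = quantize(weights)
    qa, scale_a = quantize(activations)
    # Integer multiply-accumulate: the part that fast int8 hardware speeds up.
    acc = sum(w * a for w, a in zip(qw, qa))
    # Dequantize once, using the product of the two scales.
    return acc * scale_w * scale_a


weights = [0.5, -1.0, 0.25]
activations = [2.0, 1.0, -4.0]

exact = sum(w * a for w, a in zip(weights, activations))
approx = w8a8_dot(weights, activations)
```

The quantized result differs from the exact one only by a small rounding error, which is the trade-off that lets linear layers run on integer arithmetic and shrinks the stored model.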

[48] Flexible Camera Calibration using a Collimator System

Shunkun Liang,Banglei Guan,Zhenbao Yu,Dongcai Tan,Pengju Sun,Zibin Liu,Qifeng Yu,Yang Shang

Main category: cs.CV

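TL;DR: This paper proposes a new camera calibration method based on a purpose-designed collimator system. Using an angle invariance constraint and a spherical motion model, it reduces the relative motion between target and camera from 6DOF to a 3DOF pure rotation, and it proposes a closed-form linear solver for multiple images, a minimal solver for two images, and a calibration algorithm for a single collimator image. Synthetic and real-world experiments verify the method's feasibility and show that it outperforms baselines.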

Details

Motivation: Conventional camera calibration methods are limited in environmental control and flexibility; a more reliable, controllable calibration environment that depends less on camera motion is needed. Method: Exploits the optical geometry of the designed collimator system to introduce an angle invariance constraint and proves that the relative motion follows a spherical motion model, which lowers the degrees of freedom; on this basis proposes a closed-form linear solver, a minimal solver, and a single-image calibration algorithm. Result: Synthetic and real-world experiments verify the feasibility of the method and show that it outperforms existing baseline methods. Conclusion: By introducing the angle invariance constraint and the spherical motion model, the collimator-based method makes calibration more flexible, faster, and highly accurate, offering a new approach to camera calibration. Abstract: Camera calibration is a crucial step in photogrammetry and 3D vision applications. This paper introduces a novel camera calibration method using a designed collimator system. Our collimator system provides a reliable and controllable calibration environment for the camera. Exploiting the unique optical geometry property of our collimator system, we introduce an angle invariance constraint and further prove that the relative motion between the calibration target and camera conforms to a spherical motion model. This constraint reduces the original 6DOF relative motion between target and camera to a 3DOF pure rotation motion. Using the spherical motion constraint, a closed-form linear solver for multiple images and a minimal solver for two images are proposed for camera calibration. Furthermore, we propose a single collimator image calibration algorithm based on the angle invariance constraint. This algorithm eliminates the requirement for camera motion, providing a novel solution for flexible and fast calibration. The performance of our method is evaluated in both synthetic and real-world experiments, which verify the feasibility of calibration using the collimator system and demonstrate that our method is superior to existing baseline methods. Demo code is available at https://github.com/LiangSK98/CollimatorCalibration
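
A short derivation shows why a collimated target leaves only rotation to estimate. It uses the textbook pinhole model and is an assumption about the setup, not the paper's own formulation: each target point is treated as a point at infinity with unit direction $\mathbf{d}_i$, $K$ is the intrinsic matrix, $R$ the relative rotation, and $\mathbf{x}_i$ the homogeneous image point.

```latex
% Projection of a point at infinity: the translation drops out.
\mathbf{x}_i \simeq K R\, \mathbf{d}_i

% Back-projected rays, up to scale:
K^{-1}\mathbf{x}_i \propto R\,\mathbf{d}_i

% Angle between two rays, with \omega = K^{-\top} K^{-1}:
\cos\theta_{ij}
  = \frac{\mathbf{x}_i^{\top}\,\omega\,\mathbf{x}_j}
         {\sqrt{\mathbf{x}_i^{\top}\,\omega\,\mathbf{x}_i}\;
          \sqrt{\mathbf{x}_j^{\top}\,\omega\,\mathbf{x}_j}}
  = \frac{\mathbf{d}_i^{\top} R^{\top} R\,\mathbf{d}_j}
         {\lVert \mathbf{d}_i \rVert\,\lVert \mathbf{d}_j \rVert}
  = \mathbf{d}_i^{\top}\mathbf{d}_j
```

Because $R^{\top}R = I$, the angle between any two target directions is the same in every image, so known angles $\theta_{ij}$ constrain $\omega$ (and hence $K$) even from a single view, while the remaining unknown per image is the 3DOF rotation $R$.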

[49] Interaction-via-Actions: Cattle Interaction Detection with Joint Learning of Action-Interaction Latent Space

Ren Nakagawa,Yang Yang,Risa Shinoda,Hiroaki Santo,Kenji Oyama,Fumio Okura,Takenao Ohkawa

Main category: cs.CV

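TL;DR: This paper proposes CattleAct, a method for automatically detecting behavioral interactions between grazing cattle from a single image, aimed at smart livestock management tasks such as estrus detection. It decomposes interactions into combinations of individual cattle actions, learns an action latent space from a large-scale action dataset, and fine-tunes it with contrastive learning to embed rare interactions, building a unified action-interaction latent space.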

Details

Motivation: Cattle interaction detection is difficult because there is no comprehensive behavioral dataset that includes interactions and because interactions among grazing cattle are rare events; existing human interaction detection methods cannot be applied directly. Method: First learns an action latent space from a large-scale cattle action dataset, then fine-tunes the pre-trained latent space with contrastive learning to embed rare interactions, building a unified latent space of actions and interactions; a practical system integrating video and GPS inputs is built on top. Result: Experiments on a commercial-scale pasture show that the method detects interactions more accurately than the baselines. Conclusion: CattleAct is a data-efficient method that detects behavioral interactions between grazing cattle, has practical potential, and has a publicly released implementation. Abstract: This paper introduces a method and application for automatically detecting behavioral interactions between grazing cattle from a single image, which is essential for smart livestock management in the cattle industry, such as for detecting estrus. Although interaction detection for humans has been actively studied, a non-trivial challenge lies in cattle interaction detection, specifically the lack of a comprehensive behavioral dataset that includes interactions, as the interactions of grazing cattle are rare events. We, therefore, propose CattleAct, a data-efficient method for interaction detection by decomposing interactions into the combinations of actions by individual cattle. Specifically, we first learn an action latent space from a large-scale cattle action dataset. Then, we embed rare interactions via the fine-tuning of the pre-trained latent space using contrastive learning, thereby constructing a unified latent space of actions and interactions. On top of the proposed method, we develop a practical working system integrating video and GPS inputs. Experiments on a commercial-scale pasture demonstrate the accurate interaction detection achieved by our method compared to the baselines. Our implementation is available at https://github.com/rakawanegan/CattleAct.

[50] ResDynUNet++: A nested U-Net with residual dynamic convolution blocks for dual-spectral CT

Ze Yuan,Wenbin Li,Shusen Zhao

Main category: cs.CV

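TL;DR: Proposes a dual-spectral CT reconstruction framework that combines iterative methods with deep learning, consisting of an OPMT-based knowledge-driven module and a ResDynUNet++-based data-driven module, and improves reconstruction quality.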

Details

Motivation: To address problems in dual-spectral CT reconstruction such as channel imbalance and large artifacts near interfaces, while achieving fast and accurate basis material decomposition. Method: Uses OPMT to reconstruct intermediate basis material images from the projection data, then refines the result with ResDynUNet++, a network built on UNet++ with residual dynamic convolution blocks. Result: Experiments on synthetic phantoms and real clinical data show that the method suppresses artifacts, balances channel responses, and markedly improves reconstructed image quality. Conclusion: The hybrid reconstruction framework performs well for dual-spectral CT, combining fast convergence with high accuracy, and has good application potential. Abstract: We propose a hybrid reconstruction framework for dual-spectral CT (DSCT) that integrates iterative methods with deep learning models. The reconstruction process consists of two complementary components: a knowledge-driven module and a data-driven module. In the knowledge-driven phase, we employ the oblique projection modification technique (OPMT) to reconstruct an intermediate solution of the basis material images from the projection data. We select OPMT for this role because of its fast convergence, which allows it to rapidly generate an intermediate solution that successfully achieves basis material decomposition. Subsequently, in the data-driven phase, we introduce a novel neural network, ResDynUNet++, to refine this intermediate solution. The ResDynUNet++ is built upon a UNet++ backbone by replacing standard convolutions with residual dynamic convolution blocks, which combine the adaptive, input-specific feature extraction of dynamic convolution with the stable training of residual connections. This architecture is designed to address challenges like channel imbalance and near-interface large artifacts in DSCT, producing clean and accurate final solutions. Extensive experiments on both synthetic phantoms and real clinical datasets validate the efficacy and superior performance of the proposed method.

[51] SegGraph: Leveraging Graphs of SAM Segments for Few-Shot 3D Part Segmentation

Yueyang Hu,Haiyong Jiang,Haoxuan Song,Jun Xiao,Hao Pan

Main category: cs.CV

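TL;DR: Proposes SegGraph, a propagation method based on a graph of SAM segments for few-shot 3D part segmentation. It builds a segment graph and propagates 2D foundation model features with a graph neural network, fusing geometric structure with semantic information and markedly improving segmentation performance.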

Details

Motivation: When transferring knowledge from 2D foundation models to 3D segmentation, existing methods either ignore 3D geometric structure or underuse the high-quality grouping cues that SAM provides, which leads to under-segmentation and inconsistent labels. Method: Proposes SegGraph, which builds a segment graph (nodes are segments, edges encode spatial relationships), propagates features extracted by 2D foundation models through a graph neural network, and uses a view-direction-weighted fusion strategy to keep semantics consistent within each segment. Result: On PartNet-E, the method improves on the best existing methods by at least 6.9% mIoU, with especially strong results on small components and part boundaries. Conclusion: SegGraph combines SAM's segmentation cues with geometric structure, improves understanding of local detail in 3D objects, and markedly improves few-shot 3D part segmentation. Abstract: This work presents a novel framework for few-shot 3D part segmentation. Recent advances have demonstrated the significant potential of 2D foundation models for low-shot 3D part segmentation. However, how to effectively aggregate 2D knowledge from foundation models into 3D remains an open problem. Existing methods either ignore geometric structures for 3D feature learning or neglect the high-quality grouping clues from SAM, leading to under-segmentation and inconsistent part labels. We devise a novel SAM segment graph-based propagation method, named SegGraph, to explicitly learn geometric features encoded within SAM's segmentation masks. Our method encodes geometric features by modeling mutual overlap and adjacency between segments while preserving intra-segment semantic consistency. We construct a segment graph, conceptually similar to an atlas, where nodes represent segments and edges capture their spatial relationships (overlap/adjacency). Each node adaptively modulates 2D foundation model features, which are then propagated via a graph neural network to learn global geometric structures. To enforce intra-segment semantic consistency, we map segment features to 3D points with a novel view-direction-weighted fusion attenuating contributions from low-quality segments. Extensive experiments on PartNet-E demonstrate that our method outperforms all competing baselines by at least 6.9 percent mIoU. Further analysis reveals that SegGraph achieves particularly strong performance on small components and part boundaries, demonstrating its superior geometric understanding. The code is available at: https://github.com/YueyangHu2000/SegGraph.
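
The segment graph described in the abstract (nodes are segments, edges are overlap or adjacency) can be sketched as follows. This is a simplified illustration, not the SegGraph implementation: it assumes each SAM segment has already been lifted to a set of 3D point indices, that overlap means sharing at least one point, and that adjacency is read from a precomputed point-neighbor table. The function name and data layout are hypothetical.

```python
from itertools import combinations


def build_segment_graph(segments, neighbors):
    """Build labelled edges between segments.

    segments:  {segment_id: set of 3D point indices covered by the segment}
    neighbors: {point index: set of neighbouring point indices}
    Returns    {(id_a, id_b): "overlap" or "adjacency"}.
    """
    edges = {}
    for a, b in combinations(sorted(segments), 2):
        pts_a, pts_b = segments[a], segments[b]
        if pts_a & pts_b:
            # The two segments share at least one 3D point.
            edges[(a, b)] = "overlap"
        elif any(neighbors.get(p, set()) & pts_b for p in pts_a):
            # No shared points, but some point of a touches a point of b.
            edges[(a, b)] = "adjacency"
    return edges


# Toy point cloud: a chain of six points 0-1-2-3-4-5.
neighbors = {i: {j for j in (i - 1, i + 1) if 0 <= j <= 5} for i in range(6)}
segments = {"s0": {0, 1, 2}, "s1": {2, 3}, "s2": {4, 5}}

edges = build_segment_graph(segments, neighbors)
```

In this toy chain, `s0` and `s1` overlap on point 2, `s1` and `s2` are adjacent through points 3 and 4, and `s0` and `s2` are not connected. A graph neural network would then pass features along these edges.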

[52] C-DGPA: Class-Centric Dual-Alignment Generative Prompt Adaptation

Chao Li,Dasha Hu,Chengyang Li,Yuming Jiang,Yuncheng Shen

Main category: cs.CV

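TL;DR: This paper proposes C-DGPA, a new prompt-tuning method for unsupervised domain adaptation that jointly optimizes marginal and conditional distribution alignment, substantially improving vision-language models on cross-domain tasks and achieving state-of-the-art results on several benchmarks.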

Details

Motivation: Existing prompt-tuning methods focus mainly on marginal distribution alignment and overlook conditional distribution discrepancies, which causes class prototype misalignment and weaker semantic discriminability. The paper aims to fix this conditional misalignment and improve generalization in cross-domain settings. Method: The paper proposes C-DGPA (Class-Centric Dual Alignment Generative Prompt Adaptation), a dual-branch architecture: one branch aligns marginal distributions through dynamic adversarial training; the other introduces a Class Mapping Mechanism (CMM) for conditional distribution alignment, standardizing semantic prompt understanding and preventing over-reliance on the source domain. Result: In extensive experiments on OfficeHome, Office31, and VisDA-2017, C-DGPA achieves new state-of-the-art performance on all benchmarks. Conclusion: By jointly optimizing marginal and conditional distribution alignment, C-DGPA integrates domain knowledge into prompt learning and produces domain-invariant, semantically discriminative representations, substantially improving VLMs on unsupervised domain adaptation. Abstract: Unsupervised Domain Adaptation transfers knowledge from a labeled source domain to an unlabeled target domain. Directly deploying Vision-Language Models (VLMs) with prompt tuning in downstream UDA tasks faces the significant challenge of mitigating domain discrepancies. Existing prompt-tuning strategies primarily align marginal distribution, but neglect conditional distribution discrepancies, leading to critical issues such as class prototype misalignment and degraded semantic discriminability. To address these limitations, the work proposes C-DGPA: Class-Centric Dual Alignment Generative Prompt Adaptation. C-DGPA synergistically optimizes marginal distribution alignment and conditional distribution alignment through a novel dual-branch architecture. The marginal distribution alignment branch employs a dynamic adversarial training framework to bridge marginal distribution discrepancies. Simultaneously, the conditional distribution alignment branch introduces a Class Mapping Mechanism (CMM) to align conditional distribution discrepancies by standardizing semantic prompt understanding and preventing source domain over-reliance. This dual alignment strategy effectively integrates domain knowledge into prompt learning via synergistic optimization, ensuring domain-invariant and semantically discriminative representations. Extensive experiments on OfficeHome, Office31, and VisDA-2017 validate the superiority of C-DGPA. It achieves new state-of-the-art results on all benchmarks.

[53] Towards Closing the Domain Gap with Event Cameras

M. Oltan Sevinc,Liao Wu,Francisco Cruz

Main category: cs.CV

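TL;DR: This paper examines event cameras as an alternative to traditional cameras for addressing the domain gap caused by day-night lighting differences. Experiments show that event cameras perform more consistently across lighting conditions and have better baseline performance than grayscale frames in cross-domain scenarios.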

Details

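Motivation: Traditional cameras degrade sharply when the training data and the deployment environment do not match, and the day-night lighting change is a pronounced instance of this domain gap. Method: The paper proposes replacing traditional cameras with event cameras, evaluates them experimentally under different lighting conditions, and compares them with grayscale frames in cross-domain scenarios. Result: Event cameras maintain more consistent performance across lighting conditions, with domain-shift penalties generally comparable to or smaller than those of grayscale frames, and show higher baseline performance in cross-domain scenarios. Conclusion: Event cameras are a promising alternative that can mitigate the domain gap caused by day-night lighting changes and improve the robustness of autonomous driving systems in changing environments. Abstract: Although traditional cameras are the primary sensor for end-to-end driving, their performance suffers greatly when the conditions of the data they were trained on do not match the deployment environment, a problem known as the domain gap. In this work, we consider the day-night lighting difference domain gap. Instead of traditional cameras, we propose event cameras as a potential alternative that can maintain performance across lighting condition domain gaps without requiring additional adjustments. Our results show that event cameras maintain more consistent performance across lighting conditions, exhibiting domain-shift penalties that are generally comparable to or smaller than those of grayscale frames and providing superior baseline performance in cross-domain scenarios.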

[54] Avatar4D: Synthesizing Domain-Specific 4D Humans for Real-World Pose Estimation

Jerrin Bright,Zhibo Wang,Dmytro Klepachevskyi,Yuhao Chen,Sirisha Rambhatla,David Clausi,John Zelek

Main category: cs.CV

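TL;DR: Avatar4D is a transferable pipeline for generating customizable synthetic human motion datasets for domain-specific applications. It is validated on sports, and the Syn2Sport dataset demonstrates its potential for pose estimation, zero-shot transfer, and cross-domain generalization.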

Details

Motivation: Existing methods focus mainly on general, everyday motions, lack fine-grained control and flexibility for complex human motion in specific domains such as sports, and depend on manual annotation. Method: The paper proposes Avatar4D, a 4D human motion generation pipeline that gives fine-grained control over body pose, appearance, camera viewpoint, and environmental context without manual annotation, and builds Syn2Sport, a large-scale synthetic dataset for sports motion analysis. Result: Several state-of-the-art pose estimation models are evaluated on Syn2Sport, showing its effectiveness for supervised learning, zero-shot transfer to real data, and generalization across sports; the alignment between synthetic and real data in feature space is also assessed. Conclusion: Avatar4D can generate scalable, controllable, and transferable domain-specific human motion datasets without relying on real-world data, providing strong support for specific tasks. Abstract: We present Avatar4D, a real-world transferable pipeline for generating customizable synthetic human motion datasets tailored to domain-specific applications. Unlike prior works, which focus on general, everyday motions and offer limited flexibility, our approach provides fine-grained control over body pose, appearance, camera viewpoint, and environmental context, without requiring any manual annotations. To validate the impact of Avatar4D, we focus on sports, where domain-specific human actions and movement patterns pose unique challenges for motion understanding. In this setting, we introduce Syn2Sport, a large-scale synthetic dataset spanning sports, including baseball and ice hockey. Avatar4D features high-fidelity 4D (3D geometry over time) human motion sequences with varying player appearances rendered in diverse environments. We benchmark several state-of-the-art pose estimation models on Syn2Sport and demonstrate their effectiveness for supervised learning, zero-shot transfer to real-world data, and generalization across sports. Furthermore, we evaluate how closely the generated synthetic data aligns with real-world datasets in feature space. Our results highlight the potential of such systems to generate scalable, controllable, and transferable human datasets for diverse domain-specific tasks without relying on domain-specific real data.

[55] Visual Alignment of Medical Vision-Language Models for Grounded Radiology Report Generation

Sarosij Bose,Ravi K. Rajendran,Biplob Debnath,Konstantinos Karydis,Amit K. Roy-Chowdhury,Srimat Chakradhar

Main category: cs.CV

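TL;DR: This paper proposes VALOR, a new method that uses reinforcement learning and a two-stage alignment framework to improve the visual alignment and clinical accuracy of medical vision-language models for radiology report generation.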

Details

Motivation: Existing radiology report generation methods suffer from weak cross-modal alignment, are prone to hallucination, and struggle to guarantee clinically accurate, visually grounded reports. Method: The paper proposes the VALOR framework, a reinforcement learning-based post-alignment method with two stages: first, textual rewards improve terminology accuracy; second, the vision projection module is aligned with disease findings to strengthen attention to key image regions. Result: Experiments on multiple benchmarks show that VALOR substantially improves factual accuracy and visual grounding and outperforms current state-of-the-art methods. Conclusion: VALOR effectively improves cross-modal alignment in medical vision-language models for radiology report generation, yielding reports that are more accurate and visually interpretable. Abstract: Radiology Report Generation (RRG) is a critical step toward automating healthcare workflows, facilitating accurate patient assessments, and reducing the workload of medical professionals. Despite recent progress in Large Medical Vision-Language Models (Med-VLMs), generating radiology reports that are both visually grounded and clinically accurate remains a significant challenge. Existing approaches often rely on large labeled corpora for pre-training, costly task-specific preference data, or retrieval-based methods. However, these strategies do not adequately mitigate hallucinations arising from poor cross-modal alignment between visual and linguistic representations. To address these limitations, we propose VALOR: Visual Alignment of Medical Vision-Language Models for GrOunded Radiology Report Generation. Our method introduces a reinforcement learning-based post-alignment framework utilizing Group-Relative Proximal Optimization (GRPO). The training proceeds in two stages: (1) improving the Med-VLM with textual rewards to encourage clinically precise terminology, and (2) aligning the vision projection module of the textually grounded model with disease findings, thereby guiding attention toward image regions most relevant to the diagnostic task. Extensive experiments on multiple benchmarks demonstrate that VALOR substantially improves factual accuracy and visual grounding, achieving significant performance gains over state-of-the-art report generation methods.

[56] Open Ad-hoc Categorization with Contextualized Feature Learning

Zilin Wang,Sangwoo Mo,Stella X. Yu,Sima Behpour,Liu Ren

Main category: cs.CV

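TL;DR: This paper proposes OAK, a new model for open ad-hoc categorization of visual scenes that combines the objectives of CLIP and GCD to expand categories efficiently and interpretably from a few labeled exemplars and abundant unlabeled data.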

Details

Motivation: Ad-hoc categories are created dynamically to serve specific goals, unlike fixed categories; existing methods struggle to discover the underlying context and extend it semantically, so an adaptive, generalizable categorization method is needed. Method: OAK introduces a small set of learnable context tokens at the input of a frozen CLIP and jointly optimizes CLIP's image-text alignment objective and GCD's visual clustering objective, discovering and expanding ad-hoc categories through semantic extension and visual clustering. Result: On the Stanford and Clevr-4 datasets, OAK reaches state-of-the-art performance across multiple categorization tasks, for example 87.4% novel accuracy on Stanford Mood, more than 50% above CLIP and GCD, and produces interpretable saliency maps. Conclusion: OAK effectively supports open ad-hoc categorization with high performance, interpretability, and generalization, offering a workable approach to adaptive visual understanding for AI agents in changing tasks. Abstract: Adaptive categorization of visual scenes is essential for AI agents to handle changing tasks. Unlike fixed common categories for plants or animals, ad-hoc categories are created dynamically to serve specific goals. We study open ad-hoc categorization: Given a few labeled exemplars and abundant unlabeled data, the goal is to discover the underlying context and to expand ad-hoc categories through semantic extension and visual clustering around it. Building on the insight that ad-hoc and common categories rely on similar perceptual mechanisms, we propose OAK, a simple model that introduces a small set of learnable context tokens at the input of a frozen CLIP and optimizes with both CLIP's image-text alignment objective and GCD's visual clustering objective. On Stanford and Clevr-4 datasets, OAK achieves state-of-the-art in accuracy and concept discovery across multiple categorizations, including 87.4% novel accuracy on Stanford Mood, surpassing CLIP and GCD by over 50%. Moreover, OAK produces interpretable saliency maps, focusing on hands for Action, faces for Mood, and backgrounds for Location, promoting transparency and trust while enabling adaptive and generalizable categorization.

[57] Enhanced 3D Shape Analysis via Information Geometry

Amit Vishwakarma,K. S. Subrahamanian Moosath

Main category: cs.CV

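TL;DR: This paper proposes an information-geometric framework for 3D point cloud shape analysis that represents point clouds as Gaussian Mixture Models (GMMs) and defines a Modified Symmetric KL divergence (MSKL) with theoretical upper and lower bounds, enabling point cloud comparison that is stable and reflects geometric variation.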

Details

Motivation: Traditional geometric metrics such as Hausdorff and Chamfer distances struggle to capture global statistical structure and are sensitive to outliers, while existing KL divergence approximations can be unbounded or numerically unstable. Method: Point clouds are modeled as Gaussian Mixture Models (GMMs) on a statistical manifold; the paper proves that the space of GMMs forms a statistical manifold and proposes the Modified Symmetric KL divergence (MSKL) with theoretical upper and lower bounds. Result: Experiments on the MPI-FAUST and G-PCD datasets show that MSKL is numerically stable, that its values vary monotonically with geometric change, and that it outperforms traditional distances and existing KL approximations. Conclusion: MSKL provides a robust, stable measure for point cloud comparison, suited to applications that require precise shape analysis. Abstract: Three-dimensional point clouds provide highly accurate digital representations of objects, essential for applications in computer graphics, photogrammetry, computer vision, and robotics. However, comparing point clouds faces significant challenges due to their unstructured nature and the complex geometry of the surfaces they represent. Traditional geometric metrics such as Hausdorff and Chamfer distances often fail to capture global statistical structure and exhibit sensitivity to outliers, while existing Kullback-Leibler (KL) divergence approximations for Gaussian Mixture Models can produce unbounded or numerically unstable values. This paper introduces an information geometric framework for 3D point cloud shape analysis by representing point clouds as Gaussian Mixture Models (GMMs) on a statistical manifold. We prove that the space of GMMs forms a statistical manifold and propose the Modified Symmetric Kullback-Leibler (MSKL) divergence with theoretically guaranteed upper and lower bounds, ensuring numerical stability for all GMM comparisons. Through comprehensive experiments on human pose discrimination (MPI-FAUST dataset) and animal shape comparison (G-PCD dataset), we demonstrate that MSKL provides stable and monotonically varying values that directly reflect geometric variation, outperforming traditional distances and existing KL approximations.
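
The unboundedness that motivates a bounded divergence is easiest to see for a single pair of 1D Gaussians, where KL has a standard closed form. The sketch below is only the ordinary symmetric KL divergence, not the paper's MSKL (whose definition is not given in the abstract); the function names are illustrative.

```python
import math


def kl_gauss(mu_p: float, sigma_p: float, mu_q: float, sigma_q: float) -> float:
    """Closed-form KL(p || q) for p = N(mu_p, sigma_p^2), q = N(mu_q, sigma_q^2)."""
    return (math.log(sigma_q / sigma_p)
            + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2.0 * sigma_q ** 2)
            - 0.5)


def symmetric_kl(mu_p: float, sigma_p: float, mu_q: float, sigma_q: float) -> float:
    """Symmetrized KL: KL(p || q) + KL(q || p)."""
    return (kl_gauss(mu_p, sigma_p, mu_q, sigma_q)
            + kl_gauss(mu_q, sigma_q, mu_p, sigma_p))


if __name__ == "__main__":
    # Same means; the second Gaussian becomes progressively narrower.
    for sigma_q in (1.0, 0.1, 0.01, 0.001):
        print(sigma_q, symmetric_kl(0.0, 1.0, 0.0, sigma_q))
```

As one standard deviation shrinks toward zero the value grows without limit. A tightly fitted mixture component can cause the same behavior in a GMM comparison, which is the situation a divergence with guaranteed bounds is meant to avoid.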

[58] Learning High-Quality Initial Noise for Single-View Synthesis with Diffusion Models

Zhihao Zhang,Xuejun Yang,Weihua Liu,Mouquan Shen

Main category: cs.CV

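TL;DR: This paper proposes a learning framework based on an encoder-decoder network (EDN) for generating high-quality initial noise in single-view novel view synthesis. A discretized Euler inversion method is used to build a paired noise dataset, and the approach substantially improves the generation quality of diffusion models.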

Details

Motivation: In diffusion models, high-quality initial noise improves generation results, but there is no dedicated framework for learning such noise. Method: A discretized Euler inversion method injects semantic information to build a paired dataset of random and high-quality noise, and an encoder-decoder network (EDN) framework transforms random noise into high-quality noise. Result: EDN plugs seamlessly into NVS models such as SV3D and MV-Adapter and substantially improves performance on multiple datasets. Conclusion: The proposed method effectively generates high-quality initial noise, improving both image quality and model generality in single-view novel view synthesis. Abstract: Single-view novel view synthesis (NVS) models based on diffusion models have recently attracted increasing attention, as they can generate a series of novel view images from a single image prompt and camera pose information as conditions. It has been observed that in diffusion models, certain high-quality initial noise patterns lead to better generation results than others. However, there remains a lack of dedicated learning frameworks that enable NVS models to learn such high-quality noise. To obtain high-quality initial noise from random Gaussian noise, we make the following contributions. First, we design a discretized Euler inversion method to inject image semantic information into random noise, thereby constructing paired datasets of random and high-quality noise. Second, we propose a learning framework based on an encoder-decoder network (EDN) that directly transforms random noise into high-quality noise. Experiments demonstrate that the proposed EDN can be seamlessly plugged into various NVS models, such as SV3D and MV-Adapter, achieving significant performance improvements across multiple datasets. Code is available at: https://github.com/zhihao0512/EDN.

[59] Image Compression Using Singular Value Decomposition

Justin Jiang

Main category: cs.CV

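TL;DR: This study examines image compression using singular value decomposition (SVD) and low-rank matrix approximation. Although the reconstructed images look similar to the originals, compression efficiency is far below standard formats such as JPEG, JPEG2000, and WEBP, and under low error tolerances the result can even be larger than the original image, so the method is not suitable for practical use.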

Details

Motivation: Images make up a large share of internet data, so efficient image compression is important for reducing storage and bandwidth demands. The study assesses the potential of SVD and low-rank approximation for image compression. Method: SVD-based low-rank matrix approximation is applied to grayscale and multichannel images, and performance is evaluated with relative Frobenius-norm error and compression ratio. Result: Low-rank approximations produce images visually similar to the originals, but at the same error level their compression efficiency is consistently worse than JPEG, JPEG2000, and WEBP; at low error tolerances, the SVD-compressed data can even exceed the original image in size. Conclusion: SVD-based image compression cannot compete with existing industry-standard formats in compression efficiency and is not suitable for practical use. Abstract: Images are a substantial portion of the internet, making efficient compression important for reducing storage and bandwidth demands. This study investigates the use of Singular Value Decomposition and low-rank matrix approximations for image compression, evaluating performance using relative Frobenius error and compression ratio. The approach is applied to both grayscale and multichannel images to assess its generality. Results show that the low-rank approximations often produce images that appear visually similar to the originals, but the compression efficiency remains consistently worse than established formats such as JPEG, JPEG2000, and WEBP at comparable error levels. At low tolerated error levels, the compressed representation produced by Singular Value Decomposition can even exceed the size of the original image, indicating that this method is not competitive with industry-standard codecs for practical image compression.
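
Counting stored values shows why a rank-k SVD representation can end up larger than the original. The sketch below assumes the truncated factors are stored as k left singular vectors, k right singular vectors, and k singular values, all at the same precision as the original pixels. The function names are illustrative, and the paper's own size accounting may differ.

```python
def svd_storage(m: int, n: int, k: int) -> int:
    """Values stored for a rank-k truncated SVD of an m x n matrix.

    Assumption: U_k (m*k values), k singular values, and V_k (n*k values)
    are kept at the same precision as the original pixels.
    """
    return k * (m + n + 1)


def compression_ratio(m: int, n: int, k: int) -> float:
    """Original size divided by compressed size; below 1.0 means expansion."""
    return (m * n) / svd_storage(m, n, k)


def break_even_rank(m: int, n: int) -> int:
    """Largest rank k for which the SVD form is strictly smaller than m*n values."""
    return (m * n - 1) // (m + n + 1)


if __name__ == "__main__":
    m, n = 512, 512
    for k in (10, 100, 255, 256, 400):
        print(k, round(compression_ratio(m, n, k), 3))
    print("break-even rank:", break_even_rank(m, n))
```

Under this accounting a 512 x 512 channel stops shrinking once the rank passes 255, about half of full rank. A low error tolerance forces the rank high, so it can push the representation past the size of the original.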

[60] ARMFlow: AutoRegressive MeanFlow for Online 3D Human Reaction Generation

Zichen Geng,Zeeshan Hayder,Wei Liu,Hesheng Wang,Ajmal Mian

Main category: cs.CV

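TL;DR: This paper proposes ARMFlow, a new autoregressive framework for 3D human reaction motion generation that addresses three challenges: high motion fidelity, real-time inference, and adaptability to online scenarios.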

Details

Motivation: Existing methods cannot simultaneously meet the three key requirements of 3D human reaction generation: high motion fidelity, real-time inference, and autoregressive adaptability. Method: The paper proposes ARMFlow, a MeanFlow-based autoregressive framework with a causal context encoder and an MLP velocity predictor, introduces the Bootstrap Contextual Encoding (BSCE) training strategy, and also presents an offline variant, ReMFlow. Result: On the InterHuman and InterX datasets, ARMFlow surpasses existing online methods by more than 40% in FID, offers fast single-step inference with little error accumulation, and matches state-of-the-art offline methods. Conclusion: Through global context encoding, efficient single-step inference, and BSCE, ARMFlow mitigates key limitations of online 3D reaction motion generation and balances high performance with real-time operation. Abstract: 3D human reaction generation faces three main challenges: (1) high motion fidelity, (2) real-time inference, and (3) autoregressive adaptability for online scenarios. Existing methods fail to meet all three simultaneously. We propose ARMFlow, a MeanFlow-based autoregressive framework that models temporal dependencies between actor and reactor motions. It consists of a causal context encoder and an MLP-based velocity predictor. We introduce Bootstrap Contextual Encoding (BSCE) in training, encoding generated history instead of ground-truth history, to alleviate error accumulation in autoregressive generation. We further introduce the offline variant ReMFlow, achieving state-of-the-art performance with the fastest inference among offline methods. Our ARMFlow addresses key limitations of online settings by: (1) enhancing semantic alignment via a global contextual encoder; (2) achieving high accuracy and low latency in a single-step inference; and (3) reducing accumulated errors through BSCE. Our single-step online generation surpasses existing online methods on InterHuman and InterX by over 40% in FID, while matching offline state-of-the-art performance despite using only partial sequence conditions.

[61] AI-Powered Dermatological Diagnosis: From Interpretable Models to Clinical Implementation A Comprehensive Framework for Accessible and Trustworthy Skin Disease Detection

Satya Narayana Panda,Vaishnavi Kukkala,Spandana Iyer

Main category: cs.CV

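TL;DR: This study develops a multi-modal AI framework that combines clinical imaging with family history to improve the accuracy of skin disease diagnosis, showing particular promise for hereditary skin conditions, with an emphasis on interpretability and clinical integration.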

Details

Motivation: Skin disease diagnosis is challenging because dermatologists are scarce and clinical presentations are complex; family history strongly influences disease risk and treatment response but is often overlooked, so an AI-assisted diagnostic system that integrates family history with imaging data is needed. Method: The paper proposes a multi-modal AI framework that combines deep learning-based image analysis with structured clinical data (including family history), uses interpretable convolutional neural networks together with clinical decision trees that incorporate hereditary risk factors, and designs prospective clinical trials to validate effectiveness across diverse healthcare settings. Result: With family history integrated, the AI system's diagnostic accuracy improves for hereditary skin conditions such as melanoma, psoriasis, and atopic dermatitis; expert feedback suggests it can help with early detection and personalized recommendations, and the system is interpretable and supports integration into clinical workflows. Conclusion: The multi-modal AI framework effectively integrates family history with imaging data, improves diagnostic performance for skin disease, and shows potential for clinical use; formal clinical trials are planned to further validate it in real-world settings. Abstract: Dermatological conditions affect 1.9 billion people globally, yet accurate diagnosis remains challenging due to limited specialist availability and complex clinical presentations. Family history significantly influences skin disease susceptibility and treatment responses, but is often underutilized in diagnostic processes. This research addresses the critical question: How can AI-powered systems integrate family history data with clinical imaging to enhance dermatological diagnosis while supporting clinical trial validation and real-world implementation? We developed a comprehensive multi-modal AI framework that combines deep learning-based image analysis with structured clinical data, including detailed family history patterns. Our approach employs interpretable convolutional neural networks integrated with clinical decision trees that incorporate hereditary risk factors. The methodology includes prospective clinical trials across diverse healthcare settings to validate AI-assisted diagnosis against traditional clinical assessment. In this work, validation was conducted with healthcare professionals to assess AI-assisted outputs against clinical expectations; prospective clinical trials across diverse healthcare settings are proposed as future work. The integrated AI system demonstrates enhanced diagnostic accuracy when family history data is incorporated, particularly for hereditary skin conditions such as melanoma, psoriasis, and atopic dermatitis. Expert feedback indicates potential for improved early detection and more personalized recommendations; formal clinical trials are planned. The framework is designed for integration into clinical workflows while maintaining interpretability through explainable AI mechanisms.

[62] Semi-Supervised Multi-View Crowd Counting by Ranking Multi-View Fusion Models

Qi Zhang,Yunfei Gong,Zhidan Xie,Zhizi Wang,Antoni B. Chan,Hui Huang

Main category: cs.CV

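TL;DR: This paper proposes two semi-supervised multi-view crowd counting frameworks that address limited labeled data by ranking multi-view fusion models according to either model predictions or model uncertainties.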

Details

Motivation: Because multi-view images are difficult to collect and annotate, existing multi-view crowd counting datasets contain a limited number of scenes and frames, so dependence on large amounts of labeled data needs to be reduced. Method: The first method ranks models by constraining predictions made with fewer views not to exceed predictions made with more views; the second ranks by model uncertainty, requiring that uncertainty with more views be no higher than with fewer views. Both constraints are introduced into training in a semi-supervised fashion. Result: Experiments show that the proposed methods outperform other semi-supervised methods on multi-view crowd counting, especially when labeled data is limited. Conclusion: Ranking models by prediction and by uncertainty effectively improves semi-supervised multi-view crowd counting and offers a new way to ease the scarcity of labeled data. Abstract: Multi-view crowd counting has been proposed to deal with the severe occlusion issue of crowd counting in large and wide scenes. However, due to the difficulty of collecting and annotating multi-view images, the datasets for multi-view counting have a limited number of multi-view frames and scenes. To solve the problem of limited data, one approach is to collect synthetic data to bypass the annotating step, while another is to propose semi- or weakly-supervised or unsupervised methods that demand less multi-view data. In this paper, we propose two semi-supervised multi-view crowd counting frameworks by ranking the multi-view fusion models of different numbers of input views, in terms of the model predictions or the model uncertainties. Specifically, for the first method (vanilla model), we rank the multi-view fusion models' prediction results of different numbers of camera-view inputs, namely, the model's predictions with fewer camera views shall not be larger than the predictions with more camera views. For the second method, we rank the estimated model uncertainties of the multi-view fusion models with a variable number of view inputs, guided by the multi-view fusion models' prediction errors, namely, the model uncertainties with more camera views shall not be larger than those with fewer camera views. These constraints are introduced into the model training in a semi-supervised fashion for multi-view counting with limited labeled data. The experiments demonstrate the advantages of the proposed multi-view model ranking methods compared with other semi-supervised counting methods.
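
The two ranking constraints can be written as hinge penalties that need no ground-truth counts, which is what lets them act on unlabeled samples. This is a minimal sketch, assuming scalar count predictions and scalar uncertainties per view subset. The function names and the zero margin are illustrative choices, not the authors' loss.

```python
from typing import Callable, Sequence


def count_ranking_loss(fewer: float, more: float, margin: float = 0.0) -> float:
    """Penalize a count predicted from fewer views exceeding the count from more views."""
    return max(0.0, fewer - more + margin)


def uncertainty_ranking_loss(fewer: float, more: float, margin: float = 0.0) -> float:
    """Penalize uncertainty with more views exceeding uncertainty with fewer views."""
    return max(0.0, more - fewer + margin)


def chain_loss(values: Sequence[float],
               pairwise_loss: Callable[[float, float], float]) -> float:
    """Sum a pairwise ranking loss over consecutive view counts.

    values[i] is the prediction (or uncertainty) obtained with i + 1 input views.
    """
    total = 0.0
    for fewer, more in zip(values, values[1:]):
        total += pairwise_loss(fewer, more)
    return total


if __name__ == "__main__":
    counts = [40.0, 52.0, 50.0]        # predictions with 1, 2, 3 views
    uncertainties = [0.9, 0.5, 0.6]    # uncertainties with 1, 2, 3 views
    print(chain_loss(counts, count_ranking_loss))
    print(chain_loss(uncertainties, uncertainty_ranking_loss))
```

In both toy sequences only the step from two to three views violates its constraint, so only that pair contributes to the loss.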

[63] Pixel Super-Resolved Fluorescence Lifetime Imaging Using Deep Learning

Paloma Casteleiro Costa,Parnian Ghapandar Kashani,Xuhui Liu,Alexander Chen,Ary Portes,Julien Bec,Laura Marcu,Aydogan Ozcan

Main category: cs.CV

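TL;DR: The paper proposes FLIM_PSR_k, a deep learning-based multi-channel pixel super-resolution framework that reconstructs high-resolution fluorescence lifetime imaging (FLIM) images from data acquired with up to 5-fold larger pixels, substantially improving spatial resolution and signal-to-noise ratio and advancing its use in clinical diagnostics.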

Details

Motivation: Clinical use of FLIM is limited by long pixel dwell times and low signal-to-noise ratio, which impose a stricter trade-off between resolution and imaging speed. Method: The FLIM_PSR_k model is trained with a conditional generative adversarial network (cGAN) framework to recover high-resolution FLIM images from inputs with large pixel sizes; it supports multi-channel input and fast inference. Result: In blind testing on patient-derived tumor tissue samples, FLIM_PSR_k achieves 5-fold super-resolution (k=5), increases the space-bandwidth product of the output images 25-fold, recovers fine structures lost in the low-resolution inputs, and shows statistically significant improvements across several image quality metrics. Conclusion: By raising FLIM's effective spatial resolution, FLIM_PSR_k moves fluorescence lifetime imaging toward faster, higher-resolution implementations compatible with low-numerical-aperture and miniaturized platforms, strengthening its potential in translational medicine. Abstract: Fluorescence lifetime imaging microscopy (FLIM) is a powerful quantitative technique that provides metabolic and molecular contrast, offering strong translational potential for label-free, real-time diagnostics. However, its clinical adoption remains limited by long pixel dwell times and low signal-to-noise ratio (SNR), which impose a stricter resolution-speed trade-off than conventional optical imaging approaches. Here, we introduce FLIM_PSR_k, a deep learning-based multi-channel pixel super-resolution (PSR) framework that reconstructs high-resolution FLIM images from data acquired with up to a 5-fold increased pixel size. The model is trained using the conditional generative adversarial network (cGAN) framework, which, compared to diffusion model-based alternatives, delivers a more robust PSR reconstruction with substantially shorter inference times, a crucial advantage for practical deployment. FLIM_PSR_k not only enables faster image acquisition but can also alleviate SNR limitations in autofluorescence-based FLIM. Blind testing on held-out patient-derived tumor tissue samples demonstrates that FLIM_PSR_k reliably achieves a super-resolution factor of k = 5, resulting in a 25-fold increase in the space-bandwidth product of the output images and revealing fine architectural features lost in lower-resolution inputs, with statistically significant improvements across various image quality metrics. By increasing FLIM's effective spatial resolution, FLIM_PSR_k advances lifetime imaging toward faster, higher-resolution, and hardware-flexible implementations compatible with low-numerical-aperture and miniaturized platforms, better positioning FLIM for translational applications.
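
The 25-fold figure follows from the 5-fold factor because pixel size scales along both image axes. The sketch below works through that arithmetic for a square field of view. The field-of-view and pixel-size numbers are hypothetical, chosen only to make the counts easy to read.

```python
def pixels_per_frame(fov_um: float, pixel_um: float) -> int:
    """Number of pixels needed to tile a square field of view of side fov_um."""
    per_side = round(fov_um / pixel_um)
    return per_side * per_side


def sbp_gain(k: int) -> int:
    """Pixel-count (space-bandwidth product) gain from k-fold PSR in two dimensions."""
    return k * k


if __name__ == "__main__":
    fov_um = 500.0                           # hypothetical field of view
    fine = pixels_per_frame(fov_um, 1.0)     # hypothetical 1 um pixels
    coarse = pixels_per_frame(fov_um, 5.0)   # 5-fold larger pixels
    print(fine, coarse, fine // coarse, sbp_gain(5))
```

Scanning with 5-fold larger pixels visits 25 times fewer positions per frame, and a k = 5 reconstruction restores the original pixel count.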

[64] TextEditBench: Evaluating Reasoning-aware Text Editing Beyond Rendering

Rui Gui,Yang Wan,Haochen Han,Dongxing Mao,Fangming Liu,Min Li,Alex Jinpeng Wang

Main category: cs.CV

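TL;DR: This paper proposes TextEditBench, a comprehensive evaluation benchmark focused on editing text in images that emphasizes semantic, geometric, and contextual consistency, and introduces a new metric, Semantic Expectation (SE), to measure a model's reasoning ability in text editing.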

Details

Motivation: Text editing remains underexplored in image generation, and existing models struggle to generate legible text while preserving semantic, geometric, and contextual consistency. Method: The paper proposes the TextEditBench benchmark, which focuses on text regions, designs editing scenarios that require reasoning, and introduces the Semantic Expectation (SE) metric to assess semantic consistency, contextual coherence, and cross-modal alignment. Result: Experiments show that current state-of-the-art models can follow simple textual instructions but still perform poorly on context-dependent reasoning, physical consistency, and layout-aware integration. Conclusion: TextEditBench provides a new evaluation platform for text-guided image editing and multimodal reasoning, supporting further progress in the field. Abstract: Text rendering has recently emerged as one of the most challenging frontiers in visual generation, drawing significant attention from large-scale diffusion and multimodal models. However, text editing within images remains largely unexplored, as it requires generating legible characters while preserving semantic, geometric, and contextual coherence. To fill this gap, we introduce TextEditBench, a comprehensive evaluation benchmark that explicitly focuses on text-centric regions in images. Beyond basic pixel manipulations, our benchmark emphasizes reasoning-intensive editing scenarios that require models to understand physical plausibility, linguistic meaning, and cross-modal dependencies. We further propose a novel evaluation dimension, Semantic Expectation (SE), which measures a model's reasoning ability to maintain semantic consistency, contextual coherence, and cross-modal alignment during text editing. Extensive experiments on state-of-the-art editing systems reveal that while current models can follow simple textual instructions, they still struggle with context-dependent reasoning, physical consistency, and layout-aware integration. By focusing evaluation on this long-overlooked yet fundamental capability, TextEditBench establishes a new testing ground for advancing text-guided image editing and reasoning in multimodal generation.

[65] GFLAN: Generative Functional Layouts

Mohamed Abouagour,Eleftherios Garyfallidis

Main category: cs.CV

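TL;DR: This paper proposes GFLAN, a generative framework that splits floor plan generation into two stages, topological planning and geometric realization, to address the weakness of existing deep learning methods in architectural reasoning.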

Details

Motivation: Existing deep learning methods struggle to capture architectural reasoning, such as the precedence of topological relationships over geometric instantiation, the propagation of functional constraints through adjacency networks, and the circulation patterns that emerge from local connectivity decisions. Method: The paper proposes the GFLAN framework: the first stage uses a convolutional network with dual encoders to generate room centroids; the second stage builds a heterogeneous graph and applies a Transformer-augmented graph neural network to jointly regress room boundaries. Result: Given a building's exterior boundary and door location, the method generates floor plans that satisfy functional and geometric constraints and outperforms methods that generate pixels or wall lines directly. Conclusion: By explicitly separating topological planning from geometric realization, GFLAN improves the modeling of architectural logic in floor plan generation and offers a new paradigm for automated layout design. Abstract: Automated floor plan generation lies at the intersection of combinatorial search, geometric constraint satisfaction, and functional design requirements -- a confluence that has historically resisted a unified computational treatment. While recent deep learning approaches have improved the state of the art, they often struggle to capture architectural reasoning: the precedence of topological relationships over geometric instantiation, the propagation of functional constraints through adjacency networks, and the emergence of circulation patterns from local connectivity decisions. To address these fundamental challenges, this paper introduces GFLAN, a generative framework that restructures floor plan synthesis through explicit factorization into topological planning and geometric realization. Given a single exterior boundary and a front-door location, our approach departs from direct pixel-to-pixel or wall-tracing generation in favor of a principled two-stage decomposition. Stage A employs a specialized convolutional architecture with dual encoders -- separating invariant spatial context from evolving layout state -- to sequentially allocate room centroids within the building envelope via discrete probability maps over feasible placements. Stage B constructs a heterogeneous graph linking room nodes to boundary vertices, then applies a Transformer-augmented graph neural network (GNN) that jointly regresses room boundaries.

[66] MACL: Multi-Label Adaptive Contrastive Learning Loss for Remote Sensing Image Retrieval

Amna Amir,Erchan Aptoula

Main category: cs.CV

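TL;DR: This paper proposes MACL, an adaptive contrastive learning method for multi-label remote sensing image retrieval that uses label-aware sampling, frequency-sensitive weighting, and dynamic temperature scaling to mitigate semantic imbalance, outperforming existing contrastive learning methods on several benchmark datasets.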

Details

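Motivation: Multi-label remote sensing image retrieval faces semantic overlap among categories, highly imbalanced label distributions, and complex inter-class co-occurrence patterns, and existing methods struggle to balance representation learning between common and rare classes. Method: The paper proposes Multi-Label Adaptive Contrastive Learning (MACL), which extends the contrastive learning framework with label-aware sampling, frequency-sensitive weighting, and dynamic temperature adjustment to learn more balanced representations for common and rare categories. Result: Experiments on three benchmark datasets, DLRSD, ML-AID, and WHDLD, show that MACL consistently outperforms contrastive-loss baselines and substantially improves retrieval performance. Conclusion: MACL effectively mitigates multi-label semantic imbalance in remote sensing imagery, delivers more reliable retrieval results, and has potential for use in large-scale remote sensing archives. Abstract: Semantic overlap among land-cover categories, highly imbalanced label distributions, and complex inter-class co-occurrence patterns constitute significant challenges for multi-label remote-sensing image retrieval. In this article, Multi-Label Adaptive Contrastive Learning (MACL) is introduced as an extension of contrastive learning to address them. It integrates label-aware sampling, frequency-sensitive weighting, and dynamic-temperature scaling to achieve balanced representation learning across both common and rare categories. Extensive experiments on three benchmark datasets (DLRSD, ML-AID, and WHDLD) show that MACL consistently outperforms contrastive-loss based baselines, effectively mitigating semantic imbalance and delivering more reliable retrieval performance in large-scale remote-sensing archives. Code, pretrained models, and evaluation scripts will be released at https://github.com/amna/MACL upon acceptance.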

[67] PixelArena: A benchmark for Pixel-Precision Visual Intelligence

Feng Liang,Sizhe Cheng,Chenqi Yi

Main category: cs.CV

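TL;DR: The paper proposes PixelArena, which uses semantic segmentation tasks to evaluate the fine-grained generation capabilities of multimodal large models with pixel precision, and finds that Gemini 3 Pro Image shows a new ability to generate high-fidelity semantic masks in zero-shot settings.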

Details

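Motivation: Existing image generation benchmarks mostly focus on aesthetics and lack an objective assessment of fine-grained generation capabilities, so a new way to measure the generative intelligence of multimodal models precisely is needed. Method: The paper proposes PixelArena, which uses semantic segmentation tasks as the evaluation tool, tests models zero-shot with pixel precision, and analyzes their generation capabilities qualitatively and quantitatively. Result: Gemini 3 Pro Image shows high fidelity and strong generalization on semantic mask generation and clearly outperforms other models; the analysis also reveals its visual intelligence and its limitations on new generation tasks. Conclusion: The study marks significant progress in multimodal generative models and offers insights for future research on multimodality, reasoning, interpretability, and benchmark design. Abstract: Multi-modal large language models that have image output are emerging. Many image generation benchmarks focus on aesthetics instead of fine-grained generation capabilities. In PixelArena, we propose using semantic segmentation tasks to objectively examine their fine-grained generative intelligence with pixel precision. We find the latest Gemini 3 Pro Image has emergent image generation capabilities that generate semantic masks with high fidelity under zero-shot settings, showcasing visual intelligence unseen before and true generalization in new image generation tasks. We further investigate its results, compare them qualitatively and quantitatively with those of other models, and present failure cases. The findings not only signal exciting progress in the field but also provide insights into future research related to multimodality, reasoning, interpretability and benchmarking.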

[68] LaverNet: Lightweight All-in-one Video Restoration via Selective Propagation

Haiyu Zhao,Yiwen Shan,Yuanbiao Gou,Xi Peng

Main category: cs.CV

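TL;DR: The paper proposes LaverNet, a lightweight all-in-one video restoration network with only 362K parameters that selectively propagates degradation-agnostic features to reduce the influence of degradations on temporal modeling, achieving performance better than existing large models.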

Details

Motivation: When existing all-in-one video restoration methods handle time-varying degradations, degradation information interferes with temporal modeling, and reliance on large models conceals the underlying problem. Method: The paper designs the lightweight network LaverNet and introduces a new propagation mechanism that passes only degradation-agnostic features between frames, improving temporal consistency and reducing interference. Result: LaverNet has fewer than 1% of the parameters of existing models and achieves comparable or even better performance on multiple benchmarks. Conclusion: A lightweight architecture combined with selective feature propagation can effectively handle video restoration under time-varying degradations, showing that a small model can also deliver strong all-in-one video restoration. Abstract: Recent studies have explored all-in-one video restoration, which handles multiple degradations with a unified model. However, these approaches still face two challenges when dealing with time-varying degradations. First, the degradation can dominate temporal modeling, confusing the model to focus on artifacts rather than the video content. Second, current methods typically rely on large models to handle all-in-one restoration, concealing those underlying difficulties. To address these challenges, we propose a lightweight all-in-one video restoration network, LaverNet, with only 362K parameters. To mitigate the impact of degradations on temporal modeling, we introduce a novel propagation mechanism that selectively transmits only degradation-agnostic features across frames. Through LaverNet, we demonstrate that strong all-in-one restoration can be achieved with a compact network. Despite its small size, less than 1% of the parameters of existing models, LaverNet achieves comparable, even superior performance across benchmarks.

[69] Ridge Estimation-Based Vision and Laser Ranging Fusion Localization Method for UAVs

Huayu Huang,Chen Chen,Banglei Guan,Ze Tan,Yang Shang,Zhang Li,Qifeng Yu

Main category: cs.CV

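TL;DR: This paper proposes a fusion localization method based on ridge estimation that combines the rich scene information of sequential imagery with the high precision of laser ranging, improving target localization accuracy and robustness under limited observation conditions.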

Details

Motivation: Under limited conditions such as long distances, small intersection angles, and large inclination angles, the column vectors of the design matrix become severely multicollinear, so traditional least squares estimation is ill-conditioned and yields unstable, non-robust localization results. Method: A fusion localization method based on ridge estimation combines visual information from sequential imagery with laser ranging data to suppress the adverse effects of multicollinearity. Result: Experiments show that the proposed method achieves higher localization accuracy than ground localization algorithms based on a single information source and markedly improves robustness under limited observation conditions. Conclusion: Ridge estimation effectively mitigates multicollinearity under limited observation conditions, and the proposed fusion localization method outperforms traditional methods in both accuracy and stability. Abstract: Tracking and measuring targets using a variety of sensors mounted on UAVs is an effective means to quickly and accurately locate the target. This paper proposes a fusion localization method based on ridge estimation, combining the advantages of rich scene information from sequential imagery with the high precision of laser ranging to enhance localization accuracy. Under limited conditions such as long distances, small intersection angles, and large inclination angles, the column vectors of the design matrix have serious multicollinearity when using the least squares estimation algorithm. The multicollinearity will lead to ill-conditioned problems, resulting in significant instability and low robustness. Ridge estimation is introduced to mitigate the serious multicollinearity under the condition of limited observation. Experimental results demonstrate that our method achieves higher localization accuracy compared to ground localization algorithms based on a single information source. Moreover, the introduction of ridge estimation effectively enhances the robustness, particularly under limited observation conditions.
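
The effect of multicollinearity on least squares, and how a ridge term damps it, can be reproduced on a toy two-parameter problem. The sketch below solves the textbook ridge normal equations (A^T A + lam * I) x = A^T b in pure Python. The design matrix, the noise, and lam = 0.01 are made-up illustration values, not the paper's localization model.

```python
def normal_equations(A, b):
    """Return the entries of A^T A and A^T b for a two-column design matrix."""
    s11 = sum(r[0] * r[0] for r in A)
    s12 = sum(r[0] * r[1] for r in A)
    s22 = sum(r[1] * r[1] for r in A)
    t1 = sum(r[0] * y for r, y in zip(A, b))
    t2 = sum(r[1] * y for r, y in zip(A, b))
    return (s11, s12, s22), (t1, t2)


def ridge_solve(A, b, lam=0.0):
    """Solve (A^T A + lam*I) x = A^T b; lam = 0 gives ordinary least squares."""
    (s11, s12, s22), (t1, t2) = normal_equations(A, b)
    s11 += lam
    s22 += lam
    det = s11 * s22 - s12 * s12
    return ((s22 * t1 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det)


if __name__ == "__main__":
    # Two nearly identical columns; noise-free data would come from x = (1, 1).
    A = [[1.0, 1.0], [1.0, 1.001], [1.0, 0.999]]
    b = [2.0, 2.011, 1.999]  # second observation perturbed by 0.01
    print("least squares:", ridge_solve(A, b, 0.0))
    print("ridge        :", ridge_solve(A, b, 0.01))
```

In this example a 0.01 perturbation in one observation moves the least squares estimate far from (1, 1), while the ridge estimate stays close to it. Ridge estimation is biased in general, so the choice of lam trades bias against stability.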

[70] QUIDS: Quality-informed Incentive-driven Multi-agent Dispatching System for Mobile Crowdsensing

Nan Zhou,Zuxin Li,Fanhang Man,Xuecheng Chen,Susu Xu,Fan Dang,Chaopeng Hong,Yunhao Liu,Xiao-Ping Zhang,Xinlei Chen

Main category: cs.CV

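TL;DR: This paper proposes QUIDS, a multi-agent dispatching system that uses a quality-informed incentive mechanism to jointly optimize sensing coverage and reliability in non-dedicated vehicular mobile crowdsensing (NVMCS) systems, significantly improving Quality of Information (QoI); its effectiveness is validated on real-world urban data.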

Details

Motivation: In non-dedicated vehicular mobile crowdsensing systems, the main challenge in improving Quality of Information (QoI) is to ensure both sensing coverage and reliability under a limited budget and dynamic vehicle participation. Method: The authors propose QUIDS, which introduces the Aggregated Sensing Quality (ASQ) metric to quantify QoI and a Mutually Assisted Belief-aware Vehicle Dispatching algorithm that estimates sensing reliability and allocates incentives under uncertainty, jointly optimizing coverage and reliability. Result: Experiments show that QUIDS improves ASQ by 38% over non-dispatching scenarios and by 10% over state-of-the-art methods, and reduces map reconstruction errors by 39-74%. Conclusion: Through a quality-informed incentive mechanism, QUIDS enables low-cost, high-quality urban sensing without dedicated infrastructure and suits smart-city applications such as traffic and environmental monitoring. Abstract: This paper addresses the challenge of achieving optimal Quality of Information (QoI) in non-dedicated vehicular mobile crowdsensing (NVMCS) systems. The key obstacles are the interrelated issues of sensing coverage, sensing reliability, and the dynamic participation of vehicles. To tackle these, we propose QUIDS, a QUality-informed Incentive-driven multi-agent Dispatching System, which ensures high sensing coverage and reliability under budget constraints. QUIDS introduces a novel metric, Aggregated Sensing Quality (ASQ), to quantitatively capture QoI by integrating both coverage and reliability. We also develop a Mutually Assisted Belief-aware Vehicle Dispatching algorithm that estimates sensing reliability and allocates incentives under uncertainty, further improving ASQ. Evaluation using real-world data from a metropolitan NVMCS deployment shows QUIDS improves ASQ by 38% over non-dispatching scenarios and by 10% over state-of-the-art methods. It also reduces map reconstruction errors by 39-74% across algorithms. By jointly optimizing coverage and reliability via a quality-informed incentive mechanism, QUIDS enables low-cost, high-quality urban monitoring without dedicated infrastructure, applicable to smart-city scenarios like traffic and environmental sensing.

[71] Collaborative Edge-to-Server Inference for Vision-Language Models

Soochang Song,Yongjune Kim

Main category: cs.CV

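TL;DR: Proposes a collaborative edge-to-server inference framework for vision-language models that selectively retransmits detail-preserving local images, reducing communication cost while maintaining inference accuracy.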

Details

Motivation: Existing approaches resize the original image before sending it to the server for inference, which tends to discard fine-grained details and degrade accuracy. Method: A two-stage framework is designed. In the first stage the server performs inference on the global image, uses the VLM's internal attention to identify a region of interest (RoI), and uses the min-entropy of the output tokens to decide whether retransmission is needed; if the min-entropy exceeds a threshold, the server asks the edge device for a detail-preserving local image, and in the second stage it performs joint inference on the global and local images. Result: Experiments on multiple VLM architectures show that the framework significantly reduces communication overhead while maintaining high inference accuracy. Conclusion: The proposed collaborative inference framework balances communication efficiency and model performance and suits resource-constrained edge deployments. Abstract: We propose a collaborative edge-to-server inference framework for vision-language models (VLMs) that reduces the communication cost while maintaining inference accuracy. In typical deployments, visual data captured at edge devices (clients) is transmitted to the server for VLM inference. However, resizing the original image (global image) to match the vision encoder's input resolution often discards fine-grained details, leading to accuracy degradation. To overcome this limitation, we design a two-stage framework. In the first stage, the server performs inference on the global image and identifies a region of interest (RoI) using the VLM's internal attention. The min-entropy of the output tokens is then computed as a confidence measure to determine whether retransmission is required. If the min-entropy exceeds a predefined threshold, the server requests the edge device to send a detail-preserved local image of the RoI. The server then refines its inference by jointly leveraging the global and local images. This selective retransmission strategy ensures that only essential visual content is transmitted. Experiments across multiple VLM architectures show that the proposed framework significantly reduces communication cost while maintaining inference accuracy.
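
The retransmission decision can be sketched as follows. The min-entropy of a distribution is the standard quantity -log2(max p). How the paper aggregates it over output tokens is not stated in the abstract, so averaging over tokens is an assumption here, as are the function names and the threshold value.

```python
import math


def min_entropy(probs):
    """Min-entropy in bits of one token distribution: -log2 of the top probability."""
    return -math.log2(max(probs))


def needs_retransmission(token_distributions, threshold):
    """Decide whether to request a local RoI image from the edge device.

    Assumption: per-token min-entropies are averaged; the paper may
    aggregate them differently.
    """
    scores = [min_entropy(p) for p in token_distributions]
    return sum(scores) / len(scores) > threshold


# Hypothetical next-token distributions for two answers.
confident = [[0.9, 0.1], [0.8, 0.2]]
uncertain = [[0.4, 0.3, 0.3], [0.5, 0.5]]

request_confident = needs_retransmission(confident, threshold=0.5)
request_uncertain = needs_retransmission(uncertain, threshold=0.5)
```

A low min-entropy means the model puts most of its mass on one token, so the global image was probably sufficient; only the uncertain case triggers the extra upload.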

Tao Hu,Weiyu Zhou,Yanjie Tu,Peng Wu,Wei Dong,Qingsen Yan,Yanning Zhang

Main category: cs.CV

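TL;DR: This paper proposes GMODiff, a gain map-driven one-step diffusion framework for multi-exposure high dynamic range (HDR) reconstruction. It uses conditionally guided gain map estimation and regression priors to address the limited dynamic range, high inference cost, and content hallucination of existing diffusion models on HDR tasks, significantly improving both efficiency and quality.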

Details

Motivation: Existing pre-trained latent diffusion models (LDMs) face three challenges in HDR reconstruction: limited dynamic range caused by 8-bit latent compression, high inference cost from multi-step denoising, and the content hallucination inherent to generative models. A more efficient, higher-fidelity method is therefore needed. Method: HDR reconstruction is reformulated as a conditionally guided Gain Map (GM) estimation task. The denoising process is initialized from an informative estimate produced by a regression model, and regression priors guide both the denoising and the latent decoding of the LDM, so that a high-quality gain map is generated in a single denoising step. Result: Experiments show that GMODiff performs favorably against existing state-of-the-art methods and runs 100 times faster than previous LDM-based methods, while suppressing content hallucination and preserving structural accuracy. Conclusion: By combining gain map estimation with a regression-prior-guided one-step diffusion strategy, GMODiff overcomes the key bottlenecks of LDMs in HDR reconstruction and achieves efficient, high-quality multi-exposure HDR reconstruction. Abstract: Pre-trained Latent Diffusion Models (LDMs) have recently shown strong perceptual priors for low-level vision tasks, making them a promising direction for multi-exposure High Dynamic Range (HDR) reconstruction. However, directly applying LDMs to HDR remains challenging due to: (1) limited dynamic-range representation caused by 8-bit latent compression, (2) high inference cost from multi-step denoising, and (3) content hallucination inherent to their generative nature. To address these challenges, we introduce GMODiff, a gain map-driven one-step diffusion framework for multi-exposure HDR reconstruction. Instead of reconstructing full HDR content, we reformulate HDR reconstruction as a conditionally guided Gain Map (GM) estimation task, where the GM encodes the extended dynamic range while retaining the same bit depth as LDR images. We initialize the denoising process from an informative regression-based estimate rather than pure noise, enabling the model to generate high-quality GMs in a single denoising step. Furthermore, recognizing that regression-based models excel in content fidelity while LDMs favor perceptual quality, we leverage regression priors to guide both the denoising process and latent decoding of the LDM, suppressing hallucinations while preserving structural accuracy. Extensive experiments demonstrate that our GMODiff performs favorably against several state-of-the-art methods and is 100× faster than previous LDM-based methods.
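
To make the gain map idea concrete, here is a minimal sketch of how a per-pixel gain map can carry extended dynamic range at 8-bit depth. The abstract does not give GMODiff's exact formulation, so the log2-ratio encoding, the min/max normalization, and all names below are assumptions made for illustration only.

```python
import math


def encode_gain_map(hdr, ldr, eps=1e-6):
    """Encode log2(HDR / LDR) per pixel as 8-bit codes plus the range used.

    Assumption: a simple log-ratio gain map with min/max normalization.
    """
    logs = [math.log2((h + eps) / (l + eps)) for h, l in zip(hdr, ldr)]
    lo, hi = min(logs), max(logs)
    span = (hi - lo) or 1.0  # avoid dividing by zero for a flat gain map
    codes = [round(255 * (g - lo) / span) for g in logs]
    return codes, lo, hi


def apply_gain_map(ldr, codes, lo, hi, eps=1e-6):
    """Recover approximate HDR values from LDR pixels and the 8-bit gain codes."""
    span = (hi - lo) or 1.0
    return [(l + eps) * 2 ** (lo + span * c / 255) - eps
            for l, c in zip(ldr, codes)]


# Hypothetical linear-light pixel values.
hdr_pixels = [0.5, 2.0, 8.0]
ldr_pixels = [0.5, 1.0, 1.0]  # highlights clipped to 1.0

codes, lo, hi = encode_gain_map(hdr_pixels, ldr_pixels)
restored = apply_gain_map(ldr_pixels, codes, lo, hi)
```

The codes fit in the same 8 bits as the LDR image, yet multiplying the LDR pixels by the decoded gain restores values well above 1.0, which is why estimating the gain map can stand in for estimating the full HDR image.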

[73] EverybodyDance: Bipartite Graph-Based Identity Correspondence for Multi-Character Animation

Haotian Ling,Zequn Chen,Qiuying Chen,Donglin Di,Yongjia Ma,Hao Li,Chen Wei,Zhulin Tao,Xun Yang

Main category: cs.CV

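TL;DR: This paper proposes EverybodyDance, which uses an Identity Matching Graph (IMG) and Mask-Query Attention (MQA) to solve the Identity Correspondence (IC) problem in multi-character animation, significantly improving animation consistency and visual quality in position-swap scenarios.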

Details

Motivation: Existing methods perform well on single-character animation but struggle to maintain correct identity correspondence in multi-character scenes with position swaps, which leads to confused outputs. A method is needed that ensures identities are matched correctly between reference and generated frames. Method: The authors propose the Identity Matching Graph (IMG), which models the characters in the reference and generated frames as the two node sets of a weighted complete bipartite graph, uses Mask-Query Attention (MQA) to compute the affinity between characters, and formalizes IC correctness as a graph structural metric that is optimized during training. They also introduce training strategies including identity-embedded guidance, multi-scale matching, and pre-classified sampling. Result: On a self-built identity correspondence evaluation benchmark, EverybodyDance substantially outperforms existing state-of-the-art methods in both IC accuracy and visual fidelity. Conclusion: By jointly optimizing identity correspondence through graph modeling and attention, EverybodyDance addresses identity consistency under complex interactions in multi-character animation. Abstract: Consistent pose-driven character animation has achieved remarkable progress in single-character scenarios. However, extending these advances to multi-character settings is non-trivial, especially when position swaps are involved. Beyond mere scaling, the core challenge lies in enforcing correct Identity Correspondence (IC) between characters in reference and generated frames. To address this, we introduce EverybodyDance, a systematic solution targeting IC correctness in multi-character animation. EverybodyDance is built around the Identity Matching Graph (IMG), which models characters in the generated and reference frames as two node sets in a weighted complete bipartite graph. Edge weights, computed via our proposed Mask-Query Attention (MQA), quantify the affinity between each pair of characters. Our key insight is to formalize IC correctness as a graph structural metric and to optimize it during training. We also propose a series of targeted strategies tailored for multi-character animation, including identity-embedded guidance, a multi-scale matching strategy, and pre-classified sampling, which work synergistically. Finally, to evaluate IC performance, we curate the Identity Correspondence Evaluation benchmark, dedicated to multi-character IC correctness. Extensive experiments demonstrate that EverybodyDance substantially outperforms state-of-the-art baselines in both IC and visual fidelity.
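
The bipartite view of identity correspondence can be sketched with a small affinity matrix. The abstract does not define the paper's graph structural metric or how MQA produces edge weights, so the sketch below simply assumes the weights are given and checks whether the maximum-weight perfect matching agrees with the known identities. The function names and the matrix values are hypothetical.

```python
from itertools import permutations


def best_identity_matching(affinity):
    """Return the maximum-weight perfect matching of a square affinity matrix.

    affinity[i][j] is the edge weight between reference character i and
    generated character j. The result maps each i to its matched j.
    Brute force over permutations, so only suitable for a few characters.
    """
    n = len(affinity)

    def total_weight(perm):
        return sum(affinity[i][perm[i]] for i in range(n))

    return list(max(permutations(range(n)), key=total_weight))


def matching_is_correct(affinity, ground_truth):
    """True when the best matching equals the known identity assignment."""
    return best_identity_matching(affinity) == list(ground_truth)


# Hypothetical position swap: reference character 0 appears as generated
# character 1 and vice versa.
swap_affinity = [[0.1, 0.9],
                 [0.8, 0.2]]
swap_matching = best_identity_matching(swap_affinity)
```

For more than a handful of characters the factorial search would be replaced by a polynomial-time assignment solver such as the Hungarian algorithm; the brute-force version is used here only to keep the sketch dependency-free.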

[74] Factorized Video Generation: Decoupling Scene Construction and Temporal Synthesis in Text-to-Video Diffusion Models

Mariam Hassan,Bastien Van Delft,Wuyang Li,Alexandre Alahi

Main category: cs.CV

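TL;DR: This paper proposes Factorized Video Generation (FVG), which decomposes text-to-video generation into three stages: reasoning, composition, and temporal synthesis. Introducing an anchor frame improves the logical consistency and efficiency of generated videos, reaching state-of-the-art performance on multiple benchmarks and substantially speeding up sampling.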

Details

Motivation: Existing text-to-video models often fail at complex scene composition and logical temporal understanding, and the root cause is their inability to generate a semantically correct initial frame. Method: Text-to-video generation is decoupled into three stages: (1) a reasoning stage in which a large language model rewrites the prompt to specify the initial scene; (2) a composition stage in which a text-to-image model generates a high-quality anchor frame; and (3) a temporal synthesis stage in which a fine-tuned video model generates the video from the anchor frame. Result: FVG reaches the state of the art on T2V CompBench, significantly improves every tested model on VBench2, and can cut the number of sampling steps by 70% without loss of performance. Conclusion: By decomposing the task and introducing visual anchoring, FVG offers a more efficient, robust, and controllable path for video generation and mitigates the limitations of conventional end-to-end models. Abstract: State-of-the-art Text-to-Video (T2V) diffusion models can generate visually impressive results, yet they still frequently fail to compose complex scenes or follow logical temporal instructions. In this paper, we argue that many errors, including apparent motion failures, originate from the model's inability to construct a semantically correct or logically consistent initial frame. We introduce Factorized Video Generation (FVG), a pipeline that decouples these tasks by decomposing the Text-to-Video generation into three specialized stages: (1) Reasoning, where a Large Language Model (LLM) rewrites the video prompt to describe only the initial scene, resolving temporal ambiguities; (2) Composition, where a Text-to-Image (T2I) model synthesizes a high-quality, compositionally-correct anchor frame from this new prompt; and (3) Temporal Synthesis, where a video model, finetuned to understand this anchor, focuses its entire capacity on animating the scene and following the prompt. Our decomposed approach sets a new state-of-the-art on the T2V CompBench benchmark and significantly improves all tested models on VBench2. Furthermore, we show that visual anchoring allows us to cut the number of sampling steps by 70% without any loss in performance, leading to a substantial speed-up in sampling. Factorized Video Generation offers a simple yet practical path toward more efficient, robust, and controllable video synthesis.

[75] Adaptive Frequency Domain Alignment Network for Medical image segmentation

Zhanwei Li,Liang Li,Jiawan Zhang

Main category: cs.CV

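TL;DR: Proposes AFDAN, an adaptive frequency domain alignment network that tackles cross-domain adaptation in medical image segmentation under scarce annotated data. By aligning features in the frequency domain and fusing spatial and frequency information, it significantly improves segmentation performance and achieves leading results on the VITILIGO2025 and DRIVE datasets.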

Details

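Motivation: Medical image segmentation depends on high-quality annotated data, which is costly and time-consuming to obtain; the resulting scarcity limits model performance. Effective domain adaptation is needed to reduce the distribution gap between source and target domains and enable cross-domain knowledge transfer. Method: The AFDAN framework has three core modules: an Adversarial Domain Learning Module that transfers features from the source to the target domain; a Source-Target Frequency Fusion Module that blends frequency representations across domains; and a Spatial-Frequency Integration Module that combines spatial and frequency features to improve segmentation accuracy. Feature alignment and enhancement are carried out in the frequency domain. Result: AFDAN reaches an IoU of 90.9% on the VITILIGO2025 dataset and 82.6% on the DRIVE dataset, surpassing existing state-of-the-art methods. Conclusion: Through frequency-domain feature alignment and spatial-frequency feature fusion, AFDAN alleviates data scarcity in medical image segmentation and achieves strong cross-domain segmentation performance with good generalization and application potential. Abstract: High-quality annotated data plays a crucial role in achieving accurate segmentation. However, such data for medical image segmentation are often scarce due to the time-consuming and labor-intensive nature of manual annotation. To address this challenge, we propose the Adaptive Frequency Domain Alignment Network (AFDAN), a novel domain adaptation framework designed to align features in the frequency domain and alleviate data scarcity. AFDAN integrates three core components to enable robust cross-domain knowledge transfer: an Adversarial Domain Learning Module that transfers features from the source to the target domain; a Source-Target Frequency Fusion Module that blends frequency representations across domains; and a Spatial-Frequency Integration Module that combines both frequency and spatial features to further enhance segmentation accuracy across domains. Extensive experiments demonstrate the effectiveness of AFDAN: it achieves an Intersection over Union (IoU) of 90.9% for vitiligo segmentation in the newly constructed VITILIGO2025 dataset and a competitive IoU of 82.6% on the retinal vessel segmentation benchmark DRIVE, surpassing existing state-of-the-art approaches.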
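
For readers unfamiliar with the reported metric, Intersection over Union for binary segmentation masks can be computed as in the sketch below. This is the standard definition, not code from the paper; the nested-list mask format and the convention of returning 1.0 for two empty masks are choices made for this example.

```python
def iou(pred_mask, target_mask):
    """Intersection over Union of two binary masks given as nested lists of 0/1."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred_mask, target_mask):
        for p, t in zip(pred_row, target_row):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Convention chosen here: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


# Hypothetical 2x3 masks: 2 pixels overlap, 4 pixels are covered in total.
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
score = iou(pred, target)  # 2 / 4 = 0.5
```

An IoU of 90.9% therefore means that the overlap between predicted and annotated regions is about nine tenths of their combined area.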

[76] Using Gaussian Splats to Create High-Fidelity Facial Geometry and Texture

Haodi He,Jihun Yu,Ronald Fedkiw

Main category: cs.CV

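TL;DR: This paper proposes a Gaussian Splatting-based method for 3D face reconstruction that uses uncalibrated images and segmentation annotations to reconstruct a neutral pose, and converts the Gaussian splats into a view-dependent neural texture usable in a standard graphics pipeline. It supports disentangling lighting from texture and fits text-driven asset generation.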

Details

Motivation: The goal is to use increasingly popular 3D neural representations to build a unified, consistent explanation of uncalibrated face images, removing the dependence of conventional methods on many video frames or strictly calibrated data. Method: Gaussian Splatting is used instead of NeRF because it is more explicit and easier to constrain. Segmentation annotations align the semantic regions of the face, soft constraints tie the Gaussians to a triangulated surface to improve reconstruction accuracy, the Gaussians are mapped into texture space to form a view-dependent neural texture, and a relightable model separates lighting from material. Result: A neutral-pose face can be reconstructed from only 11 images, yielding high-quality geometry and a high-resolution de-lit albedo texture that are compatible with a standard graphics pipeline and allow Gaussian Splatting rendering without modifying other scene elements. Conclusion: The method achieves efficient, flexible, high-fidelity 3D face reconstruction and texture generation, can be trained on images with differing lighting, and integrates into a text-driven asset creation pipeline. Abstract: We leverage increasingly popular three-dimensional neural representations in order to construct a unified and consistent explanation of a collection of uncalibrated images of the human face. Our approach utilizes Gaussian Splatting, since it is more explicit and thus more amenable to constraints than NeRFs. We leverage segmentation annotations to align the semantic regions of the face, facilitating the reconstruction of a neutral pose from only 11 images (as opposed to requiring a long video). We softly constrain the Gaussians to an underlying triangulated surface in order to provide a more structured Gaussian Splat reconstruction, which in turn informs subsequent perturbations to increase the accuracy of the underlying triangulated surface. The resulting triangulated surface can then be used in a standard graphics pipeline. In addition, and perhaps most impactful, we show how accurate geometry enables the Gaussian Splats to be transformed into texture space where they can be treated as a view-dependent neural texture. This allows one to use high visual fidelity Gaussian Splatting on any asset in a scene without the need to modify any other asset or any other aspect (geometry, lighting, renderer, etc.) of the graphics pipeline. We utilize a relightable Gaussian model to disentangle texture from lighting in order to obtain a delit high-resolution albedo texture that is also readily usable in a standard graphics pipeline. The flexibility of our system allows for training with disparate images, even with incompatible lighting, facilitating robust regularization.
Finally, we demonstrate the efficacy of our approach by illustrating its use in a text-driven asset creation pipeline.

[77] BrepLLM: Native Boundary Representation Understanding with Large Language Models

Liyuan Deng,Hao Guo,Yunpeng Bai,Yongkang Dai,Huaxi Huang,Yilei Shi

Main category: cs.CV

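TL;DR: BrepLLM is the first framework that enables large language models to parse and reason over raw Brep data. A two-stage training pipeline aligns 3D geometry with natural language across modalities and reaches state-of-the-art results on classification and captioning tasks.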

Details

Motivation: Existing token-sequence-based large language models cannot directly process 3D Brep models, which contain complex geometric and topological information, and there is no effective way to connect structured 3D geometry with natural language. Method: BrepLLM uses two-stage training. In the first stage, adaptive UV sampling converts Breps into graph representations, a hierarchical BrepEncoder extracts geometric and topological features, and contrastive learning aligns the global feature with CLIP text embeddings. In the second stage, the BrepEncoder is integrated into an LLM and the node-token sequence is aligned through a three-stage progressive training strategy (MLP-based semantic mapping, LLM fine-tuning, and a Mixture-of-Query Experts to enhance geometric diversity modeling). Result: BrepLLM achieves state-of-the-art performance on 3D object classification and captioning, and the authors build Brep2Text, a dataset of 269,444 Brep-text question-answer pairs. Conclusion: BrepLLM bridges the modality gap between structured 3D Brep data and natural language, providing an effective solution for 3D content understanding and generation. Abstract: Current token-sequence-based Large Language Models (LLMs) are not well-suited for directly processing 3D Boundary Representation (Brep) models that contain complex geometric and topological information. We propose BrepLLM, the first framework that enables LLMs to parse and reason over raw Brep data, bridging the modality gap between structured 3D geometry and natural language. BrepLLM employs a two-stage training pipeline: Cross-modal Alignment Pre-training and Multi-stage LLM Fine-tuning. In the first stage, an adaptive UV sampling strategy converts Breps into graph representations with geometric and topological information. We then design a hierarchical BrepEncoder to extract features from geometry (i.e., faces and edges) and topology, producing both a single global token and a sequence of node tokens. Then we align the global token with text embeddings from a frozen CLIP text encoder (ViT-L/14) via contrastive learning. In the second stage, we integrate the pretrained BrepEncoder into an LLM. We then align its sequence of node tokens using a three-stage progressive training strategy: (1) training an MLP-based semantic mapping from Brep representation to 2D with 2D-LLM priors; (2) performing fine-tuning of the LLM; and (3) designing a Mixture-of-Query Experts (MQE) to enhance geometric diversity modeling. We also construct Brep2Text, a dataset comprising 269,444 Brep-text question-answer pairs. Experiments show that BrepLLM achieves state-of-the-art (SOTA) results on 3D object classification and captioning tasks.

[78] CountZES: Counting via Zero-Shot Exemplar Selection

Muhammad Ibraheem Siddiqui,Muhammad Haris Khan

Main category: cs.CV

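TL;DR: Proposes CountZES, a training-free zero-shot object counting framework that selects precise and diverse exemplars through three synergistic stages, significantly improving counting of unseen categories in complex scenes.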

Details

Motivation: Existing zero-shot object counting methods handle unseen categories by relying either on open-vocabulary detectors, which yield multi-instance candidates, or on random patch sampling, which fails to delineate instances accurately, so precise counting is hard to achieve. Method: The CountZES framework has three stages: Detection-Anchored Exemplar (DAE), Density-Guided Exemplar (DGE), and Feature-Consensus Exemplar (FCE). DAE refines open-vocabulary detections to obtain single-instance exemplars; DGE selects semantically compact exemplars in a density-driven, self-supervised manner; FCE reinforces visual coherence through feature-space clustering. Result: Experiments on natural, aerial, and medical image datasets show that CountZES outperforms existing zero-shot object counting methods and generalizes well across domains. Conclusion: With its synergistic three-stage strategy, CountZES selects high-quality zero-shot exemplars that balance textual grounding, count consistency, and feature representativeness, offering an efficient training-free solution for object counting in complex scenes. Abstract: Object counting in complex scenes remains challenging, particularly in the zero-shot setting, where the goal is to count instances of unseen categories specified only by a class name. Existing zero-shot object counting (ZOC) methods that infer exemplars from text either rely on open-vocabulary detectors, which often yield multi-instance candidates, or on random patch sampling, which fails to accurately delineate object instances. To address this, we propose CountZES, a training-free framework for object counting via zero-shot exemplar selection. CountZES progressively discovers diverse exemplars through three synergistic stages: Detection-Anchored Exemplar (DAE), Density-Guided Exemplar (DGE), and Feature-Consensus Exemplar (FCE). DAE refines open-vocabulary detections to isolate precise single-instance exemplars. DGE introduces a density-driven, self-supervised paradigm to identify statistically consistent and semantically compact exemplars, while FCE reinforces visual coherence through feature-space clustering. Together, these stages yield a diverse, complementary exemplar set that balances textual grounding, count consistency, and feature representativeness. Experiments on diverse datasets demonstrate CountZES's superior performance among ZOC methods while generalizing effectively across natural, aerial and medical domains.

[79] Geometric Disentanglement of Text Embeddings for Subject-Consistent Text-to-Image Generation using A Single Prompt

Shangxun Li,Youngjung Uh

Main category: cs.CV

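TL;DR: This paper proposes a simple, effective training-free method that refines text embeddings from a geometric perspective to suppress unwanted semantics, addressing subject consistency and semantic alignment in text-to-image generation.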

Details

Motivation: Existing text-to-image generation methods preserve subject consistency poorly and often rely on computationally expensive fine-tuning or image conditioning, while existing training-free methods suffer from semantic leakage. Method: The approach takes a geometric perspective and refines text embeddings to suppress unwanted semantics across frames, mitigating semantic entanglement. Result: Extensive experiments show that the method significantly improves subject consistency and text alignment over existing baselines. Conclusion: Without any training, the method improves subject consistency and semantic accuracy in text-to-image generation and suits applications such as visual storytelling. Abstract: Text-to-image diffusion models excel at generating high-quality images from natural language descriptions but often fail to preserve subject consistency across multiple outputs, limiting their use in visual storytelling. Existing approaches rely on model fine-tuning or image conditioning, which are computationally expensive and require per-subject optimization. 1Prompt1Story, a training-free approach, concatenates all scene descriptions into a single prompt and rescales token embeddings, but it suffers from semantic leakage, where embeddings across frames become entangled, causing text misalignment. In this paper, we propose a simple yet effective training-free approach that addresses semantic entanglement from a geometric perspective by refining text embeddings to suppress unwanted semantics. Extensive experiments show that our approach significantly improves both subject consistency and text alignment over existing baselines.

[80] Prime and Reach: Synthesising Body Motion for Gaze-Primed Object Reach

Masashi Hatano,Saptarshi Sinha,Jacob Chalk,Wei-Hong Li,Hideo Saito,Dima Damen

Main category: cs.CV

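TL;DR: This paper proposes a text-conditioned diffusion-based method for human motion generation that focuses on gaze priming before picking up or putting down an object at a location. The authors combine five public datasets into a collection of 23.7K gaze-primed motion sequences and introduce a "Prime Success" metric to evaluate the naturalness of generated motion. Experiments show that the model achieves 60% prime success and 89% reach success on the HD-EPIC dataset.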

Details

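Motivation: Before picking up or putting down an object, humans prime the target location with their gaze, which is an important part of natural behaviour. Existing motion generation models do not adequately simulate this process and lack joint modeling of spotting a distant target and approaching it. A dedicated dataset and a generative model that reproduces this behaviour are therefore needed. Method: The authors first extract 23.7K gaze-primed human motion sequences from five public datasets (HD-EPIC, MoGaze, HOT3D, ADT, and GIMO). They then pre-train a text-conditioned diffusion model and fine-tune it conditioned on goal pose or location so that it generates context-appropriate approach and reach motion. Result: A new metric, 'Prime Success', measures the accuracy of gaze priming. On HD-EPIC, the model conditioned on the goal location achieves 60% Prime Success and 89% Reach Success, supporting the plausibility of the generated motion when guided by distant targets. Conclusion: By building the first large-scale gaze-primed motion dataset and introducing a new metric, the paper shows the potential of goal-location-conditioned diffusion models for generating natural approach and reach motion, advancing more realistic human-computer interaction and virtual character behaviour. Abstract: Human motion generation is a challenging task that aims to create realistic motion imitating natural human behaviour. We focus on the well-studied behaviour of priming an object/location for pick up or put down -- that is, the spotting of an object/location from a distance, known as gaze priming, followed by the motion of approaching and reaching the target location. To that end, we curate, for the first time, 23.7K gaze-primed human motion sequences for reaching target object locations from five publicly available datasets, i.e., HD-EPIC, MoGaze, HOT3D, ADT, and GIMO. We pre-train a text-conditioned diffusion-based motion generation model, then fine-tune it conditioned on goal pose or location, on our curated sequences. Importantly, we evaluate the ability of the generated motion to imitate natural human movement through several metrics, including the 'Reach Success' and a newly introduced 'Prime Success' metric. On the largest dataset, HD-EPIC, our model achieves 60% prime success and 89% reach success when conditioned on the goal object location.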

[81] SNOW: Spatio-Temporal Scene Understanding with World Knowledge for Open-World Embodied Reasoning

Tin Stribor Sohn,Maximilian Dillitzer,Jason J. Corso,Eric Sax

Main category: cs.CV

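TL;DR: SNOW is a training-free, backbone-agnostic framework that fuses vision-language model semantics with point cloud geometry and temporal consistency for unified 4D scene understanding, supporting precise spatially grounded reasoning and autonomous robot navigation.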

Details

Motivation: Existing vision-language models lack grounding in 3D geometry and temporal dynamics, while geometric perception is semantically sparse, so neither meets the need of autonomous robots for spatio-temporal understanding of dynamic environments. Method: SNOW combines synchronized RGB images and 3D point clouds, uses HDBSCAN clustering to generate object-level proposals that guide SAM2 segmentation, proposes Spatio-Temporal Tokenized Patch Encoding (STEP) to encode multimodal tokens capturing local semantic, geometric, and temporal features, builds a 4D Scene Graph (4DSG) as a reasoning prior, and uses a lightweight SLAM backend for spatial anchoring and global alignment. Result: SNOW reaches state-of-the-art performance on several benchmarks, validating its effectiveness for 4D scene understanding and spatially grounded reasoning. Conclusion: Structured 4D priors are essential for embodied reasoning and autonomous robotics, and SNOW offers a general, efficient solution that unifies semantics, geometry, and temporal dynamics. Abstract: Autonomous robotic systems require spatio-temporal understanding of dynamic environments to ensure reliable navigation and interaction. While Vision-Language Models (VLMs) provide open-world semantic priors, they lack grounding in 3D geometry and temporal dynamics. Conversely, geometric perception captures structure and motion but remains semantically sparse. We propose SNOW (Scene Understanding with Open-World Knowledge), a training-free and backbone-agnostic framework for unified 4D scene understanding that integrates VLM-derived semantics with point cloud geometry and temporal consistency. SNOW processes synchronized RGB images and 3D point clouds, using HDBSCAN clustering to generate object-level proposals that guide SAM2-based segmentation. Each segmented region is encoded through our proposed Spatio-Temporal Tokenized Patch Encoding (STEP), producing multimodal tokens that capture localized semantic, geometric, and temporal attributes. These tokens are incrementally integrated into a 4D Scene Graph (4DSG), which serves as a 4D prior for downstream reasoning. A lightweight SLAM backend anchors all STEP tokens spatially in the environment, providing the global reference alignment and ensuring unambiguous spatial grounding across time. The resulting 4DSG forms a queryable, unified world model through which VLMs can directly interpret spatial scene structure and temporal dynamics.
Experiments on a diverse set of benchmarks demonstrate that SNOW enables precise 4D scene understanding and spatially grounded inference, thereby setting new state-of-the-art performance in several settings, highlighting the importance of structured 4D priors for embodied reasoning and autonomous robotics.

[82] StageVAR: Stage-Aware Acceleration for Visual Autoregressive Models

Senmao Li,Kai Wang,Salman Khan,Fahad Shahbaz Khan,Jian Yang,Yaxing Wang

Main category: cs.CV

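TL;DR: Proposes StageVAR, a stage-aware acceleration framework for visual autoregressive models that achieves up to 3.4x speedup while preserving generation quality.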

Details

Motivation: Conventional VAR models have high computational complexity at large-scale generation steps, and existing acceleration methods rely on manual step selection and ignore the differing importance of generation stages. Method: The analysis finds that early steps are critical for semantic and structural consistency while later steps mainly refine details. Based on this, the authors design a training-free, plug-and-play acceleration strategy that exploits semantic irrelevance and low-rank properties in late-stage computation to prune or approximate it. Result: StageVAR achieves up to 3.4x speedup with a drop of only 0.01 on GenEval and 0.26 on DPG, outperforming existing acceleration methods. Conclusion: Stage-aware design is an effective principle for efficient visual autoregressive image generation. Abstract: Visual Autoregressive (VAR) modeling departs from the next-token prediction paradigm of traditional Autoregressive (AR) models through next-scale prediction, enabling high-quality image generation. However, the VAR paradigm suffers from sharply increased computational complexity and running time at large-scale steps. Although existing acceleration methods reduce runtime for large-scale steps, they rely on manual step selection and overlook the varying importance of different stages in the generation process. To address this challenge, we present StageVAR, a systematic study and stage-aware acceleration framework for VAR models. Our analysis shows that early steps are critical for preserving semantic and structural consistency and should remain intact, while later steps mainly refine details and can be pruned or approximated for acceleration. Building on these insights, StageVAR introduces a plug-and-play acceleration strategy that exploits semantic irrelevance and low-rank properties in late-stage computations, without requiring additional training. Our proposed StageVAR achieves up to 3.4x speedup with only a 0.01 drop on GenEval and a 0.26 decrease on DPG, consistently outperforming existing acceleration baselines. These results highlight stage-aware design as a powerful principle for efficient visual autoregressive image generation.

Yuan Li,Yahan Yu,Youyuan Lin,Yong-Hao Yang,Chenhui Chu,Shin'ya Nishida

Main category: cs.CV

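TL;DR: This paper proposes a reinforcement learning approach that gives a model human-like, self-consistent reasoning for blind image quality assessment (BIQA). It uses human annotations as reward signals and adds a reward driven by self-generated descriptions, achieving score prediction on par with current state-of-the-art methods and clearly better alignment with human explanations than the baseline.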

Details

Motivation: The aim is for a model not only to predict image quality accurately but also to make interpretable, self-consistent judgments through a perception-reasoning cascade as humans do, improving agreement between the model and human cognition. Method: Human evaluation data capturing the perception-reasoning process are collected. Within a reinforcement learning framework, human annotations serve as reward signals, and an intrinsic reward based on self-generated descriptions pushes the model to infer image quality from its own descriptions, yielding self-consistent reasoning. Result: The model matches existing state-of-the-art BIQA systems on standard correlation metrics and reaches a ROUGE-1 score of 0.512 on over 1,000 samples, compared with 0.443 for the baseline, indicating that its reasoning chains are closer to human explanations. Conclusion: The method improves interpretability and agreement with human reasoning in BIQA and is a step toward human-like interpretable reasoning. Abstract: Humans assess image quality through a perception-reasoning cascade, integrating sensory cues with implicit reasoning to form self-consistent judgments. In this work, we investigate how a model can acquire both human-like and self-consistent reasoning capability for blind image quality assessment (BIQA). We first collect human evaluation data that capture several aspects of the human perception-reasoning pipeline. Then, we adopt reinforcement learning, using human annotations as reward signals to guide the model toward human-like perception and reasoning. To enable the model to internalize self-consistent reasoning capability, we design a reward that drives the model to infer the image quality purely from self-generated descriptions. Empirically, our approach achieves score prediction performance comparable to state-of-the-art BIQA systems under general metrics, including Pearson and Spearman correlation coefficients. In addition to the rating score, we assess human-model alignment using ROUGE-1 to measure the similarity between model-generated and human perception-reasoning chains. On over 1,000 human-annotated samples, our model reaches a ROUGE-1 score of 0.512 (cf. 0.443 for baseline), indicating substantial coverage of human explanations and marking a step toward human-like interpretable reasoning in BIQA.
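
ROUGE-1 measures unigram overlap between a generated text and a reference. The sketch below computes precision, recall, and F1 with whitespace tokenization and lowercasing. The abstract does not say which ROUGE-1 variant or tokenizer the authors use, so those details, and the example sentences, are assumptions.

```python
from collections import Counter


def rouge1(candidate, reference):
    """Return (precision, recall, f1) of clipped unigram overlap.

    Assumption: lowercase whitespace tokenization, no stemming.
    """
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical model output and human explanation.
model_text = "the image is blurry and dark"
human_text = "the image looks blurry"
precision, recall, f1 = rouge1(model_text, human_text)
```

Here three of the six candidate words appear in the four-word reference, so precision is 0.5 and recall is 0.75. A score such as 0.512 on real reasoning chains indicates that roughly half of the unigrams are shared under whichever variant was used.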

[84] Smile on the Face, Sadness in the Eyes: Bridging the Emotion Gap with a Multimodal Dataset of Eye and Facial Behaviors

Kejun Liu,Yuanyuan Liu,Lin Wei,Chang Tang,Yibing Zhan,Zijing Chen,Zhe Chen

Main category: cs.CV

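TL;DR: This paper presents an Eye-behavior-aided Multimodal Emotion Recognition (EMER) dataset and a novel EMERT model that combine eye movement data with facial expressions to bridge the gap between facial expression recognition (FER) and genuine emotion recognition. Experiments show that adding eye behaviors significantly improves the robustness of emotion recognition.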

Details

Motivation: Because facial expressions often serve as social tools rather than reflections of genuine emotion, there is a gap between emotion recognition (ER) and facial expression recognition (FER). More reliable emotional cues such as eye behaviors are needed to make ER more faithful and accurate. Method: The authors build the EMER dataset using spontaneous emotion induction, capturing non-invasive eye behavior data such as eye movement sequences and eye fixation maps in sync with facial videos. They design the EMERT model, which uses modality-adversarial feature decoupling and a multitask Transformer to fuse eye behaviors with facial expressions for emotion recognition. Result: Seven multimodal benchmark protocols are introduced, and experiments show that EMERT clearly outperforms existing state-of-the-art methods, confirming that eye behaviors are effective and complementary emotional cues. Conclusion: Eye behaviors are an important complementary cue for robust emotion recognition, and the EMER dataset and EMERT model help narrow the gap between FER and genuine ER. Abstract: Emotion Recognition (ER) is the process of analyzing and identifying human emotions from sensing data. Currently, the field heavily relies on facial expression recognition (FER) because the visual channel conveys rich emotional cues. However, facial expressions are often used as social tools rather than manifestations of genuine inner emotions. To understand and bridge this gap between FER and ER, we introduce eye behaviors as an important emotional cue and construct an Eye-behavior-aided Multimodal Emotion Recognition (EMER) dataset. To collect data with genuine emotions, a spontaneous emotion induction paradigm is exploited with stimulus material, during which non-invasive eye behavior data, like eye movement sequences and eye fixation maps, is captured together with facial expression videos. To better illustrate the gap between ER and FER, multi-view emotion labels for multimodal ER and FER are separately annotated. Furthermore, based on the new dataset, we design a simple yet effective Eye-behavior-aided MER Transformer (EMERT) that enhances ER by bridging the emotion gap. EMERT leverages modality-adversarial feature decoupling and a multitask Transformer to model eye behaviors as a strong complement to facial expressions. In the experiment, we introduce seven multimodal benchmark protocols for a variety of comprehensive evaluations of the EMER dataset. The results show that the EMERT outperforms other state-of-the-art multimodal methods by a great margin, revealing the importance of modeling eye behaviors for robust ER.
To sum up, we provide a comprehensive analysis of the importance of eye behaviors in ER, advancing the study on addressing the gap between FER and ER for more robust ER performance. Our EMER dataset and the trained EMERT models will be publicly available at https://github.com/kejun1/EMER.

[85] YOLO11-4K: An Efficient Architecture for Real-Time Small Object Detection in 4K Panoramic Images

Huma Hafeez,Matthew Garratt,Jo Plested,Sankaran Iyer,Arcot Sowmya

Main category: cs.CV

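TL;DR: This paper proposes YOLO11-4K, an efficient real-time object detection framework for 4K panoramic images. A multi-scale detection head and a GhostConv backbone substantially reduce latency while improving small-object detection accuracy, and a newly annotated dataset, CVIP360, is provided for evaluation.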

Details

Motivation: Conventional detectors such as YOLO incur high computational cost and have low sensitivity to small objects on high-resolution, wide-field-of-view 360-degree images, which limits their practical use. Method: YOLO11-4K adopts a multi-scale detection head with a P2 layer to improve sensitivity to small objects and uses GhostConv to reduce computational complexity. The authors also annotate and release the CVIP360 dataset with 6,876 frame-level bounding boxes for 4K panoramic scenes. Result: YOLO11-4K reaches 0.95 mAP at 0.50 IoU with 28.3 milliseconds of inference per frame, a 75 percent latency reduction compared with YOLO11 (112.3 milliseconds), with higher accuracy (mAP rising from 0.908 to 0.95). Conclusion: YOLO11-4K balances efficiency and accuracy and suits high-resolution panoramic detection tasks in autonomous driving, surveillance, and augmented reality. Abstract: The processing of omnidirectional 360-degree images poses significant challenges for object detection due to inherent spatial distortions, wide fields of view, and ultra-high-resolution inputs. Conventional detectors such as YOLO are optimised for standard image sizes (for example, 640x640 pixels) and often struggle with the computational demands of 4K or higher-resolution imagery typical of 360-degree vision. To address these limitations, we introduce YOLO11-4K, an efficient real-time detection framework tailored for 4K panoramic images. The architecture incorporates a novel multi-scale detection head with a P2 layer to improve sensitivity to small objects often missed at coarser scales, and a GhostConv-based backbone to reduce computational complexity without sacrificing representational power. To enable evaluation, we manually annotated the CVIP360 dataset, generating 6,876 frame-level bounding boxes and producing a publicly available, detection-ready benchmark for 4K panoramic scenes. YOLO11-4K achieves 0.95 mAP at 0.50 IoU with 28.3 milliseconds inference per frame, representing a 75 percent latency reduction compared to YOLO11 (112.3 milliseconds), while also improving accuracy (mAP at 0.50 of 0.95 versus 0.908). This balance of efficiency and precision enables robust object detection in expansive 360-degree environments, making the framework suitable for real-world high-resolution panoramic applications. While this work focuses on 4K omnidirectional images, the approach is broadly applicable to high-resolution detection tasks in autonomous navigation, surveillance, and augmented reality.
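
The reported latency figures can be checked with a few lines of arithmetic. The helper names below are made up for this example, and the frames-per-second conversion assumes frames are processed one at a time with no batching or pipelining, which the abstract does not specify.

```python
def latency_reduction(baseline_ms, new_ms):
    """Fractional reduction in per-frame latency relative to the baseline."""
    return (baseline_ms - new_ms) / baseline_ms


def frames_per_second(latency_ms):
    """Throughput if frames are processed strictly one after another."""
    return 1000.0 / latency_ms


reduction = latency_reduction(112.3, 28.3)  # about 0.748, i.e. roughly 75 percent
fps_yolo11_4k = frames_per_second(28.3)     # about 35 frames per second
fps_yolo11 = frames_per_second(112.3)       # about 9 frames per second
```

The 75 percent figure in the abstract is therefore the rounded value of (112.3 - 28.3) / 112.3, and under the sequential-processing assumption it corresponds to moving from roughly 9 to roughly 35 frames per second.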

[86] PoseMoE: Mixture-of-Experts Network for Monocular 3D Human Pose Estimation

Mengyuan Liu,Jiajie Liu,Jinyan Zhang,Wenhao Li,Junsong Yuan

Main category: cs.CV

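TL;DR: This paper proposes PoseMoE, a mixture-of-experts network for monocular 3D human pose estimation that improves accuracy by separating the encoding of 2D pose features from that of depth features.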

Details

Motivation: Existing lifting-based methods encode the detected 2D pose and the unknown depth in an entangled feature space, so depth uncertainty degrades the 2D pose estimate and limits overall accuracy. Method: PoseMoE has two key designs: (1) a mixture-of-experts structure in which specialized modules process 2D pose features and depth features separately, disentangling the two; and (2) a cross-expert knowledge aggregation module that enhances features through bidirectional mapping over spatio-temporal context. Result: On three widely used datasets (Human3.6M, MPI-INF-3DHP, and 3DPW), PoseMoE outperforms conventional lifting-based methods. Conclusion: Depth representation is pivotal in monocular 3D pose estimation; estimating depth independently first and then fusing it with the 2D pose improves performance, which supports the disentangled encoding strategy. Abstract: Lifting-based methods have dominated monocular 3D human pose estimation by leveraging detected 2D poses as intermediate representations. The 2D component of the final 3D human pose benefits from the detected 2D poses, whereas its depth counterpart must be estimated from scratch. Lifting-based methods encode the detected 2D pose and unknown depth in an entangled feature space, explicitly introducing depth uncertainty to the detected 2D pose, thereby limiting overall estimation accuracy. This work reveals that the depth representation is pivotal for the estimation process. Specifically, when depth is in an initial, completely unknown state, jointly encoding depth features with 2D pose features is detrimental to the estimation process. In contrast, when depth is initially refined to a more dependable state via network-based estimation, encoding it together with 2D pose information is beneficial. To address this limitation, we present a Mixture-of-Experts network for monocular 3D pose estimation named PoseMoE. Our approach introduces: (1) A mixture-of-experts network where specialized expert modules refine the well-detected 2D pose features and learn the depth features. This mixture-of-experts design disentangles the feature encoding process for 2D pose and depth, therefore reducing the explicit influence of uncertain depth features on 2D pose features. (2) A cross-expert knowledge aggregation module that aggregates cross-expert spatio-temporal contextual information. This step enhances features through bidirectional mapping between 2D pose and depth. Extensive experiments show that our proposed PoseMoE outperforms the conventional lifting-based methods on three widely used datasets: Human3.6M, MPI-INF-3DHP, and 3DPW.

[87] VenusBench-GD: A Comprehensive Multi-Platform GUI Benchmark for Diverse Grounding Tasks

Beitong Zhou,Zhexiao Huang,Yuan Guo,Zhangxuan Gu,Tianyu Xia,Zichen Luo,Fei Tang,Dehan Kong,Yanyi Shang,Suling Ou,Zhenlin Guo,Changhua Meng,Shuheng Shen

Main category: cs.CV

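TL;DR: This paper presents VenusBench-GD, a large-scale, cross-platform, bilingual benchmark for GUI element grounding. A hierarchical task taxonomy and high-quality annotations enable comprehensive evaluation of multimodal models on both basic and advanced grounding tasks.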

Details

Motivation: Existing GUI grounding benchmarks suffer from insufficient data volume, narrow domain coverage, or excessive dependence on a single platform, and lack a comprehensive evaluation suite for real-world scenarios. Method: The authors build VenusBench-GD, a large-scale bilingual benchmark covering multiple platforms and applications, establish a high-accuracy annotation pipeline, and propose a hierarchical task taxonomy with six subtasks to evaluate models systematically on basic and advanced grounding. Result: General-purpose multimodal models now match or even surpass specialized GUI models on basic grounding tasks but still trail on advanced tasks; specialized models do better there, yet show overfitting and poor robustness. Conclusion: Multi-tiered, comprehensive evaluation frameworks are needed to advance GUI agents, and VenusBench-GD offers a more complete and realistic testbed for future model evaluation. Abstract: GUI grounding is a critical component in building capable GUI agents. However, existing grounding benchmarks suffer from significant limitations: they either provide insufficient data volume and narrow domain coverage, or focus excessively on a single platform and require highly specialized domain knowledge. In this work, we present VenusBench-GD, a comprehensive, bilingual benchmark for GUI grounding that spans multiple platforms, enabling hierarchical evaluation for real-world applications. VenusBench-GD contributes as follows: (i) we introduce a large-scale, cross-platform benchmark with extensive coverage of applications, diverse UI elements, and rich annotated data, (ii) we establish a high-quality data construction pipeline for grounding tasks, achieving higher annotation accuracy than existing benchmarks, and (iii) we extend the scope of element grounding by proposing a hierarchical task taxonomy that divides grounding into basic and advanced categories, encompassing six distinct subtasks designed to evaluate models from complementary perspectives. Our experimental findings reveal critical insights: general-purpose multimodal models now match or even surpass specialized GUI models on basic grounding tasks. In contrast, advanced tasks still favor GUI-specialized models, though they exhibit significant overfitting and poor robustness. These results underscore the necessity of comprehensive, multi-tiered evaluation frameworks.

[88] Skeleton-Snippet Contrastive Learning with Multiscale Feature Fusion for Action Localization

Qiushuo Cheng,Jingjing Liu,Catherine Morgan,Alan Whone,Majid Mirmehdi

Main category: cs.CV

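TL;DR: Proposes a self-supervised pretraining method for skeleton-based action localization that strengthens temporal feature representations through a snippet discrimination task and a U-shaped module.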

Details

Motivation: Existing self-supervised pretraining methods succeed at skeleton-based action recognition, but for action localization they struggle to capture the subtle changes at action boundaries and do not learn temporally sensitive features. Method: The authors design a snippet discrimination pretext task that divides skeleton sequences into non-overlapping snippets and uses contrastive learning to distinguish snippets across videos; a U-shaped structure fuses intermediate features to raise feature resolution for frame-level localization. Result: The method markedly improves the action localization performance of several contrastive learning methods on BABEL, and achieves state-of-the-art transfer learning on PKUMMD with pretraining on NTU RGB+D and BABEL. Conclusion: The proposed method strengthens temporally sensitive representations of skeleton sequences and advances self-supervised learning for action localization. Abstract: The self-supervised pretraining paradigm has achieved great success in learning 3D action representations for skeleton-based action recognition using contrastive learning. However, learning effective representations for skeleton-based temporal action localization remains challenging and underexplored. Unlike video-level action recognition, detecting action boundaries requires temporally sensitive features that capture subtle differences between adjacent frames where labels change. To this end, we formulate a snippet discrimination pretext task for self-supervised pretraining, which densely projects skeleton sequences into non-overlapping segments and promotes features that distinguish them across videos via contrastive learning. Additionally, we build on strong backbones of skeleton-based action recognition models by fusing intermediate features with a U-shaped module to enhance feature resolution for frame-level localization. Our approach consistently improves existing skeleton-based contrastive learning methods for action localization on BABEL across diverse subsets and evaluation protocols. We also achieve state-of-the-art transfer learning performance on PKUMMD with pretraining on NTU RGB+D and BABEL.
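
The snippet discrimination idea in the abstract above can be illustrated with a small sketch: a sequence is cut into non-overlapping snippets, and a contrastive (InfoNCE-style) loss rewards an anchor snippet embedding for being closer to its positive than to negatives. The loss form, the temperature value, and all function names here are assumptions for illustration; the abstract does not specify the exact objective or how positives are formed.

```python
import math


def split_into_snippets(frames, snippet_len):
    """Cut a frame sequence into non-overlapping snippets, dropping any remainder."""
    usable = len(frames) - len(frames) % snippet_len
    return [frames[i:i + snippet_len] for i in range(0, usable, snippet_len)]


def cosine(a, b):
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss: negative log-softmax of the positive among all candidates."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum - logits[0]


snippets = split_into_snippets(list(range(10)), snippet_len=4)
good = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])   # positive matches anchor
bad = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])    # negative matches anchor
```

The loss is near zero when the anchor aligns with its positive and large when a negative is the closer match, which is the pressure that makes snippet features distinguishable.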

[89] Multi-scale Attention-Guided Intrinsic Decomposition and Rendering Pass Prediction for Facial Images

Hossein Javidnia

Main category: cs.CV

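TL;DR: This paper proposes MAGINet, a multi-scale attention-guided network for high-accuracy intrinsic decomposition from a single face image. It produces multiple rendering passes, including diffuse albedo, enabling high-quality face relighting and material editing.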

Details

Motivation: Accurately decomposing the intrinsic properties of face images under unconstrained lighting is essential for photorealistic relighting and augmented-reality applications, but existing methods fall short on detail preservation and lighting invariance. Method: MAGINet uses hierarchical residual encoding, spatial-and-channel attention, and adaptive multi-scale feature fusion; RefinementNet upsamples and refines the initial albedo map, and a Pix2PixHD-based translator generates the remaining five physically based rendering passes. Result: Trained on the FFHQ-UV-Intrinsics dataset with a combination of losses, the method reaches state-of-the-art diffuse albedo estimation and improved fidelity for the complete rendering stack. Conclusion: The method markedly improves the quality of intrinsic decomposition for face images and supports high-fidelity digital humans, relighting, and material editing. Abstract: Accurate intrinsic decomposition of face images under unconstrained lighting is a prerequisite for photorealistic relighting, high-fidelity digital doubles, and augmented-reality effects. This paper introduces MAGINet, a Multi-scale Attention-Guided Intrinsics Network that predicts a $512\times512$ light-normalized diffuse albedo map from a single RGB portrait. MAGINet employs hierarchical residual encoding, spatial-and-channel attention in a bottleneck, and adaptive multi-scale feature fusion in the decoder, yielding sharper albedo boundaries and stronger lighting invariance than prior U-Net variants. The initial albedo prediction is upsampled to $1024\times1024$ and refined by a lightweight three-layer CNN (RefinementNet). Conditioned on this refined albedo, a Pix2PixHD-based translator then predicts a comprehensive set of five additional physically based rendering passes: ambient occlusion, surface normal, specular reflectance, translucency, and raw diffuse colour (with residual lighting). Together with the refined albedo, these six passes form the complete intrinsic decomposition. Trained with a combination of masked-MSE, VGG, edge, and patch-LPIPS losses on the FFHQ-UV-Intrinsics dataset, the full pipeline achieves state-of-the-art performance for diffuse albedo estimation and demonstrates significantly improved fidelity for the complete rendering stack compared to prior methods. The resulting passes enable high-quality relighting and material editing of real faces.

[90] TTP: Test-Time Padding for Adversarial Detection and Robust Adaptation on Vision-Language Models

Zhiwei Li,Yitian Pang,Weining Wang,Zhenan Sun,Qi Li

Main category: cs.CV

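TL;DR: This paper proposes Test-Time Padding (TTP), a lightweight defense framework that improves the robustness of vision-language models such as CLIP to adversarial perturbations at inference time, detecting and adapting to adversarial inputs without harming clean accuracy.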

Details

Motivation: Existing training-time defenses depend on labeled data and costly retraining, while test-time strategies cannot reliably separate clean from adversarial inputs, so robustness and accuracy are hard to achieve together. A retraining-free, efficient, and general test-time defense is therefore needed. Method: TTP detects adversarial inputs from the change in cosine similarity between CLIP feature embeddings computed before and after spatial padding, using a single threshold that works across architectures and datasets; for detected adversarial inputs it applies trainable padding to restore attention patterns, combined with a similarity-aware ensemble strategy for a more robust prediction; clean inputs are left unchanged or optionally passed to existing test-time adaptation techniques for further accuracy gains. Result: Experiments on multiple CLIP backbones and fine-grained benchmarks show that TTP substantially outperforms state-of-the-art test-time defenses in adversarial robustness while maintaining or even improving clean accuracy. Conclusion: TTP is an effective, lightweight test-time defense that reliably detects adversarial inputs and adapts to them, improving VLM robustness without sacrificing clean accuracy. Abstract: Vision-Language Models (VLMs), such as CLIP, have achieved impressive zero-shot recognition performance but remain highly susceptible to adversarial perturbations, posing significant risks in safety-critical scenarios. Previous training-time defenses rely on adversarial fine-tuning, which requires labeled data and costly retraining, while existing test-time strategies fail to reliably distinguish between clean and adversarial inputs, thereby preventing both adversarial robustness and clean accuracy from reaching their optimum. To address these limitations, we propose Test-Time Padding (TTP), a lightweight defense framework that performs adversarial detection followed by targeted adaptation at inference. TTP identifies adversarial inputs via the cosine similarity shift between CLIP feature embeddings computed before and after spatial padding, yielding a universal threshold for reliable detection across architectures and datasets. For detected adversarial cases, TTP employs trainable padding to restore disrupted attention patterns, coupled with a similarity-aware ensemble strategy for a more robust final prediction. For clean inputs, TTP leaves them unchanged by default or optionally integrates existing test-time adaptation techniques for further accuracy gains. Comprehensive experiments on diverse CLIP backbones and fine-grained benchmarks show that TTP consistently surpasses state-of-the-art test-time defenses, delivering substantial improvements in adversarial robustness without compromising clean accuracy. The code for this paper will be released soon.
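
The detection step described above (compare embeddings before and after spatial padding, then threshold the cosine similarity) can be sketched in plain Python. Everything concrete here is an assumption for illustration: `embed` stands in for a CLIP image encoder, the image is a 2-D list of grayscale values, the padding is zero-valued, and the threshold of 0.9 is a placeholder rather than the paper's universal threshold. The sketch also assumes that adversarial inputs show lower similarity after padding.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def pad_image(image, pad, value=0.0):
    """Pad a 2-D image (list of rows) with a constant border on all four sides."""
    width = len(image[0]) + 2 * pad
    top = [[value] * width for _ in range(pad)]
    body = [[value] * pad + list(row) + [value] * pad for row in image]
    bottom = [[value] * width for _ in range(pad)]
    return top + body + bottom


def is_adversarial(image, embed, pad=2, threshold=0.9):
    """Flag an input whose embedding moves too far once the image is padded.

    `embed` is a hypothetical callable mapping an image to a feature vector.
    """
    similarity = cosine_similarity(embed(image), embed(pad_image(image, pad)))
    return similarity < threshold
```

In a real pipeline the flagged inputs would then go through the trainable-padding adaptation step, which this sketch does not attempt to reproduce.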

[91] N3D-VLM: Native 3D Grounding Enables Accurate Spatial Reasoning in Vision-Language Models

Yuxin Wang,Lei Ke,Boqiang Zhang,Tianyuan Qu,Hanxun Yu,Zhenpeng Huang,Meng Yu,Dan Xu,Dong Yu

Main category: cs.CV

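TL;DR: This paper proposes N3D-VLM, a unified framework that combines native 3D object perception with 3D-aware visual reasoning. It delivers precise 3D grounding and spatial question answering through interpretable 3D spatial understanding, is trained with a data pipeline that lifts large-scale 2D annotations into 3D, and reaches state-of-the-art results on 3D grounding and spatial reasoning.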

Details

Motivation: Existing vision-language models lack intrinsic 3D object perception and struggle to interpret spatial relationships and depth cues, which limits their use in 3D scenes. Method: N3D-VLM introduces a native 3D object perception mechanism that combines text-driven 3D localization with explicit 3D spatial reasoning; a scalable data pipeline uses depth estimation to lift 2D annotations into 3D space and generates spatial question-answering data that supports chain-of-thought (CoT) reasoning for joint training. Result: The method leads on 3D grounding tasks and clearly outperforms existing methods on 3D spatial reasoning; its training data is more than six times larger than the largest existing single-image 3D detection dataset. Conclusion: N3D-VLM achieves precise, interpretable 3D vision-language understanding by unifying native 3D perception with explicit 3D reasoning, advancing multimodal models in complex 3D scenes. Abstract: While current multimodal models can answer questions based on 2D images, they lack intrinsic 3D object perception, limiting their ability to comprehend spatial relationships and depth cues in 3D scenes. In this work, we propose N3D-VLM, a novel unified framework that seamlessly integrates native 3D object perception with 3D-aware visual reasoning, enabling both precise 3D grounding and interpretable spatial understanding. Unlike conventional end-to-end models that directly predict answers from RGB/RGB-D inputs, our approach equips the model with native 3D object perception capabilities, enabling it to directly localize objects in 3D space based on textual descriptions. Building upon accurate 3D object localization, the model further performs explicit reasoning in 3D, achieving more interpretable and structured spatial understanding. To support robust training for these capabilities, we develop a scalable data construction pipeline that leverages depth estimation to lift large-scale 2D annotations into 3D space, significantly increasing the diversity and coverage of 3D object grounding data and yielding a dataset over six times larger than the largest existing single-image 3D detection dataset. Moreover, the pipeline generates spatial question-answering datasets that target chain-of-thought (CoT) reasoning in 3D, facilitating joint training for both 3D object localization and 3D spatial reasoning. Experimental results demonstrate that our unified framework not only achieves state-of-the-art performance on 3D grounding tasks, but also consistently surpasses existing methods in 3D spatial reasoning in vision-language models.
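
Lifting a 2D annotation into 3D with an estimated depth, as the data pipeline above does, is commonly done by back-projecting pixels through a pinhole camera model. The sketch below shows that standard back-projection for the center of a 2D box. The pinhole model, the intrinsics (`fx`, `fy`, `cx`, `cy`), and the function names are assumptions for illustration; the abstract does not state which camera model or lifting procedure the authors use.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Map pixel (u, v) at a given depth to camera-frame 3D coordinates.

    Standard pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy, Z = depth.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def lift_box_center(box, depth, fx, fy, cx, cy):
    """Lift the center of a 2D box (x_min, y_min, x_max, y_max) into 3D."""
    x_min, y_min, x_max, y_max = box
    return backproject((x_min + x_max) / 2.0, (y_min + y_max) / 2.0,
                       depth, fx, fy, cx, cy)


# Hypothetical intrinsics for a 640x480 image and a box whose center lies
# 100 pixels to the right of the principal point, at an estimated depth of 5 m.
center_3d = lift_box_center((400, 200, 440, 280), depth=5.0,
                            fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```

A full 3D box would additionally need extent and orientation, which this sketch leaves out.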

[92] 4D Primitive-Mâché: Glueing Primitives for Persistent 4D Scene Reconstruction

Kirill Mazur,Marwan Taher,Andrew J. Davison

Main category: cs.CV

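TL;DR: Proposes a dynamic reconstruction system that takes a monocular RGB video as input and outputs a complete, persistent 4D scene reconstruction, using rigid 3D primitives and motion inference to produce temporally consistent, replayable 3D reconstructions.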

Details

Motivation: Existing methods struggle to produce complete, persistent dynamic scene reconstructions with object permanence, especially when handling occlusion and motion continuity. Method: The scene is decomposed into multiple rigid 3D primitives whose rigid motions are jointly optimized from dense 2D correspondences; a motion-grouping mechanism extrapolates motion for occluded objects, yielding a 4D (3D plus time) reconstruction. Result: On object scanning and multi-object datasets, the system significantly outperforms existing methods quantitatively and qualitatively, and supports replayable 3D reconstruction, multi-object scanning, and object permanence. Conclusion: The method achieves high-quality 4D spatio-temporal dynamic reconstruction, addressing motion continuity under occlusion and scene completeness, and advances dynamic reconstruction from monocular video. Abstract: We present a dynamic reconstruction system that receives a casual monocular RGB video as input, and outputs a complete and persistent reconstruction of the scene. In other words, we reconstruct not only the currently visible parts of the scene, but also all previously viewed parts, which enables replaying the complete reconstruction across all timesteps. Our method decomposes the scene into a set of rigid 3D primitives, which are assumed to be moving throughout the scene. Using estimated dense 2D correspondences, we jointly infer the rigid motion of these primitives through an optimisation pipeline, yielding a 4D reconstruction of the scene, i.e. providing 3D geometry dynamically moving through time. To achieve this, we also introduce a mechanism to extrapolate motion for objects that become invisible, employing motion-grouping techniques to maintain continuity. The resulting system enables 4D spatio-temporal awareness, offering capabilities such as replayable 3D reconstructions of articulated objects through time, multi-object scanning, and object permanence. On object scanning and multi-object datasets, our system significantly outperforms existing methods both quantitatively and qualitatively.

[93] Causal-Tune: Mining Causal Factors from Vision Foundation Models for Domain Generalized Semantic Segmentation

Yin Zhang,Yongqiang Zhang,Yaoyue Zheng,Bogdan Raducanu,Dan Liu

Main category: cs.CV

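TL;DR: This paper proposes Causal-Tune, a fine-tuning strategy that separates and suppresses non-causal factors in vision foundation models in the frequency domain (artifacts residing in the low- and high-frequency components), improving the robustness of domain generalized semantic segmentation.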

Details

Motivation: Existing methods for domain generalized semantic segmentation overlook the artifacts present in pre-trained vision models; these artifacts are tied to non-causal factors, weaken the feature representations, and reduce performance. Method: Features are transformed to the frequency domain with the Discrete Cosine Transform, a Gaussian band-pass filter separates causal from non-causal components, learnable causal-aware tokens refine the causal part, and an inverse transform returns the result to the spatial domain. Result: The method is validated on multiple cross-domain tasks and clearly outperforms the baseline under adverse weather, with an mIoU gain of 4.8% in snow. Conclusion: Causal-Tune identifies and disentangles causal and non-causal features in vision foundation models, improving the robustness and performance of domain generalized semantic segmentation. Abstract: Fine-tuning Vision Foundation Models (VFMs) with a small number of parameters has shown remarkable performance in Domain Generalized Semantic Segmentation (DGSS). Most existing works either train lightweight adapters or refine intermediate features to achieve better generalization on unseen domains. However, they both overlook the fact that long-term pre-trained VFMs often exhibit artifacts, which hinder the utilization of valuable representations and ultimately degrade DGSS performance. Inspired by causal mechanisms, we observe that these artifacts are associated with non-causal factors, which usually reside in the low- and high-frequency components of the VFM spectrum. In this paper, we explicitly examine the causal and non-causal factors of features within VFMs for DGSS, and propose a simple yet effective method to identify and disentangle them, enabling more robust domain generalization. Specifically, we propose Causal-Tune, a novel fine-tuning strategy designed to extract causal factors and suppress non-causal ones from the features of VFMs. First, we extract the frequency spectrum of features from each layer using the Discrete Cosine Transform (DCT). A Gaussian band-pass filter is then applied to separate the spectrum into causal and non-causal components. To further refine the causal components, we introduce a set of causal-aware learnable tokens that operate in the frequency domain, while the non-causal components are discarded. Finally, refined features are transformed back into the spatial domain via inverse DCT and passed to the next layer. Extensive experiments conducted on various cross-domain tasks demonstrate the effectiveness of Causal-Tune. In particular, our method achieves superior performance under adverse weather conditions, improving mIoU by 4.8% over the baseline in snow conditions.
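
The DCT, band-pass, inverse-DCT chain described above can be shown on a 1-D signal. This is a minimal sketch under stated assumptions: it uses an orthonormal DCT-II written out in plain Python, a Gaussian mask centered on a mid-band frequency index, and hypothetical `center` and `width` parameters. It omits the learnable causal-aware tokens and operates on a single vector rather than on per-layer VFM feature maps.

```python
import math


def dct(signal):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(signal)
    coeffs = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        total = sum(signal[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        coeffs.append(scale * total)
    return coeffs


def idct(coeffs):
    """Inverse of the orthonormal DCT-II above (an orthonormal DCT-III)."""
    n = len(coeffs)
    return [
        sum(math.sqrt((1.0 if k == 0 else 2.0) / n) * coeffs[k]
            * math.cos(math.pi * (i + 0.5) * k / n) for k in range(n))
        for i in range(n)
    ]


def gaussian_band_pass(n, center, width):
    """Gaussian weights over frequency indices, peaking at `center`."""
    return [math.exp(-((k - center) ** 2) / (2.0 * width ** 2)) for k in range(n)]


def keep_mid_band(signal, center, width):
    """Attenuate low and high frequencies, keep the band around `center`."""
    spectrum = dct(signal)
    mask = gaussian_band_pass(len(signal), center, width)
    return idct([m * c for m, c in zip(mask, spectrum)])


# A constant signal is pure low frequency, so the mid-band filter removes it.
flattened = keep_mid_band([1.0] * 8, center=4.0, width=1.0)
```

Suppressing both ends of the spectrum mirrors the abstract's observation that non-causal factors tend to sit in the low- and high-frequency components.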

[94] CRONOS: Continuous Time Reconstruction for 4D Medical Longitudinal Series

Nico Albert Disch,Saikat Roy,Constantin Ulrich,Yannick Kirchhoff,Maximilian Rokuss,Robin Peretzke,David Zimmerer,Klaus Maier-Hein

Main category: cs.CV

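TL;DR: CRONOS is the first unified sequence-to-image forecasting framework for 3D medical scans that supports both discrete and continuous timestamps, enabling voxel-level continuous-time prediction under irregular sampling.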

Details

Motivation: Existing models of how 3D medical scans evolve over time are limited to a single input scan, fixed time grids, or global label prediction, and cannot meet the need for voxel-level forecasting under irregular sampling. Method: CRONOS learns a spatio-temporal velocity field that transports multiple context volumes along the time axis toward a predicted volume at an arbitrary target time, modeling continuous time directly in 3D voxel space and supporting many-to-one prediction. Result: On three public datasets spanning Cine-MRI, perfusion CT, and longitudinal MRI, CRONOS outperforms existing baselines on voxel-level forecasting while remaining computationally efficient. Conclusion: CRONOS is the first to achieve continuous-time sequence-to-image forecasting for 3D medical data, offering a more flexible and precise tool for disease progression modeling and treatment planning, and supporting reproducible multi-dataset benchmarking. Abstract: Forecasting how 3D medical scans evolve over time is important for disease progression, treatment planning, and developmental assessment. Yet existing models either rely on a single prior scan, fixed grid times, or target global labels, which limits voxel-level forecasting under irregular sampling. We present CRONOS, a unified framework for many-to-one prediction from multiple past scans that supports both discrete (grid-based) and continuous (real-valued) timestamps in one model, to the best of our knowledge the first to achieve continuous sequence-to-image forecasting for 3D medical data. CRONOS learns a spatio-temporal velocity field that transports context volumes toward a target volume at an arbitrary time, while operating directly in 3D voxel space. Across three public datasets spanning Cine-MRI, perfusion CT, and longitudinal MRI, CRONOS outperforms other baselines, while remaining computationally competitive. We will release code and evaluation protocols to enable reproducible, multi-dataset benchmarking of multi-context, continuous-time forecasting.

[95] Sketch-in-Latents: Eliciting Unified Reasoning in MLLMs

Jintao Tong,Jiaqi Gu,Yujing Lou,Lubin Fan,Yixiong Zou,Yue Wu,Jieping Ye,Ruixuan Li

Main category: cs.CV

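TL;DR: Proposes Sketch-in-Latents (SkiLa), a new paradigm that lets multimodal large language models perform visual imagination and multi-step reasoning by generating latent sketch tokens in a unified feature space.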

Details

Motivation: Existing MLLMs underperform on tasks that require visual imagination, whereas humans can interleave visual and textual imagination flexibly without a predefined toolkit. Inspired by this, the authors aim to realize multimodal reasoning within a unified space. Method: SkiLa extends the auto-regressive capability of MLLMs to natively generate continuous visual embeddings (latent sketch tokens), alternates dynamically between a textual thinking mode and a visual sketching mode during multi-step reasoning, and adds a latent visual semantics reconstruction mechanism to keep the tokens semantically grounded. Result: Experiments show that SkiLa achieves superior performance on vision-centric tasks and generalizes well to diverse general multimodal benchmarks. Conclusion: SkiLa provides a unified multimodal reasoning framework in which textual and visual thinking combine seamlessly in one feature space, improving visual imagination and reasoning. Abstract: While Multimodal Large Language Models (MLLMs) excel at visual understanding tasks through text reasoning, they often fall short in scenarios requiring visual imagination. Unlike current works, which rely on predefined external toolkits or generate images during thinking, humans can form flexible visual-text imagination and interactions during thinking without predefined toolkits; one important reason is that humans construct the visual-text thinking process in a unified space inside the brain. Inspired by this capability, given that current MLLMs already encode visual and text information in the same feature space, we hold that visual tokens can be seamlessly inserted into the reasoning process carried by text tokens, where ideally, all visual imagination processes can be encoded by the latent features. To achieve this goal, we propose Sketch-in-Latents (SkiLa), a novel paradigm for unified multi-modal reasoning that expands the auto-regressive capabilities of MLLMs to natively generate continuous visual embeddings, termed latent sketch tokens, as visual thoughts. During multi-step reasoning, the model dynamically alternates between textual thinking mode for generating textual think tokens and visual sketching mode for generating latent sketch tokens. A latent visual semantics reconstruction mechanism is proposed to ensure these latent sketch tokens are semantically grounded. Extensive experiments demonstrate that SkiLa achieves superior performance on vision-centric tasks while exhibiting strong generalization to diverse general multi-modal benchmarks. Codes will be released at https://github.com/TungChintao/SkiLa.

[96] Yuan-TecSwin: A text conditioned Diffusion model with Swin-transformer blocks

Shaohua Wu,Tong Yu,Shenling Wang,Xudong Zhao

Main category: cs.CV

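TL;DR: This paper proposes Yuan-TecSwin, a text-conditioned diffusion model built on Swin-transformer blocks that improves long-range semantic modeling and text-image alignment in image generation, reaching an FID of 1.37 on ImageNet.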

Details

Motivation: The locality of convolutional neural networks (CNNs) in diffusion models makes it hard to capture long-range semantic information, which affects generation quality. Method: Swin-transformer blocks replace CNN blocks as the basic modules of the encoder and decoder to strengthen non-local feature extraction; the text encoder and the way text embeddings are incorporated are tuned to improve text-image alignment; and an adapted time-step search strategy improves inference performance. Result: Yuan-TecSwin achieves a state-of-the-art FID of 1.37 on the ImageNet generation benchmark without additional models at different denoising stages; in a side-by-side comparison, human participants find it hard to tell model-generated images from human-painted ones. Conclusion: The Swin-transformer structure improves the global modeling ability of diffusion models and, combined with the text-conditioning design, raises the quality and realism of text-to-image generation. Abstract: Diffusion models have shown remarkable capacity in image synthesis based on their U-shaped architecture and convolutional neural networks (CNN) as basic blocks. The locality of the convolution operation in CNN may limit the model's ability to understand long-range semantic information. To address this issue, we propose Yuan-TecSwin, a text-conditioned diffusion model built on Swin-transformer blocks. The Swin-transformer blocks take the place of CNN blocks in the encoder and decoder, to improve the non-local modeling ability in feature extraction and image restoration. The text-image alignment is improved with a well-chosen text encoder, effective utilization of text embedding, and careful design in the incorporation of text condition. Using an adapted time step to search in different diffusion stages, inference performance is further improved by 10%. Yuan-TecSwin achieves the state-of-the-art FID score of 1.37 on ImageNet generation benchmark, without any additional models at different denoising stages. In a side-by-side comparison, we find it difficult for human interviewees to tell the model-generated images from the human-painted ones.

[97] Hazedefy: A Lightweight Real-Time Image and Video Dehazing Pipeline for Practical Deployment

Ayush Bhavsar

Main category: cs.CV

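TL;DR: This paper proposes Hazedefy, a lightweight, application-focused real-time dehazing method based on the dark channel prior and the atmospheric scattering model, aimed at consumer-grade hardware and mobile or embedded devices.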

Details

Motivation: To enable real-time dehazing of video and live camera feeds without GPU acceleration, making deployment on consumer-grade hardware practical. Method: Building on the Dark Channel Prior (DCP) and the atmospheric scattering model, the pipeline adds gamma-adaptive reconstruction, a fast transmission approximation, a stabilized atmospheric light estimate based on fractional top-pixel averaging, and an optional color balance stage. Result: Experiments on real-world images and videos show improved visibility and contrast, and the method runs without GPU support. Conclusion: Hazedefy is an efficient, lightweight, and practical dehazing solution suited to real-time deployment on resource-constrained mobile and embedded platforms. Abstract: This paper introduces Hazedefy, a lightweight and application-focused dehazing pipeline intended for real-time video and live camera feed enhancement. Hazedefy prioritizes computational simplicity and practical deployability on consumer-grade hardware, building upon the Dark Channel Prior (DCP) concept and the atmospheric scattering model. Key elements include gamma-adaptive reconstruction, a fast transmission approximation with lower bounds for numerical stability, a stabilized atmospheric light estimator based on fractional top-pixel averaging, and an optional color balance stage. The pipeline is suitable for mobile and embedded applications, as experimental demonstrations on real-world images and videos show improved visibility and contrast without requiring GPU acceleration.
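
The ingredients named in the abstract follow the classic dark-channel recipe, which can be sketched on a flat list of RGB pixels. The scattering model is I = J * t + A * (1 - t), where I is the hazy pixel, J the scene radiance, t the transmission, and A the atmospheric light. The sketch below is an assumption-laden simplification, not Hazedefy itself: the dark channel is taken per pixel rather than over a patch, `omega`, `t_min`, and `fraction` are placeholder values, and the gamma-adaptive reconstruction and color balance stages are omitted.

```python
def dark_channel(pixels):
    """Per-pixel dark channel: the minimum over the RGB channels."""
    return [min(p) for p in pixels]


def estimate_atmospheric_light(pixels, fraction=0.01):
    """Average the colors of the top `fraction` of pixels ranked by dark channel."""
    dark = dark_channel(pixels)
    count = max(1, int(len(pixels) * fraction))
    top = sorted(range(len(pixels)), key=lambda i: dark[i], reverse=True)[:count]
    return tuple(sum(pixels[i][c] for i in top) / count for c in range(3))


def estimate_transmission(pixels, airlight, omega=0.95, t_min=0.1):
    """t = 1 - omega * min_c(I_c / A_c), clamped below for numerical stability."""
    safe = [max(a, 1e-6) for a in airlight]
    return [max(t_min, 1.0 - omega * min(p[c] / safe[c] for c in range(3)))
            for p in pixels]


def recover(pixels, airlight, transmission):
    """Invert the scattering model: J = (I - A) / t + A."""
    return [tuple((p[c] - airlight[c]) / t + airlight[c] for c in range(3))
            for p, t in zip(pixels, transmission)]


# Synthesize one hazy pixel from a known scene color, then invert it exactly.
scene, light, t_true = (0.2, 0.4, 0.6), (0.9, 0.9, 0.9), 0.5
hazy = tuple(scene[c] * t_true + light[c] * (1 - t_true) for c in range(3))
restored = recover([hazy], light, [t_true])[0]
```

The lower bound on the transmission is what keeps the division in `recover` from amplifying noise in dense haze.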

[98] Trainable Log-linear Sparse Attention for Efficient Diffusion Transformers

Yifan Zhou,Zeqi Xiao,Tianyi Wei,Shuai Yang,Xingang Pan

Main category: cs.CV

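TL;DR: This paper proposes Log-linear Sparse Attention (LLSA), an efficient sparse attention mechanism for long-sequence diffusion Transformers. Its hierarchical structure reduces both selection and attention cost from quadratic to log-linear complexity, substantially accelerating training and inference while preserving generation quality.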

Details

Motivation: Existing Top-K sparse attention methods still pay a quadratic selection cost on long sequences and need a larger K to maintain quality; the root cause is that a single-level design cannot capture global structure, which limits scaling DiTs to long sequences. Method: LLSA uses hierarchical Top-K sparse selection that refines the choice of key blocks level by level, and adds a Hierarchical KV Enrichment mechanism that preserves global context at different granularities; an efficient GPU implementation uses only sparse indices in both the forward and backward passes, avoiding the overhead of dense masks. Result: On high-resolution image generation with 256x256 pixel token sequences, LLSA accelerates attention inference by 28.27x and DiT training by 6.09x while maintaining generation quality. Conclusion: LLSA's hierarchical sparse design removes the computational bottleneck in long-sequence DiTs and offers an efficient, practical direction for training diffusion models on very long sequences. Abstract: Diffusion Transformers (DiTs) set the state of the art in visual generation, yet their quadratic self-attention cost fundamentally limits scaling to long token sequences. Recent Top-K sparse attention approaches reduce the computation of DiTs by compressing tokens into block-wise representation and selecting a small set of relevant key blocks, but still suffer from (i) quadratic selection cost on compressed tokens and (ii) increasing K required to maintain model quality as sequences grow. We identify that their inefficiency is due to the single-level design, as a single coarse level is insufficient to represent the global structure. In this paper, we introduce Log-linear Sparse Attention (LLSA), a trainable sparse attention mechanism for extremely long token sequences that reduces both selection and attention costs from quadratic to log-linear complexity by utilizing a hierarchical structure. LLSA performs hierarchical Top-K selection, progressively adopting sparse Top-K selection with the indices found at the previous level, and introduces a Hierarchical KV Enrichment mechanism that preserves global context while using fewer tokens of different granularity during attention computation. To support efficient training, we develop a high-performance GPU implementation that uses only sparse indices for both the forward and backward passes, eliminating the need for dense attention masks. We evaluate LLSA on high-resolution pixel-space image generation without using patchification and VAE encoding. LLSA accelerates attention inference by 28.27x and DiT training by 6.09x on 256x256 pixel token sequences, while maintaining generation quality. The results demonstrate that LLSA offers a promising direction for training long-sequence DiTs efficiently. Code is available at: https://github.com/SingleZombie/LLSA
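
The hierarchical Top-K idea above (score a few coarse blocks, then only descend into the children of the winners) can be sketched for a single query over a 1-D key sequence. This is an illustrative toy under stated assumptions, not the released implementation: keys are mean-pooled into a pyramid with a hypothetical pooling `factor`, scores are plain dot products, and the Hierarchical KV Enrichment step and the GPU kernels are not modeled.

```python
def mean_pool(vectors, factor):
    """Average consecutive groups of `factor` vectors into one coarser vector."""
    pooled = []
    for start in range(0, len(vectors), factor):
        group = vectors[start:start + factor]
        dim = len(group[0])
        pooled.append([sum(v[d] for v in group) / len(group) for d in range(dim)])
    return pooled


def build_pyramid(keys, factor):
    """levels[0] holds the original keys; each later level is `factor` times coarser."""
    levels = [keys]
    while len(levels[-1]) > factor:
        levels.append(mean_pool(levels[-1], factor))
    return levels


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def hierarchical_top_k(query, keys, factor=2, k=2):
    """Select k key indices by refining Top-K choices from coarse to fine levels."""
    levels = build_pyramid(keys, factor)
    candidates = list(range(len(levels[-1])))  # every block at the coarsest level
    for depth in range(len(levels) - 1, -1, -1):
        level = levels[depth]
        ranked = sorted(candidates, key=lambda i: dot(query, level[i]), reverse=True)
        kept = ranked[:k]
        if depth == 0:
            return sorted(kept)
        finer_len = len(levels[depth - 1])
        # Only the children of the kept blocks are scored at the next level.
        candidates = [child for parent in kept
                      for child in range(parent * factor,
                                         min((parent + 1) * factor, finer_len))]


keys = [[0.0], [0.1], [0.0], [0.2], [5.0], [4.0], [0.1], [0.0]]
selected = hierarchical_top_k([1.0], keys, factor=2, k=2)
```

Each level scores at most `k * factor` candidates, and the number of levels grows logarithmically with sequence length, which is where the log-linear total cost over all queries comes from. Because coarse scores are averages, the selection is approximate: a strong key hidden inside a weak block can be missed.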

[99] Plug to Place: Indoor Multimedia Geolocation from Electrical Sockets for Digital Investigation

Kanwal Aftab,Graham Adams,Mark Scanlon

Main category: cs.CV

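TL;DR: This paper proposes a three-stage deep learning pipeline that uses electrical socket types for indoor multimedia geolocation, with forensic value in combating crimes such as human trafficking and child exploitation.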

Details

Motivation: Indoor multimedia geolocation has great potential in digital forensics, but progress is held back by similar room layouts, variable lighting, unreliable GPS signals, and above all the lack of datasets for sensitive domains. Method: YOLOv11 detects sockets (mAP@0.5 = 0.843), an Xception model classifies them into 12 socket types (accuracy 0.912), and the socket types are then mapped to countries (accuracy 0.96 at a confidence threshold above 90%); two dedicated datasets were built, with augmentation. Result: The method was validated on the real-world Hotels-50K dataset, specifically the TraffickCam subset, which gives a more realistic assessment than evaluation on professional images. Conclusion: The framework is a practical step toward real-world digital forensic use, and the code, trained models, and data are released as open source. Abstract: Computer vision is a rapidly evolving field, giving rise to powerful new tools and techniques in digital forensic investigation, and shows great promise for novel digital forensic applications. One such application, indoor multimedia geolocation, has the potential to become a crucial aid for law enforcement in the fight against human trafficking, child exploitation, and other serious crimes. While outdoor multimedia geolocation has been widely explored, its indoor counterpart remains underdeveloped due to challenges such as similar room layouts, frequent renovations, visual ambiguity, indoor lighting variability, unreliable GPS signals, and limited datasets in sensitive domains. This paper introduces a pipeline that uses electric sockets as consistent indoor markers for geolocation, since plug socket types are standardised by country or region. The three-stage deep learning pipeline detects plug sockets (YOLOv11, mAP@0.5 = 0.843), classifies them into one of 12 plug socket types (Xception, accuracy = 0.912), and maps the detected socket types to countries (accuracy = 0.96 at >90% threshold confidence). To address data scarcity, two dedicated datasets were created: a socket detection dataset of 2,328 annotated images expanded to 4,072 through augmentation, and a classification dataset of 3,187 images across 12 plug socket classes. The pipeline was evaluated on the Hotels-50K dataset, focusing on the TraffickCam subset of crowd-sourced hotel images, which capture real-world conditions such as poor lighting and amateur angles. This dataset provides a more realistic evaluation than using professional, well-lit, often wide-angle images from travel websites. This framework demonstrates a practical step toward real-world digital forensic applications. The code, trained models, and the data for this paper are available open source.
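
The third stage of the pipeline, mapping classified socket types to countries once the classifier is confident enough, can be sketched as a lookup with a confidence gate. The table below is a small illustrative subset of well-known plug types, not the paper's 12 classes or its country mapping, and the voting scheme and the 0.9 threshold are assumptions made for this example.

```python
from collections import Counter

# Illustrative subset only; each plug type is used in more countries than listed.
SOCKET_TO_COUNTRIES = {
    "G": ("United Kingdom", "Ireland", "Malaysia", "Singapore"),
    "I": ("Australia", "New Zealand", "China", "Argentina"),
    "B": ("United States", "Canada", "Mexico", "Japan"),
}


def rank_countries(predictions, min_confidence=0.9):
    """Count country votes from socket predictions above the confidence gate.

    `predictions` is a list of (socket_type, confidence) pairs, one per
    detected socket in the image.
    """
    votes = Counter()
    for socket_type, confidence in predictions:
        if confidence > min_confidence:
            votes.update(SOCKET_TO_COUNTRIES.get(socket_type, ()))
    return votes


# One confident Type G detection and one low-confidence Type B detection.
votes = rank_countries([("G", 0.97), ("B", 0.50)])
```

Since one socket type usually maps to several countries, the output is a candidate set that narrows an investigation rather than a single location.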

[100] DeContext as Defense: Safe Image Editing in Diffusion Transformers

Linghui Shen,Mingyue Cui,Xingyi Yang

Main category: cs.CV

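TL;DR: This paper proposes DeContext, a defense against unauthorized image editing in in-context diffusion models that perturbs the multimodal attention pathways, breaking the link between input and output while preserving image quality.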

Details

Motivation: In-context diffusion models make it easy to modify images, which creates risks of privacy leakage and malicious forgery; an effective mechanism is needed to protect personal images from manipulation without consent. Method: DeContext injects small, targeted perturbations at key denoising steps and in specific Transformer blocks to weaken the cross-attention pathways and block the propagation of contextual information. Result: Experiments on Flux Kontext and Step1X-Edit show that DeContext consistently blocks unauthorized image edits while preserving the visual quality of the output. Conclusion: Attention-based perturbation is an efficient and robust defense for protecting images against manipulation by large in-context diffusion models. Abstract: In-context diffusion models allow users to modify images with remarkable ease and realism. However, the same power raises serious privacy concerns: personal images can be easily manipulated for identity impersonation, misinformation, or other malicious uses, all without the owner's consent. While prior work has explored input perturbations to protect against misuse in personalized text-to-image generation, the robustness of modern, large-scale in-context DiT-based models remains largely unexamined. In this paper, we propose DeContext, a new method to safeguard input images from unauthorized in-context editing. Our key insight is that contextual information from the source image propagates to the output primarily through multimodal attention layers. By injecting small, targeted perturbations that weaken these cross-attention pathways, DeContext breaks this flow, effectively decoupling the link between input and output. This simple defense is both efficient and robust. We further show that early denoising steps and specific transformer blocks dominate context propagation, which allows us to concentrate perturbations where they matter most. Experiments on Flux Kontext and Step1X-Edit show that DeContext consistently blocks unwanted image edits while preserving visual quality. These results highlight the effectiveness of attention-based perturbations as a powerful defense against image manipulation.

[101] SARMAE: Masked Autoencoder for SAR Representation Learning

Danxu Liu,Di Wang,Hebaixu Wang,Haoyang Chen,Wentao Jiang,Yilin Cheng,Haonan Guo,Wei Cui,Jing Zhang

Main category: cs.CV

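TL;DR: Proposes SARMAE, a noise-aware masked autoencoder for self-supervised representation learning on synthetic aperture radar (SAR) imagery, using the large-scale SAR-1M dataset and optical image priors to improve performance.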

Details

Motivation: Existing deep learning methods for SAR imagery are constrained by data scarcity and inherent speckle noise, which hampers fine-grained semantic representation learning. Method: The authors build the million-scale SAR dataset SAR-1M with paired optical images; design Speckle-Aware Representation Enhancement (SARE), which injects SAR-specific noise for robust learning; and propose Semantic Anchor Representation Constraint (SARC), which uses optical priors to align features and ensure semantic consistency. Result: Experiments on multiple SAR datasets show that SARMAE achieves state-of-the-art performance on classification, detection, and segmentation. Conclusion: Through noise-aware, optically guided self-supervised learning, SARMAE improves SAR image representations and provides a strong pre-trained model for downstream applications. Abstract: Synthetic Aperture Radar (SAR) imagery plays a critical role in all-weather, day-and-night remote sensing applications. However, existing SAR-oriented deep learning is constrained by data scarcity, while the physically grounded speckle noise in SAR imagery further hampers fine-grained semantic representation learning. To address these challenges, we propose SARMAE, a Noise-Aware Masked Autoencoder for self-supervised SAR representation learning. Specifically, we construct SAR-1M, the first million-scale SAR dataset, with additional paired optical images, to enable large-scale pre-training. Building upon this, we design Speckle-Aware Representation Enhancement (SARE), which injects SAR-specific speckle noise into masked autoencoders to facilitate noise-aware and robust representation learning. Furthermore, we introduce Semantic Anchor Representation Constraint (SARC), which leverages paired optical priors to align SAR features and ensure semantic consistency. Extensive experiments across multiple SAR datasets demonstrate that SARMAE achieves state-of-the-art performance on classification, detection, and segmentation tasks. Code and models will be available at https://github.com/MiliLab/SARMAE.
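
Speckle injection of the kind SARE relies on can be illustrated with the textbook multiplicative model for multi-look SAR intensity images, in which each pixel is multiplied by unit-mean gamma noise whose shape equals the number of looks. Whether SARE uses exactly this model is not stated in the abstract, so the model choice, the `looks` parameter, and the function name are assumptions for this sketch.

```python
import random


def add_speckle(intensity, looks=4, rng=None):
    """Multiply each pixel by gamma-distributed speckle with mean 1.

    Gamma(shape=looks, scale=1/looks) has mean 1 and variance 1/looks, so a
    higher number of looks gives milder speckle.
    """
    rng = rng or random.Random()
    return [[pixel * rng.gammavariate(looks, 1.0 / looks) for pixel in row]
            for row in intensity]


# A flat 50x50 image makes the statistics of the injected noise easy to inspect.
clean = [[1.0] * 50 for _ in range(50)]
noisy = add_speckle(clean, looks=4, rng=random.Random(0))
mean_value = sum(map(sum, noisy)) / 2500.0
```

Because the noise is multiplicative and has unit mean, the average brightness is preserved while bright regions become proportionally noisier, which is the behavior that distinguishes speckle from additive Gaussian noise.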

[102] REGLUE Your Latents with Global and Local Semantics for Entangled Diffusion

Giorgos Petsangourakis,Christos Sgouropoulos,Bill Psomas,Theodoros Giannakopoulos,Giorgos Sfikas,Ioannis Kakogeorgiou

Main category: cs.CV

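TL;DR: Proposes REGLUE, a unified latent diffusion framework that jointly models VAE latents, local Vision Foundation Model (VFM) semantics, and a global [CLS] token to improve image generation quality and training efficiency.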

Details

Motivation: Existing latent diffusion models lack direct semantic supervision and under-use the rich, multi-layer, nonlinear semantics that VFMs provide, which limits sample quality and convergence speed. Method: Introduces REGLUE, in which a lightweight convolutional semantic compressor nonlinearly aggregates multi-layer VFM features and entangles them with VAE latents inside a SiT backbone; an external alignment loss additionally regularizes the internal representations. Result: On ImageNet 256x256, REGLUE outperforms the SiT baselines as well as REPA, ReDi, and REG in both FID and convergence speed. Conclusion: Spatial VFM semantics, nonlinear compression, global tokens, and external alignment are the key factors for improving latent diffusion models, and REGLUE offers an efficient, unified solution for semantically enhanced image generation. Abstract: Latent diffusion models (LDMs) achieve state-of-the-art image synthesis, yet their reconstruction-style denoising objective provides only indirect semantic supervision: high-level semantics emerge slowly, requiring longer training and limiting sample quality. Recent works inject semantics from Vision Foundation Models (VFMs) either externally via representation alignment or internally by jointly modeling only a narrow slice of VFM features inside the diffusion process, under-utilizing the rich, nonlinear, multi-layer spatial semantics available. We introduce REGLUE (Representation Entanglement with Global-Local Unified Encoding), a unified latent diffusion framework that jointly models (i) VAE image latents, (ii) compact local (patch-level) VFM semantics, and (iii) a global (image-level) [CLS] token within a single SiT backbone. A lightweight convolutional semantic compressor nonlinearly aggregates multi-layer VFM features into a low-dimensional, spatially structured representation, which is entangled with the VAE latents in the diffusion process. An external alignment loss further regularizes internal representations toward frozen VFM targets. On ImageNet 256x256, REGLUE consistently improves FID and accelerates convergence over SiT-B/2 and SiT-XL/2 baselines, as well as over REPA, ReDi, and REG. Extensive experiments show that (a) spatial VFM semantics are crucial, (b) nonlinear compression is key to unlocking their full benefit, and (c) global tokens and external alignment act as complementary, lightweight enhancements within our global-local-latent joint modeling framework. The code is available at https://github.com/giorgospets/reglue .

[103] FrameDiffuser: G-Buffer-Conditioned Diffusion for Neural Forward Frame Rendering

Ole Beisswenger,Jan-Niklas Dihlmann,Hendrik P. A. Lensch

Main category: cs.CV

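TL;DR: Proposes FrameDiffuser, an autoregressive diffusion-based neural rendering framework that conditions on G-buffer data and its own previous output frame to generate high-quality, temporally consistent images in real time.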

Details

Motivation: Existing diffusion models for interactive applications are either temporally inconsistent or too computationally expensive, so they cannot meet real-time and continuity requirements. Method: Uses a dual-conditioning architecture (ControlNet + ControlLoRA), with the G-buffer as structural guidance and the previous output frame as temporal guidance; a three-stage training strategy enables stable autoregressive generation, and the model is specialized to individual environments. Result: Within a given environment, the method achieves ray-tracing-level quality that surpasses general-purpose approaches, including accurate lighting, shadows, and reflections, and it stably generates hundreds to thousands of frames while remaining temporally consistent. Conclusion: Through environment-specific training and an autoregressive design, FrameDiffuser delivers high-quality, long-horizon stable neural rendering while preserving inference speed, making it suitable for interactive applications. Abstract: Neural rendering for interactive applications requires translating geometric and material properties (G-buffer) to photorealistic images with realistic lighting on a frame-by-frame basis. While recent diffusion-based approaches show promise for G-buffer-conditioned image synthesis, they face critical limitations: single-image models like RGBX generate frames independently without temporal consistency, while video models like DiffusionRenderer are too computationally expensive for most consumer gaming setups and require complete sequences upfront, making them unsuitable for interactive applications where future frames depend on user input. We introduce FrameDiffuser, an autoregressive neural rendering framework that generates temporally consistent, photorealistic frames by conditioning on G-buffer data and the model's own previous output. After an initial frame, FrameDiffuser operates purely on incoming G-buffer data, comprising geometry, materials, and surface properties, while using its previously generated frame for temporal guidance, maintaining stable, temporally consistent generation over hundreds to thousands of frames. Our dual-conditioning architecture combines ControlNet for structural guidance with ControlLoRA for temporal coherence. A three-stage training strategy enables stable autoregressive generation. We specialize our model to individual environments, prioritizing consistency and inference speed over broad generalization, demonstrating that environment-specific training achieves superior photorealistic quality with accurate lighting, shadows, and reflections compared to generalized approaches.

[104] Few-Shot Fingerprinting Subject Re-Identification in 3D-MRI and 2D-X-Ray

Gonçalo Gaspar Alves,Shekoufeh Gorgi Zadeh,Andreas Husch,Ben Bausch

Main category: cs.CV

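TL;DR: Proposes a subject fingerprinting method based on ResNet-50 and triplet loss to detect duplicate subjects across open-source medical imaging datasets and thereby mitigate data leakage. The method achieves near-perfect subject re-identification on several X-ray and MRI datasets.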

Details

Motivation: The same subject may appear in several open-source medical datasets, which causes data leakage during training and inflated performance estimates, so duplicate subjects need to be identified and handled. Method: Trains a ResNet-50 with triplet margin loss so that all images of a subject map to a distinct region of latent space, then performs fingerprinting and re-identification by similarity matching; evaluation uses few-shot settings (for example, 20-way 1-shot). Result: Achieves Mean Recall@K of 99.10% (20-way 1-shot) and 90.06% (500-way 5-shot) on ChestXray-14, and 99.20% (20-way 1-shot) and 98.86% (100-way 3-shot) on BraTS-2021. Conclusion: Subject fingerprinting identifies duplicate subjects across datasets with high accuracy and can help prevent data leakage in medical image analysis. Abstract: Combining open-source datasets can introduce data leakage if the same subject appears in multiple sets, leading to inflated model performance. To address this, we explore subject fingerprinting, mapping all images of a subject to a distinct region in latent space, to enable subject re-identification via similarity matching. Using a ResNet-50 trained with triplet margin loss, we evaluate few-shot fingerprinting on 3D MRI and 2D X-ray data in both standard (20-way 1-shot) and challenging (1000-way 1-shot) scenarios. The model achieves high Mean Recall@K scores: 99.10% (20-way 1-shot) and 90.06% (500-way 5-shot) on ChestXray-14; 99.20% (20-way 1-shot) and 98.86% (100-way 3-shot) on BraTS-2021.
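
The two ingredients named in this abstract, triplet margin loss and recall at K over a gallery, can be sketched on toy 2D embeddings. This is an illustration only: the Euclidean distance, the margin of 1.0, and the helper names are assumptions, and the embeddings stand in for ResNet-50 outputs.

```python
import math

def l2(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Zero once the negative is at least `margin` farther away than the positive."""
    return max(l2(anchor, positive) - l2(anchor, negative) + margin, 0.0)

def recall_at_k(query, query_subject, gallery, k=1):
    """gallery: list of (embedding, subject_id); 1.0 if the true subject is in the top k."""
    ranked = sorted(gallery, key=lambda item: l2(query, item[0]))
    return 1.0 if any(sid == query_subject for _, sid in ranked[:k]) else 0.0

# Same subject close together, different subject far away: no loss.
satisfied = triplet_margin_loss((0.0, 0.0), (0.1, 0.0), (2.0, 0.0))
# Negative too close: the loss is positive and would push it away.
violated = triplet_margin_loss((0.0, 0.0), (0.1, 0.0), (0.5, 0.0))

# A 3-way 1-shot episode: one gallery embedding per subject.
gallery = [((0.0, 0.1), "s1"), ((3.0, 3.0), "s2"), ((-2.0, 4.0), "s3")]
hit = recall_at_k((0.2, 0.0), "s1", gallery, k=1)
```

In an N-way K-shot episode the gallery would hold K embeddings for each of N subjects, and the per-query scores would be averaged to obtain the mean recall.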

[105] Detecting Localized Deepfakes: How Well Do Synthetic Image Detectors Handle Inpainting?

Serafino Pandolfini,Lorenzo Pellegrini,Matteo Ferrara,Davide Maltoni

Main category: cs.CV

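TL;DR: Systematically evaluates how state-of-the-art detectors built for fully synthetic images perform on localized inpainting, and finds that large models trained on diverse generators detect medium- and large-area or regeneration-style inpainting reasonably well.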

Details

Motivation: As generative AI advances, localized image editing is increasingly used in security threat scenarios, yet how well existing detectors generalize to such localized manipulations is unclear. Method: Runs a cross-task evaluation of existing deepfake detectors on multiple datasets covering different generators, mask sizes, and inpainting techniques. Result: Models trained on a large set of generators transfer partially to inpainting detection, perform well on medium- and large-area edits and regeneration-style inpainting, and outperform many existing ad hoc detection approaches. Conclusion: Current deepfake detectors generalize to localized inpainting to a degree, particularly for larger edited regions or specific types of manipulation. Abstract: The rapid progress of generative AI has enabled highly realistic image manipulations, including inpainting and region-level editing. These approaches preserve most of the original visual context and are increasingly exploited in cybersecurity-relevant threat scenarios. While numerous detectors have been proposed for identifying fully synthetic images, their ability to generalize to localized manipulations remains insufficiently characterized. This work presents a systematic evaluation of state-of-the-art detectors, originally trained for deepfake detection on fully synthetic images, when applied to a distinct challenge: localized inpainting detection. The study leverages multiple datasets spanning diverse generators, mask sizes, and inpainting techniques. Our experiments show that models trained on a large set of generators exhibit partial transferability to inpainting-based edits and can reliably detect medium- and large-area manipulations or regeneration-style inpainting, outperforming many existing ad hoc detection approaches.

[106] SDFoam: Signed-Distance Foam for explicit surface reconstruction

Antonella Rech,Nicola Conci,Nicola Garau

Main category: cs.CV

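TL;DR: Proposes SDFoam, which combines an explicit Voronoi diagram with an implicit signed distance field (SDF) to substantially improve mesh reconstruction accuracy and surface quality for neural radiance fields while retaining the rendering efficiency of RadiantFoam.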

Details

Motivation: Methods such as NeRF and 3DGS perform well at view synthesis but fall short on precise mesh reconstruction, producing blurry surfaces, floaters, and topological errors. RadiantFoam speeds up rendering but does not address geometric accuracy. This work aims to close the gap between efficient rendering and high-fidelity geometry. Method: Proposes SDFoam, which jointly optimizes an explicit Voronoi-diagram radiance representation and an implicit SDF. The scene is optimized via ray tracing with an Eikonal regularizer, and the SDF isosurfaces bias Voronoi cell faces to align with the zero level set for geometric consistency. Result: Across diverse scenes, SDFoam markedly improves the Chamfer distance of reconstructed meshes, reduces floaters, and improves topology, while keeping PSNR, SSIM, and training speed comparable to RadiantFoam. Conclusion: A hybrid representation that pairs an explicit Voronoi structure with an implicit SDF improves the geometric quality of neural radiance fields without sacrificing rendering efficiency, offering a new direction for high-quality view synthesis and 3D reconstruction. Abstract: Neural radiance fields (NeRF) have driven impressive progress in view synthesis by using ray-traced volumetric rendering. Splatting-based methods such as 3D Gaussian Splatting (3DGS) provide faster rendering by rasterizing 3D primitives. RadiantFoam (RF) brought ray tracing back, achieving throughput comparable to Gaussian Splatting by organizing radiance with an explicit Voronoi Diagram (VD). Yet, all the mentioned methods still struggle with precise mesh reconstruction. We address this gap by jointly learning an explicit VD with an implicit Signed Distance Field (SDF). The scene is optimized via ray tracing and regularized by an Eikonal objective. The SDF introduces metric-consistent isosurfaces, which, in turn, bias near-surface Voronoi cell faces to align with the zero level set. The resulting model produces crisper, view-consistent surfaces with fewer floaters and improved topology, while preserving photometric quality and maintaining training speed on par with RadiantFoam. Across diverse scenes, our hybrid implicit-explicit formulation, which we name SDFoam, substantially improves mesh reconstruction accuracy (Chamfer distance) with comparable appearance (PSNR, SSIM), without sacrificing efficiency.
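
The Eikonal objective mentioned above penalizes deviation of the SDF gradient norm from 1, which is what makes the field a true (metric-consistent) distance. The sketch below evaluates that penalty with finite differences on analytic fields; in SDFoam the field is learned and gradients would come from automatic differentiation, so the function names and sample points here are purely illustrative.

```python
import math

def grad_norm(f, p, h=1e-4):
    """Norm of the central finite-difference gradient of scalar field f at 3D point p."""
    grads = []
    for i in range(3):
        hi, lo = list(p), list(p)
        hi[i] += h
        lo[i] -= h
        grads.append((f(hi) - f(lo)) / (2 * h))
    return math.sqrt(sum(g * g for g in grads))

def eikonal_loss(f, points):
    """Mean of (|grad f| - 1)^2 over the sample points."""
    return sum((grad_norm(f, p) - 1.0) ** 2 for p in points) / len(points)

def sphere_sdf(p):
    """Exact signed distance to a unit sphere: gradient norm is 1 away from the origin."""
    return math.sqrt(sum(c * c for c in p)) - 1.0

def scaled_field(p):
    """Same zero level set, but not a distance field: gradient norm is 2."""
    return 2.0 * sphere_sdf(p)

points = [(0.5, 0.2, -0.3), (1.5, 0.0, 0.7), (-0.4, 0.9, 0.1)]
loss_true_sdf = eikonal_loss(sphere_sdf, points)   # close to 0
loss_scaled = eikonal_loss(scaled_field, points)   # close to (2 - 1)^2 = 1
```

Both fields describe the same surface, yet only the first satisfies the Eikonal constraint, which is why the regularizer matters when isosurfaces other than the zero level set are used to guide geometry.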

[107] A multi-centre, multi-device benchmark dataset for landmark-based comprehensive fetal biometry

Chiara Di Vece,Zhehua Mao,Netanell Avisdris,Brian Dromey,Raffaele Napolitano,Dafna Ben Bashat,Francisco Vasconcelos,Danail Stoyanov,Leo Joskowicz,Sophia Bano

Main category: cs.CV

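TL;DR: Presents an open, multi-centre, multi-device benchmark dataset of fetal ultrasound images with expert anatomical landmark annotations for clinically used fetal biometric measurements. It contains 4,513 de-identified ultrasound images from three clinical centres and seven devices, covers all primary fetal biometry measures, and ships with standardised train/test splits, evaluation code, and baseline results for fair comparison. Using an automatic biometry model, the study quantifies domain shift and shows that single-centre training and testing overestimate performance, which underlines the importance of multi-centre validation. It is the first public multi-centre, multi-device annotated dataset covering all primary fetal biometry measures, and it aims to support reliable AI-assisted fetal growth assessment across centres.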

Details

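Motivation: Manually annotating anatomical landmarks in ultrasound images is time-consuming and operator-dependent, and it varies across devices and centres, which limits the reproducibility of automated fetal growth assessment. The lack of multi-source annotated datasets is a major bottleneck for AI-assisted methods, so a public, diverse, standard dataset is needed to drive generalisation in real multi-centre settings. Method: Collects 4,513 fetal ultrasound images from three clinical centres and seven ultrasound devices, with expert annotations of key anatomical landmarks covering the primary biometric measures: bi-parietal and occipito-frontal diameters, abdominal transverse and antero-posterior diameters, and femoral length. Provides standardised, subject-disjoint train/test splits, evaluation code, and baseline results for fair comparison, and uses an automatic biometry model to quantify domain shift between centres. Result: Builds and releases the first multi-centre, multi-device, landmark-annotated ultrasound dataset covering all primary fetal biometry measures. Experiments show that training and evaluating within a single centre substantially overestimates performance, whereas cross-centre testing better reflects real generalisation, confirming both the presence of domain shift and the need for multi-centre evaluation. Complete data, annotations, training code, and evaluation pipelines are provided. Conclusion: The dataset offers a strong public benchmark for fetal biometry, supports research on domain adaptation and multi-centre generalisation, and helps develop more reliable AI-assisted fetal growth assessment tools that can be deployed across clinical environments, advancing standardisation and reproducibility in the field. Abstract: Accurate fetal growth assessment from ultrasound (US) relies on precise biometry measured by manually identifying anatomical landmarks in standard planes. Manual landmarking is time-consuming, operator-dependent, and sensitive to variability across scanners and sites, limiting the reproducibility of automated approaches. There is a need for multi-source annotated datasets to develop artificial intelligence-assisted fetal growth assessment methods. To address this bottleneck, we present an open, multi-centre, multi-device benchmark dataset of fetal US images with expert anatomical landmark annotations for clinically used fetal biometric measurements. These measurements include head bi-parietal and occipito-frontal diameters, abdominal transverse and antero-posterior diameters, and femoral length. The dataset contains 4,513 de-identified US images from 1,904 subjects acquired at three clinical sites using seven different US devices. We provide standardised, subject-disjoint train/test splits, evaluation code, and baseline results to enable fair and reproducible comparison of methods. Using an automatic biometry model, we quantify domain shift and demonstrate that training and evaluation confined to a single centre substantially overestimate performance relative to multi-centre testing.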
To the best of our knowledge, this is the first publicly available multi-centre, multi-device, landmark-annotated dataset that covers all primary fetal biometry measures, providing a robust benchmark for domain adaptation and multi-centre generalisation in fetal biometry and enabling more reliable AI-assisted fetal growth assessment across centres. All data, annotations, training code, and evaluation pipelines are made publicly available.

[108] OMG-Bench: A New Challenging Benchmark for Skeleton-based Online Micro Hand Gesture Recognition

Haochen Chang,Pengfei Ren,Buyuan Zhang,Da Li,Tianhao Han,Haoyang Zhang,Liang Xie,Hongbo Chen,Erwei Yin

Main category: cs.CV

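TL;DR: Proposes a multi-view self-supervised pipeline for online micro gesture recognition, releases OMG-Bench, the first large-scale public benchmark for the task, and designs the HMATr model to unify gesture detection and classification.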

Details

Motivation: Skeleton-based online micro gesture recognition is held back by the lack of public micro gesture datasets and general-purpose algorithms, and subtle motions are especially hard to annotate precisely. Method: Builds a multi-view self-supervised pipeline that generates skeleton data, combined with heuristic rules and expert refinement for semi-automatic annotation; proposes HMATr, which uses hierarchical memory banks and learnable position-aware queries to capture temporal context and semantic information. Result: OMG-Bench contains 40 fine-grained gesture classes and 13,948 instances; HMATr exceeds the previous best methods by 7.6% in detection rate. Conclusion: The pipeline produces high-quality micro gesture data, OMG-Bench provides an important benchmark for the field, and HMATr substantially improves online micro gesture recognition. Abstract: Online micro gesture recognition from hand skeletons is critical for VR/AR interaction but faces challenges due to limited public datasets and task-specific algorithms. Micro gestures involve subtle motion patterns, which make constructing datasets with precise skeletons and frame-level annotations difficult. To this end, we develop a multi-view self-supervised pipeline to automatically generate skeleton data, complemented by heuristic rules and expert refinement for semi-automatic annotation. Based on this pipeline, we introduce OMG-Bench, the first large-scale public benchmark for skeleton-based online micro gesture recognition. It features 40 fine-grained gesture classes with 13,948 instances across 1,272 sequences, characterized by subtle motions, rapid dynamics, and continuous execution. To tackle these challenges, we propose Hierarchical Memory-Augmented Transformer (HMATr), an end-to-end framework that unifies gesture detection and classification by leveraging hierarchical memory banks which store frame-level details and window-level semantics to preserve historical context. In addition, it employs learnable position-aware queries initialized from the memory to implicitly encode gesture positions and semantics. Experiments show that HMATr outperforms state-of-the-art methods by 7.6% in detection rate, establishing a strong baseline for online micro gesture recognition. Project page: https://omg-bench.github.io/

[109] Task-Oriented Data Synthesis and Control-Rectify Sampling for Remote Sensing Semantic Segmentation

Yunkai Yang,Yudong Zhang,Kunquan Zhang,Jinxiao Zhang,Xinying Chen,Haohuan Fu,Runmin Dong

Main category: cs.CV

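TL;DR: Proposes TODSynth, a task-oriented data synthesis framework for remote sensing semantic segmentation that combines a multimodal diffusion transformer with a task-feedback-driven sampling strategy to improve synthetic data quality and downstream performance.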

Details

Motivation: Existing controllable generation methods for remote sensing semantic segmentation struggle with complex semantic mask control and unstable sampling quality, which limits the usefulness of synthetic data. Method: Proposes TODSynth, comprising a Multimodal Diffusion Transformer (MM-DiT) with unified triple attention and a plug-and-play sampling strategy guided by task feedback; introduces control-rectify flow matching (CRFM), which dynamically adjusts sampling directions according to a semantic loss during the early stage of generation. Result: Experiments show that the method clearly outperforms state-of-the-art approaches in few-shot and complex-scene settings and produces data that is more stable and better matched to the downstream segmentation task. Conclusion: By pairing a joint attention mechanism with task-oriented sampling, TODSynth improves the usability and effectiveness of synthetic data for remote sensing semantic segmentation. Abstract: With the rapid progress of controllable generation, training data synthesis has become a promising way to expand labeled datasets and alleviate manual annotation in remote sensing (RS). However, the complexity of semantic mask control and the uncertainty of sampling quality often limit the utility of synthetic data in downstream semantic segmentation tasks. To address these challenges, we propose a task-oriented data synthesis framework (TODSynth), including a Multimodal Diffusion Transformer (MM-DiT) with unified triple attention and a plug-and-play sampling strategy guided by task feedback. Built upon the powerful DiT-based generative foundation model, we systematically evaluate different control schemes, showing that a text-image-mask joint attention scheme combined with full fine-tuning of the image and mask branches significantly enhances the effectiveness of RS semantic segmentation data synthesis, particularly in few-shot and complex-scene scenarios. Furthermore, we propose a control-rectify flow matching (CRFM) method, which dynamically adjusts sampling directions guided by semantic loss during the early high-plasticity stage, mitigating the instability of generated images and bridging the gap between synthetic data and downstream segmentation tasks. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art controllable generation methods, producing more stable and task-oriented synthetic data for RS semantic segmentation.

[110] TreeNet: A Light Weight Model for Low Bitrate Image Compression

Mahadev Prasad Panda,Purnachandra Rao Makkena,Srivatsa Prativadibhayankaram,Siegfried Fößel,André Kaup

Main category: cs.CV

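TL;DR: Proposes TreeNet, a low-complexity image compression model with a binary tree-structured encoder-decoder and attentional feature fusion, which improves compression performance while reducing computational complexity.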

Details

Motivation: Reduce the computational complexity of learning-based image compression so that it can be adopted more widely. Method: Proposes TreeNet, which uses a binary tree-structured encoder-decoder and an attentional feature fusion mechanism to integrate features from multiple branches. Result: Evaluated on three widely used datasets, TreeNet achieves an average BD-rate improvement of 4.83% over JPEG AI at low bitrates while reducing model complexity by 87.82%. Conclusion: TreeNet delivers better compression performance at much lower computational complexity, and ablation studies show how different latent representations affect reconstruction. Abstract: Reducing computational complexity remains a critical challenge for the widespread adoption of learning-based image compression techniques. In this work, we propose TreeNet, a novel low-complexity image compression model that leverages a binary tree-structured encoder-decoder architecture to achieve efficient representation and reconstruction. We employ an attentional feature fusion mechanism to effectively integrate features from multiple branches. We evaluate TreeNet on three widely used benchmark datasets and compare its performance against competing methods including JPEG AI, a recent standard in learning-based image compression. At low bitrates, TreeNet achieves an average improvement of 4.83% in BD-rate over JPEG AI, while reducing model complexity by 87.82%. Furthermore, we conduct extensive ablation studies to investigate the influence of various latent representations within TreeNet, offering deeper insights into the factors contributing to reconstruction.

[111] Make-It-Poseable: Feed-forward Latent Posing Model for 3D Humanoid Character Animation

Zhiyang Guo,Ori Zhang,Jax Xiang,Alan Zhao,Wengang Zhou,Houqiang Li

Main category: cs.CV

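TL;DR: Proposes Make-It-Poseable, a framework that recasts 3D character posing as a latent-space transformation task; by directly manipulating the character's latent representation it produces high-fidelity, topology-adaptive poses and supports several 3D editing applications.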

Details

Motivation: Existing 3D character posing methods fall short in skinning weight prediction, topological integrity, and pose conformance, which limits their robustness and generalizability; this work aims to address these problems. Method: Proposes the feed-forward framework Make-It-Poseable, which reconstructs the character from a latent-space representation; its core is a latent posing transformer that manipulates shape tokens based on skeletal motion, with a dense pose representation for precise control; a latent-space supervision strategy and an adaptive completion module preserve geometric fidelity and handle topological changes. Result: The method shows superior posing quality, preserves geometric detail accurately, handles complex topology, and extends well to 3D editing tasks such as part replacement and refinement. Conclusion: Through latent-space transformation, Make-It-Poseable achieves robust and general 3D character posing, overcomes the limitations of traditional methods, and performs well in pose accuracy, topological adaptability, and editing capability. Abstract: Posing 3D characters is a fundamental task in computer graphics and vision. However, existing methods like auto-rigging and pose-conditioned generation often struggle with challenges such as inaccurate skinning weight prediction, topological imperfections, and poor pose conformance, limiting their robustness and generalizability. To overcome these limitations, we introduce Make-It-Poseable, a novel feed-forward framework that reformulates character posing as a latent-space transformation problem. Instead of deforming mesh vertices as in traditional pipelines, our method reconstructs the character in new poses by directly manipulating its latent representation. At the core of our method is a latent posing transformer that manipulates shape tokens based on skeletal motion. This process is facilitated by a dense pose representation for precise control. To ensure high-fidelity geometry and accommodate topological changes, we also introduce a latent-space supervision strategy and an adaptive completion module. Our method demonstrates superior performance in posing quality. It also naturally extends to 3D editing applications like part replacement and refinement.

[112] FlowDet: Unifying Object Detection and Generative Transport Flows

Enis Baty,C. P. Bridges,Simon Hadfield

Main category: cs.CV

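TL;DR: Proposes FlowDet, the first application of modern conditional flow matching to object detection; it outperforms diffusion-based methods across experimental settings, with notable gains in AP and rare-class detection under recall-constrained conditions.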

Details

Motivation: Inspired by DiffusionDet, the authors recast object detection as a broader generative transport problem and replace the complex stochastic transport paths of diffusion to obtain faster scaling with inference steps and better detection performance. Method: Uses Conditional Flow Matching to learn straight transport paths from an initial distribution to the distribution of target bounding boxes, which simplifies generation and allows the number of inference steps and boxes to be changed without re-training. Result: FlowDet improves over DiffusionDet by up to +3.6% AP on COCO and +4.2% AP$_{rare}$ on LVIS, and outperforms diffusion models and non-generative baselines across multiple backbones and datasets. Conclusion: By modeling object detection as a conditional flow matching problem, FlowDet obtains a more efficient, more direct generative path and improves detection performance while remaining flexible, especially in recall-constrained and rare-class scenarios. Abstract: We present FlowDet, the first formulation of object detection using modern Conditional Flow Matching techniques. This work follows from DiffusionDet, which originally framed detection as a generative denoising problem in the bounding box space via diffusion. We revisit and generalise this formulation to a broader class of generative transport problems, while maintaining the ability to vary the number of boxes and inference steps without re-training. In contrast to the curved stochastic transport paths induced by diffusion, FlowDet learns simpler and straighter paths resulting in faster scaling of detection performance as the number of inference steps grows. We find that this reformulation enables us to outperform diffusion-based detection systems (as well as non-generative baselines) across a wide range of experiments, including various precision/recall operating points using multiple feature backbones and datasets. In particular, when evaluating under recall-constrained settings, we can highlight the effects of the generative transport without over-compensating with large numbers of proposals. This provides gains of up to +3.6% AP and +4.2% AP$_{rare}$ over DiffusionDet on the COCO and LVIS datasets, respectively.
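
The "straighter paths" idea can be made concrete with a small sketch. It assumes the common linear interpolant used in conditional flow matching, x_t = (1 - t) x_0 + t x_1, whose regression target is the constant velocity x_1 - x_0; the abstract does not state FlowDet's exact path, and the box values and the oracle velocity function below are hypothetical stand-ins for a trained network.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from noise box x0 to target box x1 at time t in [0, 1]."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Regression target for the velocity network under the linear path (constant in t)."""
    return [b - a for a, b in zip(x0, x1)]

def euler_integrate(x0, velocity_fn, steps):
    """Transport x0 from t=0 to t=1 with `steps` Euler updates of size 1/steps."""
    x, dt = list(x0), 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Boxes as (cx, cy, w, h) in normalized image coordinates.
noise_box = [0.9, 0.1, 0.5, 0.5]
gt_box = [0.4, 0.6, 0.2, 0.3]

def oracle_velocity(x, t):
    """Stand-in for a perfectly trained network: returns the true constant velocity."""
    return target_velocity(noise_box, gt_box)

one_step = euler_integrate(noise_box, oracle_velocity, steps=1)
ten_steps = euler_integrate(noise_box, oracle_velocity, steps=10)
```

With a perfectly straight path, a single Euler step already lands on the target box, so extra steps only matter to the extent that the learned field is imperfect. This is one way to read the abstract's claim about faster scaling with the number of inference steps.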

[113] Kling-Omni Technical Report

Kling Team,Jialu Chen,Yuanzheng Ci,Xiangyu Du,Zipeng Feng,Kun Gai,Sainan Guo,Feng Han,Jingbin He,Kang He,Xiao Hu,Xiaohua Hu,Boyuan Jiang,Fangyuan Kong,Hang Li,Jie Li,Qingyu Li,Shen Li,Xiaohan Li,Yan Li,Jiajun Liang,Borui Liao,Yiqiao Liao,Weihong Lin,Quande Liu,Xiaokun Liu,Yilun Liu,Yuliang Liu,Shun Lu,Hangyu Mao,Yunyao Mao,Haodong Ouyang,Wenyu Qin,Wanqi Shi,Xiaoyu Shi,Lianghao Su,Haozhi Sun,Peiqin Sun,Pengfei Wan,Chao Wang,Chenyu Wang,Meng Wang,Qiulin Wang,Runqi Wang,Xintao Wang,Xuebo Wang,Zekun Wang,Min Wei,Tiancheng Wen,Guohao Wu,Xiaoshi Wu,Zhenhua Wu,Da Xie,Yingtong Xiong,Yulong Xu,Sile Yang,Zikang Yang,Weicai Ye,Ziyang Yuan,Shenglong Zhang,Shuaiyu Zhang,Yuanxing Zhang,Yufan Zhang,Wenzheng Zhao,Ruiliang Zhou,Yan Zhou,Guosheng Zhu,Yongjie Zhu

Main category: cs.CV

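TL;DR: Kling-Omni is an end-to-end generalist generative framework that synthesizes high-fidelity video directly from multimodal visual-language inputs, integrating video generation, editing, and intelligent reasoning tasks.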

Details

Motivation: Existing video generation systems usually separate functions into disjointed pipelines, which makes unified, high-quality, intelligent content creation difficult; a framework is needed that handles multimodal inputs uniformly and supports complex reasoning and generation. Method: Proposes Kling-Omni, an end-to-end framework that encodes multimodal inputs such as text instructions, reference images, and video contexts into a unified representation, supported by large-scale pre-training strategies and inference infrastructure optimizations. Result: The model performs strongly on in-context generation, reasoning-based editing, and multimodal instruction following, and generates cinematic-quality video. Conclusion: Kling-Omni is more than a video creation tool; it is a step toward multimodal world simulators that can perceive, reason, generate, and interact. Abstract: We present Kling-Omni, a generalist generative framework designed to synthesize high-fidelity videos directly from multimodal visual language inputs. Adopting an end-to-end perspective, Kling-Omni bridges the functional separation among diverse video generation, editing, and intelligent reasoning tasks, integrating them into a holistic system. Unlike disjointed pipeline approaches, Kling-Omni supports a diverse range of user inputs, including text instructions, reference images, and video contexts, processing them into a unified multimodal representation to deliver cinematic-quality and highly intelligent video content creation. To support these capabilities, we constructed a comprehensive data system that serves as the foundation for multimodal video creation. The framework is further empowered by efficient large-scale pre-training strategies and infrastructure optimizations for inference. Comprehensive evaluations reveal that Kling-Omni demonstrates exceptional capabilities in in-context generation, reasoning-based editing, and multimodal instruction following. Moving beyond a content creation tool, we believe Kling-Omni is a pivotal advancement toward multimodal world simulators capable of perceiving, reasoning, generating, and interacting with dynamic and complex worlds.

[114] R3ST: A Synthetic 3D Dataset With Realistic Trajectories

Simone Teglia,Claudia Melis Tonti,Francesco Pro,Leonardo Russo,Andrea Alfarano,Leonardo Pentassuglia,Irene Amerini

Main category: cs.CV

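TL;DR: Introduces R3ST, a synthetic dataset that combines real-world trajectories with a synthetic 3D environment to address the unrealistic vehicle motion of existing synthetic datasets, making traffic forecasting research more realistic while keeping precise annotations.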

Details

Motivation: Real datasets reflect actual road scenes but lack precise ground-truth annotations, whereas synthetic datasets provide plentiful annotations but usually lack realistic vehicle trajectories; a solution that offers both is needed. Method: Builds a synthetic 3D environment and integrates real-world trajectories from SinD, a bird's-eye-view dataset recorded from drone footage, to produce the synthetic dataset R3ST with authentic human driving behavior. Result: R3ST combines real vehicle trajectories with accurate multimodal ground-truth annotations, narrowing the gap between synthetic data and realistic motion. Conclusion: R3ST gives vehicle trajectory forecasting research a more realistic, richly annotated data resource and advances the use of computer vision in road safety analysis. Abstract: Datasets are essential to train and evaluate computer vision models used for traffic analysis and to enhance road safety. Existing real datasets fit real-world scenarios, capturing authentic road object behaviors; however, they typically lack precise ground-truth annotations. In contrast, synthetic datasets play a crucial role, allowing for the annotation of a large number of frames without additional costs or extra time. However, a general drawback of synthetic datasets is the lack of realistic vehicle motion, since trajectories are generated using AI models or rule-based systems. In this work, we introduce R3ST (Realistic 3D Synthetic Trajectories), a synthetic dataset that overcomes this limitation by generating a synthetic 3D environment and integrating real-world trajectories derived from SinD, a bird's-eye-view dataset recorded from drone footage. The proposed dataset closes the gap between synthetic data and realistic trajectories, advancing the research in trajectory forecasting of road vehicles, offering both accurate multimodal ground-truth annotations and authentic human-driven vehicle trajectories.

[115] KineST: A Kinematics-guided Spatiotemporal State Space Model for Human Motion Tracking from Sparse Signals

Shuting Zhao,Zeyu Xiao,Xinrong Chen

Main category: cs.CV

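TL;DR: Proposes KineST, a kinematics-guided state space model that efficiently reconstructs accurate, temporally coherent full-body poses from the sparse signals of head-mounted devices.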

Details

Motivation: Existing methods suffer from high computational cost or model spatial and temporal dependencies separately, which makes efficient and accurate full-body motion tracking in AR/VR applications difficult. Method: Proposes KineST, which embeds kinematic priors through kinematics-guided bidirectional scanning, tightly couples spatial and temporal context through mixed spatiotemporal representation learning, and adds a geometric angular velocity loss to improve motion stability. Result: Experiments show that KineST achieves better pose reconstruction accuracy and temporal consistency than existing methods within a lightweight framework. Conclusion: KineST balances accuracy, smoothness, and efficiency, offering a practical solution for full-body motion capture in AR/VR. Abstract: Full-body motion tracking plays an essential role in AR/VR applications, bridging physical and virtual interactions. However, it is challenging to reconstruct realistic and diverse full-body poses based on sparse signals obtained by head-mounted displays, which are the main devices in AR/VR scenarios. Existing methods for pose reconstruction often incur high computational costs or rely on separately modeling spatial and temporal dependencies, making it difficult to balance accuracy, temporal coherence, and efficiency. To address this problem, we propose KineST, a novel kinematics-guided state space model, which effectively extracts spatiotemporal dependencies while integrating local and global pose perception. The innovation comes from two core ideas. Firstly, in order to better capture intricate joint relationships, the scanning strategy within the State Space Duality framework is reformulated into kinematics-guided bidirectional scanning, which embeds kinematic priors. Secondly, a mixed spatiotemporal representation learning approach is employed to tightly couple spatial and temporal contexts, balancing accuracy and smoothness. Additionally, a geometric angular velocity loss is introduced to impose physically meaningful constraints on rotational variations for further improving motion stability. Extensive experiments demonstrate that KineST has superior performance in both accuracy and temporal consistency within a lightweight framework. Project page: https://kaka-1314.github.io/KineST/

[116] GeoPredict: Leveraging Predictive Kinematics and 3D Gaussian Geometry for Precise VLA Manipulation

Jingjing Qian,Boyao Han,Chen Shi,Lei Xiao,Long Yang,Shaoshuai Shi,Li Jiang

Main category: cs.CV

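TL;DR: GeoPredict is a geometry-aware Vision-Language-Action (VLA) framework that augments a continuous-action policy with predictive kinematic and geometric priors, improving robot performance on tasks that require precise 3D reasoning.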

Details

Motivation: Existing VLA models are mostly reactive and 2D-centric, so they are unreliable on tasks that need precise 3D spatial reasoning and cannot model 3D motion trajectories or workspace geometry. Method: Proposes GeoPredict, which includes a trajectory-level module that encodes motion history and predicts multi-step 3D keypoint trajectories, and a predictive 3D Gaussian geometry module that forecasts workspace geometry with track-guided refinement along future trajectories; these modules act only as training-time supervision through depth-based rendering, and inference needs only lightweight query tokens with no extra 3D decoding. Result: Experiments on RoboCasa Human-50, LIBERO, and real-world manipulation tasks show that GeoPredict consistently outperforms strong VLA baselines, especially in geometry-intensive and spatially demanding scenarios. Conclusion: By adding geometric and kinematic prediction modules, GeoPredict improves VLA performance on complex 3D manipulation tasks without increasing the inference burden, showing its potential for spatial reasoning tasks. Abstract: Vision-Language-Action (VLA) models achieve strong generalization in robotic manipulation but remain largely reactive and 2D-centric, making them unreliable in tasks that require precise 3D reasoning. We propose GeoPredict, a geometry-aware VLA framework that augments a continuous-action policy with predictive kinematic and geometric priors. GeoPredict introduces a trajectory-level module that encodes motion history and predicts multi-step 3D keypoint trajectories of robot arms, and a predictive 3D Gaussian geometry module that forecasts workspace geometry with track-guided refinement along future keypoint trajectories. These predictive modules serve exclusively as training-time supervision through depth-based rendering, while inference requires only lightweight additional query tokens without invoking any 3D decoding. Experiments on RoboCasa Human-50, LIBERO, and real-world manipulation tasks show that GeoPredict consistently outperforms strong VLA baselines, especially in geometry-intensive and spatially demanding scenarios.

[117] DenseBEV: Transforming BEV Grid Cells into 3D Objects

Marius Dähling,Sebastian Krebs,J. Marius Zöllner

Main category: cs.CV

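TL;DR: Proposes DenseBEV, a two-stage anchor generation method that uses BEV feature grid cells directly as detection anchors and combines them with non-maximum suppression and temporal modeling. It yields significant gains in multi-camera 3D object detection, particularly for small objects such as pedestrians, improving over the baseline on nuScenes and reaching state-of-the-art performance on the Waymo Open dataset.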

Details

Motivation: Traditional BEV detectors based on random queries are inefficient and unintuitive, and existing approaches that generate anchors with auxiliary networks are complex; a more direct, end-to-end anchor mechanism that needs no extra network is required to exploit BEV features efficiently. Method: Treats every grid cell of the BEV feature map as an object query (anchor) and applies BEV-based NMS to select valid queries, reducing computation while letting gradients flow only through non-suppressed objects; building on temporal BEV features from encoders such as BEVFormer, adds a hybrid temporal modeling strategy that integrates prior detections. Result: On nuScenes, NDS and mAP improve significantly over the baseline, with pedestrian mAP up by 3.8%; on Waymo, LET-mAP reaches 60.7%, 5.4% above the previous best, and the advantage holds even with sparser BEV grids. Conclusion: By using the dense BEV grid as anchors, DenseBEV provides an efficient, end-to-end multi-camera 3D detection framework that combines temporal information with gradient-aware NMS and achieves leading results on several benchmarks, especially for small objects. Abstract: In current research, Bird's-Eye-View (BEV)-based transformers are increasingly utilized for multi-camera 3D object detection. Traditional models often employ random queries as anchors, optimizing them successively. Recent advancements complement or replace these random queries with detections from auxiliary networks. We propose a more intuitive and efficient approach by using BEV feature cells directly as anchors. This end-to-end approach leverages the dense grid of BEV queries, considering each cell as a potential object for the final detection task. As a result, we introduce a novel two-stage anchor generation method specifically designed for multi-camera 3D object detection. To address the scaling issues of attention with a large number of queries, we apply BEV-based Non-Maximum Suppression, allowing gradients to flow only through non-suppressed objects. This ensures efficient training without the need for post-processing. By using BEV features from encoders such as BEVFormer directly as object queries, temporal BEV information is inherently embedded. Building on the temporal BEV information already embedded in our object queries, we introduce a hybrid temporal modeling approach by integrating prior detections to further enhance detection performance. Evaluating our method on the nuScenes dataset shows consistent and significant improvements in NDS and mAP over the baseline, even with sparser BEV grids and therefore fewer initial anchors.
It is particularly effective for small objects, enhancing pedestrian detection with a 3.8% mAP increase on nuScenes and an 8% increase in LET-mAP on Waymo. Applying our method, named DenseBEV, to the challenging Waymo Open dataset yields state-of-the-art performance, achieving a LET-mAP of 60.7%, surpassing the previous best by 5.4%. Code is available at https://github.com/mdaehl/DenseBEV.
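
To illustrate how a dense BEV grid can be thinned into a manageable set of anchors, here is a minimal sketch of grid-based non-maximum suppression. It assumes that each cell carries an objectness score and that a cell survives only if it is the strict maximum of its neighborhood; the abstract does not specify DenseBEV's actual suppression rule, and `bev_nms`, `radius`, and `threshold` are hypothetical names.

```python
def bev_nms(scores, radius=1, threshold=0.3):
    """Return (row, col) of cells that pass the threshold and beat all neighbours."""
    height, width = len(scores), len(scores[0])
    keep = []
    for r in range(height):
        for c in range(width):
            s = scores[r][c]
            if s < threshold:
                continue
            neighbours = [
                scores[rr][cc]
                for rr in range(max(0, r - radius), min(height, r + radius + 1))
                for cc in range(max(0, c - radius), min(width, c + radius + 1))
                if (rr, cc) != (r, c)
            ]
            if all(s > n for n in neighbours):
                keep.append((r, c))
    return keep

# Toy 4x4 objectness map with two distinct peaks.
scores = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.4, 0.1],
    [0.1, 0.3, 0.2, 0.6],
    [0.0, 0.1, 0.5, 0.2],
]
anchors = bev_nms(scores)  # [(1, 1), (2, 3)]
```

Only the surviving cells would be passed on as object queries, which keeps the attention cost bounded even when the underlying grid is dense.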

[118] Next-Generation License Plate Detection and Recognition System using YOLOv8

Arslan Amin,Rafia Mumtaz,Muhammad Jawad Bashir,Syed Mohammad Hassan Zaidi

Main category: cs.CV

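TL;DR: This study examines the performance of YOLOv8 variants on License Plate Recognition (LPR) and character recognition, introduces a character sequencing method based on x-axis positions, and proposes an efficient, accurate detection-and-recognition pipeline suited to Intelligent Transportation Systems applications on edge devices.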

Details

Motivation: To improve the real-time performance and accuracy of license plate detection and recognition in complex environments and to advance Intelligent Transportation Systems. Method: YOLOv8 Nano handles license plate detection and YOLOv8 Small handles character recognition; a character sequencing strategy based on x-axis coordinates completes an end-to-end recognition pipeline. Result: YOLOv8 Nano reaches a precision of 0.964 and an mAP50 of 0.918 on the LPR task; YOLOv8 Small reaches a precision of 0.92 and an mAP50 of 0.91 on the character recognition task. Conclusion: The proposed optimized pipeline achieves high recognition accuracy while remaining computationally efficient, offers a feasible option for edge-device deployment, and supports smarter urban traffic infrastructure. Abstract: In the evolving landscape of traffic management and vehicle surveillance, efficient license plate detection and recognition are indispensable. Historically, many methodologies have tackled this challenge, but consistent real-time accuracy, especially in diverse environments, remains elusive. This study examines the performance of YOLOv8 variants on License Plate Recognition (LPR) and Character Recognition tasks, crucial for advancing Intelligent Transportation Systems. Two distinct datasets were employed for training and evaluation, yielding notable findings. The YOLOv8 Nano variant demonstrated a precision of 0.964 and mAP50 of 0.918 on the LPR task, while the YOLOv8 Small variant exhibited a precision of 0.92 and mAP50 of 0.91 on the Character Recognition task. A custom method for character sequencing was introduced, effectively sequencing the detected characters based on their x-axis positions. An optimized pipeline, utilizing YOLOv8 Nano for LPR and YOLOv8 Small for Character Recognition, is proposed. This configuration not only maintains computational efficiency but also ensures high accuracy, establishing a robust foundation for future real-world deployments on edge devices within Intelligent Transportation Systems. This effort marks a significant stride towards the development of smarter and more efficient urban infrastructures.
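
The character sequencing step can be sketched in a few lines. The example below assumes each detected character arrives as a label plus a pixel bounding box and that the plate has a single row of characters; the `Detection` layout and the function name `sequence_characters` are hypothetical, not the authors' code.

```python
from typing import List, Tuple

# Hypothetical layout: (label, x_min, y_min, x_max, y_max) in pixels.
Detection = Tuple[str, float, float, float, float]


def sequence_characters(detections: List[Detection]) -> str:
    """Order detected characters left to right by box centre and join them."""
    ordered = sorted(detections, key=lambda d: (d[1] + d[3]) / 2.0)
    return "".join(d[0] for d in ordered)


# Detections arrive in arbitrary order from the character detector.
example = [
    ("7", 60.0, 10.0, 80.0, 40.0),
    ("A", 5.0, 10.0, 25.0, 40.0),
    ("B", 30.0, 10.0, 50.0, 40.0),
]
print(sequence_characters(example))  # prints "AB7"
```

Sorting on the box centre rather than the left edge is a small design choice that tolerates boxes of uneven width; two-row plates would need an extra grouping step on the y-axis first.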

[119] Radiology Report Generation with Layer-Wise Anatomical Attention

Emmanuel D. Muñiz-De-León,Jorge A. Rosales-de-Golferichs,Ana S. Muñoz-Rodríguez,Alejandro I. Trejo-Castro,Eduardo de Avila-Armenta,Antonio Martínez-Torteya

Main category: cs.CV

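TL;DR: This paper proposes a compact image-to-text architecture that generates the Findings section of a radiology report from a single chest X-ray. It combines a frozen DINOv3 ViT encoder with a GPT-2 decoder equipped with layer-wise anatomical attention, improving pathology detection and report coherence without adding trainable parameters, with substantial gains on CheXpert and RadGraph metrics.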

Details

Motivation: State-of-the-art radiology report generation systems depend on large-scale multimodal training, clinical metadata, and multiple imaging views, which makes them resource-intensive and hard to adopt widely. This work aims for a lightweight model that generates high-quality reports from a single frontal X-ray, lowering the barrier to deployment. Method: A frozen DINOv3 Vision Transformer serves as the image encoder and GPT-2 as the text decoder. Layer-wise anatomical attention, driven by lung and heart segmentation masks, injects anatomical structure into the attention process through hierarchical Gaussian smoothing, steering the model toward clinically relevant regions without any additional trainable parameters. Result: On MIMIC-CXR, CheXpert Macro-F1 for five key pathologies rises by 168% (0.083 -> 0.238) and Micro-F1 by 146% (0.137 -> 0.337); performance across 14 observations improves by 86% (0.170 -> 0.316); RadGraph F1 rises by 9.7%, indicating better structural coherence. Conclusion: Despite the small model size and reliance on a single image, decoder-level anatomical guidance improves spatial grounding and the coherence of reports in clinically relevant regions, showing that a lightweight design is feasible for automatic radiology report generation. Abstract: Automatic radiology report generation is a promising application of multimodal deep learning, aiming to reduce reporting workload and improve consistency. However, current state-of-the-art (SOTA) systems - such as Multimodal AI for Radiology Applications (MAIRA-2) and Medical Pathways Language Model-Multimodal (MedPaLM-M) - depend on large-scale multimodal training, clinical metadata, and multiple imaging views, making them resource-intensive and inaccessible for most settings. We introduce a compact image-to-text architecture that generates the Findings section of chest X-ray reports from a single frontal image. The model combines a frozen Self-Distillation with No Labels v3 (DINOv3) Vision Transformer (ViT) encoder with a Generative Pre-trained Transformer 2 (GPT-2) decoder enhanced by layer-wise anatomical attention. This mechanism integrates lung and heart segmentation masks through hierarchical Gaussian smoothing, biasing attention toward clinically relevant regions without adding trainable parameters. Evaluated on the official Medical Information Mart for Intensive Care-Chest X-ray (MIMIC-CXR) dataset using Chest Radiograph Expert (CheXpert) and Radiology Graph (RadGraph) metrics, our approach achieved substantial gains: CheXpert Macro-F1 for five key pathologies increased by 168% (0.083 -> 0.238) and Micro-F1 by 146% (0.137 -> 0.337), while broader performance across 14 observations improved by 86% (0.170 -> 0.316). Structural coherence also improved, with RadGraph F1 rising by 9.7%. Despite its small size and purely image-conditioned design, the model demonstrates that decoder-level anatomical guidance improves spatial grounding and enhances coherence in clinically relevant regions. The source code is publicly available at: https://github.com/devMuniz02/UDEM-CXR-Reporting-Thesis-2025.

[120] OPENTOUCH: Bringing Full-Hand Touch to Real-World Interaction

Yuxin Ray Song,Jinzhou Li,Rao Fu,Devin Murphy,Kaichen Zhou,Rishi Shiv,Yaqi Li,Haoyu Xiong,Crystal Elaine Owens,Yilun Du,Yiyue Luo,Xianyi Cheng,Antonio Torralba,Wojciech Matusik,Paul Pu Liang

Main category: cs.CV

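TL;DR: This paper presents OpenTouch, the first in-the-wild egocentric full-hand tactile dataset, containing 5.1 hours of synchronized video-touch-pose data and 2,900 clips with text annotations, to advance research on multimodal perception and embodied learning.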

Details

Motivation: Existing work lacks real-world data that aligns first-person vision with full-hand touch, which limits our understanding of how visual perception relates to physical interaction. Method: Build the OpenTouch dataset by collecting synchronized video, touch, and hand-pose data in real environments, and design retrieval and classification benchmarks that evaluate the role of tactile signals in perception and action understanding. Result: Experiments show that tactile signals are a powerful cue for grasp understanding, strengthen cross-modal alignment, and can be reliably retrieved from in-the-wild video. Conclusion: OpenTouch provides an important resource and benchmark for multimodal egocentric perception, embodied learning, and contact-rich robotic manipulation. Abstract: The human hand is our primary interface to the physical world, yet egocentric perception rarely knows when, where, or how forcefully it makes contact. Robust wearable tactile sensors are scarce, and no existing in-the-wild datasets align first-person video with full-hand touch. To bridge the gap between visual perception and physical interaction, we present OpenTouch, the first in-the-wild egocentric full-hand tactile dataset, containing 5.1 hours of synchronized video-touch-pose data and 2,900 curated clips with detailed text annotations. Using OpenTouch, we introduce retrieval and classification benchmarks that probe how touch grounds perception and action. We show that tactile signals provide a compact yet powerful cue for grasp understanding, strengthen cross-modal alignment, and can be reliably retrieved from in-the-wild video queries. By releasing this annotated vision-touch-pose dataset and benchmark, we aim to advance multimodal egocentric perception, embodied learning, and contact-rich robotic manipulation.

[121] GenEval 2: Addressing Benchmark Drift in Text-to-Image Evaluation

Amita Kamath,Kai-Wei Chang,Ranjay Krishna,Luke Zettlemoyer,Yushi Hu,Marjan Ghazvininejad

Main category: cs.CV

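TL;DR: This paper identifies "benchmark drift" in Text-to-Image (T2I) evaluation, where existing benchmarks such as GenEval diverge substantially from human judgment over time, and introduces a new benchmark, GenEval 2, and an evaluation method, Soft-TIFA, that make evaluation more challenging and better aligned with human judgment.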

Details

Motivation: Because existing T2I benchmarks are static, they cannot keep pace with rapidly improving models and drift over time, distorting evaluation results; a more dynamic and more challenging evaluation setup is needed. Method: Analyze how far GenEval deviates from human judgment, build a new benchmark, GenEval 2, with broader coverage of visual primitives and higher compositionality, and propose Soft-TIFA, an evaluation method that combines judgments over visual primitives. Result: Experiments show that GenEval has drifted substantially (absolute error of up to 17.7%), that GenEval 2 is more challenging for current models, and that Soft-TIFA agrees with human judgment better than methods such as VQAScore. Conclusion: Avoiding benchmark drift requires continual audits and updates of evaluation benchmarks; GenEval 2 and Soft-TIFA offer a more reliable and more sustainable basis for T2I evaluation. Abstract: Automating Text-to-Image (T2I) model evaluation is challenging; a judge model must be used to score correctness, and test prompts must be selected to be challenging for current T2I models but not the judge. We argue that satisfying these constraints can lead to benchmark drift over time, where the static benchmark judges fail to keep up with newer model capabilities. We show that benchmark drift is a significant problem for GenEval, one of the most popular T2I benchmarks. Although GenEval was well-aligned with human judgment at the time of its release, it has drifted far from human judgment over time -- resulting in an absolute error of as much as 17.7% for current models. This level of drift strongly suggests that GenEval has been saturated for some time, as we verify via a large-scale human study. To help fill this benchmarking gap, we introduce a new benchmark, GenEval 2, with improved coverage of primitive visual concepts and higher degrees of compositionality, which we show is more challenging for current models. We also introduce Soft-TIFA, an evaluation method for GenEval 2 that combines judgments for visual primitives, which we show is better aligned with human judgment and argue is less likely to drift from human alignment over time (as compared to more holistic judges such as VQAScore). Although we hope GenEval 2 will provide a strong benchmark for many years, avoiding benchmark drift is far from guaranteed and our work, more generally, highlights the importance of continual audits and improvement for T2I and related automated model evaluation benchmarks.

[122] RePlan: Reasoning-guided Region Planning for Complex Instruction-based Image Editing

Tianyuan Qu,Lei Ke,Xiaohang Zhan,Longxiang Tang,Yuqi Liu,Bohao Peng,Bei Yu,Dong Yu,Jiaya Jia

Main category: cs.CV

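TL;DR: RePlan is an instruction-based image editing framework built on region-aligned planning; by coupling a vision-language planner with a diffusion editor, it handles image editing under complex instructions and complex scenes.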

Details

Motivation: Existing models perform poorly when complex instructions meet cluttered or ambiguous scenes, and struggle to make precise multi-region edits. Method: RePlan consists of a vision-language planner that decomposes instructions through step-by-step reasoning and grounds them to target regions, and a diffusion editor that performs parallel multi-region edits with a training-free attention-region injection mechanism; planning is further strengthened with GRPO-based reinforcement learning. Result: Across IV-Complex settings, RePlan significantly outperforms strong baselines trained on far larger datasets in regional precision and overall fidelity; the work also introduces IV-Edit, a benchmark focused on fine-grained grounding and knowledge-intensive edits. Conclusion: Through explicit region-aligned planning and a training-free editing mechanism, RePlan achieves more precise and reliable image editing under complex instructions and scenes. Abstract: Instruction-based image editing enables natural-language control over visual modifications, yet existing models falter under Instruction-Visual Complexity (IV-Complexity), where intricate instructions meet cluttered or ambiguous scenes. We introduce RePlan (Region-aligned Planning), a plan-then-execute framework that couples a vision-language planner with a diffusion editor. The planner decomposes instructions via step-by-step reasoning and explicitly grounds them to target regions; the editor then applies changes using a training-free attention-region injection mechanism, enabling precise, parallel multi-region edits without iterative inpainting. To strengthen planning, we apply GRPO-based reinforcement learning using 1K instruction-only examples, yielding substantial gains in reasoning fidelity and format reliability. We further present IV-Edit, a benchmark focused on fine-grained grounding and knowledge-intensive edits. Across IV-Complex settings, RePlan consistently outperforms strong baselines trained on far larger datasets, improving regional precision and overall fidelity. Our project page: https://replan-iv-edit.github.io

[123] Pixel Seal: Adversarial-only training for invisible image and video watermarking

Tomáš Souček,Pierre Fernandez,Hady Elsahar,Sylvestre-Alvise Rebuffi,Valeriu Lacatusu,Tuan Tran,Tom Sander,Alexandre Mourachko

Main category: cs.CV

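TL;DR: This paper introduces Pixel Seal, a new invisible watermarking method for images and video that uses adversarial-only training, a three-stage training schedule, and high-resolution adaptation to strike a better balance between robustness and imperceptibility, clearly outperforming existing methods.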

Details

Motivation: Existing watermarking methods struggle with the trade-off between robustness and imperceptibility and perform poorly on high-resolution images and video, so a more reliable training paradigm is needed. Method: Train with adversarial losses only, dropping unreliable pixel-wise losses; use a three-stage training schedule to stabilize convergence; adapt to high resolution through JND-based attenuation and training-time inference simulation; and extend to video with temporal watermark pooling. Result: Pixel Seal's robustness and imperceptibility are validated across many image types and transformations, with clear improvements over the state of the art and efficient extension to video. Conclusion: Pixel Seal offers a practical, scalable solution for provenance tracing of image and video content and advances invisible watermarking. Abstract: Invisible watermarking is essential for tracing the provenance of digital content. However, training state-of-the-art models remains notoriously difficult, with current approaches often struggling to balance robustness against true imperceptibility. This work introduces Pixel Seal, which sets a new state-of-the-art for image and video watermarking. We first identify three fundamental issues of existing methods: (i) the reliance on proxy perceptual losses such as MSE and LPIPS that fail to mimic human perception and result in visible watermark artifacts; (ii) the optimization instability caused by conflicting objectives, which necessitates exhaustive hyperparameter tuning; and (iii) reduced robustness and imperceptibility of watermarks when scaling models to high-resolution images and videos. To overcome these issues, we first propose an adversarial-only training paradigm that eliminates unreliable pixel-wise imperceptibility losses. Second, we introduce a three-stage training schedule that stabilizes convergence by decoupling robustness and imperceptibility. Third, we address the resolution gap via high-resolution adaptation, employing JND-based attenuation and training-time inference simulation to eliminate upscaling artifacts. We thoroughly evaluate the robustness and imperceptibility of Pixel Seal on different image types and across a wide range of transformations, and show clear improvements over the state-of-the-art. We finally demonstrate that the model efficiently adapts to video via temporal watermark pooling, positioning Pixel Seal as a practical and scalable solution for reliable provenance in real-world image and video settings.

[124] Memory-Enhanced SAM3 for Occlusion-Robust Surgical Instrument Segmentation

Valay Bundele,Mehran Hosseinzadeh,Hendrik P. A. Lensch

Main category: cs.CV

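TL;DR: ReMeDI-SAM3 is a training-free, memory-enhanced extension of SAM3 that uses relevance-aware memory filtering, piecewise interpolation, and a feature-based re-identification module to substantially improve surgical instrument segmentation after occlusions.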

Details

Motivation: Existing video object segmentation methods such as SAM3 are limited in surgical scenes, mainly because of indiscriminate memory updates, fixed memory capacity, and weak identity recovery after occlusions, which makes them ill-suited to the frequent occlusions, rapid motion, and specular reflections of surgical video. Method: ReMeDI-SAM3 has three key components: (i) relevance-aware memory filtering with a dedicated occlusion-aware memory that stores pre-occlusion frames; (ii) a piecewise interpolation scheme that expands the effective memory capacity; and (iii) a feature-based re-identification module with temporal voting for reliable identity disambiguation after occlusions. The whole method requires no training. Result: Evaluated zero-shot on EndoVis17 and EndoVis18, it improves mcIoU over vanilla SAM3 by around 7% and 16% absolute, respectively, outperforming even prior training-based approaches. Conclusion: ReMeDI-SAM3 mitigates error accumulation and markedly improves recovery after occlusions, providing an efficient, training-free solution for instrument segmentation in surgical video. Abstract: Accurate surgical instrument segmentation in endoscopic videos is crucial for computer-assisted interventions, yet remains challenging due to frequent occlusions, rapid motion, specular artefacts, and long-term instrument re-entry. While SAM3 provides a powerful spatio-temporal framework for video object segmentation, its performance in surgical scenes is limited by indiscriminate memory updates, fixed memory capacity, and weak identity recovery after occlusions. We propose ReMeDI-SAM3, a training-free memory-enhanced extension of SAM3, that addresses these limitations through three components: (i) relevance-aware memory filtering with a dedicated occlusion-aware memory for storing pre-occlusion frames, (ii) a piecewise interpolation scheme that expands the effective memory capacity, and (iii) a feature-based re-identification module with temporal voting for reliable post-occlusion identity disambiguation. Together, these components mitigate error accumulation and enable reliable recovery after occlusions. Evaluations on EndoVis17 and EndoVis18 under a zero-shot setting show absolute mcIoU improvements of around 7% and 16%, respectively, over vanilla SAM3, outperforming even prior training-based approaches. Project page: https://valaybundele.github.io/remedi-sam3/.

[125] M-PhyGs: Multi-Material Object Dynamics from Video

Norika Wada,Kohei Yamashita,Ryo Kawahara,Ko Nishino

Main category: cs.CV

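TL;DR: This paper introduces Multi-material Physical Gaussians (M-PhyGs), which estimates the material composition and physical parameters of complex multi-material natural objects (with flowers as the representative case) from video captured in natural settings, using cascaded 3D and 2D losses and temporal mini-batching, and validates the method on the newly introduced Phlowers dataset.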

Details

Motivation: Existing methods for estimating physical material parameters typically assume single-material objects, pre-learned dynamics, or simple topologies, and so do not transfer to the multi-material, geometrically complex objects that are common in the real world, such as flowers. A method that handles real-world objects with complex materials and geometry is needed. Method: M-PhyGs jointly performs material segmentation and continuum mechanical parameter estimation from a short video captured in a natural setting while accounting for gravity; it introduces cascaded 3D and 2D losses and uses temporal mini-batching to make optimization efficient. Result: Experiments on the newly built Phlowers dataset show that M-PhyGs accurately estimates the physical parameters of multi-material flowers and that each of its components contributes to the result. Conclusion: M-PhyGs handles physical parameter estimation for complex multi-material objects in real environments and offers a feasible way to understand the physical properties of heterogeneous natural objects from visual data. Abstract: Knowledge of the physical material properties governing the dynamics of a real-world object becomes necessary to accurately anticipate its response to unseen interactions. Existing methods for estimating such physical material parameters from visual data assume homogeneous single-material objects, pre-learned dynamics, or simplistic topologies. Real-world objects, however, are often complex in material composition and geometry, lying outside the realm of these assumptions. In this paper, we particularly focus on flowers as a representative common object. We introduce Multi-material Physical Gaussians (M-PhyGs) to estimate the material composition and parameters of such multi-material complex natural objects from video. From a short video captured in a natural setting, M-PhyGs jointly segments the object into similar materials and recovers their continuum mechanical parameters while accounting for gravity. M-PhyGs achieves this efficiently with newly introduced cascaded 3D and 2D losses, and by leveraging temporal mini-batching. We introduce a dataset, Phlowers, of people interacting with flowers as a novel platform to evaluate the accuracy of this challenging task of multi-material physical parameter estimation. Experimental results on the Phlowers dataset demonstrate the accuracy and effectiveness of M-PhyGs and its components.

[126] LinkedOut: Linking World Knowledge Representation Out of Video LLM for Next-Generation Video Recommendation

Haichao Zhang,Yao Lu,Lichen Wang,Yunzhe Li,Daiwei Chen,Yunpeng Xu,Yun Fu

Main category: cs.CV

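TL;DR: This paper presents LinkedOut, a new representation built on Video Large Language Models (VLLMs) for efficient, multi-video, low-latency video recommendation; it is the first to extract world knowledge directly from raw frames without handcrafted labels while achieving SOTA performance.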

Details

Motivation: VLLMs show promise for video understanding, but high latency, lack of multi-video input support, and language-only outputs limit their use in downstream vision tasks such as recommendation; the main bottleneck is the absence of a representation that preserves pixel-level detail while leveraging world knowledge. Method: LinkedOut uses a VLLM to extract semantically grounded, knowledge-aware tokens from raw video frames, guided by promptable queries and auxiliary modalities, and introduces a cross-layer knowledge fusion MoE that selects the appropriate level of feature abstraction for personalized, low-latency recommendation. Result: LinkedOut is the first VLLM-based video recommendation method that operates directly on raw frames without handcrafted labels; it achieves state-of-the-art results on standard benchmarks while supporting fast inference and multi-video histories. Conclusion: LinkedOut offers a practical path to fully exploiting the world-knowledge priors and visual reasoning of VLLMs, and confirms that layer diversity and layer-wise fusion improve both the performance and the interpretability of recommendation. Abstract: Video Large Language Models (VLLMs) unlock world-knowledge-aware video understanding through pretraining on internet-scale data and have already shown promise on tasks such as movie analysis and video question answering. However, deploying VLLMs for downstream tasks such as video recommendation remains challenging, since real systems require multi-video inputs, lightweight backbones, low-latency sequential inference, and rapid response. In practice, (1) decode-only generation yields high latency for sequential inference, (2) typical interfaces do not support multi-video inputs, and (3) constraining outputs to language discards fine-grained visual details that matter for downstream vision tasks. We argue that these limitations stem from the absence of a representation that preserves pixel-level detail while leveraging world knowledge. We present LinkedOut, a representation that extracts VLLM world knowledge directly from video to enable fast inference, supports multi-video histories, and removes the language bottleneck. LinkedOut extracts semantically grounded, knowledge-aware tokens from raw frames using VLLMs, guided by promptable queries and optional auxiliary modalities. We introduce a cross-layer knowledge fusion MoE that selects the appropriate level of abstraction from the rich VLLM features, enabling personalized, interpretable, and low-latency recommendation. To our knowledge, LinkedOut is the first VLLM-based video recommendation method that operates on raw frames without handcrafted labels, achieving state-of-the-art results on standard benchmarks. Interpretability studies and ablations confirm the benefits of layer diversity and layer-wise fusion, pointing to a practical path that fully leverages VLLM world-knowledge priors and visual reasoning for downstream vision tasks such as recommendation.

[127] Instant Expressive Gaussian Head Avatar via 3D-Aware Expression Distillation

Kaiwen Jiang,Xueting Li,Seonwook Park,Ravi Ramamoorthi,Shalini De Mello,Koki Nagano

Main category: cs.CV

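TL;DR: This paper proposes a portrait animation method that combines the expressive detail of 2D diffusion models with the speed and consistency of 3D-aware feed-forward methods, delivering high-quality, fast, and 3D-consistent facial animation.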

Details

Motivation: Existing 2D diffusion models for portrait animation lack 3D consistency and are slow, while 3D-aware methods are fast but lack expression detail, which limits their use in real-world scenarios such as digital twins. Method: Knowledge from a 2D diffusion model is distilled into a feed-forward encoder that instantly converts a single in-the-wild image into a 3D-consistent, efficient, and expressive animatable representation; a decoupled design and a lightweight local fusion strategy avoid the dependency on pre-defined parametric models and on computationally intensive global fusion mechanisms. Result: The method runs at 107.31 FPS for animation and pose control with animation quality comparable to the state of the art, surpassing alternative designs that trade speed for quality or vice versa. Conclusion: The method combines the expressivity of 2D diffusion models with the efficiency and 3D consistency of 3D-aware models, providing an effective solution for high-quality real-time portrait animation in practical applications. Abstract: Portrait animation has witnessed tremendous quality improvements thanks to recent advances in video diffusion models. However, these 2D methods often compromise 3D consistency and speed, limiting their applicability in real-world scenarios, such as digital twins or telepresence. In contrast, 3D-aware facial animation feedforward methods -- built upon explicit 3D representations, such as neural radiance fields or Gaussian splatting -- ensure 3D consistency and achieve faster inference speed, but come with inferior expression details. In this paper, we aim to combine their strengths by distilling knowledge from a 2D diffusion-based method into a feed-forward encoder, which instantly converts an in-the-wild single image into a 3D-consistent, fast yet expressive animatable representation. Our animation representation is decoupled from the face's 3D representation and learns motion implicitly from data, eliminating the dependency on pre-defined parametric models that often constrain animation capabilities. Unlike previous computationally intensive global fusion mechanisms (e.g., multiple attention layers) for fusing 3D structural and animation information, our design employs an efficient lightweight local fusion strategy to achieve high animation expressivity. As a result, our method runs at 107.31 FPS for animation and pose control while achieving comparable animation quality to the state-of-the-art, surpassing alternative designs that trade speed for quality or vice versa. Project website is https://research.nvidia.com/labs/amri/projects/instant4d

[128] FlashPortrait: 6x Faster Infinite Portrait Animation with Adaptive Latent Prediction

Shuyuan Tu,Yueming Pan,Yinming Huang,Xintong Han,Zhen Xing,Qi Dai,Kai Qiu,Chong Luo,Zuxuan Wu

Main category: cs.CV

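TL;DR: This paper presents FlashPortrait, an end-to-end video diffusion transformer that generates identity-preserving, infinite-length portrait animation while speeding up inference by up to 6x.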

Details

Motivation: Existing diffusion-based methods struggle to keep identity consistent in long portrait animation; this paper aims to solve that problem. Method: An off-the-shelf extractor first computes identity-agnostic facial expression features. A Normalized Facial Expression Block then aligns the facial features with the diffusion latents by normalizing them with their means and variances. At inference time, a dynamic sliding-window scheme with weighted blending in overlapping areas is combined with higher-order latent derivatives that skip several denoising steps to accelerate inference. Result: Experiments show that FlashPortrait improves identity consistency and visual quality on multiple benchmarks while achieving up to 6x faster inference. Conclusion: FlashPortrait outperforms existing methods in identity preservation and efficient long-video generation, and offers a new approach to using diffusion models for long-sequence portrait animation. Abstract: Current diffusion-based acceleration methods for long-portrait animation struggle to ensure identity (ID) consistency. This paper presents FlashPortrait, an end-to-end video diffusion transformer capable of synthesizing ID-preserving, infinite-length videos while achieving up to 6x acceleration in inference speed. In particular, FlashPortrait begins by computing the identity-agnostic facial expression features with an off-the-shelf extractor. It then introduces a Normalized Facial Expression Block to align facial features with diffusion latents by normalizing them with their respective means and variances, thereby improving identity stability in facial modeling. During inference, FlashPortrait adopts a dynamic sliding-window scheme with weighted blending in overlapping areas, ensuring smooth transitions and ID consistency in long animations. In each context window, based on the latent variation rate at particular timesteps and the derivative magnitude ratio among diffusion layers, FlashPortrait utilizes higher-order latent derivatives at the current timestep to directly predict latents at future timesteps, thereby skipping several denoising steps and achieving 6x speed acceleration. Experiments on benchmarks show the effectiveness of FlashPortrait both qualitatively and quantitatively.
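
The idea of predicting a future latent from derivatives at the current timestep can be illustrated with a Taylor-style extrapolation. The sketch below is an assumption-laden toy, not FlashPortrait's estimator: it assumes evenly spaced timesteps, approximates the first and second derivatives with backward finite differences over three cached latents, and uses a hypothetical function name `predict_latent`. The adaptive rule that decides when skipping is safe is not reproduced here.

```python
import numpy as np


def predict_latent(z_prev2: np.ndarray, z_prev1: np.ndarray, z_now: np.ndarray,
                   steps_ahead: float = 1.0) -> np.ndarray:
    """Second-order extrapolation of a latent `steps_ahead` timesteps forward.

    Derivatives are backward finite-difference estimates on a unit-spaced grid.
    """
    d1 = z_now - z_prev1                    # first-derivative estimate
    d2 = z_now - 2.0 * z_prev1 + z_prev2    # second-derivative estimate
    s = float(steps_ahead)
    return z_now + s * d1 + 0.5 * s * s * d2


# Toy latents that change linearly over time, so the extrapolation is exact.
z0 = np.zeros((2, 3))
z1 = np.ones((2, 3))
z2 = 2.0 * np.ones((2, 3))
z_future = predict_latent(z0, z1, z2, steps_ahead=2.0)
print(z_future[0, 0])  # prints 4.0
```

Each predicted latent stands in for one or more denoiser calls, which is where the speed-up would come from; the accuracy of the shortcut depends on how smoothly the latents actually evolve.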

[129] Alchemist: Unlocking Efficiency in Text-to-Image Model Training via Meta-Gradient Data Selection

Kaixin Ding,Yang Zhou,Xi Chen,Miao Yang,Jiarong Ou,Rui Chen,Xin Tao,Hengshuang Zhao

Main category: cs.CV

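TL;DR: This paper introduces Alchemist, a meta-gradient-based framework for automatically selecting data from large-scale text-image datasets; through data rating and pruning, it improves the training efficiency and visual quality of generative models.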

Details

Motivation: Text-to-image generative models are limited by the quality of their training data, and existing data filtering relies on manual curation or heuristic scores based on a single feature, leaving no automated, efficient data selection mechanism. Method: Alchemist has two stages, data rating and data pruning: a lightweight rater with multi-granularity perception estimates each sample's influence from gradient information, and the Shift-Gsampling strategy selects a high-value subset. Result: Experiments on synthetic and web-crawled datasets show that training on the 50% of data selected by Alchemist can outperform training on the full dataset, with clear gains in visual quality and downstream performance. Conclusion: Alchemist is the first automatic, scalable, meta-gradient-based data selection framework for text-to-image model training, and it improves both data efficiency and model performance. Abstract: Recent advances in Text-to-Image (T2I) generative models, such as Imagen, Stable Diffusion, and FLUX, have led to remarkable improvements in visual quality. However, their performance is fundamentally limited by the quality of training data. Web-crawled and synthetic image datasets often contain low-quality or redundant samples, which lead to degraded visual fidelity, unstable training, and inefficient computation. Hence, effective data selection is crucial for improving data efficiency. Existing approaches rely on costly manual curation or heuristic scoring based on single-dimensional features in Text-to-Image data filtering. Although meta-learning-based methods have been explored for LLMs, they have not been adapted to image modalities. To this end, we propose Alchemist, a meta-gradient-based framework to select a suitable subset from large-scale text-image data pairs. Our approach automatically learns to assess the influence of each sample by iteratively optimizing the model from a data-centric perspective. Alchemist consists of two key stages: data rating and data pruning. We train a lightweight rater to estimate each sample's influence based on gradient information, enhanced with multi-granularity perception. We then use the Shift-Gsampling strategy to select informative subsets for efficient model training. Alchemist is the first automatic, scalable, meta-gradient-based data selection framework for Text-to-Image model training. Experiments on both synthetic and web-crawled datasets demonstrate that Alchemist consistently improves visual quality and downstream performance. Training on an Alchemist-selected 50% of the data can outperform training on the full dataset.

[130] VIVA: VLM-Guided Instruction-Based Video Editing with Reward Optimization

Xiaoyan Cong,Haotian Yang,Angtian Wang,Yizhi Wang,Yiding Yang,Canyu Zhang,Chongyang Ma

Main category: cs.CV

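TL;DR: This paper proposes VIVA, a scalable video editing framework built on VLM-guided encoding and reward optimization that follows complex natural-language instructions more faithfully while preserving content fidelity and temporal coherence.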

Details

Motivation: Existing diffusion-based video editing methods are trained on paired data of simple operations and generalize poorly to complex, diverse real-world instructions. Method: VIVA comprises (1) a VLM-based instructor that encodes the text instruction, the first frame of the source video, and a reference image into visually grounded instruction representations; (2) a post-training stage, Edit-GRPO, that uses relative rewards to optimize the model for instruction faithfulness, content preservation, and visual quality; and (3) a data construction pipeline that synthesizes diverse paired data. Result: Experiments show that VIVA outperforms state-of-the-art methods in instruction following, generalization, and editing quality. Conclusion: Through VLM-guided representation learning and reward optimization, VIVA markedly improves the generalization and practicality of instruction-driven video editing and offers an effective solution for complex editing scenarios. Abstract: Instruction-based video editing aims to modify an input video according to a natural-language instruction while preserving content fidelity and temporal coherence. However, existing diffusion-based approaches are often trained on paired data of simple editing operations, which fundamentally limits their ability to generalize to diverse and complex, real-world instructions. To address this generalization gap, we propose VIVA, a scalable framework for instruction-based video editing that leverages VLM-guided encoding and reward optimization. First, we introduce a VLM-based instructor that encodes the textual instruction, the first frame of the source video, and an optional reference image into visually-grounded instruction representations, providing fine-grained spatial and semantic context for the diffusion transformer backbone. Second, we propose a post-training stage, Edit-GRPO, which adapts Group Relative Policy Optimization to the domain of video editing, directly optimizing the model for instruction-faithful, content-preserving, and aesthetically pleasing edits using relative rewards. Furthermore, we propose a data construction pipeline designed to synthetically generate diverse, high-fidelity paired video-instruction data of basic editing operations. Extensive experiments show that VIVA achieves superior instruction following, generalization, and editing quality over state-of-the-art methods. Website: https://viva-paper.github.io
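
The "relative rewards" in Group Relative Policy Optimization refer to scoring each sampled output against the other outputs in its group. A minimal sketch of that normalization follows; the function name `group_relative_advantages` and the example reward values are hypothetical, and how Edit-GRPO actually combines instruction, content, and aesthetic rewards into one scalar is not specified in the abstract.

```python
import numpy as np


def group_relative_advantages(rewards, eps: float = 1e-8) -> np.ndarray:
    """Standardize rewards within one group: (r - mean) / (std + eps)."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)


# Hypothetical scalar rewards for four candidate edits of the same video.
group_rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(group_rewards)
print(advantages)
```

Edits that score above the group mean receive positive advantages and are reinforced, while below-average edits are pushed down, so no separate value network is needed to provide a baseline.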

[131] Flowing from Reasoning to Motion: Learning 3D Hand Trajectory Prediction from Egocentric Human Interaction Videos

Mingfei Chen,Yifan Wang,Zhengqin Li,Homanga Bharadhwaj,Yujin Chen,Chuan Qin,Ziyi Kou,Yuan Tian,Eric Whitmire,Rajinder Sodhi,Hrvoje Benko,Eli Shlizerman,Yue Liu

Main category: cs.CV

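TL;DR: Introduces the EgoMAN dataset and model for interaction stage-aware 3D hand trajectory prediction, jointly learning vision-language reasoning and motion generation to produce accurate trajectories that generalize.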

Details

Motivation: Prior work on 3D hand trajectory prediction is constrained by datasets that lack semantic supervision and by models that only weakly link reasoning and action. Method: Build the large-scale egocentric EgoMAN dataset, with 219K 6DoF hand trajectories and 3M structured QA pairs, and design a reasoning-to-motion framework (the EgoMAN model) around a trajectory-token interface, trained progressively to align reasoning with motion dynamics. Result: The approach produces accurate, interaction stage-aware 3D hand trajectories and generalizes well across real-world scenes. Conclusion: EgoMAN tightly couples vision-language reasoning with hand motion prediction, advancing the joint modeling of semantics and action in embodied interaction. Abstract: Prior works on 3D hand trajectory prediction are constrained by datasets that decouple motion from semantic supervision and by models that weakly link reasoning and action. To address these, we first present the EgoMAN dataset, a large-scale egocentric dataset for interaction stage-aware 3D hand trajectory prediction with 219K 6DoF trajectories and 3M structured QA pairs for semantic, spatial, and motion reasoning. We then introduce the EgoMAN model, a reasoning-to-motion framework that links vision-language reasoning and motion generation via a trajectory-token interface. Trained progressively to align reasoning with motion dynamics, our approach yields accurate and stage-aware trajectories with generalization across real-world scenes.

[132] SceneDiff: A Benchmark and Method for Multiview Object Change Detection

Yuqun Wu,Chih-hao Lin,Henry Che,Aditi Tiwari,Chuhang Zou,Shenlong Wang,Derek Hoiem

Main category: cs.CV

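TL;DR: This paper introduces the SceneDiff benchmark and a training-free method for multiview object change detection that uses pretrained 3D, segmentation, and image encoding models to align captures across viewpoints and compare spatial and semantic features, detecting added, removed, or moved objects and clearly outperforming existing methods.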

Details

Motivation: Detecting object changes (additions, removals, or moves) between images or videos of the same scene captured at different times has important applications, but viewpoint changes cause false detections, so a more robust multiview change detection method is needed. Method: The SceneDiff method uses pretrained 3D, segmentation, and image encoding models to align the two captures in 3D, extract object regions, and compare their spatial and semantic features to detect changes. The method requires no training. Result: On multi-view and two-view benchmarks, the method outperforms existing approaches by large margins, with relative AP improvements of 94% and 37.4%, respectively. Conclusion: The SceneDiff method performs strongly on multiview object change detection and generalizes well, offering an efficient and reliable solution for practical applications. Abstract: We investigate the problem of identifying objects that have been added, removed, or moved between a pair of captures (images or videos) of the same scene at different times. Detecting such changes is important for many applications, such as robotic tidying or construction progress and safety monitoring. A major challenge is that varying viewpoints can cause objects to falsely appear changed. We introduce SceneDiff Benchmark, the first multiview change detection benchmark with object instance annotations, comprising 350 diverse video pairs with thousands of changed objects. We also introduce the SceneDiff method, a new training-free approach for multiview object change detection that leverages pretrained 3D, segmentation, and image encoding models to robustly predict across multiple benchmarks. Our method aligns the captures in 3D, extracts object regions, and compares spatial and semantic region features to detect changes. Experiments on multi-view and two-view benchmarks demonstrate that our method outperforms existing approaches by large margins (94% and 37.4% relative AP improvements). The benchmark and code will be publicly released.

[133] MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Model for Embodied Task Planning

Yuanchen Ju,Yongyuan Liang,Yen-Jen Wang,Nandiraju Gireesh,Yuanliang Ju,Seungjae Lee,Qiao Gu,Elvis Hsieh,Furong Huang,Koushil Sreenath

Main category: cs.CV

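TL;DR: This paper introduces MomaGraph, a unified scene representation for mobile manipulators in household environments that integrates spatial-functional relationships and part-level interactive elements, and releases MomaGraph-Scenes, the first large-scale task-driven scene graph dataset, along with the MomaGraph-Bench evaluation suite. Building on these, MomaGraph-R1 (a 7B vision-language model) is trained with reinforcement learning to act as a zero-shot task planner, reaching 71.6% accuracy on the benchmark (+11.4% over the best baseline) and generalizing and transferring well to real-robot tasks.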

Details

Motivation: Existing scene graph representations typically separate spatial and functional relations, ignore object states and temporal updates, and lack the information most relevant to the current task, so they fall short of the compact, semantically rich scene understanding that mobile manipulators in household environments need. Method: The paper proposes MomaGraph, a unified scene representation that integrates spatial-functional relationships and part-level interactive elements; builds the MomaGraph-Scenes dataset and the MomaGraph-Bench evaluation suite; and on this basis trains MomaGraph-R1, a 7B vision-language model trained with reinforcement learning that performs zero-shot task planning under a Graph-then-Plan framework. Result: MomaGraph-R1 reaches 71.6% accuracy on MomaGraph-Bench, 11.4% above the best baseline; it generalizes well across several public benchmarks and transfers successfully to real-robot experiments. Conclusion: MomaGraph gives embodied agents in household environments a richer, dynamic, task-oriented approach to scene understanding, and the accompanying dataset, evaluation suite, and model advance the field and validate the approach in complex reasoning and practical applications. Abstract: Mobile manipulators in households must both navigate and manipulate. This requires a compact, semantically rich scene representation that captures where objects are, how they function, and which parts are actionable. Scene graphs are a natural choice, yet prior work often separates spatial and functional relations, treats scenes as static snapshots without object states or temporal updates, and overlooks information most relevant for accomplishing the current task. To address these limitations, we introduce MomaGraph, a unified scene representation for embodied agents that integrates spatial-functional relationships and part-level interactive elements. However, advancing such a representation requires both suitable data and rigorous evaluation, which have been largely missing. We thus contribute MomaGraph-Scenes, the first large-scale dataset of richly annotated, task-driven scene graphs in household environments, along with MomaGraph-Bench, a systematic evaluation suite spanning six reasoning capabilities from high-level planning to fine-grained scene understanding. Built upon this foundation, we further develop MomaGraph-R1, a 7B vision-language model trained with reinforcement learning on MomaGraph-Scenes. MomaGraph-R1 predicts task-oriented scene graphs and serves as a zero-shot task planner under a Graph-then-Plan framework. Extensive experiments demonstrate that our model achieves state-of-the-art results among open-source models, reaching 71.6% accuracy on the benchmark (+11.4% over the best baseline), while generalizing across public benchmarks and transferring effectively to real-robot experiments.
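
As a rough illustration of the kind of structure this abstract describes, the sketch below models a scene graph whose nodes carry object states and part-level interactive elements, and whose edges are tagged as spatial or functional. All class names, field names, relation labels, and the toy `plan_from_graph` function are hypothetical and chosen only for illustration; they are not taken from the MomaGraph paper or its code.

```python
from dataclasses import dataclass, field


@dataclass
class Part:
    name: str    # part-level interactive element, e.g. a rocker on a switch
    action: str  # hypothetical action label for that part


@dataclass
class ObjectNode:
    name: str
    state: str = "unknown"                      # object state, updated over time
    parts: list = field(default_factory=list)   # list of Part


@dataclass
class Relation:
    subject: str
    obj: str
    kind: str    # "spatial" or "functional"
    label: str   # e.g. "left_of", "controls"


@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)
    relations: list = field(default_factory=list)

    def add_object(self, node):
        self.nodes[node.name] = node

    def relate(self, subject, obj, kind, label):
        if subject not in self.nodes or obj not in self.nodes:
            raise KeyError("both endpoints must be added before relating them")
        self.relations.append(Relation(subject, obj, kind, label))

    def update_state(self, name, state):
        # Temporal update: overwrite the stored state of one object.
        self.nodes[name].state = state

    def task_subgraph(self, names):
        # Keep only the objects (and relations among them) relevant to a task.
        keep = set(names)
        return SceneGraph(
            nodes={n: self.nodes[n] for n in names if n in self.nodes},
            relations=[r for r in self.relations
                       if r.subject in keep and r.obj in keep],
        )


def plan_from_graph(graph, target):
    # Toy planner: act on every part of any object that functionally
    # relates to the target. A real planner would be a learned model.
    steps = []
    for r in graph.relations:
        if r.kind == "functional" and r.obj == target:
            controller = graph.nodes[r.subject]
            for p in controller.parts:
                steps.append(f"{p.action} {controller.name}.{p.name}")
    return steps


g = SceneGraph()
g.add_object(ObjectNode("lamp", state="off"))
g.add_object(ObjectNode("wall_switch", parts=[Part("rocker", "press")]))
g.add_object(ObjectNode("sofa"))
g.relate("wall_switch", "lamp", "functional", "controls")
g.relate("lamp", "sofa", "spatial", "left_of")

sub = g.task_subgraph(["lamp", "wall_switch"])
plan = plan_from_graph(sub, "lamp")
g.update_state("lamp", "on")
```

Here the graph is first pruned to the task-relevant objects and only then handed to the planner, which mirrors the two-stage idea of predicting a graph before planning, under the simplifying assumptions stated above.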

[134] SFTok: Bridging the Performance Gap in Discrete Tokenizers

Qihang Rao,Borui Zhang,Wenzhao Zheng,Jie Zhou,Jiwen Lu

Main category: cs.CV

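TL;DR: SFTok is a discrete image tokenizer that introduces self-forcing guided reconstruction and a debias-and-fitting training strategy, achieving state-of-the-art image reconstruction quality at a high compression rate and strong performance on class-to-image generation.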

Details

Motivation: Existing discrete tokenizers still lag behind continuous methods in multimodal systems and suffer from a training-inference inconsistency, which limits their use in high-resolution image generation. Method: SFTok uses a multi-step iterative mechanism, combined with self-forcing guided visual reconstruction and a debias-and-fitting training strategy, to resolve the training-inference inconsistency. Result: At a high compression rate of only 64 tokens per image, SFTok reaches an rFID of 1.21 and a gFID of 2.29 on ImageNet, significantly outperforming existing discrete methods. Conclusion: SFTok effectively improves discrete tokenizer performance and narrows the gap between discrete and continuous methods, providing an efficient, high-quality image representation for multimodal generative models. Abstract: Recent advances in multimodal models highlight the pivotal role of image tokenization in high-resolution image generation. By compressing images into compact latent representations, tokenizers enable generative models to operate in lower-dimensional spaces, thereby improving computational efficiency and reducing complexity. Discrete tokenizers naturally align with the autoregressive paradigm but still lag behind continuous ones, limiting their adoption in multimodal systems. To address this, we propose SFTok, a discrete tokenizer that incorporates a multi-step iterative mechanism for precise reconstruction. By integrating self-forcing guided visual reconstruction and a debias-and-fitting training strategy, SFTok resolves the training-inference inconsistency in the multi-step process, significantly enhancing image reconstruction quality. At a high compression rate of only 64 tokens per image, SFTok achieves state-of-the-art reconstruction quality on ImageNet (rFID = 1.21) and demonstrates exceptional performance in class-to-image generation tasks (gFID = 2.29).

[135] Depth Any Panoramas: A Foundation Model for Panoramic Depth Estimation

Xin Lin,Meixi Song,Dizhe Zhang,Wenxuan Lu,Haodong Li,Bo Du,Ming-Hsuan Yang,Truong Nguyen,Lu Qi

Main category: cs.CV

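TL;DR: This paper presents a panoramic metric depth foundation model that, through a data-in-the-loop paradigm and a three-stage pseudo-label pipeline, generalizes well across diverse scene distances.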

Details

Motivation: To improve metric depth estimation for panoramic images across different scene distances and to reduce the domain gaps between indoor and outdoor data and between synthetic and real data. Method: A large-scale dataset is built by combining public datasets, high-quality synthetic data generated with a UE5 simulator, text-to-image models, and real panoramic images from the web; DINOv3-Large serves as the backbone, with a plug-and-play range mask head plus sharpness-centric and geometry-centric optimization strategies. Result: Evaluation on multiple benchmarks (e.g., Stanford2D3D, Matterport3D, and Deep360) shows strong performance and zero-shot generalization, with particularly robust and stable metric predictions in diverse real-world scenes. Conclusion: The proposed method achieves excellent cross-scene generalization in panoramic metric depth estimation and shows potential for practical use. Abstract: In this work, we present a panoramic metric depth foundation model that generalizes across diverse scene distances. We explore a data-in-the-loop paradigm from the view of both data construction and framework design. We collect a large-scale dataset by combining public datasets, high-quality synthetic data from our UE5 simulator and text-to-image models, and real panoramic images from the web. To reduce domain gaps between indoor/outdoor and synthetic/real data, we introduce a three-stage pseudo-label curation pipeline to generate reliable ground truth for unlabeled images. For the model, we adopt DINOv3-Large as the backbone for its strong pre-trained generalization, and introduce a plug-and-play range mask head, sharpness-centric optimization, and geometry-centric optimization to improve robustness to varying distances and enforce geometric consistency across views. Experiments on multiple benchmarks (e.g., Stanford2D3D, Matterport3D, and Deep360) demonstrate strong performance and zero-shot generalization, with particularly robust and stable metric predictions in diverse real-world scenes. The project page can be found at: https://insta360-research-team.github.io/DAP_website/

[136] StereoPilot: Learning Unified and Efficient Stereo Conversion via Generative Priors

Guibao Shen,Yihua Du,Wenhang Ge,Jing He,Chirui Chang,Donghao Zhou,Zhen Yang,Luozhou Wang,Xin Tao,Ying-Cong Chen

Main category: cs.CV

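TL;DR: This paper introduces the UniStereo dataset and the StereoPilot model to address the shortcomings of multi-stage pipelines in monocular-to-stereo video conversion, enabling efficient, high-quality stereo view generation.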

Details

Motivation: The existing multi-stage "Depth-Warp-Inpaint" (DWI) pipeline suffers from error propagation, depth ambiguity, and format inconsistency, which limits automatic generation of high-quality stereo video. Method: The authors build UniStereo, a large-scale unified stereo video dataset, and propose StereoPilot, an end-to-end feed-forward model that needs neither explicit depth maps nor diffusion sampling and adapts to different stereo formats through a learnable domain switcher and a cycle consistency loss. Result: Experiments show that StereoPilot significantly outperforms existing state-of-the-art methods in both visual fidelity and computational efficiency. Conclusion: Together with the UniStereo dataset, StereoPilot offers a more efficient and robust solution for monocular-to-stereo video conversion and advances content generation for stereoscopic displays. Abstract: The rapid growth of stereoscopic displays, including VR headsets and 3D cinemas, has led to increasing demand for high-quality stereo video content. However, producing 3D videos remains costly and complex, while automatic Monocular-to-Stereo conversion is hindered by the limitations of the multi-stage "Depth-Warp-Inpaint" (DWI) pipeline. This paradigm suffers from error propagation, depth ambiguity, and format inconsistency between parallel and converged stereo configurations. To address these challenges, we introduce UniStereo, the first large-scale unified dataset for stereo video conversion, covering both stereo formats to enable fair benchmarking and robust model training. Building upon this dataset, we propose StereoPilot, an efficient feed-forward model that directly synthesizes the target view without relying on explicit depth maps or iterative diffusion sampling. Equipped with a learnable domain switcher and a cycle consistency loss, StereoPilot adapts seamlessly to different stereo formats and achieves improved consistency. Extensive experiments demonstrate that StereoPilot significantly outperforms state-of-the-art methods in both visual fidelity and computational efficiency. Project page: https://hit-perfect.github.io/StereoPilot/.

[137] AdaTooler-V: Adaptive Tool-Use for Images and Videos

Chaoyang Wang,Kaituo Feng,Dongyang Chen,Zhongyu Wang,Zhixun Li,Sicheng Gao,Meng Meng,Xu Zhou,Manyuan Zhang,Yuzhang Shang,Xiangyu Yue

Main category: cs.CV

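TL;DR: This paper proposes AdaTooler-V, a multimodal large language model that invokes vision tools adaptively; reinforcement learning and newly built datasets substantially improve its visual reasoning efficiency and performance.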

Details

Motivation: Existing open-source multimodal large models invoke vision tools blindly even when they are unnecessary, which increases inference overhead and degrades performance, so an adaptive mechanism is needed to decide when tools are truly required. Method: The paper proposes AT-GRPO, a reinforcement learning algorithm that dynamically adjusts reward scales based on each sample's Tool Benefit Score, and builds two datasets, AdaTooler-V-CoT-100k and AdaTooler-V-300k, for training and reinforcement learning. Result: The model performs strongly across twelve benchmarks; AdaTooler-V-7B reaches 89.8% accuracy on the high-resolution benchmark V*, surpassing GPT-4o and Gemini 1.5 Pro. Conclusion: AdaTooler-V achieves efficient adaptive tool use and performs well across diverse visual reasoning tasks, and all code, models, and data are open-sourced. Abstract: Recent advances have shown that multimodal large language models (MLLMs) benefit from multimodal interleaved chain-of-thought (CoT) with vision tool interactions. However, existing open-source models often exhibit blind tool-use reasoning patterns, invoking vision tools even when they are unnecessary, which significantly increases inference overhead and degrades model performance. To this end, we propose AdaTooler-V, an MLLM that performs adaptive tool-use by determining whether a visual problem truly requires tools. First, we introduce AT-GRPO, a reinforcement learning algorithm that adaptively adjusts reward scales based on the Tool Benefit Score of each sample, encouraging the model to invoke tools only when they provide genuine improvements. Moreover, we construct two datasets to support training: AdaTooler-V-CoT-100k for SFT cold start and AdaTooler-V-300k for RL with verifiable rewards across single-image, multi-image, and video data. Experiments across twelve benchmarks demonstrate the strong reasoning capability of AdaTooler-V, outperforming existing methods in diverse visual reasoning tasks. Notably, AdaTooler-V-7B achieves an accuracy of 89.8% on the high-resolution benchmark V*, surpassing the commercial proprietary models GPT-4o and Gemini 1.5 Pro. All code, models, and data are released.
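
The abstract states only that AT-GRPO scales rewards by a per-sample Tool Benefit Score; it does not give formulas. The sketch below is therefore one possible reading, not the paper's algorithm. It assumes (hypothetically) that the score is the accuracy gap between tool-using and tool-free attempts on a sample, that the score adds a bonus or penalty scaled by `alpha` to rollouts that call a tool, and that advantages are then normalized within the rollout group as in standard GRPO. The function names and `alpha` are illustrative.

```python
from statistics import mean, pstdev


def tool_benefit_score(acc_with_tool, acc_without_tool):
    # Assumed definition: positive when tools help on this sample,
    # negative when they hurt.
    return acc_with_tool - acc_without_tool


def shaped_reward(correct, used_tool, benefit, alpha=0.5):
    # Correctness reward plus a tool term that applies only when a tool
    # was invoked; alpha is a hypothetical scaling hyperparameter.
    base = 1.0 if correct else 0.0
    bonus = alpha * benefit if used_tool else 0.0
    return base + bonus


def group_relative_advantages(rewards, eps=1e-6):
    # Standard GRPO-style normalization within one group of rollouts.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A sample where tools hurt: 25% accuracy with tools, 75% without.
benefit = tool_benefit_score(0.25, 0.75)

# Four rollouts for that sample: (answer correct?, tool used?)
rollouts = [(True, True), (True, False), (False, True), (False, False)]
rewards = [shaped_reward(c, t, benefit) for c, t in rollouts]
advantages = group_relative_advantages(rewards)
```

Under these assumptions the correct, tool-free rollout receives the largest advantage on a sample where tools do not help, which is the behavior the abstract describes as invoking tools only when they provide genuine improvements.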

[138] DVGT: Driving Visual Geometry Transformer

Sicheng Zuo,Zixun Xie,Wenzhao Zheng,Shaoqing Xu,Fang Li,Shengyin Jiang,Long Chen,Zhi-Xin Yang,Jiwen Lu

Main category: cs.CV

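TL;DR: Proposes the Driving Visual Geometry Transformer (DVGT), a model that reconstructs a global dense 3D point map from unposed multi-view image sequences without precise camera parameters, suited to the diverse scenarios and camera configurations of autonomous driving.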

Details

Motivation: Existing methods depend on precise camera parameters and alignment with external sensors, adapt poorly to different driving scenarios and camera configurations, and no general dense geometry perception model targets driving tasks. Method: A DINO backbone extracts image features; alternating intra-view local attention, cross-view spatial attention, and cross-frame temporal attention model the geometric relations across images; and multiple decoding heads predict a global point map in the ego coordinate frame of the first frame together with the ego pose of each frame. Result: Trained and tested on several large-scale driving datasets including nuScenes, OpenScene, Waymo, KITTI, and DDAD, DVGT significantly outperforms existing methods, handles arbitrary camera configurations flexibly, and directly outputs metric-scaled geometry. Conclusion: DVGT is a general driving geometry perception model that needs no explicit 3D geometric priors and no alignment with external sensors, and it achieves high-accuracy global dense 3D reconstruction across varied scenarios. Abstract: Perceiving and reconstructing 3D scene geometry from visual inputs is crucial for autonomous driving. However, a driving-targeted dense geometry perception model that can adapt to different scenarios and camera configurations is still lacking. To bridge this gap, we propose a Driving Visual Geometry Transformer (DVGT), which reconstructs a global dense 3D point map from a sequence of unposed multi-view visual inputs. We first extract visual features for each image using a DINO backbone, and employ alternating intra-view local attention, cross-view spatial attention, and cross-frame temporal attention to infer geometric relations across images. We then use multiple heads to decode a global point map in the ego coordinate of the first frame and the ego poses for each frame. Unlike conventional methods that rely on precise camera parameters, DVGT is free of explicit 3D geometric priors, enabling flexible processing of arbitrary camera configurations. DVGT directly predicts metric-scaled geometry from image sequences, eliminating the need for post-alignment with external sensors. Trained on a large mixture of driving datasets including nuScenes, OpenScene, Waymo, KITTI, and DDAD, DVGT significantly outperforms existing models on various scenarios. Code is available at https://github.com/wzzheng/DVGT.

[139] EasyV2V: A High-quality Instruction-based Video Editing Framework

Jinjie Mai,Chaoyang Wang,Guocheng Gordon Qian,Willi Menapace,Sergey Tulyakov,Bernard Ghanem,Peter Wonka,Ashkan Mirzaei

Main category: cs.CV

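TL;DR: This paper presents EasyV2V, a simple yet effective instruction-based video editing framework with design innovations in data, model, and control that achieves state-of-the-art video editing results.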

Details

Motivation: Video editing still faces challenges in consistency, control, and generalization, and existing methods are not flexible or efficient enough. Method: The authors build diverse video pairs, start from a pretrained text-to-video model with light LoRA fine-tuning, condition via sequence concatenation, and introduce a unified spatiotemporal mask mechanism for fine-grained editing control. Result: EasyV2V supports multiple input forms (e.g., video+text, video+mask+text) and surpasses current state-of-the-art systems and commercial tools on multiple benchmarks. Conclusion: A simple architecture combined with a high-quality data construction strategy is enough for strong video editing capability, and EasyV2V offers an efficient, general solution for instruction-driven video editing. Abstract: While image editing has advanced rapidly, video editing remains less explored, facing challenges in consistency, control, and generalization. We study the design space of data, architecture, and control, and introduce EasyV2V, a simple and effective framework for instruction-based video editing. On the data side, we compose existing experts with fast inverses to build diverse video pairs, lift image edit pairs into videos via single-frame supervision and pseudo pairs with shared affine motion, mine dense-captioned clips for video pairs, and add transition supervision to teach how edits unfold. On the model side, we observe that pretrained text-to-video models possess editing capability, motivating a simplified design. Simple sequence concatenation for conditioning with light LoRA fine-tuning suffices to train a strong model. For control, we unify spatiotemporal control via a single mask mechanism and support optional reference images. Overall, EasyV2V works with flexible inputs, e.g., video+text, video+mask+text, video+mask+reference+text, and achieves state-of-the-art video editing results, surpassing concurrent and commercial systems. Project page: https://snap-research.github.io/easyv2v/

[140] Differences That Matter: Auditing Models for Capability Gap Discovery and Rectification

Qihao Liu,Chengzhi Mao,Yaojie Liu,Alan Yuille,Wen-Sheng Chu

Main category: cs.CV

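TL;DR: Proposes AuditDM, a framework that trains a multimodal large model as an auditor via reinforcement learning to actively discover and rectify failure modes exposed by model disagreement, revealing a range of weaknesses and improving model performance.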

Details

Motivation: Existing evaluation methods lack interpretability and fail to fully expose capability gaps in multimodal large models. Method: A multimodal large model is fine-tuned with reinforcement learning as an auditor that generates challenging questions and counterfactual images maximizing disagreement among target models, in order to discover failure modes; these discoveries then serve as annotation-free data for model rectification. Result: On SoTA models such as Gemma-3 and PaliGemma-2, more than 20 distinct failure types are discovered; fine-tuning on the discovered data consistently improves performance across 16 benchmarks and lets a 3B model surpass a 28B model. Conclusion: As the returns from data scaling diminish, targeted model auditing offers an effective path to model diagnosis and improvement. Abstract: Conventional evaluation methods for multimodal LLMs (MLLMs) lack interpretability and are often insufficient to fully disclose significant capability gaps across models. To address this, we introduce AuditDM, an automated framework that actively discovers and rectifies MLLM failure modes by auditing their divergence. AuditDM fine-tunes an MLLM as an auditor via reinforcement learning to generate challenging questions and counterfactual images that maximize disagreement among target models. Once trained, the auditor uncovers diverse, interpretable exemplars that reveal model weaknesses and serve as annotation-free data for rectification. When applied to SoTA models like Gemma-3 and PaliGemma-2, AuditDM discovers more than 20 distinct failure types. Fine-tuning on these discoveries consistently improves all models across 16 benchmarks, and enables a 3B model to surpass its 28B counterpart. Our results suggest that as data scaling hits diminishing returns, targeted model auditing offers an effective path to model diagnosis and improvement.

[141] Next-Embedding Prediction Makes Strong Vision Learners

Sihan Xu,Ziqiao Ma,Wenhao Chai,Xuweiyi Chen,Weiyang Jin,Joyce Chai,Saining Xie,Stella X. Yu

Main category: cs.CV

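TL;DR: This paper proposes Next-Embedding Predictive Autoregression (NEPA), a visual self-supervised learning method that simplifies model design by predicting patch embeddings rather than pixels or discrete tokens; using only causal masking and stop gradient, ViT models pretrained on ImageNet-1k achieve strong performance.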

Details

Motivation: Inspired by the success of generative pretraining in natural language, the paper explores whether similar principles can be applied to vision to obtain efficient self-supervised learning without complex designs such as pixel reconstruction or contrastive losses. Method: A Transformer predicts future patch embeddings conditioned on past ones, using causal masking and a stop-gradient mechanism to perform autoregressive prediction directly in embedding space. Result: After fine-tuning on ImageNet-1k, ViT-B and ViT-L reach 83.8% and 85.3% top-1 accuracy respectively, and the model transfers well to semantic segmentation on ADE20K. Conclusion: Pretraining based on embedding generation offers a simple, scalable, and potentially modality-agnostic new paradigm for visual self-supervised learning. Abstract: Inspired by the success of generative pretraining in natural language, we ask whether the same principles can yield strong self-supervised visual learners. Instead of training models to output features for downstream use, we train them to generate embeddings to perform predictive tasks directly. This work explores such a shift from learning representations to learning models. Specifically, models learn to predict future patch embeddings conditioned on past ones, using causal masking and stop gradient, which we refer to as Next-Embedding Predictive Autoregression (NEPA). We demonstrate that a simple Transformer pretrained on ImageNet-1k with next embedding prediction as its sole learning objective is effective: no pixel reconstruction, discrete tokens, contrastive loss, or task-specific heads. This formulation retains architectural simplicity and scalability, without requiring additional design complexity. NEPA achieves strong results across tasks, attaining 83.8% and 85.3% top-1 accuracy on ImageNet-1K with ViT-B and ViT-L backbones after fine-tuning, and transferring effectively to semantic segmentation on ADE20K. We believe generative pretraining from embeddings provides a simple, scalable, and potentially modality-agnostic alternative to visual self-supervised learning.
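
To make the shape of a next-embedding objective concrete, here is a minimal pure-Python sketch. It is not the NEPA implementation: the causal Transformer is replaced by a stand-in predictor that averages the embeddings visible under a causal mask, and the loss (one minus cosine similarity) is an assumption, since the abstract does not name the loss. Stop gradient has no effect without automatic differentiation, so it appears only as a comment marking where the targets would be detached.

```python
import math


def causal_mask(n):
    # mask[i][j] is True when position i may see position j (j <= i).
    return [[j <= i for j in range(n)] for i in range(n)]


def causal_mean_predictor(embeddings):
    # Stand-in for a causal Transformer: position i outputs the mean of
    # the embeddings it is allowed to see (positions 0..i).
    n = len(embeddings)
    dim = len(embeddings[0])
    mask = causal_mask(n)
    preds = []
    for i in range(n):
        visible = [embeddings[j] for j in range(n) if mask[i][j]]
        preds.append([sum(v[d] for v in visible) / len(visible)
                      for d in range(dim)])
    return preds


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def next_embedding_loss(embeddings):
    preds = causal_mean_predictor(embeddings)
    # Targets are the embeddings shifted by one position. In an autodiff
    # framework they would be wrapped in a stop-gradient so that only the
    # prediction branch receives gradients.
    targets = embeddings[1:]
    losses = [1.0 - cosine(p, t) for p, t in zip(preds[:-1], targets)]
    return sum(losses) / len(losses)


constant_patches = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
orthogonal_patches = [[1.0, 0.0], [0.0, 1.0]]

loss_constant = next_embedding_loss(constant_patches)
loss_orthogonal = next_embedding_loss(orthogonal_patches)
```

With identical patches the stand-in predictor matches every next embedding and the loss is zero; with two orthogonal patches the single prediction has zero cosine similarity with its target and the loss is one.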

[142] Generative Refocusing: Flexible Defocus Control from a Single Image

Chun-Wei Tuan Mu,Jia-Bin Huang,Yu-Lun Liu

Main category: cs.CV

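TL;DR: Proposes Generative Refocusing, a two-step method that uses DeblurNet and BokehNet to recover an all-in-focus image from a single image and generate controllable depth-of-field effects; semi-supervised training that combines synthetic data with real unpaired bokeh images significantly improves deblurring, bokeh synthesis, and refocusing, and the method supports text-guided adjustments and custom aperture shapes.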

Details

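Motivation: Single-image refocusing must both recover sharp content and generate realistic bokeh; existing methods rely on all-in-focus inputs and synthetic data, offer limited control, and struggle to reflect real optical characteristics. Method: A two-step approach first uses DeblurNet to recover an all-in-focus image from varied inputs and then uses BokehNet to generate controllable bokeh; semi-supervised training combines synthetic paired data with real unpaired bokeh images selected using EXIF metadata to capture real optical characteristics. Result: The method reaches state-of-the-art performance on defocus deblurring, bokeh synthesis, and refocusing benchmarks, supports text-guided editing and custom aperture shapes, and produces more realistic results with flexible control. Conclusion: Through its semi-supervised training strategy and two-stage architecture, Generative Refocusing addresses key difficulties of single-image refocusing and outperforms existing methods in both realism and controllability. Abstract: Depth-of-field control is essential in photography, but getting the perfect focus often takes several tries or special equipment. Single-image refocusing is still difficult. It involves recovering sharp content and creating realistic bokeh. Current methods have significant drawbacks. They need all-in-focus inputs, depend on synthetic data from simulators, and have limited control over aperture. We introduce Generative Refocusing, a two-step process that uses DeblurNet to recover all-in-focus images from various inputs and BokehNet for creating controllable bokeh. Our main innovation is semi-supervised training. This method combines synthetic paired data with unpaired real bokeh images, using EXIF metadata to capture real optical characteristics beyond what simulators can provide. Our experiments show we achieve top performance in defocus deblurring, bokeh synthesis, and refocusing benchmarks. Additionally, our Generative Refocusing allows text-guided adjustments and custom aperture shapes.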

[143] The World is Your Canvas: Painting Promptable Events with Reference Images, Trajectories, and Text

Hanlin Wang,Hao Ouyang,Qiuyu Wang,Yue Yu,Yihao Meng,Wen Wang,Ka Leong Cheng,Shuailei Ma,Qingyan Bai,Yixuan Li,Cheng Chen,Yanhong Zeng,Xing Zhu,Yujun Shen,Qifeng Chen

Main category: cs.CV

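TL;DR: WorldCanvas is a multimodal framework that combines text, trajectories, and reference images to generate controllable, coherent videos of world events; it supports multi-agent interactions, object entry and exit, and identity consistency, moving world models from passive prediction toward user-steerable simulators.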

Details

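Motivation: Existing methods are limited when generating videos with complex dynamics and semantic intent, and struggle to deliver motion control, semantic expression, and visual fidelity together; a multimodal generation framework that unifies temporal, spatial, and semantic control is needed. Method: The WorldCanvas framework combines trajectories (encoding motion, timing, and visibility) with natural language descriptions and reference images, using multimodal conditioning to control complex world events precisely while preserving visual consistency. Result: The generated videos are temporally coherent and show emergent consistency of object identity and scene, keeping appearance consistent even when objects temporarily disappear; complex scenarios such as multi-agent interactions, object entry and exit, and counterintuitive events are supported. Conclusion: WorldCanvas enables richer, user-directed simulation, turning world models from passive prediction tools into interactive, programmable simulators and pointing to a new direction for interactive content generation. Abstract: We present WorldCanvas, a framework for promptable world events that enables rich, user-directed simulation by combining text, trajectories, and reference images. Unlike text-only approaches and existing trajectory-controlled image-to-video methods, our multimodal approach combines trajectories (encoding motion, timing, and visibility) with natural language for semantic intent and reference images for visual grounding of object identity, enabling the generation of coherent, controllable events that include multi-agent interactions, object entry/exit, reference-guided appearance, and counterintuitive events. The resulting videos demonstrate not only temporal coherence but also emergent consistency, preserving object identity and scene despite temporary disappearance. By supporting expressive world events generation, WorldCanvas advances world models from passive predictors to interactive, user-shaped simulators. Our project page is available at: https://worldcanvas.github.io/.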

Table of Contents

cs.CL

[1] TabReX : Tabular Referenceless eXplainable Evaluation

[2] Social Story Frames: Contextual Reasoning about Narrative Intent and Reception

[3] BRAID: Bounded Reasoning for Autonomous Inference and Decisions

[4] Examining the Utility of Self-disclosure Types for Modeling Annotators of Social Norms

[5] Are We on the Right Way to Assessing LLM-as-a-Judge?

[6] Convolutional Lie Operator for Sentence Classification

[7] MRG-R1: Reinforcement Learning for Clinically Aligned Medical Report Generation

[8] Decoding Fake Narratives in Spreading Hateful Stories: A Dual-Head RoBERTa Model with Multi-Task Learning

[9] A Domain-Adapted Pipeline for Structured Information Extraction from Police Incident Announcements on Social Media

[10] Mitigating Hallucinations in Healthcare LLMs with Granular Fact-Checking and Domain-Specific Adaptation

[11] An Information-Theoretic Framework for Robust Large Language Model Editing

[12] LoPA: Scaling dLLM Inference via Lookahead Parallel Decoding

[13] Sigma-Moe-Tiny Technical Report

[14] Evaluating OpenAI GPT Models for Translation of Endangered Uralic Languages: A Comparison of Reasoning and Non-Reasoning Architectures

[15] Hacking Neural Evaluation Metrics with Single Hub Text

[16] Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs

[17] Bridging the Reality Gap: Efficient Adaptation of ASR systems for Challenging Low-Resource Domains

[18] Plain language adaptations of biomedical text using LLMs: Comparision of evaluation metrics

[19] UM_FHS at the CLEF 2025 SimpleText Track: Comparing No-Context and Fine-Tune Approaches for GPT-4.1 Models in Sentence and Document-Level Text Simplification

[20] Refusal Steering: Fine-grained Control over LLM Refusal Behaviour for Sensitive Topics

[21] JustRL: Scaling a 1.5B LLM with a Simple RL Recipe

[22] GinSign: Grounding Natural Language Into System Signatures for Temporal Logic Translation

[23] From Facts to Conclusions : Integrating Deductive Reasoning in Retrieval-Augmented LLMs

[24] Exploration of Augmentation Strategies in Multi-modal Retrieval-Augmented Generation for the Biomedical Domain: A Case Study Evaluating Question Answering in Glycobiology

[25] Grammar-Forced Translation of Natural Language to Temporal Logic using LLMs

[26] What Do Prosody and Text Convey? Characterizing How Meaningful Information is Distributed Across Multiple Channels

[27] LLMCache: Layer-Wise Caching Strategies for Accelerated Reuse in Transformer Inference

[28] AdaSearch: Balancing Parametric Knowledge and Search in Large Language Models via Reinforcement Learning

[29] Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image

[30] In-Context Algebra

[31] Constructive Circuit Amplification: Improving Math Reasoning in LLMs via Targeted Sub-Network Updates

cs.CV

[32] Two-Step Data Augmentation for Masked Face Detection and Recognition: Turning Fake Masks to Real

[33] Seeing Beyond Words: Self-Supervised Visual Learning for Multimodal Large Language Models

[34] City Navigation in the Wild: Exploring Emergent Navigation from Web-Scale Knowledge in MLLMs

[35] R4: Retrieval-Augmented Reasoning for Vision-Language Models in 4D Spatio-Temporal Space

[36] The Perceptual Observatory Characterizing Robustness and Grounding in MLLMs

[37] Seeing is Believing (and Predicting): Context-Aware Multi-Human Behavior Prediction with Vision Language Models

[38] From Words to Wavelengths: VLMs for Few-Shot Multispectral Object Detection

[39] Are vision-language models ready to zero-shot replace supervised classification models in agriculture?

[40] Eyes on the Grass: Biodiversity-Increasing Robotic Mowing Using Deep Visual Embeddings

[41] CoVAR: Co-generation of Video and Action for Robotic Manipulation via Multi-Modal Diffusion

[42] Driving in Corner Case: A Real-World Adversarial Closed-Loop Evaluation Platform for End-to-End Autonomous Driving

[43] FOD-Diff: 3D Multi-Channel Patch Diffusion Model for Fiber Orientation Distribution

[44] Auto-Vocabulary 3D Object Detection

[45] LAPX: Lightweight Hourglass Network with Global Context

[46] Collimator-assisted high-precision calibration method for event cameras

[47] TurboDiffusion: Accelerating Video Diffusion Models by 100-200 Times

[48] Flexible Camera Calibration using a Collimator System

[49] Interaction-via-Actions: Cattle Interaction Detection with Joint Learning of Action-Interaction Latent Space

[50] ResDynUNet++: A nested U-Net with residual dynamic convolution blocks for dual-spectral CT

[51] SegGraph: Leveraging Graphs of SAM Segments for Few-Shot 3D Part Segmentation

[52] C-DGPA: Class-Centric Dual-Alignment Generative Prompt Adaptation

[53] Towards Closing the Domain Gap with Event Cameras

[54] Avatar4D: Synthesizing Domain-Specific 4D Humans for Real-World Pose Estimation

[55] Visual Alignment of Medical Vision-Language Models for Grounded Radiology Report Generation

[56] Open Ad-hoc Categorization with Contextualized Feature Learning

[57] Enhanced 3D Shape Analysis via Information Geometry

[58] Learning High-Quality Initial Noise for Single-View Synthesis with Diffusion Models

[59] Image Compression Using Singular Value Decomposition

[60] ARMFlow: AutoRegressive MeanFlow for Online 3D Human Reaction Generation

[61] AI-Powered Dermatological Diagnosis: From Interpretable Models to Clinical Implementation A Comprehensive Framework for Accessible and Trustworthy Skin Disease Detection

[62] Semi-Supervised Multi-View Crowd Counting by Ranking Multi-View Fusion Models

[63] Pixel Super-Resolved Fluorescence Lifetime Imaging Using Deep Learning

[64] TextEditBench: Evaluating Reasoning-aware Text Editing Beyond Rendering

[65] GFLAN: Generative Functional Layouts

[66] MACL: Multi-Label Adaptive Contrastive Learning Loss for Remote Sensing Image Retrieval

[67] PixelArena: A benchmark for Pixel-Precision Visual Intelligence

[68] LaverNet: Lightweight All-in-one Video Restoration via Selective Propagation

[69] Ridge Estimation-Based Vision and Laser Ranging Fusion Localization Method for UAVs

[70] QUIDS: Quality-informed Incentive-driven Multi-agent Dispatching System for Mobile Crowdsensing

[71] Collaborative Edge-to-Server Inference for Vision-Language Models

[72] GMODiff: One-Step Gain Map Refinement with Diffusion Priors for HDR Reconstruction

[73] EverybodyDance: Bipartite Graph-Based Identity Correspondence for Multi-Character Animation

[74] Factorized Video Generation: Decoupling Scene Construction and Temporal Synthesis in Text-to-Video Diffusion Models

[75] Adaptive Frequency Domain Alignment Network for Medical image segmentation

[76] Using Gaussian Splats to Create High-Fidelity Facial Geometry and Texture

[77] BrepLLM: Native Boundary Representation Understanding with Large Language Models