cs.CL

[1] On GRPO Collapse in Search-R1: The Lazy Likelihood-Displacement Death Spiral

Wenlong Deng,Yushu Li,Boying Gong,Yi Ren,Christos Thrampoulidis,Xiaoxiao Li

Main category: cs.CL

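TL;DR: This paper studies training collapse in GRPO-based tool-integrated reinforcement learning (TI-RL), proposes a new mechanism called Lazy Likelihood Displacement (LLD) to explain it, and designs a lightweight regularizer, LLDS, that mitigates LLD, stabilizing training and substantially improving performance.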

Details Motivation: In tool-integrated reinforcement learning, GRPO-style methods such as Search-R1 converge quickly and need no value network, but they frequently suffer training collapse. Prior work has not identified the root cause, so a deeper analysis and an effective remedy are needed. Method: The authors identify Lazy Likelihood Displacement (LLD) as the core mechanism behind the collapse and describe its evolution with a three-phase LLD Death Spiral model. Building on this, they design LLDS, a fine-grained regularizer that activates only when likelihood decreases and preserves the likelihood of the key tokens in the generated sequence. Result: On seven open-domain and multi-hop QA benchmarks, LLDS prevents gradient explosion and stabilizes training, yielding gains of +37.8% on Qwen2.5-3B and +32.0% on Qwen2.5-7B. Conclusion: LLD is a fundamental bottleneck in GRPO-based TIRL, and LLDS offers a low-interference, efficient regularization path toward stable, scalable training of tool-integrated LLMs. Abstract: Tool-integrated (TI) reinforcement learning (RL) enables large language models (LLMs) to perform multi-step reasoning by interacting with external tools such as search engines and retrievers. Group Relative Policy Optimization (GRPO), exemplified by the recent Search-R1, offers fast convergence and a value-free formulation that makes it appealing for this setting, yet consistently suffers from training collapse. We identify Lazy Likelihood Displacement (LLD), a systematic reduction or stagnation in the likelihood of both correct and incorrect responses, as the core mechanism driving this failure. LLD emerges early and triggers a self-reinforcing LLD Death Spiral, where declining likelihood leads to low-confidence responses, inflating gradients, and ultimately causing collapse. We empirically characterize this process across models on a Search-R1-style, search-integrated question answering task, revealing a consistent three-phase trajectory: early stagnation, steady decay, and accelerated collapse. To address this, we propose a lightweight likelihood-preserving regularization LLDS for GRPO that activates only when a trajectory's likelihood decreases, and regularizes only the tokens responsible. This fine-grained structure mitigates LLD with minimal interference to optimization. Across seven open-domain and multi-hop QA benchmarks, our method stabilizes training, prevents gradient explosion, and yields substantial performance improvements, including +37.8% gains on Qwen2.5-3B and +32.0% gains on Qwen2.5-7B. Our results establish LLD as a fundamental bottleneck in GRPO-based TIRL and provide a practical path toward stable, scalable training of tool-integrated LLMs.
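The gating idea in the abstract (act only when a trajectory's likelihood drops, and only on the tokens responsible) can be illustrated with a small sketch. The abstract does not give the exact loss, so the hinge form below, the function name `llds_penalty`, and the per-token log-probability inputs are assumptions made for illustration, not the authors' formulation.

```python
def llds_penalty(old_logps, new_logps):
    """Hypothetical likelihood-preserving penalty for one trajectory.

    old_logps / new_logps: per-token log-probabilities of the same sampled
    tokens under the previous and the current policy (assumed inputs).
    """
    # Trajectory-level gate: do nothing unless the total log-likelihood fell.
    if sum(new_logps) >= sum(old_logps):
        return 0.0
    # Token-level selectivity: penalize only tokens whose log-prob dropped.
    return sum(max(0.0, old - new) for old, new in zip(old_logps, new_logps))


# Likelihood rose overall, so the regularizer stays inactive.
print(llds_penalty([-1.0, -2.0], [-0.5, -2.2]))
# Likelihood fell overall; only the first token contributes to the penalty.
print(llds_penalty([-1.0, -2.0], [-1.5, -1.9]))
```

In a real training loop such a term would presumably be added to the GRPO loss with a weighting coefficient; that coefficient is not specified in the abstract.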

[2] Computational Linguistics Meets Libyan Dialect: A Study on Dialect Identification

Mansour Essgaer,Khamis Massud,Rabia Al Mamlook,Najah Ghmaid

Main category: cs.CL

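TL;DR: This study compares several machine learning models for classifying Libyan-dialect tweets using the QADI corpus, and finds that Multinomial Naive Bayes (MNB) with combined word and character n-gram features performs best, reaching 85.89% accuracy.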

Details Motivation: Arabic dialects are highly irregular in spelling and orthography, which challenges natural language processing; low-resource dialects such as Libyan in particular lack systematic classification studies. Method: Logistic regression, linear SVM, multinomial Naive Bayes, and Bernoulli Naive Bayes are combined with word and character n-gram features in experiments on the QADI corpus; a chi-square test screens for significant features, and classification performance is evaluated under different representations. Result: MNB achieves the highest accuracy of 85.89% and an F1-score of 0.85741 with (1,2) word n-grams and (1,5) character n-grams, ahead of logistic regression (84.41%) and linear SVM (84.73%); the chi-square analysis shows that features such as email mentions and emotion indicators are not significant, so they are excluded. Conclusion: Suitable n-gram representations and classifier choice are crucial for accurate Arabic dialect identification; MNB performs best on Libyan dialect classification and provides an empirical benchmark for Arabic NLP research. Abstract: This study investigates logistic regression, linear support vector machine, multinomial Naive Bayes, and Bernoulli Naive Bayes for classifying Libyan dialect utterances gathered from Twitter. The dataset used is the QADI corpus, which consists of 540,000 sentences across 18 Arabic dialects. Preprocessing challenges include handling inconsistent orthographic variations and non-standard spellings typical of the Libyan dialect. The chi-square analysis revealed that certain features, such as email mentions and emotion indicators, were not significantly associated with dialect classification and were thus excluded from further analysis. Two main experiments were conducted: (1) evaluating the significance of meta-features extracted from the corpus using the chi-square test and (2) assessing classifier performance using different word and character n-gram representations. The classification experiments showed that Multinomial Naive Bayes (MNB) achieved the highest accuracy of 85.89% and an F1-score of 0.85741 when using a (1,2) word n-gram and (1,5) character n-gram representation. In contrast, Logistic Regression and Linear SVM exhibited slightly lower performance, with maximum accuracies of 84.41% and 84.73%, respectively. Additional evaluation metrics, including log loss, Cohen's kappa, and the Matthews correlation coefficient, further supported the effectiveness of MNB in this task. The results indicate that carefully selected n-gram representations and classification models play a crucial role in improving the accuracy of Libyan dialect identification. This study provides empirical benchmarks and insights for future research in Arabic dialect NLP applications.
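To make the best-performing representation concrete, here is a minimal pure-Python sketch of (1,2) word n-grams combined with (1,5) character n-grams feeding a Laplace-smoothed multinomial Naive Bayes. The class `TinyMNB`, the feature prefixes, and the placeholder training strings are assumptions for illustration; the study's actual preprocessing and tooling are not described in the abstract.

```python
import math
from collections import Counter


def ngrams(text, word_range=(1, 2), char_range=(1, 5)):
    """Count word n-grams and character n-grams of one text."""
    feats = Counter()
    words = text.split()
    for n in range(word_range[0], word_range[1] + 1):
        for i in range(len(words) - n + 1):
            feats["w:" + " ".join(words[i:i + n])] += 1
    for n in range(char_range[0], char_range[1] + 1):
        for i in range(len(text) - n + 1):
            feats["c:" + text[i:i + n]] += 1
    return feats


class TinyMNB:
    """Multinomial Naive Bayes with add-alpha smoothing (illustrative only)."""

    def fit(self, texts, labels, alpha=1.0):
        self.alpha = alpha
        self.counts = {}              # label -> feature counts
        self.docs = Counter(labels)   # label -> number of documents
        self.vocab = set()
        for text, label in zip(texts, labels):
            feats = ngrams(text)
            self.counts.setdefault(label, Counter()).update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, text):
        feats = ngrams(text)
        total_docs = sum(self.docs.values())
        best, best_score = None, -math.inf
        for label, counts in self.counts.items():
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(self.docs[label] / total_docs)
            for feat, k in feats.items():
                if feat in self.vocab:  # ignore features unseen in training
                    score += k * math.log((counts[feat] + self.alpha) / denom)
            if score > best_score:
                best, best_score = label, score
        return best


# Placeholder strings stand in for tweets; they are not real dialect data.
model = TinyMNB().fit(["aaa aaa", "bbb bbb"], ["x", "y"])
print(model.predict("aaa"))
```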

[3] SQuARE: Structured Query & Adaptive Retrieval Engine For Tabular Formats

Chinmay Gondhalekar,Urjitkumar Patel,Fang-Chun Yeh

Main category: cs.CL

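TL;DR: SQuARE is a hybrid retrieval framework that uses complexity-aware routing to answer questions accurately over complex spreadsheets with multirow headers, merged cells, and unit annotations.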

Details Motivation: Existing approaches such as SQL-based views perform poorly on files that lack consistent schemas, while naive chunking loses accuracy when structures such as multirow headers and merged cells are broken up, making it hard to answer questions over real spreadsheets correctly. Method: The SQuARE framework computes a continuous score from header depth and merge density and uses it to route each query either to structure-preserving chunk retrieval or to SQL over an automatically constructed relational representation; a lightweight agent supervises the refinement or combination of retrieval results. Result: On multi-header corporate balance sheets, a heavily merged World Bank workbook, and diverse public datasets, SQuARE outperforms single-strategy baselines and ChatGPT-4o on both retrieval precision and end-to-end answer accuracy, with predictable latency. Conclusion: By decoupling retrieval from model choice, SQuARE is compatible with emerging tabular foundation models and offers a practical bridge toward more robust table understanding. Abstract: Accurate question answering over real spreadsheets remains difficult due to multirow headers, merged cells, and unit annotations that disrupt naive chunking, while rigid SQL views fail on files lacking consistent schemas. We present SQuARE, a hybrid retrieval framework with sheet-level, complexity-aware routing. It computes a continuous score based on header depth and merge density, then routes queries either through structure-preserving chunk retrieval or SQL over an automatically constructed relational representation. A lightweight agent supervises retrieval, refinement, or combination of results across both paths when confidence is low. This design maintains header hierarchies, time labels, and units, ensuring that returned values are faithful to the original cells and straightforward to verify. Evaluated on multi-header corporate balance sheets, a heavily merged World Bank workbook, and diverse public datasets, SQuARE consistently surpasses single-strategy baselines and ChatGPT-4o on both retrieval precision and end-to-end answer accuracy while keeping latency predictable. By decoupling retrieval from model choice, the system is compatible with emerging tabular foundation models and offers a practical bridge toward more robust table understanding.
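The routing step can be pictured with a short sketch. The abstract only says the score is continuous and based on header depth and merge density; the equal weighting, the normalization cap `max_header_rows`, the threshold, and the choice to send complex sheets to chunk retrieval are all assumptions here, not details from the paper.

```python
def complexity_score(header_rows, merged_cells, total_cells, max_header_rows=4):
    """Hypothetical sheet complexity in [0, 1] from header depth and merge density."""
    header_depth = min(header_rows, max_header_rows) / max_header_rows
    merge_density = merged_cells / total_cells if total_cells else 0.0
    return 0.5 * header_depth + 0.5 * merge_density


def route(score, threshold=0.5):
    """Assumed policy: complex sheets keep their structure, simple ones go to SQL."""
    return "chunk_retrieval" if score >= threshold else "sql"


flat_sheet = complexity_score(header_rows=1, merged_cells=0, total_cells=100)
nested_sheet = complexity_score(header_rows=4, merged_cells=50, total_cells=100)
print(flat_sheet, route(flat_sheet))
print(nested_sheet, route(nested_sheet))
```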

[4] DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle

Fangyu Lei,Jinxiang Meng,Yiming Huang,Junjie Zhao,Yitong Zhang,Jianwen Luo,Xin Zou,Ruiyi Yang,Wenbo Shi,Yan Gao,Shizhu He,Zuo Wang,Qian Liu,Yang Wang,Ke Wang,Jun Zhao,Kang Liu

Main category: cs.CL

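TL;DR: DAComp is a benchmark of 210 tasks for evaluating data engineering and data analysis capabilities in enterprise data intelligence workflows; it reveals severe shortcomings of current state-of-the-art models in complex pipeline orchestration and open-ended reasoning.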

Details Motivation: Existing benchmarks do not adequately reflect real enterprise data intelligence workflows and lack a combined evaluation of data engineering and data analysis capabilities. Method: The DAComp benchmark covers data engineering tasks that require building multi-stage SQL pipelines and open-ended data analysis tasks that require strategic planning and interpretation; performance is assessed with execution-based multi-metric scoring and a validated LLM judge guided by hierarchical rubrics. Result: State-of-the-art agents achieve success rates under 20% on DE tasks and average scores below 40% on DA tasks, indicating significant deficiencies in holistic pipeline orchestration and open-ended reasoning. Conclusion: DAComp provides a rigorous, realistic testbed that clearly diagnoses the bottlenecks of current autonomous data agents and drives the development of more capable systems for enterprise settings. Abstract: Real-world enterprise data intelligence workflows encompass data engineering that turns raw sources into analysis-ready tables and data analysis that converts those tables into decision-oriented insights. We introduce DAComp, a benchmark of 210 tasks that mirrors these complex workflows. Data engineering (DE) tasks require repository-level engineering on industrial schemas, including designing and building multi-stage SQL pipelines from scratch and evolving existing systems under evolving requirements. Data analysis (DA) tasks pose open-ended business problems that demand strategic planning, exploratory analysis through iterative coding, interpretation of intermediate results, and the synthesis of actionable recommendations. Engineering tasks are scored through execution-based, multi-metric evaluation. Open-ended tasks are assessed by a reliable, experimentally validated LLM-judge, which is guided by hierarchical, meticulously crafted rubrics. Our experiments reveal that even state-of-the-art agents falter on DAComp. Performance on DE tasks is particularly low, with success rates under 20%, exposing a critical bottleneck in holistic pipeline orchestration, not merely code generation. Scores on DA tasks also average below 40%, highlighting profound deficiencies in open-ended reasoning and demonstrating that engineering and analysis are distinct capabilities. By clearly diagnosing these limitations, DAComp provides a rigorous and realistic testbed to drive the development of truly capable autonomous data agents for enterprise settings. Our data and code are available at https://da-comp.github.io

[5] ClusterFusion: Hybrid Clustering with Embedding Guidance and LLM Adaptation

Yiming Xu,Yuan Yuan,Vijay Viswanathan,Graham Neubig

Main category: cs.CL

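TL;DR: Proposes ClusterFusion, a hybrid framework that uses a large language model (LLM) as the clustering core, guided by lightweight embedding methods, and achieves state-of-the-art performance on both standard and domain-specific tasks.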

Details Motivation: Traditional clustering algorithms perform poorly in specialized domains and depend on costly fine-tuning; existing uses of LLMs are mostly limited to auxiliary roles and do not fully exploit their contextual understanding. Method: A three-stage framework is proposed: embedding-guided subset partition, LLM-driven topic summarization, and LLM-based topic assignment. Embedding methods do the initial screening, and the LLM performs semantic induction and assignment. Result: On three public benchmarks and two newly constructed domain-specific datasets, ClusterFusion achieves leading performance, with especially large gains over existing methods in specialized domains. Conclusion: ClusterFusion effectively combines the contextual adaptability of LLMs with the efficiency of embedding methods, offering a more flexible and customizable paradigm for text clustering and advancing domain-specific clustering. Abstract: Text clustering is a fundamental task in natural language processing, yet traditional clustering algorithms with pre-trained embeddings often struggle in domain-specific contexts without costly fine-tuning. Large language models (LLMs) provide strong contextual reasoning, yet prior work mainly uses them as auxiliary modules to refine embeddings or adjust cluster boundaries. We propose ClusterFusion, a hybrid framework that instead treats the LLM as the clustering core, guided by lightweight embedding methods. The framework proceeds in three stages: embedding-guided subset partition, LLM-driven topic summarization, and LLM-based topic assignment. This design enables direct incorporation of domain knowledge and user preferences, fully leveraging the contextual adaptability of LLMs. Experiments on three public benchmarks and two new domain-specific datasets demonstrate that ClusterFusion not only achieves state-of-the-art performance on standard tasks but also delivers substantial gains in specialized domains. To support future work, we release our newly constructed dataset and results on all benchmarks.

[6] LangSAT: A Novel Framework Combining NLP and Reinforcement Learning for SAT Solving

Muyu Pan,Matthew Walter,Dheeraj Kodakandla,Mahfuza Farooque

Main category: cs.CL

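TL;DR: Proposes the LangSAT framework, which automatically converts natural language descriptions into CNF and uses reinforcement learning to optimize the CDCL process, improving the accessibility and efficiency of SAT solving.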

Details Motivation: Existing SAT solvers require Conjunctive Normal Form (CNF) as input, which limits their use by non-experts; the goal is end-to-end automation from natural language to logical solving, improving accessibility and solving efficiency. Method: The framework has two parts, Lang2Logic and SmartSAT. Lang2Logic converts English descriptions into CNF expressions; SmartSAT uses reinforcement learning to optimize heuristic selection in CDCL, encoding clause-variable relationships as graph structures and extracting global features. Result: Lang2Logic handles natural language inputs of up to 450 words, and the generated CNFs can be solved effectively; SmartSAT's solving time is comparable to that of traditional CDCL heuristics. Conclusion: LangSAT achieves end-to-end automation from natural language to SAT solving, improving the accessibility and scalability of SAT technology for reasoning, formal verification, and debugging. Abstract: Our work presents a novel reinforcement learning (RL) based framework to optimize heuristic selection within the conflict-driven clause learning (CDCL) process, improving the efficiency of Boolean satisfiability (SAT) solving. The proposed system, LangSAT, bridges the gap between natural language inputs and propositional logic by converting English descriptions into Conjunctive Normal Form (CNF) expressions and solving them using an RL-enhanced CDCL SAT solver. Unlike existing SAT-solving platforms that require CNF as input, LangSAT enables users to input standard English descriptions, making SAT-solving more accessible. The framework comprises two key components: Lang2Logic, which translates English sentences into CNF expressions, and SmartSAT, an RL-based SAT solver. SmartSAT encodes clause-variable relationships as structured graph representations and extracts global features specific to the SAT problem. This implementation provides the RL agent with deeper contextual information, enabling SAT problems to be solved more efficiently. Lang2Logic was evaluated on diverse natural language inputs, processing descriptions up to 450 words. The generated CNFs were solved by SmartSAT, which demonstrated comparable performance to traditional CDCL heuristics with respect to solving time. The combined LangSAT framework offers a more accessible and scalable solution for SAT-solving tasks across reasoning, formal verification, and debugging.
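As a rough illustration of what "encoding clause-variable relationships as a graph" and "extracting global features" could look like, the sketch below turns a CNF formula (DIMACS-style integer literals) into a signed bipartite edge list plus a few summary statistics. The edge format and the specific features are assumptions; the abstract does not say which encoding or features SmartSAT uses.

```python
def clause_variable_graph(cnf):
    """Return (clause_index, variable, polarity) edges for a CNF formula.

    cnf: list of clauses, each a list of nonzero ints; a negative int is a
    negated variable, as in the DIMACS convention.
    """
    edges = []
    for clause_index, clause in enumerate(cnf):
        for literal in clause:
            edges.append((clause_index, abs(literal), 1 if literal > 0 else -1))
    return edges


def global_features(cnf):
    """Hypothetical global features an RL agent might observe."""
    variables = {abs(lit) for clause in cnf for lit in clause}
    return {
        "num_vars": len(variables),
        "num_clauses": len(cnf),
        "clause_var_ratio": len(cnf) / len(variables) if variables else 0.0,
    }


# (x1 OR NOT x2) AND (x2 OR x3) AND (NOT x1)
formula = [[1, -2], [2, 3], [-1]]
print(clause_variable_graph(formula))
print(global_features(formula))
```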

[7] MASE: Interpretable NLP Models via Model-Agnostic Saliency Estimation

Zhou Yang,Shunyan Luo,Jiazhen Zhu,Fang Jin

Main category: cs.CL

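TL;DR: Proposes MASE, a model-agnostic saliency estimation framework that explains text model predictions by applying normalized linear Gaussian perturbations on the embedding layer, and that outperforms existing methods on Delta Accuracy.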

Details Motivation: Deep neural networks perform well in NLP but lack interpretability, and traditional post-hoc explanation methods are hard to apply to discrete text data; a general explanation method that needs no access to model internals is required. Method: The MASE framework estimates input saliency using Normalized Linear Gaussian Perturbations (NLGP) on the embedding layer, giving local explanations for text models without relying on internal model information. Result: Experiments show that MASE outperforms other model-agnostic explanation methods on metrics such as Delta Accuracy and identifies key text features more accurately. Conclusion: MASE is an effective, model-agnostic text explanation method that improves understanding of NLP model decisions through perturbation analysis in the embedding space. Abstract: Deep neural networks (DNNs) have made significant strides in Natural Language Processing (NLP), yet their interpretability remains elusive, particularly when evaluating their intricate decision-making processes. Traditional methods often rely on post-hoc interpretations, such as saliency maps or feature visualization, which might not be directly applicable to the discrete nature of word data in NLP. Addressing this, we introduce the Model-agnostic Saliency Estimation (MASE) framework. MASE offers local explanations for text-based predictive models without necessitating in-depth knowledge of a model's internal architecture. By leveraging Normalized Linear Gaussian Perturbations (NLGP) on the embedding layer instead of raw word inputs, MASE efficiently estimates input saliency. Our results indicate MASE's superiority over other model-agnostic interpretation methods, especially in terms of Delta Accuracy, positioning it as a promising tool for elucidating the operations of text-based models in NLP.

[8] Sarcasm Detection on Reddit Using Classical Machine Learning and Feature Engineering

Subrata Karmaker

Main category: cs.CL

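TL;DR: This paper studies sarcasm detection with classical machine learning and explicit feature engineering, without neural networks or context. Using a subset of SARC 2.0 and combining word-level and character-level TF-IDF features with stylistic indicators, Naive Bayes and logistic regression perform best, with F1-scores around 0.57.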

Details Motivation: Sarcasm is hard for machines to recognize because the intended meaning often contradicts the literal wording, and existing methods mostly depend on complex models or context, lacking interpretability and lightness. Method: Classical machine learning methods (logistic regression, linear SVM, Naive Bayes, random forest) are combined with word-level and character-level TF-IDF features and simple stylistic features to detect sarcasm without parent-comment context. Result: Naive Bayes and logistic regression perform best, with F1-scores around 0.57 for sarcastic comments; overall performance is limited by the lack of conversational context. Conclusion: Although performance is limited, the approach provides a lightweight, interpretable, and reproducible baseline for sarcasm detection, suitable for resource-constrained settings or those requiring model transparency. Abstract: Sarcasm is common in online discussions, yet difficult for machines to identify because the intended meaning often contradicts the literal wording. In this work, I study sarcasm detection using only classical machine learning methods and explicit feature engineering, without relying on neural networks or context from parent comments. Using a 100,000-comment subsample of the Self-Annotated Reddit Corpus (SARC 2.0), I combine word-level and character-level TF-IDF features with simple stylistic indicators. Four models are evaluated: logistic regression, a linear SVM, multinomial Naive Bayes, and a random forest. Naive Bayes and logistic regression perform best, achieving F1-scores around 0.57 for sarcastic comments. Although the lack of conversational context limits performance, the results offer a clear and reproducible baseline for sarcasm detection using lightweight and interpretable methods.

[9] RapidUn: Influence-Driven Parameter Reweighting for Efficient Large Language Model Unlearning

Guoshenghui Zhao,Huawei Lin,Weijie Zhao

Main category: cs.CL

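TL;DR: Proposes RapidUn, an influence-driven, parameter-efficient framework for fast unlearning in large language models that substantially outperforms existing methods.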

Details Motivation: Existing unlearning methods are inefficient and unstable when the forget set is small or imbalanced, and retraining is costly. Method: A fast estimation module computes the influence of each sample, and these scores are mapped to adaptive update weights that guide selective parameter updates, forgetting harmful behavior while retaining general knowledge. Result: On Mistral-7B and Llama-3-8B, RapidUn is up to 100 times more efficient than full retraining and outperforms Fisher, GA, and LoReUn on the Dolly-15k and Alpaca-57k datasets, for both in-distribution and out-of-distribution forgetting. Conclusion: Influence-guided parameter reweighting is a scalable and interpretable paradigm for LLM unlearning. Abstract: Removing specific data influence from large language models (LLMs) remains challenging, as retraining is costly and existing approximate unlearning methods are often unstable. The challenge is exacerbated when the forget set is small or imbalanced. We introduce RapidUn, an influence-driven and parameter-efficient unlearning framework. It first estimates per-sample influence through a fast estimation module, then maps these scores into adaptive update weights that guide selective parameter updates, forgetting harmful behavior while retaining general knowledge. On Mistral-7B and Llama-3-8B across Dolly-15k and Alpaca-57k, RapidUn achieves up to 100 times higher efficiency than full retraining and consistently outperforms Fisher, GA, and LoReUn on both in-distribution and out-of-distribution forgetting. These results establish influence-guided parameter reweighting as a scalable and interpretable paradigm for LLM unlearning.

[10] MSME: A Multi-Stage Multi-Expert Framework for Zero-Shot Stance Detection

Yuanshuo Zhang,Aohua Li,Bo Chen,Jingbo Sun,Xiaobing Zhao

Main category: cs.CL

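TL;DR: Proposes MSME, a multi-stage, multi-expert framework for zero-shot stance detection in complex real-world scenarios, which achieves state-of-the-art performance through three stages: knowledge preparation, expert reasoning, and decision aggregation.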

Details Motivation: Existing LLM-based methods still struggle in complex real-world scenarios, for example when dynamic background knowledge is needed, when stance targets involve compound entities or events, or when rhetorical devices such as irony obscure the true intent. Method: The framework has three stages. Knowledge preparation retrieves relevant background knowledge and clarifies the stance labels; expert reasoning has a Knowledge Expert, a Label Expert, and a Pragmatic Expert analyze the input from different angles; decision aggregation has a Meta-Judge integrate all expert analyses. Result: Experiments on three public datasets show that MSME achieves state-of-the-art performance on zero-shot stance detection. Conclusion: Through multi-stage, multi-expert collaboration, MSME effectively improves zero-shot stance detection in complex scenarios. Abstract: LLM-based approaches have recently achieved impressive results in zero-shot stance detection. However, they still struggle in complex real-world scenarios, where stance understanding requires dynamic background knowledge, target definitions involve compound entities or events that must be explicitly linked to stance labels, and rhetorical devices such as irony often obscure the author's actual intent. To address these challenges, we propose MSME, a Multi-Stage, Multi-Expert framework for zero-shot stance detection. MSME consists of three stages: (1) Knowledge Preparation, where relevant background knowledge is retrieved and stance labels are clarified; (2) Expert Reasoning, involving three specialized modules: the Knowledge Expert distills salient facts and reasons from a knowledge perspective, the Label Expert refines stance labels and reasons accordingly, and the Pragmatic Expert detects rhetorical cues such as irony to infer intent from a pragmatic angle; (3) Decision Aggregation, where a Meta-Judge integrates all expert analyses to produce the final stance prediction. Experiments on three public datasets show that MSME achieves state-of-the-art performance across the board.

[11] UW-BioNLP at ChemoTimelines 2025: Thinking, Fine-Tuning, and Dictionary-Enhanced LLM Systems for Chemotherapy Timeline Extraction

Tianmai M. Zhang,Zhaoyi Sun,Sihang Zeng,Chenxi Li,Neil F. Abernethy,Barbara D. Lam,Fei Xia,Meliha Yetisgen

Main category: cs.CL

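TL;DR: This paper studies how to construct timelines of systemic anticancer treatment from cancer patients' electronic health records, focusing on using large language models with different strategies (such as supervised fine-tuning and direct preference optimization) to extract chemotherapy events from clinical notes and generate patient-level timelines.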

Details Motivation: To improve the accuracy and efficiency of automatically constructing chemotherapy timelines for cancer patients from unstructured clinical text, supporting clinical decision support and retrospective research. Method: A two-step workflow is used: a large language model (LLM) first extracts chemotherapy events from individual clinical notes, and an algorithm then normalizes the events and aggregates them into patient-level timelines. Chain-of-thought, supervised fine-tuning, direct preference optimization, and dictionary-based lookup are compared. Result: Several approaches perform well on the test set, with the fine-tuned Qwen3-14B model achieving the best official score of 0.678. Conclusion: Supervised fine-tuned LLMs perform strongly on chemotherapy timeline extraction, and the proposed methods can serve as a reference for similar future tasks. Abstract: The ChemoTimelines shared task benchmarks methods for constructing timelines of systemic anticancer treatment from electronic health records of cancer patients. This paper describes our methods, results, and findings for subtask 2: generating patient chemotherapy timelines from raw clinical notes. We evaluated strategies involving chain-of-thought thinking, supervised fine-tuning, direct preference optimization, and dictionary-based lookup to improve timeline extraction. All of our approaches followed a two-step workflow, wherein an LLM first extracted chemotherapy events from individual clinical notes, and then an algorithm normalized and aggregated events into patient-level timelines. Each specific method differed in how the associated LLM was utilized and trained. Multiple approaches yielded competitive performances on the test set leaderboard, with fine-tuned Qwen3-14B achieving the best official score of 0.678. Our results and analyses could provide useful insights for future attempts on this task as well as the design of similar tasks.
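The second step of the workflow (normalize note-level events, then aggregate them per patient) can be sketched as follows. The tuple layout `(patient_id, drug, relation, iso_date)`, the relation labels, and the placeholder drug names are hypothetical; the shared task's actual event schema and normalization rules are not given in the abstract.

```python
from datetime import date


def build_timelines(events):
    """Normalize and deduplicate note-level events into patient-level timelines.

    events: iterable of (patient_id, drug, relation, iso_date) tuples, assumed
    to come from an upstream LLM extraction step.
    """
    timelines = {}
    for patient_id, drug, relation, iso_date in events:
        # Normalization: trim whitespace, lowercase, and parse the date.
        key = (drug.strip().lower(), relation.strip().lower(), date.fromisoformat(iso_date))
        # A set removes the same event mentioned in several notes.
        timelines.setdefault(patient_id, set()).add(key)
    # Order each timeline chronologically, then by drug and relation.
    return {
        pid: sorted(evts, key=lambda e: (e[2], e[0], e[1]))
        for pid, evts in timelines.items()
    }


extracted = [
    ("p1", "Drug_A ", "begins-on", "2020-01-05"),
    ("p1", "drug_a", "begins-on", "2020-01-05"),  # duplicate from another note
    ("p1", "drug_b", "ends-on", "2020-01-01"),
]
print(build_timelines(extracted))
```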

[12] EvoEdit: Lifelong Free-Text Knowledge Editing through Latent Perturbation Augmentation and Knowledge-driven Parameter Fusion

Pengfei Cao,Zeao Ji,Daojian Zeng,Jun Zhao,Kang Liu

Main category: cs.CL

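TL;DR: This paper introduces a new task, Lifelong Free-text Knowledge Editing (LF-Edit), which aims to continually update the knowledge of large language models through natural language, and builds a large-scale benchmark, MRLF-Bench, with a multi-rank evaluation framework. The authors propose EvoEdit, which combines Latent Perturbation Augmentation with Knowledge-driven Parameter Fusion to better balance knowledge injection against retention of prior knowledge, substantially outperforming existing methods.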

Details Motivation: Existing knowledge editing methods depend on structured triplets, which do not match the free-text input used in LLM pretraining and fail to capture complex relationships among facts; moreover, most methods support only one-time updates and lack support for continual, long-term editing. A new knowledge editing paradigm that handles natural language and supports lifelong learning is therefore needed. Method: The LF-Edit task and the MRLF-Bench benchmark are proposed, with 16,835 free-text edit requests and a cognitively inspired four-level evaluation framework (memorization, understanding, constrained comprehension, reasoning). To address the task, EvoEdit improves the injection of new knowledge through Latent Perturbation Augmentation and reduces forgetting of prior knowledge through Knowledge-driven Parameter Fusion. Result: Experiments show that EvoEdit substantially outperforms existing knowledge editing methods on MRLF-Bench, with stronger knowledge retention and overall performance in long-term continual editing. The multi-rank evaluation shows better results at the memorization, understanding, and reasoning levels. Conclusion: LF-Edit offers a research direction for continual knowledge updating in LLMs that is closer to practical use, and EvoEdit effectively addresses the trade-off between injecting new knowledge and forgetting prior knowledge, moving knowledge editing toward more practical, dynamic settings. Abstract: Adjusting the outdated knowledge of large language models (LLMs) after deployment remains a major challenge. This difficulty has spurred the development of knowledge editing, which seeks to accurately and efficiently modify a model's internal (parametric) knowledge without retraining it from scratch. However, existing methods suffer from two limitations. First, they depend on structured triplets that are misaligned with the free-text nature of LLM pretraining and fail to capture the nuanced relationships among facts. Second, they typically support one-time knowledge updates, with relatively limited research on the problem of sequential or lifelong editing. To address these gaps, we propose a new task, Lifelong Free-text Knowledge Editing (LF-Edit), which enables models to incorporate updates expressed in natural language and supports continual editing over time. Despite its promise, LF-Edit faces the dual challenge of integrating new knowledge while mitigating the forgetting of prior information. To foster research on this new task, we construct a large-scale benchmark, Multi-Rank Lifelong Free-text Editing Benchmark (MRLF-Bench), containing 16,835 free-text edit requests. We further design a cognitively inspired multi-rank evaluation framework encompassing four levels: memorization, understanding, constrained comprehension, and reasoning. To tackle the challenges inherent in LF-Edit, we introduce a novel approach named EvoEdit that enhances knowledge injection through Latent Perturbation Augmentation and preserves prior information via Knowledge-driven Parameter Fusion. Experimental results demonstrate that EvoEdit substantially outperforms existing knowledge editing methods on the proposed LF-Edit task.

[13] AdmTree: Compressing Lengthy Context with Adaptive Semantic Trees

Yangning Li,Shaoshen Chen,Yinghui Li,Yankai Chen,Hai-Tao Zheng,Hui Wang,Wenhao Jiang,Philip S. Yu

Main category: cs.CL

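TL;DR: AdmTree proposes an adaptive, hierarchical context compression framework that dynamically segments the input by information density and builds a semantic binary tree from gist tokens, alleviating the quadratic complexity of self-attention on long texts while maintaining high semantic fidelity.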

Details Motivation: Existing context compression methods lose local detail, suffer from positional bias, or fail to capture long-range semantic dependencies on long texts, which limits the ability of large language models to handle long contexts. Method: AdmTree dynamically segments the input according to information density, uses gist tokens to summarize variable-length segments, and builds a semantic binary tree on top of a frozen LLM with a lightweight aggregation mechanism, enabling efficient hierarchical abstraction of the context. Result: AdmTree preserves fine-grained detail and global semantic coherence while mitigating positional bias and adapting dynamically to content, and so retains the semantic information of long contexts more effectively. Conclusion: AdmTree is an efficient context compression method with high semantic fidelity, offering a scalable solution with low parameter overhead for long-context processing in LLMs. Abstract: The quadratic complexity of self-attention constrains Large Language Models (LLMs) in processing long contexts, a capability essential for many advanced applications. Context compression aims to alleviate this computational bottleneck while retaining critical semantic information. However, existing approaches often fall short: explicit methods may compromise local detail, whereas implicit methods can suffer from positional biases, information degradation, or an inability to capture long-range semantic dependencies. We propose AdmTree, a novel framework for adaptive, hierarchical context compression with a central focus on preserving high semantic fidelity while maintaining efficiency. AdmTree dynamically segments input based on information density, utilizing gist tokens to summarize variable-length segments as the leaves of a semantic binary tree. This structure, together with a lightweight aggregation mechanism and a frozen backbone LLM (thereby minimizing new trainable parameters), enables efficient hierarchical abstraction of the context. By preserving fine-grained details alongside global semantic coherence, mitigating positional bias, and dynamically adapting to content, AdmTree robustly retains the semantic information of long contexts.
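The tree structure itself is easy to picture: segment summaries form the leaves, and adjacent nodes are merged pairwise until one root remains. The sketch below shows only that bottom-up construction. The `aggregate` callback stands in for the paper's lightweight aggregation mechanism, and string concatenation stands in for gist-token representations; both are assumptions for illustration.

```python
def build_binary_tree(leaves, aggregate):
    """Merge leaf summaries pairwise, level by level, into a binary tree.

    leaves: non-empty list of segment summaries (placeholders for gist tokens).
    aggregate: function combining a left and a right summary into a parent.
    """
    if not leaves:
        raise ValueError("need at least one leaf")
    level = [{"summary": s, "children": []} for s in leaves]
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 2):
            pair = level[i:i + 2]
            if len(pair) == 1:
                # An unpaired node is carried up to the next level unchanged.
                next_level.append(pair[0])
            else:
                next_level.append({
                    "summary": aggregate(pair[0]["summary"], pair[1]["summary"]),
                    "children": pair,
                })
        level = next_level
    return level[0]


root = build_binary_tree(["seg1", "seg2", "seg3"], lambda left, right: left + "+" + right)
print(root["summary"])
```

With n leaves the tree has depth about log2(n), so a query can in principle consult coarse summaries near the root and descend only where detail is needed.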

[14] ADAPT: Learning Task Mixtures for Budget-Constrained Instruction Tuning

Pritam Kadasi,Abhishek Upperwal,Mayank Singh

Main category: cs.CL

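TL;DR: ADAPT is a meta-learning algorithm that learns task sampling proportions for multi-task instruction tuning under an explicit token budget, dynamically adjusting the task distribution to optimize downstream performance; it matches or beats static mixtures while using fewer training tokens.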

Details Motivation: In conventional multi-task learning, task weights are usually set by hand, which lacks adaptivity and can waste resources or cause task collapse; the goal is to allocate training resources automatically according to each task's actual utility. Method: The ADAPT algorithm maintains a continuous distribution over tasks and updates it via meta-gradients of a smooth worst-case validation objective, yielding an adaptive curriculum under a given token budget. Result: Experiments on three open-weight LLMs of about 1B parameters show that, using only 1% to 10% of the supervised tokens, ADAPT matches or beats strong baselines with uniform and size-proportional mixing on 11 out-of-domain benchmarks, and allocates budget more efficiently to harder, benchmark-aligned tasks. Conclusion: ADAPT effectively learns a task sampling strategy and achieves comparable or better performance under a limited training budget, providing a data-efficient, adaptive solution for multi-task instruction tuning. Abstract: We propose ADAPT, a meta-learning algorithm that learns task sampling proportions under an explicit token budget for multi-task instruction tuning. Instead of fixing task weights by hand, ADAPT maintains a continuous distribution over tasks and updates it via meta-gradients of a smooth worst-case validation objective, inducing an adaptive curriculum that allocates more tokens to useful tasks while avoiding collapse. We instantiate ADAPT on three ~1B-parameter open-weight LLMs (Gemma-3-1B, LLaMA-3.2-1B, Qwen-0.6B), training on 20 Natural Instructions task types under budgets of 1%, 5%, and 10% of the available supervised tokens, and compare against strong supervised fine-tuning baselines with uniform and size-proportional mixing. In evaluations on 11 out-of-domain benchmarks spanning reasoning, reading comprehension, code generation, and instruction following, we find that ADAPT matches or slightly improves average downstream performance relative to the best static mixture, while using fewer effective training tokens and reallocating budget toward harder, benchmark-aligned tasks.
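Two ingredients named in the abstract can be shown in isolation: a softmax-parameterized distribution over tasks and a smooth worst-case objective (here a temperature-scaled log-sum-exp of per-task validation losses). The update rule below is a deliberately simplified stand-in: it nudges task logits by the gradient of the smooth maximum with respect to the losses, rather than computing true meta-gradients through training, which the abstract does not detail. The learning rate, temperature, and loss values are all made up for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def smooth_worst_case(losses, tau=1.0):
    """tau * log(sum(exp(loss / tau))): a smooth upper bound on max(losses)."""
    m = max(losses)
    return m + tau * math.log(sum(math.exp((l - m) / tau) for l in losses))


def update_task_logits(logits, val_losses, lr=0.5, tau=1.0):
    """Heuristic stand-in for a meta-gradient step (an assumption, not ADAPT itself).

    The gradient of the smooth maximum w.r.t. each loss is softmax(losses / tau),
    so tasks with higher validation loss receive more sampling mass.
    """
    pressure = softmax([l / tau for l in val_losses])
    return [z + lr * p for z, p in zip(logits, pressure)]


logits = [0.0, 0.0, 0.0]
val_losses = [0.2, 1.5, 0.4]  # illustrative per-task validation losses
logits = update_task_logits(logits, val_losses)
print(softmax(logits))                 # new task sampling proportions
print(smooth_worst_case(val_losses))   # objective value being smoothed
```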

Wenjin Liu,Haoran Luo,Xin Feng,Xiang Ji,Lijuan Zhou,Rui Mao,Jiapu Wang,Shirui Pan,Erik Cambria

Main category: cs.CL

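TL;DR: This paper proposes LexGenius, an expert-level Chinese legal benchmark for systematically evaluating the legal general intelligence (Legal GI) of large language models (LLMs). The benchmark follows a Dimension-Task-Ability framework covering seven dimensions, eleven tasks, and twenty abilities, and builds multiple-choice questions from real cases and exam questions, with combined manual and LLM review to reduce data leakage risks. An evaluation of 12 mainstream LLMs finds significant disparities across legal intelligence abilities, with even the best models lagging well behind human legal professionals. The authors argue that LexGenius can help advance legal GI.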

Details Motivation: Existing legal AI benchmarks are too result-oriented and lack a systematic evaluation of LLMs' legal understanding, reasoning, and decision-making, which hinders the development of legal general intelligence. A more comprehensive, professional benchmark is needed to measure and improve LLMs' intelligence in the legal domain. Method: The LexGenius benchmark follows a Dimension-Task-Ability framework covering seven dimensions, eleven tasks, and twenty abilities; multiple-choice questions are built from real legal cases and judicial exam questions, with data quality controlled by manual annotation combined with LLM review and multiple rounds of checks to ensure accuracy and reliability; 12 state-of-the-art LLMs are evaluated. Result: The evaluation of 12 mainstream LLMs shows that models differ significantly across legal intelligence abilities; even the best-performing model clearly lags behind human legal professionals on most tasks; and LexGenius is both effective and challenging as an evaluation of legal general intelligence. Conclusion: LexGenius is a reliable and challenging Chinese legal benchmark that systematically evaluates the legal general intelligence of LLMs, exposes the limitations of current models, and provides direction and tooling for future legal AI. Abstract: Legal general intelligence (GI) refers to artificial intelligence (AI) that encompasses legal understanding, reasoning, and decision-making, simulating the expertise of legal experts across domains. However, existing benchmarks are result-oriented and fail to systematically evaluate the legal intelligence of large language models (LLMs), hindering the development of legal GI. To address this, we propose LexGenius, an expert-level Chinese legal benchmark for evaluating legal GI in LLMs. It follows a Dimension-Task-Ability framework, covering seven dimensions, eleven tasks, and twenty abilities. We use recent legal cases and exam questions to create multiple-choice questions with a combination of manual and LLM reviews to reduce data leakage risks, ensuring accuracy and reliability through multiple rounds of checks. We evaluate 12 state-of-the-art LLMs using LexGenius and conduct an in-depth analysis. We find significant disparities across legal intelligence abilities for LLMs, with even the best LLMs lagging behind human legal professionals. We believe LexGenius can assess the legal intelligence abilities of LLMs and enhance legal GI development. Our project is available at https://github.com/QwenQKing/LexGenius.

[16] Geschlechtsübergreifende Maskulina im Sprachgebrauch Eine korpusbasierte Untersuchung zu lexemspezifischen Unterschieden

Carolin Mueller-Spitzer,Samira Ochs,Jan Oliver Ruediger,Sascha Wolfer

Main category: cs.CL

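TL;DR: Based on a large press corpus, this study analyzes the distribution and linguistic characteristics of generic masculines (GM) in German, finding that their use varies by lexeme and that they occur mostly in the plural and in indefinite noun phrases, which challenges the traditional view that they denote entire classes of people.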

Details Motivation: To examine how generic masculines are actually used in authentic contexts, closing the gap between psycholinguistic studies and real corpus data, and to respond to the debate over their gender-neutrality. Method: The full inflectional paradigm of 21 German personal nouns is manually annotated, covering 6,195 corpus tokens, followed by corpus-based quantitative and qualitative analysis. Result: GM use differs considerably between lexical items, especially between passive role nouns and prestige-related personal nouns; grammatically, GM occurs predominantly in the plural and in indefinite expressions, and is rarely used to denote entire classes of people. Conclusion: The actual use of generic masculines is more complex than previously assumed, and their referential function must be understood in light of the specific lexeme and grammatical environment; the findings help improve the linguistic stimuli used in psycholinguistic experiments. Abstract: This study examines the distribution and linguistic characteristics of generic masculines (GM) in contemporary German press texts. The use of masculine personal nouns to refer to mixed-gender groups or unspecified individuals has been widely debated in academia and the public, with conflicting perspectives on its gender-neutrality. While psycholinguistic studies suggest that GM is more readily associated with male referents, corpus-based analyses of its actual use remain scarce. We investigate GM in a large corpus of press texts, focusing on lexeme-specific differences across different types of personal nouns. We conducted manual annotations of the whole inflectional paradigm of 21 personal nouns, resulting in 6,195 annotated tokens. Our findings reveal considerable differences between lexical items, especially between passive role nouns and prestige-related personal nouns. On a grammatical level, we find that GM occurs predominantly in the plural and in indefinite noun phrases. Furthermore, our data shows that GM is not primarily used to denote entire classes of people, as has been previously claimed. By providing an empirical insight into the use of GM in authentic written language, we contribute to a more nuanced understanding of its forms and manifestations. These findings provide a solid basis for aligning linguistic stimuli in psycholinguistic studies more closely with real-world language use.

[17] OsmT: Bridging OpenStreetMap Queries and Natural Language with Open-source Tag-aware Language Models

Zhuoyue Wan,Wentao Hu,Chen Jason Zhang,Yuanfeng Song,Shuaimin Li,Ruiqiang Xiao,Xiao-Yong Wei,Raymond Chi-Wing Wong

Main category: cs.CL

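TL;DR: This paper presents OsmT, an open-source, tag-aware language model that bridges natural language and OverpassQL (the query language of OpenStreetMap). It introduces a Tag Retrieval Augmentation (TRA) mechanism to improve the accuracy and structural validity of generated queries, defines the reverse task OverpassQL-to-Text to improve interpretability, and outperforms strong baselines on a public benchmark.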

Details Motivation: Existing methods for translating natural language into structured query languages mostly rely on large closed-source models, which have high inference costs, low transparency, and are hard to deploy in lightweight settings; geospatial queries also have complex topological structure, which challenges both semantic understanding and structural modeling. Method: The OsmT model combines tag awareness with Tag Retrieval Augmentation (TRA), using the hierarchical and relational dependencies among OSM tags to improve query generation; a reverse task, OverpassQL-to-Text, generates natural language explanations to improve interpretability; the model is lightweight and built on an open-source architecture. Result: Experiments on a public benchmark show that OsmT outperforms strong baselines on both NL-to-OverpassQL and OverpassQL-to-Text with significantly fewer parameters, demonstrating higher efficiency and competitiveness. Conclusion: OsmT shows that lightweight open-source language models, augmented with domain knowledge such as tag retrieval, can effectively bridge natural language and structured query languages in schema-rich geospatial environments, with good accuracy, transparency, and deployment potential. Abstract: Bridging natural language and structured query languages is a long-standing challenge in the database community. While recent advances in language models have shown promise in this direction, existing solutions often rely on large-scale closed-source models that suffer from high inference costs, limited transparency, and lack of adaptability for lightweight deployment. In this paper, we present OsmT, an open-source tag-aware language model specifically designed to bridge natural language and Overpass Query Language (OverpassQL), a structured query language for accessing large-scale OpenStreetMap (OSM) data. To enhance the accuracy and structural validity of generated queries, we introduce a Tag Retrieval Augmentation (TRA) mechanism that incorporates contextually relevant tag knowledge into the generation process. This mechanism is designed to capture the hierarchical and relational dependencies present in the OSM database, addressing the topological complexity inherent in geospatial query formulation. In addition, we define a reverse task, OverpassQL-to-Text, which translates structured queries into natural language explanations to support query interpretation and improve user accessibility. We evaluate OsmT on a public benchmark against strong baselines and observe consistent improvements in both query generation and interpretation. Despite using significantly fewer parameters, our model achieves competitive accuracy, demonstrating the effectiveness of open-source pre-trained language models in bridging natural language and structured query languages within schema-rich geospatial environments.

[18] SignRoundV2: Closing the Performance Gap in Extremely Low-Bit Post-Training Quantization for LLMs

Wenhua Cheng,Weiwei Zhang,Heng Guo,Haihao Shen

Main category: cs.CL

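TL;DR: SignRoundV2 is an efficient post-training quantization framework that substantially improves LLM quantization quality at extremely low bit widths (e.g., 2-4 bits), approaching full-precision performance without mixed precision.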

Details Motivation: Extreme low-bit quantization is critical for deploying LLMs, but existing methods often suffer severe performance degradation at 2 bits or 4 bits, so more effective quantization schemes are needed. Method: SignRoundV2 has two core components: a fast sensitivity metric that combines gradient information with quantization-induced deviations to guide layer-wise bit allocation, and a lightweight pre-tuning search over quantization scales to improve extremely low-bit results. Result: Experiments show about 1 percent variance at 4-5 bits and strong performance even at 2 bits, substantially narrowing the gap with full-precision models. Conclusion: SignRoundV2 is an efficient and practical low-bit quantization framework that offers a viable path to efficient LLM deployment. Abstract: Extreme low-bit quantization is critical for efficiently deploying Large Language Models (LLMs), yet it often leads to severe performance degradation at 2 bits and even 4 bits (e.g., MXFP4). We present SignRoundV2, a post-training quantization framework that is highly effective even without mixed-precision. SignRoundV2 introduces (1) a fast sensitivity metric that combines gradient information with quantization-induced deviations to guide layer-wise bit allocation, and (2) a lightweight pre-tuning search for quantization scales to improve extremely low-bit quantization. These components allow SignRoundV2 to close the gap with full-precision models. Extensive experiments indicate that our method sustains competitive accuracy for LLMs, achieving production-grade performance with about 1 percent variance at 4-5 bits and strong results even at 2 bits. The implementation is available at https://github.com/intel/auto-round.

[19] Model Whisper: Steering Vectors Unlock Large Language Models' Potential in Test-time

Xinyue Kang,Diwei Shi,Li Chen

Main category: cs.CL

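TL;DR: The paper proposes lightweight Test-Time Steering Vectors (TTSV): an optimizable vector is prepended to the input while the LLM's parameters stay frozen, efficiently activating the model's reasoning ability.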

Details Motivation: Existing test-time adaptation methods often fine-tune model parameters, which is computationally expensive and may damage the pre-trained model's existing abilities; an efficient method that leaves the parameters unchanged is needed to unlock an LLM's reasoning potential on specific tasks. Method: A lightweight Test-Time Steering Vector (TTSV) is prepended to the input; with the LLM's parameters fully frozen, the TTSV is optimized on test data to minimize the model's output entropy, steering the model towards an internal state of higher confidence. Result: On MATH500, Qwen2.5-Math-7B achieves a 45.88% relative performance gain and Qwen3-4B a 16.22% relative gain; experiments show the method works on both base and reasoning-enhanced models, and the steering vectors transfer well across tasks. Conclusion: TTSV is an efficient, plug-and-play test-time adaptation method that substantially improves LLM performance on target tasks without adjusting model parameters, and it generalizes and transfers well. Abstract: It is a critical challenge to efficiently unlock the powerful reasoning potential of Large Language Models (LLMs) for specific tasks or new distributions. Existing test-time adaptation methods often require tuning model parameters, which is not only computationally expensive but also risks degrading the model's pre-existing abilities. To address this, we introduce a lightweight component, Test-Time Steering Vectors (TTSV), which is prepended to the input while keeping the LLM's parameters entirely frozen. By optimizing the TTSV on test data to minimize the model's output entropy, we steer the model towards an internal state of higher confidence, activating its inherent abilities most relevant to the current task. TTSV is both lightweight and highly efficient to optimize, making it a true plug-and-play enhancement. Extensive experiments validate our approach's effectiveness on both base models and reasoning-enhanced models. For instance, on the MATH500 task, TTSV achieves a 45.88% relative performance gain on the Qwen2.5-Math-7B model and a 16.22% relative gain on the Qwen3-4B model. Furthermore, our approach exhibits robust generalization, with its steering vectors proving highly transferable across diverse tasks.
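The entropy-minimization idea in the TTSV abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the "frozen model" is a fixed linear map `W`, the steering vector is added to the input as a stand-in for prepending, and the gradient is estimated by finite differences. It is not the authors' implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical frozen "model": a fixed linear map from a 2-d input to 3 logits.
W = [[1.0, -0.5], [0.2, 0.8], [-0.7, 0.3]]

def model_logits(x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def objective(steer, x):
    # Toy stand-in: the steering vector is ADDED to the input, not prepended.
    shifted = [xi + si for xi, si in zip(x, steer)]
    return entropy(softmax(model_logits(shifted)))

def optimize_steering(x, steps=200, lr=0.5, eps=1e-4):
    """Gradient descent on output entropy; only `steer` changes, W stays frozen."""
    steer = [0.0] * len(x)
    for _ in range(steps):
        base = objective(steer, x)
        grad = []
        for i in range(len(steer)):
            bumped = list(steer)
            bumped[i] += eps
            grad.append((objective(bumped, x) - base) / eps)
        steer = [s - lr * g for s, g in zip(steer, grad)]
    return steer
```

Calling `optimize_steering([0.1, 0.1])` returns a vector that lowers `objective` relative to the zero vector, while `W` is never updated, which mirrors the frozen-parameter setting described above.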

[20] EtCon: Edit-then-Consolidate for Reliable Knowledge Editing

Ruilin Li,Yibin Wang,Wenhong Zhu,Chenglin Li,Jinghao Zhang,Chenliang Li,Junchi Yan,Jiaqi Wang

Main category: cs.CL

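TL;DR: This paper proposes Edit-then-Consolidate, a new knowledge editing paradigm that reduces overfitting with TPSFT and consolidates knowledge with GRPO, improving knowledge updates in LLMs under real-world conditions.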

Details Motivation: Existing knowledge editing methods perform well under idealized evaluation but poorly in practical lifelong learning scenarios, where they overfit and integrate new knowledge insufficiently. Method: The Edit-then-Consolidate framework first uses Targeted Proximal Supervised Fine-Tuning (TPSFT) to localize the edit while limiting policy drift, and then applies Group Relative Policy Optimization (GRPO) to optimize reasoning behavior at the trajectory level, consolidating the edited knowledge. Result: Experiments show that the method significantly improves editing reliability and generalization under real-world evaluation while better preserving locality and pre-trained capabilities. Conclusion: Edit-then-Consolidate bridges the gap between theoretical knowledge editing and practical use, offering a more practical solution for updating knowledge in large models. Abstract: Knowledge editing aims to update specific facts in large language models (LLMs) without full retraining. Prior efforts sought to tune the knowledge layers of LLMs, proving effective for making selective edits. However, a significant gap exists between their performance in controlled, teacher-forcing evaluations and their real-world effectiveness in lifelong learning scenarios, which greatly limits their practical applicability. This work's empirical analysis reveals two recurring issues associated with this gap: (1) Most traditional methods lead the edited model to overfit to the new fact, thereby degrading pre-trained capabilities; (2) There is a critical absence of a knowledge consolidation stage, leaving new facts insufficiently integrated into LLMs' inference-time behavior under autoregressive generation, thereby leading to a mismatch between parametric knowledge and actual generation behavior. To this end, we propose Edit-then-Consolidate, a novel knowledge editing paradigm that aims to bridge the gap between theoretical knowledge editing methods and their real-world applicability. Specifically, (1) our framework mitigates overfitting via Targeted Proximal Supervised Fine-Tuning (TPSFT) that localizes the edit via a trust-region objective to limit policy drift; (2) Then, a consolidation stage using Group Relative Policy Optimization (GRPO) aligns the edited knowledge with CoT-based inference policy by optimizing trajectory-level behavior under comprehensive reward signals. 
Extensive experiments demonstrate our framework consistently improves editing reliability and generalization under real-world evaluations, while better preserving locality and pre-trained capabilities.
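For readers unfamiliar with GRPO, the group-relative part can be sketched as follows. This is the commonly described normalization (each sampled response's reward is standardized against the other responses to the same prompt); the reward values and the `eps` constant are illustrative assumptions, and nothing here reflects the specific reward signals used in the paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt.

    No learned value network is involved: the group mean acts as the baseline.
    `eps` (an assumed constant) avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four sampled trajectories of one prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that score above the group mean receive positive advantages and the rest negative ones; when every response in a group gets the same reward, all advantages are zero and the group contributes no learning signal.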

[21] Challenging the Abilities of Large Language Models in Italian: a Community Initiative

Malvina Nissim,Danilo Croce,Viviana Patti,Pierpaolo Basile,Giuseppe Attanasio,Elio Musacchio,Matteo Rinaldi,Federico Borazio,Maria Francis,Jacopo Gili,Daniel Scalena,Begoña Altuna,Ekhi Azurmendi,Valerio Basile,Luisa Bentivogli,Arianna Bisazza,Marianna Bolognesi,Dominique Brunato,Tommaso Caselli,Silvia Casola,Maria Cassese,Mauro Cettolo,Claudia Collacciani,Leonardo De Cosmo,Maria Pia Di Buono,Andrea Esuli,Julen Etxaniz,Chiara Ferrando,Alessia Fidelangeli,Simona Frenda,Achille Fusco,Marco Gaido,Andrea Galassi,Federico Galli,Luca Giordano,Mattia Goffetti,Itziar Gonzalez-Dios,Lorenzo Gregori,Giulia Grundler,Sandro Iannaccone,Chunyang Jiang,Moreno La Quatra,Francesca Lagioia,Soda Marem Lo,Marco Madeddu,Bernardo Magnini,Raffaele Manna,Fabio Mercorio,Paola Merlo,Arianna Muti,Vivi Nastase,Matteo Negri,Dario Onorati,Elena Palmieri,Sara Papi,Lucia Passaro,Giulia Pensa,Andrea Piergentili,Daniele Potertì,Giovanni Puccetti,Federico Ranaldi,Leonardo Ranaldi,Andrea Amelio Ravelli,Martina Rosola,Elena Sofia Ruzzetti,Giuseppe Samo,Andrea Santilli,Piera Santin,Gabriele Sarti,Giovanni Sartor,Beatrice Savoldi,Antonio Serino,Andrea Seveso,Lucia Siciliani,Paolo Torroni,Rossella Varvara,Andrea Zaninello,Asya Zanollo,Fabio Massimo Zanzotto,Kamyar Zeinalipour,Andrea Zugarini

Main category: cs.CL

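TL;DR: CALAMITA is the first large-scale, community-driven benchmark for evaluating LLMs in Italian. It covers more than 20 tasks and almost 100 subtasks, emphasizes methodology, standardized pipelines, and fine-grained metrics, and aims to provide a sustainable evaluation framework for non-English languages.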

Details Motivation: LLM evaluation concentrates on English; languages such as Italian lack systematic, diverse, and sustainable evaluation, as well as a shared methodology and mechanisms for community collaboration. Method: More than 80 researchers from academia, industry, and the public sector jointly designed and integrated over 20 tasks and almost 100 subtasks covering linguistic competence, commonsense reasoning, factual consistency, and other dimensions, and built a unified evaluation pipeline that supports heterogeneous datasets and diverse metrics. Result: The project releases the most comprehensive Italian benchmark to date, evaluates four open-weight LLMs, reveals their strengths and weaknesses across abilities, identifies challenges in task-specific evaluation, and establishes an extensible framework for continuous integration of new tasks and models. Conclusion: CALAMITA is both a rich language benchmark and a transferable community-driven evaluation paradigm that emphasizes fine-grained metrics, standardized pipelines, and collaborative governance, offering a blueprint for building similar evaluations in other languages. Abstract: The rapid progress of Large Language Models (LLMs) has transformed natural language processing and broadened its impact across research and society. Yet, systematic evaluation of these models, especially for languages beyond English, remains limited. "Challenging the Abilities of LAnguage Models in ITAlian" (CALAMITA) is a large-scale collaborative benchmarking initiative for Italian, coordinated under the Italian Association for Computational Linguistics. Unlike existing efforts that focus on leaderboards, CALAMITA foregrounds methodology: it federates more than 80 contributors from academia, industry, and the public sector to design, document, and evaluate a diverse collection of tasks, covering linguistic competence, commonsense reasoning, factual consistency, fairness, summarization, translation, and code generation. Through this process, we not only assembled a benchmark of over 20 tasks and almost 100 subtasks, but also established a centralized evaluation pipeline that supports heterogeneous datasets and metrics. We report results for four open-weight LLMs, highlighting systematic strengths and weaknesses across abilities, as well as challenges in task-specific evaluation. Beyond quantitative results, CALAMITA exposes methodological lessons: the necessity of fine-grained, task-representative metrics, the importance of harmonized pipelines, and the benefits and limitations of broad community engagement. CALAMITA is conceived as a rolling benchmark, enabling continuous integration of new tasks and models. 
This makes it both a resource -- the most comprehensive and diverse benchmark for Italian to date -- and a framework for sustainable, community-driven evaluation. We argue that this combination offers a blueprint for other languages and communities seeking inclusive and rigorous LLM evaluation practices.

[22] AdiBhashaa: A Community-Curated Benchmark for Machine Translation into Indian Tribal Languages

Pooja Singh,Sandeep Kumar

Main category: cs.CL

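TL;DR: AdiBhashaa is a community-driven project that creates the first open parallel corpora and baseline machine translation systems for four Indian tribal languages, with the aim of promoting more equitable AI research.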

Details Motivation: Many tribal languages are invisible in LLMs and machine translation systems, which exacerbates structural inequities in education, governance, and digital participation. Method: The project combines participatory data creation with native speakers, human validation, and systematic evaluation of encoder-decoder MT models and LLMs. Result: It builds the first open parallel corpora and baseline MT systems for four Indian tribal languages (Bhili, Mundari, Gondi, and Santali) and demonstrates the importance of local expertise and human validation in developing language technology. Conclusion: AdiBhashaa offers a workable model for more equitable AI research, centering local knowledge, capacity building among early-career researchers from marginalized communities, and human validation. Abstract: Large language models and multilingual machine translation (MT) systems increasingly drive access to information, yet many languages of the tribal communities remain effectively invisible in these technologies. This invisibility exacerbates existing structural inequities in education, governance, and digital participation. We present AdiBhashaa, a community-driven initiative that constructs the first open parallel corpora and baseline MT systems for four major Indian tribal languages: Bhili, Mundari, Gondi, and Santali. This work combines participatory data creation with native speakers, human-in-the-loop validation, and systematic evaluation of both encoder-decoder MT models and large language models. In addition to reporting technical findings, we articulate how AdiBhashaa illustrates a possible model for more equitable AI research: it centers local expertise, builds capacity among early-career researchers from marginalized communities, and foregrounds human validation in the development of language technologies.

[23] DaLA: Danish Linguistic Acceptability Evaluation Guided by Real World Errors

Gianluca Barmina,Nathalie Carmen Hau Norman,Peter Schneider-Kamp,Lukas Galke

Main category: cs.CL

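TL;DR: This paper presents an enhanced benchmark for evaluating linguistic acceptability in Danish. It analyzes common written errors, introduces 14 corruption functions that generate incorrect sentences, and validates them with manual and automatic methods; the resulting benchmark is broader and harder than existing ones and better separates LLMs of different quality.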

Details Motivation: Existing linguistic acceptability benchmarks are limited in the error types they cover and in their ability to separate models by performance, so a broader and more challenging Danish benchmark is needed. Method: Based on an analysis of common errors in written Danish, the authors design 14 corruption functions that systematically introduce errors, validate the generated sentences with manual and automatic methods, and build a benchmark for the linguistic acceptability judgement task. Result: The new benchmark covers more diverse error types than existing ones, is harder (LLM performance drops across the board), and shows stronger discriminatory power between models. Conclusion: The enhanced benchmark is broader and more rigorous than current approaches and more effectively measures how well LLMs judge linguistic acceptability in Danish. Abstract: We present an enhanced benchmark for evaluating linguistic acceptability in Danish. We first analyze the most common errors found in written Danish. Based on this analysis, we introduce a set of fourteen corruption functions that generate incorrect sentences by systematically introducing errors into existing correct Danish sentences. To ensure the accuracy of these corruptions, we assess their validity using both manual and automatic methods. The results are then used as a benchmark for evaluating Large Language Models on a linguistic acceptability judgement task. Our findings demonstrate that this extension is both broader and more comprehensive than the current state of the art. By incorporating a greater variety of corruption types, our benchmark provides a more rigorous assessment of linguistic acceptability, increasing task difficulty, as evidenced by the lower performance of LLMs on our benchmark compared to existing ones. Our results also suggest that our benchmark has higher discriminatory power, which makes it easier to distinguish well-performing models from low-performing ones.

[24] DAMASHA: Detecting AI in Mixed Adversarial Texts via Segmentation with Human-interpretable Attribution

L. D. M. S. Sai Teja,N. Siva Gopala Krishna,Ufaq Khan,Muhammad Haris Khan,Partha Pakray,Atul Mishra

Main category: cs.CL

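TL;DR: This paper proposes Info-Mask, a framework for detecting authorship transition points in text written jointly by humans and AI. It combines stylometric features, perplexity signals, and structured boundary modeling, releases the adversarial benchmark dataset MAS, improves segmentation robustness under adversarial conditions, and adds interpretable attribution visualizations to aid human understanding.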

Details Motivation: As LLMs advance, the boundary between human and AI-generated text is increasingly blurred, and effective methods are needed to identify authorship transition points in mixed text in order to safeguard authenticity, trust, and human oversight. Method: The Info-Mask framework integrates stylometric cues, perplexity-driven signals, and structured boundary modeling; the adversarial dataset MAS is built to evaluate robustness; Human-Interpretable Attribution (HIA) visualizations are designed and assessed in a small-scale human study. Result: Info-Mask significantly improves span-level robustness under adversarial conditions across multiple architectures and establishes new baselines; HIA helps humans understand model decisions; the experiments reveal both the promise and the limitations of current methods. Conclusion: Info-Mask is more robust and more interpretable on mixed-authorship segmentation and advances trustworthy detection for human-AI co-writing, although challenges remain. Abstract: In the age of advanced large language models (LLMs), the boundaries between human and AI-generated text are becoming increasingly blurred. We address the challenge of segmenting mixed-authorship text, that is, identifying transition points in text where authorship shifts from human to AI or vice versa, a problem with critical implications for authenticity, trust, and human oversight. We introduce a novel framework, called Info-Mask, for mixed-authorship detection that integrates stylometric cues, perplexity-driven signals, and structured boundary modeling to accurately segment collaborative human-AI content. To evaluate the robustness of our system against adversarial perturbations, we construct and release an adversarial benchmark dataset, Mixed-text Adversarial setting for Segmentation (MAS), designed to probe the limits of existing detectors. Beyond segmentation accuracy, we introduce Human-Interpretable Attribution (HIA) overlays that highlight how stylometric features inform boundary predictions, and we conduct a small-scale human study assessing their usefulness. Across multiple architectures, Info-Mask significantly improves span-level robustness under adversarial conditions, establishing new baselines while revealing remaining challenges. Our findings highlight both the promise and limitations of adversarially robust, interpretable mixed-authorship detection, with implications for trust and oversight in human-AI co-authorship.

[25] Mitigating Catastrophic Forgetting in Target Language Adaptation of LLMs via Source-Shielded Updates

Atsuki Yamaguchi,Terufumi Morishita,Aline Villavicencio,Nikolaos Aletras

Main category: cs.CL

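TL;DR: This paper proposes Source-Shielded Updates (SSU), a method for adapting instruct LLMs under low-resource conditions using only unlabeled target language data. It mitigates catastrophic forgetting and clearly outperforms full fine-tuning across several languages and model sizes.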

Details Motivation: Expanding the linguistic diversity of instruct LLMs is crucial for global accessibility, but it is held back by the high cost of labeled target language data and by catastrophic forgetting during adaptation. The paper addresses the realistic, low-resource setting in which only unlabeled target language data is available. Method: SSU uses a small set of source data and a parameter importance scoring method to identify the parameters critical to maintaining source abilities, then applies a column-wise freezing strategy to protect them before adaptation, yielding selective parameter updates. Result: Experiments on five typologically diverse languages with 7B and 13B models show that SSU reduces performance degradation on monolingual source tasks to an average of 3.4% (7B) and 2.8% (13B), compared with 20.3% and 22.3% for full fine-tuning; target-language performance matches or exceeds full fine-tuning, with the 7B model ahead on all benchmarks and the 13B model ahead on most. Conclusion: SSU effectively mitigates catastrophic forgetting in low-resource cross-lingual adaptation, improving target-language performance with almost no loss of source-language ability, and offers an efficient and practical route to multilingual instruct models. Abstract: Expanding the linguistic diversity of instruct large language models (LLMs) is crucial for global accessibility but is often hindered by the reliance on costly specialized target language labeled data and catastrophic forgetting during adaptation. We tackle this challenge under a realistic, low-resource constraint: adapting instruct LLMs using only unlabeled target language data. We introduce Source-Shielded Updates (SSU), a selective parameter update strategy that proactively preserves source knowledge. Using a small set of source data and a parameter importance scoring method, SSU identifies parameters critical to maintaining source abilities. It then applies a column-wise freezing strategy to protect these parameters before adaptation. Experiments across five typologically diverse languages and 7B and 13B models demonstrate that SSU successfully mitigates catastrophic forgetting. It reduces performance degradation on monolingual source tasks to just 3.4% (7B) and 2.8% (13B) on average, a stark contrast to the 20.3% and 22.3% from full fine-tuning. SSU also achieves target-language performance highly competitive with full fine-tuning, outperforming it on all benchmarks for 7B models and the majority for 13B models.
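The two steps named in the SSU abstract (score parameters for importance, then freeze whole columns before adaptation) can be sketched roughly as below. The abstract does not specify the scoring method or how scores are aggregated, so summing absolute importance per column, the `freeze_ratio` parameter, and plain SGD are all assumptions for illustration only.

```python
def select_frozen_columns(importance, freeze_ratio):
    """Pick the columns with the highest aggregate importance.

    `importance` is a matrix (list of rows) of per-weight scores, assumed to
    have been computed on source-language data. Aggregating by the sum of
    absolute values per column is an assumption, not the paper's stated rule.
    """
    n_cols = len(importance[0])
    col_scores = [sum(abs(row[j]) for row in importance) for j in range(n_cols)]
    k = int(n_cols * freeze_ratio)
    ranked = sorted(range(n_cols), key=lambda j: col_scores[j], reverse=True)
    return set(ranked[:k])

def masked_sgd_step(weights, grads, frozen_cols, lr=0.1):
    """One SGD step that leaves every weight in a frozen column untouched."""
    return [
        [w if j in frozen_cols else w - lr * g
         for j, (w, g) in enumerate(zip(w_row, g_row))]
        for w_row, g_row in zip(weights, grads)
    ]

# Hypothetical 2x3 weight matrix with made-up importance scores.
importance = [[0.9, 0.1, 0.5], [0.8, 0.2, 0.1]]
frozen = select_frozen_columns(importance, freeze_ratio=0.34)  # column 0 here
updated = masked_sgd_step([[1.0] * 3, [1.0] * 3], [[1.0] * 3, [1.0] * 3], frozen)
```

Freezing entire columns rather than individual weights keeps the mask structured, so the protected part of each matrix is exactly what it was before adaptation.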

[26] SEAL: Self-Evolving Agentic Learning for Conversational Question Answering over Knowledge Graphs

Hao Wang,Jialun Zhong,Changcheng Wang,Zhujun Nie,Zheng Li,Shunyu Yao,Yanzeng Li,Xinchi Li

Main category: cs.CL

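TL;DR: This paper proposes SEAL, a two-stage semantic parsing framework based on self-evolving agentic learning for conversational question answering over knowledge bases. It extracts a core S-expression and completes it with templates, adds memory and reflection mechanisms, and achieves state-of-the-art performance on the SPICE benchmark.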

Details Motivation: Existing methods suffer from structural inaccuracies and high computational cost on complex queries, and struggle with coreference resolution, context modeling, and complex logical reasoning. Method: A two-stage framework. In the first stage, an LLM extracts a minimal S-expression core, and an agentic calibration module corrects syntactic errors and aligns entities precisely with the knowledge graph. In the second stage, template-based completion, guided by question-type prediction and placeholder instantiation, produces an executable S-expression. A self-evolving mechanism uses local and global memory with a reflection module for continuous improvement. Result: Experiments on the SPICE benchmark show that SEAL performs strongly on multi-hop reasoning, comparison, and aggregation tasks and markedly improves structural accuracy and computational efficiency. Conclusion: By decomposing logical form generation and adding self-evolving learning, SEAL achieves more robust and scalable conversational reasoning and offers an efficient, accurate new paradigm for knowledge base question answering. Abstract: Knowledge-based conversational question answering (KBCQA) confronts persistent challenges in resolving coreference, modeling contextual dependencies, and executing complex logical reasoning. Existing approaches, whether end-to-end semantic parsing or stepwise agent-based reasoning, often suffer from structural inaccuracies and prohibitive computational costs, particularly when processing intricate queries over large knowledge graphs. To address these limitations, we introduce SEAL, a novel two-stage semantic parsing framework grounded in self-evolving agentic learning. In the first stage, a large language model (LLM) extracts a minimal S-expression core that captures the essential semantics of the input query. This core is then refined by an agentic calibration module, which corrects syntactic inconsistencies and aligns entities and relations precisely with the underlying knowledge graph. The second stage employs template-based completion, guided by question-type prediction and placeholder instantiation, to construct a fully executable S-expression. This decomposition not only simplifies logical form generation but also significantly enhances structural fidelity and linking efficiency. Crucially, SEAL incorporates a self-evolving mechanism that integrates local and global memory with a reflection module, enabling continuous adaptation from dialog history and execution feedback without explicit retraining. Extensive experiments on the SPICE benchmark demonstrate that SEAL achieves state-of-the-art performance, especially in multi-hop reasoning, comparison, and aggregation tasks. 
The results validate notable gains in both structural accuracy and computational efficiency, underscoring the framework's capacity for robust and scalable conversational reasoning.

[27] LLMs Know More Than Words: A Genre Study with Syntax, Metaphor & Phonetics

Weiye Shi,Zhaowei Zhang,Shaoheng Yan,Yaodong Yang

Main category: cs.CL

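TL;DR: This paper introduces a new multilingual genre classification dataset for evaluating whether LLMs can learn deeper linguistic properties from raw text, such as syntactic structure, metaphor counts, and phonetic features, and examines how these features affect classification performance.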

Details Motivation: To investigate whether LLMs genuinely capture deeper linguistic properties such as syntactic structure, phonetic cues, and metrical patterns, rather than only surface text information. Method: The authors build a multilingual genre classification dataset from Project Gutenberg with binary tasks over poetry, novels, and drama in six languages, and add three explicit linguistic feature sets for analysis: syntactic tree structures, metaphor counts, and phonetic metrics. Result: Experiments show that although LLMs can learn latent linguistic structure from raw text or from explicit features, different features contribute unevenly across tasks, which points to the importance of complex linguistic signals during training. Conclusion: Incorporating more complex linguistic signals into model training is important for improving language understanding. Abstract: Large language models (LLMs) demonstrate remarkable potential across diverse language-related tasks, yet whether they capture deeper linguistic properties, such as syntactic structure, phonetic cues, and metrical patterns, from raw text remains unclear. To analyze whether LLMs can learn these features effectively and apply them to important natural-language-related tasks, we introduce a novel multilingual genre classification dataset derived from Project Gutenberg, a large-scale digital library offering free access to thousands of public domain literary works, comprising thousands of sentences per binary task (poetry vs. novel; drama vs. poetry; drama vs. novel) in six languages (English, French, German, Italian, Spanish, and Portuguese). We augment each with three explicit linguistic feature sets (syntactic tree structures, metaphor counts, and phonetic metrics) to evaluate their impact on classification performance. Experiments demonstrate that although LLM classifiers can learn latent linguistic structures either from raw text or from explicitly provided features, different features contribute unevenly across tasks, which underscores the importance of incorporating more complex linguistic signals during model training.

[28] Nex-N1: Agentic Models Trained via a Unified Ecosystem for Large-Scale Environment Construction

Nex-AGI Team,Yuxuan Cai,Lu Chen,Qiaoling Chen,Yuyang Ding,Liwen Fan,Wenjie Fu,Yufei Gao,Honglin Guo,Pinxue Guo,Zhenhua Han,Zhengfu He,Hanglei Hu,Kai Hu,Shengjia Hua,Tianyu Huai,Baodai Huang,Li Ji,Zhen Jiang,Zhikai Lei,Bufan Li,Jiahang Lin,Lizhi Lin,Jinxiu Liu,Shichun Liu,Ziming Liu,Yuchen Ni,Pengfang Qian,Yujiong Shen,Qingyun Shi,Wentao Shu,Peng Sun,Yiran Suo,Tian Tang,Boyu Tian,Guoteng Wang,Junzhe Wang,Peixin Wang,Zhiheng Xi,Hang Yan,Jie Yang,Zhixiong Yang,Tianchu Yao,Guangze Ye,Qianxi Yu,Shuo Zhang,Xinyue Zhang,Yiqi Zhang,Jiarong Zhao,Miao Zheng,Rui Zheng,Enyu Zhou,Jiazheng Zhou,Maosen Zhou,Yuhao Zhou,Tao Gui,Yining Zheng,Xinchi Chen,Jie Zhou,Siyuan Feng,Qin Chen,Liang He,Qi Zhang,Xuanjing Huang,Xipeng Qiu

Main category: cs.CL

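TL;DR: This paper proposes a scalable method for constructing interactive environments, intended to move LLMs from passive responders towards incentive-driven autonomous agents. The method scales along three dimensions (complexity, diversity, and fidelity), and the authors release the Nex-N1 model and the Nex ecosystem.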

Details Motivation: LLMs lack infrastructure for effective policy learning in diverse, complex, and realistic interactive environments, which limits their development as autonomous agents. Method: Interactive environments are scaled along three orthogonal dimensions: (1) Complexity: the NexAU framework builds complex agent hierarchies from simple configurations; (2) Diversity: NexA4A automatically generates diverse agent hierarchies from natural language; (3) Fidelity: NexGAP integrates dynamic real-world environments to synthesize grounded trajectories. Nex-N1 is trained on the environments built this way. Result: On benchmarks such as SWE-bench and tau2, Nex-N1 consistently outperforms state-of-the-art open-source models and is competitive with frontier proprietary models on complex agentic tasks. Conclusion: The Nex ecosystem addresses the scalability of constructing interaction signals for training autonomous agents and advances the evolution of LLMs into more capable agents. Abstract: The evolution of Large Language Models (LLMs) from passive responders to autonomous agents necessitates a fundamental shift in learning paradigms -- from static imitation to incentive-driven decision making. However, this transition is significantly impeded by the lack of scalable infrastructure capable of constructing high-quality interaction signals for effective policy learning. To address this, we introduce a comprehensive method designed to systematically scale the diversity and complexity of interactive environments. Our method realizes this scaling by addressing three orthogonal dimensions: (1) Complexity: NexAU, a flexible agent framework that supports building complex agent hierarchies via simple configurations; (2) Diversity: NexA4A automatically generates diverse agent hierarchies from natural language to cover infinite domains; and (3) Fidelity: NexGAP bridges the simulation-reality gap by integrating dynamic real-world environments for grounded trajectory synthesis. We train Nex-N1 upon the diverse and complex interactive environments established by our infrastructure. Empirical results on benchmarks such as SWE-bench and tau2 demonstrate that Nex-N1 consistently outperforms SOTA open-source models and achieves competitive performance against frontier proprietary models on complex agentic tasks. We open-source the Nex ecosystem and model weights to facilitate further research.

[29] Factuality and Transparency Are All RAG Needs! Self-Explaining Contrastive Evidence Re-ranking

Francielle Vargas,Daniel Pedronette

Main category: cs.CL

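TL;DR: The paper proposes Self-Explaining Contrastive Evidence Re-Ranking (CER), which improves retrieval through contrastive learning and token-level attribution rationales, and is aimed especially at safety-critical domains such as clinical trial reports.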

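Details Motivation: In safety-critical domains such as clinical trials, existing retrieval methods are prone to hallucination and lack interpretability; a more reliable retrieval mechanism grounded in factual evidence is needed. Method: Embeddings are fine-tuned with contrastive learning, hard negatives are selected automatically using a subjectivity-based criterion, and token-level attribution rationales are generated for each retrieved passage, producing an embedding space aligned with evidential reasoning. Result: Experiments on clinical trial reports show that CER improves retrieval accuracy, reduces the risk of hallucination in RAG systems, and provides transparent, evidence-based retrieval results. Conclusion: By combining contrastive learning with self-explanation, CER delivers more reliable and interpretable evidence retrieval, which suits high-risk domains in particular. Abstract: This extended abstract introduces Self-Explaining Contrastive Evidence Re-Ranking (CER), a novel method that restructures retrieval around factual evidence by fine-tuning embeddings with contrastive learning and generating token-level attribution rationales for each retrieved passage. Hard negatives are automatically selected using a subjectivity-based criterion, forcing the model to pull factual rationales closer while pushing subjective or misleading explanations apart. As a result, the method creates an embedding space explicitly aligned with evidential reasoning. We evaluated our method on clinical trial reports, and initial experimental results show that CER improves retrieval accuracy, mitigates the potential for hallucinations in RAG systems, and provides transparent, evidence-based retrieval that enhances reliability, especially in safety-critical domains.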

[30] Arbitrage: Efficient Reasoning via Advantage-Aware Speculation

Monishwaran Maheswaran,Rishabh Tiwari,Yuezhou Hu,Kerem Dilmen,Coleman Hooper,Haocheng Xi,Nicholas Lee,Mehrdad Farajtabar,Michael W. Mahoney,Kurt Keutzer,Amir Gholami

Main category: cs.CL

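TL;DR: This paper proposes Arbitrage, a step-level speculative generation framework that uses dynamic routing to choose, during inference, between the draft and target models according to their relative advantage, substantially improving the efficiency-accuracy trade-off of LLM reasoning.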

Details Motivation: Traditional token-level speculative decoding rejects steps unnecessarily in reasoning tasks when they are semantically equivalent but differ at the token level, and existing step-level methods still regenerate many rejected steps, wasting compute; a more efficient speculative generation mechanism is needed. Method: Arbitrage uses a lightweight router to predict dynamically when the target model is likely to produce a better reasoning step and decides accordingly which model generates the next step, approximating an ideal Arbitrage Oracle; the router is trained to approximate the optimal decision. Result: On several mathematical reasoning benchmarks, Arbitrage substantially reduces inference latency compared with existing step-level speculative decoding methods, by up to about 2x, at matched accuracy. Conclusion: Through dynamic routing, Arbitrage achieves a better efficiency-accuracy trade-off, addresses the limitations of traditional speculative decoding on complex reasoning tasks, and suggests a new direction for efficient LLM inference. Abstract: Modern Large Language Models achieve impressive reasoning capabilities with long Chain of Thoughts, but they incur substantial computational cost during inference, and this motivates techniques to improve the performance-cost ratio. Among these techniques, Speculative Decoding accelerates inference by employing a fast but inaccurate draft model to autoregressively propose tokens, which are then verified in parallel by a more capable target model. However, due to unnecessary rejections caused by token mismatches in semantically equivalent steps, traditional token-level Speculative Decoding struggles in reasoning tasks. Although recent works have shifted to step-level semantic verification, which improves efficiency by accepting or rejecting entire reasoning steps, existing step-level methods still regenerate many rejected steps with little improvement, wasting valuable target compute. To address this challenge, we propose Arbitrage, a novel step-level speculative generation framework that routes generation dynamically based on the relative advantage between draft and target models. Instead of applying a fixed acceptance threshold, Arbitrage uses a lightweight router trained to predict when the target model is likely to produce a meaningfully better step. This routing approximates an ideal Arbitrage Oracle that always chooses the higher-quality step, achieving near-optimal efficiency-accuracy trade-offs. 
Across multiple mathematical reasoning benchmarks, Arbitrage consistently surpasses prior step-level Speculative Decoding baselines, reducing inference latency by up to $\sim2\times$ at matched accuracy.
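The routing loop described in the abstract can be sketched in a few lines. All names here (`draft_step`, `target_step`, `router_score`) are hypothetical callables, and applying a cutoff to the router's output is a simplification chosen for this sketch; the abstract only says that a trained router predicts when the target model is likely to produce a meaningfully better step.

```python
def generate_with_routing(draft_step, target_step, router_score, n_steps, cutoff=0.5):
    """Step-level generation where a router decides which model supplies each step.

    draft_step(context)             -> cheap candidate step (hypothetical callable)
    target_step(context)            -> expensive step from the stronger model
    router_score(context, candidate)-> predicted chance the target would do better
    """
    context, target_calls = [], 0
    for _ in range(n_steps):
        candidate = draft_step(context)
        # Spend target compute only when the router expects a real improvement.
        if router_score(context, candidate) > cutoff:
            candidate = target_step(context)
            target_calls += 1
        context.append(candidate)
    return context, target_calls

# Toy stand-ins: the router prefers the target model on odd-numbered steps.
steps, calls = generate_with_routing(
    draft_step=lambda ctx: f"d{len(ctx)}",
    target_step=lambda ctx: f"t{len(ctx)}",
    router_score=lambda ctx, cand: 0.9 if len(ctx) % 2 else 0.1,
    n_steps=4,
)
```

The point of the design is that the target model is invoked only for steps where it is predicted to help, instead of regenerating every rejected draft step.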

[31] Structured Document Translation via Format Reinforcement Learning

Haiyue Song,Johannes Eschbach-Dymanus,Hour Kaing,Sumire Honda,Hideki Tanaka,Bianka Buschbeck,Masao Utiyama

Main category: cs.CL

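TL;DR: The paper proposes FormatRL, a structure-aware translation method based on reinforcement learning that optimizes new reward functions such as TreeSim and Node-chrF and makes clear progress on complex document-level XML/HTML structures.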

Details Motivation: Existing work on structured text translation is limited to the sentence level and struggles with complex document-level XML or HTML structures, so a method is needed that optimizes structure preservation and translation quality together. Method: Group Relative Policy Optimization is applied on top of a supervised fine-tuned model with two new structure-aware rewards: TreeSim, which measures structural similarity between predicted and reference XML trees, and Node-chrF, which measures translation quality at the level of XML nodes; StrucAUC is used as a fine-grained evaluation metric. Result: Experiments on the SAP software-documentation benchmark show improvements across six metrics, and an analysis confirms how the different reward functions contribute to structural and translation quality. Conclusion: FormatRL improves document-level structured text translation, balancing structural fidelity with language quality, and offers a workable approach to translating text with complex formatting. Abstract: Recent works on structured text translation remain limited to the sentence level, as they struggle to effectively handle the complex document-level XML or HTML structures. To address this, we propose Format Reinforcement Learning (FormatRL), which employs Group Relative Policy Optimization on top of a supervised fine-tuning model to directly optimize novel structure-aware rewards: 1) TreeSim, which measures structural similarity between predicted and reference XML trees and 2) Node-chrF, which measures translation quality at the level of XML nodes. Additionally, we apply StrucAUC, a fine-grained metric distinguishing between minor errors and major structural failures. Experiments on the SAP software-documentation benchmark demonstrate improvements across six metrics and an analysis further shows how different reward functions contribute to improvements in both structural and translation quality.
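To make the idea of a structure-aware reward concrete, here is a rough proxy for comparing XML structure. It is explicitly NOT the paper's TreeSim, whose definition the abstract does not give: this sketch compares the multisets of root-to-node tag paths and returns their overlap, and it assumes malformed output should score zero.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def tag_paths(xml_text):
    """Count every root-to-node tag path, e.g. '/doc/p/b', ignoring text content."""
    root = ET.fromstring(xml_text)
    paths = Counter()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        paths[path] += 1
        for child in node:
            walk(child, path)

    walk(root, "")
    return paths

def structure_overlap(pred_xml, ref_xml):
    """Multiset Jaccard overlap of tag paths (a hypothetical stand-in reward)."""
    try:
        pred = tag_paths(pred_xml)
    except ET.ParseError:
        return 0.0  # assumption: unparseable output gets no structural credit
    ref = tag_paths(ref_xml)
    inter = sum((pred & ref).values())
    union = sum((pred | ref).values())
    return inter / union if union else 1.0

ref = "<doc><p>Hello <b>world</b></p></doc>"
same_structure = structure_overlap("<doc><p>Hallo <b>Welt</b></p></doc>", ref)
lost_markup = structure_overlap("<doc><p>Hallo Welt</p></doc>", ref)
```

A translation that keeps every tag in place scores the maximum regardless of the translated text, while dropping the inline `<b>` element lowers the score, which is the kind of signal a structure-preserving reward needs to provide.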

[32] Semantic Soft Bootstrapping: Long Context Reasoning in LLMs without Reinforcement Learning

Purbesh Mitra,Sennur Ulukus

Main category: cs.CL

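TL;DR: This paper proposes Semantic Soft Bootstrapping (SSB), a self-distillation technique that improves the training efficiency and accuracy of LLMs on long-context reasoning. It builds teacher-student training pairs without human intervention and clearly outperforms a standard reinforcement learning baseline on several math benchmarks.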

Details Motivation: Reinforcement learning with verifiable rewards (RLVR) for long-context reasoning suffers from sparse rewards, low sample efficiency, and heavy compute requirements, which limit its wider use. Method: Semantic Soft Bootstrapping (SSB) uses the same base language model as both teacher and student: several reasoning rollouts are generated, the correct answer and the most common incorrect answer are selected to build a demonstration with semantic context, training samples are produced automatically, and the student, given only the question, is trained to match the sequence of logits generated by the teacher. Result: Qwen2.5-3B-Instruct is fine-tuned parameter-efficiently on GSM8K and achieves accuracy gains of 10.6% and 10% over GRPO on MATH500 and AIME2024, respectively. Conclusion: SSB is an efficient self-distillation method that needs no human annotation, substantially improves LLM performance on mathematical reasoning, eases the bottlenecks of RLVR, and shows good scalability and application potential. Abstract: Long context reasoning in large language models (LLMs) has demonstrated enhancement of their cognitive capabilities via chain-of-thought (CoT) inference. Training such models is usually done via reinforcement learning with verifiable rewards (RLVR) in reasoning-based problems, like math and programming. However, RLVR is limited by several bottlenecks, such as the lack of dense rewards and inadequate sample efficiency. As a result, it requires significant compute resources in the post-training phase. To overcome these limitations, in this work, we propose Semantic Soft Bootstrapping (SSB), a self-distillation technique in which the same base language model plays the role of both teacher and student, but receives different semantic contexts about the correctness of its outcome at training time. The model is first prompted with a math problem and several rollouts are generated. From them, the correct and most common incorrect responses are filtered and then provided to the model in context to produce a more robust, step-by-step explanation with a verified final answer. This pipeline automatically curates a paired teacher-student training set from raw problem-answer data, without any human intervention. This generation process also produces a sequence of logits, which is what the student model tries to match in the training phase just from the bare question alone. In our experiment, we train Qwen2.5-3B-Instruct on the GSM8K dataset via parameter-efficient fine-tuning. We then test its accuracy on the MATH500 and AIME2024 benchmarks. 
Our experiments show accuracy improvements of 10.6% and 10%, respectively, over group relative policy optimization (GRPO), a commonly used RLVR algorithm. Our code is available at https://github.com/purbeshmitra/semantic-soft-bootstrapping, and the model and curated dataset are available at https://huggingface.co/purbeshmitra/semantic-soft-bootstrapping.
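The logit-matching step mentioned in the abstract can be illustrated with a small sketch. The abstract says only that the student tries to match the teacher's sequence of logits; using the KL divergence between the two softmax distributions, averaged over positions, is an assumption made here, as are the toy logit values.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) for one token position."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0.0)

def sequence_distill_loss(teacher_seq, student_seq):
    """Mean per-position KL over a sequence (an assumed form of the matching loss)."""
    losses = [kl_divergence(t, s) for t, s in zip(teacher_seq, student_seq)]
    return sum(losses) / len(losses)

# Toy logits over a 3-token vocabulary for a 2-token answer.
teacher = [[2.0, 0.5, -1.0], [0.1, 3.0, 0.2]]   # produced WITH the hinting context
student = [[1.0, 1.0, 0.0], [0.0, 1.0, 0.5]]    # produced from the bare question
loss = sequence_distill_loss(teacher, student)
```

Because the target is a full distribution at every position rather than a single scalar reward per rollout, this kind of loss gives the dense training signal that the abstract contrasts with RLVR.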

cs.CV

[33] Beyond Flicker: Detecting Kinematic Inconsistencies for Generalizable Deepfake Video Detection

Alejandro Cobo,Roberto Valle,José Miguel Buenaposada,Luis Baumela

Main category: cs.CV

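TL;DR: This paper proposes a new synthetic video generation method that introduces subtle kinematic inconsistencies to improve the generalization of deepfake detection models.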

Details Motivation: Existing deepfake detection methods perform poorly on unseen manipulations; in the video domain in particular, they overlook violations of the natural motion dependencies between facial regions. Method: An autoencoder is trained to decompose facial landmark configurations into motion bases; manipulating these bases selectively breaks the natural correlations in facial motion, and the resulting artifacts are introduced into pristine videos via face morphing. Result: The method achieves state-of-the-art generalization performance on several popular benchmarks. Conclusion: The method effectively improves the generalization of deepfake detection models to unknown manipulations. Abstract: Generalizing deepfake detection to unseen manipulations remains a key challenge. A recent approach to tackle this issue is to train a network with pristine face images that have been manipulated with hand-crafted artifacts to extract more generalizable clues. While effective for static images, extending this to the video domain is an open issue. Existing methods model temporal artifacts as frame-to-frame instabilities, overlooking a key vulnerability: the violation of natural motion dependencies between different facial regions. In this paper, we propose a synthetic video generation method that creates training data with subtle kinematic inconsistencies. We train an autoencoder to decompose facial landmark configurations into motion bases. By manipulating these bases, we selectively break the natural correlations in facial movements and introduce these artifacts into pristine videos via face morphing. A network trained on our data learns to spot these sophisticated biomechanical flaws, achieving state-of-the-art generalization results on several popular benchmarks.

[34] OnSight Pathology: A real-time platform-agnostic computational pathology companion for histopathology

Jinzhen Hu,Kevin Faust,Parsa Babaei Zadeh,Adrienn Bourkas,Shane Eaton,Andrew Young,Anzar Alvi,Dimitrios George Oreopoulos,Ameesha Paliwal,Assem Saleh Alrumeh,Evelyn Rose Kamski-Hennekam,Phedias Diamandis

Main category: cs.CV

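TL;DR: OnSight Pathology is platform-agnostic computer vision software that assists pathologists in analyzing digital pathology images through real-time AI inference run locally; it requires no complex integration, supports a range of devices and settings, and promotes wider adoption of AI in pathology.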

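Details Motivation: Existing digital pathology AI solutions are mostly proprietary systems that are complex and costly to deploy, which limits the broad use of AI in clinical practice. Traditional pathological diagnosis also relies on subjective judgment and expert resources, which affects diagnostic accuracy. A low-cost, easily deployed, secure, and broadly compatible AI assistance tool is therefore needed. Method: The authors develop OnSight Pathology, platform-agnostic software that uses continuous custom screen captures to run real-time AI inference on digital slide images. It runs as a single executable on ordinary personal computers without complex software integration, integrates a multi-modal chat assistant for image descriptions and quality control, and is validated for compatibility and robustness across slide viewers, clinical settings, and live microscope video feeds (including smartphones). Result: OnSight Pathology was validated on more than 2,500 public whole slide images and on clinical cases, was applied to tasks such as common brain tumor classification, mitosis detection, and quantification of immunohistochemical stains, and was shown to be compatible with live microscope and mobile-device video feeds. Conclusion: OnSight Pathology overcomes existing technical barriers and delivers real-time AI-assisted analysis across a variety of pathology workflows; it is low-cost, secure, and easy to deploy, which helps promote the broad adoption of AI in histopathology. Abstract: The microscopic examination of surgical tissue remains a cornerstone of disease classification but relies on subjective interpretations and access to highly specialized experts, which can compromise accuracy and clinical care. While emerging breakthroughs in artificial intelligence (AI) offer promise for automated histological analysis, the growing number of proprietary digital pathology solutions has created barriers to real-world deployment. To address these challenges, we introduce OnSight Pathology, a platform-agnostic computer vision software that uses continuous custom screen captures to provide real-time AI inferences to users as they review digital slide images. Accessible as a single, self-contained executable file (https://onsightpathology.github.io/), OnSight Pathology operates locally on consumer-grade personal computers without complex software integration, enabling cost-effective and secure deployment in research and clinical workflows. Here we demonstrate the utility of OnSight Pathology using over 2,500 publicly available whole slide images across different slide viewers, as well as cases from our clinical digital pathology setup. The software's robustness is highlighted across routine histopathological tasks, including the classification of common brain tumor types, mitosis detection, and the quantification of immunohistochemical stains.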
A built-in multi-modal chat assistant provides verifiable descriptions of images, free of rigid class labels, for added quality control. Lastly, we show compatibility with live microscope camera feeds, including from personal smartphones, offering potential for deployment in more analog, inter-operative, and telepathology settings. Together, we highlight how OnSight Pathology can deliver real-time AI inferences across a broad range of pathology pipelines, removing key barriers to the adoption of AI tools in histopathology.

[35] Look Around and Pay Attention: Multi-camera Point Tracking Reimagined with Transformers

Bishoy Galoaa,Xiangyu Bai,Shayda Moezzi,Utsav Nandi,Sai Siddhartha Vivek Dhir Rangoju,Somaieh Amraee,Sarah Ostadabbas

Main category: cs.CV

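TL;DR: This paper proposes LAPA, an end-to-end transformer-based architecture for multi-camera point tracking that combines appearance matching with geometric constraints and uses attention to reason jointly across views and time, outperforming existing methods.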

Details Motivation: Traditional tracking pipelines separate detection, association, and tracking, which causes error propagation and temporal inconsistency, especially in complex scenes. A method that jointly models multi-view and temporal information is needed to improve robustness and accuracy. Method: LAPA uses a cross-view attention mechanism with geometric priors to establish soft correspondences, builds 3D point representations through attention-weighted aggregation instead of classical triangulation, and uses a transformer decoder to model long-range dependencies and keep identities consistent. Result: LAPA reaches 37.5% APD on TAPVid-3D-MC and 90.3% APD on PointOdyssey-MC, significantly outperforming existing methods, particularly in scenes with complex motion and occlusion. Conclusion: By unifying matching and geometric modeling for multi-view point tracking end to end, LAPA improves tracking accuracy and temporal consistency and offers a new direction for multi-camera tracking. Abstract: This paper presents LAPA (Look Around and Pay Attention), a novel end-to-end transformer-based architecture for multi-camera point tracking that integrates appearance-based matching with geometric constraints. Traditional pipelines decouple detection, association, and tracking, leading to error propagation and temporal inconsistency in challenging scenarios. LAPA addresses these limitations by leveraging attention mechanisms to jointly reason across views and time, establishing soft correspondences through a cross-view attention mechanism enhanced with geometric priors. Instead of relying on classical triangulation, we construct 3D point representations via attention-weighted aggregation, inherently accommodating uncertainty and partial observations. Temporal consistency is further maintained through a transformer decoder that models long-range dependencies, preserving identities through extended occlusions. Extensive experiments on challenging datasets, including our newly created multi-camera (MC) versions of TAPVid-3D panoptic and PointOdyssey, demonstrate that our unified approach significantly outperforms existing methods, achieving 37.5% APD on TAPVid-3D-MC and 90.3% APD on PointOdyssey-MC, particularly excelling in scenarios with complex motions and occlusions. Code is available at https://github.com/ostadabbas/Look-Around-and-Pay-Attention-LAPA-

[36] Generalized Event Partonomy Inference with Structured Hierarchical Predictive Learning

Zhou Chen,Joe Lin,Sathyanarayanan N. Aakur

Main category: cs.CV

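TL;DR: PARSE is an unsupervised, unified framework that learns multiscale event structure from streaming video through predictive learning, achieves human-like perception of temporally nested events, and reaches state-of-the-art performance on several benchmarks.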

Details Motivation: Mimicking how humans perceive continuous experience as temporally nested, hierarchical events requires models that can segment video predictively and hierarchically without supervision. Method: The PARSE framework uses hierarchical recurrent predictors with different temporal granularities, detects event boundaries from peaks in prediction error, and integrates long-term context through attention-based feedback. Result: On three benchmarks (Breakfast Actions, 50 Salads, and Assembly 101), PARSE achieves state-of-the-art performance among streaming methods and is comparable to offline methods in temporal alignment (H-GEBD) and structural consistency (TED, hF1). Conclusion: Predictive learning under uncertainty offers a scalable path toward human-like temporal abstraction and compositional event understanding. Abstract: Humans naturally perceive continuous experience as a hierarchy of temporally nested events, fine-grained actions embedded within coarser routines. Replicating this structure in computer vision requires models that can segment video not just retrospectively, but predictively and hierarchically. We introduce PARSE, a unified framework that learns multiscale event structure directly from streaming video without supervision. PARSE organizes perception into a hierarchy of recurrent predictors, each operating at its own temporal granularity: lower layers model short-term dynamics while higher layers integrate longer-term context through attention-based feedback. Event boundaries emerge naturally as transient peaks in prediction error, yielding temporally coherent, nested partonomies that mirror the containment relations observed in human event perception. Evaluated across three benchmarks, Breakfast Actions, 50 Salads, and Assembly 101, PARSE achieves state-of-the-art performance among streaming methods and rivals offline baselines in both temporal alignment (H-GEBD) and structural consistency (TED, hF1). The results demonstrate that predictive learning under uncertainty provides a scalable path toward human-like temporal abstraction and compositional event understanding.
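
The idea that event boundaries emerge as transient peaks in prediction error can be illustrated with a simple streaming peak detector. This is only a sketch under assumptions: the running-statistics threshold, the `window` and `k` parameters, and the function name are hypothetical and not taken from PARSE.

```python
def detect_boundaries(errors, window=5, k=2.0):
    """Return indices where the prediction error is a local peak that also
    exceeds the mean of the previous `window` errors by k standard deviations."""
    boundaries = []
    for t in range(window, len(errors) - 1):
        past = errors[t - window:t]
        mean = sum(past) / window
        var = sum((e - mean) ** 2 for e in past) / window
        threshold = mean + k * var ** 0.5
        # A local peak: strictly above the previous step, not below the next one.
        is_peak = errors[t] > errors[t - 1] and errors[t] >= errors[t + 1]
        if is_peak and errors[t] > threshold:
            boundaries.append(t)
    return boundaries
```

A hierarchy could run one such detector per layer on that layer's own error signal, so that coarser layers yield fewer, longer segments; whether PARSE does it this way is not stated in the abstract.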

[37] MoReGen: Multi-Agent Motion-Reasoning Engine for Code-based Text-to-Video Synthesis

Xiangyu Bai,He Liang,Bishoy Galoaa,Utsav Nandi,Shayda Moezzi,Yuhang He,Sarah Ostadabbas

Main category: cs.CV

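TL;DR: This paper proposes MoReGen, a text-to-video generation framework grounded in Newtonian mechanics that combines multi-agent LLMs, physics simulators, and renderers to generate physically consistent videos, and builds MoReSet, a benchmark of 1,275 annotated videos for evaluating physical consistency.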

Details Motivation: Existing text-to-video models fall short in generating physically plausible, motion-coherent videos, and perform poorly in particular at obeying Newton's laws of motion. Method: The MoReGen framework translates text prompts into physical parameters in the code domain: multi-agent LLMs parse the scene and generate executable code, a physics engine runs the simulation, and a renderer produces the video. The paper also proposes object-trajectory correspondence as an evaluation metric and builds the MoReSet benchmark for quantitative evaluation. Result: Experiments show that existing T2V models perform poorly on physical validity, whereas MoReGen generates videos that better obey physical laws and significantly outperforms existing methods on metrics such as object-trajectory correspondence. Conclusion: MoReGen offers an effective path toward physically accurate, reproducible text-to-video generation, underlines the importance of building physical priors into the generation process, and advances research on intent-aligned, physically consistent video generation. Abstract: While text-to-video (T2V) generation has achieved remarkable progress in photorealism, generating intent-aligned videos that faithfully obey physics principles remains a core challenge. In this work, we systematically study Newtonian motion-controlled text-to-video generation and evaluation, emphasizing physical precision and motion coherence. We introduce MoReGen, a motion-aware, physics-grounded T2V framework that integrates multi-agent LLMs, physics simulators, and renderers to generate reproducible, physically accurate videos from text prompts in the code domain. To quantitatively assess physical validity, we propose object-trajectory correspondence as a direct evaluation metric and present MoReSet, a benchmark of 1,275 human-annotated videos spanning nine classes of Newtonian phenomena with scene descriptions, spatiotemporal relations, and ground-truth trajectories. Using MoReSet, we conduct experiments on existing T2V models, evaluating their physical validity through both our MoRe metrics and existing physics-based evaluators. Our results reveal that state-of-the-art models struggle to maintain physical validity, while MoReGen establishes a principled direction toward physically coherent video synthesis.

[38] ReasonX: MLLM-Guided Intrinsic Image Decomposition

Alara Dirik,Tuanfeng Wang,Duygu Ceylan,Stefanos Zafeiriou,Anna Frühstück

Main category: cs.CV

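TL;DR: ReasonX is a new framework that uses a multimodal large language model as a perceptual judge providing relative intrinsic comparisons, and fine-tunes intrinsic decomposition models on unlabeled real images with GRPO rewards, yielding significant improvements across multiple base architectures and modalities.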

Details Motivation: Existing intrinsic image decomposition models trained with paired supervision from synthetic data generalize poorly to real-world scenes, so a method that can exploit unlabeled real data is needed. Method: The ReasonX framework uses a multimodal large language model (MLLM) as a perceptual judge that makes relative intrinsic comparisons, and uses these comparisons as GRPO rewards to fine-tune intrinsic decomposition models; alignment is achieved by rewarding agreement between the conditional intrinsic predictor's outputs and the judge's relational assessments. Result: ReasonX yields significant improvements across multiple base architectures and modalities, including a 9-25% WHDR reduction on IIW albedo and up to 46% depth accuracy gains on ETH3D. Conclusion: MLLM-guided comparative supervision has the potential to bridge low-level and high-level vision reasoning and provides an effective, model-agnostic way to improve intrinsic image decomposition. Abstract: Intrinsic image decomposition aims to separate images into physical components such as albedo, depth, normals, and illumination. While recent diffusion- and transformer-based models benefit from paired supervision from synthetic datasets, their generalization to diverse, real-world scenarios remains challenging. We propose ReasonX, a novel framework that leverages a multimodal large language model (MLLM) as a perceptual judge providing relative intrinsic comparisons, and uses these comparisons as GRPO rewards for fine-tuning intrinsic decomposition models on unlabeled, in-the-wild images. Unlike RL methods for generative models, our framework aligns conditional intrinsic predictors by rewarding agreement between the judge's relational assessments and analytically derived relations from the model's outputs. ReasonX is model-agnostic and can be applied to different intrinsic predictors. Across multiple base architectures and modalities, ReasonX yields significant improvements, including 9-25% WHDR reduction on IIW albedo and up to 46% depth accuracy gains on ETH3D, highlighting the promise of MLLM-guided comparative supervision to bridge low- and high-level vision reasoning.
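
A rough sketch of how judge agreement could be turned into GRPO-style advantages. The group-relative normalization (reward minus group mean, divided by group standard deviation) is the standard GRPO step; the relation labels, the tolerance `tol`, and both reward helpers are hypothetical illustrations, not ReasonX's actual reward design.

```python
def derived_relation(value_a, value_b, tol=0.05):
    """Relation between two predicted intrinsic values (e.g. albedo at two points)."""
    if abs(value_a - value_b) <= tol:
        return "equal"
    return "greater" if value_a > value_b else "less"

def agreement_reward(pred_a, pred_b, judge_relation, tol=0.05):
    """1.0 if the relation derived from the prediction matches the judge, else 0.0."""
    return 1.0 if derived_relation(pred_a, pred_b, tol) == judge_relation else 0.0

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled predictions (GRPO-style)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

Because advantages are computed relative to the group, no value network is needed; predictions that agree with the judge more often than their group siblings get positive advantages.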

[39] 6 Fingers, 1 Kidney: Natural Adversarial Medical Images Reveal Critical Weaknesses of Vision-Language Models

Leon Mayer,Piotr Kalinowski,Caroline Ebersbach,Marcel Knopp,Tim Rädsch,Evangelia Christodoulou,Annika Reinke,Fiona R. Kolbinger,Lena Maier-Hein

Main category: cs.CV

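TL;DR: AdversarialAnatomyBench is the first vision-language model (VLM) benchmark targeting naturally occurring rare anatomical variants; it reveals a marked performance drop of existing VLMs on rare anatomy, exposing their reliance on typical anatomical priors and their poor generalization.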

Details Motivation: Existing benchmarks mainly evaluate common anatomical presentations and do not measure the challenges posed by rare variants, so VLMs may be unreliable in clinical use; a new benchmark dedicated to model robustness under rare anatomical variants is needed. Method: The authors build AdversarialAnatomyBench, which contains real rare anatomical variants across multiple imaging modalities and anatomical regions, evaluate 22 state-of-the-art VLMs on basic medical perception tasks, and analyze the patterns of performance degradation, how anatomical bias shows up, and how effective existing interventions are. Result: 1) Mean VLM accuracy drops from 74% on typical anatomy to 29% on rare anatomy, and even the best models (such as GPT-5 and Gemini 2.5 Pro) drop by 41-51%; 2) model errors reflect clear anatomical biases; 3) neither scaling up models nor bias-aware prompting or test-time reasoning resolves the problem. Conclusion: Current VLMs generalize very poorly to rare anatomical presentations, and AdversarialAnatomyBench provides a systematic tool for quantifying and mitigating anatomical bias in multimodal medical AI. Abstract: Vision-language models are increasingly integrated into clinical workflows. However, existing benchmarks primarily assess performance on common anatomical presentations and fail to capture the challenges posed by rare variants. To address this gap, we introduce AdversarialAnatomyBench, the first benchmark comprising naturally occurring rare anatomical variants across diverse imaging modalities and anatomical regions. We call such variants that violate learned priors about "typical" human anatomy natural adversarial anatomy. Benchmarking 22 state-of-the-art VLMs with AdversarialAnatomyBench yielded three key insights. First, when queried with basic medical perception tasks, mean accuracy dropped from 74% on typical to 29% on atypical anatomy. Even the best-performing models, GPT-5, Gemini 2.5 Pro, and Llama 4 Maverick, showed performance drops of 41-51%. Second, model errors closely mirrored expected anatomical biases. Third, neither model scaling nor interventions, including bias-aware prompting and test-time reasoning, resolved these issues. These findings highlight a critical and previously unquantified limitation in current VLMs: their poor generalization to rare anatomical presentations. AdversarialAnatomyBench provides a foundation for systematically measuring and mitigating anatomical bias in multimodal medical AI systems.

[40] MVRoom: Controllable 3D Indoor Scene Generation with Multi-View Diffusion Models

Shaoheng Fang,Chaohui Yu,Fan Wang,Qixing Huang

Main category: cs.CV

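TL;DR: This paper proposes MVRoom, a controllable novel view synthesis (NVS) method for 3D indoor scene generation that combines multi-view diffusion models with a coarse 3D layout in a two-stage framework to achieve high-fidelity, consistent multi-view synthesis, and supports text-to-scene generation.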

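Details Motivation: Existing NVS methods struggle to maintain multi-view consistency in complex indoor scenes and lack effective control over the scene layout, so a method that uses 3D layout priors to improve generation quality and controllability is needed. Method: MVRoom uses a two-stage design. The first stage uses novel representations to effectively combine the 3D layout with image-based condition signals; the second stage performs image-conditioned multi-view generation and introduces a layout-aware epipolar attention mechanism to strengthen multi-view consistency during diffusion. An iterative framework is also proposed to support text-to-scene generation at different levels of complexity. Result: Experiments show that MVRoom outperforms state-of-the-art baselines in both quantitative and qualitative evaluation and generates high-fidelity, multi-view-consistent 3D indoor scenes; ablation studies confirm the effectiveness of the key components. Conclusion: By combining 3D layout priors with diffusion models, MVRoom markedly improves the quality and controllability of indoor-scene NVS and provides an effective solution for text-driven 3D scene generation. Abstract: We introduce MVRoom, a controllable novel view synthesis (NVS) pipeline for 3D indoor scenes that uses multi-view diffusion conditioned on a coarse 3D layout. MVRoom employs a two-stage design in which the 3D layout is used throughout to enforce multi-view consistency. The first stage employs novel representations to effectively bridge the 3D layout and consistent image-based condition signals for multi-view generation. The second stage performs image-conditioned multi-view generation, incorporating a layout-aware epipolar attention mechanism to enhance multi-view consistency during the diffusion process. Additionally, we introduce an iterative framework that generates 3D scenes with varying numbers of objects and scene complexities by recursively performing multi-view generation (MVRoom), supporting text-to-scene generation. Experimental results demonstrate that our approach achieves high-fidelity and controllable 3D scene generation for NVS, outperforming state-of-the-art baseline methods both quantitatively and qualitatively. Ablation studies further validate the effectiveness of key components within our generation pipeline.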

[41] UniLight: A Unified Representation for Lighting

Zitian Zhang,Iliyan Georgiev,Michael Fischer,Yannick Hold-Geoffroy,Jean-François Lalonde,Valentin Deschaintre

Main category: cs.CV

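TL;DR: This paper proposes UniLight, a unified multimodal lighting representation that maps text, images, irradiance, and environment maps into a shared latent space via contrastive learning, uses spherical-harmonics prediction to strengthen directional understanding, and performs well on lighting retrieval, environment-map generation, and lighting control in diffusion models.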

Details Motivation: Lighting representations in different modalities (such as environment maps, spherical harmonics, and text) are mutually incompatible, which limits cross-modal transfer, so a unified representation is needed. Method: Modality-specific encoders are designed and trained contrastively to align the representations of text, images, irradiance, and environment maps, with spherical-harmonics prediction added as an auxiliary task to strengthen directional awareness. Result: The method is validated on three tasks (lighting retrieval, environment-map generation, and diffusion-based lighting control), and the results show that the representation is consistent and transferable across modalities. Conclusion: UniLight unifies multimodal lighting information in a single representation, supports flexible cross-modal lighting manipulation, and provides a general lighting embedding space for downstream vision applications. Abstract: Lighting has a strong influence on visual appearance, yet understanding and representing lighting in images remains notoriously difficult. Various lighting representations exist, such as environment maps, irradiance, spherical harmonics, or text, but they are incompatible, which limits cross-modal transfer. We thus propose UniLight, a joint latent space as lighting representation, that unifies multiple modalities within a shared embedding. Modality-specific encoders for text, images, irradiance, and environment maps are trained contrastively to align their representations, with an auxiliary spherical-harmonics prediction task reinforcing directional understanding. Our multi-modal data pipeline enables large-scale training and evaluation across three tasks: lighting-based retrieval, environment-map generation, and lighting control in diffusion-based image synthesis. Experiments show that our representation captures consistent and transferable lighting features, enabling flexible manipulation across modalities.

[42] Inference-time Stochastic Refinement of GRU-Normalizing Flow for Real-time Video Motion Transfer

Tasmiah Haque,Srinjoy Das

Main category: cs.CV

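TL;DR: The paper proposes GRU-SNF, a new inference-time refinement method that combines GRU-NF with MCMC sampling to increase the diversity of video motion predictions without sacrificing accuracy.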

Details Motivation: The deterministic structure of the existing GRU-NF model limits its expressivity, making it hard to fully capture the multimodal distribution of future motion, so generation diversity needs to be improved. Method: MCMC stochastic sampling steps are introduced during GRU-NF inference, forming the GRU-SNF framework, which follows the idea of Stochastic Normalizing Flows to strengthen exploration of the output space. Result: On keypoint-based video motion transfer, GRU-SNF produces more diverse results than GRU-NF while maintaining high accuracy, and performs better in particular on long-horizon prediction. Conclusion: Introducing stochasticity at inference time effectively improves the diversity of flow-based sequence generation models and confirms the potential of combining stochastic dynamics with flow models for time series generation. Abstract: Real-time video motion transfer applications such as immersive gaming and vision-based anomaly detection require accurate yet diverse future predictions to support realistic synthesis and robust downstream decision making under uncertainty. To improve the diversity of such sequential forecasts we propose a novel inference-time refinement technique that combines Gated Recurrent Unit-Normalizing Flows (GRU-NF) with stochastic sampling methods. While GRU-NF can capture multimodal distributions through its integration of normalizing flows within a temporal forecasting framework, its deterministic transformation structure can limit expressivity. To address this, inspired by Stochastic Normalizing Flows (SNF), we introduce Markov Chain Monte Carlo (MCMC) steps during GRU-NF inference, enabling the model to explore a richer output space and better approximate the true data distribution without retraining. We validate our approach in a keypoint-based video motion transfer pipeline, where capturing temporally coherent and perceptually diverse future trajectories is essential for realistic samples and low bandwidth communication. Experiments show that our inference framework, Gated Recurrent Unit-Stochastic Normalizing Flows (GRU-SNF), outperforms GRU-NF in generating diverse outputs without sacrificing accuracy, even under longer prediction horizons. By injecting stochasticity during inference, our approach captures multimodal behavior more effectively. These results highlight the potential of integrating stochastic dynamics with flow-based sequence models for generative time series forecasting.
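
To make "MCMC steps during inference" concrete, here is a toy random-walk Metropolis-Hastings refinement applied to already-drawn samples. It is a sketch under strong assumptions: the 1D bimodal `toy_log_density` merely stands in for a density defined by a trained flow, and where and how GRU-SNF interleaves such steps with the flow layers is not specified in the abstract.

```python
import math
import random

def toy_log_density(x):
    """Hypothetical target: unnormalized equal mixture of N(-2, 0.5^2) and N(2, 0.5^2)."""
    a = -0.5 * ((x + 2.0) / 0.5) ** 2
    b = -0.5 * ((x - 2.0) / 0.5) ** 2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def mh_refine(samples, log_density, n_steps=10, step_size=0.3, seed=0):
    """Apply n_steps of random-walk Metropolis-Hastings to each sample, no retraining."""
    rng = random.Random(seed)
    refined = list(samples)
    for _ in range(n_steps):
        for i, x in enumerate(refined):
            proposal = x + rng.gauss(0.0, step_size)
            log_accept = log_density(proposal) - log_density(x)
            # Accept uphill moves always, downhill moves with probability exp(log_accept).
            if log_accept >= 0.0 or rng.random() < math.exp(log_accept):
                refined[i] = proposal
    return refined
```

Since the steps only need density evaluations, they can be bolted onto a frozen model, which matches the abstract's claim that no retraining is required; the trade-off is extra compute per sample at inference time.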

[43] Plug-and-Play Image Restoration with Flow Matching: A Continuous Viewpoint

Fan Jia,Yuhao Huang,Shih-Hsin Wang,Cristina Garcia-Cardona,Andrea L. Bertozzi,Bao Wang

Main category: cs.CV

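TL;DR: This paper derives the continuous limit of PnP-Flow, an SDE surrogate model, which provides a theoretical understanding of PnP-Flow and motivates improvements that outperform existing methods on several image restoration tasks.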

Details Motivation: Although PnP-Flow has been empirically successful in image restoration, its theoretical understanding lags behind, and there is no in-depth analysis of its error sources or of how to optimize it. Method: The authors derive the continuous limit of PnP-Flow, build a surrogate model based on a stochastic differential equation (SDE), and use it to propose improvements: a better step schedule, regularization of the Lipschitz constant of the neural-network vector field, and acceleration of existing models via extrapolation. Result: The SDE-informed improved PnP-Flow is validated on several benchmark tasks, including image denoising, deblurring, super-resolution, and inpainting, with numerical results significantly better than the baseline PnP-Flow and other state-of-the-art methods. Conclusion: The proposed SDE model provides a theoretical basis for understanding PnP-Flow, guides practical improvements, and raises image restoration performance, with broad potential for application. Abstract: Flow matching-based generative models have been integrated into the plug-and-play image restoration framework, and the resulting plug-and-play flow matching (PnP-Flow) model has achieved some remarkable empirical success for image restoration. However, the theoretical understanding of PnP-Flow lags its empirical success. In this paper, we derive a continuous limit for PnP-Flow, resulting in a stochastic differential equation (SDE) surrogate model of PnP-Flow. The SDE model provides two particular insights to improve PnP-Flow for image restoration: (1) It enables us to quantify the error for image restoration, informing us to improve step scheduling and regularize the Lipschitz constant of the neural network-parameterized vector field for error reduction. (2) It informs us to accelerate off-the-shelf PnP-Flow models via extrapolation, resulting in a rescaled version of the proposed SDE model. We validate the efficacy of the SDE-informed improved PnP-Flow using several benchmark tasks, including image denoising, deblurring, super-resolution, and inpainting. Numerical results show that our method significantly outperforms the baseline PnP-Flow and other state-of-the-art approaches, achieving superior performance across evaluation metrics.

[44] Learning Single-Image Super-Resolution in the JPEG Compressed Domain

Sruthi Srinivasan,Elham Shakibapour,Rajy Rawther,Mehdi Saeedi

Main category: cs.CV

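TL;DR: The paper proposes a lightweight super-resolution pipeline based on JPEG DCT coefficients that trains models directly on encoded features, significantly speeding up data loading and training while maintaining good visual quality.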

Details Motivation: Data loading has become a major bottleneck for the training and inference speed of deep learning models, especially as input data keeps growing. Existing methods usually require full JPEG decoding, which adds computational overhead, so training models directly on compressed-domain features is worth exploring. Method: The paper proposes a lightweight single-image super-resolution (SISR) pipeline that operates directly on JPEG discrete cosine transform (DCT) coefficients, training on encoded frequency-domain features and skipping full JPEG decoding to reduce data-loading cost. Result: The method achieves a 2.6x speedup in data loading and a 2.5x speedup in training while preserving visual quality comparable to conventional SISR methods. Conclusion: Training super-resolution directly on DCT coefficients in the JPEG compressed domain is feasible and efficient, markedly speeds up data processing and training, and suggests a new direction for efficient deep learning pipelines. Abstract: Deep learning models have grown increasingly complex, with input data sizes scaling accordingly. Despite substantial advances in specialized deep learning hardware, data loading continues to be a major bottleneck that limits training and inference speed. To address this challenge, we propose training models directly on encoded JPEG features, reducing the computational overhead associated with full JPEG decoding and significantly improving data loading efficiency. While prior works have focused on recognition tasks, we investigate the effectiveness of this approach for the restoration task of single-image super-resolution (SISR). We present a lightweight super-resolution pipeline that operates on JPEG discrete cosine transform (DCT) coefficients in the frequency domain. Our pipeline achieves a 2.6x speedup in data loading and a 2.5x speedup in training, while preserving visual quality comparable to standard SISR approaches.
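
For readers unfamiliar with the representation, the sketch below computes the orthonormal 2D DCT-II of one 8x8 block, which is the transform JPEG applies per block (after a level shift, and before quantization and entropy coding). It is illustrative only: a compressed-domain pipeline like the one described would read the stored coefficients from the file rather than recompute them, and the helper names are hypothetical.

```python
import math

def dct_matrix(n=8):
    """Orthonormal DCT-II basis matrix C, so that coeffs = C @ block @ C^T."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows

def matmul(a, b):
    """Plain-Python matrix product for small matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(row) for row in zip(*m)]

def block_dct(block):
    """2D DCT-II of a square block (8x8 in JPEG)."""
    c = dct_matrix(len(block))
    return matmul(matmul(c, block), transpose(c))
```

For a constant block all energy lands in the top-left (DC) coefficient, which is why smooth image regions compress well and why the coefficients are a compact input for a network.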

[45] Gamma-from-Mono: Road-Relative, Metric, Self-Supervised Monocular Geometry for Vehicular Applications

Gasser Elazab,Maximilian Jansen,Michael Unterreiner,Olaf Hellwich

Main category: cs.CV

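TL;DR: This paper proposes Gamma-from-Mono (GfM), a lightweight monocular geometry estimation method that decouples global and local structure to address the oversmoothing of fine road geometry in monocular depth estimation. It represents residual road-surface variation with the dimensionless parameter gamma, recovers metric depth in closed form without full extrinsic calibration, and achieves state-of-the-art near-field depth estimation on KITTI and RSRD.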

Details Motivation: Conventional monocular depth estimation tends to oversmooth road details (such as bumps and slopes) when reconstructing the 3D environment around a vehicle, which affects motion planning and driving stability; a solution that preserves fine-scale road geometry and is suitable for real deployment is needed. Method: Gamma-from-Mono (GfM) decomposes road structure into a dominant global plane and local residual variation. It introduces gamma (the dimensionless ratio of a point's height to its depth) to represent local geometric deviation, recovers metric depth in closed form given only the camera's height above the ground, and uses self-supervised learning to avoid relying on annotated data. Result: Validated on KITTI and RSRD, the lightweight 8.88M-parameter model reaches state-of-the-art accuracy in near-field depth and gamma estimation while maintaining competitive global depth performance and adapting well to different camera setups. Conclusion: GfM is a physically interpretable monocular geometry estimation method that needs no full calibration and suits self-supervised learning; it effectively recovers fine-grained geometry near the road and provides accurate perception for safe and comfortable control in autonomous driving. Abstract: Accurate perception of the vehicle's 3D surroundings, including fine-scale road geometry, such as bumps, slopes, and surface irregularities, is essential for safe and comfortable vehicle control. However, conventional monocular depth estimation often oversmooths these features, losing critical information for motion planning and stability. To address this, we introduce Gamma-from-Mono (GfM), a lightweight monocular geometry estimation method that resolves the projective ambiguity in single-camera reconstruction by decoupling global and local structure. GfM predicts a dominant road surface plane together with residual variations expressed by gamma, a dimensionless measure of vertical deviation from the plane, defined as the ratio of a point's height above it to its depth from the camera, and grounded in established planar parallax geometry. With only the camera's height above ground, this representation deterministically recovers metric depth via a closed form, avoiding full extrinsic calibration and naturally prioritizing near-road detail. Its physically interpretable formulation makes it well suited for self-supervised learning, eliminating the need for large annotated datasets. Evaluated on KITTI and the Road Surface Reconstruction Dataset (RSRD), GfM achieves state-of-the-art near-field accuracy in both depth and gamma estimation while maintaining competitive global depth performance.
Our lightweight 8.88M-parameter model adapts robustly across diverse camera setups and, to our knowledge, is the first self-supervised monocular approach evaluated on RSRD.
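
The abstract does not spell out the closed form, but one follows directly from the stated definition of gamma. The sketch below works it out under an assumed sign convention (documented in the docstring); the function name and argument layout are hypothetical, and the paper's own formulation may differ.

```python
def depth_from_gamma(gamma, ray, plane_normal, camera_height):
    """Metric depth Z of a pixel from its predicted gamma.

    Assumed convention (not taken from the paper):
      - ray = K^{-1} [u, v, 1]^T, so its z-component is 1 and the 3D point is P = Z * ray;
      - plane_normal (n) is a unit vector pointing from the camera toward the road,
        so road points satisfy n . P = camera_height;
      - height above the road is h = camera_height - n . P;
      - gamma = h / Z.
    Substituting P = Z * ray gives gamma * Z = camera_height - Z * (n . ray),
    hence Z = camera_height / (gamma + n . ray).
    """
    n_dot_ray = sum(n * r for n, r in zip(plane_normal, ray))
    denom = gamma + n_dot_ray
    if denom <= 0.0:
        raise ValueError("gamma + n.ray must be positive for a finite positive depth")
    return camera_height / denom
```

With gamma set to zero this reduces to the usual flat-road depth `camera_height / (n . ray)`, so gamma acts as a per-pixel correction for points that sit above or below the plane; only the camera height sets the metric scale.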

[46] How (Mis)calibrated is Your Federated CLIP and What To Do About It?

Mainak Singha,Masih Aminbeidokhti,Paolo Casari,Elisa Ricci,Subhankar Roy

Main category: cs.CV

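TL;DR: This paper studies the calibration of CLIP models under federated learning (FL) and finds that existing fine-tuning methods and aggregation strategies offer limited calibration gains. The authors propose FL²oRA, a new LoRA-based method that naturally improves model calibration in FL and reduces the reliance on explicit calibration steps.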

Details Motivation: Although prior work has examined CLIP calibration in offline settings, its calibration in federated learning has not been explored, in particular how fine-tuning affects it. Method: The paper analyzes how textual prompt tuning methods degrade calibration under federated learning, evaluates combinations of several in-training calibration techniques with four global aggregation methods, and proposes FL²oRA, a LoRA-based fine-tuning method that improves model calibration in the distributed setting. Result: Experiments show that existing calibration techniques bring limited improvement under FL, while FL²oRA consistently improves calibration on multiple benchmarks without extra explicit calibration steps. Conclusion: The key challenge is which components to fine-tune, not only how to aggregate or calibrate; through a sound choice of parameters, FL²oRA effectively addresses the calibration of vision-language models in federated learning. Abstract: While vision-language models like CLIP have been extensively studied, their calibration, crucial for reliable predictions, has received limited attention. Although a few prior works have examined CLIP calibration in offline settings, the impact of fine-tuning CLIP in a federated learning (FL) setup remains unexplored. In this work, we investigate how FL affects CLIP calibration and propose strategies to improve reliability in this distributed setting. We first analyze Textual Prompt Tuning approaches and show that they degrade calibration metrics when operating under FL. We also evaluate existing in-training calibration techniques across four global aggregation methods, finding that they provide limited improvements. Our results suggest that the key challenge lies not only in how we aggregate or calibrate, but in which components we choose to fine-tune. Motivated by this insight, we propose $\text{FL}^2\text{oRA}$, a straightforward LoRA-based approach that naturally improves calibration in FL, and we analyze the factors behind its effectiveness. Experiments on multiple benchmarks demonstrate that $\text{FL}^2\text{oRA}$ consistently produces well-calibrated models, reducing the need for explicit calibration procedures. Code is available at https://github.com/mainaksingha01/FL2oRA.
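
The abstract refers to "calibration metrics" without naming them. A widely used one is Expected Calibration Error (ECE), sketched below with equal-width confidence bins; treating ECE as the metric of interest here is an assumption, and the bin count is an arbitrary default.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted average of |accuracy - mean confidence| per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1.0 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece
```

A model that is right 90% of the time when it reports 0.9 confidence scores near zero; a model that reports 1.0 confidence but is right half the time scores 0.5.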

[47] Text-Only Training for Image Captioning with Retrieval Augmentation and Modality Gap Correction

Rui Fonseca,Bruno Martins,Gil Rocha

Main category: cs.CV

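TL;DR: This paper proposes TOMCap, a text-only training method for image captioning that needs no aligned image-text pairs; it uses CLIP representations and retrieval augmentation to guide a pre-trained language model in generating captions, significantly outperforming existing unsupervised methods.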

Details Motivation: The goal is to reduce the reliance on human-annotated image-text data and to explore image captioning without aligned image-text pairs. Method: Image representations are extracted with CLIP and passed through a modality-gap reduction strategy; retrieved caption examples and latent vector representations are then combined to prompt a pre-trained language model decoder to generate the caption. Result: Experiments show that TOMCap outperforms existing training-free and text-only methods across several unsupervised settings, and the paper analyzes different configurations of retrieval augmentation and modality-gap reduction. Conclusion: TOMCap achieves image captioning without aligned image-text pairs and offers a new way to reduce dependence on annotated data. Abstract: Image captioning has drawn considerable attention from the natural language processing and computer vision fields. Aiming to reduce the reliance on curated data, several studies have explored image captioning without any human-annotated image-text pairs for training, although existing methods are still outperformed by fully supervised approaches. This paper proposes TOMCap, i.e., an improved text-only training method that performs captioning without the need for aligned image-caption pairs. The method is based on prompting a pre-trained language model decoder with information derived from a CLIP representation, after undergoing a process to reduce the modality gap. We specifically tested the combined use of retrieved examples of captions, and latent vector representations, to guide the generation process. Through extensive experiments, we show that TOMCap outperforms other training-free and text-only methods. We also analyze the impact of different choices regarding the configuration of the retrieval-augmentation and modality gap reduction components.

[48] Real-time Cricket Sorting By Sex

Juan Manuel Cantarero Angulo,Matthew Smith

Main category: cs.CV

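TL;DR: The paper presents a low-cost, real-time system for automated sex-based sorting of crickets, built on computer vision and a lightweight deep learning model, that runs effectively on resource-constrained devices and improves the efficiency and sustainability of cricket farming.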

Details Motivation: With rising demand for edible insects as a sustainable protein source, better sex sorting in cricket farming can improve breeding efficiency, nutritional management, and production sustainability, but low-cost automated solutions are currently lacking. Method: The system combines a Raspberry Pi 5, the official AI camera, and a custom YOLOv8 nano object detection model with a servo-driven sorting arm, and its sorting accuracy is tested in a real environment. Result: The model reached an mAP@0.5 of 0.977 in testing, and the overall sorting accuracy in real group experiments was 86.8%. Conclusion: Lightweight deep learning models can be deployed effectively on resource-constrained devices for efficient automated sex sorting of crickets, offering a feasible and practical solution for insect farming. Abstract: The global demand for sustainable protein sources is driving increasing interest in edible insects, with Acheta domesticus (house cricket) identified as one of the most suitable species for industrial production. Current farming practices typically rear crickets in mixed-sex populations without automated sex sorting, despite potential benefits such as selective breeding, optimized reproduction ratios, and nutritional differentiation. This work presents a low-cost, real-time system for automated sex-based sorting of Acheta domesticus, combining computer vision and physical actuation. The device integrates a Raspberry Pi 5 with the official Raspberry AI Camera and a custom YOLOv8 nano object detection model, together with a servo-actuated sorting arm. The model reached a mean Average Precision at IoU 0.5 (mAP@0.5) of 0.977 during testing, and real-world experiments with groups of crickets achieved an overall sorting accuracy of 86.8%. These results demonstrate the feasibility of deploying lightweight deep learning models on resource-constrained devices for insect farming applications, offering a practical solution to improve efficiency and sustainability in cricket production.
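
The reported mAP@0.5 counts a detection as correct only when its box overlaps a ground-truth box with intersection-over-union (IoU) of at least 0.5. A minimal sketch of that overlap test, assuming boxes given as `(x1, y1, x2, y2)` corner coordinates (the box format and function names are illustrative choices):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0.0 else 0.0

def is_match(pred_box, gt_box, threshold=0.5):
    """True if the prediction counts as a hit at the given IoU threshold."""
    return iou(pred_box, gt_box) >= threshold
```

Note that mAP measures detection quality on images, whereas the 86.8% figure is end-to-end sorting accuracy, which also depends on the mechanical sorting arm and on insects moving between detection and actuation.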

[49] Mind-to-Face: Neural-Driven Photorealistic Avatar Synthesis via EEG Decoding

Haolin Xiong,Tianwen Fu,Pratusha Bhuvana Prasad,Yunxuan Cai,Haiwei Chen,Wenbin Teng,Hanyuan Xiao,Yajie Zhao

Main category: cs.CV

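TL;DR: This paper presents Mind-to-Face, the first framework that generates high-fidelity facial expressions directly from non-invasive electroencephalogram (EEG) signals, enabling personalized, emotion-aware, neural-driven avatars without any visual input.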

Details Motivation: Existing expressive avatar systems rely heavily on visual cues and perform poorly when faces are occluded or emotions remain internal, so a new method is needed that does not depend on vision and can decode emotion from neural signals to generate facial expressions. Method: The authors build a dual-modality synchronized recording setup that captures EEG and multi-view facial video during emotion-eliciting stimuli. A CNN-Transformer encoder maps EEG signals to dense 3D position maps, which are rendered through a modified 3D Gaussian Splatting pipeline to produce photorealistic, view-consistent expressions. Result: Experiments show that EEG signals alone can reliably predict dynamic, subject-specific facial expressions, including subtle emotional responses, revealing that neural signals carry richer affective and geometric information than previously recognized. Conclusion: Mind-to-Face establishes a new paradigm for neural-driven avatars and opens new possibilities for personalized telepresence and cognitive interaction in immersive environments. Abstract: Current expressive avatar systems rely heavily on visual cues, failing when faces are occluded or when emotions remain internal. We present Mind-to-Face, the first framework that decodes non-invasive electroencephalogram (EEG) signals directly into high-fidelity facial expressions. We build a dual-modality recording setup to obtain synchronized EEG and multi-view facial video during emotion-eliciting stimuli, enabling precise supervision for neural-to-visual learning. Our model uses a CNN-Transformer encoder to map EEG signals into dense 3D position maps, capable of sampling over 65k vertices, capturing fine-scale geometry and subtle emotional dynamics, and renders them through a modified 3D Gaussian Splatting pipeline for photorealistic, view-consistent results. Through extensive evaluation, we show that EEG alone can reliably predict dynamic, subject-specific facial expressions, including subtle emotional responses, demonstrating that neural signals contain far richer affective and geometric information than previously assumed. Mind-to-Face establishes a new paradigm for neural-driven avatars, enabling personalized, emotion-aware telepresence and cognitive interaction in immersive environments.

[50] DisentangleFormer: Spatial-Channel Decoupling for Multi-Channel Vision

Jiashu Liao,Pietro Liò,Marc de Kamps,Duygu Sarikaya

Main category: cs.CV

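TL;DR: DisentangleFormer is a Vision Transformer architecture with spatial-channel decoupling. By processing spatial and channel tokens in parallel and adding an adaptive fusion module and a multi-scale feed-forward network, it achieves better representation learning and performance on multi-channel vision tasks such as hyperspectral imaging.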

Details Motivation: Standard self-attention processes spatial and channel dimensions jointly, which entangles representations and makes it hard to model structural and semantic dependencies independently; the effect is especially pronounced in hyperspectral imaging. Method: DisentangleFormer has three core components: (1) parallel, decoupled processing of spatial tokens and channel tokens; (2) a Squeezed Token Enhancer that dynamically fuses the two streams; and (3) a multi-scale FFN that captures local contextual dependencies. Result: The model reaches state-of-the-art performance on Indian Pine, Pavia University, Houston, BigEarthNet, and an infrared pathology dataset, while remaining competitive on ImageNet and reducing FLOPs by 17.8%. Conclusion: Through principled spatial-channel decoupling, DisentangleFormer improves multi-channel visual representation and combines high performance with high efficiency. Abstract: Vision Transformers face a fundamental limitation: standard self-attention jointly processes spatial and channel dimensions, leading to entangled representations that prevent independent modeling of structural and semantic dependencies. This problem is especially pronounced in hyperspectral imaging, from satellite hyperspectral remote sensing to infrared pathology imaging, where channels capture distinct biophysical or biochemical cues. We propose DisentangleFormer, an architecture that achieves robust multi-channel vision representation through principled spatial-channel decoupling. Motivated by information-theoretic principles of decorrelated representation learning, our parallel design enables independent modeling of structural and semantic cues while minimizing redundancy between spatial and channel streams. Our design integrates three core components: (1) Parallel Disentanglement, which independently processes spatial-token and channel-token streams, enabling decorrelated feature learning across spatial and spectral dimensions; (2) Squeezed Token Enhancer, an adaptive calibration module that dynamically fuses spatial and channel streams; and (3) Multi-Scale FFN, which complements global attention with multi-scale local context to capture fine-grained structural and semantic dependencies. Extensive experiments on hyperspectral benchmarks demonstrate that DisentangleFormer achieves state-of-the-art performance, consistently outperforming existing models on Indian Pine, Pavia University, and Houston, the large-scale BigEarthNet remote sensing dataset, as well as an infrared pathology dataset.
Moreover, it retains competitive accuracy on ImageNet while reducing computational cost by 17.8% in FLOPs. The code will be made publicly available upon acceptance.

[51] SyncTrack4D: Cross-Video Motion Alignment and Video Synchronization for Multi-Video 4D Gaussian Splatting

Yonghan Lee,Tsung-Wei Huang,Shiv Gehlot,Jaehoon Choi,Guan-Ming Su,Dinesh Manocha

Main category: cs.CV

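TL;DR: This paper presents SyncTrack4D, a multi-video 4D Gaussian Splatting (4DGS) method for real-world unsynchronized video sets, and the first general 4DGS reconstruction approach that requires no predefined scene objects or prior models.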

Details Motivation: Modeling dynamic 3D scenes is challenging because of their high dimensionality and the need to aggregate information from multiple views, and existing methods struggle to reconstruct time-varying geometry and motion from unsynchronized videos. Method: Dense 4D track representations serve as cues; per-video 4D feature tracks and their cross-video correspondences are computed with Fused Gromov-Wasserstein optimal transport, a global frame-level temporal alignment maximizes the overlapping motion, and sub-frame synchronization and multi-video 4D Gaussian Splatting reconstruction are built on a motion-spline scaffold. Result: On the Panoptic Studio and SyncNeRF Blender datasets, the method attains an average temporal error below 0.26 frames and a PSNR of 26.3, giving high-fidelity 4D reconstruction with accurate synchronization. Conclusion: SyncTrack4D is the first general 4D Gaussian Splatting method for unsynchronized video sets, achieving cross-video synchronization and high-quality 4D scene reconstruction at the same time without relying on prior objects or models. Abstract: Modeling dynamic 3D scenes is challenging due to their high-dimensional nature, which requires aggregating information from multiple views to reconstruct time-evolving 3D geometry and motion. We present a novel multi-video 4D Gaussian Splatting (4DGS) approach designed to handle real-world, unsynchronized video sets. Our approach, SyncTrack4D, directly leverages a dense 4D track representation of dynamic scene parts as cues for simultaneous cross-video synchronization and 4DGS reconstruction. We first compute dense per-video 4D feature tracks and cross-video track correspondences using a Fused Gromov-Wasserstein optimal transport approach. Next, we perform global frame-level temporal alignment to maximize overlapping motion of matched 4D tracks. Finally, we achieve sub-frame synchronization through our multi-video 4D Gaussian splatting built upon a motion-spline scaffold representation. The final output is a synchronized 4DGS representation with dense, explicit 3D trajectories, and temporal offsets for each video. We evaluate our approach on the Panoptic Studio and SyncNeRF Blender datasets, demonstrating sub-frame synchronization accuracy with an average temporal error below 0.26 frames, and high-fidelity 4D reconstruction reaching 26.3 PSNR scores on the Panoptic Studio dataset. To the best of our knowledge, our work is the first general 4D Gaussian Splatting approach for unsynchronized video sets, without assuming the existence of predefined scene objects or prior models.
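
The frame-level alignment step can be pictured with a toy example: given one motion signal per video, try every integer shift and keep the one under which the two signals agree best on their overlap. The sketch below assumes a 1D per-frame motion magnitude and a mean-squared-difference cost; both are simplifications chosen for illustration, since the paper aligns matched 4D tracks and does not spell out its objective in the abstract.

```python
def best_frame_offset(reference, other, max_shift):
    """Return the integer shift s for which other[t] best matches reference[t + s].

    Both inputs are hypothetical per-frame motion magnitudes. The cost is the
    mean squared difference over the frames where the shifted signals overlap.
    """
    best_shift, best_cost = 0, float("inf")
    for shift in range(-max_shift, max_shift + 1):
        pairs = [
            (reference[t + shift], other[t])
            for t in range(len(other))
            if 0 <= t + shift < len(reference)
        ]
        if not pairs:
            continue  # no temporal overlap at this shift
        cost = sum((a - b) ** 2 for a, b in pairs) / len(pairs)
        if cost < best_cost:
            best_shift, best_cost = shift, cost
    return best_shift
```

A sub-frame refinement, as described in the abstract, would then start from this integer estimate.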

[52] Bayes-DIC Net: Estimating Digital Image Correlation Uncertainty with Bayesian Neural Networks

Biao Chen,Zhenhua Lei,Yahui Zhang,Tongzhi Niu

Main category: cs.CV

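TL;DR: The paper proposes a method for generating high-quality Digital Image Correlation (DIC) datasets from non-uniform B-spline surfaces, and designs Bayes-DIC Net, a network that extracts and fuses multi-level information and uses Bayesian inference to attach confidence levels to its predictions, improving the performance and practicality of deep learning for DIC.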

Details Motivation: Existing DIC datasets do not cover the variety of displacement fields seen in practice, which limits the training and generalization of deep learning methods; there is also no mechanism for quantifying prediction uncertainty, which reduces reliability in real applications. Method: Non-uniform B-spline displacement fields are built from randomly generated control points to produce a large-scale DIC dataset covering many realistic displacement scenarios; the proposed Bayes-DIC Net uses lightweight convolutional blocks to enlarge the receptive field, extracts multi-level features during down-sampling and aggregates them through a single skip connection, and adds dropout modules that stay active at inference to act as a Bayesian neural network that outputs predictions together with confidence levels. Result: The generated dataset improves generalization to realistic displacement fields; Bayes-DIC Net achieves accurate displacement prediction at low computational cost and provides confidence estimates for its predictions, increasing its applicability and trustworthiness on real unlabeled data. Conclusion: The non-uniform B-spline data generation method and the Bayes-DIC Net architecture offer new directions for dataset construction and algorithm design in DIC, improving the performance, reliability, and practical value of deep learning methods. Abstract: This paper introduces a novel method for generating high-quality Digital Image Correlation (DIC) datasets based on non-uniform B-spline surfaces. By randomly generating control point coordinates, we construct displacement fields that encompass a variety of realistic displacement scenarios, which are subsequently used to generate speckle pattern datasets. This approach enables the generation of a large-scale dataset that captures real-world displacement field situations, thereby enhancing the training and generalization capabilities of deep learning-based DIC algorithms. Additionally, we propose a novel network architecture, termed Bayes-DIC Net, which extracts information at multiple levels during the down-sampling phase and facilitates the aggregation of information across various levels through a single skip connection during the up-sampling phase. Bayes-DIC Net incorporates a series of lightweight convolutional blocks designed to expand the receptive field and capture rich contextual information while minimizing computational costs. Furthermore, by integrating appropriate dropout modules into Bayes-DIC Net and activating them during the network inference stage, Bayes-DIC Net is transformed into a Bayesian neural network. This transformation allows the network to provide not only predictive results but also confidence levels in these predictions when processing real unlabeled datasets.
This feature significantly enhances the practicality and reliability of our network in real-world displacement field prediction tasks. Through these innovations, this paper offers new perspectives and methods for dataset generation and algorithm performance enhancement in the field of DIC.
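
The idea of keeping dropout active at inference (often called Monte Carlo dropout) can be sketched without any deep learning library: run the same input through a stochastic forward pass many times and read the spread of the outputs as a confidence signal. The one-layer "network", its weights, and all function names below are hypothetical stand-ins for illustration; they are not the Bayes-DIC Net architecture.

```python
import random
import statistics


def dropout(values, drop_prob, rng):
    """Inverted dropout: zero each value with probability drop_prob, rescale the rest."""
    keep_prob = 1.0 - drop_prob
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]


def toy_forward(features, weights, drop_prob, rng):
    """A hypothetical one-layer model: weighted features, dropout, then a sum."""
    hidden = [f * w for f, w in zip(features, weights)]
    return sum(dropout(hidden, drop_prob, rng))


def mc_dropout_predict(features, weights, drop_prob=0.2, passes=200, seed=0):
    """Repeat stochastic forward passes; return (mean prediction, standard deviation)."""
    rng = random.Random(seed)
    samples = [toy_forward(features, weights, drop_prob, rng) for _ in range(passes)]
    return statistics.mean(samples), statistics.stdev(samples)
```

With `drop_prob=0.0` every pass is identical and the standard deviation is zero; with dropout enabled, a larger standard deviation signals a less certain prediction.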

[53] A Retrieval-Augmented Generation Approach to Extracting Algorithmic Logic from Neural Networks

Waleed Khalid,Dmitry Ignatov,Radu Timofte

Main category: cs.CV

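TL;DR: NN-RAG is a retrieval-augmented generation system that turns large PyTorch codebases into a searchable, executable library of neural network modules, supporting cross-repository module reuse and validation and substantially improving the discoverability and diversity of neural architectures.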

Details Motivation: Reusing existing neural network components is difficult, and effective tools for discovering, extracting, and validating modules from large numbers of open-source repositories are lacking. Method: The proposed NN-RAG system uses scope-aware dependency resolution, import-preserving reconstruction, and validator-gated promotion to extract runnable modules from PyTorch codebases, followed by multi-level de-duplication (exact, lexical, structural). Result: Across 19 major repositories, 1,289 candidate modules were extracted and 941 (73.0%) were validated, of which over 80% are structurally unique; NN-RAG contributes approximately 72% of the novel network structures in the LEMUR dataset and enables cross-repository migration of architectural patterns. Conclusion: NN-RAG is the first open-source system to provide large-scale, validated, executable retrieval and reuse of neural modules, offering a reproducible, provenance-tracked infrastructure for algorithmic discovery. Abstract: Reusing existing neural-network components is central to research efficiency, yet discovering, extracting, and validating such modules across thousands of open-source repositories remains difficult. We introduce NN-RAG, a retrieval-augmented generation system that converts large, heterogeneous PyTorch codebases into a searchable and executable library of validated neural modules. Unlike conventional code search or clone-detection tools, NN-RAG performs scope-aware dependency resolution, import-preserving reconstruction, and validator-gated promotion -- ensuring that every retrieved block is scope-closed, compilable, and runnable. Applied to 19 major repositories, the pipeline extracted 1,289 candidate blocks, validated 941 (73.0%), and demonstrated that over 80% are structurally unique. Through multi-level de-duplication (exact, lexical, structural), we find that NN-RAG contributes the overwhelming majority of unique architectures to the LEMUR dataset, supplying approximately 72% of all novel network structures. Beyond quantity, NN-RAG uniquely enables cross-repository migration of architectural patterns, automatically identifying reusable modules in one project and regenerating them, dependency-complete, in another context. To our knowledge, no other open-source system provides this capability at scale. The framework's neutral specifications further allow optional integration with language models for synthesis or dataset registration without redistributing third-party code.
Overall, NN-RAG transforms fragmented vision code into a reproducible, provenance-tracked substrate for algorithmic discovery, offering a first open-source solution that both quantifies and expands the diversity of executable neural architectures across repositories.
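
The three de-duplication levels named in the abstract (exact, lexical, structural) can be illustrated with Python's standard library. The sketch below is one plausible reading, not NN-RAG's implementation: exact keys hash the raw source, lexical keys hash the token stream without comments or layout, and structural keys hash the AST after every identifier is replaced by a placeholder. All function and class names here are made up for the example.

```python
import ast
import hashlib
import io
import tokenize


def _digest(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def exact_key(source):
    """Identical only when the source text is byte-for-byte the same."""
    return _digest(source)


def lexical_key(source):
    """Hash the token strings, ignoring comments, newlines, and indentation tokens."""
    skip = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
            tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return _digest(" ".join(tok.string for tok in tokens if tok.type not in skip))


class _StripNames(ast.NodeTransformer):
    """Replace every identifier with a placeholder so only the tree shape remains."""

    def visit_Name(self, node):
        node.id = "_"
        return node

    def visit_arg(self, node):
        node.arg = "_"
        self.generic_visit(node)
        return node

    def visit_Attribute(self, node):
        node.attr = "_"
        self.generic_visit(node)
        return node

    def visit_FunctionDef(self, node):
        node.name = "_"
        self.generic_visit(node)
        return node

    def visit_ClassDef(self, node):
        node.name = "_"
        self.generic_visit(node)
        return node


def structural_key(source):
    """Hash the AST with names removed; renamed copies of a block collide."""
    tree = _StripNames().visit(ast.parse(source))
    return _digest(ast.dump(tree))


def deduplicate(sources, key_fn):
    """Keep the first block seen for each key."""
    seen, unique = set(), []
    for src in sources:
        key = key_fn(src)
        if key not in seen:
            seen.add(key)
            unique.append(src)
    return unique
```

Applying `deduplicate` with each key function in turn gives progressively smaller sets, which is the sense in which the levels are ordered from strict to permissive.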

[54] Open Set Face Forgery Detection via Dual-Level Evidence Collection

Zhongyi Cai,Bryce Gernon,Wentao Bao,Yifan Li,Matthew Wright,Yu Kong

Main category: cs.CV

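TL;DR: This paper studies Open Set Face Forgery Detection (OSFFD) and proposes the Dual-Level Evidential face forgery Detection (DLED) method, which fuses category-specific evidence at the spatial and frequency levels to estimate prediction uncertainty. It outperforms existing methods by an average of 20% at detecting novel forgery types while remaining competitive on the traditional real-versus-fake task.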

Details Motivation: Most existing forgery detectors are limited to binary classification or to recognizing known forgery categories, so they cannot cope with the new forgery types that keep emerging, which makes them unreliable in practice. Method: The proposed Dual-Level Evidential Detection (DLED) method collects and fuses category-specific evidence at two levels, the spatial domain and the frequency domain, and uses uncertainty estimation to identify both known and unknown forgery types, enabling open-set forgery detection. Result: Across a range of experimental settings, DLED outperforms existing methods by an average of 20% at detecting novel forgery categories and remains competitive on traditional real-versus-fake detection, supporting its effectiveness and generalization. Conclusion: By combining dual-level evidence fusion with uncertainty modeling, DLED addresses novel-category recognition in open-set face forgery detection and improves adaptability and robustness in real-world settings. Abstract: The proliferation of face forgeries has increasingly undermined confidence in the authenticity of online content. Given the rapid development of face forgery generation algorithms, new fake categories are likely to keep appearing, posing a major challenge to existing face forgery detection methods. Despite recent advances in face forgery detection, existing methods are typically limited to binary Real-vs-Fake classification or the identification of known fake categories, and are incapable of detecting the emergence of novel types of forgeries. In this work, we study the Open Set Face Forgery Detection (OSFFD) problem, which demands that the detection model recognize novel fake categories. We reformulate the OSFFD problem and address it through uncertainty estimation, enhancing its applicability to real-world scenarios. Specifically, we propose the Dual-Level Evidential face forgery Detection (DLED) approach, which collects and fuses category-specific evidence on the spatial and frequency levels to estimate prediction uncertainty. Extensive evaluations conducted across diverse experimental settings demonstrate that the proposed DLED method achieves state-of-the-art performance, outperforming various baseline models by an average of 20% in detecting forgeries from novel fake categories. Moreover, on the traditional Real-versus-Fake face forgery detection task, our DLED method concurrently exhibits competitive performance.
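
To make "evidence" and "prediction uncertainty" concrete, the sketch below uses the standard evidential (subjective-logic) formulation: non-negative per-class evidence parameterizes a Dirichlet distribution, and the uncertainty mass is the number of classes divided by the Dirichlet strength. Two points are assumptions made purely for illustration and are not taken from the abstract: the spatial and frequency evidence is fused by simple addition, and a sample is flagged as a novel category when uncertainty exceeds a fixed threshold.

```python
def dirichlet_opinion(evidence):
    """Map non-negative per-class evidence to (belief masses, uncertainty mass).

    With alpha_k = evidence_k + 1 and S = sum(alpha), belief_k = evidence_k / S
    and uncertainty = K / S, so the beliefs and the uncertainty sum to 1.
    """
    num_classes = len(evidence)
    strength = sum(evidence) + num_classes
    beliefs = [e / strength for e in evidence]
    return beliefs, num_classes / strength


def fuse_by_sum(spatial_evidence, frequency_evidence):
    """Assumed fusion rule for this sketch: add the evidence from the two levels."""
    return [s + f for s, f in zip(spatial_evidence, frequency_evidence)]


def classify_open_set(evidence, threshold=0.5, novel_label=-1):
    """Return (label, uncertainty); label is novel_label when uncertainty is high."""
    beliefs, uncertainty = dirichlet_opinion(evidence)
    if uncertainty > threshold:
        return novel_label, uncertainty
    best = max(range(len(beliefs)), key=beliefs.__getitem__)
    return best, uncertainty
```

When both levels supply strong evidence for one known category, the uncertainty is low and that category is returned; when neither level supports any known category, the uncertainty approaches 1 and the sample is treated as novel.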

[55] Mitigating Object and Action Hallucinations in Multimodal LLMs via Self-Augmented Contrastive Alignment

Kai-Po Chang,Wei-Yuan Cheng,Chi-Pin Huang,Fu-En Yang,Yu-Chiang Frank Wang

Main category: cs.CV

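TL;DR: The paper proposes SANTA, a self-augmented contrastive alignment framework that mitigates object and action hallucinations in video captions generated by multimodal LLMs, using self-augmentation to create negative samples and tracklet-phrase contrastive alignment to improve caption faithfulness.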

Details Motivation: Existing video captioning models suffer from severe factual hallucinations, and object and action hallucinations in dynamic videos in particular have not been addressed effectively. Method: The proposed SANTA framework uses a hallucination-driven self-augmentation strategy to generate contrastive negatives, and a tracklet-phrase contrastive alignment mechanism to match regional objects and relation-guided actions with their visual and temporal phrases. Result: Experiments show that SANTA outperforms existing methods on several hallucination benchmarks and reduces both object and action hallucinations. Conclusion: SANTA improves the factual consistency of multimodal LLMs in video captioning, producing more reliable output by decoupling spurious correlations and emphasizing visual facts. Abstract: Recent advancement in multimodal LLMs (MLLMs) has demonstrated their remarkable capability to generate descriptive captions for input videos. However, these models suffer from factual inaccuracies in the generated descriptions, causing severe hallucination issues. While prior works have explored alleviating hallucinations for static images, jointly mitigating visual object and temporal action hallucinations for dynamic videos remains a challenging and unsolved task. To tackle this challenge, we propose a Self-Augmented Contrastive Alignment (SANTA) framework for enabling object and action faithfulness by exempting the spurious correlations and enforcing the emphasis on visual facts. SANTA employs a hallucinative self-augmentation scheme to identify the potential hallucinations that lie in the MLLM and transform the original captions to the contrasted negatives. Furthermore, we develop a tracklet-phrase contrastive alignment to match the regional objects and relation-guided actions with their corresponding visual and temporal phrases. Extensive experiments demonstrate that SANTA outperforms existing methods in alleviating object and action hallucinations, yielding superior performance on the hallucination examination benchmarks.

[56] MAFNet: Multi-frequency Adaptive Fusion Network for Real-time Stereo Matching

Ao Xu,Rujin Zhao,Xiong Xu,Boceng Huang,Yujia Jia,Hongfeng Long,Fuxuan Chen,Zilong Cao,Fangyuan Chen

Main category: cs.CV

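TL;DR: The paper proposes MAFNet, a multi-frequency adaptive fusion network built on 2D convolutions for efficient, high-quality stereo matching that balances accuracy and real-time performance.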

Details Motivation: Existing stereo matching methods fall short in either computational cost or non-local context modeling, making accurate real-time deployment on resource-constrained devices difficult. Method: An adaptive frequency-domain filtering attention module decomposes the cost volume into high- and low-frequency parts that are aggregated separately, and a Linformer-based low-rank attention mechanism adaptively fuses the high- and low-frequency information; all feature processing uses only 2D convolutions. Result: On datasets such as Scene Flow and KITTI 2015, the method significantly outperforms existing real-time approaches, achieving higher accuracy at faster speed. Conclusion: Without relying on 3D convolutions or iterative optimization, MAFNet balances stereo matching accuracy and real-time performance through frequency-domain decomposition and adaptive fusion, making it suitable for mobile deployment. Abstract: Existing stereo matching networks typically rely on either cost-volume construction based on 3D convolutions or deformation methods based on iterative optimization. The former incurs significant computational overhead during cost aggregation, whereas the latter often lacks the ability to model non-local contextual information. These methods exhibit poor compatibility on resource-constrained mobile devices, limiting their deployment in real-time applications. To address this, we propose a Multi-frequency Adaptive Fusion Network (MAFNet), which can produce high-quality disparity maps using only efficient 2D convolutions. Specifically, we design an adaptive frequency-domain filtering attention module that decomposes the full cost volume into high-frequency and low-frequency volumes, performing frequency-aware feature aggregation separately. Subsequently, we introduce a Linformer-based low-rank attention mechanism to adaptively fuse high- and low-frequency information, yielding more robust disparity estimation. Extensive experiments demonstrate that the proposed MAFNet significantly outperforms existing real-time methods on public datasets such as Scene Flow and KITTI 2015, showing a favorable balance between accuracy and real-time performance.

[57] FMA-Net++: Motion- and Exposure-Aware Real-World Joint Video Super-Resolution and Deblurring

Geunhyuk Youk,Jihyong Oh,Munchurl Kim

Main category: cs.CV

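TL;DR: This paper presents FMA-Net++, a framework for joint video super-resolution and deblurring that explicitly models the coupled effect of motion and dynamically varying exposure, using a hierarchical bidirectional propagation structure and exposure-aware modules for efficient, accurate video restoration.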

Details Motivation: Real-world videos often contain complex degradations caused jointly by motion and dynamically varying exposure (for example, auto-exposure or low-light capture), yet most existing methods overlook this coupling and restore such videos poorly. Method: FMA-Net++ uses a sequence-level architecture based on hierarchical bidirectional propagation, with an Exposure Time-aware Modulation layer and a Flow-Guided Dynamic Filtering module that model per-frame exposure conditions and motion-related degradation kernels respectively; degradation learning is decoupled from restoration to improve accuracy and efficiency. Result: Trained only on synthetic data, FMA-Net++ achieves state-of-the-art restoration quality and temporal consistency on the newly proposed REDS-ME and REDS-RE benchmarks and on GoPro, with faster inference, and generalizes well to real-world videos. Conclusion: By explicitly modeling the coupled degradation from motion and dynamic exposure, FMA-Net++ substantially improves video restoration in complex real-world scenes and shows good practicality and generalization. Abstract: Real-world video restoration is plagued by complex degradations from motion coupled with dynamically varying exposure - a key challenge largely overlooked by prior works and a common artifact of auto-exposure or low-light capture. We present FMA-Net++, a framework for joint video super-resolution and deblurring that explicitly models this coupled effect of motion and dynamically varying exposure. FMA-Net++ adopts a sequence-level architecture built from Hierarchical Refinement with Bidirectional Propagation blocks, enabling parallel, long-range temporal modeling. Within each block, an Exposure Time-aware Modulation layer conditions features on per-frame exposure, which in turn drives an exposure-aware Flow-Guided Dynamic Filtering module to infer motion- and exposure-aware degradation kernels. FMA-Net++ decouples degradation learning from restoration: the former predicts exposure- and motion-aware priors to guide the latter, improving both accuracy and efficiency. To evaluate under realistic capture conditions, we introduce REDS-ME (multi-exposure) and REDS-RE (random-exposure) benchmarks. Trained solely on synthetic data, FMA-Net++ achieves state-of-the-art accuracy and temporal consistency on our new benchmarks and GoPro, outperforming recent methods in both restoration quality and inference speed, and generalizes well to challenging real-world videos.

[58] Fourier-Attentive Representation Learning: A Fourier-Guided Framework for Few-Shot Generalization in Vision-Language Models

Hieu Dinh Trung Pham,Huy Minh Nhat Nguyen,Cuong Tuan Nguyen

Main category: cs.CV

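TL;DR: This paper proposes Fourier-Attentive Representation Learning (FARL), a framework that uses Fourier analysis to explicitly disentangle structure and style in visual representations, improving the generalization of large-scale pre-trained vision-language models.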

Details Motivation: Existing vision-language models typically entangle an image's domain-invariant structure with its domain-specific style implicitly, which limits generalization, so a method that separates these features explicitly is needed to improve adaptability. Method: The proposed FARL framework uses a dual cross-attention mechanism in which learnable representation tokens separately query structural features from the image's phase spectrum and stylistic features from its amplitude spectrum; an asymmetric injection strategy then injects the disentangled features into the deep layers of the VLM encoders to guide adaptation. Result: Extensive experiments on 15 datasets show that the method significantly outperforms existing approaches on few-shot learning tasks. Conclusion: By explicitly disentangling structure and style features and using them to guide VLM adaptation, FARL improves cross-domain generalization and the quality of vision-language alignment. Abstract: Large-scale pre-trained Vision-Language Models (VLMs) have demonstrated strong few-shot learning capabilities. However, these methods typically learn holistic representations where an image's domain-invariant structure is implicitly entangled with its domain-specific style. This presents an opportunity to further enhance generalization by disentangling these visual cues. In this paper, we propose Fourier-Attentive Representation Learning (FARL), a novel framework that addresses this by explicitly disentangling visual representations using Fourier analysis. The core of our method is a dual cross-attention mechanism, where learnable representation tokens separately query an image's structural features (from the phase spectrum) and stylistic features (from the amplitude spectrum). This process yields enriched, disentangled tokens that are then injected deep into the VLM encoders to guide adaptation. Our design, which includes an asymmetric injection strategy, forces the model to learn a more robust vision-language alignment. Extensive experiments on 15 datasets demonstrate the effectiveness of our approach.
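
The split into a phase spectrum (treated as structure) and an amplitude spectrum (treated as style) can be shown on a 1D signal with a plain discrete Fourier transform. This is only a toy illustration of the decomposition itself: FARL operates on images and feeds the two spectra to cross-attention, none of which is modeled here, and the helper names are invented for the example.

```python
import cmath
import math


def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def inverse_dft(spectrum):
    """Inverse transform; returns the real part of each reconstructed sample."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def split_spectrum(signal):
    """Return (amplitude spectrum, phase spectrum) of the signal."""
    spectrum = dft(signal)
    return [abs(c) for c in spectrum], [cmath.phase(c) for c in spectrum]


def recombine(amplitude, phase):
    """Rebuild a signal from an amplitude spectrum and a phase spectrum."""
    return inverse_dft([cmath.rect(a, p) for a, p in zip(amplitude, phase)])
```

Recombining a signal's own amplitude and phase reproduces it, while `recombine(amplitude_b, phase_a)` pairs one signal's phase with another's amplitude; the result of such a swap is generally not exactly real, and this sketch simply keeps the real part.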

[59] Performance Evaluation of Transfer Learning Based Medical Image Classification Techniques for Disease Detection

Zeeshan Ahmad,Shudi Bao,Meng Chen

Main category: cs.CV

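TL;DR: This paper systematically analyzes transfer learning with deep convolutional neural networks for medical image classification, evaluating six pre-trained models on a chest X-ray dataset. InceptionV3 performs best, the ResNet family improves with depth, and the study also examines model robustness, computational efficiency, and the factors that determine how effective transfer learning is.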

Details Motivation: Training large deep learning models from scratch is usually infeasible for medical image classification, so the effectiveness of transfer learning in this domain and the performance differences between models need systematic evaluation. Method: Six pre-trained deep convolutional neural networks (AlexNet, VGG16, ResNet18, ResNet34, ResNet50, InceptionV3) are used in transfer learning experiments on a custom chest X-ray dataset to evaluate their classification performance for disease detection, together with an uncertainty analysis and a runtime comparison. Result: InceptionV3 consistently performs best on all standard metrics; the ResNet family improves progressively with depth; VGG16 and AlexNet perform reasonably well but with lower accuracy; transfer learning is especially beneficial when data is limited; and a lightweight feedforward network is enough for efficient prediction. Conclusion: Transfer learning offers clear advantages for medical image classification, particularly in data-scarce settings; model selection should weigh architecture, dataset size, and domain similarity between tasks, and the study provides guidance for choosing models in practice. Abstract: Medical image classification plays an increasingly vital role in identifying various diseases by classifying medical images, such as X-rays, MRIs and CT scans, into different categories based on their features. In recent years, deep learning techniques have attracted significant attention in medical image classification. However, it is usually infeasible to train an entire large deep learning model from scratch. To address this issue, one of the solutions is the transfer learning (TL) technique, where a pre-trained model is reused for a new task. In this paper, we present a comprehensive analysis of TL techniques for medical image classification using deep convolutional neural networks. We evaluate six pre-trained models (AlexNet, VGG16, ResNet18, ResNet34, ResNet50, and InceptionV3) on a custom chest X-ray dataset for disease detection. The experimental results demonstrate that InceptionV3 consistently outperforms other models across all the standard metrics. The ResNet family shows progressively better performance with increasing depth, whereas VGG16 and AlexNet perform reasonably well but with lower accuracy. In addition, we also conduct uncertainty analysis and runtime comparison to assess the robustness and computational efficiency of these models. Our findings reveal that TL is beneficial in most cases, especially with limited data, but the extent of improvement depends on several factors such as model architecture, dataset size, and domain similarity between source and target tasks.
Moreover, we demonstrate that with a well-trained feature extractor, only a lightweight feedforward model is enough to provide efficient prediction. As such, this study contributes to the understanding of TL in medical image classification, and provides insights for selecting appropriate models based on specific requirements.

[60] Dual-Stream Spectral Decoupling Distillation for Remote Sensing Object Detection

Xiangyi Gao,Danpei Zhao,Bo Yuan,Wentao Li

Main category: cs.CV

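TL;DR: The paper proposes Dual-Stream Spectral Decoupling Distillation (DS2D2), a general distillation method that combines explicit and implicit knowledge distillation with spectral decomposition to improve small-object detection in remote sensing images.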

Details Motivation: Existing knowledge distillation methods suffer from knowledge confusion in remote sensing images because features are mixed and subtle feature differences are ignored. Method: The proposed DS2D2 applies a first-order wavelet transform for spectral decomposition, designs a Density-Independent Scale Weight (DISW), and uses full-frequency and high-frequency amplifiers to extract the implicit knowledge contained in student-teacher feature discrepancies. Result: The method is validated on the DIOR and DOTA datasets; on DIOR it improves AP50 by 4.2% for RetinaNet and 3.8% for Faster R-CNN. Conclusion: DS2D2 decouples the mixed features in remote sensing images and exploits implicit knowledge differences, improving detection accuracy while remaining architecture-agnostic and broadly applicable. Abstract: Knowledge distillation is an effective and hardware-friendly method, which plays a key role in lightweighting remote sensing object detection. However, existing distillation methods often encounter the issue of mixed features in remote sensing images (RSIs), and neglect the discrepancies caused by subtle feature variations, leading to entangled knowledge confusion. To address these challenges, we propose an architecture-agnostic distillation method named Dual-Stream Spectral Decoupling Distillation (DS2D2) for universal remote sensing object detection tasks. Specifically, DS2D2 integrates explicit and implicit distillation grounded in spectral decomposition. Firstly, the first-order wavelet transform is applied for spectral decomposition to preserve the critical spatial characteristics of RSIs. Leveraging this spatial preservation, a Density-Independent Scale Weight (DISW) is designed to address the challenges of dense and small object detection common in RSIs. Secondly, we show implicit knowledge hidden in subtle student-teacher feature discrepancies, which significantly influence predictions when activated by detection heads. This implicit knowledge is extracted via full-frequency and high-frequency amplifiers, which map feature differences to prediction deviations. Extensive experiments on DIOR and DOTA datasets validate the effectiveness of the proposed method. Specifically, on DIOR dataset, DS2D2 achieves improvements of 4.2% in AP50 for RetinaNet and 3.8% in AP50 for Faster R-CNN, outperforming existing distillation approaches. The source code will be available at https://github.com/PolarAid/DS2D2.
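
A single-level wavelet transform splits a signal into a low-frequency (average) band and a high-frequency (detail) band while keeping each coefficient tied to a position, which is the spatial-preservation property the abstract relies on. The sketch below uses the Haar wavelet on a 1D signal; the choice of Haar and the 1D setting are assumptions for illustration, since the abstract specifies neither the wavelet family nor the implementation.

```python
import math


def haar_level1(signal):
    """One level of the orthonormal Haar transform: returns (low band, high band)."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    scale = math.sqrt(2.0)
    low = [(signal[i] + signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    return low, high


def haar_inverse(low, high):
    """Reconstruct the original signal from its low and high bands."""
    scale = math.sqrt(2.0)
    signal = []
    for l, h in zip(low, high):
        signal.extend([(l + h) / scale, (l - h) / scale])
    return signal
```

Each output coefficient depends on exactly two neighboring samples, so flat regions produce zero detail coefficients and edges show up in the high band at the location where they occur.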

[61] UTrice: Unifying Primitives in Differentiable Ray Tracing and Rasterization via Triangles for Particle-Based 3D Scenes

Changhe Liu,Ehsan Javanmardi,Naren Bao,Alex Orsholits,Manabu Tsukada

Main category: cs.CV

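TL;DR: The paper proposes a differentiable triangle-based ray tracing pipeline that uses triangles directly as rendering primitives, with no proxy geometry, achieving high-quality real-time rendering and unifying the primitives used in novel-view synthesis.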

Details Motivation: Existing methods for ray tracing Gaussian particles depend on proxy geometry, which requires building complex intermediate meshes and running costly intersection tests, and Gaussian particles are poorly suited as a unified primitive for both ray tracing and rasterization. Method: A differentiable triangle-based ray tracing pipeline treats triangles directly as rendering primitives without any proxy geometry, and can directly render triangles optimized by the rasterization-based method Triangle Splatting. Result: The method achieves significantly higher rendering quality while maintaining real-time performance, and unifies the rendering primitives used in novel-view synthesis. Conclusion: The triangle-based ray tracing pipeline outperforms existing approaches in both quality and efficiency and offers a route to a unified primitive representation across rendering paths. Abstract: Ray tracing 3D Gaussian particles enables realistic effects such as depth of field, refractions, and flexible camera modeling for novel-view synthesis. However, existing methods trace Gaussians through proxy geometry, which requires constructing complex intermediate meshes and performing costly intersection tests. This limitation arises because Gaussian-based particles are not well suited as unified primitives for both ray tracing and rasterization. In this work, we propose a differentiable triangle-based ray tracing pipeline that directly treats triangles as rendering primitives without relying on any proxy geometry. Our results show that the proposed method achieves significantly higher rendering quality than existing ray tracing approaches while maintaining real-time rendering performance. Moreover, our pipeline can directly render triangles optimized by the rasterization-based method Triangle Splatting, thus unifying the primitives used in novel-view synthesis.

[62] Explainable Parkinsons Disease Gait Recognition Using Multimodal RGB-D Fusion and Large Language Models

Manar Alnaasan,Md Selim Sarowar,Sungho Kim

Main category: cs.CV

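TL;DR: This paper presents an explainable framework for Parkinson's disease gait analysis based on multimodal RGB-D data. It combines dual YOLOv11 encoders with a cross-spatial fusion mechanism and adds a frozen large language model that produces clinically interpretable text explanations, improving recognition accuracy and robustness.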

Details Motivation: Existing gait analysis methods for Parkinson's disease are limited by single-modality inputs, poor robustness, and a lack of clinical transparency, and so fall short of practical needs. Method: Dual YOLOv11 encoders extract RGB and depth features, followed by a Multi-Scale Local-Global Extraction (MLGE) module and a cross-spatial fusion mechanism; a frozen large language model turns the fused features into clinically meaningful text explanations. Result: On multimodal gait datasets, the method achieves higher recognition accuracy and better adaptation to environmental variation than single-modality baselines, and captures both subtle limb motion and overall gait characteristics. Conclusion: By fusing multimodal visual features with language-based interpretability, the study builds a reliable vision-language analysis paradigm and moves early detection of Parkinson's disease closer to clinical use. Abstract: Accurate and interpretable gait analysis plays a crucial role in the early detection of Parkinson's disease (PD), yet most existing approaches remain limited by single-modality inputs, low robustness, and a lack of clinical transparency. This paper presents an explainable multimodal framework that integrates RGB and Depth (RGB-D) data to recognize Parkinsonian gait patterns under realistic conditions. The proposed system employs dual YOLOv11-based encoders for modality-specific feature extraction, followed by a Multi-Scale Local-Global Extraction (MLGE) module and a Cross-Spatial Neck Fusion mechanism to enhance spatial-temporal representation. This design captures both fine-grained limb motion (e.g., reduced arm swing) and overall gait dynamics (e.g., short stride or turning difficulty), even in challenging scenarios such as low lighting or occlusion caused by clothing. To ensure interpretability, a frozen Large Language Model (LLM) is incorporated to translate fused visual embeddings and structured metadata into clinically meaningful textual explanations. Experimental evaluations on multimodal gait datasets demonstrate that the proposed RGB-D fusion framework achieves higher recognition accuracy, improved robustness to environmental variations, and clear visual-linguistic reasoning compared with single-input baselines. By combining multimodal feature learning with language-based interpretability, this study bridges the gap between visual recognition and clinical understanding, offering a novel vision-language paradigm for reliable and explainable Parkinson's disease gait analysis. Code: https://github.com/manaralnaasan/RGB-D_parkinson-LLM

[63] Self-Paced and Self-Corrective Masked Prediction for Movie Trailer Generation

Sidan Zhu,Hongteng Xu,Dixin Luo

Main category: cs.CV

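TL;DR: The paper proposes SSMP, a self-paced and self-corrective masked prediction method for movie trailer generation that reaches state-of-the-art performance in automatic trailer generation through bi-directional contextual modeling and progressive self-correction.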

Details Motivation: Existing methods follow a "selection-then-ranking" paradigm that suffers from error propagation and limits output quality, so a more robust end-to-end approach is needed. Method: The proposed SSMP trains a Transformer encoder with masked prediction over movie shot sequences, uses an adaptive mask ratio for self-paced learning, and at generation time repeatedly fills and re-masks positions to achieve progressive self-correction. Result: SSMP outperforms existing automatic trailer generation methods in both quantitative metrics and user studies. Conclusion: Through self-paced masked training and a self-correction mechanism, SSMP improves the quality of generated movie trailers and suggests a new direction for video summarization tasks. Abstract: As a challenging video editing task, movie trailer generation involves selecting and reorganizing movie shots to create engaging trailers. Currently, most existing automatic trailer generation methods employ a "selection-then-ranking" paradigm (i.e., first selecting key shots and then ranking them), which suffers from inevitable error propagation and limits the quality of the generated trailers. Beyond this paradigm, we propose a new self-paced and self-corrective masked prediction method called SSMP, which achieves state-of-the-art results in automatic trailer generation via bi-directional contextual modeling and progressive self-correction. In particular, SSMP trains a Transformer encoder that takes the movie shot sequences as prompts and generates corresponding trailer shot sequences accordingly. The model is trained via masked prediction, reconstructing each trailer shot sequence from its randomly masked counterpart. The mask ratio is self-paced, allowing the task difficulty to adapt to the model and thereby improving model performance. When generating a movie trailer, the model fills the shot positions with high confidence at each step and re-masks the remaining positions for the next prediction, forming a progressive self-correction mechanism that is analogous to how human editors work. Both quantitative results and user studies demonstrate the superiority of SSMP in comparison to existing automatic movie trailer generation methods. Demo is available at: https://github.com/Dixin-Lab/SSMP.
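
The generation loop described above (commit the most confident positions, re-mask the rest, predict again) can be sketched independently of any model. In the code below, `predict` is a hypothetical callable standing in for the trained Transformer encoder, `toy_predict` is a fixed stand-in used only to make the example runnable, and the rule for how many positions to commit per step is an assumption; the abstract does not state the schedule.

```python
MASK = None  # sentinel for a position that has not been committed yet


def iterative_unmask(length, predict, steps):
    """Fill a fully masked sequence over several prediction rounds.

    predict(sequence) must return one (candidate, confidence) pair per position.
    Each round commits the highest-confidence masked positions; the others stay
    masked and are re-predicted next round with the committed context visible.
    """
    sequence = [MASK] * length
    for step in range(steps):
        masked = [i for i, token in enumerate(sequence) if token is MASK]
        if not masked:
            break
        proposals = predict(sequence)
        remaining_rounds = steps - step
        # Spread the remaining positions evenly over the remaining rounds (ceiling).
        commit_count = -(-len(masked) // remaining_rounds)
        ranked = sorted(masked, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:commit_count]:
            sequence[i] = proposals[i][0]
    return sequence


def toy_predict(sequence):
    """Hypothetical predictor: proposes 'shot<i>' with confidence falling with i."""
    return [("shot%d" % i, 1.0 / (i + 1)) for i in range(len(sequence))]
```

Calling `iterative_unmask(4, toy_predict, 2)` commits positions 0 and 1 in the first round and positions 2 and 3 in the second. With a real model, the second round would see the committed shots as context, which is what allows low-confidence choices to be revised.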

[64] MindDrive: An All-in-One Framework Bridging World Models and Vision-Language Model for End-to-End Autonomous Driving

Bin Sun,Yaoguang Cao,Yan Wang,Rui Wang,Jiachen Shang,Xiejie Feng,Jiayi Lu,Jia Shi,Shichun Yang,Xiaoyu Yan,Ziying Song

Main category: cs.CV

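TL;DR: This paper presents MindDrive, an end-to-end autonomous driving framework that combines high-quality trajectory generation with multi-objective decision reasoning. Following a "context simulation - candidate generation - multi-objective trade-off" paradigm, it achieves state-of-the-art performance in safety, comfort, and efficiency.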

Details Motivation: Existing autonomous driving methods separate trajectory generation from decision selection: generation-oriented methods lack comprehensive evaluation, while selection-oriented methods lack generative capability, so a framework that delivers both high-quality generation and sound decisions is needed. Method: MindDrive has two core modules: a Future-aware Trajectory Generator (FaTG), based on a World Action Model (WaM), which runs ego-conditioned "what-if" simulations and produces foresighted trajectory candidates; and a VLM-oriented Evaluator (VLoE), which uses the reasoning ability of a large vision-language model to evaluate and trade off safety, comfort, and efficiency. Result: Extensive experiments on the NAVSIM-v1 and NAVSIM-v2 benchmarks show that MindDrive reaches state-of-the-art results on multiple driving metrics and markedly improves safety, compliance, and generalization. Conclusion: MindDrive offers a viable path toward interpretable, cognitively guided autonomous driving and confirms the value of combining generative capability with structured reasoning. Abstract: End-to-End autonomous driving (E2E-AD) has emerged as a new paradigm, where trajectory planning plays a crucial role. Existing studies mainly follow two directions: trajectory generation oriented, which focuses on producing high-quality trajectories with simple decision mechanisms, and trajectory selection oriented, which performs multi-dimensional evaluation to select the best trajectory yet lacks sufficient generative capability. In this work, we propose MindDrive, a harmonized framework that integrates high-quality trajectory generation with comprehensive decision reasoning. It establishes a structured reasoning paradigm of "context simulation - candidate generation - multi-objective trade-off". In particular, the proposed Future-aware Trajectory Generator (FaTG), based on a World Action Model (WaM), performs ego-conditioned "what-if" simulations to predict potential future scenes and generate foresighted trajectory candidates. Building upon this, the VLM-oriented Evaluator (VLoE) leverages the reasoning capability of a large vision-language model to conduct multi-objective evaluations across safety, comfort, and efficiency dimensions, leading to reasoned and human-aligned decision making. Extensive experiments on the NAVSIM-v1 and NAVSIM-v2 benchmarks demonstrate that MindDrive achieves state-of-the-art performance across multi-dimensional driving metrics, significantly enhancing safety, compliance, and generalization. This work provides a promising path toward interpretable and cognitively guided autonomous driving.

[65] StreamEQA: Towards Streaming Video Understanding for Embodied Scenarios

Yifei Wang,Zhenkai Li,Tianwen Qian,Huanran Zheng,Zheng Wang,Yuqian Fu,Xiaoling Wang

Main category: cs.CV

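TL;DR: This paper introduces StreamEQA, the first benchmark for streaming video question answering in embodied scenarios. It evaluates multimodal large language models along two dimensions, embodied and streaming, covering perception, interaction, and planning under continuous visual input. Experiments show that existing models still struggle on these tasks.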

Details Motivation: As embodied intelligence moves toward real-world deployment, models must continuously understand streaming visual input and reason dynamically. Existing benchmarks do not jointly evaluate embodiment and temporal continuity, so a new benchmark is needed. Method: The authors build StreamEQA, a benchmark with 156 long videos, 42 tasks, and about 21K question-answer pairs with precise timestamps, produced by a hybrid pipeline of automated generation and human refinement. Questions are designed along two orthogonal dimensions: embodied (perception, interaction, planning) and streaming (backward, real-time, forward reasoning). Result: An evaluation of 13 state-of-the-art video-LLMs shows that, despite strong results on conventional benchmarks, these models remain weak at streaming embodied understanding, especially on tasks that need temporal context and higher-order reasoning. Conclusion: StreamEQA provides a new evaluation standard for streaming video understanding in embodied intelligence, exposes the limitations of current models, and may advance research on systems with continuous perception and dynamic decision making. Abstract: As embodied intelligence advances toward real-world deployment, the ability to continuously perceive and reason over streaming visual inputs becomes essential. In such settings, an agent must maintain situational awareness of its environment, comprehend the interactions with surrounding entities, and dynamically plan actions informed by past observations, current contexts, and anticipated future events. To facilitate progress in this direction, we introduce StreamEQA, the first benchmark designed for streaming video question answering in embodied scenarios. StreamEQA evaluates existing MLLMs along two orthogonal dimensions: Embodied and Streaming. Along the embodied dimension, we categorize the questions into three levels: perception, interaction, and planning, which progressively assess a model's ability to recognize fine-grained visual details, reason about agent-object interactions, and perform high-level goal-directed reasoning. For the streaming dimension, questions are divided into backward, real-time, and forward reasoning, with each mode relying on a distinct temporal context. Built upon 156 independent long videos, StreamEQA defines 42 tasks and generates approximately 21K question-answer pairs with precise timestamps through a hybrid pipeline combining automated generation and human refinement. Evaluations of 13 state-of-the-art video-LLMs reveal that, despite strong performance on conventional benchmarks, these models still struggle with streaming video understanding in embodied scenarios.
We hope StreamEQA will catalyze research on streaming video understanding for embodied applications.

[66] GuidNoise: Single-Pair Guided Diffusion for Generalized Noise Synthesis

Changjin Kim,HyeokJun Lee,YoungJoon Yoo

Main category: cs.CV

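TL;DR: Proposes GuidNoise, a diffusion model guided by a single noisy/clean image pair for generalized noise synthesis. A guidance-aware affine feature modification (GAFM) and a noise-aware refine loss improve the realism of the generated noise. The model needs no additional metadata, efficiently generates high-quality noisy-clean image pairs, and significantly improves denoising performance.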

Details Motivation: Existing generative models for real noise synthesis depend on camera metadata and large numbers of target-specific noisy-clean image pairs, which limits generalization and is costly to obtain. A more general, lower-cost noise synthesis method is needed. Method: The paper proposes GuidNoise, which uses a single noisy/clean image pair as guidance and trains a diffusion model with a guidance-aware affine feature modification (GAFM) and a noise-aware refine loss, so that the model generates realistic synthetic noise and its backward process better reproduces realistic noise distributions. Result: GuidNoise synthesizes high-quality noisy images under diverse noise environments without any additional metadata. At inference time it efficiently generates noisy-clean image pairs for data augmentation, which significantly improves denoising performance with lightweight models and small training sets. Conclusion: By using single-pair guidance for generalized noise synthesis, GuidNoise reduces the dependence on metadata and large paired datasets, is practical and extensible, and provides an efficient data augmentation approach for image denoising. Abstract: Recent image denoising methods have leveraged generative modeling for real noise synthesis to address the costly acquisition of real-world noisy data. However, these generative models typically require camera metadata and extensive target-specific noisy-clean image pairs, often showing limited generalization between settings. In this paper, to mitigate the prerequisites, we propose a Single-Pair Guided Diffusion for generalized noise synthesis, GuidNoise, which uses a single noisy/clean pair as the guidance, often easily obtained by itself within a training set. To train GuidNoise, which generates synthetic noisy images from the guidance, we introduce a guidance-aware affine feature modification (GAFM) and a noise-aware refine loss to leverage the inherent potential of diffusion models. This loss function refines the diffusion model's backward process, making the model more adept at generating realistic noise distributions. GuidNoise synthesizes high-quality noisy images under diverse noise environments without additional metadata during both training and inference. Additionally, GuidNoise enables the efficient generation of noisy-clean image pairs at inference time, making synthetic noise readily applicable for augmenting training data. This self-augmentation significantly improves denoising performance, especially in practical scenarios with lightweight models and limited training data. The code is available at https://github.com/chjinny/GuidNoise.

[67] dVLM-AD: Enhance Diffusion Vision-Language-Model for Driving via Controllable Reasoning

Yingzi Ma,Yulong Cao,Wenhao Ding,Shuibai Zhang,Yan Wang,Boris Ivanovic,Ming Jiang,Marco Pavone,Chaowei Xiao

Main category: cs.CV

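TL;DR: This paper proposes dVLM-AD, a discrete-diffusion vision-language model that improves the consistency between reasoning and planning in end-to-end autonomous driving under out-of-distribution scenarios. Compared with autoregressive models, its bidirectional attention and iterative denoising improve controllability and reliability, and it shows better behavior-trajectory consistency and planning performance on several benchmarks.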

Details Motivation: Existing autoregressive vision-language models are constrained by causal attention and sequential generation, so they lack consistency and controllability between high-level reasoning and low-level planning and struggle to meet the reliability demands of autonomous driving. Method: The paper proposes dVLM-AD, a vision-language model based on discrete diffusion that combines bidirectional attention with iterative denoising to unify perception, structured reasoning, and low-level planning, improving the controllability and consistency of end-to-end driving systems. Result: On nuScenes and WOD-E2E, dVLM-AD improves behavior-trajectory consistency by 9 percent over autoregressive baselines and raises RFS by 6 percent on long-tail WOD-E2E scenarios, with planning performance comparable to existing driving VLM/VLA systems. Conclusion: Diffusion-based VLMs offer a more controllable, reliable, and scalable path for end-to-end autonomous driving than conventional autoregressive architectures. Abstract: The autonomous driving community is increasingly focused on addressing the challenges posed by out-of-distribution (OOD) driving scenarios. A dominant research trend seeks to enhance end-to-end (E2E) driving systems by integrating vision-language models (VLMs), leveraging their rich world knowledge and reasoning abilities to improve generalization across diverse environments. However, most existing VLMs or vision-language agents (VLAs) are built upon autoregressive (AR) models. In this paper, we observe that existing AR-based VLMs -- limited by causal attention and sequential token generation -- often fail to maintain consistency and controllability between high-level reasoning and low-level planning. In contrast, recent discrete diffusion VLMs equipped with bidirectional attention exhibit superior controllability and reliability through iterative denoising. Building on these observations, we introduce dVLM-AD, a diffusion-based vision-language model that unifies perception, structured reasoning, and low-level planning for end-to-end driving. Evaluated on nuScenes and WOD-E2E, dVLM-AD yields more consistent reasoning-action pairs and achieves planning performance comparable to existing driving VLM/VLA systems despite a modest backbone, outperforming AR-based baselines with a 9 percent improvement in behavior-trajectory consistency and a 6 percent increase in RFS on long-tail WOD-E2E scenarios. These results suggest a controllable and reliable pathway for scalable end-to-end driving.

[68] UniTS: Unified Time Series Generative Model for Remote Sensing

Yuxiang Zhang,Shunlin Liang,Wenyuan Li,Han Ma,Jianglei Xu,Yichuan Ma,Jiangwei Xie,Wei Li,Mengmeng Zhang,Ran Tao,Xiang-Gen Xia

Main category: cs.CV

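TL;DR: This paper proposes UniTS, a unified time series generative model based on the flow matching paradigm, applicable to multiple remote sensing time series tasks such as reconstruction, cloud removal, change detection, and forecasting.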

Details Motivation: Existing methods usually design a specialized model for each task and lack unified modeling of spatiotemporal features across tasks. Method: The paper proposes the UniTS framework, which uses a diffusion transformer and introduces an Adaptive Condition Injector (ACor) and a Spatiotemporal-aware Modulator (STM) to model multiple tasks in a unified way. Result: UniTS significantly outperforms existing methods on several low-level and high-level time series tasks, particularly under severe cloud contamination, missing modalities, and forecasting of phenological variations. Conclusion: UniTS unifies the modeling of remote sensing time series tasks, shows strong generative and cognitive capabilities, and advances general-purpose Earth observation models. Abstract: One of the primary objectives of satellite remote sensing is to capture the complex dynamics of the Earth environment, which encompasses tasks such as reconstructing continuous cloud-free time series images, detecting land cover changes, and forecasting future surface evolution. However, existing methods typically require specialized models tailored to different tasks, lacking unified modeling of spatiotemporal features across multiple time series tasks. In this paper, we propose a Unified Time Series Generative Model (UniTS), a general framework applicable to various time series tasks, including time series reconstruction, time series cloud removal, time series semantic change detection, and time series forecasting. Based on the flow matching generative paradigm, UniTS constructs a deterministic evolution path from noise to targets under the guidance of task-specific conditions, achieving unified modeling of spatiotemporal representations for multiple tasks. The UniTS architecture consists of a diffusion transformer with spatio-temporal blocks, where we design an Adaptive Condition Injector (ACor) to enhance the model's conditional perception of multimodal inputs, enabling high-quality controllable generation. Additionally, we design a Spatiotemporal-aware Modulator (STM) to improve the ability of spatio-temporal blocks to capture complex spatiotemporal dependencies. Furthermore, we construct two high-quality multimodal time series datasets, TS-S12 and TS-S12CR, filling the gap of benchmark datasets for time series cloud removal and forecasting tasks.
Extensive experiments demonstrate that UniTS exhibits exceptional generative and cognitive capabilities in both low-level and high-level time series tasks. It significantly outperforms existing methods, particularly when facing challenges such as severe cloud contamination, modality absence, and forecasting phenological variations.
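
The "deterministic evolution path from noise to targets" in the UniTS abstract can be illustrated with a toy flow matching example. This is a minimal sketch, not the UniTS implementation: it assumes a straight-line path x_t = (1 - t) * noise + t * target, which the abstract does not specify, and the function names and the `velocity_fn` argument (standing in for a trained, condition-guided network) are hypothetical.

```python
def linear_path(noise, target, t):
    """Point on the assumed straight path x_t = (1 - t) * noise + t * target."""
    return [(1.0 - t) * n + t * x for n, x in zip(noise, target)]


def target_velocity(noise, target):
    """Regression target for the velocity field on the linear path: target - noise."""
    return [x - n for n, x in zip(noise, target)]


def euler_integrate(noise, velocity_fn, steps=10):
    """Follow the deterministic ODE dx/dt = velocity_fn(x, t) from t=0 to t=1."""
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity_fn(x, t)  # hypothetical stand-in for the trained network
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    noise, target = [0.0, 2.0], [1.0, -1.0]
    # With the exact velocity field, integration carries the noise to the target.
    oracle = lambda x, t: target_velocity(noise, target)
    print(euler_integrate(noise, oracle, steps=10))
```

In training, a network would be regressed onto `target_velocity` at points sampled from `linear_path`; at inference, only `euler_integrate` (or a higher-order solver) is needed.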

[69] DeRA: Decoupled Representation Alignment for Video Tokenization

Pengbo Guo,Junke Wang,Zhen Xing,Chengxu Liu,Daoguo Dong,Xueming Qian,Zuxuan Wu

Main category: cs.CV

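TL;DR: DeRA is a novel 1D video tokenizer that decouples spatial and temporal representation learning to improve training efficiency and performance, achieving state-of-the-art results on several tasks.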

Details Motivation: Existing video tokenization methods model spatial semantics and temporal dynamics jointly, which leads to low training efficiency and performance bottlenecks, so a more efficient decoupled representation is needed. Method: DeRA factorizes video encoding into appearance and motion streams and aligns each with pretrained vision foundation models. It introduces the Symmetric Alignment-Conflict Projection (SACP) module to mitigate the gradient conflicts caused by heterogeneous supervision. Result: DeRA outperforms LARP, the previous state-of-the-art method, by 25% in rFVD on UCF-101, and sets new state-of-the-art results on UCF-101 class-conditional generation and K600 frame prediction. Conclusion: By decoupling spatial-temporal representations and handling gradient conflicts effectively, DeRA significantly improves video tokenization and generation and shows good application potential. Abstract: This paper presents DeRA, a novel 1D video tokenizer that decouples the spatial-temporal representation learning in video tokenization to achieve better training efficiency and performance. Specifically, DeRA maintains a compact 1D latent space while factorizing video encoding into appearance and motion streams, which are aligned with pretrained vision foundation models to capture the spatial semantics and temporal dynamics in videos separately. To address the gradient conflicts introduced by the heterogeneous supervision, we further propose the Symmetric Alignment-Conflict Projection (SACP) module that proactively reformulates gradients by suppressing the components along conflicting directions. Extensive experiments demonstrate that DeRA outperforms LARP, the previous state-of-the-art video tokenizer, by 25% on UCF-101 in terms of rFVD. Moreover, using DeRA for autoregressive video generation, we also achieve new state-of-the-art results on both UCF-101 class-conditional generation and K600 frame prediction.
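
The idea of "suppressing the components along conflicting directions" can be sketched with a generic symmetric gradient projection. The abstract does not give SACP's actual formulation, so the snippet below is an assumption: it uses a PCGrad-style rule in which each gradient loses its component along the other gradient only when their dot product is negative. All names are hypothetical.

```python
def dot(u, v):
    """Inner product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def remove_conflict(g, other):
    """Drop the component of g along `other` when the two gradients conflict."""
    d = dot(g, other)
    if d >= 0.0:
        return list(g)  # no conflict: leave the gradient untouched
    scale = d / dot(other, other)  # d < 0 implies `other` is non-zero
    return [a - scale * b for a, b in zip(g, other)]


def symmetric_projection(g_appearance, g_motion):
    """Apply the projection in both directions, using the original gradients."""
    return (
        remove_conflict(g_appearance, g_motion),
        remove_conflict(g_motion, g_appearance),
    )


if __name__ == "__main__":
    g_app, g_mot = [1.0, 0.0], [-1.0, 1.0]
    new_app, new_mot = symmetric_projection(g_app, g_mot)
    print(new_app, new_mot)
```

After projection, each adjusted gradient is orthogonal to the other stream's original gradient, so neither update directly undoes the other objective to first order.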

[70] Not All Birds Look The Same: Identity-Preserving Generation For Birds

Aaron Sun,Oindrila Saha,Subhransu Maji

Main category: cs.CV

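TL;DR: This paper introduces the NABirds Look-Alikes (NABLA) dataset for evaluating identity-preserving generation of birds, shows the limitations of current state-of-the-art models on it, and demonstrates that training on images grouped by species, age, and sex improves performance.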

Details Motivation: Existing zero-shot, identity-preserving generation models perform poorly on non-rigid or fine-grained categories such as birds, and these domains lack high-quality data (videos or multi-view images), which limits accuracy and fine detail. A dedicated benchmark is needed to advance the area. Method: The authors build the NABLA dataset of 4,759 expert-curated image pairs and combine it with 1,073 pairs from multi-image observations on iNaturalist and a small set of videos to form a complete benchmark. They train on images grouped by species, age, and sex to improve identity preservation. Result: State-of-the-art baselines fail to maintain identity on NABLA, while training on images grouped by species, age, and sex substantially improves performance on both seen and unseen species. Conclusion: NABLA is a challenging new benchmark for fine-grained identity-preserving generation and shows that training with semantic identity labels (species, age, sex) is an effective direction for this task. Abstract: Since the advent of controllable image generation, increasingly rich modes of control have enabled greater customization and accessibility for everyday users. Zero-shot, identity-preserving models such as Insert Anything and OminiControl now support applications like virtual try-on without requiring additional fine-tuning. While these models may be fitting for humans and rigid everyday objects, they still have limitations for non-rigid or fine-grained categories. These domains often lack accessible, high-quality data -- especially videos or multi-view observations of the same subject -- making them difficult both to evaluate and to improve upon. Yet, such domains are essential for moving beyond content creation toward applications that demand accuracy and fine detail. Birds are an excellent domain for this task: they exhibit high diversity, require fine-grained cues for identification, and come in a wide variety of poses. We introduce the NABirds Look-Alikes (NABLA) dataset, consisting of 4,759 expert-curated image pairs. Together with 1,073 pairs collected from multi-image observations on iNaturalist and a small set of videos, this forms a benchmark for evaluating identity-preserving generation of birds. We show that state-of-the-art baselines fail to maintain identity on this dataset, and we demonstrate that training on images grouped by species, age, and sex -- used as a proxy for identity -- substantially improves performance on both seen and unseen species.

[71] SEASON: Mitigating Temporal Hallucination in Video Large Language Models via Self-Diagnostic Contrastive Decoding

Chang-Hsun Wu,Kai-Po Chang,Yu-Yang Sheng,Hung-Kai Chung,Kuei-Chun Wang,Yu-Chiang Frank Wang

Main category: cs.CV

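TL;DR: Proposes SEASON, a training-free hallucination mitigation method for video large language models that improves temporal and spatial faithfulness through self-diagnostic contrastive decoding.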

Details Motivation: Existing VideoLLMs have difficulty capturing temporal information in videos, which leads to hallucinations such as temporally inconsistent or causally implausible content. Temporal reasoning in particular has received little study. Method: The paper proposes Self-Diagnostic Contrastive Decoding (SEASON), which dynamically diagnoses each output token's hallucination tendency and applies adaptive contrastive decoding against the corresponding temporal and spatial negatives to improve the temporal and spatial consistency of the output. Result: SEASON outperforms all existing training-free hallucination mitigation methods on three hallucination benchmarks and further improves VideoLLM performance on four general video understanding benchmarks. Conclusion: SEASON is an effective training-free method that adaptively improves the temporal and spatial faithfulness of VideoLLM outputs and significantly reduces temporal and spatial hallucinations. Abstract: Video Large Language Models (VideoLLMs) have shown remarkable progress in video understanding. However, these models still struggle to effectively perceive and exploit rich temporal information in videos when responding to user queries. Therefore, they often generate descriptions of events that are temporally inconsistent or causally implausible, causing severe hallucination issues. While most prior studies have focused on spatial hallucinations (e.g. object mismatches), temporal reasoning in video understanding remains relatively underexplored. To address this issue, we propose Self-Diagnostic Contrastive Decoding (SEASON), a training-free method that adaptively enhances temporal and spatial faithfulness for each output token. It achieves this by dynamically diagnosing each token's hallucination tendency and applying adaptive contrastive decoding against its corresponding temporal and spatial negatives. Extensive experiments demonstrate that SEASON outperforms all existing training-free hallucination mitigation approaches on three hallucination examination benchmarks, while further improving VideoLLMs across four general video understanding benchmarks. The code will be released upon acceptance.
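
Contrastive decoding against temporal and spatial negatives can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give SEASON's scoring rule or how the per-token weights are diagnosed, so the combination formula, the weights `w_temporal` and `w_spatial`, and all function names are hypothetical.

```python
def contrastive_scores(original, temporal_neg, spatial_neg, w_temporal, w_spatial):
    """Boost the original logits and subtract weighted logits from the negatives.

    Assumed form: (1 + w_t + w_s) * original - w_t * temporal_neg - w_s * spatial_neg.
    """
    boost = 1.0 + w_temporal + w_spatial
    return [
        boost * o - w_temporal * t - w_spatial * s
        for o, t, s in zip(original, temporal_neg, spatial_neg)
    ]


def argmax(values):
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


if __name__ == "__main__":
    original = [2.0, 1.9]      # logits with the real video
    temporal_neg = [2.0, 0.5]  # logits with a temporally corrupted input (assumed)
    spatial_neg = [1.0, 1.0]   # logits with a spatially corrupted input (assumed)
    plain = argmax(contrastive_scores(original, temporal_neg, spatial_neg, 0.0, 0.0))
    adjusted = argmax(contrastive_scores(original, temporal_neg, spatial_neg, 1.0, 0.0))
    print(plain, adjusted)
```

Token 0 scores equally high with and without correct temporal information, so it is penalized once the temporal weight is non-zero; token 1 depends on the real temporal content and wins after the adjustment.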

[72] Controllable Long-term Motion Generation with Extended Joint Targets

Eunjong Lee,Eunhee Kim,Sanghoon Hong,Eunho Jung,Jihoon Kim

Main category: cs.CV

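TL;DR: This paper proposes COMET, a real-time, stable, and controllable character motion generation framework. Built on a Transformer-based conditional VAE and a reference-guided feedback mechanism, it achieves precise control and long-term stability and supports real-time style transfer.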

Details Motivation: Existing methods fall short on fine-grained control or on long-sequence generation, so they cannot meet the stability and real-time requirements of interactive applications. Method: The paper proposes COMET, which uses an efficient Transformer-based conditional VAE for fine-grained control over arbitrary joints and introduces a reference-guided feedback mechanism that prevents error accumulation and improves long-term stability, while also supporting plug-and-play style transfer. Result: Experiments show that COMET generates high-quality motion at real-time speeds and significantly outperforms state-of-the-art methods on complex motion control tasks. Conclusion: COMET combines long-term stability, support for multiple tasks in a single model, and real-time performance, making it suitable for demanding interactive animation applications. Abstract: Generating stable and controllable character motion in real-time is a key challenge in computer animation. Existing methods often fail to provide fine-grained control or suffer from motion degradation over long sequences, limiting their use in interactive applications. We propose COMET, an autoregressive framework that runs in real time, enabling versatile character control and robust long-horizon synthesis. Our efficient Transformer-based conditional VAE allows for precise, interactive control over arbitrary user-specified joints for tasks like goal-reaching and in-betweening from a single model. To ensure long-term temporal stability, we introduce a novel reference-guided feedback mechanism that prevents error accumulation. This mechanism also serves as a plug-and-play stylization module, enabling real-time style transfer. Extensive evaluations demonstrate that COMET robustly generates high-quality motion at real-time speeds, significantly outperforming state-of-the-art approaches in complex motion control tasks and confirming its readiness for demanding interactive applications.

[73] DraCo: Draft as CoT for Text-to-Image Preview and Rare Concept Generation

Dongzhi Jiang,Renrui Zhang,Haodong Li,Zhuofan Zong,Ziyu Guo,Jun He,Claire Guo,Junyan Ye,Rongyao Fang,Weijia Li,Rui Liu,Hongsheng Li

Main category: cs.CV

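TL;DR: Proposes Draft-as-CoT (DraCo), a new interleaved reasoning paradigm that uses a low-resolution draft image for visual planning and then refines the result through selective correction and super-resolution, significantly improving text-to-image generation with multimodal large language models.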

Details Motivation: Existing unified multimodal large language models that use chain-of-thought (CoT) for text-to-image generation either rely only on abstract textual planning or act as standalone generators, and they struggle with fine-grained semantic alignment and rare attribute combinations. Method: DraCo first generates a low-resolution draft image as a visual preview that provides structured planning, then uses the model's own understanding capability to detect semantic misalignments between the draft and the input prompt, and refines the output through selective correction with super-resolution. The DraCo-240K dataset and the DraCo-CFG strategy support training and inference. Result: DraCo significantly outperforms direct generation and other CoT-based methods on GenEval, Imagine-Bench, and GenEval++, with gains of 8%, 0.91, and 3% respectively. Conclusion: By interleaving visual and textual reasoning, DraCo improves planning and verification during generation, addresses the coarse granularity of textual planning and the difficulty of generating rare attribute combinations, and significantly improves multimodal generation quality. Abstract: Recent unified multimodal large language models (MLLMs) have shown impressive capabilities, incorporating chain-of-thought (CoT) reasoning for enhanced text-to-image generation. However, existing approaches remain limited, either treating the model merely as a standalone generator or relying on abstract textual planning. To this end, we propose Draft-as-CoT (DraCo), a novel interleaved reasoning paradigm that fully leverages both textual and visual contents in CoT for better planning and verification. Our method first generates a low-resolution draft image as preview, providing more concrete and structural visual planning and guidance. Then, we employ the model's inherent understanding capability to verify potential semantic misalignments between the draft and input prompt, and perform refinement through selective corrections with super-resolution. In this way, our approach addresses two fundamental challenges: the coarse-grained nature of textual planning and the difficulty in generating rare attribute combinations. To support training, we curate DraCo-240K, aiming to enhance three atomic capabilities spanning general correction, instance manipulation, and layout reorganization. Supported by DraCo-CFG, a specialized classifier-free guidance (CFG) strategy for interleaved reasoning, DraCo achieves a tremendous increase on GenEval (+8%), Imagine-Bench (+0.91), and GenEval++ (+3%), significantly outperforming direct generation and other generation methods empowered by CoT.

[74] Shift-Window Meets Dual Attention: A Multi-Model Architecture for Specular Highlight Removal

Tianci Huo,Lingfeng Qi,Yuhan Chen,Qihong Xue,Jinyuan Shao,Hai Yu,Jie Li,Zhanhua Zhang,Guofa Li

Main category: cs.CV

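TL;DR: Proposes MM-SHR, a multi-model architecture for removing specular highlights of different scales. It combines convolution with attention mechanisms and outperforms existing methods in both accuracy and efficiency.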

Details Motivation: When handling specular highlights of different scales, existing methods cannot model local details and global dependencies at the same time, which degrades highlight removal. Method: The MM-SHR architecture uses convolution in the shallow layers to extract local details and attention in the deep layers to capture global features. It introduces the OAIBlock and HDDAConv modules, which model long-range dependencies in a coarse-to-fine manner. Result: Experiments on three benchmark tasks and six types of surface materials show that MM-SHR outperforms current state-of-the-art methods in both accuracy and efficiency of highlight removal. Conclusion: MM-SHR balances the modeling of local details and global dependencies and significantly improves specular highlight removal in complex environments. Abstract: Inevitable specular highlights in practical environments severely impair the visual performance, thus degrading the task effectiveness and efficiency. Although many methods focus on local information from convolutional neural network models or global information from transformer models, a single-type model falls into a modeling dilemma between local fine-grained details and global long-range dependencies, and its performance deteriorates for specular highlights with different scales. Therefore, to accommodate specular highlights of all scales, we propose a multi-model architecture for specular highlight removal (MM-SHR) that effectively captures fine-grained features in highlight regions and models long-range dependencies between highlight and highlight-free areas. Specifically, we employ convolution operations to extract local details in the shallow layers of MM-SHR, and utilize the attention mechanism to capture global features in the deep layers, ensuring both operation efficiency and removal accuracy. To model long-range dependencies without compromising computational complexity, we utilize a coarse-to-fine manner and propose the Omni-Directional Attention Integration Block (OAIBlock) and the Adaptive Region-Aware Hybrid-Domain Dual Attention Convolutional Network (HDDAConv), which leverage omni-directional pixel-shifting and window-dividing operations at the raw features to achieve specular highlight removal. Extensive experimental results on three benchmark tasks and six types of surface materials demonstrate that MM-SHR outperforms state-of-the-art methods in both accuracy and efficiency for specular highlight removal.
The implementation will be made publicly available at https://github.com/Htcicv/MM-SHR.

[75] Back to Basics: Motion Representation Matters for Human Motion Generation Using Diffusion Model

Yuduo Jin,Brandon Haworth

Main category: cs.CV

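TL;DR: Through a controlled study, this paper examines basic questions about motion representations and loss functions in generative motion diffusion models. Using the proxy model MDM with a v loss objective (vMDM), it evaluates six common motion representations on quality and diversity, compares training speed under different configurations, and runs an analysis on a large motion dataset. The results show that these choices significantly affect model performance.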

Details Motivation: The work aims to better understand how different motion representations and loss functions affect motion generation, in order to improve the performance and training efficiency of conditional motion diffusion models. Method: The authors build vMDM on top of MDM with v loss as the prediction objective, run ablations over six motion representations and several training configurations, and evaluate on a large motion dataset. Result: Performance differs clearly across motion representations on diverse datasets, certain configurations speed up training, and the v loss helps model the latent data distribution. Conclusion: The choice of motion representation and training configuration strongly affects diffusion model performance; careful design improves generation quality and training efficiency, and the study provides an empirical basis for future work. Abstract: Diffusion models have emerged as a widely utilized and successful methodology in human motion synthesis. Task-oriented diffusion models have significantly advanced action-to-motion, text-to-motion, and audio-to-motion applications. In this paper, we investigate fundamental questions regarding motion representations and loss functions in a controlled study, and we enumerate the impacts of various decisions in the workflow of the generative motion diffusion model. To answer these questions, we conduct empirical studies based on a proxy motion diffusion model (MDM). We apply v loss as the prediction objective on MDM (vMDM), where v is the weighted sum of motion data and noise. We aim to enhance the understanding of latent data distributions and provide a foundation for improving the state of conditional motion diffusion models. First, we evaluate the six common motion representations in the literature and compare their performance in terms of quality and diversity metrics. Second, we compare the training time under various configurations to shed light on how to speed up the training process of motion diffusion models. Finally, we also conduct evaluation analysis on a large motion dataset. The results of our experiments indicate clear performance differences across motion representations in diverse datasets. Our results also demonstrate the impacts of distinct configurations on model training and suggest the importance and effectiveness of these decisions on the outcomes of motion diffusion models.
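
The abstract describes v only as "the weighted sum of motion data and noise". One common concrete form is the v-parameterization used in diffusion model distillation work, v = alpha_t * eps - sigma_t * x0 with alpha_t^2 + sigma_t^2 = 1. The sketch below assumes that form and an angle-based schedule; whether vMDM uses exactly these weights is not stated in the abstract, and the function names are hypothetical.

```python
import math


def schedule(t):
    """Assumed variance-preserving schedule: alpha = cos(pi*t/2), sigma = sin(pi*t/2)."""
    angle = 0.5 * math.pi * t
    return math.cos(angle), math.sin(angle)


def noisy_sample(x0, eps, t):
    """Forward diffusion: z_t = alpha_t * x0 + sigma_t * eps."""
    alpha, sigma = schedule(t)
    return [alpha * x + sigma * e for x, e in zip(x0, eps)]


def v_target(x0, eps, t):
    """Assumed v target: v = alpha_t * eps - sigma_t * x0."""
    alpha, sigma = schedule(t)
    return [alpha * e - sigma * x for x, e in zip(x0, eps)]


def recover_x0(z_t, v, t):
    """Invert the parameterization: x0 = alpha_t * z_t - sigma_t * v."""
    alpha, sigma = schedule(t)
    return [alpha * z - sigma * vi for z, vi in zip(z_t, v)]


if __name__ == "__main__":
    x0 = [0.3, -1.2, 0.8]   # toy motion features
    eps = [0.5, 0.1, -0.7]  # toy noise sample
    t = 0.4
    z_t = noisy_sample(x0, eps, t)
    print(recover_x0(z_t, v_target(x0, eps, t), t))  # approximately x0
```

A network trained to predict `v_target` from `z_t` yields a clean-motion estimate through `recover_x0`, which is why a v objective can be swapped in without changing the sampler's interface.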

[76] UltraImage: Rethinking Resolution Extrapolation in Image Diffusion Transformers

Min Zhao,Bokai Yan,Xue Yang,Hongzhou Zhu,Jintao Zhang,Shilong Liu,Chongxuan Li,Jun Zhu

Main category: cs.CV

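TL;DR: This paper presents UltraImage, a framework that addresses content repetition and quality degradation when image diffusion transformers generate beyond their training resolution. Using frequency analysis with a recursive dominant frequency correction, together with entropy-guided adaptive attention concentration, it generates high-quality images up to 6K*6K.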

Details Motivation: Existing image diffusion transformers suffer from content repetition and quality degradation when generating images beyond their training scale, which limits their use for high-resolution generation. Method: Through frequency-wise analysis of positional embeddings, the authors identify periodicity of the dominant frequency as the cause of repetition and introduce a recursive dominant frequency correction. They also find that diluted attention causes quality degradation and design an entropy-guided adaptive attention concentration mechanism that sharpens local attention while preserving global attention patterns. Result: UltraImage outperforms prior methods on Qwen-Image and Flux (around 4K) across three generation scenarios, clearly reducing repetition and improving visual fidelity, and it generates images up to 6K*6K from a training resolution of 1328p. Conclusion: UltraImage effectively addresses content repetition and quality degradation under extreme extrapolation and shows strong high-resolution generation capability. Abstract: Recent image diffusion transformers achieve high-fidelity generation, but struggle to generate images beyond these scales, suffering from content repetition and quality degradation. In this work, we present UltraImage, a principled framework that addresses both issues. Through frequency-wise analysis of positional embeddings, we identify that repetition arises from the periodicity of the dominant frequency, whose period aligns with the training resolution. We introduce a recursive dominant frequency correction to constrain it within a single period after extrapolation. Furthermore, we find that quality degradation stems from diluted attention and thus propose entropy-guided adaptive attention concentration, which assigns higher focus factors to sharpen local attention for fine detail and lower ones to global attention patterns to preserve structural consistency. Experiments show that UltraImage consistently outperforms prior methods on Qwen-Image and Flux (around 4K) across three generation scenarios, reducing repetition and improving visual fidelity. Moreover, UltraImage can generate images up to 6K*6K without low-resolution guidance from a training resolution of 1328p, demonstrating its extreme extrapolation capability. Project page is available at https://thu-ml.github.io/ultraimage.github.io/.
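
The link between a "focus factor" and attention sharpness can be shown with a small numeric example. This sketch only demonstrates the general effect of scaling attention logits before the softmax: a larger factor lowers the entropy of the attention distribution. How UltraImage derives its factors from entropy is not described in the abstract, so the `focus` parameter and the function names are hypothetical.

```python
import math


def focused_softmax(logits, focus):
    """Softmax over logits multiplied by a focus factor (focus > 1 sharpens)."""
    scaled = [focus * x for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.0, -1.0]  # toy attention logits for one query
    for focus in (0.5, 1.0, 2.0):
        weights = focused_softmax(logits, focus)
        print(focus, round(entropy(weights), 4))
```

Entropy falls as the factor grows, which matches the abstract's description of higher factors sharpening local attention and lower ones keeping broader, global patterns.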

[77] DuGI-MAE: Improving Infrared Mask Autoencoders via Dual-Domain Guidance

Yinghui Xing,Xiaoting Su,Shizhou Zhang,Donghao Chu,Di Xu

Main category: cs.CV

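TL;DR: This paper proposes DuGI-MAE, a dual-domain guided infrared foundation model based on MAE. A deterministic masking strategy and a dual-domain guidance module improve infrared image representations, and a large-scale dataset, Inf-590K, is built for pretraining. The model performs strongly across several downstream tasks.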

Details Motivation: Foundation models trained on visible-light data, such as MAE, perform poorly on infrared image understanding, and current infrared foundation models lose informative tokens, model global associations insufficiently, and ignore non-uniform noise. Method: The proposed DuGI-MAE uses a deterministic masking strategy based on token entropy to retain high-entropy tokens, introduces a Dual-Domain Guidance (DDG) module that captures global token relationships while adaptively filtering non-uniform background noise, and is pretrained on Inf-590K, a newly constructed large-scale infrared image dataset. Result: On downstream tasks including infrared object detection, semantic segmentation, and small target detection, DuGI-MAE outperforms existing supervised and self-supervised methods and generalizes well. Conclusion: By improving the masking strategy and adding dual-domain modeling, DuGI-MAE improves foundational representation learning for infrared images and provides a strong general-purpose model for infrared vision tasks. Abstract: Infrared imaging plays a critical role in low-light and adverse weather conditions. However, due to the distinct characteristics of infrared images, existing foundation models such as Masked Autoencoder (MAE) trained on visible data perform suboptimally in infrared image interpretation tasks. To bridge this gap, an infrared foundation model known as InfMAE was developed and pre-trained on large-scale infrared datasets. Despite its effectiveness, InfMAE still faces several limitations, including the omission of informative tokens, insufficient modeling of global associations, and neglect of non-uniform noise. In this paper, we propose a Dual-domain Guided Infrared foundation model based on MAE (DuGI-MAE). First, we design a deterministic masking strategy based on token entropy, preserving only high-entropy tokens for reconstruction to enhance informativeness. Next, we introduce a Dual-Domain Guidance (DDG) module, which simultaneously captures global token relationships and adaptively filters non-uniform background noise commonly present in infrared imagery. To facilitate large-scale pretraining, we construct Inf-590K, a comprehensive infrared image dataset encompassing diverse scenes, various target types, and multiple spatial resolutions. Pretrained on Inf-590K, DuGI-MAE demonstrates strong generalization capabilities across various downstream tasks, including infrared object detection, semantic segmentation, and small target detection. Experimental results validate the superiority of the proposed method over both supervised and self-supervised comparison methods.
Our code is available in the supplementary material.

[78] EgoLCD: Egocentric Video Generation with Long Context Diffusion

Liuzhou Zhang,Jiarui Ye,Yuanlei Wang,Ming Zhong,Mingju Cao,Wanke Xia,Bowen Zeng,Zeyu Zhang,Hao Tang

Main category: cs.CV

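TL;DR: EgoLCD is an end-to-end framework for long egocentric video generation that addresses content drift through efficient and stable memory management, achieving state-of-the-art performance on EgoVid-5M.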

Details Motivation: Existing autoregressive models suffer from content drift when generating long egocentric videos and cannot keep object identity and scene semantics consistent over time. Method: EgoLCD combines a Long-Term Sparse KV Cache with an attention-based short-term memory, uses LoRA for local adaptation, and introduces a Memory Regulation Loss and Structured Narrative Prompting to improve temporal consistency and memory stability. Result: Experiments on the EgoVid-5M benchmark show that EgoLCD reaches state-of-the-art perceptual quality and temporal consistency and effectively mitigates generative forgetting. Conclusion: Through its memory management mechanism, EgoLCD significantly improves long egocentric video generation and marks an important step toward scalable world models for embodied AI. Abstract: Generating long, coherent egocentric videos is difficult, as hand-object interactions and procedural tasks require reliable long-term memory. Existing autoregressive models suffer from content drift, where object identity and scene semantics degrade over time. To address this challenge, we introduce EgoLCD, an end-to-end framework for egocentric long-context video generation that treats long video synthesis as a problem of efficient and stable memory management. EgoLCD combines a Long-Term Sparse KV Cache for stable global context with an attention-based short-term memory, extended by LoRA for local adaptation. A Memory Regulation Loss enforces consistent memory usage, and Structured Narrative Prompting provides explicit temporal guidance. Extensive experiments on the EgoVid-5M benchmark demonstrate that EgoLCD achieves state-of-the-art performance in both perceptual quality and temporal consistency, effectively mitigating generative forgetting and representing a significant step toward building scalable world models for embodied AI. Code: https://github.com/AIGeeksGroup/EgoLCD. Website: https://aigeeksgroup.github.io/EgoLCD.

[79] VideoSSM: Autoregressive Long Video Generation with Hybrid State-Space Memory

Yifei Yu,Xiaoshan Wu,Xinting Hu,Tao Hu,Yangtian Sun,Xiaoyang Lyu,Bo Wang,Lin Ma,Yuewen Ma,Zhongrui Wang,Xiaojuan Qi

Main category: cs.CV

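TL;DR: Proposes VideoSSM, a long video generation framework that combines autoregressive diffusion with a hybrid state-space model. Global and local memory mechanisms improve temporal consistency and motion stability for minute-scale videos.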

Details Motivation: Autoregressive diffusion models face accumulated errors, motion drift, and content repetition in long video generation and struggle to stay coherent over long horizons, which motivates an approach centered on memory. Method: Video generation is treated as a recurrent dynamical process that requires coordinated short- and long-term context. A hybrid state-space memory is introduced: a state-space model (SSM) acts as a global dynamic memory over the whole sequence, while a context window provides local memory for motion cues and fine details, both unified within an autoregressive diffusion framework. Result: VideoSSM achieves the best temporal consistency and motion stability on short- and long-range benchmarks, especially for minute-scale generation, supports diverse content and interactive prompt-based control, and its inference time scales linearly with sequence length. Conclusion: By fusing a state-space model with autoregressive diffusion, VideoSSM provides a scalable, memory-aware framework for long video generation that addresses coherence under long-range dependencies. Abstract: Autoregressive (AR) diffusion enables streaming, interactive long-video generation by producing frames causally, yet maintaining coherence over minute-scale horizons remains challenging due to accumulated errors, motion drift, and content repetition. We approach this problem from a memory perspective, treating video synthesis as a recurrent dynamical process that requires coordinated short- and long-term context. We propose VideoSSM, a Long Video Model that unifies AR diffusion with a hybrid state-space memory. The state-space model (SSM) serves as an evolving global memory of scene dynamics across the entire sequence, while a context window provides local memory for motion cues and fine details. This hybrid design preserves global consistency without frozen, repetitive patterns, supports prompt-adaptive interaction, and scales in linear time with sequence length. Experiments on short- and long-range benchmarks demonstrate state-of-the-art temporal consistency and motion stability among autoregressive video generators, especially at minute-scale horizons, enabling content diversity and interactive prompt-based control, thereby establishing a scalable, memory-aware framework for long video generation.

[80] Boundary-Aware Test-Time Adaptation for Zero-Shot Medical Image Segmentation

Chenlin Xu,Lei Zhang,Lituan Wang,Xinyu Pu,Pengfei Ma,Guangwu Qian,Zizhou Wang,Yan Wang

Main category: cs.CV

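TL;DR: This paper proposes BA-TTA-SAM, a task-agnostic test-time adaptation framework that uses Gaussian prompt injection and boundary-aware attention alignment to significantly improve SAM's zero-shot performance on medical image segmentation, raising the average Dice score by 12.4% without any source-domain training data.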

Details Motivation: Scarce annotated data and high computational cost make conventional fine-tuning difficult in medical image segmentation. Existing methods depend on task-specific training on downstream tasks, and SAM suffers from domain shift on medical data, so an efficient zero-shot enhancement method is needed. Method: The BA-TTA-SAM framework includes two mechanisms: (1) encoder-level Gaussian prompt injection, which embeds Gaussian-based prompts into the image encoder to guide initial representation learning; and (2) cross-layer boundary-aware attention alignment, which uses hierarchical feature interactions in the ViT backbone to align deep semantic responses with shallow boundary cues. Result: Experiments on four medical datasets (ISIC, Kvasir, BUSI, and REFUGE) show an average Dice improvement of 12.4% over SAM's zero-shot segmentation, and the method consistently outperforms state-of-the-art models without source-domain training data. Conclusion: BA-TTA-SAM improves SAM's generalization in medical image segmentation and achieves efficient, task-agnostic test-time adaptation, offering a new approach to zero-shot medical image segmentation. Abstract: Due to the scarcity of annotated data and the substantial computational costs of models, conventional tuning methods in medical image segmentation face critical challenges. Current approaches to adapting pretrained models, including full-parameter and parameter-efficient fine-tuning, still rely heavily on task-specific training on downstream tasks. Therefore, zero-shot segmentation has gained increasing attention, especially with foundation models such as SAM demonstrating promising generalization capabilities. However, SAM still faces notable limitations on medical datasets due to domain shifts, making efficient zero-shot enhancement an urgent research goal. To address these challenges, we propose BA-TTA-SAM, a task-agnostic test-time adaptation framework that significantly enhances the zero-shot segmentation performance of SAM via test-time adaptation. This framework integrates two key mechanisms: (1) The encoder-level Gaussian prompt injection embeds Gaussian-based prompts directly into the image encoder, providing explicit guidance for initial representation learning. (2) The cross-layer boundary-aware attention alignment exploits the hierarchical feature interactions within the ViT backbone, aligning deep semantic responses with shallow boundary cues. Experiments on four datasets, including ISIC, Kvasir, BUSI, and REFUGE, show an average improvement of 12.4% in the Dice score compared with SAM's zero-shot segmentation performance. The results demonstrate that our method consistently outperforms state-of-the-art models in medical image segmentation.
Our framework significantly enhances the generalization ability of SAM, without requiring any source-domain training data. Extensive experiments on publicly available medical datasets strongly demonstrate the superiority of our framework. Our code is available at https://github.com/Emilychenlin/BA-TTA-SAM.
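
The reported gain is measured in Dice score. As a reminder of what that metric computes, here is a minimal sketch for binary masks; the function name `dice_score` and the smoothing term `eps` are illustrative assumptions, not taken from the BA-TTA-SAM code.

```python
import numpy as np

def dice_score(pred, target, eps=1e-7):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    # eps (an assumption) avoids division by zero when both masks are empty.
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

# Toy masks: 2 overlapping pixels, 3 foreground pixels in each mask.
pred = np.array([[1, 1, 0], [0, 1, 0]])
target = np.array([[1, 0, 0], [0, 1, 1]])
score = dice_score(pred, target)
```

For the toy masks the score is 2·2 / (3 + 3), i.e. about 0.667.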

[81] WiFi-based Cross-Domain Gesture Recognition Using Attention Mechanism

Ruijing Liu,Cunhua Pan,Jiaming Zeng,Hong Ren,Kezhi Wang,Lei Kong,Jiangzhou Wang

Main category: cs.CV

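TL;DR: This paper proposes a WiFi-based cross-domain gesture recognition network that fuses multi-angle Doppler spectrum images and combines spatial and channel attention mechanisms, achieving high accuracy both in-domain and cross-domain.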

Details Motivation: Existing WiFi-based gesture recognition methods perform well within the training domain but generalize poorly across domains and struggle to adapt to unseen environments. Method: Doppler spectra are extracted from the CSI of multiple receivers and concatenated along the time axis to produce fused images carrying multi-angle information; a network combining multi-semantic spatial attention with a self-attention-based channel mechanism uses attention maps to quantify the spatiotemporal features of gestures, with ResNet18 as the backbone for deep feature extraction. Result: On the public Widar3 dataset, the method reaches 99.72% in-domain accuracy and 97.61% cross-domain accuracy, significantly outperforming the best existing methods. Conclusion: The proposed method effectively improves the generalization of WiFi-based gesture recognition across environments and shows good prospects for practical use. Abstract: While fulfilling communication tasks, wireless signals can also be used to sense the environment. Among various types of sensing media, WiFi signals offer advantages such as widespread availability, low hardware cost, and strong robustness to environmental conditions like light, temperature, and humidity. By analyzing WiFi signals in the environment, it is possible to capture dynamic changes of the human body and accomplish sensing applications such as gesture recognition. Many existing gesture sensing solutions perform well in-domain but lack cross-domain capabilities (i.e., recognition performance in untrained environments). To address this, we extract Doppler spectra from the channel state information (CSI) received by all receivers and concatenate each Doppler spectrum along the same time axis to generate fused images with multi-angle information as input features. Furthermore, inspired by the convolutional block attention module (CBAM), we propose a gesture recognition network that integrates a multi-semantic spatial attention mechanism with a self-attention-based channel mechanism. This network constructs attention maps to quantify the spatiotemporal features of gestures in images, enabling the extraction of key domain-independent features. Additionally, ResNet18 is employed as the backbone network to further capture deep-level features. 
To validate the network performance, we evaluate the proposed network on the public Widar3 dataset, and the results show that it not only maintains high in-domain accuracy of 99.72%, but also achieves high performance in cross-domain recognition of 97.61%, significantly outperforming existing best solutions.
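
The fusion step in the abstract can be sketched as follows. This is one possible reading, not the authors' pipeline: it assumes each receiver yields a spectrum of shape (frequency bins, time steps) and that "along the same time axis" means the spectra are stacked so every receiver shares one time axis. The name `fuse_doppler_spectra` and the per-receiver min-max normalization are assumptions.

```python
import numpy as np

def fuse_doppler_spectra(spectra):
    """Stack per-receiver Doppler spectra into one image with a shared time axis.

    Each input has shape (freq_bins, time_steps); the output has shape
    (n_receivers * freq_bins, time_steps).
    """
    if len({s.shape[1] for s in spectra}) != 1:
        raise ValueError("all spectra must cover the same number of time steps")
    normed = []
    for s in spectra:
        span = s.max() - s.min()
        # Min-max normalize each receiver separately (an assumption) so that
        # no single receiver dominates the fused image.
        normed.append((s - s.min()) / span if span > 0 else np.zeros_like(s, dtype=float))
    return np.concatenate(normed, axis=0)

rng = np.random.default_rng(0)
spectra = [rng.random((121, 50)) for _ in range(6)]  # hypothetical sizes
fused = fuse_doppler_spectra(spectra)
```

The fused array can then be treated as a single-channel image and passed to an image backbone such as ResNet18.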

[82] Identity Clue Refinement and Enhancement for Visible-Infrared Person Re-Identification

Guoqing Zhang,Zhun Wang,Hairui Wang,Zhonglin Ye,Yuhui Zheng

Main category: cs.CV

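TL;DR: Proposes a new ICRE network for visible-infrared person re-identification that improves performance by mining and exploiting the implicit discriminative knowledge in modality-specific attributes.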

Details Motivation: Existing methods focus mainly on learning modality-invariant features and overlook the important role of modality-specific identity-aware knowledge in discriminative feature learning. Method: A Multi-Perception Feature Refinement (MPFR) module captures easily overlooked modality-specific attributes; a Semantic Distillation Cascade Enhancement (SDCE) module distills identity-aware knowledge from shallow features and guides the learning of modality-invariant features; and an Identity Clues Guided (ICG) Loss alleviates modality discrepancies within the enhanced features. Result: Extensive experiments on multiple public datasets show that ICRE clearly outperforms existing state-of-the-art methods. Conclusion: The ICRE network effectively mines and exploits the implicit discriminative knowledge in modality-specific attributes, significantly improving visible-infrared person re-identification. Abstract: Visible-Infrared Person Re-Identification (VI-ReID) is a challenging cross-modal matching task due to significant modality discrepancies. While current methods mainly focus on learning modality-invariant features through unified embedding spaces, they often focus solely on the common discriminative semantics across modalities while disregarding the critical role of modality-specific identity-aware knowledge in discriminative feature learning. To bridge this gap, we propose a novel Identity Clue Refinement and Enhancement (ICRE) network to mine and utilize the implicit discriminative knowledge inherent in modality-specific attributes. Initially, we design a Multi-Perception Feature Refinement (MPFR) module that aggregates shallow features from shared branches, aiming to capture modality-specific attributes that are easily overlooked. Then, we propose a Semantic Distillation Cascade Enhancement (SDCE) module, which distills identity-aware knowledge from the aggregated shallow features and guides the learning of modality-invariant features. Finally, an Identity Clues Guided (ICG) Loss is proposed to alleviate the modality discrepancies within the enhanced features and promote the learning of a diverse representation space. Extensive experiments across multiple public datasets clearly show that our proposed ICRE outperforms existing SOTA methods.

[83] Auto3R: Automated 3D Reconstruction and Scanning via Data-driven Uncertainty Quantification

Chentao Shen,Sizhe Zheng,Bingqian Wu,Yaohua Feng,Yuanchen Fei,Mingyu Mei,Hanwen Jiang,Xiangru Huang

Main category: cs.CV

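TL;DR: Auto3R is a data-driven uncertainty quantification model for automated 3D scanning and reconstruction. It predicts the uncertainty distribution over scanning viewpoints without knowing the ground-truth geometry or appearance, outperforms existing methods in experiments, and can be deployed on a robot arm to produce high-quality digital assets.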

Details Motivation: Traditional high-quality 3D scanning relies on manually planned scanning procedures, which cannot meet the growing demand from embodied systems such as drones and robots for fully automated, accurate 3D scanning. Method: Auto3R is a data-driven uncertainty quantification model that, during iterative 3D reconstruction and scanning, efficiently and accurately predicts the uncertainty distribution over candidate scanning viewpoints, and applies to scenes containing objects with non-Lambertian and specular materials. Result: Experiments show that Auto3R significantly outperforms state-of-the-art methods; deployed on a camera-equipped robot arm, it effectively digitizes real-world objects and produces ready-to-use, photorealistic digital assets. Conclusion: Auto3R achieves high-quality, fully automated 3D scanning and reconstruction, supports objects with complex materials, is practical to deploy, and offers a reliable 3D perception solution for robots and automated systems. Abstract: Traditional high-quality 3D scanning and reconstruction typically relies on human labor to plan the scanning procedure. With the rapid development of embodied systems such as drones and robots, there is a growing demand for performing accurate 3D scanning and reconstruction in a fully automated manner. We introduce Auto3R, a data-driven uncertainty quantification model that is designed to automate the 3D scanning and reconstruction of scenes and objects, including objects with non-Lambertian and specular materials. Specifically, in a process of iterative 3D reconstruction and scanning, Auto3R can make efficient and accurate predictions of the uncertainty distribution over potential scanning viewpoints, without knowing the ground truth geometry and appearance. Through extensive experiments, Auto3R achieves superior performance that outperforms the state-of-the-art methods by a large margin. We also deploy Auto3R on a robot arm equipped with a camera and demonstrate that Auto3R can be used to effectively digitize real-world 3D objects and deliver ready-to-use and photorealistic digital assets. Our homepage: https://tomatoma00.github.io/auto3r.github.io .
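
The iterative scan-and-reconstruct loop described above can be illustrated with a generic next-best-view sketch. It assumes a greedy rule (scan the unvisited viewpoint with the highest predicted uncertainty); the abstract does not state Auto3R's selection rule, and `predict_uncertainty`, `select_next_view`, and `scan` are hypothetical names, with a fixed score table standing in for the learned model.

```python
import numpy as np

def select_next_view(uncertainty, visited):
    """Pick the unvisited candidate viewpoint with the highest uncertainty."""
    masked = np.where(visited, -np.inf, uncertainty)
    return int(np.argmax(masked))

def scan(predict_uncertainty, n_candidates, budget):
    """Greedy scanning loop: predict, pick a view, mark it scanned, repeat."""
    visited = np.zeros(n_candidates, dtype=bool)
    order = []
    for _ in range(min(budget, n_candidates)):
        uncertainty = predict_uncertainty(visited)  # stand-in for the model
        idx = select_next_view(uncertainty, visited)
        visited[idx] = True
        order.append(idx)
    return order

# Toy predictor with fixed scores; a real model would re-estimate uncertainty
# after each new scan updates the reconstruction.
scores = np.array([0.2, 0.9, 0.5, 0.7])
order = scan(lambda visited: scores, n_candidates=4, budget=3)
```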

[84] PhyVLLM: Physics-Guided Video Language Model with Motion-Appearance Disentanglement

Yu-Wei Zhan,Xin Wang,Hong Chen,Tongtong Feng,Wei Feng,Ren Wang,Guangyao Li,Qing Li,Wenwu Zhu

Main category: cs.CV

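TL;DR: This paper proposes PhyVLLM, a video LLM framework that incorporates physical motion modeling. A dual-branch encoder separates appearance from motion and a Neural ODE models physical dynamics, enabling self-supervised learning without explicit physical annotations; the model outperforms existing methods on both physical reasoning and general video understanding tasks.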

Details Motivation: Existing video LLMs rely on appearance matching and struggle to understand physical dynamics, which limits them in scenarios that require physical reasoning; explicit modeling of physical motion is therefore needed. Method: The PhyVLLM framework uses a dual-branch encoder to separate visual appearance from object motion; adds a Neural ODE module to model physical dynamics in continuous time; projects the motion-aware representations into the token space of a pretrained LLM; and is trained in a self-supervised way to avoid costly physical annotations. Result: Experiments show that PhyVLLM significantly outperforms state-of-the-art video LLMs on physical reasoning and general video understanding tasks, confirming the value of explicit physical modeling. Conclusion: By explicitly modeling physical motion, PhyVLLM improves video LLMs' understanding of physical dynamics while preserving their original multimodal capabilities, offering an effective path toward vision-language systems with more physical common sense. Abstract: Video Large Language Models (Video LLMs) have shown impressive performance across a wide range of video-language tasks. However, they often fail in scenarios requiring a deeper understanding of physical dynamics. This limitation primarily arises from their reliance on appearance-based matching. Incorporating physical motion modeling is crucial for deeper video understanding, but presents three key challenges: (1) motion signals are often entangled with appearance variations, making it difficult to extract clean physical cues; (2) effective motion modeling requires not only continuous-time motion representations but also capturing physical dynamics; and (3) collecting accurate annotations for physical attributes is costly and often impractical. To address these issues, we propose PhyVLLM, a physics-guided video-language framework that explicitly incorporates physical motion into Video LLMs. Specifically, PhyVLLM disentangles visual appearance and object motion through a dual-branch encoder. To model physical dynamics over time, we incorporate a Neural Ordinary Differential Equation (Neural ODE) module, which generates differentiable physical dynamic representations. The resulting motion-aware representations are projected into the token space of a pretrained LLM, enabling physics reasoning without compromising the model's original multimodal capabilities. To circumvent the need for explicit physical labels, PhyVLLM is trained in a self-supervised manner to model the continuous evolution of object motion. 
Experimental results demonstrate that PhyVLLM significantly outperforms state-of-the-art Video LLMs on both physical reasoning and general video understanding tasks, highlighting the advantages of incorporating explicit physical modeling.
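
For readers unfamiliar with Neural ODEs, the core idea is to treat a latent state z(t) as the solution of dz/dt = f(z), where f is a learned network, and to obtain z at later times by numerical integration. The sketch below uses a fixed-step Euler solver and a single tanh layer as f; both are simplifying assumptions for illustration (practical implementations usually rely on adaptive solvers and learned parameters), and none of the names come from PhyVLLM.

```python
import numpy as np

def dynamics(z, W, b):
    """Stand-in for the learned vector field f(z) = tanh(W z + b)."""
    return np.tanh(W @ z + b)

def integrate_euler(z0, W, b, t0, t1, n_steps):
    """Integrate dz/dt = f(z) from t0 to t1 with fixed-step Euler updates."""
    z = np.asarray(z0, dtype=float)
    dt = (t1 - t0) / n_steps
    trajectory = [z.copy()]
    for _ in range(n_steps):
        z = z + dt * dynamics(z, W, b)  # z_{k+1} = z_k + dt * f(z_k)
        trajectory.append(z.copy())
    return np.stack(trajectory)  # shape: (n_steps + 1, dim)

rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(2, 2))  # hypothetical parameters
b = np.zeros(2)
traj = integrate_euler([1.0, -2.0], W, b, t0=0.0, t1=1.0, n_steps=4)
```

Each row of `traj` is the latent motion state at one time step, so the state can be queried at times between observed frames.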

[85] Refaçade: Editing Object with Given Reference Texture

Youze Huang,Penghui Ruan,Bojia Zi,Xianbiao Qi,Jianan Wang,Rong Xiao

Main category: cs.CV

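TL;DR: This paper introduces a new task, Object Retexture, which transfers local textures from a reference object to a target object, and proposes Refaçade to achieve precise control over texture transfer in images and videos.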

Details Motivation: Although diffusion models have made remarkable progress in image and video editing, some tasks remain underexplored. Existing approaches to texture transfer suffer from interference by structural information and from coupled texture and structure features, which limits controllability. Method: Refaçade has two key designs: a texture remover trained on 3D mesh renderings strips appearance information from the source while preserving geometry and motion; and a jigsaw permutation scrambles the global layout of the reference image so that the model attends to local texture statistics rather than overall structure. Result: Experiments show that the method surpasses strong baselines in visual quality, editing precision, and controllability, in both quantitative and human evaluations. Conclusion: Refaçade achieves precise, controllable texture transfer for images and videos, resolves the control problems caused by structural interference and feature entanglement in earlier methods, and suggests new directions for future editing tasks. Abstract: Recent advances in diffusion models have brought remarkable progress in image and video editing, yet some tasks remain underexplored. In this paper, we introduce a new task, Object Retexture, which transfers local textures from a reference object to a target object in images or videos. To perform this task, a straightforward solution is to use ControlNet conditioned on the source structure and the reference texture. However, this approach suffers from limited controllability for two reasons: conditioning on the raw reference image introduces unwanted structural information, and it fails to disentangle the visual texture and structure information of the source. To address this problem, we propose Refaçade, a method that consists of two key designs to achieve precise and controllable texture transfer in both images and videos. First, we employ a texture remover trained on paired textured/untextured 3D mesh renderings to remove appearance information while preserving the geometry and motion of source videos. Second, we disrupt the reference global layout using a jigsaw permutation, encouraging the model to focus on local texture statistics rather than the global layout of the object. Extensive experiments demonstrate superior visual quality, precise editing, and controllability, outperforming strong baselines in both quantitative and human evaluations. Code is available at https://github.com/fishZe233/Refacade.
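
A jigsaw permutation of the kind mentioned above can be sketched in a few lines: cut the reference image into a grid of patches and reassemble them in random order, which keeps local texture but destroys the global layout. The grid size and the function name `jigsaw_permute` are assumptions; the abstract does not give the patch size or shuffling scheme used by Refaçade.

```python
import numpy as np

def jigsaw_permute(image, grid, rng):
    """Shuffle an image's patches on a grid x grid layout."""
    h, w = image.shape[:2]
    if h % grid or w % grid:
        raise ValueError("image height and width must be divisible by grid")
    ph, pw = h // grid, w // grid
    # Cut the image into grid * grid patches in row-major order.
    patches = [image[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw]
               for r in range(grid) for c in range(grid)]
    order = rng.permutation(len(patches))
    # Reassemble the patches row by row in the permuted order.
    rows = [np.concatenate([patches[order[r * grid + c]] for c in range(grid)], axis=1)
            for r in range(grid)]
    return np.concatenate(rows, axis=0)

image = np.arange(16 * 16 * 3).reshape(16, 16, 3)  # toy reference image
shuffled = jigsaw_permute(image, grid=4, rng=np.random.default_rng(0))
```

The output has the same shape and the same set of pixels as the input; only the patch positions change.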

[86] Detection of Intoxicated Individuals from Facial Video Sequences via a Recurrent Fusion Model

Bita Baroutian,Atefe Aghaei,Mohsen Ebrahimi Moghaddam

Main category: cs.CV

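TL;DR: Proposes a video-based facial sequence analysis method for detecting alcohol intoxication that combines a Graph Attention Network with a 3D ResNet to extract spatiotemporal features, and outperforms existing methods on a newly built dataset.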

Details Motivation: Alcohol consumption is a major cause of accidents and deaths worldwide, so a non-invasive, reliable detection method is needed to improve public safety. Method: Facial landmark analysis with a GAT is combined with spatiotemporal visual features from a 3D ResNet, and the two are adaptively weighted through a dynamic fusion strategy. Result: On a new dataset of 3,542 video segments, the model achieves 95.82% accuracy, 0.977 precision, and 0.97 recall, outperforming the 3D-CNN and VGGFace+LSTM baselines. Conclusion: The proposed method performs well at alcohol intoxication detection and has potential for practical deployment in public safety systems. Abstract: Alcohol consumption is a significant public health concern and a major cause of accidents and fatalities worldwide. This study introduces a novel video-based facial sequence analysis approach dedicated to the detection of alcohol intoxication. The method integrates facial landmark analysis via a Graph Attention Network (GAT) with spatiotemporal visual features extracted using a 3D ResNet. These features are dynamically fused with adaptive prioritization to enhance classification performance. Additionally, we introduce a curated dataset comprising 3,542 video segments derived from 202 individuals to support training and evaluation. Our model is compared against two baselines: a custom 3D-CNN and a VGGFace+LSTM architecture. Experimental results show that our approach achieves 95.82% accuracy, 0.977 precision, and 0.97 recall, outperforming prior methods. The findings demonstrate the model's potential for practical deployment in public safety systems for non-invasive, reliable alcohol intoxication detection.
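
One common way to realize "dynamic fusion with adaptive prioritization" is a softmax gate that assigns each feature stream an input-dependent weight. The sketch below shows that generic pattern only; the abstract does not specify the paper's fusion module, and `adaptive_fuse` and `gate_w` are hypothetical names, with random values standing in for learned parameters and extracted features.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()

def adaptive_fuse(landmark_feat, visual_feat, gate_w):
    """Weight two feature streams by a softmax gate and sum them."""
    stacked = np.stack([landmark_feat, visual_feat])  # shape (2, d)
    logits = stacked @ gate_w                         # one score per stream
    weights = softmax(logits)                         # weights sum to 1
    return weights @ stacked, weights                 # fused feature, shape (d,)

rng = np.random.default_rng(0)
landmark_feat = rng.normal(size=8)  # stand-in for GAT landmark features
visual_feat = rng.normal(size=8)    # stand-in for 3D ResNet features
gate_w = rng.normal(size=8)         # stand-in for learned gate parameters
fused, weights = adaptive_fuse(landmark_feat, visual_feat, gate_w)
```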

[87] X-Humanoid: Robotize Human Videos to Generate Humanoid Videos at Scale

Pei Yang,Hai Ci,Yiren Song,Mike Zheng Shou

Main category: cs.CV

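TL;DR: This paper proposes X-Humanoid, a generative video editing method that fine-tunes the Wan 2.2 model to translate videos of human motion into humanoid robot motion. It builds a large paired synthetic dataset and converts 60 hours of Ego-Exo4D video into more than 3.6 million "robotized" humanoid video frames.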

Details Motivation: Vision-Language-Action (VLA) models and world models are held back by the lack of large-scale, diverse training data. "Robotizing" web-scale human videos is a viable remedy, but existing methods cannot handle complex full-body motion and scene occlusion, and perform especially poorly on third-person views. Method: X-Humanoid adapts the Wan 2.2 model into a video-to-video structure and fine-tunes it for human-to-humanoid translation; a scalable data creation pipeline uses Unreal Engine to generate more than 17 hours of paired synthetic videos for training. Result: Applied to 60 hours of Ego-Exo4D video, the model produced a released large-scale dataset of over 3.6 million "robotized" humanoid video frames; quantitative analysis and a user study show that 69% of users rated it best for motion consistency and 62.1% rated it best for embodiment correctness. Conclusion: X-Humanoid addresses the problem of extracting large-scale training data for humanoid robot policies from human videos, providing a viable path and a valuable resource for embodied AI. Abstract: The advancement of embodied AI has unlocked significant potential for intelligent humanoid robots. However, progress in both Vision-Language-Action (VLA) models and world models is severely hampered by the scarcity of large-scale, diverse training data. A promising solution is to "robotize" web-scale human videos, which has been proven effective for policy training. However, these solutions mainly "overlay" robot arms onto egocentric videos and cannot handle complex full-body motions and scene occlusions in third-person videos, making them unsuitable for robotizing humans. To bridge this gap, we introduce X-Humanoid, a generative video editing approach that adapts the powerful Wan 2.2 model into a video-to-video structure and finetunes it for the human-to-humanoid translation task. This finetuning requires paired human-humanoid videos, so we designed a scalable data creation pipeline, turning community assets into 17+ hours of paired synthetic videos using Unreal Engine. We then apply our trained model to 60 hours of the Ego-Exo4D videos, generating and releasing a new large-scale dataset of over 3.6 million "robotized" humanoid video frames. Quantitative analysis and user studies confirm our method's superiority over existing baselines: 69% of users rated it best for motion consistency, and 62.1% for embodiment correctness.

[88] VideoMem: Enhancing Ultra-Long Video Understanding via Adaptive Memory Management

Hongbo Jin,Qingyuan Wang,Wenhao Zhang,Yang Liu,Sijie Cheng

Main category: cs.CV

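TL;DR: Proposes the VideoMem framework, which achieves ultra-long video understanding through adaptive memory management and the progressive training algorithm PRPO, significantly outperforming existing open-source models.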

Details Motivation: Existing vision language models are limited by context length and long-term memory when processing ultra-long videos, and external knowledge-base approaches incur excessive computation and storage overhead. Method: A dynamically updated global memory buffer retains key information, and the PRPO algorithm, with its two modules Progressive State Propagation (PSP) and Temporal Cascading Reward (TCR), improves training efficiency and convergence speed. Result: On multiple ultra-long video understanding benchmarks, VideoMem significantly outperforms existing open-source models, confirming its effectiveness and efficiency. Conclusion: VideoMem offers an efficient, scalable new paradigm for ultra-long video understanding, addressing the key challenges of long-term memory and sparse training rewards. Abstract: Ultra-long video understanding remains an open challenge, as existing vision language models (VLMs) falter on such content due to limited context length and inefficient long-term memory retention. To address this, recent works have attempted to construct external knowledge bases and corresponding retrieval-augmented generation (RAG) systems, yet these incur enormous storage and computational overhead. In this paper, we propose VideoMem, a novel framework that pioneers modeling long video understanding as a sequential generation task via adaptive memory management. Specifically, VideoMem dynamically updates a global memory buffer, which adaptively retains critical information while discarding redundant content across the video timeline. To efficiently train VLMs for such long-term tasks, VideoMem integrates the Progressive Grouped Relative Policy Optimization (PRPO) algorithm, equipped with two core modules. Progressive State Propagation (PSP) adaptively retains valid current states, propagates them to the next rollout step, and gradually narrows the model exploration space. Temporal Cascading Reward (TCR) further alleviates reward sparsity, improving sample utilization and accelerating convergence. Extensive experiments demonstrate that VideoMem significantly outperforms existing open-source models across diverse benchmarks for ultra-long video understanding tasks.
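
To make the idea of a bounded memory that "retains critical information while discarding redundant content" concrete, here is a minimal sketch of a fixed-capacity buffer that keeps the highest-scoring entries seen so far. In VideoMem the model itself decides what to keep; the scalar importance score, the class name `MemoryBuffer`, and the eviction rule are all assumptions made for illustration.

```python
import heapq

class MemoryBuffer:
    """Fixed-capacity memory that keeps the highest-scoring entries."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []  # min-heap of (score, step, item)

    def update(self, step, item, score):
        entry = (score, step, item)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
        elif score > self._heap[0][0]:
            # Evict the least important entry in favor of the new one.
            heapq.heapreplace(self._heap, entry)

    def contents(self):
        # Return retained items in timeline order.
        return [item for _, _, item in sorted(self._heap, key=lambda e: e[1])]

buffer = MemoryBuffer(capacity=2)
for step, (clip, score) in enumerate([("a", 0.1), ("b", 0.9), ("c", 0.5), ("d", 0.2)]):
    buffer.update(step, clip, score)
kept = buffer.contents()
```

With a capacity of two, the buffer ends up holding clips "b" and "c", the two highest-scoring entries, in their original order.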

[89] Gaussian Entropy Fields: Driving Adaptive Sparsity in 3D Gaussian Optimization

Hong Kuang,Jianchen Liu

Main category: cs.CV

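TL;DR: This paper proposes GEF, an entropy-minimization approach to 3D Gaussian Splatting that improves surface reconstruction by lowering the configurational entropy of primitive distributions. Combined with adaptive spatial regularization and multi-scale geometric preservation, it achieves strong geometric accuracy and rendering quality on several benchmarks.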

Details Motivation: To improve surface reconstruction accuracy in 3D Gaussian Splatting while preserving high rendering quality, the authors use low configurational entropy as the signature of a well-reconstructed surface and design an optimization framework around it. Method: Three techniques are proposed: (1) surface modeling based on entropy minimization; (2) adaptive spatial regularization using the Surface Neighborhood Redundancy Index (SNRI) and image entropy-guided weighting; (3) multi-scale geometric preservation through competitive cross-scale entropy alignment. Result: The method obtains a Chamfer Distance of 0.64 on DTU and an F1 score of 0.44 on T&T, and the best SSIM (0.855) and LPIPS (0.136) on Mip-NeRF 360, outperforming existing methods. Conclusion: The GEF framework improves the geometric accuracy of surface reconstruction without sacrificing photometric fidelity, confirming the value of entropy-driven modeling in 3DGS. Abstract: 3D Gaussian Splatting (3DGS) has emerged as a leading technique for novel view synthesis, demonstrating exceptional rendering efficiency. Well-reconstructed surfaces can be characterized by low configurational entropy, where dominant primitives clearly define surface geometry while redundant components are suppressed. Three complementary technical contributions are introduced: (1) entropy-driven surface modeling via entropy minimization for low configurational entropy in primitive distributions; (2) adaptive spatial regularization using the Surface Neighborhood Redundancy Index (SNRI) and image entropy-guided weighting; (3) multi-scale geometric preservation through competitive cross-scale entropy alignment. Extensive experiments demonstrate that GEF achieves competitive geometric precision on DTU and T&T benchmarks, while delivering superior rendering quality compared to existing methods on Mip-NeRF 360. Notably, superior Chamfer Distance (0.64) on DTU and F1 score (0.44) on T&T are obtained, alongside the best SSIM (0.855) and LPIPS (0.136) among baselines on Mip-NeRF 360, validating the framework's ability to enhance surface reconstruction accuracy without compromising photometric fidelity.
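
The intuition behind "low configurational entropy" can be illustrated with Shannon entropy over the normalized contribution weights of the primitives that cover a point or ray. The abstract does not give GEF's exact entropy definition, so the formula and the name `configurational_entropy` below are assumptions chosen to show the qualitative behavior: a single dominant primitive gives entropy near zero, while many equally weighted primitives give high entropy.

```python
import numpy as np

def configurational_entropy(weights, eps=1e-12):
    """Shannon entropy of normalized, non-negative contribution weights."""
    w = np.asarray(weights, dtype=float)
    p = w / (w.sum() + eps)                    # normalize to a distribution
    return float(-(p * np.log(p + eps)).sum())

dominant = configurational_entropy([1.0, 0.0, 0.0, 0.0])      # one primitive dominates
redundant = configurational_entropy([0.25, 0.25, 0.25, 0.25])  # equal contributions
```

Minimizing such a quantity pushes the weights toward the "dominant" case, which matches the description of dominant primitives defining the surface while redundant ones are suppressed.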

[90] Counterfeit Answers: Adversarial Forgery against OCR-Free Document Visual Question Answering

Marco Pintore,Maura Pintor,Dimosthenis Karatzas,Battista Biggio

Main category: cs.CV

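TL;DR: This paper proposes a new adversarial attack on Document Visual Question Answering (DocVQA) systems that forges document content in a visually imperceptible but semantically targeted way to induce wrong answers from the model.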

Details Motivation: Although existing DocVQA models perform well, they remain fragile under adversarial attack, and their security needs study. Method: Specialized attack algorithms generate adversarially forged documents tailored to different attack goals (such as targeted misinformation or systematic failure), and are validated on two advanced models, Pix2Struct and Donut. Result: Experiments show that the proposed attacks effectively mislead both state-of-the-art models, exposing critical security vulnerabilities in current DocVQA systems. Conclusion: Current DocVQA models are vulnerable to semantics-aware adversarial attacks, and future work should strengthen research on robust defenses. Abstract: Document Visual Question Answering (DocVQA) enables end-to-end reasoning grounded on information present in a document input. While recent models have shown impressive capabilities, they remain vulnerable to adversarial attacks. In this work, we introduce a novel attack scenario that aims to forge document content in a visually imperceptible yet semantically targeted manner, allowing an adversary to induce specific or generally incorrect answers from a DocVQA model. We develop specialized attack algorithms that can produce adversarially forged documents tailored to different attackers' goals, ranging from targeted misinformation to systematic model failure scenarios. We demonstrate the effectiveness of our approach against two end-to-end state-of-the-art models: Pix2Struct, a vision-language transformer that jointly processes image and text through sequence-to-sequence modeling, and Donut, a transformer-based model that directly extracts text and answers questions from document images. Our findings highlight critical vulnerabilities in current DocVQA systems and call for the development of more robust defenses.

[91] COOPER: A Unified Model for Cooperative Perception and Reasoning in Spatial Intelligence

Zefeng Zhang,Xiangzhao Hao,Hengzhu Tang,Zhenyu Zhang,Jiawei Sheng,Xiaodong Li,Zhenyang Li,Li Gao,Daiting Shi,Dawei Yin,Tingwen Liu

Main category: cs.CV

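TL;DR: This paper proposes COOPER, a unified multimodal large language model that uses auxiliary modalities such as depth and segmentation and a two-stage training scheme to strengthen visual spatial perception and adaptive interleaved reasoning, improving spatial understanding. Experiments show an average gain of 6.91% on spatial reasoning tasks, and a variant trained only to generate auxiliary modalities gains 7.92% on distance and size estimation, indicating that generating auxiliary modalities helps internalize spatial knowledge.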

Details Motivation: Existing multimodal large language models remain weak at 3D spatial reasoning; they usually treat perception enhancement and reasoning enhancement separately and lack a unified framework that improves both. Method: COOPER uses depth and segmentation as auxiliary modalities and is trained in two stages: the first learns to generate auxiliary modalities, and the second develops adaptive interleaved reasoning, yielding a deeper understanding of spatial relationships. Result: COOPER improves spatial reasoning by 6.91% on average while maintaining general performance; the variant trained only for auxiliary modality generation improves distance and size estimation by 7.92%. Conclusion: Learning to generate auxiliary modalities and reasoning in an interleaved way within a unified framework effectively strengthens the spatial intelligence of multimodal LLMs, indicating that jointly training perception and reasoning helps internalize spatial knowledge. Abstract: Visual Spatial Reasoning is crucial for enabling Multimodal Large Language Models (MLLMs) to understand object properties and spatial relationships, yet current models still struggle with 3D-aware reasoning. Existing approaches typically enhance either perception, by augmenting RGB inputs with auxiliary modalities such as depth and segmentation, or reasoning, by training on spatial VQA datasets and applying reinforcement learning, and thus treat these two aspects in isolation. In this work, we investigate whether a unified MLLM can develop an intrinsic ability to enhance spatial perception and, through adaptive interleaved reasoning, achieve stronger spatial intelligence. We propose COOPER, a unified MLLM that leverages depth and segmentation as auxiliary modalities and is trained in two stages to acquire auxiliary modality generation and adaptive, interleaved reasoning capabilities. COOPER achieves an average 6.91% improvement in spatial reasoning while maintaining general performance. Moreover, even a variant trained only for auxiliary modality generation attains a 7.92% gain on distance and size estimation, suggesting that learning to generate auxiliary modalities helps internalize spatial knowledge and strengthen spatial understanding.

[92] Dataset creation for supervised deep learning-based analysis of microscopic images -- review of important considerations and recommendations

Christof A. Bertram,Viktoria Weiss,Jonas Ammeling,F. Maria Schabel,Taryn A. Donovan,Frauke Wilm,Christian Marzahl,Katharina Breininger,Marc Aubreville

Main category: cs.CV

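TL;DR: This review covers the key steps and challenges in creating microscopic image datasets for deep learning, gives guidance on image acquisition, annotation software selection, and annotation creation, and stresses the importance of open datasets for reproducibility and model generalization in pathology.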

Details Motivation: High-quality, large-scale datasets are essential to the success of supervised deep learning in microscopic image analysis, but building them is hampered by time, resources, domain variability, and annotation bias, so systematic guidance is needed. Method: The review surveys the literature and summarizes three key steps in dataset creation: image acquisition, annotation tool selection, and annotation creation; it proposes ways to handle domain shift and improve annotation quality (correctness, completeness, consistency), and supplies a standard operating procedure (SOP) as supplemental material. Result: The review sets out key principles for dataset quality, including handling image variability and using multi-annotator strategies to improve annotation quality, and advocates open datasets to improve reproducibility and model generalization. Conclusion: With systematic practical recommendations and SOP support, the review helps advance the creation of high-quality, large-scale datasets and the use of generalizable, robust deep learning models in pathology. Abstract: Supervised deep learning (DL) receives great interest for automated analysis of microscopic images with an increasing body of literature supporting its potential. The development and validation of those DL models relies heavily on the availability of high-quality, large-scale datasets. However, creating such datasets is a complex and resource-intensive process, often hindered by challenges such as time constraints, domain variability, and risks of bias in image collection and label creation. This review provides a comprehensive guide to the critical steps in dataset creation, including: 1) image acquisition, 2) selection of annotation software, and 3) annotation creation. In addition to ensuring a sufficiently large number of images, it is crucial to address sources of image variability (domain shifts) - such as those related to slide preparation and digitization - that could lead to algorithmic errors if not adequately represented in the training data. Key quality criteria for annotations are the three "C"s: correctness, completeness, and consistency. This review explores methods to enhance annotation quality through the use of advanced techniques that mitigate the limitations of single annotators. To support dataset creators, a standard operating procedure (SOP) is provided as supplemental material, outlining best practices for dataset development. Furthermore, the article underscores the importance of open datasets in driving innovation and enhancing reproducibility of DL research. 
By addressing the challenges and offering practical recommendations, this review aims to advance the creation and availability of high-quality, large-scale datasets, ultimately contributing to the development of generalizable and robust DL models for pathology applications.
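
One simple way to mitigate the limitations of a single annotator is to collect labels from several annotators and take a majority vote, flagging items with low agreement for review. The sketch below illustrates that generic strategy; it is not claimed to be the procedure in the review's SOP, and the function name and label strings are made up for the example.

```python
from collections import Counter

def majority_vote(labels_per_annotator):
    """Combine per-annotator label lists into consensus labels.

    Returns the consensus label and the fraction of annotators who agreed
    with it for every item. With an even number of annotators, ties are
    broken by the first label encountered, so an odd number is preferable.
    """
    consensus, agreement = [], []
    for votes in zip(*labels_per_annotator):
        label, count = Counter(votes).most_common(1)[0]
        consensus.append(label)
        agreement.append(count / len(votes))
    return consensus, agreement

annotator_a = ["mitosis", "normal", "mitosis"]
annotator_b = ["mitosis", "normal", "normal"]
annotator_c = ["normal", "normal", "mitosis"]
consensus, agreement = majority_vote([annotator_a, annotator_b, annotator_c])
```

Items whose agreement falls below a chosen threshold (here, the first and third items at 2/3) are natural candidates for expert re-review, which targets the consistency criterion.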

[93] Prompt2Craft: Generating Functional Craft Assemblies with LLMs

Vitor Hideyo Isume,Takuya Kiyokawa,Natsuki Yamanobe,Yukiyasu Domae,Weiwei Wan,Kensuke Harada

Main category: cs.CV

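TL;DR: This paper introduces the Craft Assembly Task, in which a robot, given an RGB image of a target object, selects a suitable subset of the available objects for a creative assembly even when those objects do not directly correspond to the target's parts. The method uses a mask segmentation network to identify visible parts, retrieves labeled template meshes and optimizes their pose, simplifies the meshes to primitive shapes such as cuboids or cylinders, and designs a search algorithm based on local and global proportions to match objects in the scene. Experiments show results comparable to the baselines in two different scenes, along with qualitative results in a real-world scenario.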

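Details Motivation: Inspired by traditional handicrafts, people can improvise assemblies from whatever materials are at hand, whereas current robotic assembly usually depends on exactly matching parts and lacks flexibility. The paper aims to give robots a similar creative assembly ability, so that they can build an approximation of the target structure even when the parts do not match exactly. Method: 1. A mask segmentation neural network extracts the visible parts from an RGB image of the target object; 2. labeled template meshes are retrieved and pose-optimized to best fit the target; 3. the optimized template mesh is simplified into primitive shapes (such as cuboids and cylinders); 4. a search algorithm combining local and global proportion features finds the best-matching combination of available objects in the scene. Result: In two different scenes the method achieves results comparable to baselines that consider all possible combinations; qualitative results from a real-world implementation confirm its feasibility. Conclusion: The paper defines and implements the Craft Assembly Task, enabling robots to assemble creatively with non-standard parts. Combining template optimization and geometric simplification with a search strategy yields an effective approximate reconstruction of the target object and suggests new directions for flexible robot manipulation in open environments. Abstract: Inspired by traditional handmade crafts, where a person improvises assemblies based on the available objects, we formally introduce the Craft Assembly Task. It is a robotic assembly task that involves building an accurate representation of a given target object using the available objects, which do not directly correspond to its parts. In this work, we focus on selecting the subset of available objects for the final craft, when the given input is an RGB image of the target in the wild. We use a mask segmentation neural network to identify visible parts, followed by retrieving labeled template meshes. These meshes undergo pose optimization to determine the most suitable template. Then, we propose to simplify the parts of the transformed template mesh to primitive shapes like cuboids or cylinders. Finally, we design a search algorithm to find correspondences in the scene based on local and global proportions. We develop baselines for comparison that consider all possible combinations, and choose the highest scoring combination for common metrics used in foreground maps and mask accuracy. Our approach achieves comparable results to the baselines for two different scenes, and we show qualitative results for an implementation in a real-world scenario.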
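
The exhaustive baseline mentioned in the abstract ("consider all possible combinations") can be illustrated with a toy matcher that assigns scene objects to template parts by comparing their proportions. This sketch covers only a local proportion term (each part's scale-normalized dimensions) and ignores the global proportions between parts that the paper's search also uses; the cost function and all names are assumptions, not the authors' metric.

```python
from itertools import permutations

def proportion_error(part_dims, object_dims):
    """Compare the scale-normalized (w, h, d) proportions of two shapes."""
    p, o = sorted(part_dims), sorted(object_dims)
    p = [d / p[-1] for d in p]  # normalize by the largest dimension
    o = [d / o[-1] for d in o]
    return sum(abs(a - b) for a, b in zip(p, o))

def best_assignment(parts, objects):
    """Try every assignment of distinct objects to parts; keep the cheapest."""
    best, best_cost = None, float("inf")
    for combo in permutations(range(len(objects)), len(parts)):
        cost = sum(proportion_error(parts[i], objects[j]) for i, j in enumerate(combo))
        if cost < best_cost:
            best, best_cost = combo, cost
    return best, best_cost

parts = [(4, 2, 2), (1, 1, 1)]               # primitive dimensions of template parts
objects = [(2, 2, 2), (8, 4, 4), (1, 5, 1)]  # dimensions of available scene objects
assignment, cost = best_assignment(parts, objects)
```

Here part 0 is matched to object 1 (same proportions at twice the scale) and part 1 to object 0. The number of assignments grows factorially with the number of objects, which is why a guided search is attractive for larger scenes.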

[94] TARDis: Time Attenuated Representation Disentanglement for Incomplete Multi-Modal Tumor Segmentation and Classification

Zishuo Wan,Qinqin Kang,Yi Huang,Yun Bian,Dawei Ding,Ke Yan

Main category: cs.CV

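TL;DR: This paper proposes TARDis, a new physics-aware framework for the missing-modality problem caused by incomplete scans in contrast-enhanced CT. By disentangling anatomical and perfusion features along the time-attenuation curve, it maintains strong tumor segmentation and diagnostic performance even with very little data.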

Details Motivation: Because of radiation concerns or scanning limitations, complete multi-phase CT series are often unavailable in clinical practice, leaving modalities missing; existing deep learning methods ignore the temporal continuity of hemodynamics, which limits segmentation and diagnostic performance. Method: Time Attenuated Representation Disentanglement (TARDis) uses a dual-path architecture: a quantization-based path extracts time-invariant anatomical structure through a learnable embedding dictionary, and a probabilistic path uses a Conditional Variational Autoencoder to model time-dependent enhancement dynamics and generates the missing hemodynamic features from the estimated scan time. Result: Experiments on a large-scale private abdominal CT dataset (2,282 cases) and two public datasets show that TARDis significantly outperforms existing missing-modality methods and keeps robust diagnostic performance even under extreme data sparsity. Conclusion: By treating missing modalities as missing sample points on a continuous time-attenuation curve, TARDis effectively disentangles and reconstructs anatomical and functional features, offering a feasible way to reduce radiation exposure while maintaining diagnostic accuracy. Abstract: Tumor segmentation and diagnosis in contrast-enhanced Computed Tomography (CT) rely heavily on the physiological dynamics of contrast agents. However, obtaining a complete multi-phase series is often clinically unfeasible due to radiation concerns or scanning limitations, leading to the "missing modality" problem. Existing deep learning approaches typically treat missing phases as absent independent channels, ignoring the inherent temporal continuity of hemodynamics. In this work, we propose Time Attenuated Representation Disentanglement (TARDis), a novel physics-aware framework that redefines missing modalities as missing sample points on a continuous Time-Attenuation Curve. TARDis explicitly disentangles the latent feature space into a time-invariant static component (anatomy) and a time-dependent dynamic component (perfusion). We achieve this via a dual-path architecture: a quantization-based path using a learnable embedding dictionary to extract consistent anatomical structures, and a probabilistic path using a Conditional Variational Autoencoder to model dynamic enhancement conditioned on the estimated scan time. This design allows the network to hallucinate missing hemodynamic features by sampling from the learned latent distribution. Extensive experiments on a large-scale private abdominal CT dataset (2,282 cases) and two public datasets demonstrate that TARDis significantly outperforms state-of-the-art incomplete modality frameworks. 
Notably, our method maintains robust diagnostic performance even in extreme data-sparsity scenarios, highlighting its potential for reducing radiation exposure while maintaining diagnostic precision.

[95] Infrared UAV Target Tracking with Dynamic Feature Refinement and Global Contextual Attention Knowledge Distillation

Houzhang Fang,Chenxing Wu,Kun Bai,Tianqi Chen,Xiaolin Wang,Xiyang Liu,Yi Chang,Luxin Yan

Main category: cs.CV

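TL;DR: This paper proposes SiamDFF, a new method for tracking UAV targets in thermal infrared imagery that uses dynamic feature fusion and knowledge distillation to achieve efficient, accurate tracking against complex backgrounds.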

Details Motivation: UAV targets in thermal infrared images have weak features and complex backgrounds, which makes tracking difficult; existing trackers need better focus on the target region and stronger feature representation. Method: The SiamDFF network comprises a selective target enhancement network (STEN), a dynamic spatial feature aggregation module (DSFAM), and a dynamic channel feature aggregation module (DCFAM), plus a tracking-specific contextual attention knowledge distillation mechanism that strengthens the student network's perception of key regions at multiple levels. Result: Experiments on real infrared UAV datasets show that the method outperforms state-of-the-art trackers under complex backgrounds and runs at real-time speed. Conclusion: Through dynamic feature fusion and target-aware knowledge distillation, SiamDFF improves the accuracy and robustness of infrared UAV target tracking and suits real-time use in anti-UAV systems. Abstract: Unmanned aerial vehicle (UAV) target tracking based on thermal infrared imaging has been one of the most important sensing technologies in anti-UAV applications. However, the infrared UAV targets often exhibit weak features and complex backgrounds, posing significant challenges to accurate tracking. To address these problems, we introduce SiamDFF, a novel dynamic feature fusion Siamese network that integrates feature enhancement and global contextual attention knowledge distillation for infrared UAV target (IRUT) tracking. The SiamDFF incorporates a selective target enhancement network (STEN), a dynamic spatial feature aggregation module (DSFAM), and a dynamic channel feature aggregation module (DCFAM). The STEN employs intensity-aware multi-head cross-attention to adaptively enhance important regions for both template and search branches. The DSFAM enhances multi-scale UAV target features by integrating local details with global features, utilizing spatial attention guidance within the search frame. The DCFAM effectively integrates the mixed template generated from STEN in the template branch and original template, avoiding excessive background interference with the template and thereby enhancing the emphasis on UAV target region features within the search frame. Furthermore, to enhance the feature extraction capabilities of the network for IRUT without adding extra computational burden, we propose a novel tracking-specific target-aware contextual attention knowledge distiller. 
It transfers the target prior from the teacher network to the student model, significantly improving the student network's focus on informative regions at each hierarchical level of the backbone network. Extensive experiments on real infrared UAV datasets demonstrate that the proposed approach outperforms state-of-the-art target trackers under complex backgrounds while achieving a real-time tracking speed.

[96] SAM3-I: Segment Anything with Instructions

Jingjing Li,Yue Feng,Yuchen Guo,Jincai Huang,Yongri Piao,Qi Bi,Miao Zhang,Xiaoqi Zhao,Qiang Chen,Shihao Zou,Wei Ji,Huchuan Lu,Li Cheng

Main category: cs.CV

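TL;DR: This paper presents SAM3-I, an enhanced framework that unifies concept-level understanding with instruction-level reasoning. An instruction-aware cascaded adaptation mechanism lets SAM3 segment directly from natural-language instructions while keeping its original concept-driven capabilities.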

Details Motivation: SAM3 currently supports only short noun-phrase prompts and struggles with real-world requests involving attributes, spatial relations, actions, and other complex expressions; it relies on external multimodal agents to convert complex instructions into noun phrases, which yields coarse results. Method: SAM3-I introduces an instruction-aware cascaded adaptation mechanism that progressively aligns natural-language instruction semantics with SAM3's vision-language representations; the authors also design a structured instruction taxonomy spanning concept, simple, and complex levels and build a diverse dataset of instruction-mask pairs. Result: Experiments show that SAM3-I follows natural-language instructions well while preserving SAM3's strong concept segmentation ability. Conclusion: SAM3-I extends SAM3 to natural-language instruction segmentation, enabling fine-grained, context-aware instance segmentation, and the open-source code supports follow-up research and applications. Abstract: Segment Anything Model 3 (SAM3) has advanced open-vocabulary segmentation through promptable concept segmentation, allowing users to segment all instances corresponding to a given concept, typically specified with short noun-phrase (NP) prompts. While this marks the first integration of language-level concepts within the SAM family, real-world usage typically requires far richer expressions that include attributes, spatial relations, functionalities, actions, states, and even implicit reasoning over instances. Currently, SAM3 relies on external multi-modal agents to convert complex instructions into NPs and then conduct iterative mask filtering. However, these NP-level concepts remain overly coarse, often failing to precisely represent a specific instance. In this work, we present SAM3-I, an enhanced framework that unifies concept-level understanding and instruction-level reasoning within the SAM family. SAM3-I introduces an instruction-aware cascaded adaptation mechanism that progressively aligns expressive instruction semantics with SAM3's existing vision-language representations, enabling direct instruction-following segmentation without sacrificing its original concept-driven capabilities. Furthermore, we design a structured instruction taxonomy spanning concept, simple, and complex levels, and develop a scalable data engine to construct a dataset with diverse instruction-mask pairs. Experiments show that SAM3-I delivers appealing performance, demonstrating that SAM3 can be effectively extended to follow natural-language instructions while preserving its strong concept grounding. 
We open-source SAM3-I and provide practical fine-tuning workflows, enabling researchers to adapt it to domain-specific applications. The source code is available here.

[97] When Robots Should Say "I Don't Know": Benchmarking Abstention in Embodied Question Answering

Tao Wu,Chuhao Zhou,Guangyu Zhao,Haozhi Cao,Yewen Pu,Jianfei Yang

Main category: cs.CV

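TL;DR: This paper introduces AbstainEQA, a new dataset for evaluating whether embodied question answering (EQA) systems can decline to answer when information is insufficient. It shows that current models fall far short of humans at abstention, indicating that this ability is a key prerequisite for reliable embodied interaction.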

Details Motivation: Existing EQA benchmarks assume every question must be answered, ignoring that an agent should be able to judge when it cannot answer; this work aims to establish a more realistic evaluation standard and advance more reliable embodied intelligence. Method: From an analysis of 500 human queries, the authors derive five categories of situations that require abstention and extend OpenEQA accordingly to build AbstainEQA, which contains 1,636 abstention cases paired with their original questions, for evaluating models' abstention ability. Result: The best frontier model reaches only 42.79% abstention recall on AbstainEQA, far below the 91.17% achieved by humans; scaling, prompt engineering, and reasoning methods bring only marginal gains, and fine-tuned models tend to overfit to textual cues. Conclusion: Abstention is a basic prerequisite for reliable interaction and effective clarification in embodied question answering, and current models have substantial room for improvement. Abstract: Embodied Question Answering (EQA) requires an agent to interpret language, perceive its environment, and navigate within 3D scenes to produce responses. Existing EQA benchmarks assume that every question must be answered, but embodied agents should know when they do not have sufficient information to answer. In this work, we focus on a minimal requirement for EQA agents, abstention: knowing when to withhold an answer. From an initial study of 500 human queries, we find that 32.4% contain missing or underspecified context. Drawing on this initial study and cognitive theories of human communication errors, we derive five representative categories requiring abstention: actionability limitation, referential underspecification, preference dependence, information unavailability, and false presupposition. We augment OpenEQA by having annotators transform well-posed questions into ambiguous variants outlined by these categories. The resulting dataset, AbstainEQA, comprises 1,636 annotated abstention cases paired with 1,636 original OpenEQA instances for balanced evaluation. Evaluating on AbstainEQA, we find that even the best frontier model only attains 42.79% abstention recall, while humans achieve 91.17%. We also find that scaling, prompting, and reasoning only yield marginal gains, and that fine-tuned models overfit to textual cues. Together, these results position abstention as a fundamental prerequisite for reliable interaction in embodied settings and as a necessary basis for effective clarification.
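The abstention recall reported above can be illustrated with a small sketch. It assumes the usual definition of recall (the share of abstention-required cases on which the model actually abstained); the function names and the toy data are hypothetical and are not taken from the AbstainEQA release.

```python
def abstention_recall(should_abstain, did_abstain):
    """Share of abstention-required cases on which the model abstained."""
    if len(should_abstain) != len(did_abstain):
        raise ValueError("label and prediction lists must have equal length")
    required = sum(1 for need in should_abstain if need)
    if required == 0:
        raise ValueError("no abstention-required cases to score")
    hits = sum(1 for need, did in zip(should_abstain, did_abstain) if need and did)
    return hits / required


def false_abstention_rate(should_abstain, did_abstain):
    """Share of answerable cases on which the model abstained anyway."""
    answerable = sum(1 for need in should_abstain if not need)
    if answerable == 0:
        raise ValueError("no answerable cases to score")
    wrong = sum(1 for need, did in zip(should_abstain, did_abstain) if did and not need)
    return wrong / answerable


# Toy balanced set: 4 ambiguous variants followed by 4 well-posed originals.
should_abstain = [True, True, True, True, False, False, False, False]
did_abstain = [True, False, False, True, False, True, False, False]

recall = abstention_recall(should_abstain, did_abstain)
over_abstain = false_abstention_rate(should_abstain, did_abstain)
print(recall, over_abstain)
```

Pairing each abstention case with its original question, as the dataset does, is what makes the second number meaningful: a model that abstains on everything would score perfect recall but also a false abstention rate of 1.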

[98] Malicious Image Analysis via Vision-Language Segmentation Fusion: Detection, Element, and Location in One-shot

Sheng Hang,Chaoxiang He,Hongsheng Hu,Hanqing Hu,Bin Benjamin Zhu,Shi-Feng Sun,Dawu Gu,Shuo Wang

Main category: cs.CV

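TL;DR: Proposes a zero-shot pipeline that simultaneously detects, identifies, and localizes harmful content in images. It is accurate and robust, produces pixel-level masks, and is suited to practical malicious-image moderation.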

Details Motivation: Conventional NSFW image flags do not meet the needs of content moderation, which requires knowing exactly which objects make an image illegal and where they are. Existing methods fall short in fine-grained identification, localization, and resistance to attacks. Method: A foundation segmentation model (SAM) generates candidate object masks, which are merged into independent regions; a vision-language model (VLM) scores each region for malicious relevance using open-vocabulary prompts; the scores drive a fusion step that produces a consolidated malicious object map; an ensemble of multiple segmenters improves robustness to adaptive attacks. Result: On a newly annotated dataset of 790 images, the method reaches 85.8% element-level recall, 78.1% precision, and a 92.1% segment-success rate; recall is 27.4% higher than direct zero-shot VLM localization; under PGD adversarial perturbations, precision and recall drop by no more than 10%; processing one image takes seconds. Conclusion: The method is the first practical tool for fine-grained, explainable malicious-image moderation, combining strong performance with robustness, and it integrates seamlessly into existing VLM workflows. Abstract: Detecting illicit visual content demands more than image-level NSFW flags; moderators must also know what objects make an image illegal and where those objects occur. We introduce a zero-shot pipeline that simultaneously (i) detects if an image contains harmful content, (ii) identifies each critical element involved, and (iii) localizes those elements with pixel-accurate masks - all in one pass. The system first applies a foundation segmentation model (SAM) to generate candidate object masks and refines them into larger independent regions. Each region is scored for malicious relevance by a vision-language model using open-vocabulary prompts; these scores weight a fusion step that produces a consolidated malicious object map. An ensemble across multiple segmenters hardens the pipeline against adaptive attacks that target any single segmentation method. Evaluated on a newly-annotated 790-image dataset spanning drug, sexual, violent and extremist content, our method attains 85.8% element-level recall, 78.1% precision and a 92.1% segment-success rate - exceeding direct zero-shot VLM localization by 27.4% recall at comparable precision. Against PGD adversarial perturbations crafted to break SAM and VLM, our method's precision and recall decreased by no more than 10%, demonstrating high robustness against attacks. The full pipeline processes an image in seconds, plugs seamlessly into existing VLM workflows, and constitutes the first practical tool for fine-grained, explainable malicious-image moderation.
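The score-weighted fusion step can be pictured with a minimal sketch. The abstract does not give the fusion rule, so the per-pixel maximum used within one segmenter and the plain average used across segmenters are both assumptions, as are the function names and the tiny 2x2 masks.

```python
def fuse_region_scores(masks, scores):
    """Build a per-pixel map where each pixel takes the highest score
    of any region mask covering it (assumed fusion rule)."""
    height, width = len(masks[0]), len(masks[0][0])
    fused = [[0.0] * width for _ in range(height)]
    for mask, score in zip(masks, scores):
        for i in range(height):
            for j in range(width):
                if mask[i][j]:
                    fused[i][j] = max(fused[i][j], score)
    return fused


def ensemble_maps(maps):
    """Average the fused maps produced from several segmenters (assumed rule)."""
    count = len(maps)
    height, width = len(maps[0]), len(maps[0][0])
    return [
        [sum(m[i][j] for m in maps) / count for j in range(width)]
        for i in range(height)
    ]


# Two candidate regions with hypothetical VLM relevance scores in [0, 1].
mask_a = [[1, 1], [0, 0]]
mask_b = [[0, 1], [0, 1]]
fused = fuse_region_scores([mask_a, mask_b], [0.9, 0.2])
print(fused)
```

Thresholding the resulting map would give both the image-level decision (any pixel above threshold) and the pixel-level localization from a single structure.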

[99] Denoise to Track: Harnessing Video Diffusion Priors for Robust Correspondence

Tianyu Yuan,Yuanbo Yang,Lin-Zhuo Chen,Yao Yao,Zhuzhong Qian

Main category: cs.CV

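TL;DR: This paper presents HeFT (Head-Frequency Tracker), a zero-shot point tracking framework that exploits the visual priors of pretrained video diffusion models. Based on an analysis of VDiT's internal representations, it proposes a head- and frequency-aware feature selection strategy that markedly improves zero-shot tracking performance.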

Details Motivation: To understand how video diffusion models encode spatiotemporal information and to tap their potential for zero-shot point tracking, the authors explore and exploit the internal mechanisms of pretrained video diffusion models to improve tracking performance. Method: Analysis of attention-head behavior in the Video Diffusion Transformer (VDiT) shows that different heads specialize in matching, semantic understanding, and positional encoding, and that low-frequency features are crucial for establishing correspondences; building on this, the method jointly selects the most informative attention head and low-frequency components, and estimates correspondences by combining single-step denoising, feature selection, soft-argmax localization, and forward-backward consistency checks. Result: Extensive experiments on the TAP-Vid benchmarks show that HeFT achieves state-of-the-art zero-shot tracking performance, approaching the accuracy of supervised methods without any annotated training data. Conclusion: HeFT confirms the potential of video diffusion models as powerful visual foundation models and offers a new route toward unified visual foundation models. Abstract: In this work, we introduce HeFT (Head-Frequency Tracker), a zero-shot point tracking framework that leverages the visual priors of pretrained video diffusion models. To better understand how they encode spatiotemporal information, we analyze the internal representations of the Video Diffusion Transformer (VDiT). Our analysis reveals that attention heads act as minimal functional units with distinct specializations for matching, semantic understanding, and positional encoding. Additionally, we find that the low-frequency components in VDiT features are crucial for establishing correspondences, whereas the high-frequency components tend to introduce noise. Building on these insights, we propose a head- and frequency-aware feature selection strategy that jointly selects the most informative attention head and low-frequency components to enhance tracking performance. Specifically, our method extracts discriminative features through single-step denoising, applies feature selection, and employs soft-argmax localization with forward-backward consistency checks for correspondence estimation. Extensive experiments on TAP-Vid benchmarks demonstrate that HeFT achieves state-of-the-art zero-shot tracking performance, approaching the accuracy of supervised methods while eliminating the need for annotated training data. Our work further underscores the promise of video diffusion models as powerful foundation models for a wide range of downstream tasks, paving the way toward unified visual foundation models.
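The last two steps named in the abstract, soft-argmax localization and the forward-backward consistency check, are standard techniques and can be sketched generically. This is not HeFT's code: the similarity map, the temperature, and the one-pixel tolerance are illustrative assumptions.

```python
import math


def soft_argmax(score_map, temperature=0.1):
    """Return the expected (x, y) position under a softmax over a 2D score map."""
    peak = max(max(row) for row in score_map)  # subtract the peak for numerical stability
    weights = [[math.exp((s - peak) / temperature) for s in row] for row in score_map]
    total = sum(sum(row) for row in weights)
    x = sum(w * j for row in weights for j, w in enumerate(row)) / total
    y = sum(w * i for i, row in enumerate(weights) for w in row) / total
    return x, y


def is_forward_backward_consistent(start, back_tracked, tol=1.0):
    """Accept a match only if tracking it back lands within tol pixels of the start."""
    return math.hypot(start[0] - back_tracked[0], start[1] - back_tracked[1]) <= tol


# Similarity between one query feature and a 3x3 target frame, peaked at x=2, y=1.
similarity = [
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],
]
x, y = soft_argmax(similarity)
print(round(x, 3), round(y, 3))
print(is_forward_backward_consistent((2.0, 1.0), (2.3, 1.4)))
```

Unlike a hard argmax, the soft version yields sub-pixel coordinates, and the backward check filters out points whose match is not reproducible in the reverse direction, which is typical of occluded points.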

[100] I2I-Bench: A Comprehensive Benchmark Suite for Image-to-Image Editing Models

Juntong Wang,Jiarui Wang,Huiyu Duan,Jiaxiang Kang,Guangtao Zhai,Xiongkuo Min

Main category: cs.CV

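TL;DR: Proposes I2I-Bench, a comprehensive benchmark for image-to-image editing models that covers 10 diverse task categories and 30 fine-grained evaluation dimensions. It uses automated hybrid evaluation combining specialized tools and large multimodal models, and its consistency with human preferences is validated.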

Details Motivation: Existing image editing benchmarks have limited task scope, insufficient evaluation dimensions, and a reliance on manual annotation, which limits their scalability and practicality; a more comprehensive, automated evaluation scheme is therefore needed. Method: I2I-Bench includes diverse single-image and multi-image editing tasks, defines 30 decoupled evaluation dimensions, combines specialized tools with large multimodal models (LMMs) for automated hybrid evaluation, and verifies the reliability of its results through human preference alignment experiments. Result: The benchmark is used to evaluate multiple mainstream image editing models, revealing performance gaps and trade-offs across editing tasks and evaluation dimensions; the evaluation results are highly consistent with human preferences. Conclusion: I2I-Bench is a scalable, automated, multi-dimensional image editing benchmark that can support the development of future image editing models; all components will be open-sourced to facilitate research. Abstract: Image editing models are advancing rapidly, yet comprehensive evaluation remains a significant challenge. Existing image editing benchmarks generally suffer from limited task scopes, insufficient evaluation dimensions, and heavy reliance on manual annotations, which significantly constrain their scalability and practical applicability. To address this, we propose I2I-Bench, a comprehensive benchmark for image-to-image editing models, which features (i) diverse tasks, encompassing 10 task categories across both single-image and multi-image editing tasks, (ii) comprehensive evaluation dimensions, including 30 decoupled and fine-grained evaluation dimensions with automated hybrid evaluation methods that integrate specialized tools and large multimodal models (LMMs), and (iii) rigorous alignment validation, justifying the consistency between our benchmark evaluations and human preferences. Using I2I-Bench, we benchmark numerous mainstream image editing models, investigating the gaps and trade-offs between editing models across various dimensions. We will open-source all components of I2I-Bench to facilitate future research.

[101] Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length

Yubo Huang,Hailong Guo,Fangtai Wu,Shifeng Zhang,Shijie Huang,Qijun Gan,Lin Liu,Sirui Zhao,Enhong Chen,Jiaming Liu,Steven Hoi

Main category: cs.CV

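TL;DR: Live Avatar proposes an algorithm-system co-design framework. Using Timestep-forcing Pipeline Parallelism (TPP) and the Rolling Sink Frame Mechanism (RSFM), it achieves real-time, high-fidelity, infinite-length avatar generation with a 14-billion-parameter diffusion model, reported as the first efficient deployment of this kind in industrial long-form video synthesis.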

Details Motivation: Existing diffusion-based video generation methods are constrained by sequential computation and long-sequence inconsistency, and struggle to meet the needs of real-time, streaming audio-driven avatar generation. Method: Timestep-forcing Pipeline Parallelism (TPP) pipelines denoising steps across GPUs; the Rolling Sink Frame Mechanism (RSFM) dynamically recalibrates appearance to maintain temporal consistency; Self-Forcing Distribution Matching Distillation enables causal, streamable adaptation of large models. Result: The system reaches 20 FPS end-to-end generation on 5 H800 GPUs, clearly outperforming existing methods, and supports infinite-length, high-fidelity video stream generation. Conclusion: Live Avatar is the first practical deployment of a large-scale diffusion model for real-time avatar generation and establishes a new paradigm for industrial long-form video synthesis. Abstract: Existing diffusion-based video generation methods are fundamentally constrained by sequential computation and long-horizon inconsistency, limiting their practical adoption in real-time, streaming audio-driven avatar synthesis. We present Live Avatar, an algorithm-system co-designed framework that enables efficient, high-fidelity, and infinite-length avatar generation using a 14-billion-parameter diffusion model. Our approach introduces Timestep-forcing Pipeline Parallelism (TPP), a distributed inference paradigm that pipelines denoising steps across multiple GPUs, effectively breaking the autoregressive bottleneck and ensuring stable, low-latency real-time streaming. To further enhance temporal consistency and mitigate identity drift and color artifacts, we propose the Rolling Sink Frame Mechanism (RSFM), which maintains sequence fidelity by dynamically recalibrating appearance using a cached reference image. Additionally, we leverage Self-Forcing Distribution Matching Distillation to facilitate causal, streamable adaptation of large-scale models without sacrificing visual quality. Live Avatar demonstrates state-of-the-art performance, reaching 20 FPS end-to-end generation on 5 H800 GPUs, and, to the best of our knowledge, is the first to achieve practical, real-time, high-fidelity avatar generation at this scale. Our work establishes a new paradigm for deploying advanced diffusion models in industrial long-form video synthesis applications.

[102] Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation

Yunhong Lu,Yanhong Zeng,Haobo Li,Hao Ouyang,Qiuyu Wang,Ka Leong Cheng,Jiapeng Zhu,Hengyuan Cao,Zhipeng Zhang,Xing Zhu,Yujun Shen,Min Zhang

Main category: cs.CV

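TL;DR: This paper proposes Reward Forcing, a new framework for efficient streaming video generation built on two key techniques, EMA-Sink and Re-DMD, which markedly improves motion quality while maintaining long-horizon consistency.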

Details Motivation: Existing video diffusion models based on sliding window attention rely on static sink tokens, which leads to copied initial frames and degraded dynamics; the tension between inter-frame dependence and dynamic expressiveness needs to be resolved. Method: EMA-Sink continuously fuses tokens leaving the window via an exponential moving average to update fixed-size sink tokens; Rewarded Distribution Matching Distillation (Re-DMD) uses a vision-language model to rate the degree of dynamics and prioritizes learning from highly dynamic samples. Result: The method achieves state-of-the-art performance on standard benchmarks and delivers high-quality streaming video generation at 23.1 FPS on a single H100 GPU, effectively suppressing initial frame copying and improving motion coherence. Conclusion: Through dynamically updated sink tokens and reward-guided distillation, Reward Forcing addresses error accumulation and dynamics degradation in long-sequence video generation and offers an effective solution for efficient streaming video generation. Abstract: Efficient streaming video generation is critical for simulating interactive and dynamic worlds. Existing methods distill few-step video diffusion models with sliding window attention, using initial frames as sink tokens to maintain attention performance and reduce error accumulation. However, video frames become overly dependent on these static tokens, resulting in copied initial frames and diminished motion dynamics. To address this, we introduce Reward Forcing, a novel framework with two key designs. First, we propose EMA-Sink, which maintains fixed-size tokens initialized from initial frames and continuously updated by fusing evicted tokens via exponential moving average as they exit the sliding window. Without additional computation cost, EMA-Sink tokens capture both long-term context and recent dynamics, preventing initial frame copying while maintaining long-horizon consistency. Second, to better distill motion dynamics from teacher models, we propose a novel Rewarded Distribution Matching Distillation (Re-DMD). Vanilla distribution matching treats every training sample equally, limiting the model's ability to prioritize dynamic content. Instead, Re-DMD biases the model's output distribution toward high-reward regions by prioritizing samples with greater dynamics rated by a vision-language model. Re-DMD significantly enhances motion quality while preserving data fidelity. 
We include both quantitative and qualitative experiments to show that Reward Forcing achieves state-of-the-art performance on standard benchmarks while enabling high-quality streaming video generation at 23.1 FPS on a single H100 GPU.
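The EMA-Sink idea described above, a fixed-size sink that absorbs tokens as they leave the sliding window, can be sketched with a plain exponential moving average. The abstract does not state the exact fusion rule, so the update formula, the decay value, and the use of a single feature vector per token are assumptions made for illustration.

```python
def ema_sink_update(sink, evicted, decay=0.9):
    """Fuse one evicted token into the sink token with an exponential moving average."""
    if len(sink) != len(evicted):
        raise ValueError("sink and evicted token must have the same dimension")
    return [decay * s + (1.0 - decay) * e for s, e in zip(sink, evicted)]


# The sink starts as an initial-frame token; later frames look different.
sink = [1.0, 0.0]
for evicted in ([0.0, 1.0], [0.0, 1.0], [0.0, 1.0]):
    sink = ema_sink_update(sink, evicted, decay=0.5)
print(sink)
```

The sink keeps its size no matter how many tokens are evicted, and the initial-frame content decays geometrically instead of staying fixed, which is the property the abstract credits with preventing initial-frame copying.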

[103] Towards Cross-View Point Correspondence in Vision-Language Models

Yipu Wang,Yuheng Ji,Yuyang Liu,Enshen Zhou,Ziqiang Yang,Yuxuan Tian,Ziheng Qin,Yue Liu,Huajie Tan,Cheng Chi,Zhiyuan Ma,Daniel Dajun Zeng,Xiaolong Zheng

Main category: cs.CV

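TL;DR: This paper proposes the Cross-View Point Correspondence (CVPC) task and the CrossPoint-Bench benchmark, builds the CrossPoint-378K dataset of 378K question-answering pairs, and presents the CroPond model, which substantially outperforms existing VLMs on cross-view correspondence.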

Details Motivation: Existing vision-language models are weak at precise cross-view point-level correspondence, which makes it hard for them to support spatial understanding and embodied AI applications that require fine-grained manipulation. Method: The authors propose the CVPC task and the hierarchically designed CrossPoint-Bench benchmark, build the large-scale CrossPoint-378K dataset, and train the CroPond model on it to improve point-level correspondence. Result: Evaluation shows that state-of-the-art models (e.g., Gemini-2.5-Pro) trail humans by over 54.65% in accuracy; CroPond surpasses Gemini-2.5-Pro by 39.7% accuracy on CrossPoint-Bench. Conclusion: Cross-view point-level correspondence is a major challenge for current VLMs; dedicated datasets and model training can substantially improve performance, laying a foundation for fine-grained interaction in embodied AI. Abstract: Cross-view correspondence is a fundamental capability for spatial understanding and embodied AI. However, it is still far from being realized in Vision-Language Models (VLMs), especially in achieving precise point-level correspondence, which is crucial for precise affordance interaction. We therefore propose the Cross-View Point Correspondence (CVPC) task and CrossPoint-Bench, a comprehensive benchmark with hierarchical design, inspired by the human cognitive process of "perceive", "reason", and "correspond". Our evaluation shows that state-of-the-art models (e.g., Gemini-2.5-Pro) still fall far behind humans, with a gap of over 54.65% in overall accuracy, exposing a challenge in transitioning from coarse-grained judgement to fine-grained coordinate prediction. To address this problem, we construct CrossPoint-378K, a dataset with 378K question-answering pairs across 900 scenes, focused on actionable affordance regions that better reflect real-world manipulation and interaction scenarios. Furthermore, we propose CroPond, which is trained on the CrossPoint-378K dataset. CroPond achieves state-of-the-art performance on CrossPoint-Bench, surpassing Gemini-2.5-Pro by 39.7% accuracy, which offers a foundation for advancing future work on cross-view correspondence. The benchmark, dataset, and model are publicly available at https://github.com/WangYipu2002/CrossPoint.

[104] OmniScaleSR: Unleashing Scale-Controlled Diffusion Prior for Faithful and Realistic Arbitrary-Scale Image Super-Resolution

Xinning Chai,Zhengxue Cheng,Yuhong Zhang,Hengsheng Zhang,Yingsheng Qin,Yucai Yang,Rong Xie,Li Song

Main category: cs.CV

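TL;DR: This paper proposes OmniScaleSR, a diffusion-based framework for high-quality, highly realistic arbitrary-scale image super-resolution (ASSR). Explicit scale control mechanisms that are native to the diffusion process, combined with implicit scale adaptation, improve reconstruction accuracy and visual realism at large magnification factors.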

Details Motivation: Most existing arbitrary-scale super-resolution methods rely on implicit neural representation (INR) and struggle to generate fine details; diffusion-based Real-ISR performs well at fixed scales but lacks explicit control at arbitrary scales, causing excessive hallucination or blur at high magnification. A method is needed that maintains high realism while precisely controlling the magnification scale. Method: OmniScaleSR introduces diffusion-native explicit scale control for scale-aware and content-aware modulation of the diffusion process, and adds multi-domain fidelity enhancement designs to improve reconstruction accuracy. It leverages pretrained diffusion priors and combines implicit scale adaptation with explicit control. Result: On bicubic degradation benchmarks and real-world datasets, OmniScaleSR outperforms state-of-the-art methods in both fidelity and perceptual quality, especially at large magnification factors. Conclusion: Through the synergy of explicit and implicit scale control, OmniScaleSR achieves high-fidelity, highly realistic arbitrary-scale super-resolution and markedly improves the performance and stability of diffusion models on ASSR. Abstract: Arbitrary-scale super-resolution (ASSR) overcomes the limitation of traditional super-resolution (SR) methods that operate only at fixed scales (e.g., 4x), enabling a single model to handle arbitrary magnification. Most existing ASSR approaches rely on implicit neural representation (INR), but its regression-driven feature extraction and aggregation intrinsically limit the ability to synthesize fine details, leading to low realism. Recent diffusion-based realistic image super-resolution (Real-ISR) models leverage powerful pre-trained diffusion priors and show impressive results at the 4x setting. We observe that they can also achieve ASSR because the diffusion prior implicitly adapts to scale by encouraging high-realism generation. However, without explicit scale control, the diffusion process cannot be properly adjusted for different magnification levels, resulting in excessive hallucination or blurry outputs, especially under ultra-high scales. To address these issues, we propose OmniScaleSR, a diffusion-based realistic arbitrary-scale SR framework designed to achieve both high fidelity and high realism. We introduce explicit, diffusion-native scale control mechanisms that work synergistically with implicit scale adaptation, enabling scale-aware and content-aware modulation of the diffusion process. In addition, we incorporate multi-domain fidelity enhancement designs to further improve reconstruction accuracy. 
Extensive experiments on bicubic degradation benchmarks and real-world datasets show that OmniScaleSR surpasses state-of-the-art methods in both fidelity and perceptual realism, with particularly strong performance at large magnification factors. Code will be released at https://github.com/chaixinning/OmniScaleSR.

[105] Measuring the Unspoken: A Disentanglement Model and Benchmark for Psychological Analysis in the Wild

Yigui Feng,Qinglin Wang,Haotian Mo,Yang Liu,Ke Liu,Gencheng Liu,Xinhai Chen,Siqi Shen,Songzhu Mei,Jie Liu

Main category: cs.CV

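TL;DR: This paper proposes a new ecosystem to address two core challenges of generative psychological analysis in natural conversations: visual-affective ambiguity and the lack of verifiable evaluation metrics. The main contributions are the MIND model, the ConvoInsight-DB dataset, and the PRISM evaluation framework; MIND substantially outperforms existing methods on micro-expression detection.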

Details Motivation: Existing vision-language models struggle to distinguish mouth movements made while speaking from genuine emotional expressions (Articulatory-Affective Ambiguity), and effective metrics for assessing visual grounding and reasoning depth are lacking, which limits psychological analysis in real-world settings. Method: The Multilevel Insight Network for Disentanglement (MIND) introduces a Status Judgment module that suppresses ambiguous lip features based on their temporal feature variance; the authors construct ConvoInsight-DB, a large-scale annotated dataset with labels for micro-expressions and deep psychological inference; and they design PRISM, an automated evaluation metric based on an expert-guided large language model, for multidimensional assessment of mental vision models. Result: On the PRISM benchmark, MIND achieves a +86.95% gain in micro-expression detection over the prior state of the art; ablation studies show the Status Judgment module is the key factor in the improvement. Conclusion: Through coordinated innovation in model, dataset, and evaluation, the work addresses key obstacles in visual psychological analysis and advances explainable, verifiable deep psychological understanding in natural conversations. Abstract: Generative psychological analysis of in-the-wild conversations faces two fundamental challenges: (1) existing Vision-Language Models (VLMs) fail to resolve Articulatory-Affective Ambiguity, where visual patterns of speech mimic emotional expressions; and (2) progress is stifled by a lack of verifiable evaluation metrics capable of assessing visual grounding and reasoning depth. We propose a complete ecosystem to address these twin challenges. First, we introduce Multilevel Insight Network for Disentanglement (MIND), a novel hierarchical visual encoder that introduces a Status Judgment module to algorithmically suppress ambiguous lip features based on their temporal feature variance, achieving explicit visual disentanglement. Second, we construct ConvoInsight-DB, a new large-scale dataset with expert annotations for micro-expressions and deep psychological inference. Third, we design the Mental Reasoning Insight Rating Metric (PRISM), an automated dimensional framework that uses an expert-guided LLM to measure the multidimensional performance of large mental vision models. On our PRISM benchmark, MIND significantly outperforms all baselines, achieving a +86.95% gain in micro-expression detection over prior SOTA. Ablation studies confirm that our Status Judgment disentanglement module is the most critical component for this performance leap. Our code is publicly available.

[106] E3AD: An Emotion-Aware Vision-Language-Action Model for Human-Centric End-to-End Autonomous Driving

Yihong Tang,Haicheng Liao,Tong Nie,Junlin He,Ao Qu,Kehua Chen,Wei Ma,Zhenning Li,Lijun Sun,Chengzhong Xu

Main category: cs.CV

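TL;DR: This paper presents E3AD, an emotion-aware vision-language-action (VLA) framework for Open-Domain End-to-End (OD-E2E) autonomous driving. By introducing a continuous VAD emotion model and a dual-pathway spatial reasoning module, it links understanding of the passenger's emotion to driving behavior and improves visual grounding, trajectory planning, and emotional consistency.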

Details Motivation: Most existing end-to-end autonomous driving systems ignore the passenger's emotional state, which is critical to ride comfort and system acceptance; an autonomous driving framework that understands natural-language commands and perceives emotion is therefore needed. Method: The E3AD framework 1) uses a Valence-Arousal-Dominance (VAD) model to extract emotional state from language; 2) uses a dual-pathway spatial reasoning module that fuses egocentric and allocentric views for spatial cognition; and 3) adopts a consistency-oriented training strategy that combines modality pretraining with preference alignment to keep emotional intent and driving behavior consistent. Result: On real-world datasets, E3AD improves visual grounding and waypoint planning and reaches state-of-the-art VAD correlation for emotion estimation, supporting its effectiveness in coordinating emotion perception with driving behavior. Conclusion: Integrating emotion perception into VLA-style autonomous driving makes the system more human-like, strengthens human-machine alignment, and promotes more humane, comfortable, and acceptable autonomous driving. Abstract: End-to-end autonomous driving (AD) systems increasingly adopt vision-language-action (VLA) models, yet they typically ignore the passenger's emotional state, which is central to comfort and AD acceptance. We introduce Open-Domain End-to-End (OD-E2E) autonomous driving, where an autonomous vehicle (AV) must interpret free-form natural-language commands, infer the emotion, and plan a physically feasible trajectory. We propose E3AD, an emotion-aware VLA framework that augments semantic understanding with two cognitively inspired components: a continuous Valence-Arousal-Dominance (VAD) emotion model that captures tone and urgency from language, and a dual-pathway spatial reasoning module that fuses egocentric and allocentric views for human-like spatial cognition. A consistency-oriented training scheme, combining modality pretraining with preference-based alignment, further enforces coherence between emotional intent and driving actions. Across real-world datasets, E3AD improves visual grounding and waypoint planning and achieves state-of-the-art (SOTA) VAD correlation for emotion estimation. These results show that injecting emotion into VLA-style driving yields more human-aligned grounding, planning, and human-centric feedback.

[107] MT-Depth: Multi-task Instance feature analysis for the Depth Completion

Abdul Haseeb Nizamani,Dandi Zhou,Xinhai Sun

Main category: cs.CV

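TL;DR: This paper proposes an instance-aware depth completion framework that uses binary instance masks as spatial priors to refine depth predictions. It achieves lower RMSE on the Virtual KITTI 2 dataset, with particularly strong results at object boundaries, occlusions, and thin structures.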

Details Motivation: Most existing depth completion methods rely on semantic segmentation and overlook the benefits of object-level understanding; this work explores how instance-level information can improve depth completion. Method: The framework comprises a frozen YOLO V11 instance segmentation branch, a U-Net depth completion backbone, a cross-attention fusion module, and an attention-guided prediction head; instance masks guide depth completion through cross-attention. Result: On the Virtual KITTI 2 dataset, the method achieves lower RMSE and competitive MAE compared with a U-Net-only baseline and previous semantic-guided methods, and improves depth accuracy at object boundaries and thin structures. Conclusion: Instance-aware cues effectively improve depth completion without relying on dense semantic labels, offering a new research direction for the field. Abstract: Depth completion plays a vital role in 3D perception systems, especially in scenarios where sparse depth data must be densified for tasks such as autonomous driving, robotics, and augmented reality. While many existing approaches rely on semantic segmentation to guide depth completion, they often overlook the benefits of object-level understanding. In this work, we introduce an instance-aware depth completion framework that explicitly integrates binary instance masks as spatial priors to refine depth predictions. Our model combines four main components: a frozen YOLO V11 instance segmentation branch, a U-Net-based depth completion backbone, a cross-attention fusion module, and an attention-guided prediction head. The instance segmentation branch generates per-image foreground masks that guide the depth branch via cross-attention, allowing the network to focus on object-centric regions during refinement. We validate our method on the Virtual KITTI 2 dataset, showing that it achieves lower RMSE compared to both a U-Net-only baseline and previous semantic-guided methods, while maintaining competitive MAE. Qualitative and quantitative results demonstrate that the proposed model effectively enhances depth accuracy near object boundaries, occlusions, and thin structures. Our findings suggest that incorporating instance-aware cues offers a promising direction for improving depth completion without relying on dense semantic labels.
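For readers comparing the two reported metrics, the following sketch computes RMSE and MAE over pixels that have ground truth. Treating a target depth of 0 as "no ground truth" is an assumed convention for this example, and the function name and values are illustrative only.

```python
import math


def depth_errors(predicted, target):
    """Return (RMSE, MAE) over pixels with valid ground truth (target > 0 assumed valid)."""
    diffs = [p - t for p, t in zip(predicted, target) if t > 0]
    if not diffs:
        raise ValueError("no valid ground-truth pixels")
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    mae = sum(abs(d) for d in diffs) / len(diffs)
    return rmse, mae


# Flattened toy depth maps in metres; the last pixel has no ground truth.
predicted = [10.0, 20.0, 30.0, 5.0]
target = [12.0, 20.0, 26.0, 0.0]
rmse, mae = depth_errors(predicted, target)
print(round(rmse, 3), mae)
```

Because RMSE squares each error, it penalizes a few large mistakes more than MAE does; a method can therefore lower RMSE by fixing large errors at boundaries and thin structures while its MAE stays merely competitive.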

[108] Order Matters: 3D Shape Generation from Sequential VR Sketches

Yizi Chen,Sidi Wu,Tianyi Xiao,Nina Wiedemann,Loic Landrieu

Main category: cs.CV

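TL;DR: This paper presents VRSketch2Shape, the first framework and multi-category dataset for generating 3D shapes from sequential VR sketches. Preserving stroke order improves geometric accuracy, and the method performs well on partial sketches.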

Details Motivation: Existing sketch-to-shape models ignore the temporal order of strokes and lose key cues about structure and design intent; a method that exploits temporal information is needed to improve 3D shape generation. Method: The authors propose an automated pipeline that generates sequential VR sketches, build a dataset of over 20k synthetic and 900 hand-drawn sketch-shape pairs, and design an order-aware sketch encoder with a diffusion-based 3D generator. Result: The method achieves higher geometric fidelity than prior work, generalizes well from synthetic to real sketches with minimal supervision, and performs well on partial sketches. Conclusion: By exploiting the temporal information in sketches, VRSketch2Shape improves 3D shape generation and provides a more efficient and intuitive tool for VR sketch-based design. Abstract: VR sketching lets users explore and iterate on ideas directly in 3D, offering a faster and more intuitive alternative to conventional CAD tools. However, existing sketch-to-shape models ignore the temporal ordering of strokes, discarding crucial cues about structure and design intent. We introduce VRSketch2Shape, the first framework and multi-category dataset for generating 3D shapes from sequential VR sketches. Our contributions are threefold: (i) an automated pipeline that generates sequential VR sketches from arbitrary shapes, (ii) a dataset of over 20k synthetic and 900 hand-drawn sketch-shape pairs across four categories, and (iii) an order-aware sketch encoder coupled with a diffusion-based 3D generator. Our approach yields higher geometric fidelity than prior work, generalizes effectively from synthetic to real sketches with minimal supervision, and performs well even on partial sketches. All data and models will be released open-source at https://chenyizi086.github.io/VRSketch2Shape_website.

[109] PaCo-RL: Advancing Reinforcement Learning for Consistent Image Generation with Pairwise Reward Modeling

Bowen Ping,Chengyou Jia,Minnan Luo,Changliang Xia,Xin Shen,Zhuohang Dang,Hangwei Qian

Main category: cs.CV

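TL;DR: This paper presents PaCo-RL, a framework that uses reinforcement learning to address identity, style, and logical consistency in consistent image generation. It comprises PaCo-Reward, a consistency evaluation model, and PaCo-GRPO, an efficient reinforcement learning algorithm, and achieves strong results in both performance and training efficiency.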

Details Motivation: Existing supervised methods lack large-scale consistency datasets and have difficulty modeling human perceptual preferences, so they struggle to generate visually consistent content across images. Method: The PaCo-RL framework comprises 1) PaCo-Reward, a pairwise consistency evaluator trained on a dataset built by automated sub-figure pairing, which uses a generative autoregressive scoring mechanism combined with task instructions and chain-of-thought reasoning; and 2) PaCo-GRPO, a reinforcement learning algorithm that uses a resolution-decoupled optimization strategy and a log-tamed multi-reward aggregation mechanism to improve training efficiency and stability. Result: Experiments show that PaCo-Reward significantly improves alignment with human perception of consistency, and PaCo-GRPO achieves state-of-the-art generation consistency on two representative subtasks with higher training efficiency and stability. Conclusion: PaCo-RL offers an effective, scalable solution for consistent image generation without large-scale annotated data and demonstrates the potential of reinforcement learning for modeling complex visual consistency. Abstract: Consistent image generation requires faithfully preserving identities, styles, and logical coherence across multiple images, which is essential for applications such as storytelling and character design. Supervised training approaches struggle with this task due to the lack of large-scale datasets capturing visual consistency and the complexity of modeling human perceptual preferences. In this paper, we argue that reinforcement learning (RL) offers a promising alternative by enabling models to learn complex and subjective visual criteria in a data-free manner. To achieve this, we introduce PaCo-RL, a comprehensive framework that combines a specialized consistency reward model with an efficient RL algorithm. The first component, PaCo-Reward, is a pairwise consistency evaluator trained on a large-scale dataset constructed via automated sub-figure pairing. It evaluates consistency through a generative, autoregressive scoring mechanism enhanced by task-aware instructions and CoT reasons. The second component, PaCo-GRPO, leverages a novel resolution-decoupled optimization strategy to substantially reduce RL cost, alongside a log-tamed multi-reward aggregation mechanism that ensures balanced and stable reward optimization. Extensive experiments across the two representative subtasks show that PaCo-Reward significantly improves alignment with human perceptions of visual consistency, and PaCo-GRPO achieves state-of-the-art consistency performance with improved training efficiency and stability. 
Together, these results highlight the promise of PaCo-RL as a practical and scalable solution for consistent image generation. The project page is available at https://x-gengroup.github.io/HomePage_PaCo-RL/.

[110] LaFiTe: A Generative Latent Field for 3D Native Texturing

Chia-Hao Chen,Zi-Xin Zou,Yan-Pei Cao,Ze Yuan,Guan Luo,Xiaojuan Qi,Ding Liang,Song-Hai Zhang,Yuan-Chen Guo

Main category: cs.CV

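TL;DR: This paper presents LaFiTe, a framework for 3D-native texture generation that learns a sparse latent color field to synthesize high-fidelity, seamless textures, substantially outperforming existing methods.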

Details Motivation: Existing 3D texture generation methods are limited by the lack of a powerful, versatile latent representation, which leads to insufficient texture fidelity and generalization. Method: LaFiTe uses a variational autoencoder (VAE) to encode surface appearance into a sparse, structured latent space and decode it into a continuous color field; a conditional rectified-flow model then performs high-quality texture synthesis. Result: In reconstruction, LaFiTe exceeds state-of-the-art methods by >10 dB PSNR and generates higher-fidelity, consistent textures across styles and geometries. Conclusion: LaFiTe sets a new benchmark for 3D-native texture generation and supports applications such as material synthesis and texture super-resolution, advancing next-generation 3D content creation. Abstract: Generating high-fidelity, seamless textures directly on 3D surfaces, what we term 3D-native texturing, remains a fundamental open challenge, with the potential to overcome long-standing limitations of UV-based and multi-view projection methods. However, existing native approaches are constrained by the absence of a powerful and versatile latent representation, which severely limits the fidelity and generality of their generated textures. We identify this representation gap as the principal barrier to further progress. We introduce LaFiTe, a framework that addresses this challenge by learning to generate textures as a 3D generative sparse latent color field. At its core, LaFiTe employs a variational autoencoder (VAE) to encode complex surface appearance into a sparse, structured latent space, which is subsequently decoded into a continuous color field. This representation achieves unprecedented fidelity, exceeding state-of-the-art methods by >10 dB PSNR in reconstruction, by effectively disentangling texture appearance from mesh topology and UV parameterization. Building upon this strong representation, a conditional rectified-flow model synthesizes high-quality, coherent textures across diverse styles and geometries. Extensive experiments demonstrate that LaFiTe not only sets a new benchmark for 3D-native texturing but also enables flexible downstream applications such as material synthesis and texture super-resolution, paving the way for the next generation of 3D content creation workflows.

[111] EMMA: Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture

Xin He,Longhui Wei,Jianbo Ouyang,Lingxi Xie,Qi Tian

Main category: cs.CV

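TL;DR: This paper proposes EMMA, an efficient, unified architecture for multimodal understanding, generation, and editing. Using high-ratio compression, channel-wise concatenation, a shared-and-decoupled network, and a mixture-of-experts mechanism, it outperforms existing methods in both efficiency and performance.

Details Motivation: Existing multimodal models suffer from inefficiency and training imbalance between understanding and generation tasks, and an excess of visual tokens inflates compute cost, so a more efficient unified architecture is needed. Method: 1) An efficient autoencoder with a 32x compression ratio reduces the number of tokens needed for generation; 2) channel-wise rather than token-wise concatenation further reduces the number of visual tokens; 3) a shared-and-decoupled network enables mutual improvement across tasks while meeting task-specific requirements; 4) a mixture-of-experts mechanism in the visual understanding encoder improves perception. Result: EMMA-4B significantly outperforms state-of-the-art unified multimodal models (e.g., BAGEL-7B) in efficiency and performance, and is competitive with specialized models (e.g., Qwen3-VL and Qwen-Image) on multimodal understanding and generation. Conclusion: EMMA lays a solid foundation for future unified multimodal architectures, combining efficiency with strong performance. Abstract: We propose EMMA, an efficient and unified architecture for multimodal understanding, generation and editing. Specifically, EMMA primarily consists of 1) An efficient autoencoder with a 32x compression ratio, which significantly reduces the number of tokens required for generation. This also ensures the training balance between understanding and generation tasks by applying the same compression ratio to images. 2) Channel-wise concatenation instead of token-wise concatenation among visual understanding and generation tokens, which further reduces the visual tokens in unified architectures. 3) A shared-and-decoupled network that enables mutual improvements across tasks while meeting the task-specific modeling requirements. 4) A mixture-of-experts mechanism adopted for the visual understanding encoder, which substantially improves perceptual capabilities with only a small increase in parameters. Extensive experiments have shown that EMMA-4B can significantly outperform state-of-the-art unified multimodal approaches (e.g., BAGEL-7B) in both efficiency and performance, while also achieving competitive results compared to recent multimodal understanding and generation experts (e.g., Qwen3-VL and Qwen-Image). We believe that EMMA lays a solid foundation for the future development of unified multimodal architectures.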
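The token savings described above can be illustrated with simple counting. The sketch below assumes the 32x ratio applies per spatial side (the abstract does not say) and represents tokens as plain Python lists; all function names are hypothetical.

```python
def num_tokens(height, width, downsample):
    """Number of latent tokens if each spatial side shrinks by `downsample`."""
    return (height // downsample) * (width // downsample)


def token_wise_concat(understanding, generation):
    """Sequence grows: N + N tokens, channel width unchanged."""
    return understanding + generation


def channel_wise_concat(understanding, generation):
    """Sequence length stays N: each token's channels are stacked instead."""
    if len(understanding) != len(generation):
        raise ValueError("channel-wise concatenation needs equal token counts")
    return [u + g for u, g in zip(understanding, generation)]
```

For a 1024x1024 image, this counting gives 1024 tokens at 32x per side versus 4096 at 16x, and channel-wise concatenation keeps the sequence at N tokens where token-wise concatenation would double it.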

[112] RobustSplat++: Decoupling Densification, Dynamics, and Illumination for In-the-Wild 3DGS

Chuanyu Fu,Guanying Chen,Yuqi Zhang,Kunbin Yao,Yuan Xiong,Chuan Huang,Shuguang Cui,Yasuyuki Matsushita,Xiaochun Cao

Main category: cs.CV

TL;DR: This paper proposes RobustSplat++, a method for robust 3D Gaussian Splatting in complex real-world scenes that improves robustness to transient objects and illumination changes through delayed Gaussian growth, cascaded mask guidance, and appearance modeling.

Details Motivation: Existing 3D Gaussian Splatting methods tend to produce rendering artifacts in real scenes containing transient objects and illumination changes, mainly because Gaussian densification overfits to dynamic disturbances. Method: Three improvements are proposed: 1) a delayed Gaussian growth strategy that prioritizes optimizing static structure; 2) a scale-cascaded mask bootstrapping approach that combines low-resolution feature-similarity supervision with refined high-resolution prediction; 3) joint optimization with appearance modeling. Result: Experiments on multiple challenging datasets show that the method clearly outperforms existing methods, reducing artifacts and improving rendering quality. Conclusion: Through these key designs, RobustSplat++ makes 3DGS more robust in complex real-world scenes and offers an effective solution for handling transient disturbances and illumination changes. Abstract: 3D Gaussian Splatting (3DGS) has gained significant attention for its real-time, photo-realistic rendering in novel-view synthesis and 3D modeling. However, existing methods struggle with accurately modeling in-the-wild scenes affected by transient objects and illuminations, leading to artifacts in the rendered images. We identify that the Gaussian densification process, while enhancing scene detail capture, unintentionally contributes to these artifacts by growing additional Gaussians that model transient disturbances and illumination variations. To address this, we propose RobustSplat++, a robust solution based on several critical designs. First, we introduce a delayed Gaussian growth strategy that prioritizes optimizing static scene structure before allowing Gaussian splitting/cloning, mitigating overfitting to transient objects in early optimization. Second, we design a scale-cascaded mask bootstrapping approach that first leverages lower-resolution feature similarity supervision for reliable initial transient mask estimation, taking advantage of its stronger semantic consistency and robustness to noise, and then progresses to high-resolution supervision to achieve more precise mask prediction. Third, we incorporate the delayed Gaussian growth strategy and mask bootstrapping with appearance modeling to handle in-the-wild scenes including transients and illuminations. Extensive experiments on multiple challenging datasets show that our method outperforms existing methods, clearly demonstrating the robustness and effectiveness of our method.

[113] LatentFM: A Latent Flow Matching Approach for Generative Medical Image Segmentation

Huynh Trinh Ngoc,Hoang Anh Nguyen Kim,Toan Nguyen Hai,Long Tran Quoc

Main category: cs.CV

TL;DR: Proposes LatentFM, a latent-space flow matching method for medical image segmentation that learns a conditional velocity field to generate diverse segmentations and provides uncertainty estimates and confidence maps.

Details Motivation: Inspired by the success of flow matching (FM) in generative modeling, the authors aim to extend its advantages to medical image segmentation to obtain accurate, uncertainty-aware results. Method: Two variational autoencoders (VAEs) encode medical images and their masks into a low-dimensional latent space, where a conditional velocity field guiding the flow is estimated; sampling multiple latent representations yields diverse segmentation outputs, and confidence maps quantify the model's certainty. Result: Experiments on ISIC-2018 and CVC-Clinic show that the method surpasses existing deterministic and generative methods in segmentation accuracy while remaining efficient in the latent space, and that it reliably captures the data distribution and provides useful uncertainty estimates. Conclusion: LatentFM efficiently produces accurate, diverse medical image segmentations in the latent space and supplies uncertainty information and confidence maps that can support clinical decision-making. Abstract: Generative models have achieved remarkable progress with the emergence of flow matching (FM). It has demonstrated strong generative capabilities and attracted significant attention as a simulation-free flow-based framework capable of learning exact data densities. Motivated by these advances, we propose LatentFM, a flow-based model operating in the latent space for medical image segmentation. To model the data distribution, we first design two variational autoencoders (VAEs) to encode both medical images and their corresponding masks into a lower-dimensional latent space. We then estimate a conditional velocity field that guides the flow based on the input image. By sampling multiple latent representations, our method synthesizes diverse segmentation outputs whose pixel-wise variance reliably captures the underlying data distribution, enabling both highly accurate and uncertainty-aware predictions. Furthermore, we generate confidence maps that quantify the model certainty, providing clinicians with richer information for deeper analysis. We conduct experiments on two datasets, ISIC-2018 and CVC-Clinic, and compare our method with several prior baselines, including both deterministic and generative models. Through comprehensive evaluations, both qualitative and quantitative results show that our approach achieves superior segmentation accuracy while remaining highly efficient in the latent space.
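The idea of turning several sampled segmentations into an uncertainty estimate can be sketched as follows. Masks are flattened lists of values in [0, 1]; the confidence definition (one minus variance normalized by 0.25, the largest variance possible for values in [0, 1]) is an assumption for illustration and may differ from the paper's confidence maps.

```python
def pixelwise_stats(samples):
    """Per-pixel mean and population variance over a list of sampled masks."""
    if not samples:
        raise ValueError("need at least one sampled mask")
    n = len(samples)
    columns = list(zip(*samples))  # one tuple of sampled values per pixel
    mean = [sum(col) / n for col in columns]
    var = [sum((v - m) ** 2 for v in col) / n for col, m in zip(columns, mean)]
    return mean, var


def confidence_map(variance, max_variance=0.25):
    """Map variance to [0, 1]: 1 means all samples agree (assumed definition)."""
    return [1.0 - min(v / max_variance, 1.0) for v in variance]
```

Pixels on which all samples agree get confidence 1, while pixels where the samples split evenly between 0 and 1 get confidence 0.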

[114] FreeGen: Feed-Forward Reconstruction-Generation Co-Training for Free-Viewpoint Driving Scene Synthesis

Shijie Chen,Peixi Peng

Main category: cs.CV

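TL;DR: Proposes FreeGen, a reconstruction-generation co-training framework for free-viewpoint synthesis of autonomous driving scenes that balances interpolation consistency with extrapolation realism.

Details Motivation: Existing datasets and generative pipelines rarely provide consistent off-trajectory observations, which limits large-scale evaluation and training; generative models also struggle to achieve both interpolation consistency and extrapolation realism without per-scene optimization. Method: FreeGen is a feed-forward reconstruction-generation co-training framework: the reconstruction model provides stable geometric representations to ensure interpolation consistency, the generation model performs geometry-aware enhancement to improve realism at novel viewpoints, and co-training lets the two reinforce each other. Result: Experiments show that FreeGen achieves state-of-the-art performance on free-viewpoint driving scene synthesis. Conclusion: FreeGen addresses the trade-off between consistency and realism in free-viewpoint driving scene synthesis and provides higher-quality visual generation for closed-loop simulation and scalable pre-training in autonomous driving. Abstract: Closed-loop simulation and scalable pre-training for autonomous driving require synthesizing free-viewpoint driving scenes. However, existing datasets and generative pipelines rarely provide consistent off-trajectory observations, limiting large-scale evaluation and training. While recent generative models demonstrate strong visual realism, they struggle to jointly achieve interpolation consistency and extrapolation realism without per-scene optimization. To address this, we propose FreeGen, a feed-forward reconstruction-generation co-training framework for free-viewpoint driving scene synthesis. The reconstruction model provides stable geometric representations to ensure interpolation consistency, while the generation model performs geometry-aware enhancement to improve realism at unseen viewpoints. Through co-training, generative priors are distilled into the reconstruction model to improve off-trajectory rendering, and the refined geometry in turn offers stronger structural guidance for generation. Experiments demonstrate that FreeGen achieves state-of-the-art performance for free-viewpoint driving scene synthesis.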

[115] Tokenizing Buildings: A Transformer for Layout Synthesis

Manuel Ladron de Guevara,Jinmo Rhee,Ardavan Bidgoli,Vaidas Razgaitis,Michael Bergin

Main category: cs.CV

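TL;DR: Proposes SBM, a Transformer-based model for building layout synthesis that unifies heterogeneous feature sets through a joint embedding module, enabling efficient layout generation and semantic retrieval in BIM scenes.

Details Motivation: To model heterogeneous features of architectural elements in a unified way while preserving their compositional structure, in support of layout synthesis in BIM scenes. Method: Heterogeneous features of architectural elements are represented as a sparse attribute-feature matrix; a unified embedding module learns joint representations of categorical and continuous features; a single Transformer backbone is trained in both encoder-only and encoder-decoder modes. Result: SBM learns compact room embeddings that cluster well by type and topology and performs strongly on retrieval; in DDEP mode it generates functionally sound layouts with fewer collisions, fewer boundary violations, and better navigability. Conclusion: SBM unifies the heterogeneous feature representations of architectural elements, supports high-quality layout generation and semantic retrieval, and improves automatic layout design in BIM scenes. Abstract: We introduce Small Building Model (SBM), a Transformer-based architecture for layout synthesis in Building Information Modeling (BIM) scenes. We address the question of how to tokenize buildings by unifying heterogeneous feature sets of architectural elements into sequences while preserving compositional structure. Such feature sets are represented as a sparse attribute-feature matrix that captures room properties. We then design a unified embedding module that learns joint representations of categorical and possibly correlated continuous feature groups. Lastly, we train a single Transformer backbone in two modes: an encoder-only pathway that yields high-fidelity room embeddings, and an encoder-decoder pipeline for autoregressive prediction of room entities, referred to as Data-Driven Entity Prediction (DDEP). Experiments across retrieval and generative layout synthesis show that SBM learns compact room embeddings that reliably cluster by type and topology, enabling strong semantic retrieval. In DDEP mode, SBM produces functionally sound layouts, with fewer collisions and boundary violations and improved navigability.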

[116] A Sanity Check for Multi-In-Domain Face Forgery Detection in the Real World

Jikang Cheng,Renye Yan,Zhiyuan Yan,Yaozhong Gan,Xueyi Zhang,Zhongyuan Wang,Wei Peng,Ling Liang

Main category: cs.CV

TL;DR: This paper proposes MID-FFD, a new deepfake detection paradigm that targets the difficulty of single-image judgments when domain differences dominate the feature space in multi-domain settings, and designs DevDet, a model-agnostic framework that makes real/fake differences dominant.

Details Motivation: Existing deepfake detectors struggle to judge a single image accurately when its domain is unspecified, because multi-domain differences mask real/fake differences during training; large-scale multi-domain data must be introduced and the domain-dominance problem solved. Method: The paper defines the MID-FFD paradigm and proposes the DevDet framework, which comprises a Face Forgery Developer (FFDev) and a Dose-Adaptive detector Fine-Tuning (DAFT) strategy, to amplify the differences between real and fake and improve the detector's discrimination under unspecified domains. Result: Experiments show that the method substantially improves real/fake judgment accuracy (ACC) in the MID-FFD scenario while retaining good generalization to unseen data. Conclusion: The MID-FFD paradigm and the DevDet framework mitigate domain dominance in multi-domain settings and enable more reliable single-image deepfake detection. Abstract: Existing methods for deepfake detection aim to develop generalizable detectors. Although generalizability is the ultimate goal, with limited training forgeries and domains it appears idealistic to expect generalization that covers entirely unseen variations, especially given the diversity of real-world deepfakes. Therefore, introducing large-scale multi-domain data for training can be feasible and important for real-world applications. However, within such a multi-domain scenario, the differences between multiple domains, rather than the subtle real/fake distinctions, dominate the feature space. As a result, despite detectors being able to relatively separate real and fake within each domain (i.e., high AUC), they struggle with single-image real/fake judgments in domain-unspecified conditions (i.e., low ACC). In this paper, we first define a new research paradigm named Multi-In-Domain Face Forgery Detection (MID-FFD), which includes sufficient volumes of real-fake domains for training. Then, the detector should provide definitive real-fake judgments to the domain-unspecified inputs, which simulate the frame-by-frame independent detection scenario in the real world. Meanwhile, to address the domain-dominant issue, we propose a model-agnostic framework termed DevDet (Developer for Detector) to amplify real/fake differences and make them dominant in the feature space. DevDet consists of a Face Forgery Developer (FFDev) and a Dose-Adaptive detector Fine-Tuning strategy (DAFT). Experiments demonstrate our superiority in predicting real-fake under the MID-FFD scenario while maintaining original generalization ability to unseen data.
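The gap between high per-domain AUC and low single-image ACC can be reproduced with a toy example. The scores below are invented for illustration: each domain separates real from fake perfectly, but the domains are offset from each other, so no single threshold works for both.

```python
def auc(real_scores, fake_scores):
    """Probability that a random fake scores higher than a random real."""
    wins = 0.0
    for f in fake_scores:
        for r in real_scores:
            if f > r:
                wins += 1.0
            elif f == r:
                wins += 0.5
    return wins / (len(fake_scores) * len(real_scores))


def accuracy(real_scores, fake_scores, threshold):
    """Fraction of correct decisions when score >= threshold means 'fake'."""
    correct = sum(s < threshold for s in real_scores)
    correct += sum(s >= threshold for s in fake_scores)
    return correct / (len(real_scores) + len(fake_scores))


# Hypothetical detector scores for two domains with a domain-level offset.
domain_a = {"real": [0.1, 0.2], "fake": [0.3, 0.4]}
domain_b = {"real": [0.6, 0.7], "fake": [0.8, 0.9]}
pooled_real = domain_a["real"] + domain_b["real"]
pooled_fake = domain_a["fake"] + domain_b["fake"]
```

With these numbers, each domain has an AUC of 1.0, yet a single threshold of 0.5 over the pooled data classifies only half of the images correctly.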

[117] Autoregressive Image Generation Needs Only a Few Lines of Cached Tokens

Ziran Qin,Youru Lv,Mingbao Lin,Zeren Zhang,Chanfan Gan,Tieyuan Chen,Weiyao Lin

Main category: cs.CV

TL;DR: This paper proposes LineAR, a training-free progressive KV cache compression method for autoregressive image generation that manages the cache at the line level, substantially reducing memory use and increasing throughput while maintaining or even improving generation quality.

Details Motivation: Existing autoregressive image generation faces a severe memory bottleneck because all previously generated visual tokens must be cached, leading to high storage requirements and low throughput. Method: LineAR exploits intrinsic properties of visual attention and manages the KV cache at the line level from a 2D view, progressively evicting low-information tokens that have little effect on subsequent generation, guided by inter-line attention. Result: Experiments on six autoregressive image generation models show that, while retaining only 1/6 or 1/8 of the KV cache, LineAR improves FID on ImageNet and COCO as well as DPG, and achieves up to 67.61% memory reduction and a 7.57x speedup. Conclusion: LineAR is an efficient, general, training-free KV cache compression scheme that markedly improves the efficiency and scalability of autoregressive image generation. Abstract: Autoregressive (AR) visual generation has emerged as a powerful paradigm for image and multimodal synthesis, owing to its scalability and generality. However, existing AR image generation suffers from severe memory bottlenecks due to the need to cache all previously generated visual tokens during decoding, leading to both high storage requirements and low throughput. In this paper, we introduce LineAR, a novel, training-free progressive key-value (KV) cache compression pipeline for autoregressive image generation. By fully exploiting the intrinsic characteristics of visual attention, LineAR manages the cache at the line level using a 2D view, preserving the visual dependency regions while progressively evicting less-informative tokens that are harmless for subsequent line generation, guided by inter-line attention. LineAR enables efficient autoregressive (AR) image generation by utilizing only a few lines of cache, achieving both memory savings and throughput speedup, while maintaining or even improving generation quality. Extensive experiments across six autoregressive image generation models, including class-conditional and text-to-image generation, validate its effectiveness and generality. LineAR improves ImageNet FID from 2.77 to 2.68 and COCO FID from 23.85 to 22.86 on LlamaGen-XL and Janus-Pro-1B, while retaining only 1/6 KV cache. It also improves DPG on Lumina-mGPT-768 with just 1/8 KV cache. Additionally, LineAR achieves significant memory and throughput gains, including up to 67.61% memory reduction and 7.57x speedup on LlamaGen-XL, and 39.66% memory reduction and 5.62x speedup on Janus-Pro-7B.

[118] Contact-Aware Refinement of Human Pose Pseudo-Ground Truth via Bioimpedance Sensing

Maria-Paola Forte,Nikos Athanasiou,Giulia Ballardini,Jan Ulrich Bartels,Katherine J. Kuchenbecker,Michael J. Black

Main category: cs.CV

TL;DR: Proposes BioTUCH, a framework that combines visual pose estimation with bioimpedance sensing and uses self-contact information to improve the accuracy of in-the-wild 3D human pose capture; a miniature wearable sensor and a dataset accompany the method.

Details Motivation: Video-based pose estimation often fails in self-contact scenarios (such as a hand touching the face), while bioimpedance can detect skin contact cheaply, so fusing the two should improve 3D pose estimation in real-world settings. Method: An off-the-shelf pose estimator initializes the pose; using RGB video, bioimpedance signals, and 3D motion capture data, contact-aware pose optimization is applied whenever self-contact is detected, minimizing reprojection error and deviation from the input estimate while enforcing vertex proximity constraints. Result: Reconstruction accuracy improves by 11.7% on average across three input pose estimators, validating the method; a miniature wearable bioimpedance sensor was also developed for large-scale collection of contact-aware training data. Conclusion: By fusing vision with bioimpedance sensing, BioTUCH substantially improves 3D human pose estimation in complex self-contact scenarios and offers a new direction and practical tools for future data-driven methods. Abstract: Capturing accurate 3D human pose in the wild would provide valuable data for training pose estimation and motion generation methods. While video-based estimation approaches have become increasingly accurate, they often fail in common scenarios involving self-contact, such as a hand touching the face. In contrast, wearable bioimpedance sensing can cheaply and unobtrusively measure ground-truth skin-to-skin contact. Consequently, we propose a novel framework that combines visual pose estimators with bioimpedance sensing to capture the 3D pose of people by taking self-contact into account. Our method, BioTUCH, initializes the pose using an off-the-shelf estimator and introduces contact-aware pose optimization during measured self-contact: reprojection error and deviations from the input estimate are minimized while enforcing vertex proximity constraints. We validate our approach using a new dataset of synchronized RGB video, bioimpedance measurements, and 3D motion capture. Testing with three input pose estimators, we demonstrate an average of 11.7% improvement in reconstruction accuracy. We also present a miniature wearable bioimpedance sensor that enables efficient large-scale collection of contact-aware training data for improving pose estimation and generation using BioTUCH. Code and data are available at biotuch.is.tue.mpg.de

[119] SP-Det: Self-Prompted Dual-Text Fusion for Generalized Multi-Label Lesion Detection

Qing Xu,Yanqian Wang,Xiangjian Hea,Yue Li,Yixuan Zhang,Rong Qu,Wenting Duan,Zhen Chen

Main category: cs.CV

TL;DR: Proposes SP-Det, a self-prompted detection framework for automatic multi-label lesion detection in chest X-rays that does not rely on expert-annotated prompts.

Details Motivation: Existing prompt-based detection methods depend on manually annotated prompts, which are time-consuming and laborious to produce and hard to use clinically, so a method that generates effective prompts automatically without expert annotation is needed. Method: An expert-free dual-text prompt generator (DTPG) combines semantic context prompts capturing global pathological patterns with disease beacon prompts targeting specific diseases; a bidirectional feature enhancer (BFE) fuses diagnostic context with disease embeddings to improve detection performance. Result: Experiments on two chest X-ray datasets show that SP-Det outperforms existing state-of-the-art detection methods while removing the dependence on expert-annotated prompts entirely. Conclusion: By automatically generating high-quality text prompts, SP-Det achieves efficient, accurate multi-label lesion detection and shows good prospects for clinical application. Abstract: Automated lesion detection in chest X-rays has demonstrated significant potential for improving clinical diagnosis by precisely localizing pathological abnormalities. While recent promptable detection frameworks have achieved remarkable accuracy in target localization, existing methods typically rely on manual annotations as prompts, which are labor-intensive and impractical for clinical applications. To address this limitation, we propose SP-Det, a novel self-prompted detection framework that automatically generates rich textual context to guide multi-label lesion detection without requiring expert annotations. Specifically, we introduce an expert-free dual-text prompt generator (DTPG) that leverages two complementary textual modalities: semantic context prompts that capture global pathological patterns and disease beacon prompts that focus on disease-specific manifestations. Moreover, we devise a bidirectional feature enhancer (BFE) that synergistically integrates comprehensive diagnostic context with disease-specific embeddings to significantly improve feature representation and detection accuracy. Extensive experiments on two chest X-ray datasets with diverse thoracic disease categories demonstrate that our SP-Det framework outperforms state-of-the-art detection methods while completely eliminating the dependency on expert-annotated prompts compared to existing promptable architectures.

[120] SDG-Track: A Heterogeneous Observer-Follower Framework for High-Resolution UAV Tracking on Embedded Platforms

Jiawen Wen,Yu Hu,Suixuan Qiu,Jinshan Huang,Xiaowen Chu

Main category: cs.CV

TL;DR: This paper proposes SDG-Track, a sparse detection-guided tracker for real-time tracking of small UAVs on edge devices. Its Observer-Follower architecture resolves the conflict between high resolution and high speed, reaching 35.1 FPS system throughput while keeping precision high.

Details Motivation: Real-time tracking of small UAVs on edge devices faces a dilemma: downsampling high-resolution images destroys target features, while full-resolution processing is too slow for smooth gimbal control. Method: An Observer-Follower architecture is used: the Observer stream runs a high-capacity detector at low frequency on the GPU to obtain accurate position anchors from 1920x1080 frames; the Follower stream performs high-frequency trajectory interpolation on the CPU via ROI-constrained sparse optical flow; a training-free Dual-Space Recovery mechanism combines color histogram matching with geometric consistency constraints to handle tracking failures caused by occlusion or distractors. Result: On an NVIDIA Jetson Orin Nano, SDG-Track achieves 35.1 FPS system throughput and retains 97.2% of frame-by-frame detection precision, successfully tracking agile FPV drones in real environments. Conclusion: SDG-Track resolves the resolution-speed conflict in small-UAV tracking on edge devices, balancing high precision and high frame rate under resource constraints, and is suitable for practical ground-to-air tracking. Abstract: Real-time tracking of small unmanned aerial vehicles (UAVs) on edge devices faces a fundamental resolution-speed conflict. Downsampling high-resolution imagery to standard detector input sizes causes small target features to collapse below detectable thresholds. Yet processing native 1080p frames on resource-constrained platforms yields insufficient throughput for smooth gimbal control. We propose SDG-Track, a Sparse Detection-Guided Tracker that adopts an Observer-Follower architecture to reconcile this conflict. The Observer stream runs a high-capacity detector at low frequency on the GPU to provide accurate position anchors from 1920x1080 frames. The Follower stream performs high-frequency trajectory interpolation via ROI-constrained sparse optical flow on the CPU. To handle tracking failures from occlusion or model drift caused by spectrally similar distractors, we introduce Dual-Space Recovery, a training-free re-acquisition mechanism combining color histogram matching with geometric consistency constraints. Experiments on a ground-to-air tracking station demonstrate that SDG-Track achieves 35.1 FPS system throughput while retaining 97.2% of the frame-by-frame detection precision. The system successfully tracks agile FPV drones under real-world operational conditions on an NVIDIA Jetson Orin Nano. Our code is publicly available at https://github.com/Jeffry-wen/SDG-Track
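The division of labor between the two streams can be sketched in a few lines. This is a simplified, single-threaded illustration under stated assumptions: detector anchors and per-frame optical-flow displacements are given as precomputed inputs, and the function and argument names are hypothetical. The actual system runs the streams on GPU and CPU and adds Dual-Space Recovery, neither of which is modeled here.

```python
def follow(anchors, flows, num_frames):
    """Combine sparse detector anchors with dense per-frame flow updates.

    anchors: dict mapping frame index -> (x, y) from the low-rate Observer.
    flows:   list where flows[i] is the (dx, dy) displacement from frame
             i - 1 to frame i, as a Follower might estimate it.
    """
    if 0 not in anchors:
        raise ValueError("the first frame needs a detector anchor")
    track = []
    position = None
    for i in range(num_frames):
        if i in anchors:
            # Observer: a fresh detection resets any accumulated drift.
            position = anchors[i]
        else:
            # Follower: propagate the last position with the flow estimate.
            dx, dy = flows[i]
            position = (position[0] + dx, position[1] + dy)
        track.append(position)
    return track
```

In this sketch the flow-based updates fill the frames between detections, and each new anchor corrects whatever error the interpolation has accumulated.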

[121] You Only Train Once (YOTO): A Retraining-Free Object Detection Framework

Priyanto Hidayatullah,Nurjannah Syakrani,Yudi Widhiyasana,Muhammad Rizqi Sholahuddin,Refdinal Tubagus,Zahri Al Adzani Hidayat,Hanri Fajar Ramadhan,Dafa Alfarizki Pratama,Farhan Muhammad Yasin

Main category: cs.CV

TL;DR: This paper proposes You Only Train Once (YOTO), which combines YOLO11n, DeIT, and Proxy Anchor Loss to mitigate catastrophic forgetting in object detection. New classes can be added without retraining, and in a retail setting the method achieves high accuracy and nearly 3x training efficiency.

Details Motivation: In settings where classes are added frequently (such as retail), object detection suffers from catastrophic forgetting, and conventional methods must retrain on all data, incurring large training costs and time. Method: The YOTO framework uses YOLO11n for object localization and DeIT with Proxy Anchor Loss for feature extraction and metric learning; classification is done by cosine similarity between the target product's embedding and the features stored in a Qdrant vector database. Result: In a retail case study with 140 products, detection accuracy is good for both new and existing products; without retraining, training efficiency is nearly 3 times that of conventional methods, and the advantage grows as more products are added; average inference time on an edge device is 580 ms per image. Conclusion: YOTO addresses catastrophic forgetting and substantially lowers the cost of model updates, making it well suited to real-world scenarios that require frequent class expansion. Abstract: Object detection constitutes the primary task within the domain of computer vision. It is utilized in numerous domains. Nonetheless, object detection continues to encounter the issue of catastrophic forgetting. The model must be retrained whenever new products are introduced, utilizing not only the new products dataset but also the entirety of the previous dataset. The outcome is obvious: increasing model training expenses and significant time consumption. In numerous sectors, particularly retail checkout, the frequent introduction of new products presents a great challenge. This study introduces You Only Train Once (YOTO), a methodology designed to address the issue of catastrophic forgetting by integrating YOLO11n for object localization with DeIT and Proxy Anchor Loss for feature extraction and metric learning. For classification, we utilize cosine similarity between the embedding features of the target product and those in the Qdrant vector database. In a case study conducted in a retail store with 140 products, the experimental results demonstrate that our proposed framework achieves encouraging accuracy, whether for detecting new or existing products. Furthermore, without retraining, the training duration difference is significant. We achieve almost 3 times the training time efficiency compared to classical object detection approaches. This efficiency escalates as additional new products are added to the product database. The average inference time is 580 ms per image containing multiple products, on an edge device, validating the proposed framework's feasibility for practical use.
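The retraining-free classification step can be illustrated with a nearest-neighbor lookup by cosine similarity. In this sketch a plain dictionary stands in for the Qdrant vector database, the embeddings are tiny hand-written vectors rather than DeIT features, and the `min_similarity` rejection threshold is an added assumption.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def classify(query, gallery, min_similarity=0.5):
    """Return (label, score) of the most similar gallery embedding.

    The label is None when even the best match falls below the threshold.
    """
    best_label, best_score = None, -1.0
    for label, embedding in gallery.items():
        score = cosine_similarity(query, embedding)
        if score > best_score:
            best_label, best_score = label, score
    if best_score < min_similarity:
        return None, best_score
    return best_label, best_score


# Hypothetical product gallery; adding a product is a dictionary insert.
gallery = {"cola": [1.0, 0.0, 0.0], "chips": [0.0, 1.0, 0.0]}
gallery["soap"] = [0.0, 0.0, 1.0]  # new product, no retraining involved
```

Because the classifier is a lookup over stored embeddings, registering a new product only requires inserting its embedding; the localization and embedding models stay unchanged.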

[122] Equivariant Symmetry-Aware Head Pose Estimation for Fetal MRI

Ramya Muthukrishnan,Borjan Gagoski,Aryn Lee,P. Ellen Grant,Elfar Adalsteinsson,Polina Golland,Benjamin Billot

Main category: cs.CV

TL;DR: E(3)-Pose is a new fast pose estimation method that explicitly models rotation equivariance and object symmetry to improve the accuracy of 6-DoF fetal head pose estimation in MRI scans.

Details Motivation: Existing methods generalize poorly to clinical MRI data because of anatomical symmetries, low resolution, noise, and artifacts, which makes them unsuitable for adaptive 2D slice prescription. Method: E(3)-Pose jointly models rotation equivariance and object symmetry, estimates the 6-DoF pose from rapidly acquired 3D MRI volumes, and preserves rigid pose equivariance and anatomical symmetry by architectural construction. Result: Experiments on public and representative clinical fetal MRI datasets demonstrate superior robustness and cross-domain generalization, with state-of-the-art accuracy on clinical MRI volumes. Conclusion: Through its structural design, E(3)-Pose handles symmetry and equivariance, yields accurate and robust pose estimates, and shows good prospects for clinical application. Abstract: We present E(3)-Pose, a novel fast pose estimation method that jointly and explicitly models rotation equivariance and object symmetry. Our work is motivated by the challenging problem of accounting for fetal head motion during a diagnostic MRI scan. We aim to enable automatic adaptive prescription of 2D diagnostic MRI slices with 6-DoF head pose estimation, supported by 3D MRI volumes rapidly acquired before each 2D slice. Existing methods struggle to generalize to clinical volumes, due to pose ambiguities induced by inherent anatomical symmetries, as well as low resolution, noise, and artifacts. In contrast, E(3)-Pose captures anatomical symmetries and rigid pose equivariance by construction, and yields robust estimates of the fetal head pose. Our experiments on publicly available and representative clinical fetal MRI datasets demonstrate the superior robustness and generalization of our method across domains. Crucially, E(3)-Pose achieves state-of-the-art accuracy on clinical MRI volumes, paving the way for clinical translation. Our implementation is available at github.com/ramyamut/E3-Pose.

[123] ReflexFlow: Rethinking Learning Objective for Exposure Bias Alleviation in Flow Matching

Guanbo Huang,Jingjia Mao,Fanding Huang,Fengkai Liu,Xiangyang Luo,Yaoyuan Liang,Jiasheng Lu,Xiaoe Wang,Pei Liu,Ruiliu Fu,Shao-Lun Huang

Main category: cs.CV

TL;DR: This paper proposes ReflexFlow, a method that alleviates exposure bias in Flow Matching and improves generation quality through two components: Anti-Drift Rectification and Frequency Compensation.

Details Motivation: Flow Matching methods suffer from exposure bias between training and inference, which degrades generation quality; the paper investigates the root causes and proposes a remedy. Method: ReflexFlow comprises Anti-Drift Rectification (ADR) and Frequency Compensation (FC), which dynamically correct exposure bias by redesigning the loss function and adjusting the prediction targets. Result: Experiments on CIFAR-10, CelebA-64, and ImageNet-256 show that ReflexFlow clearly outperforms prior methods, reducing FID by 35.65% on CelebA-64. Conclusion: ReflexFlow is a general and effective method that is compatible with all Flow Matching frameworks and markedly improves image generation quality. Abstract: Despite tremendous recent progress, Flow Matching methods still suffer from exposure bias due to discrepancies in training and inference. This paper investigates the root causes of exposure bias in Flow Matching, including: (1) the model lacks generalization to biased inputs during training, and (2) insufficient low-frequency content captured during early denoising, leading to accumulated bias. Based on these insights, we propose ReflexFlow, a simple and effective reflexive refinement of the Flow Matching learning objective that dynamically corrects exposure bias. ReflexFlow consists of two components: (1) Anti-Drift Rectification (ADR), which reflexively adjusts prediction targets for biased inputs utilizing a redesigned loss under training-time scheduled sampling; and (2) Frequency Compensation (FC), which reflects on missing low-frequency components and compensates them by reweighting the loss using exposure bias. ReflexFlow is model-agnostic, compatible with all Flow Matching frameworks, and improves generation quality across datasets. Experiments on CIFAR-10, CelebA-64, and ImageNet-256 show that ReflexFlow outperforms prior approaches in mitigating exposure bias, achieving a 35.65% reduction in FID on CelebA-64.

[124] Semantics Lead the Way: Harmonizing Semantic and Texture Modeling with Asynchronous Latent Diffusion

Yueming Pan,Ruoyu Feng,Qi Dai,Yuqi Wang,Wenfeng Lin,Mingyu Guo,Chong Luo,Nanning Zheng

Main category: cs.CV

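TL;DR: This paper proposes Semantic-First Diffusion (SFD), a latent diffusion model that explicitly prioritizes semantic generation. An asynchronous denoising mechanism lets semantics form before texture, markedly improving image quality and convergence speed.

Details Motivation: Latent Diffusion Models (LDMs) generate coarse-to-fine, with high-level semantics usually forming before detailed texture, yet existing methods still denoise semantics and texture synchronously and ignore this natural ordering. Although prior work has introduced semantic priors from pretrained visual encoders, it does not fully exploit semantics as guidance for texture generation. The authors therefore aim to design a diffusion framework that explicitly builds semantic structure first and uses it to guide texture generation. Method: SFD first extracts a compact semantic latent from a pretrained visual encoder via a dedicated Semantic VAE and combines it with the VAE-encoded texture latent into a composite latent space. Its core is asynchronous denoising of the semantic and texture latents under separate noise schedules: the semantic latent is denoised earlier, forming a stable semantic anchor that gives clear high-level guidance for subsequent texture generation and yields a natural coarse-to-fine process. Result: On ImageNet 256x256, SFD with DiT models achieves FID 1.06 (LightningDiT-XL) and FID 1.04 (LightningDiT-XXL), and converges up to 100x faster than the original DiT. SFD also improves existing methods such as ReDi and VA-VAE, supporting the effectiveness of asynchronous, semantics-led modeling. Conclusion: By introducing asynchronous denoising, SFD explicitly prioritizes semantic information and strengthens the guidance that high-level semantics provide for fine texture, achieving higher-quality generation and very fast convergence and showing that the semantics-first strategy benefits diffusion models broadly. Abstract: Latent Diffusion Models (LDMs) inherently follow a coarse-to-fine generation process, where high-level semantic structure is generated slightly earlier than fine-grained texture. This indicates the preceding semantics potentially benefit texture generation by providing a semantic anchor. Recent advances have integrated semantic priors from pretrained visual encoders to further enhance LDMs, yet they still denoise semantic and VAE-encoded texture synchronously, neglecting such ordering. Observing these, we propose Semantic-First Diffusion (SFD), a latent diffusion paradigm that explicitly prioritizes semantic formation. SFD first constructs composite latents by combining a compact semantic latent, which is extracted from a pretrained visual encoder via a dedicated Semantic VAE, with the texture latent. The core of SFD is to denoise the semantic and texture latents asynchronously using separate noise schedules: semantics precede textures by a temporal offset, providing clearer high-level guidance for texture refinement and enabling natural coarse-to-fine generation. On ImageNet 256x256 with guidance, SFD achieves FID 1.06 (LightningDiT-XL) and FID 1.04 (1.0B LightningDiT-XXL), while achieving up to 100x faster convergence than the original DiT. SFD also improves existing methods like ReDi and VA-VAE, demonstrating the effectiveness of asynchronous, semantics-led modeling. Project page and code: https://yuemingpan.github.io/SFD.github.io/.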
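One simple way to picture "semantics precede textures by a temporal offset" is to drive both latents from a shared clock and delay the texture timestep. The parameterization below is an illustrative assumption, not the schedule used in SFD: timesteps run from 0 (pure noise) to 1 (clean), and the function names and the offset value are hypothetical.

```python
def clamp01(x):
    """Restrict a value to the interval [0, 1]."""
    return max(0.0, min(1.0, x))


def async_times(s, offset):
    """Map a shared clock s in [0, 1 + offset] to (t_semantic, t_texture).

    The texture timestep lags the semantic one by `offset`, so the semantic
    latent is always at least as clean as the texture latent.
    """
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return clamp01(s), clamp01(s - offset)


def schedule(num_steps, offset):
    """Evenly spaced shared-clock values and their per-latent timesteps."""
    total = 1.0 + offset
    return [async_times(total * i / num_steps, offset) for i in range(num_steps + 1)]
```

With an offset of 0.3, for example, the texture latent stays at pure noise until the semantic latent is 30% of the way through its schedule, and both latents reach the clean state at the end of the shared clock.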

[125] Virtually Unrolling the Herculaneum Papyri by Diffeomorphic Spiral Fitting

Paul Henderson

Main category: cs.CV

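TL;DR: Proposes a new top-down method that globally fits an explicit parametric model to neural-network predictions of where the rolled papyrus lies, enabling automatic virtual unrolling of CT scans of severely damaged Herculaneum scrolls and outperforming the existing automated method.

Details Motivation: The Herculaneum scrolls were burned and carbonized by the eruption of Mount Vesuvius, and physical unrolling easily destroys them, so non-destructive digital reading is urgently needed; existing automated virtual unrolling methods have limited effectiveness and struggle with severely damaged regions or regions with missing signal. Method: A fully automatic top-down method uses neural-network predictions of the likely papyrus position in the CT scan and globally fits an explicit parametric model of the scroll surface, guaranteeing that the result is a continuous 2D sheet that passes plausibly through regions where the surface is not visible in the CT scan. Result: Comprehensive experiments on high-resolution CT scans of two scrolls show successful virtual unrolling of large regions, with performance exceeding the only existing automated unrolling method suitable for this data. Conclusion: This is the first technique to fit a surface model fully automatically to a CT scan of a severely damaged scroll; it guarantees a continuous, complete unrolled result, improves the efficiency and applicability of virtual unrolling, and provides a strong tool for digitizing ancient texts. Abstract: The Herculaneum Papyri are a collection of rolled papyrus documents that were charred and buried by the famous eruption of Mount Vesuvius. They promise to contain a wealth of previously unseen Greek and Latin texts, but are extremely fragile and thus most cannot be unrolled physically. A solution to access these texts is virtual unrolling, where the papyrus surface is digitally traced out in a CT scan of the scroll, to create a flattened representation. This tracing is very laborious to do manually in gigavoxel-sized scans, so automated approaches are desirable. We present the first top-down method that automatically fits a surface model to a CT scan of a severely damaged scroll. We take a novel approach that globally fits an explicit parametric model of the deformed scroll to existing neural network predictions of where the rolled papyrus likely passes. Our method guarantees the resulting surface is a single continuous 2D sheet, even passing through regions where the surface is not detectable in the CT scan. We conduct comprehensive experiments on high-resolution CT scans of two scrolls, showing that our approach successfully unrolls large regions, and exceeds the performance of the only existing automated unrolling method suitable for this data.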

[126] LiteVGGT: Boosting Vanilla VGGT via Geometry-aware Cached Token Merging

Zhijian Shu,Cheng Lin,Tao Xie,Wei Yin,Ben Li,Zhiyuan Pu,Weize Li,Yao Yao,Xun Cao,Xiaoyang Guo,Xiao-Xiao Long

Main category: cs.CV

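TL;DR: This paper proposes LiteVGGT, an efficient 3D vision foundation model. A geometry-aware cached token merging strategy delivers up to 10x speedup and a substantial memory reduction over existing methods while preserving core performance, enabling efficient processing of large-scale scenes.

Details Motivation: Existing 3D vision foundation models such as VGGT are slow and memory-hungry on long sequences, which limits their use on large-scale scenes with more than a few hundred images, so a more efficient approach is needed. Method: Based on two key insights (tokens from local image regions are inherently geometrically correlated, and token similarity is stable across adjacent network layers), the authors design a simple, efficient geometry-aware cached token merging strategy. It analyzes each token's geometric importance, optimizes anchor token selection to better preserve the information needed for reconstruction, and caches and reuses merge indices across layers, greatly reducing latency with minimal impact on accuracy. Result: Experiments validate LiteVGGT's effectiveness, scalability, and robustness; it retains VGGT's core performance while enabling efficient fine-tuning and FP8 quantization for further gains. Conclusion: By introducing geometry-aware cached token merging, LiteVGGT greatly increases processing speed and lowers memory use while maintaining 3D reconstruction quality, providing an effective solution for 3D vision tasks on large-scale scenes. Abstract: 3D vision foundation models like Visual Geometry Grounded Transformer (VGGT) have advanced greatly in geometric perception. However, VGGT is time-consuming and memory-intensive for long sequences, limiting application to large-scale scenes beyond hundreds of images. To address this, we propose LiteVGGT, achieving up to 10x speedup and substantial memory reduction, enabling efficient processing of 1000-image scenes. We derive two key insights for 3D reconstruction: (1) tokens from local image regions have inherent geometric correlations, leading to high similarity and computational redundancy; (2) token similarity across adjacent network layers remains stable, allowing for reusable merge decisions. Guided by these, we design a simple yet efficient strategy, dubbed geometry-aware cached token merging. We analyze each token's geometric importance, optimizing anchor token selection to better preserve key information for reconstruction. We also cache and reuse merge indices across layers, substantially reducing latency with minimal accuracy impact. This strategy retains VGGT's core performance, enabling efficient fine-tuning and FP8 quantization for further gains. Extensive experiments validate LiteVGGT's effectiveness, scalability, and robustness. Project page: https://garlicba.github.io/LiteVGGT/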
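The caching idea (compute merge indices once, then reuse them in later layers) can be sketched as below. This is a toy illustration with hand-picked anchors and dot-product similarity; LiteVGGT's geometry-aware anchor selection is not reproduced, and all names are hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def assign_to_anchors(tokens, anchor_ids):
    """For each token, the index (into anchor_ids) of its most similar anchor."""
    anchors = [tokens[i] for i in anchor_ids]
    return [max(range(len(anchors)), key=lambda k: dot(tok, anchors[k]))
            for tok in tokens]


def merge(tokens, assignment, num_anchors):
    """Average all tokens assigned to the same anchor into one merged token."""
    dim = len(tokens[0])
    sums = [[0.0] * dim for _ in range(num_anchors)]
    counts = [0] * num_anchors
    for tok, k in zip(tokens, assignment):
        counts[k] += 1
        for d in range(dim):
            sums[k][d] += tok[d]
    # max(c, 1) guards against an anchor that received no tokens.
    return [[s / max(c, 1) for s in row] for row, c in zip(sums, counts)]


class CachedMerger:
    """Computes merge indices on the first call and reuses them afterwards."""

    def __init__(self, anchor_ids):
        self.anchor_ids = list(anchor_ids)
        self.assignment = None  # cached merge indices

    def __call__(self, tokens):
        if self.assignment is None:
            self.assignment = assign_to_anchors(tokens, self.anchor_ids)
        return merge(tokens, self.assignment, len(self.anchor_ids))
```

Calling the same `CachedMerger` on the tokens of a later layer skips the similarity computation entirely, which is the source of the latency savings in this sketch; whether reuse is safe rests on the stated observation that token similarity is stable across adjacent layers.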

[127] Towards Adaptive Fusion of Multimodal Deep Networks for Human Action Recognition

Novanto Yudistira

Main category: cs.CV

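TL;DR: Proposes a new human action recognition method based on deep neural networks and adaptive multimodal fusion (RGB, optical flow, audio, depth). A gating mechanism fuses information selectively, improving accuracy and robustness.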

Details Motivation: Traditional unimodal action recognition methods are limited by their single source of information and struggle in complex scenes, so multimodal fusion is needed to improve performance. Method: An adaptive fusion strategy combines deep neural networks with gating mechanisms to weight and integrate RGB, optical flow, audio, and depth information, selectively extracting key features. Result: On benchmark datasets for action recognition, violence detection, and self-supervised learning tasks, the method achieves higher recognition accuracy than traditional unimodal methods. Conclusion: The method improves the accuracy and robustness of action recognition and has potential for broad use in intelligent surveillance, human-computer interaction, and active assisted living. Abstract: This study introduces a pioneering methodology for human action recognition by harnessing deep neural network techniques and adaptive fusion strategies across multiple modalities, including RGB, optical flows, audio, and depth information. Employing gating mechanisms for multimodal fusion, we aim to surpass limitations inherent in traditional unimodal recognition methods while exploring novel possibilities for diverse applications. Through an exhaustive investigation of gating mechanisms and adaptive weighting-based fusion architectures, our methodology enables the selective integration of relevant information from various modalities, thereby bolstering both accuracy and robustness in action recognition tasks. We meticulously examine various gated fusion strategies to pinpoint the most effective approach for multimodal action recognition, showcasing its superiority over conventional unimodal methods. Gating mechanisms facilitate the extraction of pivotal features, resulting in a more holistic representation of actions and substantial enhancements in recognition performance. Our evaluations across human action recognition, violence action detection, and multiple self-supervised learning tasks on benchmark datasets demonstrate promising advancements in accuracy. The significance of this research lies in its potential to revolutionize action recognition systems across diverse fields. The fusion of multimodal information promises sophisticated applications in surveillance and human-computer interaction, especially in contexts related to active assisted living.
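
The abstract does not specify the form of the gate, so the following is only a minimal sketch of one common variant of gated multimodal fusion: each modality's feature vector gets a scalar score, a softmax turns the scores into gate weights, and the fused feature is the weighted sum. The function names, the linear scoring rule, and the toy feature vectors are assumptions made for illustration, not the paper's architecture.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(features, gate_weights):
    """Fuse per-modality feature vectors with softmax gates.

    features:     dict modality -> feature vector (all the same length)
    gate_weights: dict modality -> scoring vector (hypothetical learned weights)
    Returns (fused_vector, gates_by_modality).
    """
    names = sorted(features)
    # One scalar score per modality: dot product of scoring weights and features.
    scores = [sum(w * x for w, x in zip(gate_weights[n], features[n]))
              for n in names]
    gates = softmax(scores)
    dim = len(features[names[0]])
    fused = [sum(g * features[n][d] for g, n in zip(gates, names))
             for d in range(dim)]
    return fused, dict(zip(names, gates))


# Toy example: three modalities with 2-dimensional features.
features = {"rgb": [1.0, 0.0], "flow": [0.0, 1.0], "audio": [1.0, 1.0]}
uniform = {name: [0.0, 0.0] for name in features}
fused, gates = gated_fusion(features, uniform)
```

With all-zero scoring weights every modality receives the same gate, so the fusion reduces to a plain average; non-zero weights let the gate emphasize whichever modality scores highest for a given input.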

[128] FASTer: Toward Efficient Autoregressive Vision Language Action Modeling via neural Action Tokenization

Yicheng Liu,Shiduo Zhang,Zibin Dong,Baijun Ye,Tianyuan Yuan,Xiaopeng Yu,Linqi Yin,Chenhao Lu,Junhao Shi,Luca Jiang-Tao Yu,Liangtao Zheng,Tao Jiang,Jingjing Gong,Xipeng Qiu,Hang Zhao

Main category: cs.CV

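TL;DR: FASTer is a unified framework for efficient and generalizable robot learning. By introducing a learnable tokenizer and a block-wise autoregressive decoding strategy, it improves action reconstruction quality and inference speed while maintaining a high compression ratio.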

Details Motivation: Existing autoregressive vision-language-action (VLA) models face a trade-off between reconstruction fidelity and inference efficiency in action tokenization, which limits their use in real robotic manipulation. Method: The FASTer framework consists of FASTerVQ and FASTerVLA. FASTerVQ encodes action chunks as single-channel images to capture global spatio-temporal dependencies with high compression; FASTerVLA builds on it with block-wise autoregressive decoding and a lightweight action expert for policy learning. Result: Experiments show that FASTerVQ performs strongly in reconstruction quality, token utilization, and cross-task and cross-embodiment generalization; FASTerVLA further improves inference speed and task performance, surpassing previous state-of-the-art VLA models. Conclusion: FASTer effectively resolves the efficiency-performance trade-off in action tokenization and offers a more efficient, general, and practical solution for robot learning. Abstract: Autoregressive vision-language-action (VLA) models have recently demonstrated strong capabilities in robotic manipulation. However, their core process of action tokenization often involves a trade-off between reconstruction fidelity and inference efficiency. We introduce FASTer, a unified framework for efficient and generalizable robot learning that integrates a learnable tokenizer with an autoregressive policy built upon it. FASTerVQ encodes action chunks as single-channel images, capturing global spatio-temporal dependencies while maintaining a high compression ratio. FASTerVLA builds on this tokenizer with block-wise autoregressive decoding and a lightweight action expert, achieving both faster inference and higher task performance. Extensive experiments across simulated and real-world benchmarks show that FASTerVQ delivers superior reconstruction quality, high token utilization, and strong cross-task and cross-embodiment generalization, while FASTerVLA further improves overall capability, surpassing previous state-of-the-art VLA models in both inference speed and task performance.

[129] GeoPE: A Unified Geometric Positional Embedding for Structured Tensors

Yupu Yao,Bowen Yang

Main category: cs.CV

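TL;DR: This paper proposes Geometric Positional Embedding (GeoPE), a quaternion-based positional encoding in 3D Euclidean space that restores the 2D spatial manifold in Vision Transformers. It decouples false sequential proximity from true spatial distance and performs strongly on image classification, object detection, and 3D semantic segmentation.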

Details Motivation: Standard Vision Transformers flatten 2D images into 1D sequences, destroying the original spatial topology; existing 2D RoPE methods fail to distinguish spatial adjacency from sequential adjacency, which distorts the modeling. Method: GeoPE uses quaternions to extend rotary positional encoding to 3D Euclidean space and constructs a unified rotational operator via the geometric mean in the Lie algebra, yielding a geometrically coupled encoding of the spatial dimensions. Result: On image classification, object detection, and 3D semantic segmentation, GeoPE outperforms existing 2D RoPE variants and significantly enhances the model's shape bias, supporting its ability to model true geometric structure. Conclusion: GeoPE restores the 2D spatial structure in Vision Transformers, resolves the confusion between spatial and sequential proximity in conventional methods, and shows stronger spatial awareness and generalization. Abstract: Standard Vision Transformers flatten 2D images into 1D sequences, disrupting the natural spatial topology. While Rotary Positional Embedding (RoPE) excels in 1D, it inherits this limitation, often treating spatially distant patches (e.g., at row edges) as sequence neighbors. Existing 2D approaches typically treat spatial axes independently, failing to decouple this false sequential proximity from true spatial distance. To restore the 2D spatial manifold, we introduce Geometric Positional Embedding (GeoPE), a framework that extends rotations to 3D Euclidean space using quaternions. To overcome non-commutativity and ensure symmetry, GeoPE constructs a unified rotational operator by computing the geometric mean in the Lie algebra. This creates a geometrically coupled encoding that effectively separates spatial dimensions. Extensive experiments on image classification, object detection, and 3D semantic segmentation demonstrate that GeoPE consistently outperforms existing 2D RoPE variants and significantly enhances shape bias, confirming its ability to capture true geometric structure.

[130] Balanced Few-Shot Episodic Learning for Accurate Retinal Disease Diagnosis

Jasmaine Khale,Ravi Prakash Srivastava

Main category: cs.CV

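TL;DR: This paper proposes a balanced few-shot learning framework for a retinal multi-disease image dataset. Using balanced episodic sampling, targeted data augmentation, and a pretrained ResNet-50 encoder, it achieves fairer and more accurate diagnosis with few samples, with notable gains on rare diseases.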

Details Motivation: Conventional deep learning depends on large annotated datasets and performs poorly when disease classes are imbalanced, while annotated retinal disease data is costly to obtain and unevenly distributed, so a diagnostic method is needed that generalizes from few samples and mitigates class bias. Method: A balanced few-shot episodic learning framework with three components: (1) balanced episodic sampling, so that every class participates equally in each 5-way 5-shot task; (2) data augmentation targeted at minority classes, including CLAHE and color/geometry transformations; (3) an ImageNet-pretrained ResNet-50 for feature extraction, with prototype classification by cosine similarity. Result: Experiments on the ten most represented classes of RFMiD show that, with 100 training episodes and 1,000 test episodes, the method substantially improves overall accuracy, reduces bias toward majority classes, and in particular improves diagnosis of rare diseases such as optic disc edema and branch retinal vein occlusion. Conclusion: Combining dataset-aware few-shot learning with balanced sampling and CLAHE-enhanced preprocessing enables more robust and clinically fair automated retinal disease diagnosis under data-constrained conditions. Abstract: Automated retinal disease diagnosis is vital given the rising prevalence of conditions such as diabetic retinopathy and macular degeneration. Conventional deep learning approaches require large annotated datasets, which are costly and often imbalanced across disease categories, limiting their reliability in practice. Few-shot learning (FSL) addresses this challenge by enabling models to generalize from only a few labeled samples per class. In this study, we propose a balanced few-shot episodic learning framework tailored to the Retinal Fundus Multi-Disease Image Dataset (RFMiD). Focusing on the ten most represented classes, which still show substantial imbalance between majority diseases (e.g., Diabetic Retinopathy, Macular Hole) and minority ones (e.g., Optic Disc Edema, Branch Retinal Vein Occlusion), our method integrates three key components: (i) balanced episodic sampling, ensuring equal participation of all classes in each 5-way 5-shot episode; (ii) targeted augmentation, including Contrast Limited Adaptive Histogram Equalization (CLAHE) and color/geometry transformations, to improve minority-class diversity; and (iii) a ResNet-50 encoder pretrained on ImageNet, selected for its superior ability to capture fine-grained retinal features. Prototypes are computed in the embedding space and classification is performed with cosine similarity for improved stability. Trained on 100 episodes and evaluated on 1,000 test episodes, our framework achieves substantial accuracy gains and reduces bias toward majority classes, with notable improvements for underrepresented diseases. These results demonstrate that dataset-aware few-shot pipelines, combined with balanced sampling and CLAHE-enhanced preprocessing, can deliver more robust and clinically fair retinal disease diagnosis under data-constrained conditions.
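
The episode construction and prototype classification described above can be sketched roughly as follows. This is an illustrative sketch only: the toy embeddings stand in for ResNet-50 features, and every function name, the query count, and the synthetic data generator are assumptions rather than details from the paper.

```python
import math
import random


def sample_balanced_episode(pool, n_way=5, k_shot=5, n_query=2, rng=None):
    """Pick n_way classes, then the same number of items from each class.

    pool: dict class_label -> list of embedding vectors.
    Returns (support, query), both dict class_label -> list of vectors.
    """
    rng = rng or random.Random(0)
    classes = rng.sample(sorted(pool), n_way)
    support, query = {}, {}
    for c in classes:
        items = rng.sample(pool[c], k_shot + n_query)
        support[c] = items[:k_shot]
        query[c] = items[k_shot:]
    return support, query


def mean_vector(vectors):
    """Class prototype: element-wise mean of the support embeddings."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def classify(embedding, prototypes):
    """Assign the class whose prototype has the highest cosine similarity."""
    return max(prototypes, key=lambda c: cosine(embedding, prototypes[c]))


def toy_pool(n_classes=6, per_class=8, dim=6, seed=0):
    """Synthetic, well-separated embeddings (a stand-in for encoder output)."""
    rng = random.Random(seed)
    return {
        c: [[(1.0 if d == c else 0.0) + rng.uniform(-0.05, 0.05)
             for d in range(dim)] for _ in range(per_class)]
        for c in range(n_classes)
    }


pool = toy_pool()
support, query = sample_balanced_episode(pool)
prototypes = {c: mean_vector(vs) for c, vs in support.items()}
predictions = {c: [classify(q, prototypes) for q in qs]
               for c, qs in query.items()}
```

Sampling classes first and then a fixed number of items per class is what keeps each episode balanced, regardless of how many images each disease has in the underlying dataset.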

[131] Rethinking the Use of Vision Transformers for AI-Generated Image Detection

NaHyeon Park,Kunhee Kim,Junsuk Choe,Hyunjung Shim

Main category: cs.CV

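TL;DR: This paper proposes MoLD, a method that dynamically fuses features from multiple CLIP-ViT layers to improve AI-generated image detection. Experiments show better generalization and robustness across a range of generative models and real-world scenarios.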

Details Motivation: Existing methods mainly use final-layer CLIP-ViT features for AI-generated image detection, but features from different layers may carry complementary information, so studying and fusing layer-wise contributions can improve detection. Method: The authors systematically analyze the role of each ViT layer's features in the detection task and find that earlier layers provide more localized and generalizable features; based on this, they propose MoLD, which dynamically fuses multi-layer features with a gating-based mechanism. Result: MoLD significantly improves detection on both GAN- and diffusion-generated images, strengthens generalization across generative models, and is more robust in real-world scenarios; its applicability to other ViTs such as DINOv2 is also verified. Conclusion: Multi-layer feature fusion outperforms using final-layer features alone; by adaptively integrating features from multiple ViT layers, MoLD provides a stronger, more flexible, and more scalable solution for AI-generated image detection. Abstract: Rich feature representations derived from CLIP-ViT have been widely utilized in AI-generated image detection. While most existing methods primarily leverage features from the final layer, we systematically analyze the contributions of layer-wise features to this task. Our study reveals that earlier layers provide more localized and generalizable features, often surpassing the performance of final-layer features in detection tasks. Moreover, we find that different layers capture distinct aspects of the data, each contributing uniquely to AI-generated image detection. Motivated by these findings, we introduce a novel adaptive method, termed MoLD, which dynamically integrates features from multiple ViT layers using a gating-based mechanism. Extensive experiments on both GAN- and diffusion-generated images demonstrate that MoLD significantly improves detection performance, enhances generalization across diverse generative models, and exhibits robustness in real-world scenarios. Finally, we illustrate the scalability and versatility of our approach by successfully applying it to other pre-trained ViTs, such as DINOv2.

[132] Stable Single-Pixel Contrastive Learning for Semantic and Geometric Tasks

Leonid Pogorelyuk,Niels Bracher,Aaron Verkleeren,Lars Kühmichel,Stefan T. Radev

Main category: cs.CV

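TL;DR: Proposes a stable contrastive loss for learning pixel-level representations that capture both semantic and geometric information, enabling precise point correspondence across images without momentum-based teacher-student training.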

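Details Motivation: To obtain pixel-level representations that are both semantically meaningful and geometrically consistent without relying on complex training mechanisms. Method: A family of stable contrastive losses maps each pixel to an overcomplete descriptor, giving view-invariant and semantically clear representations. Result: Experiments in synthetic 2D and 3D environments show that the method establishes precise point correspondences and yields representations with good properties. Conclusion: The method offers a simple and effective new route to pixel-level representation learning for tasks that need fine-grained correspondence. Abstract: We pilot a family of stable contrastive losses for learning pixel-level representations that jointly capture semantic and geometric information. Our approach maps each pixel of an image to an overcomplete descriptor that is both view-invariant and semantically meaningful. It enables precise point-correspondence across images without requiring momentum-based teacher-student training. Two experiments in synthetic 2D and 3D environments demonstrate the properties of our loss and the resulting overcomplete representations.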

[133] Aligned but Stereotypical? The Hidden Influence of System Prompts on Social Bias in LVLM-Based Text-to-Image Models

NaHyeon Park,Namin An,Kunhee Kim,Soyeon Yoon,Jiahao Huo,Hyunjung Shim

Main category: cs.CV

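TL;DR: This paper studies social bias in text-to-image systems built on large vision-language models (LVLMs). It finds that they produce more pronounced bias than non-LVLM models and identifies system prompts as a primary driver. The authors propose FairPro, a training-free meta-prompting framework that self-audits at test time and constructs fairness-aware system prompts; experiments show it substantially reduces bias while preserving text-image alignment.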

Details Motivation: To investigate whether LVLM-based T2I systems amplify social bias and to understand where the bias comes from, in support of fairer and more responsible image generation models. Method: The authors build a benchmark of 1,024 prompts spanning four levels of linguistic complexity and systematically evaluate demographic bias across multiple attributes; through decoded intermediate representations, token-probability diagnostics, and embedding-association analyses, they show how system prompts encode and propagate bias; they then propose FairPro, which dynamically generates debiased system prompts at test time. Result: LVLM-based models generate more socially biased images than non-LVLM models; system prompts are confirmed as a key mechanism of bias propagation; on SANA and Qwen-Image, FairPro substantially reduces bias while maintaining generation quality and text alignment. Conclusion: System prompts play a central role in bias propagation in LVLM-based T2I systems, and FairPro offers a practical, deployable way to mitigate the problem and build fairer generative AI systems. Abstract: Large vision-language model (LVLM) based text-to-image (T2I) systems have become the dominant paradigm in image generation, yet whether they amplify social biases remains insufficiently understood. In this paper, we show that LVLM-based models produce markedly more socially biased images than non-LVLM-based models. We introduce a 1,024 prompt benchmark spanning four levels of linguistic complexity and evaluate demographic bias across multiple attributes in a systematic manner. Our analysis identifies system prompts, the predefined instructions guiding LVLMs, as a primary driver of biased behavior. Through decoded intermediate representations, token-probability diagnostics, and embedding-association analyses, we reveal how system prompts encode demographic priors that propagate into image synthesis. To this end, we propose FairPro, a training-free meta-prompting framework that enables LVLMs to self-audit and construct fairness-aware system prompts at test time. Experiments on two LVLM-based T2I models, SANA and Qwen-Image, show that FairPro substantially reduces demographic bias while preserving text-image alignment. We believe our findings provide deeper insight into the central role of system prompts in bias propagation and offer a practical, deployable approach for building more socially responsible T2I systems.

[134] A dynamic memory assignment strategy for dilation-based ICP algorithm on embedded GPUs

Qiong Chang,Weimin Wang,Junpei Zhong,Jun Miyazaki

Main category: cs.CV

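TL;DR: This paper proposes a memory optimization strategy for embedded GPUs, applied to the efficient point cloud registration algorithm VANICP. It reduces memory consumption by over 97% while preserving the original performance.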

Details Motivation: VANICP improves point cloud processing efficiency, but its high memory footprint limits its use in resource-constrained environments such as embedded systems. Method: A GPU-oriented dynamic memory assignment strategy optimizes memory usage in the dilation operation, and a lightweight enhanced version of VANICP is built on top of it. Result: The method achieves over 97% memory reduction on embedded GPUs while keeping performance comparable to the original algorithm. Conclusion: The optimization strategy addresses the difficulty of deploying VANICP on resource-constrained devices and broadens its potential in edge computing and embedded systems. Abstract: This paper proposes a memory-efficient optimization strategy for the high-performance point cloud registration algorithm VANICP, enabling lightweight execution on embedded GPUs with constrained hardware resources. VANICP is a recently published acceleration framework that significantly improves the computational efficiency of point-cloud-based applications. By transforming the global nearest neighbor search (NNS) into a localized process through a dilation-based information propagation mechanism, VANICP greatly reduces the computational complexity of the NNS. However, its original implementation demands a considerable amount of memory, which restricts its deployment in resource-constrained environments such as embedded systems. To address this issue, we propose a GPU-oriented dynamic memory assignment strategy that optimizes the memory usage of the dilation operation. Furthermore, based on this strategy, we construct an enhanced version of the VANICP framework that achieves over 97% reduction in memory consumption while preserving the original performance. Source code is published at: https://github.com/changqiong/VANICP4Em.git.

[135] Reflection Removal through Efficient Adaptation of Diffusion Transformers

Daniyar Zakarin,Thiemo Wandel,Anton Obukhov,Dengxin Dai

Main category: cs.CV

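TL;DR: This paper proposes a single-image reflection removal framework based on diffusion transformers (DiT). It combines a pretrained DiT foundation model with physically based synthetic data and efficient LoRA fine-tuning, achieving state-of-the-art performance on in-domain and zero-shot benchmarks.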

Details Motivation: Existing reflection removal methods rely on task-specific architectures and lack sufficiently diverse, scalable, and photorealistic training data, which limits generalization and restoration quality. Method: A pretrained DiT foundation model is conditioned on reflection-contaminated images and guided to output clean transmission layers; a physically based rendering (PBR) pipeline built around the Principled BSDF in Blender synthesizes data with realistic glass materials and reflection effects, and LoRA is used for efficient fine-tuning. Result: The method is validated on existing and newly built datasets, and experiments show state-of-the-art results in both in-domain and zero-shot settings. Conclusion: Pretrained diffusion transformers, combined with physically grounded data synthesis and an efficient adaptation strategy, provide a scalable, high-fidelity solution for reflection removal. Abstract: We introduce a diffusion-transformer (DiT) framework for single-image reflection removal that leverages the generalization strengths of foundation diffusion models in the restoration setting. Rather than relying on task-specific architectures, we repurpose a pre-trained DiT-based foundation model by conditioning it on reflection-contaminated inputs and guiding it toward clean transmission layers. We systematically analyze existing reflection removal data sources for diversity, scalability, and photorealism. To address the shortage of suitable data, we construct a physically based rendering (PBR) pipeline in Blender, built around the Principled BSDF, to synthesize realistic glass materials and reflection effects. Efficient LoRA-based adaptation of the foundation model, combined with the proposed synthetic data, achieves state-of-the-art performance on in-domain and zero-shot benchmarks. These results demonstrate that pretrained diffusion transformers, when paired with physically grounded data synthesis and efficient adaptation, offer a scalable and high-fidelity solution for reflection removal. Project page: https://hf.co/spaces/huawei-bayerlab/windowseat-reflection-removal-web

[136] Self-Supervised Learning for Transparent Object Depth Completion Using Depth from Non-Transparent Objects

Xianghui Fan,Zhaoyu Chen,Mengyang Pan,Anping Deng,Hang Yang

Main category: cs.CV

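TL;DR: Proposes a self-supervised method for transparent object depth completion. It simulates depth deficits in non-transparent regions and uses the original depth map as the supervision signal, reducing reliance on annotated data while achieving performance comparable to supervised methods.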

Details Motivation: Because of refraction and reflection, conventional depth sensors struggle to sense the depth of transparent objects, and existing methods depend on large amounts of costly annotated data. Method: A self-supervised depth completion method simulates the depth deficits of transparent objects within non-transparent regions and trains with the original complete depth map as the supervision signal. Result: The method reaches performance comparable to supervised methods in experiments, and pre-training with it improves model performance when training samples are few. Conclusion: The proposed self-supervised method reduces reliance on annotated data and offers an efficient, practical solution for transparent object depth perception. Abstract: The perception of transparent objects is one of the well-known challenges in computer vision. Conventional depth sensors have difficulty in sensing the depth of transparent objects due to refraction and reflection of light. Previous research has typically trained a neural network to complete the depth acquired by the sensor, and this method can quickly acquire accurate depth maps of transparent objects. However, previous training relies on a large amount of annotation data for supervision, and the labeling of depth maps is costly. To tackle this challenge, we propose a new self-supervised method for training depth completion networks. Our method simulates the depth deficits of transparent objects within non-transparent regions and utilizes the original depth map as ground truth for supervision. Experiments demonstrate that our method achieves performance comparable to supervised approaches, and pre-training with our method can improve the model performance when the training samples are small.
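
The self-supervision idea above can be illustrated with a small sketch: depth is deliberately removed inside regions where the sensor reading is trusted, and the untouched reading serves as the training target for exactly those pixels. The square-hole corruption, the function names, and the masked L1 loss are assumptions chosen for illustration; the abstract does not describe how the paper shapes its simulated deficits or which loss it uses.

```python
import random


def simulate_depth_deficit(depth, valid, hole=2, n_holes=3, seed=0):
    """Zero out square patches of depth inside valid (non-transparent) pixels.

    depth: 2D list of floats (sensor depth map)
    valid: 2D list of bools, True where the sensor depth is trusted
    Returns (corrupted_depth, removed_mask); the input depth is not modified.
    """
    rng = random.Random(seed)
    h, w = len(depth), len(depth[0])
    corrupted = [row[:] for row in depth]
    removed = [[False] * w for _ in range(h)]
    for _ in range(n_holes):
        top = rng.randrange(h - hole + 1)
        left = rng.randrange(w - hole + 1)
        for i in range(top, top + hole):
            for j in range(left, left + hole):
                if valid[i][j]:
                    corrupted[i][j] = 0.0  # mimic a missing depth reading
                    removed[i][j] = True
    return corrupted, removed


def masked_l1(pred, target, mask):
    """Mean absolute error over the pixels where depth was removed."""
    errs = [abs(p - t)
            for pr, tr, mr in zip(pred, target, mask)
            for p, t, m in zip(pr, tr, mr) if m]
    return sum(errs) / len(errs) if errs else 0.0


# Toy 6x6 depth map at a constant 1.5 m, with every pixel trusted.
depth = [[1.5] * 6 for _ in range(6)]
valid = [[True] * 6 for _ in range(6)]
corrupted, removed = simulate_depth_deficit(depth, valid)
loss_if_unfilled = masked_l1(corrupted, depth, removed)
```

A completion network would take `corrupted` (plus the RGB image) as input and be trained to minimize `masked_l1` against the original `depth`, so no manual depth annotation is needed.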

[137] Generative Neural Video Compression via Video Diffusion Prior

Qi Mao,Hao Cheng,Tinghan Yang,Libiao Jin,Siwei Ma

Main category: cs.CV

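TL;DR: GNVC-VD is the first DiT-based generative neural video compression framework. It unifies spatio-temporal latent compression with sequence-level generative refinement, using a video diffusion transformer to jointly enhance intra- and inter-frame latents, which markedly reduces perceptual flickering and improves perceptual quality at low bitrates.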

Details Motivation: Existing perceptual video compression methods based on image generative priors lack temporal modeling, causing inter-frame flickering and inconsistency; a new compression framework that exploits video-native generative priors is needed to improve spatio-temporal consistency. Method: GNVC-VD builds an end-to-end trainable generative compression framework on a video diffusion transformer (DiT); a unified flow-matching latent refinement module is initialized from decoded spatio-temporal latents and learns a correction term that adapts to compression-induced degradation; a conditioning adaptor injects compression-aware cues into intermediate layers, achieving strong temporal coherence and artifact removal. Result: At extremely low bitrates (below 0.01 bpp), GNVC-VD surpasses traditional and learned codecs in perceptual quality and significantly reduces the flickering artifacts of prior generative methods, demonstrating the advantage of video-native generative priors in neural compression. Conclusion: GNVC-VD integrates a video generation foundation model into neural video compression, achieving spatio-temporally consistent high-quality reconstruction and showing the potential of video-diffusion-based generative compression for next-generation perceptually optimized coding. Abstract: We present GNVC-VD, the first DiT-based generative neural video compression framework built upon an advanced video generation foundation model, where spatio-temporal latent compression and sequence-level generative refinement are unified within a single codec. Existing perceptual codecs primarily rely on pre-trained image generative priors to restore high-frequency details, but their frame-wise nature lacks temporal modeling and inevitably leads to perceptual flickering. To address this, GNVC-VD introduces a unified flow-matching latent refinement module that leverages a video diffusion transformer to jointly enhance intra- and inter-frame latents through sequence-level denoising, ensuring consistent spatio-temporal details. Instead of denoising from pure Gaussian noise as in video generation, GNVC-VD initializes refinement from decoded spatio-temporal latents and learns a correction term that adapts the diffusion prior to compression-induced degradation. A conditioning adaptor further injects compression-aware cues into intermediate DiT layers, enabling effective artifact removal while maintaining temporal coherence under extreme bitrate constraints. Extensive experiments show that GNVC-VD surpasses both traditional and learned codecs in perceptual quality and significantly reduces the flickering artifacts that persist in prior generative approaches, even below 0.01 bpp, highlighting the promise of integrating video-native generative priors into neural codecs for next-generation perceptual video compression.

[138] HTR-ConvText: Leveraging Convolution and Textual Information for Handwritten Text Recognition

Pham Thach Thanh Truc,Dang Hoai Nam,Huynh Tong Dang Khoa,Vo Nguyen Le Duy

Main category: cs.CV

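TL;DR: This paper proposes HTR-ConvText, a handwritten text recognition model that combines a convolutional neural network with MobileViT to capture local stroke features and global contextual dependencies, generalizing better when data is limited and handwriting styles vary widely.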

Details Motivation: Handwritten text recognition is challenging because of scarce data, diverse writing styles, and complex diacritics, and existing methods generalize poorly without large amounts of synthetic data. Method: HTR-ConvText fuses a residual CNN with a MobileViT block with positional encoding in the feature extraction stage; introduces the ConvText encoder, which integrates global context and local features in a hierarchical structure and shortens the sequence length; and adds an auxiliary module that injects textual context to strengthen CTC. Result: On IAM, READ2016, LAM, and HANDS-VNOnDB, the method outperforms existing approaches in recognition performance and generalization, especially with few training samples and high handwriting diversity. Conclusion: By fusing local detail with global context, HTR-ConvText improves the accuracy and robustness of handwritten text recognition and suits low-resource, high-variability scenarios. Abstract: Handwritten Text Recognition remains challenging due to limited data, high writing style variance, and scripts with complex diacritics. Existing approaches, though they partially address these issues, often struggle to generalize without massive synthetic data. To address these challenges, we propose HTR-ConvText, a model designed to capture fine-grained, stroke-level local features while preserving global contextual dependencies. In the feature extraction stage, we integrate a residual Convolutional Neural Network backbone with a MobileViT with Positional Encoding block. This enables the model to both capture structural patterns and learn subtle writing details. We then introduce the ConvText encoder, a hybrid architecture combining global context and local features within a hierarchical structure that reduces sequence length for improved efficiency. Additionally, an auxiliary module injects textual context to mitigate the weakness of Connectionist Temporal Classification. Evaluations on IAM, READ2016, LAM and HANDS-VNOnDB demonstrate that our approach achieves improved performance and better generalization compared to existing methods, especially in scenarios with limited training samples and high handwriting diversity.

[139] RAMEN: Resolution-Adjustable Multimodal Encoder for Earth Observation

Nicolas Houdré,Diego Marcos,Hugo Riffaud de Turckheim,Dino Ienco,Laurent Wendling,Camille Kurtz,Sylvain Lobry

Main category: cs.CV

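TL;DR: This paper proposes RAMEN, a resolution-adjustable multimodal encoder that learns a shared visual representation of Earth observation data in a sensor-agnostic manner, supporting unified representation and inference across heterogeneous modalities.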

Details Motivation: Existing foundation models usually require fixed input resolutions or depend on sensor-specific encoders, which limits generalization across Earth observation modalities. Method: RAMEN treats modality and spatial and temporal resolution as input features, reconstructs masked multimodal Earth observation data with a single unified Transformer encoder, and exposes spatial resolution as a controllable output parameter. Result: After pretraining, RAMEN outperforms larger state-of-the-art models on the PANGAEA benchmark and transfers effectively to both known and unseen sensor configurations. Conclusion: RAMEN provides unified modeling of multi-source, multi-resolution Earth observation data with an explicit trade-off between precision and computational cost, and shows good generalization and practicality. Abstract: Earth observation (EO) data spans a wide range of spatial, spectral, and temporal resolutions, from high-resolution optical imagery to low-resolution multispectral products or radar time series. While recent foundation models have improved multimodal integration for learning meaningful representations, they often expect fixed input resolutions or are based on sensor-specific encoders, limiting generalization across heterogeneous EO modalities. To overcome these limitations we introduce RAMEN, a resolution-adjustable multimodal encoder that learns a shared visual representation across EO data in a fully sensor-agnostic manner. RAMEN treats the modality and spatial and temporal resolutions as key input data features, enabling coherent analysis across modalities within a unified latent space. Its main methodological contribution is to define spatial resolution as a controllable output parameter, giving users direct control over the desired level of detail at inference and allowing explicit trade-offs between spatial precision and computational cost. We train a single, unified transformer encoder reconstructing masked multimodal EO data drawn from diverse sources, ensuring generalization across sensors and resolutions. Once pretrained, RAMEN transfers effectively to both known and unseen sensor configurations and outperforms larger state-of-the-art models on the community-standard PANGAEA benchmark, containing various multi-sensor and multi-resolution downstream tasks. Our code and pretrained model are available at https://github.com/nicolashoudre/RAMEN.

[140] Semantic-Guided Two-Stage GAN for Face Inpainting with Hybrid Perceptual Encoding

Abhigyan Bhattacharya,Hiranmoy Roy

Main category: cs.CV

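TL;DR: This paper proposes a semantic-guided hierarchical generative architecture that addresses the structural inconsistency and blurry textures caused by large missing regions in face inpainting, outperforming existing methods on CelebA-HQ and FFHQ.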

Details Motivation: Existing methods produce blurry edges, semantic inconsistencies, and distorted facial structures under large irregular masks, mainly because of direct pixel-level generation and insufficient use of facial priors. Method: A semantic-guided two-stage hierarchical synthesis approach: the first stage combines a CNN with a Vision Transformer to fuse local and global features and generate a semantic layout; the second stage uses a multi-modal texture generator for cross-scale detail refinement, with dynamic attention that adapts to arbitrary mask shapes. Result: The method achieves better LPIPS, PSNR, and SSIM on CelebA-HQ and FFHQ, with stronger semantic preservation and visual realism, especially in large-area inpainting. Conclusion: The proposed semantic-guided hierarchical architecture improves face inpainting quality, particularly for complex large-area missing regions, and generalizes to arbitrary mask configurations without mask-specific training. Abstract: Facial image inpainting aims to restore missing or corrupted regions in face images while preserving identity, structural consistency, and photorealistic image quality, a task of particular value for photo restoration. Despite many recent advances in deep generative models, existing methods struggle with large irregular masks, often producing blurry textures at the edges of the masked region, semantic inconsistencies, or unconvincing facial structures, owing to direct pixel-level synthesis and limited exploitation of facial priors. In this paper we propose a novel architecture that addresses these challenges through semantic-guided hierarchical synthesis. Our approach first organizes and synthesizes information at the semantic level and then refines the texture, which yields a clear picture of the facial structure before detailed images are generated. In the first stage, we blend two techniques: CNNs that focus on local features and Vision Transformers that capture global features. This produces clear and detailed semantic layouts. In the second stage, we use a Multi-Modal Texture Generator to refine these layouts by drawing on information from different scales, keeping the result cohesive and consistent. The architecture naturally handles arbitrary mask configurations through dynamic attention without mask-specific training. Experiments on two datasets, CelebA-HQ and FFHQ, show that our model outperforms other state-of-the-art methods, with improvements in metrics such as LPIPS, PSNR, and SSIM. It produces visually striking results with better semantic preservation in challenging large-area inpainting situations.

[141] Joint 3D Geometry Reconstruction and Motion Generation for 4D Synthesis from a Single Image

Yanran Zhang,Ziyi Wang,Wenzhao Zheng,Zheng Zhu,Jie Zhou,Jiwen Lu

Main category: cs.CV

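TL;DR: This paper proposes MoRe4D, a framework for generating interactive, dynamic 4D scenes from a single static image. By performing motion generation and geometric reconstruction jointly, it addresses the spatiotemporal inconsistency that arises when existing methods separate geometry from motion. The authors build TrajScene-60K, a large-scale trajectory dataset, and design a diffusion-based module, 4D-STraG, that generates geometrically consistent and motion-plausible 4D point trajectories, using depth-guided motion normalization and a motion-aware module to integrate geometry and dynamics. They also propose 4D-ViSM to render videos along arbitrary camera paths. Experiments show that MoRe4D generates high-quality, multi-view-consistent 4D scenes with rich dynamic detail from a single image.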

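Details Motivation: Most existing 4D scene generation methods decouple geometric reconstruction from motion generation, leading to spatiotemporal inconsistency and poor generalization. High-quality 4D scene data is also scarce, which limits training and evaluation. A method is therefore needed that models geometry and motion jointly and makes full use of single-view priors to improve generation quality and consistency. Method: The MoRe4D framework: (1) builds TrajScene-60K, a large-scale point-trajectory dataset of 60,000 video samples; (2) designs a diffusion-based 4D Scene Trajectory Generator (4D-STraG) that jointly generates 4D point trajectories; (3) introduces a depth-guided motion normalization strategy and a motion-aware module to incorporate single-view priors; (4) develops a 4D View Synthesis Module (4D-ViSM) for video synthesis from arbitrary viewpoints. Result: MoRe4D outperforms existing methods on multiple metrics and generates high-quality 4D scenes with multi-view consistency, rich dynamic detail, and spatial coherence. Qualitative and quantitative experiments both support its advantages in motion plausibility and geometric accuracy. Conclusion: By modeling geometry and motion jointly, MoRe4D improves the quality and consistency of dynamic 4D scene generation from a single image. The proposed trajectory generation framework and dataset offer a new direction and resources for future 4D scene synthesis research. Abstract: Generating interactive and dynamic 4D scenes from a single static image remains a core challenge. Most existing generate-then-reconstruct and reconstruct-then-generate methods decouple geometry from motion, causing spatiotemporal inconsistencies and poor generalization. To address these, we extend the reconstruct-then-generate framework to jointly perform Motion generation and geometric Reconstruction for 4D Synthesis (MoRe4D). We first introduce TrajScene-60K, a large-scale dataset of 60,000 video samples with dense point trajectories, addressing the scarcity of high-quality 4D scene data. Based on this, we propose a diffusion-based 4D Scene Trajectory Generator (4D-STraG) to jointly generate geometrically consistent and motion-plausible 4D point trajectories. To leverage single-view priors, we design a depth-guided motion normalization strategy and a motion-aware module for effective geometry and dynamics integration. We then propose a 4D View Synthesis Module (4D-ViSM) to render videos with arbitrary camera trajectories from 4D point track representations. Experiments show that MoRe4D generates high-quality 4D scenes with multi-view consistency and rich dynamic details from a single image. Code: https://github.com/Zhangyr2022/MoRe4D.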

[142] 4DLangVGGT: 4D Language-Visual Geometry Grounded Transformer

Xianfeng Wu,Yajing Bai,Minghan Li,Xianzu Wu,Xueqi Zhao,Zhongyuan Lai,Wenyu Liu,Xinggang Wang

Main category: cs.CV

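TL;DR: Proposes 4DLangVGGT, the first Transformer-based feed-forward unified framework for 4D language grounding, enabling efficient cross-scene training and strong generalization.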

Details Motivation: Existing methods for building 4D semantic fields rely on Gaussian splatting with per-scene optimization, which generalizes poorly and is hard to scale. Method: The 4DLangVGGT framework comprises a 4D Visual Geometry Transformer (StreamVGGT) and a Semantic Bridging Decoder (SBD), jointly modeling spatio-temporal geometry and language-aligned semantics. Result: It achieves state-of-the-art performance on the HyperNeRF and Neu3D datasets, with gains of up to 2% under per-scene training and 1% under multi-scene training. Conclusion: 4DLangVGGT achieves 4D language grounding without per-scene optimization, offers good deployment efficiency and generalization, and advances open-vocabulary 4D scene understanding. Abstract: Constructing 4D language fields is crucial for embodied AI, augmented/virtual reality, and 4D scene understanding, as they provide enriched semantic representations of dynamic environments and enable open-vocabulary querying in complex scenarios. However, existing approaches to 4D semantic field construction primarily rely on scene-specific Gaussian splatting, which requires per-scene optimization, exhibits limited generalization, and is difficult to scale to real-world applications. To address these limitations, we propose 4DLangVGGT, the first Transformer-based feed-forward unified framework for 4D language grounding, that jointly integrates geometric perception and language alignment within a single architecture. 4DLangVGGT has two key components: the 4D Visual Geometry Transformer, StreamVGGT, which captures spatio-temporal geometric representations of dynamic scenes; and the Semantic Bridging Decoder (SBD), which projects geometry-aware features into a language-aligned semantic space, thereby enhancing semantic interpretability while preserving structural fidelity. Unlike prior methods that depend on costly per-scene optimization, 4DLangVGGT can be jointly trained across multiple dynamic scenes and directly applied during inference, achieving both deployment efficiency and strong generalization. This design significantly improves the practicality of large-scale deployment and establishes a new paradigm for open-vocabulary 4D scene understanding. Experiments on HyperNeRF and Neu3D datasets demonstrate that our approach not only generalizes effectively but also achieves state-of-the-art performance, achieving up to 2% gains under per-scene training and 1% improvements under multi-scene training. Our code is released at https://github.com/hustvl/4DLangVGGT

[143] BulletTime: Decoupled Control of Time and Camera Pose for Video Generation

Yiming Wang,Qihang Zhang,Shengqu Cai,Tong Wu,Jan Ackermann,Zhengfei Kuang,Yang Zheng,Frano Rajič,Siyu Tang,Gordon Wetzstein

Main category: cs.CV

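TL;DR: This paper proposes a 4D-controllable video diffusion framework that explicitly decouples scene dynamics from camera pose, enabling fine-grained control over both scene dynamics and camera viewpoint.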

Details Motivation: Existing video diffusion models couple scene dynamics with camera motion, which limits precise control over space and time. Method: A 4D positional encoding and adaptive normalization are introduced, and a video diffusion model with decoupled control is trained on continuous world-time sequences and camera trajectories as conditioning inputs. Result: Trained on a newly curated dataset, the model achieves robust real-world 4D control across diverse timing patterns and camera trajectories, with high generation quality and better controllability than prior work. Conclusion: The framework achieves decoupled control of scene dynamics and camera motion and improves spatio-temporal controllability in video generation. Abstract: Emerging video diffusion models achieve high visual fidelity but fundamentally couple scene dynamics with camera motion, limiting their ability to provide precise spatial and temporal control. We introduce a 4D-controllable video diffusion framework that explicitly decouples scene dynamics from camera pose, enabling fine-grained manipulation of both scene dynamics and camera viewpoint. Our framework takes continuous world-time sequences and camera trajectories as conditioning inputs, injecting them into the video diffusion model through a 4D positional encoding in the attention layer and adaptive normalizations for feature modulation. To train this model, we curate a unique dataset in which temporal and camera variations are independently parameterized; this dataset will be made public. Experiments show that our model achieves robust real-world 4D control across diverse timing patterns and camera trajectories, while preserving high generation quality and outperforming prior work in controllability. See our website for video results: https://19reborn.github.io/Bullet4D/

[144] Object Reconstruction under Occlusion with Generative Priors and Contact-induced Constraints

Minghan Zhu,Zhiyi Wang,Qihang Sun,Maani Ghaffari,Michael Posa

Main category: cs.CV

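TL;DR: This paper proposes contact-guided 3D generation, which combines generative models with contact information to improve object geometry reconstruction for robot manipulation.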

Details Motivation: Cameras capture only partial views of an object, especially under occlusion, so object reconstruction is challenging and the ambiguity of visual signals needs to be reduced. Method: Generative models supply priors over common object shapes, and contact information obtained from videos and physical interaction supplies sparse constraints on the boundary of the geometry; the two are combined through contact-guided 3D generation. Result: Experiments on synthetic and real-world data show improved reconstruction compared with pure 3D generation and contact-based optimization. Conclusion: Combining generative priors with contact constraints improves object geometry reconstruction under partial observation. Abstract: Object geometry is key information for robot manipulation. Yet, object reconstruction is a challenging task because cameras only capture partial observations of objects, especially when occlusion occurs. In this paper, we leverage two extra sources of information to reduce the ambiguity of vision signals. First, generative models learn priors of the shapes of commonly seen objects, allowing us to make reasonable guesses of the unseen part of geometry. Second, contact information, which can be obtained from videos and physical interactions, provides sparse constraints on the boundary of the geometry. We combine the two sources of information through contact-guided 3D generation. The guidance formulation is inspired by drag-based editing in generative models. Experiments on synthetic and real-world data show that our approach improves the reconstruction compared to pure 3D generation and contact-based optimization.

[145] Deep Forcing: Training-Free Long Video Generation with Deep Sink and Participative Compression

Jung Yi,Wooseok Jang,Paul Hyunbin Cho,Jisu Nam,Heeji Yoon,Seungryong Kim

Main category: cs.CV

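TL;DR: This paper proposes Deep Forcing, a training-free method that improves long-sequence generation in autoregressive video diffusion models. Its Deep Sink and Participative Compression mechanisms mitigate temporal repetition, drift, and motion deceleration, enabling over 12x extrapolation while remaining real-time.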

Details Motivation: Existing video diffusion models suffer from temporal repetition, context drift, and motion slowdown in long-sequence generation, and directly applying StreamingLLM-style attention sinks degrades image quality and stalls motion, so a method that stabilizes long-horizon generation without fine-tuning is needed. Method: Deep Forcing has two core components: 1) Deep Sink fixes half of the sliding window as persistent sink tokens and re-aligns their temporal RoPE phase to maintain global context; 2) Participative Compression prunes the KV cache by importance, keeping only the tokens that participate in recent attention and discarding redundant history to reduce error accumulation. Result: Without any fine-tuning, the method extends 5-second-trained models to 60+ seconds of generation (over 12x extrapolation), with better imaging quality than LongLive, better aesthetic quality than RollingForcing, nearly preserved overall consistency, and a substantially higher dynamic degree, while maintaining real-time generation speed. Conclusion: Training-free KV-cache management such as Deep Forcing can match or exceed fine-tuning-based methods and shows strong potential for autoregressive streaming video generation. Abstract: Recent advances in autoregressive video diffusion have enabled real-time frame streaming, yet existing solutions still suffer from temporal repetition, drift, and motion deceleration. We find that naively applying StreamingLLM-style attention sinks to video diffusion leads to fidelity degradation and motion stagnation. To overcome this, we introduce Deep Forcing, which consists of two training-free mechanisms that address these issues without any fine-tuning. Specifically, 1) Deep Sink dedicates half of the sliding window to persistent sink tokens and re-aligns their temporal RoPE phase to the current timeline, stabilizing global context during long rollouts. 2) Participative Compression performs importance-aware KV cache pruning that preserves only tokens actively participating in recent attention while safely discarding redundant and degraded history, minimizing error accumulation under out-of-distribution length generation. Together, these components enable over 12x extrapolation (e.g. 5s-trained to 60s+ generation) with better imaging quality than LongLive, better aesthetic quality than RollingForcing, nearly maintained overall consistency, and substantial gains in dynamic degree, all while maintaining real-time generation. Our results demonstrate that training-free KV-cache management can match or exceed training-based approaches for autoregressively streaming long-video generation.
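
The cache policy described in this abstract (half of the window reserved for sink tokens, the rest filled by importance) can be illustrated with a small sketch. This is not the authors' implementation: the function name `select_kv_indices`, the choice of the earliest tokens as the sink, the tie-breaking rule, and the way `importance` is supplied are all assumptions made for illustration, and the RoPE phase re-alignment step is omitted.

```python
def select_kv_indices(num_cached, window, importance):
    """Return the cache positions to keep, in ascending order.

    num_cached: number of tokens currently in the KV cache
    window:     total cache budget (sink slots plus non-sink slots)
    importance: per-token score, e.g. attention mass received from
                recent queries (how it is computed is an assumption)
    """
    if num_cached <= window:
        return list(range(num_cached))      # nothing to prune yet

    sink_size = window // 2                 # half of the window is sink
    budget = window - sink_size             # slots left for other tokens

    # Assumption: the earliest tokens serve as the persistent sink.
    sink = list(range(sink_size))
    candidates = range(sink_size, num_cached)

    # Keep the highest-scoring non-sink tokens; ties favour recency.
    ranked = sorted(candidates, key=lambda i: (importance[i], i), reverse=True)
    kept = sorted(ranked[:budget])
    return sink + kept
```

With ten cached tokens and a window of six, the first three positions stay as the sink and the three highest-scoring later positions fill the remaining slots.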

[146] Visual Reasoning Tracer: Object-Level Grounded Reasoning Benchmark

Haobo Yuan,Yueyi Sun,Yanwei Li,Tao Zhang,Xueqing Deng,Henghui Ding,Lu Qi,Anran Wang,Xiangtai Li,Ming-Hsuan Yang

Main category: cs.CV

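TL;DR: This paper proposes the Visual Reasoning Tracer (VRT) task, which requires multimodal large models not only to localize the target object but also to explicitly predict the intermediate objects that form the reasoning path. It releases the VRT-Bench benchmark, an evaluation metric, and the large-scale training set VRT-80k; experiments show that models trained on VRT-80k trace reasoning paths better.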

Details Motivation: Existing multimodal large language models perform well on visual tasks, but their reasoning is opaque: they do not expose intermediate reasoning steps or fine-grained evidence, which falls short of the human ability to reason through a visual chain. Method: The Visual Reasoning Tracer (VRT) task is proposed, together with the VRT-Bench evaluation set, the VRT-80k training set, and a new metric for the quality of model-generated reasoning paths. Result: Experiments show that existing models often produce correct final outputs but are weak at grounding intermediate reasoning steps, whereas models trained on VRT-80k improve substantially at tracing the reasoning path. Conclusion: The VRT task and its datasets can help move multimodal models toward more transparent, interpretable visual reasoning. Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have significantly improved performance on tasks such as visual grounding and visual question answering. However, the reasoning processes of these models remain largely opaque; they typically output only final predictions without revealing the intermediate steps or fine-grained evidence (e.g., pixels, locations) that lead to the result. This contrasts with human intelligence, which naturally operates through a chain of visual reasoning. To address this limitation, we introduce the Visual Reasoning Tracer (VRT) task, which requires models to not only localize the target object but also explicitly predict the intermediate objects that form the reasoning path. To advance research in this area, we contribute: (1) VRT-Bench, a human-annotated benchmark for evaluating visual reasoning; (2) a new metric for assessing the quality of reasoning traces; and (3) VRT-80k, a large-scale dataset for reasoning model training. Our experiments reveal that while existing models often produce the correct final output, they struggle to ground their intermediate reasoning. In contrast, models trained on VRT-80k achieve substantial improvements in tracing the reasoning path.

[147] SA-IQA: Redefining Image Quality Assessment for Spatial Aesthetics with Multi-Dimensional Rewards

Yuan Gao,Jin Song

Main category: cs.CV

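TL;DR: This paper proposes Spatial Aesthetics, a new paradigm for assessing the aesthetic quality of interior images, and builds SA-BENCH, the first large-scale benchmark for it. On this basis it develops SA-IQA as a comprehensive assessment framework and validates it on generation optimization and image filtering tasks.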

Details Motivation: Existing quality assessment methods for AI-generated images mainly target portraits and artistic images and lack a systematic evaluation of the spatial aesthetics of interior scenes, which limits AIGC applications in areas such as interior design. Method: Four evaluation dimensions are defined (layout, harmony, lighting, distortion), and the SA-BENCH benchmark is built with 18,000 images and 50,000 annotations; the SA-IQA model is developed through MLLM fine-tuning and a multi-dimensional fusion strategy, then used as a reward signal for GRPO reinforcement learning and for Best-of-N selection. Result: SA-IQA significantly outperforms existing IQA methods on SA-BENCH and improves the quality of AIGC-generated images in downstream tasks. Conclusion: SA-IQA sets a new standard for spatial aesthetics evaluation of interior scenes and supports AIGC applications such as interior design; code and dataset will be open-sourced. Abstract: In recent years, Image Quality Assessment (IQA) for AI-generated images (AIGI) has advanced rapidly; however, existing methods primarily target portraits and artistic images, lacking a systematic evaluation of interior scenes. We introduce Spatial Aesthetics, a paradigm that assesses the aesthetic quality of interior images along four dimensions: layout, harmony, lighting, and distortion. We construct SA-BENCH, the first benchmark for spatial aesthetics, comprising 18,000 images and 50,000 precise annotations. Employing SA-BENCH, we systematically evaluate current IQA methodologies and develop SA-IQA, through MLLM fine-tuning and a multidimensional fusion approach, as a comprehensive reward framework for assessing spatial aesthetics. We apply SA-IQA to two downstream tasks: (1) serving as a reward signal integrated with GRPO reinforcement learning to optimize the AIGC generation pipeline, and (2) Best-of-N selection to filter high-quality images and improve generation quality. Experiments indicate that SA-IQA significantly outperforms existing methods on SA-BENCH, setting a new standard for spatial aesthetics evaluation. Code and dataset will be open-sourced to advance research and applications in this domain.
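
To make the Best-of-N use of a multi-dimensional reward concrete, here is a minimal sketch. The four dimension names come from the abstract; the weighted-mean fusion, the function names `fuse_scores` and `best_of_n`, and the example scores are assumptions, since the abstract does not specify how SA-IQA fuses its dimensions.

```python
DIMENSIONS = ("layout", "harmony", "lighting", "distortion")

def fuse_scores(scores, weights=None):
    """Fuse per-dimension scores into one scalar reward.

    Assumption: a weighted mean stands in for the paper's fusion
    strategy, which the abstract does not describe.
    """
    if weights is None:
        weights = {d: 1.0 for d in DIMENSIONS}
    total = sum(weights[d] for d in DIMENSIONS)
    return sum(weights[d] * scores[d] for d in DIMENSIONS) / total

def best_of_n(candidates, weights=None):
    """Pick the candidate id with the highest fused reward.

    candidates: list of (image_id, per-dimension score dict) pairs.
    """
    return max(candidates, key=lambda c: fuse_scores(c[1], weights))[0]
```

The same fused scalar could in principle serve as the per-sample reward in a GRPO-style loop; that wiring is outside the scope of this sketch.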

[148] EvoIR: Towards All-in-One Image Restoration via Evolutionary Frequency Modulation

Jiaqi Ma,Shengkai Hu,Jun Wan,Jiaxing Huang,Lefei Zhang,Salman Khan

Main category: cs.CV

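TL;DR: This paper proposes EvoIR, an all-in-one image restoration framework that introduces evolutionary frequency modulation and an evolutionary optimization strategy to model frequency information explicitly and adjust restoration objectives dynamically, yielding better performance across diverse degradations.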

Details Motivation: Existing all-in-one image restoration methods usually lack explicit frequency modeling and rely on fixed optimization strategies, which limits generalization across heterogeneous degradations. Method: The EvoIR framework includes a Frequency-Modulated Module (FMM) that explicitly decomposes features into high- and low-frequency branches and modulates them adaptively, and an Evolutionary Optimization Strategy (EOS) that adjusts frequency-aware objectives through a population-based evolutionary process, balancing structural accuracy and perceptual quality. Result: Experiments on multiple benchmarks show that EvoIR outperforms current state-of-the-art all-in-one restoration methods and that FMM and EOS are complementary. Conclusion: Through explicit frequency modeling and evolutionary optimization, EvoIR improves the robustness and generality of image restoration under complex degradations and offers a new direction for all-in-one restoration. Abstract: All-in-One Image Restoration (AiOIR) tasks often involve diverse degradations that require robust and versatile strategies. However, most existing approaches lack explicit frequency modeling and rely on fixed or heuristic optimization schedules, which limit generalization across heterogeneous degradations. To address these limitations, we propose EvoIR, an AiOIR-specific framework that introduces evolutionary frequency modulation for dynamic and adaptive image restoration. Specifically, EvoIR employs the Frequency-Modulated Module (FMM) that decomposes features into high- and low-frequency branches in an explicit manner and adaptively modulates them to enhance both structural fidelity and fine-grained details. Central to EvoIR, an Evolutionary Optimization Strategy (EOS) iteratively adjusts frequency-aware objectives through a population-based evolutionary process, dynamically balancing structural accuracy and perceptual fidelity. Its evolutionary guidance further mitigates gradient conflicts across degradations and accelerates convergence. By synergizing FMM and EOS, EvoIR yields greater improvements than using either component alone, underscoring their complementary roles. Extensive experiments on multiple benchmarks demonstrate that EvoIR outperforms state-of-the-art AiOIR methods.
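
The idea of splitting a signal into low- and high-frequency branches and re-weighting them can be shown on a 1D toy signal. This is only an analogy for the FMM described above: the moving-average low-pass filter, the fixed scalar gains, and the names `split_frequencies` and `modulate` are assumptions, whereas the actual module operates on learned features with adaptive modulation.

```python
def split_frequencies(x, radius=1):
    """Split x into a smooth (low) part and a residual (high) part.

    Assumption: a moving average with the given radius stands in for
    whatever frequency decomposition the real module uses.
    """
    n = len(x)
    low = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        window = x[lo:hi]                  # shrinks at the borders
        low.append(sum(window) / len(window))
    high = [xi - li for xi, li in zip(x, low)]
    return low, high

def modulate(x, low_gain, high_gain, radius=1):
    """Recombine the two branches with separate gains."""
    low, high = split_frequencies(x, radius)
    return [low_gain * l + high_gain * h for l, h in zip(low, high)]
```

Setting both gains to 1 reconstructs the input, while a `high_gain` above 1 sharpens detail and a value below 1 smooths it.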

[149] NeuralRemaster: Phase-Preserving Diffusion for Structure-Aligned Generation

Yu Zeng,Charles Ochoa,Mingyuan Zhou,Vishal M. Patel,Vitor Guizilini,Rowan McAllister

Main category: cs.CV

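TL;DR: The paper proposes φ-PD, a phase-preserving diffusion model that keeps the input phase and randomizes the magnitude, achieving structure-aligned generation without architectural changes or extra parameters. It also introduces FSS noise for continuous control of structural rigidity, and applies to image-to-image generation for both images and videos.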

Details Motivation: Standard diffusion destroys the spatial structure of data because, in the Fourier domain, it uses Gaussian noise with random magnitudes and phases, which makes it ill-suited to tasks requiring geometric consistency such as re-rendering, simulation enhancement, and image-to-image translation; a structure-preserving method is therefore needed. Method: Phase-Preserving Diffusion (φ-PD) keeps the input's phase during diffusion and randomizes only the magnitude; Frequency-Selective Structured (FSS) noise provides continuous control over structural rigidity through a frequency-cutoff parameter. The method requires no architectural changes or additional parameters and adds no inference overhead. Result: φ-PD produces controllable, spatially aligned results in photorealistic and stylized re-rendering and in sim-to-real enhancement for driving planners; applied to the CARLA simulator, it improves CARLA-to-Waymo planner performance by 50%. Conclusion: φ-PD is a general, efficient, parameter-free reformulation of diffusion that achieves high-quality generation while preserving spatial structure, applies broadly to image-to-image and video-to-video tasks, and complements existing conditioning methods. Abstract: Standard diffusion corrupts data using Gaussian noise whose Fourier coefficients have random magnitudes and random phases. While effective for unconditional or text-to-image generation, corrupting phase components destroys spatial structure, making it ill-suited for tasks requiring geometric consistency, such as re-rendering, simulation enhancement, and image-to-image translation. We introduce Phase-Preserving Diffusion φ-PD, a model-agnostic reformulation of the diffusion process that preserves input phase while randomizing magnitude, enabling structure-aligned generation without architectural changes or additional parameters. We further propose Frequency-Selective Structured (FSS) noise, which provides continuous control over structural rigidity via a single frequency-cutoff parameter. φ-PD adds no inference-time cost and is compatible with any diffusion model for images or videos. Across photorealistic and stylized re-rendering, as well as sim-to-real enhancement for driving planners, φ-PD produces controllable, spatially aligned results. When applied to the CARLA simulator, φ-PD improves CARLA-to-Waymo planner performance by 50%. The method is complementary to existing conditioning approaches and broadly applicable to image-to-image and video-to-video generation. Videos, additional examples, and code are available on our project page: https://yuzeng-at-tri.github.io/ppd-page/
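
The core operation described here, noise that keeps the input's Fourier phase but takes its magnitude from Gaussian noise, can be sketched on a 1D signal with only the standard library. This is an illustrative reading of the abstract rather than the paper's code: the naive DFT helpers, the function name `phase_preserving_noise`, and the 1D setting are assumptions, and FSS noise is not modelled.

```python
import cmath
import math
import random

def dft(x):
    """Naive O(n^2) discrete Fourier transform of a 1D sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse of dft; returns complex samples."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)) / n
            for t in range(n)]

def phase_preserving_noise(signal, rng):
    """Noise with the Fourier phase of `signal` and Gaussian-noise magnitudes."""
    signal_spec = dft(signal)
    noise_spec = dft([rng.gauss(0.0, 1.0) for _ in signal])
    # Take each magnitude from the noise and each phase from the signal.
    mixed = [abs(nz) * cmath.exp(1j * cmath.phase(s))
             for s, nz in zip(signal_spec, noise_spec)]
    # For real inputs the mixed spectrum is (numerically) Hermitian,
    # so the imaginary parts are rounding error and are dropped.
    return [v.real for v in idft(mixed)]
```

A call such as `phase_preserving_noise(signal, random.Random(0))` yields a sample whose per-frequency phases match `signal` while its magnitudes are random; for images the same idea would use a 2D FFT.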

[150] ShadowDraw: From Any Object to Shadow-Drawing Compositional Art

Rundong Luo,Noah Snavely,Wei-Chiu Ma

Main category: cs.CV

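TL;DR: ShadowDraw is a framework that turns ordinary 3D objects into shadow-drawing art: it optimizes scene parameters and a line drawing so that the cast shadow completes the image, combining algorithmic design with artistic storytelling.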

Details Motivation: The work explores the design space of computational visual art by combining 3D objects with shadow art to enable algorithmically generated artistic expression. Method: The system predicts scene parameters such as object pose and lighting together with a partial line drawing, optimizes scene configurations to produce meaningful shadows, uses the shadow to guide line-drawing generation, and applies automatic evaluation to enforce shadow-drawing coherence and visual quality. Result: Experiments show that ShadowDraw produces compelling results across diverse inputs (real-world scans, curated datasets, and generative assets) and extends naturally to multi-object scenes, animations, and physical deployments. Conclusion: ShadowDraw provides a practical pipeline for creating shadow-drawing art, broadens the possibilities of computational visual art, and bridges the gap between algorithmic design and artistic creation. Abstract: We introduce ShadowDraw, a framework that transforms ordinary 3D objects into shadow-drawing compositional art. Given a 3D object, our system predicts scene parameters, including object pose and lighting, together with a partial line drawing, such that the cast shadow completes the drawing into a recognizable image. To this end, we optimize scene configurations to reveal meaningful shadows, employ shadow strokes to guide line drawing generation, and adopt automatic evaluation to enforce shadow-drawing coherence and visual quality. Experiments show that ShadowDraw produces compelling results across diverse inputs, from real-world scans and curated datasets to generative assets, and naturally extends to multi-object scenes, animations, and physical deployments. Our work provides a practical pipeline for creating shadow-drawing art and broadens the design space of computational visual art, bridging the gap between algorithmic design and artistic storytelling. Check out our project page https://red-fairy.github.io/ShadowDraw/ for more results and an end-to-end real-world demonstration of our pipeline!

[151] ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning

Shengyuan Ding,Xinyu Fang,Ziyu Liu,Yuhang Zang,Yuhang Cao,Xiangyu Zhao,Haodong Duan,Xiaoyi Dong,Jianze Liang,Bin Wang,Conghui He,Dahua Lin,Jiaqi Wang

Main category: cs.CV

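TL;DR: ARM-Thinker is a new agentic multimodal reward model that autonomously invokes external tools (such as image cropping and document retrieval) to verify visual details and reasoning claims, improving the accuracy and interpretability of its judgments on complex multimodal tasks.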

Details Motivation: Existing reward models suffer from hallucination, weak visual grounding, and an inability to use tools for verification, which limits their reliability on complex multimodal reasoning tasks. Method: ARM-Thinker autonomously invokes external tools to make dynamic, verifiable reward judgments and is trained with multi-stage reinforcement learning that jointly optimizes tool-calling decisions and judgment accuracy. The ARMBench-VL evaluation suite is also built, covering three tasks: fine-grained visual grounding, multi-page document understanding, and instruction following. Result: ARM-Thinker achieves a +16.2% average improvement on reward modeling benchmarks and +9.6% on tool-use tasks, and outperforms baselines on multimodal math and logical reasoning tasks. Conclusion: Giving reward models agentic capabilities such as tool use significantly improves their accuracy and interpretability, pointing to a new direction for reliable multimodal alignment systems. Abstract: Reward models are critical for aligning vision-language systems with human preferences, yet current approaches suffer from hallucination, weak visual grounding, and an inability to use tools for verification, limiting their reliability on complex multimodal reasoning tasks. We present ARM-Thinker, an Agentic multimodal Reward Model that autonomously invokes external tools (e.g., image cropping, doc page retrieval) to ground judgments in verifiable evidence, replacing static, non-interactive reward scoring. This enables the model to verify fine-grained visual details, cross-reference multi-page evidence, and validate reasoning claims, which are capabilities absent in existing reward models. We train ARM-Thinker with multi-stage reinforcement learning, jointly optimizing tool-calling decisions and judgment accuracy. To evaluate agentic reward modeling, we introduce ARMBench-VL, comprising three benchmarks that assess fine-grained visual grounding (image-level tools), multi-page document understanding (retrieval tools), and instruction following (text-level verification). ARM-Thinker achieves +16.2% average improvement on reward modeling benchmarks, +9.6% on tool-use tasks, and outperforms baselines on multimodal math and logical reasoning benchmarks. Our results demonstrate that agentic capabilities significantly enhance both accuracy and interpretability of reward models.

[152] Splannequin: Freezing Monocular Mannequin-Challenge Footage with Dual-Detection Splatting

Hao-Jen Chien,Yi-Chuan Huang,Chung-Ho Wu,Wei-Lun Chao,Yu-Lun Liu

Main category: cs.CV

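TL;DR: The paper proposes Splannequin, a temporal-anchoring regularization for dynamic Gaussian splatting that improves the quality of high-fidelity frozen 3D scenes generated from monocular MC videos.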

Details Motivation: Synthesizing high-quality frozen 3D scenes from monocular Mannequin-Challenge videos suffers from ghosting and blur caused by sparse temporal supervision; subtle dynamics must be preserved while rendering artifacts are avoided. Method: Splannequin detects hidden and defective states of Gaussian primitives within a dynamic Gaussian framework, anchors hidden states to their most recent well-observed past states and defective states to future states with stronger supervision, and integrates into existing pipelines through simple loss terms. Result: Visual quality improves markedly, with less ghosting and blur, enabling user-selectable high-fidelity frozen-time rendering that was preferred in 96% of user-study comparisons. Conclusion: Splannequin is an architecture-agnostic regularization with no inference overhead that improves frozen-frame reconstruction in monocular dynamic scenes. Abstract: Synthesizing high-fidelity frozen 3D scenes from monocular Mannequin-Challenge (MC) videos is a unique problem distinct from standard dynamic scene reconstruction. Instead of focusing on modeling motion, our goal is to create a frozen scene while strategically preserving subtle dynamics to enable user-controlled instant selection. To achieve this, we introduce a novel application of dynamic Gaussian splatting: the scene is modeled dynamically, which retains nearby temporal variation, and a static scene is rendered by fixing the model's time parameter. However, under this usage, monocular capture with sparse temporal supervision introduces artifacts like ghosting and blur for Gaussians that become unobserved or occluded at weakly supervised timestamps. We propose Splannequin, an architecture-agnostic regularization that detects two states of Gaussian primitives, hidden and defective, and applies temporal anchoring. Under predominantly forward camera motion, hidden states are anchored to their recent well-observed past states, while defective states are anchored to future states with stronger supervision. Our method integrates into existing dynamic Gaussian pipelines via simple loss terms, requires no architectural changes, and adds zero inference overhead. This results in markedly improved visual quality, enabling high-fidelity, user-selectable frozen-time renderings, validated by a 96% user preference. Project page: https://chien90190.github.io/splannequin/
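
The temporal anchoring rule (hidden states look to the past, defective states look to the future) can be sketched for a single scalar Gaussian parameter tracked over time. Everything below is a hypothetical simplification: the state labels `"ok"`, `"hidden"`, and `"defective"`, the nearest-neighbour anchor search, the squared-error penalty, and the function names are assumptions; the paper's detection criteria and loss terms are not given in the abstract.

```python
def find_anchor(states, t):
    """Index of the anchor timestamp for t, or None if none applies.

    Assumption: "ok" marks a well-observed timestamp; hidden states
    search backwards in time and defective states search forwards.
    """
    if states[t] == "hidden":
        search = range(t - 1, -1, -1)
    elif states[t] == "defective":
        search = range(t + 1, len(states))
    else:
        return None
    for u in search:
        if states[u] == "ok":
            return u
    return None

def anchoring_loss(values, states):
    """Sum of squared distances between flagged values and their anchors."""
    loss = 0.0
    for t in range(len(values)):
        anchor = find_anchor(states, t)
        if anchor is not None:
            loss += (values[t] - values[anchor]) ** 2
    return loss
```

In a training pipeline the anchor value would typically be treated as a constant target so that gradients only move the flagged state; that detail is also an assumption here.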

[153] Light-X: Generative 4D Video Rendering with Camera and Illumination Control

Tianqi Liu,Zhaoxi Chen,Zihao Huang,Shaocong Xu,Saining Zhang,Chongjie Ye,Bohan Li,Zhiguo Cao,Wei Li,Hao Zhao,Ziwei Liu

Main category: cs.CV

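TL;DR: This paper proposes Light-X, a video generation framework that provides viewpoint and illumination control from monocular videos. By decoupling geometry and lighting signals and introducing the Light-Syn data synthesis pipeline, it outperforms existing methods in joint control of camera trajectory and illumination.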

Details Motivation: Existing image-based illumination control methods face a trade-off between lighting fidelity and temporal consistency when extended to video, and they cannot jointly control camera trajectory and illumination, which limits generative modeling of real-world scenes. Method: 1) A disentangled design captures geometry and motion by projecting dynamic point clouds along user-defined camera trajectories and uses a relit frame to provide consistent illumination cues; 2) the Light-Syn synthesis pipeline, based on degradation and inverse mapping, generates multi-view, multi-illumination training pairs from in-the-wild monocular videos. Result: Experiments show that Light-X outperforms baselines in joint camera-illumination control and surpasses prior video relighting methods under both text- and background-conditioned settings, with higher lighting quality and temporal consistency. Conclusion: Light-X enables high-quality, controllable video generation from monocular videos with free viewpoint and illumination editing, offering an effective solution for dynamic modeling of real-world scenes. Abstract: Recent advances in illumination control extend image-based methods to video, yet still face a trade-off between lighting fidelity and temporal consistency. Moving beyond relighting, a key step toward generative modeling of real-world scenes is the joint control of camera trajectory and illumination, since visual dynamics are inherently shaped by both geometry and lighting. To this end, we present Light-X, a video generation framework that enables controllable rendering from monocular videos with both viewpoint and illumination control. 1) We propose a disentangled design that decouples geometry and lighting signals: geometry and motion are captured via dynamic point clouds projected along user-defined camera trajectories, while illumination cues are provided by a relit frame consistently projected into the same geometry. These explicit, fine-grained cues enable effective disentanglement and guide high-quality illumination. 2) To address the lack of paired multi-view and multi-illumination videos, we introduce Light-Syn, a degradation-based pipeline with inverse-mapping that synthesizes training pairs from in-the-wild monocular footage. This strategy yields a dataset covering static, dynamic, and AI-generated scenes, ensuring robust training. Extensive experiments show that Light-X outperforms baseline methods in joint camera-illumination control and surpasses prior video relighting methods under both text- and background-conditioned settings.