cs.CL

[1] On GRPO Collapse in Search-R1: The Lazy Likelihood-Displacement Death Spiral

Wenlong Deng,Yushu Li,Boying Gong,Yi Ren,Christos Thrampoulidis,Xiaoxiao Li

Main category: cs.CL

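TL;DR: This paper studies the training collapse that frequently affects GRPO-based methods such as Search-R1 in tool-integrated reinforcement learning (TI-RL). It identifies Lazy Likelihood Displacement (LLD) as the root cause and proposes LLDS, a lightweight regularization method that mitigates the problem and markedly improves training stability and performance.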

Details

Motivation: In tool-integrated reinforcement learning, GRPO converges quickly and needs no value network, but it frequently suffers training collapse. The authors aim to identify the root cause and to propose an efficient, low-interference fix that enables stable training.

Method: Identifies Lazy Likelihood Displacement (LLD) as the core mechanism behind GRPO training collapse and introduces the LLD Death Spiral model to describe its self-reinforcing dynamics. To counter it, designs the LLDS regularizer, which activates only when sequence likelihood decreases and applies fine-grained regularization to the responsible tokens.

Result: Validated on seven open-domain and multi-hop QA benchmarks, LLDS stabilizes training, prevents gradient explosion, and yields gains of +37.8% on Qwen2.5-3B and +32.0% on Qwen2.5-7B.

Conclusion: LLD is a fundamental bottleneck in GRPO-style tool-integrated reinforcement learning. LLDS is an efficient, low-interference remedy that offers a workable path toward stable, large-scale training of tool-integrated LLMs.

Abstract: Tool-integrated (TI) reinforcement learning (RL) enables large language models (LLMs) to perform multi-step reasoning by interacting with external tools such as search engines and retrievers. Group Relative Policy Optimization (GRPO), exemplified by the recent Search-R1, offers fast convergence and a value-free formulation that makes it appealing for this setting, yet consistently suffers from training collapse. We identify Lazy Likelihood Displacement (LLD), a systematic reduction or stagnation in the likelihood of both correct and incorrect responses, as the core mechanism driving this failure. LLD emerges early and triggers a self-reinforcing LLD Death Spiral, where declining likelihood leads to low-confidence responses, inflating gradients, and ultimately causing collapse. We empirically characterize this process across models on a Search-R1-style, search-integrated question answering task, revealing a consistent three-phase trajectory: early stagnation, steady decay, and accelerated collapse. To address this, we propose a lightweight likelihood-preserving regularization, LLDS, for GRPO that activates only when a trajectory's likelihood decreases, and regularizes only the tokens responsible. This fine-grained structure mitigates LLD with minimal interference to optimization. Across seven open-domain and multi-hop QA benchmarks, our method stabilizes training, prevents gradient explosion, and yields substantial performance improvements, including +37.8% gains on Qwen2.5-3B and +32.0% gains on Qwen2.5-7B. Our results establish LLD as a fundamental bottleneck in GRPO-based TIRL and provide a practical path toward stable, scalable training of tool-integrated LLMs.
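
The two gating conditions described for LLDS (act only when a trajectory's likelihood decreases, and only on the tokens responsible) can be illustrated with a small sketch. This is not the paper's loss: the function name `llds_penalty`, the use of per-token log-probabilities before and after an update, and the linear form of the penalty are all assumptions made for illustration.

```python
from typing import Sequence


def llds_penalty(old_logps: Sequence[float], new_logps: Sequence[float]) -> float:
    """Hypothetical likelihood-preserving penalty for a single trajectory.

    old_logps / new_logps: per-token log-probabilities of the same response
    under the previous and the current policy (assumed inputs).
    """
    if len(old_logps) != len(new_logps):
        raise ValueError("per-token log-prob lists must have the same length")
    # Gate 1: do nothing unless the whole trajectory became less likely.
    if sum(new_logps) >= sum(old_logps):
        return 0.0
    # Gate 2: penalize only the tokens whose own log-prob went down.
    return sum(max(0.0, old - new) for old, new in zip(old_logps, new_logps))
```

In a training loop, a term like this would be added to the GRPO objective with a small coefficient; the coefficient and the exact token-selection rule would have to come from the paper itself.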

[2] Computational Linguistics Meets Libyan Dialect: A Study on Dialect Identification

Mansour Essgaer,Khamis Massud,Rabia Al Mamlook,Najah Ghmaid

Main category: cs.CL

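TL;DR: This study compares several machine learning models on classifying Libyan dialect tweets from the QADI corpus and finds that Multinomial Naive Bayes (MNB) with combined word and character n-gram features performs best.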

Details

Motivation: Inconsistent orthography and non-standard spelling make the Libyan dialect hard for existing NLP tools to process, so text classification methods for this dialect need systematic evaluation.

Method: Uses logistic regression, linear SVM, Multinomial Naive Bayes, and Bernoulli Naive Bayes; screens meta-features with a chi-square test; and compares classification performance across different word and character n-gram representations.

Result: MNB reaches 85.89% accuracy and an F1-score of 0.85741 with (1,2) word n-grams and (1,5) character n-grams, outperforming the other models. The chi-square analysis shows that features such as email mentions and emotion indicators make no significant contribution to classification.

Conclusion: Carefully chosen n-gram representations and classifiers are crucial for accurate Arabic dialect identification. MNB is an effective choice for this task, and the study provides empirical benchmarks for Arabic dialect NLP research.

Abstract: This study investigates logistic regression, linear support vector machine, multinomial Naive Bayes, and Bernoulli Naive Bayes for classifying Libyan dialect utterances gathered from Twitter. The dataset used is the QADI corpus, which consists of 540,000 sentences across 18 Arabic dialects. Preprocessing challenges include handling inconsistent orthographic variations and non-standard spellings typical of the Libyan dialect. The chi-square analysis revealed that certain features, such as email mentions and emotion indicators, were not significantly associated with dialect classification and were thus excluded from further analysis. Two main experiments were conducted: (1) evaluating the significance of meta-features extracted from the corpus using the chi-square test and (2) assessing classifier performance using different word and character n-gram representations. The classification experiments showed that Multinomial Naive Bayes (MNB) achieved the highest accuracy of 85.89% and an F1-score of 0.85741 when using a (1,2) word n-gram and (1,5) character n-gram representation. In contrast, Logistic Regression and Linear SVM exhibited slightly lower performance, with maximum accuracies of 84.41% and 84.73%, respectively. Additional evaluation metrics, including log loss, Cohen's kappa, and the Matthews correlation coefficient, further supported the effectiveness of MNB in this task. The results indicate that carefully selected n-gram representations and classification models play a crucial role in improving the accuracy of Libyan dialect identification. This study provides empirical benchmarks and insights for future research in Arabic dialect NLP applications.
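
To make the best-performing configuration concrete, the sketch below combines (1,2) word n-grams and (1,5) character n-grams and feeds them to a small multinomial Naive Bayes classifier. It is a dependency-free toy, not the authors' pipeline: the class `TinyMultinomialNB`, the `w:`/`c:` feature prefixes, and the Laplace smoothing value are assumptions, and the study itself most likely relied on a standard library implementation.

```python
import math
from collections import Counter, defaultdict


def ngram_features(text, word_range=(1, 2), char_range=(1, 5)):
    """Count word n-grams and character n-grams; prefixes keep the two kinds apart."""
    feats = Counter()
    words = text.split()
    for n in range(word_range[0], word_range[1] + 1):
        for i in range(len(words) - n + 1):
            feats["w:" + " ".join(words[i:i + n])] += 1
    for n in range(char_range[0], char_range[1] + 1):
        for i in range(len(text) - n + 1):
            feats["c:" + text[i:i + n]] += 1
    return feats


class TinyMultinomialNB:
    """Minimal multinomial Naive Bayes with Laplace smoothing (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, texts, labels):
        self.counts = defaultdict(Counter)   # label -> feature counts
        self.doc_counts = Counter(labels)    # label -> number of documents
        for text, label in zip(texts, labels):
            self.counts[label].update(ngram_features(text))
        self.vocab = set().union(*self.counts.values())
        return self

    def predict(self, text):
        feats = ngram_features(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counter in self.counts.items():
            denom = sum(counter.values()) + self.alpha * len(self.vocab)
            score = math.log(self.doc_counts[label] / total_docs)  # log prior
            for feat, count in feats.items():
                if feat in self.vocab:  # unseen features are ignored
                    score += count * math.log((counter[feat] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Character n-grams are the part that helps with inconsistent spelling: two spellings of the same word still share most of their short character sequences even when the word-level tokens differ.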

[3] SQuARE: Structured Query & Adaptive Retrieval Engine For Tabular Formats

Chinmay Gondhalekar,Urjitkumar Patel,Fang-Chun Yeh

Main category: cs.CL

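TL;DR: SQuARE is a hybrid retrieval framework for accurate question answering over complex spreadsheets. A routing mechanism based on header depth and merge density chooses dynamically between structure-preserving retrieval and SQL queries, improving precision and verifiability.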

Details

Motivation: Existing methods perform poorly on spreadsheets with multirow headers, merged cells, and unit annotations, while conventional SQL views cannot cope with files that lack a consistent schema. Both problems reduce answer accuracy.

Method: Proposes the SQuARE framework, which computes a sheet-level complexity score (from header depth and merge density), routes each query dynamically to either structure-preserving chunk retrieval or an automatically constructed relational SQL representation, and uses a lightweight agent to supervise the combination and refinement of results.

Result: On multi-header corporate balance sheets, a heavily merged World Bank workbook, and diverse public datasets, SQuARE outperforms single-strategy baselines and ChatGPT-4o on both retrieval precision and end-to-end answer accuracy, with predictable latency.

Conclusion: By decoupling retrieval from model choice, SQuARE preserves the structure of the original tables, stays compatible with emerging tabular foundation models, and offers a practical path toward robust table understanding.

Abstract: Accurate question answering over real spreadsheets remains difficult due to multirow headers, merged cells, and unit annotations that disrupt naive chunking, while rigid SQL views fail on files lacking consistent schemas. We present SQuARE, a hybrid retrieval framework with sheet-level, complexity-aware routing. It computes a continuous score based on header depth and merge density, then routes queries either through structure-preserving chunk retrieval or SQL over an automatically constructed relational representation. A lightweight agent supervises retrieval, refinement, or combination of results across both paths when confidence is low. This design maintains header hierarchies, time labels, and units, ensuring that returned values are faithful to the original cells and straightforward to verify. Evaluated on multi-header corporate balance sheets, a heavily merged World Bank workbook, and diverse public datasets, SQuARE consistently surpasses single-strategy baselines and ChatGPT-4o on both retrieval precision and end-to-end answer accuracy while keeping latency predictable. By decoupling retrieval from model choice, the system is compatible with emerging tabular foundation models and offers a practical bridge toward more robust table understanding.
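
The complexity-aware routing can be pictured as a scoring function followed by a threshold. The sketch below is a guess at the general shape only: the weights, the depth normalization, the threshold, and the direction of the routing (complex sheets go to chunk retrieval, simple ones to SQL) are assumptions, and the names `complexity_score` and `route` are hypothetical.

```python
def complexity_score(header_depth: int, merged_cells: int, total_cells: int,
                     w_depth: float = 0.5, w_merge: float = 0.5,
                     max_depth: int = 4) -> float:
    """Continuous score in [0, 1] from header depth and merge density (assumed form)."""
    if total_cells <= 0:
        raise ValueError("total_cells must be positive")
    # A single header row contributes nothing; deeper headers saturate at max_depth.
    depth_term = min(max(header_depth - 1, 0), max_depth - 1) / (max_depth - 1)
    merge_density = merged_cells / total_cells
    return w_depth * depth_term + w_merge * merge_density


def route(score: float, threshold: float = 0.3) -> str:
    """Send structurally complex sheets to chunk retrieval, simple ones to SQL."""
    return "chunk_retrieval" if score >= threshold else "sql"
```

A flat sheet with one header row and no merged cells scores 0.0 and goes to the SQL path; a sheet with three header rows and 40% merged cells scores above the threshold and goes to structure-preserving retrieval.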

[4] DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle

Fangyu Lei,Jinxiang Meng,Yiming Huang,Junjie Zhao,Yitong Zhang,Jianwen Luo,Xin Zou,Ruiyi Yang,Wenbo Shi,Yan Gao,Shizhu He,Zuo Wang,Qian Liu,Yang Wang,Ke Wang,Jun Zhao,Kang Liu

Main category: cs.CL

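TL;DR: DAComp is a benchmark of 210 tasks for evaluating data engineering and data analysis capabilities in enterprise data intelligence workflows. It reveals marked weaknesses of current state-of-the-art agents in complex pipeline orchestration and open-ended reasoning.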

Details

Motivation: Existing benchmarks do not adequately reflect the complexity of combined data engineering and data analysis in real enterprise environments, so a more comprehensive and realistic testbed is needed to advance autonomous data agents.

Method: Proposes the DAComp benchmark, which covers data engineering tasks that require building multi-stage SQL pipelines from scratch and data analysis tasks that require strategic planning and iterative analysis. Engineering tasks are scored with execution-based, multi-metric evaluation; open-ended analysis tasks are scored by an experimentally validated LLM judge guided by hierarchical rubrics.

Result: State-of-the-art agents perform poorly on DAComp: success rates on data engineering tasks are under 20% and average scores on data analysis tasks are below 40%, indicating serious deficiencies in holistic pipeline orchestration and open-ended reasoning.

Conclusion: DAComp provides a rigorous, realistic test environment that clearly diagnoses the key bottlenecks of current autonomous data agents and helps drive the development of enterprise-grade systems capable of both data engineering and analysis.

Abstract: Real-world enterprise data intelligence workflows encompass data engineering that turns raw sources into analysis-ready tables and data analysis that converts those tables into decision-oriented insights. We introduce DAComp, a benchmark of 210 tasks that mirrors these complex workflows. Data engineering (DE) tasks require repository-level engineering on industrial schemas, including designing and building multi-stage SQL pipelines from scratch and evolving existing systems under evolving requirements. Data analysis (DA) tasks pose open-ended business problems that demand strategic planning, exploratory analysis through iterative coding, interpretation of intermediate results, and the synthesis of actionable recommendations. Engineering tasks are scored through execution-based, multi-metric evaluation. Open-ended tasks are assessed by a reliable, experimentally validated LLM-judge, which is guided by hierarchical, meticulously crafted rubrics. Our experiments reveal that even state-of-the-art agents falter on DAComp. Performance on DE tasks is particularly low, with success rates under 20%, exposing a critical bottleneck in holistic pipeline orchestration, not merely code generation. Scores on DA tasks also average below 40%, highlighting profound deficiencies in open-ended reasoning and demonstrating that engineering and analysis are distinct capabilities. By clearly diagnosing these limitations, DAComp provides a rigorous and realistic testbed to drive the development of truly capable autonomous data agents for enterprise settings. Our data and code are available at https://da-comp.github.io

[5] ClusterFusion: Hybrid Clustering with Embedding Guidance and LLM Adaptation

Yiming Xu,Yuan Yuan,Vijay Viswanathan,Graham Neubig

Main category: cs.CL

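TL;DR: This paper proposes ClusterFusion, a hybrid text clustering framework that uses a large language model (LLM) as the clustering core, guided by lightweight embedding methods, and achieves leading performance on both standard and domain-specific tasks.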

Details

Motivation: Traditional clustering algorithms perform poorly in specialized domains without costly fine-tuning, and existing uses of LLMs do not fully exploit their contextual reasoning ability.

Method: Proposes a three-stage framework with the LLM at its core: embedding-guided subset partition, LLM-driven topic summarization, and LLM-based topic assignment.

Result: Experiments on three public benchmarks and two newly constructed domain-specific datasets show that ClusterFusion outperforms existing methods on both standard tasks and specialized domains.

Conclusion: ClusterFusion combines the contextual understanding of LLMs with the efficiency of embedding methods, yielding interpretable and adaptable text clustering that can incorporate domain knowledge and user preferences.

Abstract: Text clustering is a fundamental task in natural language processing, yet traditional clustering algorithms with pre-trained embeddings often struggle in domain-specific contexts without costly fine-tuning. Large language models (LLMs) provide strong contextual reasoning, yet prior work mainly uses them as auxiliary modules to refine embeddings or adjust cluster boundaries. We propose ClusterFusion, a hybrid framework that instead treats the LLM as the clustering core, guided by lightweight embedding methods. The framework proceeds in three stages: embedding-guided subset partition, LLM-driven topic summarization, and LLM-based topic assignment. This design enables direct incorporation of domain knowledge and user preferences, fully leveraging the contextual adaptability of LLMs. Experiments on three public benchmarks and two new domain-specific datasets demonstrate that ClusterFusion not only achieves state-of-the-art performance on standard tasks but also delivers substantial gains in specialized domains. To support future work, we release our newly constructed dataset and results on all benchmarks.

[6] LangSAT: A Novel Framework Combining NLP and Reinforcement Learning for SAT Solving

Muyu Pan,Matthew Walter,Dheeraj Kodakandla,Mahfuza Farooque

Main category: cs.CL

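TL;DR: LangSAT is a framework that converts natural language descriptions into CNF expressions and solves them with a CDCL SAT solver enhanced by reinforcement learning, making SAT solving more accessible and efficient.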

Details

Motivation: Existing SAT-solving platforms require input in Conjunctive Normal Form (CNF), which limits their use by non-experts. A system that accepts natural language input directly would lower the barrier to SAT solving.

Method: Proposes the LangSAT framework with two components. Lang2Logic translates English sentences into CNF expressions. SmartSAT is a reinforcement-learning-based SAT solver that encodes clause-variable relationships as a graph and extracts global features for the RL agent, which then optimizes heuristic selection within the CDCL process.

Result: Lang2Logic handles natural language inputs of up to 450 words, and SmartSAT solves the generated CNFs with performance comparable to traditional CDCL heuristics.

Conclusion: LangSAT offers a more accessible and scalable approach to SAT solving, with broad potential in reasoning, formal verification, and debugging tasks.

Abstract: Our work presents a novel reinforcement learning (RL) based framework to optimize heuristic selection within the conflict-driven clause learning (CDCL) process, improving the efficiency of Boolean satisfiability (SAT) solving. The proposed system, LangSAT, bridges the gap between natural language inputs and propositional logic by converting English descriptions into Conjunctive Normal Form (CNF) expressions and solving them using an RL-enhanced CDCL SAT solver. Unlike existing SAT-solving platforms that require CNF as input, LangSAT enables users to input standard English descriptions, making SAT-solving more accessible. The framework comprises two key components: Lang2Logic, which translates English sentences into CNF expressions, and SmartSAT, an RL-based SAT solver. SmartSAT encodes clause-variable relationships as structured graph representations and extracts global features specific to the SAT problem. This implementation provides the RL agent with deeper contextual information, enabling SAT problems to be solved more efficiently. Lang2Logic was evaluated on diverse natural language inputs, processing descriptions up to 450 words. The generated CNFs were solved by SmartSAT, which demonstrated comparable performance to traditional CDCL heuristics with respect to solving time. The combined LangSAT framework offers a more accessible and scalable solution for SAT-solving tasks across reasoning, formal verification, and debugging.
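
As a rough picture of what encoding clause-variable relationships as a graph and extracting global features might involve, the sketch below builds a bipartite edge list from a CNF formula and computes a few summary statistics. The integer-literal CNF convention (positive for a variable, negative for its negation) is the usual DIMACS-style one; the function name and the specific feature set are assumptions rather than SmartSAT's actual representation.

```python
def clause_variable_graph(cnf):
    """Build a bipartite clause-variable edge list plus simple global features.

    cnf: list of clauses, each a list of non-zero ints (negative = negated variable).
    Each edge is (clause_index, variable, is_positive_literal).
    """
    edges = [(ci, abs(lit), lit > 0)
             for ci, clause in enumerate(cnf)
             for lit in clause]
    variables = {var for _, var, _ in edges}
    features = {
        "num_clauses": len(cnf),
        "num_variables": len(variables),
        "clause_var_ratio": len(cnf) / len(variables) if variables else 0.0,
        "mean_clause_len": len(edges) / len(cnf) if cnf else 0.0,
    }
    return edges, features
```

For the formula (x1 OR NOT x2) AND (x2 OR x3) AND (NOT x1), written as `[[1, -2], [2, 3], [-1]]`, this yields five edges, three variables, and a clause-to-variable ratio of 1.0. An RL agent could take features like these as part of its state when choosing a branching heuristic.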

[7] MASE: Interpretable NLP Models via Model-Agnostic Saliency Estimation

Zhou Yang,Shunyan Luo,Jiazhen Zhu,Fang Jin

Main category: cs.CL

TL;DR: This paper proposes MASE, a model-agnostic saliency estimation framework for explaining the predictions of text-based deep learning models.

Details

Motivation: Existing interpretation methods are limited when applied to the discrete word data of natural language processing, and most are post-hoc explanations that struggle to reflect a model's decision process accurately.

Method: Estimates input saliency by applying Normalized Linear Gaussian Perturbations (NLGP) at the embedding layer rather than perturbing discrete words directly, which yields local explanations for arbitrary text models.

Result: Experiments show that MASE outperforms other model-agnostic interpretation methods on metrics such as Delta Accuracy and has stronger explanatory power.

Conclusion: MASE is an effective, model-agnostic framework for interpreting text models and reveals the decision mechanisms of NLP models more accurately.

Abstract: Deep neural networks (DNNs) have made significant strides in Natural Language Processing (NLP), yet their interpretability remains elusive, particularly when evaluating their intricate decision-making processes. Traditional methods often rely on post-hoc interpretations, such as saliency maps or feature visualization, which might not be directly applicable to the discrete nature of word data in NLP. Addressing this, we introduce the Model-agnostic Saliency Estimation (MASE) framework. MASE offers local explanations for text-based predictive models without necessitating in-depth knowledge of a model's internal architecture. By leveraging Normalized Linear Gaussian Perturbations (NLGP) on the embedding layer instead of raw word inputs, MASE efficiently estimates input saliency. Our results indicate MASE's superiority over other model-agnostic interpretation methods, especially in terms of Delta Accuracy, positioning it as a promising tool for elucidating the operations of text-based models in NLP.
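
The general idea of perturbing embeddings rather than discrete words can be shown with a generic Gaussian-perturbation saliency estimate. This is not NLGP as defined in the paper, whose details the abstract does not give: the function below simply adds Gaussian noise to one token's embedding at a time, measures the average change in a black-box model's output, and normalizes the scores to sum to one. All names and parameters are assumptions.

```python
import random


def perturbation_saliency(model, embeddings, sigma=0.1, samples=50, seed=0):
    """Generic black-box saliency via Gaussian noise on token embeddings.

    model: callable mapping a list of embedding vectors to a scalar score.
    embeddings: list of per-token embedding vectors (lists of floats).
    Returns one normalized saliency value per token.
    """
    rng = random.Random(seed)
    base = model(embeddings)
    scores = []
    for t in range(len(embeddings)):
        total = 0.0
        for _ in range(samples):
            noisy = [row[:] for row in embeddings]           # copy all embeddings
            noisy[t] = [x + rng.gauss(0.0, sigma) for x in noisy[t]]
            total += abs(model(noisy) - base)                # output sensitivity
        scores.append(total / samples)
    norm = sum(scores)
    return [s / norm for s in scores] if norm > 0 else scores
```

Because the model is only called, never inspected, the same routine works for any architecture that exposes an embedding-level input, which is the sense in which such estimates are model-agnostic.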

[8] Sarcasm Detection on Reddit Using Classical Machine Learning and Feature Engineering

Subrata Karmaker

Main category: cs.CL

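TL;DR: This study performs sarcasm detection with classical machine learning methods and explicit feature engineering, without neural networks or contextual information. Using 100,000 comments from the SARC 2.0 dataset, it combines word-level and character-level TF-IDF features with simple stylistic indicators and evaluates four models. Naive Bayes and logistic regression perform best, with F1-scores of about 0.57 on sarcastic comments, providing a clear and reproducible baseline for lightweight, interpretable sarcasm detection.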

Details

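Motivation: Sarcasm is common in online discussions but hard for machines to recognize because the literal wording contradicts the intended meaning. Existing methods often depend on complex models or contextual information and lack interpretability and reproducibility.

Method: Uses a 100,000-comment subsample of SARC 2.0, combines word-level and character-level TF-IDF features with simple stylistic indicators, and applies four classical models (logistic regression, linear SVM, multinomial Naive Bayes, and random forest) without neural networks or parent-comment context.

Result: Naive Bayes and logistic regression perform best, with F1-scores of about 0.57 on sarcastic comments; overall performance is limited by the lack of conversational context.

Conclusion: Although performance is constrained by context-free input, classical machine learning with explicit feature engineering provides a lightweight, interpretable, and reproducible baseline for sarcasm detection.

Abstract: Sarcasm is common in online discussions, yet difficult for machines to identify because the intended meaning often contradicts the literal wording. In this work, I study sarcasm detection using only classical machine learning methods and explicit feature engineering, without relying on neural networks or context from parent comments. Using a 100,000-comment subsample of the Self-Annotated Reddit Corpus (SARC 2.0), I combine word-level and character-level TF-IDF features with simple stylistic indicators. Four models are evaluated: logistic regression, a linear SVM, multinomial Naive Bayes, and a random forest. Naive Bayes and logistic regression perform best, achieving F1-scores around 0.57 for sarcastic comments. Although the lack of conversational context limits performance, the results offer a clear and reproducible baseline for sarcasm detection using lightweight and interpretable methods.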

[9] RapidUn: Influence-Driven Parameter Reweighting for Efficient Large Language Model Unlearning

Guoshenghui Zhao,Huawei Lin,Weijie Zhao

Main category: cs.CL

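TL;DR: This paper proposes RapidUn, an influence-driven, parameter-efficient unlearning framework for large language models. It quickly estimates per-sample influence and derives adaptive update weights, forgetting harmful behavior while retaining general knowledge, and clearly outperforms existing methods across multiple models and datasets.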

Details

Motivation: Removing the influence of specific data from a large language model is hard because retraining is expensive and existing approximate unlearning methods are often unstable, especially when the forget set is small or imbalanced.

Method: Proposes the RapidUn framework, which first computes an influence score for each sample with a fast estimation module, then maps these scores to adaptive update weights that guide selective parameter updates.

Result: On Mistral-7B and Llama-3-8B with the Dolly-15k and Alpaca-57k datasets, RapidUn is up to 100 times more efficient than full retraining and consistently outperforms Fisher, GA, and LoReUn on both in-distribution and out-of-distribution forgetting.

Conclusion: Influence-guided parameter reweighting is a scalable and interpretable paradigm for LLM unlearning.

Abstract: Removing specific data influence from large language models (LLMs) remains challenging, as retraining is costly and existing approximate unlearning methods are often unstable. The challenge is exacerbated when the forget set is small or imbalanced. We introduce RapidUn, an influence-driven and parameter-efficient unlearning framework. It first estimates per-sample influence through a fast estimation module, then maps these scores into adaptive update weights that guide selective parameter updates, forgetting harmful behavior while retaining general knowledge. On Mistral-7B and Llama-3-8B across Dolly-15k and Alpaca-57k, RapidUn achieves up to 100 times higher efficiency than full retraining and consistently outperforms Fisher, GA, and LoReUn on both in-distribution and out-of-distribution forgetting. These results establish influence-guided parameter reweighting as a scalable and interpretable paradigm for LLM unlearning.
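
The step of mapping influence scores to adaptive update weights can take many forms, and the abstract does not say which one RapidUn uses. Purely as an illustration, the sketch below uses a temperature-scaled softmax rescaled so that the weights average to one; the function name, the softmax choice, and the `temperature` parameter are all assumptions.

```python
import math


def influence_to_weights(scores, temperature=1.0):
    """Map per-sample influence scores to update weights with mean 1 (assumed form).

    Higher-influence samples get larger weights; a lower temperature makes
    the weighting more concentrated on the top-scoring samples.
    """
    if not scores:
        return []
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - m) / temperature) for s in scores]
    z = sum(exps)
    n = len(scores)
    return [n * e / z for e in exps]
```

Keeping the mean weight at one leaves the overall update magnitude unchanged, so the weights only redistribute emphasis across samples rather than acting as a hidden learning-rate change.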

[10] MSME: A Multi-Stage Multi-Expert Framework for Zero-Shot Stance Detection

Yuanshuo Zhang,Aohua Li,Bo Chen,Jingbo Sun,Xiaobing Zhao

Main category: cs.CL

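TL;DR: Proposes MSME, a multi-stage, multi-expert framework for zero-shot stance detection in complex real-world scenarios. It reaches state-of-the-art performance through three stages: knowledge preparation, expert reasoning, and decision aggregation.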

Details

Motivation: Existing LLM-based zero-shot stance detection methods perform poorly in complex real-world scenarios that require dynamic background knowledge, definitions of compound targets, and handling of rhetorical devices such as irony.

Method: Designs the three-stage MSME framework. Stage one prepares knowledge by retrieving background information and clarifying stance labels. In stage two, three expert modules reason from the perspectives of knowledge, labels, and pragmatics. In stage three, a Meta-Judge integrates the analyses to produce the final prediction.

Result: Experiments on three public datasets show that MSME achieves state-of-the-art performance on zero-shot stance detection.

Conclusion: Through multi-stage, multi-expert collaboration, MSME improves zero-shot stance detection in complex scenarios and offers good interpretability and generalization.

Abstract: LLM-based approaches have recently achieved impressive results in zero-shot stance detection. However, they still struggle in complex real-world scenarios, where stance understanding requires dynamic background knowledge, target definitions involve compound entities or events that must be explicitly linked to stance labels, and rhetorical devices such as irony often obscure the author's actual intent. To address these challenges, we propose MSME, a Multi-Stage, Multi-Expert framework for zero-shot stance detection. MSME consists of three stages: (1) Knowledge Preparation, where relevant background knowledge is retrieved and stance labels are clarified; (2) Expert Reasoning, involving three specialized modules: the Knowledge Expert distills salient facts and reasons from a knowledge perspective, the Label Expert refines stance labels and reasons accordingly, and the Pragmatic Expert detects rhetorical cues such as irony to infer intent from a pragmatic angle; (3) Decision Aggregation, where a Meta-Judge integrates all expert analyses to produce the final stance prediction. Experiments on three public datasets show that MSME achieves state-of-the-art performance across the board.

[11] UW-BioNLP at ChemoTimelines 2025: Thinking, Fine-Tuning, and Dictionary-Enhanced LLM Systems for Chemotherapy Timeline Extraction

Tianmai M. Zhang,Zhaoyi Sun,Sihang Zeng,Chenxi Li,Neil F. Abernethy,Barbara D. Lam,Fei Xia,Meliha Yetisgen

Main category: cs.CL

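TL;DR: This paper studies methods for constructing timelines of systemic anticancer treatment from the electronic health records of cancer patients, focusing on subtask 2: generating chemotherapy timelines from raw clinical notes. The authors use a two-step workflow in which an LLM first extracts chemotherapy events from individual notes and an algorithm then merges them into patient-level timelines. They try chain-of-thought thinking, supervised fine-tuning, direct preference optimization, and dictionary lookup; fine-tuned Qwen3-14B performs best (score 0.678), and the results are a useful reference for similar tasks.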

Details

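Motivation: To improve the accuracy and efficiency of automatically building chemotherapy timelines for cancer patients from unstructured clinical text, and to advance the use of automated electronic health record processing in clinical decision support.

Method: Uses a two-step approach. First, an LLM (such as Qwen3-14B) extracts chemotherapy events from individual clinical notes. Second, a normalization and aggregation algorithm builds patient-level timelines. Four strategies are compared: chain-of-thought prompting, supervised fine-tuning, direct preference optimization, and dictionary-based lookup.

Result: Several approaches perform well on the test set; Qwen3-14B with supervised fine-tuning achieves the best official score of 0.678, ahead of the other configurations.

Conclusion: Supervised fine-tuning of an LLM effectively improves chemotherapy event extraction and timeline construction, confirms the feasibility of the two-step framework, and offers a practical reference for similar information extraction tasks.

Abstract: The ChemoTimelines shared task benchmarks methods for constructing timelines of systemic anticancer treatment from electronic health records of cancer patients. This paper describes our methods, results, and findings for subtask 2: generating patient chemotherapy timelines from raw clinical notes. We evaluated strategies involving chain-of-thought thinking, supervised fine-tuning, direct preference optimization, and dictionary-based lookup to improve timeline extraction. All of our approaches followed a two-step workflow, wherein an LLM first extracted chemotherapy events from individual clinical notes, and then an algorithm normalized and aggregated events into patient-level timelines. Each specific method differed in how the associated LLM was utilized and trained. Multiple approaches yielded competitive performances on the test set leaderboard, with fine-tuned Qwen3-14B achieving the best official score of 0.678. Our results and analyses could provide useful insights for future attempts on this task as well as the design of similar tasks.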
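
The second step of the workflow, normalizing and aggregating note-level events into patient-level timelines, might look roughly like the sketch below. The event schema (`patient_id`, `drug`, `relation`, `date`), the lowercase normalization, and the assumption that dates are already ISO-formatted strings are all hypothetical; the shared task defines its own format and normalization rules.

```python
from collections import defaultdict


def build_timelines(note_events):
    """Aggregate note-level chemotherapy events into per-patient timelines.

    note_events: iterable of dicts with keys patient_id, drug, relation, date
    (an assumed schema). Events are normalized, de-duplicated across notes,
    and sorted chronologically.
    """
    timelines = defaultdict(set)
    for event in note_events:
        drug = event["drug"].strip().lower()
        relation = event["relation"].strip().lower()
        # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
        timelines[event["patient_id"]].add((event["date"], drug, relation))
    return {pid: sorted(events) for pid, events in timelines.items()}
```

De-duplication matters here because the same treatment is usually mentioned in many notes; without it, a patient-level timeline would repeat one administration once per mention.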

[12] EvoEdit: Lifelong Free-Text Knowledge Editing through Latent Perturbation Augmentation and Knowledge-driven Parameter Fusion

Pengfei Cao,Zeao Ji,Daojian Zeng,Jun Zhao,Kang Liu

Main category: cs.CL

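TL;DR: Proposes a new task, Lifelong Free-text Knowledge Editing (LF-Edit), builds the large-scale benchmark MRLF-Bench, and designs the EvoEdit method to strengthen knowledge injection and prevent forgetting.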

Details

Motivation: Existing knowledge editing methods depend on structured triplets and support only one-time updates, so they cannot meet the need to keep updating free-text knowledge in large language models after deployment.

Method: Proposes the LF-Edit task, builds the MRLF-Bench benchmark, and designs EvoEdit, a method based on Latent Perturbation Augmentation and Knowledge-driven Parameter Fusion.

Result: Experiments show that EvoEdit substantially outperforms existing knowledge editing methods on MRLF-Bench, supporting continual knowledge updates while mitigating forgetting.

Conclusion: EvoEdit offers an effective solution for continual free-text knowledge editing in large language models and moves knowledge editing closer to practical use.

Abstract: Adjusting the outdated knowledge of large language models (LLMs) after deployment remains a major challenge. This difficulty has spurred the development of knowledge editing, which seeks to accurately and efficiently modify a model's internal (parametric) knowledge without retraining it from scratch. However, existing methods suffer from two limitations. First, they depend on structured triplets that are misaligned with the free-text nature of LLM pretraining and fail to capture the nuanced relationships among facts. Second, they typically support one-time knowledge updates, with relatively limited research on the problem of sequential or lifelong editing. To address these gaps, we propose a new task, Lifelong Free-text Knowledge Editing (LF-Edit), which enables models to incorporate updates expressed in natural language and supports continual editing over time. Despite its promise, LF-Edit faces the dual challenge of integrating new knowledge while mitigating the forgetting of prior information. To foster research on this new task, we construct a large-scale benchmark, Multi-Rank Lifelong Free-text Editing Benchmark (MRLF-Bench), containing 16,835 free-text edit requests. We further design a cognitively inspired multi-rank evaluation framework encompassing four levels: memorization, understanding, constrained comprehension, and reasoning. To tackle the challenges inherent in LF-Edit, we introduce a novel approach named EvoEdit that enhances knowledge injection through Latent Perturbation Augmentation and preserves prior information via Knowledge-driven Parameter Fusion. Experimental results demonstrate that EvoEdit substantially outperforms existing knowledge editing methods on the proposed LF-Edit task.

[13] AdmTree: Compressing Lengthy Context with Adaptive Semantic Trees

Yangning Li,Shaoshen Chen,Yinghui Li,Yankai Chen,Hai-Tao Zheng,Hui Wang,Wenhao Jiang,Philip S. Yu

Main category: cs.CL

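TL;DR: AdmTree is an adaptive, hierarchical context compression framework. It segments input dynamically by information density and builds a semantic binary tree from gist tokens, improving the efficiency of long-context processing while maintaining high semantic fidelity.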

Details

Motivation: The quadratic complexity of self-attention limits the ability of large language models to process long contexts, and existing compression methods fall short in preserving local detail, avoiding positional bias, or capturing long-range semantic dependencies.

Method: Proposes the AdmTree framework, which segments the input text dynamically, uses gist tokens to represent variable-length segments as the leaves of a semantic binary tree, and combines a lightweight aggregation mechanism with a frozen backbone LLM to achieve efficient hierarchical abstraction.

Result: AdmTree preserves fine-grained detail together with global semantic coherence, mitigates positional bias, and adapts dynamically to content, so it retains the semantic information of long contexts more completely.

Conclusion: AdmTree processes long contexts efficiently while markedly improving semantic fidelity, offering a new approach to context compression for large language models that balances performance and accuracy.

Abstract: The quadratic complexity of self-attention constrains Large Language Models (LLMs) in processing long contexts, a capability essential for many advanced applications. Context compression aims to alleviate this computational bottleneck while retaining critical semantic information. However, existing approaches often fall short: explicit methods may compromise local detail, whereas implicit methods can suffer from positional biases, information degradation, or an inability to capture long-range semantic dependencies. We propose AdmTree, a novel framework for adaptive, hierarchical context compression with a central focus on preserving high semantic fidelity while maintaining efficiency. AdmTree dynamically segments input based on information density, utilizing gist tokens to summarize variable-length segments as the leaves of a semantic binary tree. This structure, together with a lightweight aggregation mechanism and a frozen backbone LLM (thereby minimizing new trainable parameters), enables efficient hierarchical abstraction of the context. By preserving fine-grained details alongside global semantic coherence, mitigating positional bias, and dynamically adapting to content, AdmTree robustly retains the semantic information of long contexts.
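
The tree structure itself is easy to sketch: segment summaries form the leaves, and parents are produced by aggregating pairs of children until a single root remains. In the sketch below, plain strings stand in for gist-token representations and string concatenation stands in for the learned aggregation mechanism; both substitutions, along with the names `Node` and `build_tree`, are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Node:
    summary: str                      # stand-in for a gist-token representation
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_tree(segments: List[str],
               aggregate: Callable[[str, str], str] = lambda a, b: a + " | " + b) -> Node:
    """Build a binary tree bottom-up; leaves are segment summaries."""
    if not segments:
        raise ValueError("need at least one segment")
    level = [Node(s) for s in segments]
    while len(level) > 1:
        nxt = []
        # Pair neighbours left to right and aggregate each pair into a parent.
        for i in range(0, len(level) - 1, 2):
            left, right = level[i], level[i + 1]
            nxt.append(Node(aggregate(left.summary, right.summary), left, right))
        if len(level) % 2 == 1:       # an unpaired node is carried up unchanged
            nxt.append(level[-1])
        level = nxt
    return level[0]
```

Leaves keep the fine-grained content of individual segments while nodes closer to the root summarize progressively larger spans, which is what lets a hierarchical scheme serve both local and global queries.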

[14] ADAPT: Learning Task Mixtures for Budget-Constrained Instruction Tuning

Pritam Kadasi,Abhishek Upperwal,Mayank Singh

Main category: cs.CL

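TL;DR: ADAPT is a meta-learning algorithm that learns task sampling proportions for multi-task instruction tuning under an explicit token budget. By adjusting the task distribution dynamically to optimize downstream performance, it matches or beats static mixing strategies with fewer training tokens.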

Details

Motivation: In conventional multi-task learning, task weights are usually fixed by hand and cannot adapt to the varying importance and difficulty of tasks, which leads to uneven resource allocation and suboptimal performance.

Method: Proposes the ADAPT algorithm, which maintains a continuous distribution over tasks and updates it with meta-gradients of a smooth worst-case validation objective, yielding adaptive task sampling and curriculum learning under a token budget.

Result: Experiments on three open-weight LLMs of about 1B parameters show that, using only 1% to 10% of the supervised tokens, ADAPT matches or slightly exceeds uniform and size-proportional baselines, achieving better or comparable average performance on 11 out-of-domain benchmarks while shifting more budget toward harder, benchmark-aligned tasks.

Conclusion: ADAPT learns effective task sampling strategies and improves the efficiency and quality of multi-task instruction tuning under a limited token budget, showing the potential of adaptive task scheduling in large language model training.

Abstract: We propose ADAPT, a meta-learning algorithm that learns task sampling proportions under an explicit token budget for multi-task instruction tuning. Instead of fixing task weights by hand, ADAPT maintains a continuous distribution over tasks and updates it via meta-gradients of a smooth worst-case validation objective, inducing an adaptive curriculum that allocates more tokens to useful tasks while avoiding collapse. We instantiate ADAPT on three ~1B-parameter open-weight LLMs (Gemma-3-1B, LLaMA-3.2-1B, Qwen-0.6B), training on 20 Natural Instructions task types under budgets of 1%, 5%, and 10% of the available supervised tokens, and compare against strong supervised fine-tuning baselines with uniform and size-proportional mixing. In evaluations on 11 out-of-domain benchmarks spanning reasoning, reading comprehension, code generation, and instruction following, we find that ADAPT matches or slightly improves average downstream performance relative to the best static mixture, while using fewer effective training tokens and reallocating budget toward harder, benchmark-aligned tasks.
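
One common way to make a worst-case objective smooth is to replace the maximum over per-task losses with a log-sum-exp, whose gradient with respect to each loss is a softmax weight. The sketch below uses that construction to nudge task logits toward high-loss tasks and to split a token budget accordingly. It is a simplified stand-in rather than ADAPT's meta-gradient update: the log-sum-exp choice, the update rule, and all function names are assumptions.

```python
import math


def softmax(xs, tau=1.0):
    m = max(xs)
    exps = [math.exp((x - m) / tau) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def smooth_max(losses, tau=1.0):
    """Log-sum-exp surrogate for max(losses); approaches the max as tau -> 0."""
    m = max(losses)
    return m + tau * math.log(sum(math.exp((l - m) / tau) for l in losses))


def update_task_logits(logits, val_losses, lr=0.5, tau=1.0):
    """Raise the logits of tasks that dominate the smooth worst-case objective."""
    sensitivity = softmax(val_losses, tau)   # d smooth_max / d loss_i
    return [z + lr * s for z, s in zip(logits, sensitivity)]


def allocate_tokens(logits, budget):
    """Split a token budget across tasks in proportion to sampling probabilities."""
    return [int(budget * p) for p in softmax(logits)]
```

Because the softmax never assigns exactly zero probability, every task keeps receiving some tokens, which is one simple way a smooth objective helps avoid collapsing onto a single task.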

[15] LexGenius: An Expert-Level Benchmark for Large Language Models in Legal General Intelligence

Wenjin Liu,Haoran Luo,Xin Feng,Xiang Ji,Lijuan Zhou,Rui Mao,Jiapu Wang,Shirui Pan,Erik Cambria

Main category: cs.CL

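TL;DR: This paper proposes LexGenius, an expert-level benchmark for evaluating the general intelligence of large language models (LLMs) in the Chinese legal domain. It covers seven dimensions, eleven tasks, and twenty abilities, and builds multiple-choice questions from real cases and exam questions, with combined manual and LLM review to ensure data reliability. Experiments show that current LLMs still lag well behind human experts in legal intelligence.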

Details

Motivation: Existing legal AI benchmarks focus too heavily on outcomes and lack systematic evaluation of integrated abilities such as legal understanding, reasoning, and decision-making, which holds back legal general intelligence. A more comprehensive, structured evaluation framework is therefore needed.

Method: Proposes the LexGenius benchmark, built on a Dimension-Task-Ability framework. Multiple-choice questions are generated from real legal cases and judicial exam questions, then checked through manual annotation and LLM review over multiple rounds to reduce data leakage risk and improve accuracy.

Result: An evaluation of 12 state-of-the-art LLMs shows significant disparities across legal intelligence abilities; even the best models fall well short of human legal professionals.

Conclusion: LexGenius can effectively assess the legal general intelligence of LLMs and helps push legal AI toward higher levels of cognition and reasoning.

Abstract: Legal general intelligence (GI) refers to artificial intelligence (AI) that encompasses legal understanding, reasoning, and decision-making, simulating the expertise of legal experts across domains. However, existing benchmarks are result-oriented and fail to systematically evaluate the legal intelligence of large language models (LLMs), hindering the development of legal GI. To address this, we propose LexGenius, an expert-level Chinese legal benchmark for evaluating legal GI in LLMs. It follows a Dimension-Task-Ability framework, covering seven dimensions, eleven tasks, and twenty abilities. We use recent legal cases and exam questions to create multiple-choice questions with a combination of manual and LLM reviews to reduce data leakage risks, ensuring accuracy and reliability through multiple rounds of checks. We evaluate 12 state-of-the-art LLMs using LexGenius and conduct an in-depth analysis. We find significant disparities across legal intelligence abilities for LLMs, with even the best LLMs lagging behind human legal professionals. We believe LexGenius can assess the legal intelligence abilities of LLMs and enhance legal GI development. Our project is available at https://github.com/QwenQKing/LexGenius.

[16] Geschlechtsübergreifende Maskulina im Sprachgebrauch Eine korpusbasierte Untersuchung zu lexemspezifischen Unterschieden

Carolin Mueller-Spitzer,Samira Ochs,Jan Oliver Ruediger,Sascha Wolfer

Main category: cs.CL

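TL;DR: By analyzing the distribution and linguistic features of generic masculines (GM) in contemporary German press texts, this study reveals considerable differences in how individual nouns are used and finds that GM occurs mainly in the plural and in indefinite noun phrases and is not primarily used to refer to entire groups.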

Details

Motivation: To examine how generic masculines are actually used in authentic corpus data, in order to inform the debate over their gender-neutrality and to give psycholinguistic research a basis for more realistic linguistic stimuli.

Method: Using a large press corpus, manually annotates the full inflectional paradigm of 21 masculine personal nouns, 6,195 tokens in total, and analyzes their syntactic and semantic features.

Result: Finds considerable differences between passive role nouns and prestige-related nouns. GM occurs predominantly in the plural and in indefinite noun phrases and is seldom used to denote entire classes of people.

Conclusion: Actual usage does not support the claim that generic masculines are mainly used to denote entire classes of people; their meaning should be interpreted in light of the specific lexeme and grammatical context. The findings can help improve the design of linguistic materials in psycholinguistic experiments.

Abstract: This study examines the distribution and linguistic characteristics of generic masculines (GM) in contemporary German press texts. The use of masculine personal nouns to refer to mixed-gender groups or unspecified individuals has been widely debated in academia and the public, with conflicting perspectives on its gender-neutrality. While psycholinguistic studies suggest that GM is more readily associated with male referents, corpus-based analyses of its actual use remain scarce. We investigate GM in a large corpus of press texts, focusing on lexeme-specific differences across different types of personal nouns. We conducted manual annotations of the whole inflectional paradigm of 21 personal nouns, resulting in 6,195 annotated tokens. Our findings reveal considerable differences between lexical items, especially between passive role nouns and prestige-related personal nouns. On a grammatical level, we find that GM occurs predominantly in the plural and in indefinite noun phrases. Furthermore, our data shows that GM is not primarily used to denote entire classes of people, as has been previously claimed. By providing an empirical insight into the use of GM in authentic written language, we contribute to a more nuanced understanding of its forms and manifestations. These findings provide a solid basis for aligning linguistic stimuli in psycholinguistic studies more closely with real-world language use.

[17] OsmT: Bridging OpenStreetMap Queries and Natural Language with Open-source Tag-aware Language Models

Zhuoyue Wan,Wentao Hu,Chen Jason Zhang,Yuanfeng Song,Shuaimin Li,Ruiqiang Xiao,Xiao-Yong Wei,Raymond Chi-Wing Wong

Main category: cs.CL

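TL;DR: This paper proposes OsmT, an open-source, tag-aware language model that bridges natural language and OverpassQL (the query language of OpenStreetMap), and introduces a Tag Retrieval Augmentation (TRA) mechanism to improve the accuracy and structural validity of generated queries.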

Details

Motivation: Existing methods for converting natural language into structured queries mostly rely on large closed-source language models, which have high inference costs, low transparency, and poor suitability for lightweight deployment. Geospatial queries also involve complex topology and tag dependencies that call for dedicated modeling.

Method: Proposes the OsmT model, which uses a Tag Retrieval Augmentation (TRA) mechanism to bring contextually relevant tag knowledge into OverpassQL generation; defines a reverse task, OverpassQL-to-Text, for query interpretation; and builds on open-source pre-trained language models for a lightweight design.

Result: Outperforms strong baselines on a public benchmark, with clear gains in both query generation and interpretation; reaches competitive accuracy despite having fewer parameters.

Conclusion: OsmT shows that lightweight open-source models with domain adaptation (such as tag augmentation) can match or surpass large closed-source models on geospatial semantic parsing, advancing interpretable and deployable natural language interfaces.

Abstract: Bridging natural language and structured query languages is a long-standing challenge in the database community. While recent advances in language models have shown promise in this direction, existing solutions often rely on large-scale closed-source models that suffer from high inference costs, limited transparency, and lack of adaptability for lightweight deployment. In this paper, we present OsmT, an open-source tag-aware language model specifically designed to bridge natural language and Overpass Query Language (OverpassQL), a structured query language for accessing large-scale OpenStreetMap (OSM) data. To enhance the accuracy and structural validity of generated queries, we introduce a Tag Retrieval Augmentation (TRA) mechanism that incorporates contextually relevant tag knowledge into the generation process. This mechanism is designed to capture the hierarchical and relational dependencies present in the OSM database, addressing the topological complexity inherent in geospatial query formulation. In addition, we define a reverse task, OverpassQL-to-Text, which translates structured queries into natural language explanations to support query interpretation and improve user accessibility. We evaluate OsmT on a public benchmark against strong baselines and observe consistent improvements in both query generation and interpretation. Despite using significantly fewer parameters, our model achieves competitive accuracy, demonstrating the effectiveness of open-source pre-trained language models in bridging natural language and structured query languages within schema-rich geospatial environments.
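
Retrieval augmentation with tags can be pictured as looking up the OSM tags most relevant to a question and placing them in the model's input. The sketch below uses naive word overlap against a tiny hand-written tag index; the index contents, the scoring, the prompt layout, and the function names are all assumptions, and the paper's TRA mechanism is presumably more sophisticated.

```python
# Tiny illustrative tag index: OSM tag -> descriptive keywords (hand-written).
TAG_INDEX = {
    "amenity=cafe": "cafe coffee shop",
    "amenity=restaurant": "restaurant place to eat food",
    "tourism=hotel": "hotel accommodation lodging",
    "highway=bus_stop": "bus stop public transport",
}


def retrieve_tags(question, tag_index=TAG_INDEX, k=2):
    """Return up to k tags whose keywords overlap most with the question."""
    q_words = set(question.lower().split())
    scored = [(len(q_words & set(desc.split())), tag)
              for tag, desc in tag_index.items()]
    hits = sorted((s for s in scored if s[0] > 0),
                  key=lambda s: (-s[0], s[1]))  # best overlap first, then by name
    return [tag for _, tag in hits[:k]]


def build_prompt(question, k=2):
    """Prepend retrieved tags so the generator sees valid key=value pairs."""
    tags = retrieve_tags(question, k=k)
    return ("Relevant tags: " + ", ".join(tags)
            + "\nQuestion: " + question + "\nOverpassQL:")
```

Showing the generator concrete `key=value` pairs addresses a typical failure of text-to-query models on OSM data, namely inventing plausible-looking tags that do not exist in the database.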

[18] SignRoundV2: Closing the Performance Gap in Extremely Low-Bit Post-Training Quantization for LLMs

Wenhua Cheng,Weiwei Zhang,Heng Guo,Haihao Shen

Main category: cs.CL

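TL;DR: SignRoundV2 is an efficient post-training quantization framework that substantially improves the quantized performance of large language models at extremely low bit widths (e.g., 2 to 4 bits), approaching full-precision quality without mixed precision.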

Details

Motivation: Extreme low-bit quantization is critical for deploying large language models, but it usually causes severe performance degradation, especially at 2 and 4 bits. Method: Proposes SignRoundV2, which includes a fast sensitivity metric that combines gradient information with quantization-induced deviations to guide layer-wise bit allocation, and a lightweight pre-tuning search over quantization scales to improve extremely low-bit quantization. Result: Achieves near production-grade performance with only about 1% accuracy difference at 4-5 bits, and still gives good results at 2 bits. Conclusion: SignRoundV2 effectively mitigates the performance degradation caused by extreme low-bit quantization, offering a practical solution for efficient LLM deployment. Abstract: Extreme low-bit quantization is critical for efficiently deploying Large Language Models (LLMs), yet it often leads to severe performance degradation at 2 bits and even 4 bits (e.g., MXFP4). We present SignRoundV2, a post-training quantization framework that is highly effective even without mixed-precision. SignRoundV2 introduces (1) a fast sensitivity metric that combines gradient information with quantization-induced deviations to guide layer-wise bit allocation, and (2) a lightweight pre-tuning search for quantization scales to improve extremely low-bit quantization. These components allow SignRoundV2 to close the gap with full-precision models. Extensive experiments indicate that our method sustains competitive accuracy for LLMs, achieving production-grade performance with about 1 percent variance at 4-5 bits and strong results even at 2 bits. The implementation is available at https://github.com/intel/auto-round.
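
The abstract does not give the sensitivity formula, so the following is only a sketch under stated assumptions: sensitivity is taken as the first-order proxy sum of |gradient x quantization deviation|, quantization is plain symmetric round-to-nearest, and bits are allocated by ranking layers. None of these choices is claimed to match SignRoundV2 or the auto-round implementation.

```python
def quantize(weights, bits):
    """Symmetric round-to-nearest quantization to `bits` bits (illustrative)."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax - 1, min(qmax, round(w / scale))) * scale for w in weights]

def sensitivity(weights, grads, bits):
    """Assumed proxy: sum of |gradient * (weight - dequantized weight)|."""
    deq = quantize(weights, bits)
    return sum(abs(g * (w - q)) for w, g, q in zip(weights, grads, deq))

def allocate_bits(layers, low_bits, high_bits, n_high):
    """Give `high_bits` to the n_high most sensitive layers, `low_bits` to the rest.

    `layers` maps a layer name to a (weights, grads) pair of flat lists.
    """
    scores = {name: sensitivity(w, g, low_bits) for name, (w, g) in layers.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    keep_high = set(ranked[:n_high])
    return {name: (high_bits if name in keep_high else low_bits) for name in layers}
```

Layers whose weights move a lot under quantization but carry near-zero gradient score low, which is the intuition behind combining the two signals.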

[19] Model Whisper: Steering Vectors Unlock Large Language Models' Potential in Test-time

Xinyue Kang,Diwei Shi,Li Chen

Main category: cs.CL

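TL;DR: Proposes lightweight Test-Time Steering Vectors (TTSV): the model's parameters stay frozen while an optimizable steering vector is prepended to the input and tuned to minimize output entropy, improving reasoning on a specific task.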

Details

Motivation: Existing test-time adaptation methods often need to fine-tune model parameters, which is computationally expensive and can damage the pre-trained model's existing abilities; an efficient, lightweight method that leaves the original model intact is needed. Method: Designs Test-Time Steering Vectors (TTSV), which are prepended to the input while the LLM's parameters are kept fixed; the TTSV is optimized on test data to minimize the model's output entropy, steering the model toward a higher-confidence internal state and activating the inherent abilities most relevant to the current task. Result: On MATH500, TTSV yields a 45.88% relative performance gain for Qwen2.5-Math-7B and a 16.22% relative gain for Qwen3-4B; the method generalizes well and transfers across tasks. Conclusion: TTSV is an efficient, lightweight, plug-and-play test-time adaptation method that markedly improves LLM reasoning across tasks and distributions without modifying model parameters. Abstract: It is a critical challenge to efficiently unlock the powerful reasoning potential of Large Language Models (LLMs) for specific tasks or new distributions. Existing test-time adaptation methods often require tuning model parameters, which is not only computationally expensive but also risks degrading the model's pre-existing abilities. To address this, we introduce a lightweight component, Test-Time Steering Vectors (TTSV), which is prepended to the input while keeping the LLM's parameters entirely frozen. By optimizing the TTSV on test data to minimize the model's output entropy, we steer the model towards an internal state of higher confidence, activating its inherent abilities most relevant to the current task. TTSV is both lightweight and highly efficient to optimize, making it a true plug-and-play enhancement. Extensive experiments validate our approach's effectiveness on both base models and reasoning-enhanced models. For instance, on the MATH500 task, TTSV achieves a 45.88% relative performance gain on the Qwen2.5-Math-7B model and a 16.22% relative gain on the Qwen3-4B model. Furthermore, our approach exhibits robust generalization, with its steering vectors proving highly transferable across diverse tasks.
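
The entropy-minimization objective can be illustrated with a toy sketch. Several simplifications are assumed here and are not from the paper: the frozen LLM is replaced by a fixed linear map, the steering vector is added to the input rather than prepended as extra positions, and gradients come from finite differences instead of backpropagation.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def frozen_model(x, W):
    """Stand-in for a frozen LLM: a fixed linear map from input to logits."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def tune_steering_vector(x, W, steps=200, lr=0.5, eps=1e-4):
    """Minimize output entropy over an additive vector v; W is never updated."""
    v = [0.0] * len(x)

    def loss(vec):
        shifted = [a + b for a, b in zip(x, vec)]
        return entropy(softmax(frozen_model(shifted, W)))

    for _ in range(steps):
        base = loss(v)
        # Forward-difference estimate of the gradient with respect to v.
        grad = []
        for i in range(len(v)):
            bumped = v[:i] + [v[i] + eps] + v[i + 1:]
            grad.append((loss(bumped) - base) / eps)
        v = [vi - lr * g for vi, g in zip(v, grad)]
    return v
```

Only `v` changes during tuning, which mirrors the property the abstract emphasizes: the model weights stay untouched while the output distribution becomes more confident.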

[20] EtCon: Edit-then-Consolidate for Reliable Knowledge Editing

Ruilin Li,Yibin Wang,Wenhong Zhu,Chenglin Li,Jinghao Zhang,Chenliang Li,Junchi Yan,Jiaqi Wang

Main category: cs.CL

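TL;DR: This paper proposes Edit-then-Consolidate, a new knowledge editing paradigm whose staged editing and knowledge consolidation mechanism improves knowledge updating and generalization of large language models in real-world settings.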

Details

Motivation: Existing knowledge editing methods perform well in idealized evaluations but are of limited use in realistic settings such as lifelong learning, mainly because they overfit to new facts and lack a knowledge consolidation mechanism. Method: Proposes the Edit-then-Consolidate framework: first, Targeted Proximal Supervised Fine-Tuning (TPSFT) limits policy drift to reduce overfitting; then a consolidation stage based on Group Relative Policy Optimization (GRPO) uses trajectory-level reward signals to align chain-of-thought (CoT) generation behavior. Result: Experiments show that the method significantly improves editing reliability and generalization under real-world evaluation while better preserving the model's original capabilities and edit locality. Conclusion: A staged edit-then-consolidate strategy bridges the gap between theoretical knowledge editing and actual generation behavior, providing a more practical solution for lifelong learning in large models. Abstract: Knowledge editing aims to update specific facts in large language models (LLMs) without full retraining. Prior efforts sought to tune the knowledge layers of LLMs, proving effective for making selective edits. However, a significant gap exists between their performance in controlled, teacher-forcing evaluations and their real-world effectiveness in lifelong learning scenarios, which greatly limits their practical applicability. This work's empirical analysis reveals two recurring issues associated with this gap: (1) Most traditional methods lead the edited model to overfit to the new fact, thereby degrading pre-trained capabilities; (2) There is a critical absence of a knowledge consolidation stage, leaving new facts insufficiently integrated into LLMs' inference-time behavior under autoregressive generation, thereby leading to a mismatch between parametric knowledge and actual generation behavior. To this end, we propose Edit-then-Consolidate, a novel knowledge editing paradigm that aims to bridge the gap between theoretical knowledge editing methods and their real-world applicability. Specifically, (1) our framework mitigates overfitting via Targeted Proximal Supervised Fine-Tuning (TPSFT) that localizes the edit via a trust-region objective to limit policy drift; (2) then, a consolidation stage using Group Relative Policy Optimization (GRPO) aligns the edited knowledge with CoT-based inference policy by optimizing trajectory-level behavior under comprehensive reward signals. 
Extensive experiments demonstrate our framework consistently improves editing reliability and generalization under real-world evaluations, while better preserving locality and pre-trained capabilities.
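
For readers unfamiliar with GRPO, the group-relative part can be sketched in a few lines: several responses are sampled for one prompt, and each response's advantage is its reward standardized within that group. This is a generic sketch of the commonly described formulation, not the consolidation stage of EtCon; the use of the population standard deviation and the `eps` constant are assumptions.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize trajectory-level rewards within one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; some variants differ
    return [(r - mean) / (std + eps) for r in rewards]
```

Because the baseline is the group mean, no separate value network is needed, and a group in which every response gets the same reward contributes zero advantage.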

[21] Challenging the Abilities of Large Language Models in Italian: a Community Initiative

Malvina Nissim,Danilo Croce,Viviana Patti,Pierpaolo Basile,Giuseppe Attanasio,Elio Musacchio,Matteo Rinaldi,Federico Borazio,Maria Francis,Jacopo Gili,Daniel Scalena,Begoña Altuna,Ekhi Azurmendi,Valerio Basile,Luisa Bentivogli,Arianna Bisazza,Marianna Bolognesi,Dominique Brunato,Tommaso Caselli,Silvia Casola,Maria Cassese,Mauro Cettolo,Claudia Collacciani,Leonardo De Cosmo,Maria Pia Di Buono,Andrea Esuli,Julen Etxaniz,Chiara Ferrando,Alessia Fidelangeli,Simona Frenda,Achille Fusco,Marco Gaido,Andrea Galassi,Federico Galli,Luca Giordano,Mattia Goffetti,Itziar Gonzalez-Dios,Lorenzo Gregori,Giulia Grundler,Sandro Iannaccone,Chunyang Jiang,Moreno La Quatra,Francesca Lagioia,Soda Marem Lo,Marco Madeddu,Bernardo Magnini,Raffaele Manna,Fabio Mercorio,Paola Merlo,Arianna Muti,Vivi Nastase,Matteo Negri,Dario Onorati,Elena Palmieri,Sara Papi,Lucia Passaro,Giulia Pensa,Andrea Piergentili,Daniele Potertì,Giovanni Puccetti,Federico Ranaldi,Leonardo Ranaldi,Andrea Amelio Ravelli,Martina Rosola,Elena Sofia Ruzzetti,Giuseppe Samo,Andrea Santilli,Piera Santin,Gabriele Sarti,Giovanni Sartor,Beatrice Savoldi,Antonio Serino,Andrea Seveso,Lucia Siciliani,Paolo Torroni,Rossella Varvara,Andrea Zaninello,Asya Zanollo,Fabio Massimo Zanzotto,Kamyar Zeinalipour,Andrea Zugarini

Main category: cs.CL

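TL;DR: CALAMITA is a large-scale collaborative benchmarking initiative for Italian that systematically evaluates large language models across many tasks and domains, emphasizing methodology, community participation, and a sustainable evaluation framework.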

Details

Motivation: Evaluation of large language models focuses mainly on English; languages such as Italian lack systematic, diverse evaluation as well as a unified methodology and collaboration mechanism. Method: Brings together more than 80 researchers from academia, industry, and the public sector to design and integrate over 20 tasks and almost 100 subtasks covering linguistic competence, commonsense reasoning, factual consistency, and other dimensions, and establishes a unified evaluation pipeline. Result: Releases the most comprehensive and diverse Italian benchmark to date and evaluates four open-weight LLMs, revealing their strengths and weaknesses across abilities and identifying challenges in task-specific evaluation. Conclusion: CALAMITA is not only a rich benchmark resource but also an extensible, community-driven evaluation framework that offers a reproducible blueprint for systematic model evaluation in other languages. Abstract: The rapid progress of Large Language Models (LLMs) has transformed natural language processing and broadened its impact across research and society. Yet, systematic evaluation of these models, especially for languages beyond English, remains limited. "Challenging the Abilities of LAnguage Models in ITAlian" (CALAMITA) is a large-scale collaborative benchmarking initiative for Italian, coordinated under the Italian Association for Computational Linguistics. Unlike existing efforts that focus on leaderboards, CALAMITA foregrounds methodology: it federates more than 80 contributors from academia, industry, and the public sector to design, document, and evaluate a diverse collection of tasks, covering linguistic competence, commonsense reasoning, factual consistency, fairness, summarization, translation, and code generation. Through this process, we not only assembled a benchmark of over 20 tasks and almost 100 subtasks, but also established a centralized evaluation pipeline that supports heterogeneous datasets and metrics. We report results for four open-weight LLMs, highlighting systematic strengths and weaknesses across abilities, as well as challenges in task-specific evaluation. Beyond quantitative results, CALAMITA exposes methodological lessons: the necessity of fine-grained, task-representative metrics, the importance of harmonized pipelines, and the benefits and limitations of broad community engagement. CALAMITA is conceived as a rolling benchmark, enabling continuous integration of new tasks and models. 
This makes it both a resource -- the most comprehensive and diverse benchmark for Italian to date -- and a framework for sustainable, community-driven evaluation. We argue that this combination offers a blueprint for other languages and communities seeking inclusive and rigorous LLM evaluation practices.

[22] AdiBhashaa: A Community-Curated Benchmark for Machine Translation into Indian Tribal Languages

Pooja Singh,Sandeep Kumar

Main category: cs.CL

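TL;DR: AdiBhashaa is a community-driven initiative that creates the first open parallel corpora and baseline machine translation systems for four major Indian tribal languages (Bhili, Mundari, Gondi, and Santali), aiming to address inequity in language technology.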

Details

Motivation: Many tribal languages are effectively invisible in large language models and machine translation systems, which deepens structural inequities in education, governance, and digital participation, so language technology needs to become more equitable. Method: Builds data through participatory collaboration with native speakers, combined with human-in-the-loop validation, and systematically evaluates both encoder-decoder MT models and large language models. Result: Constructs the first open parallel corpora for four Indian tribal languages, develops the corresponding baseline MT systems, and validates community participation as an effective model for low-resource language technology development. Conclusion: AdiBhashaa illustrates a more equitable paradigm for AI research, one that centers local expertise, trains early-career researchers from marginalized communities, and foregrounds human validation in language technology development. Abstract: Large language models and multilingual machine translation (MT) systems increasingly drive access to information, yet many languages of the tribal communities remain effectively invisible in these technologies. This invisibility exacerbates existing structural inequities in education, governance, and digital participation. We present AdiBhashaa, a community-driven initiative that constructs the first open parallel corpora and baseline MT systems for four major Indian tribal languages: Bhili, Mundari, Gondi, and Santali. This work combines participatory data creation with native speakers, human-in-the-loop validation, and systematic evaluation of both encoder-decoder MT models and large language models. In addition to reporting technical findings, we articulate how AdiBhashaa illustrates a possible model for more equitable AI research: it centers local expertise, builds capacity among early-career researchers from marginalized communities, and foregrounds human validation in the development of language technologies.

[23] DaLA: Danish Linguistic Acceptability Evaluation Guided by Real World Errors

Gianluca Barmina,Nathalie Carmen Hau Norman,Peter Schneider-Kamp,Lukas Galke

Main category: cs.CL

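TL;DR: This paper presents an enhanced benchmark for evaluating linguistic acceptability in Danish: it analyzes common writing errors and introduces fourteen corruption functions that generate incorrect sentences, enabling a more comprehensive evaluation of large language models.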

Details

Motivation: Existing linguistic acceptability benchmarks fall short in the range of error types they cover and in their ability to separate models by performance, so a broader and more challenging benchmark is needed to accurately evaluate LLMs on Danish. Method: Based on an analysis of common errors in written Danish, designs fourteen corruption functions that systematically introduce errors, validates them with manual and automatic methods, and builds a new benchmark for evaluating LLMs' linguistic acceptability judgments. Result: The new benchmark is broader and more challenging than existing ones; it markedly lowers LLM performance and shows higher discriminatory power, better separating high-performing from low-performing models. Conclusion: The extended benchmark is harder and more discriminative for linguistic acceptability evaluation, providing a more rigorous tool for evaluating Danish language models. Abstract: We present an enhanced benchmark for evaluating linguistic acceptability in Danish. We first analyze the most common errors found in written Danish. Based on this analysis, we introduce a set of fourteen corruption functions that generate incorrect sentences by systematically introducing errors into existing correct Danish sentences. To ensure the accuracy of these corruptions, we assess their validity using both manual and automatic methods. The results are then used as a benchmark for evaluating Large Language Models on a linguistic acceptability judgement task. Our findings demonstrate that this extension is both broader and more comprehensive than the current state of the art. By incorporating a greater variety of corruption types, our benchmark provides a more rigorous assessment of linguistic acceptability, increasing task difficulty, as evidenced by the lower performance of LLMs on our benchmark compared to existing ones. Our results also suggest that our benchmark has a higher discriminatory power, which makes it possible to better distinguish well-performing models from low-performing ones.
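
The general shape of a corruption function can be sketched as follows. The fourteen functions of the paper are not listed in the abstract, so the word-swap corruption below is a hypothetical stand-in, and the label convention (1 for acceptable, 0 for corrupted) is likewise an assumption.

```python
import random

def swap_adjacent_words(sentence, rng):
    """Hypothetical corruption: swap one random pair of adjacent words."""
    words = sentence.split()
    if len(words) < 2:
        return None  # nothing to corrupt
    i = rng.randrange(len(words) - 1)
    words[i], words[i + 1] = words[i + 1], words[i]
    corrupted = " ".join(words)
    # Swapping two identical words changes nothing, so reject that case.
    return corrupted if corrupted != sentence else None

def build_pairs(sentences, corruption, seed=0):
    """Turn correct sentences into (text, label) examples for acceptability judgement."""
    rng = random.Random(seed)
    pairs = []
    for s in sentences:
        bad = corruption(s, rng)
        if bad is not None:
            pairs.append((s, 1))    # original, acceptable
            pairs.append((bad, 0))  # corrupted, unacceptable
    return pairs
```

A real corruption also needs the validity check the abstract mentions, since a mechanical edit can occasionally produce another grammatical sentence.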

[24] DAMASHA: Detecting AI in Mixed Adversarial Texts via Segmentation with Human-interpretable Attribution

L. D. M. S. Sai Teja,N. Siva Gopala Krishna,Ufaq Khan,Muhammad Haris Khan,Partha Pakray,Atul Mishra

Main category: cs.CL

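TL;DR: This paper proposes Info-Mask, a new framework for detecting authorship transition points in text co-written by humans and AI; it combines stylometric features, perplexity signals, and structured boundary modeling, is more robust under adversarial conditions, and is released together with the MAS dataset for evaluation.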

Details

Motivation: As large language models advance, the line between human-written and AI-generated text is increasingly blurred, and effective methods are needed to locate authorship transitions in mixed text in order to safeguard authenticity, trust, and human oversight. Method: Proposes the Info-Mask framework, which fuses stylometric features, perplexity-based signals, and structured boundary modeling for mixed-authorship detection; builds the adversarial benchmark dataset MAS; and introduces interpretable attribution visualizations (HIA overlays) to help humans understand model decisions. Result: Across multiple model architectures, Info-Mask significantly improves span-level detection robustness under adversarial conditions and sets new baselines; a small-scale human study finds the HIA visualizations helpful for understanding model predictions. Conclusion: Although Info-Mask makes progress on robustness and interpretability, mixed-authorship detection in adversarial settings remains challenging, with important implications for trust and oversight in human-AI co-authorship. Abstract: In the age of advanced large language models (LLMs), the boundaries between human and AI-generated text are becoming increasingly blurred. We address the challenge of segmenting mixed-authorship text, that is, identifying transition points in text where authorship shifts from human to AI or vice-versa, a problem with critical implications for authenticity, trust, and human oversight. We introduce a novel framework, called Info-Mask, for mixed authorship detection that integrates stylometric cues, perplexity-driven signals, and structured boundary modeling to accurately segment collaborative human-AI content. To evaluate the robustness of our system against adversarial perturbations, we construct and release an adversarial benchmark dataset, Mixed-text Adversarial setting for Segmentation (MAS), designed to probe the limits of existing detectors. Beyond segmentation accuracy, we introduce Human-Interpretable Attribution (HIA) overlays that highlight how stylometric features inform boundary predictions, and we conduct a small-scale human study assessing their usefulness. Across multiple architectures, Info-Mask significantly improves span-level robustness under adversarial conditions, establishing new baselines while revealing remaining challenges. Our findings highlight both the promise and limitations of adversarially robust, interpretable mixed-authorship detection, with implications for trust and oversight in human-AI co-authorship.

[25] Mitigating Catastrophic Forgetting in Target Language Adaptation of LLMs via Source-Shielded Updates

Atsuki Yamaguchi,Terufumi Morishita,Aline Villavicencio,Nikolaos Aletras

Main category: cs.CL

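TL;DR: This paper proposes Source-Shielded Updates (SSU), a method that adapts instruct LLMs under low-resource conditions using only unlabeled target-language data; it effectively mitigates catastrophic forgetting and outperforms full fine-tuning in multilingual settings.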

Details

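Motivation: Expanding the linguistic diversity of instruct LLMs is crucial for global accessibility but is usually limited by the high cost of labeled target-language data and by catastrophic forgetting during adaptation. This paper addresses model adaptation under a low-resource constraint in which only unlabeled target-language data is available. Method: Proposes Source-Shielded Updates (SSU), a selective parameter update strategy: a small amount of source-language data and a parameter importance scoring method identify the parameters critical to preserving source-language abilities, and a column-wise freezing strategy protects these parameters before adaptation. Result: Experiments on five typologically diverse languages with 7B and 13B models show that SSU substantially mitigates catastrophic forgetting, with source-task performance dropping by only 3.4% (7B) and 2.8% (13B) on average, far better than the 20.3% and 22.3% of full fine-tuning; target-language performance matches or exceeds full fine-tuning, with the 7B model beating full fine-tuning on all benchmarks and the 13B model doing so on most. Conclusion: SSU is an effective approach to low-resource multilingual adaptation that greatly improves target-language performance with almost no loss of source-language ability, offering a viable path toward more diverse instruct models. Abstract: Expanding the linguistic diversity of instruct large language models (LLMs) is crucial for global accessibility but is often hindered by the reliance on costly specialized target language labeled data and catastrophic forgetting during adaptation. We tackle this challenge under a realistic, low-resource constraint: adapting instruct LLMs using only unlabeled target language data. We introduce Source-Shielded Updates (SSU), a selective parameter update strategy that proactively preserves source knowledge. Using a small set of source data and a parameter importance scoring method, SSU identifies parameters critical to maintaining source abilities. It then applies a column-wise freezing strategy to protect these parameters before adaptation. Experiments across five typologically diverse languages and 7B and 13B models demonstrate that SSU successfully mitigates catastrophic forgetting. It reduces performance degradation on monolingual source tasks to just 3.4% (7B) and 2.8% (13B) on average, a stark contrast to the 20.3% and 22.3% from full fine-tuning. SSU also achieves target-language performance highly competitive with full fine-tuning, outperforming it on all benchmarks for 7B models and the majority for 13B models.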
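
Column-wise freezing can be pictured with a small sketch on a weight matrix stored as nested lists. The importance score used here (sum over rows of |weight x source gradient|) and the `freeze_ratio` parameter are assumptions for illustration; the abstract does not name the scoring method SSU uses.

```python
def column_importance(weight, source_grad):
    """Assumed score: sum over rows of |w * g| for each column, from source data."""
    n_cols = len(weight[0])
    return [sum(abs(weight[r][c] * source_grad[r][c]) for r in range(len(weight)))
            for c in range(n_cols)]

def frozen_columns(scores, freeze_ratio):
    """Indices of the top `freeze_ratio` fraction of columns by importance."""
    k = int(len(scores) * freeze_ratio)
    ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
    return set(ranked[:k])

def shielded_update(weight, target_grad, frozen, lr):
    """One SGD step on target-language data that leaves frozen columns untouched."""
    return [[w if c in frozen else w - lr * g
             for c, (w, g) in enumerate(zip(w_row, g_row))]
            for w_row, g_row in zip(weight, target_grad)]
```

The mask is computed once from source-language data before adaptation starts, and every later target-language update is applied through it.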

[26] SEAL: Self-Evolving Agentic Learning for Conversational Question Answering over Knowledge Graphs

Hao Wang,Jialun Zhong,Changcheng Wang,Zhujun Nie,Zheng Li,Shunyu Yao,Yanzeng Li,Xinchi Li

Main category: cs.CL

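TL;DR: This paper proposes SEAL, a two-stage semantic parsing framework based on self-evolving agentic learning for knowledge-based conversational question answering; it extracts a minimal S-expression core and then applies template-based completion, significantly improving structural accuracy and computational efficiency.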

Details

Motivation: Addresses the shortcomings of existing methods in coreference resolution, context modeling, and complex logical reasoning, in particular the structural errors and high computational cost that arise when handling complex queries over large knowledge graphs. Method: Uses a two-stage framework. In the first stage, a large language model extracts the minimal S-expression core of the query, and an agentic calibration module corrects its syntax and aligns entities and relations. In the second stage, template-based completion driven by question-type prediction and placeholder instantiation produces a complete, executable S-expression. A self-evolving mechanism combining local and global memory with a reflection module enables continual improvement without retraining. Result: Experiments on the SPICE benchmark show that SEAL reaches state-of-the-art performance on multi-hop reasoning, comparison, and aggregation tasks, with significant gains in structural accuracy and computational efficiency. Conclusion: SEAL addresses the key challenges of KBCQA and offers robust, scalable conversational reasoning, providing an efficient and reliable new paradigm for complex query processing. Abstract: Knowledge-based conversational question answering (KBCQA) confronts persistent challenges in resolving coreference, modeling contextual dependencies, and executing complex logical reasoning. Existing approaches, whether end-to-end semantic parsing or stepwise agent-based reasoning, often suffer from structural inaccuracies and prohibitive computational costs, particularly when processing intricate queries over large knowledge graphs. To address these limitations, we introduce SEAL, a novel two-stage semantic parsing framework grounded in self-evolving agentic learning. In the first stage, a large language model (LLM) extracts a minimal S-expression core that captures the essential semantics of the input query. This core is then refined by an agentic calibration module, which corrects syntactic inconsistencies and aligns entities and relations precisely with the underlying knowledge graph. The second stage employs template-based completion, guided by question-type prediction and placeholder instantiation, to construct a fully executable S-expression. This decomposition not only simplifies logical form generation but also significantly enhances structural fidelity and linking efficiency. Crucially, SEAL incorporates a self-evolving mechanism that integrates local and global memory with a reflection module, enabling continuous adaptation from dialog history and execution feedback without explicit retraining. Extensive experiments on the SPICE benchmark demonstrate that SEAL achieves state-of-the-art performance, especially in multi-hop reasoning, comparison, and aggregation tasks. 
The results validate notable gains in both structural accuracy and computational efficiency, underscoring the framework's capacity for robust and scalable conversational reasoning.
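
The second stage, template-based completion, can be sketched as plain string templating around an already calibrated core. The operator names (`JOIN`, `COUNT`, `GT`), the question types, and the template table are hypothetical placeholders; SEAL's actual S-expression grammar is not given in the abstract.

```python
# Hypothetical templates keyed by predicted question type.
TEMPLATES = {
    "simple": "{core}",
    "count": "(COUNT {core})",
    "compare": "(GT (COUNT {core}) {value})",
}

def complete_s_expression(core, question_type, **placeholders):
    """Wrap the minimal core in the template for the predicted question type."""
    return TEMPLATES[question_type].format(core=core, **placeholders)

def is_balanced(expr):
    """Cheap structural check: parentheses must nest correctly."""
    depth = 0
    for ch in expr:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0
```

Because the model only has to produce the small core, the outer structure comes from a fixed template and cannot be malformed, which is one plausible reading of the structural-fidelity gains the abstract reports.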

[27] LLMs Know More Than Words: A Genre Study with Syntax, Metaphor & Phonetics

Weiye Shi,Zhaowei Zhang,Shaoheng Yan,Yaodong Yang

Main category: cs.CL

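TL;DR: This paper introduces a new multilingual genre classification dataset to study whether large language models can learn deeper linguistic properties from raw text, such as syntactic structure, metaphor counts, and phonetic features, and analyzes how these features affect classification performance.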

Details

Motivation: Investigates whether large language models truly capture deeper linguistic properties (such as syntactic structure, phonetic cues, and metrical patterns) rather than relying only on surface text features. Method: Builds a multilingual dataset from Project Gutenberg with binary classification tasks over poetry, novels, and drama in six languages, and adds three explicit linguistic feature sets (syntactic tree structures, metaphor counts, and phonetic metrics) to assess their effect on LLM classification performance. Result: Experiments show that although LLMs can learn latent linguistic structure from raw text or from explicit features, different features contribute unevenly across tasks, which points to the importance of complex linguistic signals in training. Conclusion: To improve language models on complex tasks such as literary genre classification, deeper linguistic features should be incorporated more systematically instead of relying only on what the model learns implicitly from raw text. Abstract: Large language models (LLMs) demonstrate remarkable potential across diverse language-related tasks, yet whether they capture deeper linguistic properties, such as syntactic structure, phonetic cues, and metrical patterns, from raw text remains unclear. To analyze whether LLMs can learn these features effectively and apply them to important natural-language-related tasks, we introduce a novel multilingual genre classification dataset derived from Project Gutenberg, a large-scale digital library offering free access to thousands of public domain literary works, comprising thousands of sentences per binary task (poetry vs. novel; drama vs. poetry; drama vs. novel) in six languages (English, French, German, Italian, Spanish, and Portuguese). We augment each with three explicit linguistic feature sets (syntactic tree structures, metaphor counts, and phonetic metrics) to evaluate their impact on classification performance. Experiments demonstrate that although LLM classifiers can learn latent linguistic structures either from raw text or from explicitly provided features, different features contribute unevenly across tasks, which underscores the importance of incorporating more complex linguistic signals during model training.

[28] Nex-N1: Agentic Models Trained via a Unified Ecosystem for Large-Scale Environment Construction

Nex-AGI Team,Yuxuan Cai,Lu Chen,Qiaoling Chen,Yuyang Ding,Liwen Fan,Wenjie Fu,Yufei Gao,Honglin Guo,Pinxue Guo,Zhenhua Han,Zhengfu He,Hanglei Hu,Kai Hu,Shengjia Hua,Tianyu Huai,Baodai Huang,Li Ji,Zhen Jiang,Zhikai Lei,Bufan Li,Jiahang Lin,Lizhi Lin,Jinxiu Liu,Shichun Liu,Ziming Liu,Yuchen Ni,Pengfang Qian,Yujiong Shen,Qingyun Shi,Wentao Shu,Peng Sun,Yiran Suo,Tian Tang,Boyu Tian,Guoteng Wang,Junzhe Wang,Peixin Wang,Zhiheng Xi,Hang Yan,Jie Yang,Zhixiong Yang,Tianchu Yao,Guangze Ye,Qianxi Yu,Shuo Zhang,Xinyue Zhang,Yiqi Zhang,Jiarong Zhao,Miao Zheng,Rui Zheng,Enyu Zhou,Jiazheng Zhou,Maosen Zhou,Yuhao Zhou,Tao Gui,Yining Zheng,Xinchi Chen,Jie Zhou,Siyuan Feng,Qin Chen,Liang He,Qi Zhang,Xuanjing Huang,Xipeng Qiu

Main category: cs.CL

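TL;DR: This paper proposes a scalable method for constructing interactive environments that improves policy learning for LLM agents along three dimensions (complexity, diversity, and fidelity), and open-sources the Nex ecosystem and model weights.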

Details

Motivation: As large language models move from passive responders to autonomous agents, the lack of scalable infrastructure for producing high-quality interaction signals holds back the incentive-driven decision-making paradigm. Method: Proposes the Nex ecosystem with three core components: NexAU (builds complex agent hierarchies from simple configurations), NexA4A (automatically generates diverse agent hierarchies from natural language), and NexGAP (integrates dynamic real-world environments to make trajectories more realistic). The Nex-N1 model is trained on the complex interactive environments built with this system. Result: On benchmarks such as SWE-bench and tau2, Nex-N1 consistently outperforms state-of-the-art open-source models and is competitive with frontier proprietary models on complex agentic tasks. Conclusion: Systematically scaling the diversity and complexity of interactive environments effectively improves LLM decision making as autonomous agents, and the proposed three-dimension framework provides a reproducible, scalable foundation for future agentic AI. Abstract: The evolution of Large Language Models (LLMs) from passive responders to autonomous agents necessitates a fundamental shift in learning paradigms -- from static imitation to incentive-driven decision making. However, this transition is significantly impeded by the lack of scalable infrastructure capable of constructing high-quality interaction signals for effective policy learning. To address this, we introduce a comprehensive method designed to systematically scale the diversity and complexity of interactive environments. Our method realizes this scaling by addressing three orthogonal dimensions: (1) Complexity: NexAU, a flexible agent framework that supports building complex agent hierarchies via simple configurations; (2) Diversity: NexA4A automatically generates diverse agent hierarchies from natural language to cover infinite domains; and (3) Fidelity: NexGAP bridges the simulation-reality gap by integrating dynamic real-world environments for grounded trajectory synthesis. We train Nex-N1 upon the diverse and complex interactive environments established by our infrastructure. Empirical results on benchmarks such as SWE-bench and tau2 demonstrate that Nex-N1 consistently outperforms SOTA open-source models and achieves competitive performance against frontier proprietary models on complex agentic tasks. We open-source the Nex ecosystem and model weights to facilitate further research.

[29] Factuality and Transparency Are All RAG Needs! Self-Explaining Contrastive Evidence Re-ranking

Francielle Vargas,Daniel Pedronette

Main category: cs.CL

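TL;DR: Proposes Self-Explaining Contrastive Evidence Re-Ranking (CER), a new method that restructures retrieval around factual evidence through contrastive learning and token-level attribution rationales.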

Details

Motivation: Existing retrieval methods are prone to hallucination in safety-critical domains such as clinical trials and lack interpretability and support for evidence-based reasoning. Method: Automatically selects hard negatives with a subjectivity-based criterion, fine-tunes embeddings with contrastive learning, and generates token-level attribution rationales for each retrieved passage, building an embedding space aligned with evidential reasoning. Result: Experiments on clinical trial reports show that CER improves retrieval accuracy, reduces hallucination risk in RAG systems, and provides transparent, evidence-based retrieval results. Conclusion: By aligning the embedding space with factual evidence, CER makes retrieval systems more reliable and interpretable, especially in safety-critical settings. Abstract: This extended abstract introduces Self-Explaining Contrastive Evidence Re-Ranking (CER), a novel method that restructures retrieval around factual evidence by fine-tuning embeddings with contrastive learning and generating token-level attribution rationales for each retrieved passage. Hard negatives are automatically selected using a subjectivity-based criterion, forcing the model to pull factual rationales closer while pushing subjective or misleading explanations apart. As a result, the method creates an embedding space explicitly aligned with evidential reasoning. We evaluated our method on clinical trial reports, and initial experimental results show that CER improves retrieval accuracy, mitigates the potential for hallucinations in RAG systems, and provides transparent, evidence-based retrieval that enhances reliability, especially in safety-critical domains.
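
The "pull factual rationales closer, push subjective ones apart" objective is the usual shape of a contrastive loss with hard negatives. The sketch below uses an InfoNCE-style loss over cosine similarities; the choice of InfoNCE and the temperature value are assumptions, since the abstract does not name the exact loss CER uses.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def info_nce(query, positive, hard_negatives, temperature=0.1):
    """Negative log-probability of the positive among positive + hard negatives."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in hard_negatives]
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))  # stable logsumexp
    return log_sum - logits[0]
```

In this picture `positive` would be the embedding of a factual passage and `hard_negatives` the embeddings of passages flagged as subjective; minimizing the loss reshapes the embedding space accordingly.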

[30] Arbitrage: Efficient Reasoning via Advantage-Aware Speculation

Monishwaran Maheswaran,Rishabh Tiwari,Yuezhou Hu,Kerem Dilmen,Coleman Hooper,Haocheng Xi,Nicholas Lee,Mehrdad Farajtabar,Michael W. Mahoney,Kurt Keutzer,Amir Gholami

Main category: cs.CL

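TL;DR: Proposes Arbitrage, a new step-level speculative generation framework that uses dynamic routing to choose the better reasoning step between a draft model and a target model, significantly improving inference efficiency and cutting inference latency on mathematical reasoning by up to about 2x over existing methods.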

Details

Motivation: Traditional token-level speculative decoding produces many unnecessary rejections on reasoning tasks because semantically equivalent steps fail to match token by token; existing step-level verification methods regenerate rejected steps with little improvement, wasting compute. Method: Proposes the Arbitrage framework, which introduces a lightweight router that dynamically predicts when the target model would produce a clearly better reasoning step and chooses the generation path accordingly; the router approximates an ideal Arbitrage Oracle, achieving a near-optimal efficiency-accuracy trade-off. Result: Across multiple mathematical reasoning benchmarks, Arbitrage consistently surpasses existing step-level speculative decoding baselines, reducing inference latency by up to about 2x at matched accuracy. Conclusion: Through dynamic routing, Arbitrage exploits the relative advantage of draft and target models, significantly improving the efficiency of speculative generation on complex reasoning tasks and offering a new way to lower LLM inference cost. Abstract: Modern Large Language Models achieve impressive reasoning capabilities with long Chain of Thoughts, but they incur substantial computational cost during inference, and this motivates techniques to improve the performance-cost ratio. Among these techniques, Speculative Decoding accelerates inference by employing a fast but inaccurate draft model to autoregressively propose tokens, which are then verified in parallel by a more capable target model. However, due to unnecessary rejections caused by token mismatches in semantically equivalent steps, traditional token-level Speculative Decoding struggles in reasoning tasks. Although recent works have shifted to step-level semantic verification, which improves efficiency by accepting or rejecting entire reasoning steps, existing step-level methods still regenerate many rejected steps with little improvement, wasting valuable target compute. To address this challenge, we propose Arbitrage, a novel step-level speculative generation framework that routes generation dynamically based on the relative advantage between draft and target models. Instead of applying a fixed acceptance threshold, Arbitrage uses a lightweight router trained to predict when the target model is likely to produce a meaningfully better step. This routing approximates an ideal Arbitrage Oracle that always chooses the higher-quality step, achieving near-optimal efficiency-accuracy trade-offs. Across multiple mathematical reasoning benchmarks, Arbitrage consistently surpasses prior step-level Speculative Decoding baselines, reducing inference latency by up to $\sim2\times$ at matched accuracy.
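
The routing loop can be sketched as follows, assuming three caller-supplied callables: `draft_step` and `target_step` each return the next reasoning step as a string (or `None` when finished), and `router` returns an estimated probability that the target model would produce a meaningfully better step. All three names, their signatures, and the threshold rule are hypothetical; the abstract only states that a trained router decides when to use the target model.

```python
def arbitrage_generate(prompt, draft_step, target_step, router,
                       threshold=0.5, max_steps=8):
    """Step-level generation that calls the target model only when routed to it."""
    steps, target_calls = [], 0
    for _ in range(max_steps):
        candidate = draft_step(prompt, steps)
        if candidate is None:  # draft model signals the end of the solution
            break
        # Regenerate with the target model only if a real gain is predicted.
        if router(prompt, steps, candidate) > threshold:
            candidate = target_step(prompt, steps)
            target_calls += 1
        steps.append(candidate)
    return steps, target_calls
```

The contrast with a fixed acceptance threshold on draft quality is that the decision depends on the expected advantage of the target over the draft, so a mediocre draft step is kept whenever the target is unlikely to do better.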

[31] Structured Document Translation via Format Reinforcement Learning

Haiyue Song,Johannes Eschbach-Dymanus,Hour Kaing,Sumire Honda,Hideki Tanaka,Bianka Buschbeck,Masao Utiyama

Main category: cs.CL

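TL;DR: Proposes FormatRL, a format reinforcement learning method that uses Group Relative Policy Optimization to directly optimize structure-aware rewards, improving translation quality for document-level XML/HTML structures.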

Details

Motivation: Existing work struggles with complex document-level XML or HTML structures and remains limited to the sentence level. Method: Applies Group Relative Policy Optimization on top of a supervised fine-tuned model and designs two new structure-aware rewards: TreeSim (structural similarity between XML trees) and Node-chrF (translation quality at the level of XML nodes); also uses the StrucAUC metric to distinguish minor errors from major structural failures. Result: Experiments on the SAP software-documentation benchmark show improvements on six metrics, and an analysis shows how different reward functions contribute to gains in structural and translation quality. Conclusion: FormatRL effectively improves translation quality for structured text and performs especially well on complex document structures. Abstract: Recent works on structured text translation remain limited to the sentence level, as they struggle to effectively handle the complex document-level XML or HTML structures. To address this, we propose Format Reinforcement Learning (FormatRL), which employs Group Relative Policy Optimization on top of a supervised fine-tuning model to directly optimize novel structure-aware rewards: 1) TreeSim, which measures structural similarity between predicted and reference XML trees and 2) Node-chrF, which measures translation quality at the level of XML nodes. Additionally, we apply StrucAUC, a fine-grained metric distinguishing between minor errors and major structural failures. Experiments on the SAP software-documentation benchmark demonstrate improvements across six metrics and an analysis further shows how different reward functions contribute to improvements in both structural and translation quality.
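
A structure-aware reward in the spirit of TreeSim can be approximated with the standard library. The sketch below compares the preorder tag sequences of two XML trees with `difflib.SequenceMatcher` and returns 0.0 for unparseable output; this proxy is an assumption for illustration and is not the TreeSim definition from the paper, which the abstract does not spell out.

```python
import difflib
import xml.etree.ElementTree as ET

def tag_sequence(xml_string):
    """Element tags in document (preorder) order; text content is ignored."""
    root = ET.fromstring(xml_string)
    return [el.tag for el in root.iter()]

def tree_similarity(pred_xml, ref_xml):
    """Similarity in [0, 1] between predicted and reference markup structure."""
    try:
        pred = tag_sequence(pred_xml)
    except ET.ParseError:
        return 0.0  # malformed output counts as a major structural failure
    ref = tag_sequence(ref_xml)
    return difflib.SequenceMatcher(None, pred, ref).ratio()
```

A reward like this is blind to the translated text, which is why it would be paired with a node-level translation score such as the Node-chrF reward the abstract describes.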

[32] Semantic Soft Bootstrapping: Long Context Reasoning in LLMs without Reinforcement Learning

Purbesh Mitra,Sennur Ulukus

Main category: cs.CL

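TL;DR: Proposes Semantic Soft Bootstrapping (SSB), a self-distillation technique that improves the training efficiency and accuracy of large language models on long-context reasoning; it builds teacher-student training pairs automatically without human intervention and significantly outperforms existing RLVR methods on several math benchmarks.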

Details

Motivation: Existing training methods for long-context reasoning based on reinforcement learning with verifiable rewards (RLVR) suffer from sparse rewards, low sample efficiency, and heavy compute requirements, which limit their broad use. Method: Proposes Semantic Soft Bootstrapping (SSB), in which the same base language model acts as both teacher and student: given different semantic contexts containing a correct answer and the most common incorrect answer, the model generates a step-by-step explanation with a verified final answer; a paired teacher-student training set is built automatically, and the student model learns to match the teacher's logit sequence from the bare question alone. Result: After parameter-efficient fine-tuning of Qwen2.5-3B-Instruct on GSM8K, accuracy improves by 10.6% on MATH500 and 10% on AIME2024, significantly better than RLVR methods such as GRPO. Conclusion: SSB is an efficient self-distillation training method that needs no human annotation, effectively improves LLM performance on mathematical reasoning, is sample-efficient and practical, and offers a new training paradigm for long-context reasoning. Abstract: Long context reasoning in large language models (LLMs) has demonstrated enhancement of their cognitive capabilities via chain-of-thought (CoT) inference. Training such models is usually done via reinforcement learning with verifiable rewards (RLVR) in reasoning based problems, like math and programming. However, RLVR is limited by several bottlenecks, such as the lack of dense rewards and inadequate sample efficiency. As a result, it requires significant compute resources in the post-training phase. To overcome these limitations, in this work, we propose Semantic Soft Bootstrapping (SSB), a self-distillation technique, in which the same base language model plays the role of both teacher and student, but receives different semantic contexts about the correctness of its outcome at training time. The model is first prompted with a math problem and several rollouts are generated. From them, the correct and most common incorrect response are filtered, and then provided to the model in context to produce a more robust, step-by-step explanation with a verified final answer. This pipeline automatically curates a paired teacher-student training set from raw problem-answer data, without any human intervention. This generation process also produces a sequence of logits, which is what the student model tries to match in the training phase just from the bare question alone. In our experiments, we train Qwen2.5-3B-Instruct on the GSM8K dataset via parameter-efficient fine-tuning. We then test its accuracy on the MATH500 and AIME2024 benchmarks. 
Our experiments show accuracy improvements of 10.6% and 10%, respectively, over group relative policy optimization (GRPO), a commonly used RLVR algorithm. Our code is available at https://github.com/purbeshmitra/semantic-soft-bootstrapping, and the model and curated dataset are available at https://huggingface.co/purbeshmitra/semantic-soft-bootstrapping.
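
Matching a sequence of teacher logits is commonly implemented as a per-token KL divergence between softened distributions. The sketch below shows that generic distillation loss; using KL(teacher || student), the temperature parameter, and averaging over tokens are assumptions, since the abstract only says the student tries to match the teacher's logits.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """Mean per-token KL between teacher and student next-token distributions.

    Both arguments are lists of per-token logit vectors of equal shape.
    """
    per_token = [kl_divergence(softmax(t, temperature), softmax(s, temperature))
                 for t, s in zip(teacher_logits, student_logits)]
    return sum(per_token) / len(per_token)
```

In the SSB setting the teacher logits would come from the model prompted with the correct and incorrect rollouts in context, and the student logits from the same model given only the question. Unlike a single scalar reward per rollout, this gives a training signal at every token.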

cs.CV [Back]

[33] Beyond Flicker: Detecting Kinematic Inconsistencies for Generalizable Deepfake Video Detection

Alejandro Cobo,Roberto Valle,José Miguel Buenaposada,Luis Baumela

Main category: cs.CV

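TL;DR: Proposes a deepfake detection method based on synthetic videos: manipulating facial motion bases yields training data with biomechanical inconsistencies, which improves generalization to unseen manipulations.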

Details

Motivation: Existing video deepfake detectors mostly focus on frame-to-frame instabilities and overlook a key vulnerability, namely the disruption of natural motion dependencies between facial regions; they also generalize poorly to unknown manipulation types. Method: Designs a synthetic video generation method: an autoencoder is trained to decompose facial landmarks into motion bases; manipulating these bases breaks the natural correlations in facial motion, and face morphing injects the resulting subtle motion inconsistencies into pristine videos, which are then used to train the detection network. Result: Achieves state-of-the-art generalization on several popular benchmarks, validating the use of biomechanical flaws for deepfake detection. Conclusion: Modeling kinematic anomalies in facial motion effectively improves the robustness and generalization of deepfake detectors to unknown forgery methods. Abstract: Generalizing deepfake detection to unseen manipulations remains a key challenge. A recent approach to tackle this issue is to train a network with pristine face images that have been manipulated with hand-crafted artifacts to extract more generalizable clues. While effective for static images, extending this to the video domain is an open issue. Existing methods model temporal artifacts as frame-to-frame instabilities, overlooking a key vulnerability: the violation of natural motion dependencies between different facial regions. In this paper, we propose a synthetic video generation method that creates training data with subtle kinematic inconsistencies. We train an autoencoder to decompose facial landmark configurations into motion bases. By manipulating these bases, we selectively break the natural correlations in facial movements and introduce these artifacts into pristine videos via face morphing. A network trained on our data learns to spot these sophisticated biomechanical flaws, achieving state-of-the-art generalization results on several popular benchmarks.

[34] OnSight Pathology: A real-time platform-agnostic computational pathology companion for histopathology

Jinzhen Hu,Kevin Faust,Parsa Babaei Zadeh,Adrienn Bourkas,Shane Eaton,Andrew Young,Anzar Alvi,Dimitrios George Oreopoulos,Ameesha Paliwal,Assem Saleh Alrumeh,Evelyn Rose Kamski-Hennekam,Phedias Diamandis

Main category: cs.CV

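TL;DR: OnSight Pathology is platform-agnostic computer vision software that assists pathologists in analyzing digital pathology slides through locally run, real-time AI inference; it supports a range of use cases without complex integration, promoting wider adoption of AI in histopathology.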

Details

Motivation: Existing AI pathology tools are mostly proprietary systems that depend on complex integration and high-performance hardware, which limits real-world clinical deployment; meanwhile, traditional pathological diagnosis relies on subjective judgment and needs more objective, accessible assistive tools. Method: Develops a standalone executable that runs locally without complex installation, uses continuous screen captures for cross-platform real-time AI inference, is compatible with multiple digital slide viewers and live microscope video feeds (including smartphone cameras), and integrates a multimodal chat assistant that provides image descriptions. Result: The system is validated on more than 2,500 public whole slide images and on clinical cases, is applied successfully to brain tumor classification, mitosis detection, and immunohistochemical stain quantification, and shows good compatibility with different platforms and live microscope input. Conclusion: OnSight Pathology lowers the technical barrier to applying AI in pathology, offering a secure, low-cost, plug-and-play solution that can help bring AI into routine pathology workflows as well as intraoperative, remote, and resource-limited settings. Abstract: The microscopic examination of surgical tissue remains a cornerstone of disease classification but relies on subjective interpretations and access to highly specialized experts, which can compromise accuracy and clinical care. While emerging breakthroughs in artificial intelligence (AI) offer promise for automated histological analysis, the growing number of proprietary digital pathology solutions has created barriers to real-world deployment. To address these challenges, we introduce OnSight Pathology, a platform-agnostic computer vision software that uses continuous custom screen captures to provide real-time AI inferences to users as they review digital slide images. Accessible as a single, self-contained executable file (https://onsightpathology.github.io/), OnSight Pathology operates locally on consumer-grade personal computers without complex software integration, enabling cost-effective and secure deployment in research and clinical workflows. Here we demonstrate the utility of OnSight Pathology using over 2,500 publicly available whole slide images across different slide viewers, as well as cases from our clinical digital pathology setup. The software's robustness is highlighted across routine histopathological tasks, including the classification of common brain tumor types, mitosis detection, and the quantification of immunohistochemical stains. A built-in multi-modal chat assistant provides verifiable descriptions of images, free of rigid class labels, for added quality control. 
Lastly, we show compatibility with live microscope camera feeds, including from personal smartphones, offering potential for deployment in more analog, inter-operative, and telepathology settings. Together, we highlight how OnSight Pathology can deliver real-time AI inferences across a broad range of pathology pipelines, removing key barriers to the adoption of AI tools in histopathology.

[35] Look Around and Pay Attention: Multi-camera Point Tracking Reimagined with Transformers

Bishoy Galoaa,Xiangyu Bai,Shayda Moezzi,Utsav Nandi,Sai Siddhartha Vivek Dhir Rangoju,Somaieh Amraee,Sarah Ostadabbas

Main category: cs.CV

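TL;DR: LAPA is an end-to-end, Transformer-based architecture for multi-camera point tracking that combines appearance matching with geometric constraints and uses attention to reason jointly across views and time, significantly improving tracking under complex motion and occlusion.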

Details

Motivation: Traditional multi-camera point tracking pipelines decouple detection, association, and tracking, which causes error propagation and temporal inconsistency, especially in complex scenes. LAPA aims to address these challenges with a unified model. Method: Proposes LAPA, which uses a cross-view attention mechanism to fuse appearance information with geometric priors, builds 3D point representations through attention-weighted aggregation to avoid the limitations of classical triangulation, and uses a Transformer decoder to model long-range dependencies and maintain temporal consistency. Result: Achieves 37.5% APD on TAPVid-3D-MC and 90.3% APD on PointOdyssey-MC, significantly outperforming existing methods, particularly in scenes with complex motion and occlusion. Conclusion: Through end-to-end attention, LAPA achieves more robust and consistent multi-camera point tracking, handles uncertainty and occlusion effectively, and offers a new approach for future video understanding tasks. Abstract: This paper presents LAPA (Look Around and Pay Attention), a novel end-to-end transformer-based architecture for multi-camera point tracking that integrates appearance-based matching with geometric constraints. Traditional pipelines decouple detection, association, and tracking, leading to error propagation and temporal inconsistency in challenging scenarios. LAPA addresses these limitations by leveraging attention mechanisms to jointly reason across views and time, establishing soft correspondences through a cross-view attention mechanism enhanced with geometric priors. Instead of relying on classical triangulation, we construct 3D point representations via attention-weighted aggregation, inherently accommodating uncertainty and partial observations. Temporal consistency is further maintained through a transformer decoder that models long-range dependencies, preserving identities through extended occlusions. Extensive experiments on challenging datasets, including our newly created multi-camera (MC) versions of TAPVid-3D panoptic and PointOdyssey, demonstrate that our unified approach significantly outperforms existing methods, achieving 37.5% APD on TAPVid-3D-MC and 90.3% APD on PointOdyssey-MC, particularly excelling in scenarios with complex motions and occlusions. Code is available at https://github.com/ostadabbas/Look-Around-and-Pay-Attention-LAPA-

[36] Generalized Event Partonomy Inference with Structured Hierarchical Predictive Learning

Zhou Chen,Joe Lin,Sathyanarayanan N. Aakur

Main category: cs.CV

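TL;DR: This paper proposes PARSE, a unified framework that learns multiscale event structure from streaming video without supervision, using a hierarchy of recurrent predictors to achieve human-like temporal abstraction and event understanding.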

Details

Motivation: To mimic the human ability to perceive continuous experience as a hierarchy of temporally nested events, so that computer vision models can segment video predictively and hierarchically. Method: Builds a hierarchical architecture of recurrent predictors in which each layer operates at a different temporal granularity: lower layers model short-term dynamics, and higher layers integrate long-term context through attention-based feedback; event boundaries emerge naturally from transient peaks in prediction error. Result: Achieves state-of-the-art performance among streaming methods on three benchmarks (Breakfast Actions, 50 Salads, and Assembly 101) and rivals offline baselines on temporal alignment (H-GEBD) and structural consistency (TED, hF1) metrics. Conclusion: Predictive learning under uncertainty offers a scalable path toward human-like temporal abstraction and compositional event understanding. Abstract: Humans naturally perceive continuous experience as a hierarchy of temporally nested events, fine-grained actions embedded within coarser routines. Replicating this structure in computer vision requires models that can segment video not just retrospectively, but predictively and hierarchically. We introduce PARSE, a unified framework that learns multiscale event structure directly from streaming video without supervision. PARSE organizes perception into a hierarchy of recurrent predictors, each operating at its own temporal granularity: lower layers model short-term dynamics while higher layers integrate longer-term context through attention-based feedback. Event boundaries emerge naturally as transient peaks in prediction error, yielding temporally coherent, nested partonomies that mirror the containment relations observed in human event perception. Evaluated across three benchmarks, Breakfast Actions, 50 Salads, and Assembly 101, PARSE achieves state-of-the-art performance among streaming methods and rivals offline baselines in both temporal alignment (H-GEBD) and structural consistency (TED, hF1). The results demonstrate that predictive learning under uncertainty provides a scalable path toward human-like temporal abstraction and compositional event understanding.
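
Editor's sketch: the idea that event boundaries appear as transient peaks in prediction error can be illustrated with a simple peak picker over a one-dimensional error signal. This is not the PARSE implementation; the function name `error_peaks`, the `z_thresh` parameter, and the thresholding rule (local maximum above mean plus a multiple of the standard deviation) are assumptions chosen for illustration.

```python
from statistics import mean, pstdev

def error_peaks(errors, z_thresh=1.0):
    """Return time indices where the prediction error is a local maximum
    that also exceeds mean + z_thresh * std of the whole signal.

    Hypothetical rule for illustration only; the abstract does not
    specify how peaks are detected.
    """
    cutoff = mean(errors) + z_thresh * pstdev(errors)
    peaks = []
    for t in range(1, len(errors) - 1):
        is_local_max = errors[t] > errors[t - 1] and errors[t] >= errors[t + 1]
        if is_local_max and errors[t] > cutoff:
            peaks.append(t)
    return peaks

# Toy per-frame prediction errors with two clear spikes.
errors = [0.1, 0.1, 0.2, 0.9, 0.2, 0.1, 0.1, 0.15, 1.1, 0.3, 0.1]
print(error_peaks(errors))  # [3, 8]
```

In a hierarchical setting one could, under the same assumptions, run this picker per layer with a larger smoothing window or threshold at higher layers, so that coarse boundaries are a subset of fine ones.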

[37] MoReGen: Multi-Agent Motion-Reasoning Engine for Code-based Text-to-Video Synthesis

Xiangyu Bai,He Liang,Bishoy Galoaa,Utsav Nandi,Shayda Moezzi,Yuhang He,Sarah Ostadabbas

Main category: cs.CV

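TL;DR: This paper proposes MoReGen, a new text-to-video generation framework that incorporates physics simulators to generate videos with accurate motion that obeys Newtonian mechanics, and builds MoReSet, a benchmark of 1,275 annotated videos for evaluating the physical consistency of videos.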

Details

Motivation: Although text-to-video generation has advanced in visual realism, generating videos that obey physical laws and show coherent motion remains challenging, especially with respect to Newton's laws of motion. Method: Proposes the MoReGen framework, which combines multi-agent large language models, a physics engine, and a renderer to generate physically accurate, reproducible videos in the code domain; it also designs an objective evaluation metric based on object-trajectory matching and builds the MoReSet benchmark for systematic evaluation. Result: Experiments show that existing mainstream T2V models perform poorly on physical validity, while MoReGen significantly improves the physical plausibility and motion consistency of generated videos. Conclusion: MoReGen offers an effective path toward physically credible, intent-aligned video generation and underlines the importance of bringing explicit physical modeling into generative models. Abstract: While text-to-video (T2V) generation has achieved remarkable progress in photorealism, generating intent-aligned videos that faithfully obey physics principles remains a core challenge. In this work, we systematically study Newtonian motion-controlled text-to-video generation and evaluation, emphasizing physical precision and motion coherence. We introduce MoReGen, a motion-aware, physics-grounded T2V framework that integrates multi-agent LLMs, physics simulators, and renderers to generate reproducible, physically accurate videos from text prompts in the code domain. To quantitatively assess physical validity, we propose object-trajectory correspondence as a direct evaluation metric and present MoReSet, a benchmark of 1,275 human-annotated videos spanning nine classes of Newtonian phenomena with scene descriptions, spatiotemporal relations, and ground-truth trajectories. Using MoReSet, we conduct experiments on existing T2V models, evaluating their physical validity through both our MoRe metrics and existing physics-based evaluators. Our results reveal that state-of-the-art models struggle to maintain physical validity, while MoReGen establishes a principled direction toward physically coherent video synthesis.

[38] ReasonX: MLLM-Guided Intrinsic Image Decomposition

Alara Dirik,Tuanfeng Wang,Duygu Ceylan,Stefanos Zafeiriou,Anna Frühstück

Main category: cs.CV

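TL;DR: ReasonX is a new framework that uses a multimodal large language model as a perceptual judge and fine-tunes intrinsic decomposition models through relative intrinsic comparisons on unlabeled real images, yielding significant improvements across architectures and modalities.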

Details

Motivation: Existing intrinsic image decomposition models trained with paired supervision on synthetic data generalize poorly to real-world scenes, so a method that adapts better to diverse real environments is needed. Method: Proposes the ReasonX framework, which uses a multimodal large language model (MLLM) as a perceptual judge that provides relative intrinsic comparisons and uses these comparisons as GRPO rewards for fine-tuning intrinsic decomposition models; alignment is achieved by rewarding agreement between relations derived analytically from the conditional predictor's outputs and the judge's assessments. Result: Across multiple base architectures and modalities, ReasonX yields significant improvements, including a 9-25% WHDR reduction on IIW albedo and up to 46% depth accuracy gains on ETH3D. Conclusion: MLLM-guided comparative supervision shows promise for bridging low-level and high-level vision reasoning, and the method is model-agnostic and applicable to different intrinsic decomposition models. Abstract: Intrinsic image decomposition aims to separate images into physical components such as albedo, depth, normals, and illumination. While recent diffusion- and transformer-based models benefit from paired supervision from synthetic datasets, their generalization to diverse, real-world scenarios remains challenging. We propose ReasonX, a novel framework that leverages a multimodal large language model (MLLM) as a perceptual judge providing relative intrinsic comparisons, and uses these comparisons as GRPO rewards for fine-tuning intrinsic decomposition models on unlabeled, in-the-wild images. Unlike RL methods for generative models, our framework aligns conditional intrinsic predictors by rewarding agreement between the judge's relational assessments and analytically derived relations from the model's outputs. ReasonX is model-agnostic and can be applied to different intrinsic predictors. Across multiple base architectures and modalities, ReasonX yields significant improvements, including 9-25% WHDR reduction on IIW albedo and up to 46% depth accuracy gains on ETH3D, highlighting the promise of MLLM-guided comparative supervision to bridge low- and high-level vision reasoning.

[39] 6 Fingers, 1 Kidney: Natural Adversarial Medical Images Reveal Critical Weaknesses of Vision-Language Models

Leon Mayer,Piotr Kalinowski,Caroline Ebersbach,Marcel Knopp,Tim Rädsch,Evangelia Christodoulou,Annika Reinke,Fiona R. Kolbinger,Lena Maier-Hein

Main category: cs.CV

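TL;DR: AdversarialAnatomyBench is the first benchmark that evaluates vision-language models on rare anatomical variants; it reveals a sharp performance drop on atypical anatomy, exposing the models' reliance on common anatomical priors and their limited generalization.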

Details

Motivation: Existing vision-language model (VLM) benchmarks focus mainly on common anatomical structures and do not reflect the challenges that rare variants pose in clinical practice, which leaves the reliability of these models in real use in doubt. A benchmark dedicated to evaluating models on rare anatomical variants is therefore needed. Method: Proposes AdversarialAnatomyBench, which contains real rare anatomical variants across multiple imaging modalities and anatomical regions, and systematically evaluates 22 state-of-the-art VLMs on typical versus atypical anatomical images. Result: On basic medical perception tasks, mean accuracy drops from 74% on typical anatomy to 29% on atypical anatomy; top models such as GPT-5, Gemini 2.5 Pro, and Llama 4 Maverick also show performance drops of 41-51%; error patterns reflect clear anatomical bias, and neither model scaling nor interventions such as bias-aware prompting mitigate the problem. Conclusion: Current VLMs show a serious generalization deficit on rare anatomical structures, exposing an over-reliance on priors about "typical" anatomy; AdversarialAnatomyBench provides a foundation for measuring and mitigating anatomical bias in multimodal medical AI. Abstract: Vision-language models are increasingly integrated into clinical workflows. However, existing benchmarks primarily assess performance on common anatomical presentations and fail to capture the challenges posed by rare variants. To address this gap, we introduce AdversarialAnatomyBench, the first benchmark comprising naturally occurring rare anatomical variants across diverse imaging modalities and anatomical regions. We call such variants that violate learned priors about "typical" human anatomy natural adversarial anatomy. Benchmarking 22 state-of-the-art VLMs with AdversarialAnatomyBench yielded three key insights. First, when queried with basic medical perception tasks, mean accuracy dropped from 74% on typical to 29% on atypical anatomy. Even the best-performing models, GPT-5, Gemini 2.5 Pro, and Llama 4 Maverick, showed performance drops of 41-51%. Second, model errors closely mirrored expected anatomical biases. Third, neither model scaling nor interventions, including bias-aware prompting and test-time reasoning, resolved these issues. These findings highlight a critical and previously unquantified limitation in current VLMs: their poor generalization to rare anatomical presentations. AdversarialAnatomyBench provides a foundation for systematically measuring and mitigating anatomical bias in multimodal medical AI systems.

[40] MVRoom: Controllable 3D Indoor Scene Generation with Multi-View Diffusion Models

Shaoheng Fang,Chaohui Yu,Fan Wang,Qixing Huang

Main category: cs.CV

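TL;DR: This paper proposes MVRoom, a controllable novel view synthesis method for indoor scenes based on multi-view diffusion and a coarse 3D layout; a two-stage design and an iterative framework enable high-fidelity, multi-object 3D scene generation.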

Details

Motivation: Existing novel view synthesis methods fall short in maintaining multi-view consistency and in controllable generation of complex indoor scenes, and they struggle to balance layout control against detail fidelity. Method: Uses a two-stage design: the first stage uses novel representations to bridge the 3D layout and image-based condition signals; the second stage performs image-conditioned multi-view generation with a layout-aware epipolar attention mechanism, and an iterative framework supports text-to-scene generation. Result: Experiments show that the method outperforms state-of-the-art baselines both quantitatively and qualitatively, generating high-fidelity, multi-view-consistent indoor scenes and supporting scenes of varying complexity. Conclusion: By combining 3D layout guidance with diffusion models, MVRoom achieves better controllability and visual quality in novel view synthesis and provides an effective solution for text-driven indoor scene generation. Abstract: We introduce MVRoom, a controllable novel view synthesis (NVS) pipeline for 3D indoor scenes that uses multi-view diffusion conditioned on a coarse 3D layout. MVRoom employs a two-stage design in which the 3D layout is used throughout to enforce multi-view consistency. The first stage employs novel representations to effectively bridge the 3D layout and consistent image-based condition signals for multi-view generation. The second stage performs image-conditioned multi-view generation, incorporating a layout-aware epipolar attention mechanism to enhance multi-view consistency during the diffusion process. Additionally, we introduce an iterative framework that generates 3D scenes with varying numbers of objects and scene complexities by recursively performing multi-view generation (MVRoom), supporting text-to-scene generation. Experimental results demonstrate that our approach achieves high-fidelity and controllable 3D scene generation for NVS, outperforming state-of-the-art baseline methods both quantitatively and qualitatively. Ablation studies further validate the effectiveness of key components within our generation pipeline.

[41] UniLight: A Unified Representation for Lighting

Zitian Zhang,Iliyan Georgiev,Michael Fischer,Yannick Hold-Geoffroy,Jean-François Lalonde,Valentin Deschaintre

Main category: cs.CV

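TL;DR: This paper proposes UniLight, a unified multimodal lighting representation that aligns several lighting modalities (text, images, irradiance, and environment maps) in a shared latent space, supporting cross-modal retrieval, environment-map generation, and lighting control in diffusion-based image synthesis.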

Details

Motivation: Different lighting representations (such as environment maps, spherical harmonics, and text) are mutually incompatible, which limits cross-modal transfer and application, so a unified representation is needed. Method: Designs modality-specific encoders that map lighting information from different modalities into a shared latent space through contrastive learning, with an auxiliary spherical-harmonics prediction task to strengthen directional understanding. Result: The method is validated on lighting retrieval, environment-map generation, and lighting control in diffusion models; experiments show that the representation is consistent and transfers well across modalities. Conclusion: UniLight provides a unified representation of multimodal lighting information, supports flexible cross-modal lighting manipulation, and offers a general lighting encoding for downstream vision tasks. Abstract: Lighting has a strong influence on visual appearance, yet understanding and representing lighting in images remains notoriously difficult. Various lighting representations exist, such as environment maps, irradiance, spherical harmonics, or text, but they are incompatible, which limits cross-modal transfer. We thus propose UniLight, a joint latent space as lighting representation, that unifies multiple modalities within a shared embedding. Modality-specific encoders for text, images, irradiance, and environment maps are trained contrastively to align their representations, with an auxiliary spherical-harmonics prediction task reinforcing directional understanding. Our multi-modal data pipeline enables large-scale training and evaluation across three tasks: lighting-based retrieval, environment-map generation, and lighting control in diffusion-based image synthesis. Experiments show that our representation captures consistent and transferable lighting features, enabling flexible manipulation across modalities.

Tasmiah Haque,Srinjoy Das

Main category: cs.CV

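TL;DR: Proposes GRU-SNF, a new inference-time refinement method that adds MCMC sampling to GRU-NF to increase diversity, improving multimodal modeling in video motion prediction.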

Details

Motivation: The deterministic transformation structure of the existing GRU-NF model limits its expressivity and makes it hard to capture the full diversity of future sequences, so its multimodal approximation ability for time series generation needs improvement. Method: Adds stochastic MCMC sampling steps on top of GRU-NF to form GRU-SNF, injecting randomness at inference time to explore a richer output space and increase diversity without retraining. Result: Experiments show that GRU-SNF outperforms GRU-NF on keypoint-based video motion transfer, substantially increasing output diversity while preserving prediction accuracy, with a more pronounced advantage at long prediction horizons. Conclusion: By combining stochastic dynamics with flow models at inference time, GRU-SNF effectively increases the diversity of generated sequences and shows potential for time series generation and practical applications. Abstract: Real-time video motion transfer applications such as immersive gaming and vision-based anomaly detection require accurate yet diverse future predictions to support realistic synthesis and robust downstream decision making under uncertainty. To improve the diversity of such sequential forecasts we propose a novel inference-time refinement technique that combines Gated Recurrent Unit-Normalizing Flows (GRU-NF) with stochastic sampling methods. While GRU-NF can capture multimodal distributions through its integration of normalizing flows within a temporal forecasting framework, its deterministic transformation structure can limit expressivity. To address this, inspired by Stochastic Normalizing Flows (SNF), we introduce Markov Chain Monte Carlo (MCMC) steps during GRU-NF inference, enabling the model to explore a richer output space and better approximate the true data distribution without retraining. We validate our approach in a keypoint-based video motion transfer pipeline, where capturing temporally coherent and perceptually diverse future trajectories is essential for realistic samples and low bandwidth communication. Experiments show that our inference framework, Gated Recurrent Unit-Stochastic Normalizing Flows (GRU-SNF), outperforms GRU-NF in generating diverse outputs without sacrificing accuracy, even under longer prediction horizons. By injecting stochasticity during inference, our approach captures multimodal behavior more effectively. These results highlight the potential of integrating stochastic dynamics with flow-based sequence models for generative time series forecasting.
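
Editor's sketch: the abstract mentions inserting MCMC steps during inference but does not say which kernel or target density is used. As one common possibility, the snippet below applies random-walk Metropolis steps to a sample under an unnormalized log-density. The function name `metropolis_refine`, its parameters, and the Gaussian proposal are assumptions for illustration, not the GRU-SNF procedure.

```python
import math
import random

def metropolis_refine(x0, log_density, n_steps=10, step_size=0.5, rng=None):
    """Refine a sample with random-walk Metropolis steps.

    x0          : initial sample (list of floats), e.g. a flow output
    log_density : callable returning an unnormalized log-density
                  (hypothetical stand-in for the model's target)
    Returns the sample after n_steps accept/reject updates.
    """
    rng = rng or random.Random(0)
    x = list(x0)
    logp = log_density(x)
    for _ in range(n_steps):
        # Symmetric Gaussian proposal around the current state.
        proposal = [xi + rng.gauss(0.0, step_size) for xi in x]
        logp_new = log_density(proposal)
        # Accept with probability min(1, p_new / p_old).
        if rng.random() < math.exp(min(0.0, logp_new - logp)):
            x, logp = proposal, logp_new
    return x

# Usage: start far from the mode of a standard normal and refine.
log_std_normal = lambda v: -0.5 * sum(c * c for c in v)
sample = metropolis_refine([5.0], log_std_normal, n_steps=2000, step_size=1.0)
```

Because the proposal is symmetric, the acceptance ratio needs only the density ratio; a method with a learned forward/backward kernel, as in stochastic normalizing flows generally, would also have to track the corresponding correction terms.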

[43] Plug-and-Play Image Restoration with Flow Matching: A Continuous Viewpoint

Fan Jia,Yuhao Huang,Shih-Hsin Wang,Cristina Garcia-Cardona,Andrea L. Bertozzi,Bao Wang

Main category: cs.CV

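TL;DR: This paper analyzes the continuous limit of the PnP-Flow model through a stochastic differential equation (SDE) and uses the analysis to improve image restoration performance.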

Details

Motivation: Although PnP-Flow has been empirically successful for image restoration, its theoretical understanding lags behind, and an in-depth analysis of error control and acceleration is missing. Method: Derives the continuous limit of PnP-Flow, builds an SDE surrogate model, and proposes improvements based on it: better step scheduling, regularization of the Lipschitz constant of the neural vector field, and acceleration via extrapolation. Result: The method is validated on denoising, deblurring, super-resolution, and inpainting, where it significantly outperforms the baseline PnP-Flow and other state-of-the-art methods. Conclusion: SDE modeling provides theoretical support for PnP-Flow and delivers performance gains through error analysis and acceleration strategies. Abstract: Flow matching-based generative models have been integrated into the plug-and-play image restoration framework, and the resulting plug-and-play flow matching (PnP-Flow) model has achieved some remarkable empirical success for image restoration. However, the theoretical understanding of PnP-Flow lags its empirical success. In this paper, we derive a continuous limit for PnP-Flow, resulting in a stochastic differential equation (SDE) surrogate model of PnP-Flow. The SDE model provides two particular insights to improve PnP-Flow for image restoration: (1) It enables us to quantify the error for image restoration, informing us to improve step scheduling and regularize the Lipschitz constant of the neural network-parameterized vector field for error reduction. (2) It informs us to accelerate off-the-shelf PnP-Flow models via extrapolation, resulting in a rescaled version of the proposed SDE model. We validate the efficacy of the SDE-informed improved PnP-Flow using several benchmark tasks, including image denoising, deblurring, super-resolution, and inpainting. Numerical results show that our method significantly outperforms the baseline PnP-Flow and other state-of-the-art approaches, achieving superior performance across evaluation metrics.

[44] Learning Single-Image Super-Resolution in the JPEG Compressed Domain

Sruthi Srinivasan,Elham Shakibapour,Rajy Rawther,Mehdi Saeedi

Main category: cs.CV

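TL;DR: Proposes a lightweight super-resolution pipeline that operates on JPEG DCT coefficients and trains directly in the encoded domain, substantially speeding up data loading and training while maintaining good visual quality.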

Details

Motivation: Data loading has become a major bottleneck for the training and inference speed of deep learning models, especially as input data keeps growing. Existing methods usually require full JPEG decoding, which adds unnecessary computational overhead. Method: Designs a lightweight super-resolution pipeline that operates directly on JPEG discrete cosine transform (DCT) coefficients, using encoded frequency-domain features for single-image super-resolution (SISR) and avoiding the full JPEG decoding process. Result: The method achieves a 2.6x speedup in data loading and a 2.5x speedup in training, with visual quality comparable to standard SISR approaches. Conclusion: Training super-resolution directly on encoded JPEG features is efficient and feasible; it substantially increases processing speed without sacrificing restoration quality and offers a new direction for efficient image restoration. Abstract: Deep learning models have grown increasingly complex, with input data sizes scaling accordingly. Despite substantial advances in specialized deep learning hardware, data loading continues to be a major bottleneck that limits training and inference speed. To address this challenge, we propose training models directly on encoded JPEG features, reducing the computational overhead associated with full JPEG decoding and significantly improving data loading efficiency. While prior works have focused on recognition tasks, we investigate the effectiveness of this approach for the restoration task of single-image super-resolution (SISR). We present a lightweight super-resolution pipeline that operates on JPEG discrete cosine transform (DCT) coefficients in the frequency domain. Our pipeline achieves a 2.6x speedup in data loading and a 2.5x speedup in training, while preserving visual quality comparable to standard SISR approaches.
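
Editor's sketch: for readers unfamiliar with the representation involved, the snippet below computes the 8x8 orthonormal DCT-II that JPEG applies to each block (after the level shift of 128). It only illustrates what "DCT coefficients" are; it omits quantization, chroma subsampling, and entropy coding, and it is not the paper's data loader, which would read the already-encoded coefficients from the file instead of recomputing them. The helper names are hypothetical.

```python
import math

def dct_matrix(n=8):
    """Orthonormal DCT-II basis matrix D, so that coeffs = D @ block @ D^T."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows

def matmul(a, b):
    """Plain-Python matrix product for small square blocks."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(row) for row in zip(*m)]

def block_dct(block):
    """2D DCT of one square block of level-shifted pixel values."""
    d = dct_matrix(len(block))
    return matmul(matmul(d, block), transpose(d))

# A flat block has all its energy in the DC coefficient (top-left entry).
flat = [[16.0] * 8 for _ in range(8)]
coeffs = block_dct(flat)
print(round(coeffs[0][0], 6))  # 128.0
```

For an 8x8 block this orthonormal form matches the forward DCT in the JPEG standard, so the DC term equals the block sum divided by 8.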

[45] Gamma-from-Mono: Road-Relative, Metric, Self-Supervised Monocular Geometry for Vehicular Applications

Gasser Elazab,Maximilian Jansen,Michael Unterreiner,Olaf Hellwich

Main category: cs.CV

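TL;DR: This paper proposes Gamma-from-Mono (GfM), a lightweight monocular geometry estimation method that decouples global and local structure to improve perception of the 3D environment around a vehicle, with particularly strong recovery of fine road geometry such as bumps and slopes. The method uses a dimensionless parameter, gamma, to express vertical deviation from the dominant road plane; combined with the camera height, it recovers metric depth analytically without full extrinsic calibration and supports self-supervised learning. State-of-the-art near-field depth and gamma estimation is demonstrated on the KITTI and RSRD datasets.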

Details

Motivation: Conventional monocular depth estimation tends to oversmooth fine road geometry such as bumps and slopes, which hurts motion planning and driving stability; a method is needed that recovers near-field road structure in detail and is robust across camera configurations. Method: Proposes Gamma-from-Mono (GfM), which decomposes reconstruction into predicting the dominant road plane plus residual variations; it introduces gamma (the ratio of a point's height to its depth) as a dimensionless variable for local vertical deviation, grounded in planar parallax geometry, so that metric depth is recovered in closed form from the camera's height above ground alone, and it uses self-supervised learning to avoid reliance on annotated data. Result: On KITTI and RSRD, GfM achieves state-of-the-art near-field depth and gamma estimation while maintaining good global depth performance; the model has only 8.88M parameters, adapts to a variety of camera setups, and is the first self-supervised monocular method evaluated on RSRD. Conclusion: With its physically interpretable gamma representation, GfM resolves the projective ambiguity of monocular reconstruction and recovers fine-grained road geometry accurately without complex calibration or annotated data, making it suitable for safety and comfort control in real autonomous driving. Abstract: Accurate perception of the vehicle's 3D surroundings, including fine-scale road geometry, such as bumps, slopes, and surface irregularities, is essential for safe and comfortable vehicle control. However, conventional monocular depth estimation often oversmooths these features, losing critical information for motion planning and stability. To address this, we introduce Gamma-from-Mono (GfM), a lightweight monocular geometry estimation method that resolves the projective ambiguity in single-camera reconstruction by decoupling global and local structure. GfM predicts a dominant road surface plane together with residual variations expressed by gamma, a dimensionless measure of vertical deviation from the plane, defined as the ratio of a point's height above it to its depth from the camera, and grounded in established planar parallax geometry. With only the camera's height above ground, this representation deterministically recovers metric depth via a closed form, avoiding full extrinsic calibration and naturally prioritizing near-road detail. Its physically interpretable formulation makes it well suited for self-supervised learning, eliminating the need for large annotated datasets. Evaluated on KITTI and the Road Surface Reconstruction Dataset (RSRD), GfM achieves state-of-the-art near-field accuracy in both depth and gamma estimation while maintaining competitive global depth performance. 
Our lightweight 8.88M-parameter model adapts robustly across diverse camera setups and, to our knowledge, is the first self-supervised monocular approach evaluated on RSRD.
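
Editor's sketch: the abstract states that metric depth follows in closed form from gamma and the camera height but does not give the formula. Under the following assumed conventions (not taken from the paper), one form can be derived: a 3D point is X = z * r, where r is the normalized camera ray with unit z-component and z is depth; the road plane is n . X = H with unit normal n pointing from the camera toward the road and H the camera height; the height above the plane is h = H - n . X. Then gamma = h / z = H / z - n . r, which gives z = H / (gamma + n . r). The function name and argument layout below are hypothetical.

```python
def depth_from_gamma(gamma, ray, plane_normal, camera_height):
    """Metric depth from gamma under the conventions stated above.

    gamma         : height above the road plane divided by depth
    ray           : normalized camera ray (x, y, 1) for the pixel
    plane_normal  : unit normal of the road plane, pointing from the
                    camera toward the road
    camera_height : distance H from the camera center to the plane
    """
    n_dot_r = sum(n * r for n, r in zip(plane_normal, ray))
    denom = gamma + n_dot_r
    if denom <= 0.0:
        # The ray never reaches this height in front of the camera.
        raise ValueError("no positive depth for this ray and gamma")
    return camera_height / denom

# Camera 1.5 m above a flat road, y-axis pointing down toward the road.
normal = (0.0, 1.0, 0.0)
ray = (0.0, 0.5, 1.0)
on_road = depth_from_gamma(0.0, ray, normal, 1.5)   # point on the plane
bump = depth_from_gamma(0.1, ray, normal, 1.5)      # point above the plane
```

With these numbers the on-plane point lies at depth 3 m, while gamma = 0.1 places the point at 2.5 m, that is 0.25 m above the road; this is consistent with h = gamma * z.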

[46] How (Mis)calibrated is Your Federated CLIP and What To Do About It?

Mainak Singha,Masih Aminbeidokhti,Paolo Casari,Elisa Ricci,Subhankar Roy

Main category: cs.CV

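TL;DR: This paper studies the calibration of CLIP models under federated learning (FL) and finds that existing textual prompt tuning and in-training calibration methods have limited effect in FL. The authors propose FL²oRA, a simple LoRA-based method that naturally improves calibration in federated learning and reduces the need for explicit calibration steps.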

Details

Motivation: Although vision-language models such as CLIP have been studied extensively, their calibration in federated learning settings has not been explored, even though calibration is essential for reliable predictions. Existing methods perform poorly in distributed settings, and an effective solution is needed. Method: Analyzes the calibration of several textual prompt tuning methods under federated learning, evaluates existing in-training calibration techniques combined with four global aggregation methods, and proposes FL²oRA, a selective LoRA-based fine-tuning strategy that improves calibration within the FL framework. Result: Experiments show that Textual Prompt Tuning degrades calibration under FL; existing calibration techniques bring limited improvement; FL²oRA significantly improves calibration on multiple benchmarks and yields more reliable predicted probabilities. Conclusion: The choice of which components to fine-tune is a key factor in CLIP calibration under federated learning; through this design choice, FL²oRA achieves stable, good calibration without extra calibration steps, offering an effective path toward reliable distributed vision-language models. Abstract: While vision-language models like CLIP have been extensively studied, their calibration, crucial for reliable predictions, has received limited attention. Although a few prior works have examined CLIP calibration in offline settings, the impact of fine-tuning CLIP in a federated learning (FL) setup remains unexplored. In this work, we investigate how FL affects CLIP calibration and propose strategies to improve reliability in this distributed setting. We first analyze Textual Prompt Tuning approaches and show that they degrade calibration metrics when operating under FL. We also evaluate existing in-training calibration techniques across four global aggregation methods, finding that they provide limited improvements. Our results suggest that the key challenge lies not only in how we aggregate or calibrate, but in which components we choose to fine-tune. Motivated by this insight, we propose $\text{FL}^2\text{oRA}$, a straightforward LoRA-based approach that naturally improves calibration in FL, and we analyze the factors behind its effectiveness. Experiments on multiple benchmarks demonstrate that $\text{FL}^2\text{oRA}$ consistently produces well-calibrated models, reducing the need for explicit calibration procedures. Codes are available at https://github.com/mainaksingha01/FL2oRA.
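
Editor's sketch: the abstract refers to "calibration metrics" without naming them. Expected Calibration Error (ECE) is one widely used choice, and the snippet below shows how it is typically computed with equal-width confidence bins. Whether the paper reports ECE, and with how many bins, is an assumption here; the function name and the default of 10 bins are illustrative.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins over [0, 1].

    confidences : predicted probability of the chosen class, per sample
    correct     : booleans, True where the prediction was right
    Returns the sample-weighted mean of |accuracy - confidence| per bin.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, h in members if h) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece

# An overconfident model: 80% confidence but only 50% accuracy.
gap = expected_calibration_error([0.8, 0.8, 0.8, 0.8],
                                 [True, True, False, False])
```

In this toy case all four samples share one bin, so the result is the gap between 0.8 confidence and 0.5 accuracy, about 0.3; a well-calibrated model would score close to 0.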

[47] Text-Only Training for Image Captioning with Retrieval Augmentation and Modality Gap Correction

Rui Fonseca,Bruno Martins,Gil Rocha

Main category: cs.CV

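TL;DR: This paper proposes TOMCap, a text-only training method for image captioning that needs no aligned image-text pairs; it uses CLIP representations and retrieval augmentation to guide a pre-trained language model in generating captions and outperforms existing methods in the unsupervised setting.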

Details

Motivation: To reduce reliance on human-annotated image-text data and explore image captioning without aligned image-text pairs. Method: Extracts image representations with CLIP and, after reducing the modality gap, combines them with retrieved caption examples and latent vector representations to prompt a pre-trained language model decoder to generate captions. Result: Experiments show that TOMCap outperforms existing training-free and text-only methods, and the impact of different retrieval-augmentation and modality-alignment choices is analyzed. Conclusion: TOMCap achieves image captioning without training on image-text pairs and demonstrates the potential of retrieval augmentation and modality alignment for cross-modal generation. Abstract: Image captioning has drawn considerable attention from the natural language processing and computer vision fields. Aiming to reduce the reliance on curated data, several studies have explored image captioning without any humanly-annotated image-text pairs for training, although existing methods are still outperformed by fully supervised approaches. This paper proposes TOMCap, i.e., an improved text-only training method that performs captioning without the need for aligned image-caption pairs. The method is based on prompting a pre-trained language model decoder with information derived from a CLIP representation, after undergoing a process to reduce the modality gap. We specifically tested the combined use of retrieved examples of captions, and latent vector representations, to guide the generation process. Through extensive experiments, we show that TOMCap outperforms other training-free and text-only methods. We also analyze the impact of different choices regarding the configuration of the retrieval-augmentation and modality gap reduction components.

[48] Real-time Cricket Sorting By Sex

Juan Manuel Cantarero Angulo,Matthew Smith

Main category: cs.CV

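TL;DR: This study presents a low-cost, real-time system for automated sex sorting of house crickets (Acheta domesticus) that combines computer vision with physical actuation and runs efficiently on resource-constrained devices, improving the sustainability and productivity of insect farming.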

Details

Motivation: With rising demand for edible insects as a sustainable protein source, better sex sorting in cricket farming can improve breeding efficiency, nutritional control, and production sustainability, yet automated solutions are lacking. Method: Develops a system based on a Raspberry Pi 5 and a dedicated AI camera, using a lightweight YOLOv8 nano object detection model for sex identification and a servo-driven sorting arm for physical sorting. Result: The model reaches an mAP@0.5 of 0.977 in testing, and the overall sorting accuracy in real-world experiments is 86.8%. Conclusion: The work demonstrates that lightweight deep learning models can be deployed on resource-constrained devices for automated insect sex sorting, offering a practical and efficient solution for industrial cricket farming. Abstract: The global demand for sustainable protein sources is driving increasing interest in edible insects, with Acheta domesticus (house cricket) identified as one of the most suitable species for industrial production. Current farming practices typically rear crickets in mixed-sex populations without automated sex sorting, despite potential benefits such as selective breeding, optimized reproduction ratios, and nutritional differentiation. This work presents a low-cost, real-time system for automated sex-based sorting of Acheta domesticus, combining computer vision and physical actuation. The device integrates a Raspberry Pi 5 with the official Raspberry AI Camera and a custom YOLOv8 nano object detection model, together with a servo-actuated sorting arm. The model reached a mean Average Precision at IoU 0.5 (mAP@0.5) of 0.977 during testing, and real-world experiments with groups of crickets achieved an overall sorting accuracy of 86.8%. These results demonstrate the feasibility of deploying lightweight deep learning models on resource-constrained devices for insect farming applications, offering a practical solution to improve efficiency and sustainability in cricket production.

[49] Mind-to-Face: Neural-Driven Photorealistic Avatar Synthesis via EEG Decoding

Haolin Xiong,Tianwen Fu,Pratusha Bhuvana Prasad,Yunxuan Cai,Haiwei Chen,Wenbin Teng,Hanyuan Xiao,Yajie Zhao

Main category: cs.CV

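TL;DR: This paper proposes Mind-to-Face, the first framework that generates high-fidelity facial expressions directly from non-invasive electroencephalogram (EEG) signals, decoding emotion and driving a personalized avatar without any visual input.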

Details

Motivation: Existing expression capture systems rely on visual cues and perform poorly when the face is occluded or emotions stay internal; this paper explores whether EEG signals encode affective facial dynamics well enough to enable more robust, personalized expression reconstruction. Method: Builds a dual-modality synchronized recording setup to collect EEG and multi-view facial video, uses a CNN-Transformer encoder to map EEG to dense 3D position maps, and renders realistic, view-consistent expression animation with a modified 3D Gaussian Splatting pipeline. Result: Experiments show that EEG alone can accurately predict subject-specific, dynamic facial expressions, including subtle emotional changes, confirming that neural signals carry rich affective and geometric information. Conclusion: Mind-to-Face establishes a new paradigm of neural-driven avatars, opening new possibilities for emotion-aware telepresence and cognitive interaction in immersive environments. Abstract: Current expressive avatar systems rely heavily on visual cues, failing when faces are occluded or when emotions remain internal. We present Mind-to-Face, the first framework that decodes non-invasive electroencephalogram (EEG) signals directly into high-fidelity facial expressions. We build a dual-modality recording setup to obtain synchronized EEG and multi-view facial video during emotion-eliciting stimuli, enabling precise supervision for neural-to-visual learning. Our model uses a CNN-Transformer encoder to map EEG signals into dense 3D position maps, capable of sampling over 65k vertices, capturing fine-scale geometry and subtle emotional dynamics, and renders them through a modified 3D Gaussian Splatting pipeline for photorealistic, view-consistent results. Through extensive evaluation, we show that EEG alone can reliably predict dynamic, subject-specific facial expressions, including subtle emotional responses, demonstrating that neural signals contain far richer affective and geometric information than previously assumed. Mind-to-Face establishes a new paradigm for neural-driven avatars, enabling personalized, emotion-aware telepresence and cognitive interaction in immersive environments.

[50] DisentangleFormer: Spatial-Channel Decoupling for Multi-Channel Vision

Jiashu Liao,Pietro Liò,Marc de Kamps,Duygu Sarikaya

Main category: cs.CV

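TL;DR: This paper proposes DisentangleFormer, a parallel architecture that decouples spatial and channel processing to address the entanglement of spatial and channel representations in Vision Transformers. It is particularly suited to hyperspectral imaging and achieves SOTA performance on multiple benchmarks while reducing computational cost.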

Details

Motivation: When standard self-attention processes multi-channel visual data, the spatial and channel dimensions become entangled, making it hard to model structural and semantic dependencies independently and limiting representation learning; the problem is most pronounced in tasks such as hyperspectral imaging, where channels carry distinct physical meaning. Method: Guided by information-theoretic principles of decorrelated representation learning, the authors design a parallel spatial-channel decoupled architecture: (1) a parallel disentanglement module processes spatial tokens and channel tokens separately; (2) a Squeezed Token Enhancer dynamically fuses the two streams; (3) a multi-scale feed-forward network adds local context to capture fine-grained dependencies. Result: The model reaches SOTA performance on the Indian Pine, Pavia University, Houston, and BigEarthNet remote sensing datasets and on an infrared pathology dataset, and stays competitive on ImageNet while cutting FLOPs by 17.8%. Conclusion: By decoupling spatial and channel modeling, DisentangleFormer yields more efficient and more expressive multi-channel visual representations, with strong accuracy and computational efficiency on hyperspectral and other multi-channel tasks. Abstract: Vision Transformers face a fundamental limitation: standard self-attention jointly processes spatial and channel dimensions, leading to entangled representations that prevent independent modeling of structural and semantic dependencies. This problem is especially pronounced in hyperspectral imaging, from satellite hyperspectral remote sensing to infrared pathology imaging, where channels capture distinct biophysical or biochemical cues. We propose DisentangleFormer, an architecture that achieves robust multi-channel vision representation through principled spatial-channel decoupling. Motivated by information-theoretic principles of decorrelated representation learning, our parallel design enables independent modeling of structural and semantic cues while minimizing redundancy between spatial and channel streams. Our design integrates three core components: (1) Parallel Disentanglement: independently processes spatial-token and channel-token streams, enabling decorrelated feature learning across spatial and spectral dimensions, (2) Squeezed Token Enhancer: an adaptive calibration module that dynamically fuses spatial and channel streams, and (3) Multi-Scale FFN: complements global attention with multi-scale local context to capture fine-grained structural and semantic dependencies.
Extensive experiments on hyperspectral benchmarks demonstrate that DisentangleFormer achieves state-of-the-art performance, consistently outperforming existing models on Indian Pine, Pavia University, and Houston, the large-scale BigEarthNet remote sensing dataset, as well as an infrared pathology dataset. Moreover, it retains competitive accuracy on ImageNet while reducing computational cost by 17.8% in FLOPs. The code will be made publicly available upon acceptance.
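
To make the idea of separate spatial-token and channel-token streams concrete, here is a minimal NumPy sketch. It assumes a feature matrix of N spatial positions by C channels and uses plain, projection-free self-attention; the function names are hypothetical, and the learned projections, the Squeezed Token Enhancer, and the Multi-Scale FFN described in the abstract are not modeled.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens):
    """Projection-free scaled dot-product attention over the rows of `tokens`."""
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[1])
    return softmax(scores, axis=-1) @ tokens


def parallel_streams(features):
    """Run attention over spatial tokens and channel tokens independently.

    features: array of shape (N, C), N spatial positions with C channels.
    """
    spatial = self_attention(features)       # rows are tokens: N tokens of size C
    channel = self_attention(features.T).T   # columns are tokens: C tokens of size N
    return spatial, channel
```

Both outputs keep the (N, C) shape, so a downstream fusion module can combine them element-wise.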

[51] SyncTrack4D: Cross-Video Motion Alignment and Video Synchronization for Multi-Video 4D Gaussian Splatting

Yonghan Lee,Tsung-Wei Huang,Shiv Gehlot,Jaehoon Choi,Guan-Ming Su,Dinesh Manocha

Main category: cs.CV

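TL;DR: This paper proposes SyncTrack4D, a new method for multi-video 4D Gaussian Splatting (4DGS) reconstruction from real-world unsynchronized videos, and the first general 4DGS approach that needs no predefined scene objects or prior models.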

Details

Motivation: Modeling dynamic 3D scenes is challenging because of their high dimensionality and the need to fuse information from multiple views; existing methods struggle to reconstruct time-evolving geometry and motion from unsynchronized videos. Method: Dense 4D track representations serve as cues for both cross-video synchronization and 4DGS reconstruction; per-video 4D feature tracks and their cross-video correspondences are computed with Fused Gromov-Wasserstein optimal transport; a global frame-level temporal alignment maximizes the motion overlap of matched tracks, and multi-video 4D Gaussian Splatting built on a motion-spline scaffold then achieves sub-frame synchronization. Result: On the Panoptic Studio and SyncNeRF Blender datasets, the average temporal error is below 0.26 frames and reconstruction fidelity reaches 26.3 PSNR; the output is a synchronized 4DGS representation with explicit 3D trajectories and a temporal offset for each video. Conclusion: SyncTrack4D is the first general 4D Gaussian Splatting method for unsynchronized video sets; without relying on predefined objects or prior models, it delivers accurate synchronization and high-quality 4D reconstruction. Abstract: Modeling dynamic 3D scenes is challenging due to their high-dimensional nature, which requires aggregating information from multiple views to reconstruct time-evolving 3D geometry and motion. We present a novel multi-video 4D Gaussian Splatting (4DGS) approach designed to handle real-world, unsynchronized video sets. Our approach, SyncTrack4D, directly leverages dense 4D track representation of dynamic scene parts as cues for simultaneous cross-video synchronization and 4DGS reconstruction. We first compute dense per-video 4D feature tracks and cross-video track correspondences by Fused Gromov-Wasserstein optimal transport approach. Next, we perform global frame-level temporal alignment to maximize overlapping motion of matched 4D tracks. Finally, we achieve sub-frame synchronization through our multi-video 4D Gaussian splatting built upon a motion-spline scaffold representation. The final output is a synchronized 4DGS representation with dense, explicit 3D trajectories, and temporal offsets for each video. We evaluate our approach on the Panoptic Studio and SyncNeRF Blender, demonstrating sub-frame synchronization accuracy with an average temporal error below 0.26 frames, and high-fidelity 4D reconstruction reaching 26.3 PSNR scores on the Panoptic Studio dataset. To the best of our knowledge, our work is the first general 4D Gaussian Splatting approach for unsynchronized video sets, without assuming the existence of predefined scene objects or prior models.
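
The frame-level alignment step can be pictured as a search over integer frame offsets. The sketch below is a simplified stand-in, not the paper's procedure: it assumes each video has already been reduced to a one-dimensional per-frame motion signal for a matched track, and it picks the offset with the highest normalized correlation. The function name and the `max_shift` parameter are hypothetical.

```python
import numpy as np


def best_frame_offset(motion_a, motion_b, max_shift):
    """Return the integer shift s that best aligns motion_a[t] with motion_b[t + s]."""
    a = np.asarray(motion_a, dtype=float)
    b = np.asarray(motion_b, dtype=float)
    best_shift, best_score = 0, -np.inf
    for s in range(-max_shift, max_shift + 1):
        lo = max(0, -s)                  # first index of a with a partner in b
        hi = min(len(a), len(b) - s)     # one past the last such index
        if hi - lo < 2:
            continue                     # too little overlap to compare
        x = a[lo:hi] - a[lo:hi].mean()
        y = b[lo + s:hi + s] - b[lo + s:hi + s].mean()
        denom = np.linalg.norm(x) * np.linalg.norm(y)
        score = float(x @ y / denom) if denom > 0 else -np.inf
        if score > best_score:
            best_shift, best_score = s, score
    return best_shift
```

An integer offset found this way would only be a starting point; the sub-frame refinement described in the abstract happens later, inside the 4DGS optimization.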

[52] Bayes-DIC Net: Estimating Digital Image Correlation Uncertainty with Bayesian Neural Networks

Biao Chen,Zhenhua Lei,Yahui Zhang,Tongzhi Niu

Main category: cs.CV

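TL;DR: Proposes a DIC dataset generation method based on non-uniform B-spline surfaces, together with the Bayes-DIC Net network, to achieve high-quality displacement field prediction with confidence estimation.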

Details

Motivation: Existing DIC datasets fail to cover realistic displacement scenarios, and deep learning models lack the ability to estimate the uncertainty of their predictions. Method: A large-scale DIC dataset is generated by building non-uniform B-spline displacement fields from randomly generated control points; Bayes-DIC Net uses multi-level feature extraction, aggregation through a single skip connection, and lightweight convolutional blocks that enlarge the receptive field, and adds dropout modules that stay active at inference time to obtain a Bayesian neural network. Result: The method produces large-scale datasets covering a variety of realistic displacement scenarios; Bayes-DIC Net improves prediction accuracy at low computational cost and can output prediction confidence. Conclusion: The work offers new ideas for dataset generation and algorithm improvement in DIC and makes deep learning models more reliable and practical in real applications. Abstract: This paper introduces a novel method for generating high-quality Digital Image Correlation (DIC) datasets based on non-uniform B-spline surfaces. By randomly generating control point coordinates, we construct displacement fields that encompass a variety of realistic displacement scenarios, which are subsequently used to generate speckle pattern datasets. This approach enables the generation of a large-scale dataset that captures real-world displacement field situations, thereby enhancing the training and generalization capabilities of deep learning-based DIC algorithms. Additionally, we propose a novel network architecture, termed Bayes-DIC Net, which extracts information at multiple levels during the down-sampling phase and facilitates the aggregation of information across various levels through a single skip connection during the up-sampling phase. Bayes-DIC Net incorporates a series of lightweight convolutional blocks designed to expand the receptive field and capture rich contextual information while minimizing computational costs. Furthermore, by integrating appropriate dropout modules into Bayes-DIC Net and activating them during the network inference stage, Bayes-DIC Net is transformed into a Bayesian neural network. This transformation allows the network to provide not only predictive results but also confidence levels in these predictions when processing real unlabeled datasets. This feature significantly enhances the practicality and reliability of our network in real-world displacement field prediction tasks.
Through these innovations, this paper offers new perspectives and methods for dataset generation and algorithm performance enhancement in the field of DIC.
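
Keeping dropout active at inference and aggregating several stochastic forward passes is commonly called Monte Carlo dropout. The following NumPy sketch illustrates the idea on a toy two-layer network; the network, weight shapes, and parameter names are assumptions for illustration and have nothing to do with the actual Bayes-DIC Net architecture.

```python
import numpy as np


def mc_dropout_predict(x, w1, w2, drop_p=0.2, passes=50, seed=0):
    """Mean prediction and per-output spread from repeated dropout passes.

    x: (batch, in_dim), w1: (in_dim, hidden), w2: (hidden, out_dim).
    """
    rng = np.random.default_rng(seed)
    outputs = []
    for _ in range(passes):
        hidden = np.maximum(x @ w1, 0.0)            # ReLU layer
        keep = rng.random(hidden.shape) >= drop_p   # fresh random mask every pass
        hidden = hidden * keep / (1.0 - drop_p)     # inverted dropout scaling
        outputs.append(hidden @ w2)
    outputs = np.stack(outputs)                     # (passes, batch, out_dim)
    return outputs.mean(axis=0), outputs.std(axis=0)
```

The standard deviation across passes serves as a rough confidence signal: outputs that vary a lot between passes are the ones the model is least sure about.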

[53] A Retrieval-Augmented Generation Approach to Extracting Algorithmic Logic from Neural Networks

Waleed Khalid,Dmitry Ignatov,Radu Timofte

Main category: cs.CV

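TL;DR: NN-RAG is a retrieval-augmented generation system that turns large PyTorch codebases into a searchable, executable library of neural network modules, supporting cross-repository module reuse and validation and markedly improving the discoverability and diversity of neural architectures.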

Details

Motivation: Existing tools struggle to discover, extract, and validate reusable neural network modules across large numbers of open-source repositories, which limits research efficiency and architectural diversity. Method: NN-RAG extracts and validates neural modules from heterogeneous PyTorch codebases through scope-aware dependency resolution, import-preserving reconstruction, and a validator-gated promotion mechanism, and identifies unique architectures with multi-level de-duplication (exact, lexical, structural). Result: From 19 major repositories the pipeline extracted 1,289 candidate modules and validated 941 (73.0%), of which over 80% are structurally unique; it contributes approximately 72% of the novel network structures in the LEMUR dataset and enables cross-repository migration of architectural patterns. Conclusion: NN-RAG is the first open-source system to provide large-scale, validated, executable retrieval and reuse of neural modules, advancing the reproducibility of neural architectures and algorithmic discovery. Abstract: Reusing existing neural-network components is central to research efficiency, yet discovering, extracting, and validating such modules across thousands of open-source repositories remains difficult. We introduce NN-RAG, a retrieval-augmented generation system that converts large, heterogeneous PyTorch codebases into a searchable and executable library of validated neural modules. Unlike conventional code search or clone-detection tools, NN-RAG performs scope-aware dependency resolution, import-preserving reconstruction, and validator-gated promotion -- ensuring that every retrieved block is scope-closed, compilable, and runnable. Applied to 19 major repositories, the pipeline extracted 1,289 candidate blocks, validated 941 (73.0%), and demonstrated that over 80% are structurally unique. Through multi-level de-duplication (exact, lexical, structural), we find that NN-RAG contributes the overwhelming majority of unique architectures to the LEMUR dataset, supplying approximately 72% of all novel network structures. Beyond quantity, NN-RAG uniquely enables cross-repository migration of architectural patterns, automatically identifying reusable modules in one project and regenerating them, dependency-complete, in another context. To our knowledge, no other open-source system provides this capability at scale. The framework's neutral specifications further allow optional integration with language models for synthesis or dataset registration without redistributing third-party code.
Overall, NN-RAG transforms fragmented vision code into a reproducible, provenance-tracked substrate for algorithmic discovery, offering a first open-source solution that both quantifies and expands the diversity of executable neural architectures across repositories.
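
The abstract names three de-duplication levels (exact, lexical, structural) without defining them. One plausible reading, sketched below with only the Python standard library, hashes the raw text, the comment- and whitespace-insensitive token stream, and the identifier-normalized syntax tree. These definitions and all function names are assumptions, not the NN-RAG implementation.

```python
import ast
import hashlib
import io
import tokenize


def _digest(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def exact_key(src):
    """Identical only when the source text is byte-for-byte the same."""
    return _digest(src)


def lexical_key(src):
    """Ignores comments, blank lines, and spacing; keeps every real token."""
    parts = []
    for tok in tokenize.generate_tokens(io.StringIO(src).readline):
        if tok.type in (tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER):
            continue
        if tok.type in (tokenize.INDENT, tokenize.DEDENT, tokenize.NEWLINE):
            parts.append(tokenize.tok_name[tok.type])  # layout token, width ignored
        else:
            parts.append(tok.string)
    return _digest(" ".join(parts))


def structural_key(src):
    """Ignores identifier names; compares the shape of the syntax tree."""
    tree = ast.parse(src)
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            node.id = "_"
        elif isinstance(node, ast.arg):
            node.arg = "_"
        elif isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            node.name = "_"
    return _digest(ast.dump(tree))


def deduplicate(sources):
    """Keep a source only if none of its three keys has been seen before."""
    seen, unique = set(), []
    for src in sources:
        keys = {("exact", exact_key(src)),
                ("lexical", lexical_key(src)),
                ("structural", structural_key(src))}
        if seen.isdisjoint(keys):
            unique.append(src)
        seen |= keys
    return unique
```

Under this reading, two functions that differ only in formatting collapse at the lexical level, and two that differ only in variable names collapse at the structural level.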

[54] Open Set Face Forgery Detection via Dual-Level Evidence Collection

Zhongyi Cai,Bryce Gernon,Wentao Bao,Yifan Li,Matthew Wright,Yu Kong

Main category: cs.CV

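TL;DR: This paper studies Open Set Face Forgery Detection (OSFFD) and proposes Dual-Level Evidential face forgery Detection (DLED), which fuses category-specific evidence at the spatial and frequency levels to estimate prediction uncertainty. It clearly outperforms existing methods at detecting novel forgery types while remaining strong on the conventional real/fake classification task.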

Details

Motivation: Existing forgery detection methods are mostly limited to binary classification or to recognizing known forgery categories, so they cope poorly with continually emerging forgery techniques and cannot detect novel categories. Method: The Dual-Level Evidential Detection (DLED) method collects and fuses category-specific evidence at the spatial and frequency levels and uses uncertainty estimation to identify both known and unknown forgery types. Result: Across diverse experimental settings, DLED outperforms baseline models by an average of 20% in detecting novel forgery categories and stays competitive on the conventional real/fake detection task. Conclusion: Through uncertainty modeling, DLED improves detection of novel face forgeries and makes detection systems more applicable and robust in open environments. Abstract: The proliferation of face forgeries has increasingly undermined confidence in the authenticity of online content. Given the rapid development of face forgery generation algorithms, new fake categories are likely to keep appearing, posing a major challenge to existing face forgery detection methods. Despite recent advances in face forgery detection, existing methods are typically limited to binary Real-vs-Fake classification or the identification of known fake categories, and are incapable of detecting the emergence of novel types of forgeries. In this work, we study the Open Set Face Forgery Detection (OSFFD) problem, which demands that the detection model recognize novel fake categories. We reformulate the OSFFD problem and address it through uncertainty estimation, enhancing its applicability to real-world scenarios. Specifically, we propose the Dual-Level Evidential face forgery Detection (DLED) approach, which collects and fuses category-specific evidence on the spatial and frequency levels to estimate prediction uncertainty. Extensive evaluations conducted across diverse experimental settings demonstrate that the proposed DLED method achieves state-of-the-art performance, outperforming various baseline models by an average of 20% in detecting forgeries from novel fake categories. Moreover, on the traditional Real-versus-Fake face forgery detection task, our DLED method concurrently exhibits competitive performance.
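
Evidential approaches typically turn non-negative per-class evidence into Dirichlet parameters and read off an explicit uncertainty mass. The sketch below follows that standard subjective-logic formulation: with K classes and evidence e_k, the Dirichlet strength is S = sum(e_k) + K, the belief for class k is e_k / S, and the uncertainty is K / S. Fusing the spatial and frequency levels by simply adding their evidence is an assumption made here for illustration; the abstract does not state DLED's fusion rule.

```python
def dirichlet_uncertainty(evidence):
    """Per-class belief masses and overall uncertainty from non-negative evidence."""
    num_classes = len(evidence)
    strength = sum(evidence) + num_classes       # S = sum(e_k + 1)
    beliefs = [e / strength for e in evidence]   # b_k = e_k / S
    uncertainty = num_classes / strength         # u = K / S, so sum(b_k) + u = 1
    return beliefs, uncertainty


def fuse_and_score(spatial_evidence, frequency_evidence):
    """Hypothetical fusion: add the two evidence vectors, then score them."""
    fused = [s + f for s, f in zip(spatial_evidence, frequency_evidence)]
    return dirichlet_uncertainty(fused)
```

A sample from an unseen forgery category would ideally gather little evidence for any known class, which pushes the uncertainty toward 1 and lets a threshold flag it as novel.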

[55] Mitigating Object and Action Hallucinations in Multimodal LLMs via Self-Augmented Contrastive Alignment

Kai-Po Chang,Wei-Yuan Cheng,Chi-Pin Huang,Fu-En Yang,Yu-Chiang Frank Wang

Main category: cs.CV

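TL;DR: Proposes SANTA, a self-augmented contrastive alignment framework that reduces object and action hallucinations of multimodal large language models in video captioning by generating negatives through self-augmentation and applying fine-grained contrastive alignment, substantially improving caption faithfulness.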

Details

Motivation: Existing multimodal large language models make serious factual errors (hallucinations) when captioning videos, especially about objects and actions in dynamic scenes, and current methods do not mitigate this effectively. Method: The Self-Augmented Contrastive Alignment (SANTA) framework (1) uses a hallucination-aware self-augmentation strategy to generate negative captions containing likely hallucinations, and (2) introduces tracklet-phrase contrastive alignment that matches regional objects with visual phrases and relation-guided actions with temporal phrases at a fine-grained level, breaking spurious correlations and strengthening attention to actual visual facts. Result: On multiple hallucination benchmarks, SANTA clearly outperforms existing methods at reducing object and action hallucinations while also improving overall caption quality. Conclusion: Through self-augmented negatives and fine-grained cross-modal contrastive alignment, SANTA effectively mitigates object and action hallucinations in video captioning and improves the factual consistency of generated content. Abstract: Recent advancement in multimodal LLMs (MLLMs) has demonstrated their remarkable capability to generate descriptive captions for input videos. However, these models suffer from factual inaccuracies in the generated descriptions, causing severe hallucination issues. While prior works have explored alleviating hallucinations for static images, jointly mitigating visual object and temporal action hallucinations for dynamic videos remains a challenging and unsolved task. To tackle this challenge, we propose a Self-Augmented Contrastive Alignment (SANTA) framework for enabling object and action faithfulness by exempting the spurious correlations and enforcing the emphasis on visual facts. SANTA employs a hallucinative self-augmentation scheme to identify the potential hallucinations that lie in the MLLM and transform the original captions to the contrasted negatives. Furthermore, we develop a tracklet-phrase contrastive alignment to match the regional objects and relation-guided actions with their corresponding visual and temporal phrases. Extensive experiments demonstrate that SANTA outperforms existing methods in alleviating object and action hallucinations, yielding superior performance on the hallucination examination benchmarks.

[56] MAFNet: Multi-frequency Adaptive Fusion Network for Real-time Stereo Matching

Ao Xu,Rujin Zhao,Xiong Xu,Boceng Huang,Yujia Jia,Hongfeng Long,Fuxuan Chen,Zilong Cao,Fangyuan Chen

Main category: cs.CV

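TL;DR: Proposes MAFNet, a stereo matching network based on multi-frequency adaptive fusion that produces high-quality disparity maps using only efficient 2D convolutions, striking a good balance between accuracy and real-time performance.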

Details

Motivation: Existing stereo matching methods fall short either in computational overhead or in non-local context modeling, which makes real-time use on resource-constrained mobile devices difficult. Method: An adaptive frequency-domain filtering attention module decomposes the full cost volume into high-frequency and low-frequency parts and aggregates features for each separately, and a Linformer-based low-rank attention mechanism adaptively fuses the high- and low-frequency information. Result: The method significantly outperforms existing real-time methods on public datasets such as Scene Flow and KITTI 2015. Conclusion: With efficient 2D convolutions and frequency-domain attention, MAFNet improves both the accuracy and the speed of stereo matching and is well suited to deployment on mobile devices. Abstract: Existing stereo matching networks typically rely on either cost-volume construction based on 3D convolutions or deformation methods based on iterative optimization. The former incurs significant computational overhead during cost aggregation, whereas the latter often lacks the ability to model non-local contextual information. These methods exhibit poor compatibility on resource-constrained mobile devices, limiting their deployment in real-time applications. To address this, we propose a Multi-frequency Adaptive Fusion Network (MAFNet), which can produce high-quality disparity maps using only efficient 2D convolutions. Specifically, we design an adaptive frequency-domain filtering attention module that decomposes the full cost volume into high-frequency and low-frequency volumes, performing frequency-aware feature aggregation separately. Subsequently, we introduce a Linformer-based low-rank attention mechanism to adaptively fuse high- and low-frequency information, yielding more robust disparity estimation. Extensive experiments demonstrate that the proposed MAFNet significantly outperforms existing real-time methods on public datasets such as Scene Flow and KITTI 2015, showing a favorable balance between accuracy and real-time performance.
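
Splitting a cost volume into low- and high-frequency parts can be illustrated with a fixed radial low-pass mask in the Fourier domain. The sketch below is only that: a fixed filter applied per disparity slice, with a hypothetical `cutoff` parameter. MAFNet's filtering is described as adaptive and attention-based, which this example does not attempt to reproduce.

```python
import numpy as np


def split_frequencies(volume, cutoff=0.25):
    """Split a (D, H, W) volume into low- and high-frequency parts per slice.

    cutoff is a radius in cycles per pixel (the Nyquist limit is 0.5).
    """
    spectrum = np.fft.fft2(volume, axes=(-2, -1))
    fy = np.fft.fftfreq(volume.shape[-2])[:, None]
    fx = np.fft.fftfreq(volume.shape[-1])[None, :]
    low_mask = np.sqrt(fx ** 2 + fy ** 2) <= cutoff   # symmetric radial low-pass
    low = np.real(np.fft.ifft2(spectrum * low_mask, axes=(-2, -1)))
    high = volume - low                               # residual holds fine detail
    return low, high
```

Because the high band is defined as the residual, the two parts always sum back to the input, so no information is lost before the fusion stage.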

[57] FMA-Net++: Motion- and Exposure-Aware Real-World Joint Video Super-Resolution and Deblurring

Geunhyuk Youk,Jihyong Oh,Munchurl Kim

Main category: cs.CV

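TL;DR: FMA-Net++ is a framework for real-world video restoration that jointly handles super-resolution and deblurring and explicitly models the coupled effect of motion and dynamically varying exposure.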

Details

Motivation: Most existing methods overlook the complex degradation that arises in real videos when motion combines with dynamically varying exposure (for example under auto-exposure or low-light capture), which limits their performance in practice; a model that accounts for both factors is needed to improve restoration. Method: FMA-Net++ adopts a sequence-level architecture based on hierarchical bidirectional propagation; an Exposure Time-aware Modulation layer and an exposure-aware Flow-Guided Dynamic Filtering module model per-frame exposure conditions and motion-related degradation kernels respectively; degradation learning is decoupled from restoration, with the predicted priors guiding the restoration network. Result: Trained on synthetic data, FMA-Net++ achieves state-of-the-art performance on the newly proposed REDS-ME and REDS-RE benchmarks and on GoPro, with strong temporal consistency and inference speed, and generalizes well to real-world videos. Conclusion: FMA-Net++ effectively handles the complex degradation caused by coupled motion and dynamic exposure in real videos, delivering accurate, efficient, and robust joint video super-resolution and deblurring. Abstract: Real-world video restoration is plagued by complex degradations from motion coupled with dynamically varying exposure - a key challenge largely overlooked by prior works and a common artifact of auto-exposure or low-light capture. We present FMA-Net++, a framework for joint video super-resolution and deblurring that explicitly models this coupled effect of motion and dynamically varying exposure. FMA-Net++ adopts a sequence-level architecture built from Hierarchical Refinement with Bidirectional Propagation blocks, enabling parallel, long-range temporal modeling. Within each block, an Exposure Time-aware Modulation layer conditions features on per-frame exposure, which in turn drives an exposure-aware Flow-Guided Dynamic Filtering module to infer motion- and exposure-aware degradation kernels. FMA-Net++ decouples degradation learning from restoration: the former predicts exposure- and motion-aware priors to guide the latter, improving both accuracy and efficiency. To evaluate under realistic capture conditions, we introduce REDS-ME (multi-exposure) and REDS-RE (random-exposure) benchmarks. Trained solely on synthetic data, FMA-Net++ achieves state-of-the-art accuracy and temporal consistency on our new benchmarks and GoPro, outperforming recent methods in both restoration quality and inference speed, and generalizes well to challenging real-world videos.

[58] Fourier-Attentive Representation Learning: A Fourier-Guided Framework for Few-Shot Generalization in Vision-Language Models

Hieu Dinh Trung Pham,Huy Minh Nhat Nguyen,Cuong Tuan Nguyen

Main category: cs.CV

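TL;DR: This paper proposes the Fourier-Attentive Representation Learning (FARL) framework, which uses Fourier analysis to explicitly disentangle structure and style information in visual representations, improving the generalization of large-scale pre-trained vision-language models.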

Details

Motivation: Existing vision-language models usually entangle an image's structure and style implicitly in a holistic representation, which limits cross-domain generalization; an explicit disentangling mechanism is needed to improve adaptability. Method: The FARL framework uses a dual cross-attention mechanism in which learnable representation tokens extract information separately from the phase spectrum (structural features) and the amplitude spectrum (stylistic features); an asymmetric injection strategy then injects the disentangled features into the deep layers of the VLM encoders to guide adaptation. Result: Extensive experiments on 15 datasets show that the method clearly outperforms existing approaches, confirming its effectiveness. Conclusion: By explicitly disentangling structural and stylistic features and integrating them into the VLM, FARL improves robustness and cross-domain generalization and offers a new direction for visual representation learning. Abstract: Large-scale pre-trained Vision-Language Models (VLMs) have demonstrated strong few-shot learning capabilities. However, these methods typically learn holistic representations where an image's domain-invariant structure is implicitly entangled with its domain-specific style. This presents an opportunity to further enhance generalization by disentangling these visual cues. In this paper, we propose Fourier-Attentive Representation Learning (FARL), a novel framework that addresses this by explicitly disentangling visual representations using Fourier analysis. The core of our method is a dual cross-attention mechanism, where learnable representation tokens separately query an image's structural features (from the phase spectrum) and stylistic features (from the amplitude spectrum). This process yields enriched, disentangled tokens that are then injected deep into the VLM encoders to guide adaptation. Our design, which includes an asymmetric injection strategy, forces the model to learn a more robust vision-language alignment. Extensive experiments on 15 datasets demonstrate the effectiveness of our approach.
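
The split into a phase spectrum (structure) and an amplitude spectrum (style) can be reproduced with a 2D FFT. The NumPy sketch below shows the decomposition, its inverse, and an amplitude swap between two images; the function names are hypothetical, and the cross-attention tokens that FARL builds on top of these spectra are not modeled.

```python
import numpy as np


def split_spectrum(image):
    """Return (amplitude, phase) of the 2D Fourier transform of a grayscale image."""
    spectrum = np.fft.fft2(image)
    return np.abs(spectrum), np.angle(spectrum)


def merge_spectrum(amplitude, phase):
    """Rebuild a real-valued image from an amplitude and a phase spectrum."""
    return np.real(np.fft.ifft2(amplitude * np.exp(1j * phase)))


def swap_style(content_image, style_image):
    """Keep the phase (structure) of one image, take the amplitude (style) of another."""
    _, content_phase = split_spectrum(content_image)
    style_amplitude, _ = split_spectrum(style_image)
    return merge_spectrum(style_amplitude, content_phase)
```

Recombining an image's own amplitude and phase returns the original image, which is a quick sanity check that the decomposition is lossless.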

[59] Performance Evaluation of Transfer Learning Based Medical Image Classification Techniques for Disease Detection

Zeeshan Ahmad,Shudi Bao,Meng Chen

Main category: cs.CV

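TL;DR: This paper systematically analyzes transfer learning with deep convolutional neural networks for medical image classification, evaluates six pre-trained models on a chest X-ray dataset, finds that InceptionV3 performs best, and examines how model architecture, dataset size, and domain similarity between tasks affect transfer learning.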

Details

Motivation: Because medical image data is limited and training large deep learning models is costly, training from scratch is impractical, so the effectiveness and applicability of transfer learning in this domain need to be studied. Method: Six pre-trained deep convolutional neural networks (AlexNet, VGG16, ResNet18, ResNet34, ResNet50, InceptionV3) are applied via transfer learning to a custom chest X-ray dataset and evaluated for classification performance, uncertainty, and runtime efficiency. Result: InceptionV3 performs best on all standard metrics; the ResNet family improves with depth; VGG16 and AlexNet perform reasonably but with lower accuracy; transfer learning is especially beneficial when data is scarce; a good feature extractor paired with a lightweight feedforward network is enough for efficient prediction. Conclusion: Transfer learning offers clear advantages for medical image classification; model selection should weigh architecture, data scale, and domain similarity between tasks, and the study provides practical guidance for choosing models in real applications. Abstract: Medical image classification plays an increasingly vital role in identifying various diseases by classifying medical images, such as X-rays, MRIs and CT scans, into different categories based on their features. In recent years, deep learning techniques have attracted significant attention in medical image classification. However, it is usually infeasible to train an entire large deep learning model from scratch. To address this issue, one of the solutions is the transfer learning (TL) technique, where a pre-trained model is reused for a new task. In this paper, we present a comprehensive analysis of TL techniques for medical image classification using deep convolutional neural networks. We evaluate six pre-trained models (AlexNet, VGG16, ResNet18, ResNet34, ResNet50, and InceptionV3) on a custom chest X-ray dataset for disease detection. The experimental results demonstrate that InceptionV3 consistently outperforms other models across all the standard metrics. The ResNet family shows progressively better performance with increasing depth, whereas VGG16 and AlexNet perform reasonably well but with lower accuracy. In addition, we also conduct uncertainty analysis and runtime comparison to assess the robustness and computational efficiency of these models. Our findings reveal that TL is beneficial in most cases, especially with limited data, but the extent of improvement depends on several factors such as model architecture, dataset size, and domain similarity between source and target tasks.
Moreover, we demonstrate that with a well-trained feature extractor, only a lightweight feedforward model is enough to provide efficient prediction. As such, this study contributes to the understanding of TL in medical image classification, and provides insights for selecting appropriate models based on specific requirements.

[60] Dual-Stream Spectral Decoupling Distillation for Remote Sensing Object Detection

Xiangyi Gao,Danpei Zhao,Bo Yuan,Wentao Li

Main category: cs.CV

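TL;DR: Proposes DS2D2, a general dual-stream spectral decoupling distillation method for remote sensing object detection that combines explicit and implicit knowledge distillation and effectively improves detection of small, dense objects.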

Details

Motivation: In remote sensing images, existing knowledge distillation methods suffer from knowledge confusion caused by mixed features and by ignoring subtle feature differences. Method: Explicit and implicit distillation are built on spectral decomposition: a first-order wavelet transform performs the decomposition and a Density-Independent Scale Weight (DISW) is designed on top of it; full-frequency and high-frequency amplifiers extract subtle student-teacher feature discrepancies as implicit knowledge. Result: Experiments on the DIOR and DOTA datasets show that DS2D2 clearly outperforms existing methods, improving AP50 on DIOR by 4.2% for RetinaNet and 3.8% for Faster R-CNN. Conclusion: DS2D2 is an architecture-agnostic and effective distillation method that decouples the knowledge in remote sensing images and improves detection performance, especially in scenes with dense small objects. Abstract: Knowledge distillation is an effective and hardware-friendly method, which plays a key role in lightweighting remote sensing object detection. However, existing distillation methods often encounter the issue of mixed features in remote sensing images (RSIs), and neglect the discrepancies caused by subtle feature variations, leading to entangled knowledge confusion. To address these challenges, we propose an architecture-agnostic distillation method named Dual-Stream Spectral Decoupling Distillation (DS2D2) for universal remote sensing object detection tasks. Specifically, DS2D2 integrates explicit and implicit distillation grounded in spectral decomposition. Firstly, the first-order wavelet transform is applied for spectral decomposition to preserve the critical spatial characteristics of RSIs. Leveraging this spatial preservation, a Density-Independent Scale Weight (DISW) is designed to address the challenges of dense and small object detection common in RSIs. Secondly, we show implicit knowledge hidden in subtle student-teacher feature discrepancies, which significantly influence predictions when activated by detection heads. This implicit knowledge is extracted via full-frequency and high-frequency amplifiers, which map feature differences to prediction deviations. Extensive experiments on DIOR and DOTA datasets validate the effectiveness of the proposed method. Specifically, on DIOR dataset, DS2D2 achieves improvements of 4.2% in AP50 for RetinaNet and 3.8% in AP50 for Faster R-CNN, outperforming existing distillation approaches. The source code will be available at https://github.com/PolarAid/DS2D2.
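
A single-level 2D wavelet transform splits a feature map into one low-frequency band and three detail bands at half resolution, so spatial layout is preserved within each band. The sketch below uses the Haar wavelet and reads "first-order" as a single decomposition level; both choices are assumptions, since the abstract names neither the wavelet family nor the level count.

```python
import numpy as np


def haar_dwt2(x):
    """Single-level orthonormal 2D Haar transform of an array with even height and width.

    Returns (approx, detail_cols, detail_rows, detail_diag), each half-size.
    """
    x = np.asarray(x, dtype=float)
    a = x[0::2, 0::2]   # top-left sample of every 2x2 block
    b = x[0::2, 1::2]   # top-right
    c = x[1::2, 0::2]   # bottom-left
    d = x[1::2, 1::2]   # bottom-right
    approx = (a + b + c + d) / 2.0        # local average (low frequency)
    detail_cols = (a - b + c - d) / 2.0   # change between neighbouring columns
    detail_rows = (a + b - c - d) / 2.0   # change between neighbouring rows
    detail_diag = (a - b - c + d) / 2.0   # diagonal change
    return approx, detail_cols, detail_rows, detail_diag
```

Because this transform is orthonormal, the total energy of the four bands equals that of the input, which makes it easy to weight the bands separately without rescaling.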

[61] UTrice: Unifying Primitives in Differentiable Ray Tracing and Rasterization via Triangles for Particle-Based 3D Scenes

Changhe Liu,Ehsan Javanmardi,Naren Bao,Alex Orsholits,Manabu Tsukada

Main category: cs.CV

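TL;DR: Proposes a differentiable triangle-based ray tracing pipeline that treats triangles directly as rendering primitives without proxy geometry, achieving high-quality real-time rendering and unifying the rendering primitives used in novel-view synthesis.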

Details

Motivation: Existing methods for ray tracing Gaussian particles rely on proxy geometry, which requires building complex intermediate meshes and running costly intersection tests, and Gaussian particles are poorly suited as a unified primitive for both rasterization and ray tracing. Method: A differentiable triangle-based ray tracing pipeline uses optimized triangles directly as rendering primitives, needs no proxy geometry, and shares primitives with rasterization methods such as Triangle Splatting. Result: Rendering quality is significantly higher than that of existing ray tracing methods while real-time performance is maintained, and triangles optimized by Triangle Splatting can be rendered directly. Conclusion: By adopting triangles as the unified rendering primitive, the method overcomes the limitations of Gaussian particles in ray tracing and enables high-quality, efficient novel-view synthesis. Abstract: Ray tracing 3D Gaussian particles enables realistic effects such as depth of field, refractions, and flexible camera modeling for novel-view synthesis. However, existing methods trace Gaussians through proxy geometry, which requires constructing complex intermediate meshes and performing costly intersection tests. This limitation arises because Gaussian-based particles are not well suited as unified primitives for both ray tracing and rasterization. In this work, we propose a differentiable triangle-based ray tracing pipeline that directly treats triangles as rendering primitives without relying on any proxy geometry. Our results show that the proposed method achieves significantly higher rendering quality than existing ray tracing approaches while maintaining real-time rendering performance. Moreover, our pipeline can directly render triangles optimized by the rasterization-based method Triangle Splatting, thus unifying the primitives used in novel-view synthesis.
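
Treating triangles as ray tracing primitives ultimately rests on a ray-triangle intersection test. The sketch below implements the standard Möller-Trumbore test in plain Python as background; it is the classic hard intersection, not the differentiable formulation used in UTrice, and the function names are illustrative.

```python
def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def ray_triangle_intersect(origin, direction, v0, v1, v2, eps=1e-9):
    """Moller-Trumbore test: distance t along the ray to the hit, or None on a miss."""
    edge1, edge2 = _sub(v1, v0), _sub(v2, v0)
    p = _cross(direction, edge2)
    det = _dot(edge1, p)
    if abs(det) < eps:
        return None                        # ray is parallel to the triangle plane
    inv_det = 1.0 / det
    to_origin = _sub(origin, v0)
    u = _dot(to_origin, p) * inv_det       # first barycentric coordinate
    if u < 0.0 or u > 1.0:
        return None
    q = _cross(to_origin, edge1)
    v = _dot(direction, q) * inv_det       # second barycentric coordinate
    if v < 0.0 or u + v > 1.0:
        return None
    t = _dot(edge2, q) * inv_det
    return t if t > eps else None          # ignore hits behind the ray origin
```

The barycentric coordinates u and v computed along the way are what a renderer would use to interpolate per-vertex attributes at the hit point.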

[62] Explainable Parkinsons Disease Gait Recognition Using Multimodal RGB-D Fusion and Large Language Models

Manar Alnaasan,Md Selim Sarowar,Sungho Kim

Main category: cs.CV

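TL;DR: This paper proposes an explainable RGB-D multimodal framework for recognizing Parkinsonian gait in realistic settings. It combines dual YOLOv11 encoders with a cross-spatial fusion mechanism and adds a frozen large language model that generates clinically readable explanations, improving accuracy, robustness, and interpretability.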

Details

Motivation: Existing gait analysis methods for Parkinson's disease are limited by single-modality input, poor robustness, and a lack of clinical transparency, and fall short of practical needs. Method: Dual YOLOv11 encoders extract RGB and Depth features, followed by a Multi-Scale Local-Global Extraction (MLGE) module and a cross-spatial fusion mechanism; a frozen large language model translates the fused features into clinical textual explanations. Result: Experiments on multimodal gait datasets show higher recognition accuracy and better robustness to environmental variation than single-modality baselines, along with clear visual-linguistic reasoning. Conclusion: By combining multimodal visual features with language-based interpretability, the study establishes a reliable new paradigm for Parkinson's disease gait analysis and bridges the gap between visual recognition and clinical understanding. Abstract: Accurate and interpretable gait analysis plays a crucial role in the early detection of Parkinson's disease (PD), yet most existing approaches remain limited by single-modality inputs, low robustness, and a lack of clinical transparency. This paper presents an explainable multimodal framework that integrates RGB and Depth (RGB-D) data to recognize Parkinsonian gait patterns under realistic conditions. The proposed system employs dual YOLOv11-based encoders for modality-specific feature extraction, followed by a Multi-Scale Local-Global Extraction (MLGE) module and a Cross-Spatial Neck Fusion mechanism to enhance spatial-temporal representation. This design captures both fine-grained limb motion (e.g., reduced arm swing) and overall gait dynamics (e.g., short stride or turning difficulty), even in challenging scenarios such as low lighting or occlusion caused by clothing. To ensure interpretability, a frozen Large Language Model (LLM) is incorporated to translate fused visual embeddings and structured metadata into clinically meaningful textual explanations. Experimental evaluations on multimodal gait datasets demonstrate that the proposed RGB-D fusion framework achieves higher recognition accuracy, improved robustness to environmental variations, and clear visual-linguistic reasoning compared with single-input baselines. By combining multimodal feature learning with language-based interpretability, this study bridges the gap between visual recognition and clinical understanding, offering a novel vision-language paradigm for reliable and explainable Parkinson's disease gait analysis. Code: https://github.com/manaralnaasan/RGB-D_parkinson-LLM

[63] Self-Paced and Self-Corrective Masked Prediction for Movie Trailer Generation

Sidan Zhu,Hongteng Xu,Dixin Luo

Main category: cs.CV

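TL;DR: Proposes SSMP, a self-paced and self-corrective masked prediction method for movie trailer generation that achieves state-of-the-art performance through bi-directional contextual modeling and a progressive self-correction mechanism.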

Details

Motivation: Most existing automatic trailer generation methods follow a "selection-then-ranking" paradigm that suffers from error propagation and limits output quality, so a more robust approach is needed. Method: SSMP uses a Transformer encoder for bi-directional contextual modeling over movie shot sequences, trains it by masked prediction with a self-paced mask ratio, and at generation time applies progressive self-correction, filling in high-confidence shots step by step and re-masking the remaining positions. Result: In both quantitative evaluation and user studies, SSMP outperforms existing automatic trailer generation methods, showing stronger generation ability and editing logic. Conclusion: With self-paced training and self-corrective generation, SSMP overcomes the error propagation of conventional methods and offers a new approach to video summarization and editing tasks. Abstract: As a challenging video editing task, movie trailer generation involves selecting and reorganizing movie shots to create engaging trailers. Currently, most existing automatic trailer generation methods employ a "selection-then-ranking" paradigm (i.e., first selecting key shots and then ranking them), which suffers from inevitable error propagation and limits the quality of the generated trailers. Beyond this paradigm, we propose a new self-paced and self-corrective masked prediction method called SSMP, which achieves state-of-the-art results in automatic trailer generation via bi-directional contextual modeling and progressive self-correction. In particular, SSMP trains a Transformer encoder that takes the movie shot sequences as prompts and generates corresponding trailer shot sequences accordingly. The model is trained via masked prediction, reconstructing each trailer shot sequence from its randomly masked counterpart. The mask ratio is self-paced, allowing the task difficulty to adapt to the model and thereby improving model performance. When generating a movie trailer, the model fills the shot positions with high confidence at each step and re-masks the remaining positions for the next prediction, forming a progressive self-correction mechanism that is analogous to how human editors work. Both quantitative results and user studies demonstrate the superiority of SSMP in comparison to existing automatic movie trailer generation methods. Demo is available at: https://github.com/Dixin-Lab/SSMP.
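
The generation loop described above (commit the most confident positions, re-mask the rest, predict again) can be sketched independently of the model. In the example below, `predict` is a hypothetical callable standing in for the trained Transformer encoder, and the even split of positions across steps is an assumed schedule that the abstract does not specify.

```python
import math

MASK = None  # sentinel for a trailer position that has not been filled yet


def progressive_fill(length, predict, steps):
    """Fill a trailer of `length` shots over at most `steps` prediction rounds.

    predict(sequence) must return {position: (shot_id, confidence)} for every
    masked position, given the partially filled sequence as context.
    """
    sequence = [MASK] * length
    for step in range(steps):
        masked = [i for i, shot in enumerate(sequence) if shot is MASK]
        if not masked:
            break
        proposals = predict(sequence)
        # Assumed schedule: spread the remaining positions over the remaining steps.
        quota = math.ceil(len(masked) / (steps - step))
        ranked = sorted(masked, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:quota]:
            sequence[i] = proposals[i][0]   # commit only the most confident shots
        # Everything not committed stays masked and is re-predicted next round.
    return sequence
```

Since later rounds see the shots committed earlier, low-confidence positions get re-predicted with more context, which is the self-correction effect the abstract describes.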

[64] MindDrive: An All-in-One Framework Bridging World Models and Vision-Language Model for End-to-End Autonomous Driving

Bin Sun,Yaoguang Cao,Yan Wang,Rui Wang,Jiachen Shang,Xiejie Feng,Jiayi Lu,Jia Shi,Shichun Yang,Xiaoyu Yan,Ziying Song

Main category: cs.CV

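TL;DR: This paper proposes MindDrive, an end-to-end autonomous driving framework that combines high-quality trajectory generation with multi-objective decision reasoning. Following a "context simulation - candidate generation - multi-objective trade-off" paradigm, it achieves superior safety, comfort, and efficiency.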

Details

Motivation: Existing autonomous driving methods separate trajectory generation from decision selection: generation-oriented methods lack complex decision-making, while selection-oriented methods have limited generative capability. This work aims to unify the strengths of both and build a more intelligent, interpretable driving system. Method: The MindDrive framework has two core modules: a Future-aware Trajectory Generator (FaTG) based on a World Action Model (WaM), which runs conditioned "what-if" simulations and generates foresighted trajectories, and a VLM-oriented Evaluator (VLoE), which uses a large vision-language model to evaluate candidates across safety, comfort, efficiency, and other dimensions and make trade-off decisions. Result: Experiments on the NAVSIM-v1 and NAVSIM-v2 benchmarks show that MindDrive reaches state-of-the-art levels on multiple driving metrics, significantly improving safety, compliance, and generalization. Conclusion: By fusing generation and reasoning, MindDrive offers a new path toward interpretable, cognitively guided autonomous driving. Abstract: End-to-End autonomous driving (E2E-AD) has emerged as a new paradigm, where trajectory planning plays a crucial role. Existing studies mainly follow two directions: trajectory generation oriented, which focuses on producing high-quality trajectories with simple decision mechanisms, and trajectory selection oriented, which performs multi-dimensional evaluation to select the best trajectory yet lacks sufficient generative capability. In this work, we propose MindDrive, a harmonized framework that integrates high-quality trajectory generation with comprehensive decision reasoning. It establishes a structured reasoning paradigm of "context simulation - candidate generation - multi-objective trade-off". In particular, the proposed Future-aware Trajectory Generator (FaTG), based on a World Action Model (WaM), performs ego-conditioned "what-if" simulations to predict potential future scenes and generate foresighted trajectory candidates. Building upon this, the VLM-oriented Evaluator (VLoE) leverages the reasoning capability of a large vision-language model to conduct multi-objective evaluations across safety, comfort, and efficiency dimensions, leading to reasoned and human-aligned decision making. Extensive experiments on the NAVSIM-v1 and NAVSIM-v2 benchmarks demonstrate that MindDrive achieves state-of-the-art performance across multi-dimensional driving metrics, significantly enhancing safety, compliance, and generalization. This work provides a promising path toward interpretable and cognitively guided autonomous driving.

[65] StreamEQA: Towards Streaming Video Understanding for Embodied Scenarios

Yifei Wang,Zhenkai Li,Tianwen Qian,Huanran Zheng,Zheng Wang,Yuqian Fu,Xiaoling Wang

Main category: cs.CV

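TL;DR: This paper proposes StreamEQA, the first benchmark for streaming video question answering in embodied scenarios. It evaluates multimodal large models along two dimensions, embodied (perception, interaction, planning) and streaming (backward, real-time, and forward reasoning), and builds 42 tasks and about 21K timestamped question-answer pairs from 156 long videos. Experiments show that existing models still struggle on these tasks.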

Details

Motivation: As embodied intelligence moves toward real-world deployment, models must continuously understand dynamic visual input and reason over time, yet existing benchmarks cannot adequately evaluate models under the combined demands of embodiment and streaming input, so a new benchmark is needed. Method: Builds the StreamEQA benchmark, comprising 156 long videos, 42 tasks, and about 21K question-answer pairs with precise timestamps; designs two orthogonal evaluation dimensions, embodied (perception, interaction, planning) and streaming (backward, real-time, forward reasoning); constructs the data with a mix of automated generation and human refinement; and evaluates 13 advanced video large models. Result: Experiments on 13 state-of-the-art video large language models show that, although these models do well on conventional benchmarks, they perform poorly on the streaming embodied understanding tasks in StreamEQA, with clear weaknesses on tasks that require temporal reasoning and high-level planning. Conclusion: StreamEQA fills the gap in evaluating streaming video understanding for embodied intelligence, exposes the limits of current multimodal large models in dynamic environment perception and continuous decision making, and may encourage research on continuous visual understanding and reasoning in real-world settings. Abstract: As embodied intelligence advances toward real-world deployment, the ability to continuously perceive and reason over streaming visual inputs becomes essential. In such settings, an agent must maintain situational awareness of its environment, comprehend the interactions with surrounding entities, and dynamically plan actions informed by past observations, current contexts, and anticipated future events. To facilitate progress in this direction, we introduce StreamEQA, the first benchmark designed for streaming video question answering in embodied scenarios. StreamEQA evaluates existing MLLMs along two orthogonal dimensions: Embodied and Streaming. Along the embodied dimension, we categorize the questions into three levels: perception, interaction, and planning, which progressively assess a model's ability to recognize fine-grained visual details, reason about agent-object interactions, and perform high-level goal-directed reasoning. For the streaming dimension, questions are divided into backward, real-time, and forward reasoning, with each mode relying on a distinct temporal context. Built upon 156 independent long videos, StreamEQA defines 42 tasks and generates approximately 21K question-answer pairs with precise timestamps through a hybrid pipeline combining automated generation and human refinement. Evaluations of 13 state-of-the-art video-LLMs reveal that, despite strong performance on conventional benchmarks, these models still struggle with streaming video understanding in embodied scenarios. We hope StreamEQA will catalyze research on streaming video understanding for embodied applications.

[66] GuidNoise: Single-Pair Guided Diffusion for Generalized Noise Synthesis

Changjin Kim,HyeokJun Lee,YoungJoon Yoo

Main category: cs.CV

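TL;DR: This paper proposes GuidNoise, a single-pair guided diffusion model for generalized noise synthesis. It needs only one noisy-clean image pair as guidance and no extra metadata, uses GAFM and a noise-aware refine loss to generate more realistic noise, and can efficiently produce paired data for training augmentation, significantly improving denoising performance.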

Details

Motivation: Existing generative approaches to real noise synthesis usually depend on camera metadata and many target-specific noisy-clean image pairs, which limits generalization and makes data expensive to collect, so a lighter, more general, and easier-to-deploy noise synthesis method is needed. Method: Proposes GuidNoise, which is guided by a single noisy-clean image pair; introduces a guidance-aware affine feature modification (GAFM) and a noise-aware refine loss that refine the diffusion model's backward process so that it better captures and generates real noise distributions. Result: GuidNoise generates high-quality synthetic noisy images under diverse noise environments without extra metadata at training or inference time; it can efficiently generate noisy-clean pairs at inference time for data augmentation; experiments show that it significantly improves denoising performance for lightweight models and limited training data. Conclusion: With single-pair guidance, GuidNoise achieves general and efficient noise synthesis, reduces the dependence on metadata and large paired datasets, and is practical and extensible, especially in resource-constrained real-world settings. Abstract: Recent image denoising methods have leveraged generative modeling for real noise synthesis to address the costly acquisition of real-world noisy data. However, these generative models typically require camera metadata and extensive target-specific noisy-clean image pairs, often showing limited generalization between settings. In this paper, to mitigate the prerequisites, we propose a Single-Pair Guided Diffusion for generalized noise synthesis, GuidNoise, which uses a single noisy/clean pair as the guidance, often easily obtained by itself within a training set. To train GuidNoise, which generates synthetic noisy images from the guidance, we introduce a guidance-aware affine feature modification (GAFM) and a noise-aware refine loss to leverage the inherent potential of diffusion models. This loss function refines the diffusion model's backward process, making the model more adept at generating realistic noise distributions. GuidNoise synthesizes high-quality noisy images under diverse noise environments without additional metadata during both training and inference. Additionally, GuidNoise enables the efficient generation of noisy-clean image pairs at inference time, making synthetic noise readily applicable for augmenting training data. This self-augmentation significantly improves denoising performance, especially in practical scenarios with lightweight models and limited training data. The code is available at https://github.com/chjinny/GuidNoise.

[67] dVLM-AD: Enhance Diffusion Vision-Language-Model for Driving via Controllable Reasoning

Yingzi Ma,Yulong Cao,Wenhao Ding,Shuibai Zhang,Yan Wang,Boris Ivanovic,Ming Jiang,Marco Pavone,Chaowei Xiao

Main category: cs.CV

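TL;DR: This paper proposes dVLM-AD, a vision-language model based on discrete diffusion that improves the controllability and consistency of end-to-end autonomous driving systems in out-of-distribution scenarios, showing better reasoning-action alignment and planning performance than autoregressive models.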

Details

Motivation: Existing autoregressive vision-language models for end-to-end driving are limited by causal attention and sequential generation, which makes it hard to guarantee consistency and controllability between high-level reasoning and low-level planning. Method: Proposes dVLM-AD, a discrete diffusion architecture with bidirectional attention that unifies perception, structured reasoning, and low-level planning through iterative denoising, improving adaptation to complex driving scenarios. Result: On nuScenes and WOD-E2E, dVLM-AD improves behavior-trajectory consistency by 9% over autoregressive baselines and RFS on long-tail scenarios by 6%, with planning performance comparable to existing advanced systems. Conclusion: Diffusion-based VLMs offer a more controllable, reliable, and scalable path for end-to-end autonomous driving. Abstract: The autonomous driving community is increasingly focused on addressing the challenges posed by out-of-distribution (OOD) driving scenarios. A dominant research trend seeks to enhance end-to-end (E2E) driving systems by integrating vision-language models (VLMs), leveraging their rich world knowledge and reasoning abilities to improve generalization across diverse environments. However, most existing VLMs or vision-language agents (VLAs) are built upon autoregressive (AR) models. In this paper, we observe that existing AR-based VLMs -- limited by causal attention and sequential token generation -- often fail to maintain consistency and controllability between high-level reasoning and low-level planning. In contrast, recent discrete diffusion VLMs equipped with bidirectional attention exhibit superior controllability and reliability through iterative denoising. Building on these observations, we introduce dVLM-AD, a diffusion-based vision-language model that unifies perception, structured reasoning, and low-level planning for end-to-end driving. Evaluated on nuScenes and WOD-E2E, dVLM-AD yields more consistent reasoning-action pairs and achieves planning performance comparable to existing driving VLM/VLA systems despite a modest backbone, outperforming AR-based baselines with a 9 percent improvement in behavior-trajectory consistency and a 6 percent increase in RFS on long-tail WOD-E2E scenarios. These results suggest a controllable and reliable pathway for scalable end-to-end driving.

[68] UniTS: Unified Time Series Generative Model for Remote Sensing

Yuxiang Zhang,Shunlin Liang,Wenyuan Li,Han Ma,Jianglei Xu,Yichuan Ma,Jiangwei Xie,Wei Li,Mengmeng Zhang,Ran Tao,Xiang-Gen Xia

Main category: cs.CV

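TL;DR: This paper proposes UniTS, a unified time series generative model based on the flow matching paradigm, for handling multiple remote sensing time series tasks such as cloud removal, reconstruction, change detection, and forecasting.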

Details

Motivation: Existing methods usually design a dedicated model for each task and lack unified modeling of spatiotemporal features across tasks. Method: Proposes the UniTS model, which uses a diffusion transformer architecture and introduces an Adaptive Condition Injector (ACor) and a Spatiotemporal-aware Modulator (STM) to model multiple tasks in a unified way. Result: Significantly outperforms existing methods on multiple low-level and high-level time series tasks, especially under severe cloud contamination, missing modalities, and forecasting of phenological variations. Conclusion: UniTS realizes a unified modeling paradigm for remote sensing time series analysis, with strong generative and cognitive capabilities, and advances dynamic monitoring of the Earth system. Abstract: One of the primary objectives of satellite remote sensing is to capture the complex dynamics of the Earth environment, which encompasses tasks such as reconstructing continuous cloud-free time series images, detecting land cover changes, and forecasting future surface evolution. However, existing methods typically require specialized models tailored to different tasks, lacking unified modeling of spatiotemporal features across multiple time series tasks. In this paper, we propose a Unified Time Series Generative Model (UniTS), a general framework applicable to various time series tasks, including time series reconstruction, time series cloud removal, time series semantic change detection, and time series forecasting. Based on the flow matching generative paradigm, UniTS constructs a deterministic evolution path from noise to targets under the guidance of task-specific conditions, achieving unified modeling of spatiotemporal representations for multiple tasks. The UniTS architecture consists of a diffusion transformer with spatio-temporal blocks, where we design an Adaptive Condition Injector (ACor) to enhance the model's conditional perception of multimodal inputs, enabling high-quality controllable generation. Additionally, we design a Spatiotemporal-aware Modulator (STM) to improve the ability of spatio-temporal blocks to capture complex spatiotemporal dependencies. Furthermore, we construct two high-quality multimodal time series datasets, TS-S12 and TS-S12CR, filling the gap of benchmark datasets for time series cloud removal and forecasting tasks. Extensive experiments demonstrate that UniTS exhibits exceptional generative and cognitive capabilities in both low-level and high-level time series tasks. It significantly outperforms existing methods, particularly when facing challenges such as severe cloud contamination, modality absence, and forecasting phenological variations.

[69] DeRA: Decoupled Representation Alignment for Video Tokenization

Pengbo Guo,Junke Wang,Zhen Xing,Chengxu Liu,Daoguo Dong,Xueming Qian,Zuxuan Wu

Main category: cs.CV

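TL;DR: DeRA is a new 1D video tokenizer that improves training efficiency and performance by decoupling spatial and temporal representation learning. It uses separate appearance and motion streams and introduces the SACP module to ease the gradient conflicts caused by heterogeneous supervision, and it performs strongly on several video generation tasks.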

Details

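Motivation: Existing video tokenizers model spatial semantics and temporal dynamics jointly, which leads to inefficient training and limited performance and makes it hard to exploit pretrained vision foundation models. A more efficient video representation that is compatible with pretrained models is therefore needed. Method: Proposes DeRA, which factorizes video encoding into two separate streams, appearance and motion, each aligned with pretrained vision foundation models to capture spatial semantics and temporal dynamics; it keeps a compact 1D latent space and introduces the SACP module, which resolves the gradient conflicts caused by heterogeneous supervision by suppressing gradient components along conflicting directions. Result: On UCF-101, DeRA improves rFVD by 25% over LARP, the previous state-of-the-art tokenizer; autoregressive video generation built on DeRA sets a new state of the art on both UCF-101 class-conditional generation and K600 frame prediction. Conclusion: By decoupling spatial and temporal representations and using pretrained models, DeRA significantly improves video tokenization and generation; the SACP module effectively eases gradient conflicts in multi-task learning and offers a new direction for efficient video modeling. Abstract: This paper presents DeRA, a novel 1D video tokenizer that decouples the spatial-temporal representation learning in video tokenization to achieve better training efficiency and performance. Specifically, DeRA maintains a compact 1D latent space while factorizing video encoding into appearance and motion streams, which are aligned with pretrained vision foundation models to capture the spatial semantics and temporal dynamics in videos separately. To address the gradient conflicts introduced by the heterogeneous supervision, we further propose the Symmetric Alignment-Conflict Projection (SACP) module that proactively reformulates gradients by suppressing the components along conflicting directions. Extensive experiments demonstrate that DeRA outperforms LARP, the previous state-of-the-art video tokenizer, by 25% on UCF-101 in terms of rFVD. Moreover, using DeRA for autoregressive video generation, we also achieve new state-of-the-art results on both UCF-101 class-conditional generation and K600 frame prediction.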
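
The idea of suppressing gradient components along conflicting directions can be illustrated with a small sketch. This is not the paper's SACP module, whose details the abstract does not give; it is a generic symmetric projection in the style of gradient-surgery methods, and the names `project_out` and `symmetric_conflict_projection` are hypothetical. It assumes two gradients conflict when their dot product is negative.

```python
def dot(a, b):
    """Dot product of two equal-length vectors given as lists."""
    return sum(x * y for x, y in zip(a, b))


def project_out(g, ref):
    """Return g with its component along ref removed, but only if g and ref conflict.

    Assumption for illustration: "conflict" means a negative dot product.
    """
    d = dot(g, ref)
    denom = dot(ref, ref)
    if d >= 0 or denom == 0:
        return list(g)  # no conflict: leave the gradient untouched
    return [x - (d / denom) * r for x, r in zip(g, ref)]


def symmetric_conflict_projection(g_appearance, g_motion):
    """Apply the projection to both gradients, each against the other's original value."""
    return (
        project_out(g_appearance, g_motion),
        project_out(g_motion, g_appearance),
    )


# Hypothetical gradients from the appearance and motion alignment losses.
g_a, g_m = [1.0, 0.0], [-1.0, 1.0]
new_a, new_m = symmetric_conflict_projection(g_a, g_m)
```

After the projection, each adjusted gradient is orthogonal to the other stream's original gradient, so a step along it no longer directly undoes the other objective.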

[70] Not All Birds Look The Same: Identity-Preserving Generation For Birds

Aaron Sun,Oindrila Saha,Subhransu Maji

Main category: cs.CV

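TL;DR: This paper introduces the NABirds Look-Alikes (NABLA) dataset for evaluating identity-preserving generation of birds, and shows that training on images grouped by species, age, and sex substantially improves model performance.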

Details

Motivation: Existing zero-shot, identity-preserving generation models perform poorly on non-rigid or fine-grained categories such as birds, and high-quality multi-view data is scarce, which makes them hard to evaluate and improve. This paper aims to build a high-precision, fine-grained benchmark to advance such tasks. Method: Builds the NABLA dataset of 4,759 expert-curated image pairs and combines it with 1,073 pairs from multi-image observations on iNaturalist and a small set of videos to form a comprehensive benchmark; trains on images grouped by species, age, and sex, used as a proxy signal for identity, to improve generation consistency. Result: Experiments show that current state-of-the-art baselines fail to preserve identity on this dataset, while training on grouped images substantially improves identity preservation on both seen and unseen species. Conclusion: NABLA provides a challenging new benchmark for identity-preserving generation of fine-grained, non-rigid objects, demonstrates the value of training on semantic identity groups, and moves image generation toward high-precision, real-world applications. Abstract: Since the advent of controllable image generation, increasingly rich modes of control have enabled greater customization and accessibility for everyday users. Zero-shot, identity-preserving models such as Insert Anything and OminiControl now support applications like virtual try-on without requiring additional fine-tuning. While these models may be fitting for humans and rigid everyday objects, they still have limitations for non-rigid or fine-grained categories. These domains often lack accessible, high-quality data -- especially videos or multi-view observations of the same subject -- making them difficult both to evaluate and to improve upon. Yet, such domains are essential for moving beyond content creation toward applications that demand accuracy and fine detail. Birds are an excellent domain for this task: they exhibit high diversity, require fine-grained cues for identification, and come in a wide variety of poses. We introduce the NABirds Look-Alikes (NABLA) dataset, consisting of 4,759 expert-curated image pairs. Together with 1,073 pairs collected from multi-image observations on iNaturalist and a small set of videos, this forms a benchmark for evaluating identity-preserving generation of birds. We show that state-of-the-art baselines fail to maintain identity on this dataset, and we demonstrate that training on images grouped by species, age, and sex -- used as a proxy for identity -- substantially improves performance on both seen and unseen species.

[71] SEASON: Mitigating Temporal Hallucination in Video Large Language Models via Self-Diagnostic Contrastive Decoding

Chang-Hsun Wu,Kai-Po Chang,Yu-Yang Sheng,Hung-Kai Chung,Kuei-Chun Wang,Yu-Chiang Frank Wang

Main category: cs.CV

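TL;DR: This paper proposes SEASON, a training-free method that improves the spatiotemporal consistency of video large language models (VideoLLMs), using self-diagnostic contrastive decoding to reduce temporal and spatial hallucinations.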

Details

Motivation: Existing VideoLLMs have difficulty handling temporal information in videos and tend to produce temporally inconsistent or causally implausible content, which causes severe hallucination; current research focuses mostly on spatial hallucination, and temporal reasoning remains underexplored. Method: Proposes Self-Diagnostic Contrastive Decoding (SEASON), which dynamically diagnoses the hallucination tendency of each output token and applies adaptive contrastive decoding against its corresponding temporal and spatial negatives, improving the spatiotemporal fidelity of the generated content. Result: On three hallucination benchmarks, SEASON outperforms all existing training-free hallucination mitigation methods, and it further improves VideoLLMs on four general video understanding benchmarks. Conclusion: SEASON is an effective training-free method that adaptively improves the temporal and spatial faithfulness of VideoLLMs, significantly reducing hallucination while improving overall video understanding. Abstract: Video Large Language Models (VideoLLMs) have shown remarkable progress in video understanding. However, these models still struggle to effectively perceive and exploit rich temporal information in videos when responding to user queries. Therefore, they often generate descriptions of events that are temporally inconsistent or causally implausible, causing severe hallucination issues. While most prior studies have focused on spatial hallucinations (e.g. object mismatches), temporal reasoning in video understanding remains relatively underexplored. To address this issue, we propose Self-Diagnostic Contrastive Decoding (SEASON), a training-free method that adaptively enhances temporal and spatial faithfulness for each output token. It achieves this by dynamically diagnosing each token's hallucination tendency and applying adaptive contrastive decoding against its corresponding temporal and spatial negatives. Extensive experiments demonstrate that SEASON outperforms all existing training-free hallucination mitigation approaches on three hallucination examination benchmarks, while further improving VideoLLMs across four general video understanding benchmarks. The code will be released upon acceptance.
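
As a rough illustration of contrastive decoding against negatives, the sketch below combines next-token logits from the original video with logits from a temporal negative and a spatial negative. The abstract does not give SEASON's formula or how its per-token weights are diagnosed, so the linear combination, the function name `contrastive_logits`, and the weights `alpha_t` and `alpha_s` are assumptions made only for illustration.

```python
def contrastive_logits(orig, temporal_neg, spatial_neg, alpha_t, alpha_s):
    """Boost tokens the original input supports more than the negatives do.

    orig, temporal_neg, spatial_neg: next-token logits (lists of floats)
    alpha_t, alpha_s: hypothetical per-token contrast weights; in an adaptive
    scheme they would come from a hallucination diagnosis step.
    """
    scale = 1.0 + alpha_t + alpha_s
    return [
        scale * o - alpha_t * t - alpha_s * s
        for o, t, s in zip(orig, temporal_neg, spatial_neg)
    ]


# Toy two-token vocabulary: the temporal negative strongly lowers token 1,
# which suggests token 1 depends on the correct temporal order.
orig = [2.0, 1.0]
temporal_neg = [2.0, 0.0]
spatial_neg = [2.0, 1.0]
adjusted = contrastive_logits(orig, temporal_neg, spatial_neg, alpha_t=1.0, alpha_s=0.0)
```

With both weights at zero the function returns the original logits unchanged, so the contrast can be switched off for tokens judged unlikely to be hallucinated.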

[72] Controllable Long-term Motion Generation with Extended Joint Targets

Eunjong Lee,Eunhee Kim,Sanghoon Hong,Eunho Jung,Jihoon Kim

Main category: cs.CV

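TL;DR: This paper proposes COMET, a Transformer-based conditional variational autoencoder framework with a reference-guided feedback mechanism that generates character motion in real time, stably and controllably, supporting fine-grained control, long-sequence synthesis, and real-time style transfer.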

Details

Motivation: Existing methods tend to suffer motion degradation over long sequences and lack fine-grained control, so they cannot meet the stability and real-time requirements of interactive applications. Method: Uses an efficient Transformer-based conditional VAE architecture that gives precise control over arbitrary user-specified joints; introduces a reference-guided feedback mechanism that prevents error accumulation, improves long-term stability, and supports plug-and-play style transfer. Result: Experiments show that COMET significantly outperforms state-of-the-art methods on complex control tasks and generates high-quality, stable motion sequences at real-time speeds. Conclusion: COMET balances real-time performance, control flexibility, and long-term stability, making it suitable for demanding interactive animation applications. Abstract: Generating stable and controllable character motion in real-time is a key challenge in computer animation. Existing methods often fail to provide fine-grained control or suffer from motion degradation over long sequences, limiting their use in interactive applications. We propose COMET, an autoregressive framework that runs in real time, enabling versatile character control and robust long-horizon synthesis. Our efficient Transformer-based conditional VAE allows for precise, interactive control over arbitrary user-specified joints for tasks like goal-reaching and in-betweening from a single model. To ensure long-term temporal stability, we introduce a novel reference-guided feedback mechanism that prevents error accumulation. This mechanism also serves as a plug-and-play stylization module, enabling real-time style transfer. Extensive evaluations demonstrate that COMET robustly generates high-quality motion at real-time speeds, significantly outperforming state-of-the-art approaches in complex motion control tasks and confirming its readiness for demanding interactive applications.

[73] DraCo: Draft as CoT for Text-to-Image Preview and Rare Concept Generation

Dongzhi Jiang,Renrui Zhang,Haodong Li,Zhuofan Zong,Ziyu Guo,Jun He,Claire Guo,Junyan Ye,Rongyao Fang,Weijia Li,Rui Liu,Hongsheng Li

Main category: cs.CV

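TL;DR: Proposes Draft-as-CoT (DraCo), a new interleaved reasoning paradigm that uses a low-resolution draft for visual planning and then refines the image through selective correction and super-resolution, significantly improving text-to-image generation with multimodal large models.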

Details

Motivation: Existing unified multimodal large language models rely on purely textual chain-of-thought reasoning for text-to-image generation, which lacks concrete visual guidance and struggles with fine-grained semantic alignment and rare attribute combinations. Method: First generates a low-resolution draft as a visual preview that provides structured planning; then uses the model's own understanding capability to detect semantic mismatches between the draft and the input prompt, and refines the result through selective correction and super-resolution; trains on the DraCo-240K dataset and designs the DraCo-CFG strategy to support interleaved reasoning. Result: Significantly outperforms direct generation and other chain-of-thought methods on GenEval, Imagine-Bench, and GenEval++, with gains of 8%, 0.91, and 3% respectively. Conclusion: By interleaving visual and textual reasoning, DraCo strengthens planning and verification during generation, addresses the coarse granularity of textual chain-of-thought and the difficulty of generating rare attribute combinations, and offers a new paradigm for multimodal generation. Abstract: Recent unified multimodal large language models (MLLMs) have shown impressive capabilities, incorporating chain-of-thought (CoT) reasoning for enhanced text-to-image generation. However, existing approaches remain limited, either treating the model merely as a standalone generator or relying on abstract textual planning. To this end, we propose Draft-as-CoT (DraCo), a novel interleaved reasoning paradigm that fully leverages both textual and visual contents in CoT for better planning and verification. Our method first generates a low-resolution draft image as preview, providing more concrete and structural visual planning and guidance. Then, we employ the model's inherent understanding capability to verify potential semantic misalignments between the draft and input prompt, and perform refinement through selective corrections with super-resolution. In this way, our approach addresses two fundamental challenges: the coarse-grained nature of textual planning and the difficulty in generating rare attribute combinations. To support training, we curate DraCo-240K, aiming to enhance three atomic capabilities spanning general correction, instance manipulation, and layout reorganization. Supported by DraCo-CFG, a specialized classifier-free guidance (CFG) strategy for interleaved reasoning, DraCo achieves a tremendous increase on GenEval (+8%), Imagine-Bench (+0.91), and GenEval++ (+3%), significantly outperforming direct generation and other generation methods empowered by CoT.

[74] Shift-Window Meets Dual Attention: A Multi-Model Architecture for Specular Highlight Removal

Tianci Huo,Lingfeng Qi,Yuhan Chen,Qihong Xue,Jinyuan Shao,Hai Yu,Jie Li,Zhanhua Zhang,Guofa Li

Main category: cs.CV

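TL;DR: This paper proposes MM-SHR, a multi-model architecture for specular highlight removal that combines convolutions for local detail with attention for global features. Working coarse to fine with the newly designed OAIBlock and HDDAConv modules, it handles highlights at different scales while staying computationally efficient, and experiments show it outperforms existing methods across multiple surface materials and benchmark tasks.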

Details

Motivation: When handling specular highlights at different scales, a single model type cannot capture local detail and global dependencies at the same time, which leads to poor highlight removal. Method: Proposes the MM-SHR architecture: shallow layers use convolutions to extract local detail, and deep layers use attention to capture global dependencies; designs the OAIBlock and HDDAConv modules, which apply omni-directional pixel-shifting and window-dividing operations to model long-range dependencies in a coarse-to-fine manner. Result: Experiments on three benchmark tasks and six types of surface materials show that MM-SHR outperforms current state-of-the-art methods in both removal accuracy and efficiency. Conclusion: By combining local and global modeling, MM-SHR effectively addresses multi-scale specular highlight removal, balances performance with computational efficiency, and has good application prospects. Abstract: Inevitable specular highlights in practical environments severely impair the visual performance, thus degrading the task effectiveness and efficiency. Although there exist considerable methods that focus on local information from convolutional neural network models or global information from transformer models, the single-type model falls into a modeling dilemma between local fine-grained details and global long-range dependencies, thus deteriorating for specular highlights with different scales. Therefore, to accommodate specular highlights of all scales, we propose a multi-model architecture for specular highlight removal (MM-SHR) that effectively captures fine-grained features in highlight regions and models long-range dependencies between highlight and highlight-free areas. Specifically, we employ convolution operations to extract local details in the shallow layers of MM-SHR, and utilize the attention mechanism to capture global features in the deep layers, ensuring both operation efficiency and removal accuracy. To model long-range dependencies without compromising computational complexity, we utilize a coarse-to-fine manner and propose Omni-Directional Attention Integration Block (OAIBlock) and Adaptive Region-Aware Hybrid-Domain Dual Attention Convolutional Network (HDDAConv), which leverage omni-directional pixel-shifting and window-dividing operations at the raw features to achieve specular highlight removal. Extensive experimental results on three benchmark tasks and six types of surface materials demonstrate that MM-SHR outperforms state-of-the-art methods in both accuracy and efficiency for specular highlight removal. The implementation will be made publicly available at https://github.com/Htcicv/MM-SHR.

[75] Back to Basics: Motion Representation Matters for Human Motion Generation Using Diffusion Model

Yuduo Jin,Brandon Haworth

Main category: cs.CV

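TL;DR: Through a controlled study, this paper examines basic questions about motion representations and loss functions in generative motion diffusion models. Using the proxy model MDM with a v loss objective (vMDM), it evaluates six common motion representations on quality and diversity, compares training speed across configurations, and analyzes results on a large dataset, finding that the choice of representation and configuration has a significant effect on model performance.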

Details

Motivation: Aims to understand the effect of different design choices in motion diffusion models, such as motion representation and loss function, and to improve the performance and training efficiency of conditional motion generation models. Method: Introduces v loss as the prediction objective on the MDM model (vMDM), empirically compares six common motion representations in terms of quality, diversity, and training time, and runs experiments on a large motion dataset. Result: Performance differs markedly across motion representations and datasets; certain configurations speed up training; v loss helps model the latent data distribution. Conclusion: The choice of motion representation and training configuration strongly affects how well diffusion models generate human motion; sound design choices can significantly improve performance and efficiency, and the study provides an empirical basis and optimization directions for later work on conditional motion generation. Abstract: Diffusion models have emerged as a widely utilized and successful methodology in human motion synthesis. Task-oriented diffusion models have significantly advanced action-to-motion, text-to-motion, and audio-to-motion applications. In this paper, we investigate fundamental questions regarding motion representations and loss functions in a controlled study, and we enumerate the impacts of various decisions in the workflow of the generative motion diffusion model. To answer these questions, we conduct empirical studies based on a proxy motion diffusion model (MDM). We apply v loss as the prediction objective on MDM (vMDM), where v is the weighted sum of motion data and noise. We aim to enhance the understanding of latent data distributions and provide a foundation for improving the state of conditional motion diffusion models. First, we evaluate the six common motion representations in the literature and compare their performance in terms of quality and diversity metrics. Second, we compare the training time under various configurations to shed light on how to speed up the training process of motion diffusion models. Finally, we also conduct evaluation analysis on a large motion dataset. The results of our experiments indicate clear performance differences across motion representations in diverse datasets. Our results also demonstrate the impacts of distinct configurations on model training and suggest the importance and effectiveness of these decisions on the outcomes of motion diffusion models.
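
The abstract describes v only as "the weighted sum of motion data and noise". For orientation, the commonly used v-prediction parameterization for diffusion models is written out below; whether vMDM uses exactly these weights is an assumption, and the symbols (motion data x_0, noise ε, schedule coefficients α_t and σ_t, condition c, network v_θ) are notation chosen here, not taken from the paper.

```latex
% Forward (noising) process with schedule coefficients \alpha_t, \sigma_t:
x_t = \alpha_t \, x_0 + \sigma_t \, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)

% v target: a weighted sum of the noise and the clean motion data:
v_t = \alpha_t \, \epsilon - \sigma_t \, x_0

% v loss: the network v_\theta regresses v_t given the noisy motion, the timestep,
% and the condition c (for example a text or action label):
\mathcal{L}_v = \mathbb{E}_{x_0, \epsilon, t} \left[ \lVert v_\theta(x_t, t, c) - v_t \rVert_2^2 \right]
```

With a variance-preserving schedule (α_t² + σ_t² = 1), the clean motion can be recovered from a predicted v as x_0 = α_t x_t − σ_t v_t.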

[76] UltraImage: Rethinking Resolution Extrapolation in Image Diffusion Transformers

Min Zhao,Bokai Yan,Xue Yang,Hongzhou Zhu,Jintao Zhang,Shilong Liu,Chongxuan Li,Jun Zhu

Main category: cs.CV

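TL;DR: UltraImage is a new framework that uses recursive dominant frequency correction and entropy-guided adaptive attention concentration to address the content repetition and quality degradation that diffusion transformers show in ultra-high-resolution image generation, producing high-quality images up to 6K*6K.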

Details

Motivation: Existing image diffusion transformers show content repetition and quality degradation when generating images beyond their training scale, which limits their use for high-resolution generation. Method: A frequency-wise analysis of positional embeddings shows that content repetition comes from the periodicity of the dominant frequency, which motivates a recursive dominant frequency correction; for the quality degradation caused by diluted attention, an entropy-guided adaptive attention concentration mechanism sharpens local detail while preserving global structure. Result: Experiments show that UltraImage outperforms prior methods on Qwen-Image and Flux (around 4K), clearly reducing repetition and improving visual fidelity; from a training resolution of 1328p it can generate images up to 6K*6K without low-resolution guidance. Conclusion: UltraImage offers an effective way for diffusion transformers to generate high-quality images under extreme extrapolation, demonstrating strong ultra-high-resolution generation capability. Abstract: Recent image diffusion transformers achieve high-fidelity generation, but struggle to generate images beyond their training resolution, suffering from content repetition and quality degradation. In this work, we present UltraImage, a principled framework that addresses both issues. Through frequency-wise analysis of positional embeddings, we identify that repetition arises from the periodicity of the dominant frequency, whose period aligns with the training resolution. We introduce a recursive dominant frequency correction to constrain it within a single period after extrapolation. Furthermore, we find that quality degradation stems from diluted attention and thus propose entropy-guided adaptive attention concentration, which assigns higher focus factors to sharpen local attention for fine detail and lower ones to global attention patterns to preserve structural consistency. Experiments show that UltraImage consistently outperforms prior methods on Qwen-Image and Flux (around 4K) across three generation scenarios, reducing repetition and improving visual fidelity. Moreover, UltraImage can generate images up to 6K*6K without low-resolution guidance from a training resolution of 1328p, demonstrating its extreme extrapolation capability. Project page is available at https://thu-ml.github.io/ultraimage.github.io/.

[77] DuGI-MAE: Improving Infrared Mask Autoencoders via Dual-Domain Guidance

Yinghui Xing,Xiaoting Su,Shizhou Zhang,Donghao Chu,Di Xu

Main category: cs.CV

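TL;DR: This paper proposes DuGI-MAE, a dual-domain guided infrared foundation model based on MAE. With an entropy-driven deterministic masking strategy and a dual-domain guidance module, pretrained on the large-scale infrared dataset Inf-590K, it improves performance on infrared image understanding tasks.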

Details

Motivation: Existing foundation models trained on visible-light data, such as MAE, perform poorly on infrared image interpretation, and current infrared foundation models drop informative tokens, model global associations insufficiently, and ignore non-uniform noise. Method: Designs a deterministic masking strategy based on token entropy that preserves high-entropy tokens; introduces a Dual-Domain Guidance (DDG) module that captures global relationships and adaptively filters non-uniform background noise; builds the large-scale infrared dataset Inf-590K for pretraining. Result: DuGI-MAE generalizes well across several downstream tasks, including infrared object detection, semantic segmentation, and small target detection, and outperforms both supervised and self-supervised comparison methods. Conclusion: By improving the masking strategy and the noise suppression mechanism, DuGI-MAE significantly improves infrared foundation model performance and offers an effective solution for infrared image understanding. Abstract: Infrared imaging plays a critical role in low-light and adverse weather conditions. However, due to the distinct characteristics of infrared images, existing foundation models such as Masked Autoencoder (MAE) trained on visible data perform suboptimally in infrared image interpretation tasks. To bridge this gap, an infrared foundation model known as InfMAE was developed and pre-trained on large-scale infrared datasets. Despite its effectiveness, InfMAE still faces several limitations, including the omission of informative tokens, insufficient modeling of global associations, and neglect of non-uniform noise. In this paper, we propose a Dual-domain Guided Infrared foundation model based on MAE (DuGI-MAE). First, we design a deterministic masking strategy based on token entropy, preserving only high-entropy tokens for reconstruction to enhance informativeness. Next, we introduce a Dual-Domain Guidance (DDG) module, which simultaneously captures global token relationships and adaptively filters non-uniform background noise commonly present in infrared imagery. To facilitate large-scale pretraining, we construct Inf-590K, a comprehensive infrared image dataset encompassing diverse scenes, various target types, and multiple spatial resolutions. Pretrained on Inf-590K, DuGI-MAE demonstrates strong generalization capabilities across various downstream tasks, including infrared object detection, semantic segmentation, and small target detection. Experimental results validate the superiority of the proposed method over both supervised and self-supervised comparison methods. Our code is available in the supplementary material.
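
A deterministic, entropy-based token selection can be sketched as follows. The abstract does not say how token entropy is computed or how many tokens are kept, so the histogram-based Shannon entropy, the bin count, and the names `patch_entropy` and `select_high_entropy` are all assumptions for illustration, not the paper's procedure.

```python
import math
from collections import Counter


def patch_entropy(pixels, bins=16, max_val=256):
    """Shannon entropy (in bits) of a patch's intensity histogram.

    pixels: flat list of integer intensities in [0, max_val).
    """
    counts = Counter(p * bins // max_val for p in pixels)
    n = len(pixels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def select_high_entropy(patches, keep):
    """Return the indices of the `keep` highest-entropy patches, in index order.

    Ties are broken by the lower index, so the selection is deterministic.
    """
    scores = [patch_entropy(p) for p in patches]
    order = sorted(range(len(patches)), key=lambda i: (-scores[i], i))
    return sorted(order[:keep])


# Three toy patches: flat background, two-level edge, and a full intensity ramp.
patches = [[0] * 16, [0, 255] * 8, list(range(0, 256, 16))]
kept = select_high_entropy(patches, keep=2)
```

Unlike random masking, the same image always yields the same selection, and flat background patches are the first to be dropped.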

[78] EgoLCD: Egocentric Video Generation with Long Context Diffusion

Liuzhou Zhang,Jiarui Ye,Yuanlei Wang,Ming Zhong,Mingju Cao,Wanke Xia,Bowen Zeng,Zeyu Zhang,Hao Tang

Main category: cs.CV

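TL;DR: EgoLCD is a long-context video generation framework for egocentric video that addresses content drift through effective memory management, achieving high-quality, temporally consistent video generation.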

Details

Motivation: Existing autoregressive models suffer from content drift when generating long videos: object identity and scene semantics degrade over time, which makes it hard to keep hand-object interactions and procedural tasks coherent. Method: Proposes the EgoLCD framework, which combines a Long-Term Sparse KV Cache (for stable global context) with an attention-based short-term memory (extended by LoRA for local adaptation), introduces a Memory Regulation Loss to keep memory usage consistent, and uses Structured Narrative Prompting to provide explicit temporal guidance. Result: Experiments on the EgoVid-5M benchmark show that EgoLCD reaches state-of-the-art perceptual quality and temporal consistency and effectively mitigates generative forgetting. Conclusion: With efficient and stable memory management, EgoLCD significantly improves egocentric long-video generation and is an important step toward scalable world models for embodied AI. Abstract: Generating long, coherent egocentric videos is difficult, as hand-object interactions and procedural tasks require reliable long-term memory. Existing autoregressive models suffer from content drift, where object identity and scene semantics degrade over time. To address this challenge, we introduce EgoLCD, an end-to-end framework for egocentric long-context video generation that treats long video synthesis as a problem of efficient and stable memory management. EgoLCD combines a Long-Term Sparse KV Cache for stable global context with an attention-based short-term memory, extended by LoRA for local adaptation. A Memory Regulation Loss enforces consistent memory usage, and Structured Narrative Prompting provides explicit temporal guidance. Extensive experiments on the EgoVid-5M benchmark demonstrate that EgoLCD achieves state-of-the-art performance in both perceptual quality and temporal consistency, effectively mitigating generative forgetting and representing a significant step toward building scalable world models for embodied AI. Code: https://github.com/AIGeeksGroup/EgoLCD. Website: https://aigeeksgroup.github.io/EgoLCD.

[79] VideoSSM: Autoregressive Long Video Generation with Hybrid State-Space Memory

Yifei Yu,Xiaoshan Wu,Xinting Hu,Tao Hu,Yangtian Sun,Xiaoyang Lyu,Bo Wang,Lin Ma,Yuewen Ma,Zhongrui Wang,Xiaojuan Qi

Main category: cs.CV

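TL;DR: This paper proposes VideoSSM, a long-video generation model that combines autoregressive diffusion with a hybrid state-space memory. By unifying short- and long-term context memory, it significantly improves temporal consistency and motion stability in minute-scale video generation.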

Details

Motivation: Autoregressive diffusion models for long-video generation suffer from accumulated errors, motion drift, and content repetition, and struggle to stay coherent over long durations; this paper approaches the tension between long-range dependency and dynamic consistency from the perspective of memory. Method: Proposes VideoSSM, which combines autoregressive diffusion with a state-space model (SSM): the SSM acts as a long-term memory that captures global scene dynamics, while a local context window retains short-term motion detail, forming a hybrid memory architecture that scales linearly and supports prompt-based interactive control. Result: Achieves state-of-the-art temporal consistency and motion stability on both short- and long-range benchmarks, especially for minute-scale video generation, reducing repetitive patterns and improving content diversity. Conclusion: VideoSSM provides a scalable, memory-aware framework for long-video generation; by fusing diffusion models with state-space memory, it achieves high-quality, controllable, and coherent video generation. Abstract: Autoregressive (AR) diffusion enables streaming, interactive long-video generation by producing frames causally, yet maintaining coherence over minute-scale horizons remains challenging due to accumulated errors, motion drift, and content repetition. We approach this problem from a memory perspective, treating video synthesis as a recurrent dynamical process that requires coordinated short- and long-term context. We propose VideoSSM, a Long Video Model that unifies AR diffusion with a hybrid state-space memory. The state-space model (SSM) serves as an evolving global memory of scene dynamics across the entire sequence, while a context window provides local memory for motion cues and fine details. This hybrid design preserves global consistency without frozen, repetitive patterns, supports prompt-adaptive interaction, and scales in linear time with sequence length. Experiments on short- and long-range benchmarks demonstrate state-of-the-art temporal consistency and motion stability among autoregressive video generators, especially at minute-scale horizons, enabling content diversity and interactive prompt-based control, thereby establishing a scalable, memory-aware framework for long video generation.
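
The hybrid memory idea, a recurrent global state plus a bounded local window, can be sketched in a few lines. This is a toy stand-in, not VideoSSM's architecture: the class name `HybridMemory`, the scalar `decay`, and the exponential-moving-average update (the simplest diagonal linear state-space recurrence) are assumptions chosen for illustration.

```python
from collections import deque


class HybridMemory:
    """Toy hybrid memory: a fixed-size recurrent state plus a sliding window."""

    def __init__(self, dim, window, decay=0.9):
        self.state = [0.0] * dim            # long-term memory, constant size
        self.window = deque(maxlen=window)  # short-term memory, last `window` frames
        self.decay = decay                  # hypothetical scalar state transition

    def update(self, frame_feat):
        """Fold one frame's features into both memories in O(dim) time."""
        self.state = [
            self.decay * h + (1.0 - self.decay) * x
            for h, x in zip(self.state, frame_feat)
        ]
        self.window.append(list(frame_feat))

    def context(self):
        """Return (global state, recent frames) for conditioning the next frame."""
        return self.state, list(self.window)


memory = HybridMemory(dim=1, window=2, decay=0.5)
for _ in range(3):
    memory.update([1.0])
global_state, recent = memory.context()
```

Because each update touches only the fixed-size state and a bounded window, the total cost grows linearly with the number of frames, which is the scaling property the abstract highlights.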

[80] Boundary-Aware Test-Time Adaptation for Zero-Shot Medical Image Segmentation

Chenlin Xu,Lei Zhang,Lituan Wang,Xinyu Pu,Pengfei Ma,Guangwu Qian,Zizhou Wang,Yan Wang

Main category: cs.CV

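TL;DR: This paper proposes BA-TTA-SAM, a test-time adaptation framework that needs no task-specific training and improves the zero-shot segmentation performance of SAM on medical images.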

Details

Motivation: 由于标注数据稀缺和计算成本高，传统微调方法在医学图像分割中面临挑战；现有迁移学习方法依赖下游任务的训练，而SAM在医学领域存在域偏移问题，因此需要高效的零样本增强方法。 Method: 提出了BA-TTA-SAM框架，包含两个关键机制：1）编码器级高斯提示注入，将基于高斯的提示嵌入图像编码器以指导表示学习；2）跨层边界感知注意力对齐，利用ViT主干网内的层次特征交互，对齐深层语义响应与浅层边界线索。 Result: 在ISIC、Kvasir、BUSI和REFUGE四个数据集上实验显示，相比SAM的零样本分割性能，DICE分数平均提升了12.4%，且优于当前最先进的模型。 Conclusion: BA-TTA-SAM显著增强了SAM在医学图像分割中的泛化能力，无需任何源域训练数据，具有良好的通用性和应用前景。 Abstract: Due to the scarcity of annotated data and the substantial computational costs of model, conventional tuning methods in medical image segmentation face critical challenges. Current approaches to adapting pretrained models, including full-parameter and parameter-efficient fine-tuning, still rely heavily on task-specific training on downstream tasks. Therefore, zero-shot segmentation has gained increasing attention, especially with foundation models such as SAM demonstrating promising generalization capabilities. However, SAM still faces notable limitations on medical datasets due to domain shifts, making efficient zero-shot enhancement an urgent research goal. To address these challenges, we propose BA-TTA-SAM, a task-agnostic test-time adaptation framework that significantly enhances the zero-shot segmentation performance of SAM via test-time adaptation. This framework integrates two key mechanisms: (1) The encoder-level Gaussian prompt injection embeds Gaussian-based prompts directly into the image encoder, providing explicit guidance for initial representation learning. (2) The cross-layer boundary-aware attention alignment exploits the hierarchical feature interactions within the ViT backbone, aligning deep semantic responses with shallow boundary cues. Experiments on four datasets, including ISIC, Kvasir, BUSI, and REFUGE, show an average improvement of 12.4\% in the DICE score compared with SAM's zero-shot segmentation performance. The results demonstrate that our method consistently outperforms state-of-the-art models in medical image segmentation. 
Our framework significantly enhances the generalization ability of SAM, without requiring any source-domain training data. Extensive experiments on publicly available medical datasets strongly demonstrate the superiority of our framework. Our code is available at https://github.com/Emilychenlin/BA-TTA-SAM.
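
For readers unfamiliar with the metric reported above, the following is a minimal sketch of the standard Dice coefficient for binary masks. It is not taken from the BA-TTA-SAM code base; the function name `dice_score`, the nested-list mask format, and the convention for two empty masks are assumptions made for illustration.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as nested lists of 0/1."""
    flat_pred = [p for row in pred for p in row]
    flat_target = [t for row in target for t in row]
    # Overlap: pixels that are foreground in both masks.
    intersection = sum(p * t for p, t in zip(flat_pred, flat_target))
    total = sum(flat_pred) + sum(flat_target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Hypothetical 2x3 masks, used only to exercise the function.
pred = [[1, 1, 0], [0, 1, 0]]
target = [[1, 0, 0], [0, 1, 1]]
score = dice_score(pred, target)
```

Here each mask has three foreground pixels and two of them overlap, so the score is 2 * 2 / (3 + 3), about 0.667.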

[81] WiFi-based Cross-Domain Gesture Recognition Using Attention Mechanism

Ruijing Liu,Cunhua Pan,Jiaming Zeng,Hong Ren,Kezhi Wang,Lei Kong,Jiangzhou Wang

Main category: cs.CV

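TL;DR: This paper proposes a WiFi-based cross-domain gesture recognition method that fuses multi-angle Doppler spectrum images with a network combining spatial and channel attention, achieving high accuracy both in-domain and cross-domain.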

Details

Motivation: Existing WiFi-based gesture recognition methods perform well within the training domain but generalize poorly across domains and struggle in unseen environments. This paper aims to improve robustness and generality across environments. Method: Doppler spectra are extracted from the channel state information (CSI) of multiple receivers and concatenated along the time axis to form fused images carrying multi-angle information. A neural network that combines multi-semantic spatial attention with a self-attention-based channel mechanism uses a CBAM-inspired structure to strengthen key features, with ResNet18 as the backbone for deep feature extraction. Result: On the public Widar3 dataset, in-domain accuracy reaches 99.72% and cross-domain accuracy reaches 97.61%, significantly better than the best existing methods. Conclusion: The method effectively improves WiFi-based gesture recognition in cross-domain scenarios, with good generalization and practical potential. Abstract: While fulfilling communication tasks, wireless signals can also be used to sense the environment. Among various types of sensing media, WiFi signals offer advantages such as widespread availability, low hardware cost, and strong robustness to environmental conditions like light, temperature, and humidity. By analyzing WiFi signals in the environment, it is possible to capture dynamic changes of the human body and accomplish sensing applications such as gesture recognition. However, many existing gesture sensing solutions perform well in-domain but lack cross-domain capabilities (i.e., recognition performance in untrained environments). To address this, we extract Doppler spectra from the channel state information (CSI) received by all receivers and concatenate each Doppler spectrum along the same time axis to generate fused images with multi-angle information as input features. Furthermore, inspired by the convolutional block attention module (CBAM), we propose a gesture recognition network that integrates a multi-semantic spatial attention mechanism with a self-attention-based channel mechanism. This network constructs attention maps to quantify the spatiotemporal features of gestures in images, enabling the extraction of key domain-independent features. Additionally, ResNet18 is employed as the backbone network to further capture deep-level features.
To validate the network performance, we evaluate the proposed network on the public Widar3 dataset, and the results show that it not only maintains high in-domain accuracy of 99.72%, but also achieves high performance in cross-domain recognition of 97.61%, significantly outperforming existing best solutions.

Guoqing Zhang,Zhun Wang,Hairui Wang,Zhonglin Ye,Yuhui Zheng

Main category: cs.CV

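TL;DR: Proposes ICRE, a new network for visible-infrared person re-identification (VI-ReID) that improves cross-modal matching by mining and exploiting the implicit discriminative knowledge in modality-specific attributes.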

Details

Motivation: Existing methods focus mainly on learning modality-invariant features and overlook the important role of modality-specific, identity-aware knowledge in discriminative feature learning. Method: A Multi-Perception Feature Refinement (MPFR) module aggregates shallow features from the shared branches to capture easily overlooked modality-specific attributes. A Semantic Distillation Cascade Enhancement (SDCE) module distills identity-aware knowledge from the shallow features and guides the learning of modality-invariant features. An Identity Clues Guided (ICG) loss reduces modality discrepancies in the enhanced features and promotes a diverse representation space. Result: Extensive experiments on multiple public datasets show that ICRE significantly outperforms existing state-of-the-art methods. Conclusion: By effectively fusing modality-specific identity clues with modality-invariant features, ICRE improves cross-modal matching accuracy in VI-ReID and offers a new approach to the modality-discrepancy problem. Abstract: Visible-Infrared Person Re-Identification (VI-ReID) is a challenging cross-modal matching task due to significant modality discrepancies. While current methods mainly focus on learning modality-invariant features through unified embedding spaces, they often focus solely on the common discriminative semantics across modalities while disregarding the critical role of modality-specific identity-aware knowledge in discriminative feature learning. To bridge this gap, we propose a novel Identity Clue Refinement and Enhancement (ICRE) network to mine and utilize the implicit discriminative knowledge inherent in modality-specific attributes. Initially, we design a Multi-Perception Feature Refinement (MPFR) module that aggregates shallow features from shared branches, aiming to capture modality-specific attributes that are easily overlooked. Then, we propose a Semantic Distillation Cascade Enhancement (SDCE) module, which distills identity-aware knowledge from the aggregated shallow features and guides the learning of modality-invariant features. Finally, an Identity Clues Guided (ICG) Loss is proposed to alleviate the modality discrepancies within the enhanced features and promote the learning of a diverse representation space. Extensive experiments across multiple public datasets clearly show that our proposed ICRE outperforms existing SOTA methods.

[83] Auto3R: Automated 3D Reconstruction and Scanning via Data-driven Uncertainty Quantification

Chentao Shen,Sizhe Zheng,Bingqian Wu,Yaohua Feng,Yuanchen Fei,Mingyu Mei,Hanwen Jiang,Xiangru Huang

Main category: cs.CV

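TL;DR: This paper presents Auto3R, a data-driven uncertainty quantification model for automated 3D scanning and reconstruction. It predicts the uncertainty distribution over scanning viewpoints without knowing the ground-truth geometry and appearance, and delivers high-quality, ready-to-use 3D digitization on a real robot platform.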

Details

Motivation: Traditional high-quality 3D scanning relies on manually planned scanning paths. With the development of embodied systems such as drones and robots, fully automated, high-accuracy 3D scanning and reconstruction is urgently needed. Method: Auto3R is a data-driven uncertainty quantification model that, during iterative 3D reconstruction, predicts the uncertainty distribution over candidate scanning viewpoints to guide the choice of the best viewpoint. It applies to objects with non-Lambertian and specular materials. Result: Experiments show that Auto3R significantly outperforms existing state-of-the-art methods. Deployed on a robot arm equipped with a camera, it efficiently digitizes real-world objects and produces ready-to-use, photorealistic digital assets. Conclusion: Auto3R achieves high-quality, fully automated 3D scanning and reconstruction with good practical applicability, especially for digitizing objects with complex materials. Abstract: Traditional high-quality 3D scanning and reconstruction typically relies on human labor to plan the scanning procedure. With the rapid development of embodied systems such as drones and robots, there is a growing demand for performing accurate 3D scanning and reconstruction in a fully automated manner. We introduce Auto3R, a data-driven uncertainty quantification model that is designed to automate the 3D scanning and reconstruction of scenes and objects, including objects with non-Lambertian and specular materials. Specifically, in a process of iterative 3D reconstruction and scanning, Auto3R can make efficient and accurate predictions of the uncertainty distribution over potential scanning viewpoints, without knowing the ground truth geometry and appearance. Through extensive experiments, Auto3R achieves superior performance that outperforms the state-of-the-art methods by a large margin. We also deploy Auto3R on a robot arm equipped with a camera and demonstrate that Auto3R can be used to effectively digitize real-world 3D objects and deliver ready-to-use and photorealistic digital assets. Our homepage: https://tomatoma00.github.io/auto3r.github.io .

[84] PhyVLLM: Physics-Guided Video Language Model with Motion-Appearance Disentanglement

Yu-Wei Zhan,Xin Wang,Hong Chen,Tongtong Feng,Wei Feng,Ren Wang,Guangyao Li,Qing Li,Wenwu Zhu

Main category: cs.CV

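TL;DR: This paper proposes PhyVLLM, a video LLM framework that introduces physical motion modeling. A dual-branch encoder and a Neural ODE module disentangle appearance from motion and model physical dynamics, improving video understanding and physical reasoning without explicit physical annotations.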

Details

Motivation: Existing video LLMs rely on appearance matching and struggle to understand physical dynamics, which limits them on tasks that require deeper physical understanding. Method: PhyVLLM uses a dual-branch encoder to separate appearance from motion, introduces a Neural ODE module to model physical dynamics in continuous time, learns motion evolution in a self-supervised way, and maps the motion-aware representations into the token space of a pretrained LLM. Result: Experiments show that PhyVLLM significantly outperforms current state-of-the-art video LLMs on both physical reasoning and general video understanding tasks. Conclusion: Explicit physical motion modeling effectively improves the understanding and reasoning abilities of video LLMs, and PhyVLLM offers a viable path toward vision-language systems with physical common sense. Abstract: Video Large Language Models (Video LLMs) have shown impressive performance across a wide range of video-language tasks. However, they often fail in scenarios requiring a deeper understanding of physical dynamics. This limitation primarily arises from their reliance on appearance-based matching. Incorporating physical motion modeling is crucial for deeper video understanding, but presents three key challenges: (1) motion signals are often entangled with appearance variations, making it difficult to extract clean physical cues; (2) effective motion modeling requires not only continuous-time motion representations but also capturing physical dynamics; and (3) collecting accurate annotations for physical attributes is costly and often impractical. To address these issues, we propose PhyVLLM, a physics-guided video-language framework that explicitly incorporates physical motion into Video LLMs. Specifically, PhyVLLM disentangles visual appearance and object motion through a dual-branch encoder. To model physical dynamics over time, we incorporate a Neural Ordinary Differential Equation (Neural ODE) module, which generates differentiable physical dynamic representations. The resulting motion-aware representations are projected into the token space of a pretrained LLM, enabling physics reasoning without compromising the model's original multimodal capabilities. To circumvent the need for explicit physical labels, PhyVLLM employs a self-supervised approach to model the continuous evolution of object motion.
Experimental results demonstrate that PhyVLLM significantly outperforms state-of-the-art Video LLMs on both physical reasoning and general video understanding tasks, highlighting the advantages of incorporating explicit physical modeling.

[85] Refaçade: Editing Object with Given Reference Texture

Youze Huang,Penghui Ruan,Bojia Zi,Xianbiao Qi,Jianan Wang,Rong Xiao

Main category: cs.CV

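TL;DR: This paper introduces a new task, Object Retexture, and proposes Refaçade, which uses texture removal and a jigsaw permutation strategy to achieve precise, controllable local texture transfer in images and videos.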

Details

Motivation: Despite progress of diffusion models in image and video editing, local texture transfer still lacks fine-grained control. Existing methods depend on the raw reference image, which introduces structural interference and makes it hard to disentangle texture from structure. Method: Refaçade has two key designs: (1) a texture remover trained on paired 3D mesh renderings removes appearance information while preserving geometry and motion; (2) a jigsaw permutation disrupts the global layout of the reference image so that the model attends to local texture statistics rather than overall structure. Result: Experiments show that the method surpasses strong baselines in visual quality, editing precision, and controllability, with better results in both quantitative and human evaluations. Conclusion: Refaçade achieves precise and controllable local texture transfer in images and videos, effectively addresses structural interference and texture-structure entanglement, and offers a new approach to the Object Retexture task. Abstract: Recent advances in diffusion models have brought remarkable progress in image and video editing, yet some tasks remain underexplored. In this paper, we introduce a new task, Object Retexture, which transfers local textures from a reference object to a target object in images or videos. To perform this task, a straightforward solution is to use ControlNet conditioned on the source structure and the reference texture. However, this approach suffers from limited controllability for two reasons: conditioning on the raw reference image introduces unwanted structural information, and it fails to disentangle the visual texture and structure information of the source. To address this problem, we propose Refaçade, a method that consists of two key designs to achieve precise and controllable texture transfer in both images and videos. First, we employ a texture remover trained on paired textured/untextured 3D mesh renderings to remove appearance information while preserving the geometry and motion of source videos. Second, we disrupt the reference global layout using a jigsaw permutation, encouraging the model to focus on local texture statistics rather than the global layout of the object. Extensive experiments demonstrate superior visual quality, precise editing, and controllability, outperforming strong baselines in both quantitative and human evaluations. Code is available at https://github.com/fishZe233/Refacade.
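
The jigsaw permutation mentioned in the abstract can be illustrated with a minimal sketch: cut the reference image into a grid of tiles and shuffle the tiles, so that local pixel content is kept while the global layout is destroyed. This is not the Refaçade implementation; the function name `jigsaw_permute`, the nested-list image format, and the `grid` and `seed` parameters are assumptions made for illustration.

```python
import random


def jigsaw_permute(image, grid, seed=None):
    """Split a 2D image (list of rows) into grid x grid tiles and shuffle them."""
    height, width = len(image), len(image[0])
    if height % grid or width % grid:
        raise ValueError("image sides must be divisible by grid")
    th, tw = height // grid, width // grid
    # Cut the image into tiles in row-major order.
    tiles = [
        [row[c * tw:(c + 1) * tw] for row in image[r * th:(r + 1) * th]]
        for r in range(grid)
        for c in range(grid)
    ]
    random.Random(seed).shuffle(tiles)
    # Reassemble the shuffled tiles into an image of the original size.
    out = []
    for r in range(grid):
        for y in range(th):
            out.append([v for c in range(grid) for v in tiles[r * grid + c][y]])
    return out


# Hypothetical 4x4 "image" whose pixel values are their row-major indices.
image = [[4 * y + x for x in range(4)] for y in range(4)]
shuffled = jigsaw_permute(image, grid=2, seed=0)
```

Every tile of the output is an intact tile of the input, so per-tile texture statistics are unchanged and only the arrangement of tiles differs.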

[86] Detection of Intoxicated Individuals from Facial Video Sequences via a Recurrent Fusion Model

Bita Baroutian,Atefe Aghaei,Mohsen Ebrahimi Moghaddam

Main category: cs.CV

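TL;DR: Proposes a video-based facial sequence analysis method for detecting alcohol intoxication. It combines a graph attention network with a 3D ResNet to extract spatiotemporal features, and its advantage is validated on a newly built dataset.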

Details

Motivation: Alcohol consumption is a major cause of accidents and deaths worldwide, so a non-invasive, reliable method for detecting intoxication is urgently needed to improve public safety. Method: Facial landmark analysis via a Graph Attention Network (GAT) is combined with spatiotemporal visual features extracted by a 3D ResNet, and the two are fused dynamically with adaptive prioritization to classify intoxication. Result: On a new dataset of 3,542 video segments, the model reaches 95.82% accuracy, 0.977 precision, and 0.97 recall, outperforming the 3D-CNN and VGGFace+LSTM baselines. Conclusion: The method performs strongly on intoxication detection and has potential for practical deployment in public safety systems. Abstract: Alcohol consumption is a significant public health concern and a major cause of accidents and fatalities worldwide. This study introduces a novel video-based facial sequence analysis approach dedicated to the detection of alcohol intoxication. The method integrates facial landmark analysis via a Graph Attention Network (GAT) with spatiotemporal visual features extracted using a 3D ResNet. These features are dynamically fused with adaptive prioritization to enhance classification performance. Additionally, we introduce a curated dataset comprising 3,542 video segments derived from 202 individuals to support training and evaluation. Our model is compared against two baselines: a custom 3D-CNN and a VGGFace+LSTM architecture. Experimental results show that our approach achieves 95.82% accuracy, 0.977 precision, and 0.97 recall, outperforming prior methods. The findings demonstrate the model's potential for practical deployment in public safety systems for non-invasive, reliable alcohol intoxication detection.

[87] X-Humanoid: Robotize Human Videos to Generate Humanoid Videos at Scale

Pei Yang,Hai Ci,Yiren Song,Mike Zheng Shou

Main category: cs.CV

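TL;DR: Proposes X-Humanoid, a generative video editing method that converts web-scale human videos into "robotized" video data that humanoid robots can learn from. It addresses the limits of existing methods on full-body motion and scene occlusion, and releases a large-scale humanoid dataset of 3.6 million frames.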

Details

Motivation: Existing vision-language-action models and world models are limited by the lack of large-scale, diverse training data. Training robot policies from human videos is promising, but current methods struggle with full-body motion and occlusion in third-person views and cannot "robotize" humans effectively. Method: X-Humanoid builds a video-to-video translation framework on the Wan 2.2 model, dedicated to human-to-humanoid motion transfer. A scalable data pipeline uses Unreal Engine to generate more than 17 hours of paired synthetic videos for fine-tuning, and the trained model is applied to 60 hours of video from the Ego-Exo4D dataset to produce large-scale robotized data. Result: More than 3.6 million "robotized" humanoid video frames were generated and released. In quantitative analysis and user studies, 69% of users rated the method best for motion consistency and 62.1% rated it best for embodiment correctness. Conclusion: X-Humanoid effectively addresses the challenge of generating high-quality humanoid training data from human videos, providing a scalable data solution and a valuable resource for embodied AI. Abstract: The advancement of embodied AI has unlocked significant potential for intelligent humanoid robots. However, progress in both Vision-Language-Action (VLA) models and world models is severely hampered by the scarcity of large-scale, diverse training data. A promising solution is to "robotize" web-scale human videos, which has been proven effective for policy training. However, these solutions mainly "overlay" robot arms onto egocentric videos, which cannot handle complex full-body motions and scene occlusions in third-person videos, making them unsuitable for robotizing humans. To bridge this gap, we introduce X-Humanoid, a generative video editing approach that adapts the powerful Wan 2.2 model into a video-to-video structure and finetunes it for the human-to-humanoid translation task. This finetuning requires paired human-humanoid videos, so we designed a scalable data creation pipeline, turning community assets into 17+ hours of paired synthetic videos using Unreal Engine. We then apply our trained model to 60 hours of the Ego-Exo4D videos, generating and releasing a new large-scale dataset of over 3.6 million "robotized" humanoid video frames. Quantitative analysis and user studies confirm our method's superiority over existing baselines: 69% of users rated it best for motion consistency, and 62.1% for embodiment correctness.

[88] VideoMem: Enhancing Ultra-Long Video Understanding via Adaptive Memory Management

Hongbo Jin,Qingyuan Wang,Wenhao Zhang,Yang Liu,Sijie Cheng

Main category: cs.CV

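TL;DR: This paper proposes VideoMem, a framework that models ultra-long video understanding as a sequential generation task through adaptive memory management, and combines it with the PRPO training algorithm to improve long-term memory and reasoning.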

Details

Motivation: Existing vision-language models are limited by context length and long-term memory when processing ultra-long videos, while external knowledge-base methods incur large computation and storage overhead. Method: A dynamically updated global memory buffer retains key information, and the PRPO algorithm, with its Progressive State Propagation (PSP) and Temporal Cascading Reward (TCR) modules, optimizes training. Result: Experiments show that VideoMem significantly outperforms existing open-source models on multiple ultra-long video understanding benchmarks. Conclusion: Through adaptive memory management and an improved training strategy, VideoMem effectively improves vision-language models' understanding of ultra-long videos while avoiding high overhead. Abstract: Ultra-long video understanding remains an open challenge, as existing vision-language models (VLMs) falter on such content due to limited context length and inefficient long-term memory retention. To address this, recent works have attempted to construct external knowledge bases and corresponding retrieval-augmented generation (RAG) systems, yet these incur enormous storage and computational overhead. In this paper, we propose VideoMem, a novel framework that pioneers modeling long video understanding as a sequential generation task via adaptive memory management. Specifically, VideoMem dynamically updates a global memory buffer, which adaptively retains critical information while discarding redundant content across the video timeline. To efficiently train VLMs for such long-term tasks, VideoMem integrates the Progressive Grouped Relative Policy Optimization (PRPO) algorithm, equipped with two core modules: Progressive State Propagation (PSP) adaptively retains valid current states, propagates them to the next rollout step, and gradually narrows the model exploration space. Temporal Cascading Reward (TCR) further alleviates reward sparsity, improving sample utilization and accelerating convergence. Extensive experiments demonstrate that VideoMem significantly outperforms existing open-source models across diverse benchmarks for ultra-long video understanding tasks.

[89] Gaussian Entropy Fields: Driving Adaptive Sparsity in 3D Gaussian Optimization

Hong Kuang,Jianchen Liu

Main category: cs.CV

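TL;DR: This paper proposes an entropy-minimization approach to novel view synthesis with 3D Gaussian Splatting (3DGS). Optimizing for low configurational entropy yields high-quality surface reconstruction, with strong geometric accuracy and rendering quality on several benchmarks.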

Details

Motivation: To improve the surface reconstruction quality of 3D Gaussian Splatting in novel view synthesis, redundant primitives must be suppressed and geometric consistency strengthened, yet existing methods struggle to balance geometric accuracy against rendering fidelity. Method: Three techniques are proposed: (1) entropy-minimization-based surface modeling that lowers the configurational entropy of the primitive distribution; (2) adaptive spatial regularization guided by the Surface Neighborhood Redundancy Index (SNRI) and image entropy; (3) multi-scale geometric preservation through competitive cross-scale entropy alignment. Result: The method achieves a Chamfer Distance of 0.64 on DTU and an F1 score of 0.44 on T&T, and obtains the best SSIM (0.855) and LPIPS (0.136) on Mip-NeRF 360, significantly outperforming existing methods. Conclusion: The framework effectively improves the geometric accuracy of surface reconstruction while maintaining excellent rendering quality and photometric fidelity, validating entropy-driven modeling for 3D scene representation. Abstract: 3D Gaussian Splatting (3DGS) has emerged as a leading technique for novel view synthesis, demonstrating exceptional rendering efficiency. Well-reconstructed surfaces can be characterized by low configurational entropy, where dominant primitives clearly define surface geometry while redundant components are suppressed. Three complementary technical contributions are introduced: (1) entropy-driven surface modeling via entropy minimization for low configurational entropy in primitive distributions; (2) adaptive spatial regularization using the Surface Neighborhood Redundancy Index (SNRI) and image entropy-guided weighting; (3) multi-scale geometric preservation through competitive cross-scale entropy alignment. Extensive experiments demonstrate that GEF achieves competitive geometric precision on DTU and T&T benchmarks, while delivering superior rendering quality compared to existing methods on Mip-NeRF 360. Notably, superior Chamfer Distance (0.64) on DTU and F1 score (0.44) on T&T are obtained, alongside the best SSIM (0.855) and LPIPS (0.136) among baselines on Mip-NeRF 360, validating the framework's ability to enhance surface reconstruction accuracy without compromising photometric fidelity.
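
The abstract does not give the paper's definition of configurational entropy. As a rough intuition only, the sketch below uses the Shannon entropy of normalized primitive contributions, which is an assumption rather than the authors' formulation; the function name `configurational_entropy` and the example weights are hypothetical.

```python
import math


def configurational_entropy(weights):
    """Shannon entropy (in nats) of non-negative weights after normalization."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    # Zero-weight entries contribute nothing and are skipped to avoid log(0).
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probs)


# Hypothetical per-primitive contributions along one ray.
dominant = [0.97, 0.01, 0.01, 0.01]  # one primitive defines the surface
diffuse = [0.25, 0.25, 0.25, 0.25]   # contributions spread evenly

low = configurational_entropy(dominant)
high = configurational_entropy(diffuse)
```

Under this assumed measure, the dominant case has lower entropy than the evenly spread case, whose entropy equals log(4). This matches the abstract's description of dominant primitives with suppressed redundant components.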

[90] Counterfeit Answers: Adversarial Forgery against OCR-Free Document Visual Question Answering

Marco Pintore,Maura Pintor,Dimosthenis Karatzas,Battista Biggio

Main category: cs.CV

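TL;DR: This paper proposes a new adversarial attack on Document Visual Question Answering (DocVQA) systems that forges document content in a visually imperceptible yet semantically targeted way, inducing the model to give wrong answers.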

Details

Motivation: Current DocVQA models are vulnerable to adversarial attacks in practical use and lack robustness. This paper aims to expose their vulnerability at the semantic level. Method: Specialized attack algorithms generate adversarially forged documents for different attack goals, such as targeted misinformation or systematic failure, and are validated on two state-of-the-art models, Pix2Struct and Donut. Result: Experiments show that the attacks effectively mislead both models into producing specific or generally incorrect answers, exposing their sensitivity to subtle semantic tampering. Conclusion: Existing DocVQA systems have serious security vulnerabilities, and stronger defenses are needed against semantic-level adversarial forgery. Abstract: Document Visual Question Answering (DocVQA) enables end-to-end reasoning grounded on information present in a document input. While recent models have shown impressive capabilities, they remain vulnerable to adversarial attacks. In this work, we introduce a novel attack scenario that aims to forge document content in a visually imperceptible yet semantically targeted manner, allowing an adversary to induce specific or generally incorrect answers from a DocVQA model. We develop specialized attack algorithms that can produce adversarially forged documents tailored to different attackers' goals, ranging from targeted misinformation to systematic model failure scenarios. We demonstrate the effectiveness of our approach against two end-to-end state-of-the-art models: Pix2Struct, a vision-language transformer that jointly processes image and text through sequence-to-sequence modeling, and Donut, a transformer-based model that directly extracts text and answers questions from document images. Our findings highlight critical vulnerabilities in current DocVQA systems and call for the development of more robust defenses.

[91] COOPER: A Unified Model for Cooperative Perception and Reasoning in Spatial Intelligence

Zefeng Zhang,Xiangzhao Hao,Hengzhu Tang,Zhenyu Zhang,Jiawei Sheng,Xiaodong Li,Zhenyang Li,Li Gao,Daiting Shi,Dawei Yin,Tingwen Liu

Main category: cs.CV

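TL;DR: Proposes COOPER, a unified multimodal large language model that uses depth and segmentation as auxiliary modalities. Two-stage training gives it auxiliary modality generation and adaptive interleaved reasoning, improving visual spatial understanding.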

Details

Motivation: Existing models still fall short on 3D-aware spatial reasoning, and they usually strengthen perception or reasoning in isolation, without a unified framework that improves both. Method: COOPER uses depth and segmentation as auxiliary modalities and is trained in two stages: the first stage learns to generate the auxiliary modalities, and the second stage learns adaptive interleaved reasoning, strengthening both spatial perception and reasoning. Result: COOPER improves spatial reasoning by 6.91% on average while maintaining general performance. A variant trained only for auxiliary modality generation gains 7.92% on distance and size estimation. Conclusion: Learning to generate auxiliary modalities helps the model internalize spatial knowledge, and coupling perception with reasoning in a unified framework can markedly improve the spatial intelligence of multimodal models. Abstract: Visual Spatial Reasoning is crucial for enabling Multimodal Large Language Models (MLLMs) to understand object properties and spatial relationships, yet current models still struggle with 3D-aware reasoning. Existing approaches typically enhance either perception, by augmenting RGB inputs with auxiliary modalities such as depth and segmentation, or reasoning, by training on spatial VQA datasets and applying reinforcement learning, and thus treat these two aspects in isolation. In this work, we investigate whether a unified MLLM can develop an intrinsic ability to enhance spatial perception and, through adaptive interleaved reasoning, achieve stronger spatial intelligence. We propose COOPER, a unified MLLM that leverages depth and segmentation as auxiliary modalities and is trained in two stages to acquire auxiliary modality generation and adaptive, interleaved reasoning capabilities. COOPER achieves an average 6.91% improvement in spatial reasoning while maintaining general performance. Moreover, even a variant trained only for auxiliary modality generation attains a 7.92% gain on distance and size estimation, suggesting that learning to generate auxiliary modalities helps internalize spatial knowledge and strengthen spatial understanding.

[92] Dataset creation for supervised deep learning-based analysis of microscopic images -- review of important considerations and recommendations

Christof A. Bertram,Viktoria Weiss,Jonas Ammeling,F. Maria Schabel,Taryn A. Donovan,Frauke Wilm,Christian Marzahl,Katharina Breininger,Marc Aubreville

Main category: cs.CV

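TL;DR: This review covers the key steps and challenges in building microscopic image datasets for supervised deep learning. It stresses the importance of image acquisition, annotation software selection, and annotation quality (correctness, completeness, consistency), and proposes a standard operating procedure to promote high-quality, large-scale public datasets and thereby advance generalizable, robust deep learning models in pathology.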

Details

Motivation: High-quality, large-scale datasets are essential for developing reliable deep learning models, but in practice their creation is hindered by time constraints, domain variability, and annotation bias. Systematic guidance is needed to improve dataset quality and reproducibility. Method: The article reviews the existing literature, summarizes the key steps in image acquisition, annotation tool selection, and annotation, analyzes factors that affect data quality (such as domain shifts caused by staining and digitization), and proposes strategies for ensuring annotation quality, including multi-annotator collaboration and standardized workflows. Result: The article proposes the three "C" principles for dataset quality (correctness, completeness, consistency) and provides a standard operating procedure (SOP) as supplemental material to standardize dataset creation. It also emphasizes the value of open datasets for research innovation and reproducibility. Conclusion: Systematic design and construction of high-quality datasets is key to generalizable deep learning models. Future work should strengthen standardized practice and data sharing to improve the research quality and clinical applicability of pathology image analysis. Abstract: Supervised deep learning (DL) receives great interest for automated analysis of microscopic images with an increasing body of literature supporting its potential. The development and validation of those DL models rely heavily on the availability of high-quality, large-scale datasets. However, creating such datasets is a complex and resource-intensive process, often hindered by challenges such as time constraints, domain variability, and risks of bias in image collection and label creation. This review provides a comprehensive guide to the critical steps in dataset creation, including: 1) image acquisition, 2) selection of annotation software, and 3) annotation creation. In addition to ensuring a sufficiently large number of images, it is crucial to address sources of image variability (domain shifts) - such as those related to slide preparation and digitization - that could lead to algorithmic errors if not adequately represented in the training data. Key quality criteria for annotations are the three "C"s: correctness, completeness, and consistency. This review explores methods to enhance annotation quality through the use of advanced techniques that mitigate the limitations of single annotators. To support dataset creators, a standard operating procedure (SOP) is provided as supplemental material, outlining best practices for dataset development. Furthermore, the article underscores the importance of open datasets in driving innovation and enhancing reproducibility of DL research.
By addressing the challenges and offering practical recommendations, this review aims to advance the creation and availability of high-quality, large-scale datasets, ultimately contributing to the development of generalizable and robust DL models for pathology applications.

[93] Prompt2Craft: Generating Functional Craft Assemblies with LLMs

Vitor Hideyo Isume,Takuya Kiyokawa,Natsuki Yamanobe,Yukiyasu Domae,Weiwei Wan,Kensuke Harada

Main category: cs.CV

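TL;DR: This paper introduces the Craft Assembly Task, a robotic assembly task in which, given an image of a target object, the robot builds an approximate representation of it from the available objects. It proposes a search method based on template matching and shape simplification to select a suitable combination of objects.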

Details

Motivation: Inspired by traditional handicrafts, where people improvise with the materials at hand, the authors want to give robots a similar ability: assembling an approximation of a target object from whatever objects are available in the environment, without exact parts. Method: A mask segmentation neural network first identifies the visible parts in an RGB image of the target object, and labeled template meshes are retrieved and pose-optimized. The optimized template mesh is then simplified into primitive shapes such as cuboids or cylinders. Finally, a search algorithm that combines local and global proportion features finds the best-matching combination of objects in the scene. Result: In two different scenes, the method achieves results comparable to baselines that consider all possible combinations, and a qualitative real-world implementation is shown. Conclusion: The method can effectively solve creative assembly with imperfectly matching parts, offering a new approach to autonomous robotic assembly in open environments. Abstract: Inspired by traditional handmade crafts, where a person improvises assemblies based on the available objects, we formally introduce the Craft Assembly Task. It is a robotic assembly task that involves building an accurate representation of a given target object using the available objects, which do not directly correspond to its parts. In this work, we focus on selecting the subset of available objects for the final craft, when the given input is an RGB image of the target in the wild. We use a mask segmentation neural network to identify visible parts, followed by retrieving labeled template meshes. These meshes undergo pose optimization to determine the most suitable template. Then, we propose to simplify the parts of the transformed template mesh to primitive shapes like cuboids or cylinders. Finally, we design a search algorithm to find correspondences in the scene based on local and global proportions. We develop baselines for comparison that consider all possible combinations, and choose the highest scoring combination for common metrics used in foreground maps and mask accuracy. Our approach achieves comparable results to the baselines for two different scenes, and we show qualitative results for an implementation in a real-world scenario.

Zishuo Wan,Qinqin Kang,Yi Huang,Yun Bian,Dawei Ding,Ke Yan

Main category: cs.CV

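TL;DR: Proposes TARDis, a physics-aware framework that disentangles time-invariant and time-dependent features to address tumor segmentation and diagnosis in CT with missing modalities.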

Details

Motivation: Radiation concerns or scanning limitations make it hard to acquire a complete multi-phase CT series, which leads to missing modalities. Existing methods ignore the temporal continuity of hemodynamics. Method: Time Attenuated Representation Disentanglement (TARDis) uses a dual-path architecture: a quantization-based path extracts anatomical structure, and a path based on a Conditional Variational Autoencoder models enhancement dynamics. Missing modalities are treated as missing sample points on a time-attenuation curve. Result: Validated on a large-scale private abdominal CT dataset (2,282 cases) and two public datasets, TARDis significantly outperforms existing incomplete-modality methods and keeps robust diagnostic performance even in extremely sparse settings. Conclusion: TARDis makes effective use of hemodynamic prior knowledge and maintains high-accuracy tumor segmentation and diagnosis while reducing radiation exposure, showing potential for clinical use. Abstract: Tumor segmentation and diagnosis in contrast-enhanced Computed Tomography (CT) rely heavily on the physiological dynamics of contrast agents. However, obtaining a complete multi-phase series is often clinically unfeasible due to radiation concerns or scanning limitations, leading to the "missing modality" problem. Existing deep learning approaches typically treat missing phases as absent independent channels, ignoring the inherent temporal continuity of hemodynamics. In this work, we propose Time Attenuated Representation Disentanglement (TARDis), a novel physics-aware framework that redefines missing modalities as missing sample points on a continuous Time-Attenuation Curve. TARDis explicitly disentangles the latent feature space into a time-invariant static component (anatomy) and a time-dependent dynamic component (perfusion). We achieve this via a dual-path architecture: a quantization-based path using a learnable embedding dictionary to extract consistent anatomical structures, and a probabilistic path using a Conditional Variational Autoencoder to model dynamic enhancement conditioned on the estimated scan time. This design allows the network to hallucinate missing hemodynamic features by sampling from the learned latent distribution. Extensive experiments on a large-scale private abdominal CT dataset (2,282 cases) and two public datasets demonstrate that TARDis significantly outperforms state-of-the-art incomplete modality frameworks.
Notably, our method maintains robust diagnostic performance even in extreme data-sparsity scenarios, highlighting its potential for reducing radiation exposure while maintaining diagnostic precision.

[95] Infrared UAV Target Tracking with Dynamic Feature Refinement and Global Contextual Attention Knowledge Distillation

Houzhang Fang,Chenxing Wu,Kun Bai,Tianqi Chen,Xiaolin Wang,Xiyang Liu,Yi Chang,Luxin Yan

Main category: cs.CV

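TL;DR: Proposes SiamDFF, a Siamese network based on dynamic feature fusion for infrared UAV target tracking. It combines feature enhancement with contextual attention knowledge distillation to achieve efficient, accurate real-time tracking against complex backgrounds.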

Details

Motivation: Infrared UAV targets have weak features and complex backgrounds, so existing methods struggle to track them accurately. Stronger feature representation and greater attention to the target region are needed. Method: The SiamDFF network comprises a selective target enhancement network (STEN), a dynamic spatial feature aggregation module (DSFAM), and a dynamic channel feature aggregation module (DCFAM). A tracking-oriented contextual attention knowledge distillation mechanism strengthens feature extraction without adding computational burden. Result: Experiments on real infrared UAV datasets show that the method outperforms existing state-of-the-art trackers under complex backgrounds and runs at real-time speed. Conclusion: Through dynamic feature fusion and knowledge distillation, SiamDFF effectively improves tracking of infrared UAV targets with weak features and complex backgrounds, and has good practical value. Abstract: Unmanned aerial vehicle (UAV) target tracking based on thermal infrared imaging has been one of the most important sensing technologies in anti-UAV applications. However, the infrared UAV targets often exhibit weak features and complex backgrounds, posing significant challenges to accurate tracking. To address these problems, we introduce SiamDFF, a novel dynamic feature fusion Siamese network that integrates feature enhancement and global contextual attention knowledge distillation for infrared UAV target (IRUT) tracking. The SiamDFF incorporates a selective target enhancement network (STEN), a dynamic spatial feature aggregation module (DSFAM), and a dynamic channel feature aggregation module (DCFAM). The STEN employs intensity-aware multi-head cross-attention to adaptively enhance important regions for both template and search branches. The DSFAM enhances multi-scale UAV target features by integrating local details with global features, utilizing spatial attention guidance within the search frame. The DCFAM effectively integrates the mixed template generated from STEN in the template branch and the original template, avoiding excessive background interference with the template and thereby enhancing the emphasis on UAV target region features within the search frame. Furthermore, to enhance the feature extraction capabilities of the network for IRUT without adding extra computational burden, we propose a novel tracking-specific target-aware contextual attention knowledge distiller.
It transfers the target prior from the teacher network to the student model, significantly improving the student network's focus on informative regions at each hierarchical level of the backbone network. Extensive experiments on real infrared UAV datasets demonstrate that the proposed approach outperforms state-of-the-art target trackers under complex backgrounds while achieving a real-time tracking speed.

[96] SAM3-I: Segment Anything with Instructions

Jingjing Li,Yue Feng,Yuchen Guo,Jincai Huang,Yongri Piao,Qi Bi,Miao Zhang,Xiaoqi Zhao,Qiang Chen,Shihao Zou,Wei Ji,Huchuan Lu,Li Cheng

Main category: cs.CV

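TL;DR: This paper presents SAM3-I, an enhanced framework that unifies concept-level understanding with instruction-level reasoning. An instruction-aware cascaded adaptation mechanism lets SAM3 follow natural-language instructions directly for segmentation while keeping its original concept-driven capabilities.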

Details

Motivation: SAM3 currently supports only simple noun-phrase prompts and cannot handle real-world requests that involve attributes, spatial relations, functions, actions, and other complex expressions. It relies on external multimodal agents to convert complex instructions into noun phrases, which is coarse and imprecise. Method: SAM3-I introduces an instruction-aware cascaded adaptation mechanism that progressively aligns natural-language instruction semantics with SAM3's vision-language representations. The authors design a structured instruction taxonomy spanning concept, simple, and complex levels, and develop a scalable data engine to build a dataset of diverse instruction-mask pairs. Result: Experiments show that SAM3-I follows natural-language instructions well and achieves finer, more accurate instance segmentation without sacrificing the original concept segmentation ability. Conclusion: SAM3-I extends SAM3 to understand and execute complex natural-language instructions directly while preserving its strong concept grounding, moving open-vocabulary segmentation in a more practical direction. Abstract: Segment Anything Model 3 (SAM3) has advanced open-vocabulary segmentation through promptable concept segmentation, allowing users to segment all instances corresponding to a given concept, typically specified with short noun-phrase (NP) prompts. While this marks the first integration of language-level concepts within the SAM family, real-world usage typically requires far richer expressions that include attributes, spatial relations, functionalities, actions, states, and even implicit reasoning over instances. Currently, SAM3 relies on external multi-modal agents to convert complex instructions into NPs and then conduct iterative mask filtering. However, these NP-level concepts remain overly coarse, often failing to precisely represent a specific instance. In this work, we present SAM3-I, an enhanced framework that unifies concept-level understanding and instruction-level reasoning within the SAM family. SAM3-I introduces an instruction-aware cascaded adaptation mechanism that progressively aligns expressive instruction semantics with SAM3's existing vision-language representations, enabling direct instruction-following segmentation without sacrificing its original concept-driven capabilities. Furthermore, we design a structured instruction taxonomy spanning concept, simple, and complex levels, and develop a scalable data engine to construct a dataset with diverse instruction-mask pairs. Experiments show that SAM3-I delivers appealing performance, demonstrating that SAM3 can be effectively extended to follow natural-language instructions while preserving its strong concept grounding.
We open-source SAM3-I and provide practical fine-tuning workflows, enabling researchers to adapt it to domain-specific applications. The source code is available here.

[97] When Robots Should Say "I Don't Know": Benchmarking Abstention in Embodied Question Answering

Tao Wu,Chuhao Zhou,Guangyu Zhao,Haozhi Cao,Yewen Pu,Jianfei Yang

Main category: cs.CV

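TL;DR: This paper proposes AbstainEQA, a new dataset for evaluating whether embodied question answering (EQA) agents can abstain when information is insufficient. The study finds that 32.4% of human-posed questions have missing or underspecified context, defines five categories of cases that require abstention, and builds a dataset of 1,636 abstention cases on top of OpenEQA. Experiments show that existing models fall far below human-level abstention recall, exposing the limitations of current methods.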

Details

Motivation: Existing EQA benchmarks assume that every question must be answered, but in practice an agent should recognize when it cannot answer. The authors therefore study abstention, the ability of an EQA agent to withhold an answer when information is insufficient, which is a prerequisite for reliable interaction and effective clarification. Method: An analysis of 500 human queries finds that 32.4% have missing or underspecified context, which motivates five categories requiring abstention: actionability limitation, referential underspecification, preference dependence, information unavailability, and false presupposition. Based on these categories, well-posed OpenEQA questions are rewritten into ambiguous variants, yielding AbstainEQA with 1,636 abstention cases paired with their original questions. The dataset is then used to evaluate the abstention ability of a range of models. Result: On AbstainEQA, the best frontier model reaches only 42.79% abstention recall versus 91.17% for humans. Scaling, better prompting, and reasoning strategies yield limited gains, and fine-tuned models overfit to textual cues rather than truly judging whether to abstain. Conclusion: Abstention is a key capability for reliable embodied interaction; current models lag far behind humans, and new methods are needed to improve judgment of answerability. Abstract: Embodied Question Answering (EQA) requires an agent to interpret language, perceive its environment, and navigate within 3D scenes to produce responses. Existing EQA benchmarks assume that every question must be answered, but embodied agents should know when they do not have sufficient information to answer. In this work, we focus on a minimal requirement for EQA agents, abstention: knowing when to withhold an answer. From an initial study of 500 human queries, we find that 32.4% contain missing or underspecified context. Drawing on this initial study and cognitive theories of human communication errors, we derive five representative categories requiring abstention: actionability limitation, referential underspecification, preference dependence, information unavailability, and false presupposition. We augment OpenEQA by having annotators transform well-posed questions into ambiguous variants outlined by these categories. The resulting dataset, AbstainEQA, comprises 1,636 annotated abstention cases paired with 1,636 original OpenEQA instances for balanced evaluation. Evaluating on AbstainEQA, we find that even the best frontier model only attains 42.79% abstention recall, while humans achieve 91.17%. We also find that scaling, prompting, and reasoning only yield marginal gains, and that fine-tuned models overfit to textual cues. 
Together, these results position abstention as a fundamental prerequisite for reliable interaction in embodied settings and as a necessary basis for effective clarification.
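As a rough illustration of the abstention-recall figure quoted in this entry, the sketch below assumes recall means the fraction of abstention-required cases on which the agent actually abstains. The function name and the boolean label encoding are hypothetical and are not taken from the AbstainEQA release.

```python
def abstention_recall(should_abstain, did_abstain):
    """Fraction of cases requiring abstention on which the agent abstained."""
    required = [d for s, d in zip(should_abstain, did_abstain) if s]
    if not required:
        raise ValueError("no abstention-required cases")
    return sum(required) / len(required)


# Toy balanced set: 4 abstention cases paired with 4 answerable originals.
should = [True, True, True, True, False, False, False, False]
did = [True, False, True, False, False, True, False, False]
recall = abstention_recall(should, did)
```

Only the abstention-required half of the pairs enters the recall; the answerable originals matter for a complementary check that the agent does not abstain indiscriminately.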

[98] Malicious Image Analysis via Vision-Language Segmentation Fusion: Detection, Element, and Location in One-shot

Sheng Hang,Chaoxiang He,Hongsheng Hu,Hanqing Hu,Bin Benjamin Zhu,Shi-Feng Sun,Dawu Gu,Shuo Wang

Main category: cs.CV

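TL;DR: This paper proposes a zero-shot image moderation pipeline that simultaneously detects, identifies, and precisely localizes illicit visual content in an image. It is accurate and robust, and integrates seamlessly into existing vision-language model workflows.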

Details

Motivation: Conventional NSFW flags do not meet the need for fine-grained moderation of illicit content, which requires knowing which objects make an image illegal and where they are. Existing methods localize malicious elements poorly, especially in zero-shot and adversarial settings. Method: A foundation segmentation model (SAM) generates candidate object masks, which are refined into larger independent regions; a vision-language model scores each region for malicious relevance using open-vocabulary prompts; a weighted fusion produces a consolidated malicious object map; and an ensemble of multiple segmenters improves robustness to adaptive attacks. Result: On a newly annotated 790-image dataset, the method attains 85.8% element-level recall, 78.1% precision, and a 92.1% segment-success rate, improving recall by 27.4% over direct zero-shot VLM localization; under PGD adversarial perturbations, precision and recall drop by no more than 10%. Conclusion: The method is presented as the first practical tool for fine-grained, explainable malicious-image moderation; it is accurate and robust, processes an image in seconds, and integrates easily into existing systems. Abstract: Detecting illicit visual content demands more than image-level NSFW flags; moderators must also know what objects make an image illegal and where those objects occur. We introduce a zero-shot pipeline that simultaneously (i) detects if an image contains harmful content, (ii) identifies each critical element involved, and (iii) localizes those elements with pixel-accurate masks - all in one pass. The system first applies a foundation segmentation model (SAM) to generate candidate object masks and refines them into larger independent regions. Each region is scored for malicious relevance by a vision-language model using open-vocabulary prompts; these scores weight a fusion step that produces a consolidated malicious object map. An ensemble across multiple segmenters hardens the pipeline against adaptive attacks that target any single segmentation method. Evaluated on a newly-annotated 790-image dataset spanning drug, sexual, violent and extremist content, our method attains 85.8% element-level recall, 78.1% precision and a 92.1% segment-success rate - exceeding direct zero-shot VLM localization by 27.4% recall at comparable precision. Against PGD adversarial perturbations crafted to break SAM and VLM, our method's precision and recall decreased by no more than 10%, demonstrating high robustness against attacks. The full pipeline processes an image in seconds, plugs seamlessly into existing VLM workflows, and constitutes the first practical tool for fine-grained, explainable malicious-image moderation.
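The score-weighted fusion step can be pictured with a small sketch. The abstract does not state the fusion rule, so the per-pixel maximum of score times mask used here is an assumption, as are the function name `fuse_masks` and the 0.5 threshold.

```python
def fuse_masks(masks, scores):
    """Per-pixel max of score * mask over all candidate regions (assumed rule)."""
    height, width = len(masks[0]), len(masks[0][0])
    fused = [[0.0] * width for _ in range(height)]
    for mask, score in zip(masks, scores):
        for y in range(height):
            for x in range(width):
                if mask[y][x]:
                    fused[y][x] = max(fused[y][x], score)
    return fused


# Two overlapping candidate regions on a 2x3 image, with hypothetical VLM scores.
masks = [
    [[1, 1, 0], [0, 0, 0]],
    [[0, 1, 1], [0, 0, 1]],
]
scores = [0.9, 0.2]
malicious_map = fuse_masks(masks, scores)
flagged = [[value >= 0.5 for value in row] for row in malicious_map]
```

Where regions overlap, the higher-scoring region wins, so a benign segment cannot dilute a malicious one. An ensemble over several segmenters could average such maps, which is likewise an assumption rather than the paper's stated procedure.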

[99] Denoise to Track: Harnessing Video Diffusion Priors for Robust Correspondence

Tianyu Yuan,Yuanbo Yang,Lin-Zhuo Chen,Yao Yao,Zhuzhong Qian

Main category: cs.CV

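TL;DR: This paper proposes HeFT (Head-Frequency Tracker), a zero-shot point tracking framework that uses the visual priors of pretrained video diffusion models. By analyzing VDiT attention heads and frequency-domain features, it derives a feature selection strategy that jointly considers attention heads and low-frequency components, substantially improving unsupervised point tracking and reaching state-of-the-art results on the TAP-Vid benchmarks.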

Details

Motivation: To explore how video diffusion models encode spatiotemporal information and to exploit their potential for zero-shot point tracking, reducing reliance on annotated data. Method: Analyzes the internal representations of the Video Diffusion Transformer (VDiT), finding that attention heads are functionally specialized and that low-frequency features matter more for matching; on this basis designs a head- and frequency-aware feature selection strategy, combined with single-step denoising, soft-argmax localization, and forward-backward consistency checks for correspondence tracking. Result: Achieves state-of-the-art zero-shot point tracking on the TAP-Vid benchmarks, approaching the accuracy of supervised methods without any annotated training data. Conclusion: Video diffusion models can serve as strong visual foundation models for downstream tasks such as point tracking, offering a path toward unified visual foundation models. Abstract: In this work, we introduce HeFT (Head-Frequency Tracker), a zero-shot point tracking framework that leverages the visual priors of pretrained video diffusion models. To better understand how they encode spatiotemporal information, we analyze the internal representations of Video Diffusion Transformer (VDiT). Our analysis reveals that attention heads act as minimal functional units with distinct specializations for matching, semantic understanding, and positional encoding. Additionally, we find that the low-frequency components in VDiT features are crucial for establishing correspondences, whereas the high-frequency components tend to introduce noise. Building on these insights, we propose a head- and frequency-aware feature selection strategy that jointly selects the most informative attention head and low-frequency components to enhance tracking performance. Specifically, our method extracts discriminative features through single-step denoising, applies feature selection, and employs soft-argmax localization with forward-backward consistency checks for correspondence estimation. Extensive experiments on TAP-Vid benchmarks demonstrate that HeFT achieves state-of-the-art zero-shot tracking performance, approaching the accuracy of supervised methods while eliminating the need for annotated training data. Our work further underscores the promise of video diffusion models as powerful foundation models for a wide range of downstream tasks, paving the way toward unified visual foundation models.
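The last two stages named in this entry, soft-argmax localization and the forward-backward consistency check, follow a standard pattern that can be sketched in a few lines. This is a generic illustration rather than HeFT's implementation; the temperature, the error threshold, and both function names are assumptions.

```python
import math


def soft_argmax(similarity, temperature=0.1):
    """Expected (x, y) location under a softmax over a 2D similarity map."""
    flat = [value / temperature for row in similarity for value in row]
    peak = max(flat)  # subtract the max for numerical stability
    weights = [math.exp(value - peak) for value in flat]
    total = sum(weights)
    width = len(similarity[0])
    x = sum(w * (i % width) for i, w in enumerate(weights)) / total
    y = sum(w * (i // width) for i, w in enumerate(weights)) / total
    return x, y


def is_consistent(query, round_trip, max_error=1.0):
    """Forward-backward check: the back-tracked point must land near the query."""
    return math.dist(query, round_trip) <= max_error


# Similarity between one query feature and every location of a 3x3 target frame.
similarity = [[0.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
x, y = soft_argmax(similarity)
```

Unlike a hard argmax, the soft version returns sub-pixel coordinates. A track would be kept only if mapping the point to the target frame and back returns close to where it started.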

[100] I2I-Bench: A Comprehensive Benchmark Suite for Image-to-Image Editing Models

Juntong Wang,Jiarui Wang,Huiyu Duan,Jiaxiang Kang,Guangtao Zhai,Xiongkuo Min

Main category: cs.CV

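TL;DR: Proposes I2I-Bench, a comprehensive benchmark for image-to-image editing models that covers diverse task categories and fine-grained evaluation dimensions, and uses automated hybrid evaluation whose consistency with human preferences is validated.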

Details

Motivation: Existing image editing benchmarks suffer from limited task scope, insufficient evaluation dimensions, and reliance on manual annotation, which limits their scalability and practicality; a more comprehensive, automated benchmark is needed. Method: Designs a diverse task set with 10 categories of single-image and multi-image editing tasks, builds 30 decoupled, fine-grained evaluation dimensions, combines specialized tools with large multimodal models (LMMs) for automated hybrid evaluation, and uses rigorous alignment validation to ensure the evaluations agree with human preferences. Result: Systematically benchmarks numerous mainstream image editing models, revealing gaps and trade-offs across editing dimensions and supporting the comprehensiveness and effectiveness of I2I-Bench. Conclusion: I2I-Bench provides a comprehensive, scalable, automated evaluation platform for image editing models; all components will be open-sourced to support future research. Abstract: Image editing models are advancing rapidly, yet comprehensive evaluation remains a significant challenge. Existing image editing benchmarks generally suffer from limited task scopes, insufficient evaluation dimensions, and heavy reliance on manual annotations, which significantly constrain their scalability and practical applicability. To address this, we propose I2I-Bench, a comprehensive benchmark for image-to-image editing models, which features (i) diverse tasks, encompassing 10 task categories across both single-image and multi-image editing tasks, (ii) comprehensive evaluation dimensions, including 30 decoupled and fine-grained evaluation dimensions with automated hybrid evaluation methods that integrate specialized tools and large multimodal models (LMMs), and (iii) rigorous alignment validation, justifying the consistency between our benchmark evaluations and human preferences. Using I2I-Bench, we benchmark numerous mainstream image editing models, investigating the gaps and trade-offs between editing models across various dimensions. We will open-source all components of I2I-Bench to facilitate future research.

[101] Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length

Yubo Huang,Hailong Guo,Fangtai Wu,Shifeng Zhang,Shijie Huang,Qijun Gan,Lin Liu,Sirui Zhao,Enhong Chen,Jiaming Liu,Steven Hoi

Main category: cs.CV

TL;DR: This paper proposes Live Avatar, an algorithm-system co-designed framework for efficient, high-fidelity, infinite-length audio-driven avatar generation.

Details

Motivation: Existing diffusion-based video generation methods are constrained by sequential computation and long-horizon inconsistency, and cannot meet the demands of real-time streaming applications. Method: Introduces Timestep-forcing Pipeline Parallelism (TPP) to pipeline denoising steps across GPUs; proposes the Rolling Sink Frame Mechanism (RSFM), which dynamically recalibrates appearance using a cached reference image to improve temporal consistency; and uses Self-Forcing Distribution Matching Distillation for causal, streamable adaptation of large models. Result: Reaches 20 FPS end-to-end generation on 5 H800 GPUs and, according to the authors, is the first to achieve practical real-time, high-fidelity avatar generation at this scale. Conclusion: Live Avatar establishes a new paradigm for deploying large-scale diffusion models in industrial long-form video synthesis. Abstract: Existing diffusion-based video generation methods are fundamentally constrained by sequential computation and long-horizon inconsistency, limiting their practical adoption in real-time, streaming audio-driven avatar synthesis. We present Live Avatar, an algorithm-system co-designed framework that enables efficient, high-fidelity, and infinite-length avatar generation using a 14-billion-parameter diffusion model. Our approach introduces Timestep-forcing Pipeline Parallelism (TPP), a distributed inference paradigm that pipelines denoising steps across multiple GPUs, effectively breaking the autoregressive bottleneck and ensuring stable, low-latency real-time streaming. To further enhance temporal consistency and mitigate identity drift and color artifacts, we propose the Rolling Sink Frame Mechanism (RSFM), which maintains sequence fidelity by dynamically recalibrating appearance using a cached reference image. Additionally, we leverage Self-Forcing Distribution Matching Distillation to facilitate causal, streamable adaptation of large-scale models without sacrificing visual quality. Live Avatar demonstrates state-of-the-art performance, reaching 20 FPS end-to-end generation on 5 H800 GPUs, and, to the best of our knowledge, is the first to achieve practical, real-time, high-fidelity avatar generation at this scale. Our work establishes a new paradigm for deploying advanced diffusion models in industrial long-form video synthesis applications.

[102] Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation

Yunhong Lu,Yanhong Zeng,Haobo Li,Hao Ouyang,Qiuyu Wang,Ka Leong Cheng,Jiapeng Zhu,Hengyuan Cao,Zhipeng Zhang,Xing Zhu,Yujun Shen,Min Zhang

Main category: cs.CV

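TL;DR: This paper proposes Reward Forcing, a new framework for efficient streaming video generation built on two key techniques, EMA-Sink and Re-DMD, which substantially improves motion quality while maintaining long-horizon consistency.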

Details

Motivation: Existing video diffusion models based on sliding window attention depend on static sink tokens, which causes copied initial frames and diminished dynamics, making it hard to achieve both efficiency and high-quality motion. Method: Proposes EMA-Sink, which continuously updates fixed-size sink tokens by fusing tokens that exit the window via an exponential moving average; and Rewarded Distribution Matching Distillation (Re-DMD), which uses a vision-language model to rate dynamics and prioritizes learning from high-dynamics samples. Result: Achieves state-of-the-art performance on standard benchmarks with high-quality streaming video generation at 23.1 FPS on a single H100 GPU, avoiding frame copying and improving motion coherence and quality. Conclusion: By dynamically updating sink tokens and using reward-guided distillation, Reward Forcing addresses error accumulation and motion degradation in long-sequence video generation, providing an effective solution for efficient streaming video generation. Abstract: Efficient streaming video generation is critical for simulating interactive and dynamic worlds. Existing methods distill few-step video diffusion models with sliding window attention, using initial frames as sink tokens to maintain attention performance and reduce error accumulation. However, video frames become overly dependent on these static tokens, resulting in copied initial frames and diminished motion dynamics. To address this, we introduce Reward Forcing, a novel framework with two key designs. First, we propose EMA-Sink, which maintains fixed-size tokens initialized from initial frames and continuously updated by fusing evicted tokens via exponential moving average as they exit the sliding window. Without additional computation cost, EMA-Sink tokens capture both long-term context and recent dynamics, preventing initial frame copying while maintaining long-horizon consistency. Second, to better distill motion dynamics from teacher models, we propose a novel Rewarded Distribution Matching Distillation (Re-DMD). Vanilla distribution matching treats every training sample equally, limiting the model's ability to prioritize dynamic content. Instead, Re-DMD biases the model's output distribution toward high-reward regions by prioritizing samples with greater dynamics rated by a vision-language model. Re-DMD significantly enhances motion quality while preserving data fidelity. 
We include both quantitative and qualitative experiments to show that Reward Forcing achieves state-of-the-art performance on standard benchmarks while enabling high-quality streaming video generation at 23.1 FPS on a single H100 GPU.
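The EMA-Sink update described in this entry can be sketched as an ordinary exponential moving average over tokens. The decay value, the one-to-one pairing of sink and evicted tokens, and the function name are assumptions made for illustration; the abstract does not specify them.

```python
def ema_sink_update(sink, evicted, decay=0.9):
    """Blend tokens leaving the sliding window into the fixed-size sink tokens."""
    return [
        [decay * s + (1.0 - decay) * e for s, e in zip(sink_token, evicted_token)]
        for sink_token, evicted_token in zip(sink, evicted)
    ]


sink = [[1.0, 0.0], [0.0, 1.0]]          # initialized from the initial frame
stream = [[[0.0, 0.0], [2.0, 2.0]]] * 3  # tokens evicted at three successive steps
for evicted in stream:
    sink = ema_sink_update(sink, evicted, decay=0.5)
```

The sink keeps its size however long the stream runs, while the influence of the initial frame shrinks geometrically. That is the property the abstract credits with preventing initial-frame copying.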

[103] Towards Cross-View Point Correspondence in Vision-Language Models

Yipu Wang,Yuheng Ji,Yuyang Liu,Enshen Zhou,Ziqiang Yang,Yuxuan Tian,Ziheng Qin,Yue Liu,Huajie Tan,Cheng Chi,Zhiyuan Ma,Daniel Dajun Zeng,Xiaolong Zheng

Main category: cs.CV

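TL;DR: This paper proposes the Cross-View Point Correspondence (CVPC) task and the CrossPoint-Bench benchmark, shows that existing vision-language models fall short on precise spatial correspondence, and builds the large-scale CrossPoint-378K dataset and a new model, CroPond, which substantially improves performance.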

Details

Motivation: Existing vision-language models remain notably weak at fine-grained cross-view point-level correspondence, which makes it hard to support embodied AI applications that require precise interaction. Method: Proposes the CVPC task and the hierarchically designed CrossPoint-Bench benchmark; builds the CrossPoint-378K dataset with 378K question-answering pairs; and trains a new model, CroPond, to improve point-level correspondence. Result: State-of-the-art models (e.g., Gemini-2.5-Pro) trail humans by more than 54.65% in accuracy; CroPond surpasses Gemini-2.5-Pro by 39.7% on CrossPoint-Bench. Conclusion: Cross-view point-level correspondence is a key open problem for VLMs, and the proposed task, benchmark, dataset, and model provide a foundation for future research. Abstract: Cross-view correspondence is a fundamental capability for spatial understanding and embodied AI. However, it is still far from being realized in Vision-Language Models (VLMs), especially in achieving precise point-level correspondence, which is crucial for precise affordance interaction. We therefore propose the Cross-View Point Correspondence (CVPC) task and CrossPoint-Bench, a comprehensive benchmark with hierarchical design, inspired by the human cognitive process of "perceive", "reason", and "correspond". Our evaluation shows that state-of-the-art models (e.g., Gemini-2.5-Pro) still fall far behind humans, with a gap of over 54.65% in overall accuracy, exposing a challenge in transitioning from coarse-grained judgement to fine-grained coordinate prediction. To address this problem, we construct CrossPoint-378K, a dataset with 378K question-answering pairs across 900 scenes, focused on actionable affordance regions that better reflect real-world manipulation and interaction scenarios. Furthermore, we propose CroPond, which is trained on the CrossPoint-378K dataset. CroPond achieves state-of-the-art performance on CrossPoint-Bench, surpassing Gemini-2.5-Pro by 39.7% accuracy, which offers a foundation for advancing future work on cross-view correspondence. The benchmark, dataset, and model are publicly available at https://github.com/WangYipu2002/CrossPoint.

[104] OmniScaleSR: Unleashing Scale-Controlled Diffusion Prior for Faithful and Realistic Arbitrary-Scale Image Super-Resolution

Xinning Chai,Zhengxue Cheng,Yuhong Zhang,Hengsheng Zhang,Yingsheng Qin,Yucai Yang,Rong Xie,Li Song

Main category: cs.CV

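TL;DR: Proposes OmniScaleSR, a diffusion-based arbitrary-scale super-resolution framework that uses explicit scale control and multi-domain fidelity enhancement designs to achieve high-fidelity, highly realistic reconstruction at large magnification factors.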

Details

Motivation: Existing arbitrary-scale SR methods fall short in detail realism and explicit scale control: methods based on implicit neural representations struggle to synthesize fine textures, while diffusion-based methods lack precise control over the magnification factor, leading to excessive hallucination or blur at high magnification. Method: Proposes OmniScaleSR, which builds on the implicit scale adaptation of the diffusion prior and introduces explicit, diffusion-native scale control mechanisms for scale-aware and content-aware modulation of the diffusion process; multi-domain fidelity enhancement modules further improve reconstruction accuracy. Result: On bicubic degradation benchmarks and real-world datasets, OmniScaleSR outperforms state-of-the-art methods in both fidelity and perceptual quality, especially at large magnification factors. Conclusion: Through the synergy of explicit and implicit scale control, OmniScaleSR balances realism and fidelity in arbitrary-scale SR and significantly improves diffusion models under extreme magnification. Abstract: Arbitrary-scale super-resolution (ASSR) overcomes the limitation of traditional super-resolution (SR) methods that operate only at fixed scales (e.g., 4x), enabling a single model to handle arbitrary magnification. Most existing ASSR approaches rely on implicit neural representation (INR), but its regression-driven feature extraction and aggregation intrinsically limit the ability to synthesize fine details, leading to low realism. Recent diffusion-based realistic image super-resolution (Real-ISR) models leverage powerful pre-trained diffusion priors and show impressive results at the 4x setting. We observe that they can also achieve ASSR because the diffusion prior implicitly adapts to scale by encouraging high-realism generation. However, without explicit scale control, the diffusion process cannot be properly adjusted for different magnification levels, resulting in excessive hallucination or blurry outputs, especially under ultra-high scales. To address these issues, we propose OmniScaleSR, a diffusion-based realistic arbitrary-scale SR framework designed to achieve both high fidelity and high realism. We introduce explicit, diffusion-native scale control mechanisms that work synergistically with implicit scale adaptation, enabling scale-aware and content-aware modulation of the diffusion process. In addition, we incorporate multi-domain fidelity enhancement designs to further improve reconstruction accuracy. 
Extensive experiments on bicubic degradation benchmarks and real-world datasets show that OmniScaleSR surpasses state-of-the-art methods in both fidelity and perceptual realism, with particularly strong performance at large magnification factors. Code will be released at https://github.com/chaixinning/OmniScaleSR.

[105] Measuring the Unspoken: A Disentanglement Model and Benchmark for Psychological Analysis in the Wild

Yigui Feng,Qinglin Wang,Haotian Mo,Yang Liu,Ke Liu,Gencheng Liu,Xinhai Chen,Siqi Shen,Songzhu Mei,Jie Liu

Main category: cs.CV

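TL;DR: This paper proposes MIND, a hierarchical visual encoder, together with the ConvoInsight-DB dataset and the PRISM evaluation framework, to address the difficulty existing vision-language models have with psychological analysis of in-the-wild conversations, where speech movements are confused with emotional expressions. It substantially outperforms existing methods on micro-expression recognition.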

Details

Motivation: Existing vision-language models struggle to distinguish speech movements from genuine emotional expressions (Articulatory-Affective Ambiguity), and there are no verifiable metrics for assessing reasoning depth and visual grounding, which limits progress in generative psychological analysis. Method: Proposes MIND, which includes a Status Judgment module that suppresses ambiguous lip features based on temporal feature variance, achieving visual feature disentanglement; builds ConvoInsight-DB, a large-scale annotated dataset with labels for micro-expressions and deep psychological inference; and designs PRISM, an automated evaluation metric based on an expert-guided LLM for multidimensional assessment of mental vision models. Result: On the PRISM benchmark, MIND improves micro-expression detection by +86.95% over the prior best method; ablations show that the Status Judgment module is the most critical component. Conclusion: By explicitly disentangling speech-related and affective visual features, and with a high-quality dataset and a new evaluation framework, MIND advances psychological analysis of real-world conversations and offers a feasible path toward future mental-health assistance technology. Abstract: Generative psychological analysis of in-the-wild conversations faces two fundamental challenges: (1) existing Vision-Language Models (VLMs) fail to resolve Articulatory-Affective Ambiguity, where visual patterns of speech mimic emotional expressions; and (2) progress is stifled by a lack of verifiable evaluation metrics capable of assessing visual grounding and reasoning depth. We propose a complete ecosystem to address these twin challenges. First, we introduce Multilevel Insight Network for Disentanglement (MIND), a novel hierarchical visual encoder that introduces a Status Judgment module to algorithmically suppress ambiguous lip features based on their temporal feature variance, achieving explicit visual disentanglement. Second, we construct ConvoInsight-DB, a new large-scale dataset with expert annotations for micro-expressions and deep psychological inference. Third, we design the Mental Reasoning Insight Rating Metric (PRISM), an automated dimensional framework that uses an expert-guided LLM to measure the multidimensional performance of large mental vision models. On our PRISM benchmark, MIND significantly outperforms all baselines, achieving a +86.95% gain in micro-expression detection over prior SOTA. Ablation studies confirm that our Status Judgment disentanglement module is the most critical component for this performance leap. Our code is open-source.
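To make the idea of variance-based suppression concrete, the sketch below gates a one-dimensional lip feature track by its temporal variance. Treating high variance as speech and zeroing the whole track is an assumed rule for illustration only; the actual Status Judgment module, its threshold, and its feature layout are not described in this digest.

```python
from statistics import pvariance


def gate_lip_features(lip_track, threshold=0.05):
    """Zero the lip track when its temporal variance suggests speech (assumed rule)."""
    speaking = pvariance(lip_track) > threshold
    gated = [0.0] * len(lip_track) if speaking else list(lip_track)
    return speaking, gated


talking = [0.1, 0.9, 0.2, 0.8]   # rapid mouth motion across frames
still = [0.5, 0.52, 0.5, 0.48]   # nearly static mouth region
talking_flag, talking_gated = gate_lip_features(talking)
still_flag, still_gated = gate_lip_features(still)
```

The point of such a gate is that lip motion caused by articulation is removed before expression analysis, while a quiet mouth region is passed through unchanged.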

[106] E3AD: An Emotion-Aware Vision-Language-Action Model for Human-Centric End-to-End Autonomous Driving

Yihong Tang,Haicheng Liao,Tong Nie,Junlin He,Ao Qu,Kehua Chen,Wei Ma,Zhenning Li,Lijun Sun,Chengzhong Xu

Main category: cs.CV

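TL;DR: This paper proposes E3AD, an emotion-aware vision-language-action (VLA) autonomous driving framework that introduces a continuous VAD emotion model and a dual-pathway spatial reasoning module to understand passenger emotion and coordinate driving behavior accordingly, improving human alignment.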

Details

Motivation: Existing end-to-end autonomous driving systems mostly ignore the passenger's emotional state, which affects ride comfort and acceptance, while the emotional information in natural-language commands is important for understanding user intent. A driving system that perceives and responds to passenger emotion is therefore needed. Method: Proposes the E3AD framework, comprising: 1) extraction of emotional state from natural language using a Valence-Arousal-Dominance (VAD) model; 2) a dual-pathway spatial reasoning module that fuses egocentric and allocentric views for path planning; and 3) a consistency-oriented training scheme combining modality pretraining with preference-based alignment to keep emotional intent and driving actions coherent. Result: On real-world datasets, E3AD performs strongly on visual grounding, waypoint prediction, and emotion estimation (SOTA VAD correlation), and markedly improves emotion-action consistency. Conclusion: Integrating emotion awareness into VLA-style driving makes the system more human-centric and acceptable; E3AD yields driving decisions and feedback that are better aligned with human cognition. Abstract: End-to-end autonomous driving (AD) systems increasingly adopt vision-language-action (VLA) models, yet they typically ignore the passenger's emotional state, which is central to comfort and AD acceptance. We introduce Open-Domain End-to-End (OD-E2E) autonomous driving, where an autonomous vehicle (AV) must interpret free-form natural-language commands, infer the emotion, and plan a physically feasible trajectory. We propose E3AD, an emotion-aware VLA framework that augments semantic understanding with two cognitively inspired components: a continuous Valence-Arousal-Dominance (VAD) emotion model that captures tone and urgency from language, and a dual-pathway spatial reasoning module that fuses egocentric and allocentric views for human-like spatial cognition. A consistency-oriented training scheme, combining modality pretraining with preference-based alignment, further enforces coherence between emotional intent and driving actions. Across real-world datasets, E3AD improves visual grounding and waypoint planning and achieves state-of-the-art (SOTA) VAD correlation for emotion estimation. These results show that injecting emotion into VLA-style driving yields more human-aligned grounding, planning, and human-centric feedback.

[107] MT-Depth: Multi-task Instance feature analysis for the Depth Completion

Abdul Haseeb Nizamani,Dandi Zhou,Xinhai Sun

Main category: cs.CV

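TL;DR: Proposes an instance-aware depth completion framework that uses binary instance masks as spatial priors to refine depth predictions, outperforming baseline and semantic-guided methods on the Virtual KITTI 2 dataset.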

Details

Motivation: Existing depth completion methods mostly rely on semantic segmentation, overlook the benefits of object-level understanding, and usually require dense semantic labels. Method: Combines a frozen YOLO V11 instance segmentation branch, a U-Net depth completion backbone, a cross-attention fusion module, and an attention-guided prediction head; instance masks guide depth completion through cross-attention. Result: On Virtual KITTI 2, the method achieves lower RMSE and competitive MAE compared with a U-Net-only baseline and semantic-guided methods, with clear improvements near object boundaries, occlusions, and thin structures. Conclusion: Instance-aware cues improve depth completion without relying on dense semantic annotation, offering a new direction for the field. Abstract: Depth completion plays a vital role in 3D perception systems, especially in scenarios where sparse depth data must be densified for tasks such as autonomous driving, robotics, and augmented reality. While many existing approaches rely on semantic segmentation to guide depth completion, they often overlook the benefits of object-level understanding. In this work, we introduce an instance-aware depth completion framework that explicitly integrates binary instance masks as spatial priors to refine depth predictions. Our model combines four main components: a frozen YOLO V11 instance segmentation branch, a U-Net-based depth completion backbone, a cross-attention fusion module, and an attention-guided prediction head. The instance segmentation branch generates per-image foreground masks that guide the depth branch via cross-attention, allowing the network to focus on object-centric regions during refinement. We validate our method on the Virtual KITTI 2 dataset, showing that it achieves lower RMSE compared to both a U-Net-only baseline and previous semantic-guided methods, while maintaining competitive MAE. Qualitative and quantitative results demonstrate that the proposed model effectively enhances depth accuracy near object boundaries, occlusions, and thin structures. Our findings suggest that incorporating instance-aware cues offers a promising direction for improving depth completion without relying on dense semantic labels.
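Since this entry reports lower RMSE alongside merely competitive MAE, a short sketch of the two metrics shows why they can diverge. It uses the common convention of scoring only pixels with valid ground truth (here, `target > 0`), which is an assumption about the evaluation protocol, and the function name is hypothetical.

```python
import math


def depth_errors(pred, target):
    """RMSE and MAE over pixels with valid ground truth (target > 0)."""
    diffs = [p - t for p, t in zip(pred, target) if t > 0]
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    mae = sum(abs(d) for d in diffs) / len(diffs)
    return rmse, mae


pred = [10.0, 20.0, 30.0, 5.0]
target = [10.0, 20.0, 26.0, 0.0]   # last pixel has no ground truth
rmse, mae = depth_errors(pred, target)
```

RMSE squares each error, so a single large miss, such as one at an object boundary, raises it far more than MAE. A method that mainly fixes boundary outliers would therefore tend to show up in RMSE first.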

[108] Order Matters: 3D Shape Generation from Sequential VR Sketches

Yizi Chen,Sidi Wu,Tianyi Xiao,Nina Wiedemann,Loic Landrieu

Main category: cs.CV

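TL;DR: This paper proposes VRSketch2Shape, the first framework and multi-category dataset for generating 3D shapes from sequential VR sketches. An order-aware sketch encoder and a diffusion-based 3D generator improve geometric fidelity and generalize well to real and partial sketches.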

Details

Motivation: Existing sketch-to-shape models ignore the temporal order of strokes, discarding key cues about structure and design intent and limiting the quality of 3D shapes generated from VR sketches. Method: Proposes an automated pipeline for generating sequential VR sketches, builds a dataset of over 20k synthetic and 900 hand-drawn sketch-shape pairs, and couples an order-aware sketch encoder with a diffusion-based 3D generator. Result: The method achieves higher geometric fidelity than prior work, generalizes from synthetic to real sketches with minimal supervision, and performs well on partial sketches. Conclusion: VRSketch2Shape provides a new benchmark for VR sketch-to-3D generation, demonstrates the value of temporal information, and advances design tools that do not require CAD expertise. Abstract: VR sketching lets users explore and iterate on ideas directly in 3D, offering a faster and more intuitive alternative to conventional CAD tools. However, existing sketch-to-shape models ignore the temporal ordering of strokes, discarding crucial cues about structure and design intent. We introduce VRSketch2Shape, the first framework and multi-category dataset for generating 3D shapes from sequential VR sketches. Our contributions are threefold: (i) an automated pipeline that generates sequential VR sketches from arbitrary shapes, (ii) a dataset of over 20k synthetic and 900 hand-drawn sketch-shape pairs across four categories, and (iii) an order-aware sketch encoder coupled with a diffusion-based 3D generator. Our approach yields higher geometric fidelity than prior work, generalizes effectively from synthetic to real sketches with minimal supervision, and performs well even on partial sketches. All data and models will be released open-source at https://chenyizi086.github.io/VRSketch2Shape_website.

[109] PaCo-RL: Advancing Reinforcement Learning for Consistent Image Generation with Pairwise Reward Modeling

Bowen Ping,Chengyou Jia,Minnan Luo,Changliang Xia,Xin Shen,Zhuohang Dang,Hangwei Qian

Main category: cs.CV

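TL;DR: This paper proposes PaCo-RL, a framework that combines the consistency reward model PaCo-Reward with the efficient reinforcement learning algorithm PaCo-GRPO to optimize consistency in image generation without large-scale annotated data, substantially improving visual consistency and training efficiency.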

Details

Motivation: Supervised methods are limited on consistent image generation because large-scale consistency datasets are lacking and human perceptual preferences are hard to model; a data-free approach that can learn subjective visual criteria is needed. Method: Proposes the PaCo-RL framework: 1) PaCo-Reward, a pairwise consistency evaluator trained on a large-scale dataset built via automated sub-figure pairing, which uses a generative autoregressive scoring mechanism with task-aware instructions and chain-of-thought reasoning; 2) PaCo-GRPO, which uses a resolution-decoupled optimization strategy to reduce RL cost and a log-tamed multi-reward aggregation mechanism for stable optimization. Result: Experiments show that PaCo-Reward significantly improves alignment with human perception of consistency, and PaCo-GRPO achieves SOTA performance on two representative subtasks with more efficient and stable training. Conclusion: PaCo-RL provides a practical, scalable solution for consistent image generation and demonstrates the potential of reinforcement learning in this area. Abstract: Consistent image generation requires faithfully preserving identities, styles, and logical coherence across multiple images, which is essential for applications such as storytelling and character design. Supervised training approaches struggle with this task due to the lack of large-scale datasets capturing visual consistency and the complexity of modeling human perceptual preferences. In this paper, we argue that reinforcement learning (RL) offers a promising alternative by enabling models to learn complex and subjective visual criteria in a data-free manner. To achieve this, we introduce PaCo-RL, a comprehensive framework that combines a specialized consistency reward model with an efficient RL algorithm. The first component, PaCo-Reward, is a pairwise consistency evaluator trained on a large-scale dataset constructed via automated sub-figure pairing. It evaluates consistency through a generative, autoregressive scoring mechanism enhanced by task-aware instructions and CoT reasoning. The second component, PaCo-GRPO, leverages a novel resolution-decoupled optimization strategy to substantially reduce RL cost, alongside a log-tamed multi-reward aggregation mechanism that ensures balanced and stable reward optimization. Extensive experiments across the two representative subtasks show that PaCo-Reward significantly improves alignment with human perceptions of visual consistency, and PaCo-GRPO achieves state-of-the-art consistency performance with improved training efficiency and stability. 
Together, these results highlight the promise of PaCo-RL as a practical and scalable solution for consistent image generation. The project page is available at https://x-gengroup.github.io/HomePage_PaCo-RL/.

[110] LaFiTe: A Generative Latent Field for 3D Native Texturing

Chia-Hao Chen,Zi-Xin Zou,Yan-Pei Cao,Ze Yuan,Guan Luo,Xiaojuan Qi,Ding Liang,Song-Hai Zhang,Yuan-Chen Guo

Main category: cs.CV

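TL;DR: This paper proposes LaFiTe, a framework that learns to generate textures as a 3D generative sparse latent color field, enabling high-fidelity, seamless 3D-native texturing and addressing the limitations of prior methods that lack a powerful, versatile latent representation.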

Details

Motivation: Existing 3D texture generation methods are constrained by the lack of a powerful and versatile latent representation, which limits the fidelity and generality of generated textures and makes it hard to overcome the inherent shortcomings of UV-based and multi-view projection methods. Method: LaFiTe uses a variational autoencoder (VAE) to encode complex surface appearance into a sparse, structured latent space that is decoded into a continuous color field; on top of this, a conditional rectified-flow model synthesizes high-quality textures across diverse styles and geometries. Result: The method exceeds state-of-the-art methods by more than 10 dB PSNR in reconstruction and supports downstream applications such as material synthesis and texture super-resolution. Conclusion: LaFiTe sets a new benchmark for 3D-native texturing, closes the representation gap, and advances next-generation 3D content creation workflows. Abstract: Generating high-fidelity, seamless textures directly on 3D surfaces, what we term 3D-native texturing, remains a fundamental open challenge, with the potential to overcome long-standing limitations of UV-based and multi-view projection methods. However, existing native approaches are constrained by the absence of a powerful and versatile latent representation, which severely limits the fidelity and generality of their generated textures. We identify this representation gap as the principal barrier to further progress. We introduce LaFiTe, a framework that addresses this challenge by learning to generate textures as a 3D generative sparse latent color field. At its core, LaFiTe employs a variational autoencoder (VAE) to encode complex surface appearance into a sparse, structured latent space, which is subsequently decoded into a continuous color field. This representation achieves unprecedented fidelity, exceeding state-of-the-art methods by >10 dB PSNR in reconstruction, by effectively disentangling texture appearance from mesh topology and UV parameterization. Building upon this strong representation, a conditional rectified-flow model synthesizes high-quality, coherent textures across diverse styles and geometries. Extensive experiments demonstrate that LaFiTe not only sets a new benchmark for 3D-native texturing but also enables flexible downstream applications such as material synthesis and texture super-resolution, paving the way for the next generation of 3D content creation workflows.

[111] EMMA: Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture

Xin He,Longhui Wei,Jianbo Ouyang,Lingxi Xie,Qi Tian

Main category: cs.CV

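TL;DR: EMMA is an efficient, unified multimodal architecture that balances performance and efficiency across understanding, generation, and editing tasks through an efficient autoencoder, channel-wise concatenation, a shared-and-decoupled network, and a mixture-of-experts mechanism.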

Details

Motivation: Existing unified multimodal architectures suffer from training imbalance between understanding and generation tasks and from low efficiency caused by too many visual tokens; a more efficient unified design is needed. Method: 1) An efficient autoencoder with a 32x compression ratio reduces the number of tokens required for generation; 2) channel-wise rather than token-wise concatenation further reduces the number of visual tokens; 3) a shared-and-decoupled network enables mutual improvement across tasks while meeting task-specific requirements; 4) a mixture-of-experts mechanism in the visual understanding encoder improves perceptual capability. Result: EMMA-4B significantly outperforms state-of-the-art unified multimodal approaches (e.g., BAGEL-7B) and is competitive in efficiency and performance with specialized multimodal models (e.g., Qwen3-VL and Qwen-Image). Conclusion: EMMA provides a solid foundation for future unified multimodal architectures. Abstract: We propose EMMA, an efficient and unified architecture for multimodal understanding, generation and editing. Specifically, EMMA primarily consists of 1) An efficient autoencoder with a 32x compression ratio, which significantly reduces the number of tokens required for generation. This also ensures the training balance between understanding and generation tasks by applying the same compression ratio to images. 2) Channel-wise concatenation instead of token-wise concatenation among visual understanding and generation tokens, which further reduces the visual tokens in unified architectures. 3) A shared-and-decoupled network that enables mutual improvements across tasks while meeting the task-specific modeling requirements. 4) A mixture-of-experts mechanism adopted for visual understanding encoder, which substantially improves perceptual capabilities with only a small increase in parameters. Extensive experiments have shown that EMMA-4B can significantly outperform state-of-the-art unified multimodal approaches (e.g., BAGEL-7B) in both efficiency and performance, while also achieving competitive results compared to recent multimodal understanding and generation experts (e.g., Qwen3-VL and Qwen-Image). We believe that EMMA lays a solid foundation for the future development of unified multimodal architectures.
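The token savings claimed for channel-wise concatenation can be illustrated with plain lists. The sketch assumes the 32x ratio applies per spatial side and that understanding and generation features share the same token grid, which is what makes a per-token merge possible; all names are hypothetical and the feature values are placeholders.

```python
def num_tokens(height, width, ratio=32):
    """Token count if each side is compressed by `ratio` (assumed interpretation)."""
    return (height // ratio) * (width // ratio)


def token_wise(understanding, generation):
    """Concatenate along the sequence axis: the token count adds up."""
    return understanding + generation


def channel_wise(understanding, generation):
    """Concatenate along the feature axis: the token count stays the same."""
    return [u + g for u, g in zip(understanding, generation)]


understanding = [[1.0, 2.0] for _ in range(4)]   # 4 tokens, 2 channels each
generation = [[3.0] for _ in range(4)]           # 4 tokens, 1 channel each
long_sequence = token_wise(understanding, generation)
wide_sequence = channel_wise(understanding, generation)
tokens_1024 = num_tokens(1024, 1024)
```

Attention cost grows with sequence length, so keeping the token count fixed and widening each token is the cheaper option, provided both branches produce one token per grid cell.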

[112] RobustSplat++: Decoupling Densification, Dynamics, and Illumination for In-the-Wild 3DGS

Chuanyu Fu,Guanying Chen,Yuqi Zhang,Kunbin Yao,Yuan Xiong,Chuan Huang,Shuguang Cui,Yasuyuki Matsushita,Xiaochun Cao

Main category: cs.CV

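TL;DR: This paper proposes RobustSplat++, a method for robust 3D Gaussian Splatting of complex in-the-wild scenes that improves robustness to transient objects and illumination changes through delayed Gaussian growth, scale-cascaded mask bootstrapping, and appearance modeling.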

Details

Motivation: Existing 3D Gaussian Splatting methods tend to produce rendering artifacts on real scenes containing transient objects and illumination changes, mainly because Gaussian densification overfits to these dynamic disturbances. Method: Three key techniques are proposed: 1) a delayed Gaussian growth strategy that optimizes static structure first; 2) a scale-cascaded mask bootstrapping mechanism that first uses low-resolution feature similarity to produce a reliable initial mask and then moves to high-resolution supervision; 3) a combination of these strategies with appearance modeling to handle complex scenes. Result: Experiments on multiple challenging datasets show that the method outperforms existing methods, reducing artifacts and improving rendering quality. Conclusion: Through these key designs, RobustSplat++ makes 3DGS more robust in real-world scenes and offers an effective solution for transient disturbances and illumination changes. Abstract: 3D Gaussian Splatting (3DGS) has gained significant attention for its real-time, photo-realistic rendering in novel-view synthesis and 3D modeling. However, existing methods struggle with accurately modeling in-the-wild scenes affected by transient objects and illuminations, leading to artifacts in the rendered images. We identify that the Gaussian densification process, while enhancing scene detail capture, unintentionally contributes to these artifacts by growing additional Gaussians that model transient disturbances and illumination variations. To address this, we propose RobustSplat++, a robust solution based on several critical designs. First, we introduce a delayed Gaussian growth strategy that prioritizes optimizing static scene structure before allowing Gaussian splitting/cloning, mitigating overfitting to transient objects in early optimization. Second, we design a scale-cascaded mask bootstrapping approach that first leverages lower-resolution feature similarity supervision for reliable initial transient mask estimation, taking advantage of its stronger semantic consistency and robustness to noise, and then progresses to high-resolution supervision to achieve more precise mask prediction. Third, we incorporate the delayed Gaussian growth strategy and mask bootstrapping with appearance modeling to handle in-the-wild scenes including transients and illuminations. Extensive experiments on multiple challenging datasets show that our method outperforms existing methods, clearly demonstrating the robustness and effectiveness of our method.

[113] LatentFM: A Latent Flow Matching Approach for Generative Medical Image Segmentation

Huynh Trinh Ngoc,Hoang Anh Nguyen Kim,Toan Nguyen Hai,Long Tran Quoc

Main category: cs.CV

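TL;DR: Proposes LatentFM, a flow-based medical image segmentation model operating in latent space, which uses variational autoencoders and conditional velocity field estimation to produce accurate segmentations with uncertainty estimates.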

Details

Motivation: Inspired by the progress of flow matching (FM) in generative modeling, the work aims to develop a flow model that segments medical images in latent space, improving accuracy and efficiency while estimating predictive uncertainty. Method: Two variational autoencoders (VAEs) encode medical images and their corresponding masks into a lower-dimensional latent space; a conditional velocity field, conditioned on the input image, is then estimated to guide the flow; sampling multiple latent representations yields diverse segmentation outputs and confidence maps that quantify model certainty. Result: Experiments on two datasets, ISIC-2018 and CVC-Clinic, show better segmentation accuracy than existing baselines while remaining efficient in latent space and providing reliable uncertainty estimates and confidence maps. Conclusion: LatentFM is an effective medical image segmentation method that combines high accuracy with uncertainty awareness and clinically useful confidence information. Abstract: Generative models have achieved remarkable progress with the emergence of flow matching (FM). It has demonstrated strong generative capabilities and attracted significant attention as a simulation-free flow-based framework capable of learning exact data densities. Motivated by these advances, we propose LatentFM, a flow-based model operating in the latent space for medical image segmentation. To model the data distribution, we first design two variational autoencoders (VAEs) to encode both medical images and their corresponding masks into a lower-dimensional latent space. We then estimate a conditional velocity field that guides the flow based on the input image. By sampling multiple latent representations, our method synthesizes diverse segmentation outputs whose pixel-wise variance reliably captures the underlying data distribution, enabling both highly accurate and uncertainty-aware predictions. Furthermore, we generate confidence maps that quantify the model certainty, providing clinicians with richer information for deeper analysis. We conduct experiments on two datasets, ISIC-2018 and CVC-Clinic, and compare our method with several prior baselines, including both deterministic and generative models. Through comprehensive evaluations, both qualitative and quantitative results show that our approach achieves superior segmentation accuracy while remaining highly efficient in the latent space.
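
The idea of reading uncertainty off several sampled segmentations, as described in the LatentFM entry, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the masks are hand-written stand-ins for decoded samples, and the `confidence_map` formula is a hypothetical choice, since the abstract does not specify how the paper computes its confidence maps.

```python
# Minimal sketch: per-pixel mean and variance over several sampled binary masks.
# The sample masks and the confidence formula are illustrative assumptions.

def pixelwise_stats(masks):
    """Return per-pixel mean and variance over a list of equally sized 2D masks."""
    n = len(masks)
    rows, cols = len(masks[0]), len(masks[0][0])
    mean = [[sum(m[r][c] for m in masks) / n for c in range(cols)]
            for r in range(rows)]
    var = [[sum((m[r][c] - mean[r][c]) ** 2 for m in masks) / n
            for c in range(cols)] for r in range(rows)]
    return mean, var

def confidence_map(mean):
    """Hypothetical confidence: distance of the mean from 0.5, rescaled to [0, 1]."""
    return [[abs(p - 0.5) * 2 for p in row] for row in mean]

# Four stand-in samples of a 2x2 mask; they disagree only on the top-right pixel.
samples = [
    [[1, 1], [0, 0]],
    [[1, 0], [0, 0]],
    [[1, 1], [0, 0]],
    [[1, 0], [0, 0]],
]
mean, var = pixelwise_stats(samples)
conf = confidence_map(mean)
print(mean, var, conf)
```

Pixels on which all samples agree get zero variance and full confidence, while the contested pixel gets the highest variance and the lowest confidence.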

[114] FreeGen: Feed-Forward Reconstruction-Generation Co-Training for Free-Viewpoint Driving Scene Synthesis

Shijie Chen,Peixi Peng

Main category: cs.CV

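TL;DR: Proposes FreeGen, a feed-forward reconstruction-generation co-training framework for free-viewpoint driving scene synthesis that achieves both interpolation consistency and extrapolation realism.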

Details

Motivation: Existing datasets and generative pipelines rarely provide consistent off-trajectory observations, which limits large-scale closed-loop simulation and pre-training for autonomous driving. Method: A reconstruction model provides stable geometric representations to ensure interpolation consistency, a generation model performs geometry-aware enhancement to improve realism at unseen viewpoints, and co-training lets the two improve each other. Result: FreeGen achieves state-of-the-art performance on free-viewpoint driving scene synthesis. Conclusion: FreeGen combines the strengths of reconstruction and generation models, improving the quality and generalization of free-viewpoint driving scene synthesis. Abstract: Closed-loop simulation and scalable pre-training for autonomous driving require synthesizing free-viewpoint driving scenes. However, existing datasets and generative pipelines rarely provide consistent off-trajectory observations, limiting large-scale evaluation and training. While recent generative models demonstrate strong visual realism, they struggle to jointly achieve interpolation consistency and extrapolation realism without per-scene optimization. To address this, we propose FreeGen, a feed-forward reconstruction-generation co-training framework for free-viewpoint driving scene synthesis. The reconstruction model provides stable geometric representations to ensure interpolation consistency, while the generation model performs geometry-aware enhancement to improve realism at unseen viewpoints. Through co-training, generative priors are distilled into the reconstruction model to improve off-trajectory rendering, and the refined geometry in turn offers stronger structural guidance for generation. Experiments demonstrate that FreeGen achieves state-of-the-art performance for free-viewpoint driving scene synthesis.

[115] Tokenizing Buildings: A Transformer for Layout Synthesis

Manuel Ladron de Guevara,Jinmo Rhee,Ardavan Bidgoli,Vaidas Razgaitis,Michael Bergin

Main category: cs.CV

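TL;DR: Proposes Small Building Model (SBM), a Transformer-based architecture for layout synthesis in Building Information Modeling (BIM) that unifies heterogeneous feature sets and uses a joint embedding module to obtain high-fidelity room representations and autoregressive prediction of room entities.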

Details

Motivation: To address how the heterogeneous feature sets of architectural elements can be unified into sequence representations while preserving compositional structure, in support of layout synthesis in BIM scenes. Method: Features of architectural elements are represented as a sparse attribute-feature matrix, and a unified embedding module learns joint representations of categorical and continuous features; a single Transformer backbone is used in an encoder-only mode to produce room embeddings and in an encoder-decoder mode for Data-Driven Entity Prediction (DDEP). Result: Experiments show that SBM learns compact room embeddings that cluster well by type and topology and support efficient semantic retrieval; in DDEP mode it generates functionally sound layouts with fewer collisions, fewer boundary violations, and better navigability. Conclusion: SBM integrates the heterogeneous features of architectural elements, delivering high-quality layout generation and semantic retrieval and offering a new approach to automated design in BIM. Abstract: We introduce Small Building Model (SBM), a Transformer-based architecture for layout synthesis in Building Information Modeling (BIM) scenes. We address the question of how to tokenize buildings by unifying heterogeneous feature sets of architectural elements into sequences while preserving compositional structure. Such feature sets are represented as a sparse attribute-feature matrix that captures room properties. We then design a unified embedding module that learns joint representations of categorical and possibly correlated continuous feature groups. Lastly, we train a single Transformer backbone in two modes: an encoder-only pathway that yields high-fidelity room embeddings, and an encoder-decoder pipeline for autoregressive prediction of room entities, referred to as Data-Driven Entity Prediction (DDEP). Experiments across retrieval and generative layout synthesis show that SBM learns compact room embeddings that reliably cluster by type and topology, enabling strong semantic retrieval. In DDEP mode, SBM produces functionally sound layouts, with fewer collisions and boundary violations and improved navigability.

[116] A Sanity Check for Multi-In-Domain Face Forgery Detection in the Real World

Jikang Cheng,Renye Yan,Zhiyuan Yan,Yaozhong Gan,Xueyi Zhang,Zhongyuan Wang,Wei Peng,Ling Liang

Main category: cs.CV

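TL;DR: Proposes MID-FFD, a new multi-in-domain paradigm for face forgery detection, and the DevDet framework, which amplifies real/fake differences to improve detection on domain-unspecified inputs.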

Details

Motivation: Existing deepfake detectors struggle to generalize to entirely unseen variations, and when many domains are used for training, domain differences dominate the feature space and make independent single-frame judgments difficult. Method: The paper proposes the MID-FFD paradigm and the DevDet framework, which consists of a Face Forgery Developer (FFDev) and a Dose-Adaptive detector Fine-Tuning strategy (DAFT), to amplify the differences between real and fake. Result: Experiments show that the method substantially improves real/fake judgment accuracy (ACC) in the MID-FFD scenario while retaining good generalization to unseen data. Conclusion: By introducing large-scale multi-domain training and addressing the domain-dominance problem, DevDet offers an effective solution for frame-level independent deepfake detection in practice. Abstract: Existing methods for deepfake detection aim to develop generalizable detectors. Although "generalizable" is the ultimate target once and for all, with limited training forgeries and domains, it appears idealistic to expect generalization that covers entirely unseen variations, especially given the diversity of real-world deepfakes. Therefore, introducing large-scale multi-domain data for training can be feasible and important for real-world applications. However, within such a multi-domain scenario, the differences between multiple domains, rather than the subtle real/fake distinctions, dominate the feature space. As a result, despite detectors being able to relatively separate real and fake within each domain (i.e., high AUC), they struggle with single-image real/fake judgments in domain-unspecified conditions (i.e., low ACC). In this paper, we first define a new research paradigm named Multi-In-Domain Face Forgery Detection (MID-FFD), which includes sufficient volumes of real-fake domains for training. Then, the detector should provide definitive real-fake judgments to the domain-unspecified inputs, which simulate the frame-by-frame independent detection scenario in the real world. Meanwhile, to address the domain-dominant issue, we propose a model-agnostic framework termed DevDet (Developer for Detector) to amplify real/fake differences and make them dominant in the feature space. DevDet consists of a Face Forgery Developer (FFDev) and a Dose-Adaptive detector Fine-Tuning strategy (DAFT). Experiments demonstrate our superiority in predicting real-fake under the MID-FFD scenario while maintaining original generalization ability to unseen data.

[117] Autoregressive Image Generation Needs Only a Few Lines of Cached Tokens

Ziran Qin,Youru Lv,Mingbao Lin,Zeren Zhang,Chanfan Gan,Tieyuan Chen,Weiyao Lin

Main category: cs.CV

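TL;DR: This paper proposes LineAR, a training-free progressive KV cache compression method that addresses the memory bottleneck of autoregressive image generation by managing the cache at the line level, substantially reducing memory use and increasing generation speed while maintaining or even improving generation quality.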

Details

Motivation: Existing autoregressive image generation must cache all previously generated visual tokens, which creates a severe memory bottleneck with high storage requirements and low throughput and limits scalability. Method: LineAR exploits the intrinsic characteristics of visual attention to manage the KV cache at the line level from a 2D view; guided by inter-line attention, it progressively evicts less-informative tokens that have little effect on subsequent generation and keeps only a few lines of cache. Result: Across six autoregressive image generation models, LineAR improves FID on ImageNet and COCO while retaining only 1/6 or 1/8 of the KV cache, and achieves up to 67.61% memory reduction and a 7.57x speedup. Conclusion: LineAR alleviates the memory bottleneck of autoregressive image generation, is general and practical, and greatly improves efficiency while maintaining or even improving generation quality. Abstract: Autoregressive (AR) visual generation has emerged as a powerful paradigm for image and multimodal synthesis, owing to its scalability and generality. However, existing AR image generation suffers from severe memory bottlenecks due to the need to cache all previously generated visual tokens during decoding, leading to both high storage requirements and low throughput. In this paper, we introduce LineAR, a novel, training-free progressive key-value (KV) cache compression pipeline for autoregressive image generation. By fully exploiting the intrinsic characteristics of visual attention, LineAR manages the cache at the line level using a 2D view, preserving the visual dependency regions while progressively evicting less-informative tokens that are harmless for subsequent line generation, guided by inter-line attention. LineAR enables efficient autoregressive (AR) image generation by utilizing only a few lines of cache, achieving both memory savings and throughput speedup, while maintaining or even improving generation quality. Extensive experiments across six autoregressive image generation models, including class-conditional and text-to-image generation, validate its effectiveness and generality. LineAR improves ImageNet FID from 2.77 to 2.68 and COCO FID from 23.85 to 22.86 on LlamaGen-XL and Janus-Pro-1B, while retaining only 1/6 KV cache. It also improves DPG on Lumina-mGPT-768 with just 1/8 KV cache. Additionally, LineAR achieves significant memory and throughput gains, including up to 67.61% memory reduction and 7.57x speedup on LlamaGen-XL, and 39.66% memory reduction and 5.62x speedup on Janus-Pro-7B.

Maria-Paola Forte,Nikos Athanasiou,Giulia Ballardini,Jan Ulrich Bartels,Katherine J. Kuchenbecker,Michael J. Black

Main category: cs.CV

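TL;DR: Proposes BioTUCH, a 3D human pose capture framework that combines visual pose estimation with bioimpedance sensing, using self-contact information to improve pose reconstruction accuracy, and releases an accompanying dataset and miniature sensor.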

Details

Motivation: Existing video-based pose estimation methods often fail in self-contact scenarios (such as a hand touching the face), whereas bioimpedance sensing can measure skin-to-skin contact cheaply and unobtrusively; the work therefore uses this signal to improve 3D pose capture in the wild. Method: The BioTUCH framework initializes the pose with an off-the-shelf estimator and, when self-contact is measured, runs a contact-aware pose optimization that jointly minimizes reprojection error and deviation from the initial estimate while enforcing vertex proximity constraints. Result: Validated on a new dataset of synchronized RGB video, bioimpedance measurements, and 3D motion capture, the method improves reconstruction accuracy by 11.7% on average across three input estimators. Conclusion: Self-contact signals from bioimpedance sensing improve 3D pose estimation in challenging scenarios, and the proposed miniature sensor enables large-scale collection of contact-aware training data. Abstract: Capturing accurate 3D human pose in the wild would provide valuable data for training pose estimation and motion generation methods. While video-based estimation approaches have become increasingly accurate, they often fail in common scenarios involving self-contact, such as a hand touching the face. In contrast, wearable bioimpedance sensing can cheaply and unobtrusively measure ground-truth skin-to-skin contact. Consequently, we propose a novel framework that combines visual pose estimators with bioimpedance sensing to capture the 3D pose of people by taking self-contact into account. Our method, BioTUCH, initializes the pose using an off-the-shelf estimator and introduces contact-aware pose optimization during measured self-contact: reprojection error and deviations from the input estimate are minimized while enforcing vertex proximity constraints. We validate our approach using a new dataset of synchronized RGB video, bioimpedance measurements, and 3D motion capture. Testing with three input pose estimators, we demonstrate an average of 11.7% improvement in reconstruction accuracy. We also present a miniature wearable bioimpedance sensor that enables efficient large-scale collection of contact-aware training data for improving pose estimation and generation using BioTUCH. Code and data are available at biotuch.is.tue.mpg.de

[119] SP-Det: Self-Prompted Dual-Text Fusion for Generalized Multi-Label Lesion Detection

Qing Xu,Yanqian Wang,Xiangjian Hea,Yue Li,Yixuan Zhang,Rong Qu,Wenting Duan,Zhen Chen

Main category: cs.CV

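TL;DR: Proposes SP-Det, a self-prompted detection framework for multi-label lesion detection in chest X-rays that requires no expert annotations and improves detection through a dual-text prompt generator and a bidirectional feature enhancer.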

Details

Motivation: Existing promptable lesion detection methods rely on manually annotated prompts, which are time-consuming and impractical in clinical settings; a method is needed that generates text prompts automatically and still achieves accurate multi-label lesion detection. Method: An expert-free dual-text prompt generator (DTPG) combines semantic context prompts that capture global pathological patterns with disease beacon prompts that target specific diseases, and a bidirectional feature enhancer (BFE) fuses diagnostic context with disease embeddings to refine the feature representation. Result: Experiments on two chest X-ray datasets show that SP-Det outperforms state-of-the-art detection methods while removing the dependency on expert-annotated prompts. Conclusion: SP-Det achieves automated multi-label lesion detection without manually annotated prompts, improving clinical applicability and detection accuracy. Abstract: Automated lesion detection in chest X-rays has demonstrated significant potential for improving clinical diagnosis by precisely localizing pathological abnormalities. While recent promptable detection frameworks have achieved remarkable accuracy in target localization, existing methods typically rely on manual annotations as prompts, which are labor-intensive and impractical for clinical applications. To address this limitation, we propose SP-Det, a novel self-prompted detection framework that automatically generates rich textual context to guide multi-label lesion detection without requiring expert annotations. Specifically, we introduce an expert-free dual-text prompt generator (DTPG) that leverages two complementary textual modalities: semantic context prompts that capture global pathological patterns and disease beacon prompts that focus on disease-specific manifestations. Moreover, we devise a bidirectional feature enhancer (BFE) that synergistically integrates comprehensive diagnostic context with disease-specific embeddings to significantly improve feature representation and detection accuracy. Extensive experiments on two chest X-ray datasets with diverse thoracic disease categories demonstrate that our SP-Det framework outperforms state-of-the-art detection methods while completely eliminating the dependency on expert-annotated prompts compared to existing promptable architectures.

[120] SDG-Track: A Heterogeneous Observer-Follower Framework for High-Resolution UAV Tracking on Embedded Platforms

Jiawen Wen,Yu Hu,Suixuan Qiu,Jinshan Huang,Xiaowen Chu

Main category: cs.CV

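TL;DR: Proposes SDG-Track, a sparse detection-guided tracker for real-time tracking of small UAVs on edge devices, which uses an Observer-Follower architecture and a Dual-Space Recovery mechanism to achieve high speed while retaining high precision.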

Details

Motivation: Real-time tracking of small UAVs on edge devices faces a conflict between resolution and speed: downsampling high-resolution images destroys small-target features, while processing full-resolution images cannot meet real-time requirements. Method: An Observer-Follower architecture is used: the Observer stream runs a high-capacity detector at low frequency on the GPU to provide accurate position anchors from 1920x1080 frames, and the Follower stream performs high-frequency trajectory interpolation on the CPU via ROI-constrained sparse optical flow. A training-free Dual-Space Recovery mechanism combines color histogram matching with geometric consistency constraints to handle tracking failures caused by occlusion or distractors. Result: On an NVIDIA Jetson Orin Nano the system reaches 35.1 FPS throughput while retaining 97.2% of frame-by-frame detection precision, and it successfully tracks agile FPV drones in real-world conditions. Conclusion: SDG-Track resolves the resolution-speed conflict in real-time small-UAV tracking on edge devices, balancing precision and efficiency for practical ground-to-air tracking. Abstract: Real-time tracking of small unmanned aerial vehicles (UAVs) on edge devices faces a fundamental resolution-speed conflict. Downsampling high-resolution imagery to standard detector input sizes causes small target features to collapse below detectable thresholds. Yet processing native 1080p frames on resource-constrained platforms yields insufficient throughput for smooth gimbal control. We propose SDG-Track, a Sparse Detection-Guided Tracker that adopts an Observer-Follower architecture to reconcile this conflict. The Observer stream runs a high-capacity detector at low frequency on the GPU to provide accurate position anchors from 1920x1080 frames. The Follower stream performs high-frequency trajectory interpolation via ROI-constrained sparse optical flow on the CPU. To handle tracking failures from occlusion or model drift caused by spectrally similar distractors, we introduce Dual-Space Recovery, a training-free re-acquisition mechanism combining color histogram matching with geometric consistency constraints. Experiments on a ground-to-air tracking station demonstrate that SDG-Track achieves 35.1 FPS system throughput while retaining 97.2% of the frame-by-frame detection precision. The system successfully tracks agile FPV drones under real-world operational conditions on an NVIDIA Jetson Orin Nano. Our code is publicly available at https://github.com/Jeffry-wen/SDG-Track

[121] You Only Train Once (YOTO): A Retraining-Free Object Detection Framework

Priyanto Hidayatullah,Nurjannah Syakrani,Yudi Widhiyasana,Muhammad Rizqi Sholahuddin,Refdinal Tubagus,Zahri Al Adzani Hidayat,Hanri Fajar Ramadhan,Dafa Alfarizki Pratama,Farhan Muhammad Yasin

Main category: cs.CV

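TL;DR: This paper proposes YOTO, a method that addresses catastrophic forgetting in object detection by combining YOLO11n, DeIT, and Proxy Anchor Loss, so that new classes can be added efficiently without retraining, improving training efficiency while keeping good detection accuracy.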

Details

Motivation: Object detectors must be retrained in full whenever new classes are added, which is costly and slow, especially in settings such as retail where new products appear frequently; an efficient incremental approach that avoids catastrophic forgetting is therefore needed. Method: The YOTO framework uses YOLO11n for object localization, DeIT with Proxy Anchor Loss for feature extraction and metric learning, and a Qdrant vector database to store class features; classification computes the cosine similarity between the target embedding and the features in the database. Result: In a retail case study with 140 products, YOTO detects both new and existing products with encouraging accuracy without retraining, achieves almost 3 times the training time efficiency of classical approaches with the advantage growing as products are added, and averages 580 ms per image at inference on an edge device. Conclusion: YOTO mitigates catastrophic forgetting and enables efficient, scalable incremental object detection suited to dynamic settings such as retail. Abstract: Object detection constitutes the primary task within the domain of computer vision. It is utilized in numerous domains. Nonetheless, object detection continues to encounter the issue of catastrophic forgetting. The model must be retrained whenever new products are introduced, utilizing not only the new products dataset but also the entirety of the previous dataset. The outcome is obvious: increasing model training expenses and significant time consumption. In numerous sectors, particularly retail checkout, the frequent introduction of new products presents a great challenge. This study introduces You Only Train Once (YOTO), a methodology designed to address the issue of catastrophic forgetting by integrating YOLO11n for object localization with DeIT and Proxy Anchor Loss for feature extraction and metric learning. For classification, we utilize cosine similarity between the embedding features of the target product and those in the Qdrant vector database. In a case study conducted in a retail store with 140 products, the experimental results demonstrate that our proposed framework achieves encouraging accuracy, whether for detecting new or existing products. Furthermore, without retraining, the training duration difference is significant. We achieve almost 3 times the training time efficiency compared to classical object detection approaches. This efficiency escalates as additional new products are added to the product database. The average inference time is 580 ms per image containing multiple products, on an edge device, validating the proposed framework's feasibility for practical use.
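
The classification step in the YOTO entry, matching a crop's embedding against stored class features by cosine similarity, can be sketched as follows. This is a hedged illustration rather than the authors' implementation: a plain dictionary stands in for the Qdrant vector database, the 3-dimensional embeddings are made up, and the `threshold` for rejecting unknown products is a hypothetical addition not mentioned in the abstract.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def classify(embedding, gallery, threshold=0.5):
    """Return the best-matching label, or "unknown" below a hypothetical threshold."""
    best_label, best_score = None, -1.0
    for label, reference in gallery.items():
        score = cosine_similarity(embedding, reference)
        if score > best_score:
            best_label, best_score = label, score
    if best_score < threshold:
        return "unknown", best_score
    return best_label, best_score

# In-memory stand-in for the vector database; the embeddings are invented.
gallery = {"cola": [1.0, 0.0, 0.0], "chips": [0.0, 1.0, 0.0]}
gallery["tea"] = [0.0, 0.0, 1.0]   # adding a product is an insert, not a retraining run

label, score = classify([0.1, 0.0, 0.9], gallery)
print(label, round(score, 3))
```

Because new classes only add entries to the gallery, the detector and embedding model stay fixed, which is the property the abstract relies on to avoid retraining.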

[122] Equivariant Symmetry-Aware Head Pose Estimation for Fetal MRI

Ramya Muthukrishnan,Borjan Gagoski,Aryn Lee,P. Ellen Grant,Elfar Adalsteinsson,Polina Golland,Benjamin Billot

Main category: cs.CV

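TL;DR: E(3)-Pose is a new fast pose estimation method that explicitly models rotation equivariance and object symmetry to improve the robustness and accuracy of 6-DoF head pose estimation in fetal MRI scans.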

Details

Motivation: To resolve the pose ambiguities in diagnostic fetal MRI caused by anatomical symmetries, low resolution, noise, and artifacts, and to enable automatic adaptive prescription of 2D slices. Method: E(3)-Pose jointly models rotation equivariance and object symmetry and estimates 6-DoF pose from 3D MRI volumes. Result: The method shows superior cross-domain robustness and generalization on publicly available and clinical fetal MRI datasets and reaches state-of-the-art accuracy on clinical MRI volumes. Conclusion: E(3)-Pose copes with the challenges of clinical MRI and has good prospects for clinical translation. Abstract: We present E(3)-Pose, a novel fast pose estimation method that jointly and explicitly models rotation equivariance and object symmetry. Our work is motivated by the challenging problem of accounting for fetal head motion during a diagnostic MRI scan. We aim to enable automatic adaptive prescription of 2D diagnostic MRI slices with 6-DoF head pose estimation, supported by 3D MRI volumes rapidly acquired before each 2D slice. Existing methods struggle to generalize to clinical volumes, due to pose ambiguities induced by inherent anatomical symmetries, as well as low resolution, noise, and artifacts. In contrast, E(3)-Pose captures anatomical symmetries and rigid pose equivariance by construction, and yields robust estimates of the fetal head pose. Our experiments on publicly available and representative clinical fetal MRI datasets demonstrate the superior robustness and generalization of our method across domains. Crucially, E(3)-Pose achieves state-of-the-art accuracy on clinical MRI volumes, paving the way for clinical translation. Our implementation is available at github.com/ramyamut/E3-Pose.

[123] ReflexFlow: Rethinking Learning Objective for Exposure Bias Alleviation in Flow Matching

Guanbo Huang,Jingjia Mao,Fanding Huang,Fengkai Liu,Xiangyang Luo,Yaoyuan Liang,Jiasheng Lu,Xiaoe Wang,Pei Liu,Ruiliu Fu,Shao-Lun Huang

Main category: cs.CV

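TL;DR: This paper proposes ReflexFlow, a method that alleviates exposure bias in Flow Matching models and improves generation quality through two components, Anti-Drift Rectification and Frequency Compensation.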

Details

Motivation: Flow Matching methods suffer from exposure bias between training and inference, which degrades generation quality; the paper investigates the root causes and proposes a remedy. Method: ReflexFlow consists of Anti-Drift Rectification (ADR), which adjusts prediction targets through a redesigned training loss, and Frequency Compensation (FC), which reweights the loss to compensate for missing low-frequency components. Result: Experiments on CIFAR-10, CelebA-64, and ImageNet-256 show that ReflexFlow reduces exposure bias, with a 35.65% reduction in FID on CelebA-64. Conclusion: ReflexFlow is a general and effective method that is compatible with existing Flow Matching frameworks and improves image generation quality. Abstract: Despite tremendous recent progress, Flow Matching methods still suffer from exposure bias due to discrepancies in training and inference. This paper investigates the root causes of exposure bias in Flow Matching, including: (1) the model lacks generalization to biased inputs during training, and (2) insufficient low-frequency content captured during early denoising, leading to accumulated bias. Based on these insights, we propose ReflexFlow, a simple and effective reflexive refinement of the Flow Matching learning objective that dynamically corrects exposure bias. ReflexFlow consists of two components: (1) Anti-Drift Rectification (ADR), which reflexively adjusts prediction targets for biased inputs utilizing a redesigned loss under training-time scheduled sampling; and (2) Frequency Compensation (FC), which reflects on missing low-frequency components and compensates them by reweighting the loss using exposure bias. ReflexFlow is model-agnostic, compatible with all Flow Matching frameworks, and improves generation quality across datasets. Experiments on CIFAR-10, CelebA-64, and ImageNet-256 show that ReflexFlow outperforms prior approaches in mitigating exposure bias, achieving a 35.65% reduction in FID on CelebA-64.

[124] Semantics Lead the Way: Harmonizing Semantic and Texture Modeling with Asynchronous Latent Diffusion

Yueming Pan,Ruoyu Feng,Qi Dai,Yuqi Wang,Wenfeng Lin,Mingyu Guo,Chong Luo,Nanning Zheng

Main category: cs.CV

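TL;DR: This paper proposes Semantic-First Diffusion (SFD), a latent diffusion model that explicitly prioritizes semantics: an asynchronous denoising mechanism forms semantics before texture, improving image generation quality and convergence speed.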

Details

Motivation: Latent Diffusion Models (LDMs) naturally follow a coarse-to-fine generation process, yet existing methods still denoise semantics and texture synchronously, ignoring the fact that semantics should form before texture, which limits generation quality. Method: The SFD framework first extracts a compact semantic latent through a dedicated Semantic VAE and combines it with the texture latent into a composite latent; the semantic and texture latents are then denoised asynchronously under separate noise schedules, with semantics leading by a temporal offset so that texture generation receives clearer high-level guidance. Result: On ImageNet 256x256, SFD achieves FID 1.06 with LightningDiT-XL and FID 1.04 with the 1.0B-parameter LightningDiT-XXL, converges up to 100x faster than the original DiT, and also improves existing methods such as ReDi and VA-VAE. Conclusion: Through explicit semantics-first asynchronous denoising, SFD exploits the timing gap between semantic and texture formation in LDMs, producing a more natural coarse-to-fine process and gains in both generation quality and training efficiency. Abstract: Latent Diffusion Models (LDMs) inherently follow a coarse-to-fine generation process, where high-level semantic structure is generated slightly earlier than fine-grained texture. This indicates the preceding semantics potentially benefit texture generation by providing a semantic anchor. Recent advances have integrated semantic priors from pretrained visual encoders to further enhance LDMs, yet they still denoise semantic and VAE-encoded texture synchronously, neglecting such ordering. Observing these, we propose Semantic-First Diffusion (SFD), a latent diffusion paradigm that explicitly prioritizes semantic formation. SFD first constructs composite latents by combining a compact semantic latent, which is extracted from a pretrained visual encoder via a dedicated Semantic VAE, with the texture latent. The core of SFD is to denoise the semantic and texture latents asynchronously using separate noise schedules: semantics precede textures by a temporal offset, providing clearer high-level guidance for texture refinement and enabling natural coarse-to-fine generation. On ImageNet 256x256 with guidance, SFD achieves FID 1.06 (LightningDiT-XL) and FID 1.04 (1.0B LightningDiT-XXL), while achieving up to 100x faster convergence than the original DiT. SFD also improves existing methods like ReDi and VA-VAE, demonstrating the effectiveness of asynchronous, semantics-led modeling. Project page and code: https://yuemingpan.github.io/SFD.github.io/.
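
The "semantics precede textures by a temporal offset" idea in the SFD entry can be pictured with a small scheduling sketch. Everything here is an assumption made for illustration: the function name, the linear progress measure (0.0 for pure noise, 1.0 for fully denoised), and the choice of 10 denoising steps with an offset of 3 are not taken from the paper, which may define its schedules differently.

```python
# Illustrative sketch of two denoising clocks sharing one step counter.
# Progress 0.0 means pure noise and 1.0 means fully denoised (assumed convention).

def async_progress(step, denoise_steps, offset_steps):
    """Return (semantic, texture) progress; texture starts `offset_steps` later."""
    semantic = min(step, denoise_steps) / denoise_steps
    texture = min(max(step - offset_steps, 0), denoise_steps) / denoise_steps
    return semantic, texture

DENOISE_STEPS = 10   # hypothetical number of steps per latent
OFFSET_STEPS = 3     # hypothetical lead of semantics over texture

schedule = [async_progress(s, DENOISE_STEPS, OFFSET_STEPS)
            for s in range(DENOISE_STEPS + OFFSET_STEPS + 1)]

for step, (sem, tex) in enumerate(schedule):
    print(step, sem, tex)
```

At every shared step the semantic latent is at least as clean as the texture latent, so texture refinement is always conditioned on semantics that are further along.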

[125] Virtually Unrolling the Herculaneum Papyri by Diffeomorphic Spiral Fitting

Paul Henderson

Main category: cs.CV

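TL;DR: Proposes a new fully automatic virtual unrolling method that globally fits an explicit parametric model to neural network predictions of the scroll position, reconstructing a continuous 2D surface from CT scans of severely damaged Herculaneum scrolls.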

Details

Motivation: The Herculaneum scrolls were carbonized by the eruption of Mount Vesuvius and are too fragile to unroll physically, so an efficient, automated virtual unrolling technique is needed to recover their text without damage. Method: A top-down approach globally fits an explicit parametric model of the scroll to neural network predictions of the papyrus position, ensuring that the resulting surface is a single continuous 2D sheet and giving plausible results even in regions where the surface is not visible in the CT scan. Result: Comprehensive experiments on high-resolution CT scans of two scrolls show that the method unrolls large regions and outperforms the only existing automated method suited to this data. Conclusion: This is the first technique that automatically fits an explicit surface model to CT scans of severely damaged scrolls, offering an efficient and reliable route to large-scale digital reading of the Herculaneum scrolls and similar artifacts. Abstract: The Herculaneum Papyri are a collection of rolled papyrus documents that were charred and buried by the famous eruption of Mount Vesuvius. They promise to contain a wealth of previously unseen Greek and Latin texts, but are extremely fragile and thus most cannot be unrolled physically. A solution to access these texts is virtual unrolling, where the papyrus surface is digitally traced out in a CT scan of the scroll, to create a flattened representation. This tracing is very laborious to do manually in gigavoxel-sized scans, so automated approaches are desirable. We present the first top-down method that automatically fits a surface model to a CT scan of a severely damaged scroll. We take a novel approach that globally fits an explicit parametric model of the deformed scroll to existing neural network predictions of where the rolled papyrus likely passes. Our method guarantees the resulting surface is a single continuous 2D sheet, even passing through regions where the surface is not detectable in the CT scan. We conduct comprehensive experiments on high-resolution CT scans of two scrolls, showing that our approach successfully unrolls large regions, and exceeds the performance of the only existing automated unrolling method suitable for this data.

[126] LiteVGGT: Boosting Vanilla VGGT via Geometry-aware Cached Token Merging

Zhijian Shu,Cheng Lin,Tao Xie,Wei Yin,Ben Li,Zhiyuan Pu,Weize Li,Yao Yao,Xun Cao,Xiaoyang Guo,Xiao-Xiao Long

Main category: cs.CV

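TL;DR: LiteVGGT proposes a geometry-aware cached token merging strategy that makes 3D vision foundation models far more efficient, delivering up to 10x speedup and substantial memory reduction and enabling efficient processing of 1000-image scenes.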

Details

Motivation: Existing 3D vision models such as VGGT incur heavy computation and memory costs on long sequences, which makes them hard to apply to large-scale scenes. Method: Based on the observations that tokens from local image regions are highly similar and that merge decisions can be reused across layers, the paper designs a geometry-aware cached token merging method that optimizes anchor token selection and reuses merge indices across layers. Result: The method achieves up to 10x speedup and substantial memory reduction while retaining VGGT's core performance, and supports efficient fine-tuning and FP8 quantization. Conclusion: With its efficient token merging strategy, LiteVGGT removes the efficiency bottleneck of 3D foundation models on large-scale scenes and shows good scalability and robustness. Abstract: 3D vision foundation models like Visual Geometry Grounded Transformer (VGGT) have advanced greatly in geometric perception. However, VGGT is time-consuming and memory-intensive for long sequences, limiting application to large-scale scenes beyond hundreds of images. To address this, we propose LiteVGGT, achieving up to 10x speedup and substantial memory reduction, enabling efficient processing of 1000-image scenes. We derive two key insights for 3D reconstruction: (1) tokens from local image regions have inherent geometric correlations, leading to high similarity and computational redundancy; (2) token similarity across adjacent network layers remains stable, allowing for reusable merge decisions. Guided by these, we design a simple yet efficient strategy, dubbed geometry-aware cached token merging. We analyze each token's geometric importance, optimizing anchor token selection to better preserve key information for reconstruction. We also cache and reuse merge indices across layers, substantially reducing latency with minimal accuracy impact. This strategy retains VGGT's core performance, enabling efficient fine-tuning and FP8 quantization for further gains. Extensive experiments validate LiteVGGT's effectiveness, scalability, and robustness. Project page: https://garlicba.github.io/LiteVGGT/

[127] Towards Adaptive Fusion of Multimodal Deep Networks for Human Action Recognition

Novanto Yudistira

Main category: cs.CV

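TL;DR: Proposes a new human action recognition method based on deep neural networks and adaptive multimodal fusion (RGB, optical flow, audio, depth), which uses gating mechanisms to fuse information selectively and improves accuracy and robustness.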

Details

Motivation: Traditional unimodal action recognition methods are limited and cannot fully characterize complex actions, so multimodal fusion is needed to improve performance. Method: Deep neural networks are combined with a gated multimodal fusion strategy that adaptively weights RGB, optical flow, audio, and depth information and selectively integrates the key features. Result: Accuracy improves on benchmark datasets for action recognition, violence detection, and self-supervised learning tasks, surpassing traditional unimodal methods. Conclusion: Effective multimodal fusion makes action recognition more accurate and robust, with potential applications in surveillance, human-computer interaction, and assisted living. Abstract: This study introduces a pioneering methodology for human action recognition by harnessing deep neural network techniques and adaptive fusion strategies across multiple modalities, including RGB, optical flows, audio, and depth information. Employing gating mechanisms for multimodal fusion, we aim to surpass limitations inherent in traditional unimodal recognition methods while exploring novel possibilities for diverse applications. Through an exhaustive investigation of gating mechanisms and adaptive weighting-based fusion architectures, our methodology enables the selective integration of relevant information from various modalities, thereby bolstering both accuracy and robustness in action recognition tasks. We meticulously examine various gated fusion strategies to pinpoint the most effective approach for multimodal action recognition, showcasing its superiority over conventional unimodal methods. Gating mechanisms facilitate the extraction of pivotal features, resulting in a more holistic representation of actions and substantial enhancements in recognition performance. Our evaluations across human action recognition, violence action detection, and multiple self-supervised learning tasks on benchmark datasets demonstrate promising advancements in accuracy. The significance of this research lies in its potential to revolutionize action recognition systems across diverse fields. The fusion of multimodal information promises sophisticated applications in surveillance and human-computer interaction, especially in contexts related to active assisted living.
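
One common form of gated fusion, a softmax over per-modality gate scores followed by a weighted sum of features, can be sketched as below. This is a generic illustration and not necessarily the variant the paper selects: the function names are hypothetical, the feature vectors are invented, and the gate scores are set by hand here, whereas in a trained model they would come from a learned layer.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gated_fusion(features, gate_scores):
    """Weight each modality's feature vector by its softmax gate and sum them."""
    names = list(features)
    weights = softmax([gate_scores[name] for name in names])
    dim = len(features[names[0]])
    fused = [sum(w * features[name][i] for w, name in zip(weights, names))
             for i in range(dim)]
    return fused, dict(zip(names, weights))

# Invented 2-dimensional features for the four modalities named in the abstract.
features = {
    "rgb": [1.0, 0.0],
    "flow": [0.0, 1.0],
    "audio": [0.5, 0.5],
    "depth": [0.0, 0.0],
}
# Hand-set gate scores (assumption); a real model would predict these per input.
gate_scores = {"rgb": 2.0, "flow": 1.0, "audio": 0.0, "depth": -1.0}

fused, weights = gated_fusion(features, gate_scores)
print(fused, weights)
```

Because the weights sum to one, a modality with a low gate score contributes little to the fused vector, which is the selective-integration behavior the abstract describes.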

[128] FASTer: Toward Efficient Autoregressive Vision Language Action Modeling via neural Action Tokenization

Yicheng Liu,Shiduo Zhang,Zibin Dong,Baijun Ye,Tianyuan Yuan,Xiaopeng Yu,Linqi Yin,Chenhao Lu,Junhao Shi,Luca Jiang-Tao Yu,Liangtao Zheng,Tao Jiang,Jingjing Gong,Xipeng Qiu,Hang Zhao

Main category: cs.CV

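TL;DR: This paper proposes the FASTer framework, which uses a learnable tokenizer, FASTerVQ, and a policy built on it, FASTerVLA, to improve the performance and generalization of vision-language-action models for robotic manipulation while keeping inference efficient.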

Details

Motivation: Existing autoregressive vision-language-action (VLA) models face a trade-off between reconstruction fidelity and inference efficiency in action tokenization, which limits their practical use. Method: Proposes the FASTer framework, in which FASTerVQ encodes action chunks as single-channel images to capture global spatio-temporal dependencies at a high compression ratio, and FASTerVLA builds on it with block-wise autoregressive decoding and a lightweight action expert for policy modeling. Result: Experiments show that FASTerVQ performs strongly in reconstruction quality, token utilization, and cross-task and cross-embodiment generalization; FASTerVLA further improves inference speed and task performance, surpassing existing state-of-the-art VLA models. Conclusion: The FASTer framework effectively addresses the efficiency-performance trade-off in action tokenization and offers a new direction for general and efficient robot learning. Abstract: Autoregressive vision-language-action (VLA) models have recently demonstrated strong capabilities in robotic manipulation. However, their core process of action tokenization often involves a trade-off between reconstruction fidelity and inference efficiency. We introduce FASTer, a unified framework for efficient and generalizable robot learning that integrates a learnable tokenizer with an autoregressive policy built upon it. FASTerVQ encodes action chunks as single-channel images, capturing global spatio-temporal dependencies while maintaining a high compression ratio. FASTerVLA builds on this tokenizer with block-wise autoregressive decoding and a lightweight action expert, achieving both faster inference and higher task performance. Extensive experiments across simulated and real-world benchmarks show that FASTerVQ delivers superior reconstruction quality, high token utilization, and strong cross-task and cross-embodiment generalization, while FASTerVLA further improves overall capability, surpassing previous state-of-the-art VLA models in both inference speed and task performance.

[129] GeoPE: A Unified Geometric Positional Embedding for Structured Tensors

Yupu Yao,Bowen Yang

Main category: cs.CV

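TL;DR: Proposes GeoPE, a quaternion-based 3D geometric positional embedding that builds a unified rotation operator by computing the geometric mean in the Lie algebra, effectively restoring the spatial manifold structure of 2D images and outperforming existing 2D RoPE methods.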

Details

Motivation: Standard Vision Transformers and RoPE flatten 2D images into 1D sequences, which breaks the spatial topology and causes spatially distant patches to be wrongly treated as sequence neighbors; existing 2D positional encoding methods fail to decouple false sequential proximity from true spatial distance. Method: Introduces the quaternion-based 3D Geometric Positional Embedding (GeoPE), which extends rotations to 3D Euclidean space and constructs a symmetric, unified rotation operator via the geometric mean in the Lie algebra, yielding an encoding that geometrically couples the spatial dimensions. Result: On image classification, object detection, and 3D semantic segmentation, GeoPE consistently outperforms existing 2D RoPE variants and significantly strengthens the model's shape bias, supporting its ability to model true geometric structure. Conclusion: With a geometrically better-founded positional encoding, GeoPE restores the 2D spatial manifold structure of images, resolves the misjudged spatial proximity of conventional methods, and improves Transformer performance on several vision tasks. Abstract: Standard Vision Transformers flatten 2D images into 1D sequences, disrupting the natural spatial topology. While Rotary Positional Embedding (RoPE) excels in 1D, it inherits this limitation, often treating spatially distant patches (e.g., at row edges) as sequence neighbors. Existing 2D approaches typically treat spatial axes independently, failing to decouple this false sequential proximity from true spatial distance. To restore the 2D spatial manifold, we introduce Geometric Positional Embedding (GeoPE), a framework that extends rotations to 3D Euclidean space using quaternions. To overcome non-commutativity and ensure symmetry, GeoPE constructs a unified rotational operator by computing the geometric mean in the Lie algebra. This creates a geometrically coupled encoding that effectively separates spatial dimensions. Extensive experiments on image classification, object detection, and 3D semantic segmentation demonstrate that GeoPE consistently outperforms existing 2D RoPE variants and significantly enhances shape bias, confirming its ability to capture true geometric structure.
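
The row-edge problem this abstract describes can be made concrete with a small calculation. This sketch only illustrates the motivation (row-major flattening of a patch grid); it does not implement GeoPE itself, and the 14x14 grid size is an assumed example value.

```python
import math


def flat_index(row, col, width):
    """Position of patch (row, col) after row-major flattening."""
    return row * width + col


def distances(p, q, width):
    """Return (1D sequence distance, 2D Euclidean distance) between patches."""
    seq = abs(flat_index(p[0], p[1], width) - flat_index(q[0], q[1], width))
    spatial = math.hypot(p[0] - q[0], p[1] - q[1])
    return seq, spatial


width = 14  # assumed patch grid width, e.g. a 14x14 grid

# Last patch of row 0 and first patch of row 1: adjacent in the sequence,
# but on opposite sides of the image.
edge_pair = distances((0, 13), (1, 0), width)

# Two vertically adjacent patches: true spatial neighbors,
# but a full row apart in the sequence.
vertical_pair = distances((0, 0), (1, 0), width)

print("row-edge pair (seq, spatial):", edge_pair)
print("vertical pair (seq, spatial):", vertical_pair)
```

The row-edge pair has sequence distance 1 but spatial distance of about 13, while the vertical pair has sequence distance 14 but spatial distance 1, which is the mismatch between sequential proximity and spatial distance that the abstract refers to.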

[130] Balanced Few-Shot Episodic Learning for Accurate Retinal Disease Diagnosis

Jasmaine Khale,Ravi Prakash Srivastava

Main category: cs.CV

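TL;DR: Proposes a balanced few-shot learning framework for disease diagnosis on the Retinal Fundus Multi-Disease Image Dataset (RFMiD); balanced episodic sampling, targeted augmentation, and a pretrained ResNet-50 encoder improve diagnosis of minority-class diseases under data-constrained conditions.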

Details

Motivation: Conventional deep learning needs large amounts of annotated data and performs poorly when disease categories are imbalanced, which limits its clinical reliability; retinal diseases such as diabetic retinopathy are increasingly common, so diagnostic models that generalize stably from few samples are needed. Method: Proposes a balanced few-shot episodic learning framework comprising: (i) a balanced episodic sampling strategy that ensures every class participates equally in each 5-way 5-shot task; (ii) targeted data augmentation, including CLAHE and color/geometric transformations, to increase minority-class diversity; and (iii) an ImageNet-pretrained ResNet-50 encoder, with prototype classification by cosine similarity in the embedding space. Result: Evaluated with 100 training episodes and 1,000 test episodes, the method substantially improves overall accuracy and reduces bias toward majority classes, with particular gains for rare diseases (e.g., optic disc edema, branch retinal vein occlusion). Conclusion: A dataset-aware few-shot pipeline combined with balanced sampling and CLAHE-enhanced preprocessing enables more robust and clinically fairer retinal disease diagnosis under data constraints. Abstract: Automated retinal disease diagnosis is vital given the rising prevalence of conditions such as diabetic retinopathy and macular degeneration. Conventional deep learning approaches require large annotated datasets, which are costly and often imbalanced across disease categories, limiting their reliability in practice. Few-shot learning (FSL) addresses this challenge by enabling models to generalize from only a few labeled samples per class. In this study, we propose a balanced few-shot episodic learning framework tailored to the Retinal Fundus Multi-Disease Image Dataset (RFMiD). Focusing on the ten most represented classes, which still show substantial imbalance between majority diseases (e.g., Diabetic Retinopathy, Macular Hole) and minority ones (e.g., Optic Disc Edema, Branch Retinal Vein Occlusion), our method integrates three key components: (i) balanced episodic sampling, ensuring equal participation of all classes in each 5-way 5-shot episode; (ii) targeted augmentation, including Contrast Limited Adaptive Histogram Equalization (CLAHE) and color/geometry transformations, to improve minority-class diversity; and (iii) a ResNet-50 encoder pretrained on ImageNet, selected for its superior ability to capture fine-grained retinal features. Prototypes are computed in the embedding space and classification is performed with cosine similarity for improved stability.
Trained on 100 episodes and evaluated on 1,000 test episodes, our framework achieves substantial accuracy gains and reduces bias toward majority classes, with notable improvements for underrepresented diseases. These results demonstrate that dataset-aware few-shot pipelines, combined with balanced sampling and CLAHE-enhanced preprocessing, can deliver more robust and clinically fair retinal disease diagnosis under data-constrained conditions.
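
The episode construction and prototype classification described above can be sketched as follows. This is a minimal illustration under assumptions: the function names are hypothetical, embeddings are plain Python lists standing in for ResNet-50 features, one query per class is an arbitrary choice, and how the paper handles classes with too few images is not stated in the abstract.

```python
import math
import random


def sample_balanced_episode(indices_by_class, n_way=5, k_shot=5, n_query=1,
                            rng=random):
    """Pick n_way classes, then the same number of samples from each.

    indices_by_class: dict class label -> list of sample indices.
    Assumes every class has at least k_shot + n_query samples.
    """
    classes = rng.sample(sorted(indices_by_class), n_way)
    support, query = {}, {}
    for c in classes:
        picked = rng.sample(indices_by_class[c], k_shot + n_query)
        support[c] = picked[:k_shot]   # used to build the class prototype
        query[c] = picked[k_shot:]     # held out for evaluation
    return support, query


def prototype(embeddings):
    """Class prototype: element-wise mean of the support embeddings."""
    dim = len(embeddings[0])
    return [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def classify(query_embedding, prototypes):
    """Assign the query to the class whose prototype is most similar."""
    return max(prototypes, key=lambda c: cosine(query_embedding, prototypes[c]))
```

Because every class contributes exactly `k_shot` support samples per episode, a majority class cannot dominate an episode regardless of how many images it has in the full dataset.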

[131] Rethinking the Use of Vision Transformers for AI-Generated Image Detection

NaHyeon Park,Kunhee Kim,Junsuk Choe,Hyunjung Shim

Main category: cs.CV

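TL;DR: This paper proposes MoLD, a new method that dynamically fuses features from multiple CLIP-ViT layers to improve AI-generated image detection; experiments show better generalization and robustness across generative models and real-world scenarios.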

Details

Motivation: Existing AI-generated image detectors mainly use final-layer CLIP-ViT features and overlook the more localized, more generalizable information other layers may provide, so the contribution of each layer's features needs to be analyzed systematically and exploited. Method: Proposes MoLD, which uses an adaptive gating-based strategy to dynamically integrate features from multiple layers of the ViT model and strengthen detection. Result: Experiments show that MoLD significantly improves detection on both GAN- and diffusion-generated images, generalizes well across models, is robust in real-world scenarios, and extends to other pretrained ViTs such as DINOv2. Conclusion: Fusing multi-layer features outperforms using final-layer features alone; MoLD's adaptive integration strategy improves the overall performance and applicability of AI-generated image detection. Abstract: Rich feature representations derived from CLIP-ViT have been widely utilized in AI-generated image detection. While most existing methods primarily leverage features from the final layer, we systematically analyze the contributions of layer-wise features to this task. Our study reveals that earlier layers provide more localized and generalizable features, often surpassing the performance of final-layer features in detection tasks. Moreover, we find that different layers capture distinct aspects of the data, each contributing uniquely to AI-generated image detection. Motivated by these findings, we introduce a novel adaptive method, termed MoLD, which dynamically integrates features from multiple ViT layers using a gating-based mechanism. Extensive experiments on both GAN- and diffusion-generated images demonstrate that MoLD significantly improves detection performance, enhances generalization across diverse generative models, and exhibits robustness in real-world scenarios. Finally, we illustrate the scalability and versatility of our approach by successfully applying it to other pre-trained ViTs, such as DINOv2.

[132] Stable Single-Pixel Contrastive Learning for Semantic and Geometric Tasks

Leonid Pogorelyuk,Niels Bracher,Aaron Verkleeren,Lars Kühmichel,Stefan T. Radev

Main category: cs.CV

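TL;DR: Proposes a stable contrastive loss for learning pixel-level representations that capture both semantic and geometric information, enabling precise point correspondence across images without momentum-based teacher-student training.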

Details

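Motivation: To remove the reliance on momentum mechanisms that existing methods have when learning pixel-level representations, and to model semantic and geometric information jointly more effectively. Method: Designs a family of stable contrastive loss functions that map each pixel to an overcomplete descriptor that is view-invariant and semantically meaningful. Result: Validated in synthetic 2D and 3D environments, demonstrating strong point-correspondence matching. Conclusion: The method learns pixel-level representations that are both semantically meaningful and geometrically consistent, without a complex teacher-student architecture. Abstract: We pilot a family of stable contrastive losses for learning pixel-level representations that jointly capture semantic and geometric information. Our approach maps each pixel of an image to an overcomplete descriptor that is both view-invariant and semantically meaningful. It enables precise point-correspondence across images without requiring momentum-based teacher-student training. Two experiments in synthetic 2D and 3D environments demonstrate the properties of our loss and the resulting overcomplete representations.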

NaHyeon Park,Namin An,Kunhee Kim,Soyeon Yoon,Jiahao Huo,Hyunjung Shim

Main category: cs.CV

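TL;DR: This paper studies social bias in text-to-image generation systems based on large vision-language models (LVLMs), finds that they produce more pronounced bias than non-LVLM models, and identifies system prompts as the primary driver. The authors propose FairPro, a training-free meta-prompting framework that self-audits at test time and constructs fairness-aware system prompts, reducing bias while preserving text-image alignment.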

Details

Motivation: To investigate whether LVLM-based text-to-image models amplify social bias, understand where the bias comes from, and improve the social fairness of generation systems. Method: Builds a benchmark of 1,024 prompts spanning four levels of linguistic complexity and systematically evaluates demographic bias across multiple attributes; uses decoded intermediate representations, token-probability diagnostics, and embedding-association analyses to reveal how system prompts encode and propagate bias; proposes the FairPro framework to generate fairness-aware system prompts dynamically. Result: LVLM-based models generate more socially biased images than non-LVLM models; system prompts are identified as the primary driver of bias propagation; FairPro substantially reduces demographic bias on two models, SANA and Qwen-Image, while preserving text-image consistency. Conclusion: System prompts play a central role in bias propagation in LVLM-based image generation, and FairPro offers a practical, deployable way to build fairer text-to-image systems. Abstract: Large vision-language model (LVLM) based text-to-image (T2I) systems have become the dominant paradigm in image generation, yet whether they amplify social biases remains insufficiently understood. In this paper, we show that LVLM-based models produce markedly more socially biased images than non-LVLM-based models. We introduce a 1,024 prompt benchmark spanning four levels of linguistic complexity and evaluate demographic bias across multiple attributes in a systematic manner. Our analysis identifies system prompts, the predefined instructions guiding LVLMs, as a primary driver of biased behavior. Through decoded intermediate representations, token-probability diagnostics, and embedding-association analyses, we reveal how system prompts encode demographic priors that propagate into image synthesis. To this end, we propose FairPro, a training-free meta-prompting framework that enables LVLMs to self-audit and construct fairness-aware system prompts at test time. Experiments on two LVLM-based T2I models, SANA and Qwen-Image, show that FairPro substantially reduces demographic bias while preserving text-image alignment. We believe our findings provide deeper insight into the central role of system prompts in bias propagation and offer a practical, deployable approach for building more socially responsible T2I systems.

[134] A dynamic memory assignment strategy for dilation-based ICP algorithm on embedded GPUs

Qiong Chang,Weimin Wang,Junpei Zhong,Jun Miyazaki

Main category: cs.CV

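TL;DR: Proposes a memory-efficient optimization strategy that makes the high-performance point cloud registration algorithm VANICP lightweight enough to run on resource-constrained embedded GPUs.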

Details

Motivation: Although the original VANICP algorithm improves computational efficiency, its high memory footprint makes it hard to deploy on resource-constrained systems such as embedded devices. Method: Proposes a GPU-oriented dynamic memory assignment strategy that optimizes the memory usage of the dilation operation, and builds an enhanced VANICP framework on it. Result: Memory consumption falls by over 97% while the original performance is preserved. Conclusion: The optimization strategy substantially reduces VANICP's memory requirements, making it suitable for low-resource environments such as embedded GPUs. Abstract: This paper proposes a memory-efficient optimization strategy for the high-performance point cloud registration algorithm VANICP, enabling lightweight execution on embedded GPUs with constrained hardware resources. VANICP is a recently published acceleration framework that significantly improves the computational efficiency of point-cloud-based applications. By transforming the global nearest neighbor search (NNS) into a localized process through a dilation-based information propagation mechanism, VANICP greatly reduces the computational complexity of the NNS. However, its original implementation demands a considerable amount of memory, which restricts its deployment in resource-constrained environments such as embedded systems. To address this issue, we propose a GPU-oriented dynamic memory assignment strategy that optimizes the memory usage of the dilation operation. Furthermore, based on this strategy, we construct an enhanced version of the VANICP framework that achieves over 97% reduction in memory consumption while preserving the original performance. Source code is available at: https://github.com/changqiong/VANICP4Em.git.

[135] Reflection Removal through Efficient Adaptation of Diffusion Transformers

Daniyar Zakarin,Thiemo Wandel,Anton Obukhov,Dengxin Dai

Main category: cs.CV

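TL;DR: This paper proposes a single-image reflection removal framework based on a diffusion transformer (DiT) that uses a pretrained foundation model together with physically based synthetic data, achieving state-of-the-art performance on in-domain and zero-shot benchmarks.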

Details

Motivation: Existing methods rely on task-specific architectures and lack high-quality, diverse training data, which limits the generalization and restoration quality of reflection removal models. Method: Uses a pretrained DiT foundation model, conditioned on reflection-contaminated input images and guided to output clean transmission layers; builds a physically based rendering pipeline around the Principled BSDF in Blender to generate realistic reflection data; and fine-tunes efficiently with LoRA. Result: Reaches state-of-the-art results on multiple benchmarks, with especially strong zero-shot performance, supporting the method's generalization and robustness. Conclusion: Pretrained diffusion transformers combined with physically grounded data synthesis and efficient adaptation provide a scalable, high-fidelity solution for reflection removal. Abstract: We introduce a diffusion-transformer (DiT) framework for single-image reflection removal that leverages the generalization strengths of foundation diffusion models in the restoration setting. Rather than relying on task-specific architectures, we repurpose a pre-trained DiT-based foundation model by conditioning it on reflection-contaminated inputs and guiding it toward clean transmission layers. We systematically analyze existing reflection removal data sources for diversity, scalability, and photorealism. To address the shortage of suitable data, we construct a physically based rendering (PBR) pipeline in Blender, built around the Principled BSDF, to synthesize realistic glass materials and reflection effects. Efficient LoRA-based adaptation of the foundation model, combined with the proposed synthetic data, achieves state-of-the-art performance on in-domain and zero-shot benchmarks. These results demonstrate that pretrained diffusion transformers, when paired with physically grounded data synthesis and efficient adaptation, offer a scalable and high-fidelity solution for reflection removal. Project page: https://hf.co/spaces/huawei-bayerlab/windowseat-reflection-removal-web
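
For readers unfamiliar with LoRA, the adaptation step mentioned above can be sketched generically. This shows the standard low-rank update (frozen weight W plus a scaled product of two small matrices B and A) in plain Python; it is not the authors' training code, and the function names, matrix sizes, and `alpha` value are illustrative assumptions.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha=1.0):
    """Compute W x + (alpha / r) * B (A x).

    W: frozen pretrained weight, shape d_out x d_in
    A: trainable down-projection, shape r x d_in
    B: trainable up-projection, shape d_out x r (commonly initialized to zero,
       so the adapted layer starts out identical to the pretrained one)
    """
    r = len(A)                      # LoRA rank
    base = matvec(W, x)             # frozen path
    delta = matvec(B, matvec(A, x)) # low-rank trainable path
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]


def lora_param_count(d_out, d_in, r):
    """Trainable parameters in A and B, versus d_out * d_in for full tuning."""
    return r * (d_in + d_out)
```

For a hypothetical 1024 x 1024 layer, full fine-tuning updates 1,048,576 weights, while rank-8 LoRA trains `lora_param_count(1024, 1024, 8)` = 16,384, which is why this kind of adaptation is described as efficient.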

[136] Self-Supervised Learning for Transparent Object Depth Completion Using Depth from Non-Transparent Objects

Xianghui Fan,Zhaoyu Chen,Mengyang Pan,Anping Deng,Hang Yang

Main category: cs.CV

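TL;DR: Proposes a self-supervised method for transparent object depth completion that simulates the depth deficits of transparent objects in non-transparent regions and uses the original depth map as the supervision signal, achieving performance comparable to supervised methods.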

Details

Motivation: Conventional depth sensors struggle to sense the depth of transparent objects because of light refraction and reflection, and existing methods depend on large amounts of annotated data, which is costly to label. Method: Proposes a new self-supervised training method for depth completion networks that simulates the depth deficits of transparent objects in non-transparent regions and uses the original depth map as ground truth for supervision. Result: Experiments show performance comparable to supervised methods, and pre-training with this method improves model performance when training samples are scarce. Conclusion: The self-supervised method reduces the dependence on annotated data and shows good potential for transparent object depth completion. Abstract: The perception of transparent objects is one of the well-known challenges in computer vision. Conventional depth sensors have difficulty in sensing the depth of transparent objects due to refraction and reflection of light. Previous research has typically trained a neural network to complete the depth acquired by the sensor, an approach that can quickly produce accurate depth maps of transparent objects. However, previous training relies on a large amount of annotation data for supervision, and the labeling of depth maps is costly. To tackle this challenge, we propose a new self-supervised method for training depth completion networks. Our method simulates the depth deficits of transparent objects within non-transparent regions and utilizes the original depth map as ground truth for supervision. Experiments demonstrate that our method achieves performance comparable to supervised approaches, and pre-training with our method can improve model performance when training samples are scarce.
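
The self-supervision signal described above can be sketched in a few lines. The abstract does not say how depth deficits are simulated, so this sketch assumes the simplest possible stand-in: a rectangular hole filled with a missing-depth value of zero, with an L1 loss computed only inside the hole. The function names and the rectangular mask are hypothetical.

```python
def simulate_depth_deficit(depth, top, left, height, width, missing_value=0.0):
    """Blank out a rectangle of a valid depth map to mimic sensor failure.

    depth: 2D list of floats from a non-transparent region (trusted values).
    Returns (corrupted, mask); the untouched `depth` serves as ground truth.
    """
    corrupted = [row[:] for row in depth]           # copy, keep original intact
    mask = [[False] * len(row) for row in depth]
    for r in range(top, min(top + height, len(depth))):
        for c in range(left, min(left + width, len(depth[r]))):
            corrupted[r][c] = missing_value
            mask[r][c] = True
    return corrupted, mask


def masked_l1(pred, target, mask):
    """Mean absolute error over the simulated-deficit pixels only."""
    errors = [
        abs(p - t)
        for pred_row, target_row, mask_row in zip(pred, target, mask)
        for p, t, m in zip(pred_row, target_row, mask_row)
        if m
    ]
    return sum(errors) / len(errors) if errors else 0.0
```

In training, a completion network would receive `corrupted` as input and be penalized with `masked_l1(network_output, depth, mask)`, so no manually labeled depth is needed.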

[137] Generative Neural Video Compression via Video Diffusion Prior

Qi Mao,Hao Cheng,Tinghan Yang,Libiao Jin,Siwei Ma

Main category: cs.CV

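TL;DR: GNVC-VD is the first DiT-based generative neural video compression framework; it uses a video diffusion transformer to unify spatio-temporal latent compression with sequence-level generative refinement, significantly improving perceptual quality and suppressing flickering artifacts.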

Details

Motivation: Existing perceptual video compression methods based on image generative priors lack temporal modeling, leading to inter-frame inconsistency and perceptual flickering, and fall short of the needs of high-quality video compression. Method: Proposes the GNVC-VD framework, which uses a unified flow-matching latent refinement module that applies a video diffusion transformer to perform sequence-level denoising of decoded spatio-temporal latents, and injects compression-aware cues into intermediate layers through a conditioning adaptor to adapt to compression degradation while maintaining temporal coherence. Result: Experiments show that GNVC-VD achieves better perceptual quality than both traditional and learned codecs and substantially reduces flickering artifacts even at extreme bitrates below 0.01 bpp. Conclusion: Integrating video-native generative priors into neural codecs improves perceptual video compression, and GNVC-VD points to a new direction for next-generation efficient video compression. Abstract: We present GNVC-VD, the first DiT-based generative neural video compression framework built upon an advanced video generation foundation model, where spatio-temporal latent compression and sequence-level generative refinement are unified within a single codec. Existing perceptual codecs primarily rely on pre-trained image generative priors to restore high-frequency details, but their frame-wise nature lacks temporal modeling and inevitably leads to perceptual flickering. To address this, GNVC-VD introduces a unified flow-matching latent refinement module that leverages a video diffusion transformer to jointly enhance intra- and inter-frame latents through sequence-level denoising, ensuring consistent spatio-temporal details. Instead of denoising from pure Gaussian noise as in video generation, GNVC-VD initializes refinement from decoded spatio-temporal latents and learns a correction term that adapts the diffusion prior to compression-induced degradation. A conditioning adaptor further injects compression-aware cues into intermediate DiT layers, enabling effective artifact removal while maintaining temporal coherence under extreme bitrate constraints. Extensive experiments show that GNVC-VD surpasses both traditional and learned codecs in perceptual quality and significantly reduces the flickering artifacts that persist in prior generative approaches, even below 0.01 bpp, highlighting the promise of integrating video-native generative priors into neural codecs for next-generation perceptual video compression.

[138] HTR-ConvText: Leveraging Convolution and Textual Information for Handwritten Text Recognition

Pham Thach Thanh Truc,Dang Hoai Nam,Huynh Tong Dang Khoa,Vo Nguyen Le Duy

Main category: cs.CV

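TL;DR: Proposes HTR-ConvText, a model that combines a CNN with MobileViT to extract local and global features, improving handwritten text recognition when data is scarce and writing styles vary widely.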

Details

Motivation: Existing handwritten text recognition methods generalize poorly with limited data, diverse writing styles, and complex diacritics, and depend on large amounts of synthetic data. Method: Designs the HTR-ConvText model: a residual CNN and a MobileViT with positional encoding extract features; a ConvText encoder fuses local and global information and shortens the sequence through a hierarchical structure; and an auxiliary module adds textual context to compensate for the weaknesses of the CTC loss. Result: Evaluated on the IAM, READ2016, LAM, and HANDS-VNOnDB datasets, the method outperforms existing methods in recognition accuracy and generalization, especially with few training samples and high handwriting variation. Conclusion: By fusing fine-grained local features with global context, HTR-ConvText significantly improves the performance and robustness of handwritten text recognition while reducing dependence on synthetic data. Abstract: Handwritten Text Recognition remains challenging due to limited data, high writing style variance, and scripts with complex diacritics. Existing approaches, though they partially address these issues, often struggle to generalize without massive synthetic data. To address these challenges, we propose HTR-ConvText, a model designed to capture fine-grained, stroke-level local features while preserving global contextual dependencies. In the feature extraction stage, we integrate a residual Convolutional Neural Network backbone with a MobileViT with Positional Encoding block. This enables the model to both capture structural patterns and learn subtle writing details. We then introduce the ConvText encoder, a hybrid architecture combining global context and local features within a hierarchical structure that reduces sequence length for improved efficiency. Additionally, an auxiliary module injects textual context to mitigate the weakness of Connectionist Temporal Classification. Evaluations on IAM, READ2016, LAM and HANDS-VNOnDB demonstrate that our approach achieves improved performance and better generalization compared to existing methods, especially in scenarios with limited training samples and high handwriting diversity.

[139] RAMEN: Resolution-Adjustable Multimodal Encoder for Earth Observation

Nicolas Houdré,Diego Marcos,Hugo Riffaud de Turckheim,Dino Ienco,Laurent Wendling,Camille Kurtz,Sylvain Lobry

Main category: cs.CV

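TL;DR: This paper proposes RAMEN, a resolution-adjustable multimodal encoder that learns a shared visual representation of Earth observation data in a sensor-agnostic manner and outperforms existing models on multi-sensor, multi-resolution tasks.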

Details

Motivation: Existing multimodal foundation models usually require fixed input resolutions or rely on sensor-specific encoders, which makes them hard to generalize to heterogeneous Earth observation (EO) modalities and limits cross-modal learning. Method: Proposes RAMEN, a unified transformer encoder that takes modality and spatial and temporal resolution as input features, is pretrained by masked reconstruction of multi-source EO data to learn sensor-agnostic shared representations, and treats spatial resolution as an adjustable output parameter. Result: On the PANGAEA benchmark, RAMEN outperforms larger state-of-the-art models on a range of multi-sensor, multi-resolution downstream tasks and transfers effectively to both known and unseen sensor configurations. Conclusion: RAMEN enables efficient, flexible, and scalable multimodal learning on heterogeneous Earth observation data and allows spatial precision to be traded against computational cost at inference. Abstract: Earth observation (EO) data spans a wide range of spatial, spectral, and temporal resolutions, from high-resolution optical imagery to low-resolution multispectral products or radar time series. While recent foundation models have improved multimodal integration for learning meaningful representations, they often expect fixed input resolutions or are based on sensor-specific encoders, limiting generalization across heterogeneous EO modalities. To overcome these limitations, we introduce RAMEN, a resolution-adjustable multimodal encoder that learns a shared visual representation across EO data in a fully sensor-agnostic manner. RAMEN treats the modality and spatial and temporal resolutions as key input data features, enabling coherent analysis across modalities within a unified latent space. Its main methodological contribution is to define spatial resolution as a controllable output parameter, giving users direct control over the desired level of detail at inference and allowing explicit trade-offs between spatial precision and computational cost. We train a single, unified transformer encoder reconstructing masked multimodal EO data drawn from diverse sources, ensuring generalization across sensors and resolutions. Once pretrained, RAMEN transfers effectively to both known and unseen sensor configurations and outperforms larger state-of-the-art models on the community-standard PANGAEA benchmark, containing various multi-sensor and multi-resolution downstream tasks. Our code and pretrained model are available at https://github.com/nicolashoudre/RAMEN.

[140] Semantic-Guided Two-Stage GAN for Face Inpainting with Hybrid Perceptual Encoding

Abhigyan Bhattacharya,Hiranmoy Roy

Main category: cs.CV

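TL;DR: This paper proposes a semantic-guided hierarchical generative architecture that addresses the structural inconsistency and blurry textures caused by large occluded regions in face inpainting; by combining a CNN with a Vision Transformer and a multi-modal texture generator, it achieves better inpainting results than existing methods on the CelebA-HQ and FFHQ datasets.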

Details

Motivation: With large irregular masks, existing methods often produce semantic inconsistencies, distorted structures, or blurry edges because they synthesize pixels directly and make limited use of facial priors, and they struggle to balance identity preservation with visual realism. Method: Proposes a semantic-guided hierarchical inpainting architecture: the first stage fuses a CNN (local features) with a Vision Transformer (global features) to generate a clear semantic layout; the second stage uses a Multi-Modal Texture Generator that draws on multi-scale information to refine textures and adapts to arbitrarily shaped masked regions through dynamic attention. Result: Experiments on CelebA-HQ and FFHQ show that the method outperforms existing state-of-the-art methods on metrics such as LPIPS, PSNR, and SSIM, with better semantic preservation and visual quality especially on large-area inpainting. Conclusion: The proposed hierarchical semantic-guided generative framework improves face inpainting quality, better preserving plausible facial structure and realistic texture under complex occlusion, and generalizes to arbitrary mask configurations without mask-specific training. Abstract: Facial image inpainting aims to restore missing or corrupted regions in face images while preserving identity, structural consistency, and photorealistic image quality, a task aimed specifically at photo restoration. Despite many recent advances in deep generative models, existing methods struggle with large irregular masks, often producing blurry textures at the edges of the masked region, semantic inconsistencies, or unconvincing facial structures, owing to direct pixel-level synthesis and limited exploitation of facial priors. In this paper we propose a novel architecture that addresses these challenges through semantic-guided hierarchical synthesis. Our approach first organizes and synthesizes information based on meaning, then refines the texture. This process gives clear insight into the facial structure before detailed images are created. In the first stage, we blend two techniques: CNNs, which focus on local features, and Vision Transformers, which capture global features. This helps us create clear and detailed semantic layouts. In the second stage, we use a Multi-Modal Texture Generator to refine these layouts by pulling in information from different scales, ensuring everything looks cohesive and consistent. The architecture naturally handles arbitrary mask configurations through dynamic attention without mask-specific training.
Experiments on two datasets, CelebA-HQ and FFHQ, show that our model outperforms other state-of-the-art methods, with improvements in metrics such as LPIPS, PSNR, and SSIM. It produces visually striking results with better semantic preservation in challenging large-area inpainting situations.

[141] Joint 3D Geometry Reconstruction and Motion Generation for 4D Synthesis from a Single Image

Yanran Zhang,Ziyi Wang,Wenzhao Zheng,Zheng Zhu,Jie Zhou,Jiwen Lu

Main category: cs.CV

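TL;DR: Proposes the MoRe4D framework, which jointly generates geometrically consistent and motion-plausible 4D point trajectories to produce high-quality, multi-view-consistent dynamic 4D scenes from a single image.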

Details

Motivation: Existing methods separate geometry from motion, causing spatiotemporal inconsistencies and poor generalization, and high-quality 4D scene data is scarce. Method: Builds the large-scale TrajScene-60K dataset; proposes the diffusion model 4D-STraG to generate trajectories jointly, combined with depth-guided motion normalization and a motion-aware module; and designs the 4D-ViSM module for video rendering. Result: From a single input image, generates high-quality 4D scene videos with multi-view consistency and rich dynamic details. Conclusion: By modeling geometry and motion jointly, MoRe4D significantly improves the quality and consistency of 4D dynamic scenes generated from a single image. Abstract: Generating interactive and dynamic 4D scenes from a single static image remains a core challenge. Most existing generate-then-reconstruct and reconstruct-then-generate methods decouple geometry from motion, causing spatiotemporal inconsistencies and poor generalization. To address these issues, we extend the reconstruct-then-generate framework to jointly perform Motion generation and geometric Reconstruction for 4D Synthesis (MoRe4D). We first introduce TrajScene-60K, a large-scale dataset of 60,000 video samples with dense point trajectories, addressing the scarcity of high-quality 4D scene data. Based on this, we propose a diffusion-based 4D Scene Trajectory Generator (4D-STraG) to jointly generate geometrically consistent and motion-plausible 4D point trajectories. To leverage single-view priors, we design a depth-guided motion normalization strategy and a motion-aware module for effective geometry and dynamics integration. We then propose a 4D View Synthesis Module (4D-ViSM) to render videos with arbitrary camera trajectories from 4D point track representations. Experiments show that MoRe4D generates high-quality 4D scenes with multi-view consistency and rich dynamic details from a single image. Code: https://github.com/Zhangyr2022/MoRe4D.

[142] 4DLangVGGT: 4D Language-Visual Geometry Grounded Transformer

Xianfeng Wu,Yajing Bai,Minghan Li,Xianzu Wu,Xueqi Zhao,Zhongyuan Lai,Wenyu Liu,Xinggang Wang

Main category: cs.CV

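TL;DR: Proposes 4DLangVGGT, the first Transformer-based feed-forward unified framework for 4D language grounding, enabling efficient cross-scene training and strong generalization.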

Details

Motivation: Existing methods for building 4D semantic fields rely on scene-specific Gaussian splatting, which requires per-scene optimization, generalizes poorly, and is hard to scale. Method: Designs 4DLangVGGT, comprising a 4D Visual Geometry Transformer (StreamVGGT) that captures spatio-temporal geometric features of dynamic scenes and a Semantic Bridging Decoder (SBD) that maps geometry-aware features into a language-aligned semantic space. Result: Experiments on the HyperNeRF and Neu3D datasets show gains of up to 2% under per-scene training and 1% under multi-scene training, along with good generalization. Conclusion: 4DLangVGGT enables efficient deployment without per-scene optimization and establishes a new paradigm for open-vocabulary 4D scene understanding. Abstract: Constructing 4D language fields is crucial for embodied AI, augmented/virtual reality, and 4D scene understanding, as they provide enriched semantic representations of dynamic environments and enable open-vocabulary querying in complex scenarios. However, existing approaches to 4D semantic field construction primarily rely on scene-specific Gaussian splatting, which requires per-scene optimization, exhibits limited generalization, and is difficult to scale to real-world applications. To address these limitations, we propose 4DLangVGGT, the first Transformer-based feed-forward unified framework for 4D language grounding, that jointly integrates geometric perception and language alignment within a single architecture. 4DLangVGGT has two key components: the 4D Visual Geometry Transformer, StreamVGGT, which captures spatio-temporal geometric representations of dynamic scenes; and the Semantic Bridging Decoder (SBD), which projects geometry-aware features into a language-aligned semantic space, thereby enhancing semantic interpretability while preserving structural fidelity. Unlike prior methods that depend on costly per-scene optimization, 4DLangVGGT can be jointly trained across multiple dynamic scenes and directly applied during inference, achieving both deployment efficiency and strong generalization. This design significantly improves the practicality of large-scale deployment and establishes a new paradigm for open-vocabulary 4D scene understanding.
Experiments on HyperNeRF and Neu3D datasets demonstrate that our approach not only generalizes effectively but also achieves state-of-the-art performance, with gains of up to 2% under per-scene training and 1% under multi-scene training. Our code is released at https://github.com/hustvl/4DLangVGGT

[143] BulletTime: Decoupled Control of Time and Camera Pose for Video Generation

Yiming Wang,Qihang Zhang,Shengqu Cai,Tong Wu,Jan Ackermann,Zhengfei Kuang,Yang Zheng,Frano Rajič,Siyu Tang,Gordon Wetzstein

Main category: cs.CV

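TL;DR: Proposes a 4D-controllable video diffusion framework that explicitly decouples scene dynamics from camera pose, enabling fine-grained control of both scene dynamics and camera viewpoint.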

Details

Motivation: Existing video diffusion models couple scene dynamics with camera motion, which limits the precision of spatial and temporal control. Method: Introduces a 4D positional encoding and adaptive normalizations, conditions on world-time sequences and camera trajectories to decouple scene dynamics from camera pose, and trains on a dataset in which the two are parameterized independently. Result: Achieves robust real-world 4D control across diverse timing patterns and camera trajectories, with high generation quality and better controllability than prior methods. Conclusion: The framework improves spatial and temporal control in video generation and advances controllable video generation. Abstract: Emerging video diffusion models achieve high visual fidelity but fundamentally couple scene dynamics with camera motion, limiting their ability to provide precise spatial and temporal control. We introduce a 4D-controllable video diffusion framework that explicitly decouples scene dynamics from camera pose, enabling fine-grained manipulation of both scene dynamics and camera viewpoint. Our framework takes continuous world-time sequences and camera trajectories as conditioning inputs, injecting them into the video diffusion model through a 4D positional encoding in the attention layer and adaptive normalizations for feature modulation. To train this model, we curate a unique dataset in which temporal and camera variations are independently parameterized; this dataset will be made public. Experiments show that our model achieves robust real-world 4D control across diverse timing patterns and camera trajectories, while preserving high generation quality and outperforming prior work in controllability. See our website for video results: https://19reborn.github.io/Bullet4D/

[144] Object Reconstruction under Occlusion with Generative Priors and Contact-induced Constraints

Minghan Zhu,Zhiyi Wang,Qihang Sun,Maani Ghaffari,Michael Posa

Main category: cs.CV

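TL;DR: This paper proposes a contact-guided 3D generation method that combines generative models with contact information to improve object geometry reconstruction for robot manipulation.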

Details

Motivation: Cameras capture only partial observations of objects, especially under occlusion, so object reconstruction is highly ambiguous and extra information is needed to reduce the uncertainty of visual signals. Method: Uses generative models to learn priors over common object shapes, uses contact information from videos and physical interactions as sparse constraints on the geometry's boundary, and fuses the two in a contact-guided 3D generation framework inspired by drag-based editing. Result: Experiments on synthetic and real-world data show improved reconstruction accuracy over pure 3D generation and contact-based optimization. Conclusion: Combining generative priors with contact constraints improves object geometry reconstruction from partial observations and offers more reliable geometric perception for robot manipulation. Abstract: Object geometry is key information for robot manipulation. Yet, object reconstruction is a challenging task because cameras only capture partial observations of objects, especially when occlusion occurs. In this paper, we leverage two extra sources of information to reduce the ambiguity of vision signals. First, generative models learn priors of the shapes of commonly seen objects, allowing us to make reasonable guesses of the unseen part of geometry. Second, contact information, which can be obtained from videos and physical interactions, provides sparse constraints on the boundary of the geometry. We combine the two sources of information through contact-guided 3D generation. The guidance formulation is inspired by drag-based editing in generative models. Experiments on synthetic and real-world data show that our approach improves the reconstruction compared to pure 3D generation and contact-based optimization.

[145] Deep Forcing: Training-Free Long Video Generation with Deep Sink and Participative Compression

Jung Yi,Wooseok Jang,Paul Hyunbin Cho,Jisu Nam,Heeji Yoon,Seungryong Kim

Main category: cs.CV

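TL;DR: This paper proposes Deep Forcing, a training-free KV cache management method whose Deep Sink and Participative Compression mechanisms significantly improve the temporal consistency and dynamics of autoregressive video diffusion models in long-sequence generation, achieving over 12x length extrapolation while maintaining real-time generation.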

Details

Motivation: Existing video diffusion models suffer from temporal repetition, drift, and motion deceleration in real-time streaming generation, and directly applying StreamingLLM-style attention sinks degrades image quality and stalls motion. Method: Proposes Deep Forcing, comprising two training-free mechanisms: 1) Deep Sink dedicates half of the sliding window to persistent sink tokens and re-aligns their temporal RoPE phase to stabilize long-range context; 2) Participative Compression prunes the KV cache by importance, keeping the key tokens that participated in recent attention and discarding redundant history. Result: Achieves over 12x extrapolation (e.g., trained on 5 s, generating 60 s or more), with better imaging quality than LongLive, better aesthetic quality than RollingForcing, well-maintained overall consistency, and substantially higher dynamic degree, all while sustaining real-time generation. Conclusion: Training-free KV cache management can match or even exceed training-based methods in long-video streaming generation, offering a new direction for efficient long-sequence generation. Abstract: Recent advances in autoregressive video diffusion have enabled real-time frame streaming, yet existing solutions still suffer from temporal repetition, drift, and motion deceleration. We find that naively applying StreamingLLM-style attention sinks to video diffusion leads to fidelity degradation and motion stagnation. To overcome this, we introduce Deep Forcing, which consists of two training-free mechanisms that address this without any fine-tuning. Specifically, 1) Deep Sink dedicates half of the sliding window to persistent sink tokens and re-aligns their temporal RoPE phase to the current timeline, stabilizing global context during long rollouts. 2) Participative Compression performs importance-aware KV cache pruning that preserves only tokens actively participating in recent attention while safely discarding redundant and degraded history, minimizing error accumulation under out-of-distribution length generation. Together, these components enable over 12x extrapolation (e.g. 5s-trained to 60s+ generation) with better imaging quality than LongLive, better aesthetic quality than RollingForcing, almost maintaining overall consistency, and substantial gains in dynamic degree, all while maintaining real-time generation. Our results demonstrate that training-free KV-cache management can match or exceed training-based approaches for autoregressively streaming long-video generation.
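
The cache policy described above (a fixed budget split between persistent sink tokens and importance-selected recent tokens) can be sketched at the bookkeeping level. This is a rough illustration only: the function name, the use of the oldest tokens as sinks, and the per-token `scores` dictionary standing in for recent attention mass are all assumptions, and the RoPE phase re-alignment step is not modeled at all.

```python
def prune_kv_cache(tokens, scores, window, sink_fraction=0.5):
    """Select which cached token ids to keep within a fixed window budget.

    tokens: unique token ids ordered oldest to newest
    scores: dict token id -> importance (hypothetical stand-in for how much
            recent queries attended to that token)
    window: total number of cache slots
    sink_fraction: share of the window reserved for persistent sink tokens
                   (the abstract states half)
    """
    if len(tokens) <= window:
        return list(tokens)            # nothing to evict yet

    n_sink = int(window * sink_fraction)
    sinks = tokens[:n_sink]            # assumed: earliest tokens act as sinks
    candidates = tokens[n_sink:]
    n_keep = window - n_sink

    # Keep the candidates with the highest importance scores.
    ranked = sorted(candidates, key=lambda t: scores.get(t, 0.0), reverse=True)
    kept = set(ranked[:n_keep])

    # Preserve temporal order among the surviving tokens.
    return sinks + [t for t in candidates if t in kept]


# Toy usage: 10 cached tokens, budget of 6 slots.
toy_tokens = list(range(10))
toy_scores = {3: 0.1, 4: 0.9, 5: 0.2, 6: 0.8, 7: 0.05, 8: 0.7, 9: 0.3}
toy_kept = prune_kv_cache(toy_tokens, toy_scores, window=6)
```

In the toy call, tokens 0 to 2 are retained as sinks and tokens 4, 6, and 8 survive because they carry the highest scores, so the cache never grows beyond six entries however long the rollout runs.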

[146] Visual Reasoning Tracer: Object-Level Grounded Reasoning Benchmark

Haobo Yuan,Yueyi Sun,Yanwei Li,Tao Zhang,Xueqing Deng,Henghui Ding,Lu Qi,Anran Wang,Xiangtai Li,Ming-Hsuan Yang

Main category: cs.CV

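TL;DR: This paper introduces the Visual Reasoning Tracer (VRT) task, which asks multimodal large models not only to output a final answer but also to explicitly expose the intermediate reasoning steps and the visual evidence they rely on. To support it, the authors build the VRT-Bench evaluation benchmark, a new evaluation metric, and the large-scale training dataset VRT-80k. Experiments show that existing models ground intermediate reasoning poorly, whereas models trained on VRT-80k substantially improve the interpretability and accuracy of the reasoning path.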

Details

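Motivation: Existing multimodal large language models perform well on tasks such as visual question answering, but their reasoning is opaque: they do not explicitly express intermediate reasoning steps or visual evidence (e.g., pixels, locations), which limits interpretability and reliability. Inspired by the human ability to reason through a chain of visual steps, this work aims to have models reason step by step and expose the reasoning path. Method: The paper proposes the Visual Reasoning Tracer (VRT) task, which requires a model to predict the intermediate objects forming the reasoning path while localizing the target; builds VRT-Bench, a human-annotated benchmark for evaluation; designs a new metric to measure the quality of reasoning traces; and constructs VRT-80k, a large-scale training dataset, on which models with explicit reasoning ability are trained and evaluated. Result: Current mainstream models can give the correct final answer but ground their intermediate reasoning steps poorly, while models trained on VRT-80k improve substantially at tracing the reasoning path, supporting the effectiveness of the approach. Conclusion: By introducing the VRT task together with its datasets and evaluation protocol, the work moves multimodal models from black-box decisions toward interpretable chained visual reasoning, improving the transparency and trustworthiness of the reasoning process. Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have significantly improved performance on tasks such as visual grounding and visual question answering. However, the reasoning processes of these models remain largely opaque; they typically output only final predictions without revealing the intermediate steps or fine-grained evidence (e.g., pixels, locations) that lead to the result. This contrasts with human intelligence, which naturally operates through a chain of visual reasoning. To address this limitation, we introduce the Visual Reasoning Tracer (VRT) task, which requires models to not only localize the target object but also explicitly predict the intermediate objects that form the reasoning path. To advance research in this area, we contribute: (1) VRT-Bench, a human-annotated benchmark for evaluating visual reasoning; (2) a new metric for assessing the quality of reasoning traces; and (3) VRT-80k, a large-scale dataset for reasoning model training. Our experiments reveal that while existing models often produce the correct final output, they struggle to ground their intermediate reasoning. In contrast, models trained on VRT-80k achieve substantial improvements in tracing the reasoning path.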

[147] SA-IQA: Redefining Image Quality Assessment for Spatial Aesthetics with Multi-Dimensional Rewards

Yuan Gao,Jin Song

Main category: cs.CV

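TL;DR: This paper proposes Spatial Aesthetics, a new paradigm for assessing the aesthetic quality of AI-generated interior-scene images, and builds SA-BENCH, the first large-scale benchmark for it. On this basis it introduces the SA-IQA model, which can serve as a reward signal for optimizing the generation pipeline and significantly outperforms existing methods on multiple tasks.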

Details

Motivation: Existing image quality assessment methods focus mainly on portraits and artistic images and lack a systematic evaluation of the spatial aesthetics of interior scenes, which limits quality improvements for AI-generated content in practical applications. Method: The paper proposes a spatial-aesthetics evaluation paradigm with four dimensions (layout, harmony, lighting, and distortion); builds the SA-BENCH benchmark with 18,000 images and 50,000 annotations; trains the SA-IQA model via MLLM fine-tuning and a multi-dimensional fusion strategy; and applies it as a reward model for GRPO reinforcement learning and Best-of-N selection. Result: SA-IQA significantly outperforms existing IQA methods on SA-BENCH and effectively improves the quality of AI-generated interior images in downstream tasks, validating it as a reward signal. Conclusion: SA-IQA provides a reliable aesthetic-quality assessment framework for AI-generated interior scenes, shifting image quality assessment from conventional images toward spatial semantic understanding, with broad application potential. Abstract: In recent years, Image Quality Assessment (IQA) for AI-generated images (AIGI) has advanced rapidly; however, existing methods primarily target portraits and artistic images, lacking a systematic evaluation of interior scenes. We introduce Spatial Aesthetics, a paradigm that assesses the aesthetic quality of interior images along four dimensions: layout, harmony, lighting, and distortion. We construct SA-BENCH, the first benchmark for spatial aesthetics, comprising 18,000 images and 50,000 precise annotations. Employing SA-BENCH, we systematically evaluate current IQA methodologies and develop SA-IQA, through MLLM fine-tuning and a multidimensional fusion approach, as a comprehensive reward framework for assessing spatial aesthetics. We apply SA-IQA to two downstream tasks: (1) serving as a reward signal integrated with GRPO reinforcement learning to optimize the AIGC generation pipeline, and (2) Best-of-N selection to filter high-quality images and improve generation quality. Experiments indicate that SA-IQA significantly outperforms existing methods on SA-BENCH, setting a new standard for spatial aesthetics evaluation. Code and dataset will be open-sourced to advance research and applications in this domain.
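
Illustrative sketch (not the authors' code): Best-of-N selection with a multi-dimensional reward can be pictured as scoring each candidate on the four dimensions named in the abstract and keeping the top one. The weighted-average fusion, the helper names, and the example scores are assumptions for illustration only; SA-IQA's actual fusion is learned.

```python
DIMENSIONS = ("layout", "harmony", "lighting", "distortion")


def fuse_scores(scores, weights):
    """Weighted average over the four dimensions (assumed fusion rule)."""
    total = sum(weights[d] for d in DIMENSIONS)
    return sum(scores[d] * weights[d] for d in DIMENSIONS) / total


def best_of_n(candidates, score_fn):
    """Return the candidate with the highest reward under `score_fn`."""
    return max(candidates, key=score_fn)


# Hypothetical per-dimension scores for two generated images.
candidates = [
    {"name": "img_a", "scores": {"layout": 0.6, "harmony": 0.7, "lighting": 0.5, "distortion": 0.9}},
    {"name": "img_b", "scores": {"layout": 0.8, "harmony": 0.8, "lighting": 0.7, "distortion": 0.9}},
]
equal = {d: 1.0 for d in DIMENSIONS}
winner = best_of_n(candidates, lambda c: fuse_scores(c["scores"], equal))
```

Here `img_b` is at least as good on every dimension, so it is selected under any positive weighting.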

[148] EvoIR: Towards All-in-One Image Restoration via Evolutionary Frequency Modulation

Jiaqi Ma,Shengkai Hu,Jun Wan,Jiaxing Huang,Lefei Zhang,Salman Khan

Main category: cs.CV

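TL;DR: This paper proposes EvoIR, an all-in-one image restoration framework that introduces evolutionary frequency modulation and an evolutionary optimization strategy to model frequency information explicitly and adjust objectives dynamically, achieving a better balance between structural fidelity and perceptual quality across diverse degradations.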

Details

Motivation: Existing all-in-one image restoration methods typically lack explicit frequency modeling and rely on fixed optimization strategies, so they generalize poorly to diverse degradation types; a more adaptive and robust method is needed. Method: The EvoIR framework includes a Frequency-Modulated Module (FMM) that explicitly decomposes features into high- and low-frequency branches and modulates them adaptively, together with an Evolutionary Optimization Strategy (EOS) that dynamically adjusts frequency-aware objectives through a population-based evolutionary process, mitigating gradient conflicts and accelerating convergence. Result: Experiments on multiple benchmarks show that EvoIR outperforms current state-of-the-art all-in-one image restoration methods, and that FMM and EOS are complementary, working better together than alone. Conclusion: Through explicit frequency modeling and evolutionary optimization, EvoIR achieves stronger image restoration performance and offers a new approach to handling complex degradations. Abstract: All-in-One Image Restoration (AiOIR) tasks often involve diverse degradations that require robust and versatile strategies. However, most existing approaches typically lack explicit frequency modeling and rely on fixed or heuristic optimization schedules, which limit generalization across heterogeneous degradations. To address these limitations, we propose EvoIR, an AiOIR-specific framework that introduces evolutionary frequency modulation for dynamic and adaptive image restoration. Specifically, EvoIR employs the Frequency-Modulated Module (FMM) that decomposes features into high- and low-frequency branches in an explicit manner and adaptively modulates them to enhance both structural fidelity and fine-grained details. Central to EvoIR, an Evolutionary Optimization Strategy (EOS) iteratively adjusts frequency-aware objectives through a population-based evolutionary process, dynamically balancing structural accuracy and perceptual fidelity. Its evolutionary guidance further mitigates gradient conflicts across degradations and accelerates convergence. By synergizing FMM and EOS, EvoIR yields greater improvements than using either component alone, underscoring their complementary roles. Extensive experiments on multiple benchmarks demonstrate that EvoIR outperforms state-of-the-art AiOIR methods.
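
Illustrative sketch (not the authors' code): the idea of splitting a feature map into low- and high-frequency branches and re-weighting them can be shown with a fixed FFT mask. EvoIR's FMM is a learned module; the ideal radial low-pass mask, the `cutoff` parameter, and the scalar gains below are assumptions chosen only to make the decomposition concrete.

```python
import numpy as np


def split_frequencies(feature, cutoff):
    """Split a 2D map into low- and high-frequency parts with an ideal FFT mask."""
    spectrum = np.fft.fft2(feature)
    fy = np.fft.fftfreq(feature.shape[0])[:, None]
    fx = np.fft.fftfreq(feature.shape[1])[None, :]
    low_mask = np.sqrt(fx ** 2 + fy ** 2) <= cutoff  # radially symmetric mask
    low = np.fft.ifft2(spectrum * low_mask).real
    high = feature - low  # residual, so low + high reconstructs the input
    return low, high


def modulate(low, high, low_gain, high_gain):
    """Re-weight the two branches (scalar gains stand in for learned modulation)."""
    return low_gain * low + high_gain * high
```

With `cutoff=0` only the DC component survives in the low branch, so `low` is the constant mean of the map; with both gains set to 1 the input is recovered.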

[149] NeuralRemaster: Phase-Preserving Diffusion for Structure-Aligned Generation

Yu Zeng,Charles Ochoa,Mingyuan Zhou,Vishal M. Patel,Vitor Guizilini,Rowan McAllister

Main category: cs.CV

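TL;DR: This paper proposes Phase-Preserving Diffusion (φ-PD), a reformulation of the diffusion process that preserves the input phase and randomizes only the magnitude, suited to image-to-image and video-to-video generation tasks that require geometric consistency. It also introduces FSS noise for continuous control over structural rigidity, and markedly improves sim-to-real transfer without extra parameters or architectural changes.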

Details

Motivation: Standard diffusion models add Gaussian noise with random magnitude and phase, which corrupts the phase component of the data and destroys spatial structure, making them ill-suited to tasks that require geometric consistency such as re-rendering, simulation enhancement, and image translation. A diffusion method that preserves the input's structural information is therefore needed. Method: φ-PD keeps the phase of the input image unchanged and randomizes only the magnitude in the frequency domain; Frequency-Selective Structured (FSS) noise controls structural rigidity through a single frequency-cutoff parameter. The method leaves the model architecture unchanged, adds no inference overhead, and is compatible with existing diffusion models. Result: On photorealistic and stylized re-rendering and on sim-to-real enhancement for driving planners, φ-PD produces controllable, spatially aligned results; applied to the CARLA simulator, it improves planner performance on Waymo by 50%. Conclusion: φ-PD is a general, efficient, plug-and-play diffusion framework that achieves high-quality generation while preserving spatial structure, is broadly applicable to conditional image and video generation, and is complementary to existing conditioning methods. Abstract: Standard diffusion corrupts data using Gaussian noise whose Fourier coefficients have random magnitudes and random phases. While effective for unconditional or text-to-image generation, corrupting phase components destroys spatial structure, making it ill-suited for tasks requiring geometric consistency, such as re-rendering, simulation enhancement, and image-to-image translation. We introduce Phase-Preserving Diffusion (φ-PD), a model-agnostic reformulation of the diffusion process that preserves input phase while randomizing magnitude, enabling structure-aligned generation without architectural changes or additional parameters. We further propose Frequency-Selective Structured (FSS) noise, which provides continuous control over structural rigidity via a single frequency-cutoff parameter. φ-PD adds no inference-time cost and is compatible with any diffusion model for images or videos. Across photorealistic and stylized re-rendering, as well as sim-to-real enhancement for driving planners, φ-PD produces controllable, spatially aligned results. When applied to the CARLA simulator, φ-PD improves CARLA-to-Waymo planner performance by 50%. The method is complementary to existing conditioning approaches and broadly applicable to image-to-image and video-to-video generation. Videos, additional examples, and code are available on our project page: https://yuzeng-at-tri.github.io/ppd-page/
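
Illustrative sketch (not the authors' code): one way to picture "preserve input phase while randomizing magnitude" is to build a noise image whose Fourier phase is copied from the input and whose Fourier magnitude comes from ordinary Gaussian noise. The function name and this particular construction are assumptions; the FSS cutoff and the diffusion schedule are not modeled here.

```python
import numpy as np


def phase_preserving_noise(image, rng):
    """Return a noise image that shares the Fourier phase of `image`."""
    # The phase of the input carries its spatial structure.
    phase = np.angle(np.fft.fft2(image))
    # Magnitudes are taken from Gaussian noise of the same shape.
    noise_mag = np.abs(np.fft.fft2(rng.standard_normal(image.shape)))
    # Recombine: random magnitude, input phase.
    spectrum = noise_mag * np.exp(1j * phase)
    # For a real input the spectrum is Hermitian up to rounding, so the
    # imaginary part of the inverse transform is negligible.
    return np.fft.ifft2(spectrum).real


rng = np.random.default_rng(0)
image = rng.random((8, 8))  # stand-in for a real image
noisy = phase_preserving_noise(image, rng)
```

The result has the same shape as the input and the same phase at every frequency, while its magnitude spectrum is that of the sampled noise.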

[150] ShadowDraw: From Any Object to Shadow-Drawing Compositional Art

Rundong Luo,Noah Snavely,Wei-Chiu Ma

Main category: cs.CV

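TL;DR: ShadowDraw is a framework that turns ordinary 3D objects into shadow-drawing art by optimizing scene parameters and generating a line drawing so that the cast shadow completes it into a recognizable image.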

Details

Motivation: To explore the design space of computational visual art, combining algorithmic design with artistic storytelling, and to automate the conversion of 3D objects into shadow-drawing art. Method: The system optimizes scene configurations (such as object pose and lighting), uses shadow strokes to guide line-drawing generation, and applies automatic evaluation to enforce coherence between shadow and drawing as well as visual quality. Result: It produces compelling results on diverse inputs, including real-world scans, curated datasets, and generative assets, and supports multi-object scenes, animations, and physical deployment. Conclusion: ShadowDraw provides a practical pipeline for generating shadow-drawing art and broadens the possibilities of computational visual art. Abstract: We introduce ShadowDraw, a framework that transforms ordinary 3D objects into shadow-drawing compositional art. Given a 3D object, our system predicts scene parameters, including object pose and lighting, together with a partial line drawing, such that the cast shadow completes the drawing into a recognizable image. To this end, we optimize scene configurations to reveal meaningful shadows, employ shadow strokes to guide line drawing generation, and adopt automatic evaluation to enforce shadow-drawing coherence and visual quality. Experiments show that ShadowDraw produces compelling results across diverse inputs, from real-world scans and curated datasets to generative assets, and naturally extends to multi-object scenes, animations, and physical deployments. Our work provides a practical pipeline for creating shadow-drawing art and broadens the design space of computational visual art, bridging the gap between algorithmic design and artistic storytelling. Check out our project page https://red-fairy.github.io/ShadowDraw/ for more results and an end-to-end real-world demonstration of our pipeline!

[151] ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning

Shengyuan Ding,Xinyu Fang,Ziyu Liu,Yuhang Zang,Yuhang Cao,Xiangyu Zhao,Haodong Duan,Xiaoyi Dong,Jianze Liang,Bin Wang,Conghui He,Dahua Lin,Jiaqi Wang

Main category: cs.CV

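TL;DR: This paper proposes ARM-Thinker, an agentic multimodal reward model that can autonomously call external tools. By grounding its judgments in verifiable evidence, it strengthens the alignment of vision-language systems and significantly outperforms existing methods on several complex tasks.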

Details

Motivation: Existing multimodal reward models suffer from hallucination, weak visual grounding, and an inability to use tools for verification, which limits their reliability on complex reasoning tasks. Method: ARM-Thinker is an agentic multimodal reward model that autonomously invokes external tools (e.g., image cropping, document page retrieval); it is trained with multi-stage reinforcement learning that jointly optimizes tool-calling decisions and judgment accuracy. Result: On the newly built ARMBench-VL benchmark, ARM-Thinker improves reward-modeling performance by 16.2% on average and tool-use tasks by 9.6%, and it surpasses baseline models on multimodal math and logical reasoning tasks. Conclusion: Giving reward models agentic capabilities significantly improves their accuracy and interpretability, pointing to a new direction for reliable multimodal alignment systems. Abstract: Reward models are critical for aligning vision-language systems with human preferences, yet current approaches suffer from hallucination, weak visual grounding, and an inability to use tools for verification, limiting their reliability on complex multimodal reasoning tasks. We present ARM-Thinker, an Agentic multimodal Reward Model that autonomously invokes external tools (e.g., image cropping, doc page retrieval) to ground judgments in verifiable evidence, replacing static, non-interactive reward scoring. This enables the model to verify fine-grained visual details, cross-reference multi-page evidence, and validate reasoning claims, which are capabilities absent in existing reward models. We train ARM-Thinker with multi-stage reinforcement learning, jointly optimizing tool-calling decisions and judgment accuracy. To evaluate agentic reward modeling, we introduce ARMBench-VL, comprising three benchmarks that assess fine-grained visual grounding (image-level tools), multi-page document understanding (retrieval tools), and instruction following (text-level verification). ARM-Thinker achieves +16.2% average improvement on reward modeling benchmarks, +9.6% on tool-use tasks, and outperforms baselines on multimodal math and logical reasoning benchmarks. Our results demonstrate that agentic capabilities significantly enhance both accuracy and interpretability of reward models.

[152] Splannequin: Freezing Monocular Mannequin-Challenge Footage with Dual-Detection Splatting

Hao-Jen Chien,Yi-Chuan Huang,Chung-Ho Wu,Wei-Lun Chao,Yu-Lun Liu

Main category: cs.CV

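TL;DR: The paper proposes Splannequin, a temporal-anchoring regularization for dynamic Gaussian splatting that improves the quality of frozen 3D scenes generated from monocular Mannequin-Challenge (MC) videos.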

Details

Motivation: Synthesizing high-fidelity frozen 3D scenes from monocular Mannequin-Challenge videos suffers from ghosting and blur caused by sparse temporal supervision, while subtle dynamics must be preserved to support user-controlled selection. Method: Splannequin detects hidden and defective states of Gaussian primitives and anchors them, respectively, to recent well-observed past states or to future states with stronger supervision; it integrates into existing dynamic Gaussian pipelines through simple loss terms, with no architectural changes and no inference overhead. Result: Visual quality improves markedly, with less ghosting and blur, enabling user-selectable, high-fidelity frozen-time renderings that received a 96% preference in a user study. Conclusion: Splannequin is an effective, general, and lightweight regularization that substantially improves the reconstruction quality of frozen 3D views in monocular dynamic scenes. Abstract: Synthesizing high-fidelity frozen 3D scenes from monocular Mannequin-Challenge (MC) videos is a unique problem distinct from standard dynamic scene reconstruction. Instead of focusing on modeling motion, our goal is to create a frozen scene while strategically preserving subtle dynamics to enable user-controlled instant selection. To achieve this, we introduce a novel application of dynamic Gaussian splatting: the scene is modeled dynamically, which retains nearby temporal variation, and a static scene is rendered by fixing the model's time parameter. However, under this usage, monocular capture with sparse temporal supervision introduces artifacts like ghosting and blur for Gaussians that become unobserved or occluded at weakly supervised timestamps. We propose Splannequin, an architecture-agnostic regularization that detects two states of Gaussian primitives, hidden and defective, and applies temporal anchoring. Under predominantly forward camera motion, hidden states are anchored to their recent well-observed past states, while defective states are anchored to future states with stronger supervision. Our method integrates into existing dynamic Gaussian pipelines via simple loss terms, requires no architectural changes, and adds zero inference overhead. This results in markedly improved visual quality, enabling high-fidelity, user-selectable frozen-time renderings, validated by a 96% user preference. Project page: https://chien90190.github.io/splannequin/

[153] Light-X: Generative 4D Video Rendering with Camera and Illumination Control

Tianqi Liu,Zhaoxi Chen,Zihao Huang,Shaocong Xu,Saining Zhang,Chongjie Ye,Bohan Li,Zhiguo Cao,Wei Li,Hao Zhao,Ziwei Liu

Main category: cs.CV

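TL;DR: This paper proposes Light-X, a video generation framework that offers joint control of viewpoint and illumination from monocular video. By disentangling geometry and lighting signals and introducing the Light-Syn synthetic data pipeline, it achieves high-quality, temporally consistent controllable rendering.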

Details

Motivation: Existing image-based illumination control methods face a trade-off between lighting fidelity and temporal consistency when extended to video, and they cannot jointly control camera trajectory and illumination, which limits generative modeling of real-world scenes. Method: The paper proposes a disentangled design in which dynamic point clouds projected along user-defined camera trajectories capture geometry and motion, while relit frames provide consistent illumination cues; it also builds the Light-Syn synthesis pipeline, which uses inverse mapping to generate paired multi-view, multi-illumination training data. Result: Experiments show that Light-X outperforms baselines on joint camera-illumination control and surpasses prior methods on text- and background-conditioned video relighting, with good generalization and temporal consistency. Conclusion: By explicitly disentangling geometry and lighting signals and training robustly on synthesized data, Light-X offers an effective solution for controllable video generation of real-world scenes. Abstract: Recent advances in illumination control extend image-based methods to video, yet still face a trade-off between lighting fidelity and temporal consistency. Moving beyond relighting, a key step toward generative modeling of real-world scenes is the joint control of camera trajectory and illumination, since visual dynamics are inherently shaped by both geometry and lighting. To this end, we present Light-X, a video generation framework that enables controllable rendering from monocular videos with both viewpoint and illumination control. 1) We propose a disentangled design that decouples geometry and lighting signals: geometry and motion are captured via dynamic point clouds projected along user-defined camera trajectories, while illumination cues are provided by a relit frame consistently projected into the same geometry. These explicit, fine-grained cues enable effective disentanglement and guide high-quality illumination. 2) To address the lack of paired multi-view and multi-illumination videos, we introduce Light-Syn, a degradation-based pipeline with inverse-mapping that synthesizes training pairs from in-the-wild monocular footage. This strategy yields a dataset covering static, dynamic, and AI-generated scenes, ensuring robust training. Extensive experiments show that Light-X outperforms baseline methods in joint camera-illumination control and surpasses prior video relighting methods under both text- and background-conditioned settings.

Table of Contents

cs.CL

[1] On GRPO Collapse in Search-R1: The Lazy Likelihood-Displacement Death Spiral

[2] Computational Linguistics Meets Libyan Dialect: A Study on Dialect Identification

[3] SQuARE: Structured Query & Adaptive Retrieval Engine For Tabular Formats

[4] DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle

[5] ClusterFusion: Hybrid Clustering with Embedding Guidance and LLM Adaptation

[6] LangSAT: A Novel Framework Combining NLP and Reinforcement Learning for SAT Solving

[7] MASE: Interpretable NLP Models via Model-Agnostic Saliency Estimation

[8] Sarcasm Detection on Reddit Using Classical Machine Learning and Feature Engineering

[9] RapidUn: Influence-Driven Parameter Reweighting for Efficient Large Language Model Unlearning

[10] MSME: A Multi-Stage Multi-Expert Framework for Zero-Shot Stance Detection

[11] UW-BioNLP at ChemoTimelines 2025: Thinking, Fine-Tuning, and Dictionary-Enhanced LLM Systems for Chemotherapy Timeline Extraction

[12] EvoEdit: Lifelong Free-Text Knowledge Editing through Latent Perturbation Augmentation and Knowledge-driven Parameter Fusion

[13] AdmTree: Compressing Lengthy Context with Adaptive Semantic Trees

[14] ADAPT: Learning Task Mixtures for Budget-Constrained Instruction Tuning

[15] LexGenius: An Expert-Level Benchmark for Large Language Models in Legal General Intelligence

[16] Geschlechtsübergreifende Maskulina im Sprachgebrauch Eine korpusbasierte Untersuchung zu lexemspezifischen Unterschieden

[17] OsmT: Bridging OpenStreetMap Queries and Natural Language with Open-source Tag-aware Language Models

[18] SignRoundV2: Closing the Performance Gap in Extremely Low-Bit Post-Training Quantization for LLMs

[19] Model Whisper: Steering Vectors Unlock Large Language Models' Potential in Test-time

[20] EtCon: Edit-then-Consolidate for Reliable Knowledge Editing

[21] Challenging the Abilities of Large Language Models in Italian: a Community Initiative

[22] AdiBhashaa: A Community-Curated Benchmark for Machine Translation into Indian Tribal Languages

[23] DaLA: Danish Linguistic Acceptability Evaluation Guided by Real World Errors

[24] DAMASHA: Detecting AI in Mixed Adversarial Texts via Segmentation with Human-interpretable Attribution

[25] Mitigating Catastrophic Forgetting in Target Language Adaptation of LLMs via Source-Shielded Updates

[26] SEAL: Self-Evolving Agentic Learning for Conversational Question Answering over Knowledge Graphs

[27] LLMs Know More Than Words: A Genre Study with Syntax, Metaphor & Phonetics

[28] Nex-N1: Agentic Models Trained via a Unified Ecosystem for Large-Scale Environment Construction

[29] Factuality and Transparency Are All RAG Needs! Self-Explaining Contrastive Evidence Re-ranking

[30] Arbitrage: Efficient Reasoning via Advantage-Aware Speculation

[31] Structured Document Translation via Format Reinforcement Learning

[32] Semantic Soft Bootstrapping: Long Context Reasoning in LLMs without Reinforcement Learning

cs.CV

[33] Beyond Flicker: Detecting Kinematic Inconsistencies for Generalizable Deepfake Video Detection

[34] OnSight Pathology: A real-time platform-agnostic computational pathology companion for histopathology

[35] Look Around and Pay Attention: Multi-camera Point Tracking Reimagined with Transformers

[36] Generalized Event Partonomy Inference with Structured Hierarchical Predictive Learning

[37] MoReGen: Multi-Agent Motion-Reasoning Engine for Code-based Text-to-Video Synthesis

[38] ReasonX: MLLM-Guided Intrinsic Image Decomposition

[39] 6 Fingers, 1 Kidney: Natural Adversarial Medical Images Reveal Critical Weaknesses of Vision-Language Models

[40] MVRoom: Controllable 3D Indoor Scene Generation with Multi-View Diffusion Models

[41] UniLight: A Unified Representation for Lighting

[42] Inference-time Stochastic Refinement of GRU-Normalizing Flow for Real-time Video Motion Transfer

[43] Plug-and-Play Image Restoration with Flow Matching: A Continuous Viewpoint

[44] Learning Single-Image Super-Resolution in the JPEG Compressed Domain

[45] Gamma-from-Mono: Road-Relative, Metric, Self-Supervised Monocular Geometry for Vehicular Applications

[46] How (Mis)calibrated is Your Federated CLIP and What To Do About It?

[47] Text-Only Training for Image Captioning with Retrieval Augmentation and Modality Gap Correction

[48] Real-time Cricket Sorting By Sex

[49] Mind-to-Face: Neural-Driven Photorealistic Avatar Synthesis via EEG Decoding

[50] DisentangleFormer: Spatial-Channel Decoupling for Multi-Channel Vision

[51] SyncTrack4D: Cross-Video Motion Alignment and Video Synchronization for Multi-Video 4D Gaussian Splatting

[52] Bayes-DIC Net: Estimating Digital Image Correlation Uncertainty with Bayesian Neural Networks

[53] A Retrieval-Augmented Generation Approach to Extracting Algorithmic Logic from Neural Networks

[54] Open Set Face Forgery Detection via Dual-Level Evidence Collection

[55] Mitigating Object and Action Hallucinations in Multimodal LLMs via Self-Augmented Contrastive Alignment

[56] MAFNet: Multi-frequency Adaptive Fusion Network for Real-time Stereo Matching

[57] FMA-Net++: Motion- and Exposure-Aware Real-World Joint Video Super-Resolution and Deblurring

[58] Fourier-Attentive Representation Learning: A Fourier-Guided Framework for Few-Shot Generalization in Vision-Language Models

[59] Performance Evaluation of Transfer Learning Based Medical Image Classification Techniques for Disease Detection

[60] Dual-Stream Spectral Decoupling Distillation for Remote Sensing Object Detection

[61] UTrice: Unifying Primitives in Differentiable Ray Tracing and Rasterization via Triangles for Particle-Based 3D Scenes

[62] Explainable Parkinsons Disease Gait Recognition Using Multimodal RGB-D Fusion and Large Language Models

[63] Self-Paced and Self-Corrective Masked Prediction for Movie Trailer Generation

[64] MindDrive: An All-in-One Framework Bridging World Models and Vision-Language Model for End-to-End Autonomous Driving

[65] StreamEQA: Towards Streaming Video Understanding for Embodied Scenarios

[66] GuidNoise: Single-Pair Guided Diffusion for Generalized Noise Synthesis

[67] dVLM-AD: Enhance Diffusion Vision-Language-Model for Driving via Controllable Reasoning

[68] UniTS: Unified Time Series Generative Model for Remote Sensing

[69] DeRA: Decoupled Representation Alignment for Video Tokenization

[70] Not All Birds Look The Same: Identity-Preserving Generation For Birds

[71] SEASON: Mitigating Temporal Hallucination in Video Large Language Models via Self-Diagnostic Contrastive Decoding

[72] Controllable Long-term Motion Generation with Extended Joint Targets

[73] DraCo: Draft as CoT for Text-to-Image Preview and Rare Concept Generation

[74] Shift-Window Meets Dual Attention: A Multi-Model Architecture for Specular Highlight Removal

[75] Back to Basics: Motion Representation Matters for Human Motion Generation Using Diffusion Model

[76] UltraImage: Rethinking Resolution Extrapolation in Image Diffusion Transformers

[77] DuGI-MAE: Improving Infrared Mask Autoencoders via Dual-Domain Guidance