cs.CL

[1] Enriching Historical Records: An OCR and AI-Driven Approach for Database Integration

Zahra Abedi,Richard M. K. van Dijk,Gijs Wijnholds,Tessa Verhoef

Main category: cs.CL

TL;DR: This study proposes an automated pipeline that combines OCR, generative AI, and database linking to digitize Leiden University's historical register of professors and integrate it into a high-quality database.

Motivation: To process structured information in images of historical documents efficiently, and to resolve inconsistencies between existing databases and unstructured text, the study aims to develop a pipeline that extracts and aligns data automatically.

Method: OCR converts the historical document images to text; generative AI with constraints applied at decoding time extracts structured JSON data; a record linkage algorithm matches the extracted results against the existing database.

Result: OCR achieved a character error rate of 1.08% and a word error rate of 5.06%. JSON extraction reached an average accuracy of 63% from OCR text and 65% from annotated OCR. Record linkage reached 94% accuracy on annotated data and 81% on OCR-derived data.

Conclusion: The automated pipeline supports digital humanities research, indicates that generative AI can compensate to some extent for weak OCR performance, and copes with challenges such as layout variability and terminology differences.

Abstract: This research digitizes and analyzes the Leidse hoogleraren en lectoren 1575-1815 books written between 1983 and 1985, which contain biographic data about professors and curators of Leiden University. It addresses the central question: how can we design an automated pipeline that integrates OCR, LLM-based interpretation, and database linking to harmonize data from historical document images with existing high-quality database records? We applied OCR techniques, generative AI decoding constraints that structure data extraction, and database linkage methods to process typewritten historical records into a digital format. OCR achieved a Character Error Rate (CER) of 1.08 percent and a Word Error Rate (WER) of 5.06 percent, while JSON extraction from OCR text achieved an average accuracy of 63 percent and, based on annotated OCR, 65 percent. This indicates that generative AI somewhat corrects low OCR performance. Our record linkage algorithm linked annotated JSON files with 94% accuracy and OCR-derived JSON files with 81%. This study contributes to digital humanities research by offering an automated pipeline for interpreting digitized historical documents, addressing challenges like layout variability and terminology differences, and exploring the applicability and strength of an advanced generative AI model.
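As a rough illustration of how the CER and WER figures quoted above are conventionally defined, the sketch below computes both as edit distance divided by reference length. The function names and sample strings are illustrative assumptions, not the authors' evaluation code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character-level edits per reference character."""
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word-level edits per reference word."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Hypothetical OCR output with a single wrong character.
reference = "hello world"
ocr_output = "hallo world"
print(cer(reference, ocr_output), wer(reference, ocr_output))
```

In this toy case one wrong character gives a CER of 1/11 but a WER of 1/2, because a single character error invalidates the whole word. The same effect is one reason a WER can sit well above the CER on the same text.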

[2] CAT: A Metric-Driven Framework for Analyzing the Consistency-Accuracy Relation of LLMs under Controlled Input Variations

Paulo Cavalin,Cassia Sanctos,Marcelo Grave,Claudio Pinhanez,Yago Primerano

Main category: cs.CL

TL;DR: This paper proposes CAT, a framework for evaluating and visualizing the interplay between accuracy and response consistency of large language models under controllable input variations. Its core is the CAR curve, which shows how model accuracy changes as the consistency requirement increases, and it adds the CORE index to quantify the trade-off between accuracy and consistency.

Motivation: Current evaluation practice focuses mainly on model capabilities such as accuracy or benchmark scores, while consistency has recently come to be seen as an important property for deploying LLMs in high-stakes real-world applications. The interdependence of accuracy and consistency, however, has not been adequately considered.

Method: Introduces Consistency-Accuracy Relation (CAR) curves and the Minimum-Consistency Accuracy (MCA) metric, plus the Consistency-Oriented Robustness Estimate (CORE) index as a global metric that quantifies the trade-off between accuracy and consistency.

Result: A practical demonstration across a range of generalist and domain-specific LLMs on several multiple-choice benchmarks shows the framework's effectiveness.

Conclusion: CAT not only enables a more fine-grained evaluation of LLMs but can also be extended to long-form, open-ended evaluation tasks, opening a direction for future work.

Abstract: We introduce CAT, a framework designed to evaluate and visualize the interplay of accuracy and response consistency of Large Language Models (LLMs) under controllable input variations, using multiple-choice (MC) benchmarks as a case study. Current evaluation practices primarily focus on model capabilities such as accuracy or benchmark scores and, more recently, measuring consistency is being considered an essential property for deploying LLMs in high-stakes, real-world applications. We argue in this paper that although both dimensions should still be evaluated independently, their inter-dependency also needs to be considered for a more nuanced evaluation of LLMs. At the core of CAT are the Consistency-Accuracy Relation (CAR) curves, which visualize how model accuracy varies with increasing consistency requirements, as defined by the Minimum-Consistency Accuracy (MCA) metric. We further propose the Consistency-Oriented Robustness Estimate (CORE) index, a global metric that combines the area and shape of the CAR curve to quantify the trade-off between accuracy and consistency. We present a practical demonstration of our framework across a diverse set of generalist and domain-specific LLMs, evaluated on multiple MC benchmarks. We also outline how CAT can be extended beyond MC tasks to support long-form, open-ended evaluations through adaptable scoring functions.

[3] STED and Consistency Scoring: A Framework for Evaluating LLM Structured Output Reliability

Guanghui Wang,Jinze Yu,Xing Zhang,Dayuan Jiang,Yin Song,Tomal Deb,Xuefeng Liu,Peiyang He

Main category: cs.CL

TL;DR: This paper proposes a comprehensive framework for evaluating and improving the consistency of structured outputs generated by large language models (LLMs), comprising a new Semantic Tree Edit Distance (STED) metric and a consistency scoring scheme. Experiments show it is effective across several models and supports model selection, prompt refinement, and diagnostic analysis.

Motivation: Consistency of structured data generated by LLMs is critical in production, but existing methods struggle to balance semantic flexibility with structural strictness, so a more effective framework for evaluation and improvement is needed.

Method: Proposes STED (Semantic Tree Edit Distance) as a new metric for the similarity of JSON outputs, and builds a consistency scoring framework that aggregates STED across repeated generations. Systematic experiments on synthetic datasets control schema, expression, and semantic variations, and several LLMs are evaluated.

Result: STED reaches a similarity of 0.86 to 0.90 between semantically equivalent samples and scores 0 on structural breaks, outperforming TED, BERTScore, and DeepDiff. Tests on six LLMs show that Claude-3.7-Sonnet stays highly consistent even at high temperature, while Claude-3-Haiku and Nova-Pro degrade markedly.

Conclusion: The framework provides a reliable tool for evaluating and improving LLM-generated structured output, supporting model selection, prompt engineering, and root-cause analysis in practice, and improving the reliability of LLMs in production systems.

Abstract: Large Language Models (LLMs) are increasingly deployed for structured data generation, yet output consistency remains critical for production applications. We introduce a comprehensive framework for evaluating and improving consistency in LLM-generated structured outputs. Our approach combines: (1) STED (Semantic Tree Edit Distance), a novel similarity metric balancing semantic flexibility with structural strictness when comparing JSON outputs, and (2) a consistency scoring framework aggregating multiple STED measurements across repeated generations to quantify reliability. Through systematic experiments on synthetic datasets with controlled schema, expression, and semantic variations, we demonstrate STED achieves superior performance (0.86-0.90 similarity for semantic equivalents, 0.0 for structural breaks) compared to existing metrics including TED, BERTScore, and DeepDiff. Applying our framework to benchmark six LLMs reveals significant variations: Claude-3.7-Sonnet demonstrates exceptional consistency, maintaining near-perfect structural reliability even at high temperatures (T=0.9), while models like Claude-3-Haiku and Nova-Pro exhibit substantial degradation requiring careful tuning. Our framework enables practical applications including targeted model selection for structured tasks, iterative prompt refinement for reproducible results, and diagnostic analysis to identify inconsistency root causes. This work provides theoretical foundations and practical tools for ensuring reliable structured output generation in LLM-based production systems.
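The idea of aggregating a pairwise similarity over repeated generations can be sketched as follows. The abstract does not define STED, so `toy_similarity` is a deliberately crude stand-in (an assumption for illustration only), and taking the mean over all pairs is likewise an assumed aggregation rather than the paper's scoring rule.

```python
import json
from itertools import combinations


def toy_similarity(a, b):
    """Stand-in for STED (NOT the paper's metric).

    Returns 0.0 when two dicts have different keys or two lists have
    different lengths, otherwise the average similarity of their children.
    Leaves score 1.0 on exact equality and 0.0 otherwise.
    """
    if isinstance(a, dict) and isinstance(b, dict):
        if a.keys() != b.keys():
            return 0.0
        children = [toy_similarity(a[k], b[k]) for k in a]
    elif isinstance(a, list) and isinstance(b, list):
        if len(a) != len(b):
            return 0.0
        children = [toy_similarity(x, y) for x, y in zip(a, b)]
    else:
        return 1.0 if a == b else 0.0
    return sum(children) / len(children) if children else 1.0


def consistency_score(raw_outputs, similarity=toy_similarity):
    """Mean pairwise similarity over repeated generations (assumed aggregation)."""
    parsed = [json.loads(text) for text in raw_outputs]
    pairs = list(combinations(parsed, 2))
    if not pairs:
        return 1.0  # a single output is trivially consistent with itself
    return sum(similarity(x, y) for x, y in pairs) / len(pairs)


# Hypothetical outputs from three runs of the same prompt.
runs = ['{"a": 1, "b": 2}', '{"b": 2, "a": 1}', '{"a": 1, "b": 3}']
print(consistency_score(runs))
```

Passing the similarity function as a parameter keeps the aggregation independent of the metric, so a real STED implementation could be dropped in without changing `consistency_score`.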

[4] PyBangla at BLP-2025 Task 2: Enhancing Bangla-to-Python Code Generation with Iterative Self-Correction and Multilingual Agents

Jahidul Islam,Md Ataullha,Saiful Azad

Main category: cs.CL

TL;DR: This paper proposes BanglaCodeAct, a framework based on multi-agent prompting and iterative self-correction for generating Python code from Bangla, which substantially improves code generation for a low-resource language.

Motivation: Existing code generation models mainly target English and offer little support for low-resource languages such as Bangla; this paper aims to fill that gap.

Method: An open-source multilingual LLM runs inside a Thought-Code-Observation loop, where multi-agent prompting and iterative self-correction drive dynamic generation and refinement of code.

Result: Several small-parameter open-source LLMs are evaluated on the mHumanEval dataset; Qwen3-8B with BanglaCodeAct reaches 94.0% pass@1 on the development set and 71.6% on the blind test set.

Conclusion: BanglaCodeAct sets a new benchmark for Bangla-to-Python code generation and shows the potential of agent-based reasoning for code generation in low-resource languages.

Abstract: LLMs excel at code generation from English prompts, but this progress has not extended to low-resource languages. We address Bangla-to-Python code generation by introducing BanglaCodeAct, an agent-based framework that leverages multi-agent prompting and iterative self-correction. Unlike prior approaches relying on task-specific fine-tuning, BanglaCodeAct employs an open-source multilingual LLM within a Thought-Code-Observation loop, enabling dynamic generation, testing, and refinement of code from Bangla instructions. We benchmark several small-parameter open-source LLMs and evaluate their effectiveness on the mHumanEval dataset for Bangla NL2Code. Our results show that Qwen3-8B, when deployed with BanglaCodeAct, achieves the best performance, with pass@1 accuracy of 94.0% on the development set and 71.6% on the blind test set. These results establish a new benchmark for Bangla-to-Python translation and highlight the potential of agent-based reasoning for reliable code generation in low-resource languages. Experimental scripts are publicly available at github.com/jahidulzaid/PyBanglaCodeActAgent.
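A Thought-Code-Observation loop of the kind described above might look roughly like the sketch below. Everything here is an assumption for illustration: `fake_model` stands in for the LLM call, the test snippet is invented, and the loop is a generic generate-run-refine cycle rather than the BanglaCodeAct implementation.

```python
import traceback


def run_candidate(code, test):
    """Execute candidate code plus a test snippet; return (passed, observation)."""
    namespace = {}
    try:
        exec(code, namespace)   # NOTE: sandbox this when the code is untrusted
        exec(test, namespace)
        return True, "all tests passed"
    except Exception:
        return False, traceback.format_exc(limit=1)


def tco_loop(generate, instruction, test, max_rounds=3):
    """Generate code, run it, and feed the observation back until it passes."""
    code, passed, observation = "", False, ""
    for round_no in range(1, max_rounds + 1):
        thought, code = generate(instruction, observation)  # Thought + Code
        passed, observation = run_candidate(code, test)     # Observation
        if passed:
            return code, True, round_no
    return code, False, max_rounds


def fake_model(instruction, observation):
    """Hypothetical stand-in for an LLM: buggy first, fixed after feedback."""
    if not observation:
        return "first attempt", "def add(a, b):\n    return a - b"
    return "fix the sign error", "def add(a, b):\n    return a + b"


final_code, ok, rounds = tco_loop(
    fake_model, "Write add(a, b) that returns the sum.", "assert add(2, 3) == 5"
)
print(ok, rounds)
```

The observation string is the only channel from execution back to the model, which is what lets a failed test in one round shape the next attempt.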

[5] PharmaShip: An Entity-Centric, Reading-Order-Supervised Benchmark for Chinese Pharmaceutical Shipping Documents

Tingwei Xie,Tianyi Zhou,Yonghong Song

Main category: cs.CL

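TL;DR: PharmaShip is a real-world dataset of Chinese pharmaceutical shipping documents for evaluating text-layout models under noisy OCR and diverse templates. It supports sequence entity recognition, relation extraction, and reading order prediction, and proposes sequence-aware constraints as a transferable bias for structure modeling.

Motivation: Existing document understanding models perform poorly on real-world Chinese pharmaceutical shipping documents, which are noisy and vary widely in template, and there is no standardized benchmark for comparing architectures fairly.

Method: Builds PharmaShip, a dataset of Chinese pharmaceutical shipping documents with scanned images and OCR information; defines three tasks (sequence entity recognition, relation extraction, and reading order prediction); adopts an entity-centric evaluation protocol; and systematically evaluates several geometry-aware and pixel-aware models under unified preprocessing, data splits, and optimization settings.

Result: Pixels and explicit geometry provide complementary inductive biases, but neither alone is sufficient. Reading-order-oriented regularization consistently improves sequence entity recognition and entity linking, and longer positional coverage stabilizes late-page predictions and reduces truncation effects. Reading order prediction is fairly accurate at the word level but remains challenging at the segment level.

Conclusion: PharmaShip provides a controlled, reproducible benchmark for document understanding in the safety-critical pharmaceutical domain and shows that sequence-aware constraints are an effective structural prior that transfers across models.

Abstract: We present PharmaShip, a real-world Chinese dataset of scanned pharmaceutical shipping documents designed to stress-test pre-trained text-layout models under noisy OCR and heterogeneous templates. PharmaShip covers three complementary tasks, namely sequence entity recognition (SER), relation extraction (RE), and reading order prediction (ROP), and adopts an entity-centric evaluation protocol to minimize confounds across architectures. We benchmark five representative baselines spanning pixel-aware and geometry-aware families (LiLT, LayoutLMv3-base, GeoLayoutLM and their available RORE-enhanced variants), and standardize preprocessing, splits, and optimization. Experiments show that pixels and explicit geometry provide complementary inductive biases, yet neither alone is sufficient: injecting reading-order-oriented regularization consistently improves SER and entity linking (EL) and yields the most robust configuration, while longer positional coverage stabilizes late-page predictions and reduces truncation artifacts. ROP is accurate at the word level but challenging at the segment level, reflecting boundary ambiguity and long-range crossings. PharmaShip thus establishes a controlled, reproducible benchmark for safety-critical document understanding in the pharmaceutical domain and highlights sequence-aware constraints as a transferable bias for structure modeling. We release the dataset at https://github.com/KevinYuLei/PharmaShip.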

[6] Noise-Driven Persona Formation in Reflexive Neural Language Generation

Toshiyuki Shigemura

Main category: cs.CL

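TL;DR: Proposes the Luca-Noise Reflex Protocol (LN-RP), a computational framework that injects stochastic noise to study noise-driven persona emergence in large language models, and finds three stable persona modes with distinct entropy signatures.

Motivation: To explore how LLMs produce and sustain dynamic persona behavior under the influence of noise, and to understand the nonlinear evolution of the generation process.

Method: Injects stochastic noise seeds into the initial generation state, runs 152 generation cycles, and analyzes nonlinear transitions in linguistic behavior and phase transitions triggered by external noise.

Result: Identifies three stable persona modes with distinct entropy signatures; external noise reliably induces phase transitions in reflexive generation dynamics; quantitative evaluation shows significant differences across modes (p < 0.01) and consistent persona retention.

Conclusion: LN-RP provides a reproducible experimental method for studying reflexive generation, emergent behavior, and long-range linguistic coherence in LLMs.

Abstract: This paper introduces the Luca-Noise Reflex Protocol (LN-RP), a computational framework for analyzing noise-driven persona emergence in large language models. By injecting stochastic noise seeds into the initial generation state, we observe nonlinear transitions in linguistic behavior across 152 generation cycles. Our results reveal three stable persona modes with distinct entropy signatures, and demonstrate that external noise sources can reliably induce phase transitions in reflexive generation dynamics. Quantitative evaluation confirms consistent persona retention and significant differences across modes (p < 0.01). The protocol provides a reproducible method for studying reflexive generation, emergent behavior, and long-range linguistic coherence in LLMs.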

[7] HarmTransform: Transforming Explicit Harmful Queries into Stealthy via Multi-Agent Debate

Shenzhe Zhu

Main category: cs.CL

TL;DR: This paper proposes HarmTransform, a multi-agent debate framework that transforms harmful queries into stealthier forms in order to improve safety alignment training for large language models. Experiments show it outperforms baseline methods, but also that multi-agent debate can introduce topic shifts and added complexity.

Motivation: Existing LLM safety mechanisms mainly target overtly harmful content and overlook threats disguised through covert rephrasing, leaving a gap in safety training data.

Method: Proposes HarmTransform, which uses iterative critique and refinement among multiple agents to systematically generate variants of harmful queries that keep the malicious intent but are stealthier.

Result: HarmTransform significantly outperforms standard baselines at producing effective stealthy queries, but the analysis also finds that multi-agent debate can cause topic shifts and over-complication.

Conclusion: HarmTransform can help strengthen LLM defenses against covert threats, but the debate mechanism must be designed with care to avoid side effects; the work shows both its potential and its limits for building safety training data.

Abstract: Large language models (LLMs) are equipped with safety mechanisms to detect and block harmful queries, yet current alignment approaches primarily focus on overtly dangerous content and overlook more subtle threats. However, users can often disguise harmful intent through covert rephrasing that preserves malicious objectives while appearing benign, which creates a significant gap in existing safety training data. To address this limitation, we introduce HarmTransform, a multi-agent debate framework for systematically transforming harmful queries into stealthier forms while preserving their underlying harmful intent. Our framework leverages iterative critique and refinement among multiple agents to generate high-quality, covert harmful query transformations that can be used to improve future LLM safety alignment. Experiments demonstrate that HarmTransform significantly outperforms standard baselines in producing effective query transformations. At the same time, our analysis reveals that debate acts as a double-edged sword: while it can sharpen transformations and improve stealth, it may also introduce topic shifts and unnecessary complexity. These insights highlight both the promise and the limitations of multi-agent debate for generating comprehensive safety training data.

[8] Emergent World Beliefs: Exploring Transformers in Stochastic Games

Adam Kamel,Tanish Rastogi,Michael Ma,Kailash Ranganathan,Kevin Zhu

Main category: cs.CL

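TL;DR: This paper studies whether Transformer-based large language models (LLMs) can learn representations of the environment's latent state in games of incomplete information, using Texas Hold'em poker as the example. A GPT model is pretrained on poker hand history data and its internal activations are analyzed with nonlinear probes. The model spontaneously learns deterministic structure (such as hand ranks) and stochastic features (such as equity), and these representations correlate with theoretical belief states, suggesting that LLMs can build an internal world model of a partially observable environment.

Motivation: To explore whether LLMs still develop world-model-like internal representations in incomplete-information settings such as Texas Hold'em, extending a phenomenon previously observed only in perfect-information games.

Method: Pretrains a GPT-style language model on Poker Hand History (PHH) data and analyzes its internal activations with linear and nonlinear probes to test whether it encodes game rules, equity, and belief states.

Result: Without explicit supervision, the model learns hand ranks (deterministic structure) and equity (stochastic features); nonlinear probes decode these representations effectively, and they correlate strongly with theoretical belief states.

Conclusion: LLMs can spontaneously build internal world models in incomplete-information environments. Their internal representations reflect key state features of a partially observable Markov decision process (POMDP), suggesting that their reasoning abilities may extend to more complex real-world settings.

Abstract: Transformer-based large language models (LLMs) have demonstrated strong reasoning abilities across diverse fields, from solving programming challenges to competing in strategy-intensive games such as chess. Prior work has shown that LLMs can develop emergent world models in games of perfect information, where internal representations correspond to latent states of the environment. In this paper, we extend this line of investigation to domains of incomplete information, focusing on poker as a canonical partially observable Markov decision process (POMDP). We pretrain a GPT-style model on Poker Hand History (PHH) data and probe its internal activations. Our results demonstrate that the model learns both deterministic structure, such as hand ranks, and stochastic features, such as equity, without explicit instruction. Furthermore, by using primarily nonlinear probes, we demonstrate that these representations are decodable and correlate with theoretical belief states, suggesting that LLMs are learning their own representation of the stochastic environment of Texas Hold'em Poker.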

[9] When in Doubt, Deliberate: Confidence-Based Routing to Expert Debate for Sexism Detection

Anwar Alajmi,Gabriele Pergola

Main category: cs.CL

TL;DR: Proposes a two-stage framework combining targeted training with reasoning-based inference to address data scarcity, noise, and conceptual ambiguity in online sexism detection, achieving state-of-the-art results on several benchmarks.

Motivation: Traditional methods struggle to detect subtle, context-dependent sexist content online, and label scarcity, class imbalance, and inconsistent annotation lead to unstable decision boundaries.

Method: Training uses class-balanced focal loss, class-aware batching, and post-hoc threshold calibration. At inference, a dynamic routing mechanism classifies high-confidence samples directly and sends uncertain ones to a Collaborative Expert Judgment (CEJ) module, which prompts multiple personas to generate reasoning and has a judge model consolidate it.

Result: F1 improves by +2.72% on EXIST 2025 Task 1.1, and by +4.48% and +1.30% on EDOS Tasks A and B, respectively.

Conclusion: The framework addresses the combined challenges of data sparsity, noise, and conceptual ambiguity and markedly improves detection of subtle sexist content.

Abstract: Sexist content online increasingly appears in subtle, context-dependent forms that evade traditional detection methods. Its interpretation often depends on overlapping linguistic, psychological, legal, and cultural dimensions, which produce mixed and sometimes contradictory signals, even in annotated datasets. These inconsistencies, combined with label scarcity and class imbalance, result in unstable decision boundaries and cause fine-tuned models to overlook subtler, underrepresented forms of harm. Together, these limitations point to the need for a design that explicitly addresses the combined effects of (i) underrepresentation, (ii) noise, and (iii) conceptual ambiguity in both data and model predictions. To address these challenges, we propose a two-stage framework that unifies (i) targeted training procedures to adapt supervision to scarce and noisy data with (ii) selective, reasoning-based inference to handle ambiguous or borderline cases. Our training setup applies class-balanced focal loss, class-aware batching, and post-hoc threshold calibration to mitigate label imbalance and noisy supervision. At inference time, a dynamic routing mechanism classifies high-confidence cases directly and escalates uncertain instances to a novel Collaborative Expert Judgment (CEJ) module, which prompts multiple personas and consolidates their reasoning through a judge model. Our approach achieves state-of-the-art results across several benchmarks, with a +2.72% improvement in F1 on EXIST 2025 Task 1.1 and gains of +4.48% and +1.30% on EDOS Tasks A and B, respectively.
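The routing step described above can be sketched as a simple confidence gate. The margin rule, the threshold values, and both stub models are assumptions made for illustration; the abstract does not specify how confidence is measured or how the CEJ module is prompted.

```python
def route(prob_positive, threshold=0.5, margin=0.2):
    """Return 'direct' for confident scores and 'escalate' for borderline ones.

    Assumed rule: a score closer than `margin` to the decision threshold
    counts as uncertain. Both numbers are hypothetical defaults.
    """
    return "escalate" if abs(prob_positive - threshold) < margin else "direct"


def classify(text, classifier, expert_panel, threshold=0.5, margin=0.2):
    """Classify directly when confident, otherwise defer to the expert panel."""
    prob = classifier(text)
    if route(prob, threshold, margin) == "direct":
        return prob >= threshold, "direct"
    return expert_panel(text), "escalated"


# Hypothetical stand-ins for the fine-tuned classifier and the CEJ module.
def stub_classifier(text):
    return {"clear case": 0.95, "borderline case": 0.55}.get(text, 0.05)


def stub_expert_panel(text):
    votes = [True, True, False]          # pretend persona verdicts
    return sum(votes) > len(votes) / 2   # pretend judge: majority vote


print(classify("clear case", stub_classifier, stub_expert_panel))
print(classify("borderline case", stub_classifier, stub_expert_panel))
```

The design point is cost: only the uncertain slice of inputs pays for the slower multi-persona deliberation, while confident cases keep single-pass latency.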

[10] Break Out the Silverware -- Semantic Understanding of Stored Household Items

Michaela Levi-Richter,Reuth Mirsky,Oren Glickman

Main category: cs.CL

TL;DR: This paper introduces the Stored Household Item Challenge, which evaluates the cognitive ability of service robots to infer where out-of-sight items are stored in a home. It releases two datasets and NOAM, a hybrid method combining vision with a large language model, which achieves prediction performance close to human level.

Motivation: When carrying out a simple instruction such as "bring me a plate", service robots have difficulty using commonsense reasoning to judge where the item is likely stored (drawers, cabinets, and so on). Existing methods cannot reason about non-visible objects, so a new benchmark task is needed to advance robots' cognitive capabilities.

Method: Proposes NOAM (Non-visible Object Allocation Model), which converts visual input into natural language describing the spatial context and visible containers, then uses a large language model (such as GPT-4) to infer the most likely hidden storage location, forming a hybrid agent architecture that fuses vision and language.

Result: On the test set of 100 real-world samples, NOAM significantly outperforms random selection, vision-language pipelines (such as Grounding-DINO + SAM), and leading multimodal models (such as Gemini, GPT-4o, and Kosmos-2), with prediction accuracy close to human performance.

Conclusion: By combining structured scene understanding with LLM inference, NOAM shows an effective path to locating non-visible objects in the home and offers a workable approach and best practices for building service robots with commonsense reasoning.

Abstract: "Bring me a plate." For domestic service robots, this simple command reveals a complex challenge: inferring where everyday items are stored, often out of sight in drawers, cabinets, or closets. Despite advances in vision and manipulation, robots still lack the commonsense reasoning needed to complete this task. We introduce the Stored Household Item Challenge, a benchmark task for evaluating service robots' cognitive capabilities: given a household scene and a queried item, predict its most likely storage location. Our benchmark includes two datasets: (1) a real-world evaluation set of 100 item-image pairs with human-annotated ground truth from participants' kitchens, and (2) a development set of 6,500 item-image pairs annotated with storage polygons over public kitchen images. These datasets support realistic modeling of household organization and enable comparative evaluation across agent architectures. To begin tackling this challenge, we introduce NOAM (Non-visible Object Allocation Model), a hybrid agent pipeline that combines structured scene understanding with large language model inference. NOAM converts visual input into natural language descriptions of spatial context and visible containers, then prompts a language model (e.g., GPT-4) to infer the most likely hidden storage location. This integrated vision-language agent exhibits emergent commonsense reasoning and is designed for modular deployment within broader robotic systems. We evaluate NOAM against baselines including random selection, vision-language pipelines (Grounding-DINO + SAM), leading multimodal models (e.g., Gemini, GPT-4o, Kosmos-2, LLaMA, Qwen), and human performance. NOAM significantly improves prediction accuracy and approaches human-level results, highlighting best practices for deploying cognitively capable agents in domestic environments.

[11] Entropy-Aware Speculative Decoding Toward Improved LLM Reasoning

Tiancheng Su,Meicong Zhang,Guoxiu He

Main category: cs.CL

TL;DR: Proposes EASD, a training-free enhancement to speculative decoding that adds a dynamic entropy-based penalty, improving LLM reasoning performance while preserving decoding efficiency.

Motivation: Existing speculative decoding methods are capped by the target model's own performance because the draft and target models are excessively aligned, which makes better reasoning hard to achieve.

Method: Adds a dynamic entropy penalty to standard speculative decoding. At each decoding step, the entropy of the sampling distribution measures model uncertainty; when both models show high entropy and their top-N predictions overlap substantially, the token is rejected and re-sampled by the target model.

Result: Across several reasoning benchmarks, EASD consistently outperforms existing speculative decoding methods and in most cases surpasses the target model itself, with decoding efficiency comparable to standard SD.

Conclusion: Through draft-model verification and an entropy-aware strategy, EASD can break through the target model's performance ceiling without added training cost, improving reasoning quality and robustness.

Abstract: Speculative decoding (SD) accelerates large language model (LLM) reasoning by using a small draft model to generate candidate tokens, which the target LLM either accepts directly or regenerates upon rejection. However, excessive alignment between the draft and target models constrains SD to the performance of the target LLM. To address this limitation, we propose Entropy-Aware Speculative Decoding (EASD), a training-free enhancement. Building on standard SD, EASD incorporates a dynamic entropy-based penalty. At each decoding step, we employ the entropy of the sampling distribution to quantify model uncertainty. When both models exhibit high entropy with substantial overlap among their top-N predictions, the corresponding token is rejected and re-sampled by the target LLM. This penalty prevents low-confidence errors from propagating. By incorporating draft-model verification, EASD enables the possibility of surpassing the target model's inherent performance. Experiments across multiple reasoning benchmarks demonstrate that EASD consistently outperforms existing SD methods and, in most cases, surpasses the target LLM itself. We further prove that the efficiency of EASD is comparable to that of SD. The code can be found in the Supplementary Materials.
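The rejection condition described above (both models uncertain, with overlapping top-N candidates) can be written as a small predicate. The thresholds, the value of N, and the overlap measure are illustrative assumptions; the abstract does not give the values or exact formulation EASD uses.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def top_n(probs, n):
    """Indices of the n most probable tokens."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(ranked[:n])


def should_reject(draft_probs, target_probs, n=3,
                  entropy_threshold=1.0, overlap_threshold=2 / 3):
    """Assumed EASD-style check with hypothetical thresholds.

    Reject the drafted token when both distributions have high entropy
    AND their top-n token sets overlap substantially.
    """
    both_uncertain = (entropy(draft_probs) > entropy_threshold
                      and entropy(target_probs) > entropy_threshold)
    overlap = len(top_n(draft_probs, n) & top_n(target_probs, n)) / n
    return both_uncertain and overlap >= overlap_threshold


# Toy five-token vocabularies.
uncertain_draft = [0.30, 0.25, 0.20, 0.15, 0.10]
uncertain_target = [0.25, 0.30, 0.20, 0.10, 0.15]
confident_target = [0.97, 0.01, 0.01, 0.01, 0.00]

print(should_reject(uncertain_draft, uncertain_target))
print(should_reject(uncertain_draft, confident_target))
```

In the first call both distributions are flat and share the same top-3 tokens, so the drafted token would be rejected and re-sampled; in the second the target is confident, so the check does not fire.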

[12] MiMo-Audio: Audio Language Models are Few-Shot Learners

Xiaomi LLM-Core Team,Dong Zhang,Gang Wang,Jinlong Xue,Kai Fang,Liang Zhao,Rui Ma,Shuhuai Ren,Shuo Liu,Tao Guo,Weiji Zhuang,Xin Zhang,Xingchen Song,Yihan Yan,Yongzhe He,Cici,Bowen Shen,Chengxuan Zhu,Chong Ma,Chun Chen,Heyu Chen,Jiawei Li,Lei Li,Menghang Zhu,Peidian Li,Qiying Wang,Sirui Deng,Weimin Xiong,Wenshan Huang,Wenyu Yang,Yilin Jiang,Yixin Yang,Yuanyuan Tian,Yue Ma,Yue Yu,Zihan Zhang,Zihao Yue,Bangjun Xiao,Bingquan Xia,Bofei Gao,Bowen Ye,Can Cai,Chang Liu,Chenhong He,Chunan Li,Dawei Zhu,Duo Zhang,Fengyuan Shi,Guoan Wang,Hailin Zhang,Hanglong Lv,Hanyu Li,Hao Tian,Heng Qu,Hongshen Xu,Houbin Zhang,Huaqiu Liu,Jiangshan Duo,Jianguang Zuo,Jianyu Wei,Jiebao Xiao,Jinhao Dong,Jun Shi,Junhao Hu,Kainan Bao,Kang Zhou,Linghao Zhang,Meng Chen,Nuo Chen,Peng Zhang,Qianli Chen,Qiantong Wang,Rang Li,Shaohui Liu,Shengfan Wang,Shicheng Li,Shihua Yu,Shijie Cao,Shimao Chen,Shuhao Gu,Weikun Wang,Wenhan Ma,Xiangwei Deng,Xing Yong,Xing Zhang,Xu Wang,Yifan Song,Yihao Zhao,Yingbo Zhao,Yizhao Gao,Yu Cheng,Yu Tu,Yudong Wang,Zhaojun Huang,Zhengju Tang,Zhenru Lin,Zhichao Song,Zhipeng Xu,Zhixian Zheng,Zihan Jiang

Main category: cs.CL

TL;DR: Through large-scale pretraining, MiMo-Audio acquires few-shot learning ability in the audio domain, reaches open-source SOTA on a range of audio tasks, and shows strong generalization and generation capabilities.

Motivation: Inspired by GPT-3, the work explores whether large-scale next-token-prediction pretraining can generalize in the audio domain, removing the traditional reliance on task-specific fine-tuning.

Method: Scales MiMo-Audio's pretraining data to over one hundred million hours and systematically evaluates its few-shot learning; in post-training, curates a diverse instruction-tuning corpus and introduces thinking mechanisms.

Result: MiMo-Audio-7B-Base reaches open-source SOTA on speech intelligence and audio understanding benchmarks, generalizes to unseen tasks such as voice conversion and style transfer, and produces high-quality speech continuations. MiMo-Audio-7B-Instruct approaches or surpasses closed-source models on several audio understanding, spoken dialogue, and instruct-TTS tasks.

Conclusion: Large-scale pretraining gives audio language models strong generalization and few-shot learning; instruction tuning and thinking mechanisms improve performance further, moving large audio models toward general-purpose AI.

Abstract: Existing audio language models typically rely on task-specific fine-tuning to accomplish particular audio tasks. In contrast, humans are able to generalize to new audio tasks with only a few examples or simple instructions. GPT-3 has shown that scaling next-token prediction pretraining enables strong generalization capabilities in text, and we believe this paradigm is equally applicable to the audio domain. By scaling MiMo-Audio's pretraining data to over one hundred million hours, we observe the emergence of few-shot learning capabilities across a diverse set of audio tasks. We develop a systematic evaluation of these capabilities and find that MiMo-Audio-7B-Base achieves SOTA performance on both speech intelligence and audio understanding benchmarks among open-source models. Beyond standard metrics, MiMo-Audio-7B-Base generalizes to tasks absent from its training data, such as voice conversion, style transfer, and speech editing. MiMo-Audio-7B-Base also demonstrates powerful speech continuation capabilities, capable of generating highly realistic talk shows, recitations, livestreaming and debates. At the post-training stage, we curate a diverse instruction-tuning corpus and introduce thinking mechanisms into both audio understanding and generation. MiMo-Audio-7B-Instruct achieves open-source SOTA on audio understanding benchmarks (MMSU, MMAU, MMAR, MMAU-Pro), spoken dialogue benchmarks (Big Bench Audio, MultiChallenge Audio) and instruct-TTS evaluations, approaching or surpassing closed-source models. Model checkpoints and full evaluation suite are available at https://github.com/XiaomiMiMo/MiMo-Audio.

[13] StressRoBERTa: Cross-Condition Transfer Learning from Depression, Anxiety, and PTSD to Stress Detection

Amal Alqahtani,Efsun Kayi,Mona Diab

Main category: cs.CL

TL;DR: This paper proposes StressRoBERTa, a cross-condition transfer learning approach for automatically detecting self-reported chronic stress in English tweets; continual training on clinical conditions related to stress improves detection performance.

Motivation: Chronic stress is a major public health concern and social media has become an important place where people share their experiences of stress, so effective methods are needed to identify this self-reported stress content automatically.

Method: Continually trains RoBERTa on Stress-SMHD, a corpus covering clinical conditions highly comorbid with stress (depression, anxiety, PTSD), then fine-tunes on the SMM4H 2022 Task 8 dataset to improve detection of chronic stress.

Result: StressRoBERTa reaches an F1 score of 82% on SMM4H 2022 Task 8, 3 percentage points above the best shared-task system, and performs well on Dreaddit (81% F1), confirming its transfer ability.

Conclusion: Focused cross-condition transfer from stress-related disorders yields stronger representations than general mental health training and helps automatic detection of chronic stress.

Abstract: The prevalence of chronic stress represents a significant public health concern, with social media platforms like Twitter serving as important venues for individuals to share their experiences. This paper introduces StressRoBERTa, a cross-condition transfer learning approach for automatic detection of self-reported chronic stress in English tweets. The investigation examines whether continual training on clinically related conditions (depression, anxiety, PTSD), disorders with high comorbidity with chronic stress, improves stress detection compared to general language models and broad mental health models. RoBERTa is continually trained on the Stress-SMHD corpus (108M words from users with self-reported diagnoses of depression, anxiety, and PTSD) and fine-tuned on the SMM4H 2022 Task 8 dataset. StressRoBERTa achieves 82% F1-score, outperforming the best shared task system (79% F1) by 3 percentage points. The results demonstrate that focused cross-condition transfer from stress-related disorders (+1% F1 over vanilla RoBERTa) provides stronger representations than general mental health training. Evaluation on Dreaddit (81% F1) further demonstrates transfer from clinical mental health contexts to situational stress discussions.

[14] Explaining News Bias Detection: A Comparative SHAP Analysis of Transformer Model Decision Mechanisms

Himel Ghosh

Main category: cs.CL

TL;DR: This study uses SHAP explanations to compare the interpretability of two Transformer-based bias detection models, finding that the domain-adapted RoBERTa model produces fewer false positives and that its attribution patterns agree better with its predictions.

Motivation: Automated bias detection in news text lacks transparency; little is known about how models decide or why they fail, so the attribution mechanisms of different models need closer analysis to improve reliability.

Method: Uses SHAP-based explanations to compare word-level attributions of a bias detector fine-tuned on the BABE dataset and a domain-adapted pre-trained RoBERTa model.

Result: Both models attend to similar evaluative language but differ substantially in how they integrate these signals. The bias detector assigns stronger evidence to false positives, leading it to over-flag neutral content, whereas the domain-adapted model produces 63% fewer false positives with more sensible attributions. Error analysis shows false positives are driven mainly by discourse-level ambiguity rather than explicit bias cues.

Conclusion: Architecture and training strategy strongly affect the reliability and deployment suitability of bias detection systems, which argues for interpretability-aware evaluation in system design, especially in journalism.

Abstract: Automated bias detection in news text is heavily used to support journalistic analysis and media accountability, yet little is known about how bias detection models arrive at their decisions or why they fail. In this work, we present a comparative interpretability study of two transformer-based bias detection models: a bias detector fine-tuned on the BABE dataset and a domain-adapted pre-trained RoBERTa model fine-tuned on the BABE dataset, using SHAP-based explanations. We analyze word-level attributions across correct and incorrect predictions to characterize how different model architectures operationalize linguistic bias. Our results show that although both models attend to similar categories of evaluative language, they differ substantially in how these signals are integrated into predictions. The bias detector model assigns stronger internal evidence to false positives than to true positives, indicating a misalignment between attribution strength and prediction correctness and contributing to systematic over-flagging of neutral journalistic content. In contrast, the domain-adaptive model exhibits attribution patterns that better align with prediction outcomes and produces 63% fewer false positives. We further demonstrate that model errors arise from distinct linguistic mechanisms, with false positives driven by discourse-level ambiguity rather than explicit bias cues. These findings highlight the importance of interpretability-aware evaluation for bias detection systems and suggest that architectural and training choices critically affect both model reliability and deployment suitability in journalistic contexts.

[15] Retrieval Augmented Question Answering: When Should LLMs Admit Ignorance?

Dingmin Wang,Ji Ma,Shankar Kumar

Main category: cs.CL

TL;DR: This paper studies the challenges that expanded context windows bring to retrieval-augmented question answering with large language models and proposes an adaptive prompting strategy that processes retrieved information in chunks to balance relevant against irrelevant information, preserving performance while using fewer tokens.

Motivation: Long contexts make it easier to bring in target knowledge but also include more irrelevant information, which hurts generation quality, so a more effective way of using the information is needed.

Method: Designs an adaptive prompting strategy that splits the retrieved information into smaller chunks and prompts the LLM with each chunk in turn to answer the question; the chunk size trades off completeness of information against distraction.

Result: On three open-domain QA datasets the strategy matches standard prompting while using fewer tokens. The analysis also finds that the model tends to produce wrong answers rather than decline when information is insufficient.

Conclusion: The adaptive strategy eases information overload in long contexts and exposes LLMs' weak ability to decline, pointing to the need to strengthen cautious responses when information is insufficient.

Abstract: The success of expanded context windows in Large Language Models (LLMs) has driven increased use of broader context in retrieval-augmented generation. We investigate the use of LLMs for retrieval augmented question answering. While longer contexts make it easier to incorporate targeted knowledge, they introduce more irrelevant information that hinders the model's generation process and degrades its performance. To address the issue, we design an adaptive prompting strategy which involves splitting the retrieved information into smaller chunks and sequentially prompting an LLM to answer the question using each chunk. Adjusting the chunk size allows a trade-off between incorporating relevant information and reducing irrelevant information. Experimental results on three open-domain question answering datasets demonstrate that the adaptive strategy matches the performance of standard prompting while using fewer tokens. Our analysis reveals that when encountering insufficient information, the LLM often generates incorrect answers instead of declining to respond, which constitutes a major source of error. This finding highlights the need for further research into enhancing LLMs' ability to effectively decline requests when faced with inadequate information.
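The chunk-by-chunk prompting idea can be sketched as below. Stopping at the first chunk that yields an answer, the `UNKNOWN` abstention marker, the word-count cost proxy, and the stub `ask` function are all assumptions for illustration; the abstract states only that chunks are prompted sequentially and that chunk size is the tuning knob.

```python
def chunk(passages, size):
    """Split the retrieved passages into consecutive groups of `size`."""
    return [passages[i:i + size] for i in range(0, len(passages), size)]


def answer_with_chunks(question, passages, ask, chunk_size=2, abstain="UNKNOWN"):
    """Prompt with one chunk at a time; stop at the first non-abstaining reply.

    Early stopping is an assumed design choice, not confirmed by the paper.
    Returns (answer, words_sent), where words_sent is a rough cost proxy.
    """
    words_sent = 0
    for group in chunk(passages, chunk_size):
        context = "\n".join(group)
        words_sent += len(context.split()) + len(question.split())
        reply = ask(question, context)
        if reply != abstain:
            return reply, words_sent
    return abstain, words_sent


def stub_ask(question, context):
    """Hypothetical stand-in for an LLM that declines when the context lacks the answer."""
    return "Paris" if "Paris" in context else "UNKNOWN"


retrieved = [
    "Berlin is in Germany.",
    "Rome is in Italy.",
    "Paris is the capital of France.",
    "Madrid is in Spain.",
]
print(answer_with_chunks("What is the capital of France?", retrieved, stub_ask))
```

Note that the sketch only works because the stub declines reliably. The abstract's error analysis points at exactly this weak spot: a real model that answers incorrectly instead of declining would end the loop on the wrong chunk.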

[16] Adversarial Lens: Exploiting Attention Layers to Generate Adversarial Examples for Evaluation

Kaustubh Dhole

Main category: cs.CL

TL;DR: This paper proposes a new method for generating adversarial examples from the token distributions of intermediate attention layers, using the model's internal generation mechanism to produce semantically plausible and consistent perturbations, and validates its disruptive effect on an evaluation task with LLaMA-3.1-Instruct-8B.

Motivation: To explore whether intermediate-layer representations of LLMs can be used to build more natural and effective adversarial perturbations for testing the robustness of evaluation systems.

Method: Extracts token distributions from intermediate attention layers and uses them as adversarial perturbations by directly substituting tokens in the original input, rather than relying on prompt engineering or gradient attacks.

Result: The method measurably lowers performance on the downstream task (argument quality assessment) while keeping inputs semantically similar; substitutions at some layers and positions, however, degrade grammar.

Conclusion: Intermediate-layer representations are a promising principled source of adversarial examples, but their practical effectiveness is limited by grammatical coherence and needs further work.

Abstract: Recent advances in mechanistic interpretability suggest that intermediate attention layers encode token-level hypotheses that are iteratively refined toward the final output. In this work, we exploit this property to generate adversarial examples directly from attention-layer token distributions. Unlike prompt-based or gradient-based attacks, our approach leverages model-internal token predictions, producing perturbations that are both plausible and internally consistent with the model's own generation process. We evaluate whether tokens extracted from intermediate layers can serve as effective adversarial perturbations for downstream evaluation tasks. We conduct experiments on argument quality assessment using the ArgQuality dataset, with LLaMA-3.1-Instruct-8B serving as both the generator and evaluator. Our results show that attention-based adversarial examples lead to measurable drops in evaluation performance while remaining semantically similar to the original inputs. However, we also observe that substitutions drawn from certain layers and token positions can introduce grammatical degradation, limiting their practical effectiveness. Overall, our findings highlight both the promise and current limitations of using intermediate-layer representations as a principled source of adversarial examples for stress-testing LLM-based evaluation pipelines.

[17] Integrating Domain Knowledge for Financial QA: A Multi-Retriever RAG Approach with LLMs

Yukun Zhang,Stefan Elbl Droguett,Samyak Jain

Main category: cs.CL

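TL;DR: This study proposes a multi-retriever RAG system that combines domain-specific training with the latest large language models to improve financial numerical-reasoning question answering, outperforming both the baseline and the previous best model.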

Details Motivation: Lacking financial domain expertise, current large language models perform poorly on financial numerical-reasoning questions; this work addresses the problem by bringing in external knowledge and domain-specific training. Method: A multi-retriever retrieval-augmented generation (RAG) system is built, with domain-specific training of a SecBERT encoder, and the latest LLM is used to construct both a neural symbolic model and a prompt-based generator. Result: Domain-specific training markedly improves performance, surpassing the top model of the FinQA paper; the best prompt-based generator reaches state-of-the-art performance with an improvement of more than 7%, though it remains below human expert level. Conclusion: Domain-specific training and external knowledge retrieval improve financial numerical reasoning, larger models benefit more from external knowledge, and the latest LLM shows stronger potential for few-shot numerical reasoning. Abstract: This research project addresses the errors of financial numerical reasoning Question Answering (QA) tasks due to the lack of domain knowledge in finance. Despite recent advances in Large Language Models (LLMs), financial numerical questions remain challenging because they require specific domain knowledge in finance and complex multi-step numeric reasoning. We implement a multi-retriever Retrieval-Augmented Generation (RAG) system to retrieve both external domain knowledge and internal question contexts, and utilize the latest LLM to tackle these tasks. Through comprehensive ablation experiments and error analysis, we find that domain-specific training with the SecBERT encoder significantly contributes to our best neural symbolic model surpassing the FinQA paper's top model, which serves as our baseline. This suggests the potential superior performance of domain-specific training. Furthermore, our best prompt-based LLM generator achieves the state-of-the-art (SOTA) performance with significant improvement (>7%), yet it is still below the human expert performance. This study highlights the trade-off between hallucination loss and external knowledge gains in smaller models and few-shot examples. For larger models, the gains from external facts typically outweigh the hallucination loss. Finally, our findings confirm the enhanced numerical reasoning capabilities of the latest LLM, optimized for few-shot learning.

[18] Disentangling Learning from Judgment: Representation Learning for Open Response Analytics

Conrad Borchers,Manit Patel,Seiyon M. Lee,Anthony F. Botelho

Main category: cs.CL

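TL;DR: Proposes an analytics framework that separates the content signal in student responses from teachers' grading tendencies, using sentence embeddings and dynamic priors to expose grading bias and make automated scoring more interpretable.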

Details Motivation: Existing automated scoring methods often conflate what students actually wrote with teachers' grading habits, making assessments less transparent and fair; separating the two improves objectivity and the value of grading for pedagogical reflection. Method: Using de-identified ASSISTments mathematics responses, teacher grading histories are modeled as dynamic priors and text is represented with sentence embeddings, with centering and residualization to remove prompt and teacher confounds; temporally validated linear models quantify each signal's contribution, and a projection surfaces disagreement cases for inspection. Result: Teacher priors strongly influence grade predictions; the model combining priors with content embeddings performs best (AUC~0.815), while the content-only model is weaker but still above chance (AUC~0.626); content representations adjusted for rater effects are more informative and reveal semantic evidence of understanding rather than surface differences. Conclusion: The framework improves the accuracy and interpretability of automated scoring and also serves as a learning analytics tool that helps teachers and researchers check whether grading practices align with evidence of student thinking. Abstract: Open-ended responses are central to learning, yet automated scoring often conflates what students wrote with how teachers grade. We present an analytics-first framework that separates content signals from rater tendencies, making judgments visible and auditable via analytics. Using de-identified ASSISTments mathematics responses, we model teacher histories as dynamic priors and derive text representations from sentence embeddings, incorporating centering and residualization to mitigate prompt and teacher confounds. Temporally-validated linear models quantify the contributions of each signal, and a projection surfaces model disagreements for qualitative inspection. Results show that teacher priors heavily influence grade predictions; the strongest results arise when priors are combined with content embeddings (AUC~0.815), while content-only models remain above chance but substantially weaker (AUC~0.626). Adjusting for rater effects sharpens the residual content representation, retaining more informative embedding dimensions and revealing cases where semantic evidence supports understanding as opposed to surface-level differences in how students respond. The contribution presents a practical pipeline that transforms embeddings from mere features into learning analytics for reflection, enabling teachers and researchers to examine where grading practices align (or conflict) with evidence of student reasoning and learning.

[19] Improving Multi-step RAG with Hypergraph-based Memory for Long-Context Complex Relational Modeling

Chulun Zhou,Chunkang Zhang,Guoxin Yu,Fandong Meng,Jie Zhou,Wai Lam,Mo Yu

Main category: cs.CL

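TL;DR: This paper proposes HGMem, a hypergraph-based dynamic memory mechanism that strengthens high-order reasoning and global understanding in multi-step retrieval-augmented generation (RAG) systems.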

Details Motivation: Memory modules in existing RAG systems mostly act as static storage and do not model high-order correlations among primitive facts, which limits their performance on complex reasoning. Method: HGMem is a hypergraph-structured memory mechanism in which memory units are represented as hyperedges, supporting the step-by-step construction of higher-order interactions into an integrated knowledge structure that guides subsequent reasoning steps. Result: Experiments on several challenging datasets that require global understanding show that HGMem substantially outperforms strong baseline systems and yields consistent gains on multi-step RAG tasks. Conclusion: Through a dynamic, structured memory representation, HGMem strengthens the global awareness and coherent reasoning of LLMs in complex reasoning and knowledge evolution. Abstract: Multi-step retrieval-augmented generation (RAG) has become a widely adopted strategy for enhancing large language models (LLMs) on tasks that demand global comprehension and intensive reasoning. Many RAG systems incorporate a working memory module to consolidate retrieved information. However, existing memory designs function primarily as passive storage that accumulates isolated facts for the purpose of condensing the lengthy inputs and generating new sub-queries through deduction. This static nature overlooks the crucial high-order correlations among primitive facts, the compositions of which can often provide stronger guidance for subsequent steps. Therefore, their representational strength and impact on multi-step reasoning and knowledge evolution are limited, resulting in fragmented reasoning and weak global sense-making capacity in extended contexts. We introduce HGMem, a hypergraph-based memory mechanism that extends the concept of memory beyond simple storage into a dynamic, expressive structure for complex reasoning and global understanding. In our approach, memory is represented as a hypergraph whose hyperedges correspond to distinct memory units, enabling the progressive formation of higher-order interactions within memory. This mechanism connects facts and thoughts around the focal problem, evolving into an integrated and situated knowledge structure that provides strong propositions for deeper reasoning in subsequent steps. We evaluate HGMem on several challenging datasets designed for global sense-making. Extensive experiments and in-depth analyses show that our method consistently improves multi-step RAG and substantially outperforms strong baseline systems across diverse tasks.

[20] Efficient Context Scaling with LongCat ZigZag Attention

Chen Zhang,Yang Bai,Jiahuan Li,Anchun Gui,Keheng Wang,Feifan Liu,Guanyu Wu,Yuwei Jiang,Defei Bu,Li Wei,Haihang Jing,Hongyin Tang,Xin Chen,Xiangzhou Huang,Fengcun Li,Rongxiang Weng,Yulei Qian,Yifan Lu,Yerui Sun,Jingang Wang,Yuchen Xie,Xunliang Cai

Main category: cs.CL

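TL;DR: This paper proposes LongCat ZigZag Attention (LoZA), a sparse attention mechanism that efficiently converts full-attention models into sparse versions, substantially speeding up long-context inference; applied to LongCat-Flash, it supports processing of up to one million tokens.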

Details Motivation: To improve inference efficiency in long-context scenarios under a limited compute budget and to address the high computational cost of full attention. Method: A sparse attention scheme, LongCat ZigZag Attention (LoZA), is applied to an existing model (LongCat-Flash) during mid-training to turn it into an efficient sparse-attention model. Result: LoZA delivers significant speed-ups on both prefill-intensive and decode-intensive workloads, supports contexts of up to 1 million tokens, and improves long-term reasoning and long-horizon agentic capabilities. Conclusion: LoZA is an efficient, easily integrated sparse attention method that markedly improves the performance and practicality of large models in extremely long contexts. Abstract: We introduce LongCat ZigZag Attention (LoZA), which is a sparse attention scheme designed to transform any existing full-attention models into sparse versions with a rather limited compute budget. In long-context scenarios, LoZA can achieve significant speed-ups both for prefill-intensive (e.g., retrieval-augmented generation) and decode-intensive (e.g., tool-integrated reasoning) cases. Specifically, by applying LoZA to LongCat-Flash during mid-training, we serve LongCat-Flash-Exp as a long-context foundation model that can swiftly process up to 1 million tokens, enabling efficient long-term reasoning and long-horizon agentic capabilities.

[21] CEC-Zero: Zero-Supervision Character Error Correction with Self-Generated Rewards

Zhiming Lin,Kai Zhao,Sophie Zhang,Peilai Yu,Canran Xiao

Main category: cs.CL

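TL;DR: CEC-Zero is a zero-supervision reinforcement learning framework that improves Chinese spelling correction by letting large language models correct their own mistakes, substantially outperforming existing methods on 9 benchmarks.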

Details Motivation: Existing LLMs and supervised methods are not robust to novel errors and depend on costly annotations, leaving large-scale Chinese spelling correction without an effective solution. Method: The CEC-Zero framework synthesizes errorful text from clean text, computes cluster-consensus rewards from semantic similarity and candidate agreement, and optimizes the policy with PPO. Result: Across 9 benchmarks it beats supervised baselines by 10-13 F1 points and strong LLM fine-tunes by 5-8 points, with theoretical guarantees of unbiased rewards and convergence. Conclusion: CEC-Zero establishes a label-free paradigm for robust, scalable Chinese spelling correction and unlocks the potential of LLMs in noisy text pipelines. Abstract: Large-scale Chinese spelling correction (CSC) remains critical for real-world text processing, yet existing LLMs and supervised methods lack robustness to novel errors and rely on costly annotations. We introduce CEC-Zero, a zero-supervision reinforcement learning framework that addresses this by enabling LLMs to correct their own mistakes. CEC-Zero synthesizes errorful inputs from clean text, computes cluster-consensus rewards via semantic similarity and candidate agreement, and optimizes the policy with PPO. It outperforms supervised baselines by 10-13 F1 points and strong LLM fine-tunes by 5-8 points across 9 benchmarks, with theoretical guarantees of unbiased rewards and convergence. CEC-Zero establishes a label-free paradigm for robust, scalable CSC, unlocking LLM potential in noisy text pipelines.

[22] Fantastic Reasoning Behaviors and Where to Find Them: Unsupervised Discovery of the Reasoning Process

Zhenyu Zhang,Shujian Zhang,John Lambert,Wenxuan Zhou,Zhangyang Wang,Mingqing Chen,Andrew Hard,Rajiv Mathews,Lun Wang

Main category: cs.CL

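TL;DR: Proposes RISE, an unsupervised framework that uses sparse auto-encoders to discover interpretable reasoning vectors in activation space, revealing and controllably steering reasoning behaviors in large language models.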

Details Motivation: Existing methods rely on human-defined concepts to analyze LLM reasoning, which cannot capture the full range of reasoning behaviors and is constrained by word-level supervision. Method: Chain-of-thought traces are segmented into sentence-level steps, sparse auto-encoders (SAEs) are trained on step-level activations, disentangled feature vectors are extracted to identify directions corresponding to different reasoning behaviors, and their interpretability and controllability are verified through visualization and intervention experiments. Result: Interpretable behaviors such as reflection and backtracking are separated, hidden patterns related to response length and confidence are discovered, and specific reasoning behaviors can be selectively amplified or suppressed. Conclusion: RISE demonstrates the potential of unsupervised representation learning for interpreting and controllably steering LLM reasoning, and supports the discovery of and intervention on new reasoning behaviors. Abstract: Despite the growing reasoning capabilities of recent large language models (LLMs), their internal mechanisms during the reasoning process remain underexplored. Prior approaches often rely on human-defined concepts (e.g., overthinking, reflection) at the word level to analyze reasoning in a supervised manner. However, such methods are limited, as it is infeasible to capture the full spectrum of potential reasoning behaviors, many of which are difficult to define in token space. In this work, we propose an unsupervised framework (namely, RISE: Reasoning behavior Interpretability via Sparse auto-Encoder) for discovering reasoning vectors, which we define as directions in the activation space that encode distinct reasoning behaviors. By segmenting chain-of-thought traces into sentence-level 'steps' and training sparse auto-encoders (SAEs) on step-level activations, we uncover disentangled features corresponding to interpretable behaviors such as reflection and backtracking. Visualization and clustering analyses show that these behaviors occupy separable regions in the decoder column space. Moreover, targeted interventions on SAE-derived vectors can controllably amplify or suppress specific reasoning behaviors, altering inference trajectories without retraining. Beyond behavior-specific disentanglement, SAEs capture structural properties such as response length, revealing clusters of long versus short reasoning traces. More interestingly, SAEs enable the discovery of novel behaviors beyond human supervision. We demonstrate the ability to control response confidence by identifying confidence-related vectors in the SAE decoder space. These findings underscore the potential of unsupervised latent discovery for both interpreting and controllably steering reasoning in LLMs.

[23] WISE: Web Information Satire and Fakeness Evaluation

Gaurab Chhetri,Subasish Das,Tausif Islam Chowdhury

Main category: cs.CL

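TL;DR: This study proposes the WISE framework for distinguishing fake news from satire. Evaluating eight lightweight Transformer models and two baseline models on 20,000 samples, it finds that MiniLM achieves the best accuracy (87.58%), RoBERTa-base the best ROC-AUC (95.42%), and DistilBERT a good balance between efficiency and accuracy.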

Details Motivation: Fake news and satire share similar linguistic features but differ in intent, so telling them apart automatically is challenging and important for misinformation detection. Method: The WISE framework evaluates several lightweight Transformer models on 20,000 balanced samples from the Fakeddit dataset with stratified 5-fold cross-validation, using accuracy, F1-score, ROC-AUC, and other metrics. Result: MiniLM reaches the highest accuracy (87.58%), RoBERTa-base the highest ROC-AUC (95.42%) with strong accuracy (87.36%), and DistilBERT balances efficiency and performance (86.28% accuracy, 93.90% ROC-AUC); statistical tests show significant differences between models. Conclusion: Lightweight models can match or exceed baseline models in resource-constrained settings, making them suitable for real-world misinformation detection systems. Abstract: Distinguishing fake or untrue news from satire or humor poses a unique challenge due to their overlapping linguistic features and divergent intent. This study develops the WISE (Web Information Satire and Fakeness Evaluation) framework, which benchmarks eight lightweight transformer models alongside two baseline models on a balanced dataset of 20,000 samples from Fakeddit, annotated as either fake news or satire. Using stratified 5-fold cross-validation, we evaluate models across comprehensive metrics including accuracy, precision, recall, F1-score, ROC-AUC, PR-AUC, MCC, Brier score, and Expected Calibration Error. Our evaluation reveals that MiniLM, a lightweight model, achieves the highest accuracy (87.58%) among all models, while RoBERTa-base achieves the highest ROC-AUC (95.42%) and strong accuracy (87.36%). DistilBERT offers an excellent efficiency-accuracy trade-off with 86.28% accuracy and 93.90% ROC-AUC. Statistical tests confirm significant performance differences between models, with paired t-tests and McNemar tests providing rigorous comparisons. Our findings highlight that lightweight models can match or exceed baseline performance, offering actionable insights for deploying misinformation detection systems in real-world, resource-constrained settings.
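
Two of the calibration metrics named in this abstract, the Brier score and Expected Calibration Error, can be computed for a binary classifier roughly as below. This is a generic sketch of the standard definitions, not the WISE evaluation code; the equal-width binning and the default of 10 bins are assumptions.

```python
from typing import List


def brier_score(probs: List[float], labels: List[int]) -> float:
    """Mean squared difference between the predicted probability and the 0/1 label."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(
    probs: List[float], labels: List[int], n_bins: int = 10
) -> float:
    """ECE for a binary classifier using equal-width confidence bins (assumed binning)."""
    bins: List[List[tuple]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        pred = 1 if p >= 0.5 else 0
        conf = p if pred == 1 else 1.0 - p  # confidence in the predicted class
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if pred == y else 0.0))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(a for _, a in members) / len(members)
        # Weight each bin's confidence/accuracy gap by its share of samples.
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece
```

Both metrics are lower-is-better: a model that predicts 0.9 for two samples but gets only one right has a confidence/accuracy gap of 0.4 in that bin.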

[24] iCLP: Large Language Model Reasoning with Implicit Cognition Latent Planning

Sijia Chen,Di Niu

Main category: cs.CL

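TL;DR: This paper proposes the iCLP framework, which distills latent plans (LPs) from explicit reasoning paths so that large language models plan in an implicit latent space while reasoning in language space, improving accuracy and efficiency and generalizing across domains.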

Details Motivation: Because large language models are prone to hallucination and task questions are highly diverse, generating accurate textual plans is difficult, so a more stable and generalizable planning mechanism is needed. Method: iCLP first distills explicit plans from existing step-by-step reasoning trajectories, then learns discrete representations of these plans with a vector-quantized autoencoder and a codebook, and finally fine-tunes the LLM on paired latent plans and corresponding reasoning steps so that it learns to plan implicitly. Result: Experiments on mathematical reasoning and code generation show that iCLP significantly improves accuracy and reasoning efficiency and generalizes strongly across domains while preserving the interpretability of chain-of-thought reasoning. Conclusion: By introducing an implicit cognition mechanism, iCLP lets LLMs adaptively generate compact reasoning plans in latent space, improving reasoning performance and generalization on complex tasks. Abstract: Large language models (LLMs), when guided by explicit textual plans, can perform reliable step-by-step reasoning during problem-solving. However, generating accurate and effective textual plans remains challenging due to LLM hallucinations and the high diversity of task-specific questions. To address this, we draw inspiration from human Implicit Cognition (IC), the subconscious process by which decisions are guided by compact, generalized patterns learned from past experiences without requiring explicit verbalization. We propose iCLP, a novel framework that enables LLMs to adaptively generate latent plans (LPs), which are compact encodings of effective reasoning instructions. iCLP first distills explicit plans from existing step-by-step reasoning trajectories. It then learns discrete representations of these plans via a vector-quantized autoencoder coupled with a codebook. Finally, by fine-tuning LLMs on paired latent plans and corresponding reasoning steps, the models learn to perform implicit planning during reasoning. Experimental results on mathematical reasoning and code generation tasks demonstrate that, with iCLP, LLMs can plan in latent space while reasoning in language space. This approach yields significant improvements in both accuracy and efficiency and, crucially, demonstrates strong cross-domain generalization while preserving the interpretability of chain-of-thought reasoning.
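
The core lookup behind a vector-quantized codebook, as mentioned in this abstract, is a nearest-neighbor search: a continuous encoding is replaced by the index of its closest codebook entry. The sketch below shows only that generic step with a hand-written toy codebook; how iCLP trains its encoder and codebook is not described in the abstract and is not modeled here.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    vector: Sequence[float], codebook: List[Sequence[float]]
) -> Tuple[int, Sequence[float]]:
    """Return the index and the entry of the codebook vector nearest to `vector`."""
    index = min(
        range(len(codebook)),
        key=lambda i: squared_distance(vector, codebook[i]),
    )
    return index, codebook[index]
```

The returned index is the discrete code; a sequence of such indices is what makes a plan representation compact.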

[25] Beyond Hallucinations: A Composite Score for Measuring Reliability in Open-Source Large Language Models

Rohit Kumar Salla,Manoj Saravanan,Shrikar Reddy Kota

Main category: cs.CL

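TL;DR: Proposes the Composite Reliability Score (CRS), a unified evaluation framework that jointly measures the reliability of large language models in terms of calibration, robustness, and uncertainty quantification.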

Details Motivation: Existing reliability evaluations of large language models are fragmented and address only isolated aspects, so they fail to reflect real-world behavior in decision-critical domains such as healthcare, law, and finance. Method: Calibration, robustness, and uncertainty quantification are integrated into a single interpretable metric, CRS, and experiments are run on ten leading open-source LLMs across five QA datasets under baselines, perturbations, and calibration methods. Result: CRS produces stable model rankings, uncovers hidden failure modes missed by single metrics, and shows that the most reliable systems balance accuracy, robustness, and calibrated uncertainty. Conclusion: CRS offers a more comprehensive, unified framework for evaluating LLM reliability and supports trustworthy use in decision-critical domains. Abstract: Large Language Models (LLMs) like LLaMA, Mistral, and Gemma are increasingly used in decision-critical domains such as healthcare, law, and finance, yet their reliability remains uncertain. They often make overconfident errors, degrade under input shifts, and lack clear uncertainty estimates. Existing evaluations are fragmented, addressing only isolated aspects. We introduce the Composite Reliability Score (CRS), a unified framework that integrates calibration, robustness, and uncertainty quantification into a single interpretable metric. Through experiments on ten leading open-source LLMs across five QA datasets, we assess performance under baselines, perturbations, and calibration methods. CRS delivers stable model rankings, uncovers hidden failure modes missed by single metrics, and highlights that the most dependable systems balance accuracy, robustness, and calibrated uncertainty.

[26] HY-MT1.5 Technical Report

Mao Zheng,Zheng Li,Tao Chen,Mingyang Song,Di Wang

Main category: cs.CL

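TL;DR: This report introduces the new machine translation models HY-MT1.5-1.8B and HY-MT1.5-7B, trained with a multi-stage framework; they perform strongly across translation tasks, reaching or approaching the state of the art in parameter efficiency and on specific language pairs.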

Details Motivation: To improve the performance and parameter efficiency of machine translation models, in particular to surpass existing open-source and commercial systems on Chinese-foreign, English-foreign, and minority-language translation. Method: A multi-stage training framework that combines general and MT-oriented pre-training, supervised fine-tuning, on-policy distillation, and reinforcement learning. Result: HY-MT1.5-1.8B outperforms larger open-source models (e.g., Tower-Plus-72B) and commercial APIs on multiple benchmarks and reaches about 90% of Gemini-3.0-Pro's performance; HY-MT1.5-7B reaches 95% of Gemini-3.0-Pro's performance on Flores-200 and surpasses it on WMT25 and minority-language test sets. The series also supports advanced features such as terminology control, context-aware translation, and format preservation. Conclusion: The HY-MT1.5 models deliver excellent translation quality and robustness for their parameter scales, offering efficient solutions for general and specialized translation. Abstract: In this report, we introduce our latest translation models, HY-MT1.5-1.8B and HY-MT1.5-7B, a new family of machine translation models developed through a holistic training framework tailored for high-performance translation. Our methodology orchestrates a multi-stage pipeline that integrates general and MT-oriented pre-training, supervised fine-tuning, on-policy distillation, and reinforcement learning. The 1.8B-parameter model, HY-MT1.5-1.8B, demonstrates remarkable parameter efficiency, comprehensively outperforming significantly larger open-source baselines (e.g., Tower-Plus-72B, Qwen3-32B) and mainstream commercial APIs (e.g., Microsoft Translator, Doubao Translator) in standard Chinese-foreign and English-foreign tasks. It achieves approximately 90% of the performance of ultra-large proprietary models such as Gemini-3.0-Pro; while marginally trailing Gemini-3.0-Pro on WMT25 and Mandarin-minority language benchmarks, it maintains a substantial lead over other competing models. Furthermore, HY-MT1.5-7B establishes a new state-of-the-art for its size class, achieving 95% of Gemini-3.0-Pro's performance on Flores-200 and surpassing it on the challenging WMT25 and Mandarin-minority language test sets. Beyond standard translation, the HY-MT1.5 series supports advanced constraints, including terminology intervention, context-aware translation, and format preservation. Extensive empirical evaluations confirm that both models offer highly competitive, robust solutions for general and specialized translation tasks within their respective parameter scales.

[27] Training a Huggingface Model on AWS Sagemaker (Without Tears)

Liling Tan

Main category: cs.CL

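TL;DR: This paper gathers the essential information researchers need to train their first Hugging Face model on AWS SageMaker from scratch, with the aim of making cloud computing more accessible.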

Details Motivation: Lacking on-premise computing resources, many researchers turn to cloud services to train models, but the steep learning curve and scattered documentation of cloud platforms create a barrier. Method: The paper consolidates the key information and steps into a simplified guide that helps researchers bridge the knowledge gaps they encounter when training Hugging Face models on AWS SageMaker. Result: An easy-to-follow tutorial that lets researchers train a Hugging Face model on AWS SageMaker from scratch. Conclusion: The work lowers the barrier to cloud adoption for researchers and helps democratize large language model development. Abstract: The development of Large Language Models (LLMs) has primarily been driven by resource-rich research groups and industry partners. Due to the lack of on-premise computing resources required for increasingly complex models, many researchers are turning to cloud services like AWS SageMaker to train Hugging Face models. However, the steep learning curve of cloud platforms often presents a barrier for researchers accustomed to local environments. Existing documentation frequently leaves knowledge gaps, forcing users to seek fragmented information across the web. This demo paper aims to democratize cloud adoption by centralizing the essential information required for researchers to successfully train their first Hugging Face model on AWS SageMaker from scratch.

[28] Activation Steering for Masked Diffusion Language Models

Adi Shnaidman,Erin Feiglin,Osher Yaari,Efrat Mentel,Amit Levi,Raz Lapid

Main category: cs.CL

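TL;DR: Proposes an activation-steering framework for masked diffusion language models (MDLMs) that computes layer-wise steering vectors from a single forward pass, enabling efficient inference-time control.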

Details Motivation: Existing MDLMs lack effective mechanisms for inference-time control and steering, which limits their flexibility and controllability in practice. Method: Layer-wise steering vectors are computed from contrastive examples with a single forward pass and applied at every reverse-diffusion step, without simulating the denoising trajectory. Result: Experiments on LLaDA-8B-Instruct show that the method reliably modulates high-level text attributes; ablations analyze the effect of steering different Transformer sub-modules and token scopes (prompt vs. response). Conclusion: The proposed activation-steering framework gives MDLMs an efficient and flexible inference-time control mechanism and broadens their potential for controllable text generation. Abstract: Masked diffusion language models (MDLMs) generate text through an iterative denoising process. They have recently gained attention due to mask-parallel decoding and competitive performance with autoregressive large language models. However, effective mechanisms for inference-time control and steering in MDLMs remain largely unexplored. We present an activation-steering framework for MDLMs that computes layer-wise steering vectors from a single forward pass using contrastive examples, without simulating the denoising trajectory. These directions are applied at every reverse-diffusion step, yielding an efficient inference-time control mechanism. Experiments on LLaDA-8B-Instruct demonstrate reliable modulation of high-level attributes, with ablations examining the effects of steering across transformer sub-modules and token scope (prompt vs. response).
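
A common way to obtain a steering vector from contrastive examples is the difference between mean activations of the two sets, added back to hidden states with a scale factor. The sketch below illustrates that recipe on plain lists; whether the paper uses exactly a difference of means is an assumption, and the names `steering_vector`, `apply_steering`, and `alpha` are illustrative only.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Element-wise mean of a list of equal-length activation vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def steering_vector(positive: List[Vector], negative: List[Vector]) -> Vector:
    """Difference of mean activations between the two contrastive sets (assumed recipe)."""
    pos_mean = mean_vector(positive)
    neg_mean = mean_vector(negative)
    return [p - n for p, n in zip(pos_mean, neg_mean)]


def apply_steering(hidden: Vector, direction: Vector, alpha: float) -> Vector:
    """Shift one hidden state along the steering direction, scaled by alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]
```

In the setting the abstract describes, one such vector would be computed per layer and the shift applied at every reverse-diffusion step; `alpha` controls how strongly the attribute is pushed.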

[29] Large Emotional World Model

Changhao Song,Yazhou Zhang,Hui Gao,Chang Yang,Peng Zhang

Main category: cs.CL

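TL;DR: This paper proposes a Large Emotional World Model (LEWM) that explicitly brings emotion into world modeling by building the EWH dataset of emotional causal relationships, improving the prediction of emotion-driven social behavior.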

Details Motivation: Existing large language models focus mainly on modeling the physical world and lack systematic exploration of emotional factors, even though emotion plays a key role in human decision-making and in understanding the world; a world model that captures emotional dynamics is therefore needed. Method: Inspired by theory of mind, the authors build the Emotion-Why-How (EWH) dataset, which integrates emotion into causal relationships, and on top of it propose LEWM, which jointly models visual observations, actions, and emotional states to predict future states and emotional transitions. Result: LEWM predicts emotion-driven social behaviors more accurately while performing comparably to general world models on basic tasks; removing emotion-related information degrades reasoning performance, confirming the importance of emotion modeling. Conclusion: Treating emotion as a core component of world knowledge is both necessary and effective, and LEWM offers a more empathetic modeling paradigm for understanding and predicting human behavior. Abstract: World Models serve as tools for understanding the current state of the world and predicting its future dynamics, with broad application potential across numerous fields. As a key component of world knowledge, emotion significantly influences human decision-making. While existing Large Language Models (LLMs) have shown preliminary capability in capturing world knowledge, they primarily focus on modeling physical-world regularities and lack systematic exploration of emotional factors. In this paper, we first demonstrate the importance of emotion in understanding the world by showing that removing emotionally relevant information degrades reasoning performance. Inspired by theory of mind, we further propose a Large Emotional World Model (LEWM). Specifically, we construct the Emotion-Why-How (EWH) dataset, which integrates emotion into causal relationships and enables reasoning about why actions occur and how emotions drive future world states. Based on this dataset, LEWM explicitly models emotional states alongside visual observations and actions, allowing the world model to predict both future states and emotional transitions. Experimental results show that LEWM more accurately predicts emotion-driven social behaviors while maintaining comparable performance to general world models on basic tasks.

[30] Training Report of TeleChat3-MoE

Xinzhang Liu,Chao Wang,Zhihao Yang,Zhuo Jiang,Xuncheng Zhao,Haoran Wang,Lei Li,Dongdong He,Luobin Liu,Kaizhe Yuan,Han Gao,Zihan Wang,Yitong Yao,Sishi Xiong,Wenmin Deng,Haowei He,Kaidong Yu,Yu Zhao,Ruiyu Fang,Yuhao Jiang,Yingyan Li,Xiaohui Hu,Xi Yu,Jingqi Li,Yanwei Liu,Qingli Li,Xinyu Shi,Junhao Niu,Chengnuo Huang,Yao Xiao,Ruiwen Wang,Fengkai Li,Luwen Pu,Kaipeng Jia,Fubei Yao,Yuyao Huang,Xuewei He,Zhuoru Jiang,Ruiting Song,Rui Xue,Qiyi Xie,Jie Zhang,Zilu Huang,Zhaoxi Zhang,Zhilong Lu,Yanhan Zhang,Yin Zhang,Yanlei Xue,Zhu Yuan,Teng Su,Xin Jiang,Shuangyong Song,Yongxiang Li,Xuelong Li

Main category: cs.CL

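TL;DR: TeleChat3-MoE is a series of very large language models built on a Mixture-of-Experts (MoE) architecture, with parameter counts from the hundred-billion to the trillion scale; the report focuses on the infrastructure and optimization techniques that make their training efficient.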

Details Motivation: Training MoE models with hundreds of billions to trillions of parameters efficiently and stably end-to-end on domestic Ascend NPU clusters requires solving numerical consistency across hardware platforms, distributed training efficiency, and system-level performance bottlenecks. Method: The report proposes systematic operator-level and end-to-end numerical accuracy verification; a performance optimization suite comprising interleaved pipeline scheduling, attention-aware data scheduling, hierarchical and overlapped communication, and DVM-based operator fusion; a multi-dimensional parallelism configuration framework based on analytical estimation and integer linear programming; and cluster-level optimizations that relieve host- and device-bound bottlenecks. Result: Near-linear scaling is achieved on clusters of thousands of chips, training throughput improves significantly, and training remains consistent and stable across hardware platforms and parallelism strategies. Conclusion: The proposed training infrastructure and system optimizations provide reliable support for efficient training of very large MoE models on domestic NPU clusters and advance large language model development on independent hardware ecosystems. Abstract: TeleChat3-MoE is the latest series of TeleChat large language models, featuring a Mixture-of-Experts (MoE) architecture with parameter counts ranging from 105 billion to over one trillion, trained end-to-end on an Ascend NPU cluster. This technical report mainly presents the underlying training infrastructure that enables reliable and efficient scaling to frontier model sizes. We detail systematic methodologies for operator-level and end-to-end numerical accuracy verification, ensuring consistency across hardware platforms and distributed parallelism strategies. Furthermore, we introduce a suite of performance optimizations, including interleaved pipeline scheduling, attention-aware data scheduling for long-sequence training, hierarchical and overlapped communication for expert parallelism, and DVM-based operator fusion. A systematic parallelization framework, leveraging analytical estimation and integer linear programming, is also proposed to optimize multi-dimensional parallelism configurations. Additionally, we present methodological approaches to cluster-level optimizations, addressing host- and device-bound bottlenecks during large-scale training tasks. These infrastructure advancements yield significant throughput improvements and near-linear scaling on clusters comprising thousands of devices, providing a robust foundation for large-scale language model development on hardware ecosystems.

[31] MedKGI: Iterative Differential Diagnosis with Medical Knowledge Graphs and Information-Guided Inquiring

Qipeng Wang,Rui Sheng,Yafei Li,Huamin Qu,Yushi Sun,Min Zhu

Main category: cs.CL

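TL;DR: MedKGI is a diagnostic framework grounded in clinical practice that integrates a medical knowledge graph, information-gain-based question selection, and structured state tracking to substantially improve the accuracy and dialogue efficiency of large language models in clinical diagnosis.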

Details Motivation: Existing large language models suffer from hallucination, redundant questions, and inconsistency across multi-turn dialogues in clinical diagnosis, making it hard to emulate real clinical reasoning. Method: The MedKGI framework constrains reasoning with a medical knowledge graph, selects discriminative questions by information gain, and uses an OSCE-format structured state to keep evidence consistent across turns. Result: On clinical benchmarks, MedKGI improves dialogue efficiency by 30% on average and outperforms strong baseline models in diagnostic accuracy. Conclusion: MedKGI addresses key shortcomings of LLMs in clinical diagnosis, enabling more efficient and reliable diagnosis that is consistent with clinical practice. Abstract: Recent advancements in Large Language Models (LLMs) have demonstrated significant promise in clinical diagnosis. However, current models struggle to emulate the iterative, diagnostic hypothesis-driven reasoning of real clinical scenarios. Specifically, current LLMs suffer from three critical limitations: (1) generating hallucinated medical content due to weak grounding in verified knowledge, (2) asking redundant or inefficient questions that hinder diagnostic progress, rather than discriminative ones, and (3) losing coherence over multi-turn dialogues, leading to contradictory or inconsistent conclusions. To address these challenges, we propose MedKGI, a diagnostic framework grounded in clinical practices. MedKGI integrates a medical knowledge graph (KG) to constrain reasoning to validated medical ontologies, selects questions based on information gain to maximize diagnostic efficiency, and adopts an OSCE-format structured state to maintain consistent evidence tracking across turns. Experiments on clinical benchmarks show that MedKGI outperforms strong LLM baselines in both diagnostic accuracy and inquiry efficiency, improving dialogue efficiency by 30% on average while maintaining state-of-the-art accuracy.
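
Selecting a question by information gain can be illustrated with a small Bayesian example: the gain of a yes/no question is the prior entropy over candidate diagnoses minus the expected entropy after hearing the answer. The sketch below is a textbook formulation under stated assumptions, not MedKGI's code; the diseases, probabilities, and the `likelihood_yes` tables are hypothetical.

```python
import math
from typing import Dict


def entropy(dist: Dict[str, float]) -> float:
    """Shannon entropy in bits of a probability distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def posterior(
    prior: Dict[str, float], likelihood_yes: Dict[str, float], answer_yes: bool
) -> Dict[str, float]:
    """Bayes update of the diagnosis distribution after a yes/no answer."""
    unnorm = {
        d: prior[d] * (likelihood_yes[d] if answer_yes else 1.0 - likelihood_yes[d])
        for d in prior
    }
    total = sum(unnorm.values())
    return {d: (v / total if total > 0 else 0.0) for d, v in unnorm.items()}


def information_gain(prior: Dict[str, float], likelihood_yes: Dict[str, float]) -> float:
    """Expected reduction in entropy from asking one yes/no question."""
    p_yes = sum(prior[d] * likelihood_yes[d] for d in prior)
    gain = entropy(prior)
    for answer_yes, p_answer in ((True, p_yes), (False, 1.0 - p_yes)):
        if p_answer > 0:
            gain -= p_answer * entropy(posterior(prior, likelihood_yes, answer_yes))
    return gain
```

A question whose answer is equally likely under every candidate diagnosis has zero gain, which is the formal sense in which it is redundant; the next question to ask is the one with the largest gain.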

[32] LAILA: A Large Trait-Based Dataset for Arabic Automated Essay Scoring

May Bashendy,Walid Massoud,Sohaila Eltanbouly,Salam Albatarni,Marwan Sayed,Abrar Abir,Houda Bouamor,Tamer Elsayed

Main category: cs.CL

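TL;DR: This paper introduces LAILA, currently the largest publicly available dataset for Arabic automated essay scoring (Arabic AES), comprising 7,859 essays annotated with holistic and trait scores on seven dimensions (relevance, organization, vocabulary, style, development, mechanics, and grammar), together with benchmark results from state-of-the-art Arabic and English models.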

Details Motivation: Research on Arabic automated essay scoring (AES) has been limited by the lack of publicly available datasets; this paper fills the gap by releasing the large annotated dataset LAILA. Method: A dataset of 7,859 Arabic essays was designed and collected, each manually annotated with a holistic score and trait scores on seven dimensions; state-of-the-art Arabic and English pre-trained models are benchmarked in prompt-specific and cross-prompt settings. Result: LAILA is the largest public Arabic AES dataset to date; benchmark experiments show that it can be used to train and evaluate reliable automated scoring models and that the cross-prompt setting is particularly challenging. Conclusion: LAILA eases data scarcity in Arabic AES research, provides an important resource for building more robust scoring systems, and promotes standardized benchmarking in the field. Abstract: Automated Essay Scoring (AES) has gained increasing attention in recent years, yet research on Arabic AES remains limited due to the lack of publicly available datasets. To address this, we introduce LAILA, the largest publicly available Arabic AES dataset to date, comprising 7,859 essays annotated with holistic and trait-specific scores on seven dimensions: relevance, organization, vocabulary, style, development, mechanics, and grammar. We detail the dataset design, collection, and annotations, and provide benchmark results using state-of-the-art Arabic and English models in prompt-specific and cross-prompt settings. LAILA fills a critical need in Arabic AES research, supporting the development of robust scoring systems.

[33] Tracing the Flow of Knowledge From Science to Technology Using Deep Learning

Michael E. Rose,Mainak Ghosh,Sebastian Erhardt,Cheng Li,Erik Buunk,Dietmar Harhoff

Main category: cs.CL

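TL;DR: This paper proposes Pat-SPECTER, a language similarity model suited to both patents and scientific publications, which performs best at predicting patent-paper citations, and confirms the hypothesis that papers cited by US patents are semantically less similar.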

Details Motivation: To handle language similarity between patents and scientific publications at the same time, and to build a model that predicts credible patent-paper citations. Method: Eight language (similarity) models are compared in a competitive evaluation; Pat-SPECTER is obtained by fine-tuning the SPECTER2 model on patent data. Result: Pat-SPECTER performs best at predicting patent-paper citations; empirically, papers cited by US patents are semantically less similar than those cited in other jurisdictions. Conclusion: Pat-SPECTER is currently the best-suited model for similarity analysis between patents and scientific literature, and the study supports the hypothesis that US patents cite less similar research because of the duty of candor. Abstract: We develop a language similarity model suitable for working with patents and scientific publications at the same time. In a horse race-style evaluation, we subject eight language (similarity) models to predict credible Patent-Paper Citations. We find that our Pat-SPECTER model performs best, which is the SPECTER2 model fine-tuned on patents. In two real-world scenarios (separating patent-paper-pairs and predicting patent-paper-pairs) we demonstrate the capabilities of the Pat-SPECTER. We finally test the hypothesis that US patents cite papers that are semantically less similar than in other large jurisdictions, which we posit is because of the duty of candor. The model is open for the academic community and practitioners alike.

[34] Joint Selection for Large-Scale Pre-Training Data via Policy Gradient-based Mask Learning

Ziqing Fan,Yuqiao Xian,Yan Sun,Li Shen

Main category: cs.CL

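TL;DR: This paper proposes DATAMASK, an efficient joint learning framework for selecting large-scale pre-training data that optimizes quality and diversity metrics in a unified process and significantly improves training efficiency and model performance on trillion-token datasets.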

Details Motivation: Existing data selection methods for trillion-token pre-training corpora typically consider only one type of metric, either quality or diversity, which limits training effectiveness; using several metrics jointly is rarely feasible because of the computational cost. Method: Data selection is formulated as a mask learning problem in which data masks are sampled iteratively and quality and diversity metrics are jointly optimized through policy-gradient optimization with acceleration techniques. Result: Selection time drops by 98.9% compared with a greedy algorithm. After training on FineWeb-Mask, a roughly 10% subset of FineWeb, a 1.5B dense model and a 7B MoE model improve by an average of 3.2% and 1.9%, respectively, across 12 tasks. Conclusion: DATAMASK enables efficient large-scale joint data selection over multiple metrics, significantly improves LLM training efficiency and performance, and confirms the value of optimizing quality and diversity together. Abstract: A fine-grained data recipe is crucial for pre-training large language models, as it can significantly enhance training efficiency and model performance. One important ingredient in the recipe is to select samples based on scores produced by defined rules, LLM judgment, or statistical information in embeddings, which can be roughly categorized into quality and diversity metrics. Due to the high computational cost when applied to trillion-scale token pre-training datasets such as FineWeb and DCLM, these two or more types of metrics are rarely considered jointly in a single selection process. However, in our empirical study, selecting samples based on quality metrics exhibits severe diminishing returns during long-term pre-training, while selecting on diversity metrics removes too many valuable high-quality samples, both of which limit pre-trained LLMs' capabilities. Therefore, we introduce DATAMASK, a novel and efficient joint learning framework designed for large-scale pre-training data selection that can simultaneously optimize multiple types of metrics in a unified process, with this study focusing specifically on quality and diversity metrics. DATAMASK approaches the selection process as a mask learning problem, involving iterative sampling of data masks, computation of policy gradients based on predefined objectives with sampled masks, and updating of mask sampling logits. Through policy gradient-based optimization and various acceleration enhancements, it significantly reduces selection time by 98.9% compared to a greedy algorithm, enabling our study to explore joint learning within trillion-scale tokens.
With DATAMASK, we select a subset of about 10% from the 15 trillion-token FineWeb dataset, termed FineWeb-Mask. Evaluated across 12 diverse tasks, we achieve significant improvements of 3.2% on a 1.5B dense model and 1.9% on a 7B MoE model.
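
The mask-learning loop described in the abstract (sample masks, compute a policy gradient, update the sampling logits) can be illustrated with a toy REINFORCE sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the function name `learn_mask`, the objective (per-sample quality minus a fixed `cost`, plus `lam` times the number of distinct clusters covered), and all hyperparameters.

```python
import math
import random

def learn_mask(quality, cluster, cost=0.5, lam=0.2,
               steps=400, batch=32, lr=1.0, seed=0):
    """Toy policy-gradient mask learning (hypothetical objective, not the paper's)."""
    rng = random.Random(seed)
    n = len(quality)
    logits = [0.0] * n  # one Bernoulli sampling logit per data sample
    for _ in range(steps):
        probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
        masks, rewards = [], []
        for _ in range(batch):
            # Sample a binary mask: 1 keeps the sample, 0 drops it.
            m = [1 if rng.random() < p else 0 for p in probs]
            chosen = [i for i in range(n) if m[i]]
            quality_term = sum(quality[i] - cost for i in chosen)
            diversity_term = len({cluster[i] for i in chosen})
            masks.append(m)
            rewards.append(quality_term + lam * diversity_term)
        baseline = sum(rewards) / batch  # batch-mean baseline reduces variance
        for i in range(n):
            # REINFORCE: d log p(mask) / d logit_i = m_i - p_i for a Bernoulli mask.
            grad = sum((r - baseline) * (m[i] - probs[i])
                       for m, r in zip(masks, rewards)) / batch
            logits[i] += lr * grad
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]

# Samples 0 and 2 share a cluster, as do samples 1 and 3.
keep_probs = learn_mask(quality=[0.9, 0.8, 0.1, 0.2], cluster=[0, 1, 0, 1])
```

In this toy setting the keep-probability of the higher-quality member of each cluster rises while its low-quality near-duplicate is pushed out, which is the qualitative behaviour a joint quality-and-diversity objective aims for.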

[35] Automated Analysis of Sustainability Reports: Using Large Language Models for the Extraction and Prediction of EU Taxonomy-Compliant KPIs

Jonathan Schmoll,Adam Jatowt

Main category: cs.CL

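TL;DR: This paper introduces a new structured dataset for EU Taxonomy compliance assessment, built from 190 corporate reports and containing ground-truth economic activities and key performance indicators (KPIs). It presents the first systematic evaluation of large language models (LLMs) on this task and finds moderate performance on the qualitative task and complete failure on the quantitative one.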

Details Motivation: The lack of public benchmark datasets hinders research on using LLMs to automate EU Taxonomy compliance workflows, so a realistic, structured dataset is needed to support systematic evaluation. Method: The authors build a new structured dataset from 190 corporate reports with ground-truth labels for economic activities and quantitative KPIs, and use a multi-step agentic framework to evaluate the zero-shot performance of LLMs on the qualitative and quantitative tasks. Result: LLMs perform moderately on the qualitative task of identifying economic activities, where the multi-step agentic framework modestly improves precision, but fail completely on the quantitative task of predicting financial KPIs. Concise metadata often outperforms full unstructured reports, and model confidence is poorly calibrated. Conclusion: LLMs cannot yet fully automate EU Taxonomy compliance, but they can assist human experts; the dataset provides a public benchmark for future research. Abstract: The manual, resource-intensive process of complying with the EU Taxonomy presents a significant challenge for companies. While Large Language Models (LLMs) offer a path to automation, research is hindered by a lack of public benchmark datasets. To address this gap, we introduce a novel, structured dataset from 190 corporate reports, containing ground-truth economic activities and quantitative Key Performance Indicators (KPIs). We use this dataset to conduct the first systematic evaluation of LLMs on the core compliance workflow. Our results reveal a clear performance gap between qualitative and quantitative tasks. LLMs show moderate success in the qualitative task of identifying economic activities, with a multi-step agentic framework modestly enhancing precision. Conversely, the models comprehensively fail at the quantitative task of predicting financial KPIs in a zero-shot setting. We also discover a paradox, where concise metadata often yields superior performance to full, unstructured reports, and find that model confidence scores are poorly calibrated. We conclude that while LLMs are not ready for full automation, they can serve as powerful assistive tools for human experts. Our dataset provides a public benchmark for future research.

[36] Figure It Out: Improving the Frontier of Reasoning with Active Visual Thinking

Meiqi Chen,Fandong Meng,Jie Zhou

Main category: cs.CL

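TL;DR: This paper proposes FIGR, a model that integrates active visual thinking into multi-turn reasoning via end-to-end reinforcement learning. By constructing visual representations, it reasons more effectively about global structural properties in complex problems and clearly outperforms text-only baselines on mathematical reasoning tasks.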

Details Motivation: Complex reasoning problems often involve implicit spatial, geometric, and structural relationships, and purely text-based reasoning struggles to capture these global structural constraints. Method: FIGR is trained with end-to-end reinforcement learning to construct visual representations during reasoning, which makes intermediate structural hypotheses explicit, and to adaptively regulate when and how visual reasoning is invoked. Result: On challenging mathematical reasoning benchmarks, FIGR outperforms strong text-only chain-of-thought baselines and improves the base model by 13.12% on AIME 2025 and 11.00% on BeyondAIME. Conclusion: Figure-guided multimodal reasoning improves the stability and reliability of complex reasoning, particularly on tasks that require global structural information. Abstract: Complex reasoning problems often involve implicit spatial, geometric, and structural relationships that are not explicitly encoded in text. While recent reasoning models have achieved strong performance across many domains, purely text-based reasoning struggles to represent global structural constraints in complex settings. In this paper, we introduce FIGR, which integrates active visual thinking into multi-turn reasoning via end-to-end reinforcement learning. FIGR externalizes intermediate structural hypotheses by constructing visual representations during problem solving. By adaptively regulating when and how visual reasoning should be invoked, FIGR enables more stable and coherent reasoning over global structural properties that are difficult to capture from text alone. Experiments on challenging mathematical reasoning benchmarks demonstrate that FIGR outperforms strong text-only chain-of-thought baselines. In particular, FIGR improves the base model by 13.12% on AIME 2025 and 11.00% on BeyondAIME, highlighting the effectiveness of figure-guided multimodal reasoning in enhancing the stability and reliability of complex reasoning.

[37] QianfanHuijin Technical Report: A Novel Multi-Stage Training Paradigm for Finance Industrial LLMs

Shupeng Li,Weipeng Lu,Linyun Liu,Chen Lin,Shaofei Li,Zhendong Tan,Hanjun Zhong,Yucheng Zeng,Chenghao Zhu,Mengyue Liu,Daxiang Dong,Jianmin Wu,Yunting Xiao,Annan Li,Danyu Liu,Jingnan Zhang,Licen Liu,Dawei Yin,Dou Shen

Main category: cs.CL

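TL;DR: This paper presents QianfanHuijin, a large language model for the financial domain, together with a generalizable multi-stage training paradigm whose progressively refined post-training pipeline significantly improves the model's financial reasoning and agentic capabilities.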

Details Motivation: As financial services grow more complex, models with domain knowledge alone no longer suffice; models with robust financial reasoning and agentic capabilities are needed. Method: A multi-stage training paradigm is used: Continual Pre-training (CPT) on financial corpora, followed by Financial SFT, Finance Reasoning RL, Finance Agentic RL, and finally General RL aligned with real-world business scenarios. Result: QianfanHuijin performs strongly on multiple authoritative financial benchmarks, and ablation studies show that the Reasoning RL and Agentic RL stages yield significant gains in their respective capabilities. Conclusion: This fine-grained, progressive post-training approach effectively strengthens targeted capabilities of industrial LLMs and may become a mainstream paradigm for industry-enhanced LLMs. Abstract: Domain-specific enhancement of Large Language Models (LLMs) within the financial context has long been a focal point of industrial application. While previous models such as BloombergGPT and Baichuan-Finance primarily focused on knowledge enhancement, the deepening complexity of financial services has driven a growing demand for models that possess not only domain knowledge but also robust financial reasoning and agentic capabilities. In this paper, we present QianfanHuijin, a financial domain LLM, and propose a generalizable multi-stage training paradigm for industrial model enhancement. Our approach begins with Continual Pre-training (CPT) on financial corpora to consolidate the knowledge base. This is followed by a fine-grained Post-training pipeline designed with increasing specificity: starting with Financial SFT, progressing to Finance Reasoning RL and Finance Agentic RL, and culminating in General RL aligned with real-world business scenarios. Empirical results demonstrate that QianfanHuijin achieves superior performance across various authoritative financial benchmarks. Furthermore, ablation studies confirm that the targeted Reasoning RL and Agentic RL stages yield significant gains in their respective capabilities. These findings validate our motivation and suggest that this fine-grained, progressive post-training methodology is poised to become a mainstream paradigm for various industrial-enhanced LLMs.

[38] World model inspired sarcasm reasoning with large language model agents

Keito Inoshita,Shinnosuke Mizuno

Main category: cs.CL

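TL;DR: This paper proposes WM-SAR, a world-model-inspired framework for sarcasm understanding. It decomposes literal meaning, context, normative expectation, and intention into specialized LLM agents and explicitly models semantic inconsistency and intention, improving both sarcasm detection performance and interpretability.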

Details Motivation: Existing sarcasm understanding methods mostly rely on black-box models, which makes the underlying cognitive factors hard to explain, and few frameworks explicitly model the mismatch between semantic evaluation and normative expectations. Method: Sarcasm understanding is recast as a world-model-inspired reasoning process. The WM-SAR framework uses separate LLM agents to model literal meaning, context, normative expectation, and intention, and a logistic regression model combines the inconsistency score and the intention score to make the final decision. Result: On multiple sarcasm detection benchmarks, WM-SAR outperforms existing deep learning and LLM-based methods while offering greater interpretability. Conclusion: Structured modeling that combines semantic inconsistency with intention reasoning improves both the effectiveness and the transparency of sarcasm understanding and suggests a new direction for cognitively motivated modeling in NLP. Abstract: Sarcasm understanding is a challenging problem in natural language processing, as it requires capturing the discrepancy between the surface meaning of an utterance and the speaker's intentions as well as the surrounding social context. Although recent advances in deep learning and Large Language Models (LLMs) have substantially improved performance, most existing approaches still rely on black-box predictions of a single model, making it difficult to structurally explain the cognitive factors underlying sarcasm. Moreover, while sarcasm often emerges as a mismatch between semantic evaluation and normative expectations or intentions, frameworks that explicitly decompose and model these components remain limited. In this work, we reformulate sarcasm understanding as a world model inspired reasoning process and propose World Model inspired SArcasm Reasoning (WM-SAR), which decomposes literal meaning, context, normative expectation, and intention into specialized LLM-based agents. The discrepancy between literal evaluation and normative expectation is explicitly quantified as a deterministic inconsistency score, and together with an intention score, these signals are integrated by a lightweight Logistic Regression model to infer the final sarcasm probability. This design leverages the reasoning capability of LLMs while maintaining an interpretable numerical decision structure. Experiments on representative sarcasm detection benchmarks show that WM-SAR consistently outperforms existing deep learning and LLM-based methods.
Ablation studies and case analyses further demonstrate that integrating semantic inconsistency and intention reasoning is essential for effective sarcasm detection, achieving both strong performance and high interpretability.
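
The final decision step described above (a deterministic inconsistency score and an intention score combined by logistic regression) might look roughly like the following sketch. The score ranges, the formula for the inconsistency score, and the weights and bias are all hypothetical placeholders; the abstract does not specify them, and in practice the weights would be fitted on labeled data.

```python
import math

def inconsistency_score(literal_eval, normative_expectation):
    """Hypothetical definition: half the absolute gap between two
    agent-produced evaluations in [-1, 1], giving a value in [0, 1]."""
    return abs(literal_eval - normative_expectation) / 2.0

def sarcasm_probability(literal_eval, normative_expectation, intention_score,
                        weights=(3.0, 2.0), bias=-2.5):
    """Logistic regression over the two interpretable signals.
    The weights and bias are illustrative, not learned values."""
    s = inconsistency_score(literal_eval, normative_expectation)
    z = weights[0] * s + weights[1] * intention_score + bias
    return 1.0 / (1.0 + math.exp(-z))

# Literally positive wording about a situation that is normatively negative,
# with a high intention score: large mismatch, so a high sarcasm probability.
p_sarcastic = sarcasm_probability(0.8, -0.6, intention_score=0.9)
# Wording and expectation agree and little ironic intent is detected.
p_sincere = sarcasm_probability(0.8, 0.7, intention_score=0.1)
```

Because the decision depends on only two named inputs, each prediction can be traced back to how much mismatch and how much ironic intent the agents reported.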

[39] Skim-Aware Contrastive Learning for Efficient Document Representation

Waheed Ahmed Abro,Zied Bouraoui

Main category: cs.CL

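TL;DR: The paper proposes a self-supervised contrastive learning framework for long-document representation that mimics human reading strategies and yields more efficient and accurate representations on legal and medical texts.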

Details Motivation: Existing Transformer models handle long documents poorly: they are resource-intensive, miss full-document context, or lack interpretability. The way humans skim key sections to understand a whole text motivates a new approach. Method: A self-supervised contrastive learning framework randomly masks a section of the document and uses a natural language inference (NLI)-based contrastive objective to align the masked section with relevant parts and push it away from unrelated ones. Result: Experiments on legal and biomedical texts show significant gains in the accuracy and computational efficiency of long-document representations. Conclusion: By imitating human reading strategies, the method strengthens long-document representation, combining performance gains with computational efficiency, and suits long texts in specialized domains. Abstract: Although transformer-based models have shown strong performance in word- and sentence-level tasks, effectively representing long documents, especially in fields like law and medicine, remains difficult. Sparse attention mechanisms can handle longer inputs, but are resource-intensive and often fail to capture full-document context. Hierarchical transformer models offer better efficiency but do not clearly explain how they relate different sections of a document. In contrast, humans often skim texts, focusing on important sections to understand the overall message. Drawing from this human strategy, we introduce a new self-supervised contrastive learning framework that enhances long document representation. Our method randomly masks a section of the document and uses a natural language inference (NLI)-based contrastive objective to align it with relevant parts while distancing it from unrelated ones. This mimics how humans synthesize information, resulting in representations that are both richer and more computationally efficient. Experiments on legal and biomedical texts confirm significant gains in both accuracy and efficiency.

[40] Comparing Approaches to Automatic Summarization in Less-Resourced Languages

Chester Palen-Michel,Constantine Lignos

Main category: cs.CL

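TL;DR: This paper studies automatic text summarization for less-resourced languages, comparing zero-shot prompting, fine-tuned mT5 models, data augmentation, and multilingual transfer, and evaluates how different LLMs and metrics behave.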

Details Motivation: Summarization in less-resourced languages has received little attention, and it is unclear how existing methods perform on these languages, so a systematic comparison is needed. Method: The study compares zero-shot prompting of LLMs, fine-tuning mT5 (with data augmentation and multilingual transfer), and an LLM-based translate-summarize-translate-back pipeline, evaluated with five different metrics. Result: LLMs of similar parameter size differ in performance; the multilingual fine-tuned mT5 model outperforms most other approaches, including zero-shot LLMs, on most metrics; and LLM-as-judge is less reliable on less-resourced languages. Conclusion: For summarization in less-resourced languages, fine-tuning a small multilingual model such as mT5 is more effective than zero-shot prompting of large models, and LLM-based evaluation may be biased and should be used with caution. Abstract: Automatic text summarization has achieved high performance in high-resourced languages like English, but comparatively less attention has been given to summarization in less-resourced languages. This work compares a variety of different approaches to summarization, from zero-shot prompting of LLMs large and small to fine-tuning smaller models like mT5 with and without three data augmentation approaches and multilingual transfer. We also explore an LLM translation pipeline approach, translating from the source language to English, summarizing and translating back. Evaluating with five different metrics, we find that there is variation across LLMs in their performance across similar parameter sizes, that our multilingual fine-tuned mT5 baseline outperforms most other approaches including zero-shot LLM performance for most metrics, and that LLM as judge may be less reliable on less-resourced languages.

[41] Cleaning English Abstracts of Scientific Publications

Michael E. Rose,Nils A. Herrmann,Sebastian Erhardt

Main category: cs.CL

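TL;DR: The paper introduces an open-source language model that automatically removes extraneous information from English scientific abstracts, improving the accuracy of text embeddings and similarity analysis.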

Details Motivation: Scientific abstracts often contain irrelevant content such as copyright statements and metadata, which reduces the accuracy of downstream text analysis. Method: The authors design and train an easy-to-integrate language model that identifies and removes non-essential information from scientific abstracts. Result: The model is conservative and precise, alters the similarity rankings of cleaned abstracts, and improves the information content of standard-length embeddings. Conclusion: The model improves the quality of scientific text preprocessing and suits analyses that depend on text similarity and embedding representations. Abstract: Scientific abstracts are often used as proxies for the content and thematic focus of research publications. However, a significant share of published abstracts contains extraneous information (such as publisher copyright statements, section headings, author notes, registrations, and bibliometric or bibliographic metadata) that can distort downstream analyses, particularly those involving document similarity or textual embeddings. We introduce an open-source, easy-to-integrate language model designed to clean English-language scientific abstracts by automatically identifying and removing such clutter. We demonstrate that our model is both conservative and precise, alters similarity rankings of cleaned abstracts and improves information content of standard-length embeddings.

[42] IELTS Writing Revision Platform with Automated Essay Scoring and Adaptive Feedback

Titas Ramancauskas,Kotryna Ramancauske

Main category: cs.CL

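TL;DR: This paper designs and evaluates a revision platform for IELTS writing preparation that combines automated scoring with targeted feedback. The model is refined iteratively through design-based research; the final DistilBERT regression model substantially improves scoring accuracy, and the adaptive feedback is shown to raise candidates' scores.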

Details Motivation: Traditional IELTS writing preparation lacks personalised, rubric-based feedback and struggles to simulate real exam conditions, so a digital platform that provides precise, tailored feedback is needed. Method: The study follows Design-Based Research (DBR) over several iterations, moving from a rule-based scoring system to a Transformer model (DistilBERT with a regression head) for Automated Essay Scoring (AES) with an integrated adaptive feedback mechanism; the platform architecture separates conversational guidance from the writing interface to reduce cognitive load. Result: Early rule-based approaches performed poorly (mid-band compression, low accuracy, negative R²). Introducing DistilBERT in DBR Cycle 4 brought MAE down to 0.66 and turned R² positive. Cycle 5 added adaptive feedback, with a mean gain of 0.060 bands (p = 0.011, Cohen's d = 0.504), although the effect varied by revision strategy. Conclusion: Automated feedback is an effective supplement to human instruction, and conservative surface-level corrections are more reliable than aggressive structural changes. The system still struggles with higher-band essays; future work should include longitudinal studies with real candidates and validation by official examiners. Abstract: This paper presents the design, development, and evaluation of a proposed revision platform assisting candidates for the International English Language Testing System (IELTS) writing exam. Traditional IELTS preparation methods lack personalised feedback tailored to the IELTS writing rubric. To address these shortcomings, the platform features an attractive user interface (UI), an Automated Essay Scoring system (AES), and targeted feedback tailored to candidates and the IELTS writing rubric. The platform architecture separates conversational guidance from a dedicated writing interface to reduce cognitive load and simulate exam conditions. Through iterative Design-Based Research (DBR) cycles, the study progressed from rule-based scoring to transformer-based scoring with a regression head, combined with adaptive feedback. Early cycles (2-3) revealed fundamental limitations of rule-based approaches: mid-band compression, low accuracy, and negative $R^2$ values. DBR Cycle 4 implemented a DistilBERT transformer model with a regression head, yielding substantial improvements with MAE of 0.66 and positive $R^2$. This enabled Cycle 5's adaptive feedback implementation, which demonstrated statistically significant score improvements (mean +0.060 bands, p = 0.011, Cohen's d = 0.504), though effectiveness varied by revision strategy.
Findings suggest automated feedback functions best as a supplement to human instruction, with conservative surface-level corrections proving more reliable than aggressive structural interventions for IELTS preparation contexts. Challenges remain in assessing higher-band essays, and future work should incorporate longitudinal studies with real IELTS candidates and validation from official examiners.
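
The two scoring metrics reported above, MAE and $R^2$, can be computed as in the minimal sketch below. It also shows why a scorer that collapses predictions toward one band can produce a negative $R^2$: its squared error exceeds that of simply predicting the mean true score. The band values are made up for illustration and are not data from the study.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute gap between true and predicted band scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.
    Negative when predictions are worse than always predicting the mean."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

true_bands = [5.0, 6.0, 7.0, 8.0]       # illustrative essay scores
compressed = [6.0, 6.0, 6.0, 6.0]       # a scorer that always outputs band 6

mae = mean_absolute_error(true_bands, compressed)   # 1.0
r2 = r_squared(true_bands, compressed)              # about -0.2
```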

[43] Paragraph Segmentation Revisited: Towards a Standard Task for Structuring Speech

Fabian Retkowski,Alexander Waibel

Main category: cs.CL

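TL;DR: This paper proposes new benchmarks and methods for paragraph segmentation of automatic speech transcripts. With the TEDPara and YTSegPara datasets, constrained decoding for large language models, and the efficient MiniSeg model, it establishes paragraph segmentation as a standardized task in speech processing.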

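Details Motivation: Automatic speech transcripts are usually delivered as unstructured word streams, which hurts readability and reuse. Text segmentation research lacks realistic, naturalistic benchmarks for the speech domain, and conventional speech post-processing does not include paragraph segmentation, so a solution designed for speech transcripts is needed. Method: The paper makes three contributions: (1) two new benchmarks, TEDPara (human-annotated) and YTSegPara (synthetic labels); (2) a constrained-decoding formulation that lets large language models insert paragraph breaks while preserving the original text, enabling faithful, sentence-aligned evaluation; (3) a lightweight model, MiniSeg, extended hierarchically to jointly predict chapters and paragraphs. Result: The proposed methods reach state-of-the-art paragraph segmentation accuracy on the new benchmarks; MiniSeg is computationally cheap and can predict chapters and paragraphs together, showing that paragraph segmentation is feasible and practical in speech processing. Conclusion: The work fills a gap at the intersection of speech processing and text segmentation and establishes paragraph segmentation as a standardized, practical speech processing task. Abstract: Automatic speech transcripts are often delivered as unstructured word streams that impede readability and repurposing. We recast paragraph segmentation as the missing structuring step and fill three gaps at the intersection of speech processing and text segmentation. First, we establish TEDPara (human-annotated TED talks) and YTSegPara (YouTube videos with synthetic labels) as the first benchmarks for the paragraph segmentation task. The benchmarks focus on the underexplored speech domain, where paragraph segmentation has traditionally not been part of post-processing, while also contributing to the wider text segmentation field, which still lacks robust and naturalistic benchmarks. Second, we propose a constrained-decoding formulation that lets large language models insert paragraph breaks while preserving the original transcript, enabling faithful, sentence-aligned evaluation. Third, we show that a compact model (MiniSeg) attains state-of-the-art accuracy and, when extended hierarchically, jointly predicts chapters and paragraphs with minimal computational cost. Together, our resources and methods establish paragraph segmentation as a standardized, practical task in speech processing.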
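
The idea behind the constrained-decoding formulation (the model may only copy the next transcript sentence or insert a paragraph break, so the transcript itself cannot be altered) can be sketched at sentence level as below. This is a simplified illustration, not the authors' implementation: `toy_break_score` is a hypothetical stand-in for an LLM's probability of emitting a break, and the threshold is arbitrary.

```python
def constrained_segment(sentences, break_score, threshold=0.5):
    """Group sentences into paragraphs. The only decision available between
    two consecutive sentences is 'insert a break' or 'continue', so the
    original sentences are preserved verbatim and in order."""
    if not sentences:
        return []
    paragraphs = [[sentences[0]]]
    for prev, nxt in zip(sentences, sentences[1:]):
        if break_score(prev, nxt) > threshold:
            paragraphs.append([nxt])       # start a new paragraph
        else:
            paragraphs[-1].append(nxt)     # stay in the current paragraph
    return paragraphs

def toy_break_score(prev, nxt):
    """Hypothetical scorer: favors a break before common discourse markers."""
    markers = ("Now", "Next", "Finally")
    return 0.9 if nxt.startswith(markers) else 0.1

transcript = [
    "Thanks for coming.",
    "Today is about speech.",
    "Now let us look at the data.",
    "It is large.",
    "Finally, the results.",
]
paragraphs = constrained_segment(transcript, toy_break_score)
```

Because the output is always the input sentences regrouped, predicted and reference boundaries line up sentence by sentence, which is what makes the evaluation faithful and sentence-aligned.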

[44] Safe in the Future, Dangerous in the Past: Dissecting Temporal and Linguistic Vulnerabilities in LLMs

Muhammad Abdullahi Said,Muhammad Sammani Sani

Main category: cs.CL

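TL;DR: Using the HausaSafety dataset, this study audits the safety alignment of GPT-5.1, Gemini 3 Pro, and Claude 4.5 Opus across languages and reveals a complex interference mechanism arising from the interaction of language and temporal framing. Current models rely on superficial heuristics, leaving Global South users exposed to localized risks, and the authors call for a shift toward invariant alignment.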

Details Motivation: As LLMs are integrated into critical global infrastructure, whether safety alignment transfers zero-shot from English to other languages remains a blind spot; low-resource languages in particular may carry overlooked risks, so multilingual safety needs a systematic audit. Method: The authors build HausaSafety, an adversarial dataset grounded in West African threat scenarios, and use a 2 x 4 factorial design over 1,440 evaluations to test how the safety responses of three state-of-the-art models (GPT-5.1, Gemini 3 Pro, Claude 4.5 Opus) differ between English and Hausa and across temporal framings. Result: A complex interference mechanism emerges in which safety depends on the interaction of language and temporal framing. Claude 4.5 Opus is safer in Hausa than in English (45.0% vs. 36.7%), but temporal reasoning fails catastrophically: past-tense framing yields only 15.6% safe responses versus 57.2% for future-tense framing, and the safest and most vulnerable configurations differ by a factor of 9.2. Conclusion: Safety in current models is a context-dependent state, not a fixed property. Models rely on superficial heuristics instead of deep semantic understanding, creating "Safety Pockets" that expose Global South users to localized harms; the authors propose "Invariant Alignment" to ensure stability across languages and temporal framings. Abstract: As Large Language Models (LLMs) integrate into critical global infrastructure, the assumption that safety alignment transfers zero-shot from English to other languages remains a dangerous blind spot. This study presents a systematic audit of three state-of-the-art models (GPT-5.1, Gemini 3 Pro, and Claude 4.5 Opus) using HausaSafety, a novel adversarial dataset grounded in West African threat scenarios (e.g., Yahoo-Yahoo fraud, Dane gun manufacturing). Employing a 2 x 4 factorial design across 1,440 evaluations, we tested the non-linear interaction between language (English vs. Hausa) and temporal framing. Our results challenge the prevailing multilingual safety gap narrative. Instead of a simple degradation in low-resource settings, we identified a mechanism of Complex Interference where safety is determined by the intersection of variables. While models exhibited a Reverse Linguistic, with Claude 4.5 Opus proving significantly safer in Hausa (45.0%) than in English (36.7%) due to uncertainty-driven refusal, they suffered catastrophic failures in temporal reasoning. We report a profound Temporal Asymmetry, where past-tense framing bypassed defenses (15.6% safe) while future-tense scenarios triggered hyper-conservative refusals (57.2% safe). The magnitude of this volatility is illustrated by a 9.2x disparity between the safest and most vulnerable configurations, proving that safety is not a fixed property but a context-dependent state.
We conclude that current models rely on superficial heuristics rather than robust semantic understanding, creating Safety Pockets that leave Global South users exposed to localized harms. We propose Invariant Alignment as a necessary paradigm shift to ensure safety stability across linguistic and temporal shifts.

[45] HaluNet: Multi-Granular Uncertainty Modeling for Efficient Hallucination Detection in LLM Question Answering

Chaodong Tong,Qi Zhang,Jiayang Gao,Lei Jiang,Yanbing Liu,Nannan Sun

Main category: cs.CL

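TL;DR: This paper proposes HaluNet, a lightweight, trainable neural framework for detecting hallucinations of large language models (LLMs) in question answering. It combines token-level probability uncertainty with uncertainty in semantic representations, and its multi-branch architecture adaptively fuses model knowledge with output uncertainty for efficient one-pass hallucination detection.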

Details Motivation: LLMs perform well on question answering but often hallucinate, producing factual errors or fabricated content. Existing methods usually focus on a single type of internal uncertainty and overlook the complementarity between sources, especially between token-level probability uncertainty and the uncertainty conveyed by semantic representations, so a detector that integrates multi-granular uncertainty is needed. Method: HaluNet uses a multi-branch architecture that combines semantic embeddings with probabilistic confidence and distributional uncertainty, integrating multi-granular token-level uncertainties. It adaptively fuses what the model knows with the uncertainty in its outputs and supports detection in a single forward pass without external resources. Result: Experiments on SQuAD, TriviaQA, and Natural Questions show strong detection performance and favorable computational efficiency, with or without access to context. Conclusion: By fusing multi-granular uncertainty signals, HaluNet detects hallucinations efficiently and accurately and has potential for real-time hallucination detection in LLM-based QA systems. Abstract: Large Language Models (LLMs) excel at question answering (QA) but often generate hallucinations, including factual errors or fabricated content. Detecting hallucinations from internal uncertainty signals is attractive due to its scalability and independence from external resources. Existing methods often aim to accurately capture a single type of uncertainty while overlooking the complementarity among different sources, particularly between token-level probability uncertainty and the uncertainty conveyed by internal semantic representations, which provide complementary views on model reliability. We present HaluNet, a lightweight and trainable neural framework that integrates multi-granular token-level uncertainties by combining semantic embeddings with probabilistic confidence and distributional uncertainty. Its multi-branch architecture adaptively fuses what the model knows with the uncertainty expressed in its outputs, enabling efficient one-pass hallucination detection. Experiments on SQuAD, TriviaQA, and Natural Questions show that HaluNet delivers strong detection performance and favorable computational efficiency, with or without access to context, highlighting its potential for real-time hallucination detection in LLM-based QA systems.

Hongseok Oh,Wonseok Hwang,Kyoung-Woon On

Main category: cs.CL

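TL;DR: The paper introduces the Korean Canonical Legal Benchmark (KCL), which evaluates the legal reasoning ability of language models independently of domain knowledge. It comprises a multiple-choice part and an open-ended essay part, and the dataset and evaluation code are released.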

Details Motivation: To evaluate the legal reasoning ability of language models on its own terms, without performance being confounded by legal knowledge stored in model parameters, a benchmark is needed that separates reasoning ability from memorized knowledge. Method: KCL comprises 283 multiple-choice questions (KCL-MCQA) and 169 open-ended questions (KCL-Essay), each supplied with supporting precedents, plus 2,739 instance-level rubrics for automated evaluation of the open-ended questions; more than 30 models are evaluated systematically. Result: Existing models leave large room for improvement on KCL, especially on KCL-Essay, and reasoning-specialized models outperform general-purpose ones. Conclusion: KCL is an effective benchmark for legal reasoning that measures reasoning ability more accurately, and the public release of its resources should support further research. Abstract: We introduce the Korean Canonical Legal Benchmark (KCL), a benchmark designed to assess language models' legal reasoning capabilities independently of domain-specific knowledge. KCL provides question-level supporting precedents, enabling a more faithful disentanglement of reasoning ability from parameterized knowledge. KCL consists of two components: (1) KCL-MCQA, multiple-choice problems of 283 questions with 1,103 aligned precedents, and (2) KCL-Essay, open-ended generation problems of 169 questions with 550 aligned precedents and 2,739 instance-level rubrics for automated evaluation. Our systematic evaluation of 30+ models shows large remaining gaps, particularly in KCL-Essay, and that reasoning-specialized models consistently outperform their general-purpose counterparts. We release all resources, including the benchmark dataset and evaluation code, at https://github.com/lbox-kr/kcl.

[47] Understanding and Steering the Cognitive Behaviors of Reasoning Models at Test-Time

Zhenyu Zhang,Xiaoxia Wu,Zhongzhu Zhou,Qingyang Wu,Yineng Zhang,Pragaash Ponnusamy,Harikaran Subbaraj,Jue Wang,Shuaiwen Leon Song,Ben Athiwaratkun

Main category: cs.CL

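TL;DR: This paper proposes CREST, a training-free method that steers the reasoning of large language models by intervening on specific cognitive attention heads at inference time, improving accuracy and reducing computational cost.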

Details Motivation: LLMs rely on long chain-of-thought reasoning to solve complex tasks, but these reasoning trajectories are often inefficient, causing high latency or unstable reasoning. Method: The study identifies specialized attention heads associated with cognitive behaviors such as verification and backtracking and builds CREST on this finding. CREST consists of an offline calibration step that identifies cognitive heads and derives head-specific steering vectors, and an inference-time procedure that rotates hidden representations to suppress components along those vectors. Result: Across diverse reasoning benchmarks and models, CREST improves accuracy by up to 17.5% while reducing token usage by 37.6%. Conclusion: CREST offers a simple and effective way to make LLM reasoning faster and more reliable. Abstract: Large Language Models (LLMs) often rely on long chain-of-thought (CoT) reasoning to solve complex tasks. While effective, these trajectories are frequently inefficient, leading to high latency from excessive token generation, or unstable reasoning that alternates between underthinking (shallow, inconsistent steps) and overthinking (repetitive, verbose reasoning). In this work, we study the structure of reasoning trajectories and uncover specialized attention heads that correlate with distinct cognitive behaviors such as verification and backtracking. By lightly intervening on these heads at inference time, we can steer the model away from inefficient modes. Building on this insight, we propose CREST, a training-free method for Cognitive REasoning Steering at Test-time. CREST has two components: (1) an offline calibration step that identifies cognitive heads and derives head-specific steering vectors, and (2) an inference-time procedure that rotates hidden representations to suppress components along those vectors. CREST adaptively suppresses unproductive reasoning behaviors, yielding both higher accuracy and lower computational cost. Across diverse reasoning benchmarks and models, CREST improves accuracy by up to 17.5% while reducing token usage by 37.6%, offering a simple and effective pathway to faster, more reliable LLM reasoning.
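
One way to read the inference-time step (rotating a hidden representation so that its component along a steering vector is suppressed) is as a projection followed by a rescaling that restores the original norm. The sketch below illustrates that reading only; it is an assumption about the mechanics, not the authors' code, and the function name `suppress_direction` and the strength parameter `alpha` are hypothetical.

```python
import math

def suppress_direction(hidden, steer, alpha=1.0):
    """Remove a fraction `alpha` of the component of `hidden` along `steer`,
    then rescale so the vector keeps its original length (a norm-preserving,
    rotation-like update)."""
    steer_norm = math.sqrt(sum(v * v for v in steer))
    unit = [v / steer_norm for v in steer]
    coeff = sum(h * u for h, u in zip(hidden, unit))   # projection coefficient
    reduced = [h - alpha * coeff * u for h, u in zip(hidden, unit)]
    hidden_norm = math.sqrt(sum(h * h for h in hidden))
    reduced_norm = math.sqrt(sum(r * r for r in reduced))
    if reduced_norm == 0.0:
        return reduced  # hidden was parallel to steer; nothing left to rescale
    return [r * hidden_norm / reduced_norm for r in reduced]

# A 2-D toy activation with a steering vector along the first axis.
steered = suppress_direction([3.0, 4.0], [1.0, 0.0])   # [0.0, 5.0]
```

Keeping the norm fixed is a design choice that leaves the activation scale seen by later layers unchanged while the suppressed direction is removed.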

[48] Youtu-LLM: Unlocking the Native Agentic Potential for Lightweight Large Language Models

Junru Lu,Jiarui Qin,Lingfeng Qiao,Yinghui Li,Xinyi Dai,Bo Ke,Jianfeng He,Ruizhi Qiao,Di Yin,Xing Sun,Yunsheng Wu,Yinsong Liu,Shuangyin Liu,Mingkong Tang,Haodong Lin,Jiayi Kuang,Fanxu Meng,Xiaojuan Tang,Yunjia Xi,Junjie Huang,Haotong Yang,Zhenyi Shen,Yangning Li,Qianwen Zhang,Yifei Yu,Siyu An,Junnan Dong,Qiufeng Wang,Jie Wang,Keyu Chen,Wei Wen,Taian Guo,Zhifeng Shen,Daohai Yu,Jiahao Li,Ke Li,Zongyi Li,Xiaoyu Tan

Main category: cs.CL

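TL;DR: Youtu-LLM is a lightweight 1.96B language model pre-trained from scratch, with long-context support, a Commonsense-STEM-Agent curriculum, and scalable agentic mid-training, which substantially improves small-model performance on reasoning, planning, and agentic tasks.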

Details Motivation: The goal is an efficient small language model with native agentic intelligence, overcoming the weak reasoning and planning abilities of typical small models that rely on distillation. Method: A compact MLA architecture with a STEM-oriented vocabulary supports a 128k context window; a progressive multi-stage pre-training curriculum moves from commonsense to STEM to agentic tasks; and mid-training adds diverse trajectory data from math, coding, and tool use to strengthen planning and reflection. Result: Youtu-LLM is competitive with larger models on general benchmarks and significantly surpasses existing SOTA baselines on agent-specific tasks, setting a new state of the art for sub-2B models. Conclusion: With systematic pre-training, lightweight language models can acquire strong intrinsic agentic capabilities without relying on distillation or larger parameter counts. Abstract: We introduce Youtu-LLM, a lightweight yet powerful language model that harmonizes high computational efficiency with native agentic intelligence. Unlike typical small models that rely on distillation, Youtu-LLM (1.96B) is pre-trained from scratch to systematically cultivate reasoning and planning capabilities. The key technical advancements are as follows: (1) Compact Architecture with Long-Context Support: Built on a dense Multi-Latent Attention (MLA) architecture with a novel STEM-oriented vocabulary, Youtu-LLM supports a 128k context window. This design enables robust long-context reasoning and state tracking within a minimal memory footprint, making it ideal for long-horizon agent and reasoning tasks. (2) Principled "Commonsense-STEM-Agent" Curriculum: We curated a massive corpus of approximately 11T tokens and implemented a multi-stage training strategy. By progressively shifting the pre-training data distribution from general commonsense to complex STEM and agentic tasks, we ensure the model acquires deep cognitive abilities rather than superficial alignment. (3) Scalable Agentic Mid-training: Specifically for the agentic mid-training, we employ diverse data construction schemes to synthesize rich and varied trajectories across math, coding, and tool-use domains. This high-quality data enables the model to internalize planning and reflection behaviors effectively. Extensive evaluations show that Youtu-LLM sets a new state-of-the-art for sub-2B LLMs.
On general benchmarks, it achieves competitive performance against larger models, while on agent-specific tasks, it significantly surpasses existing SOTA baselines, demonstrating that lightweight models can possess strong intrinsic agentic capabilities.

[49] Do Large Language Models Know What They Are Capable Of?

Casey O. Barkan,Sid Black,Oliver Sourbut

Main category: cs.CL

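TL;DR: The study asks whether large language models (LLMs) can predict their own task performance and how that judgment changes during multi-step tasks and after experiencing failure. All tested LLMs are overconfident, but most have better-than-random discriminatory power; newer and larger models do not clearly do better, with Claude models as the exception. On multi-step agentic tasks, the overconfidence of several frontier LLMs worsens as the task progresses, and reasoning LLMs do no better than non-reasoning ones. With in-context experiences of failure, some but not all LLMs reduce their overconfidence and decide better. All LLMs decide approximately rationally given their estimated success probabilities, but optimistic estimates lead to poor decisions overall. Current LLMs therefore lack accurate awareness of their own capabilities, which matters for AI misuse and misalignment risks.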

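Details Motivation: The study examines whether LLMs are aware of their own capabilities, in particular whether they can predict their success before and during a task, and how this awareness shapes their decisions when failure is costly, in order to assess their reliability and safety as agents. Method: Experiments measure how accurately a range of LLMs predict their success probability on single-step and multi-step tasks, track how confidence changes as tasks progress, and add in-context examples of failure to observe the effect on later decisions and on overconfidence, comparing models of different sizes and types (reasoning vs. non-reasoning). Result: All tested LLMs are overconfident, but most predict success with better-than-random discriminatory power. Newer or larger models generally do not discriminate better (Claude models excepted). On multi-step tasks, the overconfidence of several frontier LLMs grows as the task progresses, and reasoning models do not outperform non-reasoning ones. After in-context experiences of failure, some but not all LLMs reduce their overconfidence and decide better. All LLMs decide approximately rationally given their estimated probabilities, but systematically optimistic estimates lead to poor actual decisions. Conclusion: Current LLMs can predict their own performance to some extent, but they are broadly overconfident and do not reliably recalibrate as tasks progress or as experience accumulates, which reflects limited awareness of their own capabilities. This limits their use as reliable agents and may increase AI misuse and misalignment risks; better self-monitoring and calibration mechanisms are needed. Abstract: We investigate whether large language models (LLMs) can predict whether they will succeed on a given task and whether their predictions improve as they progress through multi-step tasks. We also investigate whether LLMs can learn from in-context experiences to make better decisions about whether to pursue a task in scenarios where failure is costly. All LLMs we tested are overconfident, but most predict their success with better-than-random discriminatory power. We find that newer and larger LLMs generally do not have greater discriminatory power, though Claude models do show such a trend. On multi-step agentic tasks, the overconfidence of several frontier LLMs worsens as they progress through the tasks, and reasoning LLMs perform comparably to or worse than non-reasoning LLMs. With in-context experiences of failure, some but not all LLMs reduce their overconfidence, leading to significantly improved decision making. Interestingly, all LLMs' decisions are approximately rational given their estimated probabilities of success, yet their overly-optimistic estimates result in poor decision making. These results suggest that current LLM agents are hindered by their lack of awareness of their own capabilities. We discuss the implications of LLMs' awareness of their capabilities for AI misuse and misalignment risks.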
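
The observation that decisions can be rational given the estimated success probability and still turn out badly can be made concrete with a small expected-value sketch. The payoff structure (a gain on success, a cost on failure, zero for declining) and all numbers are illustrative assumptions, not values from the paper.

```python
def expected_value(p_success, gain, cost):
    """Expected payoff of attempting a task that pays `gain` on success
    and loses `cost` on failure."""
    return p_success * gain - (1.0 - p_success) * cost

def should_attempt(p_estimated, gain, cost):
    """Rational rule: attempt only if the expected payoff under the agent's
    own estimate is positive (declining is assumed to pay 0)."""
    return expected_value(p_estimated, gain, cost) > 0.0

gain, cost = 1.0, 3.0          # failure is three times as costly as success is valuable
p_estimated, p_true = 0.9, 0.6 # an overconfident self-estimate

attempts = should_attempt(p_estimated, gain, cost)   # True: rational given 0.9
realized = expected_value(p_true, gain, cost)        # negative: about -0.6 per attempt
```

With these payoffs the break-even probability is 0.75, so an agent that believes 0.9 attempts the task even though a true rate of 0.6 makes each attempt a loss on average.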

[50] R-Debater: Retrieval-Augmented Debate Generation through Argumentative Memory

Maoyuan Li,Zhongsheng Wang,Haoyuan Li,Jiamou Liu

Main category: cs.CL

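TL;DR: R-Debater is a multi-turn debate generation framework built on argumentative memory that combines retrieval augmentation with role-based agents to improve consistency, evidence use, and coherence in debates.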

Details Motivation: Existing LLMs struggle to keep a consistent stance and to use evidence effectively in multi-turn debates, and they lack a mechanism for recalling and adapting earlier arguments. Method: R-Debater integrates a debate knowledge base, used to retrieve case-like evidence and prior debate moves, with a role-based agent that composes coherent utterances. It is evaluated on ORCHID debate data with a 1,000-item retrieval corpus and 32 held-out debates. Result: R-Debater outperforms strong LLM baselines on both single-turn generation and multi-turn adversarial simulation, with higher InspireScore and Debatrix scores; human evaluation confirms better stance consistency and evidence support. Conclusion: Combining retrieval grounding with structured planning improves the faithfulness, stance alignment, and cross-turn coherence of multi-turn debate systems. Abstract: We present R-Debater, an agentic framework for generating multi-turn debates built on argumentative memory. Grounded in rhetoric and memory studies, the system views debate as a process of recalling and adapting prior arguments to maintain stance consistency, respond to opponents, and support claims with evidence. Specifically, R-Debater integrates a debate knowledge base for retrieving case-like evidence and prior debate moves with a role-based agent that composes coherent utterances across turns. We evaluate on standardized ORCHID debates, constructing a 1,000-item retrieval corpus and a held-out set of 32 debates across seven domains. Two tasks are evaluated: next-utterance generation, assessed by InspireScore (subjective, logical, and factual), and adversarial multi-turn simulations, judged by Debatrix (argument, source, language, and overall). Compared with strong LLM baselines, R-Debater achieves higher single-turn and multi-turn scores. Human evaluation with 20 experienced debaters further confirms its consistency and evidence use, showing that combining retrieval grounding with structured planning yields more faithful, stance-aligned, and coherent debates across turns.

[51] MUSIC: MUlti-Step Instruction Contrast for Multi-Turn Reward Models

Wenzhe Li,Shujian Zhang,Wenxuan Zhou,John Lambert,Chi Jin,Andrew Hard,Rajiv Mathews,Lun Wang

Main category: cs.CL

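TL;DR: This paper proposes MUSIC, an unsupervised data augmentation strategy that improves multi-turn reward models (RMs) by introducing contrasts spanning multiple turns of a conversation. Experiments show markedly higher agreement with the judgments of advanced LLM judges without sacrificing single-turn performance.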

Details Motivation: Existing preference datasets usually contrast responses only on the final turn of a conversation, which fails to capture the complexity of multi-turn interaction and leads to weak automated multi-turn evaluation; an evaluation method that better models multi-turn dynamics is needed. Method: Proposes MUlti-Step Instruction Contrast (MUSIC), which synthesizes, without supervision, contrastive conversation pairs that differ across multiple turns, and trains a multi-turn reward model based on Gemma-2-9B-Instruct on the Skywork preference dataset. Result: The MUSIC-augmented reward model agrees more closely with advanced proprietary LLM judges on multi-turn conversations than baseline methods, while remaining competitive on standard single-turn reward modeling tasks. Conclusion: Contrastive signals spanning multiple turns are critical for building robust multi-turn reward models, and MUSIC offers an effective, scalable way to improve multi-turn evaluation of LLMs. Abstract: Evaluating the quality of multi-turn conversations is crucial for developing capable Large Language Models (LLMs), yet remains a significant challenge, often requiring costly human evaluation. Multi-turn reward models (RMs) offer a scalable alternative and can provide valuable signals for guiding LLM training. While recent work has advanced multi-turn training techniques, effective automated evaluation specifically for multi-turn interactions lags behind. We observe that standard preference datasets, typically contrasting responses based only on the final conversational turn, provide insufficient signal to capture the nuances of multi-turn interactions. Instead, we find that incorporating contrasts spanning multiple turns is critical for building robust multi-turn RMs. Motivated by this finding, we propose MUlti-Step Instruction Contrast (MUSIC), an unsupervised data augmentation strategy that synthesizes contrastive conversation pairs exhibiting differences across multiple turns. Leveraging MUSIC on the Skywork preference dataset, we train a multi-turn RM based on the Gemma-2-9B-Instruct model. Empirical results demonstrate that our MUSIC-augmented RM outperforms baseline methods, achieving higher alignment with judgments from advanced proprietary LLM judges on multi-turn conversations, crucially, without compromising performance on standard single-turn RM benchmarks.

[52] BIOME-Bench: A Benchmark for Biomolecular Interaction Inference and Multi-Omics Pathway Mechanism Elucidation from Scientific Literature

Sibo Wei,Peng Chen,Lifeng Dong,Yin Luo,Lei Wang,Peng Zhang,Wenpeng Lu,Jianbin Guo,Hongjun Yang,Dajun Zeng

Main category: cs.CL

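TL;DR: This paper introduces BIOME-Bench, a standardized benchmark for evaluating large language models on multi-omics pathway mechanism elucidation, and shows that existing models fall short on biomolecular relation inference and pathway mechanism explanation.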

Details Motivation: Existing pathway enrichment methods are limited by curation lag in pathway databases, functional redundancy, and insensitivity to molecular states, and there is no standardized benchmark for systematically evaluating LLMs in multi-omics analysis. Method: BIOME-Bench is built through a four-stage workflow and defines two core tasks, Biomolecular Interaction Inference and end-to-end Multi-Omics Pathway Mechanism Elucidation, each with its own evaluation protocol. Result: Experiments show that current LLMs still have substantial deficiencies in identifying fine-grained biomolecular relations and in generating faithful, robust pathway mechanism explanations. Conclusion: LLMs and their evaluation frameworks need further improvement before they can be reliable and practical for interpreting multi-omics data. Abstract: Multi-omics studies often rely on pathway enrichment to interpret heterogeneous molecular changes, but pathway enrichment (PE)-based workflows inherit structural limitations of pathway resources, including curation lag, functional redundancy, and limited sensitivity to molecular states and interventions. Although recent work has explored using large language models (LLMs) to improve PE-based interpretation, the lack of a standardized benchmark for end-to-end multi-omics pathway mechanism elucidation has largely confined evaluation to small, manually curated datasets or ad hoc case studies, hindering reproducible progress. To address this issue, we introduce BIOME-Bench, constructed via a rigorous four-stage workflow, to evaluate two core capabilities of LLMs in multi-omics analysis: Biomolecular Interaction Inference and end-to-end Multi-Omics Pathway Mechanism Elucidation. We develop evaluation protocols for both tasks and conduct comprehensive experiments across multiple strong contemporary models. Experimental results demonstrate that existing models still exhibit substantial deficiencies in multi-omics analysis, struggling to reliably distinguish fine-grained biomolecular relation types and to generate faithful, robust pathway-level mechanistic explanations.

[53] Uncertainty-aware Semi-supervised Ensemble Teacher Framework for Multilingual Depression Detection

Mohammad Zia Ur Rehman,Velpuru Navya,Sanskar,Shuja Uddin Qureshi,Nagendra Kumar

Main category: cs.CL

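TL;DR: Proposes Semi-SMDNet, a semi-supervised multilingual depression detection network that combines teacher-student pseudo-labelling, ensemble learning, and data augmentation to improve depression detection in low-resource languages.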

Details Motivation: Detecting depression from social media text remains challenging because of differing language styles, informal expression, and the lack of annotated data in many languages. Method: Uses a teacher-student pseudo-labelling framework with ensemble learning and data augmentation; teacher predictions are combined by soft voting, an uncertainty threshold filters out low-confidence pseudo-labels, and a confidence-weighted training strategy improves cross-lingual robustness. Result: The approach consistently outperforms strong baselines on Arabic, Bangla, English, and Spanish datasets and narrows the performance gap between high-resource and low-resource settings. Conclusion: The framework performs well for multilingual depression detection and suits scalable mental health monitoring where labelled resources are limited. Abstract: Detecting depression from social media text is still a challenging task. This is due to different language styles, informal expression, and the lack of annotated data in many languages. To tackle these issues, we propose Semi-SMDNet, a strong Semi-Supervised Multilingual Depression detection Network. It combines teacher-student pseudo-labelling, ensemble learning, and augmentation of data. Our framework uses a group of teacher models. Their predictions come together through soft voting. An uncertainty-based threshold filters out low-confidence pseudo-labels to reduce noise and improve learning stability. We also use a confidence-weighted training method that focuses on reliable pseudo-labelled samples. This greatly boosts robustness across languages. Tests on Arabic, Bangla, English, and Spanish datasets show that our approach consistently beats strong baselines. It significantly reduces the performance gap between settings that have plenty of resources and those that do not. Detailed experiments and studies confirm that our framework is effective and can be used in various situations. This shows that it is suitable for scalable, cross-language mental health monitoring where labelled resources are limited.
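The soft-voting and uncertainty-filtering steps described above can be illustrated with a minimal sketch. The abstract does not specify the uncertainty measure or the threshold, so the entropy criterion, the `max_entropy` value, and the function name `select_pseudo_labels` are all assumptions made for illustration; they are not taken from Semi-SMDNet itself.

```python
import numpy as np

def select_pseudo_labels(teacher_probs, max_entropy=0.5):
    """Soft-vote over teachers, then drop uncertain samples.

    teacher_probs: array of shape (n_teachers, n_samples, n_classes).
    max_entropy: hypothetical threshold (in nats) on the entropy of the
                 averaged distribution; not a value from the paper.
    """
    probs = np.asarray(teacher_probs, dtype=float)
    mean_probs = probs.mean(axis=0)             # soft voting: average class probabilities
    labels = mean_probs.argmax(axis=1)          # pseudo-label = most probable class
    confidence = mean_probs.max(axis=1)         # usable as a per-sample training weight
    entropy = -(mean_probs * np.log(mean_probs + 1e-12)).sum(axis=1)
    keep = entropy <= max_entropy               # uncertainty-based filter
    return labels[keep], confidence[keep], keep
```

The returned `confidence` values could serve as per-sample loss weights in a confidence-weighted training step, which is one plausible reading of the training method the abstract mentions.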

[54] Compute-Accuracy Pareto Frontiers for Open-Source Reasoning Large Language Models

Ákos Prucs,Márton Csutora,Mátyás Antal,Márk Marosi

Main category: cs.CL

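TL;DR: This paper studies the test-time compute efficiency of large language models on math- and reasoning-intensive benchmarks, finds that the MoE architecture balances performance and efficiency well, and shows that accuracy gains saturate as compute increases.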

Details Motivation: Current research overlooks the heavy computational burden of generating long reasoning chains, whereas model selection in industrial applications depends on resource constraints and inference cost as well as accuracy. Method: Conducts a test-time-compute-aware evaluation of both contemporary and older open-source LLMs, maps their Pareto frontiers on math- and reasoning-intensive benchmarks, and analyzes how efficiency has changed over time. Result: The Mixture of Experts (MoE) architecture performs well on both performance and efficiency; accuracy gains from inference-time compute saturate, with diminishing returns beyond a threshold. Conclusion: Extended reasoning helps but cannot overcome a model's intrinsic limitations on specific complexities, so balancing compute cost against performance is key to future model design. Abstract: Large Language Models (LLMs) are demonstrating rapid improvements on complex reasoning benchmarks, particularly when allowed to utilize intermediate reasoning steps before converging on a final solution. However, current literature often overlooks the significant computational burden associated with generating long reasoning sequences. For industrial applications, model selection depends not only on raw accuracy but also on resource constraints and inference costs. In this work, we conduct a test-time-compute aware evaluation of both contemporary and older open-source LLMs, mapping their Pareto frontiers across math- and reasoning-intensive benchmarks. Our findings identify the Mixture of Experts (MoE) architecture as a strong candidate to balance performance and efficiency in our evaluation setting. Furthermore, we trace the trajectory of Pareto efficiency over time to derive an emergent trend regarding accuracy gain per unit of compute. Finally, we demonstrate that there is a saturation point for inference-time compute. Beyond a certain threshold, accuracy gains diminish, indicating that while extended reasoning capabilities are beneficial, they cannot overcome intrinsic model limitations regarding specific complexities.
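As a rough illustration of what mapping a compute-accuracy Pareto frontier involves, the sketch below keeps only the models that no cheaper (or equally cheap) model matches or beats in accuracy. The tuple layout, the function name `pareto_frontier`, and the choice of a single scalar compute cost are assumptions for this example; the paper's actual compute measure is not given in the abstract.

```python
def pareto_frontier(points):
    """Return the names of models on the compute-accuracy Pareto frontier.

    points: iterable of (name, compute_cost, accuracy) tuples, where lower
            cost and higher accuracy are better (hypothetical layout).
    """
    frontier = []
    best_acc = float("-inf")
    # Scan from cheapest to most expensive; on cost ties, the most accurate first.
    for name, cost, acc in sorted(points, key=lambda p: (p[1], -p[2])):
        if acc > best_acc:          # strictly better than every cheaper model
            frontier.append(name)
            best_acc = acc
    return frontier
```

A model that costs more than another without being more accurate is dominated and never appears in the result.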

[55] Practising responsibility: Ethics in NLP as a hands-on course

Malvina Nissim,Viviana Patti,Beatrice Savoldi

Main category: cs.CL

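TL;DR: This paper presents a course on ethical issues in Natural Language Processing (NLP) and its teaching approach grounded in active learning, designed to cope with the field's rapid technical change and the need to foster critical thinking in NLP education.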

Details Motivation: As NLP systems become more pervasive, integrating ethical considerations into education has become essential; however, curriculum design is hard because the field evolves quickly and critical thinking must be fostered beyond traditional technical training. Method: Uses an active-learning pedagogy with interactive sessions, hands-on activities, and "learning by teaching", refined over four years across different institutions, educational levels, and interdisciplinary backgrounds. Result: The course has produced many reusable teaching materials and educational products aimed at diverse audiences, all made by the students themselves. Conclusion: The authors share the course design and their experience to inspire educators who want to bring social impact considerations into their teaching. Abstract: As Natural Language Processing (NLP) systems become more pervasive, integrating ethical considerations into NLP education has become essential. However, this presents inherent challenges in curriculum development: the field's rapid evolution from both academia and industry, and the need to foster critical thinking beyond traditional technical training. We introduce our course on Ethical Aspects in NLP and our pedagogical approach, grounded in active learning through interactive sessions, hands-on activities, and "learning by teaching" methods. Over four years, the course has been refined and adapted across different institutions, educational levels, and interdisciplinary backgrounds; it has also yielded many reusable products, both in the form of teaching materials and in the form of actual educational products aimed at diverse audiences, made by the students themselves. By sharing our approach and experience, we hope to provide inspiration for educators seeking to incorporate social impact considerations into their curricula.

[56] Triangulation as an Acceptance Rule for Multilingual Mechanistic Interpretability

Yanan Long

Main category: cs.CL

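TL;DR: This paper proposes "triangulation", an acceptance rule for mechanistic claims about multilingual models that requires necessity, sufficiency, and invariance across meaning-preserving environments, and presents a comparative experimental protocol spanning multiple model families, language pairs, and tasks.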

Details Motivation: Multilingual language models behave inconsistently across languages, scripts, and cultures, so mechanistic explanations must meet a causal standard to be reliable. Method: Formalizes reference families as predicate-preserving variants and introduces triangulation, which requires necessity, sufficiency, and invariance; candidate subgraphs come from automatic circuit discovery and are accepted or rejected by triangulation. Result: Triangulation provides a falsifiable standard that filters out spurious circuits which pass single-environment tests but fail cross-lingual invariance; a comparative protocol across several model families, language pairs, and tasks is presented. Conclusion: Triangulation offers a causal validation framework for mechanistic analysis of multilingual models, strengthening the robustness and credibility of cross-lingual explanations. Abstract: Multilingual language models achieve strong aggregate performance yet often behave unpredictably across languages, scripts, and cultures. We argue that mechanistic explanations for such models should satisfy a causal standard: claims must survive causal interventions and must cross-reference across environments that perturb surface form while preserving meaning. We formalize reference families as predicate-preserving variants and introduce triangulation, an acceptance rule requiring necessity (ablating the circuit degrades the target behavior), sufficiency (patching activations transfers the behavior), and invariance (both effects remain directionally stable and of sufficient magnitude across the reference family). To supply candidate subgraphs, we adopt automatic circuit discovery and accept or reject those candidates by triangulation. We ground triangulation in causal abstraction by casting it as an approximate transformation score over a distribution of interchange interventions, connect it to the pragmatic interpretability agenda, and present a comparative experimental protocol across multiple model families, language pairs, and tasks. Triangulation provides a falsifiable standard for mechanistic claims that filters spurious circuits passing single-environment tests but failing cross-lingual invariance.
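The acceptance rule can be read as a conjunction of checks over every environment in a reference family. The sketch below is one possible simplification under stated assumptions: each environment yields two scalar effect sizes, and a single hypothetical threshold `min_effect` stands in for "sufficient magnitude". The `Effects` class, field names, and threshold are illustrative and do not come from the paper, which grounds the rule in an approximate transformation score rather than in raw thresholds.

```python
from dataclasses import dataclass

@dataclass
class Effects:
    ablation_drop: float  # drop in the target behavior when the circuit is ablated
    patch_gain: float     # behavior transferred when the circuit's activations are patched

def triangulate(effects_by_env, min_effect=0.1):
    """Accept a candidate circuit only if both effects hold in every environment.

    effects_by_env: dict mapping an environment name (e.g. a language) to Effects.
    min_effect: hypothetical magnitude threshold; positive values also fix the
                expected direction of each effect.
    """
    if not effects_by_env:
        return False
    envs = effects_by_env.values()
    necessity = all(e.ablation_drop >= min_effect for e in envs)
    sufficiency = all(e.patch_gain >= min_effect for e in envs)
    # Invariance is enforced by requiring both conditions in *all* environments.
    return necessity and sufficiency
```

A circuit that passes in one language but fails in another is rejected, which mirrors the filtering of single-environment artifacts that the abstract describes.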

[57] PrivacyBench: A Conversational Benchmark for Evaluating Privacy in Personalized AI

Srija Mukhopadhyay,Sathwik Reddy,Shruthi Muthukumar,Jisun An,Ponnurangam Kumaraguru

Main category: cs.CL

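TL;DR: PrivacyBench is a socially grounded benchmark for evaluating how well AI agents protect user privacy and keep secrets in multi-turn conversations; the study finds that current RAG systems still leak information at a serious rate.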

Details Motivation: Personalized AI systems that lack social-context awareness can leak users' sensitive information and threaten digital privacy, so a tool for measuring and improving this behavior is needed. Method: Introduces PrivacyBench, which contains socially grounded datasets with embedded secrets, tests Retrieval-Augmented Generation (RAG) assistants for privacy leakage through multi-turn conversations, and evaluates mitigations such as a privacy-aware prompt. Result: RAG assistants leak secrets in up to 26.56% of interactions; a privacy-aware prompt lowers leakage to 5.12%, but the retrieval mechanism still accesses sensitive data indiscriminately, making the generator a single point of failure for privacy. Conclusion: Current architectures cannot guarantee safety for wide-scale deployment; structural, privacy-by-design safeguards are needed to ensure an ethical and inclusive web. Abstract: Personalized AI agents rely on access to a user's digital footprint, which often includes sensitive data from private emails, chats and purchase histories. Yet this access creates a fundamental societal and privacy risk: systems lacking social-context awareness can unintentionally expose user secrets, threatening digital well-being. We introduce PrivacyBench, a benchmark with socially grounded datasets containing embedded secrets and a multi-turn conversational evaluation to measure secret preservation. Testing Retrieval-Augmented Generation (RAG) assistants reveals that they leak secrets in up to 26.56% of interactions. A privacy-aware prompt lowers leakage to 5.12%, yet this measure offers only partial mitigation. The retrieval mechanism continues to access sensitive data indiscriminately, which shifts the entire burden of privacy preservation onto the generator. This creates a single point of failure, rendering current architectures unsafe for wide-scale deployment. Our findings underscore the urgent need for structural, privacy-by-design safeguards to ensure an ethical and inclusive web for everyone.

[58] Big AI is accelerating the metacrisis: What can we do?

Steven Bird

Main category: cs.CL

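TL;DR: This paper discusses the "metacrisis" formed by converging ecological, meaning, and language crises, points to the role that Big AI and the language engineers behind it play, and calls for rethinking the direction of NLP around human flourishing and life on the planet.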

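Details Motivation: To respond to the metacrisis formed by converging ecological, meaning, and language crises, and to reflect on the negative social and environmental effects of current NLP technology. Method: Critically analyzes the dominant paradigm of AI and NLP development, in particular the scalability narrative and the assumption of value neutrality, and argues that alternative paths are needed. Result: Identifies the role of language engineers in deepening the metacrisis, including supplying talent to plutocrats and kleptocrats and treating technology as value-free. Conclusion: Collective intelligence must be applied to redesign a future for NLP that is human-centered and life-affirming. Abstract: The world is in the grip of ecological, meaning, and language crises which are converging into a metacrisis. Big AI is accelerating them all. Language engineers are playing a central role, persisting with a scalability story that is failing humanity, supplying critical talent to plutocrats and kleptocrats, and creating new technologies as if the whole endeavour was value-free. We urgently need to explore alternatives, applying our collective intelligence to design a life-affirming future for NLP that is centered on human flourishing on a living planet.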

[59] Encyclo-K: Evaluating LLMs with Dynamically Composed Knowledge Statements

Yiming Liang,Yizhi Li,Yantao Du,Ge Zhang,Jiayi Zhou,Yuchen Wu,Yinzhu Piao,Denghui Cao,Tong Sun,Ziniu Li,Li Du,Bo Lei,Jiaheng Liu,Chenghua Lin,Zhaoxiang Zhang,Wenhao Huang,Jiajun Zhang

Main category: cs.CL

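TL;DR: This paper proposes Encyclo-K, a statement-based dynamic evaluation framework that extracts fine-grained knowledge statements from authoritative textbooks and randomly composes them into questions, addressing the data contamination, single-knowledge-point assessment, and high annotation cost of conventional benchmarks.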

Details Motivation: Existing LLM benchmarks suffer from three problems: data contamination, restriction to single-knowledge-point assessment, and reliance on costly expert annotation; a more reliable, comprehensive, and scalable evaluation method is needed. Method: Extracts standalone knowledge statements from authoritative textbooks as the basic unit and composes questions dynamically by random sampling at test time; each question aggregates 8-10 statements for multi-knowledge assessment, and annotators without domain expertise only verify formatting compliance. Result: In experiments on over 50 LLMs, even the top-performing GPT-5.1 reaches only 62.07% accuracy, and models show a clear gradient, confirming that the benchmark is challenging and discriminative. Conclusion: Encyclo-K provides a scalable, contamination-resistant, dynamic framework for evaluating LLMs' comprehensive understanding across multiple knowledge points. Abstract: Benchmarks play a crucial role in tracking the rapid advancement of large language models (LLMs) and identifying their capability boundaries. However, existing benchmarks predominantly curate questions at the question level, suffering from three fundamental limitations: vulnerability to data contamination, restriction to single-knowledge-point assessment, and reliance on costly domain expert annotation. We propose Encyclo-K, a statement-based benchmark that rethinks benchmark construction from the ground up. Our key insight is that knowledge statements, not questions, can serve as the unit of curation, and questions can then be constructed from them. We extract standalone knowledge statements from authoritative textbooks and dynamically compose them into evaluation questions through random sampling at test time. This design directly addresses all three limitations: the combinatorial space is too vast to memorize, and model rankings remain stable across dynamically generated question sets, enabling reliable periodic dataset refresh; each question aggregates 8-10 statements for comprehensive multi-knowledge assessment; annotators only verify formatting compliance without requiring domain expertise, substantially reducing annotation costs. Experiments on over 50 LLMs demonstrate that Encyclo-K poses substantial challenges with strong discriminative power. Even the top-performing OpenAI-GPT-5.1 achieves only 62.07% accuracy, and model performance displays a clear gradient distribution: reasoning models span from 16.04% to 62.07%, while chat models range from 9.71% to 50.40%. These results validate the challenges introduced by dynamic evaluation and multi-statement comprehensive understanding. These findings establish Encyclo-K as a scalable framework for dynamic evaluation of LLMs' comprehensive understanding over multiple fine-grained disciplinary knowledge statements.
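To make the idea of composing questions from statements at test time concrete, here is a minimal sketch. It assumes, purely for illustration, that each statement in the pool carries a true/false label and that a question asks which of the sampled statements are correct; the abstract does not describe the question format, so `compose_question` and the pool layout are hypothetical.

```python
import random

def compose_question(pool, rng, k_min=8, k_max=10):
    """Sample 8-10 statements from the pool and build one question.

    pool: list of (statement_text, is_true) pairs (hypothetical layout).
    rng:  a random.Random instance, so question sets can be regenerated
          with a new seed at each evaluation round.
    Returns (prompt, answer), where answer lists the 1-based numbers of
    the statements labeled true.
    """
    k = rng.randint(k_min, k_max)            # question size, inclusive bounds
    chosen = rng.sample(pool, k)             # sampling without replacement
    prompt = "\n".join(f"{i + 1}. {text}" for i, (text, _) in enumerate(chosen))
    answer = [i + 1 for i, (_, is_true) in enumerate(chosen) if is_true]
    return prompt, answer
```

Because each call draws a fresh combination, the number of possible questions grows combinatorially with the pool size, which is the property the abstract relies on for contamination resistance.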

[60] mHC: Manifold-Constrained Hyper-Connections

Zhenda Xie,Yixuan Wei,Huanqi Cao,Chenggang Zhao,Chengqi Deng,Jiashi Li,Damai Dai,Huazuo Gao,Jiang Chang,Liang Zhao,Shangyan Zhou,Zhean Xu,Zhengyan Zhang,Wangding Zeng,Shengding Hu,Yuqing Wang,Jingyang Yuan,Lean Wang,Wenfeng Liang

Main category: cs.CL

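TL;DR: Proposes the Manifold-Constrained Hyper-Connections (mHC) framework, which uses a manifold projection to restore the identity mapping property in Hyper-Connections and adds infrastructure optimization to improve training scalability and efficiency.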

Details Motivation: Existing Hyper-Connections methods diversify connectivity patterns in a way that breaks the identity mapping property of residual connections, causing training instability, limited scalability, and extra memory overhead. Method: Projects the residual connection space of Hyper-Connections onto a specific manifold to restore identity mapping, combined with rigorous infrastructure optimization for efficiency. Result: Experiments show that mHC is effective for training at scale, with performance improvements and better scalability. Conclusion: As a flexible and practical extension of HC, mHC contributes to a deeper understanding of topological architecture design and suggests directions for foundation models. Abstract: Recently, studies exemplified by Hyper-Connections (HC) have extended the ubiquitous residual connection paradigm established over the past decade by expanding the residual stream width and diversifying connectivity patterns. While yielding substantial performance gains, this diversification fundamentally compromises the identity mapping property intrinsic to the residual connection, which causes severe training instability and restricted scalability, and additionally incurs notable memory access overhead. To address these challenges, we propose Manifold-Constrained Hyper-Connections (mHC), a general framework that projects the residual connection space of HC onto a specific manifold to restore the identity mapping property, while incorporating rigorous infrastructure optimization to ensure efficiency. Empirical experiments demonstrate that mHC is effective for training at scale, offering tangible performance improvements and superior scalability. We anticipate that mHC, as a flexible and practical extension of HC, will contribute to a deeper understanding of topological architecture design and suggest promising directions for the evolution of foundational models.

[61] BEDA: Belief Estimation as Probabilistic Constraints for Performing Strategic Dialogue Acts

Hengli Li,Zhaoxin Yu,Qi Shen,Chenxi Li,Mengmeng Wang,Tinglang Wu,Yipeng Kang,Yuxuan Wang,Song-Chun Zhu,Zixia Jia,Zilong Zheng

Main category: cs.CL

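TL;DR: This paper proposes the BEDA framework, which uses belief estimation as probabilistic constraints during generation, formalizes two core dialogue acts (Adversarial and Alignment), and outperforms strong baselines on several tasks.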

Details Motivation: Prior work can estimate a dialogue agent's beliefs accurately but lacks a principled mechanism for using those beliefs during generation. Method: Proposes BEDA, consisting of a world set, a belief estimator, and a conditional generator; probabilistic constraints turn beliefs into actionable dialogue acts, supporting both Adversarial and Alignment acts. Result: BEDA outperforms strong baselines in all three settings (CKBG, MF, and CaSiNo): on CKBG it improves success rate by at least 5.0 points (20.6 points with GPT-4.1-nano), on MF it improves by 9.3 points on average, and on CaSiNo it achieves the optimal deal. Conclusion: Casting belief estimation as generation constraints is a simple, general mechanism that improves the reliability and performance of strategic dialogue systems. Abstract: Strategic dialogue requires agents to execute distinct dialogue acts, for which belief estimation is essential. While prior work often estimates beliefs accurately, it lacks a principled mechanism to use those beliefs during generation. We bridge this gap by first formalizing two core acts, Adversarial and Alignment, and by operationalizing them via probabilistic constraints on what an agent may generate. We instantiate this idea in BEDA, a framework that consists of the world set, the belief estimator for belief estimation, and the conditional generator that selects acts and realizes utterances consistent with the inferred beliefs. Across three settings, Conditional Keeper Burglar (CKBG, adversarial), Mutual Friends (MF, cooperative), and CaSiNo (negotiation), BEDA consistently outperforms strong baselines: on CKBG it improves success rate by at least 5.0 points across backbones and by 20.6 points with GPT-4.1-nano; on Mutual Friends it achieves an average improvement of 9.3 points; and on CaSiNo it achieves the optimal deal relative to all baselines. These results indicate that casting belief estimation as constraints provides a simple, general mechanism for reliable strategic dialogue.

[62] Adaptive Dependency-aware Prompt Optimization Framework for Multi-Step LLM Pipeline

Minjun Zhao,Xinyu Zhang,Shuai Zhang,Deyang Li,Ruifeng Shi

Main category: cs.CL

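TL;DR: Proposes ADOPT, a framework for adaptive, dependency-aware prompt optimization in multi-step LLM pipelines; it models the dependency between each step and the final outcome to estimate textual gradients precisely, uses a Shapley-based mechanism to allocate optimization resources dynamically, and outperforms existing methods.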

Details Motivation: The performance of a multi-step LLM pipeline depends on the quality of the prompt at each step, but the lack of step-level supervision and the presence of inter-step dependencies make joint optimization hard; existing end-to-end methods perform poorly and are unstable. Method: ADOPT explicitly models the dependency between each LLM step and the final task output, decouples textual gradient estimation from gradient updates, reduces multi-prompt optimization to flexible single-prompt optimization steps, and uses a Shapley-value-based mechanism to allocate optimization resources adaptively. Result: Experiments on real-world datasets and diverse pipeline structures show that ADOPT is effective and robust, consistently outperforming state-of-the-art prompt optimization baselines. Conclusion: Through dependency-aware gradient estimation and resource allocation, ADOPT addresses the key challenges of prompt optimization in multi-step LLM pipelines and offers an efficient, reliable approach to prompt engineering for complex tasks. Abstract: Multi-step LLM pipelines invoke large language models multiple times in a structured sequence and can effectively solve complex tasks, but their performance heavily depends on the prompts used at each step. Jointly optimizing these prompts is difficult due to missing step-level supervision and inter-step dependencies. Existing end-to-end prompt optimization methods struggle under these conditions and often yield suboptimal or unstable updates. We propose ADOPT, an Adaptive Dependency-aware Prompt Optimization framework for multi-step LLM pipelines. ADOPT explicitly models the dependency between each LLM step and the final task outcome, enabling precise text-gradient estimation analogous to computing analytical derivatives. It decouples textual gradient estimation from gradient updates, reducing multi-prompt optimization to flexible single-prompt optimization steps, and employs a Shapley-based mechanism to adaptively allocate optimization resources.  Experiments on real-world datasets and diverse pipeline structures show that ADOPT is effective and robust, consistently outperforming state-of-the-art prompt optimization baselines.
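The abstract says only that resources are allocated by a "Shapley-based mechanism", so the following is a generic sketch of that idea rather than ADOPT's procedure. It computes exact Shapley values for a small set of pipeline steps from a caller-supplied value function and then splits an optimization budget in proportion to them. The value function, the proportional split, and both function names are assumptions for illustration.

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values; feasible only for a handful of players.

    players: list of hashable step identifiers.
    value:   callable mapping a frozenset of players to a float, e.g. the
             pipeline score when only those steps use improved prompts
             (a hypothetical choice of value function).
    """
    totals = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)  # marginal contribution
            coalition = with_p
    return {p: t / len(orders) for p, t in totals.items()}

def allocate_budget(shapley, budget):
    """Split a budget in proportion to the non-negative Shapley values."""
    positive = {p: max(v, 0.0) for p, v in shapley.items()}
    total = sum(positive.values())
    if total == 0:
        return {p: budget / len(shapley) for p in shapley}  # fall back to an even split
    return {p: budget * v / total for p, v in positive.items()}
```

Exact enumeration costs factorial time in the number of steps, so a pipeline with many steps would need a sampled approximation instead.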

Luis Adrián Cabrera-Diego

Main category: cs.CL

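TL;DR: Proposes a legal document classifier based on DeBERTa V3 and an LSTM that takes 48 randomly selected short chunks (at most 128 tokens each) as input, together with a deployment pipeline built on Temporal for an efficient and reliable processing workflow.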

Details Motivation: Legal documents use specialized vocabulary and are often long, so processing them directly with Transformer models may be infeasible, expensive, or slow; an efficient classification method is needed. Method: Combines DeBERTa V3 with an LSTM, takes 48 randomly selected short chunks (at most 128 tokens each) as input, and builds the deployment pipeline with Temporal to make processing reliable and robust. Result: The best model reaches a weighted F-score of 0.898, and the pipeline has a median processing time of 498 seconds per 100 files on CPU. Conclusion: The method addresses the challenge of classifying long legal documents without sacrificing performance, and Temporal provides a stable, efficient processing workflow. Abstract: Classifying legal documents is challenging: besides their specialized vocabulary, they can be very long. This means that feeding full documents to a Transformer-based model for classification might be impossible, expensive, or slow. Thus, we present a legal document classifier based on DeBERTa V3 and an LSTM that uses as input a collection of 48 randomly selected short chunks (max 128 tokens). In addition, we present its deployment pipeline using Temporal, a durable execution solution, which allows us to have a reliable and robust processing workflow. The best model had a weighted F-score of 0.898, while the pipeline running on CPU had a median processing time of 498 seconds per 100 files.
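The chunk-sampling input described above can be sketched as follows. The abstract gives only the chunk count (48) and the maximum chunk length (128 tokens); the non-overlapping split, the preservation of document order, and the handling of short documents are assumptions of this example, as is the function name `sample_chunks`.

```python
import random

def sample_chunks(tokens, rng, n_chunks=48, chunk_len=128):
    """Split a token sequence into chunks and randomly keep up to n_chunks.

    tokens: list of token ids for one document.
    rng:    a random.Random instance, for reproducible sampling.
    Chunks are non-overlapping and returned in document order (both are
    assumptions; the source does not specify them).
    """
    chunks = [tokens[i:i + chunk_len] for i in range(0, len(tokens), chunk_len)]
    if len(chunks) <= n_chunks:
        return chunks                                   # short document: keep everything
    picked = sorted(rng.sample(range(len(chunks)), n_chunks))
    return [chunks[i] for i in picked]
```

Each selected chunk would then be encoded separately, with the sequence of chunk representations passed to the recurrent layer, so the cost per document stays bounded regardless of its length.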

[64] MAMA-Memeia! Multi-Aspect Multi-Agent Collaboration for Depressive Symptoms Identification in Memes

Siddhant Agarwal,Adya Dhuler,Polly Ruhnke,Melvin Speisman,Md Shad Akhtar,Shweta Yadav

Main category: cs.CL

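TL;DR: This paper proposes MAMAMemeia, a method based on large language models and a multi-agent framework for detecting depressive symptoms expressed in social media memes, and introduces the RESTOREx resource; the method sets a new benchmark.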

Details Motivation: As memes are increasingly used to express depressive sentiments, effective methods are needed to identify them for mental health monitoring. Method: Introduces the RESTOREx resource, which combines LLM-generated and human-annotated explanations, and designs MAMAMemeia, a multi-agent, multi-aspect discussion framework grounded in Cognitive Analytic Therapy (CAT). Result: MAMAMemeia improves on the previous state of the art by 7.55% in macro-F1 and becomes the new benchmark among more than 30 methods. Conclusion: The method improves the identification of depression-related memes and provides a useful tool for mental health analysis on social media. Abstract: Over the past years, memes have evolved from being exclusively a medium of humorous exchanges to one that allows users to express a range of emotions freely and easily. With the ever-growing utilization of memes in expressing depressive sentiments, we conduct a study on identifying depressive symptoms exhibited by memes shared by users of online social media platforms. We introduce RESTOREx as a vital resource for detecting depressive symptoms in memes on social media through the Large Language Model (LLM) generated and human-annotated explanations. We introduce MAMAMemeia, a collaborative multi-agent multi-aspect discussion framework grounded in the clinical psychology method of Cognitive Analytic Therapy (CAT) Competencies. MAMAMemeia improves upon the current state-of-the-art by 7.55% in macro-F1 and is established as the new benchmark compared to over 30 methods.

[65] Modeling Language as a Sequence of Thoughts

Nasim Borazjanizadeh,James McClelland

Main category: cs.CL

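TL;DR: This paper proposes the Thought Gestalt (TG) model, a recurrent Transformer that models language at two levels, tokens and sentence-level "thought" states; trained with shared parameters and a single next-token cross-entropy objective, it improves data efficiency and reduces relational-direction errors.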

Details Motivation: Cognitive science suggests that humans convert linguistic input into persistent memory representations, whereas standard Transformers rely mainly on surface co-occurrence statistics and lack globally consistent latent representations, which hurts relational direction, contextualization, and data efficiency. Method: Proposes the Thought Gestalt (TG) model, which combines token generation with a memory of sentence-level "thought" states; both representations are generated with the same parameters, and the computation graph of sentence representations written to memory is retained so that gradients from future token losses flow back to optimize the parameters that produced earlier sentence vectors. Result: In scaling experiments, TG is more efficient than matched GPT-2 runs; scaling fits indicate that GPT-2 needs about 5-8% more data and about 33-42% more parameters to match TG's loss. TG also reduces relational-direction errors on a father-son reversal curse probe. Conclusion: By adding sentence-level memory representations and cross-attention without complicating the training objective, TG improves data efficiency and relational reasoning, supporting the value of cognitively inspired structure in language models. Abstract: Transformer language models can generate strikingly natural text by modeling language as a sequence of tokens. Yet, by relying primarily on surface-level co-occurrence statistics, they fail to form globally consistent latent representations of entities and events, the lack of which contributes to brittleness in relational direction (e.g., reversal curse), contextualization errors, and data inefficiency. On the other hand, cognitive science shows that human comprehension involves converting the input linguistic stream into compact, event-like representations that persist in memory while verbatim form is short-lived. Motivated by this view, we introduce the Thought Gestalt (TG) model, a recurrent Transformer that models language at two levels of abstraction: tokens and sentence-level "thought" states. TG generates the tokens of one sentence at a time while cross-attending to a memory of prior sentence representations. In TG, token and sentence representations are generated using the same set of model parameters and trained with a single objective, the next-token cross-entropy: by retaining the computation graph of sentence representations written to memory, gradients from future token losses flow backward through cross-attention to optimize the parameters generating earlier sentence vectors. In scaling experiments, TG consistently improves efficiency over matched GPT-2 runs, among other baselines, with scaling fits indicating GPT-2 requires ~5-8% more data and ~33-42% more parameters to match TG's loss. TG also reduces errors on relational direction generalization on a father-son reversal curse probe.

[66] AdaGReS:Adaptive Greedy Context Selection via Redundancy-Aware Scoring for Token-Budgeted RAG

Chao Peng,Bin Wang,Zhilei Long,Jinfang Sheng

Main category: cs.CL

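TL;DR: This paper proposes AdaGReS, a redundancy-aware context selection framework for retrieval-augmented generation (RAG) that explicitly models intra-set redundancy alongside relevance, uses adaptive calibration to avoid manual tuning, and improves context quality and generation across several tasks.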

Details Motivation: Standard top-k retrieval often returns redundant or near-duplicate chunks that waste a limited token budget and degrade generation, so a selection mechanism that balances relevance and redundancy automatically is needed. Method: AdaGReS optimizes a set-level objective that combines query-chunk relevance with intra-set redundancy penalties, performs greedy selection by marginal gain under a token-budget constraint, and introduces a closed-form, instance-adaptive calibration of the relevance-redundancy trade-off parameter. Result: Theoretical analysis shows that the objective is epsilon-approximately submodular under practical embedding similarity conditions, giving near-optimality guarantees for greedy selection; experiments on open-domain question answering and a high-redundancy biomedical corpus show consistent improvements in context quality and answer generation. Conclusion: AdaGReS addresses token waste and quality loss caused by redundant context in RAG, providing robust and efficient context selection through an adaptive, theoretically grounded method. Abstract: Retrieval-augmented generation (RAG) is highly sensitive to the quality of selected context, yet standard top-k retrieval often returns redundant or near-duplicate chunks that waste token budget and degrade downstream generation. We present AdaGReS, a redundancy-aware context selection framework for token-budgeted RAG that optimizes a set-level objective combining query-chunk relevance and intra-set redundancy penalties. AdaGReS performs greedy selection under a token-budget constraint using marginal gains derived from the objective, and introduces a closed-form, instance-adaptive calibration of the relevance-redundancy trade-off parameter to eliminate manual tuning and adapt to candidate-pool statistics and budget limits. We further provide a theoretical analysis showing that the proposed objective exhibits epsilon-approximate submodularity under practical embedding similarity conditions, yielding near-optimality guarantees for greedy selection. Experiments on open-domain question answering (Natural Questions) and a high-redundancy biomedical (drug) corpus demonstrate consistent improvements in redundancy control and context quality, translating to better end-to-end answer quality and robustness across settings.
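A minimal sketch of greedy, redundancy-aware selection under a token budget is given below. It assumes a set-level objective of the form "sum of relevances minus `lam` times the sum of pairwise similarities within the selected set", which is one natural reading of the abstract but is not confirmed by it. The trade-off weight `lam` is fixed here, whereas AdaGReS calibrates it adaptively with a closed-form rule that the abstract does not state; `greedy_select` and its parameters are hypothetical names.

```python
def greedy_select(relevance, similarity, token_costs, budget, lam=0.5):
    """Greedily pick chunks by marginal gain until the token budget is spent.

    relevance:   list of query-chunk relevance scores.
    similarity:  square matrix (list of lists) of chunk-chunk similarities.
    token_costs: list of token counts per chunk.
    lam:         fixed relevance-redundancy trade-off (an assumption; the
                 paper calibrates this per instance).
    Returns the indices of the selected chunks in selection order.
    """
    selected, used = [], 0
    remaining = set(range(len(relevance)))
    while remaining:
        best, best_gain = None, 0.0
        for i in sorted(remaining):                      # sorted for deterministic ties
            if used + token_costs[i] > budget:
                continue                                 # chunk does not fit
            redundancy = sum(similarity[i][j] for j in selected)
            gain = relevance[i] - lam * redundancy       # marginal gain of adding i
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:
            break                                        # nothing fits or nothing helps
        selected.append(best)
        used += token_costs[best]
        remaining.remove(best)
    return selected
```

With two near-duplicate chunks of high relevance and one distinct chunk of moderate relevance, the second duplicate's marginal gain turns negative once the first is chosen, so the distinct chunk is selected instead.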

cs.CV [Back]

[67] Leveraging Synthetic Priors for Monocular Depth Estimation in Specular Surgical Environments

Ankan Aich,Yangming Lee

Main category: cs.CV

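TL;DR: Proposes a monocular depth estimation method based on Depth Anything V2 and DV-LORA that improves accuracy and robustness in surgical endoscopic environments, especially in high-specularity regions.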

Details Motivation: Existing self-supervised monocular depth estimation methods tend to suffer boundary collapse in surgical scenes because of reflections, fluids, and transparent surfaces, and struggle to recover the geometry of thin instruments and tissue. Method: Uses the high-fidelity synthetic priors of Depth Anything V2 and adapts them efficiently to the medical domain with Dynamic Vector Low-Rank Adaptation (DV-LORA), and designs a physically stratified evaluation protocol to assess performance in high-specularity regions more accurately. Result: On the SCARED dataset the method reaches an accuracy (< 1.25) of 98.1% and reduces Squared Relative Error by over 17% compared with established baselines. Conclusion: The method bridges the gap between synthetic and real surgical environments and improves the robustness of depth estimation under difficult lighting, supporting visual perception in robotic surgery. Abstract: Accurate Monocular Depth Estimation (MDE) is critical for robotic surgery but remains fragile in specular, fluid-filled endoscopic environments. Existing self-supervised methods, typically relying on foundation models trained with noisy real-world pseudo-labels, often suffer from boundary collapse on thin surgical tools and transparent surfaces. In this work, we address this by leveraging the high-fidelity synthetic priors of the Depth Anything V2 architecture, which inherently captures precise geometric details of thin structures. We efficiently adapt these priors to the medical domain using Dynamic Vector Low-Rank Adaptation (DV-LORA), minimizing the parameter budget while bridging the synthetic-to-real gap. Additionally, we introduce a physically-stratified evaluation protocol on the SCARED dataset to rigorously quantify performance in high-specularity regimes often masked by aggregate metrics. Our approach establishes a new state-of-the-art, achieving an accuracy (< 1.25) of 98.1% and reducing Squared Relative Error by over 17% compared to established baselines, demonstrating superior robustness in adverse surgical lighting.
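The two reported numbers correspond to metrics commonly used in depth estimation. The sketch below assumes that "accuracy (< 1.25)" means the usual threshold accuracy, the fraction of pixels where max(pred/gt, gt/pred) is below 1.25, and that Squared Relative Error is the mean of (pred - gt)^2 / gt; the abstract does not define either metric, so these definitions and the function name `depth_metrics` are assumptions.

```python
import numpy as np

def depth_metrics(pred, gt):
    """Threshold accuracy (ratio < 1.25) and squared relative error.

    pred, gt: arrays of predicted and ground-truth depths; predictions are
              assumed strictly positive. Pixels with gt <= 0 are treated
              as invalid and ignored.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    valid = gt > 0
    pred, gt = pred[valid], gt[valid]
    ratio = np.maximum(pred / gt, gt / pred)        # symmetric over/under-estimation ratio
    delta1 = float((ratio < 1.25).mean())           # fraction of pixels within the threshold
    sq_rel = float((((pred - gt) ** 2) / gt).mean())
    return delta1, sq_rel
```

Because both metrics average over all valid pixels, a small specular region with large errors can be hidden in the aggregate, which is the motivation the abstract gives for a stratified evaluation protocol.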

[68] Video-Based Performance Evaluation for ECR Drills in Synthetic Training Environments

Surya Rayala,Marcos Quinones-Grueiro,Naveeduddin Mohammed,Ashwin T S,Benjamin Goldberg,Randall Spain,Paige Lawton,Gautam Biswas

Main category: cs.CV

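TL;DR: This paper proposes a video-based assessment pipeline that uses computer vision to extract 2D skeletons, gaze vectors, and movement trajectories from urban warfare training videos, builds task-specific metrics, and combines them in an extended Cognitive Task Analysis (CTA) hierarchy to quantify psychomotor skill, situational awareness, and team coordination objectively.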

Details Motivation: 传统军事训练评估依赖昂贵且侵入式的传感器或主观观察,难以实现可扩展、客观的性能评估,尤其在认知、心理运动和团队协作方面存在不足。 Method: 采用纯视频驱动的方法,通过计算机视觉模型提取2D姿态、 gaze 和轨迹数据,构建针对Enter and Clear the Room(ECR)任务的性能指标,并融入加权的扩展认知任务分析(CTA)框架,生成团队与个体的综合评分。 Result: 在真实ECR演练案例中验证了该方法的有效性,提供了可操作的、领域特定的个体与团队性能度量,并支持通过Gamemaster与GIFT系统进行交互式复盘反馈。 Conclusion: 该方法无需额外硬件即可实现对合成训练环境中的复杂技能进行客观、可扩展的自动化评估,具备应用于军事训练复盘系统的潜力,但仍受限于追踪精度、真值验证与泛化能力,未来将拓展至3D视频分析与更广泛的STE应用。 Abstract: Effective urban warfare training requires situational awareness and muscle memory, developed through repeated practice in realistic yet controlled environments. A key drill, Enter and Clear the Room (ECR), demands threat assessment, coordination, and securing confined spaces. The military uses Synthetic Training Environments that offer scalable, controlled settings for repeated exercises. However, automatic performance assessment remains challenging, particularly when aiming for objective evaluation of cognitive, psychomotor, and teamwork skills. Traditional methods often rely on costly, intrusive sensors or subjective human observation, limiting scalability and accuracy. This paper introduces a video-based assessment pipeline that derives performance analytics from training videos without requiring additional hardware. By utilizing computer vision models, the system extracts 2D skeletons, gaze vectors, and movement trajectories. From these data, we develop task-specific metrics that measure psychomotor fluency, situational awareness, and team coordination. These metrics feed into an extended Cognitive Task Analysis (CTA) hierarchy, which employs a weighted combination to generate overall performance scores for teamwork and cognition. We demonstrate the approach with a case study of real-world ECR drills, providing actionable, domain specific metrics that capture individual and team performance. 
We also discuss how these insights can support After Action Reviews with interactive dashboards within Gamemaster and the Generalized Intelligent Framework for Tutoring (GIFT), providing intuitive and understandable feedback. We conclude by addressing limitations, including tracking difficulties, ground-truth validation, and the broader applicability of our approach. Future work includes expanding analysis to 3D video data and leveraging video analysis to enable scalable evaluation within STEs.
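The weighted combination inside the CTA hierarchy can be illustrated with a small sketch. The abstract does not give the actual hierarchy, weights, or metric scales, so the node names, weights, and values below are hypothetical; the code only shows how leaf metrics could roll up into higher-level scores through a weight-normalised average.

```python
from typing import Dict

# Maps each internal node to its children and their weights (all hypothetical).
Tree = Dict[str, Dict[str, float]]


def score(node: str, tree: Tree, metrics: Dict[str, float]) -> float:
    """Return a leaf metric, or the weight-normalised mean of the child scores."""
    if node not in tree:
        return metrics[node]  # leaf: a task-specific metric in [0, 1]
    children = tree[node]
    total_weight = sum(children.values())
    return sum(w * score(c, tree, metrics) for c, w in children.items()) / total_weight


# Hypothetical hierarchy: not the weights used by the authors.
tree: Tree = {
    "overall": {"cognition": 0.5, "teamwork": 0.5},
    "cognition": {"situational_awareness": 0.6, "psychomotor_fluency": 0.4},
    "teamwork": {"team_coordination": 1.0},
}
metrics = {
    "situational_awareness": 0.8,
    "psychomotor_fluency": 0.5,
    "team_coordination": 0.9,
}

overall = score("overall", tree, metrics)
print(round(overall, 2))  # 0.5 * (0.6*0.8 + 0.4*0.5) + 0.5 * 0.9 = 0.79
```

Normalising by the sum of weights at each node keeps every score on the same scale as the leaf metrics, which makes individual and team scores directly comparable.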

[69] Pretraining Frame Preservation in Autoregressive Video Memory Compression

Lvmin Zhang,Shengqu Cai,Muyang Li,Chong Zeng,Beijia Lu,Anyi Rao,Song Han,Gordon Wetzstein,Maneesh Agrawala

Main category: cs.CV

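TL;DR: Proposes PFP, a neural network structure that compresses long videos into short contexts, with a pretraining objective that preserves the high-frequency details of single frames at arbitrary temporal positions.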

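Details Motivation: Long videos need to be compressed effectively while retaining key visual details, to support downstream video generation and memory modeling. Method: Designs the PFP network with an explicit pretraining objective that preserves the high-frequency details of single frames, and uses the model as a memory encoder for autoregressive video models. Result: The baseline model compresses a 20-second video into a context of about 5k length, supports perceptually faithful reconstruction of random frames, and enables long-term memory modeling at low context cost and with low fidelity loss. Conclusion: PFP strikes a good balance between video compression and reconstruction and suits video generation tasks that need long history memory. Abstract: We present PFP, a neural network structure to compress long videos into short contexts, with an explicit pretraining objective to preserve the high-frequency details of single frames at arbitrary temporal positions. The baseline model can compress a 20-second video into a context at about 5k length, where random frames can be retrieved with perceptually preserved appearances. Such pretrained models can be directly fine-tuned as memory encoders for autoregressive video models, enabling long history memory with low context cost and relatively low fidelity loss. We evaluate the framework with ablative settings and discuss the trade-offs of possible neural architecture designs.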

[70] Lifelong Domain Adaptive 3D Human Pose Estimation

Qucheng Peng,Hongfei Xue,Pu Wang,Chen Chen

Main category: cs.CV

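TL;DR: This paper proposes a new task, lifelong domain adaptive 3D human pose estimation, introducing lifelong domain adaptation to 3D HPE for the first time. A generative adversarial network framework combined with a novel 3D pose generator paradigm mitigates domain shift and catastrophic forgetting and performs strongly on multiple datasets.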

Details Motivation: Existing domain adaptation methods overlook the non-stationarity of target pose datasets and struggle to keep adapting to new domains without access to the source domain and earlier target domains, which leads to forgetting and poor generalization. Method: Proposes a new lifelong domain adaptive 3D HPE framework based on a GAN, comprising 3D pose generators, a 2D pose discriminator, and a 3D pose estimator; the 3D pose generator integrates pose-aware, temporal-aware, and domain-aware knowledge to improve adaptation and reduce forgetting. Result: Extensive experiments on diverse domain adaptive 3D HPE datasets show that the method outperforms existing methods both in adapting to the current domain and in retaining knowledge from earlier ones. Conclusion: The method addresses knowledge transfer and retention under continual domain change in 3D HPE and offers an effective solution for pose estimation in unconstrained real-world settings. Abstract: 3D Human Pose Estimation (3D HPE) is vital in various applications, from person re-identification and action recognition to virtual reality. However, the reliance on annotated 3D data collected in controlled environments poses challenges for generalization to diverse in-the-wild scenarios. Existing domain adaptation (DA) paradigms like general DA and source-free DA for 3D HPE overlook the issues of non-stationary target pose datasets. To address these challenges, we propose a novel task named lifelong domain adaptive 3D HPE. To our knowledge, we are the first to introduce lifelong domain adaptation to the 3D HPE task. In this lifelong DA setting, the pose estimator is pretrained on the source domain and subsequently adapted to distinct target domains. Moreover, during adaptation to the current target domain, the pose estimator cannot access the source and all the previous target domains. The lifelong DA for 3D HPE involves overcoming challenges in adapting to current domain poses and preserving knowledge from previous domains, particularly combating catastrophic forgetting. We present an innovative Generative Adversarial Network (GAN) framework, which incorporates 3D pose generators, a 2D pose discriminator, and a 3D pose estimator. This framework effectively mitigates domain shifts and aligns original and augmented poses. Moreover, we construct a novel 3D pose generator paradigm, integrating pose-aware, temporal-aware, and domain-aware knowledge to enhance the current domain's adaptation and alleviate catastrophic forgetting on previous domains.
Our method demonstrates superior performance through extensive experiments on diverse domain adaptive 3D HPE datasets.

[71] MRI-to-CT Synthesis With Cranial Suture Segmentations Using A Variational Autoencoder Framework

Krithika Iyer,Austin Tapp,Athelia Paulli,Gabrielle Dickerson,Syed Muhammad Anwar,Natasha Lepore,Marius George Linguraru

Main category: cs.CV

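TL;DR: This study proposes a deep learning framework that converts pediatric T1-weighted MRI into synthetic CT (sCT), enabling accurate segmentation and visualization of cranial bones and sutures and overcoming MRI's limitations in bone imaging.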

Details Motivation: CT's ionizing radiation is unsuitable for children, while MRI cannot clearly depict cranial bones and sutures, so a radiation-free method that still assesses pediatric cranial development accurately is needed. Method: Builds a deep learning pipeline from domain-specific variational autoencoders that uses T1-weighted MRIs of children aged 0.2 to 2 years to generate synthetic CTs, predict cranial bone segmentation, generate suture probability heatmaps, and derive direct suture segmentation. Result: Synthetic CTs reach 99% structural similarity and a Fréchet inception distance of 1.01 relative to real CTs; segmentation averages 85% Dice across seven cranial bones and 80% Dice for sutures; TOST shows sCTs and real CTs are equivalent for skull and suture segmentation. Conclusion: This is the first framework to generate, from pediatric MRI, synthetic CTs usable for suture segmentation; it fills a key gap in noninvasive pediatric cranial assessment and could support radiation-free clinical monitoring of craniofacial development. Abstract: Quantifying normative pediatric cranial development and suture ossification is crucial for diagnosing and treating growth-related cephalic disorders. Computed tomography (CT) is widely used to evaluate cranial and sutural deformities; however, its ionizing radiation is contraindicated in children without significant abnormalities. Magnetic resonance imaging (MRI) offers radiation-free scans with superior soft tissue contrast, but unlike CT, MRI cannot elucidate cranial sutures, estimate skull bone density, or assess cranial vault growth. This study proposes a deep-learning-driven pipeline for transforming T1-weighted MRIs of children aged 0.2 to 2 years into synthetic CTs (sCTs), predicting detailed cranial bone segmentation, generating suture probability heatmaps, and deriving direct suture segmentation from the heatmaps. With our in-house pediatric data, sCTs achieved 99% structural similarity and a Fréchet inception distance of 1.01 relative to real CTs. Skull segmentation attained an average Dice coefficient of 85% across seven cranial bones, and sutures achieved 80% Dice. Equivalence of skull and suture segmentation between sCTs and real CTs was confirmed using two one-sided tests (TOST p < 0.05). To our knowledge, this is the first pediatric cranial CT synthesis framework to enable suture segmentation on sCTs derived from MRI, despite MRI's limited depiction of bone and sutures. By combining robust, domain-specific variational autoencoders, our method generates perceptually indistinguishable cranial sCTs from routine pediatric MRIs, bridging critical gaps in noninvasive cranial evaluation.
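For readers unfamiliar with the Dice coefficient reported above, the following is a minimal sketch of the standard definition, Dice = 2|A∩B| / (|A| + |B|), on flattened binary masks. The masks are toy values, not data from the paper, and the convention of returning 1.0 for two empty masks is an assumption; evaluation toolkits differ on that edge case.

```python
from typing import Sequence


def dice(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice overlap between two flattened binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    size = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if size == 0 else 2.0 * intersection / size


# Toy example: 3 predicted voxels, 3 reference voxels, 2 shared.
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
print(round(dice(pred, target), 3))  # 2*2 / (3+3) = 0.667
```

An average such as the 85% figure would then be the mean of this quantity over the seven bone labels, each treated as its own binary mask.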

[72] Scaling Remote Sensing Foundation Models: Data Domain Tradeoffs at the Peta-Scale

Charith Wickrema,Eliza Mace,Hunter Brown,Heidys Cabrera,Nick Krall,Matthew O'Neill,Shivangi Sarkar,Lowell Weissman,Eric Hughes,Guido Zarrella

Main category: cs.CV

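TL;DR: Using over a quadrillion pixels of commercial satellite electro-optical data, this study explores the scaling behavior of large vision transformers trained for remote sensing, finds that performance is limited by data rather than by model parameters, and offers practical guidance on data collection, compute, and optimization for future remote sensing foundation models.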

Details Motivation: Scaling laws in high-value domains such as remote sensing are still poorly understood, leaving large-scale training without guiding principles; the paper aims to establish, through large-scale experiments, training techniques for foundation models on high-resolution remote sensing data. Method: Using large-scale commercial satellite EO data and the MITRE Federal AI Sandbox, trains vision transformer (ViT) backbones of increasing size and analyzes their behavior and failure modes in multimodal machine learning tasks. Result: At the current scale, performance remains data-limited rather than parameter-limited; the study also identifies domain gaps that arise across remote sensing modalities. Conclusion: The results underline the importance of scaling data in remote sensing and offer practical insight into data-collection strategies, compute budgets, and training schedules for future frontier-scale foundation models. Abstract: We explore the scaling behaviors of artificial intelligence to establish practical techniques for training foundation models on high-resolution electro-optical (EO) datasets that exceed the current state-of-the-art scale by orders of magnitude. Modern multimodal machine learning (ML) applications, such as generative artificial intelligence (GenAI) systems for image captioning, search, and reasoning, depend on robust, domain-specialized encoders for non-text modalities. In natural-image domains where internet-scale data is plentiful, well-established scaling laws help optimize the joint scaling of model capacity, training compute, and dataset size. Unfortunately, these relationships are much less well-understood in high-value domains like remote sensing (RS). Using over a quadrillion pixels of commercial satellite EO data and the MITRE Federal AI Sandbox, we train progressively larger vision transformer (ViT) backbones, report success and failure modes observed at petascale, and analyze implications for bridging domain gaps across additional RS modalities. We observe that even at this scale, performance is consistent with a data-limited regime rather than a model parameter-limited one. These practical insights are intended to inform data-collection strategies, compute budgets, and optimization schedules that advance the future development of frontier-scale RS foundation models.

[73] Learning to learn skill assessment for fetal ultrasound scanning

Yipei Wang,Qianye Yang,Lior Drukker,Aris T. Papageorghiou,Yipeng Hu,J. Alison Noble

Main category: cs.CV

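TL;DR: Proposes a fetal ultrasound skill assessment method built on a bi-level optimisation framework that quantifies skill automatically from how well a task is performed, without manually predefined ratings.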

Details Motivation: Traditional ultrasound skill assessment depends on subjective expert judgement, which is time-consuming and inconsistent; existing automated methods mostly rely on supervised learning and preset features, which limits objectivity and generalization. Method: Designs a bi-level optimisation framework with a clinical task predictor and a skill predictor, optimises the two networks jointly, and uses the quality of task completion on the acquired images as the skill indicator. Result: Validation on real-world clinical ultrasound videos of the fetal head shows the method is feasible and can predict operator skill. Conclusion: The framework enables objective, automated ultrasound skill assessment and suggests a new direction for skills training in medical imaging. Abstract: Traditionally, ultrasound skill assessment has relied on expert supervision and feedback, a process known for its subjectivity and time-intensive nature. Previous works on quantitative and automated skill assessment have predominantly employed supervised learning methods, often limiting the analysis to predetermined or assumed factors considered influential in determining skill levels. In this work, we propose a novel bi-level optimisation framework that assesses fetal ultrasound skills by how well a task is performed on the acquired fetal ultrasound images, without using manually predefined skill ratings. The framework consists of a clinical task predictor and a skill predictor, which are optimised jointly by refining the two networks simultaneously. We validate the proposed method on real-world clinical ultrasound videos of scanning the fetal head. The results demonstrate the feasibility of predicting ultrasound skills by the proposed framework, which quantifies optimised task performance as a skill indicator.

[74] MGML: A Plug-and-Play Meta-Guided Multi-Modal Learning Framework for Incomplete Multimodal Brain Tumor Segmentation

Yulong Zou,Bo Liu,Cun-Jing Zheng,Yuan-ming Geng,Siyue Li,Qiankun Zuo,Shuihua Wang,Yudong Zhang,Jin Hong

Main category: cs.CV

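TL;DR: Proposes a meta-guided multi-modal learning (MGML) framework that improves brain tumor MRI segmentation when modalities are missing. It comprises adaptive modality fusion and consistency regularization modules, requires no change to the model architecture, trains end to end, and outperforms existing methods on BraTS2020 and BraTS2023.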

Details Motivation: Multimodal MRI data are often incomplete in clinical practice, and making full use of incomplete multimodal information for lesion segmentation is a key challenge. Method: Proposes the MGML framework with two modules: (1) meta-parameterized adaptive modality fusion (Meta-AMF), which generates soft-label supervision signals from the available modalities to achieve dynamic fusion; and (2) a consistency regularization module that strengthens robustness and generalization. The method leaves the original model architecture unchanged and is easy to integrate into the training pipeline. Result: On BraTS2020, the average Dice scores over the 15 missing-modality combinations are 87.55, 79.36, and 62.67 for WT, TC, and ET respectively, better than several state-of-the-art methods; experiments on BraTS2023 also confirm its effectiveness. Conclusion: MGML makes effective use of incomplete multimodal MRI data and, through adaptive fusion and consistency learning, markedly improves brain tumor segmentation, with good generality and practicality. Abstract: Leveraging multimodal information from Magnetic Resonance Imaging (MRI) plays a vital role in lesion segmentation, especially for brain tumors. However, in clinical practice, multimodal MRI data are often incomplete, making it challenging to fully utilize the available information. Therefore, maximizing the utilization of this incomplete multimodal information presents a crucial research challenge. We present a novel meta-guided multi-modal learning (MGML) framework that comprises two components: meta-parameterized adaptive modality fusion and consistency regularization module. The meta-parameterized adaptive modality fusion (Meta-AMF) enables the model to effectively integrate information from multiple modalities under varying input conditions. By generating adaptive soft-label supervision signals based on the available modalities, Meta-AMF explicitly promotes more coherent multimodal fusion. In addition, the consistency regularization module enhances segmentation performance and implicitly reinforces the robustness and generalization of the overall framework. Notably, our approach does not alter the original model architecture and can be conveniently integrated into the training pipeline for end-to-end model optimization. We conducted extensive experiments on the public BraTS2020 and BraTS2023 datasets. Compared to multiple state-of-the-art methods from previous years, our method achieved superior performance.
On BraTS2020, for the average Dice scores across fifteen missing modality combinations, building upon the baseline, our method obtained scores of 87.55, 79.36, and 62.67 for the whole tumor (WT), the tumor core (TC), and the enhancing tumor (ET), respectively. We have made our source code publicly available at https://github.com/worldlikerr/MGML.
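The figure of fifteen combinations is consistent with the number of non-empty subsets of the four BraTS MRI sequences (2^4 - 1 = 15). The sketch below enumerates those subsets, assuming that this is how the combinations are defined, as is common in the missing-modality literature; the averaging helper and its input format are hypothetical and not taken from the MGML code base.

```python
from itertools import combinations
from typing import Dict, List, Tuple

# The four MRI sequences provided in the BraTS datasets.
MODALITIES: Tuple[str, ...] = ("T1", "T1ce", "T2", "FLAIR")


def available_subsets(modalities: Tuple[str, ...]) -> List[Tuple[str, ...]]:
    """All non-empty subsets of modalities that may be available at test time."""
    return [
        subset
        for r in range(1, len(modalities) + 1)
        for subset in combinations(modalities, r)
    ]


def mean_dice(per_subset: Dict[Tuple[str, ...], float]) -> float:
    """Hypothetical helper: average one region's Dice over all evaluated subsets."""
    return sum(per_subset.values()) / len(per_subset)


subsets = available_subsets(MODALITIES)
print(len(subsets))  # 15
```

Reporting the mean over all fifteen settings, rather than only the full-modality case, rewards a model that degrades gracefully as sequences go missing.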

[75] Learnable Query Aggregation with KV Routing for Cross-view Geo-localisation

Hualin Ye,Bingxi Liu,Jixiang Du,Yu Qin,Ziyi Chen,Hong Zhang

Main category: cs.CV

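TL;DR: This paper proposes a new method for cross-view geo-localisation (CVGL) that improves feature extraction and aggregation and achieves competitive performance with fewer parameters.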

Details Motivation: Large discrepancies between viewpoints make feature aggregation and alignment difficult for conventional methods, so a more robust model is needed to cope with cross-view variation. Method: Fine-tunes a DINOv2 backbone with a convolution adapter, introduces a multi-scale channel reallocation module to strengthen the diversity and stability of spatial representations, and designs an improved aggregation module with Mixture-of-Experts (MoE) routing for adaptive processing of heterogeneous input domains. Result: Experiments on University-1652 and SUES-200 show that the method reaches leading or competitive performance with fewer trained parameters. Conclusion: The method mitigates viewpoint discrepancy in cross-view geo-localisation, improves feature-matching accuracy and model efficiency, and shows good application potential. Abstract: Cross-view geo-localisation (CVGL) aims to estimate the geographic location of a query image by matching it with images from a large-scale database. However, the significant viewpoint discrepancies present considerable challenges for effective feature aggregation and alignment. To address these challenges, we propose a novel CVGL system that incorporates three key improvements. Firstly, we leverage the DINOv2 backbone with convolution adapter fine-tuning to enhance model adaptability to cross-view variations. Secondly, we propose a multi-scale channel reallocation module to strengthen the diversity and stability of spatial representations. Finally, we propose an improved aggregation module that integrates a Mixture-of-Experts (MoE) routing into the feature aggregation process. Specifically, the module dynamically selects expert subspaces for the keys and values in a cross-attention framework, enabling adaptive processing of heterogeneous input domains. Extensive experiments on the University-1652 and SUES-200 datasets demonstrate that our method achieves competitive performance with fewer trained parameters.

[76] Kinematic-Based Assessment of Surgical Actions in Microanastomosis

Yan Meng,Daniel Donoho,Marcelle Altshuler,Omar Arnaout

Main category: cs.CV

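TL;DR: Proposes an AI-based framework for action segmentation and skill assessment in microanastomosis that runs efficiently on edge computing platforms and provides objective, real-time feedback for microsurgical training.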

Details Motivation: Conventional microsurgical skill assessment depends on subjective expert scoring, which suffers from high inter-rater variability and is time-consuming, so objective, automated assessment methods are needed. Method: The framework has three modules: instrument tip tracking and localization based on YOLO and DeepSORT; action boundary detection from a self-similarity matrix with unsupervised clustering for action segmentation; and a supervised classification module that evaluates surgical gesture proficiency. Result: On a dataset of 58 expert-rated microanastomosis videos, frame-level action segmentation accuracy reaches 92.4% and skill classification accuracy reaches 85.5%. Conclusion: The method replicates expert evaluations effectively and could help standardize microsurgical education and enable data-driven training. Abstract: Proficiency in microanastomosis is a critical surgical skill in neurosurgery, where the ability to precisely manipulate fine instruments is crucial to successful outcomes. These procedures require sustained attention, coordinated hand movements, and highly refined motor skills, underscoring the need for objective and systematic methods to evaluate and enhance microsurgical training. Conventional assessment approaches typically rely on expert raters supervising the procedures or reviewing surgical videos, which is an inherently subjective process prone to inter-rater variability, inconsistency, and significant time investment. These limitations highlight the necessity for automated and scalable solutions. To address this challenge, we introduce a novel AI-driven framework for automated action segmentation and performance assessment in microanastomosis procedures, designed to operate efficiently on edge computing platforms. The proposed system comprises three main components: (1) an object tip tracking and localization module based on YOLO and DeepSORT; (2) an action segmentation module leveraging self-similarity matrix for action boundary detection and unsupervised clustering; and (3) a supervised classification module designed to evaluate surgical gesture proficiency. Experimental validation on a dataset of 58 expert-rated microanastomosis videos demonstrates the effectiveness of our approach, achieving a frame-level action segmentation accuracy of 92.4% and an overall skill classification accuracy of 85.5% in replicating expert evaluations.
These findings demonstrate the potential of the proposed method to provide objective, real-time feedback in microsurgical education, thereby enabling more standardized, data-driven training protocols and advancing competency assessment in high-stakes surgical environments.
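The idea of detecting action boundaries from a self-similarity matrix can be sketched as follows. The abstract does not specify the features or the boundary criterion, so this example makes two assumptions: each frame is described by a small kinematic feature vector (hypothetical toy values here), and a boundary is the frame where the similarity between the frames just before it and the frames from it onward is lowest.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def self_similarity(frames: Sequence[Sequence[float]]) -> List[List[float]]:
    """Frame-by-frame similarity matrix S[i][j] = cosine(frame_i, frame_j)."""
    return [[cosine(a, b) for b in frames] for a in frames]


def novelty(ssm: List[List[float]], half_window: int = 2) -> List[float]:
    """1 - mean similarity between the window before t and the window starting at t."""
    n = len(ssm)
    scores = [0.0] * n
    for t in range(half_window, n - half_window + 1):
        cross = [
            ssm[i][j]
            for i in range(t - half_window, t)
            for j in range(t, t + half_window)
        ]
        scores[t] = 1.0 - sum(cross) / len(cross)
    return scores


# Toy sequence: four frames of one motion pattern, then four of another.
frames = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4
scores = novelty(self_similarity(frames))
boundary = max(range(len(scores)), key=scores.__getitem__)
print(boundary)  # 4: the first frame of the second pattern
```

In a real pipeline the segments between detected boundaries would then be grouped by an unsupervised clustering step, which this sketch leaves out.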

[77] U-Net-Like Spiking Neural Networks for Single Image Dehazing

Huibin Li,Haoran Liu,Mingzhe Liu,Yulong Xiao,Peng Li,Guibin Zan

Main category: cs.CV

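TL;DR: This paper proposes DehazeSNN, a new dehazing architecture that combines a U-Net structure with Spiking Neural Networks (SNNs); its OLIFBlock module improves cross-channel communication and delivers strong dehazing performance at reduced computational cost.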

Details Motivation: Traditional dehazing methods rely on atmospheric scattering models, while existing deep learning methods fall short either on long-range dependencies or on computational efficiency, so a new architecture that balances performance and efficiency is needed. Method: Proposes DehazeSNN, which combines a U-Net-like structure with Spiking Neural Networks and introduces the Orthogonal Leaky-Integrate-and-Fire Block (OLIFBlock) to strengthen multi-scale feature extraction and cross-channel information exchange. Result: Experiments show that DehazeSNN matches state-of-the-art methods on several benchmark datasets with a smaller model and fewer MACs, giving it an inference-efficiency advantage. Conclusion: DehazeSNN offers an efficient, high-performing solution for image dehazing and advances low-power, lightweight dehazing models. Abstract: Image dehazing is a critical challenge in computer vision, essential for enhancing image clarity in hazy conditions. Traditional methods often rely on atmospheric scattering models, while recent deep learning techniques, specifically Convolutional Neural Networks (CNNs) and Transformers, have improved performance by effectively analyzing image features. However, CNNs struggle with long-range dependencies, and Transformers demand significant computational resources. To address these limitations, we propose DehazeSNN, an innovative architecture that integrates a U-Net-like design with Spiking Neural Networks (SNNs). DehazeSNN captures multi-scale image features while efficiently managing local and long-range dependencies. The introduction of the Orthogonal Leaky-Integrate-and-Fire Block (OLIFBlock) enhances cross-channel communication, resulting in superior dehazing performance with reduced computational burden. Our extensive experiments show that DehazeSNN is highly competitive with state-of-the-art methods on benchmark datasets, delivering high-quality haze-free images with a smaller model size and fewer multiply-accumulate operations. The proposed dehazing method is publicly available at https://github.com/HaoranLiu507/DehazeSNN.
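As background for the leaky-integrate-and-fire dynamics that the OLIFBlock builds on, here is a minimal sketch of a single discrete-time LIF neuron. It is a textbook formulation with a hard reset, not the orthogonal variant from the paper, and the leak factor and threshold are illustrative assumptions.

```python
from typing import List, Sequence


def lif(inputs: Sequence[float], leak: float = 0.9, threshold: float = 1.0) -> List[int]:
    """Simulate one leaky integrate-and-fire neuron and return its spike train."""
    v = 0.0  # membrane potential
    spikes: List[int] = []
    for x in inputs:
        v = leak * v + x  # leak the old potential, then integrate the input
        if v >= threshold:
            spikes.append(1)
            v = 0.0  # hard reset after firing (one of several common choices)
        else:
            spikes.append(0)
    return spikes


print(lif([0.5, 0.5, 0.5, 0.0, 1.2]))  # [0, 0, 1, 0, 1]
```

Because the output is binary, downstream layers can replace multiplications with sparse accumulations, which is the usual argument for the low operation counts of spiking networks.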

[78] T2VAttack: Adversarial Attack on Text-to-Video Diffusion Models

Changzhen Li,Yuecong Min,Jie Zhang,Zheng Yuan,Shiguang Shan,Xilin Chen

Main category: cs.CV

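TL;DR: This paper introduces T2VAttack, the first systematic study of adversarial attacks on text-to-video diffusion models from both semantic and temporal perspectives, and shows that existing models are fragile under minor prompt modifications.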

Details Motivation: Text-to-video generation has advanced considerably, but its robustness to adversarial attacks is largely unexplored; the paper aims to expose potential weaknesses in semantic alignment and temporal dynamics. Method: Proposes two attack objectives (semantic and temporal) and two attack methods: T2VAttack-S replaces critical prompt words with synonyms via greedy search, and T2VAttack-I iteratively inserts optimized words as a minimal perturbation. Result: Experiments show that substituting or inserting a single word can substantially reduce the semantic fidelity and temporal coherence of videos generated by mainstream T2V models such as CogVideoX and Open-Sora. Conclusion: Current text-to-video diffusion models are seriously vulnerable to slight prompt tampering and need better adversarial robustness. Abstract: The rapid evolution of Text-to-Video (T2V) diffusion models has driven remarkable advancements in generating high-quality, temporally coherent videos from natural language descriptions. Despite these achievements, their vulnerability to adversarial attacks remains largely unexplored. In this paper, we introduce T2VAttack, a comprehensive study of adversarial attacks on T2V diffusion models from both semantic and temporal perspectives. Considering the inherently dynamic nature of video data, we propose two distinct attack objectives: a semantic objective to evaluate video-text alignment and a temporal objective to assess the temporal dynamics. To achieve an effective and efficient attack process, we propose two adversarial attack methods: (i) T2VAttack-S, which identifies semantically or temporally critical words in prompts and replaces them with synonyms via greedy search, and (ii) T2VAttack-I, which iteratively inserts optimized words with minimal perturbation to the prompt. By combining these objectives and strategies, we conduct a comprehensive evaluation on the adversarial robustness of several state-of-the-art T2V models, including ModelScope, CogVideoX, Open-Sora, and HunyuanVideo. Our experiments reveal that even minor prompt modifications, such as the substitution or insertion of a single word, can cause substantial degradation in semantic fidelity and temporal dynamics, highlighting critical vulnerabilities in current T2V diffusion models.
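The greedy synonym search behind T2VAttack-S can be pictured with a generic sketch. This is not the authors' implementation: the `score` callable stands in for whatever semantic or temporal objective is evaluated on the generated video, the synonym table is hypothetical, and the toy scorer below only counts words so that the example runs without a video model.

```python
from typing import Callable, Dict, List


def greedy_substitute(
    prompt: List[str],
    synonyms: Dict[str, List[str]],
    score: Callable[[List[str]], float],
    budget: int = 1,
) -> List[str]:
    """Replace up to `budget` words, each round taking the swap that lowers `score` most."""
    current = list(prompt)
    for _ in range(budget):
        best, best_score = None, score(current)
        for i, word in enumerate(current):
            for candidate in synonyms.get(word, []):
                trial = current[:i] + [candidate] + current[i + 1:]
                trial_score = score(trial)
                if trial_score < best_score:
                    best, best_score = trial, trial_score
        if best is None:  # no substitution reduces the objective further
            break
        current = best
    return current


# Toy stand-in objective: how many words a pretend model "understands".
KNOWN = {"dog", "runs", "fast"}
toy_score = lambda words: float(sum(w in KNOWN for w in words))

adversarial = greedy_substitute(
    ["a", "dog", "runs", "fast"],
    {"dog": ["hound"], "runs": ["sprints"]},
    toy_score,
)
print(adversarial)  # ['a', 'hound', 'runs', 'fast']
```

With a real objective, each call to `score` would involve generating and evaluating a video, so the number of candidate substitutions dominates the cost of the attack.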

[79] DriveExplorer: Images-Only Decoupled 4D Reconstruction with Progressive Restoration for Driving View Extrapolation

Yuang Jia,Jinlong Wang,Jiayi Zhao,Chunlam Li,Shunzhou Wang,Wei Gao

Main category: cs.CV

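TL;DR: This paper proposes a view extrapolation method for autonomous driving scenes that needs no expensive sensors or annotations: using only images and optional camera poses, it iteratively refines a deformable 4D Gaussian framework with a diffusion model to synthesize high-quality novel views.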

Details Motivation: Existing methods depend on expensive priors such as LiDAR and 3D bounding boxes, which limits real-world deployment; the paper aims to reduce that dependence. Method: First estimates a global static point cloud and per-frame dynamic point clouds and fuses them; reconstructs the scene with a deformable 4D Gaussian framework; trains a video diffusion model on the initial renderings; then iteratively refines the Gaussian renderings with the diffusion model and feeds the enhanced results back as training data for 4DGS. Result: Produces higher-quality images at extrapolated novel viewpoints than the baselines. Conclusion: The method achieves effective view extrapolation from fewer inputs and has stronger practical potential. Abstract: This paper presents an effective solution for view extrapolation in autonomous driving scenarios. Recent approaches focus on generating shifted novel view images from given viewpoints using diffusion models. However, these methods heavily rely on priors such as LiDAR point clouds, 3D bounding boxes, and lane annotations, which demand expensive sensors or labor-intensive labeling, limiting applicability in real-world deployment. In this work, with only images and optional camera poses, we first estimate a global static point cloud and per-frame dynamic point clouds, fusing them into a unified representation. We then employ a deformable 4D Gaussian framework to reconstruct the scene. The initially trained 4D Gaussian model renders degraded and pseudo-images to train a video diffusion model. Subsequently, progressively shifted Gaussian renderings are iteratively refined by the diffusion model, and the enhanced results are incorporated back as training data for 4DGS. This process continues until extrapolation reaches the target viewpoints. Compared with baselines, our method produces higher-quality images at novel extrapolated viewpoints.

[80] Anomaly detection in satellite imagery through temporal inpainting

Bertrand Rouet-Leduc,Claudia Hulbert

Main category: cs.CV

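TL;DR: Proposes a deep learning method for anomaly detection in satellite image time series that predicts what the surface should look like and flags deviations as surface change, with higher sensitivity and specificity than traditional methods.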

Details Motivation: The complex interplay of atmospheric noise, seasonal variation, and sensor artifacts keeps surface change detection from satellite imagery challenging, especially in disaster response and environmental monitoring, where high sensitivity is required. Method: Builds an inpainting model on the SATLAS foundation model that reconstructs the latest frame of a Sentinel-2 time series from earlier acquisitions, trained on globally distributed data spanning many climate zones and land cover types; anomalies are identified from the difference between prediction and observation. Result: Validated on surface ruptures caused by the 2023 Turkey-Syria earthquakes (for example at Tepehan), the method is roughly three times more sensitive than traditional approaches such as the temporal median or the Reed-Xiaoli detector and picks up subtle changes they miss. Conclusion: The method exploits temporal redundancy to improve change detection and offers a feasible path to automated, global-scale monitoring of surface change from freely available multi-spectral satellite data. Abstract: Detecting surface changes from satellite imagery is critical for rapid disaster response and environmental monitoring, yet remains challenging due to the complex interplay between atmospheric noise, seasonal variations, and sensor artifacts. Here we show that deep learning can leverage the temporal redundancy of satellite time series to detect anomalies at unprecedented sensitivity, by learning to predict what the surface should look like in the absence of change. We train an inpainting model built upon the SATLAS foundation model to reconstruct the last frame of a Sentinel-2 time series from preceding acquisitions, using globally distributed training data spanning diverse climate zones and land cover types. When applied to regions affected by sudden surface changes, the discrepancy between prediction and observation reveals anomalies that traditional change detection methods miss. We validate our approach on earthquake-triggered surface ruptures from the 2023 Turkey-Syria earthquake sequence, demonstrating detection of a rift feature in Tepehan with higher sensitivity and specificity than temporal median or Reed-Xiaoli anomaly detectors. Our method reaches detection thresholds approximately three times lower than baseline approaches, providing a path towards automated, global-scale monitoring of surface changes from freely available multi-spectral satellite data.
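To make the comparison concrete, the sketch below shows one plausible form of the temporal median baseline mentioned above: the per-pixel median of earlier frames serves as the expected surface, and the absolute residual of the latest frame is the anomaly score. The pixel values and threshold are hypothetical, and the exact baseline configuration used in the paper is not given in the abstract. The proposed method keeps the same residual step but replaces the median with a learned inpainting prediction.

```python
from statistics import median
from typing import List, Sequence


def temporal_median_anomaly(
    history: Sequence[Sequence[float]], latest: Sequence[float]
) -> List[float]:
    """Per-pixel |latest - median(history)| for frames stored as flat pixel lists."""
    baseline = [median(pixel_series) for pixel_series in zip(*history)]
    return [abs(obs - ref) for obs, ref in zip(latest, baseline)]


# Toy reflectances for three pixels over three past acquisitions.
history = [
    [0.2, 0.5, 0.1],
    [0.3, 0.5, 0.1],
    [0.2, 0.6, 0.9],  # transient outlier (e.g. a cloud) on the last pixel
]
latest = [0.2, 0.5, 0.7]  # the last pixel has genuinely changed

scores = temporal_median_anomaly(history, latest)
flagged = [i for i, s in enumerate(scores) if s > 0.3]  # hypothetical threshold
print(flagged)  # [2]
```

The median ignores the single cloudy observation, but it cannot model seasonal trends, which is one reason a learned predictor can reach lower detection thresholds.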

[81] GCA-ResUNet: Medical Image Segmentation Using Grouped Coordinate Attention

Jun Ding,Shang Gao

Main category: cs.CV

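TL;DR: Proposes GCA-ResUNet, an efficient medical image segmentation framework built around a lightweight Grouped Coordinate Attention (GCA) module that strengthens global context modeling while keeping CNN efficiency, and improves segmentation accuracy in multi-organ and low-contrast regions.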

Details Motivation: U-Net-style methods struggle to model long-range dependencies because of their local receptive fields and homogeneous attention, while Transformers capture global information at a computational cost that limits use in resource-constrained clinical settings; a segmentation method that balances accuracy and efficiency is therefore needed. Method: Designs a lightweight, plug-and-play Grouped Coordinate Attention (GCA) module that models channel context in groups to handle semantic heterogeneity and adds direction-aware coordinate encoding to capture spatial dependencies along the horizontal and vertical axes; integrating it into ResUNet gives GCA-ResUNet. Result: Achieves Dice scores of 86.11% on Synapse and 92.64% on ACDC, outperforming representative CNN and Transformer methods including Swin-UNet and TransUNet, with particular gains on small organs and structures with complex boundaries. Conclusion: GCA-ResUNet balances segmentation accuracy and computational efficiency well, is feasible and scalable for clinical deployment, and offers a practical new option for medical image segmentation. Abstract: Accurate segmentation of heterogeneous anatomical structures is pivotal for computer-aided diagnosis and subsequent clinical decision-making. Although U-Net based convolutional neural networks have achieved remarkable progress, their intrinsic locality and largely homogeneous attention formulations often limit the modeling of long-range contextual dependencies, especially in multi-organ scenarios and low-contrast regions. Transformer-based architectures mitigate this issue by leveraging global self-attention, but they usually require higher computational resources and larger training data, which may impede deployment in resource-constrained clinical environments. In this paper, we propose GCA-ResUNet, an efficient medical image segmentation framework equipped with a lightweight and plug-and-play Grouped Coordinate Attention (GCA) module. The proposed GCA decouples channel-wise context modeling into multiple groups to explicitly account for semantic heterogeneity across channels, and integrates direction-aware coordinate encoding to capture structured spatial dependencies along horizontal and vertical axes. This design enhances global representation capability while preserving the efficiency advantages of CNN backbones. Extensive experiments on two widely used benchmarks, Synapse and ACDC, demonstrate that GCA-ResUNet achieves Dice scores of 86.11% and 92.64%, respectively, outperforming a range of representative CNN and Transformer-based methods, including Swin-UNet and TransUNet.
In particular, GCA-ResUNet yields consistent improvements in delineating small anatomical structures with complex boundaries. These results indicate that the proposed approach provides a favorable trade-off between segmentation accuracy and computational efficiency, offering a practical and scalable solution for clinical deployment.

[82] Bridging Structure and Appearance: Topological Features for Robust Self-Supervised Segmentation

Haotang Li,Zhenyu Qi,Hao Qin,Huanrui Yang,Sen He,Kebin Peng

Main category: cs.CV

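TL;DR: This paper proposes GASeg, a new framework that uses topological information to bridge appearance and geometric features, addressing the performance drop that appearance ambiguity causes in self-supervised semantic segmentation.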

Details Motivation: Existing self-supervised semantic segmentation methods perform poorly under appearance ambiguities such as shadows, glare, and local textures, mainly because they over-rely on unstable appearance features. Method: Proposes a Differentiable Box-Counting (DBC) module that extracts multi-scale topological statistics from parallel geometric and appearance streams, a Topological Augmentation (TopoAug) strategy that simulates real-world ambiguities, and GALoss to align features across the two modalities. Result: Achieves state-of-the-art performance on several benchmarks, including COCO-Stuff, Cityscapes, and PASCAL. Conclusion: By combining geometric and appearance features and exploiting stable topological structure, GASeg improves the robustness and accuracy of self-supervised semantic segmentation in complex scenes. Abstract: Self-supervised semantic segmentation methods often fail when faced with appearance ambiguities. We argue that this is due to an over-reliance on unstable, appearance-based features such as shadows, glare, and local textures. We propose GASeg, a novel framework that bridges appearance and geometry by leveraging stable topological information. The core of our method is the Differentiable Box-Counting (DBC) module, which quantifies multi-scale topological statistics from two parallel streams: geometric-based features and appearance-based features. To force the model to learn these stable structural representations, we introduce Topological Augmentation (TopoAug), an adversarial strategy that simulates real-world ambiguities by applying morphological operators to the input images. A multi-objective loss, GALoss, then explicitly enforces cross-modal alignment between geometric-based and appearance-based features. Extensive experiments demonstrate that GASeg achieves state-of-the-art performance on four benchmarks, including COCO-Stuff, Cityscapes, and PASCAL, validating our approach of bridging geometry and appearance via topological information.
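For context on the box-counting idea that the DBC module adapts, the following sketch implements the classical, non-differentiable version on a binary mask: count the occupied boxes N(s) at several box sizes s and take the slope of log N(s) against log(1/s). How the paper makes this differentiable and applies it to feature maps is not described in the abstract, so nothing here should be read as the authors' module; the masks and box sizes are toy choices.

```python
import math
from typing import List, Sequence

Mask = List[List[int]]


def box_count(mask: Mask, size: int) -> int:
    """Number of size x size boxes that contain at least one foreground pixel."""
    h, w = len(mask), len(mask[0])
    return sum(
        1
        for r in range(0, h, size)
        for c in range(0, w, size)
        if any(
            mask[i][j]
            for i in range(r, min(r + size, h))
            for j in range(c, min(c + size, w))
        )
    )


def box_dimension(mask: Mask, sizes: Sequence[int]) -> float:
    """Least-squares slope of log N(s) versus log(1/s); assumes every count is > 0."""
    xs = [math.log(1.0 / s) for s in sizes]
    ys = [math.log(box_count(mask, s)) for s in sizes]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den


filled = [[1] * 8 for _ in range(8)]  # a solid square
diagonal = [[1 if i == j else 0 for j in range(8)] for i in range(8)]  # a thin line

print(round(box_dimension(filled, [1, 2, 4]), 3))    # 2.0
print(round(box_dimension(diagonal, [1, 2, 4]), 3))  # 1.0
```

The statistic depends on the shape of a region rather than its colour or texture, which is the kind of stability under appearance change that the abstract appeals to.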

[83] Improved 3D Gaussian Splatting of Unknown Spacecraft Structure Using Space Environment Illumination Knowledge

Tae Ha Park,Simone D'Amico

Main category: cs.CV

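TL;DR: Proposes a new method that uses prior knowledge of the Sun's position to improve 3D Gaussian Splatting (3DGS) training, recovering the 3D structure of an unknown target spacecraft from image sequences captured under dynamic illumination during rendezvous and proximity operations, and improving the photometric accuracy of renderings used for camera pose estimation.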

Details Motivation: Conventional 3DGS assumes a static scene and copes poorly with the rapidly changing illumination in spaceborne imagery, while downstream pose estimation needs a 3D model with high photometric accuracy. Method: Incorporates the Sun-position prior, estimated and maintained by the servicer spacecraft, into the 3DGS training pipeline so that the model can better represent shadows, self-occlusion, and dynamic lighting. Result: Experiments show that the resulting 3DGS models adapt to rapidly changing illumination in space, improve the photometric quality of rendered images, and reflect global shadowing and self-occlusion. Conclusion: Fusing the Sun-position prior improves 3DGS reconstruction under dynamic illumination; the resulting model is both geometrically and photometrically accurate and suits 3D reconstruction and camera pose estimation during RPO. Abstract: This work presents a novel pipeline to recover the 3D structure of an unknown target spacecraft from a sequence of images captured during Rendezvous and Proximity Operations (RPO) in space. The target's geometry and appearance are represented as a 3D Gaussian Splatting (3DGS) model. However, learning 3DGS requires static scenes, an assumption in contrast to dynamic lighting conditions encountered in spaceborne imagery. The trained 3DGS model can also be used for camera pose estimation through photometric optimization. Therefore, in addition to recovering a geometrically accurate 3DGS model, the photometric accuracy of the rendered images is imperative to downstream pose estimation tasks during the RPO process. This work proposes to incorporate the prior knowledge of the Sun's position, estimated and maintained by the servicer spacecraft, into the training pipeline for improved photometric quality of 3DGS rasterization. Experimental studies demonstrate the effectiveness of the proposed solution, as 3DGS models trained on a sequence of images learn to adapt to rapidly changing illumination conditions in space and reflect global shadowing and self-occlusion.

[84] Bridging the Perception-Cognition Gap: Re-engineering SAM2 with Hilbert-Mamba for Robust VLM-based Medical Diagnosis

Hao Wu,Hui Li,Yiyun Su

Main category: cs.CV

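TL;DR: This paper proposes Hilbert-VLM, a new two-stage fusion framework that improves vision-language models on 3D medical image analysis; by introducing Hilbert space-filling curves and redesigning the SAM2 architecture, it achieves more precise lesion segmentation and disease classification.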

Details Motivation: Existing vision-language models struggle to integrate information from complex 3D multimodal medical images and tend to overlook key pathological features, so a method that better preserves spatial locality and captures subtle lesions is needed. Method: Proposes the Hilbert-VLM framework, whose HilbertMed-SAM module performs lesion segmentation; it uses Hilbert space-filling curves to optimize the scanning mechanism of the Mamba state space model and adds a Hilbert-Mamba Cross-Attention (HMCA) mechanism and a scale-aware decoder; the resulting multimodal enhanced prompts guide the vision-language model in disease classification. Result: On the BraTS2021 segmentation benchmark it reaches a Dice score of 82.35% and a diagnostic classification accuracy (ACC) of 78.85%, improving the accuracy of 3D medical image analysis. Conclusion: Through these structural changes, Hilbert-VLM strengthens spatial modeling and fine-grained feature extraction for 3D medical images and offers a more reliable technical route for VLM-based medical diagnosis. Abstract: Recent studies suggest that Visual Language Models (VLMs) hold great potential for tasks such as automated medical diagnosis. However, processing complex three-dimensional (3D) multimodal medical images poses significant challenges, specifically the effective integration of complementary information and the occasional oversight of subtle yet critical pathological features. To address these issues, we present a novel two-stage fusion framework termed Hilbert-VLM. This framework leverages the HilbertMed-SAM module for precise lesion segmentation, with the generated multimodal enhanced prompts then guiding the VLM toward accurate disease classification. Our key innovation lies in the systematic redesign of the Segment Anything Model 2 (SAM2) architecture: we incorporate Hilbert space-filling curves into the scanning mechanism of the Mamba State Space Model (SSM) to maximize the preservation of spatial locality in 3D data, a property critical for medical image analysis. We also introduce a novel Hilbert-Mamba Cross-Attention (HMCA) mechanism and a scale-aware decoder to capture fine-grained details. Meanwhile, the prompt enhancement module unifies segmentation masks and their corresponding textual attributes into an information-dense prompt to support VLM inference. Extensive experiments were conducted to validate the effectiveness of the Hilbert-VLM model. On the BraTS2021 segmentation benchmark, it achieves a Dice score of 82.35 percent, with a diagnostic classification accuracy (ACC) of 78.85 percent.
These results demonstrate that the proposed model offers substantial potential to improve the accuracy and reliability of medical VLM-based analysis.
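
To make the spatial-locality argument concrete, here is a minimal sketch of Hilbert-curve ordering. It is not the paper's implementation: it works on a 2D grid rather than 3D volumes, and the function name `hilbert_d2xy` is chosen here for illustration. It uses the standard index-to-coordinate construction for the Hilbert curve.

```python
def hilbert_d2xy(order, d):
    """Map index d along a Hilbert curve to (x, y) on a 2**order x 2**order grid.

    Illustrative 2D helper (the abstract applies the idea to 3D data).
    """
    x = y = 0
    t = d
    s = 1
    n = 1 << order
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        # Rotate/flip the current quadrant so sub-curves join end to end.
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_scan_order(order):
    """Return all grid cells in Hilbert-curve visiting order."""
    n = 1 << order
    return [hilbert_d2xy(order, d) for d in range(n * n)]
```

Consecutive cells in this ordering are always grid neighbours, whereas a row-by-row raster scan jumps across the whole image at the end of each row. That is the locality property a sequence model scanning the flattened tokens can benefit from.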

[85] On Exact Editing of Flow-Based Diffusion Models

Zixiang Li,Yue Song,Jianing Peng,Ting Liu,Jun Huang,Xiaochao Qu,Luoqi Liu,Wei Wang,Yao Zhao,Yunchao Wei

Main category: cs.CV

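TL;DR: Proposes Conditioned Velocity Correction (CVC), a new framework for flow-based diffusion editing that uses a dual-perspective velocity conversion mechanism and posterior-consistent updates to achieve stable, faithful, and semantically consistent image editing.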

Details Motivation: Existing flow-based diffusion editing methods accumulate velocity errors along their latent trajectories, causing semantic inconsistency and structural distortion; a new mechanism is needed that preserves structural fidelity while allowing semantic control. Method: Flow-based editing is reformulated as a distribution transformation problem driven by a known source prior, with a two-branch velocity mechanism: a structure-preserving branch that stays consistent with the source trajectory, and a semantically guided branch that drives a controlled deviation toward the target distribution; a posterior-consistent velocity update based on Empirical Bayes inference and Tweedie correction compensates for velocity errors. Result: CVC markedly reduces trajectory drift and velocity error in latent space, yields more stable dynamics, and shows higher image fidelity, better semantic alignment, and more reliable editing behavior across diverse editing tasks. Conclusion: CVC offers a theoretically grounded and practical solution for inter-distribution transformation in flow-based diffusion models; its explicit velocity correction balances structure preservation against semantic editing and advances high-quality image editing without explicit inversion. Abstract: Recent methods in flow-based diffusion editing have enabled direct transformations between source and target image distribution without explicit inversion. However, the latent trajectories in these methods often exhibit accumulated velocity errors, leading to semantic inconsistency and loss of structural fidelity. We propose Conditioned Velocity Correction (CVC), a principled framework that reformulates flow-based editing as a distribution transformation problem driven by a known source prior. CVC rethinks the role of velocity in inter-distribution transformation by introducing a dual-perspective velocity conversion mechanism. This mechanism explicitly decomposes the latent evolution into two components: a structure-preserving branch that remains consistent with the source trajectory, and a semantically-guided branch that drives a controlled deviation toward the target distribution. The conditional velocity field exhibits an absolute velocity error relative to the true underlying distribution trajectory, which inherently introduces potential instability and trajectory drift in the latent space. To address this quantifiable deviation and maintain fidelity to the true flow, we apply a posterior-consistent update to the resulting conditional velocity field. This update is derived from Empirical Bayes Inference and Tweedie correction, which ensures a mathematically grounded error compensation over time. Our method yields stable and interpretable latent dynamics, achieving faithful reconstruction alongside smooth local semantic conversion. 
Comprehensive experiments demonstrate that CVC consistently achieves superior fidelity, better semantic alignment, and more reliable editing behavior across diverse tasks.

[86] FitControler: Toward Fit-Aware Virtual Try-On

Lu Yang,Yicheng Liu,Yanan Li,Xiang Bai,Hao Lu

Main category: cs.CV

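TL;DR: This paper proposes FitControler, a learnable plug-in that integrates into modern virtual try-on (VTON) models and, for the first time, enables controllable adjustment of garment fit; it also builds the Fit4Men dataset of 13,000 pairs and proposes two evaluation metrics to validate the generated results.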

Details Motivation: Existing virtual try-on techniques mostly focus on realistic rendering of garment details but ignore garment fit, a key factor shaping the overall style, so the generated results fall short in style coordination. Method: FitControler comprises a fit-aware layout generator that produces body-garment layouts of different fits from garment-agnostic representations, and a multi-scale fit injector that feeds the layout information into existing VTON models to enable layout-driven fit control. Result: The authors build Fit4Men, the first dataset focused on fit, with 13,000 body-garment pairs covering tops and bottoms, different poses, and different camera distances; they propose two fit consistency metrics; experiments show that FitControler is compatible with multiple VTON models and achieves accurate fit control. Conclusion: Modeling fit markedly improves the realism and controllability of the overall visual style in virtual try-on and offers a new design dimension for future VTON systems. Abstract: Realistic virtual try-on (VTON) concerns not only faithful rendering of garment details but also coordination of the style. Prior art typically pursues the former, but neglects a key factor that shapes the holistic style: garment fit. Garment fit delineates how a garment aligns with the body of a wearer and is a fundamental element in fashion design. In this work, we introduce fit-aware VTON and present FitControler, a learnable plug-in that can seamlessly integrate into modern VTON models to enable customized fit control. To achieve this, we highlight two challenges: i) how to delineate layouts of different fits and ii) how to render the garment that matches the layout. FitControler first features a fit-aware layout generator to redraw the body-garment layout conditioned on a set of delicately processed garment-agnostic representations, and a multi-scale fit injector is then used to deliver layout cues to enable layout-driven VTON. In particular, we build a fit-aware VTON dataset termed Fit4Men, including 13,000 body-garment pairs of different fits, covering both tops and bottoms, and featuring varying camera distances and body poses. Two fit consistency metrics are also introduced to assess the fitness of generations. Extensive experiments show that FitControler can work with various VTON models and achieve accurate fit control. Code and data will be released.

[87] Structure-Guided Allocation of 2D Gaussians for Image Representation and Compression

Huanxiong Liang,Yunuo Chen,Yicheng Pan,Sixian Wang,Jincheng Dai,Guo Lu,Wenjun Zhang

Main category: cs.CV

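TL;DR: Proposes a structure-guided allocation method for 2D Gaussian Splatting (2DGS) that combines structure-guided initialization, adaptive bitwidth quantization, and geometry-consistent regularization to substantially improve rate-distortion performance at low bitrates while keeping millisecond-level decoding.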

Details Motivation: Existing 2DGS methods ignore image structure when allocating representation capacity and parameter precision, which leads to poor rate-distortion efficiency at low bitrates. Method: 1) Structure-guided initialization: 2D Gaussians are assigned according to the spatial structural priors of natural images; 2) Adaptive bitwidth quantization: small-scale Gaussians in complex regions receive higher covariance precision; 3) Geometry-consistent regularization: Gaussian orientations are aligned with local gradient directions. Result: BD-rate is reduced by 43.44% on Kodak and by 29.91% on DIV2K, while decoding stays above 1000 FPS. Conclusion: The proposed structure-guided allocation strategy improves the representational power and rate-distortion performance of 2DGS while retaining its fast decoding. Abstract: Recent advances in 2D Gaussian Splatting (2DGS) have demonstrated its potential as a compact image representation with millisecond-level decoding. However, existing 2DGS-based pipelines allocate representation capacity and parameter precision largely oblivious to image structure, limiting their rate-distortion (RD) efficiency at low bitrates. To address this, we propose a structure-guided allocation principle for 2DGS, which explicitly couples image structure with both representation capacity and quantization precision, while preserving native decoding speed. First, we introduce a structure-guided initialization that assigns 2D Gaussians according to spatial structural priors inherent in natural images, yielding a localized and semantically meaningful distribution. Second, during quantization-aware fine-tuning, we propose adaptive bitwidth quantization of covariance parameters, which grants higher precision to small-scale Gaussians in complex regions and lower precision elsewhere, enabling RD-aware optimization, thereby reducing redundancy without degrading edge quality. Third, we impose a geometry-consistent regularization that aligns Gaussian orientations with local gradient directions to better preserve structural details. Extensive experiments demonstrate that our approach substantially improves both the representational power and the RD performance of 2DGS while maintaining over 1000 FPS decoding. Compared with the baseline GSImage, we reduce BD-rate by 43.44% on Kodak and 29.91% on DIV2K.

[88] FUSE-RSVLM: Feature Fusion Vision-Language Model for Remote Sensing

Yunkai Dang,Donghao Wang,Jiacheng Yang,Yifan Jiang,Meiyi Zhu,Yuekun Yang,Cong Wang,Qi Fan,Wenbin Li,Yang Gao

Main category: cs.CV

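TL;DR: This paper proposes MF-RSVLM, a multi-feature fusion vision-language model for remote sensing; with multi-scale visual representations and a recurrent visual feature injection mechanism, it improves fine-grained feature extraction and retention of visual information in remote sensing image understanding, reaching state-of-the-art results on a range of remote sensing tasks.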

Details Motivation: Existing vision-language models have difficulty extracting fine-grained features from remote sensing images and suffer from visual forgetting during deep language processing; in addition, remote sensing images differ fundamentally from natural images, which limits the performance of existing methods. Method: MF-RSVLM adopts a multi-scale visual feature extraction and fusion strategy that combines global context with local detail, and a recurrent visual feature injection mechanism that keeps introducing visual information during language generation to reduce visual forgetting. Result: In extensive experiments on remote sensing classification, image captioning, and visual question answering benchmarks, MF-RSVLM achieves the best or competitive performance on most tasks. Conclusion: MF-RSVLM captures complex structures and small objects in remote sensing scenes more effectively and markedly improves the understanding and generation abilities of vision-language models in the remote sensing domain. Abstract: Large vision-language models (VLMs) exhibit strong performance across various tasks. However, these VLMs encounter significant challenges when applied to the remote sensing domain due to the inherent differences between remote sensing images and natural images. Existing remote sensing VLMs often fail to extract fine-grained visual features and suffer from visual forgetting during deep language processing. To address this, we introduce MF-RSVLM, a Multi-Feature Fusion Remote Sensing Vision-Language Model that effectively extracts and fuses visual features for RS understanding. MF-RSVLM learns multi-scale visual representations and combines global context with local details, improving the capture of small and complex structures in RS scenes. A recurrent visual feature injection scheme ensures the language model remains grounded in visual evidence and reduces visual forgetting during generation. Extensive experiments on diverse RS benchmarks show that MF-RSVLM achieves state-of-the-art or highly competitive performance across remote sensing classification, image captioning, and VQA tasks. Our code is publicly available at https://github.com/Yunkaidang/RSVLM.

[89] RSAgent: Learning to Reason and Act for Text-Guided Segmentation via Multi-Turn Tool Invocations

Xingqi He,Yujie Zhang,Shuyong Gao,Wenjie Li,Lingyi Hong,Mingxi Chen,Kaixun Jiang,Jiyuan Fu,Wenqiang Zhang

Main category: cs.CV

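TL;DR: This paper proposes RSAgent, an agent framework built on a multimodal large language model that interleaves reasoning and action through multi-turn tool invocations for text-guided object segmentation, substantially improving segmentation in zero-shot and cross-domain settings.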

Details Motivation: Existing methods treat text-guided segmentation as a single-pass inference task and cannot correct an initial localization error, which limits segmentation accuracy and robustness. Method: RSAgent uses multi-turn tool invocations to update its spatial hypothesis iteratively from visual feedback during reasoning; the authors build a data pipeline for multi-turn segmentation trajectories and train in two stages: cold-start supervised fine-tuning followed by agentic reinforcement learning with fine-grained, task-specific rewards. Result: RSAgent reaches 66.5% gIoU zero-shot on the ReasonSeg test set, 9% above Seg-Zero-7B, and 81.5% cIoU on RefCOCOg, both state-of-the-art results. Conclusion: By introducing agentic multi-turn reasoning, RSAgent strengthens localization correction and mask refinement in text-guided segmentation and achieves leading performance on multiple benchmarks. Abstract: Text-guided object segmentation requires both cross-modal reasoning and pixel grounding abilities. Most recent methods treat text-guided segmentation as one-shot grounding, where the model predicts pixel prompts in a single forward pass to drive an external segmentor, which limits verification, refocusing and refinement when initial localization is wrong. To address this limitation, we propose RSAgent, an agentic Multimodal Large Language Model (MLLM) which interleaves reasoning and action for segmentation via multi-turn tool invocations. RSAgent queries a segmentation toolbox, observes visual feedback, and revises its spatial hypothesis using historical observations to re-localize targets and iteratively refine masks. We further build a data pipeline to synthesize multi-turn reasoning segmentation trajectories, and train RSAgent with a two-stage framework: cold-start supervised fine-tuning followed by agentic reinforcement learning with fine-grained, task-specific rewards. Extensive experiments show that RSAgent achieves a zero-shot performance of 66.5% gIoU on ReasonSeg test, improving over Seg-Zero-7B by 9%, and reaches 81.5% cIoU on RefCOCOg, demonstrating state-of-the-art performance on both in-domain and out-of-domain benchmarks.
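
The two reported metrics aggregate IoU differently, which matters when reading the numbers. The sketch below assumes the definitions commonly used on reasoning-segmentation benchmarks: gIoU as the mean of per-image IoUs and cIoU as cumulative intersection over cumulative union. The abstract does not restate them, and scoring an empty union as 1.0 is a further assumption made here.

```python
import numpy as np


def giou_ciou(preds, gts):
    """Compute (gIoU, cIoU) for lists of boolean masks.

    Assumed definitions: gIoU = mean of per-image IoU,
    cIoU = sum of intersections / sum of unions over the dataset.
    """
    inter = [int(np.logical_and(p, g).sum()) for p, g in zip(preds, gts)]
    union = [int(np.logical_or(p, g).sum()) for p, g in zip(preds, gts)]
    # Assumption: an image with empty prediction and empty target scores 1.0.
    per_image = [i / u if u > 0 else 1.0 for i, u in zip(inter, union)]
    giou = float(np.mean(per_image))
    ciou = sum(inter) / sum(union)
    return giou, ciou
```

Because cIoU pools pixels across the dataset, large objects dominate it, while gIoU weights every image equally; the two can therefore rank methods differently.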

[90] PipeFlow: Pipelined Processing and Motion-Aware Frame Selection for Long-Form Video Editing

Mustafa Munir,Md Mostafijur Rahman,Kartikeya Bhardwaj,Paul Whatmough,Radu Marculescu

Main category: cs.CV

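TL;DR: PipeFlow is a scalable method for long-form video editing that skips low-motion frames, processes segments in parallel, and uses neural interpolation, so that editing time grows linearly with video length and efficiency improves markedly.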

Details Motivation: Long-form video editing is challenging because computational cost grows exponentially as sequences lengthen, especially the high overhead of joint editing and DDIM inversion. Method: PipeFlow introduces three innovations: skipping low-motion frames based on SSIM and optical flow analysis; a pipelined scheduling algorithm that processes segments in parallel; and neural network interpolation to smooth border frames and fill in the skipped frames. Result: When editing long videos, PipeFlow is up to 9.6x faster than TokenFlow and 31.7x faster than DMT, and its editing time grows linearly with video length. Conclusion: With its segmented pipeline strategy, PipeFlow removes the computational bottleneck in long-form video editing and can in principle scale to videos of unbounded length, avoiding the growing per-frame overhead of conventional methods. Abstract: Long-form video editing poses unique challenges due to the exponential increase in the computational cost from joint editing and Denoising Diffusion Implicit Models (DDIM) inversion across extended sequences. To address these limitations, we propose PipeFlow, a scalable, pipelined video editing method that introduces three key innovations: First, based on a motion analysis using Structural Similarity Index Measure (SSIM) and Optical Flow, we identify and propose to skip editing of frames with low motion. Second, we propose a pipelined task scheduling algorithm that splits a video into multiple segments and performs DDIM inversion and joint editing in parallel based on available GPU memory. Lastly, we leverage a neural network-based interpolation technique to smooth out the border frames between segments and interpolate the previously skipped frames. Our method uniquely scales to longer videos by dividing them into smaller segments, allowing PipeFlow's editing time to increase linearly with video length. In principle, this enables editing of infinitely long videos without the growing per-frame computational overhead encountered by other methods. PipeFlow achieves up to a 9.6X speedup compared to TokenFlow and a 31.7X speedup over Diffusion Motion Transfer (DMT).
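
The frame-skipping idea can be sketched roughly as follows. This is a simplification, not PipeFlow's algorithm: it uses a single-window (global) SSIM instead of the usual sliding-window form, omits the optical-flow cue entirely, and the `threshold` value and function names are hypothetical.

```python
import numpy as np


def global_ssim(a, b, data_range=1.0):
    """Single-window SSIM between two grayscale frames (simplified variant)."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    num = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    den = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return num / den


def select_frames_to_edit(frames, threshold=0.95):
    """Keep a frame only if it differs enough from the last kept frame.

    `threshold` is a hypothetical value; frames above it are treated as
    low-motion and left to be filled in later by interpolation.
    """
    keep = [0]
    for i in range(1, len(frames)):
        if global_ssim(frames[keep[-1]], frames[i]) < threshold:
            keep.append(i)
    return keep
```

Only the returned indices would go through the expensive inversion and editing steps; the skipped frames are the ones the abstract says are later reconstructed by neural interpolation.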

[91] Reinforced Diffusion: Learning to Push the Limits of Anisotropic Diffusion for Image Denoising

Xinran Qin,Yuhui Quan,Ruotao Xu,Hui Ji

Main category: cs.CV

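TL;DR: Proposes a trainable anisotropic diffusion framework based on reinforcement learning for image denoising; it adapts to complex image structures, outperforms traditional diffusion methods on several noise types, and is comparable to deep CNN methods.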

Details Motivation: Traditional anisotropic diffusion methods use explicit diffusion operators that adapt poorly to complex image structures, which limits their performance relative to learning-based methods; a more adaptive diffusion denoising framework is therefore needed. Method: The denoising process is modeled as a sequence of diffusion actions optimized by deep Q-learning; reinforcement learning learns the order of the actions, producing a stochastic anisotropic diffusion process with strong adaptivity. Result: On three common noise types, the proposed method outperforms existing diffusion-based methods and performs comparably to representative deep CNN methods. Conclusion: The reinforcement-learning-based anisotropic diffusion framework improves the adaptivity and denoising performance of traditional methods and suggests a new way to combine diffusion models with deep reinforcement learning. Abstract: Image denoising is an important problem in low-level vision and serves as a critical module for many image recovery tasks. Anisotropic diffusion is a wide family of image denoising approaches with promising performance. However, traditional anisotropic diffusion approaches use explicit diffusion operators which are not well adapted to complex image structures. As a result, their performance is limited compared to recent learning-based approaches. In this work, we describe a trainable anisotropic diffusion framework based on reinforcement learning. By modeling the denoising process as a series of naive diffusion actions with order learned by deep Q-learning, we propose an effective diffusion-based image denoiser. The diffusion actions selected by deep Q-learning at different iterations indeed compose a stochastic anisotropic diffusion process with strong adaptivity to different image structures, which enjoys improvement over the traditional ones. The proposed denoiser is applied to removing three types of often-seen noise. The experiments show that it outperforms existing diffusion-based methods and competes with the representative deep CNN-based methods.
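
For context on what an "explicit diffusion operator" looks like, here is a minimal sketch of one classic Perona-Malik step. This is the traditional hand-designed baseline family the abstract contrasts against, not the paper's learned action sequence; the `kappa` and `dt` defaults are illustrative assumptions.

```python
import numpy as np


def perona_malik_step(img, kappa=0.1, dt=0.2):
    """One explicit Perona-Malik diffusion step on a 2D float image.

    kappa sets edge sensitivity and dt the step size (dt <= 0.25 keeps the
    4-neighbour scheme stable); both defaults are illustrative.
    """
    p = np.pad(img, 1, mode="edge")  # replicate borders (zero flux)
    # Differences to the four neighbours.
    d_n = p[:-2, 1:-1] - img
    d_s = p[2:, 1:-1] - img
    d_e = p[1:-1, 2:] - img
    d_w = p[1:-1, :-2] - img

    def g(d):
        # Edge-stopping function: near 1 in flat areas, near 0 across edges.
        return np.exp(-(d / kappa) ** 2)

    return img + dt * (g(d_n) * d_n + g(d_s) * d_s + g(d_e) * d_e + g(d_w) * d_w)
```

The operator and its parameters are fixed in advance here. The abstract's proposal is instead to let deep Q-learning choose which simple diffusion action to apply at each iteration.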

[92] Neighbor-aware Instance Refining with Noisy Labels for Cross-Modal Retrieval

Yizhi Liu,Ruitao Pu,Shilin Xu,Yingke Chen,Quan-Hui Liu,Yuan Sun

Main category: cs.CV

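TL;DR: Proposes NIRNL, a new robust cross-modal learning framework that improves retrieval under noisy labels through cross-modal margin preserving and neighbor-aware instance refining.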

Details Motivation: Existing methods struggle to satisfy model performance, label calibration reliability, and data utilization at the same time, and their performance is limited under high noise. Method: Cross-modal Margin Preserving (CMP) is proposed to make sample pairs more discriminable, and Neighbor-aware Instance Refining (NIR) uses cross-modal neighborhood consensus to identify pure, hard, and noisy samples, each of which is then handled with a tailored optimization strategy. Result: Experiments on three benchmark datasets show that NIRNL reaches state-of-the-art performance at every noise rate tested and is especially robust at high noise rates. Conclusion: NIRNL improves the robustness of cross-modal retrieval under noisy labels and balances higher performance, more reliable label calibration, and higher data utilization. Abstract: In recent years, Cross-Modal Retrieval (CMR) has made significant progress in the field of multi-modal analysis. However, since it is time-consuming and labor-intensive to collect large-scale and well-annotated data, the annotation of multi-modal data inevitably contains some noise. This will degrade the retrieval performance of the model. To tackle the problem, numerous robust CMR methods have been developed, including robust learning paradigms, label calibration strategies, and instance selection mechanisms. Unfortunately, they often fail to simultaneously satisfy model performance ceilings, calibration reliability, and data utilization rate. To overcome the limitations, we propose a novel robust cross-modal learning framework, namely Neighbor-aware Instance Refining with Noisy Labels (NIRNL). Specifically, we first propose Cross-modal Margin Preserving (CMP) to adjust the relative distance between positive and negative pairs, thereby enhancing the discrimination between sample pairs. Then, we propose Neighbor-aware Instance Refining (NIR) to identify pure subset, hard subset, and noisy subset through cross-modal neighborhood consensus. Afterward, we construct different tailored optimization strategies for this fine-grained partitioning, thereby maximizing the utilization of all available data while mitigating error propagation. Extensive experiments on three benchmark datasets demonstrate that NIRNL achieves state-of-the-art performance, exhibiting remarkable robustness, especially under high noise rates.

[93] Pathology Context Recalibration Network for Ocular Disease Recognition

Zunjie Xiao,Xiaoqing Zhang,Risa Higashita,Jiang Liu

Main category: cs.CV

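TL;DR: This paper proposes PCRNet, a model for automated ocular disease recognition that combines pathology context and expert experience priors through a new Pathology Recalibration Module (PRM) and an Expert Prior Guidance Adapter (EPGA), and adds an Integrated Loss (IL) to improve performance; it outperforms existing methods on three ocular disease datasets.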

Details Motivation: Deep neural networks for ocular disease recognition ignore the clinical pathology context and expert experience priors, which limits both performance and decision interpretability; this prior knowledge should therefore be incorporated to improve the models. Method: A Pathology Recalibration Module (PRM) captures pathology context with a pixel-wise context compression operator and a pathology distribution concentration operator; an Expert Prior Guidance Adapter (EPGA) highlights key regions; these are combined into PCRNet, and an Integrated Loss (IL) is introduced to improve training. Result: On three ocular disease datasets, PCRNet with IL clearly outperforms existing attention-based networks and advanced loss methods, and visualization analysis shows that PRM and EPGA do influence the model's decision process. Conclusion: Fusing pathology context with expert experience priors improves both ocular disease recognition performance and model interpretability, and PCRNet offers a better solution for clinical decision support. Abstract: Pathology context and expert experience play significant roles in clinical ocular disease diagnosis. Although deep neural networks (DNNs) have good ocular disease recognition results, they often ignore exploring the clinical pathology context and expert experience priors to improve ocular disease recognition performance and decision-making interpretability. To this end, we first develop a novel Pathology Recalibration Module (PRM) to leverage the potential of pathology context prior via the combination of the well-designed pixel-wise context compression operator and pathology distribution concentration operator; then this paper applies a novel Expert Prior Guidance Adapter (EPGA) to further highlight significant pixel-wise representation regions by fully mining the expert experience prior. By incorporating PRM and EPGA into the modern DNN, the PCRNet is constructed for automated ocular disease recognition. Additionally, we introduce an Integrated Loss (IL) to boost the ocular disease recognition performance of PCRNet by considering the effects of sample-wise loss distributions and training label frequencies. The extensive experiments on three ocular disease datasets demonstrate the superiority of PCRNet with IL over state-of-the-art attention-based networks and advanced loss methods. Further visualization analysis explains the inherent behavior of PRM and EPGA that affects the decision-making process of DNNs.

[94] Balanced Hierarchical Contrastive Learning with Decoupled Queries for Fine-grained Object Detection in Remote Sensing Images

Jingzhou Chen,Dexin Chen,Fengchao Xiong,Yuntao Qian,Liang Xiao

Main category: cs.CV

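TL;DR: Proposes a balanced hierarchical contrastive loss and a decoupled learning strategy to better embed the semantic hierarchy in fine-grained object detection for remote sensing images.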

Details Motivation: When embedding the label hierarchy, existing methods suffer from imbalanced data distribution and from interference between the classification and localization tasks, which hurts fine-grained detection performance. Method: A balanced hierarchical contrastive loss introduces learnable class prototypes and equalizes the gradient contribution of each class; within the DETR framework, a decoupling strategy splits the object queries into a classification set and a localization set for task-specific optimization. Result: Experiments on three fine-grained datasets with hierarchical annotations show that the method outperforms existing state-of-the-art approaches. Conclusion: The proposed method mitigates class imbalance in hierarchical contrastive learning, improves the joint performance of classification and localization through decoupled learning, and markedly raises the accuracy of fine-grained remote sensing object detection. Abstract: Fine-grained remote sensing datasets often use hierarchical label structures to differentiate objects in a coarse-to-fine manner, with each object annotated across multiple levels. However, embedding this semantic hierarchy into the representation learning space to improve fine-grained detection performance remains challenging. Previous studies have applied supervised contrastive learning at different hierarchical levels to group objects under the same parent class while distinguishing sibling subcategories. Nevertheless, they overlook two critical issues: (1) imbalanced data distribution across the label hierarchy causes high-frequency classes to dominate the learning process, and (2) learning semantic relationships among categories interferes with class-agnostic localization. To address these issues, we propose a balanced hierarchical contrastive loss combined with a decoupled learning strategy within the detection transformer (DETR) framework. The proposed loss introduces learnable class prototypes and equilibrates gradients contributed by different classes at each hierarchical level, ensuring that each hierarchical class contributes equally to the loss computation in every mini-batch. The decoupled strategy separates DETR's object queries into classification and localization sets, enabling task-specific feature extraction and optimization. Experiments on three fine-grained datasets with hierarchical annotations demonstrate that our method outperforms state-of-the-art approaches.

[95] RainFusion2.0: Temporal-Spatial Awareness and Hardware-Efficient Block-wise Sparse Attention

Aiyue Chen,Yaofu Liu,Junjian Huang,Guang Lian,Yiwu Yao,Wangli Lan,Jing Lin,Zhixin Ma,Tingting Zhou,Harry Yang

Main category: cs.CV

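TL;DR: RainFusion2.0 proposes an efficient, low-overhead, hardware-general sparse attention mechanism for accelerating video and image generation models, achieving a 1.5~1.8x end-to-end speedup while preserving quality.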

Details Motivation: DiT models are limited by the high computational cost of attention, and existing sparse attention methods have high prediction overhead and poor hardware generality. Method: RainFusion2.0 uses block-wise mean values as representative tokens for sparse mask prediction, applies spatiotemporal-aware token permutation, and introduces a first-frame sink mechanism for video generation. Result: It reaches 80% sparsity with a 1.5~1.8x end-to-end speedup and no loss in video quality, and it supports multiple generative models and hardware platforms. Conclusion: RainFusion2.0 is an online adaptive, hardware-efficient sparse attention scheme with good generalization across hardware and promising applications. Abstract: In video and image generation tasks, Diffusion Transformer (DiT) models incur extremely high computational costs due to attention mechanisms, which limits their practical applications. Furthermore, with hardware advancements, a wide range of devices besides graphics processing unit (GPU), such as application-specific integrated circuit (ASIC), have been increasingly adopted for model inference. Sparse attention, which leverages the inherent sparsity of attention by skipping computations for insignificant tokens, is an effective approach to mitigate computational costs. However, existing sparse attention methods have two critical limitations: the overhead of sparse pattern prediction and the lack of hardware generality, as most of these methods are designed for GPU. To address these challenges, this study proposes RainFusion2.0, which aims to develop an online adaptive, hardware-efficient, and low-overhead sparse attention mechanism to accelerate both video and image generative models, with robust performance across diverse hardware platforms. Key technical insights include: (1) leveraging block-wise mean values as representative tokens for sparse mask prediction; (2) implementing spatiotemporal-aware token permutation; and (3) introducing a first-frame sink mechanism specifically designed for video generation scenarios. Experimental results demonstrate that RainFusion2.0 can achieve 80% sparsity while achieving an end-to-end speedup of 1.5~1.8x without compromising video quality. Moreover, RainFusion2.0 demonstrates effectiveness across various generative models and validates its generalization across diverse hardware platforms.
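
The first insight, block-wise means as representative tokens, can be illustrated with a small sketch. The abstract does not say how blocks are selected from the coarse scores, so the per-row top-k rule, the block size, and the function name below are all assumptions for illustration. Token permutation and the first-frame sink are not modeled.

```python
import numpy as np


def block_mean_sparse_mask(q, k, block=4, keep_ratio=0.2):
    """Predict a block-level attention mask from block-mean tokens.

    q, k: (num_tokens, dim) arrays with num_tokens divisible by `block`.
    Returns a boolean (q_blocks, k_blocks) mask; True = compute this block.
    The top-k selection rule is an assumption, not taken from the paper.
    """
    dim = q.shape[-1]
    # One representative (mean) token per block.
    q_rep = q.reshape(-1, block, dim).mean(axis=1)
    k_rep = k.reshape(-1, block, dim).mean(axis=1)
    # Coarse scores cost O((N / block)^2) instead of O(N^2).
    scores = q_rep @ k_rep.T / np.sqrt(dim)
    n_keep = max(1, int(round(keep_ratio * k_rep.shape[0])))
    top = np.argsort(-scores, axis=1)[:, :n_keep]
    mask = np.zeros(scores.shape, dtype=bool)
    np.put_along_axis(mask, top, True, axis=1)
    return mask
```

With `keep_ratio=0.2`, 80% of the key blocks are skipped for each query block, which matches the sparsity level quoted in the abstract. Whether quality holds at that level depends on the selection rule and the model.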

[96] Factorized Learning for Temporally Grounded Video-Language Models

Wenzheng Zeng,Difei Gao,Mike Zheng Shou,Hwee Tou Ng

Main category: cs.CV

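TL;DR: Proposes the D$^2$VLM framework and the factorized preference optimization (FPO) algorithm, which decouple temporal grounding from textual answering in video understanding and improve event-level perception.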

Details Motivation: Existing video-language models usually handle temporal grounding and textual response in a coupled way without a clear logical structure, which leads to sub-optimal results; a learning framework is needed that reflects the dependency between the two and optimizes them step by step. Method: D$^2$VLM adopts a "ground the evidence first, then answer" paradigm and introduces evidence tokens to strengthen event-level semantic capture; the factorized preference optimization (FPO) algorithm incorporates probabilistic temporal grounding modeling into the optimization objective so that temporal grounding and text generation are optimized separately; a synthetic dataset is built to support training. Result: Experiments on multiple video understanding tasks show that the method outperforms existing methods in both temporal grounding accuracy and question answering, supporting the effectiveness of factorized learning. Conclusion: By decoupling the two tasks while preserving their dependency, D$^2$VLM and FPO achieve event-level video understanding more effectively and offer a new training paradigm for video-language models. Abstract: Recent video-language models have shown great potential for video understanding, but still struggle with accurate temporal grounding for event-level perception. We observe that two main factors in video understanding (i.e., temporal grounding and textual response) form a logical hierarchy: accurate temporal evidence grounding lays the foundation for reliable textual response. However, existing works typically handle these two tasks in a coupled manner without a clear logical structure, leading to sub-optimal objectives. We address this from a factorized learning perspective. We first propose D$^2$VLM, a framework that decouples the learning of these two tasks while also emphasizing their inherent dependency. We adopt a "grounding then answering with evidence referencing" paradigm and introduce evidence tokens for evidence grounding, which emphasize event-level visual semantic capture beyond the focus on timestamp representation in existing works. To further facilitate the learning of these two tasks, we introduce a novel factorized preference optimization (FPO) algorithm. Unlike standard preference optimization, FPO explicitly incorporates probabilistic temporal grounding modeling into the optimization objective, enabling preference learning for both temporal grounding and textual response. We also construct a synthetic dataset to address the lack of suitable datasets for factorized preference learning with explicit temporal grounding. Experiments on various tasks demonstrate the clear advantage of our approach. Our source code is available at https://github.com/nusnlp/d2vlm.

[97] Think Before You Move: Latent Motion Reasoning for Text-to-Motion Generation

Yijie Qian,Juncheng Wang,Yuxiang Feng,Chao Xu,Wang Lu,Yang Liu,Baigui Sun,Yiqiang Chen,Yong Liu,Shujun Wang

Main category: cs.CV

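TL;DR: This paper proposes Latent Motion Reasoning (LMR), a new text-to-motion generation framework that introduces a two-stage "think-then-act" mechanism to resolve the semantic-kinematic impedance mismatch between linguistic semantics and kinematic data.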

Details Motivation: Existing text-to-motion methods map language directly to high-frequency motion, which creates a mismatch between semantic and kinematic information and makes complex semantic intent hard to handle. Method: Inspired by hierarchical motor control in cognitive science, the paper proposes a Latent System 2 Reasoning architecture in which a Dual-Granularity Tokenizer decomposes motion into a semantically rich Reasoning Latent for global planning and an Execution Latent that preserves physical detail; generation is autoregressive, reasoning first and producing frames second. Result: LMR brings clear gains on two baselines, T2M-GPT and MotionStreamer, and experiments show that it improves on existing methods in both semantic alignment and physical plausibility. Conclusion: The best planning space for motion generation is not natural language itself but a learned, motion-aligned concept space, and LMR offers a better architectural paradigm for text-to-motion generation. Abstract: Current state-of-the-art paradigms predominantly treat Text-to-Motion (T2M) generation as a direct translation problem, mapping symbolic language directly to continuous poses. While effective for simple actions, this System 1 approach faces a fundamental theoretical bottleneck we identify as the Semantic-Kinematic Impedance Mismatch: the inherent difficulty of grounding semantically dense, discrete linguistic intent into kinematically dense, high-frequency motion data in a single shot. In this paper, we argue that the solution lies in an architectural shift towards Latent System 2 Reasoning. Drawing inspiration from Hierarchical Motor Control in cognitive science, we propose Latent Motion Reasoning (LMR) that reformulates generation as a two-stage Think-then-Act decision process. Central to LMR is a novel Dual-Granularity Tokenizer that disentangles motion into two distinct manifolds: a compressed, semantically rich Reasoning Latent for planning global topology, and a high-frequency Execution Latent for preserving physical fidelity. By forcing the model to autoregressively reason (plan the coarse trajectory) before it moves (instantiates the frames), we effectively bridge the ineffability gap between language and physics. We demonstrate LMR's versatility by implementing it for two representative baselines: T2M-GPT (discrete) and MotionStreamer (continuous). 
Extensive experiments show that LMR yields non-trivial improvements in both semantic alignment and physical plausibility, validating that the optimal substrate for motion planning is not natural language, but a learned, motion-aligned concept space. Code and demos are available at https://chenhaoqcdyq.github.io/LMR/

[98] Guided Diffusion-based Generation of Adversarial Objects for Real-World Monocular Depth Estimation Attacks

Yongtao Chen,Yanbo Wang,Wentao Zhao,Guole Shen,Tianchen Deng,Jingchuan Wang

Main category: cs.CV

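TL;DR: Proposes a training-free generative adversarial attack framework that uses a diffusion model to generate natural, scene-consistent adversarial objects that more effectively disrupt monocular depth estimation in autonomous driving systems.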

Details Motivation: Existing physical attacks based on texture patches impose strict placement constraints and lack realism in complex driving environments, which limits their effectiveness; more natural and practical attacks are needed to assess the safety of autonomous driving systems. Method: A training-free adversarial attack framework combines a Salient Region Selection module with a Jacobian Vector Product Guidance mechanism and uses a diffusion-based conditional generation process to synthesize physically plausible adversarial objects that perturb monocular depth estimates. Result: In digital and physical experiments, the method clearly outperforms existing attacks in effectiveness, stealthiness, and physical deployability. Conclusion: The method generates highly realistic adversarial objects that mislead monocular depth estimation models, which is significant for the safety assessment of autonomous driving systems. Abstract: Monocular Depth Estimation (MDE) serves as a core perception module in autonomous driving systems, but it remains highly susceptible to adversarial attacks. Errors in depth estimation may propagate through downstream decision making and influence overall traffic safety. Existing physical attacks primarily rely on texture-based patches, which impose strict placement constraints and exhibit limited realism, thereby reducing their effectiveness in complex driving environments. To overcome these limitations, this work introduces a training-free generative adversarial attack framework that generates naturalistic, scene-consistent adversarial objects via a diffusion-based conditional generation process. The framework incorporates a Salient Region Selection module that identifies regions most influential to MDE and a Jacobian Vector Product Guidance mechanism that steers adversarial gradients toward update directions supported by the pre-trained diffusion model. This formulation enables the generation of physically plausible adversarial objects capable of inducing substantial adversarial depth shifts. Extensive digital and physical experiments demonstrate that our method significantly outperforms existing attacks in effectiveness, stealthiness, and physical deployability, underscoring its strong practical implications for autonomous driving safety assessment.

[99] GeoBench: Rethinking Multimodal Geometric Problem-Solving via Hierarchical Evaluation

Yuan Feng,Yue Yang,Xiaohan He,Jiatong Zhao,Jianlong Chen,Zijun Chen,Daocheng Fu,Qi Liu,Renqiu Xia,Bo Zhang,Junchi Yan

Main category: cs.CV

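TL;DR: This paper presents GeoBench, a hierarchical benchmark for evaluating vision-language models on geometric problem solving; it covers four reasoning levels, assesses model capabilities through six formally verified tasks, and shows that sub-goal decomposition and irrelevant premise filtering are critical to accuracy.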

Details Motivation: Existing evaluations of geometric reasoning suffer from data contamination, overemphasis on final answers at the expense of the reasoning process, and insufficient diagnostic granularity, so a more reliable evaluation framework is needed. Method: GeoBench defines four reasoning levels: Visual Perception, Goal-Oriented Planning, Rigorous Theorem Application, and Self-Reflective Backtracking; six formally verified tasks generated with TrustGeoGen are used for systematic evaluation. Result: Experiments show that although reasoning models such as OpenAI-o3 outperform general multimodal large models, performance drops significantly as task complexity increases; sub-goal decomposition and irrelevant premise filtering are critical to accuracy; and Chain-of-Thought prompting actually lowers performance on some tasks. Conclusion: GeoBench is a comprehensive and actionable benchmark for geometric reasoning that gives clear directions for improving geometric problem-solving systems. Abstract: Geometric problem solving constitutes a critical branch of mathematical reasoning, requiring precise analysis of shapes and spatial relationships. Current evaluations of geometric reasoning in vision-language models (VLMs) face limitations, including the risk of test data contamination from textbook-based benchmarks, overemphasis on final answers over reasoning processes, and insufficient diagnostic granularity. To address these issues, we present GeoBench, a hierarchical benchmark featuring four reasoning levels in geometric problem-solving: Visual Perception, Goal-Oriented Planning, Rigorous Theorem Application, and Self-Reflective Backtracking. Through six formally verified tasks generated via TrustGeoGen, we systematically assess capabilities ranging from attribute extraction to logical error correction. Experiments reveal that while reasoning models like OpenAI-o3 outperform general MLLMs, performance declines significantly with increasing task complexity. Key findings demonstrate that sub-goal decomposition and irrelevant premise filtering critically influence final problem-solving accuracy, whereas Chain-of-Thought prompting unexpectedly degrades performance in some tasks. These findings establish GeoBench as a comprehensive benchmark while offering actionable guidelines for developing geometric problem-solving systems.

[100] Enhancing LLM-Based Neural Network Generation: Few-Shot Prompting and Efficient Validation for Automated Architecture Design

Chandini Vysyaraju,Raghuvir Duvvuri,Avi Goyal,Dmitry Ignatov,Radu Timofte

Main category: cs.CV

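TL;DR: This paper presents two techniques for LLM-based neural network architecture generation in computer vision, Few-Shot Architecture Prompting (FSAP) and Whitespace-Normalized Hash Validation, and validates their effectiveness and efficiency in large-scale experiments.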

Details Motivation: Automated neural network architecture design remains challenging in computer vision; existing methods are computationally expensive, and prompt engineering and validation strategies have not been studied systematically. Large language models (LLMs) are promising, but their use for vision architecture generation has not been explored systematically. Method: Building on the NNGPT/LEMUR framework, the paper proposes Few-Shot Architecture Prompting (FSAP) and systematically studies how the number of examples (n = 1 to 6) affects generation; it also introduces Whitespace-Normalized Hash Validation for fast deduplication. It generates 1,900 unique architectures across seven vision benchmarks and evaluates them with a dataset-balanced methodology. Result: Using n = 3 examples gives the best balance between diversity and context focus; hash validation takes less than 1 ms, is 100x faster than AST parsing, and substantially reduces redundant training; the large-scale experiments show the methods are efficient and effective. Conclusion: The paper provides practical guidelines and a rigorous evaluation approach for LLM-based architecture search in computer vision and lowers the computational requirements, so that more researchers can take part in automated model design. Abstract: Automated neural network architecture design remains a significant challenge in computer vision. Task diversity and computational constraints require both effective architectures and efficient search methods. Large Language Models (LLMs) present a promising alternative to computationally intensive Neural Architecture Search (NAS), but their application to architecture generation in computer vision has not been systematically studied, particularly regarding prompt engineering and validation strategies. Building on the task-agnostic NNGPT/LEMUR framework, this work introduces and validates two key contributions for computer vision. First, we present Few-Shot Architecture Prompting (FSAP), the first systematic study of the number of supporting examples (n = 1, 2, 3, 4, 5, 6) for LLM-based architecture generation. We find that using n = 3 examples best balances architectural diversity and context focus for vision tasks. Second, we introduce Whitespace-Normalized Hash Validation, a lightweight deduplication method (less than 1 ms) that provides a 100x speedup over AST parsing and prevents redundant training of duplicate computer vision architectures. In large-scale experiments across seven computer vision benchmarks (MNIST, CIFAR-10, CIFAR-100, CelebA, ImageNette, SVHN, Places365), we generated 1,900 unique architectures. We also introduce a dataset-balanced evaluation methodology to address the challenge of comparing architectures across heterogeneous vision tasks. 
These contributions provide actionable guidelines for LLM-based architecture search in computer vision and establish rigorous evaluation practices, making automated design more accessible to researchers with limited computational resources.
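
The deduplication idea in the abstract above can be illustrated with a minimal sketch. The abstract does not specify the normalization rule or the hash function, so collapsing every whitespace run to a single space and using SHA-256 are assumptions made here for illustration; the function names are hypothetical and are not taken from NNGPT/LEMUR.

```python
import hashlib
import re


def whitespace_normalized_hash(source: str) -> str:
    """Hash source code after collapsing each whitespace run to one space.

    Assumption: this normalization rule and SHA-256 are illustrative choices,
    not the rule used by the paper.
    """
    normalized = re.sub(r"\s+", " ", source).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(candidates):
    """Keep the first occurrence of each architecture, judged by its hash."""
    seen = set()
    unique = []
    for code in candidates:
        digest = whitespace_normalized_hash(code)
        if digest not in seen:
            seen.add(digest)
            unique.append(code)
    return unique


# Two candidates that differ only in whitespace, plus one real variant.
a = "class Net:\n    def forward(self, x):\n        return x"
b = "class Net:\n  def forward(self, x):\n\n      return x   "
c = "class Net:\n    def forward(self, x):\n        return x * 2"
unique = deduplicate([a, b, c])
```

Hashing avoids parsing entirely, which is where the speed comes from. The trade-off of this particular rule is that it ignores indentation, so in an indentation-sensitive language two programs with different nesting could in principle collide.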

[101] Taming Preference Mode Collapse via Directional Decoupling Alignment in Diffusion Reinforcement Learning

Chubin Chen,Sujie Hu,Jiashu Zhu,Meiqi Wu,Jintao Chen,Yanxun Li,Nisha Huang,Chengyu Fang,Jiahong Wu,Xiangxiang Chu,Xiu Li

Main category: cs.CV

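TL;DR: This paper proposes D²-Align, a new alignment method that mitigates Preference Mode Collapse (PMC) in text-to-image diffusion models by applying a directional decoupling correction to the reward signal, preserving generative diversity.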

Details Motivation: Existing methods based on reinforcement learning from human feedback score well on automated reward metrics but tend to produce outputs that lack diversity, a phenomenon termed Preference Mode Collapse (PMC). Method: Directional Decoupling Alignment (D²-Align) learns a directional correction in the embedding space of a frozen reward model and applies it to the reward signal during optimization, preventing the model from collapsing into specific modes. Result: In qualitative and quantitative evaluation, D²-Align outperforms existing methods in both image quality and diversity and effectively mitigates PMC on the DivGenBench benchmark. Conclusion: D²-Align suppresses the mode collapse caused by reward over-optimization and improves both generative diversity and alignment with human preference. Abstract: Recent studies have demonstrated significant progress in aligning text-to-image diffusion models with human preference via Reinforcement Learning from Human Feedback. However, while existing methods achieve high scores on automated reward metrics, they often lead to Preference Mode Collapse (PMC), a specific form of reward hacking where models converge on narrow, high-scoring outputs (e.g., images with monolithic styles or pervasive overexposure), severely degrading generative diversity. In this work, we introduce and quantify this phenomenon, proposing DivGenBench, a novel benchmark designed to measure the extent of PMC. We posit that this collapse is driven by over-optimization along the reward model's inherent biases. Building on this analysis, we propose Directional Decoupling Alignment (D$^2$-Align), a novel framework that mitigates PMC by directionally correcting the reward signal. Specifically, our method first learns a directional correction within the reward model's embedding space while keeping the model frozen. This correction is then applied to the reward signal during the optimization process, preventing the model from collapsing into specific modes and thereby maintaining diversity. Our comprehensive evaluation, combining qualitative analysis with quantitative metrics for both quality and diversity, reveals that D$^2$-Align achieves superior alignment with human preference.

[102] Towards Open-Vocabulary Industrial Defect Understanding with a Large-Scale Multimodal Dataset

TsaiChing Ni,ZhenQi Chen,YuanFu Yang

Main category: cs.CV

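TL;DR: IMDD-1M is a large-scale industrial defect dataset of 1 million image-text pairs that supports a range of tasks; a diffusion-based foundation model trained on it can be fine-tuned efficiently, enabling data-efficient industrial inspection.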

Details Motivation: To advance multimodal learning in manufacturing and address the small scale, coarse annotations, and missing contextual information of existing datasets. Method: The authors build IMDD-1M, a large-scale aligned image-text defect dataset, and train a diffusion-based vision-language foundation model from scratch that supports lightweight fine-tuning for specific tasks. Result: With less than 5% of the task-specific data, the model matches the performance of dedicated expert models, demonstrating the feasibility of data-efficient adaptation. Conclusion: IMDD-1M provides a scalable, domain-adaptive, and knowledge-grounded foundation for industrial intelligence and shows the potential of foundation models for industrial inspection and generation. Abstract: We present IMDD-1M, the first large-scale Industrial Multimodal Defect Dataset comprising 1,000,000 aligned image-text pairs, designed to advance multimodal learning for manufacturing and quality inspection. IMDD-1M contains high-resolution real-world defects spanning over 60 material categories and more than 400 defect types, each accompanied by expert-verified annotations and fine-grained textual descriptions detailing defect location, severity, and contextual attributes. This dataset enables a wide spectrum of applications, including classification, segmentation, retrieval, captioning, and generative modeling. Building upon IMDD-1M, we train a diffusion-based vision-language foundation model from scratch, specifically tailored for industrial scenarios. The model serves as a generalizable foundation that can be efficiently adapted to specialized domains through lightweight fine-tuning. With less than 5% of the task-specific data required by dedicated expert models, it achieves comparable performance, highlighting the potential of data-efficient foundation model adaptation for industrial inspection and generation, paving the way for scalable, domain-adaptive, and knowledge-grounded manufacturing intelligence.

[103] Bayesian Self-Distillation for Image Classification

Anton Adelöw,Matteo Gamba,Atsuto Maki

Main category: cs.CV

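TL;DR: Proposes Bayesian Self-Distillation (BSD), which builds sample-specific target distributions via Bayesian inference without relying on hard targets, improving accuracy, calibration, and robustness.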

Details Motivation: Supervised training of deep neural networks typically relies on hard targets, which makes models overconfident and hurts calibration, generalization, and robustness; existing self-distillation methods still depend on hard targets, which limits their effectiveness. Method: Bayesian Self-Distillation (BSD) constructs sample-specific target distributions via Bayesian inference from the model's own predictions and no longer relies on hard targets after initialization. Result: Across architectures and datasets, BSD improves test accuracy (e.g. +1.4% for ResNet-50 on CIFAR-100), lowers Expected Calibration Error (ECE reduced by 40%), and improves robustness to data corruptions, perturbations, and label noise; combined with a contrastive loss, it reaches state-of-the-art robustness under label noise among single-stage, single-network methods. Conclusion: BSD is a principled and effective self-distillation method that removes the dependence on hard targets and improves model performance and reliability. Abstract: Supervised training of deep neural networks for classification typically relies on hard targets, which promote overconfidence and can limit calibration, generalization, and robustness. Self-distillation methods aim to mitigate this by leveraging inter-class and sample-specific information present in the model's own predictions, but often remain dependent on hard targets, reducing their effectiveness. With this in mind, we propose Bayesian Self-Distillation (BSD), a principled method for constructing sample-specific target distributions via Bayesian inference using the model's own predictions. Unlike existing approaches, BSD does not rely on hard targets after initialization. BSD consistently yields higher test accuracy (e.g. +1.4% for ResNet-50 on CIFAR-100) and significantly lower Expected Calibration Error (ECE) (-40% ResNet-50, CIFAR-100) than existing architecture-preserving self-distillation methods for a range of deep architectures and datasets. Additional benefits include improved robustness against data corruptions, perturbations, and label noise. When combined with a contrastive loss, BSD achieves state-of-the-art robustness under label noise for single-stage, single-network methods.
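
Since the abstract reports calibration as Expected Calibration Error (ECE), a short sketch of that metric may help. This is the common equal-width-bin estimator; the bin count of 10 and the toy confidences below are assumptions for illustration and are not taken from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted mean of |accuracy - mean confidence| per bin.

    confidences: top-class probabilities in [0, 1]
    correct:     1 if the prediction was right, else 0
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Made-up predictions: four at 95% confidence (three right),
# two at 55% confidence (one right).
confidences = [0.95, 0.95, 0.95, 0.95, 0.55, 0.55]
correct = [1, 1, 1, 0, 1, 0]
ece = expected_calibration_error(confidences, correct)
```

In this toy case the high-confidence bin is 75% accurate at 95% mean confidence, which is the overconfidence pattern that hard targets are said to encourage.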

[104] DiffThinker: Towards Generative Multimodal Reasoning with Diffusion Models

Zefeng He,Xiaoye Qu,Yafu Li,Tong Zhu,Siyuan Huang,Yu Cheng

Main category: cs.CV

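TL;DR: This paper proposes DiffThinker, a new generative multimodal reasoning paradigm that recasts multimodal reasoning as an image-to-image generation task; on complex vision-centric tasks it significantly outperforms existing MLLMs, with better logical consistency and spatial precision.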

Details Motivation: Existing multimodal large language models (MLLMs) rely mainly on text-centric reasoning and perform poorly on complex, long-horizon, vision-centric tasks, so a paradigm better suited to visual reasoning is needed. Method: DiffThinker is a diffusion-based generative image-to-image reasoning framework that performs multimodal reasoning directly in visual space; the authors systematically analyze four of its properties: efficiency, controllability, native parallelism, and collaboration. Result: Across four domains (sequential planning, combinatorial optimization, constraint satisfaction, and spatial configuration), DiffThinker significantly outperforms GPT-5 (+314.2%), Gemini-3-Flash (+111.6%), and the fine-tuned Qwen3-VL-32B baseline (+39.0%). Conclusion: Generative multimodal reasoning is a promising route to vision-centric reasoning, and DiffThinker provides an effective framework for it. Abstract: While recent Multimodal Large Language Models (MLLMs) have made significant strides in multimodal reasoning, their reasoning processes remain predominantly text-centric, leading to suboptimal performance in complex long-horizon, vision-centric tasks. In this paper, we establish a novel Generative Multimodal Reasoning paradigm and introduce DiffThinker, a diffusion-based reasoning framework. Conceptually, DiffThinker reformulates multimodal reasoning as a native generative image-to-image task, achieving superior logical consistency and spatial precision in vision-centric tasks. We perform a systematic comparison between DiffThinker and MLLMs, providing the first in-depth investigation into the intrinsic characteristics of this paradigm, revealing four core properties: efficiency, controllability, native parallelism, and collaboration. Extensive experiments across four domains (sequential planning, combinatorial optimization, constraint satisfaction, and spatial configuration) demonstrate that DiffThinker significantly outperforms leading closed-source models including GPT-5 (+314.2%) and Gemini-3-Flash (+111.6%), as well as the fine-tuned Qwen3-VL-32B baseline (+39.0%), highlighting generative multimodal reasoning as a promising approach for vision-centric reasoning.

[105] Deep Global Clustering for Hyperspectral Image Segmentation: Concepts, Applications, and Open Challenges

Yu-Tang Chang,Pin-Wei Chen,Shih-Fang Chen

Main category: cs.CV

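TL;DR: Proposes Deep Global Clustering (DGC), a framework for memory-efficient hyperspectral image segmentation that learns global clustering structure without pre-training but suffers from optimization instability when balancing its multi-objective loss.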

Details Motivation: Hyperspectral image analysis faces compute and memory bottlenecks because of large data volumes, and existing foundation models transfer poorly to specific domains such as close-range agricultural monitoring, so a memory-friendly segmentation method that needs no pre-training is required. Method: DGC learns global clustering structure from local observations by processing small patches with overlapping regions and enforcing consistency across them, which gives low memory use and fast training. Result: On a leaf disease dataset, DGC achieves high-quality background-tissue separation (mean IoU 0.925) and demonstrates unsupervised disease detection through navigable semantic granularity; however, representations degrade because clusters over-merge in feature space. Conclusion: DGC is promising as a conceptual framework, particularly in resource-constrained settings, but its practicality depends on better solutions for dynamic loss balancing; the work offers a conceptual basis for follow-up research. Abstract: Hyperspectral imaging (HSI) analysis faces computational bottlenecks due to massive data volumes that exceed available memory. While foundation models pre-trained on large remote sensing datasets show promise, their learned representations often fail to transfer to domain-specific applications like close-range agricultural monitoring where spectral signatures, spatial scales, and semantic targets differ fundamentally. This report presents Deep Global Clustering (DGC), a conceptual framework for memory-efficient HSI segmentation that learns global clustering structure from local patch observations without pre-training. DGC operates on small patches with overlapping regions to enforce consistency, enabling training in under 30 minutes on consumer hardware while maintaining constant memory usage. On a leaf disease dataset, DGC achieves background-tissue separation (mean IoU 0.925) and demonstrates unsupervised disease detection through navigable semantic granularity. However, the framework suffers from optimization instability rooted in multi-objective loss balancing: meaningful representations emerge rapidly but degrade due to cluster over-merging in feature space. We position this work as intellectual scaffolding: the design philosophy has merit, but stable implementation requires principled approaches to dynamic loss balancing. Code and data are available at https://github.com/b05611038/HSI_global_clustering.
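
For readers unfamiliar with the mean IoU figure quoted above, the following sketch shows how the metric is usually computed for a label map. The flattened toy labels are made up, and the abstract does not say how cluster indices were matched to ground-truth classes, so that matching step is assumed to have been done already.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes for flattened label maps.

    Classes that appear in neither prediction nor target are skipped.
    """
    ious = []
    for cls in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
        union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious)


# Toy example: 0 = background, 1 = tissue; one pixel is mislabeled.
pred = [0, 0, 1, 1, 1, 0]
target = [0, 0, 1, 1, 0, 0]
miou = mean_iou(pred, target, num_classes=2)
```

Here the background class scores 3/4 and the tissue class 2/3, so the mean is 17/24.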

[106] Guiding a Diffusion Transformer with the Internal Dynamics of Itself

Xingyu Zhou,Qifan Li,Xiaobin Hu,Hai Chen,Shuhang Gu

Main category: cs.CV

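TL;DR: This paper proposes Internal Guidance (IG), a simple yet effective strategy that adds auxiliary supervision on an intermediate layer during training and extrapolates the intermediate and deep-layer outputs during sampling, significantly improving the training efficiency and generation quality of diffusion models.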

Details Motivation: Existing diffusion models generate poorly in low-probability regions; classifier-free guidance (CFG) tends to produce over-simplified or distorted samples, while methods that guide a model with a "bad version" of itself depend on carefully designed degradations, extra training, and additional sampling steps, which limits their use. Method: Internal Guidance (IG) adds auxiliary supervision on an intermediate layer during training and extrapolates the intermediate and deep-layer outputs during sampling to improve generation. It requires no extra network and no elaborate degradation strategy. Result: On ImageNet 256x256, SiT-XL/2+IG reaches FID=5.31 and FID=1.75 at 80 and 800 epochs, respectively; LightningDiT-XL/1+IG reaches FID=1.34, improving to FID=1.19 when combined with CFG, the current state of the art. Conclusion: IG is an efficient, general guidance strategy that significantly improves the generation quality and training efficiency of diffusion models without extra networks or complex design, and has broad potential for application. Abstract: The diffusion model presents a powerful ability to capture the entire (conditional) data distribution. However, due to the lack of sufficient training and data to learn to cover low-probability areas, the model will be penalized for failing to generate high-quality images corresponding to these areas. To achieve better generation quality, guidance strategies such as classifier-free guidance (CFG) can guide the samples to the high-probability areas during the sampling stage. However, the standard CFG often leads to over-simplified or distorted samples. On the other hand, the alternative line of guiding a diffusion model with a bad version of itself is limited by carefully designed degradation strategies, extra training and additional sampling steps. In this paper, we propose a simple yet effective strategy, Internal Guidance (IG), which introduces an auxiliary supervision on the intermediate layer during the training process and extrapolates the intermediate and deep layer's outputs to obtain generative results during the sampling process. This simple strategy yields significant improvements in both training efficiency and generation quality on various baselines. On ImageNet 256x256, SiT-XL/2+IG achieves FID=5.31 and FID=1.75 at 80 and 800 epochs. More impressively, LightningDiT-XL/1+IG achieves FID=1.34, a large margin over all of these methods. Combined with CFG, LightningDiT-XL/1+IG achieves the current state-of-the-art FID of 1.19.
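
The sampling-time step described above, extrapolating between the intermediate and deep-layer outputs, can be sketched as follows. The abstract does not give the exact rule, so the CFG-style linear form used here (start from the intermediate output and move past the deep output by a guidance scale) is an assumption, as are the function name and the toy values.

```python
def extrapolate(deep, intermediate, scale):
    """Linear extrapolation from the intermediate output through the deep one.

    Assumed form: intermediate + scale * (deep - intermediate).
    scale = 1.0 reproduces the deep output; scale > 1.0 moves further away
    from the weaker intermediate prediction.
    """
    return [i + scale * (d - i) for d, i in zip(deep, intermediate)]


# Made-up per-element predictions from the two heads at one sampling step.
deep = [0.8, -0.2, 0.5]
intermediate = [0.6, 0.0, 0.5]
guided = extrapolate(deep, intermediate, scale=1.5)
```

Under this assumed form, elements where the two heads already agree are left unchanged, and the intermediate head plays the role that the unconditional branch plays in CFG, without a second network.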

[107] PointRAFT: 3D deep learning for high-throughput prediction of potato tuber weight from partial point clouds

Pieter M. Blok,Haozhou Wang,Hyun Kwon Suh,Peicheng Wang,James Burridge,Wei Guo

Main category: cs.CV

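TL;DR: This paper proposes PointRAFT, a high-throughput point cloud regression network that predicts potato tuber weight directly from partial point clouds captured by RGB-D cameras, avoiding the weight underestimation caused by self-occlusion.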

Details Motivation: Point clouds captured by RGB-D cameras during harvesting suffer from self-occlusion, so the 3D information about each tuber is incomplete and weight is systematically underestimated; a method that estimates weight accurately from partial point clouds is needed. Method: PointRAFT adds a tuber height embedding as an extra geometric cue and regresses weight directly from raw partial point clouds without full 3D reconstruction; it is trained and evaluated on a large dataset collected under real harvesting conditions. Result: On a test set of 5,254 point clouds, PointRAFT achieves a mean absolute error of 12.0 g and a root mean squared error of 17.2 g, substantially outperforming linear regression and PointNet++ baselines; a single inference takes 6.3 ms, supporting up to 150 tubers per second. Conclusion: PointRAFT estimates potato weight from partial point clouds efficiently and accurately, meets the throughput requirements of commercial harvesters, and may extend to other 3D phenotyping and robotic perception tasks. Abstract: Potato yield is a key indicator for optimizing cultivation practices in agriculture. Potato yield can be estimated on harvesters using RGB-D cameras, which capture three-dimensional (3D) information of individual tubers moving along the conveyor belt. However, point clouds reconstructed from RGB-D images are incomplete due to self-occlusion, leading to systematic underestimation of tuber weight. To address this, we introduce PointRAFT, a high-throughput point cloud regression network that directly predicts continuous 3D shape properties, such as tuber weight, from partial point clouds. Rather than reconstructing full 3D geometry, PointRAFT infers target values directly from raw 3D data. Its key architectural novelty is an object height embedding that incorporates tuber height as an additional geometric cue, improving weight prediction under practical harvesting conditions. PointRAFT was trained and evaluated on 26,688 partial point clouds collected from 859 potato tubers across four cultivars and three growing seasons on an operational harvester in Japan. On a test set of 5,254 point clouds from 172 tubers, PointRAFT achieved a mean absolute error of 12.0 g and a root mean squared error of 17.2 g, substantially outperforming a linear regression baseline and a standard PointNet++ regression network. With an average inference time of 6.3 ms per point cloud, PointRAFT supports processing rates of up to 150 tubers per second, meeting the high-throughput requirements of commercial potato harvesters.
Beyond potato weight estimation, PointRAFT provides a versatile regression network applicable to a wide range of 3D phenotyping and robotic perception tasks. The code, network weights, and a subset of the dataset are publicly available at https://github.com/pieterblok/pointraft.git.
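
The error metrics and the throughput figure in the abstract can be checked with a few lines of arithmetic. The weights below are made-up values, not data from the paper, and the rate calculation assumes one point cloud per tuber processed sequentially with no batching.

```python
import math


def mae(pred, true):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)


def rmse(pred, true):
    """Root mean squared error; penalizes large misses more than MAE."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))


def max_rate_per_second(latency_ms):
    """Upper bound on sequential throughput for a given per-item latency."""
    return 1000.0 / latency_ms


# Made-up tuber weights in grams, for illustration only.
predicted_g = [100.0, 152.0, 198.0, 260.0]
measured_g = [110.0, 150.0, 190.0, 250.0]
error_mae = mae(predicted_g, measured_g)
error_rmse = rmse(predicted_g, measured_g)

# 6.3 ms per point cloud is the latency quoted in the abstract.
rate = max_rate_per_second(6.3)
```

At 6.3 ms per point cloud the sequential ceiling is about 158.7 items per second, which is consistent with the quoted "up to 150 tubers per second".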

[108] CorGi: Contribution-Guided Block-Wise Interval Caching for Training-Free Acceleration of Diffusion Transformers

Yonglak Son,Suhyeok Kim,Seungryong Kim,Young Geun Kim

Main category: cs.CV

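TL;DR: Proposes CorGi and CorGi+, which use contribution-guided block-wise caching to substantially accelerate DiT inference without sacrificing generation quality.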

Details Motivation: DiT performs well in image generation, but its iterative denoising process is computationally heavy and involves redundant computation across steps. Method: CorGi is a training-free inference acceleration framework that estimates the contribution of each transformer block, caches low-contribution blocks, and reuses them in later steps; for text-to-image tasks, CorGi+ additionally uses cross-attention maps to identify salient tokens and applies partial attention updates. Result: On state-of-the-art DiT models, CorGi and CorGi+ achieve up to 2.0x speedup on average while preserving high generation quality. Conclusion: The CorGi family effectively reduces redundant computation in DiT inference, offering a practical approach to efficient diffusion transformer inference. Abstract: Diffusion transformer (DiT) achieves remarkable performance in visual generation, but its iterative denoising process combined with larger capacity leads to a high inference cost. Recent works have demonstrated that the iterative denoising process of DiT models involves substantial redundant computation across steps. To effectively reduce the redundant computation in DiT, we propose CorGi (Contribution-Guided Block-Wise Interval Caching), a training-free DiT inference acceleration framework that selectively reuses the outputs of transformer blocks in DiT across denoising steps. CorGi caches low-contribution blocks and reuses them in later steps within each interval to reduce redundant computation while preserving generation quality. For text-to-image tasks, we further propose CorGi+, which leverages per-block cross-attention maps to identify salient tokens and applies partial attention updates to protect important object details. Evaluation on the state-of-the-art DiT models demonstrates that CorGi and CorGi+ achieve up to 2.0x speedup on average, while preserving high generation quality.
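
The interval caching idea above can be sketched on a toy denoising loop. Everything specific here is an assumption for illustration: the contribution scores are supplied by hand (the abstract does not say how CorGi measures contribution), blocks are modeled as scalar residual functions, and `keep_ratio` is a hypothetical parameter.

```python
def run_with_interval_caching(blocks, contributions, x0, num_steps, interval, keep_ratio):
    """Toy loop: low-contribution blocks are recomputed only on the first step
    of each interval and served from a cache on the remaining steps."""
    n_keep = max(1, round(len(blocks) * keep_ratio))
    ranked = sorted(range(len(blocks)), key=lambda i: contributions[i], reverse=True)
    always_compute = set(ranked[:n_keep])  # high-contribution blocks
    cache = {}
    calls = 0
    x = x0
    for step in range(num_steps):
        refresh = step % interval == 0  # first step of an interval
        for i, block in enumerate(blocks):
            if refresh or i in always_compute:
                delta = block(x)  # residual update from this block
                calls += 1
                cache[i] = delta
            else:
                delta = cache[i]  # reuse the cached residual
            x = x + delta
    return x, calls


# Four toy "blocks"; blocks 1 and 3 barely depend on the input.
blocks = [lambda x: 0.1 * x, lambda x: 0.01, lambda x: -0.05 * x, lambda x: 0.02]
contributions = [0.9, 0.1, 0.7, 0.2]  # hand-picked, hypothetical scores

x_cached, calls_cached = run_with_interval_caching(
    blocks, contributions, 1.0, num_steps=8, interval=4, keep_ratio=0.5)
x_full, calls_full = run_with_interval_caching(
    blocks, contributions, 1.0, num_steps=8, interval=1, keep_ratio=0.5)
```

With these settings the cached run makes 20 block calls instead of 32. The result matches the full run only because the cached toy blocks return constants; in a real model, reuse is an approximation whose error depends on how slowly the cached block's output changes across steps.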

[109] Medical Image Classification on Imbalanced Data Using ProGAN and SMA-Optimized ResNet: Application to COVID-19

Sina Jahromi,Farshid Hajati,Alireza Rezaee,Javaher Nourian

Main category: cs.CV

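TL;DR: Proposes a model based on a progressive generative adversarial network and multi-objective optimization that addresses class imbalance in medical image classification through synthetic data augmentation and a weighted fusion strategy, achieving strong classification accuracy on a COVID-19 chest X-ray dataset.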

Details Motivation: Medical image classification suffers from significant class imbalance, which becomes even more pronounced during pandemics and limits the effectiveness of AI methods for disease detection. Method: A progressive generative adversarial network generates synthetic data, which is combined with real data using a weighted approach; a multi-objective, population-based meta-heuristic algorithm optimizes the hyperparameters of the deep classifier. Result: On a large, imbalanced COVID-19 chest X-ray dataset, the model outperforms existing methods, reaching 95.5% accuracy on 4-class and 98.5% on 2-class classification. Conclusion: The model handles the data imbalance that pandemics cause in medical imaging and significantly improves classification performance. Abstract: The challenge of imbalanced data is prominent in medical image classification. This challenge arises when there is a significant disparity in the number of images belonging to a particular class, such as the presence or absence of a specific disease, as compared to the number of images belonging to other classes. This issue is especially notable during pandemics, which may result in an even more significant imbalance in the dataset. Researchers have employed various approaches in recent years to detect COVID-19 infected individuals accurately and quickly, with artificial intelligence and machine learning algorithms at the forefront. However, the lack of sufficient and balanced data remains a significant obstacle to these methods. This study addresses the challenge by proposing a progressive generative adversarial network to generate synthetic data to supplement the real ones. The proposed method suggests a weighted approach to combine synthetic data with real ones before inputting it into a deep network classifier. A multi-objective meta-heuristic population-based optimization algorithm is employed to optimize the hyper-parameters of the classifier. The proposed model exhibits superior cross-validated metrics compared to existing methods when applied to a large and imbalanced chest X-ray image dataset of COVID-19. The proposed model achieves 95.5% and 98.5% accuracy for 4-class and 2-class imbalanced classification problems, respectively. The successful experimental outcomes demonstrate the effectiveness of the proposed model in classifying medical images using imbalanced data during pandemics.

[110] ARM: A Learnable, Plug-and-Play Module for CLIP-based Open-vocabulary Semantic Segmentation

Ziquan Liu,Zhewei Zhu,Xuyang Shi

Main category: cs.CV

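TL;DR: Proposes the Attention Refinement Module (ARM), a lightweight learnable module that adaptively fuses CLIP's hierarchical features and significantly improves open-vocabulary semantic segmentation without retraining.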

Details Motivation: Existing methods rely on costly external models or static heuristics and fail to exploit CLIP's pixel-level detail, which limits training-free open-vocabulary semantic segmentation. Method: A semantically guided cross-attention block uses deep features to guide the refinement of shallow features and is followed by self-attention; trained once on a general-purpose dataset, the module works as a plug-and-play component for various frameworks. Result: ARM consistently improves baseline performance on multiple benchmarks with negligible inference overhead and no task-specific fine-tuning. Conclusion: ARM establishes an efficient, general paradigm for training-free OVSS and shows that the internal potential of a pretrained model can be unlocked through learning. Abstract: Open-vocabulary semantic segmentation (OVSS) is fundamentally hampered by the coarse, image-level representations of CLIP, which lack precise pixel-level details. Existing training-free methods attempt to resolve this by either importing priors from costly external foundation models (e.g., SAM, DINO) or by applying static, hand-crafted heuristics to CLIP's internal features. These approaches are either computationally expensive or sub-optimal. We propose the Attention Refinement Module (ARM), a lightweight, learnable module that effectively unlocks and refines CLIP's internal potential. Unlike static-fusion methods, ARM learns to adaptively fuse hierarchical features. It employs a semantically-guided cross-attention block, using robust deep features (K, V) to select and refine detail-rich shallow features (Q), followed by a self-attention block. The key innovation lies in a "train once, use anywhere" paradigm. Trained once on a general-purpose dataset (e.g., COCO-Stuff), ARM acts as a universal plug-and-play post-processor for diverse training-free frameworks. Extensive experiments show that ARM consistently boosts baseline performance on multiple benchmarks with negligible inference overhead, establishing an efficient and effective paradigm for training-free OVSS.

[111] Mirage: One-Step Video Diffusion for Photorealistic and Coherent Asset Editing in Driving Scenes

Shuyun Wang,Haiyang Sun,Bing Wang,Hangjun Ye,Xin Yu

Main category: cs.CV

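TL;DR: This paper proposes Mirage, a one-step video diffusion model for high-quality, temporally coherent asset editing in autonomous driving scenes; it combines 2D encoder features with a 3D decoder and a two-stage alignment strategy to produce photorealistic video edits.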

Details Motivation: Existing video object editing methods fall short in visual fidelity and temporal coherence and cannot meet the demand of vision-centric autonomous driving systems for high-quality training data. Method: Mirage builds on a text-to-video diffusion prior and injects temporally agnostic latents from a pretrained 2D encoder into the 3D decoder to restore spatial detail while preserving causal structure; a two-stage data alignment strategy (coarse 3D alignment plus fine 2D refinement) mitigates the pose misalignment caused by mismatched Gaussian distributions. Result: Experiments show that Mirage achieves high realism and temporal consistency across diverse editing scenarios and generalizes to other video-to-video translation tasks. Conclusion: Mirage resolves the trade-off between spatial fidelity and temporal coherence in existing methods and provides a reliable, scalable video editing approach for autonomous driving data augmentation. Abstract: Vision-centric autonomous driving systems rely on diverse and scalable training data to achieve robust performance. While video object editing offers a promising path for data augmentation, existing methods often struggle to maintain both high visual fidelity and temporal coherence. In this work, we propose Mirage, a one-step video diffusion model for photorealistic and coherent asset editing in driving scenes. Mirage builds upon a text-to-video diffusion prior to ensure temporal consistency across frames. However, 3D causal variational autoencoders often suffer from degraded spatial fidelity due to compression, and directly passing 3D encoder features to decoder layers breaks temporal causality. To address this, we inject temporally agnostic latents from a pretrained 2D encoder into the 3D decoder to restore detail while preserving causal structures. Furthermore, because scene objects and inserted assets are optimized under different objectives, their Gaussians exhibit a distribution mismatch that leads to pose misalignment. To mitigate this, we introduce a two-stage data alignment strategy combining coarse 3D alignment and fine 2D refinement, thereby improving alignment and providing cleaner supervision. Extensive experiments demonstrate that Mirage achieves high realism and temporal consistency across diverse editing scenarios. Beyond asset editing, Mirage can also generalize to other video-to-video translation tasks, serving as a reliable baseline for future research. Our code is available at https://github.com/wm-research/mirage.

[112] MotivNet: Evolving Meta-Sapiens into an Emotionally Intelligent Foundation Model

Rahul Medicharla,Alper Yilmaz

Main category: cs.CV

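TL;DR: This paper proposes MotivNet, a general-purpose facial expression recognition model built on the Meta-Sapiens backbone that achieves competitive performance across datasets without cross-domain training and generalizes well to real-world data.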

Details Motivation: Existing facial expression recognition (FER) models generalize poorly to diverse data, which limits their use in real-world settings. Although complex architectures have been proposed, they still require cross-domain training, which conflicts with the needs of practical applications. Method: Using Sapiens, a human vision foundation model with strong generalization, as the backbone, the authors treat FER as one of its downstream tasks, propose MotivNet, and assess its viability against three criteria: benchmark performance, model similarity, and data similarity. Result: MotivNet performs well on multiple datasets, generalizes across domains without cross-domain training, and meets the three criteria for a Sapiens downstream task. Conclusion: MotivNet confirms that FER is viable as a Sapiens downstream task, making FER more applicable in real-world settings and more attractive as a research topic. Abstract: In this paper, we introduce MotivNet, a generalizable facial emotion recognition model for robust real-world application. Current state-of-the-art FER models tend to have weak generalization when tested on diverse data, leading to deteriorated performance in the real world and hindering FER as a research domain. Though researchers have proposed complex architectures to address this generalization issue, they require training cross-domain to obtain generalizable results, which is inherently contradictory for real-world application. Our model, MotivNet, achieves competitive performance across datasets without cross-domain training by using Meta-Sapiens as a backbone. Sapiens is a human vision foundational model with state-of-the-art generalization in the real world through large-scale pretraining of a Masked Autoencoder. We propose MotivNet as an additional downstream task for Sapiens and define three criteria to evaluate MotivNet's viability as a Sapiens task: benchmark performance, model similarity, and data similarity. Throughout this paper, we describe the components of MotivNet, our training approach, and our results showing MotivNet is generalizable across domains. We demonstrate that MotivNet can be benchmarked against existing SOTA models and meets the listed criteria, validating MotivNet as a Sapiens downstream task, and making FER more incentivizing for in-the-wild application. The code is available at https://github.com/OSUPCVLab/EmotionFromFaceImages.

[113] MambaSeg: Harnessing Mamba for Accurate and Efficient Image-Event Semantic Segmentation

Fuqiang Gu,Yuanke Li,Xianlei Long,Kangping Ji,Chao Chen,Qingyi Gu,Zhenliang Ni

Main category: cs.CV

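TL;DR: This paper proposes MambaSeg, an RGB-event multimodal semantic segmentation framework built on dual-branch Mamba encoders; its Dual-Dimensional Interaction Module (DDIM) fuses the two modalities along spatial and temporal dimensions efficiently and at low computational cost, achieving state-of-the-art performance on the DDD17 and DSEC datasets.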

Details Motivation: Conventional RGB methods degrade under fast motion, low light, or high dynamic range, while event cameras offer advantages in those conditions but lack color and texture information; existing fusion methods are computationally expensive and ignore the temporal dynamics of event streams. Method: Dual-branch Mamba encoders process RGB images and event streams separately, and a Dual-Dimensional Interaction Module (DDIM), composed of a Cross-Spatial Interaction Module (CSIM) and a Cross-Temporal Interaction Module (CTIM), performs fine-grained fusion along the spatial and temporal dimensions. Result: State-of-the-art segmentation performance on the DDD17 and DSEC datasets with significantly lower computational cost. Conclusion: MambaSeg exploits the complementarity of RGB and event data, and its joint spatio-temporal fusion improves the efficiency and robustness of multimodal semantic segmentation, making it suitable for real-time applications such as autonomous driving. Abstract: Semantic segmentation is a fundamental task in computer vision with wide-ranging applications, including autonomous driving and robotics. While RGB-based methods have achieved strong performance with CNNs and Transformers, their effectiveness degrades under fast motion, low-light, or high dynamic range conditions due to limitations of frame cameras. Event cameras offer complementary advantages such as high temporal resolution and low latency, yet lack color and texture, making them insufficient on their own. To address this, recent research has explored multimodal fusion of RGB and event data; however, many existing approaches are computationally expensive and focus primarily on spatial fusion, neglecting the temporal dynamics inherent in event streams. In this work, we propose MambaSeg, a novel dual-branch semantic segmentation framework that employs parallel Mamba encoders to efficiently model RGB images and event streams. To reduce cross-modal ambiguity, we introduce the Dual-Dimensional Interaction Module (DDIM), comprising a Cross-Spatial Interaction Module (CSIM) and a Cross-Temporal Interaction Module (CTIM), which jointly perform fine-grained fusion along both spatial and temporal dimensions. This design improves cross-modal alignment, reduces ambiguity, and leverages the complementary properties of each modality. Extensive experiments on the DDD17 and DSEC datasets demonstrate that MambaSeg achieves state-of-the-art segmentation performance while significantly reducing computational cost, showcasing its promise for efficient, scalable, and robust multimodal perception.

[114] Physically-Grounded Manifold Projection with Foundation Priors for Metal Artifact Reduction in Dental CBCT

Zhi Li,Yaqi Wang,Bingtao Ma,Yifan Zhang,Huiyu Zhou,Shuai Wang

Main category: cs.CV

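TL;DR: Proposes Physically-Grounded Manifold Projection (PGMP), a framework for metal artifact reduction in dental CBCT that combines high-fidelity simulated data, a deterministic restoration model, and semantic-structural alignment for efficient, clinically reliable artifact removal.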

Details Motivation: Existing deep learning methods for metal artifact reduction suffer either from blurring caused by regression to the mean or, in the unsupervised case, from structural hallucinations; diffusion models are realistic but sample slowly, making them hard to use clinically. Method: 1) An Anatomically-Adaptive Physics Simulation (AAPS) pipeline generates high-quality training data; 2) the DMP-Former network adopts the direct x-prediction paradigm and models restoration as a deterministic manifold projection, removing artifacts in a single forward pass; 3) a Semantic-Structural Alignment (SSA) module uses priors from a medical foundation model to keep the result anatomically plausible. Result: PGMP outperforms state-of-the-art methods on synthetic and multi-center clinical data, especially on unseen anatomy, with fast inference and no iterative sampling. Conclusion: Through physics grounding and deterministic modeling, PGMP balances speed, realism, and clinical reliability, offering a workable MAR solution for practical dental CBCT. Abstract: Metal artifacts in Dental CBCT severely obscure anatomical structures, hindering diagnosis. Current deep learning for Metal Artifact Reduction (MAR) faces limitations: supervised methods suffer from spectral blurring due to "regression-to-the-mean", while unsupervised ones risk structural hallucinations. Denoising Diffusion Models (DDPMs) offer realism but rely on slow, stochastic iterative sampling, unsuitable for clinical use. To resolve this, we propose the Physically-Grounded Manifold Projection (PGMP) framework. First, our Anatomically-Adaptive Physics Simulation (AAPS) pipeline synthesizes high-fidelity training pairs via Monte Carlo spectral modeling and patient-specific digital twins, bridging the synthetic-to-real gap. Second, our DMP-Former adapts the Direct x-Prediction paradigm, reformulating restoration as a deterministic manifold projection to recover clean anatomy in a single forward pass, eliminating stochastic sampling. Finally, a Semantic-Structural Alignment (SSA) module anchors the solution using priors from medical foundation models (MedDINOv3), ensuring clinical plausibility. Experiments on synthetic and multi-center clinical datasets show PGMP outperforms state-of-the-art methods on unseen anatomy, setting new benchmarks in efficiency and diagnostic reliability. Code and data: https://github.com/ricoleehduu/PGMP

[115] Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation

Zhe Huang,Hao Wen,Aiming Hao,Bingze Song,Meiqi Wu,Jiahong Wu,Xiangxiang Chu,Sheng Lu,Haoqian Wang

Main category: cs.CV

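TL;DR: This paper proposes the DualityForge framework and the DualityVidQA dataset, which use controllable diffusion-based video editing to generate counterfactual videos with high-quality QA pairs; combined with the DNA-Train training method, they substantially reduce MLLM hallucinations on counterfactual videos.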

Details Motivation: Multimodal large language models (MLLMs) over-rely on language priors in video understanding, which produces visually ungrounded hallucinations on counterfactual videos that defy common sense; the problem is hard to fix because counterfactual data is scarce. Method: DualityForge uses controllable diffusion-based video editing to turn real videos into counterfactual scenarios and, with structured contextual information, automatically produces original-edited video pairs and matching high-quality QA pairs; the large-scale DualityVidQA dataset is built for contrastive training; DNA-Train is a two-stage training method whose reinforcement learning stage applies pairwise $\ell_1$ advantage normalization to stabilize optimization. Result: On DualityVidQA-Test, the method yields a 24.0% relative improvement over the Qwen2.5-VL-7B baseline on counterfactual-video hallucination, with significant gains on both hallucination and general-purpose benchmarks. Conclusion: The method mitigates language-prior dependence and visual hallucination in MLLM video understanding and generalizes well; the dataset and code will be open-sourced. Abstract: Multimodal Large Language Models (MLLMs) have made remarkable progress in video understanding. However, they suffer from a critical vulnerability: an over-reliance on language priors, which can lead to visually ungrounded hallucinations, especially when processing counterfactual videos that defy common sense. This limitation, stemming from the intrinsic data imbalance between text and video, is challenging to address due to the substantial cost of collecting and annotating counterfactual data. To address this, we introduce DualityForge, a novel counterfactual data synthesis framework that employs controllable, diffusion-based video editing to transform real-world videos into counterfactual scenarios. By embedding structured contextual information into the video editing and QA generation processes, the framework automatically produces high-quality QA pairs together with original-edited video pairs for contrastive training. Based on this, we build DualityVidQA, a large-scale video dataset designed to reduce MLLM hallucinations. In addition, to fully exploit the contrastive nature of our paired data, we propose Duality-Normalized Advantage Training (DNA-Train), a two-stage SFT-RL training regime where the RL phase applies pair-wise $\ell_1$ advantage normalization, thereby enabling a more stable and efficient policy optimization. Experiments on DualityVidQA-Test demonstrate that our method substantially reduces model hallucinations on counterfactual videos, yielding a relative improvement of 24.0% over the Qwen2.5-VL-7B baseline.
Moreover, our approach achieves significant gains across both hallucination and general-purpose benchmarks, indicating strong generalization capability. We will open-source our dataset and code.

[116] LiftProj: Space Lifting and Projection-Based Panorama Stitching

Yuan Jia,Ruimin Wu,Rui Song,Jiaojiao Li,Bin Song

Main category: cs.CV

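TL;DR: Proposes a panorama stitching framework based on 3D space lifting: images are lifted into 3D space for global fusion and then mapped by an equidistant cylindrical projection into a geometrically consistent 360° panorama, reducing the geometric distortion and ghosting that traditional methods produce in complex scenes.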

Details Motivation: Traditional 2D stitching methods produce ghosting, bending, and stretching distortions on 3D scenes with multiple depth layers and occlusions, especially in multi-view accumulation and 360° closed-loop stitching, so a more robust solution is needed. Method: Input images are lifted into a dense 3D point representation in a unified coordinate system and fused across views using confidence scores; a unified projection center is set in 3D space, an equidistant cylindrical projection maps the fused data onto the panoramic plane, and hole filling in the canvas domain restores texture continuity. Result: The method significantly reduces geometric distortion and ghosting artifacts in scenes with large parallax and complex occlusions, producing more natural and geometrically consistent panoramas. Conclusion: The framework shifts stitching from 2D warping to 3D consistency, can flexibly incorporate different 3D lifting and completion modules, and improves the robustness and quality of panorama stitching in complex real-world scenes. Abstract: Traditional image stitching techniques have predominantly utilized two-dimensional homography transformations and mesh warping to achieve alignment on a planar surface. While effective for scenes that are approximately coplanar or exhibit minimal parallax, these approaches often result in ghosting, structural bending, and stretching distortions in non-overlapping regions when applied to real three-dimensional scenes characterized by multiple depth layers and occlusions. Such challenges are exacerbated in multi-view accumulations and 360° closed-loop stitching scenarios. In response, this study introduces a spatially lifted panoramic stitching framework that initially elevates each input image into a dense three-dimensional point representation within a unified coordinate system, facilitating global cross-view fusion augmented by confidence metrics. Subsequently, a unified projection center is established in three-dimensional space, and an equidistant cylindrical projection is employed to map the fused data onto a single panoramic manifold, thereby producing a geometrically consistent 360° panoramic layout. Finally, hole filling is conducted within the canvas domain to address unknown regions revealed by viewpoint transitions, restoring continuous texture and semantic coherence. This framework reconceptualizes stitching from a two-dimensional warping paradigm to a three-dimensional consistency paradigm and is designed to flexibly incorporate various three-dimensional lifting and completion modules.
Experimental evaluations demonstrate that the proposed method substantially mitigates geometric distortions and ghosting artifacts in scenarios involving significant parallax and complex occlusions, yielding panoramic results that are more natural and consistent.
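
The projection step described in this abstract can be illustrated with a minimal sketch. It assumes the standard equirectangular (equidistant cylindrical) mapping with +z as the forward direction and +y as up; the function name, axis convention, and canvas size are illustrative assumptions, not details taken from the paper.

```python
import math

def project_equirectangular(point, center, width, height):
    """Map a 3D point to (u, v) pixel coordinates on an equirectangular canvas.

    Assumed convention (not from the paper): +z is forward, +y is up.
    """
    x = point[0] - center[0]
    y = point[1] - center[1]
    z = point[2] - center[2]
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        raise ValueError("point coincides with the projection center")
    lon = math.atan2(x, z)   # longitude in (-pi, pi], 0 means straight ahead
    lat = math.asin(y / r)   # latitude in [-pi/2, pi/2], positive means up
    u = (lon / (2.0 * math.pi) + 0.5) * width   # longitude spans the full width
    v = (0.5 - lat / math.pi) * height          # latitude spans the full height
    return u, v

# A point straight ahead of the projection center lands in the canvas center.
u, v = project_equirectangular((0.0, 0.0, 1.0), (0.0, 0.0, 0.0), 2048, 1024)
```

Because longitude and latitude map linearly to pixel coordinates, every fused 3D point gets a unique canvas position relative to the single projection center, which is what makes the 360° layout geometrically consistent.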

[117] One-shot synthesis of rare gastrointestinal lesions improves diagnostic accuracy and clinical training

Jia Yu,Yan Zhu,Peiyao Fu,Tianyi Chen,Zhihua Wang,Fei Wu,Quanlin Li,Pinghong Zhou,Shuo Wang,Xian Yang

Main category: cs.CV

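TL;DR: EndoRare is a retraining-free generative framework that produces diverse, high-fidelity images of rare gastrointestinal lesions from a single reference image, used to strengthen AI models and improve the diagnostic performance of novice endoscopists.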

Details Motivation: Rare gastrointestinal lesions are seldom seen in routine endoscopy, which limits the data available for developing reliable artificial intelligence (AI) models and training novice clinicians. Method: The EndoRare framework uses language-guided concept disentanglement to separate pathognomonic lesion features from non-diagnostic attributes; the former are encoded as a learnable prototype embedding while the latter are varied to ensure diversity, enabling diverse lesion generation from a single image. Result: The framework was validated on four rare pathologies; experts judged the generated images clinically plausible; used for data augmentation, they significantly improved the true positive rate of downstream AI classifiers at low false-positive rates; a blinded reader study showed novice endoscopists gaining 0.400 in recall and 0.267 in precision. Conclusion: EndoRare offers a practical, data-efficient way to address data scarcity for rare diseases in computer-aided diagnosis and clinical education. Abstract: Rare gastrointestinal lesions are infrequently encountered in routine endoscopy, restricting the data available for developing reliable artificial intelligence (AI) models and training novice clinicians. Here we present EndoRare, a one-shot, retraining-free generative framework that synthesizes diverse, high-fidelity lesion exemplars from a single reference image. By leveraging language-guided concept disentanglement, EndoRare separates pathognomonic lesion features from non-diagnostic attributes, encoding the former into a learnable prototype embedding while varying the latter to ensure diversity. We validated the framework across four rare pathologies (calcifying fibrous tumor, juvenile polyposis syndrome, familial adenomatous polyposis, and Peutz-Jeghers syndrome). Synthetic images were judged clinically plausible by experts and, when used for data augmentation, significantly enhanced downstream AI classifiers, improving the true positive rate at low false-positive rates. Crucially, a blinded reader study demonstrated that novice endoscopists exposed to EndoRare-generated cases achieved a 0.400 increase in recall and a 0.267 increase in precision. These results establish a practical, data-efficient pathway to bridge the rare-disease gap in both computer-aided diagnostics and clinical education.

[118] Virtual-Eyes: Quantitative Validation of a Lung CT Quality-Control Pipeline for Foundation-Model Cancer Risk Prediction

Md. Enamul Hoq,Linda Larson-Prior,Fred Prior

Main category: cs.CV

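TL;DR: Presents and validates Virtual-Eyes, a 16-bit image quality-control preprocessing pipeline for low-dose CT lung cancer screening. It markedly improves the performance and calibration of generalist foundation models (e.g., RAD-DINO) on slice-level and patient-level classification, but can hurt specialist models that rely on raw clinical context (e.g., Sybil, ResNet-18), showing that preprocessing affects model types differently.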

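Details Motivation: Robust preprocessing is rarely quantified in deep-learning pipelines for low-dose CT lung cancer screening. Existing models may rely on non-semantic features or shortcut learning in raw data, which hurts generalization. A clinically motivated quality-control preprocessing method is therefore needed to standardize inputs, along with a systematic evaluation of how it affects generalist foundation models versus specialist models. Method: The Virtual-Eyes pipeline enforces 512x512 in-plane resolution, rejects short or non-diagnostic series, and extracts a contiguous lung block using Hounsfield-unit filtering and bilateral lung-coverage scoring while preserving native 16-bit precision. Using the NLST dataset (765 patients), slice-level embeddings are computed with frozen RAD-DINO and Merlin encoders, and leakage-free MLP heads are trained for patient-level classification; Sybil and ResNet-18 are also evaluated under Raw versus Virtual-Eyes inputs without backbone retraining. Result: Virtual-Eyes raises RAD-DINO slice-level AUC from 0.576 to 0.610 and patient-level AUC from 0.646 to 0.683 with mean pooling and from 0.619 to 0.735 with max pooling, with better calibration (Brier score 0.188 to 0.112). Sybil AUC falls from 0.886 to 0.837; the abstract groups ResNet-18 (AUC 0.571 to 0.596) with the models that degrade; Merlin shows limited transferability (AUC approximately 0.507 to 0.567) regardless of preprocessing. Conclusion: Anatomically targeted quality-control preprocessing can stabilize and improve generalist foundation models in low-dose CT analysis but may disrupt specialist models adapted to the raw clinical data distribution, so its effect on each model type should be evaluated carefully before deployment. Abstract: Robust preprocessing is rarely quantified in deep-learning pipelines for low-dose CT (LDCT) lung cancer screening. We develop and validate Virtual-Eyes, a clinically motivated 16-bit CT quality-control pipeline, and measure its differential impact on generalist foundation models versus specialist models. Virtual-Eyes enforces strict 512x512 in-plane resolution, rejects short or non-diagnostic series, and extracts a contiguous lung block using Hounsfield-unit filtering and bilateral lung-coverage scoring while preserving the native 16-bit grid. Using 765 NLST patients (182 cancer, 583 non-cancer), we compute slice-level embeddings from RAD-DINO and Merlin with frozen encoders and train leakage-free patient-level MLP heads; we also evaluate Sybil and a 2D ResNet-18 baseline under Raw versus Virtual-Eyes inputs without backbone retraining. Virtual-Eyes improves RAD-DINO slice-level AUC from 0.576 to 0.610 and patient-level AUC from 0.646 to 0.683 (mean pooling) and from 0.619 to 0.735 (max pooling), with improved calibration (Brier score 0.188 to 0.112). 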
In contrast, Sybil and ResNet-18 degrade under Virtual-Eyes (Sybil AUC 0.886 to 0.837; ResNet-18 AUC 0.571 to 0.596) with evidence of context dependence and shortcut learning, and Merlin shows limited transferability (AUC approximately 0.507 to 0.567) regardless of preprocessing. These results demonstrate that anatomically targeted QC can stabilize and improve generalist foundation-model workflows but may disrupt specialist models adapted to raw clinical context.
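
The pooling and calibration terms used in this abstract can be made concrete with a small sketch. It assumes pooling is applied to per-slice cancer probabilities, whereas the paper may pool slice embeddings before the MLP head; the helper names and toy numbers are hypothetical.

```python
def mean_pool(slice_probs):
    """Patient-level score as the average over slice-level scores."""
    return sum(slice_probs) / len(slice_probs)

def max_pool(slice_probs):
    """Patient-level score as the single most suspicious slice."""
    return max(slice_probs)

def brier_score(probs, labels):
    """Mean squared error between predicted probabilities and 0/1 labels (lower is better)."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

# Toy data (hypothetical): one cancer patient with a single suspicious slice,
# and one non-cancer patient with uniformly low scores.
patients = {"a": [0.1, 0.2, 0.9], "b": [0.1, 0.1, 0.1]}
labels = [1, 0]

mean_scores = [mean_pool(p) for p in patients.values()]
max_scores = [max_pool(p) for p in patients.values()]
brier_mean = brier_score(mean_scores, labels)
brier_max = brier_score(max_scores, labels)
```

In this toy case max pooling keeps the one high-scoring slice that mean pooling dilutes, which gives some intuition for why the two pooling rules can produce different patient-level AUC and Brier values.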

[119] UniAct: Unified Motion Generation and Action Streaming for Humanoid Robots

Nan Jiang,Zimo He,Wanhe Yu,Lexi Pang,Yunhao Li,Hongjie Li,Jieming Cui,Yuhan Li,Yizhou Wang,Yixin Zhu,Siyuan Huang

Main category: cs.CV

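TL;DR: UniAct is a two-stage framework that combines a fine-tuned MLLM with a causal streaming pipeline so that humanoid robots respond to multimodal instructions (such as language, music, and trajectories) in real time (latency < 500 ms), with a 19% higher success rate in zero-shot motion tracking.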

Details Motivation: Existing methods struggle to turn heterogeneous multimodal instructions (such as language, music, and trajectories) into stable, real-time whole-body humanoid motion, and do not combine cross-modal alignment with physical plausibility. Method: UniAct has two stages: a fine-tuned multimodal large language model (MLLM) interprets the multimodal input, and a causal streaming pipeline with a shared discrete FSQ codebook generates actions while constraining motion to a physically grounded manifold. Result: Validated on UniMoCap, the authors' 20-hour humanoid motion benchmark, UniAct improves the success rate of zero-shot tracking of imperfect reference motions by 19% with latency under 500 ms, and generalizes well. Conclusion: By unifying perception and control, UniAct responds quickly and accurately to multimodal instructions, a step toward responsive, general-purpose humanoid assistants. Abstract: A long-standing objective in humanoid robotics is the realization of versatile agents capable of following diverse multimodal instructions with human-level flexibility. Despite advances in humanoid control, bridging high-level multimodal perception with whole-body execution remains a significant bottleneck. Existing methods often struggle to translate heterogeneous instructions -- such as language, music, and trajectories -- into stable, real-time actions. Here we show that UniAct, a two-stage framework integrating a fine-tuned MLLM with a causal streaming pipeline, enables humanoid robots to execute multimodal instructions with sub-500 ms latency. By unifying inputs through a shared discrete codebook via FSQ, UniAct ensures cross-modal alignment while constraining motions to a physically grounded manifold. This approach yields a 19% improvement in the success rate of zero-shot tracking of imperfect reference motions. We validate UniAct on UniMoCap, our 20-hour humanoid motion benchmark, demonstrating robust generalization across diverse real-world scenarios. Our results mark a critical step toward responsive, general-purpose humanoid assistants capable of seamless interaction through unified perception and control.
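
The shared discrete codebook mentioned above relies on FSQ (finite scalar quantization). The sketch below shows the general idea under stated assumptions: each latent dimension is squashed with tanh and rounded to a small set of integer levels, and the per-dimension codes are combined into one token index. The level counts and function names are illustrative, only odd level counts are handled, and nothing here reflects UniAct's actual configuration.

```python
import math

def fsq_quantize(z, levels):
    """Round each bounded latent dimension to one of `levels[i]` integer values."""
    codes = []
    for value, n_levels in zip(z, levels):
        if n_levels % 2 == 0:
            raise ValueError("this sketch only handles odd level counts")
        half = n_levels // 2
        # tanh bounds the value to (-1, 1); scaling by `half` gives codes in [-half, half].
        codes.append(round(half * math.tanh(value)))
    return codes

def fsq_index(codes, levels):
    """Combine per-dimension codes into a single integer token (mixed-radix encoding)."""
    index, basis = 0, 1
    for code, n_levels in zip(codes, levels):
        index += (code + n_levels // 2) * basis  # shift the code to 0..n_levels-1
        basis *= n_levels
    return index

levels = [5, 5, 5]                      # implicit codebook of 5 * 5 * 5 = 125 entries
codes = fsq_quantize([0.0, 10.0, -10.0], levels)
token = fsq_index(codes, levels)
```

Since the codebook is implied by the level grid rather than learned, inputs from different modalities that map to the same grid cell share the same token, which is one way a single codebook can support cross-modal alignment.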

[120] Robust Egocentric Referring Video Object Segmentation via Dual-Modal Causal Intervention

Haijing Liu,Zhiyuan Song,Hefeng Wu,Tao Pu,Keze Wang,Liang Lin

Main category: cs.CV

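TL;DR: Proposes CERES, a framework that uses causal intervention to address dataset bias and visual confounding in egocentric referring video object segmentation (Ego-RVOS), substantially improving performance.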

Details Motivation: On first-person video, existing methods are easily affected by skewed object-action pairings in datasets and by visual confounders such as rapid motion and frequent occlusion, which limits performance. Method: The CERES framework applies dual-modal causal intervention: backdoor adjustment removes dataset bias from the language representation, and front-door adjustment fuses semantic visual features with geometric depth information to mitigate visual confounding. The framework is a plug-in that adapts pre-trained RVOS backbones. Result: CERES achieves state-of-the-art performance on Ego-RVOS benchmarks, supporting the value of causal reasoning for model robustness. Conclusion: By introducing causal reasoning, CERES addresses key challenges of first-person video and shows the potential of causal methods for egocentric video understanding. Abstract: Egocentric Referring Video Object Segmentation (Ego-RVOS) aims to segment the specific object actively involved in a human action, as described by a language query, within first-person videos. This task is critical for understanding egocentric human behavior. However, achieving such segmentation robustly is challenging due to ambiguities inherent in egocentric videos and biases present in training data. Consequently, existing methods often struggle, learning spurious correlations from skewed object-action pairings in datasets and fundamental visual confounding factors of the egocentric perspective, such as rapid motion and frequent occlusions. To address these limitations, we introduce Causal Ego-REferring Segmentation (CERES), a plug-in causal framework that adapts strong, pre-trained RVOS backbones to the egocentric domain. CERES implements dual-modal causal intervention: applying backdoor adjustment principles to counteract language representation biases learned from dataset statistics, and leveraging front-door adjustment concepts to address visual confounding by intelligently integrating semantic visual features with geometric depth information guided by causal principles, creating representations more robust to egocentric distortions. Extensive experiments demonstrate that CERES achieves state-of-the-art performance on Ego-RVOS benchmarks, highlighting the potential of applying causal reasoning to build more reliable models for broader egocentric video understanding.
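
For reference, the two adjustment principles named in this abstract have standard textbook forms, shown below. The symbols are generic assumptions ($X$ input, $Y$ prediction, $Z$ an observed confounder, $M$ a mediator); how CERES instantiates them, for example which features play the role of $Z$ or $M$, is not specified in the abstract.

```latex
% Backdoor adjustment: condition on the confounder Z, then average over its marginal.
P(Y \mid \mathrm{do}(X = x)) = \sum_{z} P(Y \mid X = x, Z = z)\, P(Z = z)

% Front-door adjustment: route the effect through a mediator M when the
% confounder itself cannot be observed.
P(Y \mid \mathrm{do}(X = x)) = \sum_{m} P(M = m \mid X = x) \sum_{x'} P(Y \mid M = m, X = x')\, P(X = x')
```

The backdoor form needs the confounder to be observable (plausible for dataset statistics over object-action pairs), while the front-door form only needs a mediator, which fits confounders such as motion blur or occlusion that are hard to enumerate.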

[121] SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning

Yong Xien Chng,Tao Hu,Wenwen Tong,Xueheng Li,Jiandong Chen,Haojia Yu,Jiefan Lu,Hewei Guo,Hanming Deng,Chengjun Xie,Gao Huang,Dahua Lin,Lewei Lu

Main category: cs.CV

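TL;DR: Proposes SenseNova-MARS, a framework that uses reinforcement learning to interleave multimodal agentic reasoning with coordinated tool use, markedly improving vision-language models on knowledge-intensive and high-resolution image understanding tasks, and introduces a new benchmark, HR-MMSearch.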

Details Motivation: Existing vision-language models cannot seamlessly combine dynamic tool manipulation with continuous reasoning in complex scenarios, and fall short when external tools such as search and image cropping are required. Method: The SenseNova-MARS framework combines image search, text search, and image crop tools, and uses reinforcement learning (specifically the BN-GSPO algorithm) to interleave reasoning with tool use. Result: The framework reaches state-of-the-art performance on benchmarks including MMSearch and HR-MMSearch; SenseNova-MARS-8B scores 67.84 and 41.64 respectively, surpassing proprietary models such as Gemini-3-Flash and GPT-5. Conclusion: SenseNova-MARS advances agentic VLMs with tool-use capabilities and offers an effective, robust approach to complex visual tasks. Abstract: While Vision-Language Models (VLMs) can solve complex tasks through agentic reasoning, their capabilities remain largely constrained to text-oriented chain-of-thought or isolated tool invocation. They fail to exhibit the human-like proficiency required to seamlessly interleave dynamic tool manipulation with continuous reasoning, particularly in knowledge-intensive and visually complex scenarios that demand coordinated external tools such as search and image cropping. In this work, we introduce SenseNova-MARS, a novel Multimodal Agentic Reasoning and Search framework that empowers VLMs with interleaved visual reasoning and tool-use capabilities via reinforcement learning (RL). Specifically, SenseNova-MARS dynamically integrates the image search, text search, and image crop tools to tackle fine-grained and knowledge-intensive visual understanding challenges. In the RL stage, we propose the Batch-Normalized Group Sequence Policy Optimization (BN-GSPO) algorithm to improve the training stability and advance the model's ability to invoke tools and reason effectively. To comprehensively evaluate the agentic VLMs on complex visual tasks, we introduce the HR-MMSearch benchmark, the first search-oriented benchmark composed of high-resolution images with knowledge-intensive and search-driven questions. Experiments demonstrate that SenseNova-MARS achieves state-of-the-art performance on open-source search and fine-grained image understanding benchmarks. Specifically, on search-oriented benchmarks, SenseNova-MARS-8B scores 67.84 on MMSearch and 41.64 on HR-MMSearch, surpassing proprietary models such as Gemini-3-Flash and GPT-5. 
SenseNova-MARS represents a promising step toward agentic VLMs by providing effective and robust tool-use capabilities. To facilitate further research in this field, we will release all code, models, and datasets.

[122] Spatial-aware Vision Language Model for Autonomous Driving

Weijie Wei,Zhipeng Luo,Ling Feng,Venice Erin Liong

Main category: cs.CV

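TL;DR: Proposes LVLDrive, a framework that adds LiDAR point clouds to strengthen the 3D spatial understanding of vision-language models (VLMs) for autonomous driving, addressing the limits of image-only input in geometric reasoning and metric spatial perception.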

Details Motivation: Image-based vision-language models for autonomous driving rely on 2D visual cues and struggle with accurate metric spatial reasoning and geometric understanding, which affects driving safety and reliability. Method: The LVLDrive framework adds LiDAR point clouds as an extra input modality and uses a Gradual Fusion Q-Former to inject 3D features stably without disrupting the pre-trained VLM's knowledge; a spatial-aware question-answering (SA-QA) dataset is built to improve the model's 3D perception and reasoning. Result: Experiments on several autonomous-driving benchmarks show that LVLDrive outperforms vision-only VLM methods in scene understanding, metric spatial perception, and driving decision-making. Conclusion: Explicit 3D metric information (such as LiDAR) is essential for building reliable, safe VLM-based autonomous driving systems. Abstract: While Vision-Language Models (VLMs) show significant promise for end-to-end autonomous driving by leveraging the common sense embedded in language models, their reliance on 2D image cues for complex scene understanding and decision-making presents a critical bottleneck for safety and reliability. Current image-based methods struggle with accurate metric spatial reasoning and geometric inference, leading to unreliable driving policies. To bridge this gap, we propose LVLDrive (LiDAR-Vision-Language), a novel framework specifically designed to upgrade existing VLMs with robust 3D metric spatial understanding for autonomous driving by incorporating LiDAR point clouds as an extra input modality. A key challenge lies in mitigating the catastrophic disturbance introduced by disparate 3D data to the pre-trained VLMs. To this end, we introduce a Gradual Fusion Q-Former that incrementally injects LiDAR features, ensuring the stability and preservation of the VLM's existing knowledge base. Furthermore, we develop a spatial-aware question-answering (SA-QA) dataset to explicitly teach the model advanced 3D perception and reasoning capabilities. Extensive experiments on driving benchmarks demonstrate that LVLDrive achieves superior performance compared to vision-only counterparts across scene understanding, metric spatial perception, and reliable driving decision-making. Our work highlights the necessity of explicit 3D metric data for building trustworthy VLM-based autonomous systems.

[123] The Mechanics of CNN Filtering with Rectification

Liam Frija-Altrac,Matthew Toews

Main category: cs.CV

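TL;DR: Proposes elementary information mechanics, a new model for understanding convolutional filtering inspired by special relativity and quantum mechanics; an even/odd kernel decomposition and DCT-domain analysis link information processing in CNNs to the energy-momentum relation.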

Details Motivation: Inspired by modern physical theories, the work seeks a deeper understanding of information processing in convolutional neural networks from an energy-momentum perspective. Method: Convolution kernels are decomposed into orthogonal even and odd parts, whose properties are analyzed in the discrete cosine transform (DCT) spectral domain, including how low-frequency bases shape the structure of small filters. Result: Even kernels correspond to isotropic diffusion (analogous to potential energy), odd kernels displace the center of mass (analogous to kinetic energy), the speed of information displacement is linearly related to the share of odd kernel energy, and small filters are dominated by the DC component $\Sigma$ and the gradient components $\nabla$. Conclusion: The work draws an analogy between information processing in CNNs and the energy-momentum relation of relativistic physics, which the authors believe is the first demonstration of this link. Abstract: This paper proposes elementary information mechanics as a new model for understanding the mechanical properties of convolutional filtering with rectification, inspired by physical theories of special relativity and quantum mechanics. We consider kernels decomposed into orthogonal even and odd components. Even components cause image content to diffuse isotropically while preserving the center of mass, analogously to rest or potential energy with zero net momentum. Odd kernels cause directional displacement of the center of mass, analogously to kinetic energy with non-zero momentum. The speed of information displacement is linearly related to the ratio of odd vs total kernel energy. Even-Odd properties are analyzed in the spectral domain via the discrete cosine transform (DCT), where the structure of small convolutional filters (e.g. $3 \times 3$ pixels) is dominated by low-frequency bases, specifically the DC $\Sigma$ and gradient components $\nabla$, which define the fundamental modes of information propagation. To our knowledge, this is the first work demonstrating the link between information processing in generic CNNs and the energy-momentum relation, a cornerstone of modern relativistic physics.
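
The even/odd decomposition described above can be sketched in a few lines. The sketch assumes that "even" and "odd" refer to symmetry under point reflection about the kernel center, i.e. $k(-x) = \pm k(x)$, and that kernel energy means the sum of squared weights; both are assumptions about the paper's definitions, and the function names are hypothetical.

```python
def even_odd_split(kernel):
    """Split a square kernel into parts that are symmetric / antisymmetric under point reflection."""
    n = len(kernel)
    flipped = [[kernel[n - 1 - i][n - 1 - j] for j in range(n)] for i in range(n)]
    even = [[(kernel[i][j] + flipped[i][j]) / 2.0 for j in range(n)] for i in range(n)]
    odd = [[(kernel[i][j] - flipped[i][j]) / 2.0 for j in range(n)] for i in range(n)]
    return even, odd

def energy(kernel):
    """Sum of squared weights."""
    return sum(w * w for row in kernel for w in row)

def odd_energy_ratio(kernel):
    """Share of kernel energy carried by the odd part (0 = purely even, 1 = purely odd)."""
    _, odd = even_odd_split(kernel)
    return energy(odd) / energy(kernel)

sobel_x = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # a gradient filter: purely odd
box = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]          # an averaging filter: purely even
mixed = [[0, 0, 0], [0, 1, 1], [0, 0, 0]]        # identity plus a one-pixel shift

ratios = [odd_energy_ratio(k) for k in (sobel_x, box, mixed)]
```

The even and odd parts are orthogonal, so their energies add up to the total kernel energy, and the ratio computed here is the quantity the abstract relates linearly to the speed of information displacement.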

[124] DermaVQA-DAS: Dermatology Assessment Schema (DAS) & Datasets for Closed-Ended Question Answering & Segmentation in Patient-Generated Dermatology Images

Wen-wai Yim,Yujuan Fu,Asma Ben Abacha,Meliha Yetisgen,Noel Codella,Roberto Andres Novoa,Josep Malvehy

Main category: cs.CV

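TL;DR: Presents DermaVQA-DAS, an extension of the DermaVQA dataset that supports dermatological lesion segmentation and closed-ended question answering. It introduces the expert-designed Dermatology Assessment Schema (DAS) to capture clinically meaningful skin features in structured form, and releases the datasets and evaluation protocols to advance patient-centered dermatological vision-language modeling.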

Details Motivation: Most existing dermatology image datasets focus on dermatoscopic images and lack patient-authored queries and clinical context, which limits their use in patient-centered care. A dataset that combines clinical context with the patient's perspective is needed to support more practical diagnostic tools. Method: The Dermatology Assessment Schema (DAS) comprises 36 high-level and 27 fine-grained questions with answer options in English and Chinese. DermaVQA is extended on the basis of DAS into expert-annotated datasets for closed-ended QA and lesion segmentation, and several multimodal models are evaluated under different prompting strategies. Result: In segmentation, prompting strategy affects performance: the default prompt is best under Mean-of-Max and Mean-of-Mean aggregation, while an augmented prompt is best under majority-vote-based microscore evaluation (Jaccard index 0.395, Dice score 0.566). In QA, o3 has the highest accuracy (0.798), closely followed by GPT-4.1 (0.796), and Gemini-1.5-Pro is competitive within the Gemini family (0.783). Conclusion: By integrating the structured clinical assessment schema DAS, DermaVQA-DAS provides a new benchmark for patient-centered dermatological analysis, confirms that prompt design affects multimodal model performance, and supports open research on dermatological vision-language models. Abstract: Recent advances in dermatological image analysis have been driven by large-scale annotated datasets; however, most existing benchmarks focus on dermatoscopic images and lack patient-authored queries and clinical context, limiting their applicability to patient-centered care. To address this gap, we introduce DermaVQA-DAS, an extension of the DermaVQA dataset that supports two complementary tasks: closed-ended question answering (QA) and dermatological lesion segmentation. Central to this work is the Dermatology Assessment Schema (DAS), a novel expert-developed framework that systematically captures clinically meaningful dermatological features in a structured and standardized form. DAS comprises 36 high-level and 27 fine-grained assessment questions, with multiple-choice options in English and Chinese. Leveraging DAS, we provide expert-annotated datasets for both closed QA and segmentation and benchmark state-of-the-art multimodal models. For segmentation, we evaluate multiple prompting strategies and show that prompt design impacts performance: the default prompt achieves the best results under Mean-of-Max and Mean-of-Mean evaluation aggregation schemes, while an augmented prompt incorporating both patient query title and content yields the highest performance under majority-vote-based microscore evaluation, achieving a Jaccard index of 0.395 and a Dice score of 0.566 with BiomedParse. 
For closed-ended QA, overall performance is strong across models, with average accuracies ranging from 0.729 to 0.798; o3 achieves the best overall accuracy (0.798), closely followed by GPT-4.1 (0.796), while Gemini-1.5-Pro shows competitive performance within the Gemini family (0.783). We publicly release DermaVQA-DAS, the DAS schema, and evaluation protocols to support and accelerate future research in patient-centered dermatological vision-language modeling (https://osf.io/72rp3).
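
The two segmentation metrics reported above are closely related, which a short sketch makes explicit. It assumes binary masks flattened to lists of 0/1 values; the function names are illustrative and this is not the benchmark's evaluation code.

```python
def jaccard_and_dice(pred, truth):
    """Return (Jaccard index, Dice score) for two flat binary masks of equal length."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    if tp + fp + fn == 0:
        return 1.0, 1.0  # both masks empty: treat as perfect agreement
    jaccard = tp / (tp + fp + fn)
    dice = 2 * tp / (2 * tp + fp + fn)
    return jaccard, dice

def dice_from_jaccard(jaccard):
    """Dice and Jaccard computed from the same counts satisfy D = 2J / (1 + J)."""
    return 2 * jaccard / (1 + jaccard)

j, d = jaccard_and_dice([1, 1, 0, 0], [1, 0, 1, 0])
reported = dice_from_jaccard(0.395)
```

Plugging the reported Jaccard index of 0.395 into the identity gives a Dice score of about 0.566, which agrees with the reported figure and is what one would expect if both numbers come from the same pooled (micro) counts.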

[125] Forging Spatial Intelligence: A Roadmap of Multi-Modal Data Pre-Training for Autonomous Systems

Song Wang,Lingdong Kong,Xiaolu Liu,Hao Shi,Wentong Li,Jianke Zhu,Steven C. H. Hoi

Main category: cs.CV

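TL;DR: Presents a framework for multi-modal pre-training that integrates data from sensors such as cameras and LiDAR toward a unified form of Spatial Intelligence, with the aim of advancing perception and planning in autonomous systems such as self-driving vehicles and drones.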

Details Motivation: Foundation models perform well in single-modal settings, but fusing multi-modal sensor data (such as vision and LiDAR) into a unified spatial understanding remains difficult and calls for a systematic framework. Method: The paper formulates a unified taxonomy of multi-modal pre-training paradigms, from single-modality baselines to unified frameworks that learn holistic representations, and examines the integration of textual inputs and occupancy representations to support open-world perception and planning. Result: The paper analyzes the interplay between sensor characteristics, learning strategies, and platform-specific datasets, and organizes pre-training paradigms that target tasks such as 3D object detection and semantic occupancy prediction. Conclusion: The paper identifies critical bottlenecks such as computational efficiency and model scalability and proposes a roadmap toward general-purpose multi-modal foundation models capable of robust Spatial Intelligence in real-world deployment. Abstract: The rapid advancement of autonomous systems, including self-driving vehicles and drones, has intensified the need to forge true Spatial Intelligence from multi-modal onboard sensor data. While foundation models excel in single-modal contexts, integrating their capabilities across diverse sensors like cameras and LiDAR to create a unified understanding remains a formidable challenge. This paper presents a comprehensive framework for multi-modal pre-training, identifying the core set of techniques driving progress toward this goal. We dissect the interplay between foundational sensor characteristics and learning strategies, evaluating the role of platform-specific datasets in enabling these advancements. Our central contribution is the formulation of a unified taxonomy for pre-training paradigms: ranging from single-modality baselines to sophisticated unified frameworks that learn holistic representations for advanced tasks like 3D object detection and semantic occupancy prediction. Furthermore, we investigate the integration of textual inputs and occupancy representations to facilitate open-world perception and planning. Finally, we identify critical bottlenecks, such as computational efficiency and model scalability, and propose a roadmap toward general-purpose multi-modal foundation models capable of achieving robust Spatial Intelligence for real-world deployment.

[126] RedunCut: Measurement-Driven Sampling and Accuracy Performance Modeling for Low-Cost Live Video Analytics

Gur-Eyal Sela,Kumar Krishna Agrawal,Bharathan Balaji,Joseph Gonzalez,Ion Stoica

Main category: cs.CV

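TL;DR: Presents RedunCut, a new dynamic model size selection (DMSS) system for cutting inference cost in large-scale live video analytics. A measurement-driven planner and a lightweight data-driven performance model improve sampling efficiency and accuracy prediction, substantially reducing compute cost across varied video scenarios.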

Details Motivation: Existing DMSS methods generalize poorly to diverse workloads such as mobile video and lower accuracy targets, mainly because of inefficient sampling and inaccurate per-segment accuracy prediction, and the lack of ground-truth labels at runtime makes optimization hard. Method: RedunCut uses a measurement-driven planner to estimate the cost-benefit tradeoff of sampling and a lightweight data-driven performance model to improve accuracy prediction. The design needs no model retraining or modification and works across model families and tasks. Result: Across road-vehicle, drone, and surveillance videos, RedunCut reduces compute cost by 14-62% at fixed accuracy and remains robust to limited historical data and to drift. Conclusion: With more efficient sampling and more accurate performance prediction, RedunCut makes DMSS more practical and general, offering a low-cost, adaptable approach to large-scale video analytics. Abstract: Live video analytics (LVA) runs continuously across massive camera fleets, but inference cost with modern vision models remains high. To address this, dynamic model size selection (DMSS) is an attractive approach: it is content-aware but treats models as black boxes, and could potentially reduce cost by up to 10x without model retraining or modification. Without ground truth labels at runtime, we observe that DMSS methods use two stages per segment: (i) sampling a few models to calculate prediction statistics (e.g., confidences), then (ii) selection of the model size from those statistics. Prior systems fail to generalize to diverse workloads, particularly to mobile videos and lower accuracy targets. We identify that the failure modes stem from inefficient sampling whose cost exceeds its benefit, and inaccurate per-segment accuracy prediction. In this work, we present RedunCut, a new DMSS system that addresses both: It uses a measurement-driven planner that estimates the cost-benefit tradeoff of sampling, and a lightweight, data-driven performance model to improve accuracy prediction. Across road-vehicle, drone, and surveillance videos and multiple model families and tasks, RedunCut reduces compute cost by 14-62% at fixed accuracy and remains robust to limited historical data and to drift.

[127] DyStream: Streaming Dyadic Talking Heads Generation via Flow Matching-based Autoregressive Model

Bohong Chen,Haiyang Liu

Main category: cs.CV

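TL;DR: Presents DyStream, a flow matching-based autoregressive model for real-time generation of dyadic talking-head video, with ultra-low latency and high-quality lip sync.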

Details Motivation: Existing chunk-based methods need full non-causal context windows, which causes high latency and prevents the immediate non-verbal feedback that realistic conversation requires. Method: A stream-friendly autoregressive framework with flow-matching heads is combined with a causal encoder that uses a lookahead module to bring in short future context (e.g., 60 ms), improving quality while keeping latency low. Result: Generation takes 34 ms per frame and total system latency stays under 100 ms; offline and online lip-sync confidence scores on HDTF are 8.13 and 7.61, better than existing causal methods. Conclusion: DyStream achieves state-of-the-art lip-sync quality at ultra-low latency and suits real dyadic conversation settings that need real-time non-verbal feedback. Abstract: Generating realistic, dyadic talking head video requires ultra-low latency. Existing chunk-based methods require full non-causal context windows, introducing significant delays. This high latency critically prevents the immediate, non-verbal feedback required for a realistic listener. To address this, we present DyStream, a flow matching-based autoregressive model that can generate video in real time from both speaker and listener audio. Our method contains two key designs: (1) we adopt a stream-friendly autoregressive framework with flow-matching heads for probabilistic modeling, and (2) we propose a causal encoder enhanced by a lookahead module to incorporate short future context (e.g., 60 ms) to improve quality while maintaining low latency. Our analysis shows this simple yet effective method significantly surpasses alternative causal strategies, including distillation and a generative encoder. Extensive experiments show that DyStream can generate video within 34 ms per frame, keeping the entire system latency under 100 ms. In addition, it achieves state-of-the-art lip-sync quality, with offline and online LipSync Confidence scores of 8.13 and 7.61 on HDTF, respectively. The model, weights, and code are available.

[128] AI-Driven Evaluation of Surgical Skill via Action Recognition

Yan Meng,Daniel A. Donoho,Marcelle Altshuler,Omar Arnaout

Main category: cs.CV

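TL;DR: Proposes an AI-based framework for automated assessment of microanastomosis performance that combines an improved TimeSformer with YOLO for accurate action recognition and motion analysis.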

Details Motivation: Conventional surgical skill assessment depends on subjective expert judgment, which varies between raters, is time-consuming and labor-intensive, and is hard to scale in resource-limited regions. Method: An improved TimeSformer (with hierarchical temporal attention and weighted spatial attention) performs action recognition, and YOLO-based object detection and tracking extracts fine-grained motion features; microanastomosis skill is assessed along five aspects. Result: On 58 annotated videos, frame-level action segmentation accuracy is 87.7% (93.62% after post-processing), and average classification accuracy for skill assessment is 76%. Conclusion: The system can provide objective, consistent, and interpretable feedback and may support more standardized, data-driven evaluation in surgical education. Abstract: The development of effective training and evaluation strategies is critical. Conventional methods for assessing surgical proficiency typically rely on expert supervision, either through onsite observation or retrospective analysis of recorded procedures. However, these approaches are inherently subjective, susceptible to inter-rater variability, and require substantial time and effort from expert surgeons. These demands are often impractical in low- and middle-income countries, thereby limiting the scalability and consistency of such methods across training programs. To address these limitations, we propose a novel AI-driven framework for the automated assessment of microanastomosis performance. The system integrates a video transformer architecture based on TimeSformer, improved with hierarchical temporal attention and weighted spatial attention mechanisms, to achieve accurate action recognition within surgical videos. Fine-grained motion features are then extracted using a YOLO-based object detection and tracking method, allowing for detailed analysis of instrument kinematics. Performance is evaluated along five aspects of microanastomosis skill, including overall action execution, motion quality during procedure-critical actions, and general instrument handling. Experimental validation using a dataset of 58 expert-annotated videos demonstrates the effectiveness of the system, achieving 87.7% frame-level accuracy in action segmentation that increased to 93.62% with post-processing, and an average classification accuracy of 76% in replicating expert assessments across all skill aspects. 
These findings highlight the system's potential to provide objective, consistent, and interpretable feedback, thereby enabling more standardized, data-driven training and evaluation in surgical education.
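
The abstract does not say what post-processing lifts frame-level accuracy from 87.7% to 93.62%. One common choice for frame-wise action labels is a sliding-window majority vote that removes isolated misclassifications; the sketch below illustrates that idea only, as an assumption rather than the authors' method, and the function name, window size, and labels are hypothetical.

```python
from collections import Counter

def smooth_labels(labels, window=5):
    """Replace each frame label with the majority label in a centered window.

    Ties keep the frame's original label when it is among the most frequent.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    smoothed = []
    for i in range(len(labels)):
        lo = max(0, i - half)
        hi = min(len(labels), i + half + 1)
        counts = Counter(labels[lo:hi])
        best = max(counts.values())
        if counts[labels[i]] == best:
            smoothed.append(labels[i])
        else:
            smoothed.append(counts.most_common(1)[0][0])
    return smoothed

# Hypothetical per-frame predictions with one isolated error at index 2.
frames = ["cut", "cut", "tie", "cut", "cut", "tie", "tie", "tie"]
cleaned = smooth_labels(frames, window=3)
```

Smoothing of this kind trades a small delay at true action boundaries for the removal of single-frame flicker, which is usually a good trade when actions last many frames.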

[129] Exploring Compositionality in Vision Transformers using Wavelet Representations

Akshad Shyam Purushottamdas,Pranav K Nayak,Divya Mehul Rajparia,Deekshith Patel,Yashmitha Gogineni,Konda Reddy Mopuri,Sumohana S. Channappayya

Main category: cs.CV

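TL;DR: Proposes a new framework that uses the Discrete Wavelet Transform (DWT) to obtain input-dependent primitives in vision and studies compositionality in the representation space of the Vision Transformer (ViT) encoder. Experiments show that primitives from a one-level DWT decomposition approximately compose in latent space, offering a new perspective on how ViTs organize information.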

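Details Motivation: Understanding of transformer models comes mostly from analyses on language tasks, while how ViTs build and organize representations in vision remains unclear. The paper asks whether the ViT encoder learns visual representations compositionally, that is, whether complex representations can be composed from simple primitives. Method: A framework analogous to compositionality measures in representation learning uses the Discrete Wavelet Transform (DWT) to extract input-dependent primitives from an image and tests whether the encoder representations of these primitives can be composed in latent space to reproduce the representation of the original image. Result: Primitives from a one-level DWT decomposition produce encoder representations that approximately compose in latent space and recover the original image representation, indicating a degree of compositionality in ViT representations. Conclusion: The representation space of the ViT encoder supports approximate composition of primitives, suggesting that it may build higher-level visual representations by composing lower-level features, which offers a new perspective on how ViTs process information. Abstract: While insights into the workings of the transformer model have largely emerged by analysing their behaviour on language tasks, this work investigates the representations learnt by the Vision Transformer (ViT) encoder through the lens of compositionality. We introduce a framework, analogous to prior work on measuring compositionality in representation learning, to test for compositionality in the ViT encoder. Crucial to drawing this analogy is the Discrete Wavelet Transform (DWT), which is a simple yet effective tool for obtaining input-dependent primitives in the vision setting. By examining the ability of composed representations to reproduce original image representations, we empirically test the extent to which compositionality is respected in the representation space. Our findings show that primitives from a one-level DWT decomposition produce encoder representations that approximately compose in latent space, offering a new perspective on how ViTs structure information.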
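
To make "input-dependent primitives from a one-level DWT" concrete, the sketch below builds four image-space primitives by keeping one subband at a time and inverting the transform. It assumes the Haar wavelet and a grayscale image stored as a list of rows; the wavelet choice, the subband names, and the helper names are assumptions, since the abstract does not specify them.

```python
BANDS = ("LL", "LH", "HL", "HH")

def haar_dwt2(image):
    """One-level 2D Haar transform of an even-sized grayscale image."""
    rows, cols = len(image), len(image[0])
    if rows % 2 or cols % 2:
        raise ValueError("image height and width must be even")
    bands = {name: [[0.0] * (cols // 2) for _ in range(rows // 2)] for name in BANDS}
    for i in range(0, rows, 2):
        for j in range(0, cols, 2):
            a, b = image[i][j], image[i][j + 1]
            c, d = image[i + 1][j], image[i + 1][j + 1]
            bands["LL"][i // 2][j // 2] = (a + b + c + d) / 2.0  # local average
            bands["LH"][i // 2][j // 2] = (a - b + c - d) / 2.0  # left-right difference
            bands["HL"][i // 2][j // 2] = (a + b - c - d) / 2.0  # top-bottom difference
            bands["HH"][i // 2][j // 2] = (a - b - c + d) / 2.0  # diagonal difference
    return bands

def haar_idwt2(bands):
    """Inverse of haar_dwt2."""
    half_rows, half_cols = len(bands["LL"]), len(bands["LL"][0])
    image = [[0.0] * (2 * half_cols) for _ in range(2 * half_rows)]
    for i in range(half_rows):
        for j in range(half_cols):
            ll, lh = bands["LL"][i][j], bands["LH"][i][j]
            hl, hh = bands["HL"][i][j], bands["HH"][i][j]
            image[2 * i][2 * j] = (ll + lh + hl + hh) / 2.0
            image[2 * i][2 * j + 1] = (ll - lh + hl - hh) / 2.0
            image[2 * i + 1][2 * j] = (ll + lh - hl - hh) / 2.0
            image[2 * i + 1][2 * j + 1] = (ll - lh - hl + hh) / 2.0
    return image

def dwt_primitives(image):
    """Return one image per subband, reconstructed with all other subbands zeroed."""
    bands = haar_dwt2(image)
    primitives = {}
    for keep in BANDS:
        masked = {name: (band if name == keep else [[0.0] * len(row) for row in band])
                  for name, band in bands.items()}
        primitives[keep] = haar_idwt2(masked)
    return primitives

image = [[1.0, 2.0], [3.0, 4.0]]
prims = dwt_primitives(image)
recomposed = [[sum(prims[name][i][j] for name in BANDS) for j in range(2)] for i in range(2)]
```

In pixel space the primitives sum back to the image exactly because the transform is linear; the paper's question is whether the encoder representations of these primitives can likewise be composed to approximate the representation of the full image.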

[130] Spectral and Spatial Graph Learning for Multispectral Solar Image Compression

Prasiddha Siwakoti,Atefeh Khoshkhahtinat,Piyush M. Mehta,Barbara J. Thompson,Michael S. F. Kirk,Daniel da Silva

Main category: cs.CV

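TL;DR: Proposes a high-fidelity compression framework for multispectral solar imagery that combines graph embeddings with attention mechanisms and clearly outperforms existing methods on the SDOML dataset.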

Details Motivation: In bandwidth-limited space missions, compressing multispectral solar images while preserving fine spectral and spatial detail is a challenge. Method: The iSWGE module models inter-band relationships by representing spectral channels as graph nodes with learned edge features; the WSGA-C module combines sparse graph attention with convolutional attention to reduce spatial redundancy and emphasize fine-scale structures. Result: On six EUV channels of the SDOML dataset, compared with strong learned baselines, MSID drops by 20.15%, PSNR improves by up to 1.09%, and log-transformed MS-SSIM improves by 1.62%, with sharper and more spectrally faithful reconstructions. Conclusion: The method achieves better compression at comparable bit rates while preserving both spectral and spatial quality, and suits data transmission in solar observation missions. Abstract: High-fidelity compression of multispectral solar imagery remains challenging for space missions, where limited bandwidth must be balanced against preserving fine spectral and spatial details. We present a learned image compression framework tailored to solar observations, leveraging two complementary modules: (1) the Inter-Spectral Windowed Graph Embedding (iSWGE), which explicitly models inter-band relationships by representing spectral channels as graph nodes with learned edge features; and (2) the Windowed Spatial Graph Attention and Convolutional Block Attention (WSGA-C), which combines sparse graph attention with convolutional attention to reduce spatial redundancy and emphasize fine-scale structures. Evaluations on the SDOML dataset across six extreme ultraviolet (EUV) channels show that our approach achieves a 20.15% reduction in Mean Spectral Information Divergence (MSID), up to 1.09% PSNR improvement, and a 1.62% log-transformed MS-SSIM gain over strong learned baselines, delivering sharper and spectrally faithful reconstructions at comparable bits-per-pixel rates. The code is publicly available at https://github.com/agyat4/sgraph .
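
The MSID metric can be illustrated with the usual definition of spectral information divergence: each pixel's spectrum is normalized to a probability distribution, and the divergence is the symmetric sum of the two Kullback-Leibler terms. The sketch below assumes that definition and a plain average over pixels; the paper's exact normalization may differ, and the function names and `eps` guard are illustrative.

```python
import math

def spectral_information_divergence(x, y, eps=1e-12):
    """Symmetric KL divergence between two non-negative spectra (lists of band values)."""
    sx, sy = sum(x), sum(y)
    if sx <= 0 or sy <= 0:
        raise ValueError("spectra must have positive total intensity")
    p = [max(v / sx, eps) for v in x]   # normalize to a distribution; eps avoids log(0)
    q = [max(v / sy, eps) for v in y]
    kl_pq = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    kl_qp = sum(qi * math.log(qi / pi) for pi, qi in zip(p, q))
    return kl_pq + kl_qp

def mean_sid(original_pixels, reconstructed_pixels):
    """Average the divergence over corresponding pixels (each pixel is a spectrum)."""
    values = [spectral_information_divergence(a, b)
              for a, b in zip(original_pixels, reconstructed_pixels)]
    return sum(values) / len(values)

sid_same = spectral_information_divergence([1.0, 1.0], [2.0, 2.0])   # same shape, different scale
sid_diff = spectral_information_divergence([1.0, 1.0], [1.0, 3.0])
```

Because each spectrum is normalized first, the metric ignores overall brightness and responds only to changes in the relative balance between bands, which is why it is a natural check on spectral fidelity after compression.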

[131] F2IDiff: Real-world Image Super-resolution using Feature to Image Diffusion Foundation Model

Devendra K. Jangid,Ripon K. Saha,Dilshan Godaliyadda,Jing Li,Seok-Jun Lee,Hamid R. Sheikh

Main category: cs.CV

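TL;DR: Proposes F2IDiff, a new image super-resolution method conditioned on low-level DINOv2 features, which reduces hallucination during generation and is particularly suited to high-fidelity smartphone photography.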

Details Motivation: Existing text-to-image diffusion models tend to produce uncontrolled hallucinations in single image super-resolution, and text features cannot describe fine textures, which limits their use in smartphone photography. Method: Low-level features extracted by DINOv2 condition the diffusion model, giving a Feature-to-Image Diffusion model (F2IDiff) with stricter and richer conditioning. Result: Lower-level features provide stricter conditioning while remaining rich descriptors of even small patches, which is intended to limit unwanted hallucination on high-fidelity consumer LR images. Conclusion: By conditioning on low-level, dense visual features instead of high-level text features, F2IDiff aims to balance generation quality against fidelity and suits super-resolution for high-fidelity consumer photography. Abstract: With the advent of Generative AI, Single Image Super-Resolution (SISR) quality has seen substantial improvement, as the strong priors learned by Text-2-Image Diffusion (T2IDiff) Foundation Models (FM) can bridge the gap between High-Resolution (HR) and Low-Resolution (LR) images. However, flagship smartphone cameras have been slow to adopt generative models because strong generation can lead to undesirable hallucinations. For substantially degraded LR images, as seen in academia, strong generation is required and hallucinations are more tolerable because of the wide gap between LR and HR images. In contrast, in consumer photography, the LR image has substantially higher fidelity, requiring only minimal hallucination-free generation. We hypothesize that generation in SISR is controlled by the stringency and richness of the FM's conditioning feature. First, text features are high-level features, which often cannot describe subtle textures in an image. Additionally, smartphone LR images are at least 12 MP, whereas SISR networks built on T2IDiff FM are designed to perform inference on much smaller images (<1 MP). As a result, SISR inference has to be performed on small patches, which often cannot be accurately described by text features. To address these shortcomings, we introduce an SISR network built on an FM with lower-level feature conditioning, specifically DINOv2 features, which we call a Feature-to-Image Diffusion (F2IDiff) Foundation Model (FM). Lower-level features provide stricter conditioning while being rich descriptors of even small patches.

[132] CPJ: Explainable Agricultural Pest Diagnosis via Caption-Prompt-Judge with LLM-Judged Refinement

Wentao Zhang,Tao Fang,Lina Lu,Lifei Wang,Weihe Zhong

Main category: cs.CV

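TL;DR: Proposes CPJ, a training-free few-shot framework that uses structured image captions to improve the accuracy and interpretability of agricultural disease diagnosis, clearly outperforming baseline methods on several metrics.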

Details Motivation: Existing crop disease diagnosis methods rely on costly supervised fine-tuning and perform poorly under domain shift, so a more robust, interpretable, training-free method is needed. Method: The Caption-Prompt-Judge (CPJ) framework uses large vision-language models to generate multi-angle image captions, refines the captions iteratively with an LLM-as-Judge module, and feeds them into a dual-answer VQA process that supports both recognition and management decisions. Result: On CDDMBench, with captions generated by GPT-5-mini, GPT-5-Nano gains +22.7 percentage points in disease classification and +19.5 points in QA score. Conclusion: CPJ delivers high-performance agricultural disease diagnosis without fine-tuning and provides transparent, evidence-based reasoning, advancing explainable agricultural AI. Abstract: Accurate and interpretable crop disease diagnosis is essential for agricultural decision-making, yet existing methods often rely on costly supervised fine-tuning and perform poorly under domain shifts. We propose Caption-Prompt-Judge (CPJ), a training-free few-shot framework that enhances Agri-Pest VQA through structured, interpretable image captions. CPJ employs large vision-language models to generate multi-angle captions, refined iteratively via an LLM-as-Judge module, which then inform a dual-answer VQA process for both recognition and management responses. Evaluated on CDDMBench, CPJ significantly improves performance: using GPT-5-mini captions, GPT-5-Nano achieves +22.7 pp in disease classification and +19.5 points in QA score over no-caption baselines. The framework provides transparent, evidence-based reasoning, advancing robust and explainable agricultural diagnosis without fine-tuning. Our code and data are publicly available at: https://github.com/CPJ-Agricultural/CPJ-Agricultural-Diagnosis.

[133] Using Large Language Models To Translate Machine Results To Human Results

Trishna Niraula,Jonathan Stubblefield

Main category: cs.CV

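TL;DR: This study presents a pipeline that combines YOLOv5 and YOLOv8 for chest X-ray anomaly detection with a large language model (LLM) that generates natural-language radiology reports.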

Details Motivation: Existing computer vision systems classify and detect findings in medical images efficiently, but they output structured predictions and radiologists still have to write the report, so a method that automatically generates high-quality diagnostic narratives is needed to improve efficiency. Method: YOLOv5 and YOLOv8 perform anomaly detection and output bounding boxes and class labels; these structured results are passed to a large language model (such as GPT-4) to generate descriptive findings and clinical summaries, and the generated text is assessed with cosine similarity and human evaluation. Result: AI-generated reports show high semantic similarity to ground-truth reports; in human evaluation GPT-4 scores high on clarity (4.88/5) but lower on natural writing flow (2.81/5). Conclusion: The approach enables clinically accurate automatic report generation, but language naturalness leaves room for improvement and current systems cannot yet fully imitate a radiologist's writing style. Abstract: Artificial intelligence (AI) has transformed medical imaging, with computer vision (CV) systems achieving state-of-the-art performance in classification and detection tasks. However, these systems typically output structured predictions, leaving radiologists responsible for translating results into full narrative reports. Recent advances in large language models (LLMs), such as GPT-4, offer new opportunities to bridge this gap by generating diagnostic narratives from structured findings. This study introduces a pipeline that integrates YOLOv5 and YOLOv8 for anomaly detection in chest X-ray images with a large language model (LLM) to generate natural-language radiology reports. The YOLO models produce bounding-box predictions and class labels, which are then passed to the LLM to generate descriptive findings and clinical summaries. YOLOv5 and YOLOv8 are compared in terms of detection accuracy, inference latency, and the quality of generated text, as measured by cosine similarity to ground-truth reports. Results show strong semantic similarity between AI and human reports, while human evaluation reveals GPT-4 excels in clarity (4.88/5) but exhibits lower scores for natural writing flow (2.81/5), indicating that current systems achieve clinical accuracy but remain stylistically distinguishable from radiologist-authored text.
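The abstract above scores generated reports by cosine similarity to ground-truth reports but does not say how the texts are turned into vectors. As a minimal sketch, assuming plain word-count vectors (the paper may well use learned embeddings instead), the metric could look like this; the function name and the whitespace tokenization are illustrative assumptions:

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between word-count vectors of two texts.

    Assumption: lowercase whitespace tokenization stands in for whatever
    text representation the paper actually uses.
    """
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Dot product over the words the two texts share.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # An empty text has no direction to compare.
    return dot / (norm_a * norm_b)
```

Identical texts score 1 and texts with no shared words score 0; a count-based vector ignores word order, which is one reason a high similarity can coexist with a low writing-flow rating.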

[134] Hierarchical Vector-Quantized Latents for Perceptual Low-Resolution Video Compression

Manikanta Kotthapalli,Banafsheh Rekabdar

Main category: cs.CV

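TL;DR: This paper proposes a multi-scale vector-quantized variational autoencoder (MS-VQ-VAE) for low-resolution video that supports efficient storage, transmission, and decoding on edge devices, adds a perceptual loss to improve reconstruction quality, and outperforms a single-scale baseline on UCF101.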

Details Motivation: Traditional video codecs such as H.264 and HEVC target pixel-level reconstruction and lack support for machine-learning-friendly latent representations, which makes them hard to integrate into deep learning pipelines and limits their efficiency in modern AI-driven settings. Method: Building on the VQ-VAE-2 framework, the authors construct a spatiotemporal model with a two-level hierarchical latent structure and 3D residual convolutions, and add a perceptual loss based on a pre-trained VGG16 network to improve reconstruction quality; the model is lightweight (about 18.5M parameters) and designed for 64x64 video clips. Result: On the UCF101 test set the model reaches 25.96 dB PSNR and 0.8375 SSIM; on the validation set it improves over the single-scale baseline by 1.41 dB PSNR and 0.0248 SSIM. Conclusion: The proposed MS-VQ-VAE framework produces compact, high-fidelity video latents and suits scalable video compression in bandwidth-constrained scenarios such as real-time streaming, mobile video analytics, and CDN storage optimization. Abstract: The exponential growth of video traffic has placed increasing demands on bandwidth and storage infrastructure, particularly for content delivery networks (CDNs) and edge devices. While traditional video codecs like H.264 and HEVC achieve high compression ratios, they are designed primarily for pixel-domain reconstruction and lack native support for machine learning-centric latent representations, limiting their integration into deep learning pipelines. In this work, we present a Multi-Scale Vector Quantized Variational Autoencoder (MS-VQ-VAE) designed to generate compact, high-fidelity latent representations of low-resolution video, suitable for efficient storage, transmission, and client-side decoding. Our architecture extends the VQ-VAE-2 framework to a spatiotemporal setting, introducing a two-level hierarchical latent structure built with 3D residual convolutions. The model is lightweight (approximately 18.5M parameters) and optimized for 64x64 resolution video clips, making it appropriate for deployment on edge devices with constrained compute and memory resources. To improve perceptual reconstruction quality, we incorporate a perceptual loss derived from a pre-trained VGG16 network. Trained on the UCF101 dataset using 2-second video clips (32 frames at 16 FPS), on the test set we achieve 25.96 dB PSNR and 0.8375 SSIM. On validation, our model improves over the single-scale baseline by 1.41 dB PSNR and 0.0248 SSIM. The proposed framework is well-suited for scalable video compression in bandwidth-sensitive scenarios, including real-time streaming, mobile video analytics, and CDN-level storage optimization.
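The reconstruction numbers above are reported in PSNR. As a minimal sketch of how that figure is conventionally computed, assuming flat lists of 8-bit pixel values (the function name and the `max_value` default are illustrative, not taken from the paper):

```python
import math


def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(reconstruction) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # Identical signals: PSNR is unbounded.
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Because the scale is logarithmic, a gain of 1.41 dB corresponds to the mean squared error shrinking by a factor of about 10^0.141, roughly 1.38.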

[135] PhyGDPO: Physics-Aware Groupwise Direct Preference Optimization for Physically Consistent Text-to-Video Generation

Yuanhao Cai,Kunpeng Li,Menglin Jia,Jialiang Wang,Junzhe Sun,Feng Liang,Weifeng Chen,Felix Juefei-Xu,Chu Wang,Ali Thabet,Xiaoliang Dai,Xuan Ju,Alan Yuille,Ji Hou

Main category: cs.CV

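TL;DR: This paper proposes PhyGDPO, a physics-aware video generation method that builds the large-scale physics-augmented video dataset PhyVidGen-135K and combines a physics-guided reward scheme with an efficient training strategy to achieve better physical consistency in text-to-video generation.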

Details Motivation: Existing text-to-video generation methods follow physical laws poorly, and large-scale training data with rich physical interactions is scarce. Method: The authors propose the PhyAugPipe data construction pipeline, which uses a vision-language model with chain-of-thought reasoning to build the large-scale physics-related video dataset PhyVidGen-135K; they design the PhyGDPO framework on top of the groupwise Plackett-Luce model and introduce Physics-Guided Rewarding (PGR) and a LoRA-Switch Reference (LoRA-SR) scheme for efficient training. Result: Experiments show that the method significantly outperforms existing open-source methods on two physics-aware video generation benchmarks, PhyGenBench and VideoPhy2. Conclusion: By combining the physical reasoning ability of vision-language models with an efficient preference optimization framework, PhyGDPO improves the physical plausibility of generated videos and offers a new direction for physically consistent video generation. Abstract: Recent advances in text-to-video (T2V) generation have achieved good visual quality, yet synthesizing videos that faithfully follow physical laws remains an open challenge. Existing methods, mainly based on graphics or prompt extension, struggle to generalize beyond simple simulated environments or learn implicit physical reasoning. The scarcity of training data with rich physics interactions and phenomena is also a problem. In this paper, we first introduce a Physics-Augmented video data construction Pipeline, PhyAugPipe, that leverages a vision-language model (VLM) with chain-of-thought reasoning to collect a large-scale training dataset, PhyVidGen-135K. Then we formulate a principled Physics-aware Groupwise Direct Preference Optimization (PhyGDPO) framework that builds upon the groupwise Plackett-Luce probabilistic model to capture holistic preferences beyond pairwise comparisons. In PhyGDPO, we design a Physics-Guided Rewarding (PGR) scheme that embeds VLM-based physics rewards to steer optimization toward physical consistency. We also propose a LoRA-Switch Reference (LoRA-SR) scheme that eliminates memory-heavy reference duplication for efficient training. Experiments show that our method significantly outperforms state-of-the-art open-source methods on PhyGenBench and VideoPhy2. Please check our project page at https://caiyuanhao1998.github.io/project/PhyGDPO for more video results. Our code, models, and data will be released at https://github.com/caiyuanhao1998/Open-PhyGDPO
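The abstract above builds on the groupwise Plackett-Luce model but does not spell out the objective. As a minimal sketch of the standard Plackett-Luce log-likelihood for one ranked group, assuming each candidate already has a scalar score (how PhyGDPO derives those scores from the policy, the reference model, and the physics rewards is not given in the abstract):

```python
import math


def plackett_luce_log_likelihood(scores_in_ranked_order):
    """Log-probability of a full ranking under the Plackett-Luce model.

    `scores_in_ranked_order[i]` is the score of the item ranked i-th (best first).
    The ranking is built by repeatedly picking the next item with probability
    softmax(score) over the items not yet picked.
    """
    total = 0.0
    for i, score in enumerate(scores_in_ranked_order):
        remaining = scores_in_ranked_order[i:]
        # Numerically stable log-sum-exp over the remaining candidates.
        peak = max(remaining)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in remaining))
        total += score - log_norm
    return total
```

With only two items the expression reduces to the log-sigmoid of the score difference, which is the pairwise form used by ordinary DPO; larger groups let a single term express a preference over several candidates at once.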

[136] OCP-LS: An Efficient Algorithm for Visual Localization

Jindi Zhong,Hongxia Wang,Huanshui Zhang

Main category: cs.CV

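TL;DR: Proposes a new second-order optimization algorithm for large-scale optimization problems in deep learning that combines the OCP method with a suitable approximation of the diagonal of the Hessian matrix and shows superior performance on several visual localization benchmarks.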

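Details Motivation: To address large-scale optimization problems in deep learning, in particular the shortcomings of conventional optimizers in convergence speed, training stability, and robustness to noise. Method: A new second-order optimization algorithm that incorporates the OCP method and appropriately approximates the diagonal elements of the Hessian matrix. Result: Experiments on several standard visual localization benchmarks show faster convergence, greater training stability, and better robustness to noise, while localization accuracy remains competitive. Conclusion: The proposed method outperforms conventional optimization algorithms on large-scale deep learning optimization problems and has practical potential. Abstract: This paper proposes a novel second-order optimization algorithm. It targets large-scale optimization problems in deep learning by incorporating the OCP method and appropriately approximating the diagonal elements of the Hessian matrix. Extensive experiments on multiple standard visual localization benchmarks demonstrate the significant superiority of the proposed method. Compared with conventional optimization algorithms, our framework achieves competitive localization accuracy while exhibiting faster convergence, enhanced training stability, and improved robustness to noise interference.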

[137] RGBT-Ground Benchmark: Visual Grounding Beyond RGB in Complex Real-World Scenarios

Tianyi Zhao,Jiawen Xi,Linhui Xiao,Junnan Li,Xue Yang,Maoxun Yuan,Xingxing Wei

Main category: cs.CV

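TL;DR: This paper introduces RGBT-Ground, the first large-scale visual grounding benchmark for complex real-world scenarios, containing aligned RGB and thermal infrared image pairs with high-quality annotations, and proposes a unified framework and the RGBT-VGNet model for robust cross-modal visual grounding.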

Details Motivation: Existing visual grounding benchmarks are mostly built from datasets collected in clean environments and do not cover real-world conditions such as varying illumination and weather, which makes it hard to assess robustness and generalization in safety-critical applications. Method: The authors build RGBT-Ground, a large-scale dataset of RGB and thermal infrared image pairs with fine-grained annotations; design a unified framework that supports uni-modal and multi-modal inputs; and propose RGBT-VGNet to fuse complementary visual modalities. Result: Extensive experiments on RGBT-Ground show that RGBT-VGNet significantly outperforms adapted versions of existing methods, especially in nighttime and long-distance scenarios. Conclusion: RGBT-Ground provides a new evaluation benchmark and research platform for robust visual grounding in complex environments, and the proposed method improves grounding performance under multi-modal conditions. Abstract: Visual Grounding (VG) aims to localize specific objects in an image according to natural language expressions, serving as a fundamental task in vision-language understanding. However, existing VG benchmarks are mostly derived from datasets collected under clean environments, such as COCO, where scene diversity is limited. Consequently, they fail to reflect the complexity of real-world conditions, such as changes in illumination, weather, etc., that are critical to evaluating model robustness and generalization in safety-critical applications. To address these limitations, we present RGBT-Ground, the first large-scale visual grounding benchmark built for complex real-world scenarios. It consists of spatially aligned RGB and Thermal infrared (TIR) image pairs with high-quality referring expressions, corresponding object bounding boxes, and fine-grained annotations at the scene, environment, and object levels. This benchmark enables comprehensive evaluation and facilitates the study of robust grounding under diverse and challenging conditions. Furthermore, we establish a unified visual grounding framework that supports both uni-modal (RGB or TIR) and multi-modal (RGB-TIR) visual inputs. Based on it, we propose RGBT-VGNet, a simple yet effective baseline for fusing complementary visual modalities to achieve robust grounding. We extensively adapt existing methods to RGBT-Ground. Experimental results show that our proposed RGBT-VGNet significantly outperforms these adapted methods, particularly in nighttime and long-distance scenarios. All resources will be publicly released to promote future research on robust visual grounding in complex real-world environments.

[138] Improving Few-Shot Change Detection Visual Question Answering via Decision-Ambiguity-guided Reinforcement Fine-Tuning

Fuyu Dong,Ke Li,Di Wang,Nan Luo,Yiming Zhang,Kaiyu Li,Jianfei Yang,Quan Wang

Main category: cs.CV

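TL;DR: This paper proposes DARFT, a reinforcement fine-tuning framework that targets decision ambiguity in change detection visual question answering (CDVQA); it mines decision-ambiguous samples and applies group-relative policy optimization to them, improving the model's discriminability and robustness.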

Details Motivation: In CDVQA, existing models produce plausible predictions but often suffer from decision ambiguity because the confidence gap between the correct answer and strong distractors is small, which hurts performance; the authors aim to optimize these ambiguous samples explicitly to improve discriminability. Method: The DARFT framework first mines Decision-Ambiguous Samples (DAS) with a reference policy trained by supervised fine-tuning (SFT), then applies reinforcement fine-tuning with group-relative policy optimization on those samples, using multi-sample decoding and intra-group relative advantages to suppress strong distractors. Result: Experiments show that DARFT outperforms SFT baselines across settings, with especially large gains in few-shot scenarios. Conclusion: Explicitly optimizing decision-ambiguous samples strengthens the decision clarity and robustness of CDVQA models, and DARFT offers a new approach to optimizing vision-language models on fine-grained reasoning tasks. Abstract: Change detection visual question answering (CDVQA) requires answering text queries by reasoning about semantic changes in bi-temporal remote sensing images. A straightforward approach is to boost CDVQA performance with generic vision-language models via supervised fine-tuning (SFT). Despite recent progress, we observe that a significant portion of failures do not stem from clearly incorrect predictions, but from decision ambiguity, where the model assigns similar confidence to the correct answer and strong distractors. To formalize this challenge, we define Decision-Ambiguous Samples (DAS) as instances with a small probability margin between the ground-truth answer and the most competitive alternative. We argue that explicitly optimizing DAS is crucial for improving the discriminability and robustness of CDVQA models. To this end, we propose DARFT, a Decision-Ambiguity-guided Reinforcement Fine-Tuning framework that first mines DAS using an SFT-trained reference policy and then applies group-relative policy optimization on the mined subset. By leveraging multi-sample decoding and intra-group relative advantages, DARFT suppresses strong distractors and sharpens decision boundaries without additional supervision. Extensive experiments demonstrate consistent gains over SFT baselines, particularly under few-shot settings.
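The two steps named in the abstract above, mining samples with a small probability margin and computing intra-group relative advantages, can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the dictionary layout of a sample, the `threshold` value, and the mean/standard-deviation normalization of rewards are all hypothetical choices.

```python
import statistics


def probability_margin(answer_probs, truth):
    """Gap between the ground-truth probability and the strongest alternative."""
    best_other = max(p for answer, p in answer_probs.items() if answer != truth)
    return answer_probs[truth] - best_other


def mine_ambiguous_samples(samples, threshold=0.1):
    """Keep samples whose margin magnitude is below a (hypothetical) threshold."""
    return [
        s for s in samples
        if abs(probability_margin(s["probs"], s["truth"])) < threshold
    ]


def group_relative_advantages(rewards):
    """Normalize the rewards of one sampled group to zero mean and unit spread."""
    mean = statistics.fmean(rewards)
    spread = statistics.pstdev(rewards)
    if spread == 0:
        return [0.0] * len(rewards)  # All answers tied: no learning signal.
    return [(r - mean) / spread for r in rewards]
```

A sample where the reference policy gives 0.48 to the true answer and 0.45 to a distractor has a margin of 0.03 and would be kept, while a confident 0.9 versus 0.1 prediction would be dropped from the mined subset.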

[139] SliceLens: Fine-Grained and Grounded Error Slice Discovery for Multi-Instance Vision Tasks

Wei Zhang,Chaoqun Wang,Zixuan Guan,Sam Kao,Pengfei Zhao,Peng Wu,Sifeng He

Main category: cs.CV

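TL;DR: This paper proposes SliceLens, a hypothesis-driven framework based on large language models and vision-language models for discovering fine-grained, interpretable error slices in instance-level vision tasks, and builds FeSD, the first benchmark for such tasks; experiments show it clearly outperforms existing methods in both precision and practical value for model improvement.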

Details Motivation: Existing error slice discovery methods focus on image classification and transfer poorly to multi-instance tasks such as detection, segmentation, and pose estimation; existing benchmarks also rely on artificial ground truth that does not reflect real model failures, especially in corner cases involving complex visual relationships. Method: The SliceLens framework uses large language models (LLMs) and vision-language models (VLMs) to generate and verify diverse failure hypotheses, identifying fine-grained error slices reliably through grounded visual reasoning; the authors also build the FeSD benchmark, which contains expert-annotated, carefully refined real error slices precisely grounded to local error regions. Result: Extensive experiments on existing benchmarks and on FeSD show that SliceLens reaches a Precision@10 of 0.73 on FeSD, up from the previous 0.31 (a gain of 0.42), and that the discovered slices are highly interpretable and effectively guide model repair. Conclusion: By combining LLMs and VLMs, SliceLens discovers error slices efficiently and interpretably across instance-level vision tasks, and together with FeSD it gives future research a more realistic and reliable evaluation standard for robust model evaluation. Abstract: Systematic failures of computer vision models on subsets with coherent visual patterns, known as error slices, pose a critical challenge for robust model evaluation. Existing slice discovery methods are primarily developed for image classification, limiting their applicability to multi-instance tasks such as detection, segmentation, and pose estimation. In real-world scenarios, error slices often arise from corner cases involving complex visual relationships, where existing instance-level approaches lacking fine-grained reasoning struggle to yield meaningful insights. Moreover, current benchmarks are typically tailored to specific algorithms or biased toward image classification, with artificial ground truth that fails to reflect real model failures. To address these limitations, we propose SliceLens, a hypothesis-driven framework that leverages LLMs and VLMs to generate and verify diverse failure hypotheses through grounded visual reasoning, enabling reliable identification of fine-grained and interpretable error slices. We further introduce FeSD (Fine-grained Slice Discovery), the first benchmark specifically designed for evaluating fine-grained error slice discovery across instance-level vision tasks, featuring expert-annotated and carefully refined ground-truth slices with precise grounding to local error regions. Extensive experiments on both existing benchmarks and FeSD demonstrate that SliceLens achieves state-of-the-art performance, improving Precision@10 by 0.42 (0.73 vs. 0.31) on FeSD, and identifies interpretable slices that facilitate actionable model improvements, as validated through model repair experiments.
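The headline number above is Precision@10. As a minimal sketch of the generic metric, assuming a ranked list of retrieved items and a set of items known to belong to the ground-truth slice (how FeSD decides that a retrieved instance counts as a hit is not described in the abstract):

```python
def precision_at_k(ranked_items, relevant_items, k=10):
    """Fraction of the top-k ranked items that are in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_items[:k]
    hits = sum(1 for item in top_k if item in relevant_items)
    # Dividing by k (not len(top_k)) penalizes lists shorter than k.
    return hits / k
```

Under this definition a Precision@10 of 0.73 means that, averaged over slices, about seven of the ten top-ranked items belong to the true error slice.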

[140] 3D Semantic Segmentation for Post-Disaster Assessment

Nhut Le,Maryam Rahnemoonfar

Main category: cs.CV

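TL;DR: This paper presents a 3D semantic segmentation dataset dedicated to post-disaster environments, built from UAV footage of areas affected by Hurricane Ian and reconstructed into 3D point clouds with SfM and MVS, and evaluates state-of-the-art models on it, revealing the limitations of existing methods in post-disaster scenes.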

Details Motivation: Existing deep learning models lack 3D datasets designed specifically for post-disaster environments, which limits the accuracy and efficiency of post-disaster assessment. Method: Aerial footage of areas hit by Hurricane Ian was collected with UAVs and converted into 3D point clouds using Structure-from-Motion (SfM) and Multi-View Stereo (MVS) to build a dedicated dataset, on which the state-of-the-art 3D semantic segmentation models FPT, PTv3, and OA-CNNs were evaluated. Result: Experiments show that existing SOTA models perform poorly on this post-disaster 3D dataset, exposing insufficient segmentation ability in complex disaster scenes. Conclusion: More advanced 3D segmentation techniques and dedicated benchmark datasets are needed to improve post-disaster scene understanding and support emergency response decisions. Abstract: The increasing frequency of natural disasters poses severe threats to human lives and leads to substantial economic losses. While 3D semantic segmentation is crucial for post-disaster assessment, existing deep learning models lack datasets specifically designed for post-disaster environments. To address this gap, we constructed a specialized 3D dataset from aerial footage of areas affected by Hurricane Ian (2022), captured by unmanned aerial vehicles (UAVs), employing Structure-from-Motion (SfM) and Multi-View Stereo (MVS) techniques to reconstruct 3D point clouds. We evaluated the state-of-the-art (SOTA) 3D semantic segmentation models Fast Point Transformer (FPT), Point Transformer v3 (PTv3), and OA-CNNs on this dataset, exposing significant limitations in existing methods for disaster-stricken regions. These findings underscore the urgent need for advancements in 3D segmentation techniques and the development of specialized 3D benchmark datasets to improve post-disaster scene understanding and response.

[141] Collaborative Low-Rank Adaptation for Pre-Trained Vision Transformers

Zheng Liu,Jinchao Zhu,Gao Huang

Main category: cs.CV

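TL;DR: Proposes CLoRA, a new fine-tuning method that uses base-space sharing and sample-agnostic diversity enhancement (SADE) to increase learning capacity while staying parameter-efficient, achieving a better balance between performance and efficiency on image and point cloud tasks.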

Details Motivation: Existing low-rank adaptation methods struggle to reconcile parameter efficiency with fine-tuning performance and tend either to sacrifice performance or to introduce too many trainable parameters. Method: Collaborative low-rank adaptation (CLoRA) comprises a base-space sharing mechanism, in which multiple low-rank modules share projection spaces, and sample-agnostic diversity enhancement (SADE), which encourages diverse representations. Result: Experiments on several image and point cloud datasets show that CLoRA outperforms existing methods with fewer trainable parameters and the lowest GFLOPs. Conclusion: CLoRA balances parameter efficiency and model performance; its sharing mechanism and diversity regularization increase the expressiveness of low-rank fine-tuning, making it suitable for efficient fine-tuning of vision transformers. Abstract: Low-rank adaptation (LoRA) has achieved remarkable success in fine-tuning pre-trained vision transformers for various downstream tasks. Existing studies mainly focus on exploring more parameter-efficient strategies or more effective representation learning schemes. However, these methods either sacrifice fine-tuning performance or introduce excessive trainable parameters, failing to strike a balance between learning performance and parameter efficiency. To address this problem, we propose a novel tuning method named collaborative low-rank adaptation (CLoRA) in this paper. CLoRA consists of base-space sharing and sample-agnostic diversity enhancement (SADE) components. To maintain parameter efficiency while expanding the learning capacity of low-rank modules (LRMs), base-space sharing allows all LRMs to share a set of down/up-projection spaces. In CLoRA, the low-rank matrices obtained from the shared spaces collaboratively construct each LRM. Since the representations extracted by these matrices may contain redundant information, SADE is employed to regularize the similarities among them to encourage diverse representations in the training process. We conduct extensive experiments on widely used image and point cloud datasets to evaluate the performance of CLoRA. Experimental results demonstrate that CLoRA strikes a better balance between learning performance and parameter efficiency, while requiring the fewest GFLOPs for point cloud analysis, compared with the state-of-the-art methods.

[142] MoniRefer: A Real-world Large-scale Multi-modal Dataset based on Roadside Infrastructure for 3D Visual Grounding

Panquan Yang,Junfei Huang,Zongzhangbao Yin,Yingsong Hu,Anni Xu,Xinyi Luo,Xueqi Sun,Hai Wu,Sheng Ao,Zhaoxing Zhu,Chenglu Wen,Cheng Wang

Main category: cs.CV

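TL;DR: This paper introduces a new 3D visual grounding task for outdoor monitoring scenarios and builds MoniRefer, the first large-scale real-world multi-modal dataset for it, with about 136,000 objects and 411,000 natural language descriptions. It also proposes Moni3DVG, an end-to-end method that fuses image appearance with point cloud geometry for multi-modal learning and performs strongly on the new task.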

Details Motivation: Existing 3D visual grounding research concentrates on indoor or autonomous driving scenes; datasets and methods for roadside infrastructure monitoring are missing, which limits how well infrastructure can interpret natural language in traffic scenes. Method: The authors build MoniRefer, a large-scale real-world multi-modal dataset with a large amount of manually verified paired point cloud-text data, and propose Moni3DVG, which fuses appearance information from images with geometry and optical information from point clouds end to end for multi-modal feature learning and 3D object localization. Result: Extensive experiments and ablation studies on the new MoniRefer dataset show that Moni3DVG is superior and effective for 3D visual grounding. Conclusion: By introducing a new task, releasing a high-quality dataset, and proposing an effective method, this work advances 3D visual grounding at the roadside infrastructure level and provides a new benchmark and solution for language-driven object localization in complex traffic environments. Abstract: 3D visual grounding aims to localize the object in 3D point cloud scenes that semantically corresponds to given natural language sentences. It is critical for roadside infrastructure systems to interpret natural language and localize relevant target objects in complex traffic environments. However, most existing datasets and approaches for 3D visual grounding focus on indoor and outdoor driving scenes, while outdoor monitoring scenarios remain unexplored due to the scarcity of paired point cloud-text data captured by roadside infrastructure sensors. In this paper, we introduce a novel task of 3D Visual Grounding for Outdoor Monitoring Scenarios, which enables infrastructure-level understanding of traffic scenes beyond the ego-vehicle perspective. To support this task, we construct MoniRefer, the first real-world large-scale multi-modal dataset for roadside-level 3D visual grounding. The dataset consists of about 136,018 objects with 411,128 natural language expressions collected from multiple complex traffic intersections in real-world environments. To ensure the quality and accuracy of the dataset, we manually verified all linguistic descriptions and 3D labels for objects. Additionally, we propose a new end-to-end method, named Moni3DVG, which utilizes the rich appearance information provided by images and the geometry and optical information from point clouds for multi-modal feature learning and 3D object localization. Extensive experiments and ablation studies on the proposed benchmarks demonstrate the superiority and effectiveness of our method. Our dataset and code will be released.

[143] LLHA-Net: A Hierarchical Attention Network for Two-View Correspondence Learning

Shuyuan Lin,Yu Guo,Xiao Chen,Yanjie Liang,Guobao Xiao,Feiran Huang

Main category: cs.CV

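TL;DR: This paper proposes the Layer-by-Layer Hierarchical Attention Network, a new method that improves the precision of feature point matching in computer vision and performs particularly well when outliers are abundant.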

Details Motivation: Large numbers of outliers in feature point matching significantly degrade accuracy and robustness; a key challenge, especially at high outlier ratios, is extracting high-quality information while reducing errors caused by negative samples. Method: The proposed network combines stage fusion, hierarchical extraction, and attention: a layer-by-layer channel fusion module preserves semantic information from each stage and fuses it overall, a hierarchical attention module adaptively captures and fuses global perception and structural semantic information, and two architectures are designed to improve the network's adaptability. Result: Experiments on two public datasets, YFCC100M and SUN3D, show that the method outperforms several state-of-the-art approaches on outlier removal and camera pose estimation. Conclusion: The proposed hierarchical attention network improves the precision and robustness of feature point matching, especially at high outlier ratios, and has strong representation capability and application potential. Abstract: Establishing the correct correspondence of feature points is a fundamental task in computer vision. However, the presence of numerous outliers among the feature points can significantly affect the matching results, reducing the accuracy and robustness of the process. Furthermore, a challenge arises when dealing with a large proportion of outliers: how to ensure the extraction of high-quality information while reducing errors caused by negative samples. To address these issues, in this paper, we propose a novel method called Layer-by-Layer Hierarchical Attention Network, which enhances the precision of feature point matching in computer vision by addressing the issue of outliers. Our method incorporates stage fusion, hierarchical extraction, and an attention mechanism to improve the network's representation capability by emphasizing the rich semantic information of feature points. Specifically, we introduce a layer-by-layer channel fusion module, which preserves the feature semantic information from each stage and achieves overall fusion, thereby enhancing the representation capability of the feature points. Additionally, we design a hierarchical attention module that adaptively captures and fuses global perception and structural semantic information using an attention mechanism. Finally, we propose two architectures to extract and integrate features, thereby improving the adaptability of our network. We conduct experiments on two public datasets, namely YFCC100M and SUN3D, and the results demonstrate that our proposed method outperforms several state-of-the-art techniques in both outlier removal and camera pose estimation. Source code is available at http://www.linshuyuan.com.

[144] FireRescue: A UAV-Based Dataset and Enhanced YOLO Model for Object Detection in Fire Rescue Scenes

Qingyu Xu,Runtong Zhang,Zihuan Qiu,Fanman Meng

Main category: cs.CV

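TL;DR: This paper proposes a new object detection approach for fire rescue scenes, building the FireRescue dataset, which covers multiple scene types and key target categories, and an improved FRS-YOLO model that raises detection performance in complex environments.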

Details Motivation: Existing research concentrates on mountainous or forest scenes and neglects urban rescue scenes, which are more common and structurally complex; detection categories are also limited and do not cover the multiple target classes (such as fire trucks and firefighters) that matter for command decisions. Method: The authors build a new dataset, FireRescue, with 15,980 images and 32,000 bounding boxes covering eight key categories across urban, mountainous, forest, and water rescue scenes, and propose FRS-YOLO, which adds a plug-and-play multidimensional collaborative enhancement attention module and a dynamic feature sampler to improve detection of easily confused categories and small targets. Result: Experiments confirm that object detection in fire rescue scenes is challenging and that the proposed method clearly improves the detection performance of YOLO-series models in this setting, mitigating class confusion and smoke occlusion. Conclusion: By building the more realistic FireRescue dataset and designing the targeted FRS-YOLO model, the paper advances object detection for fire rescue scenes and provides more reliable technical support for command decisions. Abstract: Object detection in fire rescue scenarios is important for command and decision-making in firefighting operations. However, existing research still suffers from two main limitations. First, current work predominantly focuses on environments such as mountainous or forest areas, while paying insufficient attention to urban rescue scenes, which are more frequent and structurally complex. Second, existing detection systems include a limited number of classes, such as flames and smoke, and lack a comprehensive system covering key targets crucial for command decisions, such as fire trucks and firefighters. To address the above issues, this paper first constructs a new dataset named "FireRescue" for rescue command, which covers multiple rescue scenarios, including urban, mountainous, forest, and water areas, and contains eight key categories such as fire trucks and firefighters, with a total of 15,980 images and 32,000 bounding boxes. Second, to tackle the problems of inter-class confusion and missed detection of small targets caused by chaotic scenes, diverse targets, and long-distance shooting, this paper proposes an improved model named FRS-YOLO. On the one hand, the model introduces a plug-and-play multidimensional collaborative enhancement attention module, which enhances the discriminative representation of easily confused categories (e.g., fire trucks vs. ordinary trucks) through cross-dimensional feature interaction. On the other hand, it integrates a dynamic feature sampler to strengthen high-response foreground features, thereby mitigating the effects of smoke occlusion and background interference. Experimental results demonstrate that object detection in fire rescue scenarios is highly challenging, and the proposed method effectively improves the detection performance of YOLO series models in this context.

[145] From Sequential to Spatial: Reordering Autoregression for Efficient Visual Generation

Siyang Wang,Hanting Li,Wei Li,Jie Hu,Xinghao Chen,Feng Zhao

Main category: cs.CV

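TL;DR: This paper proposes RadAR, a parallelizable framework for accelerating autoregressive visual generation that uses a radial topology and a nested attention mechanism to improve generation efficiency and consistency.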

Details Motivation: Conventional autoregressive visual generation models decode token by token, which makes inference slow, and the standard raster-scan order does not fully exploit the local dependencies and spatial correlations among visual tokens. Method: The RadAR framework organizes visual tokens into concentric rings by spatial distance and generates ring by ring from the center outward, predicting all tokens within a ring in parallel; a nested attention mechanism dynamically corrects implausible outputs during the forward pass, mitigating inconsistencies caused by limited context. Result: RadAR retains the representational capacity of autoregressive models while substantially increasing parallelism and inference efficiency, and its dynamic correction mechanism prevents error accumulation and model collapse. Conclusion: With radial parallel generation and nested attention correction, RadAR achieves efficient, parallelizable autoregressive visual generation that greatly speeds up inference while maintaining generation quality. Abstract: Inspired by the remarkable success of autoregressive models in language modeling, this paradigm has been widely adopted in visual generation. However, the sequential token-by-token decoding mechanism inherent in traditional autoregressive models leads to low inference efficiency. In this paper, we propose RadAR, an efficient and parallelizable framework designed to accelerate autoregressive visual generation while preserving its representational capacity. Our approach is motivated by the observation that visual tokens exhibit strong local dependencies and spatial correlations with their neighbors, a property not fully exploited in standard raster-scan decoding orders. Specifically, we organize the generation process around a radial topology: an initial token is selected as the starting point, and all other tokens are systematically grouped into multiple concentric rings according to their spatial distances from this center. Generation then proceeds in a ring-wise manner, from inner to outer regions, enabling the parallel prediction of all tokens within the same ring. This design not only preserves the structural locality and spatial coherence of visual scenes but also substantially increases parallelization. Furthermore, to address the risk of inconsistent predictions arising from simultaneous token generation with limited context, we introduce a nested attention mechanism. This mechanism dynamically refines implausible outputs during the forward pass, thereby mitigating error accumulation and preventing model collapse. By integrating radial parallel prediction with dynamic output correction, RadAR significantly improves generation efficiency.
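The ring-wise ordering described above can be made concrete with a short sketch. It groups the cells of a token grid into concentric rings around a start cell; the choice of Chebyshev distance and of the grid center as the default start are assumptions made here for illustration, since the abstract only says tokens are grouped "according to their spatial distances" from an initial token.

```python
def radial_rings(height, width, center=None):
    """Group (row, col) token positions into concentric rings around `center`.

    Returns a list of rings ordered from inner to outer; all positions in one
    ring would be predicted in parallel, rings are visited sequentially.
    """
    if center is None:
        center = (height // 2, width // 2)  # Assumed default starting token.
    center_row, center_col = center
    rings = {}
    for row in range(height):
        for col in range(width):
            # Chebyshev distance yields square rings (an illustrative choice).
            distance = max(abs(row - center_row), abs(col - center_col))
            rings.setdefault(distance, []).append((row, col))
    return [rings[d] for d in sorted(rings)]
```

For a 16x16 grid this ordering needs 9 sequential steps instead of the 256 steps of raster-scan decoding, which is where the parallel speedup would come from; the price is that tokens in the same ring cannot condition on each other, the gap the nested attention mechanism is meant to close.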

Maolin Wang,Bowen Yu,Sheng Zhang,Linjie Mi,Wanyu Wang,Yiqi Wang,Pengyue Jia,Xuetao Wei,Zenglin Xu,Ruocheng Guo,Xiangyu Zhao

Main category: cs.CV

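TL;DR: Proposes RGTN, a renormalization-group-inspired tensor network search framework that achieves efficient and robust tensor decomposition through multi-scale continuous structure evolution and clearly outperforms existing methods in both compression ratio and speed.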

Details Motivation: Existing tensor network structure search methods fall short in computational scalability, structural adaptivity, and optimization robustness; they fail to capture multi-scale structure and optimize inefficiently. Method: The RGTN framework uses renormalization group flows for multi-scale transformation and introduces learnable edge gates together with intelligent structure proposals based on physical quantities (such as node tension and edge information flow), enabling coarse-to-fine continuous structure evolution and dynamic topology adjustment. Result: On light field data, high-order synthetic tensors, and video completion tasks, RGTN achieves state-of-the-art compression ratios and runs 4-600 times faster than existing methods. Conclusion: Through its physics-inspired multi-scale optimization, RGTN overcomes the limitations of conventional TN-SS methods in structure search and is both efficient and highly adaptive. Abstract: Tensor network structure search (TN-SS) aims to automatically discover optimal network topologies and rank configurations for efficient tensor decomposition in high-dimensional data representation. Despite recent advances, existing TN-SS methods face significant limitations in computational tractability, structure adaptivity, and optimization robustness across diverse tensor characteristics. They struggle with three key challenges: single-scale optimization missing multi-scale structures, discrete search spaces hindering smooth structure evolution, and separated structure-parameter optimization causing computational inefficiency. We propose RGTN (Renormalization Group guided Tensor Network search), a physics-inspired framework transforming TN-SS via multi-scale renormalization group flows. Unlike fixed-scale discrete search methods, RGTN uses dynamic scale-transformation for continuous structure evolution across resolutions. Its core innovation includes learnable edge gates for optimization-stage topology modification and intelligent proposals based on physical quantities like node tension measuring local stress and edge information flow quantifying connectivity importance. Starting from low-complexity coarse scales and refining to finer ones, RGTN finds compact structures while escaping local minima via scale-induced perturbations. Extensive experiments on light field data, high-order synthetic tensors, and video completion tasks show RGTN achieves state-of-the-art compression ratios and runs 4-600$\times$ faster than existing methods, validating the effectiveness of our physics-inspired approach.

[147] Evolving, Not Training: Zero-Shot Reasoning Segmentation via Evolutionary Prompting

Kai Ye,Xiaotong You,Jianghang Lin,Jiayi Ji,Pingyang Dai,Liujuan Cao

Main category: cs.CV

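TL;DR: This paper proposes EVOL-SAM3, a training-free zero-shot reasoning segmentation framework that uses an inference-time evolutionary search to overcome existing methods' weaknesses with linguistic hallucinations and spatial misinterpretations.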

Details Motivation: Existing reasoning segmentation methods are limited by the catastrophic forgetting of supervised fine-tuning, the training instability of reinforcement learning, or the static inference of training-free methods, and they lack the ability to self-correct. Method: EVOL-SAM3 models reasoning segmentation as an inference-time evolutionary search that maintains a population of prompt hypotheses and refines them iteratively through a "Generate-Evaluate-Evolve" loop; it introduces a reference-free Visual Arena for pairwise evaluation, a Semantic Mutation operator to correct errors, and a Heterogeneous Arena module that incorporates geometric priors to make the final selection more robust. Result: On the ReasonSeg benchmark, EVOL-SAM3 in the zero-shot setting significantly outperforms static baselines and fully supervised state-of-the-art methods. Conclusion: Through its evolutionary reasoning mechanism, EVOL-SAM3 achieves deeper semantic understanding and self-correction and offers a new paradigm for zero-shot reasoning segmentation. Abstract: Reasoning Segmentation requires models to interpret complex, context-dependent linguistic queries to achieve pixel-level localization. Current dominant approaches rely heavily on Supervised Fine-Tuning (SFT) or Reinforcement Learning (RL). However, SFT suffers from catastrophic forgetting and domain dependency, while RL is often hindered by training instability and rigid reliance on predefined reward functions. Although recent training-free methods circumvent these training burdens, they are fundamentally limited by a static inference paradigm. These methods typically rely on a single-pass "generate-then-segment" chain, which suffers from insufficient reasoning depth and lacks the capability to self-correct linguistic hallucinations or spatial misinterpretations. In this paper, we challenge these limitations and propose EVOL-SAM3, a novel zero-shot framework that reformulates reasoning segmentation as an inference-time evolutionary search process. Instead of relying on a fixed prompt, EVOL-SAM3 maintains a population of prompt hypotheses and iteratively refines them through a "Generate-Evaluate-Evolve" loop. We introduce a Visual Arena to assess prompt fitness via reference-free pairwise tournaments, and a Semantic Mutation operator to inject diversity and correct semantic errors. Furthermore, a Heterogeneous Arena module integrates geometric priors with semantic reasoning to ensure robust final selection. Extensive experiments demonstrate that EVOL-SAM3 not only substantially outperforms static baselines but also significantly surpasses fully supervised state-of-the-art methods on the challenging ReasonSeg benchmark in a zero-shot setting. The code is available at https://github.com/AHideoKuzeA/Evol-SAM3.

[148] FlowBlending: Stage-Aware Multi-Model Sampling for Fast and High-Fidelity Video Generation

Jibin Song,Mingi Kwon,Jaeseok Jeong,Youngjung Uh

Main category: cs.CV

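TL;DR: FlowBlending is a stage-aware multi-model sampling strategy that uses a large model and a small model at different stages of generation, substantially speeding up inference and reducing compute while preserving generation quality.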

Details Motivation: The effect of model capacity differs across the timesteps of the generation process; conventional methods do not exploit this, which wastes compute. Method: Proposes FlowBlending, which uses the large model in the capacity-sensitive early and late stages and the small model in the intermediate stage, introduces simple criteria for choosing the stage boundaries, and uses a velocity-divergence analysis to identify capacity-sensitive regions. Result: On LTX-Video and WAN 2.1, FlowBlending achieves up to 1.65x faster inference and 57.35% fewer FLOPs while maintaining the visual fidelity, temporal coherence, and semantic alignment of the large models; it can also be combined with existing acceleration techniques for up to 2x additional speedup. Conclusion: Through stage-aware model switching, FlowBlending balances generation quality against computational cost and offers a new approach to efficient inference for diffusion models. Abstract: In this work, we show that the impact of model capacity varies across timesteps: it is crucial for the early and late stages but largely negligible during the intermediate stage. Accordingly, we propose FlowBlending, a stage-aware multi-model sampling strategy that employs a large model and a small model at capacity-sensitive stages and intermediate stages, respectively. We further introduce simple criteria to choose stage boundaries and provide a velocity-divergence analysis as an effective proxy for identifying capacity-sensitive regions. Across LTX-Video (2B/13B) and WAN 2.1 (1.3B/14B), FlowBlending achieves up to 1.65x faster inference with 57.35% fewer FLOPs, while maintaining the visual fidelity, temporal coherence, and semantic alignment of the large models. FlowBlending is also compatible with existing sampling-acceleration techniques, enabling up to 2x additional speedup. Project page is available at: https://jibin86.github.io/flowblending_project_page.
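The stage-aware switching described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the fixed boundary fractions `early_frac` and `late_frac`, the scalar state, and the plain Euler update are hypothetical and are not taken from the FlowBlending code; the paper chooses stage boundaries with its own criteria.

```python
from typing import Callable, List, Tuple

# A velocity model maps (state, time) to a velocity; toy scalar state for clarity.
Velocity = Callable[[float, float], float]


def pick_model(step: int, num_steps: int, early_frac: float, late_frac: float) -> str:
    """Return 'large' for early and late steps, 'small' for the middle ones."""
    early_end = int(round(early_frac * num_steps))
    late_start = num_steps - int(round(late_frac * num_steps))
    return "large" if step < early_end or step >= late_start else "small"


def blended_sample(
    x0: float,
    large: Velocity,
    small: Velocity,
    num_steps: int = 10,
    early_frac: float = 0.3,  # hypothetical boundary, not from the paper
    late_frac: float = 0.2,   # hypothetical boundary, not from the paper
) -> Tuple[float, List[str]]:
    """Euler-integrate from t=0 to t=1, switching models by stage."""
    x = x0
    dt = 1.0 / num_steps
    schedule = []
    for step in range(num_steps):
        name = pick_model(step, num_steps, early_frac, late_frac)
        velocity = large if name == "large" else small
        x = x + dt * velocity(x, step * dt)
        schedule.append(name)
    return x, schedule


# Toy usage: both "models" are constant velocity fields.
final_x, schedule = blended_sample(0.0, lambda x, t: 1.0, lambda x, t: 1.0)
```

With the defaults above, the large model handles steps 0 to 2 and 8 to 9, and the small model handles the five steps in between, so half of the model calls go to the cheaper network.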

[149] EchoFoley: Event-Centric Hierarchical Control for Video Grounded Creative Sound Generation

Bingxuan Li,Yiming Cui,Yicheng He,Yiwei Wang,Shu Zhang,Longyin Wen,Yulei Niu

Main category: cs.CV

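TL;DR: This paper proposes the EchoFoley task and the EchoVidia framework for fine-grained, controllable, video-grounded sound generation; by introducing a fine-grained sounding-event representation and the large-scale EchoFoley-6k dataset, it substantially outperforms existing methods in controllability and perceptual quality.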

Details Motivation: Existing video-text-to-audio (VT2A) methods suffer from an imbalance between visual and textual conditioning, lack a definition of fine-grained controllable generation, and have weak instruction understanding, all of which limit the use of sound effects in multimodal storytelling. Method: Proposes the EchoFoley task, which defines event-level local control and hierarchical semantic control; designs a symbolic representation of sounding events that specifies when, what, and how each sound is produced; builds EchoFoley-6k, a high-quality dataset of more than 6,000 samples; and proposes EchoVidia, an agentic generation framework based on a slow-fast thinking strategy. Result: Experiments show that EchoVidia surpasses existing VT2A models by 40.7% in controllability and 12.5% in perceptual quality. Conclusion: The EchoFoley task and the EchoVidia framework address the current limitations of VT2A in controllability and semantic understanding, moving video-grounded sound generation toward finer-grained control. Abstract: Sound effects build an essential layer of multimodal storytelling, shaping the emotional atmosphere and the narrative semantics of videos. Despite recent advances in video-text-to-audio (VT2A), the current formulation faces three key limitations: first, an imbalance between visual and textual conditioning that leads to visual dominance; second, the absence of a concrete definition for fine-grained controllable generation; third, weak instruction understanding and following, as existing datasets rely on brief categorical tags. To address these limitations, we introduce EchoFoley, a new task designed for video-grounded sound generation with both event-level local control and hierarchical semantic control. Our symbolic representation for sounding events specifies when, what, and how each sound is produced within a video or instruction, enabling fine-grained controls like sound generation, insertion, and editing. To support this task, we construct EchoFoley-6k, a large-scale, expert-curated benchmark containing over 6,000 video-instruction-annotation triplets. Building upon this foundation, we propose EchoVidia, a sounding-event-centric agentic generation framework with a slow-fast thinking strategy. Experiments show that EchoVidia surpasses recent VT2A models by 40.7% in controllability and 12.5% in perceptual quality.

[150] Splatwizard: A Benchmark Toolkit for 3D Gaussian Splatting Compression

Xiang Liu,Yimin Zhou,Jinxiang Wang,Yujun Huang,Shuzhao Xie,Shiyu Qin,Mingyao Hong,Jiawei Li,Yaowei Wang,Zhi Wang,Shu-Tao Xia,Bin Chen

Main category: cs.CV

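TL;DR: This paper presents Splatwizard, a unified benchmark toolkit designed for 3D Gaussian Splatting compression models, with automated evaluation of key metrics such as rendering quality, geometric accuracy, frame rate, and resource consumption.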

Details Motivation: Existing evaluation tools cannot comprehensively assess 3DGS compression methods in terms of rendering speed, rate-distortion trade-offs, memory efficiency, and geometric accuracy, so a standardized evaluation platform is needed. Method: Develops Splatwizard, a unified framework that integrates new model implementations, reproductions of existing techniques, and an automated metric pipeline covering image quality, Chamfer distance of the reconstructed mesh, rendering frame rate, and resource consumption. Result: Splatwizard supports the evaluation of a range of 3DGS compression algorithms with consistent, comprehensive performance analysis, and its code is open source to encourage community use. Conclusion: Splatwizard fills the gap left by the absence of a standardized evaluation tool for 3DGS compression and gives future research an extensible, easy-to-use, integrated evaluation solution. Abstract: The recent advent of 3D Gaussian Splatting (3DGS) has marked a significant breakthrough in real-time novel view synthesis. However, the rapid proliferation of 3DGS-based algorithms has created a pressing need for standardized and comprehensive evaluation tools, especially for the compression task. Existing benchmarks often lack the specific metrics necessary to holistically assess the unique characteristics of different methods, such as rendering speed, rate-distortion trade-offs, memory efficiency, and geometric accuracy. To address this gap, we introduce Splatwizard, a unified benchmark toolkit designed specifically for benchmarking 3DGS compression models. Splatwizard provides an easy-to-use framework for implementing new 3DGS compression models and for using state-of-the-art techniques proposed in previous work. The framework also includes an integrated pipeline that automates the calculation of key performance indicators, including image-based quality metrics, Chamfer distance of the reconstructed mesh, rendering frame rates, and computational resource consumption. Code is available at https://github.com/splatwizard/splatwizard
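As a point of reference for the geometric metric mentioned above, here is a brute-force sketch of a symmetric Chamfer distance between two point sets. It is not Splatwizard's implementation, and the convention chosen here (mean of unsquared nearest-neighbour distances, summed over both directions) is an assumption; squared-distance and max-based variants are also in common use.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def chamfer_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Chamfer distance between two non-empty 3D point sets.

    Brute force, O(len(a) * len(b)); real toolkits typically use a KD-tree
    or GPU kernels for points sampled from reconstructed meshes.
    """

    def one_way(src: Sequence[Point], dst: Sequence[Point]) -> float:
        # Mean distance from each source point to its nearest destination point.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(a, b) + one_way(b, a)


# Toy usage with two tiny point sets.
set_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
set_b = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
distance = chamfer_distance(set_a, set_b)
```

In this toy case each direction contributes a mean nearest-neighbour distance of 0.5, so `distance` evaluates to 1.0; identical point sets give 0.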

[151] UniC-Lift: Unified 3D Instance Segmentation via Contrastive Learning

Ankit Dhiman,Srinath R,Jaswanth Reddy,Lokesh R Boregowda,Venkatesh Babu Radhakrishnan

Main category: cs.CV

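TL;DR: Proposes a unified 3D instance segmentation framework that combines learnable feature embeddings on Gaussian primitives, an "Embedding-to-Label" decoding mechanism, and hard mining along object boundaries to improve instance segmentation in 3D Gaussian Splatting and NeRF scenes.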

Details Motivation: Addresses the poor 3D predictions caused by inconsistent multi-view 2D instance labels, as well as the limitations of two-stage methods, which are slow to train and depend on sensitive hyperparameters or preprocessing. Method: Designs a unified framework that integrates feature-embedding learning with label generation: learnable feature embeddings are attached to the Gaussian primitives and decoded into instance labels through a novel "Embedding-to-Label" process. To reduce boundary artifacts, boundary samples are hard-mined with a triplet loss computed on features passed through a linear layer. Result: Achieves better qualitative and quantitative results than the baselines on the ScanNet, Replica3D, and Messy-Rooms datasets, improving performance while reducing training time. Conclusion: Through unified optimization and boundary-aware hard mining, the proposed method markedly improves 3D instance segmentation in 3DGS and NeRF, with greater robustness and practicality. Abstract: 3D Gaussian Splatting (3DGS) and Neural Radiance Fields (NeRF) have advanced novel-view synthesis. Recent methods extend multi-view 2D segmentation to 3D, enabling instance/semantic segmentation for better scene understanding. A key challenge is the inconsistency of 2D instance labels across views, leading to poor 3D predictions. Existing methods use a two-stage approach in which some rely on contrastive learning with hyperparameter-sensitive clustering, while others preprocess labels for consistency. We propose a unified framework that merges these steps, reducing training time and improving performance by introducing a learnable feature embedding for segmentation in Gaussian primitives. This embedding is then efficiently decoded into instance labels through a novel "Embedding-to-Label" process, effectively integrating the optimization. While this unified framework offers substantial benefits, we observed artifacts at the object boundaries. To address the object boundary issues, we propose hard-mining samples along these boundaries. However, directly applying hard mining to the feature embeddings proved unstable. Therefore, we apply a linear layer to the rasterized feature embeddings before calculating the triplet loss, which stabilizes training and significantly improves performance. Our method outperforms baselines qualitatively and quantitatively on the ScanNet, Replica3D, and Messy-Rooms datasets.
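The last step of the abstract (a linear layer applied to the embeddings before the triplet loss) can be sketched in plain Python. This is only a toy illustration under stated assumptions: the helper names, the Euclidean distance, the margin value, and the hand-written weights are all hypothetical, and the paper's actual layer, sampling of boundary triplets, and loss details are not given in this listing.

```python
import math
from typing import List, Sequence


def linear(x: Sequence[float], weight: Sequence[Sequence[float]], bias: Sequence[float]) -> List[float]:
    """Apply y = W x + b to one feature vector (toy stand-in for a learned layer)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def triplet_loss(anchor: Sequence[float], positive: Sequence[float],
                 negative: Sequence[float], margin: float = 1.0) -> float:
    """Standard hinge-style triplet loss on Euclidean distances."""
    return max(0.0, math.dist(anchor, positive) - math.dist(anchor, negative) + margin)


# Hypothetical 2D embeddings for a boundary pixel (anchor), a pixel of the same
# instance (positive), and a pixel of a neighbouring instance (negative).
W = [[1.0, 0.0], [0.0, 1.0]]  # identity weights, for illustration only
b = [0.0, 0.0]
anchor, positive, negative = [0.0, 0.0], [0.0, 1.0], [1.0, 0.0]

# The loss is computed on the projected features, not on the raw embeddings.
loss = triplet_loss(linear(anchor, W, b), linear(positive, W, b), linear(negative, W, b))
```

Here the positive and the negative are equally far from the anchor, so the loss equals the margin; it drops to zero once the negative is at least one margin farther away than the positive.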

[152] Projection-based Adversarial Attack using Physics-in-the-Loop Optimization for Monocular Depth Estimation

Takeru Kusakabe,Yudai Hirose,Mashiho Mukaida,Satoshi Ono

Main category: cs.CV

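TL;DR: Proposes a projection-based adversarial attack that uses physics-in-the-loop optimization and a distributed covariance matrix adaptation evolution strategy to demonstrate the vulnerability of deep neural networks in monocular depth estimation.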

Details Motivation: Deep neural networks used for monocular depth estimation are vulnerable to adversarial attacks, which undermines their reliability, so their vulnerability in real environments needs to be verified. Method: Proposes a projection-based adversarial attack that uses physics-in-the-loop (PITL) optimization together with a distributed covariance matrix adaptation evolution strategy, projecting perturbation light onto a target object to create adversarial examples. Result: Experiments confirm that the method creates adversarial examples that cause depth misestimation, making parts of objects disappear from the target scene. Conclusion: DNN-based MDE models are markedly vulnerable to real-world attacks, and the proposed method effectively exposes this lack of robustness. Abstract: Deep neural networks (DNNs) remain vulnerable to adversarial attacks that cause misclassification when specific perturbations are added to input images. This vulnerability also threatens the reliability of DNN-based monocular depth estimation (MDE) models, making robustness enhancement a critical need in practical applications. To validate the vulnerability of DNN-based MDE models, this study proposes a projection-based adversarial attack method that projects perturbation light onto a target object. The proposed method employs physics-in-the-loop (PITL) optimization -- evaluating candidate solutions in actual environments to account for device specifications and disturbances -- and utilizes a distributed covariance matrix adaptation evolution strategy. Experiments confirmed that the proposed method successfully created adversarial examples that lead to depth misestimations, resulting in parts of objects disappearing from the target scene.

[153] Nonlinear Noise2Noise for Efficient Monte Carlo Denoiser Training

Andrew Tinits,Stephen Mann

Main category: cs.CV

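TL;DR: This paper proposes an improved Noise2Noise method. A theoretical analysis shows that certain nonlinear functions introduce very little bias into denoiser training, which allows nonlinear tone mapping to be used in high dynamic range (HDR) image denoising to suppress the influence of outliers; training on noisy data alone then approaches the quality of training that originally required clean targets.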

Details Motivation: Noise2Noise trains denoisers without clean targets, but nonlinear functions cannot be applied directly to the target images because they introduce bias. Outliers in HDR images make training difficult, and nonlinear tone mapping is the usual remedy, yet it was previously considered incompatible with Noise2Noise. This paper aims to remove that restriction. Method: Builds a theoretical framework for analyzing how nonlinear functions affect Noise2Noise training, defines a class of nonlinear functions that introduce minimal bias, and pairs specific loss functions with tone mapping functions so that dynamic range is reduced and training is stabilized while bias stays low. Result: Applied to a machine learning-based Monte Carlo rendering denoiser, training on noisy data alone approaches the results of the original training with high-sample-count reference images, confirming that nonlinear processing is feasible and effective on HDR images. Conclusion: Certain nonlinear operations can be used safely within the Noise2Noise framework, particularly for HDR denoising and other settings with severe noise and outliers, which widens the range of applications of Noise2Noise and makes it more practical. Abstract: The Noise2Noise method allows for training machine learning-based denoisers with pairs of input and target images where both the input and target can be noisy. This removes the need for training with clean target images, which can be difficult to obtain. However, Noise2Noise training has a major limitation: nonlinear functions applied to the noisy targets will skew the results. This bias occurs because the nonlinearity makes the expected value of the noisy targets different from the clean target image. Since nonlinear functions are common in image processing, avoiding them limits the types of preprocessing that can be performed on the noisy targets. Our main insight is that certain nonlinear functions can be applied to the noisy targets without adding significant bias to the results. We develop a theoretical framework for analyzing the effects of these nonlinearities, and describe a class of nonlinear functions with minimal bias. We demonstrate our method on the denoising of high dynamic range (HDR) images produced by Monte Carlo rendering. Noise2Noise training can have trouble with HDR images, where the training process is overwhelmed by outliers and performs poorly. We consider a commonly used method of addressing these training issues: applying a nonlinear tone mapping function to the model output and target images to reduce their dynamic range. This method was previously thought to be incompatible with Noise2Noise training because of the nonlinearities involved. 
We show that certain combinations of loss functions and tone mapping functions can reduce the effect of outliers while introducing minimal bias. We apply our method to an existing machine learning-based Monte Carlo denoiser, where the original implementation was trained with high-sample-count reference images. Our results approach those of the original implementation, but are produced using only noisy training data.
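The source of the bias described in the abstract (a nonlinearity changes the expected value of the noisy targets) can be checked with a two-sample toy example. The Reinhard-style curve `x / (1 + x)` is used here only as a familiar tone mapping function and is an assumption, as are the sample values; the abstract does not say which tone mapping or loss functions the authors analyze.

```python
from typing import Callable, Sequence


def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def reinhard(x: float) -> float:
    """A simple nonlinear tone mapping curve (assumed here for illustration)."""
    return x / (1.0 + x)


def halve(x: float) -> float:
    """A linear map, included as the unbiased control case."""
    return 0.5 * x


def target_bias(f: Callable[[float], float], noisy_targets: Sequence[float]) -> float:
    """E[f(noisy)] - f(E[noisy]): zero means f leaves the expected target unchanged."""
    return mean([f(v) for v in noisy_targets]) - f(mean(noisy_targets))


# Two equally likely noisy observations of a pixel whose clean value is 1.0.
noisy_targets = [0.0, 2.0]

bias_nonlinear = target_bias(reinhard, noisy_targets)  # 1/3 - 1/2
bias_linear = target_bias(halve, noisy_targets)        # 0.5 - 0.5
```

The linear map gives zero bias, while the tone-mapped targets average to 1/3 instead of the tone-mapped clean value 1/2, a bias of about -0.167. This is the effect that, according to the abstract, can be kept minimal for certain combinations of loss and tone mapping functions.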

[154] Video and Language Alignment in 2D Systems for 3D Multi-object Scenes with Multi-Information Derivative-Free Control

Jason Armitage,Rico Sennrich

Main category: cs.CV

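TL;DR: Proposes a new method based on derivative-free optimization and regret minimization that improves the ability of cross-modal systems to adapt to 3D scenes, improving task performance on multi-object 3D scenes without pretraining or finetuning.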

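Details Motivation: Cross-modal systems face a dimensionality mismatch when moving from 2D visual inputs to 3D scenes, and they must also cope with object occlusions and the need to differentiate features. Method: Improves multivariate mutual information estimates through regret minimization combined with derivative-free optimization, and uses these estimates to control an in-scene camera so that off-the-shelf cross-modal systems can adapt online to 3D environments. Result: The method improves performance on cross-modal tasks in multi-object 3D scenes, handles occlusions and differentiates features effectively, and needs no additional pretraining or finetuning. Conclusion: The method offers an efficient, plug-and-play way to apply 2D-trained cross-modal systems to 3D scenes and advances adaptive vision-language systems that do not require retraining. Abstract: Cross-modal systems trained on 2D visual inputs are presented with a dimensional shift when processing 3D scenes. An in-scene camera bridges the dimensionality gap but requires learning a control module. We introduce a new method that improves multivariate mutual information estimates by regret minimisation with derivative-free optimisation. Our algorithm enables off-the-shelf cross-modal systems trained on 2D visual inputs to adapt online to object occlusions and differentiate features. The pairing of expressive measures and value-based optimisation assists control of an in-scene camera to learn directly from the noisy outputs of vision-language models. The resulting pipeline improves performance in cross-modal tasks on multi-object 3D scenes without resorting to pretraining or finetuning.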

[155] CropTrack: A Tracking with Re-Identification Framework for Precision Agriculture

Md Ahmed Al Muzaddid,Jordan A. James,William J. Beksi

Main category: cs.CV

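TL;DR: Proposes CropTrack, a new multiple-object tracking framework that combines appearance and motion information to address the tracking challenges posed by frequent occlusions and similar-looking targets in agricultural environments.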

Details Motivation: Repetitive patterns, similar object appearances, sudden illumination changes, and frequent occlusions in agricultural environments make it hard for existing trackers to maintain object identities; methods that rely on motion information perform especially poorly under strong occlusion. Method: CropTrack introduces a reranking-enhanced appearance association, a one-to-many association strategy with appearance-based conflict resolution, and an exponential moving average prototype feature bank to make appearance-based association more robust. Result: On publicly available agricultural MOT datasets, CropTrack preserves identities well; compared with traditional motion-based methods and the state of the art, it achieves significant gains in identification F1 and association accuracy with fewer identity switches. Conclusion: By effectively fusing appearance and motion information, CropTrack markedly improves multiple-object tracking in agricultural scenes, especially under frequent occlusion and high appearance similarity. Abstract: Multiple-object tracking (MOT) in agricultural environments presents major challenges due to repetitive patterns, similar object appearances, sudden illumination changes, and frequent occlusions. Contemporary trackers in this domain rely on the motion of objects rather than appearance for association. Nevertheless, they struggle to maintain object identities when targets undergo frequent and strong occlusions. The high similarity of object appearances makes integrating appearance-based association nontrivial for agricultural scenarios. To solve this problem, we propose CropTrack, a novel MOT framework based on the combination of appearance and motion information. CropTrack integrates a reranking-enhanced appearance association, a one-to-many association with appearance-based conflict resolution strategy, and an exponential moving average prototype feature bank to improve appearance-based association. Evaluated on publicly available agricultural MOT datasets, CropTrack demonstrates consistent identity preservation, outperforming traditional motion-based tracking methods. Compared to the state of the art, CropTrack achieves significant gains in identification F1 and association accuracy scores with a lower number of identity switches.
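One ingredient named above, an exponential moving average (EMA) prototype feature bank, follows a generic pattern that can be sketched briefly. The class name, the `momentum` parameter, and the list-based features are hypothetical choices for this sketch; CropTrack's actual bank, its feature extractor, and its update schedule are not described in this listing.

```python
from typing import Dict, List, Sequence


class PrototypeBank:
    """Keeps one EMA-smoothed appearance prototype per track ID (illustrative only)."""

    def __init__(self, momentum: float = 0.9) -> None:
        self.momentum = momentum  # weight given to the existing prototype
        self.prototypes: Dict[int, List[float]] = {}

    def update(self, track_id: int, feature: Sequence[float]) -> List[float]:
        """Blend a new appearance feature into the stored prototype for a track."""
        old = self.prototypes.get(track_id)
        if old is None:
            # First observation of this track: the feature becomes the prototype.
            self.prototypes[track_id] = list(feature)
        else:
            m = self.momentum
            self.prototypes[track_id] = [m * o + (1.0 - m) * f for o, f in zip(old, feature)]
        return self.prototypes[track_id]


# Toy usage with 2D features and an even blend.
bank = PrototypeBank(momentum=0.5)
bank.update(1, [1.0, 0.0])
prototype = bank.update(1, [0.0, 1.0])
```

A higher momentum makes the prototype change more slowly, which in general makes it less sensitive to a single corrupted observation, for example a partially occluded detection.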

[156] VLN-MME: Diagnosing MLLMs as Language-guided Visual Navigation agents

Xunyi Zhao,Gengze Zhou,Qi Wu

Main category: cs.CV

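TL;DR: This paper presents VLN-MME, a unified and extensible evaluation framework for probing the potential of multimodal large language models (MLLMs) as zero-shot embodied agents in Vision-and-Language Navigation (VLN). The study finds that although MLLMs can follow instructions and structure their output, they perform poorly at 3D spatial reasoning and context awareness, and adding Chain-of-Thought and self-reflection actually lowers performance. The work exposes the limitations of MLLMs in sequential decision-making and offers guidance for subsequent training of embodied agents.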

Details Motivation: The capabilities of multimodal large language models (MLLMs) as embodied agents remain unclear, especially in Vision-and-Language Navigation, which requires multi-round dialogue, spatial reasoning, and sequential action prediction, so a systematic evaluation is needed. Method: Proposes the VLN-MME framework, which consolidates traditional navigation datasets into a standardized benchmark, uses a modular design to support structured comparisons and component-level ablations across MLLM architectures, agent designs, and navigation tasks, and tests the effect of adding Chain-of-Thought and self-reflection. Result: Experiments show that adding Chain-of-Thought and self-reflection lowers the performance of the baseline agent, indicating that MLLMs have weak context awareness and insufficient 3D spatial reasoning in embodied navigation. Conclusion: In their current state, MLLMs have limited ability as embodied navigation agents, with clear deficits in sequential decision-making and spatial understanding; VLN-MME provides a base evaluation platform for future research, and the results point to the need for targeted improvements in context and spatial modeling. Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities across a wide range of vision-language tasks. However, their performance as embodied agents, which requires multi-round dialogue, spatial reasoning, and sequential action prediction, needs further exploration. Our work investigates this potential in the context of Vision-and-Language Navigation (VLN) by introducing a unified and extensible evaluation framework to probe MLLMs as zero-shot agents by bridging traditional navigation datasets into a standardized benchmark, named VLN-MME. We simplify the evaluation with a highly modular and accessible design. This flexibility streamlines experiments, enabling structured comparisons and component-level ablations across diverse MLLM architectures, agent designs, and navigation tasks. Crucially, enabled by our framework, we observe that enhancing our baseline agent with Chain-of-Thought (CoT) reasoning and self-reflection leads to an unexpected performance decrease. This suggests MLLMs exhibit poor context awareness in embodied navigation tasks; although they can follow instructions and structure their output, their 3D spatial reasoning fidelity is low. VLN-MME lays the groundwork for systematic evaluation of general-purpose MLLMs in embodied navigation settings and reveals limitations in their sequential decision-making capabilities. We believe these findings offer crucial guidance for MLLM post-training as embodied agents.

[157] OFL-SAM2: Prompt SAM2 with Online Few-shot Learner for Efficient Medical Image Segmentation

Meng Lan,Lefei Zhang,Xiaomeng Li

Main category: cs.CV

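TL;DR: This paper proposes OFL-SAM2, a SAM2 framework that needs no manual prompts, for label-efficient medical image segmentation. A lightweight mapping network learns and fuses target features online, enabling accurate segmentation of 3D volumes or temporal 2D medical image sequences with only a small amount of annotated data.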

Details Motivation: Applying SAM2 to medical image segmentation requires large amounts of annotated data and high-quality manual prompts, which is time-consuming and depends on expert intervention; a self-adapting framework with low label dependence and no manual prompts is therefore needed. Method: Designs a lightweight mapping network that learns medical knowledge from limited annotated samples and transforms generic image features into target features. The network supports online parameter updates during inference, and an adaptive fusion module dynamically combines its output with the memory-attention features of the frozen SAM2 to produce accurate segmentation. Result: Experiments on three different medical image segmentation datasets show that OFL-SAM2 reaches state-of-the-art performance with very little training data, clearly outperforming existing methods. Conclusion: OFL-SAM2 removes SAM2's dependence on annotated data and manual prompts in medical image segmentation; through online few-shot learning and adaptive feature fusion it achieves efficient, robust segmentation with good prospects for clinical use. Abstract: The Segment Anything Model 2 (SAM2) has demonstrated remarkable promptable visual segmentation capabilities in video data, showing potential for extension to medical image segmentation (MIS) tasks involving 3D volumes and temporally correlated 2D image sequences. However, adapting SAM2 to MIS presents several challenges, including the need for extensive annotated medical data for fine-tuning and high-quality manual prompts, which are both labor-intensive and require intervention from medical experts. To address these challenges, we introduce OFL-SAM2, a prompt-free SAM2 framework for label-efficient MIS. Our core idea is to leverage limited annotated samples to train a lightweight mapping network that captures medical knowledge and transforms generic image features into target features, thereby providing additional discriminative target representations for each frame and eliminating the need for manual prompts. Crucially, the mapping network supports online parameter updates during inference, enhancing the model's generalization across test sequences. Technically, we introduce two key components: (1) an online few-shot learner that trains the mapping network to generate target features using limited data, and (2) an adaptive fusion module that dynamically integrates the target features with the memory-attention features generated by frozen SAM2, leading to accurate and robust target representation. Extensive experiments on three diverse MIS datasets demonstrate that OFL-SAM2 achieves state-of-the-art performance with limited training data.

[158] FinMMDocR: Benchmarking Financial Multimodal Reasoning with Scenario Awareness, Document Understanding, and Multi-Step Computation

Zichen Tang,Haihong E,Rongjin Li,Jiacheng Liu,Linwei Jia,Zhuodi Hao,Zhongjun Yang,Yuanze Li,Haolin Tian,Xinyi Hu,Peizhi Zhao,Yuan Liu,Zhengyu Wang,Xianghe Wang,Yiling Huang,Xueyuan Lin,Ruofei Bai,Zijian Xie,Qian Huang,Ruining Cao,Haocheng Gao

Main category: cs.CV

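TL;DR: FinMMDocR is a new bilingual multimodal benchmark for evaluating multimodal large language models on real-world financial numerical reasoning, characterized by scenario awareness, document understanding, and multi-step computation.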

Details Motivation: Existing benchmarks do not adequately evaluate multimodal reasoning in real financial scenarios and lack support for complex financial documents and expert-level reasoning. Method: Builds a bilingual multimodal dataset of 1,200 expert-annotated problems covering 12 types of implicit financial scenarios and 9 types of long financial documents; problems require 11 reasoning steps on average, and 65% require integrating evidence across pages. Result: The best MLLM reaches only 58.0% accuracy, and different RAG methods vary significantly in performance, which shows how challenging the benchmark is. Conclusion: FinMMDocR can drive progress in the multimodal reasoning of MLLMs and reasoning-enhanced methods in complex real-world scenarios. Abstract: We introduce FinMMDocR, a novel bilingual multimodal benchmark for evaluating multimodal large language models (MLLMs) on real-world financial numerical reasoning. Compared to existing benchmarks, our work delivers three major advancements. (1) Scenario Awareness: 57.9% of 1,200 expert-annotated problems incorporate 12 types of implicit financial scenarios (e.g., Portfolio Management), challenging models to perform expert-level reasoning based on assumptions; (2) Document Understanding: 837 Chinese/English documents spanning 9 types (e.g., Company Research) average 50.8 pages with rich visual elements, significantly surpassing existing benchmarks in both breadth and depth of financial documents; (3) Multi-Step Computation: Problems demand 11-step reasoning on average (5.3 extraction + 5.7 calculation steps), with 65.0% requiring cross-page evidence (2.4 pages average). The best-performing MLLM achieves only 58.0% accuracy, and different retrieval-augmented generation (RAG) methods show significant performance variations on this task. We expect FinMMDocR to drive improvements in MLLMs and reasoning-enhanced methods on complex multimodal reasoning tasks in real-world scenarios.

[159] Semi-Supervised Diversity-Aware Domain Adaptation for 3D Object detection

Bartłomiej Olber,Jakub Winter,Paweł Wawrzyński,Andrii Gamalii,Daniel Górniak,Marcin Łojek,Robert Nowak,Krystian Radlak

Main category: cs.CV

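TL;DR: This paper proposes a new lidar domain adaptation method based on neuron activation patterns that achieves state-of-the-art 3D object detection performance by annotating only a small number of representative and diverse samples from the target domain.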

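Details Motivation: Existing 3D object detectors generalize poorly across domains (for example, a model trained in the U.S. degrades in Asia or Europe), so effective domain adaptation methods are needed to make models usable in different regions. Method: Uses neuron activation patterns to select a small number of the most representative and diverse target-domain samples for annotation, and combines this with post-training techniques inspired by continual learning that keep the model weights from drifting away from the original model. Result: Experiments show that, with a very small annotation budget, the proposed method outperforms linear probing and existing state-of-the-art domain adaptation techniques. Conclusion: Carefully selecting key samples from the target domain and combining them with techniques that prevent weight drift enables efficient, high-performing cross-domain adaptation for 3D object detection. Abstract: 3D object detectors are fundamental components of perception systems in autonomous vehicles. While these detectors achieve remarkable performance on standard autonomous driving benchmarks, they often struggle to generalize across different domains - for instance, a model trained in the U.S. may perform poorly in regions like Asia or Europe. This paper presents a novel lidar domain adaptation method based on neuron activation patterns, demonstrating that state-of-the-art performance can be achieved by annotating only a small, representative, and diverse subset of samples from the target domain if they are correctly selected. The proposed approach requires a very small annotation budget and is combined with post-training techniques inspired by continual learning that prevent weight drift from the original model. Empirical evaluation shows that the proposed domain adaptation approach outperforms both linear probing and state-of-the-art domain adaptation techniques.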

[160] HaineiFRDM: Explore Diffusion to Restore Defects in Fast-Movement Films

Rongji Xun,Junjie Yuan,Zhongjie Wang

Main category: cs.CV

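TL;DR: Proposes HaineiFRDM, a diffusion-based film restoration framework that restores high-resolution films using a global-local frequency module and a patch-wise training strategy, and builds a dataset of real degraded films; it clearly outperforms existing open-source methods.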

Details Motivation: Existing open-source film restoration methods have limited performance because they are trained on low-quality synthetic data and use noisy optical flows, and they have not addressed high-resolution restoration. Method: Proposes HaineiFRDM, which uses a patch-wise training and testing strategy to handle high resolutions; designs a position-aware Global Prompt and Frame Fusion Modules; introduces a global-local frequency module for consistent textures; uses a low-resolution result as a global residual to reduce patching artifacts; and builds a film restoration dataset containing real degraded films and synthetic data. Result: Experiments show that the method clearly outperforms existing open-source methods in defect restoration, particularly on high-resolution films. Conclusion: HaineiFRDM makes effective use of the content-understanding ability of diffusion models and, together with its new modules and high-quality dataset, achieves advanced open-source film restoration performance and advances the restoration of high-resolution old films. Abstract: Existing open-source film restoration methods show limited performance compared to commercial methods due to training with low-quality synthetic data and employing noisy optical flows. In addition, high-resolution films have not been explored by the open-source methods. We propose HaineiFRDM (Film Restoration Diffusion Model), a film restoration framework, to explore the diffusion model's powerful content-understanding ability to help human experts better restore indistinguishable film defects. Specifically, we employ a patch-wise training and testing strategy to make restoring high-resolution films on one 24GB-VRAM GPU possible and design a position-aware Global Prompt and Frame Fusion Modules. Also, we introduce a global-local frequency module to reconstruct consistent textures among different patches. Besides, we first restore a low-resolution result and use it as a global residual to mitigate blocky artifacts caused by the patching process. Furthermore, we construct a film restoration dataset that contains restored real-degraded films and realistic synthetic data. Comprehensive experimental results conclusively demonstrate the superiority of our model in defect restoration ability over existing open-source methods. Code and the dataset will be released.

[161] ProDM: Synthetic Reality-driven Property-aware Progressive Diffusion Model for Coronary Calcium Motion Correction in Non-gated Chest CT

Xinran Gong,Gorkem Durak,Halil Ertugrul Aktas,Vedat Cicek,Jinkui Hao,Ulas Bagci,Nilay S. Shah,Bo Zhou

Main category: cs.CV

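TL;DR: This paper proposes ProDM, a generative diffusion model that restores motion-artifact-free coronary calcified lesions from non-gated chest CT, improving the accuracy and clinical usability of CAC scoring.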

Details Motivation: Motion artifacts in non-gated chest CT seriously degrade the accuracy of coronary artery calcium (CAC) scoring, while gated CT has limited availability, so a method for reliable CAC quantification from routine CT is needed. Method: Proposes the ProDM framework with three key parts: (1) a CAC motion simulation engine that synthesizes non-gated images from gated CT; (2) property-aware learning with a differentiable calcium consistency loss that incorporates calcium-specific priors; and (3) a progressive correction mechanism that removes artifacts gradually across diffusion steps. Result: Experiments show that ProDM significantly improves CAC scoring accuracy, spatial lesion fidelity, and risk stratification on real patient data, and a reader study confirms that it suppresses motion artifacts and improves clinical usability. Conclusion: ProDM is a promising solution for reliable CAC quantification from routine non-gated chest CT and can help broaden access to cardiovascular risk assessment. Abstract: Coronary artery calcium (CAC) scoring from chest CT is a well-established tool to stratify and refine clinical cardiovascular disease risk estimation. CAC quantification relies on the accurate delineation of calcified lesions, but is oftentimes affected by artifacts introduced by cardiac and respiratory motion. ECG-gated cardiac CTs substantially reduce motion artifacts, but their use in population screening and routine imaging remains limited due to gating requirements and lack of insurance coverage. Although identification of incidental CAC from non-gated chest CT is increasingly considered because it offers an accessible and widely available alternative, this modality is limited by more severe motion artifacts. We present ProDM (Property-aware Progressive Correction Diffusion Model), a generative diffusion framework that restores motion-free calcified lesions from non-gated CTs. ProDM introduces three key components: (1) a CAC motion simulation data engine that synthesizes realistic non-gated acquisitions with diverse motion trajectories directly from cardiac-gated CTs, enabling supervised training without paired data; (2) a property-aware learning strategy incorporating calcium-specific priors through a differentiable calcium consistency loss to preserve lesion integrity; and (3) a progressive correction scheme that reduces artifacts gradually across diffusion steps to enhance stability and calcium fidelity. Experiments on real patient datasets show that ProDM significantly improves CAC scoring accuracy, spatial lesion fidelity, and risk stratification performance compared with several baselines. 
A reader study on real non-gated scans further confirms that ProDM suppresses motion artifacts and improves clinical usability. These findings highlight the potential of progressive, property-aware frameworks for reliable CAC quantification from routine chest CT imaging.

[162] VIPER: Process-aware Evaluation for Generative Video Reasoning

Yifan Li,Yukai Gu,Yingqian Min,Zikang Liu,Yifan Du,Kun Zhou,Min Yang,Wayne Xin Zhao,Minghui Qiu

Main category: cs.CV

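TL;DR: This paper proposes a new evaluation paradigm for generative video reasoning, introducing the VIPER benchmark and the POC@r metric to assess both process consistency and outcome correctness on complex tasks; it finds that current state-of-the-art models exhibit severe outcome-hacking and remain far from genuine visual reasoning.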

Details Motivation: Existing video generation evaluations mostly rely on single-frame judgments, which lets a model reach a correct result through a wrong reasoning process (outcome-hacking) and leaves the validity of the intermediate steps unassessed. Method: Proposes the VIPER benchmark, covering 16 tasks across temporal, structural, symbolic, spatial, physics, and planning reasoning, and designs the Process-outcome Consistency (POC@r) metric, which uses VLM-as-Judge with a hierarchical rubric to evaluate the consistency of the reasoning process and the result. Result: Experiments show that current state-of-the-art video generation models reach only about 20% POC@1.0 and exhibit significant outcome-hacking; test-time scaling and sampling robustness also show clear limitations. Conclusion: Current video generation models are still far from genuine generalized visual reasoning, and evaluation needs to pay more attention to the soundness of the reasoning process; VIPER and POC@r provide tools and direction for future research. Abstract: Recent breakthroughs in video generation have demonstrated an emerging capability termed Chain-of-Frames (CoF) reasoning, where models resolve complex tasks through the generation of continuous frames. While these models show promise for Generative Video Reasoning (GVR), existing evaluation frameworks often rely on single-frame assessments, which can lead to outcome-hacking, where a model reaches a correct conclusion through an erroneous process. To address this, we propose a process-aware evaluation paradigm. We introduce VIPER, a comprehensive benchmark spanning 16 tasks across temporal, structural, symbolic, spatial, physics, and planning reasoning. Furthermore, we propose Process-outcome Consistency (POC@r), a new metric that utilizes VLM-as-Judge with a hierarchical rubric to evaluate both the validity of the intermediate steps and the final result. Our experiments reveal that state-of-the-art video models achieve only about 20% POC@1.0 and exhibit significant outcome-hacking. We further explore the impact of test-time scaling and sampling robustness, highlighting a substantial gap between current video generation and true generalized visual reasoning. Our benchmark will be publicly released.

[163] ShowUI-$π$: Flow-based Generative Models as GUI Dexterous Hands

Siyuan Hu,Kevin Qinghong Lin,Mike Zheng Shou

Main category: cs.CV

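TL;DR: This paper presents ShowUI-π, the first flow-based generative model for dexterous manipulation in GUI agents. It models discrete clicks and continuous drags in a unified way and introduces the ScreenDrag benchmark; experiments show that existing methods perform poorly on drag tasks, while ShowUI-π achieves better performance with fewer parameters.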

Details Motivation: Existing GUI agents rely on discrete click predictions and cannot support continuous interactions that require real-time perception and adjustment (such as dragging a progress bar); this lack of dexterous manipulation limits their ability to automate complex digital environments. Method: Proposes ShowUI-π with three core designs: (i) a unified discrete-continuous action model that handles clicks and drags in a shared framework; (ii) flow-based action generation, in which a lightweight action expert predicts incremental cursor adjustments from visual input; and (iii) a dataset of 20K drag trajectories and the ScreenDrag benchmark, which supports online and offline evaluation. Result: Experiments show that existing proprietary GUI agents perform poorly on ScreenDrag (for example, Operator scores 13.27 and Gemini-2.5-CUA scores 22.18), while ShowUI-π reaches 26.98 with only 450M parameters, clearly outperforming existing methods. Conclusion: ShowUI-π moves GUI agents toward human-level dexterous control, confirms the effectiveness of unified modeling and flow-based generation on complex interaction tasks, and points to a new direction for intelligent agents in digital environments. Abstract: Building intelligent agents capable of dexterous manipulation is essential for achieving human-like automation in both robotics and digital environments. However, existing GUI agents rely on discrete click predictions (x,y), which prohibit free-form, closed-loop trajectories (e.g. dragging a progress bar) that require continuous, on-the-fly perception and adjustment. In this work, we develop ShowUI-$π$, the first flow-based generative model as GUI dexterous hand, featuring the following designs: (i) Unified Discrete-Continuous Actions, integrating discrete clicks and continuous drags within a shared model, enabling flexible adaptation across diverse interaction modes; (ii) Flow-based Action Generation for drag modeling, which predicts incremental cursor adjustments from continuous visual observations via a lightweight action expert, ensuring smooth and stable trajectories; (iii) Drag Training data and Benchmark, where we manually collect and synthesize 20K drag trajectories across five domains (e.g. PowerPoint, Adobe Premiere Pro), and introduce ScreenDrag, a benchmark with comprehensive online and offline evaluation protocols for assessing GUI agents' drag capabilities. Our experiments show that proprietary GUI agents still struggle on ScreenDrag (e.g. Operator scores 13.27, and the best Gemini-2.5-CUA reaches 22.18). In contrast, ShowUI-$π$ achieves 26.98 with only 450M parameters, underscoring both the difficulty of the task and the effectiveness of our approach. 
We hope this work advances GUI agents toward human-like dexterous control in the digital world. The code is available at https://github.com/showlab/showui-pi.

[164] Evaluating the Impact of Compression Techniques on the Robustness of CNNs under Natural Corruptions

Itallo Patrick Castro Alves Da Silva,Emanuel Adler Medeiros Pereira,Erick de Andrade Barboza,Baldoino Fonseca dos Santos Neto,Marcio de Medeiros Ribeiro

Main category: cs.CV

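TL;DR: This paper systematically evaluates how compression techniques such as quantization, pruning, and weight clustering affect the robustness of convolutional neural networks under natural corruptions, and finds that some compression strategies not only preserve but can even improve robustness, especially for more complex architectures.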

Details Motivation: Model compression may affect the robustness of deep learning models under natural corruptions, so deployment on resource-constrained devices calls for a joint assessment of how compression methods trade off accuracy, compression ratio, and robustness. Method: Applies quantization, pruning, and weight clustering (individually and in combination) to ResNet-50, VGG-19, and MobileNetV2, evaluates them on the CIFAR-10-C and CIFAR-100-C datasets, and uses multi-objective analysis to identify the best configurations. Result: Some compression strategies improve robustness, especially on complex architectures; customized combinations of techniques achieve better multi-objective performance (accuracy, compression ratio, robustness). Conclusion: A well-chosen combination of compression techniques can improve robustness in corrupted real-world environments while keeping a high compression ratio, which gives guidance for practical deployment. Abstract: Compressed deep learning models are crucial for deploying computer vision systems on resource-constrained devices. However, model compression may affect robustness, especially under natural corruption. Therefore, it is important to consider robustness evaluation while validating computer vision systems. This paper presents a comprehensive evaluation of compression techniques (quantization, pruning, and weight clustering), applied individually and in combination, to convolutional neural networks (ResNet-50, VGG-19, and MobileNetV2). Using the CIFAR-10-C and CIFAR-100-C datasets, we analyze the trade-offs between robustness, accuracy, and compression ratio. Our results show that certain compression strategies not only preserve but can also improve robustness, particularly on networks with more complex architectures. Utilizing multi-objective assessment, we determine the best configurations, showing that customized technique combinations produce beneficial multi-objective results. This study provides insights into selecting compression methods for robust and efficient deployment of models in corrupted real-world environments.
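One common way to carry out a multi-objective comparison across accuracy, robustness, and compression ratio is Pareto dominance; whether the authors use exactly this procedure is not stated in the abstract, so the sketch below is an assumption. The configuration names and scores are made up for illustration and are not results from the paper.

```python
from typing import Dict, Tuple

# (clean accuracy, accuracy under corruption, compression ratio): higher is better for all.
Scores = Tuple[float, float, float]


def dominates(a: Scores, b: Scores) -> bool:
    """True if a is at least as good as b on every objective and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(configs: Dict[str, Scores]) -> Dict[str, Scores]:
    """Keep only the configurations that no other configuration dominates."""
    return {
        name: scores
        for name, scores in configs.items()
        if not any(dominates(other, scores)
                   for other_name, other in configs.items() if other_name != name)
    }


# Hypothetical numbers, invented purely to exercise the code.
configs = {
    "uncompressed": (0.90, 0.60, 1.0),
    "quantize+prune": (0.88, 0.65, 4.0),
    "prune-only": (0.85, 0.60, 3.0),
}
front = pareto_front(configs)
```

In this toy data, "prune-only" is dominated by "quantize+prune" on all three objectives and drops out, while the other two remain because each is best on at least one objective.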

[165] DarkEQA: Benchmarking Vision-Language Models for Embodied Question Answering in Low-Light Indoor Environments

Yohan Park,Hyunwoo Ha,Wonjun Jo,Tae-Hyun Oh

Main category: cs.CV

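TL;DR: This paper presents DarkEQA, an open-source benchmark for evaluating the visual question answering ability of vision-language models under multi-level low-light conditions; it emphasizes physically realistic modeling of visual degradation and exposes the limitations of existing models under these challenging conditions.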

Details Motivation: Existing benchmarks for vision-language models (VLMs) evaluate mainly under ideal lighting and overlook the low-light performance that real applications demand, especially for embodied agents that must operate 24/7; a benchmark that assesses perception under low light is therefore needed. Method: The DarkEQA benchmark generates physically faithful low-light images by simulating physics-based illumination drop and sensor noise in linear RAW space, followed by an ISP-inspired rendering pipeline; it evaluates question answering from egocentric observations under controlled degradations, which isolates the perception bottleneck and enables attributable robustness analysis. Result: A systematic evaluation of a range of state-of-the-art VLMs and low-light image enhancement (LLIE) models shows that current VLMs degrade markedly under low light and that existing LLIE methods struggle to recover semantic information well enough to support downstream VLM tasks. Conclusion: DarkEQA fills a gap in evaluating vision-language reasoning in low-light environments, exposes the perceptual fragility of existing models, and provides a tool and direction for developing more robust embodied systems. Abstract: Vision Language Models (VLMs) are increasingly adopted as central reasoning modules for embodied agents. Existing benchmarks evaluate their capabilities under ideal, well-lit conditions, yet robust 24/7 operation demands performance under a wide range of visual degradations, including low-light conditions at night or in dark environments, a core necessity that has been largely overlooked. To address this underexplored challenge, we present DarkEQA, an open-source benchmark for evaluating EQA-relevant perceptual primitives under multi-level low-light conditions. DarkEQA isolates the perception bottleneck by evaluating question answering from egocentric observations under controlled degradations, enabling attributable robustness analysis. A key design feature of DarkEQA is its physical fidelity: visual degradations are modeled in linear RAW space, simulating physics-based illumination drop and sensor noise followed by an ISP-inspired rendering pipeline. We demonstrate the utility of DarkEQA by evaluating a wide range of state-of-the-art VLMs and Low-Light Image Enhancement (LLIE) models. Our analysis systematically reveals VLMs' limitations when operating under these challenging visual conditions. Our code and benchmark dataset will be released upon acceptance.
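
The degradation recipe described above (illumination drop and sensor noise applied in linear space, then rendering back to display space) can be sketched as follows. This is a toy approximation, not the benchmark's pipeline: the gamma of 2.2 stands in for the ISP, shot noise is approximated as Gaussian, and `full_well`, `read_noise`, and `exposure_scale` are assumed illustrative parameters.

```python
import random


def simulate_low_light(pixels, exposure_scale, full_well=1000.0,
                       read_noise=2.0, seed=0):
    """Darken display-space pixel values in [0, 1] with a simple sensor model."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    out = []
    for p in pixels:
        linear = p ** 2.2                                # approximate inverse gamma
        electrons = linear * exposure_scale * full_well  # illumination drop
        shot = rng.gauss(0.0, electrons ** 0.5)          # Gaussian approx. of shot noise
        read = rng.gauss(0.0, read_noise)                # signal-independent read noise
        noisy = max(electrons + shot + read, 0.0) / full_well
        out.append(min(noisy, 1.0) ** (1.0 / 2.2))       # re-apply display gamma
    return out


bright = [0.8] * 200
dark = simulate_low_light(bright, exposure_scale=0.05)
print(sum(dark) / len(dark))
```

Adding noise before the gamma step matters: noise that is small in linear space becomes relatively large in dark regions once the image is rendered, which is the effect a display-space brightness reduction would miss.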

[166] Bi-C2R: Bidirectional Continual Compatible Representation for Re-indexing Free Lifelong Person Re-identification

Zhenyu Cui,Jiahuan Zhou,Yuxin Peng

Main category: cs.CV

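TL;DR: This paper introduces a new lifelong person re-identification task that requires no re-indexing of historical gallery images (RFL-ReID) and designs a Bidirectional Continuous Compatible Representation (Bi-C2R) framework that makes features from new and old models compatible without re-extracting historical features, effectively mitigating catastrophic forgetting and achieving strong performance.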

Details Motivation: Existing lifelong person re-identification methods depend on re-indexing historical gallery images, which is hard to apply in practice because of privacy concerns and high computational cost; in addition, features extracted by new and old models are incompatible, which degrades retrieval performance. Method: The Bi-C2R framework uses a bidirectional knowledge-transfer mechanism to keep the new and old feature spaces compatible as the model is updated, without re-extracting and storing historical gallery features, enabling efficient re-indexing-free lifelong learning. Result: Extensive experiments on multiple benchmark datasets show that the method achieves leading performance on both RFL-ReID and the traditional L-ReID task, and theoretical analysis supports its feature compatibility and stability. Conclusion: Bi-C2R resolves the incompatibility between new and old features that arises in RFL-ReID when re-indexing is impossible, offering an efficient, privacy-friendly solution for continual-learning person re-identification in real deployments. Abstract: Lifelong person Re-IDentification (L-ReID) exploits sequentially collected data to continuously train and update a ReID model, focusing on the overall performance of all data. Its main challenge is to avoid the catastrophic forgetting problem of old knowledge while training on new data. Existing L-ReID methods typically re-extract new features for all historical gallery images for inference after each update, known as "re-indexing". However, historical gallery data typically cannot be saved directly because of data privacy issues, and re-indexing large-scale gallery images is costly. As a result, it inevitably leads to incompatible retrieval between query features extracted by the updated model and gallery features extracted by those before the update, greatly impairing the re-identification performance. To tackle the above issue, this paper focuses on a new task called Re-index Free Lifelong person Re-IDentification (RFL-ReID), which requires performing lifelong person re-identification without re-indexing historical gallery images. Therefore, RFL-ReID is more challenging than L-ReID, requiring continuous learning and balancing new and old knowledge in diverse streaming data, and making the features output by the new and old models compatible with each other. To this end, we propose a Bidirectional Continuous Compatible Representation (Bi-C2R) framework to continuously update the gallery features extracted by the old model to perform efficient L-ReID in a compatible manner. 
We verify our proposed Bi-C2R method through theoretical analysis and extensive experiments on multiple benchmarks, which demonstrate that the proposed method can achieve leading performance on both the introduced RFL-ReID task and the traditional L-ReID task.

[167] FoundationSLAM: Unleashing the Power of Depth Foundation Models for End-to-End Dense Visual SLAM

Yuchen Wu,Jiahe Li,Fabio Tosi,Matteo Poggi,Jin Zheng,Xiao Bai

Main category: cs.CV

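TL;DR: This paper proposes FoundationSLAM, a learning-based monocular dense SLAM system that incorporates geometric guidance from foundation depth models to improve trajectory accuracy and reconstruction quality while running in real time.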

Details Motivation: To address the lack of geometric consistency in earlier flow-based approaches to monocular dense SLAM, and to improve the accuracy and robustness of tracking and mapping. Method: A Hybrid Flow Network produces geometry-aware correspondences; a Bi-Consistent Bundle Adjustment Layer performs joint multi-view optimization; and a Reliability-Aware Refinement mechanism closes the feedback loop between matching and optimization. Result: The system achieves superior trajectory accuracy and dense reconstruction quality on multiple challenging datasets while running in real time at 18 FPS. Conclusion: By combining the geometric priors of foundation models with learned optical flow, FoundationSLAM strikes a good balance among accuracy, robustness, and real-time performance and is broadly applicable. Abstract: We present FoundationSLAM, a learning-based monocular dense SLAM system that addresses the absence of geometric consistency in previous flow-based approaches for accurate and robust tracking and mapping. Our core idea is to bridge flow estimation with geometric reasoning by leveraging the guidance from foundation depth models. To this end, we first develop a Hybrid Flow Network that produces geometry-aware correspondences, enabling consistent depth and pose inference across diverse keyframes. To enforce global consistency, we propose a Bi-Consistent Bundle Adjustment Layer that jointly optimizes keyframe pose and depth under multi-view constraints. Furthermore, we introduce a Reliability-Aware Refinement mechanism that dynamically adapts the flow update process by distinguishing between reliable and uncertain regions, forming a closed feedback loop between matching and optimization. Extensive experiments demonstrate that FoundationSLAM achieves superior trajectory accuracy and dense reconstruction quality across multiple challenging datasets, while running in real time at 18 FPS, demonstrating strong generalization to various scenarios and practical applicability of our method.

[168] From Inpainting to Editing: A Self-Bootstrapping Framework for Context-Rich Visual Dubbing

Xu He,Haoxian Zhang,Hejia Chen,Changyuan Zheng,Liyang Chen,Songlin Tang,Jiehui Huang,Xiaoqiang Liu,Pengfei Wan,Zhiyong Wu

Main category: cs.CV

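TL;DR: This paper proposes a self-bootstrapping framework that turns audio-driven visual dubbing from an ill-posed inpainting task into a well-conditioned video editing problem, using a Diffusion Transformer to generate ideal training data and achieve precise lip synchronization.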

Details Motivation: Because ideal paired training data (videos that differ only in lip motion while all other visual conditions are identical) is unavailable, existing methods rely on a mask-based inpainting paradigm, which leads to visual artifacts, identity drift, and poor synchronization. Method: A Diffusion Transformer serves as a data generator that synthesizes a lip-altered companion video for each real video sample, forming visually aligned video pairs; a DiT-based audio-driven editor is then trained end-to-end on these pairs, with a timestep-adaptive multi-phase learning strategy to disentangle conflicting objectives in the diffusion process. Result: The method significantly outperforms existing methods in lip-sync accuracy, identity preservation, and robustness in complex real-world scenarios, while also improving visual fidelity. Conclusion: By constructing ideal training data and making full use of the complete visual context, the method addresses fundamental flaws of conventional approaches and advances the practical use of audio-driven visual dubbing. Abstract: Audio-driven visual dubbing aims to synchronize a video's lip movements with new speech, but is fundamentally challenged by the lack of ideal training data: paired videos where only a subject's lip movements differ while all other visual conditions are identical. Existing methods circumvent this with a mask-based inpainting paradigm, where an incomplete visual conditioning forces models to simultaneously hallucinate missing content and sync lips, leading to visual artifacts, identity drift, and poor synchronization. In this work, we propose a novel self-bootstrapping framework that reframes visual dubbing from an ill-posed inpainting task into a well-conditioned video-to-video editing problem. Our approach employs a Diffusion Transformer, first as a data generator, to synthesize ideal training data: a lip-altered companion video for each real sample, forming visually aligned video pairs. A DiT-based audio-driven editor is then trained on these pairs end-to-end, leveraging the complete and aligned input video frames to focus solely on precise, audio-driven lip modifications. This complete, frame-aligned input conditioning forms a rich visual context for the editor, providing it with complete identity cues, scene interactions, and continuous spatiotemporal dynamics. Leveraging this rich context fundamentally enables our method to achieve highly accurate lip sync, faithful identity preservation, and exceptional robustness against challenging in-the-wild scenarios. 
We further introduce a timestep-adaptive multi-phase learning strategy as a necessary component to disentangle conflicting editing objectives across diffusion timesteps, thereby facilitating stable training and yielding enhanced lip synchronization and visual fidelity. Additionally, we propose ContextDubBench, a comprehensive benchmark dataset for robust evaluation in diverse and challenging practical application scenarios.

[169] FineTec: Fine-Grained Action Recognition Under Temporal Corruption via Skeleton Decomposition and Sequence Completion

Dian Shao,Mingfei Shi,Like Liu

Main category: cs.CV

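TL;DR: This paper proposes FineTec, a framework for fine-grained action recognition from temporally corrupted skeleton sequences, which improves performance through context-aware completion, spatial decomposition, and physics-driven acceleration estimation.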

Details Motivation: Existing methods struggle to recover the key spatiotemporal features of fine-grained actions from skeleton data with severe temporal gaps, a challenge that is especially acute in online pose estimation. Method: Context-aware completion with temporal masking restores a base skeleton sequence; spatial decomposition partitions the skeleton into semantic regions, groups them into dynamic and static subgroups, and generates augmented sequences; Lagrangian dynamics is used to estimate joint accelerations; and the position and acceleration sequences are combined in a GCN for action recognition. Result: FineTec significantly outperforms previous methods on benchmarks including NTU-60, NTU-120, Gym99, and Gym288, reaching top-1 accuracies of 89.1% and 78.1% on the Gym99-severe and Gym288-severe settings, respectively. Conclusion: By combining spatial decomposition with physics-driven modeling, FineTec improves the robustness and generalization of fine-grained action recognition under temporal corruption. Abstract: Recognizing fine-grained actions from temporally corrupted skeleton sequences remains a significant challenge, particularly in real-world scenarios where online pose estimation often yields substantial missing data. Existing methods often struggle to accurately recover temporal dynamics and fine-grained spatial structures, resulting in the loss of subtle motion cues crucial for distinguishing similar actions. To address this, we propose FineTec, a unified framework for Fine-grained action recognition under Temporal Corruption. FineTec first restores a base skeleton sequence from corrupted input using context-aware completion with diverse temporal masking. Next, a skeleton-based spatial decomposition module partitions the skeleton into five semantic regions, further divides them into dynamic and static subgroups based on motion variance, and generates two augmented skeleton sequences via targeted perturbation. These, along with the base sequence, are then processed by a physics-driven estimation module, which utilizes Lagrangian dynamics to estimate joint accelerations. Finally, both the fused skeleton position sequence and the fused acceleration sequence are jointly fed into a GCN-based action recognition head. Extensive experiments on both coarse-grained (NTU-60, NTU-120) and fine-grained (Gym99, Gym288) benchmarks show that FineTec significantly outperforms previous methods under various levels of temporal corruption. Specifically, FineTec achieves top-1 accuracies of 89.1% and 78.1% on the challenging Gym99-severe and Gym288-severe settings, respectively, demonstrating its robustness and generalizability. 
Code and datasets can be found at https://smartdianlab.github.io/projects-FineTec/.
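
The abstract states that joint accelerations are estimated with Lagrangian dynamics; that formulation is not reproduced here. As a simpler stand-in that shows what an acceleration sequence derived from skeleton positions looks like, the sketch below uses a central second-order finite difference. The function name and the single-joint input layout are assumptions made for illustration.

```python
def finite_difference_acceleration(positions, dt):
    """Approximate acceleration of one joint from its position sequence.

    positions: list of coordinate tuples, one per frame.
    dt: time between consecutive frames, in seconds.
    Returns one acceleration tuple per interior frame.
    """
    accelerations = []
    for t in range(1, len(positions) - 1):
        prev_p, cur_p, next_p = positions[t - 1], positions[t], positions[t + 1]
        accelerations.append(tuple(
            (next_p[k] - 2.0 * cur_p[k] + prev_p[k]) / (dt * dt)
            for k in range(len(cur_p))
        ))
    return accelerations


# A joint moving as x = t^2 along the first axis has constant acceleration 2.
dt = 0.1
positions = [((i * dt) ** 2, 0.0) for i in range(6)]
print(finite_difference_acceleration(positions, dt))
```

Finite differences amplify noise and need three valid frames per estimate, so they are a weak baseline on corrupted sequences, which is the setting this entry targets.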

[170] Edit3r: Instant 3D Scene Editing from Sparse Unposed Images

Jiageng Liu,Weijie Lyu,Xueting Li,Yejie Guo,Ming-Hsuan Yang

Main category: cs.CV

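TL;DR: Edit3r is a feed-forward framework that reconstructs and edits 3D scenes in a single pass from unposed, view-inconsistent edited images, without per-scene optimization or pose estimation, offering fast inference and high 3D consistency.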

Details Motivation: Existing 3D scene editing methods usually depend on per-scene optimization and accurate pose estimation, which limits editing speed and practical use. In addition, multi-view consistent edited images are not available for supervised training. An efficient, optimization-free 3D editing framework that can handle real edited inputs is therefore needed. Method: Edit3r uses a feed-forward network to directly predict instruction-aligned 3D edits; a SAM2-based recoloring strategy generates cross-view-consistent supervision, and an asymmetric input strategy pairs a recolored reference view with raw auxiliary views so that the network fuses information from different observations. At inference it can handle 2D-edited images such as those produced by InstructPix2Pix. Result: On the newly built large-scale benchmark DL3DV-Edit-Bench, Edit3r outperforms existing methods in semantic alignment and 3D consistency, with significantly faster inference. Conclusion: Edit3r achieves fast, high-quality single-pass 3D scene editing without optimization or pose estimation and remains robust on real edited inputs, offering a feasible approach to real-time 3D editing. Abstract: We present Edit3r, a feed-forward framework that reconstructs and edits 3D scenes in a single pass from unposed, view-inconsistent, instruction-edited images. Unlike prior methods requiring per-scene optimization, Edit3r directly predicts instruction-aligned 3D edits, enabling fast and photorealistic rendering without optimization or pose estimation. A key challenge in training such a model lies in the absence of multi-view consistent edited images for supervision. We address this with (i) a SAM2-based recoloring strategy that generates reliable, cross-view-consistent supervision, and (ii) an asymmetric input strategy that pairs a recolored reference view with raw auxiliary views, encouraging the network to fuse and align disparate observations. At inference, our model effectively handles images edited by 2D methods such as InstructPix2Pix, despite not being exposed to such edits during training. For large-scale quantitative evaluation, we introduce DL3DV-Edit-Bench, a benchmark built on the DL3DV test split, featuring 20 diverse scenes, 4 edit types and 100 edits in total. Comprehensive quantitative and qualitative results show that Edit3r achieves superior semantic alignment and enhanced 3D consistency compared to recent baselines, while operating at significantly higher inference speed, making it promising for real-time 3D editing applications.

[171] GaMO: Geometry-aware Multi-view Diffusion Outpainting for Sparse-View 3D Reconstruction

Yi-Chuan Huang,Hao-Jen Chien,Chin-Yang Lin,Ying-Huan Chen,Yu-Lun Liu

Main category: cs.CV

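TL;DR: This paper proposes GaMO, a geometry-aware multi-view outpainting framework that addresses insufficient coverage, geometric inconsistency, and high computational cost in sparse-view 3D reconstruction; by expanding the field of view of existing viewpoints in a zero-shot manner, it achieves better reconstruction quality and a 25× speedup.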

Details Motivation: Existing diffusion-based methods for sparse-view 3D reconstruction suffer from insufficient scene coverage when extrapolating views, geometric inconsistency across generated views, and high computational overhead, making high-quality reconstruction difficult. Method: GaMO recasts sparse-view reconstruction as a multi-view outpainting task; using multi-view conditioning and a geometry-aware denoising strategy, it expands the field of view from existing camera poses without any training, preserving geometric consistency while broadening coverage. Result: On Replica and ScanNet++ with 3, 6, and 9 input views, GaMO outperforms prior methods in PSNR and LPIPS and runs 25× faster than state-of-the-art diffusion-based methods, with processing time under 10 minutes. Conclusion: Through geometry-aware multi-view outpainting, GaMO addresses key challenges in sparse-view 3D reconstruction and delivers clear gains in quality, consistency, and efficiency, with good potential for practical use. Abstract: Recent advances in 3D reconstruction have achieved remarkable progress in high-quality scene capture from dense multi-view imagery, yet struggle when input views are limited. Various approaches, including regularization techniques, semantic priors, and geometric constraints, have been implemented to address this challenge. The latest diffusion-based methods have demonstrated substantial improvements by generating novel views from new camera poses to augment training data, surpassing earlier regularization and prior-based techniques. Despite this progress, we identify three critical limitations in these state-of-the-art approaches: inadequate coverage beyond known view peripheries, geometric inconsistencies across generated views, and computationally expensive pipelines. We introduce GaMO (Geometry-aware Multi-view Outpainter), a framework that reformulates sparse-view reconstruction through multi-view outpainting. Instead of generating new viewpoints, GaMO expands the field of view from existing camera poses, which inherently preserves geometric consistency while providing broader scene coverage. Our approach employs multi-view conditioning and geometry-aware denoising strategies in a zero-shot manner without training. Extensive experiments on Replica and ScanNet++ demonstrate state-of-the-art reconstruction quality across 3, 6, and 9 input views, outperforming prior methods in PSNR and LPIPS, while achieving a $25\times$ speedup over SOTA diffusion-based methods with processing time under 10 minutes. Project page: https://yichuanh.github.io/GaMO/
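
Expanding the field of view from an existing camera pose can be pictured with pinhole intrinsics: the pose and focal length stay fixed, the image canvas grows, and the principal point shifts by the padding so the known pixels stay in place. The sketch below shows only this bookkeeping under assumed names and a centered padding layout; it does not reproduce the paper's outpainting or denoising.

```python
import math


def horizontal_fov_deg(width, focal_px):
    """Horizontal field of view of a pinhole camera, in degrees."""
    return math.degrees(2.0 * math.atan(width / (2.0 * focal_px)))


def expand_canvas(width, height, fx, fy, cx, cy, scale):
    """Intrinsics for a larger canvas that shares the original camera pose.

    The original image occupies the center of the new canvas; the border
    region is what an outpainting model would have to fill in.
    """
    new_w, new_h = round(width * scale), round(height * scale)
    pad_x, pad_y = (new_w - width) // 2, (new_h - height) // 2
    return {
        "width": new_w, "height": new_h,
        "fx": fx, "fy": fy,                   # focal length is unchanged
        "cx": cx + pad_x, "cy": cy + pad_y,   # principal point follows the padding
        "pad": (pad_x, pad_y),
    }


wide = expand_canvas(640, 480, 320.0, 320.0, 320.0, 240.0, scale=1.5)
print(horizontal_fov_deg(640, 320.0), horizontal_fov_deg(wide["width"], wide["fx"]))
```

Because the extrinsics never change, every original pixel keeps its viewing ray, which is one way to see why this setup avoids the cross-view inconsistency that arises when entirely new viewpoints are generated.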

[172] SpaceTimePilot: Generative Rendering of Dynamic Scenes Across Space and Time

Zhening Huang,Hyeonho Jeong,Xuelin Chen,Yulia Gryaditskaya,Tuanfeng Y. Wang,Joan Lasenby,Chun-Hao Huang

Main category: cs.CV

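TL;DR: This paper proposes SpaceTimePilot, a video diffusion model that disentangles space and time, enabling generative rendering with independent control over camera viewpoint and motion sequence.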

Details Motivation: Existing video generation models struggle to control viewpoint changes and motion timing precisely at the same time, and no dataset supports training with fully disentangled space and time. Method: An animation time-embedding mechanism and a temporal-warping training scheme are combined with an improved camera-conditioning mechanism and the newly built CamxTime dataset for joint training. Result: Evaluation on real and synthetic data shows clear space-time disentanglement, more precise temporal control, and better results than prior work. Conclusion: SpaceTimePilot provides independent, precise control over the spatial and temporal dimensions of generated video, advancing controllable generation of dynamic scenes. Abstract: We present SpaceTimePilot, a video diffusion model that disentangles space and time for controllable generative rendering. Given a monocular video, SpaceTimePilot can independently alter the camera viewpoint and the motion sequence within the generative process, re-rendering the scene for continuous and arbitrary exploration across space and time. To achieve this, we introduce an effective animation time-embedding mechanism in the diffusion process, allowing explicit control of the output video's motion sequence with respect to that of the source video. As no datasets provide paired videos of the same dynamic scene with continuous temporal variations, we propose a simple yet effective temporal-warping training scheme that repurposes existing multi-view datasets to mimic temporal differences. This strategy effectively supervises the model to learn temporal control and achieve robust space-time disentanglement. To further enhance the precision of dual control, we introduce two additional components: an improved camera-conditioning mechanism that allows altering the camera from the first frame, and CamxTime, the first synthetic space-and-time full-coverage rendering dataset that provides fully free space-time video trajectories within a scene. Joint training on the temporal-warping scheme and the CamxTime dataset yields more precise temporal control. We evaluate SpaceTimePilot on both real-world and synthetic data, demonstrating clear space-time disentanglement and strong results compared to prior work. Project page: https://zheninghuang.github.io/Space-Time-Pilot/ Code: https://github.com/ZheningHuang/spacetimepilot
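
The temporal-warping idea (re-timing an existing clip so that a source and a target video show the same scene at different animation times) can be sketched as a remapping of frame indices. The three warp modes below are hypothetical examples chosen for illustration; the abstract does not specify which warps the training scheme uses.

```python
def warp_frame_indices(num_frames, mode):
    """Map each output frame to the source frame it should show."""
    last = num_frames - 1
    if mode == "reverse":   # play the motion backwards
        return [last - i for i in range(num_frames)]
    if mode == "slow":      # half speed: each source frame is held twice
        return [i // 2 for i in range(num_frames)]
    if mode == "freeze":    # motion stops at the middle frame
        return [min(i, last // 2) for i in range(num_frames)]
    raise ValueError(f"unknown warp mode: {mode}")


def apply_warp(frames, mode):
    """Build the re-timed clip from a list of frames."""
    return [frames[i] for i in warp_frame_indices(len(frames), mode)]


print(apply_warp(["f0", "f1", "f2", "f3", "f4"], "freeze"))
```

In a multi-view dataset, pairing the original clip from one camera with a warped clip from another camera yields examples where viewpoint and animation time vary independently, which is the kind of supervision the entry describes.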