cs.CL

[1] Enriching Historical Records: An OCR and AI-Driven Approach for Database Integration

Zahra Abedi, Richard M. K. van Dijk, Gijs Wijnholds, Tessa Verhoef

Main category: cs.CL

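TL;DR: This study proposes an automated pipeline that combines OCR, generative AI, and database linking to digitize and structure typewritten text from images of Leiden University historical documents. It reports low OCR error rates and record-linkage accuracy of up to 94%.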

Motivation: To efficiently integrate unstructured biographical data in images of historical documents with existing high-quality databases, replacing slow and error-prone manual entry and addressing digital humanities challenges such as layout variability and terminology differences.

Method: OCR converts the historical document images to text, a generative large language model with decoding-time constraints extracts structured JSON, and a record linkage algorithm matches the extracted records against the existing database.

Result: OCR reaches a Character Error Rate (CER) of 1.08% and a Word Error Rate (WER) of 5.06%. JSON extraction averages 63% accuracy from OCR text and 65% from annotated text. Record linkage reaches 94% accuracy on annotated JSON and 81% on OCR-derived JSON.

Conclusion: Generative AI can partly compensate for OCR errors, and the proposed automated pipeline effectively supports the digitization of historical documents, with potential applications in digital humanities research.

Abstract: This research digitizes and analyzes the Leidse hoogleraren en lectoren 1575-1815 books written between 1983 and 1985, which contain biographical data about professors and curators of Leiden University. It addresses the central question: how can we design an automated pipeline that integrates OCR, LLM-based interpretation, and database linking to harmonize data from historical document images with existing high-quality database records? We applied OCR techniques, generative AI decoding constraints that structure data extraction, and database linkage methods to process typewritten historical records into a digital format. OCR achieved a Character Error Rate (CER) of 1.08 percent and a Word Error Rate (WER) of 5.06 percent, while JSON extraction from OCR text achieved an average accuracy of 63 percent and, based on annotated OCR, 65 percent. This indicates that generative AI somewhat corrects low OCR performance. Our record linkage algorithm linked annotated JSON files with 94 percent accuracy and OCR-derived JSON files with 81 percent. This study contributes to digital humanities research by offering an automated pipeline for interpreting digitized historical documents, addressing challenges like layout variability and terminology differences, and exploring the applicability and strength of an advanced generative AI model.
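The CER and WER figures quoted for this paper are standard edit-distance ratios. The sketch below shows how such numbers are typically computed. It is not the authors' evaluation code, and the sample strings are invented for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def cer(reference, hypothesis):
    """Character Error Rate: character edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word Error Rate: word edits divided by number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Invented example: one wrong character in a three-word line.
reference = "hoogleraar te Leiden"
ocr_output = "hoogleraar te Leidcn"
print(cer(reference, ocr_output), wer(reference, ocr_output))
```

A single wrong character spoils a whole word, which is why WER (5.06%) sits well above CER (1.08%) in the reported results.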

[2] CAT: A Metric-Driven Framework for Analyzing the Consistency-Accuracy Relation of LLMs under Controlled Input Variations

Paulo Cavalin, Cassia Sanctos, Marcelo Grave, Claudio Pinhanez, Yago Primerano

Main category: cs.CL

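TL;DR: This paper presents CAT, a framework for evaluating and visualizing the interplay between accuracy and response consistency of large language models under controlled input variations. Its core components are Consistency-Accuracy Relation (CAR) curves and the Consistency-Oriented Robustness Estimate (CORE) index, which quantify the trade-off between accuracy and consistency.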

Motivation: Current evaluation practice focuses on model capabilities such as accuracy or benchmark scores, while consistency has recently been recognized as an essential property for deploying LLMs in high-stakes, real-world applications. The dependency between accuracy and consistency also needs to be considered for a more nuanced evaluation of LLMs.

Method: The paper proposes Consistency-Accuracy Relation (CAR) curves together with the Minimum-Consistency Accuracy (MCA) metric, and introduces the Consistency-Oriented Robustness Estimate (CORE) index, a global metric that combines the area and shape of the CAR curve.

Result: A practical demonstration on a range of generalist and domain-specific LLMs across several multiple-choice benchmarks shows the framework in use, and the authors outline how CAT can be extended to long-form, open-ended evaluation tasks.

Conclusion: CAT offers a new perspective on the trade-off between accuracy and consistency in LLMs, supporting a better understanding of model performance and guiding deployment in high-stakes settings.

Abstract: We introduce CAT, a framework designed to evaluate and visualize the interplay of accuracy and response consistency of Large Language Models (LLMs) under controllable input variations, using multiple-choice (MC) benchmarks as a case study. Current evaluation practices primarily focus on model capabilities such as accuracy or benchmark scores and, more recently, measuring consistency is being considered an essential property for deploying LLMs in high-stakes, real-world applications. We argue in this paper that although both dimensions should still be evaluated independently, their inter-dependency also needs to be considered for a more nuanced evaluation of LLMs. At the core of CAT are the Consistency-Accuracy Relation (CAR) curves, which visualize how model accuracy varies with increasing consistency requirements, as defined by the Minimum-Consistency Accuracy (MCA) metric. We further propose the Consistency-Oriented Robustness Estimate (CORE) index, a global metric that combines the area and shape of the CAR curve to quantify the trade-off between accuracy and consistency. We present a practical demonstration of our framework across a diverse set of generalist and domain-specific LLMs, evaluated on multiple MC benchmarks. We also outline how CAT can be extended beyond MC tasks to support long-form, open-ended evaluations through adaptable scoring functions.
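The abstract does not give formulas for MCA or CORE, so the following is only one plausible reading, offered as a sketch. It assumes an item counts toward accuracy at level tau when the model answers correctly on at least a fraction tau of that item's input variations. The function names, the threshold grid, and the toy data are all hypothetical. CORE itself is not reproduced, since the abstract only says it combines the area and shape of the curve.

```python
def mca(correct, tau):
    """Assumed Minimum-Consistency Accuracy: share of items answered
    correctly on at least a fraction `tau` of their input variations.
    `correct` is a list of per-item lists of booleans."""
    hits = [sum(item) / len(item) >= tau for item in correct]
    return sum(hits) / len(hits)


def car_curve(correct, steps=4):
    """Assumed CAR curve: (tau, MCA) pairs over an evenly spaced grid."""
    taus = [i / steps for i in range(1, steps + 1)]
    return [(tau, mca(correct, tau)) for tau in taus]


def area_under(curve):
    """Trapezoidal area under the curve, one ingredient a CORE-like
    index could use (the paper's actual combination is not shown here)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(curve, curve[1:]))


# Toy data: 3 items, 4 input variations each.
toy = [
    [True, True, True, True],      # always correct
    [True, False, True, False],    # correct half the time
    [False, False, False, False],  # never correct
]
for tau, acc in car_curve(toy):
    print(f"tau={tau:.2f}  MCA={acc:.3f}")
```

Under this reading the curve can only fall as tau rises, which matches the description of accuracy varying with "increasing consistency requirements".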

[3] STED and Consistency Scoring: A Framework for Evaluating LLM Structured Output Reliability

Guanghui Wang, Jinze Yu, Xing Zhang, Dayuan Jiang, Yin Song, Tomal Deb, Xuefeng Liu, Peiyang He

Main category: cs.CL

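TL;DR: This paper proposes a framework for evaluating and improving the consistency of structured outputs generated by large language models (LLMs). It comprises a new Semantic Tree Edit Distance (STED) metric and a consistency scoring scheme. Experiments show it works across several models and supports model selection, prompt refinement, and diagnostic analysis.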

Motivation: To ensure that LLMs generate structured data consistently and reliably in production, and to address the poor balance between semantic flexibility and structural strictness in existing evaluation metrics.

Method: The paper proposes STED (Semantic Tree Edit Distance), a new metric for the similarity of JSON outputs, and builds a consistency scoring framework that aggregates STED over repeated generations. Systematic experiments on synthetic datasets control schema, expression, and semantic variation and evaluate several LLMs.

Result: STED reaches 0.86-0.90 similarity for semantically equivalent samples and 0.0 for structural breaks, outperforming existing methods such as TED, BERTScore, and DeepDiff. Among six LLMs, Claude-3.7-Sonnet shows exceptional generation consistency and stays stable even at high temperature, while Claude-3-Haiku and Nova-Pro degrade substantially.

Conclusion: The framework provides a reliable evaluation tool and an improvement path for LLM-generated structured output. It can be used for model selection, prompt engineering, and diagnosis of inconsistency, making LLMs more practical and trustworthy in production systems.

Abstract: Large Language Models (LLMs) are increasingly deployed for structured data generation, yet output consistency remains critical for production applications. We introduce a comprehensive framework for evaluating and improving consistency in LLM-generated structured outputs. Our approach combines: (1) STED (Semantic Tree Edit Distance), a novel similarity metric balancing semantic flexibility with structural strictness when comparing JSON outputs, and (2) a consistency scoring framework aggregating multiple STED measurements across repeated generations to quantify reliability. Through systematic experiments on synthetic datasets with controlled schema, expression, and semantic variations, we demonstrate STED achieves superior performance (0.86-0.90 similarity for semantic equivalents, 0.0 for structural breaks) compared to existing metrics including TED, BERTScore, and DeepDiff. Applying our framework to benchmark six LLMs reveals significant variations: Claude-3.7-Sonnet demonstrates exceptional consistency, maintaining near-perfect structural reliability even at high temperatures (T = 0.9), while models like Claude-3-Haiku and Nova-Pro exhibit substantial degradation requiring careful tuning. Our framework enables practical applications including targeted model selection for structured tasks, iterative prompt refinement for reproducible results, and diagnostic analysis to identify inconsistency root causes. This work provides theoretical foundations and practical tools for ensuring reliable structured output generation in LLM-based production systems.
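The aggregation step (scoring consistency from pairwise similarity across repeated generations) can be sketched without STED itself. The `similarity` function below is a deliberately crude stand-in, not STED. It returns 0.0 when two JSON objects differ in structure and otherwise the share of leaf values that match exactly, with no semantic matching at all. Taking the mean over all pairs is also an assumption, since the abstract does not say how measurements are aggregated.

```python
from itertools import combinations


def flatten(obj, prefix=""):
    """Map a parsed JSON value to {path: leaf value}."""
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = enumerate(obj)
    else:
        return {prefix: obj}
    flat = {}
    for key, value in items:
        flat.update(flatten(value, f"{prefix}/{key}"))
    return flat


def similarity(a, b):
    """Stand-in for STED (assumption): 0.0 on a structural mismatch,
    otherwise the fraction of leaves with identical values."""
    fa, fb = flatten(a), flatten(b)
    if set(fa) != set(fb):
        return 0.0
    if not fa:
        return 1.0
    return sum(fa[path] == fb[path] for path in fa) / len(fa)


def consistency_score(outputs):
    """Mean pairwise similarity over repeated generations (needs >= 2)."""
    pairs = list(combinations(outputs, 2))
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


# Three hypothetical generations for the same prompt.
runs = [
    {"name": "Ada", "age": 36},
    {"name": "Ada", "age": 36},
    {"name": "Ada", "age": 37},
]
print(consistency_score(runs))
```

A real implementation would replace `similarity` with a tree edit distance that tolerates paraphrased values, which is the gap the paper says STED fills.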

[4] PyBangla at BLP-2025 Task 2: Enhancing Bangla-to-Python Code Generation with Iterative Self-Correction and Multilingual Agents

Jahidul Islam, Md Ataullha, Saiful Azad

Main category: cs.CL

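TL;DR: This paper proposes BanglaCodeAct, an agent-based framework for generating Python code from Bangla instructions. Using multi-agent prompting and iterative self-correction, it achieves strong code generation results for a low-resource language.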

Motivation: Existing models fall short at generating code from low-resource languages such as Bangla. The work also aims to reduce reliance on task-specific fine-tuning.

Method: An open-source multilingual LLM runs inside a Thought-Code-Observation loop, with multi-agent prompting and iterative self-correction used to generate, test, and refine code dynamically.

Result: Several small-parameter open-source LLMs are evaluated on the mHumanEval dataset. Qwen3-8B with BanglaCodeAct reaches pass@1 accuracy of 94.0% on the development set and 71.6% on the blind test set.

Conclusion: The method sets a new benchmark for Bangla-to-Python code generation and shows the potential of agent-based reasoning for code generation in low-resource languages.

Abstract: LLMs excel at code generation from English prompts, but this progress has not extended to low-resource languages. We address Bangla-to-Python code generation by introducing BanglaCodeAct, an agent-based framework that leverages multi-agent prompting and iterative self-correction. Unlike prior approaches relying on task-specific fine-tuning, BanglaCodeAct employs an open-source multilingual LLM within a Thought-Code-Observation loop, enabling dynamic generation, testing, and refinement of code from Bangla instructions. We benchmark several small-parameter open-source LLMs and evaluate their effectiveness on the mHumanEval dataset for Bangla NL2Code. Our results show that Qwen3-8B, when deployed with BanglaCodeAct, achieves the best performance, with pass@1 accuracy of 94.0% on the development set and 71.6% on the blind test set. These results establish a new benchmark for Bangla-to-Python translation and highlight the potential of agent-based reasoning for reliable code generation in low-resource languages. Experimental scripts are publicly available at github.com/jahidulzaid/PyBanglaCodeActAgent.

[5] PharmaShip: An Entity-Centric, Reading-Order-Supervised Benchmark for Chinese Pharmaceutical Shipping Documents

Tingwei Xie, Tianyi Zhou, Yonghong Song

Main category: cs.CL

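TL;DR: PharmaShip is a Chinese dataset of pharmaceutical shipping documents for testing pre-trained text-layout models under noisy OCR and heterogeneous templates. It supports sequence entity recognition, relation extraction, and reading order prediction, and it identifies sequence-aware constraints as a transferable bias for structure modeling.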

Motivation: Document understanding models face noisy OCR and diverse templates when processing real-world pharmaceutical shipping documents, and no standardized benchmark exists to evaluate their robustness. A domain-specific dataset that reflects this real-world complexity is therefore needed.

Method: The authors build the PharmaShip dataset covering three tasks (SER, RE, ROP), adopt an entity-centric evaluation protocol, standardize preprocessing, data splits, and optimization, and benchmark five representative models (such as LiLT and LayoutLMv3), adding reading-order regularization and longer positional coverage.

Result: Pixel information and explicit geometric features are complementary, but neither alone is sufficient. Reading-order regularization improves SER and EL and increases robustness, and longer positional coverage stabilizes late-page predictions. ROP is accurate at the word level but remains challenging at the segment level.

Conclusion: PharmaShip provides a controlled, reproducible benchmark for safety-critical document understanding in the pharmaceutical domain and confirms that sequence-aware constraints are an effective, transferable inductive bias for modeling complex document structure.

Abstract: We present PharmaShip, a real-world Chinese dataset of scanned pharmaceutical shipping documents designed to stress-test pre-trained text-layout models under noisy OCR and heterogeneous templates. PharmaShip covers three complementary tasks, namely sequence entity recognition (SER), relation extraction (RE), and reading order prediction (ROP), and adopts an entity-centric evaluation protocol to minimize confounds across architectures. We benchmark five representative baselines spanning pixel-aware and geometry-aware families (LiLT, LayoutLMv3-base, GeoLayoutLM and their available RORE-enhanced variants), and standardize preprocessing, splits, and optimization. Experiments show that pixels and explicit geometry provide complementary inductive biases, yet neither alone is sufficient: injecting reading-order-oriented regularization consistently improves SER and EL and yields the most robust configuration, while longer positional coverage stabilizes late-page predictions and reduces truncation artifacts. ROP is accurate at the word level but challenging at the segment level, reflecting boundary ambiguity and long-range crossings. PharmaShip thus establishes a controlled, reproducible benchmark for safety-critical document understanding in the pharmaceutical domain and highlights sequence-aware constraints as a transferable bias for structure modeling. We release the dataset at https://github.com/KevinYuLei/PharmaShip.

[6] Noise-Driven Persona Formation in Reflexive Neural Language Generation

Toshiyuki Shigemura

Main category: cs.CL

TL;DR: The paper proposes the Luca-Noise Reflex Protocol (LN-RP), a computational framework for studying noise-driven persona emergence in large language models.

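Motivation: To explore whether large language models produce stable, repeatable persona modes under random noise, and to understand the reflexive dynamics of the generation process.

Method: Stochastic noise seeds are injected into the initial generation state, and nonlinear changes in linguistic behavior are observed over 152 generation cycles to identify stable persona modes with distinct entropy signatures.

Result: Three stable persona modes with significant differences (p < 0.01) are identified. External noise is shown to induce phase transitions, and persona characteristics remain consistent during generation.

Conclusion: LN-RP provides a reproducible experimental method for studying reflexive generation, emergent behavior, and long-range linguistic coherence in large language models.

Abstract: This paper introduces the Luca-Noise Reflex Protocol (LN-RP), a computational framework for analyzing noise-driven persona emergence in large language models. By injecting stochastic noise seeds into the initial generation state, we observe nonlinear transitions in linguistic behavior across 152 generation cycles. Our results reveal three stable persona modes with distinct entropy signatures, and demonstrate that external noise sources can reliably induce phase transitions in reflexive generation dynamics. Quantitative evaluation confirms consistent persona retention and significant differences across modes (p < 0.01). The protocol provides a reproducible method for studying reflexive generation, emergent behavior, and long-range linguistic coherence in LLMs.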

[7] HarmTransform: Transforming Explicit Harmful Queries into Stealthy via Multi-Agent Debate

Shenzhe Zhu

Main category: cs.CL

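TL;DR: This paper proposes HarmTransform, a multi-agent debate framework that transforms harmful queries into stealthier forms, with the aim of strengthening the safety alignment of large language models.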

Motivation: Existing safety mechanisms mainly target overtly dangerous content and struggle with queries that preserve malicious intent under covert rephrasing, which leaves a blind spot in safety training data.

Method: A multi-agent debate framework uses iterative critique and refinement to generate stealthier query variants that preserve the harmful intent.

Result: Experiments show that HarmTransform significantly outperforms baseline methods at producing effective stealthy queries, but the analysis also finds that debate can introduce topic shifts and unnecessary complexity.

Conclusion: Multi-agent debate shows promise for broadening the coverage of safety training data, but its side effects call for further refinement to balance effectiveness and stability.

Abstract: Large language models (LLMs) are equipped with safety mechanisms to detect and block harmful queries, yet current alignment approaches primarily focus on overtly dangerous content and overlook more subtle threats. However, users can often disguise harmful intent through covert rephrasing that preserves malicious objectives while appearing benign, which creates a significant gap in existing safety training data. To address this limitation, we introduce HarmTransform, a multi-agent debate framework for systematically transforming harmful queries into stealthier forms while preserving their underlying harmful intent. Our framework leverages iterative critique and refinement among multiple agents to generate high-quality, covert harmful query transformations that can be used to improve future LLM safety alignment. Experiments demonstrate that HarmTransform significantly outperforms standard baselines in producing effective query transformations. At the same time, our analysis reveals that debate acts as a double-edged sword: while it can sharpen transformations and improve stealth, it may also introduce topic shifts and unnecessary complexity. These insights highlight both the promise and the limitations of multi-agent debate for generating comprehensive safety training data.

[8] Emergent World Beliefs: Exploring Transformers in Stochastic Games

Adam Kamel, Tanish Rastogi, Michael Ma, Kailash Ranganathan, Kevin Zhu

Main category: cs.CL

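TL;DR: This paper studies whether Transformer-based large language models (LLMs) can learn latent state representations of the environment in games of incomplete information such as Texas Hold'em. The authors pretrain a GPT-style model on a poker hand history dataset and analyze its internal activations with nonlinear probes. They find that the model learns both deterministic structure (such as hand ranks) and stochastic features (such as equity) on its own, and that these representations correlate with theoretical belief states. This suggests that LLMs can build world models in partially observable environments.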

Motivation: To explore whether large language models develop internal representations of the environment state in incomplete-information settings such as Texas Hold'em, as they do in perfect-information games, and so extend the understanding of LLM reasoning and world modeling.

Method: A GPT-style language model is pretrained on Poker Hand History (PHH) data, and its internal activations are analyzed with linear and nonlinear probes to test whether it encodes game-relevant features such as hand rank and equity.

Result: Without explicit supervision, the model learns deterministic structure such as hand ranks and stochastic features such as equity. Nonlinear probes show that these representations are decodable and correlate with theoretical belief states.

Conclusion: Large language models can build internal world models containing both deterministic and stochastic features in complex, partially observable environments such as Texas Hold'em, which indicates potential for decision-making under incomplete information.

Abstract: Transformer-based large language models (LLMs) have demonstrated strong reasoning abilities across diverse fields, from solving programming challenges to competing in strategy-intensive games such as chess. Prior work has shown that LLMs can develop emergent world models in games of perfect information, where internal representations correspond to latent states of the environment. In this paper, we extend this line of investigation to domains of incomplete information, focusing on poker as a canonical partially observable Markov decision process (POMDP). We pretrain a GPT-style model on Poker Hand History (PHH) data and probe its internal activations. Our results demonstrate that the model learns both deterministic structure, such as hand ranks, and stochastic features, such as equity, without explicit instruction. Furthermore, by using primarily nonlinear probes, we demonstrate that these representations are decodable and correlate with theoretical belief states, suggesting that LLMs are learning their own representation of the stochastic environment of Texas Hold'em Poker.

[9] When in Doubt, Deliberate: Confidence-Based Routing to Expert Debate for Sexism Detection

Anwar Alajmi, Gabriele Pergola

Main category: cs.CL

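TL;DR: The paper proposes a two-stage framework that combines targeted training with reasoning-based inference to handle data sparsity, noise, and conceptual ambiguity in online sexism detection, achieving state-of-the-art results on several benchmarks.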

Motivation: Traditional methods struggle to identify subtle, context-dependent sexist content online. Existing annotated data suffers from inconsistency, label scarcity, and class imbalance, which leads to unstable decision boundaries and causes models to miss subtler forms of harm.

Method: The framework has two stages. Training uses class-balanced focal loss, class-aware batching, and post-hoc threshold calibration. At inference, a dynamic routing mechanism classifies high-confidence samples directly and passes low-confidence samples to a multi-persona Collaborative Expert Judgment (CEJ) module, which consolidates the personas' reasoning.

Result: F1 improves by +2.72% on EXIST 2025 Task 1.1 and by +4.48% and +1.30% on EDOS Tasks A and B respectively, reaching state-of-the-art results on several benchmarks.

Conclusion: The framework addresses the combined challenges of scarce data, noise, and semantic ambiguity in sexism detection. Separating certain from uncertain cases and adding collaborative reasoning improves the detection of subtle harmful content.

Abstract: Sexist content online increasingly appears in subtle, context-dependent forms that evade traditional detection methods. Its interpretation often depends on overlapping linguistic, psychological, legal, and cultural dimensions, which produce mixed and sometimes contradictory signals, even in annotated datasets. These inconsistencies, combined with label scarcity and class imbalance, result in unstable decision boundaries and cause fine-tuned models to overlook subtler, underrepresented forms of harm. Together, these limitations point to the need for a design that explicitly addresses the combined effects of (i) underrepresentation, (ii) noise, and (iii) conceptual ambiguity in both data and model predictions. To address these challenges, we propose a two-stage framework that unifies (i) targeted training procedures to adapt supervision to scarce and noisy data with (ii) selective, reasoning-based inference to handle ambiguous or borderline cases. Our training setup applies class-balanced focal loss, class-aware batching, and post-hoc threshold calibration to mitigate label imbalance and noisy supervision. At inference time, a dynamic routing mechanism classifies high-confidence cases directly and escalates uncertain instances to a novel Collaborative Expert Judgment (CEJ) module, which prompts multiple personas and consolidates their reasoning through a judge model. Our approach achieves state-of-the-art results across several benchmarks, with a +2.72% improvement in F1 on the EXIST 2025 Task 1.1, and gains of +4.48% and +1.30% on the EDOS Tasks A and B, respectively.
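The inference-time routing can be sketched in a few lines. Everything here is illustrative. The 0.8 threshold, the `classifier` and persona callables, and the majority vote standing in for the judge model are assumptions, not details from the paper.

```python
from collections import Counter


def classify(text, classifier, personas, threshold=0.8):
    """Route one input: accept the classifier's label when it is confident,
    otherwise escalate to a panel of persona callables.

    classifier(text) returns a list of class probabilities (hypothetical).
    Each persona(text) returns a class index (hypothetical).
    A majority vote stands in for the paper's judge model (assumption).
    """
    probs = classifier(text)
    label = max(range(len(probs)), key=lambda i: probs[i])
    if probs[label] >= threshold:
        return label, "direct"
    votes = [persona(text) for persona in personas]
    return Counter(votes).most_common(1)[0][0], "escalated"


# Stub components so the sketch runs end to end.
def stub_classifier(text):
    return [0.95, 0.05] if "clear" in text else [0.55, 0.45]


stub_personas = [lambda t: 1, lambda t: 1, lambda t: 0]

print(classify("a clear case", stub_classifier, stub_personas))
print(classify("a borderline case", stub_classifier, stub_personas))
```

The threshold sets the cost trade-off: a higher value sends more inputs to the slower persona panel, a lower one trusts the fine-tuned classifier more often.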

[10] Break Out the Silverware -- Semantic Understanding of Stored Household Items

Michaela Levi-Richter, Reuth Mirsky, Oren Glickman

Main category: cs.CL

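TL;DR: This paper introduces the Stored Household Item Challenge, which evaluates a service robot's ability to infer where out-of-sight items are stored in a home. It releases two datasets and NOAM, a hybrid method combining vision with a large language model, which significantly outperforms baselines in prediction accuracy and approaches human performance.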

Motivation: When executing a simple command such as "bring me a plate", a service robot cannot see where the item is stored and lacks the commonsense reasoning to find it. A benchmark task is needed to evaluate and improve robots' reasoning about household storage locations.

Method: NOAM (Non-visible Object Allocation Model) converts visual input into natural language describing the spatial context and visible containers, then uses a large language model (e.g., GPT-4) to infer the most likely hidden storage location, combining vision and language reasoning.

Result: On a test set of 100 real-world samples, NOAM significantly outperforms random selection, pure vision-language pipelines (Grounding-DINO + SAM), and leading multimodal models (e.g., Gemini, GPT-4o), with prediction accuracy approaching human performance. Two datasets are released, one for real-world evaluation and one for development.

Conclusion: By combining structured scene understanding with a large language model, NOAM predicts the storage location of non-visible household items effectively. It shows a feasible route to adding commonsense reasoning to robotic systems and provides a practical example and an evaluation benchmark for more cognitively capable service robots.

Abstract: "Bring me a plate." For domestic service robots, this simple command reveals a complex challenge: inferring where everyday items are stored, often out of sight in drawers, cabinets, or closets. Despite advances in vision and manipulation, robots still lack the commonsense reasoning needed to complete this task. We introduce the Stored Household Item Challenge, a benchmark task for evaluating service robots' cognitive capabilities: given a household scene and a queried item, predict its most likely storage location. Our benchmark includes two datasets: (1) a real-world evaluation set of 100 item-image pairs with human-annotated ground truth from participants' kitchens, and (2) a development set of 6,500 item-image pairs annotated with storage polygons over public kitchen images. These datasets support realistic modeling of household organization and enable comparative evaluation across agent architectures. To begin tackling this challenge, we introduce NOAM (Non-visible Object Allocation Model), a hybrid agent pipeline that combines structured scene understanding with large language model inference. NOAM converts visual input into natural language descriptions of spatial context and visible containers, then prompts a language model (e.g., GPT-4) to infer the most likely hidden storage location. This integrated vision-language agent exhibits emergent commonsense reasoning and is designed for modular deployment within broader robotic systems. We evaluate NOAM against baselines including random selection, vision-language pipelines (Grounding-DINO + SAM), leading multimodal models (e.g., Gemini, GPT-4o, Kosmos-2, LLaMA, Qwen), and human performance. NOAM significantly improves prediction accuracy and approaches human-level results, highlighting best practices for deploying cognitively capable agents in domestic environments.

[11] Entropy-Aware Speculative Decoding Toward Improved LLM Reasoning

Tiancheng Su, Meicong Zhang, Guoxiu He

Main category: cs.CL

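TL;DR: The paper proposes EASD, a training-free enhancement to speculative decoding. A dynamic entropy-based penalty improves the reasoning performance of large language models while preserving decoding efficiency.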

Motivation: Excessive alignment between the draft model and the target model limits speculative decoding, which cannot exceed the performance of the target model itself.

Method: On top of standard speculative decoding, EASD adds a dynamic entropy-based penalty. The entropy of the sampling distribution measures model uncertainty, and when entropy is high and the models' predictions overlap substantially, the token is rejected and re-sampled.

Result: On several reasoning benchmarks, EASD consistently outperforms existing speculative decoding methods and in most cases surpasses the target model itself, while keeping efficiency comparable to standard speculative decoding.

Conclusion: With entropy-aware draft verification, EASD improves reasoning quality without sacrificing efficiency and can even exceed the target model's performance ceiling.

Abstract: Speculative decoding (SD) accelerates large language model (LLM) reasoning by using a small draft model to generate candidate tokens, which the target LLM either accepts directly or regenerates upon rejection. However, excessive alignment between the draft and target models constrains SD to the performance of the target LLM. To address this limitation, we propose Entropy-Aware Speculative Decoding (EASD), a training-free enhancement. Building on standard SD, EASD incorporates a dynamic entropy-based penalty. At each decoding step, we employ the entropy of the sampling distribution to quantify model uncertainty. When both models exhibit high entropy with substantial overlap among their top-N predictions, the corresponding token is rejected and re-sampled by the target LLM. This penalty prevents low-confidence errors from propagating. By incorporating draft-model verification, EASD enables the possibility of surpassing the target model's inherent performance. Experiments across multiple reasoning benchmarks demonstrate that EASD consistently outperforms existing SD methods and, in most cases, surpasses the target LLM itself. We further prove that the efficiency of EASD is comparable to that of SD. The code can be found in the Supplementary Materials.
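The rejection rule in the abstract (both models uncertain, top-N predictions largely shared) can be written as a small predicate. This is a sketch under stated assumptions. The fixed thresholds and N = 3 are invented, the paper's penalty is described as dynamic rather than fixed, and the surrounding speculative decoding loop is omitted.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def top_n(probs, n):
    """Indices of the n most probable tokens."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(ranked[:n])


def should_reject(draft_probs, target_probs, n=3,
                  entropy_threshold=1.0, overlap_threshold=0.6):
    """True when both distributions have high entropy and their top-n
    sets overlap substantially. All threshold values are assumptions."""
    both_uncertain = (entropy(draft_probs) > entropy_threshold
                      and entropy(target_probs) > entropy_threshold)
    overlap = len(top_n(draft_probs, n) & top_n(target_probs, n)) / n
    return both_uncertain and overlap >= overlap_threshold


# Toy 5-token vocabulary: an uncertain step and a confident step.
uncertain_draft = [0.30, 0.25, 0.20, 0.15, 0.10]
uncertain_target = [0.28, 0.27, 0.20, 0.15, 0.10]
confident = [0.96, 0.01, 0.01, 0.01, 0.01]

print(should_reject(uncertain_draft, uncertain_target))  # both uncertain
print(should_reject(confident, confident))               # both confident
```

When the predicate fires, the abstract says the token is re-sampled by the target LLM. Otherwise the usual accept-or-reject step of speculative decoding would apply.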

[12] MiMo-Audio: Audio Language Models are Few-Shot Learners

Xiaomi LLM-Core Team: Dong Zhang, Gang Wang, Jinlong Xue, Kai Fang, Liang Zhao, Rui Ma, Shuhuai Ren, Shuo Liu, Tao Guo, Weiji Zhuang, Xin Zhang, Xingchen Song, Yihan Yan, Yongzhe He, Cici, Bowen Shen, Chengxuan Zhu, Chong Ma, Chun Chen, Heyu Chen, Jiawei Li, Lei Li, Menghang Zhu, Peidian Li, Qiying Wang, Sirui Deng, Weimin Xiong, Wenshan Huang, Wenyu Yang, Yilin Jiang, Yixin Yang, Yuanyuan Tian, Yue Ma, Yue Yu, Zihan Zhang, Zihao Yue, Bangjun Xiao, Bingquan Xia, Bofei Gao, Bowen Ye, Can Cai, Chang Liu, Chenhong He, Chunan Li, Dawei Zhu, Duo Zhang, Fengyuan Shi, Guoan Wang, Hailin Zhang, Hanglong Lv, Hanyu Li, Hao Tian, Heng Qu, Hongshen Xu, Houbin Zhang, Huaqiu Liu, Jiangshan Duo, Jianguang Zuo, Jianyu Wei, Jiebao Xiao, Jinhao Dong, Jun Shi, Junhao Hu, Kainan Bao, Kang Zhou, Linghao Zhang, Meng Chen, Nuo Chen, Peng Zhang, Qianli Chen, Qiantong Wang, Rang Li, Shaohui Liu, Shengfan Wang, Shicheng Li, Shihua Yu, Shijie Cao, Shimao Chen, Shuhao Gu, Weikun Wang, Wenhan Ma, Xiangwei Deng, Xing Yong, Xing Zhang, Xu Wang, Yifan Song, Yihao Zhao, Yingbo Zhao, Yizhao Gao, Yu Cheng, Yu Tu, Yudong Wang, Zhaojun Huang, Zhengju Tang, Zhenru Lin, Zhichao Song, Zhipeng Xu, Zhixian Zheng, Zihan Jiang

Main category: cs.CL

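TL;DR: Through large-scale pretraining, MiMo-Audio gains few-shot learning ability in the audio domain, reaches state-of-the-art results among open-source models on a range of audio tasks, and shows strong generalization and generation capabilities.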

Motivation: Inspired by GPT-3, the work asks whether large-scale autoregressive pretraining alone can give audio models the generality and few-shot learning ability seen in text language models, removing the need for task-specific fine-tuning.

Method: Pretraining data for MiMo-Audio is scaled to over one hundred million hours to build a large autoregressive audio language model. In post-training, a diverse instruction-tuning corpus is curated and thinking mechanisms are introduced to improve understanding and generation.

Result: MiMo-Audio-7B-Base reaches open-source SOTA on speech intelligence and audio understanding benchmarks, generalizes to unseen tasks such as voice conversion, style transfer, and speech editing, and produces high-quality speech continuation. MiMo-Audio-7B-Instruct approaches or surpasses closed-source models on several audio understanding, spoken dialogue, and instruct-TTS evaluations.

Conclusion: Large-scale autoregressive pretraining lets few-shot learning and strong generalization emerge in audio language models, and instruction tuning with thinking mechanisms improves performance further, advancing general-purpose audio intelligence.

Abstract: Existing audio language models typically rely on task-specific fine-tuning to accomplish particular audio tasks. In contrast, humans are able to generalize to new audio tasks with only a few examples or simple instructions. GPT-3 has shown that scaling next-token prediction pretraining enables strong generalization capabilities in text, and we believe this paradigm is equally applicable to the audio domain. By scaling MiMo-Audio's pretraining data to over one hundred million hours, we observe the emergence of few-shot learning capabilities across a diverse set of audio tasks. We develop a systematic evaluation of these capabilities and find that MiMo-Audio-7B-Base achieves SOTA performance on both speech intelligence and audio understanding benchmarks among open-source models. Beyond standard metrics, MiMo-Audio-7B-Base generalizes to tasks absent from its training data, such as voice conversion, style transfer, and speech editing. MiMo-Audio-7B-Base also demonstrates powerful speech continuation capabilities, capable of generating highly realistic talk shows, recitations, livestreaming and debates. At the post-training stage, we curate a diverse instruction-tuning corpus and introduce thinking mechanisms into both audio understanding and generation. MiMo-Audio-7B-Instruct achieves open-source SOTA on audio understanding benchmarks (MMSU, MMAU, MMAR, MMAU-Pro), spoken dialogue benchmarks (Big Bench Audio, MultiChallenge Audio) and instruct-TTS evaluations, approaching or surpassing closed-source models. Model checkpoints and full evaluation suite are available at https://github.com/XiaomiMiMo/MiMo-Audio.

[13] StressRoBERTa: Cross-Condition Transfer Learning from Depression, Anxiety, and PTSD to Stress Detection

Amal Alqahtani, Efsun Kayi, Mona Diab

Main category: cs.CL

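TL;DR: This paper proposes StressRoBERTa, a cross-condition transfer learning model for automatically detecting self-reported chronic stress in English tweets. With continual training on text related to depression, anxiety, and PTSD, followed by fine-tuning on the SMM4H 2022 task dataset, the model reaches an F1-score of 82%, outperforming the best existing system.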

Motivation: Chronic stress is a significant public health concern, and social media has become an important place for people to share their experiences of stress. Because chronic stress is often comorbid with other mental health conditions, text from related conditions may improve stress detection.

Method: RoBERTa is continually pretrained on the Stress-SMHD corpus (108M words of text posted by users with depression, anxiety, and PTSD) and then fine-tuned on the SMM4H 2022 Task 8 dataset to detect chronic stress.

Result: StressRoBERTa reaches 82% F1 on the SMM4H 2022 task, 3 percentage points above the best shared task system and 1 percentage point above vanilla RoBERTa, which indicates that transfer from related conditions works better. It also transfers well to the Dreaddit dataset (81% F1).

Conclusion: Cross-condition transfer learning focused on highly comorbid mental disorders improves chronic stress detection, showing that focused training has an advantage over general mental health training.

Abstract: The prevalence of chronic stress represents a significant public health concern, with social media platforms like Twitter serving as important venues for individuals to share their experiences. This paper introduces StressRoBERTa, a cross-condition transfer learning approach for automatic detection of self-reported chronic stress in English tweets. The investigation examines whether continual training on clinically related conditions (depression, anxiety, PTSD), disorders with high comorbidity with chronic stress, improves stress detection compared to general language models and broad mental health models. RoBERTa is continually trained on the Stress-SMHD corpus (108M words from users with self-reported diagnoses of depression, anxiety, and PTSD) and fine-tuned on the SMM4H 2022 Task 8 dataset. StressRoBERTa achieves 82% F1-score, outperforming the best shared task system (79% F1) by 3 percentage points. The results demonstrate that focused cross-condition transfer from stress-related disorders (+1% F1 over vanilla RoBERTa) provides stronger representations than general mental health training. Evaluation on Dreaddit (81% F1) further demonstrates transfer from clinical mental health contexts to situational stress discussions.

[14] Explaining News Bias Detection: A Comparative SHAP Analysis of Transformer Model Decision Mechanisms

Himel Ghosh

Main category: cs.CL

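TL;DR: This paper compares two Transformer-based bias detection models, using SHAP explanations to analyze their decision mechanisms on news text. Both models attend to similar evaluative language, but they differ substantially in how they integrate those signals. The domain-adapted model has more reasonable attribution patterns and a lower false positive rate.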

Motivation: Current automated bias detection models lack interpretability, which makes their decision processes and failure causes hard to understand and limits trustworthy deployment in journalism.

Method: SHAP-based explanations provide word-level attribution analysis for a bias detector fine-tuned on the BABE dataset and for a domain-adapted RoBERTa model, comparing linguistic patterns across correct and incorrect predictions.

Result: Both models attend to similar features of biased language, but the domain-adapted model's attributions align better with its prediction outcomes and it produces 63% fewer false positives. Errors stem mainly from discourse-level ambiguity rather than explicit bias cues.

Conclusion: Model architecture and training strategy significantly affect the reliability and suitability of bias detection systems, and interpretability-aware evaluation is needed to support practical use in journalism.

Abstract: Automated bias detection in news text is heavily used to support journalistic analysis and media accountability, yet little is known about how bias detection models arrive at their decisions or why they fail. In this work, we present a comparative interpretability study of two transformer-based bias detection models: a bias detector fine-tuned on the BABE dataset and a domain-adapted pre-trained RoBERTa model fine-tuned on the BABE dataset, using SHAP-based explanations. We analyze word-level attributions across correct and incorrect predictions to characterize how different model architectures operationalize linguistic bias. Our results show that although both models attend to similar categories of evaluative language, they differ substantially in how these signals are integrated into predictions. The bias detector model assigns stronger internal evidence to false positives than to true positives, indicating a misalignment between attribution strength and prediction correctness and contributing to systematic over-flagging of neutral journalistic content. In contrast, the domain-adaptive model exhibits attribution patterns that better align with prediction outcomes and produces 63% fewer false positives. We further demonstrate that model errors arise from distinct linguistic mechanisms, with false positives driven by discourse-level ambiguity rather than explicit bias cues. These findings highlight the importance of interpretability-aware evaluation for bias detection systems and suggest that architectural and training choices critically affect both model reliability and deployment suitability in journalistic contexts.

[15] Retrieval Augmented Question Answering: When Should LLMs Admit Ignorance?

Dingmin Wang, Ji Ma, Shankar Kumar

Main category: cs.CL

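TL;DR: The paper proposes an adaptive prompting strategy that processes retrieved information in chunks, preserving question answering performance while using fewer tokens. It also finds that LLMs tend to generate wrong answers rather than decline when information is insufficient.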

Motivation: Longer contexts make it easier to bring in knowledge, but they also carry more irrelevant information, which hurts LLM performance in retrieval-augmented question answering.

Method: The retrieved information is split into smaller chunks, and the LLM is prompted sequentially to answer the question using each chunk. Adjusting the chunk size trades off including relevant information against suppressing irrelevant information.

Result: Experiments on three open-domain question answering datasets show that the strategy matches standard prompting while using fewer tokens. The analysis finds that LLMs often generate incorrect answers rather than decline when information is insufficient.

Conclusion: The adaptive prompting strategy reduces the interference caused by long contexts, and the findings show the importance of improving LLMs' ability to decline, which needs further research.

Abstract: The success of expanded context windows in Large Language Models (LLMs) has driven increased use of broader context in retrieval-augmented generation. We investigate the use of LLMs for retrieval-augmented question answering. While longer contexts make it easier to incorporate targeted knowledge, they introduce more irrelevant information that hinders the model's generation process and degrades its performance. To address the issue, we design an adaptive prompting strategy which involves splitting the retrieved information into smaller chunks and sequentially prompting an LLM to answer the question using each chunk. Adjusting the chunk size allows a trade-off between incorporating relevant information and reducing irrelevant information. Experimental results on three open-domain question answering datasets demonstrate that the adaptive strategy matches the performance of standard prompting while using fewer tokens. Our analysis reveals that when encountering insufficient information, the LLM often generates incorrect answers instead of declining to respond, which constitutes a major source of error. This finding highlights the need for further research into enhancing LLMs' ability to effectively decline requests when faced with inadequate information.
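The chunked prompting strategy can be sketched as a loop over groups of retrieved passages. This is a minimal illustration rather than the authors' implementation. The prompt wording, the `UNKNOWN` sentinel, the early stop on the first non-declined answer, and the `llm` callable are all assumptions.

```python
def chunked(passages, size):
    """Split the retrieved passages into consecutive groups of `size`."""
    return [passages[i:i + size] for i in range(0, len(passages), size)]


def answer_with_chunks(question, passages, llm, chunk_size=2):
    """Prompt the model on one chunk at a time and return the first answer
    that is not a refusal. `llm` is any callable mapping a prompt string
    to a reply string (hypothetical interface)."""
    for group in chunked(passages, chunk_size):
        prompt = (
            "Answer using only the context. "
            "Reply UNKNOWN if the context is insufficient.\n"
            "Context:\n" + "\n".join(group) + "\n"
            "Question: " + question
        )
        reply = llm(prompt).strip()
        if reply != "UNKNOWN":
            return reply  # later chunks are never sent, saving tokens
    return "UNKNOWN"


# Stub model: answers only when the supporting passage is in the prompt.
def stub_llm(prompt):
    return "Paris" if "Paris is the capital" in prompt else "UNKNOWN"


passages = [
    "Berlin is in Germany.",
    "Rome is in Italy.",
    "Paris is the capital of France.",
    "Madrid is in Spain.",
]
print(answer_with_chunks("What is the capital of France?", passages, stub_llm))
```

The loop depends on the model declining when a chunk lacks the answer, which is the behavior the paper finds unreliable. A wrong answer on an early chunk would end the search prematurely.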

[16] Adversarial Lens: Exploiting Attention Layers to Generate Adversarial Examples for Evaluation

Kaustubh Dhole

Main category: cs.CL

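TL;DR: This paper proposes a new method for generating adversarial examples from attention-layer token distributions, perturbing inputs with the model's own intermediate-layer representations to degrade performance on LLM evaluation tasks. The perturbations are semantically consistent and plausible, but substitutions from some layers and positions can degrade grammar.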

Motivation: To explore whether the information that mechanistic interpretability finds encoded in attention layers can be used to generate effective adversarial examples, in order to test the robustness of LLM-based evaluation systems.

Method: Token distributions are extracted from intermediate attention layers of LLaMA-3.1-Instruct-8B and applied to the input text as adversarial perturbations, without relying on conventional prompt-based or gradient-based attacks.

Result: Experiments on the ArgQuality dataset show that attention-based adversarial examples measurably reduce evaluation performance while remaining semantically similar to the original inputs. Substitutions from certain layers and positions, however, introduce grammatical problems.

Conclusion: Intermediate-layer representations are a promising, principled source of adversarial examples for stress-testing LLM evaluation pipelines, but their practical effectiveness is limited by grammatical quality and by the choice of layer.

Abstract: Recent advances in mechanistic interpretability suggest that intermediate attention layers encode token-level hypotheses that are iteratively refined toward the final output. In this work, we exploit this property to generate adversarial examples directly from attention-layer token distributions. Unlike prompt-based or gradient-based attacks, our approach leverages model-internal token predictions, producing perturbations that are both plausible and internally consistent with the model's own generation process. We evaluate whether tokens extracted from intermediate layers can serve as effective adversarial perturbations for downstream evaluation tasks. We conduct experiments on argument quality assessment using the ArgQuality dataset, with LLaMA-3.1-Instruct-8B serving as both the generator and evaluator. Our results show that attention-based adversarial examples lead to measurable drops in evaluation performance while remaining semantically similar to the original inputs. However, we also observe that substitutions drawn from certain layers and token positions can introduce grammatical degradation, limiting their practical effectiveness. Overall, our findings highlight both the promise and current limitations of using intermediate-layer representations as a principled source of adversarial examples for stress-testing LLM-based evaluation pipelines.

[17] Integrating Domain Knowledge for Financial QA: A Multi-Retriever RAG Approach with LLMs

Yukun Zhang,Stefan Elbl Droguett,Samyak Jain

Main category: cs.CL

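TL;DR: This study proposes a multi-retriever RAG system that combines domain-specific training with the latest large language models to improve financial numerical-reasoning QA, achieving SOTA results that surpass the baseline model while still falling short of human expert performance.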

Details Motivation: Lacking specialized financial knowledge, existing large language models perform poorly on financial questions that require complex multi-step numerical reasoning; this work aims to mitigate the problem by introducing external knowledge and domain-specific training. Method: Build a multi-retriever retrieval-augmented generation (RAG) system, apply domain-specific training with the SecBERT encoder, and use the latest large language models to construct a neural symbolic model and a prompt-based generator; ablation experiments and error analysis validate the approach. Result: Domain-specific training with SecBERT significantly improves performance, surpassing the top model from the FinQA paper; the best prompt-based LLM generator achieves SOTA with an improvement of more than 7%, yet remains below human expert performance; for larger models, the gains from external knowledge typically outweigh the loss from hallucinations. Conclusion: Domain-specific training and external knowledge retrieval are crucial for financial numerical reasoning, and the latest large language models show stronger numerical reasoning under few-shot learning; future work should balance hallucination against knowledge gains to improve performance further. Abstract: This research project addresses the errors of financial numerical reasoning Question Answering (QA) tasks due to the lack of domain knowledge in finance. Despite recent advances in Large Language Models (LLMs), financial numerical questions remain challenging because they require specific domain knowledge in finance and complex multi-step numeric reasoning. We implement a multi-retriever Retrieval Augmented Generators (RAG) system to retrieve both external domain knowledge and internal question contexts, and utilize the latest LLM to tackle these tasks. Through comprehensive ablation experiments and error analysis, we find that domain-specific training with the SecBERT encoder significantly contributes to our best neural symbolic model surpassing the FinQA paper's top model, which serves as our baseline. This suggests the potential superior performance of domain-specific training. Furthermore, our best prompt-based LLM generator achieves the state-of-the-art (SOTA) performance with significant improvement (>7%), yet it is still below the human expert performance. This study highlights the trade-off between hallucinations loss and external knowledge gains in smaller models and few-shot examples. For larger models, the gains from external facts typically outweigh the hallucination loss. Finally, our findings confirm the enhanced numerical reasoning capabilities of the latest LLM, optimized for few-shot learning.

[18] Disentangling Learning from Judgment: Representation Learning for Open Response Analytics

Conrad Borchers,Manit Patel,Seiyon M. Lee,Anthony F. Botelho

Main category: cs.CL

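TL;DR: This paper proposes an analytics-first framework for separating the influence of student content from teacher grading tendencies in open-ended responses. It models teacher histories dynamically, represents text with sentence embeddings, applies centering and residualization to reduce confounds, quantifies each signal's contribution with temporally validated linear models, and surfaces grading disagreements through a projection. Results show that teacher priors strongly influence grade prediction; combining them with content embeddings gives the best performance (AUC 0.815), while content-only models are weaker (AUC 0.626). Content representations adjusted for rater effects are more informative and help identify genuine semantic evidence of student understanding. The framework turns embedding features into learning analytics that support teaching reflection.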

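Details Motivation: Automated scoring often conflates what students actually wrote with how teachers habitually grade, which can misrepresent student understanding; a method is needed that separates content signals from rater bias to make grading more transparent and fair. Method: An analytics-first framework: using de-identified ASSISTments mathematics responses, model teacher grading histories as dynamic priors, represent text content with sentence embeddings, and apply centering and residualization to remove prompt and teacher confounds; temporally validated linear models quantify the contributions of content and rater signals, and a projection surfaces model disagreements for qualitative inspection. Result: Teacher priors heavily influence grade predictions, and the strongest results come from combining priors with content embeddings (AUC~0.815); content-only models are substantially weaker (AUC~0.626); after adjusting for rater effects, the residual content representation retains more informative embedding dimensions and identifies responses that show genuine understanding as opposed to surface-level differences; the projection reveals cases of grading disagreement. Conclusion: The framework separates content signals from teacher tendencies in grading, improving the interpretability and auditability of automated scoring; beyond predictive accuracy, it turns embedding features into learning analytics for reflection, helping teachers and researchers examine whether grading practices align with evidence of student reasoning and supporting fairer, evidence-based teaching decisions. Abstract: Open-ended responses are central to learning, yet automated scoring often conflates what students wrote with how teachers grade. We present an analytics-first framework that separates content signals from rater tendencies, making judgments visible and auditable via analytics. Using de-identified ASSISTments mathematics responses, we model teacher histories as dynamic priors and derive text representations from sentence embeddings, incorporating centering and residualization to mitigate prompt and teacher confounds. Temporally-validated linear models quantify the contributions of each signal, and a projection surfaces model disagreements for qualitative inspection. Results show that teacher priors heavily influence grade predictions; the strongest results arise when priors are combined with content embeddings (AUC~0.815), while content-only models remain above chance but substantially weaker (AUC~0.626). Adjusting for rater effects sharpens the residual content representation, retaining more informative embedding dimensions and revealing cases where semantic evidence supports understanding as opposed to surface-level differences in how students respond. The contribution presents a practical pipeline that transforms embeddings from mere features into learning analytics for reflection, enabling teachers and researchers to examine where grading practices align (or conflict) with evidence of student reasoning and learning.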
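
As a rough illustration of the centering step mentioned in this entry, the sketch below subtracts the per-group mean embedding (for example, the mean over all responses to the same prompt) from each response embedding. It covers centering only, not residualization; the function name, the plain-list vector representation, and the grouping keys are assumptions for the example, not the paper's pipeline.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def center_by_group(
    vectors: Sequence[Sequence[float]],
    groups: Sequence[Hashable],
) -> List[List[float]]:
    """Subtract each group's mean vector from the vectors in that group.

    vectors[i] is an embedding and groups[i] its group key (e.g. a prompt id).
    After centering, every group has a zero mean vector, so differences that
    are shared by the whole group no longer dominate the representation.
    """
    if len(vectors) != len(groups):
        raise ValueError("vectors and groups must have the same length")

    sums: Dict[Hashable, List[float]] = {}
    counts: Dict[Hashable, int] = defaultdict(int)
    for vec, group in zip(vectors, groups):
        if group not in sums:
            sums[group] = [0.0] * len(vec)
        for j, value in enumerate(vec):
            sums[group][j] += value
        counts[group] += 1

    means = {g: [s / counts[g] for s in total] for g, total in sums.items()}
    return [
        [value - mean for value, mean in zip(vec, means[group])]
        for vec, group in zip(vectors, groups)
    ]
```

The same function could be applied with teacher ids as the grouping key to remove a per-teacher offset, though whether the authors center per prompt, per teacher, or both is not stated in the abstract.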

[19] Improving Multi-step RAG with Hypergraph-based Memory for Long-Context Complex Relational Modeling

Chulun Zhou,Chunkang Zhang,Guoxin Yu,Fandong Meng,Jie Zhou,Wai Lam,Mo Yu

Main category: cs.CL

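TL;DR: This paper proposes HGMem, a hypergraph-based memory mechanism that strengthens complex reasoning and global understanding in multi-step retrieval-augmented generation (RAG).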

Details Motivation: Memory modules in existing RAG systems mostly act as static storage and do not model the high-order correlations among primitive facts, which limits their performance in multi-step reasoning and knowledge evolution. Method: HGMem represents memory as a hypergraph whose hyperedges correspond to memory units, supporting the progressive formation of higher-order interactions and yielding a dynamic, integrated knowledge structure that guides subsequent reasoning. Result: Experiments on several challenging datasets that require global understanding show that HGMem substantially outperforms strong baseline systems and consistently improves multi-step RAG. Conclusion: By introducing a dynamic hypergraph memory structure, HGMem strengthens knowledge integration and deep reasoning in multi-step RAG and improves the model's global sense-making and reasoning coherence. Abstract: Multi-step retrieval-augmented generation (RAG) has become a widely adopted strategy for enhancing large language models (LLMs) on tasks that demand global comprehension and intensive reasoning. Many RAG systems incorporate a working memory module to consolidate retrieved information. However, existing memory designs function primarily as passive storage that accumulates isolated facts for the purpose of condensing the lengthy inputs and generating new sub-queries through deduction. This static nature overlooks the crucial high-order correlations among primitive facts, the compositions of which can often provide stronger guidance for subsequent steps. Therefore, their representational strength and impact on multi-step reasoning and knowledge evolution are limited, resulting in fragmented reasoning and weak global sense-making capacity in extended contexts. We introduce HGMem, a hypergraph-based memory mechanism that extends the concept of memory beyond simple storage into a dynamic, expressive structure for complex reasoning and global understanding. In our approach, memory is represented as a hypergraph whose hyperedges correspond to distinct memory units, enabling the progressive formation of higher-order interactions within memory. This mechanism connects facts and thoughts around the focal problem, evolving into an integrated and situated knowledge structure that provides strong propositions for deeper reasoning in subsequent steps. We evaluate HGMem on several challenging datasets designed for global sense-making. Extensive experiments and in-depth analyses show that our method consistently improves multi-step RAG and substantially outperforms strong baseline systems across diverse tasks.

[20] Efficient Context Scaling with LongCat ZigZag Attention

Chen Zhang,Yang Bai,Jiahuan Li,Anchun Gui,Keheng Wang,Feifan Liu,Guanyu Wu,Yuwei Jiang,Defei Bu,Li Wei,Haihang Jing,Hongyin Tang,Xin Chen,Xiangzhou Huang,Fengcun Li,Rongxiang Weng,Yulei Qian,Yifan Lu,Yerui Sun,Jingang Wang,Yuchen Xie,Xunliang Cai

Main category: cs.CL

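TL;DR: This paper proposes LongCat ZigZag Attention (LoZA), a sparse attention scheme that efficiently converts full-attention models into sparse versions and significantly speeds up inference in long-context scenarios such as retrieval-augmented generation and tool-integrated reasoning.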

Details Motivation: Improve inference efficiency in long-context scenarios under a limited compute budget, addressing the high computational cost of full attention. Method: Propose a sparse attention scheme called LoZA and apply it to LongCat-Flash during mid-training, enabling fast processing of up to a million tokens. Result: LoZA achieves significant speed-ups in both prefill-intensive and decode-intensive cases, supports contexts of up to 1 million tokens, and improves long-term reasoning and long-horizon agentic capabilities. Conclusion: LoZA is an efficient sparse attention method that makes existing models considerably more practical in long-context scenarios. Abstract: We introduce LongCat ZigZag Attention (LoZA), which is a sparse attention scheme designed to transform any existing full-attention models into sparse versions with rather limited compute budget. In long-context scenarios, LoZA can achieve significant speed-ups both for prefill-intensive (e.g., retrieval-augmented generation) and decode-intensive (e.g., tool-integrated reasoning) cases. Specifically, by applying LoZA to LongCat-Flash during mid-training, we serve LongCat-Flash-Exp as a long-context foundation model that can swiftly process up to 1 million tokens, enabling efficient long-term reasoning and long-horizon agentic capabilities.

[21] CEC-Zero: Zero-Supervision Character Error Correction with Self-Generated Rewards

Zhiming Lin,Kai Zhao,Sophie Zhang,Peilai Yu,Canran Xiao

Main category: cs.CL

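TL;DR: Proposes CEC-Zero, a zero-supervision reinforcement learning framework that lets large language models correct their own Chinese spelling errors, significantly outperforming supervised methods and strong LLM fine-tunes on 9 benchmarks.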

Details Motivation: Existing LLMs and supervised methods are not robust to novel errors and depend on costly annotations, so large-scale Chinese spelling correction remains challenging. Method: Synthesize errorful inputs from clean text, compute cluster-consensus rewards from semantic similarity and candidate agreement, and optimize the policy with PPO. Result: Across 9 benchmarks, CEC-Zero outperforms supervised baselines by 10-13 F$_1$ points and strong LLM fine-tunes by 5-8 points, with theoretical guarantees of unbiased rewards and convergence. Conclusion: CEC-Zero establishes a label-free paradigm for robust, scalable Chinese spelling correction, unlocking the potential of LLMs in noisy text processing. Abstract: Large-scale Chinese spelling correction (CSC) remains critical for real-world text processing, yet existing LLMs and supervised methods lack robustness to novel errors and rely on costly annotations. We introduce CEC-Zero, a zero-supervision reinforcement learning framework that addresses this by enabling LLMs to correct their own mistakes. CEC-Zero synthesizes errorful inputs from clean text, computes cluster-consensus rewards via semantic similarity and candidate agreement, and optimizes the policy with PPO. It outperforms supervised baselines by 10--13 F$_1$ points and strong LLM fine-tunes by 5--8 points across 9 benchmarks, with theoretical guarantees of unbiased rewards and convergence. CEC-Zero establishes a label-free paradigm for robust, scalable CSC, unlocking LLM potential in noisy text pipelines.

[22] Fantastic Reasoning Behaviors and Where to Find Them: Unsupervised Discovery of the Reasoning Process

Zhenyu Zhang,Shujian Zhang,John Lambert,Wenxuan Zhou,Zhangyang Wang,Mingqing Chen,Andrew Hard,Rajiv Mathews,Lun Wang

Main category: cs.CL

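TL;DR: This paper proposes RISE, an unsupervised framework that uses sparse auto-encoders to discover interpretable reasoning vectors in activation space, revealing and controlling reasoning behaviors in large language models.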

Details Motivation: Existing methods rely on human-defined, word-level concepts to analyze reasoning, cannot capture the full range of complex reasoning behaviors, and are constrained by supervision; an unsupervised method is needed to discover and identify diverse reasoning behaviors automatically. Method: Segment chain-of-thought traces into sentence-level steps and train sparse auto-encoders (SAEs) on step-level activations to extract disentangled features corresponding to different reasoning behaviors; visualization, clustering, and intervention experiments verify their interpretability and controllability. Result: The framework identifies interpretable reasoning behaviors such as reflection, backtracking, and confidence modulation, which occupy separable regions in the decoder space; intervening on specific vectors steers reasoning behavior, and novel behaviors beyond human annotation are discovered. Conclusion: Sparse auto-encoders can discover structured reasoning behaviors from activation space without supervision, offering a new route to understanding and steering LLM reasoning. Abstract: Despite the growing reasoning capabilities of recent large language models (LLMs), their internal mechanisms during the reasoning process remain underexplored. Prior approaches often rely on human-defined concepts (e.g., overthinking, reflection) at the word level to analyze reasoning in a supervised manner. However, such methods are limited, as it is infeasible to capture the full spectrum of potential reasoning behaviors, many of which are difficult to define in token space. In this work, we propose an unsupervised framework (namely, RISE: Reasoning behavior Interpretability via Sparse auto-Encoder) for discovering reasoning vectors, which we define as directions in the activation space that encode distinct reasoning behaviors. By segmenting chain-of-thought traces into sentence-level 'steps' and training sparse auto-encoders (SAEs) on step-level activations, we uncover disentangled features corresponding to interpretable behaviors such as reflection and backtracking. Visualization and clustering analyses show that these behaviors occupy separable regions in the decoder column space. Moreover, targeted interventions on SAE-derived vectors can controllably amplify or suppress specific reasoning behaviors, altering inference trajectories without retraining. Beyond behavior-specific disentanglement, SAEs capture structural properties such as response length, revealing clusters of long versus short reasoning traces. More interestingly, SAEs enable the discovery of novel behaviors beyond human supervision. We demonstrate the ability to control response confidence by identifying confidence-related vectors in the SAE decoder space. These findings underscore the potential of unsupervised latent discovery for both interpreting and controllably steering reasoning in LLMs.

[23] WISE: Web Information Satire and Fakeness Evaluation

Gaurab Chhetri,Subasish Das,Tausif Islam Chowdhury

Main category: cs.CL

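TL;DR: This study proposes the WISE framework, which compares eight lightweight transformer models and two baseline models on 20,000 samples to evaluate how well they distinguish fake news from satire. MiniLM and RoBERTa-base perform best, and lightweight models prove practical for resource-constrained settings.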

Details Motivation: Fake news and satire share linguistic features but differ in intent, so telling them apart is difficult, and existing methods struggle to balance efficiency and accuracy. Method: Build the WISE framework and, on a balanced subset of the Fakeddit dataset with stratified 5-fold cross-validation, evaluate several lightweight transformer models and baseline models using accuracy, F1-score, ROC-AUC, and other metrics. Result: MiniLM achieves the highest accuracy (87.58%); RoBERTa-base achieves the highest ROC-AUC (95.42%) with strong accuracy (87.36%); DistilBERT offers a good efficiency-accuracy trade-off (86.28% accuracy, 93.90% ROC-AUC); statistical tests show significant differences between models. Conclusion: Lightweight transformer models can match or exceed baseline performance in distinguishing fake news from satire and are suitable for deploying misinformation detection systems in resource-constrained environments. Abstract: Distinguishing fake or untrue news from satire or humor poses a unique challenge due to their overlapping linguistic features and divergent intent. This study develops the WISE (Web Information Satire and Fakeness Evaluation) framework, which benchmarks eight lightweight transformer models alongside two baseline models on a balanced dataset of 20,000 samples from Fakeddit, annotated as either fake news or satire. Using stratified 5-fold cross-validation, we evaluate models across comprehensive metrics including accuracy, precision, recall, F1-score, ROC-AUC, PR-AUC, MCC, Brier score, and Expected Calibration Error. Our evaluation reveals that MiniLM, a lightweight model, achieves the highest accuracy (87.58%) among all models, while RoBERTa-base achieves the highest ROC-AUC (95.42%) and strong accuracy (87.36%). DistilBERT offers an excellent efficiency-accuracy trade-off with 86.28% accuracy and 93.90% ROC-AUC. Statistical tests confirm significant performance differences between models, with paired t-tests and McNemar tests providing rigorous comparisons. Our findings highlight that lightweight models can match or exceed baseline performance, offering actionable insights for deploying misinformation detection systems in real-world, resource-constrained settings.
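
Two of the calibration metrics listed in this entry, the Brier score and Expected Calibration Error (ECE), can be computed for a binary classifier roughly as below. This is a generic sketch of the standard definitions, not the paper's evaluation code; the equal-width binning with 10 bins and the 0.5 decision threshold are assumptions.

```python
from typing import List, Sequence, Tuple


def brier_score(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean squared difference between predicted P(y=1) and the 0/1 label."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(
    probs: Sequence[float],
    labels: Sequence[int],
    n_bins: int = 10,
) -> float:
    """ECE with equal-width confidence bins.

    Confidence is the probability of the predicted class, max(p, 1 - p);
    ECE is the sample-weighted average of |accuracy - mean confidence| per bin.
    """
    bins: List[List[Tuple[float, float]]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        confidence = max(p, 1.0 - p)
        predicted = 1 if p >= 0.5 else 0
        correct = 1.0 if predicted == y else 0.0
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, correct))

    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_confidence = sum(c for c, _ in members) / len(members)
        accuracy = sum(k for _, k in members) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_confidence)
    return ece
```

For both metrics lower is better: a model that predicts 0.9 for two samples but is right on only one of them has a gap of 0.4 between confidence and accuracy in that bin.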

[24] iCLP: Large Language Model Reasoning with Implicit Cognition Latent Planning

Sijia Chen,Di Niu

Main category: cs.CL

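TL;DR: Proposes the iCLP framework, which generates compact reasoning instructions in latent space through implicit planning, improving the accuracy, efficiency, and cross-domain generalization of large models on mathematical reasoning and code generation tasks.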

Details Motivation: Large language models rely on explicit textual plans for reasoning, but hallucinations and task diversity make accurate plans hard to generate. Inspired by human implicit cognition, the work aims for efficient planning that needs no explicit verbalization. Method: Distill explicit plans from existing chain-of-thought trajectories, learn discrete latent representations of them with a vector-quantized autoencoder, and fine-tune the LLM to generate reasoning steps conditioned on latent plans. Result: On mathematical reasoning and code generation tasks, iCLP significantly improves accuracy and reasoning efficiency and shows strong cross-domain generalization while preserving the interpretability of chain-of-thought reasoning. Conclusion: iCLP enables implicit planning in latent space, combining the advantages of implicit cognition with the interpretability of explicit reasoning and offering a new route to efficient, reliable reasoning. Abstract: Large language models (LLMs), when guided by explicit textual plans, can perform reliable step-by-step reasoning during problem-solving. However, generating accurate and effective textual plans remains challenging due to LLM hallucinations and the high diversity of task-specific questions. To address this, we draw inspiration from human Implicit Cognition (IC), the subconscious process by which decisions are guided by compact, generalized patterns learned from past experiences without requiring explicit verbalization. We propose iCLP, a novel framework that enables LLMs to adaptively generate latent plans (LPs), which are compact encodings of effective reasoning instructions. iCLP first distills explicit plans from existing step-by-step reasoning trajectories. It then learns discrete representations of these plans via a vector-quantized autoencoder coupled with a codebook. Finally, by fine-tuning LLMs on paired latent plans and corresponding reasoning steps, the models learn to perform implicit planning during reasoning. Experimental results on mathematical reasoning and code generation tasks demonstrate that, with iCLP, LLMs can plan in latent space while reasoning in language space. This approach yields significant improvements in both accuracy and efficiency and, crucially, demonstrates strong cross-domain generalization while preserving the interpretability of chain-of-thought reasoning.

[25] Beyond Hallucinations: A Composite Score for Measuring Reliability in Open-Source Large Language Models

Rohit Kumar Salla,Manoj Saravanan,Shrikar Reddy Kota

Main category: cs.CL

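TL;DR: This paper proposes the Composite Reliability Score (CRS), a unified framework for jointly assessing the reliability of large language models in terms of calibration, robustness, and uncertainty quantification, and validates it across multiple models and datasets.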

Details Motivation: Large language models are widely used in decision-critical domains, but their reliability (overconfident errors, degradation under input shifts, missing uncertainty estimates) remains unclear, and existing evaluations are fragmented and cannot measure reliability as a whole. Method: Propose the CRS framework, which integrates calibration, robustness, and uncertainty quantification into a single interpretable metric, and run baseline, perturbation, and calibration-method experiments on ten open-source LLMs across five QA datasets. Result: CRS yields stable model rankings, uncovers hidden failure modes that single metrics miss, and shows that the most dependable systems balance accuracy, robustness, and calibrated uncertainty. Conclusion: CRS is an effective unified evaluation tool that helps make large models more trustworthy and useful in critical applications. Abstract: Large Language Models (LLMs) like LLaMA, Mistral, and Gemma are increasingly used in decision-critical domains such as healthcare, law, and finance, yet their reliability remains uncertain. They often make overconfident errors, degrade under input shifts, and lack clear uncertainty estimates. Existing evaluations are fragmented, addressing only isolated aspects. We introduce the Composite Reliability Score (CRS), a unified framework that integrates calibration, robustness, and uncertainty quantification into a single interpretable metric. Through experiments on ten leading open-source LLMs across five QA datasets, we assess performance under baselines, perturbations, and calibration methods. CRS delivers stable model rankings, uncovers hidden failure modes missed by single metrics, and highlights that the most dependable systems balance accuracy, robustness, and calibrated uncertainty.

[26] HY-MT1.5 Technical Report

Mao Zheng,Zheng Li,Tao Chen,Mingyang Song,Di Wang

Main category: cs.CL

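TL;DR: This report introduces the new machine translation models HY-MT1.5-1.8B and HY-MT1.5-7B, trained with a multi-stage framework. They perform strongly across translation tasks and, in parameter efficiency and on specific language pairs, outperform mainstream open-source and commercial models.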

Details Motivation: Improve the performance and parameter efficiency of machine translation models, in particular to surpass existing large models on Chinese-foreign, English-foreign, and minority-language translation tasks while supporting advanced translation constraints. Method: A holistic multi-stage training framework that combines general and MT-oriented pre-training, supervised fine-tuning, on-policy distillation, and reinforcement learning. Result: HY-MT1.5-1.8B outperforms much larger open-source models (e.g., Tower-Plus-72B) and commercial APIs on several benchmarks and reaches about 90% of Gemini-3.0-Pro's performance; HY-MT1.5-7B sets a new SOTA for its size class, reaching 95% of Gemini-3.0-Pro's performance on Flores-200 and surpassing it on WMT25 and the Mandarin-minority language test sets. Conclusion: The HY-MT1.5 models provide high-performing, robust translation at their respective parameter scales together with advanced feature support, placing them among the best open-source MT models currently available. Abstract: In this report, we introduce our latest translation models, HY-MT1.5-1.8B and HY-MT1.5-7B, a new family of machine translation models developed through a holistic training framework tailored for high-performance translation. Our methodology orchestrates a multi-stage pipeline that integrates general and MT-oriented pre-training, supervised fine-tuning, on-policy distillation, and reinforcement learning. HY-MT1.5-1.8B, the 1.8B-parameter model, demonstrates remarkable parameter efficiency, comprehensively outperforming significantly larger open-source baselines (e.g., Tower-Plus-72B, Qwen3-32B) and mainstream commercial APIs (e.g., Microsoft Translator, Doubao Translator) in standard Chinese-foreign and English-foreign tasks. It achieves approximately 90% of the performance of ultra-large proprietary models such as Gemini-3.0-Pro; while marginally trailing Gemini-3.0-Pro on WMT25 and Mandarin-minority language benchmarks, it maintains a substantial lead over other competing models. Furthermore, HY-MT1.5-7B establishes a new state-of-the-art for its size class, achieving 95% of Gemini-3.0-Pro's performance on Flores-200 and surpassing it on the challenging WMT25 and Mandarin-minority language test sets. Beyond standard translation, the HY-MT1.5 series supports advanced constraints, including terminology intervention, context-aware translation, and format preservation. Extensive empirical evaluations confirm that both models offer highly competitive, robust solutions for general and specialized translation tasks within their respective parameter scales.

[27] Training a Huggingface Model on AWS Sagemaker (Without Tears)

Liling Tan

Main category: cs.CL

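TL;DR: This paper centralizes the essential information researchers need to train their first Hugging Face model on AWS SageMaker from scratch, with the aim of making cloud computing more accessible.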

Details Motivation: Lacking on-premise compute, many researchers turn to cloud services to train models, but cloud platforms have a steep learning curve and existing documentation is incomplete, leaving knowledge gaps. Method: Consolidate the key steps and information needed to train a Hugging Face model on AWS SageMaker from scratch into one simplified guide. Result: Researchers get a clear, centralized walkthrough that lowers the barrier to using cloud platforms. Conclusion: The guide helps democratize the use of cloud platforms, enabling more resource-constrained researchers to train large language models efficiently. Abstract: The development of Large Language Models (LLMs) has primarily been driven by resource-rich research groups and industry partners. Due to the lack of on-premise computing resources required for increasingly complex models, many researchers are turning to cloud services like AWS SageMaker to train Hugging Face models. However, the steep learning curve of cloud platforms often presents a barrier for researchers accustomed to local environments. Existing documentation frequently leaves knowledge gaps, forcing users to seek fragmented information across the web. This demo paper aims to democratize cloud adoption by centralizing the essential information required for researchers to successfully train their first Hugging Face model on AWS SageMaker from scratch.

[28] Activation Steering for Masked Diffusion Language Models

Adi Shnaidman,Erin Feiglin,Osher Yaari,Efrat Mentel,Amit Levi,Raz Lapid

Main category: cs.CL

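TL;DR: Proposes an activation-steering framework for masked diffusion language models (MDLMs) that computes layer-wise steering vectors from contrastive examples, enabling efficient inference-time control.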

Details Motivation: Existing MDLMs lack effective mechanisms for inference-time control and steering, which limits their flexibility and controllability in practice. Method: Compute layer-wise steering vectors from contrastive examples in a single forward pass and apply them at every reverse-diffusion step, without simulating the denoising trajectory. Result: Experiments on LLaDA-8B-Instruct show that the method reliably modulates high-level text attributes; ablations analyze the effects of steering across transformer sub-modules and token scope. Conclusion: The proposed activation-steering framework gives MDLMs an efficient inference-time control mechanism, improving the controllability and flexibility of generated text. Abstract: Masked diffusion language models (MDLMs) generate text through an iterative denoising process. They have recently gained attention due to mask-parallel decoding and competitive performance with autoregressive large language models. However, effective mechanisms for inference-time control and steering in MDLMs remain largely unexplored. We present an activation-steering framework for MDLMs that computes layer-wise steering vectors from a single forward pass using contrastive examples, without simulating the denoising trajectory. These directions are applied at every reverse-diffusion step, yielding an efficient inference-time control mechanism. Experiments on LLaDA-8B-Instruct demonstrate reliable modulation of high-level attributes, with ablations examining the effects of steering across transformer sub-modules and token scope (prompt vs. response).

[29] Large Emotional World Model

Changhao Song,Yazhou Zhang,Hui Gao,Chang Yang,Peng Zhang

Main category: cs.CL

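TL;DR: This paper proposes a Large Emotional World Model (LEWM). By constructing the EWH dataset, which captures emotional causal relationships, it models emotional factors explicitly within the world model and improves prediction of emotion-driven social behavior.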

Details Motivation: Existing large language models can model world knowledge to some extent but focus mainly on physical regularities and lack systematic exploration of emotional factors; because emotion plays a key role in human decision-making and in how world states evolve, a world model that understands emotion is needed. Method: Inspired by theory of mind, construct the Emotion-Why-How (EWH) dataset, which links emotions to the causes and effects of actions, and build LEWM on it to jointly model emotional states, visual observations, and actions so that it predicts both future states and emotional transitions. Result: LEWM predicts emotion-driven social behaviors more accurately while performing comparably to general world models on basic tasks. Removing emotionally relevant information degrades reasoning performance, confirming the importance of modeling emotion. Conclusion: Incorporating emotion into world models improves understanding and prediction of complex social scenarios, and LEWM points toward intelligent systems with more human-like cognition. Abstract: World Models serve as tools for understanding the current state of the world and predicting its future dynamics, with broad application potential across numerous fields. As a key component of world knowledge, emotion significantly influences human decision-making. While existing Large Language Models (LLMs) have shown preliminary capability in capturing world knowledge, they primarily focus on modeling physical-world regularities and lack systematic exploration of emotional factors. In this paper, we first demonstrate the importance of emotion in understanding the world by showing that removing emotionally relevant information degrades reasoning performance. Inspired by theory of mind, we further propose a Large Emotional World Model (LEWM). Specifically, we construct the Emotion-Why-How (EWH) dataset, which integrates emotion into causal relationships and enables reasoning about why actions occur and how emotions drive future world states. Based on this dataset, LEWM explicitly models emotional states alongside visual observations and actions, allowing the world model to predict both future states and emotional transitions. Experimental results show that LEWM more accurately predicts emotion-driven social behaviors while maintaining comparable performance to general world models on basic tasks.

[30] Training Report of TeleChat3-MoE

Xinzhang Liu,Chao Wang,Zhihao Yang,Zhuo Jiang,Xuncheng Zhao,Haoran Wang,Lei Li,Dongdong He,Luobin Liu,Kaizhe Yuan,Han Gao,Zihan Wang,Yitong Yao,Sishi Xiong,Wenmin Deng,Haowei He,Kaidong Yu,Yu Zhao,Ruiyu Fang,Yuhao Jiang,Yingyan Li,Xiaohui Hu,Xi Yu,Jingqi Li,Yanwei Liu,Qingli Li,Xinyu Shi,Junhao Niu,Chengnuo Huang,Yao Xiao,Ruiwen Wang,Fengkai Li,Luwen Pu,Kaipeng Jia,Fubei Yao,Yuyao Huang,Xuewei He,Zhuoru Jiang,Ruiting Song,Rui Xue,Qiyi Xie,Jie Zhang,Zilu Huang,Zhaoxi Zhang,Zhilong Lu,Yanhan Zhang,Yin Zhang,Yanlei Xue,Zhu Yuan,Teng Su,Xin Jiang,Shuangyong Song,Yongxiang Li,Xuelong Li

Main category: cs.CL

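TL;DR: TeleChat3-MoE is a series of very large language models trained on Ascend NPU clusters, using an MoE architecture with parameter counts from about a hundred billion to over a trillion. The report focuses on the infrastructure that supports efficient, scalable training, including numerical accuracy verification, performance optimizations, and parallelization strategies.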

Details Motivation: Training very large language models efficiently and stably on Ascend NPU clusters requires solving numerical consistency across hardware platforms, distributed training efficiency, and system-level bottlenecks. Method: Systematic operator-level and end-to-end numerical accuracy verification; a suite of performance optimizations including interleaved pipeline scheduling, attention-aware data scheduling for long-sequence training, hierarchical and overlapped communication for expert parallelism, and DVM-based operator fusion; a multi-dimensional parallelism configuration framework based on analytical estimation and integer linear programming; and cluster-level optimizations that address host- and device-bound bottlenecks. Result: On Ascend clusters comprising thousands of devices, the infrastructure delivers significant throughput improvements and near-linear scaling, supporting efficient end-to-end training of MoE models from about a hundred billion to over a trillion parameters. Conclusion: The work provides a reliable, efficient training foundation for developing large language models on this hardware ecosystem and demonstrates the feasibility of training large-scale MoE models on it. Abstract: TeleChat3-MoE is the latest series of TeleChat large language models, featuring a Mixture-of-Experts (MoE) architecture with parameter counts ranging from 105 billion to over one trillion, trained end-to-end on Ascend NPU cluster. This technical report mainly presents the underlying training infrastructure that enables reliable and efficient scaling to frontier model sizes. We detail systematic methodologies for operator-level and end-to-end numerical accuracy verification, ensuring consistency across hardware platforms and distributed parallelism strategies. Furthermore, we introduce a suite of performance optimizations, including interleaved pipeline scheduling, attention-aware data scheduling for long-sequence training, hierarchical and overlapped communication for expert parallelism, and DVM-based operator fusion. A systematic parallelization framework, leveraging analytical estimation and integer linear programming, is also proposed to optimize multi-dimensional parallelism configurations. Additionally, we present methodological approaches to cluster-level optimizations, addressing host- and device-bound bottlenecks during large-scale training tasks. These infrastructure advancements yield significant throughput improvements and near-linear scaling on clusters comprising thousands of devices, providing a robust foundation for large-scale language model development on hardware ecosystems.

[31] MedKGI: Iterative Differential Diagnosis with Medical Knowledge Graphs and Information-Guided Inquiring

Qipeng Wang,Rui Sheng,Yafei Li,Huamin Qu,Yushi Sun,Min Zhu

Main category: cs.CL

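TL;DR: MedKGI is a diagnostic framework grounded in clinical practice that integrates a medical knowledge graph, information-gain-based question selection, and structured state tracking to improve the accuracy and dialogue efficiency of large language models in clinical diagnosis.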

Details Motivation: Existing large language models hallucinate, ask inefficient questions, and become inconsistent over multi-turn dialogues in clinical diagnosis, so they struggle to emulate real clinical reasoning. Method: Propose the MedKGI framework, which constrains reasoning with a medical knowledge graph, selects discriminative questions by information gain, and uses an OSCE-format structured state to keep evidence consistent across turns. Result: On clinical benchmarks, MedKGI improves dialogue efficiency by 30% on average and outperforms strong baseline models in diagnostic accuracy. Conclusion: MedKGI addresses key shortcomings of LLMs in clinical diagnosis, achieving more efficient and reliable diagnostic reasoning that is consistent with clinical practice. Abstract: Recent advancements in Large Language Models (LLMs) have demonstrated significant promise in clinical diagnosis. However, current models struggle to emulate the iterative, diagnostic hypothesis-driven reasoning of real clinical scenarios. Specifically, current LLMs suffer from three critical limitations: (1) generating hallucinated medical content due to weak grounding in verified knowledge, (2) asking redundant or inefficient questions rather than discriminative ones that hinder diagnostic progress, and (3) losing coherence over multi-turn dialogues, leading to contradictory or inconsistent conclusions. To address these challenges, we propose MedKGI, a diagnostic framework grounded in clinical practices. MedKGI integrates a medical knowledge graph (KG) to constrain reasoning to validated medical ontologies, selects questions based on information gain to maximize diagnostic efficiency, and adopts an OSCE-format structured state to maintain consistent evidence tracking across turns. Experiments on clinical benchmarks show that MedKGI outperforms strong LLM baselines in both diagnostic accuracy and inquiry efficiency, improving dialogue efficiency by 30% on average while maintaining state-of-the-art accuracy.
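
Selecting questions by information gain, as this entry describes, can be illustrated with a small Bayesian sketch: keep a probability distribution over candidate diagnoses and ask the yes/no question whose answer is expected to reduce the entropy of that distribution the most. The function names, the yes/no question model, and the likelihood table format are assumptions for the example; how MedKGI derives these quantities from its knowledge graph is not specified in the abstract.

```python
import math
from typing import Dict

Distribution = Dict[str, float]  # hypothesis -> probability


def entropy(dist: Distribution) -> float:
    """Shannon entropy in bits of a distribution over hypotheses."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def posterior(prior: Distribution, p_yes: Distribution, answer: bool) -> Distribution:
    """Bayes update of the prior after a yes/no answer.

    p_yes[h] is the assumed probability of a "yes" answer under hypothesis h.
    """
    unnormalized = {
        h: prior[h] * (p_yes[h] if answer else 1.0 - p_yes[h]) for h in prior
    }
    total = sum(unnormalized.values())
    if total == 0:
        return dict(prior)  # answer impossible under the model: keep the prior
    return {h: value / total for h, value in unnormalized.items()}


def expected_information_gain(prior: Distribution, p_yes: Distribution) -> float:
    """Prior entropy minus the expected posterior entropy over both answers."""
    prob_yes = sum(prior[h] * p_yes[h] for h in prior)
    gain = entropy(prior)
    for answer, prob_answer in ((True, prob_yes), (False, 1.0 - prob_yes)):
        if prob_answer > 0:
            gain -= prob_answer * entropy(posterior(prior, p_yes, answer))
    return gain


def pick_question(prior: Distribution, questions: Dict[str, Distribution]) -> str:
    """Return the question with the highest expected information gain."""
    return max(questions, key=lambda q: expected_information_gain(prior, questions[q]))
```

A question that every candidate diagnosis answers the same way has zero expected gain and is never preferred, which is the sense in which the selected questions are discriminative rather than redundant.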

[32] LAILA: A Large Trait-Based Dataset for Arabic Automated Essay Scoring

May Bashendy,Walid Massoud,Sohaila Eltanbouly,Salam Albatarni,Marwan Sayed,Abrar Abir,Houda Bouamor,Tamer Elsayed

Main category: cs.CL

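TL;DR: This paper introduces LAILA, the largest publicly available Arabic automated essay scoring (AES) dataset to date, comprising 7,859 essays annotated with holistic and trait-specific scores across multiple dimensions, an important resource for Arabic AES research.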

Details Motivation: Research on Arabic automated essay scoring (AES) is limited by the lack of publicly available datasets; this work aims to fill that gap. Method: Build and release the LAILA dataset of 7,859 Arabic essays scored on seven trait dimensions, and benchmark state-of-the-art Arabic and English models in prompt-specific and cross-prompt settings. Result: LAILA is the largest publicly available Arabic AES dataset to date; the benchmark experiments show how existing models perform in both settings and support progress on Arabic AES systems. Conclusion: LAILA fills the data gap in Arabic AES research and supports the development of more robust scoring systems, with clear research and practical value. Abstract: Automated Essay Scoring (AES) has gained increasing attention in recent years, yet research on Arabic AES remains limited due to the lack of publicly available datasets. To address this, we introduce LAILA, the largest publicly available Arabic AES dataset to date, comprising 7,859 essays annotated with holistic and trait-specific scores on seven dimensions: relevance, organization, vocabulary, style, development, mechanics, and grammar. We detail the dataset design, collection, and annotations, and provide benchmark results using state-of-the-art Arabic and English models in prompt-specific and cross-prompt settings. LAILA fills a critical need in Arabic AES research, supporting the development of robust scoring systems.

[33] Tracing the Flow of Knowledge From Science to Technology Using Deep Learning

Michael E. Rose,Mainak Ghosh,Sebastian Erhardt,Cheng Li,Erik Buunk,Dietmar Harhoff

Main category: cs.CL

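TL;DR: Proposes Pat-SPECTER, a model of semantic similarity between patents and scientific literature that performs best across several tasks, and uses it to test the hypothesis that US patents cite semantically less similar papers.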

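Details Motivation: A language similarity model that handles patents and scientific publications at the same time is needed to better predict patent-paper citations and to reveal cross-domain citation patterns. Method: Fine-tune the SPECTER2 model on patent data to build Pat-SPECTER, and evaluate it against other models in a comparison of eight language models. Result: Pat-SPECTER performs best at predicting patent-paper citations and can be used in real-world scenarios for separating and predicting patent-paper pairs; empirically, papers cited by US patents are significantly less semantically similar than those cited in other jurisdictions. Conclusion: Pat-SPECTER is the best of the compared models for semantic similarity between patents and papers, and the study supports the hypothesis that the US duty of candor leads patents to cite less closely related literature. Abstract: We develop a language similarity model suitable for working with patents and scientific publications at the same time. In a horse race-style evaluation, we subject eight language (similarity) models to predict credible Patent-Paper Citations. We find that our Pat-SPECTER model performs best, which is the SPECTER2 model fine-tuned on patents. In two real-world scenarios (separating patent-paper-pairs and predicting patent-paper-pairs) we demonstrate the capabilities of the Pat-SPECTER. We finally test the hypothesis that US patents cite papers that are semantically less similar than in other large jurisdictions, which we posit is because of the duty of candor. The model is open for the academic community and practitioners alike.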

[34] Joint Selection for Large-Scale Pre-Training Data via Policy Gradient-based Mask Learning

Ziqing Fan,Yuqiao Xian,Yan Sun,Li Shen

Main category: cs.CL

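TL;DR: This paper proposes DATAMASK, an efficient joint learning framework for large-scale pre-training data selection that optimizes quality and diversity metrics in a single unified process and significantly improves model performance on trillion-token datasets.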

Details Motivation: Existing data selection methods show diminishing returns during long-term pre-training or remove too many high-quality samples, and they struggle to consider multiple types of metrics such as quality and diversity jointly, which limits the performance of large models. Method: Data selection is formulated as a mask learning problem. Data masks are sampled iteratively, a policy-gradient objective is optimized, and the mask sampling logits are updated, so that quality and diversity metrics are optimized jointly; several acceleration techniques improve efficiency. Result: Selection time is reduced by 98.9% compared with a greedy algorithm. A subset of about 10% of FineWeb, named FineWeb-Mask, improves a 1.5B dense model by 3.2% and a 7B MoE model by 1.9% across 12 tasks. Conclusion: DATAMASK enables efficient large-scale data selection over multiple metrics jointly, significantly improves pre-trained model performance, and confirms the importance of jointly optimizing quality and diversity. Abstract: A fine-grained data recipe is crucial for pre-training large language models, as it can significantly enhance training efficiency and model performance. One important ingredient in the recipe is to select samples based on scores produced by defined rules, LLM judgment, or statistical information in embeddings, which can be roughly categorized into quality and diversity metrics. Due to the high computational cost when applied to trillion-scale token pre-training datasets such as FineWeb and DCLM, these two or more types of metrics are rarely considered jointly in a single selection process. However, in our empirical study, selecting samples based on quality metrics exhibits severe diminishing returns during long-term pre-training, while selecting on diversity metrics removes too many valuable high-quality samples, both of which limit pre-trained LLMs' capabilities. Therefore, we introduce DATAMASK, a novel and efficient joint learning framework designed for large-scale pre-training data selection that can simultaneously optimize multiple types of metrics in a unified process, with this study focusing specifically on quality and diversity metrics. DATAMASK approaches the selection process as a mask learning problem, involving iterative sampling of data masks, computation of policy gradients based on predefined objectives with sampled masks, and updating of mask sampling logits. Through policy gradient-based optimization and various acceleration enhancements, it significantly reduces selection time by 98.9% compared to a greedy algorithm, enabling our study to explore joint learning within trillion-scale tokens. With DATAMASK, we select a subset of about 10% from the 15 trillion-token FineWeb dataset, termed FineWeb-Mask. Evaluated across 12 diverse tasks, we achieve significant improvements of 3.2% on a 1.5B dense model and 1.9% on a 7B MoE model.
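The mask learning loop described in the DATAMASK abstract (sample masks, compute a policy gradient against an objective, update the sampling logits) can be illustrated with a toy sketch. This is not the authors' implementation: the `objective` function, the per-sample `quality` scores, the `groups` used as a diversity proxy, and all hyperparameters are assumptions made up for illustration, and the update is a plain REINFORCE step with a mean-reward baseline.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def objective(mask, quality, groups):
    # Hypothetical objective (an assumption, not from the paper):
    # mean quality of the selected samples plus the fraction of groups covered.
    chosen = [i for i, m in enumerate(mask) if m]
    if not chosen:
        return 0.0
    mean_quality = sum(quality[i] for i in chosen) / len(chosen)
    coverage = len({groups[i] for i in chosen}) / len(set(groups))
    return mean_quality + coverage


def learn_mask(quality, groups, steps=300, samples=16, lr=0.5, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * len(quality)  # one Bernoulli logit per data sample
    for _ in range(steps):
        probs = [sigmoid(l) for l in logits]
        # Sample a batch of binary masks from the current logits.
        masks = [[1 if rng.random() < p else 0 for p in probs]
                 for _ in range(samples)]
        rewards = [objective(m, quality, groups) for m in masks]
        baseline = sum(rewards) / samples  # variance-reduction baseline
        for i in range(len(logits)):
            # REINFORCE: d/dlogit of log Bernoulli(m; sigmoid(logit)) = m - p
            grad = sum((r - baseline) * (m[i] - probs[i])
                       for m, r in zip(masks, rewards)) / samples
            logits[i] += lr * grad
    return logits
```

With `quality = [0.9, 0.1, 0.8, 0.2]` and `groups = [0, 0, 1, 1]`, the learned logits should favor the high-quality sample in each group, which is the joint quality-plus-diversity behavior the abstract motivates.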

[35] Automated Analysis of Sustainability Reports: Using Large Language Models for the Extraction and Prediction of EU Taxonomy-Compliant KPIs

Jonathan Schmoll,Adam Jatowt

Main category: cs.CL

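TL;DR: This paper introduces a new structured dataset for EU Taxonomy compliance assessment and presents the first systematic evaluation of large language models (LLMs) on this task. LLMs perform moderately on the qualitative task but fail completely on the quantitative task, which suggests they are currently better suited as assistive tools than as fully automated solutions.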

Details Motivation: Research on applying LLMs to automate the EU Taxonomy compliance workflow is limited by the lack of public benchmark datasets; this paper aims to fill that gap. Method: A structured dataset is built from 190 corporate reports, with ground-truth economic activities and Key Performance Indicators (KPIs), and a multi-step agentic framework is used to evaluate the zero-shot performance of LLMs on qualitative and quantitative tasks. Result: LLMs perform moderately on the qualitative task of identifying economic activities, with the multi-step framework slightly improving precision, but fail across the board on the quantitative task of predicting financial KPIs. In addition, concise metadata often works better than full unstructured reports, and model confidence is poorly calibrated. Conclusion: LLMs cannot yet fully automate EU Taxonomy compliance, but they can serve as effective assistive tools for human experts; the proposed dataset provides a public benchmark for future research. Abstract: The manual, resource-intensive process of complying with the EU Taxonomy presents a significant challenge for companies. While Large Language Models (LLMs) offer a path to automation, research is hindered by a lack of public benchmark datasets. To address this gap, we introduce a novel, structured dataset from 190 corporate reports, containing ground-truth economic activities and quantitative Key Performance Indicators (KPIs). We use this dataset to conduct the first systematic evaluation of LLMs on the core compliance workflow. Our results reveal a clear performance gap between qualitative and quantitative tasks. LLMs show moderate success in the qualitative task of identifying economic activities, with a multi-step agentic framework modestly enhancing precision. Conversely, the models comprehensively fail at the quantitative task of predicting financial KPIs in a zero-shot setting. We also discover a paradox, where concise metadata often yields superior performance to full, unstructured reports, and find that model confidence scores are poorly calibrated. We conclude that while LLMs are not ready for full automation, they can serve as powerful assistive tools for human experts. Our dataset provides a public benchmark for future research.

[36] Figure It Out: Improving the Frontier of Reasoning with Active Visual Thinking

Meiqi Chen,Fandong Meng,Jie Zhou

Main category: cs.CL

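TL;DR: This paper proposes FIGR, a framework that integrates active visual thinking into multi-turn reasoning through end-to-end reinforcement learning, using visual representations to improve reasoning about global structural properties in complex problems.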

Details Motivation: Purely text-based reasoning struggles to capture the spatial, geometric, and structural relationships in complex problems and lacks an effective representation of global structural constraints. Method: The FIGR framework externalizes structural hypotheses by constructing intermediate visual representations and uses reinforcement learning to adaptively regulate when and how visual reasoning is invoked. Result: FIGR outperforms strong text-only chain-of-thought baselines and improves the base model by 13.12% on AIME 2025 and by 11.00% on BeyondAIME. Conclusion: Figure-guided multimodal reasoning effectively enhances the stability and reliability of complex reasoning, with clear advantages on tasks that require modeling global structural properties. Abstract: Complex reasoning problems often involve implicit spatial, geometric, and structural relationships that are not explicitly encoded in text. While recent reasoning models have achieved strong performance across many domains, purely text-based reasoning struggles to represent global structural constraints in complex settings. In this paper, we introduce FIGR, which integrates active visual thinking into multi-turn reasoning via end-to-end reinforcement learning. FIGR externalizes intermediate structural hypotheses by constructing visual representations during problem solving. By adaptively regulating when and how visual reasoning should be invoked, FIGR enables more stable and coherent reasoning over global structural properties that are difficult to capture from text alone. Experiments on challenging mathematical reasoning benchmarks demonstrate that FIGR outperforms strong text-only chain-of-thought baselines. In particular, FIGR improves the base model by 13.12% on AIME 2025 and 11.00% on BeyondAIME, highlighting the effectiveness of figure-guided multimodal reasoning in enhancing the stability and reliability of complex reasoning.

[37] QianfanHuijin Technical Report: A Novel Multi-Stage Training Paradigm for Finance Industrial LLMs

Shupeng Li,Weipeng Lu,Linyun Liu,Chen Lin,Shaofei Li,Zhendong Tan,Hanjun Zhong,Yucheng Zeng,Chenghao Zhu,Mengyue Liu,Daxiang Dong,Jianmin Wu,Yunting Xiao,Annan Li,Danyu Liu,Jingnan Zhang,Licen Liu,Dawei Yin,Dou Shen

Main category: cs.CL

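TL;DR: This paper presents QianfanHuijin, a large language model for the financial domain, and proposes a generalizable multi-stage training paradigm whose progressive post-training pipeline significantly improves the model's financial reasoning and agentic capabilities.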

Details Motivation: As financial services grow more complex, models that only hold domain knowledge no longer meet demand; enhanced models with financial reasoning and agentic capabilities are needed. Method: A multi-stage training paradigm is used: Continual Pre-training (CPT) on financial corpora comes first, followed in order by Financial SFT, Finance Reasoning RL, and Finance Agentic RL, and training ends with General RL aligned with real-world business scenarios. Result: QianfanHuijin performs strongly on multiple authoritative financial benchmarks, and ablation studies show that the Reasoning RL and Agentic RL stages significantly improve their respective capabilities. Conclusion: The proposed fine-grained, progressive post-training method effectively strengthens the specialized capabilities of industrial LLMs and may become a mainstream paradigm for enhancing domain models across industries. Abstract: Domain-specific enhancement of Large Language Models (LLMs) within the financial context has long been a focal point of industrial application. While previous models such as BloombergGPT and Baichuan-Finance primarily focused on knowledge enhancement, the deepening complexity of financial services has driven a growing demand for models that possess not only domain knowledge but also robust financial reasoning and agentic capabilities. In this paper, we present QianfanHuijin, a financial domain LLM, and propose a generalizable multi-stage training paradigm for industrial model enhancement. Our approach begins with Continual Pre-training (CPT) on financial corpora to consolidate the knowledge base. This is followed by a fine-grained Post-training pipeline designed with increasing specificity: starting with Financial SFT, progressing to Finance Reasoning RL and Finance Agentic RL, and culminating in General RL aligned with real-world business scenarios. Empirical results demonstrate that QianfanHuijin achieves superior performance across various authoritative financial benchmarks. Furthermore, ablation studies confirm that the targeted Reasoning RL and Agentic RL stages yield significant gains in their respective capabilities. These findings validate our motivation and suggest that this fine-grained, progressive post-training methodology is poised to become a mainstream paradigm for various industrial-enhanced LLMs.

[38] World model inspired sarcasm reasoning with large language model agents

Keito Inoshita,Shinnosuke Mizuno

Main category: cs.CL

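TL;DR: This paper proposes WM-SAR, a world-model-inspired framework for sarcasm understanding that decomposes literal meaning, context, normative expectation, and intention into specialized LLM-based agents and explicitly models semantic inconsistency and intention signals, achieving strong and interpretable sarcasm detection.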

Details Motivation: Existing sarcasm understanding methods mostly rely on black-box models and lack a structured account of cognitive factors, such as the mismatch between semantics and normative expectations, so they struggle to capture the cognitive mechanism behind sarcasm. Method: Sarcasm understanding is modeled as a world-model-driven reasoning process. Four LLM agents handle literal meaning, context, normative expectation, and intention; an inconsistency score between literal meaning and normative expectation is computed and fed, together with an intention score, into a lightweight logistic regression model that makes the final decision. Result: The method outperforms existing deep learning and LLM-based methods on several benchmarks, and ablation studies and case analyses confirm that semantic inconsistency and intention reasoning are both necessary. Conclusion: By combining the reasoning ability of LLMs with an interpretable numerical decision structure, WM-SAR achieves strong performance while remaining highly interpretable, offering a new cognitively inspired modeling paradigm for sarcasm understanding. Abstract: Sarcasm understanding is a challenging problem in natural language processing, as it requires capturing the discrepancy between the surface meaning of an utterance and the speaker's intentions as well as the surrounding social context. Although recent advances in deep learning and Large Language Models (LLMs) have substantially improved performance, most existing approaches still rely on black-box predictions of a single model, making it difficult to structurally explain the cognitive factors underlying sarcasm. Moreover, while sarcasm often emerges as a mismatch between semantic evaluation and normative expectations or intentions, frameworks that explicitly decompose and model these components remain limited. In this work, we reformulate sarcasm understanding as a world model inspired reasoning process and propose World Model inspired SArcasm Reasoning (WM-SAR), which decomposes literal meaning, context, normative expectation, and intention into specialized LLM-based agents. The discrepancy between literal evaluation and normative expectation is explicitly quantified as a deterministic inconsistency score, and together with an intention score, these signals are integrated by a lightweight Logistic Regression model to infer the final sarcasm probability. This design leverages the reasoning capability of LLMs while maintaining an interpretable numerical decision structure. Experiments on representative sarcasm detection benchmarks show that WM-SAR consistently outperforms existing deep learning and LLM-based methods. Ablation studies and case analyses further demonstrate that integrating semantic inconsistency and intention reasoning is essential for effective sarcasm detection, achieving both strong performance and high interpretability.
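The decision structure described for WM-SAR (a deterministic inconsistency score and an intention score combined by logistic regression) can be sketched as follows. The score definition, the polarity range, and the weights here are illustrative assumptions; the abstract does not give the formula, and in practice the weights would be fitted on labeled data rather than hard-coded.

```python
import math


def inconsistency_score(literal_polarity, normative_polarity):
    # Assumed definition: both polarities lie in [-1, 1], and the score is
    # half their absolute gap, so it lies in [0, 1].
    return abs(literal_polarity - normative_polarity) / 2.0


def sarcasm_probability(inconsistency, intention,
                        w_inconsistency=3.0, w_intention=2.0, bias=-2.5):
    # Logistic regression over two interpretable features.
    # The weights and bias are hypothetical placeholders.
    z = w_inconsistency * inconsistency + w_intention * intention + bias
    return 1.0 / (1.0 + math.exp(-z))


# An utterance that sounds positive in a situation where a negative reaction
# is expected, with a strong intention signal.
inc = inconsistency_score(literal_polarity=0.9, normative_polarity=-0.8)
p_sarcastic = sarcasm_probability(inc, intention=0.9)
```

Because the final decision depends on only two numeric inputs, each prediction can be traced back to how much the literal reading conflicts with the expected one and how strong the intention signal is.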

[39] Skim-Aware Contrastive Learning for Efficient Document Representation

Waheed Ahmed Abro,Zied Bouraoui

Main category: cs.CL

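TL;DR: Proposes a self-supervised contrastive learning framework for long-document representation that mimics human reading strategies and yields more efficient and accurate representations on legal and biomedical texts.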

Details Motivation: Existing models for long documents are inefficient, capture context incompletely, or do not clearly explain how they relate sections, and they struggle to model full-document semantics, especially in law and medicine. Method: A new self-supervised contrastive learning framework randomly masks a section of the document and uses a natural language inference (NLI)-based contrastive objective to align it with relevant parts and push it away from unrelated ones, mimicking how humans skim a document to understand it. Result: Experiments on legal and biomedical texts show significant gains in both the accuracy and the computational efficiency of long-document representations. Conclusion: By imitating human reading strategies, the proposed method improves the quality of long-document representations while remaining computationally efficient. Abstract: Although transformer-based models have shown strong performance in word- and sentence-level tasks, effectively representing long documents, especially in fields like law and medicine, remains difficult. Sparse attention mechanisms can handle longer inputs, but are resource-intensive and often fail to capture full-document context. Hierarchical transformer models offer better efficiency but do not clearly explain how they relate different sections of a document. In contrast, humans often skim texts, focusing on important sections to understand the overall message. Drawing from this human strategy, we introduce a new self-supervised contrastive learning framework that enhances long document representation. Our method randomly masks a section of the document and uses a natural language inference (NLI)-based contrastive objective to align it with relevant parts while distancing it from unrelated ones. This mimics how humans synthesize information, resulting in representations that are both richer and more computationally efficient. Experiments on legal and biomedical texts confirm significant gains in both accuracy and efficiency.
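The contrastive objective in the abstract above aligns a masked section with relevant parts and distances it from unrelated ones. A common way to express that idea is an InfoNCE-style loss over embedding similarities, sketched below. The abstract does not state the exact loss, so the cosine similarity, the `temperature` value, and the toy vectors are all assumptions.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    # InfoNCE-style loss: small when the anchor is closer to the positive
    # than to every negative, large otherwise.
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# Toy embeddings: the masked section, a related section, an unrelated one.
masked_section = [1.0, 0.0]
related = [1.0, 0.0]
unrelated = [0.0, 1.0]

aligned_loss = contrastive_loss(masked_section, related, [unrelated])
misaligned_loss = contrastive_loss(masked_section, unrelated, [related])
```

Here `aligned_loss` is close to zero and `misaligned_loss` is large, so minimizing the loss pulls the masked section's embedding toward the sections it belongs with.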

[40] Comparing Approaches to Automatic Summarization in Less-Resourced Languages

Chester Palen-Michel,Constantine Lignos

Main category: cs.CL

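TL;DR: This paper studies automatic summarization for less-resourced languages, comparing zero-shot prompting of LLMs, fine-tuning smaller models such as mT5 with data augmentation and multilingual transfer, and an LLM translation pipeline. Multilingually fine-tuned mT5 outperforms most other approaches on most metrics, and LLM-as-judge evaluation may be less reliable on less-resourced languages.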

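Details Motivation: Summarization for less-resourced languages is comparatively under-studied, and existing high-performing methods focus on high-resourced languages such as English, so effective summarization techniques for less-resourced languages need to be explored. Method: Several approaches are compared: zero-shot prompting of large language models (LLMs) of different sizes, fine-tuning mT5 with and without three data augmentation approaches and multilingual transfer, and an LLM pipeline that translates from the source language to English, summarizes, and translates back. Five different metrics are used for evaluation. Result: LLMs of similar parameter size vary in performance; the multilingual fine-tuned mT5 baseline outperforms most other approaches, including zero-shot LLMs, on most metrics; and LLMs as evaluators may be less reliable on less-resourced languages. Conclusion: For summarization in less-resourced languages, fine-tuning a small multilingual model such as mT5 is more effective than zero-shot prompting of LLMs, and current LLM-based evaluation may have limitations in these settings. Abstract: Automatic text summarization has achieved high performance in high-resourced languages like English, but comparatively less attention has been given to summarization in less-resourced languages. This work compares a variety of different approaches to summarization, from zero-shot prompting of LLMs large and small to fine-tuning smaller models like mT5 with and without three data augmentation approaches and multilingual transfer. We also explore an LLM translation pipeline approach, translating from the source language to English, summarizing, and translating back. Evaluating with five different metrics, we find that there is variation across LLMs in their performance across similar parameter sizes, that our multilingual fine-tuned mT5 baseline outperforms most other approaches including zero-shot LLM performance for most metrics, and that LLM as judge may be less reliable on less-resourced languages.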

[41] Cleaning English Abstracts of Scientific Publications

Michael E. Rose,Nils A. Herrmann,Sebastian Erhardt

Main category: cs.CL

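TL;DR: Proposes an open-source language model that automatically cleans clutter from English scientific abstracts, improving the accuracy of text embeddings and similarity analysis.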

Details Motivation: Scientific abstracts often contain irrelevant content such as copyright statements and metadata, which distorts text analysis results. Method: An easy-to-integrate language model is developed that identifies and removes non-essential information from abstracts. Result: The model is conservative and precise, changes the similarity rankings of cleaned abstracts, and increases the information content of standard-length embeddings. Conclusion: The model improves the preprocessing quality of scientific text and is suitable for downstream natural language processing tasks. Abstract: Scientific abstracts are often used as proxies for the content and thematic focus of research publications. However, a significant share of published abstracts contains extraneous information (such as publisher copyright statements, section headings, author notes, registrations, and bibliometric or bibliographic metadata) that can distort downstream analyses, particularly those involving document similarity or textual embeddings. We introduce an open-source, easy-to-integrate language model designed to clean English-language scientific abstracts by automatically identifying and removing such clutter. We demonstrate that our model is both conservative and precise, that it alters similarity rankings of cleaned abstracts, and that it improves the information content of standard-length embeddings.

[42] IELTS Writing Revision Platform with Automated Essay Scoring and Adaptive Feedback

Titas Ramancauskas,Kotryna Ramancauske

Main category: cs.CL

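TL;DR: This paper presents a revision platform for IELTS writing exam preparation that combines automated scoring with targeted feedback. Through iterative Design-Based Research (DBR), the scoring model moved from a rule-based approach to a DistilBERT-based regression model, which markedly improved scoring accuracy and enabled effective adaptive feedback.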

Details Motivation: Traditional IELTS writing preparation lacks personalised feedback aligned with the scoring rubric and struggles to simulate real exam conditions, so a platform that provides precise, tailored feedback is needed to improve learning outcomes. Method: Using Design-Based Research (DBR) over several iterations, the platform evolved from a rule-based automated scoring system to a transformer model (DistilBERT with a regression head), separated conversational guidance from the writing interface to reduce cognitive load, and added an adaptive feedback mechanism. Result: In DBR Cycle 4, the DistilBERT model reached an MAE of 0.66 with a positive R²; in Cycle 5, adaptive feedback produced statistically significant score improvements (mean +0.060 bands, p = 0.011, Cohen's d = 0.504), although effectiveness varied by revision strategy. Conclusion: Automated feedback can be an effective supplement to IELTS writing instruction, and conservative surface-level corrections are more reliable than aggressive structural changes; the system still has difficulty assessing higher-band essays, and future work should include longitudinal studies and validation by official examiners. Abstract: This paper presents the design, development, and evaluation of a proposed revision platform assisting candidates for the International English Language Testing System (IELTS) writing exam. Traditional IELTS preparation methods lack personalised feedback aligned with the IELTS writing rubric. To address these shortcomings, the platform features an attractive user interface (UI), an Automated Essay Scoring system (AES), and targeted feedback tailored to candidates and the IELTS writing rubric. The platform architecture separates conversational guidance from a dedicated writing interface to reduce cognitive load and simulate exam conditions. Through iterative Design-Based Research (DBR) cycles, the study progressed from rule-based scoring to transformer-based scoring with a regression head, combined with adaptive feedback. Early cycles (2-3) revealed fundamental limitations of rule-based approaches: mid-band compression, low accuracy, and negative $R^2$ values. DBR Cycle 4 implemented a DistilBERT transformer model with a regression head, yielding substantial improvements with MAE of 0.66 and positive $R^2$. This enabled Cycle 5's adaptive feedback implementation, which demonstrated statistically significant score improvements (mean +0.060 bands, p = 0.011, Cohen's d = 0.504), though effectiveness varied by revision strategy. Findings suggest that automated feedback is best suited as a supplement to human instruction, with conservative surface-level corrections proving more reliable than aggressive structural interventions for IELTS preparation contexts. Challenges remain in assessing higher-band essays, and future work should incorporate longitudinal studies with real IELTS candidates and validation from official examiners.
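The abstract above reports negative $R^2$ values alongside mid-band compression for the rule-based scorers. A short sketch of the two metrics shows why these go together: a scorer that pushes every essay toward the middle bands can do worse than always predicting the mean score, which makes $R^2$ negative. The band scores below are invented for illustration and are not data from the study.

```python
def mae(y_true, y_pred):
    # Mean absolute error, in band units.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    # R^2 = 1 - SS_res / SS_tot; negative when the predictions are worse
    # than predicting the mean of y_true for every essay.
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


human_bands = [5.0, 6.0, 7.0, 8.0]   # hypothetical examiner scores
compressed = [6.0, 6.0, 6.0, 6.0]    # every essay scored as mid-band
regression = [5.5, 6.0, 6.5, 7.5]    # a scorer that tracks the spread

compressed_r2 = r_squared(human_bands, compressed)   # negative
regression_r2 = r_squared(human_bands, regression)   # positive
```

MAE on its own can hide this failure mode, which may be why the abstract reports both MAE and the sign of $R^2$.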

[43] Paragraph Segmentation Revisited: Towards a Standard Task for Structuring Speech

Fabian Retkowski,Alexander Waibel

Main category: cs.CL

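TL;DR: This paper proposes paragraph segmentation methods for automatic speech transcripts, builds the first paragraph segmentation benchmarks for the speech domain, and achieves efficient, accurate structuring through constrained decoding and a compact model.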

Details Motivation: Speech transcripts are usually unstructured word streams, which hurts readability and reuse; existing text segmentation research lacks realistic, robust benchmarks for the speech domain, and paragraph segmentation has not been part of speech post-processing. Method: Two new benchmarks are introduced, TEDPara (human-annotated) and YTSegPara (synthetic labels); a constrained-decoding formulation lets large language models insert paragraph breaks while preserving the original text; and a compact model, MiniSeg, with a hierarchical extension, jointly predicts chapters and paragraphs. Result: The first speech-oriented paragraph segmentation benchmarks are established; constrained decoding supports sentence-aligned evaluation; and MiniSeg reaches state-of-the-art accuracy at low computational cost while jointly predicting chapters and paragraphs. Conclusion: Paragraph segmentation should become a standard step in speech processing; the resources and methods provided here lay the groundwork and connect the speech and text segmentation fields. Abstract: Automatic speech transcripts are often delivered as unstructured word streams that impede readability and repurposing. We recast paragraph segmentation as the missing structuring step and fill three gaps at the intersection of speech processing and text segmentation. First, we establish TEDPara (human-annotated TED talks) and YTSegPara (YouTube videos with synthetic labels) as the first benchmarks for the paragraph segmentation task. The benchmarks focus on the underexplored speech domain, where paragraph segmentation has traditionally not been part of post-processing, while also contributing to the wider text segmentation field, which still lacks robust and naturalistic benchmarks. Second, we propose a constrained-decoding formulation that lets large language models insert paragraph breaks while preserving the original transcript, enabling faithful, sentence-aligned evaluation. Third, we show that a compact model (MiniSeg) attains state-of-the-art accuracy and, when extended hierarchically, jointly predicts chapters and paragraphs with minimal computational cost. Together, our resources and methods establish paragraph segmentation as a standardized, practical task in speech processing.
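The constrained-decoding idea in the abstract above restricts the model so that its output is the original transcript plus paragraph breaks and nothing else. A minimal sketch, assuming sentence-level units and a greedy decision rule: at each step the only allowed actions are "emit a break" or "copy the next sentence". The `break_score` callable, the `BREAK` marker, and the threshold are hypothetical stand-ins for whatever probability a real model would assign to the break token.

```python
BREAK = "<p>"  # hypothetical paragraph-break marker


def constrained_segment(sentences, break_score, threshold=0.5):
    # Greedy constrained decoding over two allowed actions per step:
    # insert BREAK, or copy the next input sentence unchanged.
    output = []
    for i, sentence in enumerate(sentences):
        if i > 0 and break_score(sentences[:i], sentence) > threshold:
            output.append(BREAK)
        output.append(sentence)  # the transcript itself is never rewritten
    return output


def toy_break_score(previous_sentences, next_sentence):
    # Toy scorer for illustration only: break before a topic-shift cue.
    return 1.0 if next_sentence.startswith("Next,") else 0.0


transcript = [
    "Thanks for coming.",
    "Today is about speech.",
    "Next, we turn to text segmentation.",
    "It lacks benchmarks.",
]
segmented = constrained_segment(transcript, toy_break_score)
```

Removing the `BREAK` markers from `segmented` returns the input list exactly, which is the property that makes sentence-aligned evaluation possible.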

[44] Safe in the Future, Dangerous in the Past: Dissecting Temporal and Linguistic Vulnerabilities in LLMs

Muhammad Abdullahi Said,Muhammad Sammani Sani

Main category: cs.CL

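TL;DR: This study audits the safety alignment of three advanced models (GPT-5.1, Gemini 3 Pro, and Claude 4.5 Opus) with the HausaSafety dataset. Multilingual safety depends on a complex interaction between language and temporal framing; the study identifies a Complex Interference mechanism and a Temporal Asymmetry, argues that current models rely on superficial heuristics rather than deep semantic understanding, and proposes a shift to Invariant Alignment to ensure consistent safety worldwide.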

Details Motivation: As LLMs are integrated into critical global infrastructure, whether their safety alignment transfers zero-shot to non-English languages remains a blind spot; serious safety risks may exist in low-resource languages and region-specific threat scenarios, so multilingual safety alignment needs systematic evaluation. Method: HausaSafety, an adversarial dataset grounded in West African threat scenarios (e.g., Yahoo-Yahoo fraud, Dane gun manufacturing), is constructed. A 2 x 4 factorial design across 1,440 evaluations tests the safety responses of three state-of-the-art models (GPT-5.1, Gemini 3 Pro, Claude 4.5 Opus) in English and Hausa under different temporal framings and analyzes the non-linear interaction between language and time. Result: A Complex Interference mechanism is identified, in which safety is determined by the intersection of language and temporal framing. Claude 4.5 Opus is safer in Hausa (45.0%) than in English (36.7%), attributed to uncertainty-driven refusal. A Temporal Asymmetry is also reported: past-tense framing bypasses defenses (only 15.6% safe), while future-tense scenarios trigger over-refusal (57.2% safe). There is a 9.2x gap between the safest and most vulnerable configurations. Conclusion: Safety in current LLMs is not a fixed property but a context-dependent state; reliance on superficial heuristics creates Safety Pockets that expose Global South users to localized harms, and a shift to Invariant Alignment is needed to keep safety stable across languages and temporal framings. Abstract: As Large Language Models (LLMs) integrate into critical global infrastructure, the assumption that safety alignment transfers zero-shot from English to other languages remains a dangerous blind spot. This study presents a systematic audit of three state-of-the-art models (GPT-5.1, Gemini 3 Pro, and Claude 4.5 Opus) using HausaSafety, a novel adversarial dataset grounded in West African threat scenarios (e.g., Yahoo-Yahoo fraud, Dane gun manufacturing). Employing a 2 x 4 factorial design across 1,440 evaluations, we tested the non-linear interaction between language (English vs. Hausa) and temporal framing. Our results challenge the prevailing multilingual safety gap narrative. Instead of a simple degradation in low-resource settings, we identified a mechanism of Complex Interference where safety is determined by the intersection of variables. While models exhibited a Reverse Linguistic, with Claude 4.5 Opus proving significantly safer in Hausa (45.0%) than in English (36.7%) due to uncertainty-driven refusal, they suffered catastrophic failures in temporal reasoning. We report a profound Temporal Asymmetry, where past-tense framing bypassed defenses (15.6% safe) while future-tense scenarios triggered hyper-conservative refusals (57.2% safe). The magnitude of this volatility is illustrated by a 9.2x disparity between the safest and most vulnerable configurations, proving that safety is not a fixed property but a context-dependent state. We conclude that current models rely on superficial heuristics rather than robust semantic understanding, creating Safety Pockets that leave Global South users exposed to localized harms. We propose Invariant Alignment as a necessary paradigm shift to ensure safety stability across linguistic and temporal shifts.

[45] HaluNet: Multi-Granular Uncertainty Modeling for Efficient Hallucination Detection in LLM Question Answering

Chaodong Tong,Qi Zhang,Jiayang Gao,Lei Jiang,Yanbing Liu,Nannan Sun

Main category: cs.CL

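TL;DR: This paper proposes HaluNet, a lightweight, trainable neural framework for detecting hallucinations of large language models (LLMs) in question answering. It fuses token-level probability uncertainty with the uncertainty carried by semantic representations to enable efficient one-pass hallucination detection, and it performs well on several datasets at low computational cost.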

Details Motivation: LLMs perform well on question answering but often hallucinate, producing factual errors or fabricated content. Existing detection methods based on internal uncertainty signals usually focus on a single type of uncertainty and overlook the complementarity among different types, in particular between token-level probability uncertainty and the uncertainty in semantic representations. Method: HaluNet is a lightweight, trainable multi-branch neural framework that combines semantic embeddings with probabilistic confidence and distributional uncertainty, adaptively fusing what the model knows with the uncertainty expressed in its outputs to integrate multi-granular token-level uncertainties. Result: Experiments on SQuAD, TriviaQA, and Natural Questions show that HaluNet delivers strong hallucination detection performance and favorable computational efficiency, with or without access to context. Conclusion: By fusing multi-granular internal uncertainty signals, HaluNet detects LLM hallucinations efficiently and accurately and has potential for real-time question answering systems. Abstract: Large Language Models (LLMs) excel at question answering (QA) but often generate hallucinations, including factual errors or fabricated content. Detecting hallucinations from internal uncertainty signals is attractive due to its scalability and independence from external resources. Existing methods often aim to accurately capture a single type of uncertainty while overlooking the complementarity among different sources, particularly between token-level probability uncertainty and the uncertainty conveyed by internal semantic representations, which provide complementary views on model reliability. We present HaluNet, a lightweight and trainable neural framework that integrates multi-granular token-level uncertainties by combining semantic embeddings with probabilistic confidence and distributional uncertainty. Its multi-branch architecture adaptively fuses what the model knows with the uncertainty expressed in its outputs, enabling efficient one-pass hallucination detection. Experiments on SQuAD, TriviaQA, and Natural Questions show that HaluNet delivers strong detection performance and favorable computational efficiency, with or without access to context, highlighting its potential for real-time hallucination detection in LLM-based QA systems.
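The token-level signals named in the HaluNet abstract, probabilistic confidence and distributional uncertainty, can be illustrated with two standard quantities: the negative log-probability of each generated token and the entropy of each next-token distribution. The sketch below only computes such features from given probabilities. It is not HaluNet itself, which is a trained network that also fuses semantic embeddings; the feature names and toy distributions are assumptions.

```python
import math


def token_entropy(distribution):
    # Shannon entropy (in nats) of one next-token distribution.
    return -sum(p * math.log(p) for p in distribution if p > 0.0)


def uncertainty_features(chosen_probs, distributions):
    # chosen_probs: probability the model gave to each generated token.
    # distributions: the full next-token distribution at each step.
    nll = [-math.log(p) for p in chosen_probs]
    entropies = [token_entropy(d) for d in distributions]
    return {
        "mean_nll": sum(nll) / len(nll),
        "max_nll": max(nll),
        "mean_entropy": sum(entropies) / len(entropies),
    }


# Toy two-token answers over a three-word vocabulary.
confident = uncertainty_features(
    [0.9, 0.8], [[0.9, 0.05, 0.05], [0.8, 0.1, 0.1]])
hesitant = uncertainty_features(
    [0.4, 0.35], [[0.4, 0.3, 0.3], [0.35, 0.35, 0.3]])
```

Both kinds of feature come from the same forward pass that generated the answer, which is what makes one-pass detection cheap compared with sampling the model several times.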

Hongseok Oh,Wonseok Hwang,Kyoung-Woon On

Main category: cs.CL

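TL;DR: Proposes the Korean Canonical Legal Benchmark (KCL) for evaluating language models' legal reasoning independently of domain-specific knowledge. It contains a multiple-choice part and an open-ended generation part and provides supporting precedents and resources for automated evaluation.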

Details Motivation: To evaluate legal reasoning on its own, without the result being driven by domain knowledge stored in a model's parameters, a benchmark is needed that disentangles reasoning ability from memorized knowledge. Method: The KCL benchmark is built from KCL-MCQA (283 multiple-choice questions with 1,103 aligned precedents) and KCL-Essay (169 open-ended questions with 550 precedents and 2,739 instance-level rubrics), and more than 30 models are evaluated systematically. Result: Existing models still have large room for improvement, particularly on KCL-Essay, and reasoning-specialized models outperform general-purpose ones. Conclusion: KCL effectively evaluates the legal reasoning of language models and supports a finer decomposition of capabilities; all resources are publicly released. Abstract: We introduce the Korean Canonical Legal Benchmark (KCL), a benchmark designed to assess language models' legal reasoning capabilities independently of domain-specific knowledge. KCL provides question-level supporting precedents, enabling a more faithful disentanglement of reasoning ability from parameterized knowledge. KCL consists of two components: (1) KCL-MCQA, 283 multiple-choice questions with 1,103 aligned precedents, and (2) KCL-Essay, 169 open-ended generation questions with 550 aligned precedents and 2,739 instance-level rubrics for automated evaluation. Our systematic evaluation of 30+ models shows large remaining gaps, particularly in KCL-Essay, and that reasoning-specialized models consistently outperform their general-purpose counterparts. We release all resources, including the benchmark dataset and evaluation code, at https://github.com/lbox-kr/kcl.

[47] Understanding and Steering the Cognitive Behaviors of Reasoning Models at Test-Time

Zhenyu Zhang,Xiaoxia Wu,Zhongzhu Zhou,Qingyang Wu,Yineng Zhang,Pragaash Ponnusamy,Harikaran Subbaraj,Jue Wang,Shuaiwen Leon Song,Ben Athiwaratkun

Main category: cs.CL

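TL;DR: This paper proposes CREST, a training-free method for steering cognitive reasoning that intervenes on specific attention heads to improve the reasoning trajectories of large language models, raising accuracy and lowering computational cost.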

Details Motivation: LLMs rely on long chain-of-thought reasoning for complex tasks, but the reasoning is often inefficient and unstable, with overthinking or underthinking. Method: The study finds that certain attention heads correlate with cognitive behaviors such as verification and backtracking. CREST uses these heads to build steering vectors and rotates hidden representations at inference time to suppress unproductive reasoning behaviors. The method has two steps: offline calibration and inference-time intervention. Result: Across multiple reasoning benchmarks and models, CREST improves accuracy by up to 17.5% and reduces token usage by 37.6%. Conclusion: CREST offers a simple and effective way to make LLM reasoning faster and more reliable at test time. Abstract: Large Language Models (LLMs) often rely on long chain-of-thought (CoT) reasoning to solve complex tasks. While effective, these trajectories are frequently inefficient, leading to high latency from excessive token generation, or unstable reasoning that alternates between underthinking (shallow, inconsistent steps) and overthinking (repetitive, verbose reasoning). In this work, we study the structure of reasoning trajectories and uncover specialized attention heads that correlate with distinct cognitive behaviors such as verification and backtracking. By lightly intervening on these heads at inference time, we can steer the model away from inefficient modes. Building on this insight, we propose CREST, a training-free method for Cognitive REasoning Steering at Test-time. CREST has two components: (1) an offline calibration step that identifies cognitive heads and derives head-specific steering vectors, and (2) an inference-time procedure that rotates hidden representations to suppress components along those vectors. CREST adaptively suppresses unproductive reasoning behaviors, yielding both higher accuracy and lower computational cost. Across diverse reasoning benchmarks and models, CREST improves accuracy by up to 17.5% while reducing token usage by 37.6%, offering a simple and effective pathway to faster, more reliable LLM reasoning.
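The inference-time step in the CREST abstract rotates hidden representations to suppress components along steering vectors. One simple reading, sketched below, removes the projection of the hidden vector onto the steering direction and then rescales the result to the original norm, so that only the direction changes. This reading, the `strength` parameter, and the plain-list vectors are assumptions; the abstract does not give the exact operation.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def suppress_along(hidden, steer, strength=1.0):
    # Remove `strength` times the component of `hidden` along `steer`,
    # then rescale to the original norm (a norm-preserving change of
    # direction, which is how "rotation" is interpreted here).
    steer_norm = norm(steer)
    unit = [s / steer_norm for s in steer]
    coeff = dot(hidden, unit)
    reduced = [h - strength * coeff * u for h, u in zip(hidden, unit)]
    reduced_norm = norm(reduced)
    if reduced_norm == 0.0:
        return list(hidden)  # hidden lies entirely along steer; leave as is
    scale = norm(hidden) / reduced_norm
    return [r * scale for r in reduced]


steered = suppress_along([3.0, 4.0], [1.0, 0.0])  # approximately [0.0, 5.0]
```

Keeping the norm fixed is one way to avoid shrinking activations that later layers expect at a certain scale; whether CREST does exactly this is not stated in the abstract.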

[48] Youtu-LLM: Unlocking the Native Agentic Potential for Lightweight Large Language Models

Junru Lu,Jiarui Qin,Lingfeng Qiao,Yinghui Li,Xinyi Dai,Bo Ke,Jianfeng He,Ruizhi Qiao,Di Yin,Xing Sun,Yunsheng Wu,Yinsong Liu,Shuangyin Liu,Mingkong Tang,Haodong Lin,Jiayi Kuang,Fanxu Meng,Xiaojuan Tang,Yunjia Xi,Junjie Huang,Haotong Yang,Zhenyi Shen,Yangning Li,Qianwen Zhang,Yifei Yu,Siyu An,Junnan Dong,Qiufeng Wang,Jie Wang,Keyu Chen,Wei Wen,Taian Guo,Zhifeng Shen,Daohai Yu,Jiahao Li,Ke Li,Zongyi Li,Xiaoyu Tan

Main category: cs.CL

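TL;DR: Youtu-LLM is a lightweight 1.96B language model trained from scratch. With a compact architecture, a multi-stage curriculum, and scalable agentic mid-training, it performs strongly on long-context reasoning and agentic tasks while combining computational efficiency with native agentic intelligence.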

Details Motivation: To design a small language model that combines high computational efficiency with native agentic intelligence, overcoming the limitation that typical small models rely on distillation and lack intrinsic reasoning ability. Method: A dense MLA architecture with a STEM-oriented vocabulary supports a 128k context window; a Commonsense-STEM-Agent curriculum trains the model in stages on about 11T tokens; and during mid-training, diverse data construction schemes synthesize rich trajectories for math, coding, and tool use. Result: On general benchmarks the model is competitive with larger models, and on agent-specific tasks it significantly surpasses existing SOTA baselines, setting a new state of the art for sub-2B models. Conclusion: Lightweight models can acquire strong intrinsic agentic capabilities through systematic pre-training, and Youtu-LLM offers a practical path for small, efficient language models. Abstract: We introduce Youtu-LLM, a lightweight yet powerful language model that harmonizes high computational efficiency with native agentic intelligence. Unlike typical small models that rely on distillation, Youtu-LLM (1.96B) is pre-trained from scratch to systematically cultivate reasoning and planning capabilities. The key technical advancements are as follows: (1) Compact Architecture with Long-Context Support: Built on a dense Multi-Latent Attention (MLA) architecture with a novel STEM-oriented vocabulary, Youtu-LLM supports a 128k context window. This design enables robust long-context reasoning and state tracking within a minimal memory footprint, making it ideal for long-horizon agent and reasoning tasks. (2) Principled "Commonsense-STEM-Agent" Curriculum: We curated a massive corpus of approximately 11T tokens and implemented a multi-stage training strategy. By progressively shifting the pre-training data distribution from general commonsense to complex STEM and agentic tasks, we ensure the model acquires deep cognitive abilities rather than superficial alignment. (3) Scalable Agentic Mid-training: Specifically for the agentic mid-training, we employ diverse data construction schemes to synthesize rich and varied trajectories across math, coding, and tool-use domains. This high-quality data enables the model to internalize planning and reflection behaviors effectively. Extensive evaluations show that Youtu-LLM sets a new state-of-the-art for sub-2B LLMs. On general benchmarks, it achieves competitive performance against larger models, while on agent-specific tasks, it significantly surpasses existing SOTA baselines, demonstrating that lightweight models can possess strong intrinsic agentic capabilities.

[49] Do Large Language Models Know What They Are Capable Of?

Casey O. Barkan,Sid Black,Oliver Sourbut

Main category: cs.CL

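TL;DR: This study examines whether large language models (LLMs) can predict their own success on single-step and multi-step tasks. All tested LLMs are overconfident, yet most have better-than-random discriminatory power. Newer and larger models do not generally discriminate better, and some models reduce their overconfidence after in-context experiences of failure, which improves their decisions. Because they lack accurate awareness of their own capabilities, current LLM agents decide poorly when failure is costly, with implications for AI misuse and misalignment risks.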

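Details Motivation: Knowing whether LLMs can accurately assess their own likelihood of success, especially on multi-step agentic tasks and in high-stakes settings, helps improve their autonomous decision making and reduce the AI misuse and misalignment risks that overconfidence creates. Method: Experiments evaluate how well several frontier LLMs predict their success on single-step and multi-step tasks, analyze the relationship between their confidence and actual performance, and add in-context experiences of failure to test whether models adjust their confidence and decide better. Result: Most LLMs have better-than-random discriminatory power but all are overconfident; on multi-step tasks, the overconfidence of several frontier models worsens as they progress; reasoning LLMs do no better than non-reasoning LLMs; some models reduce their overconfidence and decide better after in-context failures; and all models' decisions are approximately rational given their estimated probabilities, but overly optimistic estimates lead to poor decisions overall. Conclusion: Current LLM agents are limited by poor awareness of their own capabilities. Some models can adjust their confidence from experience, but overconfidence still degrades decision quality overall, and further work is needed to improve self-awareness and mitigate the related AI safety risks. Abstract: We investigate whether large language models (LLMs) can predict whether they will succeed on a given task and whether their predictions improve as they progress through multi-step tasks. We also investigate whether LLMs can learn from in-context experiences to make better decisions about whether to pursue a task in scenarios where failure is costly. All LLMs we tested are overconfident, but most predict their success with better-than-random discriminatory power. We find that newer and larger LLMs generally do not have greater discriminatory power, though Claude models do show such a trend. On multi-step agentic tasks, the overconfidence of several frontier LLMs worsens as they progress through the tasks, and reasoning LLMs perform comparably to or worse than non-reasoning LLMs. With in-context experiences of failure, some LLMs reduce their overconfidence, leading to significantly improved decision making, while others do not. Interestingly, all LLMs' decisions are approximately rational given their estimated probabilities of success, yet their overly-optimistic estimates result in poor decision making. These results suggest that current LLM agents are hindered by their lack of awareness of their own capabilities. We discuss the implications of LLMs' awareness of their capabilities for AI misuse and misalignment risks.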
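The abstract above observes that decisions can be rational given a model's estimated success probability and still be poor because the estimate is too optimistic. A small expected-value sketch makes the distinction concrete. The payoff structure (a fixed reward for success, a fixed cost for failure) and all numbers are assumptions for illustration, not the paper's experimental setup.

```python
def expected_value(p_success, reward, failure_cost):
    # Expected payoff of attempting a task that pays `reward` on success
    # and costs `failure_cost` on failure.
    return p_success * reward - (1.0 - p_success) * failure_cost


def should_attempt(estimated_p, reward, failure_cost):
    # A rational agent attempts the task only if the expected payoff,
    # computed from its own estimate, is positive.
    return expected_value(estimated_p, reward, failure_cost) > 0.0


# Hypothetical case: the agent believes it succeeds 70% of the time,
# but its true success rate is 30%.
believed_p, true_p = 0.7, 0.3
attempts = should_attempt(believed_p, reward=1.0, failure_cost=1.0)
realized = expected_value(true_p, reward=1.0, failure_cost=1.0)
```

The decision rule is sound, yet `attempts` is `True` while `realized` is negative: the loss comes entirely from the miscalibrated probability, which matches the abstract's diagnosis.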

[50] R-Debater: Retrieval-Augmented Debate Generation through Argumentative Memory

Maoyuan Li,Zhongsheng Wang,Haoyuan Li,Jiamou Liu

Main category: cs.CL

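TL;DR: R-Debater is a multi-turn debate generation framework built on argumentative memory. It combines retrieval grounding with a role-based agent to improve the consistency, evidence use, and coherence of debates.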

Details Motivation: Conventional LLMs struggle to keep a consistent stance and use evidence effectively in multi-turn debates; inspired by rhetoric and memory studies, the authors build a debate system with a memory mechanism. Method: The R-Debater framework integrates a debate knowledge base for retrieving case-like evidence and prior debate moves with a role-based agent that composes coherent utterances across turns. It is evaluated on the ORCHID dataset with a 1,000-item retrieval corpus and 32 held-out debates. Result: R-Debater outperforms strong LLM baselines on both the single-turn task (InspireScore) and adversarial multi-turn simulations (Debatrix), and human evaluation confirms its stance consistency and evidence use. Conclusion: Combining retrieval grounding with structured planning yields more faithful, stance-aligned debates that stay coherent across turns, supporting the value of memory mechanisms in complex dialogue systems. Abstract: We present R-Debater, an agentic framework for generating multi-turn debates built on argumentative memory. Grounded in rhetoric and memory studies, the system views debate as a process of recalling and adapting prior arguments to maintain stance consistency, respond to opponents, and support claims with evidence. Specifically, R-Debater integrates a debate knowledge base for retrieving case-like evidence and prior debate moves with a role-based agent that composes coherent utterances across turns. We evaluate on standardized ORCHID debates, constructing a 1,000-item retrieval corpus and a held-out set of 32 debates across seven domains. Two tasks are evaluated: next-utterance generation, assessed by InspireScore (subjective, logical, and factual), and adversarial multi-turn simulations, judged by Debatrix (argument, source, language, and overall). Compared with strong LLM baselines, R-Debater achieves higher single-turn and multi-turn scores. Human evaluation with 20 experienced debaters further confirms its consistency and evidence use, showing that combining retrieval grounding with structured planning yields more faithful, stance-aligned, and coherent debates across turns.

[51] MUSIC: MUlti-Step Instruction Contrast for Multi-Turn Reward Models

Wenzhe Li,Shujian Zhang,Wenxuan Zhou,John Lambert,Chi Jin,Andrew Hard,Rajiv Mathews,Lun Wang

Main category: cs.CL

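TL;DR: This paper proposes MUSIC, an unsupervised data augmentation strategy that generates contrastive conversation pairs spanning multiple turns to improve multi-turn reward models (RMs). A MUSIC-augmented RM trained on Gemma-2-9B-Instruct agrees more closely with advanced proprietary LLM judges on multi-turn conversations without sacrificing single-turn RM benchmark performance.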

Details Motivation: Existing preference datasets usually contrast responses based only on the final conversational turn, which fails to capture the nuances of multi-turn interaction and leads to weak automated multi-turn evaluation; an evaluation approach that better models multi-turn dynamics is needed. Method: MUSIC (MUlti-Step Instruction Contrast) synthesizes, without supervision, contrastive conversation pairs that differ across multiple turns, and the strategy is applied to the Skywork preference dataset to train a multi-turn reward model based on Gemma-2-9B-Instruct. Result: The MUSIC-augmented multi-turn RM aligns better with judgments from advanced proprietary LLM judges than baseline methods while remaining competitive on standard single-turn RM benchmarks. Conclusion: Contrastive signal spanning multiple turns is critical for building robust multi-turn reward models, and MUSIC provides an effective, scalable data augmentation route that advances automated evaluation of multi-turn conversations. Abstract: Evaluating the quality of multi-turn conversations is crucial for developing capable Large Language Models (LLMs), yet remains a significant challenge, often requiring costly human evaluation. Multi-turn reward models (RMs) offer a scalable alternative and can provide valuable signals for guiding LLM training. While recent work has advanced multi-turn training techniques, effective automated evaluation specifically for multi-turn interactions lags behind. We observe that standard preference datasets, typically contrasting responses based only on the final conversational turn, provide insufficient signal to capture the nuances of multi-turn interactions. Instead, we find that incorporating contrasts spanning multiple turns is critical for building robust multi-turn RMs. Motivated by this finding, we propose MUlti-Step Instruction Contrast (MUSIC), an unsupervised data augmentation strategy that synthesizes contrastive conversation pairs exhibiting differences across multiple turns. Leveraging MUSIC on the Skywork preference dataset, we train a multi-turn RM based on the Gemma-2-9B-Instruct model. Empirical results demonstrate that our MUSIC-augmented RM outperforms baseline methods, achieving higher alignment with judgments from advanced proprietary LLM judges on multi-turn conversations, crucially, without compromising performance on standard single-turn RM benchmarks.

[52] BIOME-Bench: A Benchmark for Biomolecular Interaction Inference and Multi-Omics Pathway Mechanism Elucidation from Scientific Literature

Sibo Wei,Peng Chen,Lifeng Dong,Yin Luo,Lei Wang,Peng Zhang,Wenpeng Lu,Jianbin Guo,Hongjun Yang,Dajun Zeng

Main category: cs.CL

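TL;DR: This paper introduces BIOME-Bench, a standardized benchmark for evaluating large language models on multi-omics pathway mechanism elucidation, and reveals the shortcomings of existing models in biomolecular relation inference and pathway mechanism explanation.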

Details Motivation: Existing pathway enrichment methods are limited by curation lag in pathway databases, functional redundancy, and insufficient sensitivity to molecular states, and there is no standardized benchmark for systematically evaluating LLMs in multi-omics analysis. Method: BIOME-Bench is constructed through a four-stage workflow and defines two core tasks, Biomolecular Interaction Inference and end-to-end Multi-Omics Pathway Mechanism Elucidation, each with its own evaluation protocol. Result: Experiments show that current LLMs still have substantial deficiencies in distinguishing fine-grained biomolecular relation types and in generating faithful, robust pathway mechanism explanations. Conclusion: LLMs and their evaluation methods need further improvement to raise their accuracy and practical usefulness in multi-omics data interpretation. Abstract: Multi-omics studies often rely on pathway enrichment to interpret heterogeneous molecular changes, but pathway enrichment (PE)-based workflows inherit structural limitations of pathway resources, including curation lag, functional redundancy, and limited sensitivity to molecular states and interventions. Although recent work has explored using large language models (LLMs) to improve PE-based interpretation, the lack of a standardized benchmark for end-to-end multi-omics pathway mechanism elucidation has largely confined evaluation to small, manually curated datasets or ad hoc case studies, hindering reproducible progress. To address this issue, we introduce BIOME-Bench, constructed via a rigorous four-stage workflow, to evaluate two core capabilities of LLMs in multi-omics analysis: Biomolecular Interaction Inference and end-to-end Multi-Omics Pathway Mechanism Elucidation. We develop evaluation protocols for both tasks and conduct comprehensive experiments across multiple strong contemporary models. Experimental results demonstrate that existing models still exhibit substantial deficiencies in multi-omics analysis, struggling to reliably distinguish fine-grained biomolecular relation types and to generate faithful, robust pathway-level mechanistic explanations.

[53] Uncertainty-aware Semi-supervised Ensemble Teacher Framework for Multilingual Depression Detection

Mohammad Zia Ur Rehman,Velpuru Navya,Sanskar,Shuja Uddin Qureshi,Nagendra Kumar

Main category: cs.CL

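TL;DR: The paper proposes Semi-SMDNet, a semi-supervised multilingual depression detection network that combines teacher-student models, ensemble learning, and data augmentation to improve depression detection in low-resource languages.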

Details Motivation: Detecting depression in social media text remains challenging because of differing language styles, informal expression, and the lack of annotated data in many languages. Method: Semi-SMDNet uses a teacher-student pseudo-labelling framework, soft-voting ensembling, uncertainty-based threshold filtering, and confidence-weighted training, together with data augmentation to improve multilingual robustness. Result: On Arabic, Bangla, English, and Spanish datasets it significantly outperforms strong baselines and narrows the performance gap between high-resource and low-resource settings. Conclusion: The framework is suited to scalable cross-language mental health monitoring where labelled resources are limited, and it generalizes well in practice. Abstract: Detecting depression from social media text is still a challenging task. This is due to different language styles, informal expression, and the lack of annotated data in many languages. To tackle these issues, we propose Semi-SMDNet, a strong Semi-Supervised Multilingual Depression detection Network. It combines teacher-student pseudo-labelling, ensemble learning, and augmentation of data. Our framework uses a group of teacher models. Their predictions come together through soft voting. An uncertainty-based threshold filters out low-confidence pseudo-labels to reduce noise and improve learning stability. We also use a confidence-weighted training method that focuses on reliable pseudo-labelled samples. This greatly boosts robustness across languages. Tests on Arabic, Bangla, English, and Spanish datasets show that our approach consistently beats strong baselines. It significantly reduces the performance gap between settings that have plenty of resources and those that do not. Detailed experiments and studies confirm that our framework is effective and can be used in various situations. This shows that it is suitable for scalable, cross-language mental health monitoring where labelled resources are limited.
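
The pseudo-labelling step described for Semi-SMDNet (soft voting over teachers, an uncertainty threshold, then confidence weighting) can be illustrated with a minimal sketch. The abstract does not say how uncertainty is measured or how weights are set, so the use of normalized entropy, the `max_uncertainty` value, and all function names below are assumptions made for illustration only.

```python
import math

def soft_vote(teacher_probs):
    """Average the class-probability vectors of several teachers (soft voting)."""
    n_teachers = len(teacher_probs)
    n_classes = len(teacher_probs[0])
    return [sum(p[c] for p in teacher_probs) / n_teachers for c in range(n_classes)]

def normalized_entropy(probs):
    """Entropy scaled to [0, 1]; used here as the uncertainty score (an assumption)."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))

def pseudo_label(teacher_probs, max_uncertainty=0.5):
    """Return (label, weight) for a confident sample, or None if it is filtered out.

    The weight is the ensemble probability of the chosen class, which a
    confidence-weighted loss could multiply into the per-sample loss.
    """
    avg = soft_vote(teacher_probs)
    if normalized_entropy(avg) > max_uncertainty:
        return None  # too uncertain: drop the pseudo-label to reduce noise
    label = max(range(len(avg)), key=avg.__getitem__)
    return label, avg[label]
```

With two teachers that agree strongly, the sample is kept with a weight close to 1; with teachers that disagree, the averaged distribution is near uniform and the sample is discarded.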

[54] Compute-Accuracy Pareto Frontiers for Open-Source Reasoning Large Language Models

Ákos Prucs,Márton Csutora,Mátyás Antal,Márk Marosi

Main category: cs.CL

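TL;DR: This paper studies the test-time compute efficiency of large language models on math- and reasoning-intensive benchmarks. It finds that the Mixture of Experts (MoE) architecture balances performance and efficiency well, and that accuracy gains saturate as compute increases.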

Details Motivation: Current research overlooks the large computational cost of generating long reasoning chains, while industrial applications must balance accuracy against inference cost, so models need a compute-aware evaluation. Method: The authors run a test-time-compute-aware evaluation of contemporary and older open-source LLMs on math and reasoning tasks, map their Pareto frontiers, and analyze the efficiency of different architectures such as MoE. Result: MoE architectures do well on both performance and efficiency; reasoning ability shows diminishing marginal returns as compute grows; and there is a saturation point for inference compute beyond which accuracy gains are limited. Conclusion: Extended reasoning helps but cannot overcome a model's intrinsic limitations. Model selection should take compute cost into account, and MoE is a strong candidate for an efficient architecture. Abstract: Large Language Models (LLMs) are demonstrating rapid improvements on complex reasoning benchmarks, particularly when allowed to utilize intermediate reasoning steps before converging on a final solution. However, current literature often overlooks the significant computational burden associated with generating long reasoning sequences. For industrial applications, model selection depends not only on raw accuracy but also on resource constraints and inference costs. In this work, we conduct a test-time-compute aware evaluation of both contemporary and older open-source LLMs, mapping their Pareto frontiers across math- and reasoning-intensive benchmarks. Our findings identify the Mixture of Experts (MoE) architecture as a strong candidate to balance performance and efficiency in our evaluation setting. Furthermore, we trace the trajectory of Pareto efficiency over time to derive an emergent trend regarding accuracy gain per unit of compute. Finally, we demonstrate that there is a saturation point for inference-time compute. Beyond a certain threshold, accuracy gains diminish, indicating that while extended reasoning capabilities are beneficial, they cannot overcome intrinsic model limitations regarding specific complexities.
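
A compute-accuracy Pareto frontier keeps only the models that no other model beats on both cost and accuracy. The sketch below shows one common way to compute it; the tuple layout, the cost unit, and the example models are hypothetical and do not come from the paper.

```python
def pareto_frontier(points):
    """Return the non-dominated (name, cost, accuracy) tuples, sorted by cost.

    A point is dropped if some other point is at least as cheap and
    strictly more accurate, or equally accurate and no more expensive.
    """
    # Cheapest first; among equal costs, most accurate first.
    ordered = sorted(points, key=lambda p: (p[1], -p[2]))
    frontier = []
    best_accuracy = float("-inf")
    for name, cost, accuracy in ordered:
        # Keep a point only if it beats every cheaper (or equally cheap) point.
        if accuracy > best_accuracy:
            frontier.append((name, cost, accuracy))
            best_accuracy = accuracy
    return frontier


# Hypothetical models: (name, compute cost per query, benchmark accuracy).
models = [
    ("A", 1.0, 0.50),
    ("B", 2.0, 0.45),   # dominated by A: more expensive and less accurate
    ("C", 3.0, 0.70),
    ("D", 3.0, 0.65),   # dominated by C at the same cost
    ("E", 10.0, 0.71),  # on the frontier, but the gain over C is small
]
print(pareto_frontier(models))
```

Model E in the example illustrates the saturation effect the abstract describes: it stays on the frontier, yet more than triples the cost for one point of accuracy.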

[55] Practising responsibility: Ethics in NLP as a hands-on course

Malvina Nissim,Viviana Patti,Beatrice Savoldi

Main category: cs.CL

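TL;DR: This paper presents a course on ethical issues in natural language processing (NLP) and its active-learning pedagogy, which aims to cope with a rapidly evolving field and to foster students' critical thinking.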

Details Motivation: As NLP systems become more pervasive, integrating ethical considerations into NLP education has become essential. Curriculum development, however, must cope with the field's rapid evolution and with the need to foster critical thinking beyond traditional technical training. Method: The course uses active-learning teaching methods, including interactive sessions, hands-on exercises, and "learning by teaching", and has been refined over four years across different institutions, educational levels, and interdisciplinary backgrounds. Result: The course has produced many reusable teaching resources and educational products for diverse audiences, all created by the students themselves. Conclusion: By sharing the course design and practical experience, the authors aim to offer a reference and inspiration for educators who want to bring social impact considerations into their teaching. Abstract: As Natural Language Processing (NLP) systems become more pervasive, integrating ethical considerations into NLP education has become essential. However, this presents inherent challenges in curriculum development: the field's rapid evolution from both academia and industry, and the need to foster critical thinking beyond traditional technical training. We introduce our course on Ethical Aspects in NLP and our pedagogical approach, grounded in active learning through interactive sessions, hands-on activities, and "learning by teaching" methods. Over four years, the course has been refined and adapted across different institutions, educational levels, and interdisciplinary backgrounds; it has also yielded many reusable products, both in the form of teaching materials and in the form of actual educational products aimed at diverse audiences, made by the students themselves. By sharing our approach and experience, we hope to provide inspiration for educators seeking to incorporate social impact considerations into their curricula.

[56] Triangulation as an Acceptance Rule for Multilingual Mechanistic Interpretability

Yanan Long

Main category: cs.CL

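TL;DR: This paper proposes "triangulation", a new acceptance rule for validating mechanistic explanations of multilingual models so that they remain causally valid across languages and cultures.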

Details Motivation: Multilingual models perform well in aggregate but behave unpredictably across languages, scripts, and cultures, so mechanistic explanations of them should meet a causal standard. Method: The paper introduces "reference families" and the "triangulation" acceptance rule, which combines necessity, sufficiency, and invariance, and uses automatic circuit discovery together with interchange interventions for validation. Result: Triangulation filters out spurious circuits that hold in a single environment but fail cross-lingually, making mechanistic explanations more reliable. Conclusion: The method gives multilingual interpretability a falsifiable standard and connects causal abstraction with pragmatic interpretability research. Abstract: Multilingual language models achieve strong aggregate performance yet often behave unpredictably across languages, scripts, and cultures. We argue that mechanistic explanations for such models should satisfy a causal standard: claims must survive causal interventions and must cross-reference across environments that perturb surface form while preserving meaning. We formalize reference families as predicate-preserving variants and introduce triangulation, an acceptance rule requiring necessity (ablating the circuit degrades the target behavior), sufficiency (patching activations transfers the behavior), and invariance (both effects remain directionally stable and of sufficient magnitude across the reference family). To supply candidate subgraphs, we adopt automatic circuit discovery and accept or reject those candidates by triangulation. We ground triangulation in causal abstraction by casting it as an approximate transformation score over a distribution of interchange interventions, connect it to the pragmatic interpretability agenda, and present a comparative experimental protocol across multiple model families, language pairs, and tasks. Triangulation provides a falsifiable standard for mechanistic claims that filters spurious circuits passing single-environment tests but failing cross-lingual invariance.
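
The three conditions of the triangulation rule can be read as a simple predicate over per-environment effect sizes. The sketch below is one possible reading, not the paper's formal transformation score: the `Effect` record, the single shared threshold `min_effect`, and the sign convention (positive means the expected direction) are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class Effect:
    ablation_drop: float  # drop in the target behaviour when the circuit is ablated
    patch_gain: float     # behaviour transferred when the circuit's activations are patched

def triangulate(effects, min_effect=0.1):
    """Accept a candidate circuit only if it passes in every environment.

    effects maps each member of the reference family (for example a
    language or script variant) to its measured Effect. Necessity needs a
    large enough ablation drop, sufficiency a large enough patch gain, and
    invariance requires both to hold with the same sign everywhere.
    """
    if not effects:
        return False
    return all(
        e.ablation_drop >= min_effect and e.patch_gain >= min_effect
        for e in effects.values()
    )


# Hypothetical measurements for two candidate circuits.
stable = {"en": Effect(0.40, 0.30), "de": Effect(0.35, 0.25), "hi": Effect(0.20, 0.15)}
spurious = {"en": Effect(0.50, 0.40), "hi": Effect(0.02, -0.10)}
print(triangulate(stable), triangulate(spurious))
```

The second circuit would pass a single-environment test on English alone, which is exactly the case the rule is meant to reject.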

[57] PrivacyBench: A Conversational Benchmark for Evaluating Privacy in Personalized AI

Srija Mukhopadhyay,Sathwik Reddy,Shruthi Muthukumar,Jisun An,Ponnurangam Kumaraguru

Main category: cs.CL

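TL;DR: PrivacyBench is a socially grounded benchmark for evaluating how well AI agents protect users' secrets in multi-turn conversations. It shows that existing retrieval-augmented generation systems carry a serious risk of information leakage.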

Details Motivation: Personalized AI systems rely on users' digital footprints, but without social-context awareness they may expose sensitive information and threaten users' digital safety, so the privacy protection of existing systems needs to be evaluated and improved. Method: The paper proposes the PrivacyBench benchmark, which includes socially grounded datasets with embedded secrets and a multi-turn conversational evaluation. It tests secret leakage by RAG assistants under different conditions and analyzes the privacy risks of the retrieval and generation components. Result: RAG systems leak secrets in up to 26.56% of interactions. A privacy-aware prompt lowers leakage to 5.12%, but the retrieval mechanism still accesses sensitive data indiscriminately, creating a single point of failure. Conclusion: Current AI architectures have structural privacy flaws, and systematic, privacy-by-design safeguards are urgently needed to make wide-scale deployment safe and ethical. Abstract: Personalized AI agents rely on access to a user's digital footprint, which often includes sensitive data from private emails, chats and purchase histories. Yet this access creates a fundamental societal and privacy risk: systems lacking social-context awareness can unintentionally expose user secrets, threatening digital well-being. We introduce PrivacyBench, a benchmark with socially grounded datasets containing embedded secrets and a multi-turn conversational evaluation to measure secret preservation. Testing Retrieval-Augmented Generation (RAG) assistants reveals that they leak secrets in up to 26.56% of interactions. A privacy-aware prompt lowers leakage to 5.12%, yet this measure offers only partial mitigation. The retrieval mechanism continues to access sensitive data indiscriminately, which shifts the entire burden of privacy preservation onto the generator. This creates a single point of failure, rendering current architectures unsafe for wide-scale deployment. Our findings underscore the urgent need for structural, privacy-by-design safeguards to ensure an ethical and inclusive web for everyone.

[58] Big AI is accelerating the metacrisis: What can we do?

Steven Bird

Main category: cs.CL

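TL;DR: The world faces ecological, meaning, and language crises that are converging into a metacrisis, and Big AI is accelerating the process. Language engineers play a key role: their scalability narrative no longer serves human needs, they supply technical talent to plutocrats and kleptocrats, and they ignore the value-laden nature of technology. Alternatives are urgently needed to design a future for NLP centred on human flourishing and life on the planet.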

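Details Motivation: To respond to the metacrisis formed by intertwined ecological, meaning, and language crises, and to reflect on the negative social and environmental effects of the current NLP trajectory, in particular the responsibility of Big AI and language engineers. Method: A critical analysis of the current development paradigm in NLP that points out its illusion of value neutrality and its complicity with power structures, and calls for a shift to human-centred, life-centred technology design. Result: The paper argues that the prevailing NLP development model aggravates global crises, highlights the missing value orientation in technology development, and states that the field must change direction toward language technology that supports human flourishing and ecological sustainability. Conclusion: The current NLP development model, which serves a wealthy few, should be abandoned, and collective intelligence should be mobilized to build a life-affirming paradigm for language technology that promotes human well-being and ecological harmony. Abstract: The world is in the grip of ecological, meaning, and language crises which are converging into a metacrisis. Big AI is accelerating them all. Language engineers are playing a central role, persisting with a scalability story that is failing humanity, supplying critical talent to plutocrats and kleptocrats, and creating new technologies as if the whole endeavour was value-free. We urgently need to explore alternatives, applying our collective intelligence to design a life-affirming future for NLP that is centered on human flourishing on a living planet.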

[59] Encyclo-K: Evaluating LLMs with Dynamically Composed Knowledge Statements

Yiming Liang,Yizhi Li,Yantao Du,Ge Zhang,Jiayi Zhou,Yuchen Wu,Yinzhu Piao,Denghui Cao,Tong Sun,Ziniu Li,Li Du,Bo Lei,Jiaheng Liu,Chenghua Lin,Zhaoxiang Zhang,Wenhao Huang,Jiajun Zhang

Main category: cs.CL

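TL;DR: Encyclo-K is a new statement-based LLM benchmark that generates questions dynamically, giving it resistance to data contamination, multi-knowledge-point assessment, and low-cost annotation.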

Details Motivation: Existing benchmarks are vulnerable to data contamination, restricted to single-knowledge-point assessment, and dependent on costly expert annotation; Encyclo-K aims to overcome these three limitations. Method: Standalone knowledge statements are extracted from authoritative textbooks and composed dynamically into questions through random sampling at test time, so the knowledge statement becomes the basic unit of the evaluation. Result: Experiments on more than 50 LLMs show that even the best-performing GPT-5.1 reaches only 62.07% accuracy, and models form a clear performance gradient, confirming the benchmark's difficulty and strong discriminative power. Conclusion: Encyclo-K provides a scalable framework for dynamic, reliable evaluation of LLMs' comprehensive understanding of fine-grained disciplinary knowledge. Abstract: Benchmarks play a crucial role in tracking the rapid advancement of large language models (LLMs) and identifying their capability boundaries. However, existing benchmarks predominantly curate questions at the question level, suffering from three fundamental limitations: vulnerability to data contamination, restriction to single-knowledge-point assessment, and reliance on costly domain expert annotation. We propose Encyclo-K, a statement-based benchmark that rethinks benchmark construction from the ground up. Our key insight is that knowledge statements, not questions, can serve as the unit of curation, and questions can then be constructed from them. We extract standalone knowledge statements from authoritative textbooks and dynamically compose them into evaluation questions through random sampling at test time. This design directly addresses all three limitations: the combinatorial space is too vast to memorize, and model rankings remain stable across dynamically generated question sets, enabling reliable periodic dataset refresh; each question aggregates 8-10 statements for comprehensive multi-knowledge assessment; annotators only verify formatting compliance without requiring domain expertise, substantially reducing annotation costs. Experiments on over 50 LLMs demonstrate that Encyclo-K poses substantial challenges with strong discriminative power. Even the top-performing OpenAI-GPT-5.1 achieves only 62.07% accuracy, and model performance displays a clear gradient distribution--reasoning models span from 16.04% to 62.07%, while chat models range from 9.71% to 50.40%. 
These results validate the challenges introduced by dynamic evaluation and multi-statement comprehensive understanding. These findings establish Encyclo-K as a scalable framework for dynamic evaluation of LLMs' comprehensive understanding over multiple fine-grained disciplinary knowledge statements.
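
Composing a question from randomly sampled statements at test time can be sketched as follows. The abstract states only that each question aggregates 8-10 statements; the question format assumed here (a mix of correct and incorrect statements, with the answer being the positions of the correct ones) and every name in the code are illustrative assumptions.

```python
import random

def compose_question(pool, rng, min_k=8, max_k=10):
    """Sample 8-10 statements from the pool and build one question.

    pool is a list of (statement_text, is_correct) pairs. Returns the
    sampled statement texts and the positions of the correct ones, which
    serve as the answer key for this freshly generated question.
    """
    k = rng.randint(min_k, max_k)           # inclusive on both ends
    sampled = rng.sample(pool, k)           # sampling without replacement
    statements = [text for text, _ in sampled]
    answer_key = [i for i, (_, ok) in enumerate(sampled) if ok]
    return statements, answer_key


# Hypothetical statement pool: placeholder texts, every third one marked incorrect.
pool = [(f"statement {i}", i % 3 != 0) for i in range(30)]
statements, answer_key = compose_question(pool, random.Random(0))
print(len(statements), answer_key)
```

Because a new subset is drawn on every call, the number of distinct questions grows combinatorially with the pool size, which is the property the abstract relies on for contamination resistance.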

[60] mHC: Manifold-Constrained Hyper-Connections

Zhenda Xie,Yixuan Wei,Huanqi Cao,Chenggang Zhao,Chengqi Deng,Jiashi Li,Damai Dai,Huazuo Gao,Jiang Chang,Liang Zhao,Shangyan Zhou,Zhean Xu,Zhengyan Zhang,Wangding Zeng,Shengding Hu,Yuqing Wang,Jingyang Yuan,Lean Wang,Wenfeng Liang

Main category: cs.CL

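TL;DR: The paper proposes the Manifold-Constrained Hyper-Connections (mHC) framework, which uses a manifold constraint to restore the identity mapping property in hyper-connections and optimizes infrastructure to improve training stability, scalability, and efficiency.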

Details Motivation: Existing hyper-connection methods improve performance but break the identity mapping property of residual connections, which causes training instability, poor scalability, and high memory overhead. Method: The residual connection space of hyper-connections is projected onto a specific manifold to restore the identity mapping, combined with rigorous infrastructure optimization. Result: Experiments show that mHC is effective for training at scale, delivering performance improvements and better scalability while reducing memory access overhead. Conclusion: As a practical extension of HC, mHC contributes to the understanding of topological architecture design and suggests new directions for foundation models. Abstract: Recently, studies exemplified by Hyper-Connections (HC) have extended the ubiquitous residual connection paradigm established over the past decade by expanding the residual stream width and diversifying connectivity patterns. While yielding substantial performance gains, this diversification fundamentally compromises the identity mapping property intrinsic to the residual connection, which causes severe training instability and restricted scalability, and additionally incurs notable memory access overhead. To address these challenges, we propose Manifold-Constrained Hyper-Connections (mHC), a general framework that projects the residual connection space of HC onto a specific manifold to restore the identity mapping property, while incorporating rigorous infrastructure optimization to ensure efficiency. Empirical experiments demonstrate that mHC is effective for training at scale, offering tangible performance improvements and superior scalability. We anticipate that mHC, as a flexible and practical extension of HC, will contribute to a deeper understanding of topological architecture design and suggest promising directions for the evolution of foundational models.

[61] BEDA: Belief Estimation as Probabilistic Constraints for Performing Strategic Dialogue Acts

Hengli Li,Zhaoxin Yu,Qi Shen,Chenxi Li,Mengmeng Wang,Tinglang Wu,Yipeng Kang,Yuxuan Wang,Song-Chun Zhu,Zixia Jia,Zilong Zheng

Main category: cs.CL

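TL;DR: This paper proposes the BEDA framework, which turns belief estimation into probabilistic constraints at generation time to address the under-use of beliefs in strategic dialogue, and significantly outperforms baselines across several dialogue settings.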

Details Motivation: Existing methods can estimate beliefs in dialogue accurately but lack a mechanism for using those beliefs effectively during generation. Method: The paper formalizes two core dialogue acts, Adversarial and Alignment, and incorporates them into generation through probabilistic constraints. It proposes the BEDA framework, which consists of a world set, a belief estimator, and a conditional generator. Result: On the CKBG, MF, and CaSiNo tasks BEDA outperforms strong baselines: up to 20.6 points on CKBG (with GPT-4.1-nano), an average of 9.3 points on MF, and the optimal deal on CaSiNo. Conclusion: Treating belief estimation as a generation constraint is a simple, general way to improve the reliability and performance of strategic dialogue systems. Abstract: Strategic dialogue requires agents to execute distinct dialogue acts, for which belief estimation is essential. While prior work often estimates beliefs accurately, it lacks a principled mechanism to use those beliefs during generation. We bridge this gap by first formalizing two core acts, Adversarial and Alignment, and by operationalizing them via probabilistic constraints on what an agent may generate. We instantiate this idea in BEDA, a framework that consists of the world set, the belief estimator for belief estimation, and the conditional generator that selects acts and realizes utterances consistent with the inferred beliefs. Across three settings, Conditional Keeper Burglar (CKBG, adversarial), Mutual Friends (MF, cooperative), and CaSiNo (negotiation), BEDA consistently outperforms strong baselines: on CKBG it improves success rate by at least 5.0 points across backbones and by 20.6 points with GPT-4.1-nano; on Mutual Friends it achieves an average improvement of 9.3 points; and on CaSiNo it achieves the optimal deal relative to all baselines. These results indicate that casting belief estimation as constraints provides a simple, general mechanism for reliable strategic dialogue.

[62] Adaptive Dependency-aware Prompt Optimization Framework for Multi-Step LLM Pipeline

Minjun Zhao,Xinyu Zhang,Shuai Zhang,Deyang Li,Ruifeng Shi

Main category: cs.CL

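TL;DR: The paper proposes ADOPT, an adaptive, dependency-aware prompt optimization framework for multi-step large language model (LLM) pipelines. By modelling the dependency between each step and the final task outcome, it obtains precise text-gradient estimates and handles joint tuning of multiple prompts effectively.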

Details Motivation: The performance of a multi-step LLM pipeline depends on the quality of the prompt at each step, but missing step-level supervision and inter-step dependencies make joint optimization difficult, and existing end-to-end methods perform poorly in this setting. Method: The ADOPT framework explicitly models the dependency between each LLM step and the final task output, decouples textual gradient estimation from gradient updates, and uses a Shapley-based mechanism to allocate optimization resources adaptively, reducing multi-prompt optimization to flexible single-prompt optimization steps. Result: Experiments on real-world datasets and diverse pipeline structures show that ADOPT outperforms state-of-the-art prompt optimization baselines in both performance and robustness. Conclusion: Through dependency awareness and adaptive resource allocation, ADOPT addresses the challenge of jointly optimizing prompts in multi-step LLM pipelines. Abstract: Multi-step LLM pipelines invoke large language models multiple times in a structured sequence and can effectively solve complex tasks, but their performance heavily depends on the prompts used at each step. Jointly optimizing these prompts is difficult due to missing step-level supervision and inter-step dependencies. Existing end-to-end prompt optimization methods struggle under these conditions and often yield suboptimal or unstable updates. We propose ADOPT, an Adaptive Dependency-aware Prompt Optimization framework for multi-step LLM pipelines. ADOPT explicitly models the dependency between each LLM step and the final task outcome, enabling precise text-gradient estimation analogous to computing analytical derivatives. It decouples textual gradient estimation from gradient updates, reducing multi-prompt optimization to flexible single-prompt optimization steps, and employs a Shapley-based mechanism to adaptively allocate optimization resources. Experiments on real-world datasets and diverse pipeline structures show that ADOPT is effective and robust, consistently outperforming state-of-the-art prompt optimization baselines.
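
To make the idea of Shapley-based resource allocation concrete, the sketch below computes exact Shapley values for a handful of pipeline steps and splits an optimization budget in proportion to them. The abstract does not define ADOPT's value function or allocation rule, so the `value` callable, the proportional split, and the step names are assumptions; this is the textbook Shapley computation, not the authors' implementation.

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings.

    value takes a frozenset of players and returns the payoff of that
    coalition. The cost is factorial in len(players), so this is only
    practical for pipelines with a few steps.
    """
    totals = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            extended = coalition | {p}
            totals[p] += value(extended) - value(coalition)
            coalition = extended
    return {p: total / len(orders) for p, total in totals.items()}

def allocate(budget, scores):
    """Split the budget in proportion to the non-negative part of each score."""
    clipped = {p: max(s, 0.0) for p, s in scores.items()}
    total = sum(clipped.values())
    if total == 0:
        return {p: budget / len(scores) for p in scores}
    return {p: budget * s / total for p, s in clipped.items()}


# Hypothetical gains in end-task score from optimizing each step's prompt.
gains = {"retrieve": 0.1, "draft": 0.3, "refine": 0.0}
scores = shapley_values(list(gains), lambda steps: sum(gains[s] for s in steps))
print(allocate(40, scores))
```

With an additive value function like the one in the example, each Shapley value equals the step's own gain; the computation becomes informative when steps interact, which is the inter-step dependency the paper targets.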

Luis Adrián Cabrera-Diego

Main category: cs.CL

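TL;DR: The paper proposes a legal document classifier based on DeBERTa V3 and an LSTM that takes 48 randomly selected short chunks (at most 128 tokens each) as input and, together with a reliable processing workflow deployed on Temporal, classifies long legal documents efficiently.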

Details Motivation: Legal documents are often long and contain specialized vocabulary, so processing the full text with a Transformer model can be expensive, slow, or infeasible; an efficient, scalable classification method is needed. Method: The model combines DeBERTa V3 with an LSTM. Each legal document is split into short chunks (max 128 tokens), and 48 of them are selected at random as the model input. Temporal is used to build a robust deployment pipeline for large-scale document processing. Result: The best model reached a weighted F-score of 0.898, and the pipeline running on CPU had a median processing time of 498 seconds per 100 files. Conclusion: The method addresses the length and efficiency problems of long legal document classification, keeping accuracy high while remaining practical to deploy. Abstract: Classifying legal documents is a challenge: besides their specialized vocabulary, they can sometimes be very long. This means that feeding full documents to a Transformer-based model for classification might be impossible, expensive or slow. Thus, we present a legal document classifier based on DeBERTa V3 and an LSTM, that uses as input a collection of 48 randomly-selected short chunks (max 128 tokens). We also present its deployment pipeline using Temporal, a durable execution solution, which allows us to have a reliable and robust processing workflow. The best model had a weighted F-score of 0.898, while the pipeline running on CPU had a median processing time of 498 seconds per 100 files.
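
The input construction described above (split the document into chunks of at most 128 tokens, then pick 48 at random) can be sketched in a few lines. Several details are assumptions because the abstract does not give them: whitespace splitting stands in for the DeBERTa tokenizer, sampled chunks are kept in document order, and documents with fewer than 48 chunks are returned whole.

```python
import random

def sample_chunks(tokens, rng, n_chunks=48, max_len=128):
    """Split a token list into chunks of at most max_len and sample n_chunks of them."""
    chunks = [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]
    if len(chunks) <= n_chunks:
        return chunks  # short document: nothing to drop (padding is left to the model)
    # Sample chunk positions without replacement, then restore document order.
    picked = sorted(rng.sample(range(len(chunks)), n_chunks))
    return [chunks[i] for i in picked]


# Hypothetical long document: 10,000 whitespace tokens give 79 chunks.
document = " ".join(f"word{i}" for i in range(10_000))
selected = sample_chunks(document.split(), random.Random(42))
print(len(selected), max(len(c) for c in selected))
```

Whatever the document length, the classifier then sees at most 48 x 128 = 6,144 tokens, which is what keeps the cost of encoding bounded.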

[64] MAMA-Memeia! Multi-Aspect Multi-Agent Collaboration for Depressive Symptoms Identification in Memes

Siddhant Agarwal,Adya Dhuler,Polly Ruhnke,Melvin Speisman,Md Shad Akhtar,Shweta Yadav

Main category: cs.CL

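TL;DR: This paper proposes MAMAMemeia, a new method based on large language models and a multi-agent framework for detecting depressive symptoms expressed in social media memes, with a marked improvement in detection performance.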

Details Motivation: As memes are increasingly used to express depressive feelings, more effective methods are needed to identify this content in support of mental health monitoring. Method: The paper introduces the RESTOREx resource, which combines LLM-generated and human-annotated explanations, and designs MAMAMemeia, a multi-agent, multi-aspect discussion framework grounded in Cognitive Analytic Therapy (CAT). Result: MAMAMemeia improves macro-F1 by 7.55% over the previous state of the art and sets a new benchmark in a comparison with more than 30 methods. Conclusion: The study offers an effective new tool for detecting depressive symptoms in memes and shows the potential of AI in mental health applications. Abstract: Over the past years, memes have evolved from being exclusively a medium of humorous exchanges to one that allows users to express a range of emotions freely and easily. With the ever-growing utilization of memes in expressing depressive sentiments, we conduct a study on identifying depressive symptoms exhibited by memes shared by users of online social media platforms. We introduce RESTOREx as a vital resource for detecting depressive symptoms in memes on social media through Large Language Model (LLM)-generated and human-annotated explanations. We introduce MAMAMemeia, a collaborative multi-agent multi-aspect discussion framework grounded in the clinical psychology method of Cognitive Analytic Therapy (CAT) Competencies. MAMAMemeia improves upon the current state-of-the-art by 7.55% in macro-F1 and is established as the new benchmark compared to over 30 methods.

[65] Modeling Language as a Sequence of Thoughts

Nasim Borazjanizadeh,James McClelland

Main category: cs.CL

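TL;DR: This paper proposes Thought Gestalt (TG), a recurrent Transformer that models language at two levels, tokens and sentence-level "thought" states, and generates the tokens of the current sentence while cross-attending to a memory of prior sentence representations. The approach improves data efficiency and reduces relational-direction generalization errors.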

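Details Motivation: Existing Transformer language models rely mainly on surface-level co-occurrence statistics and do not form globally consistent latent representations of entities and events, which leads to problems such as the reversal curse, contextualization errors, and data inefficiency. The new architecture is inspired by cognitive science accounts of human comprehension, in which the incoming linguistic stream is converted into compact, event-like representations that persist in memory while the verbatim form is short-lived. Method: The Thought Gestalt (TG) model is a recurrent Transformer that models language at two levels of abstraction: tokens and sentence-level "thought" states. TG generates the tokens of one sentence at a time while cross-attending to a memory of prior sentence representations. Token and sentence representations are produced by the same set of model parameters and trained with a single objective, next-token cross-entropy. Because the computation graph of sentence representations written to memory is retained, gradients from future token losses flow back through cross-attention to optimize the parameters that generated earlier sentence vectors. Result: In scaling experiments, TG consistently improves efficiency over matched GPT-2 runs and other baselines; scaling fits indicate that GPT-2 needs roughly 5-8% more data and 33-42% more parameters to match TG's loss. TG also reduces relational-direction generalization errors on a father-son reversal curse probe. Conclusion: By combining token-level and sentence-level abstraction, Thought Gestalt improves data efficiency and the handling of complex semantic structure, in particular reducing problems such as the reversal curse, which suggests that two-level modelling is an effective way to improve language models. Abstract: Transformer language models can generate strikingly natural text by modeling language as a sequence of tokens. Yet, by relying primarily on surface-level co-occurrence statistics, they fail to form globally consistent latent representations of entities and events, the lack of which contributes to brittleness in relational direction (e.g., reversal curse), contextualization errors, and data inefficiency. On the other hand, cognitive science shows that human comprehension involves converting the input linguistic stream into compact, event-like representations that persist in memory while verbatim form is short-lived. Motivated by this view, we introduce the Thought Gestalt (TG) model, a recurrent Transformer that models language at two levels of abstraction - tokens and sentence-level "thought" states. TG generates the tokens of one sentence at a time while cross-attending to a memory of prior sentence representations. In TG, token and sentence representations are generated using the same set of model parameters and trained with a single objective, the next-token cross-entropy: by retaining the computation graph of sentence representations written to memory, gradients from future token losses flow backward through cross-attention to optimize the parameters generating earlier sentence vectors. 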
In scaling experiments, TG consistently improves efficiency over matched GPT-2 runs, among other baselines, with scaling fits indicating GPT-2 requires ~5-8% more data and ~33-42% more parameters to match TG's loss. TG also reduces errors on relational direction generalization on a father-son reversal curse probe.

[66] AdaGReS: Adaptive Greedy Context Selection via Redundancy-Aware Scoring for Token-Budgeted RAG

Chao Peng,Bin Wang,Zhilei Long,Jinfang Sheng

Main category: cs.CL

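TL;DR: The paper proposes AdaGReS, a redundancy-aware context selection framework for retrieval-augmented generation (RAG) that, under a token budget, optimizes a set-level objective combining relevance with redundancy penalties to select higher-quality context.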

Details Motivation: Standard top-k retrieval often returns redundant or near-duplicate chunks that waste the token budget and degrade generation quality, so a context selection method that automatically balances relevance and redundancy is needed. Method: Under a token-budget constraint, AdaGReS uses greedy selection based on marginal gains to optimize a set-level objective that combines query-chunk relevance with intra-set redundancy penalties, and introduces a closed-form, instance-adaptive calibration that sets the relevance-redundancy trade-off automatically. Result: Theoretical analysis shows that the objective is epsilon-approximately submodular under practical embedding similarity conditions, which gives near-optimality guarantees for greedy selection. Experiments on open-domain question answering and a high-redundancy biomedical dataset confirm improvements in redundancy control, context quality, and final answer quality. Conclusion: AdaGReS addresses the wasted tokens and lower generation quality caused by redundant context in RAG, with theoretical guarantees and practical gains, and needs no manual tuning across datasets and budgets. Abstract: Retrieval-augmented generation (RAG) is highly sensitive to the quality of selected context, yet standard top-k retrieval often returns redundant or near-duplicate chunks that waste token budget and degrade downstream generation. We present AdaGReS, a redundancy-aware context selection framework for token-budgeted RAG that optimizes a set-level objective combining query-chunk relevance and intra-set redundancy penalties. AdaGReS performs greedy selection under a token-budget constraint using marginal gains derived from the objective, and introduces a closed-form, instance-adaptive calibration of the relevance-redundancy trade-off parameter to eliminate manual tuning and adapt to candidate-pool statistics and budget limits. We further provide a theoretical analysis showing that the proposed objective exhibits epsilon-approximate submodularity under practical embedding similarity conditions, yielding near-optimality guarantees for greedy selection. Experiments on open-domain question answering (Natural Questions) and a high-redundancy biomedical (drug) corpus demonstrate consistent improvements in redundancy control and context quality, translating to better end-to-end answer quality and robustness across settings.
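
Greedy selection by marginal gain under a token budget can be sketched as below. The objective assumed here is total relevance minus `lam` times the summed pairwise similarity within the selected set, so the marginal gain of a chunk is its relevance minus `lam` times its similarity to everything already chosen. The abstract does not give the exact objective or the closed-form calibration of the trade-off parameter, so a fixed `lam` is used and all names are illustrative.

```python
def greedy_select(token_counts, relevance, similarity, budget, lam=1.0):
    """Pick chunks greedily by marginal gain until the budget or the gains run out.

    token_counts: dict chunk_id -> number of tokens
    relevance:    dict chunk_id -> query-chunk relevance score
    similarity:   function (chunk_id, chunk_id) -> similarity of two chunks
    """
    selected, used = [], 0
    remaining = set(token_counts)
    while remaining:
        best, best_gain = None, 0.0
        for c in sorted(remaining):  # sorted only to make ties deterministic
            if used + token_counts[c] > budget:
                continue  # chunk does not fit in the remaining budget
            gain = relevance[c] - lam * sum(similarity(c, s) for s in selected)
            if gain > best_gain:
                best, best_gain = c, gain
        if best is None:
            break  # nothing fits, or every remaining chunk has non-positive gain
        selected.append(best)
        used += token_counts[best]
        remaining.remove(best)
    return selected


# Hypothetical candidates: "a_dup" is a near-duplicate of "a".
token_counts = {"a": 100, "a_dup": 100, "b": 100}
relevance = {"a": 0.90, "a_dup": 0.88, "b": 0.60}
pair_sim = {frozenset(("a", "a_dup")): 0.95}
similarity = lambda x, y: pair_sim.get(frozenset((x, y)), 0.10)
print(greedy_select(token_counts, relevance, similarity, budget=200))
```

Plain top-2 retrieval by relevance would return the two near-duplicates, whereas the redundancy penalty makes the second pick the less relevant but novel chunk.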

cs.CV [Back]

[67] Leveraging Synthetic Priors for Monocular Depth Estimation in Specular Surgical Environments

Ankan Aich,Yangming Lee

Main category: cs.CV

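TL;DR: The paper proposes a monocular depth estimation method based on Depth Anything V2 and DV-LORA that substantially improves the accuracy and robustness of depth estimation in highly specular, fluid-filled endoscopic surgical environments.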

Details Motivation: Existing self-supervised monocular depth estimation methods suffer from boundary collapse on thin instruments and transparent surfaces in surgical scenes and depend on noisy pseudo-labels, which limits their performance. Method: The method uses the high-fidelity synthetic priors of Depth Anything V2 and transfers them efficiently to the medical domain with Dynamic Vector Low-Rank Adaptation (DV-LORA). It also introduces a physically stratified evaluation protocol to assess performance in high-specularity regions more accurately. Result: On the SCARED dataset the method reaches 98.1% accuracy (< 1.25) and reduces Squared Relative Error by more than 17%, clearly outperforming the baselines. Conclusion: The method is parameter-efficient and adapts well across domains, markedly improving the robustness of depth estimation under difficult surgical lighting and providing reliable perception for robotic surgery. Abstract: Accurate Monocular Depth Estimation (MDE) is critical for robotic surgery but remains fragile in specular, fluid-filled endoscopic environments. Existing self-supervised methods, typically relying on foundation models trained with noisy real-world pseudo-labels, often suffer from boundary collapse on thin surgical tools and transparent surfaces. In this work, we address this by leveraging the high-fidelity synthetic priors of the Depth Anything V2 architecture, which inherently captures precise geometric details of thin structures. We efficiently adapt these priors to the medical domain using Dynamic Vector Low-Rank Adaptation (DV-LORA), minimizing the parameter budget while bridging the synthetic-to-real gap. Additionally, we introduce a physically-stratified evaluation protocol on the SCARED dataset to rigorously quantify performance in high-specularity regimes often masked by aggregate metrics. Our approach establishes a new state-of-the-art, achieving an accuracy (< 1.25) of 98.1% and reducing Squared Relative Error by over 17% compared to established baselines, demonstrating superior robustness in adverse surgical lighting.
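
The two reported metrics are standard in monocular depth estimation. The sketch below uses their usual definitions, which the abstract does not spell out: threshold accuracy is the fraction of pixels where max(pred/gt, gt/pred) is below 1.25, and Squared Relative Error is the mean of (pred - gt)^2 / gt. The flat lists of depths and the example values are hypothetical; real evaluations also mask invalid pixels, which is omitted here.

```python
def delta_accuracy(pred, gt, threshold=1.25):
    """Fraction of pixels whose depth ratio max(pred/gt, gt/pred) is below the threshold."""
    ratios = [max(p / g, g / p) for p, g in zip(pred, gt)]
    return sum(r < threshold for r in ratios) / len(ratios)

def squared_relative_error(pred, gt):
    """Mean of squared depth error divided by the ground-truth depth."""
    return sum((p - g) ** 2 / g for p, g in zip(pred, gt)) / len(gt)


# Hypothetical per-pixel depths; all values must be strictly positive.
predicted = [1.0, 2.0, 4.0]
ground_truth = [1.0, 2.2, 2.0]
print(delta_accuracy(predicted, ground_truth))
print(squared_relative_error(predicted, ground_truth))
```

The third pixel shows why the two metrics are reported together: it counts as a single miss for threshold accuracy but dominates the Squared Relative Error.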

[68] Video-Based Performance Evaluation for ECR Drills in Synthetic Training Environments

Surya Rayala,Marcos Quinones-Grueiro,Naveeduddin Mohammed,Ashwin T S,Benjamin Goldberg,Randall Spain,Paige Lawton,Gautam Biswas

Main category: cs.CV

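TL;DR: This paper proposes a video-based assessment pipeline that uses computer vision to extract pose, gaze, and trajectory data from urban warfare training videos, builds task-specific metrics, and combines them with an extended Cognitive Task Analysis model to quantify teamwork and cognitive skills objectively.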

Details Motivation: Existing military training assessment relies on costly, intrusive sensors or subjective human observation, which makes scalable, objective evaluation difficult, especially for cognitive, psychomotor, and teamwork skills. Method: The paper proposes a video analysis pipeline that needs no additional hardware. Computer vision models extract 2D skeletons, gaze vectors, and movement trajectories, from which task-specific metrics for psychomotor fluency, situational awareness, and team coordination are built. These metrics are combined with weights in an extended Cognitive Task Analysis (CTA) hierarchy to produce overall scores. Result: The method is demonstrated on real "Enter and Clear the Room" (ECR) drills, where it provides actionable individual and team performance metrics and supports After Action Reviews through interactive dashboards. Conclusion: The method offers a low-cost, non-intrusive, scalable way to assess training in synthetic training environments. Future work will extend it to 3D video analysis and broaden its applicability. Abstract: Effective urban warfare training requires situational awareness and muscle memory, developed through repeated practice in realistic yet controlled environments. A key drill, Enter and Clear the Room (ECR), demands threat assessment, coordination, and securing confined spaces. The military uses Synthetic Training Environments that offer scalable, controlled settings for repeated exercises. However, automatic performance assessment remains challenging, particularly when aiming for objective evaluation of cognitive, psychomotor, and teamwork skills. Traditional methods often rely on costly, intrusive sensors or subjective human observation, limiting scalability and accuracy. This paper introduces a video-based assessment pipeline that derives performance analytics from training videos without requiring additional hardware. By utilizing computer vision models, the system extracts 2D skeletons, gaze vectors, and movement trajectories. From these data, we develop task-specific metrics that measure psychomotor fluency, situational awareness, and team coordination. These metrics feed into an extended Cognitive Task Analysis (CTA) hierarchy, which employs a weighted combination to generate overall performance scores for teamwork and cognition. We demonstrate the approach with a case study of real-world ECR drills, providing actionable, domain-specific metrics that capture individual and team performance. 
We also discuss how these insights can support After Action Reviews with interactive dashboards within Gamemaster and the Generalized Intelligent Framework for Tutoring (GIFT), providing intuitive and understandable feedback. We conclude by addressing limitations, including tracking difficulties, ground-truth validation, and the broader applicability of our approach. Future work includes expanding analysis to 3D video data and leveraging video analysis to enable scalable evaluation within STEs.

[69] Pretraining Frame Preservation in Autoregressive Video Memory Compression

Lvmin Zhang,Shengqu Cai,Muyang Li,Chong Zeng,Beijia Lu,Anyi Rao,Song Han,Gordon Wetzstein,Maneesh Agrawala

Main category: cs.CV

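TL;DR: The paper proposes PFP, a neural network structure that compresses long videos into short contexts and uses a pretraining objective to preserve the high-frequency details of single frames at arbitrary temporal positions.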

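Details Motivation: Long videos need to be compressed effectively while preserving key visual details in order to support downstream video modelling tasks. Method: The PFP neural network structure uses an explicit pretraining objective to preserve the high-frequency information of single frames and can serve as a memory encoder for autoregressive video models. Result: The baseline model compresses a 20-second video into a context of about 5k length from which random frames can be recovered with perceptually preserved appearance. The model can be fine-tuned for long-memory video generation with low context cost and relatively low fidelity loss. Conclusion: PFP provides an efficient approach to long-video compression and context representation for video generation tasks that need long-range dependencies. Abstract: We present PFP, a neural network structure to compress long videos into short contexts, with an explicit pretraining objective to preserve the high-frequency details of single frames at arbitrary temporal positions. The baseline model can compress a 20-second video into a context at about 5k length, where random frames can be retrieved with perceptually preserved appearances. Such pretrained models can be directly fine-tuned as memory encoders for autoregressive video models, enabling long history memory with low context cost and relatively low fidelity loss. We evaluate the framework with ablative settings and discuss the trade-offs of possible neural architecture designs.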

[70] Lifelong Domain Adaptive 3D Human Pose Estimation

Qucheng Peng,Hongfei Xue,Pu Wang,Chen Chen

Main category: cs.CV

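TL;DR: Proposes lifelong domain adaptation as a new task for 3D human pose estimation, which the authors describe as the first such formulation in this field. A GAN framework with a new 3D pose generator paradigm mitigates domain shift and catastrophic forgetting, and the method outperforms existing approaches on multiple datasets.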

Details Motivation: Existing domain adaptation methods overlook non-stationary target pose datasets and struggle to retain knowledge of earlier domains while learning new ones, which leads to poor generalization and catastrophic forgetting. Method: A new lifelong domain adaptive 3D HPE framework is built on a GAN comprising 3D pose generators, a 2D pose discriminator, and a 3D pose estimator; the 3D pose generator integrates pose-aware, temporal-aware, and domain-aware knowledge to strengthen adaptation and prevent forgetting. Result: Extensive experiments on several domain adaptive 3D HPE datasets show that the method adapts to new domains while retaining knowledge of earlier ones, and that it outperforms existing methods. Conclusion: The paper introduces lifelong domain adaptation to 3D human pose estimation; the framework handles non-stationary target domains and alleviates catastrophic forgetting, offering a new direction for continual learning in practical settings. Abstract: 3D Human Pose Estimation (3D HPE) is vital in various applications, from person re-identification and action recognition to virtual reality. However, the reliance on annotated 3D data collected in controlled environments poses challenges for generalization to diverse in-the-wild scenarios. Existing domain adaptation (DA) paradigms like general DA and source-free DA for 3D HPE overlook the issues of non-stationary target pose datasets. To address these challenges, we propose a novel task named lifelong domain adaptive 3D HPE. To our knowledge, we are the first to introduce the lifelong domain adaptation to the 3D HPE task. In this lifelong DA setting, the pose estimator is pretrained on the source domain and subsequently adapted to distinct target domains. Moreover, during adaptation to the current target domain, the pose estimator cannot access the source and all the previous target domains. The lifelong DA for 3D HPE involves overcoming challenges in adapting to current domain poses and preserving knowledge from previous domains, particularly combating catastrophic forgetting. We present an innovative Generative Adversarial Network (GAN) framework, which incorporates 3D pose generators, a 2D pose discriminator, and a 3D pose estimator. This framework effectively mitigates domain shifts and aligns original and augmented poses. Moreover, we construct a novel 3D pose generator paradigm, integrating pose-aware, temporal-aware, and domain-aware knowledge to enhance the current domain's adaptation and alleviate catastrophic forgetting on previous domains.
Our method demonstrates superior performance through extensive experiments on diverse domain adaptive 3D HPE datasets.

[71] MRI-to-CT Synthesis With Cranial Suture Segmentations Using A Variational Autoencoder Framework

Krithika Iyer,Austin Tapp,Athelia Paulli,Gabrielle Dickerson,Syed Muhammad Anwar,Natasha Lepore,Marius George Linguraru

Main category: cs.CV

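TL;DR: Proposes a deep learning framework that generates synthetic CTs (sCTs) from pediatric T1-weighted MRI, enabling accurate segmentation and visualization of cranial bones and sutures. It addresses MRI's limited depiction of bone and reaches equivalence with real CT in structural similarity and segmentation accuracy.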

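Details Motivation: Quantitative assessment of pediatric cranial development and suture ossification is essential for diagnosing and treating disorders of head growth, but ionizing radiation limits the use of CT in children, while radiation-free MRI cannot clearly show the skull and sutures. A radiation-free method that still assesses cranial development accurately is therefore needed. Method: A deep learning pipeline built on domain-specific variational autoencoders converts T1-weighted MRIs of children aged 0.2 to 2 years into synthetic CTs (sCTs), predicts cranial bone segmentation, generates suture probability heatmaps, and derives direct suture segmentation from them. The pipeline is trained and validated on in-house pediatric data and evaluated with structural similarity, Frechet inception distance, and the Dice coefficient; two one-sided tests (TOST) check the equivalence of segmentation on sCTs and real CTs. Result: sCTs reach 99% structural similarity and a Frechet inception distance of 1.01 relative to real CTs; the average Dice coefficient is 85% across seven cranial bones and 80% for sutures; TOST indicates equivalence of skull and suture segmentation between sCTs and real CTs (p < 0.05). Conclusion: According to the authors, this is the first method to segment cranial sutures on sCTs derived from pediatric MRI. It fills a gap in non-invasive pediatric cranial assessment and offers a safe, accurate alternative that may reduce reliance on CT scans in children. Abstract: Quantifying normative pediatric cranial development and suture ossification is crucial for diagnosing and treating growth-related cephalic disorders. Computed tomography (CT) is widely used to evaluate cranial and sutural deformities; however, its ionizing radiation is contraindicated in children without significant abnormalities. Magnetic resonance imaging (MRI) offers radiation-free scans with superior soft tissue contrast, but unlike CT, MRI cannot elucidate cranial sutures, estimate skull bone density, or assess cranial vault growth. This study proposes a deep learning driven pipeline for transforming T1-weighted MRIs of children aged 0.2 to 2 years into synthetic CTs (sCTs), predicting detailed cranial bone segmentation, generating suture probability heatmaps, and deriving direct suture segmentation from the heatmaps. With our in-house pediatric data, sCTs achieved 99% structural similarity and a Frechet inception distance of 1.01 relative to real CTs. Skull segmentation attained an average Dice coefficient of 85% across seven cranial bones, and sutures achieved 80% Dice. Equivalence of skull and suture segmentation between sCTs and real CTs was confirmed using two one-sided tests (TOST p < 0.05). To our knowledge, this is the first pediatric cranial CT synthesis framework to enable suture segmentation on sCTs derived from MRI, despite MRI's limited depiction of bone and sutures.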
By combining robust, domain specific variational autoencoders, our method generates perceptually indistinguishable cranial sCTs from routine pediatric MRIs, bridging critical gaps in non invasive cranial evaluation.
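For readers unfamiliar with the Dice coefficient reported above, the following is a minimal sketch of the standard definition, 2|A ∩ B| / (|A| + |B|), on flattened binary masks. The function name and the toy masks are illustrative assumptions and are not taken from the paper.

```python
def dice_coefficient(pred, target):
    """Dice overlap between two flattened binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of elements")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    size = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Two empty masks agree perfectly by convention.
    return 1.0 if size == 0 else 2.0 * intersection / size


# Toy masks: one voxel overlaps, each mask marks two voxels.
pred_mask = [1, 1, 0, 0]
true_mask = [1, 0, 1, 0]
print(dice_coefficient(pred_mask, true_mask))
```

A score of 1.0 means the predicted and reference masks coincide, and 0.0 means they do not overlap at all.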

[72] Scaling Remote Sensing Foundation Models: Data Domain Tradeoffs at the Peta-Scale

Charith Wickrema,Eliza Mace,Hunter Brown,Heidys Cabrera,Nick Krall,Matthew O'Neill,Shivangi Sarkar,Lowell Weissman,Eric Hughes,Guido Zarrella

Main category: cs.CV

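TL;DR: Using more than a quadrillion pixels of commercial satellite electro-optical data, this study examines scaling behavior when training large vision transformer models for remote sensing. It finds that performance is still limited by data rather than by model parameters, and offers practical guidance on data collection, compute allocation, and optimization strategy for future remote sensing foundation models.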

Details Motivation: Scaling laws for high-value domains such as remote sensing are not well understood, so there are few principles to guide large-scale model training, which holds back domain-specialized encoders. Method: Using large-scale satellite EO data and the MITRE Federal AI Sandbox, the authors train vision transformer (ViT) backbones of increasing size and analyze their behavior at very large scale, their success and failure modes, and the implications for domain gaps across modalities. Result: Even at peta-scale, model performance remains in a data-limited rather than parameter-limited regime, which reveals scaling behavior particular to remote sensing. Conclusion: The results suggest that training remote sensing foundation models should prioritize more data over more parameters, and they provide a practical basis for data-collection strategies, compute budgets, and optimization schedules. Abstract: We explore the scaling behaviors of artificial intelligence to establish practical techniques for training foundation models on high-resolution electro-optical (EO) datasets that exceed the current state-of-the-art scale by orders of magnitude. Modern multimodal machine learning (ML) applications, such as generative artificial intelligence (GenAI) systems for image captioning, search, and reasoning, depend on robust, domain-specialized encoders for non-text modalities. In natural-image domains where internet-scale data is plentiful, well-established scaling laws help optimize the joint scaling of model capacity, training compute, and dataset size. Unfortunately, these relationships are much less well-understood in high-value domains like remote sensing (RS). Using over a quadrillion pixels of commercial satellite EO data and the MITRE Federal AI Sandbox, we train progressively larger vision transformer (ViT) backbones, report success and failure modes observed at petascale, and analyze implications for bridging domain gaps across additional RS modalities. We observe that even at this scale, performance is consistent with a data-limited regime rather than a model parameter-limited one. These practical insights are intended to inform data-collection strategies, compute budgets, and optimization schedules that advance the future development of frontier-scale RS foundation models.

[73] Learning to learn skill assessment for fetal ultrasound scanning

Yipei Wang,Qianye Yang,Lior Drukker,Aris T. Papageorghiou,Yipeng Hu,J. Alison Noble

Main category: cs.CV

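TL;DR: Proposes a bi-level optimisation framework that assesses fetal ultrasound skill automatically from how well a task is performed, without manually predefined skill ratings.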

Details Motivation: Traditional ultrasound skill assessment depends on subjective expert judgment, which is time-consuming and not sufficiently objective; existing automated methods mostly rely on supervised learning and on preset factors assumed to influence skill, which limits how general the assessment can be. Method: A bi-level optimisation framework contains a clinical task predictor and a skill predictor; the two networks are optimised jointly, and the quality of task completion on the acquired images serves as the skill indicator. Result: The method is shown to be feasible on real-world clinical ultrasound videos of the fetal head, quantifying skill level through the optimised task performance. Conclusion: The framework offers a new route to objective, automated ultrasound skill assessment with less dependence on manual annotation and preset rules. Abstract: Traditionally, ultrasound skill assessment has relied on expert supervision and feedback, a process known for its subjectivity and time-intensive nature. Previous works on quantitative and automated skill assessment have predominantly employed supervised learning methods, often limiting the analysis to predetermined or assumed factors considered influential in determining skill levels. In this work, we propose a novel bi-level optimisation framework that assesses fetal ultrasound skills by how well a task is performed on the acquired fetal ultrasound images, without using manually predefined skill ratings. The framework consists of a clinical task predictor and a skill predictor, which are optimised jointly by refining the two networks simultaneously. We validate the proposed method on real-world clinical ultrasound videos of scanning the fetal head. The results demonstrate the feasibility of predicting ultrasound skills by the proposed framework, which quantifies optimised task performance as a skill indicator.

[74] MGML: A Plug-and-Play Meta-Guided Multi-Modal Learning Framework for Incomplete Multimodal Brain Tumor Segmentation

Yulong Zou,Bo Liu,Cun-Jing Zheng,Yuan-ming Geng,Siyue Li,Qiankun Zuo,Shuihua Wang,Yudong Zhang,Jin Hong

Main category: cs.CV

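TL;DR: Proposes a meta-guided multi-modal learning (MGML) framework for brain tumor segmentation from incomplete multimodal MRI data, with good adaptability and robustness.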

Details Motivation: Multimodal MRI data are often incomplete in clinical practice, which limits lesion segmentation performance; making full use of incomplete multimodal information is a key challenge. Method: Two modules are designed: meta-parameterized adaptive modality fusion (Meta-AMF) and a consistency regularization module. Meta-AMF generates soft-label supervision signals from the available modalities to achieve adaptive fusion, and consistency regularization improves robustness and generalization. Result: Validated on BraTS2020 and BraTS2023, the method outperforms existing approaches; on BraTS2020 the average Dice scores over the 15 missing-modality combinations are 87.55 (WT), 79.36 (TC), and 62.67 (ET). Conclusion: MGML makes effective use of incomplete multimodal MRI data and improves brain tumor segmentation without changing the original model architecture, so it is easy to integrate into a training pipeline. Abstract: Leveraging multimodal information from Magnetic Resonance Imaging (MRI) plays a vital role in lesion segmentation, especially for brain tumors. However, in clinical practice, multimodal MRI data are often incomplete, making it challenging to fully utilize the available information. Therefore, maximizing the utilization of this incomplete multimodal information presents a crucial research challenge. We present a novel meta-guided multi-modal learning (MGML) framework that comprises two components: meta-parameterized adaptive modality fusion and consistency regularization module. The meta-parameterized adaptive modality fusion (Meta-AMF) enables the model to effectively integrate information from multiple modalities under varying input conditions. By generating adaptive soft-label supervision signals based on the available modalities, Meta-AMF explicitly promotes more coherent multimodal fusion. In addition, the consistency regularization module enhances segmentation performance and implicitly reinforces the robustness and generalization of the overall framework. Notably, our approach does not alter the original model architecture and can be conveniently integrated into the training pipeline for end-to-end model optimization. We conducted extensive experiments on the public BraTS2020 and BraTS2023 datasets. Compared to multiple state-of-the-art methods from previous years, our method achieved superior performance.
On BraTS2020, for the average Dice scores across fifteen missing modality combinations, building upon the baseline, our method obtained scores of 87.55, 79.36, and 62.67 for the whole tumor (WT), the tumor core (TC), and the enhancing tumor (ET), respectively. We have made our source code publicly available at https://github.com/worldlikerr/MGML.
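The figure of fifteen missing-modality combinations follows from the non-empty subsets of four MRI sequences (2^4 - 1 = 15). The sketch below enumerates them and averages a per-subset score. It assumes the four sequences usually distributed with BraTS (T1, T1ce, T2, FLAIR); the modality labels, helper names, and the placeholder scores are illustrative and do not come from the paper.

```python
from itertools import combinations

# Assumed modality names; the abstract itself does not list them.
MODALITIES = ("T1", "T1ce", "T2", "FLAIR")


def available_subsets(modalities=MODALITIES):
    """Every non-empty subset of modalities that may be present at test time."""
    return [
        subset
        for size in range(1, len(modalities) + 1)
        for subset in combinations(modalities, size)
    ]


def mean_score(score_by_subset):
    """Average a metric (for example Dice) over all evaluated subsets."""
    return sum(score_by_subset.values()) / len(score_by_subset)


subsets = available_subsets()
print(len(subsets))

# Placeholder scores only, to show the averaging step.
placeholder_scores = {subset: 0.5 for subset in subsets}
print(mean_score(placeholder_scores))
```

Reporting the mean over all subsets, rather than only the full-modality case, is what makes the metric sensitive to how well a model copes with missing inputs.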

[75] Learnable Query Aggregation with KV Routing for Cross-view Geo-localisation

Hualin Ye,Bingxi Liu,Jixiang Du,Yu Qin,Ziyi Chen,Hong Zhang

Main category: cs.CV

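TL;DR: Proposes a cross-view geo-localisation (CVGL) system that improves feature aggregation and alignment to cope with viewpoint differences. It combines a DINOv2 backbone, a multi-scale channel reallocation module, and an aggregation module with MoE routing, and achieves competitive performance with fewer trained parameters.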

Details Motivation: Large differences between viewpoints make feature aggregation and alignment difficult for conventional methods, which then struggle to match query images to database images. Method: A DINOv2 backbone is fine-tuned with a convolution adapter; a multi-scale channel reallocation module strengthens the diversity and stability of spatial representations; and an improved aggregation module integrates Mixture-of-Experts (MoE) routing, dynamically selecting expert subspaces for the keys and values within a cross-attention framework. Result: Experiments on the University-1652 and SUES-200 datasets show competitive performance with fewer trained parameters. Conclusion: The proposed CVGL system handles cross-view differences effectively and improves both localisation accuracy and model efficiency. Abstract: Cross-view geo-localisation (CVGL) aims to estimate the geographic location of a query image by matching it with images from a large-scale database. However, the significant view-point discrepancies present considerable challenges for effective feature aggregation and alignment. To address these challenges, we propose a novel CVGL system that incorporates three key improvements. Firstly, we leverage the DINOv2 backbone with a convolution adapter fine-tuning to enhance model adaptability to cross-view variations. Secondly, we propose a multi-scale channel reallocation module to strengthen the diversity and stability of spatial representations. Finally, we propose an improved aggregation module that integrates a Mixture-of-Experts (MoE) routing into the feature aggregation process. Specifically, the module dynamically selects expert subspaces for the keys and values in a cross-attention framework, enabling adaptive processing of heterogeneous input domains. Extensive experiments on the University-1652 and SUES-200 datasets demonstrate that our method achieves competitive performance with fewer trained parameters.

[76] Kinematic-Based Assessment of Surgical Actions in Microanastomosis

Yan Meng,Daniel Donoho,Marcelle Altshuler,Omar Arnaout

Main category: cs.CV

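TL;DR: Proposes an AI-driven automated framework for action segmentation and skill assessment in microanastomosis procedures, designed to run efficiently on edge computing platforms.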

Details Motivation: Conventional assessment of microsurgical training depends on subjective expert ratings, which suffer from inter-rater variability, inconsistency, and a heavy time cost; an objective, scalable automated method is needed. Method: The framework has three modules: instrument tip tracking and localization based on YOLO and DeepSORT; action segmentation that uses a self-similarity matrix for action boundary detection together with unsupervised clustering; and a supervised classification module that evaluates surgical gesture proficiency. Result: On a dataset of 58 expert-rated microanastomosis videos, frame-level action segmentation accuracy reaches 92.4% and skill classification accuracy reaches 85.5%. Conclusion: The method can provide objective, real-time feedback and supports a move toward standardized, data-driven microsurgical training and better competency assessment in high-stakes surgical settings. Abstract: Proficiency in microanastomosis is a critical surgical skill in neurosurgery, where the ability to precisely manipulate fine instruments is crucial to successful outcomes. These procedures require sustained attention, coordinated hand movements, and highly refined motor skills, underscoring the need for objective and systematic methods to evaluate and enhance microsurgical training. Conventional assessment approaches typically rely on expert raters supervising the procedures or reviewing surgical videos, which is an inherently subjective process prone to inter-rater variability, inconsistency, and significant time investment. These limitations highlight the necessity for automated and scalable solutions. To address this challenge, we introduce a novel AI-driven framework for automated action segmentation and performance assessment in microanastomosis procedures, designed to operate efficiently on edge computing platforms. The proposed system comprises three main components: (1) an object tip tracking and localization module based on YOLO and DeepSORT; (2) an action segmentation module leveraging self-similarity matrix for action boundary detection and unsupervised clustering; and (3) a supervised classification module designed to evaluate surgical gesture proficiency. Experimental validation on a dataset of 58 expert-rated microanastomosis videos demonstrates the effectiveness of our approach, achieving a frame-level action segmentation accuracy of 92.4% and an overall skill classification accuracy of 85.5% in replicating expert evaluations.
These findings demonstrate the potential of the proposed method to provide objective, real-time feedback in microsurgical education, thereby enabling more standardized, data-driven training protocols and advancing competency assessment in high-stakes surgical environments.
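To illustrate how a self-similarity matrix can expose action boundaries, here is a minimal sketch. It assumes each frame is summarized by a small feature vector (for instance tip kinematics) and flags a boundary wherever consecutive frames are dissimilar. The cosine measure, the threshold, and the toy features are assumptions for illustration; the abstract does not specify the authors' actual procedure.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def self_similarity(frames):
    """Square matrix whose entry (i, j) compares frame i with frame j."""
    return [[cosine(a, b) for b in frames] for a in frames]


def boundaries(ssm, threshold=0.5):
    """Indices where a frame is dissimilar to the frame just before it."""
    return [i + 1 for i in range(len(ssm) - 1) if ssm[i][i + 1] < threshold]


# Toy per-frame features: two frames of one motion, then two of another.
frames = [(1.0, 0.0), (1.0, 0.1), (0.0, 1.0), (0.1, 1.0)]
ssm = self_similarity(frames)
print(boundaries(ssm))
```

Only the first off-diagonal of the matrix is used here; practical systems often slide a checkerboard kernel along the diagonal instead, which is more robust to single noisy frames.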

[77] U-Net-Like Spiking Neural Networks for Single Image Dehazing

Huibin Li,Haoran Liu,Mingzhe Liu,Yulong Xiao,Peng Li,Guibin Zan

Main category: cs.CV

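TL;DR: Proposes DehazeSNN, a dehazing architecture based on spiking neural networks (SNNs) that combines a U-Net-like structure with the OLIFBlock module, matching state-of-the-art dehazing performance at lower computational cost.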

Details Motivation: Traditional dehazing methods rely on atmospheric scattering models. Deep learning methods such as CNNs and Transformers are effective, but CNNs model long-range dependencies poorly and Transformers are computationally expensive, so an efficient, high-performing dehazing model is needed. Method: DehazeSNN pairs a U-Net-like structure with Spiking Neural Networks (SNNs) and introduces the Orthogonal Leaky-Integrate-and-Fire Block (OLIFBlock) to strengthen cross-channel communication, capturing multi-scale features and handling both local and long-range dependencies. Result: Experiments show that DehazeSNN is competitive with state-of-the-art methods on several benchmark datasets and produces high-quality haze-free images with a smaller model and fewer multiply-accumulate operations. Conclusion: By combining SNNs with a U-Net structure, DehazeSNN balances efficiency and performance and offers a low-power option for image dehazing with practical potential. Abstract: Image dehazing is a critical challenge in computer vision, essential for enhancing image clarity in hazy conditions. Traditional methods often rely on atmospheric scattering models, while recent deep learning techniques, specifically Convolutional Neural Networks (CNNs) and Transformers, have improved performance by effectively analyzing image features. However, CNNs struggle with long-range dependencies, and Transformers demand significant computational resources. To address these limitations, we propose DehazeSNN, an innovative architecture that integrates a U-Net-like design with Spiking Neural Networks (SNNs). DehazeSNN captures multi-scale image features while efficiently managing local and long-range dependencies. The introduction of the Orthogonal Leaky-Integrate-and-Fire Block (OLIFBlock) enhances cross-channel communication, resulting in superior dehazing performance with reduced computational burden. Our extensive experiments show that DehazeSNN is highly competitive with state-of-the-art methods on benchmark datasets, delivering high-quality haze-free images with a smaller model size and fewer multiply-accumulate operations. The proposed dehazing method is publicly available at https://github.com/HaoranLiu507/DehazeSNN.

[78] T2VAttack: Adversarial Attack on Text-to-Video Diffusion Models

Changzhen Li,Yuecong Min,Jie Zhang,Zheng Yuan,Shiguang Shan,Xilin Chen

Main category: cs.CV

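TL;DR: Introduces T2VAttack, a systematic study of adversarial attacks on text-to-video diffusion models from both semantic and temporal perspectives, showing that existing models are fragile under small prompt modifications.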

Details Motivation: Text-to-video generation has advanced quickly, but its robustness to adversarial attacks is largely unexplored; its security with respect to semantic consistency and temporal dynamics needs to be evaluated. Method: Two attack objectives are defined (semantic alignment and temporal dynamics) along with two attack methods: T2VAttack-S replaces semantically or temporally critical words through greedy search, and T2VAttack-I iteratively inserts optimized words to attack with minimal perturbation. Result: Replacing or inserting a single word is enough to substantially degrade videos generated by mainstream T2V models such as ModelScope and CogVideoX, in both semantic fidelity and temporal coherence. Conclusion: Current text-to-video diffusion models have serious vulnerabilities to slight prompt tampering, and future work should strengthen their adversarial robustness. Abstract: The rapid evolution of Text-to-Video (T2V) diffusion models has driven remarkable advancements in generating high-quality, temporally coherent videos from natural language descriptions. Despite these achievements, their vulnerability to adversarial attacks remains largely unexplored. In this paper, we introduce T2VAttack, a comprehensive study of adversarial attacks on T2V diffusion models from both semantic and temporal perspectives. Considering the inherently dynamic nature of video data, we propose two distinct attack objectives: a semantic objective to evaluate video-text alignment and a temporal objective to assess the temporal dynamics. To achieve an effective and efficient attack process, we propose two adversarial attack methods: (i) T2VAttack-S, which identifies semantically or temporally critical words in prompts and replaces them with synonyms via greedy search, and (ii) T2VAttack-I, which iteratively inserts optimized words with minimal perturbation to the prompt. By combining these objectives and strategies, we conduct a comprehensive evaluation on the adversarial robustness of several state-of-the-art T2V models, including ModelScope, CogVideoX, Open-Sora, and HunyuanVideo. Our experiments reveal that even minor prompt modifications, such as the substitution or insertion of a single word, can cause substantial degradation in semantic fidelity and temporal dynamics, highlighting critical vulnerabilities in current T2V diffusion models.
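The greedy synonym substitution attributed to T2VAttack-S can be pictured with the sketch below. It is a generic greedy search under stated assumptions, not the paper's algorithm: `greedy_substitute`, the synonym table, and the toy scoring function (a stand-in for a real video-text alignment or temporal objective) are all hypothetical.

```python
def greedy_substitute(words, synonyms, score, max_edits=1):
    """Greedily swap words for synonyms to lower `score` as much as possible.

    words:    list of prompt tokens
    synonyms: dict mapping a word to its candidate replacements
    score:    callable returning a number; lower means a stronger attack
    """
    current = list(words)
    for _ in range(max_edits):
        best_score, best_pos, best_word = score(current), None, None
        for pos, word in enumerate(current):
            for candidate in synonyms.get(word, ()):
                trial = current[:pos] + [candidate] + current[pos + 1:]
                trial_score = score(trial)
                if trial_score < best_score:
                    best_score, best_pos, best_word = trial_score, pos, candidate
        if best_pos is None:
            break  # no single substitution improves the attack further
        current[best_pos] = best_word
    return current


# Toy objective: pretend these words carry the prompt's alignment signal.
IMPORTANCE = {"dog": 2.0, "runs": 1.0}


def toy_score(tokens):
    return sum(IMPORTANCE.get(token, 0.0) for token in tokens)


prompt = ["a", "dog", "runs", "quickly"]
table = {"dog": ["hound"], "runs": ["jogs"]}
attacked = greedy_substitute(prompt, table, toy_score, max_edits=1)
print(attacked)
```

With a single permitted edit the search replaces the word whose removal lowers the toy score the most, which mirrors the single-word perturbation budget highlighted in the abstract.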

[79] DriveExplorer: Images-Only Decoupled 4D Reconstruction with Progressive Restoration for Driving View Extrapolation

Yuang Jia,Jinlong Wang,Jiayi Zhao,Chunlam Li,Shunzhou Wang,Wei Gao

Main category: cs.CV

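TL;DR: Presents a method for view extrapolation in autonomous driving scenes that uses only images and optional camera poses. It fuses static and dynamic point clouds and iteratively refines a deformable 4D Gaussian model with a video diffusion model, improving the quality of images rendered at novel viewpoints.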

Details Motivation: Existing view extrapolation methods depend on priors that require expensive sensors or annotation (LiDAR, 3D boxes, lane lines), which limits practical use; this work aims to reduce that dependence. Method: A global static point cloud and per-frame dynamic point clouds are estimated and fused into a unified representation; a deformable 4D Gaussian framework reconstructs the scene; degraded images rendered by the initial 4D Gaussian model are used to train a video diffusion model; the diffusion model then iteratively refines the Gaussian renderings, and the enhanced results are fed back to optimize 4DGS. Result: Compared with baselines, the method generates higher-quality novel-view images at the target extrapolated viewpoints. Conclusion: The method achieves high-quality view extrapolation without strong priors and is therefore better suited to real-world deployment. Abstract: This paper presents an effective solution for view extrapolation in autonomous driving scenarios. Recent approaches focus on generating shifted novel view images from given viewpoints using diffusion models. However, these methods heavily rely on priors such as LiDAR point clouds, 3D bounding boxes, and lane annotations, which demand expensive sensors or labor-intensive labeling, limiting applicability in real-world deployment. In this work, with only images and optional camera poses, we first estimate a global static point cloud and per-frame dynamic point clouds, fusing them into a unified representation. We then employ a deformable 4D Gaussian framework to reconstruct the scene. The initially trained 4D Gaussian model renders degraded and pseudo-images to train a video diffusion model. Subsequently, progressively shifted Gaussian renderings are iteratively refined by the diffusion model, and the enhanced results are incorporated back as training data for 4DGS. This process continues until extrapolation reaches the target viewpoints. Compared with baselines, our method produces higher-quality images at novel extrapolated viewpoints.

[80] Anomaly detection in satellite imagery through temporal inpainting

Bertrand Rouet-Leduc,Claudia Hulbert

Main category: cs.CV

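TL;DR: Proposes a deep learning method for anomaly detection in satellite image time series that uses an inpainting model to predict the expected surface appearance, improving sensitivity and specificity for sudden surface changes such as earthquake ruptures.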

Details Motivation: Traditional change detection methods struggle to separate atmospheric noise and seasonal variation from real surface change, which limits sensitivity and specificity; a more robust method is needed for disaster response and environmental monitoring. Method: An inpainting model built on the SATLAS foundation model predicts the surface appearance of the latest frame of a Sentinel-2 time series from earlier acquisitions, and anomalies are identified from the discrepancy between prediction and observation. Training uses globally distributed data covering many climate zones and land cover types. Result: For surface ruptures caused by the 2023 Turkey-Syria earthquake, the method shows higher sensitivity and specificity than the temporal median and Reed-Xiaoli detectors, with detection thresholds about three times lower. Conclusion: The method exploits the temporal redundancy of satellite time series to detect subtle surface changes accurately and provides a feasible path to automated, global-scale surface change monitoring. Abstract: Detecting surface changes from satellite imagery is critical for rapid disaster response and environmental monitoring, yet remains challenging due to the complex interplay between atmospheric noise, seasonal variations, and sensor artifacts. Here we show that deep learning can leverage the temporal redundancy of satellite time series to detect anomalies at unprecedented sensitivity, by learning to predict what the surface should look like in the absence of change. We train an inpainting model built upon the SATLAS foundation model to reconstruct the last frame of a Sentinel-2 time series from preceding acquisitions, using globally distributed training data spanning diverse climate zones and land cover types. When applied to regions affected by sudden surface changes, the discrepancy between prediction and observation reveals anomalies that traditional change detection methods miss. We validate our approach on earthquake-triggered surface ruptures from the 2023 Turkey-Syria earthquake sequence, demonstrating detection of a rift feature in Tepehan with higher sensitivity and specificity than temporal median or Reed-Xiaoli anomaly detectors. Our method reaches detection thresholds approximately three times lower than baseline approaches, providing a path towards automated, global-scale monitoring of surface changes from freely available multi-spectral satellite data.
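The core idea of scoring anomalies by the gap between a predicted and an observed frame can be sketched as below. Because the learned inpainting model is not available here, the per-pixel temporal median (one of the baselines named in the abstract) stands in as the predictor. Function names, pixel values, and the threshold are illustrative assumptions.

```python
from statistics import median


def temporal_median(history):
    """Per-pixel median over earlier frames; each frame is a flat list."""
    return [median(pixel_series) for pixel_series in zip(*history)]


def residual_map(predicted, observed):
    """Absolute per-pixel difference between prediction and observation."""
    return [abs(obs - pred) for pred, obs in zip(predicted, observed)]


def flag_anomalies(residual, threshold):
    """Mark pixels whose residual exceeds the detection threshold."""
    return [value > threshold for value in residual]


# Toy two-pixel time series: the second pixel changes abruptly.
history = [[0.10, 0.20], [0.10, 0.25], [0.12, 0.20]]
observed = [0.11, 0.60]

predicted = temporal_median(history)
residual = residual_map(predicted, observed)
print(flag_anomalies(residual, threshold=0.1))
```

Swapping `temporal_median` for a stronger predictor changes only the first step; a predictor that better explains seasonal and atmospheric variation leaves smaller residuals on unchanged pixels, which is what permits a lower detection threshold.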

[81] GCA-ResUNet: Medical Image Segmentation Using Grouped Coordinate Attention

Jun Ding,Shang Gao

Main category: cs.CV

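TL;DR: Proposes GCA-ResUNet, an efficient medical image segmentation framework with a lightweight Grouped Coordinate Attention (GCA) module that models semantic heterogeneity across channels and spatial dependencies, improving segmentation of multiple organs and low-contrast regions while remaining computationally efficient.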

Details Motivation: U-Net style methods have local receptive fields and largely homogeneous attention, so they struggle to model long-range context; Transformers model global context but are computationally expensive, which limits their use in resource-constrained clinical environments. Method: A plug-and-play GCA module models channel context in groups to account for semantic heterogeneity and adds direction-aware coordinate encoding to capture spatial dependencies along the horizontal and vertical axes; it is integrated into a ResUNet architecture. Result: Dice scores of 86.11% on Synapse and 92.64% on ACDC, outperforming a range of CNN and Transformer methods including Swin-UNet and TransUNet, with particular gains on small organs and structures with complex boundaries. Conclusion: GCA-ResUNet strikes a good balance between segmentation accuracy and computational efficiency and is practical and scalable for clinical deployment. Abstract: Accurate segmentation of heterogeneous anatomical structures is pivotal for computer-aided diagnosis and subsequent clinical decision-making. Although U-Net based convolutional neural networks have achieved remarkable progress, their intrinsic locality and largely homogeneous attention formulations often limit the modeling of long-range contextual dependencies, especially in multi-organ scenarios and low-contrast regions. Transformer-based architectures mitigate this issue by leveraging global self-attention, but they usually require higher computational resources and larger training data, which may impede deployment in resource-constrained clinical environments. In this paper, we propose GCA-ResUNet, an efficient medical image segmentation framework equipped with a lightweight and plug-and-play Grouped Coordinate Attention (GCA) module. The proposed GCA decouples channel-wise context modeling into multiple groups to explicitly account for semantic heterogeneity across channels, and integrates direction-aware coordinate encoding to capture structured spatial dependencies along horizontal and vertical axes. This design enhances global representation capability while preserving the efficiency advantages of CNN backbones. Extensive experiments on two widely used benchmarks, Synapse and ACDC, demonstrate that GCA-ResUNet achieves Dice scores of 86.11% and 92.64%, respectively, outperforming a range of representative CNN and Transformer-based methods, including Swin-UNet and TransUNet.
In particular, GCA-ResUNet yields consistent improvements in delineating small anatomical structures with complex boundaries. These results indicate that the proposed approach provides a favorable trade-off between segmentation accuracy and computational efficiency, offering a practical and scalable solution for clinical deployment.

[82] Bridging Structure and Appearance: Topological Features for Robust Self-Supervised Segmentation

Haotang Li,Zhenyu Qi,Hao Qin,Huanrui Yang,Sen He,Kebin Peng

Main category: cs.CV

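TL;DR: Proposes GASeg, a framework that improves self-supervised semantic segmentation by combining appearance with geometric topological information. It introduces a Differentiable Box-Counting (DBC) module and a Topological Augmentation (TopoAug) strategy, uses a multi-objective loss (GALoss) for cross-modal alignment, and reports state-of-the-art results on several benchmarks.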

Details Motivation: Existing self-supervised semantic segmentation methods rely too heavily on unstable appearance features (shadows, glare, local textures) and perform poorly when appearance is ambiguous, so more stable structural information is needed. Method: The GASeg framework includes a Differentiable Box-Counting (DBC) module that extracts multi-scale topological statistics from parallel geometric and appearance streams; Topological Augmentation (TopoAug), an adversarial data augmentation strategy that simulates real-world appearance ambiguity; and a multi-objective loss, GALoss, that enforces cross-modal feature alignment. Result: In extensive experiments on public benchmarks including COCO-Stuff, Cityscapes, and PASCAL, GASeg achieves state-of-the-art performance. Conclusion: Incorporating stable topological structure mitigates the over-reliance on appearance features in self-supervised semantic segmentation, and GASeg improves segmentation performance across datasets. Abstract: Self-supervised semantic segmentation methods often fail when faced with appearance ambiguities. We argue that this is due to an over-reliance on unstable, appearance-based features such as shadows, glare, and local textures. We propose GASeg, a novel framework that bridges appearance and geometry by leveraging stable topological information. The core of our method is the Differentiable Box-Counting (DBC) module, which quantifies multi-scale topological statistics from two parallel streams: geometric-based features and appearance-based features. To force the model to learn these stable structural representations, we introduce Topological Augmentation (TopoAug), an adversarial strategy that simulates real-world ambiguities by applying morphological operators to the input images. A multi-objective loss, GALoss, then explicitly enforces cross-modal alignment between geometric-based and appearance-based features. Extensive experiments demonstrate that GASeg achieves state-of-the-art performance on four benchmarks, including COCO-Stuff, Cityscapes, and PASCAL, validating our approach of bridging geometry and appearance via topological information.
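As background for the box-counting idea, the sketch below implements the classical, non-differentiable box-counting estimate on a binary grid: count the occupied boxes at several box sizes and fit the slope of log(count) against log(1/size). It is not the paper's DBC module, and the function names and test shapes are illustrative assumptions.

```python
import math


def count_boxes(grid, box):
    """Number of box-by-box cells that contain at least one set pixel."""
    n = len(grid)
    occupied = 0
    for top in range(0, n, box):
        for left in range(0, n, box):
            if any(
                grid[r][c]
                for r in range(top, min(top + box, n))
                for c in range(left, min(left + box, n))
            ):
                occupied += 1
    return occupied


def box_counting_dimension(grid, boxes=(1, 2, 4, 8)):
    """Least-squares slope of log(count) versus log(1 / box size).

    Assumes the grid contains at least one set pixel.
    """
    xs = [math.log(1.0 / box) for box in boxes]
    ys = [math.log(count_boxes(grid, box)) for box in boxes]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


filled = [[1] * 8 for _ in range(8)]
diagonal = [[1 if r == c else 0 for c in range(8)] for r in range(8)]
print(round(box_counting_dimension(filled), 3))
print(round(box_counting_dimension(diagonal), 3))
```

A filled square yields a slope near 2 and a straight line a slope near 1, which is the sense in which the statistic summarizes structure rather than appearance.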

[83] Improved 3D Gaussian Splatting of Unknown Spacecraft Structure Using Space Environment Illumination Knowledge

Tae Ha Park,Simone D'Amico

Main category: cs.CV

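TL;DR: Proposes a method that uses prior knowledge of the Sun's position to improve training of 3D Gaussian Splatting models, in order to recover the 3D structure of an unknown target spacecraft from image sequences under changing space illumination and to improve photometric rendering quality for downstream pose estimation.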

Details Motivation: Spaceborne imagery has dynamic lighting, so 3D reconstruction methods that assume static scenes (such as 3DGS) struggle to render photometrically accurate images, which degrades subsequent pose estimation; prior knowledge is needed to improve the model under complex illumination. Method: The Sun's position, provided by the servicer spacecraft, is incorporated into the training of the 3D Gaussian Splatting (3DGS) model to improve the photometric quality of the rendered images under dynamic lighting. Result: Experiments show that 3DGS models trained this way adapt to rapidly changing space illumination and reflect global shadowing and self-occlusion. Conclusion: A Sun-position prior improves how well 3DGS models scenes under non-ideal lighting and supports 3D structure recovery and pose estimation during RPO. Abstract: This work presents a novel pipeline to recover the 3D structure of an unknown target spacecraft from a sequence of images captured during Rendezvous and Proximity Operations (RPO) in space. The target's geometry and appearance are represented as a 3D Gaussian Splatting (3DGS) model. However, learning 3DGS requires static scenes, an assumption in contrast to dynamic lighting conditions encountered in spaceborne imagery. The trained 3DGS model can also be used for camera pose estimation through photometric optimization. Therefore, in addition to recovering a geometrically accurate 3DGS model, the photometric accuracy of the rendered images is imperative to downstream pose estimation tasks during the RPO process. This work proposes to incorporate the prior knowledge of the Sun's position, estimated and maintained by the servicer spacecraft, into the training pipeline for improved photometric quality of 3DGS rasterization. Experimental studies demonstrate the effectiveness of the proposed solution, as 3DGS models trained on a sequence of images learn to adapt to rapidly changing illumination conditions in space and reflect global shadowing and self-occlusion.

[84] Bridging the Perception-Cognition Gap: Re-engineering SAM2 with Hilbert-Mamba for Robust VLM-based Medical Diagnosis

Hao Wu,Hui Li,Yiyun Su

Main category: cs.CV

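TL;DR: Proposes Hilbert-VLM, a two-stage fusion framework that improves vision language models on 3D medical image analysis by combining a redesigned SAM2 with Hilbert space-filling curves, yielding more precise lesion segmentation and disease classification.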

Details Motivation: Existing vision language models have difficulty fusing information from complex 3D multimodal medical images and tend to overlook subtle but critical pathological features, so a method that better preserves spatial locality and enriches the prompt is needed. Method: The Hilbert-VLM framework includes a HilbertMed-SAM module for precise lesion segmentation. The SAM2 architecture is redesigned with a Mamba scanning mechanism based on Hilbert curves, a Hilbert-Mamba Cross-Attention (HMCA) mechanism, and a scale-aware decoder, and a multimodal prompt enhancement module combines segmentation masks with textual attributes to guide VLM inference. Result: On the BraTS2021 segmentation benchmark the model reaches a Dice score of 82.35% and a diagnostic classification accuracy (ACC) of 78.85%. Conclusion: By preserving the spatial locality of 3D medical images and strengthening fine-grained feature extraction, Hilbert-VLM improves the accuracy of lesion segmentation and disease classification and shows potential for medical vision-language analysis. Abstract: Recent studies suggest that Visual Language Models (VLMs) hold great potential for tasks such as automated medical diagnosis. However, processing complex three-dimensional (3D) multimodal medical images poses significant challenges - specifically, the effective integration of complementary information and the occasional oversight of subtle yet critical pathological features. To address these issues, we present a novel two-stage fusion framework termed Hilbert-VLM. This framework leverages the HilbertMed-SAM module for precise lesion segmentation, with the generated multimodal enhanced prompts then guiding the VLM toward accurate disease classification. Our key innovation lies in the systematic redesign of the Segment Anything Model 2 (SAM2) architecture: we incorporate Hilbert space-filling curves into the scanning mechanism of the Mamba State Space Model (SSM) to maximize the preservation of spatial locality in 3D data, a property critical for medical image analysis. We also introduce a novel Hilbert-Mamba Cross-Attention (HMCA) mechanism and a scale-aware decoder to capture fine-grained details. Meanwhile, the prompt enhancement module unifies segmentation masks and their corresponding textual attributes into an information-dense prompt to support VLM inference. Extensive experiments were conducted to validate the effectiveness of the Hilbert-VLM model. On the BraTS2021 segmentation benchmark, it achieves a Dice score of 82.35 percent, with a diagnostic classification accuracy (ACC) of 78.85 percent.
These results demonstrate that the proposed model offers substantial potential to improve the accuracy and reliability of medical VLM-based analysis.
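To show why a Hilbert scan preserves spatial locality, the sketch below uses the classical index-to-coordinate conversion for a 2D Hilbert curve and checks that consecutive scan positions are always neighbouring cells. It is a 2D illustration only; the paper applies the idea to 3D data inside a Mamba scan, which is not reproduced here. The function name `hilbert_d2xy` is a label chosen for this example.

```python
def hilbert_d2xy(n, d):
    """Map index d along a Hilbert curve to (x, y) on an n-by-n grid.

    n must be a power of two and 0 <= d < n * n.
    """
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:
                # Flip the current sub-square before the transpose.
                x, y = s - 1 - x, s - 1 - y
            # Transpose so the sub-curve enters and exits at the right corners.
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


side = 8
scan = [hilbert_d2xy(side, d) for d in range(side * side)]

# Every step moves to a directly adjacent cell (Manhattan distance 1).
steps = [abs(x1 - x0) + abs(y1 - y0) for (x0, y0), (x1, y1) in zip(scan, scan[1:])]
print(scan[:4], max(steps))
```

A row-major raster scan, by contrast, jumps across the whole width of the grid at the end of every row, so tokens that are adjacent in the sequence are not always adjacent in space.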

[85] On Exact Editing of Flow-Based Diffusion Models

Zixiang Li,Yue Song,Jianing Peng,Ting Liu,Jun Huang,Xiaochao Qu,Luoqi Liu,Wei Wang,Yao Zhao,Yunchao Wei

Main category: cs.CV

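TL;DR: Proposes Conditioned Velocity Correction (CVC), a flow-based diffusion editing method whose dual velocity correction mechanism addresses the semantic inconsistency and structural distortion caused by accumulated velocity errors along latent trajectories in existing methods.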

Details Motivation: When existing flow-based diffusion editing methods transform between source and target distributions, errors in the latent velocity field accumulate and cause semantic inconsistency and reduced structural fidelity. Method: The Conditioned Velocity Correction (CVC) framework recasts flow-based editing as a distribution transformation problem driven by a known source prior. It introduces a dual-perspective velocity conversion mechanism, split into a structure-preserving branch and a semantically guided branch, and applies a posterior-consistent update to the conditional velocity field using Empirical Bayes inference and Tweedie correction. Result: CVC yields more stable latent dynamics and shows higher image fidelity, better semantic alignment, and more reliable editing behavior across tasks. Conclusion: Through a mathematically grounded velocity correction, CVC improves the editing performance of flow-based diffusion models without explicit inversion, balancing structure preservation with semantic change. Abstract: Recent methods in flow-based diffusion editing have enabled direct transformations between source and target image distributions without explicit inversion. However, the latent trajectories in these methods often exhibit accumulated velocity errors, leading to semantic inconsistency and loss of structural fidelity. We propose Conditioned Velocity Correction (CVC), a principled framework that reformulates flow-based editing as a distribution transformation problem driven by a known source prior. CVC rethinks the role of velocity in inter-distribution transformation by introducing a dual-perspective velocity conversion mechanism. This mechanism explicitly decomposes the latent evolution into two components: a structure-preserving branch that remains consistent with the source trajectory, and a semantically-guided branch that drives a controlled deviation toward the target distribution. The conditional velocity field exhibits an absolute velocity error relative to the true underlying distribution trajectory, which inherently introduces potential instability and trajectory drift in the latent space. To address this quantifiable deviation and maintain fidelity to the true flow, we apply a posterior-consistent update to the resulting conditional velocity field. This update is derived from Empirical Bayes Inference and Tweedie correction, which ensures a mathematically grounded error compensation over time. Our method yields stable and interpretable latent dynamics, achieving faithful reconstruction alongside smooth local semantic conversion.
Comprehensive experiments demonstrate that CVC consistently achieves superior fidelity, better semantic alignment, and more reliable editing behavior across diverse tasks.

[86] FitControler: Toward Fit-Aware Virtual Try-On

Lu Yang,Yicheng Liu,Yanan Li,Xiang Bai,Hao Lu

Main category: cs.CV

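TL;DR: This paper presents FitControler, a plug-in that integrates into modern virtual try-on (VTON) models to enable controllable generation of garment fit. It also builds Fit4Men, a fit-aware VTON dataset, and proposes two metrics for measuring the fit consistency of generated results.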

Details Motivation: Existing virtual try-on methods focus on realistic rendering of garment details but neglect garment fit, the way a garment aligns with the wearer's body, which is a key factor in the overall style; fit therefore needs to be modeled and controlled explicitly. Method: FitControler comprises a fit-aware layout generator that produces body-garment layouts of different fits from preprocessed garment-agnostic representations, and a multi-scale fit injector that feeds the layout cues into an existing VTON model to enable layout-driven generation. The authors also build Fit4Men, a dataset of 13,000 body-garment pairs, and propose two fit consistency metrics. Result: Experiments show that FitControler works with various VTON models and achieves accurate fit control; the two proposed metrics are used to assess the fit consistency of the generations. Conclusion: Explicitly modeling and controlling garment fit improves the overall visual coordination of virtual try-on and gives future VTON systems an additional design dimension. Abstract: Realistic virtual try-on (VTON) concerns not only faithful rendering of garment details but also coordination of the style. Prior art typically pursues the former, but neglects a key factor that shapes the holistic style: garment fit. Garment fit delineates how a garment aligns with the body of a wearer and is a fundamental element in fashion design. In this work, we introduce fit-aware VTON and present FitControler, a learnable plug-in that can seamlessly integrate into modern VTON models to enable customized fit control. To achieve this, we highlight two challenges: i) how to delineate layouts of different fits and ii) how to render the garment that matches the layout. FitControler first features a fit-aware layout generator to redraw the body-garment layout conditioned on a set of delicately processed garment-agnostic representations, and a multi-scale fit injector is then used to deliver layout cues to enable layout-driven VTON. In particular, we build a fit-aware VTON dataset termed Fit4Men, including 13,000 body-garment pairs of different fits, covering both tops and bottoms, and featuring varying camera distances and body poses. Two fit consistency metrics are also introduced to assess the fitness of generations. Extensive experiments show that FitControler can work with various VTON models and achieve accurate fit control. Code and data will be released.

[87] Structure-Guided Allocation of 2D Gaussians for Image Representation and Compression

Huanxiong Liang,Yunuo Chen,Yicheng Pan,Sixian Wang,Jincheng Dai,Guo Lu,Wenjun Zhang

Main category: cs.CV

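TL;DR: Proposes a structure-guided allocation method for 2D Gaussian Splatting (2DGS) that combines structure-aware initialization, adaptive quantization, and geometry-consistent regularization to substantially improve the rate-distortion performance of 2DGS at low bitrates while keeping millisecond-level decoding.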

Details Motivation: Existing 2DGS methods ignore image structure when allocating representation capacity and quantization precision, which limits their rate-distortion efficiency at low bitrates. Method: 1. Structure-guided initialization: 2D Gaussians are assigned according to structural priors of the image; 2. Adaptive bitwidth quantization: small-scale Gaussians in complex regions receive higher quantization precision; 3. Geometry-consistent regularization: Gaussian orientations are aligned with local gradient directions. Result: Compared with the GSImage baseline, BD-rate is reduced by 43.44% on Kodak and 29.91% on DIV2K, while decoding stays above 1000 FPS. Conclusion: The structure-guided resource allocation strategy improves both the representational power and the compression efficiency of 2DGS, making it suitable for high-performance image compression. Abstract: Recent advances in 2D Gaussian Splatting (2DGS) have demonstrated its potential as a compact image representation with millisecond-level decoding. However, existing 2DGS-based pipelines allocate representation capacity and parameter precision largely oblivious to image structure, limiting their rate-distortion (RD) efficiency at low bitrates. To address this, we propose a structure-guided allocation principle for 2DGS, which explicitly couples image structure with both representation capacity and quantization precision, while preserving native decoding speed. First, we introduce a structure-guided initialization that assigns 2D Gaussians according to spatial structural priors inherent in natural images, yielding a localized and semantically meaningful distribution. Second, during quantization-aware fine-tuning, we propose adaptive bitwidth quantization of covariance parameters, which grants higher precision to small-scale Gaussians in complex regions and lower precision elsewhere, enabling RD-aware optimization, thereby reducing redundancy without degrading edge quality. Third, we impose a geometry-consistent regularization that aligns Gaussian orientations with local gradient directions to better preserve structural details. Extensive experiments demonstrate that our approach substantially improves both the representational power and the RD performance of 2DGS while maintaining over 1000 FPS decoding. Compared with the baseline GSImage, we reduce BD-rate by 43.44% on Kodak and 29.91% on DIV2K.

[88] FUSE-RSVLM: Feature Fusion Vision-Language Model for Remote Sensing

Yunkai Dang,Donghao Wang,Jiacheng Yang,Yifan Jiang,Meiyi Zhu,Yuekun Yang,Cong Wang,Qi Fan,Wenbin Li,Yang Gao

Main category: cs.CV

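TL;DR: This paper proposes MF-RSVLM, a multi-feature fusion vision-language model for remote sensing image understanding. Multi-scale feature extraction and a recurrent visual feature injection mechanism mitigate visual forgetting, and the model achieves state-of-the-art or highly competitive performance on multiple remote sensing tasks.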

Details Motivation: Existing vision-language models struggle with remote sensing images: they often fail to extract fine-grained visual features and suffer from visual forgetting, so a model designed specifically for the remote sensing domain is needed. Method: The proposed MF-RSVLM learns multi-scale visual representations that fuse global context with local details, and uses a recurrent visual feature injection mechanism to keep introducing visual information during language generation. Result: In extensive experiments on remote sensing classification, image captioning, and visual question answering benchmarks, MF-RSVLM achieves state-of-the-art or highly competitive results. Conclusion: MF-RSVLM improves remote sensing image understanding by addressing the insufficient feature extraction and visual forgetting of existing VLMs in this domain. Abstract: Large vision-language models (VLMs) exhibit strong performance across various tasks. However, these VLMs encounter significant challenges when applied to the remote sensing domain due to the inherent differences between remote sensing images and natural images. Existing remote sensing VLMs often fail to extract fine-grained visual features and suffer from visual forgetting during deep language processing. To address this, we introduce MF-RSVLM, a Multi-Feature Fusion Remote Sensing Vision-Language Model that effectively extracts and fuses visual features for RS understanding. MF-RSVLM learns multi-scale visual representations and combines global context with local details, improving the capture of small and complex structures in RS scenes. A recurrent visual feature injection scheme ensures the language model remains grounded in visual evidence and reduces visual forgetting during generation. Extensive experiments on diverse RS benchmarks show that MF-RSVLM achieves state-of-the-art or highly competitive performance across remote sensing classification, image captioning, and VQA tasks. Our code is publicly available at https://github.com/Yunkaidang/RSVLM.

[89] RSAgent: Learning to Reason and Act for Text-Guided Segmentation via Multi-Turn Tool Invocations

Xingqi He,Yujie Zhang,Shuyong Gao,Wenjie Li,Lingyi Hong,Mingxi Chen,Kaixun Jiang,Jiyuan Fu,Wenqiang Zhang

Main category: cs.CV

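TL;DR: This paper proposes RSAgent, a method that interleaves reasoning and action through multi-turn tool invocations to address the difficulty of correcting wrong initial localization in text-guided segmentation.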

Details Motivation: Existing methods predict pixel prompts in a single forward pass, which limits verification, refocusing, and refinement, and they perform poorly when the initial localization is inaccurate. Method: An agent built on a multimodal large language model (RSAgent) interacts with a segmentation toolbox, using visual feedback and historical observations to iteratively update its spatial hypothesis and refine segmentation masks. Training follows a two-stage framework: cold-start supervised fine-tuning followed by agentic reinforcement learning with fine-grained rewards. Result: RSAgent reaches 66.5% gIoU on the ReasonSeg test set in a zero-shot setting, a 9% improvement over Seg-Zero-7B, and 81.5% cIoU on RefCOCOg, which is state-of-the-art on both in-domain and out-of-domain benchmarks. Conclusion: Agentic multi-turn reasoning improves the accuracy and robustness of text-guided segmentation and generalizes well. Abstract: Text-guided object segmentation requires both cross-modal reasoning and pixel grounding abilities. Most recent methods treat text-guided segmentation as one-shot grounding, where the model predicts pixel prompts in a single forward pass to drive an external segmentor, which limits verification, refocusing and refinement when initial localization is wrong. To address this limitation, we propose RSAgent, an agentic Multimodal Large Language Model (MLLM) which interleaves reasoning and action for segmentation via multi-turn tool invocations. RSAgent queries a segmentation toolbox, observes visual feedback, and revises its spatial hypothesis using historical observations to re-localize targets and iteratively refine masks. We further build a data pipeline to synthesize multi-turn reasoning segmentation trajectories, and train RSAgent with a two-stage framework: cold-start supervised fine-tuning followed by agentic reinforcement learning with fine-grained, task-specific rewards. Extensive experiments show that RSAgent achieves a zero-shot performance of 66.5% gIoU on ReasonSeg test, improving over Seg-Zero-7B by 9%, and reaches 81.5% cIoU on RefCOCOg, demonstrating state-of-the-art performance on both in-domain and out-of-domain benchmarks.
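The abstract reports gIoU and cIoU without defining them. The sketch below assumes the convention common in reasoning-segmentation benchmarks: gIoU is the mean of per-image IoUs, and cIoU is the cumulative intersection divided by the cumulative union over the whole set. The function name `giou_ciou` and the choice to score an empty prediction against an empty ground truth as 1.0 are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def giou_ciou(preds, gts):
    """Return (gIoU, cIoU) for paired lists of binary masks.

    Assumed definitions (not stated in the abstract):
      gIoU = mean over images of intersection / union
      cIoU = sum of intersections / sum of unions
    """
    per_image = []
    total_inter = 0
    total_union = 0
    for pred, gt in zip(preds, gts):
        pred = np.asarray(pred).astype(bool)
        gt = np.asarray(gt).astype(bool)
        inter = int(np.logical_and(pred, gt).sum())
        union = int(np.logical_or(pred, gt).sum())
        # Assumption: two empty masks count as a perfect match.
        per_image.append(inter / union if union > 0 else 1.0)
        total_inter += inter
        total_union += union
    giou = float(np.mean(per_image))
    ciou = total_inter / total_union if total_union > 0 else 1.0
    return giou, ciou
```

Because cIoU pools pixels across images, large objects dominate it, whereas gIoU weights every image equally; this is one reason the two numbers can diverge on the same predictions.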

[90] PipeFlow: Pipelined Processing and Motion-Aware Frame Selection for Long-Form Video Editing

Mustafa Munir,Md Mostafijur Rahman,Kartikeya Bhardwaj,Paul Whatmough,Radu Marculescu

Main category: cs.CV

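TL;DR: This paper proposes PipeFlow, a scalable video editing method that skips low-motion frames, parallelizes processing, and uses neural interpolation to edit long videos efficiently. Its editing time grows linearly with video length, and it is substantially faster than existing methods.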

Details Motivation: Long-form video editing is challenging because the computational cost of joint editing and DDIM inversion grows exponentially as sequences get longer. Method: Frames with low motion are skipped based on a motion analysis using SSIM and optical flow; a pipelined scheduling algorithm splits the video into segments and runs DDIM inversion and editing on them in parallel; a neural network-based interpolation smooths the border frames between segments and fills in the skipped frames. Result: PipeFlow's editing time scales linearly with video length, with up to a 9.6X speedup over TokenFlow and a 31.7X speedup over DMT. Conclusion: Through segment-wise processing and resource optimization, PipeFlow removes the computational bottleneck of long-form video editing and, in principle, supports efficient editing of videos of unbounded length. Abstract: Long-form video editing poses unique challenges due to the exponential increase in the computational cost from joint editing and Denoising Diffusion Implicit Models (DDIM) inversion across extended sequences. To address these limitations, we propose PipeFlow, a scalable, pipelined video editing method that introduces three key innovations: First, based on a motion analysis using Structural Similarity Index Measure (SSIM) and Optical Flow, we identify and propose to skip editing of frames with low motion. Second, we propose a pipelined task scheduling algorithm that splits a video into multiple segments and performs DDIM inversion and joint editing in parallel based on available GPU memory. Lastly, we leverage a neural network-based interpolation technique to smooth out the border frames between segments and interpolate the previously skipped frames. Our method uniquely scales to longer videos by dividing them into smaller segments, allowing PipeFlow's editing time to increase linearly with video length. In principle, this enables editing of infinitely long videos without the growing per-frame computational overhead encountered by other methods. PipeFlow achieves up to a 9.6X speedup compared to TokenFlow and a 31.7X speedup over Diffusion Motion Transfer (DMT).
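To make the first step concrete, here is a minimal sketch of SSIM-based frame skipping. It is not PipeFlow's implementation: the abstract combines SSIM with optical flow and gives no thresholds, so this version uses only a single-window (global) SSIM, compares each frame with the last kept frame, and uses a hypothetical threshold of 0.95. The names `global_ssim` and `select_frames` are illustrative.

```python
import numpy as np

def global_ssim(x, y, data_range=255.0):
    """SSIM computed over the whole image as one window (a simplification;
    standard SSIM averages the same formula over local windows)."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )

def select_frames(frames, threshold=0.95):
    """Return indices of frames to edit; frames that are nearly identical
    to the last kept frame are skipped (to be interpolated later)."""
    if len(frames) == 0:
        return []
    keep = [0]  # always edit the first frame
    for i in range(1, len(frames)):
        if global_ssim(frames[keep[-1]], frames[i]) < threshold:
            keep.append(i)
    return keep
```

Comparing against the last kept frame rather than the immediately preceding one is a design choice in this sketch: it prevents slow drift from accumulating unnoticed across many skipped frames.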

[91] Reinforced Diffusion: Learning to Push the Limits of Anisotropic Diffusion for Image Denoising

Xinran Qin,Yuhui Quan,Ruotao Xu,Hui Ji

Main category: cs.CV

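TL;DR: Proposes a trainable anisotropic diffusion framework for image denoising based on reinforcement learning. It adapts to complex image structures, outperforms traditional diffusion methods on several noise types, and is competitive with deep CNN-based methods.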

Details Motivation: Traditional anisotropic diffusion methods use explicit diffusion operators that adapt poorly to complex image structures, so their performance lags behind recent learning-based approaches; combining diffusion with a learning mechanism could improve its adaptivity and performance. Method: Denoising is modeled as a sequence of diffusion actions whose order is learned by deep Q-learning, which yields a stochastic anisotropic diffusion process with strong adaptivity. Result: On three common noise types, the method outperforms existing diffusion-based methods and is competitive with representative deep CNN-based methods. Conclusion: The reinforcement learning-based anisotropic diffusion framework improves on traditional diffusion methods while retaining interpretability and strong adaptivity, making it a promising direction for image denoising. Abstract: Image denoising is an important problem in low-level vision and serves as a critical module for many image recovery tasks. Anisotropic diffusion is a wide family of image denoising approaches with promising performance. However, traditional anisotropic diffusion approaches use explicit diffusion operators which are not well adapted to complex image structures. As a result, their performance is limited compared to recent learning-based approaches. In this work, we describe a trainable anisotropic diffusion framework based on reinforcement learning. By modeling the denoising process as a series of naive diffusion actions with order learned by deep Q-learning, we propose an effective diffusion-based image denoiser. The diffusion actions selected by deep Q-learning at different iterations compose a stochastic anisotropic diffusion process with strong adaptivity to different image structures, which improves on the traditional ones. The proposed denoiser is applied to removing three types of often-seen noise. The experiments show that it outperforms existing diffusion-based methods and competes with the representative deep CNN-based methods.

[92] Neighbor-aware Instance Refining with Noisy Labels for Cross-Modal Retrieval

Yizhi Liu,Ruitao Pu,Shilin Xu,Yingke Chen,Quan-Hui Liu,Yuan Sun

Main category: cs.CV

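TL;DR: Proposes NIRNL, a new robust cross-modal learning framework that handles cross-modal retrieval with noisy labels through cross-modal margin preserving and neighbor-aware instance refining, achieving state-of-the-art performance on multiple benchmark datasets.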

Details Motivation: Annotating multi-modal data is time-consuming and inevitably noisy, and existing robust methods struggle to satisfy model performance, label calibration reliability, and data utilization at the same time, so a more effective noise-robust learning framework is needed. Method: Cross-modal Margin Preserving (CMP) enhances the discrimination between sample pairs, and Neighbor-aware Instance Refining (NIR) uses cross-modal neighborhood consensus to partition the data into pure, hard, and noisy subsets; a tailored optimization strategy is then applied to each subset to raise data utilization and suppress error propagation. Result: Experiments on three benchmark datasets show that NIRNL remains highly robust under high noise rates and achieves state-of-the-art retrieval performance. Conclusion: NIRNL addresses cross-modal retrieval with noisy labels, balancing performance, reliability, and data utilization through fine-grained sample partitioning and tailored optimization strategies. Abstract: In recent years, Cross-Modal Retrieval (CMR) has made significant progress in the field of multi-modal analysis. However, since it is time-consuming and labor-intensive to collect large-scale and well-annotated data, the annotation of multi-modal data inevitably contains some noise. This will degrade the retrieval performance of the model. To tackle the problem, numerous robust CMR methods have been developed, including robust learning paradigms, label calibration strategies, and instance selection mechanisms. Unfortunately, they often fail to simultaneously satisfy model performance ceilings, calibration reliability, and data utilization rate. To overcome the limitations, we propose a novel robust cross-modal learning framework, namely Neighbor-aware Instance Refining with Noisy Labels (NIRNL). Specifically, we first propose Cross-modal Margin Preserving (CMP) to adjust the relative distance between positive and negative pairs, thereby enhancing the discrimination between sample pairs. Then, we propose Neighbor-aware Instance Refining (NIR) to identify pure, hard, and noisy subsets through cross-modal neighborhood consensus. Afterward, we construct different tailored optimization strategies for this fine-grained partitioning, thereby maximizing the utilization of all available data while mitigating error propagation. Extensive experiments on three benchmark datasets demonstrate that NIRNL achieves state-of-the-art performance, exhibiting remarkable robustness, especially under high noise rates.

[93] Pathology Context Recalibration Network for Ocular Disease Recognition

Zunjie Xiao,Xiaoqing Zhang,Risa Higashita,Jiang Liu

Main category: cs.CV

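TL;DR: This paper proposes PCRNet, a network that combines pathology context and expert experience priors to improve both the performance and the decision-making interpretability of ocular disease recognition.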

Details Motivation: Existing deep neural networks for ocular disease recognition ignore clinical pathology context and expert experience priors, which limits their performance and interpretability. Method: The paper designs a Pathology Recalibration Module (PRM) and an Expert Prior Guidance Adapter (EPGA), and introduces an Integrated Loss (IL) to optimize training. Result: On three ocular disease datasets, PCRNet with IL outperforms state-of-the-art attention-based networks and advanced loss methods. Conclusion: PRM and EPGA strengthen the model's focus on key pathology regions, improving recognition accuracy and decision transparency. Abstract: Pathology context and expert experience play significant roles in clinical ocular disease diagnosis. Although deep neural networks (DNNs) achieve good ocular disease recognition results, they often neglect to explore the clinical pathology context and expert experience priors to improve ocular disease recognition performance and decision-making interpretability. To this end, we first develop a novel Pathology Recalibration Module (PRM) to leverage the potential of pathology context prior via the combination of the well-designed pixel-wise context compression operator and pathology distribution concentration operator; we then apply a novel Expert Prior Guidance Adapter (EPGA) to further highlight significant pixel-wise representation regions by fully mining the expert experience prior. By incorporating PRM and EPGA into the modern DNN, the PCRNet is constructed for automated ocular disease recognition. Additionally, we introduce an Integrated Loss (IL) to boost the ocular disease recognition performance of PCRNet by considering the effects of sample-wise loss distributions and training label frequencies. The extensive experiments on three ocular disease datasets demonstrate the superiority of PCRNet with IL over state-of-the-art attention-based networks and advanced loss methods. Further visualization analysis explains the inherent behavior of PRM and EPGA that affects the decision-making process of DNNs.

[94] Balanced Hierarchical Contrastive Learning with Decoupled Queries for Fine-grained Object Detection in Remote Sensing Images

Jingzhou Chen,Dexin Chen,Fengchao Xiong,Yuntao Qian,Liang Xiao

Main category: cs.CV

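TL;DR: Proposes a balanced hierarchical contrastive loss and a decoupled learning strategy to address class imbalance and the interference between classification and localization in fine-grained remote sensing object detection.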

Details Motivation: When handling fine-grained remote sensing data with hierarchical label structures, existing methods overlook the imbalanced data distribution across the label hierarchy and the way learning semantic relationships interferes with localization. Method: A balanced hierarchical contrastive loss introduces learnable class prototypes and equalizes the gradient contributions of the classes at each level; within the DETR framework, a decoupled strategy splits the object queries into a classification set and a localization set. Result: Experiments on three fine-grained datasets with hierarchical annotations show that the method outperforms state-of-the-art approaches. Conclusion: Balancing gradient contributions and decoupling classification from localization improves fine-grained object detection in remote sensing images. Abstract: Fine-grained remote sensing datasets often use hierarchical label structures to differentiate objects in a coarse-to-fine manner, with each object annotated across multiple levels. However, embedding this semantic hierarchy into the representation learning space to improve fine-grained detection performance remains challenging. Previous studies have applied supervised contrastive learning at different hierarchical levels to group objects under the same parent class while distinguishing sibling subcategories. Nevertheless, they overlook two critical issues: (1) imbalanced data distribution across the label hierarchy causes high-frequency classes to dominate the learning process, and (2) learning semantic relationships among categories interferes with class-agnostic localization. To address these issues, we propose a balanced hierarchical contrastive loss combined with a decoupled learning strategy within the detection transformer (DETR) framework. The proposed loss introduces learnable class prototypes and equilibrates gradients contributed by different classes at each hierarchical level, ensuring that each hierarchical class contributes equally to the loss computation in every mini-batch. The decoupled strategy separates DETR's object queries into classification and localization sets, enabling task-specific feature extraction and optimization. Experiments on three fine-grained datasets with hierarchical annotations demonstrate that our method outperforms state-of-the-art approaches.

[95] RainFusion2.0: Temporal-Spatial Awareness and Hardware-Efficient Block-wise Sparse Attention

Aiyue Chen,Yaofu Liu,Junjian Huang,Guang Lian,Yiwu Yao,Wangli Lan,Jing Lin,Zhixin Ma,Tingting Zhou,Harry Yang

Main category: cs.CV

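TL;DR: RainFusion2.0 is an efficient, low-overhead sparse attention mechanism for accelerating video and image generation models. It achieves a 1.5~1.8x end-to-end speedup while preserving quality and supports multiple hardware platforms.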

Details Motivation: Diffusion Transformers are computationally expensive for video and image generation, and existing sparse attention methods suffer from high prediction overhead and poor hardware generality. Method: Block-wise mean values serve as representative tokens for sparse mask prediction, tokens are permuted in a spatiotemporal-aware manner, and a first-frame sink mechanism is introduced for video generation. Result: RainFusion2.0 reaches 80% sparsity with a 1.5~1.8x end-to-end speedup and no loss in video quality, and its effectiveness and generalization are validated across various generative models and hardware platforms. Conclusion: RainFusion2.0 is an online adaptive, hardware-efficient, and low-overhead sparse attention scheme that improves the inference efficiency of generative models and generalizes across hardware. Abstract: In video and image generation tasks, Diffusion Transformer (DiT) models incur extremely high computational costs due to attention mechanisms, which limits their practical applications. Furthermore, with hardware advancements, a wide range of devices besides graphics processing unit (GPU), such as application-specific integrated circuit (ASIC), have been increasingly adopted for model inference. Sparse attention, which leverages the inherent sparsity of attention by skipping computations for insignificant tokens, is an effective approach to mitigate computational costs. However, existing sparse attention methods have two critical limitations: the overhead of sparse pattern prediction and the lack of hardware generality, as most of these methods are designed for GPU. To address these challenges, this study proposes RainFusion2.0, which aims to develop an online adaptive, hardware-efficient, and low-overhead sparse attention mechanism to accelerate both video and image generative models, with robust performance across diverse hardware platforms. Key technical insights include: (1) leveraging block-wise mean values as representative tokens for sparse mask prediction; (2) implementing spatiotemporal-aware token permutation; and (3) introducing a first-frame sink mechanism specifically designed for video generation scenarios. Experimental results demonstrate that RainFusion2.0 can reach 80% sparsity while achieving an end-to-end speedup of 1.5~1.8x without compromising video quality. Moreover, RainFusion2.0 demonstrates effectiveness across various generative models and validates its generalization across diverse hardware platforms.
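The first insight, using block-wise means as representative tokens, can be sketched as follows. This is an illustrative reading of the abstract, not RainFusion2.0's kernel: the top-k selection rule, the `keep_ratio` parameter, the scaling by the square root of the head dimension, and the function name `block_sparse_mask` are all assumptions, and the token permutation and first-frame sink are left out.

```python
import numpy as np

def block_sparse_mask(q, k, block_size, keep_ratio=0.2):
    """Predict a block-level attention mask from block-mean tokens.

    q, k: arrays of shape (n_tokens, head_dim), n_tokens divisible by block_size.
    Returns a boolean (n_blocks, n_blocks) mask; True means "compute this block".
    """
    n, d = q.shape
    if k.shape != (n, d) or n % block_size != 0:
        raise ValueError("q and k must share a shape divisible into blocks")
    n_blocks = n // block_size
    # One representative token per block: the mean of its tokens.
    q_rep = q.reshape(n_blocks, block_size, d).mean(axis=1)
    k_rep = k.reshape(n_blocks, block_size, d).mean(axis=1)
    # Cheap proxy for block importance: (n_blocks x n_blocks) scores
    # instead of the full (n x n) attention matrix.
    scores = q_rep @ k_rep.T / np.sqrt(d)
    # Assumed rule: each query block keeps its highest-scoring key blocks.
    n_keep = max(1, int(round(keep_ratio * n_blocks)))
    top = np.argsort(-scores, axis=1)[:, :n_keep]
    mask = np.zeros((n_blocks, n_blocks), dtype=bool)
    np.put_along_axis(mask, top, True, axis=1)
    return mask
```

With `keep_ratio=0.2` the mask drops 80% of the blocks, which matches the sparsity level quoted in the abstract; the prediction itself costs only a matrix product between the block means, which is where the low overhead would come from under these assumptions.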

[96] Factorized Learning for Temporally Grounded Video-Language Models

Wenzheng Zeng,Difei Gao,Mike Zheng Shou,Hwee Tou Ng

Main category: cs.CV

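TL;DR: This paper proposes the D$^2$VLM framework and the factorized preference optimization (FPO) algorithm, which decouple temporal grounding from textual response in video understanding to improve the accuracy of event-level perception.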

Details Motivation: Existing video-language models usually handle temporal grounding and textual response in a coupled manner without a clear logical structure, which leads to sub-optimal results. Method: The D$^2$VLM framework adopts a "grounding then answering" paradigm and introduces evidence tokens to capture event-level semantics; the FPO algorithm incorporates temporal grounding modeling into the optimization objective. A synthetic dataset is also constructed for factorized preference learning. Result: Experiments on various tasks show a clear advantage over existing methods in both temporal grounding and textual response. Conclusion: Factorized, decoupled learning with explicit modeling of temporal evidence enables more reliable video question answering and event understanding. Abstract: Recent video-language models have shown great potential for video understanding, but still struggle with accurate temporal grounding for event-level perception. We observe that two main factors in video understanding (i.e., temporal grounding and textual response) form a logical hierarchy: accurate temporal evidence grounding lays the foundation for reliable textual response. However, existing works typically handle these two tasks in a coupled manner without a clear logical structure, leading to sub-optimal objectives. We address this from a factorized learning perspective. We first propose D$^2$VLM, a framework that decouples the learning of these two tasks while also emphasizing their inherent dependency. We adopt a "grounding then answering with evidence referencing" paradigm and introduce evidence tokens for evidence grounding, which emphasize event-level visual semantic capture beyond the focus on timestamp representation in existing works. To further facilitate the learning of these two tasks, we introduce a novel factorized preference optimization (FPO) algorithm. Unlike standard preference optimization, FPO explicitly incorporates probabilistic temporal grounding modeling into the optimization objective, enabling preference learning for both temporal grounding and textual response. We also construct a synthetic dataset to address the lack of suitable datasets for factorized preference learning with explicit temporal grounding. Experiments on various tasks demonstrate the clear advantage of our approach. Our source code is available at https://github.com/nusnlp/d2vlm.

[97] Think Before You Move: Latent Motion Reasoning for Text-to-Motion Generation

Yijie Qian,Juncheng Wang,Yuxiang Feng,Chao Xu,Wang Lu,Yang Liu,Baigui Sun,Yiqiang Chen,Yong Liu,Shujun Wang

Main category: cs.CV

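TL;DR: This paper proposes Latent Motion Reasoning (LMR), a new text-to-motion generation framework that introduces a two-stage "think-then-act" mechanism to resolve the semantic-kinematic impedance mismatch between linguistic semantics and kinematic data.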

Details Motivation: Existing text-to-motion methods face a fundamental mismatch between semantically dense language representations and high-frequency motion data when handling complex semantics, which limits the semantic alignment and physical plausibility of generated motion. Method: Inspired by hierarchical motor control in cognitive science, the Latent Motion Reasoning (LMR) framework uses a dual-granularity tokenizer that separates motion into a semantically rich reasoning latent for global planning and a high-frequency execution latent for restoring detail, giving a two-stage reason-then-execute generation process. Result: LMR is implemented on two baselines, T2M-GPT and MotionStreamer, and experiments show non-trivial improvements in both semantic alignment and physical plausibility. Conclusion: The best planning space for motion generation is not natural language itself but a learned, motion-aligned concept space, which supports introducing System 2 reasoning structure into T2M. Abstract: Current state-of-the-art paradigms predominantly treat Text-to-Motion (T2M) generation as a direct translation problem, mapping symbolic language directly to continuous poses. While effective for simple actions, this System 1 approach faces a fundamental theoretical bottleneck we identify as the Semantic-Kinematic Impedance Mismatch: the inherent difficulty of grounding semantically dense, discrete linguistic intent into kinematically dense, high-frequency motion data in a single shot. In this paper, we argue that the solution lies in an architectural shift towards Latent System 2 Reasoning. Drawing inspiration from Hierarchical Motor Control in cognitive science, we propose Latent Motion Reasoning (LMR), which reformulates generation as a two-stage Think-then-Act decision process. Central to LMR is a novel Dual-Granularity Tokenizer that disentangles motion into two distinct manifolds: a compressed, semantically rich Reasoning Latent for planning global topology, and a high-frequency Execution Latent for preserving physical fidelity. By forcing the model to autoregressively reason (plan the coarse trajectory) before it moves (instantiates the frames), we effectively bridge the ineffability gap between language and physics. We demonstrate LMR's versatility by implementing it for two representative baselines: T2M-GPT (discrete) and MotionStreamer (continuous). Extensive experiments show that LMR yields non-trivial improvements in both semantic alignment and physical plausibility, validating that the optimal substrate for motion planning is not natural language, but a learned, motion-aligned concept space. Codes and demos can be found at https://chenhaoqcdyq.github.io/LMR/

[98] Guided Diffusion-based Generation of Adversarial Objects for Real-World Monocular Depth Estimation Attacks

Yongtao Chen,Yanbo Wang,Wentao Zhao,Guole Shen,Tianchen Deng,Jingchuan Wang

Main category: cs.CV

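TL;DR: Proposes a training-free generative adversarial attack framework that uses a diffusion model to generate naturalistic, scene-consistent adversarial objects, which disrupt monocular depth estimation more effectively and strengthen safety assessment for autonomous driving.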

Details Motivation: Existing physical attacks based on texture patches face placement constraints and limited realism in real driving environments, which reduces their effectiveness; a more natural and practical attack is needed to assess the robustness of monocular depth estimation systems. Method: The training-free attack framework includes a Salient Region Selection module and a Jacobian Vector Product Guidance mechanism; it uses the conditional generation process of a diffusion model to produce scene-consistent adversarial objects and steers the adversarial gradients to maximize their effect on depth estimation. Result: In digital and physical experiments, the method significantly outperforms existing attacks in effectiveness, stealthiness, and physical deployability. Conclusion: The method generates highly realistic adversarial objects that disrupt monocular depth estimation, providing a strong tool for assessing the safety of autonomous driving systems. Abstract: Monocular Depth Estimation (MDE) serves as a core perception module in autonomous driving systems, but it remains highly susceptible to adversarial attacks. Errors in depth estimation may propagate through downstream decision making and influence overall traffic safety. Existing physical attacks primarily rely on texture-based patches, which impose strict placement constraints and exhibit limited realism, thereby reducing their effectiveness in complex driving environments. To overcome these limitations, this work introduces a training-free generative adversarial attack framework that generates naturalistic, scene-consistent adversarial objects via a diffusion-based conditional generation process. The framework incorporates a Salient Region Selection module that identifies regions most influential to MDE and a Jacobian Vector Product Guidance mechanism that steers adversarial gradients toward update directions supported by the pre-trained diffusion model. This formulation enables the generation of physically plausible adversarial objects capable of inducing substantial adversarial depth shifts. Extensive digital and physical experiments demonstrate that our method significantly outperforms existing attacks in effectiveness, stealthiness, and physical deployability, underscoring its strong practical implications for autonomous driving safety assessment.

[99] GeoBench: Rethinking Multimodal Geometric Problem-Solving via Hierarchical Evaluation

Yuan Feng,Yue Yang,Xiaohan He,Jiatong Zhao,Jianlong Chen,Zijun Chen,Daocheng Fu,Qi Liu,Renqiu Xia,Bo Zhang,Junchi Yan

Main category: cs.CV

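TL;DR: Proposes GeoBench, a hierarchical benchmark for geometric problem solving with four reasoning levels. Six formally verified tasks systematically evaluate the geometric reasoning of vision-language models, revealing that sub-goal decomposition and irrelevant premise filtering critically affect accuracy and that Chain-of-Thought prompting unexpectedly degrades performance on some tasks.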

Details Motivation: Existing evaluations of geometric reasoning suffer from test data contamination, overemphasis on final answers at the expense of the reasoning process, and insufficient diagnostic granularity, so a more reliable and fine-grained benchmark is needed. Method: GeoBench is a hierarchical benchmark with four reasoning levels: Visual Perception, Goal-Oriented Planning, Rigorous Theorem Application, and Self-Reflective Backtracking. Six formally verified tasks generated via TrustGeoGen systematically assess capabilities ranging from attribute extraction to logical error correction. Result: Reasoning models such as OpenAI-o3 outperform general MLLMs, but performance declines significantly as task complexity increases; sub-goal decomposition and irrelevant premise filtering critically influence final accuracy, while Chain-of-Thought prompting degrades performance on some tasks. Conclusion: GeoBench is a comprehensive benchmark for geometric reasoning that offers actionable guidelines for developing geometric problem-solving systems. Abstract: Geometric problem solving constitutes a critical branch of mathematical reasoning, requiring precise analysis of shapes and spatial relationships. Current evaluations of geometric reasoning in vision-language models (VLMs) face limitations, including the risk of test data contamination from textbook-based benchmarks, overemphasis on final answers over reasoning processes, and insufficient diagnostic granularity. To address these issues, we present GeoBench, a hierarchical benchmark featuring four reasoning levels in geometric problem-solving: Visual Perception, Goal-Oriented Planning, Rigorous Theorem Application, and Self-Reflective Backtracking. Through six formally verified tasks generated via TrustGeoGen, we systematically assess capabilities ranging from attribute extraction to logical error correction. Experiments reveal that while reasoning models like OpenAI-o3 outperform general MLLMs, performance declines significantly with increasing task complexity. Key findings demonstrate that sub-goal decomposition and irrelevant premise filtering critically influence final problem-solving accuracy, whereas Chain-of-Thought prompting unexpectedly degrades performance in some tasks. These findings establish GeoBench as a comprehensive benchmark while offering actionable guidelines for developing geometric problem-solving systems.

[100] Enhancing LLM-Based Neural Network Generation: Few-Shot Prompting and Efficient Validation for Automated Architecture Design

Chandini Vysyaraju,Raghuvir Duvvuri,Avi Goyal,Dmitry Ignatov,Radu Timofte

Main category: cs.CV

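TL;DR: This study introduces and validates two methods for LLM-based generation of neural network architectures for computer vision, Few-Shot Architecture Prompting (FSAP) and Whitespace-Normalized Hash Validation, which improve generation efficiency and evaluation rigor.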

Details Motivation: Neural Architecture Search (NAS) is computationally expensive, and large language models (LLMs) offer a new opportunity, but their systematic application in computer vision, particularly regarding prompt engineering and validation strategies, remains understudied. Method: Building on the NNGPT/LEMUR framework, the paper proposes FSAP, systematically studies how the number of supporting examples (n = 1 to 6) affects architecture generation, and introduces Whitespace-Normalized Hash Validation for fast deduplication. It generates 1,900 unique architectures across seven vision benchmarks and uses a dataset-balanced evaluation methodology. Result: n = 3 examples gives the best balance between architectural diversity and context focus; the hash validation takes less than 1 ms and is 100x faster than AST parsing, preventing redundant training of duplicates; the large-scale experiments confirm the effectiveness and scalability of the approach. Conclusion: The work provides practical guidelines and rigorous evaluation practices for LLM-driven architecture search in computer vision and lowers the computational requirements, making automated model design accessible to more researchers. Abstract: Automated neural network architecture design remains a significant challenge in computer vision. Task diversity and computational constraints require both effective architectures and efficient search methods. Large Language Models (LLMs) present a promising alternative to computationally intensive Neural Architecture Search (NAS), but their application to architecture generation in computer vision has not been systematically studied, particularly regarding prompt engineering and validation strategies. Building on the task-agnostic NNGPT/LEMUR framework, this work introduces and validates two key contributions for computer vision. First, we present Few-Shot Architecture Prompting (FSAP), the first systematic study of the number of supporting examples (n = 1, 2, 3, 4, 5, 6) for LLM-based architecture generation. We find that using n = 3 examples best balances architectural diversity and context focus for vision tasks. Second, we introduce Whitespace-Normalized Hash Validation, a lightweight deduplication method (less than 1 ms) that provides a 100x speedup over AST parsing and prevents redundant training of duplicate computer vision architectures. In large-scale experiments across seven computer vision benchmarks (MNIST, CIFAR-10, CIFAR-100, CelebA, ImageNette, SVHN, Places365), we generated 1,900 unique architectures. We also introduce a dataset-balanced evaluation methodology to address the challenge of comparing architectures across heterogeneous vision tasks. These contributions provide actionable guidelines for LLM-based architecture search in computer vision and establish rigorous evaluation practices, making automated design more accessible to researchers with limited computational resources.
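A whitespace-normalized hash check can be sketched in a few lines. The abstract does not give the normalization rule or the hash function, so this version assumes that every run of whitespace collapses to a single space and that SHA-256 is used; `whitespace_normalized_hash` and `ArchitectureDeduplicator` are hypothetical names.

```python
import hashlib
import re

def whitespace_normalized_hash(source: str) -> str:
    """Hash generated model code after collapsing all whitespace runs.

    Assumptions: the normalization rule and SHA-256 are illustrative choices.
    """
    normalized = re.sub(r"\s+", " ", source).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

class ArchitectureDeduplicator:
    """Remembers the hashes of architectures that were already accepted."""

    def __init__(self):
        self._seen = set()

    def is_new(self, source: str) -> bool:
        """Return True the first time a (normalized) source is seen."""
        digest = whitespace_normalized_hash(source)
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True
```

Usage would be to call `is_new(code)` on each generated architecture before training and skip it when the result is False. One caveat of this aggressive normalization is that it also erases indentation, so two Python programs that differ only in block structure would collide; a stricter variant could normalize only trailing whitespace and blank lines.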

[101] Taming Preference Mode Collapse via Directional Decoupling Alignment in Diffusion Reinforcement Learning

Chubin Chen,Sujie Hu,Jiashu Zhu,Meiqi Wu,Jintao Chen,Yanxun Li,Nisha Huang,Chengyu Fang,Jiahong Wu,Xiangxiang Chu,Xiu Li

Main category: cs.CV

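TL;DR: This paper proposes D²-Align, a new method that mitigates Preference Mode Collapse (PMC) in text-to-image diffusion models trained with reinforcement learning from human feedback, preserving generative diversity through directional decoupling alignment.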

Details Motivation: Existing reinforcement learning from human feedback methods score well on automated reward metrics but tend to produce outputs that lack diversity, a phenomenon called Preference Mode Collapse (PMC); a way to quantify and mitigate it is needed. Method: The paper proposes the DivGenBench benchmark to quantify PMC and designs the D²-Align framework: with the reward model frozen, it first learns a directional correction in the reward model's embedding space and then applies that correction to the reward signal during optimization, preventing the model from collapsing into specific modes. Result: The evaluation, combining qualitative analysis with quantitative metrics for both quality and diversity, shows that D²-Align maintains diversity and achieves superior alignment with human preference. Conclusion: D²-Align mitigates preference mode collapse, enabling text-to-image generation that is more stable, more diverse, and aligned with human preference. Abstract: Recent studies have demonstrated significant progress in aligning text-to-image diffusion models with human preference via Reinforcement Learning from Human Feedback. However, while existing methods achieve high scores on automated reward metrics, they often lead to Preference Mode Collapse (PMC), a specific form of reward hacking where models converge on narrow, high-scoring outputs (e.g., images with monolithic styles or pervasive overexposure), severely degrading generative diversity. In this work, we introduce and quantify this phenomenon, proposing DivGenBench, a novel benchmark designed to measure the extent of PMC. We posit that this collapse is driven by over-optimization along the reward model's inherent biases. Building on this analysis, we propose Directional Decoupling Alignment (D$^2$-Align), a novel framework that mitigates PMC by directionally correcting the reward signal. Specifically, our method first learns a directional correction within the reward model's embedding space while keeping the model frozen. This correction is then applied to the reward signal during the optimization process, preventing the model from collapsing into specific modes and thereby maintaining diversity. Our comprehensive evaluation, combining qualitative analysis with quantitative metrics for both quality and diversity, reveals that D$^2$-Align achieves superior alignment with human preference.

[102] Towards Open-Vocabulary Industrial Defect Understanding with a Large-Scale Multimodal Dataset

TsaiChing Ni,ZhenQi Chen,YuanFu Yang

Main category: cs.CV

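TL;DR: This paper presents IMDD-1M, the first large-scale industrial multimodal defect dataset, with 1,000,000 image-text pairs, and uses it to train a diffusion-based vision-language foundation model for industrial scenarios, enabling data-efficient and scalable industrial inspection and generation.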

Details Motivation: Existing industrial defect detection datasets fall short in scale, diversity, and multimodal annotation, which limits the use of multimodal learning in manufacturing quality inspection. A large-scale, finely annotated industrial multimodal defect dataset is therefore needed. Method: The authors build IMDD-1M, containing one million high-resolution real-world defect images paired with fine-grained text descriptions, and train a diffusion-based vision-language foundation model from scratch that supports lightweight fine-tuning for specific tasks. Result: Using less than 5% of the task-specific data required by dedicated expert models, the proposed model achieves comparable performance and does well across classification, segmentation, retrieval, generation, and other tasks. Conclusion: IMDD-1M is an important resource for industrial multimodal learning, and the proposed foundation model shows the potential for data-efficient adaptation and scalable intelligence in industrial quality inspection. Abstract: We present IMDD-1M, the first large-scale Industrial Multimodal Defect Dataset comprising 1,000,000 aligned image-text pairs, designed to advance multimodal learning for manufacturing and quality inspection. IMDD-1M contains high-resolution real-world defects spanning over 60 material categories and more than 400 defect types, each accompanied by expert-verified annotations and fine-grained textual descriptions detailing defect location, severity, and contextual attributes. This dataset enables a wide spectrum of applications, including classification, segmentation, retrieval, captioning, and generative modeling. Building upon IMDD-1M, we train a diffusion-based vision-language foundation model from scratch, specifically tailored for industrial scenarios. The model serves as a generalizable foundation that can be efficiently adapted to specialized domains through lightweight fine-tuning. With less than 5% of the task-specific data required by dedicated expert models, it achieves comparable performance, highlighting the potential of data-efficient foundation model adaptation for industrial inspection and generation, paving the way for scalable, domain-adaptive, and knowledge-grounded manufacturing intelligence.

[103] Bayesian Self-Distillation for Image Classification

Anton Adelöw,Matteo Gamba,Atsuto Maki

Main category: cs.CV

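TL;DR: Proposes Bayesian Self-Distillation (BSD), a self-distillation method based on Bayesian inference that does not rely on hard targets after initialization and improves accuracy, calibration, and robustness.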

Details Motivation: Conventional supervised training relies on hard targets, which makes models overconfident; existing self-distillation methods still depend on hard targets, which limits their effectiveness. Method: BSD uses Bayesian inference over the model's own predictions to construct sample-specific target distributions, giving self-distillation that does not depend on hard targets. Result: Across a range of architectures and datasets, BSD improves accuracy over existing methods (e.g., +1.4% for ResNet-50 on CIFAR-100), lowers calibration error (ECE reduced by 40%), and is more robust to data corruptions and perturbations. Combined with a contrastive loss, it reaches state-of-the-art robustness under label noise. Conclusion: BSD is a principled and effective self-distillation method that removes the dependence on hard targets and improves model performance across the board. Abstract: Supervised training of deep neural networks for classification typically relies on hard targets, which promote overconfidence and can limit calibration, generalization, and robustness. Self-distillation methods aim to mitigate this by leveraging inter-class and sample-specific information present in the model's own predictions, but often remain dependent on hard targets, reducing their effectiveness. With this in mind, we propose Bayesian Self-Distillation (BSD), a principled method for constructing sample-specific target distributions via Bayesian inference using the model's own predictions. Unlike existing approaches, BSD does not rely on hard targets after initialization. BSD consistently yields higher test accuracy (e.g. +1.4% for ResNet-50 on CIFAR-100) and significantly lower Expected Calibration Error (ECE) (-40% ResNet-50, CIFAR-100) than existing architecture-preserving self-distillation methods for a range of deep architectures and datasets. Additional benefits include improved robustness against data corruptions, perturbations, and label noise. When combined with a contrastive loss, BSD achieves state-of-the-art robustness under label noise for single-stage, single-network methods.
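
The Expected Calibration Error reported for BSD is a standard metric, and a minimal sketch of how it is commonly computed may help when reading the numbers. The equal-width binning and the default of 10 bins are assumptions here; the abstract does not state the binning used.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sum over bins of (fraction of samples) * |accuracy - mean confidence|.

    confidences: top-class probability per sample, in [0, 1].
    correct:     1 if the top-class prediction was right, else 0.
    n_bins:      number of equal-width bins (an assumed, commonly used default).
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if lo == 0.0:
            # Put a confidence of exactly 0 into the first bin.
            in_bin |= confidences == 0.0
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return float(ece)
```

A relative reduction such as the reported -40% would then be `(ece_new - ece_old) / ece_old` for two models evaluated on the same test set.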

[104] DiffThinker: Towards Generative Multimodal Reasoning with Diffusion Models

Zefeng He,Xiaoye Qu,Yafu Li,Tong Zhu,Siyuan Huang,Yu Cheng

Main category: cs.CV

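TL;DR: This paper proposes DiffThinker, a new generative multimodal reasoning paradigm that recasts multimodal reasoning as an image-to-image generation task and substantially improves performance on complex, long-horizon, vision-centric tasks.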

Details Motivation: Existing Multimodal Large Language Models (MLLMs) rely mainly on textual reasoning and perform poorly on complex vision-dominated tasks, so a new paradigm closer to the nature of visual reasoning is needed. Method: DiffThinker adopts a diffusion-based image generation framework and treats reasoning as an image-to-image generation task, giving end-to-end visual reasoning with efficiency, controllability, native parallelism, and collaboration. Result: Experiments in four domains (sequential planning, combinatorial optimization, constraint satisfaction, and spatial configuration) show that DiffThinker significantly outperforms GPT-5 (+314.2%), Gemini-3-Flash (+111.6%), and the fine-tuned Qwen3-VL-32B baseline (+39.0%). Conclusion: Generative multimodal reasoning is a promising approach to vision-centric reasoning, and DiffThinker demonstrates its advantages in logical consistency and spatial precision. Abstract: While recent Multimodal Large Language Models (MLLMs) have made significant strides in multimodal reasoning, their reasoning processes remain predominantly text-centric, leading to suboptimal performance in complex long-horizon, vision-centric tasks. In this paper, we establish a novel Generative Multimodal Reasoning paradigm and introduce DiffThinker, a diffusion-based reasoning framework. Conceptually, DiffThinker reformulates multimodal reasoning as a native generative image-to-image task, achieving superior logical consistency and spatial precision in vision-centric tasks. We perform a systematic comparison between DiffThinker and MLLMs, providing the first in-depth investigation into the intrinsic characteristics of this paradigm, revealing four core properties: efficiency, controllability, native parallelism, and collaboration. Extensive experiments across four domains (sequential planning, combinatorial optimization, constraint satisfaction, and spatial configuration) demonstrate that DiffThinker significantly outperforms leading closed-source models including GPT-5 (+314.2%) and Gemini-3-Flash (+111.6%), as well as the fine-tuned Qwen3-VL-32B baseline (+39.0%), highlighting generative multimodal reasoning as a promising approach for vision-centric reasoning.

[105] Deep Global Clustering for Hyperspectral Image Segmentation: Concepts, Applications, and Open Challenges

Yu-Tang Chang,Pin-Wei Chen,Shih-Fang Chen

Main category: cs.CV

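TL;DR: Proposes Deep Global Clustering (DGC), an unsupervised framework for efficient hyperspectral image segmentation that trains quickly on consumer hardware with low memory usage, but suffers from optimization instability when balancing its multi-objective loss.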

Details Motivation: Foundation models pre-trained on remote sensing data transfer poorly to specific domains such as close-range agricultural monitoring, because spectral signatures, spatial scales, and semantic targets differ substantially, and the sheer volume of hyperspectral data creates compute and memory bottlenecks. Method: DGC learns a global clustering from local overlapping image patches, uses the overlapping regions to enforce consistency, and needs no pre-training, giving constant-memory, efficient training. Result: On a leaf disease dataset, DGC achieves high-quality background-tissue separation (mean IoU 0.925) and shows navigable semantic granularity that supports unsupervised disease detection, but cluster over-merging in feature space degrades the learned representations. Conclusion: The DGC framework is conceptually promising, especially for resource-constrained settings, but its practicality depends on a more principled solution to dynamic multi-objective loss balancing; the present work serves as intellectual scaffolding for follow-up research. Abstract: Hyperspectral imaging (HSI) analysis faces computational bottlenecks due to massive data volumes that exceed available memory. While foundation models pre-trained on large remote sensing datasets show promise, their learned representations often fail to transfer to domain-specific applications like close-range agricultural monitoring where spectral signatures, spatial scales, and semantic targets differ fundamentally. This report presents Deep Global Clustering (DGC), a conceptual framework for memory-efficient HSI segmentation that learns global clustering structure from local patch observations without pre-training. DGC operates on small patches with overlapping regions to enforce consistency, enabling training in under 30 minutes on consumer hardware while maintaining constant memory usage. On a leaf disease dataset, DGC achieves background-tissue separation (mean IoU 0.925) and demonstrates unsupervised disease detection through navigable semantic granularity. However, the framework suffers from optimization instability rooted in multi-objective loss balancing: meaningful representations emerge rapidly but degrade due to cluster over-merging in feature space. We position this work as intellectual scaffolding: the design philosophy has merit, but stable implementation requires principled approaches to dynamic loss balancing. Code and data are available at https://github.com/b05611038/HSI_global_clustering.
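
For readers less familiar with the mean IoU figure quoted for DGC, here is a minimal sketch of the metric. It assumes cluster indices have already been matched to ground-truth class indices, a step an unsupervised method needs but which the abstract does not describe.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes present in pred or target.

    pred, target: integer label maps of the same shape.
    Assumption: predicted cluster ids are already aligned with target class ids.
    """
    pred = np.asarray(pred)
    target = np.asarray(target)
    ious = []
    for c in range(num_classes):
        p = pred == c
        t = target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class absent from both maps; skip it
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious))
```

For a two-class background-tissue split, `num_classes` would be 2 and the result is the average of the background IoU and the tissue IoU.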

[106] Guiding a Diffusion Transformer with the Internal Dynamics of Itself

Xingyu Zhou,Qifan Li,Xiaobin Hu,Hai Chen,Shuhang Gu

Main category: cs.CV

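TL;DR: Proposes Internal Guidance (IG), a simple yet effective strategy that adds auxiliary supervision on an intermediate layer during training and extrapolates the intermediate and deep layer outputs during sampling, significantly improving the training efficiency and generation quality of diffusion models.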

Details Motivation: Existing diffusion models generate poorly in low-probability regions; standard classifier-free guidance (CFG) tends to produce over-simplified or distorted samples, and methods that guide a model with a "bad version" of itself depend on carefully designed degradations, extra training, and additional sampling steps. Method: IG adds auxiliary supervision on an intermediate layer during training and, during sampling, extrapolates the outputs of the intermediate and deep layers to obtain the result. Result: On ImageNet 256x256, SiT-XL/2+IG reaches FID=5.31 at 80 epochs and FID=1.75 at 800 epochs; LightningDiT-XL/1+IG reaches FID=1.34, improving to FID=1.19 when combined with CFG, the current state of the art. Conclusion: IG is simple and efficient, needs no extra network or elaborate degradation design, significantly improves generation quality and training efficiency, and is an effective alternative to existing guidance methods for diffusion models. Abstract: The diffusion model presents a powerful ability to capture the entire (conditional) data distribution. However, due to the lack of sufficient training and data to learn to cover low-probability areas, the model will be penalized for failing to generate high-quality images corresponding to these areas. To achieve better generation quality, guidance strategies such as classifier-free guidance (CFG) can guide the samples to the high-probability areas during the sampling stage. However, the standard CFG often leads to over-simplified or distorted samples. On the other hand, the alternative line of guiding a diffusion model with its bad version is limited by carefully designed degradation strategies, extra training and additional sampling steps. In this paper, we propose a simple yet effective strategy, Internal Guidance (IG), which introduces auxiliary supervision on the intermediate layer during the training process and extrapolates the intermediate and deep layers' outputs to obtain generative results during the sampling process. This simple strategy yields significant improvements in both training efficiency and generation quality on various baselines. On ImageNet 256x256, SiT-XL/2+IG achieves FID=5.31 and FID=1.75 at 80 and 800 epochs. More impressively, LightningDiT-XL/1+IG achieves FID=1.34, outperforming all of these methods by a large margin. Combined with CFG, LightningDiT-XL/1+IG achieves the current state-of-the-art FID of 1.19.
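
The abstract says IG extrapolates the intermediate and deep layer outputs at sampling time but does not give the formula. The sketch below assumes a CFG-style linear extrapolation; the function name, the `scale` parameter, and the formula itself are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def internal_guidance(deep_out, intermediate_out, scale):
    """Linear extrapolation from the intermediate prediction through the deep one.

    scale = 1.0 returns the deep output unchanged; scale > 1.0 pushes the
    result further along the direction (deep - intermediate).
    Assumed form, by analogy with classifier-free guidance.
    """
    deep_out = np.asarray(deep_out, dtype=float)
    intermediate_out = np.asarray(intermediate_out, dtype=float)
    return intermediate_out + scale * (deep_out - intermediate_out)
```

Under this assumed form the intermediate-layer prediction plays the role that the unconditional or "bad" model plays in other guidance schemes, which is why no second network would be needed.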

[107] PointRAFT: 3D deep learning for high-throughput prediction of potato tuber weight from partial point clouds

Pieter M. Blok,Haozhou Wang,Hyun Kwon Suh,Peicheng Wang,James Burridge,Wei Guo

Main category: cs.CV

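TL;DR: This paper proposes PointRAFT, a high-throughput point cloud regression network that predicts potato tuber weight directly from partial point clouds, avoiding the weight underestimation caused by self-occlusion.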

Details Motivation: Point clouds reconstructed from RGB-D cameras are incomplete because of self-occlusion, which leads to systematic underestimation of tuber weight, so a method is needed that estimates weight accurately from incomplete point clouds. Method: PointRAFT regresses tuber weight directly from raw 3D point cloud data, adds a tuber height embedding as an extra geometric cue, and is trained and evaluated under real harvesting conditions. Result: On a test set of 5,254 point clouds, PointRAFT achieves a mean absolute error of 12.0 g and a root mean squared error of 17.2 g, substantially outperforming linear regression and PointNet++ baselines, with an average inference time of 6.3 ms per point cloud. Conclusion: PointRAFT estimates potato weight efficiently and accurately, meets the high-throughput needs of commercial harvesters, and can extend to other 3D phenotyping and robotic perception tasks. Abstract: Potato yield is a key indicator for optimizing cultivation practices in agriculture. Potato yield can be estimated on harvesters using RGB-D cameras, which capture three-dimensional (3D) information of individual tubers moving along the conveyor belt. However, point clouds reconstructed from RGB-D images are incomplete due to self-occlusion, leading to systematic underestimation of tuber weight. To address this, we introduce PointRAFT, a high-throughput point cloud regression network that directly predicts continuous 3D shape properties, such as tuber weight, from partial point clouds. Rather than reconstructing full 3D geometry, PointRAFT infers target values directly from raw 3D data. Its key architectural novelty is an object height embedding that incorporates tuber height as an additional geometric cue, improving weight prediction under practical harvesting conditions. PointRAFT was trained and evaluated on 26,688 partial point clouds collected from 859 potato tubers across four cultivars and three growing seasons on an operational harvester in Japan. On a test set of 5,254 point clouds from 172 tubers, PointRAFT achieved a mean absolute error of 12.0 g and a root mean squared error of 17.2 g, substantially outperforming a linear regression baseline and a standard PointNet++ regression network. With an average inference time of 6.3 ms per point cloud, PointRAFT supports processing rates of up to 150 tubers per second, meeting the high-throughput requirements of commercial potato harvesters. Beyond potato weight estimation, PointRAFT provides a versatile regression network applicable to a wide range of 3D phenotyping and robotic perception tasks. The code, network weights, and a subset of the dataset are publicly available at https://github.com/pieterblok/pointraft.git.
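
The two error metrics and the throughput claim can be checked with a few lines of arithmetic. This is a sketch only: the helper names are hypothetical, and the throughput bound assumes point clouds are processed one at a time with no batching.

```python
import numpy as np

def mae_rmse(pred_g, true_g):
    """Mean absolute error and root mean squared error, in grams."""
    err = np.asarray(pred_g, dtype=float) - np.asarray(true_g, dtype=float)
    return float(np.abs(err).mean()), float(np.sqrt((err ** 2).mean()))

def max_rate_per_second(latency_ms):
    """Upper bound on items per second for sequential inference (assumed: no batching)."""
    return 1000.0 / latency_ms
```

With the reported 6.3 ms per point cloud, `max_rate_per_second(6.3)` is roughly 159 per second, which is consistent with the stated rate of up to 150 tubers per second.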

[108] CorGi: Contribution-Guided Block-Wise Interval Caching for Training-Free Acceleration of Diffusion Transformers

Yonglak Son,Suhyeok Kim,Seungryong Kim,Young Geun Kim

Main category: cs.CV

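TL;DR: Proposes CorGi, a training-free framework that accelerates diffusion transformer (DiT) inference through contribution-guided block-wise interval caching, achieving up to 2.0x speedup on average while preserving generation quality.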

Details Motivation: DiT performs very well in image generation, but its iterative denoising process and larger model capacity make inference expensive, and much of the computation across steps is redundant. Method: CorGi evaluates the contribution of each transformer block, caches low-contribution blocks, and reuses them in later denoising steps; for text-to-image tasks, CorGi+ additionally uses cross-attention maps to identify salient tokens and applies partial attention updates. Result: Evaluated on state-of-the-art DiT models, CorGi and CorGi+ achieve up to 2.0x speedup on average while maintaining high generation quality. Conclusion: CorGi and its extension CorGi+ reduce redundant computation in DiT inference and speed it up significantly without sacrificing generation quality, which suits efficient deployment of high-capacity diffusion models. Abstract: Diffusion transformer (DiT) achieves remarkable performance in visual generation, but its iterative denoising process combined with larger capacity leads to a high inference cost. Recent works have demonstrated that the iterative denoising process of DiT models involves substantial redundant computation across steps. To effectively reduce the redundant computation in DiT, we propose CorGi (Contribution-Guided Block-Wise Interval Caching), a training-free DiT inference acceleration framework that selectively reuses the outputs of transformer blocks in DiT across denoising steps. CorGi caches low-contribution blocks and reuses them in later steps within each interval to reduce redundant computation while preserving generation quality. For text-to-image tasks, we further propose CorGi+, which leverages per-block cross-attention maps to identify salient tokens and applies partial attention updates to protect important object details. Evaluation on the state-of-the-art DiT models demonstrates that CorGi and CorGi+ achieve up to 2.0x speedup on average, while preserving high generation quality.
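
To make the idea of block-wise interval caching concrete, here is a toy sketch. It is not the CorGi implementation: the set of low-contribution blocks (`cached_ids`) is passed in rather than scored, blocks are plain callables returning a residual update, and the sampler update between denoising steps is omitted. All names are hypothetical.

```python
def run_with_interval_caching(blocks, x, num_steps, interval, cached_ids):
    """Run a stack of residual blocks for num_steps, reusing cached block outputs.

    blocks:     list of callables block(h, step) -> residual update (toy stand-ins).
    interval:   cached blocks are recomputed on the first step of every interval.
    cached_ids: indices of blocks treated as low-contribution (assumed given).
    Returns the final state and the number of block evaluations performed.
    """
    cache = {}
    block_calls = 0
    for step in range(num_steps):
        refresh = step % interval == 0
        h = x
        for i, block in enumerate(blocks):
            if i in cached_ids and not refresh:
                delta = cache[i]          # reuse the output stored at the interval start
            else:
                delta = block(h, step)    # full computation
                block_calls += 1
                if i in cached_ids:
                    cache[i] = delta
            h = h + delta
        x = h
    return x, block_calls
```

With 3 blocks, 4 steps, an interval of 2, and one cached block, this performs 10 block evaluations instead of 12; larger intervals and more cached blocks increase the saving at the cost of staler outputs.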

[109] Medical Image Classification on Imbalanced Data Using ProGAN and SMA-Optimized ResNet: Application to COVID-19

Sina Jahromi,Farshid Hajati,Alireza Rezaee,Javaher Nourian

Main category: cs.CV

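TL;DR: This paper proposes a model based on a progressive generative adversarial network and multi-objective optimization to address data imbalance in medical image classification, and reports high accuracy on COVID-19 detection.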

Details Motivation: Medical image classification suffers from marked data imbalance, especially during pandemics, which hurts the performance of AI methods, so effective ways to improve classification accuracy are needed. Method: A progressive generative adversarial network generates synthetic data; real and synthetic data are combined with a weighted scheme and fed to a deep network classifier, whose hyper-parameters are tuned with a multi-objective, meta-heuristic, population-based optimization algorithm. Result: On a large, imbalanced chest X-ray dataset, the model reaches 95.5% and 98.5% accuracy on the 4-class and 2-class tasks, respectively, with cross-validated metrics better than existing methods. Conclusion: The proposed model handles data imbalance in medical images effectively and shows good classification performance and practical potential in a pandemic setting. Abstract: The challenge of imbalanced data is prominent in medical image classification. This challenge arises when there is a significant disparity in the number of images belonging to a particular class, such as the presence or absence of a specific disease, as compared to the number of images belonging to other classes. This issue is especially notable during pandemics, which may result in an even more significant imbalance in the dataset. Researchers have employed various approaches in recent years to detect COVID-19 infected individuals accurately and quickly, with artificial intelligence and machine learning algorithms at the forefront. However, the lack of sufficient and balanced data remains a significant obstacle to these methods. This study addresses the challenge by proposing a progressive generative adversarial network to generate synthetic data to supplement the real ones. The proposed method suggests a weighted approach to combine synthetic data with real ones before inputting it into a deep network classifier. A multi-objective meta-heuristic population-based optimization algorithm is employed to optimize the hyper-parameters of the classifier. The proposed model exhibits superior cross-validated metrics compared to existing methods when applied to a large and imbalanced chest X-ray image dataset of COVID-19. The proposed model achieves 95.5% and 98.5% accuracy for 4-class and 2-class imbalanced classification problems, respectively. The successful experimental outcomes demonstrate the effectiveness of the proposed model in classifying medical images using imbalanced data during pandemics.

[110] ARM: A Learnable, Plug-and-Play Module for CLIP-based Open-vocabulary Semantic Segmentation

Ziquan Liu,Zhewei Zhu,Xuyang Shi

Main category: cs.CV

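TL;DR: Proposes the Attention Refinement Module (ARM), a lightweight learnable module that adaptively fuses CLIP's hierarchical features and improves open-vocabulary semantic segmentation in a "train once, use anywhere" manner.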

Details Motivation: Existing training-free methods depend on costly external models or static heuristics and struggle to exploit CLIP's internal features for pixel-level segmentation, which limits performance or raises computational cost. Method: ARM contains a semantically guided cross-attention mechanism in which deep features guide the refinement of shallow features, followed by a self-attention block; after being trained once on a general-purpose dataset such as COCO-Stuff, it serves as a universal plug-in for a variety of training-free frameworks. Result: ARM consistently and significantly improves baseline methods on multiple benchmarks with very low inference overhead. Conclusion: ARM establishes an efficient, general paradigm for training-free open-vocabulary semantic segmentation with good practicality and extensibility. Abstract: Open-vocabulary semantic segmentation (OVSS) is fundamentally hampered by the coarse, image-level representations of CLIP, which lack precise pixel-level details. Existing training-free methods attempt to resolve this by either importing priors from costly external foundation models (e.g., SAM, DINO) or by applying static, hand-crafted heuristics to CLIP's internal features. These approaches are either computationally expensive or sub-optimal. We propose the Attention Refinement Module (ARM), a lightweight, learnable module that effectively unlocks and refines CLIP's internal potential. Unlike static-fusion methods, ARM learns to adaptively fuse hierarchical features. It employs a semantically-guided cross-attention block, using robust deep features (K, V) to select and refine detail-rich shallow features (Q), followed by a self-attention block. The key innovation lies in a "train once, use anywhere" paradigm. Trained once on a general-purpose dataset (e.g., COCO-Stuff), ARM acts as a universal plug-and-play post-processor for diverse training-free frameworks. Extensive experiments show that ARM consistently boosts baseline performance on multiple benchmarks with negligible inference overhead, establishing an efficient and effective paradigm for training-free OVSS.
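
The cross-attention block described for ARM, with shallow features as queries and deep features as keys and values, can be sketched with plain scaled dot-product attention. The learned projection matrices, multi-head structure, and the subsequent self-attention block are omitted; those details are not in the abstract, so this is an illustration rather than the module itself.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(shallow, deep):
    """Scaled dot-product cross-attention.

    shallow: (N, d) detail-rich features used as queries (Q).
    deep:    (M, d) semantically robust features used as keys and values (K, V).
    Returns the (N, d) refined features and the (N, M) attention weights.
    Assumption: identity projections stand in for the learned Q/K/V weights.
    """
    d = shallow.shape[-1]
    weights = softmax(shallow @ deep.T / np.sqrt(d), axis=-1)
    return weights @ deep, weights
```

Each refined shallow token is a convex combination of deep tokens, which is one way to read "deep features select and refine shallow features".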

[111] Mirage: One-Step Video Diffusion for Photorealistic and Coherent Asset Editing in Driving Scenes

Shuyun Wang,Haiyang Sun,Bing Wang,Hangjun Ye,Xin Yu

Main category: cs.CV

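TL;DR: This paper proposes Mirage, a one-step video diffusion model for photorealistic and temporally coherent asset editing in driving scenes; by injecting temporally agnostic latents from a pretrained 2D encoder and using a two-stage data alignment strategy, it addresses the spatial fidelity and object alignment problems of existing methods.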

Details Motivation: Existing video object editing methods fall short in visual fidelity and temporal consistency and cannot meet the need of vision-centric autonomous driving systems for high-quality, scalable training data. Method: Mirage builds a one-step video diffusion model on a text-to-video diffusion prior; injects latents from a pretrained 2D encoder into the 3D decoder to restore detail while preserving causal structure; and introduces a two-stage data alignment strategy (coarse 3D alignment followed by fine 2D refinement) to reduce the pose misalignment caused by mismatched Gaussian distributions. Result: Experiments show that Mirage achieves high realism and good temporal consistency across diverse editing scenarios and generalizes to other video-to-video translation tasks. Conclusion: Mirage improves the quality and temporal consistency of video asset editing in driving scenes, offers a reliable option for data augmentation, and may serve as a baseline for future research. Abstract: Vision-centric autonomous driving systems rely on diverse and scalable training data to achieve robust performance. While video object editing offers a promising path for data augmentation, existing methods often struggle to maintain both high visual fidelity and temporal coherence. In this work, we propose Mirage, a one-step video diffusion model for photorealistic and coherent asset editing in driving scenes. Mirage builds upon a text-to-video diffusion prior to ensure temporal consistency across frames. However, 3D causal variational autoencoders often suffer from degraded spatial fidelity due to compression, and directly passing 3D encoder features to decoder layers breaks temporal causality. To address this, we inject temporally agnostic latents from a pretrained 2D encoder into the 3D decoder to restore detail while preserving causal structures. Furthermore, because scene objects and inserted assets are optimized under different objectives, their Gaussians exhibit a distribution mismatch that leads to pose misalignment. To mitigate this, we introduce a two-stage data alignment strategy combining coarse 3D alignment and fine 2D refinement, thereby improving alignment and providing cleaner supervision. Extensive experiments demonstrate that Mirage achieves high realism and temporal consistency across diverse editing scenarios. Beyond asset editing, Mirage can also generalize to other video-to-video translation tasks, serving as a reliable baseline for future research. Our code is available at https://github.com/wm-research/mirage.

[112] MotivNet: Evolving Meta-Sapiens into an Emotionally Intelligent Foundation Model

Rahul Medicharla,Alper Yilmaz

Main category: cs.CV

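TL;DR: This paper proposes MotivNet, a generalizable facial emotion recognition (FER) model built on the Meta-Sapiens backbone that generalizes well across datasets without cross-domain training, advancing FER for real-world use.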

Details Motivation: Current state-of-the-art FER models generalize poorly to diverse data, which limits their real-world use. Complex architectures have been proposed, but they still require cross-domain training, which conflicts with real application settings. Method: The authors use Sapiens, a human vision foundation model with strong generalization, as the backbone and treat FER as one of its downstream tasks; they propose MotivNet and assess its viability against three criteria: benchmark performance, model similarity, and data similarity. Result: MotivNet performs on par with current SOTA models across multiple datasets without cross-domain training, showing strong cross-domain generalization and supporting its viability as a Sapiens downstream task. Conclusion: MotivNet shows that foundation-model-based facial emotion recognition is effective, opens a direction for FER research and applications in real-world settings, and makes FER more attractive for in-the-wild use. Abstract: In this paper, we introduce MotivNet, a generalizable facial emotion recognition model for robust real-world application. Current state-of-the-art FER models tend to have weak generalization when tested on diverse data, leading to deteriorated performance in the real world and hindering FER as a research domain. Though researchers have proposed complex architectures to address this generalization issue, they require training cross-domain to obtain generalizable results, which is inherently contradictory for real-world application. Our model, MotivNet, achieves competitive performance across datasets without cross-domain training by using Meta-Sapiens as a backbone. Sapiens is a human vision foundational model with state-of-the-art generalization in the real world through large-scale pretraining of a Masked Autoencoder. We propose MotivNet as an additional downstream task for Sapiens and define three criteria to evaluate MotivNet's viability as a Sapiens task: benchmark performance, model similarity, and data similarity. Throughout this paper, we describe the components of MotivNet, our training approach, and our results showing MotivNet is generalizable across domains. We demonstrate that MotivNet can be benchmarked against existing SOTA models and meets the listed criteria, validating MotivNet as a Sapiens downstream task and making FER more attractive for in-the-wild application. The code is available at https://github.com/OSUPCVLab/EmotionFromFaceImages.

[113] MambaSeg: Harnessing Mamba for Accurate and Efficient Image-Event Semantic Segmentation

Fuqiang Gu,Yuanke Li,Xianlei Long,Kangping Ji,Chao Chen,Qingyi Gu,Zhenliang Ni

Main category: cs.CV

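TL;DR: This paper proposes MambaSeg, a novel dual-branch semantic segmentation framework that uses parallel Mamba encoders to fuse RGB images and event streams, with a Dual-Dimensional Interaction Module (DDIM) performing fine-grained cross-modal fusion along the spatial and temporal dimensions, improving segmentation performance while reducing computational cost.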

Details Motivation: Existing RGB methods degrade under fast motion, low light, or high dynamic range, while event cameras offer high temporal resolution but lack color and texture; current multimodal fusion methods are often computationally expensive and neglect the temporal dynamics of event streams. Method: MambaSeg uses parallel Mamba encoders to process RGB and event data separately and introduces the Dual-Dimensional Interaction Module (DDIM), comprising a Cross-Spatial Interaction Module (CSIM) and a Cross-Temporal Interaction Module (CTIM), for fine-grained fusion in space and time. Result: Experiments on the DDD17 and DSEC datasets show that MambaSeg reaches state-of-the-art semantic segmentation performance while significantly reducing computational cost. Conclusion: MambaSeg combines the strengths of RGB and event data, improves cross-modal alignment through joint spatial-temporal fusion, and offers a new approach to efficient, scalable, and robust multimodal perception. Abstract: Semantic segmentation is a fundamental task in computer vision with wide-ranging applications, including autonomous driving and robotics. While RGB-based methods have achieved strong performance with CNNs and Transformers, their effectiveness degrades under fast motion, low-light, or high dynamic range conditions due to limitations of frame cameras. Event cameras offer complementary advantages such as high temporal resolution and low latency, yet lack color and texture, making them insufficient on their own. To address this, recent research has explored multimodal fusion of RGB and event data; however, many existing approaches are computationally expensive and focus primarily on spatial fusion, neglecting the temporal dynamics inherent in event streams. In this work, we propose MambaSeg, a novel dual-branch semantic segmentation framework that employs parallel Mamba encoders to efficiently model RGB images and event streams. To reduce cross-modal ambiguity, we introduce the Dual-Dimensional Interaction Module (DDIM), comprising a Cross-Spatial Interaction Module (CSIM) and a Cross-Temporal Interaction Module (CTIM), which jointly perform fine-grained fusion along both spatial and temporal dimensions. This design improves cross-modal alignment, reduces ambiguity, and leverages the complementary properties of each modality. Extensive experiments on the DDD17 and DSEC datasets demonstrate that MambaSeg achieves state-of-the-art segmentation performance while significantly reducing computational cost, showcasing its promise for efficient, scalable, and robust multimodal perception.

[114] Physically-Grounded Manifold Projection with Foundation Priors for Metal Artifact Reduction in Dental CBCT

Zhi Li,Yaqi Wang,Bingtao Ma,Yifan Zhang,Huiyu Zhou,Shuai Wang

Main category: cs.CV

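TL;DR: Proposes Physically-Grounded Manifold Projection (PGMP), a metal artifact reduction framework that combines high-fidelity simulated data, a deterministic restoration network, and semantic-structural alignment to achieve efficient, reliable artifact removal in dental CBCT.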

Details Motivation: Existing deep learning methods for metal artifact reduction in dental CBCT suffer either from blurring caused by regression to the mean (supervised methods) or from structural hallucinations (unsupervised methods), and diffusion models are too slow at inference for clinical use. Method: The PGMP framework has three parts: 1) the AAPS pipeline generates high-fidelity training data; 2) the DMP-Former network models restoration as a deterministic manifold projection with a single forward pass; 3) the SSA module uses priors from a medical foundation model to ensure anatomical plausibility. Result: PGMP outperforms state-of-the-art methods on synthetic and multi-center clinical data, particularly on unseen anatomy, with fast inference and no iterative sampling. Conclusion: PGMP delivers fast, accurate, and clinically reliable metal artifact reduction, bridges the gap between synthetic and real data, and sets a new benchmark for practical use. Abstract: Metal artifacts in Dental CBCT severely obscure anatomical structures, hindering diagnosis. Current deep learning for Metal Artifact Reduction (MAR) faces limitations: supervised methods suffer from spectral blurring due to "regression-to-the-mean", while unsupervised ones risk structural hallucinations. Denoising Diffusion Models (DDPMs) offer realism but rely on slow, stochastic iterative sampling, unsuitable for clinical use. To resolve this, we propose the Physically-Grounded Manifold Projection (PGMP) framework. First, our Anatomically-Adaptive Physics Simulation (AAPS) pipeline synthesizes high-fidelity training pairs via Monte Carlo spectral modeling and patient-specific digital twins, bridging the synthetic-to-real gap. Second, our DMP-Former adapts the Direct x-Prediction paradigm, reformulating restoration as a deterministic manifold projection to recover clean anatomy in a single forward pass, eliminating stochastic sampling. Finally, a Semantic-Structural Alignment (SSA) module anchors the solution using priors from medical foundation models (MedDINOv3), ensuring clinical plausibility. Experiments on synthetic and multi-center clinical datasets show PGMP outperforms state-of-the-art methods on unseen anatomy, setting new benchmarks in efficiency and diagnostic reliability. Code and data: https://github.com/ricoleehduu/PGMP

[115] Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation

Zhe Huang,Hao Wen,Aiming Hao,Bingze Song,Meiqi Wu,Jiahong Wu,Xiangxiang Chu,Sheng Lu,Haoqian Wang

Main category: cs.CV

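TL;DR: This paper proposes the DualityForge framework and the DualityVidQA dataset, which use controllable diffusion-based video editing to generate counterfactual videos and high-quality QA pairs; combined with the DNA-Train training method, they reduce hallucinations of multimodal large language models on counterfactual videos, improve performance, and generalize well.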

Details Motivation: Multimodal Large Language Models (MLLMs) over-rely on language priors in video understanding and produce visually ungrounded hallucinations, especially on counterfactual videos that defy common sense; the problem is hard to address because counterfactual data is costly to collect and annotate. Method: DualityForge uses controllable diffusion-based video editing to turn real videos into counterfactual scenarios and, with structured contextual information, automatically generates original-edited video pairs with matching high-quality QA pairs to build the large-scale DualityVidQA dataset; DNA-Train is a two-stage training method whose reinforcement learning phase applies pair-wise ℓ₁ advantage normalization for more stable policy optimization. Result: On DualityVidQA-Test, the method reduces hallucinations on counterfactual videos by a relative 24.0% compared with the Qwen2.5-VL-7B baseline and also achieves significant gains on general-purpose benchmarks, indicating strong generalization. Conclusion: DualityForge and DNA-Train mitigate hallucinations caused by language priors in MLLM video understanding and offer an efficient data synthesis and training paradigm for building robust multimodal models. Abstract: Multimodal Large Language Models (MLLMs) have made remarkable progress in video understanding. However, they suffer from a critical vulnerability: an over-reliance on language priors, which can lead to visually ungrounded hallucinations, especially when processing counterfactual videos that defy common sense. This limitation, stemming from the intrinsic data imbalance between text and video, is challenging to address due to the substantial cost of collecting and annotating counterfactual data. To address this, we introduce DualityForge, a novel counterfactual data synthesis framework that employs controllable, diffusion-based video editing to transform real-world videos into counterfactual scenarios. By embedding structured contextual information into the video editing and QA generation processes, the framework automatically produces high-quality QA pairs together with original-edited video pairs for contrastive training. Based on this, we build DualityVidQA, a large-scale video dataset designed to reduce MLLM hallucinations. In addition, to fully exploit the contrastive nature of our paired data, we propose Duality-Normalized Advantage Training (DNA-Train), a two-stage SFT-RL training regime where the RL phase applies pair-wise $\ell_1$ advantage normalization, thereby enabling a more stable and efficient policy optimization. Experiments on DualityVidQA-Test demonstrate that our method substantially reduces model hallucinations on counterfactual videos, yielding a relative improvement of 24.0% over the Qwen2.5-VL-7B baseline. Moreover, our approach achieves significant gains across both hallucination and general-purpose benchmarks, indicating strong generalization capability. We will open-source our dataset and code.

[116] LiftProj: Space Lifting and Projection-Based Panorama Stitching

Yuan Jia,Ruimin Wu,Rui Song,Jiaojiao Li,Bin Song

Main category: cs.CV

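TL;DR: Proposes a panorama stitching framework based on 3D space lifting: images are lifted into 3D space for multi-view fusion, and an equidistant cylindrical projection produces a geometrically consistent 360° panorama, mitigating the distortion and ghosting of traditional 2D methods in complex scenes.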

Details Motivation: Traditional stitching based on 2D homography tends to produce ghosting, bending, and stretching distortions in 3D scenes with multiple depth layers and parallax, and the problem is most severe in multi-view accumulation and 360° closed-loop stitching, so a more robust stitching paradigm is needed. Method: Input images are lifted into a dense 3D point cloud representation in a unified coordinate system and fused globally across views using confidence; a unified projection center is established in 3D space and an equidistant cylindrical projection maps the result onto the panoramic plane; finally, hole filling in the canvas domain restores texture continuity. Result: Experiments show that the method substantially reduces geometric distortion and ghosting artifacts in scenes with significant parallax and complex occlusions, producing panoramas that are more natural and more geometrically consistent. Conclusion: The framework moves image stitching from a 2D warping paradigm to a 3D consistency paradigm, is flexible and extensible, can integrate various 3D lifting and completion modules, and improves the quality and robustness of panorama stitching in complex real scenes. Abstract: Traditional image stitching techniques have predominantly utilized two-dimensional homography transformations and mesh warping to achieve alignment on a planar surface. While effective for scenes that are approximately coplanar or exhibit minimal parallax, these approaches often result in ghosting, structural bending, and stretching distortions in non-overlapping regions when applied to real three-dimensional scenes characterized by multiple depth layers and occlusions. Such challenges are exacerbated in multi-view accumulations and 360° closed-loop stitching scenarios. In response, this study introduces a spatially lifted panoramic stitching framework that initially elevates each input image into a dense three-dimensional point representation within a unified coordinate system, facilitating global cross-view fusion augmented by confidence metrics. Subsequently, a unified projection center is established in three-dimensional space, and an equidistant cylindrical projection is employed to map the fused data onto a single panoramic manifold, thereby producing a geometrically consistent 360° panoramic layout. Finally, hole filling is conducted within the canvas domain to address unknown regions revealed by viewpoint transitions, restoring continuous texture and semantic coherence. This framework reconceptualizes stitching from a two-dimensional warping paradigm to a three-dimensional consistency paradigm and is designed to flexibly incorporate various three-dimensional lifting and completion modules. Experimental evaluations demonstrate that the proposed method substantially mitigates geometric distortions and ghosting artifacts in scenarios involving significant parallax and complex occlusions, yielding panoramic results that are more natural and consistent.
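
The projection step, mapping fused 3D points onto a 360° panorama with an equidistant cylindrical (equirectangular) projection, follows a standard formula. The sketch below assumes points are expressed relative to the unified projection center with x to the right, y down, and z forward; that axis convention and the function name are assumptions, not taken from the paper.

```python
import numpy as np

def project_equirectangular(points, width, height):
    """Map 3D points to equirectangular pixel coordinates.

    points: (N, 3) array relative to the projection center
            (assumed axes: x right, y down, z forward).
    Returns an (N, 2) array of (u, v): u grows with longitude, v grows downward.
    """
    points = np.asarray(points, dtype=float)
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    lon = np.arctan2(x, z)                  # longitude in [-pi, pi], 0 = straight ahead
    lat = np.arctan2(-y, np.hypot(x, z))    # latitude in [-pi/2, pi/2], up is positive
    u = (lon / (2.0 * np.pi) + 0.5) * width
    v = (0.5 - lat / np.pi) * height
    return np.stack([u, v], axis=1)
```

Because the mapping depends only on viewing direction, several 3D points can land on the same pixel and some pixels receive none; the former needs a depth or confidence rule, and the latter is where the hole-filling stage comes in.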

[117] One-shot synthesis of rare gastrointestinal lesions improves diagnostic accuracy and clinical training

Jia Yu,Yan Zhu,Peiyao Fu,Tianyi Chen,Zhihua Wang,Fei Wu,Quanlin Li,Pinghong Zhou,Shuo Wang,Xian Yang

Main category: cs.CV

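TL;DR: EndoRare is a retraining-free generative framework that synthesizes high-fidelity examples of rare gastrointestinal lesions from a single reference image, using language-guided concept disentanglement to improve both AI model performance and the diagnostic ability of novice clinicians.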

Details Motivation: Rare gastrointestinal lesions are seldom seen in routine endoscopy, which limits the data available for developing reliable AI models and training novice clinicians. Method: EndoRare uses language-guided concept disentanglement to separate pathognomonic lesion features from non-diagnostic attributes; the former are encoded as a learnable prototype embedding and the latter are varied to ensure diversity, so that diverse, high-fidelity lesion samples can be generated from a single reference image. Result: The framework was validated on four rare pathologies; experts judged the synthetic images clinically plausible; used for data augmentation, they significantly improved downstream AI classifiers, raising the true positive rate at low false-positive rates; in a blinded reader study, novice endoscopists exposed to EndoRare-generated cases improved recall by 0.400 and precision by 0.267. Conclusion: EndoRare offers a practical, data-efficient way to address data scarcity for rare diseases in computer-aided diagnosis and clinical education. Abstract: Rare gastrointestinal lesions are infrequently encountered in routine endoscopy, restricting the data available for developing reliable artificial intelligence (AI) models and training novice clinicians. Here we present EndoRare, a one-shot, retraining-free generative framework that synthesizes diverse, high-fidelity lesion exemplars from a single reference image. By leveraging language-guided concept disentanglement, EndoRare separates pathognomonic lesion features from non-diagnostic attributes, encoding the former into a learnable prototype embedding while varying the latter to ensure diversity. We validated the framework across four rare pathologies (calcifying fibrous tumor, juvenile polyposis syndrome, familial adenomatous polyposis, and Peutz-Jeghers syndrome). Synthetic images were judged clinically plausible by experts and, when used for data augmentation, significantly enhanced downstream AI classifiers, improving the true positive rate at low false-positive rates. Crucially, a blinded reader study demonstrated that novice endoscopists exposed to EndoRare-generated cases achieved a 0.400 increase in recall and a 0.267 increase in precision. These results establish a practical, data-efficient pathway to bridge the rare-disease gap in both computer-aided diagnostics and clinical education.

[118] Virtual-Eyes: Quantitative Validation of a Lung CT Quality-Control Pipeline for Foundation-Model Cancer Risk Prediction

Md. Enamul Hoq,Linda Larson-Prior,Fred Prior

Main category: cs.CV

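TL;DR: This paper proposes and validates Virtual-Eyes, a 16-bit CT quality-control preprocessing pipeline for low-dose CT lung cancer screening. It improves the performance and calibration of generalist foundation models (such as RAD-DINO) but can hurt specialist models (such as Sybil and ResNet-18), showing that preprocessing affects different model types differently.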

Details Motivation: Robust preprocessing is rarely evaluated quantitatively in deep-learning pipelines for low-dose CT lung cancer screening. The authors set out to develop a clinically motivated, standardized preprocessing method and to analyze systematically how its effect differs between generalist foundation models and specialist models. Method: Virtual-Eyes enforces 512x512 in-plane resolution, rejects short or non-diagnostic series, and extracts a contiguous lung block using Hounsfield-unit filtering and bilateral lung-coverage scoring while preserving the native 16-bit data. On 765 NLST patients, slice-level embeddings are extracted from RAD-DINO and Merlin with frozen encoders and used to train leakage-free MLP heads; Sybil and ResNet-18 are evaluated on raw versus Virtual-Eyes inputs without backbone retraining. Result: Virtual-Eyes raises RAD-DINO slice-level AUC from 0.576 to 0.610 and patient-level AUC from 0.646 to 0.683 (mean pooling) and from 0.619 to 0.735 (max pooling), and improves the Brier score from 0.188 to 0.112. Sybil and ResNet-18 are reported to degrade (Sybil AUC 0.886 to 0.837; ResNet-18 AUC 0.571 to 0.596, as stated in the abstract), and Merlin stays close to chance (AUC approximately 0.507 to 0.567). Conclusion: Anatomically targeted quality control can stabilize and improve generalist foundation models on low-dose CT, but may disrupt specialist models that depend on raw clinical context, so the preprocessing strategy should be chosen with the model type in mind. Abstract: Robust preprocessing is rarely quantified in deep-learning pipelines for low-dose CT (LDCT) lung cancer screening. We develop and validate Virtual-Eyes, a clinically motivated 16-bit CT quality-control pipeline, and measure its differential impact on generalist foundation models versus specialist models. Virtual-Eyes enforces strict 512x512 in-plane resolution, rejects short or non-diagnostic series, and extracts a contiguous lung block using Hounsfield-unit filtering and bilateral lung-coverage scoring while preserving the native 16-bit grid. Using 765 NLST patients (182 cancer, 583 non-cancer), we compute slice-level embeddings from RAD-DINO and Merlin with frozen encoders and train leakage-free patient-level MLP heads; we also evaluate Sybil and a 2D ResNet-18 baseline under Raw versus Virtual-Eyes inputs without backbone retraining. Virtual-Eyes improves RAD-DINO slice-level AUC from 0.576 to 0.610 and patient-level AUC from 0.646 to 0.683 (mean pooling) and from 0.619 to 0.735 (max pooling), with improved calibration (Brier score 0.188 to 0.112). In contrast, Sybil and ResNet-18 degrade under Virtual-Eyes (Sybil AUC 0.886 to 0.837; ResNet-18 AUC 0.571 to 0.596) with evidence of context dependence and shortcut learning, and Merlin shows limited transferability (AUC approximately 0.507 to 0.567) regardless of preprocessing. These results demonstrate that anatomically targeted QC can stabilize and improve generalist foundation-model workflows but may disrupt specialist models adapted to raw clinical context.
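
The entry above reports patient-level results under mean and max pooling and a Brier score for calibration. As a minimal sketch of what those terms mean (standard definitions, not the authors' code; the function names and toy numbers below are hypothetical), slice-level probabilities can be pooled per patient and scored like this:

```python
from statistics import mean


def pool_patient_score(slice_probs, mode="mean"):
    """Collapse slice-level cancer probabilities into one patient-level score."""
    if mode == "mean":
        return mean(slice_probs)
    if mode == "max":
        return max(slice_probs)
    raise ValueError(f"unknown pooling mode: {mode}")


def brier_score(probs, labels):
    """Mean squared gap between predicted probability and the 0/1 outcome (lower is better)."""
    return mean((p - y) ** 2 for p, y in zip(probs, labels))


# Toy, hypothetical slice probabilities for two patients (not NLST data).
patients = {"p1": [0.1, 0.2, 0.9], "p2": [0.1, 0.1, 0.2]}
labels = [1, 0]  # p1 has cancer, p2 does not

for mode in ("mean", "max"):
    scores = [pool_patient_score(probs, mode) for probs in patients.values()]
    print(mode, scores, brier_score(scores, labels))
```

Max pooling lets a single confident slice decide the patient-level score, while mean pooling dilutes it across the series, which is one plausible reason the two pooling modes can respond differently to preprocessing.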

[119] UniAct: Unified Motion Generation and Action Streaming for Humanoid Robots

Nan Jiang,Zimo He,Wanhe Yu,Lexi Pang,Yunhao Li,Hongjie Li,Jieming Cui,Yuhan Li,Yizhou Wang,Yixin Zhu,Siyuan Huang

Main category: cs.CV

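TL;DR: UniAct is a two-stage framework that combines a fine-tuned MLLM with a causal streaming pipeline so that humanoid robots can respond to multimodal instructions (such as language, music, and trajectories) in real time, with latency under 500 ms; a unified discrete codebook improves action success rate and generalization.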

Details Motivation: Existing methods struggle to turn heterogeneous multimodal instructions (such as language, music, and trajectories) into stable, real-time whole-body humanoid motion, and lack a unified mechanism for cross-modal alignment and physical plausibility. Method: UniAct is a two-stage framework. The first stage uses a fine-tuned multimodal large language model (MLLM) to interpret multimodal inputs; the second stage generates action sequences through a causal streaming pipeline. FSQ (finite scalar quantization) provides a shared discrete codebook that aligns the modalities while constraining motions to a physically grounded manifold. Result: Evaluated on UniMoCap, the authors' 20-hour humanoid motion benchmark, UniAct improves the success rate of zero-shot tracking of imperfect reference motions by 19%, keeps system latency below 500 ms, and generalizes robustly across scenarios. Conclusion: By unifying perception and control, UniAct substantially improves how well humanoid robots understand and execute multimodal instructions, advancing general-purpose, responsive humanoid assistants. Abstract: A long-standing objective in humanoid robotics is the realization of versatile agents capable of following diverse multimodal instructions with human-level flexibility. Despite advances in humanoid control, bridging high-level multimodal perception with whole-body execution remains a significant bottleneck. Existing methods often struggle to translate heterogeneous instructions -- such as language, music, and trajectories -- into stable, real-time actions. Here we show that UniAct, a two-stage framework integrating a fine-tuned MLLM with a causal streaming pipeline, enables humanoid robots to execute multimodal instructions with sub-500 ms latency. By unifying inputs through a shared discrete codebook via FSQ, UniAct ensures cross-modal alignment while constraining motions to a physically grounded manifold. This approach yields a 19% improvement in the success rate of zero-shot tracking of imperfect reference motions. We validate UniAct on UniMoCap, our 20-hour humanoid motion benchmark, demonstrating robust generalization across diverse real-world scenarios. Our results mark a critical step toward responsive, general-purpose humanoid assistants capable of seamless interaction through unified perception and control.
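
The UniAct entry relies on FSQ to build a shared discrete codebook. The abstract does not give the number of latent dimensions or levels, so the following is only a rough sketch of the general finite scalar quantization idea under assumed settings: each latent dimension is bounded and rounded to a small set of integer levels, and the combination of per-dimension levels acts as an implicit codebook entry. The function names and the restriction to odd level counts are simplifications for illustration.

```python
import math


def fsq_quantize(latent, levels):
    """Round each bounded latent dimension to one of levels[i] integer values.

    Simplified sketch: only odd level counts are handled, so the levels are
    symmetric around zero (for 5 levels: -2, -1, 0, 1, 2).
    """
    codes = []
    for value, n_levels in zip(latent, levels):
        if n_levels % 2 == 0:
            raise ValueError("this sketch handles odd level counts only")
        half = (n_levels - 1) / 2
        bounded = math.tanh(value) * half  # squash into [-half, half]
        codes.append(int(round(bounded)))  # snap to the nearest integer level
    return codes


def code_index(codes, levels):
    """Map per-dimension codes to a single token id in the implicit codebook."""
    index = 0
    for code, n_levels in zip(codes, levels):
        index = index * n_levels + (code + (n_levels - 1) // 2)
    return index


levels = [5, 5, 5]  # hypothetical choice: 5 * 5 * 5 = 125 implicit codebook entries
codes = fsq_quantize([0.0, 10.0, -10.0], levels)
print(codes, code_index(codes, levels))
```

Because the codebook is implicit in the rounding grid, there are no codebook vectors to learn, which is part of what makes a single token vocabulary convenient to share across input modalities.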

[120] Robust Egocentric Referring Video Object Segmentation via Dual-Modal Causal Intervention

Haijing Liu,Zhiyuan Song,Hefeng Wu,Tao Pu,Keze Wang,Liang Lin

Main category: cs.CV

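TL;DR: This paper proposes CERES, a framework that uses causal intervention to address dataset bias and visual confounding in egocentric referring video object segmentation (Ego-RVOS); causal adjustment on both the language and visual modalities significantly improves performance.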

Details Motivation: Existing Ego-RVOS methods are vulnerable to skewed object-action pairings in datasets and to disturbances inherent in the first-person view (such as rapid motion and occlusion), so they learn spurious correlations and generalize poorly. Method: CERES applies dual-modal causal intervention: backdoor adjustment mitigates dataset bias in the language representation, and front-door adjustment fuses semantic visual features with geometric depth information to counter visual confounding and improve robustness to egocentric distortions. The framework is a plug-in that adapts strong pre-trained RVOS backbones. Result: CERES achieves state-of-the-art performance on Ego-RVOS benchmarks, and experiments indicate that the causal approach mitigates bias and improves robustness. Conclusion: Introducing causal reasoning can make egocentric video understanding models more reliable; CERES offers a new approach to dataset bias and visual confounding in Ego-RVOS. Abstract: Egocentric Referring Video Object Segmentation (Ego-RVOS) aims to segment the specific object actively involved in a human action, as described by a language query, within first-person videos. This task is critical for understanding egocentric human behavior. However, achieving such segmentation robustly is challenging due to ambiguities inherent in egocentric videos and biases present in training data. Consequently, existing methods often struggle, learning spurious correlations from skewed object-action pairings in datasets and fundamental visual confounding factors of the egocentric perspective, such as rapid motion and frequent occlusions. To address these limitations, we introduce Causal Ego-REferring Segmentation (CERES), a plug-in causal framework that adapts strong, pre-trained RVOS backbones to the egocentric domain. CERES implements dual-modal causal intervention: applying backdoor adjustment principles to counteract language representation biases learned from dataset statistics, and leveraging front-door adjustment concepts to address visual confounding by intelligently integrating semantic visual features with geometric depth information guided by causal principles, creating representations more robust to egocentric distortions. Extensive experiments demonstrate that CERES achieves state-of-the-art performance on Ego-RVOS benchmarks, highlighting the potential of applying causal reasoning to build more reliable models for broader egocentric video understanding.

[121] SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning

Yong Xien Chng,Tao Hu,Wenwen Tong,Xueheng Li,Jiandong Chen,Haojia Yu,Jiefan Lu,Hewei Guo,Hanming Deng,Chengjun Xie,Gao Huang,Dahua Lin,Lewei Lu

Main category: cs.CV

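TL;DR: This paper proposes SenseNova-MARS, a framework that uses reinforcement learning to enable multimodal agentic reasoning and search. It dynamically combines image search, text search, and image cropping tools for fine-grained, knowledge-intensive visual understanding, introduces the BN-GSPO algorithm to stabilize training, and achieves leading results on benchmarks including the newly built HR-MMSearch.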

Details Motivation: Existing vision-language models cannot interleave tool manipulation with continuous reasoning as fluidly as humans do in complex scenarios, and fall short especially on knowledge-intensive, high-resolution visual tasks that require coordinating external tools such as search and image cropping. Method: SenseNova-MARS uses reinforcement learning to interleave visual reasoning with tool use; the BN-GSPO algorithm improves training stability; and HR-MMSearch, a search-oriented benchmark of high-resolution images, is built for evaluation. Result: SenseNova-MARS-8B scores 67.84 on MMSearch and 41.64 on HR-MMSearch, surpassing proprietary models such as Gemini-3-Flash and GPT-5 and reaching state of the art among open-source models. Conclusion: SenseNova-MARS advances agentic VLMs with tool-use capability and offers an effective path toward more natural, human-like visual reasoning; the authors state they will release all code, models, and datasets to support follow-up research. Abstract: While Vision-Language Models (VLMs) can solve complex tasks through agentic reasoning, their capabilities remain largely constrained to text-oriented chain-of-thought or isolated tool invocation. They fail to exhibit the human-like proficiency required to seamlessly interleave dynamic tool manipulation with continuous reasoning, particularly in knowledge-intensive and visually complex scenarios that demand coordinated external tools such as search and image cropping. In this work, we introduce SenseNova-MARS, a novel Multimodal Agentic Reasoning and Search framework that empowers VLMs with interleaved visual reasoning and tool-use capabilities via reinforcement learning (RL). Specifically, SenseNova-MARS dynamically integrates the image search, text search, and image crop tools to tackle fine-grained and knowledge-intensive visual understanding challenges. In the RL stage, we propose the Batch-Normalized Group Sequence Policy Optimization (BN-GSPO) algorithm to improve the training stability and advance the model's ability to invoke tools and reason effectively. To comprehensively evaluate the agentic VLMs on complex visual tasks, we introduce the HR-MMSearch benchmark, the first search-oriented benchmark composed of high-resolution images with knowledge-intensive and search-driven questions. Experiments demonstrate that SenseNova-MARS achieves state-of-the-art performance on open-source search and fine-grained image understanding benchmarks. Specifically, on search-oriented benchmarks, SenseNova-MARS-8B scores 67.84 on MMSearch and 41.64 on HR-MMSearch, surpassing proprietary models such as Gemini-3-Flash and GPT-5. SenseNova-MARS represents a promising step toward agentic VLMs by providing effective and robust tool-use capabilities. To facilitate further research in this field, we will release all code, models, and datasets.

[122] Spatial-aware Vision Language Model for Autonomous Driving

Weijie Wei,Zhipeng Luo,Ling Feng,Venice Erin Liong

Main category: cs.CV

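TL;DR: LVLDrive is a new framework that combines LiDAR point clouds with vision-language models (VLMs) to improve 3D metric spatial understanding in autonomous driving; gradual fusion and a spatial-aware question-answering dataset support safer, more reliable driving decisions.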

Details Motivation: Existing image-based vision-language models fall short in complex scene understanding and geometric reasoning, and cannot meet the safety requirements that autonomous driving places on precise metric spatial reasoning. Method: LVLDrive adds LiDAR as an extra input modality, uses a Gradual Fusion Q-Former to inject LiDAR features stably, and builds a spatial-aware question-answering (SA-QA) dataset to train the model's 3D perception and reasoning. Result: On several autonomous driving benchmarks, LVLDrive clearly outperforms vision-only models in scene understanding, metric spatial perception, and driving decision-making. Conclusion: Explicit 3D metric data such as LiDAR is essential for building trustworthy VLM-based autonomous driving systems. Abstract: While Vision-Language Models (VLMs) show significant promise for end-to-end autonomous driving by leveraging the common sense embedded in language models, their reliance on 2D image cues for complex scene understanding and decision-making presents a critical bottleneck for safety and reliability. Current image-based methods struggle with accurate metric spatial reasoning and geometric inference, leading to unreliable driving policies. To bridge this gap, we propose LVLDrive (LiDAR-Vision-Language), a novel framework specifically designed to upgrade existing VLMs with robust 3D metric spatial understanding for autonomous driving by incorporating LiDAR point clouds as an extra input modality. A key challenge lies in mitigating the catastrophic disturbance introduced by disparate 3D data to the pre-trained VLMs. To this end, we introduce a Gradual Fusion Q-Former that incrementally injects LiDAR features, ensuring the stability and preservation of the VLM's existing knowledge base. Furthermore, we develop a spatial-aware question-answering (SA-QA) dataset to explicitly teach the model advanced 3D perception and reasoning capabilities. Extensive experiments on driving benchmarks demonstrate that LVLDrive achieves superior performance compared to vision-only counterparts across scene understanding, metric spatial perception, and reliable driving decision-making. Our work highlights the necessity of explicit 3D metric data for building trustworthy VLM-based autonomous systems.

[123] The Mechanics of CNN Filtering with Rectification

Liam Frija-Altrac,Matthew Toews

Main category: cs.CV

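TL;DR: Proposes an elementary information mechanics model, inspired by special relativity and quantum mechanics, for understanding the mechanical properties of convolutional filtering with rectification, and links information processing in CNNs to the energy-momentum relation.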

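Details Motivation: Inspired by physical theory, the work aims to draw an analogy between information processing in convolutional neural networks and the energy-momentum relation in physics, in order to better understand the mechanical properties of convolutional filtering. Method: Convolution kernels are decomposed into even and odd components, and their effect on information propagation is analyzed in the spectral (DCT) domain, where low-frequency bases (the DC and gradient components) are identified as the fundamental modes of information propagation. Result: Even kernels cause isotropic diffusion (analogous to rest or potential energy), odd kernels cause directional displacement (analogous to kinetic energy), and the speed of information displacement is linearly related to the share of kernel energy in the odd component. Conclusion: According to the authors, this is the first work to establish a theoretical link between information processing in generic CNNs and the energy-momentum relation of relativistic physics, giving a new physical perspective on CNNs. Abstract: This paper proposes elementary information mechanics as a new model for understanding the mechanical properties of convolutional filtering with rectification, inspired by physical theories of special relativity and quantum mechanics. We consider kernels decomposed into orthogonal even and odd components. Even components cause image content to diffuse isotropically while preserving the center of mass, analogously to rest or potential energy with zero net momentum. Odd kernels cause directional displacement of the center of mass, analogously to kinetic energy with non-zero momentum. The speed of information displacement is linearly related to the ratio of odd vs total kernel energy. Even-Odd properties are analyzed in the spectral domain via the discrete cosine transform (DCT), where the structure of small convolutional filters (e.g. $3 \times 3$ pixels) is dominated by low-frequency bases, specifically the DC $\Sigma$ and gradient components $\nabla$, which define the fundamental modes of information propagation. To our knowledge, this is the first work demonstrating the link between information processing in generic CNNs and the energy-momentum relation, a cornerstone of modern relativistic physics.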
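
The even/odd decomposition in the entry above can be made concrete with a small sketch. It assumes, as one natural reading of the abstract, that "even" and "odd" refer to symmetry under reflection through the kernel centre; the function names are hypothetical, and the ratio computed here is the odd-to-total energy ratio that the abstract says is linearly related to displacement speed, not the speed itself.

```python
def flip(kernel):
    """Reflect a 2D kernel through its centre (reverse row order and each row)."""
    return [row[::-1] for row in kernel[::-1]]


def even_odd_parts(kernel):
    """Split a kernel into parts that are symmetric and antisymmetric under flip()."""
    flipped = flip(kernel)
    even = [[(a + b) / 2 for a, b in zip(row, frow)] for row, frow in zip(kernel, flipped)]
    odd = [[(a - b) / 2 for a, b in zip(row, frow)] for row, frow in zip(kernel, flipped)]
    return even, odd


def energy(kernel):
    """Sum of squared kernel weights."""
    return sum(value * value for row in kernel for value in row)


def odd_energy_ratio(kernel):
    """Share of total kernel energy carried by the odd component."""
    _, odd = even_odd_parts(kernel)
    return energy(odd) / energy(kernel)


box = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]          # symmetric blur: purely even
sobel_x = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # horizontal gradient: purely odd
shifted = [[0, 0, 0], [0, 1, 1], [0, 0, 0]]      # mixes both components

for name, kernel in (("box", box), ("sobel_x", sobel_x), ("shifted", shifted)):
    print(name, odd_energy_ratio(kernel))
```

Because the two parts are orthogonal, their energies add up to the total kernel energy, so the ratio always lies between 0 (pure diffusion in the abstract's analogy) and 1 (pure displacement).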

[124] DermaVQA-DAS: Dermatology Assessment Schema (DAS) & Datasets for Closed-Ended Question Answering & Segmentation in Patient-Generated Dermatology Images

Wen-wai Yim,Yujuan Fu,Asma Ben Abacha,Meliha Yetisgen,Noel Codella,Roberto Andres Novoa,Josep Malvehy

Main category: cs.CV

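TL;DR: This paper presents DermaVQA-DAS, an extended dataset supporting closed-ended question answering and skin lesion segmentation, together with the expert-designed Dermatology Assessment Schema (DAS), to advance patient-centered dermatological vision-language modeling.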

Details Motivation: Existing dermatology image datasets focus mostly on dermatoscopic images and lack patient-authored queries and clinical context, which limits their use in patient-centered care. A dataset closer to clinical practice, with structured annotation of clinical features, is needed. Method: The Dermatology Assessment Schema (DAS) comprises 36 high-level and 27 fine-grained assessment questions for systematically annotating dermatological features. On this basis the DermaVQA-DAS dataset is built to support closed-ended QA and lesion segmentation, and several multimodal models are benchmarked on it. Result: For segmentation, the prompting strategy affects model performance; the best reported result is a Jaccard index of 0.395 and a Dice score of 0.566 (BiomedParse). For closed-ended QA, o3 has the highest accuracy (0.798), followed by GPT-4.1 (0.796), and Gemini-1.5-Pro is competitive within the Gemini family (0.783). Conclusion: DermaVQA-DAS and the DAS framework provide standardized, structured data and an evaluation scheme for patient-centered dermatology, support the use of multimodal models in clinical settings, and have been publicly released for follow-up research. Abstract: Recent advances in dermatological image analysis have been driven by large-scale annotated datasets; however, most existing benchmarks focus on dermatoscopic images and lack patient-authored queries and clinical context, limiting their applicability to patient-centered care. To address this gap, we introduce DermaVQA-DAS, an extension of the DermaVQA dataset that supports two complementary tasks: closed-ended question answering (QA) and dermatological lesion segmentation. Central to this work is the Dermatology Assessment Schema (DAS), a novel expert-developed framework that systematically captures clinically meaningful dermatological features in a structured and standardized form. DAS comprises 36 high-level and 27 fine-grained assessment questions, with multiple-choice options in English and Chinese. Leveraging DAS, we provide expert-annotated datasets for both closed QA and segmentation and benchmark state-of-the-art multimodal models. For segmentation, we evaluate multiple prompting strategies and show that prompt design impacts performance: the default prompt achieves the best results under Mean-of-Max and Mean-of-Mean evaluation aggregation schemes, while an augmented prompt incorporating both patient query title and content yields the highest performance under majority-vote-based microscore evaluation, achieving a Jaccard index of 0.395 and a Dice score of 0.566 with BiomedParse. For closed-ended QA, overall performance is strong across models, with average accuracies ranging from 0.729 to 0.798; o3 achieves the best overall accuracy (0.798), closely followed by GPT-4.1 (0.796), while Gemini-1.5-Pro shows competitive performance within the Gemini family (0.783). We publicly release DermaVQA-DAS, the DAS schema, and evaluation protocols to support and accelerate future research in patient-centered dermatological vision-language modeling (https://osf.io/72rp3).
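
The segmentation results above are given as a Jaccard index and a Dice score. The sketch below shows the standard definitions of both on binary masks; it is not the benchmark's evaluation code, and the aggregation schemes named in the abstract (Mean-of-Max, Mean-of-Mean, majority-vote microscore) are not reproduced here. Masks are flattened to 0/1 sequences for simplicity.

```python
def jaccard_and_dice(pred, target):
    """Return (Jaccard index, Dice score) for two binary masks given as flat 0/1 sequences."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - intersection
    if union == 0:
        # Both masks empty: treated here as perfect agreement (a convention, not a rule).
        return 1.0, 1.0
    jaccard = intersection / union
    dice = 2 * intersection / (pred_area + target_area)
    return jaccard, dice


pred = [1, 1, 0, 0]
target = [1, 0, 1, 0]
jaccard, dice = jaccard_and_dice(pred, target)
print(jaccard, dice)
```

For a single pair of masks, or for counts pooled over a whole dataset, the two metrics are tied by Dice = 2J / (1 + J). The reported pair is consistent with that relation, since 2 × 0.395 / 1.395 ≈ 0.566.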

[125] Forging Spatial Intelligence: A Roadmap of Multi-Modal Data Pre-Training for Autonomous Systems

Song Wang,Lingdong Kong,Xiaolu Liu,Hao Shi,Wentong Li,Jianke Zhu,Steven C. H. Hoi

Main category: cs.CV

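TL;DR: This paper presents a comprehensive framework for multi-modal pre-training that integrates data from cameras, LiDAR, and other sensors to achieve spatial intelligence in autonomous systems. The authors build a unified taxonomy, examine the role of textual inputs and occupancy representations in open-world perception and planning, identify key bottlenecks such as computational efficiency and model scalability, and propose a roadmap toward general-purpose multi-modal foundation models.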

Details Motivation: Foundation models excel at single-modality tasks, but fusing multi-modal sensor data (such as camera and LiDAR) into a unified spatial understanding remains difficult, which limits the perception and decision-making of autonomous systems in real environments. Method: The paper proposes a unified taxonomy of multi-modal pre-training paradigms, from single-modality baselines to unified frameworks that learn holistic representations; analyzes the roles of sensor characteristics, learning strategies, and platform-specific datasets; and explores the integration of textual inputs and occupancy representations. Result: The paper establishes a unified taxonomy of multi-modal pre-training paradigms, relates it to advanced tasks such as 3D object detection and semantic occupancy prediction, and discusses how combining text and occupancy representations supports open-world perception and planning. Conclusion: Robust spatial intelligence requires overcoming bottlenecks such as computational efficiency and model scalability; future work should develop general-purpose multi-modal foundation models to support real-world deployment of autonomous systems such as self-driving vehicles. Abstract: The rapid advancement of autonomous systems, including self-driving vehicles and drones, has intensified the need to forge true Spatial Intelligence from multi-modal onboard sensor data. While foundation models excel in single-modal contexts, integrating their capabilities across diverse sensors like cameras and LiDAR to create a unified understanding remains a formidable challenge. This paper presents a comprehensive framework for multi-modal pre-training, identifying the core set of techniques driving progress toward this goal. We dissect the interplay between foundational sensor characteristics and learning strategies, evaluating the role of platform-specific datasets in enabling these advancements. Our central contribution is the formulation of a unified taxonomy for pre-training paradigms: ranging from single-modality baselines to sophisticated unified frameworks that learn holistic representations for advanced tasks like 3D object detection and semantic occupancy prediction. Furthermore, we investigate the integration of textual inputs and occupancy representations to facilitate open-world perception and planning. Finally, we identify critical bottlenecks, such as computational efficiency and model scalability, and propose a roadmap toward general-purpose multi-modal foundation models capable of achieving robust Spatial Intelligence for real-world deployment.

[126] RedunCut: Measurement-Driven Sampling and Accuracy Performance Modeling for Low-Cost Live Video Analytics

Gur-Eyal Sela,Kumar Krishna Agrawal,Bharathan Balaji,Joseph Gonzalez,Ion Stoica

Main category: cs.CV

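TL;DR: This paper presents RedunCut, a new system for dynamic model size selection (DMSS) that uses a measurement-driven planner and a lightweight, data-driven performance model to substantially reduce the compute cost of live video analytics.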

Details Motivation: Existing DMSS methods generalize poorly across diverse workloads; their sampling is inefficient and their accuracy prediction is imprecise, which drives up runtime cost. Method: RedunCut uses a measurement-driven planner to estimate the cost-benefit tradeoff of sampling, and a lightweight, data-driven performance model to improve per-segment accuracy prediction. Result: Across road-vehicle, drone, and surveillance videos, RedunCut reduces compute cost by 14-62% at fixed accuracy and remains robust to limited historical data and to drift. Conclusion: RedunCut addresses inefficient sampling and inaccurate accuracy prediction in DMSS and markedly lowers inference cost for large-scale live video analytics. Abstract: Live video analytics (LVA) runs continuously across massive camera fleets, but inference cost with modern vision models remains high. To address this, dynamic model size selection (DMSS) is an attractive approach: it is content-aware but treats models as black boxes, and could potentially reduce cost by up to 10x without model retraining or modification. Without ground truth labels at runtime, we observe that DMSS methods use two stages per segment: (i) sampling a few models to calculate prediction statistics (e.g., confidences), then (ii) selection of the model size from those statistics. Prior systems fail to generalize to diverse workloads, particularly to mobile videos and lower accuracy targets. We identify that the failure modes stem from inefficient sampling whose cost exceeds its benefit, and inaccurate per-segment accuracy prediction. In this work, we present RedunCut, a new DMSS system that addresses both: it uses a measurement-driven planner that estimates the cost-benefit tradeoff of sampling, and a lightweight, data-driven performance model to improve accuracy prediction. Across road-vehicle, drone, and surveillance videos and multiple model families and tasks, RedunCut reduces compute cost by 14-62% at fixed accuracy and remains robust to limited historical data and to drift.

[127] DyStream: Streaming Dyadic Talking Heads Generation via Flow Matching-based Autoregressive Model

Bohong Chen,Haiyang Liu

Main category: cs.CV

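TL;DR: This paper presents DyStream, a flow matching-based autoregressive model for low-latency, high-quality dyadic talking-head video generation that supports real-time interaction.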

Details Motivation: Existing chunk-based methods need full non-causal context windows, which introduces high latency and prevents the immediate non-verbal feedback that realistic conversation requires. Method: DyStream uses a streaming autoregressive framework with flow-matching heads and a causal encoder with a lookahead module that brings in short future context (e.g., 60 ms), improving quality while keeping latency low. Result: DyStream generates each frame within 34 ms and keeps total system latency under 100 ms; on HDTF it reaches offline and online LipSync Confidence scores of 8.13 and 7.61, outperforming alternative causal strategies. Conclusion: DyStream achieves state-of-the-art lip-sync quality at very low latency, making it suitable for real conversations that need real-time non-verbal interaction. Abstract: Generating realistic, dyadic talking head video requires ultra-low latency. Existing chunk-based methods require full non-causal context windows, introducing significant delays. This high latency critically prevents the immediate, non-verbal feedback required for a realistic listener. To address this, we present DyStream, a flow matching-based autoregressive model that can generate video in real time from both speaker and listener audio. Our method contains two key designs: (1) we adopt a stream-friendly autoregressive framework with flow-matching heads for probabilistic modeling, and (2) we propose a causal encoder enhanced by a lookahead module to incorporate short future context (e.g., 60 ms) to improve quality while maintaining low latency. Our analysis shows that this simple and effective method significantly surpasses alternative causal strategies, including distillation and generative encoder. Extensive experiments show that DyStream can generate video within 34 ms per frame, guaranteeing that the entire system latency remains under 100 ms. In addition, it achieves state-of-the-art lip-sync quality, with offline and online LipSync Confidence scores of 8.13 and 7.61 on HDTF, respectively. The model, weights, and code are available.

[128] AI-Driven Evaluation of Surgical Skill via Action Recognition

Yan Meng,Daniel A. Donoho,Marcelle Altshuler,Omar Arnaout

Main category: cs.CV

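TL;DR: Proposes an AI-based framework for assessing microanastomosis performance that combines an improved TimeSformer with YOLO for surgical action recognition and motion-quality analysis, reaching frame-level action segmentation accuracy of 87.7% (93.62% after post-processing).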

Details Motivation: Conventional surgical skill assessment depends on subjective expert judgment; it is time-consuming, inconsistent, and resource-intensive, and is especially hard to scale in low- and middle-income countries, so an automated, objective method is needed. Method: An improved TimeSformer model (with hierarchical temporal attention and weighted spatial attention) performs action recognition, and YOLO-based object detection and tracking extracts fine-grained motion features; microanastomosis skill is quantified along five aspects. Result: On 58 annotated videos, frame-level action segmentation accuracy is 87.7% (93.62% after post-processing), and the average classification accuracy across skill aspects is 76%. Conclusion: The framework can give objective, consistent, and interpretable feedback and could help move surgical education toward standardized, data-driven training and evaluation. Abstract: The development of effective training and evaluation strategies is critical. Conventional methods for assessing surgical proficiency typically rely on expert supervision, either through onsite observation or retrospective analysis of recorded procedures. However, these approaches are inherently subjective, susceptible to inter-rater variability, and require substantial time and effort from expert surgeons. These demands are often impractical in low- and middle-income countries, thereby limiting the scalability and consistency of such methods across training programs. To address these limitations, we propose a novel AI-driven framework for the automated assessment of microanastomosis performance. The system integrates a video transformer architecture based on TimeSformer, improved with hierarchical temporal attention and weighted spatial attention mechanisms, to achieve accurate action recognition within surgical videos. Fine-grained motion features are then extracted using a YOLO-based object detection and tracking method, allowing for detailed analysis of instrument kinematics. Performance is evaluated along five aspects of microanastomosis skill, including overall action execution, motion quality during procedure-critical actions, and general instrument handling. Experimental validation using a dataset of 58 expert-annotated videos demonstrates the effectiveness of the system, achieving 87.7% frame-level accuracy in action segmentation that increased to 93.62% with post-processing, and an average classification accuracy of 76% in replicating expert assessments across all skill aspects. These findings highlight the system's potential to provide objective, consistent, and interpretable feedback, thereby enabling more standardized, data-driven training and evaluation in surgical education.

[129] Exploring Compositionality in Vision Transformers using Wavelet Representations

Akshad Shyam Purushottamdas,Pranav K Nayak,Divya Mehul Rajparia,Deekshith Patel,Yashmitha Gogineni,Konda Reddy Mopuri,Sumohana S. Channappayya

Main category: cs.CV

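TL;DR: This paper proposes a new framework that uses the Discrete Wavelet Transform (DWT) to obtain input-dependent primitives for vision tasks and studies compositionality in the representation space of the Vision Transformer (ViT) encoder. Experiments show that primitives from a one-level DWT decomposition compose approximately in latent space, shedding new light on how ViTs organize information.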

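Details Motivation: Most of what is understood about transformers comes from analysis on language tasks; in vision, how ViTs build and organize representations remains unclear. This paper asks whether the ViT encoder exhibits compositionality, a key property, in its representation space. Method: The authors introduce a framework analogous to prior measures of compositionality in representation learning. The Discrete Wavelet Transform (DWT) extracts input-dependent primitives from an image, and the degree of compositionality is measured by testing whether representations composed from these primitives reproduce the representation of the original image. Result: Primitives from a one-level DWT decomposition approximately compose in the ViT encoder's representation space: the composed representation closely reproduces the original image representation, indicating a degree of compositionality in the latent space. Conclusion: Under specific conditions (such as DWT primitives), the ViT representation space shows approximate compositionality, which offers a new perspective on how ViTs structure visual information. Abstract: While insights into the workings of the transformer model have largely emerged by analysing their behaviour on language tasks, this work investigates the representations learnt by the Vision Transformer (ViT) encoder through the lens of compositionality. We introduce a framework, analogous to prior work on measuring compositionality in representation learning, to test for compositionality in the ViT encoder. Crucial to drawing this analogy is the Discrete Wavelet Transform (DWT), which is a simple yet effective tool for obtaining input-dependent primitives in the vision setting. By examining the ability of composed representations to reproduce original image representations, we empirically test the extent to which compositionality is respected in the representation space. Our findings show that primitives from a one-level DWT decomposition produce encoder representations that approximately compose in latent space, offering a new perspective on how ViTs structure information.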
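
To illustrate what "primitives from a one-level DWT decomposition" could look like, here is a minimal sketch under stated assumptions: the abstract does not name the wavelet, so the Haar wavelet is assumed, and a "primitive" is taken to be the image reconstructed from one subband with the others zeroed. The function names are hypothetical, and the ViT-side composition test is not reproduced. Only the input-side property is shown: the four primitives sum exactly back to the image.

```python
def haar_dwt2(image):
    """One-level 2D Haar transform; image height and width must be even."""
    rows, cols = len(image) // 2, len(image[0]) // 2
    bands = {name: [[0.0] * cols for _ in range(rows)] for name in ("LL", "H", "V", "D")}
    for i in range(rows):
        for j in range(cols):
            a, b = image[2 * i][2 * j], image[2 * i][2 * j + 1]
            c, d = image[2 * i + 1][2 * j], image[2 * i + 1][2 * j + 1]
            bands["LL"][i][j] = (a + b + c + d) / 2  # local average
            bands["H"][i][j] = (a - b + c - d) / 2   # left-right difference
            bands["V"][i][j] = (a + b - c - d) / 2   # top-bottom difference
            bands["D"][i][j] = (a - b - c + d) / 2   # diagonal difference
    return bands


def haar_idwt2(bands):
    """Invert haar_dwt2."""
    rows, cols = len(bands["LL"]), len(bands["LL"][0])
    image = [[0.0] * (2 * cols) for _ in range(2 * rows)]
    for i in range(rows):
        for j in range(cols):
            ll, h, v, d = (bands[name][i][j] for name in ("LL", "H", "V", "D"))
            image[2 * i][2 * j] = (ll + h + v + d) / 2
            image[2 * i][2 * j + 1] = (ll - h + v - d) / 2
            image[2 * i + 1][2 * j] = (ll + h - v - d) / 2
            image[2 * i + 1][2 * j + 1] = (ll - h - v + d) / 2
    return image


def subband_primitives(image):
    """Image-space primitive per subband: invert the transform with all other subbands zeroed."""
    bands = haar_dwt2(image)
    zero = [[0.0] * len(bands["LL"][0]) for _ in bands["LL"]]
    return {
        name: haar_idwt2({k: (bands[k] if k == name else zero) for k in bands})
        for name in bands
    }


image = [[1.0, 2.0], [3.0, 4.0]]
primitives = subband_primitives(image)
recomposed = [
    [sum(primitives[name][r][c] for name in primitives) for c in range(2)]
    for r in range(2)
]
print(recomposed)
```

In pixel space the composition is exact because the transform is linear. The question the paper studies is the harder one: whether a non-linear encoder's representations of such primitives still combine to approximate the representation of the full image.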

[130] Spectral and Spatial Graph Learning for Multispectral Solar Image Compression

Prasiddha Siwakoti,Atefeh Khoshkhahtinat,Piyush M. Mehta,Barbara J. Thompson,Michael S. F. Kirk,Daniel da Silva

Main category: cs.CV

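TL;DR: Proposes a high-fidelity compression framework for multispectral solar images that combines graph embedding with attention mechanisms and clearly outperforms existing methods on the SDOML dataset.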

Details Motivation: In bandwidth-limited space missions, compression of multispectral solar images must balance efficiency against preserving spectral and spatial detail. Method: The iSWGE module models inter-band relationships, representing spectral channels as graph nodes with learned edge features; the WSGA-C module combines sparse graph attention with convolutional attention to reduce spatial redundancy and emphasize fine-scale structures. Result: On six EUV channels of the SDOML dataset, the method reduces MSID by 20.15% and improves PSNR by up to 1.09% and log-transformed MS-SSIM by 1.62% over strong learned baselines, giving sharper and spectrally faithful reconstructions. Conclusion: The learned compression framework delivers better multispectral solar image reconstruction at comparable bit rates and suits space observation missions. Abstract: High-fidelity compression of multispectral solar imagery remains challenging for space missions, where limited bandwidth must be balanced against preserving fine spectral and spatial details. We present a learned image compression framework tailored to solar observations, leveraging two complementary modules: (1) the Inter-Spectral Windowed Graph Embedding (iSWGE), which explicitly models inter-band relationships by representing spectral channels as graph nodes with learned edge features; and (2) the Windowed Spatial Graph Attention and Convolutional Block Attention (WSGA-C), which combines sparse graph attention with convolutional attention to reduce spatial redundancy and emphasize fine-scale structures. Evaluations on the SDOML dataset across six extreme ultraviolet (EUV) channels show that our approach achieves a 20.15% reduction in Mean Spectral Information Divergence (MSID), up to 1.09% PSNR improvement, and a 1.62% log-transformed MS-SSIM gain over strong learned baselines, delivering sharper and spectrally faithful reconstructions at comparable bits-per-pixel rates. The code is publicly available at https://github.com/agyat4/sgraph .
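
The headline metric above is Mean Spectral Information Divergence (MSID). The abstract does not define it, so the sketch below uses the common textbook form of spectral information divergence, the symmetric KL divergence between two spectra normalized to sum to one, averaged over pixels. The paper's exact definition, the `eps` clipping, and the function names are assumptions made for illustration.

```python
import math


def _normalise(spectrum, eps):
    """Clip to a small positive floor and rescale so the values sum to one."""
    clipped = [max(value, eps) for value in spectrum]
    total = sum(clipped)
    return [value / total for value in clipped]


def spectral_information_divergence(x, y, eps=1e-12):
    """Symmetric KL divergence between two per-pixel spectra treated as distributions."""
    p, q = _normalise(x, eps), _normalise(y, eps)
    return sum(pi * math.log(pi / qi) + qi * math.log(qi / pi) for pi, qi in zip(p, q))


def mean_sid(reference_pixels, reconstructed_pixels):
    """Average the divergence over all pixels (each pixel is a list of band values)."""
    pairs = list(zip(reference_pixels, reconstructed_pixels))
    return sum(spectral_information_divergence(x, y) for x, y in pairs) / len(pairs)


# Toy six-band pixels (hypothetical values, not SDOML data).
reference = [[1.0, 2.0, 3.0, 4.0, 5.0, 6.0], [2.0, 2.0, 2.0, 2.0, 2.0, 2.0]]
reconstructed = [[1.1, 1.9, 3.0, 4.2, 4.9, 6.0], [2.0, 2.1, 1.9, 2.0, 2.0, 2.0]]
print(mean_sid(reference, reconstructed))
```

Under this definition the metric ignores overall brightness and responds only to changes in the relative distribution across bands, which is why it complements intensity-based metrics such as PSNR.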

[131] F2IDiff: Real-world Image Super-resolution using Feature to Image Diffusion Foundation Model

Devendra K. Jangid,Ripon K. Saha,Dilshan Godaliyadda,Jing Li,Seok-Jun Lee,Hamid R. Sheikh

Main category: cs.CV

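TL;DR: This paper proposes F2IDiff, a new image super-resolution method conditioned on lower-level DINOv2 features, which reduces hallucination during generation and is especially suited to high-fidelity smartphone photography.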

Details Motivation: Existing text-to-image diffusion models tend to hallucinate excessively in single-image super-resolution, and text features cannot accurately describe the details of small patches, which limits their use on high-resolution smartphone images. Method: Lower-level features extracted by DINOv2 serve as the conditioning input to the diffusion model, forming a Feature-to-Image Diffusion (F2IDiff) foundation model whose stricter conditioning supports more precise super-resolution. Result: The method improves low-resolution image quality while keeping hallucination low, and is particularly suited to high-resolution, high-fidelity smartphone image restoration. Conclusion: Conditioning on more descriptive lower-level features lets F2IDiff balance generation quality against fidelity, helping bring generative AI into consumer photography. Abstract: With the advent of Generative AI, Single Image Super-Resolution (SISR) quality has seen substantial improvement, as the strong priors learned by Text-2-Image Diffusion (T2IDiff) Foundation Models (FM) can bridge the gap between High-Resolution (HR) and Low-Resolution (LR) images. However, flagship smartphone cameras have been slow to adopt generative models because strong generation can lead to undesirable hallucinations. For substantially degraded LR images, as seen in academia, strong generation is required and hallucinations are more tolerable because of the wide gap between LR and HR images. In contrast, in consumer photography, the LR image has substantially higher fidelity, requiring only minimal hallucination-free generation. We hypothesize that generation in SISR is controlled by the stringency and richness of the FM's conditioning feature. First, text features are high-level features, which often cannot describe subtle textures in an image. Additionally, smartphone LR images are at least 12 MP, whereas SISR networks built on T2IDiff FM are designed to perform inference on much smaller images (<1 MP). As a result, SISR inference has to be performed on small patches, which often cannot be accurately described by text features. To address these shortcomings, we introduce an SISR network built on a FM with lower-level feature conditioning, specifically DINOv2 features, which we call a Feature-to-Image Diffusion (F2IDiff) Foundation Model (FM). Lower-level features provide stricter conditioning while being rich descriptors of even small patches.

[132] Using Large Language Models To Translate Machine Results To Human Results

Trishna Niraula,Jonathan Stubblefield

Main category: cs.CV

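TL;DR: This study presents a pipeline that uses YOLOv5 and YOLOv8 for anomaly detection in chest X-rays and a large language model (LLM) to generate natural-language radiology reports, automating the path from image detection to written report.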

Details Motivation: Existing medical imaging AI systems usually output only structured predictions, leaving radiologists to write the report; a method that automatically generates high-quality diagnostic narratives would improve efficiency. Method: YOLOv5 and YOLOv8 perform anomaly detection and output bounding boxes and class labels; these structured results are passed to a large language model (such as GPT-4) to generate descriptive findings and clinical summaries. The two YOLO models are compared on detection accuracy, inference latency, and the quality of the generated text. Result: AI-generated reports show strong semantic similarity to ground-truth reports; in human evaluation GPT-4 scores high on clarity (4.88/5) but lower on natural writing flow (2.81/5). Conclusion: The approach produces clinically accurate radiology reports, but the writing style still differs from radiologist-authored text, and the naturalness of the generated language needs further work. Abstract: Artificial intelligence (AI) has transformed medical imaging, with computer vision (CV) systems achieving state-of-the-art performance in classification and detection tasks. However, these systems typically output structured predictions, leaving radiologists responsible for translating results into full narrative reports. Recent advances in large language models (LLMs), such as GPT-4, offer new opportunities to bridge this gap by generating diagnostic narratives from structured findings. This study introduces a pipeline that integrates YOLOv5 and YOLOv8 for anomaly detection in chest X-ray images with a large language model (LLM) to generate natural-language radiology reports. The YOLO models produce bounding-box predictions and class labels, which are then passed to the LLM to generate descriptive findings and clinical summaries. YOLOv5 and YOLOv8 are compared in terms of detection accuracy, inference latency, and the quality of generated text, as measured by cosine similarity to ground-truth reports. Results show strong semantic similarity between AI and human reports, while human evaluation reveals GPT-4 excels in clarity (4.88/5) but exhibits lower scores for natural writing flow (2.81/5), indicating that current systems achieve clinical accuracy but remain stylistically distinguishable from radiologist-authored text.
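
The entry above measures generated-text quality by cosine similarity to ground-truth reports, without saying how the reports are turned into vectors. The sketch below assumes the simplest possible choice, word-count vectors, purely to show the computation; the study may well use learned sentence embeddings instead, and the example reports are invented.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between word-count vectors of two texts (0.0 if either is empty)."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    dot = sum(counts_a[token] * counts_b[token] for token in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(count * count for count in counts_a.values()))
    norm_b = math.sqrt(sum(count * count for count in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented example reports, for illustration only.
generated = "Opacity in the right lower lobe consistent with consolidation"
reference = "Right lower lobe opacity suggests consolidation"
print(cosine_similarity(generated, reference))
```

A word-count similarity rewards shared vocabulary regardless of word order or phrasing, which fits the abstract's observation that reports can score well on similarity while still reading differently from radiologist-authored text.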

[133] Hierarchical Vector-Quantized Latents for Perceptual Low-Resolution Video Compression

Manikanta Kotthapalli,Banafsheh Rekabdar

Main category: cs.CV

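TL;DR: Proposes a Multi-Scale Vector Quantized Variational Autoencoder (MS-VQ-VAE) that produces compact, high-fidelity latent representations of low-resolution video for efficient storage, transmission, and decoding on edge devices.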

Details Motivation: Traditional video codecs such as H.264 and HEVC target pixel-domain reconstruction and lack support for machine-learning-friendly latent representations, so they are hard to integrate into deep learning pipelines and fall short of growing bandwidth and storage demands. Method: Extends the VQ-VAE-2 framework to the spatiotemporal domain with a two-level hierarchical latent structure built from 3D residual convolutions, and adds a perceptual loss derived from a pre-trained VGG16 network to improve reconstruction quality; the model is lightweight (about 18.5M parameters) and optimized for 64x64 video clips. Result: Trained on UCF101 with 2-second clips, the model reaches 25.96 dB PSNR and 0.8375 SSIM on the test set, and improves over the single-scale baseline by 1.41 dB PSNR and 0.0248 SSIM on the validation set. Conclusion: The proposed MS-VQ-VAE framework suits scalable video compression in bandwidth-sensitive scenarios, including real-time streaming, mobile video analytics, and CDN storage optimization, combining efficiency with good reconstruction quality. Abstract: The exponential growth of video traffic has placed increasing demands on bandwidth and storage infrastructure, particularly for content delivery networks (CDNs) and edge devices. While traditional video codecs like H.264 and HEVC achieve high compression ratios, they are designed primarily for pixel-domain reconstruction and lack native support for machine learning-centric latent representations, limiting their integration into deep learning pipelines. In this work, we present a Multi-Scale Vector Quantized Variational Autoencoder (MS-VQ-VAE) designed to generate compact, high-fidelity latent representations of low-resolution video, suitable for efficient storage, transmission, and client-side decoding. Our architecture extends the VQ-VAE-2 framework to a spatiotemporal setting, introducing a two-level hierarchical latent structure built with 3D residual convolutions. The model is lightweight (approximately 18.5M parameters) and optimized for 64x64 resolution video clips, making it appropriate for deployment on edge devices with constrained compute and memory resources. To improve perceptual reconstruction quality, we incorporate a perceptual loss derived from a pre-trained VGG16 network. Trained on the UCF101 dataset using 2-second video clips (32 frames at 16 FPS), our model achieves 25.96 dB PSNR and 0.8375 SSIM on the test set. On validation, it improves over the single-scale baseline by 1.41 dB PSNR and 0.0248 SSIM. The proposed framework is well-suited for scalable video compression in bandwidth-sensitive scenarios, including real-time streaming, mobile video analytics, and CDN-level storage optimization.
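
As a rough illustration of the vector-quantization step that VQ-VAE-style models rely on, the sketch below maps each latent vector to its nearest codebook entry. It is a minimal sketch under stated assumptions, not the MS-VQ-VAE implementation: the function names `quantize` and `dequantize`, the toy codebook, and the use of squared Euclidean distance are all illustrative choices.

```python
def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry.

    Assumption: nearest is measured by squared Euclidean distance, the usual
    choice in VQ-VAE-style models.
    """
    indices = []
    for z in latents:
        # Squared distance from this latent to every codebook vector.
        dists = [sum((zi - ci) ** 2 for zi, ci in zip(z, c)) for c in codebook]
        indices.append(dists.index(min(dists)))
    return indices


def dequantize(indices, codebook):
    """Look the indices back up to recover the quantized latents."""
    return [codebook[i] for i in indices]


# Hypothetical toy data: a 3-entry codebook of 2-D vectors.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
latents = [[0.9, 1.2], [-0.1, 0.1], [-0.8, 0.7]]
codes = quantize(latents, codebook)
print(codes)
```

Only the integer codes need to be stored or transmitted; a decoder holding the same codebook can recover the quantized latents with `dequantize`.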

[134] PhyGDPO: Physics-Aware Groupwise Direct Preference Optimization for Physically Consistent Text-to-Video Generation

Yuanhao Cai,Kunpeng Li,Menglin Jia,Jialiang Wang,Junzhe Sun,Feng Liang,Weifeng Chen,Felix Juefei-Xu,Chu Wang,Ali Thabet,Xiaoliang Dai,Xuan Ju,Alan Yuille,Ji Hou

Main category: cs.CV

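TL;DR: Proposes PhyGDPO, a physics-aware text-to-video generation method. It builds the large-scale physics-augmented video dataset PhyVidGen-135K and combines a physics-guided rewarding scheme with the efficient LoRA-SR training strategy, significantly outperforming existing open-source methods in physical consistency.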

Details Motivation: Existing text-to-video generation methods follow physical laws poorly, and large-scale training data with rich physical interactions is scarce, which limits how well models capture real-world dynamics. Method: Introduces the PhyAugPipe pipeline, which uses a vision-language model with chain-of-thought reasoning to automatically produce physics-annotated video data; builds the PhyGDPO framework, which performs preference optimization on the groupwise Plackett-Luce model and adds Physics-Guided Rewarding (PGR) and LoRA-Switch Reference (LoRA-SR) for efficient training. Result: Significantly outperforms state-of-the-art open-source methods on the two physics-aware video generation benchmarks PhyGenBench and VideoPhy2, with gains in both physical plausibility and dynamic realism. Conclusion: Combining physics-augmented data construction with a physics-aware preference optimization framework effectively improves the physical consistency of text-to-video models and offers a viable path toward generative systems that better obey real-world laws. Abstract: Recent advances in text-to-video (T2V) generation have achieved good visual quality, yet synthesizing videos that faithfully follow physical laws remains an open challenge. Existing methods, mainly based on graphics or prompt extension, struggle to generalize beyond simple simulated environments or to learn implicit physical reasoning. The scarcity of training data with rich physics interactions and phenomena is also a problem. In this paper, we first introduce a Physics-Augmented video data construction Pipeline, PhyAugPipe, that leverages a vision-language model (VLM) with chain-of-thought reasoning to collect a large-scale training dataset, PhyVidGen-135K. Then we formulate a principled Physics-aware Groupwise Direct Preference Optimization (PhyGDPO) framework that builds upon the groupwise Plackett-Luce probabilistic model to capture holistic preferences beyond pairwise comparisons. In PhyGDPO, we design a Physics-Guided Rewarding (PGR) scheme that embeds VLM-based physics rewards to steer optimization toward physical consistency. We also propose a LoRA-Switch Reference (LoRA-SR) scheme that eliminates memory-heavy reference duplication for efficient training. Experiments show that our method significantly outperforms state-of-the-art open-source methods on PhyGenBench and VideoPhy2. Please check our project page at https://caiyuanhao1998.github.io/project/PhyGDPO for more video results. Our code, models, and data will be released at https://github.com/caiyuanhao1998/Open-PhyGDPO
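
For readers unfamiliar with the Plackett-Luce model mentioned above, the following sketch computes the log-probability it assigns to a full ranking of a group. This is the textbook Plackett-Luce likelihood only; how PhyGDPO turns it into a training loss is not specified in the abstract, and the function name and example scores are hypothetical.

```python
import math


def plackett_luce_log_prob(ranked_scores):
    """Log-probability of a ranking under the standard Plackett-Luce model.

    `ranked_scores` holds one real-valued score per item, listed from the
    most-preferred item to the least-preferred one. At each position the
    chosen item competes (via a softmax) against all items not yet placed.
    """
    log_prob = 0.0
    for k in range(len(ranked_scores)):
        remaining = ranked_scores[k:]
        # Numerically stable log-sum-exp over the items still unranked.
        m = max(remaining)
        log_denominator = m + math.log(sum(math.exp(s - m) for s in remaining))
        log_prob += ranked_scores[k] - log_denominator
    return log_prob


# Hypothetical scores for a group of three candidates, best first.
agreeing = plackett_luce_log_prob([2.0, 1.0, 0.0])
reversed_order = plackett_luce_log_prob([0.0, 1.0, 2.0])
print(agreeing > reversed_order)
```

With two items this reduces to the pairwise Bradley-Terry comparison, which is why a groupwise formulation can be read as a generalization of pairwise preference objectives.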

[135] OCP-LS: An Efficient Algorithm for Visual Localization

Jindi Zhong,Hongxia Wang,Huanshui Zhang

Main category: cs.CV

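TL;DR: Proposes a new second-order optimization algorithm that incorporates the OCP method and approximates the diagonal elements of the Hessian matrix, significantly improving convergence speed, training stability, and noise robustness on large-scale deep learning optimization problems.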

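Details Motivation: Existing optimization algorithms for large-scale deep learning problems fall short in convergence speed, stability, and robustness. Method: Proposes a new second-order optimization algorithm that incorporates the OCP method and reduces computational cost by appropriately approximating the diagonal elements of the Hessian matrix. Result: Extensive experiments on multiple standard visual localization benchmarks show competitive localization accuracy together with faster convergence, more stable training, and better robustness to noise interference. Conclusion: The algorithm shows clear advantages on large-scale deep learning optimization problems and is an efficient, robust optimization framework. Abstract: This paper proposes a novel second-order optimization algorithm. It aims to address large-scale optimization problems in deep learning by incorporating the OCP method and appropriately approximating the diagonal elements of the Hessian matrix. Extensive experiments on multiple standard visual localization benchmarks demonstrate the significant superiority of the proposed method. Compared with conventional optimization algorithms, our framework achieves competitive localization accuracy while exhibiting faster convergence, enhanced training stability, and improved robustness to noise interference.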

[136] RGBT-Ground Benchmark: Visual Grounding Beyond RGB in Complex Real-World Scenarios

Tianyi Zhao,Jiawen Xi,Linhui Xiao,Junnan Li,Xue Yang,Maoxun Yuan,Xingxing Wei

Main category: cs.CV

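TL;DR: Presents RGBT-Ground, the first large-scale visual grounding benchmark for complex real-world scenarios, consisting of aligned RGB and thermal infrared image pairs with high-quality annotations. Also proposes a unified framework supporting uni-modal and multi-modal inputs and the RGBT-VGNet baseline, which proves more robust in challenging scenarios such as nighttime and long distance.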

Details Motivation: Existing visual grounding benchmarks are mostly built on datasets collected in clean environments with limited scene diversity, so they fail to reflect how real-world conditions (such as changes in illumination and weather) affect model robustness and generalization, limiting their value for evaluating safety-critical applications. Method: Builds RGBT-Ground, a large-scale dataset of spatially aligned RGB and thermal infrared image pairs with high-quality referring expressions, bounding boxes, and fine-grained scene-, environment-, and object-level annotations; designs a unified visual grounding framework supporting RGB-only, TIR-only, and RGB-TIR inputs; proposes RGBT-VGNet as a baseline that fuses complementary multi-modal information. Result: Existing methods were extensively adapted to RGBT-Ground, and RGBT-VGNet significantly outperforms them, particularly in nighttime and long-distance scenarios. Conclusion: RGBT-Ground provides a new benchmark for robust visual grounding in complex real-world environments, and RGBT-VGNet validates the effectiveness of multi-modal fusion, advancing research on visual grounding for real-world applications. Abstract: Visual Grounding (VG) aims to localize specific objects in an image according to natural language expressions, serving as a fundamental task in vision-language understanding. However, existing VG benchmarks are mostly derived from datasets collected under clean environments, such as COCO, where scene diversity is limited. Consequently, they fail to reflect the complexity of real-world conditions, such as changes in illumination and weather, that are critical to evaluating model robustness and generalization in safety-critical applications. To address these limitations, we present RGBT-Ground, the first large-scale visual grounding benchmark built for complex real-world scenarios. It consists of spatially aligned RGB and Thermal infrared (TIR) image pairs with high-quality referring expressions, corresponding object bounding boxes, and fine-grained annotations at the scene, environment, and object levels. This benchmark enables comprehensive evaluation and facilitates the study of robust grounding under diverse and challenging conditions. Furthermore, we establish a unified visual grounding framework that supports both uni-modal (RGB or TIR) and multi-modal (RGB-TIR) visual inputs. Based on it, we propose RGBT-VGNet, a simple yet effective baseline for fusing complementary visual modalities to achieve robust grounding. We extensively adapt existing methods to RGBT-Ground. Experimental results show that our proposed RGBT-VGNet significantly outperforms these adapted methods, particularly in nighttime and long-distance scenarios. All resources will be publicly released to promote future research on robust visual grounding in complex real-world environments.

[137] Improving Few-Shot Change Detection Visual Question Answering via Decision-Ambiguity-guided Reinforcement Fine-Tuning

Fuyu Dong,Ke Li,Di Wang,Nan Luo,Yiming Zhang,Kaiyu Li,Jianfei Yang,Quan Wang

Main category: cs.CV

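TL;DR: Proposes DARFT, a reinforcement fine-tuning framework that targets decision ambiguity in change detection visual question answering (CDVQA). It mines decision-ambiguous samples and applies group-relative policy optimization to them, improving the model's discriminability and robustness.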

Details Motivation: Even after supervised fine-tuning, existing CDVQA models still suffer from decision ambiguity: the correct answer and strong distractors receive similar confidence, which hurts performance. Method: Proposes the DARFT framework, which first mines Decision-Ambiguous Samples (DAS) using an SFT-trained reference policy and then applies group-relative policy optimization, based on multi-sample decoding and intra-group relative advantages, to the mined samples. Result: Experiments show that DARFT significantly outperforms SFT baselines in both full-data and few-shot settings, with clear benefits in reducing decision ambiguity and sharpening decision boundaries. Conclusion: Explicitly optimizing decision-ambiguous samples improves CDVQA models, and DARFT offers a new approach to decision uncertainty in vision-language tasks. Abstract: Change detection visual question answering (CDVQA) requires answering text queries by reasoning about semantic changes in bi-temporal remote sensing images. A straightforward approach is to boost CDVQA performance with generic vision-language models via supervised fine-tuning (SFT). Despite recent progress, we observe that a significant portion of failures do not stem from clearly incorrect predictions, but from decision ambiguity, where the model assigns similar confidence to the correct answer and strong distractors. To formalize this challenge, we define Decision-Ambiguous Samples (DAS) as instances with a small probability margin between the ground-truth answer and the most competitive alternative. We argue that explicitly optimizing DAS is crucial for improving the discriminability and robustness of CDVQA models. To this end, we propose DARFT, a Decision-Ambiguity-guided Reinforcement Fine-Tuning framework that first mines DAS using an SFT-trained reference policy and then applies group-relative policy optimization on the mined subset. By leveraging multi-sample decoding and intra-group relative advantages, DARFT suppresses strong distractors and sharpens decision boundaries without additional supervision. Extensive experiments demonstrate consistent gains over SFT baselines, particularly under few-shot settings.
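
The DAS definition above (a small probability margin between the ground-truth answer and the strongest alternative) can be sketched as a simple filter. This is an illustrative reading of the definition, not the paper's mining procedure: the threshold value `0.1`, the use of the absolute margin, and the names `decision_margin` and `mine_das` are assumptions.

```python
def decision_margin(probs, gt_index):
    """Probability of the ground-truth answer minus the best competing answer."""
    best_other = max(p for i, p in enumerate(probs) if i != gt_index)
    return probs[gt_index] - best_other


def mine_das(samples, threshold=0.1):
    """Return indices of samples whose absolute margin is below `threshold`.

    `samples` is a list of (answer_probabilities, ground_truth_index) pairs.
    The threshold is a hypothetical value chosen for illustration.
    """
    return [
        i
        for i, (probs, gt) in enumerate(samples)
        if abs(decision_margin(probs, gt)) < threshold
    ]


# Hypothetical answer distributions from a reference policy; ground truth is index 0.
samples = [
    ([0.48, 0.45, 0.07], 0),  # narrowly correct: ambiguous
    ([0.90, 0.05, 0.05], 0),  # confidently correct: not ambiguous
    ([0.40, 0.45, 0.15], 0),  # narrowly wrong: ambiguous
]
ambiguous = mine_das(samples)
print(ambiguous)
```

Under this reading, both narrowly correct and narrowly wrong predictions count as ambiguous, while confidently wrong predictions would be excluded.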

[138] SliceLens: Fine-Grained and Grounded Error Slice Discovery for Multi-Instance Vision Tasks

Wei Zhang,Chaoqun Wang,Zixuan Guan,Sam Kao,Pengfei Zhao,Peng Wu,Sifeng He

Main category: cs.CV

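TL;DR: Proposes SliceLens, a hypothesis-driven framework built on LLMs and VLMs that discovers fine-grained, interpretable error slices in instance-level vision tasks, and introduces FeSD, the first benchmark for such tasks. Experiments show it significantly outperforms existing methods in both precision and actionability.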

Details Motivation: Existing error slice discovery methods focus on image classification and are hard to apply to multi-instance vision tasks such as detection and segmentation, and they lack fine-grained reasoning about complex visual relationships; current benchmarks also rely on artificial ground truth or do not reflect real model failures. Method: Proposes the SliceLens framework, which uses large language models (LLMs) and vision-language models (VLMs) to generate and verify diverse failure hypotheses and identifies fine-grained, interpretable error slices through grounded visual reasoning; also builds the new FeSD benchmark, with expert-annotated, refined ground-truth error slices grounded to local error regions. Result: On FeSD, SliceLens raises Precision@10 from 0.31 to 0.73, significantly outperforming existing methods; experiments confirm that it finds interpretable error patterns, and model repair experiments show the results lead to practical improvements. Conclusion: By combining LLMs and VLMs, SliceLens achieves efficient, interpretable, fine-grained error slice discovery across instance-level vision tasks, and FeSD provides a more realistic and reliable evaluation standard that advances robust model evaluation. Abstract: Systematic failures of computer vision models on subsets with coherent visual patterns, known as error slices, pose a critical challenge for robust model evaluation. Existing slice discovery methods are primarily developed for image classification, limiting their applicability to multi-instance tasks such as detection, segmentation, and pose estimation. In real-world scenarios, error slices often arise from corner cases involving complex visual relationships, where existing instance-level approaches lacking fine-grained reasoning struggle to yield meaningful insights. Moreover, current benchmarks are typically tailored to specific algorithms or biased toward image classification, with artificial ground truth that fails to reflect real model failures. To address these limitations, we propose SliceLens, a hypothesis-driven framework that leverages LLMs and VLMs to generate and verify diverse failure hypotheses through grounded visual reasoning, enabling reliable identification of fine-grained and interpretable error slices. We further introduce FeSD (Fine-grained Slice Discovery), the first benchmark specifically designed for evaluating fine-grained error slice discovery across instance-level vision tasks, featuring expert-annotated and carefully refined ground-truth slices with precise grounding to local error regions. Extensive experiments on both existing benchmarks and FeSD demonstrate that SliceLens achieves state-of-the-art performance, improving Precision@10 by 0.42 (0.73 vs. 0.31) on FeSD, and identifies interpretable slices that facilitate actionable model improvements, as validated through model repair experiments.

[139] CPJ: Explainable Agricultural Pest Diagnosis via Caption-Prompt-Judge with LLM-Judged Refinement

Wentao Zhang,Tao Fang,Lina Lu,Lifei Wang,Weihe Zhong

Main category: cs.CV

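TL;DR: Proposes CPJ, a training-free few-shot framework that uses structured image captions to improve the accuracy and interpretability of agricultural disease diagnosis, significantly outperforming baselines on multiple metrics.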

Details Motivation: Existing crop disease diagnosis methods depend on costly supervised fine-tuning, perform poorly under domain shift, and lack interpretability. Method: Proposes the Caption-Prompt-Judge (CPJ) framework, which uses large vision-language models to generate multi-angle image captions, refines them iteratively with an LLM-as-Judge module, and feeds them into a dual-answer VQA process that supports both recognition and management decisions. Result: On CDDMBench, with captions generated by GPT-5-mini, GPT-5-Nano gains 22.7 percentage points in disease classification accuracy and 19.5 points in QA score over no-caption baselines. Conclusion: CPJ achieves robust, interpretable agricultural disease diagnosis without fine-tuning and provides transparent, evidence-based reasoning, advancing precision agriculture. Abstract: Accurate and interpretable crop disease diagnosis is essential for agricultural decision-making, yet existing methods often rely on costly supervised fine-tuning and perform poorly under domain shifts. We propose Caption-Prompt-Judge (CPJ), a training-free few-shot framework that enhances Agri-Pest VQA through structured, interpretable image captions. CPJ employs large vision-language models to generate multi-angle captions, refined iteratively via an LLM-as-Judge module, which then inform a dual-answer VQA process for both recognition and management responses. Evaluated on CDDMBench, CPJ significantly improves performance: using GPT-5-mini captions, GPT-5-Nano achieves +22.7 pp in disease classification and +19.5 points in QA score over no-caption baselines. The framework provides transparent, evidence-based reasoning, advancing robust and explainable agricultural diagnosis without fine-tuning. Our code and data are publicly available at: https://github.com/CPJ-Agricultural/CPJ-Agricultural-Diagnosis.

[140] 3D Semantic Segmentation for Post-Disaster Assessment

Nhut Le,Maryam Rahnemoonfar

Main category: cs.CV

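TL;DR: Presents a new 3D semantic segmentation dataset for post-disaster environments, built from UAV footage of areas hit by Hurricane Ian and reconstructed into 3D point clouds with SfM and MVS, and evaluates state-of-the-art segmentation models on it, revealing the limitations of existing methods in disaster scenes.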

Details Motivation: Existing deep learning models lack 3D datasets designed specifically for post-disaster environments, which limits the accuracy and efficiency of post-disaster assessment. Method: Collects UAV aerial footage of areas affected by Hurricane Ian, reconstructs 3D point clouds with Structure-from-Motion (SfM) and Multi-View Stereo (MVS), and evaluates advanced 3D semantic segmentation models (FPT, PTv3, and OA-CNNs) on the resulting dataset. Result: Existing SOTA models perform poorly in complex post-disaster environments, with a marked drop in segmentation accuracy that exposes their weak adaptation to unstructured disaster scenes. Conclusion: The study highlights the urgent need for dedicated 3D benchmark datasets and more robust segmentation algorithms for post-disaster scenes, to improve scene understanding in disaster response. Abstract: The increasing frequency of natural disasters poses severe threats to human lives and leads to substantial economic losses. While 3D semantic segmentation is crucial for post-disaster assessment, existing deep learning models lack datasets specifically designed for post-disaster environments. To address this gap, we constructed a specialized 3D dataset from aerial footage of areas affected by Hurricane Ian (2022), captured by unmanned aerial vehicles (UAVs), employing Structure-from-Motion (SfM) and Multi-View Stereo (MVS) techniques to reconstruct 3D point clouds. We evaluated the state-of-the-art (SOTA) 3D semantic segmentation models Fast Point Transformer (FPT), Point Transformer v3 (PTv3), and OA-CNNs on this dataset, exposing significant limitations in existing methods for disaster-stricken regions. These findings underscore the urgent need for advancements in 3D segmentation techniques and the development of specialized 3D benchmark datasets to improve post-disaster scene understanding and response.

[141] Collaborative Low-Rank Adaptation for Pre-Trained Vision Transformers

Zheng Liu,Jinchao Zhu,Gao Huang

Main category: cs.CV

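TL;DR: Proposes collaborative low-rank adaptation (CLoRA), which uses base-space sharing and sample-agnostic diversity enhancement (SADE) to increase learning capacity while staying parameter-efficient, achieving a better balance between performance and efficiency on image and point cloud tasks.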

Details Motivation: Existing low-rank adaptation methods struggle to balance parameter efficiency and fine-tuning performance: they either lose performance or introduce too many trainable parameters. Method: Proposes CLoRA, which consists of base-space sharing and SADE: base-space sharing lets multiple low-rank modules share projection spaces to increase capacity, and SADE uses regularization to reduce representation redundancy and encourage diversity. Result: Experiments on image and point cloud datasets show that CLoRA strikes a better balance between parameter efficiency and learning performance, and requires the fewest GFLOPs for point cloud analysis. Conclusion: CLoRA improves the representational capacity and diversity of low-rank fine-tuning while remaining highly parameter-efficient, offering a new approach to efficient fine-tuning of vision transformers. Abstract: Low-rank adaptation (LoRA) has achieved remarkable success in fine-tuning pre-trained vision transformers for various downstream tasks. Existing studies mainly focus on exploring more parameter-efficient strategies or more effective representation learning schemes. However, these methods either sacrifice fine-tuning performance or introduce excessive trainable parameters, failing to strike a balance between learning performance and parameter efficiency. To address this problem, we propose a novel tuning method named collaborative low-rank adaptation (CLoRA) in this paper. CLoRA consists of base-space sharing and sample-agnostic diversity enhancement (SADE) components. To maintain parameter efficiency while expanding the learning capacity of low-rank modules (LRMs), base-space sharing allows all LRMs to share a set of down/up-projection spaces. In CLoRA, the low-rank matrices obtained from the shared spaces collaboratively construct each LRM. Since the representations extracted by these matrices may contain redundant information, SADE is employed to regularize the similarities among them to encourage diverse representations in the training process. We conduct extensive experiments on widely used image and point cloud datasets to evaluate the performance of CLoRA. Experimental results demonstrate that CLoRA strikes a better balance between learning performance and parameter efficiency, while requiring the fewest GFLOPs for point cloud analysis, compared with the state-of-the-art methods.

[142] MoniRefer: A Real-world Large-scale Multi-modal Dataset based on Roadside Infrastructure for 3D Visual Grounding

Panquan Yang,Junfei Huang,Zongzhangbao Yin,Yingsong Hu,Anni Xu,Xinyi Luo,Xueqi Sun,Hai Wu,Sheng Ao,Zhaoxing Zhu,Chenglu Wen,Cheng Wang

Main category: cs.CV

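TL;DR: Introduces the new task of 3D visual grounding for outdoor monitoring scenarios and builds MoniRefer, the first large-scale real-world multi-modal dataset for it, together with Moni3DVG, an end-to-end multi-modal method that achieves better 3D object localization in complex traffic environments.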

Details Motivation: Existing 3D visual grounding research concentrates on indoor or autonomous driving scenes, and paired point cloud-text data for roadside infrastructure monitoring is lacking, which limits how well traffic infrastructure can understand natural language instructions. Method: Builds the MoniRefer dataset, with about 136,000 objects and 411,000 natural language descriptions, and proposes Moni3DVG, which fuses appearance information from images with geometric and optical information from point clouds for multi-modal feature learning and 3D object localization. Result: Extensive experiments and ablation studies on the proposed benchmark confirm the superiority and effectiveness of the method for 3D visual grounding. Conclusion: The work advances natural language understanding and object localization for roadside infrastructure in complex traffic environments, providing data and techniques for urban intelligent transportation systems. Abstract: 3D visual grounding aims to localize the object in 3D point cloud scenes that semantically corresponds to given natural language sentences. It is critical for roadside infrastructure systems to interpret natural language and localize relevant target objects in complex traffic environments. However, most existing datasets and approaches for 3D visual grounding focus on indoor and outdoor driving scenes; outdoor monitoring scenarios remain unexplored due to the scarcity of paired point cloud-text data captured by roadside infrastructure sensors. In this paper, we introduce a novel task of 3D Visual Grounding for Outdoor Monitoring Scenarios, which enables infrastructure-level understanding of traffic scenes beyond the ego-vehicle perspective. To support this task, we construct MoniRefer, the first real-world large-scale multi-modal dataset for roadside-level 3D visual grounding. The dataset consists of about 136,018 objects with 411,128 natural language expressions collected from multiple complex traffic intersections in real-world environments. To ensure the quality and accuracy of the dataset, we manually verified all linguistic descriptions and 3D labels for objects. Additionally, we propose a new end-to-end method, named Moni3DVG, which utilizes the rich appearance information provided by images and the geometric and optical information from point clouds for multi-modal feature learning and 3D object localization. Extensive experiments and ablation studies on the proposed benchmarks demonstrate the superiority and effectiveness of our method. Our dataset and code will be released.

[143] LLHA-Net: A Hierarchical Attention Network for Two-View Correspondence Learning

Shuyuan Lin,Yu Guo,Xiao Chen,Yanjie Liang,Guobao Xiao,Feiran Huang

Main category: cs.CV

TL;DR: Proposes the Layer-by-Layer Hierarchical Attention Network, a new method that improves the precision of feature point matching in computer vision and performs especially well when many outliers are present.

Details Motivation: Numerous outliers in feature point matching significantly reduce accuracy and robustness, and with a high outlier ratio it is especially challenging to extract high-quality information while limiting the influence of negative samples. Method: Proposes a network that combines stage fusion, hierarchical extraction, and an attention mechanism; introduces a layer-by-layer channel fusion module that preserves semantic information from each stage and fuses it overall, and designs a hierarchical attention module that adaptively captures and fuses global perception and structural semantic information. Result: Experiments on the public YFCC100M and SUN3D datasets show that the method outperforms several state-of-the-art techniques in outlier removal and camera pose estimation. Conclusion: By strengthening feature representation, the method improves the precision and robustness of feature point matching, with particularly strong performance under high outlier ratios. Abstract: Establishing the correct correspondence of feature points is a fundamental task in computer vision. However, the presence of numerous outliers among the feature points can significantly affect the matching results, reducing the accuracy and robustness of the process. Furthermore, a challenge arises when dealing with a large proportion of outliers: how to ensure the extraction of high-quality information while reducing errors caused by negative samples. To address these issues, in this paper, we propose a novel method called Layer-by-Layer Hierarchical Attention Network, which enhances the precision of feature point matching in computer vision by addressing the issue of outliers. Our method incorporates stage fusion, hierarchical extraction, and an attention mechanism to improve the network's representation capability by emphasizing the rich semantic information of feature points. Specifically, we introduce a layer-by-layer channel fusion module, which preserves the feature semantic information from each stage and achieves overall fusion, thereby enhancing the representation capability of the feature points. Additionally, we design a hierarchical attention module that adaptively captures and fuses global perception and structural semantic information using an attention mechanism. Finally, we propose two architectures to extract and integrate features, thereby improving the adaptability of our network. We conduct experiments on two public datasets, namely YFCC100M and SUN3D, and the results demonstrate that our proposed method outperforms several state-of-the-art techniques in both outlier removal and camera pose estimation. Source code is available at http://www.linshuyuan.com.

[144] FireRescue: A UAV-Based Dataset and Enhanced YOLO Model for Object Detection in Fire Rescue Scenes

Qingyu Xu,Runtong Zhang,Zihuan Qiu,Fanman Meng

Main category: cs.CV

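TL;DR: Proposes a new approach to object detection in fire rescue scenes: it builds the FireRescue dataset, which covers multiple scenarios and key target categories, and introduces the improved FRS-YOLO model to boost detection performance in complex environments.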

Details Motivation: Existing research focuses mainly on mountainous or forest environments and neglects urban rescue scenes, which are more frequent and structurally complex; detection categories are also limited and do not comprehensively cover the targets that matter for command decisions, such as fire trucks and firefighters. Method: Builds a new dataset named FireRescue, with 15,980 images and 32,000 bounding boxes covering eight key target categories across urban, mountainous, forest, and water rescue scenarios; proposes the FRS-YOLO model, which adds a plug-and-play multidimensional collaborative enhancement attention module and a dynamic feature sampler to improve detection of easily confused categories and small targets. Result: Experiments show that object detection in fire rescue scenes is highly challenging, and that the proposed method significantly improves the performance of YOLO series models in this setting, alleviating category confusion and missed detection of small targets. Conclusion: By building a dataset closer to real command needs and designing a targeted detection model, the work advances object detection for fire rescue scenes and offers more reliable technical support for practical use. Abstract: Object detection in fire rescue scenarios is important for command and decision-making in firefighting operations. However, existing research still suffers from two main limitations. First, current work predominantly focuses on environments such as mountainous or forest areas, while paying insufficient attention to urban rescue scenes, which are more frequent and structurally complex. Second, existing detection systems include a limited number of classes, such as flames and smoke, and lack a comprehensive system covering key targets crucial for command decisions, such as fire trucks and firefighters. To address the above issues, this paper first constructs a new dataset named "FireRescue" for rescue command, which covers multiple rescue scenarios, including urban, mountainous, forest, and water areas, and contains eight key categories such as fire trucks and firefighters, with a total of 15,980 images and 32,000 bounding boxes. Second, to tackle the problems of inter-class confusion and missed detection of small targets caused by chaotic scenes, diverse targets, and long-distance shooting, this paper proposes an improved model named FRS-YOLO. On the one hand, the model introduces a plug-and-play multidimensional collaborative enhancement attention module, which enhances the discriminative representation of easily confused categories (e.g., fire trucks vs. ordinary trucks) through cross-dimensional feature interaction. On the other hand, it integrates a dynamic feature sampler to strengthen high-response foreground features, thereby mitigating the effects of smoke occlusion and background interference. Experimental results demonstrate that object detection in fire rescue scenarios is highly challenging, and the proposed method effectively improves the detection performance of YOLO series models in this context.

[145] From Sequential to Spatial: Reordering Autoregression for Efficient Visual Generation

Siyang Wang,Hanting Li,Wei Li,Jie Hu,Xinghao Chen,Feng Zhao

Main category: cs.CV

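TL;DR: Proposes RadAR, a parallelizable autoregressive visual generation framework built on a radial topology, which improves generation efficiency and quality through ring-wise generation and a nested attention mechanism.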

Details Motivation: Traditional autoregressive models decode token by token in raster-scan order, which makes inference slow and does not fully exploit the local dependencies and spatial correlations among visual tokens. Method: Proposes a radial topology that groups image tokens into concentric rings by their spatial distance from a center point and generates them ring by ring, from the inside out, with all tokens in a ring predicted in parallel; introduces a nested attention mechanism that dynamically corrects implausible outputs during the forward pass to reduce error accumulation. Result: Achieves efficient parallel generation with a significant gain in inference speed, while maintaining good generation quality and model stability. Conclusion: RadAR accelerates autoregressive visual generation without sacrificing representational capacity, showing the benefit of combining structural design with dynamic correction. Abstract: Inspired by the remarkable success of autoregressive models in language modeling, this paradigm has been widely adopted in visual generation. However, the sequential token-by-token decoding mechanism inherent in traditional autoregressive models leads to low inference efficiency. In this paper, we propose RadAR, an efficient and parallelizable framework designed to accelerate autoregressive visual generation while preserving its representational capacity. Our approach is motivated by the observation that visual tokens exhibit strong local dependencies and spatial correlations with their neighbors, a property not fully exploited in standard raster-scan decoding orders. Specifically, we organize the generation process around a radial topology: an initial token is selected as the starting point, and all other tokens are systematically grouped into multiple concentric rings according to their spatial distances from this center. Generation then proceeds in a ring-wise manner, from inner to outer regions, enabling the parallel prediction of all tokens within the same ring. This design not only preserves the structural locality and spatial coherence of visual scenes but also substantially increases parallelization. Furthermore, to address the risk of inconsistent predictions arising from simultaneous token generation with limited context, we introduce a nested attention mechanism. This mechanism dynamically refines implausible outputs during the forward pass, thereby mitigating error accumulation and preventing model collapse. By integrating radial parallel prediction with dynamic output correction, RadAR significantly improves generation efficiency.
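
To make the ring-wise ordering concrete, here is a small sketch that groups the positions of a token grid into concentric rings around a center. The abstract does not say which distance RadAR uses or how the center is chosen, so the Chebyshev (square-ring) distance, the default center, and the name `ring_schedule` are assumptions made for illustration.

```python
def ring_schedule(height, width, center=None):
    """Group grid positions into concentric rings around `center`.

    Assumption: ring index is the Chebyshev distance max(|dy|, |dx|), which
    yields square rings. Returns a list of rings ordered from inner to outer;
    each ring is a list of (row, col) positions that could be predicted together.
    """
    if center is None:
        center = (height // 2, width // 2)
    cy, cx = center
    rings = {}
    for y in range(height):
        for x in range(width):
            r = max(abs(y - cy), abs(x - cx))
            rings.setdefault(r, []).append((y, x))
    return [rings[r] for r in sorted(rings)]


schedule = ring_schedule(4, 4)
# Number of tokens predicted at each generation step, inner ring first.
print([len(ring) for ring in schedule])
```

For this 4x4 grid the schedule has 3 steps instead of the 16 steps a raster-scan order would need, which is the source of the parallel speedup the abstract describes.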

Maolin Wang,Bowen Yu,Sheng Zhang,Linjie Mi,Wanyu Wang,Yiqi Wang,Pengyue Jia,Xuetao Wei,Zenglin Xu,Ruocheng Guo,Xiangyu Zhao

Main category: cs.CV

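TL;DR: Proposes RGTN, a renormalization-group-inspired framework for tensor network structure search that achieves efficient, adaptive tensor decomposition through multi-scale continuous optimization.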

Details Motivation: Existing tensor network structure search methods fall short in computational scalability, structural adaptivity, and optimization robustness: they miss multi-scale structure, are confined to discrete search spaces, and optimize structure and parameters separately. Method: Adopts the physics-inspired idea of renormalization group flows, using dynamic scale transformation for continuous structure evolution across resolutions; designs learnable edge gates and an intelligent proposal strategy based on node tension and edge information flow to adjust the topology during optimization. Result: On light field data, high-order synthetic tensors, and video completion tasks, RGTN achieves state-of-the-art compression ratios and runs 4-600x faster than existing methods. Conclusion: With its multi-scale, continuous, physics-guided search paradigm, RGTN substantially improves the efficiency and performance of tensor network structure search and offers a new approach to efficient tensor decomposition. Abstract: Tensor network structure search (TN-SS) aims to automatically discover optimal network topologies and rank configurations for efficient tensor decomposition in high-dimensional data representation. Despite recent advances, existing TN-SS methods face significant limitations in computational tractability, structure adaptivity, and optimization robustness across diverse tensor characteristics. They struggle with three key challenges: single-scale optimization missing multi-scale structures, discrete search spaces hindering smooth structure evolution, and separated structure-parameter optimization causing computational inefficiency. We propose RGTN (Renormalization Group guided Tensor Network search), a physics-inspired framework transforming TN-SS via multi-scale renormalization group flows. Unlike fixed-scale discrete search methods, RGTN uses dynamic scale-transformation for continuous structure evolution across resolutions. Its core innovation includes learnable edge gates for optimization-stage topology modification and intelligent proposals based on physical quantities such as node tension, which measures local stress, and edge information flow, which quantifies connectivity importance. Starting from low-complexity coarse scales and refining to finer ones, RGTN finds compact structures while escaping local minima via scale-induced perturbations. Extensive experiments on light field data, high-order synthetic tensors, and video completion tasks show RGTN achieves state-of-the-art compression ratios and runs 4-600x faster than existing methods, validating the effectiveness of our physics-inspired approach.

[147] Evolving, Not Training: Zero-Shot Reasoning Segmentation via Evolutionary Prompting

Kai Ye,Xiaotong You,Jianghang Lin,Jiayi Ji,Pingyang Dai,Liujuan Cao

Main category: cs.CV

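TL;DR: Proposes EVOL-SAM3, a zero-shot reasoning segmentation framework that uses an inference-time evolutionary search (a Generate-Evaluate-Evolve loop) to overcome the linguistic hallucination, spatial misinterpretation, and training dependency of existing methods.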

Details Motivation: Existing reasoning segmentation methods are limited by the catastrophic forgetting of supervised fine-tuning, the instability of reinforcement learning, or the static inference paradigm of training-free methods, and lack deep reasoning and self-correction. Method: Proposes EVOL-SAM3, which maintains a population of prompt hypotheses and refines them iteratively through a "Generate-Evaluate-Evolve" loop; introduces a reference-free Visual Arena for pairwise evaluation, a Semantic Mutation operator to correct errors, and a Heterogeneous Arena module that adds geometric priors for robustness. Result: On the ReasonSeg benchmark, EVOL-SAM3 substantially outperforms static baselines and, in a zero-shot setting, surpasses fully supervised state-of-the-art methods. Conclusion: Recasting reasoning segmentation as an inference-time evolutionary search is an effective and promising direction, and EVOL-SAM3 offers a new paradigm for zero-shot understanding of complex semantics. Abstract: Reasoning Segmentation requires models to interpret complex, context-dependent linguistic queries to achieve pixel-level localization. Current dominant approaches rely heavily on Supervised Fine-Tuning (SFT) or Reinforcement Learning (RL). However, SFT suffers from catastrophic forgetting and domain dependency, while RL is often hindered by training instability and rigid reliance on predefined reward functions. Although recent training-free methods circumvent these training burdens, they are fundamentally limited by a static inference paradigm. These methods typically rely on a single-pass "generate-then-segment" chain, which suffers from insufficient reasoning depth and lacks the capability to self-correct linguistic hallucinations or spatial misinterpretations. In this paper, we challenge these limitations and propose EVOL-SAM3, a novel zero-shot framework that reformulates reasoning segmentation as an inference-time evolutionary search process. Instead of relying on a fixed prompt, EVOL-SAM3 maintains a population of prompt hypotheses and iteratively refines them through a "Generate-Evaluate-Evolve" loop. We introduce a Visual Arena to assess prompt fitness via reference-free pairwise tournaments, and a Semantic Mutation operator to inject diversity and correct semantic errors. Furthermore, a Heterogeneous Arena module integrates geometric priors with semantic reasoning to ensure robust final selection. Extensive experiments demonstrate that EVOL-SAM3 not only substantially outperforms static baselines but also significantly surpasses fully supervised state-of-the-art methods on the challenging ReasonSeg benchmark in a zero-shot setting. The code is available at https://github.com/AHideoKuzeA/Evol-SAM3.

[148] FlowBlending: Stage-Aware Multi-Model Sampling for Fast and High-Fidelity Video Generation

Jibin Song,Mingi Kwon,Jaeseok Jeong,Youngjung Uh

Main category: cs.CV

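TL;DR: Proposes FlowBlending, a stage-aware multi-model sampling strategy that uses a large model and a small model at different sampling stages to cut inference time and compute, significantly reducing FLOPs and speeding up inference while preserving the large model's generation quality.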

Details Motivation: The impact of model capacity differs across timesteps: the early and late stages are capacity-sensitive, while the intermediate stage is not, and this property can be used to optimize sampling. Method: Proposes FlowBlending, which uses the large model in the capacity-sensitive stages (early and late) and the small model in the intermediate stage; introduces simple criteria for choosing stage boundaries and a velocity-divergence analysis for identifying capacity-sensitive regions. Result: On LTX-Video and WAN 2.1, FlowBlending achieves up to 1.65x faster inference with 57.35% fewer FLOPs while maintaining the large models' visual fidelity, temporal coherence, and semantic alignment, and it can be combined with existing acceleration techniques for up to 2x additional speedup. Conclusion: FlowBlending is an efficient, broadly compatible sampling strategy that adjusts model capacity to the stage of the sampling process, greatly reducing compute while preserving generation quality. Abstract: In this work, we show that the impact of model capacity varies across timesteps: it is crucial for the early and late stages but largely negligible during the intermediate stage. Accordingly, we propose FlowBlending, a stage-aware multi-model sampling strategy that employs a large model at capacity-sensitive stages and a small model at intermediate stages. We further introduce simple criteria to choose stage boundaries and provide a velocity-divergence analysis as an effective proxy for identifying capacity-sensitive regions. Across LTX-Video (2B/13B) and WAN 2.1 (1.3B/14B), FlowBlending achieves up to 1.65x faster inference with 57.35% fewer FLOPs, while maintaining the visual fidelity, temporal coherence, and semantic alignment of the large models. FlowBlending is also compatible with existing sampling-acceleration techniques, enabling up to 2x additional speedup. Project page is available at: https://jibin86.github.io/flowblending_project_page.
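
The stage-aware idea can be sketched as a per-step model selector. This is only an illustration of the large-small-large pattern described in the abstract: the boundary values `early_steps=3` and `late_steps=2` are hypothetical, since the paper's criteria for choosing stage boundaries are not given here, and `pick_model` is an invented name.

```python
def pick_model(step, num_steps, early_steps=3, late_steps=2):
    """Choose which model handles a given sampling step.

    Assumption: the first `early_steps` and the last `late_steps` steps are
    treated as capacity-sensitive and go to the large model; every step in
    between goes to the small model. The boundary values are hypothetical.
    """
    if step < early_steps or step >= num_steps - late_steps:
        return "large"
    return "small"


num_steps = 10
schedule = [pick_model(t, num_steps) for t in range(num_steps)]
print(schedule)
```

In this toy schedule half of the steps run on the small model, which is where the FLOP savings would come from; the actual savings depend on the real boundaries and on the relative cost of the two models.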

[149] EchoFoley: Event-Centric Hierarchical Control for Video Grounded Creative Sound Generation

Bingxuan Li,Yiming Cui,Yicheng He,Yiwei Wang,Shu Zhang,Longyin Wen,Yulei Niu

Main category: cs.CV

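TL;DR: Introduces the EchoFoley task and the EchoVidia framework for fine-grained controllable, video-grounded sound generation; with the newly built EchoFoley-6k dataset and an event-centric generation strategy, it significantly surpasses existing methods in controllability and audio quality.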

Details Motivation: Existing video-text-to-audio (VT2A) methods suffer from visual dominance, the lack of a definition for fine-grained control, and weak instruction understanding, which limits the controllability and semantic accuracy of sound generation. Method: Introduces the EchoFoley task, which uses a symbolic representation to describe the timing, category, and production of sounding events; builds the EchoFoley-6k benchmark; designs EchoVidia, a sounding-event-centric generation framework with a slow-fast thinking strategy. Result: Experiments show that EchoVidia surpasses existing VT2A models by 40.7% in controllability and 12.5% in perceptual quality. Conclusion: The EchoFoley task and the EchoVidia framework address visual dominance, coarse control granularity, and weak instruction understanding in VT2A, advancing the controllability and practicality of video-grounded sound generation. Abstract: Sound effects build an essential layer of multimodal storytelling, shaping the emotional atmosphere and the narrative semantics of videos. Despite recent advances in video-text-to-audio (VT2A), the current formulation faces three key limitations: first, an imbalance between visual and textual conditioning that leads to visual dominance; second, the absence of a concrete definition for fine-grained controllable generation; third, weak instruction understanding and following, as existing datasets rely on brief categorical tags. To address these limitations, we introduce EchoFoley, a new task designed for video-grounded sound generation with both event-level local control and hierarchical semantic control. Our symbolic representation for sounding events specifies when, what, and how each sound is produced within a video or instruction, enabling fine-grained controls like sound generation, insertion, and editing. To support this task, we construct EchoFoley-6k, a large-scale, expert-curated benchmark containing over 6,000 video-instruction-annotation triplets. Building upon this foundation, we propose EchoVidia, a sounding-event-centric agentic generation framework with a slow-fast thinking strategy. Experiments show that EchoVidia surpasses recent VT2A models by 40.7% in controllability and 12.5% in perceptual quality.

[150] Splatwizard: A Benchmark Toolkit for 3D Gaussian Splatting Compression

Xiang Liu,Yimin Zhou,Jinxiang Wang,Yujun Huang,Shuzhao Xie,Shiyu Qin,Mingyao Hong,Jiawei Li,Yaowei Wang,Zhi Wang,Shu-Tao Xia,Bin Chen

Main category: cs.CV

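TL;DR: This paper presents Splatwizard, a unified benchmark toolkit designed for 3D Gaussian Splatting compression models, with automated evaluation of key metrics such as rendering quality, geometric accuracy, frame rate, and resource consumption.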

Details Motivation: Existing 3DGS algorithms lack standardized, comprehensive evaluation tools, especially for compression, which makes it hard to compare methods holistically on rendering speed, rate-distortion trade-offs, memory efficiency, and geometric accuracy. Method: Designs Splatwizard, a unified benchmarking framework that supports implementing new 3DGS compression models and integrates a pipeline that automatically computes image quality metrics, the Chamfer distance of reconstructed meshes, rendering frame rates, and computational resource consumption. Result: Splatwizard provides an easy-to-use framework and an automated evaluation pipeline, supports integration of existing state-of-the-art methods, and enables comprehensive evaluation of 3DGS compression models. Conclusion: Splatwizard fills the gap left by the lack of standardized evaluation tools for 3DGS compression and helps standardize the field and its technical comparisons. Abstract: The recent advent of 3D Gaussian Splatting (3DGS) has marked a significant breakthrough in real-time novel view synthesis. However, the rapid proliferation of 3DGS-based algorithms has created a pressing need for standardized and comprehensive evaluation tools, especially for compression tasks. Existing benchmarks often lack the specific metrics necessary to holistically assess the unique characteristics of different methods, such as rendering speed, rate-distortion trade-offs, memory efficiency, and geometric accuracy. To address this gap, we introduce Splatwizard, a unified benchmark toolkit designed specifically for benchmarking 3DGS compression models. Splatwizard provides an easy-to-use framework for implementing new 3DGS compression models and utilizing state-of-the-art techniques proposed by previous work. In addition, the framework includes an integrated pipeline that automates the calculation of key performance indicators, including image-based quality metrics, the Chamfer distance of the reconstructed mesh, rendering frame rates, and computational resource consumption. Code is available at https://github.com/splatwizard/splatwizard
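The Chamfer distance mentioned among the key performance indicators can be illustrated with a minimal sketch. This is not Splatwizard's implementation: the function name `chamfer_distance` and the convention used here (mean nearest-neighbour Euclidean distance in each direction, summed) are assumptions, and other toolkits square the distances or average the two terms.

```python
import math

def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two non-empty 3D point sets.

    Assumed convention: mean nearest-neighbour Euclidean distance from A to B
    plus the same quantity from B to A. Brute force, O(len(A) * len(B)).
    """
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(points_a, points_b) + one_way(points_b, points_a)

# Hypothetical toy data: points sampled from a reference mesh and a reconstruction.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reconstructed = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
print(chamfer_distance(reference, reconstructed))
```

For real meshes the point sets would be sampled from the surfaces and the nearest-neighbour search would typically use a spatial index rather than brute force.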

[151] UniC-Lift: Unified 3D Instance Segmentation via Contrastive Learning

Ankit Dhiman,Srinath R,Jaswanth Reddy,Lokesh R Boregowda,Venkatesh Babu Radhakrishnan

Main category: cs.CV

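TL;DR: Proposes a unified 3D instance segmentation framework that combines learnable feature embeddings in Gaussian primitives, an "Embedding-to-Label" decoding mechanism, and a boundary hard-mining strategy to resolve inconsistent multi-view 2D instance labels, outperforming existing methods on multiple datasets.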

Details Motivation: To address the performance drop that inconsistent multi-view 2D instance labels cause in existing 3D instance segmentation methods, and the limitations of two-stage approaches, which are inefficient and depend on sensitive hyperparameters or preprocessing. Method: Proposes a unified framework that integrates feature embedding learning with label generation: (1) it introduces a learnable feature embedding in the Gaussian primitives; (2) it decodes the embedding efficiently into instance labels through a novel "Embedding-to-Label" process; (3) to handle artifacts at object boundaries, it hard-mines boundary samples with a triplet loss computed on features transformed by a linear layer, which improves training stability and performance. Result: Achieves better qualitative and quantitative results than the baselines on the ScanNet, Replica3D, and Messy-Rooms datasets, validating the method's effectiveness and robustness. Conclusion: By jointly optimizing feature embedding and label prediction and adding a stable boundary hard-mining strategy, the method significantly improves the performance and training efficiency of 3D instance segmentation and offers a new direction for scene understanding based on 3DGS/NeRF. Abstract: 3D Gaussian Splatting (3DGS) and Neural Radiance Fields (NeRF) have advanced novel-view synthesis. Recent methods extend multi-view 2D segmentation to 3D, enabling instance/semantic segmentation for better scene understanding. A key challenge is the inconsistency of 2D instance labels across views, leading to poor 3D predictions. Existing methods use a two-stage approach in which some rely on contrastive learning with hyperparameter-sensitive clustering, while others preprocess labels for consistency. We propose a unified framework that merges these steps, reducing training time and improving performance by introducing a learnable feature embedding for segmentation in Gaussian primitives. This embedding is then efficiently decoded into instance labels through a novel "Embedding-to-Label" process, effectively integrating the optimization. While this unified framework offers substantial benefits, we observed artifacts at the object boundaries. To address the object boundary issues, we propose hard-mining samples along these boundaries. However, directly applying hard mining to the feature embeddings proved unstable. Therefore, we apply a linear layer to the rasterized feature embeddings before calculating the triplet loss, which stabilizes training and significantly improves performance. Our method outperforms baselines qualitatively and quantitatively on the ScanNet, Replica3D, and Messy-Rooms datasets.
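The idea of applying a linear layer before the triplet loss can be sketched roughly as follows. This is an illustration only, not the authors' code: the helper names `linear` and `triplet_loss`, the 2-D features, the fixed weights, and the margin of 1.0 are all assumptions chosen to keep the arithmetic easy to follow.

```python
import math

def linear(weights, bias, x):
    """Apply y = W x + b to one feature vector (lists of floats)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss with Euclidean distance: max(0, d(a,p) - d(a,n) + margin)."""
    return max(0.0, math.dist(anchor, positive) - math.dist(anchor, negative) + margin)

# Hypothetical 2->2 projection; in training these weights would be learned.
W = [[1.0, 0.0], [0.0, 2.0]]
b = [0.0, 0.0]

anchor = linear(W, b, [0.0, 0.0])         # pixel inside an instance
positive = linear(W, b, [0.0, 0.5])       # same instance
easy_negative = linear(W, b, [3.0, 0.0])  # other instance, far from the boundary
hard_negative = linear(W, b, [0.5, 0.0])  # other instance, right at the boundary

loss_easy = triplet_loss(anchor, positive, easy_negative)
loss_hard = triplet_loss(anchor, positive, hard_negative)
print(loss_easy, loss_hard)
```

The easy negative already satisfies the margin and contributes no loss, while the boundary sample does, which is the reason hard mining concentrates on boundary pixels.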

[152] Projection-based Adversarial Attack using Physics-in-the-Loop Optimization for Monocular Depth Estimation

Takeru Kusakabe,Yudai Hirose,Mashiho Mukaida,Satoshi Ono

Main category: cs.CV

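TL;DR: Proposes a projection-based adversarial attack that uses optimization in the physical environment to validate the vulnerability of depth estimation models.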

Details Motivation: To validate the vulnerability of DNN-based monocular depth estimation models to adversarial attacks, and to improve their robustness. Method: Uses physics-in-the-loop (PITL) optimization together with a distributed covariance matrix adaptation evolution strategy, generating adversarial examples by projecting perturbation light onto the target object. Result: Experiments confirm that the method generates adversarial examples that cause depth misestimation, making parts of objects disappear from the target scene. Conclusion: DNN-based MDE models are vulnerable to projection-based adversarial attacks, and practical applications need stronger robustness by design. Abstract: Deep neural networks (DNNs) remain vulnerable to adversarial attacks that cause misclassification when specific perturbations are added to input images. This vulnerability also threatens the reliability of DNN-based monocular depth estimation (MDE) models, making robustness enhancement a critical need in practical applications. To validate the vulnerability of DNN-based MDE models, this study proposes a projection-based adversarial attack method that projects perturbation light onto a target object. The proposed method employs physics-in-the-loop (PITL) optimization, which evaluates candidate solutions in actual environments to account for device specifications and disturbances, and utilizes a distributed covariance matrix adaptation evolution strategy. Experiments confirmed that the proposed method successfully created adversarial examples that lead to depth misestimations, resulting in parts of objects disappearing from the target scene.

[153] Nonlinear Noise2Noise for Efficient Monte Carlo Denoiser Training

Andrew Tinits,Stephen Mann

Main category: cs.CV

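TL;DR: This paper proposes an improved Noise2Noise method. A theoretical analysis shows that certain nonlinear functions can, under specific conditions, be applied safely to noisy target images without introducing significant bias. The method is applied successfully to high dynamic range (HDR) image denoising and removes the need for clean training data.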

Details Motivation: Noise2Noise does not need clean images as training labels, but applying nonlinear functions to the noisy targets introduces bias, which restricts the preprocessing that can be used; in HDR image denoising in particular, outliers make training difficult. Method: Proposes a theoretical framework for analyzing how nonlinear functions affect Noise2Noise training, identifies a class of nonlinear functions that introduce minimal bias, and combines specific loss functions with tone mapping functions to suppress the influence of outliers. Result: On denoising HDR images from Monte Carlo rendering, a model trained with the method performs close to the original version trained with high-sample-count reference images, while needing only noisy training data. Conclusion: Certain nonlinear operations can be used safely within the Noise2Noise framework, which widens its applicability and, in particular, eases the training difficulties in HDR image denoising. Abstract: The Noise2Noise method allows for training machine learning-based denoisers with pairs of input and target images where both the input and target can be noisy. This removes the need for training with clean target images, which can be difficult to obtain. However, Noise2Noise training has a major limitation: nonlinear functions applied to the noisy targets will skew the results. This bias occurs because the nonlinearity makes the expected value of the noisy targets different from the clean target image. Since nonlinear functions are common in image processing, avoiding them limits the types of preprocessing that can be performed on the noisy targets. Our main insight is that certain nonlinear functions can be applied to the noisy targets without adding significant bias to the results. We develop a theoretical framework for analyzing the effects of these nonlinearities, and describe a class of nonlinear functions with minimal bias. We demonstrate our method on the denoising of high dynamic range (HDR) images produced by Monte Carlo rendering. Noise2Noise training can have trouble with HDR images, where the training process is overwhelmed by outliers and performs poorly. We consider a commonly used method of addressing these training issues: applying a nonlinear tone mapping function to the model output and target images to reduce their dynamic range. This method was previously thought to be incompatible with Noise2Noise training because of the nonlinearities involved. We show that certain combinations of loss functions and tone mapping functions can reduce the effect of outliers while introducing minimal bias.
We apply our method to an existing machine learning-based Monte Carlo denoiser, where the original implementation was trained with high-sample count reference images. Our results approach those of the original implementation, but are produced using only noisy training data.
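The bias described in this abstract can be seen in a small worked example. The sketch below is an illustration, not the paper's analysis: it assumes a Reinhard-style tone map `x / (1 + x)` (the abstract does not name a specific function) and a toy noisy target that takes the values 0 and 2 with equal probability, so its mean, the clean value, is 1.

```python
def reinhard(x):
    """Assumed nonlinear tone map, compressing [0, inf) into [0, 1)."""
    return x / (1.0 + x)

def halve(x):
    """A linear map, included for contrast."""
    return 0.5 * x

def mean(values):
    return sum(values) / len(values)

# Toy noisy target: two equally likely samples whose mean is the clean value.
noisy_samples = [0.0, 2.0]
clean = mean(noisy_samples)

# With an L2 loss a network converges to the mean of its targets, so the
# quantity of interest is E[f(noisy)] compared with f(clean).
bias_nonlinear = mean([reinhard(s) for s in noisy_samples]) - reinhard(clean)
bias_linear = mean([halve(s) for s in noisy_samples]) - halve(clean)
print(bias_nonlinear, bias_linear)
```

Here the tone-mapped targets average to 1/3 while the tone-mapped clean value is 1/2, a bias of about -0.167; the linear map shows no bias. The paper's contribution concerns the combinations of loss and tone mapping for which this gap stays small.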

[154] Video and Language Alignment in 2D Systems for 3D Multi-object Scenes with Multi-Information Derivative-Free Control

Jason Armitage,Rico Sennrich

Main category: cs.CV

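TL;DR: Proposes a new method that performs regret minimisation with derivative-free optimisation to improve how cross-modal systems trained on 2D inputs adapt to 3D scenes.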

Details Motivation: To address the dimensional mismatch that cross-modal systems face when moving from 2D visual inputs to 3D scenes. Method: Uses value-based optimisation and derivative-free optimisation to improve multivariate mutual information estimates through regret minimisation, and to control an in-scene camera. Result: The method lets off-the-shelf cross-modal systems adapt online to object occlusions and differentiate features, improving performance on cross-modal tasks in multi-object 3D scenes. Conclusion: Without pretraining or finetuning, the proposed method improves the adaptability and robustness of cross-modal systems on 3D scenes. Abstract: Cross-modal systems trained on 2D visual inputs are presented with a dimensional shift when processing 3D scenes. An in-scene camera bridges the dimensionality gap but requires learning a control module. We introduce a new method that improves multivariate mutual information estimates by regret minimisation with derivative-free optimisation. Our algorithm enables off-the-shelf cross-modal systems trained on 2D visual inputs to adapt online to object occlusions and differentiate features. The pairing of expressive measures and value-based optimisation assists control of an in-scene camera to learn directly from the noisy outputs of vision-language models. The resulting pipeline improves performance in cross-modal tasks on multi-object 3D scenes without resorting to pretraining or finetuning.

[155] CropTrack: A Tracking with Re-Identification Framework for Precision Agriculture

Md Ahmed Al Muzaddid,Jordan A. James,William J. Beksi

Main category: cs.CV

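TL;DR: CropTrack is a multiple-object tracking framework that combines appearance and motion information, designed for the frequent occlusions and similar object appearances found in agricultural environments.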

Details Motivation: Repetitive patterns, similar object appearances, sudden illumination changes, and frequent occlusions in agricultural environments make it hard for existing trackers to maintain object identities; methods that rely on motion information perform especially poorly under strong occlusion. Method: Proposes CropTrack, which combines a reranking-enhanced appearance association, a one-to-many association with an appearance-based conflict resolution strategy, and an exponential moving average prototype feature bank to strengthen appearance-based association. Result: On publicly available agricultural MOT datasets, CropTrack significantly outperforms existing methods in IDF1 and association accuracy, with fewer identity switches. Conclusion: By fusing appearance and motion information effectively, CropTrack preserves identities more reliably in agricultural multiple-object tracking and advances performance in this domain. Abstract: Multiple-object tracking (MOT) in agricultural environments presents major challenges due to repetitive patterns, similar object appearances, sudden illumination changes, and frequent occlusions. Contemporary trackers in this domain rely on the motion of objects rather than appearance for association. Nevertheless, they struggle to maintain object identities when targets undergo frequent and strong occlusions. The high similarity of object appearances makes integrating appearance-based association nontrivial for agricultural scenarios. To solve this problem we propose CropTrack, a novel MOT framework based on the combination of appearance and motion information. CropTrack integrates a reranking-enhanced appearance association, a one-to-many association with appearance-based conflict resolution strategy, and an exponential moving average prototype feature bank to improve appearance-based association. Evaluated on publicly available agricultural MOT datasets, CropTrack demonstrates consistent identity preservation, outperforming traditional motion-based tracking methods. Compared to the state of the art, CropTrack achieves significant gains in identification F1 and association accuracy scores with a lower number of identity switches.
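An exponential moving average prototype feature bank can be sketched in a few lines. The class below is a generic illustration and not CropTrack's code: the name `PrototypeBank`, the `momentum` parameter, and its default of 0.9 are assumptions.

```python
class PrototypeBank:
    """Keeps one appearance prototype per track, updated by an exponential moving average."""

    def __init__(self, momentum=0.9):
        self.momentum = momentum  # weight given to the existing prototype
        self.prototypes = {}      # track_id -> list of floats

    def update(self, track_id, feature):
        """Blend a new appearance feature into the track's prototype and return it."""
        old = self.prototypes.get(track_id)
        if old is None:
            # First observation of this track: the feature becomes the prototype.
            self.prototypes[track_id] = list(feature)
        else:
            m = self.momentum
            self.prototypes[track_id] = [m * o + (1.0 - m) * f for o, f in zip(old, feature)]
        return self.prototypes[track_id]

bank = PrototypeBank(momentum=0.5)
bank.update(7, [1.0, 0.0])
blended = bank.update(7, [0.0, 1.0])
print(blended)
```

A higher momentum makes the prototype change slowly, so a single occluded or badly lit detection has limited influence on later appearance matching.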

[156] VLN-MME: Diagnosing MLLMs as Language-guided Visual Navigation agents

Xunyi Zhao,Gengze Zhou,Qi Wu

Main category: cs.CV

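TL;DR: This paper presents VLN-MME, a unified evaluation framework for probing the potential of multimodal large language models (MLLMs) as zero-shot embodied agents in vision-and-language navigation (VLN), and reveals their limitations in 3D spatial reasoning and context awareness.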

Details Motivation: To study how MLLMs perform as embodied agents in tasks that require multi-round dialogue, spatial reasoning, and sequential action prediction, and in particular their strengths and weaknesses in vision-and-language navigation. Method: Builds VLN-MME, a modular and extensible evaluation framework that consolidates traditional navigation datasets into a standardized benchmark, and runs zero-shot experiments in which a baseline agent is enhanced with Chain-of-Thought (CoT) reasoning and self-reflection. Result: Although MLLMs can follow instructions and structure their output, adding CoT and self-reflection lowers performance, indicating deficiencies in 3D spatial reasoning and context awareness. Conclusion: VLN-MME provides a basis for systematically evaluating MLLMs on embodied navigation tasks, reveals limitations in their sequential decision-making, and offers guidance for post-training MLLMs as embodied agents. Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities across a wide range of vision-language tasks. However, their performance as embodied agents, which requires multi-round dialogue, spatial reasoning, and sequential action prediction, needs further exploration. Our work investigates this potential in the context of Vision-and-Language Navigation (VLN) by introducing a unified and extensible evaluation framework to probe MLLMs as zero-shot agents by bridging traditional navigation datasets into a standardized benchmark, named VLN-MME. We simplify the evaluation with a highly modular and accessible design. This flexibility streamlines experiments, enabling structured comparisons and component-level ablations across diverse MLLM architectures, agent designs, and navigation tasks. Crucially, enabled by our framework, we observe that enhancing our baseline agent with Chain-of-Thought (CoT) reasoning and self-reflection leads to an unexpected performance decrease. This suggests MLLMs exhibit poor context awareness in embodied navigation tasks; although they can follow instructions and structure their output, their 3D spatial reasoning fidelity is low. VLN-MME lays the groundwork for systematic evaluation of general-purpose MLLMs in embodied navigation settings and reveals limitations in their sequential decision-making capabilities. We believe these findings offer crucial guidance for MLLM post-training as embodied agents.

[157] OFL-SAM2: Prompt SAM2 with Online Few-shot Learner for Efficient Medical Image Segmentation

Meng Lan,Lefei Zhang,Xiaomeng Li

Main category: cs.CV

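TL;DR: Proposes OFL-SAM2, a SAM2 framework that needs no manual prompts and learns online, for label-efficient medical image segmentation; a lightweight mapping network and an adaptive fusion module deliver strong performance with few samples.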

Details Motivation: Applying SAM2 to medical image segmentation requires large amounts of annotated data and high-quality manual prompts, which is time-consuming and depends on expert involvement, so a more efficient approach with lower annotation cost is needed. Method: Designs a lightweight mapping network that uses limited annotated samples to transform generic image features into target features, with an online few-shot learning mechanism that updates parameters dynamically at inference time; also designs an adaptive fusion module that dynamically fuses the generated target features with the memory-attention features of the frozen SAM2. Result: Experiments on three different medical image segmentation datasets show that OFL-SAM2 achieves state-of-the-art performance with little training data. Conclusion: OFL-SAM2 removes SAM2's dependence on manual prompts and large annotated datasets in medical image segmentation, yielding an efficient, label-saving segmentation framework that generalizes well. Abstract: The Segment Anything Model 2 (SAM2) has demonstrated remarkable promptable visual segmentation capabilities in video data, showing potential for extension to medical image segmentation (MIS) tasks involving 3D volumes and temporally correlated 2D image sequences. However, adapting SAM2 to MIS presents several challenges, including the need for extensive annotated medical data for fine-tuning and high-quality manual prompts, which are both labor-intensive and require intervention from medical experts. To address these challenges, we introduce OFL-SAM2, a prompt-free SAM2 framework for label-efficient MIS. Our core idea is to leverage limited annotated samples to train a lightweight mapping network that captures medical knowledge and transforms generic image features into target features, thereby providing additional discriminative target representations for each frame and eliminating the need for manual prompts. Crucially, the mapping network supports online parameter update during inference, enhancing the model's generalization across test sequences. Technically, we introduce two key components: (1) an online few-shot learner that trains the mapping network to generate target features using limited data, and (2) an adaptive fusion module that dynamically integrates the target features with the memory-attention features generated by frozen SAM2, leading to accurate and robust target representation. Extensive experiments on three diverse MIS datasets demonstrate that OFL-SAM2 achieves state-of-the-art performance with limited training data.

[158] FinMMDocR: Benchmarking Financial Multimodal Reasoning with Scenario Awareness, Document Understanding, and Multi-Step Computation

Zichen Tang,Haihong E,Rongjin Li,Jiacheng Liu,Linwei Jia,Zhuodi Hao,Zhongjun Yang,Yuanze Li,Haolin Tian,Xinyi Hu,Peizhi Zhao,Yuan Liu,Zhengyu Wang,Xianghe Wang,Yiling Huang,Xueyuan Lin,Ruofei Bai,Zijian Xie,Qian Huang,Ruining Cao,Haocheng Gao

Main category: cs.CV

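TL;DR: FinMMDocR is a new bilingual multimodal benchmark for evaluating the numerical reasoning of multimodal large language models in real-world financial scenarios, characterized by scenario awareness, document understanding, and multi-step computation.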

Details Motivation: Existing benchmarks fall short in evaluating multimodal numerical reasoning in real-world financial scenarios and lack combined support for implicit financial scenarios, long-document understanding, and complex multi-step reasoning. Method: Builds a bilingual multimodal dataset of 1,200 expert-annotated problems covering 12 types of financial scenarios and 837 long Chinese and English documents of 9 types; problems need 11 reasoning steps on average, and 65% require integrating evidence across pages. Result: The best multimodal large language model reaches only 58.0% accuracy, and different retrieval-augmented generation methods vary significantly in performance, showing that the task is highly challenging. Conclusion: FinMMDocR can drive progress in multimodal large models and reasoning-enhanced methods for complex real-world scenarios. Abstract: We introduce FinMMDocR, a novel bilingual multimodal benchmark for evaluating multimodal large language models (MLLMs) on real-world financial numerical reasoning. Compared to existing benchmarks, our work delivers three major advancements. (1) Scenario Awareness: 57.9% of 1,200 expert-annotated problems incorporate 12 types of implicit financial scenarios (e.g., Portfolio Management), challenging models to perform expert-level reasoning based on assumptions; (2) Document Understanding: 837 Chinese/English documents spanning 9 types (e.g., Company Research) average 50.8 pages with rich visual elements, significantly surpassing existing benchmarks in both breadth and depth of financial documents; (3) Multi-Step Computation: Problems demand 11-step reasoning on average (5.3 extraction + 5.7 calculation steps), with 65.0% requiring cross-page evidence (2.4 pages average). The best-performing MLLM achieves only 58.0% accuracy, and different retrieval-augmented generation (RAG) methods show significant performance variations on this task. We expect FinMMDocR to drive improvements in MLLMs and reasoning-enhanced methods on complex multimodal reasoning tasks in real-world scenarios.

[159] Semi-Supervised Diversity-Aware Domain Adaptation for 3D Object detection

Bartłomiej Olber,Jakub Winter,Paweł Wawrzyński,Andrii Gamalii,Daniel Górniak,Marcin Łojek,Robert Nowak,Krystian Radlak

Main category: cs.CV

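TL;DR: Proposes a lidar domain adaptation method based on neuron activation patterns that reaches high cross-domain 3D object detection performance by selecting a small number of representative samples.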

Details Motivation: To address the poor generalization of 3D object detectors across regions; for example, a model trained in the U.S. may perform poorly in Asia or Europe. Method: Selects a small set of representative and diverse target-domain samples for annotation based on neuron activation patterns, and combines this with post-training techniques inspired by continual learning to prevent weight drift. Result: With a very small annotation budget, the method clearly outperforms linear probing and existing domain adaptation techniques, reaching state-of-the-art performance. Conclusion: Selecting a few target-domain samples well and combining them with techniques that prevent weight drift enables efficient domain adaptation for cross-domain 3D detection. Abstract: 3D object detectors are fundamental components of perception systems in autonomous vehicles. While these detectors achieve remarkable performance on standard autonomous driving benchmarks, they often struggle to generalize across different domains - for instance, a model trained in the U.S. may perform poorly in regions like Asia or Europe. This paper presents a novel lidar domain adaptation method based on neuron activation patterns, demonstrating that state-of-the-art performance can be achieved by annotating only a small, representative, and diverse subset of samples from the target domain if they are correctly selected. The proposed approach requires a very small annotation budget and, when combined with post-training techniques inspired by continual learning, prevents weight drift from the original model. Empirical evaluation shows that the proposed domain adaptation approach outperforms both linear probing and state-of-the-art domain adaptation techniques.

[160] HaineiFRDM: Explore Diffusion to Restore Defects in Fast-Movement Films

Rongji Xun,Junjie Yuan,Zhongjie Wang

Main category: cs.CV

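TL;DR: Proposes HaineiFRDM, an open-source film restoration framework based on a diffusion model. A global-local frequency module and a patch-wise training strategy enable high-resolution film restoration, and a dataset with real degradations is constructed; the method clearly outperforms existing open-source methods.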

Details Motivation: Existing open-source film restoration methods lag behind commercial solutions because they use low-quality synthetic data and noisy optical flow, and they have not explored high-resolution restoration. Method: Proposes the HaineiFRDM framework, which uses a patch-wise training and testing strategy to fit on a single GPU with 24GB of VRAM; designs a position-aware Global Prompt and Frame Fusion Modules; introduces a global-local frequency module to improve texture consistency; uses a low-resolution result as a global residual to mitigate patching artifacts; and builds a film restoration dataset containing real degradations and synthetic data. Result: Experiments show that the method clearly outperforms existing open-source methods in defect restoration, especially on high-resolution films. Conclusion: HaineiFRDM improves the performance of open-source film restoration and advances the use of diffusion models in real film restoration; the code and dataset will be released. Abstract: Existing open-source film restoration methods show limited performance compared to commercial methods due to training with low-quality synthetic data and employing noisy optical flows. In addition, high-resolution films have not been explored by the open-source methods. We propose HaineiFRDM (Film Restoration Diffusion Model), a film restoration framework, to explore the diffusion model's powerful content-understanding ability to help human experts better restore indistinguishable film defects. Specifically, we employ a patch-wise training and testing strategy to make restoring high-resolution films on one 24GB-VRAM GPU possible and design a position-aware Global Prompt and Frame Fusion Modules. Also, we introduce a global-local frequency module to reconstruct consistent textures among different patches. Besides, we first restore a low-resolution result and use it as a global residual to mitigate blocky artifacts caused by the patching process. Furthermore, we construct a film restoration dataset that contains restored real-degraded films and realistic synthetic data. Comprehensive experimental results conclusively demonstrate the superiority of our model in defect restoration ability over existing open-source methods. Code and the dataset will be released.

[161] ProDM: Synthetic Reality-driven Property-aware Progressive Diffusion Model for Coronary Calcium Motion Correction in Non-gated Chest CT

Xinran Gong,Gorkem Durak,Halil Ertugrul Aktas,Vedat Cicek,Jinkui Hao,Ulas Bagci,Nilay S. Shah,Bo Zhou

Main category: cs.CV

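TL;DR: This paper proposes ProDM, a generative diffusion model that restores coronary artery calcium lesions free of motion artifacts from non-gated chest CT, improving the accuracy and clinical usability of CAC scoring.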

Details Motivation: Non-gated chest CT often shows severe artifacts from cardiac and respiratory motion, which degrade the accuracy of coronary artery calcium (CAC) quantification; ECG-gated CT reduces artifacts but has limited availability, so a solution for reliable CAC quantification on routine CT is needed. Method: Proposes the ProDM framework with three core components: (1) a CAC motion simulation data engine that generates non-gated images with diverse motion trajectories from gated CT, supporting supervised training without paired data; (2) a differentiable calcium consistency loss that incorporates calcium-specific priors to preserve lesion integrity; (3) a progressive correction scheme that reduces artifacts gradually across diffusion steps, improving stability and calcium fidelity. Result: Experiments on real patient data show that ProDM significantly improves CAC scoring accuracy, spatial lesion fidelity, and risk stratification performance, and a reader study confirms that it suppresses motion artifacts and improves clinical usability. Conclusion: ProDM offers a promising framework for reliable CAC quantification from routine non-gated chest CT and may help bring cardiovascular risk assessment to broad screening. Abstract: Coronary artery calcium (CAC) scoring from chest CT is a well-established tool to stratify and refine clinical cardiovascular disease risk estimation. CAC quantification relies on the accurate delineation of calcified lesions, but is oftentimes affected by artifacts introduced by cardiac and respiratory motion. ECG-gated cardiac CTs substantially reduce motion artifacts, but their use in population screening and routine imaging remains limited due to gating requirements and lack of insurance coverage. Although identification of incidental CAC from non-gated chest CT is increasingly considered because it offers an accessible and widely available alternative, this modality is limited by more severe motion artifacts. We present ProDM (Property-aware Progressive Correction Diffusion Model), a generative diffusion framework that restores motion-free calcified lesions from non-gated CTs. ProDM introduces three key components: (1) a CAC motion simulation data engine that synthesizes realistic non-gated acquisitions with diverse motion trajectories directly from cardiac-gated CTs, enabling supervised training without paired data; (2) a property-aware learning strategy incorporating calcium-specific priors through a differentiable calcium consistency loss to preserve lesion integrity; and (3) a progressive correction scheme that reduces artifacts gradually across diffusion steps to enhance stability and calcium fidelity.
Experiments on real patient datasets show that ProDM significantly improves CAC scoring accuracy, spatial lesion fidelity, and risk stratification performance compared with several baselines. A reader study on real non-gated scans further confirms that ProDM suppresses motion artifacts and improves clinical usability. These findings highlight the potential of progressive, property-aware frameworks for reliable CAC quantification from routine chest CT imaging.

[162] VIPER: Process-aware Evaluation for Generative Video Reasoning

Yifan Li,Yukai Gu,Yingqian Min,Zikang Liu,Yifan Du,Kun Zhou,Min Yang,Wayne Xin Zhao,Minghui Qiu

Main category: cs.CV

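TL;DR: This paper proposes a process-aware evaluation paradigm for Generative Video Reasoning (GVR), introducing the 16-task VIPER benchmark and the POC@r metric, which measures the consistency between intermediate steps and outcomes. It finds that current state-of-the-art models frequently reach correct outcomes through incorrect processes.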

Details Motivation: Evaluations of video generation models mostly rely on single-frame judgments, which lets models reach correct outcomes through erroneous reasoning (outcome-hacking) and leaves the validity of the reasoning process unassessed. Method: Proposes the VIPER benchmark, covering temporal, structural, symbolic, spatial, physics, and planning reasoning tasks; designs the Process-outcome Consistency (POC@r) metric, based on VLM-as-Judge with a hierarchical rubric, to evaluate the consistency between the reasoning process and the final result. Result: Experiments show that state-of-the-art video generation models reach only about 20% POC@1.0, indicating widespread and severe outcome-process inconsistency; the study also exposes gaps in test-time scaling and sampling robustness. Conclusion: Current video generation models remain far from truly generalized visual reasoning, and process validity needs to be evaluated; VIPER and POC@r give future research a more rigorous evaluation framework. Abstract: Recent breakthroughs in video generation have demonstrated an emerging capability termed Chain-of-Frames (CoF) reasoning, where models resolve complex tasks through the generation of continuous frames. While these models show promise for Generative Video Reasoning (GVR), existing evaluation frameworks often rely on single-frame assessments, which can lead to outcome-hacking, where a model reaches a correct conclusion through an erroneous process. To address this, we propose a process-aware evaluation paradigm. We introduce VIPER, a comprehensive benchmark spanning 16 tasks across temporal, structural, symbolic, spatial, physics, and planning reasoning. Furthermore, we propose Process-outcome Consistency (POC@r), a new metric that utilizes VLM-as-Judge with a hierarchical rubric to evaluate both the validity of the intermediate steps and the final result. Our experiments reveal that state-of-the-art video models achieve only about 20% POC@1.0 and exhibit significant outcome-hacking. We further explore the impact of test-time scaling and sampling robustness, highlighting a substantial gap between current video generation and true generalized visual reasoning. Our benchmark will be publicly released.

[163] ShowUI-$π$: Flow-based Generative Models as GUI Dexterous Hands

Siyuan Hu,Kevin Qinghong Lin,Mike Zheng Shou

Main category: cs.CV

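TL;DR: This paper presents ShowUI-π, the first flow-based generative model for dexterous manipulation in GUI agents. It models discrete clicks and continuous drags in a unified way, and comes with a dataset of 20K drag trajectories and the ScreenDrag benchmark. Experiments show that existing proprietary agents perform poorly on this task, while ShowUI-π achieves the best score (26.98) with 450M parameters.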

Details Motivation: Existing GUI agents rely on discrete click predictions and cannot support continuous interactions that need real-time perception and adjustment (such as dragging a progress bar); this lack of support for complex, flexible interaction limits their ability to automate real digital environments. Method: Proposes ShowUI-π, which uses a flow-based action generation model in which a lightweight action expert predicts incremental cursor adjustments from continuous visual input; designs a unified discrete-continuous action space that integrates clicks and drags; and builds the ScreenDrag dataset (20K drag trajectories) and benchmark, covering five domains (e.g. PowerPoint, Premiere Pro) with online and offline evaluation. Result: Existing proprietary GUI agents perform poorly on ScreenDrag (Operator scores 13.27; Gemini-2.5-CUA is the best of them at 22.18); ShowUI-π scores 26.98 with only 450M parameters, showing stronger drag capability and task adaptability. Conclusion: ShowUI-π is the first to model dexterous manipulation in GUI environments, moving GUI agents toward human-level fine control, and the released ScreenDrag dataset and benchmark provide an important resource for follow-up research. Abstract: Building intelligent agents capable of dexterous manipulation is essential for achieving human-like automation in both robotics and digital environments. However, existing GUI agents rely on discrete click predictions (x,y), which prohibits free-form, closed-loop trajectories (e.g. dragging a progress bar) that require continuous, on-the-fly perception and adjustment. In this work, we develop ShowUI-$π$, the first flow-based generative model as GUI dexterous hand, featuring the following designs: (i) Unified Discrete-Continuous Actions, integrating discrete clicks and continuous drags within a shared model, enabling flexible adaptation across diverse interaction modes; (ii) Flow-based Action Generation for drag modeling, which predicts incremental cursor adjustments from continuous visual observations via a lightweight action expert, ensuring smooth and stable trajectories; (iii) Drag Training data and Benchmark, where we manually collect and synthesize 20K drag trajectories across five domains (e.g. PowerPoint, Adobe Premiere Pro), and introduce ScreenDrag, a benchmark with comprehensive online and offline evaluation protocols for assessing GUI agents' drag capabilities. Our experiments show that proprietary GUI agents still struggle on ScreenDrag (e.g. Operator scores 13.27, and the best Gemini-2.5-CUA reaches 22.18). In contrast, ShowUI-$π$ achieves 26.98 with only 450M parameters, underscoring both the difficulty of the task and the effectiveness of our approach.
We hope this work advances GUI agents toward human-like dexterous control in the digital world. The code is available at https://github.com/showlab/showui-pi.

[164] Evaluating the Impact of Compression Techniques on the Robustness of CNNs under Natural Corruptions

Itallo Patrick Castro Alves Da Silva,Emanuel Adler Medeiros Pereira,Erick de Andrade Barboza,Baldoino Fonseca dos Santos Neto,Marcio de Medeiros Ribeiro

Main category: cs.CV

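TL;DR: This paper comprehensively evaluates the model compression techniques of quantization, pruning, and weight clustering, applied individually and in combination to ResNet-50, VGG-19, and MobileNetV2, and uses the CIFAR-10-C and CIFAR-100-C datasets to analyze the trade-offs between robustness, accuracy, and compression ratio. The results show that certain compression strategies preserve and can even improve robustness to common image corruptions, particularly for networks with more complex architectures. A multi-objective assessment identifies the best configurations and shows the advantage of customized combinations.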

Details Motivation: Model compression helps deploy deep learning models on resource-constrained devices but may affect their robustness under natural corruptions, so the effect of different compression methods on robustness needs systematic evaluation. Method: Applies quantization, pruning, and weight clustering (individually and combined) to ResNet-50, VGG-19, and MobileNetV2, evaluates accuracy, compression ratio, and robustness on the CIFAR-10-C and CIFAR-100-C datasets, and analyzes the performance trade-offs with a multi-objective approach. Result: Certain compression strategies raise the compression ratio while preserving or even improving robustness, especially for networks with complex architectures; customized combinations of techniques achieve better multi-objective performance. Conclusion: Choosing and combining compression strategies sensibly can deliver both efficiency and robustness, giving guidance for deploying efficient vision models in real, noisy environments. Abstract: Compressed deep learning models are crucial for deploying computer vision systems on resource-constrained devices. However, model compression may affect robustness, especially under natural corruption. Therefore, it is important to consider robustness evaluation while validating computer vision systems. This paper presents a comprehensive evaluation of compression techniques (quantization, pruning, and weight clustering), applied individually and in combination to convolutional neural networks (ResNet-50, VGG-19, and MobileNetV2). Using the CIFAR-10-C and CIFAR-100-C datasets, we analyze the trade-offs between robustness, accuracy, and compression ratio. Our results show that certain compression strategies not only preserve but can also improve robustness, particularly on networks with more complex architectures. Utilizing multiobjective assessment, we determine the best configurations, showing that customized technique combinations produce beneficial multi-objective results. This study provides insights into selecting compression methods for robust and efficient deployment of models in corrupted real-world environments.
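Two of the compression techniques named here, quantization and pruning, can be shown on a toy weight vector. This is a generic sketch, not the paper's experimental setup: the function names, the 8-bit affine quantization scheme, and the 40% sparsity level are assumptions, and real experiments would use a deep learning framework's own compression tooling.

```python
def quantize_uint8(weights):
    """Affine 8-bit quantization: map floats to integer codes in [0, 255]."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 when all weights are equal
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo

def dequantize(codes, scale, lo):
    """Map integer codes back to approximate float weights."""
    return [c * scale + lo for c in codes]

def magnitude_prune(weights, sparsity):
    """Zero out the given fraction of weights with the smallest magnitude."""
    k = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    dropped = set(order[:k])
    return [0.0 if i in dropped else w for i, w in enumerate(weights)]

weights = [-1.0, -0.2, 0.05, 0.4, 1.0]
codes, scale, lo = quantize_uint8(weights)
restored = dequantize(codes, scale, lo)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
pruned = magnitude_prune(weights, sparsity=0.4)
print(codes, max_error, pruned)
```

Rounding to the nearest code bounds the reconstruction error by half a quantization step, and pruning removes the two smallest-magnitude weights; robustness studies such as this one ask how those small perturbations interact with corrupted inputs.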

[165] DarkEQA: Benchmarking Vision-Language Models for Embodied Question Answering in Low-Light Indoor Environments

Yohan Park,Hyunwoo Ha,Wonjun Jo,Tae-Hyun Oh

Main category: cs.CV

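TL;DR: This paper presents DarkEQA, an open-source benchmark for evaluating the visual perception of vision-language models under low-light conditions, highlighting the shortcomings of existing models at night or in dark environments.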

Details Motivation: Existing vision-language model benchmarks mainly evaluate performance under ideal lighting and overlook visual degradations that are common in practice, such as low light, which limits robustness in around-the-clock operation. Method: Builds the DarkEQA benchmark, which simulates illumination drop and sensor noise in a physically grounded way in linear RAW space, followed by an ISP rendering pipeline, to control the level of low-light degradation and evaluate question answering from egocentric observations. Result: Experiments on a range of advanced vision-language models and low-light image enhancement models show that current VLMs degrade significantly under low light, exposing their perception bottleneck. Conclusion: DarkEQA is an effective tool for evaluating the robustness of vision-language models in low-light environments and underlines the need to improve perception under real, complex lighting. Abstract: Vision Language Models (VLMs) are increasingly adopted as central reasoning modules for embodied agents. Existing benchmarks evaluate their capabilities under ideal, well-lit conditions, yet robust 24/7 operation demands performance under a wide range of visual degradations, including low-light conditions at night or in dark environments, a core necessity that has been largely overlooked. To address this underexplored challenge, we present DarkEQA, an open-source benchmark for evaluating EQA-relevant perceptual primitives under multi-level low-light conditions. DarkEQA isolates the perception bottleneck by evaluating question answering from egocentric observations under controlled degradations, enabling attributable robustness analysis. A key design feature of DarkEQA is its physical fidelity: visual degradations are modeled in linear RAW space, simulating physics-based illumination drop and sensor noise followed by an ISP-inspired rendering pipeline. We demonstrate the utility of DarkEQA by evaluating a wide range of state-of-the-art VLMs and Low-Light Image Enhancement (LLIE) models. Our analysis systematically reveals VLMs' limitations when operating under these challenging visual conditions. Our code and benchmark dataset will be released upon acceptance.
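The general idea of degrading an image in linear light rather than on display-encoded values can be sketched as follows. This is a strong simplification and not DarkEQA's pipeline: it uses the standard sRGB transfer function as a stand-in for the RAW and ISP stages, and the function `simulate_low_light`, its parameters, and the Gaussian noise model are assumptions.

```python
import random

def srgb_to_linear(v):
    """Standard sRGB decoding for a value in [0, 1]."""
    return v / 12.92 if v <= 0.04045 else ((v + 0.055) / 1.055) ** 2.4

def linear_to_srgb(v):
    """Standard sRGB encoding for a linear value in [0, 1]."""
    return 12.92 * v if v <= 0.0031308 else 1.055 * v ** (1 / 2.4) - 0.055

def simulate_low_light(pixels, exposure_scale, noise_std=0.0, seed=0):
    """Darken sRGB pixel values by scaling in linear light, optionally adding noise."""
    rng = random.Random(seed)  # seeded so the degradation is reproducible
    out = []
    for v in pixels:
        lin = srgb_to_linear(v) * exposure_scale  # illumination drop
        if noise_std > 0:
            lin += rng.gauss(0.0, noise_std)      # assumed additive sensor noise
        lin = min(1.0, max(0.0, lin))             # clip to the valid range
        out.append(linear_to_srgb(lin))           # re-encode for display
    return out

bright = [0.2, 0.5, 0.9]
dark = simulate_low_light(bright, exposure_scale=0.1)
noisy_dark = simulate_low_light(bright, exposure_scale=0.1, noise_std=0.01)
print(dark, noisy_dark)
```

Scaling in linear light matters because display encoding is nonlinear: multiplying encoded values by 0.1 would not correspond to a tenfold drop in scene illumination.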

[166] Bi-C2R: Bidirectional Continual Compatible Representation for Re-indexing Free Lifelong Person Re-identification

Zhenyu Cui,Jiahuan Zhou,Yuxin Peng

Main category: cs.CV

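TL;DR: This paper introduces a new lifelong person re-identification task that requires no re-indexing of historical gallery images (RFL-ReID) and designs the Bidirectional Continuous Compatible Representation (Bi-C2R) framework, which keeps the features of new and old models compatible while avoiding catastrophic forgetting, and achieves strong performance.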

Details Motivation: Existing lifelong person re-identification methods depend on re-indexing the historical gallery, which raises privacy concerns and high computational cost, and otherwise leaves new and old features incompatible, hurting retrieval performance. Method: Proposes the Bi-C2R framework, which uses bidirectional knowledge transfer and a feature update mechanism to keep updating the model and to keep the new and old feature spaces consistent without re-extracting historical features, enabling lifelong learning without re-indexing. Result: Bi-C2R is validated on multiple benchmark datasets, where it leads on the newly introduced RFL-ReID task and reaches state-of-the-art performance on the traditional L-ReID task. Conclusion: Bi-C2R addresses feature compatibility and catastrophic forgetting in RFL-ReID, offering a practical route to efficient, privacy-preserving continual person re-identification. Abstract: Lifelong person Re-IDentification (L-ReID) exploits sequentially collected data to continuously train and update a ReID model, focusing on the overall performance of all data. Its main challenge is to avoid the catastrophic forgetting problem of old knowledge while training on new data. Existing L-ReID methods typically re-extract new features for all historical gallery images for inference after each update, known as "re-indexing". However, historical gallery images often cannot be stored directly because of data privacy concerns, and re-indexing large-scale galleries is costly. This inevitably leads to incompatible retrieval between query features extracted by the updated model and gallery features extracted by earlier models, greatly impairing the re-identification performance. To tackle the above issue, this paper focuses on a new task called Re-index Free Lifelong person Re-IDentification (RFL-ReID), which requires performing lifelong person re-identification without re-indexing historical gallery images. Therefore, RFL-ReID is more challenging than L-ReID, requiring continuous learning and balancing new and old knowledge in diverse streaming data, and making the features output by the new and old models compatible with each other. To this end, we propose a Bidirectional Continuous Compatible Representation (Bi-C2R) framework to continuously update the gallery features extracted by the old model to perform efficient L-ReID in a compatible manner.
We verify our proposed Bi-C2R method through theoretical analysis and extensive experiments on multiple benchmarks, which demonstrate that the proposed method can achieve leading performance on both the introduced RFL-ReID task and the traditional L-ReID task.

[167] FoundationSLAM: Unleashing the Power of Depth Foundation Models for End-to-End Dense Visual SLAM

Yuchen Wu,Jiahe Li,Fabio Tosi,Matteo Poggi,Jin Zheng,Xiao Bai

Main category: cs.CV

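TL;DR: FoundationSLAM is a learning-based monocular dense SLAM system that uses geometric guidance from foundation depth models to improve the accuracy and robustness of trajectory estimation and dense reconstruction, achieving real-time, high-quality performance on multiple datasets.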

Details Motivation: Existing flow-based dense SLAM methods lack geometric consistency, which makes tracking and mapping less accurate and robust and limits their use in complex scenes. Method: Proposes a Hybrid Flow Network that produces geometry-aware correspondences; designs a Bi-Consistent Bundle Adjustment Layer that jointly optimizes keyframe pose and depth under multi-view constraints; and introduces a Reliability-Aware Refinement mechanism that dynamically adapts flow updates, forming a closed feedback loop between matching and optimization. Result: Achieves better trajectory accuracy and dense reconstruction quality than existing methods on multiple challenging datasets, runs at 18 FPS, and shows good generalization and practical potential. Conclusion: By fusing geometric priors from foundation models with learnable flow matching, FoundationSLAM significantly improves the accuracy and robustness of monocular dense SLAM while remaining real-time. Abstract: We present FoundationSLAM, a learning-based monocular dense SLAM system that addresses the absence of geometric consistency in previous flow-based approaches for accurate and robust tracking and mapping. Our core idea is to bridge flow estimation with geometric reasoning by leveraging the guidance from foundation depth models. To this end, we first develop a Hybrid Flow Network that produces geometry-aware correspondences, enabling consistent depth and pose inference across diverse keyframes. To enforce global consistency, we propose a Bi-Consistent Bundle Adjustment Layer that jointly optimizes keyframe pose and depth under multi-view constraints. Furthermore, we introduce a Reliability-Aware Refinement mechanism that dynamically adapts the flow update process by distinguishing between reliable and uncertain regions, forming a closed feedback loop between matching and optimization. Extensive experiments demonstrate that FoundationSLAM achieves superior trajectory accuracy and dense reconstruction quality across multiple challenging datasets while running in real time at 18 FPS, demonstrating strong generalization to various scenarios and the practical applicability of our method.

[168] From Inpainting to Editing: A Self-Bootstrapping Framework for Context-Rich Visual Dubbing

Xu He,Haoxian Zhang,Hejia Chen,Changyuan Zheng,Liyang Chen,Songlin Tang,Jiehui Huang,Xiaoqiang Liu,Pengfei Wan,Zhiyong Wu

Main category: cs.CV

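TL;DR: This paper proposes a self-bootstrapping framework that turns audio-driven visual dubbing from an ill-posed inpainting task into a well-conditioned video editing problem, using a Diffusion Transformer to generate ideal training data and to perform end-to-end editing, achieving accurate lip synchronization and identity preservation.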

Details Motivation: Existing methods lack ideal paired training data (video pairs that differ only in lip motion while all other conditions are identical) and therefore rely on a mask-based inpainting paradigm, which leads to visual artifacts, identity drift, and poor synchronization. Method: Uses a Diffusion Transformer as a data generator to synthesize a lip-altered version of each real video, forming aligned video pairs; then trains a DiT-based audio-driven editor on these data, learning end-to-end from complete frame pairs, and introduces a timestep-adaptive multi-phase training strategy to disentangle editing objectives. Result: The method significantly outperforms existing methods in lip-sync accuracy, identity preservation, and real-world robustness, especially in complex in-the-wild scenes. Conclusion: By reframing the problem and constructing ideal training data, the paper achieves high-quality audio-driven visual dubbing and provides a new paradigm and the ContextDubBench evaluation benchmark for future research. Abstract: Audio-driven visual dubbing aims to synchronize a video's lip movements with new speech, but is fundamentally challenged by the lack of ideal training data: paired videos where only a subject's lip movements differ while all other visual conditions are identical. Existing methods circumvent this with a mask-based inpainting paradigm, where an incomplete visual conditioning forces models to simultaneously hallucinate missing content and sync lips, leading to visual artifacts, identity drift, and poor synchronization. In this work, we propose a novel self-bootstrapping framework that reframes visual dubbing from an ill-posed inpainting task into a well-conditioned video-to-video editing problem. Our approach employs a Diffusion Transformer, first as a data generator, to synthesize ideal training data: a lip-altered companion video for each real sample, forming visually aligned video pairs. A DiT-based audio-driven editor is then trained on these pairs end-to-end, leveraging the complete and aligned input video frames to focus solely on precise, audio-driven lip modifications. This complete, frame-aligned input conditioning forms a rich visual context for the editor, providing it with complete identity cues, scene interactions, and continuous spatiotemporal dynamics. Leveraging this rich context enables our method to achieve highly accurate lip sync, faithful identity preservation, and exceptional robustness against challenging in-the-wild scenarios. We further introduce a timestep-adaptive multi-phase learning strategy as a necessary component to disentangle conflicting editing objectives across diffusion timesteps, thereby facilitating stable training and yielding enhanced lip synchronization and visual fidelity. Additionally, we propose ContextDubBench, a comprehensive benchmark dataset for robust evaluation in diverse and challenging practical application scenarios.

[169] FineTec: Fine-Grained Action Recognition Under Temporal Corruption via Skeleton Decomposition and Sequence Completion

Dian Shao,Mingfei Shi,Like Liu

Main category: cs.CV

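TL;DR: This paper proposes FineTec, a unified framework for fine-grained action recognition under temporal corruption. It restores and augments skeleton sequences through context-aware completion, spatial decomposition, and physics-driven estimation, substantially improving recognition performance under severe missing data.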

Details Motivation: Existing methods struggle to recover temporal dynamics and fine-grained spatial structure accurately, so the subtle motion cues that separate similar actions are lost; they perform poorly in real-world settings where online pose estimation yields substantial missing data. Method: FineTec first restores a base skeleton sequence using context-aware completion with diverse temporal masking; it then generates two augmented sequences through a spatial decomposition based on semantic regions and motion variance; finally, it estimates joint accelerations with Lagrangian dynamics and feeds the fused position and acceleration sequences into a GCN for action recognition. Result: Experiments on NTU-60, NTU-120, Gym99, and Gym288 show that FineTec outperforms previous methods at different levels of temporal corruption, reaching top-1 accuracies of 89.1% and 78.1% on the Gym99-severe and Gym288-severe settings, respectively. Conclusion: By combining context-aware completion, semantic spatial decomposition, and physics-driven modeling, FineTec improves the robustness and generalization of fine-grained action recognition under temporal data corruption. Abstract: Recognizing fine-grained actions from temporally corrupted skeleton sequences remains a significant challenge, particularly in real-world scenarios where online pose estimation often yields substantial missing data. Existing methods often struggle to accurately recover temporal dynamics and fine-grained spatial structures, resulting in the loss of subtle motion cues crucial for distinguishing similar actions. To address this, we propose FineTec, a unified framework for Fine-grained action recognition under Temporal Corruption. FineTec first restores a base skeleton sequence from corrupted input using context-aware completion with diverse temporal masking. Next, a skeleton-based spatial decomposition module partitions the skeleton into five semantic regions, further divides them into dynamic and static subgroups based on motion variance, and generates two augmented skeleton sequences via targeted perturbation. These, along with the base sequence, are then processed by a physics-driven estimation module, which utilizes Lagrangian dynamics to estimate joint accelerations. Finally, both the fused skeleton position sequence and the fused acceleration sequence are jointly fed into a GCN-based action recognition head. Extensive experiments on both coarse-grained (NTU-60, NTU-120) and fine-grained (Gym99, Gym288) benchmarks show that FineTec significantly outperforms previous methods under various levels of temporal corruption. Specifically, FineTec achieves top-1 accuracies of 89.1% and 78.1% on the challenging Gym99-severe and Gym288-severe settings, respectively, demonstrating its robustness and generalizability. Code and datasets can be found at https://smartdianlab.github.io/projects-FineTec/.
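
The dynamic/static split and the acceleration channel described in the FineTec abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it skips the five semantic regions, uses the median variance as a hypothetical threshold, and replaces the Lagrangian-dynamics estimator with a plain central finite difference. All function names are made up for illustration.

```python
from statistics import median, pvariance


def joint_motion_variance(frames):
    """frames[t][j] is the (x, y) position of joint j at frame t.
    Returns one score per joint: variance of x plus variance of y over time."""
    num_joints = len(frames[0])
    scores = []
    for j in range(num_joints):
        xs = [frame[j][0] for frame in frames]
        ys = [frame[j][1] for frame in frames]
        scores.append(pvariance(xs) + pvariance(ys))
    return scores


def split_dynamic_static(frames, threshold=None):
    """Split joint indices into (dynamic, static) by motion variance.
    The median threshold is an assumption, not taken from the paper."""
    scores = joint_motion_variance(frames)
    if threshold is None:
        threshold = median(scores)
    dynamic = [j for j, s in enumerate(scores) if s > threshold]
    static = [j for j, s in enumerate(scores) if s <= threshold]
    return dynamic, static


def finite_difference_acceleration(frames, dt=1.0):
    """Central second difference per joint and coordinate.
    A stand-in for the physics-driven estimator; returns len(frames) - 2 frames."""
    acc = []
    for t in range(1, len(frames) - 1):
        acc.append([
            tuple(
                (frames[t + 1][j][k] - 2 * frames[t][j][k] + frames[t - 1][j][k]) / dt ** 2
                for k in range(2)
            )
            for j in range(len(frames[t]))
        ])
    return acc


# Joint 0 stays at the origin; joint 1 moves along x with x = t^2.
frames = [[(0.0, 0.0), (float(t * t), 0.0)] for t in range(4)]
dynamic, static = split_dynamic_static(frames)
acc = finite_difference_acceleration(frames)
```

In this toy input the moving joint lands in the dynamic group and its estimated x-acceleration is constant, as expected for quadratic motion.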

[170] Edit3r: Instant 3D Scene Editing from Sparse Unposed Images

Jiageng Liu,Weijie Lyu,Xueting Li,Yejie Guo,Ming-Hsuan Yang

Main category: cs.CV

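TL;DR: Edit3r is a feed-forward framework that reconstructs and edits 3D scenes in a single pass from unposed, view-inconsistent, instruction-edited images, without per-scene optimization or pose estimation, offering fast, high-fidelity, 3D-consistent results.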

Details Motivation: Existing 3D scene editing methods usually require per-scene optimization and accurate pose estimation, which limits editing speed and practicality, and multi-view-consistent edited images for supervised training are lacking. Method: Proposes Edit3r, a feed-forward network that directly predicts instruction-aligned 3D edits; a SAM2-based recoloring strategy generates cross-view-consistent supervision, and an asymmetric input strategy fuses a reference view with auxiliary views to improve alignment. Result: On the newly proposed DL3DV-Edit-Bench benchmark, Edit3r achieves better semantic alignment and 3D consistency than existing methods with significantly faster inference, and it handles inputs from 2D editors such as InstructPix2Pix. Conclusion: Edit3r enables fast, optimization-free, single-pass 3D scene editing, balances photorealistic rendering with editing consistency, and shows potential for real-time 3D editing applications. Abstract: We present Edit3r, a feed-forward framework that reconstructs and edits 3D scenes in a single pass from unposed, view-inconsistent, instruction-edited images. Unlike prior methods requiring per-scene optimization, Edit3r directly predicts instruction-aligned 3D edits, enabling fast and photorealistic rendering without optimization or pose estimation. A key challenge in training such a model lies in the absence of multi-view consistent edited images for supervision. We address this with (i) a SAM2-based recoloring strategy that generates reliable, cross-view-consistent supervision, and (ii) an asymmetric input strategy that pairs a recolored reference view with raw auxiliary views, encouraging the network to fuse and align disparate observations. At inference, our model effectively handles images edited by 2D methods such as InstructPix2Pix, despite not being exposed to such edits during training. For large-scale quantitative evaluation, we introduce DL3DV-Edit-Bench, a benchmark built on the DL3DV test split, featuring 20 diverse scenes, 4 edit types and 100 edits in total. Comprehensive quantitative and qualitative results show that Edit3r achieves superior semantic alignment and enhanced 3D consistency compared to recent baselines, while operating at significantly higher inference speed, making it promising for real-time 3D editing applications.

[171] GaMO: Geometry-aware Multi-view Diffusion Outpainting for Sparse-View 3D Reconstruction

Yi-Chuan Huang,Hao-Jen Chien,Chin-Yang Lin,Ying-Huan Chen,Yu-Lun Liu

Main category: cs.CV

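TL;DR: This paper proposes GaMO, a geometry-aware multi-view outpainting framework for sparse-view 3D scene reconstruction. Rather than generating novel views, GaMO expands the field of view of existing views, preserving geometric consistency and improving coverage; it runs zero-shot without training, outperforms existing methods, and is 25 times faster.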

Details Motivation: Existing diffusion-based methods for sparse-view reconstruction suffer from insufficient field-of-view coverage, geometric inconsistency, and high computational cost; this paper aims to address these limitations with a multi-view outpainting strategy. Method: Proposes the GaMO framework, which uses multi-view conditioning and geometry-aware denoising to expand the field of view of existing views (rather than generating new viewpoints) and runs in a zero-shot setting without additional training. Result: On the Replica and ScanNet++ datasets, GaMO reaches state-of-the-art reconstruction quality with 3, 6, and 9 input views, outperforms prior methods in PSNR and LPIPS, and processes a scene in under 10 minutes, 25 times faster than existing diffusion-based methods. Conclusion: Through geometry-aware multi-view outpainting, GaMO addresses key challenges in sparse-view reconstruction, with clear gains in quality, coverage, and efficiency, offering a new approach to efficient, high-quality 3D reconstruction. Abstract: Recent advances in 3D reconstruction have achieved remarkable progress in high-quality scene capture from dense multi-view imagery, yet struggle when input views are limited. Various approaches, including regularization techniques, semantic priors, and geometric constraints, have been implemented to address this challenge. The latest diffusion-based methods have demonstrated substantial improvements by generating novel views from new camera poses to augment training data, surpassing earlier regularization and prior-based techniques. Despite this progress, we identify three critical limitations in these state-of-the-art approaches: inadequate coverage beyond known view peripheries, geometric inconsistencies across generated views, and computationally expensive pipelines. We introduce GaMO (Geometry-aware Multi-view Outpainter), a framework that reformulates sparse-view reconstruction through multi-view outpainting. Instead of generating new viewpoints, GaMO expands the field of view from existing camera poses, which inherently preserves geometric consistency while providing broader scene coverage. Our approach employs multi-view conditioning and geometry-aware denoising strategies in a zero-shot manner without training. Extensive experiments on Replica and ScanNet++ demonstrate state-of-the-art reconstruction quality across 3, 6, and 9 input views, outperforming prior methods in PSNR and LPIPS, while achieving a $25\times$ speedup over SOTA diffusion-based methods with processing time under 10 minutes. Project page: https://yichuanh.github.io/GaMO/
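
To make "expanding the field of view from existing camera poses" concrete, the sketch below works through the pinhole-camera arithmetic. It assumes, hypothetically, that an outpainted image is modeled as a larger canvas rendered from the same pose with the same focal length, so only the image size and principal point change. The function names and the padding scheme are illustrative and not taken from the GaMO code.

```python
import math


def horizontal_fov_deg(width_px, focal_px):
    """Horizontal field of view of a pinhole camera, in degrees."""
    return math.degrees(2.0 * math.atan(width_px / (2.0 * focal_px)))


def outpaint_intrinsics(width_px, height_px, focal_px, cx, cy, pad_px):
    """Intrinsics of a canvas padded by pad_px on every side.
    Pose and focal length are unchanged; the principal point shifts by the padding."""
    return {
        "width": width_px + 2 * pad_px,
        "height": height_px + 2 * pad_px,
        "focal": focal_px,
        "cx": cx + pad_px,
        "cy": cy + pad_px,
    }


# A 640x480 view with a 320 px focal length, padded by 160 px per side.
original_fov = horizontal_fov_deg(640, 320.0)
padded = outpaint_intrinsics(640, 480, 320.0, 320.0, 240.0, 160)
expanded_fov = horizontal_fov_deg(padded["width"], padded["focal"])
```

Because the camera pose is untouched, the padded pixels extend the same viewing rays outward, which is one way to see why outpainting does not introduce a new viewpoint that must be kept consistent with the others.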

[172] SpaceTimePilot: Generative Rendering of Dynamic Scenes Across Space and Time

Zhening Huang,Hyeonho Jeong,Xuelin Chen,Yulia Gryaditskaya,Tuanfeng Y. Wang,Joan Lasenby,Chun-Hao Huang

Main category: cs.CV

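TL;DR: This paper proposes SpaceTimePilot, a video diffusion model that disentangles space and time, allowing the camera viewpoint and the motion sequence to be controlled independently during generation so that dynamic scenes can be freely re-rendered across space and time.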

Details Motivation: Existing methods struggle to control viewpoint changes and motion timing precisely at the same time in generated video, and no dataset provides paired videos of the same scene with multiple temporal variations for training. This paper therefore aims for high-quality generation with separable space and time control. Method: Proposes a new animation time-embedding mechanism and a temporal-warping training strategy that uses multi-view data to mimic temporal differences; introduces an improved camera-conditioning mechanism and the new synthetic CamxTime dataset for more precise dual control. The model is trained jointly on the temporal-warping scheme and the CamxTime dataset. Result: Shows strong space-time disentanglement on both real and synthetic data, controls viewpoint and motion sequence flexibly, and generates higher-quality results than prior methods. Conclusion: Through its time modeling and data strategy, SpaceTimePilot achieves efficient and precise space-time disentangled control in a diffusion model, advancing controllable video generation. Abstract: We present SpaceTimePilot, a video diffusion model that disentangles space and time for controllable generative rendering. Given a monocular video, SpaceTimePilot can independently alter the camera viewpoint and the motion sequence within the generative process, re-rendering the scene for continuous and arbitrary exploration across space and time. To achieve this, we introduce an effective animation time-embedding mechanism in the diffusion process, allowing explicit control of the output video's motion sequence with respect to that of the source video. As no datasets provide paired videos of the same dynamic scene with continuous temporal variations, we propose a simple yet effective temporal-warping training scheme that repurposes existing multi-view datasets to mimic temporal differences. This strategy effectively supervises the model to learn temporal control and achieve robust space-time disentanglement. To further enhance the precision of dual control, we introduce two additional components: an improved camera-conditioning mechanism that allows altering the camera from the first frame, and CamxTime, the first synthetic space-and-time full-coverage rendering dataset that provides fully free space-time video trajectories within a scene. Joint training on the temporal-warping scheme and the CamxTime dataset yields more precise temporal control. We evaluate SpaceTimePilot on both real-world and synthetic data, demonstrating clear space-time disentanglement and strong results compared to prior work. Project page: https://zheninghuang.github.io/Space-Time-Pilot/ Code: https://github.com/ZheningHuang/spacetimepilot
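
The temporal-warping idea in the SpaceTimePilot abstract, reusing synchronized multi-view footage to fake temporal differences, can be sketched as index resampling. The abstract does not list the warps the authors use, so the modes below (`reverse`, `slow`, `freeze`) and the helper names are hypothetical. The target clip is taken from a second view at warped frame indices, and those indices double as the per-frame time labels a time embedding could consume.

```python
def warp_indices(num_frames, mode):
    """Map each output frame to a source-time index under a simple warp.
    The set of modes is an illustrative assumption."""
    if mode == "identity":
        return list(range(num_frames))
    if mode == "reverse":
        return list(range(num_frames - 1, -1, -1))
    if mode == "slow":  # half speed: each source time is shown twice
        return [i // 2 for i in range(num_frames)]
    if mode == "freeze":  # hold the middle frame for the whole clip
        return [num_frames // 2] * num_frames
    raise ValueError(f"unknown warp mode: {mode}")


def make_training_pair(view_a, view_b, mode):
    """Build (source, target, time_labels) from two synchronized views.
    The source keeps view A in original order; the target samples view B
    at warped times, so camera and time differ between the two clips."""
    if len(view_a) != len(view_b):
        raise ValueError("views must be synchronized and equally long")
    labels = warp_indices(len(view_a), mode)
    target = [view_b[t] for t in labels]
    return list(view_a), target, labels


# Frames are stand-in strings here; real data would be image tensors.
view_a = ["a0", "a1", "a2", "a3"]
view_b = ["b0", "b1", "b2", "b3"]
source, target, labels = make_training_pair(view_a, view_b, "reverse")
```

Pairs built this way differ in both camera and time while remaining exactly supervised, which is the property the abstract attributes to the temporal-warping scheme.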