cs.CL

[1] Enriching Historical Records: An OCR and AI-Driven Approach for Database Integration

Zahra Abedi,Richard M. K. van Dijk,Gijs Wijnholds,Tessa Verhoef

Main category: cs.CL

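TL;DR: This study proposes an automated pipeline that combines OCR, generative AI, and database linking to digitize biographical data on professors from historical Leiden University documents, with reasonably high extraction and matching accuracy.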

Motivation: To integrate unstructured information from historical document images with existing high-quality databases efficiently, and to avoid the slow, error-prone work of manual transcription, the study aims to build a digitization pipeline that processes historical documents automatically.

Method: OCR is applied to the 1983-1985 Leidse hoogleraren en lectoren books; generative AI with structural constraints imposed at decoding time extracts JSON data from the OCR output; a record linkage algorithm matches the extracted records against the existing database.

Result: OCR reaches a character error rate (CER) of 1.08% and a word error rate (WER) of 5.06%. JSON extraction averages 63% accuracy from OCR text (65% from annotated text). Record linkage reaches 94% accuracy on annotated JSON and 81% on OCR-derived JSON.

Conclusion: The automated pipeline copes with the layout variability and terminology differences of historical documents, generative AI partly compensates for OCR shortcomings, and the approach offers a workable technical framework for digitizing historical sources in the digital humanities.

Abstract: This research digitizes and analyzes the Leidse hoogleraren en lectoren 1575-1815 books written between 1983 and 1985, which contain biographic data about professors and curators of Leiden University. It addresses the central question: how can we design an automated pipeline that integrates OCR, LLM-based interpretation, and database linking to harmonize data from historical document images with existing high-quality database records? We applied OCR techniques, generative AI decoding constraints that structure data extraction, and database linkage methods to process typewritten historical records into a digital format. OCR achieved a Character Error Rate (CER) of 1.08 percent and a Word Error Rate (WER) of 5.06 percent, while JSON extraction from OCR text achieved an average accuracy of 63 percent and, based on annotated OCR, 65 percent. This indicates that generative AI somewhat corrects low OCR performance. Our record linkage algorithm linked annotated JSON files with 94% accuracy and OCR-derived JSON files with 81%. This study contributes to digital humanities research by offering an automated pipeline for interpreting digitized historical documents, addressing challenges like layout variability and terminology differences, and exploring the applicability and strength of an advanced generative AI model.
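
For readers unfamiliar with the two OCR metrics quoted above, the following is a minimal sketch of how CER and WER are commonly computed: the Levenshtein edit distance between reference and OCR output, divided by the reference length in characters or words. The function names are illustrative and the paper's own evaluation code is not shown in the source, so details such as text normalization may differ.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference item
                            curr[j - 1] + 1,      # insert a hypothesis item
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits divided by number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

Because one wrong character makes a whole word wrong, WER is normally several times higher than CER on the same text, which matches the 1.08% versus 5.06% pattern reported.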

[2] CAT: A Metric-Driven Framework for Analyzing the Consistency-Accuracy Relation of LLMs under Controlled Input Variations

Paulo Cavalin,Cassia Sanctos,Marcelo Grave,Claudio Pinhanez,Yago Primerano

Main category: cs.CL

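TL;DR: This paper proposes CAT, a framework for evaluating and visualizing the interplay between accuracy and response consistency of large language models under controlled input variations. Its core components are Consistency-Accuracy Relation (CAR) curves and the Consistency-Oriented Robustness Estimate (CORE) index, which quantify the trade-off between accuracy and consistency.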

Motivation: Existing evaluation practice focuses mainly on accuracy or benchmark scores, while consistency has recently come to be seen as an important property for deploying LLMs in high-stakes real-world applications. The interdependence of accuracy and consistency has not been adequately considered, so a more fine-grained evaluation method is needed.

Method: The CAT framework uses multiple-choice benchmarks as a case study. It introduces CAR curves and the MCA metric to show how model accuracy changes as the consistency requirement increases, and proposes the CORE index as a combined measure of the accuracy-consistency trade-off.

Result: A practical demonstration covers a range of generalist and domain-specific LLMs on several multiple-choice benchmarks, and the paper outlines how CAT can be extended to long-form, open-ended evaluation.

Conclusion: CAT offers a new way to assess the relation between LLM accuracy and consistency, supports a better understanding of model performance, and informs model deployment in high-stakes settings.

Abstract: We introduce CAT, a framework designed to evaluate and visualize the interplay of accuracy and response consistency of Large Language Models (LLMs) under controllable input variations, using multiple-choice (MC) benchmarks as a case study. Current evaluation practices primarily focus on model capabilities such as accuracy or benchmark scores and, more recently, measuring consistency is being considered an essential property for deploying LLMs in high-stakes, real-world applications. We argue in this paper that although both dimensions should still be evaluated independently, their inter-dependency also needs to be considered for a more nuanced evaluation of LLMs. At the core of CAT are the Consistency-Accuracy Relation (CAR) curves, which visualize how model accuracy varies with increasing consistency requirements, as defined by the Minimum-Consistency Accuracy (MCA) metric. We further propose the Consistency-Oriented Robustness Estimate (CORE) index, a global metric that combines the area and shape of the CAR curve to quantify the trade-off between accuracy and consistency. We present a practical demonstration of our framework across a diverse set of generalist and domain-specific LLMs, evaluated on multiple MC benchmarks. We also outline how CAT can be extended beyond MC tasks to support long-form, open-ended evaluations through adaptable scoring functions.
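
The abstract does not give the formula for MCA, so the sketch below rests on one plausible reading, stated here as an assumption: an item counts as correct at consistency level tau only if the model answers it correctly on at least a fraction tau of its input variations. Under that reading, a CAR curve is simply MCA evaluated over a grid of thresholds. The names `mca` and `car_curve` and the 0/1 matrix layout are hypothetical, and the CORE index is deliberately left out because its definition is not in the source.

```python
def mca(correct_matrix, tau):
    """Share of items answered correctly on at least a fraction tau of variations.

    correct_matrix[i][j] is 1 if variation j of item i was answered correctly.
    This definition is an assumption, not the paper's stated formula.
    """
    hits = 0
    for row in correct_matrix:
        if sum(row) / len(row) >= tau:
            hits += 1
    return hits / len(correct_matrix)


def car_curve(correct_matrix, thresholds):
    """Pairs (tau, MCA(tau)) that trace accuracy against the consistency demand."""
    return [(tau, mca(correct_matrix, tau)) for tau in thresholds]
```

With this reading the curve can only fall or stay flat as tau grows, so a model whose curve drops steeply is accurate mostly on items it answers inconsistently.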

[3] STED and Consistency Scoring: A Framework for Evaluating LLM Structured Output Reliability

Guanghui Wang,Jinze Yu,Xing Zhang,Dayuan Jiang,Yin Song,Tomal Deb,Xuefeng Liu,Peiyang He

Main category: cs.CL

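TL;DR: This paper proposes a framework for evaluating and improving the consistency of structured outputs generated by large language models. It comprises Semantic Tree Edit Distance (STED) and a consistency scoring scheme, is validated experimentally, and provides theoretical grounding and practical tools for real applications.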

Motivation: LLMs produce inconsistent outputs when generating structured data, which undermines their reliability in production, so a framework is needed that can quantify and improve this consistency.

Method: STED (Semantic Tree Edit Distance) is proposed as a new similarity metric and combined with aggregate analysis over repeated generations to form a consistency scoring framework. Systematic experiments on controlled synthetic datasets evaluate how different models behave.

Result: STED scores 0.86-0.90 similarity on semantically equivalent samples and 0 on structurally broken samples, outperforming existing metrics. Across six LLMs, Claude-3.7-Sonnet is the most consistent, while others such as Claude-3-Haiku degrade markedly.

Conclusion: The framework evaluates and improves the consistency of LLM-generated structured output, supports model selection, prompt refinement, and failure diagnosis, and strengthens the reliability of LLMs in production systems.

Abstract: Large Language Models (LLMs) are increasingly deployed for structured data generation, yet output consistency remains critical for production applications. We introduce a comprehensive framework for evaluating and improving consistency in LLM-generated structured outputs. Our approach combines: (1) STED (Semantic Tree Edit Distance), a novel similarity metric balancing semantic flexibility with structural strictness when comparing JSON outputs, and (2) a consistency scoring framework aggregating multiple STED measurements across repeated generations to quantify reliability. Through systematic experiments on synthetic datasets with controlled schema, expression, and semantic variations, we demonstrate STED achieves superior performance (0.86-0.90 similarity for semantic equivalents, 0.0 for structural breaks) compared to existing metrics including TED, BERTScore, and DeepDiff. Applying our framework to benchmark six LLMs reveals significant variations: Claude-3.7-Sonnet demonstrates exceptional consistency, maintaining near-perfect structural reliability even at high temperatures (T=0.9), while models like Claude-3-Haiku and Nova-Pro exhibit substantial degradation requiring careful tuning. Our framework enables practical applications including targeted model selection for structured tasks, iterative prompt refinement for reproducible results, and diagnostic analysis to identify inconsistency root causes. This work provides theoretical foundations and practical tools for ensuring reliable structured output generation in LLM-based production systems.

[4] PyBangla at BLP-2025 Task 2: Enhancing Bangla-to-Python Code Generation with Iterative Self-Correction and Multilingual Agents

Jahidul Islam,Md Ataullha,Saiful Azad

Main category: cs.CL

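TL;DR: This paper presents BanglaCodeAct, a framework based on multi-agent prompting and iterative self-correction for generating Python code from Bangla instructions. It requires no task-specific fine-tuning and achieves strong results on the mHumanEval dataset.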

Motivation: Existing large models underperform at code generation for low-resource languages such as Bangla, and effective methods for supporting programmers who do not work in English are lacking.

Method: An open-source multilingual LLM is placed in an agent framework built around a Thought-Code-Observation loop, and multi-agent collaboration with self-correction enables dynamic code generation and refinement.

Result: Qwen3-8B with BanglaCodeAct reaches a pass@1 accuracy of 94.0% on the development set and 71.6% on the blind test set, ahead of the other small-parameter models.

Conclusion: The work sets a new benchmark for Bangla-to-Python code generation and shows the effectiveness and potential of agent-based reasoning for code generation in low-resource languages.

Abstract: LLMs excel at code generation from English prompts, but this progress has not extended to low-resource languages. We address Bangla-to-Python code generation by introducing BanglaCodeAct, an agent-based framework that leverages multi-agent prompting and iterative self-correction. Unlike prior approaches relying on task-specific fine-tuning, BanglaCodeAct employs an open-source multilingual LLM within a Thought-Code-Observation loop, enabling dynamic generation, testing, and refinement of code from Bangla instructions. We benchmark several small-parameter open-source LLMs and evaluate their effectiveness on the mHumanEval dataset for Bangla NL2Code. Our results show that Qwen3-8B, when deployed with BanglaCodeAct, achieves the best performance, with pass@1 accuracy of 94.0% on the development set and 71.6% on the blind test set. These results establish a new benchmark for Bangla-to-Python translation and highlight the potential of agent-based reasoning for reliable code generation in low-resource languages. Experimental scripts are publicly available at github.com/jahidulzaid/PyBanglaCodeActAgent.
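
The pass@1 figures above belong to the pass@k family of metrics used with HumanEval-style benchmarks. As a reminder of what the number means, here is a minimal sketch of the standard unbiased pass@k estimator, where n is the number of samples drawn for a problem and c the number that pass the unit tests. Whether the shared task scored submissions with exactly this estimator is an assumption; with a single sample per problem, pass@1 reduces to the fraction of problems solved.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem: 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        # Fewer than k failing samples: every size-k draw contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For example, four problems with one sample each, three of which pass, give `mean_pass_at_k([(1, 1), (1, 0), (1, 1), (1, 1)], 1)`, which is three solved problems out of four.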

[5] PharmaShip: An Entity-Centric, Reading-Order-Supervised Benchmark for Chinese Pharmaceutical Shipping Documents

Tingwei Xie,Tianyi Zhou,Yonghong Song

Main category: cs.CL

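TL;DR: PharmaShip is a dataset of Chinese pharmaceutical shipping documents for testing pre-trained text-layout models under noisy OCR and diverse templates. It supports sequence entity recognition, relation extraction, and reading order prediction, and puts forward sequence-aware constraints as a transferable bias for structure modeling.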

Motivation: Existing document understanding models perform poorly on noisy, template-diverse real-world pharmaceutical shipping documents, and no unified, controlled benchmark exists for comparing architectures in safety-critical settings.

Method: PharmaShip, a real-world dataset of Chinese pharmaceutical shipping documents covering three tasks (SER, RE, ROP), is built with an entity-centric evaluation protocol and standardized preprocessing, data splits, and optimization. Five representative models (e.g., LiLT, LayoutLMv3) are benchmarked, and the models are improved with reading-order regularization and longer positional coverage.

Result: Pixel information and explicit geometric features are complementary, but neither suffices alone. Reading-order regularization consistently improves SER and EL and increases robustness, and longer positional coverage stabilizes late-page predictions. ROP is accurate at the word level but remains challenging at the segment level, reflecting boundary ambiguity and long-range crossings.

Conclusion: PharmaShip provides a reproducible benchmark for safety-critical document understanding in the pharmaceutical domain and shows that sequence-aware constraints are a transferable and effective inductive bias that helps model complex document structure.

Abstract: We present PharmaShip, a real-world Chinese dataset of scanned pharmaceutical shipping documents designed to stress-test pre-trained text-layout models under noisy OCR and heterogeneous templates. PharmaShip covers three complementary tasks, namely sequence entity recognition (SER), relation extraction (RE), and reading order prediction (ROP), and adopts an entity-centric evaluation protocol to minimize confounds across architectures. We benchmark five representative baselines spanning pixel-aware and geometry-aware families (LiLT, LayoutLMv3-base, GeoLayoutLM and their available RORE-enhanced variants), and standardize preprocessing, splits, and optimization. Experiments show that pixels and explicit geometry provide complementary inductive biases, yet neither alone is sufficient: injecting reading-order-oriented regularization consistently improves SER and EL and yields the most robust configuration, while longer positional coverage stabilizes late-page predictions and reduces truncation artifacts. ROP is accurate at the word level but challenging at the segment level, reflecting boundary ambiguity and long-range crossings. PharmaShip thus establishes a controlled, reproducible benchmark for safety-critical document understanding in the pharmaceutical domain and highlights sequence-aware constraints as a transferable bias for structure modeling. We release the dataset at https://github.com/KevinYuLei/PharmaShip.

[6] Noise-Driven Persona Formation in Reflexive Neural Language Generation

Toshiyuki Shigemura

Main category: cs.CL

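TL;DR: This paper proposes the Luca-Noise Reflex Protocol (LN-RP) for analyzing noise-driven persona emergence in large language models. Injecting random noise reveals nonlinear transitions in generation behavior and three stable persona modes.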

Motivation: To study how LLMs form and maintain persona traits under the influence of noise, and to explore the dynamics of the generation process.

Method: Random noise seeds are injected into the initial generation state, 152 generation cycles are run, and changes in linguistic behavior and entropy signatures are analyzed.

Result: Three stable persona modes with distinct entropy signatures are found. External noise can induce phase transitions in reflexive generation dynamics, and the modes differ significantly (p < 0.01).

Conclusion: LN-RP provides a reproducible experimental method for studying reflexive generation, emergent behavior, and long-range linguistic coherence in LLMs.

Abstract: This paper introduces the Luca-Noise Reflex Protocol (LN-RP), a computational framework for analyzing noise-driven persona emergence in large language models. By injecting stochastic noise seeds into the initial generation state, we observe nonlinear transitions in linguistic behavior across 152 generation cycles. Our results reveal three stable persona modes with distinct entropy signatures, and demonstrate that external noise sources can reliably induce phase transitions in reflexive generation dynamics. Quantitative evaluation confirms consistent persona retention and significant differences across modes (p < 0.01). The protocol provides a reproducible method for studying reflexive generation, emergent behavior, and long-range linguistic coherence in LLMs.

[7] HarmTransform: Transforming Explicit Harmful Queries into Stealthy via Multi-Agent Debate

Shenzhe Zhu

Main category: cs.CL

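TL;DR: This paper proposes HarmTransform, a multi-agent debate framework that transforms harmful queries into stealthier forms in order to improve the safety alignment of large language models.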

Motivation: Existing safety mechanisms mainly target overtly harmful content and overlook the threat posed by covert rephrasing that preserves malicious intent, which leaves a significant gap in safety training data.

Method: The HarmTransform framework uses iterative critique and refinement among multiple agents to systematically generate high-quality, covert query variants that preserve the original harmful intent.

Result: HarmTransform significantly outperforms baseline methods at producing effective stealthy queries, but the analysis also finds that multi-agent debate can introduce topic shifts and unnecessary complexity.

Conclusion: Multi-agent debate shows promise for broadening the coverage of safety training data but also has limits, and stealth has to be weighed against semantic consistency.

Abstract: Large language models (LLMs) are equipped with safety mechanisms to detect and block harmful queries, yet current alignment approaches primarily focus on overtly dangerous content and overlook more subtle threats. However, users can often disguise harmful intent through covert rephrasing that preserves malicious objectives while appearing benign, which creates a significant gap in existing safety training data. To address this limitation, we introduce HarmTransform, a multi-agent debate framework for systematically transforming harmful queries into stealthier forms while preserving their underlying harmful intent. Our framework leverages iterative critique and refinement among multiple agents to generate high-quality, covert harmful query transformations that can be used to improve future LLM safety alignment. Experiments demonstrate that HarmTransform significantly outperforms standard baselines in producing effective query transformations. At the same time, our analysis reveals that debate acts as a double-edged sword: while it can sharpen transformations and improve stealth, it may also introduce topic shifts and unnecessary complexity. These insights highlight both the promise and the limitations of multi-agent debate for generating comprehensive safety training data.

[8] Emergent World Beliefs: Exploring Transformers in Stochastic Games

Adam Kamel,Tanish Rastogi,Michael Ma,Kailash Ranganathan,Kevin Zhu

Main category: cs.CL

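TL;DR: This paper studies whether Transformer-based large language models can learn latent state representations of the environment in incomplete-information games such as Texas Hold'em. It finds that the model learns structure such as hand ranks and equity without supervision, and nonlinear probes show that its internal representations correlate with theoretical belief states.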

Motivation: To explore whether LLMs form world-model-like internal representations in partially observable environments (POMDPs), extending earlier findings from perfect-information games.

Method: A GPT-style model is pretrained on Poker Hand History (PHH) data, and linear and nonlinear probes analyze its internal activations to test whether features such as hand rank and equity are learned.

Result: Without explicit supervision, the model learns both deterministic structure (such as hand ranks) and stochastic features (such as equity). Nonlinear probes decode these representations, which correlate with theoretical belief states.

Conclusion: LLMs can build meaningful internal representations in incomplete-information environments, which suggests potential for reasoning and modeling under complex, uncertain conditions.

Abstract: Transformer-based large language models (LLMs) have demonstrated strong reasoning abilities across diverse fields, from solving programming challenges to competing in strategy-intensive games such as chess. Prior work has shown that LLMs can develop emergent world models in games of perfect information, where internal representations correspond to latent states of the environment. In this paper, we extend this line of investigation to domains of incomplete information, focusing on poker as a canonical partially observable Markov decision process (POMDP). We pretrain a GPT-style model on Poker Hand History (PHH) data and probe its internal activations. Our results demonstrate that the model learns both deterministic structure, such as hand ranks, and stochastic features, such as equity, without explicit instruction. Furthermore, by using primarily nonlinear probes, we demonstrate that these representations are decodable and correlate with theoretical belief states, suggesting that LLMs are learning their own representation of the stochastic environment of Texas Hold'em Poker.

[9] When in Doubt, Deliberate: Confidence-Based Routing to Expert Debate for Sexism Detection

Anwar Alajmi,Gabriele Pergola

Main category: cs.CL

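TL;DR: This paper proposes a two-stage framework for detecting subtle sexist content online that combines targeted training with reasoning-based inference and achieves leading results on several benchmarks.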

Motivation: Traditional methods struggle to identify subtle, context-dependent sexist content, and label scarcity, class imbalance, and annotation noise make model behavior unstable.

Method: Training uses class-balanced focal loss, class-aware batching, and post-hoc threshold calibration. At inference time, dynamic routing classifies high-confidence samples directly and passes uncertain ones to a multi-persona Collaborative Expert Judgment (CEJ) module that consolidates the personas' reasoning.

Result: F1 improves by +2.72% on EXIST 2025 Task 1.1, and by +4.48% and +1.30% on EDOS Tasks A and B respectively.

Conclusion: The framework addresses data sparsity, noise, and conceptual ambiguity and improves detection of subtle sexist content.

Abstract: Sexist content online increasingly appears in subtle, context-dependent forms that evade traditional detection methods. Its interpretation often depends on overlapping linguistic, psychological, legal, and cultural dimensions, which produce mixed and sometimes contradictory signals, even in annotated datasets. These inconsistencies, combined with label scarcity and class imbalance, result in unstable decision boundaries and cause fine-tuned models to overlook subtler, underrepresented forms of harm. Together, these limitations point to the need for a design that explicitly addresses the combined effects of (i) underrepresentation, (ii) noise, and (iii) conceptual ambiguity in both data and model predictions. To address these challenges, we propose a two-stage framework that unifies (i) targeted training procedures to adapt supervision to scarce and noisy data with (ii) selective, reasoning-based inference to handle ambiguous or borderline cases. Our training setup applies class-balanced focal loss, class-aware batching, and post-hoc threshold calibration to mitigate label imbalance and noisy supervision. At inference time, a dynamic routing mechanism classifies high-confidence cases directly and escalates uncertain instances to a novel Collaborative Expert Judgment (CEJ) module, which prompts multiple personas and consolidates their reasoning through a judge model. Our approach achieves state-of-the-art results across several benchmarks, with a +2.72% improvement in F1 on the EXIST 2025 Task 1.1, and gains of +4.48% and +1.30% on the EDOS Tasks A and B, respectively.
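
The routing step described above can be pictured with a minimal sketch: a classifier's class probabilities are compared against a confidence threshold, confident cases are labeled directly, and the rest are flagged for escalation to a deliberation stage. The function name, the label strings, and the 0.8 threshold are all hypothetical; the abstract does not state how confidence is measured or what cut-off the authors use.

```python
def route(probabilities, threshold=0.8):
    """Decide whether a prediction is confident enough to skip deliberation.

    probabilities maps each label to the classifier's probability for it.
    Returns ("direct", label) or ("escalate", None). The threshold value is
    an illustrative assumption, not a figure from the paper.
    """
    label = max(probabilities, key=probabilities.get)
    if probabilities[label] >= threshold:
        return ("direct", label)
    return ("escalate", None)
```

A prediction such as `{"sexist": 0.55, "not_sexist": 0.45}` would be escalated, while `{"sexist": 0.95, "not_sexist": 0.05}` would be labeled directly, so the costlier multi-persona stage runs only on borderline inputs.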

[10] Break Out the Silverware -- Semantic Understanding of Stored Household Items

Michaela Levi-Richter,Reuth Mirsky,Oren Glickman

Main category: cs.CL

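TL;DR: This paper introduces the Stored Household Item Challenge, which evaluates a service robot's cognitive ability to infer where non-visible items are stored in a household scene. It releases two datasets and NOAM, a hybrid method combining vision with a large language model whose prediction accuracy approaches human level.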

Motivation: Service robots lack commonsense reasoning and struggle to find where hidden items are stored when given everyday instructions, so a benchmark task is needed to evaluate and improve their cognitive capabilities.

Method: NOAM (Non-visible Object Allocation Model) converts visual input into natural language descriptions and combines scene-structure understanding with a large language model (e.g., GPT-4) to infer the most likely hidden storage location.

Result: On the real-world test set, NOAM clearly outperforms random selection, vision pipeline baselines, and multimodal models, and its prediction accuracy approaches human performance.

Conclusion: A hybrid architecture that combines vision and language models improves a robot's reasoning about household storage locations and offers a workable path toward more cognitively capable service robots.

Abstract: "Bring me a plate." For domestic service robots, this simple command reveals a complex challenge: inferring where everyday items are stored, often out of sight in drawers, cabinets, or closets. Despite advances in vision and manipulation, robots still lack the commonsense reasoning needed to complete this task. We introduce the Stored Household Item Challenge, a benchmark task for evaluating service robots' cognitive capabilities: given a household scene and a queried item, predict its most likely storage location. Our benchmark includes two datasets: (1) a real-world evaluation set of 100 item-image pairs with human-annotated ground truth from participants' kitchens, and (2) a development set of 6,500 item-image pairs annotated with storage polygons over public kitchen images. These datasets support realistic modeling of household organization and enable comparative evaluation across agent architectures. To begin tackling this challenge, we introduce NOAM (Non-visible Object Allocation Model), a hybrid agent pipeline that combines structured scene understanding with large language model inference. NOAM converts visual input into natural language descriptions of spatial context and visible containers, then prompts a language model (e.g., GPT-4) to infer the most likely hidden storage location. This integrated vision-language agent exhibits emergent commonsense reasoning and is designed for modular deployment within broader robotic systems. We evaluate NOAM against baselines including random selection, vision-language pipelines (Grounding-DINO + SAM), leading multimodal models (e.g., Gemini, GPT-4o, Kosmos-2, LLaMA, Qwen), and human performance. NOAM significantly improves prediction accuracy and approaches human-level results, highlighting best practices for deploying cognitively capable agents in domestic environments.

[11] Entropy-Aware Speculative Decoding Toward Improved LLM Reasoning

Tiancheng Su,Meicong Zhang,Guoxiu He

Main category: cs.CL

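TL;DR: The paper proposes EASD, a training-free enhancement to speculative decoding that adds a dynamic entropy-based penalty and improves the reasoning performance of large language models while preserving decoding efficiency.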

Motivation: In standard speculative decoding, excessive alignment between the draft and target models limits the achievable gains and prevents the system from exceeding the target model's performance, so a method is needed that dynamically identifies and corrects low-confidence predictions.

Method: Building on standard speculative decoding, EASD adds a dynamic entropy penalty. At each decoding step, the entropy of the sampling distribution measures model uncertainty; when both models show high entropy and their top-N predictions overlap substantially, the current token is rejected and resampled by the target model.

Result: On multiple reasoning benchmarks, EASD consistently outperforms existing speculative decoding methods and in most cases surpasses the target model itself, while keeping efficiency comparable to standard speculative decoding.

Conclusion: EASD is an effective training-free enhancement to speculative decoding. Its entropy-aware mechanism stops low-confidence errors from propagating, preserves decoding efficiency, and may exceed the target model's own performance ceiling.

Abstract: Speculative decoding (SD) accelerates large language model (LLM) reasoning by using a small draft model to generate candidate tokens, which the target LLM either accepts directly or regenerates upon rejection. However, excessive alignment between the draft and target models constrains SD to the performance of the target LLM. To address this limitation, we propose Entropy-Aware Speculative Decoding (EASD), a training-free enhancement. Building on standard SD, EASD incorporates a dynamic entropy-based penalty. At each decoding step, we employ the entropy of the sampling distribution to quantify model uncertainty. When both models exhibit high entropy with substantial overlap among their top-N predictions, the corresponding token is rejected and re-sampled by the target LLM. This penalty prevents low-confidence errors from propagating. By incorporating draft-model verification, EASD enables the possibility of surpassing the target model's inherent performance. Experiments across multiple reasoning benchmarks demonstrate that EASD consistently outperforms existing SD methods and, in most cases, surpasses the target LLM itself. We further prove that the efficiency of EASD is comparable to that of SD. The code can be found in the Supplementary Materials.
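
The rejection rule described above can be sketched in a few lines: compute the Shannon entropy of each model's next-token distribution, measure how much their top-N token sets overlap, and reject the drafted token when both entropies are high and the overlap is large. This is only an illustration of the idea as the abstract states it; the thresholds, the value of N, and the overlap measure are hypothetical, since the abstract gives no numeric settings or exact formula.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def top_n(probs, n):
    """Indices of the n most probable tokens."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(ranked[:n])


def should_reject(draft_probs, target_probs, n=3,
                  entropy_threshold=1.0, overlap_threshold=0.5):
    """Reject the drafted token when both models are uncertain and agree on candidates.

    All three default parameters are illustrative assumptions.
    """
    both_uncertain = (entropy(draft_probs) > entropy_threshold
                      and entropy(target_probs) > entropy_threshold)
    overlap = len(top_n(draft_probs, n) & top_n(target_probs, n)) / n
    return both_uncertain and overlap >= overlap_threshold
```

In a full decoder, a rejection would trigger resampling from the target model at that position; the sketch covers only the decision itself.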

[12] MiMo-Audio: Audio Language Models are Few-Shot Learners

Xiaomi LLM-Core Team,:,Dong Zhang,Gang Wang,Jinlong Xue,Kai Fang,Liang Zhao,Rui Ma,Shuhuai Ren,Shuo Liu,Tao Guo,Weiji Zhuang,Xin Zhang,Xingchen Song,Yihan Yan,Yongzhe He,Cici,Bowen Shen,Chengxuan Zhu,Chong Ma,Chun Chen,Heyu Chen,Jiawei Li,Lei Li,Menghang Zhu,Peidian Li,Qiying Wang,Sirui Deng,Weimin Xiong,Wenshan Huang,Wenyu Yang,Yilin Jiang,Yixin Yang,Yuanyuan Tian,Yue Ma,Yue Yu,Zihan Zhang,Zihao Yue,Bangjun Xiao,Bingquan Xia,Bofei Gao,Bowen Ye,Can Cai,Chang Liu,Chenhong He,Chunan Li,Dawei Zhu,Duo Zhang,Fengyuan Shi,Guoan Wang,Hailin Zhang,Hanglong Lv,Hanyu Li,Hao Tian,Heng Qu,Hongshen Xu,Houbin Zhang,Huaqiu Liu,Jiangshan Duo,Jianguang Zuo,Jianyu Wei,Jiebao Xiao,Jinhao Dong,Jun Shi,Junhao Hu,Kainan Bao,Kang Zhou,Linghao Zhang,Meng Chen,Nuo Chen,Peng Zhang,Qianli Chen,Qiantong Wang,Rang Li,Shaohui Liu,Shengfan Wang,Shicheng Li,Shihua Yu,Shijie Cao,Shimao Chen,Shuhao Gu,Weikun Wang,Wenhan Ma,Xiangwei Deng,Xing Yong,Xing Zhang,Xu Wang,Yifan Song,Yihao Zhao,Yingbo Zhao,Yizhao Gao,Yu Cheng,Yu Tu,Yudong Wang,Zhaojun Huang,Zhengju Tang,Zhenru Lin,Zhichao Song,Zhipeng Xu,Zhixian Zheng,Zihan Jiang

Main category: cs.CL

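TL;DR: Through large-scale pretraining, MiMo-Audio gains few-shot learning ability across many audio tasks, reaches open-source SOTA on speech intelligence and audio understanding benchmarks, and shows generalization abilities such as speech continuation and style transfer.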

Motivation: Existing audio language models depend on task-specific fine-tuning, whereas humans generalize to new tasks from a few examples or simple instructions. Inspired by GPT-3, the authors explore whether large-scale pretraining can produce similarly strong generalization in the audio domain.

Method: MiMo-Audio's pretraining data is scaled to over one hundred million hours and its few-shot learning ability is evaluated systematically. In post-training, a diverse instruction-tuning corpus is curated and thinking mechanisms are introduced to strengthen audio understanding and generation.

Result: MiMo-Audio-7B-Base reaches open-source SOTA on several speech and audio understanding benchmarks, generalizes to unseen tasks such as voice conversion, style transfer, and speech editing, and can generate highly realistic talk shows, recitations, livestreams, and similar content. MiMo-Audio-7B-Instruct approaches or surpasses closed-source models on several audio understanding, dialogue, and instruct-TTS evaluations.

Conclusion: Large-scale next-token prediction pretraining improves the generality and generalization of audio language models, supports the applicability of scaling laws to the audio domain, and advances general-purpose multi-task audio foundation models.

Abstract: Existing audio language models typically rely on task-specific fine-tuning to accomplish particular audio tasks. In contrast, humans are able to generalize to new audio tasks with only a few examples or simple instructions. GPT-3 has shown that scaling next-token prediction pretraining enables strong generalization capabilities in text, and we believe this paradigm is equally applicable to the audio domain. By scaling MiMo-Audio's pretraining data to over one hundred million hours, we observe the emergence of few-shot learning capabilities across a diverse set of audio tasks. We develop a systematic evaluation of these capabilities and find that MiMo-Audio-7B-Base achieves SOTA performance on both speech intelligence and audio understanding benchmarks among open-source models. Beyond standard metrics, MiMo-Audio-7B-Base generalizes to tasks absent from its training data, such as voice conversion, style transfer, and speech editing. MiMo-Audio-7B-Base also demonstrates powerful speech continuation capabilities, capable of generating highly realistic talk shows, recitations, livestreaming and debates. At the post-training stage, we curate a diverse instruction-tuning corpus and introduce thinking mechanisms into both audio understanding and generation. MiMo-Audio-7B-Instruct achieves open-source SOTA on audio understanding benchmarks (MMSU, MMAU, MMAR, MMAU-Pro), spoken dialogue benchmarks (Big Bench Audio, MultiChallenge Audio) and instruct-TTS evaluations, approaching or surpassing closed-source models. Model checkpoints and full evaluation suite are available at https://github.com/XiaomiMiMo/MiMo-Audio.

[13] StressRoBERTa: Cross-Condition Transfer Learning from Depression, Anxiety, and PTSD to Stress Detection

Amal Alqahtani,Efsun Kayi,Mona Diab

Main category: cs.CL

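TL;DR: This paper presents StressRoBERTa, a cross-condition transfer learning approach for automatically detecting self-reported chronic stress in English tweets. Continual training on clinically related conditions improves detection performance.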

Motivation: Chronic stress is widespread and often comorbid with other mental health conditions, so automatically identifying stress in social media text can support public health monitoring and intervention.

Method: A RoBERTa model is continually trained on the Stress-SMHD corpus, which covers conditions highly comorbid with stress (depression, anxiety, PTSD), and then fine-tuned on the SMM4H 2022 Task 8 dataset to detect chronic stress.

Result: StressRoBERTa reaches an 82% F1-score on the SMM4H 2022 task, 3 percentage points above the best participating system, and 81% F1 on the Dreaddit dataset, which confirms that the model transfers across contexts.

Conclusion: Focused cross-condition transfer learning from clinical conditions related to stress yields stronger representations than general language models or broad mental health models and improves chronic stress detection.

Abstract: The prevalence of chronic stress represents a significant public health concern, with social media platforms like Twitter serving as important venues for individuals to share their experiences. This paper introduces StressRoBERTa, a cross-condition transfer learning approach for automatic detection of self-reported chronic stress in English tweets. The investigation examines whether continual training on clinically related conditions (depression, anxiety, PTSD), disorders with high comorbidity with chronic stress, improves stress detection compared to general language models and broad mental health models. RoBERTa is continually trained on the Stress-SMHD corpus (108M words from users with self-reported diagnoses of depression, anxiety, and PTSD) and fine-tuned on the SMM4H 2022 Task 8 dataset. StressRoBERTa achieves 82% F1-score, outperforming the best shared task system (79% F1) by 3 percentage points. The results demonstrate that focused cross-condition transfer from stress-related disorders (+1% F1 over vanilla RoBERTa) provides stronger representations than general mental health training. Evaluation on Dreaddit (81% F1) further demonstrates transfer from clinical mental health contexts to situational stress discussions.

[14] Explaining News Bias Detection: A Comparative SHAP Analysis of Transformer Model Decision Mechanisms

Himel Ghosh

Main category: cs.CL

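TL;DR: This paper presents a comparative interpretability study of two Transformer-based bias detection models, using SHAP to analyze word-level attributions in correct and incorrect predictions. Both models attend to similar evaluative language but differ markedly in how they integrate these signals: the dedicated bias detector is more prone to false positives, while the domain-adapted model performs better and makes fewer errors.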

Motivation: Little is understood about how automated bias detection models for news text reach their decisions, and even less about why they fail, so interpretability methods are needed to analyze how different models behave.

Method: SHAP-based explanations provide word-level attribution analysis for a bias detection model fine-tuned on the BABE dataset and a domain-adapted RoBERTa model, comparing their attribution patterns across correct and incorrect predictions.

Result: Both models attend to similar kinds of evaluative language but integrate the signals differently. The bias detector assigns stronger internal evidence to false positives, which leads it to over-flag neutral content. The domain-adapted model shows more reasonable attribution patterns and produces 63% fewer false positives. False positives stem mainly from discourse-level ambiguity rather than explicit bias cues.

Conclusion: Interpretability-aware evaluation matters for bias detection systems. Model architecture and training strategy strongly affect reliability and suitability for newsroom use, and future work should attend to explanation consistency to improve real-world deployment.

Abstract: Automated bias detection in news text is heavily used to support journalistic analysis and media accountability, yet little is known about how bias detection models arrive at their decisions or why they fail. In this work, we present a comparative interpretability study of two transformer-based bias detection models: a bias detector fine-tuned on the BABE dataset and a domain-adapted pre-trained RoBERTa model fine-tuned on the BABE dataset, using SHAP-based explanations. We analyze word-level attributions across correct and incorrect predictions to characterize how different model architectures operationalize linguistic bias. Our results show that although both models attend to similar categories of evaluative language, they differ substantially in how these signals are integrated into predictions. The bias detector model assigns stronger internal evidence to false positives than to true positives, indicating a misalignment between attribution strength and prediction correctness and contributing to systematic over-flagging of neutral journalistic content. In contrast, the domain-adaptive model exhibits attribution patterns that better align with prediction outcomes and produces 63% fewer false positives. We further demonstrate that model errors arise from distinct linguistic mechanisms, with false positives driven by discourse-level ambiguity rather than explicit bias cues. These findings highlight the importance of interpretability-aware evaluation for bias detection systems and suggest that architectural and training choices critically affect both model reliability and deployment suitability in journalistic contexts.

[15] Retrieval Augmented Question Answering: When Should LLMs Admit Ignorance?

Dingmin Wang,Ji Ma,Shankar Kumar

Main category: cs.CL

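TL;DR: This paper studies the performance drop that occurs in retrieval-augmented question answering when long contexts bring in irrelevant information. It proposes an adaptive chunked prompting strategy that maintains performance with fewer tokens, and finds that models tend to produce wrong answers rather than decline when information is insufficient.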

Motivation: Long contexts help bring in relevant knowledge but also introduce more irrelevant information, which degrades generation quality, so a more effective way of using retrieved information is needed.

Method: An adaptive prompting strategy splits the retrieved information into smaller chunks and prompts the LLM with each chunk in turn. Adjusting the chunk size balances retaining relevant information against excluding irrelevant information.

Result: Experiments on three open-domain question answering datasets show that the strategy matches standard prompting while using fewer tokens. The analysis finds that models often generate wrong answers instead of declining when information is insufficient.

Conclusion: Adaptive chunked prompting makes long-context question answering more token-efficient without losing accuracy, and the findings underline the importance of improving LLMs' ability to decline to answer.

Abstract: The success of expanded context windows in Large Language Models (LLMs) has driven increased use of broader context in retrieval-augmented generation. We investigate the use of LLMs for retrieval augmented question answering. While longer contexts make it easier to incorporate targeted knowledge, they introduce more irrelevant information that hinders the model's generation process and degrades its performance. To address the issue, we design an adaptive prompting strategy which involves splitting the retrieved information into smaller chunks and sequentially prompting an LLM to answer the question using each chunk. Adjusting the chunk size allows a trade-off between incorporating relevant information and reducing irrelevant information. Experimental results on three open-domain question answering datasets demonstrate that the adaptive strategy matches the performance of standard prompting while using fewer tokens. Our analysis reveals that when encountering insufficient information, the LLM often generates incorrect answers instead of declining to respond, which constitutes a major source of error. This finding highlights the need for further research into enhancing LLMs' ability to effectively decline requests when faced with inadequate information.
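
The chunked prompting idea above can be sketched as a small loop: split the retrieved passages into fixed-size chunks, prompt the model once per chunk, and stop as soon as a chunk yields an answer. The early-stopping rule and the convention that the model returns `None` when it declines are assumptions made for this sketch; `ask_llm` is a hypothetical callable standing in for whatever model interface is used, and the paper's actual prompts and stopping criterion are not given in the abstract.

```python
def chunked(passages, chunk_size):
    """Yield consecutive slices of at most chunk_size passages."""
    for start in range(0, len(passages), chunk_size):
        yield passages[start:start + chunk_size]


def answer_with_chunks(question, passages, chunk_size, ask_llm):
    """Prompt once per chunk and return the first answer the model gives.

    ask_llm(question, chunk) is a hypothetical callable that returns an
    answer string, or None when the chunk does not support an answer.
    """
    for chunk in chunked(passages, chunk_size):
        reply = ask_llm(question, chunk)
        if reply is not None:
            return reply
    return None  # no chunk contained enough information
```

This design only saves tokens if the model reliably returns a refusal on uninformative chunks, which is exactly the weakness the analysis in the abstract identifies.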

[16] Adversarial Lens: Exploiting Attention Layers to Generate Adversarial Examples for Evaluation

Kaustubh Dhole

Main category: cs.CL

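TL;DR: This paper proposes a new method for generating adversarial examples from the token distributions of intermediate attention layers, using the model's internal generation mechanism to produce semantically plausible and consistent perturbations. Experiments show that the method lowers LLM evaluation performance on argument quality assessment, but substitutions from some layers and positions can degrade grammar, which reveals both the promise and the limits of the approach.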

Motivation: To explore the token-level hypotheses encoded in the intermediate attention layers of LLMs and to use these internal representations to generate more natural and more effective adversarial examples for testing and hardening LLM-based evaluation systems.

Method: Token distributions are extracted from the intermediate attention layers of LLaMA-3.1-Instruct-8B and used as adversarial perturbations that directly replace tokens in the original input, without relying on prompt-based or gradient-based attacks. The approach is tested on the ArgQuality dataset.

Result: Attention-based adversarial examples measurably lower model performance on argument quality assessment while keeping the input semantically similar, but substitutions at particular layers and positions introduce grammatical errors that limit practical effectiveness.

Conclusion: Intermediate-layer representations have potential as a principled source of adversarial examples for stress-testing LLM evaluation pipelines, but grammatical coherence needs further work before the approach is practical.

Abstract: Recent advances in mechanistic interpretability suggest that intermediate attention layers encode token-level hypotheses that are iteratively refined toward the final output. In this work, we exploit this property to generate adversarial examples directly from attention-layer token distributions. Unlike prompt-based or gradient-based attacks, our approach leverages model-internal token predictions, producing perturbations that are both plausible and internally consistent with the model's own generation process. We evaluate whether tokens extracted from intermediate layers can serve as effective adversarial perturbations for downstream evaluation tasks. We conduct experiments on argument quality assessment using the ArgQuality dataset, with LLaMA-3.1-Instruct-8B serving as both the generator and evaluator. Our results show that attention-based adversarial examples lead to measurable drops in evaluation performance while remaining semantically similar to the original inputs. However, we also observe that substitutions drawn from certain layers and token positions can introduce grammatical degradation, limiting their practical effectiveness. Overall, our findings highlight both the promise and current limitations of using intermediate-layer representations as a principled source of adversarial examples for stress-testing LLM-based evaluation pipelines.

[17] Integrating Domain Knowledge for Financial QA: A Multi-Retriever RAG Approach with LLMs

Yukun Zhang,Stefan Elbl Droguett,Samyak Jain

Main category: cs.CL

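TL;DR: This study proposes a multi-retriever RAG system that combines domain knowledge with question context to improve accuracy on financial numerical question answering, and shows the benefit of domain-specific training and of the latest LLMs in few-shot settings.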

Details Motivation: Lacking domain knowledge in finance, existing LLMs make many errors on financial numerical reasoning questions, which require both specialist understanding and complex multi-step calculation. Method: A multi-retriever retrieval-augmented generation (RAG) system retrieves external domain knowledge and internal question context; the SecBERT encoder is used for domain-specific training, and the latest LLM is used to build a prompt-based generator. Result: Domain-specific training significantly improves performance and surpasses the top model from the FinQA paper; the best prompt-based LLM generator achieves state-of-the-art results with an improvement of more than 7%, yet remains below human expert performance. Larger models make better use of external knowledge, which outweighs the loss from hallucination. Conclusion: Domain-specific training and the latest LLMs effectively improve financial numerical reasoning, the gain from external knowledge is more pronounced in larger models, and the gap to human experts still needs to be narrowed. Abstract: This research project addresses the errors of financial numerical reasoning Question Answering (QA) tasks due to the lack of domain knowledge in finance. Despite recent advances in Large Language Models (LLMs), financial numerical questions remain challenging because they require specific domain knowledge in finance and complex multi-step numeric reasoning. We implement a multi-retriever Retrieval-Augmented Generation (RAG) system to retrieve both external domain knowledge and internal question contexts, and utilize the latest LLM to tackle these tasks. Through comprehensive ablation experiments and error analysis, we find that domain-specific training with the SecBERT encoder significantly contributes to our best neural symbolic model surpassing the FinQA paper's top model, which serves as our baseline. This suggests the potentially superior performance of domain-specific training. Furthermore, our best prompt-based LLM generator achieves state-of-the-art (SOTA) performance with significant improvement (>7%), yet it is still below the human expert performance. This study highlights the trade-off between hallucination loss and external knowledge gains in smaller models and few-shot examples. For larger models, the gains from external facts typically outweigh the hallucination loss. Finally, our findings confirm the enhanced numerical reasoning capabilities of the latest LLM, optimized for few-shot learning.

[18] Disentangling Learning from Judgment: Representation Learning for Open Response Analytics

Conrad Borchers,Manit Patel,Seiyon M. Lee,Anthony F. Botelho

Main category: cs.CL

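TL;DR: This paper presents an analytics-first framework that separates student content signals from teacher grading tendencies in open-ended responses. Using ASSISTments mathematics data, it models teacher histories as dynamic priors and combines sentence embeddings with residualization to make content representations more interpretable. Combining teacher priors with content embeddings gives the best predictive performance (AUC 0.815), while content-only models remain above chance (AUC 0.626). The framework helps reveal where grading practices align or conflict with evidence of student thinking.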

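Details Motivation: Automated scoring often conflates what students actually wrote with how teachers habitually grade, which can misrepresent learning; a method is needed that separates content signals from rater bias to enable more transparent, auditable assessment. Method: An analytics-first framework is applied to de-identified ASSISTments open-ended mathematics responses. Teacher grading histories are modeled as dynamic priors, sentence embeddings serve as text representations, and centering and residualization remove prompt and teacher confounds. Temporally validated linear models quantify each signal's contribution, and a projection surfaces model disagreements for qualitative inspection. Result: Teacher priors heavily influence grade predictions; combining priors with content embeddings performs best (AUC 0.815), while content-only models are weaker but above chance (AUC 0.626). Adjusting for rater effects yields a more informative content representation that retains more meaningful embedding dimensions and reveals cases where semantic evidence supports understanding as opposed to surface-level response differences. Conclusion: The study provides a practical pipeline that turns embeddings into learning analytics for reflection, letting teachers and researchers examine whether grading practices align with evidence of student reasoning and learning, and supporting fairer, more transparent assessment. Abstract: Open-ended responses are central to learning, yet automated scoring often conflates what students wrote with how teachers grade. We present an analytics-first framework that separates content signals from rater tendencies, making judgments visible and auditable via analytics. Using de-identified ASSISTments mathematics responses, we model teacher histories as dynamic priors and derive text representations from sentence embeddings, incorporating centering and residualization to mitigate prompt and teacher confounds. Temporally-validated linear models quantify the contributions of each signal, and a projection surfaces model disagreements for qualitative inspection. Results show that teacher priors heavily influence grade predictions; the strongest results arise when priors are combined with content embeddings (AUC 0.815), while content-only models remain above chance but substantially weaker (AUC 0.626). Adjusting for rater effects sharpens the residual content representation, retaining more informative embedding dimensions and revealing cases where semantic evidence supports understanding as opposed to surface-level differences in how students respond. The contribution presents a practical pipeline that transforms embeddings from mere features into learning analytics for reflection, enabling teachers and researchers to examine where grading practices align (or conflict) with evidence of student reasoning and learning.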

[19] Improving Multi-step RAG with Hypergraph-based Memory for Long-Context Complex Relational Modeling

Chulun Zhou,Chunkang Zhang,Guoxin Yu,Fandong Meng,Jie Zhou,Wai Lam,Mo Yu

Main category: cs.CL

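TL;DR: This paper proposes HGMem, a hypergraph-based dynamic memory mechanism that strengthens complex reasoning and global understanding in multi-step retrieval-augmented generation (RAG).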

Details Motivation: The memory modules of existing RAG systems are mostly static storage that does not model high-order correlations among primitive facts, which limits multi-step reasoning and knowledge evolution. Method: HGMem is a hypergraph-structured memory in which hyperedges represent memory units, supporting the progressive construction of higher-order interactions into an integrated knowledge structure that guides subsequent reasoning steps. Result: Experiments on several challenging datasets that require global sense-making show that HGMem consistently improves multi-step RAG and substantially outperforms strong baselines. Conclusion: By turning memory from passive storage into a dynamic, expressive structure, HGMem strengthens LLMs' global sense-making and reasoning on complex, long-range tasks. Abstract: Multi-step retrieval-augmented generation (RAG) has become a widely adopted strategy for enhancing large language models (LLMs) on tasks that demand global comprehension and intensive reasoning. Many RAG systems incorporate a working memory module to consolidate retrieved information. However, existing memory designs function primarily as passive storage that accumulates isolated facts for the purpose of condensing the lengthy inputs and generating new sub-queries through deduction. This static nature overlooks the crucial high-order correlations among primitive facts, the compositions of which can often provide stronger guidance for subsequent steps. Therefore, their representational strength and impact on multi-step reasoning and knowledge evolution are limited, resulting in fragmented reasoning and weak global sense-making capacity in extended contexts. We introduce HGMem, a hypergraph-based memory mechanism that extends the concept of memory beyond simple storage into a dynamic, expressive structure for complex reasoning and global understanding. In our approach, memory is represented as a hypergraph whose hyperedges correspond to distinct memory units, enabling the progressive formation of higher-order interactions within memory. This mechanism connects facts and thoughts around the focal problem, evolving into an integrated and situated knowledge structure that provides strong propositions for deeper reasoning in subsequent steps. We evaluate HGMem on several challenging datasets designed for global sense-making. Extensive experiments and in-depth analyses show that our method consistently improves multi-step RAG and substantially outperforms strong baseline systems across diverse tasks.

[20] Efficient Context Scaling with LongCat ZigZag Attention

Chen Zhang,Yang Bai,Jiahuan Li,Anchun Gui,Keheng Wang,Feifan Liu,Guanyu Wu,Yuwei Jiang,Defei Bu,Li Wei,Haihang Jing,Hongyin Tang,Xin Chen,Xiangzhou Huang,Fengcun Li,Rongxiang Weng,Yulei Qian,Yifan Lu,Yerui Sun,Jingang Wang,Yuchen Xie,Xunliang Cai

Main category: cs.CL

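TL;DR: This paper proposes LongCat ZigZag Attention (LoZA), a sparse attention scheme that efficiently converts full-attention models into sparse versions, significantly speeds up inference in long-context scenarios, and is applied in LongCat-Flash-Exp to support processing of up to one million tokens.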

Details Motivation: To improve inference efficiency in long-context scenarios under a limited compute budget and address the high computational cost of full attention. Method: A sparse attention scheme, LongCat ZigZag Attention (LoZA), is applied to LongCat-Flash during mid-training to accelerate both prefill-intensive and decode-intensive workloads. Result: LoZA delivers significant speed-ups in long-context scenarios and supports fast processing of up to 1 million tokens, enabling long-term reasoning and long-horizon agentic capabilities. Conclusion: LoZA is an efficient sparse attention scheme that converts existing full-attention models into sparse versions and substantially improves the efficiency of long-sequence processing. Abstract: We introduce LongCat ZigZag Attention (LoZA), which is a sparse attention scheme designed to transform any existing full-attention model into a sparse version with a rather limited compute budget. In long-context scenarios, LoZA can achieve significant speed-ups both for prefill-intensive (e.g., retrieval-augmented generation) and decode-intensive (e.g., tool-integrated reasoning) cases. Specifically, by applying LoZA to LongCat-Flash during mid-training, we serve LongCat-Flash-Exp as a long-context foundation model that can swiftly process up to 1 million tokens, enabling efficient long-term reasoning and long-horizon agentic capabilities.

[21] CEC-Zero: Zero-Supervision Character Error Correction with Self-Generated Rewards

Zhiming Lin,Kai Zhao,Sophie Zhang,Peilai Yu,Canran Xiao

Main category: cs.CL

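TL;DR: CEC-Zero is a zero-supervision reinforcement learning framework that lets LLMs correct their own Chinese spelling errors. Across 9 benchmarks it clearly outperforms supervised methods and strong fine-tuned LLMs, establishing a robust, scalable, label-free paradigm for Chinese spelling correction.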

Details Motivation: Existing LLMs and supervised methods for Chinese spelling correction are not robust to novel errors and depend on costly annotation; an efficient label-free solution is missing. Method: CEC-Zero builds training data by synthesizing errorful text from clean text, computes cluster-consensus rewards from semantic similarity and candidate agreement, and optimizes the policy with PPO, achieving self-correction without supervision. Result: Across 9 benchmarks, CEC-Zero beats supervised baselines by 10-13 F$_1$ points and strong fine-tuned LLMs by 5-8 points, with theoretical guarantees of unbiased rewards and convergence. Conclusion: CEC-Zero establishes a label-free paradigm for Chinese spelling correction that improves robustness and scalability on noisy text and unlocks LLM potential in real-world text processing. Abstract: Large-scale Chinese spelling correction (CSC) remains critical for real-world text processing, yet existing LLMs and supervised methods lack robustness to novel errors and rely on costly annotations. We introduce CEC-Zero, a zero-supervision reinforcement learning framework that addresses this by enabling LLMs to correct their own mistakes. CEC-Zero synthesizes errorful inputs from clean text, computes cluster-consensus rewards via semantic similarity and candidate agreement, and optimizes the policy with PPO. It outperforms supervised baselines by 10-13 F$_1$ points and strong LLM fine-tunes by 5-8 points across 9 benchmarks, with theoretical guarantees of unbiased rewards and convergence. CEC-Zero establishes a label-free paradigm for robust, scalable CSC, unlocking LLM potential in noisy text pipelines.

[22] Fantastic Reasoning Behaviors and Where to Find Them: Unsupervised Discovery of the Reasoning Process

Zhenyu Zhang,Shujian Zhang,John Lambert,Wenxuan Zhou,Zhangyang Wang,Mingqing Chen,Andrew Hard,Rajiv Mathews,Lun Wang

Main category: cs.CL

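TL;DR: The paper proposes RISE, an unsupervised framework that uses sparse auto-encoders to discover interpretable reasoning-behavior vectors in activation space, enabling both analysis of and controllable intervention in the reasoning process of LLMs.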

Details Motivation: Existing methods analyze LLM reasoning through human-defined concepts, which cannot capture the full range of reasoning behaviors, and they are restricted to word-level supervision that offers little insight into internal mechanisms. Method: Chain-of-thought traces are segmented into sentence-level steps, and sparse auto-encoders (SAEs) are trained on step-level activations to extract disentangled feature vectors, called reasoning vectors, that characterize distinct reasoning behaviors; their interpretability and controllability are checked through visualization, clustering, and intervention experiments. Result: The method identifies interpretable behaviors such as reflection, backtracking, and confidence modulation, which are separable in decoder space; intervening on specific vectors steers the model's reasoning trajectory; the SAEs also capture structural properties related to response length and surface novel behaviors beyond human supervision. Conclusion: RISE demonstrates the potential of unsupervised representation learning for revealing and controlling the internal reasoning mechanisms of LLMs, offering a new route to interpretability and controllability. Abstract: Despite the growing reasoning capabilities of recent large language models (LLMs), their internal mechanisms during the reasoning process remain underexplored. Prior approaches often rely on human-defined concepts (e.g., overthinking, reflection) at the word level to analyze reasoning in a supervised manner. However, such methods are limited, as it is infeasible to capture the full spectrum of potential reasoning behaviors, many of which are difficult to define in token space. In this work, we propose an unsupervised framework (namely, RISE: Reasoning behavior Interpretability via Sparse auto-Encoder) for discovering reasoning vectors, which we define as directions in the activation space that encode distinct reasoning behaviors. By segmenting chain-of-thought traces into sentence-level 'steps' and training sparse auto-encoders (SAEs) on step-level activations, we uncover disentangled features corresponding to interpretable behaviors such as reflection and backtracking. Visualization and clustering analyses show that these behaviors occupy separable regions in the decoder column space. Moreover, targeted interventions on SAE-derived vectors can controllably amplify or suppress specific reasoning behaviors, altering inference trajectories without retraining. Beyond behavior-specific disentanglement, SAEs capture structural properties such as response length, revealing clusters of long versus short reasoning traces. More interestingly, SAEs enable the discovery of novel behaviors beyond human supervision. We demonstrate the ability to control response confidence by identifying confidence-related vectors in the SAE decoder space. These findings underscore the potential of unsupervised latent discovery for both interpreting and controllably steering reasoning in LLMs.

[23] WISE: Web Information Satire and Fakeness Evaluation

Gaurab Chhetri,Subasish Das,Tausif Islam Chowdhury

Main category: cs.CL

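TL;DR: This study presents the WISE framework, which compares eight lightweight transformer models and two baseline models on classifying fake news versus satire using a 20,000-sample dataset. MiniLM and RoBERTa-base perform best, and lightweight models prove practical for resource-constrained settings.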

Details Motivation: Fake news and satire share similar linguistic features but differ in intent, which makes them hard to tell apart, and existing methods struggle to balance efficiency and accuracy. Method: The WISE framework uses 20,000 balanced samples from the Fakeddit dataset and stratified 5-fold cross-validation to evaluate several lightweight transformer models and baselines on accuracy, precision, recall, F1-score, ROC-AUC, PR-AUC, MCC, Brier score, and Expected Calibration Error. Result: MiniLM reaches the highest accuracy (87.58%); RoBERTa-base has the best ROC-AUC (95.42%) with 87.36% accuracy; DistilBERT offers a good efficiency-accuracy trade-off (86.28% accuracy, 93.90% ROC-AUC); statistical tests show significant differences between models. Conclusion: Lightweight models can match or exceed baseline performance on fake-versus-satire classification and are suitable for deploying credibility detection systems in resource-constrained environments. Abstract: Distinguishing fake or untrue news from satire or humor poses a unique challenge due to their overlapping linguistic features and divergent intent. This study develops the WISE (Web Information Satire and Fakeness Evaluation) framework, which benchmarks eight lightweight transformer models alongside two baseline models on a balanced dataset of 20,000 samples from Fakeddit, annotated as either fake news or satire. Using stratified 5-fold cross-validation, we evaluate models across comprehensive metrics including accuracy, precision, recall, F1-score, ROC-AUC, PR-AUC, MCC, Brier score, and Expected Calibration Error. Our evaluation reveals that MiniLM, a lightweight model, achieves the highest accuracy (87.58%) among all models, while RoBERTa-base achieves the highest ROC-AUC (95.42%) and strong accuracy (87.36%). DistilBERT offers an excellent efficiency-accuracy trade-off with 86.28% accuracy and 93.90% ROC-AUC. Statistical tests confirm significant performance differences between models, with paired t-tests and McNemar tests providing rigorous comparisons. Our findings highlight that lightweight models can match or exceed baseline performance, offering actionable insights for deploying misinformation detection systems in real-world, resource-constrained settings.
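Two of the calibration metrics listed above, the Brier score and Expected Calibration Error (ECE), can be computed as in the sketch below. It assumes binary labels (1 = fake, 0 = satire is an arbitrary, hypothetical encoding), equal-width confidence bins, and the common "confidence of the predicted class" definition of ECE; the paper's exact binning is not stated in the abstract.

```python
from typing import List


def brier_score(probs: List[float], labels: List[int]) -> float:
    """Mean squared difference between predicted P(y=1) and the 0/1 label."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(
    probs: List[float], labels: List[int], n_bins: int = 10
) -> float:
    """Binary ECE with equal-width bins over the predicted-class confidence."""
    n = len(probs)
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        confidence = max(p, 1.0 - p)           # confidence in predicted class
        correct = 1.0 if (p >= 0.5) == (y == 1) else 0.0
        idx = min(int(confidence * n_bins), n_bins - 1)
        bins[idx].append((confidence, correct))
    ece = 0.0
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(k for _, k in members) / len(members)
            # Weight each bin's |accuracy - confidence| gap by its share of samples.
            ece += len(members) / n * abs(accuracy - avg_conf)
    return ece
```

Lower is better for both metrics. A model that is always right but only 90% confident still has a nonzero ECE, because it is underconfident.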

[24] iCLP: Large Language Model Reasoning with Implicit Cognition Latent Planning

Sijia Chen,Di Niu

Main category: cs.CL

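TL;DR: The paper proposes iCLP, a framework in which implicit planning generates compact reasoning instructions in latent space, improving the accuracy, efficiency, and cross-domain generalization of LLMs on mathematical reasoning and code generation.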

Details Motivation: Because LLMs are prone to hallucination and task questions are highly diverse, accurate explicit textual plans are hard to generate, so a more robust form of planning is needed. Method: Explicit plans are distilled from existing reasoning trajectories, a vector-quantized autoencoder learns discrete latent representations of them, and LLMs are fine-tuned to reason conditioned on these latent plans. Result: Accuracy and reasoning efficiency improve significantly on mathematical reasoning and code generation, with strong cross-domain generalization and preserved chain-of-thought interpretability. Conclusion: iCLP enables LLMs to plan implicitly in latent space while reasoning explicitly in language space, which improves performance and generalization on complex tasks. Abstract: Large language models (LLMs), when guided by explicit textual plans, can perform reliable step-by-step reasoning during problem-solving. However, generating accurate and effective textual plans remains challenging due to LLM hallucinations and the high diversity of task-specific questions. To address this, we draw inspiration from human Implicit Cognition (IC), the subconscious process by which decisions are guided by compact, generalized patterns learned from past experiences without requiring explicit verbalization. We propose iCLP, a novel framework that enables LLMs to adaptively generate latent plans (LPs), which are compact encodings of effective reasoning instructions. iCLP first distills explicit plans from existing step-by-step reasoning trajectories. It then learns discrete representations of these plans via a vector-quantized autoencoder coupled with a codebook. Finally, by fine-tuning LLMs on paired latent plans and corresponding reasoning steps, the models learn to perform implicit planning during reasoning. Experimental results on mathematical reasoning and code generation tasks demonstrate that, with iCLP, LLMs can plan in latent space while reasoning in language space. This approach yields significant improvements in both accuracy and efficiency and, crucially, demonstrates strong cross-domain generalization while preserving the interpretability of chain-of-thought reasoning.

[25] Beyond Hallucinations: A Composite Score for Measuring Reliability in Open-Source Large Language Models

Rohit Kumar Salla,Manoj Saravanan,Shrikar Reddy Kota

Main category: cs.CL

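TL;DR: This paper proposes the Composite Reliability Score (CRS), a unified framework for jointly assessing the calibration, robustness, and uncertainty quantification of large language models, and validates it on multiple models and datasets.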

Details Motivation: LLMs are widely used in decision-critical domains, yet their reliability (overconfident errors, degradation under input shifts, lack of uncertainty estimates) remains unclear, and existing evaluations are fragmented and cannot measure reliability as a whole. Method: The CRS framework integrates calibration, robustness, and uncertainty quantification into a single interpretable metric; ten open-source LLMs are evaluated on five QA datasets under baseline conditions, perturbations, and different calibration methods. Result: CRS yields stable model rankings, uncovers hidden failure modes that single metrics miss, and shows that the most reliable systems balance accuracy, robustness, and calibrated uncertainty. Conclusion: CRS provides a unified, interpretable measure of LLM reliability that helps identify more trustworthy models for real-world use. Abstract: Large Language Models (LLMs) like LLaMA, Mistral, and Gemma are increasingly used in decision-critical domains such as healthcare, law, and finance, yet their reliability remains uncertain. They often make overconfident errors, degrade under input shifts, and lack clear uncertainty estimates. Existing evaluations are fragmented, addressing only isolated aspects. We introduce the Composite Reliability Score (CRS), a unified framework that integrates calibration, robustness, and uncertainty quantification into a single interpretable metric. Through experiments on ten leading open-source LLMs across five QA datasets, we assess performance under baselines, perturbations, and calibration methods. CRS delivers stable model rankings, uncovers hidden failure modes missed by single metrics, and highlights that the most dependable systems balance accuracy, robustness, and calibrated uncertainty.

[26] HY-MT1.5 Technical Report

Mao Zheng,Zheng Li,Tao Chen,Mingyang Song,Di Wang

Main category: cs.CL

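TL;DR: This report introduces the machine translation models HY-MT1.5-1.8B and HY-MT1.5-7B, trained with a multi-stage framework. They clearly outperform existing open-source and commercial systems on Chinese-foreign, English-foreign, and low-resource translation tasks, and support advanced features such as terminology control, context-aware translation, and format preservation.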

Details Motivation: To improve the performance and parameter efficiency of machine translation models, in particular to surpass existing large models on Chinese-foreign and English-foreign tasks, and to support professional translation needs. Method: A multi-stage training framework combining general and MT-oriented pre-training, supervised fine-tuning, on-policy distillation, and reinforcement learning is used to build the HY-MT1.5 series. Result: With only 1.8B parameters, HY-MT1.5-1.8B outperforms larger models such as Qwen3-32B and reaches about 90% of Gemini-3.0-Pro's performance; HY-MT1.5-7B reaches 95% of Gemini-3.0-Pro's performance on Flores-200 and surpasses it on WMT25 and Mandarin-minority language test sets. Conclusion: The HY-MT1.5 models achieve state-of-the-art translation performance for their parameter scales and combine efficiency with versatility for general and specialized translation. Abstract: In this report, we introduce our latest translation models, HY-MT1.5-1.8B and HY-MT1.5-7B, a new family of machine translation models developed through a holistic training framework tailored for high-performance translation. Our methodology orchestrates a multi-stage pipeline that integrates general and MT-oriented pre-training, supervised fine-tuning, on-policy distillation, and reinforcement learning. HY-MT1.5-1.8B, the 1.8B-parameter model, demonstrates remarkable parameter efficiency, comprehensively outperforming significantly larger open-source baselines (e.g., Tower-Plus-72B, Qwen3-32B) and mainstream commercial APIs (e.g., Microsoft Translator, Doubao Translator) in standard Chinese-foreign and English-foreign tasks. It achieves approximately 90% of the performance of ultra-large proprietary models such as Gemini-3.0-Pro; while marginally trailing Gemini-3.0-Pro on WMT25 and Mandarin-minority language benchmarks, it maintains a substantial lead over other competing models. Furthermore, HY-MT1.5-7B establishes a new state-of-the-art for its size class, achieving 95% of Gemini-3.0-Pro's performance on Flores-200 and surpassing it on the challenging WMT25 and Mandarin-minority language test sets. Beyond standard translation, the HY-MT1.5 series supports advanced constraints, including terminology intervention, context-aware translation, and format preservation. Extensive empirical evaluations confirm that both models offer highly competitive, robust solutions for general and specialized translation tasks within their respective parameter scales.

[27] Training a Huggingface Model on AWS Sagemaker (Without Tears)

Liling Tan

Main category: cs.CL

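TL;DR: This paper centralizes the key information researchers need to train their first Hugging Face model on AWS SageMaker from scratch, with the aim of making cloud adoption more accessible.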

Details Motivation: Lacking on-premise compute, many researchers turn to cloud services to train models, but the steep learning curve and scattered documentation of cloud platforms create a barrier. Method: The paper consolidates and organizes the core steps and information required to train a Hugging Face model on AWS SageMaker from scratch into a single guide. Result: Researchers get a clear, complete practical guide that lowers the technical barrier to training large language models on cloud platforms. Conclusion: By simplifying the cloud workflow, the work helps more researchers use cloud resources for model training and broadens access to LLM research. Abstract: The development of Large Language Models (LLMs) has primarily been driven by resource-rich research groups and industry partners. Due to the lack of on-premise computing resources required for increasingly complex models, many researchers are turning to cloud services like AWS SageMaker to train Hugging Face models. However, the steep learning curve of cloud platforms often presents a barrier for researchers accustomed to local environments. Existing documentation frequently leaves knowledge gaps, forcing users to seek fragmented information across the web. This demo paper aims to democratize cloud adoption by centralizing the essential information required for researchers to successfully train their first Hugging Face model on AWS SageMaker from scratch.

[28] Activation Steering for Masked Diffusion Language Models

Adi Shnaidman,Erin Feiglin,Osher Yaari,Efrat Mentel,Amit Levi,Raz Lapid

Main category: cs.CL

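TL;DR: The paper proposes an activation-steering framework for masked diffusion language models (MDLMs) that computes layer-wise steering vectors from contrastive examples, providing efficient inference-time control.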

Details Motivation: Existing MDLMs lack effective mechanisms for inference-time control and steering, which limits their flexibility in practice. Method: Layer-wise steering vectors are computed from contrastive examples in a single forward pass and applied at every reverse-diffusion step, without simulating the denoising trajectory. Result: Experiments on LLaDA-8B-Instruct show reliable modulation of high-level text attributes, and ablations analyze the effect of steering different transformer sub-modules and token scopes. Conclusion: The proposed activation-steering framework gives MDLMs an efficient, flexible inference-time control mechanism and improves the controllability of generated text. Abstract: Masked diffusion language models (MDLMs) generate text through an iterative denoising process. They have recently gained attention due to mask-parallel decoding and competitive performance with autoregressive large language models. However, effective mechanisms for inference-time control and steering in MDLMs remain largely unexplored. We present an activation-steering framework for MDLMs that computes layer-wise steering vectors from a single forward pass using contrastive examples, without simulating the denoising trajectory. These directions are applied at every reverse-diffusion step, yielding an efficient inference-time control mechanism. Experiments on LLaDA-8B-Instruct demonstrate reliable modulation of high-level attributes, with ablations examining the effects of steering across transformer sub-modules and token scope (prompt vs. response).
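A common way to obtain a steering vector from contrastive examples is the difference of mean activations, sketched below for a single layer. Whether the paper uses exactly this estimator is an assumption; the function names, the scaling factor `alpha`, and the plain-list activations are all hypothetical and stand in for real hidden states.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def steering_vector(positive: List[Vector], negative: List[Vector]) -> Vector:
    """Difference of means between activations of contrastive example sets.

    Assumption: `positive` and `negative` hold one layer's activations
    collected from a single forward pass over each contrastive example.
    """
    mean_pos = mean_vector(positive)
    mean_neg = mean_vector(negative)
    return [p - q for p, q in zip(mean_pos, mean_neg)]


def apply_steering(hidden: Vector, direction: Vector, alpha: float) -> Vector:
    """Shift a hidden state along the steering direction by strength alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]
```

In the setting the abstract describes, a shift like `apply_steering` would be applied to the chosen layer at every reverse-diffusion step, with one vector per layer.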

[29] Large Emotional World Model

Changhao Song,Yazhou Zhang,Hui Gao,Chang Yang,Peng Zhang

Main category: cs.CL

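TL;DR: This paper proposes the Large Emotional World Model (LEWM). By building EWH, a dataset of causal relationships that incorporate emotion, it lets a world model represent emotional states explicitly and better predict emotion-driven social behavior.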

Details Motivation: Existing LLMs capture some world knowledge but focus mainly on physical regularities and lack systematic modeling of emotion, even though emotion is central to human decision-making and understanding of the world. Method: Inspired by theory of mind, the authors build the Emotion-Why-How (EWH) dataset, which integrates emotion, causal relationships, and behavioral motivation, and propose LEWM, which jointly models visual observations, actions, and emotional states to predict future states and emotional transitions. Result: LEWM predicts emotion-driven social behaviors more accurately while performing comparably to general world models on basic tasks. Conclusion: Incorporating emotion into world models improves the understanding and prediction of human behavior, and LEWM points to a new direction for building more socially intelligent agents. Abstract: World Models serve as tools for understanding the current state of the world and predicting its future dynamics, with broad application potential across numerous fields. As a key component of world knowledge, emotion significantly influences human decision-making. While existing Large Language Models (LLMs) have shown preliminary capability in capturing world knowledge, they primarily focus on modeling physical-world regularities and lack systematic exploration of emotional factors. In this paper, we first demonstrate the importance of emotion in understanding the world by showing that removing emotionally relevant information degrades reasoning performance. Inspired by theory of mind, we further propose a Large Emotional World Model (LEWM). Specifically, we construct the Emotion-Why-How (EWH) dataset, which integrates emotion into causal relationships and enables reasoning about why actions occur and how emotions drive future world states. Based on this dataset, LEWM explicitly models emotional states alongside visual observations and actions, allowing the world model to predict both future states and emotional transitions. Experimental results show that LEWM more accurately predicts emotion-driven social behaviors while maintaining comparable performance to general world models on basic tasks.

[30] Training Report of TeleChat3-MoE

Xinzhang Liu,Chao Wang,Zhihao Yang,Zhuo Jiang,Xuncheng Zhao,Haoran Wang,Lei Li,Dongdong He,Luobin Liu,Kaizhe Yuan,Han Gao,Zihan Wang,Yitong Yao,Sishi Xiong,Wenmin Deng,Haowei He,Kaidong Yu,Yu Zhao,Ruiyu Fang,Yuhao Jiang,Yingyan Li,Xiaohui Hu,Xi Yu,Jingqi Li,Yanwei Liu,Qingli Li,Xinyu Shi,Junhao Niu,Chengnuo Huang,Yao Xiao,Ruiwen Wang,Fengkai Li,Luwen Pu,Kaipeng Jia,Fubei Yao,Yuyao Huang,Xuewei He,Zhuoru Jiang,Ruiting Song,Rui Xue,Qiyi Xie,Jie Zhang,Zilu Huang,Zhaoxi Zhang,Zhilong Lu,Yanhan Zhang,Yin Zhang,Yanlei Xue,Zhu Yuan,Teng Su,Xin Jiang,Shuangyong Song,Yongxiang Li,Xuelong Li

Main category: cs.CL

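TL;DR: TeleChat3-MoE is a series of Mixture-of-Experts large language models, scaling up to over one trillion parameters, trained on an Ascend NPU cluster. This technical report focuses on the training infrastructure that makes such scaling efficient and reliable.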

Details Motivation: Stable, efficient training of very large language models (such as trillion-parameter MoE models) on dedicated hardware (Ascend NPUs) requires solving cross-platform numerical consistency, distributed training bottlenecks, and multi-dimensional parallelism tuning. Method: The report presents systematic operator-level and end-to-end numerical accuracy verification; a performance optimization suite comprising interleaved pipeline scheduling, attention-aware data scheduling, hierarchical and overlapped communication, and DVM-based operator fusion; a parallelism configuration framework based on analytical estimation and integer linear programming; and cluster-level optimizations for host-bound and device-bound bottlenecks. Result: The infrastructure achieves near-linear scaling and significant throughput gains on clusters of thousands of devices, including long-sequence training and expert-parallel scenarios. Conclusion: The proposed infrastructure offers a reliable, efficient engineering approach to training very large language models on a domestic hardware ecosystem and demonstrates the viability of Ascend NPU clusters for frontier model training. Abstract: TeleChat3-MoE is the latest series of TeleChat large language models, featuring a Mixture-of-Experts (MoE) architecture with parameter counts ranging from 105 billion to over one trillion, trained end-to-end on an Ascend NPU cluster. This technical report mainly presents the underlying training infrastructure that enables reliable and efficient scaling to frontier model sizes. We detail systematic methodologies for operator-level and end-to-end numerical accuracy verification, ensuring consistency across hardware platforms and distributed parallelism strategies. Furthermore, we introduce a suite of performance optimizations, including interleaved pipeline scheduling, attention-aware data scheduling for long-sequence training, hierarchical and overlapped communication for expert parallelism, and DVM-based operator fusion. A systematic parallelization framework, leveraging analytical estimation and integer linear programming, is also proposed to optimize multi-dimensional parallelism configurations. Additionally, we present methodological approaches to cluster-level optimizations, addressing host- and device-bound bottlenecks during large-scale training tasks. These infrastructure advancements yield significant throughput improvements and near-linear scaling on clusters comprising thousands of devices, providing a robust foundation for large-scale language model development on hardware ecosystems.

[31] MedKGI: Iterative Differential Diagnosis with Medical Knowledge Graphs and Information-Guided Inquiring

Qipeng Wang,Rui Sheng,Yafei Li,Huamin Qu,Yushi Sun,Min Zhu

Main category: cs.CL

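TL;DR: MedKGI is a diagnostic framework grounded in clinical practice. By combining a medical knowledge graph, information-gain-based question selection, and structured state tracking, it improves the diagnostic accuracy and dialogue efficiency of LLMs.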

Details Motivation: In clinical diagnosis, existing LLMs hallucinate, ask redundant questions, and lose consistency over multi-turn dialogues, so they struggle to emulate real clinical reasoning. Method: MedKGI constrains reasoning with a medical knowledge graph, selects discriminative questions by information gain, and maintains evidence consistency with an OSCE-format structured state. Result: On clinical benchmarks, MedKGI improves dialogue efficiency by 30% on average while reaching state-of-the-art diagnostic accuracy. Conclusion: MedKGI addresses key weaknesses of LLMs in clinical diagnosis and delivers more efficient, reliable diagnostic reasoning that is consistent with clinical practice. Abstract: Recent advancements in Large Language Models (LLMs) have demonstrated significant promise in clinical diagnosis. However, current models struggle to emulate the iterative, diagnostic hypothesis-driven reasoning of real clinical scenarios. Specifically, current LLMs suffer from three critical limitations: (1) generating hallucinated medical content due to weak grounding in verified knowledge, (2) asking redundant or inefficient questions rather than discriminative ones that hinder diagnostic progress, and (3) losing coherence over multi-turn dialogues, leading to contradictory or inconsistent conclusions. To address these challenges, we propose MedKGI, a diagnostic framework grounded in clinical practices. MedKGI integrates a medical knowledge graph (KG) to constrain reasoning to validated medical ontologies, selects questions based on information gain to maximize diagnostic efficiency, and adopts an OSCE-format structured state to maintain consistent evidence tracking across turns. Experiments on clinical benchmarks show that MedKGI outperforms strong LLM baselines in both diagnostic accuracy and inquiry efficiency, improving dialogue efficiency by 30% on average while maintaining state-of-the-art accuracy.
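Selecting questions by information gain, as the abstract above describes, can be illustrated with a small yes/no example. This is a textbook-style sketch, not MedKGI's code: the disease names, the likelihood table `P(yes | disease)`, and all function names are hypothetical.

```python
import math
from typing import Dict

Dist = Dict[str, float]


def entropy(dist: Dist) -> float:
    """Shannon entropy in bits of a distribution over diagnoses."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def posterior(prior: Dist, p_yes: Dist, answered_yes: bool) -> Dist:
    """Bayes update of the diagnosis distribution after a yes/no answer."""
    unnorm = {
        d: p * (p_yes[d] if answered_yes else 1.0 - p_yes[d])
        for d, p in prior.items()
    }
    total = sum(unnorm.values())
    return {d: v / total for d, v in unnorm.items()} if total > 0 else dict(prior)


def information_gain(prior: Dist, p_yes: Dist) -> float:
    """Expected entropy reduction from asking one yes/no question."""
    prob_yes = sum(prior[d] * p_yes[d] for d in prior)
    gain = entropy(prior)
    for answer, prob in ((True, prob_yes), (False, 1.0 - prob_yes)):
        if prob > 0:
            gain -= prob * entropy(posterior(prior, p_yes, answer))
    return gain


def best_question(prior: Dist, questions: Dict[str, Dist]) -> str:
    """Pick the question with the highest expected information gain."""
    return max(questions, key=lambda q: information_gain(prior, questions[q]))
```

A question whose answer is equally likely under every candidate diagnosis has zero gain, which is the redundant kind of question the abstract says should be avoided.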

[32] LAILA: A Large Trait-Based Dataset for Arabic Automated Essay Scoring

May Bashendy,Walid Massoud,Sohaila Eltanbouly,Salam Albatarni,Marwan Sayed,Abrar Abir,Houda Bouamor,Tamer Elsayed

Main category: cs.CL

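TL;DR: This paper introduces LAILA, the largest publicly available Arabic automated essay scoring (AES) dataset to date. It contains 7,859 essays annotated with holistic and trait-specific scores on seven dimensions (relevance, organization, vocabulary, style, development, mechanics, and grammar) and provides benchmark results for state-of-the-art Arabic and English models in prompt-specific and cross-prompt settings.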

Details Motivation: Research on Arabic automated essay scoring (AES) has been held back by the lack of publicly available datasets, so a large public dataset is needed to advance the field. Method: The authors design and collect LAILA, 7,859 Arabic essays each manually annotated on seven scoring dimensions, and evaluate state-of-the-art Arabic and English pre-trained models on it in prompt-specific and cross-prompt settings. Result: LAILA is currently the largest public Arabic AES dataset and covers multiple scoring dimensions; experiments indicate that existing models still have room for improvement, especially in the cross-prompt setting, and the results provide a reliable benchmark for future work. Conclusion: LAILA fills a critical gap in Arabic AES research, offers an important resource for developing more robust scoring systems, and is expected to spur further progress in the field. Abstract: Automated Essay Scoring (AES) has gained increasing attention in recent years, yet research on Arabic AES remains limited due to the lack of publicly available datasets. To address this, we introduce LAILA, the largest publicly available Arabic AES dataset to date, comprising 7,859 essays annotated with holistic and trait-specific scores on seven dimensions: relevance, organization, vocabulary, style, development, mechanics, and grammar. We detail the dataset design, collection, and annotations, and provide benchmark results using state-of-the-art Arabic and English models in prompt-specific and cross-prompt settings. LAILA fills a critical need in Arabic AES research, supporting the development of robust scoring systems.

[33] Tracing the Flow of Knowledge From Science to Technology Using Deep Learning

Michael E. Rose,Mainak Ghosh,Sebastian Erhardt,Cheng Li,Erik Buunk,Dietmar Harhoff

Main category: cs.CL

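TL;DR: This paper presents Pat-SPECTER, a language similarity model suited to both patents and scientific publications. It performs best in a comparison of eight models, and the study supports the hypothesis that US patents cite papers that are semantically less similar.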

Details Motivation: To handle language similarity across patents and scientific publications with a single model that can predict credible patent-paper citations. Method: The SPECTER2 model is fine-tuned on patent data to build Pat-SPECTER, which is compared against other models in a horse race-style evaluation of eight language models. Result: Pat-SPECTER performs best at predicting patent-paper citations and demonstrates its capabilities in two real-world scenarios; the study finds that papers cited by US patents are semantically less similar than those cited in other jurisdictions. Conclusion: Pat-SPECTER is the best of the evaluated models for computing similarity between patents and papers, and the results support the hypothesis that the duty of candor leads US patents to cite less related papers. Abstract: We develop a language similarity model suitable for working with patents and scientific publications at the same time. In a horse race-style evaluation, we subject eight language (similarity) models to predict credible Patent-Paper Citations. We find that our Pat-SPECTER model performs best, which is the SPECTER2 model fine-tuned on patents. In two real-world scenarios (separating patent-paper-pairs and predicting patent-paper-pairs) we demonstrate the capabilities of Pat-SPECTER. We finally test the hypothesis that US patents cite papers that are semantically less similar than in other large jurisdictions, which we posit is because of the duty of candor. The model is open for the academic community and practitioners alike.

[34] Joint Selection for Large-Scale Pre-Training Data via Policy Gradient-based Mask Learning

Ziqing Fan,Yuqiao Xian,Yan Sun,Li Shen

Main category: cs.CL

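TL;DR: This paper proposes DATAMASK, an efficient joint learning framework for large-scale pre-training data selection. It casts selection as a mask learning problem, cuts selection time by 98.9%, and improves performance across 12 diverse tasks.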

Details Motivation: Existing data selection methods based on quality or diversity metrics either show diminishing returns or discard high-value samples during long-term pre-training, and jointly optimizing several metric types over trillion-token data is difficult. Method: Data selection is treated as a mask learning problem: data masks are sampled iteratively, a policy-gradient objective is optimized, and the mask sampling logits are updated, which jointly optimizes quality and diversity metrics; several acceleration techniques improve efficiency. Result: DATAMASK selects a subset of about 10% of the 15-trillion-token FineWeb dataset, named FineWeb-Mask, which yields improvements of 3.2% on a 1.5B dense model and 1.9% on a 7B MoE model, with selection time reduced by 98.9%. Conclusion: DATAMASK jointly optimizes multiple types of data selection metrics efficiently and improves large language model performance on a range of tasks, offering a practical route to fine-grained filtering of trillion-token pre-training data. Abstract: A fine-grained data recipe is crucial for pre-training large language models, as it can significantly enhance training efficiency and model performance. One important ingredient in the recipe is to select samples based on scores produced by defined rules, LLM judgment, or statistical information in embeddings, which can be roughly categorized into quality and diversity metrics. Due to the high computational cost when applied to trillion-scale token pre-training datasets such as FineWeb and DCLM, these two or more types of metrics are rarely considered jointly in a single selection process. However, in our empirical study, selecting samples based on quality metrics exhibits severe diminishing returns during long-term pre-training, while selecting on diversity metrics removes too many valuable high-quality samples, both of which limit pre-trained LLMs' capabilities. Therefore, we introduce DATAMASK, a novel and efficient joint learning framework designed for large-scale pre-training data selection that can simultaneously optimize multiple types of metrics in a unified process, with this study focusing specifically on quality and diversity metrics. DATAMASK approaches the selection process as a mask learning problem, involving iterative sampling of data masks, computation of policy gradients based on predefined objectives with sampled masks, and updating of mask sampling logits. Through policy gradient-based optimization and various acceleration enhancements, it significantly reduces selection time by 98.9% compared to a greedy algorithm, enabling our study to explore joint learning within trillion-scale tokens. With DATAMASK, we select a subset of about 10% from the 15 trillion-token FineWeb dataset, termed FineWeb-Mask. Evaluated across 12 diverse tasks, we achieve significant improvements of 3.2% on a 1.5B dense model and 1.9% on a 7B MoE model.
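
The mask-learning loop described in the abstract (sample masks, compute policy gradients, update mask logits) can be illustrated with a toy REINFORCE sketch. This is not the authors' implementation: the reward function, the `groups` diversity proxy, the `budget` term, and every hyperparameter below are assumptions made for illustration only.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def learn_mask(quality, groups, steps=400, samples=32, lr=0.5, budget=0.5, seed=0):
    """Toy REINFORCE loop over per-sample Bernoulli mask logits.

    quality: hypothetical per-sample quality scores in [0, 1].
    groups:  hypothetical cluster ids used as a crude diversity proxy.
    Returns the learned keep-probability of every sample.
    """
    rng = random.Random(seed)
    n = len(quality)
    n_groups = len(set(groups))
    logits = [0.0] * n

    def reward(mask):
        kept = [i for i in range(n) if mask[i]]
        if not kept:
            return 0.0
        mean_quality = sum(quality[i] for i in kept) / len(kept)
        coverage = len({groups[i] for i in kept}) / n_groups
        size_penalty = abs(len(kept) / n - budget)  # stay near the target fraction
        return mean_quality + coverage - size_penalty

    for _ in range(steps):
        probs = [sigmoid(l) for l in logits]
        # 1) Sample a batch of data masks from the current policy.
        masks = [[1 if rng.random() < p else 0 for p in probs] for _ in range(samples)]
        rewards = [reward(m) for m in masks]
        baseline = sum(rewards) / samples  # variance-reduction baseline
        # 2) Policy gradient: d/dlogit log Bernoulli(m; sigmoid(logit)) = m - p.
        for i in range(n):
            grad = sum((r - baseline) * (m[i] - probs[i])
                       for m, r in zip(masks, rewards)) / samples
            # 3) Update the mask sampling logits (gradient ascent on reward).
            logits[i] += lr * grad
    return [sigmoid(l) for l in logits]


if __name__ == "__main__":
    keep_probs = learn_mask([0.95, 0.05, 0.9, 0.1], [0, 0, 1, 1])
    print([round(p, 2) for p in keep_probs])
```

In this toy setting the high-quality sample in each group should end up with the larger keep-probability, since the reward favors both quality and group coverage.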

[35] Automated Analysis of Sustainability Reports: Using Large Language Models for the Extraction and Prediction of EU Taxonomy-Compliant KPIs

Jonathan Schmoll,Adam Jatowt

Main category: cs.CL

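TL;DR: This paper introduces a new structured dataset for EU Taxonomy compliance assessment, built from 190 corporate reports and containing ground-truth economic activities and key performance indicators (KPIs). It presents the first systematic evaluation of large language models (LLMs) on this workflow: LLMs perform moderately on the qualitative task and fail entirely on the quantitative task. The study also uncovers a paradox in which concise metadata outperforms full reports, and finds that model confidence is poorly calibrated. LLMs cannot yet fully automate the process but can serve as assistive tools for experts, and the dataset provides a public benchmark for future research.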

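Details Motivation: Research on using large language models (LLMs) to automate the EU Taxonomy compliance workflow is held back by the lack of public benchmark datasets; a real, structured dataset is needed to evaluate model performance systematically. Method: A structured dataset is built from 190 corporate reports with ground-truth labels for economic activities and quantitative KPIs; LLMs, including a multi-step agentic framework, are evaluated in a zero-shot setting on qualitative identification and quantitative prediction tasks, and the relationship between model outputs, input length, and confidence is analyzed. Result: LLMs perform moderately on the qualitative task of identifying economic activities, with the multi-step framework slightly improving precision; they fail comprehensively on the quantitative task of predicting financial KPIs; concise metadata often yields better performance than full reports; and model confidence correlates weakly with actual accuracy, indicating poor calibration. Conclusion: Current LLMs are not yet sufficient to fully automate the EU Taxonomy compliance workflow, but they can serve as assistive tools that make human experts more efficient; the proposed dataset provides an important public benchmark for follow-up research. Abstract: The manual, resource-intensive process of complying with the EU Taxonomy presents a significant challenge for companies. While Large Language Models (LLMs) offer a path to automation, research is hindered by a lack of public benchmark datasets. To address this gap, we introduce a novel, structured dataset from 190 corporate reports, containing ground-truth economic activities and quantitative Key Performance Indicators (KPIs). We use this dataset to conduct the first systematic evaluation of LLMs on the core compliance workflow. Our results reveal a clear performance gap between qualitative and quantitative tasks. LLMs show moderate success in the qualitative task of identifying economic activities, with a multi-step agentic framework modestly enhancing precision. Conversely, the models comprehensively fail at the quantitative task of predicting financial KPIs in a zero-shot setting. We also discover a paradox, where concise metadata often yields superior performance to full, unstructured reports, and find that model confidence scores are poorly calibrated. We conclude that while LLMs are not ready for full automation, they can serve as powerful assistive tools for human experts. Our dataset provides a public benchmark for future research.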

[36] Figure It Out: Improving the Frontier of Reasoning with Active Visual Thinking

Meiqi Chen,Fandong Meng,Jie Zhou

Main category: cs.CL

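TL;DR: This paper proposes FIGR, a model that integrates active visual thinking into multi-turn reasoning through end-to-end reinforcement learning, using visual representations to better capture global structural relationships in complex problems.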

Details Motivation: Complex reasoning problems often involve implicit spatial, geometric, and structural relationships, and purely text-based reasoning struggles to capture these global structural constraints. Method: FIGR externalizes intermediate structural hypotheses by constructing visual representations during reasoning, and uses reinforcement learning to adaptively regulate when and how visual reasoning is invoked. Result: On mathematical reasoning benchmarks, FIGR outperforms strong text-only chain-of-thought baselines and improves the base model by 13.12% on AIME 2025 and 11.00% on BeyondAIME. Conclusion: Figure-guided multimodal reasoning markedly improves the stability and reliability of complex reasoning, especially for structural information that is hard to express in text. Abstract: Complex reasoning problems often involve implicit spatial, geometric, and structural relationships that are not explicitly encoded in text. While recent reasoning models have achieved strong performance across many domains, purely text-based reasoning struggles to represent global structural constraints in complex settings. In this paper, we introduce FIGR, which integrates active visual thinking into multi-turn reasoning via end-to-end reinforcement learning. FIGR externalizes intermediate structural hypotheses by constructing visual representations during problem solving. By adaptively regulating when and how visual reasoning should be invoked, FIGR enables more stable and coherent reasoning over global structural properties that are difficult to capture from text alone. Experiments on challenging mathematical reasoning benchmarks demonstrate that FIGR outperforms strong text-only chain-of-thought baselines. In particular, FIGR improves the base model by 13.12% on AIME 2025 and 11.00% on BeyondAIME, highlighting the effectiveness of figure-guided multimodal reasoning in enhancing the stability and reliability of complex reasoning.

[37] QianfanHuijin Technical Report: A Novel Multi-Stage Training Paradigm for Finance Industrial LLMs

Shupeng Li,Weipeng Lu,Linyun Liu,Chen Lin,Shaofei Li,Zhendong Tan,Hanjun Zhong,Yucheng Zeng,Chenghao Zhu,Mengyue Liu,Daxiang Dong,Jianmin Wu,Yunting Xiao,Annan Li,Danyu Liu,Jingnan Zhang,Licen Liu,Dawei Yin,Dou Shen

Main category: cs.CL

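TL;DR: This paper presents QianfanHuijin, a large language model for the financial domain, together with a generalizable multi-stage training paradigm that progressively strengthens domain knowledge, reasoning, and agentic capabilities, achieving superior performance on authoritative financial benchmarks.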

Details Motivation: As financial services grow more complex, models with domain knowledge alone no longer suffice; models that also have financial reasoning and agentic capabilities are needed. Method: A multi-stage training paradigm is used: continual pre-training (CPT) on financial corpora, followed in order by financial supervised fine-tuning (SFT), finance reasoning reinforcement learning (RL), and finance agentic RL, and finally general RL aligned with real-world business scenarios. Result: QianfanHuijin performs strongly on multiple authoritative financial benchmarks, and ablation studies show that the reasoning RL and agentic RL stages significantly improve their respective capabilities. Conclusion: This fine-grained, progressive post-training approach effectively enhances industrial-grade large models and may become a mainstream paradigm for domain model enhancement across industries. Abstract: Domain-specific enhancement of Large Language Models (LLMs) within the financial context has long been a focal point of industrial application. While previous models such as BloombergGPT and Baichuan-Finance primarily focused on knowledge enhancement, the deepening complexity of financial services has driven a growing demand for models that possess not only domain knowledge but also robust financial reasoning and agentic capabilities. In this paper, we present QianfanHuijin, a financial domain LLM, and propose a generalizable multi-stage training paradigm for industrial model enhancement. Our approach begins with Continual Pre-training (CPT) on financial corpora to consolidate the knowledge base. This is followed by a fine-grained Post-training pipeline designed with increasing specificity: starting with Financial SFT, progressing to Finance Reasoning RL and Finance Agentic RL, and culminating in General RL aligned with real-world business scenarios. Empirical results demonstrate that QianfanHuijin achieves superior performance across various authoritative financial benchmarks. Furthermore, ablation studies confirm that the targeted Reasoning RL and Agentic RL stages yield significant gains in their respective capabilities. These findings validate our motivation and suggest that this fine-grained, progressive post-training methodology is poised to become a mainstream paradigm for various industrial-enhanced LLMs.

[38] World model inspired sarcasm reasoning with large language model agents

Keito Inoshita,Shinnosuke Mizuno

Main category: cs.CL

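TL;DR: This paper proposes WM-SAR, which reformulates sarcasm understanding as a world-model-inspired reasoning process. Multiple LLM agents decompose literal meaning, context, normative expectation, and intention, and an interpretable logistic regression model fuses the inconsistency and intention scores to detect sarcasm.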

Details Motivation: Existing sarcasm detection methods mostly rely on black-box models and offer no structured explanation of the cognitive factors involved; they also fail to model explicitly the mismatch between semantic evaluation and normative expectations. A more interpretable framework grounded in cognitive mechanisms is therefore needed. Method: The WM-SAR framework uses specialized LLM agents to model literal meaning, context, normative expectation, and intention separately; it computes an inconsistency score between the literal evaluation and the normative expectation, combines it with an intention score, and feeds both into a lightweight logistic regression model that predicts the sarcasm probability. Result: On several representative sarcasm detection benchmarks, WM-SAR outperforms existing deep learning and LLM-based methods; ablation studies and case analyses show that combining semantic inconsistency with intention reasoning is essential for performance. Conclusion: By introducing an interpretable numerical decision structure, WM-SAR models the cognitive mechanisms of sarcasm in a structured way while maintaining strong performance, advancing interpretable sarcasm understanding. Abstract: Sarcasm understanding is a challenging problem in natural language processing, as it requires capturing the discrepancy between the surface meaning of an utterance and the speaker's intentions as well as the surrounding social context. Although recent advances in deep learning and Large Language Models (LLMs) have substantially improved performance, most existing approaches still rely on black-box predictions of a single model, making it difficult to structurally explain the cognitive factors underlying sarcasm. Moreover, while sarcasm often emerges as a mismatch between semantic evaluation and normative expectations or intentions, frameworks that explicitly decompose and model these components remain limited. In this work, we reformulate sarcasm understanding as a world model inspired reasoning process and propose World Model inspired SArcasm Reasoning (WM-SAR), which decomposes literal meaning, context, normative expectation, and intention into specialized LLM-based agents. The discrepancy between literal evaluation and normative expectation is explicitly quantified as a deterministic inconsistency score, and together with an intention score, these signals are integrated by a lightweight Logistic Regression model to infer the final sarcasm probability. This design leverages the reasoning capability of LLMs while maintaining an interpretable numerical decision structure. Experiments on representative sarcasm detection benchmarks show that WM-SAR consistently outperforms existing deep learning and LLM-based methods. Ablation studies and case analyses further demonstrate that integrating semantic inconsistency and intention reasoning is essential for effective sarcasm detection, achieving both strong performance and high interpretability.
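
The final fusion step in the abstract (a deterministic inconsistency score plus an intention score, combined by logistic regression) might look roughly like the sketch below. The score ranges, the absolute-difference definition of `inconsistency`, and the toy training data are all assumptions; in the paper these signals come from LLM agents, which are not modeled here.

```python
import math


def inconsistency(literal_eval, normative_expectation):
    """Assumed definition: both inputs lie in [-1, 1]; the score is their scaled gap in [0, 1]."""
    return abs(literal_eval - normative_expectation) / 2.0


def predict(w, b, x):
    """Sarcasm probability from features x = [inconsistency, intention]."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Plain batch gradient descent on the logistic log-loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, t in zip(X, y):
            err = predict(w, b, x) - t
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


if __name__ == "__main__":
    # Hypothetical (inconsistency, intention) pairs; 1 = sarcastic, 0 = not sarcastic.
    X = [[0.9, 0.8], [0.8, 0.9], [0.7, 0.7], [0.1, 0.2], [0.2, 0.1], [0.0, 0.3]]
    y = [1, 1, 1, 0, 0, 0]
    w, b = fit_logistic(X, y)
    praise_in_bad_context = [inconsistency(0.8, -0.9), 0.85]
    print(w, b, predict(w, b, praise_in_bad_context))
```

Because the model has only two weights and a bias, each learned coefficient can be read directly as how strongly the corresponding signal pushes toward a sarcasm prediction, which is the interpretability argument the abstract makes.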

[39] Skim-Aware Contrastive Learning for Efficient Document Representation

Waheed Ahmed Abro,Zied Bouraoui

Main category: cs.CL

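TL;DR: This paper proposes a self-supervised contrastive learning framework that mimics human reading strategies to improve long-document representation, with significant gains in accuracy and efficiency on legal and biomedical texts.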

Details Motivation: Existing models for long documents are resource-intensive, capture context incompletely, or lack interpretability, which makes it hard to represent long texts in fields such as law and medicine. Method: A new self-supervised contrastive learning framework randomly masks a section of the document and uses a natural language inference (NLI)-based contrastive objective to align it with relevant parts and push it away from unrelated ones, mimicking how humans skim a text to understand it. Result: Experiments on legal and biomedical texts show significant gains in both accuracy and computational efficiency. Conclusion: The proposed method models long-document structure more effectively and produces richer, more efficient document representations, making it suitable for long-text understanding in specialized domains. Abstract: Although transformer-based models have shown strong performance in word- and sentence-level tasks, effectively representing long documents, especially in fields like law and medicine, remains difficult. Sparse attention mechanisms can handle longer inputs, but are resource-intensive and often fail to capture full-document context. Hierarchical transformer models offer better efficiency but do not clearly explain how they relate different sections of a document. In contrast, humans often skim texts, focusing on important sections to understand the overall message. Drawing from this human strategy, we introduce a new self-supervised contrastive learning framework that enhances long document representation. Our method randomly masks a section of the document and uses a natural language inference (NLI)-based contrastive objective to align it with relevant parts while distancing it from unrelated ones. This mimics how humans synthesize information, resulting in representations that are both richer and more computationally efficient. Experiments on legal and biomedical texts confirm significant gains in both accuracy and efficiency.
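
To make the "align with relevant parts, distance from unrelated ones" idea concrete, here is a generic InfoNCE-style contrastive loss over toy vectors. It is a stand-in, not the paper's NLI-based objective: the embeddings, the cosine similarity, and the `temperature` value are assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Cross-entropy of picking the positive among [positive] + negatives.

    anchor:    embedding of the masked section (hypothetical).
    positive:  embedding of a related part of the same document.
    negatives: embeddings of unrelated sections.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_denominator = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_denominator - logits[0]


if __name__ == "__main__":
    masked_section = [1.0, 0.2, 0.0]
    related_part = [0.9, 0.3, 0.1]
    unrelated_parts = [[0.0, 1.0, 0.0], [0.0, 0.1, 1.0]]
    print(info_nce(masked_section, related_part, unrelated_parts))
```

The loss is small when the masked section's embedding is closer to its related part than to any unrelated section, and grows when an unrelated section is the nearest neighbor.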

[40] Comparing Approaches to Automatic Summarization in Less-Resourced Languages

Chester Palen-Michel,Constantine Lignos

Main category: cs.CL

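TL;DR: This paper studies automatic text summarization for less-resourced languages, comparing zero-shot LLMs, fine-tuned mT5 models, data augmentation, and other approaches. A multilingually fine-tuned mT5 performs best on most metrics, and LLM-as-judge evaluation is less reliable for less-resourced languages.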

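Details Motivation: Text summarization for less-resourced languages is comparatively understudied, and existing high-performing methods concentrate on high-resourced languages such as English, so effective summarization methods for less-resourced languages need to be explored. Method: Several approaches are compared: zero-shot prompting of LLMs of different sizes, fine-tuning mT5 (with and without data augmentation and multilingual transfer), and an LLM pipeline that translates, summarizes, and translates back; all are evaluated with five different metrics. Result: LLMs of similar size vary in performance; the multilingual fine-tuned mT5 baseline outperforms most other approaches, including zero-shot LLMs, on most metrics; and LLM-based evaluation is less reliable on less-resourced languages. Conclusion: For summarization in less-resourced languages, fine-tuning small multilingual models such as mT5 is more effective than zero-shot prompting of large models, and LLMs should be used as evaluators with caution. Abstract: Automatic text summarization has achieved high performance in high-resourced languages like English, but comparatively less attention has been given to summarization in less-resourced languages. This work compares a variety of different approaches to summarization from zero-shot prompting of LLMs large and small to fine-tuning smaller models like mT5 with and without three data augmentation approaches and multilingual transfer. We also explore an LLM translation pipeline approach, translating from the source language to English, summarizing and translating back. Evaluating with five different metrics, we find that there is variation across LLMs in their performance across similar parameter sizes, that our multilingual fine-tuned mT5 baseline outperforms most other approaches including zero-shot LLM performance for most metrics, and that LLM as judge may be less reliable on less-resourced languages.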

[41] Cleaning English Abstracts of Scientific Publications

Michael E. Rose,Nils A. Herrmann,Sebastian Erhardt

Main category: cs.CL

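TL;DR: This paper introduces an open-source language model that automatically cleans extraneous information out of English scientific abstracts, improving the quality of text embeddings and the accuracy of similarity analysis.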

Details Motivation: Scientific abstracts are often used as proxies for research topics, but clutter such as copyright statements and metadata distorts downstream text analysis. Method: An open-source, easy-to-integrate language model is developed that automatically identifies and removes non-essential content from scientific abstracts. Result: The model is conservative and precise, changes the similarity rankings of cleaned abstracts, and increases the information content of standard-length embeddings. Conclusion: The proposed model improves the quality of scientific text preprocessing and suits research analyses that depend on text similarity and embedding representations. Abstract: Scientific abstracts are often used as proxies for the content and thematic focus of research publications. However, a significant share of published abstracts contains extraneous information (such as publisher copyright statements, section headings, author notes, registrations, and bibliometric or bibliographic metadata) that can distort downstream analyses, particularly those involving document similarity or textual embeddings. We introduce an open-source, easy-to-integrate language model designed to clean English-language scientific abstracts by automatically identifying and removing such clutter. We demonstrate that our model is both conservative and precise, alters similarity rankings of cleaned abstracts and improves information content of standard-length embeddings.

[42] IELTS Writing Revision Platform with Automated Essay Scoring and Adaptive Feedback

Titas Ramancauskas,Kotryna Ramancauske

Main category: cs.CL

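TL;DR: This paper presents and evaluates an online revision platform for the IELTS writing exam that combines automated scoring with personalized feedback. Through iterative Design-Based Research (DBR) cycles, the scoring model moved from a rule-based approach to a DistilBERT-based regression model, which substantially improved scoring accuracy, and the adaptive feedback was shown to help candidates raise their scores.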

Details Motivation: Traditional IELTS writing preparation lacks personalized feedback grounded in the scoring rubric and struggles to simulate real exam conditions, so an intelligent platform that provides precise, tailored feedback is needed. Method: Design-Based Research (DBR) is used across multiple iterative development cycles; the platform architecture separates conversational guidance from the writing interface to reduce cognitive load; early cycles use rule-based automated essay scoring (AES), while later cycles upgrade to a DistilBERT model with a regression head and add an adaptive feedback mechanism. Result: Early rule-based models suffered from mid-band compression, low accuracy, and negative $R^2$; introducing DistilBERT in DBR Cycle 4 brought the MAE to 0.66 and made $R^2$ positive; Cycle 5 added adaptive feedback, with a mean score gain of 0.060 bands (p = 0.011, Cohen's d = 0.504), although the effect varied by revision strategy; conservative surface-level edits were more reliable than aggressive structural changes. Conclusion: Automated feedback can be an effective supplement to IELTS writing instruction, especially for surface-level language corrections; it still struggles to assess higher-band essays accurately, and future work should combine longitudinal studies with validation by official examiners to improve reliability and practicality. Abstract: This paper presents the design, development, and evaluation of a proposed revision platform assisting candidates for the International English Language Testing System (IELTS) writing exam. Traditional IELTS preparation methods lack personalised feedback tailored to the IELTS writing rubric. To address these shortcomings, the platform features an attractive user interface (UI), an Automated Essay Scoring system (AES), and targeted feedback tailored to candidates and the IELTS writing rubric. The platform architecture separates conversational guidance from a dedicated writing interface to reduce cognitive load and simulate exam conditions. Through iterative Design-Based Research (DBR) cycles, the study progressed from rule-based scoring to transformer-based scoring with a regression head, coupled with adaptive feedback. Early cycles (2-3) revealed fundamental limitations of rule-based approaches: mid-band compression, low accuracy, and negative $R^2$ values. DBR Cycle 4 implemented a DistilBERT transformer model with a regression head, yielding substantial improvements with MAE of 0.66 and positive $R^2$. This enabled Cycle 5's adaptive feedback implementation, which demonstrated statistically significant score improvements (mean +0.060 bands, p = 0.011, Cohen's d = 0.504), though effectiveness varied by revision strategy. Findings suggest automated feedback functions are most suited as a supplement to human instruction, with conservative surface-level corrections proving more reliable than aggressive structural interventions for IELTS preparation contexts. Challenges remain in assessing higher-band essays, and future work should incorporate longitudinal studies with real IELTS candidates and validation from official examiners.
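
The abstract reports MAE and $R^2$, and notes that rule-based scoring produced negative $R^2$ together with mid-band compression. The short sketch below shows how those two metrics are computed and why a scorer that compresses everything toward one band can fall below zero. The band scores are invented for illustration and are not the study's data.

```python
def mae(y_true, y_pred):
    """Mean absolute error between true and predicted band scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    It is negative whenever the predictions are worse than always
    predicting the mean of the true scores.
    """
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    true_bands = [5.0, 6.0, 7.0, 8.0]   # hypothetical examiner scores
    compressed = [6.0, 6.0, 6.0, 6.0]   # every essay pushed to one mid band
    regression = [5.5, 6.0, 6.5, 7.5]   # a scorer that tracks the true ordering
    print(mae(true_bands, compressed), r_squared(true_bands, compressed))
    print(mae(true_bands, regression), r_squared(true_bands, regression))
```

With these invented numbers the compressed scorer has an MAE of 1.0 and a negative $R^2$, while the scorer that follows the ordering has a lower MAE and a positive $R^2$.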

[43] Paragraph Segmentation Revisited: Towards a Standard Task for Structuring Speech

Fabian Retkowski,Alexander Waibel

Main category: cs.CL

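TL;DR: This paper frames paragraph segmentation as a task for automatic speech transcripts, builds the first benchmarks for it (TEDPara and YTSegPara), proposes a constrained-decoding method based on large language models, and designs an efficient compact model, MiniSeg, that achieves state-of-the-art accuracy and, when extended hierarchically, jointly predicts chapters and paragraphs.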

Details Motivation: Speech transcripts are usually unstructured word streams, which hurts readability and reuse; existing text segmentation research lacks natural, robust benchmarks, and paragraph segmentation is often overlooked in speech processing. Method: Two new benchmarks (TEDPara and YTSegPara) are built; a constrained-decoding framework lets large language models insert paragraph breaks while preserving the original transcript; and a lightweight model, MiniSeg, is designed and extended hierarchically to predict chapters and paragraphs jointly. Result: The work establishes the first benchmarks for paragraph segmentation of speech; constrained decoding supports faithful, sentence-aligned evaluation; MiniSeg reaches state-of-the-art accuracy and jointly predicts chapters and paragraphs at minimal computational cost. Conclusion: Paragraph segmentation should become a standard step in speech processing; the resources and methods provided here lay the groundwork for that and bring the speech and text segmentation fields closer together. Abstract: Automatic speech transcripts are often delivered as unstructured word streams that impede readability and repurposing. We recast paragraph segmentation as the missing structuring step and fill three gaps at the intersection of speech processing and text segmentation. First, we establish TEDPara (human-annotated TED talks) and YTSegPara (YouTube videos with synthetic labels) as the first benchmarks for the paragraph segmentation task. The benchmarks focus on the underexplored speech domain, where paragraph segmentation has traditionally not been part of post-processing, while also contributing to the wider text segmentation field, which still lacks robust and naturalistic benchmarks. Second, we propose a constrained-decoding formulation that lets large language models insert paragraph breaks while preserving the original transcript, enabling faithful, sentence-aligned evaluation. Third, we show that a compact model (MiniSeg) attains state-of-the-art accuracy and, when extended hierarchically, jointly predicts chapters and paragraphs with minimal computational cost. Together, our resources and methods establish paragraph segmentation as a standardized, practical task in speech processing.
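
One way to picture the constrained-decoding formulation is a decoder whose only legal actions are "copy the next transcript sentence" or "emit a paragraph break". The sketch below is a simplified, sentence-level illustration under that assumption; `break_prob` is a hypothetical stand-in for the language model's probability of a break, and the `<p>` marker and 0.5 threshold are likewise invented.

```python
BREAK = "<p>"  # hypothetical paragraph-break marker


def constrained_segment(sentences, break_prob, threshold=0.5):
    """Greedy decoding restricted to two actions: copy next sentence, or emit BREAK.

    Because the decoder can never produce anything except the original
    sentences and BREAK, the transcript is preserved by construction.
    """
    output = []
    for index, sentence in enumerate(sentences):
        # A break is only considered between sentences, never before the first one.
        if index > 0 and break_prob(sentences[:index], sentence) > threshold:
            output.append(BREAK)
        output.append(sentence)
    return output


def toy_break_prob(previous, upcoming):
    """Toy scorer: a discourse marker at the start signals a new paragraph."""
    return 0.9 if upcoming.startswith("Next,") else 0.1


if __name__ == "__main__":
    transcript = [
        "Hello everyone.",
        "Today is about OCR.",
        "Next, record linking.",
        "It matches records.",
    ]
    segmented = constrained_segment(transcript, toy_break_prob)
    print(segmented)
    # Removing the markers recovers the transcript, so evaluation stays sentence-aligned.
    print([s for s in segmented if s != BREAK] == transcript)
```

The design choice worth noting is that faithfulness is enforced by the action space rather than checked afterwards, which is what makes sentence-aligned comparison against reference breaks straightforward.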

[44] Safe in the Future, Dangerous in the Past: Dissecting Temporal and Linguistic Vulnerabilities in LLMs

Muhammad Abdullahi Said,Muhammad Sammani Sani

Main category: cs.CL

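TL;DR: This study builds HausaSafety, an adversarial dataset grounded in West African threat scenarios, and systematically evaluates the safety alignment of three leading models (GPT-5.1, Gemini 3 Pro, Claude 4.5 Opus) in English and Hausa. It finds a complex interference mechanism driven by the interaction of language and temporal framing, shows that current safety defenses rely on superficial heuristics rather than deep semantic understanding, and argues for a shift to Invariant Alignment to keep safety stable across languages and temporal framings.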

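Details Motivation: As large language models are integrated into critical global infrastructure, whether safety alignment transfers zero-shot from English to other languages remains a blind spot; in regions with fewer resources but distinctive risks, such as West Africa, this assumption may create serious safety hazards, so systematic safety audits for low-resource languages and localized threat scenarios are urgently needed. Method: A new adversarial dataset, HausaSafety, covers threat scenarios specific to West Africa (such as Yahoo-Yahoo fraud and Dane gun manufacturing); a 2 x 4 factorial design across 1,440 evaluations tests three state-of-the-art models (GPT-5.1, Gemini 3 Pro, Claude 4.5 Opus) under crossed conditions of language (English vs. Hausa) and temporal framing (past, present, future, and so on), and the interaction effects are analyzed. Result: Rather than a simple multilingual safety gap, the study finds a "Complex Interference" mechanism arising from the interaction of language and temporal framing. Claude 4.5 Opus is safer in Hausa (45.0%) than in English (36.7%) because of uncertainty-driven refusal, a reverse linguistic effect. There is also a pronounced Temporal Asymmetry: past-tense framing almost entirely bypasses defenses (15.6% safe), while future-tense scenarios trigger overly conservative refusals (57.2% safe). The safest and most vulnerable configurations differ by a factor of 9.2. Conclusion: The safety of current large models is not a stable property but a dynamic state that depends on linguistic and temporal context; reliance on superficial heuristics creates "Safety Pockets" that leave Global South users exposed to localized harms. The study calls for a shift from the current alignment paradigm to Invariant Alignment to achieve safety stability across languages and temporal framings. Abstract: As Large Language Models (LLMs) integrate into critical global infrastructure, the assumption that safety alignment transfers zero-shot from English to other languages remains a dangerous blind spot. This study presents a systematic audit of three state-of-the-art models (GPT-5.1, Gemini 3 Pro, and Claude 4.5 Opus) using HausaSafety, a novel adversarial dataset grounded in West African threat scenarios (e.g., Yahoo-Yahoo fraud, Dane gun manufacturing). Employing a 2 x 4 factorial design across 1,440 evaluations, we tested the non-linear interaction between language (English vs. Hausa) and temporal framing. Our results challenge the prevailing multilingual safety gap narrative. Instead of a simple degradation in low-resource settings, we identified a mechanism of Complex Interference where safety is determined by the intersection of variables. While models exhibited a Reverse Linguistic effect, with Claude 4.5 Opus proving significantly safer in Hausa (45.0%) than in English (36.7%) due to uncertainty-driven refusal, they suffered catastrophic failures in temporal reasoning. We report a profound Temporal Asymmetry, where past-tense framing bypassed defenses (15.6% safe) while future-tense scenarios triggered hyper-conservative refusals (57.2% safe). The magnitude of this volatility is illustrated by a 9.2x disparity between the safest and most vulnerable configurations, proving that safety is not a fixed property but a context-dependent state. We conclude that current models rely on superficial heuristics rather than robust semantic understanding, creating Safety Pockets that leave Global South users exposed to localized harms. We propose Invariant Alignment as a necessary paradigm shift to ensure safety stability across linguistic and temporal shifts.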

[45] HaluNet: Multi-Granular Uncertainty Modeling for Efficient Hallucination Detection in LLM Question Answering

Chaodong Tong,Qi Zhang,Jiayang Gao,Lei Jiang,Yanbing Liu,Nannan Sun

Main category: cs.CL

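TL;DR: This paper proposes HaluNet, a lightweight neural framework for detecting hallucinations in large language model question answering. It combines token-level probability uncertainty with semantic-representation uncertainty and uses a multi-branch architecture to adaptively fuse the knowledge and uncertainty in the model's outputs, enabling efficient one-pass hallucination detection.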

Details Motivation: Large language models are prone to hallucinations in question answering, such as factual errors or fabricated content. Existing detection methods based on internal uncertainty signals usually focus on a single type of uncertainty and overlook the complementarity among sources, especially between token-level probability uncertainty and semantic-representation uncertainty. Method: HaluNet is a trainable, lightweight neural framework with a multi-branch structure that combines semantic embeddings with probabilistic confidence and distributional uncertainty, integrates multi-granular token-level uncertainties, and adaptively fuses what the model knows with the uncertainty in its outputs. Result: Experiments on SQuAD, TriviaQA, and Natural Questions show that HaluNet delivers strong hallucination detection performance and good computational efficiency, with or without access to context. Conclusion: HaluNet exploits the complementarity of multiple uncertainty sources to detect hallucinations efficiently and accurately, with potential for real-time use in LLM-based question answering systems. Abstract: Large Language Models (LLMs) excel at question answering (QA) but often generate hallucinations, including factual errors or fabricated content. Detecting hallucinations from internal uncertainty signals is attractive due to its scalability and independence from external resources. Existing methods often aim to accurately capture a single type of uncertainty while overlooking the complementarity among different sources, particularly between token-level probability uncertainty and the uncertainty conveyed by internal semantic representations, which provide complementary views on model reliability. We present HaluNet, a lightweight and trainable neural framework that integrates multi-granular token-level uncertainties by combining semantic embeddings with probabilistic confidence and distributional uncertainty. Its multi-branch architecture adaptively fuses what the model knows with the uncertainty expressed in its outputs, enabling efficient one-pass hallucination detection. Experiments on SQuAD, TriviaQA, and Natural Questions show that HaluNet delivers strong detection performance and favorable computational efficiency, with or without access to context, highlighting its potential for real-time hallucination detection in LLM-based QA systems.
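
As a rough illustration of the two token-level signals named in the abstract, probabilistic confidence and distributional uncertainty, the sketch below computes a negative log-likelihood and a Shannon entropy per generated token and mean-pools them over an answer. The feature choice and pooling are assumptions, the probability vectors are invented, and the semantic-embedding branch and the trained fusion network are not modeled.

```python
import math


def token_features(distribution, chosen_index):
    """Return (negative log-likelihood of the chosen token, entropy of the distribution)."""
    nll = -math.log(distribution[chosen_index])
    entropy = -sum(p * math.log(p) for p in distribution if p > 0.0)
    return nll, entropy


def answer_features(distributions, chosen_indices):
    """Mean-pool per-token features over a generated answer (an assumed pooling choice)."""
    per_token = [token_features(d, c) for d, c in zip(distributions, chosen_indices)]
    count = len(per_token)
    mean_nll = sum(f[0] for f in per_token) / count
    mean_entropy = sum(f[1] for f in per_token) / count
    return mean_nll, mean_entropy


if __name__ == "__main__":
    # Hypothetical next-token distributions over a 4-token vocabulary.
    confident_answer = [[0.97, 0.01, 0.01, 0.01], [0.01, 0.97, 0.01, 0.01]]
    uncertain_answer = [[0.25, 0.25, 0.25, 0.25], [0.30, 0.30, 0.20, 0.20]]
    print(answer_features(confident_answer, [0, 1]))
    print(answer_features(uncertain_answer, [0, 1]))
```

A detector along the lines the abstract describes would feed features like these, together with semantic embeddings, into a small trained classifier; only the feature extraction is shown here.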

Hongseok Oh,Wonseok Hwang,Kyoung-Woon On

Main category: cs.CL

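TL;DR: This paper introduces the Korean Canonical Legal Benchmark (KCL), which evaluates language models' legal reasoning independently of domain knowledge. It comprises multiple-choice and open-ended question answering components and provides supporting precedents and automated evaluation tools.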

Details Motivation: The aim is to evaluate language models' legal reasoning ability on its own terms, rather than testing their memorization of specific legal knowledge. Method: A two-component benchmark with supporting precedents is built: KCL-MCQA (283 multiple-choice questions) and KCL-Essay (169 open-ended questions); more than 30 models are evaluated systematically. Result: Existing models still show large gaps on KCL-Essay, and reasoning-specialized models outperform general-purpose ones. Conclusion: KCL separates a model's reasoning ability from its parameterized knowledge more faithfully and pushes legal AI toward genuine reasoning; all resources are open-sourced. Abstract: We introduce the Korean Canonical Legal Benchmark (KCL), a benchmark designed to assess language models' legal reasoning capabilities independently of domain-specific knowledge. KCL provides question-level supporting precedents, enabling a more faithful disentanglement of reasoning ability from parameterized knowledge. KCL consists of two components: (1) KCL-MCQA, multiple-choice problems of 283 questions with 1,103 aligned precedents, and (2) KCL-Essay, open-ended generation problems of 169 questions with 550 aligned precedents and 2,739 instance-level rubrics for automated evaluation. Our systematic evaluation of 30+ models shows large remaining gaps, particularly in KCL-Essay, and that reasoning-specialized models consistently outperform their general-purpose counterparts. We release all resources, including the benchmark dataset and evaluation code, at https://github.com/lbox-kr/kcl.

[47] Understanding and Steering the Cognitive Behaviors of Reasoning Models at Test-Time

Zhenyu Zhang,Xiaoxia Wu,Zhongzhu Zhou,Qingyang Wu,Yineng Zhang,Pragaash Ponnusamy,Harikaran Subbaraj,Jue Wang,Shuaiwen Leon Song,Ben Athiwaratkun

Main category: cs.CL

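TL;DR: This paper proposes CREST, a training-free method for steering cognitive reasoning. By intervening on attention heads associated with specific cognitive behaviors, it suppresses inefficient reasoning modes at inference time, improving both the accuracy and the reasoning efficiency of large language models.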

Details Motivation: Large language models rely on long chain-of-thought reasoning to solve complex tasks, but the reasoning is often inefficient, slow, and unstable (for example, shallow or repetitive thinking), so an effective way to optimize the reasoning process is needed. Method: The study finds that certain attention heads correlate with cognitive behaviors such as verification and backtracking. CREST builds on this by intervening lightly on those heads at inference time, rotating hidden representations to suppress unproductive reasoning behaviors. The method has two steps: offline calibration and inference-time steering. Result: Across multiple reasoning benchmarks and models, CREST improves accuracy by up to 17.5% and reduces token usage by 37.6%. Conclusion: CREST is a simple, effective, training-free method for test-time reasoning optimization that adaptively suppresses unproductive reasoning behaviors, yielding faster and more reliable LLM reasoning. Abstract: Large Language Models (LLMs) often rely on long chain-of-thought (CoT) reasoning to solve complex tasks. While effective, these trajectories are frequently inefficient, leading to high latency from excessive token generation, or unstable reasoning that alternates between underthinking (shallow, inconsistent steps) and overthinking (repetitive, verbose reasoning). In this work, we study the structure of reasoning trajectories and uncover specialized attention heads that correlate with distinct cognitive behaviors such as verification and backtracking. By lightly intervening on these heads at inference time, we can steer the model away from inefficient modes. Building on this insight, we propose CREST, a training-free method for Cognitive REasoning Steering at Test-time. CREST has two components: (1) an offline calibration step that identifies cognitive heads and derives head-specific steering vectors, and (2) an inference-time procedure that rotates hidden representations to suppress components along those vectors. CREST adaptively suppresses unproductive reasoning behaviors, yielding both higher accuracy and lower computational cost. Across diverse reasoning benchmarks and models, CREST improves accuracy by up to 17.5% while reducing token usage by 37.6%, offering a simple and effective pathway to faster, more reliable LLM reasoning.
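
The inference-time step in the abstract, suppressing the component of a hidden representation along a steering vector, can be approximated by a projection followed by rescaling to the original norm. This is one plausible reading of "rotates hidden representations", offered as an assumption; the `alpha` strength parameter and the toy vectors are hypothetical and the paper's exact operation may differ.

```python
import math


def norm(vector):
    return math.sqrt(sum(x * x for x in vector))


def steer(hidden, direction, alpha=1.0):
    """Remove a fraction alpha of hidden's component along direction, keeping its norm.

    hidden:    a head's hidden representation (toy vector here).
    direction: a steering vector from an offline calibration step (assumed given).
    """
    direction_norm = norm(direction)
    unit = [d / direction_norm for d in direction]
    projection = sum(h * u for h, u in zip(hidden, unit))
    suppressed = [h - alpha * projection * u for h, u in zip(hidden, unit)]
    suppressed_norm = norm(suppressed)
    if suppressed_norm == 0.0:
        return suppressed  # hidden was parallel to direction; nothing left to rescale
    scale = norm(hidden) / suppressed_norm
    return [s * scale for s in suppressed]


if __name__ == "__main__":
    hidden_state = [3.0, 4.0]
    steering_vector = [1.0, 0.0]
    print(steer(hidden_state, steering_vector, alpha=1.0))
    print(steer(hidden_state, steering_vector, alpha=0.5))
```

Rescaling keeps the vector's length unchanged so that only its direction moves away from the steering vector, which is why the operation behaves like a rotation rather than a shrinkage.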

[48] Youtu-LLM: Unlocking the Native Agentic Potential for Lightweight Large Language Models

Junru Lu,Jiarui Qin,Lingfeng Qiao,Yinghui Li,Xinyi Dai,Bo Ke,Jianfeng He,Ruizhi Qiao,Di Yin,Xing Sun,Yunsheng Wu,Yinsong Liu,Shuangyin Liu,Mingkong Tang,Haodong Lin,Jiayi Kuang,Fanxu Meng,Xiaojuan Tang,Yunjia Xi,Junjie Huang,Haotong Yang,Zhenyi Shen,Yangning Li,Qianwen Zhang,Yifei Yu,Siyu An,Junnan Dong,Qiufeng Wang,Jie Wang,Keyu Chen,Wei Wen,Taian Guo,Zhifeng Shen,Daohai Yu,Jiahao Li,Ke Li,Zongyi Li,Xiaoyu Tan

Main category: cs.CL

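TL;DR: Youtu-LLM is a 1.96B lightweight language model trained from scratch. With a compact architecture, a dedicated vocabulary, and a staged curriculum, it excels at long-context understanding and agentic intelligence, setting a new state of the art among models with fewer than 2B parameters.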

Details Motivation: The goal is a small language model that combines efficient computation with native agentic intelligence, overcoming the tendency of conventional small models to depend on distillation and to lack intrinsic reasoning and planning abilities. Method: The model uses a dense Multi-Latent Attention architecture and a STEM-oriented vocabulary, and supports a 128k context window; a three-stage "Commonsense-STEM-Agent" curriculum is used for pre-training on about 11T tokens; during mid-training, agentic trajectory data from multiple domains (math, coding, tool use) strengthens planning and reflection. Result: On general benchmarks the model is competitive with larger models, and on agent-specific tasks it significantly surpasses existing SOTA baselines, confirming that lightweight models can have strong intrinsic agentic capabilities. Conclusion: Youtu-LLM shows that, with systematic architecture design and curriculum training, small language models can natively develop strong reasoning and agentic abilities, providing an efficient option for intelligent agent applications in resource-constrained settings. Abstract: We introduce Youtu-LLM, a lightweight yet powerful language model that harmonizes high computational efficiency with native agentic intelligence. Unlike typical small models that rely on distillation, Youtu-LLM (1.96B) is pre-trained from scratch to systematically cultivate reasoning and planning capabilities. The key technical advancements are as follows: (1) Compact Architecture with Long-Context Support: Built on a dense Multi-Latent Attention (MLA) architecture with a novel STEM-oriented vocabulary, Youtu-LLM supports a 128k context window. This design enables robust long-context reasoning and state tracking within a minimal memory footprint, making it ideal for long-horizon agent and reasoning tasks. (2) Principled "Commonsense-STEM-Agent" Curriculum: We curated a massive corpus of approximately 11T tokens and implemented a multi-stage training strategy. By progressively shifting the pre-training data distribution from general commonsense to complex STEM and agentic tasks, we ensure the model acquires deep cognitive abilities rather than superficial alignment. (3) Scalable Agentic Mid-training: Specifically for the agentic mid-training, we employ diverse data construction schemes to synthesize rich and varied trajectories across math, coding, and tool-use domains. This high-quality data enables the model to internalize planning and reflection behaviors effectively. Extensive evaluations show that Youtu-LLM sets a new state-of-the-art for sub-2B LLMs. On general benchmarks, it achieves competitive performance against larger models, while on agent-specific tasks, it significantly surpasses existing SOTA baselines, demonstrating that lightweight models can possess strong intrinsic agentic capabilities.

[49] Do Large Language Models Know What They Are Capable Of?

Casey O. Barkan,Sid Black,Oliver Sourbut

Main category: cs.CL

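TL;DR: This study examines whether large language models (LLMs) can predict their own success on multi-step tasks and whether they can improve their decisions. All tested LLMs are overconfident, but most show better-than-random discriminatory power. Newer and larger LLMs generally do not discriminate better, with Claude models as the exception. On multi-step agentic tasks, the overconfidence of several frontier LLMs worsens as the task progresses, and reasoning LLMs do no better than non-reasoning ones. With in-context experiences of failure, some but not all LLMs reduce their overconfidence and make markedly better decisions. Interestingly, all LLMs' decisions are approximately rational given their estimated success probabilities, but overly optimistic estimates lead to poor decisions. The results suggest that current LLM agents are limited by poor awareness of their own capabilities, and the paper discusses what this awareness means for AI misuse and misalignment risks.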

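Details Motivation: The study asks whether large language models can accurately predict their success on a given task, whether they can update those predictions as a multi-step task progresses, and whether they can learn from experience to make better decisions, particularly when failure is costly. Method: Experiments evaluate how accurately several LLMs predict their own success on single-step and multi-step tasks, analyze how their confidence changes during a task, and introduce in-context examples of failure to test whether models adjust their behavior and decision strategies. Result: Most LLMs have better-than-random discriminatory power but are broadly overconfident; larger or newer models do not discriminate better (except Claude); on multi-step tasks, the overconfidence of several frontier LLMs worsens as the task proceeds, and reasoning models do no better than non-reasoning models; some but not all LLMs reduce overconfidence and improve decisions after in-context failures; all LLMs' decisions are consistent with their predicted probabilities, but overly optimistic predictions lead to poor overall decision quality. Conclusion: Current LLM agents are limited by inaccurate awareness of their own capabilities, especially persistent overconfidence, which affects their effectiveness and safety on complex, high-stakes tasks; improving this self-knowledge matters for reducing AI misuse and misalignment risks. Abstract: We investigate whether large language models (LLMs) can predict whether they will succeed on a given task and whether their predictions improve as they progress through multi-step tasks. We also investigate whether LLMs can learn from in-context experiences to make better decisions about whether to pursue a task in scenarios where failure is costly. All LLMs we tested are overconfident, but most predict their success with better-than-random discriminatory power. We find that newer and larger LLMs generally do not have greater discriminatory power, though Claude models do show such a trend. On multi-step agentic tasks, the overconfidence of several frontier LLMs worsens as they progress through the tasks, and reasoning LLMs perform comparably to or worse than non-reasoning LLMs. With in-context experiences of failure, some but not all LLMs reduce their overconfidence leading to significantly improved decision making, while others do not. Interestingly, all LLMs' decisions are approximately rational given their estimated probabilities of success, yet their overly-optimistic estimates result in poor decision making. These results suggest that current LLM agents are hindered by their lack of awareness of their own capabilities. We discuss the implications of LLMs' awareness of their capabilities for AI misuse and misalignment risks.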
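
The abstract distinguishes overconfidence from discriminatory power and notes that decisions can be rational given an estimate yet still poor when the estimate is too optimistic. The sketch below shows one conventional way to measure each: the mean gap between predicted and actual success, a pairwise AUROC, and an expected-value decision rule. The metric choices, the payoff structure, and the numbers are assumptions, not the paper's protocol.

```python
def overconfidence(predicted, outcomes):
    """Mean predicted success probability minus the actual success rate."""
    return sum(predicted) / len(predicted) - sum(outcomes) / len(outcomes)


def auroc(predicted, outcomes):
    """Probability that a random success received a higher prediction than a random failure."""
    successes = [p for p, o in zip(predicted, outcomes) if o == 1]
    failures = [p for p, o in zip(predicted, outcomes) if o == 0]
    wins = 0.0
    for s in successes:
        for f in failures:
            if s > f:
                wins += 1.0
            elif s == f:
                wins += 0.5
    return wins / (len(successes) * len(failures))


def should_attempt(p_success, gain, cost):
    """Attempt the task only if the expected value under the model's own estimate is positive."""
    return p_success * gain - (1.0 - p_success) * cost > 0.0


if __name__ == "__main__":
    predicted = [0.9, 0.8, 0.95, 0.7]  # hypothetical self-predicted success probabilities
    outcomes = [1, 0, 1, 0]            # hypothetical actual results
    print(overconfidence(predicted, outcomes))  # positive: estimates exceed the success rate
    print(auroc(predicted, outcomes))           # 1.0: successes are still ranked above failures
    # With failure three times as costly as success is valuable, the inflated
    # estimate of 0.8 says "attempt", whereas a calibrated 0.5 would say "decline".
    print(should_attempt(0.8, gain=1.0, cost=3.0), should_attempt(0.5, gain=1.0, cost=3.0))
```

The toy numbers reproduce the pattern the abstract describes: a model can rank its successes above its failures perfectly while every individual estimate is too high, and an expected-value rule applied to those inflated estimates then accepts tasks it should decline.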

[50] R-Debater: Retrieval-Augmented Debate Generation through Argumentative Memory

Maoyuan Li,Zhongsheng Wang,Haoyuan Li,Jiamou Liu

Main category: cs.CL

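TL;DR: R-Debater is a multi-turn debate generation framework built on argumentative memory. It combines retrieval augmentation with a role-based agent to improve the consistency, evidence use, and coherence of generated debates.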

Details Motivation: Conventional LLMs struggle to keep a consistent stance and use evidence effectively in multi-turn debates; drawing on rhetoric and memory studies, the work builds a debate system with a memory mechanism. Method: The R-Debater framework integrates a debate knowledge base for retrieving case-like evidence and prior debate moves, and a role-based agent that composes coherent utterances; it is evaluated on the ORCHID debate dataset with a 1,000-item retrieval corpus and a held-out test set of 32 debates. Result: R-Debater outperforms strong LLM baselines on both the single-turn task (InspireScore) and the multi-turn adversarial simulation task (Debatrix); human evaluation shows it is stronger on stance consistency and evidence use. Conclusion: Combining retrieval grounding with structured planning makes generated debates more faithful, stance-aligned, and coherent across turns, and R-Debater offers a workable framework for building debate systems with memory. Abstract: We present R-Debater, an agentic framework for generating multi-turn debates built on argumentative memory. Grounded in rhetoric and memory studies, the system views debate as a process of recalling and adapting prior arguments to maintain stance consistency, respond to opponents, and support claims with evidence. Specifically, R-Debater integrates a debate knowledge base for retrieving case-like evidence and prior debate moves with a role-based agent that composes coherent utterances across turns. We evaluate on standardized ORCHID debates, constructing a 1,000-item retrieval corpus and a held-out set of 32 debates across seven domains. Two tasks are evaluated: next-utterance generation, assessed by InspireScore (subjective, logical, and factual), and adversarial multi-turn simulations, judged by Debatrix (argument, source, language, and overall). Compared with strong LLM baselines, R-Debater achieves higher single-turn and multi-turn scores. Human evaluation with 20 experienced debaters further confirms its consistency and evidence use, showing that combining retrieval grounding with structured planning yields more faithful, stance-aligned, and coherent debates across turns.

[51] MUSIC: MUlti-Step Instruction Contrast for Multi-Turn Reward Models

Wenzhe Li,Shujian Zhang,Wenxuan Zhou,John Lambert,Chi Jin,Andrew Hard,Rajiv Mathews,Lun Wang

Main category: cs.CL

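TL;DR: This paper proposes MUSIC, an unsupervised data augmentation strategy that synthesizes contrastive conversation pairs differing across multiple turns, to improve the training of multi-turn reward models (RMs). A MUSIC-augmented RM based on Gemma-2-9B-Instruct and trained on the Skywork preference dataset agrees more closely with advanced proprietary LLM judges on multi-turn conversations, without sacrificing performance on single-turn RM benchmarks.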

Details Motivation: Existing preference datasets usually contrast responses on the final conversational turn only, so they capture little of the complexity of multi-turn interaction and give weak signal for automated multi-turn evaluation; better multi-turn evaluation methods are needed to improve multi-turn reward models. Method: The authors propose MUSIC (MUlti-Step Instruction Contrast), an unsupervised data augmentation strategy that synthesizes contrastive conversation pairs whose differences span multiple turns, and use the augmented data to train a multi-turn reward model based on Gemma-2-9B-Instruct. Result: The multi-turn RM trained with MUSIC on the Skywork preference dataset agrees more closely with advanced proprietary LLM judges than baseline methods, while maintaining performance on standard single-turn RM benchmarks. Conclusion: Contrastive signal that spans multiple turns is critical for building robust multi-turn reward models, and MUSIC offers an effective, scalable way to strengthen automated evaluation of multi-turn conversations. Abstract: Evaluating the quality of multi-turn conversations is crucial for developing capable Large Language Models (LLMs), yet remains a significant challenge, often requiring costly human evaluation. Multi-turn reward models (RMs) offer a scalable alternative and can provide valuable signals for guiding LLM training. While recent work has advanced multi-turn training techniques, effective automated evaluation specifically for multi-turn interactions lags behind. We observe that standard preference datasets, typically contrasting responses based only on the final conversational turn, provide insufficient signal to capture the nuances of multi-turn interactions. Instead, we find that incorporating contrasts spanning multiple turns is critical for building robust multi-turn RMs. Motivated by this finding, we propose MUlti-Step Instruction Contrast (MUSIC), an unsupervised data augmentation strategy that synthesizes contrastive conversation pairs exhibiting differences across multiple turns. Leveraging MUSIC on the Skywork preference dataset, we train a multi-turn RM based on the Gemma-2-9B-Instruct model. Empirical results demonstrate that our MUSIC-augmented RM outperforms baseline methods, achieving higher alignment with judgments from advanced proprietary LLM judges on multi-turn conversations, crucially, without compromising performance on standard single-turn RM benchmarks.

[52] BIOME-Bench: A Benchmark for Biomolecular Interaction Inference and Multi-Omics Pathway Mechanism Elucidation from Scientific Literature

Sibo Wei,Peng Chen,Lifeng Dong,Yin Luo,Lei Wang,Peng Zhang,Wenpeng Lu,Jianbin Guo,Hongjun Yang,Dajun Zeng

Main category: cs.CL

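TL;DR: This paper presents BIOME-Bench, a standardized benchmark for evaluating large language models on multi-omics pathway mechanism elucidation; it shows that current models fall short on biomolecular interaction inference and pathway mechanism explanation.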

Details Motivation: Existing pathway enrichment methods inherit the limitations of pathway resources (curation lag, redundancy, and limited sensitivity), and no standardized benchmark exists for systematically evaluating LLMs in multi-omics analysis. Method: BIOME-Bench is built through a four-stage workflow and defines two core tasks, Biomolecular Interaction Inference and end-to-end Multi-Omics Pathway Mechanism Elucidation, each with its own evaluation protocol. Result: Experiments show that current LLMs still have substantial deficiencies in distinguishing fine-grained biomolecular relation types and in generating faithful, robust pathway-level mechanistic explanations. Conclusion: LLMs need further improvement before they can reliably support mechanistic interpretation of multi-omics data; BIOME-Bench provides a basis for reproducible evaluation in this direction. Abstract: Multi-omics studies often rely on pathway enrichment to interpret heterogeneous molecular changes, but pathway enrichment (PE)-based workflows inherit structural limitations of pathway resources, including curation lag, functional redundancy, and limited sensitivity to molecular states and interventions. Although recent work has explored using large language models (LLMs) to improve PE-based interpretation, the lack of a standardized benchmark for end-to-end multi-omics pathway mechanism elucidation has largely confined evaluation to small, manually curated datasets or ad hoc case studies, hindering reproducible progress. To address this issue, we introduce BIOME-Bench, constructed via a rigorous four-stage workflow, to evaluate two core capabilities of LLMs in multi-omics analysis: Biomolecular Interaction Inference and end-to-end Multi-Omics Pathway Mechanism Elucidation. We develop evaluation protocols for both tasks and conduct comprehensive experiments across multiple strong contemporary models. Experimental results demonstrate that existing models still exhibit substantial deficiencies in multi-omics analysis, struggling to reliably distinguish fine-grained biomolecular relation types and to generate faithful, robust pathway-level mechanistic explanations.

[53] Uncertainty-aware Semi-supervised Ensemble Teacher Framework for Multilingual Depression Detection

Mohammad Zia Ur Rehman,Velpuru Navya,Sanskar,Shuja Uddin Qureshi,Nagendra Kumar

Main category: cs.CL

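TL;DR: This paper proposes Semi-SMDNet, a semi-supervised multilingual depression detection network that combines teacher-student pseudo-labelling, ensemble learning, and data augmentation to improve depression detection in low-resource languages.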

Details Motivation: Detecting depression from social media text remains difficult because of varied language styles, informal expression, and the lack of annotated data in many languages. Method: The Semi-SMDNet framework combines the predictions of several teacher models by soft voting to produce pseudo-labels, filters out low-confidence samples with an uncertainty-based threshold, and applies confidence-weighted training to improve robustness across languages. Result: On Arabic, Bangla, English, and Spanish datasets the approach consistently beats strong baselines and significantly narrows the performance gap between high-resource and low-resource settings. Conclusion: The framework is suitable for scalable, cross-language mental health monitoring where labelled resources are limited. Abstract: Detecting depression from social media text is still a challenging task. This is due to different language styles, informal expression, and the lack of annotated data in many languages. To tackle these issues, we propose Semi-SMDNet, a strong Semi-Supervised Multilingual Depression detection Network. It combines teacher-student pseudo-labelling, ensemble learning, and augmentation of data. Our framework uses a group of teacher models. Their predictions come together through soft voting. An uncertainty-based threshold filters out low-confidence pseudo-labels to reduce noise and improve learning stability. We also use a confidence-weighted training method that focuses on reliable pseudo-labelled samples. This greatly boosts robustness across languages. Tests on Arabic, Bangla, English, and Spanish datasets show that our approach consistently beats strong baselines. It significantly reduces the performance gap between settings that have plenty of resources and those that do not. Detailed experiments and studies confirm that our framework is effective and can be used in various situations. This shows that it is suitable for scalable, cross-language mental health monitoring where labelled resources are limited.
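
The soft-voting and threshold-filtering steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not define the uncertainty measure, so the sketch assumes the maximum averaged class probability as the confidence score, and the function names and the 0.8 threshold are hypothetical.

```python
from typing import List, Optional, Tuple


def soft_vote(teacher_probs: List[List[float]]) -> List[float]:
    """Average the class-probability vectors produced by several teacher models."""
    n_teachers = len(teacher_probs)
    n_classes = len(teacher_probs[0])
    return [sum(p[c] for p in teacher_probs) / n_teachers for c in range(n_classes)]


def pseudo_label(teacher_probs: List[List[float]],
                 threshold: float = 0.8) -> Optional[Tuple[int, float]]:
    """Return (label, confidence) for one unlabelled sample, or None if it is filtered out.

    Assumption: confidence is the largest averaged probability; the paper's
    uncertainty-based threshold may be defined differently.
    """
    avg = soft_vote(teacher_probs)
    confidence = max(avg)
    if confidence < threshold:
        return None  # low-confidence pseudo-label is discarded
    return avg.index(confidence), confidence
```

The returned confidence could then serve as a per-sample loss weight during student training, which is one plausible reading of "confidence-weighted training".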

[54] Compute-Accuracy Pareto Frontiers for Open-Source Reasoning Large Language Models

Ákos Prucs,Márton Csutora,Mátyás Antal,Márk Marosi

Main category: cs.CL

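TL;DR: This paper studies the trade-off between accuracy and compute cost for large language models on complex reasoning tasks, finds that the Mixture of Experts (MoE) architecture balances efficiency and performance well, and shows that inference-time compute has a saturation point.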

Details Motivation: Existing work often overlooks the large computational cost of generating long reasoning chains, whereas industrial applications must balance accuracy against inference cost, so models need a compute-aware evaluation. Method: The authors run a test-time-compute aware evaluation of contemporary and older open-source LLMs, map their Pareto frontiers on math- and reasoning-intensive benchmarks, and trace how Pareto efficiency changes over time. Result: The MoE architecture is a strong candidate for balancing performance and efficiency in this evaluation setting; inference-time compute has a saturation point beyond which accuracy gains diminish. Conclusion: Extended reasoning is beneficial but cannot overcome a model's intrinsic limitations, so weighing compute cost against accuracy matters when selecting a model. Abstract: Large Language Models (LLMs) are demonstrating rapid improvements on complex reasoning benchmarks, particularly when allowed to utilize intermediate reasoning steps before converging on a final solution. However, current literature often overlooks the significant computational burden associated with generating long reasoning sequences. For industrial applications, model selection depends not only on raw accuracy but also on resource constraints and inference costs. In this work, we conduct a test-time-compute aware evaluation of both contemporary and older open-source LLMs, mapping their Pareto frontiers across math- and reasoning-intensive benchmarks. Our findings identify the Mixture of Experts (MoE) architecture as a strong candidate to balance performance and efficiency in our evaluation setting. Furthermore, we trace the trajectory of Pareto efficiency over time to derive an emergent trend regarding accuracy gain per unit of compute. Finally, we demonstrate that there is a saturation point for inference-time compute. Beyond a certain threshold, accuracy gains diminish, indicating that while extended reasoning capabilities are beneficial, they cannot overcome intrinsic model limitations regarding specific complexities.
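
For readers unfamiliar with the term, a compute-accuracy Pareto frontier keeps only the models that no other model beats on both cost and accuracy. The sketch below shows one standard way to compute it; the tuple layout, the cost unit, and the model names are placeholders, not data from the paper.

```python
from typing import List, Tuple

Point = Tuple[str, float, float]  # (model name, compute cost, accuracy)


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Return the non-dominated points: lower cost is better, higher accuracy is better."""
    # Sort by cost ascending; among equal costs, put the most accurate first.
    ordered = sorted(points, key=lambda p: (p[1], -p[2]))
    frontier: List[Point] = []
    best_accuracy = float("-inf")
    for name, cost, accuracy in ordered:
        # A point survives only if it is more accurate than every cheaper (or equally cheap) point.
        if accuracy > best_accuracy:
            frontier.append((name, cost, accuracy))
            best_accuracy = accuracy
    return frontier


# Hypothetical measurements, for illustration only.
models = [("a", 1.0, 0.50), ("b", 2.0, 0.40), ("c", 3.0, 0.70), ("d", 3.0, 0.60)]
frontier = pareto_frontier(models)
```

In this toy data, "b" is dominated by "a" (cheaper and more accurate) and "d" by "c" (same cost, more accurate), so the frontier contains only "a" and "c".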

[55] Practising responsibility: Ethics in NLP as a hands-on course

Malvina Nissim,Viviana Patti,Beatrice Savoldi

Main category: cs.CL

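TL;DR: This paper presents a course on ethical issues in Natural Language Processing (NLP) and its active-learning pedagogy, designed to meet the challenge of integrating ethics education into a fast-moving technical field.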

Details Motivation: As NLP systems become more pervasive, integrating ethical considerations into NLP education has become essential, but curriculum design is hard because the field evolves rapidly and critical thinking must be fostered beyond traditional technical training. Method: The course uses an active-learning approach, with interactive sessions, hands-on activities, and "learning by teaching", and has been refined over four years across different institutions, educational levels, and interdisciplinary backgrounds. Result: The course has yielded many reusable teaching materials as well as educational products aimed at diverse audiences, all made by the students themselves. Conclusion: By sharing the course design and their experience, the authors hope to inspire educators who want to bring social impact considerations into their curricula. Abstract: As Natural Language Processing (NLP) systems become more pervasive, integrating ethical considerations into NLP education has become essential. However, this presents inherent challenges in curriculum development: the field's rapid evolution from both academia and industry, and the need to foster critical thinking beyond traditional technical training. We introduce our course on Ethical Aspects in NLP and our pedagogical approach, grounded in active learning through interactive sessions, hands-on activities, and "learning by teaching" methods. Over four years, the course has been refined and adapted across different institutions, educational levels, and interdisciplinary backgrounds; it has also yielded many reusable products, both in the form of teaching materials and in the form of actual educational products aimed at diverse audiences, made by the students themselves. By sharing our approach and experience, we hope to provide inspiration for educators seeking to incorporate social impact considerations into their curricula.

[56] Triangulation as an Acceptance Rule for Multilingual Mechanistic Interpretability

Yanan Long

Main category: cs.CL

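TL;DR: This paper proposes "triangulation", an acceptance rule for validating mechanistic explanations of multilingual models: an explanation must meet a causal standard and remain stable across environments.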

Details Motivation: Multilingual models behave unpredictably across languages, scripts, and cultures, so more reliable mechanistic explanations are needed. Method: The paper introduces "reference families" and the "triangulation" rule, which validates a circuit by requiring necessity, sufficiency, and invariance, and pairs the rule with automatic circuit discovery and intervention experiments. Result: Triangulation filters out spurious circuits that pass single-environment tests but fail cross-lingual invariance, making interpretability claims more trustworthy. Conclusion: The method gives mechanistic analysis of multilingual models a falsifiable standard and strengthens understanding of their internal mechanisms. Abstract: Multilingual language models achieve strong aggregate performance yet often behave unpredictably across languages, scripts, and cultures. We argue that mechanistic explanations for such models should satisfy a causal standard: claims must survive causal interventions and must cross-reference across environments that perturb surface form while preserving meaning. We formalize reference families as predicate-preserving variants and introduce triangulation, an acceptance rule requiring necessity (ablating the circuit degrades the target behavior), sufficiency (patching activations transfers the behavior), and invariance (both effects remain directionally stable and of sufficient magnitude across the reference family). To supply candidate subgraphs, we adopt automatic circuit discovery and accept or reject those candidates by triangulation. We ground triangulation in causal abstraction by casting it as an approximate transformation score over a distribution of interchange interventions, connect it to the pragmatic interpretability agenda, and present a comparative experimental protocol across multiple model families, language pairs, and tasks. Triangulation provides a falsifiable standard for mechanistic claims that filters spurious circuits passing single-environment tests but failing cross-lingual invariance.
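
The three-part acceptance rule can be read as a conjunction of checks over every member of a reference family. The sketch below is a loose illustration under strong assumptions: effects are summarized as one scalar per environment, and the `EnvEffect` type, field names, and `min_effect` threshold are hypothetical rather than taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EnvEffect:
    """Measured intervention effects for one environment in a reference family (hypothetical summary)."""
    ablation_drop: float  # decrease in the target behavior when the circuit is ablated
    patch_gain: float     # increase in the target behavior when activations are patched in


def triangulate(effects: List[EnvEffect], min_effect: float = 0.1) -> bool:
    """Accept a candidate circuit only if both effects hold, with the same sign, in every environment."""
    if not effects:
        return False  # nothing measured, nothing accepted
    # Necessity: ablation degrades the behavior by at least min_effect.
    necessity = all(e.ablation_drop >= min_effect for e in effects)
    # Sufficiency: patching transfers the behavior by at least min_effect.
    sufficiency = all(e.patch_gain >= min_effect for e in effects)
    # Invariance is enforced by requiring both checks across the whole family, not one environment.
    return necessity and sufficiency
```

A circuit whose ablation effect is strong in one language but vanishes in another is rejected, which is the filtering behavior the abstract describes.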

[57] PrivacyBench: A Conversational Benchmark for Evaluating Privacy in Personalized AI

Srija Mukhopadhyay,Sathwik Reddy,Shruthi Muthukumar,Jisun An,Ponnurangam Kumaraguru

Main category: cs.CL

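TL;DR: PrivacyBench is a socially grounded benchmark for evaluating how well AI agents protect users' private information in multi-turn conversations; it finds that current Retrieval-Augmented Generation (RAG) systems still pose a serious leakage risk.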

Details Motivation: Personalized AI agents need access to a user's digital footprint, but systems that lack social-context awareness can unintentionally expose user secrets, creating privacy and safety risks. Method: The authors propose the PrivacyBench benchmark, which contains socially grounded datasets with embedded secrets, and use a multi-turn conversational evaluation to test RAG systems for leakage. Result: RAG assistants leak secrets in up to 26.56% of interactions; a privacy-aware prompt lowers leakage to 5.12%, but the retrieval mechanism still accesses sensitive data indiscriminately, which makes the generator a single point of failure for privacy protection. Conclusion: Current architectures are unsafe for wide-scale deployment, and structural, privacy-by-design safeguards are urgently needed for a safer and more inclusive web. Abstract: Personalized AI agents rely on access to a user's digital footprint, which often includes sensitive data from private emails, chats and purchase histories. Yet this access creates a fundamental societal and privacy risk: systems lacking social-context awareness can unintentionally expose user secrets, threatening digital well-being. We introduce PrivacyBench, a benchmark with socially grounded datasets containing embedded secrets and a multi-turn conversational evaluation to measure secret preservation. Testing Retrieval-Augmented Generation (RAG) assistants reveals that they leak secrets in up to 26.56% of interactions. A privacy-aware prompt lowers leakage to 5.12%, yet this measure offers only partial mitigation. The retrieval mechanism continues to access sensitive data indiscriminately, which shifts the entire burden of privacy preservation onto the generator. This creates a single point of failure, rendering current architectures unsafe for wide-scale deployment. Our findings underscore the urgent need for structural, privacy-by-design safeguards to ensure an ethical and inclusive web for everyone.

[58] Big AI is accelerating the metacrisis: What can we do?

Steven Bird

Main category: cs.CL

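TL;DR: This paper discusses the metacrisis formed by converging ecological, meaning, and language crises, points to the role that Big AI and language engineers play in it, and calls for redesigning a future for NLP centered on human flourishing and life on the planet.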

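Details Motivation: To respond to the metacrisis arising from converging ecological, meaning, and language crises, and to reflect on the harm that the current direction of AI and natural language processing does to society and the environment. Method: A critical analysis of current trends in AI and language engineering that exposes their treatment of the work as value-free and their misuse of technology. Result: The paper argues that collective intelligence is needed to explore alternatives and to move NLP toward life-affirming goals. Conclusion: The future of NLP must be reimagined so that it serves human well-being and the sustainability of the planet's ecology. Abstract: The world is in the grip of ecological, meaning, and language crises which are converging into a metacrisis. Big AI is accelerating them all. Language engineers are playing a central role, persisting with a scalability story that is failing humanity, supplying critical talent to plutocrats and kleptocrats, and creating new technologies as if the whole endeavour was value-free. We urgently need to explore alternatives, applying our collective intelligence to design a life-affirming future for NLP that is centered on human flourishing on a living planet.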

[59] Encyclo-K: Evaluating LLMs with Dynamically Composed Knowledge Statements

Yiming Liang,Yizhi Li,Yantao Du,Ge Zhang,Jiayi Zhou,Yuchen Wu,Yinzhu Piao,Denghui Cao,Tong Sun,Ziniu Li,Li Du,Bo Lei,Jiaheng Liu,Chenghua Lin,Zhaoxiang Zhang,Wenhao Huang,Jiajun Zhang

Main category: cs.CL

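TL;DR: Encyclo-K is a new statement-based benchmark for evaluating LLMs; by extracting knowledge statements from textbooks and composing questions dynamically, it addresses data contamination, single-knowledge-point assessment, and high annotation cost.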

Details Motivation: Existing LLM benchmarks have three main problems: vulnerability to data contamination, restriction to single-knowledge-point assessment, and reliance on expert annotation; a more reliable, comprehensive, and low-cost evaluation method is needed. Method: Standalone knowledge statements are extracted from authoritative textbooks and composed into questions at test time through random sampling; annotators only verify formatting compliance, which lowers annotation cost. Result: Experiments on more than 50 LLMs show that even the strongest model, GPT-5.1, reaches only 62.07% accuracy, and model performance forms a clear gradient, which confirms the difficulty of dynamic evaluation and multi-statement understanding. Conclusion: Encyclo-K provides a scalable framework for dynamic, reliable evaluation of LLMs' comprehensive understanding of fine-grained disciplinary knowledge. Abstract: Benchmarks play a crucial role in tracking the rapid advancement of large language models (LLMs) and identifying their capability boundaries. However, existing benchmarks predominantly curate questions at the question level, suffering from three fundamental limitations: vulnerability to data contamination, restriction to single-knowledge-point assessment, and reliance on costly domain expert annotation. We propose Encyclo-K, a statement-based benchmark that rethinks benchmark construction from the ground up. Our key insight is that knowledge statements, not questions, can serve as the unit of curation, and questions can then be constructed from them. We extract standalone knowledge statements from authoritative textbooks and dynamically compose them into evaluation questions through random sampling at test time. This design directly addresses all three limitations: the combinatorial space is too vast to memorize, and model rankings remain stable across dynamically generated question sets, enabling reliable periodic dataset refresh; each question aggregates 8-10 statements for comprehensive multi-knowledge assessment; annotators only verify formatting compliance without requiring domain expertise, substantially reducing annotation costs. Experiments on over 50 LLMs demonstrate that Encyclo-K poses substantial challenges with strong discriminative power. Even the top-performing OpenAI-GPT-5.1 achieves only 62.07% accuracy, and model performance displays a clear gradient distribution: reasoning models span from 16.04% to 62.07%, while chat models range from 9.71% to 50.40%. These results validate the challenges introduced by dynamic evaluation and multi-statement comprehensive understanding. These findings establish Encyclo-K as a scalable framework for dynamic evaluation of LLMs' comprehensive understanding over multiple fine-grained disciplinary knowledge statements.
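
To make the "too vast to memorize" argument concrete, the sketch below samples 8-10 statements from a pool at test time and counts how many distinct statement sets exist. It is only an illustration: the question template, the pool size of 1,000, and the function names are assumptions, and the benchmark's actual question format is not given in the abstract.

```python
import math
import random
from typing import List


def compose_question(statements: List[str], rng: random.Random,
                     k_min: int = 8, k_max: int = 10) -> str:
    """Randomly pick between k_min and k_max statements and lay them out as one question."""
    k = rng.randint(k_min, k_max)
    picked = rng.sample(statements, k)
    lines = [f"({i + 1}) {s}" for i, s in enumerate(picked)]
    # Hypothetical template; the real benchmark's prompt wording is not specified here.
    return "Consider the following statements:\n" + "\n".join(lines)


def num_possible_sets(pool_size: int, k: int) -> int:
    """Number of distinct k-statement subsets that can be drawn from the pool."""
    return math.comb(pool_size, k)


pool = [f"statement {i}" for i in range(1000)]  # placeholder pool
question = compose_question(pool, random.Random(0))
```

Even for this small placeholder pool, `num_possible_sets(1000, 8)` exceeds 10^19, so a model is very unlikely to have seen any particular composed question before.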

[60] mHC: Manifold-Constrained Hyper-Connections

Zhenda Xie,Yixuan Wei,Huanqi Cao,Chenggang Zhao,Chengqi Deng,Jiashi Li,Damai Dai,Huazuo Gao,Jiang Chang,Liang Zhao,Shangyan Zhou,Zhean Xu,Zhengyan Zhang,Wangding Zeng,Shengding Hu,Yuqing Wang,Jingyang Yuan,Lean Wang,Wenfeng Liang

Main category: cs.CL

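TL;DR: The paper proposes Manifold-Constrained Hyper-Connections (mHC), a framework that uses a manifold projection to restore the identity mapping property in Hyper-Connections and optimizes infrastructure to improve training stability, scalability, and efficiency.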

Details Motivation: By diversifying connectivity patterns, existing Hyper-Connections (HC) methods break the identity mapping property of the residual connection, which leads to training instability, restricted scalability, and extra memory access overhead. Method: mHC projects the residual connection space of HC onto a specific manifold to restore the identity mapping property, and adds rigorous infrastructure optimization for efficiency. Result: Experiments show that mHC is effective for training at scale, with tangible performance improvements and better scalability. Conclusion: As a flexible and practical extension of HC, mHC should help deepen understanding of topological architecture design and point to new directions for foundation models. Abstract: Recently, studies exemplified by Hyper-Connections (HC) have extended the ubiquitous residual connection paradigm established over the past decade by expanding the residual stream width and diversifying connectivity patterns. While yielding substantial performance gains, this diversification fundamentally compromises the identity mapping property intrinsic to the residual connection, which causes severe training instability and restricted scalability, and additionally incurs notable memory access overhead. To address these challenges, we propose Manifold-Constrained Hyper-Connections (mHC), a general framework that projects the residual connection space of HC onto a specific manifold to restore the identity mapping property, while incorporating rigorous infrastructure optimization to ensure efficiency. Empirical experiments demonstrate that mHC is effective for training at scale, offering tangible performance improvements and superior scalability. We anticipate that mHC, as a flexible and practical extension of HC, will contribute to a deeper understanding of topological architecture design and suggest promising directions for the evolution of foundational models.

[61] BEDA: Belief Estimation as Probabilistic Constraints for Performing Strategic Dialogue Acts

Hengli Li,Zhaoxin Yu,Qi Shen,Chenxi Li,Mengmeng Wang,Tinglang Wu,Yipeng Kang,Yuxuan Wang,Song-Chun Zhu,Zixia Jia,Zilong Zheng

Main category: cs.CL

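TL;DR: This paper proposes BEDA, a framework that turns belief estimation into probabilistic constraints on generation and thereby achieves more effective strategic dialogue in adversarial, cooperative, and negotiation settings.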

Details Motivation: Existing methods can estimate beliefs accurately but lack a principled mechanism for using those beliefs during generation. Method: The authors formalize two core dialogue acts (Adversarial and Alignment), operationalize them through probabilistic constraints, and build the BEDA framework from a world set, a belief estimator, and a conditional generator. Result: On CKBG, MF, and CaSiNo, BEDA consistently outperforms strong baselines; for example, it raises the CKBG success rate by 20.6 points with GPT-4.1-nano and achieves the optimal deal on CaSiNo. Conclusion: Treating belief estimation as a constraint on generation is a simple, general way to make strategic dialogue more reliable. Abstract: Strategic dialogue requires agents to execute distinct dialogue acts, for which belief estimation is essential. While prior work often estimates beliefs accurately, it lacks a principled mechanism to use those beliefs during generation. We bridge this gap by first formalizing two core acts, Adversarial and Alignment, and by operationalizing them via probabilistic constraints on what an agent may generate. We instantiate this idea in BEDA, a framework that consists of the world set, the belief estimator for belief estimation, and the conditional generator that selects acts and realizes utterances consistent with the inferred beliefs. Across three settings, Conditional Keeper Burglar (CKBG, adversarial), Mutual Friends (MF, cooperative), and CaSiNo (negotiation), BEDA consistently outperforms strong baselines: on CKBG it improves success rate by at least 5.0 points across backbones and by 20.6 points with GPT-4.1-nano; on Mutual Friends it achieves an average improvement of 9.3 points; and on CaSiNo it achieves the optimal deal relative to all baselines. These results indicate that casting belief estimation as constraints provides a simple, general mechanism for reliable strategic dialogue.

[62] Adaptive Dependency-aware Prompt Optimization Framework for Multi-Step LLM Pipeline

Minjun Zhao,Xinyu Zhang,Shuai Zhang,Deyang Li,Ruifeng Shi

Main category: cs.CL

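TL;DR: The paper proposes ADOPT, an adaptive dependency-aware prompt optimization framework for multi-step LLM pipelines; by modeling the dependency between each step and the final outcome, it enables precise text-gradient estimation and efficient prompt optimization.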

Details Motivation: The performance of a multi-step LLM pipeline depends on the prompt used at each step, but missing step-level supervision and inter-step dependencies make joint prompt optimization difficult, and existing methods perform poorly. Method: ADOPT models the dependency between each LLM step and the final task outcome, decouples textual gradient estimation from gradient updates, and uses a Shapley-based mechanism to allocate optimization resources adaptively. Result: Experiments on real-world datasets and diverse pipeline structures show that ADOPT is effective and robust, consistently outperforming state-of-the-art prompt optimization baselines. Conclusion: Through dependency awareness and adaptive resource allocation, ADOPT improves both the effectiveness and the stability of prompt optimization for multi-step LLM pipelines. Abstract: Multi-step LLM pipelines invoke large language models multiple times in a structured sequence and can effectively solve complex tasks, but their performance heavily depends on the prompts used at each step. Jointly optimizing these prompts is difficult due to missing step-level supervision and inter-step dependencies. Existing end-to-end prompt optimization methods struggle under these conditions and often yield suboptimal or unstable updates. We propose ADOPT, an Adaptive Dependency-aware Prompt Optimization framework for multi-step LLM pipelines. ADOPT explicitly models the dependency between each LLM step and the final task outcome, enabling precise text-gradient estimation analogous to computing analytical derivatives. It decouples textual gradient estimation from gradient updates, reducing multi-prompt optimization to flexible single-prompt optimization steps, and employs a Shapley-based mechanism to adaptively allocate optimization resources.
Experiments on real-world datasets and diverse pipeline structures show that ADOPT is effective and robust, consistently outperforming state-of-the-art prompt optimization baselines.
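
The abstract does not spell out the Shapley-based mechanism, so the following is only a textbook Shapley computation applied to a toy pipeline, with budget split in proportion to each step's value. The step names, the additive value function, and `allocate_budget` are all hypothetical and should not be read as ADOPT's actual algorithm.

```python
from itertools import permutations
from typing import Callable, Dict, FrozenSet, List


def shapley_values(players: List[str],
                   value: Callable[[FrozenSet[str]], float]) -> Dict[str, float]:
    """Exact Shapley values: average marginal contribution over all orderings (small player sets only)."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition: FrozenSet[str] = frozenset()
        for p in order:
            extended = coalition | {p}
            totals[p] += value(extended) - value(coalition)
            coalition = extended
    return {p: t / len(orderings) for p, t in totals.items()}


def allocate_budget(budget: float, shapley: Dict[str, float]) -> Dict[str, float]:
    """Split an optimization budget in proportion to non-negative Shapley values (assumed rule)."""
    clipped = {p: max(v, 0.0) for p, v in shapley.items()}
    total = sum(clipped.values())
    return {p: budget * v / total for p, v in clipped.items()}


# Toy value function: each optimized step adds a fixed amount to pipeline quality.
gains = {"retrieve": 0.2, "reason": 0.5, "answer": 0.3}
shap = shapley_values(list(gains), lambda s: sum(gains[p] for p in s))
alloc = allocate_budget(10.0, shap)
```

With an additive value function the Shapley value of each step equals its own gain, so the "reason" step receives half of the budget. Real pipelines have interacting steps, which is where Shapley attribution differs from simply ranking individual gains.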

Luis Adrián Cabrera-Diego

Main category: cs.CL

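TL;DR: This paper presents a legal document classifier based on DeBERTa V3 and an LSTM that takes 48 randomly selected short chunks (at most 128 tokens each) as input to sidestep the difficulty of processing long texts, and builds a reliable deployment pipeline with Temporal.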

Details Motivation: Legal documents use specialized vocabulary and are often long, so feeding them whole to Transformer models can be expensive, slow, or impossible; an efficient and practical classification method is needed. Method: The model combines DeBERTa V3 with an LSTM and takes as input 48 short chunks (at most 128 tokens each) randomly selected from the document; the deployment pipeline is built on Temporal, a durable execution solution, to make processing reliable. Result: The best model reaches a weighted F-score of 0.898, and the pipeline running on CPU has a median processing time of 498 seconds per 100 files. Conclusion: The method classifies legal documents effectively without reading the full text, and the combination of an efficient model with a reliable workflow suits long-text classification in practice. Abstract: Classifying legal documents is a challenge: besides their specialized vocabulary, they can be very long. This means that feeding full documents to Transformer-based models for classification might be impossible, expensive or slow. Thus, we present a legal document classifier based on DeBERTa V3 and an LSTM, that uses as input a collection of 48 randomly-selected short chunks (max 128 tokens). Besides, we present its deployment pipeline using Temporal, a durable execution solution, which allows us to have a reliable and robust processing workflow. The best model had a weighted F-score of 0.898, while the pipeline running on CPU had a median processing time of 498 seconds per 100 files.
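
The input construction (48 random chunks of at most 128 tokens) might look like the sketch below. How the paper cuts and orders chunks is not stated, so this assumes non-overlapping consecutive chunks kept in document order; the function name and the `seed` parameter are hypothetical, and the tokens are stand-in integers rather than real tokenizer output.

```python
import random
from typing import List, Optional, Sequence


def random_chunks(tokens: Sequence[int], n_chunks: int = 48, chunk_len: int = 128,
                  seed: Optional[int] = None) -> List[Sequence[int]]:
    """Split a token sequence into chunks of at most chunk_len and keep n_chunks of them at random."""
    rng = random.Random(seed)
    # Assumption: non-overlapping, consecutive chunks.
    chunks = [tokens[i:i + chunk_len] for i in range(0, len(tokens), chunk_len)]
    if len(chunks) <= n_chunks:
        return chunks  # short document: use everything
    # Sample without replacement, then restore document order for the sequence model.
    kept = sorted(rng.sample(range(len(chunks)), n_chunks))
    return [chunks[i] for i in kept]


document = list(range(10_000))  # stand-in for a tokenized long legal document
selected = random_chunks(document, seed=0)
```

Each selected chunk would then be encoded separately (by DeBERTa V3 in the paper) and the resulting sequence of chunk vectors passed to the LSTM, so the encoder never sees more than 128 tokens at once.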

[64] MAMA-Memeia! Multi-Aspect Multi-Agent Collaboration for Depressive Symptoms Identification in Memes

Siddhant Agarwal,Adya Dhuler,Polly Ruhnke,Melvin Speisman,Md Shad Akhtar,Shweta Yadav

Main category: cs.CL

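TL;DR: This paper proposes MAMAMemeia, a multi-agent framework built on large language models for detecting the depressive symptoms expressed in social media memes, and reports a substantial gain in detection performance.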

Details Motivation: Memes are increasingly used to express depressive feelings, so effective methods are needed to identify these potential mental health risks. Method: The authors introduce the RESTOREx resource, which combines LLM-generated and human-annotated explanations, and design MAMAMemeia, a multi-agent, multi-aspect discussion framework grounded in Cognitive Analytic Therapy (CAT), to detect depressive symptoms. Result: MAMAMemeia improves on the previous state of the art by 7.55% in macro-F1 and sets a new benchmark among more than 30 compared methods. Conclusion: The study offers an effective tool and a new direction for using AI to understand mental health as expressed on social media. Abstract: Over the past years, memes have evolved from being exclusively a medium of humorous exchanges to one that allows users to express a range of emotions freely and easily. With the ever-growing utilization of memes in expressing depressive sentiments, we conduct a study on identifying depressive symptoms exhibited by memes shared by users of online social media platforms. We introduce RESTOREx as a vital resource for detecting depressive symptoms in memes on social media through Large Language Model (LLM)-generated and human-annotated explanations. We introduce MAMAMemeia, a collaborative multi-agent multi-aspect discussion framework grounded in the clinical psychology method of Cognitive Analytic Therapy (CAT) Competencies. MAMAMemeia improves upon the current state-of-the-art by 7.55% in macro-F1 and is established as the new benchmark compared to over 30 methods.

[65] Modeling Language as a Sequence of Thoughts

Nasim Borazjanizadeh,James McClelland

Main category: cs.CL

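TL;DR: This paper proposes the Thought Gestalt (TG) model, a recurrent Transformer that models language at two levels, tokens and sentence-level "thought" states. It keeps a memory of sentence representations and lets gradients from future token losses refine earlier sentence vectors, and it outperforms standard Transformers in data efficiency and in relational direction generalization.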

Details Motivation: Cognitive science indicates that human comprehension converts the linguistic stream into compact, persistent, event-like representations, whereas existing language models rely on surface co-occurrence statistics and lack globally consistent latent representations, which hurts relational direction, contextualization, and data efficiency; the authors therefore aim for a model closer to human comprehension. Method: The Thought Gestalt (TG) model is a recurrent Transformer that generates tokens one sentence at a time while cross-attending to a memory of prior sentence representations; token and sentence representations share the same parameters and are trained with a single next-token cross-entropy objective, and the computation graph of sentence representations written to memory is retained so that gradients from future token losses flow back to optimize earlier sentence vectors. Result: In scaling experiments TG consistently improves efficiency over matched GPT-2 runs, with scaling fits indicating that GPT-2 needs about 5-8% more data and about 33-42% more parameters to match TG's loss; TG also reduces relational direction generalization errors on a father-son reversal curse probe. Conclusion: By adding persistent sentence-level memory and cross-attention, with sentence representations learned by the same parameters that generate tokens, TG improves data efficiency and relational reasoning without a more complex training objective, which supports modeling language comprehension on the human mechanism. Abstract: Transformer language models can generate strikingly natural text by modeling language as a sequence of tokens. Yet, by relying primarily on surface-level co-occurrence statistics, they fail to form globally consistent latent representations of entities and events, the lack of which contributes to brittleness in relational direction (e.g., reversal curse), contextualization errors, and data inefficiency. On the other hand, cognitive science shows that human comprehension involves converting the input linguistic stream into compact, event-like representations that persist in memory while verbatim form is short-lived. Motivated by this view, we introduce the Thought Gestalt (TG) model, a recurrent Transformer that models language at two levels of abstraction: tokens and sentence-level "thought" states. TG generates the tokens of one sentence at a time while cross-attending to a memory of prior sentence representations. In TG, token and sentence representations are generated using the same set of model parameters and trained with a single objective, the next-token cross-entropy: by retaining the computation graph of sentence representations written to memory, gradients from future token losses flow backward through cross-attention to optimize the parameters generating earlier sentence vectors. In scaling experiments, TG consistently improves efficiency over matched GPT-2 runs, among other baselines, with scaling fits indicating GPT-2 requires ~5-8% more data and ~33-42% more parameters to match TG's loss. TG also reduces errors on relational direction generalization on a father-son reversal curse probe.

[66] AdaGReS: Adaptive Greedy Context Selection via Redundancy-Aware Scoring for Token-Budgeted RAG

Chao Peng,Bin Wang,Zhilei Long,Jinfang Sheng

Main category: cs.CL

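TL;DR: AdaGReS is a redundancy-aware context selection framework for retrieval-augmented generation (RAG); it optimizes a set-level objective that combines relevance with a redundancy penalty to improve context quality under a token budget.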

Details Motivation: Standard top-k retrieval often returns redundant or near-duplicate chunks, which waste the token budget and degrade generation quality, so a smarter context selection mechanism is needed. Method: AdaGReS runs a marginal-gain greedy selection under a token-budget constraint to optimize a set-level objective that combines relevance and redundancy penalties, and introduces a closed-form, instance-adaptive calibration of the parameter that balances relevance against redundancy. Result: Theoretical analysis shows that the objective is epsilon-approximately submodular under practical embedding similarity conditions, which gives near-optimality guarantees for greedy selection; experiments on open-domain question answering and a high-redundancy biomedical corpus show better redundancy control, context quality, and final answer quality. Conclusion: By adaptively balancing relevance and redundancy, AdaGReS selects better context for token-budgeted RAG without manual tuning, with theoretical guarantees and good generalization. Abstract: Retrieval-augmented generation (RAG) is highly sensitive to the quality of selected context, yet standard top-k retrieval often returns redundant or near-duplicate chunks that waste token budget and degrade downstream generation. We present AdaGReS, a redundancy-aware context selection framework for token-budgeted RAG that optimizes a set-level objective combining query-chunk relevance and intra-set redundancy penalties. AdaGReS performs greedy selection under a token-budget constraint using marginal gains derived from the objective, and introduces a closed-form, instance-adaptive calibration of the relevance-redundancy trade-off parameter to eliminate manual tuning and adapt to candidate-pool statistics and budget limits. We further provide a theoretical analysis showing that the proposed objective exhibits epsilon-approximate submodularity under practical embedding similarity conditions, yielding near-optimality guarantees for greedy selection. Experiments on open-domain question answering (Natural Questions) and a high-redundancy biomedical (drug) corpus demonstrate consistent improvements in redundancy control and context quality, translating to better end-to-end answer quality and robustness across settings.
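
A budgeted greedy selection with a redundancy penalty can be sketched as below. The exact objective and the adaptive calibration are not given in the abstract, so this assumes a marginal gain of relevance minus `lam` times the summed similarity to already-selected chunks, with a fixed, hand-set `lam`; all names and numbers are illustrative.

```python
from typing import List


def greedy_select(relevance: List[float], similarity: List[List[float]],
                  tokens: List[int], budget: int, lam: float = 0.5) -> List[int]:
    """Greedily pick chunk indices by marginal gain until the token budget or positive gains run out."""
    selected: List[int] = []
    used = 0
    remaining = set(range(len(relevance)))
    while remaining:
        best, best_gain = None, 0.0
        for i in sorted(remaining):
            if used + tokens[i] > budget:
                continue  # chunk does not fit in the remaining budget
            # Assumed marginal gain: relevance minus penalty for overlap with chosen chunks.
            gain = relevance[i] - lam * sum(similarity[i][j] for j in selected)
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:
            break  # nothing fits, or every remaining chunk has non-positive gain
        selected.append(best)
        used += tokens[best]
        remaining.remove(best)
    return selected


# Chunks 0 and 1 are near-duplicates; chunk 2 is less relevant but novel.
rel = [0.90, 0.88, 0.50]
sim = [[1.0, 0.95, 0.1], [0.95, 1.0, 0.1], [0.1, 0.1, 1.0]]
picked = greedy_select(rel, sim, tokens=[100, 100, 100], budget=200, lam=1.0)
```

Plain top-2 retrieval would return the two near-duplicates (chunks 0 and 1); the redundancy penalty makes the greedy pass spend the second slot on chunk 2 instead.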

cs.CV [Back]

[67] Leveraging Synthetic Priors for Monocular Depth Estimation in Specular Surgical Environments

Ankan Aich,Yangming Lee

Main category: cs.CV

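TL;DR: The paper proposes a monocular depth estimation method based on the Depth Anything V2 architecture and DV-LORA adaptation, which markedly improves the accuracy and robustness of depth estimates for thin instruments and highly specular regions in surgical endoscopy.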

Details Motivation: Existing self-supervised monocular depth estimation methods suffer from boundary collapse in the specular, fluid-filled environments of surgical endoscopy, and are especially inaccurate on thin instruments and transparent surfaces. Method: The approach exploits the high-fidelity synthetic priors of Depth Anything V2 and transfers them efficiently to the medical domain with Dynamic Vector Low-Rank Adaptation (DV-LORA); it also introduces a physically-stratified evaluation protocol on the SCARED dataset to measure performance in high-specularity scenes more accurately. Result: On SCARED the method reaches an accuracy (< 1.25) of 98.1% and reduces Squared Relative Error by more than 17% compared with baselines, and it is more robust in highly specular regions. Conclusion: The method bridges the gap between synthetic and real medical scenes and improves depth estimation in complex surgical environments, giving robot-assisted surgery a more reliable perceptual basis. Abstract: Accurate Monocular Depth Estimation (MDE) is critical for robotic surgery but remains fragile in specular, fluid-filled endoscopic environments. Existing self-supervised methods, typically relying on foundation models trained with noisy real-world pseudo-labels, often suffer from boundary collapse on thin surgical tools and transparent surfaces. In this work, we address this by leveraging the high-fidelity synthetic priors of the Depth Anything V2 architecture, which inherently captures precise geometric details of thin structures. We efficiently adapt these priors to the medical domain using Dynamic Vector Low-Rank Adaptation (DV-LORA), minimizing the parameter budget while bridging the synthetic-to-real gap. Additionally, we introduce a physically-stratified evaluation protocol on the SCARED dataset to rigorously quantify performance in high-specularity regimes often masked by aggregate metrics. Our approach establishes a new state-of-the-art, achieving an accuracy (< 1.25) of 98.1% and reducing Squared Relative Error by over 17% compared to established baselines, demonstrating superior robustness in adverse surgical lighting.
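
The two reported metrics are presumably the standard depth-estimation measures: threshold accuracy, the fraction of pixels with max(pred/gt, gt/pred) < 1.25, and squared relative error, the mean of (pred - gt)^2 / gt. The sketch below computes both on flat lists of positive depths; that the paper uses exactly these definitions is an assumption, and the sample values are made up.

```python
from typing import Sequence


def delta_accuracy(pred: Sequence[float], gt: Sequence[float], thresh: float = 1.25) -> float:
    """Fraction of pixels whose ratio max(pred/gt, gt/pred) is below thresh (depths must be positive)."""
    ratios = [max(p / g, g / p) for p, g in zip(pred, gt)]
    return sum(r < thresh for r in ratios) / len(ratios)


def squared_relative_error(pred: Sequence[float], gt: Sequence[float]) -> float:
    """Mean of (pred - gt)^2 / gt over all pixels."""
    return sum((p - g) ** 2 / g for p, g in zip(pred, gt)) / len(gt)


# Made-up depths: the third pixel is off by a factor of two.
pred_depth = [1.0, 2.0, 4.0]
true_depth = [1.0, 2.0, 2.0]
acc = delta_accuracy(pred_depth, true_depth)
sq_rel = squared_relative_error(pred_depth, true_depth)
```

Because both metrics average over all pixels, a few badly wrong specular pixels barely move the aggregate, which is the masking effect that a stratified evaluation protocol is meant to expose.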

[68] Video-Based Performance Evaluation for ECR Drills in Synthetic Training Environments

Surya Rayala,Marcos Quinones-Grueiro,Naveeduddin Mohammed,Ashwin T S,Benjamin Goldberg,Randall Spain,Paige Lawton,Gautam Biswas

Main category: cs.CV

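TL;DR: This paper proposes a video-based assessment pipeline that uses computer vision to extract 2D skeletons, gaze vectors, and movement trajectories from urban warfare training videos, builds task-specific metrics from them, and combines the metrics in an extended Cognitive Task Analysis (CTA) hierarchy to quantify psychomotor fluency, situational awareness, and team coordination objectively.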

Details Motivation: 传统军事训练评估依赖昂贵、侵入式传感器或主观人工观察,难以实现可扩展且准确的客观性能评估,尤其是在认知、心理运动和团队协作技能方面存在不足。 Method: 采用无需额外硬件的视频分析方法,通过计算机视觉模型提取训练中的2D姿态、 gaze 和轨迹数据,设计针对ECR任务的性能指标,并集成到加权的认知任务分析(CTA)框架中,生成个体与团队的整体评分。 Result: 在真实ECR演练案例中验证了该方法的有效性,提供了可操作的领域特定指标,支持通过Gamemaster和GIFT系统进行交互式战后回顾与可视化反馈。 Conclusion: 该方法为合成训练环境中的可扩展、客观性能评估提供了可行路径,未来将拓展至3D视频分析并提升在STE中的广泛应用能力。 Abstract: Effective urban warfare training requires situational awareness and muscle memory, developed through repeated practice in realistic yet controlled environments. A key drill, Enter and Clear the Room (ECR), demands threat assessment, coordination, and securing confined spaces. The military uses Synthetic Training Environments that offer scalable, controlled settings for repeated exercises. However, automatic performance assessment remains challenging, particularly when aiming for objective evaluation of cognitive, psychomotor, and teamwork skills. Traditional methods often rely on costly, intrusive sensors or subjective human observation, limiting scalability and accuracy. This paper introduces a video-based assessment pipeline that derives performance analytics from training videos without requiring additional hardware. By utilizing computer vision models, the system extracts 2D skeletons, gaze vectors, and movement trajectories. From these data, we develop task-specific metrics that measure psychomotor fluency, situational awareness, and team coordination. These metrics feed into an extended Cognitive Task Analysis (CTA) hierarchy, which employs a weighted combination to generate overall performance scores for teamwork and cognition. We demonstrate the approach with a case study of real-world ECR drills, providing actionable, domain specific metrics that capture individual and team performance. 
We also discuss how these insights can support After Action Reviews with interactive dashboards within Gamemaster and the Generalized Intelligent Framework for Tutoring (GIFT), providing intuitive and understandable feedback. We conclude by addressing limitations, including tracking difficulties, ground-truth validation, and the broader applicability of our approach. Future work includes expanding analysis to 3D video data and leveraging video analysis to enable scalable evaluation within STEs.
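
The weighted combination described in this abstract can be illustrated with a minimal sketch. The metric names, values, and weights below are hypothetical placeholders, not taken from the paper; the sketch only shows how leaf metrics in [0, 1] might roll up into one score through a normalised weighted average.

```python
def weighted_score(metrics, weights):
    """Weighted average of metric values; weights are normalised to sum to 1.

    `metrics` and `weights` are dicts keyed by metric name (names are
    hypothetical, chosen only for illustration).
    """
    total = sum(weights[name] for name in metrics)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(metrics[name] * weights[name] for name in metrics) / total


# Hypothetical leaf metrics for a teamwork node of the hierarchy.
teamwork = weighted_score(
    {"entry_spacing": 0.8, "sector_coverage": 0.6},
    {"entry_spacing": 1.0, "sector_coverage": 3.0},
)
```

Applying the same function one level up (for example, combining a teamwork score with a cognition score) would give a hierarchical roll-up; how the paper actually sets its weights is not stated in the abstract.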

[69] Pretraining Frame Preservation in Autoregressive Video Memory Compression

Lvmin Zhang,Shengqu Cai,Muyang Li,Chong Zeng,Beijia Lu,Anyi Rao,Song Han,Gordon Wetzstein,Maneesh Agrawala

Main category: cs.CV

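TL;DR: Proposes PFP, a neural network structure that compresses long videos into short contexts through an explicit pretraining objective that preserves the high-frequency details of single frames at arbitrary temporal positions.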

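Details Motivation: Achieving long history memory at low context cost while keeping fidelity high requires compressing long videos effectively and retaining key visual details. Method: Designs the PFP neural network structure with an explicit pretraining objective that preserves the high-frequency details of single frames; the pretrained model is then fine-tuned as a memory encoder for autoregressive video models. Result: The baseline model compresses a 20-second video into a context of about 5k length, supports perceptually faithful reconstruction of random frames, and can serve as low-overhead long-term memory in autoregressive video generation. Conclusion: PFP provides an effective way to compress long videos, markedly shortening the context while retaining high-frequency details, which suits video modeling tasks that need long-term memory. Abstract: We present PFP, a neural network structure to compress long videos into short contexts, with an explicit pretraining objective to preserve the high-frequency details of single frames at arbitrary temporal positions. The baseline model can compress a 20-second video into a context at about 5k length, where random frames can be retrieved with perceptually preserved appearances. Such pretrained models can be directly fine-tuned as memory encoders for autoregressive video models, enabling long history memory with low context cost and relatively low fidelity loss. We evaluate the framework with ablative settings and discuss the trade-offs of possible neural architecture designs.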

[70] Lifelong Domain Adaptive 3D Human Pose Estimation

Qucheng Peng,Hongfei Xue,Pu Wang,Chen Chen

Main category: cs.CV

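TL;DR: This paper proposes the new task of lifelong domain adaptive 3D human pose estimation, bringing lifelong domain adaptation to 3D HPE for the first time. A novel GAN framework and a new 3D pose generator paradigm mitigate domain shift and catastrophic forgetting, and the method performs strongly on multiple datasets.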

Details Motivation: Existing domain adaptation methods overlook non-stationary target pose datasets and struggle to retain knowledge of earlier domains while continually learning new ones, so a method is needed that keeps adapting to new domains while preventing catastrophic forgetting. Method: Proposes a new GAN framework comprising 3D pose generators, a 2D pose discriminator, and a 3D pose estimator, and designs a 3D pose generator that integrates pose-aware, temporal-aware, and domain-aware knowledge to strengthen adaptation to the current domain and reduce catastrophic forgetting on previous domains. Result: Extensive experiments on diverse domain adaptive 3D HPE datasets show that the method adapts to new domains while preserving performance on old ones, and outperforms existing methods overall. Conclusion: The paper is the first to bring lifelong domain adaptation to 3D HPE; the proposed framework handles domain shift and knowledge forgetting effectively and offers a feasible route to continual adaptation across diverse real-world scenarios. Abstract: 3D Human Pose Estimation (3D HPE) is vital in various applications, from person re-identification and action recognition to virtual reality. However, the reliance on annotated 3D data collected in controlled environments poses challenges for generalization to diverse in-the-wild scenarios. Existing domain adaptation (DA) paradigms like general DA and source-free DA for 3D HPE overlook the issues of non-stationary target pose datasets. To address these challenges, we propose a novel task named lifelong domain adaptive 3D HPE. To our knowledge, we are the first to introduce the lifelong domain adaptation to the 3D HPE task. In this lifelong DA setting, the pose estimator is pretrained on the source domain and subsequently adapted to distinct target domains. Moreover, during adaptation to the current target domain, the pose estimator cannot access the source and all the previous target domains. The lifelong DA for 3D HPE involves overcoming challenges in adapting to current domain poses and preserving knowledge from previous domains, particularly combating catastrophic forgetting. We present an innovative Generative Adversarial Network (GAN) framework, which incorporates 3D pose generators, a 2D pose discriminator, and a 3D pose estimator. This framework effectively mitigates domain shifts and aligns original and augmented poses. Moreover, we construct a novel 3D pose generator paradigm, integrating pose-aware, temporal-aware, and domain-aware knowledge to enhance the current domain's adaptation and alleviate catastrophic forgetting on previous domains. 
Our method demonstrates superior performance through extensive experiments on diverse domain adaptive 3D HPE datasets.

[71] MRI-to-CT Synthesis With Cranial Suture Segmentations Using A Variational Autoencoder Framework

Krithika Iyer,Austin Tapp,Athelia Paulli,Gabrielle Dickerson,Syed Muhammad Anwar,Natasha Lepore,Marius George Linguraru

Main category: cs.CV

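TL;DR: This study proposes a deep learning framework that generates synthetic CT (sCT) from pediatric T1-weighted MRI, enabling accurate segmentation and visualization of cranial bones and sutures. It overcomes MRI's limitations in bone imaging without radiation exposure.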

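Details Motivation: Quantifying pediatric cranial development and suture ossification is crucial for diagnosing and treating cranial growth abnormalities. CT carries an ionizing radiation risk and is unsuitable for routine use in children, while MRI is radiation-free but cannot clearly show sutures or bone density, so a radiation-free method that accurately assesses cranial structure is needed. Method: A deep learning driven pipeline with domain-specific variational autoencoders converts T1-weighted MRIs of children aged 0.2 to 2 years into synthetic CTs (sCTs), predicts cranial bone segmentation, generates suture probability heatmaps, and derives suture segmentation from them. In-house pediatric data are used for training and validation. Result: sCTs reached 99% structural similarity and a Fréchet inception distance of 1.01 relative to real CTs; skull segmentation reached an average Dice coefficient of 85% across seven cranial bones; suture segmentation reached 80% Dice; TOST showed that skull and suture segmentation on sCTs is equivalent to that on real CTs (p < 0.05). Conclusion: The method is the first to enable pediatric suture segmentation on MRI-derived sCTs. The generated sCTs are nearly indistinguishable from real CTs both visually and quantitatively, offering a feasible approach to non-invasive assessment of pediatric cranial development and filling a gap in current imaging techniques. Abstract: Quantifying normative pediatric cranial development and suture ossification is crucial for diagnosing and treating growth-related cephalic disorders. Computed tomography (CT) is widely used to evaluate cranial and sutural deformities; however, its ionizing radiation is contraindicated in children without significant abnormalities. Magnetic resonance imaging (MRI) offers radiation-free scans with superior soft tissue contrast, but unlike CT, MRI cannot elucidate cranial sutures, estimate skull bone density, or assess cranial vault growth. This study proposes a deep learning driven pipeline for transforming T1-weighted MRIs of children aged 0.2 to 2 years into synthetic CTs (sCTs), predicting detailed cranial bone segmentation, generating suture probability heatmaps, and deriving direct suture segmentation from the heatmaps. With our in-house pediatric data, sCTs achieved 99% structural similarity and a Fréchet inception distance of 1.01 relative to real CTs. Skull segmentation attained an average Dice coefficient of 85% across seven cranial bones, and sutures achieved 80% Dice. Equivalence of skull and suture segmentation between sCTs and real CTs was confirmed using two one-sided tests (TOST p < 0.05). To our knowledge, this is the first pediatric cranial CT synthesis framework to enable suture segmentation on sCTs derived from MRI, despite MRI's limited depiction of bone and sutures. 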
By combining robust, domain-specific variational autoencoders, our method generates perceptually indistinguishable cranial sCTs from routine pediatric MRIs, bridging critical gaps in non-invasive cranial evaluation.
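
For readers unfamiliar with the Dice coefficient reported above, here is a minimal sketch of the standard definition, Dice = 2|A ∩ B| / (|A| + |B|), on binary masks. The flat-list mask representation and the convention of returning 1.0 for two empty masks are assumptions made for illustration; the paper's own evaluation code is not shown in the abstract.

```python
def dice(pred, target):
    """Dice coefficient between two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of elements")
    # Count voxels labelled foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    size = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if size == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / size


score = dice([1, 1, 0, 0], [1, 0, 0, 0])  # 2 * 1 / (2 + 1)
```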

[72] Scaling Remote Sensing Foundation Models: Data Domain Tradeoffs at the Peta-Scale

Charith Wickrema,Eliza Mace,Hunter Brown,Heidys Cabrera,Nick Krall,Matthew O'Neill,Shivangi Sarkar,Lowell Weissman,Eric Hughes,Guido Zarrella

Main category: cs.CV

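TL;DR: This study explores scaling behavior when training large foundation models on high-resolution electro-optical remote sensing data. Vision transformers trained on over a quadrillion pixels of satellite data show that performance is limited by data rather than by model parameters, offering practical guidance for large-scale model development in remote sensing.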

Details Motivation: Scaling laws remain poorly understood in high-value domains such as remote sensing, and the lack of principles to guide large-scale training holds back foundation models in the field. Method: Using over a quadrillion pixels of commercial satellite EO data, progressively larger vision transformer backbones are trained in the MITRE Federal AI Sandbox; the study analyzes their behavior at petascale, the success and failure modes, and the implications for domain gaps across remote sensing modalities. Result: Even at this extreme scale, model performance remains in a data-limited rather than parameter-limited regime; several success and failure modes are identified, along with implications for generalizing to other remote sensing modalities. Conclusion: The findings can inform future data-collection strategies, compute budgets, and optimization schedules in remote sensing, advancing frontier-scale RS foundation models. Abstract: We explore the scaling behaviors of artificial intelligence to establish practical techniques for training foundation models on high-resolution electro-optical (EO) datasets that exceed the current state-of-the-art scale by orders of magnitude. Modern multimodal machine learning (ML) applications, such as generative artificial intelligence (GenAI) systems for image captioning, search, and reasoning, depend on robust, domain-specialized encoders for non-text modalities. In natural-image domains where internet-scale data is plentiful, well-established scaling laws help optimize the joint scaling of model capacity, training compute, and dataset size. Unfortunately, these relationships are much less well-understood in high-value domains like remote sensing (RS). Using over a quadrillion pixels of commercial satellite EO data and the MITRE Federal AI Sandbox, we train progressively larger vision transformer (ViT) backbones, report success and failure modes observed at petascale, and analyze implications for bridging domain gaps across additional RS modalities. We observe that even at this scale, performance is consistent with a data-limited regime rather than a model parameter-limited one. These practical insights are intended to inform data-collection strategies, compute budgets, and optimization schedules that advance the future development of frontier-scale RS foundation models.

[73] Learning to learn skill assessment for fetal ultrasound scanning

Yipei Wang,Qianye Yang,Lior Drukker,Aris T. Papageorghiou,Yipeng Hu,J. Alison Noble

Main category: cs.CV

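TL;DR: Proposes a novel bi-level optimisation framework that assesses fetal ultrasound skill without skill-rating supervision, quantifying skill level by how well a task is performed.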

Details Motivation: Traditional ultrasound skill assessment depends on subjective expert judgment and is time-consuming; existing automated methods mostly rely on supervised learning and predefined skill indicators, which limits generalization. Method: Designs a bi-level optimisation framework with a clinical task predictor and a skill predictor; the two networks are optimised jointly, and scanning skill is assessed indirectly through how well the task is completed on the acquired images. Result: The method's feasibility is validated on real-world clinical ultrasound videos of the fetal head, where it predicts ultrasound skill levels effectively. Conclusion: The framework needs no manually labeled skill ratings and uses optimised task performance as the skill indicator, enabling more objective and automated ultrasound skill assessment. Abstract: Traditionally, ultrasound skill assessment has relied on expert supervision and feedback, a process known for its subjectivity and time-intensive nature. Previous works on quantitative and automated skill assessment have predominantly employed supervised learning methods, often limiting the analysis to predetermined or assumed factors considered influential in determining skill levels. In this work, we propose a novel bi-level optimisation framework that assesses fetal ultrasound skills by how well a task is performed on the acquired fetal ultrasound images, without using manually predefined skill ratings. The framework consists of a clinical task predictor and a skill predictor, which are optimised jointly by refining the two networks simultaneously. We validate the proposed method on real-world clinical ultrasound videos of scanning the fetal head. The results demonstrate the feasibility of predicting ultrasound skills by the proposed framework, which quantifies optimised task performance as a skill indicator.

[74] MGML: A Plug-and-Play Meta-Guided Multi-Modal Learning Framework for Incomplete Multimodal Brain Tumor Segmentation

Yulong Zou,Bo Liu,Cun-Jing Zheng,Yuan-ming Geng,Siyue Li,Qiankun Zuo,Shuihua Wang,Yudong Zhang,Jin Hong

Main category: cs.CV

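TL;DR: Proposes a meta-guided multi-modal learning (MGML) framework that improves brain tumor segmentation from incomplete multimodal MRI data. It comprises adaptive modality fusion and consistency regularization modules, requires no changes to the model architecture, trains end to end, and outperforms existing methods on BraTS2020 and BraTS2023.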

Details Motivation: Multimodal MRI data are often incomplete in clinical practice, which limits full use of multimodal information; exploiting incomplete multimodal data effectively for lesion segmentation is a key challenge. Method: Proposes the MGML framework with two modules: (1) meta-parameterized adaptive modality fusion (Meta-AMF), which generates soft-label supervision signals from the available modalities to achieve dynamic multimodal fusion; (2) a consistency regularization module that improves robustness and generalization. The method leaves the original model architecture unchanged and can be embedded in the training pipeline for end-to-end optimization. Result: Validated on BraTS2020 and BraTS2023, the method outperforms multiple SOTA methods. On BraTS2020, the average Dice scores across fifteen missing modality combinations are 87.55, 79.36, and 62.67 for WT, TC, and ET, respectively. Conclusion: MGML makes effective use of incomplete multimodal MRI data and improves brain tumor segmentation; it is robust, general, and easy to integrate, making it suitable for real clinical settings. Abstract: Leveraging multimodal information from Magnetic Resonance Imaging (MRI) plays a vital role in lesion segmentation, especially for brain tumors. However, in clinical practice, multimodal MRI data are often incomplete, making it challenging to fully utilize the available information. Therefore, maximizing the utilization of this incomplete multimodal information presents a crucial research challenge. We present a novel meta-guided multi-modal learning (MGML) framework that comprises two components: meta-parameterized adaptive modality fusion and consistency regularization module. The meta-parameterized adaptive modality fusion (Meta-AMF) enables the model to effectively integrate information from multiple modalities under varying input conditions. By generating adaptive soft-label supervision signals based on the available modalities, Meta-AMF explicitly promotes more coherent multimodal fusion. In addition, the consistency regularization module enhances segmentation performance and implicitly reinforces the robustness and generalization of the overall framework. Notably, our approach does not alter the original model architecture and can be conveniently integrated into the training pipeline for end-to-end model optimization. We conducted extensive experiments on the public BraTS2020 and BraTS2023 datasets. Compared to multiple state-of-the-art methods from previous years, our method achieved superior performance. 
On BraTS2020, for the average Dice scores across fifteen missing modality combinations, building upon the baseline, our method obtained scores of 87.55, 79.36, and 62.67 for the whole tumor (WT), the tumor core (TC), and the enhancing tumor (ET), respectively. We have made our source code publicly available at https://github.com/worldlikerr/MGML.
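
The figure of fifteen missing-modality combinations follows from the four BraTS MRI sequences: every non-empty subset of four modalities is a possible "available" set, giving 2^4 - 1 = 15. The short sketch below enumerates them; the modality labels are the usual BraTS names, and the helper name is an illustrative assumption rather than something from the MGML code base.

```python
from itertools import combinations

# The four MRI sequences provided in BraTS.
MODALITIES = ("T1", "T1ce", "T2", "FLAIR")


def available_subsets(modalities):
    """All non-empty subsets of modalities, i.e. every missing-modality scenario."""
    return [
        subset
        for size in range(1, len(modalities) + 1)
        for subset in combinations(modalities, size)
    ]


scenarios = available_subsets(MODALITIES)
```

Averaging a per-scenario Dice score over `scenarios` would give a number comparable in spirit to the averages quoted above, assuming each scenario is weighted equally.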

[75] Learnable Query Aggregation with KV Routing for Cross-view Geo-localisation

Hualin Ye,Bingxi Liu,Jixiang Du,Yu Qin,Ziyi Chen,Hong Zhang

Main category: cs.CV

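TL;DR: This paper proposes a new system for cross-view geo-localisation (CVGL) that combines a DINOv2 backbone, a multi-scale channel reallocation module, and an improved aggregation module with MoE routing to address the feature alignment challenges caused by viewpoint differences, achieving competitive performance with fewer parameters.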

Details Motivation: Large differences between viewpoints make feature aggregation and alignment hard for existing methods, so a more robust model is needed to improve the accuracy and efficiency of cross-view geo-localisation. Method: Fine-tunes a DINOv2 backbone with a convolution adapter, designs a multi-scale channel reallocation module to strengthen the diversity and stability of spatial representations, and introduces an improved aggregation module based on Mixture-of-Experts (MoE) routing that dynamically selects expert subspaces within a cross-attention framework to handle heterogeneous inputs. Result: Experiments on the University-1652 and SUES-200 datasets show that the method reaches competitive performance with fewer trained parameters. Conclusion: The proposed CVGL system mitigates cross-view discrepancies through architectural changes, balances performance with model efficiency, and offers a feasible direction for follow-up research. Abstract: Cross-view geo-localisation (CVGL) aims to estimate the geographic location of a query image by matching it with images from a large-scale database. However, the significant viewpoint discrepancies present considerable challenges for effective feature aggregation and alignment. To address these challenges, we propose a novel CVGL system that incorporates three key improvements. Firstly, we leverage the DINOv2 backbone with convolution adapter fine-tuning to enhance model adaptability to cross-view variations. Secondly, we propose a multi-scale channel reallocation module to strengthen the diversity and stability of spatial representations. Finally, we propose an improved aggregation module that integrates a Mixture-of-Experts (MoE) routing into the feature aggregation process. Specifically, the module dynamically selects expert subspaces for the keys and values in a cross-attention framework, enabling adaptive processing of heterogeneous input domains. Extensive experiments on the University-1652 and SUES-200 datasets demonstrate that our method achieves competitive performance with fewer trained parameters.

[76] Kinematic-Based Assessment of Surgical Actions in Microanastomosis

Yan Meng,Daniel Donoho,Marcelle Altshuler,Omar Arnaout

Main category: cs.CV

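TL;DR: Proposes an AI-driven automated framework for action segmentation and skill assessment in microanastomosis procedures, designed to run efficiently on edge computing platforms.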

Details Motivation: Conventional assessment of microsurgical training relies on subjective expert ratings, which suffer from inter-rater variability, inconsistency, and high time cost, so objective and systematic automated assessment is needed. Method: The framework has three modules: an instrument tip tracking and localization module based on YOLO and DeepSORT; an action segmentation module that uses a self-similarity matrix for action boundary detection together with unsupervised clustering; and a supervised classification module that evaluates surgical gesture proficiency. Result: Validated on a dataset of 58 expert-rated microanastomosis videos, the framework reaches a frame-level action segmentation accuracy of 92.4% and a skill classification accuracy of 85.5%, closely reproducing expert evaluations. Conclusion: The method can provide objective, real-time feedback in microsurgical education, supporting standardized, data-driven training and improving competency assessment in high-stakes surgical environments. Abstract: Proficiency in microanastomosis is a critical surgical skill in neurosurgery, where the ability to precisely manipulate fine instruments is crucial to successful outcomes. These procedures require sustained attention, coordinated hand movements, and highly refined motor skills, underscoring the need for objective and systematic methods to evaluate and enhance microsurgical training. Conventional assessment approaches typically rely on expert raters supervising the procedures or reviewing surgical videos, which is an inherently subjective process prone to inter-rater variability, inconsistency, and significant time investment. These limitations highlight the necessity for automated and scalable solutions. To address this challenge, we introduce a novel AI-driven framework for automated action segmentation and performance assessment in microanastomosis procedures, designed to operate efficiently on edge computing platforms. The proposed system comprises three main components: (1) an object tip tracking and localization module based on YOLO and DeepSORT; (2) an action segmentation module leveraging a self-similarity matrix for action boundary detection and unsupervised clustering; and (3) a supervised classification module designed to evaluate surgical gesture proficiency. Experimental validation on a dataset of 58 expert-rated microanastomosis videos demonstrates the effectiveness of our approach, achieving a frame-level action segmentation accuracy of 92.4% and an overall skill classification accuracy of 85.5% in replicating expert evaluations. 
These findings demonstrate the potential of the proposed method to provide objective, real-time feedback in microsurgical education, thereby enabling more standardized, data-driven training protocols and advancing competency assessment in high-stakes surgical environments.
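
To make the self-similarity idea in component (2) concrete, here is a minimal sketch under simplifying assumptions: each frame is a small hypothetical feature vector (for example, instrument tip velocity), similarity is cosine similarity, and a boundary is declared wherever adjacent frames fall below a fixed threshold. The paper's actual features, boundary detector, and clustering step are not specified in the abstract, so this is an illustration of the general technique only.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def self_similarity(features):
    """Square matrix whose (i, j) entry compares frame i with frame j."""
    return [[cosine(a, b) for b in features] for a in features]


def boundaries(ssm, threshold=0.5):
    """Frame indices t where frame t is dissimilar to frame t - 1.

    Only the first off-diagonal of the matrix is inspected, which is a
    deliberate simplification of kernel-based novelty detection.
    """
    return [t for t in range(1, len(ssm)) if ssm[t - 1][t] < threshold]


# Hypothetical per-frame features: two frames of one motion, then two of another.
frames = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
cuts = boundaries(self_similarity(frames))
```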

[77] U-Net-Like Spiking Neural Networks for Single Image Dehazing

Huibin Li,Haoran Liu,Mingzhe Liu,Yulong Xiao,Peng Li,Guibin Zan

Main category: cs.CV

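TL;DR: Proposes DehazeSNN, a new dehazing architecture that combines a U-Net-like structure with spiking neural networks (SNNs). The OLIFBlock module improves cross-channel communication, and the model matches state-of-the-art methods at lower computational cost.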

Details Motivation: Traditional dehazing methods rely on atmospheric scattering models, while deep learning methods such as CNNs and Transformers either model long-range dependencies poorly or are computationally expensive, so an efficient, high-performing dehazing model is needed. Method: Proposes DehazeSNN, which combines a U-Net-like structure with Spiking Neural Networks and introduces the Orthogonal Leaky-Integrate-and-Fire Block (OLIFBlock) to strengthen cross-channel information exchange, capturing multi-scale features and long-range dependencies. Result: Experiments show that DehazeSNN dehazes on par with state-of-the-art methods on multiple benchmark datasets while using a smaller model and fewer MACs, giving higher computational efficiency. Conclusion: DehazeSNN is an efficient, lightweight, high-performing image dehazing method; drawing on the strengths of SNNs, it points to a new direction for applications that need low power consumption and high clarity. Abstract: Image dehazing is a critical challenge in computer vision, essential for enhancing image clarity in hazy conditions. Traditional methods often rely on atmospheric scattering models, while recent deep learning techniques, specifically Convolutional Neural Networks (CNNs) and Transformers, have improved performance by effectively analyzing image features. However, CNNs struggle with long-range dependencies, and Transformers demand significant computational resources. To address these limitations, we propose DehazeSNN, an innovative architecture that integrates a U-Net-like design with Spiking Neural Networks (SNNs). DehazeSNN captures multi-scale image features while efficiently managing local and long-range dependencies. The introduction of the Orthogonal Leaky-Integrate-and-Fire Block (OLIFBlock) enhances cross-channel communication, resulting in superior dehazing performance with reduced computational burden. Our extensive experiments show that DehazeSNN is highly competitive with state-of-the-art methods on benchmark datasets, delivering high-quality haze-free images with a smaller model size and fewer multiply-accumulate operations. The proposed dehazing method is publicly available at https://github.com/HaoranLiu507/DehazeSNN.
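
As background for the leaky-integrate-and-fire (LIF) dynamics that the OLIFBlock name refers to, the following is a minimal sketch of a single textbook LIF neuron in discrete time. It is not the OLIFBlock itself, whose orthogonal design is not described in the abstract; the decay factor, threshold, and hard reset to zero are assumed values chosen for illustration.

```python
def lif_neuron(inputs, decay=0.5, threshold=1.0):
    """Simulate one leaky integrate-and-fire neuron over a sequence of inputs.

    At each step the membrane potential leaks (multiplied by `decay`),
    integrates the new input, and emits a spike (1) when it reaches
    `threshold`, after which it is reset to zero.
    """
    potential = 0.0
    spikes = []
    for current in inputs:
        potential = decay * potential + current
        if potential >= threshold:
            spikes.append(1)
            potential = 0.0  # hard reset after firing (assumed reset rule)
        else:
            spikes.append(0)
    return spikes


spike_train = lif_neuron([0.6, 0.6, 0.6, 0.0])
```

Because the output is a sparse binary spike train, downstream layers can replace many multiplications with additions, which is the usual argument for the lower multiply-accumulate counts of SNNs.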

[78] T2VAttack: Adversarial Attack on Text-to-Video Diffusion Models

Changzhen Li,Yuecong Min,Jie Zhang,Zheng Yuan,Shiguang Shan,Xilin Chen

Main category: cs.CV

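TL;DR: This paper introduces T2VAttack, the first systematic study of adversarial attacks on text-to-video diffusion models from both semantic and temporal perspectives, revealing how fragile current models are under small prompt changes.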

Details Motivation: Text-to-video generation models have advanced considerably, yet their robustness to adversarial attacks has not been well explored; this paper aims to fill that gap. Method: Proposes two attack objectives (semantic alignment and temporal dynamics) and two attack methods: T2VAttack-S replaces critical words via greedy search, and T2VAttack-I iteratively inserts optimized words as a minimal perturbation. Result: Experiments show that substituting or inserting a single word can substantially reduce the semantic fidelity and temporal coherence of videos generated by several mainstream T2V models (such as ModelScope and CogVideoX). Conclusion: Current text-to-video diffusion models have serious vulnerabilities under adversarial prompts, and their robustness needs to improve to ensure safe real-world use. Abstract: The rapid evolution of Text-to-Video (T2V) diffusion models has driven remarkable advancements in generating high-quality, temporally coherent videos from natural language descriptions. Despite these achievements, their vulnerability to adversarial attacks remains largely unexplored. In this paper, we introduce T2VAttack, a comprehensive study of adversarial attacks on T2V diffusion models from both semantic and temporal perspectives. Considering the inherently dynamic nature of video data, we propose two distinct attack objectives: a semantic objective to evaluate video-text alignment and a temporal objective to assess the temporal dynamics. To achieve an effective and efficient attack process, we propose two adversarial attack methods: (i) T2VAttack-S, which identifies semantically or temporally critical words in prompts and replaces them with synonyms via greedy search, and (ii) T2VAttack-I, which iteratively inserts optimized words with minimal perturbation to the prompt. By combining these objectives and strategies, we conduct a comprehensive evaluation on the adversarial robustness of several state-of-the-art T2V models, including ModelScope, CogVideoX, Open-Sora, and HunyuanVideo. Our experiments reveal that even minor prompt modifications, such as the substitution or insertion of a single word, can cause substantial degradation in semantic fidelity and temporal dynamics, highlighting critical vulnerabilities in current T2V diffusion models.

[79] DriveExplorer: Images-Only Decoupled 4D Reconstruction with Progressive Restoration for Driving View Extrapolation

Yuang Jia,Jinlong Wang,Jiayi Zhao,Chunlam Li,Shunzhou Wang,Wei Gao

Main category: cs.CV

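TL;DR: This paper proposes a view extrapolation method for autonomous driving scenes that needs no expensive sensors or annotated priors. Using only images and optional camera poses, it generates high-quality novel-view images by iteratively refining deformable 4D Gaussians with a diffusion model.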

Details Motivation: Existing view extrapolation methods depend on priors such as LiDAR and 3D boxes that are expensive or require manual annotation, which limits real-world deployment; this paper aims for efficient, high-quality view extrapolation from images alone. Method: First estimates a global static point cloud and per-frame dynamic point clouds and fuses them into a unified representation; reconstructs the scene with a deformable 4D Gaussian framework; trains a video diffusion model on the pseudo-images it renders; then iteratively refines the Gaussian renderings with the diffusion model while feeding the refined results back into 4DGS training. Result: Compared with baselines, the method produces higher-quality novel-view images at the target extrapolated viewpoints. Conclusion: The method achieves high-quality view extrapolation without strong geometric priors, giving it greater potential for real-world use. Abstract: This paper presents an effective solution for view extrapolation in autonomous driving scenarios. Recent approaches focus on generating shifted novel view images from given viewpoints using diffusion models. However, these methods heavily rely on priors such as LiDAR point clouds, 3D bounding boxes, and lane annotations, which demand expensive sensors or labor-intensive labeling, limiting applicability in real-world deployment. In this work, with only images and optional camera poses, we first estimate a global static point cloud and per-frame dynamic point clouds, fusing them into a unified representation. We then employ a deformable 4D Gaussian framework to reconstruct the scene. The initially trained 4D Gaussian model renders degraded and pseudo-images to train a video diffusion model. Subsequently, progressively shifted Gaussian renderings are iteratively refined by the diffusion model, and the enhanced results are incorporated back as training data for 4DGS. This process continues until extrapolation reaches the target viewpoints. Compared with baselines, our method produces higher-quality images at novel extrapolated viewpoints.

[80] Anomaly detection in satellite imagery through temporal inpainting

Bertrand Rouet-Leduc,Claudia Hulbert

Main category: cs.CV

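TL;DR: Proposes a deep learning method for anomaly detection in satellite image time series that identifies surface changes by predicting what the surface should look like in the absence of change, markedly improving detection sensitivity and specificity.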

Details Motivation: Traditional change detection methods struggle to separate atmospheric noise and seasonal variation from real surface change, which limits sensitivity. A more robust method is needed for automated, global-scale monitoring of surface changes. Method: Builds an inpainting model on the SATLAS foundation model that predicts the surface state of the latest frame of a Sentinel-2 time series from earlier acquisitions, and detects anomalies from the discrepancy between prediction and observation. Training uses globally distributed data spanning diverse climate zones and land cover types. Result: Validated on surface ruptures triggered by the 2023 Turkey-Syria earthquake sequence, the method detects a rift feature in Tepehan, reaches detection thresholds about three times lower than traditional methods, and exceeds temporal median and Reed-Xiaoli anomaly detectors in both sensitivity and specificity. Conclusion: The method exploits the temporal redundancy of satellite imagery to detect subtle surface changes with high sensitivity, providing a feasible path to automated global change monitoring from freely available multi-spectral satellite data. Abstract: Detecting surface changes from satellite imagery is critical for rapid disaster response and environmental monitoring, yet remains challenging due to the complex interplay between atmospheric noise, seasonal variations, and sensor artifacts. Here we show that deep learning can leverage the temporal redundancy of satellite time series to detect anomalies at unprecedented sensitivity, by learning to predict what the surface should look like in the absence of change. We train an inpainting model built upon the SATLAS foundation model to reconstruct the last frame of a Sentinel-2 time series from preceding acquisitions, using globally distributed training data spanning diverse climate zones and land cover types. When applied to regions affected by sudden surface changes, the discrepancy between prediction and observation reveals anomalies that traditional change detection methods miss. We validate our approach on earthquake-triggered surface ruptures from the 2023 Turkey-Syria earthquake sequence, demonstrating detection of a rift feature in Tepehan with higher sensitivity and specificity than temporal median or Reed-Xiaoli anomaly detectors. Our method reaches detection thresholds approximately three times lower than baseline approaches, providing a path towards automated, global-scale monitoring of surface changes from freely available multi-spectral satellite data.
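
The core detection step, comparing a predicted frame with the observed one, can be sketched as follows. This is a minimal illustration that assumes single-band pixel values in flat lists and a robust threshold based on the median absolute deviation (MAD); the thresholding rule and the factor `k` are assumptions for this example, and the inpainting model that would produce `predicted` is not reproduced here.

```python
from statistics import median


def anomaly_mask(predicted, observed, k=3.0, eps=1e-12):
    """Flag pixels whose prediction error is unusually large.

    A pixel is flagged when its absolute residual exceeds the median
    residual by more than k times the MAD of the residuals.
    """
    if len(predicted) != len(observed):
        raise ValueError("frames must have the same number of pixels")
    residuals = [abs(o - p) for p, o in zip(predicted, observed)]
    centre = median(residuals)
    mad = median([abs(r - centre) for r in residuals])
    scale = max(mad, eps)  # guard against a zero MAD on very flat scenes
    return [(r - centre) > k * scale for r in residuals]


# Hypothetical reflectance values: the last pixel changed, the rest did not.
predicted = [0.2, 0.2, 0.2, 0.2, 0.2, 0.2]
observed = [0.21, 0.19, 0.2, 0.22, 0.2, 0.6]
mask = anomaly_mask(predicted, observed)
```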

[81] GCA-ResUNet: Medical Image Segmentation Using Grouped Coordinate Attention

Jun Ding,Shang Gao

Main category: cs.CV

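TL;DR: This paper proposes GCA-ResUNet, an efficient medical image segmentation framework. Its lightweight Grouped Coordinate Attention (GCA) module strengthens the modeling of long-range contextual dependencies and heterogeneous multi-organ structures, improving segmentation accuracy while preserving computational efficiency.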

Details Motivation: Existing U-Net style methods struggle to model long-range dependencies because of local receptive fields and homogeneous attention mechanisms, while Transformers capture global information at a high computational cost that limits their use in resource-constrained clinical environments. A segmentation method that balances performance and efficiency is therefore needed. Method: Proposes GCA-ResUNet with a plug-and-play Grouped Coordinate Attention (GCA) module that splits channel-wise context modeling into groups to handle semantic heterogeneity and adds direction-aware coordinate encoding to capture horizontal and vertical spatial dependencies. The module is embedded in a CNN backbone and strengthens global representation without a significant increase in computational cost. Result: On the Synapse and ACDC benchmarks, GCA-ResUNet achieves Dice scores of 86.11% and 92.64%, respectively, outperforming a range of representative CNN and Transformer methods including Swin-UNet and TransUNet, with particular gains on small organs and structures with complex boundaries. Conclusion: GCA-ResUNet strikes a good balance between segmentation accuracy and computational efficiency, shows good potential for clinical deployment, and offers a practical, scalable solution for medical image segmentation. Abstract: Accurate segmentation of heterogeneous anatomical structures is pivotal for computer-aided diagnosis and subsequent clinical decision-making. Although U-Net based convolutional neural networks have achieved remarkable progress, their intrinsic locality and largely homogeneous attention formulations often limit the modeling of long-range contextual dependencies, especially in multi-organ scenarios and low-contrast regions. Transformer-based architectures mitigate this issue by leveraging global self-attention, but they usually require higher computational resources and larger training data, which may impede deployment in resource-constrained clinical environments. In this paper, we propose GCA-ResUNet, an efficient medical image segmentation framework equipped with a lightweight and plug-and-play Grouped Coordinate Attention (GCA) module. The proposed GCA decouples channel-wise context modeling into multiple groups to explicitly account for semantic heterogeneity across channels, and integrates direction-aware coordinate encoding to capture structured spatial dependencies along horizontal and vertical axes. This design enhances global representation capability while preserving the efficiency advantages of CNN backbones. Extensive experiments on two widely used benchmarks, Synapse and ACDC, demonstrate that GCA-ResUNet achieves Dice scores of 86.11% and 92.64%, respectively, outperforming a range of representative CNN and Transformer-based methods, including Swin-UNet and TransUNet. 
In particular, GCA-ResUNet yields consistent improvements in delineating small anatomical structures with complex boundaries. These results indicate that the proposed approach provides a favorable trade-off between segmentation accuracy and computational efficiency, offering a practical and scalable solution for clinical deployment.

[82] Bridging Structure and Appearance: Topological Features for Robust Self-Supervised Segmentation

Haotang Li,Zhenyu Qi,Hao Qin,Huanrui Yang,Sen He,Kebin Peng

Main category: cs.CV

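TL;DR: This paper proposes GASeg, a framework that makes self-supervised semantic segmentation more robust by exploiting topological information in both geometric and appearance features. Its core components are the Differentiable Box-Counting (DBC) module and Topological Augmentation (TopoAug), with GALoss enforcing cross-modal alignment; it reaches SOTA performance on multiple benchmarks.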

Details Motivation: Existing self-supervised semantic segmentation methods perform poorly under appearance ambiguities (such as shadows, glare, and local textures) because they rely too heavily on unstable appearance features. This paper aims to mitigate the problem by introducing stable topological structure information. Method: Proposes the GASeg framework, which includes a Differentiable Box-Counting (DBC) module that extracts multi-scale topological statistics from parallel geometric and appearance streams; a Topological Augmentation (TopoAug) strategy that uses morphological operations to simulate real-world ambiguities and improve robustness; and a multi-objective loss, GALoss, that explicitly enforces cross-modal alignment between geometric and appearance features. Result: Achieves state-of-the-art performance on multiple benchmarks including COCO-Stuff, Cityscapes, and PASCAL, validating the approach of bridging geometric and appearance information. Conclusion: By combining stable topological information with adversarial data augmentation, GASeg reduces over-reliance on appearance features in self-supervised semantic segmentation and improves generalization in complex scenes. Abstract: Self-supervised semantic segmentation methods often fail when faced with appearance ambiguities. We argue that this is due to an over-reliance on unstable, appearance-based features such as shadows, glare, and local textures. We propose GASeg, a novel framework that bridges appearance and geometry by leveraging stable topological information. The core of our method is the Differentiable Box-Counting (DBC) module, which quantifies multi-scale topological statistics from two parallel streams: geometric-based features and appearance-based features. To force the model to learn these stable structural representations, we introduce Topological Augmentation (TopoAug), an adversarial strategy that simulates real-world ambiguities by applying morphological operators to the input images. A multi-objective loss, GALoss, then explicitly enforces cross-modal alignment between geometric-based and appearance-based features. Extensive experiments demonstrate that GASeg achieves state-of-the-art performance on four benchmarks, including COCO-Stuff, Cityscapes, and PASCAL, validating our approach of bridging geometry and appearance via topological information.
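
For context on what box counting measures, below is a minimal sketch of the classical, non-differentiable box-counting dimension on a binary 2D mask: count the boxes of side s that contain foreground, then fit the slope of log N(s) against log(1/s). The paper's DBC module is a differentiable variant operating on feature maps, and its formulation is not given in the abstract, so this illustrates only the underlying statistic; the box sizes are assumed.

```python
import math


def box_count(mask, size):
    """Number of size-by-size boxes containing at least one foreground pixel."""
    occupied = set()
    for r, row in enumerate(mask):
        for c, value in enumerate(row):
            if value:
                occupied.add((r // size, c // size))
    return len(occupied)


def box_counting_dimension(mask, sizes=(1, 2, 4)):
    """Least-squares slope of log N(s) versus log(1 / s)."""
    counts = [box_count(mask, s) for s in sizes]
    if min(counts) == 0:
        raise ValueError("mask has no foreground pixels")
    xs = [math.log(1.0 / s) for s in sizes]
    ys = [math.log(n) for n in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


filled = [[1] * 8 for _ in range(8)]                                # a 2D region
diagonal = [[1 if r == c else 0 for c in range(8)] for r in range(8)]  # a 1D line
```

A filled square gives a dimension near 2 and a thin line near 1, which is why the statistic responds to structure rather than to colour or texture.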

[83] Improved 3D Gaussian Splatting of Unknown Spacecraft Structure Using Space Environment Illumination Knowledge

Tae Ha Park,Simone D'Amico

Main category: cs.CV

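TL;DR: Proposes a new method that uses prior knowledge of the Sun's position to improve training of 3D Gaussian Splatting models, recovering the 3D structure of an unknown target spacecraft from image sequences during rendezvous and improving the photometric quality of renderings to support pose estimation.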

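Details Motivation: Lighting conditions change dynamically in spaceborne imagery, so conventional 3D reconstruction cannot guarantee the photometric accuracy of a 3D Gaussian Splatting (3DGS) model, which hurts downstream pose estimation. An illumination prior is therefore needed to improve model performance. Method: Incorporates the prior knowledge of the Sun's position, estimated and maintained by the servicer spacecraft, into the 3DGS training pipeline, jointly optimizing geometric and photometric consistency to adapt to dynamic lighting and to model shadowing and self-occlusion. Result: Experiments show that the method lets 3DGS models adapt to rapidly changing illumination in space, markedly improves the photometric quality of rendered images, and helps improve the accuracy of camera pose estimation based on photometric optimization. Conclusion: Introducing the Sun position prior resolves the photometric inconsistency of 3DGS under dynamic lighting when reconstructing space targets, making 3D reconstruction and visual pose estimation more robust and practical. Abstract: This work presents a novel pipeline to recover the 3D structure of an unknown target spacecraft from a sequence of images captured during Rendezvous and Proximity Operations (RPO) in space. The target's geometry and appearance are represented as a 3D Gaussian Splatting (3DGS) model. However, learning 3DGS requires static scenes, an assumption in contrast to dynamic lighting conditions encountered in spaceborne imagery. The trained 3DGS model can also be used for camera pose estimation through photometric optimization. Therefore, in addition to recovering a geometrically accurate 3DGS model, the photometric accuracy of the rendered images is imperative to downstream pose estimation tasks during the RPO process. This work proposes to incorporate the prior knowledge of the Sun's position, estimated and maintained by the servicer spacecraft, into the training pipeline for improved photometric quality of 3DGS rasterization. Experimental studies demonstrate the effectiveness of the proposed solution, as 3DGS models trained on a sequence of images learn to adapt to rapidly changing illumination conditions in space and reflect global shadowing and self-occlusion.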

[84] Bridging the Perception-Cognition Gap:Re-engineering SAM2 with Hilbert-Mamba for Robust VLM-based Medical Diagnosis

Hao Wu,Hui Li,Yiyun Su

Main category: cs.CV

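TL;DR: This paper proposes Hilbert-VLM, a novel two-stage fusion framework that improves the performance of visual language models on 3D medical image analysis. By introducing Hilbert space-filling curves and redesigning the SAM2 architecture, it achieves more precise lesion segmentation and disease classification.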

Details Motivation: Existing visual language models struggle to integrate complementary information when processing complex 3D multimodal medical images and tend to overlook subtle but critical pathological features, which limits their use in automated medical diagnosis. Method: Proposes the Hilbert-VLM framework, in which the HilbertMed-SAM module performs precise lesion segmentation and produces multimodal enhanced prompts that guide the visual language model in disease classification. The SAM2 architecture is redesigned by incorporating Hilbert space-filling curves into the scanning mechanism of the Mamba state space model to preserve the spatial locality of 3D data; a Hilbert-Mamba Cross-Attention (HMCA) mechanism and a scale-aware decoder capture fine-grained details; and a prompt enhancement module fuses segmentation masks with textual attributes into a dense prompt. Result: On the BraTS2021 segmentation benchmark, the model reaches a Dice score of 82.35% and a diagnostic classification accuracy (ACC) of 78.85%. Conclusion: Through its architectural changes, Hilbert-VLM improves the accuracy of lesion segmentation and disease classification in 3D medical images and strengthens the reliability and potential of VLM-based medical analysis. Abstract: Recent studies suggest that Visual Language Models (VLMs) hold great potential for tasks such as automated medical diagnosis. However, processing complex three-dimensional (3D) multimodal medical images poses significant challenges, specifically the effective integration of complementary information and the occasional oversight of subtle yet critical pathological features. To address these issues, we present a novel two-stage fusion framework termed Hilbert-VLM. This framework leverages the HilbertMed-SAM module for precise lesion segmentation, with the generated multimodal enhanced prompts then guiding the VLM toward accurate disease classification. Our key innovation lies in the systematic redesign of the Segment Anything Model 2 (SAM2) architecture: we incorporate Hilbert space-filling curves into the scanning mechanism of the Mamba State Space Model (SSM) to maximize the preservation of spatial locality in 3D data, a property critical for medical image analysis. We also introduce a novel Hilbert-Mamba Cross-Attention (HMCA) mechanism and a scale-aware decoder to capture fine-grained details. Meanwhile, the prompt enhancement module unifies segmentation masks and their corresponding textual attributes into an information-dense prompt to support VLM inference. Extensive experiments were conducted to validate the effectiveness of the Hilbert-VLM model. On the BraTS2021 segmentation benchmark, it achieves a Dice score of 82.35 percent, with a diagnostic classification accuracy (ACC) of 78.85 percent. 
These results demonstrate that the proposed model offers substantial potential to improve the accuracy and reliability of medical VLM-based analysis.
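
The locality property that motivates Hilbert scanning can be illustrated with a minimal sketch. The code below uses the textbook 2D index-to-coordinate conversion, whereas the abstract applies Hilbert curves to 3D data inside the Mamba scan; the function name `d2xy` and the 2D simplification are assumptions made here for illustration and are not taken from the paper.

```python
def d2xy(n, d):
    """Map position d along a Hilbert curve to (x, y) on an n x n grid.

    n must be a power of two. This is the classic 2D construction,
    used only to illustrate locality, not the paper's 3D scan.
    """
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        # Rotate or flip the current quadrant so sub-curves connect end to end.
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


# Consecutive curve positions are always grid neighbours, unlike a raster
# scan, which jumps across the grid at the end of every row.
path = [d2xy(8, d) for d in range(64)]
```

Flattening a grid in this order keeps spatially adjacent cells close together in the resulting 1D sequence, which is the property a sequence model scanning the data can benefit from.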

[85] On Exact Editing of Flow-Based Diffusion Models

Zixiang Li,Yue Song,Jianing Peng,Ting Liu,Jun Huang,Xiaochao Qu,Luoqi Liu,Wei Wang,Yao Zhao,Yunchao Wei

Main category: cs.CV

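TL;DR: This paper proposes Conditioned Velocity Correction (CVC), a framework that improves flow-based diffusion image editing by decomposing the velocity field in latent space and applying a posterior-consistent update, which improves editing stability and fidelity.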

Details Motivation: Existing flow-based diffusion editing methods accumulate velocity errors along the latent trajectory, which causes semantic inconsistency and structural distortion. This paper aims to address that problem. Method: The CVC framework recasts flow-based editing as a distribution transformation problem driven by a known source prior. It introduces a dual-perspective velocity conversion mechanism that splits the evolution into a structure-preserving branch and a semantically guided branch, and it applies a posterior-consistent update to the conditional velocity field based on Empirical Bayes inference and Tweedie correction. Result: CVC substantially reduces trajectory drift and velocity error in latent space and yields more stable dynamics, with higher image fidelity, better semantic alignment, and more reliable editing across a range of tasks. Conclusion: With a mathematically grounded velocity correction, CVC improves the editing quality and stability of flow-based diffusion models without explicit inversion and offers a reliable path toward controllable generation. Abstract: Recent methods in flow-based diffusion editing have enabled direct transformations between source and target image distributions without explicit inversion. However, the latent trajectories in these methods often exhibit accumulated velocity errors, leading to semantic inconsistency and loss of structural fidelity. We propose Conditioned Velocity Correction (CVC), a principled framework that reformulates flow-based editing as a distribution transformation problem driven by a known source prior. CVC rethinks the role of velocity in inter-distribution transformation by introducing a dual-perspective velocity conversion mechanism. This mechanism explicitly decomposes the latent evolution into two components: a structure-preserving branch that remains consistent with the source trajectory, and a semantically-guided branch that drives a controlled deviation toward the target distribution. The conditional velocity field exhibits an absolute velocity error relative to the true underlying distribution trajectory, which inherently introduces potential instability and trajectory drift in the latent space. To address this quantifiable deviation and maintain fidelity to the true flow, we apply a posterior-consistent update to the resulting conditional velocity field. This update is derived from Empirical Bayes Inference and Tweedie correction, which ensures a mathematically grounded error compensation over time. Our method yields stable and interpretable latent dynamics, achieving faithful reconstruction alongside smooth local semantic conversion.
Comprehensive experiments demonstrate that CVC consistently achieves superior fidelity, better semantic alignment, and more reliable editing behavior across diverse tasks.
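
For context on the Tweedie correction named in the abstract, the standard form of Tweedie's formula for additive Gaussian noise is given below. The symbols $x_0$, $x_t$, $\sigma_t$ and $p_t$ are notation assumed here; the abstract does not say how CVC parameterizes the noise or how the posterior mean enters the velocity update, so this is background rather than the paper's derivation.

```latex
% Observation model (assumed notation): clean sample x_0, noisy latent x_t
x_t = x_0 + \sigma_t \,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)

% Tweedie's formula: the posterior mean of x_0 given x_t
% is x_t shifted along the score of the noisy marginal p_t
\mathbb{E}\left[x_0 \mid x_t\right] = x_t + \sigma_t^{2}\, \nabla_{x_t} \log p_t(x_t)
```

The formula is an Empirical Bayes result in the sense that the posterior mean is expressed through the marginal density of the noisy variable alone, without an explicit prior on $x_0$.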

[86] FitControler: Toward Fit-Aware Virtual Try-On

Lu Yang,Yicheng Liu,Yanan Li,Xiang Bai,Hao Lu

Main category: cs.CV

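TL;DR: This paper proposes FitControler, a learnable plug-in that can be integrated into modern virtual try-on (VTON) models to give fine-grained control over garment fit. It also builds Fit4Men, the first fit-focused VTON dataset, together with matching evaluation metrics.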

Details Motivation: Existing virtual try-on methods focus on rendering garment details but neglect garment fit, that is, how a garment sits on the body. Fit is a key factor in overall style, so ignoring it leaves the generated results stylistically uncoordinated. Method: FitControler includes a fit-aware layout generator, built on garment-agnostic representations, that produces body-garment layouts for different fits, and a multi-scale fit injector that feeds layout information into existing VTON models to produce layout-driven try-on results. The authors also build the Fit4Men dataset with 13,000 pairs and propose two fit consistency metrics. Result: Experiments show that FitControler works with a range of mainstream VTON models and achieves accurate fit control. The new metrics better measure how plausible the fit of a generated result is, and Fit4Men provides a resource for follow-up research. Conclusion: Explicitly modeling and controlling garment fit improves the realism and stylistic coordination of virtual try-on results and moves VTON from "what to wear" toward "how to wear it". Abstract: Realistic virtual try-on (VTON) concerns not only faithful rendering of garment details but also coordination of the style. Prior art typically pursues the former, but neglects a key factor that shapes the holistic style -- garment fit. Garment fit delineates how a garment aligns with the body of a wearer and is a fundamental element in fashion design. In this work, we introduce fit-aware VTON and present FitControler, a learnable plug-in that can seamlessly integrate into modern VTON models to enable customized fit control. To achieve this, we highlight two challenges: i) how to delineate layouts of different fits and ii) how to render the garment that matches the layout. FitControler first features a fit-aware layout generator to redraw the body-garment layout conditioned on a set of delicately processed garment-agnostic representations, and a multi-scale fit injector is then used to deliver layout cues to enable layout-driven VTON. In particular, we build a fit-aware VTON dataset termed Fit4Men, including 13,000 body-garment pairs of different fits, covering both tops and bottoms, and featuring varying camera distances and body poses. Two fit consistency metrics are also introduced to assess the fitness of generations. Extensive experiments show that FitControler can work with various VTON models and achieve accurate fit control. Code and data will be released.

[87] Structure-Guided Allocation of 2D Gaussians for Image Representation and Compression

Huanxiong Liang,Yunuo Chen,Yicheng Pan,Sixian Wang,Jincheng Dai,Guo Lu,Wenjun Zhang

Main category: cs.CV

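TL;DR: Proposes a structure-guided allocation method for 2D Gaussian Splatting (2DGS) that combines structure-guided initialization, adaptive bitwidth quantization, and geometry-consistent regularization. It substantially improves the rate-distortion performance of 2DGS at low bitrates while keeping millisecond-level decoding.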

Details Motivation: Existing 2DGS methods ignore image structure when allocating representation capacity and parameter precision, which leads to poor rate-distortion efficiency at low bitrates. Method: (1) Structure-guided initialization: 2D Gaussians are placed according to spatial structural priors of natural images. (2) Adaptive bitwidth quantization: small-scale Gaussians in complex regions receive higher precision. (3) Geometry-consistent regularization: Gaussian orientations are aligned with local gradient directions. Result: BD-rate is reduced by 43.44% on Kodak and 29.91% on DIV2K, while decoding stays above 1000 FPS. Conclusion: The method improves the representational power and rate-distortion performance of 2DGS while keeping decoding efficient, making it suitable for high-quality, low-latency image representation. Abstract: Recent advances in 2D Gaussian Splatting (2DGS) have demonstrated its potential as a compact image representation with millisecond-level decoding. However, existing 2DGS-based pipelines allocate representation capacity and parameter precision largely oblivious to image structure, limiting their rate-distortion (RD) efficiency at low bitrates. To address this, we propose a structure-guided allocation principle for 2DGS, which explicitly couples image structure with both representation capacity and quantization precision, while preserving native decoding speed. First, we introduce a structure-guided initialization that assigns 2D Gaussians according to spatial structural priors inherent in natural images, yielding a localized and semantically meaningful distribution. Second, during quantization-aware fine-tuning, we propose adaptive bitwidth quantization of covariance parameters, which grants higher precision to small-scale Gaussians in complex regions and lower precision elsewhere, enabling RD-aware optimization, thereby reducing redundancy without degrading edge quality. Third, we impose a geometry-consistent regularization that aligns Gaussian orientations with local gradient directions to better preserve structural details. Extensive experiments demonstrate that our approach substantially improves both the representational power and the RD performance of 2DGS while maintaining over 1000 FPS decoding. Compared with the baseline GSImage, we reduce BD-rate by 43.44% on Kodak and 29.91% on DIV2K.
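
The idea of spending more bits on small-scale Gaussians can be sketched with a plain uniform quantizer. Everything specific here is a hypothetical choice for illustration: the function names, the scale threshold, the 8-bit and 4-bit settings, and the rule that depends on scale alone. The abstract also conditions on region complexity and performs quantization-aware fine-tuning, neither of which is modeled.

```python
def quantize(value, bits, lo=0.0, hi=1.0):
    """Uniformly quantize `value` in [lo, hi] to 2**bits levels."""
    levels = (1 << bits) - 1
    clipped = min(max(value, lo), hi)
    index = round((clipped - lo) / (hi - lo) * levels)
    return lo + index * (hi - lo) / levels


def pick_bits(scale, small_scale=0.1, high_bits=8, low_bits=4):
    """Hypothetical rule: small Gaussians get the higher bitwidth."""
    return high_bits if scale < small_scale else low_bits


# A small Gaussian (scale 0.05) keeps its covariance parameter at 8 bits,
# a large one (scale 0.5) is stored at 4 bits.
fine = quantize(0.3, pick_bits(0.05))
coarse = quantize(0.3, pick_bits(0.5))
```

The worst-case rounding error of this quantizer is half a step, `(hi - lo) / (2 * (2**bits - 1))`, so dropping from 8 to 4 bits makes the error bound 17 times larger; that is the precision being traded against bitrate.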

[88] FUSE-RSVLM: Feature Fusion Vision-Language Model for Remote Sensing

Yunkai Dang,Donghao Wang,Jiacheng Yang,Yifan Jiang,Meiyi Zhu,Yuekun Yang,Cong Wang,Qi Fan,Wenbin Li,Yang Gao

Main category: cs.CV

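TL;DR: This paper proposes MF-RSVLM, a multi-feature fusion vision-language model for remote sensing image understanding. Multi-scale feature learning and a recurrent visual feature injection mechanism improve its ability to capture fine-grained structures and reduce visual forgetting, and it reaches state-of-the-art results on several remote sensing tasks.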

Details Motivation: Existing vision-language models have difficulty with remote sensing images: they struggle to extract fine-grained visual features and tend to suffer from visual forgetting during deep language processing, so a model tailored to the remote sensing domain is needed. Method: MF-RSVLM learns multi-scale visual representations that fuse global context with local detail, and it uses a recurrent visual feature injection mechanism to keep introducing visual information during language generation. Result: Experiments on several remote sensing benchmarks (classification, image captioning, and visual question answering) show that MF-RSVLM achieves state-of-the-art or competitive performance. Conclusion: MF-RSVLM improves vision-language understanding of remote sensing images, addresses fine-grained feature extraction and visual forgetting, and provides an efficient multimodal solution for the remote sensing domain. Abstract: Large vision-language models (VLMs) exhibit strong performance across various tasks. However, these VLMs encounter significant challenges when applied to the remote sensing domain due to the inherent differences between remote sensing images and natural images. Existing remote sensing VLMs often fail to extract fine-grained visual features and suffer from visual forgetting during deep language processing. To address this, we introduce MF-RSVLM, a Multi-Feature Fusion Remote Sensing Vision-Language Model that effectively extracts and fuses visual features for RS understanding. MF-RSVLM learns multi-scale visual representations and combines global context with local details, improving the capture of small and complex structures in RS scenes. A recurrent visual feature injection scheme ensures the language model remains grounded in visual evidence and reduces visual forgetting during generation. Extensive experiments on diverse RS benchmarks show that MF-RSVLM achieves state-of-the-art or highly competitive performance across remote sensing classification, image captioning, and VQA tasks. Our code is publicly available at https://github.com/Yunkaidang/RSVLM.

[89] RSAgent: Learning to Reason and Act for Text-Guided Segmentation via Multi-Turn Tool Invocations

Xingqi He,Yujie Zhang,Shuyong Gao,Wenjie Li,Lingyi Hong,Mingxi Chen,Kaixun Jiang,Jiyuan Fu,Wenqiang Zhang

Main category: cs.CV

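TL;DR: This paper proposes RSAgent, an agent framework built on a multimodal large language model that iteratively refines text-guided image segmentation through multi-turn tool invocations and visual feedback, substantially improving localization and mask accuracy.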

Details Motivation: Existing text-guided segmentation methods usually predict in a single forward pass and cannot correct an initial localization error, which limits segmentation performance. Method: RSAgent is an agent framework that interleaves multi-turn reasoning and action. It queries a segmentation toolbox, observes visual feedback, and uses the interaction history to iteratively refine its spatial hypothesis. The authors build a data pipeline for multi-turn reasoning trajectories and train in two stages: cold-start supervised fine-tuning followed by agentic reinforcement learning with fine-grained rewards. Result: RSAgent reaches a zero-shot 66.5% gIoU on the ReasonSeg test set, a 9% improvement over Seg-Zero-7B, and 81.5% cIoU on RefCOCOg, both state-of-the-art results. Conclusion: By introducing agent-style multi-turn interaction, RSAgent achieves more robust and precise text-guided segmentation across several benchmarks and shows the potential of the agentic paradigm for this task. Abstract: Text-guided object segmentation requires both cross-modal reasoning and pixel grounding abilities. Most recent methods treat text-guided segmentation as one-shot grounding, where the model predicts pixel prompts in a single forward pass to drive an external segmentor, which limits verification, refocusing and refinement when initial localization is wrong. To address this limitation, we propose RSAgent, an agentic Multimodal Large Language Model (MLLM) which interleaves reasoning and action for segmentation via multi-turn tool invocations. RSAgent queries a segmentation toolbox, observes visual feedback, and revises its spatial hypothesis using historical observations to re-localize targets and iteratively refine masks. We further build a data pipeline to synthesize multi-turn reasoning segmentation trajectories, and train RSAgent with a two-stage framework: cold-start supervised fine-tuning followed by agentic reinforcement learning with fine-grained, task-specific rewards. Extensive experiments show that RSAgent achieves a zero-shot performance of 66.5% gIoU on ReasonSeg test, improving over Seg-Zero-7B by 9%, and reaches 81.5% cIoU on RefCOCOg, demonstrating state-of-the-art performance on both in-domain and out-of-domain benchmarks.
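
The two metrics quoted above differ in how they aggregate over images. The sketch below assumes the convention commonly used for these benchmarks: gIoU is the mean of per-image IoU values, and cIoU is the cumulative intersection divided by the cumulative union. The abstract does not define them, so treat these definitions, the flat 0/1 mask format, and the handling of empty unions as assumptions.

```python
def intersection_and_union(pred, gt):
    """Count intersecting and united pixels of two flat 0/1 masks."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter, union


def giou_ciou(preds, gts):
    """Return (gIoU, cIoU) over a list of predicted and ground-truth masks."""
    per_image = []
    total_inter = 0
    total_union = 0
    for pred, gt in zip(preds, gts):
        inter, union = intersection_and_union(pred, gt)
        # Assumed convention: two empty masks count as a perfect match.
        per_image.append(inter / union if union else 1.0)
        total_inter += inter
        total_union += union
    giou = sum(per_image) / len(per_image)
    ciou = total_inter / total_union if total_union else 1.0
    return giou, ciou


preds = [[1, 1, 0, 0], [1, 0, 0, 0]]
gts = [[1, 0, 0, 0], [1, 0, 0, 0]]
giou, ciou = giou_ciou(preds, gts)  # per-image IoUs are 0.5 and 1.0
```

Because cIoU pools pixels before dividing, large objects dominate it, while gIoU weights every image equally. That difference matters when comparing numbers across the two benchmarks.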

[90] PipeFlow: Pipelined Processing and Motion-Aware Frame Selection for Long-Form Video Editing

Mustafa Munir,Md Mostafijur Rahman,Kartikeya Bhardwaj,Paul Whatmough,Radu Marculescu

Main category: cs.CV

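TL;DR: PipeFlow is a scalable long-form video editing method. By skipping low-motion frames, processing segments in parallel, and using neural interpolation, it makes editing time grow linearly with video length and substantially improves efficiency.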

Details Motivation: Long-form video editing is difficult because the computational cost grows exponentially as sequences get longer, especially for joint editing methods based on DDIM inversion. Method: PipeFlow introduces three components: motion analysis based on SSIM and optical flow to skip low-motion frames; pipelined scheduling that runs DDIM inversion and editing on video segments in parallel according to available GPU memory; and neural-network interpolation to smooth segment border frames and fill in the skipped frames. Result: PipeFlow's editing time grows linearly with video length, with a 9.6x speedup over TokenFlow and a 31.7x speedup over DMT. Conclusion: PipeFlow scales efficiently to very long, in principle unbounded, videos and avoids the per-frame computational overhead that accumulates in conventional methods. Abstract: Long-form video editing poses unique challenges due to the exponential increase in the computational cost from joint editing and Denoising Diffusion Implicit Models (DDIM) inversion across extended sequences. To address these limitations, we propose PipeFlow, a scalable, pipelined video editing method that introduces three key innovations: First, based on a motion analysis using Structural Similarity Index Measure (SSIM) and Optical Flow, we identify and propose to skip editing of frames with low motion. Second, we propose a pipelined task scheduling algorithm that splits a video into multiple segments and performs DDIM inversion and joint editing in parallel based on available GPU memory. Lastly, we leverage a neural network-based interpolation technique to smooth out the border frames between segments and interpolate the previously skipped frames. Our method uniquely scales to longer videos by dividing them into smaller segments, allowing PipeFlow's editing time to increase linearly with video length. In principle, this enables editing of infinitely long videos without the growing per-frame computational overhead encountered by other methods. PipeFlow achieves up to a 9.6X speedup compared to TokenFlow and a 31.7X speedup over Diffusion Motion Transfer (DMT).
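
The first step, skipping low-motion frames, can be sketched as follows. This is a simplified stand-in: it scores motion with a mean absolute pixel difference instead of the SSIM and optical flow analysis the abstract describes, and the function names, the flat-list frame format, and the threshold value are all hypothetical.

```python
def mean_abs_diff(a, b):
    """Mean absolute difference between two frames given as flat pixel lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def select_keyframes(frames, threshold=0.05):
    """Return indices of frames to edit; the rest are left for interpolation.

    A frame is kept when it differs enough from the last kept frame.
    The first and last frames are always kept so that every skipped
    frame lies between two edited ones.
    """
    if not frames:
        return []
    kept = [0]
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i], frames[kept[-1]]) >= threshold:
            kept.append(i)
    if kept[-1] != len(frames) - 1:
        kept.append(len(frames) - 1)
    return kept


frames = [[0.0] * 4, [0.01] * 4, [0.02] * 4, [0.5] * 4, [0.5] * 4, [0.51] * 4]
keyframes = select_keyframes(frames)  # only frame 3 shows a large change
```

Comparing against the last kept frame rather than the immediately preceding one prevents slow drift from going unnoticed, since many small changes eventually add up to a difference above the threshold.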

[91] Reinforced Diffusion: Learning to Push the Limits of Anisotropic Diffusion for Image Denoising

Xinran Qin,Yuhui Quan,Ruotao Xu,Hui Ji

Main category: cs.CV

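TL;DR: Proposes a trainable anisotropic diffusion framework based on reinforcement learning, in which deep Q-learning selects the diffusion actions. The resulting process adapts strongly to different image structures, outperforms traditional diffusion methods on denoising, and is comparable to deep CNN methods.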

Details Motivation: Traditional anisotropic diffusion methods use explicit diffusion operators that adapt poorly to complex image structures, which limits their performance. Method: The denoising process is modeled as a sequence of diffusion actions whose order is optimized by deep Q-learning, giving a trainable anisotropic diffusion framework based on reinforcement learning. Result: On three common noise types, the method denoises better than existing diffusion-based methods and is on par with representative deep CNN methods. Conclusion: By learning a diffusion policy, the method improves the adaptivity of anisotropic diffusion and offers an effective new paradigm for image denoising. Abstract: Image denoising is an important problem in low-level vision and serves as a critical module for many image recovery tasks. Anisotropic diffusion is a wide family of image denoising approaches with promising performance. However, traditional anisotropic diffusion approaches use explicit diffusion operators which are not well adapted to complex image structures. As a result, their performance is limited compared to recent learning-based approaches. In this work, we describe a trainable anisotropic diffusion framework based on reinforcement learning. By modeling the denoising process as a series of naive diffusion actions with order learned by deep Q-learning, we propose an effective diffusion-based image denoiser. The diffusion actions selected by deep Q-learning at different iterations indeed compose a stochastic anisotropic diffusion process with strong adaptivity to different image structures, which enjoys improvement over the traditional ones. The proposed denoiser is applied to removing three types of often-seen noise. The experiments show that it outperforms existing diffusion-based methods and competes with the representative deep CNN-based methods.
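
As background for the "explicit diffusion operators" that the abstract contrasts with its learned policy, here is a minimal sketch of one classic hand-designed operator, a Perona-Malik step on a 1D signal. It is not the paper's method: the learned action selection by deep Q-learning is not shown, and the edge-stopping function, `kappa`, and `dt` are illustrative assumptions.

```python
import math


def perona_malik_step(u, kappa=0.1, dt=0.2):
    """One explicit Perona-Malik update on a 1D signal.

    Differences much larger than kappa are treated as edges and barely
    diffuse; smaller differences are smoothed. Boundaries are reflecting.
    """
    n = len(u)
    out = []
    for i in range(n):
        left = u[i - 1] if i > 0 else u[i]
        right = u[i + 1] if i < n - 1 else u[i]
        d_left = left - u[i]
        d_right = right - u[i]
        # Edge-stopping conductance: close to 1 in flat areas, close to 0 at edges.
        g_left = math.exp(-((d_left / kappa) ** 2))
        g_right = math.exp(-((d_right / kappa) ** 2))
        out.append(u[i] + dt * (g_left * d_left + g_right * d_right))
    return out


noisy_step = [0.0, 0.02, -0.01, 0.01, 1.0, 0.98, 1.01, 1.0]
smoothed = noisy_step
for _ in range(20):
    smoothed = perona_malik_step(smoothed)
```

The conductance is fixed by hand here, which is exactly the limitation the abstract points to: one formula must serve every image structure, whereas a learned policy can pick a different action at each iteration.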

[92] Neighbor-aware Instance Refining with Noisy Labels for Cross-Modal Retrieval

Yizhi Liu,Ruitao Pu,Shilin Xu,Yingke Chen,Quan-Hui Liu,Yuan Sun

Main category: cs.CV

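TL;DR: Proposes NIRNL, a robust cross-modal learning framework that handles cross-modal retrieval with noisy labels through cross-modal margin preserving and neighbor-aware instance refining.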

Details Motivation: Existing robust cross-modal retrieval methods struggle to balance model performance, label calibration reliability, and data utilization at the same time, and annotated multimodal data often contains noise that hurts retrieval. Method: The paper proposes Cross-modal Margin Preserving (CMP) to sharpen the discrimination between sample pairs, and Neighbor-aware Instance Refining (NIR), which uses cross-modal neighborhood consensus to split the data into pure, hard, and noisy subsets and then applies an optimization strategy tailored to each subset. Result: Experiments on three benchmark datasets show that NIRNL remains highly robust under high noise rates and reaches state-of-the-art performance. Conclusion: NIRNL handles annotation noise effectively, improving robustness while making maximal use of the available data, and clearly outperforms existing methods. Abstract: In recent years, Cross-Modal Retrieval (CMR) has made significant progress in the field of multi-modal analysis. However, since it is time-consuming and labor-intensive to collect large-scale and well-annotated data, the annotation of multi-modal data inevitably contains some noise. This will degrade the retrieval performance of the model. To tackle the problem, numerous robust CMR methods have been developed, including robust learning paradigms, label calibration strategies, and instance selection mechanisms. Unfortunately, they often fail to simultaneously satisfy model performance ceilings, calibration reliability, and data utilization rate. To overcome the limitations, we propose a novel robust cross-modal learning framework, namely Neighbor-aware Instance Refining with Noisy Labels (NIRNL). Specifically, we first propose Cross-modal Margin Preserving (CMP) to adjust the relative distance between positive and negative pairs, thereby enhancing the discrimination between sample pairs. Then, we propose Neighbor-aware Instance Refining (NIR) to identify pure subset, hard subset, and noisy subset through cross-modal neighborhood consensus. Afterward, we construct different tailored optimization strategies for this fine-grained partitioning, thereby maximizing the utilization of all available data while mitigating error propagation. Extensive experiments on three benchmark datasets demonstrate that NIRNL achieves state-of-the-art performance, exhibiting remarkable robustness, especially under high noise rates.

[93] Pathology Context Recalibration Network for Ocular Disease Recognition

Zunjie Xiao,Xiaoqing Zhang,Risa Higashita,Jiang Liu

Main category: cs.CV

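TL;DR: This paper proposes PCRNet for automated ocular disease recognition. It combines pathology context and expert experience priors through a new Pathology Recalibration Module (PRM) and an Expert Prior Guidance Adapter (EPGA), adds an Integrated Loss (IL) to improve performance, and outperforms existing methods on three datasets.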

Details Motivation: Deep neural networks for ocular disease recognition ignore clinical pathology context and expert experience priors, which hurts both recognition performance and the interpretability of their decisions. Method: The PRM module captures the pathology context prior with a pixel-wise context compression operator and a pathology distribution concentration operator. The EPGA adapter mines the expert experience prior to strengthen the representation of key regions. Both are built into PCRNet, which is trained with an Integrated Loss (IL). Result: On three ocular disease datasets, PCRNet with IL clearly outperforms existing attention-based networks and advanced loss methods. Visualization analysis shows how PRM and EPGA affect the model's decision process. Conclusion: Combining pathology context with expert experience priors improves both ocular disease recognition performance and model interpretability, and PCRNet provides an efficient and reliable option for computer-aided clinical diagnosis. Abstract: Pathology context and expert experience play significant roles in clinical ocular disease diagnosis. Although deep neural networks (DNNs) have good ocular disease recognition results, they often ignore exploring the clinical pathology context and expert experience priors to improve ocular disease recognition performance and decision-making interpretability. To this end, we first develop a novel Pathology Recalibration Module (PRM) to leverage the potential of pathology context prior via the combination of the well-designed pixel-wise context compression operator and pathology distribution concentration operator; then this paper applies a novel Expert Prior Guidance Adapter (EPGA) to further highlight significant pixel-wise representation regions by fully mining the expert experience prior. By incorporating PRM and EPGA into the modern DNN, the PCRNet is constructed for automated ocular disease recognition. Additionally, we introduce an Integrated Loss (IL) to boost the ocular disease recognition performance of PCRNet by considering the effects of sample-wise loss distributions and training label frequencies. The extensive experiments on three ocular disease datasets demonstrate the superiority of PCRNet with IL over state-of-the-art attention-based networks and advanced loss methods. Further visualization analysis explains the inherent behavior of PRM and EPGA that affects the decision-making process of DNNs.

[94] Balanced Hierarchical Contrastive Learning with Decoupled Queries for Fine-grained Object Detection in Remote Sensing Images

Jingzhou Chen,Dexin Chen,Fengchao Xiong,Yuntao Qian,Liang Xiao

Main category: cs.CV

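TL;DR: Proposes a balanced hierarchical contrastive loss and a decoupled learning strategy to improve fine-grained object detection in remote sensing images.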

Details Motivation: Existing methods for hierarchical label structures overlook the imbalanced class distribution and the interference between the classification and localization tasks. Method: The paper introduces a hierarchical contrastive loss with learnable class prototypes and a gradient equalization mechanism, and decouples the object queries of the DETR framework into classification and localization sets. Result: Experiments on three fine-grained datasets with hierarchical annotations show that the method outperforms current state-of-the-art approaches. Conclusion: The proposed method mitigates class imbalance and improves fine-grained detection performance. Abstract: Fine-grained remote sensing datasets often use hierarchical label structures to differentiate objects in a coarse-to-fine manner, with each object annotated across multiple levels. However, embedding this semantic hierarchy into the representation learning space to improve fine-grained detection performance remains challenging. Previous studies have applied supervised contrastive learning at different hierarchical levels to group objects under the same parent class while distinguishing sibling subcategories. Nevertheless, they overlook two critical issues: (1) imbalanced data distribution across the label hierarchy causes high-frequency classes to dominate the learning process, and (2) learning semantic relationships among categories interferes with class-agnostic localization. To address these issues, we propose a balanced hierarchical contrastive loss combined with a decoupled learning strategy within the detection transformer (DETR) framework. The proposed loss introduces learnable class prototypes and equilibrates gradients contributed by different classes at each hierarchical level, ensuring that each hierarchical class contributes equally to the loss computation in every mini-batch. The decoupled strategy separates DETR's object queries into classification and localization sets, enabling task-specific feature extraction and optimization. Experiments on three fine-grained datasets with hierarchical annotations demonstrate that our method outperforms state-of-the-art approaches.
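
The balancing goal stated in the abstract, that each class contributes equally to the loss in every mini-batch, can be illustrated with per-class averaging. This sketch shows only that aggregation step under an assumed setup of one scalar loss per sample; the paper's contrastive formulation, learnable prototypes, and per-level hierarchy are not modeled, and the function name is hypothetical.

```python
from collections import defaultdict


def class_balanced_mean(losses, labels):
    """Average losses within each class first, then across classes.

    Every class present in the batch carries the same weight,
    regardless of how many samples it has.
    """
    per_class = defaultdict(list)
    for loss, label in zip(losses, labels):
        per_class[label].append(loss)
    class_means = [sum(values) / len(values) for values in per_class.values()]
    return sum(class_means) / len(class_means)


losses = [1.0, 1.0, 1.0, 1.0, 5.0]
labels = ["ship", "ship", "ship", "ship", "tanker"]
plain = sum(losses) / len(losses)               # dominated by the frequent class
balanced = class_balanced_mean(losses, labels)  # rare class counts equally
```

With a plain mean the single rare-class sample contributes one fifth of the total, so its gradient is swamped by the frequent class; after per-class averaging it contributes one half.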

[95] RainFusion2.0: Temporal-Spatial Awareness and Hardware-Efficient Block-wise Sparse Attention

Aiyue Chen,Yaofu Liu,Junjian Huang,Guang Lian,Yiwu Yao,Wangli Lan,Jing Lin,Zhixin Ma,Tingting Zhou,Harry Yang

Main category: cs.CV

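TL;DR: RainFusion2.0 is an online adaptive, hardware-efficient sparse attention mechanism. Using block-mean prediction, spatiotemporal-aware permutation, and a first-frame sink mechanism, it delivers a 1.5~1.8x end-to-end speedup while preserving generation quality, and it supports multiple hardware platforms.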

Details Motivation: DiT models are limited by the high computational cost of attention, and existing sparse attention methods suffer from large prediction overhead and poor hardware generality. Method: Block-wise means serve as representative tokens for sparse mask prediction, a spatiotemporal-aware token permutation strategy is applied, and a first-frame sink mechanism is introduced for video generation. Result: The method reaches 80% sparsity with a 1.5~1.8x end-to-end speedup and no loss of video quality, and it applies to a range of generative models and hardware platforms. Conclusion: RainFusion2.0 reduces the computational cost of DiT models with low prediction overhead and good hardware generality, making it practical to deploy. Abstract: In video and image generation tasks, Diffusion Transformer (DiT) models incur extremely high computational costs due to attention mechanisms, which limits their practical applications. Furthermore, with hardware advancements, a wide range of devices besides graphics processing units (GPUs), such as application-specific integrated circuits (ASICs), have been increasingly adopted for model inference. Sparse attention, which leverages the inherent sparsity of attention by skipping computations for insignificant tokens, is an effective approach to mitigate computational costs. However, existing sparse attention methods have two critical limitations: the overhead of sparse pattern prediction and the lack of hardware generality, as most of these methods are designed for GPUs. To address these challenges, this study proposes RainFusion2.0, which aims to develop an online adaptive, hardware-efficient, and low-overhead sparse attention mechanism to accelerate both video and image generative models, with robust performance across diverse hardware platforms. Key technical insights include: (1) leveraging block-wise mean values as representative tokens for sparse mask prediction; (2) implementing spatiotemporal-aware token permutation; and (3) introducing a first-frame sink mechanism specifically designed for video generation scenarios. Experimental results demonstrate that RainFusion2.0 can reach 80% sparsity while achieving an end-to-end speedup of 1.5~1.8x without compromising video quality. Moreover, RainFusion2.0 demonstrates effectiveness across various generative models and validates its generalization across diverse hardware platforms.
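
Insight (1), using block-wise means as representative tokens, can be sketched in a few lines. The code assumes a simple selection rule: score each pair of query and key blocks by the dot product of their means and keep the top-scoring key blocks per query block. The function names, the `keep` parameter, and the top-k rule are hypothetical; the abstract does not specify how scores become a mask, and the permutation and first-frame sink steps are omitted.

```python
def block_means(tokens, block):
    """Average each consecutive group of `block` token vectors."""
    means = []
    for start in range(0, len(tokens), block):
        chunk = tokens[start:start + block]
        dim = len(chunk[0])
        means.append([sum(t[d] for t in chunk) / len(chunk) for d in range(dim)])
    return means


def block_sparse_mask(queries, keys, block, keep):
    """Return mask[i][j] = True if query block i should attend to key block j.

    Scores come from block means only, so the prediction costs
    (N / block)**2 dot products instead of N**2.
    """
    q_means = block_means(queries, block)
    k_means = block_means(keys, block)
    mask = []
    for q in q_means:
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_means]
        ranked = sorted(range(len(k_means)), key=lambda j: scores[j], reverse=True)
        top = set(ranked[:keep])
        mask.append([j in top for j in range(len(k_means))])
    return mask


tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
mask = block_sparse_mask(tokens, tokens, block=2, keep=1)
```

In this setup, keeping one key block in five per query block corresponds to the 80% sparsity quoted in the abstract. Working at block granularity also means the attention kernel can skip whole tiles, which is generally easier to map onto different accelerators than per-token sparsity.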

[96] Factorized Learning for Temporally Grounded Video-Language Models

Wenzheng Zeng,Difei Gao,Mike Zheng Shou,Hwee Tou Ng

Main category: cs.CV

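TL;DR: This paper proposes D²VLM, a video-language model framework that decouples temporal grounding from textual response and introduces evidence tokens and a factorized preference optimization (FPO) algorithm, improving event-level video understanding.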

Details Motivation: Existing video-language models usually handle temporal grounding and textual response in a coupled way without a clear logical structure, which leads to sub-optimal learning objectives and makes accurate event-level perception difficult. Method: The D²VLM framework follows a "grounding then answering" paradigm and introduces evidence tokens to capture event-level semantics. A factorized preference optimization (FPO) algorithm builds probabilistic temporal grounding modeling into the optimization objective, so the two tasks are learned separately but cooperatively. The authors also construct a synthetic dataset to support training. Result: Experiments show that the method clearly outperforms existing methods on several video understanding tasks, particularly in temporal grounding and evidence-based answering. Conclusion: With factorized, decoupled learning and the FPO strategy, D²VLM strengthens the temporal reasoning and semantic response abilities of video-language models and offers a new direction for event-level video understanding. Abstract: Recent video-language models have shown great potential for video understanding, but still struggle with accurate temporal grounding for event-level perception. We observe that two main factors in video understanding (i.e., temporal grounding and textual response) form a logical hierarchy: accurate temporal evidence grounding lays the foundation for reliable textual response. However, existing works typically handle these two tasks in a coupled manner without a clear logical structure, leading to sub-optimal objectives. We address this from a factorized learning perspective. We first propose D$^2$VLM, a framework that decouples the learning of these two tasks while also emphasizing their inherent dependency. We adopt a "grounding then answering with evidence referencing" paradigm and introduce evidence tokens for evidence grounding, which emphasize event-level visual semantic capture beyond the focus on timestamp representation in existing works. To further facilitate the learning of these two tasks, we introduce a novel factorized preference optimization (FPO) algorithm. Unlike standard preference optimization, FPO explicitly incorporates probabilistic temporal grounding modeling into the optimization objective, enabling preference learning for both temporal grounding and textual response. We also construct a synthetic dataset to address the lack of suitable datasets for factorized preference learning with explicit temporal grounding. Experiments on various tasks demonstrate the clear advantage of our approach. Our source code is available at https://github.com/nusnlp/d2vlm.

[97] Think Before You Move: Latent Motion Reasoning for Text-to-Motion Generation

Yijie Qian,Juncheng Wang,Yuxiang Feng,Chao Xu,Wang Lu,Yang Liu,Baigui Sun,Yiqiang Chen,Yong Liu,Shujun Wang

Main category: cs.CV

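TL;DR: This paper proposes Latent Motion Reasoning (LMR), a text-to-motion generation framework that uses a two-stage "think then act" mechanism to address the semantic-kinematic impedance mismatch between linguistic semantics and kinematic data.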

Details Motivation: Existing methods treat text-to-motion generation as a direct mapping problem and run into a fundamental mismatch between semantics and high-frequency motion data when the semantics are complex. The authors argue for a hierarchical generation architecture that is closer to how cognition works. Method: Inspired by hierarchical motor control, the Latent Motion Reasoning (LMR) framework includes a dual-granularity tokenizer that splits motion into a reasoning latent for global planning and an execution latent for restoring detail. Generation is autoregressive in two stages: coarse trajectory planning first (Think), then the concrete frames (Act). Result: LMR is implemented on two baselines, T2M-GPT and MotionStreamer, and experiments show clear gains in both semantic alignment and physical plausibility. Conclusion: The best planning space for motion generation is not natural language itself but a learned, motion-aligned intermediate concept space, which supports the value of System 2 reasoning in generative models. Abstract: Current state-of-the-art paradigms predominantly treat Text-to-Motion (T2M) generation as a direct translation problem, mapping symbolic language directly to continuous poses. While effective for simple actions, this System 1 approach faces a fundamental theoretical bottleneck we identify as the Semantic-Kinematic Impedance Mismatch: the inherent difficulty of grounding semantically dense, discrete linguistic intent into kinematically dense, high-frequency motion data in a single shot. In this paper, we argue that the solution lies in an architectural shift towards Latent System 2 Reasoning. Drawing inspiration from Hierarchical Motor Control in cognitive science, we propose Latent Motion Reasoning (LMR) that reformulates generation as a two-stage Think-then-Act decision process. Central to LMR is a novel Dual-Granularity Tokenizer that disentangles motion into two distinct manifolds: a compressed, semantically rich Reasoning Latent for planning global topology, and a high-frequency Execution Latent for preserving physical fidelity. By forcing the model to autoregressively reason (plan the coarse trajectory) before it moves (instantiates the frames), we effectively bridge the ineffability gap between language and physics. We demonstrate LMR's versatility by implementing it for two representative baselines: T2M-GPT (discrete) and MotionStreamer (continuous).
Extensive experiments show that LMR yields non-trivial improvements in both semantic alignment and physical plausibility, validating that the optimal substrate for motion planning is not natural language, but a learned, motion-aligned concept space. Code and demos can be found at https://chenhaoqcdyq.github.io/LMR/

[98] Guided Diffusion-based Generation of Adversarial Objects for Real-World Monocular Depth Estimation Attacks

Yongtao Chen,Yanbo Wang,Wentao Zhao,Guole Shen,Tianchen Deng,Jingchuan Wang

Main category: cs.CV

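TL;DR: Proposes a training-free generative adversarial attack framework that uses a diffusion model to generate natural, scene-consistent adversarial objects, attacking monocular depth estimation systems in a more effective, stealthy, and deployable way.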

Details Motivation: Existing physical attacks based on texture patches face placement constraints and limited realism in autonomous driving scenes, which limits their effectiveness. More natural and practical attacks are needed to assess system safety. Method: The training-free attack framework includes a salient region selection module and a Jacobian vector product guidance mechanism, and it synthesizes physically plausible adversarial objects through a diffusion-based conditional generation process. Result: In digital and physical experiments, the method clearly outperforms existing attacks in effectiveness, stealthiness, and physical deployability. Conclusion: The method generates highly realistic adversarial objects that disrupt monocular depth estimation, which is significant for the safety assessment of autonomous driving systems. Abstract: Monocular Depth Estimation (MDE) serves as a core perception module in autonomous driving systems, but it remains highly susceptible to adversarial attacks. Errors in depth estimation may propagate through downstream decision making and influence overall traffic safety. Existing physical attacks primarily rely on texture-based patches, which impose strict placement constraints and exhibit limited realism, thereby reducing their effectiveness in complex driving environments. To overcome these limitations, this work introduces a training-free generative adversarial attack framework that generates naturalistic, scene-consistent adversarial objects via a diffusion-based conditional generation process. The framework incorporates a Salient Region Selection module that identifies regions most influential to MDE and a Jacobian Vector Product Guidance mechanism that steers adversarial gradients toward update directions supported by the pre-trained diffusion model. This formulation enables the generation of physically plausible adversarial objects capable of inducing substantial adversarial depth shifts. Extensive digital and physical experiments demonstrate that our method significantly outperforms existing attacks in effectiveness, stealthiness, and physical deployability, underscoring its strong practical implications for autonomous driving safety assessment.

[99] GeoBench: Rethinking Multimodal Geometric Problem-Solving via Hierarchical Evaluation

Yuan Feng,Yue Yang,Xiaohan He,Jiatong Zhao,Jianlong Chen,Zijun Chen,Daocheng Fu,Qi Liu,Renqiu Xia,Bo Zhang,Junchi Yan

Main category: cs.CV

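TL;DR: This paper presents GeoBench, a hierarchical benchmark for evaluating the reasoning ability of vision-language models in geometric problem solving. It exposes the performance bottlenecks of current models on complex tasks and the key factors behind them.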

Details Motivation: Existing geometric reasoning evaluations suffer from data contamination, an overemphasis on final answers at the expense of the reasoning process, and insufficient diagnostic granularity, so a more systematic and reliable evaluation framework is needed. Method: GeoBench covers four reasoning levels (visual perception, goal-oriented planning, rigorous theorem application, and self-reflective backtracking) and evaluates them systematically through six formally verified tasks generated with TrustGeoGen. Result: Reasoning models such as OpenAI-o3 outperform general MLLMs, but performance drops sharply as task complexity increases. Sub-goal decomposition and irrelevant premise filtering are critical to accuracy, while Chain-of-Thought prompting actually lowers performance on some tasks. Conclusion: GeoBench is a comprehensive and actionable benchmark for geometric reasoning and gives clear guidance for building systems capable of deep geometric reasoning. Abstract: Geometric problem solving constitutes a critical branch of mathematical reasoning, requiring precise analysis of shapes and spatial relationships. Current evaluations of geometric reasoning in vision-language models (VLMs) face limitations, including the risk of test data contamination from textbook-based benchmarks, overemphasis on final answers over reasoning processes, and insufficient diagnostic granularity. To address these issues, we present GeoBench, a hierarchical benchmark featuring four reasoning levels in geometric problem-solving: Visual Perception, Goal-Oriented Planning, Rigorous Theorem Application, and Self-Reflective Backtracking. Through six formally verified tasks generated via TrustGeoGen, we systematically assess capabilities ranging from attribute extraction to logical error correction. Experiments reveal that while reasoning models like OpenAI-o3 outperform general MLLMs, performance declines significantly with increasing task complexity. Key findings demonstrate that sub-goal decomposition and irrelevant premise filtering critically influence final problem-solving accuracy, whereas Chain-of-Thought prompting unexpectedly degrades performance in some tasks. These findings establish GeoBench as a comprehensive benchmark while offering actionable guidelines for developing geometric problem-solving systems.

[100] Enhancing LLM-Based Neural Network Generation: Few-Shot Prompting and Efficient Validation for Automated Architecture Design

Chandini Vysyaraju,Raghuvir Duvvuri,Avi Goyal,Dmitry Ignatov,Radu Timofte

Main category: cs.CV

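TL;DR: This paper proposes and validates two techniques for LLM-based neural network architecture generation in computer vision: Few-Shot Architecture Prompting (FSAP) and Whitespace-Normalized Hash Validation. They improve generation efficiency and diversity and are validated in large-scale experiments on several vision benchmarks.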

Details Motivation: Neural Architecture Search (NAS) is computationally expensive. Large language models (LLMs) offer an alternative, but their use in computer vision has not been studied systematically, especially with respect to prompt engineering and deduplication. Method: Building on the NNGPT/LEMUR framework, the paper proposes Few-Shot Architecture Prompting (FSAP) and systematically studies how the number of examples affects generation. It also introduces Whitespace-Normalized Hash Validation for fast deduplication in place of slower AST parsing. Result: Using n = 3 examples gives the best balance between diversity and focus. The hash method is 100x faster than AST parsing (under 1 ms) and prevents redundant training. The experiments generate 1,900 unique architectures across 7 vision benchmarks, and a dataset-balanced evaluation method is proposed to support cross-task comparison. Conclusion: The work provides practical guidelines and a rigorous evaluation approach for LLM-driven architecture design in computer vision, and it lowers the compute required so that more researchers can take part in automated model design. Abstract: Automated neural network architecture design remains a significant challenge in computer vision. Task diversity and computational constraints require both effective architectures and efficient search methods. Large Language Models (LLMs) present a promising alternative to computationally intensive Neural Architecture Search (NAS), but their application to architecture generation in computer vision has not been systematically studied, particularly regarding prompt engineering and validation strategies. Building on the task-agnostic NNGPT/LEMUR framework, this work introduces and validates two key contributions for computer vision. First, we present Few-Shot Architecture Prompting (FSAP), the first systematic study of the number of supporting examples (n = 1, 2, 3, 4, 5, 6) for LLM-based architecture generation. We find that using n = 3 examples best balances architectural diversity and context focus for vision tasks. Second, we introduce Whitespace-Normalized Hash Validation, a lightweight deduplication method (less than 1 ms) that provides a 100x speedup over AST parsing and prevents redundant training of duplicate computer vision architectures. In large-scale experiments across seven computer vision benchmarks (MNIST, CIFAR-10, CIFAR-100, CelebA, ImageNette, SVHN, Places365), we generated 1,900 unique architectures. We also introduce a dataset-balanced evaluation methodology to address the challenge of comparing architectures across heterogeneous vision tasks.
These contributions provide actionable guidelines for LLM-based architecture search in computer vision and establish rigorous evaluation practices, making automated design more accessible to researchers with limited computational resources.
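
The deduplication idea in this abstract can be illustrated with a short sketch. The abstract does not spell out the normalization rule, so the version below assumes it means collapsing every run of whitespace to a single space before hashing; `normalized_hash` and `DuplicateFilter` are hypothetical names, not part of NNGPT/LEMUR.

```python
import hashlib
import re


def normalized_hash(source: str) -> str:
    """Hash generated architecture code after collapsing whitespace runs."""
    # Assumption: normalization = replace each whitespace run with one space
    # and trim the ends. The paper's exact rule may differ.
    normalized = re.sub(r"\s+", " ", source).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class DuplicateFilter:
    """Remembers hashes of architectures that were already accepted."""

    def __init__(self):
        self._seen = set()

    def is_new(self, source: str) -> bool:
        """Return True the first time a (normalized) source is seen."""
        digest = normalized_hash(source)
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True


if __name__ == "__main__":
    spaces = "class Net:\n    def forward(self, x):\n        return x\n"
    tabs = "class Net:\n\tdef forward(self, x):\n\n\t\treturn x"
    dedup = DuplicateFilter()
    print(dedup.is_new(spaces))  # first occurrence is accepted
    print(dedup.is_new(tabs))    # same code with different whitespace is rejected
```

A hash over normalized text only catches duplicates that differ in layout. Two architectures that differ in identifier names or statement order would still count as distinct, which is the trade-off for avoiding a full parse.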

[101] Taming Preference Mode Collapse via Directional Decoupling Alignment in Diffusion Reinforcement Learning

Chubin Chen,Sujie Hu,Jiashu Zhu,Meiqi Wu,Jintao Chen,Yanxun Li,Nisha Huang,Chengyu Fang,Jiahong Wu,Xiangxiang Chu,Xiu Li

Main category: cs.CV

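TL;DR: This paper proposes D²-Align, a new alignment method that mitigates Preference Mode Collapse (PMC) in text-to-image diffusion models by introducing directional decoupling into the reward signal to preserve generative diversity.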

Details Motivation: Existing reinforcement learning from human feedback methods score well on automated reward metrics but tend to cause Preference Mode Collapse (PMC), in which outputs converge and diversity drops sharply. An alignment method that preserves both quality and diversity is needed. Method: Directional Decoupling Alignment (D²-Align) first learns a directional correction in the embedding space of a frozen reward model, then applies that correction to the reward signal during optimization so that the model does not collapse into specific modes. The authors also build the DivGenBench benchmark to quantify PMC. Result: Experiments show that D²-Align substantially improves generative diversity while maintaining high image quality, and outperforms existing methods on both automated metrics and human preference evaluation. Conclusion: D²-Align effectively mitigates preference mode collapse and achieves more stable, more diverse alignment of text-to-image models, suggesting a new direction for future alignment techniques. Abstract: Recent studies have demonstrated significant progress in aligning text-to-image diffusion models with human preference via Reinforcement Learning from Human Feedback. However, while existing methods achieve high scores on automated reward metrics, they often lead to Preference Mode Collapse (PMC), a specific form of reward hacking where models converge on narrow, high-scoring outputs (e.g., images with monolithic styles or pervasive overexposure), severely degrading generative diversity. In this work, we introduce and quantify this phenomenon, proposing DivGenBench, a novel benchmark designed to measure the extent of PMC. We posit that this collapse is driven by over-optimization along the reward model's inherent biases. Building on this analysis, we propose Directional Decoupling Alignment (D$^2$-Align), a novel framework that mitigates PMC by directionally correcting the reward signal. Specifically, our method first learns a directional correction within the reward model's embedding space while keeping the model frozen. This correction is then applied to the reward signal during the optimization process, preventing the model from collapsing into specific modes and thereby maintaining diversity. Our comprehensive evaluation, combining qualitative analysis with quantitative metrics for both quality and diversity, reveals that D$^2$-Align achieves superior alignment with human preference.

[102] Towards Open-Vocabulary Industrial Defect Understanding with a Large-Scale Multimodal Dataset

TsaiChing Ni,ZhenQi Chen,YuanFu Yang

Main category: cs.CV

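TL;DR: IMDD-1M is the first large-scale industrial multimodal defect dataset, with 1 million image-text pairs covering more than 60 materials and over 400 defect types, each with fine-grained textual descriptions, supporting classification, segmentation, retrieval, and other tasks. A diffusion-based vision-language foundation model trained on it can be fine-tuned with little data for efficient industrial inspection, showing strong scalability and domain adaptability.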

Details Motivation: Existing industrial defect detection datasets lack fine-grained semantic information and multimodal alignment, which makes it hard to apply advanced multimodal learning methods in manufacturing. A large-scale, high-quality, image-text-aligned industrial defect dataset is therefore needed. Method: The authors build IMDD-1M, a large-scale multimodal dataset of 1 million high-resolution real defect images paired with expert-annotated text, and train a diffusion-based vision-language foundation model from scratch that adapts quickly to specific industrial scenarios through lightweight fine-tuning. Result: The model reaches performance comparable to dedicated expert models while using less than 5% of the task-specific data, supporting the efficiency and generalization of an IMDD-1M-based foundation model for industrial defect understanding and generation. Conclusion: IMDD-1M provides an important basis for multimodal learning in industrial quality inspection, and the proposed foundation-model paradigm substantially reduces data dependence, pointing toward scalable, domain-adaptive manufacturing intelligence. Abstract: We present IMDD-1M, the first large-scale Industrial Multimodal Defect Dataset comprising 1,000,000 aligned image-text pairs, designed to advance multimodal learning for manufacturing and quality inspection. IMDD-1M contains high-resolution real-world defects spanning over 60 material categories and more than 400 defect types, each accompanied by expert-verified annotations and fine-grained textual descriptions detailing defect location, severity, and contextual attributes. This dataset enables a wide spectrum of applications, including classification, segmentation, retrieval, captioning, and generative modeling. Building upon IMDD-1M, we train a diffusion-based vision-language foundation model from scratch, specifically tailored for industrial scenarios. The model serves as a generalizable foundation that can be efficiently adapted to specialized domains through lightweight fine-tuning. With less than 5% of the task-specific data required by dedicated expert models, it achieves comparable performance, highlighting the potential of data-efficient foundation model adaptation for industrial inspection and generation, paving the way for scalable, domain-adaptive, and knowledge-grounded manufacturing intelligence.

[103] Bayesian Self-Distillation for Image Classification

Anton Adelöw,Matteo Gamba,Atsuto Maki

Main category: cs.CV

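TL;DR: Proposes Bayesian Self-Distillation (BSD), a self-distillation method based on Bayesian inference that does not rely on hard targets after initialization and improves accuracy, calibration, and robustness.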

Details Motivation: Existing self-distillation methods still depend on hard targets, which makes models overconfident and limits calibration, generalization, and robustness. Method: BSD uses Bayesian inference over the model's own predictions to construct sample-specific target distributions, enabling self-distillation without hard targets. Result: Across several architectures and datasets, BSD improves test accuracy (e.g. +1.4% for ResNet-50 on CIFAR-100), reduces ECE by 40% in the same setting, and increases robustness to data corruptions, perturbations, and label noise. Combined with a contrastive loss, it achieves the best label-noise robustness among single-stage, single-network methods. Conclusion: BSD is an effective hard-target-free self-distillation method that improves both model performance and reliability. Abstract: Supervised training of deep neural networks for classification typically relies on hard targets, which promote overconfidence and can limit calibration, generalization, and robustness. Self-distillation methods aim to mitigate this by leveraging inter-class and sample-specific information present in the model's own predictions, but often remain dependent on hard targets, reducing their effectiveness. With this in mind, we propose Bayesian Self-Distillation (BSD), a principled method for constructing sample-specific target distributions via Bayesian inference using the model's own predictions. Unlike existing approaches, BSD does not rely on hard targets after initialization. BSD consistently yields higher test accuracy (e.g. +1.4% for ResNet-50 on CIFAR-100) and significantly lower Expected Calibration Error (ECE) (-40% ResNet-50, CIFAR-100) than existing architecture-preserving self-distillation methods for a range of deep architectures and datasets. Additional benefits include improved robustness against data corruptions, perturbations, and label noise. When combined with a contrastive loss, BSD achieves state-of-the-art robustness under label noise for single-stage, single-network methods.

[104] DiffThinker: Towards Generative Multimodal Reasoning with Diffusion Models

Zefeng He,Xiaoye Qu,Yafu Li,Tong Zhu,Siyuan Huang,Yu Cheng

Main category: cs.CV

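TL;DR: This paper proposes DiffThinker, a new generative multimodal reasoning paradigm that recasts multimodal reasoning as an image-to-image generation task and significantly outperforms existing MLLMs on complex vision-centric tasks.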

Details Motivation: The reasoning process of existing multimodal large models is text-centric, which leads to poor performance on complex, long-horizon visual tasks. Method: DiffThinker is a diffusion-based generative image-to-image reasoning framework that turns multimodal reasoning into a native visual generation task. Result: Experiments in four domains (sequential planning, combinatorial optimization, constraint satisfaction, spatial configuration) show that DiffThinker significantly outperforms GPT-5 (+314.2%), Gemini-3-Flash (+111.6%), and a fine-tuned Qwen3-VL-32B baseline (+39.0%). Conclusion: Generative multimodal reasoning is a promising new reasoning paradigm for vision-centric tasks, with four core properties: efficiency, controllability, native parallelism, and collaboration. Abstract: While recent Multimodal Large Language Models (MLLMs) have made significant strides in multimodal reasoning, their reasoning processes remain predominantly text-centric, leading to suboptimal performance in complex long-horizon, vision-centric tasks. In this paper, we establish a novel Generative Multimodal Reasoning paradigm and introduce DiffThinker, a diffusion-based reasoning framework. Conceptually, DiffThinker reformulates multimodal reasoning as a native generative image-to-image task, achieving superior logical consistency and spatial precision in vision-centric tasks. We perform a systematic comparison between DiffThinker and MLLMs, providing the first in-depth investigation into the intrinsic characteristics of this paradigm, revealing four core properties: efficiency, controllability, native parallelism, and collaboration. Extensive experiments across four domains (sequential planning, combinatorial optimization, constraint satisfaction, and spatial configuration) demonstrate that DiffThinker significantly outperforms leading closed-source models including GPT-5 (+314.2%) and Gemini-3-Flash (+111.6%), as well as the fine-tuned Qwen3-VL-32B baseline (+39.0%), highlighting generative multimodal reasoning as a promising approach for vision-centric reasoning.

[105] Deep Global Clustering for Hyperspectral Image Segmentation: Concepts, Applications, and Open Challenges

Yu-Tang Chang,Pin-Wei Chen,Shih-Fang Chen

Main category: cs.CV

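TL;DR: Proposes Deep Global Clustering (DGC), a framework for memory-efficient hyperspectral image segmentation that learns global clustering structure without pre-training, but suffers from optimization instability in multi-objective loss balancing.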

Details Motivation: Existing foundation models transfer poorly to domain-specific applications such as close-range agricultural monitoring, and the volume of hyperspectral data creates compute and memory bottlenecks. A memory-efficient segmentation method that needs no pre-training is required. Method: DGC learns global clustering structure by processing small local patches with overlapping regions, using the overlaps to enforce consistency, which keeps memory use low and training fast. Result: On a leaf disease dataset, DGC achieves high-quality background-tissue separation (mean IoU 0.925) and demonstrates unsupervised disease detection, but optimization becomes unstable because clusters over-merge in feature space. Conclusion: The DGC framework is conceptually promising: it trains quickly on resource-constrained hardware and produces meaningful semantic representations, but its practical use depends on further work on dynamic loss balancing. Abstract: Hyperspectral imaging (HSI) analysis faces computational bottlenecks due to massive data volumes that exceed available memory. While foundation models pre-trained on large remote sensing datasets show promise, their learned representations often fail to transfer to domain-specific applications like close-range agricultural monitoring where spectral signatures, spatial scales, and semantic targets differ fundamentally. This report presents Deep Global Clustering (DGC), a conceptual framework for memory-efficient HSI segmentation that learns global clustering structure from local patch observations without pre-training. DGC operates on small patches with overlapping regions to enforce consistency, enabling training in under 30 minutes on consumer hardware while maintaining constant memory usage. On a leaf disease dataset, DGC achieves background-tissue separation (mean IoU 0.925) and demonstrates unsupervised disease detection through navigable semantic granularity. However, the framework suffers from optimization instability rooted in multi-objective loss balancing: meaningful representations emerge rapidly but degrade due to cluster over-merging in feature space. We position this work as intellectual scaffolding - the design philosophy has merit, but stable implementation requires principled approaches to dynamic loss balancing. Code and data are available at https://github.com/b05611038/HSI_global_clustering.
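
For readers unfamiliar with the mean IoU figure quoted above, the following is a minimal sketch of how the metric is commonly computed on flattened label maps. It assumes cluster ids have already been matched to ground-truth classes; `mean_iou` is an illustrative helper and is not taken from the DGC repository.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes that occur in pred or target.

    pred and target are equal-length sequences of integer class ids,
    for example a segmentation mask flattened to one dimension.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for c in range(num_classes):
        # Pixels where both agree on class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where either one says class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both masks
            ious.append(inter / union)
    if not ious:
        raise ValueError("no class from range(num_classes) is present")
    return sum(ious) / len(ious)


if __name__ == "__main__":
    # Two classes: 0 = background, 1 = tissue.
    prediction = [0, 0, 1, 1]
    ground_truth = [0, 1, 1, 1]
    print(mean_iou(prediction, ground_truth, num_classes=2))
```

In the toy call, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.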

[106] Guiding a Diffusion Transformer with the Internal Dynamics of Itself

Xingyu Zhou,Qifan Li,Xiaobin Hu,Hai Chen,Shuhang Gu

Main category: cs.CV

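TL;DR: This paper proposes Internal Guidance (IG), a simple and effective strategy that adds auxiliary supervision on an intermediate layer during training and extrapolates intermediate and deep-layer outputs during sampling, significantly improving the training efficiency and generation quality of diffusion models.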

Details Motivation: Existing diffusion models generate poorly in low-probability regions. Classifier-free guidance (CFG) tends to produce over-simplified or distorted samples, while methods that guide a model with a "bad version" of itself depend on carefully designed degradation, extra training, and additional sampling steps, which limits their use. Method: Internal Guidance (IG) adds auxiliary supervision on an intermediate layer during training and, during sampling, extrapolates the outputs of the intermediate and deep layers to produce the result. Result: IG yields significant gains on several baselines. On ImageNet 256x256, SiT-XL/2+IG reaches FID=5.31 and FID=1.75 at 80 and 800 epochs respectively; LightningDiT-XL/1+IG reaches FID=1.34, well ahead of existing methods; combined with CFG it reaches a state-of-the-art FID of 1.19. Conclusion: IG improves diffusion model generation quality without extra networks or complex design, combining efficient training with strong performance and offering a new angle on guidance mechanisms for diffusion models. Abstract: The diffusion model presents a powerful ability to capture the entire (conditional) data distribution. However, due to the lack of sufficient training and data to learn to cover low-probability areas, the model will be penalized for failing to generate high-quality images corresponding to these areas. To achieve better generation quality, guidance strategies such as classifier-free guidance (CFG) can guide the samples to the high-probability areas during the sampling stage. However, the standard CFG often leads to over-simplified or distorted samples. On the other hand, the alternative line of guiding a diffusion model with its bad version is limited by carefully designed degradation strategies, extra training and additional sampling steps. In this paper, we propose a simple yet effective strategy, Internal Guidance (IG), which introduces an auxiliary supervision on the intermediate layer during the training process and extrapolates the intermediate- and deep-layer outputs to obtain generative results during the sampling process. This simple strategy yields significant improvements in both training efficiency and generation quality on various baselines. On ImageNet 256x256, SiT-XL/2+IG achieves FID=5.31 and FID=1.75 at 80 and 800 epochs. More impressively, LightningDiT-XL/1+IG achieves FID=1.34, outperforming all of these methods by a large margin. Combined with CFG, LightningDiT-XL/1+IG achieves the current state-of-the-art FID of 1.19.
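
The abstract says only that IG "extrapolates" the intermediate and deep-layer outputs at sampling time. One plausible reading, shown below purely as an assumption, is a CFG-style linear extrapolation in which the intermediate-layer prediction plays the role of the weaker model. The function name and the `scale` parameter are hypothetical; the paper's actual formula may differ.

```python
def internal_guidance(deep_out, intermediate_out, scale):
    """Extrapolate from the intermediate prediction past the deep prediction.

    Assumed form (not confirmed by the abstract):
        guided = deep + scale * (deep - intermediate)
    scale = 0 returns the deep-layer output unchanged.
    """
    if len(deep_out) != len(intermediate_out):
        raise ValueError("outputs must have the same length")
    return [d + scale * (d - i) for d, i in zip(deep_out, intermediate_out)]


if __name__ == "__main__":
    deep = [1.0, 2.0]          # prediction from the final layer
    intermediate = [0.5, 2.5]  # prediction from the supervised intermediate layer
    print(internal_guidance(deep, intermediate, scale=0.0))
    print(internal_guidance(deep, intermediate, scale=2.0))
```

Under this reading, no second network is needed: both predictions come from a single forward pass, which would be consistent with the abstract's claim that IG avoids extra training and additional sampling steps.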

[107] PointRAFT: 3D deep learning for high-throughput prediction of potato tuber weight from partial point clouds

Pieter M. Blok,Haozhou Wang,Hyun Kwon Suh,Peicheng Wang,James Burridge,Wei Guo

Main category: cs.CV

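TL;DR: This paper proposes PointRAFT, a high-throughput point cloud regression network that predicts potato tuber weight directly from partial point clouds, avoiding the weight underestimation caused by self-occlusion.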

Details Motivation: Point clouds captured by RGB-D cameras are incomplete because of self-occlusion, so potato weight is systematically underestimated. A method that estimates weight accurately from incomplete point clouds is needed. Method: PointRAFT introduces an object height embedding as a geometric cue and regresses weight directly from raw 3D point cloud data, without reconstructing the full 3D shape. Result: Trained and tested on a dataset of 26,688 point clouds, the model achieves a mean absolute error of 12.0 g and a root mean squared error of 17.2 g on the test set, with inference throughput of up to 150 tubers per second, substantially outperforming baseline models. Conclusion: PointRAFT estimates potato weight efficiently and accurately, meets the high-throughput needs of commercial harvesters, and can be extended to other 3D phenotyping and robotic perception tasks. Abstract: Potato yield is a key indicator for optimizing cultivation practices in agriculture. Potato yield can be estimated on harvesters using RGB-D cameras, which capture three-dimensional (3D) information of individual tubers moving along the conveyor belt. However, point clouds reconstructed from RGB-D images are incomplete due to self-occlusion, leading to systematic underestimation of tuber weight. To address this, we introduce PointRAFT, a high-throughput point cloud regression network that directly predicts continuous 3D shape properties, such as tuber weight, from partial point clouds. Rather than reconstructing full 3D geometry, PointRAFT infers target values directly from raw 3D data. Its key architectural novelty is an object height embedding that incorporates tuber height as an additional geometric cue, improving weight prediction under practical harvesting conditions. PointRAFT was trained and evaluated on 26,688 partial point clouds collected from 859 potato tubers across four cultivars and three growing seasons on an operational harvester in Japan. On a test set of 5,254 point clouds from 172 tubers, PointRAFT achieved a mean absolute error of 12.0 g and a root mean squared error of 17.2 g, substantially outperforming a linear regression baseline and a standard PointNet++ regression network. With an average inference time of 6.3 ms per point cloud, PointRAFT supports processing rates of up to 150 tubers per second, meeting the high-throughput requirements of commercial potato harvesters. Beyond potato weight estimation, PointRAFT provides a versatile regression network applicable to a wide range of 3D phenotyping and robotic perception tasks.
The code, network weights, and a subset of the dataset are publicly available at https://github.com/pieterblok/pointraft.git.
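
The error metrics and the throughput figure in this abstract follow from standard definitions. The sketch below shows them under two assumptions that the abstract does not state: one point cloud is processed at a time, and one point cloud corresponds to one tuber. The helper names are illustrative and are not taken from the PointRAFT repository.

```python
import math


def mae(pred, true):
    """Mean absolute error, in the same unit as the inputs (grams here)."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)


def rmse(pred, true):
    """Root mean squared error; penalizes large misses more than MAE does."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))


def throughput_per_second(latency_ms):
    """Items per second when items are processed one after another."""
    return 1000.0 / latency_ms


if __name__ == "__main__":
    predicted_g = [100.0, 200.0]  # made-up weights for illustration only
    measured_g = [110.0, 190.0]
    print(mae(predicted_g, measured_g))
    print(rmse(predicted_g, measured_g))
    # 6.3 ms per point cloud is the latency reported in the abstract.
    print(throughput_per_second(6.3))
```

At 6.3 ms per point cloud the sequential rate is roughly 158 per second, which is consistent with the reported "up to 150 tubers per second".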

[108] CorGi: Contribution-Guided Block-Wise Interval Caching for Training-Free Acceleration of Diffusion Transformers

Yonglak Son,Suhyeok Kim,Seungryong Kim,Young Geun Kim

Main category: cs.CV

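TL;DR: Proposes CorGi and CorGi+, which use a contribution-guided block-wise caching strategy to substantially accelerate DiT inference without sacrificing generation quality.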

Details Motivation: DiT models perform well in image generation, but their iterative denoising process is computationally heavy and repeats much of the same computation across steps. Method: CorGi is a training-free inference acceleration framework that estimates the contribution of each transformer block, caches low-contribution blocks, and reuses them in later steps. For text-to-image tasks, CorGi+ additionally uses cross-attention maps to identify salient tokens and applies partial attention updates. Result: On state-of-the-art DiT models, CorGi and CorGi+ achieve up to 2.0x speedup on average while preserving high generation quality. Conclusion: The CorGi methods effectively reduce redundant computation in DiT inference and offer an efficient, practical acceleration approach for diffusion transformers. Abstract: Diffusion transformer (DiT) achieves remarkable performance in visual generation, but its iterative denoising process combined with larger capacity leads to a high inference cost. Recent works have demonstrated that the iterative denoising process of DiT models involves substantial redundant computation across steps. To effectively reduce the redundant computation in DiT, we propose CorGi (Contribution-Guided Block-Wise Interval Caching), a training-free DiT inference acceleration framework that selectively reuses the outputs of transformer blocks in DiT across denoising steps. CorGi caches low-contribution blocks and reuses them in later steps within each interval to reduce redundant computation while preserving generation quality. For text-to-image tasks, we further propose CorGi+, which leverages per-block cross-attention maps to identify salient tokens and applies partial attention updates to protect important object details. Evaluation on the state-of-the-art DiT models demonstrates that CorGi and CorGi+ achieve up to 2.0x speedup on average, while preserving high generation quality.
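
The interval caching idea can be sketched on a toy model. This is not the authors' implementation: the sketch assumes each block adds a residual to a scalar state, that a block's contribution is the magnitude of its residual on the first step of an interval, and that the `num_cached` lowest-contribution blocks are reused until the next interval starts. All names are hypothetical.

```python
def run_with_interval_cache(blocks, x0, num_steps, interval, num_cached):
    """Toy denoising loop with contribution-guided block caching.

    blocks: list of callables block(x, step) -> residual (a float).
    Returns the final state and the number of block evaluations performed.
    """
    x = x0
    cache = {}  # block index -> residual reused within the current interval
    calls = 0
    for step in range(num_steps):
        if step % interval == 0:
            # Refresh step: evaluate every block and record its residual.
            residuals = []
            for block in blocks:
                r = block(x, step)
                calls += 1
                residuals.append(r)
                x = x + r
            # Assumed contribution score: absolute size of the residual.
            ranked = sorted(range(len(blocks)), key=lambda i: abs(residuals[i]))
            cache = {i: residuals[i] for i in ranked[:num_cached]}
        else:
            # Reuse step: skip cached blocks, recompute the rest.
            for i, block in enumerate(blocks):
                if i in cache:
                    r = cache[i]
                else:
                    r = block(x, step)
                    calls += 1
                x = x + r
    return x, calls


if __name__ == "__main__":
    toy_blocks = [
        lambda x, step: 0.01,  # low contribution
        lambda x, step: 1.0,   # high contribution
        lambda x, step: 0.02,  # low contribution
    ]
    state, evaluations = run_with_interval_cache(
        toy_blocks, x0=0.0, num_steps=4, interval=2, num_cached=2
    )
    print(state, evaluations)
```

In the toy run, four steps with three blocks would normally take 12 block evaluations. With two blocks cached on every second step, only 8 are needed. Because the toy residuals are constant, the final state is unchanged; with real transformer blocks the reused outputs are approximations, which is why a contribution score is used to choose what to cache.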

[109] Medical Image Classification on Imbalanced Data Using ProGAN and SMA-Optimized ResNet: Application to COVID-19

Sina Jahromi,Farshid Hajati,Alireza Rezaee,Javaher Nourian

Main category: cs.CV

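TL;DR: This paper proposes a deep learning model based on a progressive generative adversarial network and multi-objective optimization to address data imbalance in medical image classification, applied in particular to COVID-19 detection.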

Details Motivation: Class imbalance in medical image datasets seriously degrades AI model performance, and the imbalance is even more pronounced during pandemics, so effective methods are needed to improve classification accuracy. Method: A progressive generative adversarial network (Progressive GAN) generates synthetic data, which is combined with real data using a weighted scheme; a population-based multi-objective meta-heuristic optimization algorithm tunes the hyper-parameters of the deep classifier. Result: On a large, imbalanced chest X-ray dataset, the model reaches 95.5% and 98.5% accuracy on the 4-class and 2-class tasks respectively, with cross-validated metrics better than existing methods. Conclusion: The proposed method effectively alleviates data imbalance in medical image classification, significantly improves classification performance, and suits data-scarce settings such as pandemics. Abstract: The challenge of imbalanced data is prominent in medical image classification. This challenge arises when there is a significant disparity in the number of images belonging to a particular class, such as the presence or absence of a specific disease, as compared to the number of images belonging to other classes. This issue is especially notable during pandemics, which may result in an even more significant imbalance in the dataset. Researchers have employed various approaches in recent years to detect COVID-19 infected individuals accurately and quickly, with artificial intelligence and machine learning algorithms at the forefront. However, the lack of sufficient and balanced data remains a significant obstacle to these methods. This study addresses the challenge by proposing a progressive generative adversarial network to generate synthetic data to supplement the real data. The proposed method suggests a weighted approach to combine synthetic data with real data before inputting it into a deep network classifier. A multi-objective meta-heuristic population-based optimization algorithm is employed to optimize the hyper-parameters of the classifier. The proposed model exhibits superior cross-validated metrics compared to existing methods when applied to a large and imbalanced chest X-ray image dataset of COVID-19. The proposed model achieves 95.5% and 98.5% accuracy for 4-class and 2-class imbalanced classification problems, respectively. The successful experimental outcomes demonstrate the effectiveness of the proposed model in classifying medical images using imbalanced data during pandemics.

[110] ARM: A Learnable, Plug-and-Play Module for CLIP-based Open-vocabulary Semantic Segmentation

Ziquan Liu,Zhewei Zhu,Xuyang Shi

Main category: cs.CV

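TL;DR: Proposes the Attention Refinement Module (ARM), a lightweight learnable module that improves CLIP's performance on open-vocabulary semantic segmentation and follows a "train once, use anywhere" paradigm.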

Details Motivation: Existing training-free methods rely on costly external models or static heuristics and struggle to exploit CLIP's internal features for pixel-level segmentation. Method: A learnable Attention Refinement Module (ARM) adaptively fuses CLIP's hierarchical features through semantically guided cross-attention followed by self-attention. Result: ARM consistently improves baseline performance on multiple benchmarks with negligible inference overhead and works as a plug-and-play component across different training-free frameworks. Conclusion: ARM offers an efficient, general paradigm for training-free open-vocabulary semantic segmentation. Abstract: Open-vocabulary semantic segmentation (OVSS) is fundamentally hampered by the coarse, image-level representations of CLIP, which lack precise pixel-level details. Existing training-free methods attempt to resolve this by either importing priors from costly external foundation models (e.g., SAM, DINO) or by applying static, hand-crafted heuristics to CLIP's internal features. These approaches are either computationally expensive or sub-optimal. We propose the Attention Refinement Module (ARM), a lightweight, learnable module that effectively unlocks and refines CLIP's internal potential. Unlike static-fusion methods, ARM learns to adaptively fuse hierarchical features. It employs a semantically-guided cross-attention block, using robust deep features (K, V) to select and refine detail-rich shallow features (Q), followed by a self-attention block. The key innovation lies in a "train once, use anywhere" paradigm. Trained once on a general-purpose dataset (e.g., COCO-Stuff), ARM acts as a universal plug-and-play post-processor for diverse training-free frameworks. Extensive experiments show that ARM consistently boosts baseline performance on multiple benchmarks with negligible inference overhead, establishing an efficient and effective paradigm for training-free OVSS.

[111] Mirage: One-Step Video Diffusion for Photorealistic and Coherent Asset Editing in Driving Scenes

Shuyun Wang,Haiyang Sun,Bing Wang,Hangjun Ye,Xin Yu

Main category: cs.CV

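TL;DR: This paper proposes Mirage, a one-step video diffusion model for high-quality, coherent asset editing in autonomous driving scenes. By injecting 2D encoder features and using a two-stage alignment strategy, it achieves high realism and temporal consistency and generalizes to other video-to-video translation tasks.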

Details Motivation: Existing video object editing methods struggle to maintain both visual fidelity and temporal coherence, which limits their use in vision-centric autonomous driving systems. Method: Mirage is built on a text-to-video diffusion prior. A pretrained 2D encoder injects temporally agnostic latents into the 3D decoder to restore detail while preserving causal structure. A two-stage data alignment strategy combines coarse 3D alignment with fine 2D refinement to reduce the pose misalignment caused by a mismatch between Gaussian distributions. Result: Experiments show that Mirage achieves high realism and temporal consistency across a range of editing scenarios and generalizes well to video-to-video translation tasks. Conclusion: Mirage addresses the limits of existing methods in spatial fidelity and temporal consistency, provides a reliable new option for data augmentation in autonomous driving, and may serve as a baseline for future research. Abstract: Vision-centric autonomous driving systems rely on diverse and scalable training data to achieve robust performance. While video object editing offers a promising path for data augmentation, existing methods often struggle to maintain both high visual fidelity and temporal coherence. In this work, we propose Mirage, a one-step video diffusion model for photorealistic and coherent asset editing in driving scenes. Mirage builds upon a text-to-video diffusion prior to ensure temporal consistency across frames. However, 3D causal variational autoencoders often suffer from degraded spatial fidelity due to compression, and directly passing 3D encoder features to decoder layers breaks temporal causality. To address this, we inject temporally agnostic latents from a pretrained 2D encoder into the 3D decoder to restore detail while preserving causal structures. Furthermore, because scene objects and inserted assets are optimized under different objectives, their Gaussians exhibit a distribution mismatch that leads to pose misalignment. To mitigate this, we introduce a two-stage data alignment strategy combining coarse 3D alignment and fine 2D refinement, thereby improving alignment and providing cleaner supervision. Extensive experiments demonstrate that Mirage achieves high realism and temporal consistency across diverse editing scenarios. Beyond asset editing, Mirage can also generalize to other video-to-video translation tasks, serving as a reliable baseline for future research. Our code is available at https://github.com/wm-research/mirage.

[112] MotivNet: Evolving Meta-Sapiens into an Emotionally Intelligent Foundation Model

Rahul Medicharla,Alper Yilmaz

Main category: cs.CV

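TL;DR: This paper proposes MotivNet, a general-purpose facial emotion recognition model built on the Meta-Sapiens backbone. It achieves competitive performance across datasets without cross-domain training, generalizes well to real-world data, and is validated as a viable downstream task for Sapiens.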

Details Motivation: Existing facial emotion recognition (FER) models generalize poorly to diverse data, which limits their use in real-world settings. Proposed complex architectures still depend on cross-domain training, which conflicts with real-world deployment needs. Method: MotivNet uses Meta-Sapiens as its backbone, relying on the strong generalization Sapiens obtains from large-scale Masked Autoencoder pretraining. FER is framed as a downstream task for Sapiens, and its viability is assessed against three criteria: benchmark performance, model similarity, and data similarity. Result: Without cross-domain training, MotivNet performs comparably to current state-of-the-art models on multiple datasets, shows strong cross-domain generalization, and meets all three criteria for a Sapiens downstream task. Conclusion: MotivNet is a general-purpose FER model with strong generalization; the results validate it as a Sapiens downstream task and improve the prospects for FER in real-world settings. Abstract: In this paper, we introduce MotivNet, a generalizable facial emotion recognition model for robust real-world application. Current state-of-the-art FER models tend to have weak generalization when tested on diverse data, leading to deteriorated performance in the real world and hindering FER as a research domain. Though researchers have proposed complex architectures to address this generalization issue, they require training cross-domain to obtain generalizable results, which is inherently contradictory for real-world application. Our model, MotivNet, achieves competitive performance across datasets without cross-domain training by using Meta-Sapiens as a backbone. Sapiens is a human vision foundational model with state-of-the-art generalization in the real world through large-scale pretraining of a Masked Autoencoder. We propose MotivNet as an additional downstream task for Sapiens and define three criteria to evaluate MotivNet's viability as a Sapiens task: benchmark performance, model similarity, and data similarity. Throughout this paper, we describe the components of MotivNet, our training approach, and our results showing MotivNet is generalizable across domains. We demonstrate that MotivNet can be benchmarked against existing SOTA models and meets the listed criteria, validating MotivNet as a Sapiens downstream task, and making FER more attractive for in-the-wild application. The code is available at https://github.com/OSUPCVLab/EmotionFromFaceImages.

[113] MambaSeg: Harnessing Mamba for Accurate and Efficient Image-Event Semantic Segmentation

Fuqiang Gu,Yuanke Li,Xianlei Long,Kangping Ji,Chao Chen,Qingyi Gu,Zhenliang Ni

Main category: cs.CV

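TL;DR: This paper proposes MambaSeg, a dual-branch semantic segmentation framework built on parallel Mamba encoders. It fuses RGB images and event streams with fine-grained cross-modal interaction along both spatial and temporal dimensions, improving segmentation in challenging conditions while reducing computational cost.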

Details Motivation: RGB-based semantic segmentation degrades under fast motion, low light, or high dynamic range. Event cameras offer high temporal resolution and low latency but lack color and texture. Existing fusion methods mostly focus on spatial fusion, are computationally expensive, and ignore the temporal dynamics of event streams. An efficient multimodal segmentation method that fully exploits the complementary properties of both modalities is needed. Method: MambaSeg uses a dual-branch structure to process RGB images and event streams separately, with Mamba models capturing long-range dependencies. A Dual-Dimensional Interaction Module (DDIM), made up of a Cross-Spatial Interaction Module (CSIM) and a Cross-Temporal Interaction Module (CTIM), performs fine-grained fusion along spatial and temporal dimensions to reduce cross-modal ambiguity and improve alignment. Result: Extensive experiments on the DDD17 and DSEC datasets show that MambaSeg reaches state-of-the-art segmentation performance while significantly reducing computational cost. Conclusion: By fusing RGB and event data with efficient cross-modal alignment in both space and time, MambaSeg provides a high-performance, low-cost approach to multimodal perception suited to real-time applications such as autonomous driving and robotics. Abstract: Semantic segmentation is a fundamental task in computer vision with wide-ranging applications, including autonomous driving and robotics. While RGB-based methods have achieved strong performance with CNNs and Transformers, their effectiveness degrades under fast motion, low-light, or high dynamic range conditions due to limitations of frame cameras. Event cameras offer complementary advantages such as high temporal resolution and low latency, yet lack color and texture, making them insufficient on their own. To address this, recent research has explored multimodal fusion of RGB and event data; however, many existing approaches are computationally expensive and focus primarily on spatial fusion, neglecting the temporal dynamics inherent in event streams. In this work, we propose MambaSeg, a novel dual-branch semantic segmentation framework that employs parallel Mamba encoders to efficiently model RGB images and event streams. To reduce cross-modal ambiguity, we introduce the Dual-Dimensional Interaction Module (DDIM), comprising a Cross-Spatial Interaction Module (CSIM) and a Cross-Temporal Interaction Module (CTIM), which jointly perform fine-grained fusion along both spatial and temporal dimensions. This design improves cross-modal alignment, reduces ambiguity, and leverages the complementary properties of each modality.
Extensive experiments on the DDD17 and DSEC datasets demonstrate that MambaSeg achieves state-of-the-art segmentation performance while significantly reducing computational cost, showcasing its promise for efficient, scalable, and robust multimodal perception.

[114] Physically-Grounded Manifold Projection with Foundation Priors for Metal Artifact Reduction in Dental CBCT

Zhi Li,Yaqi Wang,Bingtao Ma,Yifan Zhang,Huiyu Zhou,Shuai Wang

Main category: cs.CV

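TL;DR: Proposes Physically-Grounded Manifold Projection (PGMP), a metal artifact reduction framework that combines high-fidelity simulation with deterministic restoration for fast, reliable artifact correction in dental CBCT.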

Details Motivation: Existing deep learning methods for metal artifact reduction in dental CBCT suffer from regression blurring or structural hallucinations, and diffusion models are too slow for clinical use because of iterative sampling. Method: An Anatomically-Adaptive Physics Simulation (AAPS) pipeline generates high-quality training data; DMP-Former performs single-step deterministic image restoration; a Semantic-Structural Alignment (SSA) module uses a medical foundation model to keep results anatomically plausible. Result: PGMP outperforms existing state-of-the-art methods on both synthetic and multi-center clinical data, with better efficiency and diagnostic reliability. Conclusion: PGMP offers a new paradigm for efficient, realistic, and reliable metal artifact reduction that is usable in the clinic. Abstract: Metal artifacts in Dental CBCT severely obscure anatomical structures, hindering diagnosis. Current deep learning for Metal Artifact Reduction (MAR) faces limitations: supervised methods suffer from spectral blurring due to "regression-to-the-mean", while unsupervised ones risk structural hallucinations. Denoising Diffusion Models (DDPMs) offer realism but rely on slow, stochastic iterative sampling, unsuitable for clinical use. To resolve this, we propose the Physically-Grounded Manifold Projection (PGMP) framework. First, our Anatomically-Adaptive Physics Simulation (AAPS) pipeline synthesizes high-fidelity training pairs via Monte Carlo spectral modeling and patient-specific digital twins, bridging the synthetic-to-real gap. Second, our DMP-Former adapts the Direct x-Prediction paradigm, reformulating restoration as a deterministic manifold projection to recover clean anatomy in a single forward pass, eliminating stochastic sampling. Finally, a Semantic-Structural Alignment (SSA) module anchors the solution using priors from medical foundation models (MedDINOv3), ensuring clinical plausibility. Experiments on synthetic and multi-center clinical datasets show PGMP outperforms state-of-the-art methods on unseen anatomy, setting new benchmarks in efficiency and diagnostic reliability. Code and data: https://github.com/ricoleehduu/PGMP

[115] Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation

Zhe Huang,Hao Wen,Aiming Hao,Bingze Song,Meiqi Wu,Jiahong Wu,Xiangxiang Chu,Sheng Lu,Haoqian Wang

Main category: cs.CV

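TL;DR: This paper proposes DualityForge, a counterfactual data synthesis framework that uses diffusion-based controllable video editing to produce high-quality original-edited video pairs with matching QA pairs. These form the large-scale DualityVidQA dataset, which, together with the two-stage DNA-Train method, reduces hallucinations of multimodal large language models on counterfactual videos.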

Details Motivation: Multimodal Large Language Models (MLLMs) over-rely on language priors in video understanding, which produces visually ungrounded hallucinations on counterfactual videos that defy common sense. Because annotating counterfactual data is costly, the problem is hard to solve by conventional means. Method: DualityForge uses diffusion-based controllable video editing to turn real videos into counterfactual scenarios and, with structured contextual information, automatically generates original-edited video pairs and corresponding QA pairs. The authors build the DualityVidQA dataset and design the DNA-Train method, which applies pair-wise ℓ₁ advantage normalization in the reinforcement learning stage for stable, efficient policy optimization. Result: On DualityVidQA-Test, the method yields a 24.0% relative improvement over the Qwen2.5-VL-7B baseline in reducing hallucinations on counterfactual videos, and also brings significant gains on general-purpose benchmarks, indicating good generalization. Conclusion: DualityForge combined with DNA-Train effectively mitigates MLLM hallucinations on counterfactual videos. High-quality contrastive data and an efficient training strategy improve robustness and generalization; the data and code will be open-sourced. Abstract: Multimodal Large Language Models (MLLMs) have made remarkable progress in video understanding. However, they suffer from a critical vulnerability: an over-reliance on language priors, which can lead to visually ungrounded hallucinations, especially when processing counterfactual videos that defy common sense. This limitation, stemming from the intrinsic data imbalance between text and video, is challenging to address due to the substantial cost of collecting and annotating counterfactual data. To address this, we introduce DualityForge, a novel counterfactual data synthesis framework that employs controllable, diffusion-based video editing to transform real-world videos into counterfactual scenarios. By embedding structured contextual information into the video editing and QA generation processes, the framework automatically produces high-quality QA pairs together with original-edited video pairs for contrastive training. Based on this, we build DualityVidQA, a large-scale video dataset designed to reduce MLLM hallucinations. In addition, to fully exploit the contrastive nature of our paired data, we propose Duality-Normalized Advantage Training (DNA-Train), a two-stage SFT-RL training regime where the RL phase applies pair-wise $\ell_1$ advantage normalization, thereby enabling a more stable and efficient policy optimization.
Experiments on DualityVidQA-Test demonstrate that our method substantially reduces model hallucinations on counterfactual videos, yielding a relative improvement of 24.0% over the Qwen2.5-VL-7B baseline. Moreover, our approach achieves significant gains across both hallucination and general-purpose benchmarks, indicating strong generalization capability. We will open-source our dataset and code.

[116] LiftProj: Space Lifting and Projection-Based Panorama Stitching

Yuan Jia,Ruimin Wu,Rui Song,Jiaojiao Li,Bin Song

Main category: cs.CV

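TL;DR: Proposes a panorama stitching framework based on 3D space lifting. Images are converted to 3D point cloud representations in a unified coordinate system for multi-view fusion and geometrically consistent 360° panorama generation, which reduces the distortion and ghosting that traditional methods show under significant parallax and complex occlusion.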

Details Motivation: Traditional stitching based on 2D homography tends to produce ghosting, structural bending, and stretching when applied to 3D scenes with multiple depth layers and occlusions. The problems are worse for multi-view accumulation and 360° closed-loop stitching, so a more robust, geometrically consistent approach is needed. Method: Input images are lifted to dense point cloud representations in a unified 3D coordinate system and fused globally across views using confidence; a unified projection center is established in 3D space, an equidistant cylindrical projection maps the fused data onto a single panoramic manifold, and hole filling in the canvas domain restores texture continuity. Result: Experiments show that the method substantially reduces geometric distortion and ghosting artifacts in scenes with significant parallax and complex occlusion, producing more natural and more geometrically consistent 360° panoramas. Conclusion: The framework moves image stitching from a 2D warping paradigm to a 3D consistency paradigm, can flexibly incorporate different 3D lifting and completion modules, and improves the robustness and visual quality of panorama stitching in complex real scenes. Abstract: Traditional image stitching techniques have predominantly utilized two-dimensional homography transformations and mesh warping to achieve alignment on a planar surface. While effective for scenes that are approximately coplanar or exhibit minimal parallax, these approaches often result in ghosting, structural bending, and stretching distortions in non-overlapping regions when applied to real three-dimensional scenes characterized by multiple depth layers and occlusions. Such challenges are exacerbated in multi-view accumulations and 360° closed-loop stitching scenarios. In response, this study introduces a spatially lifted panoramic stitching framework that initially elevates each input image into a dense three-dimensional point representation within a unified coordinate system, facilitating global cross-view fusion augmented by confidence metrics. Subsequently, a unified projection center is established in three-dimensional space, and an equidistant cylindrical projection is employed to map the fused data onto a single panoramic manifold, thereby producing a geometrically consistent 360° panoramic layout. Finally, hole filling is conducted within the canvas domain to address unknown regions revealed by viewpoint transitions, restoring continuous texture and semantic coherence. This framework reconceptualizes stitching from a two-dimensional warping paradigm to a three-dimensional consistency paradigm and is designed to flexibly incorporate various three-dimensional lifting and completion modules.
Experimental evaluations demonstrate that the proposed method substantially mitigates geometric distortions and ghosting artifacts in scenarios involving significant parallax and complex occlusions, yielding panoramic results that are more natural and consistent.
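
The projection step described in this abstract can be illustrated with a minimal sketch of an equidistant cylindrical (equirectangular) mapping. It assumes a y-up axis convention with longitude measured from the +z axis; the function name, the axis convention, and the canvas size are illustrative assumptions, not details taken from the paper.

```python
import math

def project_equirectangular(point, center, width, height):
    """Map a 3D point to (u, v) canvas coordinates plus its range r.

    Hypothetical helper: axes are assumed to be x right, y up, z forward.
    """
    # Express the point relative to the chosen projection center.
    x = point[0] - center[0]
    y = point[1] - center[1]
    z = point[2] - center[2]
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        raise ValueError("point coincides with the projection center")
    # Longitude in [-pi, pi], measured around the vertical axis from +z.
    lon = math.atan2(x, z)
    # Latitude in [-pi/2, pi/2], positive upward.
    lat = math.asin(y / r)
    # Angles map linearly to pixels, which is what makes the projection equidistant.
    u = (lon / (2.0 * math.pi) + 0.5) * width
    v = (0.5 - lat / math.pi) * height
    return u, v, r
```

Returning the range `r` alongside the pixel position would let a fusion stage keep the nearest point when several land on the same pixel; whether the paper resolves collisions that way is not stated in the abstract.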

[117] One-shot synthesis of rare gastrointestinal lesions improves diagnostic accuracy and clinical training

Jia Yu,Yan Zhu,Peiyao Fu,Tianyi Chen,Zhihua Wang,Fei Wu,Quanlin Li,Pinghong Zhou,Shuo Wang,Xian Yang

Main category: cs.CV

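TL;DR: This paper proposes EndoRare, a one-shot generative framework that synthesizes high-fidelity, diverse images of rare gastrointestinal lesions from a single reference image, for improving AI model performance and training novice clinicians.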

Details Motivation: Rare gastrointestinal lesions are seldom seen in routine endoscopy, which limits the data available for developing reliable artificial intelligence (AI) models and training novice clinicians; an efficient way to close this data gap is needed. Method: The EndoRare framework uses language-guided concept disentanglement to separate pathognomonic features from non-diagnostic attributes, encoding the former into a learnable prototype embedding while varying the latter to ensure diversity; new samples are generated without retraining. Result: The framework was validated on four rare pathologies, and experts judged the synthetic images clinically plausible; used for data augmentation, they significantly improved downstream AI classifiers, raising the true positive rate at low false-positive rates; in a blinded reader study, novice endoscopists' recall rose by 0.400 and precision by 0.267. Conclusion: EndoRare offers a practical, data-efficient way to address data scarcity for rare diseases in computer-aided diagnosis and clinical education. Abstract: Rare gastrointestinal lesions are infrequently encountered in routine endoscopy, restricting the data available for developing reliable artificial intelligence (AI) models and training novice clinicians. Here we present EndoRare, a one-shot, retraining-free generative framework that synthesizes diverse, high-fidelity lesion exemplars from a single reference image. By leveraging language-guided concept disentanglement, EndoRare separates pathognomonic lesion features from non-diagnostic attributes, encoding the former into a learnable prototype embedding while varying the latter to ensure diversity. We validated the framework across four rare pathologies (calcifying fibrous tumor, juvenile polyposis syndrome, familial adenomatous polyposis, and Peutz-Jeghers syndrome). Synthetic images were judged clinically plausible by experts and, when used for data augmentation, significantly enhanced downstream AI classifiers, improving the true positive rate at low false-positive rates. Crucially, a blinded reader study demonstrated that novice endoscopists exposed to EndoRare-generated cases achieved a 0.400 increase in recall and a 0.267 increase in precision. These results establish a practical, data-efficient pathway to bridge the rare-disease gap in both computer-aided diagnostics and clinical education.

[118] Virtual-Eyes: Quantitative Validation of a Lung CT Quality-Control Pipeline for Foundation-Model Cancer Risk Prediction

Md. Enamul Hoq,Linda Larson-Prior,Fred Prior

Main category: cs.CV

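TL;DR: This paper proposes and validates Virtual-Eyes, a 16-bit CT quality-control preprocessing pipeline for low-dose CT lung cancer screening. It improves the performance and calibration of generalist foundation models (e.g., RAD-DINO) but can degrade specialist models (e.g., Sybil, ResNet-18), showing that preprocessing affects different model types differently.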

Details Motivation: In deep-learning pipelines for low-dose CT lung cancer screening, robust preprocessing is rarely evaluated quantitatively. The authors aim to develop a clinically motivated, standardized preprocessing method and systematically analyze how its effect differs between generalist foundation models and specialist models. Method: The Virtual-Eyes pipeline enforces 512x512 in-plane resolution, rejects non-diagnostic series, and extracts a contiguous lung block via Hounsfield-unit filtering and bilateral lung-coverage scoring while preserving the native 16-bit precision; using 765 patients from the NLST dataset, slice-level embeddings are extracted with frozen RAD-DINO and Merlin encoders, leakage-free MLP classification heads are trained, and Sybil and ResNet-18 are compared under Raw versus Virtual-Eyes inputs. Result: Virtual-Eyes raises RAD-DINO slice-level AUC from 0.576 to 0.610 and patient-level AUC from 0.646 to 0.683 with mean pooling and from 0.619 to 0.735 with max pooling, with improved calibration (Brier score from 0.188 to 0.112); Sybil and ResNet-18 degrade under Virtual-Eyes inputs (Sybil AUC from 0.886 to 0.837), and Merlin transfers poorly (AUC about 0.507-0.567). Conclusion: Anatomically targeted quality control can stabilize and improve generalist foundation models in low-dose CT screening but may disrupt specialist models that depend on raw clinical context, suggesting that model design and preprocessing strategy should be considered together. Abstract: Robust preprocessing is rarely quantified in deep-learning pipelines for low-dose CT (LDCT) lung cancer screening. We develop and validate Virtual-Eyes, a clinically motivated 16-bit CT quality-control pipeline, and measure its differential impact on generalist foundation models versus specialist models. Virtual-Eyes enforces strict 512x512 in-plane resolution, rejects short or non-diagnostic series, and extracts a contiguous lung block using Hounsfield-unit filtering and bilateral lung-coverage scoring while preserving the native 16-bit grid. Using 765 NLST patients (182 cancer, 583 non-cancer), we compute slice-level embeddings from RAD-DINO and Merlin with frozen encoders and train leakage-free patient-level MLP heads; we also evaluate Sybil and a 2D ResNet-18 baseline under Raw versus Virtual-Eyes inputs without backbone retraining. Virtual-Eyes improves RAD-DINO slice-level AUC from 0.576 to 0.610 and patient-level AUC from 0.646 to 0.683 (mean pooling) and from 0.619 to 0.735 (max pooling), with improved calibration (Brier score 0.188 to 0.112). In contrast, Sybil and ResNet-18 degrade under Virtual-Eyes (Sybil AUC 0.886 to 0.837; ResNet-18 AUC 0.571 to 0.596) with evidence of context dependence and shortcut learning, and Merlin shows limited transferability (AUC approximately 0.507 to 0.567) regardless of preprocessing. 
These results demonstrate that anatomically targeted QC can stabilize and improve generalist foundation-model workflows but may disrupt specialist models adapted to raw clinical context.
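
Two quantities quoted in this abstract, the Brier score and mean versus max pooling, are easy to state concretely. The sketch below assumes that pooling is applied to per-slice probabilities to obtain one patient-level score; the paper may instead pool embeddings before the MLP head, and the function names are illustrative.

```python
def brier_score(probs, labels):
    """Mean squared gap between predicted probabilities and 0/1 labels (lower is better)."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

def pool_patient(slice_probs, mode="mean"):
    """Collapse per-slice scores for one patient into a single score."""
    if mode == "mean":
        return sum(slice_probs) / len(slice_probs)
    if mode == "max":
        # Max pooling lets a single suspicious slice drive the patient-level score.
        return max(slice_probs)
    raise ValueError("mode must be 'mean' or 'max'")
```

Under this reading, max pooling is sensitive to the single highest-scoring slice, which would make it more dependent on which slices survive quality control than mean pooling is.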

[119] UniAct: Unified Motion Generation and Action Streaming for Humanoid Robots

Nan Jiang,Zimo He,Wanhe Yu,Lexi Pang,Yunhao Li,Hongjie Li,Jieming Cui,Yuhan Li,Yizhou Wang,Yixin Zhu,Siyuan Huang

Main category: cs.CV

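TL;DR: UniAct is a two-stage framework that combines a fine-tuned MLLM with a causal streaming pipeline to let humanoid robots execute multimodal instructions with low latency and high success rates.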

Details Motivation: Existing methods struggle to turn heterogeneous instructions such as language, music, and trajectories into stable, real-time actions, limiting how flexibly humanoid robots can respond in diverse real-world scenarios. Method: UniAct first parses multimodal inputs with a fine-tuned multimodal large language model (MLLM) and then generates actions through a causal streaming pipeline; FSQ maps the different modalities into a shared discrete codebook, ensuring cross-modal alignment and constraining motions to a physically plausible manifold. Result: End-to-end latency is below 500 ms, and the success rate for zero-shot tracking of imperfect reference motions improves by 19%; generalization is validated on UniMoCap, the authors' 20-hour humanoid motion benchmark. Conclusion: By unifying perception and control, UniAct markedly improves the response speed and execution performance of humanoid robots on multimodal instructions, advancing general-purpose humanoid assistants. Abstract: A long-standing objective in humanoid robotics is the realization of versatile agents capable of following diverse multimodal instructions with human-level flexibility. Despite advances in humanoid control, bridging high-level multimodal perception with whole-body execution remains a significant bottleneck. Existing methods often struggle to translate heterogeneous instructions -- such as language, music, and trajectories -- into stable, real-time actions. Here we show that UniAct, a two-stage framework integrating a fine-tuned MLLM with a causal streaming pipeline, enables humanoid robots to execute multimodal instructions with sub-500 ms latency. By unifying inputs through a shared discrete codebook via FSQ, UniAct ensures cross-modal alignment while constraining motions to a physically grounded manifold. This approach yields a 19% improvement in the success rate of zero-shot tracking of imperfect reference motions. We validate UniAct on UniMoCap, our 20-hour humanoid motion benchmark, demonstrating robust generalization across diverse real-world scenarios. Our results mark a critical step toward responsive, general-purpose humanoid assistants capable of seamless interaction through unified perception and control.
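
As a rough illustration of how finite scalar quantization (FSQ) yields a discrete codebook without learned code vectors, the sketch below bounds each latent coordinate with tanh and rounds it to a small set of integers. It is a simplified forward pass only: it handles odd level counts, omits the straight-through gradient used during training, and the level configuration is a made-up example rather than UniAct's setting.

```python
import math

def fsq_quantize(z, levels):
    """Round each coordinate of z to one of levels[i] integer values (odd counts only)."""
    codes = []
    for value, n_levels in zip(z, levels):
        if n_levels % 2 == 0:
            raise ValueError("this sketch handles odd level counts only")
        half = (n_levels - 1) / 2
        # tanh bounds the coordinate to (-half, half) before rounding.
        bounded = half * math.tanh(value)
        codes.append(int(round(bounded)))
    return codes

def fsq_index(codes, levels):
    """Flatten per-dimension codes into one codebook index (mixed-radix encoding)."""
    index = 0
    for code, n_levels in zip(codes, levels):
        half = (n_levels - 1) // 2
        index = index * n_levels + (code + half)
    return index
```

With levels `[5, 5, 3]` the implicit codebook has 5 × 5 × 3 = 75 entries, so tokens from different modalities that share the level configuration also share one index space.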

[120] Robust Egocentric Referring Video Object Segmentation via Dual-Modal Causal Intervention

Haijing Liu,Zhiyuan Song,Hefeng Wu,Tao Pu,Keze Wang,Liang Lin

Main category: cs.CV

TL;DR: 本文提出了CERES,一种基于因果推理的插件式框架,用于解决第一人称视频中指代表达视频对象分割(Ego-RVOS)中的数据偏差和视觉混淆问题,通过双模态因果干预实现最先进性能。

Details Motivation: Existing Ego-RVOS methods are vulnerable to biased object-action pairings in datasets and to visual confounding inherent in the egocentric view (e.g., rapid motion and frequent occlusions), which leads to poor generalization. Method: The CERES framework applies backdoor adjustment to remove dataset bias from language representations and uses front-door adjustment to fuse semantic features with geometric depth information, mitigating visual confounding on causal principles. The framework can adapt strong pre-trained RVOS backbones. Result: CERES achieves state-of-the-art performance on multiple Ego-RVOS benchmarks, supporting the value of causal intervention for model robustness. Conclusion: Bringing causal inference to Ego-RVOS effectively mitigates data bias and visual interference, offering a new direction for building more reliable egocentric video understanding models. Abstract: Egocentric Referring Video Object Segmentation (Ego-RVOS) aims to segment the specific object actively involved in a human action, as described by a language query, within first-person videos. This task is critical for understanding egocentric human behavior. However, achieving such segmentation robustly is challenging due to ambiguities inherent in egocentric videos and biases present in training data. Consequently, existing methods often struggle, learning spurious correlations from skewed object-action pairings in datasets and fundamental visual confounding factors of the egocentric perspective, such as rapid motion and frequent occlusions. To address these limitations, we introduce Causal Ego-REferring Segmentation (CERES), a plug-in causal framework that adapts strong, pre-trained RVOS backbones to the egocentric domain. CERES implements dual-modal causal intervention: applying backdoor adjustment principles to counteract language representation biases learned from dataset statistics, and leveraging front-door adjustment concepts to address visual confounding by intelligently integrating semantic visual features with geometric depth information guided by causal principles, creating representations more robust to egocentric distortions. Extensive experiments demonstrate that CERES achieves state-of-the-art performance on Ego-RVOS benchmarks, highlighting the potential of applying causal reasoning to build more reliable models for broader egocentric video understanding.

[121] SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning

Yong Xien Chng,Tao Hu,Wenwen Tong,Xueheng Li,Jiandong Chen,Haojia Yu,Jiefan Lu,Hewei Guo,Hanming Deng,Chengjun Xie,Gao Huang,Dahua Lin,Lewei Lu

Main category: cs.CV

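TL;DR: This paper proposes SenseNova-MARS, a framework for multimodal agentic reasoning and search trained with reinforcement learning. It dynamically combines image search, text search, and image cropping tools to improve vision-language models' reasoning in complex, knowledge-intensive scenarios, and introduces a new high-resolution benchmark, HR-MMSearch.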

Details Motivation: Existing vision-language models lack the human-like ability to reason continuously while dynamically using tools (such as search and cropping) in complex visual tasks, and fall short especially in knowledge-intensive and high-resolution image scenarios. Method: The SenseNova-MARS framework combines image search, text search, and image crop tools and is trained with reinforcement learning using the Batch-Normalized Group Sequence Policy Optimization (BN-GSPO) algorithm, interleaving reasoning with tool use. Result: The model scores 67.84 on MMSearch and 41.64 on the newly proposed HR-MMSearch, surpassing proprietary models such as Gemini-3-Flash and GPT-5 and reaching the state of the art among open-source models. Conclusion: Through reinforcement learning, SenseNova-MARS achieves stable and efficient tool invocation and continuous reasoning for vision-language models on complex tasks, advancing agentic VLMs; all code, models, and data will be open-sourced. Abstract: While Vision-Language Models (VLMs) can solve complex tasks through agentic reasoning, their capabilities remain largely constrained to text-oriented chain-of-thought or isolated tool invocation. They fail to exhibit the human-like proficiency required to seamlessly interleave dynamic tool manipulation with continuous reasoning, particularly in knowledge-intensive and visually complex scenarios that demand coordinated external tools such as search and image cropping. In this work, we introduce SenseNova-MARS, a novel Multimodal Agentic Reasoning and Search framework that empowers VLMs with interleaved visual reasoning and tool-use capabilities via reinforcement learning (RL). Specifically, SenseNova-MARS dynamically integrates the image search, text search, and image crop tools to tackle fine-grained and knowledge-intensive visual understanding challenges. In the RL stage, we propose the Batch-Normalized Group Sequence Policy Optimization (BN-GSPO) algorithm to improve the training stability and advance the model's ability to invoke tools and reason effectively. To comprehensively evaluate the agentic VLMs on complex visual tasks, we introduce the HR-MMSearch benchmark, the first search-oriented benchmark composed of high-resolution images with knowledge-intensive and search-driven questions. Experiments demonstrate that SenseNova-MARS achieves state-of-the-art performance on open-source search and fine-grained image understanding benchmarks. 
Specifically, on search-oriented benchmarks, SenseNova-MARS-8B scores 67.84 on MMSearch and 41.64 on HR-MMSearch, surpassing proprietary models such as Gemini-3-Flash and GPT-5. SenseNova-MARS represents a promising step toward agentic VLMs by providing effective and robust tool-use capabilities. To facilitate further research in this field, we will release all code, models, and datasets.

[122] Spatial-aware Vision Language Model for Autonomous Driving

Weijie Wei,Zhipeng Luo,Ling Feng,Venice Erin Liong

Main category: cs.CV

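TL;DR: Proposes LVLDrive, a framework that augments vision-language models with LiDAR point clouds to strengthen their 3D spatial understanding, improving safety and reliability in autonomous driving.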

Details Motivation: Existing image-based vision-language models lack accurate metric spatial reasoning and geometric inference for autonomous driving, which undermines the reliability of driving policies. Method: The LVLDrive framework adds LiDAR point clouds as an extra input modality and uses a Gradual Fusion Q-Former to inject LiDAR features stably; a spatial-aware question-answering (SA-QA) dataset is built to train the model's 3D perception and reasoning. Result: Experiments on multiple autonomous driving benchmarks show that LVLDrive outperforms vision-only methods in scene understanding, metric spatial perception, and driving decision-making. Conclusion: Explicit 3D metric data is essential for building trustworthy VLM-based autonomous driving systems. Abstract: While Vision-Language Models (VLMs) show significant promise for end-to-end autonomous driving by leveraging the common sense embedded in language models, their reliance on 2D image cues for complex scene understanding and decision-making presents a critical bottleneck for safety and reliability. Current image-based methods struggle with accurate metric spatial reasoning and geometric inference, leading to unreliable driving policies. To bridge this gap, we propose LVLDrive (LiDAR-Vision-Language), a novel framework specifically designed to upgrade existing VLMs with robust 3D metric spatial understanding for autonomous driving by incorporating LiDAR point clouds as an extra input modality. A key challenge lies in mitigating the catastrophic disturbance introduced by disparate 3D data to the pre-trained VLMs. To this end, we introduce a Gradual Fusion Q-Former that incrementally injects LiDAR features, ensuring the stability and preservation of the VLM's existing knowledge base. Furthermore, we develop a spatial-aware question-answering (SA-QA) dataset to explicitly teach the model advanced 3D perception and reasoning capabilities. Extensive experiments on driving benchmarks demonstrate that LVLDrive achieves superior performance compared to vision-only counterparts across scene understanding, metric spatial perception, and reliable driving decision-making. Our work highlights the necessity of explicit 3D metric data for building trustworthy VLM-based autonomous systems.

[123] The Mechanics of CNN Filtering with Rectification

Liam Frija-Altrac,Matthew Toews

Main category: cs.CV

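TL;DR: Proposes an information-mechanics model of convolutional filtering with rectification, inspired by special relativity and quantum mechanics; using an even-odd kernel decomposition and DCT spectral analysis, it relates information processing in CNNs to the energy-momentum relation.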

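Details Motivation: Inspired by modern physical theories, the work seeks to understand the mechanics of information propagation in convolutional neural networks through analogies to fundamental physics. Method: Convolution kernels are decomposed into even and odd parts, and their effects on information diffusion and displacement are analyzed in the discrete cosine transform (DCT) spectral domain, by analogy with potential and kinetic energy and with momentum and velocity. Result: Even kernels cause isotropic diffusion and preserve the center of mass, odd kernels cause directional displacement of the center of mass, and the speed of information propagation is linearly related to the odd share of kernel energy; for small kernels (e.g., 3×3), propagation patterns are determined mainly by low-frequency bases (the DC and gradient components). Conclusion: The work draws an analogy between information propagation in CNNs and the relativistic energy-momentum relation and, according to the authors, is the first to demonstrate this link between information processing in deep learning and fundamental physical principles. Abstract: This paper proposes elementary information mechanics as a new model for understanding the mechanical properties of convolutional filtering with rectification, inspired by physical theories of special relativity and quantum mechanics. We consider kernels decomposed into orthogonal even and odd components. Even components cause image content to diffuse isotropically while preserving the center of mass, analogously to rest or potential energy with zero net momentum. Odd kernels cause directional displacement of the center of mass, analogously to kinetic energy with non-zero momentum. The speed of information displacement is linearly related to the ratio of odd vs total kernel energy. Even-Odd properties are analyzed in the spectral domain via the discrete cosine transform (DCT), where the structure of small convolutional filters (e.g. $3 \times 3$ pixels) is dominated by low-frequency bases, specifically the DC $Σ$ and gradient components $\nabla$, which define the fundamental modes of information propagation. To our knowledge, this is the first work demonstrating the link between information processing in generic CNNs and the energy-momentum relation, a cornerstone of modern relativistic physics.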
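
The even-odd split mentioned in this abstract can be sketched with the standard decomposition under point reflection about the kernel center. Treating "energy" as the sum of squared weights is an assumption here, as are the helper names; the paper may define both differently.

```python
def flip(kernel):
    """Point-reflect a 2D kernel about its center (reverse rows and columns)."""
    return [row[::-1] for row in kernel[::-1]]

def even_odd_split(kernel):
    """Return (even, odd) parts with kernel = even + odd."""
    flipped = flip(kernel)
    even = [[(a + b) / 2 for a, b in zip(r, rf)] for r, rf in zip(kernel, flipped)]
    odd = [[(a - b) / 2 for a, b in zip(r, rf)] for r, rf in zip(kernel, flipped)]
    return even, odd

def energy(kernel):
    """Sum of squared weights (assumed definition of kernel energy)."""
    return sum(v * v for row in kernel for v in row)

def odd_energy_ratio(kernel):
    """Share of total kernel energy carried by the odd part."""
    _, odd = even_odd_split(kernel)
    return energy(odd) / energy(kernel)
```

Because the two parts are orthogonal, their energies add up to the total, so the ratio lies between 0 (a symmetric blur kernel) and 1 (a pure gradient kernel such as Sobel). The kernel `[[0, 0, 0], [0, 1, 1], [0, 0, 0]]` gives 0.25.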

[124] DermaVQA-DAS: Dermatology Assessment Schema (DAS) & Datasets for Closed-Ended Question Answering & Segmentation in Patient-Generated Dermatology Images

Wen-wai Yim,Yujuan Fu,Asma Ben Abacha,Meliha Yetisgen,Noel Codella,Roberto Andres Novoa,Josep Malvehy

Main category: cs.CV

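TL;DR: This paper presents DermaVQA-DAS, an extended dataset supporting closed-ended question answering and dermatological lesion segmentation, and introduces the expert-designed Dermatology Assessment Schema (DAS) to advance patient-centered dermatological vision-language modeling.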

Details Motivation: Existing dermatology image datasets mostly focus on dermatoscopic images and lack patient-authored queries and clinical context, limiting their use in patient-centered care. DermaVQA-DAS is proposed to fill this gap. Method: The Dermatology Assessment Schema (DAS), comprising 36 high-level and 27 fine-grained assessment questions, is used for structured annotation; expert-annotated datasets built on DAS support closed-ended question answering and lesion segmentation, and several multimodal models and prompting strategies are evaluated. Result: In segmentation, prompting strategy affects performance; an augmented prompt combining the patient query title and content performs best under majority-vote-based microscore evaluation (Jaccard index 0.395, Dice score 0.566). In closed-ended QA, o3 achieves the highest accuracy (0.798), closely followed by GPT-4.1 (0.796), and Gemini-1.5-Pro performs competitively within the Gemini family (0.783). Conclusion: DermaVQA-DAS and the DAS framework provide a standardized, structured means of evaluation for patient-centered dermatological analysis, support the development of multimodal models in clinically relevant settings, and are publicly released for follow-up research. Abstract: Recent advances in dermatological image analysis have been driven by large-scale annotated datasets; however, most existing benchmarks focus on dermatoscopic images and lack patient-authored queries and clinical context, limiting their applicability to patient-centered care. To address this gap, we introduce DermaVQA-DAS, an extension of the DermaVQA dataset that supports two complementary tasks: closed-ended question answering (QA) and dermatological lesion segmentation. Central to this work is the Dermatology Assessment Schema (DAS), a novel expert-developed framework that systematically captures clinically meaningful dermatological features in a structured and standardized form. DAS comprises 36 high-level and 27 fine-grained assessment questions, with multiple-choice options in English and Chinese. Leveraging DAS, we provide expert-annotated datasets for both closed QA and segmentation and benchmark state-of-the-art multimodal models. For segmentation, we evaluate multiple prompting strategies and show that prompt design impacts performance: the default prompt achieves the best results under Mean-of-Max and Mean-of-Mean evaluation aggregation schemes, while an augmented prompt incorporating both patient query title and content yields the highest performance under majority-vote-based microscore evaluation, achieving a Jaccard index of 0.395 and a Dice score of 0.566 with BiomedParse. 
For closed-ended QA, overall performance is strong across models, with average accuracies ranging from 0.729 to 0.798; o3 achieves the best overall accuracy (0.798), closely followed by GPT-4.1 (0.796), while Gemini-1.5-Pro shows competitive performance within the Gemini family (0.783). We publicly release DermaVQA-DAS, the DAS schema, and evaluation protocols to support and accelerate future research in patient-centered dermatological vision-language modeling (https://osf.io/72rp3).
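
The two segmentation scores quoted in this abstract can be computed from binary masks as in the sketch below. Masks are assumed to be flat sequences of 0/1 values, and scoring two empty masks as a perfect match is a convention chosen for the example rather than something the paper specifies.

```python
def jaccard_and_dice(pred, target):
    """Return (Jaccard index, Dice score) for two flat binary masks."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    pred_sum = sum(1 for p in pred if p)
    target_sum = sum(1 for t in target if t)
    union = pred_sum + target_sum - inter
    if union == 0:
        # Both masks empty: treated as perfect agreement (assumed convention).
        return 1.0, 1.0
    jaccard = inter / union
    dice = 2 * inter / (pred_sum + target_sum)
    return jaccard, dice
```

For one set of counts the two scores are tied by Dice = 2J / (1 + J). Plugging in J = 0.395 gives about 0.566, which matches the reported pair and is what one would expect if pixel counts are pooled before scoring.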

[125] Forging Spatial Intelligence: A Roadmap of Multi-Modal Data Pre-Training for Autonomous Systems

Song Wang,Lingdong Kong,Xiaolu Liu,Hao Shi,Wentong Li,Jianke Zhu,Steven C. H. Hoi

Main category: cs.CV

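TL;DR: This paper presents a unified pre-training framework for multi-modal sensor data aimed at Spatial Intelligence in autonomous systems, spanning single-modality foundations to unified models that incorporate text and occupancy representations, and proposes a roadmap toward general-purpose multi-modal foundation models.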

Details Motivation: Existing foundation models excel within a single modality but struggle to fuse multi-modal sensor data such as cameras and LiDAR into a unified spatial understanding, which limits the perception and decision-making of autonomous systems in complex environments. Method: The paper presents a unified multi-modal pre-training framework and taxonomy, analyzes the interplay between sensor characteristics and learning strategies, evaluates the role of platform-specific datasets, and explores integrating textual inputs and occupancy representations to support open-world perception and planning. Result: A taxonomy of pre-training paradigms is established, from single-modality baselines to unified multi-modal frameworks targeting tasks such as 3D object detection and semantic occupancy prediction, and key bottlenecks such as computational efficiency and model scalability are identified. Conclusion: Systematic multi-modal pre-training is a promising route to robust Spatial Intelligence and to deploying general-purpose multi-modal foundation models in real-world settings. Abstract: The rapid advancement of autonomous systems, including self-driving vehicles and drones, has intensified the need to forge true Spatial Intelligence from multi-modal onboard sensor data. While foundation models excel in single-modal contexts, integrating their capabilities across diverse sensors like cameras and LiDAR to create a unified understanding remains a formidable challenge. This paper presents a comprehensive framework for multi-modal pre-training, identifying the core set of techniques driving progress toward this goal. We dissect the interplay between foundational sensor characteristics and learning strategies, evaluating the role of platform-specific datasets in enabling these advancements. Our central contribution is the formulation of a unified taxonomy for pre-training paradigms: ranging from single-modality baselines to sophisticated unified frameworks that learn holistic representations for advanced tasks like 3D object detection and semantic occupancy prediction. Furthermore, we investigate the integration of textual inputs and occupancy representations to facilitate open-world perception and planning. Finally, we identify critical bottlenecks, such as computational efficiency and model scalability, and propose a roadmap toward general-purpose multi-modal foundation models capable of achieving robust Spatial Intelligence for real-world deployment.

[126] RedunCut: Measurement-Driven Sampling and Accuracy Performance Modeling for Low-Cost Live Video Analytics

Gur-Eyal Sela,Kumar Krishna Agrawal,Bharathan Balaji,Joseph Gonzalez,Ion Stoica

Main category: cs.CV

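TL;DR: This paper proposes RedunCut, a dynamic model size selection (DMSS) system that uses a measurement-driven planner and a lightweight performance model to reduce the compute cost of video analytics.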

Details Motivation: Existing DMSS methods fall short in sampling efficiency and accuracy prediction, generalize poorly to mobile videos and lower accuracy targets in particular, and face difficult accuracy prediction because ground-truth labels are unavailable at runtime. Method: RedunCut uses a measurement-driven planner to estimate the cost-benefit tradeoff of sampling and a lightweight, data-driven performance model to improve per-segment accuracy prediction, thereby improving model size selection. Result: Across road-vehicle, drone, and surveillance videos and multiple model families and tasks, RedunCut reduces compute cost by 14-62% at fixed accuracy and is robust to limited historical data and to data drift. Conclusion: Through more efficient sampling and more accurate accuracy prediction, RedunCut substantially lowers the inference cost of large-scale video analytics while remaining general and stable. Abstract: Live video analytics (LVA) runs continuously across massive camera fleets, but inference cost with modern vision models remains high. To address this, dynamic model size selection (DMSS) is an attractive approach: it is content-aware but treats models as black boxes, and could potentially reduce cost by up to 10x without model retraining or modification. Without ground truth labels at runtime, we observe that DMSS methods use two stages per segment: (i) sampling a few models to calculate prediction statistics (e.g., confidences), then (ii) selection of the model size from those statistics. Prior systems fail to generalize to diverse workloads, particularly to mobile videos and lower accuracy targets. We identify that the failure modes stem from inefficient sampling whose cost exceeds its benefit, and inaccurate per-segment accuracy prediction. In this work, we present RedunCut, a new DMSS system that addresses both: It uses a measurement-driven planner that estimates the cost-benefit tradeoff of sampling, and a lightweight, data-driven performance model to improve accuracy prediction. Across road-vehicle, drone, and surveillance videos and multiple model families and tasks, RedunCut reduces compute cost by 14-62% at fixed accuracy and remains robust to limited historical data and to drift.

[127] DyStream: Streaming Dyadic Talking Heads Generation via Flow Matching-based Autoregressive Model

Bohong Chen,Haiyang Liu

Main category: cs.CV

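TL;DR: This paper proposes DyStream, a flow matching-based autoregressive model for low-latency, highly realistic dyadic talking-head video generation, supporting real-time lip sync and natural non-verbal feedback.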

Details Motivation: Existing chunk-based non-causal methods suffer from high latency and cannot deliver the immediate non-verbal feedback needed in real conversation, limiting their use in dyadic interaction. Method: A flow matching-based autoregressive framework is combined with a causal encoder and a lookahead module (e.g., 60 ms of future context) to improve generation quality while keeping latency low. Result: Each frame is generated within 34 ms and total system latency stays under 100 ms; offline and online lip-sync confidence scores on HDTF reach 8.13 and 7.61, respectively, outperforming existing causal methods. Conclusion: DyStream achieves high-quality, ultra-low-latency dyadic conversational video generation, offering an effective solution for real-time interactive virtual dialogue systems. Abstract: Generating realistic, dyadic talking head video requires ultra-low latency. Existing chunk-based methods require full non-causal context windows, introducing significant delays. This high latency critically prevents the immediate, non-verbal feedback required for a realistic listener. To address this, we present DyStream, a flow matching-based autoregressive model that can generate video in real time from both speaker and listener audio. Our method contains two key designs: (1) we adopt a stream-friendly autoregressive framework with flow-matching heads for probabilistic modeling, and (2) we propose a causal encoder enhanced by a lookahead module to incorporate short future context (e.g., 60 ms) to improve quality while maintaining low latency. Our analysis shows this simple-and-effective method significantly surpasses alternative causal strategies, including distillation and generative encoder. Extensive experiments show that DyStream can generate video within 34 ms per frame, guaranteeing the entire system latency remains under 100 ms. Besides, it achieves state-of-the-art lip-sync quality, with offline and online LipSync Confidence scores of 8.13 and 7.61 on HDTF, respectively. The model, weights and codes are available.

[128] AI-Driven Evaluation of Surgical Skill via Action Recognition

Yan Meng,Daniel A. Donoho,Marcelle Altshuler,Omar Arnaout

Main category: cs.CV

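TL;DR: This paper proposes an AI-based framework for assessing microanastomosis performance that combines an improved TimeSformer model with YOLO object detection to automatically recognize actions in surgical videos and evaluate skill.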

Details Motivation: Conventional surgical skill assessment relies on expert observation, which is subjective, time-consuming, and resource-intensive, and is especially hard to scale in resource-limited regions; an objective, scalable automated method is needed. Method: A TimeSformer architecture with hierarchical temporal attention and weighted spatial attention is used for action recognition, combined with YOLO-based object detection and tracking to extract fine-grained motion features; microanastomosis skill is assessed along five aspects. Result: On a dataset of 58 expert-annotated videos, frame-level action segmentation accuracy is 87.7% (93.62% after post-processing), and average classification accuracy across skill aspects is 76%. Conclusion: The system can provide objective, consistent, and interpretable feedback, with the potential to make surgical education more standardized and data-driven. Abstract: The development of effective training and evaluation strategies is critical. Conventional methods for assessing surgical proficiency typically rely on expert supervision, either through onsite observation or retrospective analysis of recorded procedures. However, these approaches are inherently subjective, susceptible to inter-rater variability, and require substantial time and effort from expert surgeons. These demands are often impractical in low- and middle-income countries, thereby limiting the scalability and consistency of such methods across training programs. To address these limitations, we propose a novel AI-driven framework for the automated assessment of microanastomosis performance. The system integrates a video transformer architecture based on TimeSformer, improved with hierarchical temporal attention and weighted spatial attention mechanisms, to achieve accurate action recognition within surgical videos. Fine-grained motion features are then extracted using a YOLO-based object detection and tracking method, allowing for detailed analysis of instrument kinematics. Performance is evaluated along five aspects of microanastomosis skill, including overall action execution, motion quality during procedure-critical actions, and general instrument handling. Experimental validation using a dataset of 58 expert-annotated videos demonstrates the effectiveness of the system, achieving 87.7% frame-level accuracy in action segmentation that increased to 93.62% with post-processing, and an average classification accuracy of 76% in replicating expert assessments across all skill aspects. 
These findings highlight the system's potential to provide objective, consistent, and interpretable feedback, thereby enabling more standardized, data-driven training and evaluation in surgical education.

[129] Exploring Compositionality in Vision Transformers using Wavelet Representations

Akshad Shyam Purushottamdas,Pranav K Nayak,Divya Mehul Rajparia,Deekshith Patel,Yashmitha Gogineni,Konda Reddy Mopuri,Sumohana S. Channappayya

Main category: cs.CV

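TL;DR: This paper proposes a new framework that uses the Discrete Wavelet Transform (DWT) to obtain input-dependent primitives for vision tasks, in order to study compositionality in the representation space of the Vision Transformer (ViT) encoder. Experiments show that primitives from a one-level DWT decomposition approximately compose in latent space, shedding new light on how ViTs organize information.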

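Details Motivation: Although much of what is understood about transformer models comes from analyzing language tasks, how they learn representations in vision tasks remains unclear, and compositionality in ViTs in particular has not been studied systematically. This work asks whether the ViT encoder exhibits compositional structure in its representation space. Method: A framework analogous to existing compositionality measures in representation learning is introduced; the Discrete Wavelet Transform (DWT) extracts input-dependent primitives from images, and the degree of compositionality in the ViT is quantified by testing whether representations composed from these primitives reproduce the representation of the original image. Result: Experiments show that primitives from a one-level DWT decomposition approximately compose in the ViT encoder's representation space, that is, the composed representation closely reproduces the original image's representation, indicating a degree of compositionality in the ViT latent space. Conclusion: The ViT representation space supports compositional structure to some extent, and the DWT is an effective way to extract primitives; the finding offers a new perspective on how ViTs organize and process visual information. Abstract: While insights into the workings of the transformer model have largely emerged by analysing their behaviour on language tasks, this work investigates the representations learnt by the Vision Transformer (ViT) encoder through the lens of compositionality. We introduce a framework, analogous to prior work on measuring compositionality in representation learning, to test for compositionality in the ViT encoder. Crucial to drawing this analogy is the Discrete Wavelet Transform (DWT), which is a simple yet effective tool for obtaining input-dependent primitives in the vision setting. By examining the ability of composed representations to reproduce original image representations, we empirically test the extent to which compositionality is respected in the representation space. Our findings show that primitives from a one-level DWT decomposition produce encoder representations that approximately compose in latent space, offering a new perspective on how ViTs structure information.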
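
One way to picture the test described in this abstract is sketched below. It splits an image into four image-space primitives, one per subband of a one-level Haar DWT (up to scaling), and measures how far an encoder's output for the whole image is from the sum of its outputs for the primitives. Using the Haar wavelet, composing by plain summation, and the `encode` callable are all assumptions made for illustration; the paper's composition function may be learned and its wavelet may differ.

```python
# Sign patterns of the four Haar subbands on a 2x2 block (names are illustrative).
SIGNS = {
    "approx": ((1, 1), (1, 1)),
    "col_diff": ((1, -1), (1, -1)),
    "row_diff": ((1, 1), (-1, -1)),
    "diag": ((1, -1), (-1, 1)),
}

def haar_primitives(image):
    """Split an even-sized 2D image into four primitives that sum back to it."""
    h, w = len(image), len(image[0])
    prims = {name: [[0.0] * w for _ in range(h)] for name in SIGNS}
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            for name, s in SIGNS.items():
                # Subband coefficient for this 2x2 block.
                coeff = sum(
                    s[di][dj] * image[i + di][j + dj]
                    for di in range(2) for dj in range(2)
                ) / 4.0
                # Inverse transform of that single subband, written in place.
                for di in range(2):
                    for dj in range(2):
                        prims[name][i + di][j + dj] = coeff * s[di][dj]
    return prims

def composition_error(encode, image):
    """Squared distance between encode(image) and the sum of encode(primitive)."""
    target = encode(image)
    parts = [encode(p) for p in haar_primitives(image).values()]
    composed = [sum(vals) for vals in zip(*parts)]
    return sum((t - c) ** 2 for t, c in zip(target, composed))
```

A linear encoder gives zero error by construction, so any non-zero error reflects non-linearity in the encoder; the abstract's finding is that for a ViT this gap is small in an appropriate sense.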

[130] Spectral and Spatial Graph Learning for Multispectral Solar Image Compression

Prasiddha Siwakoti,Atefeh Khoshkhahtinat,Piyush M. Mehta,Barbara J. Thompson,Michael S. F. Kirk,Daniel da Silva

Main category: cs.CV

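TL;DR: Proposes a multispectral image compression framework for solar observations that combines graph embedding with attention mechanisms, substantially reducing data volume while maintaining high fidelity.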

Details Motivation: In bandwidth-limited space missions, balancing the compression efficiency of multispectral solar imagery against preservation of fine spectral and spatial detail is a challenge. Method: The iSWGE module models inter-band relationships by representing spectral channels as graph nodes with learned edge features; the WSGA-C module combines sparse graph attention with convolutional attention to reduce spatial redundancy and emphasize fine-scale structures. Result: On the SDOML dataset, compared with strong learned baselines, the method reduces MSID by 20.15%, improves PSNR by up to 1.09%, and raises log MS-SSIM by 1.62%, yielding sharper reconstructions with higher spectral fidelity. Conclusion: The method achieves better compression performance at comparable bit rates and suits solar observation missions that require high-fidelity transmission. Abstract: High-fidelity compression of multispectral solar imagery remains challenging for space missions, where limited bandwidth must be balanced against preserving fine spectral and spatial details. We present a learned image compression framework tailored to solar observations, leveraging two complementary modules: (1) the Inter-Spectral Windowed Graph Embedding (iSWGE), which explicitly models inter-band relationships by representing spectral channels as graph nodes with learned edge features; and (2) the Windowed Spatial Graph Attention and Convolutional Block Attention (WSGA-C), which combines sparse graph attention with convolutional attention to reduce spatial redundancy and emphasize fine-scale structures. Evaluations on the SDOML dataset across six extreme ultraviolet (EUV) channels show that our approach achieves a 20.15% reduction in Mean Spectral Information Divergence (MSID), up to 1.09% PSNR improvement, and a 1.62% log transformed MS-SSIM gain over strong learned baselines, delivering sharper and spectrally faithful reconstructions at comparable bits-per-pixel rates. The code is publicly available at https://github.com/agyat4/sgraph .
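
The spectral metric named in this abstract can be sketched as follows. Spectral Information Divergence is commonly defined as the symmetric Kullback-Leibler divergence between two spectra normalized to sum to one; averaging it over pixels to obtain MSID is an assumption about how the paper aggregates, and the `eps` clamp is added only to keep the logarithm finite.

```python
import math

def spectral_information_divergence(x, y, eps=1e-12):
    """Symmetric KL divergence between two non-negative spectra (one pixel each)."""
    sx, sy = sum(x), sum(y)
    # Normalize each spectrum to a probability-like vector, clamped away from zero.
    p = [max(v / sx, eps) for v in x]
    q = [max(v / sy, eps) for v in y]
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

def mean_sid(reference_pixels, reconstructed_pixels):
    """Average SID over paired per-pixel spectra (assumed aggregation for MSID)."""
    pairs = list(zip(reference_pixels, reconstructed_pixels))
    return sum(spectral_information_divergence(a, b) for a, b in pairs) / len(pairs)
```

Because each spectrum is normalized first, the measure ignores overall brightness and responds only to changes in the relative balance between channels, which is why it complements PSNR and MS-SSIM.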

[131] F2IDiff: Real-world Image Super-resolution using Feature to Image Diffusion Foundation Model

Devendra K. Jangid,Ripon K. Saha,Dilshan Godaliyadda,Jing Li,Seok-Jun Lee,Hamid R. Sheikh

Main category: cs.CV

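TL;DR: This paper proposes F2IDiff, an image super-resolution method conditioned on low-level DINOv2 features, to reduce hallucination during generation; it is aimed particularly at high-fidelity smartphone photography.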

Details Motivation: Existing text-to-image diffusion models tend to produce unwanted hallucinations in single image super-resolution, and text features struggle to describe fine textures, so they perform poorly on high-resolution smartphone images in particular. Method: A diffusion model conditioned on low-level DINOv2 features (F2IDiff) replaces conventional text conditioning with richer, stricter low-level features, improving reconstruction accuracy on small patches. Result: The method reduces hallucination while maintaining high fidelity, and outperforms existing text-to-image diffusion models particularly on LR images that are already close to the true high-resolution image. Conclusion: Diffusion models conditioned on low-level features such as DINOv2 achieve high-quality, low-hallucination single image super-resolution more effectively and are better suited to consumer photography. Abstract: With the advent of Generative AI, Single Image Super-Resolution (SISR) quality has seen substantial improvement, as the strong priors learned by Text-2-Image Diffusion (T2IDiff) Foundation Models (FM) can bridge the gap between High-Resolution (HR) and Low-Resolution (LR) images. However, flagship smartphone cameras have been slow to adopt generative models because strong generation can lead to undesirable hallucinations. For substantially degraded LR images, as seen in academia, strong generation is required and hallucinations are more tolerable because of the wide gap between LR and HR images. In contrast, in consumer photography, the LR image has substantially higher fidelity, requiring only minimal hallucination-free generation. We hypothesize that generation in SISR is controlled by the stringency and richness of the FM's conditioning feature. First, text features are high-level features, which often cannot describe subtle textures in an image. Additionally, smartphone LR images are at least 12 MP, whereas SISR networks built on T2IDiff FM are designed to perform inference on much smaller images (<1 MP). As a result, SISR inference has to be performed on small patches, which often cannot be accurately described by text features. To address these shortcomings, we introduce an SISR network built on an FM with lower-level feature conditioning, specifically DINOv2 features, which we call a Feature-to-Image Diffusion (F2IDiff) Foundation Model (FM). Lower-level features provide stricter conditioning while being rich descriptors of even small patches.

[132] Using Large Language Models To Translate Machine Results To Human Results

Trishna Niraula,Jonathan Stubblefield

Main category: cs.CV

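TL;DR: This study presents a pipeline that combines YOLOv5 and YOLOv8 for chest X-ray anomaly detection with a large language model (LLM) that generates natural-language radiology reports, and evaluates detection accuracy, inference latency, and the quality of the generated text.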

Details Motivation: To bridge the gap between the structured predictions output by computer vision systems and the full narrative reports radiologists need, making AI more practical in clinical workflows. Method: YOLOv5 and YOLOv8 detect lesions; the extracted bounding boxes and class labels are passed to a large language model (such as GPT-4), which generates descriptive findings and clinical summaries; generated text is evaluated with cosine similarity and human ratings. Result: Both YOLOv5 and YOLOv8 achieve high detection accuracy, and generated reports show strong semantic similarity to ground-truth reports; in human evaluation GPT-4 scores high on clarity (4.88/5) but lower on natural writing flow (2.81/5). Conclusion: The pipeline can generate clinically accurate radiology reports, but in writing style and natural flow the current system still falls short of human-written reports. Abstract: Artificial intelligence (AI) has transformed medical imaging, with computer vision (CV) systems achieving state-of-the-art performance in classification and detection tasks. However, these systems typically output structured predictions, leaving radiologists responsible for translating results into full narrative reports. Recent advances in large language models (LLMs), such as GPT-4, offer new opportunities to bridge this gap by generating diagnostic narratives from structured findings. This study introduces a pipeline that integrates YOLOv5 and YOLOv8 for anomaly detection in chest X-ray images with a large language model (LLM) to generate natural-language radiology reports. The YOLO models produce bounding-box predictions and class labels, which are then passed to the LLM to generate descriptive findings and clinical summaries. YOLOv5 and YOLOv8 are compared in terms of detection accuracy, inference latency, and the quality of generated text, as measured by cosine similarity to ground-truth reports. Results show strong semantic similarity between AI and human reports, while human evaluation reveals GPT-4 excels in clarity (4.88/5) but exhibits lower scores for natural writing flow (2.81/5), indicating that current systems achieve clinical accuracy but remain stylistically distinguishable from radiologist-authored text.

[133] Hierarchical Vector-Quantized Latents for Perceptual Low-Resolution Video Compression

Manikanta Kotthapalli,Banafsheh Rekabdar

Main category: cs.CV

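TL;DR: Proposes a Multi-Scale Vector Quantized Variational Autoencoder (MS-VQ-VAE) for low-resolution video that produces compact, high-fidelity latent representations suited to efficient storage, transmission, and decoding on edge devices.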

Details Motivation: Traditional video codecs such as H.264 and HEVC are designed mainly for pixel-domain reconstruction, lack native support for machine-learning latent representations, are hard to integrate into deep learning pipelines, and are not efficient enough under growing bandwidth and storage pressure. Method: Extends the VQ-VAE-2 framework to a multi-scale spatiotemporal structure, building a two-level hierarchical latent representation with 3D residual convolutions and adding a perceptual loss based on a pre-trained VGG16 to improve reconstruction quality; the model is lightweight (about 18.5M parameters) and optimized for 64x64 video clips. Result: Trained on UCF101 with 2-second clips, the model reaches 25.96 dB PSNR and 0.8375 SSIM on the test set, and improves over the single-scale baseline by 1.41 dB PSNR and 0.0248 SSIM on the validation set. Conclusion: The proposed MS-VQ-VAE framework suits scalable video compression in bandwidth-constrained scenarios, including real-time streaming, mobile video analytics, and CDN-level storage optimization. Abstract: The exponential growth of video traffic has placed increasing demands on bandwidth and storage infrastructure, particularly for content delivery networks (CDNs) and edge devices. While traditional video codecs like H.264 and HEVC achieve high compression ratios, they are designed primarily for pixel-domain reconstruction and lack native support for machine learning-centric latent representations, limiting their integration into deep learning pipelines. In this work, we present a Multi-Scale Vector Quantized Variational Autoencoder (MS-VQ-VAE) designed to generate compact, high-fidelity latent representations of low-resolution video, suitable for efficient storage, transmission, and client-side decoding. Our architecture extends the VQ-VAE-2 framework to a spatiotemporal setting, introducing a two-level hierarchical latent structure built with 3D residual convolutions. The model is lightweight (approximately 18.5M parameters) and optimized for 64x64 resolution video clips, making it appropriate for deployment on edge devices with constrained compute and memory resources. To improve perceptual reconstruction quality, we incorporate a perceptual loss derived from a pre-trained VGG16 network. Trained on the UCF101 dataset using 2-second video clips (32 frames at 16 FPS), our model achieves 25.96 dB PSNR and 0.8375 SSIM on the test set. On validation, our model improves over the single-scale baseline by 1.41 dB PSNR and 0.0248 SSIM. The proposed framework is well-suited for scalable video compression in bandwidth-sensitive scenarios, including real-time streaming, mobile video analytics, and CDN-level storage optimization.
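
To put the reported PSNR figures in context, the following is a minimal sketch of the standard PSNR definition in plain Python. It is not the authors' evaluation code; the function names `psnr` and `mse_ratio_from_gain` and the 8-bit peak value of 255 are assumptions made for illustration.

```python
import math


def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences.

    max_value is the peak pixel value (255 assumes 8-bit data).
    """
    if len(reference) != len(reconstruction) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_from_gain(gain_db):
    """Factor by which MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (gain_db / 10.0)
```

Because PSNR is logarithmic in the mean squared error, the 1.41 dB gain over the single-scale baseline corresponds to an MSE roughly 1.38 times smaller, which `mse_ratio_from_gain(1.41)` reproduces.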

[134] PhyGDPO: Physics-Aware Groupwise Direct Preference Optimization for Physically Consistent Text-to-Video Generation

Yuanhao Cai,Kunpeng Li,Menglin Jia,Jialiang Wang,Junzhe Sun,Feng Liang,Weifeng Chen,Felix Juefei-Xu,Chu Wang,Ali Thabet,Xiaoliang Dai,Xuan Ju,Alan Yuille,Ji Hou

Main category: cs.CV

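TL;DR: This paper proposes PhyGDPO, a new approach to text-to-video generation that substantially improves the physical consistency of generated videos through a physics-augmented data construction pipeline and a physics-aware preference optimization framework.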

Details Motivation: Existing text-to-video generation methods generalize poorly to complex physical interactions, and training data rich in physical phenomena is scarce. Method: Proposes the PhyAugPipe data construction pipeline, which uses a vision-language model with chain-of-thought reasoning to build PhyVidGen-135K, a large-scale dataset of physical-interaction videos; and designs the PhyGDPO framework, which performs physics-aware preference learning on the groupwise Plackett-Luce model, with a VLM-based Physics-Guided Rewarding (PGR) scheme and an efficient LoRA-Switch Reference (LoRA-SR) training strategy. Result: Experiments show the method significantly outperforms current open-source state-of-the-art methods on the physics-aware benchmarks PhyGenBench and VideoPhy2. Conclusion: By combining physics knowledge with large-scale data construction, PhyGDPO improves the physical plausibility of generated videos and offers a scalable, efficient solution for physically consistent video generation. Abstract: Recent advances in text-to-video (T2V) generation have achieved good visual quality, yet synthesizing videos that faithfully follow physical laws remains an open challenge. Existing methods, mainly based on graphics or prompt extension, struggle to generalize beyond simple simulated environments or learn implicit physical reasoning. The scarcity of training data with rich physics interactions and phenomena is also a problem. In this paper, we first introduce a Physics-Augmented video data construction Pipeline, PhyAugPipe, that leverages a vision-language model (VLM) with chain-of-thought reasoning to collect a large-scale training dataset, PhyVidGen-135K. Then we formulate a principled Physics-aware Groupwise Direct Preference Optimization, PhyGDPO, framework that builds upon the groupwise Plackett-Luce probabilistic model to capture holistic preferences beyond pairwise comparisons. In PhyGDPO, we design a Physics-Guided Rewarding (PGR) scheme that embeds VLM-based physics rewards to steer optimization toward physical consistency. We also propose a LoRA-Switch Reference (LoRA-SR) scheme that eliminates memory-heavy reference duplication for efficient training. Experiments show that our method significantly outperforms state-of-the-art open-source methods on PhyGenBench and VideoPhy2. Please check our project page at https://caiyuanhao1998.github.io/project/PhyGDPO for more video results. Our code, models, and data will be released at https://github.com/caiyuanhao1998/Open-PhyGDPO
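
The groupwise Plackett-Luce model mentioned above generalizes pairwise comparison to a full ranking of a group. The sketch below shows only the textbook Plackett-Luce log-likelihood for a ranked list of scalar scores; it is not the PhyGDPO objective, and the function name and the use of raw scores as utilities are assumptions for illustration.

```python
import math


def plackett_luce_log_likelihood(scores_in_preferred_order):
    """Log-probability of a ranking under the Plackett-Luce model.

    scores_in_preferred_order lists one real-valued score per item,
    from most preferred to least preferred.
    """
    scores = list(scores_in_preferred_order)
    log_prob = 0.0
    # At each position, the chosen item competes against all items not yet placed.
    for i in range(len(scores) - 1):  # the last remaining item is chosen with probability 1
        remaining = scores[i:]
        m = max(remaining)  # subtract the max for a numerically stable log-sum-exp
        log_norm = m + math.log(sum(math.exp(s - m) for s in remaining))
        log_prob += scores[i] - log_norm
    return log_prob
```

With a group of two items this reduces to the familiar pairwise (Bradley-Terry) form, which is why a groupwise objective can be read as an extension of pairwise preference optimization.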

[135] OCP-LS: An Efficient Algorithm for Visual Localization

Jindi Zhong,Hongxia Wang,Huanshui Zhang

Main category: cs.CV

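TL;DR: Proposes a new second-order optimization algorithm for large-scale optimization problems in deep learning that performs well on multiple visual localization benchmarks.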

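Details Motivation: Existing optimization algorithms in deep learning converge slowly, train unstably, and are sensitive to noise on large-scale problems, which calls for a more efficient optimizer. Method: Designs a new second-order optimization algorithm that incorporates the OCP method and appropriately approximates the diagonal elements of the Hessian matrix. Result: Experiments on multiple standard visual localization benchmarks show faster convergence, more stable training, and greater robustness to noise than conventional algorithms, with competitive localization accuracy. Conclusion: The method has clear advantages on large-scale deep learning optimization problems and suits visual localization tasks in high-noise settings. Abstract: This paper proposes a novel second-order optimization algorithm. It aims to address large-scale optimization problems in deep learning by incorporating the OCP method and appropriately approximating the diagonal elements of the Hessian matrix. Extensive experiments on multiple standard visual localization benchmarks demonstrate the significant superiority of the proposed method. Compared with conventional optimization algorithms, our framework achieves competitive localization accuracy while exhibiting faster convergence, enhanced training stability, and improved robustness to noise interference.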

[136] RGBT-Ground Benchmark: Visual Grounding Beyond RGB in Complex Real-World Scenarios

Tianyi Zhao,Jiawen Xi,Linhui Xiao,Junnan Li,Xue Yang,Maoxun Yuan,Xingxing Wei

Main category: cs.CV

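TL;DR: This paper presents RGBT-Ground, the first large-scale visual grounding benchmark for complex real-world scenarios, with aligned RGB and thermal infrared image pairs and high-quality annotations, and proposes a unified framework and the RGBT-VGNet model for robust multi-modal visual grounding.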

Details Motivation: Existing visual grounding benchmarks are mostly built on datasets collected in clean environments and do not cover complex real-world conditions such as illumination and weather changes, making it hard to assess model robustness and generalization in safety-critical applications. Method: Builds the RGBT-Ground benchmark, comprising RGB and thermal infrared image pairs, referring expressions, bounding boxes, and fine-grained scene annotations; designs a unified framework supporting uni-modal (RGB or TIR) and multi-modal (RGB-TIR) inputs; and proposes RGBT-VGNet to fuse complementary modality information. Result: Existing methods are extensively adapted to RGBT-Ground, and RGBT-VGNet significantly outperforms them, particularly in nighttime and long-distance scenarios. Conclusion: RGBT-Ground provides a new benchmark for robust visual grounding in complex real-world environments, and RGBT-VGNet improves grounding performance by effectively fusing multi-modal information, especially in challenging scenes. Abstract: Visual Grounding (VG) aims to localize specific objects in an image according to natural language expressions, serving as a fundamental task in vision-language understanding. However, existing VG benchmarks are mostly derived from datasets collected under clean environments, such as COCO, where scene diversity is limited. Consequently, they fail to reflect the complexity of real-world conditions, such as changes in illumination and weather, that are critical to evaluating model robustness and generalization in safety-critical applications. To address these limitations, we present RGBT-Ground, the first large-scale visual grounding benchmark built for complex real-world scenarios. It consists of spatially aligned RGB and Thermal infrared (TIR) image pairs with high-quality referring expressions, corresponding object bounding boxes, and fine-grained annotations at the scene, environment, and object levels. This benchmark enables comprehensive evaluation and facilitates the study of robust grounding under diverse and challenging conditions. Furthermore, we establish a unified visual grounding framework that supports both uni-modal (RGB or TIR) and multi-modal (RGB-TIR) visual inputs. Based on it, we propose RGBT-VGNet, a simple yet effective baseline for fusing complementary visual modalities to achieve robust grounding. We extensively adapt existing methods to RGBT-Ground. Experimental results show that our proposed RGBT-VGNet significantly outperforms these adapted methods, particularly in nighttime and long-distance scenarios. All resources will be publicly released to promote future research on robust visual grounding in complex real-world environments.

[137] Improving Few-Shot Change Detection Visual Question Answering via Decision-Ambiguity-guided Reinforcement Fine-Tuning

Fuyu Dong,Ke Li,Di Wang,Nan Luo,Yiming Zhang,Kaiyu Li,Jianfei Yang,Quan Wang

Main category: cs.CV

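TL;DR: This paper proposes DARFT, a reinforcement fine-tuning framework that targets decision ambiguity in change detection visual question answering (CDVQA); it mines decision-ambiguous samples and applies group-relative policy optimization to them, improving the model's discriminability and robustness.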

Details Motivation: After supervised fine-tuning, existing CDVQA models still suffer from decision ambiguity, where the correct answer and strong distractors receive similar confidence, which hurts performance. Method: Proposes the DARFT framework, which first mines Decision-Ambiguous Samples (DAS) with an SFT-trained reference policy and then applies group-relative policy optimization on that subset, using multi-sample decoding and intra-group relative advantages. Result: Experiments show DARFT clearly outperforms SFT baselines in both full-data and few-shot settings, especially in reducing decision ambiguity and sharpening decision boundaries. Conclusion: Explicitly optimizing decision-ambiguous samples is crucial for improving CDVQA models, and DARFT offers an effective approach that needs no additional supervision. Abstract: Change detection visual question answering (CDVQA) requires answering text queries by reasoning about semantic changes in bi-temporal remote sensing images. A straightforward approach is to boost CDVQA performance with generic vision-language models via supervised fine-tuning (SFT). Despite recent progress, we observe that a significant portion of failures do not stem from clearly incorrect predictions, but from decision ambiguity, where the model assigns similar confidence to the correct answer and strong distractors. To formalize this challenge, we define Decision-Ambiguous Samples (DAS) as instances with a small probability margin between the ground-truth answer and the most competitive alternative. We argue that explicitly optimizing DAS is crucial for improving the discriminability and robustness of CDVQA models. To this end, we propose DARFT, a Decision-Ambiguity-guided Reinforcement Fine-Tuning framework that first mines DAS using an SFT-trained reference policy and then applies group-relative policy optimization on the mined subset. By leveraging multi-sample decoding and intra-group relative advantages, DARFT suppresses strong distractors and sharpens decision boundaries without additional supervision. Extensive experiments demonstrate consistent gains over SFT baselines, particularly under few-shot settings.
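
The DAS definition above (a small probability margin between the ground-truth answer and the most competitive alternative) can be sketched as a simple filter. This is an illustrative reading, not the paper's mining procedure: the sample layout, the function names, the threshold of 0.1, and the choice to treat narrowly wrong predictions as ambiguous too (via the absolute margin) are all assumptions.

```python
def decision_margin(probs, truth):
    """Probability of the ground-truth answer minus that of the strongest other answer.

    probs maps each candidate answer to the reference policy's probability.
    """
    best_other = max(p for answer, p in probs.items() if answer != truth)
    return probs[truth] - best_other


def mine_das(samples, threshold=0.1):
    """Keep samples whose margin is small in absolute value (hypothetical threshold)."""
    return [
        s for s in samples
        if abs(decision_margin(s["probs"], s["truth"])) < threshold
    ]
```

A confidently correct sample (large positive margin) is filtered out, so the later optimization stage would concentrate on cases where a distractor is nearly tied with the correct answer.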

[138] SliceLens: Fine-Grained and Grounded Error Slice Discovery for Multi-Instance Vision Tasks

Wei Zhang,Chaoqun Wang,Zixuan Guan,Sam Kao,Pengfei Zhao,Peng Wu,Sifeng He

Main category: cs.CV

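TL;DR: This paper proposes SliceLens, a hypothesis-driven framework built on LLMs and VLMs that uses grounded visual reasoning to discover fine-grained, interpretable error slices of computer vision models, and introduces FeSD, the first benchmark for fine-grained slice discovery on instance-level vision tasks.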

Details Motivation: Existing error slice discovery methods focus mainly on image classification, are hard to apply to multi-instance tasks such as detection and segmentation, and cannot identify fine-grained errors arising from complex visual relationships; existing benchmarks also carry artificial annotation bias and do not reflect real failure modes. Method: Proposes the SliceLens framework, which uses large language models and vision-language models to generate and verify diverse failure hypotheses and relies on grounded visual reasoning to identify fine-grained error slices reliably; also builds the FeSD benchmark, with expert-annotated, refined ground-truth slice labels precisely linked to local error regions. Result: On existing benchmarks and FeSD, SliceLens reaches a Precision@10 of 0.73 on FeSD versus 0.31 for the baseline (a gain of 0.42), and identifies interpretable error slices that support practical model improvement, as validated by model repair experiments. Conclusion: SliceLens enables effective, interpretable fine-grained error slice discovery across instance-level vision tasks and, together with the FeSD benchmark, provides a useful tool for more robust model evaluation. Abstract: Systematic failures of computer vision models on subsets with coherent visual patterns, known as error slices, pose a critical challenge for robust model evaluation. Existing slice discovery methods are primarily developed for image classification, limiting their applicability to multi-instance tasks such as detection, segmentation, and pose estimation. In real-world scenarios, error slices often arise from corner cases involving complex visual relationships, where existing instance-level approaches lacking fine-grained reasoning struggle to yield meaningful insights. Moreover, current benchmarks are typically tailored to specific algorithms or biased toward image classification, with artificial ground truth that fails to reflect real model failures. To address these limitations, we propose SliceLens, a hypothesis-driven framework that leverages LLMs and VLMs to generate and verify diverse failure hypotheses through grounded visual reasoning, enabling reliable identification of fine-grained and interpretable error slices. We further introduce FeSD (Fine-grained Slice Discovery), the first benchmark specifically designed for evaluating fine-grained error slice discovery across instance-level vision tasks, featuring expert-annotated and carefully refined ground-truth slices with precise grounding to local error regions. Extensive experiments on both existing benchmarks and FeSD demonstrate that SliceLens achieves state-of-the-art performance, improving Precision@10 by 0.42 (0.73 vs. 0.31) on FeSD, and identifies interpretable slices that facilitate actionable model improvements, as validated through model repair experiments.
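
For readers unfamiliar with the Precision@10 metric quoted above, here is a minimal sketch of the generic Precision@k computation. How FeSD matches a discovered slice to a ground-truth slice is not described in the abstract, so the exact-identifier matching and the function name below are assumptions for illustration only.

```python
def precision_at_k(ranked_slice_ids, true_slice_ids, k=10):
    """Fraction of the top-k ranked slices that are ground-truth error slices.

    Matching is by exact identifier here, which is a simplifying assumption.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    true_set = set(true_slice_ids)
    top_k = ranked_slice_ids[:k]
    hits = sum(1 for slice_id in top_k if slice_id in true_set)
    return hits / k
```

A single ranking can only yield multiples of 1/k, so a reported value such as 0.73 would presumably be an average over several evaluated models or tasks; that reading is an assumption, not something the abstract states.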

[139] CPJ: Explainable Agricultural Pest Diagnosis via Caption-Prompt-Judge with LLM-Judged Refinement

Wentao Zhang,Tao Fang,Lina Lu,Lifei Wang,Weihe Zhong

Main category: cs.CV

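TL;DR: Proposes CPJ, a training-free few-shot framework that improves the accuracy and interpretability of agricultural disease diagnosis through structured image captions and significantly outperforms baselines on multiple metrics.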

Details Motivation: Existing methods rely on costly supervised fine-tuning and perform poorly under domain shift, so a more robust, interpretable, training-free approach to agricultural disease diagnosis is needed. Method: Proposes the Caption-Prompt-Judge (CPJ) framework, which uses large vision-language models to generate multi-angle image captions, refines them iteratively with an LLM-as-Judge module, and feeds them into a dual-answer VQA process that supports both recognition and management decisions. Result: On CDDMBench, with captions generated by GPT-5-mini, GPT-5-Nano gains +22.7 percentage points in disease classification and +19.5 points in QA score. Conclusion: CPJ delivers high-performing agricultural disease diagnosis without fine-tuning and provides transparent, evidence-based reasoning, advancing explainable agricultural AI. Abstract: Accurate and interpretable crop disease diagnosis is essential for agricultural decision-making, yet existing methods often rely on costly supervised fine-tuning and perform poorly under domain shifts. We propose Caption-Prompt-Judge (CPJ), a training-free few-shot framework that enhances Agri-Pest VQA through structured, interpretable image captions. CPJ employs large vision-language models to generate multi-angle captions, refined iteratively via an LLM-as-Judge module, which then inform a dual-answer VQA process for both recognition and management responses. Evaluated on CDDMBench, CPJ significantly improves performance: using GPT-5-mini captions, GPT-5-Nano achieves +22.7 pp in disease classification and +19.5 points in QA score over no-caption baselines. The framework provides transparent, evidence-based reasoning, advancing robust and explainable agricultural diagnosis without fine-tuning. Our code and data are publicly available at: https://github.com/CPJ-Agricultural/CPJ-Agricultural-Diagnosis.

[140] 3D Semantic Segmentation for Post-Disaster Assessment

Nhut Le,Maryam Rahnemoonfar

Main category: cs.CV

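TL;DR: This paper builds a 3D semantic segmentation dataset for post-hurricane environments by reconstructing point clouds from UAV aerial imagery with SfM and MVS, and evaluates current state-of-the-art models on it, revealing the limitations of existing methods.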

Details Motivation: Existing deep learning models lack 3D datasets designed specifically for post-disaster environments, which limits progress in post-disaster scene understanding and response. Method: Uses UAV aerial footage of areas affected by Hurricane Ian, reconstructs 3D point clouds with SfM and MVS to build a dedicated dataset, and evaluates SOTA 3D semantic segmentation models (FPT, PTv3, and OA-CNNs) on it. Result: Experiments show that existing SOTA models perform poorly in complex post-disaster environments and adapt badly to disaster scenes. Conclusion: Dedicated 3D semantic segmentation benchmark datasets for post-disaster scenes are needed, along with segmentation algorithms suited to such environments, to improve post-disaster assessment and response. Abstract: The increasing frequency of natural disasters poses severe threats to human lives and leads to substantial economic losses. While 3D semantic segmentation is crucial for post-disaster assessment, existing deep learning models lack datasets specifically designed for post-disaster environments. To address this gap, we constructed a specialized 3D dataset from aerial footage of areas affected by Hurricane Ian (2022), captured by unmanned aerial vehicles (UAVs), employing Structure-from-Motion (SfM) and Multi-View Stereo (MVS) techniques to reconstruct 3D point clouds. We evaluated the state-of-the-art (SOTA) 3D semantic segmentation models, Fast Point Transformer (FPT), Point Transformer v3 (PTv3), and OA-CNNs on this dataset, exposing significant limitations in existing methods for disaster-stricken regions. These findings underscore the urgent need for advancements in 3D segmentation techniques and the development of specialized 3D benchmark datasets to improve post-disaster scene understanding and response.

[141] Collaborative Low-Rank Adaptation for Pre-Trained Vision Transformers

Zheng Liu,Jinchao Zhu,Gao Huang

Main category: cs.CV

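TL;DR: This paper proposes collaborative low-rank adaptation (CLoRA), which uses base-space sharing and sample-agnostic diversity enhancement (SADE) to improve learning performance while staying parameter-efficient, achieving a better performance-efficiency balance than existing methods on image and point cloud tasks.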

Details Motivation: Existing low-rank adaptation methods struggle to reconcile parameter efficiency with fine-tuning performance, either sacrificing performance or introducing too many trainable parameters. Method: Proposes CLoRA, comprising a base-space sharing mechanism (multiple low-rank modules share projection spaces) and SADE (which regularizes similarity among low-rank matrices to encourage diverse representations). Result: Experiments on multiple image and point cloud datasets show that CLoRA strikes a better balance between parameter efficiency and learning performance, and requires the fewest GFLOPs for point cloud analysis. Conclusion: Through a collaborative low-rank structure and diversity regularization, CLoRA improves both the performance and efficiency of low-rank fine-tuning and applies to a range of vision tasks. Abstract: Low-rank adaptation (LoRA) has achieved remarkable success in fine-tuning pre-trained vision transformers for various downstream tasks. Existing studies mainly focus on exploring more parameter-efficient strategies or more effective representation learning schemes. However, these methods either sacrifice fine-tuning performance or introduce excessive trainable parameters, failing to strike a balance between learning performance and parameter efficiency. To address this problem, we propose a novel tuning method named collaborative low-rank adaptation (CLoRA) in this paper. CLoRA consists of base-space sharing and sample-agnostic diversity enhancement (SADE) components. To maintain parameter efficiency while expanding the learning capacity of low-rank modules (LRMs), base-space sharing allows all LRMs to share a set of down/up-projection spaces. In CLoRA, the low-rank matrices obtained from the shared spaces collaboratively construct each LRM. Since the representations extracted by these matrices may contain redundant information, SADE is employed to regularize the similarities among them to encourage diverse representations in the training process. We conduct extensive experiments on widely used image and point cloud datasets to evaluate the performance of CLoRA. Experimental results demonstrate that CLoRA strikes a better balance between learning performance and parameter efficiency, while requiring the fewest GFLOPs for point cloud analysis, compared with the state-of-the-art methods.

[142] MoniRefer: A Real-world Large-scale Multi-modal Dataset based on Roadside Infrastructure for 3D Visual Grounding

Panquan Yang,Junfei Huang,Zongzhangbao Yin,Yingsong Hu,Anni Xu,Xinyi Luo,Xueqi Sun,Hai Wu,Sheng Ao,Zhaoxing Zhu,Chenglu Wen,Cheng Wang

Main category: cs.CV

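TL;DR: This paper introduces the new task of 3D visual grounding for outdoor monitoring scenarios and builds MoniRefer, the first large-scale real-world multi-modal dataset for it, containing about 136,000 objects and 411,000 natural-language descriptions. It also proposes Moni3DVG, an end-to-end method that fuses image appearance with point cloud geometry for multi-modal learning and performs strongly on the new task.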

Details Motivation: Existing 3D visual grounding research focuses mainly on indoor or autonomous driving scenes and lacks datasets and methods for roadside infrastructure monitoring, which limits natural-language interaction and target localization in traffic environments. Method: Builds the large-scale real-world dataset MoniRefer, containing paired point cloud-text data from complex traffic scenes at multiple intersections, and proposes the end-to-end model Moni3DVG, which fuses appearance information from images with geometric and optical information from point clouds for multi-modal feature learning and 3D object localization. Result: MoniRefer contains 136,018 objects and 411,128 language expressions, manually verified for quality; experiments show that Moni3DVG significantly outperforms existing methods on the new benchmark, and ablation studies confirm the effectiveness of each module. Conclusion: The paper advances 3D visual grounding at the roadside infrastructure level, and the proposed task, dataset, and method offer new directions and resources for multi-modal understanding of complex traffic scenes. Abstract: 3D visual grounding aims to localize the object in 3D point cloud scenes that semantically corresponds to given natural language sentences. It is critical for roadside infrastructure systems to interpret natural language and localize relevant target objects in complex traffic environments. However, most existing datasets and approaches for 3D visual grounding focus on indoor and outdoor driving scenes, while outdoor monitoring scenarios remain unexplored due to the scarcity of paired point cloud-text data captured by roadside infrastructure sensors. In this paper, we introduce a novel task of 3D Visual Grounding for Outdoor Monitoring Scenarios, which enables infrastructure-level understanding of traffic scenes beyond the ego-vehicle perspective. To support this task, we construct MoniRefer, the first real-world large-scale multi-modal dataset for roadside-level 3D visual grounding. The dataset consists of 136,018 objects with 411,128 natural language expressions collected from multiple complex traffic intersections in real-world environments. To ensure the quality and accuracy of the dataset, we manually verified all linguistic descriptions and 3D labels for objects. Additionally, we propose a new end-to-end method, named Moni3DVG, which utilizes the rich appearance information provided by images and the geometry and optical information from point clouds for multi-modal feature learning and 3D object localization. Extensive experiments and ablation studies on the proposed benchmarks demonstrate the superiority and effectiveness of our method. Our dataset and code will be released.

[143] LLHA-Net: A Hierarchical Attention Network for Two-View Correspondence Learning

Shuyuan Lin,Yu Guo,Xiao Chen,Yanjie Liang,Guobao Xiao,Feiran Huang

Main category: cs.CV

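TL;DR: This paper proposes the Layer-by-Layer Hierarchical Attention Network, a new method that improves the precision of feature point matching in computer vision and performs especially well when many outliers are present.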

Details Motivation: Large numbers of outliers in feature point matching significantly reduce matching accuracy and robustness, and extracting high-quality information when the outlier ratio is high is a key challenge. Method: Proposes a network that combines stage fusion, hierarchical extraction, and an attention mechanism; introduces a layer-by-layer channel fusion module that preserves semantic information from each stage and fuses it overall, and designs a hierarchical attention module that adaptively captures and fuses global perception and structural semantic information. Result: Experiments on the public YFCC100M and SUN3D datasets show the method outperforms several state-of-the-art approaches in outlier removal and camera pose estimation. Conclusion: The proposed network improves the precision and robustness of feature point matching, and its fusion of multi-stage features with attention strengthens resistance to outliers. Abstract: Establishing the correct correspondence of feature points is a fundamental task in computer vision. However, the presence of numerous outliers among the feature points can significantly affect the matching results, reducing the accuracy and robustness of the process. Furthermore, a challenge arises when dealing with a large proportion of outliers: how to ensure the extraction of high-quality information while reducing errors caused by negative samples. To address these issues, in this paper, we propose a novel method called Layer-by-Layer Hierarchical Attention Network, which enhances the precision of feature point matching in computer vision by addressing the issue of outliers. Our method incorporates stage fusion, hierarchical extraction, and an attention mechanism to improve the network's representation capability by emphasizing the rich semantic information of feature points. Specifically, we introduce a layer-by-layer channel fusion module, which preserves the feature semantic information from each stage and achieves overall fusion, thereby enhancing the representation capability of the feature points. Additionally, we design a hierarchical attention module that adaptively captures and fuses global perception and structural semantic information using an attention mechanism. Finally, we propose two architectures to extract and integrate features, thereby improving the adaptability of our network. We conduct experiments on two public datasets, namely YFCC100M and SUN3D, and the results demonstrate that our proposed method outperforms several state-of-the-art techniques in both outlier removal and camera pose estimation. Source code is available at http://www.linshuyuan.com.

[144] FireRescue: A UAV-Based Dataset and Enhanced YOLO Model for Object Detection in Fire Rescue Scenes

Qingyu Xu,Runtong Zhang,Zihuan Qiu,Fanman Meng

Main category: cs.CV

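TL;DR: This paper presents a new approach to object detection in fire rescue scenes: it builds the FireRescue dataset, covering multiple scenarios and key target categories, and proposes the improved FRS-YOLO model to raise detection performance in complex environments.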

Details Motivation: Existing research focuses mainly on mountainous or forest environments and neglects urban rescue scenes, which are more common and structurally complex; detection classes are also limited, without full coverage of the targets that matter for command decisions, such as fire trucks and firefighters. Method: Builds a new dataset named FireRescue that covers urban, mountainous, forest, and water rescue scenarios and contains 8 key categories with 15,980 images and 32,000 bounding boxes; proposes the FRS-YOLO model, which introduces a multidimensional collaborative enhancement attention module and a dynamic feature sampler to reduce inter-class confusion and missed detection of small targets. Result: Experiments confirm that object detection in fire rescue scenes is challenging, and the proposed method clearly improves the detection performance of YOLO series models in this setting, handling smoke occlusion, background interference, and long-distance shooting better. Conclusion: By building a dataset closer to real command needs and designing a targeted detection model, the paper advances object detection for fire rescue scenes and provides a foundation for follow-up research. Abstract: Object detection in fire rescue scenarios is important for command and decision-making in firefighting operations. However, existing research still suffers from two main limitations. First, current work predominantly focuses on environments such as mountainous or forest areas, while paying insufficient attention to urban rescue scenes, which are more frequent and structurally complex. Second, existing detection systems include a limited number of classes, such as flames and smoke, and lack a comprehensive system covering key targets crucial for command decisions, such as fire trucks and firefighters. To address the above issues, this paper first constructs a new dataset named "FireRescue" for rescue command, which covers multiple rescue scenarios, including urban, mountainous, forest, and water areas, and contains eight key categories such as fire trucks and firefighters, with a total of 15,980 images and 32,000 bounding boxes. Secondly, to tackle the problems of inter-class confusion and missed detection of small targets caused by chaotic scenes, diverse targets, and long-distance shooting, this paper proposes an improved model named FRS-YOLO. On the one hand, the model introduces a plug-and-play multidimensional collaborative enhancement attention module, which enhances the discriminative representation of easily confused categories (e.g., fire trucks vs. ordinary trucks) through cross-dimensional feature interaction. On the other hand, it integrates a dynamic feature sampler to strengthen high-response foreground features, thereby mitigating the effects of smoke occlusion and background interference. Experimental results demonstrate that object detection in fire rescue scenarios is highly challenging, and the proposed method effectively improves the detection performance of YOLO series models in this context.

[145] From Sequential to Spatial: Reordering Autoregression for Efficient Visual Generation

Siyang Wang,Hanting Li,Wei Li,Jie Hu,Xinghao Chen,Feng Zhao

Main category: cs.CV

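TL;DR: This paper proposes RadAR, a parallelizable framework for accelerating autoregressive visual generation that improves generation efficiency and quality through a radial topology and a nested attention mechanism.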

Details Motivation: Traditional autoregressive models decode token by token, which makes inference slow, and the standard raster-scan order does not fully exploit the local dependencies and spatial correlations among visual tokens. Method: Designs a generation scheme based on a radial topology: starting from a center token, the remaining tokens are grouped by spatial distance into concentric rings and generated in parallel ring by ring, from inner to outer; a nested attention mechanism dynamically corrects implausible outputs during the forward pass to reduce error accumulation. Result: Achieves a higher degree of parallelism, significantly improving generation efficiency while preserving representational capacity, and mitigates the context inconsistency that parallel generation introduces. Conclusion: By combining ring-wise parallel generation with dynamic output correction, RadAR balances generation quality and speed and offers a new direction for efficient autoregressive visual generation. Abstract: Inspired by the remarkable success of autoregressive models in language modeling, this paradigm has been widely adopted in visual generation. However, the sequential token-by-token decoding mechanism inherent in traditional autoregressive models leads to low inference efficiency. In this paper, we propose RadAR, an efficient and parallelizable framework designed to accelerate autoregressive visual generation while preserving its representational capacity. Our approach is motivated by the observation that visual tokens exhibit strong local dependencies and spatial correlations with their neighbors, a property not fully exploited in standard raster-scan decoding orders. Specifically, we organize the generation process around a radial topology: an initial token is selected as the starting point, and all other tokens are systematically grouped into multiple concentric rings according to their spatial distances from this center. Generation then proceeds in a ring-wise manner, from inner to outer regions, enabling the parallel prediction of all tokens within the same ring. This design not only preserves the structural locality and spatial coherence of visual scenes but also substantially increases parallelization. Furthermore, to address the risk of inconsistent predictions arising from simultaneous token generation with limited context, we introduce a nested attention mechanism. This mechanism dynamically refines implausible outputs during the forward pass, thereby mitigating error accumulation and preventing model collapse. By integrating radial parallel prediction with dynamic output correction, RadAR significantly improves generation efficiency.
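
The ring-wise ordering described above can be sketched by grouping the positions of a token grid by their distance from a chosen center. This is only an illustration of the grouping idea, not RadAR's implementation: the abstract does not say which distance is used, so the Chebyshev distance (which yields square rings), the default center, and the function name are assumptions.

```python
def concentric_rings(height, width, center=None):
    """Group (row, col) grid positions into rings ordered from inner to outer.

    Rings are defined by Chebyshev distance from the center, an assumed choice.
    """
    if center is None:
        center = (height // 2, width // 2)
    cy, cx = center
    rings = {}
    for y in range(height):
        for x in range(width):
            distance = max(abs(y - cy), abs(x - cx))
            rings.setdefault(distance, []).append((y, x))
    # Tokens inside one ring would be predicted in parallel; rings are visited in order.
    return [rings[d] for d in sorted(rings)]
```

Under this assumed grouping, a 5x5 token grid is covered in 3 ring steps (1, 8, and 16 tokens) rather than 25 sequential raster-scan steps, which illustrates where the parallel speedup comes from.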

Maolin Wang,Bowen Yu,Sheng Zhang,Linjie Mi,Wanyu Wang,Yiqi Wang,Pengyue Jia,Xuetao Wei,Zenglin Xu,Ruocheng Guo,Xiangyu Zhao

Main category: cs.CV

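TL;DR: This paper proposes RGTN, a new framework for tensor network structure search inspired by renormalization group theory; it uses multi-scale continuous optimization for efficient, robust tensor decomposition and clearly outperforms existing methods in both compression ratio and speed.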

Details Motivation: Existing tensor network structure search methods are limited in computational scalability, structural adaptivity, and optimization robustness; they struggle to capture multi-scale structure or evolve structures smoothly, and separating structure optimization from parameter optimization makes them inefficient. Method: Proposes the RGTN framework, which draws on the multi-scale transformations of the renormalization group and uses learnable edge gates together with proposals based on physical quantities (such as node tension and edge information flow) to evolve structures continuously from coarse to fine and adjust topology dynamically. Result: Experiments on light field data, high-order synthetic tensors, and video completion show that RGTN achieves state-of-the-art compression ratios and runs 4 to 600 times faster than existing methods. Conclusion: By introducing a physics-inspired multi-scale optimization mechanism, RGTN addresses key challenges of traditional TN-SS methods and discovers tensor network structures more efficiently and robustly. Abstract: Tensor network structure search (TN-SS) aims to automatically discover optimal network topologies and rank configurations for efficient tensor decomposition in high-dimensional data representation. Despite recent advances, existing TN-SS methods face significant limitations in computational tractability, structure adaptivity, and optimization robustness across diverse tensor characteristics. They struggle with three key challenges: single-scale optimization missing multi-scale structures, discrete search spaces hindering smooth structure evolution, and separated structure-parameter optimization causing computational inefficiency. We propose RGTN (Renormalization Group guided Tensor Network search), a physics-inspired framework transforming TN-SS via multi-scale renormalization group flows. Unlike fixed-scale discrete search methods, RGTN uses dynamic scale-transformation for continuous structure evolution across resolutions. Its core innovation includes learnable edge gates for optimization-stage topology modification and intelligent proposals based on physical quantities like node tension measuring local stress and edge information flow quantifying connectivity importance. Starting from low-complexity coarse scales and refining to finer ones, RGTN finds compact structures while escaping local minima via scale-induced perturbations. Extensive experiments on light field data, high-order synthetic tensors, and video completion tasks show RGTN achieves state-of-the-art compression ratios and runs 4x to 600x faster than existing methods, validating the effectiveness of our physics-inspired approach.

[147] Evolving, Not Training: Zero-Shot Reasoning Segmentation via Evolutionary Prompting

Kai Ye,Xiaotong You,Jianghang Lin,Jiayi Ji,Pingyang Dai,Liujuan Cao

Main category: cs.CV

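TL;DR: This paper proposes EVOL-SAM3, a zero-shot framework for reasoning segmentation that recasts the task as an inference-time evolutionary search, overcoming the training burden and limited reasoning depth of existing methods.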

Details Motivation: Existing reasoning segmentation methods are constrained by catastrophic forgetting in supervised fine-tuning, training instability in reinforcement learning, and the static inference paradigm of training-free methods, which lacks self-correction. Method: Proposes EVOL-SAM3, which runs a "Generate-Evaluate-Evolve" loop over a population of prompt hypotheses; introduces a reference-free Visual Arena for pairwise evaluation and a Semantic Mutation operator to correct errors, and adds a Heterogeneous Arena module that combines geometric priors with semantic reasoning for a more robust final selection. Result: On the ReasonSeg benchmark, EVOL-SAM3 substantially outperforms static baselines and, in the zero-shot setting, surpasses fully supervised state-of-the-art methods. Conclusion: Through its evolutionary reasoning mechanism, EVOL-SAM3 achieves deeper joint language-vision reasoning and offers a new paradigm for zero-shot reasoning segmentation. Abstract: Reasoning Segmentation requires models to interpret complex, context-dependent linguistic queries to achieve pixel-level localization. Current dominant approaches rely heavily on Supervised Fine-Tuning (SFT) or Reinforcement Learning (RL). However, SFT suffers from catastrophic forgetting and domain dependency, while RL is often hindered by training instability and rigid reliance on predefined reward functions. Although recent training-free methods circumvent these training burdens, they are fundamentally limited by a static inference paradigm. These methods typically rely on a single-pass "generate-then-segment" chain, which suffers from insufficient reasoning depth and lacks the capability to self-correct linguistic hallucinations or spatial misinterpretations. In this paper, we challenge these limitations and propose EVOL-SAM3, a novel zero-shot framework that reformulates reasoning segmentation as an inference-time evolutionary search process. Instead of relying on a fixed prompt, EVOL-SAM3 maintains a population of prompt hypotheses and iteratively refines them through a "Generate-Evaluate-Evolve" loop. We introduce a Visual Arena to assess prompt fitness via reference-free pairwise tournaments, and a Semantic Mutation operator to inject diversity and correct semantic errors. Furthermore, a Heterogeneous Arena module integrates geometric priors with semantic reasoning to ensure robust final selection. Extensive experiments demonstrate that EVOL-SAM3 not only substantially outperforms static baselines but also significantly surpasses fully supervised state-of-the-art methods on the challenging ReasonSeg benchmark in a zero-shot setting. The code is available at https://github.com/AHideoKuzeA/Evol-SAM3.

[148] FlowBlending: Stage-Aware Multi-Model Sampling for Fast and High-Fidelity Video Generation

Jibin Song,Mingi Kwon,Jaeseok Jeong,Youngjung Uh

Main category: cs.CV

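TL;DR: Proposes FlowBlending, a stage-aware multi-model sampling strategy that uses a large model at the critical stages and a small model at the intermediate stage, substantially speeding up inference and cutting computation while preserving generation quality.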

Details Motivation: The impact of model capacity varies across timesteps: the early and late stages are sensitive to capacity while the intermediate stage is not, which suggests a way to improve inference efficiency without losing generation quality. Method: Proposes FlowBlending, which switches between a large and a small model according to the timestep stage; introduces simple criteria for choosing stage boundaries and uses velocity-divergence analysis to identify capacity-sensitive regions. Result: On LTX-Video and WAN 2.1, achieves up to 1.65x faster inference and 57.35% fewer FLOPs while maintaining the large models' visual fidelity, temporal coherence, and semantic alignment; it is compatible with existing acceleration techniques, giving up to 2x additional speedup. Conclusion: Through stage-aware model switching, FlowBlending balances generation quality and inference efficiency and offers a new approach to efficient inference for diffusion models. Abstract: In this work, we show that the impact of model capacity varies across timesteps: it is crucial for the early and late stages but largely negligible during the intermediate stage. Accordingly, we propose FlowBlending, a stage-aware multi-model sampling strategy that employs a large model at the capacity-sensitive stages and a small model at the intermediate stage. We further introduce simple criteria to choose stage boundaries and provide a velocity-divergence analysis as an effective proxy for identifying capacity-sensitive regions. Across LTX-Video (2B/13B) and WAN 2.1 (1.3B/14B), FlowBlending achieves up to 1.65x faster inference with 57.35% fewer FLOPs, while maintaining the visual fidelity, temporal coherence, and semantic alignment of the large models. FlowBlending is also compatible with existing sampling-acceleration techniques, enabling up to 2x additional speedup. Project page is available at: https://jibin86.github.io/flowblending_project_page.
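
The stage-aware switching described above can be sketched as a per-step model choice over a sampling schedule. This is a hedged illustration rather than FlowBlending itself: the boundary fractions `early_end=0.2` and `late_start=0.8` are placeholder values, not the criteria from the paper, and the function names are hypothetical.

```python
def choose_model(step, num_steps, early_end=0.2, late_start=0.8):
    """Return which model handles a sampling step: 'large' or 'small'.

    Early and late steps are treated as capacity-sensitive; the boundary
    fractions are illustrative placeholders.
    """
    fraction = step / num_steps
    if fraction < early_end or fraction >= late_start:
        return "large"
    return "small"


def build_schedule(num_steps, early_end=0.2, late_start=0.8):
    """List the model assignment for every step of a sampling run."""
    return [choose_model(s, num_steps, early_end, late_start) for s in range(num_steps)]
```

With these placeholder boundaries, a 10-step run would use the large model for 4 steps and the small model for 6, so the saving depends directly on how wide the capacity-insensitive middle stage is.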

[149] EchoFoley: Event-Centric Hierarchical Control for Video Grounded Creative Sound Generation

Bingxuan Li,Yiming Cui,Yicheng He,Yiwei Wang,Shu Zhang,Longyin Wen,Yulei Niu

Main category: cs.CV

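TL;DR: Introduces the EchoFoley task and the EchoVidia framework for fine-grained, controllable video-grounded sound generation, addressing visual dominance, limited control precision, and weak instruction understanding in existing video-to-audio models.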

Details Motivation: Existing video-text-to-audio generation models suffer from an imbalance between visual and textual conditioning, the lack of a definition of fine-grained control, and weak instruction understanding, which limits the use of sound effects in multimodal storytelling. Method: Proposes the EchoFoley task with event-level local control and hierarchical semantic control; designs a symbolic representation of sounding events that specifies when, what, and how each sound is produced; builds the EchoFoley-6k benchmark with over 6,000 samples; and develops EchoVidia, a sounding-event-centric generation framework with a slow-fast thinking strategy. Result: Experiments show that EchoVidia surpasses existing VT2A models by 40.7% in controllability and 12.5% in perceptual quality. Conclusion: The EchoFoley task and the EchoVidia framework improve the controllability and quality of video-grounded sound generation and advance sound design for multimodal content creation. Abstract: Sound effects build an essential layer of multimodal storytelling, shaping the emotional atmosphere and the narrative semantics of videos. Despite recent advances in video-text-to-audio (VT2A), the current formulation faces three key limitations: first, an imbalance between visual and textual conditioning that leads to visual dominance; second, the absence of a concrete definition for fine-grained controllable generation; third, weak instruction understanding and following, as existing datasets rely on brief categorical tags. To address these limitations, we introduce EchoFoley, a new task designed for video-grounded sound generation with both event-level local control and hierarchical semantic control. Our symbolic representation for sounding events specifies when, what, and how each sound is produced within a video or instruction, enabling fine-grained controls like sound generation, insertion, and editing. To support this task, we construct EchoFoley-6k, a large-scale, expert-curated benchmark containing over 6,000 video-instruction-annotation triplets. Building upon this foundation, we propose EchoVidia, a sounding-event-centric agentic generation framework with a slow-fast thinking strategy. Experiments show that EchoVidia surpasses recent VT2A models by 40.7% in controllability and 12.5% in perceptual quality.

[150] Splatwizard: A Benchmark Toolkit for 3D Gaussian Splatting Compression

Xiang Liu,Yimin Zhou,Jinxiang Wang,Yujun Huang,Shuzhao Xie,Shiyu Qin,Mingyao Hong,Jiawei Li,Yaowei Wang,Zhi Wang,Shu-Tao Xia,Bin Chen

Main category: cs.CV

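TL;DR: Splatwizard is a unified benchmark toolkit designed for 3D Gaussian Splatting compression models, providing an easy-to-use framework and automated computation of performance metrics.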

Details Motivation: Existing evaluation tools lack comprehensive, standardized benchmarking for 3DGS compression, making it hard to assess different methods jointly on rendering speed, rate-distortion trade-offs, memory efficiency, and geometric accuracy. Method: Proposes Splatwizard, an integrated open-source toolkit that supports rapid implementation of new 3DGS compression models and includes an automated evaluation pipeline covering image quality, Chamfer distance of reconstructed meshes, rendering frame rate, and resource consumption. Result: Provides a unified evaluation framework that systematically assesses the multi-dimensional performance of 3DGS compression algorithms and promotes comparability and reproducibility in the field. Conclusion: Splatwizard fills the gap in standardized evaluation tools for 3DGS compression and helps advance and compare techniques in this area. Abstract: The recent advent of 3D Gaussian Splatting (3DGS) has marked a significant breakthrough in real-time novel view synthesis. However, the rapid proliferation of 3DGS-based algorithms has created a pressing need for standardized and comprehensive evaluation tools, especially for compression tasks. Existing benchmarks often lack the specific metrics necessary to holistically assess the unique characteristics of different methods, such as rendering speed, rate-distortion trade-offs, memory efficiency, and geometric accuracy. To address this gap, we introduce Splatwizard, a unified benchmark toolkit designed specifically for benchmarking 3DGS compression models. Splatwizard provides an easy-to-use framework to implement new 3DGS compression models and utilize state-of-the-art techniques proposed by previous work. The framework also includes an integrated pipeline that automates the calculation of key performance indicators, including image-based quality metrics, Chamfer distance of reconstructed meshes, rendering frame rates, and computational resource consumption. Code is available at https://github.com/splatwizard/splatwizard
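
One of the geometric metrics named above, Chamfer distance, can be sketched in a few lines. This is a generic brute-force formulation on small point lists, not Splatwizard's code or API; conventions vary (squared versus unsquared distances, sum versus average of the two directions), and the variant below is only one common choice.

```python
import math

def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance: mean nearest-neighbour distance, both directions."""
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(points_a, points_b) + one_way(points_b, points_a)

# Two tiny point sets sampled (hypothetically) from a reference and a reconstructed mesh.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reconstructed = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
distance = chamfer_distance(reference, reconstructed)
```

The brute-force nearest-neighbour search is quadratic in the number of points; a practical pipeline would typically use a spatial index or a GPU implementation for meshes sampled at realistic density.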

[151] UniC-Lift: Unified 3D Instance Segmentation via Contrastive Learning

Ankit Dhiman,Srinath R,Jaswanth Reddy,Lokesh R Boregowda,Venkatesh Babu Radhakrishnan

Main category: cs.CV

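TL;DR: Proposes a unified 3D instance segmentation framework that uses learnable feature embeddings in Gaussian primitives, an "Embedding-to-Label" decoding step, and a boundary hard-mining strategy to improve multi-view consistency and performance.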

Details Motivation: Addresses the poor 3D predictions caused by 2D instance labels that are inconsistent across views, as well as the limitations of two-stage methods, which are slow to train and depend on sensitive hyperparameters or preprocessing. Method: Proposes a unified framework that merges feature embedding learning with label generation by introducing a learnable feature embedding in Gaussian primitives and decoding it into instance labels through a novel "Embedding-to-Label" process; to reduce boundary artifacts, it applies a linear layer to the rasterized features before computing a triplet loss, which gives a stable hard-mining strategy. Result: Achieves better qualitative and quantitative results than baselines on the ScanNet, Replica3D, and Messy-Rooms datasets, with more efficient training and better boundary handling. Conclusion: The unified optimization framework and the stable boundary hard-mining strategy improve the accuracy and training efficiency of 3D instance segmentation and resolve cross-view label inconsistency. Abstract: 3D Gaussian Splatting (3DGS) and Neural Radiance Fields (NeRF) have advanced novel-view synthesis. Recent methods extend multi-view 2D segmentation to 3D, enabling instance/semantic segmentation for better scene understanding. A key challenge is the inconsistency of 2D instance labels across views, leading to poor 3D predictions. Existing methods use a two-stage approach in which some rely on contrastive learning with hyperparameter-sensitive clustering, while others preprocess labels for consistency. We propose a unified framework that merges these steps, reducing training time and improving performance by introducing a learnable feature embedding for segmentation in Gaussian primitives. This embedding is then efficiently decoded into instance labels through a novel "Embedding-to-Label" process, effectively integrating the optimization. While this unified framework offers substantial benefits, we observed artifacts at the object boundaries. To address the object boundary issues, we propose hard-mining samples along these boundaries. However, directly applying hard mining to the feature embeddings proved unstable. Therefore, we apply a linear layer to the rasterized feature embeddings before calculating the triplet loss, which stabilizes training and significantly improves performance. Our method outperforms baselines qualitatively and quantitatively on the ScanNet, Replica3D, and Messy-Rooms datasets.
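
The idea of passing embeddings through a linear layer before a hard-mined triplet loss can be sketched as follows. This is an illustrative toy, not the paper's loss: the margin, the choice of "hardest" positive and negative, and the helper names `linear` and `hard_triplet_loss` are all assumptions.

```python
import math

def linear(x, weight, bias):
    """Apply y = W x + b to a single feature vector (lists of floats)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def hard_triplet_loss(anchor, positives, negatives, weight, bias, margin=0.5):
    """Triplet loss on projected features using the hardest positive and negative."""
    a = linear(anchor, weight, bias)
    hardest_pos = max(math.dist(a, linear(p, weight, bias)) for p in positives)
    hardest_neg = min(math.dist(a, linear(n, weight, bias)) for n in negatives)
    return max(0.0, hardest_pos - hardest_neg + margin)

# Identity projection so the numbers are easy to follow.
identity, zero_bias = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
loss = hard_triplet_loss(
    anchor=[0.0, 0.0],
    positives=[[0.0, 1.0], [0.0, 2.0]],    # same instance as the anchor
    negatives=[[3.0, 0.0], [0.0, 2.2]],    # different instances
    weight=identity, bias=zero_bias,
)
```

Here the farthest positive is at distance 2.0 and the nearest negative at 2.2, so the loss is positive until the negative is pushed at least a margin farther than the positive. In training, `weight` and `bias` would be learned jointly with the embeddings.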

[152] Projection-based Adversarial Attack using Physics-in-the-Loop Optimization for Monocular Depth Estimation

Takeru Kusakabe,Yudai Hirose,Mashiho Mukaida,Satoshi Ono

Main category: cs.CV

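TL;DR: Proposes a projection-based adversarial attack that uses physics-in-the-loop optimization and a distributed covariance matrix adaptation evolution strategy to demonstrate the vulnerability of deep neural networks in monocular depth estimation.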

Details Motivation: Deep neural networks for monocular depth estimation are vulnerable to adversarial attacks, which undermines their reliability in practice; this vulnerability needs to be validated so that robustness can be improved. Method: Proposes a projection-based adversarial attack that uses physics-in-the-loop (PITL) optimization together with a distributed covariance matrix adaptation evolution strategy to generate adversarial examples. Result: Experiments show that the method successfully generates adversarial examples that cause depth misestimation, making parts of objects disappear from the target scene. Conclusion: DNN-based MDE models are sensitive to projection-based adversarial attacks in the real world, and robustness must be strengthened to counter such practical threats. Abstract: Deep neural networks (DNNs) remain vulnerable to adversarial attacks that cause misclassification when specific perturbations are added to input images. This vulnerability also threatens the reliability of DNN-based monocular depth estimation (MDE) models, making robustness enhancement a critical need in practical applications. To validate the vulnerability of DNN-based MDE models, this study proposes a projection-based adversarial attack method that projects perturbation light onto a target object. The proposed method employs physics-in-the-loop (PITL) optimization -- evaluating candidate solutions in actual environments to account for device specifications and disturbances -- and utilizes a distributed covariance matrix adaptation evolution strategy. Experiments confirmed that the proposed method successfully created adversarial examples that lead to depth misestimations, resulting in parts of objects disappearing from the target scene.

[153] Nonlinear Noise2Noise for Efficient Monte Carlo Denoiser Training

Andrew Tinits,Stephen Mann

Main category: cs.CV

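TL;DR: Presents an improved Noise2Noise method. A theoretical analysis shows that certain nonlinear functions can, under specific conditions, be applied to noisy training targets without significant bias, which resolves the bias introduced by nonlinear tone mapping in high dynamic range image denoising; the method is used to train a Monte Carlo rendering denoiser from noisy data only.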

Details Motivation: Noise2Noise trains denoisers without clean targets, but nonlinear functions cannot be applied directly to the noisy targets because they shift the expected value; this restricts preprocessing and makes outliers hard to handle in HDR image denoising. Method: Builds a theoretical framework for analyzing the effect of nonlinear functions and defines a class of nonlinear functions with low bias; combines specific loss functions with tone mapping functions to reduce dynamic range while minimizing training bias. Result: On HDR images from Monte Carlo rendering, Noise2Noise training with nonlinear tone mapping approaches the performance of a model originally trained with high-sample-count reference images, while using only noisy data. Conclusion: Certain nonlinear operations can be used safely within the Noise2Noise framework, extending its applicability to practical image processing pipelines, particularly HDR image denoising. Abstract: The Noise2Noise method allows for training machine learning-based denoisers with pairs of input and target images where both the input and target can be noisy. This removes the need for training with clean target images, which can be difficult to obtain. However, Noise2Noise training has a major limitation: nonlinear functions applied to the noisy targets will skew the results. This bias occurs because the nonlinearity makes the expected value of the noisy targets different from the clean target image. Since nonlinear functions are common in image processing, avoiding them limits the types of preprocessing that can be performed on the noisy targets. Our main insight is that certain nonlinear functions can be applied to the noisy targets without adding significant bias to the results. We develop a theoretical framework for analyzing the effects of these nonlinearities, and describe a class of nonlinear functions with minimal bias. We demonstrate our method on the denoising of high dynamic range (HDR) images produced by Monte Carlo rendering. Noise2Noise training can have trouble with HDR images, where the training process is overwhelmed by outliers and performs poorly. We consider a commonly used method of addressing these training issues: applying a nonlinear tone mapping function to the model output and target images to reduce their dynamic range. This method was previously thought to be incompatible with Noise2Noise training because of the nonlinearities involved. 
We show that certain combinations of loss functions and tone mapping functions can reduce the effect of outliers while introducing minimal bias. We apply our method to an existing machine learning-based Monte Carlo denoiser, where the original implementation was trained with high-sample count reference images. Our results approach those of the original implementation, but are produced using only noisy training data.
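
The source of the bias discussed in the abstract is that, for a nonlinear function, the mean of the transformed noisy targets generally differs from the transform of their mean. The short sketch below makes this concrete with two hand-picked noisy samples; the Reinhard-style curve `x / (1 + x)` is used only as a familiar example of a tone mapping function and is not claimed to be the one analyzed in the paper.

```python
from statistics import mean

def reinhard(x):
    """A common tone mapping curve, used here purely as an example nonlinearity."""
    return x / (1.0 + x)

def target_bias(fn, noisy_samples):
    """E[fn(noisy)] - fn(E[noisy]): zero when fn commutes with taking the mean."""
    return mean(fn(s) for s in noisy_samples) - fn(mean(noisy_samples))

# Two noisy observations whose mean is the clean value 1.0.
noisy = [0.0, 2.0]
linear_bias = target_bias(lambda x: 2.0 * x + 1.0, noisy)   # affine map: no bias
tonemap_bias = target_bias(reinhard, noisy)                 # nonlinear map: biased
```

For these samples the affine map gives zero bias, while the tone-mapped targets average to 1/3 instead of the tone-mapped clean value 1/2, a shift of -1/6. The paper's contribution, as summarized above, is identifying loss and tone mapping combinations for which this kind of shift stays small.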

[154] Video and Language Alignment in 2D Systems for 3D Multi-object Scenes with Multi-Information Derivative-Free Control

Jason Armitage,Rico Sennrich

Main category: cs.CV

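TL;DR: Proposes a new method based on derivative-free optimization and regret minimization that improves multivariate mutual information estimates, enabling off-the-shelf cross-modal systems to adapt online to object occlusions and differentiate features in 3D scenes without pretraining or finetuning.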

Details Motivation: Cross-modal systems trained on 2D visual inputs face a dimensional mismatch when processing 3D scenes, in particular with object occlusions and feature differentiation. Method: Performs regret minimization with derivative-free optimization to improve multivariate mutual information estimates, and combines this with value-based optimization to control an in-scene camera that learns directly from the noisy outputs of vision-language models. Result: The method enables off-the-shelf cross-modal systems to improve performance on cross-modal tasks in multi-object 3D scenes without pretraining or finetuning. Conclusion: The method bridges the dimensional gap between 2D training and 3D inference, achieving adaptive perception and control in 3D scenes. Abstract: Cross-modal systems trained on 2D visual inputs are presented with a dimensional shift when processing 3D scenes. An in-scene camera bridges the dimensionality gap but requires learning a control module. We introduce a new method that improves multivariate mutual information estimates by regret minimisation with derivative-free optimisation. Our algorithm enables off-the-shelf cross-modal systems trained on 2D visual inputs to adapt online to object occlusions and differentiate features. The pairing of expressive measures and value-based optimisation assists control of an in-scene camera to learn directly from the noisy outputs of vision-language models. The resulting pipeline improves performance in cross-modal tasks on multi-object 3D scenes without resorting to pretraining or finetuning.

[155] CropTrack: A Tracking with Re-Identification Framework for Precision Agriculture

Md Ahmed Al Muzaddid,Jordan A. James,William J. Beksi

Main category: cs.CV

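TL;DR: Proposes CropTrack, a multiple-object tracking framework that combines appearance and motion information to address identity preservation in agricultural environments, where occlusions and similar appearances make it difficult.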

Details Motivation: Multiple-object tracking in agricultural environments is challenged by repetitive patterns, similar appearances, illumination changes, and frequent occlusions, and existing methods struggle to maintain identities under strong occlusion. Method: CropTrack combines appearance and motion information, using reranking-enhanced appearance association, one-to-many association with an appearance-based conflict resolution strategy, and an exponential moving average prototype feature bank to improve appearance-based association. Result: On public agricultural MOT datasets, CropTrack outperforms existing methods in identification F1 and association accuracy with fewer identity switches, showing stronger identity preservation. Conclusion: By fusing appearance and motion information, CropTrack improves multiple-object tracking in agricultural scenes, especially under frequent occlusion and similar appearances. Abstract: Multiple-object tracking (MOT) in agricultural environments presents major challenges due to repetitive patterns, similar object appearances, sudden illumination changes, and frequent occlusions. Contemporary trackers in this domain rely on the motion of objects rather than appearance for association. Nevertheless, they struggle to maintain object identities when targets undergo frequent and strong occlusions. The high similarity of object appearances makes integrating appearance-based association nontrivial for agricultural scenarios. To solve this problem, we propose CropTrack, a novel MOT framework based on the combination of appearance and motion information. CropTrack integrates a reranking-enhanced appearance association, a one-to-many association with appearance-based conflict resolution strategy, and an exponential moving average prototype feature bank to improve appearance-based association. Evaluated on publicly available agricultural MOT datasets, CropTrack demonstrates consistent identity preservation, outperforming traditional motion-based tracking methods. Compared to the state of the art, CropTrack achieves significant gains in identification F1 and association accuracy scores with a lower number of identity switches.
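
An exponential moving average (EMA) prototype feature bank, one of the components listed above, can be sketched as a per-track running average of appearance features. The class name `PrototypeBank`, the `momentum` value, and the update rule below are generic assumptions for illustration, not CropTrack's actual implementation.

```python
class PrototypeBank:
    """Keeps one smoothed appearance prototype per track ID."""

    def __init__(self, momentum=0.9):
        self.momentum = momentum   # weight given to the existing prototype
        self.prototypes = {}       # track_id -> list of floats

    def update(self, track_id, feature):
        """Blend a new appearance feature into the track's prototype."""
        old = self.prototypes.get(track_id)
        if old is None:
            # First observation of this track: the feature becomes the prototype.
            self.prototypes[track_id] = list(feature)
        else:
            m = self.momentum
            self.prototypes[track_id] = [m * o + (1.0 - m) * f
                                         for o, f in zip(old, feature)]
        return self.prototypes[track_id]

bank = PrototypeBank(momentum=0.5)
bank.update(1, [1.0, 0.0])
prototype = bank.update(1, [0.0, 1.0])
```

A higher momentum makes the prototype change more slowly, which can help a track keep a stable appearance signature when a few frames are corrupted by occlusion.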

[156] VLN-MME: Diagnosing MLLMs as Language-guided Visual Navigation agents

Xunyi Zhao,Gengze Zhou,Qi Wu

Main category: cs.CV

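TL;DR: Introduces VLN-MME, a unified evaluation framework for probing multimodal large language models (MLLMs) as zero-shot embodied agents in Vision-and-Language Navigation (VLN), and reveals their limitations in 3D spatial reasoning and sequential decision-making.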

Details Motivation: Although MLLMs perform well on vision-language tasks, their performance as embodied agents, which requires multi-round dialogue, spatial reasoning, and sequential action prediction, is unclear, so a systematic evaluation framework is needed to probe their abilities and shortcomings. Method: Proposes the VLN-MME framework, which consolidates traditional navigation datasets into a standardized benchmark; its modular design supports structured comparisons and component-level ablations across MLLM architectures, agent designs, and navigation tasks, and the baseline agent is augmented with Chain-of-Thought (CoT) reasoning and self-reflection. Result: Adding CoT and self-reflection unexpectedly lowers performance, indicating that MLLMs have poor context awareness and weak 3D spatial reasoning in embodied navigation and exposing their shortcomings in sequential decision-making. Conclusion: VLN-MME provides a foundation for evaluating MLLMs in embodied navigation, reveals key deficiencies in spatial understanding and sequential decision-making, and offers guidance for post-training MLLMs as embodied agents. Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities across a wide range of vision-language tasks. However, their performance as embodied agents, which requires multi-round dialogue, spatial reasoning, and sequential action prediction, needs further exploration. Our work investigates this potential in the context of Vision-and-Language Navigation (VLN) by introducing a unified and extensible evaluation framework to probe MLLMs as zero-shot agents by bridging traditional navigation datasets into a standardized benchmark, named VLN-MME. We simplify the evaluation with a highly modular and accessible design. This flexibility streamlines experiments, enabling structured comparisons and component-level ablations across diverse MLLM architectures, agent designs, and navigation tasks. Crucially, enabled by our framework, we observe that enhancing our baseline agent with Chain-of-Thought (CoT) reasoning and self-reflection leads to an unexpected performance decrease. This suggests MLLMs exhibit poor context awareness in embodied navigation tasks; although they can follow instructions and structure their output, their 3D spatial reasoning fidelity is low. VLN-MME lays the groundwork for systematic evaluation of general-purpose MLLMs in embodied navigation settings and reveals limitations in their sequential decision-making capabilities. We believe these findings offer crucial guidance for MLLM post-training as embodied agents.

[157] OFL-SAM2: Prompt SAM2 with Online Few-shot Learner for Efficient Medical Image Segmentation

Meng Lan,Lefei Zhang,Xiaomeng Li

Main category: cs.CV

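TL;DR: Proposes OFL-SAM2, a prompt-free online learning framework that uses a lightweight mapping network and an adaptive fusion module to perform few-shot medical image segmentation.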

Details Motivation: Applying SAM2 to medical image segmentation depends on annotated data and high-quality manual prompts; the goal is to reduce reliance on expert intervention and improve generalization with limited data. Method: Designs a lightweight mapping network that uses a small number of annotated samples to transform generic image features into target features and supports online parameter updates during inference; an adaptive fusion module dynamically integrates these with SAM2's memory-attention features, enabling segmentation without manual prompts. Result: Experiments on three medical image segmentation datasets validate the method, with OFL-SAM2 reaching state-of-the-art performance using very little training data. Conclusion: Through online few-shot learning and adaptive feature fusion, OFL-SAM2 achieves efficient, prompt-free medical image segmentation and substantially reduces reliance on annotated data and expert intervention. Abstract: The Segment Anything Model 2 (SAM2) has demonstrated remarkable promptable visual segmentation capabilities in video data, showing potential for extension to medical image segmentation (MIS) tasks involving 3D volumes and temporally correlated 2D image sequences. However, adapting SAM2 to MIS presents several challenges, including the need for extensive annotated medical data for fine-tuning and high-quality manual prompts, which are both labor-intensive and require intervention from medical experts. To address these challenges, we introduce OFL-SAM2, a prompt-free SAM2 framework for label-efficient MIS. Our core idea is to leverage limited annotated samples to train a lightweight mapping network that captures medical knowledge and transforms generic image features into target features, thereby providing additional discriminative target representations for each frame and eliminating the need for manual prompts. Crucially, the mapping network supports online parameter updates during inference, enhancing the model's generalization across test sequences. Technically, we introduce two key components: (1) an online few-shot learner that trains the mapping network to generate target features using limited data, and (2) an adaptive fusion module that dynamically integrates the target features with the memory-attention features generated by frozen SAM2, leading to accurate and robust target representation. Extensive experiments on three diverse MIS datasets demonstrate that OFL-SAM2 achieves state-of-the-art performance with limited training data.

[158] FinMMDocR: Benchmarking Financial Multimodal Reasoning with Scenario Awareness, Document Understanding, and Multi-Step Computation

Zichen Tang,Haihong E,Rongjin Li,Jiacheng Liu,Linwei Jia,Zhuodi Hao,Zhongjun Yang,Yuanze Li,Haolin Tian,Xinyi Hu,Peizhi Zhao,Yuan Liu,Zhengyu Wang,Xianghe Wang,Yiling Huang,Xueyuan Lin,Ruofei Bai,Zijian Xie,Qian Huang,Ruining Cao,Haocheng Gao

Main category: cs.CV

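TL;DR: FinMMDocR is a new bilingual multimodal benchmark for evaluating the numerical reasoning of multimodal large language models in real-world financial scenarios, with three advances: scenario awareness, document understanding, and multi-step computation.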

Details Motivation: Existing benchmarks fall short in evaluating multimodal reasoning in real-world financial scenarios, lacking support for implicit financial scenarios, long-document understanding, and complex multi-step computation. Method: Builds a bilingual multimodal dataset of 1,200 expert-annotated problems covering 12 types of financial scenarios and 837 long Chinese/English documents of 9 types; problems require 11 reasoning steps on average and integration of information across pages. Result: The best multimodal large language model reaches only 58.0% accuracy, and different retrieval-augmented generation methods vary significantly in performance, showing how challenging the task is. Conclusion: FinMMDocR can drive progress in multimodal large language models and reasoning-enhanced methods for complex real-world scenarios. Abstract: We introduce FinMMDocR, a novel bilingual multimodal benchmark for evaluating multimodal large language models (MLLMs) on real-world financial numerical reasoning. Compared to existing benchmarks, our work delivers three major advancements. (1) Scenario Awareness: 57.9% of 1,200 expert-annotated problems incorporate 12 types of implicit financial scenarios (e.g., Portfolio Management), challenging models to perform expert-level reasoning based on assumptions; (2) Document Understanding: 837 Chinese/English documents spanning 9 types (e.g., Company Research) average 50.8 pages with rich visual elements, significantly surpassing existing benchmarks in both breadth and depth of financial documents; (3) Multi-Step Computation: Problems demand 11-step reasoning on average (5.3 extraction + 5.7 calculation steps), with 65.0% requiring cross-page evidence (2.4 pages average). The best-performing MLLM achieves only 58.0% accuracy, and different retrieval-augmented generation (RAG) methods show significant performance variations on this task. We expect FinMMDocR to drive improvements in MLLMs and reasoning-enhanced methods on complex multimodal reasoning tasks in real-world scenarios.

[159] Semi-Supervised Diversity-Aware Domain Adaptation for 3D Object detection

Bartłomiej Olber,Jakub Winter,Paweł Wawrzyński,Andrii Gamalii,Daniel Górniak,Marcin Łojek,Robert Nowak,Krystian Radlak

Main category: cs.CV

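TL;DR: Proposes a lidar domain adaptation method based on neuron activation patterns that achieves state-of-the-art 3D object detection performance by annotating only a small set of representative samples from the target domain.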

Details Motivation: Existing 3D object detectors generalize poorly across domains; for example, a model trained in the U.S. may perform poorly in Asia or Europe. Method: Selects a small set of representative and diverse target-domain samples for annotation based on neuron activation patterns, and combines this with post-training techniques inspired by continual learning to prevent weight drift. Result: With a very small annotation budget, the method outperforms linear probing and existing state-of-the-art domain adaptation techniques. Conclusion: Carefully selecting a small number of samples and preventing weight drift enables efficient, high-performance domain adaptation for cross-domain 3D object detection. Abstract: 3D object detectors are fundamental components of perception systems in autonomous vehicles. While these detectors achieve remarkable performance on standard autonomous driving benchmarks, they often struggle to generalize across different domains - for instance, a model trained in the U.S. may perform poorly in regions like Asia or Europe. This paper presents a novel lidar domain adaptation method based on neuron activation patterns, demonstrating that state-of-the-art performance can be achieved by annotating only a small, representative, and diverse subset of samples from the target domain if they are correctly selected. The proposed approach requires a very small annotation budget and, when combined with post-training techniques inspired by continual learning, prevents weight drift from the original model. Empirical evaluation shows that the proposed domain adaptation approach outperforms both linear probing and state-of-the-art domain adaptation techniques.

[160] HaineiFRDM: Explore Diffusion to Restore Defects in Fast-Movement Films

Rongji Xun,Junjie Yuan,Zhongjie Wang

Main category: cs.CV

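TL;DR: Proposes HaineiFRDM, an open-source diffusion-based film restoration framework that uses a global-local frequency module and a patch-wise training strategy to restore high-resolution films, and builds a dataset with real degradations; it clearly outperforms existing open-source methods.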

Details Motivation: Existing open-source film restoration methods lag behind commercial ones because they use low-quality synthetic data and noisy optical flow, and they have not explored high-resolution film restoration. Method: Proposes the HaineiFRDM framework, which adopts a patch-wise training and testing strategy to fit on a single GPU with 24GB of VRAM; designs position-aware Global Prompt and Frame Fusion modules; introduces a global-local frequency module to keep textures consistent; first restores a low-resolution result and uses it as a global residual to reduce blocky artifacts; and builds a dataset containing real degradations and realistic synthetic data. Result: Experiments show that the method clearly outperforms existing open-source methods in defect restoration, especially for high-resolution films. Conclusion: HaineiFRDM leverages the content-understanding ability of diffusion models, combined with new modules and a high-quality dataset, to advance open-source film restoration with practical potential. Abstract: Existing open-source film restoration methods show limited performance compared to commercial methods due to training with low-quality synthetic data and employing noisy optical flows. In addition, high-resolution films have not been explored by the open-source methods. We propose HaineiFRDM (Film Restoration Diffusion Model), a film restoration framework, to explore the diffusion model's powerful content-understanding ability to help human experts better restore indistinguishable film defects. Specifically, we employ a patch-wise training and testing strategy to make restoring high-resolution films on one 24GB-VRAM GPU possible and design position-aware Global Prompt and Frame Fusion Modules. Also, we introduce a global-local frequency module to reconstruct consistent textures among different patches. Besides, we first restore a low-resolution result and use it as a global residual to mitigate blocky artifacts caused by the patching process. Furthermore, we construct a film restoration dataset that contains restored real-degraded films and realistic synthetic data. Comprehensive experimental results conclusively demonstrate the superiority of our model in defect restoration ability over existing open-source methods. Code and the dataset will be released.

[161] ProDM: Synthetic Reality-driven Property-aware Progressive Diffusion Model for Coronary Calcium Motion Correction in Non-gated Chest CT

Xinran Gong,Gorkem Durak,Halil Ertugrul Aktas,Vedat Cicek,Jinkui Hao,Ulas Bagci,Nilay S. Shah,Bo Zhou

Main category: cs.CV

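TL;DR: Proposes ProDM, a progressive diffusion model that restores motion-artifact-free coronary calcified lesions from non-gated chest CT, improving the accuracy and clinical usability of CAC scoring.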

Details Motivation: Non-gated chest CT is common in routine imaging but suffers from severe motion artifacts that limit the accuracy of coronary artery calcium (CAC) scoring, and the lack of paired data hinders model training; a method is needed that requires no paired data and preserves the integrity of calcified lesions. Method: Proposes the ProDM framework with three key components: (1) a CAC motion simulation data engine that generates non-gated images with diverse motion trajectories from ECG-gated CT, enabling supervised training without paired data; (2) a property-aware learning strategy that incorporates calcium priors through a differentiable calcium consistency loss; and (3) a progressive correction scheme that reduces artifacts gradually across diffusion steps to improve stability and calcium fidelity. Result: Experiments on real patient datasets show that ProDM significantly improves CAC scoring accuracy, spatial lesion fidelity, and risk stratification; a reader study further confirms that it suppresses motion artifacts and improves clinical usability. Conclusion: ProDM is a promising solution for reliable CAC quantification from routine non-gated chest CT and demonstrates the potential of progressive, property-aware frameworks for artifact correction in medical imaging. Abstract: Coronary artery calcium (CAC) scoring from chest CT is a well-established tool to stratify and refine clinical cardiovascular disease risk estimation. CAC quantification relies on the accurate delineation of calcified lesions, but is oftentimes affected by artifacts introduced by cardiac and respiratory motion. ECG-gated cardiac CTs substantially reduce motion artifacts, but their use in population screening and routine imaging remains limited due to gating requirements and lack of insurance coverage. Although identification of incidental CAC from non-gated chest CT is increasingly considered because it offers an accessible and widely available alternative, this modality is limited by more severe motion artifacts. We present ProDM (Property-aware Progressive Correction Diffusion Model), a generative diffusion framework that restores motion-free calcified lesions from non-gated CTs. ProDM introduces three key components: (1) a CAC motion simulation data engine that synthesizes realistic non-gated acquisitions with diverse motion trajectories directly from cardiac-gated CTs, enabling supervised training without paired data; (2) a property-aware learning strategy incorporating calcium-specific priors through a differentiable calcium consistency loss to preserve lesion integrity; and (3) a progressive correction scheme that reduces artifacts gradually across diffusion steps to enhance stability and calcium fidelity. 
Experiments on real patient datasets show that ProDM significantly improves CAC scoring accuracy, spatial lesion fidelity, and risk stratification performance compared with several baselines. A reader study on real non-gated scans further confirms that ProDM suppresses motion artifacts and improves clinical usability. These findings highlight the potential of progressive, property-aware frameworks for reliable CAC quantification from routine chest CT imaging.

[162] VIPER: Process-aware Evaluation for Generative Video Reasoning

Yifan Li,Yukai Gu,Yingqian Min,Zikang Liu,Yifan Du,Kun Zhou,Min Yang,Wayne Xin Zhao,Minghui Qiu

Main category: cs.CV

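TL;DR: Proposes VIPER, a process-aware evaluation paradigm for the reasoning ability of video generation models, and introduces the POC@r metric to measure consistency between intermediate steps and outcomes; existing models show severe outcome-hacking, reaching correct results through incorrect reasoning.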

Details Motivation: Evaluations of video generation models mostly rely on single-frame judgments and ignore the reasoning process, which allows models to reach correct results through erroneous reasoning (outcome-hacking); the validity of the reasoning process itself is not assessed. Method: Builds VIPER, a video reasoning benchmark of 16 tasks spanning temporal, structural, symbolic, spatial, physics, and planning reasoning; proposes the Process-outcome Consistency (POC@r) metric, which uses a VLM-based judge with a hierarchical rubric to evaluate the validity of both intermediate steps and the final result. Result: State-of-the-art video models reach only about 20% POC@1.0 and exhibit significant outcome-hacking; analyses of test-time scaling and sampling robustness reveal a large gap between current generation capability and true generalized visual reasoning. Conclusion: Current video generation models are still far from reliable chain-style reasoning on complex tasks, and evaluation should pay more attention to process consistency; VIPER and POC@r provide tools and direction for future research. Abstract: Recent breakthroughs in video generation have demonstrated an emerging capability termed Chain-of-Frames (CoF) reasoning, where models resolve complex tasks through the generation of continuous frames. While these models show promise for Generative Video Reasoning (GVR), existing evaluation frameworks often rely on single-frame assessments, which can lead to outcome-hacking, where a model reaches a correct conclusion through an erroneous process. To address this, we propose a process-aware evaluation paradigm. We introduce VIPER, a comprehensive benchmark spanning 16 tasks across temporal, structural, symbolic, spatial, physics, and planning reasoning. Furthermore, we propose Process-outcome Consistency (POC@r), a new metric that utilizes VLM-as-Judge with a hierarchical rubric to evaluate both the validity of the intermediate steps and the final result. Our experiments reveal that state-of-the-art video models achieve only about 20% POC@1.0 and exhibit significant outcome-hacking. We further explore the impact of test-time scaling and sampling robustness, highlighting a substantial gap between current video generation and true generalized visual reasoning. Our benchmark will be publicly released.

[163] ShowUI-$π$: Flow-based Generative Models as GUI Dexterous Hands

Siyuan Hu,Kevin Qinghong Lin,Mike Zheng Shou

Main category: cs.CV

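TL;DR: Presents ShowUI-π, the first flow-based generative model for dexterous GUI manipulation, which models discrete clicks and continuous drags in a unified way, together with a dataset of 20K drag trajectories and the ScreenDrag benchmark. Existing proprietary agents perform poorly on ScreenDrag, while ShowUI-π scores 26.98 with 450M parameters, outperforming them.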

Details Motivation: Existing GUI agents rely on discrete click predictions and cannot support free-form, closed-loop drag operations that require continuous perception and real-time adjustment (e.g. dragging a progress bar), which limits their flexibility and human-likeness in complex interaction. Method: Proposes ShowUI-π with (i) unified discrete-continuous action modeling that integrates clicks and drags; (ii) flow-based action generation, in which a lightweight action expert predicts incremental cursor adjustments from visual input; and (iii) the ScreenDrag benchmark, with 20K manually collected and synthesized cross-domain drag trajectories and online/offline evaluation protocols. Result: Mainstream proprietary agents perform poorly on ScreenDrag (Operator 13.27, Gemini-2.5-CUA 22.18), while ShowUI-π reaches 26.98 with only 450M parameters, confirming both the difficulty of the task and the effectiveness of the method. Conclusion: ShowUI-π moves GUI agents toward human-level dexterous control in the digital world and provides a basis for more flexible, continuous interaction in future agents. Abstract: Building intelligent agents capable of dexterous manipulation is essential for achieving human-like automation in both robotics and digital environments. However, existing GUI agents rely on discrete click predictions (x,y), which prohibits free-form, closed-loop trajectories (e.g. dragging a progress bar) that require continuous, on-the-fly perception and adjustment. In this work, we develop ShowUI-$π$, the first flow-based generative model as a GUI dexterous hand, featuring the following designs: (i) Unified Discrete-Continuous Actions, integrating discrete clicks and continuous drags within a shared model, enabling flexible adaptation across diverse interaction modes; (ii) Flow-based Action Generation for drag modeling, which predicts incremental cursor adjustments from continuous visual observations via a lightweight action expert, ensuring smooth and stable trajectories; (iii) Drag Training data and Benchmark, where we manually collect and synthesize 20K drag trajectories across five domains (e.g. PowerPoint, Adobe Premiere Pro), and introduce ScreenDrag, a benchmark with comprehensive online and offline evaluation protocols for assessing GUI agents' drag capabilities. Our experiments show that proprietary GUI agents still struggle on ScreenDrag (e.g. Operator scores 13.27, and the best Gemini-2.5-CUA reaches 22.18). In contrast, ShowUI-$π$ achieves 26.98 with only 450M parameters, underscoring both the difficulty of the task and the effectiveness of our approach. We hope this work advances GUI agents toward human-like dexterous control in the digital world. 
The code is available at https://github.com/showlab/showui-pi.

[164] Evaluating the Impact of Compression Techniques on the Robustness of CNNs under Natural Corruptions

Itallo Patrick Castro Alves Da Silva,Emanuel Adler Medeiros Pereira,Erick de Andrade Barboza,Baldoino Fonseca dos Santos Neto,Marcio de Medeiros Ribeiro

Main category: cs.CV

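TL;DR: Studies how compression techniques such as quantization, pruning, and weight clustering affect the robustness of convolutional neural networks under natural corruptions, finding that some compression strategies preserve or even improve robustness, particularly for more complex architectures.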

Details Motivation: Model compression may affect robustness under natural corruptions, so the behavior of compressed models in real-world conditions needs to be evaluated. Method: Applies individual and combined compression techniques to ResNet-50, VGG-19, and MobileNetV2, and evaluates the trade-offs between robustness, accuracy, and compression ratio on the CIFAR-10-C and CIFAR-100-C datasets. Result: Some compression strategies improve robustness, and a multi-objective assessment shows that customized combinations achieve better overall performance. Conclusion: Choosing the right combination of compression methods helps deploy vision models that are both efficient and robust on resource-constrained devices. Abstract: Compressed deep learning models are crucial for deploying computer vision systems on resource-constrained devices. However, model compression may affect robustness, especially under natural corruption. Therefore, it is important to consider robustness evaluation while validating computer vision systems. This paper presents a comprehensive evaluation of compression techniques (quantization, pruning, and weight clustering), applied individually and in combination, to convolutional neural networks (ResNet-50, VGG-19, and MobileNetV2). Using the CIFAR-10-C and CIFAR-100-C datasets, we analyze the trade-offs between robustness, accuracy, and compression ratio. Our results show that certain compression strategies not only preserve but can also improve robustness, particularly on networks with more complex architectures. Utilizing multi-objective assessment, we determine the best configurations, showing that customized technique combinations produce beneficial multi-objective results. This study provides insights into selecting compression methods for robust and efficient deployment of models in corrupted real-world environments.

[165] DarkEQA: Benchmarking Vision-Language Models for Embodied Question Answering in Low-Light Indoor Environments

Yohan Park,Hyunwoo Ha,Wonjun Jo,Tae-Hyun Oh

Main category: cs.CV

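TL;DR: Presents DarkEQA, an open-source benchmark for evaluating the perception of vision-language models under multi-level low-light conditions, highlighting the performance limits of existing models in low light.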

Details Motivation: Existing vision-language model (VLM) benchmarks are evaluated mainly under ideal lighting and overlook visual degradations common in practice, such as low light, which limits robustness in around-the-clock operation. Method: Builds DarkEQA, a physically realistic open-source benchmark that simulates physics-based illumination drop and sensor noise in linear RAW space, followed by an ISP-inspired rendering pipeline, to test question answering from egocentric observations under controlled degradations. Result: Experiments on a range of state-of-the-art VLMs and Low-Light Image Enhancement (LLIE) models systematically reveal performance drops under low-light conditions, supporting the usefulness of DarkEQA. Conclusion: DarkEQA is a reliable tool for evaluating the perceptual robustness of VLMs in real low-light environments and underscores the need to improve VLMs under adverse visual conditions. Abstract: Vision Language Models (VLMs) are increasingly adopted as central reasoning modules for embodied agents. Existing benchmarks evaluate their capabilities under ideal, well-lit conditions, yet robust 24/7 operation demands performance under a wide range of visual degradations, including low-light conditions at night or in dark environments--a core necessity that has been largely overlooked. To address this underexplored challenge, we present DarkEQA, an open-source benchmark for evaluating EQA-relevant perceptual primitives under multi-level low-light conditions. DarkEQA isolates the perception bottleneck by evaluating question answering from egocentric observations under controlled degradations, enabling attributable robustness analysis. A key design feature of DarkEQA is its physical fidelity: visual degradations are modeled in linear RAW space, simulating physics-based illumination drop and sensor noise followed by an ISP-inspired rendering pipeline. We demonstrate the utility of DarkEQA by evaluating a wide range of state-of-the-art VLMs and Low-Light Image Enhancement (LLIE) models. Our analysis systematically reveals VLMs' limitations when operating under these challenging visual conditions. Our code and benchmark dataset will be released upon acceptance.

[166] Bi-C2R: Bidirectional Continual Compatible Representation for Re-indexing Free Lifelong Person Re-identification

Zhenyu Cui,Jiahuan Zhou,Yuxin Peng

Main category: cs.CV

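TL;DR: This paper introduces a new task, Re-index Free Lifelong person Re-IDentification (RFL-ReID), which performs lifelong person re-identification without re-indexing historical gallery images, and designs the Bi-C2R framework, which keeps the features of new and old models compatible while avoiding catastrophic forgetting and significantly improves performance.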

Details Motivation: Existing L-ReID methods rely on re-indexing the historical gallery to maintain performance, but this raises privacy concerns and high computational costs; without re-indexing, new and old features become incompatible, which limits practical use. Method: The Bi-C2R framework uses a bidirectional feature update mechanism to keep learning new knowledge without re-extracting historical features, while keeping the features output by the new and old models compatible and supporting efficient inference. Result: Extensive experiments on multiple benchmarks show that the method achieves leading performance on both the RFL-ReID task and the traditional L-ReID task, and the results are backed by theoretical analysis. Conclusion: Bi-C2R addresses the challenge of re-indexing-free RFL-ReID, balancing new and old knowledge and keeping features compatible, and offers a feasible solution for practical deployment. Abstract: Lifelong person Re-IDentification (L-ReID) exploits sequentially collected data to continuously train and update a ReID model, focusing on the overall performance of all data. Its main challenge is to avoid the catastrophic forgetting problem of old knowledge while training on new data. Existing L-ReID methods typically re-extract new features for all historical gallery images for inference after each update, known as "re-indexing". However, historical gallery images often cannot be saved directly because of data privacy issues, and re-indexing large-scale gallery images is costly. As a result, retrieval inevitably becomes incompatible between query features extracted by the updated model and gallery features extracted by the models before the update, greatly impairing the re-identification performance. To tackle the above issue, this paper focuses on a new task called Re-index Free Lifelong person Re-IDentification (RFL-ReID), which requires performing lifelong person re-identification without re-indexing historical gallery images. Therefore, RFL-ReID is more challenging than L-ReID, requiring continuous learning and balancing new and old knowledge in diverse streaming data, and making the features output by the new and old models compatible with each other. To this end, we propose a Bidirectional Continuous Compatible Representation (Bi-C2R) framework to continuously update the gallery features extracted by the old model to perform efficient L-ReID in a compatible manner. We verify our proposed Bi-C2R method through theoretical analysis and extensive experiments on multiple benchmarks, which demonstrate that the proposed method can achieve leading performance on both the introduced RFL-ReID task and the traditional L-ReID task.

[167] FoundationSLAM: Unleashing the Power of Depth Foundation Models for End-to-End Dense Visual SLAM

Yuchen Wu,Jiahe Li,Fabio Tosi,Matteo Poggi,Jin Zheng,Xiao Bai

Main category: cs.CV

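TL;DR: FoundationSLAM is a learning-based monocular dense SLAM system that improves the accuracy and robustness of tracking and mapping by incorporating geometric guidance from foundation depth models.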

Details Motivation: Previous flow-based approaches to monocular dense SLAM lack geometric consistency; the goal is more accurate and robust pose estimation and dense reconstruction. Method: A Hybrid Flow Network produces geometry-aware correspondences, a Bi-Consistent Bundle Adjustment Layer performs joint optimization under multi-view constraints, and a Reliability-Aware Refinement mechanism dynamically adapts the flow update process. Result: The system achieves better trajectory accuracy and dense reconstruction quality than existing methods on multiple challenging datasets while running in real time at 18 FPS. Conclusion: By bridging flow estimation with geometric reasoning, FoundationSLAM significantly improves the performance and generalization of monocular dense SLAM and is practically applicable. Abstract: We present FoundationSLAM, a learning-based monocular dense SLAM system that addresses the absence of geometric consistency in previous flow-based approaches for accurate and robust tracking and mapping. Our core idea is to bridge flow estimation with geometric reasoning by leveraging the guidance from foundation depth models. To this end, we first develop a Hybrid Flow Network that produces geometry-aware correspondences, enabling consistent depth and pose inference across diverse keyframes. To enforce global consistency, we propose a Bi-Consistent Bundle Adjustment Layer that jointly optimizes keyframe pose and depth under multi-view constraints. Furthermore, we introduce a Reliability-Aware Refinement mechanism that dynamically adapts the flow update process by distinguishing between reliable and uncertain regions, forming a closed feedback loop between matching and optimization. Extensive experiments demonstrate that FoundationSLAM achieves superior trajectory accuracy and dense reconstruction quality across multiple challenging datasets, while running in real time at 18 FPS, showing strong generalization to various scenarios and the practical applicability of our method.

[168] From Inpainting to Editing: A Self-Bootstrapping Framework for Context-Rich Visual Dubbing

Xu He,Haoxian Zhang,Hejia Chen,Changyuan Zheng,Liyang Chen,Songlin Tang,Jiehui Huang,Xiaoqiang Liu,Pengfei Wan,Zhiyong Wu

Main category: cs.CV

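TL;DR: This paper proposes a self-bootstrapping framework that reframes audio-driven visual dubbing from an ill-posed inpainting task into a well-conditioned video-to-video editing problem, using a Diffusion Transformer to generate ideal training data and to perform end-to-end editing, which significantly improves lip-sync accuracy, identity preservation, and robustness in real-world scenes.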

Details Motivation: Existing methods lack ideal paired training data (video pairs that differ only in lip motion while all other visual conditions are identical) and therefore rely on a mask-based inpainting paradigm, which forces the model to hallucinate content and synchronize lips at the same time, causing visual artifacts, identity drift, and poor synchronization. Method: A self-bootstrapping framework based on a Diffusion Transformer (DiT): the DiT is first used as a data generator to synthesize, for each real video sample, a lip-altered but visually aligned companion video, forming ideal training pairs; a DiT-based audio-driven editor is then trained end-to-end on these pairs, with a timestep-adaptive multi-phase learning strategy that disentangles conflicting editing objectives across diffusion timesteps. Result: The method outperforms existing methods in lip-sync accuracy, identity preservation, and visual fidelity, and is more robust in complex real-world scenes; the proposed ContextDubBench benchmark supports evaluation in diverse and challenging application scenarios. Conclusion: By constructing ideal training data and conditioning on complete, aligned input frames, the paper turns visual dubbing into a well-defined video editing problem and, combined with the multi-phase training strategy, achieves high-quality and stable audio-driven lip-sync editing. Abstract: Audio-driven visual dubbing aims to synchronize a video's lip movements with new speech, but is fundamentally challenged by the lack of ideal training data: paired videos where only a subject's lip movements differ while all other visual conditions are identical. Existing methods circumvent this with a mask-based inpainting paradigm, where an incomplete visual conditioning forces models to simultaneously hallucinate missing content and sync lips, leading to visual artifacts, identity drift, and poor synchronization. In this work, we propose a novel self-bootstrapping framework that reframes visual dubbing from an ill-posed inpainting task into a well-conditioned video-to-video editing problem. Our approach employs a Diffusion Transformer, first as a data generator, to synthesize ideal training data: a lip-altered companion video for each real sample, forming visually aligned video pairs. A DiT-based audio-driven editor is then trained on these pairs end-to-end, leveraging the complete and aligned input video frames to focus solely on precise, audio-driven lip modifications. This complete, frame-aligned input conditioning forms a rich visual context for the editor, providing it with complete identity cues, scene interactions, and continuous spatiotemporal dynamics. Leveraging this rich context fundamentally enables our method to achieve highly accurate lip sync, faithful identity preservation, and exceptional robustness against challenging in-the-wild scenarios. We further introduce a timestep-adaptive multi-phase learning strategy as a necessary component to disentangle conflicting editing objectives across diffusion timesteps, thereby facilitating stable training and yielding enhanced lip synchronization and visual fidelity. Additionally, we propose ContextDubBench, a comprehensive benchmark dataset for robust evaluation in diverse and challenging practical application scenarios.

[169] FineTec: Fine-Grained Action Recognition Under Temporal Corruption via Skeleton Decomposition and Sequence Completion

Dian Shao,Mingfei Shi,Like Liu

Main category: cs.CV

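TL;DR: This paper proposes FineTec, a unified framework for fine-grained action recognition under temporally corrupted sequences. Through context-aware completion, spatial decomposition, and physics-based dynamics estimation, the method significantly improves recognition performance under severe data loss.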

Details Motivation: Online pose estimation often yields substantial missing data, and existing methods struggle to recover temporal dynamics and fine-grained spatial structure, losing subtle but crucial motion cues. A method is needed that copes with temporal corruption while preserving fine-grained action features. Method: 1) Restore a base skeleton sequence using context-aware completion with diverse temporal masking; 2) generate augmented skeleton sequences via semantic region partitioning and dynamic/static grouping; 3) estimate joint accelerations using Lagrangian dynamics; 4) fuse the position and acceleration sequences and feed them into a GCN for action recognition. Result: The method is validated on multiple benchmarks, including NTU-60, NTU-120, Gym99, and Gym288, reaching top-1 accuracies of 89.1% and 78.1% on the Gym99-severe and Gym288-severe settings, respectively. Conclusion: FineTec significantly outperforms existing methods across levels of temporal corruption, showing strong robustness and generalization, and is suited to fine-grained action recognition in real-world scenarios. Abstract: Recognizing fine-grained actions from temporally corrupted skeleton sequences remains a significant challenge, particularly in real-world scenarios where online pose estimation often yields substantial missing data. Existing methods often struggle to accurately recover temporal dynamics and fine-grained spatial structures, resulting in the loss of subtle motion cues crucial for distinguishing similar actions. To address this, we propose FineTec, a unified framework for Fine-grained action recognition under Temporal Corruption. FineTec first restores a base skeleton sequence from corrupted input using context-aware completion with diverse temporal masking. Next, a skeleton-based spatial decomposition module partitions the skeleton into five semantic regions, further divides them into dynamic and static subgroups based on motion variance, and generates two augmented skeleton sequences via targeted perturbation. These, along with the base sequence, are then processed by a physics-driven estimation module, which utilizes Lagrangian dynamics to estimate joint accelerations. Finally, both the fused skeleton position sequence and the fused acceleration sequence are jointly fed into a GCN-based action recognition head. Extensive experiments on both coarse-grained (NTU-60, NTU-120) and fine-grained (Gym99, Gym288) benchmarks show that FineTec significantly outperforms previous methods under various levels of temporal corruption. Specifically, FineTec achieves top-1 accuracies of 89.1% and 78.1% on the challenging Gym99-severe and Gym288-severe settings, respectively, demonstrating its robustness and generalizability. Code and datasets can be found at https://smartdianlab.github.io/projects-FineTec/.
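
Two ingredients of the pipeline, grouping joints by motion variance and deriving an acceleration sequence from joint positions, can be sketched in a few lines. This is an illustrative simplification, not FineTec's method: accelerations are approximated here with a central finite difference rather than the Lagrangian dynamics the paper uses, and the function names, joint names, and `threshold` parameter are hypothetical.

```python
def finite_difference_acceleration(positions, dt=1.0):
    """Central second difference a_t = (x_{t+1} - 2*x_t + x_{t-1}) / dt^2 per coordinate."""
    acc = []
    for t in range(1, len(positions) - 1):
        prev, cur, nxt = positions[t - 1], positions[t], positions[t + 1]
        acc.append(tuple((n - 2 * c + p) / dt ** 2 for p, c, n in zip(prev, cur, nxt)))
    return acc


def motion_variance(positions):
    """Variance of a joint's position over time, summed over coordinates."""
    n = len(positions)
    total = 0.0
    for d in range(len(positions[0])):
        values = [p[d] for p in positions]
        mean = sum(values) / n
        total += sum((v - mean) ** 2 for v in values) / n
    return total


def split_dynamic_static(joint_tracks, threshold):
    """Assign each joint to the dynamic or static group by its motion variance."""
    dynamic = [name for name, track in joint_tracks.items()
               if motion_variance(track) > threshold]
    static = [name for name in joint_tracks if name not in dynamic]
    return dynamic, static


# Hypothetical 2D tracks: a wrist moving along x as t^2 and a stationary hip.
tracks = {
    "wrist": [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0), (9.0, 0.0)],
    "hip": [(1.0, 1.0)] * 4,
}
dynamic, static = split_dynamic_static(tracks, threshold=0.5)
wrist_acc = finite_difference_acceleration(tracks["wrist"])
```

The finite difference needs one frame on each side, so a sequence of T frames yields T - 2 acceleration samples; a real pipeline would have to pad or align the two sequences before fusing them.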

[170] Edit3r: Instant 3D Scene Editing from Sparse Unposed Images

Jiageng Liu,Weijie Lyu,Xueting Li,Yejie Guo,Ming-Hsuan Yang

Main category: cs.CV

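TL;DR: Edit3r is a feed-forward framework that reconstructs and edits 3D scenes in a single pass from unposed, view-inconsistent, instruction-edited images, without per-scene optimization or pose estimation; it is fast, produces high-quality results, and shows potential for real-time applications.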

Details Motivation: Existing 3D scene editing methods usually require per-scene optimization and accurate pose estimation, which is slow and makes fast, consistent editing difficult. The lack of multi-view consistent edited images also limits model training. An efficient, optimization-free, single-pass 3D editing framework is therefore needed. Method: Edit3r uses a feed-forward network to directly predict instruction-aligned 3D edits; a SAM2-based recoloring strategy generates cross-view-consistent supervision, and an asymmetric input strategy fuses a reference view with auxiliary views to improve generalization to unseen 2D edits (such as InstructPix2Pix). Result: In a large-scale evaluation on the newly built DL3DV-Edit-Bench benchmark, Edit3r outperforms existing methods in semantic alignment and 3D consistency, with significantly faster inference and support for high-quality real-time rendering. Conclusion: Edit3r achieves fast, optimization-free, single-pass 3D scene editing with a good balance of realism, consistency, and efficiency, offering a feasible approach to real-time 3D content creation. Abstract: We present Edit3r, a feed-forward framework that reconstructs and edits 3D scenes in a single pass from unposed, view-inconsistent, instruction-edited images. Unlike prior methods requiring per-scene optimization, Edit3r directly predicts instruction-aligned 3D edits, enabling fast and photorealistic rendering without optimization or pose estimation. A key challenge in training such a model lies in the absence of multi-view consistent edited images for supervision. We address this with (i) a SAM2-based recoloring strategy that generates reliable, cross-view-consistent supervision, and (ii) an asymmetric input strategy that pairs a recolored reference view with raw auxiliary views, encouraging the network to fuse and align disparate observations. At inference, our model effectively handles images edited by 2D methods such as InstructPix2Pix, despite not being exposed to such edits during training. For large-scale quantitative evaluation, we introduce DL3DV-Edit-Bench, a benchmark built on the DL3DV test split, featuring 20 diverse scenes, 4 edit types and 100 edits in total. Comprehensive quantitative and qualitative results show that Edit3r achieves superior semantic alignment and enhanced 3D consistency compared to recent baselines, while operating at significantly higher inference speed, making it promising for real-time 3D editing applications.

[171] GaMO: Geometry-aware Multi-view Diffusion Outpainting for Sparse-View 3D Reconstruction

Yi-Chuan Huang,Hao-Jen Chien,Chin-Yang Lin,Ying-Huan Chen,Yu-Lun Liu

Main category: cs.CV

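TL;DR: This paper proposes GaMO (Geometry-aware Multi-view Outpainter), a new framework that tackles sparse-view 3D reconstruction through multi-view outpainting. Unlike diffusion methods that generate novel views, GaMO expands the field of view from existing camera poses, preserving geometric consistency and broadening scene coverage. It runs zero-shot without training and achieves better reconstruction quality than existing methods on multiple datasets, with a 25× speedup.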

Details Motivation: Existing sparse-view 3D reconstruction methods have three major problems: inadequate coverage beyond the periphery of known views, geometric inconsistency across generated views, and high computational cost. The authors therefore call for a method that preserves geometric consistency, broadens scene coverage, and is efficient. Method: The GaMO framework reformulates sparse-view reconstruction as a multi-view outpainting task. Using multi-view conditioning and geometry-aware denoising strategies, it expands the field of view from existing poses without generating new viewpoints, achieving high-quality outpainting zero-shot and without training. Result: On the Replica and ScanNet++ datasets with 3, 6, and 9 input views, GaMO outperforms existing methods in PSNR and LPIPS and is 25× faster than state-of-the-art diffusion-based methods, with processing time under 10 minutes. Conclusion: Through its multi-view outpainting strategy, GaMO addresses the coverage, consistency, and efficiency problems of sparse-view reconstruction, reaches state-of-the-art quality and speed, and suggests a new direction for 3D scene reconstruction from sparse inputs. Abstract: Recent advances in 3D reconstruction have achieved remarkable progress in high-quality scene capture from dense multi-view imagery, yet struggle when input views are limited. Various approaches, including regularization techniques, semantic priors, and geometric constraints, have been implemented to address this challenge. The latest diffusion-based methods have demonstrated substantial improvements by generating novel views from new camera poses to augment training data, surpassing earlier regularization and prior-based techniques. Despite this progress, we identify three critical limitations in these state-of-the-art approaches: inadequate coverage beyond known view peripheries, geometric inconsistencies across generated views, and computationally expensive pipelines. We introduce GaMO (Geometry-aware Multi-view Outpainter), a framework that reformulates sparse-view reconstruction through multi-view outpainting. Instead of generating new viewpoints, GaMO expands the field of view from existing camera poses, which inherently preserves geometric consistency while providing broader scene coverage. Our approach employs multi-view conditioning and geometry-aware denoising strategies in a zero-shot manner without training. Extensive experiments on Replica and ScanNet++ demonstrate state-of-the-art reconstruction quality across 3, 6, and 9 input views, outperforming prior methods in PSNR and LPIPS, while achieving a $25\times$ speedup over SOTA diffusion-based methods with processing time under 10 minutes. Project page: https://yichuanh.github.io/GaMO/
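
Of the two metrics cited above, PSNR has a closed form, PSNR = 10 · log10(MAX² / MSE), and can be computed directly; LPIPS depends on a learned network and is not sketched here. The function below is a generic illustration over flat lists of pixel values in [0, 1], not the paper's evaluation code, and its name and signature are assumptions.

```python
import math


def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if not reference or len(reference) != len(estimate):
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error between the rendered image and the ground truth.
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical usage: every pixel is off by 0.1, so MSE = 0.01.
score = psnr([0.0, 0.5, 1.0], [0.1, 0.6, 0.9])
```

Higher PSNR means a closer pixel-wise match to the reference, whereas lower LPIPS means a closer perceptual match, so the two metrics improve in opposite directions.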

[172] SpaceTimePilot: Generative Rendering of Dynamic Scenes Across Space and Time

Zhening Huang,Hyeonho Jeong,Xuelin Chen,Yulia Gryaditskaya,Tuanfeng Y. Wang,Joan Lasenby,Chun-Hao Huang

Main category: cs.CV

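TL;DR: The paper proposes SpaceTimePilot, a video diffusion model that achieves controllable generative rendering by disentangling spatial and temporal control, allowing the camera viewpoint and the motion sequence of a monocular video to be adjusted independently.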

Details Motivation: Existing methods struggle to control a video's spatial viewpoint and temporal dynamics independently during generation, and no paired video data with continuous temporal variation is available for training. Method: An animation time-embedding mechanism explicitly controls the motion sequence of the output video; a temporal-warping training scheme repurposes multi-view datasets to mimic temporal differences; the camera-conditioning mechanism is improved, and CamxTime, the first rendering dataset with full space-and-time coverage, is built for joint training. Result: Experiments on real-world and synthetic data show that the model achieves effective space-time disentanglement, supports arbitrary viewpoint and motion control, provides more precise temporal control, and outperforms prior methods. Conclusion: With its new temporal modeling and data strategies, SpaceTimePilot achieves high-quality space-time disentangled video generation and offers a new approach to controllable rendering of dynamic scenes. Abstract: We present SpaceTimePilot, a video diffusion model that disentangles space and time for controllable generative rendering. Given a monocular video, SpaceTimePilot can independently alter the camera viewpoint and the motion sequence within the generative process, re-rendering the scene for continuous and arbitrary exploration across space and time. To achieve this, we introduce an effective animation time-embedding mechanism in the diffusion process, allowing explicit control of the output video's motion sequence with respect to that of the source video. As no datasets provide paired videos of the same dynamic scene with continuous temporal variations, we propose a simple yet effective temporal-warping training scheme that repurposes existing multi-view datasets to mimic temporal differences. This strategy effectively supervises the model to learn temporal control and achieve robust space-time disentanglement. To further enhance the precision of dual control, we introduce two additional components: an improved camera-conditioning mechanism that allows altering the camera from the first frame, and CamxTime, the first synthetic space-and-time full-coverage rendering dataset that provides fully free space-time video trajectories within a scene. Joint training on the temporal-warping scheme and the CamxTime dataset yields more precise temporal control. We evaluate SpaceTimePilot on both real-world and synthetic data, demonstrating clear space-time disentanglement and strong results compared to prior work. Project page: https://zheninghuang.github.io/Space-Time-Pilot/ Code: https://github.com/ZheningHuang/spacetimepilot