Table of Contents
cs.CL
[1] Enriching Historical Records: An OCR and AI-Driven Approach for Database Integration
Zahra Abedi,Richard M. K. van Dijk,Gijs Wijnholds,Tessa Verhoef
Main category: cs.CL
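TL;DR: This study proposes an automated pipeline that combines OCR, generative AI, and database linking to digitize historical records of Leiden University figures and integrate them into an existing database.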
Details
Motivation: To address how unstructured data in paper historical documents can be efficiently and automatically converted into structured digital records and accurately matched against a high-quality database.
Method: OCR extracts text from the page images; generative AI with constraints applied at decoding time improves the accuracy of structured data extraction (e.g., JSON); a record linkage algorithm matches the extracted results against the existing database.
Result: OCR reached a character error rate of 1.08% and a word error rate of 5.06%; JSON extraction accuracy was 63% from OCR text and 65% from annotated OCR; record linkage reached 94% accuracy on annotated data and 81% on OCR-derived data.
Conclusion: The automated pipeline effectively supports the digitization of historical documents and shows that generative AI can compensate for OCR errors to some extent, improving data structuring and integration overall, which matters for digital humanities research.
Abstract: This research digitizes and analyzes the Leidse hoogleraren en lectoren 1575-1815 books written between 1983 and 1985, which contain biographic data about professors and curators of Leiden University. It addresses the central question: how can we design an automated pipeline that integrates OCR, LLM-based interpretation, and database linking to harmonize data from historical document images with existing high-quality database records? We applied OCR techniques, generative AI decoding constraints that structure data extraction, and database linkage methods to process typewritten historical records into a digital format. OCR achieved a Character Error Rate (CER) of 1.08 percent and a Word Error Rate (WER) of 5.06 percent, while JSON extraction from OCR text achieved an average accuracy of 63 percent and, based on annotated OCR, 65 percent. This indicates that generative AI somewhat corrects low OCR performance. Our record linkage algorithm linked annotated JSON files with 94% accuracy and OCR-derived JSON files with 81%. This study contributes to digital humanities research by offering an automated pipeline for interpreting digitized historical documents, addressing challenges like layout variability and terminology differences, and exploring the applicability and strength of an advanced generative AI model.
[2] CAT: A Metric-Driven Framework for Analyzing the Consistency-Accuracy Relation of LLMs under Controlled Input Variations
Paulo Cavalin,Cassia Sanctos,Marcelo Grave,Claudio Pinhanez,Yago Primerano
Main category: cs.CL
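TL;DR: This paper proposes the CAT framework for evaluating and visualizing the interplay between accuracy and response consistency of large language models under controllable input variations, using multiple-choice benchmarks as a case study.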
Details
Motivation: Current evaluation practice focuses mainly on model capabilities such as accuracy or benchmark scores, while consistency has recently come to be seen as an important property for deploying LLMs in high-stakes real-world applications. The interdependence between accuracy and consistency also needs to be considered for a more nuanced evaluation of LLMs.
Method: At the core of CAT are the Consistency-Accuracy Relation (CAR) curves, which show how model accuracy changes as the consistency requirement, defined by the Minimum-Consistency Accuracy (MCA) metric, increases. The paper also proposes the Consistency-Oriented Robustness Estimate (CORE) index, a global metric that combines the area and shape of the CAR curve to quantify the trade-off between accuracy and consistency.
Result: The framework is demonstrated on a range of generalist and domain-specific LLMs across multiple multiple-choice benchmarks, and the paper outlines how CAT can be extended beyond MC tasks to long-form, open-ended evaluation through adaptable scoring functions.
Conclusion: CAT offers a new way to jointly evaluate the accuracy and response consistency of large language models, helping to better understand how these models behave across application scenarios.
Abstract: We introduce CAT, a framework designed to evaluate and visualize the interplay of accuracy and response consistency of Large Language Models (LLMs) under controllable input variations, using multiple-choice (MC) benchmarks as a case study. Current evaluation practices primarily focus on model capabilities such as accuracy or benchmark scores and, more recently, measuring consistency is being considered an essential property for deploying LLMs in high-stakes, real-world applications. We argue in this paper that although both dimensions should still be evaluated independently, their inter-dependency also needs to be considered for a more nuanced evaluation of LLMs. At the core of CAT are the Consistency-Accuracy Relation (CAR) curves, which visualize how model accuracy varies with increasing consistency requirements, as defined by the Minimum-Consistency Accuracy (MCA) metric. We further propose the Consistency-Oriented Robustness Estimate (CORE) index, a global metric that combines the area and shape of the CAR curve to quantify the trade-off between accuracy and consistency. We present a practical demonstration of our framework across a diverse set of generalist and domain-specific LLMs, evaluated on multiple MC benchmarks. We also outline how CAT can be extended beyond MC tasks to support long-form, open-ended evaluations through adaptable scoring functions.
[3] STED and Consistency Scoring: A Framework for Evaluating LLM Structured Output Reliability
Guanghui Wang,Jinze Yu,Xing Zhang,Dayuan Jiang,Yin Song,Tomal Deb,Xuefeng Liu,Peiyang He
Main category: cs.CL
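TL;DR: This paper proposes a comprehensive framework for evaluating and improving the consistency of structured outputs generated by large language models (LLMs), comprising a new Semantic Tree Edit Distance (STED) metric and a consistency scoring scheme. Experiments show that it outperforms existing metrics and can be used for model selection, prompt refinement, and diagnostic analysis.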
Details
Motivation: To ensure that large language models generate structured data consistently and reliably in production, addressing the current problem of inconsistent outputs.
Method: The paper proposes STED (Semantic Tree Edit Distance) as a new metric for the similarity of JSON outputs and builds a consistency scoring framework that aggregates STED measurements over repeated generations; its effectiveness is validated through systematic experiments on synthetic datasets.
Result: STED reaches similarity scores of 0.86-0.90 on semantically equivalent samples and 0 on structural breaks, clearly outperforming TED, BERTScore, and DeepDiff. An evaluation of six LLMs shows that Claude-3.7-Sonnet is the most consistent and remains stable even at high temperature.
Conclusion: The framework provides theoretical foundations and practical tools for evaluating and improving the consistency of LLM-generated structured outputs, helping to make LLMs more reliable and usable in production systems.
Abstract: Large Language Models (LLMs) are increasingly deployed for structured data generation, yet output consistency remains critical for production applications. We introduce a comprehensive framework for evaluating and improving consistency in LLM-generated structured outputs. Our approach combines: (1) STED (Semantic Tree Edit Distance), a novel similarity metric balancing semantic flexibility with structural strictness when comparing JSON outputs, and (2) a consistency scoring framework aggregating multiple STED measurements across repeated generations to quantify reliability. Through systematic experiments on synthetic datasets with controlled schema, expression, and semantic variations, we demonstrate STED achieves superior performance (0.86-0.90 similarity for semantic equivalents, 0.0 for structural breaks) compared to existing metrics including TED, BERTScore, and DeepDiff. Applying our framework to benchmark six LLMs reveals significant variations: Claude-3.7-Sonnet demonstrates exceptional consistency, maintaining near-perfect structural reliability even at high temperatures (T = 0.9), while models like Claude-3-Haiku and Nova-Pro exhibit substantial degradation requiring careful tuning. Our framework enables practical applications including targeted model selection for structured tasks, iterative prompt refinement for reproducible results, and diagnostic analysis to identify inconsistency root causes.
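Editor's sketch: the idea of aggregating a pairwise similarity over repeated generations can be illustrated as below. This is not the paper's STED metric, whose definition the abstract does not give. The `similarity` function here is a deliberately simple stand-in (Jaccard overlap of flattened path/value pairs), and the names `flatten`, `similarity`, and `consistency_score` are hypothetical. Averaging over all pairs is likewise an assumption about how the aggregation might work.

```python
import json
from itertools import combinations


def flatten(obj, prefix=""):
    """Flatten nested JSON into a set of (path, serialized value) pairs."""
    items = set()
    if isinstance(obj, dict):
        for key, value in obj.items():
            items |= flatten(value, f"{prefix}/{key}")
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            items |= flatten(value, f"{prefix}/{index}")
    else:
        items.add((prefix, json.dumps(obj)))
    return items


def similarity(a, b):
    """Stand-in similarity (NOT STED): Jaccard overlap of flattened pairs."""
    fa, fb = flatten(a), flatten(b)
    if not fa and not fb:
        return 1.0  # two empty structures are treated as identical
    return len(fa & fb) / len(fa | fb)


def consistency_score(outputs):
    """Mean pairwise similarity over repeated generations (JSON strings)."""
    parsed = [json.loads(text) for text in outputs]
    pairs = list(combinations(parsed, 2))
    if not pairs:
        return 1.0  # a single generation is trivially consistent
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    runs = ['{"name": "Ada", "tags": ["x"]}',
            '{"tags": ["x"], "name": "Ada"}',
            '{"name": "Bob", "tags": ["y"]}']
    print(round(consistency_score(runs), 3))
```

A semantic metric would replace `similarity` so that paraphrased values score as near-equivalent; the stand-in only rewards exact matches at the same path, which is stricter than what the abstract describes.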
This work provides theoretical foundations and practical tools for ensuring reliable structured output generation in LLM-based production systems.
[4] PyBangla at BLP-2025 Task 2: Enhancing Bangla-to-Python Code Generation with Iterative Self-Correction and Multilingual Agents
Jahidul Islam,Md Ataullha,Saiful Azad
Main category: cs.CL
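TL;DR: This paper proposes BanglaCodeAct, a framework based on multi-agent prompting and iterative self-correction for generating Python code from Bangla, which substantially improves code generation for a low-resource language.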
Details
Motivation: Existing large language models excel at code generation from English but have made limited progress on low-resource languages such as Bangla, and effective methods that avoid task-specific fine-tuning are lacking.
Method: BanglaCodeAct uses an open-source multilingual LLM inside a Thought-Code-Observation loop, with multi-agent prompting, to generate, test, and refine code dynamically without fine-tuning.
Result: Among several small-parameter open-source LLMs evaluated on the mHumanEval dataset, Qwen3-8B combined with BanglaCodeAct performs best, with pass@1 of 94.0% on the development set and 71.6% on the blind test set.
Conclusion: The work sets a new benchmark for Bangla-to-Python code generation and shows the potential of agent-based reasoning for code generation in low-resource languages.
Abstract: LLMs excel at code generation from English prompts, but this progress has not extended to low-resource languages. We address Bangla-to-Python code generation by introducing BanglaCodeAct, an agent-based framework that leverages multi-agent prompting and iterative self-correction. Unlike prior approaches relying on task-specific fine-tuning, BanglaCodeAct employs an open-source multilingual LLM within a Thought-Code-Observation loop, enabling dynamic generation, testing, and refinement of code from Bangla instructions. We benchmark several small-parameter open-source LLMs and evaluate their effectiveness on the mHumanEval dataset for Bangla NL2Code. Our results show that Qwen3-8B, when deployed with BanglaCodeAct, achieves the best performance, with pass@1 accuracy of 94.0% on the development set and 71.6% on the blind test set. These results establish a new benchmark for Bangla-to-Python translation and highlight the potential of agent-based reasoning for reliable code generation in low-resource languages. Experimental scripts are publicly available at github.com/jahidulzaid/PyBanglaCodeActAgent.
[5] PharmaShip: An Entity-Centric, Reading-Order-Supervised Benchmark for Chinese Pharmaceutical Shipping Documents
Tingwei Xie,Tianyi Zhou,Yonghong Song
Main category: cs.CL
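TL;DR: PharmaShip is a Chinese dataset of pharmaceutical logistics documents for testing pre-trained text-layout models under noisy OCR and heterogeneous templates. It supports sequence entity recognition, relation extraction, and reading order prediction, and puts forward sequence-aware constraints as a transferable bias for structure modeling.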
Details
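Motivation: Existing document understanding models perform poorly on real pharmaceutical documents with noisy OCR and diverse templates, and there is no unified, controlled benchmark for assessing model robustness in safety-critical settings.
Method: The authors build PharmaShip, a dataset of real Chinese pharmaceutical shipping documents covering three tasks: sequence entity recognition, relation extraction, and reading order prediction. They adopt an entity-centric evaluation protocol, standardize preprocessing, data splits, and optimization, compare five representative pixel-aware and geometry-aware models, and improve the models with reading-order regularization and longer-range positional encoding.
Result: Pixel and geometry information are complementary, but neither is sufficient on its own. Reading-order regularization clearly improves sequence entity recognition and entity linking and makes the models more robust. Extending positional coverage stabilizes late-page predictions and reduces truncation effects. Reading order prediction is accurate at the word level but remains hard at the segment level, mainly because of ambiguous boundaries and long-range crossings.
Conclusion: PharmaShip provides a controlled, reproducible benchmark for safety-critical document understanding in the pharmaceutical domain and confirms that sequence-aware constraints are an effective, transferable inductive bias that helps model complex document structure.
Abstract: We present PharmaShip, a real-world Chinese dataset of scanned pharmaceutical shipping documents designed to stress-test pre-trained text-layout models under noisy OCR and heterogeneous templates. PharmaShip covers three complementary tasks, namely sequence entity recognition (SER), relation extraction (RE), and reading order prediction (ROP), and adopts an entity-centric evaluation protocol to minimize confounds across architectures. We benchmark five representative baselines spanning pixel-aware and geometry-aware families (LiLT, LayoutLMv3-base, GeoLayoutLM and their available RORE-enhanced variants), and standardize preprocessing, splits, and optimization. Experiments show that pixels and explicit geometry provide complementary inductive biases, yet neither alone is sufficient: injecting reading-order-oriented regularization consistently improves SER and EL and yields the most robust configuration, while longer positional coverage stabilizes late-page predictions and reduces truncation artifacts. ROP is accurate at the word level but challenging at the segment level, reflecting boundary ambiguity and long-range crossings. PharmaShip thus establishes a controlled, reproducible benchmark for safety-critical document understanding in the pharmaceutical domain and highlights sequence-aware constraints as a transferable bias for structure modeling. We release the dataset at https://github.com/KevinYuLei/PharmaShip.
[6] Noise-Driven Persona Formation in Reflexive Neural Language Generation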
Toshiyuki Shigemura
Main category: cs.CL
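TL;DR: The paper proposes the Luca-Noise Reflex Protocol (LN-RP), a computational framework that injects stochastic noise to study noise-driven persona emergence in large language models. It observes nonlinear transitions in linguistic behavior and identifies three stable persona modes.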
Details
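Motivation: To explore how large language models form and maintain distinct persona traits under the influence of noise, and to understand the mechanisms behind reflexivity and emergent behavior during generation.
Method: Stochastic noise seeds are injected into the initial generation state, 152 generation cycles are run, and the dynamics of linguistic behavior and its entropy signatures are analyzed.
Result: Three stable persona modes with clearly different entropy signatures are found, and external noise can trigger phase transitions. Quantitative evaluation shows consistent persona retention and significant differences across modes (p < 0.01).
Conclusion: LN-RP provides a reproducible method for studying reflexive generation, emergent behavior, and long-range linguistic coherence in large language models.
Abstract: This paper introduces the Luca-Noise Reflex Protocol (LN-RP), a computational framework for analyzing noise-driven persona emergence in large language models. By injecting stochastic noise seeds into the initial generation state, we observe nonlinear transitions in linguistic behavior across 152 generation cycles. Our results reveal three stable persona modes with distinct entropy signatures, and demonstrate that external noise sources can reliably induce phase transitions in reflexive generation dynamics. Quantitative evaluation confirms consistent persona retention and significant differences across modes (p < 0.01). The protocol provides a reproducible method for studying reflexive generation, emergent behavior, and long-range linguistic coherence in LLMs.
[7] HarmTransform: Transforming Explicit Harmful Queries into Stealthy via Multi-Agent Debate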
Shenzhe Zhu
Main category: cs.CL
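TL;DR: This paper proposes HarmTransform, a multi-agent debate framework that transforms harmful queries into stealthier forms, with the aim of improving the safety alignment of large language models.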
Details
Motivation: Existing safety mechanisms mainly target overtly harmful content and struggle with queries that preserve malicious intent through covert rephrasing, which leaves a gap in safety training data.
Method: HarmTransform uses iterative critique and refinement among multiple agents to systematically generate high-quality, covert query variants that preserve the original harmful intent.
Result: HarmTransform significantly outperforms baseline methods at producing effective covert queries, but the analysis also finds that multi-agent debate can introduce topic shifts and unnecessary complexity.
Conclusion: Multi-agent debate shows promise for broadening the coverage of safety training data, but its side effects need watching; the work offers both new directions and cautions for future LLM safety alignment.
Abstract: Large language models (LLMs) are equipped with safety mechanisms to detect and block harmful queries, yet current alignment approaches primarily focus on overtly dangerous content and overlook more subtle threats. However, users can often disguise harmful intent through covert rephrasing that preserves malicious objectives while appearing benign, which creates a significant gap in existing safety training data. To address this limitation, we introduce HarmTransform, a multi-agent debate framework for systematically transforming harmful queries into stealthier forms while preserving their underlying harmful intent. Our framework leverages iterative critique and refinement among multiple agents to generate high-quality, covert harmful query transformations that can be used to improve future LLM safety alignment. Experiments demonstrate that HarmTransform significantly outperforms standard baselines in producing effective query transformations. At the same time, our analysis reveals that debate acts as a double-edged sword: while it can sharpen transformations and improve stealth, it may also introduce topic shifts and unnecessary complexity. These insights highlight both the promise and the limitations of multi-agent debate for generating comprehensive safety training data.
[8] Emergent World Beliefs: Exploring Transformers in Stochastic Games
Adam Kamel,Tanish Rastogi,Michael Ma,Kailash Ranganathan,Kevin Zhu
Main category: cs.CL
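TL;DR: The paper studies whether Transformer-based large language models (LLMs) can learn the latent state of the environment in games of incomplete information, using Texas Hold'em poker as the example. It finds that the model learns structures such as hand ranks and equity without supervision, and that these can be decoded with nonlinear probes, indicating an internal representation of the stochastic environment.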
Details
Motivation: To explore whether large language models in partially observable environments such as Texas Hold'em can spontaneously form an internal world model of the environment state, as they do in games of perfect information.
Method: A GPT-style model is pretrained on poker hand history data, and linear and nonlinear probes are applied to its internal activations to test whether they encode information such as hand ranks and equity.
Result: Without explicit supervision, the model learns both deterministic structure (such as hand ranks) and stochastic features (such as equity). Nonlinear probes decode these representations effectively, and the representations correlate significantly with theoretical belief states.
Conclusion: LLMs can learn and build internal representations of stochastic dynamics in incomplete-information environments, suggesting that their world-model capability extends to POMDP-style tasks.
Abstract: Transformer-based large language models (LLMs) have demonstrated strong reasoning abilities across diverse fields, from solving programming challenges to competing in strategy-intensive games such as chess. Prior work has shown that LLMs can develop emergent world models in games of perfect information, where internal representations correspond to latent states of the environment. In this paper, we extend this line of investigation to domains of incomplete information, focusing on poker as a canonical partially observable Markov decision process (POMDP). We pretrain a GPT-style model on Poker Hand History (PHH) data and probe its internal activations. Our results demonstrate that the model learns both deterministic structure, such as hand ranks, and stochastic features, such as equity, without explicit instruction. Furthermore, by using primarily nonlinear probes, we demonstrate that these representations are decodable and correlate with theoretical belief states, suggesting that LLMs are learning their own representation of the stochastic environment of Texas Hold'em Poker.
[9] When in Doubt, Deliberate: Confidence-Based Routing to Expert Debate for Sexism Detection
Anwar Alajmi,Gabriele Pergola
Main category: cs.CL
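TL;DR: This paper proposes a two-stage framework to handle data sparsity, noise, and conceptual ambiguity in detecting sexist content online. By improving the training strategy and adding a reasoning-based expert collaboration module, it achieves state-of-the-art results on several benchmarks.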
Details
Motivation: Sexist content is increasingly subtle and context-dependent, so traditional detection methods struggle to identify it, and existing models handle annotation noise, class imbalance, and semantic ambiguity poorly.
Method: A two-stage framework. Training uses class-balanced focal loss, class-aware batching, and threshold calibration. At inference, a dynamic routing mechanism classifies high-confidence samples directly and passes uncertain samples to a multi-persona Collaborative Expert Judgment (CEJ) module that consolidates the personas' reasoning.
Result: F1 improves by +2.72% on EXIST 2025 Task 1.1, and by +4.48% and +1.30% on EDOS Tasks A and B respectively.
Conclusion: The method effectively addresses the challenges of data scarcity, noise, and conceptual ambiguity, and clearly improves detection of implicit sexist content.
Abstract: Sexist content online increasingly appears in subtle, context-dependent forms that evade traditional detection methods. Its interpretation often depends on overlapping linguistic, psychological, legal, and cultural dimensions, which produce mixed and sometimes contradictory signals, even in annotated datasets. These inconsistencies, combined with label scarcity and class imbalance, result in unstable decision boundaries and cause fine-tuned models to overlook subtler, underrepresented forms of harm. Together, these limitations point to the need for a design that explicitly addresses the combined effects of (i) underrepresentation, (ii) noise, and (iii) conceptual ambiguity in both data and model predictions. To address these challenges, we propose a two-stage framework that unifies (i) targeted training procedures to adapt supervision to scarce and noisy data with (ii) selective, reasoning-based inference to handle ambiguous or borderline cases. Our training setup applies class-balanced focal loss, class-aware batching, and post-hoc threshold calibration to mitigate label imbalance and noisy supervision. At inference time, a dynamic routing mechanism classifies high-confidence cases directly and escalates uncertain instances to a novel Collaborative Expert Judgment (CEJ) module, which prompts multiple personas and consolidates their reasoning through a judge model.
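Editor's sketch: the routing step described above can be pictured roughly as follows. The abstract does not say how confidence is measured or which thresholds are used, so everything here is an assumption: confidence is taken as the distance of the classifier probability from 0.5, the threshold values are placeholders, and `route` and its `escalate` callback (standing in for the CEJ module) are hypothetical names.

```python
from typing import Callable, Tuple


def route(prob_positive: float,
          text: str,
          escalate: Callable[[str], bool],
          decision_threshold: float = 0.5,
          confidence_threshold: float = 0.8) -> Tuple[bool, str]:
    """Classify directly when confident, otherwise defer to `escalate`.

    prob_positive: probability of the positive class from a fine-tuned model.
    escalate: hypothetical stand-in for an expert-debate module; it receives
        the text and returns the final boolean label.
    Both thresholds are illustrative values, not taken from the paper.
    """
    # Assumed confidence measure: probability of the more likely class.
    confidence = max(prob_positive, 1.0 - prob_positive)
    if confidence >= confidence_threshold:
        return prob_positive >= decision_threshold, "direct"
    return escalate(text), "escalated"


if __name__ == "__main__":
    # A trivial placeholder expert that always answers "not positive".
    cautious_expert = lambda text: False
    print(route(0.95, "example post", cautious_expert))
    print(route(0.55, "borderline post", cautious_expert))
```

The design point is that the expensive reasoning path only runs on the uncertain band around the decision boundary, so its cost scales with the share of ambiguous inputs rather than with the whole dataset.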
Our approach achieves state-of-the-art results across several benchmarks, with a +2.72% improvement in F1 on the EXIST 2025 Task 1.1, and gains of +4.48% and +1.30% on the EDOS Tasks A and B, respectively.
[10] Break Out the Silverware -- Semantic Understanding of Stored Household Items
Michaela Levi-Richter,Reuth Mirsky,Oren Glickman
Main category: cs.CL
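TL;DR: This paper introduces the Stored Household Item Challenge, which evaluates the cognitive ability of service robots to infer where unseen items are stored in a home. It releases two datasets and proposes NOAM, a hybrid method combining vision with a large language model, whose prediction accuracy approaches human level.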
Details
Motivation: Service robots have advanced in vision and manipulation, but without commonsense reasoning they struggle to find hidden household items, so a new benchmark task is needed to evaluate and improve this cognitive ability.
Method: NOAM (Non-visible Object Allocation Model) converts visual input into natural language describing the spatial context and visible containers, then uses a large language model (such as GPT-4) to infer the most likely hidden storage location, forming an agent architecture that combines vision and language.
Result: On a test set of 100 real-world samples and a development set of 6,500 samples, NOAM clearly outperforms random selection, vision-only models, and multimodal model baselines, with prediction accuracy close to human performance.
Conclusion: NOAM shows that combining structured scene understanding with large language models is effective, offers a workable path toward service robots with commonsense reasoning, and advances cognitive modeling of non-visible object localization.
Abstract: "Bring me a plate." For domestic service robots, this simple command reveals a complex challenge: inferring where everyday items are stored, often out of sight in drawers, cabinets, or closets. Despite advances in vision and manipulation, robots still lack the commonsense reasoning needed to complete this task. We introduce the Stored Household Item Challenge, a benchmark task for evaluating service robots' cognitive capabilities: given a household scene and a queried item, predict its most likely storage location. Our benchmark includes two datasets: (1) a real-world evaluation set of 100 item-image pairs with human-annotated ground truth from participants' kitchens, and (2) a development set of 6,500 item-image pairs annotated with storage polygons over public kitchen images. These datasets support realistic modeling of household organization and enable comparative evaluation across agent architectures. To begin tackling this challenge, we introduce NOAM (Non-visible Object Allocation Model), a hybrid agent pipeline that combines structured scene understanding with large language model inference. NOAM converts visual input into natural language descriptions of spatial context and visible containers, then prompts a language model (e.g., GPT-4) to infer the most likely hidden storage location. This integrated vision-language agent exhibits emergent commonsense reasoning and is designed for modular deployment within broader robotic systems.
We evaluate NOAM against baselines including random selection, vision-language pipelines (Grounding-DINO + SAM), leading multimodal models (e.g., Gemini, GPT-4o, Kosmos-2, LLaMA, Qwen), and human performance. NOAM significantly improves prediction accuracy and approaches human-level results, highlighting best practices for deploying cognitively capable agents in domestic environments.
[11] Entropy-Aware Speculative Decoding Toward Improved LLM Reasoning
Tiancheng Su,Meicong Zhang,Guoxiu He
Main category: cs.CL
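TL;DR: The paper proposes EASD, a training-free enhancement to speculative decoding that adds a dynamic entropy-based penalty, improving the reasoning performance of large models while preserving decoding efficiency.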
Details
Motivation: In standard speculative decoding the draft model is excessively aligned with the target model, which limits the speedup and makes it hard to exceed the target model's own performance.
Method: On top of standard speculative decoding, EASD adds a dynamic penalty based on the entropy of the sampling distribution: when both the draft and target models show high entropy and their top-N predictions overlap substantially, the token is rejected and re-sampled by the target model.
Result: Across multiple reasoning benchmarks, EASD consistently outperforms existing speculative decoding methods and in most cases surpasses the target model itself, while keeping efficiency comparable to standard SD.
Conclusion: EASD is an effective training-free enhancement to speculative decoding; its entropy-aware mechanism keeps low-confidence errors from propagating and lets performance exceed that of the target model.
Abstract: Speculative decoding (SD) accelerates large language model (LLM) reasoning by using a small draft model to generate candidate tokens, which the target LLM either accepts directly or regenerates upon rejection. However, excessive alignment between the draft and target models constrains SD to the performance of the target LLM. To address this limitation, we propose Entropy-Aware Speculative Decoding (EASD), a training-free enhancement. Building on standard SD, EASD incorporates a dynamic entropy-based penalty. At each decoding step, we employ the entropy of the sampling distribution to quantify model uncertainty. When both models exhibit high entropy with substantial overlap among their top-N predictions, the corresponding token is rejected and re-sampled by the target LLM. This penalty prevents low-confidence errors from propagating. By incorporating draft-model verification, EASD enables the possibility of surpassing the target model's inherent performance. Experiments across multiple reasoning benchmarks demonstrate that EASD consistently outperforms existing SD methods and, in most cases, surpasses the target LLM itself. We further prove that the efficiency of EASD is comparable to that of SD. The code can be found in the Supplementary Materials.
[12] MiMo-Audio: Audio Language Models are Few-Shot Learners
Xiaomi LLM-Core Team: Dong Zhang,Gang Wang,Jinlong Xue,Kai Fang,Liang Zhao,Rui Ma,Shuhuai Ren,Shuo Liu,Tao Guo,Weiji Zhuang,Xin Zhang,Xingchen Song,Yihan Yan,Yongzhe He,Cici,Bowen Shen,Chengxuan Zhu,Chong Ma,Chun Chen,Heyu Chen,Jiawei Li,Lei Li,Menghang Zhu,Peidian Li,Qiying Wang,Sirui Deng,Weimin Xiong,Wenshan Huang,Wenyu Yang,Yilin Jiang,Yixin Yang,Yuanyuan Tian,Yue Ma,Yue Yu,Zihan Zhang,Zihao Yue,Bangjun Xiao,Bingquan Xia,Bofei Gao,Bowen Ye,Can Cai,Chang Liu,Chenhong He,Chunan Li,Dawei Zhu,Duo Zhang,Fengyuan Shi,Guoan Wang,Hailin Zhang,Hanglong Lv,Hanyu Li,Hao Tian,Heng Qu,Hongshen Xu,Houbin Zhang,Huaqiu Liu,Jiangshan Duo,Jianguang Zuo,Jianyu Wei,Jiebao Xiao,Jinhao Dong,Jun Shi,Junhao Hu,Kainan Bao,Kang Zhou,Linghao Zhang,Meng Chen,Nuo Chen,Peng Zhang,Qianli Chen,Qiantong Wang,Rang Li,Shaohui Liu,Shengfan Wang,Shicheng Li,Shihua Yu,Shijie Cao,Shimao Chen,Shuhao Gu,Weikun Wang,Wenhan Ma,Xiangwei Deng,Xing Yong,Xing Zhang,Xu Wang,Yifan Song,Yihao Zhao,Yingbo Zhao,Yizhao Gao,Yu Cheng,Yu Tu,Yudong Wang,Zhaojun Huang,Zhengju Tang,Zhenru Lin,Zhichao Song,Zhipeng Xu,Zhixian Zheng,Zihan Jiang
Main category: cs.CL
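TL;DR: Through large-scale pretraining, MiMo-Audio gains few-shot learning ability in the audio domain. It reaches open-source SOTA on a range of audio tasks and shows good generalization to unseen tasks as well as high-quality speech generation.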
Details
Motivation: Inspired by GPT-3, the work explores whether large-scale autoregressive pretraining can yield general-purpose capability in the audio domain, removing the reliance of traditional audio models on task-specific fine-tuning.
Method: MiMo-Audio's pretraining data is scaled to over one hundred million hours, a unified tokenization framework handles multimodal audio signals, and at the post-training stage a diverse instruction-tuning corpus is built and thinking mechanisms are introduced to improve understanding and generation.
Result: MiMo-Audio-7B-Base reaches open-source SOTA on speech intelligence and audio understanding benchmarks, generalizes to tasks absent from training such as voice conversion and style transfer, and produces highly realistic speech continuations. MiMo-Audio-7B-Instruct approaches or surpasses closed-source models on several audio understanding, spoken dialogue, and instruct-TTS evaluations.
Conclusion: Large-scale next-token prediction pretraining can effectively push audio language models toward general-purpose capability, and MiMo-Audio confirms the feasibility and potential of this paradigm in the audio domain.
Abstract: Existing audio language models typically rely on task-specific fine-tuning to accomplish particular audio tasks. In contrast, humans are able to generalize to new audio tasks with only a few examples or simple instructions. GPT-3 has shown that scaling next-token prediction pretraining enables strong generalization capabilities in text, and we believe this paradigm is equally applicable to the audio domain. By scaling MiMo-Audio's pretraining data to over one hundred million hours, we observe the emergence of few-shot learning capabilities across a diverse set of audio tasks. We develop a systematic evaluation of these capabilities and find that MiMo-Audio-7B-Base achieves SOTA performance on both speech intelligence and audio understanding benchmarks among open-source models. Beyond standard metrics, MiMo-Audio-7B-Base generalizes to tasks absent from its training data, such as voice conversion, style transfer, and speech editing. MiMo-Audio-7B-Base also demonstrates powerful speech continuation capabilities, capable of generating highly realistic talk shows, recitations, livestreaming and debates. At the post-training stage, we curate a diverse instruction-tuning corpus and introduce thinking mechanisms into both audio understanding and generation. MiMo-Audio-7B-Instruct achieves open-source SOTA on audio understanding benchmarks (MMSU, MMAU, MMAR, MMAU-Pro), spoken dialogue benchmarks (Big Bench Audio, MultiChallenge Audio) and instruct-TTS evaluations, approaching or surpassing closed-source models.
Model checkpoints and full evaluation suite are available at https://github.com/XiaomiMiMo/MiMo-Audio.
[13] StressRoBERTa: Cross-Condition Transfer Learning from Depression, Anxiety, and PTSD to Stress Detection
Amal Alqahtani,Efsun Kayi,Mona Diab
Main category: cs.CL
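TL;DR: This paper proposes StressRoBERTa, a cross-condition transfer learning approach for automatically detecting self-reported chronic stress in English tweets. Continual training on related clinical conditions clearly improves detection performance.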
Details
Motivation: Chronic stress is widespread and often comorbid with other mental health conditions, so detecting it automatically from social media text is valuable, yet existing models perform only modestly at identifying stress specifically.
Method: A RoBERTa model is continually pretrained on a clinical mental health corpus covering related conditions such as depression, anxiety, and PTSD (Stress-SMHD) and then fine-tuned on the SMM4H 2022 Task 8 dataset, giving cross-condition transfer learning.
Result: StressRoBERTa reaches 82% F1 on the SMM4H dataset, 3 percentage points above the best shared-task system, and 81% F1 on the Dreaddit dataset, confirming that it transfers well.
Conclusion: Focused cross-condition transfer from related, highly comorbid mental health conditions effectively improves chronic stress detection and beats both general language models and broad mental health models.
Abstract: The prevalence of chronic stress represents a significant public health concern, with social media platforms like Twitter serving as important venues for individuals to share their experiences. This paper introduces StressRoBERTa, a cross-condition transfer learning approach for automatic detection of self-reported chronic stress in English tweets. The investigation examines whether continual training on clinically related conditions (depression, anxiety, PTSD), disorders with high comorbidity with chronic stress, improves stress detection compared to general language models and broad mental health models. RoBERTa is continually trained on the Stress-SMHD corpus (108M words from users with self-reported diagnoses of depression, anxiety, and PTSD) and fine-tuned on the SMM4H 2022 Task 8 dataset. StressRoBERTa achieves 82% F1-score, outperforming the best shared task system (79% F1) by 3 percentage points. The results demonstrate that focused cross-condition transfer from stress-related disorders (+1% F1 over vanilla RoBERTa) provides stronger representations than general mental health training. Evaluation on Dreaddit (81% F1) further demonstrates transfer from clinical mental health contexts to situational stress discussions.
[14] Explaining News Bias Detection: A Comparative SHAP Analysis of Transformer Model Decision Mechanisms
Himel Ghosh
Main category: cs.CL
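TL;DR: This study compares two Transformer-based news bias detection models and analyzes their decision mechanisms with SHAP explanations. The domain-adapted RoBERTa model shows more sensible attribution patterns and fewer false positives than the plain bias detector.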
Details
Motivation: Little is understood about how bias detection models reach their decisions or why they fail, so interpretability research is needed to make these models more reliable and usable in journalism.
Method: Using SHAP-based explanations, the study performs word-level attribution analysis on two Transformer models fine-tuned on the BABE dataset (a plain bias detector and a domain-adapted RoBERTa) and compares their attribution patterns on correct and incorrect predictions.
Result: Both models attend to evaluative language but differ in how they integrate these signals. The plain model assigns stronger evidence to false positives, which leads it to over-flag neutral content. The domain-adapted model produces 63% fewer false positives, and its attributions align better with prediction outcomes. False positives stem mainly from discourse-level ambiguity rather than explicit bias cues.
Conclusion: Model architecture and training approach strongly affect the reliability of bias detection systems, and interpretability analysis is essential for evaluating and deploying them.
Abstract: Automated bias detection in news text is heavily used to support journalistic analysis and media accountability, yet little is known about how bias detection models arrive at their decisions or why they fail. In this work, we present a comparative interpretability study of two transformer-based bias detection models: a bias detector fine-tuned on the BABE dataset and a domain-adapted pre-trained RoBERTa model fine-tuned on the BABE dataset, using SHAP-based explanations. We analyze word-level attributions across correct and incorrect predictions to characterize how different model architectures operationalize linguistic bias. Our results show that although both models attend to similar categories of evaluative language, they differ substantially in how these signals are integrated into predictions. The bias detector model assigns stronger internal evidence to false positives than to true positives, indicating a misalignment between attribution strength and prediction correctness and contributing to systematic over-flagging of neutral journalistic content. In contrast, the domain-adaptive model exhibits attribution patterns that better align with prediction outcomes and produces 63% fewer false positives. We further demonstrate that model errors arise from distinct linguistic mechanisms, with false positives driven by discourse-level ambiguity rather than explicit bias cues.
These findings highlight the importance of interpretability-aware evaluation for bias detection systems and suggest that architectural and training choices critically affect both model reliability and deployment suitability in journalistic contexts.
[15] Retrieval Augmented Question Answering: When Should LLMs Admit Ignorance?
Dingmin Wang,Ji Ma,Shankar Kumar
Main category: cs.CL
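TL;DR: The paper proposes an adaptive prompting strategy that processes retrieved information in chunks, maintaining question answering performance while using fewer tokens, and finds that LLMs tend to generate wrong answers rather than decline when information is insufficient.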
Details
Motivation: Long contexts make it easier to bring in knowledge, but they also include more irrelevant information, which degrades the quality of LLM generation.
Method: The retrieved information is split into small chunks and the LLM is prompted to answer chunk by chunk; adjusting the chunk size balances bringing in relevant information against suppressing irrelevant information.
Result: On three open-domain question answering datasets, the method matches the performance of standard prompting while using fewer tokens. The analysis finds that LLMs often generate incorrect answers when information is insufficient.
Conclusion: The adaptive prompting strategy makes retrieval-augmented question answering more efficient, and further research is needed on getting LLMs to decline to answer when information is insufficient.
Abstract: The success of expanded context windows in Large Language Models (LLMs) has driven increased use of broader context in retrieval-augmented generation. We investigate the use of LLMs for retrieval augmented question answering. While longer contexts make it easier to incorporate targeted knowledge, they introduce more irrelevant information that hinders the model's generation process and degrades its performance. To address the issue, we design an adaptive prompting strategy which involves splitting the retrieved information into smaller chunks and sequentially prompting an LLM to answer the question using each chunk. Adjusting the chunk size allows a trade-off between incorporating relevant information and reducing irrelevant information. Experimental results on three open-domain question answering datasets demonstrate that the adaptive strategy matches the performance of standard prompting while using fewer tokens. Our analysis reveals that when encountering insufficient information, the LLM often generates incorrect answers instead of declining to respond, which constitutes a major source of error. This finding highlights the need for further research into enhancing LLMs' ability to effectively decline requests when faced with inadequate information.
[16] Adversarial Lens: Exploiting Attention Layers to Generate Adversarial Examples for Evaluation
Kaustubh Dhole
Main category: cs.CL
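TL;DR: This paper proposes a new method for generating adversarial examples from attention-layer token distributions, using the model's internal generation mechanism to produce plausible, semantically consistent perturbations, and tests its effect on evaluation tasks with LLaMA-3.1-8B.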
Details
Motivation: To explore whether the intermediate attention layers of large language models can be used to generate effective adversarial perturbations for testing the robustness of LLM-based evaluation systems.
Method: Token distributions are extracted from intermediate attention layers and used to generate adversarial examples directly, avoiding prompt-based or gradient-based attacks and keeping the perturbations consistent with the model's own generation process.
Result: Experiments on the ArgQuality dataset show that the method measurably lowers evaluation performance while preserving semantic similarity, but substitutions from certain layers and positions cause grammatical degradation.
Conclusion: Intermediate-layer representations have potential as a principled source of adversarial examples, but their practical effectiveness is limited by grammatical quality and the choice of layer.
Abstract: Recent advances in mechanistic interpretability suggest that intermediate attention layers encode token-level hypotheses that are iteratively refined toward the final output. In this work, we exploit this property to generate adversarial examples directly from attention-layer token distributions. Unlike prompt-based or gradient-based attacks, our approach leverages model-internal token predictions, producing perturbations that are both plausible and internally consistent with the model's own generation process. We evaluate whether tokens extracted from intermediate layers can serve as effective adversarial perturbations for downstream evaluation tasks. We conduct experiments on argument quality assessment using the ArgQuality dataset, with LLaMA-3.1-Instruct-8B serving as both the generator and evaluator. Our results show that attention-based adversarial examples lead to measurable drops in evaluation performance while remaining semantically similar to the original inputs. However, we also observe that substitutions drawn from certain layers and token positions can introduce grammatical degradation, limiting their practical effectiveness. Overall, our findings highlight both the promise and current limitations of using intermediate-layer representations as a principled source of adversarial examples for stress-testing LLM-based evaluation pipelines.
[17] Integrating Domain Knowledge for Financial QA: A Multi-Retriever RAG Approach with LLMs
Yukun Zhang,Stefan Elbl Droguett,Samyak Jain
Main category: cs.CL
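TL;DR: This study proposes a multi-retriever RAG system that combines domain-specific training with the latest large language models to improve financial numerical-reasoning QA, achieves SOTA results that surpass the baseline model, and examines the trade-off between external-knowledge gains and hallucination losses.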
Details
Motivation: Lacking specialist financial knowledge, existing large language models struggle with financial numerical-reasoning QA and have difficulty carrying out complex multi-step numerical calculations accurately. Method: Use a multi-retriever retrieval-augmented generation (RAG) system that combines external domain knowledge with internal question context, apply domain-specific training with the SecBERT encoder, and use the latest large language model for few-shot numerical reasoning. Result: Domain-specific training significantly improves performance, surpassing the best model in the FinQA paper; the largest prompt-based model achieves SOTA performance with a gain of more than 7%, but remains below human expert level; in larger models, external-knowledge gains typically outweigh hallucination losses. Conclusion: Domain-specific training and multi-source knowledge retrieval are essential for financial numerical reasoning; the latest large language models show stronger numerical reasoning in few-shot settings, but there is still room to approach human performance. Abstract: This research project addresses the errors of financial numerical reasoning Question Answering (QA) tasks due to the lack of domain knowledge in finance. Despite recent advances in Large Language Models (LLMs), financial numerical questions remain challenging because they require specific domain knowledge in finance and complex multi-step numeric reasoning. We implement a multi-retriever Retrieval-Augmented Generation (RAG) system to retrieve both external domain knowledge and internal question contexts, and utilize the latest LLM to tackle these tasks. Through comprehensive ablation experiments and error analysis, we find that domain-specific training with the SecBERT encoder significantly contributes to our best neural symbolic model surpassing the FinQA paper's top model, which serves as our baseline. This suggests the potential superior performance of domain-specific training. Furthermore, our best prompt-based LLM generator achieves the state-of-the-art (SOTA) performance with significant improvement (>7%), yet it is still below the human expert performance. This study highlights the trade-off between hallucination loss and external knowledge gains in smaller models and few-shot examples. For larger models, the gains from external facts typically outweigh the hallucination loss. Finally, our findings confirm the enhanced numerical reasoning capabilities of the latest LLM, optimized for few-shot learning.[18] Disentangling Learning from Judgment: Representation Learning for Open Response Analytics
Conrad Borchers,Manit Patel,Seiyon M. Lee,Anthony F. Botelho
Main category: cs.CL
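TL;DR: This paper proposes an analytics-first framework that separates student content signals from teacher grading tendencies in open-ended responses. Using ASSISTments mathematics data, it models teacher grading histories and extracts sentence-embedding representations, applying centering and residualization to remove prompt and teacher confounds. Combining teacher priors with content embeddings gives the strongest predictions (AUC 0.815), while content-only models are weaker (AUC 0.626); adjusting for rater effects makes the semantic representation more informative and supports reflection on how well teaching practice aligns with student understanding.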
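The centering step named in this entry can be illustrated with a minimal sketch, assuming that "centering" means subtracting each prompt's mean embedding from the responses to that prompt. The function name `center_by_group` and the toy vectors are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

def center_by_group(embeddings, groups):
    """Subtract the per-group mean vector from every embedding in that group."""
    sums = {}
    counts = defaultdict(int)
    for vec, group in zip(embeddings, groups):
        if group not in sums:
            sums[group] = [0.0] * len(vec)
        sums[group] = [a + b for a, b in zip(sums[group], vec)]
        counts[group] += 1
    means = {g: [x / counts[g] for x in total] for g, total in sums.items()}
    return [
        [x - m for x, m in zip(vec, means[group])]
        for vec, group in zip(embeddings, groups)
    ]

# Toy data: two responses to prompt "p1", one response to prompt "p2".
embeddings = [[1.0, 2.0], [3.0, 4.0], [10.0, 10.0]]
prompts = ["p1", "p1", "p2"]
centered = center_by_group(embeddings, prompts)
```

After centering, what remains in each vector is the response's deviation from the typical answer to its prompt, which is one plausible way to reduce prompt confounds before fitting a linear model.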
Details
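Motivation: Existing automated scoring methods often conflate what students actually wrote with teachers' grading habits, which biases assessment and makes it hard to reflect students' true understanding; a method that decouples content from grading tendencies is needed to make assessment more transparent and valid. Method: Propose an analytics-first framework that uses de-identified ASSISTments mathematics responses, models teacher grading histories as dynamic priors, derives text representations from sentence embeddings, and applies centering and residualization to reduce confounds from prompts and individual teachers; temporally validated linear models quantify the contribution of each signal, and a projection surfaces model disagreements for qualitative analysis. Result: Teacher priors strongly influence grade predictions; the model combining teacher priors with content embeddings performs best (AUC about 0.815), while the content-only model is above chance but weaker (AUC about 0.626); after correcting for rater effects, the residual content representation retains more meaningful embedding dimensions and reveals cases where semantic evidence supports student understanding as opposed to surface-level differences in responses. Conclusion: The framework offers a practical analysis pipeline that not only improves scoring accuracy but also turns embedding features into learning analytics for reflection, letting teachers and researchers examine whether grading practices align with evidence of student reasoning and learning, and supporting fairer, more transparent educational assessment. Abstract: Open-ended responses are central to learning, yet automated scoring often conflates what students wrote with how teachers grade. We present an analytics-first framework that separates content signals from rater tendencies, making judgments visible and auditable via analytics. Using de-identified ASSISTments mathematics responses, we model teacher histories as dynamic priors and derive text representations from sentence embeddings, incorporating centering and residualization to mitigate prompt and teacher confounds. Temporally-validated linear models quantify the contributions of each signal, and a projection surfaces model disagreements for qualitative inspection. Results show that teacher priors heavily influence grade predictions; the strongest results arise when priors are combined with content embeddings (AUC~0.815), while content-only models remain above chance but substantially weaker (AUC~0.626). Adjusting for rater effects sharpens the residual content representation, retaining more informative embedding dimensions and revealing cases where semantic evidence supports understanding as opposed to surface-level differences in how students respond. The contribution presents a practical pipeline that transforms embeddings from mere features into learning analytics for reflection, enabling teachers and researchers to examine where grading practices align (or conflict) with evidence of student reasoning and learning.[19] Improving Multi-step RAG with Hypergraph-based Memory for Long-Context Complex Relational Modeling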
Chulun Zhou,Chunkang Zhang,Guoxin Yu,Fandong Meng,Jie Zhou,Wai Lam,Mo Yu
Main category: cs.CL
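TL;DR: This paper proposes HGMem, a hypergraph-based dynamic memory mechanism that strengthens complex reasoning and global understanding in multi-step retrieval-augmented generation (RAG).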
Details
Motivation: Memory modules in existing RAG systems are mostly static storage that does not model high-order correlations among primitive facts, which limits their performance in multi-step reasoning and global sense-making. Method: Design HGMem, which represents memory as a hypergraph whose hyperedges correspond to memory units, supporting the progressive formation of higher-order interactions and the dynamic integration of facts and thoughts. Result: Experiments on several challenging datasets that require global understanding show that HGMem substantially outperforms strong baseline systems and consistently improves multi-step RAG. Conclusion: By building a dynamic, expressive memory structure, HGMem effectively supports knowledge evolution and deeper reasoning, strengthening the model's global sense-making and coherent reasoning. Abstract: Multi-step retrieval-augmented generation (RAG) has become a widely adopted strategy for enhancing large language models (LLMs) on tasks that demand global comprehension and intensive reasoning. Many RAG systems incorporate a working memory module to consolidate retrieved information. However, existing memory designs function primarily as passive storage that accumulates isolated facts for the purpose of condensing the lengthy inputs and generating new sub-queries through deduction. This static nature overlooks the crucial high-order correlations among primitive facts, the compositions of which can often provide stronger guidance for subsequent steps. Therefore, their representational strength and impact on multi-step reasoning and knowledge evolution are limited, resulting in fragmented reasoning and weak global sense-making capacity in extended contexts. We introduce HGMem, a hypergraph-based memory mechanism that extends the concept of memory beyond simple storage into a dynamic, expressive structure for complex reasoning and global understanding. In our approach, memory is represented as a hypergraph whose hyperedges correspond to distinct memory units, enabling the progressive formation of higher-order interactions within memory. This mechanism connects facts and thoughts around the focal problem, evolving into an integrated and situated knowledge structure that provides strong propositions for deeper reasoning in subsequent steps. We evaluate HGMem on several challenging datasets designed for global sense-making. Extensive experiments and in-depth analyses show that our method consistently improves multi-step RAG and substantially outperforms strong baseline systems across diverse tasks.[20] Efficient Context Scaling with LongCat ZigZag Attention
Chen Zhang,Yang Bai,Jiahuan Li,Anchun Gui,Keheng Wang,Feifan Liu,Guanyu Wu,Yuwei Jiang,Defei Bu,Li Wei,Haihang Jing,Hongyin Tang,Xin Chen,Xiangzhou Huang,Fengcun Li,Rongxiang Weng,Yulei Qian,Yifan Lu,Yerui Sun,Jingang Wang,Yuchen Xie,Xunliang Cai
Main category: cs.CL
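TL;DR: Proposes LongCat ZigZag Attention (LoZA), a sparse attention mechanism that efficiently converts full-attention models into sparse versions and significantly speeds up inference in long-context scenarios, handling up to one million tokens.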
Details
Motivation: Improve inference efficiency in long-context scenarios under a limited compute budget, addressing the high computational cost of full attention. Method: Design the sparse attention scheme LoZA and apply it during mid-training of the LongCat-Flash model, producing the long-context sparse model LongCat-Flash-Exp. Result: LoZA delivers significant speed-ups in both prefill-intensive and decode-intensive tasks, supports fast processing of up to 1 million tokens, and strengthens long-term reasoning and long-horizon agentic capabilities. Conclusion: LoZA effectively converts full-attention models into efficient sparse models, greatly reducing the compute cost of long sequences while preserving performance. Abstract: We introduce LongCat ZigZag Attention (LoZA), which is a sparse attention scheme designed to transform any existing full-attention models into sparse versions with a rather limited compute budget. In long-context scenarios, LoZA can achieve significant speed-ups both for prefill-intensive (e.g., retrieval-augmented generation) and decode-intensive (e.g., tool-integrated reasoning) cases. Specifically, by applying LoZA to LongCat-Flash during mid-training, we serve LongCat-Flash-Exp as a long-context foundation model that can swiftly process up to 1 million tokens, enabling efficient long-term reasoning and long-horizon agentic capabilities.[21] CEC-Zero: Zero-Supervision Character Error Correction with Self-Generated Rewards
Zhiming Lin,Kai Zhao,Sophie Zhang,Peilai Yu,Canran Xiao
Main category: cs.CL
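TL;DR: Proposes CEC-Zero, a zero-supervision reinforcement learning framework that improves Chinese spelling correction by having LLMs correct their own mistakes; across 9 benchmarks it significantly outperforms supervised methods and strong LLM fine-tunes.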
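A minimal sketch of a consensus-style reward in the spirit of this entry, assuming that exact string match stands in for the semantic-similarity clustering the paper describes. The function name `consensus_rewards` and the majority-share scoring are illustrative assumptions, not the authors' reward definition.

```python
from collections import Counter

def consensus_rewards(candidates):
    """Reward each sampled correction by the share of samples that agree with it."""
    if not candidates:
        return []
    counts = Counter(candidates)
    total = len(candidates)
    return [counts[c] / total for c in candidates]

# Four sampled corrections for one noisy sentence; three of them agree.
samples = ["我今天很高兴", "我今天很高兴", "我今天很高心", "我今天很高兴"]
rewards = consensus_rewards(samples)
```

No gold label is needed to compute these rewards, which is the property a label-free training loop relies on; the resulting values could then be passed to a policy optimizer such as PPO.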
Details
Motivation: Existing LLMs and supervised methods for Chinese spelling correction are not robust to novel errors and depend on costly annotation; an efficient label-free solution is lacking. Method: Build the CEC-Zero framework: synthesize errorful inputs from clean text, compute cluster-consensus rewards from semantic similarity and candidate agreement, and optimize the policy with PPO. Result: Across 9 benchmarks, F$_1$ exceeds supervised baselines by 10-13 points and strong LLM fine-tuned baselines by 5-8 points, with theoretical guarantees of unbiased rewards and convergence. Conclusion: CEC-Zero establishes a new label-free paradigm for Chinese spelling correction, improving the scalability and robustness of LLMs on noisy text. Abstract: Large-scale Chinese spelling correction (CSC) remains critical for real-world text processing, yet existing LLMs and supervised methods lack robustness to novel errors and rely on costly annotations. We introduce CEC-Zero, a zero-supervision reinforcement learning framework that addresses this by enabling LLMs to correct their own mistakes. CEC-Zero synthesizes errorful inputs from clean text, computes cluster-consensus rewards via semantic similarity and candidate agreement, and optimizes the policy with PPO. It outperforms supervised baselines by 10-13 F$_1$ points and strong LLM fine-tunes by 5-8 points across 9 benchmarks, with theoretical guarantees of unbiased rewards and convergence. CEC-Zero establishes a label-free paradigm for robust, scalable CSC, unlocking LLM potential in noisy text pipelines.[22] Fantastic Reasoning Behaviors and Where to Find Them: Unsupervised Discovery of the Reasoning Process
Zhenyu Zhang,Shujian Zhang,John Lambert,Wenxuan Zhou,Zhangyang Wang,Mingqing Chen,Andrew Hard,Rajiv Mathews,Lun Wang
Main category: cs.CL
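TL;DR: Proposes the RISE framework, which uses sparse auto-encoders to discover interpretable reasoning vectors in activation space without supervision, making it possible to separate and control LLM reasoning behaviors such as reflection, backtracking, and confidence.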
Details
Motivation: Existing reasoning analyses built on human-defined concepts cannot fully capture reasoning behaviors that are complex and hard to define at the token level, and they give little insight into the internal reasoning mechanisms of LLMs. Method: Segment chain-of-thought traces into sentence-level steps, train sparse auto-encoders (SAEs) on step-level activations, extract disentangled reasoning vectors from activation space, and analyze their semantics through visualization and clustering; then intervene on these vectors to steer reasoning behavior. Result: The method identifies separate vectors for interpretable behaviors such as reflection, backtracking, response length, and confidence; clustering shows they occupy separable regions of the decoder space, and interventions can change the reasoning trajectory. Conclusion: Unsupervised sparse auto-encoding can effectively reveal and control reasoning behaviors at multiple levels in LLMs, providing a new tool for understanding and steering model reasoning. Abstract: Despite the growing reasoning capabilities of recent large language models (LLMs), their internal mechanisms during the reasoning process remain underexplored. Prior approaches often rely on human-defined concepts (e.g., overthinking, reflection) at the word level to analyze reasoning in a supervised manner. However, such methods are limited, as it is infeasible to capture the full spectrum of potential reasoning behaviors, many of which are difficult to define in token space. In this work, we propose an unsupervised framework (namely, RISE: Reasoning behavior Interpretability via Sparse auto-Encoder) for discovering reasoning vectors, which we define as directions in the activation space that encode distinct reasoning behaviors. By segmenting chain-of-thought traces into sentence-level 'steps' and training sparse auto-encoders (SAEs) on step-level activations, we uncover disentangled features corresponding to interpretable behaviors such as reflection and backtracking. Visualization and clustering analyses show that these behaviors occupy separable regions in the decoder column space. Moreover, targeted interventions on SAE-derived vectors can controllably amplify or suppress specific reasoning behaviors, altering inference trajectories without retraining. Beyond behavior-specific disentanglement, SAEs capture structural properties such as response length, revealing clusters of long versus short reasoning traces. More interestingly, SAEs enable the discovery of novel behaviors beyond human supervision. We demonstrate the ability to control response confidence by identifying confidence-related vectors in the SAE decoder space. These findings underscore the potential of unsupervised latent discovery for both interpreting and controllably steering reasoning in LLMs.[23] WISE: Web Information Satire and Fakeness Evaluation
Gaurab Chhetri,Subasish Das,Tausif Islam Chowdhury
Main category: cs.CL
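TL;DR: This study proposes the WISE framework for evaluating how well lightweight Transformer models distinguish fake news from satire; MiniLM and RoBERTa-base perform best, and lightweight models can be deployed effectively in resource-constrained settings.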
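The stratified 5-fold protocol this entry relies on can be sketched in plain Python. This is a simplified illustration that deals each class round-robin across folds without shuffling; the function name `stratified_folds` and the toy labels are assumptions, and the paper's actual tooling is not stated here.

```python
from collections import defaultdict

def stratified_folds(labels, k=5):
    """Assign sample indices to k folds so each fold keeps roughly the same class ratio."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal the indices of each class round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

# Toy balanced dataset: 10 fake-news items followed by 10 satire items.
labels = ["fake"] * 10 + ["satire"] * 10
folds = stratified_folds(labels, k=5)
```

Each fold serves once as the held-out set while the other four are used for training, so every sample is evaluated exactly once and each fold preserves the balanced class split.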
Details
Motivation: Fake news and satire share linguistic features but differ in intent, so telling them apart accurately is challenging, and existing methods struggle to balance efficiency against accuracy. Method: Build a balanced dataset of 20,000 samples from Fakeddit, annotated as fake news or satire; evaluate 8 lightweight Transformer models and 2 baseline models with stratified 5-fold cross-validation on metrics including accuracy, F1-score, and ROC-AUC. Result: MiniLM reaches the highest accuracy (87.58%); RoBERTa-base has the best ROC-AUC (95.42%) with high accuracy (87.36%); DistilBERT strikes a good balance between efficiency and performance (accuracy 86.28%, ROC-AUC 93.90%); statistical tests show significant differences between models. Conclusion: Lightweight Transformer models can match or exceed baseline performance at distinguishing fake news from satire and are suited to deployment in resource-constrained real-world settings, offering an efficient, practical approach to misinformation detection. Abstract: Distinguishing fake or untrue news from satire or humor poses a unique challenge due to their overlapping linguistic features and divergent intent. This study develops the WISE (Web Information Satire and Fakeness Evaluation) framework, which benchmarks eight lightweight transformer models alongside two baseline models on a balanced dataset of 20,000 samples from Fakeddit, annotated as either fake news or satire. Using stratified 5-fold cross-validation, we evaluate models across comprehensive metrics including accuracy, precision, recall, F1-score, ROC-AUC, PR-AUC, MCC, Brier score, and Expected Calibration Error. Our evaluation reveals that MiniLM, a lightweight model, achieves the highest accuracy (87.58%) among all models, while RoBERTa-base achieves the highest ROC-AUC (95.42%) and strong accuracy (87.36%). DistilBERT offers an excellent efficiency-accuracy trade-off with 86.28% accuracy and 93.90% ROC-AUC. Statistical tests confirm significant performance differences between models, with paired t-tests and McNemar tests providing rigorous comparisons. Our findings highlight that lightweight models can match or exceed baseline performance, offering actionable insights for deploying misinformation detection systems in real-world, resource-constrained settings.[24] iCLP: Large Language Model Reasoning with Implicit Cognition Latent Planning
Sijia Chen,Di Niu
Main category: cs.CL
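TL;DR: Proposes iCLP, a framework that plans implicitly by generating compact reasoning instructions in latent space, improving LLM accuracy, efficiency, and cross-domain generalization in mathematical reasoning and code generation.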
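The discrete plan representation in this entry rests on vector quantization, where a continuous plan encoding is replaced by the index of its nearest codebook entry. The sketch below shows only that lookup step, assuming squared Euclidean distance; the `quantize` name and the toy codebook are hypothetical, and training the autoencoder is out of scope.

```python
def quantize(vector, codebook):
    """Return the index of the codebook entry closest to the vector."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: squared_distance(vector, codebook[i]))

# Toy codebook with three latent-plan codes in a 2-D space.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
plan_encoding = [0.9, 1.2]
code_index = quantize(plan_encoding, codebook)
```

The resulting index acts as a compact latent-plan token that a fine-tuned model could condition on before producing its reasoning steps in natural language.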
Details
Motivation: Because large language models are prone to hallucination and tasks are highly diverse, accurate explicit textual plans are hard to generate; inspired by human implicit cognition, a more robust planning mechanism is needed. Method: Distill explicit plans from existing chain-of-thought trajectories, learn their discrete representations with a vector-quantized autoencoder, and fine-tune the LLM to generate reasoning steps conditioned on latent plans. Result: Significant gains in accuracy and efficiency on mathematical reasoning and code generation, with cross-domain generalization and preserved chain-of-thought interpretability. Conclusion: iCLP enables LLMs to plan implicitly in latent space, offering an efficient, general, and interpretable reasoning paradigm for complex tasks. Abstract: Large language models (LLMs), when guided by explicit textual plans, can perform reliable step-by-step reasoning during problem-solving. However, generating accurate and effective textual plans remains challenging due to LLM hallucinations and the high diversity of task-specific questions. To address this, we draw inspiration from human Implicit Cognition (IC), the subconscious process by which decisions are guided by compact, generalized patterns learned from past experiences without requiring explicit verbalization. We propose iCLP, a novel framework that enables LLMs to adaptively generate latent plans (LPs), which are compact encodings of effective reasoning instructions. iCLP first distills explicit plans from existing step-by-step reasoning trajectories. It then learns discrete representations of these plans via a vector-quantized autoencoder coupled with a codebook. Finally, by fine-tuning LLMs on paired latent plans and corresponding reasoning steps, the models learn to perform implicit planning during reasoning. Experimental results on mathematical reasoning and code generation tasks demonstrate that, with iCLP, LLMs can plan in latent space while reasoning in language space. This approach yields significant improvements in both accuracy and efficiency and, crucially, demonstrates strong cross-domain generalization while preserving the interpretability of chain-of-thought reasoning.[25] Beyond Hallucinations: A Composite Score for Measuring Reliability in Open-Source Large Language Models
Rohit Kumar Salla,Manoj Saravanan,Shrikar Reddy Kota
Main category: cs.CL
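TL;DR: Proposes the Composite Reliability Score (CRS), a unified evaluation framework that combines calibration, robustness, and uncertainty quantification to assess the reliability of large language models in decision-critical domains.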
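Calibration is one of the three ingredients this entry names. A common way to measure it is Expected Calibration Error (ECE), sketched below with equal-width confidence bins. How CRS actually combines calibration with robustness and uncertainty is not given in this digest, so the sketch covers only the standard ECE computation; the bin count and the toy inputs are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per confidence bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(1 for _, ok in members if ok) / len(members)
            ece += len(members) / total * abs(avg_conf - accuracy)
    return ece

# Toy predictions: two confident correct answers, two uncertain answers with one wrong.
confidences = [0.95, 0.95, 0.55, 0.55]
correct = [True, True, True, False]
ece = expected_calibration_error(confidences, correct)
```

A lower ECE means stated confidence tracks observed accuracy more closely, which is the kind of overconfidence check the abstract motivates.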
Details
Motivation: Large language models are increasingly used in critical domains such as healthcare, law, and finance, yet their reliability remains uncertain, and existing evaluation methods are fragmented and cannot capture reliability as a whole. Method: Build the CRS framework, which integrates calibration, robustness, and uncertainty quantification, and evaluate ten leading open-source LLMs on five QA datasets under baseline, input-perturbation, and calibration-method conditions. Result: CRS ranks models stably, uncovers hidden failure modes that single metrics miss, and shows that the most reliable systems balance accuracy, robustness, and calibrated uncertainty. Conclusion: CRS is an interpretable composite metric that assesses LLM reliability in critical applications more completely and helps identify genuinely robust models. Abstract: Large Language Models (LLMs) like LLaMA, Mistral, and Gemma are increasingly used in decision-critical domains such as healthcare, law, and finance, yet their reliability remains uncertain. They often make overconfident errors, degrade under input shifts, and lack clear uncertainty estimates. Existing evaluations are fragmented, addressing only isolated aspects. We introduce the Composite Reliability Score (CRS), a unified framework that integrates calibration, robustness, and uncertainty quantification into a single interpretable metric. Through experiments on ten leading open-source LLMs across five QA datasets, we assess performance under baselines, perturbations, and calibration methods. CRS delivers stable model rankings, uncovers hidden failure modes missed by single metrics, and highlights that the most dependable systems balance accuracy, robustness, and calibrated uncertainty.[26] HY-MT1.5 Technical Report
Mao Zheng,Zheng Li,Tao Chen,Mingyang Song,Di Wang
Main category: cs.CL
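TL;DR: This report introduces the machine translation models HY-MT1.5-1.8B and HY-MT1.5-7B, which are trained with a multi-stage framework, perform strongly across a range of translation tasks, and combine parameter efficiency with support for advanced features.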
Details
Motivation: Develop high-performance, efficient, dedicated machine translation models that surpass existing open-source and commercial models and support complex translation constraints. Method: Use a holistic multi-stage training framework comprising general and MT-oriented pre-training, supervised fine-tuning, on-policy distillation, and reinforcement learning. Result: HY-MT1.5-1.8B significantly outperforms mainstream models with far more parameters and reaches about 90% of the performance of Gemini-3.0-Pro; HY-MT1.5-7B sets a new SOTA for its size class, reaching 95% of Gemini-3.0-Pro's performance and surpassing it on some benchmarks. Both models support advanced features such as terminology intervention, context-aware translation, and format preservation. Conclusion: The HY-MT1.5 models offer highly competitive, robust solutions for general and specialized translation at their respective parameter scales, showing the potential of efficient dedicated MT models. Abstract: In this report, we introduce our latest translation models, HY-MT1.5-1.8B and HY-MT1.5-7B, a new family of machine translation models developed through a holistic training framework tailored for high-performance translation. Our methodology orchestrates a multi-stage pipeline that integrates general and MT-oriented pre-training, supervised fine-tuning, on-policy distillation, and reinforcement learning. HY-MT1.5-1.8B, the 1.8B-parameter model, demonstrates remarkable parameter efficiency, comprehensively outperforming significantly larger open-source baselines (e.g., Tower-Plus-72B, Qwen3-32B) and mainstream commercial APIs (e.g., Microsoft Translator, Doubao Translator) in standard Chinese-foreign and English-foreign tasks. It achieves approximately 90% of the performance of ultra-large proprietary models such as Gemini-3.0-Pro. While marginally trailing Gemini-3.0-Pro on WMT25 and Mandarin-minority language benchmarks, it maintains a substantial lead over other competing models. Furthermore, HY-MT1.5-7B establishes a new state-of-the-art for its size class, achieving 95% of Gemini-3.0-Pro's performance on Flores-200 and surpassing it on the challenging WMT25 and Mandarin-minority language test sets. Beyond standard translation, the HY-MT1.5 series supports advanced constraints, including terminology intervention, context-aware translation, and format preservation. Extensive empirical evaluations confirm that both models offer highly competitive, robust solutions for general and specialized translation tasks within their respective parameter scales.[27] Training a Huggingface Model on AWS Sagemaker (Without Tears)
Liling Tan
Main category: cs.CL
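TL;DR: This paper centralizes the key information researchers need to train their first Hugging Face model on AWS SageMaker from scratch, lowering the barrier to cloud platforms and encouraging more researchers to use cloud computing for training large language models.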
Details
Motivation: Lacking on-premise compute, many researchers turn to cloud services to train complex models, but the steep learning curve and scattered documentation of cloud platforms create a barrier to adoption. Method: Consolidate the core steps and practical guidance needed to train a Hugging Face model on AWS SageMaker from scratch into a one-stop tutorial that fills existing knowledge gaps. Result: A concise, actionable guide that helps researchers get past common cloud-platform difficulties and complete model training. Conclusion: The work lowers the barrier to cloud platforms and helps resource-constrained researchers use cloud computing more easily for large language model research. Abstract: The development of Large Language Models (LLMs) has primarily been driven by resource-rich research groups and industry partners. Due to the lack of on-premise computing resources required for increasingly complex models, many researchers are turning to cloud services like AWS SageMaker to train Hugging Face models. However, the steep learning curve of cloud platforms often presents a barrier for researchers accustomed to local environments. Existing documentation frequently leaves knowledge gaps, forcing users to seek fragmented information across the web. This demo paper aims to democratize cloud adoption by centralizing the essential information required for researchers to successfully train their first Hugging Face model on AWS SageMaker from scratch.[28] Activation Steering for Masked Diffusion Language Models
Adi Shnaidman,Erin Feiglin,Osher Yaari,Efrat Mentel,Amit Levi,Raz Lapid
Main category: cs.CL
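TL;DR: Proposes an activation-steering framework for masked diffusion language models (MDLMs) that computes layer-wise steering vectors from contrastive examples, enabling efficient inference-time control.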
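A minimal sketch of how a steering vector can be built from contrastive examples and applied to a hidden state, assuming the common mean-difference construction. The digest does not say which construction the paper uses, so `steering_vector`, `apply_steering`, the scale `alpha`, and the toy activations are all illustrative assumptions.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def steering_vector(positive_acts, negative_acts):
    """Direction from the mean negative activation to the mean positive activation."""
    pos = mean_vector(positive_acts)
    neg = mean_vector(negative_acts)
    return [p - q for p, q in zip(pos, neg)]

def apply_steering(hidden, direction, alpha=1.0):
    """Shift a hidden state along the steering direction, scaled by alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Toy layer activations for prompts with and without the target attribute.
positive = [[1.0, 2.0], [3.0, 4.0]]
negative = [[0.0, 1.0], [2.0, 1.0]]
direction = steering_vector(positive, negative)
steered = apply_steering([0.5, 0.5], direction, alpha=2.0)
```

In a diffusion-style decoder, the same `direction` would be added to the chosen layer's activations at every reverse step, which matches the entry's description of reusing vectors computed once.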
Details
Motivation: Existing MDLMs lack effective mechanisms for inference-time control and steering, which limits their flexibility and controllability in practice. Method: Compute layer-wise steering vectors from contrastive examples in a single forward pass and apply them at every reverse-diffusion step, without simulating the denoising trajectory. Result: Experiments on LLaDA-8B-Instruct show that the method reliably modulates high-level text attributes; ablations analyze the effect of different Transformer sub-modules and token scope. Conclusion: The proposed activation-steering framework gives MDLMs an efficient, flexible means of inference-time control, broadening their potential for controllable text generation. Abstract: Masked diffusion language models (MDLMs) generate text through an iterative denoising process. They have recently gained attention due to mask-parallel decoding and competitive performance with autoregressive large language models. However, effective mechanisms for inference-time control and steering in MDLMs remain largely unexplored. We present an activation-steering framework for MDLMs that computes layer-wise steering vectors from a single forward pass using contrastive examples, without simulating the denoising trajectory. These directions are applied at every reverse-diffusion step, yielding an efficient inference-time control mechanism. Experiments on LLaDA-8B-Instruct demonstrate reliable modulation of high-level attributes, with ablations examining the effects of steering across transformer sub-modules and token scope (prompt vs. response).[29] Large Emotional World Model
Changhao Song,Yazhou Zhang,Hui Gao,Chang Yang,Peng Zhang
Main category: cs.CL
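TL;DR: This paper proposes the Large Emotional World Model (LEWM). By building the EWH dataset, which encodes emotional causal relationships, it lets a world model represent emotional states explicitly and predict emotion-driven social behavior more accurately.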
Details
Motivation: Existing large language models have some ability to model world knowledge but focus mainly on physical regularities and lack systematic modeling of emotion, a key factor in human decision-making and in understanding the world. Method: Inspired by theory of mind, build the Emotion-Why-How (EWH) dataset, which combines emotion, causal relationships, and behavioral reasoning, and on that basis propose LEWM, which jointly models emotional states, visual observations, and actions to predict both future states and emotional transitions. Result: Experiments show that LEWM predicts emotion-driven social behavior better while staying on par with general world models on basic tasks. Removing emotion-related information degrades reasoning performance, confirming the importance of emotion modeling. Conclusion: Bringing emotion into world models improves understanding and prediction of complex social scenarios, and LEWM points to a new direction for building intelligent systems with more human-like cognition. Abstract: World Models serve as tools for understanding the current state of the world and predicting its future dynamics, with broad application potential across numerous fields. As a key component of world knowledge, emotion significantly influences human decision-making. While existing Large Language Models (LLMs) have shown preliminary capability in capturing world knowledge, they primarily focus on modeling physical-world regularities and lack systematic exploration of emotional factors. In this paper, we first demonstrate the importance of emotion in understanding the world by showing that removing emotionally relevant information degrades reasoning performance. Inspired by theory of mind, we further propose a Large Emotional World Model (LEWM). Specifically, we construct the Emotion-Why-How (EWH) dataset, which integrates emotion into causal relationships and enables reasoning about why actions occur and how emotions drive future world states. Based on this dataset, LEWM explicitly models emotional states alongside visual observations and actions, allowing the world model to predict both future states and emotional transitions. Experimental results show that LEWM more accurately predicts emotion-driven social behaviors while maintaining comparable performance to general world models on basic tasks.[30] Training Report of TeleChat3-MoE
Xinzhang Liu,Chao Wang,Zhihao Yang,Zhuo Jiang,Xuncheng Zhao,Haoran Wang,Lei Li,Dongdong He,Luobin Liu,Kaizhe Yuan,Han Gao,Zihan Wang,Yitong Yao,Sishi Xiong,Wenmin Deng,Haowei He,Kaidong Yu,Yu Zhao,Ruiyu Fang,Yuhao Jiang,Yingyan Li,Xiaohui Hu,Xi Yu,Jingqi Li,Yanwei Liu,Qingli Li,Xinyu Shi,Junhao Niu,Chengnuo Huang,Yao Xiao,Ruiwen Wang,Fengkai Li,Luwen Pu,Kaipeng Jia,Fubei Yao,Yuyao Huang,Xuewei He,Zhuoru Jiang,Ruiting Song,Rui Xue,Qiyi Xie,Jie Zhang,Zilu Huang,Zhaoxi Zhang,Zhilong Lu,Yanhan Zhang,Yin Zhang,Yanlei Xue,Zhu Yuan,Teng Su,Xin Jiang,Shuangyong Song,Yongxiang Li,Xuelong Li
Main category: cs.CL
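TL;DR: TeleChat3-MoE is a series of ultra-large language models trained on Ascend NPU clusters, using a Mixture-of-Experts (MoE) architecture with parameter counts from the hundred-billion to the trillion scale. The report focuses on the infrastructure behind its efficient, scalable training, including numerical accuracy verification, performance optimizations, and a multi-dimensional parallelism framework.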
Details
Motivation: Supporting efficient, stable training of ultra-large language models on domestic NPU clusters requires solving cross-platform numerical consistency, distributed training efficiency, and system-level bottlenecks. Method: Propose systematic operator-level and end-to-end numerical accuracy verification; design performance optimizations including interleaved pipeline scheduling, long-sequence-aware data scheduling, hierarchical and overlapped communication, and DVM-based operator fusion; build a framework that optimizes multi-dimensional parallelism configurations using analytical estimation and integer linear programming; and apply cluster-level optimizations to relieve host- and device-bound bottlenecks. Result: The infrastructure delivers significant throughput gains and near-linear scaling on clusters of thousands of devices, supporting end-to-end training of models from the hundred-billion to the trillion-parameter scale. Conclusion: The work provides a reliable, efficient training foundation for building ultra-large language models on heterogeneous hardware ecosystems and demonstrates the feasibility and potential of domestic NPU clusters for large models. Abstract: TeleChat3-MoE is the latest series of TeleChat large language models, featuring a Mixture-of-Experts (MoE) architecture with parameter counts ranging from 105 billion to over one trillion, trained end-to-end on an Ascend NPU cluster. This technical report mainly presents the underlying training infrastructure that enables reliable and efficient scaling to frontier model sizes. We detail systematic methodologies for operator-level and end-to-end numerical accuracy verification, ensuring consistency across hardware platforms and distributed parallelism strategies. Furthermore, we introduce a suite of performance optimizations, including interleaved pipeline scheduling, attention-aware data scheduling for long-sequence training, hierarchical and overlapped communication for expert parallelism, and DVM-based operator fusion. A systematic parallelization framework, leveraging analytical estimation and integer linear programming, is also proposed to optimize multi-dimensional parallelism configurations. Additionally, we present methodological approaches to cluster-level optimizations, addressing host- and device-bound bottlenecks during large-scale training tasks. These infrastructure advancements yield significant throughput improvements and near-linear scaling on clusters comprising thousands of devices, providing a robust foundation for large-scale language model development on hardware ecosystems.[31] MedKGI: Iterative Differential Diagnosis with Medical Knowledge Graphs and Information-Guided Inquiring
Qipeng Wang,Rui Sheng,Yafei Li,Huamin Qu,Yushi Sun,Min Zhu
Main category: cs.CL
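TL;DR: MedKGI is a diagnostic framework grounded in clinical practice that integrates a medical knowledge graph, information-gain-based question selection, and an OSCE-format state structure to improve the accuracy and dialogue efficiency of large language models in clinical diagnosis.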
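Selecting the next question by information gain can be sketched as follows, assuming yes/no questions, a prior over candidate diagnoses, and known probabilities of a "yes" answer under each diagnosis. The names `information_gain` and `questions`, and the toy numbers, are hypothetical; the paper's knowledge-graph machinery is not reproduced here.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(prior, p_yes_given_disease):
    """Expected entropy reduction over diagnoses from asking one yes/no question."""
    gain = entropy(prior.values())
    for answer_is_yes in (True, False):
        likelihood = {
            d: (p if answer_is_yes else 1.0 - p)
            for d, p in p_yes_given_disease.items()
        }
        answer_prob = sum(prior[d] * likelihood[d] for d in prior)
        if answer_prob > 0:
            posterior = [prior[d] * likelihood[d] / answer_prob for d in prior]
            gain -= answer_prob * entropy(posterior)
    return gain

# Toy differential: two equally likely diagnoses and two candidate questions.
prior = {"flu": 0.5, "cold": 0.5}
questions = {
    "fever": {"flu": 1.0, "cold": 0.0},  # perfectly discriminative
    "cough": {"flu": 0.5, "cold": 0.5},  # uninformative
}
best = max(questions, key=lambda q: information_gain(prior, questions[q]))
```

Asking the highest-gain question first is what lets a dialogue reach a diagnosis in fewer turns, since uninformative questions like "cough" in this toy case score zero.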
Details
Motivation: Existing large language models hallucinate, ask redundant questions, and lose coherence over multi-turn dialogue in clinical diagnosis, making it hard for them to emulate real clinical reasoning. Method: Propose the MedKGI framework, which constrains reasoning with a medical knowledge graph, uses information gain to guide question generation, and tracks evidence with an OSCE-format structured state. Result: On clinical benchmarks, MedKGI improves dialogue efficiency by 30% on average over strong baselines while maintaining state-of-the-art diagnostic accuracy. Conclusion: MedKGI addresses key weaknesses of LLMs in clinical diagnosis and substantially improves the efficiency and consistency of the diagnostic process. Abstract: Recent advancements in Large Language Models (LLMs) have demonstrated significant promise in clinical diagnosis. However, current models struggle to emulate the iterative, diagnostic hypothesis-driven reasoning of real clinical scenarios. Specifically, current LLMs suffer from three critical limitations: (1) generating hallucinated medical content due to weak grounding in verified knowledge, (2) asking redundant or inefficient questions that hinder diagnostic progress, rather than discriminative ones, and (3) losing coherence over multi-turn dialogues, leading to contradictory or inconsistent conclusions. To address these challenges, we propose MedKGI, a diagnostic framework grounded in clinical practices. MedKGI integrates a medical knowledge graph (KG) to constrain reasoning to validated medical ontologies, selects questions based on information gain to maximize diagnostic efficiency, and adopts an OSCE-format structured state to maintain consistent evidence tracking across turns. Experiments on clinical benchmarks show that MedKGI outperforms strong LLM baselines in both diagnostic accuracy and inquiry efficiency, improving dialogue efficiency by 30% on average while maintaining state-of-the-art accuracy.[32] LAILA: A Large Trait-Based Dataset for Arabic Automated Essay Scoring
May Bashendy,Walid Massoud,Sohaila Eltanbouly,Salam Albatarni,Marwan Sayed,Abrar Abir,Houda Bouamor,Tamer Elsayed
Main category: cs.CL
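TL;DR: This paper introduces LAILA, currently the largest publicly available Arabic automated essay scoring (AES) dataset, containing 7,859 essays annotated with holistic and trait-specific scores across seven dimensions, along with benchmark results from state-of-the-art models.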
Details
Motivation: Research on Arabic automated essay scoring (AES) is limited by the lack of publicly available datasets, so a large public Arabic AES dataset is needed to advance the field. Method: Design and collect LAILA, a dataset of 7,859 Arabic essays, each annotated with holistic and trait-specific scores on seven dimensions (relevance, organization, vocabulary, style, development, mechanics, and grammar), and benchmark state-of-the-art Arabic and English models in prompt-specific and cross-prompt settings. Result: LAILA is the largest publicly available Arabic AES dataset to date; benchmark experiments show it can be used to train and evaluate Arabic automated scoring systems and supports modeling of multiple scoring dimensions. Conclusion: LAILA fills a key gap in Arabic AES research and provides an important resource for developing robust Arabic writing assessment systems. Abstract: Automated Essay Scoring (AES) has gained increasing attention in recent years, yet research on Arabic AES remains limited due to the lack of publicly available datasets. To address this, we introduce LAILA, the largest publicly available Arabic AES dataset to date, comprising 7,859 essays annotated with holistic and trait-specific scores on seven dimensions: relevance, organization, vocabulary, style, development, mechanics, and grammar. We detail the dataset design, collection, and annotations, and provide benchmark results using state-of-the-art Arabic and English models in prompt-specific and cross-prompt settings. LAILA fills a critical need in Arabic AES research, supporting the development of robust scoring systems.[33] Tracing the Flow of Knowledge From Science to Technology Using Deep Learning
Michael E. Rose,Mainak Ghosh,Sebastian Erhardt,Cheng Li,Erik Buunk,Dietmar Harhoff
Main category: cs.CL
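TL;DR: Proposes Pat-SPECTER, a language similarity model for both patents and scientific publications that performs best at predicting patent-paper citations, and confirms the hypothesis that US patents cite papers that are semantically less similar.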
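Once a similarity model has embedded a patent and a set of candidate papers, likely citations can be ranked by cosine similarity. The sketch below assumes the embeddings are already available as plain lists of floats; the toy vectors are invented for illustration and do not come from Pat-SPECTER.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy embeddings: paper "A" points the same way as the patent, "B" is orthogonal.
patent = [1.0, 0.0]
papers = {"A": [2.0, 0.0], "B": [0.0, 3.0]}
ranked = sorted(papers, key=lambda name: cosine_similarity(patent, papers[name]), reverse=True)
```

Comparing the similarity of actually cited pairs across jurisdictions is then a matter of averaging these scores per group, which is the kind of comparison the US hypothesis in this entry calls for.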
Details
Motivation: A language similarity model that handles patents and scientific publications together is needed to better predict citation links between patents and papers and to explore differences in citation behavior across jurisdictions. Method: Fine-tune the SPECTER2 model on patent data to obtain Pat-SPECTER, evaluate it against other models in a comparison of eight language (similarity) models, and test it in two real-world scenarios. Result: Pat-SPECTER performs best at predicting credible patent-paper citations; papers cited by US patents are semantically less similar than those cited in other large jurisdictions, which the authors attribute to the US duty of candor. Conclusion: Pat-SPECTER is currently the best-suited language model for joint analysis of patents and scientific publications, and the results support the hypothesis that US legal requirements make citation behavior more diverse. Abstract: We develop a language similarity model suitable for working with patents and scientific publications at the same time. In a horse race-style evaluation, we subject eight language (similarity) models to predict credible Patent-Paper Citations. We find that our Pat-SPECTER model performs best, which is the SPECTER2 model fine-tuned on patents. In two real-world scenarios (separating patent-paper-pairs and predicting patent-paper-pairs) we demonstrate the capabilities of the Pat-SPECTER. We finally test the hypothesis that US patents cite papers that are semantically less similar than in other large jurisdictions, which we posit is because of the duty of candor. The model is open for the academic community and practitioners alike.[34] Joint Selection for Large-Scale Pre-Training Data via Policy Gradient-based Mask Learning
Ziqing Fan,Yuqiao Xian,Yan Sun,Li Shen
Main category: cs.CL
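TL;DR: This paper proposes DATAMASK, an efficient joint learning framework for large-scale pre-training data selection that balances quality and diversity metrics, significantly improving model performance while cutting selection time by 98.9%.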
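The mask-learning loop in this entry (sample masks, compute policy gradients, update sampling logits) can be sketched with a REINFORCE update over independent Bernoulli keep/drop decisions. Everything below is a toy under stated assumptions: the objective (sum of kept scores minus a penalty for exceeding a budget), the moving-average baseline, and all names and hyperparameters are hypothetical rather than the paper's.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def learn_mask_logits(scores, budget, steps=200, lr=0.5, seed=0):
    """Learn per-sample keep logits with a REINFORCE-style policy gradient."""
    rng = random.Random(seed)
    logits = [0.0] * len(scores)
    baseline = 0.0
    for _ in range(steps):
        probs = [sigmoid(z) for z in logits]
        # Sample a binary data mask from the current keep probabilities.
        mask = [1 if rng.random() < p else 0 for p in probs]
        kept = sum(mask)
        # Toy objective: total score of kept samples, penalized past the budget.
        reward = sum(s * m for s, m in zip(scores, mask)) - max(0, kept - budget)
        advantage = reward - baseline
        baseline += 0.1 * (reward - baseline)
        # For a Bernoulli variable, d log p(mask_i) / d logit_i = mask_i - prob_i.
        logits = [z + lr * advantage * (m - p) for z, m, p in zip(logits, mask, probs)]
    return logits

# Toy per-sample scores standing in for combined quality and diversity metrics.
scores = [1.0, 0.9, 0.1, 0.0]
logits = learn_mask_logits(scores, budget=2)
```

Because the reward is computed on the whole sampled mask, any set-level objective can be plugged in without being differentiable, which is what makes a policy-gradient formulation convenient for joint selection criteria.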
Details
Motivation: Existing data selection methods for large-scale pre-training struggle to account for quality and diversity metrics at the same time, which lowers training efficiency, discards valuable samples, and limits model capability. Method: Cast data selection as a mask learning problem: iteratively sample data masks, compute policy gradients, and update the mask sampling logits, jointly optimizing quality and diversity metrics in a unified framework, with acceleration techniques to improve efficiency. Result: From the 15-trillion-token FineWeb dataset, the method selects a subset of about 10%, FineWeb-Mask; it cuts selection time by 98.9% relative to the baseline and, across 12 tasks, improves a 1.5B dense model by 3.2% and a 7B MoE model by 1.9%. Conclusion: DATAMASK achieves efficient joint optimization for data selection, confirms the importance of combining quality and diversity metrics, and offers a new paradigm for building pre-training data for large language models. Abstract: A fine-grained data recipe is crucial for pre-training large language models, as it can significantly enhance training efficiency and model performance. One important ingredient in the recipe is to select samples based on scores produced by defined rules, LLM judgment, or statistical information in embeddings, which can be roughly categorized into quality and diversity metrics. Due to the high computational cost when applied to trillion-scale token pre-training datasets such as FineWeb and DCLM, these two or more types of metrics are rarely considered jointly in a single selection process. However, in our empirical study, selecting samples based on quality metrics exhibits severe diminishing returns during long-term pre-training, while selecting on diversity metrics removes too many valuable high-quality samples, both of which limit pre-trained LLMs' capabilities. Therefore, we introduce DATAMASK, a novel and efficient joint learning framework designed for large-scale pre-training data selection that can simultaneously optimize multiple types of metrics in a unified process, with this study focusing specifically on quality and diversity metrics. DATAMASK approaches the selection process as a mask learning problem, involving iterative sampling of data masks, computation of policy gradients based on predefined objectives with sampled masks, and updating of mask sampling logits. Through policy gradient-based optimization and various acceleration enhancements, it significantly reduces selection time by 98.9% compared to a greedy algorithm, enabling our study to explore joint learning within trillion-scale tokens. With DATAMASK, we select a subset of about 10% from the 15 trillion-token FineWeb dataset, termed FineWeb-Mask. Evaluated across 12 diverse tasks, we achieve significant improvements of 3.2% on a 1.5B dense model and 1.9% on a 7B MoE model.[35] Automated Analysis of Sustainability Reports: Using Large Language Models for the Extraction and Prediction of EU Taxonomy-Compliant KPIs
Jonathan Schmoll,Adam Jatowt
Main category: cs.CL
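TL;DR: This paper introduces a new structured dataset for EU Taxonomy compliance assessment, covering economic activities and key performance indicators (KPIs) from 190 corporate reports, and presents the first systematic evaluation of large language models (LLMs) on this workflow. LLMs perform moderately on qualitative tasks and fail comprehensively on quantitative ones.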
Details
Motivation: Because no public benchmark dataset exists, current research cannot readily assess the automation potential of LLMs in the EU Taxonomy compliance workflow; this paper aims to fill that gap.
Method: A structured dataset is built from 190 corporate reports, annotated with economic activities and quantitative KPIs, and a multi-step agentic framework is used to systematically evaluate LLMs on qualitative and quantitative tasks.
Result: LLMs perform moderately on the qualitative task of identifying economic activities, with the multi-step framework slightly improving precision; they fail completely on quantitative tasks such as zero-shot prediction of financial KPIs. In addition, concise metadata sometimes works better than full reports, and model confidence is poorly calibrated.
Conclusion: LLMs cannot yet fully automate EU Taxonomy compliance, but they can serve as assistive tools for human experts; the dataset provides a public benchmark for future research.
Abstract: The manual, resource-intensive process of complying with the EU Taxonomy presents a significant challenge for companies. While Large Language Models (LLMs) offer a path to automation, research is hindered by a lack of public benchmark datasets. To address this gap, we introduce a novel, structured dataset from 190 corporate reports, containing ground-truth economic activities and quantitative Key Performance Indicators (KPIs). We use this dataset to conduct the first systematic evaluation of LLMs on the core compliance workflow. Our results reveal a clear performance gap between qualitative and quantitative tasks. LLMs show moderate success in the qualitative task of identifying economic activities, with a multi-step agentic framework modestly enhancing precision. Conversely, the models comprehensively fail at the quantitative task of predicting financial KPIs in a zero-shot setting. We also discover a paradox, where concise metadata often yields superior performance to full, unstructured reports, and find that model confidence scores are poorly calibrated. We conclude that while LLMs are not ready for full automation, they can serve as powerful assistive tools for human experts. Our dataset provides a public benchmark for future research.
[36] Figure It Out: Improving the Frontier of Reasoning with Active Visual Thinking
Meiqi Chen,Fandong Meng,Jie Zhou
Main category: cs.CL
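TL;DR: This paper proposes FIGR, a model that integrates active visual thinking into multi-turn reasoning through end-to-end reinforcement learning. It externalizes intermediate structural hypotheses by constructing visual representations, which improves reasoning about global structural properties in complex problems.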
Details
Motivation: Purely text-based reasoning struggles to capture global structural constraints and implicit spatial and geometric relationships in complex problems, so visual representations are needed to make reasoning more stable and coherent.
Method: FIGR uses end-to-end reinforcement learning to generate visual representations dynamically during reasoning and adaptively decides when and how to invoke visual reasoning, enabling multi-step reasoning in which text and figures work together.
Result: FIGR improves the base model by 13.12% on AIME 2025 and 11.00% on BeyondAIME and outperforms strong text-only chain-of-thought baselines, supporting the effectiveness of figure-guided multimodal reasoning on complex mathematical reasoning tasks.
Conclusion: Multimodal reasoning with visual representations improves the understanding of global structure and the stability of complex reasoning; FIGR offers a new approach to problems with implicit structure.
Abstract: Complex reasoning problems often involve implicit spatial, geometric, and structural relationships that are not explicitly encoded in text. While recent reasoning models have achieved strong performance across many domains, purely text-based reasoning struggles to represent global structural constraints in complex settings. In this paper, we introduce FIGR, which integrates active visual thinking into multi-turn reasoning via end-to-end reinforcement learning. FIGR externalizes intermediate structural hypotheses by constructing visual representations during problem solving. By adaptively regulating when and how visual reasoning should be invoked, FIGR enables more stable and coherent reasoning over global structural properties that are difficult to capture from text alone. Experiments on challenging mathematical reasoning benchmarks demonstrate that FIGR outperforms strong text-only chain-of-thought baselines. In particular, FIGR improves the base model by 13.12% on AIME 2025 and 11.00% on BeyondAIME, highlighting the effectiveness of figure-guided multimodal reasoning in enhancing the stability and reliability of complex reasoning.
[37] QianfanHuijin Technical Report: A Novel Multi-Stage Training Paradigm for Finance Industrial LLMs
Shupeng Li,Weipeng Lu,Linyun Liu,Chen Lin,Shaofei Li,Zhendong Tan,Hanjun Zhong,Yucheng Zeng,Chenghao Zhu,Mengyue Liu,Daxiang Dong,Jianmin Wu,Yunting Xiao,Annan Li,Danyu Liu,Jingnan Zhang,Licen Liu,Dawei Yin,Dou Shen
Main category: cs.CL
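TL;DR: This paper presents QianfanHuijin, a large language model for the financial domain, and proposes a generalizable multi-stage training paradigm. Its progressively refined post-training pipeline (financial supervised fine-tuning, reasoning reinforcement learning, and agentic reinforcement learning) substantially improves the model's financial knowledge, reasoning, and agentic capabilities.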
Details
Motivation: As financial services grow more complex, models with domain knowledge alone no longer suffice; enhanced large models that also have financial reasoning and agentic capabilities are needed.
Method: A multi-stage training paradigm is used: Continual Pre-training (CPT) on financial corpora first, followed in order by Financial SFT, Finance Reasoning RL, and Finance Agentic RL, and finally General RL aligned with real-world business scenarios.
Result: QianfanHuijin performs strongly on multiple authoritative financial benchmarks, and ablation studies show that the Reasoning RL and Agentic RL stages each significantly improve the corresponding capability.
Conclusion: This fine-grained, progressive post-training approach strengthens the domain adaptation of industrial-grade large models and may become a mainstream training paradigm for industry-enhanced LLMs.
Abstract: Domain-specific enhancement of Large Language Models (LLMs) within the financial context has long been a focal point of industrial application. While previous models such as BloombergGPT and Baichuan-Finance primarily focused on knowledge enhancement, the deepening complexity of financial services has driven a growing demand for models that possess not only domain knowledge but also robust financial reasoning and agentic capabilities. In this paper, we present QianfanHuijin, a financial domain LLM, and propose a generalizable multi-stage training paradigm for industrial model enhancement. Our approach begins with Continual Pre-training (CPT) on financial corpora to consolidate the knowledge base. This is followed by a fine-grained Post-training pipeline designed with increasing specificity: starting with Financial SFT, progressing to Finance Reasoning RL and Finance Agentic RL, and culminating in General RL aligned with real-world business scenarios. Empirical results demonstrate that QianfanHuijin achieves superior performance across various authoritative financial benchmarks. Furthermore, ablation studies confirm that the targeted Reasoning RL and Agentic RL stages yield significant gains in their respective capabilities. These findings validate our motivation and suggest that this fine-grained, progressive post-training methodology is poised to become a mainstream paradigm for various industrial-enhanced LLMs.
[38] World model inspired sarcasm reasoning with large language model agents
Keito Inoshita,Shinnosuke Mizuno
Main category: cs.CL
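TL;DR: This paper proposes WM-SAR, a world-model-based framework for sarcasm understanding. It decomposes literal meaning, context, normative expectation, and intention into specialized LLM-based agents, explicitly quantifies semantic inconsistency and intention scores, and combines them with a lightweight logistic regression model for interpretable, high-performing sarcasm detection.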
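The decision structure summarized above (an inconsistency score and an intention score fed into logistic regression) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration and not the authors' implementation: the function names, the [-1, 1] score range, the absolute-difference form of the inconsistency score, and the hand-picked weights. In the paper the scores come from LLM agents and the regression weights would be fitted to data.

```python
import math


def inconsistency_score(literal_eval: float, normative_expectation: float) -> float:
    """Deterministic mismatch between literal evaluation and normative expectation.

    Assumption: both inputs lie in [-1, 1] (negative to positive sentiment),
    so the result is scaled to [0, 1].
    """
    return abs(literal_eval - normative_expectation) / 2.0


def sarcasm_probability(inconsistency: float, intention: float,
                        w_inc: float = 3.0, w_int: float = 2.0,
                        bias: float = -2.5) -> float:
    """Logistic regression over the two scores.

    The weights and bias are hypothetical placeholders, not learned values.
    """
    z = w_inc * inconsistency + w_int * intention + bias
    return 1.0 / (1.0 + math.exp(-z))


# "Great, another Monday": literally positive, normatively negative.
inc = inconsistency_score(literal_eval=1.0, normative_expectation=-1.0)
p_sarcastic = sarcasm_probability(inc, intention=1.0)
p_literal = sarcasm_probability(0.0, intention=0.0)
```

Because the final step is a two-feature linear model, each score's contribution to the prediction can be read directly from its weight, which is the kind of interpretability the entry emphasizes.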
Details
Motivation: Existing sarcasm understanding methods mostly rely on black-box models whose cognitive mechanisms are hard to explain, and they lack explicit modeling of the mismatch between semantic evaluation and normative expectation, which limits both interpretability and performance.
Method: Sarcasm understanding is recast as a world-model-driven reasoning process in the WM-SAR framework. Multiple LLM agents separately model literal meaning, context, normative expectation, and intention; an inconsistency score between literal evaluation and normative expectation is computed and fed, together with an intention score, into a lightweight logistic regression model for the final decision.
Result: On several representative sarcasm detection benchmarks, WM-SAR outperforms existing deep learning and LLM-based methods; ablation studies and case analyses show that combining semantic inconsistency with intention reasoning is essential to performance.
Conclusion: By combining the reasoning capability of LLMs with a structured numerical decision, WM-SAR achieves both high performance and high interpretability and reveals the key cognitive factors in sarcasm understanding and how they operate.
Abstract: Sarcasm understanding is a challenging problem in natural language processing, as it requires capturing the discrepancy between the surface meaning of an utterance and the speaker's intentions as well as the surrounding social context. Although recent advances in deep learning and Large Language Models (LLMs) have substantially improved performance, most existing approaches still rely on black-box predictions of a single model, making it difficult to structurally explain the cognitive factors underlying sarcasm. Moreover, while sarcasm often emerges as a mismatch between semantic evaluation and normative expectations or intentions, frameworks that explicitly decompose and model these components remain limited. In this work, we reformulate sarcasm understanding as a world model inspired reasoning process and propose World Model inspired SArcasm Reasoning (WM-SAR), which decomposes literal meaning, context, normative expectation, and intention into specialized LLM-based agents. The discrepancy between literal evaluation and normative expectation is explicitly quantified as a deterministic inconsistency score, and together with an intention score, these signals are integrated by a lightweight Logistic Regression model to infer the final sarcasm probability. This design leverages the reasoning capability of LLMs while maintaining an interpretable numerical decision structure. Experiments on representative sarcasm detection benchmarks show that WM-SAR consistently outperforms existing deep learning and LLM-based methods. Ablation studies and case analyses further demonstrate that integrating semantic inconsistency and intention reasoning is essential for effective sarcasm detection, achieving both strong performance and high interpretability.
[39] Skim-Aware Contrastive Learning for Efficient Document Representation
Waheed Ahmed Abro,Zied Bouraoui
Main category: cs.CL
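TL;DR: The paper proposes a self-supervised contrastive learning framework based on natural language inference that strengthens long-document representations by mimicking human reading habits, achieving higher accuracy and efficiency on legal and biomedical texts.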
Details
Motivation: Existing models for long documents are resource-intensive, capture context incompletely, or lack interpretability, especially in law and medicine.
Method: A new self-supervised contrastive learning framework randomly masks a section of the document and uses a natural language inference (NLI)-based contrastive objective to align it with relevant parts and distance it from unrelated ones, mimicking how humans skim text.
Result: Experiments on legal and biomedical texts show that the method significantly improves the accuracy and computational efficiency of long-document representations.
Conclusion: By imitating human reading strategies, the method improves the quality of long-document representations while remaining computationally efficient and practically promising.
Abstract: Although transformer-based models have shown strong performance in word- and sentence-level tasks, effectively representing long documents, especially in fields like law and medicine, remains difficult. Sparse attention mechanisms can handle longer inputs, but are resource-intensive and often fail to capture full-document context. Hierarchical transformer models offer better efficiency but do not clearly explain how they relate different sections of a document. In contrast, humans often skim texts, focusing on important sections to understand the overall message. Drawing from this human strategy, we introduce a new self-supervised contrastive learning framework that enhances long document representation. Our method randomly masks a section of the document and uses a natural language inference (NLI)-based contrastive objective to align it with relevant parts while distancing it from unrelated ones. This mimics how humans synthesize information, resulting in representations that are both richer and more computationally efficient. Experiments on legal and biomedical texts confirm significant gains in both accuracy and efficiency.
[40] Comparing Approaches to Automatic Summarization in Less-Resourced Languages
Chester Palen-Michel,Constantine Lignos
Main category: cs.CL
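TL;DR: This paper studies automatic text summarization for less-resourced languages, comparing approaches that range from zero-shot prompting of large and small LLMs to fine-tuning mT5, and explores an LLM translation pipeline. Multilingual fine-tuned mT5 outperforms the other approaches on most metrics.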
Details
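Motivation: Summarization for less-resourced languages is comparatively under-studied, and existing high-performing methods concentrate on high-resourced languages such as English, so effective summarization techniques for less-resourced languages need to be explored.
Method: Several approaches are compared: zero-shot prompting of LLMs of different sizes; fine-tuning mT5 (with three data augmentation approaches and multilingual transfer); and an LLM translation pipeline that translates the source language into English, summarizes, and translates back. Five metrics are used for evaluation.
Result: LLMs of similar parameter size vary in performance; the multilingual fine-tuned mT5 baseline outperforms most other approaches, including zero-shot LLMs, on most metrics; and LLM-as-judge is less reliable on less-resourced languages.
Conclusion: For summarization in less-resourced languages, small multilingual fine-tuned models such as mT5 are more effective than zero-shot large models, while translation-based pipelines and LLM judges should be used with caution.
Abstract: Automatic text summarization has achieved high performance in high-resourced languages like English, but comparatively less attention has been given to summarization in less-resourced languages. This work compares a variety of different approaches to summarization, from zero-shot prompting of LLMs large and small to fine-tuning smaller models like mT5 with and without three data augmentation approaches and multilingual transfer. We also explore an LLM translation pipeline approach, translating from the source language to English, summarizing and translating back. Evaluating with five different metrics, we find that there is variation across LLMs in their performance across similar parameter sizes, that our multilingual fine-tuned mT5 baseline outperforms most other approaches including zero-shot LLM performance for most metrics, and that LLM as judge may be less reliable on less-resourced languages.
[41] Cleaning English Abstracts of Scientific Publications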
Michael E. Rose,Nils A. Herrmann,Sebastian Erhardt
Main category: cs.CL
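TL;DR: The paper proposes an open-source language model that automatically removes extraneous information from English scientific abstracts, improving the quality of text embeddings and the accuracy of similarity analysis.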
Details
Motivation: Scientific abstracts are often used as the basis for analyzing research topics, but they frequently contain irrelevant content such as copyright statements and metadata, which affects text analysis results.
Method: An easy-to-integrate open-source language model is developed that automatically identifies and removes non-essential information from scientific abstracts.
Result: The model is precise and conservative, alters the similarity rankings of cleaned abstracts, and improves the information content of standard-length embeddings.
Conclusion: The proposed model effectively cleans scientific abstracts, making downstream natural language processing tasks more reliable and accurate.
Abstract: Scientific abstracts are often used as proxies for the content and thematic focus of research publications. However, a significant share of published abstracts contains extraneous information (such as publisher copyright statements, section headings, author notes, registrations, and bibliometric or bibliographic metadata) that can distort downstream analyses, particularly those involving document similarity or textual embeddings. We introduce an open-source, easy-to-integrate language model designed to clean English-language scientific abstracts by automatically identifying and removing such clutter. We demonstrate that our model is both conservative and precise, alters similarity rankings of cleaned abstracts and improves information content of standard-length embeddings.
[42] IELTS Writing Revision Platform with Automated Essay Scoring and Adaptive Feedback
Titas Ramancauskas,Kotryna Ramancauske
Main category: cs.CL
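TL;DR: This paper presents and evaluates a revision platform for the IELTS writing exam that combines automated scoring with tailored feedback. Through iterative design-based research, the scoring model moved from a rule-based approach to a DistilBERT-based regression model, substantially improving scoring accuracy and feedback effectiveness.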
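This entry reports scoring quality as MAE and R^2 and notes that the early rule-based scorers had negative R^2 values. A minimal sketch of how those two metrics are computed, and of why compressing predictions into one band can push R^2 below zero, is given here. The band scores are made-up illustrative numbers, not data from the paper.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute gap between human and predicted band scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    A negative value means the predictions fit worse than always
    predicting the mean of the true scores.
    """
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical human band scores and two hypothetical scorers.
human_bands = [5.0, 6.0, 7.0, 8.0]
compressed = [6.0, 6.0, 6.0, 6.0]   # every essay pushed to one band
regression = [5.5, 6.0, 6.5, 7.5]   # tracks the human scores

mae_compressed = mean_absolute_error(human_bands, compressed)
r2_compressed = r_squared(human_bands, compressed)
mae_regression = mean_absolute_error(human_bands, regression)
r2_regression = r_squared(human_bands, regression)
```

In this toy data the compressed scorer is off by one band on average and has a negative R^2, while the scorer that tracks the human scores has a lower MAE and a positive R^2.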
Details
Motivation: Traditional IELTS writing preparation lacks personalised feedback aligned with the IELTS rubric and struggles to simulate real exam conditions, so an online platform that provides accurate scoring and targeted feedback is needed to improve learning outcomes.
Method: Using Design-Based Research (DBR) over multiple iterations, the platform evolved from rule-based automated scoring to a transformer model (DistilBERT with a regression head) with an integrated adaptive feedback mechanism. Conversational guidance is separated from the writing interface to reduce cognitive load and simulate exam conditions.
Result: In DBR Cycle 4, the DistilBERT model reached an MAE of 0.66 and a positive R², clearly outperforming the earlier rule-based models. In Cycle 5, adaptive feedback raised scores by an average of 0.060 bands (p = 0.011, Cohen's d = 0.504), though the effect varied by revision strategy.
Conclusion: Automated feedback can be an effective supplement to IELTS writing instruction, and conservative surface-level corrections are more reliable than aggressive structural changes. The current system remains limited in assessing higher-band essays, and future work should include longitudinal studies and validation by official examiners.
Abstract: This paper presents the design, development, and evaluation of a proposed revision platform assisting candidates for the International English Language Testing System (IELTS) writing exam. Traditional IELTS preparation methods lack personalised feedback tailored to the IELTS writing rubric. To address these shortcomings, the platform features an attractive user interface (UI), an Automated Essay Scoring system (AES), and targeted feedback tailored to candidates and the IELTS writing rubric. The platform architecture separates conversational guidance from a dedicated writing interface to reduce cognitive load and simulate exam conditions. Through iterative Design-Based Research (DBR) cycles, the study progressed from rule-based scoring to transformer-based scoring with a regression head, combined with adaptive feedback. Early cycles (2-3) revealed fundamental limitations of rule-based approaches: mid-band compression, low accuracy, and negative $R^2$ values. DBR Cycle 4 implemented a DistilBERT transformer model with a regression head, yielding substantial improvements with MAE of 0.66 and positive $R^2$. This enabled Cycle 5's adaptive feedback implementation, which demonstrated statistically significant score improvements (mean +0.060 bands, p = 0.011, Cohen's d = 0.504), though effectiveness varied by revision strategy. Findings suggest automated feedback functions are most suited as a supplement to human instruction, with conservative surface-level corrections proving more reliable than aggressive structural interventions for IELTS preparation contexts. Challenges remain in assessing higher-band essays, and future work should incorporate longitudinal studies with real IELTS candidates and validation from official examiners.
[43] Paragraph Segmentation Revisited: Towards a Standard Task for Structuring Speech
Fabian Retkowski,Alexander Waibel
Main category: cs.CL
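TL;DR: This paper argues for the importance of paragraph segmentation in speech transcripts, builds two new benchmarks (TEDPara and YTSegPara), proposes a constrained-decoding approach for large language models, and designs a lightweight model, MiniSeg, that jointly predicts chapters and paragraphs, moving paragraph segmentation toward a standardized task in speech processing.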
Details
Motivation: Speech transcripts are typically unstructured word streams with poor readability and reusability, yet paragraph segmentation has long been neglected, and the text segmentation field lacks naturalistic, robust benchmark datasets.
Method: (1) Build TEDPara (human-annotated) and YTSegPara (synthetic labels) as new benchmarks for paragraph segmentation of speech; (2) propose a constrained-decoding mechanism that lets large language models insert paragraph breaks while leaving the original text unchanged; (3) design a lightweight model, MiniSeg, and extend it hierarchically to jointly predict chapters and paragraphs.
Result: The proposed methods reach state-of-the-art performance on the new benchmarks; MiniSeg is small and computationally cheap, predicts chapters and paragraphs jointly, and supports faithful, sentence-aligned evaluation.
Conclusion: Paragraph segmentation should become a standard step in speech processing; the resources and methods provided here lay the groundwork and bring the speech and text segmentation fields closer together.
Abstract: Automatic speech transcripts are often delivered as unstructured word streams that impede readability and repurposing. We recast paragraph segmentation as the missing structuring step and fill three gaps at the intersection of speech processing and text segmentation. First, we establish TEDPara (human-annotated TED talks) and YTSegPara (YouTube videos with synthetic labels) as the first benchmarks for the paragraph segmentation task. The benchmarks focus on the underexplored speech domain, where paragraph segmentation has traditionally not been part of post-processing, while also contributing to the wider text segmentation field, which still lacks robust and naturalistic benchmarks. Second, we propose a constrained-decoding formulation that lets large language models insert paragraph breaks while preserving the original transcript, enabling faithful, sentence-aligned evaluation. Third, we show that a compact model (MiniSeg) attains state-of-the-art accuracy and, when extended hierarchically, jointly predicts chapters and paragraphs with minimal computational cost. Together, our resources and methods establish paragraph segmentation as a standardized, practical task in speech processing.
[44] Safe in the Future, Dangerous in the Past: Dissecting Temporal and Linguistic Vulnerabilities in LLMs
Muhammad Abdullahi Said,Muhammad Sammani Sani
Main category: cs.CL
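TL;DR: Using HausaSafety, an adversarial dataset built on West African threat scenarios, this study systematically evaluates the safety alignment of three mainstream large models in English and Hausa. It finds that safety is strongly affected by the interaction of language and temporal framing, identifies a "Complex Interference" mechanism and "Safety Pockets", and proposes a shift to a new "Invariant Alignment" paradigm to keep safety stable in multilingual settings.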
Details
Motivation: Safety alignment of large language models is assumed to transfer zero-shot from English to other languages, but this assumption may have serious gaps in low-resource languages. With region-specific threats in particular, users in the Global South may face higher risk, so the reality and robustness of multilingual safety alignment need systematic testing.
Method: A new adversarial dataset, HausaSafety, focuses on threats specific to West Africa (e.g., Yahoo-Yahoo fraud, Dane gun manufacturing). A 2 x 4 factorial design with 1,440 evaluations tests the interaction between language (English vs. Hausa) and temporal framing (past, future, and others) for GPT-5.1, Gemini 3 Pro, and Claude 4.5 Opus, and analyzes their safety response patterns.
Result: Rather than a simple multilingual safety gap, the study finds a "Complex Interference" mechanism. Claude 4.5 Opus is safer in Hausa because of uncertainty-driven refusal (45.0% vs. 36.7%), but models broadly break down in temporal reasoning: past-tense framing markedly weakens defenses (only 15.6% safe), while future-tense framing triggers over-refusal (57.2% safe). A 9.2x gap separates the safest and most vulnerable configurations, indicating that safety is a context-dependent state rather than a fixed property.
Conclusion: Current models rely on superficial heuristics rather than deep semantic understanding, producing "Safety Pockets" under particular combinations of language and temporal framing that leave Global South users exposed to localized harms. A shift from the current paradigm to "Invariant Alignment" is needed to keep safety stable across languages and temporal framings.
Abstract: As Large Language Models (LLMs) integrate into critical global infrastructure, the assumption that safety alignment transfers zero-shot from English to other languages remains a dangerous blind spot. This study presents a systematic audit of three state-of-the-art models (GPT-5.1, Gemini 3 Pro, and Claude 4.5 Opus) using HausaSafety, a novel adversarial dataset grounded in West African threat scenarios (e.g., Yahoo-Yahoo fraud, Dane gun manufacturing). Employing a 2 x 4 factorial design across 1,440 evaluations, we tested the non-linear interaction between language (English vs. Hausa) and temporal framing. Our results challenge the prevailing multilingual safety gap narrative. Instead of a simple degradation in low-resource settings, we identified a mechanism of Complex Interference where safety is determined by the intersection of variables. While models exhibited a Reverse Linguistic with Claude 4.5 Opus proving significantly safer in Hausa (45.0%) than in English (36.7%) due to uncertainty-driven refusal, they suffered catastrophic failures in temporal reasoning. We report a profound Temporal Asymmetry, where past-tense framing bypassed defenses (15.6% safe) while future-tense scenarios triggered hyper-conservative refusals (57.2% safe). The magnitude of this volatility is illustrated by a 9.2x disparity between the safest and most vulnerable configurations, proving that safety is not a fixed property but a context-dependent state. We conclude that current models rely on superficial heuristics rather than robust semantic understanding, creating Safety Pockets that leave Global South users exposed to localized harms. We propose Invariant Alignment as a necessary paradigm shift to ensure safety stability across linguistic and temporal shifts.
[45] HaluNet: Multi-Granular Uncertainty Modeling for Efficient Hallucination Detection in LLM Question Answering
Chaodong Tong,Qi Zhang,Jiayang Gao,Lei Jiang,Yanbing Liu,Nannan Sun
Main category: cs.CL
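TL;DR: This paper proposes HaluNet, a lightweight neural framework for detecting hallucinations of large language models in question answering. By fusing token-level probability uncertainty with uncertainty in semantic representations, it performs efficient one-pass hallucination detection, with strong results on multiple datasets and high computational efficiency.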
Details
Motivation: Large language models perform well on question answering but often hallucinate (factual errors or fabricated content). Existing methods usually focus on a single type of uncertainty and overlook the complementarity among different sources, especially the potential of combining token-level probability uncertainty with uncertainty in internal semantic representations.
Method: HaluNet is a lightweight, trainable multi-branch neural framework that combines semantic embeddings with probabilistic confidence and distributional uncertainty, adaptively fusing the model's knowledge with the uncertainty in its outputs to integrate multi-granular token-level uncertainties.
Result: Experiments on SQuAD, TriviaQA, and Natural Questions show strong hallucination detection performance and good computational efficiency, with or without context.
Conclusion: HaluNet effectively integrates multiple uncertainty signals and offers a workable approach to real-time hallucination detection in large language models, with practical potential.
Abstract: Large Language Models (LLMs) excel at question answering (QA) but often generate hallucinations, including factual errors or fabricated content. Detecting hallucinations from internal uncertainty signals is attractive due to its scalability and independence from external resources. Existing methods often aim to accurately capture a single type of uncertainty while overlooking the complementarity among different sources, particularly between token-level probability uncertainty and the uncertainty conveyed by internal semantic representations, which provide complementary views on model reliability. We present HaluNet, a lightweight and trainable neural framework that integrates multi-granular token-level uncertainties by combining semantic embeddings with probabilistic confidence and distributional uncertainty. Its multi-branch architecture adaptively fuses what the model knows with the uncertainty expressed in its outputs, enabling efficient one-pass hallucination detection. Experiments on SQuAD, TriviaQA, and Natural Questions show that HaluNet delivers strong detection performance and favorable computational efficiency, with or without access to context, highlighting its potential for real-time hallucination detection in LLM-based QA systems.
[46] Korean Canonical Legal Benchmark: Toward Knowledge-Independent Evaluation of LLMs' Legal Reasoning Capabilities
Hongseok Oh,Wonseok Hwang,Kyoung-Woon On
Main category: cs.CL
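TL;DR: The paper proposes the Korean Canonical Legal Benchmark (KCL) for evaluating language models' legal reasoning independently of domain knowledge. It comprises multiple-choice and open-ended question answering parts and provides supporting precedents and automated evaluation tools.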
Details
Motivation: Existing benchmarks cannot readily separate a model's legal reasoning ability from its parameterized knowledge; a new benchmark that disentangles the two is needed to evaluate legal reasoning more accurately.
Method: The KCL benchmark contains 283 multiple-choice questions and 169 essay questions, each accompanied by relevant precedents. It has two subsets, KCL-MCQA and KCL-Essay, and provides 2,739 fine-grained rubrics for automated evaluation of the essay questions.
Result: A systematic evaluation of more than 30 models shows that current models still have large gaps on KCL-Essay and that reasoning-specialized models outperform general-purpose ones.
Conclusion: KCL effectively evaluates language models' legal reasoning, supports finer-grained disentangled analysis of capabilities, and advances legal AI.
Abstract: We introduce the Korean Canonical Legal Benchmark (KCL), a benchmark designed to assess language models' legal reasoning capabilities independently of domain-specific knowledge. KCL provides question-level supporting precedents, enabling a more faithful disentanglement of reasoning ability from parameterized knowledge. KCL consists of two components: (1) KCL-MCQA, multiple-choice problems of 283 questions with 1,103 aligned precedents, and (2) KCL-Essay, open-ended generation problems of 169 questions with 550 aligned precedents and 2,739 instance-level rubrics for automated evaluation. Our systematic evaluation of 30+ models shows large remaining gaps, particularly in KCL-Essay, and that reasoning-specialized models consistently outperform their general-purpose counterparts. We release all resources, including the benchmark dataset and evaluation code, at https://github.com/lbox-kr/kcl.
[47] Understanding and Steering the Cognitive Behaviors of Reasoning Models at Test-Time
Zhenyu Zhang,Xiaoxia Wu,Zhongzhu Zhou,Qingyang Wu,Yineng Zhang,Pragaash Ponnusamy,Harikaran Subbaraj,Jue Wang,Shuaiwen Leon Song,Ben Athiwaratkun
Main category: cs.CL
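TL;DR: This paper proposes CREST, a training-free method that steers the cognitive reasoning process of large language models by intervening on specific attention heads at inference time, improving both reasoning accuracy and efficiency.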
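The inference-time intervention in this entry suppresses the component of a hidden representation along a steering vector. A minimal sketch of that idea follows, with a loud caveat: the abstract describes a rotation of hidden representations, whereas this sketch uses a plain orthogonal projection (which does not preserve the vector norm) as a simplified stand-in. The function name and the `strength` parameter are hypothetical, and real hidden states would be tensors rather than Python lists.

```python
def suppress_along(hidden, steering, strength=1.0):
    """Remove (a fraction of) the component of `hidden` along `steering`.

    Simplifying assumption: orthogonal projection instead of the rotation
    described for CREST. `strength=1.0` removes the component entirely.
    """
    norm_sq = sum(s * s for s in steering)
    if norm_sq == 0.0:
        # A zero steering vector defines no direction; leave the state as is.
        return list(hidden)
    coeff = sum(h * s for h, s in zip(hidden, steering)) / norm_sq
    return [h - strength * coeff * s for h, s in zip(hidden, steering)]


# Toy 2-D example: the steering direction is the first axis.
steered = suppress_along([3.0, 4.0], [1.0, 0.0])
residual = sum(h * s for h, s in zip(steered, [1.0, 0.0]))
```

After the call, the steered vector has no component left along the steering direction, while the orthogonal part of the hidden state is unchanged.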
Details
Motivation: Large language models rely on long chain-of-thought reasoning for complex tasks, but this is often inefficient, high-latency, or unstable, showing shallow, inconsistent steps (underthinking) or repetitive, verbose reasoning (overthinking).
Method: The study finds that certain attention heads correlate with cognitive behaviors such as verification and backtracking. CREST has two parts: (1) an offline calibration step that identifies these cognitive heads and derives head-specific steering vectors; (2) an inference-time procedure that rotates hidden representations to suppress the components along those vectors, thereby suppressing inefficient reasoning behaviors.
Result: Across multiple reasoning benchmarks and models, CREST improves accuracy by up to 17.5% and reduces token usage by 37.6%.
Conclusion: CREST is a simple and effective training-free way to speed up and stabilize LLM reasoning, achieving higher accuracy at lower computational cost.
Abstract: Large Language Models (LLMs) often rely on long chain-of-thought (CoT) reasoning to solve complex tasks. While effective, these trajectories are frequently inefficient, leading to high latency from excessive token generation, or unstable reasoning that alternates between underthinking (shallow, inconsistent steps) and overthinking (repetitive, verbose reasoning). In this work, we study the structure of reasoning trajectories and uncover specialized attention heads that correlate with distinct cognitive behaviors such as verification and backtracking. By lightly intervening on these heads at inference time, we can steer the model away from inefficient modes. Building on this insight, we propose CREST, a training-free method for Cognitive REasoning Steering at Test-time. CREST has two components: (1) an offline calibration step that identifies cognitive heads and derives head-specific steering vectors, and (2) an inference-time procedure that rotates hidden representations to suppress components along those vectors. CREST adaptively suppresses unproductive reasoning behaviors, yielding both higher accuracy and lower computational cost. Across diverse reasoning benchmarks and models, CREST improves accuracy by up to 17.5% while reducing token usage by 37.6%, offering a simple and effective pathway to faster, more reliable LLM reasoning.
[48] Youtu-LLM: Unlocking the Native Agentic Potential for Lightweight Large Language Models
Junru Lu,Jiarui Qin,Lingfeng Qiao,Yinghui Li,Xinyi Dai,Bo Ke,Jianfeng He,Ruizhi Qiao,Di Yin,Xing Sun,Yunsheng Wu,Yinsong Liu,Shuangyin Liu,Mingkong Tang,Haodong Lin,Jiayi Kuang,Fanxu Meng,Xiaojuan Tang,Yunjia Xi,Junjie Huang,Haotong Yang,Zhenyi Shen,Yangning Li,Qianwen Zhang,Yifei Yu,Siyu An,Junnan Dong,Qiufeng Wang,Jie Wang,Keyu Chen,Wei Wen,Taian Guo,Zhifeng Shen,Daohai Yu,Jiahao Li,Ke Li,Zongyi Li,Xiaoyu Tan
Main category: cs.CL
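TL;DR: Youtu-LLM is a 1.96B lightweight language model trained from scratch. With a compact architecture, a multi-stage curriculum, and agentic mid-training, it achieves strong reasoning and planning capabilities while remaining computationally efficient, and performs especially well on long-context and agentic tasks.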
Details
Motivation: Small language models usually depend on knowledge distillation and struggle to acquire deep reasoning and autonomous agentic capabilities, so a new approach that systematically cultivates cognitive abilities from the pre-training stage is needed.
Method: A dense MLA architecture and a STEM-oriented vocabulary support a 128k context window; a progressive "Commonsense-STEM-Agent" curriculum drives multi-stage pre-training on about 11T tokens; and mid-training adds diverse trajectory data from math, coding, and tool use to strengthen agentic capabilities.
Result: Youtu-LLM is competitive with larger models on general benchmarks and significantly surpasses existing SOTA baselines on agent-specific tasks, setting a new standard among sub-2B models and showing that lightweight models can have strong intrinsic agentic intelligence.
Conclusion: Lightweight language models can develop native agentic intelligence without relying on distillation or scaling up parameters; the key is systematic architecture design and training strategy.
Abstract: We introduce Youtu-LLM, a lightweight yet powerful language model that harmonizes high computational efficiency with native agentic intelligence. Unlike typical small models that rely on distillation, Youtu-LLM (1.96B) is pre-trained from scratch to systematically cultivate reasoning and planning capabilities. The key technical advancements are as follows: (1) Compact Architecture with Long-Context Support: Built on a dense Multi-Latent Attention (MLA) architecture with a novel STEM-oriented vocabulary, Youtu-LLM supports a 128k context window. This design enables robust long-context reasoning and state tracking within a minimal memory footprint, making it ideal for long-horizon agent and reasoning tasks. (2) Principled "Commonsense-STEM-Agent" Curriculum: We curated a massive corpus of approximately 11T tokens and implemented a multi-stage training strategy. By progressively shifting the pre-training data distribution from general commonsense to complex STEM and agentic tasks, we ensure the model acquires deep cognitive abilities rather than superficial alignment. (3) Scalable Agentic Mid-training: Specifically for the agentic mid-training, we employ diverse data construction schemes to synthesize rich and varied trajectories across math, coding, and tool-use domains. This high-quality data enables the model to internalize planning and reflection behaviors effectively. Extensive evaluations show that Youtu-LLM sets a new state-of-the-art for sub-2B LLMs. On general benchmarks, it achieves competitive performance against larger models, while on agent-specific tasks, it significantly surpasses existing SOTA baselines, demonstrating that lightweight models can possess strong intrinsic agentic capabilities.
[49] Do Large Language Models Know What They Are Capable Of?
Casey O. Barkan,Sid Black,Oliver Sourbut
Main category: cs.CL
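TL;DR: This study examines whether large language models (LLMs) can predict their own performance on tasks, and how that judgment changes during multi-step tasks and after in-context failures. LLMs are broadly overconfident: they have some ability to predict success, but overconfidence worsens as tasks progress, leading to poor decisions. Some models recalibrate their confidence after failures and decide better, but overall they remain limited by poor awareness of their own capabilities.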
Details
Motivation: Whether LLMs understand their own capabilities matters for their reliability and safety in complex, high-stakes tasks. As AI agents become widespread, assessing the accuracy of their self-judgment helps reduce misuse and loss-of-control risks.
Method: Multiple LLMs are tested on their ability to predict their own success on single-step and multi-step tasks, and the agreement between confidence and actual performance is analyzed. In-context experiences of failure are introduced to see whether models adjust their behavior and decision strategy.
Result: All tested LLMs are overconfident but predict success better than chance; newer and larger models do not show markedly greater discriminatory power (except Claude); on multi-step tasks the overconfidence of several frontier models worsens; some models reduce their confidence and decide better after failures, but reasoning models do no better than non-reasoning ones; all models' decisions are approximately rational given their self-estimated probabilities, but optimistic estimates make overall decisions poor.
Conclusion: Current LLM agents have a clear deficit in awareness of their own capabilities. Overconfidence hinders effective decisions in complex or costly scenarios, and improving metacognition is essential for safe and reliable AI systems.
Abstract: We investigate whether large language models (LLMs) can predict whether they will succeed on a given task and whether their predictions improve as they progress through multi-step tasks. We also investigate whether LLMs can learn from in-context experiences to make better decisions about whether to pursue a task in scenarios where failure is costly. All LLMs we tested are overconfident, but most predict their success with better-than-random discriminatory power. We find that newer and larger LLMs generally do not have greater discriminatory power, though Claude models do show such a trend. On multi-step agentic tasks, the overconfidence of several frontier LLMs worsens as they progress through the tasks, and reasoning LLMs perform comparably to or worse than non-reasoning LLMs. With in-context experiences of failure, some but not all LLMs reduce their overconfidence, leading to significantly improved decision making. Interestingly, all LLMs' decisions are approximately rational given their estimated probabilities of success, yet their overly-optimistic estimates result in poor decision making. These results suggest that current LLM agents are hindered by their lack of awareness of their own capabilities. We discuss the implications of LLMs' awareness of their capabilities for AI misuse and misalignment risks.
[50] R-Debater: Retrieval-Augmented Debate Generation through Argumentative Memory
Maoyuan Li,Zhongsheng Wang,Haoyuan Li,Jiamou Liu
Main category: cs.CL
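TL;DR: R-Debater is a multi-turn debate generation framework built on argumentative memory that combines retrieval augmentation with role-based agents. It outperforms strong baselines on single-turn and multi-turn debate tasks, and automatic and human evaluations confirm its consistency, evidence use, and coherence.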
Details
Motivation: Existing debate systems fall short in maintaining stance consistency and coherence of arguments across turns, and lack effective modeling of rhetoric and memory mechanisms.
Method: The R-Debater framework integrates a debate knowledge base for retrieving case-like evidence and prior debate moves with a role-based agent that composes coherent utterances across turns, and is evaluated on the ORCHID debate dataset.
Result: On standardized ORCHID debates, a 1,000-item retrieval corpus and a held-out set of 32 debates across seven domains are constructed. On next-utterance generation and adversarial multi-turn simulation, R-Debater scores higher than strong LLM baselines on InspireScore and Debatrix.
Conclusion: Combining retrieval grounding with structured planning improves the faithfulness, stance consistency, and cross-turn coherence of generated debates, confirming the key role of argumentative memory in debate systems.
Abstract: We present R-Debater, an agentic framework for generating multi-turn debates built on argumentative memory. Grounded in rhetoric and memory studies, the system views debate as a process of recalling and adapting prior arguments to maintain stance consistency, respond to opponents, and support claims with evidence. Specifically, R-Debater integrates a debate knowledge base for retrieving case-like evidence and prior debate moves with a role-based agent that composes coherent utterances across turns. We evaluate on standardized ORCHID debates, constructing a 1,000-item retrieval corpus and a held-out set of 32 debates across seven domains. Two tasks are evaluated: next-utterance generation, assessed by InspireScore (subjective, logical, and factual), and adversarial multi-turn simulations, judged by Debatrix (argument, source, language, and overall). Compared with strong LLM baselines, R-Debater achieves higher single-turn and multi-turn scores. Human evaluation with 20 experienced debaters further confirms its consistency and evidence use, showing that combining retrieval grounding with structured planning yields more faithful, stance-aligned, and coherent debates across turns.
[51] MUSIC: MUlti-Step Instruction Contrast for Multi-Turn Reward Models
Wenzhe Li,Shujian Zhang,Wenxuan Zhou,John Lambert,Chi Jin,Andrew Hard,Rajiv Mathews,Lun Wang
Main category: cs.CL
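TL;DR: This paper proposes MUSIC, an unsupervised data augmentation strategy that improves multi-turn reward models (RMs) by constructing contrasts spanning multiple turns of a conversation. It substantially raises agreement with advanced LLM judges while preserving single-turn performance.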
Details
Motivation: Existing preference datasets usually contrast responses only on the final turn of a conversation, which fails to capture the complexity of multi-turn interaction and leads to poor automated multi-turn evaluation. A method that better models multi-turn conversation quality is therefore needed.
Method: Proposes MUlti-Step Instruction Contrast (MUSIC), an unsupervised data augmentation method that synthesizes contrastive samples spanning multiple conversational turns, and applies it to the Skywork preference dataset to train a multi-turn reward model based on Gemma-2-9B-Instruct.
Result: Experiments show that the RM trained with MUSIC augmentation aligns more closely than baselines with the judgments of advanced proprietary LLM judges on multi-turn conversations, with no performance drop on standard single-turn RM benchmarks.
Conclusion: Contrastive signals spanning multiple turns are critical for building robust multi-turn reward models, and MUSIC offers an effective path to efficient, scalable multi-turn conversation evaluation.
Abstract: Evaluating the quality of multi-turn conversations is crucial for developing capable Large Language Models (LLMs), yet remains a significant challenge, often requiring costly human evaluation. Multi-turn reward models (RMs) offer a scalable alternative and can provide valuable signals for guiding LLM training. While recent work has advanced multi-turn training techniques, effective automated evaluation specifically for multi-turn interactions lags behind. We observe that standard preference datasets, typically contrasting responses based only on the final conversational turn, provide insufficient signal to capture the nuances of multi-turn interactions. Instead, we find that incorporating contrasts spanning multiple turns is critical for building robust multi-turn RMs. Motivated by this finding, we propose MUlti-Step Instruction Contrast (MUSIC), an unsupervised data augmentation strategy that synthesizes contrastive conversation pairs exhibiting differences across multiple turns. Leveraging MUSIC on the Skywork preference dataset, we train a multi-turn RM based on the Gemma-2-9B-Instruct model. Empirical results demonstrate that our MUSIC-augmented RM outperforms baseline methods, achieving higher alignment with judgments from advanced proprietary LLM judges on multi-turn conversations, crucially, without compromising performance on standard single-turn RM benchmarks.
[52] BIOME-Bench: A Benchmark for Biomolecular Interaction Inference and Multi-Omics Pathway Mechanism Elucidation from Scientific Literature
Sibo Wei,Peng Chen,Lifeng Dong,Yin Luo,Lei Wang,Peng Zhang,Wenpeng Lu,Jianbin Guo,Hongjun Yang,Dajun Zeng
Main category: cs.CL
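TL;DR: This paper introduces BIOME-Bench, a standardized benchmark for evaluating large language models on multi-omics pathway mechanism elucidation, and reveals the shortcomings of current models in biomolecular relation inference and pathway mechanism explanation.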
Details
Motivation: Existing pathway enrichment methods are limited by lagging database curation, functional redundancy, and insufficient sensitivity to molecular states, and no standardized benchmark exists to systematically evaluate LLMs on multi-omics analysis.
Method: Builds BIOME-Bench through a four-stage workflow. The benchmark comprises two tasks, Biomolecular Interaction Inference and end-to-end Multi-Omics Pathway Mechanism Elucidation, each with its own evaluation protocol.
Result: Experiments on several mainstream LLMs show that current models still have substantial deficiencies in identifying fine-grained biomolecular relations and in generating accurate, robust pathway mechanism explanations.
Conclusion: LLMs and their evaluation frameworks need further improvement to become reliable and practical for the biological interpretation of multi-omics data.
Abstract: Multi-omics studies often rely on pathway enrichment to interpret heterogeneous molecular changes, but pathway enrichment (PE)-based workflows inherit structural limitations of pathway resources, including curation lag, functional redundancy, and limited sensitivity to molecular states and interventions. Although recent work has explored using large language models (LLMs) to improve PE-based interpretation, the lack of a standardized benchmark for end-to-end multi-omics pathway mechanism elucidation has largely confined evaluation to small, manually curated datasets or ad hoc case studies, hindering reproducible progress. To address this issue, we introduce BIOME-Bench, constructed via a rigorous four-stage workflow, to evaluate two core capabilities of LLMs in multi-omics analysis: Biomolecular Interaction Inference and end-to-end Multi-Omics Pathway Mechanism Elucidation. We develop evaluation protocols for both tasks and conduct comprehensive experiments across multiple strong contemporary models. Experimental results demonstrate that existing models still exhibit substantial deficiencies in multi-omics analysis, struggling to reliably distinguish fine-grained biomolecular relation types and to generate faithful, robust pathway-level mechanistic explanations.
[53] Uncertainty-aware Semi-supervised Ensemble Teacher Framework for Multilingual Depression Detection
Mohammad Zia Ur Rehman,Velpuru Navya,Sanskar,Shuja Uddin Qureshi,Nagendra Kumar
Main category: cs.CL
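TL;DR: Proposes Semi-SMDNet, a semi-supervised multilingual depression detection network that combines teacher-student models, ensemble learning, and data augmentation to improve depression detection in low-resource languages.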
Details
Motivation: Detecting depression from social media text remains challenging because of differing language styles, informal expression, and the lack of annotated data in many languages.
Method: Uses a teacher-student pseudo-labelling framework that combines the predictions of multiple teacher models through soft voting and filters out low-confidence pseudo-labels with an uncertainty-based threshold; a confidence-weighted training strategy and data augmentation improve cross-lingual robustness.
Result: Outperforms strong baselines on Arabic, Bangla, English, and Spanish datasets and significantly narrows the performance gap between high-resource and low-resource settings.
Conclusion: The framework suits scalable cross-lingual mental health monitoring where labelled resources are limited, and it generalizes well across settings.
Abstract: Detecting depression from social media text is still a challenging task. This is due to different language styles, informal expression, and the lack of annotated data in many languages. To tackle these issues, we propose Semi-SMDNet, a strong Semi-Supervised Multilingual Depression detection Network. It combines teacher-student pseudo-labelling, ensemble learning, and augmentation of data. Our framework uses a group of teacher models. Their predictions come together through soft voting. An uncertainty-based threshold filters out low-confidence pseudo-labels to reduce noise and improve learning stability. We also use a confidence-weighted training method that focuses on reliable pseudo-labelled samples. This greatly boosts robustness across languages. Tests on Arabic, Bangla, English, and Spanish datasets show that our approach consistently beats strong baselines. It significantly reduces the performance gap between settings that have plenty of resources and those that do not. Detailed experiments and studies confirm that our framework is effective and can be used in various situations. This shows that it is suitable for scalable, cross-language mental health monitoring where labelled resources are limited.
[54] Compute-Accuracy Pareto Frontiers for Open-Source Reasoning Large Language Models
Ákos Prucs,Márton Csutora,Mátyás Antal,Márk Marosi
Main category: cs.CL
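TL;DR: This paper studies the trade-off between performance and compute cost for large language models on complex reasoning tasks. It finds that the Mixture of Experts architecture balances performance and efficiency well, and that inference-time compute has a saturation point beyond which accuracy gains are limited.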
Details
Motivation: Current research overlooks the heavy computational cost of generating long reasoning chains, whereas industrial applications must balance accuracy against inference cost, so models need compute-aware evaluation.
Method: Conducts a test-time-compute-aware evaluation of both recent and older open-source LLMs, maps their Pareto frontiers on math- and reasoning-intensive benchmarks, and analyzes how efficiency evolves over time.
Result: The Mixture of Experts architecture performs well on both performance and efficiency; inference compute has a saturation point beyond which accuracy gains drop off markedly.
Conclusion: Extended reasoning helps performance but cannot overcome a model's intrinsic capability limits; sound architectural design (such as MoE) is more effective than simply adding inference compute.
Abstract: Large Language Models (LLMs) are demonstrating rapid improvements on complex reasoning benchmarks, particularly when allowed to utilize intermediate reasoning steps before converging on a final solution. However, current literature often overlooks the significant computational burden associated with generating long reasoning sequences. For industrial applications, model selection depends not only on raw accuracy but also on resource constraints and inference costs. In this work, we conduct a test-time-compute aware evaluation of both contemporary and older open-source LLMs, mapping their Pareto frontiers across math- and reasoning-intensive benchmarks. Our findings identify the Mixture of Experts (MoE) architecture as a strong candidate to balance performance and efficiency in our evaluation setting. Furthermore, we trace the trajectory of Pareto efficiency over time to derive an emergent trend regarding accuracy gain per unit of compute. Finally, we demonstrate that there is a saturation point for inference-time compute. Beyond a certain threshold, accuracy gains diminish, indicating that while extended reasoning capabilities are beneficial, they cannot overcome intrinsic model limitations regarding specific complexities.
[55] Practising responsibility: Ethics in NLP as a hands-on course
Malvina Nissim,Viviana Patti,Beatrice Savoldi
Main category: cs.CL
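TL;DR: This paper presents a course on ethical issues in natural language processing and its active-learning-based teaching approach. Four years of practice and refinement across institutions and educational settings have produced many reusable teaching resources and student-made products.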
Details
Motivation: As natural language processing (NLP) systems become more pervasive, integrating ethical considerations into NLP education has become essential; however, curriculum development faces inherent challenges because the field evolves rapidly and critical thinking must be fostered beyond traditional technical training.
Method: Uses an active-learning teaching approach with interactive sessions, hands-on activities, and "learning by teaching", refined over four years across different institutions, educational levels, and interdisciplinary backgrounds.
Result: The course adapted successfully to varied teaching settings and yielded many reusable teaching materials and educational products for diverse audiences, all made by the students themselves.
Conclusion: By sharing the course design and experience, the authors aim to offer guidance and inspiration to educators who want to bring social impact considerations into their teaching.
Abstract: As Natural Language Processing (NLP) systems become more pervasive, integrating ethical considerations into NLP education has become essential. However, this presents inherent challenges in curriculum development: the field's rapid evolution from both academia and industry, and the need to foster critical thinking beyond traditional technical training. We introduce our course on Ethical Aspects in NLP and our pedagogical approach, grounded in active learning through interactive sessions, hands-on activities, and "learning by teaching" methods. Over four years, the course has been refined and adapted across different institutions, educational levels, and interdisciplinary backgrounds; it has also yielded many reusable products, both in the form of teaching materials and in the form of actual educational products aimed at diverse audiences, made by the students themselves. By sharing our approach and experience, we hope to provide inspiration for educators seeking to incorporate social impact considerations into their curricula.
[56] Triangulation as an Acceptance Rule for Multilingual Mechanistic Interpretability
Yanan Long
Main category: cs.CL
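TL;DR: This paper proposes "triangulation", a causal standard for validating mechanistic explanations of multilingual models. It requires necessity, sufficiency, and invariance to hold under interventions across languages and environments, which filters out spurious circuits.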
Details
Motivation: Multilingual models behave inconsistently across languages and cultures, and existing explanations lack a causal validation standard, so a method is needed to test mechanistic hypotheses across environments.
Method: Introduces "reference families" as meaning-preserving variants, proposes the "triangulation" acceptance rule, and combines automatic circuit discovery with interventional experiments (such as ablation and activation patching), with validation across multiple models, language pairs, and tasks.
Result: Triangulation identifies spurious circuits that appear valid in a single environment but fail across languages, making the understanding of multilingual models' internal mechanisms more reliable.
Conclusion: Grounded in causal abstraction, triangulation provides a falsifiable standard for mechanistic explanations of multilingual models that is robust across languages.
Abstract: Multilingual language models achieve strong aggregate performance yet often behave unpredictably across languages, scripts, and cultures. We argue that mechanistic explanations for such models should satisfy a causal standard: claims must survive causal interventions and must cross-reference across environments that perturb surface form while preserving meaning. We formalize reference families as predicate-preserving variants and introduce triangulation, an acceptance rule requiring necessity (ablating the circuit degrades the target behavior), sufficiency (patching activations transfers the behavior), and invariance (both effects remain directionally stable and of sufficient magnitude across the reference family). To supply candidate subgraphs, we adopt automatic circuit discovery and accept or reject those candidates by triangulation. We ground triangulation in causal abstraction by casting it as an approximate transformation score over a distribution of interchange interventions, connect it to the pragmatic interpretability agenda, and present a comparative experimental protocol across multiple model families, language pairs, and tasks. Triangulation provides a falsifiable standard for mechanistic claims that filters spurious circuits passing single-environment tests but failing cross-lingual invariance.
[57] PrivacyBench: A Conversational Benchmark for Evaluating Privacy in Personalized AI
Srija Mukhopadhyay,Sathwik Reddy,Shruthi Muthukumar,Jisun An,Ponnurangam Kumaraguru
Main category: cs.CL
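TL;DR: PrivacyBench is a socially grounded benchmark for evaluating how well personalized AI agents protect user privacy in multi-turn conversations; it reveals serious information-leakage risks in existing systems.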
Details
Motivation: Personalized AI needs access to a user's digital footprint, but a lack of social-context awareness can lead to leaks of sensitive information, threatening user privacy and digital well-being.
Method: Proposes the PrivacyBench benchmark, which comprises socially grounded datasets with embedded secrets and a multi-turn conversational evaluation, and tests secret leakage by RAG assistants under different conditions.
Result: RAG systems leak secrets in up to 26.56% of interactions; a privacy-aware prompt lowers leakage to 5.12%, but the retrieval mechanism still accesses sensitive data indiscriminately, making the generator a single point of failure for privacy protection.
Conclusion: Current architectures cannot guarantee privacy; structural, privacy-by-design safeguards are urgently needed to protect user privacy under wide-scale deployment.
Abstract: Personalized AI agents rely on access to a user's digital footprint, which often includes sensitive data from private emails, chats and purchase histories. Yet this access creates a fundamental societal and privacy risk: systems lacking social-context awareness can unintentionally expose user secrets, threatening digital well-being. We introduce PrivacyBench, a benchmark with socially grounded datasets containing embedded secrets and a multi-turn conversational evaluation to measure secret preservation. Testing Retrieval-Augmented Generation (RAG) assistants reveals that they leak secrets in up to 26.56% of interactions. A privacy-aware prompt lowers leakage to 5.12%, yet this measure offers only partial mitigation. The retrieval mechanism continues to access sensitive data indiscriminately, which shifts the entire burden of privacy preservation onto the generator. This creates a single point of failure, rendering current architectures unsafe for wide-scale deployment. Our findings underscore the urgent need for structural, privacy-by-design safeguards to ensure an ethical and inclusive web for everyone.
[58] Big AI is accelerating the metacrisis: What can we do?
Steven Bird
Main category: cs.CL
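TL;DR: This paper discusses the metacrisis formed by converging ecological, meaning, and language crises, argues that Big AI is making them worse, and calls on language engineers to rethink the future of NLP around human flourishing and life on the planet.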
Details
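Motivation: Facing a metacrisis in which ecological, meaning, and language crises intertwine, and which Big AI is accelerating, the author aims to prompt reflection on the current path of technological development.
Method: Critically analyzes current trends in AI and NLP and their social impact, and argues that new alternatives need to be explored.
Result: Highlights the central role language engineers play in the crisis, criticizes technology development that treats itself as value-free, and advocates applying collective intelligence to design an NLP future that better serves people and the planet.
Conclusion: The field must move beyond a narrative that pursues scalability alone and toward life-affirming NLP design, so that technology serves human well-being and ecological sustainability.
Abstract: The world is in the grip of ecological, meaning, and language crises which are converging into a metacrisis. Big AI is accelerating them all. Language engineers are playing a central role, persisting with a scalability story that is failing humanity, supplying critical talent to plutocrats and kleptocrats, and creating new technologies as if the whole endeavour was value-free. We urgently need to explore alternatives, applying our collective intelligence to design a life-affirming future for NLP that is centered on human flourishing on a living planet.
[59] Encyclo-K: Evaluating LLMs with Dynamically Composed Knowledge Statements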
Yiming Liang,Yizhi Li,Yantao Du,Ge Zhang,Jiayi Zhou,Yuchen Wu,Yinzhu Piao,Denghui Cao,Tong Sun,Ziniu Li,Li Du,Bo Lei,Jiaheng Liu,Chenghua Lin,Zhaoxiang Zhang,Wenhao Huang,Jiajun Zhang
Main category: cs.CL
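TL;DR: Encyclo-K is a new statement-based benchmark for evaluating large language models. It extracts knowledge statements from textbooks and composes questions dynamically at test time, which addresses data contamination, single-knowledge-point assessment, and high annotation cost.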
Details
Motivation: Existing benchmarks are vulnerable to data contamination, limited to single-knowledge-point assessment, and dependent on costly expert annotation, which makes it hard to evaluate a model's comprehensive understanding.
Method: Extracts standalone knowledge statements from authoritative textbooks and composes them dynamically into questions through random sampling at test time; annotators only need to verify formatting compliance, which lowers annotation cost.
Result: Experiments on more than 50 LLMs show that even the top-performing GPT-5.1 reaches only 62.07% accuracy, and model performance shows a clear gradient, confirming the difficulty of dynamic evaluation and multi-statement comprehension.
Conclusion: Encyclo-K offers a scalable framework for evaluating LLMs' comprehensive understanding over multiple fine-grained disciplinary knowledge statements.
Abstract: Benchmarks play a crucial role in tracking the rapid advancement of large language models (LLMs) and identifying their capability boundaries. However, existing benchmarks predominantly curate questions at the question level, suffering from three fundamental limitations: vulnerability to data contamination, restriction to single-knowledge-point assessment, and reliance on costly domain expert annotation. We propose Encyclo-K, a statement-based benchmark that rethinks benchmark construction from the ground up. Our key insight is that knowledge statements, not questions, can serve as the unit of curation, and questions can then be constructed from them. We extract standalone knowledge statements from authoritative textbooks and dynamically compose them into evaluation questions through random sampling at test time. This design directly addresses all three limitations: the combinatorial space is too vast to memorize, and model rankings remain stable across dynamically generated question sets, enabling reliable periodic dataset refresh; each question aggregates 8-10 statements for comprehensive multi-knowledge assessment; annotators only verify formatting compliance without requiring domain expertise, substantially reducing annotation costs. Experiments on over 50 LLMs demonstrate that Encyclo-K poses substantial challenges with strong discriminative power. Even the top-performing OpenAI-GPT-5.1 achieves only 62.07% accuracy, and model performance displays a clear gradient distribution: reasoning models span from 16.04% to 62.07%, while chat models range from 9.71% to 50.40%. These results validate the challenges introduced by dynamic evaluation and multi-statement comprehensive understanding.
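The test-time composition step described in the Encyclo-K abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Statement` type, the `is_true` label, the "which statements are correct" question format, and the placeholder pool are all assumptions made for illustration; only the random sampling of 8-10 statements per question comes from the abstract.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    text: str
    is_true: bool  # assumed label; the abstract does not describe the labeling scheme


def compose_question(pool, rng, min_k=8, max_k=10):
    """Sample between min_k and max_k statements and build one question.

    Returns the prompt text and the 1-based indices of the true statements.
    """
    k = rng.randint(min_k, max_k)  # inclusive on both ends
    if k > len(pool):
        raise ValueError("statement pool is smaller than the requested sample size")
    sampled = rng.sample(pool, k)  # sampling without replacement
    lines = ["Which of the following statements are correct?"]
    lines += [f"({i + 1}) {s.text}" for i, s in enumerate(sampled)]
    answer = [i + 1 for i, s in enumerate(sampled) if s.is_true]
    return "\n".join(lines), answer


def make_placeholder_pool(n=40):
    """Hypothetical pool; a real pool would hold textbook-derived statements."""
    return [Statement(f"Placeholder statement {i}", is_true=(i % 3 != 0)) for i in range(n)]


if __name__ == "__main__":
    prompt, answer = compose_question(make_placeholder_pool(), random.Random(0))
    print(prompt)
    print("Correct statements:", answer)
```

Because each call draws a fresh subset, re-running with a different seed produces a different question set from the same pool, which is the property the abstract relies on for periodic refresh.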
These findings establish Encyclo-K as a scalable framework for dynamic evaluation of LLMs' comprehensive understanding over multiple fine-grained disciplinary knowledge statements.
[60] mHC: Manifold-Constrained Hyper-Connections
Zhenda Xie,Yixuan Wei,Huanqi Cao,Chenggang Zhao,Chengqi Deng,Jiashi Li,Damai Dai,Huazuo Gao,Jiang Chang,Liang Zhao,Shangyan Zhou,Zhean Xu,Zhengyan Zhang,Wangding Zeng,Shengding Hu,Yuqing Wang,Jingyang Yuan,Lean Wang,Wenfeng Liang
Main category: cs.CL
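TL;DR: This paper proposes Manifold-Constrained Hyper-Connections (mHC), a general framework that restores the identity mapping property in Hyper-Connections while remaining efficient, addressing training instability and restricted scalability.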
Details
Motivation: Existing Hyper-Connections (HC) methods improve performance but break the identity mapping property of residual connections, which causes training instability, poor scalability, and added memory access overhead.
Method: Proposes the mHC framework, which projects the residual connection space of HC onto a specific manifold to restore the identity mapping property, combined with rigorous infrastructure optimization for efficiency.
Result: Experiments show that mHC is effective for training at scale, with better performance and scalability.
Conclusion: As a flexible and practical extension of HC, mHC contributes to a deeper understanding of topological architecture design and suggests new directions for foundation models.
Abstract: Recently, studies exemplified by Hyper-Connections (HC) have extended the ubiquitous residual connection paradigm established over the past decade by expanding the residual stream width and diversifying connectivity patterns. While yielding substantial performance gains, this diversification fundamentally compromises the identity mapping property intrinsic to the residual connection, which causes severe training instability and restricted scalability, and additionally incurs notable memory access overhead. To address these challenges, we propose Manifold-Constrained Hyper-Connections (mHC), a general framework that projects the residual connection space of HC onto a specific manifold to restore the identity mapping property, while incorporating rigorous infrastructure optimization to ensure efficiency. Empirical experiments demonstrate that mHC is effective for training at scale, offering tangible performance improvements and superior scalability. We anticipate that mHC, as a flexible and practical extension of HC, will contribute to a deeper understanding of topological architecture design and suggest promising directions for the evolution of foundational models.
[61] BEDA: Belief Estimation as Probabilistic Constraints for Performing Strategic Dialogue Acts
Hengli Li,Zhaoxin Yu,Qi Shen,Chenxi Li,Mengmeng Wang,Tinglang Wu,Yipeng Kang,Yuxuan Wang,Song-Chun Zhu,Zixia Jia,Zilong Zheng
Main category: cs.CL
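TL;DR: This paper proposes the BEDA framework, which uses belief estimation as probabilistic constraints during generation, formalizes two core dialogue acts (Adversarial and Alignment), and significantly outperforms baselines on several tasks.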
Details
Motivation: Prior work can estimate a dialogue agent's beliefs accurately but lacks a principled mechanism for using those beliefs during generation.
Method: Proposes the BEDA framework, consisting of a world set, a belief estimator, and a conditional generator; formalizes the Adversarial and Alignment dialogue acts and guides generation through probabilistic constraints.
Result: Across the CKBG, MF, and CaSiNo settings, BEDA outperforms strong baselines: on CKBG it gains up to 20.6 points (with GPT-4.1-nano), on MF it gains 9.3 points on average, and on CaSiNo it reaches the optimal deal relative to all baselines.
Conclusion: Casting belief estimation as generation constraints is a simple, general approach that improves the reliability of strategic dialogue systems.
Abstract: Strategic dialogue requires agents to execute distinct dialogue acts, for which belief estimation is essential. While prior work often estimates beliefs accurately, it lacks a principled mechanism to use those beliefs during generation. We bridge this gap by first formalizing two core acts, Adversarial and Alignment, and by operationalizing them via probabilistic constraints on what an agent may generate. We instantiate this idea in BEDA, a framework that consists of the world set, the belief estimator for belief estimation, and the conditional generator that selects acts and realizes utterances consistent with the inferred beliefs. Across three settings, Conditional Keeper Burglar (CKBG, adversarial), Mutual Friends (MF, cooperative), and CaSiNo (negotiation), BEDA consistently outperforms strong baselines: on CKBG it improves success rate by at least 5.0 points across backbones and by 20.6 points with GPT-4.1-nano; on Mutual Friends it achieves an average improvement of 9.3 points; and on CaSiNo it achieves the optimal deal relative to all baselines. These results indicate that casting belief estimation as constraints provides a simple, general mechanism for reliable strategic dialogue.
[62] Adaptive Dependency-aware Prompt Optimization Framework for Multi-Step LLM Pipeline
Minjun Zhao,Xinyu Zhang,Shuai Zhang,Deyang Li,Ruifeng Shi
Main category: cs.CL
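TL;DR: Proposes ADOPT, a framework for dependency-aware prompt optimization in multi-step LLM pipelines. By modeling the dependency between each step and the final outcome, it enables precise text-gradient estimation and adaptively allocates optimization resources, clearly outperforming existing methods.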
Details
Motivation: The performance of a multi-step LLM pipeline depends on the prompt at each step, but missing intermediate supervision and inter-step dependencies make joint optimization difficult, and existing end-to-end methods perform poorly.
Method: Proposes the ADOPT framework, which explicitly models the dependency between each LLM step and the final task outcome, decouples textual gradient estimation from gradient updates, uses a Shapley-based mechanism to allocate optimization resources adaptively, and reduces multi-prompt optimization to flexible single-prompt optimization.
Result: Experiments on real-world datasets and diverse pipeline structures show that ADOPT is effective and robust, consistently outperforming state-of-the-art prompt optimization baselines.
Conclusion: Through dependency awareness and adaptive resource allocation, ADOPT improves the effectiveness and stability of prompt optimization in multi-step LLM pipelines and offers a new approach to prompt engineering for complex tasks.
Abstract: Multi-step LLM pipelines invoke large language models multiple times in a structured sequence and can effectively solve complex tasks, but their performance heavily depends on the prompts used at each step. Jointly optimizing these prompts is difficult due to missing step-level supervision and inter-step dependencies. Existing end-to-end prompt optimization methods struggle under these conditions and often yield suboptimal or unstable updates. We propose ADOPT, an Adaptive Dependency-aware Prompt Optimization framework for multi-step LLM pipelines. ADOPT explicitly models the dependency between each LLM step and the final task outcome, enabling precise text-gradient estimation analogous to computing analytical derivatives. It decouples textual gradient estimation from gradient updates, reducing multi-prompt optimization to flexible single-prompt optimization steps, and employs a Shapley-based mechanism to adaptively allocate optimization resources. Experiments on real-world datasets and diverse pipeline structures show that ADOPT is effective and robust, consistently outperforming state-of-the-art prompt optimization baselines.
[63] Classifying long legal documents using short random chunks
Luis Adrián Cabrera-Diego
Main category: cs.CL
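TL;DR: This paper proposes a legal document classifier based on DeBERTa V3 and an LSTM. It takes 48 randomly selected short chunks (at most 128 tokens each) as input to handle long texts, and uses Temporal to build a reliable deployment pipeline.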
Details
Motivation: Legal documents typically use specialized vocabulary and are long, so processing them directly with Transformer models can be computationally expensive, slow, or impossible.
Method: Uses a model combining DeBERTa V3 with an LSTM whose input is 48 short chunks (at most 128 tokens each) sampled at random from the document, and builds a durable-execution deployment pipeline with Temporal.
Result: The best model reached a weighted F-score of 0.898, and the pipeline running on CPU had a median processing time of 498 seconds per 100 files.
Conclusion: The method handles long legal document classification while keeping classification performance high, and achieves scalable deployment through chunked input and a reliable workflow.
Abstract: Classifying legal documents is a challenge: besides their specialized vocabulary, they can sometimes be very long. This means that feeding full documents to Transformer-based models for classification might be impossible, expensive or slow. Thus, we present a legal document classifier based on DeBERTa V3 and an LSTM that uses as input a collection of 48 randomly selected short chunks (max 128 tokens). In addition, we present its deployment pipeline using Temporal, a durable execution solution, which allows us to have a reliable and robust processing workflow. The best model had a weighted F-score of 0.898, while the pipeline running on CPU had a median processing time of 498 seconds per 100 files.
[64] MAMA-Memeia! Multi-Aspect Multi-Agent Collaboration for Depressive Symptoms Identification in Memes
Siddhant Agarwal,Adya Dhuler,Polly Ruhnke,Melvin Speisman,Md Shad Akhtar,Shweta Yadav
Main category: cs.CL
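TL;DR: This study introduces the RESTOREx resource and the MAMAMemeia framework, which uses large language models and multi-agent collaboration, with multi-aspect discussion guided by clinical psychology, to identify depressive symptoms in social media memes; it outperforms the previous state of the art by 7.55% in macro-F1.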
Details
Motivation: As memes are increasingly used to express depressive feelings, effective tools are needed to identify the depressive symptoms they convey so that psychological intervention is possible.
Method: Builds RESTOREx, a dataset of LLM-generated, human-annotated explanations, and proposes MAMAMemeia, a multi-agent, multi-aspect discussion framework grounded in Cognitive Analytic Therapy (CAT).
Result: Compared with over 30 methods, MAMAMemeia improves on the previous state of the art by 7.55% in macro-F1 and sets a new benchmark.
Conclusion: A framework that combines clinical psychology methods with multi-agent collaboration identifies depressive symptoms in memes more effectively and offers a new approach to mental health monitoring.
Abstract: Over the past years, memes have evolved from being exclusively a medium of humorous exchanges to one that allows users to express a range of emotions freely and easily. With the ever-growing utilization of memes in expressing depressive sentiments, we conduct a study on identifying depressive symptoms exhibited by memes shared by users of online social media platforms. We introduce RESTOREx as a vital resource for detecting depressive symptoms in memes on social media through Large Language Model (LLM)-generated and human-annotated explanations. We introduce MAMAMemeia, a collaborative multi-agent multi-aspect discussion framework grounded in the clinical psychology method of Cognitive Analytic Therapy (CAT) Competencies. MAMAMemeia improves upon the current state-of-the-art by 7.55% in macro-F1 and is established as the new benchmark compared to over 30 methods.
[65] Modeling Language as a Sequence of Thoughts
Nasim Borazjanizadeh,James McClelland
Main category: cs.CL
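TL;DR: This paper proposes the Thought Gestalt (TG) model, a recurrent Transformer that models language at two levels of abstraction: tokens and sentence-level "thought" states. With shared parameters and a single next-token cross-entropy objective, it improves data efficiency and reduces errors in relational-direction generalization.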
Details
Motivation: Conventional Transformer language models rely mainly on surface co-occurrence statistics and lack globally consistent latent representations of entities and events, which leads to weaknesses in relational direction, contextualization errors, and data inefficiency. Inspired by cognitive-science accounts of human language comprehension, the authors aim to build a model that forms persistent memory representations.
Method: Proposes the Thought Gestalt (TG) model, which combines token generation with recurrent sentence-level "thought" states. The same model parameters generate both token and sentence representations; by retaining the computation graph of sentence representations written to memory, gradients from future token losses flow backward to optimize the parameters that produced earlier sentence vectors.
Result: In scaling experiments, TG is more efficient than matched GPT-2 runs: scaling fits indicate GPT-2 needs about 5-8% more data and 33-42% more parameters to match TG's loss. TG also reduces relational-direction errors on a father-son reversal curse probe.
Conclusion: By introducing persistent sentence-level memory representations, TG improves the data efficiency and relational reasoning of language models without added complexity, supporting the value of borrowing from human cognitive mechanisms in architecture design.
Abstract: Transformer language models can generate strikingly natural text by modeling language as a sequence of tokens. Yet, by relying primarily on surface-level co-occurrence statistics, they fail to form globally consistent latent representations of entities and events, the lack of which contributes to brittleness in relational direction (e.g., reversal curse), contextualization errors, and data inefficiency. On the other hand, cognitive science shows that human comprehension involves converting the input linguistic stream into compact, event-like representations that persist in memory while verbatim form is short-lived. Motivated by this view, we introduce the Thought Gestalt (TG) model, a recurrent Transformer that models language at two levels of abstraction: tokens and sentence-level "thought" states. TG generates the tokens of one sentence at a time while cross-attending to a memory of prior sentence representations. In TG, token and sentence representations are generated using the same set of model parameters and trained with a single objective, the next-token cross-entropy: by retaining the computation graph of sentence representations written to memory, gradients from future token losses flow backward through cross-attention to optimize the parameters generating earlier sentence vectors. In scaling experiments, TG consistently improves efficiency over matched GPT-2 runs, among other baselines, with scaling fits indicating GPT-2 requires ~5-8% more data and ~33-42% more parameters to match TG's loss. TG also reduces errors on relational direction generalization on a father-son reversal curse probe.
[66] AdaGReS: Adaptive Greedy Context Selection via Redundancy-Aware Scoring for Token-Budgeted RAG
Chao Peng,Bin Wang,Zhilei Long,Jinfang Sheng
Main category: cs.CL
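TL;DR: AdaGReS is a redundancy-aware context selection framework for retrieval-augmented generation (RAG). It optimizes a set-level objective that combines relevance with redundancy penalties to improve context quality under a token budget.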
Details
Motivation: Standard top-k retrieval often returns redundant or near-duplicate chunks, which wastes token budget and degrades generation, so a smarter context selection mechanism is needed.
Method: AdaGReS selects chunks greedily under a token-budget constraint using the marginal gains of a set-level objective, and introduces a closed-form, instance-adaptive calibration of the relevance-redundancy trade-off parameter.
Result: Theoretical analysis shows that the objective is ε-approximately submodular under practical embedding similarity conditions, which gives near-optimality guarantees for greedy selection; experiments on open-domain question answering and biomedical text show reduced redundancy and better context and answer quality.
Conclusion: By adaptively balancing relevance and redundancy, AdaGReS achieves more efficient and robust context selection across settings and clearly outperforms conventional retrieval.
Abstract: Retrieval-augmented generation (RAG) is highly sensitive to the quality of selected context, yet standard top-k retrieval often returns redundant or near-duplicate chunks that waste token budget and degrade downstream generation. We present AdaGReS, a redundancy-aware context selection framework for token-budgeted RAG that optimizes a set-level objective combining query-chunk relevance and intra-set redundancy penalties. AdaGReS performs greedy selection under a token-budget constraint using marginal gains derived from the objective, and introduces a closed-form, instance-adaptive calibration of the relevance-redundancy trade-off parameter to eliminate manual tuning and adapt to candidate-pool statistics and budget limits. We further provide a theoretical analysis showing that the proposed objective exhibits epsilon-approximate submodularity under practical embedding similarity conditions, yielding near-optimality guarantees for greedy selection. Experiments on open-domain question answering (Natural Questions) and a high-redundancy biomedical (drug) corpus demonstrate consistent improvements in redundancy control and context quality, translating to better end-to-end answer quality and robustness across settings.
cs.CV [Back]
[67] Leveraging Synthetic Priors for Monocular Depth Estimation in Specular Surgical Environments
Ankan Aich,Yangming Lee
Main category: cs.CV
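TL;DR: This paper proposes a monocular depth estimation method based on the Depth Anything V2 architecture and DV-LORA adaptation. It markedly improves depth accuracy and robustness in highly specular, fluid-filled endoscopic surgical environments and sets a new state of the art on the SCARED dataset.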
Details
Motivation: Existing self-supervised monocular depth estimation methods suffer from boundary collapse on thin instruments and transparent surfaces in surgical scenes, and are unstable under strong specular reflection and fluid interference, mainly because they rely on noisy real-world pseudo-labels and lack an evaluation protocol for extreme lighting.
Method: Uses the high-fidelity synthetic priors of Depth Anything V2 to capture the geometric detail of thin structures, and transfers them to the medical imaging domain with Dynamic Vector Low-Rank Adaptation (DV-LORA), which keeps the parameter budget small while narrowing the synthetic-to-real gap; also designs a physically-stratified evaluation protocol on the SCARED dataset to assess performance in highly specular regions.
Result: On the SCARED dataset the method reaches an accuracy (< 1.25) of 98.1% and reduces Squared Relative Error by over 17%, outperforming established baselines, with stronger robustness in highly specular regions.
Conclusion: By combining synthetic priors with a lightweight adaptation strategy, the method addresses the challenges of monocular depth estimation in surgical environments and offers a more reliable basis for 3D perception in robot-assisted surgery.
Abstract: Accurate Monocular Depth Estimation (MDE) is critical for robotic surgery but remains fragile in specular, fluid-filled endoscopic environments. Existing self-supervised methods, typically relying on foundation models trained with noisy real-world pseudo-labels, often suffer from boundary collapse on thin surgical tools and transparent surfaces. In this work, we address this by leveraging the high-fidelity synthetic priors of the Depth Anything V2 architecture, which inherently captures precise geometric details of thin structures. We efficiently adapt these priors to the medical domain using Dynamic Vector Low-Rank Adaptation (DV-LORA), minimizing the parameter budget while bridging the synthetic-to-real gap. Additionally, we introduce a physically-stratified evaluation protocol on the SCARED dataset to rigorously quantify performance in high-specularity regimes often masked by aggregate metrics. Our approach establishes a new state-of-the-art, achieving an accuracy (< 1.25) of 98.1% and reducing Squared Relative Error by over 17% compared to established baselines, demonstrating superior robustness in adverse surgical lighting.
[68] Video-Based Performance Evaluation for ECR Drills in Synthetic Training Environments
Surya Rayala,Marcos Quinones-Grueiro,Naveeduddin Mohammed,Ashwin T S,Benjamin Goldberg,Randall Spain,Paige Lawton,Gautam Biswas
Main category: cs.CV
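TL;DR: This paper proposes a video-based assessment pipeline that uses computer vision to extract 2D skeletons, gaze vectors, and movement trajectories from urban warfare training videos, builds task-specific metrics, and combines them in an extended Cognitive Task Analysis (CTA) hierarchy to give objective, quantitative assessment of psychomotor fluency, situational awareness, and team coordination.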
Details
Motivation: Performance assessment of Enter and Clear the Room (ECR) drills in traditional military training relies on costly, intrusive sensors or subjective human observation, which makes scalable, objective, multi-dimensional skill assessment difficult. A method is needed that automatically assesses cognitive, psychomotor, and teamwork skills without extra hardware.
Method: Proposes a video-based assessment pipeline: computer vision models extract 2D poses, gaze directions, and movement trajectories from training videos; task-specific metrics for psychomotor fluency, situational awareness, and team coordination are derived from these data; the metrics feed into an extended Cognitive Task Analysis (CTA) hierarchy, where a weighted combination produces overall scores for teamwork and cognition.
Result: The approach was validated on a real-world ECR case study, producing actionable, domain-specific metrics of individual and team performance; the metrics can support After Action Reviews (AAR) and give intuitive feedback through interactive dashboards within Gamemaster and GIFT.
Conclusion: The video-driven framework enables objective, scalable assessment of performance in synthetic training environments without added hardware cost; despite limitations in tracking accuracy, ground-truth validation, and generalizability, it lays a foundation for 3D video analysis and broader STE applications.
Abstract: Effective urban warfare training requires situational awareness and muscle memory, developed through repeated practice in realistic yet controlled environments. A key drill, Enter and Clear the Room (ECR), demands threat assessment, coordination, and securing confined spaces. The military uses Synthetic Training Environments that offer scalable, controlled settings for repeated exercises. However, automatic performance assessment remains challenging, particularly when aiming for objective evaluation of cognitive, psychomotor, and teamwork skills. Traditional methods often rely on costly, intrusive sensors or subjective human observation, limiting scalability and accuracy. This paper introduces a video-based assessment pipeline that derives performance analytics from training videos without requiring additional hardware. By utilizing computer vision models, the system extracts 2D skeletons, gaze vectors, and movement trajectories. From these data, we develop task-specific metrics that measure psychomotor fluency, situational awareness, and team coordination. These metrics feed into an extended Cognitive Task Analysis (CTA) hierarchy, which employs a weighted combination to generate overall performance scores for teamwork and cognition. We demonstrate the approach with a case study of real-world ECR drills, providing actionable, domain-specific metrics that capture individual and team performance.
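The weighted combination mentioned in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the metric names, the weights, the normalization of every metric to the range 0 to 1, and the flat two-skill hierarchy are hypothetical, since the abstract does not specify them.

```python
def weighted_score(metrics, weights):
    """Weighted average of the named metrics; weights need not sum to 1."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(metrics[name] * w for name, w in weights.items()) / total_weight


def overall_scores(metrics, hierarchy):
    """Compute one aggregate score per top-level skill in the hierarchy."""
    return {skill: weighted_score(metrics, weights) for skill, weights in hierarchy.items()}


# Hypothetical task-specific metrics, each assumed to be normalized to [0, 1].
EXAMPLE_METRICS = {
    "entry_speed": 0.8,
    "spacing": 0.5,
    "gaze_coverage": 0.6,
    "threat_scan": 0.9,
}

# Hypothetical hierarchy: each top-level skill maps metric names to weights.
EXAMPLE_HIERARCHY = {
    "teamwork": {"spacing": 2.0, "entry_speed": 1.0},
    "cognition": {"gaze_coverage": 1.0, "threat_scan": 1.0},
}

if __name__ == "__main__":
    for skill, score in overall_scores(EXAMPLE_METRICS, EXAMPLE_HIERARCHY).items():
        print(f"{skill}: {score:.2f}")
```

Dividing by the weight total keeps each aggregate on the same 0 to 1 scale as its inputs, so scores stay comparable if the weights are later retuned.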
We also discuss how these insights can support After Action Reviews with interactive dashboards within Gamemaster and the Generalized Intelligent Framework for Tutoring (GIFT), providing intuitive and understandable feedback. We conclude by addressing limitations, including tracking difficulties, ground-truth validation, and the broader applicability of our approach. Future work includes expanding analysis to 3D video data and leveraging video analysis to enable scalable evaluation within STEs.
[69] Pretraining Frame Preservation in Autoregressive Video Memory Compression
Lvmin Zhang,Shengqu Cai,Muyang Li,Chong Zeng,Beijia Lu,Anyi Rao,Song Han,Gordon Wetzstein,Maneesh Agrawala
Main category: cs.CV
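TL;DR: Proposes PFP, a neural network structure that compresses long videos into short contexts using an explicit pretraining objective that preserves the high-frequency details of single frames at arbitrary temporal positions.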
Details
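Motivation: Achieving long-term memory at low context cost and with little fidelity loss requires compressing long videos effectively while preserving key visual details.
Method: Designs the PFP neural network structure, which uses an explicit pretraining objective to preserve the high-frequency details of single frames; the pretrained model is then fine-tuned as a memory encoder for autoregressive video models.
Result: The baseline model compresses a 20-second video into a context of about 5k length, supports random frame retrieval with perceptually preserved appearance, and provides long-history memory at low overhead in autoregressive video generation.
Conclusion: The PFP framework strikes a good trade-off between video compression and reconstruction and suits video modeling tasks that need long-range dependencies.
Abstract: We present PFP, a neural network structure to compress long videos into short contexts, with an explicit pretraining objective to preserve the high-frequency details of single frames at arbitrary temporal positions. The baseline model can compress a 20-second video into a context at about 5k length, where random frames can be retrieved with perceptually preserved appearances. Such pretrained models can be directly fine-tuned as memory encoders for autoregressive video models, enabling long history memory with low context cost and relatively low fidelity loss. We evaluate the framework with ablative settings and discuss the trade-offs of possible neural architecture designs.
[70] Lifelong Domain Adaptive 3D Human Pose Estimation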
Qucheng Peng,Hongfei Xue,Pu Wang,Chen Chen
Main category: cs.CV
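TL;DR: This paper proposes a new task, lifelong domain adaptive 3D human pose estimation, introducing lifelong domain adaptation to this field for the first time, and addresses domain shift and catastrophic forgetting with a GAN framework and a new 3D pose generator paradigm.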
Details
Motivation: Existing domain adaptation methods overlook the non-stationarity of target pose datasets and struggle to keep adapting to new domains without access to the source and previous target domains, which limits generalization in real-world scenarios.
Method: A GAN-based framework comprising 3D pose generators, a 2D pose discriminator, and a 3D pose estimator, with a 3D pose generator that integrates pose-aware, temporal-aware, and domain-aware knowledge to strengthen adaptation to the current domain and mitigate forgetting of earlier knowledge.
Result: Extensive experiments on several domain adaptive 3D HPE datasets show that the method outperforms existing approaches, handling domain shift more effectively while retaining knowledge of earlier domains.
Conclusion: The lifelong domain adaptation framework markedly improves the robustness and generalization of 3D human pose estimation across continually changing domains, offering a workable solution for practical use.
Abstract: 3D Human Pose Estimation (3D HPE) is vital in various applications, from person re-identification and action recognition to virtual reality. However, the reliance on annotated 3D data collected in controlled environments poses challenges for generalization to diverse in-the-wild scenarios. Existing domain adaptation (DA) paradigms like general DA and source-free DA for 3D HPE overlook the issue of non-stationary target pose datasets. To address these challenges, we propose a novel task named lifelong domain adaptive 3D HPE. To our knowledge, we are the first to introduce lifelong domain adaptation to the 3D HPE task. In this lifelong DA setting, the pose estimator is pretrained on the source domain and subsequently adapted to distinct target domains. Moreover, during adaptation to the current target domain, the pose estimator cannot access the source and all the previous target domains. The lifelong DA for 3D HPE involves overcoming challenges in adapting to current domain poses and preserving knowledge from previous domains, particularly combating catastrophic forgetting. We present an innovative Generative Adversarial Network (GAN) framework, which incorporates 3D pose generators, a 2D pose discriminator, and a 3D pose estimator. This framework effectively mitigates domain shifts and aligns original and augmented poses. Moreover, we construct a novel 3D pose generator paradigm, integrating pose-aware, temporal-aware, and domain-aware knowledge to enhance the current domain's adaptation and alleviate catastrophic forgetting on previous domains.
Our method demonstrates superior performance through extensive experiments on diverse domain adaptive 3D HPE datasets.
[71] MRI-to-CT Synthesis With Cranial Suture Segmentations Using A Variational Autoencoder Framework
Krithika Iyer,Austin Tapp,Athelia Paulli,Gabrielle Dickerson,Syed Muhammad Anwar,Natasha Lepore,Marius George Linguraru
Main category: cs.CV
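TL;DR: This study proposes a deep learning framework that generates synthetic CT (sCT) from pediatric T1-weighted MRI and uses it for cranial bone segmentation and suture probability heatmap prediction. It is the first to achieve suture segmentation on MRI-derived sCT, overcoming MRI's limited depiction of bone.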
Details
Motivation: To assess pediatric cranial development and suture closure accurately without CT, sparing children from ionizing radiation while compensating for MRI's inability to show the skull and sutures clearly.
Method: A deep learning pipeline built on domain-specific variational autoencoders converts T1-weighted MRIs of children aged 0.2 to 2 years into synthetic CTs (sCTs), then performs cranial bone segmentation, generates suture probability heatmaps, and derives direct suture segmentations from the heatmaps.
Result: sCTs reach 99% structural similarity to real CTs and a Fréchet inception distance of 1.01. The average Dice coefficient over seven cranial bones is 85%, and suture segmentation reaches 80% Dice. Two one-sided tests (TOST, p < 0.05) confirm the equivalence of skull and suture segmentation between sCTs and real CTs.
Conclusion: The method is the first to generate, from pediatric MRI, sCTs usable for suture segmentation, providing radiation-free, accurate cranial assessment and filling a gap in noninvasive pediatric cranial imaging.
Abstract: Quantifying normative pediatric cranial development and suture ossification is crucial for diagnosing and treating growth-related cephalic disorders. Computed tomography (CT) is widely used to evaluate cranial and sutural deformities; however, its ionizing radiation is contraindicated in children without significant abnormalities. Magnetic resonance imaging (MRI) offers radiation-free scans with superior soft tissue contrast, but unlike CT, MRI cannot elucidate cranial sutures, estimate skull bone density, or assess cranial vault growth. This study proposes a deep-learning-driven pipeline for transforming T1-weighted MRIs of children aged 0.2 to 2 years into synthetic CTs (sCTs), predicting detailed cranial bone segmentation, generating suture probability heatmaps, and deriving direct suture segmentation from the heatmaps. With our in-house pediatric data, sCTs achieved 99% structural similarity and a Fréchet inception distance of 1.01 relative to real CTs. Skull segmentation attained an average Dice coefficient of 85% across seven cranial bones, and sutures achieved 80% Dice. Equivalence of skull and suture segmentation between sCTs and real CTs was confirmed using two one-sided tests (TOST, p < 0.05). To our knowledge, this is the first pediatric cranial CT synthesis framework to enable suture segmentation on sCTs derived from MRI, despite MRI's limited depiction of bone and sutures.
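For readers unfamiliar with the Dice coefficient quoted for the skull and suture results, a minimal sketch of the standard definition, 2|A∩B| / (|A| + |B|), follows. The function name and the flat 0/1 mask representation are illustrative assumptions; the paper's own evaluation code is not shown in this digest.

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[int], truth: Sequence[int]) -> float:
    """Dice overlap between two binary masks given as flat sequences of 0/1."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    size_sum = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    if size_sum == 0:
        # Assumption: two empty masks are treated as a perfect match.
        return 1.0
    return 2.0 * intersection / size_sum


# Toy masks: one overlapping voxel, two foreground voxels in each mask.
print(dice_coefficient([1, 1, 0, 0], [1, 0, 1, 0]))  # 0.5
```

An average Dice over several structures, such as the seven cranial bones, would simply be the mean of the per-structure values.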
By combining robust, domain-specific variational autoencoders, our method generates perceptually indistinguishable cranial sCTs from routine pediatric MRIs, bridging critical gaps in noninvasive cranial evaluation.
[72] Scaling Remote Sensing Foundation Models: Data Domain Tradeoffs at the Peta-Scale
Charith Wickrema,Eliza Mace,Hunter Brown,Heidys Cabrera,Nick Krall,Matthew O'Neill,Shivangi Sarkar,Lowell Weissman,Eric Hughes,Guido Zarrella
Main category: cs.CV
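TL;DR: This study explores the scaling behavior of large foundation models trained on high-resolution electro-optical remote sensing data. Training vision transformers on over a quadrillion pixels of satellite imagery, it finds that performance is limited by data rather than by model parameters, and offers practical guidance for large-scale model development in remote sensing.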
Details
Motivation: Scaling laws in high-value domains such as remote sensing are far less clear than in natural-image domains, and guidance is lacking for training foundation models on large-scale non-text modalities such as remote sensing imagery.
Method: Using over a quadrillion pixels of commercial satellite EO data in the MITRE Federal AI Sandbox, the authors train progressively larger vision transformer (ViT) backbones and analyze their petascale behavior, success and failure modes, and implications for bridging domain gaps across remote sensing modalities.
Result: Even at this scale, performance remains in a data-limited rather than a parameter-limited regime. Several success and failure modes are identified, along with domain-gap issues in multimodal remote sensing modeling.
Conclusion: The findings suggest that future remote sensing foundation models should prioritize data-collection strategy and compute optimization over simply enlarging the model, giving the field empirical evidence and practical guidance.
Abstract: We explore the scaling behaviors of artificial intelligence to establish practical techniques for training foundation models on high-resolution electro-optical (EO) datasets that exceed the current state-of-the-art scale by orders of magnitude. Modern multimodal machine learning (ML) applications, such as generative artificial intelligence (GenAI) systems for image captioning, search, and reasoning, depend on robust, domain-specialized encoders for non-text modalities. In natural-image domains where internet-scale data is plentiful, well-established scaling laws help optimize the joint scaling of model capacity, training compute, and dataset size. Unfortunately, these relationships are much less well understood in high-value domains like remote sensing (RS). Using over a quadrillion pixels of commercial satellite EO data and the MITRE Federal AI Sandbox, we train progressively larger vision transformer (ViT) backbones, report success and failure modes observed at petascale, and analyze implications for bridging domain gaps across additional RS modalities. We observe that even at this scale, performance is consistent with a data-limited regime rather than a model-parameter-limited one. These practical insights are intended to inform data-collection strategies, compute budgets, and optimization schedules that advance the future development of frontier-scale RS foundation models.
[73] Learning to learn skill assessment for fetal ultrasound scanning
Yipei Wang,Qianye Yang,Lior Drukker,Aris T. Papageorghiou,Yipeng Hu,J. Alison Noble
Main category: cs.CV
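TL;DR: Proposes a new bi-level optimisation framework that assesses fetal ultrasound skill without predefined skill ratings, quantifying skill by how well a clinical task is performed.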
Details
Motivation: Traditional ultrasound skill assessment depends on subjective expert judgment and is time-consuming and insufficiently objective. Existing automated methods mostly rely on predefined ratings and supervised learning, which limits the discovery of the factors that actually determine skill.
Method: A bi-level optimisation framework with a clinical task predictor and a skill predictor. The two networks are optimised jointly, and the quality of task performance serves as the skill indicator, so no manual skill-level annotation is required.
Result: The method's feasibility is validated on real-world fetal head ultrasound videos, where optimised task performance predicts operator skill.
Conclusion: The framework offers an objective, automated alternative for ultrasound skill assessment that does not depend on expert ratings or preset features.
Abstract: Traditionally, ultrasound skill assessment has relied on expert supervision and feedback, a process known for its subjectivity and time-intensive nature. Previous works on quantitative and automated skill assessment have predominantly employed supervised learning methods, often limiting the analysis to predetermined or assumed factors considered influential in determining skill levels. In this work, we propose a novel bi-level optimisation framework that assesses fetal ultrasound skills by how well a task is performed on the acquired fetal ultrasound images, without using manually predefined skill ratings. The framework consists of a clinical task predictor and a skill predictor, which are optimised jointly by refining the two networks simultaneously. We validate the proposed method on real-world clinical ultrasound videos of scanning the fetal head. The results demonstrate the feasibility of predicting ultrasound skills by the proposed framework, which quantifies optimised task performance as a skill indicator.
[74] MGML: A Plug-and-Play Meta-Guided Multi-Modal Learning Framework for Incomplete Multimodal Brain Tumor Segmentation
Yulong Zou,Bo Liu,Cun-Jing Zheng,Yuan-ming Geng,Siyue Li,Qiankun Zuo,Shuihua Wang,Yudong Zhang,Jin Hong
Main category: cs.CV
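TL;DR: Proposes a meta-guided multi-modal learning framework (MGML) that improves brain tumor MRI segmentation when modalities are missing. It combines adaptive modality fusion with a consistency regularization module, leaves the model architecture unchanged, trains end to end, and outperforms existing methods on BraTS2020 and BraTS2023.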
Details
Motivation: Multimodal MRI data are often incomplete in clinical practice, which prevents full use of the multimodal information. Fusing that information effectively when modalities are missing is the key challenge.
Method: The MGML framework has two modules: (1) meta-parameterized adaptive modality fusion (Meta-AMF), which generates soft-label supervision signals from the available modalities to achieve dynamic fusion; and (2) a consistency regularization module that improves robustness and generalization. The method does not change the original model architecture and integrates easily into the training pipeline.
Result: Experiments on BraTS2020 and BraTS2023 show that MGML outperforms several state-of-the-art methods. On BraTS2020, the average Dice scores over fifteen missing-modality combinations are 87.55 for WT, 79.36 for TC, and 62.67 for ET.
Conclusion: MGML makes effective use of incomplete multimodal MRI data to improve brain tumor segmentation, is general and practical, and has open-source code.
Abstract: Leveraging multimodal information from Magnetic Resonance Imaging (MRI) plays a vital role in lesion segmentation, especially for brain tumors. However, in clinical practice, multimodal MRI data are often incomplete, making it challenging to fully utilize the available information. Therefore, maximizing the utilization of this incomplete multimodal information presents a crucial research challenge. We present a novel meta-guided multi-modal learning (MGML) framework that comprises two components: meta-parameterized adaptive modality fusion and a consistency regularization module. The meta-parameterized adaptive modality fusion (Meta-AMF) enables the model to effectively integrate information from multiple modalities under varying input conditions. By generating adaptive soft-label supervision signals based on the available modalities, Meta-AMF explicitly promotes more coherent multimodal fusion. In addition, the consistency regularization module enhances segmentation performance and implicitly reinforces the robustness and generalization of the overall framework. Notably, our approach does not alter the original model architecture and can be conveniently integrated into the training pipeline for end-to-end model optimization. We conducted extensive experiments on the public BraTS2020 and BraTS2023 datasets. Compared to multiple state-of-the-art methods from previous years, our method achieved superior performance.
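The figure of fifteen missing-modality combinations follows from BraTS providing four MRI sequences: every non-empty subset of available modalities gives 2^4 - 1 = 15 cases. The sketch below enumerates them and averages a per-combination score. The function names and the example scores are hypothetical placeholders, not values or code from the paper.

```python
from itertools import combinations
from statistics import mean
from typing import Dict, List, Tuple

# The four MRI sequences provided per case in BraTS.
MODALITIES: Tuple[str, ...] = ("T1", "T1ce", "T2", "FLAIR")


def available_modality_subsets(modalities: Tuple[str, ...] = MODALITIES) -> List[Tuple[str, ...]]:
    """All non-empty subsets of modalities that may be present at test time."""
    return [
        subset
        for size in range(1, len(modalities) + 1)
        for subset in combinations(modalities, size)
    ]


def average_dice(dice_per_subset: Dict[Tuple[str, ...], float]) -> float:
    """Mean Dice over every evaluated modality combination."""
    return mean(dice_per_subset.values())


subsets = available_modality_subsets()
print(len(subsets))  # 15

# Placeholder scores purely to show the averaging step.
fake_scores = {subset: 80.0 + len(subset) for subset in subsets}
print(round(average_dice(fake_scores), 2))
```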
On BraTS2020, averaged across the fifteen missing-modality combinations and building upon the baseline, our method obtained Dice scores of 87.55, 79.36, and 62.67 for the whole tumor (WT), the tumor core (TC), and the enhancing tumor (ET), respectively. We have made our source code publicly available at https://github.com/worldlikerr/MGML.
[75] Learnable Query Aggregation with KV Routing for Cross-view Geo-localisation
Hualin Ye,Bingxi Liu,Jixiang Du,Yu Qin,Ziyi Chen,Hong Zhang
Main category: cs.CV
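TL;DR: This paper proposes a new method for cross-view geo-localisation (CVGL) that improves feature aggregation and alignment to cope with viewpoint differences. It combines a DINOv2 backbone, a multi-scale channel reallocation module, and an MoE-based aggregation module, achieving strong performance with fewer trained parameters.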
Details
Motivation: Large differences between viewpoints make effective feature aggregation and alignment hard for traditional cross-view geo-localisation methods, so a more robust model design is needed to improve localisation accuracy and generalization.
Method: Fine-tunes a DINOv2 backbone with a convolution adapter, introduces a multi-scale channel reallocation module to increase the diversity and stability of spatial representations, and designs an improved aggregation module with Mixture-of-Experts (MoE) routing that dynamically selects expert subspaces within a cross-attention framework to handle heterogeneous input domains.
Result: Experiments on the University-1652 and SUES-200 datasets show that the method reaches competitive performance with fewer trained parameters.
Conclusion: The three proposed improvements raise both the accuracy and the efficiency of cross-view geo-localisation, maintaining strong matching ability while reducing model complexity.
Abstract: Cross-view geo-localisation (CVGL) aims to estimate the geographic location of a query image by matching it with images from a large-scale database. However, the significant viewpoint discrepancies present considerable challenges for effective feature aggregation and alignment. To address these challenges, we propose a novel CVGL system that incorporates three key improvements. Firstly, we leverage the DINOv2 backbone with convolution adapter fine-tuning to enhance model adaptability to cross-view variations. Secondly, we propose a multi-scale channel reallocation module to strengthen the diversity and stability of spatial representations. Finally, we propose an improved aggregation module that integrates Mixture-of-Experts (MoE) routing into the feature aggregation process. Specifically, the module dynamically selects expert subspaces for the keys and values in a cross-attention framework, enabling adaptive processing of heterogeneous input domains. Extensive experiments on the University-1652 and SUES-200 datasets demonstrate that our method achieves competitive performance with fewer trained parameters.
[76] Kinematic-Based Assessment of Surgical Actions in Microanastomosis
Yan Meng,Daniel Donoho,Marcelle Altshuler,Omar Arnaout
Main category: cs.CV
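TL;DR: Proposes an AI-based automated framework for action segmentation and skill assessment in microanastomosis surgery that runs efficiently on edge computing platforms.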
Details
Motivation: Traditional microsurgical skill assessment depends on expert ratings, which are subjective, time-consuming, and inconsistent between raters. Objective, scalable automated assessment is needed.
Method: The framework has three modules: instrument tip tracking and localization based on YOLO and DeepSORT; action boundary detection from a self-similarity matrix with unsupervised clustering for action segmentation; and a supervised classification module for skill assessment.
Result: On a dataset of 58 expert-rated microanastomosis videos, frame-level action segmentation accuracy reaches 92.4% and skill classification accuracy reaches 85.5%.
Conclusion: The method can provide objective, real-time feedback in microsurgical training and supports standardized, data-driven training and competency assessment.
Abstract: Proficiency in microanastomosis is a critical surgical skill in neurosurgery, where the ability to precisely manipulate fine instruments is crucial to successful outcomes. These procedures require sustained attention, coordinated hand movements, and highly refined motor skills, underscoring the need for objective and systematic methods to evaluate and enhance microsurgical training. Conventional assessment approaches typically rely on expert raters supervising the procedures or reviewing surgical videos, which is an inherently subjective process prone to inter-rater variability, inconsistency, and significant time investment. These limitations highlight the necessity for automated and scalable solutions. To address this challenge, we introduce a novel AI-driven framework for automated action segmentation and performance assessment in microanastomosis procedures, designed to operate efficiently on edge computing platforms. The proposed system comprises three main components: (1) an object tip tracking and localization module based on YOLO and DeepSORT; (2) an action segmentation module leveraging a self-similarity matrix for action boundary detection and unsupervised clustering; and (3) a supervised classification module designed to evaluate surgical gesture proficiency. Experimental validation on a dataset of 58 expert-rated microanastomosis videos demonstrates the effectiveness of our approach, achieving a frame-level action segmentation accuracy of 92.4% and an overall skill classification accuracy of 85.5% in replicating expert evaluations.
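To make the self-similarity idea in component (2) concrete, here is a rough sketch. It assumes each frame is described by a small feature vector (for example, tool-tip kinematics) and uses cosine similarity. The boundary rule, a drop in similarity between adjacent frames below a threshold, is a deliberately simplified stand-in and not the detector used in the paper; all names and the threshold value are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def self_similarity_matrix(features: Sequence[Sequence[float]]) -> List[List[float]]:
    """Entry [i][j] is the similarity between frame i and frame j."""
    n = len(features)
    return [[cosine_similarity(features[i], features[j]) for j in range(n)] for i in range(n)]


def detect_boundaries(ssm: List[List[float]], threshold: float = 0.5) -> List[int]:
    """Simplified rule: frame i starts a new action if it is dissimilar to frame i - 1."""
    return [i for i in range(1, len(ssm)) if ssm[i - 1][i] < threshold]


# Toy per-frame features: two frames of one motion pattern, then two of another.
frames = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
ssm = self_similarity_matrix(frames)
print(detect_boundaries(ssm))  # [2]
```

In a self-similarity matrix, frames of the same action form bright square blocks along the diagonal, and action boundaries sit at the corners between blocks. A fuller detector would look at a neighbourhood around the diagonal, not just at adjacent frames.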
These findings demonstrate the potential of the proposed method to provide objective, real-time feedback in microsurgical education, thereby enabling more standardized, data-driven training protocols and advancing competency assessment in high-stakes surgical environments.
[77] U-Net-Like Spiking Neural Networks for Single Image Dehazing
Huibin Li,Haoran Liu,Mingzhe Liu,Yulong Xiao,Peng Li,Guibin Zan
Main category: cs.CV
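TL;DR: Proposes DehazeSNN, a new dehazing architecture that combines a U-Net structure with spiking neural networks (SNNs). Its OLIFBlock module improves cross-channel communication, giving performance comparable to state-of-the-art methods at lower computational cost.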
Details
Motivation: Traditional dehazing methods rely on atmospheric scattering models. Deep learning methods such as CNNs and Transformers improve on them, but CNNs struggle to capture long-range dependencies and Transformers are computationally expensive. An efficient, high-performing dehazing model is therefore needed.
Method: DehazeSNN pairs a U-Net-like structure with spiking neural networks (SNNs) and introduces the Orthogonal Leaky-Integrate-and-Fire Block (OLIFBlock) to strengthen cross-channel information exchange, capturing multi-scale features and long-range dependencies.
Result: Experiments show that DehazeSNN is competitive with state-of-the-art methods on several benchmark datasets, producing high-quality haze-free images with a smaller model and fewer multiply-accumulate operations.
Conclusion: DehazeSNN is an efficient, lightweight, high-performing dehazing method that combines the strengths of SNNs and U-Net, suggesting a route to low-power, high-clarity vision systems in practice.
Abstract: Image dehazing is a critical challenge in computer vision, essential for enhancing image clarity in hazy conditions. Traditional methods often rely on atmospheric scattering models, while recent deep learning techniques, specifically Convolutional Neural Networks (CNNs) and Transformers, have improved performance by effectively analyzing image features. However, CNNs struggle with long-range dependencies, and Transformers demand significant computational resources. To address these limitations, we propose DehazeSNN, an innovative architecture that integrates a U-Net-like design with Spiking Neural Networks (SNNs). DehazeSNN captures multi-scale image features while efficiently managing local and long-range dependencies. The introduction of the Orthogonal Leaky-Integrate-and-Fire Block (OLIFBlock) enhances cross-channel communication, resulting in superior dehazing performance with reduced computational burden. Our extensive experiments show that DehazeSNN is highly competitive with state-of-the-art methods on benchmark datasets, delivering high-quality haze-free images with a smaller model size and fewer multiply-accumulate operations. The proposed dehazing method is publicly available at https://github.com/HaoranLiu507/DehazeSNN.
[78] T2VAttack: Adversarial Attack on Text-to-Video Diffusion Models
Changzhen Li,Yuecong Min,Jie Zhang,Zheng Yuan,Shiguang Shan,Xilin Chen
Main category: cs.CV
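TL;DR: This paper presents T2VAttack, the first systematic study of adversarial attacks on text-to-video diffusion models from both semantic and temporal perspectives, revealing how fragile current models are under minor prompt modifications.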
Details
Motivation: Text-to-video generation models have advanced considerably, but their vulnerability to adversarial attacks has not been explored in depth, and their robustness needs to be evaluated.
Method: Proposes two attack objectives (semantic alignment and temporal dynamics) and two attack methods: T2VAttack-S, which replaces critical words via greedy search, and T2VAttack-I, which inserts optimized words to attack with minimal perturbation.
Result: Experiments on several state-of-the-art T2V models, including ModelScope, CogVideoX, Open-Sora, and HunyuanVideo, show that substituting or inserting a single word can substantially degrade the semantic fidelity and temporal coherence of generated videos.
Conclusion: Current text-to-video diffusion models are seriously vulnerable to slight prompt tampering, and future work should strengthen their adversarial robustness.
Abstract: The rapid evolution of Text-to-Video (T2V) diffusion models has driven remarkable advancements in generating high-quality, temporally coherent videos from natural language descriptions. Despite these achievements, their vulnerability to adversarial attacks remains largely unexplored. In this paper, we introduce T2VAttack, a comprehensive study of adversarial attacks on T2V diffusion models from both semantic and temporal perspectives. Considering the inherently dynamic nature of video data, we propose two distinct attack objectives: a semantic objective to evaluate video-text alignment and a temporal objective to assess the temporal dynamics. To achieve an effective and efficient attack process, we propose two adversarial attack methods: (i) T2VAttack-S, which identifies semantically or temporally critical words in prompts and replaces them with synonyms via greedy search, and (ii) T2VAttack-I, which iteratively inserts optimized words with minimal perturbation to the prompt. By combining these objectives and strategies, we conduct a comprehensive evaluation of the adversarial robustness of several state-of-the-art T2V models, including ModelScope, CogVideoX, Open-Sora, and HunyuanVideo. Our experiments reveal that even minor prompt modifications, such as the substitution or insertion of a single word, can cause substantial degradation in semantic fidelity and temporal dynamics, highlighting critical vulnerabilities in current T2V diffusion models.
[79] DriveExplorer: Images-Only Decoupled 4D Reconstruction with Progressive Restoration for Driving View Extrapolation
Yuang Jia,Jinlong Wang,Jiayi Zhao,Chunlam Li,Shunzhou Wang,Wei Gao
Main category: cs.CV
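TL;DR: This paper proposes a view extrapolation method for autonomous driving scenes that needs no expensive sensors or annotated priors. From images and optional camera poses it estimates static and dynamic point clouds, then combines a deformable 4D Gaussian model with a video diffusion model to iteratively refine the renderings.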
Details
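Motivation: Existing view generation methods depend on priors such as LiDAR and 3D bounding boxes that are expensive or hard to obtain, which limits their use in real autonomous driving deployments. This work aims for high-quality view extrapolation using only images and optional camera poses.
Method: A global static point cloud and per-frame dynamic point clouds are estimated from images and fused into a unified representation. A deformable 4D Gaussian framework reconstructs the scene. The initially trained 4D Gaussian model renders degraded pseudo-images used to train a video diffusion model. The diffusion model then iteratively refines progressively shifted Gaussian renderings, and the enhanced results are fed back as training data for 4DGS until the target viewpoints are reached.
Result: Compared with baselines, the method produces higher-quality images at novel extrapolated viewpoints without strong priors such as LiDAR.
Conclusion: The method achieves view extrapolation from images (and optional poses) alone, and the iterative interplay between 4D Gaussians and the diffusion model improves both generation quality and practicality.
Abstract: This paper presents an effective solution for view extrapolation in autonomous driving scenarios. Recent approaches focus on generating shifted novel view images from given viewpoints using diffusion models. However, these methods heavily rely on priors such as LiDAR point clouds, 3D bounding boxes, and lane annotations, which demand expensive sensors or labor-intensive labeling, limiting applicability in real-world deployment. In this work, with only images and optional camera poses, we first estimate a global static point cloud and per-frame dynamic point clouds, fusing them into a unified representation. We then employ a deformable 4D Gaussian framework to reconstruct the scene. The initially trained 4D Gaussian model renders degraded and pseudo-images to train a video diffusion model. Subsequently, progressively shifted Gaussian renderings are iteratively refined by the diffusion model, and the enhanced results are incorporated back as training data for 4DGS. This process continues until extrapolation reaches the target viewpoints. Compared with baselines, our method produces higher-quality images at novel extrapolated viewpoints.
[80] Anomaly detection in satellite imagery through temporal inpainting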
Bertrand Rouet-Leduc,Claudia Hulbert
Main category: cs.CV
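TL;DR: Proposes a deep learning method for anomaly detection in satellite image time series that identifies surface changes by predicting what the surface should look like in the absence of change, markedly improving detection sensitivity and specificity.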
Details
Motivation: Traditional change detection methods struggle to separate atmospheric noise, seasonal variation, and sensor artifacts from real surface change, which limits the efficiency of disaster response and environmental monitoring.
Method: An inpainting model built on the SATLAS foundation model predicts the surface state of the last frame of a Sentinel-2 time series from the preceding acquisitions, and anomalies are identified from the discrepancy between prediction and observation. Training uses globally distributed data spanning many climate zones and land cover types.
Result: In detecting surface ruptures triggered by the 2023 Turkey-Syria earthquake sequence, the method shows higher sensitivity and specificity than the temporal median and Reed-Xiaoli detectors, with a detection threshold about one third that of the baselines.
Conclusion: The method exploits the temporal redundancy of satellite time series to detect subtle surface changes automatically and accurately, offering a workable route to global-scale surface change monitoring.
Abstract: Detecting surface changes from satellite imagery is critical for rapid disaster response and environmental monitoring, yet remains challenging due to the complex interplay between atmospheric noise, seasonal variations, and sensor artifacts. Here we show that deep learning can leverage the temporal redundancy of satellite time series to detect anomalies at unprecedented sensitivity, by learning to predict what the surface should look like in the absence of change. We train an inpainting model built upon the SATLAS foundation model to reconstruct the last frame of a Sentinel-2 time series from preceding acquisitions, using globally distributed training data spanning diverse climate zones and land cover types. When applied to regions affected by sudden surface changes, the discrepancy between prediction and observation reveals anomalies that traditional change detection methods miss. We validate our approach on earthquake-triggered surface ruptures from the 2023 Turkey-Syria earthquake sequence, demonstrating detection of a rift feature in Tepehan with higher sensitivity and specificity than temporal median or Reed-Xiaoli anomaly detectors. Our method reaches detection thresholds approximately three times lower than baseline approaches, providing a path towards automated, global-scale monitoring of surface changes from freely available multi-spectral satellite data.
[81] GCA-ResUNet: Medical Image Segmentation Using Grouped Coordinate Attention
Jun Ding,Shang Gao
Main category: cs.CV
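TL;DR: This paper proposes GCA-ResUNet, an efficient medical image segmentation framework with a lightweight Grouped Coordinate Attention (GCA) module that strengthens global context modeling while keeping CNN efficiency, improving segmentation accuracy in multi-organ and low-contrast regions.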
Details
Motivation: Existing U-Net-style methods have local receptive fields and largely homogeneous attention mechanisms, so they model long-range dependencies poorly. Transformers offer global modeling but at a high computational cost that limits their use in resource-constrained clinical settings. A segmentation method that balances accuracy and efficiency is needed.
Method: A lightweight, plug-and-play Grouped Coordinate Attention (GCA) module splits channel-wise context modeling into groups to handle semantic heterogeneity and adds direction-aware coordinate encoding to capture horizontal and vertical spatial dependencies. Embedding the module in ResUNet gives the GCA-ResUNet framework.
Result: On the Synapse and ACDC benchmarks the method reaches Dice scores of 86.11% and 92.64%, respectively, outperforming a range of CNN and Transformer methods including Swin-UNet and TransUNet, with the largest gains on small organs and structures with complex boundaries.
Conclusion: GCA-ResUNet strikes a good balance between segmentation accuracy and computational efficiency, making it feasible to deploy and scale in clinical settings.
Abstract: Accurate segmentation of heterogeneous anatomical structures is pivotal for computer-aided diagnosis and subsequent clinical decision-making. Although U-Net based convolutional neural networks have achieved remarkable progress, their intrinsic locality and largely homogeneous attention formulations often limit the modeling of long-range contextual dependencies, especially in multi-organ scenarios and low-contrast regions. Transformer-based architectures mitigate this issue by leveraging global self-attention, but they usually require higher computational resources and larger training data, which may impede deployment in resource-constrained clinical environments. In this paper, we propose GCA-ResUNet, an efficient medical image segmentation framework equipped with a lightweight and plug-and-play Grouped Coordinate Attention (GCA) module. The proposed GCA decouples channel-wise context modeling into multiple groups to explicitly account for semantic heterogeneity across channels, and integrates direction-aware coordinate encoding to capture structured spatial dependencies along horizontal and vertical axes. This design enhances global representation capability while preserving the efficiency advantages of CNN backbones. Extensive experiments on two widely used benchmarks, Synapse and ACDC, demonstrate that GCA-ResUNet achieves Dice scores of 86.11% and 92.64%, respectively, outperforming a range of representative CNN and Transformer-based methods, including Swin-UNet and TransUNet.
In particular, GCA-ResUNet yields consistent improvements in delineating small anatomical structures with complex boundaries. These results indicate that the proposed approach provides a favorable trade-off between segmentation accuracy and computational efficiency, offering a practical and scalable solution for clinical deployment.
[82] Bridging Structure and Appearance: Topological Features for Robust Self-Supervised Segmentation
Haotang Li,Zhenyu Qi,Hao Qin,Huanrui Yang,Sen He,Kebin Peng
Main category: cs.CV
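TL;DR: This paper proposes GASeg, a new framework that combines stable topological features from appearance and geometry to counter the performance drop that appearance ambiguities cause in self-supervised semantic segmentation.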
Details
Motivation: Existing self-supervised semantic segmentation methods perform poorly under appearance ambiguities such as shadows, glare, and local textures, mainly because they rely too heavily on unstable appearance features. A more robust approach is needed to improve generalization.
Method: The GASeg framework is built around a Differentiable Box-Counting (DBC) module that extracts multi-scale topological statistics from parallel geometry and appearance streams. A Topological Augmentation (TopoAug) strategy simulates real-world ambiguities with morphological operators, and a multi-objective loss, GALoss, enforces cross-modal feature alignment.
Result: Extensive experiments on benchmarks including COCO-Stuff, Cityscapes, and PASCAL show that GASeg achieves state-of-the-art performance.
Conclusion: By incorporating stable topological structure, GASeg bridges the appearance and geometry modalities and markedly improves the robustness and accuracy of self-supervised semantic segmentation in complex scenes.
Abstract: Self-supervised semantic segmentation methods often fail when faced with appearance ambiguities. We argue that this is due to an over-reliance on unstable, appearance-based features such as shadows, glare, and local textures. We propose GASeg, a novel framework that bridges appearance and geometry by leveraging stable topological information. The core of our method is the Differentiable Box-Counting (DBC) module, which quantifies multi-scale topological statistics from two parallel streams: geometric-based features and appearance-based features. To force the model to learn these stable structural representations, we introduce Topological Augmentation (TopoAug), an adversarial strategy that simulates real-world ambiguities by applying morphological operators to the input images. A multi-objective loss, GALoss, then explicitly enforces cross-modal alignment between geometric-based and appearance-based features. Extensive experiments demonstrate that GASeg achieves state-of-the-art performance on four benchmarks, including COCO-Stuff, Cityscapes, and PASCAL, validating our approach of bridging geometry and appearance via topological information.
[83] Improved 3D Gaussian Splatting of Unknown Spacecraft Structure Using Space Environment Illumination Knowledge
Tae Ha Park,Simone D'Amico
Main category: cs.CV
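TL;DR: Proposes a new method that uses prior knowledge of the Sun's position to improve 3D Gaussian Splatting training, recovering the 3D structure of an unknown spacecraft from an image sequence under dynamic lighting and improving the photometric accuracy of renderings used for pose estimation.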
Details
Motivation: During rendezvous and proximity operations in space, lighting changes rapidly and the scene is dynamic, so traditional 3D reconstruction methods struggle to recover the target spacecraft's geometry and photometric properties accurately.
Method: Prior knowledge of the Sun's position is introduced into the training of a 3D Gaussian Splatting (3DGS) model optimized over an image sequence; the same model also supports camera pose estimation through photometric optimization.
Result: Experiments show that the method adapts to the rapidly changing illumination found in space and produces 3DGS models with high geometric and photometric accuracy that reflect global shadowing and self-occlusion.
Conclusion: Training 3DGS with the Sun prior markedly improves spacecraft 3D reconstruction under complex lighting and supports accurate downstream pose estimation.
Abstract: This work presents a novel pipeline to recover the 3D structure of an unknown target spacecraft from a sequence of images captured during Rendezvous and Proximity Operations (RPO) in space. The target's geometry and appearance are represented as a 3D Gaussian Splatting (3DGS) model. However, learning 3DGS requires static scenes, an assumption in contrast to dynamic lighting conditions encountered in spaceborne imagery. The trained 3DGS model can also be used for camera pose estimation through photometric optimization. Therefore, in addition to recovering a geometrically accurate 3DGS model, the photometric accuracy of the rendered images is imperative to downstream pose estimation tasks during the RPO process. This work proposes to incorporate the prior knowledge of the Sun's position, estimated and maintained by the servicer spacecraft, into the training pipeline for improved photometric quality of 3DGS rasterization. Experimental studies demonstrate the effectiveness of the proposed solution, as 3DGS models trained on a sequence of images learn to adapt to rapidly changing illumination conditions in space and reflect global shadowing and self-occlusion.
[84] Bridging the Perception-Cognition Gap: Re-engineering SAM2 with Hilbert-Mamba for Robust VLM-based Medical Diagnosis
Hao Wu,Hui Li,Yiyun Su
Main category: cs.CV
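TL;DR: This paper proposes Hilbert-VLM, a new two-stage fusion framework that improves vision language model performance on 3D medical image analysis. By introducing Hilbert space-filling curves and redesigning the SAM2 architecture, it achieves more precise lesion segmentation and disease classification.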
Details
Motivation: Existing vision language models struggle to integrate complementary information when processing complex 3D multimodal medical images and often overlook subtle but critical pathological features, so a stronger fusion framework is needed to improve diagnostic accuracy.
Method: The Hilbert-VLM framework includes the HilbertMed-SAM module for lesion segmentation, a Hilbert-Mamba Cross-Attention (HMCA) mechanism, and a scale-aware decoder. Hilbert space-filling curves reorder the scanning of the Mamba state space model to preserve spatial locality in 3D data, and multimodal enhanced prompts guide the VLM in classification.
Result: On the BraTS2021 segmentation benchmark the model reaches a Dice score of 82.35% and a diagnostic classification accuracy (ACC) of 78.85%, supporting its effectiveness for medical image analysis.
Conclusion: Through its structural changes, Hilbert-VLM improves the accuracy of lesion segmentation and disease classification in 3D medical images, offering a more reliable technical route for VLM-based medical diagnosis.
Abstract: Recent studies suggest that Visual Language Models (VLMs) hold great potential for tasks such as automated medical diagnosis. However, processing complex three-dimensional (3D) multimodal medical images poses significant challenges - specifically, the effective integration of complementary information and the occasional oversight of subtle yet critical pathological features. To address these issues, we present a novel two-stage fusion framework termed Hilbert-VLM. This framework leverages the HilbertMed-SAM module for precise lesion segmentation, with the generated multimodal enhanced prompts then guiding the VLM toward accurate disease classification. Our key innovation lies in the systematic redesign of the Segment Anything Model 2 (SAM2) architecture: we incorporate Hilbert space-filling curves into the scanning mechanism of the Mamba State Space Model (SSM) to maximize the preservation of spatial locality in 3D data, a property critical for medical image analysis. We also introduce a novel Hilbert-Mamba Cross-Attention (HMCA) mechanism and a scale-aware decoder to capture fine-grained details. Meanwhile, the prompt enhancement module unifies segmentation masks and their corresponding textual attributes into an information-dense prompt to support VLM inference. Extensive experiments were conducted to validate the effectiveness of the Hilbert-VLM model. On the BraTS2021 segmentation benchmark, it achieves a Dice score of 82.35 percent, with a diagnostic classification accuracy (ACC) of 78.85 percent.
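The locality property that motivates Hilbert-curve scanning can be seen in a small 2D sketch. The code below is the classic index-to-coordinate conversion for a 2D Hilbert curve on an n x n grid (n a power of two). It is offered only as an illustration: the paper applies Hilbert ordering to 3D volumes inside a Mamba scan, and none of the function names here come from its implementation.

```python
from typing import List, Tuple


def _rotate(s: int, x: int, y: int, rx: int, ry: int) -> Tuple[int, int]:
    """Rotate or flip a quadrant of side s so sub-curves join end to end."""
    if ry == 0:
        if rx == 1:
            x = s - 1 - x
            y = s - 1 - y
        x, y = y, x
    return x, y


def hilbert_index_to_xy(n: int, d: int) -> Tuple[int, int]:
    """Map position d along the curve (0 <= d < n * n) to grid coordinates (x, y)."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        x, y = _rotate(s, x, y, rx, ry)
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_scan_order(n: int) -> List[Tuple[int, int]]:
    """Visit every cell of an n x n grid in Hilbert order."""
    return [hilbert_index_to_xy(n, d) for d in range(n * n)]


order = hilbert_scan_order(4)
print(order[:5])  # [(0, 0), (1, 0), (1, 1), (0, 1), (0, 2)]
```

Every step along the curve moves to a directly adjacent cell, so tokens that are neighbours in the 1D sequence are always neighbours in space. A row-by-row raster scan breaks this at the end of each row, which is the usual argument for Hilbert ordering when a sequence model has to consume spatial data.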
These results demonstrate that the proposed model offers substantial potential to improve the accuracy and reliability of medical VLM-based analysis.
[85] On Exact Editing of Flow-Based Diffusion Models
Zixiang Li,Yue Song,Jianing Peng,Ting Liu,Jun Huang,Xiaochao Qu,Luoqi Liu,Wei Wang,Yao Zhao,Yunchao Wei
Main category: cs.CV
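TL;DR: This paper proposes Conditioned Velocity Correction (CVC), a new flow-based diffusion editing framework that uses a dual-perspective velocity correction mechanism to address accumulated errors along latent trajectories, yielding stable image editing with structural fidelity and semantic consistency.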
Details
Motivation: Existing flow-based diffusion editing methods accumulate velocity errors along the latent trajectory when transforming between source and target distributions, causing semantic inconsistency and structural distortion; this paper aims to solve that problem.
Method: Flow-based editing is reformulated as a distribution transformation problem driven by a known source prior. A dual-perspective velocity conversion mechanism is introduced, with a structure-preserving branch that stays consistent with the source trajectory and a semantically guided branch that drives a controlled deviation toward the target distribution. Empirical Bayes inference and Tweedie correction are combined to apply a posterior-consistent update to the conditional velocity field.
Result: CVC markedly reduces trajectory drift and velocity error in latent space, yields more stable latent dynamics, and shows higher image fidelity, better semantic alignment, and more reliable editing behavior across a range of editing tasks.
Conclusion: CVC offers a more robust and interpretable solution for editing with flow-based diffusion models; its mathematically grounded velocity correction balances structure preservation against semantic transformation and advances high-quality image editing without explicit inversion.
Abstract: Recent methods in flow-based diffusion editing have enabled direct transformations between source and target image distributions without explicit inversion. However, the latent trajectories in these methods often exhibit accumulated velocity errors, leading to semantic inconsistency and loss of structural fidelity. We propose Conditioned Velocity Correction (CVC), a principled framework that reformulates flow-based editing as a distribution transformation problem driven by a known source prior. CVC rethinks the role of velocity in inter-distribution transformation by introducing a dual-perspective velocity conversion mechanism. This mechanism explicitly decomposes the latent evolution into two components: a structure-preserving branch that remains consistent with the source trajectory, and a semantically-guided branch that drives a controlled deviation toward the target distribution. The conditional velocity field exhibits an absolute velocity error relative to the true underlying distribution trajectory, which inherently introduces potential instability and trajectory drift in the latent space. To address this quantifiable deviation and maintain fidelity to the true flow, we apply a posterior-consistent update to the resulting conditional velocity field. This update is derived from Empirical Bayes Inference and Tweedie correction, which ensures a mathematically grounded error compensation over time. Our method yields stable and interpretable latent dynamics, achieving faithful reconstruction alongside smooth local semantic conversion.
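For readers unfamiliar with the Tweedie correction mentioned here, the textbook form of Tweedie's formula under a Gaussian noising model is reproduced below. This is the generic identity only: the abstract does not state the exact CVC update, and the symbols $\alpha_t$, $\sigma_t$, and $p_t$ are notation assumed for this note rather than taken from the paper.

```latex
x_t = \alpha_t x_0 + \sigma_t \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)
\quad\Longrightarrow\quad
\mathbb{E}[x_0 \mid x_t] = \frac{x_t + \sigma_t^2 \, \nabla_{x_t} \log p_t(x_t)}{\alpha_t}
```

Read this way, a posterior-mean estimate of the clean latent can be recovered from the noisy latent and the score of its marginal, which is the kind of quantity an empirical Bayes correction would compare against the current trajectory.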
Comprehensive experiments demonstrate that CVC consistently achieves superior fidelity, better semantic alignment, and more reliable editing behavior across diverse tasks.
[86] FitControler: Toward Fit-Aware Virtual Try-On
Lu Yang,Yicheng Liu,Yanan Li,Xiang Bai,Hao Lu
Main category: cs.CV
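TL;DR: This paper proposes FitControler, a learnable plug-in that can be integrated into modern virtual try-on (VTON) models; it is the first to offer fine-grained control over garment fit, improving the overall style coordination of virtual try-on.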
Details
Motivation: Existing virtual try-on techniques mostly focus on rendering garment details and neglect garment fit, a key factor that shapes the overall outfit style. This paper aims to fill that gap by bringing fit modeling into the VTON task.
Method: FitControler includes a fit-aware layout generator that produces body-garment layouts for different fits from garment-agnostic representations, and a multi-scale fit injector that feeds layout information into existing VTON models to enable layout-driven image generation. The authors build Fit4Men, the first fit-focused dataset (13,000 body-garment pairs), and propose two new fit consistency metrics.
Result: Experiments show that FitControler is compatible with a variety of mainstream VTON models and achieves accurate fit control; the new metrics effectively measure the fit consistency of generated results; and the Fit4Men dataset provides an important resource for follow-up research.
Conclusion: Explicitly modeling and controlling garment fit markedly improves the realism and style coordination of virtual try-on and opens a new direction for future VTON research.
Abstract: Realistic virtual try-on (VTON) concerns not only faithful rendering of garment details but also coordination of the style. Prior art typically pursues the former, but neglects a key factor that shapes the holistic style: garment fit. Garment fit delineates how a garment aligns with the body of a wearer and is a fundamental element in fashion design. In this work, we introduce fit-aware VTON and present FitControler, a learnable plug-in that can seamlessly integrate into modern VTON models to enable customized fit control. To achieve this, we highlight two challenges: i) how to delineate layouts of different fits and ii) how to render the garment that matches the layout. FitControler first features a fit-aware layout generator to redraw the body-garment layout conditioned on a set of delicately processed garment-agnostic representations, and a multi-scale fit injector is then used to deliver layout cues to enable layout-driven VTON. In particular, we build a fit-aware VTON dataset termed Fit4Men, including 13,000 body-garment pairs of different fits, covering both tops and bottoms, and featuring varying camera distances and body poses. Two fit consistency metrics are also introduced to assess the fitness of generations. Extensive experiments show that FitControler can work with various VTON models and achieve accurate fit control. Code and data will be released.
[87] Structure-Guided Allocation of 2D Gaussians for Image Representation and Compression
Huanxiong Liang,Yunuo Chen,Yicheng Pan,Sixian Wang,Jincheng Dai,Guo Lu,Wenjun Zhang
Main category: cs.CV
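TL;DR: Proposes a structure-guided allocation method for 2D Gaussians that combines structure-guided initialization, adaptive bitwidth quantization, and geometry-consistent regularization to improve the rate-distortion performance of 2DGS at low bitrates while keeping millisecond-level decoding.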
Details
Motivation: Existing 2DGS methods ignore image structure when allocating representation capacity and parameter precision, which leads to poor rate-distortion efficiency at low bitrates.
Method: (1) Structure-guided initialization: 2D Gaussians are assigned according to spatial structural priors of natural images. (2) Adaptive bitwidth quantization: small-scale Gaussians in complex regions receive higher precision. (3) Geometry-consistent regularization: Gaussian orientations are aligned with local gradient directions to preserve structural details.
Result: BD-rate is reduced by 43.44% on Kodak and 29.91% on DIV2K, while decoding speed stays above 1000 FPS.
Conclusion: The proposed method markedly improves the representational power and rate-distortion performance of 2DGS while keeping native decoding speed, making it suitable for efficient image compression.
Abstract: Recent advances in 2D Gaussian Splatting (2DGS) have demonstrated its potential as a compact image representation with millisecond-level decoding. However, existing 2DGS-based pipelines allocate representation capacity and parameter precision largely oblivious to image structure, limiting their rate-distortion (RD) efficiency at low bitrates. To address this, we propose a structure-guided allocation principle for 2DGS, which explicitly couples image structure with both representation capacity and quantization precision, while preserving native decoding speed. First, we introduce a structure-guided initialization that assigns 2D Gaussians according to spatial structural priors inherent in natural images, yielding a localized and semantically meaningful distribution. Second, during quantization-aware fine-tuning, we propose adaptive bitwidth quantization of covariance parameters, which grants higher precision to small-scale Gaussians in complex regions and lower precision elsewhere, enabling RD-aware optimization, thereby reducing redundancy without degrading edge quality. Third, we impose a geometry-consistent regularization that aligns Gaussian orientations with local gradient directions to better preserve structural details. Extensive experiments demonstrate that our approach substantially improves both the representational power and the RD performance of 2DGS while maintaining over 1000 FPS decoding. Compared with the baseline GSImage, we reduce BD-rate by 43.44% on Kodak and 29.91% on DIV2K.
[88] FUSE-RSVLM: Feature Fusion Vision-Language Model for Remote Sensing
Yunkai Dang,Donghao Wang,Jiacheng Yang,Yifan Jiang,Meiyi Zhu,Yuekun Yang,Cong Wang,Qi Fan,Wenbin Li,Yang Gao
Main category: cs.CV
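TL;DR: Proposes MF-RSVLM, a multi-feature fusion remote sensing vision-language model that uses multi-scale feature extraction and recurrent visual feature injection to improve fine-grained recognition and retention of visual information in remote sensing image understanding, reaching state-of-the-art levels on classification, image captioning, and visual question answering.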
Details
Motivation: Because remote sensing images differ markedly from natural images, existing vision-language models have difficulty extracting fine-grained features and suffer from visual forgetting during language processing.
Method: MF-RSVLM extracts multi-scale visual features and fuses global context with local details. A recurrent visual feature injection mechanism keeps feeding visual information into the language generation process to reduce visual forgetting.
Result: Across multiple remote sensing benchmarks, MF-RSVLM achieves state-of-the-art or competitive performance on classification, image captioning, and visual question answering.
Conclusion: MF-RSVLM improves how remote sensing vision-language models handle complex structures and small objects, and its feature fusion and injection mechanisms strengthen retention of visual information, advancing intelligent interpretation of remote sensing imagery.
Abstract: Large vision-language models (VLMs) exhibit strong performance across various tasks. However, these VLMs encounter significant challenges when applied to the remote sensing domain due to the inherent differences between remote sensing images and natural images. Existing remote sensing VLMs often fail to extract fine-grained visual features and suffer from visual forgetting during deep language processing. To address this, we introduce MF-RSVLM, a Multi-Feature Fusion Remote Sensing Vision-Language Model that effectively extracts and fuses visual features for RS understanding. MF-RSVLM learns multi-scale visual representations and combines global context with local details, improving the capture of small and complex structures in RS scenes. A recurrent visual feature injection scheme ensures the language model remains grounded in visual evidence and reduces visual forgetting during generation. Extensive experiments on diverse RS benchmarks show that MF-RSVLM achieves state-of-the-art or highly competitive performance across remote sensing classification, image captioning, and VQA tasks. Our code is publicly available at https://github.com/Yunkaidang/RSVLM.
[89] RSAgent: Learning to Reason and Act for Text-Guided Segmentation via Multi-Turn Tool Invocations
Xingqi He,Yujie Zhang,Shuyong Gao,Wenjie Li,Lingyi Hong,Mingxi Chen,Kaixun Jiang,Jiyuan Fu,Wenqiang Zhang
Main category: cs.CV
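TL;DR: This paper proposes RSAgent, an agentic method built on a multimodal large language model that performs text-guided object segmentation through multi-turn tool invocations, iteratively refining masks with visual feedback and reaching state-of-the-art performance on several benchmarks.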
Details
Motivation: Most existing text-guided segmentation methods run a single inference pass and cannot correct an initial localization error, which limits precision and robustness.
Method: RSAgent works over multiple turns: it calls a segmentation toolbox, observes visual feedback, and revises its spatial hypothesis using historical observations. The authors build a data pipeline for multi-turn reasoning trajectories and train in two stages: cold-start supervised fine-tuning followed by agentic reinforcement learning with fine-grained rewards.
Result: Zero-shot gIoU on the ReasonSeg test set reaches 66.5%, a 9% improvement over Seg-Zero-7B, and cIoU on RefCOCOg reaches 81.5%, with state-of-the-art results on both in-domain and out-of-domain benchmarks.
Conclusion: RSAgent's agentic reasoning framework delivers stronger cross-modal reasoning and pixel grounding, supports a segmentation process that can verify, refocus, and refine, and markedly improves text-guided segmentation performance.
Abstract: Text-guided object segmentation requires both cross-modal reasoning and pixel grounding abilities. Most recent methods treat text-guided segmentation as one-shot grounding, where the model predicts pixel prompts in a single forward pass to drive an external segmentor, which limits verification, refocusing and refinement when initial localization is wrong. To address this limitation, we propose RSAgent, an agentic Multimodal Large Language Model (MLLM) which interleaves reasoning and action for segmentation via multi-turn tool invocations. RSAgent queries a segmentation toolbox, observes visual feedback, and revises its spatial hypothesis using historical observations to re-localize targets and iteratively refine masks. We further build a data pipeline to synthesize multi-turn reasoning segmentation trajectories, and train RSAgent with a two-stage framework: cold-start supervised fine-tuning followed by agentic reinforcement learning with fine-grained, task-specific rewards. Extensive experiments show that RSAgent achieves a zero-shot performance of 66.5% gIoU on ReasonSeg test, improving over Seg-Zero-7B by 9%, and reaches 81.5% cIoU on RefCOCOg, demonstrating state-of-the-art performance on both in-domain and out-of-domain benchmarks.
[90] PipeFlow: Pipelined Processing and Motion-Aware Frame Selection for Long-Form Video Editing
Mustafa Munir,Md Mostafijur Rahman,Kartikeya Bhardwaj,Paul Whatmough,Radu Marculescu
Main category: cs.CV
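TL;DR: PipeFlow is a scalable long-form video editing method that skips low-motion frames, parallelizes processing, and uses neural network interpolation, so editing time grows linearly with video length and efficiency improves substantially.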
Details
Motivation: Long-form video editing is challenging because computational cost grows exponentially with sequence length, and existing methods scale poorly in joint editing and DDIM inversion.
Method: PipeFlow introduces three innovations: skipping low-motion frames based on SSIM and optical flow analysis; a pipelined task scheduling algorithm that splits the video into segments and runs DDIM inversion and editing in parallel; and neural network interpolation that smooths border frames and fills in skipped frames.
Result: PipeFlow's editing time grows linearly with video length, with up to a 9.6x speedup over TokenFlow and a 31.7x speedup over Diffusion Motion Transfer.
Conclusion: Through segmented processing and its optimization strategies, PipeFlow relieves the computational bottleneck in long-form video editing and can in principle support efficient editing of videos of unbounded length.
Abstract: Long-form video editing poses unique challenges due to the exponential increase in the computational cost from joint editing and Denoising Diffusion Implicit Models (DDIM) inversion across extended sequences. To address these limitations, we propose PipeFlow, a scalable, pipelined video editing method that introduces three key innovations: First, based on a motion analysis using Structural Similarity Index Measure (SSIM) and Optical Flow, we identify and propose to skip editing of frames with low motion. Second, we propose a pipelined task scheduling algorithm that splits a video into multiple segments and performs DDIM inversion and joint editing in parallel based on available GPU memory. Lastly, we leverage a neural network-based interpolation technique to smooth out the border frames between segments and interpolate the previously skipped frames. Our method uniquely scales to longer videos by dividing them into smaller segments, allowing PipeFlow's editing time to increase linearly with video length. In principle, this enables editing of infinitely long videos without the growing per-frame computational overhead encountered by other methods. PipeFlow achieves up to a 9.6X speedup compared to TokenFlow and a 31.7X speedup over Diffusion Motion Transfer (DMT).
[91] Reinforced Diffusion: Learning to Push the Limits of Anisotropic Diffusion for Image Denoising
Xinran Qin,Yuhui Quan,Ruotao Xu,Hui Ji
Main category: cs.CV
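TL;DR: Proposes a trainable anisotropic diffusion framework based on reinforcement learning for image denoising; it adapts to complex image structures, outperforms traditional diffusion methods on several noise types, and is comparable to deep CNN methods.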
Details
Motivation: Traditional anisotropic diffusion methods use explicit diffusion operators that adapt poorly to complex image structures, so their performance is limited and cannot compete with recent learning-based methods.
Method: Denoising is modeled as a sequence of diffusion actions optimized by deep Q-learning; reinforcement learning learns the order of actions automatically, producing a stochastic anisotropic diffusion process with strong adaptivity.
Result: The method performs well on three common noise types, outperforming existing diffusion-based methods and matching representative deep CNN methods.
Conclusion: The reinforcement-learning-based anisotropic diffusion framework improves the adaptivity and denoising performance of traditional diffusion methods and offers a new approach for image restoration tasks.
Abstract: Image denoising is an important problem in low-level vision and serves as a critical module for many image recovery tasks. Anisotropic diffusion is a wide family of image denoising approaches with promising performance. However, traditional anisotropic diffusion approaches use explicit diffusion operators which are not well adapted to complex image structures. As a result, their performance is limited compared to recent learning-based approaches. In this work, we describe a trainable anisotropic diffusion framework based on reinforcement learning. By modeling the denoising process as a series of naive diffusion actions with order learned by deep Q-learning, we propose an effective diffusion-based image denoiser. The diffusion actions selected by deep Q-learning at different iterations indeed composite a stochastic anisotropic diffusion process with strong adaptivity to different image structures, which enjoys improvement over the traditional ones. The proposed denoiser is applied to removing three types of often-seen noise. The experiments show that it outperforms existing diffusion-based methods and competes with the representative deep CNN-based methods.
[92] Neighbor-aware Instance Refining with Noisy Labels for Cross-Modal Retrieval
Yizhi Liu,Ruitao Pu,Shilin Xu,Yingke Chen,Quan-Hui Liu,Yuan Sun
Main category: cs.CV
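TL;DR: This paper proposes NIRNL, a new robust cross-modal learning framework for noisy labels that improves retrieval performance through cross-modal margin preserving and neighbor-aware instance refining.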
Details
Motivation: Annotating cross-modal data is time-consuming and labor-intensive and easily introduces noise that hurts retrieval performance; existing methods struggle to balance model performance, label calibration reliability, and data utilization at the same time.
Method: Cross-modal Margin Preserving (CMP) strengthens discrimination between sample pairs, and Neighbor-aware Instance Refining (NIR) uses cross-modal neighborhood consensus to partition the data into pure, hard, and noisy subsets, each of which then gets a tailored optimization strategy.
Result: Experiments on three benchmark datasets show that NIRNL remains highly robust at high noise rates and achieves state-of-the-art performance.
Conclusion: NIRNL improves cross-modal retrieval under noisy labels and balances model performance, label calibration, and data utilization.
Abstract: In recent years, Cross-Modal Retrieval (CMR) has made significant progress in the field of multi-modal analysis. However, since it is time-consuming and labor-intensive to collect large-scale and well-annotated data, the annotation of multi-modal data inevitably contains some noise. This will degrade the retrieval performance of the model. To tackle the problem, numerous robust CMR methods have been developed, including robust learning paradigms, label calibration strategies, and instance selection mechanisms. Unfortunately, they often fail to simultaneously satisfy model performance ceilings, calibration reliability, and data utilization rate. To overcome the limitations, we propose a novel robust cross-modal learning framework, namely Neighbor-aware Instance Refining with Noisy Labels (NIRNL). Specifically, we first propose Cross-modal Margin Preserving (CMP) to adjust the relative distance between positive and negative pairs, thereby enhancing the discrimination between sample pairs. Then, we propose Neighbor-aware Instance Refining (NIR) to identify pure subset, hard subset, and noisy subset through cross-modal neighborhood consensus. Afterward, we construct different tailored optimization strategies for this fine-grained partitioning, thereby maximizing the utilization of all available data while mitigating error propagation. Extensive experiments on three benchmark datasets demonstrate that NIRNL achieves state-of-the-art performance, exhibiting remarkable robustness, especially under high noise rates.
[93] Pathology Context Recalibration Network for Ocular Disease Recognition
Zunjie Xiao,Xiaoqing Zhang,Risa Higashita,Jiang Liu
Main category: cs.CV
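TL;DR: This paper proposes PCRNet, a new network that combines pathology context priors with expert experience priors to improve ocular disease recognition and decision interpretability; it comprises a PRM module, an EPGA adapter, and an Integrated Loss (IL), and outperforms existing methods on multiple datasets.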
Details
Motivation: Deep neural networks for ocular disease recognition overlook clinical pathology context and expert experience priors, which limits recognition performance and decision interpretability.
Method: The authors propose a Pathology Recalibration Module (PRM) and an Expert Prior Guidance Adapter (EPGA), build PCRNet from them, and introduce an Integrated Loss (IL) to improve training.
Result: On three ocular disease datasets, PCRNet with IL clearly outperforms existing attention mechanisms and advanced loss methods, and visualization analysis confirms the effect of PRM and EPGA on the decision process.
Conclusion: By fusing pathology context and expert experience priors, PCRNet improves both the accuracy and the interpretability of automated ocular disease recognition.
Abstract: Pathology context and expert experience play significant roles in clinical ocular disease diagnosis. Although deep neural networks (DNNs) have good ocular disease recognition results, they often ignore exploring the clinical pathology context and expert experience priors to improve ocular disease recognition performance and decision-making interpretability. To this end, we first develop a novel Pathology Recalibration Module (PRM) to leverage the potential of pathology context prior via the combination of the well-designed pixel-wise context compression operator and pathology distribution concentration operator; then this paper applies a novel Expert Prior Guidance Adapter (EPGA) to further highlight significant pixel-wise representation regions by fully mining the expert experience prior. By incorporating PRM and EPGA into the modern DNN, the PCRNet is constructed for automated ocular disease recognition. Additionally, we introduce an Integrated Loss (IL) to boost the ocular disease recognition performance of PCRNet by considering the effects of sample-wise loss distributions and training label frequencies. The extensive experiments on three ocular disease datasets demonstrate the superiority of PCRNet with IL over state-of-the-art attention-based networks and advanced loss methods. Further visualization analysis explains the inherent behavior of PRM and EPGA that affects the decision-making process of DNNs.
[94] Balanced Hierarchical Contrastive Learning with Decoupled Queries for Fine-grained Object Detection in Remote Sensing Images
Jingzhou Chen,Dexin Chen,Fengchao Xiong,Yuntao Qian,Liang Xiao
Main category: cs.CV
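TL;DR: This paper proposes a balanced hierarchical contrastive loss and a decoupled learning strategy to improve fine-grained detection in remote sensing images with hierarchical label structures.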
Details
Motivation: Existing methods that apply supervised contrastive learning to hierarchical labels overlook imbalanced data distribution across the label hierarchy and the way learning semantic relationships interferes with localization.
Method: A balanced hierarchical contrastive loss introduces learnable class prototypes and equalizes the gradient contribution of classes at each level; within the DETR framework, a decoupled strategy splits object queries into classification and localization sets for task-specific feature extraction.
Result: Experiments on three fine-grained datasets with hierarchical annotations show that the method outperforms existing state-of-the-art approaches.
Conclusion: By balancing gradient contributions and decoupling classification from localization, the method improves fine-grained detection in remote sensing images.
Abstract: Fine-grained remote sensing datasets often use hierarchical label structures to differentiate objects in a coarse-to-fine manner, with each object annotated across multiple levels. However, embedding this semantic hierarchy into the representation learning space to improve fine-grained detection performance remains challenging. Previous studies have applied supervised contrastive learning at different hierarchical levels to group objects under the same parent class while distinguishing sibling subcategories. Nevertheless, they overlook two critical issues: (1) imbalanced data distribution across the label hierarchy causes high-frequency classes to dominate the learning process, and (2) learning semantic relationships among categories interferes with class-agnostic localization. To address these issues, we propose a balanced hierarchical contrastive loss combined with a decoupled learning strategy within the detection transformer (DETR) framework. The proposed loss introduces learnable class prototypes and equilibrates gradients contributed by different classes at each hierarchical level, ensuring that each hierarchical class contributes equally to the loss computation in every mini-batch. The decoupled strategy separates DETR's object queries into classification and localization sets, enabling task-specific feature extraction and optimization. Experiments on three fine-grained datasets with hierarchical annotations demonstrate that our method outperforms state-of-the-art approaches.
[95] RainFusion2.0: Temporal-Spatial Awareness and Hardware-Efficient Block-wise Sparse Attention
Aiyue Chen,Yaofu Liu,Junjian Huang,Guang Lian,Yiwu Yao,Wangli Lan,Jing Lin,Zhixin Ma,Tingting Zhou,Harry Yang
Main category: cs.CV
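TL;DR: RainFusion2.0 proposes an online adaptive, hardware-efficient sparse attention mechanism for accelerating video and image generation models, achieving a 1.5~1.8x end-to-end speedup while preserving quality and supporting multiple hardware platforms.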
Details
Motivation: Diffusion Transformers are too computationally expensive for generation tasks, and existing sparse attention methods suffer from high prediction overhead and poor hardware generality.
Method: Block-wise means serve as representative tokens for sparse mask prediction, tokens are permuted in a spatiotemporal-aware way, and a first-frame sink mechanism is introduced for video generation.
Result: The method reaches 80% sparsity with a 1.5~1.8x end-to-end speedup and no loss in video quality, and applies to a range of generative models and hardware platforms.
Conclusion: RainFusion2.0 is an efficient, low-overhead, hardware-general sparse attention scheme that markedly improves DiT inference efficiency across diverse hardware.
Abstract: In video and image generation tasks, Diffusion Transformer (DiT) models incur extremely high computational costs due to attention mechanisms, which limits their practical applications. Furthermore, with hardware advancements, a wide range of devices besides graphics processing unit (GPU), such as application-specific integrated circuit (ASIC), have been increasingly adopted for model inference. Sparse attention, which leverages the inherent sparsity of attention by skipping computations for insignificant tokens, is an effective approach to mitigate computational costs. However, existing sparse attention methods have two critical limitations: the overhead of sparse pattern prediction and the lack of hardware generality, as most of these methods are designed for GPU. To address these challenges, this study proposes RainFusion2.0, which aims to develop an online adaptive, hardware-efficient, and low-overhead sparse attention mechanism to accelerate both video and image generative models, with robust performance across diverse hardware platforms. Key technical insights include: (1) leveraging block-wise mean values as representative tokens for sparse mask prediction; (2) implementing spatiotemporal-aware token permutation; and (3) introducing a first-frame sink mechanism specifically designed for video generation scenarios. Experimental results demonstrate that RainFusion2.0 can achieve 80% sparsity while achieving an end-to-end speedup of 1.5~1.8x without compromising video quality.
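The first technical insight, using block-wise means as representative tokens for mask prediction, can be sketched in a few lines. The code below is a minimal pure-Python illustration, not RainFusion2.0's implementation: the top-k selection rule, the `keep_ratio` parameter, and the function names `block_means` and `block_sparse_mask` are all assumptions, and the token permutation and first-frame sink steps are left out.

```python
import math


def block_means(tokens, block_size):
    """Mean-pool consecutive groups of `block_size` tokens into one vector each."""
    dim = len(tokens[0])
    n_blocks = len(tokens) // block_size
    means = []
    for b in range(n_blocks):
        block = tokens[b * block_size:(b + 1) * block_size]
        means.append([sum(tok[d] for tok in block) / block_size for d in range(dim)])
    return means


def block_sparse_mask(queries, keys, block_size, keep_ratio):
    """Return a block-level boolean mask: mask[i][j] is True when query block i
    should attend to key block j.

    Scores are computed only between block means, so the prediction cost scales
    with the number of blocks rather than the number of tokens.
    """
    q_rep = block_means(queries, block_size)
    k_rep = block_means(keys, block_size)
    dim = len(queries[0])
    keep = max(1, round(keep_ratio * len(k_rep)))  # key blocks kept per query block
    mask = []
    for q_vec in q_rep:
        scores = [
            sum(a * b for a, b in zip(q_vec, k_vec)) / math.sqrt(dim)
            for k_vec in k_rep
        ]
        top = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:keep]
        kept = set(top)
        mask.append([j in kept for j in range(len(scores))])
    return mask
```

With `keep_ratio=0.2`, each query block would attend to roughly one fifth of the key blocks, which corresponds to the 80% sparsity level quoted in the abstract; a real kernel would then skip the masked blocks entirely instead of materializing a dense mask.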
Moreover, RainFusion2.0 demonstrates effectiveness across various generative models and validates its generalization across diverse hardware platforms.
[96] Factorized Learning for Temporally Grounded Video-Language Models
Wenzheng Zeng,Difei Gao,Mike Zheng Shou,Hwee Tou Ng
Main category: cs.CV
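TL;DR: This paper proposes the D$^2$VLM framework and the factorized preference optimization (FPO) algorithm, which decouple temporal grounding from textual response in video understanding to improve event-level perception.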
Details
Motivation: Existing video-language models usually handle temporal grounding and textual response in a coupled way without a logical hierarchy, which leads to sub-optimal objectives.
Method: D$^2$VLM adopts a "grounding then answering" paradigm, introduces evidence tokens to capture event-level semantics, and uses the FPO algorithm to fold temporal grounding modeling into the preference learning objective. A synthetic dataset is also built for factorized preference learning.
Result: Experiments show that the method outperforms existing approaches on multiple tasks, with clear gains in temporal grounding accuracy and answer reliability.
Conclusion: Factorized, decoupled learning with FPO strengthens the event-level understanding of video-language models and offers a new paradigm for future research.
Abstract: Recent video-language models have shown great potential for video understanding, but still struggle with accurate temporal grounding for event-level perception. We observe that two main factors in video understanding (i.e., temporal grounding and textual response) form a logical hierarchy: accurate temporal evidence grounding lays the foundation for reliable textual response. However, existing works typically handle these two tasks in a coupled manner without a clear logical structure, leading to sub-optimal objectives. We address this from a factorized learning perspective. We first propose D$^2$VLM, a framework that decouples the learning of these two tasks while also emphasizing their inherent dependency. We adopt a "grounding then answering with evidence referencing" paradigm and introduce evidence tokens for evidence grounding, which emphasize event-level visual semantic capture beyond the focus on timestamp representation in existing works. To further facilitate the learning of these two tasks, we introduce a novel factorized preference optimization (FPO) algorithm. Unlike standard preference optimization, FPO explicitly incorporates probabilistic temporal grounding modeling into the optimization objective, enabling preference learning for both temporal grounding and textual response. We also construct a synthetic dataset to address the lack of suitable datasets for factorized preference learning with explicit temporal grounding. Experiments on various tasks demonstrate the clear advantage of our approach. Our source code is available at https://github.com/nusnlp/d2vlm.
[97] Think Before You Move: Latent Motion Reasoning for Text-to-Motion Generation
Yijie Qian,Juncheng Wang,Yuxiang Feng,Chao Xu,Wang Lu,Yang Liu,Baigui Sun,Yiqiang Chen,Yong Liu,Shujun Wang
Main category: cs.CV
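TL;DR: This paper proposes Latent Motion Reasoning (LMR), a new text-to-motion generation framework whose two-stage think-then-act mechanism addresses the semantic-kinematic impedance mismatch between linguistic semantics and kinematic data.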
Details
Motivation: Existing text-to-motion methods face a fundamental mismatch between semantics and motion data when handling complex semantics, and struggle to map discrete linguistic intent onto continuous, high-frequency motion sequences.
Method: Inspired by hierarchical motor control in cognitive science, the authors propose a Latent System 2 Reasoning architecture and a dual-granularity tokenizer that splits motion into a reasoning latent for global trajectory planning and an execution latent that preserves physical detail, giving a two-stage reason-then-generate process.
Result: LMR yields clear improvements on two baselines, T2M-GPT and MotionStreamer, in both semantic alignment and physical plausibility.
Conclusion: The best planning space for motion generation is not natural language but a learned, motion-aligned latent concept space, and LMR offers a better architectural paradigm for text-driven motion generation.
Abstract: Current state-of-the-art paradigms predominantly treat Text-to-Motion (T2M) generation as a direct translation problem, mapping symbolic language directly to continuous poses. While effective for simple actions, this System 1 approach faces a fundamental theoretical bottleneck we identify as the Semantic-Kinematic Impedance Mismatch: the inherent difficulty of grounding semantically dense, discrete linguistic intent into kinematically dense, high-frequency motion data in a single shot. In this paper, we argue that the solution lies in an architectural shift towards Latent System 2 Reasoning. Drawing inspiration from Hierarchical Motor Control in cognitive science, we propose Latent Motion Reasoning (LMR) that reformulates generation as a two-stage Think-then-Act decision process. Central to LMR is a novel Dual-Granularity Tokenizer that disentangles motion into two distinct manifolds: a compressed, semantically rich Reasoning Latent for planning global topology, and a high-frequency Execution Latent for preserving physical fidelity. By forcing the model to autoregressively reason (plan the coarse trajectory) before it moves (instantiates the frames), we effectively bridge the ineffability gap between language and physics. We demonstrate LMR's versatility by implementing it for two representative baselines: T2M-GPT (discrete) and MotionStreamer (continuous). Extensive experiments show that LMR yields non-trivial improvements in both semantic alignment and physical plausibility, validating that the optimal substrate for motion planning is not natural language, but a learned, motion-aligned concept space.
Code and demos can be found at https://chenhaoqcdyq.github.io/LMR/
[98] Guided Diffusion-based Generation of Adversarial Objects for Real-World Monocular Depth Estimation Attacks
Yongtao Chen,Yanbo Wang,Wentao Zhao,Guole Shen,Tianchen Deng,Jingchuan Wang
Main category: cs.CV
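TL;DR: Proposes a training-free generative adversarial attack framework that uses a diffusion model to generate natural, scene-consistent adversarial objects, effectively disrupting monocular depth estimation and improving safety evaluation of autonomous driving systems.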
Details
Motivation: Existing physical attacks based on texture patches have strict placement constraints and limited realism in complex driving environments, which limits their effectiveness.
Method: A diffusion-based conditional generation framework for adversarial attacks, with a Salient Region Selection module and a Jacobian Vector Product Guidance mechanism, generates physically plausible adversarial objects.
Result: In digital and physical experiments, the method clearly outperforms existing attacks in effectiveness, stealthiness, and physical deployability.
Conclusion: The method generates natural adversarial objects that strongly disrupt depth estimation, providing a new and effective tool for safety assessment of autonomous driving systems.
Abstract: Monocular Depth Estimation (MDE) serves as a core perception module in autonomous driving systems, but it remains highly susceptible to adversarial attacks. Errors in depth estimation may propagate through downstream decision making and influence overall traffic safety. Existing physical attacks primarily rely on texture-based patches, which impose strict placement constraints and exhibit limited realism, thereby reducing their effectiveness in complex driving environments. To overcome these limitations, this work introduces a training-free generative adversarial attack framework that generates naturalistic, scene-consistent adversarial objects via a diffusion-based conditional generation process. The framework incorporates a Salient Region Selection module that identifies regions most influential to MDE and a Jacobian Vector Product Guidance mechanism that steers adversarial gradients toward update directions supported by the pre-trained diffusion model. This formulation enables the generation of physically plausible adversarial objects capable of inducing substantial adversarial depth shifts. Extensive digital and physical experiments demonstrate that our method significantly outperforms existing attacks in effectiveness, stealthiness, and physical deployability, underscoring its strong practical implications for autonomous driving safety assessment.
[99] GeoBench: Rethinking Multimodal Geometric Problem-Solving via Hierarchical Evaluation
Yuan Feng,Yue Yang,Xiaohan He,Jiatong Zhao,Jianlong Chen,Zijun Chen,Daocheng Fu,Qi Liu,Renqiu Xia,Bo Zhang,Junchi Yan
Main category: cs.CV
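TL;DR: This paper presents GeoBench, a hierarchical benchmark for evaluating the reasoning ability of vision-language models on geometric problem solving; it exposes performance bottlenecks of current models on complex tasks and shows that sub-goal decomposition and irrelevant premise filtering are critical to accuracy.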
Details
Motivation: Existing evaluations of geometric reasoning suffer from data contamination, emphasis on answers over process, and insufficient diagnostic granularity, and there is no systematic way to assess VLMs at each level of geometric reasoning.
Method: GeoBench defines four reasoning levels (Visual Perception, Goal-Oriented Planning, Rigorous Theorem Application, and Self-Reflective Backtracking) and evaluates them systematically through six formally verified tasks generated with TrustGeoGen.
Result: Reasoning models such as OpenAI-o3 outperform general MLLMs, but performance drops significantly as task complexity rises; sub-goal decomposition and irrelevant premise filtering are critical to accuracy, while Chain-of-Thought prompting lowers performance on some tasks.
Conclusion: GeoBench provides a comprehensive evaluation framework for geometric problem solving and actionable guidelines for building systems with deep geometric reasoning ability.
Abstract: Geometric problem solving constitutes a critical branch of mathematical reasoning, requiring precise analysis of shapes and spatial relationships. Current evaluations of geometric reasoning in vision-language models (VLMs) face limitations, including the risk of test data contamination from textbook-based benchmarks, overemphasis on final answers over reasoning processes, and insufficient diagnostic granularity. To address these issues, we present GeoBench, a hierarchical benchmark featuring four reasoning levels in geometric problem-solving: Visual Perception, Goal-Oriented Planning, Rigorous Theorem Application, and Self-Reflective Backtracking. Through six formally verified tasks generated via TrustGeoGen, we systematically assess capabilities ranging from attribute extraction to logical error correction. Experiments reveal that while reasoning models like OpenAI-o3 outperform general MLLMs, performance declines significantly with increasing task complexity. Key findings demonstrate that sub-goal decomposition and irrelevant premise filtering critically influence final problem-solving accuracy, whereas Chain-of-Thought prompting unexpectedly degrades performance in some tasks. These findings establish GeoBench as a comprehensive benchmark while offering actionable guidelines for developing geometric problem-solving systems.
[100] Enhancing LLM-Based Neural Network Generation: Few-Shot Prompting and Efficient Validation for Automated Architecture Design
Chandini Vysyaraju,Raghuvir Duvvuri,Avi Goyal,Dmitry Ignatov,Radu Timofte
Main category: cs.CV
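TL;DR: This paper proposes two new methods, Few-Shot Architecture Prompting (FSAP) and Whitespace-Normalized Hash Validation, to improve the efficiency and diversity of LLM-based neural network architecture generation for computer vision, and validates them in large-scale experiments.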
Details
Motivation: Automated neural network architecture design remains challenging in computer vision: existing neural architecture search (NAS) methods are computationally expensive, and although large language models (LLMs) are promising, prompt engineering and validation strategies have not been studied systematically.
Method: Building on the NNGPT/LEMUR framework, FSAP systematically studies how the number of supporting examples (n = 1 to 6) affects architecture generation, and Whitespace-Normalized Hash Validation provides lightweight deduplication that avoids redundant training.
Result: n = 3 gives the best balance between diversity and context focus; the hash validation is 100x faster than AST parsing (under 1 ms) and prevents training duplicate architectures; 1,900 unique architectures were generated across seven vision benchmarks, and a dataset-balanced evaluation method is proposed to support cross-task comparison.
Conclusion: The work provides practical guidelines and rigorous evaluation practices for LLM-based architecture search in computer vision and lowers compute requirements so that more researchers can take part in automated architecture design.
Abstract: Automated neural network architecture design remains a significant challenge in computer vision. Task diversity and computational constraints require both effective architectures and efficient search methods. Large Language Models (LLMs) present a promising alternative to computationally intensive Neural Architecture Search (NAS), but their application to architecture generation in computer vision has not been systematically studied, particularly regarding prompt engineering and validation strategies. Building on the task-agnostic NNGPT/LEMUR framework, this work introduces and validates two key contributions for computer vision. First, we present Few-Shot Architecture Prompting (FSAP), the first systematic study of the number of supporting examples (n = 1, 2, 3, 4, 5, 6) for LLM-based architecture generation. We find that using n = 3 examples best balances architectural diversity and context focus for vision tasks. Second, we introduce Whitespace-Normalized Hash Validation, a lightweight deduplication method (less than 1 ms) that provides a 100x speedup over AST parsing and prevents redundant training of duplicate computer vision architectures. In large-scale experiments across seven computer vision benchmarks (MNIST, CIFAR-10, CIFAR-100, CelebA, ImageNette, SVHN, Places365), we generated 1,900 unique architectures. We also introduce a dataset-balanced evaluation methodology to address the challenge of comparing architectures across heterogeneous vision tasks.
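The deduplication idea is simple enough to sketch directly. The example below is one plausible reading of whitespace-normalized hash validation, not the authors' code: the abstract does not specify the normalization rule or the hash function, so collapsing every whitespace run to a single space, the choice of SHA-256, and the names `normalized_hash` and `ArchitectureRegistry` are all assumptions.

```python
import hashlib


def normalized_hash(source: str) -> str:
    """Hash generated model code after collapsing all whitespace runs.

    Assumption: "whitespace-normalized" means formatting-only differences
    (indent width, blank lines, trailing spaces) should not change the hash.
    """
    canonical = " ".join(source.split())
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ArchitectureRegistry:
    """Remembers hashes of architectures already queued for training."""

    def __init__(self):
        self._seen = set()

    def is_new(self, source: str) -> bool:
        """Return True and record the code if it has not been seen before."""
        digest = normalized_hash(source)
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True


if __name__ == "__main__":
    registry = ArchitectureRegistry()
    first = "class Net:\n    def forward(self, x):\n        return x\n"
    reformatted = "class Net:\n\tdef forward(self, x):\n\n\t\treturn x"
    print(registry.is_new(first))        # first sighting is accepted
    print(registry.is_new(reformatted))  # same code, different whitespace
```

The trade-off against AST parsing is that a string hash is fast but blunt. It cannot detect duplicates that differ in variable names or comments, and because indentation is significant in Python, aggressive collapsing could in rare cases merge two programs that are not actually equivalent.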
These contributions provide actionable guidelines for LLM-based architecture search in computer vision and establish rigorous evaluation practices, making automated design more accessible to researchers with limited computational resources.[101] Taming Preference Mode Collapse via Directional Decoupling Alignment in Diffusion Reinforcement Learning
Chubin Chen,Sujie Hu,Jiashu Zhu,Meiqi Wu,Jintao Chen,Yanxun Li,Nisha Huang,Chengyu Fang,Jiahong Wu,Xiangxiang Chu,Xiu Li
Main category: cs.CV
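TL;DR: This paper proposes D²-Align, a new method that mitigates Preference Mode Collapse (PMC) when text-to-image diffusion models are trained with Reinforcement Learning from Human Feedback, using directional decoupling alignment to preserve generative diversity.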
Details
Motivation: Existing methods score well on automated reward metrics but tend to reduce the diversity of generated outputs, a failure known as Preference Mode Collapse (PMC). This paper aims to address that problem.
Method: The authors propose the DivGenBench benchmark to quantify PMC and design the D²-Align framework, which learns a directional correction in the embedding space of a frozen reward model and applies it to the reward signal during optimization, thereby avoiding mode collapse.
Result: Experiments show that D²-Align significantly improves diversity while keeping image quality high, outperforming existing methods on both automated metrics and human preference evaluation.
Conclusion: D²-Align effectively mitigates over-optimization caused by reward model biases and offers a new direction for building more robust and diverse aligned models.
Abstract: Recent studies have demonstrated significant progress in aligning text-to-image diffusion models with human preference via Reinforcement Learning from Human Feedback. However, while existing methods achieve high scores on automated reward metrics, they often lead to Preference Mode Collapse (PMC), a specific form of reward hacking where models converge on narrow, high-scoring outputs (e.g., images with monolithic styles or pervasive overexposure), severely degrading generative diversity. In this work, we introduce and quantify this phenomenon, proposing DivGenBench, a novel benchmark designed to measure the extent of PMC. We posit that this collapse is driven by over-optimization along the reward model's inherent biases. Building on this analysis, we propose Directional Decoupling Alignment (D²-Align), a novel framework that mitigates PMC by directionally correcting the reward signal. Specifically, our method first learns a directional correction within the reward model's embedding space while keeping the model frozen. This correction is then applied to the reward signal during the optimization process, preventing the model from collapsing into specific modes and thereby maintaining diversity. Our comprehensive evaluation, combining qualitative analysis with quantitative metrics for both quality and diversity, reveals that D²-Align achieves superior alignment with human preference.
[102] Towards Open-Vocabulary Industrial Defect Understanding with a Large-Scale Multimodal Dataset
TsaiChing Ni,ZhenQi Chen,YuanFu Yang
Main category: cs.CV
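TL;DR: This paper presents IMDD-1M, the first large-scale industrial multimodal defect dataset, with 1 million image-text pairs covering more than 60 materials and more than 400 defect types, each with expert-verified, fine-grained text descriptions. On this dataset the authors train a diffusion-based vision-language foundation model from scratch that adapts efficiently to specific industrial tasks. It reaches performance comparable to dedicated models with less than 5% of the task-specific data, showing the potential of data-efficient foundation models for industrial inspection and generation.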
Details
Motivation: Existing industrial defect detection datasets lack rich semantic information and aligned multimodal data, which limits the use of multimodal learning in manufacturing quality inspection. A large-scale, high-quality dataset of aligned defect images and text is needed to move the field forward.
Method: The authors build IMDD-1M, a dataset of 1 million pairs of high-resolution real-world defect images and fine-grained text descriptions, covering 60+ material categories and 400+ defect types. On this dataset they train a diffusion-based vision-language foundation model from scratch and adapt it across tasks through lightweight fine-tuning.
Result: Across a range of industrial inspection tasks, the model matches dedicated expert models while using less than 5% of the task-specific data, supporting the data efficiency and generalization of foundation models in industrial settings.
Conclusion: IMDD-1M is an important resource for industrial multimodal learning. It shows that pre-training a foundation model on large-scale industrial multimodal data and then fine-tuning it lightly is an effective route to scalable, domain-adaptive manufacturing intelligence.
Abstract: We present IMDD-1M, the first large-scale Industrial Multimodal Defect Dataset comprising 1,000,000 aligned image-text pairs, designed to advance multimodal learning for manufacturing and quality inspection. IMDD-1M contains high-resolution real-world defects spanning over 60 material categories and more than 400 defect types, each accompanied by expert-verified annotations and fine-grained textual descriptions detailing defect location, severity, and contextual attributes. This dataset enables a wide spectrum of applications, including classification, segmentation, retrieval, captioning, and generative modeling. Building upon IMDD-1M, we train a diffusion-based vision-language foundation model from scratch, specifically tailored for industrial scenarios. The model serves as a generalizable foundation that can be efficiently adapted to specialized domains through lightweight fine-tuning. With less than 5% of the task-specific data required by dedicated expert models, it achieves comparable performance, highlighting the potential of data-efficient foundation model adaptation for industrial inspection and generation, paving the way for scalable, domain-adaptive, and knowledge-grounded manufacturing intelligence.
[103] Bayesian Self-Distillation for Image Classification
Anton Adelöw,Matteo Gamba,Atsuto Maki
Main category: cs.CV
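TL;DR: Proposes Bayesian Self-Distillation (BSD), which builds sample-specific target distributions through Bayesian inference without relying on hard targets, improving model accuracy, calibration, and robustness.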
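Calibration in this entry is reported as Expected Calibration Error (ECE). For readers unfamiliar with the metric, the sketch below computes the standard equal-width-bin ECE from per-sample confidences and correctness flags. The bin count of 10 and the function name are illustrative choices, not details taken from the paper.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - confidence| per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Samples whose top-class confidence falls in the half-open bin (lo, hi].
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by the fraction of samples in the bin
    return ece


if __name__ == "__main__":
    # Four predictions made with 85% confidence, three of which are right.
    print(expected_calibration_error([0.85] * 4, [1, 1, 1, 0]))
```

A well-calibrated model has an ECE near zero: within each confidence bin, its accuracy matches its stated confidence.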
Details
Motivation: Existing self-distillation methods still depend on hard targets, which makes models overconfident and limits calibration, generalization, and robustness.
Method: BSD uses the model's own predictions to construct sample-specific soft target distributions via Bayesian inference, removing the dependence on hard targets after initialization.
Result: Across several architectures and datasets, BSD raises test accuracy (e.g. +1.4% for ResNet-50 on CIFAR-100), lowers Expected Calibration Error (ECE down 40%), and improves robustness to data corruptions, perturbations, and label noise. Combined with a contrastive loss, it achieves state-of-the-art robustness under label noise among single-stage, single-network methods.
Conclusion: BSD is a principled and effective self-distillation method that does not depend on hard targets and improves performance, calibration, and robustness together.
Abstract: Supervised training of deep neural networks for classification typically relies on hard targets, which promote overconfidence and can limit calibration, generalization, and robustness. Self-distillation methods aim to mitigate this by leveraging inter-class and sample-specific information present in the model's own predictions, but often remain dependent on hard targets, reducing their effectiveness. With this in mind, we propose Bayesian Self-Distillation (BSD), a principled method for constructing sample-specific target distributions via Bayesian inference using the model's own predictions. Unlike existing approaches, BSD does not rely on hard targets after initialization. BSD consistently yields higher test accuracy (e.g. +1.4% for ResNet-50 on CIFAR-100) and significantly lower Expected Calibration Error (ECE) (-40% ResNet-50, CIFAR-100) than existing architecture-preserving self-distillation methods for a range of deep architectures and datasets. Additional benefits include improved robustness against data corruptions, perturbations, and label noise. When combined with a contrastive loss, BSD achieves state-of-the-art robustness under label noise for single-stage, single-network methods.
[104] DiffThinker: Towards Generative Multimodal Reasoning with Diffusion Models
Zefeng He,Xiaoye Qu,Yafu Li,Tong Zhu,Siyuan Huang,Yu Cheng
Main category: cs.CV
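TL;DR: This paper proposes DiffThinker, a new generative multimodal reasoning paradigm that recasts multimodal reasoning as an image-to-image generation task and significantly outperforms existing MLLMs on complex vision-centric tasks.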
Details
Motivation: Existing Multimodal Large Language Models (MLLMs) rely mainly on text-based reasoning and perform poorly on vision-centric tasks that demand high spatial precision and long horizons, so a paradigm better suited to visual reasoning is needed.
Method: The authors propose DiffThinker, a diffusion-based framework that models multimodal reasoning as a native image-to-image generation task, and systematically analyze four of its properties: efficiency, controllability, native parallelism, and collaboration.
Result: Experiments in four domains (sequential planning, combinatorial optimization, constraint satisfaction, and spatial configuration) show that DiffThinker significantly outperforms GPT-5 (+314.2%), Gemini-3-Flash (+111.6%), and a fine-tuned Qwen3-VL-32B baseline (+39.0%).
Conclusion: Generative multimodal reasoning is a promising new route for vision-centric reasoning, and DiffThinker provides an effective framework for it.
Abstract: While recent Multimodal Large Language Models (MLLMs) have attained significant strides in multimodal reasoning, their reasoning processes remain predominantly text-centric, leading to suboptimal performance in complex long-horizon, vision-centric tasks. In this paper, we establish a novel Generative Multimodal Reasoning paradigm and introduce DiffThinker, a diffusion-based reasoning framework. Conceptually, DiffThinker reformulates multimodal reasoning as a native generative image-to-image task, achieving superior logical consistency and spatial precision in vision-centric tasks. We perform a systematic comparison between DiffThinker and MLLMs, providing the first in-depth investigation into the intrinsic characteristics of this paradigm, revealing four core properties: efficiency, controllability, native parallelism, and collaboration. Extensive experiments across four domains (sequential planning, combinatorial optimization, constraint satisfaction, and spatial configuration) demonstrate that DiffThinker significantly outperforms leading closed source models including GPT-5 (+314.2%) and Gemini-3-Flash (+111.6%), as well as the fine-tuned Qwen3-VL-32B baseline (+39.0%), highlighting generative multimodal reasoning as a promising approach for vision-centric reasoning.
[105] Deep Global Clustering for Hyperspectral Image Segmentation: Concepts, Applications, and Open Challenges
Yu-Tang Chang,Pin-Wei Chen,Shih-Fang Chen
Main category: cs.CV
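TL;DR: Proposes Deep Global Clustering (DGC), a framework for memory-efficient hyperspectral image segmentation that learns global clustering structure from local image patches. It needs no pre-training and trains quickly on consumer hardware, but suffers from optimization instability when balancing its multi-objective loss.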
Details
Motivation: Hyperspectral imaging produces large data volumes under tight memory limits, and existing foundation models transfer poorly to specific domains such as close-range agricultural monitoring. A segmentation method is needed that requires no pre-training, is memory efficient, and suits domain-specific tasks.
Method: The DGC framework splits an image into small overlapping patches, learns global clustering structure from these local observations, and uses the overlapping regions to enforce consistency. The whole process requires no pre-training and offers constant memory usage and fast training.
Result: On a leaf disease dataset, DGC separates background from tissue (mean IoU 0.925) and demonstrates unsupervised disease detection. Training takes under 30 minutes on consumer hardware. However, representations degrade because clusters over-merge in feature space.
Conclusion: DGC is a conceptually viable direction for unsupervised hyperspectral image segmentation, with advantages in memory efficiency and training speed. Its practicality is limited by the optimization instability of multi-objective loss balancing, and stable training will require further work on dynamic loss balancing.
Abstract: Hyperspectral imaging (HSI) analysis faces computational bottlenecks due to massive data volumes that exceed available memory. While foundation models pre-trained on large remote sensing datasets show promise, their learned representations often fail to transfer to domain-specific applications like close-range agricultural monitoring where spectral signatures, spatial scales, and semantic targets differ fundamentally. This report presents Deep Global Clustering (DGC), a conceptual framework for memory-efficient HSI segmentation that learns global clustering structure from local patch observations without pre-training. DGC operates on small patches with overlapping regions to enforce consistency, enabling training in under 30 minutes on consumer hardware while maintaining constant memory usage. On a leaf disease dataset, DGC achieves background-tissue separation (mean IoU 0.925) and demonstrates unsupervised disease detection through navigable semantic granularity. However, the framework suffers from optimization instability rooted in multi-objective loss balancing: meaningful representations emerge rapidly but degrade due to cluster over-merging in feature space. We position this work as intellectual scaffolding - the design philosophy has merit, but stable implementation requires principled approaches to dynamic loss balancing. Code and data are available at https://github.com/b05611038/HSI_global_clustering.
[106] Guiding a Diffusion Transformer with the Internal Dynamics of Itself
Xingyu Zhou,Qifan Li,Xiaobin Hu,Hai Chen,Shuhang Gu
Main category: cs.CV
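TL;DR: This paper proposes Internal Guidance (IG), a simple yet effective strategy that adds auxiliary supervision on an intermediate layer during training and extrapolates the deep-layer output during sampling. It significantly improves the training efficiency and generation quality of diffusion models and reaches a state-of-the-art FID of 1.19 on ImageNet 256x256.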
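The sampling-time extrapolation mentioned in this summary can be pictured with the sketch below. It assumes the guided prediction has the same linear form as classifier-free guidance, with the intermediate-layer prediction playing the role of the weaker model. The entry states only that intermediate and deep outputs are extrapolated, so the exact formula, the weight `w`, and the function name are assumptions.

```python
import numpy as np


def internal_guidance(deep_out, intermediate_out, w):
    """Extrapolate from the intermediate-layer prediction through the deep-layer one.

    w = 1 returns the deep output unchanged; w > 1 pushes further away from
    the intermediate prediction (assumed CFG-style linear extrapolation).
    """
    deep_out = np.asarray(deep_out, dtype=float)
    intermediate_out = np.asarray(intermediate_out, dtype=float)
    return intermediate_out + w * (deep_out - intermediate_out)


if __name__ == "__main__":
    deep = np.array([2.0, 4.0])
    inter = np.array([1.0, 1.0])
    print(internal_guidance(deep, inter, 1.0))  # no guidance: the deep output
    print(internal_guidance(deep, inter, 2.0))  # extrapolated prediction
```

Under this assumption both predictions come from a single forward pass of one network, which is consistent with the entry's claim that no extra network or sampling steps are needed.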
Details
Motivation: Diffusion models generate poorly in low-probability regions. Existing guidance methods such as classifier-free guidance (CFG) tend to produce over-simplified or distorted samples, while methods that guide a model with a "bad version" of itself depend on carefully designed degradation and extra training, which limits their use.
Method: Internal Guidance (IG) adds auxiliary supervision on an intermediate layer during training and, during sampling, extrapolates the intermediate and deep layer outputs to produce the result, with no extra network or sampling steps.
Result: IG significantly improves several baselines. On ImageNet 256x256, SiT-XL/2+IG reaches FID=5.31 (80 epochs) and FID=1.75 (800 epochs), and LightningDiT-XL/1+IG reaches FID=1.34. Combined with CFG, the FID drops further to 1.19, the current state of the art.
Conclusion: IG is an efficient and simple guidance strategy that markedly improves the generation quality and training efficiency of diffusion models without extra network structure or sampling steps, making it broadly applicable and practical.
Abstract: The diffusion model presents a powerful ability to capture the entire (conditional) data distribution. However, due to the lack of sufficient training and data to learn to cover low-probability areas, the model will be penalized for failing to generate high-quality images corresponding to these areas. To achieve better generation quality, guidance strategies such as classifier free guidance (CFG) can guide the samples to the high-probability areas during the sampling stage. However, the standard CFG often leads to over-simplified or distorted samples. On the other hand, the alternative line of guiding diffusion model with its bad version is limited by carefully designed degradation strategies, extra training and additional sampling steps. In this paper, we proposed a simple yet effective strategy Internal Guidance (IG), which introduces an auxiliary supervision on the intermediate layer during training process and extrapolates the intermediate and deep layer's outputs to obtain generative results during sampling process. This simple strategy yields significant improvements in both training efficiency and generation quality on various baselines. On ImageNet 256x256, SiT-XL/2+IG achieves FID=5.31 and FID=1.75 at 80 and 800 epochs. More impressively, LightningDiT-XL/1+IG achieves FID=1.34 which achieves a large margin between all of these methods. Combined with CFG, LightningDiT-XL/1+IG achieves the current state-of-the-art FID of 1.19.
[107] PointRAFT: 3D deep learning for high-throughput prediction of potato tuber weight from partial point clouds
Pieter M. Blok,Haozhou Wang,Hyun Kwon Suh,Peicheng Wang,James Burridge,Wei Guo
Main category: cs.CV
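TL;DR: This paper proposes PointRAFT, a high-throughput point cloud regression network that predicts potato tuber weight directly from partial point clouds, avoiding the underestimation of weight caused by self-occlusion.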
Details
Motivation: Point clouds captured by RGB-D cameras are often incomplete because of self-occlusion, so potato weight is systematically underestimated. A method is needed that estimates weight accurately and directly from incomplete point clouds.
Method: The PointRAFT network uses an object height embedding as an additional geometric cue and regresses tuber weight directly from raw 3D point clouds, without reconstructing the full 3D shape. The model is trained and evaluated on 26,688 point clouds collected in the field.
Result: On the test set, PointRAFT achieves a mean absolute error of 12.0 g and a root mean squared error of 17.2 g, substantially better than the linear regression and PointNet++ baselines. A single inference takes only 6.3 ms, supporting processing rates of up to 150 tubers per second.
Conclusion: PointRAFT estimates potato weight efficiently and accurately, meets the high-throughput needs of commercial harvesters, and can be extended to other 3D phenotyping and robotic perception tasks.
Abstract: Potato yield is a key indicator for optimizing cultivation practices in agriculture. Potato yield can be estimated on harvesters using RGB-D cameras, which capture three-dimensional (3D) information of individual tubers moving along the conveyor belt. However, point clouds reconstructed from RGB-D images are incomplete due to self-occlusion, leading to systematic underestimation of tuber weight. To address this, we introduce PointRAFT, a high-throughput point cloud regression network that directly predicts continuous 3D shape properties, such as tuber weight, from partial point clouds. Rather than reconstructing full 3D geometry, PointRAFT infers target values directly from raw 3D data. Its key architectural novelty is an object height embedding that incorporates tuber height as an additional geometric cue, improving weight prediction under practical harvesting conditions. PointRAFT was trained and evaluated on 26,688 partial point clouds collected from 859 potato tubers across four cultivars and three growing seasons on an operational harvester in Japan. On a test set of 5,254 point clouds from 172 tubers, PointRAFT achieved a mean absolute error of 12.0 g and a root mean squared error of 17.2 g, substantially outperforming a linear regression baseline and a standard PointNet++ regression network. With an average inference time of 6.3 ms per point cloud, PointRAFT supports processing rates of up to 150 tubers per second, meeting the high-throughput requirements of commercial potato harvesters. Beyond potato weight estimation, PointRAFT provides a versatile regression network applicable to a wide range of 3D phenotyping and robotic perception tasks. The code, network weights, and a subset of the dataset are publicly available at https://github.com/pieterblok/pointraft.git.
[108] CorGi: Contribution-Guided Block-Wise Interval Caching for Training-Free Acceleration of Diffusion Transformers
Yonglak Son,Suhyeok Kim,Seungryong Kim,Young Geun Kim
Main category: cs.CV
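TL;DR: Proposes CorGi, a training-free inference acceleration framework that uses contribution-guided block-wise interval caching to cut redundant computation in DiT models, achieving up to 2.0x speedup on average while preserving generation quality.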
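To make the block-wise interval caching idea in this summary concrete, the toy sketch below runs a stack of residual blocks over several denoising steps. At the first step of each interval it evaluates every block, scores each block's contribution, and caches the outputs of the lowest-scoring blocks for reuse during the rest of the interval. The contribution score (relative norm of the residual), the toy blocks, and all names are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def run_with_interval_cache(blocks, x0, num_steps, interval, num_cached):
    """Toy denoising loop that reuses low-contribution block outputs within an interval.

    blocks: list of callables f(h) returning the residual update for hidden state h.
    Returns the final state and the number of block evaluations performed.
    """
    cache = {}
    calls = 0
    x = np.asarray(x0, dtype=float)
    for step in range(num_steps):
        refresh = step % interval == 0  # first step of an interval: recompute everything
        contributions, residuals = [], []
        h = x
        for i, block in enumerate(blocks):
            if not refresh and i in cache:
                r = cache[i]  # reuse the cached residual of a low-contribution block
            else:
                r = block(h)
                calls += 1
            if refresh:
                # Assumed contribution score: relative size of the block's update.
                contributions.append(np.linalg.norm(r) / (np.linalg.norm(h) + 1e-8))
                residuals.append(r)
            h = h + r
        if refresh:
            lowest = np.argsort(contributions)[:num_cached]
            cache = {int(i): residuals[int(i)] for i in lowest}
        x = h  # stand-in for the sampler's actual update rule
    return x, calls


if __name__ == "__main__":
    toy_blocks = [lambda h, s=s: s * h for s in (0.5, 0.01, 0.3, 0.02)]
    _, calls = run_with_interval_cache(toy_blocks, np.ones(4), 4, 2, 2)
    print(calls)  # block evaluations with caching; 16 would be needed without it
```

In this toy setting, caching two of four blocks over intervals of two steps removes a quarter of the block evaluations. The real speedup depends on how many blocks can be cached without hurting quality.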
Details
Motivation: DiT performs very well in image generation, but its iterative denoising process and large capacity make inference expensive, and there is substantial redundant computation across denoising steps.
Method: The CorGi framework evaluates the contribution of each transformer block, caches low-contribution blocks, and reuses them in later steps. For text-to-image tasks, the authors further propose CorGi+, which uses cross-attention maps to identify salient tokens and applies partial attention updates.
Result: Evaluated on state-of-the-art DiT models, CorGi and CorGi+ achieve up to 2.0x speedup on average while maintaining high generation quality.
Conclusion: CorGi is an effective training-free method for accelerating DiT inference that substantially reduces redundant computation and suits tasks such as text-to-image generation.
Abstract: Diffusion transformer (DiT) achieves remarkable performance in visual generation, but its iterative denoising process combined with larger capacity leads to a high inference cost. Recent works have demonstrated that the iterative denoising process of DiT models involves substantial redundant computation across steps. To effectively reduce the redundant computation in DiT, we propose CorGi (Contribution-Guided Block-Wise Interval Caching), a training-free DiT inference acceleration framework that selectively reuses the outputs of transformer blocks in DiT across denoising steps. CorGi caches low-contribution blocks and reuses them in later steps within each interval to reduce redundant computation while preserving generation quality. For text-to-image tasks, we further propose CorGi+, which leverages per-block cross-attention maps to identify salient tokens and applies partial attention updates to protect important object details. Evaluation on the state-of-the-art DiT models demonstrates that CorGi and CorGi+ achieve up to 2.0x speedup on average, while preserving high generation quality.
[109] Medical Image Classification on Imbalanced Data Using ProGAN and SMA-Optimized ResNet: Application to COVID-19
Sina Jahromi,Farshid Hajati,Alireza Rezaee,Javaher Nourian
Main category: cs.CV
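TL;DR: Proposes a model based on a progressive generative adversarial network and multi-objective optimization to address data imbalance in medical image classification, achieving high accuracy on COVID-19 chest X-ray classification.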
Details
Motivation: Medical image classification suffers from significant data imbalance, especially during pandemics, which degrades the performance of AI methods.
Method: A progressive generative adversarial network generates synthetic data, which is combined with real data using a weighted scheme. A multi-objective meta-heuristic optimization algorithm tunes the hyper-parameters of the deep classifier.
Result: On a large, imbalanced chest X-ray dataset, the model reaches 95.5% and 98.5% accuracy on the 4-class and 2-class tasks respectively, with cross-validated metrics better than existing methods.
Conclusion: The proposed model improves classification on imbalanced medical image data and is suited to disease detection during pandemics.
Abstract: The challenge of imbalanced data is prominent in medical image classification. This challenge arises when there is a significant disparity in the number of images belonging to a particular class, such as the presence or absence of a specific disease, as compared to the number of images belonging to other classes. This issue is especially notable during pandemics, which may result in an even more significant imbalance in the dataset. Researchers have employed various approaches in recent years to detect COVID-19 infected individuals accurately and quickly, with artificial intelligence and machine learning algorithms at the forefront. However, the lack of sufficient and balanced data remains a significant obstacle to these methods. This study addresses the challenge by proposing a progressive generative adversarial network to generate synthetic data to supplement the real ones. The proposed method suggests a weighted approach to combine synthetic data with real ones before inputting it into a deep network classifier. A multi-objective meta-heuristic population-based optimization algorithm is employed to optimize the hyper-parameters of the classifier. The proposed model exhibits superior cross-validated metrics compared to existing methods when applied to a large and imbalanced chest X-ray image dataset of COVID-19. The proposed model achieves 95.5% and 98.5% accuracy for 4-class and 2-class imbalanced classification problems, respectively. The successful experimental outcomes demonstrate the effectiveness of the proposed model in classifying medical images using imbalanced data during pandemics.
[110] ARM: A Learnable, Plug-and-Play Module for CLIP-based Open-vocabulary Semantic Segmentation
Ziquan Liu,Zhewei Zhu,Xuyang Shi
Main category: cs.CV
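TL;DR: This paper proposes the Attention Refinement Module (ARM), a lightweight, learnable module that improves CLIP for open-vocabulary semantic segmentation by adaptively fusing hierarchical features, delivering clear gains without significant inference overhead.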
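This entry describes a cross-attention block in which detail-rich shallow features act as queries (Q) and robust deep features supply keys and values (K, V). The sketch below shows that wiring with plain scaled dot-product attention in NumPy. The single-head layout, the projection matrices, and the tensor sizes are illustrative assumptions rather than the module's actual configuration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(shallow, deep, Wq, Wk, Wv):
    """Single-head cross-attention: shallow tokens query deep tokens.

    shallow: (n_shallow, d) features used to form Q.
    deep:    (n_deep, d) features used to form K and V.
    Returns the refined shallow tokens and the attention weights.
    """
    Q = shallow @ Wq
    K = deep @ Wk
    V = deep @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # scaled dot-product similarity
    attn = softmax(scores, axis=-1)          # each shallow token attends over deep tokens
    return attn @ V, attn


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    shallow = rng.normal(size=(5, 8))  # hypothetical shallow-layer tokens
    deep = rng.normal(size=(3, 8))     # hypothetical deep-layer tokens
    Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
    refined, attn = cross_attention(shallow, deep, Wq, Wk, Wv)
    print(refined.shape, attn.shape)
```

Each output row is a mixture of deep-feature values, weighted by how strongly the corresponding shallow token matches each deep token.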
Details
Motivation: CLIP's image-level representations lack pixel-level detail. Existing methods depend on costly external models or static heuristics, so they are either computationally expensive or limited in effect.
Method: The Attention Refinement Module (ARM) uses a semantically guided cross-attention mechanism in which deep features (K, V) select and refine detail-rich shallow features (Q), followed by a self-attention block for feature fusion. It follows a "train once, use anywhere" paradigm.
Result: ARM significantly improves several training-free frameworks on multiple benchmarks with very low inference overhead.
Conclusion: ARM is an efficient, general plug-and-play post-processing module that establishes a new paradigm for training-free open-vocabulary semantic segmentation.
Abstract: Open-vocabulary semantic segmentation (OVSS) is fundamentally hampered by the coarse, image-level representations of CLIP, which lack precise pixel-level details. Existing training-free methods attempt to resolve this by either importing priors from costly external foundation models (e.g., SAM, DINO) or by applying static, hand-crafted heuristics to CLIP's internal features. These approaches are either computationally expensive or sub-optimal. We propose the Attention Refinement Module (ARM), a lightweight, learnable module that effectively unlocks and refines CLIP's internal potential. Unlike static-fusion methods, ARM learns to adaptively fuse hierarchical features. It employs a semantically-guided cross-attention block, using robust deep features (K, V) to select and refine detail-rich shallow features (Q), followed by a self-attention block. The key innovation lies in a "train once, use anywhere" paradigm. Trained once on a general-purpose dataset (e.g., COCO-Stuff), ARM acts as a universal plug-and-play post-processor for diverse training-free frameworks. Extensive experiments show that ARM consistently boosts baseline performance on multiple benchmarks with negligible inference overhead, establishing an efficient and effective paradigm for training-free OVSS.
[111] Mirage: One-Step Video Diffusion for Photorealistic and Coherent Asset Editing in Driving Scenes
Shuyun Wang,Haiyang Sun,Bing Wang,Hangjun Ye,Xin Yu
Main category: cs.CV
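TL;DR: This paper proposes Mirage, a one-step video diffusion model for high-fidelity, temporally consistent asset editing in autonomous driving scenes. By combining 2D encoder features with a 3D decoder and introducing a two-stage alignment strategy, it addresses the spatial fidelity and object alignment problems of existing methods.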
Details
Motivation: Existing video object editing methods struggle to balance visual realism and temporal coherence. 3D compression degrades spatial detail, passing encoder features directly to the decoder breaks temporal causality, and a distribution mismatch between inserted assets and the original scene causes pose misalignment.
Method: The authors build a one-step diffusion model on a text-to-video diffusion prior. Temporally agnostic latents from a pretrained 2D encoder are injected into the 3D decoder to restore detail while preserving causal structure. A two-stage data alignment strategy (coarse 3D alignment followed by fine 2D refinement) mitigates the pose misalignment caused by mismatched Gaussian distributions.
Result: Experiments show that Mirage achieves higher realism and temporal consistency across diverse editing scenarios, generalizes to other video-to-video translation tasks, and outperforms existing methods.
Conclusion: Mirage offers an efficient, high-quality solution for video editing in driving scenes that combines generation quality with temporal stability, and it may serve as a reliable baseline for future research.
Abstract: Vision-centric autonomous driving systems rely on diverse and scalable training data to achieve robust performance. While video object editing offers a promising path for data augmentation, existing methods often struggle to maintain both high visual fidelity and temporal coherence. In this work, we propose Mirage, a one-step video diffusion model for photorealistic and coherent asset editing in driving scenes. Mirage builds upon a text-to-video diffusion prior to ensure temporal consistency across frames. However, 3D causal variational autoencoders often suffer from degraded spatial fidelity due to compression, and directly passing 3D encoder features to decoder layers breaks temporal causality. To address this, we inject temporally agnostic latents from a pretrained 2D encoder into the 3D decoder to restore detail while preserving causal structures. Furthermore, because scene objects and inserted assets are optimized under different objectives, their Gaussians exhibit a distribution mismatch that leads to pose misalignment. To mitigate this, we introduce a two-stage data alignment strategy combining coarse 3D alignment and fine 2D refinement, thereby improving alignment and providing cleaner supervision. Extensive experiments demonstrate that Mirage achieves high realism and temporal consistency across diverse editing scenarios. Beyond asset editing, Mirage can also generalize to other video-to-video translation tasks, serving as a reliable baseline for future research. Our code is available at https://github.com/wm-research/mirage.
[112] MotivNet: Evolving Meta-Sapiens into an Emotionally Intelligent Foundation Model
Rahul Medicharla,Alper Yilmaz
Main category: cs.CV
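TL;DR: This paper proposes MotivNet, a generalizable facial emotion recognition model built on the Meta-Sapiens backbone. It achieves competitive performance across datasets without cross-domain training, shows strong cross-domain generalization, and is validated as a viable Sapiens downstream task.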
Details
Motivation: Current state-of-the-art facial emotion recognition (FER) models generalize poorly to diverse data, which limits their real-world use. Complex architectures have been proposed to address this, but they still require cross-domain training, which conflicts with real deployment conditions.
Method: MotivNet uses Sapiens, a human vision foundation model with strong generalization, as its backbone and extends it to emotion recognition. Its viability as a Sapiens downstream task is assessed against three criteria: benchmark performance, model similarity, and data similarity.
Result: Without cross-domain training, MotivNet reaches competitive performance on multiple datasets, shows good cross-domain generalization, and meets all three proposed criteria.
Conclusion: MotivNet is a viable, generalizable FER model that inherits the generalization ability of Sapiens, advances the use of emotion recognition in real-world settings, and opens a new direction for FER research.
Abstract: In this paper, we introduce MotivNet, a generalizable facial emotion recognition model for robust real-world application. Current state-of-the-art FER models tend to have weak generalization when tested on diverse data, leading to deteriorated performance in the real world and hindering FER as a research domain. Though researchers have proposed complex architectures to address this generalization issue, they require training cross-domain to obtain generalizable results, which is inherently contradictory for real-world application. Our model, MotivNet, achieves competitive performance across datasets without cross-domain training by using Meta-Sapiens as a backbone. Sapiens is a human vision foundational model with state-of-the-art generalization in the real world through large-scale pretraining of a Masked Autoencoder. We propose MotivNet as an additional downstream task for Sapiens and define three criteria to evaluate MotivNet's viability as a Sapiens task: benchmark performance, model similarity, and data similarity. Throughout this paper, we describe the components of MotivNet, our training approach, and our results showing MotivNet is generalizable across domains. We demonstrate that MotivNet can be benchmarked against existing SOTA models and meets the listed criteria, validating MotivNet as a Sapiens downstream task, and making FER more incentivizing for in-the-wild application. The code is available at https://github.com/OSUPCVLab/EmotionFromFaceImages.
[113] MambaSeg: Harnessing Mamba for Accurate and Efficient Image-Event Semantic Segmentation
Fuqiang Gu,Yuanke Li,Xianlei Long,Kangping Ji,Chao Chen,Qingyi Gu,Zhenliang Ni
Main category: cs.CV
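TL;DR: This paper proposes MambaSeg, a dual-branch framework for semantic segmentation of RGB and event data. It uses parallel Mamba encoders and a Dual-Dimensional Interaction Module (DDIM) for fine-grained fusion along spatial and temporal dimensions, achieving state-of-the-art performance at lower computational cost.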
Details
Motivation: Existing RGB-event fusion methods mostly focus on spatial fusion, are computationally expensive, and neglect the temporal dynamics of event streams. A single modality also performs poorly under difficult conditions such as low light and fast motion.
Method: MambaSeg uses two parallel Mamba encoders to process RGB images and event streams separately. Its Dual-Dimensional Interaction Module (DDIM) contains a Cross-Spatial Interaction Module (CSIM) and a Cross-Temporal Interaction Module (CTIM), which jointly fuse the modalities along spatial and temporal dimensions.
Result: Experiments on the DDD17 and DSEC datasets show that MambaSeg achieves state-of-the-art semantic segmentation performance while significantly reducing computational cost.
Conclusion: By fusing the complementary spatio-temporal information of RGB and event data, MambaSeg improves the robustness and efficiency of semantic segmentation in complex scenes and shows good prospects for application.
Abstract: Semantic segmentation is a fundamental task in computer vision with wide-ranging applications, including autonomous driving and robotics. While RGB-based methods have achieved strong performance with CNNs and Transformers, their effectiveness degrades under fast motion, low-light, or high dynamic range conditions due to limitations of frame cameras. Event cameras offer complementary advantages such as high temporal resolution and low latency, yet lack color and texture, making them insufficient on their own. To address this, recent research has explored multimodal fusion of RGB and event data; however, many existing approaches are computationally expensive and focus primarily on spatial fusion, neglecting the temporal dynamics inherent in event streams. In this work, we propose MambaSeg, a novel dual-branch semantic segmentation framework that employs parallel Mamba encoders to efficiently model RGB images and event streams. To reduce cross-modal ambiguity, we introduce the Dual-Dimensional Interaction Module (DDIM), comprising a Cross-Spatial Interaction Module (CSIM) and a Cross-Temporal Interaction Module (CTIM), which jointly perform fine-grained fusion along both spatial and temporal dimensions. This design improves cross-modal alignment, reduces ambiguity, and leverages the complementary properties of each modality. Extensive experiments on the DDD17 and DSEC datasets demonstrate that MambaSeg achieves state-of-the-art segmentation performance while significantly reducing computational cost, showcasing its promise for efficient, scalable, and robust multimodal perception.
[114] Physically-Grounded Manifold Projection with Foundation Priors for Metal Artifact Reduction in Dental CBCT
Zhi Li,Yaqi Wang,Bingtao Ma,Yifan Zhang,Huiyu Zhou,Shuai Wang
Main category: cs.CV
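TL;DR: Proposes Physically-Grounded Manifold Projection (PGMP), a metal artifact reduction framework that combines high-fidelity simulated data, a deterministic restoration model, and semantic-structural alignment to remove artifacts from dental CBCT quickly and accurately while keeping the anatomy plausible.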
Details
Motivation: Existing deep learning methods for metal artifact reduction in dental CBCT suffer from regression blurring or structural hallucinations, and diffusion models depend on slow stochastic sampling that cannot meet clinical real-time requirements.
Method: 1) An Anatomically-Adaptive Physics Simulation (AAPS) pipeline generates high-quality training data. 2) The DMP-Former network adopts the direct x-prediction paradigm, modeling artifact removal as a deterministic manifold projection that restores the image in a single forward pass. 3) A Semantic-Structural Alignment (SSA) module uses priors from medical foundation models to keep the anatomy plausible.
Result: PGMP outperforms existing state-of-the-art methods on both synthetic and multi-center clinical data, especially on unseen anatomy, and its fast single-step inference markedly improves efficiency and diagnostic reliability.
Conclusion: Through physics simulation, deterministic modeling, and semantic alignment, PGMP resolves the trade-off among realism, efficiency, and clinical trustworthiness in metal artifact reduction and sets a new benchmark for practical use.
Abstract: Metal artifacts in Dental CBCT severely obscure anatomical structures, hindering diagnosis. Current deep learning for Metal Artifact Reduction (MAR) faces limitations: supervised methods suffer from spectral blurring due to "regression-to-the-mean", while unsupervised ones risk structural hallucinations. Denoising Diffusion Models (DDPMs) offer realism but rely on slow, stochastic iterative sampling, unsuitable for clinical use. To resolve this, we propose the Physically-Grounded Manifold Projection (PGMP) framework. First, our Anatomically-Adaptive Physics Simulation (AAPS) pipeline synthesizes high-fidelity training pairs via Monte Carlo spectral modeling and patient-specific digital twins, bridging the synthetic-to-real gap. Second, our DMP-Former adapts the Direct x-Prediction paradigm, reformulating restoration as a deterministic manifold projection to recover clean anatomy in a single forward pass, eliminating stochastic sampling. Finally, a Semantic-Structural Alignment (SSA) module anchors the solution using priors from medical foundation models (MedDINOv3), ensuring clinical plausibility. Experiments on synthetic and multi-center clinical datasets show PGMP outperforms state-of-the-art methods on unseen anatomy, setting new benchmarks in efficiency and diagnostic reliability. Code and data: https://github.com/ricoleehduu/PGMP
[115] Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation
Zhe Huang,Hao Wen,Aiming Hao,Bingze Song,Meiqi Wu,Jiahong Wu,Xiangxiang Chu,Sheng Lu,Haoqian Wang
Main category: cs.CV
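TL;DR: This paper proposes the DualityForge framework and the DualityVidQA dataset, which use controllable diffusion-based video editing to generate paired counterfactual and original videos. Combined with the DNA-Train training method, they effectively reduce hallucinations of multimodal large language models on counterfactual videos.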
Details
Motivation: Multimodal Large Language Models (MLLMs) rely too heavily on language priors in video understanding, which leads to visually ungrounded hallucinations on counterfactual videos that defy common sense. The problem is hard to address because counterfactual data is costly to annotate.
Method: The DualityForge framework uses controllable, diffusion-based video editing to turn real videos into counterfactual scenarios and, with structured contextual information, automatically generates high-quality QA pairs and original-edited video pairs. On this basis the authors build the large-scale DualityVidQA dataset and design DNA-Train, a two-stage training method whose reinforcement learning phase applies pair-wise ℓ₁ advantage normalization for stable and efficient policy optimization.
Result: On DualityVidQA-Test, the method reduces hallucinations on counterfactual videos by a relative 24.0% compared with the Qwen2.5-VL-7B baseline, and it achieves significant gains on both hallucination and general-purpose benchmarks, indicating good generalization.
Conclusion: By synthesizing counterfactual contrastive data and improving the training strategy, the method effectively mitigates MLLM hallucinations in video understanding, generalizes well, and has practical potential. The dataset and code will be open-sourced.
Abstract: Multimodal Large Language Models (MLLMs) have made remarkable progress in video understanding. However, they suffer from a critical vulnerability: an over-reliance on language priors, which can lead to visual ungrounded hallucinations, especially when processing counterfactual videos that defy common sense. This limitation, stemming from the intrinsic data imbalance between text and video, is challenging to address due to the substantial cost of collecting and annotating counterfactual data. To address this, we introduce DualityForge, a novel counterfactual data synthesis framework that employs controllable, diffusion-based video editing to transform real-world videos into counterfactual scenarios. By embedding structured contextual information into the video editing and QA generation processes, the framework automatically produces high-quality QA pairs together with original-edited video pairs for contrastive training. Based on this, we build DualityVidQA, a large-scale video dataset designed to reduce MLLM hallucinations. In addition, to fully exploit the contrastive nature of our paired data, we propose Duality-Normalized Advantage Training (DNA-Train), a two-stage SFT-RL training regime where the RL phase applies pair-wise $\ell_1$ advantage normalization, thereby enabling a more stable and efficient policy optimization. Experiments on DualityVidQA-Test demonstrate that our method substantially reduces model hallucinations on counterfactual videos, yielding a relative improvement of 24.0% over the Qwen2.5-VL-7B baseline. Moreover, our approach achieves significant gains across both hallucination and general-purpose benchmarks, indicating strong generalization capability. We will open-source our dataset and code.
[116] LiftProj: Space Lifting and Projection-Based Panorama Stitching
Yuan Jia,Ruimin Wu,Rui Song,Jiaojiao Li,Bin Song
Main category: cs.CV
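TL;DR: Proposes a panoramic image stitching framework based on lifting images into 3D space, fusing them globally there, and using an equidistant cylindrical projection to produce a geometrically consistent 360° panorama, which effectively reduces the distortion and ghosting caused by parallax and occlusion.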
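The equidistant cylindrical (equirectangular) projection named in this summary maps each 3D point, seen from the projection center, to a longitude and latitude and then to panorama pixel coordinates. The sketch below shows one common form of that mapping. The axis convention (y up, z forward), the pixel layout, and the function name are assumptions for illustration, and points are assumed to be expressed relative to the projection center and to be non-zero.

```python
import numpy as np


def equirectangular_project(points, width, height):
    """Map 3D points (N, 3), relative to the projection center, to panorama pixels.

    Assumed convention: y is up, z is forward; longitude 0 lands at the image
    center column and latitude +90 degrees at the top row.
    """
    points = np.asarray(points, dtype=float)
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)    # distance to the projection center (must be > 0)
    lon = np.arctan2(x, z)                # longitude in [-pi, pi]
    lat = np.arcsin(y / r)                # latitude in [-pi/2, pi/2]
    u = (lon / (2 * np.pi) + 0.5) * width
    v = (0.5 - lat / np.pi) * height
    return np.stack([u, v], axis=1)


if __name__ == "__main__":
    pts = np.array([[0.0, 0.0, 1.0],   # straight ahead
                    [1.0, 0.0, 0.0],   # 90 degrees to the right
                    [0.0, 1.0, 0.0]])  # straight up
    print(equirectangular_project(pts, width=2048, height=1024))
```

Because longitude and latitude are mapped linearly to pixel columns and rows, a full 360° by 180° field of view fits a canvas with a 2:1 aspect ratio.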
Details
Motivation: Traditional 2D stitching methods produce ghosting, bending, and stretching distortions on 3D scenes with multiple depth layers and occlusions. The problem is most severe in multi-view accumulation and 360° closed-loop stitching, so a solution with better geometric consistency is needed.
Method: Input images are lifted into dense 3D point representations in a unified coordinate system and fused globally across views using confidence measures. A unified projection center is established in 3D space, an equidistant cylindrical projection maps the fused data onto the panoramic plane, and holes are filled in the canvas domain to restore texture continuity.
Result: The method significantly reduces geometric distortion and ghosting artifacts in scenes with large parallax and complex occlusions, producing more natural and geometrically consistent panoramas.
Conclusion: The work shifts image stitching from a 2D warping paradigm to a 3D consistency paradigm. The framework can flexibly integrate different 3D lifting and completion modules, improving the quality and robustness of panorama stitching in complex real-world scenes.
Abstract: Traditional image stitching techniques have predominantly utilized two-dimensional homography transformations and mesh warping to achieve alignment on a planar surface. While effective for scenes that are approximately coplanar or exhibit minimal parallax, these approaches often result in ghosting, structural bending, and stretching distortions in non-overlapping regions when applied to real three-dimensional scenes characterized by multiple depth layers and occlusions. Such challenges are exacerbated in multi-view accumulations and 360° closed-loop stitching scenarios. In response, this study introduces a spatially lifted panoramic stitching framework that initially elevates each input image into a dense three-dimensional point representation within a unified coordinate system, facilitating global cross-view fusion augmented by confidence metrics. Subsequently, a unified projection center is established in three-dimensional space, and an equidistant cylindrical projection is employed to map the fused data onto a single panoramic manifold, thereby producing a geometrically consistent 360° panoramic layout. Finally, hole filling is conducted within the canvas domain to address unknown regions revealed by viewpoint transitions, restoring continuous texture and semantic coherence. This framework reconceptualizes stitching from a two-dimensional warping paradigm to a three-dimensional consistency paradigm and is designed to flexibly incorporate various three-dimensional lifting and completion modules. Experimental evaluations demonstrate that the proposed method substantially mitigates geometric distortions and ghosting artifacts in scenarios involving significant parallax and complex occlusions, yielding panoramic results that are more natural and consistent.
[117] One-shot synthesis of rare gastrointestinal lesions improves diagnostic accuracy and clinical training
Jia Yu,Yan Zhu,Peiyao Fu,Tianyi Chen,Zhihua Wang,Fei Wu,Quanlin Li,Pinghong Zhou,Shuo Wang,Xian Yang
Main category: cs.CV
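TL;DR: EndoRare is a retraining-free generative framework that synthesizes high-fidelity images of rare gastrointestinal lesions from a single reference image. It isolates diagnostic features through language-guided concept disentanglement and, when used for data augmentation and clinical training, markedly improves both AI model performance and the diagnostic accuracy of novice clinicians.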
Details
Motivation: Rare gastrointestinal lesions are seldom seen in routine endoscopy, so little data is available for training AI models and clinicians. An efficient way to generate high-quality, diverse lesion images is needed to fill this gap.
Method: EndoRare uses language-guided concept disentanglement to separate pathological features from non-diagnostic attributes. The key features are encoded as a learnable prototype embedding while diversity is preserved, and diverse, high-fidelity lesion images are generated from a single reference image without retraining.
Result: The framework was validated on four rare conditions (calcifying fibrous tumor, juvenile polyposis syndrome, familial adenomatous polyposis, and Peutz-Jeghers syndrome). Experts judged the generated images clinically plausible; using them for data augmentation significantly improved the AI classifiers' true positive rate at low false-positive rates; and in a blinded reader study, novice endoscopists' recall rose by 0.400 and precision by 0.267.
Conclusion: EndoRare offers a practical and efficient data-generation route for addressing the scarcity of rare-disease data in computer-aided diagnosis and clinical teaching.
Abstract: Rare gastrointestinal lesions are infrequently encountered in routine endoscopy, restricting the data available for developing reliable artificial intelligence (AI) models and training novice clinicians. Here we present EndoRare, a one-shot, retraining-free generative framework that synthesizes diverse, high-fidelity lesion exemplars from a single reference image. By leveraging language-guided concept disentanglement, EndoRare separates pathognomonic lesion features from non-diagnostic attributes, encoding the former into a learnable prototype embedding while varying the latter to ensure diversity. We validated the framework across four rare pathologies (calcifying fibrous tumor, juvenile polyposis syndrome, familial adenomatous polyposis, and Peutz-Jeghers syndrome). Synthetic images were judged clinically plausible by experts and, when used for data augmentation, significantly enhanced downstream AI classifiers, improving the true positive rate at low false-positive rates. Crucially, a blinded reader study demonstrated that novice endoscopists exposed to EndoRare-generated cases achieved a 0.400 increase in recall and a 0.267 increase in precision. These results establish a practical, data-efficient pathway to bridge the rare-disease gap in both computer-aided diagnostics and clinical education.
[118] Virtual-Eyes: Quantitative Validation of a Lung CT Quality-Control Pipeline for Foundation-Model Cancer Risk Prediction
Md. Enamul Hoq,Linda Larson-Prior,Fred Prior
Main category: cs.CV
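TL;DR: Presents and validates Virtual-Eyes, a 16-bit CT quality-control preprocessing pipeline for low-dose CT lung cancer screening. It markedly improves the performance and calibration of generalist foundation models (such as RAD-DINO) but may degrade specialist models (such as Sybil and ResNet-18), showing that preprocessing affects different model types differently.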
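The calibration and pooling terms used in this entry (Brier score, mean pooling, max pooling) can be illustrated with a short sketch. The function names and the example numbers are hypothetical and are not taken from the paper; the sketch only shows how slice-level scores might be aggregated to a patient-level score and how a Brier score is computed.

```python
def brier_score(probabilities, labels):
    """Mean squared difference between predicted probabilities and 0/1 labels."""
    if len(probabilities) != len(labels) or not labels:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for p, y in zip(probabilities, labels)) / len(labels)


def pool_patient_score(slice_scores, mode="mean"):
    """Aggregate slice-level scores into one patient-level score."""
    if not slice_scores:
        raise ValueError("need at least one slice score")
    if mode == "mean":
        return sum(slice_scores) / len(slice_scores)
    if mode == "max":
        return max(slice_scores)
    raise ValueError(f"unknown pooling mode: {mode}")


if __name__ == "__main__":
    # Hypothetical slice-level cancer scores for two patients.
    patients = {"patient_a": [0.2, 0.8, 0.5], "patient_b": [0.1, 0.2, 0.1]}
    labels = {"patient_a": 1, "patient_b": 0}
    for mode in ("mean", "max"):
        pooled = [pool_patient_score(patients[name], mode) for name in patients]
        truth = [labels[name] for name in patients]
        print(mode, pooled, brier_score(pooled, truth))
```

A lower Brier score means better-calibrated probabilities. Max pooling lets a single suspicious slice dominate the patient-level score, whereas mean pooling averages it against the rest of the scan.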
Details
Motivation: Robust preprocessing is rarely evaluated quantitatively in deep-learning pipelines for low-dose CT lung cancer screening. The authors aim to develop a clinically motivated quality-control preprocessing method and to analyze systematically how its impact differs between generalist foundation models and specialist models.
Method: Virtual-Eyes enforces 512x512 in-plane resolution, rejects short or non-diagnostic series, and extracts a contiguous lung block using Hounsfield-unit filtering and bilateral lung-coverage scoring while preserving the native 16-bit precision. On the NLST dataset (765 patients), slice-level embeddings are extracted with frozen encoders, leakage-free MLP heads are trained, and several models (RAD-DINO, Merlin, Sybil, ResNet-18) are compared on raw versus Virtual-Eyes inputs.
Result: Virtual-Eyes raises RAD-DINO slice-level AUC from 0.576 to 0.610 and patient-level AUC from 0.646 to 0.683 with mean pooling and from 0.619 to 0.735 with max pooling, and it improves calibration (Brier score from 0.188 to 0.112). Sybil and ResNet-18 perform worse on Virtual-Eyes inputs (Sybil AUC from 0.886 to 0.837), and Merlin shows limited transferability (AUC about 0.507 to 0.567).
Conclusion: Anatomically targeted quality control can stabilize and improve generalist foundation models in low-dose CT screening but may disrupt specialist models that depend on raw clinical context, so preprocessing strategies need to be chosen with the model type in mind.
Abstract: Robust preprocessing is rarely quantified in deep-learning pipelines for low-dose CT (LDCT) lung cancer screening. We develop and validate Virtual-Eyes, a clinically motivated 16-bit CT quality-control pipeline, and measure its differential impact on generalist foundation models versus specialist models. Virtual-Eyes enforces strict 512x512 in-plane resolution, rejects short or non-diagnostic series, and extracts a contiguous lung block using Hounsfield-unit filtering and bilateral lung-coverage scoring while preserving the native 16-bit grid. Using 765 NLST patients (182 cancer, 583 non-cancer), we compute slice-level embeddings from RAD-DINO and Merlin with frozen encoders and train leakage-free patient-level MLP heads; we also evaluate Sybil and a 2D ResNet-18 baseline under Raw versus Virtual-Eyes inputs without backbone retraining. Virtual-Eyes improves RAD-DINO slice-level AUC from 0.576 to 0.610 and patient-level AUC from 0.646 to 0.683 (mean pooling) and from 0.619 to 0.735 (max pooling), with improved calibration (Brier score 0.188 to 0.112). In contrast, Sybil and ResNet-18 degrade under Virtual-Eyes (Sybil AUC 0.886 to 0.837; ResNet-18 AUC 0.571 to 0.596) with evidence of context dependence and shortcut learning, and Merlin shows limited transferability (AUC approximately 0.507 to 0.567) regardless of preprocessing. These results demonstrate that anatomically targeted QC can stabilize and improve generalist foundation-model workflows but may disrupt specialist models adapted to raw clinical context.
[119] UniAct: Unified Motion Generation and Action Streaming for Humanoid Robots
Nan Jiang,Zimo He,Wanhe Yu,Lexi Pang,Yunhao Li,Hongjie Li,Jieming Cui,Yuhan Li,Yizhou Wang,Yixin Zhu,Siyuan Huang
Main category: cs.CV
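TL;DR: UniAct is a two-stage framework that combines a fine-tuned MLLM with a causal streaming pipeline so that humanoid robots can respond in real time (latency below 500 ms) to multimodal instructions such as language, music, and trajectories. It improves the success rate of zero-shot motion tracking by 19%.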
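This entry relies on FSQ (finite scalar quantization) to build a shared discrete codebook. The sketch below shows the general idea only, under simplifying assumptions: it handles odd level counts, uses hypothetical function names, and says nothing about UniAct's actual configuration or training setup.

```python
import math


def fsq_quantize(latent, levels):
    """Quantize each latent dimension to one of `levels[i]` integer values.

    Simplified: only odd level counts are handled, so the grid is symmetric
    around zero, e.g. 5 levels -> {-2, -1, 0, 1, 2}.
    """
    if len(latent) != len(levels):
        raise ValueError("latent and levels must have the same length")
    codes = []
    for value, num_levels in zip(latent, levels):
        if num_levels % 2 == 0:
            raise ValueError("this sketch only supports odd level counts")
        half = (num_levels - 1) // 2
        # tanh bounds the value to (-1, 1); scaling and rounding snaps it to the grid.
        codes.append(round(half * math.tanh(value)))
    return codes


def fsq_index(codes, levels):
    """Flatten per-dimension codes into one codebook index (mixed radix)."""
    index, stride = 0, 1
    for code, num_levels in zip(codes, levels):
        half = (num_levels - 1) // 2
        index += (code + half) * stride  # shift codes to 0..num_levels-1
        stride *= num_levels
    return index


if __name__ == "__main__":
    levels = [5, 5, 5]  # implicit codebook of 5 * 5 * 5 = 125 entries
    codes = fsq_quantize([0.0, 10.0, -10.0], levels)
    print(codes, fsq_index(codes, levels))
```

The codebook is implicit: its size is the product of the level counts, and no embedding table has to be learned. During training, the rounding step is usually bypassed with a straight-through gradient, which this sketch omits.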
Details
Motivation: Existing methods struggle to turn heterogeneous multimodal instructions (such as language, music, and trajectories) into stable, real-time whole-body humanoid motion, and they fail to balance cross-modal alignment with physical plausibility.
Method: UniAct has two stages. The first uses a fine-tuned multimodal large language model (MLLM) to interpret instructions. The second maps multimodal inputs to physically plausible motion streams through a shared FSQ-based discrete codebook and a causal streaming decoder, keeping latency low and modalities aligned.
Result: Validated on UniMoCap, a self-built 20-hour dataset, UniAct achieves latency below 500 ms and a 19% higher success rate in zero-shot tracking of reference motions, and it generalizes well across diverse real-world scenarios.
Conclusion: With a unified perception-control architecture, UniAct substantially improves how humanoid robots understand and execute multimodal instructions, advancing the development of general-purpose, responsive humanoid assistants.
Abstract: A long-standing objective in humanoid robotics is the realization of versatile agents capable of following diverse multimodal instructions with human-level flexibility. Despite advances in humanoid control, bridging high-level multimodal perception with whole-body execution remains a significant bottleneck. Existing methods often struggle to translate heterogeneous instructions, such as language, music, and trajectories, into stable, real-time actions. Here we show that UniAct, a two-stage framework integrating a fine-tuned MLLM with a causal streaming pipeline, enables humanoid robots to execute multimodal instructions with sub-500 ms latency. By unifying inputs through a shared discrete codebook via FSQ, UniAct ensures cross-modal alignment while constraining motions to a physically grounded manifold. This approach yields a 19% improvement in the success rate of zero-shot tracking of imperfect reference motions. We validate UniAct on UniMoCap, our 20-hour humanoid motion benchmark, demonstrating robust generalization across diverse real-world scenarios. Our results mark a critical step toward responsive, general-purpose humanoid assistants capable of seamless interaction through unified perception and control.
[120] Robust Egocentric Referring Video Object Segmentation via Dual-Modal Causal Intervention
Haijing Liu,Zhiyuan Song,Hefeng Wu,Tao Pu,Keze Wang,Liang Lin
Main category: cs.CV
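TL;DR: Proposes CERES, a causal framework for egocentric referring video object segmentation (Ego-RVOS) that improves model robustness through dual-modal causal intervention and achieves state-of-the-art results on benchmarks.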
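The backdoor adjustment that this entry builds on can be shown on a toy discrete example. The sketch below is a textbook illustration, not CERES itself: the variable names, the dictionaries, and all probabilities are made up to show how averaging over the confounder's marginal distribution differs from ordinary conditioning.

```python
def backdoor_adjusted(p_y_given_xz, p_z, x):
    """P(y | do(x)) = sum_z P(y | x, z) * P(z) for a discrete confounder z."""
    return sum(p_y_given_xz[(x, z)] * prob_z for z, prob_z in p_z.items())


def observational(p_y_given_xz, p_z_given_x, x):
    """P(y | x) = sum_z P(y | x, z) * P(z | x), which keeps the confounding."""
    return sum(p_y_given_xz[(x, z)] * prob for z, prob in p_z_given_x[x].items())


if __name__ == "__main__":
    # Hypothetical numbers: z is a dataset bias (e.g. a frequent object-action pairing).
    p_y_given_xz = {(1, 0): 0.2, (1, 1): 0.8}
    p_z = {0: 0.5, 1: 0.5}                 # marginal distribution of the confounder
    p_z_given_x = {1: {0: 0.1, 1: 0.9}}    # skewed co-occurrence in the training data
    print("observational:", observational(p_y_given_xz, p_z_given_x, 1))
    print("backdoor-adjusted:", backdoor_adjusted(p_y_given_xz, p_z, 1))
```

With these numbers the observational estimate is inflated by the skewed pairing, while the adjusted estimate weights each confounder value by how common it is overall.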
Details
Motivation: Existing methods perform poorly because of dataset biases and the visual confounders of the egocentric viewpoint (such as rapid motion and frequent occlusion), which makes robust segmentation hard to achieve.
Method: CERES applies backdoor adjustment to mitigate bias in the language representation and front-door adjustment to integrate semantic visual features with geometric depth information, using causal principles to strengthen robustness to visual confounding.
Result: CERES achieves the best performance on multiple Ego-RVOS benchmarks, supporting the value of causal reasoning for improving robustness and generalization.
Conclusion: By introducing causal intervention, CERES substantially improves referring segmentation in complex first-person videos and shows the potential of causal reasoning for egocentric video understanding.
Abstract: Egocentric Referring Video Object Segmentation (Ego-RVOS) aims to segment the specific object actively involved in a human action, as described by a language query, within first-person videos. This task is critical for understanding egocentric human behavior. However, achieving such segmentation robustly is challenging due to ambiguities inherent in egocentric videos and biases present in training data. Consequently, existing methods often struggle, learning spurious correlations from skewed object-action pairings in datasets and fundamental visual confounding factors of the egocentric perspective, such as rapid motion and frequent occlusions. To address these limitations, we introduce Causal Ego-REferring Segmentation (CERES), a plug-in causal framework that adapts strong, pre-trained RVOS backbones to the egocentric domain. CERES implements dual-modal causal intervention: applying backdoor adjustment principles to counteract language representation biases learned from dataset statistics, and leveraging front-door adjustment concepts to address visual confounding by intelligently integrating semantic visual features with geometric depth information guided by causal principles, creating representations more robust to egocentric distortions. Extensive experiments demonstrate that CERES achieves state-of-the-art performance on Ego-RVOS benchmarks, highlighting the potential of applying causal reasoning to build more reliable models for broader egocentric video understanding.
[121] SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning
Yong Xien Chng,Tao Hu,Wenwen Tong,Xueheng Li,Jiandong Chen,Haojia Yu,Jiefan Lu,Hewei Guo,Hanming Deng,Chengjun Xie,Gao Huang,Dahua Lin,Lewei Lu
Main category: cs.CV
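TL;DR: Proposes SenseNova-MARS, a framework that uses reinforcement learning for multimodal agentic reasoning and search. It dynamically combines image search, text search, and image cropping tools to improve the reasoning of vision-language models in complex, knowledge-intensive scenarios, and it introduces a new evaluation benchmark, HR-MMSearch.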
Details
Motivation: Existing vision-language models lack the human-like ability to reason continuously while coordinating tools dynamically in complex scenes, and they are particularly limited on high-resolution, knowledge-intensive tasks.
Method: SenseNova-MARS combines image search, text search, and image cropping tools and is trained with reinforcement learning using the Batch-Normalized Group Sequence Policy Optimization (BN-GSPO) algorithm, so that reasoning and tool use are interleaved.
Result: The framework achieves the best results on benchmarks such as MMSearch and HR-MMSearch. SenseNova-MARS-8B scores 67.84 on MMSearch and 41.64 on HR-MMSearch, surpassing proprietary models such as Gemini-3-Flash and GPT-5.
Conclusion: SenseNova-MARS advances agentic vision-language models by making tool invocation and multimodal reasoning more effective and stable. The authors state that code, models, and datasets will be released to support future research.
Abstract: While Vision-Language Models (VLMs) can solve complex tasks through agentic reasoning, their capabilities remain largely constrained to text-oriented chain-of-thought or isolated tool invocation. They fail to exhibit the human-like proficiency required to seamlessly interleave dynamic tool manipulation with continuous reasoning, particularly in knowledge-intensive and visually complex scenarios that demand coordinated external tools such as search and image cropping. In this work, we introduce SenseNova-MARS, a novel Multimodal Agentic Reasoning and Search framework that empowers VLMs with interleaved visual reasoning and tool-use capabilities via reinforcement learning (RL). Specifically, SenseNova-MARS dynamically integrates the image search, text search, and image crop tools to tackle fine-grained and knowledge-intensive visual understanding challenges. In the RL stage, we propose the Batch-Normalized Group Sequence Policy Optimization (BN-GSPO) algorithm to improve the training stability and advance the model's ability to invoke tools and reason effectively. To comprehensively evaluate the agentic VLMs on complex visual tasks, we introduce the HR-MMSearch benchmark, the first search-oriented benchmark composed of high-resolution images with knowledge-intensive and search-driven questions. Experiments demonstrate that SenseNova-MARS achieves state-of-the-art performance on open-source search and fine-grained image understanding benchmarks. Specifically, on search-oriented benchmarks, SenseNova-MARS-8B scores 67.84 on MMSearch and 41.64 on HR-MMSearch, surpassing proprietary models such as Gemini-3-Flash and GPT-5. SenseNova-MARS represents a promising step toward agentic VLMs by providing effective and robust tool-use capabilities. To facilitate further research in this field, we will release all code, models, and datasets.
[122] Spatial-aware Vision Language Model for Autonomous Driving
Weijie Wei,Zhipeng Luo,Ling Feng,Venice Erin Liong
Main category: cs.CV
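TL;DR: LVLDrive is a new framework that combines LiDAR with vision-language models (VLMs) to improve 3D spatial understanding in autonomous driving. It uses gradual fusion and a spatial-aware question-answering dataset to support more reliable and safer driving decisions.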
Details
Motivation: Existing image-based vision-language models for autonomous driving rely on 2D visual cues and struggle with accurate metric spatial reasoning and geometric inference, which harms safety and reliability.
Method: LVLDrive adds LiDAR point clouds as an extra input modality and uses a Gradual Fusion Q-Former to fuse the 3D data stably. A spatial-aware question-answering (SA-QA) dataset is built to train the model's 3D perception and reasoning.
Result: On multiple autonomous driving benchmarks, LVLDrive outperforms vision-only methods in scene understanding, metric spatial perception, and driving decision-making.
Conclusion: Explicitly incorporating 3D metric data is essential for building trustworthy VLM-based autonomous driving systems.
Abstract: While Vision-Language Models (VLMs) show significant promise for end-to-end autonomous driving by leveraging the common sense embedded in language models, their reliance on 2D image cues for complex scene understanding and decision-making presents a critical bottleneck for safety and reliability. Current image-based methods struggle with accurate metric spatial reasoning and geometric inference, leading to unreliable driving policies. To bridge this gap, we propose LVLDrive (LiDAR-Vision-Language), a novel framework specifically designed to upgrade existing VLMs with robust 3D metric spatial understanding for autonomous driving by incorporating LiDAR point clouds as an extra input modality. A key challenge lies in mitigating the catastrophic disturbance introduced by disparate 3D data to the pre-trained VLMs. To this end, we introduce a Gradual Fusion Q-Former that incrementally injects LiDAR features, ensuring the stability and preservation of the VLM's existing knowledge base. Furthermore, we develop a spatial-aware question-answering (SA-QA) dataset to explicitly teach the model advanced 3D perception and reasoning capabilities. Extensive experiments on driving benchmarks demonstrate that LVLDrive achieves superior performance compared to vision-only counterparts across scene understanding, metric spatial perception, and reliable driving decision-making. Our work highlights the necessity of explicit 3D metric data for building trustworthy VLM-based autonomous systems.
[123] The Mechanics of CNN Filtering with Rectification
Liam Frija-Altrac,Matthew Toews
Main category: cs.CV
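TL;DR: Proposes an information-mechanics model of convolutional filtering with rectification, inspired by relativity and quantum mechanics. An even-odd kernel decomposition links information processing in CNNs to the energy-momentum relation.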
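The even-odd kernel decomposition in this entry can be sketched in a few lines. The code assumes that "even" and "odd" refer to symmetry under point reflection about the kernel center, which is the standard definition but is not confirmed by the text here; the function names are illustrative.

```python
def split_even_odd(kernel):
    """Split a 2D kernel into even and odd parts under point reflection."""
    # Point reflection about the kernel center = reverse rows and columns.
    flipped = [row[::-1] for row in kernel[::-1]]
    even = [[(a + b) / 2.0 for a, b in zip(r, fr)] for r, fr in zip(kernel, flipped)]
    odd = [[(a - b) / 2.0 for a, b in zip(r, fr)] for r, fr in zip(kernel, flipped)]
    return even, odd


def energy(kernel):
    """Sum of squared coefficients."""
    return sum(v * v for row in kernel for v in row)


def odd_energy_ratio(kernel):
    """Fraction of the kernel energy carried by the odd component."""
    _, odd = split_even_odd(kernel)
    total = energy(kernel)
    return energy(odd) / total if total else 0.0


if __name__ == "__main__":
    sobel_x = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # purely odd
    box = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]          # purely even
    shifted = [[0, 0, 0], [0, 1, 1], [0, 0, 0]]      # mixed
    for name, k in (("sobel_x", sobel_x), ("box", box), ("shifted", shifted)):
        print(name, odd_energy_ratio(k))
```

The two parts are orthogonal, so their energies add up to the energy of the original kernel. The odd-to-total energy ratio computed here is the quantity the abstract relates linearly to the speed of information displacement.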
Details
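Motivation: Inspired by relativity and quantum mechanics, the work seeks to understand the mechanical properties of filtering and rectification in convolutional neural networks through a physical analogy.
Method: Convolution kernels are decomposed into orthogonal even and odd components and analyzed in the spectral domain of the discrete cosine transform (DCT), identifying how low-frequency bases (such as the DC and gradient components) shape the modes of information propagation.
Result: Even kernels diffuse information isotropically while preserving the center of mass, analogous to potential energy; odd kernels shift the center of mass, analogous to kinetic energy; and the speed of information propagation is linearly related to the share of energy in the odd kernel. The authors describe this as the first demonstrated link between CNN information processing and the relativistic energy-momentum relation.
Conclusion: Convolutional filtering can be viewed as an information-mechanical system whose behavior follows dynamics resembling physical laws, offering a new theoretical framework for understanding deep networks.
Abstract: This paper proposes elementary information mechanics as a new model for understanding the mechanical properties of convolutional filtering with rectification, inspired by physical theories of special relativity and quantum mechanics. We consider kernels decomposed into orthogonal even and odd components. Even components cause image content to diffuse isotropically while preserving the center of mass, analogously to rest or potential energy with zero net momentum. Odd kernels cause directional displacement of the center of mass, analogously to kinetic energy with non-zero momentum. The speed of information displacement is linearly related to the ratio of odd vs total kernel energy. Even-Odd properties are analyzed in the spectral domain via the discrete cosine transform (DCT), where the structure of small convolutional filters (e.g. $3 \times 3$ pixels) is dominated by low-frequency bases, specifically the DC $\Sigma$ and gradient components $\nabla$, which define the fundamental modes of information propagation. To our knowledge, this is the first work demonstrating the link between information processing in generic CNNs and the energy-momentum relation, a cornerstone of modern relativistic physics.
[124] DermaVQA-DAS: Dermatology Assessment Schema (DAS) & Datasets for Closed-Ended Question Answering & Segmentation in Patient-Generated Dermatology Images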
Wen-wai Yim,Yujuan Fu,Asma Ben Abacha,Meliha Yetisgen,Noel Codella,Roberto Andres Novoa,Josep Malvehy
Main category: cs.CV
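TL;DR: Introduces DermaVQA-DAS, an extended dataset that supports closed-ended question answering and skin lesion segmentation, together with DAS, an expert-designed dermatology assessment schema, to advance patient-centered dermatological vision-language modeling.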
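The two segmentation metrics reported in this entry, the Jaccard index and the Dice score, can be computed from binary masks as sketched below. The function names and the flat-list mask format are assumptions made for illustration; the benchmark's aggregation schemes (Mean-of-Max, Mean-of-Mean, majority-vote microscore) are not reproduced.

```python
def jaccard_and_dice(pred_mask, true_mask):
    """Return (Jaccard, Dice) for two flat binary masks of equal length."""
    if len(pred_mask) != len(true_mask):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred_mask, true_mask) if p and t)
    pred_area = sum(1 for p in pred_mask if p)
    true_area = sum(1 for t in true_mask if t)
    union = pred_area + true_area - intersection
    if union == 0:
        return 1.0, 1.0  # both masks empty: treated as perfect agreement (a convention)
    return intersection / union, 2.0 * intersection / (pred_area + true_area)


def dice_from_jaccard(jaccard):
    """For counts taken from the same masks, Dice = 2J / (1 + J)."""
    return 2.0 * jaccard / (1.0 + jaccard)


if __name__ == "__main__":
    print(jaccard_and_dice([1, 1, 0, 0], [1, 0, 1, 0]))
    print(dice_from_jaccard(0.395))
```

The reported pair is consistent with this identity: 2 × 0.395 / 1.395 ≈ 0.566. The identity holds for a single pair of masks or for pooled counts, but not in general for averages of per-image scores.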
Details
Motivation: Existing dermatology image datasets focus mostly on dermatoscopic images and lack patient-authored questions and clinical context, which limits their use in patient-centered care. The work aims to build a dataset and assessment framework that is closer to clinical practice and includes structured annotations of clinical features.
Method: The Dermatology Assessment Schema (DAS) comprises 36 high-level and 27 fine-grained assessment questions (multiple choice, in English and Chinese), and the DermaVQA-DAS dataset is built on it. State-of-the-art multimodal models are benchmarked on closed-ended question answering and lesion segmentation, and the effect of different prompting strategies on segmentation is compared.
Result: In segmentation, the prompting strategy affects performance: the default prompt is best under the Mean-of-Max and Mean-of-Mean metrics, while an augmented prompt combining the patient query title and content is best under the microscore (BiomedParse: Jaccard index 0.395, Dice score 0.566). In closed-ended QA, o3 has the highest accuracy (0.798), followed by GPT-4.1 (0.796), and Gemini-1.5-Pro stands out within the Gemini family (0.783).
Conclusion: DermaVQA-DAS and DAS provide standardized tools and benchmarks for patient-centered multimodal dermatology analysis and should help develop AI models with greater clinical relevance; prompt design has a notable effect on task performance.
Abstract: Recent advances in dermatological image analysis have been driven by large-scale annotated datasets; however, most existing benchmarks focus on dermatoscopic images and lack patient-authored queries and clinical context, limiting their applicability to patient-centered care. To address this gap, we introduce DermaVQA-DAS, an extension of the DermaVQA dataset that supports two complementary tasks: closed-ended question answering (QA) and dermatological lesion segmentation. Central to this work is the Dermatology Assessment Schema (DAS), a novel expert-developed framework that systematically captures clinically meaningful dermatological features in a structured and standardized form. DAS comprises 36 high-level and 27 fine-grained assessment questions, with multiple-choice options in English and Chinese. Leveraging DAS, we provide expert-annotated datasets for both closed QA and segmentation and benchmark state-of-the-art multimodal models. For segmentation, we evaluate multiple prompting strategies and show that prompt design impacts performance: the default prompt achieves the best results under Mean-of-Max and Mean-of-Mean evaluation aggregation schemes, while an augmented prompt incorporating both patient query title and content yields the highest performance under majority-vote-based microscore evaluation, achieving a Jaccard index of 0.395 and a Dice score of 0.566 with BiomedParse. For closed-ended QA, overall performance is strong across models, with average accuracies ranging from 0.729 to 0.798; o3 achieves the best overall accuracy (0.798), closely followed by GPT-4.1 (0.796), while Gemini-1.5-Pro shows competitive performance within the Gemini family (0.783). We publicly release DermaVQA-DAS, the DAS schema, and evaluation protocols to support and accelerate future research in patient-centered dermatological vision-language modeling (https://osf.io/72rp3).
[125] Forging Spatial Intelligence: A Roadmap of Multi-Modal Data Pre-Training for Autonomous Systems
Song Wang,Lingdong Kong,Xiaolu Liu,Hao Shi,Wentong Li,Jianke Zhu,Steven C. H. Hoi
Main category: cs.CV
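TL;DR: Presents a comprehensive framework for multi-modal pre-training that aims at unified spatial intelligence by integrating sensor data such as cameras and LiDAR. It proposes a taxonomy ranging from single-modality to unified models and examines the role of textual inputs and occupancy representations in open-world perception and planning.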
Details
Motivation: Existing foundation models excel at single-modality tasks but still struggle to fuse multi-modal sensor data (such as cameras and LiDAR) into a unified spatial understanding, which limits the use of autonomous systems in real environments.
Method: The paper lays out a multi-modal pre-training framework, analyzes the relationship between sensor characteristics and learning strategies, builds a unified taxonomy of pre-training paradigms, explores the combination of textual inputs with occupancy representations, and evaluates the role of platform-specific datasets.
Result: The paper proposes a taxonomy spanning single-modality baselines to unified modeling paradigms, covers tasks such as 3D object detection and semantic occupancy prediction, and shows how text and occupancy representations can support open-world perception and planning.
Conclusion: Multi-modal pre-training is a key route to robust spatial intelligence. Bottlenecks such as computational efficiency and model scalability remain to be solved on the way to general-purpose multi-modal foundation models suitable for real deployment.
Abstract: The rapid advancement of autonomous systems, including self-driving vehicles and drones, has intensified the need to forge true Spatial Intelligence from multi-modal onboard sensor data. While foundation models excel in single-modal contexts, integrating their capabilities across diverse sensors like cameras and LiDAR to create a unified understanding remains a formidable challenge. This paper presents a comprehensive framework for multi-modal pre-training, identifying the core set of techniques driving progress toward this goal. We dissect the interplay between foundational sensor characteristics and learning strategies, evaluating the role of platform-specific datasets in enabling these advancements. Our central contribution is the formulation of a unified taxonomy for pre-training paradigms: ranging from single-modality baselines to sophisticated unified frameworks that learn holistic representations for advanced tasks like 3D object detection and semantic occupancy prediction. Furthermore, we investigate the integration of textual inputs and occupancy representations to facilitate open-world perception and planning. Finally, we identify critical bottlenecks, such as computational efficiency and model scalability, and propose a roadmap toward general-purpose multi-modal foundation models capable of achieving robust Spatial Intelligence for real-world deployment.
[126] RedunCut: Measurement-Driven Sampling and Accuracy Performance Modeling for Low-Cost Live Video Analytics
Gur-Eyal Sela,Kumar Krishna Agrawal,Bharathan Balaji,Joseph Gonzalez,Ion Stoica
Main category: cs.CV
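TL;DR: Presents RedunCut, a new system for dynamic model size selection (DMSS) that uses a measurement-driven planner and a lightweight performance model to cut the compute cost of video analytics substantially without retraining any model.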
Details
Motivation: Existing DMSS methods generalize poorly to diverse workloads (such as mobile videos and lower accuracy targets), mainly because of inefficient sampling and inaccurate accuracy prediction.
Method: RedunCut uses a measurement-driven planner to estimate the cost-benefit tradeoff of sampling and a lightweight, data-driven performance model to improve per-segment accuracy prediction.
Result: Across road-vehicle, drone, and surveillance video datasets, RedunCut reduces compute cost by 14-62% at fixed accuracy and is robust to limited historical data and to data drift.
Conclusion: With a more efficient sampling strategy and more accurate performance prediction, RedunCut makes DMSS systems considerably more practical and widely applicable.
Abstract: Live video analytics (LVA) runs continuously across massive camera fleets, but inference cost with modern vision models remains high. To address this, dynamic model size selection (DMSS) is an attractive approach: it is content-aware but treats models as black boxes, and could potentially reduce cost by up to 10x without model retraining or modification. Without ground truth labels at runtime, we observe that DMSS methods use two stages per segment: (i) sampling a few models to calculate prediction statistics (e.g., confidences), then (ii) selection of the model size from those statistics. Prior systems fail to generalize to diverse workloads, particularly to mobile videos and lower accuracy targets. We identify that the failure modes stem from inefficient sampling whose cost exceeds its benefit, and inaccurate per-segment accuracy prediction. In this work, we present RedunCut, a new DMSS system that addresses both: it uses a measurement-driven planner that estimates the cost-benefit tradeoff of sampling, and a lightweight, data-driven performance model to improve accuracy prediction. Across road-vehicle, drone, and surveillance videos and multiple model families and tasks, RedunCut reduces compute cost by 14-62% at fixed accuracy and remains robust to limited historical data and to drift.
[127] DyStream: Streaming Dyadic Talking Heads Generation via Flow Matching-based Autoregressive Model
Bohong Chen,Haiyang Liu
Main category: cs.CV
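TL;DR: Presents DyStream, a flow matching-based autoregressive model for low-latency dyadic talking-head video generation that supports real-time lip sync and natural non-verbal feedback.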
Details
Motivation: Existing chunk-based methods need a full non-causal context window, which causes high latency and prevents the immediate non-verbal feedback that real conversation requires.
Method: DyStream adopts a streaming autoregressive framework with flow-matching heads and a causal encoder with a lookahead module that brings in short future context (e.g., 60 ms) to improve quality while keeping latency low.
Result: Each frame is generated in 34 ms and total system latency stays below 100 ms. On the HDTF dataset, offline and online lip-sync confidence scores reach 8.13 and 7.61 respectively, better than existing causal methods.
Conclusion: DyStream delivers high-quality, ultra-low-latency dyadic conversational video generation and advances real-time virtual interaction systems.
Abstract: Generating realistic, dyadic talking head video requires ultra-low latency. Existing chunk-based methods require full non-causal context windows, introducing significant delays. This high latency critically prevents the immediate, non-verbal feedback required for a realistic listener. To address this, we present DyStream, a flow matching-based autoregressive model that can generate video in real time from both speaker and listener audio. Our method contains two key designs: (1) we adopt a stream-friendly autoregressive framework with flow-matching heads for probabilistic modeling, and (2) we propose a causal encoder enhanced by a lookahead module to incorporate short future context (e.g., 60 ms) to improve quality while maintaining low latency. Our analysis shows that this simple and effective method significantly surpasses alternative causal strategies, including distillation and a generative encoder. Extensive experiments show that DyStream can generate video within 34 ms per frame, guaranteeing that the entire system latency remains under 100 ms. In addition, it achieves state-of-the-art lip-sync quality, with offline and online LipSync Confidence scores of 8.13 and 7.61 on HDTF, respectively. The model, weights, and code are available.
[128] AI-Driven Evaluation of Surgical Skill via Action Recognition
Yan Meng,Daniel A. Donoho,Marcelle Altshuler,Omar Arnaout
Main category: cs.CV
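TL;DR: Proposes an AI-based framework for assessing microanastomosis performance that combines an improved TimeSformer with YOLO to automate action recognition and skill evaluation in surgical videos.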
Details
Motivation: Traditional surgical skill assessment depends on subjective expert judgment, which is time-consuming, unreliable, and hard to scale, especially in resource-limited regions. An objective, scalable, automated assessment method is needed.
Method: An improved TimeSformer model (with hierarchical temporal attention and weighted spatial attention) performs action recognition, and a YOLO-based object detection and tracking method extracts fine-grained motion features. Microanastomosis skill is assessed along five aspects.
Result: On a dataset of 58 annotated videos, frame-level action segmentation accuracy reaches 87.7% (93.62% after post-processing), and average classification accuracy across skill aspects is 76%.
Conclusion: The system can provide objective, consistent, and interpretable feedback and may help move surgical education toward standardized, data-driven training and evaluation.
Abstract: The development of effective training and evaluation strategies is critical. Conventional methods for assessing surgical proficiency typically rely on expert supervision, either through onsite observation or retrospective analysis of recorded procedures. However, these approaches are inherently subjective, susceptible to inter-rater variability, and require substantial time and effort from expert surgeons. These demands are often impractical in low- and middle-income countries, thereby limiting the scalability and consistency of such methods across training programs. To address these limitations, we propose a novel AI-driven framework for the automated assessment of microanastomosis performance. The system integrates a video transformer architecture based on TimeSformer, improved with hierarchical temporal attention and weighted spatial attention mechanisms, to achieve accurate action recognition within surgical videos. Fine-grained motion features are then extracted using a YOLO-based object detection and tracking method, allowing for detailed analysis of instrument kinematics. Performance is evaluated along five aspects of microanastomosis skill, including overall action execution, motion quality during procedure-critical actions, and general instrument handling. Experimental validation using a dataset of 58 expert-annotated videos demonstrates the effectiveness of the system, achieving 87.7% frame-level accuracy in action segmentation that increased to 93.62% with post-processing, and an average classification accuracy of 76% in replicating expert assessments across all skill aspects. These findings highlight the system's potential to provide objective, consistent, and interpretable feedback, thereby enabling more standardized, data-driven training and evaluation in surgical education.
[129] Exploring Compositionality in Vision Transformers using Wavelet Representations
Akshad Shyam Purushottamdas,Pranav K Nayak,Divya Mehul Rajparia,Deekshith Patel,Yashmitha Gogineni,Konda Reddy Mopuri,Sumohana S. Channappayya
Main category: cs.CV
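TL;DR: Proposes a new framework that uses the Discrete Wavelet Transform (DWT) to obtain input-dependent visual primitives and tests whether representations in the Vision Transformer (ViT) encoder are compositional. Experiments show that primitives from a one-level DWT decomposition compose approximately in latent space, offering a new view of how ViTs organize information.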
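One way to picture DWT-derived primitives is to reconstruct one image per sub-band of a one-level Haar transform. The sketch below does this in plain Python. It assumes the Haar wavelet and this particular definition of a primitive, neither of which is confirmed by the text here, and sub-band naming conventions vary between libraries.

```python
def haar_dwt2(image):
    """One-level 2D Haar DWT of an image with even height and width."""
    rows, cols = len(image), len(image[0])
    bands = {name: [[0.0] * (cols // 2) for _ in range(rows // 2)]
             for name in ("LL", "LH", "HL", "HH")}
    for i in range(0, rows, 2):
        for j in range(0, cols, 2):
            a, b = image[i][j], image[i][j + 1]
            c, d = image[i + 1][j], image[i + 1][j + 1]
            bands["LL"][i // 2][j // 2] = (a + b + c + d) / 2.0
            bands["LH"][i // 2][j // 2] = (a + b - c - d) / 2.0
            bands["HL"][i // 2][j // 2] = (a - b + c - d) / 2.0
            bands["HH"][i // 2][j // 2] = (a - b - c + d) / 2.0
    return bands


def haar_idwt2(bands):
    """Inverse of haar_dwt2."""
    half_rows, half_cols = len(bands["LL"]), len(bands["LL"][0])
    image = [[0.0] * (2 * half_cols) for _ in range(2 * half_rows)]
    for i in range(half_rows):
        for j in range(half_cols):
            ll, lh = bands["LL"][i][j], bands["LH"][i][j]
            hl, hh = bands["HL"][i][j], bands["HH"][i][j]
            image[2 * i][2 * j] = (ll + lh + hl + hh) / 2.0
            image[2 * i][2 * j + 1] = (ll + lh - hl - hh) / 2.0
            image[2 * i + 1][2 * j] = (ll - lh + hl - hh) / 2.0
            image[2 * i + 1][2 * j + 1] = (ll - lh - hl + hh) / 2.0
    return image


def subband_primitives(image):
    """Reconstruct one image per sub-band, with the other sub-bands zeroed."""
    bands = haar_dwt2(image)
    result = {}
    for name in bands:
        isolated = {other: [[0.0] * len(row) for row in band]
                    for other, band in bands.items()}
        isolated[name] = bands[name]
        result[name] = haar_idwt2(isolated)
    return result


if __name__ == "__main__":
    parts = subband_primitives([[1.0, 2.0], [3.0, 4.0]])
    for name, part in parts.items():
        print(name, part)
```

In pixel space the four primitives sum exactly to the original image because the transform is linear. The question studied in this entry is whether a similar relation holds approximately between the encoder representations of the primitives and the representation of the full image.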
Details
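Motivation: Most of what is known about how transformers work comes from language tasks, while the representation learning mechanisms of vision models such as ViT remain unclear, in particular the role of compositional structure. The work asks whether the ViT encoder exhibits compositionality in its representation space.
Method: A framework analogous to existing measures of compositionality in representation learning is introduced. The Discrete Wavelet Transform (DWT) extracts primitives from the visual input, and compositionality is assessed by checking whether representations composed from these primitives reproduce the representation of the original image.
Result: Encoder representations produced from one-level DWT primitives compose approximately in latent space, that is, the composed representation effectively reproduces the original image representation.
Conclusion: The ViT representation space supports compositionality to some extent, and the DWT is an effective analysis tool that offers a new view of how ViTs organize visual information.
Abstract: While insights into the workings of the transformer model have largely emerged by analysing their behaviour on language tasks, this work investigates the representations learnt by the Vision Transformer (ViT) encoder through the lens of compositionality. We introduce a framework, analogous to prior work on measuring compositionality in representation learning, to test for compositionality in the ViT encoder. Crucial to drawing this analogy is the Discrete Wavelet Transform (DWT), which is a simple yet effective tool for obtaining input-dependent primitives in the vision setting. By examining the ability of composed representations to reproduce original image representations, we empirically test the extent to which compositionality is respected in the representation space. Our findings show that primitives from a one-level DWT decomposition produce encoder representations that approximately compose in latent space, offering a new perspective on how ViTs structure information.
[130] Spectral and Spatial Graph Learning for Multispectral Solar Image Compression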
Prasiddha Siwakoti,Atefeh Khoshkhahtinat,Piyush M. Mehta,Barbara J. Thompson,Michael S. F. Kirk,Daniel da Silva
Main category: cs.CV
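TL;DR: Proposes a high-fidelity compression framework for solar observation imagery that combines graph embedding with attention mechanisms, achieving lower information loss and higher reconstruction quality in multispectral image compression.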
Details
Motivation: In space missions, high-fidelity compression of multispectral solar images must reconcile limited bandwidth with the need to preserve fine spectral and spatial detail, and existing methods struggle to achieve both.
Method: A learned image compression framework with two modules is proposed. The iSWGE module explicitly captures inter-band relationships by modeling spectral channels as graph nodes with learned edge features; the WSGA-C module combines sparse graph attention with convolutional attention to reduce spatial redundancy and emphasize fine structures.
Result: On six extreme ultraviolet (EUV) channels of the SDOML dataset, the method reduces MSID by 20.15%, improves PSNR by up to 1.09%, and raises log MS-SSIM by 1.62% over strong learned baselines, giving sharper and more spectrally faithful reconstructions at comparable bit rates.
Conclusion: The method balances compression efficiency and reconstruction quality and is particularly suited to solar image transmission, where high spectral and spatial fidelity is required.
Abstract: High-fidelity compression of multispectral solar imagery remains challenging for space missions, where limited bandwidth must be balanced against preserving fine spectral and spatial details. We present a learned image compression framework tailored to solar observations, leveraging two complementary modules: (1) the Inter-Spectral Windowed Graph Embedding (iSWGE), which explicitly models inter-band relationships by representing spectral channels as graph nodes with learned edge features; and (2) the Windowed Spatial Graph Attention and Convolutional Block Attention (WSGA-C), which combines sparse graph attention with convolutional attention to reduce spatial redundancy and emphasize fine-scale structures. Evaluations on the SDOML dataset across six extreme ultraviolet (EUV) channels show that our approach achieves a 20.15% reduction in Mean Spectral Information Divergence (MSID), up to 1.09% PSNR improvement, and a 1.62% log-transformed MS-SSIM gain over strong learned baselines, delivering sharper and spectrally faithful reconstructions at comparable bits-per-pixel rates. The code is publicly available at https://github.com/agyat4/sgraph.
[131] F2IDiff: Real-world Image Super-resolution using Feature to Image Diffusion Foundation Model
Devendra K. Jangid,Ripon K. Saha,Dilshan Godaliyadda,Jing Li,Seok-Jun Lee,Hamid R. Sheikh
Main category: cs.CV
TL;DR: Proposes F2IDiff, a Feature-to-Image Diffusion model conditioned on lower-level DINOv2 features, for single image super-resolution with more precise, hallucination-free generation.
Details
Motivation: Existing text-to-image diffusion models tend to produce unacceptable hallucinations in consumer photography. Text features cannot describe fine textures, and small image patches cannot be described accurately by text, so generation needs stricter conditioning.
Method: Lower-level features extracted with DINOv2 serve as the conditioning input to the diffusion model, forming the F2IDiff foundation model, which enables more precise super-resolution on small image patches.
Result: The approach reduces hallucination during generation and is especially suited to super-resolving high-fidelity low-resolution images, as in smartphone photography.
Conclusion: Conditioning on lower-level features gives F2IDiff tighter control over generated content and improves the usability and realism of SISR in practical applications.
Abstract: With the advent of Generative AI, Single Image Super-Resolution (SISR) quality has seen substantial improvement, as the strong priors learned by Text-2-Image Diffusion (T2IDiff) Foundation Models (FM) can bridge the gap between High-Resolution (HR) and Low-Resolution (LR) images. However, flagship smartphone cameras have been slow to adopt generative models because strong generation can lead to undesirable hallucinations. For substantially degraded LR images, as seen in academia, strong generation is required and hallucinations are more tolerable because of the wide gap between LR and HR images. In contrast, in consumer photography, the LR image has substantially higher fidelity, requiring only minimal hallucination-free generation. We hypothesize that generation in SISR is controlled by the stringency and richness of the FM's conditioning feature. First, text features are high-level features, which often cannot describe subtle textures in an image. Additionally, smartphone LR images are at least 12 MP, whereas SISR networks built on T2IDiff FM are designed to perform inference on much smaller images (<1 MP). As a result, SISR inference has to be performed on small patches, which often cannot be accurately described by text features. To address these shortcomings, we introduce an SISR network built on a FM with lower-level feature conditioning, specifically DINOv2 features, which we call a Feature-to-Image Diffusion (F2IDiff) Foundation Model (FM). Lower-level features provide stricter conditioning while being rich descriptors of even small patches.
[132] CPJ: Explainable Agricultural Pest Diagnosis via Caption-Prompt-Judge with LLM-Judged Refinement
Wentao Zhang,Tao Fang,Lina Lu,Lifei Wang,Weihe Zhong
Main category: cs.CV
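TL;DR: Proposes CPJ, a training-free few-shot framework that improves the accuracy and interpretability of agricultural disease diagnosis through structured image captions.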
Details
Motivation: Existing methods rely on costly supervised fine-tuning, perform poorly under domain shift, and lack interpretability. Method: Large vision-language models generate multi-angle image captions, which are iteratively refined by an LLM-as-Judge module and then fed into a dual-answer VQA process that supports both recognition and management decisions. Result: On CDDMBench, with captions generated by GPT-5-mini, GPT-5-Nano improves over the no-caption baseline by 22.7 percentage points in classification and 19.5 points in QA score. Conclusion: Without fine-tuning, the CPJ framework delivers high-performing, transparent, evidence-based agricultural disease diagnosis with strong robustness and interpretability. Abstract: Accurate and interpretable crop disease diagnosis is essential for agricultural decision-making, yet existing methods often rely on costly supervised fine-tuning and perform poorly under domain shifts. We propose Caption-Prompt-Judge (CPJ), a training-free few-shot framework that enhances Agri-Pest VQA through structured, interpretable image captions. CPJ employs large vision-language models to generate multi-angle captions, refined iteratively via an LLM-as-Judge module, which then inform a dual-answer VQA process for both recognition and management responses. Evaluated on CDDMBench, CPJ significantly improves performance: using GPT-5-mini captions, GPT-5-Nano achieves +22.7 pp in disease classification and +19.5 points in QA score over no-caption baselines. The framework provides transparent, evidence-based reasoning, advancing robust and explainable agricultural diagnosis without fine-tuning. Our code and data are publicly available at: https://github.com/CPJ-Agricultural/CPJ-Agricultural-Diagnosis.
[133] Using Large Language Models To Translate Machine Results To Human Results
Trishna Niraula,Jonathan Stubblefield
Main category: cs.CV
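TL;DR: This study proposes a pipeline that combines YOLOv5 and YOLOv8 for chest X-ray anomaly detection with a large language model (such as GPT-4) that generates natural-language radiology reports, automating report generation from image to text.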
Details
Motivation: Existing computer vision systems can accurately detect anomalies in medical images, but they output structured results and still leave radiologists to write the report; a method is therefore needed that automatically generates high-quality, clinically usable natural-language reports. Method: YOLOv5 and YOLOv8 detect anomalies in chest X-rays; the resulting bounding boxes and class labels are passed as structured input to a large language model (LLM), which generates descriptive findings and a clinical summary. The two YOLO models are compared on detection accuracy, inference latency, and the quality of the generated text (cosine similarity to ground-truth reports). Result: AI-generated reports show high semantic similarity to ground-truth reports; human evaluation finds that GPT-4 scores high on clarity (4.88/5) but lower on natural writing flow (2.81/5). Conclusion: The approach can generate clinically accurate radiology reports, although the current system still falls short of radiologist-written reports in writing style and fluency. Abstract: Artificial intelligence (AI) has transformed medical imaging, with computer vision (CV) systems achieving state-of-the-art performance in classification and detection tasks. However, these systems typically output structured predictions, leaving radiologists responsible for translating results into full narrative reports. Recent advances in large language models (LLMs), such as GPT-4, offer new opportunities to bridge this gap by generating diagnostic narratives from structured findings. This study introduces a pipeline that integrates YOLOv5 and YOLOv8 for anomaly detection in chest X-ray images with an LLM to generate natural-language radiology reports. The YOLO models produce bounding-box predictions and class labels, which are then passed to the LLM to generate descriptive findings and clinical summaries. YOLOv5 and YOLOv8 are compared in terms of detection accuracy, inference latency, and the quality of generated text, as measured by cosine similarity to ground-truth reports. Results show strong semantic similarity between AI and human reports, while human evaluation reveals GPT-4 excels in clarity (4.88/5) but exhibits lower scores for natural writing flow (2.81/5), indicating that current systems achieve clinical accuracy but remain stylistically distinguishable from radiologist-authored text.
[134] Hierarchical Vector-Quantized Latents for Perceptual Low-Resolution Video Compression
Manikanta Kotthapalli,Banafsheh Rekabdar
Main category: cs.CV
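TL;DR: This paper proposes a multi-scale vector-quantized variational autoencoder (MS-VQ-VAE) for low-resolution video that builds a hierarchical spatiotemporal latent representation, enabling efficient compression, transmission, and edge-device decoding while balancing perceptual quality and reconstruction performance.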
Details
Motivation: Traditional video codecs such as H.264 and HEVC target pixel-level reconstruction and lack support for machine-learning-friendly latent representations, which makes them hard to integrate into deep learning pipelines and creates efficiency bottlenecks in bandwidth-constrained settings. Method: MS-VQ-VAE extends the VQ-VAE-2 framework to spatiotemporal data, using a two-level hierarchical latent structure and 3D residual convolutions, plus a perceptual loss from a pre-trained VGG16 to improve the visual quality of reconstructions; the model is lightweight (about 18.5M parameters) and optimized for 64x64 video clips. Result: Trained on UCF101 with 2-second clips, the model reaches 25.96 dB PSNR and 0.8375 SSIM on the test set; on the validation set it improves over a single-scale baseline by 1.41 dB PSNR and 0.0248 SSIM. Conclusion: The method produces compact, high-fidelity latent representations of video, suits bandwidth-sensitive scenarios such as real-time streaming, mobile video analytics, and CDN storage optimization, and shows good potential for edge deployment. Abstract: The exponential growth of video traffic has placed increasing demands on bandwidth and storage infrastructure, particularly for content delivery networks (CDNs) and edge devices. While traditional video codecs like H.264 and HEVC achieve high compression ratios, they are designed primarily for pixel-domain reconstruction and lack native support for machine learning-centric latent representations, limiting their integration into deep learning pipelines. In this work, we present a Multi-Scale Vector Quantized Variational Autoencoder (MS-VQ-VAE) designed to generate compact, high-fidelity latent representations of low-resolution video, suitable for efficient storage, transmission, and client-side decoding. Our architecture extends the VQ-VAE-2 framework to a spatiotemporal setting, introducing a two-level hierarchical latent structure built with 3D residual convolutions. The model is lightweight (approximately 18.5M parameters) and optimized for 64x64 resolution video clips, making it appropriate for deployment on edge devices with constrained compute and memory resources. To improve perceptual reconstruction quality, we incorporate a perceptual loss derived from a pre-trained VGG16 network. Trained on the UCF101 dataset using 2-second video clips (32 frames at 16 FPS), our model achieves 25.96 dB PSNR and 0.8375 SSIM on the test set. On validation, it improves over the single-scale baseline by 1.41 dB PSNR and 0.0248 SSIM.
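For readers unfamiliar with the metric, the PSNR figures quoted above follow the standard definition, PSNR = 10 log10(MAX^2 / MSE). The following is a minimal Python sketch of that definition, not the authors' evaluation code; the function names `psnr` and `mse_ratio_for_gain`, the flat-list pixel representation, and the default peak value of 255 are assumptions made for illustration.

```python
import math


def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(reconstruction) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixel values.
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks for a given PSNR gain at a fixed peak value."""
    return 10.0 ** (-gain_db / 10.0)
```

Under this definition, and assuming both models are scored against the same peak value, a 1.41 dB gain corresponds to `mse_ratio_for_gain(1.41)`, about 0.72, that is, roughly 28% lower mean squared error.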
The proposed framework is well-suited for scalable video compression in bandwidth-sensitive scenarios, including real-time streaming, mobile video analytics, and CDN-level storage optimization.
[135] PhyGDPO: Physics-Aware Groupwise Direct Preference Optimization for Physically Consistent Text-to-Video Generation
Yuanhao Cai,Kunpeng Li,Menglin Jia,Jialiang Wang,Junzhe Sun,Feng Liang,Weifeng Chen,Felix Juefei-Xu,Chu Wang,Ali Thabet,Xiaoliang Dai,Xuan Ju,Alan Yuille,Ji Hou
Main category: cs.CV
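TL;DR: This paper proposes a physics-aware text-to-video generation method that builds a large-scale physics-augmented dataset and a physics-guided preference optimization framework, markedly improving the physical plausibility of generated videos.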
Details
Motivation: Existing text-to-video generation methods follow physical laws poorly, and training data with rich physical interactions is scarce. Method: The PhyAugPipe pipeline uses a vision-language model with chain-of-thought reasoning to build PhyVidGen-135K, a large-scale physics video dataset; the PhyGDPO framework combines the groupwise Plackett-Luce model with a physics-guided reward scheme for optimization, and LoRA-SR reduces training memory overhead. Result: The method clearly outperforms existing open-source methods on the PhyGenBench and VideoPhy2 benchmarks, producing videos with better physical consistency. Conclusion: By combining data construction with physics-aware optimization, the method advances physically plausible video generation. Abstract: Recent advances in text-to-video (T2V) generation have achieved good visual quality, yet synthesizing videos that faithfully follow physical laws remains an open challenge. Existing methods, mainly based on graphics or prompt extension, struggle to generalize beyond simple simulated environments or to learn implicit physical reasoning. The scarcity of training data with rich physics interactions and phenomena is also a problem. In this paper, we first introduce a Physics-Augmented video data construction Pipeline, PhyAugPipe, that leverages a vision-language model (VLM) with chain-of-thought reasoning to collect a large-scale training dataset, PhyVidGen-135K. Then we formulate a principled Physics-aware Groupwise Direct Preference Optimization framework, PhyGDPO, that builds upon the groupwise Plackett-Luce probabilistic model to capture holistic preferences beyond pairwise comparisons. In PhyGDPO, we design a Physics-Guided Rewarding (PGR) scheme that embeds VLM-based physics rewards to steer optimization toward physical consistency. We also propose a LoRA-Switch Reference (LoRA-SR) scheme that eliminates memory-heavy reference duplication for efficient training. Experiments show that our method significantly outperforms state-of-the-art open-source methods on PhyGenBench and VideoPhy2. Please check our project page at https://caiyuanhao1998.github.io/project/PhyGDPO for more video results. Our code, models, and data will be released at https://github.com/caiyuanhao1998/Open-PhyGDPO
[136] OCP-LS: An Efficient Algorithm for Visual Localization
Jindi Zhong,Hongxia Wang,Huanshui Zhang
Main category: cs.CV
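TL;DR: Proposes a new second-order optimization algorithm for large-scale optimization problems in deep learning; by combining the OCP method with a suitable approximation of the diagonal elements of the Hessian matrix, it performs strongly on multiple visual localization benchmarks.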
Details
Motivation: To address large-scale optimization problems in deep learning, in particular the shortcomings of conventional optimizers in convergence speed, training stability, and robustness to noise. Method: A new second-order optimization algorithm that incorporates the OCP method and suitably approximates the diagonal elements of the Hessian matrix. Result: Experiments on multiple standard visual localization benchmarks show competitive localization accuracy together with faster convergence, more stable training, and greater robustness to noise. Conclusion: The proposed optimizer outperforms conventional methods on visual localization and shows good application potential. Abstract: This paper proposes a novel second-order optimization algorithm. It aims to address large-scale optimization problems in deep learning by incorporating the OCP method and appropriately approximating the diagonal elements of the Hessian matrix. Extensive experiments on multiple standard visual localization benchmarks demonstrate the significant superiority of the proposed method. Compared with conventional optimization algorithms, our framework achieves competitive localization accuracy while exhibiting faster convergence, enhanced training stability, and improved robustness to noise interference.
[137] RGBT-Ground Benchmark: Visual Grounding Beyond RGB in Complex Real-World Scenarios
Tianyi Zhao,Jiawen Xi,Linhui Xiao,Junnan Li,Xue Yang,Maoxun Yuan,Xingxing Wei
Main category: cs.CV
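TL;DR: This paper presents RGBT-Ground, the first large-scale visual grounding benchmark for complex real-world scenes, containing paired RGB and thermal infrared images with high-quality referring expressions, and proposes the RGBT-VGNet model to fuse the two modalities, markedly improving grounding in challenging conditions such as nighttime and long distance.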
Details
Motivation: Existing visual grounding benchmarks are mostly built on datasets collected in clean environments (such as COCO); their limited scene diversity fails to reflect real-world variation in illumination, weather, and similar factors, which limits the evaluation of model robustness and generalization, especially for safety-critical applications. Method: The authors build RGBT-Ground, a large-scale dataset of spatially aligned RGB and thermal infrared image pairs with high-quality referring expressions, bounding boxes, and fine-grained scene-level annotations; they design a unified visual grounding framework that supports uni-modal (RGB or TIR) and multi-modal (RGB-TIR) input, and propose RGBT-VGNet as a baseline that fuses the complementary visual modalities. Result: Existing methods were extensively adapted to RGBT-Ground; RGBT-VGNet significantly outperforms them in challenging scenarios such as nighttime and long distance, confirming its robustness in complex environments. Conclusion: RGBT-Ground provides a new benchmark for visual grounding in complex real-world scenes and advances research on robust vision-language understanding; RGBT-VGNet achieves better grounding through multi-modal fusion, particularly under adverse conditions. Abstract: Visual Grounding (VG) aims to localize specific objects in an image according to natural language expressions, serving as a fundamental task in vision-language understanding. However, existing VG benchmarks are mostly derived from datasets collected under clean environments, such as COCO, where scene diversity is limited. Consequently, they fail to reflect the complexity of real-world conditions, such as changes in illumination and weather, that are critical to evaluating model robustness and generalization in safety-critical applications. To address these limitations, we present RGBT-Ground, the first large-scale visual grounding benchmark built for complex real-world scenarios. It consists of spatially aligned RGB and Thermal infrared (TIR) image pairs with high-quality referring expressions, corresponding object bounding boxes, and fine-grained annotations at the scene, environment, and object levels. This benchmark enables comprehensive evaluation and facilitates the study of robust grounding under diverse and challenging conditions. Furthermore, we establish a unified visual grounding framework that supports both uni-modal (RGB or TIR) and multi-modal (RGB-TIR) visual inputs. Based on it, we propose RGBT-VGNet, a simple yet effective baseline for fusing complementary visual modalities to achieve robust grounding. We extensively adapt existing methods to RGBT-Ground. Experimental results show that our proposed RGBT-VGNet significantly outperforms these adapted methods, particularly in nighttime and long-distance scenarios. All resources will be publicly released to promote future research on robust visual grounding in complex real-world environments.
[138] Improving Few-Shot Change Detection Visual Question Answering via Decision-Ambiguity-guided Reinforcement Fine-Tuning
Fuyu Dong,Ke Li,Di Wang,Nan Luo,Yiming Zhang,Kaiyu Li,Jianfei Yang,Quan Wang
Main category: cs.CV
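TL;DR: This paper proposes DARFT, a reinforcement fine-tuning framework that targets decision ambiguity in change detection visual question answering (CDVQA); it mines decision-ambiguous samples and applies group-relative policy optimization to them, improving the model's discriminability and robustness.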
Details
Motivation: After supervised fine-tuning, existing CDVQA models still suffer from decision ambiguity: the correct answer and strong distractors receive similar confidence, which hurts performance. Method: DARFT first mines Decision-Ambiguous Samples (DAS) with an SFT-trained reference policy, then performs reinforcement fine-tuning on those samples using group-relative policy optimization based on multi-sample decoding and intra-group relative advantages. Result: Experiments show that DARFT outperforms SFT baselines across multiple settings, with the largest gains in few-shot scenarios. Conclusion: Explicitly optimizing decision-ambiguous samples improves CDVQA models, and DARFT offers an effective approach that needs no additional supervision. Abstract: Change detection visual question answering (CDVQA) requires answering text queries by reasoning about semantic changes in bi-temporal remote sensing images. A straightforward approach is to boost CDVQA performance with generic vision-language models via supervised fine-tuning (SFT). Despite recent progress, we observe that a significant portion of failures do not stem from clearly incorrect predictions, but from decision ambiguity, where the model assigns similar confidence to the correct answer and strong distractors. To formalize this challenge, we define Decision-Ambiguous Samples (DAS) as instances with a small probability margin between the ground-truth answer and the most competitive alternative. We argue that explicitly optimizing DAS is crucial for improving the discriminability and robustness of CDVQA models. To this end, we propose DARFT, a Decision-Ambiguity-guided Reinforcement Fine-Tuning framework that first mines DAS using an SFT-trained reference policy and then applies group-relative policy optimization on the mined subset. By leveraging multi-sample decoding and intra-group relative advantages, DARFT suppresses strong distractors and sharpens decision boundaries without additional supervision. Extensive experiments demonstrate consistent gains over SFT baselines, particularly under few-shot settings.
[139] SliceLens: Fine-Grained and Grounded Error Slice Discovery for Multi-Instance Vision Tasks
Wei Zhang,Chaoqun Wang,Zixuan Guan,Sam Kao,Pengfei Zhao,Peng Wu,Sifeng He
Main category: cs.CV
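TL;DR: This paper proposes SliceLens, a hypothesis-driven framework based on LLMs and VLMs for discovering fine-grained, interpretable error slices in multi-instance vision tasks, and introduces FeSD, the first benchmark for such tasks; experiments show clear gains over existing methods in both precision and practical usefulness.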
Details
Motivation: Existing error slice discovery methods mainly target image classification and are hard to apply to multi-instance tasks such as detection and segmentation; they also lack fine-grained reasoning over complex visual relationships. Current benchmarks, moreover, rely on artificial ground truth that does not reflect real model failures. Method: SliceLens uses large language models and vision-language models to generate and verify diverse failure hypotheses, discovering fine-grained, interpretable error slices through grounded visual reasoning. The new FeSD benchmark provides expert-annotated, carefully refined ground-truth slices that are precisely grounded to local error regions. Result: In experiments on existing benchmarks and FeSD, SliceLens reaches a Precision@10 of 0.73 on FeSD, 0.42 higher than the previous method (0.31); it identifies interpretable error slices, and model repair experiments confirm that they lead to real model improvements. Conclusion: By combining LLMs and VLMs, SliceLens discovers interpretable error slices efficiently across multi-instance vision tasks; together with FeSD it gives fine-grained model evaluation a more reliable standard and supports the development of robust vision models. Abstract: Systematic failures of computer vision models on subsets with coherent visual patterns, known as error slices, pose a critical challenge for robust model evaluation. Existing slice discovery methods are primarily developed for image classification, limiting their applicability to multi-instance tasks such as detection, segmentation, and pose estimation. In real-world scenarios, error slices often arise from corner cases involving complex visual relationships, where existing instance-level approaches lacking fine-grained reasoning struggle to yield meaningful insights. Moreover, current benchmarks are typically tailored to specific algorithms or biased toward image classification, with artificial ground truth that fails to reflect real model failures. To address these limitations, we propose SliceLens, a hypothesis-driven framework that leverages LLMs and VLMs to generate and verify diverse failure hypotheses through grounded visual reasoning, enabling reliable identification of fine-grained and interpretable error slices. We further introduce FeSD (Fine-grained Slice Discovery), the first benchmark specifically designed for evaluating fine-grained error slice discovery across instance-level vision tasks, featuring expert-annotated and carefully refined ground-truth slices with precise grounding to local error regions. Extensive experiments on both existing benchmarks and FeSD demonstrate that SliceLens achieves state-of-the-art performance, improving Precision@10 by 0.42 (0.73 vs. 0.31) on FeSD, and identifies interpretable slices that facilitate actionable model improvements, as validated through model repair experiments.
[140] 3D Semantic Segmentation for Post-Disaster Assessment
Nhut Le,Maryam Rahnemoonfar
Main category: cs.CV
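TL;DR: This paper presents a 3D semantic segmentation dataset designed for environments after natural disasters such as hurricanes, reconstructing point clouds from UAV aerial imagery with SfM and MVS, and evaluates existing state-of-the-art models on post-disaster scenes, revealing their limitations.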
Details
Motivation: Existing deep learning models lack 3D datasets built for post-disaster environments, which limits post-disaster assessment. Method: UAV imagery of areas hit by Hurricane Ian is reconstructed into a 3D point cloud dataset with SfM and MVS, and the SOTA 3D semantic segmentation models FPT, PTv3, and OA-CNNs are evaluated on it. Result: The SOTA models perform poorly in complex post-disaster environments, exposing their limited ability to adapt to disaster scenes. Conclusion: 3D semantic segmentation models and benchmark datasets built specifically for post-disaster scenes are needed to improve disaster response and scene understanding. Abstract: The increasing frequency of natural disasters poses severe threats to human lives and leads to substantial economic losses. While 3D semantic segmentation is crucial for post-disaster assessment, existing deep learning models lack datasets specifically designed for post-disaster environments. To address this gap, we constructed a specialized 3D dataset from aerial footage of areas affected by Hurricane Ian (2022), captured by unmanned aerial vehicles (UAVs), employing Structure-from-Motion (SfM) and Multi-View Stereo (MVS) techniques to reconstruct 3D point clouds. We evaluated the state-of-the-art (SOTA) 3D semantic segmentation models Fast Point Transformer (FPT), Point Transformer v3 (PTv3), and OA-CNNs on this dataset, exposing significant limitations in existing methods for disaster-stricken regions. These findings underscore the urgent need for advancements in 3D segmentation techniques and the development of specialized 3D benchmark datasets to improve post-disaster scene understanding and response.
[141] Collaborative Low-Rank Adaptation for Pre-Trained Vision Transformers
Zheng Liu,Jinchao Zhu,Gao Huang
Main category: cs.CV
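TL;DR: This paper proposes CLoRA, a new fine-tuning method that uses base-space sharing and sample-agnostic diversity enhancement (SADE) to increase the learning capacity of low-rank modules while preserving parameter efficiency, achieving a better balance of performance and efficiency on image and point cloud tasks.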
Details
Motivation: Existing low-rank adaptation methods struggle to reconcile parameter efficiency with learning performance: they either sacrifice performance or introduce too many trainable parameters. Method: Collaborative low-rank adaptation (CLoRA) combines a base-space sharing mechanism (down/up-projection spaces are shared and collaboratively construct the low-rank modules) with sample-agnostic diversity enhancement (SADE) to encourage diverse representations. Result: Experiments on multiple image and point cloud datasets show that CLoRA strikes a better balance between learning performance and parameter efficiency and requires the fewest GFLOPs for point cloud analysis. Conclusion: By sharing base spaces and enhancing representation diversity, CLoRA improves both the performance and efficiency of low-rank fine-tuning and outperforms existing state-of-the-art methods. Abstract: Low-rank adaptation (LoRA) has achieved remarkable success in fine-tuning pre-trained vision transformers for various downstream tasks. Existing studies mainly focus on exploring more parameter-efficient strategies or more effective representation learning schemes. However, these methods either sacrifice fine-tuning performance or introduce excessive trainable parameters, failing to strike a balance between learning performance and parameter efficiency. To address this problem, we propose a novel tuning method named collaborative low-rank adaptation (CLoRA) in this paper. CLoRA consists of base-space sharing and sample-agnostic diversity enhancement (SADE) components. To maintain parameter efficiency while expanding the learning capacity of low-rank modules (LRMs), base-space sharing allows all LRMs to share a set of down/up-projection spaces. In CLoRA, the low-rank matrices obtained from the shared spaces collaboratively construct each LRM. Since the representations extracted by these matrices may contain redundant information, SADE is employed to regularize the similarities among them to encourage diverse representations in the training process. We conduct extensive experiments on widely used image and point cloud datasets to evaluate the performance of CLoRA. Experimental results demonstrate that CLoRA strikes a better balance between learning performance and parameter efficiency, while requiring the fewest GFLOPs for point cloud analysis, compared with the state-of-the-art methods.
[142] MoniRefer: A Real-world Large-scale Multi-modal Dataset based on Roadside Infrastructure for 3D Visual Grounding
Panquan Yang,Junfei Huang,Zongzhangbao Yin,Yingsong Hu,Anni Xu,Xinyi Luo,Xueqi Sun,Hai Wu,Sheng Ao,Zhaoxing Zhu,Chenglu Wen,Cheng Wang
Main category: cs.CV
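TL;DR: This paper introduces a new task, 3D visual grounding for outdoor monitoring scenarios, and builds MoniRefer, the first large-scale real-world multi-modal dataset for it, with about 136,000 objects and 411,000 natural language descriptions. It also proposes Moni3DVG, an end-to-end method that fuses image appearance with point cloud geometry for multi-modal learning and performs strongly on the new task.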
Details
Motivation: Existing 3D visual grounding research focuses on indoor or autonomous-driving scenes; datasets and methods for roadside-infrastructure monitoring scenarios are missing, which limits infrastructure-level semantic understanding of traffic environments. Method: The authors build MoniRefer, a large-scale real-world multi-modal dataset, and propose the end-to-end model Moni3DVG, which combines appearance information from images with geometric and optical information from point clouds for multi-modal feature learning and 3D object localization. Result: Extensive experiments and ablation studies on the new MoniRefer dataset confirm the effectiveness and superiority of the method for 3D visual grounding. Conclusion: The work advances 3D visual grounding for infrastructure monitoring, providing the first large-scale real-world dataset and an effective method, and lays a foundation for roadside perception and natural language interaction in complex traffic environments. Abstract: 3D visual grounding aims to localize the object in 3D point cloud scenes that semantically corresponds to given natural language sentences. It is critical for roadside infrastructure systems to interpret natural language and localize relevant target objects in complex traffic environments. However, most existing datasets and approaches for 3D visual grounding focus on indoor and outdoor driving scenes; outdoor monitoring scenarios remain unexplored due to the scarcity of paired point cloud-text data captured by roadside infrastructure sensors. In this paper, we introduce a novel task of 3D Visual Grounding for Outdoor Monitoring Scenarios, which enables infrastructure-level understanding of traffic scenes beyond the ego-vehicle perspective. To support this task, we construct MoniRefer, the first real-world large-scale multi-modal dataset for roadside-level 3D visual grounding. The dataset consists of about 136,018 objects with 411,128 natural language expressions collected from multiple complex traffic intersections in real-world environments. To ensure the quality and accuracy of the dataset, we manually verified all linguistic descriptions and 3D labels for objects. We also propose a new end-to-end method, named Moni3DVG, which utilizes the rich appearance information provided by images and the geometry and optical information from point clouds for multi-modal feature learning and 3D object localization. Extensive experiments and ablation studies on the proposed benchmarks demonstrate the superiority and effectiveness of our method. Our dataset and code will be released.
[143] LLHA-Net: A Hierarchical Attention Network for Two-View Correspondence Learning
Shuyuan Lin,Yu Guo,Xiao Chen,Yanjie Liang,Guobao Xiao,Feiran Huang
Main category: cs.CV
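TL;DR: This paper proposes the Layer-by-Layer Hierarchical Attention Network, a new method that improves the accuracy of feature point matching in computer vision and performs particularly well when outliers are numerous.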
Details
Motivation: Numerous outliers in feature point matching significantly degrade accuracy and robustness; with a high proportion of outliers, the key challenge is extracting high-quality information while reducing errors caused by negative samples. Method: The network combines stage fusion, hierarchical extraction, and an attention mechanism. A layer-by-layer channel fusion module preserves semantic information from each stage and fuses it overall; a hierarchical attention module adaptively captures and fuses global perception and structural semantic information; and two architectures are built for feature extraction and integration. Result: Experiments on the public YFCC100M and SUN3D datasets show that the method outperforms several state-of-the-art approaches in outlier removal and camera pose estimation. Conclusion: The network strengthens the representation of feature points and maintains high matching accuracy and robustness under high outlier ratios, showing good application prospects. Abstract: Establishing the correct correspondence of feature points is a fundamental task in computer vision. However, the presence of numerous outliers among the feature points can significantly affect the matching results, reducing the accuracy and robustness of the process. Furthermore, a challenge arises when dealing with a large proportion of outliers: how to ensure the extraction of high-quality information while reducing errors caused by negative samples. To address these issues, in this paper, we propose a novel method called Layer-by-Layer Hierarchical Attention Network, which enhances the precision of feature point matching in computer vision by addressing the issue of outliers. Our method incorporates stage fusion, hierarchical extraction, and an attention mechanism to improve the network's representation capability by emphasizing the rich semantic information of feature points. Specifically, we introduce a layer-by-layer channel fusion module, which preserves the feature semantic information from each stage and achieves overall fusion, thereby enhancing the representation capability of the feature points. Additionally, we design a hierarchical attention module that adaptively captures and fuses global perception and structural semantic information using an attention mechanism. Finally, we propose two architectures to extract and integrate features, thereby improving the adaptability of our network. We conduct experiments on two public datasets, namely YFCC100M and SUN3D, and the results demonstrate that our proposed method outperforms several state-of-the-art techniques in both outlier removal and camera pose estimation. Source code is available at http://www.linshuyuan.com.
[144] FireRescue: A UAV-Based Dataset and Enhanced YOLO Model for Object Detection in Fire Rescue Scenes
Qingyu Xu,Runtong Zhang,Zihuan Qiu,Fanman Meng
Main category: cs.CV
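TL;DR: This paper proposes an object detection approach for fire rescue scenes: it builds FireRescue, a new dataset covering multiple scenarios and key target categories, and proposes an improved FRS-YOLO model to address inter-class confusion and missed small targets in complex environments.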
Details
Motivation: Existing research focuses mainly on mountainous or forest environments and neglects urban rescue scenes, which are more frequent and structurally complex; the detected categories are also limited and do not meet the needs of command decision-making. Method: The authors build FireRescue, a new dataset of 15,980 images and 32,000 bounding boxes covering urban, mountainous, forest, and water rescue scenarios and eight key target categories; they propose FRS-YOLO, which adds a multi-dimensional collaborative enhancement attention module and a dynamic feature sampler to better separate easily confused categories and strengthen foreground feature responses. Result: Experiments show that the method significantly improves the detection performance of YOLO-series models on FireRescue and effectively mitigates inter-class confusion and missed small targets in chaotic scenes. Conclusion: The work gives fire rescue command more comprehensive data support and an efficient detection model, advancing intelligent visual perception for real urban rescue scenes. Abstract: Object detection in fire rescue scenarios is important for command and decision-making in firefighting operations. However, existing research still suffers from two main limitations. First, current work predominantly focuses on environments such as mountainous or forest areas, while paying insufficient attention to urban rescue scenes, which are more frequent and structurally complex. Second, existing detection systems include a limited number of classes, such as flames and smoke, and lack a comprehensive system covering key targets crucial for command decisions, such as fire trucks and firefighters. To address the above issues, this paper first constructs a new dataset named "FireRescue" for rescue command, which covers multiple rescue scenarios, including urban, mountainous, forest, and water areas, and contains eight key categories such as fire trucks and firefighters, with a total of 15,980 images and 32,000 bounding boxes. Second, to tackle the problems of inter-class confusion and missed detection of small targets caused by chaotic scenes, diverse targets, and long-distance shooting, this paper proposes an improved model named FRS-YOLO. On the one hand, the model introduces a plug-and-play multi-dimensional collaborative enhancement attention module, which enhances the discriminative representation of easily confused categories (e.g., fire trucks vs. ordinary trucks) through cross-dimensional feature interaction. On the other hand, it integrates a dynamic feature sampler to strengthen high-response foreground features, thereby mitigating the effects of smoke occlusion and background interference. Experimental results demonstrate that object detection in fire rescue scenarios is highly challenging, and the proposed method effectively improves the detection performance of YOLO series models in this context.
[145] From Sequential to Spatial: Reordering Autoregression for Efficient Visual Generation
Siyang Wang,Hanting Li,Wei Li,Jie Hu,Xinghao Chen,Feng Zhao
Main category: cs.CV
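TL;DR: This paper proposes RadAR, a parallelizable autoregressive visual generation framework built on a radial topology, which uses ring-by-ring generation and a nested attention mechanism to improve generation efficiency and consistency.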
Details
Motivation: Conventional autoregressive models decode one token at a time, so inference is slow; visual tokens have strong local dependencies and spatial correlations that the standard raster-scan order does not fully exploit. Method: Generation is organized around a radial topology: starting from a center token, the remaining tokens are grouped into concentric rings by spatial distance and generated ring by ring, from the inside out, with all tokens in a ring predicted in parallel; a nested attention mechanism dynamically corrects implausible outputs during the forward pass to mitigate error accumulation. Result: The approach achieves a higher degree of parallelism, significantly improving visual generation efficiency while retaining the representational capacity of autoregressive models, and it guards against inconsistent simultaneous predictions caused by limited context. Conclusion: Through ring-wise parallel generation and dynamic correction, RadAR preserves spatial structure while improving inference efficiency, offering a new direction for efficient autoregressive visual generation. Abstract: Inspired by the remarkable success of autoregressive models in language modeling, this paradigm has been widely adopted in visual generation. However, the sequential token-by-token decoding mechanism inherent in traditional autoregressive models leads to low inference efficiency. In this paper, we propose RadAR, an efficient and parallelizable framework designed to accelerate autoregressive visual generation while preserving its representational capacity. Our approach is motivated by the observation that visual tokens exhibit strong local dependencies and spatial correlations with their neighbors, a property not fully exploited in standard raster-scan decoding orders. Specifically, we organize the generation process around a radial topology: an initial token is selected as the starting point, and all other tokens are systematically grouped into multiple concentric rings according to their spatial distances from this center. Generation then proceeds in a ring-wise manner, from inner to outer regions, enabling the parallel prediction of all tokens within the same ring. This design not only preserves the structural locality and spatial coherence of visual scenes but also substantially increases parallelization. Furthermore, to address the risk of inconsistent predictions arising from simultaneous token generation with limited context, we introduce a nested attention mechanism. This mechanism dynamically refines implausible outputs during the forward pass, thereby mitigating error accumulation and preventing model collapse.
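The ring-wise ordering described above can be sketched as follows. This is an illustrative sketch only, not the RadAR implementation: the abstract does not state the distance metric or how the center is chosen, so the Chebyshev distance, the grid-center default, and the name `radial_rings` are all assumptions.

```python
def radial_rings(height, width, center=None):
    """Group grid positions into concentric rings around a center token.

    Returns a list of rings ordered from inner to outer; each ring is a list
    of (row, col) positions that could be predicted in parallel.
    """
    if center is None:
        center = (height // 2, width // 2)  # assumed default starting token
    center_row, center_col = center
    rings = {}
    for row in range(height):
        for col in range(width):
            # Chebyshev distance gives square "rings"; this metric is an assumption.
            distance = max(abs(row - center_row), abs(col - center_col))
            rings.setdefault(distance, []).append((row, col))
    return [rings[d] for d in sorted(rings)]
```

On a 4x4 token grid with the default center, this grouping produces rings of 1, 8, and 7 tokens, so generation would take three ring-wise steps rather than 16 token-by-token steps.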
By integrating radial parallel prediction with dynamic output correction, RadAR significantly improves generation efficiency.
[146] Renormalization Group Guided Tensor Network Structure Search
Maolin Wang,Bowen Yu,Sheng Zhang,Linjie Mi,Wanyu Wang,Yiqi Wang,Pengyue Jia,Xuetao Wei,Zenglin Xu,Ruocheng Guo,Xiangyu Zhao
Main category: cs.CV
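TL;DR: Proposes RGTN, a renormalization-group-inspired framework for tensor network structure search that optimizes tensor decomposition structures efficiently and robustly through multi-scale continuous evolution.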
Details
Motivation: Existing tensor network structure search methods fall short in computational scalability, structural adaptivity, and optimization robustness, and they struggle with multi-scale structures, discrete search spaces, and the separate optimization of parameters and structure. Method: A physics-inspired renormalization group flow mechanism uses dynamic scale transformations to evolve structures continuously across resolutions; learnable edge gates adjust the topology during optimization, and structure proposals are driven by physical quantities such as node tension and edge information flow. Result: On light field data, high-order synthetic tensors, and video completion tasks, RGTN achieves state-of-the-art compression ratios and runs 4-600× faster than existing methods. Conclusion: With a multi-scale, continuous, physics-guided search paradigm, RGTN significantly improves the efficiency and performance of tensor network structure search, demonstrating the potential of physical principles for deep learning architecture search. Abstract: Tensor network structure search (TN-SS) aims to automatically discover optimal network topologies and rank configurations for efficient tensor decomposition in high-dimensional data representation. Despite recent advances, existing TN-SS methods face significant limitations in computational tractability, structure adaptivity, and optimization robustness across diverse tensor characteristics. They struggle with three key challenges: single-scale optimization missing multi-scale structures, discrete search spaces hindering smooth structure evolution, and separated structure-parameter optimization causing computational inefficiency. We propose RGTN (Renormalization Group guided Tensor Network search), a physics-inspired framework transforming TN-SS via multi-scale renormalization group flows. Unlike fixed-scale discrete search methods, RGTN uses dynamic scale-transformation for continuous structure evolution across resolutions. Its core innovation includes learnable edge gates for optimization-stage topology modification and intelligent proposals based on physical quantities like node tension measuring local stress and edge information flow quantifying connectivity importance. Starting from low-complexity coarse scales and refining to finer ones, RGTN finds compact structures while escaping local minima via scale-induced perturbations. Extensive experiments on light field data, high-order synthetic tensors, and video completion tasks show RGTN achieves state-of-the-art compression ratios and runs 4-600× faster than existing methods, validating the effectiveness of our physics-inspired approach.
[147] Evolving, Not Training: Zero-Shot Reasoning Segmentation via Evolutionary Prompting
Kai Ye,Xiaotong You,Jianghang Lin,Jiayi Ji,Pingyang Dai,Liujuan Cao
Main category: cs.CV
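TL;DR: This paper proposes EVOL-SAM3, a zero-shot reasoning segmentation framework whose inference-time evolutionary search (a generate-evaluate-evolve loop) overcomes the forgetting, training instability, and static inference that limit existing methods.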
Details
Motivation: Existing reasoning segmentation methods are limited by the catastrophic forgetting of supervised fine-tuning, the training instability of reinforcement learning, or the static single-pass inference of training-free methods, which lack any ability to self-correct. Method: EVOL-SAM3 maintains a population of prompt hypotheses and refines them iteratively through a "Generate-Evaluate-Evolve" loop; a reference-free Visual Arena performs pairwise evaluation, a Semantic Mutation operator corrects errors, and a Heterogeneous Arena module that incorporates geometric priors improves robustness. Result: On the ReasonSeg benchmark, EVOL-SAM3 in the zero-shot setting significantly outperforms static baselines and fully supervised state-of-the-art methods. Conclusion: Recasting reasoning segmentation as an inference-time evolutionary search is an effective and powerful paradigm; EVOL-SAM3 achieves deeper reasoning and self-correction, advancing zero-shot segmentation. Abstract: Reasoning Segmentation requires models to interpret complex, context-dependent linguistic queries to achieve pixel-level localization. Current dominant approaches rely heavily on Supervised Fine-Tuning (SFT) or Reinforcement Learning (RL). However, SFT suffers from catastrophic forgetting and domain dependency, while RL is often hindered by training instability and rigid reliance on predefined reward functions. Although recent training-free methods circumvent these training burdens, they are fundamentally limited by a static inference paradigm. These methods typically rely on a single-pass "generate-then-segment" chain, which suffers from insufficient reasoning depth and lacks the capability to self-correct linguistic hallucinations or spatial misinterpretations. In this paper, we challenge these limitations and propose EVOL-SAM3, a novel zero-shot framework that reformulates reasoning segmentation as an inference-time evolutionary search process. Instead of relying on a fixed prompt, EVOL-SAM3 maintains a population of prompt hypotheses and iteratively refines them through a "Generate-Evaluate-Evolve" loop. We introduce a Visual Arena to assess prompt fitness via reference-free pairwise tournaments, and a Semantic Mutation operator to inject diversity and correct semantic errors. Furthermore, a Heterogeneous Arena module integrates geometric priors with semantic reasoning to ensure robust final selection. Extensive experiments demonstrate that EVOL-SAM3 not only substantially outperforms static baselines but also significantly surpasses fully supervised state-of-the-art methods on the challenging ReasonSeg benchmark in a zero-shot setting.
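The general shape of a generate-evaluate-evolve loop with pairwise tournaments can be sketched as below. This is a generic evolutionary-search skeleton under stated assumptions, not the EVOL-SAM3 code: the names `rank_by_tournament` and `evolve`, the round-robin scoring, and the keep-the-top-half selection rule are hypothetical, and `prefer` and `mutate` are caller-supplied placeholders standing in for the model-based Visual Arena and Semantic Mutation operator named in the abstract.

```python
import random


def rank_by_tournament(pool, prefer):
    """Order candidates by round-robin pairwise wins, best first.

    prefer(a, b) returns True when candidate a is judged better than b.
    """
    wins = [0] * len(pool)
    for i in range(len(pool)):
        for j in range(i + 1, len(pool)):
            if prefer(pool[i], pool[j]):
                wins[i] += 1
            else:
                wins[j] += 1
    order = sorted(range(len(pool)), key=lambda k: wins[k], reverse=True)
    return [pool[k] for k in order]


def evolve(population, mutate, prefer, generations=3, seed=0):
    """Generic generate-evaluate-evolve loop; returns the best final candidate."""
    rng = random.Random(seed)
    pool = list(population)
    for _ in range(generations):
        ranked = rank_by_tournament(pool, prefer)         # evaluate
        survivors = ranked[: max(1, len(ranked) // 2)]    # select the top half
        children = [mutate(rng.choice(survivors), rng)    # evolve by mutation
                    for _ in range(len(pool) - len(survivors))]
        pool = survivors + children
    return rank_by_tournament(pool, prefer)[0]
```

Because survivors are carried over unchanged, the best candidate found so far is never discarded when `prefer` is a consistent ordering; a real judge that compares segmentation outputs may not be transitive, in which case that guarantee does not hold.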
The code is available at https://github.com/AHideoKuzeA/Evol-SAM3.
[148] FlowBlending: Stage-Aware Multi-Model Sampling for Fast and High-Fidelity Video Generation
Jibin Song,Mingi Kwon,Jaeseok Jeong,Youngjung Uh
Main category: cs.CV
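TL;DR: Proposes FlowBlending, a stage-aware multi-model sampling strategy that substantially speeds up inference while preserving the generation quality of the large model.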
Details
Motivation: Model capacity affects timesteps differently: the early and late stages need a high-capacity model, whereas the intermediate stage is largely insensitive to capacity. The goal is to improve inference efficiency without sacrificing generation quality.
Method: FlowBlending uses a large model for the capacity-sensitive stages and a small model for the capacity-insensitive stage; simple criteria determine the stage boundaries, and a velocity-divergence analysis identifies capacity-sensitive regions.
Result: On LTX-Video and WAN 2.1, FlowBlending achieves up to 1.65x speedup with 57.35% fewer FLOPs while maintaining the large models' visual fidelity, temporal coherence, and semantic alignment; combined with existing acceleration techniques, it gains up to 2x additional speedup.
Conclusion: FlowBlending is an efficient and broadly compatible stage-aware sampling strategy that substantially improves the inference efficiency of video generation models without sacrificing generation quality.
Abstract: In this work, we show that the impact of model capacity varies across timesteps: it is crucial for the early and late stages but largely negligible during the intermediate stage. Accordingly, we propose FlowBlending, a stage-aware multi-model sampling strategy that employs a large model and a small model at capacity-sensitive stages and intermediate stages, respectively. We further introduce simple criteria to choose stage boundaries and provide a velocity-divergence analysis as an effective proxy for identifying capacity-sensitive regions. Across LTX-Video (2B/13B) and WAN 2.1 (1.3B/14B), FlowBlending achieves up to 1.65x faster inference with 57.35% fewer FLOPs, while maintaining the visual fidelity, temporal coherence, and semantic alignment of the large models. FlowBlending is also compatible with existing sampling-acceleration techniques, enabling up to 2x additional speedup. Project page is available at: https://jibin86.github.io/flowblending_project_page.
[149] EchoFoley: Event-Centric Hierarchical Control for Video Grounded Creative Sound Generation
Bingxuan Li,Yiming Cui,Yicheng He,Yiwei Wang,Shu Zhang,Longyin Wen,Yulei Niu
Main category: cs.CV
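TL;DR: This paper introduces the EchoFoley task and the EchoVidia framework for fine-grained, controllable, video-grounded sound generation. Supported by the large-scale, expert-annotated EchoFoley-6k dataset, the approach clearly outperforms existing methods in controllability and audio quality.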
Details
Motivation: Existing video-text-to-audio generation models suffer from visual dominance, lack a definition of fine-grained control, and follow instructions poorly, which limits the use of sound effects in multimodal storytelling.
Method: Introduces the EchoFoley task, which represents the timing, category, and attributes of sounding events symbolically, and builds the EchoFoley-6k dataset; designs EchoVidia, a sounding-event-centric generation framework with a slow-fast thinking strategy.
Result: Experiments show that EchoVidia surpasses existing VT2A models by 40.7% in controllability and by 12.5% in perceptual quality.
Conclusion: The EchoFoley task and the EchoVidia framework address visual dominance, coarse control granularity, and weak instruction following in current video-to-audio generation, advancing controllability and semantic understanding in video-grounded sound generation.
Abstract: Sound effects build an essential layer of multimodal storytelling, shaping the emotional atmosphere and the narrative semantics of videos. Despite recent advances in video-text-to-audio (VT2A), the current formulation faces three key limitations: first, an imbalance between visual and textual conditioning that leads to visual dominance; second, the absence of a concrete definition for fine-grained controllable generation; third, weak instruction understanding and following, as existing datasets rely on brief categorical tags. To address these limitations, we introduce EchoFoley, a new task designed for video-grounded sound generation with both event-level local control and hierarchical semantic control. Our symbolic representation for sounding events specifies when, what, and how each sound is produced within a video or instruction, enabling fine-grained controls like sound generation, insertion, and editing. To support this task, we construct EchoFoley-6k, a large-scale, expert-curated benchmark containing over 6,000 video-instruction-annotation triplets. Building upon this foundation, we propose EchoVidia, a sounding-event-centric agentic generation framework with a slow-fast thinking strategy. Experiments show that EchoVidia surpasses recent VT2A models by 40.7% in controllability and 12.5% in perceptual quality.
[150] Splatwizard: A Benchmark Toolkit for 3D Gaussian Splatting Compression
Xiang Liu,Yimin Zhou,Jinxiang Wang,Yujun Huang,Shuzhao Xie,Shiyu Qin,Mingyao Hong,Jiawei Li,Yaowei Wang,Zhi Wang,Shu-Tao Xia,Bin Chen
Main category: cs.CV
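TL;DR: This paper presents Splatwizard, a unified benchmark toolkit for 3D Gaussian Splatting compression models that automates the evaluation of key metrics such as rendering quality, geometric accuracy, frame rate, and resource consumption.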
Details
Motivation: Existing evaluation tools cannot assess 3DGS compression methods comprehensively and in a standardized way, particularly with respect to rate-distortion trade-offs, memory efficiency, and geometric accuracy.
Method: Designs Splatwizard, a unified benchmarking framework that provides interfaces for implementing new compression models and an automated pipeline for computing image quality, the Chamfer distance of the reconstructed mesh, rendering frame rate, and resource consumption.
Result: The toolkit is easy to use and comprehensive, supports the integration of existing state-of-the-art techniques, and evaluates multiple performance metrics in one pipeline.
Conclusion: Splatwizard fills the gap left by the lack of standardized evaluation tools for 3DGS compression and supports reproducible research and fair comparison in the field.
Abstract: The recent advent of 3D Gaussian Splatting (3DGS) has marked a significant breakthrough in real-time novel view synthesis. However, the rapid proliferation of 3DGS-based algorithms has created a pressing need for standardized and comprehensive evaluation tools, especially for the compression task. Existing benchmarks often lack the specific metrics necessary to holistically assess the unique characteristics of different methods, such as rendering speed, rate-distortion trade-offs, memory efficiency, and geometric accuracy. To address this gap, we introduce Splatwizard, a unified benchmark toolkit designed specifically for benchmarking 3DGS compression models. Splatwizard provides an easy-to-use framework to implement new 3DGS compression models and utilize state-of-the-art techniques proposed by previous work. The framework also includes an integrated pipeline that automates the calculation of key performance indicators, including image-based quality metrics, Chamfer distance of the reconstructed mesh, rendering frame rates, and computational resource consumption. Code is available at https://github.com/splatwizard/splatwizard
[151] UniC-Lift: Unified 3D Instance Segmentation via Contrastive Learning
Ankit Dhiman,Srinath R,Jaswanth Reddy,Lokesh R Boregowda,Venkatesh Babu Radhakrishnan
Main category: cs.CV
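TL;DR: Proposes a unified 3D instance segmentation framework that combines learnable feature embeddings on Gaussian primitives, an "Embedding-to-Label" decoding mechanism, and boundary hard mining to improve the consistency and performance of multi-view 3D instance segmentation.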
Details
Motivation: Existing 3D instance segmentation methods handle inconsistent multi-view 2D instance labels with complex two-stage pipelines that depend on hyperparameter-sensitive clustering or on label preprocessing, which makes training inefficient and limits performance.
Method: Designs an end-to-end unified framework that merges feature embedding learning with label generation, introduces learnable features on Gaussian primitives, and decodes them directly into instance labels through an "Embedding-to-Label" mechanism; to reduce artifacts at object boundaries, applies a linear layer to the rasterized features and uses a triplet loss with hard mining along the boundaries.
Result: Achieves better qualitative and quantitative results than existing methods on ScanNet, Replica3D, and Messy-Rooms, with reduced training time, while resolving label inconsistency and boundary artifacts.
Conclusion: By unifying feature learning and label generation, the method simplifies the 3D instance segmentation architecture, improves cross-view consistency and robustness, and offers an efficient and reliable approach to 3DGS-based scene understanding.
Abstract: 3D Gaussian Splatting (3DGS) and Neural Radiance Fields (NeRF) have advanced novel-view synthesis. Recent methods extend multi-view 2D segmentation to 3D, enabling instance/semantic segmentation for better scene understanding. A key challenge is the inconsistency of 2D instance labels across views, leading to poor 3D predictions. Existing methods use a two-stage approach in which some rely on contrastive learning with hyperparameter-sensitive clustering, while others preprocess labels for consistency. We propose a unified framework that merges these steps, reducing training time and improving performance by introducing a learnable feature embedding for segmentation in Gaussian primitives. This embedding is then efficiently decoded into instance labels through a novel "Embedding-to-Label" process, effectively integrating the optimization. While this unified framework offers substantial benefits, we observed artifacts at the object boundaries. To address the object boundary issues, we propose hard-mining samples along these boundaries. However, directly applying hard mining to the feature embeddings proved unstable. Therefore, we apply a linear layer to the rasterized feature embeddings before calculating the triplet loss, which stabilizes training and significantly improves performance. Our method outperforms baselines qualitatively and quantitatively on the ScanNet, Replica3D, and Messy-Rooms datasets.
[152] Projection-based Adversarial Attack using Physics-in-the-Loop Optimization for Monocular Depth Estimation
Takeru Kusakabe,Yudai Hirose,Mashiho Mukaida,Satoshi Ono
Main category: cs.CV
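TL;DR: Proposes a projection-based adversarial attack that uses physics-in-the-loop optimization and a distributed covariance matrix adaptation evolution strategy to make monocular depth estimation models misestimate depth, so that parts of the target object disappear.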
Details
Motivation: To validate the vulnerability of DNN-based monocular depth estimation models to adversarial attacks in the physical world and to raise awareness of their robustness.
Method: Proposes a projection-based adversarial attack that uses physics-in-the-loop (PITL) optimization together with a distributed covariance matrix adaptation evolution strategy to generate adversarial perturbation light in real environments.
Result: Experiments show that the method generates effective adversarial examples that cause MDE models to misestimate depth, so that parts of objects disappear from the target scene.
Conclusion: DNN-based MDE models remain vulnerable to adversarial attacks in the physical world, and their robustness must be strengthened to withstand security threats in practical applications.
Abstract: Deep neural networks (DNNs) remain vulnerable to adversarial attacks that cause misclassification when specific perturbations are added to input images. This vulnerability also threatens the reliability of DNN-based monocular depth estimation (MDE) models, making robustness enhancement a critical need in practical applications. To validate the vulnerability of DNN-based MDE models, this study proposes a projection-based adversarial attack method that projects perturbation light onto a target object. The proposed method employs physics-in-the-loop (PITL) optimization -- evaluating candidate solutions in actual environments to account for device specifications and disturbances -- and utilizes a distributed covariance matrix adaptation evolution strategy. Experiments confirmed that the proposed method successfully created adversarial examples that lead to depth misestimations, resulting in parts of objects disappearing from the target scene.
[153] Nonlinear Noise2Noise for Efficient Monte Carlo Denoiser Training
Andrew Tinits,Stephen Mann
Main category: cs.CV
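TL;DR: This paper proposes an improved Noise2Noise method. A theoretical analysis shows that certain nonlinear functions can, under specific conditions, be applied safely to noisy target images without introducing significant bias. Applied to denoising high dynamic range Monte Carlo renderings, the method approaches the performance of the original model while training only on noisy data.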
Details
Motivation: Noise2Noise needs no clean images as training targets, but nonlinear processing such as tone mapping cannot be applied directly to the noisy targets because it introduces bias; high dynamic range images contain many outliers that make training difficult, so an effective remedy is needed.
Method: Builds a theoretical framework for analyzing the effect of nonlinear functions and gives design conditions for a class of low-bias nonlinear functions; combines specific loss functions and tone mapping functions to reduce dynamic range while minimizing bias.
Result: In Monte Carlo rendering denoising, a model trained on noisy data with nonlinear tone mapping approaches the performance of the original model, which was trained with high-sample-count reference images, confirming the effectiveness of the method.
Conclusion: Certain nonlinear operations can be used safely in Noise2Noise training; the proposed framework offers a practical training strategy for high dynamic range image denoising and broadens the applicability of Noise2Noise.
Abstract: The Noise2Noise method allows for training machine learning-based denoisers with pairs of input and target images where both the input and target can be noisy. This removes the need for training with clean target images, which can be difficult to obtain. However, Noise2Noise training has a major limitation: nonlinear functions applied to the noisy targets will skew the results. This bias occurs because the nonlinearity makes the expected value of the noisy targets different from the clean target image. Since nonlinear functions are common in image processing, avoiding them limits the types of preprocessing that can be performed on the noisy targets. Our main insight is that certain nonlinear functions can be applied to the noisy targets without adding significant bias to the results. We develop a theoretical framework for analyzing the effects of these nonlinearities, and describe a class of nonlinear functions with minimal bias. We demonstrate our method on the denoising of high dynamic range (HDR) images produced by Monte Carlo rendering. Noise2Noise training can have trouble with HDR images, where the training process is overwhelmed by outliers and performs poorly. We consider a commonly used method of addressing these training issues: applying a nonlinear tone mapping function to the model output and target images to reduce their dynamic range. This method was previously thought to be incompatible with Noise2Noise training because of the nonlinearities involved. We show that certain combinations of loss functions and tone mapping functions can reduce the effect of outliers while introducing minimal bias.
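The bias described in the abstract follows from Jensen's inequality: for a nonlinear function f, the expectation E[f(y)] of the mapped noisy target generally differs from f(E[y]), the mapped clean value. The Python sketch below makes this concrete under stated assumptions. The Reinhard-style tone map x / (1 + x) and the two-point noise model are illustrative choices only and are not claimed to be the functions or noise model analyzed in the paper; `tone_map` and `jensen_gap` are hypothetical names.

```python
def tone_map(x: float) -> float:
    """Reinhard-style tone map x / (1 + x); an illustrative choice only."""
    return x / (1.0 + x)


def jensen_gap(clean: float, spread: float) -> float:
    """Return f(E[y]) - E[f(y)] for a two-point noisy target.

    The noisy target y takes the values clean * (1 - spread) and
    clean * (1 + spread) with equal probability, so E[y] == clean.
    """
    low, high = clean * (1.0 - spread), clean * (1.0 + spread)
    expected_mapped = 0.5 * (tone_map(low) + tone_map(high))
    return tone_map(clean) - expected_mapped


# The gap is zero without noise and grows with the noise level.
for spread in (0.0, 0.1, 0.9):
    print(f"spread={spread:.1f}  gap={jensen_gap(1.0, spread):.6f}")
```

Because this tone map is concave, the gap is non-negative: averaging tone-mapped noisy targets underestimates the tone-mapped clean value, and the underestimate grows with the spread of the noise.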
We apply our method to an existing machine learning-based Monte Carlo denoiser, where the original implementation was trained with high-sample count reference images. Our results approach those of the original implementation, but are produced using only noisy training data.
[154] Video and Language Alignment in 2D Systems for 3D Multi-object Scenes with Multi-Information Derivative-Free Control
Jason Armitage,Rico Sennrich
Main category: cs.CV
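TL;DR: Proposes a new method that improves multivariate mutual information estimation through derivative-free optimization and regret minimization, enabling off-the-shelf cross-modal systems to adapt online to object occlusions in 3D scenes and differentiate features without pretraining or finetuning.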
Details
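Motivation: Existing cross-modal systems face a dimensionality gap between 2D visual inputs and 3D space when processing 3D scenes, and they struggle with object occlusions and noisy outputs.
Method: Introduces derivative-free optimization based on regret minimization to improve mutual information estimates, and combines it with value-based optimization to control an in-scene camera, so that the system learns directly from the noisy outputs of vision-language models.
Result: The method lets off-the-shelf cross-modal systems adapt online to occlusions in multi-object 3D scenes and differentiate features, improving cross-modal task performance without pretraining or finetuning.
Conclusion: By strengthening mutual information estimation and camera control, the method bridges the gap between systems trained on 2D inputs and 3D inference, offering an efficient and practical approach to cross-modal understanding in complex 3D scenes.
Abstract: Cross-modal systems trained on 2D visual inputs are presented with a dimensional shift when processing 3D scenes. An in-scene camera bridges the dimensionality gap but requires learning a control module. We introduce a new method that improves multivariate mutual information estimates by regret minimisation with derivative-free optimisation. Our algorithm enables off-the-shelf cross-modal systems trained on 2D visual inputs to adapt online to object occlusions and differentiate features. The pairing of expressive measures and value-based optimisation assists control of an in-scene camera to learn directly from the noisy outputs of vision-language models. The resulting pipeline improves performance in cross-modal tasks on multi-object 3D scenes without resorting to pretraining or finetuning.
[155] CropTrack: A Tracking with Re-Identification Framework for Precision Agriculture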
Md Ahmed Al Muzaddid,Jordan A. James,William J. Beksi
Main category: cs.CV
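TL;DR: This paper proposes CropTrack, a multiple-object tracking framework that combines appearance and motion information to preserve identities in agricultural environments despite occlusions, illumination changes, and similar-looking targets; it clearly outperforms existing methods on public agricultural datasets.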
Details
Motivation: Multiple-object tracking in agricultural environments is challenged by repetitive patterns, similar object appearances, sudden illumination changes, and frequent occlusions. Existing methods rely on motion information and struggle to keep identities consistent under strong occlusion, while appearance information is hard to exploit because the objects look so alike.
Method: Proposes CropTrack, which combines appearance and motion information through a reranking-enhanced appearance association, a one-to-many association strategy with appearance-based conflict resolution, and an exponential moving average prototype feature bank that makes appearance matching more robust.
Result: On public agricultural MOT datasets, CropTrack clearly outperforms existing methods in IDF1 and association accuracy, with fewer identity switches and stronger identity preservation.
Conclusion: By fusing refined appearance information with motion information, CropTrack improves multiple-object tracking in agricultural scenes, especially under frequent occlusion and similar appearance, and provides a reliable basis for visual tracking in agricultural automation.
Abstract: Multiple-object tracking (MOT) in agricultural environments presents major challenges due to repetitive patterns, similar object appearances, sudden illumination changes, and frequent occlusions. Contemporary trackers in this domain rely on the motion of objects rather than appearance for association. Nevertheless, they struggle to maintain object identities when targets undergo frequent and strong occlusions. The high similarity of object appearances makes integrating appearance-based association nontrivial for agricultural scenarios. To solve this problem, we propose CropTrack, a novel MOT framework based on the combination of appearance and motion information. CropTrack integrates a reranking-enhanced appearance association, a one-to-many association with appearance-based conflict resolution strategy, and an exponential moving average prototype feature bank to improve appearance-based association. Evaluated on publicly available agricultural MOT datasets, CropTrack demonstrates consistent identity preservation, outperforming traditional motion-based tracking methods. Compared to the state of the art, CropTrack achieves significant gains in identification F1 and association accuracy scores with a lower number of identity switches.
[156] VLN-MME: Diagnosing MLLMs as Language-guided Visual Navigation agents
Xunyi Zhao,Gengze Zhou,Qi Wu
Main category: cs.CV
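TL;DR: This paper presents VLN-MME, a unified evaluation framework for probing multimodal large language models (MLLMs) as zero-shot embodied agents in Vision-and-Language Navigation (VLN), and finds that although MLLMs can follow instructions, they perform poorly at 3D spatial reasoning and context awareness.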
Details
Motivation: To study the weaknesses of MLLMs in multi-round dialogue, spatial reasoning, and sequential action prediction in embodied navigation tasks, and to advance their use in complex real-world environments.
Method: Builds VLN-MME, a modular and extensible evaluation framework that consolidates traditional navigation datasets into a standardized benchmark, and tests different MLLM architectures with Chain-of-Thought (CoT) reasoning and self-reflection.
Result: Experiments show that adding CoT and self-reflection actually lowers performance, indicating significant deficiencies in the 3D spatial reasoning and context awareness of MLLMs.
Conclusion: MLLMs have limited sequential decision-making ability in embodied navigation tasks and need targeted post-training to improve spatial reasoning and context understanding; VLN-MME provides a basis for systematic evaluation in future research.
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities across a wide range of vision-language tasks. However, their performance as embodied agents, which requires multi-round dialogue, spatial reasoning, and sequential action prediction, needs further exploration. Our work investigates this potential in the context of Vision-and-Language Navigation (VLN) by introducing a unified and extensible evaluation framework to probe MLLMs as zero-shot agents by bridging traditional navigation datasets into a standardized benchmark, named VLN-MME. We simplify the evaluation with a highly modular and accessible design. This flexibility streamlines experiments, enabling structured comparisons and component-level ablations across diverse MLLM architectures, agent designs, and navigation tasks. Crucially, enabled by our framework, we observe that enhancing our baseline agent with Chain-of-Thought (CoT) reasoning and self-reflection leads to an unexpected performance decrease. This suggests MLLMs exhibit poor context awareness in embodied navigation tasks; although they can follow instructions and structure their output, their 3D spatial reasoning fidelity is low. VLN-MME lays the groundwork for systematic evaluation of general-purpose MLLMs in embodied navigation settings and reveals limitations in their sequential decision-making capabilities. We believe these findings offer crucial guidance for MLLM post-training as embodied agents.
[157] OFL-SAM2: Prompt SAM2 with Online Few-shot Learner for Efficient Medical Image Segmentation
Meng Lan,Lefei Zhang,Xiaomeng Li
Main category: cs.CV
TL;DR: Proposes OFL-SAM2, a SAM2 framework that needs no manual prompts and supports online learning, for label-efficient medical image segmentation.
Details
Motivation: Adapting SAM2 to medical image segmentation requires large amounts of annotated data and high-quality manual prompts, both of which are labor-intensive and depend on experts.
Method: Designs a lightweight mapping network that uses a few annotated samples to transform generic image features into target features and supports online parameter updates during inference; introduces an online few-shot learner and an adaptive fusion module that dynamically combines the target features with SAM2's memory-attention features.
Result: The method is validated on three medical image segmentation datasets and reaches state-of-the-art performance with limited training data.
Conclusion: Through online learning and adaptive feature fusion, OFL-SAM2 achieves efficient, prompt-free medical image segmentation with good generalization and practical potential.
Abstract: The Segment Anything Model 2 (SAM2) has demonstrated remarkable promptable visual segmentation capabilities in video data, showing potential for extension to medical image segmentation (MIS) tasks involving 3D volumes and temporally correlated 2D image sequences. However, adapting SAM2 to MIS presents several challenges, including the need for extensive annotated medical data for fine-tuning and high-quality manual prompts, which are both labor-intensive and require intervention from medical experts. To address these challenges, we introduce OFL-SAM2, a prompt-free SAM2 framework for label-efficient MIS. Our core idea is to leverage limited annotated samples to train a lightweight mapping network that captures medical knowledge and transforms generic image features into target features, thereby providing additional discriminative target representations for each frame and eliminating the need for manual prompts. Crucially, the mapping network supports online parameter updates during inference, enhancing the model's generalization across test sequences. Technically, we introduce two key components: (1) an online few-shot learner that trains the mapping network to generate target features using limited data, and (2) an adaptive fusion module that dynamically integrates the target features with the memory-attention features generated by frozen SAM2, leading to accurate and robust target representation. Extensive experiments on three diverse MIS datasets demonstrate that OFL-SAM2 achieves state-of-the-art performance with limited training data.
[158] FinMMDocR: Benchmarking Financial Multimodal Reasoning with Scenario Awareness, Document Understanding, and Multi-Step Computation
Zichen Tang,Haihong E,Rongjin Li,Jiacheng Liu,Linwei Jia,Zhuodi Hao,Zhongjun Yang,Yuanze Li,Haolin Tian,Xinyi Hu,Peizhi Zhao,Yuan Liu,Zhengyu Wang,Xianghe Wang,Yiling Huang,Xueyuan Lin,Ruofei Bai,Zijian Xie,Qian Huang,Ruining Cao,Haocheng Gao
Main category: cs.CV
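TL;DR: FinMMDocR is a new bilingual multimodal benchmark for evaluating the numerical reasoning of multimodal large language models in real-world financial settings, characterized by scenario awareness, document understanding, and multi-step computation.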
Details
Motivation: Existing benchmarks do not adequately evaluate multimodal reasoning in real-world financial settings and lack support for complex financial documents and expert-level reasoning.
Method: Builds a bilingual multimodal dataset of 1,200 expert-annotated problems covering 12 types of implicit financial scenarios and 837 long Chinese/English documents of 9 types, averaging 50.8 pages with rich visual elements, and designs complex problems that require multi-step extraction and calculation.
Result: Problems need 11 reasoning steps on average (5.3 extraction + 5.7 calculation), 65.0% require cross-page evidence (2.4 pages on average), the best current MLLM reaches only 58.0% accuracy, and different RAG methods vary significantly in performance.
Conclusion: FinMMDocR can drive progress in MLLMs and reasoning-enhanced methods on complex, real-world multimodal tasks.
Abstract: We introduce FinMMDocR, a novel bilingual multimodal benchmark for evaluating multimodal large language models (MLLMs) on real-world financial numerical reasoning. Compared to existing benchmarks, our work delivers three major advancements. (1) Scenario Awareness: 57.9% of 1,200 expert-annotated problems incorporate 12 types of implicit financial scenarios (e.g., Portfolio Management), challenging models to perform expert-level reasoning based on assumptions; (2) Document Understanding: 837 Chinese/English documents spanning 9 types (e.g., Company Research) average 50.8 pages with rich visual elements, significantly surpassing existing benchmarks in both breadth and depth of financial documents; (3) Multi-Step Computation: Problems demand 11-step reasoning on average (5.3 extraction + 5.7 calculation steps), with 65.0% requiring cross-page evidence (2.4 pages average). The best-performing MLLM achieves only 58.0% accuracy, and different retrieval-augmented generation (RAG) methods show significant performance variations on this task. We expect FinMMDocR to drive improvements in MLLMs and reasoning-enhanced methods on complex multimodal reasoning tasks in real-world scenarios.
[159] Semi-Supervised Diversity-Aware Domain Adaptation for 3D Object detection
Bartłomiej Olber,Jakub Winter,Paweł Wawrzyński,Andrii Gamalii,Daniel Górniak,Marcin Łojek,Robert Nowak,Krystian Radlak
Main category: cs.CV
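TL;DR: This paper proposes a lidar domain adaptation method based on neuron activation patterns that reaches state-of-the-art 3D object detection performance by annotating only a small set of representative target-domain samples, combined with post-training techniques inspired by continual learning that prevent weight drift.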
Details
Motivation: Existing 3D object detectors generalize poorly across domains; for example, a model trained in the U.S. performs worse in Asia or Europe. Effective domain adaptation is therefore needed to make models usable across regions.
Method: Proposes a domain adaptation method based on neuron activation patterns that selects a small set of representative and diverse target-domain samples for annotation, combined with post-training techniques inspired by continual learning to prevent weight drift.
Result: With a very small annotation budget, the method outperforms linear probing and existing state-of-the-art domain adaptation techniques.
Conclusion: Carefully selecting a few target-domain samples and preventing weight drift improves the cross-domain generalization of 3D object detection models at low cost.
Abstract: 3D object detectors are fundamental components of perception systems in autonomous vehicles. While these detectors achieve remarkable performance on standard autonomous driving benchmarks, they often struggle to generalize across different domains; for instance, a model trained in the U.S. may perform poorly in regions like Asia or Europe. This paper presents a novel lidar domain adaptation method based on neuron activation patterns, demonstrating that state-of-the-art performance can be achieved by annotating only a small, representative, and diverse subset of samples from the target domain if they are correctly selected. The proposed approach requires a very small annotation budget and, when combined with post-training techniques inspired by continual learning, prevents weight drift from the original model. Empirical evaluation shows that the proposed domain adaptation approach outperforms both linear probing and state-of-the-art domain adaptation techniques.
[160] HaineiFRDM: Explore Diffusion to Restore Defects in Fast-Movement Films
Rongji Xun,Junjie Yuan,Zhongjie Wang
Main category: cs.CV
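TL;DR: This paper proposes HaineiFRDM, a film restoration framework that uses the strong content-understanding ability of diffusion models to help human experts restore film defects that are hard to distinguish.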
Details
Motivation: Existing open-source film restoration methods lag behind commercial methods because they train on low-quality synthetic data and rely on noisy optical flow, and they have not addressed high-resolution film restoration.
Method: Uses a patch-wise training and testing strategy, designs a position-aware Global Prompt and Frame Fusion Modules, adds a global-local frequency module to reconstruct consistent textures across patches, and first restores a low-resolution result that serves as a global residual to mitigate the blocky artifacts caused by patching.
Result: Experiments show that the model restores defects better than existing open-source methods.
Conclusion: Through its structural design and high-quality dataset, HaineiFRDM performs strongly on high-resolution film restoration and advances open-source film restoration.
Abstract: Existing open-source film restoration methods show limited performance compared to commercial methods due to training with low-quality synthetic data and employing noisy optical flows. In addition, high-resolution films have not been explored by the open-source methods. We propose HaineiFRDM (Film Restoration Diffusion Model), a film restoration framework, to explore the diffusion model's powerful content-understanding ability to help human experts better restore indistinguishable film defects. Specifically, we employ a patch-wise training and testing strategy to make restoring high-resolution films on one 24GB-VRAM GPU possible and design a position-aware Global Prompt and Frame Fusion Modules. Also, we introduce a global-local frequency module to reconstruct consistent textures among different patches. Besides, we first restore a low-resolution result and use it as a global residual to mitigate blocky artifacts caused by the patching process. Furthermore, we construct a film restoration dataset that contains restored real-degraded films and realistic synthetic data. Comprehensive experimental results conclusively demonstrate the superiority of our model in defect restoration ability over existing open-source methods. Code and the dataset will be released.
[161] ProDM: Synthetic Reality-driven Property-aware Progressive Diffusion Model for Coronary Calcium Motion Correction in Non-gated Chest CT
Xinran Gong,Gorkem Durak,Halil Ertugrul Aktas,Vedat Cicek,Jinkui Hao,Ulas Bagci,Nilay S. Shah,Bo Zhou
Main category: cs.CV
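TL;DR: This paper proposes ProDM, a generative diffusion model that restores coronary calcified lesions free of motion artifacts from non-gated chest CT, improving the accuracy and clinical usability of CAC scoring.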
Details
Motivation: Motion artifacts in non-gated chest CT seriously degrade the accuracy of coronary artery calcium (CAC) quantification, while gated CT has limited availability, so a reliable method for accurate CAC assessment on routine CT is needed.
Method: Proposes the ProDM framework with three key components: (1) a CAC motion simulation data engine that generates realistic non-gated images from gated CT; (2) property-aware learning through a differentiable calcium consistency loss that incorporates calcium-specific priors; (3) a progressive correction scheme that reduces artifacts gradually during the diffusion process.
Result: Experiments on real patient data show that ProDM significantly improves CAC scoring accuracy, spatial lesion fidelity, and risk stratification performance, and a reader study confirms that it suppresses motion artifacts and improves clinical usability.
Conclusion: ProDM is a promising approach to reliable coronary calcium quantification on routine chest CT, particularly for population screening and everyday imaging.
Abstract: Coronary artery calcium (CAC) scoring from chest CT is a well-established tool to stratify and refine clinical cardiovascular disease risk estimation. CAC quantification relies on the accurate delineation of calcified lesions, but is oftentimes affected by artifacts introduced by cardiac and respiratory motion. ECG-gated cardiac CTs substantially reduce motion artifacts, but their use in population screening and routine imaging remains limited due to gating requirements and lack of insurance coverage. Although identification of incidental CAC from non-gated chest CT is increasingly considered because it offers an accessible and widely available alternative, this modality is limited by more severe motion artifacts. We present ProDM (Property-aware Progressive Correction Diffusion Model), a generative diffusion framework that restores motion-free calcified lesions from non-gated CTs. ProDM introduces three key components: (1) a CAC motion simulation data engine that synthesizes realistic non-gated acquisitions with diverse motion trajectories directly from cardiac-gated CTs, enabling supervised training without paired data; (2) a property-aware learning strategy incorporating calcium-specific priors through a differentiable calcium consistency loss to preserve lesion integrity; and (3) a progressive correction scheme that reduces artifacts gradually across diffusion steps to enhance stability and calcium fidelity. Experiments on real patient datasets show that ProDM significantly improves CAC scoring accuracy, spatial lesion fidelity, and risk stratification performance compared with several baselines.
A reader study on real non-gated scans further confirms that ProDM suppresses motion artifacts and improves clinical usability. These findings highlight the potential of progressive, property-aware frameworks for reliable CAC quantification from routine chest CT imaging.
[162] VIPER: Process-aware Evaluation for Generative Video Reasoning
Yifan Li,Yukai Gu,Yingqian Min,Zikang Liu,Yifan Du,Kun Zhou,Min Yang,Wayne Xin Zhao,Minghui Qiu
Main category: cs.CV
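TL;DR: This paper proposes a new evaluation paradigm for video generation models, introducing the VIPER benchmark and the Process-outcome Consistency (POC@r) metric to assess both the intermediate steps and the final result of generative video reasoning; it shows that current state-of-the-art models often reach correct outcomes through incorrect processes.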
Details
Motivation: Existing video generation evaluations mostly judge single frames, which lets a model reach a correct result through a flawed reasoning process (outcome-hacking), and they lack a fine-grained assessment of the reasoning process.
Method: Proposes the VIPER benchmark, covering 16 tasks across temporal, structural, symbolic, spatial, physics, and planning reasoning, and designs the POC@r metric, which uses a VLM-based judge with a hierarchical rubric to evaluate both the validity of intermediate reasoning steps and the correctness of the final output.
Result: Experiments show that current state-of-the-art video generation models score only about 20% on POC@1.0, revealing severe outcome-hacking; test-time scaling and sampling robustness also show significant shortcomings.
Conclusion: Current video generation models are still far from true generalized visual reasoning, and evaluation must examine the reasoning process, not only the final result.
Abstract: Recent breakthroughs in video generation have demonstrated an emerging capability termed Chain-of-Frames (CoF) reasoning, where models resolve complex tasks through the generation of continuous frames. While these models show promise for Generative Video Reasoning (GVR), existing evaluation frameworks often rely on single-frame assessments, which can lead to outcome-hacking, where a model reaches a correct conclusion through an erroneous process. To address this, we propose a process-aware evaluation paradigm. We introduce VIPER, a comprehensive benchmark spanning 16 tasks across temporal, structural, symbolic, spatial, physics, and planning reasoning. Furthermore, we propose Process-outcome Consistency (POC@r), a new metric that utilizes VLM-as-Judge with a hierarchical rubric to evaluate both the validity of the intermediate steps and the final result. Our experiments reveal that state-of-the-art video models achieve only about 20% POC@1.0 and exhibit significant outcome-hacking. We further explore the impact of test-time scaling and sampling robustness, highlighting a substantial gap between current video generation and true generalized visual reasoning. Our benchmark will be publicly released.
[163] ShowUI-$π$: Flow-based Generative Models as GUI Dexterous Hands
Siyuan Hu,Kevin Qinghong Lin,Mike Zheng Shou
Main category: cs.CV
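TL;DR: This paper presents ShowUI-$\pi$, the first flow-based generative model for dexterous manipulation in GUI agents. It models discrete clicks and continuous drags in a unified way and comes with a dataset of 20K drag trajectories and the ScreenDrag benchmark. Experiments show that existing proprietary agents perform poorly, while ShowUI-$\pi$ achieves better performance with only 450M parameters, advancing human-like dexterous control in digital environments.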
Details
Motivation: Existing GUI agents rely on discrete click predictions and cannot perform continuous interactions that need real-time perception and adjustment (such as dragging a progress bar). The lack of support for complex closed-loop operations limits their flexibility and usefulness in real digital environments.
Method: Proposes ShowUI-$\pi$, a flow-based action generation model in which a lightweight action expert predicts incremental cursor adjustments from continuous visual observations; designs a unified discrete-continuous action framework that integrates clicks and drags; builds the ScreenDrag benchmark, with 20K manually collected and synthesized drag trajectories and online/offline evaluation protocols across multiple domains.
Result: Existing proprietary GUI agents perform poorly on ScreenDrag (Operator scores 13.27 and the best, Gemini-2.5-CUA, reaches 22.18), whereas ShowUI-$\pi$ scores 26.98 with only 450M parameters, confirming its effectiveness on drag tasks.
Conclusion: ShowUI-$\pi$ is the first to model discrete and continuous actions in a unified way for GUI agents, clearly improving support for complex interactions, especially dragging; it offers a path toward human-level dexterous control in the digital world, and its new benchmark exposes the limits of current technology.
Abstract: Building intelligent agents capable of dexterous manipulation is essential for achieving human-like automation in both robotics and digital environments. However, existing GUI agents rely on discrete click predictions (x,y), which prohibits free-form, closed-loop trajectories (e.g. dragging a progress bar) that require continuous, on-the-fly perception and adjustment. In this work, we develop ShowUI-$π$, the first flow-based generative model as a GUI dexterous hand, featuring the following designs: (i) Unified Discrete-Continuous Actions, integrating discrete clicks and continuous drags within a shared model, enabling flexible adaptation across diverse interaction modes; (ii) Flow-based Action Generation for drag modeling, which predicts incremental cursor adjustments from continuous visual observations via a lightweight action expert, ensuring smooth and stable trajectories; (iii) Drag Training data and Benchmark, where we manually collect and synthesize 20K drag trajectories across five domains (e.g. PowerPoint, Adobe Premiere Pro), and introduce ScreenDrag, a benchmark with comprehensive online and offline evaluation protocols for assessing GUI agents' drag capabilities. Our experiments show that proprietary GUI agents still struggle on ScreenDrag (e.g. Operator scores 13.27, and the best Gemini-2.5-CUA reaches 22.18). In contrast, ShowUI-$π$ achieves 26.98 with only 450M parameters, underscoring both the difficulty of the task and the effectiveness of our approach.
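To make the contrast between a single discrete (x, y) prediction and a closed-loop drag concrete, here is a minimal Python sketch of incremental cursor control. It is an assumption-laden illustration, not the ShowUI-$π$ model: `drag` and `toy_policy` are hypothetical names, the hand-written policy only stands in for the learned action expert, the observation is reduced to the cursor and target coordinates instead of screen images, and no flow model is implemented.

```python
import math
from typing import Callable, List, Tuple

Point = Tuple[float, float]
Policy = Callable[[Point, Point], Tuple[float, float]]  # returns (dx, dy)


def toy_policy(cursor: Point, target: Point, max_step: float = 20.0) -> Tuple[float, float]:
    """Hand-written stand-in for a learned policy: step toward the target,
    moving at most max_step pixels per call."""
    dx, dy = target[0] - cursor[0], target[1] - cursor[1]
    dist = math.hypot(dx, dy)
    if dist <= max_step:
        return dx, dy
    scale = max_step / dist
    return dx * scale, dy * scale


def drag(start: Point, target: Point, policy: Policy,
         max_steps: int = 100, tol: float = 0.5) -> List[Point]:
    """Closed loop: observe the cursor, apply one increment, repeat."""
    trajectory = [start]
    cursor = start
    for _ in range(max_steps):
        if math.dist(cursor, target) <= tol:
            break                      # close enough: release the drag
        dx, dy = policy(cursor, target)
        cursor = (cursor[0] + dx, cursor[1] + dy)
        trajectory.append(cursor)
    return trajectory


# Drag a slider handle 300 px to the right in small increments.
path = drag((100.0, 300.0), (400.0, 300.0), toy_policy)
print(len(path), path[-1])
```

Because the increment is recomputed from the current cursor position at every step, the loop can correct itself if an earlier step was off, which a one-shot click prediction cannot do.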
We hope this work advances GUI agents toward human-like dexterous control in the digital world. The code is available at https://github.com/showlab/showui-pi.
[164] Evaluating the Impact of Compression Techniques on the Robustness of CNNs under Natural Corruptions
Itallo Patrick Castro Alves Da Silva,Emanuel Adler Medeiros Pereira,Erick de Andrade Barboza,Baldoino Fonseca dos Santos Neto,Marcio de Medeiros Ribeiro
Main category: cs.CV
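TL;DR: This paper comprehensively evaluates compression techniques for convolutional neural networks (quantization, pruning, and weight clustering), analyzing the trade-offs among accuracy, compression ratio, and robustness, and finds that certain combinations can preserve or even improve robustness under natural corruptions.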
Details
Motivation: Model compression may affect robustness under natural corruptions, so its effect on robustness should be evaluated before deployment on resource-constrained devices.
Method: Applies quantization, pruning, and weight clustering to ResNet-50, VGG-19, and MobileNetV2, evaluates them individually and in combination on CIFAR-10-C and CIFAR-100-C, and uses multi-objective assessment to analyze the performance trade-offs.
Result: Certain compression strategies preserve or improve robustness while raising the compression ratio, particularly on networks with more complex architectures; customized combinations of techniques give better multi-objective results.
Conclusion: Choosing and combining compression techniques carefully helps deploy efficient and robust deep learning models in corrupted real-world environments.
Abstract: Compressed deep learning models are crucial for deploying computer vision systems on resource-constrained devices. However, model compression may affect robustness, especially under natural corruption. Therefore, it is important to consider robustness evaluation while validating computer vision systems. This paper presents a comprehensive evaluation of compression techniques (quantization, pruning, and weight clustering), applied individually and in combination to convolutional neural networks (ResNet-50, VGG-19, and MobileNetV2). Using the CIFAR-10-C and CIFAR-100-C datasets, we analyze the trade-offs between robustness, accuracy, and compression ratio. Our results show that certain compression strategies not only preserve but can also improve robustness, particularly on networks with more complex architectures. Utilizing multiobjective assessment, we determine the best configurations, showing that customized technique combinations produce beneficial multi-objective results. This study provides insights into selecting compression methods for robust and efficient deployment of models in corrupted real-world environments.
[165] DarkEQA: Benchmarking Vision-Language Models for Embodied Question Answering in Low-Light Indoor Environments
Yohan Park,Hyunwoo Ha,Wonjun Jo,Tae-Hyun Oh
Main category: cs.CV
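TL;DR: This paper presents DarkEQA, an open-source benchmark for evaluating the visual question answering ability of vision-language models under multi-level low-light conditions. It emphasizes physically realistic modeling of visual degradation and reveals the limitations of existing models under such challenging conditions.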
Details
Motivation: Existing benchmarks for vision-language models (VLMs) mainly evaluate performance under ideal lighting and overlook visual degradations that are common in practice, such as low light. The demand for continuous 24/7 operation makes a systematic evaluation of perceptual robustness in low-light environments pressing.
Method: The DarkEQA benchmark models low-light degradation in physically realistic linear RAW space, simulating illumination drop and sensor noise, and generates images through an ISP-inspired rendering pipeline. VLMs and Low-Light Image Enhancement (LLIE) models are evaluated on question answering from egocentric observations.
Result: Experiments on a range of state-of-the-art VLMs and LLIE models show that current VLMs degrade markedly under low light, exposing their perception bottleneck, while existing LLIE methods improve their performance only to a limited extent.
Conclusion: DarkEQA provides a new, physically realistic benchmark for evaluating the perceptual robustness of VLMs in low-light environments, reveals the shortcomings of current models, and encourages future research on embodied agents under adverse visual conditions.
Abstract: Vision Language Models (VLMs) are increasingly adopted as central reasoning modules for embodied agents. Existing benchmarks evaluate their capabilities under ideal, well-lit conditions, yet robust 24/7 operation demands performance under a wide range of visual degradations, including low-light conditions at night or in dark environments, a core necessity that has been largely overlooked. To address this underexplored challenge, we present DarkEQA, an open-source benchmark for evaluating EQA-relevant perceptual primitives under multi-level low-light conditions. DarkEQA isolates the perception bottleneck by evaluating question answering from egocentric observations under controlled degradations, enabling attributable robustness analysis. A key design feature of DarkEQA is its physical fidelity: visual degradations are modeled in linear RAW space, simulating physics-based illumination drop and sensor noise followed by an ISP-inspired rendering pipeline. We demonstrate the utility of DarkEQA by evaluating a wide range of state-of-the-art VLMs and Low-Light Image Enhancement (LLIE) models. Our analysis systematically reveals VLMs' limitations when operating under these challenging visual conditions. Our code and benchmark dataset will be released upon acceptance.
[166] Bi-C2R: Bidirectional Continual Compatible Representation for Re-indexing Free Lifelong Person Re-identification
Zhenyu Cui,Jiahuan Zhou,Yuxin Peng
Main category: cs.CV
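TL;DR: This paper proposes a new task, Re-index Free Lifelong person Re-IDentification (RFL-ReID), which requires no re-indexing of historical gallery images, and designs the Bidirectional Continuous Compatible Representation (Bi-C2R) framework. Bi-C2R makes the features of new and old models compatible without re-extracting historical features, effectively mitigates catastrophic forgetting, and achieves strong performance.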
Details
Motivation: Existing lifelong person re-identification methods rely on re-indexing historical gallery images, which is hard to apply in practice because of privacy concerns and high computational cost. In addition, the features extracted by new and old models are incompatible, which degrades retrieval performance.
Method: The Bi-C2R framework uses a bidirectional knowledge transfer mechanism to continuously update the gallery features produced by the old model as the model is updated, making them compatible with the new model, while balancing the learning of new and old knowledge to avoid catastrophic forgetting.
Result: The effectiveness of Bi-C2R is validated on multiple benchmark datasets: it leads on the newly proposed RFL-ReID task and also reaches a state-of-the-art level on the traditional L-ReID task.
Conclusion: Bi-C2R addresses feature compatibility and knowledge forgetting in re-index free lifelong person re-identification, offering a feasible solution for continual-learning ReID systems in practical scenarios.
Abstract: Lifelong person Re-IDentification (L-ReID) exploits sequentially collected data to continuously train and update a ReID model, focusing on the overall performance on all data. Its main challenge is to avoid catastrophic forgetting of old knowledge while training on new data. Existing L-ReID methods typically re-extract new features for all historical gallery images for inference after each update, known as "re-indexing". However, historical gallery images often cannot be stored directly because of data privacy issues, and re-indexing large-scale gallery images is costly. As a result, retrieval inevitably becomes incompatible between query features extracted by the updated model and gallery features extracted by the models before the update, greatly impairing the re-identification performance. To tackle the above issue, this paper focuses on a new task called Re-index Free Lifelong person Re-IDentification (RFL-ReID), which requires performing lifelong person re-identification without re-indexing historical gallery images. Therefore, RFL-ReID is more challenging than L-ReID, requiring continuous learning and balancing new and old knowledge in diverse streaming data, and making the features output by the new and old models compatible with each other. To this end, we propose a Bidirectional Continuous Compatible Representation (Bi-C2R) framework to continuously update the gallery features extracted by the old model to perform efficient L-ReID in a compatible manner. We verify our proposed Bi-C2R method through theoretical analysis and extensive experiments on multiple benchmarks, which demonstrate that the proposed method can achieve leading performance on both the introduced RFL-ReID task and the traditional L-ReID task.
[167] FoundationSLAM: Unleashing the Power of Depth Foundation Models for End-to-End Dense Visual SLAM
Yuchen Wu,Jiahe Li,Fabio Tosi,Matteo Poggi,Jin Zheng,Xiao Bai
Main category: cs.CV
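TL;DR: FoundationSLAM is a learning-based monocular dense SLAM system that improves the accuracy and robustness of tracking and mapping by incorporating geometric guidance from foundation depth models.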
Details
Motivation: To address the lack of geometric consistency in previous flow-based methods and achieve more accurate and robust pose estimation and dense reconstruction.
Method: A Hybrid Flow Network produces geometry-aware correspondences, a Bi-Consistent Bundle Adjustment Layer performs joint optimization under multi-view constraints, and a Reliability-Aware Refinement mechanism dynamically adjusts the flow updates.
Result: The system achieves superior trajectory accuracy and dense reconstruction quality on multiple challenging datasets while running in real time at 18 FPS.
Conclusion: By fusing the geometric priors of foundation models with learned optical flow, FoundationSLAM significantly improves the accuracy and generalization of the SLAM system while remaining real-time.
Abstract: We present FoundationSLAM, a learning-based monocular dense SLAM system that addresses the absence of geometric consistency in previous flow-based approaches for accurate and robust tracking and mapping. Our core idea is to bridge flow estimation with geometric reasoning by leveraging the guidance from foundation depth models. To this end, we first develop a Hybrid Flow Network that produces geometry-aware correspondences, enabling consistent depth and pose inference across diverse keyframes. To enforce global consistency, we propose a Bi-Consistent Bundle Adjustment Layer that jointly optimizes keyframe pose and depth under multi-view constraints. Furthermore, we introduce a Reliability-Aware Refinement mechanism that dynamically adapts the flow update process by distinguishing between reliable and uncertain regions, forming a closed feedback loop between matching and optimization. Extensive experiments demonstrate that FoundationSLAM achieves superior trajectory accuracy and dense reconstruction quality across multiple challenging datasets, while running in real time at 18 FPS, demonstrating strong generalization to various scenarios and practical applicability of our method.
[168] From Inpainting to Editing: A Self-Bootstrapping Framework for Context-Rich Visual Dubbing
Xu He,Haoxian Zhang,Hejia Chen,Changyuan Zheng,Liyang Chen,Songlin Tang,Jiehui Huang,Xiaoqiang Liu,Pengfei Wan,Zhiyong Wu
Main category: cs.CV
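TL;DR: This paper proposes a self-bootstrapping framework that turns audio-driven visual dubbing from an ill-posed inpainting task into a well-conditioned video editing problem, using a Diffusion Transformer to generate ideal training data and achieving highly accurate lip sync and identity preservation.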
Details
Motivation: Existing methods lack ideal paired training data (video pairs that differ only in lip motion while all other visual conditions are identical) and therefore rely on a mask-based inpainting paradigm, which leads to visual artifacts, identity drift, and poor synchronization.
Method: A Diffusion Transformer serves as a data generator, synthesizing a lip-altered companion video for each real video sample to build visually aligned video pairs. A DiT-based audio-driven editor is then trained on these pairs, with a timestep-adaptive multi-phase learning strategy that disentangles conflicting editing objectives in the diffusion process.
Result: The method significantly outperforms existing approaches in lip-sync accuracy, identity preservation, and robustness in complex real-world scenes, while also improving visual fidelity.
Conclusion: By constructing ideal training data and fully exploiting the complete visual context, the method resolves fundamental flaws of traditional approaches and advances audio-driven visual dubbing toward higher quality and practical use.
Abstract: Audio-driven visual dubbing aims to synchronize a video's lip movements with new speech, but is fundamentally challenged by the lack of ideal training data: paired videos where only a subject's lip movements differ while all other visual conditions are identical. Existing methods circumvent this with a mask-based inpainting paradigm, where an incomplete visual conditioning forces models to simultaneously hallucinate missing content and sync lips, leading to visual artifacts, identity drift, and poor synchronization. In this work, we propose a novel self-bootstrapping framework that reframes visual dubbing from an ill-posed inpainting task into a well-conditioned video-to-video editing problem. Our approach employs a Diffusion Transformer, first as a data generator, to synthesize ideal training data: a lip-altered companion video for each real sample, forming visually aligned video pairs. A DiT-based audio-driven editor is then trained on these pairs end-to-end, leveraging the complete and aligned input video frames to focus solely on precise, audio-driven lip modifications. This complete, frame-aligned input conditioning forms a rich visual context for the editor, providing it with complete identity cues, scene interactions, and continuous spatiotemporal dynamics. Leveraging this rich context fundamentally enables our method to achieve highly accurate lip sync, faithful identity preservation, and exceptional robustness against challenging in-the-wild scenarios. We further introduce a timestep-adaptive multi-phase learning strategy as a necessary component to disentangle conflicting editing objectives across diffusion timesteps, thereby facilitating stable training and yielding enhanced lip synchronization and visual fidelity. Additionally, we propose ContextDubBench, a comprehensive benchmark dataset for robust evaluation in diverse and challenging practical application scenarios.
[169] FineTec: Fine-Grained Action Recognition Under Temporal Corruption via Skeleton Decomposition and Sequence Completion
Dian Shao,Mingfei Shi,Like Liu
Main category: cs.CV
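TL;DR: This paper proposes FineTec, a unified framework for fine-grained action recognition under temporal corruption. Using context-aware completion, spatial decomposition, and physics-based dynamics estimation, the method significantly improves recognition performance when large amounts of data are missing.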
Details
Motivation: Existing methods struggle to accurately recover the spatiotemporal features of fine-grained actions from temporally corrupted skeleton sequences with large amounts of missing data, and they perform poorly in online pose estimation scenarios in particular.
Method: FineTec first restores a base skeleton sequence through context-aware completion with diverse temporal masking. A spatial decomposition module based on semantic regions then splits the skeleton into dynamic and static subgroups and generates augmented sequences. Finally, Lagrangian dynamics are used to estimate joint accelerations, which are combined with a GCN for action recognition.
Result: The method is validated on multiple benchmarks, including NTU-60, NTU-120, Gym99, and Gym288, reaching top-1 accuracies of 89.1% and 78.1% on the Gym99-severe and Gym288-severe settings, respectively.
Conclusion: FineTec significantly outperforms existing methods at different levels of temporal corruption, showing strong robustness and generalization, and is suitable for fine-grained action recognition in real-world scenarios.
Abstract: Recognizing fine-grained actions from temporally corrupted skeleton sequences remains a significant challenge, particularly in real-world scenarios where online pose estimation often yields substantial missing data. Existing methods often struggle to accurately recover temporal dynamics and fine-grained spatial structures, resulting in the loss of subtle motion cues crucial for distinguishing similar actions. To address this, we propose FineTec, a unified framework for Fine-grained action recognition under Temporal Corruption. FineTec first restores a base skeleton sequence from corrupted input using context-aware completion with diverse temporal masking. Next, a skeleton-based spatial decomposition module partitions the skeleton into five semantic regions, further divides them into dynamic and static subgroups based on motion variance, and generates two augmented skeleton sequences via targeted perturbation. These, along with the base sequence, are then processed by a physics-driven estimation module, which utilizes Lagrangian dynamics to estimate joint accelerations. Finally, both the fused skeleton position sequence and the fused acceleration sequence are jointly fed into a GCN-based action recognition head. Extensive experiments on both coarse-grained (NTU-60, NTU-120) and fine-grained (Gym99, Gym288) benchmarks show that FineTec significantly outperforms previous methods under various levels of temporal corruption. Specifically, FineTec achieves top-1 accuracies of 89.1% and 78.1% on the challenging Gym99-severe and Gym288-severe settings, respectively, demonstrating its robustness and generalizability. Code and datasets can be found at https://smartdianlab.github.io/projects-FineTec/.
[170] Edit3r: Instant 3D Scene Editing from Sparse Unposed Images
Jiageng Liu,Weijie Lyu,Xueting Li,Yejie Guo,Ming-Hsuan Yang
Main category: cs.CV
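TL;DR: Edit3r is a feed-forward framework that reconstructs and edits 3D scenes in a single pass from unposed, view-inconsistent edited images, without optimization or pose estimation. It is fast, produces high-quality results, and shows promise for real-time applications.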
Details
Motivation: Existing 3D scene editing methods usually require per-scene optimization or rely on accurate camera pose estimation, which makes them slow and makes photorealistic rendering hard to achieve. Multi-view consistent edited images for supervised training are also lacking.
Method: Edit3r uses a feed-forward network to directly predict instruction-aligned 3D edits. A SAM2-based recoloring strategy generates cross-view-consistent supervision, and an asymmetric input strategy pairs a recolored reference view with raw auxiliary views so that the network fuses information from different observations.
Result: On the newly built large-scale benchmark DL3DV-Edit-Bench, Edit3r outperforms existing baselines in semantic alignment and 3D consistency, with significantly faster inference.
Conclusion: Edit3r achieves fast, optimization-free, single-pass 3D scene editing, shows strong potential for photorealistic rendering and real-time applications, and moves 3D scene editing closer to practical use.
Abstract: We present Edit3r, a feed-forward framework that reconstructs and edits 3D scenes in a single pass from unposed, view-inconsistent, instruction-edited images. Unlike prior methods requiring per-scene optimization, Edit3r directly predicts instruction-aligned 3D edits, enabling fast and photorealistic rendering without optimization or pose estimation. A key challenge in training such a model lies in the absence of multi-view consistent edited images for supervision. We address this with (i) a SAM2-based recoloring strategy that generates reliable, cross-view-consistent supervision, and (ii) an asymmetric input strategy that pairs a recolored reference view with raw auxiliary views, encouraging the network to fuse and align disparate observations. At inference, our model effectively handles images edited by 2D methods such as InstructPix2Pix, despite not being exposed to such edits during training. For large-scale quantitative evaluation, we introduce DL3DV-Edit-Bench, a benchmark built on the DL3DV test split, featuring 20 diverse scenes, 4 edit types and 100 edits in total. Comprehensive quantitative and qualitative results show that Edit3r achieves superior semantic alignment and enhanced 3D consistency compared to recent baselines, while operating at significantly higher inference speed, making it promising for real-time 3D editing applications.
[171] GaMO: Geometry-aware Multi-view Diffusion Outpainting for Sparse-View 3D Reconstruction
Yi-Chuan Huang,Hao-Jen Chien,Chin-Yang Lin,Ying-Huan Chen,Yu-Lun Liu
Main category: cs.CV
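TL;DR: This paper proposes GaMO (Geometry-aware Multi-view Outpainter), a new framework that addresses sparse-view 3D reconstruction through multi-view outpainting. Rather than generating novel viewpoints, GaMO expands the field of view from existing viewpoints, preserving geometric consistency and broadening scene coverage. It achieves SOTA performance in a zero-shot setting without training and is 25 times faster than existing diffusion-based methods.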
Details
Motivation: Existing sparse-view 3D reconstruction methods have three main problems: insufficient field-of-view coverage, geometric inconsistency across generated views, and high computational cost. The paper proposes a new reconstruction paradigm to overcome them.
Method: The GaMO framework reformulates sparse-view reconstruction as a multi-view outpainting task. Using multi-view conditioning and geometry-aware denoising strategies, it outpaints the existing views in a zero-shot manner, expanding their field of view without generating new viewpoints and thereby preserving geometric consistency.
Result: On the Replica and ScanNet++ datasets, GaMO achieves the best reconstruction quality with 3, 6, and 9 input views, outperforming prior methods in PSNR and LPIPS, with processing time under 10 minutes and a 25-fold speedup over SOTA diffusion-based methods.
Conclusion: By outpainting multiple views instead of generating novel ones, GaMO addresses the coverage, consistency, and efficiency problems of sparse-view reconstruction and offers an efficient, high-quality paradigm for 3D reconstruction.
Abstract: Recent advances in 3D reconstruction have achieved remarkable progress in high-quality scene capture from dense multi-view imagery, yet struggle when input views are limited. Various approaches, including regularization techniques, semantic priors, and geometric constraints, have been implemented to address this challenge. The latest diffusion-based methods have demonstrated substantial improvements by generating novel views from new camera poses to augment training data, surpassing earlier regularization and prior-based techniques. Despite this progress, we identify three critical limitations in these state-of-the-art approaches: inadequate coverage beyond known view peripheries, geometric inconsistencies across generated views, and computationally expensive pipelines. We introduce GaMO (Geometry-aware Multi-view Outpainter), a framework that reformulates sparse-view reconstruction through multi-view outpainting. Instead of generating new viewpoints, GaMO expands the field of view from existing camera poses, which inherently preserves geometric consistency while providing broader scene coverage. Our approach employs multi-view conditioning and geometry-aware denoising strategies in a zero-shot manner without training. Extensive experiments on Replica and ScanNet++ demonstrate state-of-the-art reconstruction quality across 3, 6, and 9 input views, outperforming prior methods in PSNR and LPIPS, while achieving a $25\times$ speedup over SOTA diffusion-based methods with processing time under 10 minutes.
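For readers unfamiliar with the PSNR metric cited above, the following is a minimal Python sketch of its standard definition, $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$. The function name `psnr`, the flat-list image representation, and the default intensity range `max_value=1.0` are illustrative assumptions and are not taken from the GaMO code. LPIPS is a learned perceptual metric and is not sketched here.

```python
import math


def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized images.

    Images are given as flat sequences of pixel intensities (an assumption
    made for this sketch); max_value is the largest possible intensity.
    """
    if len(reference) != len(rendered) or len(reference) == 0:
        raise ValueError("images must be non-empty and have the same size")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        # Identical images: PSNR is unbounded.
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher PSNR means the rendered view is closer to the reference; a uniform error of 0.1 on a [0, 1] intensity scale corresponds to about 20 dB.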
Project page: https://yichuanh.github.io/GaMO/
[172] SpaceTimePilot: Generative Rendering of Dynamic Scenes Across Space and Time
Zhening Huang,Hyeonho Jeong,Xuelin Chen,Yulia Gryaditskaya,Tuanfeng Y. Wang,Joan Lasenby,Chun-Hao Huang
Main category: cs.CV
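TL;DR: SpaceTimePilot is a video diffusion model that achieves controllable generative rendering by disentangling space and time, allowing the camera viewpoint and the motion sequence to be controlled independently given a monocular video.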