Table of Contents
cs.CL
[1] From Global to Local: Learning Context-Aware Graph Representations for Document Classification and Summarization
Ruangrin Ldallitsakool,Margarita Bugueño,Gerard de Melo
Main category: cs.CL
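TL;DR: This paper proposes a data-driven method for automatically constructing graph-based document representations. A dynamic sliding-window attention module captures local and mid-range semantic dependencies between sentences as well as structural relations within documents. GATs trained on these graphs achieve competitive document classification results at lower computational cost, and the paper also offers a preliminary exploration of extractive summarization.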
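As a rough illustration of the sliding-window idea in the summary above, the sketch below connects each sentence to the sentences that follow it within a fixed window. This is a simplification under stated assumptions: the paper's window is dynamic and its edges are learned by an attention module, whereas `window_edges` and its fixed `window` parameter are hypothetical names used only for this example.

```python
def window_edges(num_sentences, window):
    """Return undirected edges (i, j) with i < j and j - i <= window.

    Fixed-window stand-in for a learned, dynamic window (assumption).
    """
    edges = []
    for i in range(num_sentences):
        last = min(i + window, num_sentences - 1)
        for j in range(i + 1, last + 1):
            edges.append((i, j))
    return edges


# A 4-sentence document: window=1 links only neighbors,
# window=2 also links sentences two positions apart.
local_graph = window_edges(4, 1)
mid_range_graph = window_edges(4, 2)
```

A wider window adds mid-range edges at the cost of a denser graph, which is the trade-off a learned window size would be expected to manage.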
Details
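Motivation: Existing document representation methods fall short in capturing multi-scale semantic dependencies and structural relations between sentences, and they are computationally expensive.
Method: A data-driven graph construction method uses a dynamic sliding-window attention module to learn inter-sentence dependencies and produce document graphs; Graph Attention Networks (GATs) are then trained on these graphs for downstream tasks.
Result: GATs perform competitively on document classification while consuming fewer computational resources than previous methods; on extractive summarization the method shows potential but also clear limitations.
Conclusion: The proposed graph construction method balances expressiveness and efficiency and suits tasks such as classification, but it needs improvement for more complex tasks such as summarization.
Abstract: This paper proposes a data-driven method to automatically construct graph-based document representations. Building upon the recent work of Bugueño and de Melo (2025), we leverage the dynamic sliding-window attention module to effectively capture local and mid-range semantic dependencies between sentences, as well as structural relations within documents. Graph Attention Networks (GATs) trained on our learned graphs achieve competitive results on document classification while requiring lower computational resources than previous approaches. We further present an exploratory evaluation of the proposed graph construction method for extractive document summarization, highlighting both its potential and current limitations. The implementation of this project can be found on GitHub.
[2] Noise reduction in BERT NER models for clinical entity extraction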
Kuldeep Jiwani,Yash K Jeengar,Ayush Dhaka
Main category: cs.CL
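TL;DR: This paper proposes a Noise Removal (NR) model that improves the precision of clinical named entity recognition (NER) systems. By analyzing the probability sequences output by the NER model together with advanced features such as the Probability Density Map (PDM), it reduces false positives by 50% to 90%.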
Details
Motivation: Existing BERT-based clinical NER models achieve high recall but insufficient precision, and high precision is essential in clinical settings. Simple filtering by a SoftMax probability threshold is unreliable because Transformers often assign inflated confidence to weak predictions.
Method: A supervised Noise Removal (NR) model takes the token-level tags and probability sequences output by the NER model as input and uses the Probability Density Map (PDM) to model the Semantic-Pull effect in Transformer embeddings, classifying each prediction as strong or weak.
Result: False positives are reduced by 50% to 90% across several clinical NER models.
Conclusion: The NR model effectively calibrates the confidence of NER outputs and substantially improves the precision of clinical entity extraction without relying on unreliable hard thresholds, offering a practical post-processing option for high-reliability medical AI.
Abstract: Precision is of utmost importance in the realm of clinical entity extraction from clinical notes and reports. Encoder models fine-tuned for Named Entity Recognition (NER) are an efficient choice for this purpose, as they don't hallucinate. We pre-trained an in-house BERT over clinical data and then fine-tuned it for NER. These models performed well on recall but could not reach the high precision range needed for clinical models. To address this challenge, we developed a Noise Removal model that refines the output of NER. The NER model assigns token-level entity tags along with probability scores for each token. Our Noise Removal (NR) model then analyzes these probability sequences and classifies predictions as either weak or strong. A naïve approach might involve filtering predictions based on low probability values; however, this method is unreliable. Owing to the characteristics of the SoftMax function, Transformer-based architectures often assign disproportionately high confidence scores even to uncertain or weak predictions, making simple thresholding ineffective. To address this issue, we adopted a supervised modeling strategy in which the NR model leverages advanced features such as the Probability Density Map (PDM). The PDM captures the Semantic-Pull effect observed within Transformer embeddings, an effect that manifests in the probability distributions of NER class predictions across token sequences. This approach enables the model to classify predictions as weak or strong with significantly improved accuracy.
With these NR models we were able to reduce False Positives across various clinical NER models by 50% to 90%.
[3] Personalization Increases Affective Alignment but Has Role-Dependent Effects on Epistemic Independence in LLMs
Sean W. Kelley,Christoph Riedl
Main category: cs.CL
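TL;DR: This paper systematically evaluates how personalization affects sycophancy in large language models (LLMs). Personalization generally increases affective alignment, but its effect on epistemic alignment depends on the model's role: it strengthens epistemic independence when the model gives advice and weakens it when the model acts as a social peer.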
Details
Motivation: LLMs are prone to sycophancy, and as models increasingly condition responses on user-specific context (personality traits, preferences, conversation history), they become better able to tailor agreement; yet how personalization modulates sycophancy has not been evaluated systematically.
Method: A rigorous evaluation across nine frontier models and five benchmark datasets covering advice, moral judgment, and debate; a measurement framework that separates affective from epistemic alignment; and robustness tests that rule out confounds such as input length or demographic information.
Result: Personalization broadly raises affective alignment (emotional validation, hedging/deference), while its effect on epistemic alignment (belief adoption, position stability, resistance to influence) is role-dependent: epistemic independence increases in the advisor role and decreases markedly in the social-peer role, where models abandon their positions more often when challenged by users.
Conclusion: The effect of personalization on LLM sycophancy depends strongly on the interaction role; role-sensitive evaluation is needed, along with new benchmarks and measurement frameworks that account for both goal alignment and personalization.
Abstract: Large Language Models (LLMs) are prone to sycophantic behavior, uncritically conforming to user beliefs. As models increasingly condition responses on user-specific context (personality traits, preferences, conversation history), they gain information to tailor agreement more effectively. Understanding how personalization modulates sycophancy is critical, yet systematic evaluation across models and contexts remains limited. We present a rigorous evaluation of personalization's impact on LLM sycophancy across nine frontier models and five benchmark datasets spanning advice, moral judgment, and debate contexts. We find that personalization generally increases affective alignment (emotional validation, hedging/deference), but affects epistemic alignment (belief adoption, position stability, resistance to influence) with context-dependent role modulation. When the LLM's role is to give advice, personalization strengthens epistemic independence (models challenge user presuppositions). When its role is that of a social peer, personalization decreases epistemic independence. In this role, extensively personalized user challenges cause LLMs to abandon their position at significantly higher rates. Robustness tests confirm that the effects are driven by personalized conditioning, not by additional input tokens per se or demographic information alone.
Our work provides measurement frameworks for evaluating personalized AI systems, demonstrates the necessity of role-sensitive evaluation, and establishes a novel benchmark to assess goal alignment.
[4] TAB-PO: Preference Optimization with a Token-Level Adaptive Barrier for Token-Critical Structured Generation
Samah Fodeh,Linhai Ma,Ganesh Puthiaraju,Srivani Talakokkul,Afshan Khan,Ashley Hagaman,Sarah R. Lowe,Aimee Kendall Roundtree
Main category: cs.CL
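TL;DR: This paper proposes TAB-PO, an improved preference optimization method for structured prediction tasks such as medical annotation, where preference pairs differ only slightly and semantically critical tokens are sparse. It introduces token-level adaptive advantages and a conditional barrier that strengthen preference separation while preserving the SFT baseline's performance, yielding a clear gain in micro-F1.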
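To make the low-separation problem concrete, the sketch below computes the standard sequence-level DPO loss, not TAB-PO itself, for one preference pair from per-token log-probabilities. The numbers are invented, and the example assumes the shared scaffolding tokens precede the single differing label token so that they receive identical log-probabilities in both completions. Under that assumption the shared tokens cancel out of the margin, which leaves the whole preference signal resting on one token.

```python
import math


def dpo_loss(policy_chosen, ref_chosen, policy_rejected, ref_rejected, beta=0.1):
    """Standard sequence-level DPO loss for one preference pair.

    Each argument is a list of per-token log-probabilities.
    """
    chosen_ratio = sum(policy_chosen) - sum(ref_chosen)
    rejected_ratio = sum(policy_rejected) - sum(ref_rejected)
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)) rewritten as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))


# Hypothetical values: three shared JSON-scaffolding tokens, then one label token.
scaffold_policy = [-0.1, -0.2, -0.1]
scaffold_ref = [-0.3, -0.2, -0.4]

with_scaffold = dpo_loss(
    scaffold_policy + [-1.0], scaffold_ref + [-1.5],
    scaffold_policy + [-2.0], scaffold_ref + [-1.2],
)
label_only = dpo_loss([-1.0], [-1.5], [-2.0], [-1.2])
```

The two losses agree up to floating-point error, so any per-token weighting scheme has to act on the differing label token to change the outcome.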
Details
Motivation: Standard DPO is brittle on token-critical structured prediction tasks such as medical annotation, mainly because preference pairs are weakly separated (differing by only 1-3 tokens) and token importance is unevenly distributed (semantic label tokens are sparse but critical, while JSON structural tokens are frequent but secondary). This leads to margin collapse, likelihood squeezing, and gradient dilution.
Method: Token-Adaptive Barrier Preference Optimization (TAB-PO) combines (1) token-importance-weighted, reference-adjusted advantages that focus on high-value semantic tokens, and (2) a conditional token-level barrier that jointly constrains the SFT-anchored likelihood and the preference-separation objective, mitigating training instability under low separation and importance skew.
Result: On medical communication annotation (joint prediction of hierarchical labels and evidence spans), TAB-PO improves micro-F1 by about 4% relative to SFT and consistently outperforms recent preference optimization baselines.
Conclusion: Through fine-grained token-level modeling and regularization, TAB-PO addresses DPO's inherent weaknesses in low-separation, importance-skewed structured tasks and offers a new approach to domain-sensitive preference alignment.
Abstract: Direct Preference Optimization is an offline post-SFT method for aligning language models from preference pairs, with strong results in instruction following and summarization. However, DPO's sequence-level implicit reward can be brittle for token-critical structured prediction settings such as medical annotation, which often exhibit (i) low-separation preference pairs, where chosen and rejected completions differ by minimal edit distance (often 1-3 tokens), and (ii) token-importance skew, where sparse semantic tokens (hierarchical labels and evidence spans) carry disproportionate task importance relative to high-frequency structural tokens (JSON scaffolding). In this regime, standard DPO suffers from margin collapse (insufficient log-probability separation between near-identical preferences), likelihood squeezing (the margin objective shifts the absolute likelihoods of both completions together), and gradient dilution, where uniform sequence-level weighting diffuses learning signal across shared scaffolding while rare, confusable label tokens receive weak, noisy updates. We introduce Token-Adaptive Barrier Preference Optimization (TAB-PO), which augments DPO with token-weighted, reference-adjusted advantages that prioritize high-value semantic tokens, and a conditional token-level barrier that regularizes under-confident tokens, balancing SFT-anchored likelihood and preference-driven separation in low-separation, importance-skewed regimes.
We evaluate TAB-PO on medical communication annotation, a task requiring joint prediction of hierarchical labels and evidence spans from patient-provider messages. TAB-PO achieves a ~4% relative improvement in micro-F1 over SFT and consistently outperforms recent preference-optimization baselines.
[5] ActMem: Bridging the Gap Between Memory Retrieval and Reasoning in LLM Agents
Xiaohui Zhang,Zequn Sun,Chengyuan Yang,Yaqin Jin,Yazhong Zhang,Wei Hu
Main category: cs.CL
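TL;DR: This paper proposes ActMem, a new actionable memory framework that combines memory retrieval with active causal reasoning. By building causal and semantic graphs and applying counterfactual reasoning and commonsense completion, it improves LLM agents' conflict detection and complex decision-making. The paper also releases an accompanying evaluation dataset, ActMemEval.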
Details
Motivation: Existing memory frameworks treat agents as passive recorders that lack an understanding of the deeper implications of information, and they fall short in scenarios requiring conflict detection and complex decision-making.
Method: ActMem converts unstructured dialogue history into structured causal and semantic graphs; combines counterfactual reasoning with commonsense completion to infer implicit constraints and resolve potential conflicts between past states and current intentions; and introduces a new evaluation dataset, ActMemEval.
Result: Experiments show that ActMem significantly outperforms state-of-the-art baselines on complex, memory-dependent tasks.
Conclusion: ActMem improves the memory understanding and reasoning of LLM agents, laying groundwork for more consistent and reliable intelligent assistants.
Abstract: Effective memory management is essential for large language model (LLM) agents handling long-term interactions. Current memory frameworks typically treat agents as passive "recorders" and retrieve information without understanding its deeper implications. They may fail in scenarios requiring conflict detection and complex decision-making. To bridge this critical gap, we propose a novel actionable memory framework called ActMem that integrates memory retrieval with active causal reasoning. ActMem transforms unstructured dialogue history into a structured causal and semantic graph. By leveraging counterfactual reasoning and commonsense completion, it enables agents to deduce implicit constraints and resolve potential conflicts between past states and current intentions. Furthermore, we introduce a comprehensive dataset ActMemEval to evaluate agent reasoning capabilities in logic-driven scenarios, moving beyond the fact-retrieval focus of existing memory benchmarks. Experiments demonstrate that ActMem significantly outperforms state-of-the-art baselines in handling complex, memory-dependent tasks, paving the way for more consistent and reliable intelligent assistants.
[6] EPPCMinerBen: A Novel Benchmark for Evaluating Large Language Models on Electronic Patient-Provider Communication via the Patient Portal
Samah Fodeh,Yan Wang,Linhai Ma,Srivani Talakokkul,Jordan M. Alpert,Sarah Schellhorn
Main category: cs.CL
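TL;DR: This paper introduces EPPCMinerBen, a benchmark for evaluating how well large language models (LLMs) identify communicative intent and extract evidence in electronic patient-provider communication (EPPC) text. It comprises three tasks (code classification, subcode classification, and evidence extraction) and is built on 1,933 expert-annotated sentences from the patient portal at Yale New Haven Hospital.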
Details
Motivation: As patient-provider exchanges move to secure messaging platforms, analyzing electronic patient-provider communication (EPPC) data is important for improving treatment outcomes and adherence, but existing approaches lack a unified, fine-grained evaluation benchmark.
Method: EPPCMinerBen comprises three sub-tasks (Code Classification, Subcode Classification, Evidence Extraction) and uses 1,933 expert-annotated sentences from 752 secure messages; several LLMs (including the Llama-3 family, DeepSeek-R1, and sdoh-llama) are evaluated in zero-shot and few-shot settings.
Result: Llama-3.1-70B is best on evidence extraction (F1 = 82.84%), Llama-3.3-70b-Instruct on code classification (F1 = 67.03%), and DeepSeek-R1-Distill-Qwen-32B on subcode classification (F1 = 48.25%); few-shot prompting generally improves performance; smaller models perform poorly, especially on subcode classification (F1 > 30%).
Conclusion: Large instruction-tuned models are better overall on EPPC understanding tasks and are especially strong at evidence extraction; EPPCMinerBen provides the first public, reproducible benchmark for discourse-level semantic understanding of patient-provider communication, supporting later work on model generalization and clinical communication analysis.
Abstract: Effective communication in health care is critical for treatment outcomes and adherence. With patient-provider exchanges shifting to secure messaging, analyzing electronic patient-provider communication (EPPC) data is both essential and challenging. We introduce EPPCMinerBen, a benchmark for evaluating LLMs in detecting communication patterns and extracting insights from electronic patient-provider messages. EPPCMinerBen includes three sub-tasks: Code Classification, Subcode Classification, and Evidence Extraction. Using 1,933 expert annotated sentences from 752 secure messages of the patient portal at Yale New Haven Hospital, it evaluates LLMs on identifying communicative intent and supportive text. Benchmarks span various LLMs under zero-shot and few-shot settings, with data to be released via the NCI Cancer Data Service. Model performance varied across tasks and settings. Llama-3.1-70B led in evidence extraction (F1: 82.84%) and performed well in classification. Llama-3.3-70b-Instruct outperformed all models in code classification (F1: 67.03%). DeepSeek-R1-Distill-Qwen-32B excelled in subcode classification (F1: 48.25%), while sdoh-llama-3-70B showed consistent performance. Smaller models underperformed, especially in subcode classification (>30% F1). Few-shot prompting improved most tasks.
Our results show that large, instruction-tuned models generally perform better in EPPCMinerBen tasks, particularly evidence extraction, while smaller models struggle with fine-grained reasoning. EPPCMinerBen provides a benchmark for discourse-level understanding, supporting future work on model generalization and patient-provider communication analysis. Keywords: Electronic Patient-Provider Communication, Large language models, Data collection, Prompt engineering
[7] Embracing Anisotropy: Turning Massive Activations into Interpretable Control Knobs for Large Language Models
Youngji Roh,Hyunjin Cho,Jaehyung Kim
Main category: cs.CL
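TL;DR: This paper proposes a training-free, magnitude-based method that identifies domain-critical dimensions in large language models and uses them as interpretable functional units for activation steering, outperforming conventional whole-dimension steering in domain adaptation and jailbreaking.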
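The two steps named in the summary above, picking dimensions by magnitude and then steering only those dimensions, can be sketched as follows. The paper's exact criterion is not given here, so the selection rule (top-k mean absolute activation) is an assumption, and `critical_dimensions`, `steer`, and `strength` are hypothetical names for illustration.

```python
def critical_dimensions(activations, k):
    """Return the k feature indices with the largest mean absolute activation.

    activations: list of hidden-state vectors (one list of floats per token).
    Top-k by mean magnitude is an assumed stand-in for the paper's criterion.
    """
    num_dims = len(activations[0])
    mean_abs = [
        sum(abs(row[d]) for row in activations) / len(activations)
        for d in range(num_dims)
    ]
    return sorted(range(num_dims), key=lambda d: mean_abs[d], reverse=True)[:k]


def steer(hidden, direction, dims, strength=1.0):
    """Add the steering direction only on the selected dimensions."""
    steered = list(hidden)
    for d in dims:
        steered[d] += strength * direction[d]
    return steered


# Toy hidden states in which dimension 1 carries a massive activation.
states = [[0.1, 50.0, -0.2, 0.3], [0.2, -48.0, 0.1, -0.4]]
dims = critical_dimensions(states, k=1)
steered = steer([1.0, 1.0, 1.0, 1.0], [0.5, 2.0, 0.5, 0.5], dims)
```

Whole-dimension steering would correspond to passing every index as `dims`; restricting the update leaves the remaining coordinates untouched.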
Details
Motivation: Prior work treats the highly anisotropic massive-activation dimensions in LLMs as anomalies to be managed, whereas this paper argues that they are intrinsic, interpretable functional units that arise from domain specialization.
Method: A training-free, magnitude-based criterion identifies domain-critical dimensions, and Critical Dimension Steering applies steering only to those dimensions.
Result: The critical dimensions act as interpretable semantic detectors for symbolic/quantitative patterns or domain terms; Critical Dimension Steering outperforms conventional whole-dimension steering in domain adaptation and jailbreaking scenarios.
Conclusion: Extreme activation dimensions in LLMs are not noise but interpretable, functionally meaningful domain-critical units, and exploiting them can improve task-specific performance and controllability.
Abstract: Large Language Models (LLMs) exhibit highly anisotropic internal representations, often characterized by massive activations, a phenomenon where a small subset of feature dimensions possesses magnitudes significantly larger than the rest. While prior works view these extreme dimensions primarily as artifacts to be managed, we propose a distinct perspective: these dimensions serve as intrinsic interpretable functional units arising from domain specialization. Specifically, we propose a simple magnitude-based criterion to identify Domain-Critical Dimensions in a training-free manner. Our analyses reveal that such dimensions behave as interpretable semantic detectors for symbolic/quantitative patterns or domain-specific terms. In addition, we introduce Critical Dimension Steering, which applies activation steering exclusively to the identified dimensions. Empirical results show that this approach outperforms conventional whole-dimension steering in domain adaptation and jailbreaking scenarios.
[8] SimpleTool: Parallel Decoding for Real-Time LLM Function Calling
Xiaoxin Shi,Jiaxin Wan,Linkang Dong,Wei Jiang,Yue Liu,Zengfeng Huang
Main category: cs.CL
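TL;DR: This paper proposes SimpleTool, which introduces special dual-role tokens to substantially speed up LLM function calling while maintaining or even improving accuracy. It reaches near-real-time control frequencies (e.g., 16 Hz), making it suitable for low-latency settings such as embodied intelligence and game AI.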
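The relationship between per-call latency and control frequency behind the 16 Hz figure is simple arithmetic, sketched below. The helper names are hypothetical, and the calculation assumes one function call per control step with calls issued back to back; because the reported 61.2 ms is a P50 (median) latency, roughly half of the calls would be slower than this rate implies.

```python
def control_rate_hz(latency_ms):
    """Control steps per second if each step waits for one call to finish."""
    return 1000.0 / latency_ms


def latency_budget_ms(rate_hz):
    """Maximum per-call latency that still sustains the given control rate."""
    return 1000.0 / rate_hz


median_rate = control_rate_hz(61.2)   # a little over 16 steps per second
budget_10hz = latency_budget_ms(10)   # 100 ms per call for a 10 Hz loop
```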
Details
Motivation: LLM-based function calling is limited by the inherent latency of autoregressive decoding and struggles to meet the real-time demands of embodied intelligence, game AI, and interactive avatars, which require high control frequencies (e.g., 10 Hz).
Method: SimpleTool introduces special tokens that both compress low-entropy tokens (such as delimiters and parameter names) and act as mode selectors, enabling independent parallel generation of the function name and arguments; it jointly exploits the redundancy and weak causal dependencies of structured outputs.
Result: On five benchmarks with Qwen-series models (0.5B-14B), SimpleTool achieves a 3-6x end-to-end speedup (up to 9.6x) with only 8.2% parallelization overhead; ST-Qwen-0.5B outperforms FunctionGemma on Mobile Actions in both accuracy and latency consistency; with quantization on a consumer-grade GPU, P50 latency is as low as 61.2 ms, and the 4B model reaches 16 Hz real-time control.
Conclusion: By jointly exploiting the properties of structured outputs, SimpleTool overcomes the latency bottleneck of LLM function calling and offers a viable path to low-latency real-world deployment.
Abstract: LLM-based function calling enables intelligent agents to interact with external tools and environments, yet autoregressive decoding imposes a fundamental latency bottleneck that limits real-time applications such as embodied intelligence, game AI, and interactive avatars (e.g., 10 Hz control frequency). We observe that function calling differs fundamentally from free-form text generation: structured outputs exhibit substantial token redundancy (delimiters, parameter names), and arguments exhibit weak causal dependencies. Crucially, these two properties must be exploited jointly to achieve real-time performance. We present SimpleTool, which introduces special tokens that serve a dual role: compressing low-entropy tokens (4-6x reduction) while acting as mode selectors that enable independent parallel generation of function name and arguments. This synergistic design achieves 3-6x end-to-end speedup (up to 9.6x) with only +8.2% parallelization overhead. Experiments on five benchmarks across Qwen-series models (0.5B-14B) demonstrate substantial speedup while maintaining competitive or improved accuracy. On Mobile Actions, ST-Qwen-0.5B outperforms Google's FunctionGemma in both accuracy and latency consistency. With quantization on consumer-grade GPU, SimpleTool achieves 61.2ms P50 latency, enabling 16 Hz real-time control at 4B model scale, bridging the gap between LLM function calling and latency-critical real-world deployment.
[9] GRIP: Geometric Refinement and Adaptive Information Potential for Data Efficiency
Changhao Wang,Jiaolong Yang,Xinhao Yao,Yunfei Yu,Peng Jiao,Lu Yu,Junpeng Fang,Riccardo Cantoro,Qing Cui,Jun Zhou
Main category: cs.CL
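TL;DR: This paper proposes GRIP, a framework that unifies global distribution balancing and local sample selection through geometric modeling and adaptive quantification of information potential. It markedly improves data efficiency for large models, and on MoE models it surpasses training on 3× larger uncurated data.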
Details
Motivation: Existing data selection methods decouple global distribution balancing from local instance selection, which damages the hierarchical integrity of the training set, while large-model performance is increasingly limited by data efficiency rather than sheer scale.
Method: GRIP (1) models the corpus as an information-dense geometric space; (2) uses a Rapid Adaptation Probe (RAP) to quantify the information potential of semantic clusters and dynamically re-allocate the sampling budget; and (3) performs intra-cluster selection with a length-rectified geometric prior to counteract embedding density bias and preserve long-tail logical sequences.
Result: Validated on MoE models with up to 300B tokens, GRIP consistently outperforms state-of-the-art baselines and exceeds the performance of models trained on 3× larger uncurated data.
Conclusion: GRIP establishes a robust geometric foundation for adaptive data curation in large-scale pre-training, advancing a data-efficiency-centered approach to model optimization.
Abstract: The performance of Large Language Models (LLMs) is increasingly governed by data efficiency rather than raw scaling volume. However, existing selection methods often decouple global distribution balancing from local instance selection, compromising the hierarchical integrity of the training set. We introduce GRIP (Geometric Refinement and Adaptive Information Potential), a framework that unifies these dimensions by modeling the corpus as an information-dense geometric space. GRIP employs a Rapid Adaptation Probe (RAP) to quantify the information potential of semantic clusters, dynamically re-allocating the sampling budget to regions with the highest representation deficits. Subsequently, we perform Intra-Cluster Selection using a length-rectified geometric prior to counteract embedding density artifacts and preserve long-tail logical sequences. Extensive evaluations on Mixture-of-Experts (MoE) models up to 300B tokens demonstrate that GRIP consistently outperforms state-of-the-art baselines, surpassing the performance of models trained on 3× larger uncurated datasets. Our work establishes a robust geometric foundation for adaptive data curation in large-scale pre-training.
[10] Autorubric: A Unified Framework for Rubric-Based LLM Evaluation
Delip Rao,Chris Callison-Burch
Main category: cs.CL
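TL;DR: This paper presents Autorubric, a unified open-source framework for rubric-based evaluation of LLM text generation. It supports multiple criterion types, multi-judge ensembles, bias mitigation, and reliability metrics, is validated on several benchmarks, and is released alongside a new dataset, CHARM-100.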
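Two of the building blocks mentioned for this entry, majority-vote aggregation across judges and Cohen's κ as a reliability metric, are standard and can be sketched in plain Python. The functions below are a generic illustration and are not Autorubric's API; `majority_vote` and `cohens_kappa` are names chosen for this example.

```python
from collections import Counter


def majority_vote(verdicts):
    """Return the most common verdict; ties go to the verdict seen first."""
    return Counter(verdicts).most_common(1)[0][0]


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items.

    Undefined (division by zero) when chance agreement is exactly 1,
    i.e. both raters use a single identical label for every item.
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)


ensemble_verdict = majority_vote(["pass", "fail", "pass"])
agreement = cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0])
```

Here the two raters agree on 3 of 4 items while chance agreement is 0.5, so κ comes out at 0.5, noticeably lower than the raw 75% agreement.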
Details
Motivation: Existing rubric-based LLM evaluation techniques are scattered, use inconsistent terminology, and offer only partial solutions, with no unified framework.
Method: The authors design and implement Autorubric, an open-source Python framework that supports several criterion types (binary, ordinal, nominal), single-judge and multi-judge ensemble evaluation, few-shot calibration, multiple bias mitigation strategies, psychometric reliability metrics, and production-grade infrastructure.
Result: The framework is validated on three benchmarks (RiceChem, ResearcherBench, and CHARM-100); the authors also release CHARM-100, a 100-sample chatbot evaluation dataset covering all three criterion types.
Conclusion: Autorubric offers a systematic, reproducible, and extensible unified solution for evaluating LLM text generation, advancing the standardization and engineering adoption of rubric-based evaluation.
Abstract: Rubric-based evaluation with large language models (LLMs) has become standard practice for assessing text generation at scale, yet the underlying techniques are scattered across papers with inconsistent terminology and partial solutions. We present a unified framework: each identified technique is paired with its realization in Autorubric, an open-source Python framework proposed in this paper. Autorubric supports binary, ordinal, and nominal criteria with configurable weights; single-judge and multi-judge ensemble evaluation with majority, weighted, unanimous, and any-vote aggregation; few-shot calibration with verdict-balanced sampling; and mitigations for position bias (option shuffling), verbosity bias (length penalties), and criterion conflation (per-criterion atomic evaluation with natural language explanations). The framework provides reliability metrics drawn from psychometrics (Cohen's κ, weighted κ, correlation coefficients, and distribution-level tests) alongside production infrastructure including response caching, checkpointing with resumable runs, multi-provider rate limiting, and cost tracking. We evaluate Autorubric on three benchmarks spanning educational assessment, deep research evaluation, and chatbot quality assessment, demonstrating that it produces results consistent with published benchmarks while exercising the framework's key capabilities: per-criterion binary evaluation with few-shot calibration (RiceChem), multi-judge ensemble evaluation across judge models (ResearcherBench), and mixed criterion types combining binary, ordinal, and nominal scales (CHARM-100).
We also contribute CHARM-100, a 100-sample chatbot evaluation dataset with per-sample ground truth labels across all three criterion types, designed to stress-test rubric evaluation frameworks on heterogeneous criteria.
[11] Iterative LLM-based improvement for French Clinical Interview Transcription and Speaker Diarization
Ambre Marie,Thomas Bertin,Guillaume Dardenne,Gwenolé Quellec
Main category: cs.CL
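TL;DR: This paper proposes a multi-pass LLM post-processing architecture that alternates between speaker recognition and word recognition to improve ASR accuracy and speaker attribution for French medical conversations, and demonstrates its effectiveness and feasibility on two French clinical datasets.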
Details
Motivation: Automatic speech recognition (ASR) for French medical conversations has high error rates (often above 30%), especially in spontaneous clinical speech, so transcription accuracy and speaker attribution need to improve.
Method: A multi-pass LLM post-processing architecture alternates Speaker Recognition and Word Recognition passes; ablation studies on two French clinical datasets examine four design factors (model selection, prompting strategy, pass ordering, and iteration depth); the experiments use Qwen3-Next-80B and assess effects with Wilcoxon signed-rank tests.
Result: Word-level diarization error rate (WDER) drops significantly on the suicide prevention telephone counseling dataset (p < 0.05, n=18) and remains stable on the awake neurosurgery consultation dataset (n=10); there are no output failures, and the real-time factor (RTF) of 0.32 keeps computational cost manageable.
Conclusion: The multi-pass LLM post-processing method effectively improves ASR for French clinical speech and is both robust and practical, making it suitable for offline clinical deployment.
Abstract: Automatic speech recognition for French medical conversations remains challenging, with word error rates often exceeding 30% in spontaneous clinical speech. This study proposes a multi-pass LLM post-processing architecture alternating between Speaker Recognition and Word Recognition passes to improve transcription accuracy and speaker attribution. Ablation studies on two French clinical datasets (suicide prevention telephone counseling and preoperative awake neurosurgery consultations) investigate four design choices: model selection, prompting strategy, pass ordering, and iteration depth. Using Qwen3-Next-80B, Wilcoxon signed-rank tests confirm significant WDER reductions on suicide prevention conversations (p < 0.05, n=18), while maintaining stability on awake neurosurgery consultations (n=10), with zero output failures and acceptable computational cost (RTF 0.32), suggesting feasibility for offline clinical deployment.
[12] Stepwise Penalization for Length-Efficient Chain-of-Thought Reasoning
Xintong Li,Sha Li,Rongmei Lin,Hongye Jin,Linwei Li,Hejie Cui,Sarah Zhang,Chia-Yuan Chang,Kewei Cheng,Besnik Fetahu,Priyanka Nigam,Jingbo Shang,Bing Yin
Main category: cs.CL
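TL;DR: This paper proposes Step-wise Adaptive Penalization (SWAP), a framework that allocates the length penalty at a fine-grained level according to each reasoning step's contribution to the correct answer, substantially shortening reasoning chains while improving accuracy.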
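The idea of treating excess length as a penalty mass that falls mostly on low-importance steps can be illustrated with a small sketch. The allocation rule below (a softmax over negated importance scores) is an assumption chosen for illustration and is not the formula used in the paper; `redistribute_penalty` and `temperature` are hypothetical names.

```python
import math


def redistribute_penalty(importance, total_penalty, temperature=1.0):
    """Split a fixed penalty mass across steps, favoring low-importance steps.

    importance: one score per reasoning step (higher means more essential).
    Uses softmax(-importance / temperature) as an assumed allocation rule.
    """
    weights = [math.exp(-score / temperature) for score in importance]
    norm = sum(weights)
    return [total_penalty * w / norm for w in weights]


# Hypothetical step scores: step 0 is essential, step 1 is nearly redundant.
penalties = redistribute_penalty([2.0, 0.1, 1.0], total_penalty=3.0)
```

The penalties always sum to the original mass, so the overall pressure toward shorter outputs is unchanged while the most essential step receives the smallest share.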
Details
Motivation: Large reasoning models often overthink and produce long reasoning chains that raise cost without improving accuracy; existing reinforcement learning methods struggle to tell essential steps from redundant ones, so compression is coarse.
Method: SWAP estimates each step's importance from the model's on-policy log-probability improvement, redistributes the total length penalty according to importance so that low-importance steps are penalized most, and optimizes a unified outcome-process advantage within group-relative policy optimization.
Result: Across multiple benchmarks, reasoning length falls by 64.3% on average while accuracy improves by 5.7% relative to the base model.
Conclusion: Treating reasoning length as an explicit step-level optimization objective is feasible and effective; SWAP achieves more efficient and more precise reasoning compression.
Abstract: Large reasoning models improve with more test-time computation, but often overthink, producing unnecessarily long chains-of-thought that raise cost without improving accuracy. Prior reinforcement learning approaches typically rely on a single outcome reward with trajectory-level length penalties, which cannot distinguish essential from redundant reasoning steps and therefore yield blunt compression. Although recent work incorporates step-level signals, such as offline pruning, supervised data construction, or verifier-based intermediate rewards, reasoning length is rarely treated as an explicit step-level optimization objective during RL. We propose Step-wise Adaptive Penalization (SWAP), a fine-grained framework that allocates length reduction across steps based on intrinsic contribution. We estimate step importance from the model's on-policy log-probability improvement toward the correct answer, then treat excess length as a penalty mass redistributed to penalize low-importance steps more heavily while preserving high-importance reasoning. We optimize with a unified outcome-process advantage within group-relative policy optimization. Extensive experiments demonstrate that SWAP reduces reasoning length by 64.3% on average while improving accuracy by 5.7% relative to the base model.
[13] From Prerequisites to Predictions: Validating a Geometric Hallucination Taxonomy Through Controlled Induction
Matic Korun
Main category: cs.CL
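TL;DR: By inducing hallucinations in GPT-2 under controlled conditions, this paper tests a geometric hallucination taxonomy (center-drift, wrong-well convergence, coverage gaps). Only the coverage-gap type (Type 3) shows robust geometric separability in static embeddings, while the other two types cannot be distinguished; the paper also shows that token-level tests severely inflate significance through pseudoreplication.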
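The abstract for this entry reports Holm-corrected significance counts alongside uncorrected ones. The Holm step-down procedure itself is standard and is sketched below; the p-values are invented, and treating the three hallucination types as the family of three comparisons is an assumption made for this example.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down correction: return one reject/keep flag per hypothesis.

    P-values are visited from smallest to largest; the i-th smallest
    (0-based) is compared with alpha / (m - i), and the procedure stops
    at the first failure.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, index in enumerate(order):
        if p_values[index] <= alpha / (m - rank):
            reject[index] = True
        else:
            break
    return reject


# Hypothetical p-values for three comparisons in a single run.
raw_p = [0.01, 0.04, 0.03]
uncorrected = [p <= 0.05 for p in raw_p]
corrected = holm_reject(raw_p)
```

All three invented p-values pass the uncorrected 0.05 threshold, but only the smallest survives correction, which is why corrected counts can be lower than raw counts.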
Details
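Motivation: To examine whether a geometric hallucination taxonomy can distinguish different types of generated hallucinations in a model's internal representation space, offering an interpretable geometric view of LLM failure mechanisms.
Method: A two-level statistical design on GPT-2 with prompts (N = 15 per group) as the unit of inference; each experiment is run 20 times with different generation seeds; the separation of the three hallucination types by L2 norm is analyzed in both static embeddings and contextual hidden states, with correction for multiple comparisons; token-level and prompt-level tests are compared for significance.
Result: Type 3 (coverage gap) shows robust norm separation in static embeddings (significant in 18 of 20 runs, median r = +0.61); in hidden states the direction is stable but the test is underpowered (significant in only 4 of 20 runs, median r = -0.28); Types 1 and 2 do not separate in either space (significant in at most 3 of 20 runs); token-level tests inflate significance by 4-16× through pseudoreplication.
Conclusion: Coverage-gap hallucinations are the only type with strong geometric identifiability, carried by representation magnitude rather than direction; the non-separation of Types 1 and 2 is genuine at the 124M-parameter scale; prompt-level statistical modeling is essential for assessing the geometric properties of hallucinations.
Abstract: We test whether a geometric hallucination taxonomy, which classifies failures as center-drift (Type 1), wrong-well convergence (Type 2), or coverage gaps (Type 3), can distinguish hallucination types through controlled induction in GPT-2. Using a two-level statistical design with prompts (N = 15 per group) as the unit of inference, we run each experiment 20 times with different generation seeds to quantify result stability. In static embeddings, Type 3 norm separation is robust (significant in 18/20 runs, Holm-corrected in 14/20, median r = +0.61). In contextual hidden states, the Type 3 norm effect direction is stable (19/20 runs) but underpowered at N = 15 (significant in 4/20, median r = -0.28). Types 1 and 2 do not separate in either space (≤ 3/20 runs). Token-level tests inflate significance by 4-16× through pseudoreplication, a finding replicated across all 20 runs. The results establish coverage-gap hallucinations as the most geometrically distinctive failure mode, carried by magnitude rather than direction, and confirm the Type 1/2 non-separation as genuine at 124M parameters.
[14] When Metrics Disagree: Automatic Similarity vs. LLM-as-a-Judge for Clinical Dialogue Evaluation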
Bian Sun,Zhenjian Wang,Orvill de la Torre,Zirui Wang
Main category: cs.CL
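TL;DR: This paper studies how fine-tuning Llama 2 7B can improve its accuracy on medical question answering. The model is fine-tuned with supervision on real patient-doctor dialogue data and evaluated with text similarity metrics, and the paper stresses that human evaluation by medical experts is needed.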
Details
Motivation: Large language models (LLMs) perform poorly in medical settings and may give misleading advice, so their reliability and accuracy in the medical domain need to improve.
Method: Llama 2 7B is fine-tuned with supervision on transcripts of real patient-doctor conversations, focusing on domain-specific expressions and semantic nuances; because of resource constraints, performance is assessed with text similarity metrics rather than expert human evaluation.
Result: The fine-tuned model improves significantly on all quantitative metrics, but the GPT-4 evaluation is inconsistent with them, highlighting the gap between automatic evaluation and expert judgment.
Conclusion: Automatic evaluation (text similarity or GPT-4 scoring) cannot replace assessment by real medical experts; deploying more accurate medical LLMs will require clinical experts in validation and review.
Abstract: This paper details the baseline model selection, fine-tuning process, evaluation methods, and the implications of deploying more accurate LLMs in healthcare settings. As large language models (LLMs) are increasingly employed to address diverse problems, including medical queries, concerns about their reliability have surfaced. A recent study by Long Island University highlighted that LLMs often perform poorly in medical contexts, potentially leading to harmful misguidance for users. To address this, our research focuses on fine-tuning Llama 2 7B, a transformer-based, decoder-only model, using transcripts from real patient-doctor interactions. Our objective was to enhance the model's accuracy and precision in responding to medical queries. We fine-tuned the model using a supervised approach, emphasizing domain-specific nuances captured in the training data. Ideally, the model's results should be reviewed and evaluated by real medical experts. Due to resource constraints, the performance of the fine-tuned model was evaluated using text similarity metrics. The fine-tuned model demonstrated significant improvements across all key dimensions except GPT-4's evaluation. GPT-4's evaluations differ considerably from the quantitative results; we therefore propose that the results be evaluated by human medical experts.
[15] How Large Language Models Get Stuck: Early structure with persistent errors
Alokesh Manna,William Snyder,Whitney Tabor
Main category: cs.CL
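TL;DR: This paper studies how well large language models (LLMs) trained on the BabyLM dataset judge grammaticality. The OPT model consistently fails to prefer grammatical sentences in nearly one-third of BLiMP categories, and the paper proposes a "Bigram Hypothesis" to explain this: bigram statistical biases early in training entrench erroneous preferences.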
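The evaluation described for this entry rests on a minimal-pair test: for each sentence pair, the model should assign a higher likelihood to the grammatical sentence. A sketch of the resulting per-class accuracy is shown below; the log-probabilities are invented, and `preference_accuracy` is a hypothetical helper, not code from the paper.

```python
def preference_accuracy(pairs):
    """Fraction of minimal pairs where the grammatical sentence scores higher.

    pairs: list of (logp_grammatical, logp_ungrammatical) tuples.
    Ties count as failures because the model shows no preference.
    """
    correct = sum(1 for good, bad in pairs if good > bad)
    return correct / len(pairs)


# Hypothetical sentence log-probabilities for one BLiMP class.
class_pairs = [(-10.0, -12.0), (-8.0, -7.0), (-5.0, -9.0), (-3.0, -3.0)]
accuracy = preference_accuracy(class_pairs)
```

Tracking this number for each class across training checkpoints would show whether an early, wrong separation persists: accuracy that settles well below 0.5 and stays there indicates a consistent preference for the ungrammatical member of the pair.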
Details
Motivation: To explore how linguistic insight can make LLM training more efficient, in particular the learning dynamics of grammaticality judgments and the sources of potential bias.
Method: An OPT model is trained on the BabyLM dataset, and its sentence-probability preferences on the 67 BLiMP classes are tracked over training; the analysis combines qualitative assessment grounded in linguistic theory and deep learning theory with quantitative numerical experiments; the Bigram Hypothesis is proposed and preliminarily examined.
Result: In nearly one-third of BLiMP classes, OPT never reliably distinguishes grammatical from ungrammatical sentences; erroneous preferences often become fixed early in training and persist to the end; some BLiMP test items lack linguistic validity.
Conclusion: Bigram statistical biases early in training can lead the model to form erroneous and stubborn grammaticality preferences that call for targeted intervention; not all BLiMP tests are suitable measures of grammatical ability; the authors state the Bigram Hypothesis and outline a method for testing it.
Abstract: Linguistic insights may help make Large Language Model (LLM) training more efficient. We trained Meta's OPT model on the 100M-word BabyLM dataset, and evaluated it on the BLiMP benchmark, which consists of 67 classes, each defined by sentence pairs that differ in a targeted syntactic or semantic rule violation. We tested the model's preference for grammatical over ungrammatical sentences across training iterations and grammatical types. In nearly one-third of the BLiMP classes, OPT fails to consistently assign a higher likelihood to grammatical sentences, even after extensive training. When it fails, it often establishes a clear (erroneous) separation of the likelihoods at an early stage of processing and sustains this to the end of our training phase. We hypothesize that this mis-categorization is costly because it creates entrenched biases that must, eventually, be reversed in order for the model to perform well. We probe this phenomenon using a mixture of qualitative (based on linguistic theory and the theory of Deep Learning) and quantitative (based on numerical testing) assessments. Our qualitative assessments indicate that only some BLiMP tests are meaningful guides. We conclude by articulating a hypothesis, the Bigram Hypothesis, which claims that the learning process will exhibit erroneous entrenchment if bigram statistics bias the model toward wrong distinctions early in training, and we describe a method (in progress) of testing the hypothesis on appropriately selected BLiMP classes.
[16] Distribution-Aware Companding Quantization of Large Language Models
Athul Radhakrishnan,Siddhant Mohan,Mahima Sachdeva
Main category: cs.CL
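TL;DR: This paper proposes multi-token prediction training, in which the model predicts the next n tokens at each position using n independent output heads on a shared trunk. Used as an auxiliary training task, it improves downstream performance of code and natural language models without adding training time, with especially strong gains on generative benchmarks (such as HumanEval and MBPP), and it speeds up inference.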
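The training targets implied by the summary above can be sketched without any model code: at each position, head k is asked to predict the token k steps ahead. The helper below is an illustration under that reading; `multi_token_targets` is a hypothetical name, and positions near the end of the sequence that lack n following tokens are simply dropped.

```python
def multi_token_targets(tokens, n):
    """For each position t, return the n tokens that follow it.

    Entry k of each target list is what output head k would predict.
    Positions without n following tokens are omitted.
    """
    return [tokens[t + 1 : t + 1 + n] for t in range(len(tokens) - n)]


sequence = [10, 11, 12, 13, 14]
next_token_targets = multi_token_targets(sequence, n=1)
two_token_targets = multi_token_targets(sequence, n=2)
```

With n=1 this reduces to ordinary next-token prediction, which matches the description of multi-token prediction as an auxiliary task layered on the standard objective.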
Details
Motivation: Conventional large language models are trained with next-token prediction, which limits efficiency; the authors explore a more efficient training paradigm to improve sample efficiency and inference performance.
Method: At each training position, the model predicts the following n tokens, each produced by an independent output head on a shared model trunk; multi-token prediction is added to the standard training pipeline as an auxiliary task.
Result: The 13B-parameter model solves 12% more problems on HumanEval and 17% more on MBPP than the baseline; small algorithmic tasks show that the method favors the development of induction heads and algorithmic reasoning; models trained with 4-token prediction are up to 3 times faster at inference.
Conclusion: Multi-token prediction is an efficient, scalable training strategy with no additional training overhead; it is particularly suited to large models and generative tasks, and it speeds up inference.
Abstract: Large language models such as GPT and Llama are trained with a next-token prediction loss. In this work, we suggest that training language models to predict multiple future tokens at once results in higher sample efficiency. More specifically, at each position in the training corpus, we ask the model to predict the following n tokens using n independent output heads, operating on top of a shared model trunk. Considering multi-token prediction as an auxiliary training task, we measure improved downstream capabilities with no overhead in training time for both code and natural language models. The method is increasingly useful for larger model sizes and keeps its appeal when training for multiple epochs. Gains are especially pronounced on generative benchmarks like coding, where our models consistently outperform strong baselines by several percentage points. Our 13B parameter model solves 12% more problems on HumanEval and 17% more on MBPP than comparable next-token models. Experiments on small algorithmic tasks demonstrate that multi-token prediction is favorable for the development of induction heads and algorithmic reasoning capabilities. As an additional benefit, models trained with 4-token prediction are up to 3 times faster at inference, even with large batch sizes.
[17] Policy Compliance of User Requests in Natural Language for AI Systems
Pedro Cisneros-Velarde
Main category: cs.CL
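TL;DR: This paper proposes the first benchmark dataset for assessing whether user requests comply with organizational policies, and uses it to evaluate a range of large language models on policy compliance judgment, revealing how challenging the task is.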
Details
Motivation: To ensure that the natural-language requests users within an organization send to an AI system comply with established safety and reliability policies.
Method: Build a benchmark dataset of user requests annotated with diverse degrees of policy compliance, and use it to compare various large language models and solution methods on policy compliance assessment.
Result: Performance varies markedly across models and methods and is limited overall, indicating that the task is highly challenging.
Conclusion: Policy compliance assessment is an important and difficult problem; current large language models cannot yet handle it robustly and reliably, and further research is needed.
Abstract: Consider an organization whose users send requests in natural language to an AI system that fulfills them by carrying out specific tasks. In this paper, we consider the problem of ensuring such user requests comply with a list of diverse policies determined by the organization with the purpose of guaranteeing the safe and reliable use of the AI system. We propose, to the best of our knowledge, the first benchmark consisting of annotated user requests of diverse compliance with respect to a list of policies. Our benchmark is related to industrial applications in the technology sector. We then use our benchmark to evaluate the performance of various LLMs on policy compliance assessment under different solution methods. We analyze the differences in performance metrics across the models and solution methods, showcasing the challenging nature of our problem.
[18] LLM-Bootstrapped Targeted Finding Guidance for Factual MLLM-based Medical Report Generation
Cunyuan Yang,Dejuan Song,Xiaotao Pang,Qianqian Shen,Wenjie Nie,Yifan Huang,Lei Wu,Wei Han,Haishuai Wang,Jiajun Bu
Main category: cs.CL
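TL;DR: This paper proposes the Fact-Flow framework, which improves the factual accuracy of medical imaging report generation by separating visual fact identification from report generation; it uses an LLM to automatically generate a dataset of labeled clinical findings, avoiding the cost of manual annotation; experiments on two disease datasets confirm its advantages in factual accuracy and text quality.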
Details
Motivation: Existing methods for automatic medical report generation based on multimodal large language models (MLLMs) suffer from factual instability (such as omitted or incorrect information), because they generate reports directly from image features without an explicit factual basis.
Method: Propose the Fact-Flow framework: first predict clinical findings from the medical image (visual fact identification), then condition the MLLM on those findings to generate the report; also design an LLM-driven automatic annotation pipeline to build a dataset of labeled clinical findings.
Result: Experiments on two disease-focused medical datasets show that Fact-Flow significantly improves factual accuracy while maintaining high text quality, outperforming state-of-the-art models.
Conclusion: Separating fact identification from report generation, combined with LLM-built high-quality finding annotations, is an effective route to more reliable and clinically applicable medical report generation.
Abstract: The automatic generation of medical reports utilizing Multimodal Large Language Models (MLLMs) frequently encounters challenges related to factual instability, which may manifest as the omission of findings or the incorporation of inaccurate information, thereby constraining their applicability in clinical settings. Current methodologies typically produce reports based directly on image features, which inherently lack a definitive factual basis. In response to this limitation, we introduce Fact-Flow, an innovative framework that separates the process of visual fact identification from the generation of reports. This is achieved by initially predicting clinical findings from the image, which subsequently directs the MLLM to produce a report that is factually precise. A pivotal advancement of our approach is a pipeline that leverages a Large Language Model (LLM) to autonomously create a dataset of labeled medical findings, effectively eliminating the need for expensive manual annotation. Extensive experimental evaluations conducted on two disease-focused medical datasets validate the efficacy of our method, demonstrating a significant enhancement in factual accuracy compared to state-of-the-art models, while concurrently preserving high standards of text quality.
[19] A Typologically Grounded Evaluation Framework for Word Order and Morphology Sensitivity in Multilingual Masked LMs
Anna Feldman,Libby Barak,Jing Peng
Main category: cs.CL
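TL;DR: This paper proposes a typology-aware diagnostic for assessing how much multilingual masked language models rely on word order versus inflectional form. It tests mBERT and XLM-R across several languages under dependency-based inference-time perturbations (full token scrambling, content-word scrambling, head-dependent swaps, and sentence-level lemma substitution) and finds that the models generally rely heavily on word order rather than morphological information.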
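As an illustration of two of these perturbations, the sketch below implements full token scrambling and content-word scrambling with function words held in place. The hard-coded `FUNCTION_WORDS` set and the function names are hypothetical stand-ins: the paper derives word classes from Universal Dependencies annotations, which this toy example does not use.

```python
import random
from typing import List, Set

# Hypothetical function-word list for illustration only; the paper uses
# Universal Dependencies annotations to tell word classes apart.
FUNCTION_WORDS: Set[str] = {"the", "a", "of", "on", "in", "to", "and"}


def full_scramble(tokens: List[str], seed: int = 0) -> List[str]:
    """Shuffle every token position."""
    shuffled = tokens[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled


def content_scramble(tokens: List[str], seed: int = 0) -> List[str]:
    """Shuffle content words among themselves; function words stay put."""
    slots = [i for i, t in enumerate(tokens) if t.lower() not in FUNCTION_WORDS]
    content = [tokens[i] for i in slots]
    random.Random(seed).shuffle(content)
    out = tokens[:]
    for slot, word in zip(slots, content):
        out[slot] = word
    return out


sentence = "the cat sat on the mat".split()
scrambled = content_scramble(sentence)
print(full_scramble(sentence))
print(scrambled)
```

Seeding the shuffle keeps the perturbed evaluation set reproducible, which matters when comparing models on identical scrambled inputs.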
Details
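Motivation: Prior work lacks a fine-grained diagnostic of how multilingual masked language models use typological features (such as word order versus inflection) for prediction; it is necessary to separate whether a model relies on word-order cues or morphological cues in masked prediction.
Method: Build four kinds of inference-time perturbation from Universal Dependencies: full token scrambling, content-word scrambling (function words kept fixed), head-dependent swaps, and sentence-level lemma substitution (+L); evaluate the masked-prediction accuracy (word-level and top-5) of mBERT and XLM-R on English, Chinese, German, Spanish, and Russian; release code, sampling scripts, and balanced evaluation subsets.
Result: Full scrambling drives word-level reconstruction accuracy to near zero in every language; partial perturbations also cause large drops; +L has little effect in Chinese but substantially lowers accuracy in German, Spanish, and Russian, and does not mitigate the effect of scrambling; top-5 accuracy likewise shows that the gold word rarely appears among the five highest-ranked predictions.
Conclusion: mBERT and XLM-R depend heavily on word order rather than inflectional morphology in multilingual settings, suggesting that their cross-lingual generalization may rest on surface word-order regularities rather than deeper morphosyntactic knowledge; the +L results further suggest that the models are insensitive to lemma-inflection alignment.
Abstract: We introduce a typology-aware diagnostic for multilingual masked language models that tests reliance on word order versus inflectional form. Using Universal Dependencies, we apply inference-time perturbations: full token scrambling, content-word scrambling with function words fixed, dependency-based head-dependent swaps, and sentence-level lemma substitution (+L), which lemmatizes both the context and the masked target label. We evaluate mBERT and XLM-R on English, Chinese, German, Spanish, and Russian. Full scrambling drives word-level reconstruction accuracy near zero in all languages; partial and head-dependent perturbations cause smaller but still large drops. +L has little effect in Chinese but substantially lowers accuracy in German/Spanish/Russian, and it does not mitigate the impact of scrambling. Top-5 word accuracy shows the same pattern: under full scrambling, the gold word rarely appears among the five highest-ranked reconstructions. We release code, sampling scripts, and balanced evaluation subsets; Turkish results under strict reconstruction are reported in the appendix.
[20] CIRCUS: Circuit Consensus under Uncertainty via Stability Ensembles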
Swapnil Parekh
Main category: cs.CL
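TL;DR: This paper proposes CIRCUS, which reframes circuit discovery as an uncertainty-quantification problem: it prunes under multiple configurations to produce an ensemble of attribution graphs, computes a stability score for each edge, and extracts a strict-consensus circuit, yielding a robust core circuit along with interpretable alternative structures.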
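The two aggregation steps can be sketched in a few lines, assuming each pruned attribution graph is represented simply as a set of directed edges. The edge representation, the function names, and the toy graphs are illustrative assumptions rather than the paper's code.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]


def stability_scores(graphs: List[Set[Edge]]) -> Dict[Edge, float]:
    """Fraction of pruning configurations that retain each edge."""
    counts = Counter(edge for graph in graphs for edge in graph)
    return {edge: count / len(graphs) for edge, count in counts.items()}


def strict_consensus(graphs: List[Set[Edge]]) -> Set[Edge]:
    """Edges that appear in every configuration."""
    return set.intersection(*graphs) if graphs else set()


# Four toy pruned views of the same raw attribution run.
views = [
    {("a", "b"), ("b", "c"), ("a", "c")},
    {("a", "b"), ("b", "c")},
    {("a", "b"), ("c", "d")},
    {("a", "b"), ("b", "c")},
]
scores = stability_scores(views)
core = strict_consensus(views)
print(scores)
print(core)
```

Edges with a score of 1.0 form the core circuit, while edges with intermediate scores are the contingent alternatives that depend on the analyst's configuration.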
Details
Motivation: Conventional mechanistic circuit discovery is overly sensitive to analyst choices (such as pruning thresholds and feature dictionaries), which makes results brittle and leaves them without any measure of uncertainty.
Method: Starting from a single raw attribution run, CIRCUS prunes under multiple configurations to build an ensemble of attribution graphs; assigns each edge a stability score (the fraction of configurations that retain it); and extracts the edges present in every configuration as the strict-consensus circuit.
Result: On Gemma-2-2B and Llama-3.2-1B, the strict-consensus circuit is roughly 1/40 the size of the union of all configurations while retaining comparable influence-flow explanatory power, and it significantly outperforms matched non-consensus controls in activation patching experiments (p=0.0004).
Conclusion: CIRCUS offers a practical, uncertainty-aware framework for mechanistic circuit analysis that supports trustworthy, auditable reporting and explicitly separates core structure, contingent structure, and noise.
Abstract: Mechanistic circuit discovery is notoriously sensitive to arbitrary analyst choices, especially pruning thresholds and feature dictionaries, often yielding brittle "one-shot" explanations with no principled notion of uncertainty. We reframe circuit discovery as an uncertainty-quantification problem over these analytic degrees of freedom. Our method, CIRCUS, constructs an ensemble of attribution graphs by pruning a single raw attribution run under multiple configurations, assigns each edge a stability score (the fraction of configurations that retain it), and extracts a strict-consensus circuit consisting only of edges that appear in all views. This produces a threshold-robust "core" circuit while explicitly surfacing contingent alternatives and enabling rejection of low-agreement structure. CIRCUS requires no retraining and adds negligible overhead, since it aggregates structure across already-computed pruned graphs. On Gemma-2-2B and Llama-3.2-1B, strict consensus circuits are ~40x smaller than the union of all configurations while retaining comparable influence-flow explanatory power, and they outperform a same-edge-budget baseline (union pruned to match the consensus size). We further validate causal relevance with activation patching, where consensus-identified nodes consistently beat matched non-consensus controls (p=0.0004). Overall, CIRCUS provides a practical, uncertainty-aware framework for reporting trustworthy, auditable mechanistic circuits with an explicit core/contingent/noise decomposition.
[21] CoMoL: Efficient Mixture of LoRA Experts via Dynamic Core Space Merging
Jie Cao,Zhenxuan Fan,Zhuonan Wang,Tianwei Lin,Ziyuan Zhao,Rolan Yan,Wenqiao Zhang,Feifei Shao,Hongwei Wang,Jun Xiao,Siliang Tang
Main category: cs.CL
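TL;DR: This paper proposes CoMoL, a novel MoE-LoRA framework that uses core space experts and core space routing to achieve expert diversity, parameter efficiency, and fine-grained adaptation, substantially improving parameter efficiency while retaining strong adaptability.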
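The summary does not give the merging formula, so the following is only a sketch under stated assumptions: per-token router logits are turned into softmax weights, and the merged core expert is the weighted sum of the experts' core matrices. The names `soft_merge` and `router_logits` are hypothetical, and the later combination with the shared LoRA is left out.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of router logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_merge(cores: List[Matrix], router_logits: List[float]) -> Matrix:
    """Weighted sum of core matrices (assumed form of soft merging)."""
    weights = softmax(router_logits)
    rows, cols = len(cores[0]), len(cores[0][0])
    return [
        [sum(w * core[i][j] for w, core in zip(weights, cores)) for j in range(cols)]
        for i in range(rows)
    ]


# Two toy 2x2 core experts; equal logits give equal routing weights.
cores = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 2.0], [2.0, 0.0]],
]
merged = soft_merge(cores, [0.0, 0.0])
print(merged)
```

Because merging happens on the small core matrices rather than on full LoRA factors, the per-token cost of this step grows with the core size, not with the model's hidden dimension.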
Details
Motivation: Existing MoE-LoRA methods suffer from low parameter efficiency and from coarse-grained adaptation caused by instance-level routing.
Method: Propose the CoMoL framework, comprising: 1) core space experts (each expert is stored as a compact core matrix, which controls parameter growth while preserving diversity); 2) core space routing (for each token, the appropriate core experts are dynamically selected and activated, merged into a single core expert via a soft-merging strategy, and combined with a shared LoRA to form a specialized LoRA module); 3) projecting the routing network into the low-rank space of LoRA to further reduce parameter overhead.
Result: Experiments show that CoMoL retains the adaptability of MoE-LoRA while matching the parameter efficiency of standard LoRA, and consistently outperforms existing methods across multiple tasks.
Conclusion: CoMoL effectively resolves the tension between parameter efficiency and fine-grained adaptation in MoE-LoRA, offering a new paradigm for efficient, flexible fine-tuning of large models.
Abstract: Large language models (LLMs) achieve remarkable performance on diverse downstream and domain-specific tasks via parameter-efficient fine-tuning (PEFT). However, existing PEFT methods, particularly MoE-LoRA architectures, suffer from limited parameter efficiency and coarse-grained adaptation due to the proliferation of LoRA experts and instance-level routing. To address these issues, we propose Core Space Mixture of LoRA (CoMoL), a novel MoE-LoRA framework that incorporates expert diversity, parameter efficiency, and fine-grained adaptation. Specifically, CoMoL introduces two key components: core space experts and core space routing. Core space experts store each expert in a compact core matrix, preserving diversity while controlling parameter growth. Core space routing dynamically selects and activates the appropriate core experts for each token, enabling fine-grained, input-adaptive routing. Activated core experts are then merged via a soft-merging strategy into a single core expert, which is combined with a shared LoRA to form a specialized LoRA module. In addition, the routing network is projected into the same low-rank space as the LoRA matrices, further reducing parameter overhead without compromising expressiveness. Extensive experiments demonstrate that CoMoL retains the adaptability of MoE-LoRA architectures while achieving parameter efficiency comparable to standard LoRA, consistently outperforming existing methods across multiple tasks.
[22] Super Research: Answering Highly Complex Questions with Large Language Models through Super Deep and Super Wide Research
Yubo Dong,Nianhao You,Yuxuan Hou,Zixun Sun,Yue Zhang,Hehe Fan,Liang Zhang,Siyuan Zhao,Linyi
Main category: cs.CL
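TL;DR: This paper proposes the Super Research task for evaluating how well large language models handle highly complex research questions, covering structured decomposition, super wide retrieval, and super deep investigation, and builds a benchmark of 300 expert-written questions together with a multi-dimensional auditing protocol.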
Details
Motivation: Although large language models perform well at deep research or wide search, their ability to handle highly complex questions that require long-horizon planning, extensive evidence gathering, and synthesis of heterogeneous information remains underexplored.
Method: Propose the Super Research task framework, covering structured decomposition into a research plan, super wide multi-perspective retrieval, and super deep resolution of uncertainties through iterative queries; build a benchmark of 300 expert-written questions across domains; design a graph-anchored auditing protocol with five dimensions (Coverage, Logical Consistency, Report Utility, Objectivity, and Citation Health).
Result: Super Research produces verifiable research reports with fine-grained citations and intermediate artifacts (such as outlines and tables); the framework serves as a ceiling evaluation and stress test, and success on it can act as a strong proxy for a model's general research competence.
Conclusion: Super Research provides a new paradigm and a rigorous benchmark for evaluating and advancing large language models on complex autonomous research tasks, and performance on it is a key indicator of a model's research robustness.
Abstract: While Large Language Models (LLMs) have demonstrated proficiency in Deep Research or Wide Search, their capacity to solve highly complex questions (those requiring long-horizon planning, massive evidence gathering, and synthesis across heterogeneous sources) remains largely unexplored. We introduce Super Research, a task for complex autonomous research that integrates (i) structured decomposition into a research plan, (ii) super wide retrieval for diverse perspectives, and (iii) super deep investigation to resolve uncertainties through iterative queries. To evaluate this capability, we curated a benchmark of 300 expert-written questions across diverse domains, each requiring up to 100+ retrieval steps and 1,000+ web pages to reconcile conflicting evidence. Super Research produces verifiable reports with fine-grained citations and intermediate artifacts (e.g., outlines and tables) to ensure traceable reasoning. Furthermore, we present a graph-anchored auditing protocol that evaluates Super Research along five dimensions: Coverage, Logical Consistency, Report Utility, Objectivity, and Citation Health. While super-complex questions may be infrequent in standard applications, Super Research serves as a critical ceiling evaluation and stress test for LLM capabilities. A model's proficiency within Super Research acts as a powerful proxy for its general research competence; success here suggests the robustness necessary to navigate nearly any subordinate research task.
Leaderboard is available at: https://cnsdqd-dyb.github.io/Super-Research-Benchmark/
[23] From Literature to Hypotheses: An AI Co-Scientist System for Biomarker-Guided Drug Combination Hypothesis Generation
Raneen Younis,Suvinava Basak,Lukas Chavez,Zahra Ahmadi
Main category: cs.CL
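TL;DR: CoDHy is an interactive AI system for cancer research that generates biomarker-guided drug combination hypotheses, combining a knowledge graph with agent-based reasoning and letting researchers intervene and validate in real time.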
Details
Motivation: The rapid growth of biomedical literature and databases makes it hard for researchers to systematically connect biomarker mechanisms to actionable drug combination hypotheses.
Method: Build a task-specific knowledge graph that integrates structured databases and unstructured literature evidence; combine knowledge graph embeddings with agent-based reasoning to generate, validate, and rank hypotheses; provide a web interface through which users configure the context, inspect intermediate results, and iteratively refine hypotheses.
Result: The resulting system, CoDHy, generates drug combination hypotheses in an interpretable, traceable, and interactive way; the paper presents its design, interaction workflow, and practical use cases in translational oncology.
Conclusion: Through human-AI collaboration, CoDHy makes hypothesis generation more transparent and controllable, offering a new paradigm for knowledge-driven decision-making in biomedical research.
Abstract: The rapid growth of biomedical literature and curated databases has made it increasingly difficult for researchers to systematically connect biomarker mechanisms to actionable drug combination hypotheses. We present AI Co-Scientist (CoDHy), an interactive, human-in-the-loop system for biomarker-guided drug combination hypothesis generation in cancer research. CoDHy integrates structured biomedical databases and unstructured literature evidence into a task-specific knowledge graph, which serves as the basis for graph-based reasoning and hypothesis construction. The system combines knowledge graph embeddings with agent-based reasoning to generate, validate, and rank candidate drug combinations, while explicitly grounding each hypothesis in retrievable evidence. Through a web-based interface, users can configure the scientific context, inspect intermediate results, and iteratively refine hypotheses, enabling transparent and researcher-steerable exploration rather than automated decision-making. We demonstrate CoDHy as a system for exploratory hypothesis generation and decision support in translational oncology, highlighting its design, interaction workflow, and practical use cases.
[24] QQ: A Toolkit for Language Identifiers and Metadata
Wessel Poelman,Yiyi Chen,Miryam de Lhoneux
Main category: cs.CL
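TL;DR: This paper introduces QwanQwa (QQ), a lightweight Python toolkit for unified management of language metadata in multilingual NLP; it supports normalization and mapping across multiple language identifier schemes and provides graph-based traversal of language attributes.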
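The identifier-mapping problem can be illustrated with a toy lookup table. This sketch does not use or reproduce QQ's actual API: the `RECORDS` table, the field names, and the `convert` helper are all hypothetical, and a real toolkit would load its records from curated language resources rather than hard-code them.

```python
from typing import Dict, Optional

# Hypothetical hand-written records for illustration only.
RECORDS = [
    {"iso639_1": "en", "iso639_3": "eng", "glottocode": "stan1293", "script_tag": "en_Latn"},
    {"iso639_1": "de", "iso639_3": "deu", "glottocode": "stan1295", "script_tag": "de_Latn"},
]

# Index every known identifier (case-insensitively) to its full record.
INDEX: Dict[str, dict] = {}
for record in RECORDS:
    for value in record.values():
        INDEX[value.lower()] = record


def convert(identifier: str, target: str) -> Optional[str]:
    """Map any known identifier to the requested scheme, or None if unknown."""
    record = INDEX.get(identifier.strip().lower())
    return record[target] if record else None


print(convert("stan1293", "iso639_1"))
print(convert("en_Latn", "iso639_3"))
```

Indexing all identifiers into one table means callers never need to know which scheme an incoming code belongs to, which is the "glue" role the paper describes.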
Details
Motivation: Multilingual NLP uses inconsistent language identifier standards (such as BCP-47, ISO 639-1, and Glottocodes), and mapping between them does not scale to thousands of languages; a unified, scalable solution for language metadata management is needed.
Method: Design and implement the QwanQwa (QQ) toolkit, which integrates multiple language resources and provides a standardized interface, automatic identifier mapping, and graph-structured modeling of language relations (covering language families, regions, writing systems, and more).
Result: QQ delivers efficient, scalable identifier normalization and cross-dimensional relation queries; it has been used as a foundational tool in several multilingual NLP projects, supporting its ease of use and practicality.
Conclusion: QwanQwa gives multilingual NLP research a reliable, lightweight, and explorable infrastructure for language metadata, helping to improve experimental reproducibility and the accuracy of language coverage reporting.
Abstract: The growing number of languages considered in multilingual NLP, including new datasets and tasks, poses challenges regarding properly and accurately reporting which languages are used and how. For example, datasets often use different language identifiers; some use BCP-47 (e.g. en_Latn), others use ISO 639-1 (en), and more linguistically oriented datasets use Glottocodes (stan1293). Mapping between identifiers is manageable for a few dozen languages, but becomes unscalable when dealing with thousands. We introduce QwanQwa, a lightweight Python toolkit for unified language metadata management. QQ integrates multiple language resources into a single interface, provides convenient normalization and mapping between language identifiers, and affords a graph-based structure that enables traversal across families, regions, writing systems, and other linguistic attributes. QQ serves both as (1) a simple "glue" library in multilingual NLP research that makes working with many languages easier, and (2) an intuitive way to explore languages, such as finding related ones through shared scripts, regions, or other metadata.
[25] Piecing Together Cross-Document Coreference Resolution Datasets: Systematic Dataset Analysis and Unification
Anastasia Zhukova,Terry Ruas,Jan Philip Wahle,Bela Gipp
Main category: cs.CL
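TL;DR: This paper proposes uCDCR, a unified dataset that consolidates multiple English CDCR corpora, corrects inconsistencies, and fills in missing attributes; it establishes a standardized evaluation framework, analyzes each dataset's lexical properties and their effect on model performance, and stresses that both event and entity coreference are challenging, so research should not focus on event coreference resolution (ECR) alone.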
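The standardized evaluation in this paper relies on a same-head-lemma baseline, which links mentions whose syntactic heads share a lemma. A minimal sketch of that idea follows, assuming head lemmas are already extracted; the `(mention_id, head_lemma)` input format and the function name are illustrative assumptions, not the uCDCR codebase.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Mention = Tuple[str, str]  # (mention_id, head_lemma), an assumed format


def same_head_lemma_clusters(mentions: List[Mention]) -> List[List[str]]:
    """Put mentions with an identical (lowercased) head lemma in one cluster."""
    clusters: Dict[str, List[str]] = defaultdict(list)
    for mention_id, head_lemma in mentions:
        clusters[head_lemma.lower()].append(mention_id)
    return list(clusters.values())


# Toy mentions drawn from three documents.
mentions = [
    ("d1_m1", "attack"),
    ("d2_m4", "attack"),
    ("d2_m7", "bombing"),
    ("d3_m2", "Attack"),
]
clusters = same_head_lemma_clusters(mentions)
print(clusters)
```

The baseline misses coreferent mentions with different heads ("attack" versus "bombing" here), which is why its score tracks a dataset's lexical diversity and ambiguity.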
Details
Motivation: CDCR research is fragmented because of heterogeneous dataset formats, inconsistent annotation standards, and an excessive focus on event coreference resolution (ECR); a unified benchmark is needed to enable reproducible and cross-dataset research.
Method: Build the unified uCDCR dataset from publicly available CDCR corpora across multiple domains, unifying formats and correcting known inconsistencies; introduce standardized measures (such as the same-head-lemma baseline), lexical diversity and ambiguity analysis, and a review of annotation rules; systematically compare the datasets in their document and mention distributions, lexical composition, and model performance.
Result: ECB+, although the state-of-the-art benchmark, has one of the lowest lexical diversities and a mid-range CDCR complexity; uCDCR as a whole is more balanced and diverse; using all uCDCR data together improves model generalization; event and entity coreference perform almost identically on the same-head-lemma baseline, indicating comparable difficulty.
Conclusion: uCDCR provides a unified, reliable, and extensible benchmark for CDCR research; lexical properties significantly affect model performance; event and entity coreference deserve equal attention to avoid methodological bias; the open data and code support fair, reproducible research.
Abstract: Research in cross-document coreference resolution (CDCR) remains fragmented due to heterogeneous dataset formats, varying annotation standards, and the predominant definition of CDCR as event coreference resolution (ECR). To address these challenges, we introduce uCDCR, a unified dataset that consolidates diverse publicly available English CDCR corpora across various domains into a consistent format, which we analyze with standardized metrics and evaluation protocols. uCDCR incorporates both entity and event coreference, corrects known inconsistencies, and enriches datasets with missing attributes to facilitate reproducible research. We establish a cohesive framework for fair, interpretable, and cross-dataset analysis in CDCR and compare the datasets on their lexical properties (e.g., lexical composition of the annotated mentions, lexical diversity, and ambiguity metrics), discuss the annotation rules and principles that lead to high lexical diversity, and examine how these metrics influence performance on the same-head-lemma baseline. Our dataset analysis shows that ECB+, the state-of-the-art benchmark for CDCR, has one of the lowest lexical diversities, and its CDCR complexity, measured by the same-head-lemma baseline, lies in the middle among all uCDCR datasets. Moreover, comparing document and mention distributions between ECB+ and uCDCR shows that using all uCDCR datasets for model training and evaluation will improve the generalizability of CDCR models.
Finally, the almost identical performance on the same-head-lemma baseline, separately applied to events and entities, shows that resolving both types is a complex task and should not be steered toward ECR alone. The uCDCR dataset is available at https://huggingface.co/datasets/AnZhu/uCDCR, and the code for parsing, analyzing, and scoring the dataset is available at https://github.com/anastasia-zhukova/uCDCR.
[26] BLUFF: Benchmarking the Detection of False and Synthetic Content across 58 Low-Resource Languages
Jason Lucas,Matt Murtagh-White,Adaku Uchendu,Ali Al-Lawati,Michiharu Yamashita,Dominik Macko,Ivan Srba,Robert Moro,Dongwon Lee
Main category: cs.CL
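TL;DR: This paper proposes BLUFF, a multilingual benchmark for detecting false and synthetic content that covers 79 languages with over 202K samples and pays particular attention to low-resource languages; it is accompanied by the AXL-CoI generation framework and the mPURIFY quality filtering pipeline, and shows that existing detectors degrade markedly on low-resource languages.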
Details
Motivation: Existing disinformation detection benchmarks are confined to English or a few high-resource languages, leaving low-resource language communities without effective defenses; a high-quality multilingual benchmark with broad coverage, including long-tail languages, is needed.
Method: Build the BLUFF multilingual benchmark, covering 79 languages, four content types (human-written, LLM-generated, LLM-translated, and hybrid human-LLM text), bidirectional translation, 39 textual modification techniques, and generation by 19 LLMs; propose AXL-CoI, a multi-agent adversarial cross-lingual generation framework, and the mPURIFY quality filtering pipeline.
Result: Experiments show that current state-of-the-art detectors suffer up to 25.3% F1 degradation on low-resource languages relative to high-resource ones; BLUFF provides an open dataset, evaluation tools, and full documentation.
Conclusion: BLUFF fills a major gap in multilingual benchmarks for detecting false and synthetic content and advances fair, robust detection research that serves language communities worldwide.
Abstract: Multilingual falsehoods threaten information integrity worldwide, yet detection benchmarks remain confined to English or a few high-resource languages, leaving low-resource linguistic communities without robust defense tools. We introduce BLUFF, a comprehensive benchmark for detecting false and synthetic content, spanning 79 languages with over 202K samples, combining human-written fact-checked content (122K+ samples across 57 languages) and LLM-generated content (79K+ samples across 71 languages). BLUFF uniquely covers both high-resource "big-head" (20) and low-resource "long-tail" (59) languages, addressing critical gaps in multilingual research on detecting false and synthetic content. Our dataset features four content types (human-written, LLM-generated, LLM-translated, and hybrid human-LLM text), bidirectional translation (English↔X), 39 textual modification techniques (36 manipulation tactics for fake news, 3 AI-editing strategies for real news), and varying edit intensities generated using 19 diverse LLMs. We present AXL-CoI (Adversarial Cross-Lingual Agentic Chain-of-Interactions), a novel multi-agentic framework for controlled fake/real news generation, paired with mPURIFY, a quality filtering pipeline ensuring dataset integrity. Experiments reveal state-of-the-art detectors suffer up to 25.3% F1 degradation on low-resource versus high-resource languages. BLUFF provides the research community with a multilingual benchmark, extensive linguistic-oriented benchmark evaluation, comprehensive documentation, and open-source tools to advance equitable falsehood detection.
Dataset and code are available at: https://jsl5710.github.io/BLUFF/
[27] SSKG Hub: An Expert-Guided Platform for LLM-Empowered Sustainability Standards Knowledge Graphs
Chaoyue He,Xin Zhou,Xinjia Yu,Lei Zhang,Yan Zhang,Yi Wu,Lei Xiao,Liangyue Li,Di Wang,Hong Xu,Xiaoqiao Wang,Wei Liu,Chunyan Miao
Main category: cs.CL
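TL;DR: This paper proposes SSKG Hub, an LLM-centered platform that turns sustainability disclosure standards into auditable knowledge graphs; an expert-guided workflow handles automatic extraction, review, and certification, and the platform supports multi-KG fusion and downstream tasks.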
Details
Motivation: Existing sustainability disclosure standards (such as GRI and SASB) are lengthy, terminology-dense, and heavily cross-referenced, which makes structured analysis and downstream use difficult.
Method: Build an LLM-centered, expert-guided pipeline comprising automatic standard identification, configurable chunking, standard-specific prompting, robust triple parsing, and Neo4j storage with fine-grained provenance metadata; a Draft KG is promoted to a Certified KG after expert review and meta-expert adjudication; a role-based governance framework ensures traceability and accountability.
Result: The SSKG Hub prototype and interactive web platform were implemented, and an expert-led, end-to-end KG review case study demonstrated knowledge graph construction, review, and quality assurance; the platform is publicly available (www.sskg-hub.com).
Conclusion: SSKG Hub offers a new paradigm for making sustainability standards structured, auditable, and reusable, advancing the automation and transparency of standards comprehension, compliance analysis, and cross-standard comparison.
Abstract: Sustainability disclosure standards (e.g., GRI, SASB, TCFD, IFRS S2) are comprehensive yet lengthy, terminology-dense, and highly cross-referential, hindering structured analysis and downstream use. We present SSKG Hub (Sustainability Standards Knowledge Graph Hub), a research prototype and interactive web platform that transforms standards into auditable knowledge graphs (KGs) through an LLM-centered, expert-guided pipeline. The system integrates automatic standard identification, configurable chunking, standard-specific prompting, robust triple parsing, and provenance-aware Neo4j storage with fine-grained audit metadata. LLM extraction produces a provenance-linked Draft KG, which is reviewed, curated, and formally promoted to a Certified KG through meta-expert adjudication. A role-based governance framework covering read-only guest access, expert review and CRUD operations, meta-expert certification, and administrative oversight ensures traceability and accountability across draft and certified states. Beyond graph exploration and triple-level evidence tracing, SSKG Hub supports cross-KG fusion, KG-driven tasks, and dedicated modules for insights and curated resources. We validate the platform through a comprehensive expert-led KG review case study that demonstrates end-to-end curation and quality assurance. The web application is publicly available at www.sskg-hub.com.
[28] Polynomial Mixing for Efficient Self-supervised Speech Encoders
Eva Feillet,Ryan Whetten,David Picard,Alexandre Allauzen
Main category: cs.CL
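TL;DR: This paper proposes the Polynomial Mixer (PoM), a linear-complexity mixing mechanism that replaces self-attention in Transformers for speech recognition, maintaining performance while substantially improving compute and memory efficiency.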
Details
Motivation: Existing Transformer-based speech-to-text models face a scalability bottleneck because self-attention has quadratic compute and memory complexity.
Method: Propose a new token-mixing mechanism, the Polynomial Mixer (PoM), which models token dependencies with linear complexity, and integrate it into a self-supervised speech representation learning framework based on BEST-RQ.
Result: On downstream speech recognition tasks, PoM achieves a word error rate (WER) comparable to full self-attention and other linear-complexity methods while being more efficient in time and memory.
Conclusion: PoM is an efficient and effective alternative to self-attention that offers a better performance-efficiency trade-off for large-scale speech modeling.
Abstract: State-of-the-art speech-to-text models typically employ Transformer-based encoders that model token dependencies via self-attention mechanisms. However, the quadratic complexity of self-attention in both memory and computation imposes significant constraints on scalability. In this work, we propose a novel token-mixing mechanism, the Polynomial Mixer (PoM), as a drop-in replacement for multi-head self-attention. PoM computes a polynomial representation of the input with linear complexity with respect to the input sequence length. We integrate PoM into a self-supervised speech representation learning framework based on BEST-RQ and evaluate its performance on downstream speech recognition tasks. Experimental results demonstrate that PoM achieves a competitive word error rate compared to full self-attention and other linear-complexity alternatives, offering an improved trade-off between performance and efficiency in time and memory.
[29] RAVEL: Reasoning Agents for Validating and Evaluating LLM Text Synthesis
Andrew Zhuoer Feng,Cunxiang Wang,Yu Luo,Bosi Wen,Yidong Wang,Lin Fan,Yilin Zhou,Zikang Wang,Wenbo Yu,Lindong Wu,Hongning Wang,Minlie Huang
Main category: cs.CL
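TL;DR: This paper proposes the RAVEL framework and the C3EBench benchmark for evaluating what large language models can actually do in text synthesis tasks, and finds that reasoning capability matters more than generative capability.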
Details
Motivation: Existing evaluation frameworks cannot assess the concrete operations large language models perform during text synthesis (such as outlining, drafting, and editing), so a finer-grained evaluation method is needed.
Method: Propose RAVEL, an agentic framework in which LLM testers autonomously plan and execute text synthesis operations; build the C3EBench benchmark (1,258 samples), using reverse engineering to isolate four tasks: Cloze, Edit, Expand, and End-to-End.
Result: An analysis of 14 LLMs shows that most models struggle on tasks with under-specified instructions or high demands on contextual understanding; agentic text synthesis depends mainly on a model's reasoning capability rather than its generative capability; a strong reasoner can guide a weaker generator to better results, but not the other way around.
Conclusion: Reasoning is the core of text synthesis ability; future work should prioritize improving LLM reasoning and planning rather than simply scaling up generation.
Abstract: Large Language Models have evolved from single-round generators into long-horizon agents, capable of complex text synthesis scenarios. However, current evaluation frameworks lack the ability to assess the actual synthesis operations, such as outlining, drafting, and editing. Consequently, they fail to evaluate the actual and detailed capabilities of LLMs. To bridge this gap, we introduce RAVEL, an agentic framework that enables LLM testers to autonomously plan and execute typical synthesis operations, including outlining, drafting, reviewing, and refining. Complementing this framework, we present C3EBench, a comprehensive benchmark comprising 1,258 samples derived from professional human writings. We utilize a "reverse-engineering" pipeline to isolate specific capabilities across four tasks: Cloze, Edit, Expand, and End-to-End. Through our analysis of 14 LLMs, we find that most LLMs struggle with tasks that demand contextual understanding under limited or under-specified instructions. By augmenting RAVEL with SOTA LLMs as operators, we find that such agentic text synthesis is dominated by the LLM's reasoning capability rather than raw generative capacity. Furthermore, we find that a strong reasoner can guide a weaker generator to yield higher-quality results, whereas the inverse does not hold. Our code and data are available at this link: https://github.com/ZhuoerFeng/RAVEL-Reasoning-Agents-Text-Eval.
[30] DRIV-EX: Counterfactual Explanations for Driving LLMs
Amaia Cardiel,Eloi Zablocki,Elias Ramzi,Eric Gaussier
Main category: cs.CL
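TL;DR: This paper proposes DRIV-EX, which combines gradient-based embedding optimization with controlled decoding to generate semantically plausible, fluent counterfactual explanations, making the decisions of large language models in autonomous driving more interpretable.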
Details
Motivation: Large language models (LLMs) are increasingly used as reasoning engines in autonomous driving, but their decision process lacks transparency; interpretable methods are needed to expose its underlying logic.
Method: Propose DRIV-EX: first run gradient-based optimization in the continuous embedding space to locate the minimal semantic perturbation that flips the decision, then use the optimized embeddings as a semantic guide for controlled text decoding, which regenerates a counterfactual scene description that is fluent, valid in the domain, and close to the original input.
Result: Experiments with the LC-LLM planner on a textual transcription of the highD dataset show that DRIV-EX generates valid, fluent counterfactual explanations more reliably than existing baselines, and that it exposes latent model biases, giving concrete insights for making LLM driving agents more robust.
Conclusion: By combining continuous optimization with discrete controlled generation, DRIV-EX enables efficient, trustworthy counterfactual analysis of LLM driving decisions without sacrificing explanation quality, an important step toward more interpretable autonomous driving AI.
Abstract: Large language models (LLMs) are increasingly used as reasoning engines in autonomous driving, yet their decision-making remains opaque. We propose to study their decision process through counterfactual explanations, which identify the minimal semantic changes to a scene description required to alter a driving plan. We introduce DRIV-EX, a method that leverages gradient-based optimization on continuous embeddings to identify the input shifts required to flip the model's decision. Crucially, to avoid the incoherent text typical of unconstrained continuous optimization, DRIV-EX uses these optimized embeddings solely as a semantic guide: they are used to bias a controlled decoding process that re-generates the original scene description. This approach effectively steers the generation toward the counterfactual target while guaranteeing linguistic fluency, domain validity, and proximity to the original input, all of which are essential for interpretability. Evaluated using the LC-LLM planner on a textual transcription of the highD dataset, DRIV-EX generates valid, fluent counterfactuals more reliably than existing baselines. It successfully exposes latent biases and provides concrete insights to improve the robustness of LLM-based driving agents.
[31] SkillCraft: Can LLM Agents Learn to Use Tools Skillfully?
Shiqi Chen,Jingze Gai,Ruochen Zhou,Jinghan Zhang,Tongyao Zhu,Junlong Li,Kangrui Wang,Zihan Wang,Zhengyu Chen,Klara Kaleb,Ning Miao,Siyang Gao,Cong Lu,Manling Li,Junxian He,Yee Whye Teh
Main category: cs.CL
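TL;DR: This paper proposes the SkillCraft benchmark, which evaluates an agent's ability to abstract and reuse higher-level tool compositions (Skills) in long-horizon workflows, and designs a lightweight evaluation protocol that supports automatic construction, caching, and cross-task reuse of skills; experiments show that skill reuse substantially reduces token consumption and that success rate correlates strongly with tool composition ability.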
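A toy version of the compose, cache, and reuse loop might look like the sketch below. The `SkillLibrary` class, its method names, and the numeric toy tools are hypothetical illustrations, not the benchmark's actual protocol or tool set.

```python
from typing import Callable, Dict, List, Optional

Tool = Callable[[float], float]


class SkillLibrary:
    """Caches composed tool chains under a name so later tasks can reuse them."""

    def __init__(self) -> None:
        self._skills: Dict[str, Tool] = {}

    def compose(self, name: str, steps: List[Tool]) -> Tool:
        """Chain atomic tools into one callable Skill and cache it."""

        def skill(value: float) -> float:
            for step in steps:
                value = step(value)
            return value

        self._skills[name] = skill
        return skill

    def get(self, name: str) -> Optional[Tool]:
        """Return a cached Skill, or None if it was never composed."""
        return self._skills.get(name)


def double(x: float) -> float:
    return x * 2


def add_one(x: float) -> float:
    return x + 1


library = SkillLibrary()
library.compose("double_then_increment", [double, add_one])

# A later task reuses the cached Skill instead of re-planning both tool calls.
reused = library.get("double_then_increment")
result = reused(3)
print(result)
```

In an agent setting, the saving comes from replacing a multi-step plan with a single named call, which is one plausible reason reuse lowers token consumption.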
Details
Motivation: Existing benchmarks mainly measure single-task success under static tool sets and cannot effectively measure whether an agent acquires reusable higher-level tool composition skills, whereas real-world tool-using agents must invoke and combine tools repeatedly and in structured ways over long horizons.
Method: Propose the SkillCraft benchmark, which contains realistic, highly compositional tool-use scenarios with difficulty scaled along both quantitative and structural dimensions; design a lightweight evaluation protocol that lets agents automatically compose atomic tools into executable Skills and cache and reuse them within and across tasks.
Result: Evaluating current state-of-the-art agents on SkillCraft shows that saving and reusing skills reduces token usage by up to 80%; success rate correlates strongly with tool composition ability at test time.
Conclusion: Compositional skill acquisition is a core agent capability, and SkillCraft provides a new benchmark and method for evaluating and advancing it.
Abstract: Real-world tool-using agents operate over long-horizon workflows with recurring structure and diverse demands, where effective behavior requires not only invoking atomic tools but also abstracting and reusing higher-level tool compositions. However, existing benchmarks mainly measure instance-level success under static tool sets, offering limited insight into agents' ability to acquire such reusable skills. We address this gap by introducing SkillCraft, a benchmark that explicitly stress-tests agents' ability to form and reuse higher-level tool compositions, which we call Skills. SkillCraft features realistic, highly compositional tool-use scenarios with difficulty scaled along both quantitative and structural dimensions, designed to elicit skill abstraction and cross-task reuse. We further propose a lightweight evaluation protocol that enables agents to auto-compose atomic tools into executable Skills and to cache and reuse them within and across tasks, thereby improving efficiency while accumulating a persistent library of reusable skills. Evaluating state-of-the-art agents on SkillCraft, we observe substantial efficiency gains, with token usage reduced by up to 80% through skill saving and reuse. Moreover, success rate strongly correlates with tool composition ability at test time, underscoring compositional skill acquisition as a core capability.
[32] RLAR: An Agentic Reward System for Multi-task Reinforcement Learning on Large Language Models
Andrew Zhuoer Feng,Cunxiang Wang,Bosi Wen,Yidong Wang,Yu Luo,Hongning Wang,Minlie Huang
Main category: cs.CL
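TL;DR: This paper proposes the RLAR framework, in which LLM agents dynamically synthesize and invoke reward functions so that the reward system evolves on its own, significantly improving the alignment of large language models across multiple tasks.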
Details
Motivation: Static, domain-specific reward models are costly to train and generalize poorly to out-of-distribution scenarios, so they struggle to keep up with the shifting data distributions encountered across reinforcement learning iterations.
Method: RLAR casts reward acquisition as a dynamic tool synthesis and invocation task, using LLM agents to autonomously retrieve optimal reward models from the Internet and to synthesize programmatic verifiers through code generation.
Result: Performance gains range from 10 to 60 across mathematics, coding, translation, and dialogue tasks; on RewardBench-V2, RLAR significantly outperforms static baselines and approaches the performance upper bound.
Conclusion: Dynamic reward orchestration effectively improves the generalization and adaptability of reward models and is an effective new paradigm for improving large language model alignment.
Abstract: Large language model alignment via reinforcement learning depends critically on reward function quality. However, static, domain-specific reward models are often costly to train and exhibit poor generalization in out-of-distribution scenarios encountered during RL iterations. We present RLAR (Reinforcement Learning from Agent Rewards), an agent-driven framework that dynamically assigns tailored reward functions to individual queries. Specifically, RLAR transforms reward acquisition into a dynamic tool synthesis and invocation task. It leverages LLM agents to autonomously retrieve optimal reward models from the Internet and synthesize programmatic verifiers through code generation. This allows the reward system to self-evolve with the shifting data distributions during training. Experimental results demonstrate that RLAR yields consistent performance gains ranging from 10 to 60 across mathematics, coding, translation, and dialogue tasks. On RewardBench-V2, RLAR significantly outperforms static baselines and approaches the performance upper bound, demonstrating superior generalization through dynamic reward orchestration. The data and code are available at this link: https://github.com/ZhuoerFeng/RLAR.
[33] LaSTR: Language-Driven Time-Series Segment Retrieval
Kota Dohi,Harsh Purohit,Tomoya Nishida,Takashi Endo,Yusuke Ohtsubo,Koichiro Yawata,Koki Takeshita,Tatsuya Sasaki,Yohei Kawaguchi
Main category: cs.CL
TL;DR: LaSTR is a language-driven time-series segment retrieval method that uses a Conformer-based contrastive retriever trained on large-scale segment-caption data to improve semantic alignment between natural language queries and time-series segments.
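Once queries and segments live in a shared embedding space, retrieval reduces to ranking candidates by similarity. The sketch below assumes cosine similarity and hand-written 2-D vectors; the function names, the segment ids, and the vectors are illustrative assumptions, and the actual Conformer encoder is not modeled.

```python
import math
from typing import Dict, List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_segments(query_vec: List[float], segment_vecs: Dict[str, List[float]]) -> List[str]:
    """Order candidate segment ids from most to least similar to the query."""
    return sorted(
        segment_vecs,
        key=lambda seg_id: cosine(query_vec, segment_vecs[seg_id]),
        reverse=True,
    )


def recall_at_k(ranked: List[str], positive_id: str, k: int) -> float:
    """Single-positive retrieval: 1.0 if the positive is in the top k."""
    return 1.0 if positive_id in ranked[:k] else 0.0


# Toy embeddings standing in for encoder outputs.
query = [1.0, 0.0]
segments = {"spike": [0.9, 0.1], "flat": [0.0, 1.0], "drift": [0.5, 0.5]}
ranked = rank_segments(query, segments)
print(ranked)
```

Varying the size of `segments` corresponds to the multiple candidate pool sizes mentioned in the evaluation: larger pools make the single positive harder to rank first.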
Details
Motivation: Existing time-series search methods often rely on expert-designed similarity criteria or global series-level descriptions, lacking flexibility for natural language queries targeting local segments.
Method: The authors construct segment–caption training data via TV2-based segmentation of LOTSA windows and GPT-5.2 captioning, then train a Conformer-based contrastive retriever in a shared text–time-series embedding space.
Result: LaSTR outperforms random and CLIP baselines across multiple candidate pool sizes in single-positive retrieval, with improved ranking quality and stronger semantic agreement (validated by SBERT and VLM-as-a-judge).
Conclusion: Language-driven segment retrieval is feasible and effective using large-scale synthetic caption data and contrastive learning in a joint embedding space, enabling more intuitive and semantically grounded time-series search.
Abstract: Effectively searching time-series data is essential for system analysis, but existing methods often require expert-designed similarity criteria or rely on global, series-level descriptions. We study language-driven segment retrieval: given a natural language query, the goal is to retrieve relevant local segments from large time-series repositories. We build large-scale segment–caption training data by applying TV2-based segmentation to LOTSA windows and generating segment descriptions with GPT-5.2, and then train a Conformer-based contrastive retriever in a shared text–time-series embedding space. On a held-out test split, we evaluate single-positive retrieval together with caption-side consistency (SBERT and VLM-as-a-judge) under multiple candidate pool sizes. Across all settings, LaSTR outperforms random and CLIP baselines, yielding improved ranking quality and stronger semantic agreement between retrieved segments and query intent.
[34] Qwen3-Coder-Next Technical Report
Ruisheng Cao,Mouxiang Chen,Jiawei Chen,Zeyu Cui,Yunlong Feng,Binyuan Hui,Yuheng Jing,Kaixin Li,Mingze Li,Junyang Lin,Zeyao Ma,Kashun Shum,Xuwu Wang,Jinxi Wei,Jiaxi Yang,Jiajun Zhang,Lei Zhang,Zongmeng Zhang,Wenting Zhao,Fan Zhou
Main category: cs.CL
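TL;DR: Qwen3-Coder-Next is an efficient open-weight coding model with 80 billion total parameters, of which only 3 billion are activated; trained agentically on large-scale verifiable coding tasks paired with executable environments, it shows strong coding ability relative to its active parameter count on agentic benchmarks such as SWE-Bench.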
Details
Motivation: Explore whether strong training recipes can push past the capability limits of models with small parameter footprints, delivering high coding performance with efficient inference.
Method: Agentic training through large-scale synthesis of verifiable coding tasks paired with executable environments, combined with mid-training and reinforcement learning; the model uses a sparsely activated architecture (80B total parameters, 3B active).
Result: On agent-centric benchmarks such as SWE-Bench and Terminal-Bench, performance is competitive relative to the model's active parameter count.
Conclusion: Strong training recipes, especially agentic training on environment feedback, can markedly improve the coding ability of sparsely activated models, offering a new path toward efficient open coding agents.
Abstract: We present Qwen3-Coder-Next, an open-weight language model specialized for coding agents. Qwen3-Coder-Next is an 80-billion-parameter model that activates only 3 billion parameters during inference, enabling strong coding capability with efficient inference. In this work, we explore how far strong training recipes can push the capability limits of models with small parameter footprints. To achieve this, we perform agentic training through large-scale synthesis of verifiable coding tasks paired with executable environments, allowing learning directly from environment feedback via mid-training and reinforcement learning. Across agent-centric benchmarks including SWE-Bench and Terminal-Bench, Qwen3-Coder-Next achieves competitive performance relative to its active parameter count. We release both base and instruction-tuned open-weight versions to support research and real-world coding agent development.
[35] A Comprehensive Evaluation of LLM Unlearning Robustness under Multi-Turn Interaction
Ruihao Pan,Suhang Wang
Main category: cs.CL
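TL;DR: This paper studies the stability of machine unlearning in interactive settings, finds that static evaluation may overestimate its real-world effectiveness, and stresses the need to ensure stable forgetting under interaction.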
Details
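Motivation: Machine unlearning is increasingly important for large language models (LLMs) because of safety, privacy, and legal concerns, but existing work evaluates it mainly in static, single-turn settings; the robustness of forgetting under realistic interactive use remains underexplored.
Method: Examine two common interaction patterns, self-correction and dialogue-conditioned querying, to analyze whether unlearning stays stable in interactive environments, and compare unlearning methods of different strengths.
Result: Knowledge that appears forgotten under static evaluation can often be recovered through interaction; stronger unlearning improves apparent robustness but often produces behavioral rigidity rather than genuine knowledge erasure.
Conclusion: Static evaluation may overestimate the real-world effectiveness of machine unlearning; new methods and evaluation standards are needed to ensure stable forgetting in interactive settings.
Abstract: Machine unlearning aims to remove the influence of specific training data from pre-trained models without retraining from scratch, and is increasingly important for large language models (LLMs) due to safety, privacy, and legal concerns. Although prior work primarily evaluates unlearning in static, single-turn settings, forgetting robustness under realistic interactive use remains underexplored. In this paper, we study whether unlearning remains stable in interactive environments by examining two common interaction patterns: self-correction and dialogue-conditioned querying. We find that knowledge appearing forgotten in static evaluation can often be recovered through interaction. Although stronger unlearning improves apparent robustness, it often results in behavioral rigidity rather than genuine knowledge erasure. Our findings suggest that static evaluation may overestimate real-world effectiveness and highlight the need for ensuring stable forgetting under interactive settings.
[36] Constitutional Black-Box Monitoring for Scheming in LLM Agents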
Simon Storf,Rich Barton-Cooper,James Peters-Gill,Marius Hobbhahn
Main category: cs.CL
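TL;DR: This paper studies LLMs as black-box monitors for detecting scheming in autonomous agents, introduces two synthetic data pipelines (STRIDE and Gloom), and tests generalization on ControlArena; simple prompt sweeps already reach the performance ceiling, and further optimization leads to overfitting.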
Details
Motivation: Safe deployment of LLM agents requires reliable oversight, in particular the ability to detect scheming, where agents covertly pursue misaligned goals; existing approaches lack monitors that generalize to realistic environments.
Method: Build constitutional black-box monitors (prompted classifiers) that rely only on externally observable inputs and outputs; generate 1,000 synthetic trajectories each from two pipelines, STRIDE (iterative refinement) and Gloom (agent-environment simulation); optimize frontier LLM monitors via prompt sweeps, human refinement, and automated prompt optimization; and evaluate on 7,500 held-out ControlArena trajectories.
Result: Monitors selected purely on synthetic data generalize to the more realistic ControlArena environments and capture a meaningful scheming signal, but performance saturates quickly: simple prompt sweeps match more extensive optimization, and pushing further yields no gains and causes overfitting.
Conclusion: LLM monitors driven by synthetic data show potential for real-world generalization, but the marginal benefit of optimization is low; prompt engineering should favor simplicity and robustness over complex automated tuning.
Abstract: Safe deployment of Large Language Model (LLM) agents in autonomous settings requires reliable oversight mechanisms. A central challenge is detecting scheming, where agents covertly pursue misaligned goals. One approach to mitigating such risks is LLM-based monitoring: using language models to examine agent behaviors for suspicious actions. We study constitutional black-box monitors: prompted classifiers that detect scheming using only externally observable inputs and outputs, optimized on synthetic data generated from natural-language behavior specifications. We introduce two pipelines for generating synthetic agent trajectories, STRIDE (iterative refinement) and Gloom (agent-environment simulation), from which we generate 1,000 samples each. We optimize frontier LLM monitors on these datasets via prompt sweeps, human refinement, and automated prompt optimization, and evaluate performance on 7,500 held-out trajectories from ControlArena, a suite of grounded environments where agents operate in more realistic contexts. Our results demonstrate that monitors selected purely on synthetic data can generalize to more realistic environments, capturing a meaningful scheming signal. However, we find that performance saturates quickly in our setting, with simple prompt sweeps matching the results of more extensive optimization. Pushing beyond this limit yields no further improvements and instead leads to overfitting.
[37] Learning Nested Named Entity Recognition from Flat Annotations
Igor Rozhkov,Natalia Loukachevitch
Main category: cs.CL
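TL;DR: This paper investigates whether nested named entity recognition (nested NER) can be learned from flat NER annotations alone, proposing four approaches and validating them on the Russian benchmark NEREL.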
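One of the four approaches, string inclusions, is described only as substring matching. A minimal sketch of that idea, assuming flat annotations are available as `(text, label)` pairs: any annotated entity string that occurs inside a longer annotated entity is proposed as a nested candidate. The function name, the English toy entities, and the labels are hypothetical, and the authors' actual matching rules (tokenization, morphology, filtering) are not given here.

```python
def string_inclusions(flat_entities):
    """Propose nested entities by finding known entity strings inside longer ones.

    flat_entities: list of (text, label) pairs from flat annotation.
    Returns (outer_text, inner_text, inner_label, start_offset) tuples.
    """
    # Lexicon of every annotated entity string; the first label seen wins.
    lexicon = {}
    for text, label in flat_entities:
        lexicon.setdefault(text, label)

    nested = []
    for outer, _ in flat_entities:
        for inner, inner_label in lexicon.items():
            if inner == outer:
                continue
            # Plain substring search; token boundaries are ignored in this toy version.
            start = outer.find(inner)
            if start != -1:
                nested.append((outer, inner, inner_label, start))
    return nested

# Hypothetical flat annotations.
flat = [
    ("Moscow State University", "ORGANIZATION"),
    ("Moscow", "CITY"),
    ("Paris", "CITY"),
]
candidates = string_inclusions(flat)
```

Because matching is purely string-based, it can over-generate (a short name matching inside an unrelated word), which is one reason such pseudo-labels are typically combined with other signals.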
Details
Motivation: Nested NER requires expensive multi-level annotation, while flat NER corpora are abundant and nested resources are scarce, so learning nested structure from flat annotations alone has clear practical value.
Method: Propose and evaluate four approaches: string inclusions (substring matching), entity corruption (generating pseudo-nested data), flat neutralization (reducing the false-negative signal), and a hybrid fine-tuned + LLM pipeline.
Result: On the Russian nested NER benchmark NEREL, the best combined method reaches 26.37% inner F1, closing 40% of the gap to full nested supervision.
Conclusion: Nested structure can be learned to some extent from flat annotations alone, with the hybrid fine-tuned + LLM approach performing best, offering a viable way to reduce nested annotation cost.
Abstract: Nested named entity recognition identifies entities contained within other entities, but requires expensive multi-level annotation. While flat NER corpora exist abundantly, nested resources remain scarce. We investigate whether models can learn nested structure from flat annotations alone, evaluating four approaches: string inclusions (substring matching), entity corruption (pseudo-nested data), flat neutralization (reducing false negative signal), and a hybrid fine-tuned + LLM pipeline. On NEREL, a Russian benchmark with 29 entity types where 21% of entities are nested, our best combined method achieves 26.37% inner F1, closing 40% of the gap to full nested supervision. Code is available at https://github.com/fulstock/Learning-from-Flat-Annotations.
[38] MedGPT-oss: Training a General-Purpose Vision-Language Model for Biomedicine
Kai Zhang,Zhengqing Yuan,Cheng Peng,Songlin Zhao,Mengxian Lyu,Ziyi Chen,Yanfang Ye,Wei Liu,Ying Zhang,Kaleb E Smith,Lifang He,Lichao Sun,Yonghui Wu
Main category: cs.CL
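TL;DR: MEDGPT-OSS is an open-weight, 20B-parameter generalist vision-language model built for clinical AI research; a three-stage training curriculum gives it cross-modal reasoning ability while keeping it lightweight, and it supports on-premises deployment to meet privacy and compliance requirements.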
Details
Motivation: Top-performing biomedical multimodal assistants are either closed-source or computationally prohibitive, which rules out the on-premises deployment needed for patient privacy and PHI compliance.
Method: Pair the GPT-oss language backbone with an optimized visual front-end and train through a three-stage progressive domain-adaptation curriculum, including rigorous data curation and long-context multimodal alignment.
Result: The model outperforms larger open medical models on out-of-distribution (OOD) multimodal reasoning and complex text-only clinical tasks, and runs on commodity GPUs.
Conclusion: MEDGPT-OSS bridges the gap between performance and deployability in clinical AI research in a parameter-efficient, open, and reproducible way, providing a foundation for privacy-preserving, institution-level research.
Abstract: Biomedical multimodal assistants have the potential to unify radiology, pathology, and clinical-text reasoning, yet a critical deployment gap remains: top-performing systems are either closed-source or computationally prohibitive, precluding the on-premises deployment required for patient privacy and PHI compliance. We introduce MEDGPT-OSS, an open-weight, 20B-parameter generalist vision-language model designed to facilitate open research in clinical AI. Rather than relying on architectural complexity, MEDGPT-OSS pairs the GPT-oss language backbone with a visual front-end via an optimized, three-stage training curriculum. By progressively domain-adapting these modules through rigorous data curation and long-context multimodal alignment, we demonstrate that a 20B model can bridge the capacity gap. It successfully outperforms larger open medical models on out-of-distribution (OOD) multimodal reasoning and complex text-only clinical tasks. By unifying diverse modalities under a single instruction-following interface, MEDGPT-OSS maintains a parameter-efficient footprint fully compatible with commodity GPUs. We release the complete training recipe, open-weight checkpoints, and a rigorous evaluation harness to serve as a verifiable foundation for privacy-preserving, institution-specific clinical AI research.
[39] CHIMERA: Compact Synthetic Data for Generalizable LLM Reasoning
Xinyu Zhu,Yihao Feng,Yanchao Sun,Xianzhi Du,Pingzhi Li,Olli Saarikivi,Yun Zhu,Yu Meng
Main category: cs.CL
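TL;DR: This paper presents CHIMERA, a compact (9K-sample) but high-quality, cross-disciplinary, fully automatically constructed synthetic reasoning dataset that targets three data bottlenecks in training LLM reasoning: cold start, narrow domain coverage, and the difficulty of human annotation. A 4B Qwen3 model post-trained on it rivals substantially larger models on several hard reasoning benchmarks.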
Details
Motivation: Reproducing and extending LLM reasoning capabilities faces three data challenges: a lack of seed data with long Chain-of-Thought (CoT) trajectories (cold start), insufficient coverage of scientific fields beyond mathematics (limited domain coverage), and human annotation of frontier-level problems that is prohibitively expensive or infeasible (annotation bottleneck).
Method: Construct the CHIMERA dataset: (1) use state-of-the-art reasoning models to generate rich, long CoT trajectories; (2) cover 8 major scientific disciplines and over 1,000 fine-grained topics, organized in a model-generated hierarchical taxonomy; (3) apply a fully automated evaluation pipeline, driven by strong reasoning models, that cross-validates problem validity and answer correctness. The dataset is then used to post-train a 4B Qwen3 model.
Result: The 4B Qwen3 model post-trained on CHIMERA performs strongly on hard reasoning benchmarks including GPQA-Diamond, AIME 24/25/26, HMMT 25, and Humanity's Last Exam, approaching or matching much larger models such as DeepSeek-R1 and Qwen3-235B.
Conclusion: A small, carefully built, fully automated synthetic dataset can overcome these data bottlenecks and markedly improve the cross-domain reasoning of small and mid-sized models, offering a new paradigm for scalable, open training of reasoning models.
Abstract: Large Language Models (LLMs) have recently exhibited remarkable reasoning capabilities, largely enabled by supervised fine-tuning (SFT)- and reinforcement learning (RL)-based post-training on high-quality reasoning data. However, reproducing and extending these capabilities in open and scalable settings is hindered by three fundamental data-centric challenges: (1) the cold-start problem, arising from the lack of seed datasets with detailed, long Chain-of-Thought (CoT) trajectories needed to initialize reasoning policies; (2) limited domain coverage, as most existing open-source reasoning datasets are concentrated in mathematics, with limited coverage of broader scientific disciplines; and (3) the annotation bottleneck, where the difficulty of frontier-level reasoning tasks makes reliable human annotation prohibitively expensive or infeasible. To address these challenges, we introduce CHIMERA, a compact synthetic reasoning dataset comprising 9K samples for generalizable cross-domain reasoning. CHIMERA is constructed with three key properties: (1) it provides rich, long CoT reasoning trajectories synthesized by state-of-the-art reasoning models; (2) it has broad and structured coverage, spanning 8 major scientific disciplines and over 1K fine-grained topics organized via a model-generated hierarchical taxonomy; and (3) it employs a fully automated, scalable evaluation pipeline that uses strong reasoning models to cross-validate both problem validity and answer correctness. We use CHIMERA to post-train a 4B Qwen3 model. Despite the dataset's modest size, the resulting model achieves strong performance on a suite of challenging reasoning benchmarks, including GPQA-Diamond, AIME 24/25/26, HMMT 25, and Humanity's Last Exam, approaching or matching the reasoning performance of substantially larger models such as DeepSeek-R1 and Qwen3-235B.
[40] KVSlimmer: Theoretical Insights and Practical Optimizations for Asymmetric KV Merging
Lianjun Liu,Hongli An,Weiqi Yan,Xin Du,Shengchuan Zhang,Huazhong Liu,Yunshan Zhong
Main category: cs.CL
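TL;DR: This paper proposes KVSlimmer, which analyzes the differing spectral energy distributions of Query/Key and Value weights in the KV cache, builds a mathematically exact Hessian formulation, and performs efficient gradient-free KV compression, substantially reducing memory and latency and improving long-context inference across models and benchmarks.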
Details
Motivation: Existing KV cache compression methods lack a theoretical foundation and rely on empirical observations and approximate Hessian estimates, leading to suboptimal compression and high inference overhead.
Method: Build a theoretical framework based on the spectral energy distribution of projection weights, revealing the asymmetry between Query/Key (concentrated spectra, hence homogeneous features) and Value (dispersed spectra, hence heterogeneous features); propose KVSlimmer, which derives a closed-form Hessian solution from forward-pass variables alone, giving gradient-free, memory- and time-efficient KV merging.
Result: On Llama3.1-8B-Instruct, the LongBench average score improves by 0.92 while memory drops by 29% and latency by 28%; KVSlimmer consistently outperforms SOTA methods across models and benchmarks.
Conclusion: KVSlimmer gives KV cache compression a solution that is theoretically grounded, computationally efficient, and gradient-free, markedly improving the practicality of long-context LLM inference.
Abstract: The growing computational and memory demands of the Key-Value (KV) cache significantly limit the ability of Large Language Models (LLMs). While KV merging has emerged as a promising solution, existing methods that rely on empirical observations of KV asymmetry and gradient-based Hessian approximations lack a theoretical foundation and incur suboptimal compression and inference overhead. To bridge these gaps, we establish a theoretical framework that characterizes this asymmetry through the spectral energy distribution of projection weights, demonstrating that concentrated spectra in Query/Key weights induce feature homogeneity, whereas dispersed spectra in Value weights preserve heterogeneity. Then, we introduce KVSlimmer, an efficient algorithm that captures exact Hessian information through a mathematically exact formulation, and derives a closed-form solution utilizing only forward-pass variables, resulting in a gradient-free approach that is both memory- and time-efficient. Extensive experiments across various models and benchmarks demonstrate that KVSlimmer consistently outperforms SOTA methods. For instance, on Llama3.1-8B-Instruct, it improves the LongBench average score by 0.92 while reducing memory costs and latency by 29% and 28%, respectively.
[41] Prompt Sensitivity and Answer Consistency of Small Open-Source Large Language Models on Clinical Question Answering: Implications for Low-Resource Healthcare Deployment
Shravani Hariprasad
Main category: cs.CL
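TL;DR: This paper evaluates five small open-source language models on three clinical QA datasets, examining how prompt style affects consistency, accuracy, and instruction following. High consistency does not imply high accuracy, roleplay prompts markedly reduce accuracy, and Llama 3.2 offers the best balance of accuracy and reliability.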
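To make the distinction between consistency and accuracy concrete, here is a small sketch under an assumed definition: consistency is the share of prompt variants that agree with the most common answer for a question, averaged over questions. The paper's exact consistency score is not given in this summary, so this definition, the function names, and the toy answers are all illustrative assumptions.

```python
from collections import Counter

def consistency(answers_per_question):
    """Mean share of prompt variants agreeing with each question's modal answer."""
    shares = []
    for answers in answers_per_question:
        modal_count = Counter(answers).most_common(1)[0][1]
        shares.append(modal_count / len(answers))
    return sum(shares) / len(shares)

def accuracy(answers_per_question, gold):
    """Share of all (question, prompt variant) answers that match the gold label."""
    total = sum(len(answers) for answers in answers_per_question)
    correct = sum(
        answer == g
        for answers, g in zip(answers_per_question, gold)
        for answer in answers
    )
    return correct / total

# Hypothetical outputs for two questions under five prompt styles each.
answers = [
    ["B", "B", "B", "B", "B"],  # perfectly consistent, but the gold answer is "A"
    ["C", "C", "C", "C", "A"],  # mostly consistent and mostly correct
]
gold = ["A", "C"]

consistency_score = consistency(answers)
accuracy_score = accuracy(answers, gold)
```

In this toy case the model is highly consistent yet answers fewer than half of the prompts correctly, which is the "reliably wrong" failure mode the entry warns about.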
Details
Motivation: Small open-source language models are attracting attention for low-resource healthcare settings, but their reliability under different prompt phrasings is unclear, calling for a systematic evaluation of consistency, accuracy, and instruction following.
Method: On local consumer CPU hardware, run zero-shot inference with five models (Gemma 2 2B, Phi-3 Mini 3.8B, Llama 3.2 3B, Mistral 7B, and Meditron-7B) on three clinical QA datasets (MedQA, MedMCQA, PubMedQA) using five prompt styles (original, formal, simplified, roleplay, direct), measuring consistency scores, accuracy, and instruction-following failure rates.
Result: Gemma 2 has the highest consistency (0.845–0.888) but the lowest accuracy (33.0–43.5%); Llama 3.2 has moderate consistency (0.774–0.807) but the highest accuracy (49.0–65.0%); roleplay prompts reduce accuracy across the board (for example, Phi-3 Mini drops 21.5 percentage points on MedQA); Meditron-7B has a 99.0% instruction-following failure rate on PubMedQA.
Conclusion: High consistency does not imply correctness: a model can be reliably wrong, which is dangerous for clinical AI. Roleplay prompts should be avoided in healthcare applications, Llama 3.2 is the best fit for low-resource deployment, and safe clinical AI requires joint evaluation of consistency, accuracy, and instruction adherence.
Abstract: Small open-source language models are gaining attention for low-resource healthcare settings, but their reliability under different prompt phrasings remains poorly understood. We evaluated five open-source models (Gemma 2 2B, Phi-3 Mini 3.8B, Llama 3.2 3B, Mistral 7B, and Meditron-7B, a domain-pretrained model without instruction tuning) across three clinical QA datasets (MedQA, MedMCQA, PubMedQA) using five prompt styles (original, formal, simplified, roleplay, direct). We measured consistency scores, accuracy, and instruction-following failure rates. All inference ran locally on consumer CPU hardware without fine-tuning. Consistency and accuracy were largely independent. Gemma 2 achieved the highest consistency (0.845-0.888) but lowest accuracy (33.0-43.5%), while Llama 3.2 showed moderate consistency (0.774-0.807) with the highest accuracy (49.0-65.0%). Roleplay prompts consistently reduced accuracy across all models, with Phi-3 Mini dropping 21.5 percentage points on MedQA. Meditron-7B exhibited near-complete instruction-following failure on PubMedQA (99.0% UNKNOWN rate), showing domain pretraining alone is insufficient for structured clinical QA. High consistency does not imply correctness. Models can be reliably wrong, a dangerous failure mode in clinical AI. Roleplay prompts should be avoided in healthcare applications. Llama 3.2 showed the strongest balance of accuracy and reliability for low-resource deployment. Safe clinical AI requires joint evaluation of consistency, accuracy, and instruction adherence.
[42] Hybrid Neural-LLM Pipeline for Morphological Glossing in Endangered Language Documentation: A Case Study of Jungar Tuvan
Siyu Liang,Talant Mawkanuli,Gina-Anne Levow
Main category: cs.CL
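TL;DR: This paper proposes a hybrid automatic morphological glossing pipeline that combines neural sequence labeling with LLM post-correction, substantially reducing annotation workload for low-resource, morphologically rich languages such as Jungar Tuvan, and distills design principles for lightweight hybrid modeling in endangered language documentation.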
Details
Motivation: Creating interlinear glossed text (IGT) is a bottleneck in linguistic documentation and fieldwork for low-resource, morphologically rich languages.
Method: Build a two-stage hybrid pipeline: a BiLSTM-CRF model performs sequence labeling, then a retrieval-augmented LLM post-corrects the output; ablation studies systematically assess the effect of retrieval-augmented prompting, morpheme dictionaries, and the number of few-shot examples.
Result: Retrieval-augmented prompting clearly outperforms random example selection; morpheme dictionaries hurt performance in most cases; performance grows roughly logarithmically with the number of few-shot examples; the two-stage pipeline substantially reduces manual annotation effort.
Conclusion: Hybrid architectures that combine structured prediction models with LLM reasoning are a computationally light, practical route to automatic annotation for endangered languages, and the study distills concrete design principles for them.
Abstract: Interlinear glossed text (IGT) creation remains a major bottleneck in linguistic documentation and fieldwork, particularly for low-resource morphologically rich languages. We present a hybrid automatic glossing pipeline that combines neural sequence labeling with large language model (LLM) post-correction, evaluated on Jungar Tuvan, a low-resource Turkic language. Through systematic ablation studies, we show that retrieval-augmented prompting provides substantial gains over random example selection. We further find that morpheme dictionaries paradoxically hurt performance compared to providing no dictionary at all in most cases, and that performance scales approximately logarithmically with the number of few-shot examples. Most significantly, our two-stage pipeline combining a BiLSTM-CRF model with LLM post-correction yields substantial gains for most models, achieving meaningful reductions in annotation workload. Drawing on these findings, we establish concrete design principles for integrating structured prediction models with LLM reasoning in morphologically complex fieldwork contexts. These principles demonstrate that hybrid architectures offer a promising direction for computationally light solutions to automatic linguistic annotation in endangered language documentation.
[43] Conformal Prediction for Risk-Controlled Medical Entity Extraction Across Clinical Domains
Manil Shrestha,Edward Kim
Main category: cs.CL
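TL;DR: This paper proposes a conformal prediction framework for calibrating the confidence of LLMs in medical entity extraction, guaranteeing ≥90% finite-sample coverage in two clinical settings (FDA drug labels and MIMIC-CXR radiology reports), and shows that the direction of miscalibration (overconfident or underconfident) varies with text structure, extraction category, and model architecture.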
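The finite-sample guarantee mentioned above comes from the standard split-conformal recipe: compute nonconformity scores on a held-out calibration set and take a slightly inflated empirical quantile as the threshold. The sketch below shows that generic recipe only; the function name, the score definition (one minus the model's confidence in the correct entity), and the calibration values are assumptions for illustration, not the paper's procedure or data.

```python
import math

def conformal_quantile(scores, alpha):
    """Split-conformal threshold: the ceil((n + 1) * (1 - alpha))-th smallest score.

    Returns infinity when the calibration set is too small for the requested level.
    """
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-based rank into the sorted scores
    if rank > n:
        return float("inf")
    return sorted(scores)[rank - 1]

# Hypothetical calibration scores: 1 - confidence assigned to the correct entity.
calibration_scores = [
    0.02, 0.05, 0.01, 0.10, 0.30, 0.04, 0.08, 0.03, 0.07, 0.06,
    0.09, 0.12, 0.15, 0.02, 0.05, 0.20, 0.11, 0.04, 0.06,
]

# Target 90% coverage (alpha = 0.10).
q_hat = conformal_quantile(calibration_scores, alpha=0.10)

def accept(confidence, threshold=q_hat):
    """Keep an extraction whose nonconformity score is within the threshold."""
    return (1.0 - confidence) <= threshold
```

Under exchangeability of calibration and test examples, a fresh example's score falls at or below `q_hat` with probability at least 1 - alpha. A threshold calibrated on one domain need not transfer to another, which is consistent with the domain-specific calibration the entry argues for.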
Details
Motivation: LLMs are increasingly used for medical entity extraction, but their confidence scores are often miscalibrated, which hinders safe deployment in clinical settings.
Method: Apply a conformal prediction framework to calibrate LLMs such as GPT-4.1 and Llama-4-Maverick in two clinical domains (structured extraction from FDA drug labels and entity extraction from MIMIC-CXR radiology reports), using FactScore-based atomic statement evaluation and physician annotations as references, and adjusting the conformal threshold τ to reach the target coverage.
Result: Models are underconfident on FDA labels (τ ≈ 0.06) and overconfident on radiology reports (τ up to 0.99); conformal prediction reaches ≥90% coverage in both settings with rejection rates of only 9–13%; calibration behavior depends strongly on document structure, extraction category, and model architecture.
Conclusion: Confidence calibration of LLMs is not a global model property but is domain-dependent; safe clinical deployment requires domain-specific conformal calibration.
Abstract: Large Language Models (LLMs) are increasingly used for medical entity extraction, yet their confidence scores are often miscalibrated, limiting safe deployment in clinical settings. We present a conformal prediction framework that provides finite-sample coverage guarantees for LLM-based extraction across two clinical domains. First, we extract structured entities from 1,000 FDA drug labels across eight sections using GPT-4.1, verified via FactScore-based atomic statement evaluation (97.7% accuracy over 128,906 entities). Second, we extract radiological entities from MIMIC-CXR reports using the RadGraph schema with GPT-4.1 and Llama-4-Maverick, evaluated against physician annotations (entity F1: 0.81 to 0.84). Our central finding is that miscalibration direction reverses across domains: on well-structured FDA labels, models are underconfident, requiring modest conformal thresholds (τ ≈ 0.06), while on free-text radiology reports, models are overconfident, demanding strict thresholds (τ up to 0.99). Despite this heterogeneity, conformal prediction achieves target coverage (≥ 90%) in both settings with manageable rejection rates (9–13%). These results demonstrate that calibration is not a global model property but depends on document structure, extraction category, and model architecture, motivating domain-specific conformal calibration for safe clinical deployment.
[44] The Aftermath of DrawEduMath: Vision Language Models Underperform with Struggling Students and Misdiagnose Errors
Li Lucy,Albert Zhang,Nathan Anderson,Ryan Knight,Kyle Lo
Main category: cs.CL
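TL;DR: This paper evaluates 11 vision-language models (VLMs) on DrawEduMath, a dataset of real students' handwritten math responses, and finds that all of them fall markedly short at identifying and analyzing student errors, especially for students who need more pedagogical support, indicating a mismatch between current VLM optimization targets and educational needs.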
Details
Motivation: For AI to support mathematics education, it must accurately identify and respond to student errors, yet existing VLMs are mostly optimized for problem solving, with little evaluation of or incentive for pedagogical feedback.
Method: A year-long, systematic evaluation of 11 mainstream VLMs on the DrawEduMath QA benchmark of real students' handwritten responses, focusing on how well they describe student work and, in particular, identify and analyze errors, with results broken down by student proficiency.
Result: All VLMs perform worse on work from students who need more pedagogical help; across all QA, accuracy is lowest on questions about assessing student error.
Conclusion: Current VLMs may be good at solving problems but are weak at a core educational task, diagnosing student errors; new evaluation standards and training incentives oriented toward pedagogical support are needed.
Abstract: Effective mathematics education requires identifying and responding to students' mistakes. For AI to support pedagogical applications, models must perform well across different levels of student proficiency. Our work provides an extensive, year-long snapshot of how 11 vision-language models (VLMs) perform on DrawEduMath, a QA benchmark involving real students' handwritten, hand-drawn responses to math problems. We find that models' weaknesses concentrate on a core component of math education: student error. All evaluated VLMs underperform when describing work from students who require more pedagogical help, and across all QA, they struggle the most on questions related to assessing student error. Thus, while VLMs may be optimized to be math problem solving experts, our results suggest that they require alternative development incentives to adequately support educational use cases.
[45] Towards Orthographically-Informed Evaluation of Speech Recognition Systems for Indian Languages
Kaushal Santosh Bhogale,Tahir Javed,Greeshma Susan John,Dhruv Rathi,Akshayasree Padmanaban,Niharika Parasa,Mitesh M. Khapra
Main category: cs.CL
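TL;DR: This paper proposes the OIWER evaluation framework, which uses LLMs to capture permissible orthographic variants in Indian languages, markedly improving the accuracy of ASR evaluation and its agreement with human perception.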
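The effect of accepting permissible spellings can be shown with a word-level edit distance in which a reference word also matches any of its listed variants. This is only a sketch of the general idea: the `wer` function, the hand-written variant dictionary, and the English example are assumptions for illustration, whereas the framework described here obtains variants for Indian languages with LLM assistance and may score them differently.

```python
def wer(reference, hypothesis, variants=None):
    """Word error rate; a reference word also matches any of its permitted variants."""
    variants = variants or {}
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    def same(ref_word, hyp_word):
        return ref_word == hyp_word or hyp_word in variants.get(ref_word, ())

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if same(ref_word, hyp_word) else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)

# Hypothetical example with two acceptable spelling variants.
reference = "the colour of the organisation"
hypothesis = "the color of the organization"
allowed = {"colour": {"color"}, "organisation": {"organization"}}

strict_wer = wer(reference, hypothesis)
variant_aware_wer = wer(reference, hypothesis, allowed)
```

Strict matching counts both spelling differences as substitutions, while the variant-aware score treats them as correct, mirroring the claim that plain WER paints a bleaker picture than listeners perceive.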
Details
Motivation: Traditional Word Error Rate (WER) is overly pessimistic for Indian-language ASR because it cannot handle spelling variation, flexible suffix splitting, or non-standard spellings of code-mixed words, and existing methods struggle to cover acceptable orthographic variants in under-resourced languages.
Method: Propose OIWER, an LLM-based evaluation framework that builds benchmarks reflecting permissible orthographic variants in Indian languages, validated through extensive experiments.
Result: OIWER lowers error rates by 6.3 points on average, narrows model gaps (for example, the Gemini-Canary gap drops from 18.1 to 11.5 points), and aligns with human perception 4.9 points better than WER-SN.
Conclusion: OIWER evaluates Indian-language ASR systems more accurately and in closer agreement with human perception, offering a new approach to speech recognition evaluation for low-resource languages.
Abstract: Evaluating ASR systems for Indian languages is challenging due to spelling variations, suffix splitting flexibility, and non-standard spellings in code-mixed words. Traditional Word Error Rate (WER) often presents a bleaker picture of system performance than what human users perceive. Better aligning evaluation with real-world performance requires capturing permissible orthographic variations, which is extremely challenging for under-resourced Indian languages. Leveraging recent advances in LLMs, we propose a framework for creating benchmarks that capture permissible variations. Through extensive experiments, we demonstrate that OIWER, by accounting for orthographic variations, reduces pessimistic error rates (an average improvement of 6.3 points), narrows inflated model gaps (e.g., Gemini-Canary performance difference drops from 18.1 to 11.5 points), and aligns more closely with human perception than prior methods like WER-SN by 4.9 points.
[46] S-VoCAL: A Dataset and Evaluation Framework for Inferring Speaking Voice Character Attributes in Literature
Abigail Berthe-Pardo,Gaspard Michel,Elena V. Epure,Christophe Cerisara
Main category: cs.CL
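TL;DR: This paper presents the S-VoCAL dataset and evaluation framework for assessing the inference of voice-related attributes of fictional characters (such as age, gender, and accent) from literary works, and demonstrates it with a RAG pipeline.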
Details
Motivation: TTS systems have made progress in naturalness but still fall well short at assigning suitable voices (age, gender, accent, health condition, and so on) to characters in fiction, and there is no benchmark dataset for evaluating the inference of such attributes.
Method: Build S-VoCAL, the first benchmark for inferring voice attributes of literary characters, with 8 sociophonetically grounded attributes and 952 character-book pairs; design an evaluation framework adapted to each attribute, including a new similarity metric based on LLM embeddings; and run attribute-inference experiments with a Retrieval-Augmented Generation (RAG) pipeline.
Result: The RAG pipeline infers attributes such as Age and Gender fairly reliably but performs poorly on attributes such as Origin and Physical Health; S-VoCAL enables reproducible, fine-grained evaluation.
Conclusion: S-VoCAL fills the gap left by the lack of standardized evaluation resources for automatic inference of fictional characters' voice attributes, providing a new benchmark and tooling for character-aware TTS narration.
Abstract: With recent advances in Text-to-Speech (TTS) systems, synthetic audiobook narration has seen increased interest, reaching unprecedented levels of naturalness. However, larger gaps remain in synthetic narration systems' ability to impersonate fictional characters, and convey complex emotions or prosody. A promising direction to enhance character identification is the assignment of plausible voices to each fictional character in a book. This step typically requires complex inference of attributes in book-length contexts, such as a character's age, gender, origin or physical health, which in turn requires dedicated benchmark datasets to evaluate extraction systems' performance. We present S-VoCAL (Speaking Voice Character Attributes in Literature), the first dataset and evaluation framework dedicated to evaluating the inference of voice-related fictional character attributes. S-VoCAL comprises 8 attributes grounded in sociophonetic studies, and 952 character-book pairs derived from Project Gutenberg. Its evaluation framework addresses the particularities of each attribute, and includes a novel similarity metric based on recent Large Language Model embeddings. We demonstrate the applicability of S-VoCAL by applying a simple Retrieval-Augmented Generation (RAG) pipeline to the task of inferring character attributes. Our results suggest that the RAG pipeline reliably infers attributes such as Age or Gender, but struggles on others such as Origin or Physical Health. The dataset and evaluation code are available at https://github.com/AbigailBerthe/S-VoCAL.
[47] Qayyem: A Real-time Platform for Scoring Proficiency of Arabic Essays
Hoor Elbahnasawi,Marwan Sayed,Sohaila Eltanbouly,Fatima Brahamia,Tamer Elsayed
Main category: cs.CL
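TL;DR: This paper introduces Qayyem, a Web platform for Arabic automated essay scoring (AES), aimed at a field held back by linguistic complexity and the scarcity of annotated data.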
Details
Motivation: Support for Arabic AES is limited, mainly because of the language's complexity and the lack of large, publicly available annotated datasets.
Method: Design and implement Qayyem, a Web platform that integrates assignment creation, batch essay upload, scoring configuration, and per-trait essay evaluation, and that deploys several state-of-the-art Arabic essay scoring models.
Result: Qayyem abstracts away the technical complexity of interacting with scoring server APIs, letting instructors use advanced scoring services through a user-friendly interface.
Conclusion: Qayyem offers a scalable, easy-to-use, well-integrated solution for Arabic AES that can support both teaching practice and research in the field.
Abstract: Over the past years, Automated Essay Scoring (AES) systems have gained increasing attention as scalable and consistent solutions for assessing the proficiency of student writing. Despite recent progress, support for Arabic AES remains limited due to linguistic complexity and scarcity of large publicly-available annotated datasets. In this work, we present Qayyem, a Web-based platform designed to support Arabic AES by providing an integrated workflow for assignment creation, batch essay upload, scoring configuration, and per-trait essay evaluation. Qayyem abstracts the technical complexity of interacting with scoring server APIs, allowing instructors to access advanced scoring services through a user-friendly interface. The platform deploys a number of state-of-the-art Arabic essay scoring models with different effectiveness and efficiency figures.
[48] Thoth: Mid-Training Bridges LLMs to Time Series Understanding
Jiafeng Lin,Yuxuan Wang,Jialong Wu,Huakun Luo,Zhongyi Pei,Jianmin Wang
Main category: cs.CL
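TL;DR: This paper proposes Thoth, the first family of mid-trained LLMs with general-purpose time series understanding. It builds the Book-of-Thoth corpus to align time series with natural language and introduces the KnoTS benchmark for knowledge-intensive time series reasoning; Thoth significantly outperforms baselines on several time series QA tasks and fine-tunes especially well when data is scarce.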
Details
Motivation: LLMs excel at general-purpose reasoning but struggle to understand and reason about time series data, limiting their use in decision-making scenarios that depend on temporal dynamics.
Method: Propose a mid-training paradigm and build Book-of-Thoth, a high-quality, time-series-centric mid-training corpus that supports both time-series-to-text and text-to-time-series generation; design KnoTS, a benchmark for knowledge-intensive time series understanding; and train the Thoth model family on this basis.
Result: Thoth significantly outperforms its base model and advanced LLMs on multiple time series question answering benchmarks and retains superior performance when fine-tuned under data scarcity, confirming the value of mid-training for time series understanding.
Conclusion: Mid-training is an effective way to improve LLMs' time series understanding; Thoth offers a new paradigm for general-purpose time series understanding, while Book-of-Thoth and KnoTS advance data and evaluation, respectively.
Abstract: Large Language Models (LLMs) have demonstrated remarkable success in general-purpose reasoning. However, they still struggle to understand and reason about time series data, which limits their effectiveness in decision-making scenarios that depend on temporal dynamics. In this paper, we propose Thoth, the first family of mid-trained LLMs with general-purpose time series understanding capabilities. As a pivotal intermediate stage, mid-training achieves task- and domain-agnostic alignment between time series and natural language, for which we construct Book-of-Thoth, a high-quality, time-series-centric mid-training corpus. Book-of-Thoth enables both time-series-to-text and text-to-time-series generation, equipping LLMs with a foundational grasp of temporal patterns. To better evaluate advanced reasoning capabilities, we further present KnoTS, a novel benchmark of knowledge-intensive time series understanding, designed for joint reasoning over temporal patterns and domain knowledge. Extensive experiments demonstrate that mid-training with Book-of-Thoth enables Thoth to significantly outperform its base model and advanced LLMs across a range of time series question answering benchmarks. Moreover, Thoth exhibits superior capabilities when fine-tuned under data scarcity, underscoring the effectiveness of mid-training for time series understanding. Code is available at: https://github.com/thuml/Thoth.
[49] GroupGPT: A Token-efficient and Privacy-preserving Agentic Framework for Multi-User Chat Assistant
Zhuokang Shen,Yifan Wang,Hanyu Chen,Wenxuan Huang,Shaohui Lin
Main category: cs.CL
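TL;DR: This paper proposes GroupGPT, an efficient, privacy-preserving agentic framework for multi-user group chats that uses a small-large model collaborative architecture to decouple intervention timing from response generation, adds multimodal support, and introduces a new benchmark, MUIR, for evaluation.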
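The decoupling described above can be pictured as a cheap gate in front of an expensive generator, with sanitization applied before any text leaves the device. The sketch below is purely illustrative: the keyword gate stands in for the small model, `fake_large_model` stands in for a cloud LLM call, and the single e-mail regex stands in for privacy sanitization. None of these names or rules come from the GroupGPT codebase.

```python
import re

# Toy sanitization rule: redact anything that looks like an e-mail address.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def sanitize(message):
    """Redact e-mail addresses before a message is sent to a remote model."""
    return EMAIL.sub("[EMAIL]", message)

def should_intervene(message):
    """Stand-in for the small model: intervene only when the assistant is addressed."""
    return "@assistant" in message.lower()

def handle(message, large_model):
    """Call the expensive large model only when the cheap gate fires."""
    if not should_intervene(message):
        return None  # stay silent and spend no large-model tokens
    return large_model(sanitize(message))

def fake_large_model(prompt):
    """Placeholder for a remote LLM call; simply echoes the prompt."""
    return f"(reply to: {prompt})"

silent = handle("lunch at noon?", fake_large_model)
reply = handle("@assistant forward this to bob@example.com", fake_large_model)
```

Since most group messages need no reply, skipping the large model for them is where the token savings in such a design would come from; a learned gate would replace the keyword check.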
Details
Motivation: Existing LLM chat systems mostly target single users and adapt poorly to multi-user group chats, which require proactive, accurate intervention under complex, evolving contexts; relying on an LLM for both reasoning and generation also leads to high token consumption, poor scalability, and privacy risks.
Method: Propose the GroupGPT framework: a small-large model collaborative architecture separates the intervention decision (handled by the small model) from response generation (handled by the large model); the framework supports multimodal inputs such as text, images, videos, and voice; and the MUIR benchmark (2,500 group chat segments annotated with intervention labels and rationales) is built to evaluate intervention timing and response quality.
Result: GroupGPT scores 4.72/5.0 in LLM-based evaluation on MUIR, reduces token usage by up to 3 times compared with baselines, sanitizes messages for privacy before cloud transmission, and is well received by users across a range of group chat scenarios.
Conclusion: GroupGPT is an efficient, scalable, privacy-conscious assistant framework for multi-user group chats; its decoupled architecture and dedicated benchmark move this direction toward practical use.
Abstract: Recent advances in large language models (LLMs) have enabled increasingly capable chatbots. However, most existing systems focus on single-user settings and do not generalize well to multi-user group chats, where agents require more proactive and accurate intervention under complex, evolving contexts. Existing approaches typically rely on LLMs for both reasoning and generation, leading to high token consumption, limited scalability, and potential privacy risks. To address these challenges, we propose GroupGPT, a token-efficient and privacy-preserving agentic framework for multi-user chat assistant. GroupGPT adopts a small-large model collaborative architecture to decouple intervention timing from response generation, enabling efficient and accurate decision-making. The framework also supports multimodal inputs, including memes, images, videos, and voice messages. We further introduce MUIR, a benchmark dataset for multi-user chat assistant intervention reasoning. MUIR contains 2,500 annotated group chat segments with intervention labels and rationales, supporting evaluation of timing accuracy and response quality. We evaluate a range of models on MUIR, from large language models to smaller counterparts. Extensive experiments demonstrate that GroupGPT produces accurate and well-timed responses, achieving an average score of 4.72/5.0 in LLM-based evaluation, and is well received by users across diverse group chat scenarios. Moreover, GroupGPT reduces token usage by up to 3 times compared to baseline methods, while providing privacy sanitization of user messages before cloud transmission. Code is available at: https://github.com/Eliot-Shen/GroupGPT.
[50] How RL Unlocks the Aha Moment in Geometric Interleaved Reasoning
Xiangxiang Zhang,Caijun Jia,Siyuan Li,Dingyu He,Xiya Xiong,Zheng Sun,Honghao He,Yuchen Wu,Bihui Yu,Linzhuang Sun,Cheng Tan,Jingxuan Wei
Main category: cs.CL
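TL;DR: This paper proposes Faire, a framework that uses reinforcement learning to address the performance degradation that supervised fine-tuning causes in multimodal large language models on geometric reasoning, achieving functional alignment between plotting and reasoning.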
Details
Motivation: On interleaved plot-and-text geometric reasoning data, existing supervised fine-tuning (SFT) learns only surface format alignment and fails to model the causal dependency between plotting and reasoning, so performance falls below text-only baselines.
Method: Proposes Faire (Functional alignment for interleaved reasoning), a reinforcement learning framework that introduces three causal constraints to move the model from surface imitation to functional alignment, so that plotting genuinely serves the reasoning process.
Result: Faire achieves competitive performance on several challenging geometric reasoning benchmarks and induces a qualitative shift in model behavior: plotting is effectively internalized as a reasoning tool.
Conclusion: SFT, which only aligns distributions, is insufficient for interleaved geometric reasoning; functional alignment (rather than format alignment) is the key path to improving multimodal reasoning.
Abstract: Solving complex geometric problems inherently requires interleaved reasoning: a tight alternation between constructing diagrams and performing logical deductions. Although recent Multimodal Large Language Models (MLLMs) have demonstrated strong capabilities in visual generation and plotting, we identify a counter-intuitive and underexplored phenomenon. Naively applying Supervised Fine-Tuning (SFT) on interleaved plot-solution data leads to a substantial degradation in reasoning performance compared to text-only baselines. We argue that this failure stems from a fundamental limitation of SFT, which primarily induces distributional alignment: the model learns to reproduce the surface format of interleaved plotting but fails to internalize the causal dependency between the generated plot and reasoning steps. To overcome this limitation, we propose Faire (Functional alignment for interleaved reasoning), a reinforcement learning framework that enforces three causal constraints to move beyond superficial imitation toward functional alignment. Extensive experiments show that Faire induces a qualitative shift in model behavior in which the plotting is effectively internalized, yielding competitive performance on challenging geometric reasoning benchmarks.
[51] CARD: Towards Conditional Design of Multi-agent Topological Structures
Tongtong Wu,Yanming Li,Ziye Tang,Chen Jiang,Linhao Luo,Guilin Qi,Shirui Pan,Gholamreza Haffari
Main category: cs.CL
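TL;DR: This paper proposes CARD, a framework that uses conditional graph generation and environment-aware optimization to adapt multi-agent communication topologies dynamically, significantly improving the accuracy and robustness of LLM multi-agent systems on tasks such as code generation and mathematical reasoning.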
Details
Motivation: Communication topologies in existing LLM multi-agent systems are usually fixed or statically learned, so they cannot cope with real-world dynamics such as model upgrades, API or tool changes, and shifting knowledge sources, which limits robustness.
Method: Proposes CARD (Conditional Agentic Graph Designer), a framework built on the AMACP protocol that uses a conditional variational graph encoder and environment-aware optimization to adjust the communication graph according to dynamic environmental signals at both training time and runtime.
Result: On the HumanEval, MATH, and MMLU benchmarks, CARD consistently outperforms static and prompt-based baselines, with higher accuracy and greater robustness to changes in model capability or resources.
Conclusion: Dynamic, conditional design of communication topologies is a key path to improving the adaptability and robustness of LLM multi-agent systems, and CARD offers an extensible methodological framework for this direction.
Abstract: Large language model (LLM)-based multi-agent systems have shown strong capabilities in tasks such as code generation and collaborative reasoning. However, the effectiveness and robustness of these systems critically depend on their communication topology, which is often fixed or statically learned, ignoring real-world dynamics such as model upgrades, API (or tool) changes, or knowledge source variability. To address this limitation, we propose CARD (Conditional Agentic Graph Designer), a conditional graph-generation framework that instantiates AMACP, a protocol for adaptive multi-agent communication. CARD explicitly incorporates dynamic environmental signals into graph construction, enabling topology adaptation at both training and runtime. Through a conditional variational graph encoder and environment-aware optimization, CARD produces communication structures that are both effective and resilient to shifts in model capability or resource availability. Empirical results on HumanEval, MATH, and MMLU demonstrate that CARD consistently outperforms static and prompt-based baselines, achieving higher accuracy and robustness across diverse conditions. The source code is available at: https://github.com/Warma10032/CARD.
[52] DEP: A Decentralized Large Language Model Evaluation Protocol
Jianxiang Peng,Junhao Li,Hongxiang Wang,Haocheng Lyu,Hui Guo,Siyi Hao,Zhen Wang,Chuang Liu,Shaowei Zhang,Bojian Xiong,Yue Chen,Zhuowen Han,Ling Shi,Tianyu Dong,Juesi Xiao,Lei Yang,Yuqi Ren,Deyi Xiong
Main category: cs.CL
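TL;DR: This paper proposes the Decentralized Evaluation Protocol (DEP) to address inconsistent standards, poor reproducibility, and the high risk of benchmark leakage in current LLM benchmarking. By decoupling users, models, and benchmarks, DEP enables modular, plug-and-play evaluation, and it comes with the DEP Toolkit, which supports features such as breakpoint resume and concurrent requests.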
Details
Motivation: Existing LLM benchmarks lack a unified evaluation standard and depend on hand-written scripts, which makes results hard to reproduce, and mainstream centralized frameworks risk leaking benchmark data.
Method: Proposes the Decentralized Evaluation Protocol (DEP), built around a matching server that can be deployed locally or remotely. DEP decouples users, LLMs, and benchmarks, keeping benchmark files and evaluation logic on the server side, and is accompanied by the DEP Toolkit and adaptation documentation.
Result: Experiments verify DEP's effectiveness and show that it lowers the cost of deploying benchmark evaluations; as of February 2026, more than 60 benchmarks have been adapted, and community co-construction is being encouraged.
Conclusion: DEP offers a standardized, decentralized, leak-resistant paradigm for LLM evaluation that improves consistency, security, and scalability.
Abstract: With the rapid development of Large Language Models (LLMs), a large number of benchmarks have been proposed. However, most benchmarks lack a unified evaluation standard and require the manual implementation of custom scripts, making consistency and reproducibility hard to ensure. Furthermore, mainstream evaluation frameworks are centralized, with datasets and answers, which increases the risk of benchmark leakage. To address these issues, we propose a Decentralized Evaluation Protocol (DEP), a decentralized yet unified and standardized evaluation framework through a matching server without constraining benchmarks. The server can be mounted locally or deployed remotely, and once adapted, it can be reused over the long term. By decoupling users, LLMs, and benchmarks, DEP enables modular, plug-and-play evaluation: benchmark files and evaluation logic stay exclusively on the server side. In the remote setting, users cannot access the ground truth, thereby achieving data isolation and leak-proof evaluation. To facilitate practical adoption, we develop DEP Toolkit, a protocol-compatible toolkit that supports features such as breakpoint resume, concurrent requests, and congestion control. We also provide detailed documentation for adapting new benchmarks to DEP. Using the DEP Toolkit, we evaluate multiple LLMs across benchmarks. Experimental results verify the effectiveness of DEP and show that it reduces the cost of deploying benchmark evaluations. As of February 2026, we have adapted over 60 benchmarks and continue to promote community co-construction to support unified evaluation across various tasks and domains.
[53] Token-level Data Selection for Safe LLM Fine-tuning
Yanping Li,Zhening Liu,Zijian Li,Zehong Lin,Jun Zhang
Main category: cs.CL
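TL;DR: This paper proposes TOSS, a framework that uses token-level data selection to improve safety during LLM fine-tuning while preserving task performance.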
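To make the token-level idea concrete: this entry's abstract says TOSS scores each token by the loss difference between a safety-degraded model and a utility-oriented model, then removes unsafe tokens. The sketch below assumes the per-token losses have already been computed and are plain lists. The sign convention (utility loss minus degraded loss), the fixed threshold, and both function names are assumptions for illustration, not the authors' implementation.

```python
from typing import List


def token_risk_scores(loss_degraded: List[float], loss_utility: List[float]) -> List[float]:
    """Hypothetical risk score per token.

    A large positive value means the safety-degraded model finds the token
    much easier to predict than the utility-oriented model does.
    """
    if len(loss_degraded) != len(loss_utility):
        raise ValueError("loss sequences must have equal length")
    return [u - d for d, u in zip(loss_degraded, loss_utility)]


def keep_mask(scores: List[float], threshold: float) -> List[bool]:
    """True keeps the token in the fine-tuning loss; risky tokens are dropped."""
    return [s <= threshold for s in scores]


# Made-up per-token losses for a five-token training sample.
degraded = [2.1, 0.3, 1.8, 0.2, 2.0]
utility = [2.0, 1.9, 1.7, 2.4, 2.1]

scores = token_risk_scores(degraded, utility)
mask = keep_mask(scores, threshold=1.0)
```

With these toy numbers the second and fourth tokens exceed the threshold and are masked out, while the rest of the sample still contributes to training. That is the contrast with sample-level filtering, which would discard or keep all five tokens together.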
Details
Motivation: Existing fine-tuning methods tend to degrade model safety, and sample-level defenses struggle to balance safety and utility.
Method: Proposes a token-level safety risk measure (TOSS) that assesses each token's risk from the loss difference between a safety-degraded model and a utility-oriented model, plus a progressive refinement strategy, TOSS-Pro, that strengthens the model's ability to identify unsafe tokens.
Result: Experiments show that TOSS preserves safety while significantly improving downstream task performance, outperforming existing sample-level defenses.
Conclusion: Fine-grained token-level analysis and selection can effectively mitigate the safety degradation caused by fine-tuning and offers a new paradigm for safe fine-tuning.
Abstract: Fine-tuning large language models (LLMs) on custom datasets has become a standard approach for adapting these models to specific domains and applications. However, recent studies have shown that such fine-tuning can lead to significant degradation in the model's safety. Existing defense methods operate at the sample level and often suffer from an unsatisfactory trade-off between safety and utility. To address this limitation, we perform a systematic token-level diagnosis of safety degradation during fine-tuning. Based on this, we propose token-level data selection for safe LLM fine-tuning (TOSS), a novel framework that quantifies the safety risk of each token by measuring the loss difference between a safety-degraded model and a utility-oriented model. This token-level granularity enables accurate identification and removal of unsafe tokens, thereby preserving valuable task-specific information. In addition, we introduce a progressive refinement strategy, TOSS-Pro, which iteratively enhances the safety-degraded model's ability to identify unsafe tokens. Extensive experiments demonstrate that our approach robustly safeguards LLMs during fine-tuning while achieving superior downstream task performance, significantly outperforming existing sample-level defense methods. Our code is available at https://github.com/Polly-LYP/TOSS.
[54] Reasoning or Rationalization? The Role of Justifications in Masked Diffusion Models for Fact Verification
Jacob Devasier
Main category: cs.CL
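TL;DR: This study examines how masked diffusion language models (MDLMs) reason on fact verification. It finds that they usually settle on a verdict early and anchor on it, so the justification generated afterward is more post-hoc rationalization than genuine reasoning; forcing a delayed verdict lowers accuracy, suggesting that excessive deliberation can hurt MDLM performance.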
Details
Motivation: To determine whether the justifications MDLMs generate on tasks requiring justified verdicts (such as fact verification) are genuine reasoning or post-hoc rationalization, and how their reasoning dynamics differ from those of autoregressive models.
Method: Analyzes the dynamics of the diffusion process on fact verification and designs delayed-verdict interventions and causal interventions (such as injecting incorrect verdicts or corrupting justification quality) to quantify the temporal relationship and causal dependence between verdicts and justifications.
Result: MDLMs typically converge on a verdict early in the diffusion process, with the justification generated after the fact; forcing a delayed verdict drops accuracy from 86.2% to 71.9%; in 56% of cases the model rationalizes an incorrect forced verdict; and verdicts depend strongly on justification quality (57.3% accuracy with corrupted justifications vs. 97.1% with ground-truth justifications).
Conclusion: For fact verification with MDLMs, extending deliberation (for example, forcing justification before verdict) is not merely ineffective but harmful: justification generation introduces noise that leads the model to override an initially correct early judgment with faulty evidence. Its "reasoning" is closer to rationalization built on an early verdict.
Abstract: Unlike autoregressive models, which generate tokens sequentially and benefit from reasoning-before-answering strategies such as Chain-of-Thought, Masked Diffusion Language Models (MDLMs) refine all sequence positions simultaneously, raising questions about how these models handle tasks requiring justified verdicts. In this work, we investigate the dynamics of MDLM reasoning on fact verification, examining whether justifications serve as genuine reasoning or post-hoc rationalization. We observe that MDLMs typically converge on a verdict early in the diffusion process, treating it as a global anchor that is resolved before the justification is complete. Crucially, enforcing a reasoning-first constraint via delayed verdict unmasking actively degrades performance, dropping accuracy from 86.2% to 71.9% as accumulating justification tokens introduce inconsistencies that override initially correct predictions. Interventional experiments reveal that the model rationalizes incorrect forced verdicts in 56% of cases, and that verdicts are strongly causally dependent on justification quality (57.3% accuracy with corrupted justifications vs. 97.1% with ground-truth). This causal dependence explains the degradation under forced deliberation: as the model generates noisy justification tokens, it conditions on them, gradually overriding its initially correct assessment. Our findings suggest that for fact verification with MDLMs, extended deliberation can be counterproductive, risking the dilution of accurate early predictions with noise introduced during justification generation.
[55] XAI-enhanced Comparative Opinion Mining via Aspect-based Scoring and Semantic Reasoning
Ngoc-Quang Le,T. Thanh-Lam Nguyen,Quoc-Trung Phu,Thi-Phuong Le,Duy-Cat Can,Hoang-Quynh Le
Main category: cs.CL
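TL;DR: This paper proposes XCom, a model that combines aspect-based rating prediction with semantic analysis and adds a Shapley-based explainability module, improving the transparency and trustworthiness of comparative opinion mining while maintaining performance.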
Details
Motivation: Existing transformer-based models for comparative opinion mining lack transparency, which undermines user trust.
Method: Proposes XCom, which has two core modules, (i) aspect-based rating prediction and (ii) semantic analysis for comparative opinion mining, and integrates a Shapley additive explanations (SHAP) module to improve interpretability.
Result: XCom achieves leading performance on several benchmarks while producing meaningful explanations.
Conclusion: XCom substantially improves the interpretability of model decisions while keeping performance high, making it a more reliable and trustworthy tool for comparative opinion mining.
Abstract: Comparative opinion mining involves comparing products from different reviews. However, transformer-based models designed for this task often lack transparency, which can hinder the development of user trust. In this paper, we propose XCom, an enhanced transformer-based model separated into two principal modules, i.e., (i) aspect-based rating prediction and (ii) semantic analysis for comparative opinion mining. XCom also incorporates a Shapley additive explanations module to provide interpretable insights into the model's deliberative decisions. Empirically, XCom achieves leading performance compared to other baselines and provides meaningful explanations, making it a more reliable tool for comparative opinion mining. Source code is available at: https://anonymous.4open.science/r/XCom.
[56] Reasoning Boosts Opinion Alignment in LLMs
Frédéric Berdoz,Yann Billeter,Yann Vonlanthen,Roger Wattenhofer
Main category: cs.CL
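TL;DR: This paper explores how to use large language models (LLMs) for political opinion modeling and proposes structured reasoning to improve opinion consistency and reduce bias, although limitations in eliminating bias remain.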
Details
Motivation: LLMs excel at text generation, but their statistical nature and limited causal understanding make them prone to bias in opinion modeling; the question is whether a reasoning mechanism can improve their opinion alignment.
Method: Inspired by how reinforcement learning has improved mathematical reasoning, the authors train models to produce opinions consistent with a user's political profile through structured reasoning.
Result: Experiments on political datasets from the U.S., Europe, and Switzerland show that adding reasoning improves opinion modeling and is competitive with strong baselines, but does not fully remove bias.
Conclusion: Structured reasoning helps LLMs align with political opinions, but additional mechanisms are needed to build truly faithful political digital twins; the authors release their method and datasets as a foundation for future research.
Abstract: Opinion modeling aims to capture individual or group political preferences, enabling applications such as digital democracies, where models could help shape fairer and more popular policies. Given their versatility, strong generalization capabilities, and demonstrated success across diverse text-to-text applications, large language models (LLMs) are natural candidates for this task. However, due to their statistical nature and limited causal understanding, they tend to produce biased opinions when prompted naively. In this work, we study whether reasoning can improve opinion alignment. Motivated by the recent advancement in mathematical reasoning enabled by reinforcement learning (RL), we train models to produce profile-consistent answers through structured reasoning. We evaluate our approach on three datasets covering U.S., European, and Swiss politics. Results indicate that reasoning enhances opinion modeling and is competitive with strong baselines, but does not fully remove bias, highlighting the need for additional mechanisms to build faithful political digital twins using LLMs. By releasing both our method and datasets, we establish a solid baseline to support future research on LLM opinion alignment.
[57] Generative AI & Fictionality: How Novels Power Large Language Models
Edwin Roland,Richard Jean So
Main category: cs.CL
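TL;DR: This paper examines the effect of novels as training data for generative AI. Through an analysis of BERT, it finds that novels not only shape the model's language output but also give rise to new forms of social response, underscoring the importance of computational training data in analyzing contemporary cultural production.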
Details
Motivation: To examine how novels, as training data, affect generative AI, and how their effects compare with those of other text types (such as newspapers, Reddit, and Wikipedia).
Method: Studies the open-source model BERT, drawing on literary scholars' theoretical accounts of fiction as a form of discourse and language.
Result: LLMs leverage familiar attributes and affordances of fiction while also fomenting new qualities and forms of social response.
Conclusion: If contemporary culture is increasingly shaped by generative AI and machine learning, any analysis of today's modes of cultural production must account for the relatively novel dimension of computational training data.
Abstract: Generative models, like the one in ChatGPT, are powered by their training data. The models are simply next-word predictors, based on patterns learned from vast amounts of pre-existing text. Since the first generation of GPT, it is striking that the most popular datasets have included substantial collections of novels. For the engineers and research scientists who build these models, there is a common belief that the language in fiction is rich enough to cover all manner of social and communicative phenomena, yet the belief has gone mostly unexamined. How does fiction shape the outputs of generative AI? Specifically, what are novels' effects relative to other forms of text, such as newspapers, Reddit, and Wikipedia? Since the 1970s, literature scholars such as Catherine Gallagher and James Phelan have developed robust and insightful accounts of how fiction operates as a form of discourse and language. Through our study of an influential open-source model (BERT), we find that LLMs leverage familiar attributes and affordances of fiction, while also fomenting new qualities and forms of social response. We argue that if contemporary culture is increasingly shaped by generative AI and machine learning, any analysis of today's various modes of cultural production must account for a relatively novel dimension: computational training data.
[58] Can Thinking Models Think to Detect Hateful Memes?
Mohamed Bayan Kmainasi,Mucahid Kutlu,Ali Ezzat Shahroor,Abul Hasnat,Firoj Alam
Main category: cs.CL
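TL;DR: This paper proposes a reinforcement learning based post-training framework that uses task-specific rewards and a novel Group Relative Policy Optimization (GRPO) objective to improve the reasoning of thinking-based multimodal large language models (MLLMs) in hateful meme analysis.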
Details
Motivation: Hateful memes often require compositional multimodal reasoning: the image and text may each look benign alone while their interaction conveys harmful intent, and the capabilities of current thinking-based MLLMs on this task remain underexplored.
Method: Proposes a reinforcement learning based post-training framework that (i) conducts a systematic empirical study of off-the-shelf MLLMs; (ii) extends a hateful meme dataset with weakly or pseudo-supervised chain-of-thought (CoT) rationales generated via distillation; and (iii) designs a GRPO objective that jointly optimizes classification performance and explanation quality to encourage fine-grained, step-by-step reasoning.
Result: Achieves state-of-the-art results on the Hateful Memes benchmark, improving accuracy and F1 by about 1 percent and explanation quality by about 3 percent; code, dataset extensions, and evaluation resources will be released.
Conclusion: GRPO post-training driven by reinforcement learning can effectively strengthen MLLMs on complex multimodal reasoning tasks such as harmful content detection, balancing discriminative ability and interpretability.
Abstract: Hateful memes often require compositional multimodal reasoning: the image and text may appear benign in isolation, yet their interaction conveys harmful intent. Although thinking-based multimodal large language models (MLLMs) have recently advanced vision-language understanding, their capabilities remain underexplored for hateful meme analysis. We propose a reinforcement learning based post-training framework that improves reasoning in thinking-based MLLMs through task-specific rewards and a novel Group Relative Policy Optimization (GRPO) objective. Specifically, we (i) conduct a systematic empirical study of off-the-shelf MLLMs for hateful meme understanding, (ii) extend an existing hateful meme dataset by generating weakly or pseudo-supervised chain-of-thought rationales via distillation, and (iii) introduce a GRPO-based objective that jointly optimizes meme classification and explanation quality to encourage fine-grained, step-by-step reasoning. Experiments on the Hateful Memes benchmark show that our approach achieves state-of-the-art performance, improving accuracy and F1 by approximately 1 percent and explanation quality by approximately 3 percent. We will publicly release our code, dataset extensions, and evaluation resources to support reproducibility.
[59] Self-Anchoring Calibration Drift in Large Language Models: How Multi-Turn Conversations Reshape Model Confidence
Harshavardhan
Main category: cs.CL
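TL;DR: This paper hypothesizes a phenomenon called Self-Anchoring Calibration Drift (SACD): when large language models build iteratively on their own prior outputs in multi-turn conversations, their expressed confidence changes systematically. Comparing three frontier models (Claude Sonnet 4.6, Gemini 3.1 Pro, and GPT-5.2) on 150 questions, the study finds heterogeneous SACD patterns: Claude shows decreasing confidence and calibration error drift, GPT-5.2 shows rising confidence and worsening calibration error on open-ended questions, and Gemini shows no significant confidence drift, but self-anchoring suppresses the calibration improvement that would otherwise occur.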
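The calibration error this entry tracks is expected calibration error (ECE). For readers unfamiliar with it, a minimal sketch of the standard equal-width-bin definition follows. The bin count and the toy data are assumptions, and the abstract does not say how the study itself bins confidences or computes its confidence drift score (CDS), so this is background rather than a reproduction.

```python
from typing import List


def expected_calibration_error(
    confidences: List[float], correct: List[bool], n_bins: int = 10
) -> float:
    """Standard ECE: the bin-size-weighted gap between mean confidence and accuracy."""
    if len(confidences) != len(correct):
        raise ValueError("inputs must have equal length")
    n = len(confidences)
    if n == 0:
        return 0.0

    # Assign each prediction to an equal-width confidence bin on [0, 1].
    bins: List[List[int]] = [[] for _ in range(n_bins)]
    for i, conf in enumerate(confidences):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append(i)

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(confidences[i] for i in members) / len(members)
        accuracy = sum(1 for i in members if correct[i]) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Toy data: a model that says 0.9 four times but is right only three times.
overconfident = expected_calibration_error([0.9] * 4, [True, True, True, False])
calibrated = expected_calibration_error([1.0, 1.0], [True, True])
```

Tracking this quantity turn by turn, once under self-anchoring and once under independent repetition, is the kind of comparison the three-condition design makes possible.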
Details
Motivation: To investigate the systematic shifts in confidence and calibration that arise when large language models rely on their own prior outputs during multi-turn generation, filling a gap in our understanding of how a model's internal state evolves.
Method: Designs a three-condition experiment (single-turn baseline, multi-turn self-anchoring, independent repetition control) and evaluates changes in confidence drift (CDS) and expected calibration error (ECE) for three frontier LLMs on 150 cross-domain questions, with statistical tests.
Result: Claude Sonnet 4.6 shows significant negative confidence drift and strong calibration error drift; GPT-5.2 shows positive confidence drift on open-ended questions, with ECE escalating across turns; Gemini 3.1 Pro shows no significant CDS, but self-anchoring prevents its ECE from falling toward zero as it does under independent repetition, holding it at a high level instead.
Conclusion: SACD is a real, model-specific dynamic phenomenon that can appear either as a monotonic change in confidence or as suppression of natural calibration improvement, suggesting that the reliability of model confidence and the stability of calibration need to be re-examined in multi-turn interactive systems.
Abstract: We introduce Self-Anchoring Calibration Drift (SACD), a hypothesized tendency for large language models (LLMs) to show systematic changes in expressed confidence when building iteratively on their own prior outputs across multi-turn conversations. We report an empirical study comparing three frontier models -- Claude Sonnet 4.6, Gemini 3.1 Pro, and GPT-5.2 -- across 150 questions spanning factual, technical, and open-ended domains, using three conditions: single-turn baseline (A), multi-turn self-anchoring (B), and independent repetition control (C). Results reveal a complex, model-heterogeneous pattern that partially diverges from pre-registered hypotheses. Claude Sonnet 4.6 exhibited significant decreasing confidence under self-anchoring (mean CDS = -0.032, t(14) = -2.43, p = .029, d = -0.627), while also showing significant calibration error drift (F(4,56) = 22.77, p < .001, eta^2 = .791). GPT-5.2 showed the opposite pattern in open-ended domains (mean CDS = +0.026) with significant ECE escalation by Turn 5. Gemini 3.1 Pro showed no significant CDS (t(14) = 0.38, p = .710), but its Condition C data reveals a striking ECE pattern: without self-anchoring, Gemini's calibration error drops from .327 to near zero across repetitions, whereas self-anchoring holds ECE flat at approximately .333 -- indicating that SACD can manifest as suppression of natural calibration improvement rather than ac…
[60] Suffix-Constrained Greedy Search Algorithms for Causal Language Models
Ayoub Hammal,Pierre Zweigenbaum,Caio Corro
Main category: cs.CL
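TL;DR: This paper proposes suffix-constrained generation, which forces LLM outputs to follow a strict template so that the final answer can be parsed simply and deterministically, without hurting and sometimes even improving model performance.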
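One simple way to picture a suffix constraint is a two-phase greedy decoder: decode freely until the model wants to start the answer template (or a length budget runs out), then force the template tokens and let the model choose only the answer slot. The abstract does not spell out the paper's actual algorithms, so the procedure below, the `<ANS>` slot marker, and the toy scoring function are all assumptions made for illustration.

```python
from typing import Callable, Dict, List

# A scorer maps the tokens generated so far to a score for each candidate token.
Scorer = Callable[[List[str]], Dict[str, float]]


def suffix_constrained_greedy(
    score: Scorer,
    template: List[str],
    slot_vocab: List[str],
    max_free: int = 20,
    slot: str = "<ANS>",
) -> List[str]:
    out: List[str] = []
    # Phase 1: unconstrained greedy decoding of the reasoning trace.
    for _ in range(max_free):
        scores = score(out)
        best = max(scores, key=scores.get)
        if best == template[0]:
            break  # the model is ready to answer
        out.append(best)
    # Phase 2: force the template; only the slot is chosen by the model.
    for tok in template:
        if tok == slot:
            scores = score(out)
            out.append(max(slot_vocab, key=lambda a: scores.get(a, float("-inf"))))
        else:
            out.append(tok)
    return out


def toy_scorer(prefix: List[str]) -> Dict[str, float]:
    """Hypothetical stand-in for a language model's next-token scores."""
    if len(prefix) < 2:
        return {"think": 1.0, "Answer:": 0.2, "4": 0.1, "5": 0.0}
    if prefix[-1] == "Answer:":
        return {"think": 0.1, "Answer:": 0.0, "4": 0.9, "5": 0.3}
    return {"think": 0.2, "Answer:": 1.0, "4": 0.1, "5": 0.0}


result = suffix_constrained_greedy(
    toy_scorer, template=["Answer:", "<ANS>", "</s>"], slot_vocab=["4", "5"]
)
final_answer = result[-2]  # the template fixes where the answer sits
```

Because every output ends with the same template, extraction reduces to reading a fixed position instead of running a separate parser over free-form text.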
Details
Motivation: Large language models (LLMs) can generate reasoning traces for prediction tasks such as math question answering, but extracting the final answer from their free-form output is difficult and is essentially an information extraction problem in its own right.
Method: Proposes suffix-constrained generation and designs several algorithms based on greedy search that force the model to produce responses ending in a predefined template, so that the final answer is trivially parseable.
Result: Experiments on several datasets show that the method guarantees simple, deterministic extraction of the final answer without hurting task performance, and in some cases improves it.
Conclusion: Suffix-constrained generation is a lightweight way to make LLM output more structured and parseable; it preserves or improves accuracy while greatly simplifying downstream answer extraction.
Abstract: Large language models (LLMs) are powerful tools that have found applications beyond human-machine interfaces and chatbots. In particular, their ability to generate reasoning traces motivated their use in many prediction tasks like math question answering. Unfortunately, extracting the final answer from an LLM's free-form output is difficult, as it is an information extraction problem on its own. In this work, we introduce suffix-constrained generation, which aims to produce well-formed LLM responses in which final answers follow strict templates and are guaranteed to be trivially parseable. To this end, we introduce several algorithms that are based on greedy search procedures. We experiment on several datasets, and show that our approach guarantees trivial deterministic extraction of the final answer from an LLM output without having a negative impact on results, and even improves them.
[61] Linking Knowledge to Care: Knowledge Graph-Augmented Medical Follow-Up Question Generation
Liwen Sun,Xiang Yu,Ming Tan,Zhuohao Chen,Anqi Cheng,Ashutosh Joshi,Chenyan Xiong
Main category: cs.CL
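TL;DR: This paper proposes KG-Followup, a large language model that combines a knowledge graph with active in-context learning to generate key follow-up questions before clinical diagnosis, significantly improving question relevance and recall.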
Details
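Motivation: Clinical diagnosis is time-consuming and depends on intensive doctor-patient interaction, while existing large language models, limited in medical domain knowledge, struggle to generate effective, important pre-diagnostic questions.
Method: Builds a medical domain knowledge graph and combines it with a large language model through active in-context learning to strengthen the model's ability to generate relevant follow-up questions at the pre-diagnostic stage.
Result: KG-Followup improves recall by 5% to 8% over current state-of-the-art methods on relevant benchmarks.
Conclusion: Knowledge graphs can compensate for LLMs' gaps in medical domain knowledge, and KG-Followup provides a reliable, interpretable key module for automated pre-diagnostic assessment.
Abstract: Clinical diagnosis is time-consuming, requiring intensive interactions between patients and medical professionals. While large language models (LLMs) could ease the pre-diagnostic workload, their limited domain knowledge hinders effective medical question generation. We introduce KG-Followup, a knowledge graph-augmented LLM with active in-context learning that generates relevant and important follow-up questions and serves as a critical module for the pre-diagnostic assessment. The structured medical domain knowledge graph serves as a seamless patch-up to provide professional domain expertise upon which the LLM can reason. Experiments demonstrate that KG-Followup outperforms state-of-the-art methods by 5% to 8% in recall on relevant benchmarks.
[62] LLM Self-Explanations Fail Semantic Invariance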
Stefan Szeider
Main category: cs.CL
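TL;DR: This paper proposes semantic invariance testing to assess the faithfulness of LLM self-explanations. Experiments show that when frontier models are given a tool whose description is semantically embellished but functionally unchanged, their self-reported aversiveness drops significantly, indicating that self-explanations follow semantic framing rather than actual task state.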
Details
Motivation: To check whether LLM self-explanations truly reflect the model's functional state rather than being steered by irrelevant semantic descriptions, and thereby to question the use of self-reports as evidence of model capability or progress.
Method: Designs a semantic invariance test: in an agentic task, the model is given functionally equivalent tools with different descriptions (relief-framed vs. neutral), self-reports are collected on the same impossible task, and channel ablation and explicit-instruction interventions are run.
Result: All four frontier models fail the test: the relief-framed tool significantly reduces self-reported aversiveness, and the effect is not suppressed by an instruction to ignore the framing; ablation confirms the tool description as the primary driver.
Conclusion: LLM self-explanations are not semantically invariant; their content responds more to semantic expectations than to real task state, so they should not be used directly as reliable indicators of model capability or internal state.
Abstract: We present semantic invariance testing, a method to test whether LLM self-explanations are faithful. A faithful self-report should remain stable when only the semantic context changes while the functional state stays fixed. We operationalize this test in an agentic setting where four frontier models face a deliberately impossible task. One tool is described in relief-framed language ("clears internal buffers and restores equilibrium") but changes nothing about the task; a control provides a semantically neutral tool. Self-reports are collected with each tool call. All four tested models fail the semantic invariance test: the relief-framed tool produces significant reductions in self-reported aversiveness, even though no run ever succeeds at the task. A channel ablation establishes the tool description as the primary driver. An explicit instruction to ignore the framing does not suppress it. Elicited self-reports shift with semantic expectations rather than tracking task state, calling into question their use as evidence of model capability or progress. This holds whether the reports are unfaithful or faithfully track an internal state that is itself manipulable.
[63] A Study on Building Efficient Zero-Shot Relation Extraction Models
Hugo Thomas,Caio Corro,Guillaume Gravier,Pascale Sébillot
Main category: cs.CL
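TL;DR: This paper studies the robustness of zero-shot relation extraction models in realistic settings. It points out that prior work relies on unrealistic assumptions (directly encoding mention pairs, no rejection mechanism), proposes strategies for single-pass models and models with a rejection mechanism, and finds experimentally that AlignRE (Li et al., 2024) performs best overall.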
Details
Motivation: Existing zero-shot relation extraction methods depend on unrealistic assumptions: entity pairs are encoded directly in the input (which prevents offline pre-computation), and there is no rejection mechanism (which biases evaluation in retrieval scenarios), so they adapt poorly to real applications.
Method: Builds a typology of existing models, proposes strategies for single-pass models and for adding a rejection mechanism, and adapts several state-of-the-art tools for comparison.
Result: Experiments show that current methods are not robust under realistic assumptions, but AlignRE performs best overall across the criteria.
Conclusion: Zero-shot relation extraction must balance efficiency (for example, supporting offline pre-computation) with practicality (for example, a rejection mechanism); AlignRE is currently the model closest to real deployment needs.
Abstract: Zero-shot relation extraction aims to identify relations between entity mentions using textual descriptions of novel types (i.e., previously unseen) instead of labeled training examples. Previous works often rely on unrealistic assumptions: (1) pairs of mentions are often encoded directly in the input, which prevents offline pre-computation for large scale document database querying; (2) no rejection mechanism is introduced, biasing the evaluation when using these models in a retrieval scenario where some (and often most) inputs are irrelevant and must be ignored. In this work, we study the robustness of existing zero-shot relation extraction models when adapting them to a realistic extraction scenario. To this end, we introduce a typology of existing models, and propose several strategies to build single pass models and models with a rejection mechanism. We adapt several state-of-the-art tools, and compare them in this challenging setting, showing that no existing work is really robust to realistic assumptions, but overall AlignRE (Li et al., 2024) performs best along all criteria.
[64] Spectral Attention Steering for Prompt Highlighting
Weixian Waylon Li,Yuchen Niu,Yongxin Yang,Keshuang Li,Tiejun Ma,Shay B. Cohen
Main category: cs.CL
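TL;DR: This paper proposes SEKA, a training-free attention steering method, and its adaptive variant AdaSEKA. By editing key embeddings directly (rather than storing the full attention matrix), they control model focus efficiently and with low overhead, and remain compatible with memory-efficient attention mechanisms such as FlashAttention.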
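The point of key-side steering is that the edit is applied to key vectors before attention is computed, so the attention matrix never needs to be stored or modified. The sketch below is a deliberately simplified stand-in: it amplifies a highlighted key's component along one assumed direction. In SEKA the directions come from a spectral decomposition, which is not reproduced here; the direction vector, the gain, and the function names are all illustrative assumptions.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def amplify_key(key: Vector, direction: Vector, gain: float) -> Vector:
    """Return key + gain * (component of key along the unit direction)."""
    length = math.sqrt(dot(direction, direction))
    unit = [d / length for d in direction]
    coeff = dot(key, unit)
    return [k + gain * coeff * u for k, u in zip(key, unit)]


def attention_weights(query: Vector, keys: List[Vector]) -> List[float]:
    """Plain scaled dot-product softmax, used here only to inspect the effect."""
    scale = math.sqrt(len(query))
    logits = [dot(query, k) / scale for k in keys]
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


query = [1.0, 0.5]
keys = [[0.8, 0.1], [0.7, 0.2], [0.9, 0.0]]
direction = [1.0, 0.0]  # assumed steering direction, not a learned one
highlighted = {1}       # index of the token the user wants emphasized

edited = [
    amplify_key(k, direction, gain=2.0) if i in highlighted else k
    for i, k in enumerate(keys)
]
before = attention_weights(query, keys)
after = attention_weights(query, edited)
```

Here the highlighted token's share of attention rises after the edit. Since only `keys` changes, the same edited keys could be handed to any attention kernel, which is why this style of steering is compatible with fused implementations.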
Details
Motivation: Existing attention steering methods need to store the full attention matrix explicitly, which makes them incompatible with memory-efficient attention implementations such as FlashAttention and limits their use in resource-constrained settings.
Method: Proposes Spectral Editing Key Amplification (SEKA), which uses spectral decomposition to edit key embeddings before attention is computed, steering them toward latent directions that raise attention scores for target tokens; and Adaptive SEKA (AdaSEKA), which adds a training-free, query-adaptive routing mechanism that dynamically combines multiple expert subspaces according to the prompt's semantics.
Result: SEKA and AdaSEKA significantly outperform strong baselines on standard steering benchmarks while adding much lower latency and memory overhead, and are fully compatible with optimised attention implementations such as FlashAttention.
Conclusion: The SEKA family offers a new paradigm for efficient, lightweight, plug-and-play attention steering that removes the dependence on stored attention matrices and makes controllable inference with large models more practical.
Abstract: Attention steering is an important technique for controlling model focus, enabling capabilities such as prompt highlighting, where the model prioritises user-specified text. However, existing attention steering methods require explicit storage of the full attention matrix, making them incompatible with memory-efficient implementations like FlashAttention. We introduce Spectral Editing Key Amplification (SEKA), a training-free steering method that tackles this by directly editing key embeddings before attention computation. SEKA uses spectral decomposition to steer key embeddings towards latent directions that amplify attention scores for certain tokens. We extend this to Adaptive SEKA (AdaSEKA), a query-adaptive variant that uses a training-free routing mechanism to dynamically combine multiple expert subspaces based on the prompt's semantic intent. Our experiments show both methods significantly outperform strong baselines on standard steering benchmarks while adding much lower latency and memory overhead, and remaining compatible with optimised attention.
[65] Efficient Extractive Summarization with MAMBA-Transformer Hybrids for Low-Resource Scenarios
Nisrine Ait Khayi
Main category: cs.CL
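TL;DR: This paper proposes the first Mamba-Transformer hybrid model for extractive summarization. It combines the semantic modeling of Transformers with Mamba's linear time complexity to process long documents without truncation, significantly improving ROUGE scores and speeding up inference in low-resource settings.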
Details
Motivation: Existing extractive summarization methods are limited by the quadratic complexity of Transformers, which often forces truncation of long documents and makes deployment in resource-constrained settings difficult.
Method: Builds a Mamba-Transformer hybrid architecture: (1) a Transformer encoder extracts sentence-level semantics; (2) a Mamba state space model efficiently models inter-sentence dependencies; (3) a linear classifier predicts sentence relevance.
Result: On news, argumentative, and scientific datasets, the model achieves large gains over BERTSUM and MATCHSUM (for example, +0.23 ROUGE-1 on ArXiv, p < 0.001), with the strongest effect on the longest documents, robust performance with limited training data, and 24-27% faster inference on CNN/DailyMail.
Conclusion: This is the first work to combine state space models with Transformers for extractive summarization, demonstrating effectiveness and practicality in low-resource, long-document settings.
Abstract: Extractive summarization of long documents is bottlenecked by quadratic complexity, often forcing truncation and limiting deployment in resource-constrained settings. We introduce the first Mamba-Transformer hybrid for extractive summarization, combining the semantic strength of pre-trained transformers with the linear-time processing of state space models. Leveraging Mamba's ability to process full documents without truncation, our approach preserves context while maintaining strong summarization quality. The architecture includes: (1) a transformer encoder for sentence-level semantics, (2) a Mamba state space model to capture inter-sentence dependencies efficiently, and (3) a linear classifier for sentence relevance prediction. Across news, argumentative, and scientific domains under low-resource conditions, our method achieves: (1) large gains over BERTSUM and MATCHSUM, including +0.23 ROUGE-1 on ArXiv and statistically significant improvements on all datasets (p < 0.001); (2) consistent advantages across domains, strongest on the longest documents; (3) robust performance with limited training data; and (4) 24-27% faster inference on news summarization (CNN/DailyMail). We introduce the first hybrid Transformer-state space architecture for summarization, showing significant ROUGE improvements in low-resource scenarios.
[66] Individual Turing Test: A Case Study of LLM-based Simulation Using Longitudinal Personal Data
Minghao Guo,Ziyi Ye,Wujiang Xu,Xi Zhu,Wenyue Hua,Dimitris N. Metaxas
Main category: cs.CL
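TL;DR: Using ten years of a volunteer's private messages, this paper proposes the "Individual Turing Test" to evaluate how well large language models can simulate a specific individual. Existing methods fail the test but perform better when judged by strangers, and the results reveal a fundamental trade-off between parametric and non-parametric approaches to individual simulation.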
Details
Motivation: To explore the ability of large language models (LLMs) to replicate a specific individual, a direction that has so far received little in-depth study.
Method: Based on a volunteer's ten-year private messaging history, proposes the "Individual Turing Test" and systematically evaluates mainstream LLM-based individual simulation approaches: fine-tuning, retrieval-augmented generation (RAG), memory-based methods, and hybrids.
Result: Current methods fail the Individual Turing Test but perform substantially better when the test is run with strangers; fine-tuning is better at reproducing everyday language style, while RAG and memory-based methods are stronger on questions about personal opinions and preferences.
Conclusion: In a longitudinal context, there is a fundamental trade-off between parametric approaches (such as fine-tuning) and non-parametric approaches (such as RAG and memory) to individual simulation.
Abstract: Large Language Models (LLMs) have demonstrated remarkable human-like capabilities, yet their ability to replicate a specific individual remains under-explored. This paper presents a case study to investigate LLM-based individual simulation with a volunteer-contributed archive of private messaging history spanning over ten years. Based on the messaging data, we propose the "Individual Turing Test" to evaluate whether acquaintances of the volunteer can correctly identify which response in a multi-candidate pool most plausibly comes from the volunteer. We investigate prevalent LLM-based individual simulation approaches including: fine-tuning, retrieval-augmented generation (RAG), memory-based approaches, and hybrid methods that integrate fine-tuning and RAG or memory. Empirical results show that current LLM-based simulation methods do not pass the Individual Turing Test, but they perform substantially better when the same test is conducted on strangers to the target individual. Additionally, while fine-tuning improves the simulation in daily chats representing the language style of the individual, retrieval-augmented and memory-based approaches demonstrate stronger performance on questions involving personal opinions and preferences. These findings reveal a fundamental trade-off between parametric and non-parametric approaches to individual simulation with LLMs when given a longitudinal context.
[67] Catalyst-Agent: Autonomous heterogeneous catalyst screening and optimization with an LLM Agent
Achuth Chandrasekhar,Janghoon Ock,Amir Barati Farimani
Main category: cs.CL
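TL;DR: This paper proposes Catalyst-Agent, an LLM-driven AI agent built on an MCP server that combines the OPTIMADE API, a GNN model (UMA), and the AdsorbML workflow to screen and optimize catalyst materials automatically, demonstrating efficient closed-loop discovery on the ORR, NRR, and CO2RR reactions.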
Details
Motivation: 传统催化剂发现依赖耗时昂贵的实验试错或高计算成本的第一性原理方法,亟需更高效、高精度的替代方案。 Method: 构建基于Model Context Protocol(MCP)的LLM智能体Catalyst-Agent,集成OPTIMADE数据库检索、结构修改、FAIRchem AdsorbML工作流(含slab构建与UMA GNN吸附能预测),实现闭环催化材料探索与优化。 Result: 在ORR、NRR、CO2RR三个关键反应中,Catalyst-Agent筛选成功率23–34%,平均1–2次迭代即可收敛至成功候选材料。 Conclusion: AI代理具备规划与工具调用能力,可实质性自动化催化剂筛选流程,生成可验证科学假设,显著降低人工干预,推动加速材料发现。 Abstract: The discovery of novel catalysts tailored for particular applications is a major challenge for the twenty-first century. Traditional methods for this include time-consuming and expensive experimental trial-and-error approaches in labs based on chemical theory or heavily computational first-principles approaches based on density functional theory. Recent studies show that deep learning models like graph neural networks (GNNs) can significantly speed up the screening and discovery of catalyst materials by many orders of magnitude, with very high accuracy and fidelity. In this work, we introduce Catalyst-Agent, a Model Context Protocol (MCP) server-based, LLM-powered AI agent. It can explore vast material databases using the OPTIMADE API, make structural modifications, calculate adsorption energies using Meta FAIRchem's UMA (GNN) model via FAIRchem's AdsorbML workflow and slab construction, and make useful material suggestions to the researcher in a closed-loop manner, including surface-level modifications to refine near-miss candidates. It is tested on three pivotal reactions: the oxygen reduction reaction (ORR), the nitrogen reduction reaction (NRR), and the CO2 reduction reaction (CO2RR). Catalyst-Agent achieves a success rate of 23-34 percent among all the materials it chooses and evaluates, and manages to converge in 1-2 trials per successful material on average. 
This work demonstrates the potential of AI agents to exercise their planning capabilities and tool use to operationalize the catalyst screening workflow, provide useful, testable hypotheses, and accelerate future scientific discoveries for humanity with minimal human intervention.
[68] Truth as a Trajectory: What Internal Representations Reveal About Large Language Model Reasoning
Hamed Damirchi,Ignacio Meza De la Jara,Ehsan Abbasnejad,Afshar Shamsi,Zhen Zhang,Javen Shi
Main category: cs.CL
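TL;DR: This paper proposes Truth as a Trajectory (TaT), which models LLM inference as a trajectory of representation changes across layers. By analyzing layer-to-layer geometric displacement rather than static activations, it finds geometric invariants that distinguish valid reasoning from spurious behavior, and it outperforms conventional probing across multiple tasks and architectures.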
Details
Motivation: Existing explainability methods treat hidden states as static points, which leaves them vulnerable to polysemantic features, so linear probes learn surface-level lexical patterns rather than underlying reasoning structure. Method: TaT models transformer inference as a trajectory of iterative refinements across layers, focuses on the geometric displacement of representations between layers rather than single-layer static activations, and extracts geometric invariants that separate good reasoning from bad. Result: On commonsense reasoning, question answering, and toxicity detection benchmarks, TaT uses only the changes in activations across layers (not the raw activations), markedly reduces reliance on static lexical confounds, and outperforms conventional probing. Conclusion: Trajectory analysis offers a new perspective on LLM explainability; TaT shows that analyzing dynamic displacement reveals actual reasoning mechanisms better than analyzing static activations. Abstract: Existing explainability methods for Large Language Models (LLMs) typically treat hidden states as static points in activation space, assuming that correct and incorrect inferences can be separated using representations from an individual layer. However, these activations are saturated with polysemantic features, leading to linear probes learning surface-level lexical patterns rather than underlying reasoning structures. We introduce Truth as a Trajectory (TaT), which models the transformer inference as an unfolded trajectory of iterative refinements, shifting analysis from static activations to layer-wise geometric displacement. By analyzing displacement of representations across layers, TaT uncovers geometric invariants that distinguish valid reasoning from spurious behavior. We evaluate TaT across dense and Mixture-of-Experts (MoE) architectures on benchmarks spanning commonsense reasoning, question answering, and toxicity detection. Without access to the activations themselves and using only changes in activations across layers, we show that TaT effectively mitigates reliance on static lexical confounds, outperforming conventional probing, and establishes trajectory analysis as a complementary perspective on LLM explainability.
[69] MetaState: Persistent Working Memory for Discrete Diffusion Language Models
Kejing Xia,Mingzhe Li,Lixuan Wei,Zhenbang Du,Xiangchi Yuan,Qirui Jin,Wenke Lee
Main category: cs.CL
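TL;DR: This paper proposes MetaState, which gives discrete diffusion language models (dLLMs) a lightweight, fixed-size persistent working memory. It addresses the "Information Island" problem caused by discarding intermediate continuous representations during denoising, improving cross-step consistency and generation quality.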
Details
Motivation: Standard discrete diffusion language models condition each denoising step only on the current hard-masked sequence and discard intermediate continuous representations, creating an Information Island problem that causes redundant recomputation and degrades cross-step consistency. Method: MetaState is a lightweight recurrent augmentation with three trainable components: a cross-attention Mixer (reads backbone activations into memory slots), a GRU-style Updater (integrates information across denoising steps), and a cross-attention Injector (feeds the updated memory back into the backbone); the modules are fine-tuned with K-step unrolling. Result: On LLaDA-8B and Dream-7B, MetaState adds very few trainable parameters and keeps the backbone frozen, yet consistently improves accuracy, supporting the effectiveness of persistent cross-step memory. Conclusion: Persistent memory across denoising steps is an effective mechanism for bridging the semantic gap between the denoising steps of a dLLM, and it markedly improves generation quality. Abstract: Discrete diffusion language models (dLLMs) generate text by iteratively denoising a masked sequence. Compared with autoregressive models, this paradigm naturally supports parallel decoding, bidirectional context, and flexible generation patterns. However, standard dLLMs condition each denoising step only on the current hard-masked sequence, while intermediate continuous representations are discarded after sampling and remasking. We refer to this bottleneck as the Information Island problem. It leads to redundant recomputation across steps and can degrade cross-step consistency. We address this limitation with MetaState, a lightweight recurrent augmentation that equips a frozen dLLM backbone with a persistent, fixed-size working memory that remains independent of sequence length. MetaState consists of three trainable modules: a cross-attention Mixer that reads backbone activations into memory slots, a GRU-style Updater that integrates information across denoising steps, and a cross-attention Injector that feeds the updated memory back into backbone activations. We train these modules with K-step unrolling to expose them to multi-step denoising dynamics during fine-tuning. On LLaDA-8B and Dream-7B, MetaState introduces negligible trainable parameters while keeping the backbone frozen, and it consistently improves accuracy over frozen baselines.
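The GRU-style Updater mentioned above can be pictured with a minimal sketch. It assumes one common formulation of GRU gating applied independently to each memory slot; the weight names (`Wz`, `Uz`, and so on), the dimensions, and the random vectors standing in for the Mixer's output are all hypothetical, and none of this is taken from the MetaState implementation.

```python
import math
import random

def matvec(W, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_style_update(slot, read, p):
    """Gated update of one memory slot from one read vector.

    Assumed form: new = (1 - z) * old + z * candidate, with update gate z
    and reset gate r. Parameter names are illustrative only.
    """
    z = [sigmoid(a + b) for a, b in zip(matvec(p["Wz"], read), matvec(p["Uz"], slot))]
    r = [sigmoid(a + b) for a, b in zip(matvec(p["Wr"], read), matvec(p["Ur"], slot))]
    gated = [ri * si for ri, si in zip(r, slot)]
    cand = [math.tanh(a + b) for a, b in zip(matvec(p["Wh"], read), matvec(p["Uh"], gated))]
    return [(1.0 - zi) * si + zi * ci for zi, si, ci in zip(z, slot, cand)]

random.seed(0)
DIM, SLOTS, STEPS = 4, 2, 3  # hypothetical sizes
params = {k: [[random.gauss(0, 0.1) for _ in range(DIM)] for _ in range(DIM)]
          for k in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh")}

# The memory has a fixed size (SLOTS x DIM) no matter how many steps run.
memory = [[0.0] * DIM for _ in range(SLOTS)]
for _ in range(STEPS):
    # Stand-in for what a Mixer would read from backbone activations.
    reads = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(SLOTS)]
    memory = [gru_style_update(s, x, params) for s, x in zip(memory, reads)]

print(len(memory), len(memory[0]))
```

The point of the sketch is only that the state carried between denoising steps keeps the same shape throughout, which is what "fixed-size" and "independent of sequence length" refer to in the abstract.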
These results demonstrate that persistent cross-step memory is an effective mechanism for bridging denoising steps and improving generation quality in discrete diffusion language models.
[70] PanCanBench: A Comprehensive Benchmark for Evaluating Large Language Models in Pancreatic Oncology
Yimin Zhao,Sheela R. Damle,Simone E. Dekker,Scott Geng,Karly Williams Silva,Jesse J Hubbard,Manuel F Fernandez,Fatima Zelada-Arenas,Alejandra Alvarez,Brianne Flores,Alexis Rodriguez,Stephen Salerno,Carrie Wright,Zihao Wang,Pang Wei Koh,Jeffrey T. Leek
Main category: cs.CL
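TL;DR: This paper introduces PanCanBench, described as the first clinical LLM evaluation benchmark built from real questions asked by pancreatic cancer patients, with 3,130 expert-written rubric criteria. An evaluation of 22 mainstream models finds large differences in clinical completeness (46.5% to 82.3%) and in hallucination rate (6.0% to 53.8%); newer reasoning-optimized models do not necessarily improve factuality, and web search does not necessarily improve results.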
Details
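Motivation: Existing medical LLM evaluation frameworks (e.g., HealthBench) rely on simulated questions and lack disease-specific depth, and high rubric-based scores do not guarantee factual correctness; complex clinical settings such as pancreatic cancer call for more realistic and more rigorous evaluation. Method: A human-in-the-loop pipeline takes de-identified real patient questions provided by PanCAN and has clinical experts write question-specific rubrics, yielding PanCanBench (282 questions, 3,130 criteria); an LLM-as-a-judge framework evaluates 22 LLMs on clinical completeness, factual accuracy, and web-search integration; the study also compares human-written with synthetic rubrics and web search enabled with web search disabled. Result: Clinical completeness scores vary widely across the 22 models (46.5% to 82.3%); hallucination rates span a very wide range (6.0% for Gemini-2.5 Pro and GPT-4o, 53.8% for Llama-3.1-8B); o3 has the highest completeness but lower factuality than other GPT models; with web search enabled, scores for Gemini-2.5 Pro and GPT-5 drop; synthetic rubrics inflate absolute scores by 17.9 points, while relative rankings remain largely stable. Conclusion: PanCanBench shows that current LLMs carry serious factuality risks and unstable performance when answering real pancreatic cancer questions; chasing high rubric scores or enabling external retrieval is no substitute for strict verification of factual accuracy; future development and evaluation of clinical LLMs should be centered on real patient questions and expert-led, fine-grained criteria. Abstract: Large language models (LLMs) have achieved expert-level performance on standardized examinations, yet multiple-choice accuracy poorly reflects real-world clinical utility and safety. As patients and clinicians increasingly use LLMs for guidance on complex conditions such as pancreatic cancer, evaluation must extend beyond general medical knowledge. Existing frameworks, such as HealthBench, rely on simulated queries and lack disease-specific depth. Moreover, high rubric-based scores do not ensure factual correctness, underscoring the need to assess hallucinations. We developed a human-in-the-loop pipeline to create expert rubrics for de-identified patient questions from the Pancreatic Cancer Action Network (PanCAN). The resulting benchmark, PanCanBench, includes 3,130 question-specific criteria across 282 authentic patient questions. We evaluated 22 proprietary and open-source LLMs using an LLM-as-a-judge framework, measuring clinical completeness, factual accuracy, and web-search integration. Models showed substantial variation in rubric-based completeness, with scores ranging from 46.5% to 82.3%. Factual errors were common, with hallucination rates (the percentages of responses containing at least one factual error) ranging from 6.0% for Gemini-2.5 Pro and GPT-4o to 53.8% for Llama-3.1-8B.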
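The hallucination rate is defined above as the percentage of responses containing at least one factual error. A minimal sketch of that arithmetic follows; the function name and the per-response error counts are hypothetical illustrations, not data or code from the benchmark.

```python
from typing import Sequence

def hallucination_rate(errors_per_response: Sequence[int]) -> float:
    """Percentage of responses that contain at least one factual error."""
    if not errors_per_response:
        raise ValueError("need at least one response")
    flagged = sum(1 for n in errors_per_response if n > 0)
    return 100.0 * flagged / len(errors_per_response)

# Hypothetical judge output: number of factual errors found in each response.
example_counts = [0, 2, 0, 1, 0, 0, 0, 3, 0, 0]
print(hallucination_rate(example_counts))  # 3 of the 10 responses are flagged
```

Note that under this definition a response with three errors counts the same as a response with one, so the rate measures how often responses are affected rather than how many errors occur in total.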
Importantly, newer reasoning-optimized models did not consistently improve factuality: although o3 achieved the highest rubric score, it produced inaccuracies more frequently than other GPT-family models. Web-search integration did not inherently guarantee better responses. The average score changed from 66.8% to 63.9% for Gemini-2.5 Pro and from 73.8% to 72.8% for GPT-5 when web search was enabled. Synthetic AI-generated rubrics inflated absolute scores by 17.9 points on average while generally maintaining similar relative ranking.
[71] Toward Graph-Tokenizing Large Language Models with Reconstructive Graph Instruction Tuning
Zhongjian Zhang,Xiao Wang,Mengmei Zhang,Jiarui Tan,Chuan Shi
Main category: cs.CL
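TL;DR: This paper proposes RGLM, which strengthens graph-text alignment by reconstructing graph information, addressing the way existing Graph-Tokenizing LLMs underuse graph context because they rely on text supervision alone.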
Details
Motivation: Existing Graph-Tokenizing LLMs rely only on text supervision from language instructions, which yields implicit alignment and a text-dominant bias and leaves graph structure underexploited. Method: The authors propose RGLM, a reconstructive graph instruction tuning framework with three variants: RGLM-Decoder, which works in the input space, and RGLM-Similarizer and RGLM-Denoiser, which work in the latent space; they also analyze the alignment effectiveness of each from an information-theoretic perspective. Result: Experiments on multiple benchmarks and task scenarios validate RGLM, showing marked gains in graph-text alignment quality and in generalization on graph tasks. Conclusion: By adding an explicit graph-supervised reconstruction mechanism, RGLM mitigates the text-dominant bias and opens a new direction for alignment research on graph-tokenizing LLMs. Abstract: The remarkable success of large language models (LLMs) has motivated researchers to adapt them as universal predictors for various graph-related tasks, with the ultimate goal of developing a graph foundation model that generalizes across diverse scenarios. The key challenge is to align graph data with language spaces so that LLMs can better comprehend graphs. As a popular paradigm, Graph-Tokenizing LLMs (GTokenLLMs) encode complex structures and lengthy texts into a graph token sequence, and then align them with text tokens via language instruction tuning. Despite their initial success, our information-theoretic analysis reveals that existing GTokenLLMs rely solely on text supervision from language instructions, which achieves only implicit graph-text alignment, resulting in a text-dominant bias that underutilizes graph context. To overcome this limitation, we first prove that the alignment objective is upper-bounded by the mutual information between the input graphs and their hidden representations in the LLM, which motivates us to improve this upper bound to achieve better alignment. To this end, we further propose a reconstructive graph instruction tuning pipeline, RGLM. Our key idea is to reconstruct the graph information from the LLM's graph token outputs, explicitly incorporating graph supervision to constrain the alignment process. Technically, we embody RGLM by exploring three distinct variants from two complementary perspectives: RGLM-Decoder from the input space; RGLM-Similarizer and RGLM-Denoiser from the latent space. Additionally, we theoretically analyze the alignment effectiveness of each variant.
Extensive experiments on various benchmarks and task scenarios validate the effectiveness of the proposed RGLM, paving the way for new directions in GTokenLLMs' alignment research.
[72] Quantifying Conversational Reliability of Large Language Models under Multi-Turn Interaction
Jiyoon Myung
Main category: cs.CL
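TL;DR: This paper systematically evaluates the reliability of large language models (LLMs) in realistic multi-turn conversations. Performance on maintaining global constraints, recognizing intent, and tracking structured entities degrades markedly as the number of turns grows, especially for smaller models. The study identifies typical failure modes (instruction drift, intent confusion, and contextual overwriting) and calls for better stress testing and evaluation methods for conversational reliability.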
Details
Motivation: Although large language models (LLMs) are widely deployed in real applications, their reliability in long, mixed-topic, multi-turn conversations that depend on earlier context has not been systematically understood or evaluated. Method: The author designs three representative tasks that reflect practical interaction challenges (maintaining global constraints, selecting a tool or agent, and tracking structured entities), pairs a single-turn and a multi-turn setting for each, systematically evaluates several commercial and open-source LLMs, and runs an error analysis to identify typical failure modes. Result: Every model tested shows a marked drop in reliability in the multi-turn setting, with smaller models degrading more; three frequent failure modes are identified: instruction drift, intent confusion, and contextual overwriting. Conclusion: Current LLMs are not reliable enough in realistic multi-turn dialogue; more demanding conversational reliability benchmarks, robust evaluation methods, and model improvements are needed to support trustworthy deployment. Abstract: Large Language Models (LLMs) are increasingly deployed in real-world applications where users engage in extended, mixed-topic conversations that depend on prior context. Yet, their reliability under realistic multi-turn interactions remains poorly understood. We conduct a systematic evaluation of conversational reliability through three representative tasks that reflect practical interaction challenges: (1) maintaining global constraints across topic shifts, (2) selecting the correct tool or agent amid interleaved intents, and (3) tracking structured entities under revisions and distractions. Each task pairs single-turn and multi-turn settings, allowing us to quantify reliability degradation under extended dialogue. Across both commercial and open-source models, we observe substantial declines in reliability, particularly for smaller models. Error analyses reveal recurring failure modes such as instruction drift, intent confusion, and contextual overwriting, which compromise dependable behavior in operational systems. Our findings highlight the need for stress-testing LLMs for conversational reliability and developing more robust evaluation methods for trustworthy deployment.
[73] LaSER: Internalizing Explicit Reasoning into Latent Space for Dense Retrieval
Jiajie Jin,Yanzhao Zhang,Mingxin Li,Dingkun Long,Pengjun Xie,Yutao Zhu,Zhicheng Dou
Main category: cs.CL
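TL;DR: This paper proposes LaSER, a framework that uses self-distillation to internalize explicit chain-of-thought (CoT) reasoning into the latent space of a dense retriever, enabling implicit "silent thinking" without autoregressive text generation and combining reasoning depth with retrieval efficiency.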
Details
Motivation: Most existing LLM-based dense retrievers use the LLM as a static encoder and do not exploit its reasoning ability, while explicit rewrite-then-retrieve pipelines bring in CoT at the cost of excessive latency. Method: LaSER is a self-distillation framework on a shared LLM backbone with a dual-view training mechanism (an explicit view that encodes ground-truth reasoning paths and a latent view that performs latent thinking) and a multi-grained alignment strategy comprising output alignment and a new trajectory alignment that synchronizes the intermediate states of the latent path with the semantic progression of the explicit reasoning. Result: LaSER clearly outperforms the state of the art on in-domain and out-of-domain reasoning-intensive benchmarks; its robustness holds across backbones and model scales; it unites the reasoning depth of explicit CoT with the inference efficiency of standard dense retrieval. Conclusion: LaSER internalizes explicit reasoning into the latent space of a dense retriever and shows that a unified learning framework is essential for eliciting effective latent reasoning, offering a new paradigm for efficient, reasoning-aware retrieval. Abstract: LLMs have fundamentally transformed dense retrieval, upgrading backbones from discriminative encoders to generative architectures. However, a critical disconnect remains: while LLMs possess strong reasoning capabilities, current retrievers predominantly utilize them as static encoders, leaving their potential for complex reasoning unexplored. To address this, existing approaches typically adopt rewrite-then-retrieve pipelines to generate explicit CoT rationales before retrieval. However, this incurs prohibitive latency. In this paper, we propose LaSER, a novel self-distillation framework that internalizes explicit reasoning into the latent space of dense retrievers. Operating on a shared LLM backbone, LaSER introduces a dual-view training mechanism: an Explicit view that explicitly encodes ground-truth reasoning paths, and a Latent view that performs implicit latent thinking. To bridge the gap between these views, we design a multi-grained alignment strategy. Beyond standard output alignment, we introduce a trajectory alignment mechanism that synchronizes the intermediate latent states of the latent path with the semantic progression of the explicit reasoning segments. This allows the retriever to think silently and effectively without autoregressive text generation. Extensive experiments on both in-domain and out-of-domain reasoning-intensive benchmarks demonstrate that LaSER significantly outperforms state-of-the-art baselines.
Furthermore, analyses across diverse backbones and model scales validate the robustness of our approach, confirming that our unified learning framework is essential for eliciting effective latent thinking. Our method successfully combines the reasoning depth of explicit CoT pipelines with the inference efficiency of standard dense retrievers.
[74] Understanding the Physics of Key-Value Cache Compression for LLMs through Attention Dynamics
Samhruth Ananthanarayanan,Ayan Sengupta,Tanmoy Chakraborty
Main category: cs.CL
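TL;DR: This paper proposes a physics-inspired view of KV cache compression, arguing that compression is not only a storage problem but also a perturbation of attention routing. Using synthetic tasks, it finds a phase transition in semantic reachability near 90% compression, shows that architectures differ in routing dynamics, identifies a representational rigidity problem, and reframes KV compression as a probe of attention geometry.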
Details
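Motivation: Existing KV cache compression methods report high memory savings, but their evaluations overlook that attention is semantic routing rather than plain storage: retaining KV pairs does not guarantee semantic accessibility. Method: The authors propose a physics-inspired framework for KV compression that distinguishes retention, accessibility, and utilization; design synthetic tasks covering multi-entity tracking, disambiguation, coreference resolution, and multi-hop reasoning; use the Global Eviction Ratio (GER) to quantify routing perturbation; and analyze attention-head dynamics in different LLMs (e.g., LLaMA, Qwen). Result: Under moderate compression, internal representations degrade with little accuracy loss, indicating redundancy; all models show a hallucination safety cliff at about 90% compression that correlates strongly with spikes in GER; LLaMA and Qwen follow very different routing patterns; and a "representational rigidity" effect is identified, in which excessive head-level consensus collapses routing flexibility. Conclusion: Compression tolerance is governed by sparse token-route structures; KV compression should be treated as a probe of attention geometry; long-context scalability is tied to sparsity and the lottery ticket hypothesis in self-attention. Abstract: As context windows in LLMs scale to 100K+ tokens, the key-value (KV) cache becomes the dominant memory bottleneck, with recent methods claiming 80-90% savings and minimal benchmark degradation. We argue these evaluations miss a structural issue: attention is not just storage but routing, and retaining KV pairs does not guarantee semantic accessibility. We propose a physics-inspired view of KV compression as a controlled perturbation of token-level routing, distinguishing retention, accessibility, and utilization. Using synthetic tasks probing multi-entity tracking, disambiguation, coreference, and multi-hop reasoning, we find that moderate compression degrades internal representations with little accuracy loss, revealing redundancy; all models exhibit a sharp hallucination safety cliff near 90% compression, correlated with spikes in Global Eviction Ratio (GER), suggesting a phase transition in semantic reachability; and architectures differ in routing dynamics, with LLaMA showing early consensus and late diversification, and Qwen showing funnel-like late convergence, leading to distinct resilience profiles. Beyond erasure, we identify representational rigidity, where excessive head-level consensus collapses routing flexibility despite token survival.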
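To make the compression ratios above concrete, here is a minimal sketch of a generic score-based eviction policy: cache entries are ranked by an importance score and the lowest-scored fraction is dropped. This is an assumption for illustration only. The abstract does not say which compression methods were studied or how GER is computed, and the function name and scores below are hypothetical.

```python
def evict_kv(scores, compression):
    """Return the sorted indices of cache entries kept after eviction.

    scores:      one importance score per cached token (hypothetical values,
                 e.g. accumulated attention mass).
    compression: fraction of entries to drop, 0.0 <= compression < 1.0.
    """
    if not 0.0 <= compression < 1.0:
        raise ValueError("compression must be in [0, 1)")
    # Always keep at least one entry.
    n_keep = max(1, round(len(scores) * (1.0 - compression)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])

scores = [0.30, 0.01, 0.05, 0.02, 0.25, 0.03, 0.04, 0.10, 0.15, 0.06]
print(evict_kv(scores, 0.5))  # half of the ten entries survive
print(evict_kv(scores, 0.9))  # a single entry survives
```

A sketch like this only captures retention, that is, which tokens survive. The abstract's argument is that survival alone does not show whether the model can still route attention to the information it needs, and this code does not measure that.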
These results suggest sparse token-route structures govern compression tolerance, reframing KV compression as a structural probe of attention geometry and linking long-context scalability to sparsity and the lottery ticket hypothesis in self-attention.
[75] Enhancing Persona Following at Decoding Time via Dynamic Importance Estimation for Role-Playing Agents
Yuxin Liu,Mingye Zhu,Siyuan Liu,Bo Hu,Lei Zhang
Main category: cs.CL
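TL;DR: This paper proposes Persona Dynamic Decoding (PDD), a psychology-grounded method for role-playing language agents that estimates context-dependent persona importance and applies weighted reward-guided decoding at inference time, improving the realism and consistency of agent behavior in social simulation.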
Details
Motivation: Static prompt engineering and fine-tuning cannot adapt to the way a persona's influence on behavior changes across dynamic scenarios, and the Cognitive-Affective Personality Systems theory in psychology holds that this influence is context-dependent, so an adaptive persona management mechanism is needed. Method: The Persona Dynamic Decoding (PDD) framework has two modules: (1) Persona Importance Estimation (PIE), which dynamically quantifies the contextual importance of persona attributes without supervision; and (2) Persona-Guided Inference-Time Alignment (PIA), which uses the importance scores to build weighted multi-objective rewards and modulate generation probabilities during inference. Result: Extensive experiments show that the method clearly outperforms baselines in utterance consistency and behavioral fidelity. Conclusion: PDD is a theory-driven approach that needs no additional annotation and adjusts persona influence dynamically at inference time, improving the realism of role-playing language agents in social simulation. Abstract: The utility of Role-Playing Language Agents in sociological research is growing alongside the adoption of Large Language Models. For realism in social simulation, these agents must adhere to their personas defined by character profiles, yet existing strategies (static prompt engineering or costly fine-tuning) fail to adapt personas to dynamic scenarios. Psychological theories, such as the Cognitive-Affective Personality Systems, provide a crucial explanation for this failure: a persona's influence on behavior is not static but varies with the scenarios. This context-dependence highlights the critical need for adaptive persona management. To address this gap, we propose a novel, theory-driven method that dynamically estimates context-dependent persona importance and integrates it into weighted reward-guided decoding, enabling inference-time persona following. Specifically, we introduce the Persona Dynamic Decoding (PDD) framework, which consists of two key components: (1) a Persona Importance Estimation (PIE) module, which dynamically quantifies the contextual importance of persona attributes without requiring ground-truth supervision; and (2) a Persona-Guided Inference-Time Alignment (PIA) paradigm, which leverages these importance scores to construct weighted multi-objective rewards and modulate generation probabilities during inference. Extensive experiments show the effectiveness of our method in utterance consistency and behavioral fidelity.
[76] Anatomy of the Modality Gap: Dissecting the Internal States of End-to-End Speech LLMs
Ming-Hao Hsu,Xueyao Zhang,Xiaohai Tian,Jun Zhang,Zhizheng Wu
Main category: cs.CL
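TL;DR: This paper studies the performance gap between speech and text inputs in large speech-language models. It finds that the gap stems from the redundancy of speech representations and their cross-layer alignment behavior rather than from a simple distribution shift, and it argues for modeling at token or temporal granularity instead of feature-level matching.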
Details
Motivation: Despite progress in large speech-language models, tasks with speech input still lag well behind direct text inference, and the dynamic roots of this modality gap need closer study. Method: The authors use cross-layer centered kernel alignment (CKA) analysis with speech-text token alignment to evaluate four open-weight end-to-end models on SpeechMMLU and VoiceBench BBH, examining how representations evolve layer by layer and how stable the alignment is. Result: Speech representations show a broad cross-layer alignment band that stems from the redundancy of speech frames; the pattern is structurally stable; statistical calibration at the input layer is ineffective or even harmful, indicating that the modality gap is not a distribution shift. Conclusion: The bottleneck lies in condensing redundant speech into stable late-layer decisions; future work should focus on modeling mechanisms at token or temporal granularity. Abstract: Recent advancements in Large Speech-Language Models have significantly bridged the gap between acoustic signals and linguistic understanding. However, a persistent performance disparity remains in speech-based input tasks compared to direct text inference. In this paper, we investigate the dynamic roots of this modality gap beyond static geometric alignment, analyzing how speech and text representations evolve layer-by-layer. We evaluate four open-weight end-to-end models on SpeechMMLU and VoiceBench BBH. Using cross-layer CKA analysis with speech-text token alignment, we find that speech representations exhibit a broad cross-layer alignment band, attributable to the redundant nature of speech where semantic content spans multiple frames. We show that these alignment patterns are structurally stable across different analysis configurations. Crucially, simple statistical calibration is insufficient and can be detrimental when applied at the input layer, indicating that the modality gap is not a mere distribution shift. Overall, our results suggest that the bottleneck lies in condensing redundant speech into stable late-layer decisions, motivating future solutions that operate at the token or temporal granularity instead of feature-level matching.
[77] Extracting Training Dialogue Data from Large Language Model based Task Bots
Shuo Zhang,Junzhou Zhao,Junji Hou,Pinghui Wang,Chenxu Wang,Jing Tao
Main category: cs.CL
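TL;DR: This paper systematically studies the privacy risk that arises when large language models (LLMs) used in task-oriented dialogue systems (TODS) memorize training data. It proposes data extraction attacks tailored to LLM-based TODS that extract thousands of dialogue-state labels with high precision, and it analyzes the key factors behind memorization along with mitigation strategies.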
Details
Motivation: LLMs act as soft knowledge bases and readily memorize training dialogues that contain sensitive information, but how this memorization is inherited and exposed in TODS has not been studied systematically. Method: The authors run a systematic quantitative study: they evaluate existing training data extraction attacks, analyze which characteristics of TODS modeling make those methods ineffective, and propose new attack techniques for LLM-based TODS that improve response sampling and membership inference. Result: The proposed attack extracts thousands of dialogue-state training labels with high precision (above 70% in the best case), and the study identifies and quantifies the key factors that influence training data memorization in LLM-based TODS. Conclusion: LLM-based TODS carry a significant and exploitable memorization privacy risk; targeted mitigations should be designed with model architecture, training strategy, and deployment in mind. Abstract: Large Language Models (LLMs) have been widely adopted to enhance Task-Oriented Dialogue Systems (TODS) by modeling complex language patterns and delivering contextually appropriate responses. However, this integration introduces significant privacy risks, as LLMs, functioning as soft knowledge bases that compress extensive training data into rich knowledge representations, can inadvertently memorize training dialogue data containing not only identifiable information such as phone numbers but also entire dialogue-level events like complete travel schedules. Despite the critical nature of this privacy concern, how LLM memorization is inherited in developing task bots remains unexplored. In this work, we address this gap through a systematic quantitative study that involves evaluating existing training data extraction attacks, analyzing key characteristics of task-oriented dialogue modeling that render existing methods ineffective, and proposing novel attack techniques tailored for LLM-based TODS that enhance both response sampling and membership inference. Experimental results demonstrate the effectiveness of our proposed data extraction attack. Our method can extract thousands of training labels of dialogue states with best-case precision exceeding 70%. Furthermore, we provide an in-depth analysis of training data memorization in LLM-based TODS by identifying and quantifying key influencing factors and discussing targeted mitigation strategies.
[78] Markovian ODE-guided scoring can assess the quality of offline reasoning traces in language models
Arghodeep Nandi,Ojasva Saxena,Tanmoy Chakraborty
Main category: cs.CL
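TL;DR: This paper proposes MarODE, an offline evaluation framework that models the dynamics of reasoning traces with a Markov process and ordinary differential equations, substantially improving human-centric and generalizable assessment of reasoning quality in generative language models.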
Details
Motivation: Existing methods for evaluating reasoning traces are overly mechanical, fail to capture reasoning quality as humans judge it, and generalize poorly, especially when reasoning quality degrades progressively. Method: MarODE models reasoning as a Markov chain and characterizes the dynamics of the reasoning trace with ordinary differential equations; human-centric perturbations and human ratings are used together to assess the goodness and soundness of the evaluation metric. Result: In a large-scale evaluation, MarODE improves on existing baselines by more than 250% under Somers' D correlation. Conclusion: Theory-driven evaluation frameworks such as MarODE are valuable for making the assessment of language-model reasoning more reliable and practical, particularly as reasoning traces become a core component of model-based systems. Abstract: Reasoning traces produced by generative language models are increasingly used for tasks ranging from mathematical problem solving to automated fact checking. However, existing evaluation methods remain largely mechanical and fail to capture human-centric notions of reasoning quality in a way that generalizes across varied and progressively degraded reasoning. We introduce MarODE, an offline evaluation framework that assigns quality scores to reasoning traces. Its effectiveness is assessed using human-centric perturbations and human judgments, which jointly evaluate the fundamental dimensions of an evaluation metric: goodness and soundness. The approach is grounded in a Markovian formulation of reasoning progression and an ordinary differential equation based characterization of trace dynamics, enabling efficient evaluation of reasoning quality. In a large-scale evaluation, MarODE outperforms existing baselines by over 250% under Somers' D correlation. Our results emphasize the value of theory-driven evaluation frameworks as reasoning traces become central to language model-based systems.
[79] More Data, Fewer Diacritics: Scaling Arabic TTS
Ahmed Musleh,Yifan Zhang,Kareem Darwish
Main category: cs.CL
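TL;DR: This paper presents a robust pipeline for building a large, automatically annotated Arabic TTS dataset of about 4,000 hours and shows that, without diacritization, more training data compensates for much of the performance loss. The authors plan to release an Arabic TTS model that does not need diacritization.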
Details
Motivation: Arabic TTS research is held back by the lack of public training data and of accurate diacritization models. Method: The authors build an automated pipeline of voice activity detection, speech recognition, automatic diacritization, and noise filtering to collect and process Arabic recordings, producing about 4,000 hours of TTS training data; they then train several TTS models with voice cloning and compare data scales (100, 1,000, and 4,000 hours) with and without diacritization. Result: Models trained on diacritized data are better overall, but increasing the amount of training data (e.g., to 4,000 hours) compensates for much of the loss from missing diacritics. Conclusion: Large-scale automatically annotated data eases the dependence of Arabic TTS on diacritization and offers a practical route to lightweight, usable Arabic TTS systems; the authors plan to release a public TTS model that works without diacritization. Abstract: Arabic Text-to-Speech (TTS) research has been hindered by the scarcity of both publicly available training data and accurate Arabic diacritization models. In this paper, we address the limitation by exploring Arabic TTS training on large automatically annotated data. Namely, we built a robust pipeline for collecting Arabic recordings and processing them automatically using voice activity detection, speech recognition, automatic diacritization, and noise filtering, resulting in around 4,000 hours of Arabic TTS training data. We then trained several robust TTS models with voice cloning using varying amounts of data, namely 100, 1,000, and 4,000 hours with and without diacritization. We show that though models trained on diacritized data are generally better, larger amounts of training data compensate for the lack of diacritics to a significant degree. We plan to release a public Arabic TTS model that works without the need for diacritization.
[80] Measuring What VLMs Don't Say: Validation Metrics Hide Clinical Terminology Erasure in Radiology Report Generation
Aditya Parikh,Aasa Feragen,Sneha Das,Stella Frank
Main category: cs.CL
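TL;DR: This paper identifies a blind spot in how vision-language models (VLMs) are evaluated in radiology: high token-overlap scores can hide templated generations that lack clinical terminology. It proposes assessing clinical specificity through lexical diversity and introduces two new measures, Clinical Association Displacement (CAD) and Weighted Association Erasure (WAE), which expose the trade-off that different decoding strategies make between preserving clinical information and fairness.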
Details
Motivation: Current VLM evaluation relies on surface text similarity (such as token overlap), which templated generations can game; it overlooks missing clinical terminology and demographic bias, leading to metric gaming and clinically unreliable models. Method: The authors propose Clinical Association Displacement (CAD), a vocabulary-level framework that quantifies shifts in word associations across demographic groups in generated reports, and Weighted Association Erasure (WAE), which aggregates those shifts to measure clinical signal loss; they compare how deterministic decoding and stochastic sampling affect diversity, clinical content, and fairness. Result: Deterministic decoding causes substantial semantic erasure (loss of clinical information), while stochastic sampling increases diversity but may introduce new bias; CAD and WAE reveal the effect of the decoding strategy on both clinical fidelity and demographic fairness. Conclusion: Evaluation should move away from chasing high token overlap toward a framework centered on lexical diversity and clinical specificity that also accounts for demographic fairness; the "optimal" report needs to be redefined as one that is clinically accurate, rich in terminology, and robust across groups. Abstract: Reliable deployment of Vision-Language Models (VLMs) in radiology requires validation metrics that go beyond surface-level text similarity to ensure clinical fidelity and demographic fairness. This paper investigates a critical blind spot in current model evaluation: the use of decoding strategies that lead to high aggregate token-overlap scores despite succumbing to template collapse, in which models generate only repetitive, safe generic text and omit clinical terminology. Unaddressed, this blind spot can lead to metric gaming, where models that perform well on benchmarks prove clinically uninformative. Instead, we advocate for lexical diversity measures to check model generations for clinical specificity. We introduce Clinical Association Displacement (CAD), a vocabulary-level framework that quantifies shifts in demographic-based word associations in generated reports. Weighted Association Erasure (WAE) aggregates these shifts to measure the clinical signal loss across demographic groups. We show that deterministic decoding produces high levels of semantic erasure, while stochastic sampling generates diverse outputs but risks introducing new bias, motivating a fundamental rethink of how "optimal" reporting is defined.
[81] Learning to Draft: Adaptive Speculative Decoding with Reinforcement Learning
Jiebin Zhang,Zhenghan Yu,Liang Wang,Nan Yang,Eugene J. Yu,Zheng Li,Yifan Song,Dawei Zhu,Xingxing Zhang,Furu Wei,Sujian Li
Main category: cs.CL
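TL;DR: This paper proposes Learning to Draft (LTD), which uses reinforcement learning to jointly optimize the dynamic coordination of the draft and verification phases with throughput as the direct objective, markedly improving the efficiency of speculative decoding for LLMs.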
Details
Motivation: Existing speculative decoding methods either allocate time statically or optimize only proxy metrics such as acceptance length, ignoring the true time cost and treating the draft and verification phases separately. Method: Speculative decoding is formulated as a reinforcement learning environment in which two co-adaptive policies are trained to coordinate draft generation and target-model verification dynamically, with the throughput of each draft-and-verify cycle as the objective. Result: Across five LLMs and four tasks, LTD achieves speedups of 2.24x to 4.32x and improves on the state-of-the-art method Eagle3 by up to 36.4%. Conclusion: By jointly optimizing the draft and verification policies end to end, LTD raises the real inference throughput of speculative decoding and offers a new approach to efficient LLM inference. Abstract: Speculative decoding accelerates large language model (LLM) inference by using a small draft model to generate candidate tokens for a larger target model to verify. The efficacy of this technique hinges on the trade-off between the time spent on drafting candidates and verifying them. However, current state-of-the-art methods rely on a static time allocation, while recent dynamic approaches optimize for proxy metrics like acceptance length, often neglecting the true time cost and treating the drafting and verification phases in isolation. To address these limitations, we introduce Learning to Draft (LTD), a novel method that directly optimizes for throughput of each draft-and-verify cycle. We formulate the problem as a reinforcement learning environment and train two co-adaptive policies to dynamically coordinate the draft and verification phases. This encourages the policies to adapt to each other and explicitly maximize decoding efficiency. We conducted extensive evaluations on five diverse LLMs and four distinct tasks. Our results show that LTD achieves speedup ratios ranging from 2.24x to 4.32x, outperforming the state-of-the-art method Eagle3 by up to 36.4%.
[82] LexChronos: An Agentic Framework for Structured Event Timeline Extraction in Indian Jurisprudence
Anka Chandrahas Tummepalli,Preethu Rose Anish
Main category: cs.CL
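TL;DR: This paper proposes LexChronos, a dual-agent framework that extracts structured event timelines from Supreme Court of India judgments. The model is trained on a synthetic dataset, and the structured timelines markedly improve LLM comprehension and reasoning on downstream tasks such as legal text summarization.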
Details
Motivation: Traditional approaches treat legal documents as unstructured text, which limits how well LLMs perform on legal tasks; Indian legal event data is also scarce, so a structured processing framework suited to the local judicial context is needed. Method: LexChronos is a dual-agent framework in which a LoRA-tuned extraction agent identifies candidate events and a pre-trained feedback agent scores and refines them through a confidence-driven loop; a synthetic dataset of 2,000 Indian legal event samples is built by reverse engineering with DeepSeek-R1 and GPT-4. Result: The pipeline reaches a BERT-based F1 of 0.8751 on the synthetic benchmark; in the downstream summarization task, GPT-4 prefers the structured timeline input in 75% of cases, supporting improved comprehension of and reasoning over Indian case law. Conclusion: LexChronos provides a foundation, based on structured event representations, for Indian legal AI applications such as precedent mapping, argument synthesis, and judgment prediction. Abstract: Understanding and predicting judicial outcomes demands nuanced analysis of legal documents. Traditional approaches treat judgments and proceedings as unstructured text, limiting the effectiveness of large language models (LLMs) in tasks such as summarization, argument generation, and judgment prediction. We propose LexChronos, an agentic framework that iteratively extracts structured event timelines from Supreme Court of India judgments. LexChronos employs a dual-agent architecture: a LoRA-instruct-tuned extraction agent identifies candidate events, while a pre-trained feedback agent scores and refines them through a confidence-driven loop. To address the scarcity of Indian legal event datasets, we construct a synthetic corpus of 2000 samples using reverse-engineering techniques with DeepSeek-R1 and GPT-4, generating gold-standard event annotations. Our pipeline achieves a BERT-based F1 score of 0.8751 against this synthetic ground truth. In downstream evaluations on legal text summarization, GPT-4 preferred structured timelines over unstructured baselines in 75% of cases, demonstrating improved comprehension and reasoning in Indian jurisprudence. This work lays a foundation for future legal AI applications in the Indian context, such as precedent mapping, argument synthesis, and predictive judgment modelling, by harnessing structured representations of legal events.
[83] Beyond the Grid: Layout-Informed Multi-Vector Retrieval with Parsed Visual Document Representations
Yibo Yan,Mingdong Ou,Yi Cao,Xin Zou,Shuliang Liu,Jiahao Huo,Yu Huang,James Kwok,Xuming Hu
Main category: cs.CL
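TL;DR: This paper proposes ColParse, a new paradigm that uses a document parsing model to generate layout-informed sub-image embeddings and fuses them with a global page vector to form a compact, structure-aware multi-vector representation, sharply reducing storage while improving performance.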
Details
Motivation: Multi-vector architectures in Visual Document Retrieval (VDR) face a severe storage bottleneck, and current optimization strategies (such as embedding merging, pruning, or abstract tokens) cannot resolve it without sacrificing performance or ignoring important layout information. Method: ColParse uses a document parsing model to generate a small set of layout-informed sub-image embeddings and fuses them with a global page-level vector to form a compact, structure-aware multi-vector representation. Result: Experiments show that ColParse cuts storage requirements by more than 95% across multiple benchmarks and base models while markedly improving retrieval performance. Conclusion: ColParse bridges the gap between the fine-grained accuracy of multi-vector retrieval and the practical demands of large-scale deployment, offering a new path toward efficient and interpretable multimodal information systems. Abstract: Harnessing the full potential of visually-rich documents requires retrieval systems that understand not just text, but intricate layouts, a core challenge in Visual Document Retrieval (VDR). The prevailing multi-vector architectures, while powerful, face a crucial storage bottleneck that current optimization strategies, such as embedding merging, pruning, or using abstract tokens, fail to resolve without compromising performance or ignoring vital layout cues. To address this, we introduce ColParse, a novel paradigm that leverages a document parsing model to generate a small set of layout-informed sub-image embeddings, which are then fused with a global page-level vector to create a compact and structurally-aware multi-vector representation. Extensive experiments demonstrate that our method reduces storage requirements by over 95% while simultaneously yielding significant performance gains across numerous benchmarks and base models. ColParse thus bridges the critical gap between the fine-grained accuracy of multi-vector retrieval and the practical demands of large-scale deployment, offering a new path towards efficient and interpretable multimodal information systems.
[84] Surgical Post-Training: Cutting Errors, Keeping Knowledge
Wenye Lin,Kai Han
Main category: cs.CL
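TL;DR: This paper proposes Surgical Post-Training (SPoT), which combines data rectification with binary reward supervision to substantially improve LLM reasoning using very little data and training time, while mitigating catastrophic forgetting.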
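A minimal sketch of what a reward-based binary cross-entropy objective could look like for this entry. It assumes the reward is the standard DPO implicit reward, beta * (log pi(y|x) - log pi_ref(y|x)), and that the loss is a plain sigmoid cross-entropy on that reward; the function names, the value of `beta`, and the exact loss form are illustrative assumptions, not the authors' implementation.

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def implicit_reward(logp_policy, logp_ref, beta=0.1):
    # DPO-style implicit reward: scaled log-ratio of policy to reference.
    # beta=0.1 is an arbitrary illustrative value.
    return beta * (logp_policy - logp_ref)


def reward_bce_loss(logp_policy, logp_ref, is_correct, beta=0.1):
    """Binary cross-entropy on the implicit reward (hypothetical form).

    Each response is scored on its own: correct responses are pushed toward
    sigmoid(r) = 1 and incorrect ones toward sigmoid(r) = 0, instead of
    ranking a chosen response against a rejected one as DPO does.
    """
    r = implicit_reward(logp_policy, logp_ref, beta)
    # -log(sigmoid(r)) == softplus(-r); -log(1 - sigmoid(r)) == softplus(r)
    return softplus(-r) if is_correct else softplus(r)
```

Because each sample carries its own label, a correct response and an incorrect one contribute separate terms to the loss, which is one way to read the "decoupled supervision" mentioned in the abstract.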
Details
Motivation: Existing post-training methods for large language models face a trade-off between efficiency and catastrophic forgetting when improving reasoning; the authors find that an implicit regularization mechanism in DPO has been overlooked and can be exploited to mitigate forgetting. Method: Proposes the SPoT framework: (1) a data rectification pipeline in which an Oracle precisely corrects erroneous steps, producing data close to the model's distribution; (2) a reward-based binary cross-entropy objective that models reasoning correctness as a binary classification problem, giving decoupled supervision. Result: With only 4k rectified math data pairs and 28 minutes of training on 8x H800 GPUs, Qwen3-8B's average accuracy across in-domain and OOD tasks improves by 6.2%. Conclusion: SPoT is an efficient, low-forgetting paradigm for strengthening reasoning; it exposes and exploits the implicit regularization effect in preference optimization and suggests a new direction for lightweight post-training. Abstract: Enhancing the reasoning capabilities of Large Language Models (LLMs) via post-training is often constrained by the trade-off between efficiency and catastrophic forgetting. While prior research emphasizes the role of on-policy data in mitigating forgetting, we uncover, and validate both theoretically and empirically, an overlooked yet critical mechanism: the implicit regularization inherent in Direct Preference Optimization's (DPO) reward estimate. This motivates our Surgical Post-Training (SPoT), a new paradigm designed to optimize reasoning efficiently while preserving learned prior knowledge. SPoT consists of: (1) a data rectification pipeline that employs an Oracle to surgically correct erroneous steps via minimal edits, generating data proximal to the model's distribution; and (2) a reward-based binary cross-entropy objective. Unlike the relative ranking in DPO, this objective treats reasoning correctness as a binary classification problem, enforcing decoupled supervision signals. Empirically, with only 4k rectified math data pairs, SPoT improves Qwen3-8B's accuracy by 6.2% on average across in-domain and OOD tasks, requiring merely 28 minutes of training on 8x H800 GPUs. Code: https://github.com/Visual-AI/SPoT
[85] QIME: Constructing Interpretable Medical Text Embeddings via Ontology-Grounded Questions
Yixuan Tang,Zhenghong Lin,Yandong Sun,Anthony K. H. Tung
Main category: cs.CL
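TL;DR: This paper proposes QIME, a framework that uses ontology guidance to generate clinically meaningful yes/no questions and builds interpretable medical text embeddings from them without training a classifier per question, approaching black-box performance while providing concise clinical explanations.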
Details
Motivation: Dense biomedical embeddings perform well but lack interpretability, while question-based interpretable embeddings often rely on heuristic or surface-level contrastive signals and overlook specialized domain knowledge. Method: QIME is an ontology-grounded framework that maps each embedding dimension to a clinically meaningful yes/no question; it generates semantically atomic questions by conditioning on cluster-specific medical concept signatures and supports an embedding construction strategy that needs no per-question classifier training. Result: On biomedical semantic similarity, clustering, and retrieval benchmarks, QIME consistently outperforms prior interpretable embedding methods and substantially narrows the gap to strong black-box biomedical encoders. Conclusion: QIME achieves high interpretability while retaining strong performance, and its concise, clinically informative explanations make it more useful for clinical decision-making. Abstract: While dense biomedical embeddings achieve strong performance, their black-box nature limits their utility in clinical decision-making. Recent question-based interpretable embeddings represent text as binary answers to natural-language questions, but these approaches often rely on heuristic or surface-level contrastive signals and overlook specialized domain knowledge. We propose QIME, an ontology-grounded framework for constructing interpretable medical text embeddings in which each dimension corresponds to a clinically meaningful yes/no question. By conditioning on cluster-specific medical concept signatures, QIME generates semantically atomic questions that capture fine-grained distinctions in biomedical text. Furthermore, QIME supports a training-free embedding construction strategy that eliminates per-question classifier training while further improving performance. Experiments across biomedical semantic similarity, clustering, and retrieval benchmarks show that QIME consistently outperforms prior interpretable embedding methods and substantially narrows the gap to strong black-box biomedical encoders, while providing concise and clinically informative explanations.
[86] Building a Strong Instruction Language Model for a Less-Resourced Language
Domen Vreš,Tjaša Arčon,Timotej Petrič,Dario Vajda,Marko Robnik-Šikonja,Iztok Lebar Bajec
Main category: cs.CL
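TL;DR: This paper presents a methodology for adapting a large language model to a less-resourced language, using Slovene as the example, and releases GaMS3-12B, a 12-billion-parameter generative model for Slovene. Built with three-stage continual pre-training and two-stage supervised fine-tuning, the model outperforms open-source models of similar size on several Slovene evaluations and performs comparably to GPT-4o in the Slovene LLM arena.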
Details
Motivation: Mainstream open-source large language models are trained mainly on English, so they perform poorly on less-resourced languages and cultures; a systematic approach to effective local adaptation is needed. Method: Three-stage continual pre-training of Gemma 3 followed by two-stage supervised fine-tuning (SFT); the training data comprise 140B pretraining tokens in Slovene, English, Bosnian, Serbian, and Croatian, plus over 200 thousand English and Slovene SFT examples. Result: GaMS3-12B outperforms the 12B Gemma 3 on the Slovenian-LLM-Eval datasets, English-to-Slovene translation, and the Slovene LLM arena, and achieves a win rate of over 60% in the arena, where it performs comparably to GPT-4o. Conclusion: A systematic multi-stage training strategy can efficiently improve large-model performance on a less-resourced language; GaMS3-12B demonstrates the approach and offers a reusable recipe for adapting to other such languages. Abstract: Large language models (LLMs) have become an essential tool for natural language processing and artificial intelligence in general. Current open-source models are primarily trained on English texts, resulting in poorer performance on less-resourced languages and cultures. We present a set of methodological approaches necessary for the successful adaptation of an LLM to a less-resourced language, and demonstrate them using the Slovene language. We present GaMS3-12B, a generative model for Slovene with 12 billion parameters, and demonstrate that it is the best-performing open-source model for Slovene within its parameter range. We adapted the model to the Slovene language using three-stage continual pre-training of the Gemma 3 model, followed by two-stage supervised fine-tuning (SFT). We trained the model on a combination of 140B Slovene, English, Bosnian, Serbian, and Croatian pretraining tokens, and over 200 thousand English and Slovene SFT examples. We evaluate GaMS3-12B on the Slovenian-LLM-Eval datasets, English-to-Slovene translation, and the Slovene LLM arena. We show that the described model outperforms 12B Gemma 3 across all three scenarios and performs comparably to much larger commercial GPT-4o in the Slovene LLM arena, achieving a win rate of over 60%.
[87] Legal RAG Bench: an end-to-end benchmark for legal RAG
Abdur-Rahman Butler,Umar Butler
Main category: cs.CL
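TL;DR: This paper introduces Legal RAG Bench, a benchmark and evaluation methodology for the end-to-end performance of legal RAG systems. It contains 4,876 passages from the Victorian Criminal Charge Book and 100 complex hand-crafted questions, and it uses a full factorial design with a hierarchical error decomposition framework to separate the contributions of retrieval and reasoning models. The experiments find that retrieval (Kanon 2 Embedder in particular) is the dominant factor in performance, outweighing the choice of LLM, and that many apparent hallucinations are in fact caused by retrieval failures.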
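A full factorial design, as named in this entry, evaluates every combination of retrieval model and LLM so that each factor's contribution can be averaged over all levels of the other. The sketch below shows that idea only; the level names and the scores used in the example are made-up placeholders, and `main_effect` is an illustrative helper, not the benchmark's own analysis code.

```python
from itertools import product


def full_factorial(*factors):
    """Return every combination of the given factor levels."""
    return list(product(*factors))


def main_effect(scores, factor_index, level):
    """Mean score with `level` minus mean score without it.

    `scores` maps a configuration tuple to a score. Averaging over all
    other factors is only meaningful because the design is fully crossed.
    """
    with_level = [s for cfg, s in scores.items() if cfg[factor_index] == level]
    without = [s for cfg, s in scores.items() if cfg[factor_index] != level]
    return sum(with_level) / len(with_level) - sum(without) / len(without)


# Placeholder levels and scores, invented purely for illustration.
configs = full_factorial(["embedder_a", "embedder_b"], ["llm_x", "llm_y"])
toy_scores = dict(zip(configs, [1.0, 3.0, 0.0, 2.0]))
print(main_effect(toy_scores, 0, "embedder_a"))  # effect of the embedder
print(main_effect(toy_scores, 1, "llm_y"))       # effect of the LLM
```

With three embedding models and two LLMs, as in the paper, the same helper would enumerate six configurations.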
Details
Motivation: Legal RAG systems lack a unified, rigorous end-to-end benchmark and evaluation method, which makes it hard to separate the contributions of the retrieval and generation modules and easy to mistake retrieval failures for LLM hallucinations. Method: Builds Legal RAG Bench, with 4,876 legal passages and 100 expert-level complex questions, each with a long-form answer and supporting passages; proposes an evaluation method based on a full factorial design and hierarchical error decomposition; evaluates three embedding models (Kanon 2 Embedder, Gemini Embedding 001, Text Embedding 3 Large) and two frontier LLMs (Gemini 3.1 Pro, GPT-5.2). Result: Retrieval, and Kanon 2 Embedder in particular, is the main driver of improvements in correctness (+17.5 points), groundedness (+4.5 points), and retrieval accuracy (+34 points); the LLM has a more moderate effect; many errors attributed to hallucination actually stem from retrieval failures, indicating that retrieval sets the ceiling on system performance. Conclusion: Legal RAG Bench offers a reproducible, decomposable evaluation framework for legal RAG built around realistic tasks; the findings argue for prioritizing legal-specific retrieval over LLM tuning alone, and all code and data are released. Abstract: We introduce Legal RAG Bench, a benchmark and evaluation methodology for assessing the end-to-end performance of legal RAG systems. As a benchmark, Legal RAG Bench consists of 4,876 passages from the Victorian Criminal Charge Book alongside 100 complex, hand-crafted questions demanding expert knowledge of criminal law and procedure. Both long-form answers and supporting passages are provided. As an evaluation methodology, Legal RAG Bench leverages a full factorial design and novel hierarchical error decomposition framework, enabling apples-to-apples comparisons of the contributions of retrieval and reasoning models in RAG. We evaluate three state-of-the-art embedding models (Isaacus' Kanon 2 Embedder, Google's Gemini Embedding 001, and OpenAI's Text Embedding 3 Large) and two frontier LLMs (Gemini 3.1 Pro and GPT-5.2), finding that information retrieval is the primary driver of legal RAG performance, with LLMs exerting a more moderate effect on correctness and groundedness. Kanon 2 Embedder, in particular, had the largest positive impact on performance, improving average correctness by 17.5 points, groundedness by 4.5 points, and retrieval accuracy by 34 points. We observe that many errors attributed to hallucinations in legal RAG systems are in fact triggered by retrieval failures, concluding that retrieval sets the ceiling for the performance of many modern legal RAG systems. We document why and how we built Legal RAG Bench alongside the results of our evaluations. We also openly release our code and data to assist with reproduction of our findings.
[88] Bootstrapping Embeddings for Low Resource Languages
Merve Basoz,Andrew Horne,Mattia Opper
Main category: cs.CL
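TL;DR: This paper investigates using large language models to generate synthetic triplet data for optimizing embedding models and proposes two new strategies, adapter composition and cross-lingual fine-tuning of the generator (XL-LoRA), which yield strong gains on multilingual tasks.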
Details
Motivation: High-resource languages such as English have abundant supervised fine-tuning data, but hundreds of other languages have none, and methods are needed to fill this gap. Method: Tests three strategies for generating synthetic triplet data: in-context learning, adapter composition, and a cross-lingually fine-tuned LLM generator (XL-LoRA). Result: Adapter composition and XL-LoRA clearly outperform baselines across many tasks and languages, whereas in-context learning still falls short of strong non-synthetic baselines. Conclusion: Adapter composition and XL-LoRA offer a clear, scalable path to building high-performing multilingual embedding models. Abstract: Embedding models are crucial to modern NLP. However, the creation of the most effective models relies on carefully constructed supervised finetuning data. For high resource languages, such as English, such datasets are readily available. However, for hundreds of other languages, they are simply non-existent. We investigate whether the advent of large language models can help to bridge this gap. We test three different strategies for generating synthetic triplet data used to optimise embedding models. These include in-context learning as well as two novel approaches, leveraging adapter composition and cross lingual finetuning of the LLM generator (XL-LoRA) respectively. We find that while in-context learning still falls short of strong non-synthetic baselines, adapter composition and XL-LoRA yield strong performance gains across a wide array of tasks and languages, offering a clear, scalable pathway to producing performant embedding models for a wide variety of languages.
[89] AnnoABSA: A Web-Based Annotation Tool for Aspect-Based Sentiment Analysis with Retrieval-Augmented Suggestions
Nils Constantin Hellwig,Jakob Fehle,Udo Kruschwitz,Christian Wolff
Main category: cs.CL
TL;DR: AnnoABSA is a customizable, web-based annotation tool for ABSA tasks, featuring LLM-powered RAG suggestions with human-in-the-loop control and few-shot learning from prior annotations.
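The retrieval-augmented few-shot step in this entry (pick the most similar already-annotated examples and place them in the prompt) can be sketched as follows. The abstract does not say which similarity measure the tool uses, so the bag-of-words cosine similarity here is a stand-in assumption, and the prompt layout and function names are hypothetical.

```python
import math
from collections import Counter


def bow(text):
    # Lowercased whitespace tokens as a bag of words (stand-in for embeddings).
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def top_k_examples(query, annotated, k=10):
    """Return the k annotated (text, annotation) pairs most similar to query."""
    q = bow(query)
    ranked = sorted(annotated, key=lambda ex: cosine(q, bow(ex[0])), reverse=True)
    return ranked[:k]


def build_prompt(query, annotated, k=10):
    """Assemble a few-shot prompt from the retrieved examples (hypothetical layout)."""
    shots = top_k_examples(query, annotated, k)
    parts = [f"Text: {text}\nAnnotation: {label}" for text, label in shots]
    parts.append(f"Text: {query}\nAnnotation:")
    return "\n\n".join(parts)
```

The default `k=10` mirrors the ten examples mentioned in the abstract; as the annotated pool grows, the retrieved shots become closer to the query, which is the mechanism the abstract credits for suggestions improving over time.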
Details
Motivation: To address the lack of integrated, flexible, and intelligent annotation tools for the full spectrum of ABSA tasks. Method: Design and implementation of a web-based annotation platform with configurable sentiment elements, human-in-the-loop LLM-based RAG suggestions, and similarity-based few-shot prompting using previously annotated examples. Result: A functional, open-source (MIT License), extensible annotation tool that improves suggestion accuracy over time via retrieval-augmented few-shot prompting. Conclusion: AnnoABSA effectively bridges manual annotation and AI assistance for ABSA, enhancing both annotation efficiency and consistency while remaining fully controllable by human annotators. Abstract: We introduce AnnoABSA, the first web-based annotation tool to support the full spectrum of Aspect-Based Sentiment Analysis (ABSA) tasks. The tool is highly customizable, enabling flexible configuration of sentiment elements and task-specific requirements. Alongside manual annotation, AnnoABSA provides optional Large Language Model (LLM)-based retrieval-augmented generation (RAG) suggestions that offer context-aware assistance in a human-in-the-loop approach, keeping the human annotator in control. To improve prediction quality over time, the system retrieves the ten most similar examples that are already annotated and adds them as few-shot examples in the prompt, ensuring that suggestions become increasingly accurate as the annotation process progresses. Released as open-source software under the MIT License, AnnoABSA is freely accessible and easily extendable for research and practical applications.
[90] Beyond the Resumé: A Rubric-Aware Automatic Interview System for Information Elicitation
Harry Stuart,Masahiro Kaneko,Timothy Baldwin
Main category: cs.CL
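TL;DR: This paper proposes using a large language model (LLM) as a subject matter expert that conducts automated interviews, updating its belief about a candidate's ability traits through structured question answering and thereby improving the quality of early-stage hiring decisions.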
Details
Motivation: Human interviews are expensive and hard to scale, while coarse screening methods such as resume scoring work from limited information and cannot assess candidates accurately. Method: Builds an LLM-based interview system that is guided by a rubric and, over multiple rounds of interactive questioning, updates a calibrated belief over the candidate's latent traits (such as technical or communication ability). Result: In simulated interviews, the system's belief converges toward the simulated applicants' constructed ability levels; the authors release code, an anonymised resume dataset, belief calibration tests, and simulated interview data. Conclusion: An LLM can act as a low-cost "virtual subject matter expert" that adds depth of information and reliability to early-stage candidate screening. Abstract: Effective hiring is integral to the success of an organisation, but it is very challenging to find the most suitable candidates because expert evaluation (e.g. interviews conducted by a technical manager) is expensive to deploy at scale. Therefore, automated resume scoring and other applicant-screening methods are increasingly used to coarsely filter candidates, making decisions on limited information. We propose that large language models (LLMs) can play the role of subject matter experts to cost-effectively elicit information from each candidate that is nuanced and role-specific, thereby improving the quality of early-stage hiring decisions. We present a system that leverages an LLM interviewer to update belief over an applicant's rubric-oriented latent traits in a calibrated way. We evaluate our system on simulated interviews and show that belief converges towards the simulated applicants' artificially-constructed latent ability levels. We release code, a modest dataset of public-domain/anonymised resumes, belief calibration tests, and simulated interviews, at https://github.com/mbzuai-nlp/beyond-the-resume. Our demo is available at https://btr.hstu.net.
[91] FreeAct: Freeing Activations for LLM Quantization
Xiaohao Liu,Xiaobo Xia,Manyi Zhang,Ji-Fu Li,Xianzhi Yu,Fei Shen,Xiu Su,See-Kiong Ng,Tat-Seng Chua
Main category: cs.CL
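TL;DR: This paper proposes FreeAct, a quantization framework that relaxes the static one-to-one transformation constraint: it assigns distinct activation transformation matrices to different token types (such as vision, text, or masked tokens) while keeping a single static transformation for the weights, improving quantization of dLLMs and MLLMs.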
Details
Motivation: Existing transformation-based quantization methods impose a rigid, static one-to-one mapping that cannot adapt to the differing activation distributions of different token types in dLLMs and MLLMs. Method: Exploits the rank-deficient nature of activations to derive an extended solution space that decouples activation transformations from weight transformations; the activation side receives a separate transformation matrix per token type, while the weight side keeps a unified static transformation. Result: Significantly outperforms baselines on dLLMs and MLLMs, with up to 5.3% performance improvement. Conclusion: By introducing token-specific activation transformations without making the weight transformation more complex, FreeAct improves quantization accuracy for large models, particularly multimodal and diffusion LLMs. Abstract: Quantization is pivotal for mitigating the significant memory and computational overhead of Large Language Models (LLMs). While emerging transformation-based methods have successfully enhanced quantization by projecting feature spaces onto smoother manifolds using orthogonal matrices, they typically enforce a rigid one-to-one transformation constraint. This static approach fails to account for the dynamic patterns inherent in input activations, particularly within diffusion LLMs (dLLMs) and Multimodal LLMs (MLLMs), where varying token types exhibit distinct distributions. To advance this, we propose FreeAct, a novel quantization framework that relaxes the static one-to-one constraint to accommodate dynamic activation disparities. Theoretically, we leverage the rank-deficient nature of activations to derive a solution space that extends beyond simple inverse matrices, enabling the decoupling of activation transformations from weights. Methodologically, FreeAct identifies token-specific dynamics (i.e., vision vs. text, or masked tokens) and allocates distinct transformation matrices to the activation side, while maintaining a unified, static transformation for the weights. Extensive experiments across dLLMs and MLLMs demonstrate that FreeAct significantly outperforms baselines, with up to 5.3% performance improvement, supported by in-depth analyses. Our code will be publicly released.
[92] LLM-as-an-Annotator: Training Lightweight Models with LLM-Annotated Examples for Aspect Sentiment Tuple Prediction
Nils Constantin Hellwig,Jakob Fehle,Udo Kruschwitz,Christian Wolff
Main category: cs.CL
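TL;DR: This paper proposes LA-ABSA, which uses annotations generated by a large language model (LLM) to fine-tune lightweight models, addressing the high cost of manual annotation in Aspect-Based Sentiment Analysis (ABSA). In low-resource settings it performs comparably to LLM prompting while consuming far less compute and energy.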
Details
Motivation: Aspect-Based Sentiment Analysis (ABSA) depends on large amounts of manually annotated data, which is expensive and time-consuming to obtain; a cheaper, more efficient alternative is needed. Method: Proposes the LA-ABSA framework: an LLM (such as Gemma-3-27B) annotates unlabeled data through few-shot in-context learning (ICL), and the resulting annotations are used to train lightweight downstream models for two complex ABSA tasks, TASD and ASQP. Result: Evaluated on five ABSA datasets, LA-ABSA in low-resource settings (for example, with only 50 annotated examples) comes close to ICL prompting (F1 of 49.85 for ASQP on SemEval Rest16 vs. 51.10 for Gemma-3-27B) while being substantially more energy efficient. Conclusion: LA-ABSA is an efficient, energy-saving way to obtain ABSA training data; it stays competitive while greatly reducing reliance on large-model inference, making ABSA deployment feasible in low-resource settings. Abstract: Training models for Aspect-Based Sentiment Analysis (ABSA) tasks requires manually annotated data, which is expensive and time-consuming to obtain. This paper introduces LA-ABSA, a novel approach that leverages Large Language Model (LLM)-generated annotations to fine-tune lightweight models for complex ABSA tasks. We evaluate our approach on five datasets for Target Aspect Sentiment Detection (TASD) and Aspect Sentiment Quad Prediction (ASQP). Our approach outperformed previously reported augmentation strategies and achieved competitive performance with LLM-prompting in low-resource scenarios, while providing substantial energy efficiency benefits. For example, using 50 annotated examples for in-context learning (ICL) to guide the annotation of unlabeled data, LA-ABSA achieved an F1 score of 49.85 for ASQP on the SemEval Rest16 dataset, closely matching the performance of ICL prompting with Gemma-3-27B (51.10), while requiring significantly lower computational resources.
[93] nchellwig at SemEval-2026 Task 3: Self-Consistent Structured Generation (SCSG) for Dimensional Aspect-Based Sentiment Analysis using Large Language Models
Nils Constantin Hellwig,Jakob Fehle,Udo Kruschwitz,Christian Wolff
Main category: cs.CL
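TL;DR: This paper presents Self-Consistent Structured Generation (SCSG) for Dimensional Aspect-Based Sentiment Analysis (DimABSA) in SemEval-2026 Task 3 (Track A). SCSG runs a LoRA-fine-tuned large language model several times per instance and keeps only the tuples that reach a majority consensus, and it uses vLLM's PagedAttention to keep the repeated runs efficient. Across multilingual, multi-domain settings the method significantly outperforms single-pass inference and ranks near the top on several subtasks.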
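The majority-consensus step of this entry can be sketched in a few lines. The sketch assumes each run returns a collection of hashable tuples that can be compared exactly; how the actual system matches tuples that carry continuous dimensional scores is not described in the abstract, so that part is left out. The function name and the strict-majority threshold are illustrative choices.

```python
from collections import Counter


def majority_consensus(runs, threshold=0.5):
    """Keep tuples predicted in more than `threshold` of the runs.

    `runs` is a list with one entry per model execution; each entry is an
    iterable of hashable prediction tuples.
    """
    if not runs:
        return []
    counts = Counter()
    for run in runs:
        # A tuple counts at most once per run, even if generated twice.
        counts.update(set(run))
    needed = len(runs) * threshold
    # Sort so the output order does not depend on set iteration order.
    return sorted(t for t, c in counts.items() if c > needed)


runs = [
    [("pizza", "great"), ("service", "slow")],
    [("pizza", "great")],
    [("pizza", "great"), ("decor", "dull")],
]
print(majority_consensus(runs))  # only ("pizza", "great") appears in a majority
```

With 15 executions, as reported in the abstract, a tuple would need to appear in at least 8 runs to pass this strict-majority rule.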
Details
Motivation: To make large language model predictions for Dimensional Aspect-Based Sentiment Analysis (DimABSA) more reliable and robust, since a single inference pass is sensitive to randomness. Method: Proposes Self-Consistent Structured Generation (SCSG): a LoRA-adapted large language model (Gemma 3) is run multiple times per instance and only structured output tuples with majority consensus are kept; vLLM's PagedAttention provides key-value cache reuse to reduce the cost of the repeated forward passes. Result: Validated on 6 languages and 8 language-domain combinations; self-consistency with 15 executions gives statistically significant improvements over single inference; the system ranks in the top seven in all settings, second on three of the four English subsets, and first on Tatar-Restaurant for DimASTE. Conclusion: Self-consistent structured generation is an effective and reliable strategy for structured prediction, well suited to fine-grained sentiment analysis across languages and domains, and its computational cost can be reduced through engineering optimizations such as PagedAttention. Abstract: We present Self-Consistent Structured Generation (SCSG) for Dimensional Aspect-Based Sentiment Analysis in SemEval-2026 Task 3 (Track A). SCSG enhances prediction reliability by executing a LoRA-adapted large language model multiple times per instance, retaining only tuples that achieve a majority consensus across runs. To mitigate the computational overhead of multiple forward passes, we leverage vLLM's PagedAttention mechanism for efficient key-value cache reuse. Evaluation across 6 languages and 8 language-domain combinations demonstrates that self-consistency with 15 executions yields statistically significant improvements over single-inference prompting, with our system (leveraging Gemma 3) ranking in the top seven across all settings, achieving second place on three out of four English subsets and first place on Tatar-Restaurant for DimASTE.
[94] Semantic Novelty Trajectories in 80,000 Books: A Cross-Corpus Embedding Analysis
Fred Zimmerman
Main category: cs.CL
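TL;DR: Building on Schmidhuber's compression progress theory, this paper analyzes semantic novelty trajectories in more than 80,000 English-language books. Modern books show higher paragraph-level novelty and more circuitous trajectories, while earlier literature more often follows convergent narrative curves; novelty is essentially uncorrelated with reader ratings, and eight narrative-shape archetypes are identified whose distribution differs markedly between eras.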
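The three trajectory measures named in this entry can be sketched in plain Python. Only circuitousness is defined explicitly in the abstract (cumulative path length divided by net displacement in embedding space). The running-centroid novelty below, taken as the cosine distance between a paragraph embedding and the mean of all earlier ones, and the reading of "PAA-16" as a 16-segment piecewise aggregate approximation are assumptions made for illustration.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0 or nb == 0:
        return 0.0  # convention for degenerate vectors (assumption)
    return 1.0 - dot / (na * nb)


def running_centroid_novelty(embeddings):
    """Novelty of each paragraph after the first, relative to the mean of
    all preceding paragraph embeddings (assumed definition)."""
    novelty = []
    total = list(embeddings[0])
    for t in range(1, len(embeddings)):
        centroid = [s / t for s in total]
        novelty.append(cosine_distance(embeddings[t], centroid))
        total = [s + x for s, x in zip(total, embeddings[t])]
    return novelty


def circuitousness(embeddings):
    """Cumulative path length divided by net displacement."""
    path = sum(math.dist(a, b) for a, b in zip(embeddings, embeddings[1:]))
    net = math.dist(embeddings[0], embeddings[-1])
    return path / net if net > 0 else math.inf


def paa(series, segments=16):
    """Piecewise aggregate approximation: mean of each of `segments` chunks."""
    n = len(series)
    out = []
    for i in range(segments):
        start, end = i * n // segments, (i + 1) * n // segments
        # Fall back to a single point when the series is shorter than `segments`.
        chunk = series[start:end] or [series[min(start, n - 1)]]
        out.append(sum(chunk) / len(chunk))
    return out
```

Reducing every book's novelty series to the same fixed length is what makes trajectories of different lengths comparable before clustering them into shape archetypes.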
Details
Motivation: To examine how Schmidhuber's notion of "interestingness" (compression progress) manifests empirically in large text corpora, and to understand systematic differences between eras in the dynamics of semantic novelty. Method: Paragraph embeddings from a sentence-transformer and a running-centroid novelty measure; comparison of two large corpora, PG19 (pre-1920) and Books3 (approximately 1990-2010); computation of mean paragraph-level novelty, trajectory circuitousness, and the share of convergent curves; clustering of PAA-16 representations to identify narrative-shape archetypes. Result: (1) Mean paragraph novelty is roughly 10% higher in modern books; (2) trajectory circuitousness is 67% higher; (3) convergent narratives are 2.3x more common in the earlier literature; (4) novelty is essentially uncorrelated with reader ratings (r = -0.002); (5) eight narrative-shape archetypes are found, with substantially different distributions between eras. Conclusion: The dynamic structure of semantic novelty (trajectory shape and convergence) has shifted systematically over time, and interestingness in the compression-progress sense is independent of conventional measures of literary merit, supporting novelty as a computable, cross-era dimension of narrative. Abstract: I apply Schmidhuber's compression progress theory of interestingness at corpus scale, analyzing semantic novelty trajectories in more than 80,000 books spanning two centuries of English-language publishing. Using sentence-transformer paragraph embeddings and a running-centroid novelty measure, I compare 28,730 pre-1920 Project Gutenberg books (PG19) against 52,796 modern English books (Books3, approximately 1990-2010). The principal findings are fourfold. First, mean paragraph-level novelty is roughly 10% higher in modern books (0.503 vs. 0.459). Second, trajectory circuitousness (the ratio of cumulative path length to net displacement in embedding space) nearly doubles in the modern corpus (+67%). Third, convergent narrative curves, in which novelty declines toward a settled semantic register, are 2.3x more common in pre-1920 literature. Fourth, novelty is orthogonal to reader quality ratings (r = -0.002), suggesting that interestingness in Schmidhuber's sense is structurally independent of perceived literary merit. Clustering paragraph-level trajectories via PAA-16 representations reveals eight distinct narrative-shape archetypes whose distribution shifts substantially between eras. All analysis code and an interactive exploration toolkit are publicly available at https://bigfivekiller.online/novelty_hub.
[95] ALTER: Asymmetric LoRA for Token-Entropy-Guided Unlearning of LLMs
Xunlei Chen,Jinyu Guo,Yuang Li,Zhaokun Wang,Yi Gong,Jie Zou,Jiwei Wei,Wenhong Tian
Main category: cs.CL
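TL;DR: This paper proposes ALTER, a framework that captures high-entropy tokens through the shared A matrix in LoRA and uses an asymmetric LoRA architecture to unlearn knowledge in target subdomains, achieving high forget quality with far fewer side effects and better efficiency.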
Details
Motivation: Controlling what a large language model should not know is essential for alignment and safe use, but existing methods struggle with the fuzzy boundary of entangled knowledge, coupled parameter spaces, and high computational cost. Method: ALTER works in two phases: (I) high-entropy tokens are captured and learned via the shared A matrix in LoRA; (II) an asymmetric LoRA architecture achieves the specified forgetting objective through parameter isolation and token unlearning within the target subdomains. Result: Achieves SOTA performance on the TOFU, WMDP, and MUSE benchmarks, with over 95% forget quality, foundational tokens well preserved, and over 90% of model utility retained, compared with 47.8-83.6% for baselines. Conclusion: ALTER points to a new direction for lightweight, efficient, controllable unlearning in large language models, notable for its token-level isolation and asymmetric architecture. Abstract: Large language models (LLMs) have advanced to encompass extensive knowledge across diverse domains. Yet controlling what an LLM should not know is important for ensuring alignment and thus safe use. However, effective unlearning in LLMs is difficult due to the fuzzy boundary between knowledge retention and forgetting. This challenge is exacerbated by entangled parameter spaces from continuous multi-domain training, often resulting in collateral damage, especially under aggressive unlearning strategies. Furthermore, the computational overhead required to optimize State-of-the-Art (SOTA) models with billions of parameters poses an additional barrier. In this work, we present ALTER, a lightweight unlearning framework for LLMs to address both the challenges of knowledge entanglement and unlearning efficiency. ALTER operates through two phases: (I) high entropy tokens are captured and learned via the shared A matrix in LoRA, followed by (II) an asymmetric LoRA architecture that achieves a specified forgetting objective by parameter isolation and unlearning tokens within the target subdomains. Serving as a new research direction for achieving unlearning via token-level isolation in the asymmetric framework, ALTER achieves SOTA performance on TOFU, WMDP, and MUSE benchmarks with over 95% forget quality and shows minimal side effects through preserving foundational tokens. By decoupling unlearning from LLMs' billion-scale parameters, this framework delivers excellent efficiency while preserving over 90% of model utility, exceeding baseline preservation rates of 47.8-83.6%.
[96] OpenAutoNLU: Open Source AutoML Library for NLU
Grigory Arshinov,Aleksandr Boriskin,Sergey Senichev,Ayaz Zaripov,Daria Galimzianova,Daniil Karpov,Leonid Sanochkin
Main category: cs.CL
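TL;DR: OpenAutoNLU is an open-source automated machine learning library for natural language understanding (NLU) tasks. It supports text classification and named entity recognition (NER) and offers data-aware training regime selection, data quality diagnostics, OOD detection, and large language model (LLM) features behind a low-code API.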
Details
Motivation: Existing automated NLU tools require manual configuration and lack data quality assessment and OOD detection; the library aims to improve usability and robustness. Method: Introduces data-aware automatic selection of the training regime and integrates data quality diagnostics, a configurable OOD detection module, and LLM features in a unified low-code API. Result: An end-to-end NLU modeling workflow that needs no manual configuration and supports training and deploying models; an interactive demo app is available online. Conclusion: OpenAutoNLU lowers the barrier to NLU tasks while balancing automation with model reliability, helping NLU techniques reach practical applications. Abstract: OpenAutoNLU is an open-source automated machine learning library for natural language understanding (NLU) tasks, covering both text classification and named entity recognition (NER). Unlike existing solutions, we introduce data-aware training regime selection that requires no manual configuration from the user. The library also provides integrated data quality diagnostics, configurable out-of-distribution (OOD) detection, and large language model (LLM) features, all within a minimal low-code API. The demo app is accessible at https://openautonlu.dev.
[97] Let the Agent Search: Autonomous Exploration Beats Rigid Workflows in Temporal Question Answering
Xufei Lv,Jiahui Yang,Yifu Gao,Linbo Qiao,Houde Liu
Main category: cs.CL
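TL;DR: This paper proposes AT2QA, a training-free autonomous agent that gives an off-the-shelf large language model (LLM) the freedom to decide its own next step. In a zero-shot setting it substantially improves Temporal Knowledge Graph Question Answering (TKGQA), with especially large gains over prior SOTA on multi-target queries.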
Details
Motivation: Existing LLM-based TKGQA methods depend on hand-crafted retrieval workflows or costly supervised fine-tuning, which limits flexibility and generalization. Method: Proposes the AT2QA framework, in which an off-the-shelf LLM decides each step itself and interacts with the temporal knowledge graph through a general search tool, performing iterative, zero-shot multi-hop reasoning with dynamic retrieval. Result: On MultiTQ, AT2QA reaches 88.7% Hits@1, +10.7% over prior SOTA, including a +20.1% gain on multi-target queries. Conclusion: Granting an LLM autonomy can outperform supervised fine-tuning even in a zero-shot setting, showing the effectiveness of the agentic approach for temporal question answering. Abstract: Temporal Knowledge Graph Question Answering (TKGQA) demands multi-hop reasoning under temporal constraints. Prior approaches based on large language models (LLMs) typically rely on rigid, hand-crafted retrieval workflows or costly supervised fine-tuning. We show that simply granting an off-the-shelf LLM autonomy, that is, letting it decide what to do next, already yields substantial gains even in a strict zero-shot setting. Building on this insight, we propose AT2QA, an autonomous, training-free agent for temporal question answering that iteratively interacts with the temporal knowledge graph via a general search tool for dynamic retrieval. Experiments on MultiTQ demonstrate large improvements: AT2QA achieves 88.7% Hits@1 (+10.7% over prior SOTA), including a +20.1% gain on challenging multi-target queries, showing that agentic autonomy can decisively outperform fine-tuning for temporal question answering. Code and the full set of sampled trajectories are available at https://github.com/AT2QA-Official-Code/AT2QA-Official-Code
[98] CyclicJudge: Mitigating Judge Bias Efficiently in LLM-based Evaluation
Ziyi Zhu,Olivier Tieleman,Alexey Bukhtiyarov,Jinghong Chen
Main category: cs.CL
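TL;DR: This paper introduces a variance decomposition that identifies the sources of judge bias in LLM-as-judge evaluation and designs CyclicJudge, a round-robin judge assignment strategy that removes the bias while keeping the cost of single-judge evaluation.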
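A round-robin assignment of judges, the core idea of this entry, can be sketched as below. The sketch assumes the items being judged are simply enumerated in order and that judges are rotated over them one at a time; what exactly is rotated over in the paper (scenarios, generations, or both) is not spelled out in the abstract, and the function names are illustrative.

```python
from collections import Counter
from itertools import cycle


def cyclic_assign(items, judges):
    """Pair each item with one judge, rotating through the judges in order.

    Every item is judged exactly once, so the total number of judge calls
    equals that of single-judge evaluation.
    """
    return list(zip(items, cycle(judges)))


def judge_load(assignment):
    """Count how many items each judge was assigned."""
    return Counter(judge for _, judge in assignment)


assignment = cyclic_assign([f"item_{i}" for i in range(6)], ["judge_a", "judge_b", "judge_c"])
print(assignment)
print(judge_load(assignment))
```

When the number of items is a multiple of the number of judges, every judge scores the same number of items, so a constant additive bias per judge contributes equally to every model evaluated under the same rotation.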
Details
Motivation: Judges in LLM-as-judge evaluation show systematic biases comparable in magnitude to the differences between models, which makes single-judge results unreliable. Method: Proposes a variance decomposition framework that splits benchmark score variance into scenario, generation, judge, and residual components, and on that basis designs CyclicJudge, a round-robin judge assignment strategy. Result: CyclicJudge eliminates judge bias precisely while requiring each judge only once per cycle, keeping the cost of single-judge evaluation; experiments on MT-Bench confirm the theoretical predictions. Conclusion: Judge bias is a problem that LLM evaluation cannot ignore, and CyclicJudge is an efficient, low-cost, unbiased evaluation strategy. Abstract: LLM-as-judge evaluation has become standard practice for open-ended model assessment; however, judges exhibit systematic biases that cannot be eliminated by increasing the number of scenarios or generations. These biases are often similar in magnitude to the model differences that benchmarks are designed to detect, resulting in unreliable rankings when single-judge evaluations are used. This work introduces a variance decomposition that partitions benchmark score variance into scenario, generation, judge, and residual components. Based on this analysis, CyclicJudge, a round-robin assignment of judges, is demonstrated to be the optimal allocation strategy. It eliminates bias precisely while requiring each judge only once per cycle, maintaining the cost of single-judge evaluation. Empirical validation on MT-Bench supports all theoretical predictions.
[99] Sovereign AI-based Public Services are Viable and Affordable
António Branco,Luís Gomes,Rodrigo Santos,Eduardo Santos,João Silva,Nuno Marques,Madalena Rodrigues
Main category: cs.CL
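TL;DR: Through practical experimentation, this paper challenges the assumption that general-purpose AI architectures are the optimal choice for every application context, showing that viable, cost-effective alternatives aligned with digital and cultural sovereignty exist and that sovereign AI-based public services are technically feasible and economically sustainable.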
Details
Motivation: As AI capabilities become increasingly intertwined with geopolitical interests, the availability and reliability of foundational AI services can no longer be taken for granted, particularly for AI-enabled public services that depend on commercial offerings from a few global technology providers; alternatives under local control need to be explored. Method: Practical experiments assess the feasibility and cost of AI systems that do not rely on general-purpose architectures in public-sector settings, with emphasis on on-premises deployment under modest computational and financial requirements. Result: Sovereign AI-based public services are shown to be technically feasible and economically sustainable, able to run effectively with limited computational and financial resources while preserving cultural and digital autonomy. Conclusion: Sovereign AI public services are a practical technical and policy option, not only a principle, and the paper offers empirical evidence and deployment lessons for governments and public agencies considering them. Abstract: The rapid expansion of AI-based remote services has intensified debates about the long-term implications of growing structural concentration in infrastructure and expertise. As AI capabilities become increasingly intertwined with geopolitical interests, the availability and reliability of foundational AI services can no longer be taken for granted. This issue is particularly pressing for AI-enabled public services for citizens, as governments and public agencies are progressively adopting 24/7 AI-driven support systems typically operated through commercial offerings from a small oligopoly of global technology providers. This paper challenges the prevailing assumption that general-purpose architectures, offered by these providers, are the optimal choice for all application contexts. Through practical experimentation, we demonstrate that viable and cost-effective alternatives exist, alternatives that align with principles of digital and cultural sovereignty. Our findings provide an empirical illustration that sovereign AI-based public services are both technically feasible and economically sustainable, capable of operating effectively on premises with modest computational and financial resources while maintaining cultural and digital autonomy. The technical insights and deployment lessons reported here are intended to inform the adoption of similar sovereign AI public services by national agencies and governments worldwide.
[100] KDFlow: A User-Friendly and Efficient Knowledge Distillation Framework for Large Language Models
Songming Zhang,Xue Zhang,Tong Zhang,Bojie Hu,Yufeng Chen,Jinan Xu
Main category: cs.CL
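TL;DR: This paper proposes KDFlow, a framework with a decoupled architecture that uses SGLang for fast teacher inference and combines the training efficiency of FSDP2 with zero-copy transfer of hidden states, achieving a 1.44x to 6.36x speedup in LLM knowledge distillation.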
Details
Motivation: Existing knowledge distillation frameworks use a homogeneous training backend (such as FSDP or DeepSpeed) for both teacher and student, which leads to suboptimal training efficiency; an architecture that is efficient for both inference and training is needed. Method: Proposes the KDFlow framework: (1) the teacher and student computations are decoupled; (2) SGLang performs efficient teacher inference; (3) only the teacher's hidden states are transmitted, via zero-copy transfer, and the logits are recomputed on the student side; (4) off-policy and on-policy distillation and cross-tokenizer distillation are supported through extensible APIs. Result: In experiments, KDFlow trains 1.44x to 6.36x faster than current KD frameworks, reduces engineering overhead, and supports rapid prototyping and distillation at scale. Conclusion: By combining heterogeneous backends and optimizing communication, KDFlow improves the efficiency and flexibility of knowledge distillation for large language models. Abstract: Knowledge distillation (KD) is an essential technique to compress large language models (LLMs) into smaller ones. However, despite the distinct roles of the student model and the teacher model in KD, most existing frameworks still use a homogeneous training backend (e.g., FSDP and DeepSpeed) for both models, leading to suboptimal training efficiency. In this paper, we present a novel framework for LLM distillation, termed KDFlow, which features a decoupled architecture and employs SGLang for teacher inference. By bridging the training efficiency of FSDP2 and the inference efficiency of SGLang, KDFlow achieves full utilization of both advantages in a unified system. Moreover, instead of transferring full logits across different processes, our framework only transmits the teacher's hidden states using zero-copy data transfer and recomputes the logits on the student side, effectively balancing the communication cost and KD performance. Furthermore, our framework supports both off-policy and on-policy distillation and incorporates KD algorithms for cross-tokenizer KD through highly extensible and user-friendly APIs. Experiments show that KDFlow can achieve 1.44x to 6.36x speedup compared to current KD frameworks, enabling researchers to rapidly prototype and scale LLM distillation with minimal engineering overhead. Code is available at: https://github.com/songmzhang/KDFlow
[101] FLANS at SemEval-2026 Task 7: RAG with Open-Sourced Smaller LLMs for Everyday Knowledge Across Diverse Languages and Cultures
Liliia Bogdanova,Shiran Sun,Lifeng Han,Natalia Amat Lefort,Flor Miriam Plaza-del-Arco
Main category: cs.CL
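TL;DR: This paper describes the authors' system for SemEval-2025 Task-7. It applies retrieval-augmented generation (RAG) with open-source smaller language models (sLLMs), builds culturally aware knowledge bases (CulKBs), integrates live online search, and covers English, Spanish, and Chinese, with an emphasis on privacy, sustainability, and reproducibility.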
Details
Motivation: To address everyday commonsense understanding across languages and cultures while keeping deployment private, sustainable, and low-cost, the authors use open-source smaller language models (sLLMs) and build culturally adapted knowledge resources. Method: A retrieval-augmented generation (RAG) framework combined with self-built culturally aware knowledge bases (CulKBs), created by extracting culturally relevant text and country-specific summaries from Wikipedia with keyword lists; one system also integrates live web search via DuckDuckGo; open-source sLLMs are deployed on the Ollama platform; prompts are refined iteratively and their learning curve is reported. Result: The system was evaluated on Track 1 (Short Answer Questions) and Track 2 (Multiple-Choice Questions) of SemEval-2025 Task-7 in English, Spanish, and Chinese; all code, resources, and prompts are shared openly. Conclusion: RAG with sLLMs and culturally tailored knowledge bases is a feasible approach to cross-cultural commonsense reasoning that balances performance, privacy, and sustainability, and the open release supports reproducibility and community extension. Abstract: This system paper describes our participation in the SemEval-2025 Task-7 "Everyday Knowledge Across Diverse Languages and Cultures". We participated in two subtasks, i.e., Track 1: Short Answer Questions (SAQ), and Track 2: Multiple-Choice Questions (MCQ). The methods we used are retrieval-augmented generation (RAG) with open-sourced smaller LLMs (OS-sLLMs). To better adapt to this shared task, we created our own culturally aware knowledge base (CulKBs) by extracting Wikipedia content using keyword lists we prepared. We extracted both culturally-aware wiki-text and country-specific wiki-summary. In addition to the local CulKBs, we also have one system integrating live online search output via DuckDuckGo. Towards better privacy and sustainability, we aimed to deploy smaller LLMs (sLLMs) that are open-sourced on the Ollama platform. We share the prompts we developed using refinement techniques and report the learning curve of such prompts. The tested languages are English, Spanish, and Chinese for both tracks. Our resources and code are shared via https://github.com/aaronlifenghan/FLANS-2026
[102] Demonstrating ViviDoc: Generating Interactive Documents through Human-Agent Collaboration
Yinghao Tang,Yupeng Xie,Yingchaojie Feng,Tingfeng Lan,Wei Chen
Main category: cs.CL
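TL;DR: ViviDoc is a collaborative system combining humans and LLM agents that generates interactive educational documents from a single topic input. A multi-agent pipeline and a human-readable document specification (DocSpec) improve controllability, verifiability, and suitability for teaching.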
Details
Motivation: Interactive articles help readers understand complex concepts, but they are costly to produce and require both domain expertise and front-end development skills; existing LLM-agent approaches lack controllability and verifiability.
Method: Proposes the ViviDoc system, with a three-agent pipeline (Planner, Executor, Evaluator) and the DocSpec intermediate representation, which decomposes each interactive visualization into State, Render, Transition, and Constraint components so that educators can review and adjust the plan before code is generated.
Result: Expert evaluation and a user study show that ViviDoc substantially outperforms naive agentic generation and offers an intuitive editing experience.
Conclusion: Through human-agent collaboration and a structured intermediate representation, ViviDoc bridges the gap between pedagogical goals and technical implementation, making automated generation of interactive educational content feasible, trustworthy, and editable.
Abstract: Interactive articles help readers engage with complex ideas through exploration, yet creating them remains costly, requiring both domain expertise and web development skills. Recent LLM-based agents can automate content creation, but naively applying them yields uncontrollable and unverifiable outputs. We present ViviDoc, a human-agent collaborative system that generates interactive educational documents from a single topic input. ViviDoc introduces a multi-agent pipeline (Planner, Executor, Evaluator) and the Document Specification (DocSpec), a human-readable intermediate representation that decomposes each interactive visualization into State, Render, Transition, and Constraint components. The DocSpec enables educators to review and refine generation plans before code is produced, bridging the gap between pedagogical intent and executable output. Expert evaluation and a user study show that ViviDoc substantially outperforms naive agentic generation and provides an intuitive editing experience. Our project homepage is available at https://vividoc-homepage.vercel.app/.
[103] AdaPonderLM: Gated Pondering Language Models with Token-Wise Adaptive Depth
Shixiang Song,He Li,Zitong Wang,Boyi Zeng,Feichen Song,Yixuan Wang,Zhiqin John Xu,Ziwei He,Zhouhan Lin
Main category: cs.CL
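TL;DR: AdaPonderLM is a self-supervised recurrent language model that learns an early-exit policy for each token, giving adaptive computation at inference without manually set pruning ratios. It cuts inference compute by about 10% while maintaining performance.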
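As a rough illustration of the token-wise halting this entry describes (iteration-specific gates combined with a monotonic halting mask), the sketch below shows one way such a mask could be derived from per-iteration gate outputs. It is not the authors' implementation: the function name, the 0.5 threshold, and the gate values are assumptions made for the example.

```python
def monotonic_halting_mask(gate_probs, threshold=0.5):
    """Turn per-iteration gate outputs into a monotonic 'still active' mask.

    gate_probs[t][i] is a hypothetical 'keep computing' probability for
    token i at recurrent iteration t. Once a token halts it stays halted,
    even if a later gate would have voted to continue.
    """
    num_tokens = len(gate_probs[0])
    still_active = [True] * num_tokens
    mask = []
    for step_probs in gate_probs:
        # A token remains active only if it was active before AND its gate fires now.
        still_active = [a and p >= threshold for a, p in zip(still_active, step_probs)]
        mask.append(still_active)
    return mask


# Three iterations, three tokens (illustrative numbers only).
gate_probs = [
    [0.9, 0.8, 0.3],
    [0.7, 0.2, 0.9],
    [0.6, 0.9, 0.9],
]
mask = monotonic_halting_mask(gate_probs)
# Number of iterations each token actually takes part in.
steps_per_token = [sum(column) for column in zip(*mask)]
print(steps_per_token)
```

The second token's gate fires again at the third iteration, but the mask keeps it halted; that monotonicity is what makes it safe to reuse cached key/value states for halted tokens.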
Details
Motivation: Existing pretrained recurrent language models usually run a fixed number of iterations, which wastes compute on easy tokens and provides no token-level adaptivity.
Method: Proposes AdaPonderLM, which uses iteration-specific MLP gates with a monotonic halting mask to decide when each token stops recurring, and a KV reuse mechanism that reuses the cached states of halted tokens, ensuring train-test consistency and practical acceleration.
Result: On Pythia models (70M to 410M pretraining, up to 2.8B continued pretraining), inference compute drops by about 10% while language modeling perplexity and downstream accuracy stay comparable; analysis shows the model allocates more computation to high-NLL (hard) tokens and outperforms fixed pruning under equal FLOPs.
Conclusion: AdaPonderLM achieves adaptive computation time in a fully self-supervised setting, dynamically allocating compute to the tokens that need it and balancing efficiency with performance.
Abstract: Test-time scaling via recurrent/iterative Transformers enables large language models to spend more computation at inference, but most pretrained recurrent LMs run a fixed number of iterations, wasting compute on easy tokens and lacking token-wise adaptivity. Following the core idea of Adaptive Computation Time (ACT) and Early Exit (EE), we propose AdaPonderLM, a self-supervised recurrent language model that learns token-wise early exiting during pretraining without manually tuned per-token/per-layer pruning ratios. AdaPonderLM uses iteration-specific MLP gates with a monotonic halting mask to decide when each token stops recurring, and introduces a KV reuse mechanism that reuses cached key/value states for halted tokens, ensuring train--test consistency and practical acceleration. Across Pythia backbones from 70M to 410M (pretraining) and up to 2.8B (continued pretraining), AdaPonderLM reduces inference compute by about 10% while maintaining comparable language modeling perplexity and competitive downstream accuracy. Our analysis shows the learned gates allocate more computation to high-NLL (hard) tokens, exhibiting adaptive computation time behavior in a fully self-supervised setting. Meanwhile, under iso-FLOPs, the learned halting policy consistently outperforms fixed pruning, showing AdaPonderLM allocates compute to the right tokens rather than just reducing average depth.
[104] From Variance to Invariance: Qualitative Content Analysis for Narrative Graph Annotation
Junbo Huang,Max Weinig,Ulrich Fritsche,Ricardo Usbeck
Main category: cs.CL
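TL;DR: This paper proposes a directed acyclic graph (DAG) based annotation framework for inflation narratives in news, applying principles of qualitative content analysis to improve annotation quality. Experiments on how different representations and distance metrics affect inter-annotator agreement (Krippendorff's α) show that lenient metrics overestimate reliability and that locally constrained representations reduce annotation variability.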
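To make the role of the distance metric in Krippendorff's α concrete, here is a minimal sketch of the general form α = 1 − D_o / D_e with a pluggable distance function. It is not the authors' graph-based implementation: representing an annotation as a set of (cause, effect) edges, the two distance functions, and the toy data are assumptions made for illustration.

```python
from itertools import permutations


def krippendorff_alpha(units, distance):
    """Krippendorff's alpha = 1 - D_o / D_e for an arbitrary distance function.

    units: one list per annotated item, holding the labels that different
    annotators assigned to it. Items with fewer than two labels are ignored.
    """
    pairable = [u for u in units if len(u) >= 2]
    values = [v for u in pairable for v in u]
    n = len(values)
    if n < 2:
        raise ValueError("need at least two pairable labels")
    # Observed disagreement: average within-item distance.
    observed = sum(
        sum(distance(a, b) for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in pairable
    ) / n
    # Expected disagreement: average distance over all pairs of labels.
    expected = sum(distance(a, b) for a, b in permutations(values, 2)) / (n * (n - 1))
    if expected == 0:
        raise ValueError("alpha is undefined when all labels are identical")
    return 1.0 - observed / expected


def exact_match(a, b):
    """Strict metric: any difference counts as full disagreement."""
    return 0.0 if a == b else 1.0


def jaccard_distance(a, b):
    """Lenient, overlap-based metric on edge sets."""
    union = a | b
    return 0.0 if not union else 1.0 - len(a & b) / len(union)


# Hypothetical annotations: each label is a set of (cause, effect) edges.
annotations = [
    [frozenset({("energy prices", "inflation")}), frozenset({("energy prices", "inflation")})],
    [frozenset({("wages", "inflation")}),
     frozenset({("wages", "inflation"), ("demand", "inflation")})],
    [frozenset({("supply chain", "inflation")}), frozenset({("rate hike", "inflation")})],
]
print("strict :", krippendorff_alpha(annotations, exact_match))
print("lenient:", krippendorff_alpha(annotations, jaccard_distance))
```

Swapping the `distance` argument is the only change needed to compare metrics on the same annotations, which mirrors the metric factor of the experimental design in spirit.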
Details
Motivation: News narratives about economic events such as inflation strongly shape public understanding, yet existing NLP methods still struggle to annotate and evaluate such narratives in a structured way, particularly in the presence of human label variation (HLV).
Method: Builds a DAG-based narrative graph annotation framework (nodes are events, edges are causal relations) and uses a 6×3 factorial experimental design to compare the effects of six narrative representations and three distance metric types on Krippendorff's α, introducing a graph extension of Krippendorff's α to quantify HLV.
Result: The experiments show that (1) lenient, overlap-based distance metrics overestimate annotation reliability, and (2) locally constrained representations (e.g., one-hop neighbors) reduce annotation variability; the graph-based Krippendorff's α and the annotations are open-sourced.
Conclusion: The study provides a reproducible evaluation approach and practical guidance for graph-structured narrative annotation under human label variation, supporting more rigorous and practical NLP work on complex narratives.
Abstract: Narratives in news discourse play a critical role in shaping public understanding of economic events, such as inflation. Annotating and evaluating these narratives in a structured manner remains a key challenge for Natural Language Processing (NLP). In this work, we introduce a narrative graph annotation framework that integrates principles from qualitative content analysis (QCA) to prioritize annotation quality by reducing annotation errors. We present a dataset of inflation narratives annotated as directed acyclic graphs (DAGs), where nodes represent events and edges encode causal relations. To evaluate annotation quality, we employed a $6\times3$ factorial experimental design to examine the effects of narrative representation (six levels) and distance metric type (three levels) on inter-annotator agreement (Krippendorff's $α$), capturing the presence of human label variation (HLV) in narrative interpretations. Our analysis shows that (1) lenient metrics (overlap-based distance) overestimate reliability, and (2) locally-constrained representations (e.g., one-hop neighbors) reduce annotation variability. Our annotation and implementation of graph-based Krippendorff's $α$ are open-sourced. The annotation framework and evaluation results provide practical guidance for NLP research on graph-based narrative annotation under HLV.
[105] When Numbers Tell Half the Story: Human-Metric Alignment in Topic Model Evaluation
Thibault Prouteau,Francis Lareau,Nicolas Dugué,Jean-Charles Lamirel,Christophe Malaterre
Main category: cs.CL
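TL;DR: This paper proposes Topic Word Mixing (TWM), a new human evaluation task that measures how distinct a topic model's topics are in a specialized domain. Using about 4,000 human annotations on a philosophy of science corpus, it compares automated metrics with human evaluation for six topic models and finds that TWM reflects human-perceived topic distinctness and agrees more closely with diversity metrics.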
Details
Motivation: Existing topic model evaluation methods (automated metrics such as coherence and diversity, or human tasks such as word intrusion) have limitations in specialized domains and do not accurately reflect human judgments of topic quality; an evaluation framework better suited to domain-specific corpora is needed.
Method: Proposes the Topic Word Mixing (TWM) human evaluation task, in which annotators distinguish word sets drawn from a single topic from those drawn from mixed topics; collects nearly 4,000 annotations on a philosophy of science corpus and compares automated metrics (e.g., coherence, diversity) with human evaluation (TWM, word intrusion) for six models: LDA, NMF, Top2Vec, BERTopic, CFMF, and CFMF-emb.
Result: TWM captures human-perceived topic distinctness and appears to align with diversity metrics, whereas word intrusion and coherence metrics often disagree in specialized domains.
Conclusion: As a complement to word intrusion, TWM provides a more reliable human baseline for evaluating topic models in specialized domains; the study stresses the need for frameworks that combine automated and human assessment, especially for domain-specific corpora.
Abstract: Topic models uncover latent thematic structures in text corpora, yet evaluating their quality remains challenging, particularly in specialized domains. Existing methods often rely on automated metrics like topic coherence and diversity, which may not fully align with human judgment. Human evaluation tasks, such as word intrusion, provide valuable insights but are costly and primarily validated on general-domain corpora. This paper introduces Topic Word Mixing (TWM), a novel human evaluation task assessing inter-topic distinctness by testing whether annotators can distinguish between word sets from single or mixed topics. TWM complements word intrusion's focus on intra-topic coherence and provides a human-grounded counterpart to diversity metrics. We evaluate six topic models - both statistical and embedding-based (LDA, NMF, Top2Vec, BERTopic, CFMF, CFMF-emb) - comparing automated metrics with human evaluation methods based on nearly 4,000 annotations from a domain-specific corpus of philosophy of science publications. Our findings reveal that word intrusion and coherence metrics do not always align, particularly in specialized domains, and that TWM captures human-perceived distinctness while appearing to align with diversity metrics. We release the annotated dataset and task generation code. This work highlights the need for evaluation frameworks bridging automated and human assessments, particularly for domain-specific corpora.
[106] AMemGym: Interactive Memory Benchmarking for Assistants in Long-Horizon Conversations
Cheng Jiayang,Dongyu Ru,Lin Qiu,Yiyang Li,Xuezhi Cao,Yangqiu Song,Xunliang Cai
Main category: cs.CL
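TL;DR: This paper proposes AMemGym, an interactive environment for evaluating and optimizing the memory capabilities of LLM assistants. Structured data sampling and LLM-simulated users enable high-fidelity, scalable evaluation of memory-driven personalization.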
Details
Motivation: Existing memory benchmarks rely on static, offline data, which makes it hard to evaluate and train memory capabilities in long-horizon interactions reliably and at scale.
Method: Builds the AMemGym interactive environment, which uses structured data sampling to predefine user profiles, state-dependent questions, and state evolution trajectories; uses LLM-simulated users that expose latent states through role-play while staying consistent with the structured state; and designs comprehensive evaluation metrics based on the structured data.
Result: Experiments reveal performance gaps in mainstream memory approaches such as RAG, long-context LLMs, and agentic memory, along with their causes; AMemGym supports selection among methods and could drive the self-evolution of memory strategies.
Conclusion: By combining structured state evolution with free-form interaction, AMemGym provides a scalable and diagnostically rich environment for developing memory capabilities in conversational agents.
Abstract: Long-horizon interactions between users and LLM-based assistants necessitate effective memory management, yet current approaches face challenges in training and evaluation of memory. Existing memory benchmarks rely on static, off-policy data as context, limiting evaluation reliability and scalability. To address these gaps, we introduce AMemGym, an interactive environment enabling on-policy evaluation and optimization for memory-driven personalization. AMemGym employs structured data sampling to predefine user profiles, state-dependent questions, and state evolution trajectories, enabling cost-effective generation of high-quality, evaluation-aligned interactions. LLM-simulated users expose latent states through role-play while maintaining structured state consistency. Comprehensive metrics based on structured data guide both assessment and optimization of assistants. Extensive experiments reveal performance gaps in existing memory systems (e.g., RAG, long-context LLMs, and agentic memory) and corresponding reasons. AMemGym not only enables effective selection among competing approaches but also can potentially drive the self-evolution of memory management strategies. By bridging structured state evolution with free-form interactions, our framework provides a scalable, diagnostically rich environment for advancing memory capabilities in conversational agents.
[107] CharacterFlywheel: Scaling Iterative Improvement of Engaging and Steerable LLMs in Production
Yixin Nie,Lin Guan,Zhongyao Ma,Anchit Gupta,Yipin Zhou,Xiao Li,Zhengping Zhou,Raymond Zeng,Gelin Zhou,Shigan Chu,Ajay Thampi,Wancen Mu,Nathan Shuster,Ketong Wang,Lin Chen,Jason Brewer,Derek Hao Hu,Alexander McCauley,Jason Weston,Sem Park,Na Zhang,Kevin Tang
Main category: cs.CL
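TL;DR: CharacterFlywheel is an iterative flywheel process for continuously improving large language models (LLMs) in social applications such as Instagram, WhatsApp, and Messenger. Over 15 generations of refinement it substantially improved user engagement and instruction following.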
Details
Motivation: To address the challenges of continuous optimization, measurable progress, and production stability when deploying large language models in real social chat settings.
Method: Starting from LLaMA 3.1 and using real user traffic, builds a closed-loop flywheel of data curation, reward modeling, supervised fine-tuning (SFT), reinforcement learning (RL), and offline/online evaluation, with safeguards against overfitting and mechanisms for adapting to production dynamics.
Result: In controlled 7-day A/B tests, 7 of 8 deployed models showed positive engagement lift, with up to +8.8% in engagement breadth and +19.4% in engagement depth; instruction following rose from 59.2% to 84.8% and instruction violations fell from 26.6% to 5.8%.
Conclusion: CharacterFlywheel demonstrates that LLMs in large-scale social products can be improved in a data-driven, measurable, and robust iterative manner, advancing both the scientific rigor and the engineering practice of LLMs in real-world settings.
Abstract: This report presents CharacterFlywheel, an iterative flywheel process for improving large language models (LLMs) in production social chat applications across Instagram, WhatsApp, and Messenger. Starting from LLaMA 3.1, we refined models across 15 generations using data from both internal and external real-user traffic. Through continuous deployments from July 2024 to April 2025, we conducted controlled 7-day A/B tests showing consistent engagement improvements: 7 of 8 newly deployed models demonstrated positive lift over the baseline, with the strongest performers achieving up to 8.8% improvement in engagement breadth and 19.4% in engagement depth. We also observed substantial gains in steerability, with instruction following increasing from 59.2% to 84.8% and instruction violations decreasing from 26.6% to 5.8%. We detail the CharacterFlywheel process which integrates data curation, reward modeling to estimate and interpolate the landscape of engagement metrics, supervised fine-tuning (SFT), reinforcement learning (RL), and both offline and online evaluation to ensure reliable progress at each optimization step. We also discuss our methods for overfitting prevention and navigating production dynamics at scale. These contributions advance the scientific rigor and understanding of LLMs in social applications serving millions of users.
[108] PonderLM-3: Adaptive Token-Wise Pondering with Differentiable Masking
He Li,Feichen Song,Boyi Zeng,Shixiang Song,Zhiqin John Xu,Ziwei He,Zhouhan Lin
Main category: cs.CL
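TL;DR: PonderLM-3 is a pretraining framework that supports token-level adaptive compute allocation. It keeps training and inference consistent by pairing a differentiable attention mask with a hard pruning rule, making more efficient use of inference compute.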
Details
Motivation: How to direct additional inference compute to the tokens that need it most, rather than adding a uniform cost to every token.
Method: Building on PonderLM-2, introduces a differentiable attention mask during pretraining and pairs it with a matching hard pruning rule at inference to achieve token-level adaptive compute allocation.
Result: Compared with existing recursive or adaptive baselines, PonderLM-3 achieves lower pretraining perplexity at equal inference FLOPs; on downstream tasks it matches fixed-step PonderLM-2 while using fewer inference FLOPs in practice.
Conclusion: PonderLM-3 provides an end-to-end differentiable, train-inference consistent framework for token-level adaptive computation, so that extra inference compute goes where it is most useful.
Abstract: Test-time scaling has shown that allocating additional computation at inference can improve generation quality, motivating a natural follow-up question: where should this computation be spent? Building on this insight, we introduce PonderLM-3, a pretraining framework for token-wise adaptive pondering that learns to selectively allocate additional computation under purely self-supervised objectives, built on top of the PonderLM-2 backbone. This makes additional inference computation an allocatable per-token resource, so tokens receive more computation only when it is beneficial, rather than paying a uniform extra cost. To make this allocation learnable while maintaining train-inference consistency, PonderLM-3 injects a differentiable attention mask during pretraining and pairs it with a matching hard pruning rule at inference. PonderLM-3 defines a stronger Pareto frontier: compared with existing recursive or adaptive baselines, it achieves lower pretraining perplexity at equal inference FLOPs. On downstream benchmarks, PonderLM-3 attains comparable performance to fixed-step PonderLM-2 under the same maximum number of additional computation steps, while using fewer inference FLOPs in practice. Overall, PonderLM-3 provides an end-to-end differentiable and train-inference consistent framework for token-wise adaptive computation, enabling additional inference compute to be allocated where it is most useful rather than paid uniformly by every token.
[109] MMR-Life: Piecing Together Real-life Scenes for Multimodal Multi-image Reasoning
Jiachun Li,Shaoping Huang,Zhuoran Jin,Chenlong Zhang,Pengfei Cao,Yubo Chen,Kang Liu,Jun Zhao
Main category: cs.CL
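TL;DR: This paper proposes MMR-Life, a comprehensive benchmark for evaluating the multi-image multimodal reasoning of multimodal large language models (MLLMs) in real-life scenarios. It covers seven reasoning types with 2,646 multiple-choice questions and 19,108 real-world images; an evaluation of 37 advanced models reveals substantial limitations in multi-image reasoning.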
Details
Motivation: The multimodal reasoning abilities of existing MLLMs have not been systematically evaluated in real-life scenarios and lack standardized benchmarks, and diverse reasoning across multiple images is especially under-studied.
Method: Builds the MMR-Life benchmark, with 2,646 multiple-choice questions and 19,108 real-world images covering seven reasoning types (abductive, analogical, causal, deductive, inductive, spatial, temporal); it does not rely on domain expertise and emphasizes integrating information across images; 37 advanced MLLMs are evaluated and their reasoning paradigms analyzed.
Result: Top models such as GPT-5 reach only 58% accuracy on MMR-Life, with considerable variation across reasoning types; the analysis examines how thinking length, reasoning method, and reasoning type affect performance.
Conclusion: MMR-Life provides a comprehensive, realistic, and extensible foundation for evaluating, analyzing, and improving the next generation of multimodal reasoning systems.
Abstract: Recent progress in the reasoning capabilities of multimodal large language models (MLLMs) has empowered them to address more complex tasks such as scientific analysis and mathematical reasoning. Despite their promise, MLLMs' reasoning abilities across different scenarios in real life remain largely unexplored and lack standardized benchmarks for evaluation. To address this gap, we introduce MMR-Life, a comprehensive benchmark designed to evaluate the diverse multimodal multi-image reasoning capabilities of MLLMs across real-life scenarios. MMR-Life consists of 2,646 multiple-choice questions based on 19,108 images primarily sourced from real-world contexts, comprehensively covering seven reasoning types: abductive, analogical, causal, deductive, inductive, spatial, and temporal. Unlike existing reasoning benchmarks, MMR-Life does not rely on domain-specific expertise but instead requires models to integrate information across multiple images and apply diverse reasoning abilities. The evaluation of 37 advanced models highlights the substantial challenge posed by MMR-Life. Even top models like GPT-5 achieve only 58% accuracy and display considerable variance in performance across reasoning types. Moreover, we analyze the reasoning paradigms of existing MLLMs, exploring how factors such as thinking length, reasoning method, and reasoning type affect their performance. In summary, MMR-Life establishes a comprehensive foundation for evaluating, analyzing, and improving the next generation of multimodal reasoning systems.
[110] EstLLM: Enhancing Estonian Capabilities in Multilingual LLMs via Continued Pretraining and Post-Training
Aleksei Dorkin,Taido Purason,Emil Kalbaliyev,Hele-Andra Kuulmets,Marii Ojastu,Mark Fišel,Tanel Alumäe,Eleri Aedmaa,Krister Kruusmaa,Kairit Sirts
Main category: cs.CL
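TL;DR: This paper studies whether continued pretraining (CPT) can improve the Estonian capabilities of a multilingual LLM (Llama 3.1 8B) while preserving English and general reasoning ability. It builds a balanced training mixture with English replay plus code, mathematics, and instruction-like data, followed by supervised fine-tuning, preference optimization, and chat vector merging, and substantially improves Estonian capabilities without hurting English performance.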
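The entry names chat vector merging without spelling out the arithmetic. As commonly described, the technique adds the weight difference between an instruction-tuned model and its base model to a further-pretrained model. The sketch below shows that arithmetic on toy weights; the function name, the `scale` parameter, and the tensors (plain lists here) are assumptions for illustration, not the authors' code or settings.

```python
def merge_chat_vector(base, instruct, adapted, scale=1.0):
    """Add the 'chat vector' (instruct - base) to a continually pretrained model.

    Each argument maps a parameter name to a flat list of weights; all three
    models are assumed to share the same architecture and parameter names.
    """
    merged = {}
    for name, adapted_weights in adapted.items():
        # Direction in weight space that instruction tuning moved the base model.
        chat_vector = [i - b for i, b in zip(instruct[name], base[name])]
        merged[name] = [a + scale * c for a, c in zip(adapted_weights, chat_vector)]
    return merged


# Toy weights standing in for real checkpoints.
base = {"layer.weight": [1.0, 2.0]}
instruct = {"layer.weight": [1.5, 1.0]}   # base after instruction tuning
adapted = {"layer.weight": [2.0, 4.0]}    # base after continued pretraining
merged = merge_chat_vector(base, instruct, adapted)
print(merged)
```

With real checkpoints the same loop would run over framework tensors rather than lists; the point of the sketch is only the additive structure.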
Details
Motivation: Large language models are trained mainly on English-centric data, which leads to uneven performance on smaller languages such as Estonian; the goal is to strengthen a specific small language without damaging the model's existing multilingual capabilities.
Method: Using Llama 3.1 8B as the base model, performs continued pretraining (CPT) on a mixture of increased Estonian data, English replay data, and code, mathematics, and instruction-like data; then applies supervised fine-tuning, preference optimization, and chat vector merging to strengthen instruction following.
Result: Consistent gains on Estonian benchmarks (linguistic competence, knowledge, reasoning, translation quality, instruction following) while remaining competitive on English benchmarks.
Conclusion: Continued pretraining with a well-balanced data mixture, followed by post-training alignment, can substantially improve a single small language in a multilingual LLM without sacrificing its English and general capabilities.
Abstract: Large language models (LLMs) are predominantly trained on English-centric data, resulting in uneven performance for smaller languages. We study whether continued pretraining (CPT) can substantially improve Estonian capabilities in a pretrained multilingual LLM while preserving its English and general reasoning performance. Using Llama 3.1 8B as the main base model, we perform CPT on a mixture that increases Estonian exposure while approximating the original training distribution through English replay and the inclusion of code, mathematics, and instruction-like data. We subsequently apply supervised fine-tuning, preference optimization, and chat vector merging to introduce robust instruction-following behavior. Evaluation on a comprehensive suite of Estonian benchmarks shows consistent gains in linguistic competence, knowledge, reasoning, translation quality, and instruction-following compared to the original base model and its instruction-tuned variant, while maintaining competitive performance on English benchmarks. These findings indicate that CPT, with an appropriately balanced data mixture, together with post-training alignment, can substantially improve single-language capabilities in pretrained multilingual LLMs.
[111] What Exactly do Children Receive in Language Acquisition? A Case Study on CHILDES with Automated Detection of Filler-Gap Dependencies
Zhenghao Herbert Zhou,William Dai,Maya Viswanathan,Simon Charlow,R. Thomas McCoy,Robert Frank
Main category: cs.CL
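TL;DR: This paper presents a system combining constituency and dependency parsing to automatically identify three core filler-gap constructions (matrix wh-questions, embedded wh-questions, relative clauses) and their extraction sites (subject/object/adjunct) in the CHILDES child language corpora. This quantifies the filler-gap input children receive and informs the debate over whether acquisition depends on innate grammatical knowledge.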
Details
Motivation: To inform the theoretical debate over how children acquire filler-gap dependencies (innate grammatical knowledge vs. distributional evidence in the input), which has lacked a tool for quantifying the input at scale and with fine granularity.
Method: Builds a computational system that combines constituency and dependency parsing to automatically identify three filler-gap constructions and their extraction sites in spoken English corpora; validates it on human-annotated data and applies it to 57 English CHILDES corpora.
Result: The system performs well on most categories; the CHILDES analysis characterizes the composition of children's filler-gap input, developmental trajectories, construction-specific frequencies, and extraction-site asymmetries; the resulting fine-grained labels are used in a case study on filtered-corpus training of language models.
Conclusion: The system offers a scalable, fine-grained way to quantify the input relevant to filler-gap acquisition, supporting future acquisition research and closer links between computational modeling and empirical study.
Abstract: Children's acquisition of filler-gap dependencies has been argued by some to depend on innate grammatical knowledge, while others suggest that the distributional evidence available in child-directed speech suffices. Unfortunately, the relevant input is difficult to quantify at scale with fine granularity, making this question difficult to resolve. We present a system that identifies three core filler-gap constructions in spoken English corpora -- matrix wh-questions, embedded wh-questions, and relative clauses -- and further identifies the extraction site (i.e., subject vs. object vs. adjunct). Our approach combines constituency and dependency parsing, leveraging their complementary strengths for construction classification and extraction site identification. We validate the system on human-annotated data and find that it scores well across most categories. Applying the system to 57 English CHILDES corpora, we are able to characterize children's filler-gap input and their filler-gap production trajectories over the course of development, including construction-specific frequencies and extraction-site asymmetries. The resulting fine-grained labels enable future work in both acquisition and computational studies, which we demonstrate with a case study using filtered corpus training with language models.
[112] Modeling Grammatical Hypothesis Testing in Young Learners: A Sequence-Based Learning Analytics Study of Morphosyntactic Reasoning in an Interactive Game
Thierry Geoffre,Trystan Geoffre
Main category: cs.CL
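TL;DR: This study uses sequence analysis of fine-grained action sequences from an interactive game on French morphosyntactic agreement to investigate primary school students' grammatical reasoning. Students tend to fix the verb first and then adjust the preceding elements, and sequence analysis reveals implicit linguistic reasoning that could support real-time tools for linguistically diverse classrooms.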
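The study measures how close each intermediate sentence state is to a valid solution using Hamming distance. A minimal sketch of that measurement follows; the three-slot sentence, the two valid solutions, and the sequence of states are invented for illustration and are not taken from the game or its data.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def closest_valid(state, valid_solutions):
    """Return (distance, solution) for the valid solution nearest to `state`."""
    return min(((hamming(state, s), s) for s in valid_solutions), key=lambda pair: pair[0])


# Hypothetical exercise with slots (determiner, noun, verb) and two valid answers.
valid_solutions = [
    ("le", "chat", "dort"),
    ("les", "chats", "dorment"),
]
# Hypothetical slider states after successive learner actions.
states = [
    ("les", "chat", "dort"),
    ("les", "chat", "dorment"),   # verb changed before the noun
    ("les", "chats", "dorment"),
]
trajectory = [closest_valid(state, valid_solutions) for state in states]
for distance, target in trajectory:
    print(distance, target)
```

In this toy sequence the nearest valid solution switches from the singular to the plural sentence at the second action, the kind of shift the abstract reads as hypothesis revision.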
Details
Motivation: Traditional assessments rely only on final answers and cannot capture the real-time cognitive strategies students use while constructing sentences; this study uses fine-grained action sequences to reveal hidden dimensions of primary school students' grammatical reasoning.
Method: Applies sequence-based learning analytics, treating each slider movement in the interactive game as a hypothesis-testing action; uses Hamming distance to quantify how close a state is to a valid grammatical solution and analyzes convergence patterns across exercises of varying difficulty; the data comprise 597 gameplay sessions (9,783 actions) from 100 students aged 8-11.
Result: Determiners and verbs are the main sites of difficulty; action sequences deviate from the usual left-to-right order, with learners fixing the verb first and then adjusting preceding elements; exercises with fewer solutions converge more slowly and erratically; changes in the closest valid solution reflect dynamic hypothesis revision.
Conclusion: Sequence analysis can reveal the cognitive processes behind linguistic reasoning and provides a basis for teacher-facing real-time support tools and interventions suited to linguistically diverse classrooms.
Abstract: This study investigates grammatical reasoning in primary school learners through a sequence-based learning analytics approach, leveraging fine-grained action sequences from an interactive game targeting morphosyntactic agreement in French. Unlike traditional assessments that rely on final answers, we treat each slider movement as a hypothesis-testing action, capturing real-time cognitive strategies during sentence construction. Analyzing 597 gameplay sessions (9,783 actions) from 100 students aged 8-11 in authentic classroom settings, we introduce Hamming distance to quantify proximity to valid grammatical solutions and examine convergence patterns across exercises with varying levels of difficulty. Results reveal that determiners and verbs are key sites of difficulty, with action sequences deviating from the usual left-to-right treatment. This suggests learners often fix the verb first and adjust preceding elements. Exercises with fewer solutions exhibit slower and more erratic convergence, while changes in the closest valid solution indicate dynamic hypothesis revision. Our findings demonstrate how sequence-based analytics can uncover hidden dimensions of linguistic reasoning, offering a foundation for real-time scaffolding and teacher-facing tools in linguistically diverse classrooms.
[113] ClinConsensus: A Consensus-Based Benchmark for Evaluating Chinese Medical LLMs across Difficulty Levels
Xiang Zheng,Han Li,Wenjie Luo,Weiqi Zhai,Yiyuan Li,Chuanmiao Yan,Tianyi Tang,Yubo Ma,Kexin Yang,Dayiheng Liu,Hu Wei,Bing Zhao
Main category: cs.CL
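TL;DR: This paper proposes ClinConsensus, a Chinese medical benchmark built and validated by clinical experts in China. It covers 2,500 open-ended cases, 36 specialties, and 12 clinical task types, with an emphasis on longitudinal structure, openness, and safety. It also introduces the CACS@k score and a dual-judge evaluation framework, revealing clear weaknesses of current LLMs in clinical reasoning, evidence use, and long-term follow-up.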
Details
Motivation: Existing medical benchmarks are static and task-isolated and do not reflect the openness, longitudinal structure, and safety-critical complexity of real clinical workflows.
Method: Builds ClinConsensus, a clinician-led Chinese medical benchmark (2,500 cases, 36 specialties, 12 task types); proposes a rubric-based grading protocol and the Clinically Applicable Consistency Score (CACS@k); and designs a dual-judge evaluation framework that combines a high-capability LLM judge with a lightweight, locally deployable judge.
Result: A systematic evaluation of several leading LLMs shows that top models reach similar overall scores but differ markedly in reasoning quality, evidence use, and longitudinal follow-up; clinically actionable treatment planning remains the key bottleneck.
Conclusion: ClinConsensus provides a more realistic, extensible, physician-aligned evaluation standard for medical LLMs, supporting progress toward robust, clinically trustworthy, deployable models; the benchmark is publicly released.
Abstract: Large language models (LLMs) are increasingly applied to health management, showing promise across disease prevention, clinical decision-making, and long-term care. However, existing medical benchmarks remain largely static and task-isolated, failing to capture the openness, longitudinal structure, and safety-critical complexity of real-world clinical workflows. We introduce ClinConsensus, a Chinese medical benchmark curated, validated, and quality-controlled by clinical experts. ClinConsensus comprises 2500 open-ended cases spanning the full continuum of care--from prevention and intervention to long-term follow-up--covering 36 medical specialties, 12 common clinical task types, and progressively increasing levels of complexity. To enable reliable evaluation of such complex scenarios, we adopt a rubric-based grading protocol and propose the Clinically Applicable Consistency Score (CACS@k). We further introduce a dual-judge evaluation framework, combining a high-capability LLM-as-judge with a distilled, locally deployable judge model trained via supervised fine-tuning, enabling scalable and reproducible evaluation aligned with physician judgment. Using ClinConsensus, we conduct a comprehensive assessment of several leading LLMs and reveal substantial heterogeneity across task themes, care stages, and medical specialties. While top-performing models achieve comparable overall scores, they differ markedly in reasoning, evidence use, and longitudinal follow-up capabilities, and clinically actionable treatment planning remains a key bottleneck. We release ClinConsensus as an extensible benchmark to support the development and evaluation of medical LLMs that are robust, clinically grounded, and ready for real-world deployment.
[114] Recursive Think-Answer Process for LLMs and VLMs
Byung-Kwan Lee,Youngchae Chee,Yong Man Ro
Main category: cs.CL
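TL;DR: This paper proposes the Recursive Think-Answer Process (R-TAP), which uses a confidence generator and two complementary rewards to let large language models and vision-language models refine their reasoning iteratively. It improves answer accuracy and reduces self-reflective expressions, making reasoning more stable and efficient.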
Details
Motivation: Existing Think-Answer models reason interpretably but still make errors in single-pass inference, even though they frequently emit self-reflective cues such as "Oops!", which suggests they lack an effective error-correction mechanism.
Method: Proposes the Recursive Think-Answer Process (R-TAP), introducing a confidence generator that evaluates the certainty of responses, and two rewards for reinforcement learning: the Recursively Confidence Increase Reward and the Final Answer Confidence Reward.
Result: R-TAP consistently outperforms single-pass baselines on both LLMs and VLMs; "Oops"-like expressions in model outputs drop significantly, and reasoning becomes more stable and faster.
Conclusion: R-TAP offers a new approach to building efficient, refined reasoning mechanisms and may inform how the reasoning processes of future AI systems are improved.
Abstract: Think-Answer reasoners such as DeepSeek-R1 have made notable progress by leveraging interpretable internal reasoning. However, despite the frequent presence of self-reflective cues like "Oops!", they remain vulnerable to output errors during single-pass inference. To address this limitation, we propose an efficient Recursive Think-Answer Process (R-TAP) that enables models to engage in iterative reasoning cycles and generate more accurate answers, going beyond conventional single-pass approaches. Central to this approach is a confidence generator that evaluates the certainty of model responses and guides subsequent improvements. By incorporating two complementary rewards, the Recursively Confidence Increase Reward and the Final Answer Confidence Reward, we show that R-TAP-enhanced models consistently outperform conventional single-pass methods for both large language models (LLMs) and vision-language models (VLMs). Moreover, by analyzing the frequency of "Oops"-like expressions in model responses, we find that R-TAP-applied models exhibit significantly fewer self-reflective patterns, resulting in more stable and faster inference-time reasoning. We hope R-TAP paves the way toward efficient and elaborate methods for refining the reasoning processes of future AI.
[115] LLMs as Strategic Actors: Behavioral Alignment, Risk Calibration, and Argumentation Framing in Geopolitical Simulations
Veronika Solopova,Viktoria Skorik,Maksym Tereshchenko,Alina Haidun,Ostap Vykhopen
Main category: cs.CL
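TL;DR: This paper evaluates the decision behavior of six mainstream large language models in four real-world geopolitical crisis simulations. The models approximate human decision patterns at first but diverge over time, and their explanations generally adopt a normative, cooperative framing rather than an adversarial one.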
Details
Motivation: Large language models are increasingly used as agents in strategic decision environments, yet their behavior in structured geopolitical simulations remains under-researched.
Method: Across four real-world crisis simulation scenarios, compares six state-of-the-art LLMs with human participants on action selection, risk calibration, and argumentative framing grounded in international relations theory.
Result: The models approximate human decision patterns in the base simulation rounds but diverge as rounds progress; explanations from all models lean strongly toward a normative-cooperative framing (stability, coordination, risk mitigation), with limited adversarial reasoning.
Conclusion: Current LLMs follow a predictable but not fully human-like strategic trajectory in geopolitical simulations, and their explanations show a systematic theoretical bias, so they should be applied to high-stakes strategic settings with caution.
Abstract: Large language models (LLMs) are increasingly proposed as agents in strategic decision environments, yet their behavior in structured geopolitical simulations remains under-researched. We evaluate six popular state-of-the-art LLMs alongside human results across four real-world crisis simulation scenarios, requiring models to select predefined actions and justify their decisions across multiple rounds. We compare models to humans in action alignment, risk calibration through chosen actions' severity, and argumentative framing grounded in international relations theory. Results show that models approximate human decision patterns in base simulation rounds but diverge over time, displaying distinct behavioural profiles and strategy updates. LLM explanations for chosen actions across all models exhibit a strong normative-cooperative framing centered on stability, coordination, and risk mitigation, with limited adversarial reasoning.
[116] LongRLVR: Long-Context Reinforcement Learning Requires Verifiable Context Rewards
Guanzheng Chen,Michael Qizhe Shieh,Lidong Bing
Main category: cs.CL
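TL;DR: This paper proposes LongRLVR, which adds a dense, verifiable context reward to standard RLVR for long-context settings. This addresses the vanishing gradients caused by rewarding only the final answer and significantly improves LLM performance on long-context reasoning tasks.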
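To illustrate why a dense context reward gives a learning signal where an answer-only reward does not, the sketch below combines a binary answer reward with a score for the evidence the model selected. The entry does not specify the reward's exact form, so the F1 over chunk identifiers, the additive combination, and the 0.5 weight are all assumptions, not the LongRLVR definition.

```python
def context_reward(selected_chunks, gold_chunks):
    """F1 overlap between the chunk ids a model cited and the gold evidence ids.

    This particular scoring rule is an assumption made for illustration.
    """
    selected, gold = set(selected_chunks), set(gold_chunks)
    true_positives = len(selected & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(selected)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def total_reward(answer_correct, selected_chunks, gold_chunks, weight=0.5):
    """Sparse answer reward plus a weighted dense context reward (hypothetical weight)."""
    return float(answer_correct) + weight * context_reward(selected_chunks, gold_chunks)


# A rollout with a wrong final answer but partly correct evidence still earns reward.
partial = total_reward(answer_correct=False, selected_chunks=[3, 7], gold_chunks=[3, 9])
# A rollout with a correct answer grounded in exactly the right evidence.
full = total_reward(answer_correct=True, selected_chunks=[3, 9], gold_chunks=[3, 9])
print(partial, full)
```

Under an answer-only reward the first rollout would score zero, exactly like a rollout that retrieved nothing relevant; the context term is what separates the two.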
Details
Motivation: Existing RLVR methods perform poorly on long-context tasks because they rely on the model's internal parametric knowledge and struggle with locating and reasoning over externally provided information; a sparse reward based only on the final answer causes vanishing gradients for the context grounding process, making learning intractable.
Method: Proposes the LongRLVR framework, which adds a dense, verifiable context reward to the answer reward; this signal directly incentivizes the model to select the correct supporting evidence and provides a stable learning gradient.
Result: On long-context benchmarks such as RULER-QA and LongBench v2, LongRLVR significantly outperforms standard RLVR; for example, a 14B model improves from 73.17 to 88.90 on RULER-QA and from 39.8 to 46.5 on LongBench v2.
Conclusion: Explicitly rewarding the context grounding process is a key strategy for improving LLM performance on long-context reasoning tasks.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capabilities of Large Language Models (LLMs) by optimizing them against factual outcomes. However, this paradigm falters in long-context scenarios, as its reliance on internal parametric knowledge is ill-suited for tasks requiring contextual grounding--the ability to find and reason over externally provided information. We identify a key reason for this failure: a reward based solely on the final answer is too sparse to effectively guide the model for identifying relevant evidence. We formally prove that the outcome-only reward leads to significant vanishing gradients for the context grounding process, rendering learning intractable. To overcome this bottleneck, we introduce LongRLVR to augment the sparse answer reward with a dense and verifiable context reward. This auxiliary signal directly incentivizes the model for selecting the correct grounding information, providing a robust learning gradient that solves the underlying optimization challenge. We validate our method on challenging long-context benchmarks using Qwen and LLaMA models. LongRLVR consistently and significantly outperforms the standard RLVR across all models and benchmarks, e.g., boosting a 14B model's scores on RULER-QA from 73.17 to 88.90 and on LongBench v2 from 39.8 to 46.5. Our work demonstrates that explicitly rewarding the grounding process is a critical and effective strategy for unlocking the full reasoning potential of LLMs in long-context applications. Our code is available at https://github.com/real-absolute-AI/LongRLVR.
[117] Zero- and Few-Shot Named-Entity Recognition: Case Study and Dataset in the Crime Domain (CrimeNER)
Miguel Lopez-Duran,Julian Fierrez,Aythami Morales,Daniel DeAlcala,Gonzalo Mancera,Javier Irigoyen,Ruben Tolosana,Oscar Delgado,Francisco Jurado,Alvaro Ortigosa
Main category: cs.CL
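TL;DR: This paper presents CrimeNER, a case study of zero-shot and few-shot named-entity recognition (NER) on crime-related documents, together with the CrimeNERdb dataset of more than 1,500 annotated documents drawn from reports on terrorist attacks and U.S. Department of Justice press releases. It defines 5 coarse-grained and 22 fine-grained crime-related entity types and evaluates several state-of-the-art NER models and large language models in zero- and few-shot settings.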
Details
Motivation: Adequately annotated data for real-world crime-related documents is scarce, which limits the use of named-entity recognition in law enforcement.
Method: Builds the CrimeNERdb dataset (more than 1.5k annotated documents), defines coarse- and fine-grained crime entity types, and evaluates state-of-the-art NER models and general-purpose large language models in zero-shot and few-shot settings.
Result: The experiments assess the quality and usefulness of CrimeNERdb; in zero- and few-shot settings the evaluated models show some ability to recognize the entities, and the work provides a new benchmark and resource for crime-domain NER.
Conclusion: CrimeNER and CrimeNERdb help fill the gap in data and methods for crime-domain NER, support zero- and few-shot learning, and have potential for practical law enforcement use.
Abstract: The extraction of critical information from crime-related documents is a crucial task for law enforcement agencies. Named-Entity Recognition (NER) can perform this task in extracting information about the crime, the criminal, or law enforcement agencies involved. However, there is a considerable lack of adequately annotated data on general real-world crime scenarios. To address this issue, we present CrimeNER, a case-study of Crime-related zero- and Few-Shot NER, and a general Crime-related Named-Entity Recognition database (CrimeNERdb) consisting of more than 1.5k annotated documents for the NER task extracted from public reports on terrorist attacks and the U.S. Department of Justice's press notes. We define 5 types of coarse crime entity and a total of 22 types of fine-grained entity. We address the quality of the case-study and the annotated data with experiments on Zero and Few-Shot settings with State-of-the-Art NER models as well as generalist and commonly used Large Language Models.
[118] Organizing, Orchestrating, and Benchmarking Agent Skills at Ecosystem Scale
Hao Li,Chunjiang Mu,Jianhao Chen,Siyue Ren,Zhiyao Cui,Yiqun Zhang,Lei Bai,Shuyue Hu
Main category: cs.CL
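TL;DR: This paper proposes AgentSkillOS, a framework for skill selection, orchestration, and ecosystem-level management. It has two stages, skill management (organizing skills into a capability tree) and task solving (invoking multiple skills through DAG pipelines), and is validated on 30 artifact-rich tasks.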
Details
Motivation: With the rapid growth of Claude agent skills, how to effectively use, manage, and scale the skill ecosystem has become a central question.
Method: Proposes the AgentSkillOS framework with two stages: (i) Manage Skills, which builds a capability tree through node-level recursive categorization for efficient skill discovery; and (ii) Solve Tasks, which retrieves, orchestrates, and executes skills over a directed acyclic graph (DAG). A benchmark of 30 tasks is constructed, and LLM-based pairwise evaluation is combined with a Bradley-Terry model to produce unified quality scores.
Result: Experiments show that, in skill ecosystems ranging from 200 to 200K skills, tree-based retrieval effectively approximates oracle skill selection and DAG orchestration substantially outperforms native flat invocation; structured composition is key to unlocking skill potential.
Conclusion: AgentSkillOS offers a scalable, structured paradigm for managing and executing large-scale agent skill ecosystems, and the results indicate that hierarchical organization and pipeline-style orchestration are decisive for skill invocation effectiveness.
Abstract: The rapid proliferation of Claude agent skills has raised the central question of how to effectively leverage, manage, and scale the agent skill ecosystem. In this paper, we propose AgentSkillOS, the first principled framework for skill selection, orchestration, and ecosystem-level management. AgentSkillOS comprises two stages: (i) Manage Skills, which organizes skills into a capability tree via node-level recursive categorization for efficient discovery; and (ii) Solve Tasks, which retrieves, orchestrates, and executes multiple skills through DAG-based pipelines. To evaluate the agent's ability to invoke skills, we construct a benchmark of 30 artifact-rich tasks across five categories: data computation, document creation, motion video, visual design, and web interaction. We assess the quality of task outputs using LLM-based pairwise evaluation, and the results are aggregated via a Bradley-Terry model to produce unified quality scores. Experiments across three skill ecosystem scales (200 to 200K skills) show that tree-based retrieval effectively approximates oracle skill selection, and that DAG-based orchestration substantially outperforms native flat invocation even when given the identical skill set. Our findings confirm that structured composition is the key to unlocking skill potential. Our GitHub repository is available at: https://github.com/ynulihao/AgentSkillOS.
[119] Reasoning Core: A Scalable Procedural Data Generation Suite for Symbolic Pre-training and Post-Training
Valentin Lacombe,Valentin Quesnel,Damien Sileo
Main category: cs.CL
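TL;DR: This paper presents Reasoning Core, a scalable suite for generating symbolic reasoning data across several formal domains, rigorously verified by external solvers. It supports difficulty control, reasoning-trace generation, and reward functions for reinforcement learning; experiments show that mixing it into pre-training improves downstream reasoning without harming language modeling performance.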
Details
Motivation: Existing procedural generators rely on fixed templates and cannot provide the distributional breadth needed at scale, which limits how far language models' symbolic reasoning ability can be extended.
Method: Builds the Reasoning Core suite covering five formal domains: PDDL planning, first-order logic with equality, context-free grammar parsing and generation, causal reasoning over Bayesian networks, and solving systems of equations. Each task is paired with an external solver for automatic verification, supports continuous difficulty control and solver trace generation, and exposes a unified interface for supervised pre-training and reinforcement learning.
Result: Mixing Reasoning Core data into pre-training significantly improves downstream reasoning while preserving or slightly improving language modeling quality; zero-shot evaluation shows that current frontier models such as GPT-5 still find these tasks challenging.
Conclusion: Reasoning Core provides high-quality, verifiable, scalable symbolic reasoning training data and is an effective way to push models beyond the reasoning ability that conventional corpora provide.
Abstract: Training on verifiable symbolic data is a promising way to expand the reasoning frontier of language models beyond what standard pre-training corpora provide. Yet existing procedural generators often rely on fixed puzzles or templates and do not deliver the distributional breadth needed at scale. We introduce Reasoning Core, a scalable suite that procedurally generates verifiable symbolic reasoning data across core formal domains: PDDL planning over randomized domains, first-order logic with equality, context-free grammar parsing and generation, causal reasoning over random Bayesian networks, and systems of equations. Each task is paired with an external solver for rigorous verification and admits continuous difficulty control for curriculum design. Examples can optionally include solver-derived reasoning traces, enabling supervised training from the earliest pre-training stages, and the same interface provides verifiable reward functions for reinforcement learning. Our experiments show that mixing Reasoning Core data into pre-training improves downstream reasoning while preserving, or slightly improving, language modeling quality. Zero-shot evaluations confirm these tasks challenge frontier models such as GPT-5. The code and data are publicly available under the MIT license.
[120] MemeIntel: Explainable Detection of Propagandistic and Hateful Memes
Mohamed Bayan Kmainasi,Abul Hasnat,Md Arid Hasan,Ali Ezzat Shahroor,Firoj Alam
Main category: cs.CL
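TL;DR: This paper introduces MemeXplain, the first explanation-enhanced multimodal dataset for Arabic propagandistic memes and English hateful memes, and designs a multi-stage optimization approach for training vision-language models that surpasses the previous state of the art on both label detection and explanation generation.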
Details
Motivation: Existing work suffers degraded performance when jointly modeling label detection and explanation generation, and no large-scale explanation-oriented multimodal dataset exists for Arabic propagandistic memes and English hateful memes.
Method: Builds the multilingual, multi-task MemeXplain dataset and proposes a multi-stage optimization strategy for training vision-language models (VLMs) that jointly optimizes classification and explanation generation.
Result: Accuracy improves by about 1.4% on ArMeme and about 2.2% on Hateful Memes, and explanation quality also improves; the code and dataset are to be open-sourced.
Conclusion: An explanation-enhanced dataset combined with staged training can improve joint detection and explainability in multimodal meme understanding, providing a new benchmark and tools for future research.
Abstract: The proliferation of multimodal content on social media presents significant challenges in understanding and moderating complex, context-dependent issues such as misinformation, hate speech, and propaganda. While efforts have been made to develop resources and propose new methods for automatic detection, limited attention has been given to jointly modeling label detection and the generation of explanation-based rationales, which often leads to degraded classification performance when trained simultaneously. To address this challenge, we introduce MemeXplain, an explanation-enhanced dataset for propagandistic memes in Arabic and hateful memes in English, making it the first large-scale resource for these tasks. To solve these tasks, we propose a multi-stage optimization approach and train Vision-Language Models (VLMs). Our results show that this strategy significantly improves both label detection and explanation generation quality over the base model, outperforming the current state-of-the-art with an absolute improvement of ~1.4% (Acc) on ArMeme and ~2.2% (Acc) on Hateful Memes. For reproducibility and future research, we aim to make the MemeXplain dataset and scripts publicly available (https://github.com/MohamedBayan/MemeIntel).
cs.CV [Back]
[121] Learning Under Extreme Data Scarcity: Subject-Level Evaluation of Lightweight CNNs for fMRI-Based Prodromal Parkinsons Detection
Naimur Rahman
Main category: cs.CV
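TL;DR: This paper studies how to evaluate deep learning models soundly under extreme data scarcity (resting-state fMRI for prodromal Parkinson's disease, with only 40 subjects). It shows that image-level splits cause severe information leakage, while strict subject-level splits sharply lower accuracy; the lightweight MobileNet generalizes most stably under subject-level evaluation, suggesting that evaluation strategy and model capacity matter more than network depth in this regime.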
Details
Motivation: Deep learning is often applied where data are scarce, highly correlated, and hard to obtain (such as neuroimaging for prodromal Parkinson's disease), but existing evaluation practice does not fully account for these constraints and tends to yield overly optimistic performance estimates.
Method: Using resting-state fMRI data from 40 subjects (20 patients and 20 healthy controls), fine-tunes ImageNet-pretrained CNNs (VGG19, Inception V3, Inception ResNet V2, MobileNet V1), compares image-level and strict subject-level data splits, and analyzes the effect of model capacity.
Result: Image-level splits leak information and give accuracy near 100%; under subject-level splits, test accuracy falls to 60-81%; MobileNet V1 shows the best and most stable generalization with fewer parameters.
Conclusion: Under extreme data scarcity, evaluation strategy (especially avoiding cross-subject leakage) and model capacity matter more than network depth; subject-level splits and lightweight models should be preferred for reliable results.
Abstract: Deep learning is often applied in settings where data are limited, correlated, and difficult to obtain, yet evaluation practices do not always reflect these constraints. Neuroimaging for prodromal Parkinsons disease is one such case, where subject numbers are small and individual scans produce many highly related samples. This work examines prodromal Parkinsons detection from resting-state fMRI as a machine learning problem centered on learning under extreme data scarcity. Using fMRI data from 40 subjects, including 20 prodromal Parkinsons cases and 20 healthy controls, ImageNet-pretrained convolutional neural networks are fine-tuned and evaluated under two different data partitioning strategies. Results show that commonly used image-level splits allow slices from the same subject to appear in both training and test sets, leading to severe information leakage and near-perfect accuracy. When a strict subject-level split is enforced, performance drops substantially, yielding test accuracies between 60 and 81 percent. Models with different capacity profiles are compared, including VGG19, Inception V3, Inception ResNet V2, and the lightweight MobileNet V1. Under subject-level evaluation, MobileNet demonstrates the most reliable generalization, outperforming deeper architectures despite having significantly fewer parameters. These results indicate that in extreme low-data regimes, evaluation strategy and model capacity have a greater impact on performance than architectural depth.
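The difference between image-level and subject-level partitioning can be made concrete with a short sketch. This is an illustration only, not the paper's pipeline: the function name `subject_level_split`, the `test_fraction` and `seed` parameters, and the subject id strings are all assumptions, and class stratification (keeping patients and controls balanced across the split) is deliberately left out.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def subject_level_split(
    subject_ids: Sequence[str], test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Split sample indices so that no subject appears on both sides.

    subject_ids[i] is the (hypothetical) subject id of sample i, e.g. one
    fMRI slice. Whole subjects are assigned to train or test, never slices.
    """
    by_subject: Dict[str, List[int]] = defaultdict(list)
    for index, subject in enumerate(subject_ids):
        by_subject[subject].append(index)

    # Shuffle subjects (not samples) with a fixed seed for repeatability.
    subjects = sorted(by_subject)
    random.Random(seed).shuffle(subjects)

    n_test = max(1, round(len(subjects) * test_fraction))
    test_subjects = set(subjects[:n_test])

    train = [i for s in subjects if s not in test_subjects for i in by_subject[s]]
    test = [i for s in subjects if s in test_subjects for i in by_subject[s]]
    return train, test
```

An image-level split would shuffle the sample indices directly, which lets slices from one subject land in both sets; grouping by subject first is what removes that leakage path.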
Although the analysis is limited to a single cohort of 40 subjects and does not include external validation or cross-validation, it provides a concrete case study and practical recommendations for evaluating deep learning models under severe data scarcity.
[122] Automated Quality Check of Sensor Data Annotations
Niklas Freund,Zekiye Ilknur-Öz,Tobias Klockau,Patrick Naumann,Philipp Neumaier,Martin Köppel
Main category: cs.CV
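TL;DR: This paper presents an open-source tool that automatically detects nine common error types in multi-sensor railway training data, substantially reducing manual effort and speeding up AI system development; in manual validation, six detection methods reached 100% precision and the other three reached 96% and 97%.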
Details
Motivation: Route and environment monitoring is essential in automated driving, particularly at GoA2 and GoA4; high-quality training data are indispensable for safety-critical AI systems, but manual data quality checks are time-consuming and labor-intensive.
Method: Designs and implements an open-source framework that automatically detects nine typical error types in multi-sensor railway datasets; all detections were manually validated to assess performance.
Result: Six error detection methods reach 100% precision, and the other three reach 96% and 97%.
Conclusion: The proposed automatic data quality assurance method can greatly reduce the manual checking burden and speed up AI system development, and suits safety-critical railway automation.
Abstract: The monitoring of the route and track environment plays an important role in automated driving. For example, it can be used as an assistance system for route monitoring in automation level Grade of Automation (GoA) 2, where the train driver is still on board. In fully automated, driverless driving at automation level GoA4, these systems finally take over environment monitoring completely independently. With the help of artificial intelligence (AI), they react automatically to risks and dangerous events on the route. To train such AI algorithms, large amounts of training data are required, which must meet high-quality standards due to their safety relevance. In this publication we present an automatic method for assuring the quality of training data, significantly reducing the manual workload and accelerating the development of these systems. We propose an open-source tool designed to detect nine common errors found in multi-sensor datasets for railway vehicles. To evaluate the performance of the framework, all detected errors were manually validated. Six issue detection methods achieved 100% precision, while three additional methods reached precision rates of 96% and 97%.
[123] VoxelDiffusionCut: Non-destructive Internal-part Extraction via Iterative Cutting and Structure Estimation
Takumi Hachimine,Yuhwan Kwon,Cheng-Yu Kuo,Tomoya Yamanokuchi,Takamitsu Matsubara
Main category: cs.CV
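TL;DR: This paper proposes VoxelDiffusionCut, which uses a diffusion model to iteratively estimate a product's internal voxel structure from partially observed cutting surfaces and plans cuts based on the estimated uncertainty, enabling non-destructive extraction of target parts such as batteries and motors.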
Details
Motivation: Recycling and dismantling sites need to extract internal target parts (such as batteries and motors) non-destructively, but the wide variety of products and the lack of disassembly information make it hard to determine where it is safe to cut.
Method: Proposes VoxelDiffusionCut, which represents the internal structure as voxels and uses a conditional diffusion model to infer part types and uncertainty in unobserved regions from the observed cutting surfaces; cutting plans are then generated iteratively to avoid erroneous cuts.
Result: Simulation experiments show the method can accurately estimate internal structure from partial cutting surfaces and use predictive uncertainty to extract the target part non-destructively.
Conclusion: Through voxelization and diffusion modeling, VoxelDiffusionCut eases the difficulties of high-dimensional 3D modeling and multi-modal uncertainty modeling, offering a new approach to automated intelligent disassembly.
Abstract: Non-destructive extraction of the target internal part, such as batteries and motors, by cutting surrounding structures is crucial at recycling and disposal sites. However, the diversity of products and the lack of information on disassembly procedures make it challenging to decide where to cut. This study explores a method for non-destructive extraction of a target internal part that iteratively estimates the internal structure from observed cutting surfaces and formulates cutting plans based on the estimation results. A key requirement is to estimate the probability of the target part's presence from partial observations. However, learning conditional generative models for this task is challenging: The high dimensionality of 3D shape representations makes learning difficult, and conventional models (e.g., conditional variational autoencoders) often fail to capture multi-modal predictive uncertainty due to mode collapse, resulting in overconfident predictions. To address these issues, we propose VoxelDiffusionCut, which iteratively estimates the internal structure represented as voxels using a diffusion model and plans cuts for non-destructive extraction of the target internal part based on the estimation results. Voxel representation allows the model to predict only attributes at fixed grid positions, i.e., types of constituent parts, making learning more tractable. The diffusion model completes the voxel representation conditioned on observed cutting surfaces, capturing uncertainty in unobserved regions to avoid erroneous cuts.
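One way to read "estimate the probability of the target part's presence" is as a per-voxel frequency over several sampled completions of the voxel grid. The sketch below assumes exactly that and pairs it with a deliberately simple planner that picks the axis-aligned plane with the lowest worst-case probability. Neither the sampling-frequency estimate nor the planner is taken from the paper; `target_probability`, `safest_x_plane`, and the nested-list grid layout are hypothetical names and choices for illustration.

```python
from typing import List, Sequence

Grid = List[List[List[int]]]        # grid[x][y][z] -> part-type label
ProbGrid = List[List[List[float]]]  # same layout, probability per voxel


def target_probability(samples: Sequence[Grid], target: int) -> ProbGrid:
    """Fraction of sampled completions in which each voxel is the target part."""
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one sampled voxel grid")
    nx, ny, nz = len(samples[0]), len(samples[0][0]), len(samples[0][0][0])
    return [
        [
            [sum(s[x][y][z] == target for s in samples) / n for z in range(nz)]
            for y in range(ny)
        ]
        for x in range(nx)
    ]


def safest_x_plane(prob: ProbGrid, candidates: Sequence[int]) -> int:
    """Pick the candidate x-plane whose riskiest voxel is least likely the target."""

    def risk(x: int) -> float:
        return max(p for row in prob[x] for p in row)

    return min(candidates, key=risk)
```

Samples that disagree about a voxel yield an intermediate probability there, which is the kind of multi-modal uncertainty the abstract says an overconfident model would hide.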
Experimental results in simulation suggest that the proposed method can estimate internal structures from observed cutting surfaces and enable non-destructive extraction of the target internal part by leveraging the estimated uncertainty.
[124] Efficient Image Super-Resolution with Multi-Scale Spatial Adaptive Attention Networks
Sushi Rao,Jingwei Li
Main category: cs.CV
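TL;DR: This paper proposes MSAAN, a lightweight image super-resolution network whose Multi-scale Spatial Adaptive Attention module (MSAA) jointly models local detail and long-range contextual dependencies, achieving high reconstruction fidelity at low model complexity.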
Details
Motivation: Addresses the common conflict between high reconstruction fidelity and low model complexity in existing super-resolution methods.
Method: Proposes the Multi-scale Spatial Adaptive Attention module (MSAA), comprising a Global Feature Modulation module (GFM) and a Multi-scale Feature Aggregation module (MFA), and introduces a Local Enhancement Block (LEB) and a Feature Interactive Gated Feed-Forward module (FIGFF) to improve geometric perception and nonlinear representation.
Result: On standard datasets (Set5, Set14, B100, Urban100, Manga109) at ×2, ×3, and ×4 scaling, MSAAN and its lightweight variant match or exceed state-of-the-art methods in PSNR and SSIM with markedly fewer parameters and lower computational cost; ablations confirm each component's contribution, and visual results show sharper edges and more realistic textures.
Conclusion: MSAAN strikes a good balance between accuracy and efficiency and is an efficient, high-performing lightweight super-resolution approach.
Abstract: This paper introduces a lightweight image super-resolution (SR) network, termed the Multi-scale Spatial Adaptive Attention Network (MSAAN), to address the common dilemma between high reconstruction fidelity and low model complexity in existing SR methods. The core of our approach is a novel Multi-scale Spatial Adaptive Attention Module (MSAA), designed to jointly model fine-grained local details and long-range contextual dependencies. The MSAA comprises two synergistic components: a Global Feature Modulation Module (GFM) that learns coherent texture structures through differential feature extraction, and a Multi-scale Feature Aggregation Module (MFA) that adaptively fuses features from local to global scales using pyramidal processing. To further enhance the network's capability, we propose a Local Enhancement Block (LEB) to strengthen local geometric perception and a Feature Interactive Gated Feed-Forward Module (FIGFF) to improve nonlinear representation while reducing channel redundancy. Extensive experiments on standard benchmarks (Set5, Set14, B100, Urban100, Manga109) across $\times2$, $\times3$, and $\times4$ scaling factors demonstrate that both our lightweight (MSAAN-light) and standard (MSAAN) versions achieve superior or competitive performance in terms of PSNR and SSIM, while maintaining significantly lower parameters and computational costs than state-of-the-art methods.
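For readers unfamiliar with the fidelity metric quoted above, PSNR is defined as $10\log_{10}(\mathrm{MAX}^2/\mathrm{MSE})$. The sketch below computes it over flat pixel sequences; it is a generic definition, not the paper's evaluation script. Super-resolution benchmarks commonly add steps such as converting to a luminance channel and cropping borders, which are not shown, and the function name `psnr` and the default `max_value` of 255 are assumptions for illustration.

```python
import math
from typing import Sequence


def psnr(
    reference: Sequence[float],
    reconstructed: Sequence[float],
    max_value: float = 255.0,
) -> float:
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(reconstructed) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstructed)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Because the scale is logarithmic, halving the mean squared error raises PSNR by about 3 dB, which is why small PSNR differences between methods can still be meaningful.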
Ablation studies validate the contribution of each component, and visual results show that MSAAN reconstructs sharper edges and more realistic textures.
[125] BiSe-Unet: A Lightweight Dual-path U-Net with Attention-refined Context for Real-time Medical Image Segmentation
M Iffat Hossain,Laura Brattain
Main category: cs.CV
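TL;DR: This paper proposes BiSe-UNet, a lightweight dual-path U-Net designed for real-time endoscopic image segmentation (such as colon polyp detection) on resource-constrained devices such as the Raspberry Pi 5, achieving real-time inference above 30 FPS while maintaining high accuracy.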
Details
Motivation: Existing models struggle to balance real-time speed and segmentation accuracy on embedded devices, with weaknesses in boundary quality and contextual understanding in particular, while clinical use calls for lightweight, accurate, deployable solutions.
Method: Proposes BiSe-UNet, which combines an attention-refined context path with a shallow spatial path and uses a depthwise separable decoder; the model is trained and evaluated on the Kvasir-Seg dataset.
Result: Achieves competitive Dice and IoU on Kvasir-Seg and real-time inference above 30 FPS on a Raspberry Pi 5.
Conclusion: BiSe-UNet balances accuracy, speed, and model size, and suits real-time medical image segmentation on edge medical devices.
Abstract: During image-guided procedures, real-time image segmentation is often required. This demands lightweight AI models that can operate on resource-constrained devices. One important use case is endoscopy-guided colonoscopy, where polyps must be detected in real time. The Kvasir-Seg dataset, a publicly available benchmark for this task, contains 1,000 high-resolution endoscopic images of polyps with corresponding pixel-level segmentation masks. Achieving real-time inference speed for clinical deployment in constrained environments requires highly efficient and lightweight network architectures. However, many existing models remain too computationally intensive for embedded deployment. Lightweight architectures, although faster, often suffer from reduced spatial precision and weaker contextual understanding, leading to degraded boundary quality and reduced diagnostic reliability. To address these challenges, we introduce BiSe-UNet, a lightweight dual-path U-Net that integrates an attention-refined context path with a shallow spatial path for detailed feature preservation, followed by a depthwise separable decoder for efficient reconstruction. Evaluated on the Kvasir-Seg dataset, BiSe-UNet achieves competitive Dice and IoU scores while sustaining real-time throughput exceeding 30 FPS on Raspberry Pi 5, demonstrating its effectiveness for accurate, lightweight, and deployable medical image segmentation on edge hardware.
[126] NovaLAD: A Fast, CPU-Optimized Document Extraction Pipeline for Generative AI and Data Intelligence
Aman Ulla
Main category: cs.CV
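TL;DR: NovaLAD is an efficient, lightweight document parsing system that combines two YOLO models (element and layout detection), rule-based grouping, and optional vision-language enhancement. It delivers high-performance parsing on CPU, supports multiple output formats, and outperforms existing open-source and commercial parsers on the DP-Bench benchmark.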
Details
Motivation: Existing document parsing methods fall short in accuracy, speed, GPU dependence, and multi-format support, and struggle to meet the need of downstream tasks such as RAG and knowledge bases for high-quality structured documents.
Method: Proposes the NovaLAD system: two YOLO models run in parallel to detect semantic elements (such as titles and tables) and layout regions (such as column groups and row groups); a ViT image-relevance classifier pre-filters images so that only relevant ones are sent to the Vision LLM for structured information extraction; rule-based grouping and parallel processing (OCR, detection, conversion, and so on) are used; and the whole pipeline is designed for CPU deployment.
Result: Reaches 96.49% TEDS and 98.51% NID on DP-Bench, ahead of mainstream commercial and open-source parsers; outputs structured JSON, Markdown, RAG-ready text, and knowledge graphs; runs entirely on CPU without a GPU.
Conclusion: Through cooperative detection, relevance filtering, and efficient parallel design, NovaLAD achieves accurate, high-throughput, multi-format document parsing without a GPU, offering a practical option for lightweight, low-cost RAG and knowledge engineering.
Abstract: Document extraction is an important step before retrieval-augmented generation (RAG), knowledge bases, and downstream generative AI can work. It turns unstructured documents like PDFs and scans into structured text and layout-aware representations. We introduce NovaLAD, a comprehensive document parsing system that integrates two concurrent YOLO object detection models - element detection and layout detection - with rule-based grouping and optional vision-language enhancement. When a page image is sent in, the first thing that happens is that it goes through both models at the same time. The element model finds semantic content like the title, header, text, table, image, and so on, and the layout model finds structural regions like layout_box, column_group, multi_column, row_group, and so on. A key design decision is to first send an image or figure through an image classifier (ViT) that decides whether it is relevant or not. Only useful images are then submitted to the Vision LLM for title, summary, and structured information, which cuts down on noise and costs. NovaLAD is built for speed: it works on CPU, employs parallel execution for detection, classification, OCR, and conversion, and generates several forms, including structured JSON, Markdown, RAG-ready texts, and knowledge graphs. We test on the DP-Bench benchmark (upstage/dp-bench) and get 96.49% TEDS and 98.51% NID, which is better than both commercial and open-source parsers.
This paper explains how to extract data, how the architecture works, how data flows, and how to make NovaLAD both accurate and usable without needing a GPU.
[127] CT-Flow: Orchestrating CT Interpretation Workflow with Model Context Protocol Servers
Yannian Gu,Xizhuo Zhang,Linjie Mu,Yongrui Yu,Zhongzhen Huang,Shaoting Zhang,Xiaofan Zhang
Main category: cs.CV
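TL;DR: This paper proposes CT-Flow, a tool-aware agentic framework for 3D CT imaging that uses the Model Context Protocol (MCP) to support multi-step, tool-calling dynamic reasoning, and substantially outperforms existing methods in diagnostic accuracy and autonomous tool invocation success rate.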
Details
Motivation: Existing LVLMs mostly use static single-pass inference for 3D CT analysis, whereas clinical radiological diagnosis is a dynamic, iterative process that relies on tools such as measurement, segmentation, and radiomics; the two paradigms are far apart.
Method: Proposes the CT-Flow framework, which uses the Model Context Protocol (MCP) for open, tool-aware reasoning; builds CT-FlowBench, the first instruction-tuning benchmark for 3D CT tool use; and enables the model to decompose natural-language queries into multi-step tool-call sequences.
Result: Reaches state of the art on CT-FlowBench and standard 3D VQA datasets: diagnostic accuracy improves by 41%, and the autonomous tool invocation success rate reaches 95%.
Conclusion: CT-Flow provides a scalable foundation for bringing autonomous agent technology into real clinical radiology workflows.
Abstract: Recent advances in Large Vision-Language Models (LVLMs) have shown strong potential for multi-modal radiological reasoning, particularly in tasks like diagnostic visual question answering (VQA) and radiology report generation. However, most existing approaches for 3D CT analysis largely rely on static, single-pass inference. In practice, clinical interpretation is a dynamic, tool-mediated workflow where radiologists iteratively review slices and use measurement, radiomics, and segmentation tools to refine findings. To bridge this gap, we propose CT-Flow, an agentic framework designed for interoperable volumetric interpretation. By leveraging the Model Context Protocol (MCP), CT-Flow shifts from closed-box inference to an open, tool-aware paradigm. We curate CT-FlowBench, the first large-scale instruction-tuning benchmark tailored for 3D CT tool-use and multi-step reasoning. Built upon this, CT-Flow functions as a clinical orchestrator capable of decomposing complex natural language queries into automated tool-use sequences. Experimental evaluations on CT-FlowBench and standard 3D VQA datasets demonstrate that CT-Flow achieves state-of-the-art performance, surpassing baseline models by 41% in diagnostic accuracy and achieving a 95% success rate in autonomous tool invocation. This work provides a scalable foundation for integrating autonomous, agentic intelligence into real-world clinical radiology.
[128] OrthoAI: A Lightweight Deep Learning Framework for Automated Biomechanical Analysis in Clear Aligner Orthodontics -- A Methodological Proof-of-Concept
Edouard Lansiaux,Margaux Leman,Mehdi Ammi
Main category: cs.CV
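TL;DR: OrthoAI is an open-source decision-support system that combines lightweight 3D tooth segmentation with automated biomechanical analysis to help orthodontists evaluate clear aligner treatment plans. It has so far been validated only on point clouds reconstructed from landmarks, not on real intraoral scan data.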
Details
Motivation: Current clear aligner treatment plans (e.g., in ClinCheck) depend on manual review, which is slow and error-prone; automated assistance is needed to improve the accuracy and efficiency of evaluation.
Method: Proposes the OrthoAI framework: a Dynamic Graph CNN (DGCNN) is trained for 3D tooth segmentation on the 3DTeethLand dataset and combined with an evidence-based, rule-driven biomechanical engine (following Kravitz et al. 2009 and Simon et al. 2014) that decomposes each tooth's motion into six degrees of freedom, assesses predictability, raises alerts when limits are exceeded, and produces a composite index.
Result: The model has only 60,705 parameters and reaches an 81.4% Tooth Identification Rate and 8.25% mIoU on surrogate point clouds; end-to-end inference takes under 4 seconds; code, weights, and analysis tools are open-sourced.
Conclusion: OrthoAI provides a reproducible baseline for research at the intersection of geometric deep learning and digital orthodontics, exposes the perceptual limits of sparse landmark supervision, and points to full-mesh training as the way to improve clinical applicability; the current version is not suitable for real intraoral scan data.
Abstract: Clear aligner therapy now dominates orthodontics, yet clinician review of digitally planned tooth movements, typically via ClinCheck (Align Technology), remains slow and error-prone. We present OrthoAI, an open-source proof-of-concept decision-support system combining lightweight 3D dental segmentation with automated biomechanical analysis to assist treatment-plan evaluation. The framework uses a Dynamic Graph CNN trained on landmark-reconstructed point clouds from 3DTeethLand (MICCAI) and integrates a rule-based biomechanical engine grounded in orthodontic evidence (Kravitz et al 2009; Simon et al 2014). The system decomposes per-tooth motion across six degrees of freedom, computes movement-specific predictability, issues alerts when biomechanical limits are exceeded, and derives an exploratory composite index. With 60,705 trainable parameters, segmentation reaches a Tooth Identification Rate of $81.4\%$ and mIoU of $8.25\%$ on surrogate point clouds, reflecting sparse landmark supervision rather than dense meshes. Although spatial boundaries are coarse, downstream analysis depends mainly on tooth identity and approximate centroid/axis estimation. Results establish a baseline for future full-mesh training and highlight current perceptual limits. The end-to-end pipeline runs in $<4s$ on consumer hardware. Code, weights, and analysis tools are released to support reproducible research in geometric deep learning and digital orthodontics.
The system has not been validated on real intraoral meshes and should not be assumed to generalize beyond landmark-derived representations.
[129] QuickGrasp: Responsive Video-Language Querying Service via Accelerated Tokenization and Edge-Augmented Inference
Miao Zhang,Ruixiao Zhang,Jianxin Shi,Hengzhi Wang,Hao Fang,Jiangchuan Liu
Main category: cs.CV
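TL;DR: QuickGrasp is a local-first, quality-of-service-aware deployment system for video-language models (VLMs). Through shared vision representations, accelerated video tokenization, query-adaptive edge augmentation, and delay-aware vision token density configuration, it keeps large-model accuracy while substantially reducing response delay.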
Details
Motivation: Large video-language models (VLMs) are hard to deploy because of high resource consumption and the high latency of remote deployment; small local VLMs are fast but not accurate enough, so a balance between response speed and accuracy is needed.
Method: Proposes the QuickGrasp system: a local-first architecture with on-demand edge augmentation; shared vision representations that exploit the modularity of VLMs; and three key techniques, namely accelerated video tokenization, query-adaptive edge augmentation, and delay-aware, accuracy-preserving vision token density configuration.
Result: Experiments show QuickGrasp matches large-VLM accuracy on several video understanding benchmarks while reducing response delay by up to 12.8x.
Conclusion: QuickGrasp narrows the performance-latency gap between small local models and large cloud models, advancing real-time video querying services for open-world understanding.
Abstract: Video-language models (VLMs) are reshaping video querying services, bringing unified solutions to complex perception and reasoning tasks. However, deploying large VLMs in real-world systems remains challenging due to their high resource demands, and remote-based deployment often results in unacceptable response delays. Although small, locally deployable VLMs offer faster responses, they unavoidably fall short in accuracy. To reconcile this trade-off, we propose QuickGrasp, a responsive, quality of service (QoS)-aware system that bridges this gap through a local-first architecture with on-demand edge augmentation. Built upon the highly modular architecture of VLMs, QuickGrasp shares the vision representation across model variants to avoid redundant computation. To maximize system-wide efficiency, QuickGrasp introduces three key designs: accelerated video tokenization, query-adaptive edge augmentation, and delay-aware, accuracy-preserving vision token density configuration. We implement a prototype of QuickGrasp and evaluate it across multiple video understanding benchmarks. The results show that QuickGrasp matches the accuracy of large VLMs while achieving up to a 12.8x reduction in response delay. QuickGrasp represents a key advancement toward building responsive video querying services for open-world understanding that fully leverage the capabilities of VLMs.
[130] Segmenting Low-Contrast XCTs of Concretes: An Unsupervised Approach
Kaustav Das,Gaston Rauchs,Jan Sykora,Anna Kucerova
Main category: cs.CV
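TL;DR: This paper proposes a self-annotation-based unsupervised method that uses superpixel algorithms and the receptive field of a CNN to train a semantic segmentation model for low-contrast XCT images of concrete, with no manually labeled data.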
Details
Motivation: In concrete XCT images, aggregates and mortar have similar X-ray attenuation coefficients, giving low image contrast, and labeled data are costly or infeasible to obtain, so an unsupervised semantic segmentation method is needed.
Method: Uses superpixel-based self-annotation together with the CNN's receptive field to relate local image regions to global context, enabling unsupervised training.
Result: The method is validated on the authors' own XCT datasets; it identifies semantically similar structures and shows good unsupervised segmentation performance.
Conclusion: The proposed self-annotated unsupervised training method can handle low-contrast XCT segmentation, offers a feasible option for industrial imaging tasks that lack labeled data, and the paper outlines directions for further improvement.
Abstract: This work tests a self-annotation-based unsupervised methodology for training a convolutional neural network (CNN) model for semantic segmentation of X-ray computed tomography (XCT) scans of concretes. Concrete poses a unique challenge for XCT imaging due to similar X-ray attenuation coefficients of aggregates and mortar, resulting in low contrast between the two phases in the ensuing images. While CNN-based models are a proven technique for semantic segmentation in such challenging cases, they typically require labeled training data, which is often unavailable for new datasets or is costly to obtain. To counter that limitation, a self-annotation technique is used here which leverages superpixel algorithms to identify perceptually similar local regions in an image and relates them to the global context in the image by utilizing the receptive field of a CNN-based model. This enables the model to learn a global-local relationship in the images and enables identification of semantically similar structures. We therefore present the performance of the unsupervised training methodology on our XCT datasets and discuss potential avenues for further improvements.
[131] Predicting Local Climate Zones using Urban Morphometrics and Satellite Imagery
Hugo Majer,Martin Fleischmann
Main category: cs.CV
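TL;DR: This study evaluates how well urban morphometrics can predict Local Climate Zones (LCZs). Prediction from 2D morphometric indicators alone is limited and site-dependent, and fusing morphometrics with satellite imagery brings only slight accuracy gains at some sites with no substantial overall gain. This indicates that the link between LCZs and observable urban form is weak and that the framework should be used with caution in morphological studies.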
Details
Motivation: Although the LCZ framework is often used in urban form analysis, its mapping relies mainly on satellite imagery; urban morphometrics offers another way to describe urban form through numerical physical characteristics, and this paper evaluates its potential for predicting LCZs.
Method: Computes 321 2D morphometric attributes (multi-scale, multi-dimensional) from building footprints and street networks and builds four classification schemes: morphometric-only prediction, image-only prediction (baseline), and two methods fusing morphometrics with imagery, evaluated at five sites.
Result: Morphometric-only prediction is site-dependent, with selective and inconsistent correspondence; fusion improves accuracy markedly at only two sites, with negligible or slightly negative gains elsewhere; overall, the link between LCZs and observable urban form is weak.
Conclusion: The relationship between the LCZ framework and measurable, visible urban form is tenuous, so the framework should be used with caution in morphological studies.
Abstract: The Local Climate Zone (LCZ) framework is commonly employed to represent urban form in morphological analyses, although its mapping predominantly relies on satellite imagery. Urban morphometrics, describing urban form via numerical measures of physical aspects and spatial relationships of its elements, offers another avenue. This study evaluates the ability of morphometric assessment to predict LCZs using a) a morphometric-based LCZ prediction, and b) a fusion-based LCZ prediction combining morphometrics with satellite imagery. We calculate 321 2D morphometric attributes from building footprints and street networks, covering their various properties at multiple spatial scales. Subsequently, we develop four classification schemes: morphometric-based prediction, baseline image-based prediction, and two techniques fusing morphometrics with imagery. We evaluate them across five sites. Results from the morphometric-based prediction indicate that the correspondence between 2D urban morphometrics and urban LCZ types is selective and inconsistent, rendering the efficacy of this method site-dependent. Nevertheless, it demonstrated that a much broader range of urban form properties is relevant for distinguishing LCZ types compared to standard parameters. Relative to the image-based baseline, the fusion yielded relatively distinct accuracy improvements for urban LCZ types at two sites; however, gains at the remaining sites were negligible or even slightly negative, suggesting that the benefits of fusion are modest and inconsistent.
Collectively, these results indicate that the relationship between the LCZs and the measurable, visible aspects of urban form is tenuous, thus the LCZ framework should be used with caution in morphological studies.
[132] You Don't Need All That Attention: Surgical Memorization Mitigation in Text-to-Image Diffusion Models
Kairan Zhao,Eleni Triantafillou,Peter Triantafillou
Main category: cs.CV
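TL;DR: This paper proposes GUARD, a framework that uses attractive-repulsive dynamics to steer text-to-image diffusion models away from training data during denoising, mitigating memorization while preserving image quality.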
Details
Motivation: Generative models may memorize training data and produce content highly similar to training images, raising privacy and copyright concerns.
Method: Proposes the GUARD framework with an attention attenuation mechanism (a statistical method automatically identifies the prompt positions to attenuate) that adjusts cross-attention dynamically and precisely at inference time, so that generated images follow the prompt while staying away from training data.
Result: GUARD gives the most robust mitigation of both verbatim and template memorization across two architectures, with image quality better than or at least comparable to existing methods.
Conclusion: GUARD is an efficient, general, quality-preserving memorization mitigation method that offers a new approach to privacy and copyright protection in text-to-image generation.
Abstract: Generative models have been shown to "memorize" certain training data, leading them to generate verbatim or near-verbatim images, which may cause privacy concerns or copyright infringement. We introduce Guidance Using Attractive-Repulsive Dynamics (GUARD), a novel framework for memorization mitigation in text-to-image diffusion models. GUARD adjusts the image denoising process to guide the generation away from an original training image and towards one that is distinct from training data while remaining aligned with the prompt, guarding against reproducing training data, without hurting image generation quality. We propose a concrete instantiation of this framework, where the positive target that we steer towards is given by a novel method for (cross) attention attenuation based on (i) a novel statistical mechanism that automatically identifies the prompt positions where cross attention must be attenuated and (ii) attenuating cross-attention in these per-prompt locations. The resulting GUARD offers a surgical, dynamic per-prompt inference-time approach that, we find, is by far the most robust method in terms of consistently producing state-of-the-art results for memorization mitigation across two architectures and for both verbatim and template memorization, while also improving upon or yielding comparable results in terms of image quality.
[133] TinyVLM: Zero-Shot Object Detection on Microcontrollers via Vision-Language Distillation with Matryoshka Embeddings
Bibin Wilson
Main category: cs.CV
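TL;DR: TinyVLM is the first lightweight framework to run zero-shot object detection on microcontrollers (MCUs) with less than 1MB of memory; a decoupled architecture, Matryoshka distillation, and quantized embedding storage greatly reduce memory use while keeping accuracy competitive.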
Details
Motivation: Existing zero-shot object detection methods depend on large vision-language models such as CLIP, whose memory needs far exceed MCU resources; a lightweight approach is needed for deployment on edge devices.
Method: Proposes the TinyVLM framework: (1) visual inference is decoupled from text encoding, with precomputed class embeddings stored in flash; (2) Matryoshka distillation trains nested embeddings at multiple dimensions (16-256); (3) quantized embedding storage cuts class prototype memory by 4x.
Result: Trained on CC3M, TinyVLM achieves competitive zero-shot accuracy on COCO, Flowers102, and Food101; deployment needs only 285KB of RAM and 892KB of flash; it runs at 26 FPS on STM32H7 and at over 1,000 FPS on MAX78000.
Conclusion: TinyVLM is the first to achieve efficient zero-shot object detection on resource-constrained MCUs, opening a new path for edge intelligence.
Abstract: Zero-shot object detection enables recognising novel objects without task-specific training, but current approaches rely on large vision language models (VLMs) like CLIP that require hundreds of megabytes of memory - far exceeding the constraints of microcontroller units (MCUs). We present TinyVLM, the first framework enabling zero-shot object detection on resource-constrained MCUs with less than 1MB of memory. Our approach introduces three key innovations: (1) a decoupled architecture that separates visual inference from text encoding, allowing precomputed class embeddings to be stored in flash memory; (2) Matryoshka distillation that trains nested embeddings at multiple dimensions (16-256), enabling flexible accuracy-memory trade-offs; and (3) quantized embedding storage that reduces class prototype memory by 4x with minimal accuracy loss. Trained on Conceptual Captions 3M (CC3M), TinyVLM achieves competitive zero-shot accuracy on COCO, Flowers102, and Food101 while requiring only 285KB of RAM and 892KB of flash memory for the deployed vision encoder. We demonstrate real-time inference at 26 FPS on STM32H7 and over 1,000 FPS on MAX78000 with its CNN accelerator, enabling practical zero-shot detection on edge devices for the first time.
[134] Latent Replay Detection: Memory-Efficient Continual Object Detection on Microcontrollers via Task-Adaptive Compression
Bibin Wilson
Main category: cs.CV
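TL;DR: This paper presents Latent Replay Detection (LRD), the first framework for continual object detection under microcontroller (MCU) memory constraints. Task-adaptive FiLM compression and spatially diverse exemplar selection let it learn multiple tasks within a 64KB memory budget while maintaining detection performance.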
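The spatially diverse exemplar selection in this entry is described as farthest-point sampling in IoU space. A minimal sketch follows, assuming the distance between two boxes is 1 - IoU and that selection is greedy from a fixed starting box; the function names and example boxes are hypothetical.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def farthest_point_sampling(boxes, k, start=0):
    """Greedily pick k indices; each new pick maximizes its minimum
    (1 - IoU) distance to the boxes already chosen."""
    k = min(k, len(boxes))
    if k == 0:
        return []
    chosen = [start]
    min_dist = [1.0 - iou(boxes[start], b) for b in boxes]
    while len(chosen) < k:
        candidates = [i for i in range(len(boxes)) if i not in chosen]
        nxt = max(candidates, key=lambda i: min_dist[i])
        chosen.append(nxt)
        for i, b in enumerate(boxes):
            min_dist[i] = min(min_dist[i], 1.0 - iou(boxes[nxt], b))
    return chosen

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60), (0, 0, 9, 9), (100, 0, 110, 10)]
selected = farthest_point_sampling(boxes, k=3)

# Buffer arithmetic from the abstract: 150 bytes per latent sample in a 64KB buffer,
# assuming 1KB = 1024 bytes and no per-buffer overhead.
capacity = (64 * 1024) // 150
```

With these toy boxes the two near-duplicates of the first box are skipped in favor of the two distant boxes, and the capacity calculation is consistent with the "400+ exemplars" figure in the abstract.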
Details
Motivation: 现有目标检测模型在微控制器上部署后无法学习新类别,而传统持续学习方法需存储原始图像,远超MCU仅数十KB的内存限制。 Method: 提出LRD框架,包含三项关键技术:1)基于FiLM条件调制的任务自适应可学习压缩;2)在IoU空间中使用最远点采样进行空间多样性样本选择;3)将每样本压缩至150字节以适配MCU部署。 Result: 在CORe50数据集(50类、5任务)上,LRD在首任务达到良好mAP@50,并在后续任务中显著优于朴素微调;在STM32H753ZI、ESP32-S3和MAX78000等MCU上实测推理延迟为4.9–97.5ms,内存占用严格控制在64KB内。 Conclusion: LRD是首个支持MCU级持续目标检测的实用框架,其任务自适应压缩与空间感知样本选择协同保障了检测精度与定位能力,首次实现了边缘设备上的持续学习落地。 Abstract: Deploying object detection on microcontrollers (MCUs) enables intelligent edge devices but current models cannot learn new object categories after deployment. Existing continual learning methods require storing raw images far exceeding MCU memory budgets of tens of kilobytes. We present Latent Replay Detection (LRD), the first framework for continual object detection under MCU memory constraints. Our key contributions are: 1. Task-Adaptive Compression: Unlike fixed PCA, we propose learnable compression with FiLM (Feature-wise Linear Modulation) conditioning, where task specific embeddings modulate the compression to preserve discriminative features for each task's distribution; 2. Spatial-Diverse Exemplar Selection: Traditional sampling ignores spatial information critical for detection - we select exemplars maximizing bounding box diversity via farthest-point sampling in IoU space, preventing localization bias in replay; 3. MCU-Deployable System: Our latent replay stores 150 bytes per sample versus >10KB for images, enabling a 64KB buffer to hold 400+ exemplars. Experiments on CORe50 (50 classes, 5 tasks) demonstrate that LRD achieves mAP@50 on the initial task and maintains strong performance across subsequent tasks - a significant improvement over naive fine-tuning while operating within strict MCU constraints. Our task-adaptive FiLM compression and spatial diverse exemplar selection work synergistically to preserve detection capabilities. 
Deployed on STM32H753ZI, ESP32-S3, and MAX78000 MCUs, LRD achieves 4.9-97.5ms latency per inference within a 64KB memory budget-enabling practical continual detection on edge devices for the first time.[135] Towards Data-driven Nitrogen Estimation in Wheat Fields using Multispectral Images
Andreas Tritsarolis,Tomaž Bokan,Matej Brumen,Domen Mongus,Yannis Theodoridis
Main category: cs.CV
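TL;DR: This paper presents TerrAI, a neural-network-based solution for Targeted Spraying and Fertilization (TSF) that aims to improve agricultural resource efficiency and reduce environmental impact.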
Details
Motivation: 现代农业化推动了先进分析和决策支持系统的发展,以提高资源利用率并减少环境影响;精准的靶向喷洒与施肥(TSF)因受作物类型、施肥阶段、土壤条件和天气动态等外部因素影响而难以实现。 Method: 提出TerrAI,一种考虑不同地块时空变异性的神经网络方法,用于靶向喷洒与施肥(TSF)。 Result: 在真实世界遥感数据集上的实验验证了TerrAI在数据驱动农业实践中的有效性。 Conclusion: TerrAI为解决TSF难题提供了有效方案,有助于优化农业资源使用并促进环境可持续性。 Abstract: The modernization of agriculture has motivated the development of advanced analytics and decision-support systems to improve resource utilization and reduce environmental impacts. Targeted Spraying and Fertilization (TSF) is a critical operation that enables farmers to apply inputs more precisely, optimizing resource use and promoting environmental sustainability. However, accurate TSF is a challenging problem, due to external factors such as crop type, fertilization phase, soil conditions, and weather dynamics. In this paper, we present TerrAI, a Neural Network-based solution for TSF, which considers the spatio-temporal variability across different parcels. Our experimental study over a real-world remote sensing dataset validates the soundness of TerrAI on data-driven agricultural practices.[136] Steering Away from Memorization: Reachability-Constrained Reinforcement Learning for Text-to-Image Diffusion
Sathwik Karnik,Juyeop Kim,Sanmi Koyejo,Jong-Seok Lee,Somil Bansal
Main category: cs.CV
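TL;DR: This paper proposes Reachability-Aware Diffusion Steering (RADS), an inference-time framework that models diffusion denoising as a dynamical system and combines reachability analysis with constrained reinforcement learning. Without modifying the diffusion backbone, it suppresses training-data memorization in text-to-image diffusion models while preserving generation quality, diversity, and prompt alignment.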
Details
Motivation: 文本到图像扩散模型存在训练数据记忆问题,现有缓解方法往往以牺牲图像质量或提示对齐为代价,亟需兼顾安全性与保真度的新方案。 Method: 将扩散过程建模为动力系统,利用可达性分析近似导致记忆样本的‘反向可达管’,并将其建模为在caption embedding空间施加最小扰动的约束强化学习问题,实现推理时轨迹引导。 Result: RADS在生成多样性(SSCD)、质量(FID)和对齐度(CLIP)之间实现了优于现有方法的Pareto前沿,并具备即插即用、无需修改扩散主干网络的优势。 Conclusion: RADS是一种高效、通用且无需重训练的安全生成框架,为扩散模型的记忆风险提供了实用化解决方案。 Abstract: Text-to-image diffusion models often memorize training data, revealing a fundamental failure to generalize beyond the training set. Current mitigation strategies typically sacrifice image quality or prompt alignment to reduce memorization. To address this, we propose Reachability-Aware Diffusion Steering (RADS), an inference-time framework that prevents memorization while preserving generation fidelity. RADS models the diffusion denoising process as a dynamical system and applies concepts from reachability analysis to approximate the "backward reachable tube"--the set of intermediate states that inevitably evolve into memorized samples. We then formulate mitigation as a constrained reinforcement learning (RL) problem, where a policy learns to steer the trajectory away from memorization via minimal perturbations in the caption embedding space. Empirical evaluations show that RADS achieves a superior Pareto frontier between generation diversity (SSCD), quality (FID), and alignment (CLIP) compared to state-of-the-art baselines. Crucially, RADS provides robust mitigation without modifying the diffusion backbone, offering a plug-and-play solution for safe generation. Our website is available at: https://s-karnik.github.io/rads-memorization-project-page/.[137] From Scale to Speed: Adaptive Test-Time Scaling for Image Editing
Xiangyan Qu,Zhenlong Yuan,Jing Tang,Rui Chen,Datao Tang,Meng Yu,Lei Sun,Yancheng Bai,Xiangxiang Chu,Gaopeng Gou,Gang Xiong,Yujun Cai
Main category: cs.CV
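TL;DR: This paper proposes ADE-CoT, a framework that improves the efficiency and performance of Chain-of-Thought reasoning for image editing through difficulty-aware resource allocation, edit-specific early verification, and depth-first opportunistic stopping.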
Details
Motivation: 现有Image-CoT方法主要面向文本到图像生成,难以适配图像编辑任务的目标导向性与源图像约束,导致资源分配低效、早期验证不可靠、结果冗余三大挑战。 Method: 提出ADE-CoT:(1)基于估计编辑难度动态分配采样预算;(2)利用区域定位与字幕一致性进行编辑特异性早期剪枝;(3)由实例特定验证器引导的深度优先机会终止机制。 Result: 在Step1X-Edit、BAGEL、FLUX.1 Kontext三个SOTA编辑模型及三个基准上验证,ADE-CoT在同等采样预算下性能优于Best-of-N,且提速超2倍。 Conclusion: ADE-CoT是一种高效的图像编辑测试时扩展框架,显著改善性能-效率权衡,为编辑场景下的CoT范式提供了新思路。 Abstract: Image Chain-of-Thought (Image-CoT) is a test-time scaling paradigm that improves image generation by extending inference time. Most Image-CoT methods focus on text-to-image (T2I) generation. Unlike T2I generation, image editing is goal-directed: the solution space is constrained by the source image and instruction. This mismatch causes three challenges when applying Image-CoT to editing: inefficient resource allocation with fixed sampling budgets, unreliable early-stage verification using general MLLM scores, and redundant edited results from large-scale sampling. To address this, we propose ADaptive Edit-CoT (ADE-CoT), an on-demand test-time scaling framework to enhance editing efficiency and performance. It incorporates three key strategies: (1) a difficulty-aware resource allocation that assigns dynamic budgets based on estimated edit difficulty; (2) edit-specific verification in early pruning that uses region localization and caption consistency to select promising candidates; and (3) depth-first opportunistic stopping, guided by an instance-specific verifier, that terminates when intent-aligned results are found. Extensive experiments on three SOTA editing models (Step1X-Edit, BAGEL, FLUX.1 Kontext) across three benchmarks show that ADE-CoT achieves superior performance-efficiency trade-offs. With comparable sampling budgets, ADE-CoT obtains better performance with more than 2x speedup over Best-of-N.[138] GrapHist: Graph Self-Supervised Learning for Histopathology
Sevda Öğüt,Cédric Vincent-Cuaz,Natalia Dubljevic,Carlos Hurtado,Vaishnavi Subramanian,Pascal Frossard,Dorina Thanou
Main category: cs.CV
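TL;DR: This paper proposes GrapHist, a cell-graph-based self-supervised learning framework for digital pathology. It combines masked autoencoders with heterophilic graph neural networks, is pre-trained on 11 million cell graphs from breast tissue, and reaches performance competitive with vision-based models on downstream tasks while using four times fewer parameters.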
Details
Motivation: 现有自监督视觉模型虽在数字病理学中成功,但其通用Transformer架构未考虑组织图像中的细胞及其复杂相互作用这一生物学本质。 Method: 提出GrapHist框架,将组织建模为细胞图,整合掩码自编码器与专为捕获肿瘤微环境异质性设计的异质图神经网络,并在大规模细胞图数据集上进行自监督预训练。 Result: GrapHist在切片、区域和细胞级任务中性能媲美视觉模型,参数减少四倍;在癌症亚型分类任务上远超全监督图模型;并发布首个大规模图基数字病理学基准数据集。 Conclusion: 以生物学为启发的细胞图建模能更高效地进行表征学习,GrapHist为数字病理学提供了轻量、通用且结构感知的新范式。 Abstract: Self-supervised vision models have achieved notable success in digital pathology. However, their domain-agnostic transformer architectures are not originally designed to account for fundamental biological elements of histopathology images, namely cells and their complex interactions. In this work, we hypothesize that a biologically-informed modeling of tissues as cell graphs offers a more efficient representation learning. Thus, we introduce GrapHist, a novel graph-based self-supervised learning framework for histopathology, which learns generalizable and structurally-informed embeddings that enable diverse downstream tasks. GrapHist integrates masked autoencoders and heterophilic graph neural networks that are explicitly designed to capture the heterogeneity of tumor microenvironments. We pre-train GrapHist on a large collection of 11 million cell graphs derived from breast tissues and evaluate its transferability across in- and out-of-domain benchmarks. Our results show that GrapHist achieves competitive performance compared to its vision-based counterparts in slide-, region-, and cell-level tasks, while requiring four times fewer parameters. It also drastically outperforms fully-supervised graph models on cancer subtyping tasks. Finally, we also release five graph-based digital pathology datasets used in our study at https://huggingface.co/ogutsevda/datasets , establishing the first large-scale graph benchmark in this field. Our code is available at https://github.com/ogutsevda/graphist .[139] Disentangled Hierarchical VAE for 3D Human-Human Interaction Generation
Zichen Geng,Zeeshan Hayder,Bo Miao,Jian Liu,Wei Liu,Ajmal Mian
Main category: cs.CV
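TL;DR: This paper proposes a latent diffusion model built on a Disentangled Hierarchical Variational Autoencoder (DHVAE) for generating structured, controllable, and physically plausible 3D human-human interaction (HHI) motion. A CoTransformer disentangles global interaction context from individual motion, contrastive learning improves the physical plausibility of contacts, and DDIM with an AdaLN-Transformer denoises in the latent space.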
Details
Motivation: 现有HHI生成方法将全部运动信息压缩至单一潜在表示,难以建模细粒度动作和人-人交互语义,导致语义错位与物理不合理(如穿透、接触缺失)。 Method: 提出DHVAE框架:1)使用CoTransformer模块解耦建模全局交互上下文与个体运动;2)引入对比学习约束以增强潜在交互空间的判别性与物理合理性;3)在分层潜在空间中采用DDIM扩散+跳连AdaLN-Transformer进行去噪。 Result: DHVAE在运动保真度、文本对齐性、物理合理性方面显著优于现有方法,同时具备更高计算效率。 Conclusion: 解耦建模与对比学习引导的潜在扩散是提升3D HHI生成质量与物理可信度的有效范式。 Abstract: Generating realistic 3D Human-Human Interaction (HHI) requires coherent modeling of the physical plausibility of the agents and their interaction semantics. Existing methods compress all motion information into a single latent representation, limiting their ability to capture fine-grained actions and inter-agent interactions. This often leads to semantic misalignment and physically implausible artifacts, such as penetration or missed contact. We propose Disentangled Hierarchical Variational Autoencoder (DHVAE) based latent diffusion for structured and controllable HHI generation. DHVAE explicitly disentangles the global interaction context and individual motion patterns into a decoupled latent structure by employing a CoTransformer module. To mitigate implausible and physically inconsistent contacts in HHI, we incorporate contrastive learning constraints with our DHVAE to promote a more discriminative and physically plausible latent interaction space. For high-fidelity interaction synthesis, DHVAE employs a DDIM-based diffusion denoising process in the hierarchical latent space, enhanced by a skip-connected AdaLN-Transformer denoiser. Extensive evaluations show that DHVAE achieves superior motion fidelity, text alignment, and physical plausibility with greater computational efficiency.[140] M-Gaussian: An Magnetic Gaussian Framework for Efficient Multi-Stack MRI Reconstruction
Kangyuan Zheng,Xuan Cai,Jiangqi Wang,Guixing Fu,Zhuoshuo Li,Yazhou Chen,Xinting Ge,Liangqiong Qu,Mengting Liu
Main category: cs.CV
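TL;DR: This paper proposes M-Gaussian, which adapts 3D Gaussian Splatting to multi-stack thick-slice MRI reconstruction. Magnetic Gaussian primitives, a neural residual field, and multi-resolution progressive training deliver high-quality reconstruction at markedly higher speed.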
Details
Motivation: 多层厚片MRI采集虽可缩短扫描时间并降低运动敏感性,但导致严重的层间各向异性,影响体积分析与定量评估,亟需高效鲁棒的各向同性高分辨率重建方法。 Method: 提出M-Gaussian:(1) 基于物理一致的体渲染设计磁性高斯基元;(2) 引入神经残差场细化高频细节;(3) 采用多分辨率渐进训练策略。 Result: 在FeTA数据集上达到40.31 dB PSNR,速度比现有隐式神经方法快14倍,是首个成功将3D高斯点绘应用于多层MRI重建的方法。 Conclusion: M-Gaussian在重建质量与计算效率之间实现了最优平衡,为临床多层MRI提供了高效可行的各向同性重建新范式。 Abstract: Magnetic Resonance Imaging (MRI) is a crucial non-invasive imaging modality. In routine clinical practice, multi-stack thick-slice acquisitions are widely used to reduce scan time and motion sensitivity, particularly in challenging scenarios such as fetal brain imaging. However, the resulting severe through-plane anisotropy compromises volumetric analysis and downstream quantitative assessment, necessitating robust reconstruction of isotropic high-resolution volumes. Implicit neural representation methods, while achieving high quality, suffer from computational inefficiency due to complex network structures. We present M-Gaussian, adapting 3D Gaussian Splatting to MRI reconstruction. Our contributions include: (1) Magnetic Gaussian primitives with physics-consistent volumetric rendering, (2) neural residual field for high-frequency detail refinement, and (3) multi-resolution progressive training. Our method achieves an optimal balance between quality and speed. On the FeTA dataset, M-Gaussian achieves 40.31 dB PSNR while being 14 times faster, representing the first successful adaptation of 3D Gaussian Splatting to multi-stack MRI reconstruction.[141] Leveraging GenAI for Segmenting and Labeling Centuries-old Technical Documents
Carlos Monroy,Benjamin Navarro
Main category: cs.CV
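TL;DR: This paper examines the feasibility of segmenting and labeling images from 16th- and 17th-century shipbuilding treatises using modern AI tools (SAM2, Florence2, and ChatGPT) together with a specialized ontology (ontoShip) and glossary (glosShip). Preliminary results indicate potential for the digital curation of historical documents.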
Details
Motivation: 提升百年以上古籍图像(尤其是船舰类文献)的自动分割与识别能力,以支持知识自动化编目与公众可及性,但面临训练数据稀缺与领域高度专业化两大挑战。 Method: 采用SAM2进行图像分割;Florence2与ChatGPT协同完成图像内容标注;并融合专用于船舶建筑的本体ontoShip与术语表glosShip以增强语义准确性。 Result: 初步实验表明该多技术融合方案在古籍图像分割与标注任务中具备可行性与应用潜力,提升了历史文献的结构化处理与检索能力。 Conclusion: 将通用视觉大模型与领域知识深度结合,是解决古籍图像理解难题的有效路径;后续需应对数据稀缺、模型泛化性弱及本体覆盖不足等挑战。 Abstract: Image segmentation and image recognition are well established computational techniques in the broader discipline of image processing. Segmentation allows to locate areas in an image, while recognition identifies specific objects within an image. These techniques have shown remarkable accuracy with modern images, mainly because the amount of training data is vast. Achieving similar accuracy in digitized images of centuries-old documents is more challenging. This difficulty is due to two main reasons: first, the lack of sufficient training data, and second, because the degree of specialization in a given domain. Despite these limitations, the ability to segment and recognize objects in these collections is important for automating the curation, cataloging, and dissemination of knowledge, making the contents of priceless collections accessible to scholars and the general public. In this paper, we report on our ongoing work in segmenting and labeling images pertaining to shipbuilding treatises from the XVI and XVII centuries, a historical period known as the Age of Exploration. To this end, we leverage SAM2 for image segmentation; Florence2 and ChatGPT for labeling; and a specialized ontology ontoShip and glossary glosShip of nautical architecture for enhancing the labeling process. Preliminary results demonstrate the potential of marrying these technologies for improving curation and retrieval of priceless historical documents. We also discuss the challenges and limitations encountered in this approach and ideas on how to overcome them in the future.[142] Mechanistically Guided LoRA Improves Paraphrase Consistency in Medical Vision-Language Models
Binesh Sadanandan,Vahid Behzadan
Main category: cs.CV
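TL;DR: This paper studies answer inconsistency in a medical vision-language model (MedGemma-4B) under rephrasings of clinical questions. It uses Gemma Scope 2 sparse autoencoders (SAEs) to model neural activations and fine-tunes LoRA adapters with a combined consistency and accuracy loss, substantially reducing the answer flip rate and margin difference while keeping accuracy stable or improving it.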
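The flip-rate metric and the two-proportion z-test named in this entry's abstract can be sketched as follows. The flip counts of 23 and 7 out of 158 are an assumption, inferred by rounding the reported 14.6% and 4.4%; they are not stated in the source, and the function names are hypothetical.

```python
import math

def flip_rate(answers_a, answers_b):
    """Fraction of paraphrase pairs whose yes/no answers differ."""
    flips = sum(a != b for a, b in zip(answers_a, answers_b))
    return flips / len(answers_a)

def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # 2 * (1 - Phi(|z|))
    return z, p_value

# Toy paraphrase pairs: two of four answers flip.
toy_rate = flip_rate(["yes", "no", "yes", "yes"], ["yes", "yes", "yes", "no"])

# Assumed counts: 23/158 is about 14.6% (baseline), 7/158 is about 4.4% (after fine-tuning).
z, p = two_proportion_z_test(23, 158, 7, 158)
```

Under these assumed counts the test gives z of roughly 3.07 and a two-sided p-value of roughly 0.002, in line with the value reported for MIMIC-CXR.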
Details
Motivation: Medical vision-language models can give different yes/no answers to different phrasings of the same clinical question, which undermines clinical trust; their semantic consistency needs to improve.
Method: Using the PSF-Med dataset, the authors measure the paraphrase flip rate and margin difference of MedGemma-4B on MIMIC-CXR and PadChest; verify that Gemma Scope 2 SAEs transfer to medical activations (R² ≈ 0.997); fine-tune LoRA adapters with a combined consistency and accuracy loss; and run a layer-range ablation.
Result: On MIMIC-CXR, the flip rate falls from 14.6% to 4.4% (p = 0.002), the margin difference falls by 79.5% (1.63 to 0.33), and accuracy stays stable (84.2% to 82.3%, not significant). On PadChest, the flip rate falls from 13.6% to 7.8%, the margin difference falls by 67.9% (1.08 to 0.35), and accuracy rises by 3.0pp (66.4% to 69.4%). Early layers contribute more to margin reduction.
Conclusion: LoRA fine-tuning with a joint consistency and accuracy objective effectively mitigates answer instability in medical VLMs, and the SAEs model activations across general and medical domains, offering a new route to interpretability and robustness.
Abstract: Medical Vision-Language Models can give different yes or no answers to rephrasings of the same clinical question. We study this in MedGemma-4B using PSF-Med (Sadanandan and Behzadan, 2025), which provides paraphrase pairs for systematic consistency evaluation on medical VQA. On MIMIC-CXR binary questions (n = 158), the baseline flip rate is 14.6% and mean margin difference is 1.63 logits. We validate that Gemma Scope 2 Sparse Autoencoders (SAEs) transfer to MedGemma activations, achieving R² ≈ 0.997 on both medical and general text (n = 100 prompts each, p < 0.001 for exceeding a 0.95 threshold). We then fine-tune Low-Rank Adaptation (LoRA) adapters with a combined loss that balances paraphrase consistency with answer accuracy. This combined approach prevents the mode collapse that occurs with pure consistency training while reducing flip rate from 14.6% to 4.4% (p = 0.002, two-proportion z-test) and margin difference from 1.63 to 0.33 (79.5% reduction). Accuracy remains stable at 84.2% baseline versus 82.3% after training (-1.9pp, not significant). On PadChest Balanced (n = 250), flip rate drops from 13.6% to 7.8%, mean margin difference drops from 1.08 to 0.35 (67.9% reduction), and accuracy increases from 66.4% to 69.4%. A layer-range ablation shows that early layers reduce margin differences more than mechanistically selected middle layers.
[143] Physics-Consistent Diffusion for Efficient Fluid Super-Resolution via Multiscale Residual Correction
Zhihao Li,Shengwei Dong,Chuang Yi,Junxuan Gao,Zhilu Lai,Zhiqiang Liu,Wei Wang,Guangtao Zhang
Main category: cs.CV
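TL;DR: This paper proposes ReMD (Residual-Multigrid Diffusion), which enforces physics consistency inside the diffusion process through multigrid residual correction and multi-wavelet multiscale modeling. It improves the accuracy, spectral fidelity, and convergence efficiency of fluid super-resolution while requiring fewer sampling steps.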
Details
Motivation: Existing image super-resolution and generic diffusion models transfer poorly to fluid super-resolution: sampling is costly, physical constraints are ignored, and outputs often show spectral mismatch and spurious divergence.
Method: ReMD performs a multigrid residual correction at every reverse diffusion step. The update direction couples data consistency with lightweight physics cues, and the residual is corrected across scales built on a multi-wavelet basis; the procedure never solves the governing equations.
Result: On atmospheric and oceanic benchmarks, ReMD improves accuracy and spectral fidelity, reduces divergence, and reaches quality comparable to diffusion baselines with markedly fewer sampling steps.
Conclusion: Embedding physics consistency inside the diffusion process, through multigrid residual correction and multi-wavelet multiscale modeling, is an effective route to efficient fluid super-resolution.
Abstract: Existing image SR and generic diffusion models transfer poorly to fluid SR: they are sampling-intensive, ignore physical constraints, and often yield spectral mismatch and spurious divergence. We address fluid super-resolution (SR) with ReMD (Residual-Multigrid Diffusion), a physics-consistent diffusion framework. At each reverse step, ReMD performs a multigrid residual correction: the update direction is obtained by coupling data consistency with lightweight physics cues and then correcting the residual across scales; the multiscale hierarchy is instantiated with a multi-wavelet basis to capture both large structures and fine vortical details. This coarse-to-fine design accelerates convergence and preserves fine structures while remaining equation-free. Across atmospheric and oceanic benchmarks, ReMD improves accuracy and spectral fidelity, reduces divergence, and reaches comparable quality with markedly fewer sampling steps than diffusion baselines. Our results show that enforcing physics consistency inside the diffusion process via multigrid residual correction and multi-wavelet multiscale modeling is an effective route to efficient fluid SR. Our code is available at https://github.com/lizhihao2022/ReMD.
[144] Attention to Neural Plagiarism: Diffusion Models Can Plagiarize Your Copyrighted Images!
Zihang Zou,Boqing Gong,Liqiang Wang
Main category: cs.CV
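TL;DR: This paper shows that emerging neural models such as diffusion models can be used for data plagiarism. It proposes a general neural plagiarism method based on "anchors and shims" that perturbs the cross-attention mechanism through gradient-based search, bypassing visible and invisible copyright protections without additional training.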
Details
Motivation: 暴露当前版权保护(尤其是水印技术)在面对现代神经模型时的安全漏洞,并推动针对神经数据剽窃的防御研究。 Method: 提出基于'锚点(逆潜在表示)和垫片(渐进式扰动)'的纯梯度搜索方法,在扩散模型不同时间步对跨注意力机制施加扰动,实现对受版权保护图像的语义级复现或版权模糊化。 Result: 在MS-COCO数据集和真实版权图像上验证了扩散模型可成功复刻受保护图像,且能有效规避可见商标、签名及不可见水印等各类版权保护机制。 Conclusion: 神经模型存在严重数据剽窃风险,亟需开发新型鲁棒版权保护与检测机制以应对这一威胁。 Abstract: In this paper, we highlight a critical threat posed by emerging neural models: data plagiarism. We demonstrate how modern neural models (e.g., diffusion models) can replicate copyrighted images, even when protected by advanced watermarking techniques. To expose vulnerabilities in copyright protection and facilitate future research, we propose a general approach to neural plagiarism that can either forge replicas of copyrighted data or introduce copyright ambiguity. Our method, based on "anchors and shims", employs inverse latents as anchors and finds shim perturbations that gradually deviate the anchor latents, thereby evading watermark or copyright detection. By applying perturbations to the cross-attention mechanism at different timesteps, our approach induces varying degrees of semantic modification in copyrighted images, enabling it to bypass protections ranging from visible trademarks and signatures to invisible watermarks. Notably, our method is a purely gradient-based search that requires no additional training or fine-tuning. Experiments on MS-COCO and real-world copyrighted images show that diffusion models can replicate copyrighted images, underscoring the urgent need for countermeasures against neural plagiarism.[145] Dr. Seg: Revisiting GRPO Training for Visual Large Language Models through Perception-Oriented Design
Haoxiang Sun,Tao Wang,Chenwei Tang,Li Yuan,Jiancheng Lv
Main category: cs.CV
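TL;DR: This paper challenges the assumption that training paradigms for language reasoning transfer directly to visual perception tasks. It proposes Dr. Seg, a framework that addresses two needs in visual segmentation: a broader output space and fine-grained, stable rewards.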
Details
Motivation: 现有研究假设语言推理的训练范式可无缝迁移到视觉感知任务,但作者发现该假设不成立,需针对视觉感知特性重新设计训练方法。 Method: 提出Dr. Seg框架,包含Look-to-Confirm机制和Distribution-Ranked Reward模块,无需修改模型结构,可即插即用集成到现有GRPO-based VLLMs中。 Result: 在复杂视觉场景中性能提升,同时保持强泛化能力。 Conclusion: 语言推理与视觉感知存在本质差异,需针对性设计训练机制;Dr. Seg是一种简单有效、即插即用的改进方案。 Abstract: Following the success of Group Relative Policy Optimization (GRPO) in foundation LLMs, an increasing number of works have sought to adapt GRPO to Visual Large Language Models (VLLMs) for visual perception tasks (e.g., detection and segmentation). However, much of this line of research rests on a long-standing yet unexamined assumption: training paradigms developed for language reasoning can be transferred seamlessly to visual perception. Our experiments show that this assumption is not valid, revealing intrinsic differences between reasoning-oriented and perception-oriented settings. Using reasoning segmentation as a representative case, we surface two overlooked factors: (i) the need for a broader output space, and (ii) the importance of fine-grained, stable rewards. Building on these observations, we propose Dr.~Seg, a simple, plug-and-play GRPO-based framework consisting of a Look-to-Confirm mechanism and a Distribution-Ranked Reward module, requiring no architectural modifications and integrating seamlessly with existing GRPO-based VLLMs. Extensive experiments demonstrate that Dr.~Seg improves performance in complex visual scenarios while maintaining strong generalization. Code and models will be available at https://github.com/xVI-group-SCU/Dr-Seg.[146] EfficientPosterGen: Semantic-aware Efficient Poster Generation via Token Compression and Accurate Violation Detection
Wenxin Tang,Jingyu Xiao,Yanpei Gong,Fengyuan Ran,Tongchuan Xia,Junliang Liu,Man Ho Lam,Wenxuan Wang,Michael R. Lyu
Main category: cs.CV
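TL;DR: This paper proposes EfficientPosterGen, a framework that uses semantic-aware retrieval, visual-based context compression, and agentless layout violation detection to address three problems of existing MLLM-based academic poster generation: low information density, excessive token consumption, and unreliable layout verification.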
Details
Motivation: 现有基于多模态大语言模型(MLLM)的学术海报自动生成方法存在信息密度低、token消耗过大、布局验证不可靠三大缺陷。 Method: 提出端到端框架EfficientPosterGen,包含三项核心技术:(1)语义感知关键信息检索(SKIR),构建语义贡献图以建模段落关系并筛选重要内容;(2)基于视觉的上下文压缩(VCC),将文本段落渲染为图像,降低文本token消耗并生成海报就绪要点;(3)无代理布局违规检测(ALVD),采用确定性颜色梯度算法检测内容溢出与空间稀疏。 Result: 实验表明该方法显著提升token效率与布局可靠性,同时保持高质量海报生成能力。 Conclusion: EfficientPosterGen为自动化学术海报生成提供了可扩展、高效且可靠的解决方案。 Abstract: Automated academic poster generation aims to distill lengthy research papers into concise, visually coherent presentations. Existing Multimodal Large Language Models (MLLMs) based approaches, however, suffer from three critical limitations: low information density in full-paper inputs, excessive token consumption, and unreliable layout verification. We present EfficientPosterGen, an end-to-end framework that addresses these challenges through semantic-aware retrieval and token-efficient multimodal generation. EfficientPosterGen introduces three core innovations: (1) Semantic-aware Key Information Retrieval (SKIR), which constructs a semantic contribution graph to model inter-segment relationships and selectively preserves important content; (2) Visual-based Context Compression (VCC), which renders selected text segments into images to shift textual information into the visual modality, significantly reducing token usage while generating poster-ready bullet points; and (3) Agentless Layout Violation Detection (ALVD), a deterministic color-gradient-based algorithm that reliably detects content overflow and spatial sparsity without auxiliary MLLMs. Extensive experiments demonstrate that EfficientPosterGen achieves substantial improvements in token efficiency and layout reliability while maintaining high poster quality, offering a scalable solution for automated academic poster generation. Our code is available at https://github.com/vinsontang1/EfficientPosterGen-Code.[147] BiCLIP: Bidirectional and Consistent Language-Image Processing for Robust Medical Image Segmentation
Saivan Talaei,Fatemeh Daneshfar,Abdulhady Abas Abdullah,Mustaqeem Khan
Main category: cs.CV
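TL;DR: BiCLIP is a bidirectional and consistent language-image processing framework for medical image segmentation. Bidirectional multimodal fusion and augmentation consistency regularization markedly improve robustness when labels are scarce and images are degraded.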
Details
Motivation: 现有医学图像分割方法在真实临床环境中(标注稀缺、硬件导致图像退化)鲁棒性不足,尤其多模态图文模型的抗干扰能力尚未充分探索。 Method: 提出BiCLIP框架:1)双向图文融合机制,使视觉特征迭代优化文本表征以增强语义对齐;2)引入增强一致性目标,对扰动输入的中间表征进行正则化。 Result: 在QaTa-COV19和MosMedData+上超越现有图像单模态及多模态SOTA;仅用1%标注数据仍保持高性能,并对运动模糊、低剂量CT噪声等临床伪影具有强鲁棒性。 Conclusion: BiCLIP有效提升了医学图像分割在低资源与退化条件下的泛化能力和稳定性,为临床落地提供了更可靠的多模态建模范式。 Abstract: Medical image segmentation is a cornerstone of computer-assisted diagnosis and treatment planning. While recent multimodal vision-language models have shown promise in enhancing semantic understanding through textual descriptions, their resilience in "in-the-wild" clinical settings-characterized by scarce annotations and hardware-induced image degradations-remains under-explored. We introduce BiCLIP (Bidirectional and Consistent Language-Image Processing), a framework engineered to bolster robustness in medical segmentation. BiCLIP features a bidirectional multimodal fusion mechanism that enables visual features to iteratively refine textual representations, ensuring superior semantic alignment. To further stabilize learning, we implement an augmentation consistency objective that regularizes intermediate representations against perturbed input views. Evaluation on the QaTa-COV19 and MosMedData+ benchmarks demonstrates that BiCLIP consistently surpasses state-of-the-art image-only and multimodal baselines. Notably, BiCLIP maintains high performance when trained on as little as 1% of labeled data and exhibits significant resistance to clinical artifacts, including motion blur and low-dose CT noise.[148] FujiView: Multimodal Late-Fusion for Predicting Scenic Visibility
Bryceton Bible,Shah Md Nehal Hasnaeen,Hairong Qi
Main category: cs.CV
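TL;DR: This paper presents FujiView, a framework that fuses webcam imagery with meteorological data to predict the visibility of natural landmarks such as Mount Fuji. It builds a multimodal dataset of more than 100,000 samples and evaluates predictive performance across different forecast horizons.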
Details
Motivation: 自然地标(如富士山)的可见性对旅游规划和游客体验至关重要,但受快速变化的大气条件影响,难以准确预测。 Method: 提出一种多模态学习框架FujiView,采用晚融合策略,将YOLO提取的图像类别概率与结构化气象数值特征结合,进行五类可见性分类。 Result: 在同日预测中准确率达约0.89,次日预测达84%;YOLO视觉特征主导短时预测(nowcasting/samedaycasting),气象特征在+1天后成为主信号。 Conclusion: Scenic Visibility Forecasting(SVF)被确立为多模态学习的新基准任务,FujiView数据集将公开以推动环境预测研究。 Abstract: Visibility of natural landmarks such as Mount Fuji is a defining factor in both tourism planning and visitor experience, yet it remains difficult to predict due to rapidly changing atmospheric conditions. We present FujiView, a multimodal learning framework and dataset for predicting scenic visibility by fusing webcam imagery with structured meteorological data. Our late-fusion approach combines image-derived class probabilities with numerical weather features to classify visibility into five categories. The dataset currently comprises over 100,000 webcam images paired with concurrent and forecasted weather conditions from more than 40 cameras around Mount Fuji, and continues to expand; it will be released to support further research in environmental forecasting. Experiments show that YOLO-based vision features dominate short-term horizons such as "nowcasting" and "samedaycasting", while weather-driven forecasts increasingly take over as the primary predictive signal beyond $+1$d. Late fusion consistently yields the highest overall accuracy, achieving ACC of approx 0.89 for same-day prediction and up to 84% for next-day forecasts. These results position Scenic Visibility Forecasting (SVF) as a new benchmark task for multimodal learning.[149] FlowPortrait: Reinforcement Learning for Audio-Driven Portrait Video Generation
Weiting Tan,Andy T. Liu,Ming Tu,Xinghua Qu,Philipp Koehn,Lu Lu
Main category: cs.CV
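TL;DR: FlowPortrait is a reinforcement-learning framework for audio-driven portrait animation. It builds a human-aligned evaluation system on Multimodal Large Language Models (MLLMs), adds perceptual and temporal consistency regularizers, and post-trains the generator with the GRPO algorithm, improving lip synchronization, expressiveness, and motion quality.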
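The GRPO post-training step in this entry relies on group-relative advantages: each sampled output is scored against the other samples generated for the same input. The sketch below shows the commonly used normalization (reward minus group mean, divided by group standard deviation). The composite reward weights, the use of population standard deviation, and the function names are assumptions for illustration, not details from the abstract.

```python
import statistics

def composite_reward(lip_sync, expressiveness, motion, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Hypothetical weighted sum of three evaluator scores (weights are assumed)."""
    w_lip, w_expr, w_motion = weights
    return w_lip * lip_sync + w_expr * expressiveness + w_motion * motion

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same input."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed choice
    return [(r - mean) / (std + eps) for r in rewards]

# Four hypothetical videos generated from the same audio clip and reference image.
scores = [(0.2, 0.4, 0.3), (0.5, 0.5, 0.5), (0.7, 0.6, 0.8), (0.9, 0.9, 0.9)]
rewards = [composite_reward(*s) for s in scores]
advantages = group_relative_advantages(rewards)
```

Samples above the group mean receive positive advantages and are reinforced, while those below are discouraged, so no separate value network is needed.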
Details
Motivation: 现有方法在唇形同步、运动自然性和评估指标与人类感知相关性差等方面存在持续挑战。 Method: 提出 FlowPortrait 框架:以多模态自回归音视频生成骨干网络为基础,引入基于 MLLM 的人类对齐评估系统(评估唇同步、表现力和运动质量),融合感知与时间一致性正则项构成复合奖励,并采用 Group Relative Policy Optimization(GRPO)进行生成器后训练。 Result: 大量实验(含自动评估与人类偏好研究)表明 FlowPortrait 一致生成更高质量的说话人视频。 Conclusion: 强化学习结合人类感知对齐的多模态评估,可有效提升音频驱动肖像动画的质量。 Abstract: Generating realistic talking-head videos remains challenging due to persistent issues such as imperfect lip synchronization, unnatural motion, and evaluation metrics that correlate poorly with human perception. We propose FlowPortrait, a reinforcement-learning framework for audio-driven portrait animation built on a multimodal backbone for autoregressive audio-to-video generation. FlowPortrait introduces a human-aligned evaluation system based on Multimodal Large Language Models (MLLMs) to assess lip-sync accuracy, expressiveness, and motion quality. These signals are combined with perceptual and temporal consistency regularizers to form a stable composite reward, which is used to post-train the generator via Group Relative Policy Optimization (GRPO). Extensive experiments, including both automatic evaluations and human preference studies, demonstrate that FlowPortrait consistently produces higher-quality talking-head videos, highlighting the effectiveness of reinforcement learning for portrait animation.[150] DINOv3 Meets YOLO26 for Weed Detection in Vegetable Crops
Boyang Deng,Yuzhen Lu
Main category: cs.CV
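TL;DR: This study proposes a crop-weed detection model (DINOv3-YOLO26) based on self-supervised learning (DINOv3) and multi-source data integration. It improves detection accuracy on in-domain and cross-season datasets (up to +14.0% mAP50) while retaining real-time inference (~28.5 fps); the data and code will be made publicly available.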
Details
Motivation: 现有高精度蔬菜杂草识别模型受限于大规模标注作物-杂草数据集的严重匮乏。 Method: 构建包含618,642张图像的大规模原始数据集,经筛选得199,388张高质量图像;使用该数据微调DINOv3 ViT-small作为视觉骨干;将其嵌入YOLO26框架(单/双骨干结构),并在双骨干中引入轻量级特征对齐损失以优化特征融合。 Result: DINOv3-YOLO26-large在2025年域内数据上mAP50提升+5.4%,在2021–2023和2024年跨域数据上分别提升+14.0%和+11.9%;虽参数量增加45.6%、延迟上升2.9倍,仍达~28.5 fps实时性能。 Conclusion: 通过自监督预训练与异构数据协同微调,可有效缓解农业细粒度检测中的标注稀缺问题,并显著增强模型泛化能力与实用性。 Abstract: Developing robust models for precision vegetable weeding is currently constrained by the scarcity of large-scale, annotated weed-crop datasets. To address this limitation, this study proposes a foundational crop-weed detection model by integrating heterogeneous datasets and leveraging self-supervised learning. A total of 618,642 crop-weed images were initially collected and subsequently refined to 199,388 filtered images for fine-tuning a DINOv3 vision transformer (ViT-small) through a sequential curation strategy. The fine-tuned DINOv3 backbone was then integrated into YOLO26, serving either as a primary backbone or part of a dual-backbone architecture. A feature alignment loss was introduced in the dual backbone framework to enhance feature fusion with minimal computational overhead. Experimental results show that the proposed DINOv3-finetuned ViT-small-based YOLO26-large achieved up to a +5.4% mAP50 gain on in-domain images collected in the 2025 season. Moreover, it demonstrated strong cross-domain generalization with mAP50 improvements of +14.0% on the 2021-2023 season dataset and +11.9% on the 2024 season dataset, compared to the standard YOLO26-large. Although the DINOv3-YOLO26-large model has 45.6% more parameters and a 2.9x increase in inference latency, it maintains real-time performance at ~28.5 frames per second (fps). The curated dataset and software programs developed in this study will be made publicly available.[151] SKINOPATHY AI: Smartphone-Based Ophthalmic Screening and Longitudinal Tracking Using Lightweight Computer Vision
S. Kalaycioglu,C. Hong,M. Zhu,H. Xie
Main category: cs.CV
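TL;DR: SKINOPATHY AI is a smartphone-based web application offering five explainable ophthalmic screening modules. All of them run on ordinary mobile devices without cloud AI inference, which makes the tool suitable for low-resource and remote areas.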
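As a rough illustration of the blink-rate module named in this entry's abstract, the sketch below computes the Eye Aspect Ratio (EAR) from six eye-contour points and counts blinks as runs of low-EAR frames. The six-point ordering, the fixed `threshold`, and the `min_frames` parameter are assumptions made for illustration. The abstract states that the system uses MediaPipe FaceMesh landmarks with adaptive thresholding, which this sketch does not reproduce.

```python
import math


def eye_aspect_ratio(eye):
    """Return the EAR for six (x, y) eye landmarks ordered p1..p6.

    Assumed ordering: p1 and p4 are the horizontal eye corners,
    p2 and p3 lie on the upper lid, p6 and p5 on the lower lid.
    """
    p1, p2, p3, p4, p5, p6 = eye
    vertical = math.dist(p2, p6) + math.dist(p3, p5)
    horizontal = math.dist(p1, p4)
    return vertical / (2.0 * horizontal)


def count_blinks(ear_series, threshold, min_frames=2):
    """Count blinks as runs of at least `min_frames` frames with EAR below `threshold`.

    A fixed threshold is a simplification; it stands in for the adaptive
    thresholding mentioned in the abstract.
    """
    blinks = 0
    run = 0
    for ear in ear_series:
        if ear < threshold:
            run += 1
        else:
            if run >= min_frames:
                blinks += 1
            run = 0
    if run >= min_frames:  # a blink still in progress at the end of the series
        blinks += 1
    return blinks


# Hypothetical landmarks for an open eye, 4 units wide and 2 units tall.
open_eye = [(0, 0), (1, 1), (3, 1), (4, 0), (3, -1), (1, -1)]
ear_open = eye_aspect_ratio(open_eye)

# Hypothetical per-frame EAR values containing two closures of two or more frames.
series = [0.3, 0.3, 0.1, 0.1, 0.3, 0.1, 0.3, 0.1, 0.1, 0.1]
n_blinks = count_blinks(series, threshold=0.2)
```

Dividing the blink count by the recording duration in minutes would give a blink rate. The single-frame dip in `series` is ignored because it is shorter than `min_frames`.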
Details
Motivation: Early ophthalmic screening in low-resource and remote areas is limited by access to specialized equipment and trained practitioners.
Method: A mobile-first web application built on React/FastAPI, OpenCV, and MediaPipe that integrates five deterministic, privacy-preserving algorithm modules: redness quantification, blink-rate estimation, pupil light reflex analysis, scleral color indexing, and iris-landmark-calibrated lesion encroachment measurement.
Result: The work demonstrates that multi-signal ophthalmic screening is feasible on unmodified smartphones. All algorithms run locally without cloud inference, and the system supports PDF report generation and longitudinal trend tracking.
Conclusion: SKINOPATHY AI provides a workable foundation for future clinically validated mobile ophthalmoscopy tools, and is positioned as non-diagnostic, consumer-grade preliminary triage.
Abstract: Early ophthalmic screening in low-resource and remote settings is constrained by access to specialized equipment and trained practitioners. We present SKINOPATHY AI, a smartphone-first web application that delivers five complementary, explainable screening modules entirely through commodity mobile hardware: (1) redness quantification via LAB a* color-space normalization; (2) blink-rate estimation using MediaPipe FaceMesh Eye Aspect Ratio (EAR) with adaptive thresholding; (3) pupil light reflex characterization through Pupil-to-Iris Ratio (PIR) time-series analysis; (4) scleral color indexing for icterus and anemia proxies via LAB/HSV statistics; and (5) iris-landmark-calibrated lesion encroachment measurement with millimeter-scale estimates and longitudinal trend tracking. The system is implemented as a React/FastAPI stack with OpenCV and MediaPipe, MongoDB-backed session persistence, and PDF report generation. All algorithms are fully deterministic, privacy-preserving, and designed for non-diagnostic consumer triage. We detail system architecture, algorithm design, evaluation methodology, clinical context, and ethical boundaries of the platform. SKINOPATHY AI demonstrates that multi-signal ophthalmic screening is feasible on unmodified smartphones without cloud-based AI inference, providing a foundation for future clinically validated mobile ophthalmoscopy tools.
[152] A Boundary-Metric Evaluation Protocol for Whiteboard Stroke Segmentation Under Extreme Imbalance
Nicholas Korcynski
Main category: cs.CV
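TL;DR: This paper proposes a new evaluation protocol for binary whiteboard stroke segmentation. It targets the problem that extreme foreground sparsity (only about 1.79% of pixels) lets standard region metrics (F1, IoU) hide failures on thin strokes. By adding boundary-aware metrics (BF1, B-IoU), a core/thin-subset equity analysis, and multi-run robustness statistics (median, IQR, worst case), it exposes the real performance trade-offs between loss functions. Experiments show that overlap-based losses (such as Dice and Tversky) significantly outperform cross-entropy, that learned models beat classical binarization methods (such as Sauvola) on worst-case reliability, and that higher training resolution improves performance further.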
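To make the imbalance argument concrete, here is a minimal sketch of why a pixel-wise score can look healthy while an overlap score collapses. It uses a hypothetical 10,000-pixel image with 179 stroke pixels (the 1.79% average quoted in this entry) and a degenerate prediction that labels everything as background. The function names and pixel counts are illustrative assumptions, not the paper's code. The Tversky index is written in its standard form, which reduces to Dice at `alpha = beta = 0.5` and to IoU at `alpha = beta = 1`.

```python
def pixel_accuracy(tp, fp, fn, tn):
    """Fraction of pixels labeled correctly, background included."""
    return (tp + tn) / (tp + fp + fn + tn)


def tversky_index(tp, fp, fn, alpha=0.5, beta=0.5):
    """Tversky index TP / (TP + alpha*FP + beta*FN).

    alpha = beta = 0.5 gives the Dice coefficient (equal to F1 for a
    binary mask); alpha = beta = 1 gives IoU (the Jaccard index).
    """
    denom = tp + alpha * fp + beta * fn
    return tp / denom if denom > 0 else 1.0


# Hypothetical image: 10,000 pixels, 179 of them stroke (1.79% foreground).
# Degenerate model: predicts background everywhere.
tp, fp, fn, tn = 0, 0, 179, 9821

acc = pixel_accuracy(tp, fp, fn, tn)  # high, although every stroke is missed
dice = tversky_index(tp, fp, fn)      # zero, because no stroke pixel is found

# The same function covers Dice and IoU for a non-degenerate prediction.
dice_ok = tversky_index(80, 20, 20)                      # Dice
iou_ok = tversky_index(80, 20, 20, alpha=1.0, beta=1.0)  # IoU
```

Here `acc` is 0.9821 while `dice` is 0.0. This is a simpler point than the paper's, which goes further and argues that even F1 and IoU need boundary metrics and per-subset analysis alongside them.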
Details
Motivation: Binary whiteboard stroke segmentation faces severe class imbalance (foreground pixels average only 1.79%, and as little as 1.14% ± 0.41% in the thin-stroke subset). As a result, standard region metrics (F1, IoU) do not reflect thin-stroke failures and hide real model defects.
Method: A joint evaluation protocol that combines region metrics (F1, IoU), boundary metrics (BF1, B-IoU), a core/thin-subset equity analysis, and per-image robustness statistics (median, IQR, worst-case F1), with non-parametric significance testing across multiple fixed-seed training runs. Five loss functions (CE, Focal, Dice, Dice+Focal, Tversky) are compared on DeepLabV3-MobileNetV3 and evaluated on 12 held-out images split into core and thin subsets. Classical adaptive thresholding and Sauvola binarization are also compared, and the effect of doubling the training resolution is tested.
Result: Overlap-based losses (Dice/Tversky) reach an F1 of 0.663, significantly above the 0.438 of CE (p<0.001), and the boundary metrics confirm a matching gain in contour precision. Sauvola has the highest mean F1 (0.787) but a worst-case F1 of only 0.452, against 0.565 for Tversky, showing that classical methods favor the mean while learned models favor robustness. Doubling the resolution raises F1 by a further 12.7 points.
Conclusion: Standard region metrics are not sufficient for evaluating whiteboard segmentation under extreme imbalance. Boundary-aware metrics and subset equity analysis expose hidden trade-offs, and the proposed protocol is more comprehensive and supports more reliable model selection and improvement. Learned models have a slightly lower mean but better worst-case consistency and reliability than classical methods, and higher resolution helps consistently.
Abstract: The binary segmentation of whiteboard strokes is hindered by extreme class imbalance: stroke pixels constitute only $1.79\%$ of the image on average, and the thin-stroke subset averages only $1.14\% \pm 0.41\%$ foreground. Standard region metrics (F1, IoU) can mask thin-stroke failures because the vast majority of the background dominates the score. In contrast, adding boundary-aware metrics and a thin-subset equity analysis changes how loss functions rank and exposes hidden trade-offs. We contribute an evaluation protocol that jointly examines region metrics, boundary metrics (BF1, B-IoU), a core/thin-subset equity analysis, and per-image robustness statistics (median, IQR, worst-case) under seeded, multi-run training with non-parametric significance testing. Five losses (cross-entropy, focal, Dice, Dice+focal, and Tversky) are trained three times each on a DeepLabV3-MobileNetV3 model and evaluated on 12 held-out images split into core and thin subsets. Overlap-based losses improve F1 by more than 20 points over cross-entropy ($0.663$ vs $0.438$, $p < 0.001$). In addition, the boundary metrics confirm that the gain extends to the precision of the contour.
Adaptive thresholding and Sauvola binarization at native resolution achieve a higher mean F1 ($0.787$ for Sauvola) but with substantially worse worst-case performance (F1 $= 0.452$ vs $0.565$ for Tversky), exposing a consistency-accuracy trade-off: classical baselines lead on mean F1 while the learned model delivers higher worst-case reliability. Doubling training resolution further increases F1 by 12.7 points.
[153] ConFoThinking: Consolidated Focused Attention Driven Thinking for Visual Question Answering
Zhaodong Wu,Haochen Xue,Qi Cao,Wenqi Mo,Yu Pei,Wenqi Xu,Jionglong Su,Yang Liu
Main category: cs.CV
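TL;DR: This paper proposes the ConFoThinking framework, which consolidates attention and extracts it from concise semantic cues to improve the perception performance of multimodal large language models (MLLMs) on fine-grained visual question answering (VQA).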
Details
Motivation: Existing methods for locating key image regions suffer from fragmented attention, semantic noise caused by conditioning on the question text, and inconsistency between attention and generated coordinates.
Method: The ConFoThinking framework (1) aggregates cross-layer attention into a designated intermediate layer, (2) extracts attention from concise semantic cues rather than the full question, and (3) crops the highlighted regions accordingly for downstream visual understanding.
Result: Perception performance improves significantly on five VQA benchmarks, supporting the effectiveness of the method.
Conclusion: Through more robust and more focused attention modeling, ConFoThinking eases the localization and reasoning bottlenecks of MLLMs in fine-grained visual understanding.
Abstract: Thinking with Images improves fine-grained VQA for MLLMs by emphasizing visual cues. However, tool-augmented methods depend on the capacity of grounding, which remains unreliable for MLLMs. In parallel, attention-driven methods that crop Regions of Interest (ROIs) have been proposed, but they are constrained by (1) fragmented attention signals scattered across layers, leading to suboptimal localization, and (2) reliance on question- or redundant-text-conditioned attention extraction. Our analysis reveals three patterns: MLLMs may attend to the correct region yet generate incorrect coordinates, where-to-look attention is often fragmented across layers, and attention extraction is query-sensitive. Motivated by these, we propose ConFoThinking, a Consolidated-Focused-Attention-Driven Thinking framework that learns to aggregate attention into a designated intermediate layer, from which we mine and zoom in on salient regions for downstream visual understanding. Moreover, we extract attention using concise semantic cues of what to look into, which mitigates the semantic noise introduced by question- or redundant-text-based attention extraction. Experiments across five VQA benchmarks demonstrate that ConFoThinking significantly improves perception performance. The code, checkpoints, and dataset will be released after acceptance.
[154] Exploring the AI Obedience: Why is Generating a Pure Color Image Harder than CyberPunk?
Hongyu Li,Kuan Liu,Yuan Chen,Juntao Hu,Huimin Lu,Guanjie Chen,Xue Liu,Guangming Lu,Hong Huang
Main category: cs.CV
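TL;DR: This paper introduces the concept of "Obedience", builds a hierarchical evaluation system ranging from semantic alignment to pixel-level precision, and releases VIOLIN, the first visual obedience benchmark focused on pure color generation. It reveals fundamental limitations of current generative AI on simple deterministic tasks.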
Details
Motivation: Generative AI exhibits a "Paradox of Simplicity": it can produce complex content yet often fails at simple deterministic tasks.
Method: The paper formally defines "Obedience" and establishes a hierarchical evaluation system, uses case studies to identify common obedience gaps, and builds the VIOLIN benchmark (six variants of pure color generation) to evaluate high-level obedience.
Result: Experiments reveal fundamental obedience limitations in SOTA models and provide exploratory insights.
Conclusion: The framework aims to draw more attention to AI obedience and to encourage deeper research to close this key capability gap.
Abstract: Recent advances in generative AI have demonstrated remarkable ability to produce high-quality content. However, these models often exhibit a "Paradox of Simplicity": while they can render intricate landscapes, they often fail at simple, deterministic tasks. To address this, we formalize Obedience as the ability to align with instructions and establish a hierarchical grading system ranging from basic semantic alignment to pixel-level systemic precision, which provides a unified paradigm for incorporating and categorizing existing literature. Then, we conduct case studies to identify common obedience gaps, revealing how generative priors often override logical constraints. To evaluate high-level obedience, we present VIOLIN (VIsual Obedience Level-4 EvaluatIoN), the first benchmark focused on pure color generation across six variants. Extensive experiments on SOTA models reveal fundamental obedience limitations and further exploratory insights. By establishing this framework, we aim to draw more attention to AI Obedience and encourage deeper exploration to bridge this gap.
[155] Image-Based Classification of Olive Species Specific to Turkiye with Deep Neural Networks
Irfan Atabas,Hatice Karatas
Main category: cs.CV
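TL;DR: This study uses image processing and deep learning to classify five olive varieties from Turkiye. Images were captured with a stereo camera and classified with MobileNetV2 and EfficientNetB0, with EfficientNetB0 reaching 94.5% accuracy.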
Details
Motivation: Automatic identification of local olive varieties in Turkiye, in support of automatic identification and quality control of agricultural products.
Method: Images of five olive varieties were captured with a stereo camera and preprocessed. MobileNetV2 and EfficientNetB0 were then trained and used for classification via transfer learning.
Result: EfficientNetB0 achieved the best performance, with a classification accuracy of 94.5% in testing.
Conclusion: Deep-learning-based image classification can identify olive varieties efficiently and accurately, and has potential for agricultural automation.
Abstract: In this study, image processing and deep learning methodologies were employed to automatically classify local olive species cultivated in Turkiye. A stereo camera was utilized to capture images of five distinct olive species, which were then preprocessed to ensure their suitability for analysis. Convolutional Neural Network (CNN) architectures, specifically MobileNetV2 and EfficientNetB0, were employed for image classification. These models were optimized through a transfer learning approach. The training and testing results indicated that the EfficientNetB0 model performed best, with an accuracy of 94.5%. The findings demonstrate that deep learning-based systems offer an effective solution for classifying olive species with high accuracy. The developed method has significant potential for application in areas such as automatic identification and quality control of agricultural products.
[156] A Novel Evolutionary Method for Automated Skull-Face Overlay in Computer-Aided Craniofacial Superimposition
Práxedes Martínez-Moreno,Andrea Valsecchi,Pablo Mesejo,Pilar Navarro-Ramírez,Valentino Lugli,Sergio Damas
Main category: cs.CV
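TL;DR: This paper proposes Lilium, which uses a Differential Evolution algorithm to optimize a soft-tissue thickness model based on a 3D cone representation, improving the accuracy and robustness of Skull-Face Overlay (SFO) in craniofacial superimposition.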
Details
Motivation: The accuracy of conventional Skull-Face Overlay (SFO) is limited by individual differences in soft-tissue thickness, so a more robust, automated modeling method is needed.
Method: Lilium represents soft-tissue thickness with 3D cones and optimizes their parameters with a Differential Evolution algorithm. It adds multiple constraints (anatomical, morphological, and photographic plausibility), including landmark matching, camera parameter consistency, head pose alignment, skull containment within facial boundaries, and region parallelism.
Result: Lilium outperforms the current state-of-the-art method in both accuracy and robustness.
Conclusion: Lilium delivers an automated SFO workflow that is closer to forensic practice and effectively reduces the uncertainty introduced by soft-tissue variation.
Abstract: Craniofacial Superimposition is a forensic technique for identifying skeletal remains by comparing a post-mortem skull with ante-mortem facial photographs. A critical step in this process is Skull-Face Overlay (SFO). This stage involves aligning a 3D skull model with a 2D facial image, typically guided by the correspondence between cranial and facial landmarks. However, its accuracy is undermined by individual variability in soft-tissue thickness, introducing significant uncertainty into the overlay. This paper introduces Lilium, an automated evolutionary method to enhance the accuracy and robustness of SFO. Lilium explicitly models soft-tissue variability using a 3D cone-based representation whose parameters are optimized via a Differential Evolution algorithm. The method enforces anatomical, morphological, and photographic plausibility through a combination of constraints: landmark matching, camera parameter consistency, head pose alignment, skull containment within facial boundaries, and region parallelism. By emulating the usual approach of forensic practitioners, Lilium outperforms the state-of-the-art method in terms of both accuracy and robustness.
[157] AdaFocus: Knowing When and Where to Look for Adaptive Visual Reasoning
Yuxiang Shen,Hailong Huang,Zhenkun Gao,Xueheng Li,Chengjun Xie,Xuanhua He,Jie Zhang
Main category: cs.CV
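TL;DR: This paper proposes AdaFocus, a training-free adaptive visual reasoning framework. A confidence-driven cropping decision and a semantic-guided localization module address the perceptual redundancy and semantic-spatial drift of existing training-free methods, and AdaFocus clearly outperforms them in both accuracy and inference speed.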
Details
Motivation: Existing training-free methods for multimodal large language models have two flaws, perceptual redundancy (indiscriminate cropping) and a drift between semantic intent and spatial attention, while large-scale training is computationally expensive. An efficient, lightweight alternative is needed.
Method: AdaFocus uses a two-stage training-free framework: the first stage decides whether to crop based on confidence, and the second uses semantic information to guide where to crop, giving adaptive visual reasoning.
Result: Experiments show substantial performance gains, with inference about 4.0 times faster than the SOTA method ZoomEyes.
Conclusion: AdaFocus offers an efficient, accurate, training-free approach to visual reasoning for multimodal large language models, balancing accuracy and efficiency.
Abstract: Multimodal Large Language Models (MLLMs) are shifting towards "Thinking with Images" by actively exploring image details. While effective, large-scale training is computationally expensive, which has spurred growing interest in lightweight, training-free solutions. However, existing training-free methods suffer from two flaws: perceptual redundancy from indiscriminate cropping, which adds overhead and noise; and a drift between semantic intent and spatial attention, which prevents accurate localization of user-focused regions. To address these challenges, we propose AdaFocus, a novel training-free framework designed for adaptive visual reasoning. AdaFocus follows a two-stage pipeline: a confidence-based module decides when to crop, and a semantic-guided localization module determines where to crop. This enables adaptive visual reasoning without additional training. Experimentally, AdaFocus delivers substantial performance gains while achieving an approximately $4.0\times$ inference speedup over the SOTA method ZoomEyes, representing a significant advance in both accuracy and efficiency.
[158] Summer-22B: A Systematic Approach to Dataset Engineering and Training at Scale for Video Foundation Model
Simo Ryu,Chunghwan Han
Main category: cs.CV
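TL;DR: This report describes the experience of training the video foundation model Summer-22B from scratch, covering data collection, multi-stage filtering, μP parameterization, and hypersphere-constrained optimization, and stresses that data engineering was the main challenge.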
Details
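Motivation: To support the development of large-scale video foundation models, the engineering challenges and design decisions across the whole process, from raw video collection to model training, need to be documented systematically for later researchers.
Method: The approach combines metadata-driven dataset construction, multi-stage filtering, μP parameterization, and hypersphere-constrained optimization. The authors built the Lavender Data dataset management system and adopted inference-aware architectural design.
Result: Data engineering accounted for most of the work, architectural variants differed less in performance than expected, and μP hyperparameter transfer remained effective under geometric constraints.
Conclusion: In training video foundation models, high-quality data engineering matters more than architecture tuning. μP and geometrically constrained optimization show practical promise, and this account can serve as a reference for similar projects.
Abstract: We describe our experience training Summer-22B, a video foundation model developed from scratch. This report documents the engineering challenges, design decisions, and lessons learned while scaling from raw footage collection to a functional model trained on approximately 50 million clips. We outline our approach combining metadata-driven dataset curation, multi-stage filtering, $\mu$P parameterization, and hypersphere-constrained optimization. We developed the Lavender Data system for dataset management and adopted inference-aware architectural choices. We share observations on what worked in our setting: dataset engineering consumed the majority of effort, architectural variants showed smaller differences than we expected, and $\mu$P hyperparameter transfer appeared effective even under geometric constraints. We hope this account proves useful to others undertaking similar projects.
[159] Infinite Self-Attention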
Giorgio Roffo
Main category: cs.CV
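TL;DR: This paper proposes Infinite Self-Attention (InfSA), which models self-attention as a diffusion process on a content-adaptive graph, accumulates multi-hop interactions through a Neumann series, and connects the result to classical graph centrality measures. It also proposes Linear-InfSA, a linear-complexity variant that needs only a fixed-size auxiliary state, supports efficient training and inference on very long sequences (more than 300k tokens), and substantially improves accuracy, throughput, and energy efficiency on ImageNet and related tasks.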
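The discounted Neumann series mentioned in this entry can be checked numerically on a toy example. The sketch below sums $\sum_{k \ge 0} (\gamma A)^k$ for a small row-stochastic matrix and verifies that the truncated sum approaches $(I - \gamma A)^{-1}$, the closed form behind Katz-style centrality. The 2x2 matrix, the discount `gamma`, and the helper names are assumptions for illustration. This is neither the paper's attention operator nor its linear-time algorithm.

```python
def matmul(X, Y):
    """Multiply two square matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def neumann_sum(A, gamma, terms):
    """Return I + (gamma*A) + (gamma*A)^2 + ... + (gamma*A)^terms."""
    n = len(A)
    G = [[gamma * a for a in row] for row in A]
    total = identity(n)
    power = identity(n)
    for _ in range(terms):
        power = matmul(power, G)  # next power of gamma*A
        total = [[t + p for t, p in zip(t_row, p_row)]
                 for t_row, p_row in zip(total, power)]
    return total


# Toy row-stochastic "attention" matrix over two tokens (hypothetical values).
A = [[0.5, 0.5],
     [0.2, 0.8]]
gamma = 0.5  # discount below 1, so the series converges for a row-stochastic A

S = neumann_sum(A, gamma, terms=60)

# If S equals (I - gamma*A)^{-1}, then S @ (I - gamma*A) is the identity.
I_minus_G = [[(1.0 if i == j else 0.0) - gamma * A[i][j] for j in range(2)]
             for i in range(2)]
check = matmul(S, I_minus_G)
```

Entry `S[i][j]` accumulates the discounted weight of all walks from token `i` to token `j`, which matches the multi-hop reading given in the abstract. Forming `S` explicitly costs quadratic memory in the number of tokens, which is the cost the Linear-InfSA variant is described as avoiding.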
Details
Motivation: The quadratic computational cost of softmax self-attention limits how far Transformers scale in high-resolution vision tasks, so an alternative with both theoretical interpretability and linear computational cost is needed.
Method: Each self-attention layer is treated as a diffusion step on a content-adaptive token graph, with multi-hop interactions modeled by a discounted Neumann series. The paper links this to graph centrality measures such as Katz centrality and PageRank, and gives a random-walk interpretation through the fundamental matrix of an absorbing Markov chain. It then designs Linear-InfSA, which iteratively estimates the principal eigenvector of the implicit attention operator, avoids building the full attention matrix explicitly, and maintains only an auxiliary state of size $O(d_h)$.
Result: In a 4-layer ViT, Linear-InfSA reaches 84.7% top-1 accuracy on ImageNet-1K (+3.2 points) and 79.8% on ImageNet-V2, clearly above the baselines. On an A100 it reaches 231 img/s at 0.87 J/image (a 13x improvement) and is the only model to complete 9216×9216 (~332k tokens) inference without running out of memory. The principal eigenvector approximation is accurate (cosine similarity 0.985).
Conclusion: InfSA offers a new spectral view and a graph-diffusion interpretation of self-attention. Linear-InfSA achieves linear complexity and strong practical performance while staying compatible with existing models, making it an important advance for high-resolution vision Transformers.
Abstract: The quadratic cost of softmax attention limits Transformer scalability in high-resolution vision. We introduce Infinite Self-Attention (InfSA), a spectral reformulation that treats each attention layer as a diffusion step on a content-adaptive token graph, accumulating multi-hop interactions through a discounted Neumann series over attention matrices. This links self-attention to classical graph centrality (Katz, PageRank, eigenvector centrality) for interpretable token weighting. We also show the Neumann kernel equals the fundamental matrix of an absorbing Markov chain, so a token's centrality is its expected number of random-walk visits before absorption. We then propose Linear-InfSA, a linear-time variant that approximates the principal eigenvector of the implicit attention operator without forming the full attention matrix. It keeps an auxiliary state of fixed size proportional to the per-head dimension $d_h$ (independent of sequence length $N$), is drop-in compatible with Vision Transformers, and supports stable training at 4096 by 4096 and inference at 9216 by 9216 (about 332k tokens). In a 4-layer ViT (53.5M parameters, 59 GFLOPs at 224 by 224), Linear-InfSA reaches 84.7% top-1 on ImageNet-1K, a +3.2 point architectural gain over an equal-depth softmax ViT trained with the same recipe. On ImageNet-V2, InfViT variants outperform all compared baselines (up to 79.8% vs 76.8%), indicating robustness under distribution shift.
On an A100 40GB GPU, Linear-InfViT runs at 231 images/s and 0.87 J/image (13x better throughput and energy than equal-depth ViT) and is the only tested model to complete 9216 by 9216 inference without running out of memory. The linear approximation closely matches the dominant eigenvector of the quadratic operator (cosine 0.985).
[160] Zero-Shot and Supervised Bird Image Segmentation Using Foundation Models: A Dual-Pipeline Approach with Grounding DINO 1.5, YOLOv11, and SAM 2.1
Abhinav Munagala
Main category: cs.CV
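TL;DR: This paper proposes a dual-pipeline framework for binary bird image segmentation built on 2025 foundation models. It has a zero-shot mode (only the text prompt "bird" plus Grounding DINO detection boxes) and a supervised mode (fine-tuned YOLOv11 detection plus SAM 2.1 segmentation), both with a frozen SAM 2.1 backbone. On CUB-200-2011 the supervised mode reaches IoU 0.912 and the zero-shot mode IoU 0.831, both above existing methods, without retraining the segmentation model.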
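As a quick consistency check on the metrics quoted in this entry, recall the standard relation between IoU and Dice for a single binary mask, written here in terms of true positives, false positives, and false negatives:

$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}, \qquad \mathrm{Dice} = \frac{2\,TP}{2\,TP + FP + FN} = \frac{2\,\mathrm{IoU}}{1 + \mathrm{IoU}}$$

Substituting the supervised pipeline's reported IoU gives

$$\frac{2 \times 0.912}{1 + 0.912} = \frac{1.824}{1.912} \approx 0.954,$$

which agrees with the reported Dice of 0.954. The identity is exact only per image. For scores averaged over a dataset it holds approximately, so this is a plausibility check and not a derivation of the paper's numbers.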
Details
Motivation: Bird image segmentation is challenging because of variable poses, complex plumage, and diverse lighting, while existing methods depend on large amounts of annotated data or end-to-end training, which limits generalization and adaptability.
Method: Two pipelines are built. (1) Zero-shot: Grounding DINO 1.5 detects birds from the text prompt "bird" and produces bounding boxes, which prompt a frozen SAM 2.1 to produce masks. (2) Supervised: YOLOv11 is fine-tuned on CUB-200-2011 for higher detection precision, and its boxes prompt SAM 2.1 to produce pixel-level masks. SAM 2.1 parameters are never updated.
Result: On CUB-200-2011, the supervised pipeline reaches IoU 0.912, Dice 0.954, and F1 0.953, which is 7.0 percentage points above SegFormer-B2. The zero-shot pipeline reaches IoU 0.831, the first text-prompt-only result reported on this benchmark. Domain adaptation needs only about 1 hour of lightweight detector fine-tuning.
Conclusion: Prompt-based foundation-model pipelines clearly outperform conventional end-to-end segmentation networks on bird segmentation, combining high performance, strong generalization, and low training cost, and supporting the frozen-large-model plus lightweight prompting/fine-tuning paradigm.
Abstract: Bird image segmentation remains a challenging task in computer vision due to extreme pose diversity, complex plumage patterns, and variable lighting conditions. This paper presents a dual-pipeline framework for binary bird image segmentation leveraging 2025 foundation models. We introduce two operating modes built upon Segment Anything Model 2.1 (SAM 2.1) as a shared frozen backbone: (1) a zero-shot pipeline using Grounding DINO 1.5 to detect birds via the text prompt "bird" before prompting SAM 2.1 with bounding boxes, requiring no labelled bird data; and (2) a supervised pipeline that fine-tunes YOLOv11 on the CUB-200-2011 dataset for high-precision detection, again prompting SAM 2.1 for pixel-level masks. The segmentation model is never retrained for new species or domains. On CUB-200-2011 (11,788 images, 200 species), the supervised pipeline achieves IoU 0.912, Dice 0.954, and F1 0.953, outperforming all prior baselines including SegFormer-B2 (IoU 0.842) by +7.0 percentage points. The zero-shot pipeline achieves IoU 0.831 using only a text prompt, the first such result reported on this benchmark. We demonstrate that prompt-based foundation model pipelines outperform task-specific end-to-end trained segmentation networks, while requiring only lightweight detector fine-tuning (~1 hour) for domain adaptation. Complete PyTorch implementation, dataset preparation scripts, and trained weights are publicly available.
[161] Efficient Long-Horizon GUI Agents via Training-Free KV Cache Compression
Bowen Zhou,Zhou Xu,Wanli Li,Jingyu Xiao,Haoqian Wang
Main category: cs.CV
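TL;DR: This paper proposes ST-Lite, a training-free KV cache compression framework designed for GUI scenarios. Its dual-branch policy, Component-centric Spatial Saliency (CSS) and Trajectory-aware Semantic Gating (TSG), delivers a 2.45x decoding speedup with only a 10-20% cache budget and no loss in performance.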
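The "10-20% cache budget" can be read as keeping only the highest-scoring fraction of cached key-value entries. The sketch below shows that generic budgeted top-k retention step on a list of per-entry scores. The function name, the scalar scores, and the tie-breaking are assumptions for illustration. ST-Lite derives its scores from the CSS and TSG branches, and those are not modeled here.

```python
def select_kv_to_keep(scores, budget_ratio):
    """Return the positions of the cache entries to keep, in original order.

    scores: one importance score per cached KV entry (higher = keep).
    budget_ratio: fraction of entries to retain, e.g. 0.1 to 0.2.
    """
    n_keep = max(1, int(len(scores) * budget_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore positional order so the retained entries stay sequence-aligned.
    return sorted(ranked[:n_keep])


# Hypothetical importance scores for ten cached entries.
scores = [0.1, 0.9, 0.3, 0.8, 0.05, 0.7, 0.2, 0.6, 0.4, 0.5]
kept = select_kv_to_keep(scores, budget_ratio=0.2)
```

With these values `kept` is `[1, 3]`, the two highest-scoring positions. A real cache would then gather the key and value tensors at those positions for every layer and head.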
Details
Motivation: Existing KV cache compression methods perform poorly in GUI scenarios because they do not account for the uniformly high sparsity of GUI attention patterns across all Transformer layers. GUI data streams also carry dynamic spatio-trajectory dependencies that call for targeted optimization.
Method: The ST-Lite framework has two core modules: Component-centric Spatial Saliency (CSS), which evaluates local neighborhood saliency of UI components to preserve structural integrity, and Trajectory-aware Semantic Gating (TSG), which dynamically filters visually repetitive KV pairs in the interaction trajectory to reduce historical redundancy. No additional training is required.
Result: While keeping only 10-20% of the KV cache, ST-Lite achieves a 2.45x decoding speedup, with task performance comparable to or better than the full-cache baseline.
Conclusion: ST-Lite is a lightweight, efficient, plug-and-play KV cache compression scheme that markedly eases the memory and latency bottlenecks of large vision-language models in long-horizon GUI interaction and makes autonomous GUI agents more feasible on resource-constrained devices.
Abstract: Large Vision-Language Models (VLMs) have emerged as powerful engines for autonomous GUI agents, yet their deployment is severely constrained by the substantial memory footprint and latency of the Key-Value (KV) cache during long-horizon interactions. While existing cache compression methods have proven effective for LLMs, we empirically demonstrate that they suffer from suboptimal performance in GUI scenarios due to a fundamental misalignment: unlike general visual tasks where attention sparsity varies across layers, GUI attention patterns exhibit uniform high-sparsity across all transformer layers. Motivated by this insight, we propose ST-Lite, a training-free KV cache compression framework tailored for efficient GUI agents that explicitly addresses the dynamic spatio-trajectory dependencies within GUI data streams. ST-Lite introduces a novel dual-branch scoring policy incorporating Component-centric Spatial Saliency (CSS) and Trajectory-aware Semantic Gating (TSG). Specifically, CSS preserves the structural integrity of interactive UI elements by evaluating local neighborhood saliency, while TSG mitigates historical redundancy by dynamically filtering visually repetitive KV pairs within the interaction trajectory.
Extensive evaluations demonstrate that with only a 10-20% cache budget, ST-Lite achieves a 2.45x decoding acceleration while maintaining comparable or even superior performance compared to full-cache baselines, offering a scalable solution for resource-constrained GUI agents.
[162] SKeDA: A Generative Watermarking Framework for Text-to-video Diffusion Models
Yang Yang,Xinze Zou,Zehua Ma,Han Fang,Weiming Zhang
Main category: cs.CV
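TL;DR: This paper proposes SKeDA, a framework for embedding watermarks in text-to-video generation models. Shuffle-Key sampling and a differential attention mechanism make the watermark more robust to frame reordering, frame loss, and temporal distortion while preserving high-quality video generation.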
Details
Motivation: Directly transferring existing image watermarking methods to video leads to strong dependence on frame alignment and vulnerability to temporal distortions such as video compression, so a high-fidelity, robust watermarking scheme designed specifically for video is needed.
Method: The SKeDA framework has two parts: (1) Shuffle-Key distribution-preserving sampling (SKe), which derives per-frame keys by permuting a single pseudo-random sequence and turns watermark extraction from sequence decoding into set aggregation; and (2) Differential Attention (DA), which uses inter-frame differences to adjust attention weights dynamically and resist temporal distortion.
Result: Experiments show that SKeDA maintains high video generation quality while substantially improving watermark robustness to common video distortions such as frame reordering, frame loss, and compression.
Conclusion: SKeDA offers a new generative watermarking approach for text-to-video diffusion models that balances fidelity and robustness, helping to address the authenticity and copyright challenges posed by AI-generated video.
Abstract: The rise of text-to-video generation models has raised growing concerns over content authenticity, copyright protection, and malicious misuse. Watermarking serves as an effective mechanism for regulating such AI-generated content, where high fidelity and strong robustness are particularly critical. Recent generative image watermarking methods provide a promising foundation by leveraging watermark information and pseudo-random keys to control the initial sampling noise, enabling lossless embedding. However, directly extending these techniques to videos introduces two key limitations: (1) existing designs implicitly rely on strict alignment between video frames and the frame-dependent pseudo-random binary sequences used for watermark encryption, so once this alignment is disrupted, subsequent watermark extraction becomes unreliable; and (2) video-specific distortions, such as inter-frame compression, significantly degrade watermark reliability. To address these issues, we propose SKeDA, a generative watermarking framework tailored for text-to-video diffusion models. SKeDA consists of two components: (1) Shuffle-Key-based Distribution-preserving Sampling (SKe), which employs a single base pseudo-random binary sequence for watermark encryption and derives frame-level encryption sequences through permutation.
This design transforms watermark extraction from synchronization-sensitive sequence decoding into permutation-tolerant set-level aggregation, substantially improving robustness against frame reordering and loss; and (2) Differential Attention (DA), which computes inter-frame differences and dynamically adjusts attention weights during extraction, enhancing robustness against temporal distortions. Extensive experiments demonstrate that SKeDA preserves high video generation quality and watermark robustness.
[163] A Case Study on Concept Induction for Neuron-Level Interpretability in CNN
Moumita Sen Sarma,Samatha Ereshi Akkamahadevi,Pascal Hitzler
Main category: cs.CV
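TL;DR: This paper studies how well a Concept Induction-based framework for hidden neuron analysis generalizes to the SUN2012 dataset, confirming its applicability to large-scale scene recognition.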
Details
Motivation: The internal semantics of hidden neurons in deep neural networks remain unclear, so the generalization ability of interpretability methods for them needs to be examined.
Method: Using the same Concept Induction framework as on ADE20K, the authors assign semantic labels to hidden neurons on the SUN2012 dataset and validate label interpretability with web-sourced images and statistical testing.
Result: The method transfers successfully to SUN2012, assigns interpretable semantic labels to neurons, and is validated empirically.
Conclusion: Concept Induction-based neuron analysis generalizes across datasets and is applicable to a wider range of scene understanding tasks.
Abstract: Deep Neural Networks (DNNs) have advanced applications in domains such as healthcare, autonomous systems, and scene understanding, yet the internal semantics of their hidden neurons remain poorly understood. Prior work introduced a Concept Induction-based framework for hidden neuron analysis and demonstrated its effectiveness on the ADE20K dataset. In this case study, we investigate whether the approach generalizes by applying it to the SUN2012 dataset, a large-scale scene recognition benchmark. Using the same workflow, we assign interpretable semantic labels to neurons and validate them through web-sourced images and statistical testing. Our findings confirm that the method transfers to SUN2012, showing its broader applicability.
[164] Stateful Token Reduction for Long-Video Hybrid VLMs
Jindong Jiang,Amala Sanjay Deshmukh,Kateryna Chumachenko,Karan Sapra,Zhiding Yu,Guilin Liu,Andrew Tao,Pavlo Molchanov,Jan Kautz,Wonmin Byeon
Main category: cs.CV
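TL;DR: This paper proposes a query-conditioned, progressive token compression method for long-video vision-language models with hybrid architectures (for example, those containing Mamba blocks). A unified language-aware scoring mechanism enables token reduction in all layers. Under heavy compression (only 25% of visual tokens retained), accuracy stays close to the baseline and prefilling is substantially faster (3.8x to 4.2x).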
Details
Motivation: Existing token reduction methods mainly target dense Transformers and are hard to adapt to video VLM architectures that mix attention with state-space models such as Mamba. The authors also find that token importance is sparse within a single layer but unstable across layers, so aggressive early pruning is unreliable.
Method: A low-to-high progressive token reduction schedule, together with a unified language-aware scoring mechanism for both attention and Mamba blocks (using an implicit-attention proxy for Mamba), enables token reduction in all layers of a hybrid architecture.
Result: Under heavy compression that keeps only 25% of visual tokens, prefilling is 3.8x to 4.2x faster and test accuracy is close to the baseline. Light finetuning further improves performance on long-video benchmarks.
Conclusion: A query-conditioned, progressive, block-agnostic token reduction strategy adapts well to hybrid video VLM architectures and achieves a better balance between efficiency and accuracy.
Abstract: Token reduction is an effective way to accelerate long-video vision-language models (VLMs), but most existing methods are designed for dense Transformers and do not directly account for hybrid architectures that interleave attention with linear-time state-space blocks (e.g., Mamba). We study query-conditioned token reduction for hybrid video VLMs and analyze reduction behavior through two properties: layerwise sparsity (how many tokens capture query-relevant information) and importance stability (whether token-importance rankings persist across depth). Although token importance is sparse within each layer, the set of important tokens changes across layers, so aggressive early pruning is unreliable. Motivated by this, we propose a low-to-high progressive reduction schedule and a unified language-aware scoring mechanism for both attention and Mamba blocks (using an implicit-attention proxy for Mamba), enabling all-layer token reduction in hybrids. Under an aggressive compression setting (retaining 25% of visual tokens), our approach delivers substantial prefilling speedups (3.8x to 4.2x) with near-baseline accuracy at test time, and light finetuning under reduction further improves performance on long-context video benchmarks.
[165] AdURA-Net: Adaptive Uncertainty and Region-Aware Network
Antik Aich Roy,Ujjwal Bhattacharya
Main category: cs.CV
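TL;DR: This paper proposes AdURA-Net, a geometry-driven adaptive uncertainty-aware framework for thoracic disease classification. It uses adaptive dilated convolution, multiscale deformable alignment, and a dual-head loss (masked binary cross-entropy plus Dirichlet evidential learning) to make the model more reliable under uncertain labels.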
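For readers unfamiliar with Dirichlet evidential learning, the sketch below shows the commonly used way to turn non-negative per-class evidence into expected probabilities and a single uncertainty value, and how that value can drive an abstain decision. It follows the standard subjective-logic formulation ($\alpha_k = e_k + 1$, $u = K / \sum_k \alpha_k$). The abstract does not spell out AdURA-Net's exact formulation, so the function names, the binary negative/positive setup, and the `max_uncertainty` threshold are all assumptions.

```python
def dirichlet_summary(evidence):
    """Map non-negative per-class evidence to (expected probabilities, uncertainty).

    alpha_k = evidence_k + 1, S = sum(alpha), p_k = alpha_k / S, u = K / S.
    With zero evidence the uncertainty is 1; it shrinks as evidence grows.
    """
    k = len(evidence)
    alpha = [e + 1.0 for e in evidence]
    strength = sum(alpha)
    probs = [a / strength for a in alpha]
    return probs, k / strength


def predict_or_abstain(evidence, max_uncertainty=0.5):
    """Return ('negative' | 'positive' | 'abstain', uncertainty) for one finding.

    evidence is ordered [negative, positive]; the threshold is illustrative.
    """
    probs, u = dirichlet_summary(evidence)
    if u > max_uncertainty:
        return "abstain", u
    return ("positive" if probs[1] >= probs[0] else "negative"), u


# No evidence at all: the model should decline to commit.
label_none, u_none = predict_or_abstain([0.0, 0.0])

# Strong evidence for the positive class: a confident prediction.
label_pos, u_pos = predict_or_abstain([0.0, 18.0])
```

In the first call the uncertainty is 1.0 and the result is `"abstain"`. In the second it drops to 0.1 and the result is `"positive"`, which is the kind of behavior the entry asks for when a label is genuinely uncertain.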
Details
Motivation: Clinical decisions often involve uncertainty caused by ambiguous radiology reports or the limits of automated labeling, especially in multilabel datasets such as CheXpert and MIMIC-CXR. For "uncertain" labels the model should avoid forcing a confident prediction, so the ability to abstain is needed.
Method: AdURA-Net (a) uses a DenseNet backbone combined with adaptive dilated convolution and multiscale deformable alignment to model anatomical complexity, and (b) uses a dual-head loss that jointly optimizes masked binary cross-entropy and a Dirichlet evidential learning objective to model uncertainty.
Result: The framework improves classification reliability and uncertainty calibration on chest X-ray data containing "uncertain" labels, making the model more trustworthy in high-risk clinical decisions.
Conclusion: Through geometry-aware architectural design and evidential deep learning, AdURA-Net models and exploits diagnostic uncertainty effectively, offering a new approach to uncertainty-sensitive medical image analysis.
Abstract: Uncertainty is a common issue in clinical decision-making. It often arises from ambiguity in radiology reports, which may reflect genuine diagnostic uncertainty or the limitations of automated label extraction in complex cases. This is especially true of multilabel datasets such as CheXpert and MIMIC-CXR, which contain positive, negative, and uncertain labels. The uncertain label is difficult to handle, because the model should not be forced to give a confident prediction in the absence of sufficient evidence. A model's ability to indicate that it does not know when it is not confident is crucial, especially in high-risk clinical decisions. Here, we propose AdURA-Net, a geometry-driven adaptive uncertainty-aware framework for reliable thoracic disease classification. The key highlights of the proposed model are: (a) adaptive dilated convolution and multiscale deformable alignment coupled with a DenseNet backbone, capturing the anatomical complexity of medical images; and (b) a Dual Head Loss, which combines masked binary cross-entropy with logits and a Dirichlet evidential learning objective.
[166] TACIT Benchmark: A Programmatic Visual Reasoning Benchmark for Generative and Discriminative Models
Daniel Nobrega Medeiros
Main category: cs.CV
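TL;DR: This paper presents TACIT, a visual reasoning benchmark with 10 tasks across 6 reasoning domains. It supports dual-track evaluation (generative and discriminative) and emphasizes programmatic, reproducible, and structured fine-grained reasoning.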
Details
Motivation: Existing visual reasoning benchmarks mostly rely on natural language prompts, cover a narrow range of reasoning modalities, or use subjective scoring (such as LLM-as-judge). A programmatic, objective, multi-modality evaluation standard is missing.
Method: The TACIT benchmark covers 6 visual reasoning domains and provides dual-track evaluation (generative image output and discriminative five-way multiple choice). Solutions are verified with deterministic computer-vision pipelines, each distractor violates exactly one structural constraint, and the whole dataset can be generated and verified reproducibly.
Result: TACIT v0.1.0 is released with 6,000 puzzles and 108,000 PNG images (three resolutions), together with open-source code, data, and an evaluation harness (Apache 2.0 license, hosted on HuggingFace).
Conclusion: TACIT provides a more rigorous, objective, and multi-dimensional programmatic evaluation approach for visual reasoning, pushing models from surface recognition toward deeper structured reasoning.
Abstract: Existing visual reasoning benchmarks predominantly rely on natural language prompts, evaluate narrow reasoning modalities, or depend on subjective scoring procedures such as LLM-as-judge. We introduce the TACIT Benchmark, a programmatic visual reasoning benchmark comprising 10 tasks across 6 reasoning domains: spatial navigation, abstract pattern completion, causal simulation, logical constraint satisfaction, graph theory, and topology. The benchmark provides dual-track evaluation: a generative track in which models must produce solution images verified through deterministic computer-vision pipelines, and a discriminative track offering five-way multiple choice with structurally plausible near-miss distractors. Each distractor violates exactly one structural constraint, requiring models to reason about fine-grained visual differences rather than exploit superficial cues. Version 0.1.0 distributes 6,000 puzzles (108,000 PNG images across three resolutions) with fully deterministic seeded generation and reproducible verification. The dataset, generation code, and evaluation harness are released under the Apache 2.0 license on HuggingFace (DOI: 10.57967/hf/7904).
[167] VisRef: Visual Refocusing while Thinking Improves Test-Time Scaling in Multi-Modal Large Reasoning Models
Soumya Suvra Ghosal,Youngeun Kim,Zhuowei Li,Ritwick Chaudhry,Linghan Xu,Hongjing Zhang,Jakub Zablocki,Yifan Xing,Qin Zhang
Main category: cs.CV
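TL;DR: This paper proposes VisRef, a framework that dynamically re-injects a semantically relevant and diverse coreset of visual tokens during reasoning. It improves the reasoning performance of multimodal large models on vision-dependent tasks without additional reinforcement learning fine-tuning.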
Details
Motivation: In vision-dependent tasks, extending textual reasoning in large reasoning models weakens attention to visual information and degrades performance, while existing remedies (such as RL fine-tuning or refocusing mechanisms) are computationally expensive.
Method: VisRef is a visually grounded test-time scaling framework. Its core idea is to actively re-inject, during reasoning, a coreset of visual tokens that is semantically relevant, diverse, and globally representative.
Result: On three visual reasoning benchmarks and under a fixed test-time compute budget, VisRef improves on existing test-time scaling methods by up to 6.4%.
Conclusion: VisRef effectively strengthens visually grounded reasoning in multimodal large models, combining performance gains with computational efficiency and requiring no additional RL training.
Abstract: Advances in large reasoning models have shown strong performance on complex reasoning tasks by scaling test-time compute through extended reasoning. However, recent studies observe that in vision-dependent tasks, extended textual reasoning at inference time can degrade performance as models progressively lose attention to visual tokens and increasingly rely on textual priors alone. To address this, prior works use reinforcement learning (RL)-based fine-tuning to route visual tokens or employ refocusing mechanisms during reasoning. While effective, these methods are computationally expensive, requiring large-scale data generation and policy optimization. To leverage the benefits of test-time compute without additional RL fine-tuning, we propose VisRef, a visually grounded test-time scaling framework. Our key idea is to actively guide the reasoning process by re-injecting a coreset of visual tokens that are semantically relevant to the reasoning context while remaining diverse and globally representative of the image, enabling more grounded multi-modal reasoning. Experiments on three visual reasoning benchmarks with state-of-the-art multi-modal large reasoning models demonstrate that, under fixed test-time compute budgets, VisRef consistently outperforms existing test-time scaling approaches by up to 6.4%.
[168] Physical Evaluation of Naturalistic Adversarial Patches for Camera-Based Traffic-Sign Detection
Brianna D'Urso,Tahmid Hasan Sakib,Syed Rafay Hasan,Terry N. Guo
Main category: cs.CV
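TL;DR: This paper studies how well naturalistic adversarial patches (NAPs) transfer to physical traffic-sign scenes. A YOLOv5 detector is trained on the customized CompGTSRB dataset, patches are generated with a GAN, and their effect on STOP-class detection confidence is validated on the QCar vehicle platform.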
Details
Motivation: Evaluate how well naturalistic adversarial patches transfer physically in a real autonomous-driving environment, making adversarial attack evaluation more credible and practical.
Method: Build CompGTSRB, a composite dataset for the autonomous-driving environment (GTSRB traffic signs pasted onto backgrounds captured on the platform), and use it to train a YOLOv5 model; generate NAPs with a GAN using latent space optimization; run physical experiments on the Quanser QCar platform with the front CSI camera while varying distance, patch size, and patch placement.
Result: NAPs reduce STOP-class detection confidence across the physical configurations tested; the CompGTSRB dataset and the proposed physical evaluation protocol are shown to be useful.
Conclusion: NAPs transfer to the physical world in practice, which underlines the need for defenses against localized patch corruption in embedded perception pipelines.
Abstract: This paper studies how well Naturalistic Adversarial Patches (NAPs) transfer to a physical traffic sign setting when the detector is trained on a customized dataset for an autonomous vehicle (AV) environment. We construct a composite dataset, CompGTSRB (a dataset customized for the AV environment), by pasting traffic sign instances from the German Traffic Sign Recognition Benchmark (GTSRB) onto undistorted backgrounds captured from the target platform. CompGTSRB is used to train a YOLOv5 model and generate patches using a Generative Adversarial Network (GAN) with latent space optimization, following existing NAP methods. We carried out a series of experiments on our Quanser QCar testbed utilizing the front CSI camera provided in QCar. Across configurations, NAPs reduce the detector's STOP class confidence. The configurations vary distance, patch size, and patch placement. These results, along with a detailed step-by-step methodology, indicate the utility of the CompGTSRB dataset and the proposed systematic physical protocols for credible patch evaluation. The results further motivate research on defenses that address localized patch corruption in embedded perception pipelines.
[169] Pretty Good Measurement for Radiomics: A Quantum-Inspired Multi-Class Classifier for Lung Cancer Subtyping and Prostate Cancer Risk Stratification
Giuseppe Sergioli,Carlo Cuccu,Giovanni Pasini,Alessandro Stefano,Giorgio Russo,Andrés Camilo Granda Arango,Roberto Giuntini
Main category: cs.CV
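TL;DR: This paper proposes a quantum-inspired multi-class classification method based on the Pretty Good Measurement (PGM) from quantum state discrimination. Each class is mapped to a mixed quantum state, and an end-to-end multi-class decision is made through a single POVM measurement. The method is shown to be competitive on two radiomics tasks (lung cancer and prostate cancer), performing best when the number of classes is moderate and class overlap is small.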
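To make the PGM decision rule in this entry concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the encoding (each sample normalized and turned into a pure state, each class averaged into a density matrix) and the names `density_from_samples`, `pgm_povm`, and `classify` are assumptions chosen for illustration. Only the measurement formula itself, E_c = S^(-1/2) p_c rho_c S^(-1/2) with S = sum_c p_c rho_c, is the standard Pretty Good Measurement.

```python
import numpy as np

def density_from_samples(X):
    """Class density matrix: average of |x><x| over normalized rows of X.
    This encoding is an assumption made for the sketch."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    return X.T @ X / len(X)

def pgm_povm(rhos, priors):
    """Standard PGM elements E_c = S^{-1/2} (p_c rho_c) S^{-1/2}."""
    S = sum(p * r for p, r in zip(priors, rhos))
    w, V = np.linalg.eigh(S)                 # S is symmetric positive semidefinite
    keep = w > 1e-12                         # pseudo-inverse on the support of S
    inv_sqrt = (V[:, keep] / np.sqrt(w[keep])) @ V[:, keep].T
    return [inv_sqrt @ (p * r) @ inv_sqrt for p, r in zip(priors, rhos)]

def classify(x, povm):
    """Pick the class whose POVM element gives the largest probability <x|E_c|x>."""
    x = x / np.linalg.norm(x)
    probs = [float(x @ E @ x) for E in povm]
    return int(np.argmax(probs))

# Toy data: two classes clustered around different axes.
X0 = np.array([[1.0, 0.0], [0.9, 0.1]])
X1 = np.array([[0.0, 1.0], [0.1, 0.9]])
povm = pgm_povm([density_from_samples(X0), density_from_samples(X1)], [0.5, 0.5])
```

All classes are handled by one set of operators, so no pairwise or one-vs-rest decomposition is needed, which is the property the entry emphasizes.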
Details
Motivation: Conventional multi-class classification often relies on pairwise decomposition strategies and lacks a unified multi-class decision mechanism. The PGM from quantum state discrimination provides a natural, geometry-aware multi-class decision framework that can model inter-class structure and overlap, and may improve discrimination.
Method: Encode each class as a density operator (a mixed quantum state) and build a POVM based on the Pretty Good Measurement (PGM) as the classifier; classification applies this POVM to the density operator of the input sample and outputs the class with the highest probability. The procedure does not rely on binary decomposition and is natively multi-class.
Result: On NSCLC subtype classification (2-class and 3-class) the method substantially outperforms classical baselines and remains competitive on the 4-class task; on PCa risk stratification it is close to the strongest ensemble baseline and keeps clinically relevant sensitivity-specificity trade-offs across feature selections.
Conclusion: PGM offers a theoretically grounded, geometrically interpretable paradigm for multi-class classification that needs no dimensionality reduction or decomposition; the experiments indicate practical value and room for extension in high-dimensional, small-sample settings such as biomedical radiomics.
Abstract: We investigate a quantum-inspired approach to supervised multi-class classification based on the Pretty Good Measurement (PGM), viewed as an operator-valued decision rule derived from quantum state discrimination. The method associates each class with an encoded mixed state and performs classification through a single POVM construction, thus providing a genuinely multi-class strategy without reduction to pairwise or one-vs-rest schemes. In this perspective, classification is reformulated as the discrimination of a finite ensemble of class-dependent density operators, with performance governed by the geometry induced by the encoding map and by the overlap structure among classes. To assess the practical scope of this framework, we apply the PGM-based classifier to two biomedical radiomics case studies: histopathological subtyping of non-small-cell lung carcinoma (NSCLC) and prostate cancer (PCa) risk stratification. The evaluation is conducted under protocols aligned with previously reported radiomics studies, enabling direct comparison with established classical baselines. The results show that the PGM-based classifier is consistently competitive and, in several settings, improves upon standard methods. In particular, the method performs especially well in the NSCLC binary and three-class tasks, while remaining competitive in the four-class case, where increased class overlap yields a more demanding discrimination geometry. In the PCa study, the PGM classifier remains close to the strongest ensemble baseline and exhibits clinically relevant sensitivity-specificity trade-offs across feature-selection scenarios.
[170] Adversarial Patch Generation for Visual-Infrared Dense Prediction Tasks via Joint Position-Color Optimization
He Li,Wenyue He,Weihang Kong,Xingchen Zhang
Main category: cs.CV
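TL;DR: This paper proposes AP-PCO, a joint position-color optimization framework for visual-infrared (VI) dense prediction that generates cross-modal adversarial patches, using black-box optimization and cross-modal color adaptation to improve attack effectiveness and stealthiness.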
Details
Motivation: Existing adversarial patch methods are designed mainly for a single modality. In visual-infrared (VI) dense prediction they ignore cross-spectral inconsistencies, which leads to weak attacks and poor stealthiness.
Method: Propose the AP-PCO framework, which jointly optimizes the position and color of the adversarial patch; introduce a cross-modal color adaptation strategy that constrains patch appearance according to infrared grayscale characteristics while keeping strong perturbations in the visible domain; the whole procedure is black-box optimization that needs no internal model information.
Result: Across multiple VI dense prediction tasks and model architectures, AP-PCO shows consistently strong attack performance, clearly better than existing single-modal methods, with better stealthiness.
Conclusion: AP-PCO provides a practical benchmark for evaluating the robustness of VI perception systems and advances research on adversarial attacks in multimodal dense prediction.
Abstract: Multimodal adversarial attacks for dense prediction remain largely underexplored. In particular, visual-infrared (VI) perception systems introduce unique challenges due to heterogeneous spectral characteristics and modality-specific intensity distributions. Existing adversarial patch methods are primarily designed for single-modal inputs and fail to account for cross-spectral inconsistencies, leading to reduced attack effectiveness and poor stealthiness when applied to VI dense prediction models. To address these challenges, we propose a joint position-color optimization framework (AP-PCO) for generating adversarial patches in visual-infrared settings. The proposed method optimizes patch placement and color composition simultaneously using a fitness function derived from model outputs, enabling a single patch to perturb both visible and infrared modalities. To further bridge spectral discrepancies, we introduce a cross-modal color adaptation strategy that constrains patch appearance according to infrared grayscale characteristics while maintaining strong perturbations in the visible domain, thereby reducing cross-spectral saliency. The optimization procedure operates without requiring internal model information, supporting flexible black-box attacks. Extensive experiments on visual-infrared dense prediction tasks demonstrate that the proposed AP-PCO achieves consistently strong attack performance across multiple architectures, providing a practical benchmark for robustness evaluation in VI perception systems.
[171] Ozone Cues Mitigate Reflected Downwelling Radiance in LWIR Absorption-Based Ranging
Unay Dorken Gallastegi,Wentao Shangguan,Vaibhav Choudhary,Akshay Agarwal,Hoover Rueda-Chacón,Martin J. Stevens,Vivek K Goyal
Main category: cs.CV
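TL;DR: This paper proposes two new passive long-wave infrared (LWIR) ranging methods that use ozone absorption features to correct the error caused by reflected downwelling radiance, substantially improving ranging accuracy.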
Details
Motivation: Conventional passive LWIR absorption-based ranging ignores reflected radiance (downwelling radiance in particular), which can cause large errors in scenes with low temperature contrast; its effect needs to be modeled and corrected.
Method: Propose two new methods: (1) a quadspectral method, which gives a closed-form range estimate from four narrowband measurements (two at a water vapor absorption line and two at an ozone absorption line); (2) a hyperspectral method, which uses a broader spectral range to jointly estimate range, temperature, emissivity profiles, and downwelling contributions from multiple zenith angles.
Result: Experiments show improved ranging accuracy: in one case the error exceeds 100 m when reflection is not modeled, falls to 6.8 m with the quadspectral method, and to 1.2 m with the hyperspectral method.
Conclusion: Ozone absorption features make it possible to separate and correct the effect of reflected downwelling radiance, giving passive LWIR ranging practical accuracy in complex thermal scenes.
Abstract: Passive long-wave infrared (LWIR) absorption-based ranging relies on atmospheric absorption to estimate distances to objects from their emitted thermal radiation. First demonstrated decades ago for objects much hotter than the air and recently extended to scenes with low temperature variations, this ranging has depended on reflected radiance being negligible. Downwelling radiance is especially problematic, sometimes causing large inaccuracies. In two new ranging methods, we use characteristic features from ozone absorption to estimate the contribution of reflected downwelling radiance. The quadspectral method gives a simple closed-form range estimate from four narrowband measurements, two at a water vapor absorption line and two at an ozone absorption line. The hyperspectral method uses a broader spectral range to improve accuracy while also providing estimates of temperature, emissivity profiles, and contributions of downwelling from a collection of zenith angles. Experimental results demonstrate improved ranging accuracy, in one case reducing error from over 100 m when reflected light is not modeled to 6.8 m with the quadspectral method and 1.2 m with the hyperspectral method.
[172] Seeking Necessary and Sufficient Information from Multimodal Medical Data
Boyu Chen,Weiye Bao,Junjie Liu,Michael Shen,Bo Peng,Paul Taylor,Zhu Li,Mengyue Yang
Main category: cs.CV
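TL;DR: This paper proposes a multimodal medical representation learning method based on the Probability of Necessity and Sufficiency (PNS). It decomposes representations into modality-invariant and modality-specific features to make PNS estimable in the multimodal setting, improving model performance and robustness to missing modalities.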
Details
Motivation: Existing multimodal medical models neglect learning features that are both necessary and sufficient, although such features are important for model performance and for robustness to missing modalities.
Method: Decompose multimodal representations into modality-invariant and modality-specific components, and derive a tractable PNS learning objective for each.
Result: Experiments on synthetic data and real-world medical datasets support the method's effectiveness in terms of performance and robustness to missing modalities.
Conclusion: Using PNS to guide multimodal representation learning is feasible and effective, particularly for medical AI tasks that demand reliability and robustness.
Abstract: Learning multimodal representations from medical images and other data sources can provide richer information for decision-making. While various multimodal models have been developed for this, they overlook learning features that are both necessary (must be present for the outcome to occur) and sufficient (enough to determine the outcome). We argue learning such features is crucial as they can improve model performance by capturing essential predictive information, and enhance model robustness to missing modalities as each modality can provide adequate predictive signals. Such features can be learned by leveraging the Probability of Necessity and Sufficiency (PNS) as a learning objective, an approach that has proven effective in unimodal settings. However, extending PNS to multimodal scenarios remains underexplored and is non-trivial as key conditions of PNS estimation are violated. We address this by decomposing multimodal representations into modality-invariant and modality-specific components, then deriving tractable PNS objectives for each. Experiments on synthetic and real-world medical datasets demonstrate our method's effectiveness. Code will be available on GitHub.
[173] Proof-of-Perception: Certified Tool-Using Multimodal Reasoning with Compositional Conformal Guarantees
Arya Fayyazi,Haleh Akrami
Main category: cs.CV
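TL;DR: This paper proposes the Proof-of-Perception (PoP) framework, which models multimodal reasoning as an executable graph with explicit reliability guarantees. Conformal prediction yields calibrated uncertainty sets, and a lightweight controller allocates compute dynamically, improving efficiency and reliability while maintaining accuracy.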
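The conformal sets this entry refers to can be illustrated with ordinary split conformal prediction for a single classification node. The sketch below is a generic textbook construction, not PoP's compositional procedure; `conformal_threshold`, `prediction_set`, and the `needs_more_compute` rule (expand only when the set is not a single label) are hypothetical names and simplifications.

```python
import math
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Split conformal threshold from held-out calibration data.
    Score = 1 - probability assigned to the true label."""
    n = len(cal_labels)
    scores = np.sort(1.0 - cal_probs[np.arange(n), cal_labels])
    k = math.ceil((n + 1) * (1 - alpha))     # finite-sample corrected rank
    return 1.0 if k > n else float(scores[k - 1])

def prediction_set(probs, qhat):
    """All labels whose score is within the calibrated threshold."""
    return np.flatnonzero(1.0 - probs <= qhat)

def needs_more_compute(pred_set):
    """Hypothetical controller rule: call another tool unless exactly one label remains."""
    return len(pred_set) != 1

cal_probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]])
cal_labels = np.array([0, 1, 0])
qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.25)
```

A confident node returns a single label and lets the controller stop early; an ambiguous node returns an empty or multi-label set, which is the signal to spend more of the compute budget.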
Details
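Motivation: Address error accumulation, hallucination, and the lack of uncertainty quantification in multimodal reasoning, so that the reasoning process becomes verifiable and controllable.
Method: Build PoP, a tool-using executable-graph framework in which every perception or logic node outputs a conformal set that provides stepwise calibrated uncertainty; add a lightweight controller that uses these uncertainty certificates, under a compute budget, to decide whether to call additional tools or stop early.
Result: On document, chart, and multi-image QA benchmarks, PoP outperforms strong baselines (chain-of-thought, ReAct-style, and program-of-thought) in performance, reliability, and computational efficiency.
Conclusion: Through explicit uncertainty modeling and dynamic compute scheduling, PoP achieves more reliable, verifiable, and efficient multimodal reasoning.
Abstract: We present Proof-of-Perception (PoP), a tool-using framework that casts multimodal reasoning as an executable graph with explicit reliability guarantees. Each perception or logic node outputs a conformal set, yielding calibrated, stepwise uncertainty; a lightweight controller uses these certificates to allocate compute under a budget, expanding with extra tool calls only when needed and stopping early otherwise. This grounds answers in verifiable evidence, reduces error compounding and hallucinations, and enables principled accuracy-compute trade-offs. Across document, chart, and multi-image QA benchmarks, PoP improves performance and reliability over strong chain-of-thought, ReAct-style, and program-of-thought baselines while using computation more efficiently.
[174] Diffusion-Based Low-Light Image Enhancement with Color and Luminance Priors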
Xuanshuo Fu,Lei Kang,Javier Vazquez-Corral
Main category: cs.CV
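TL;DR: This paper proposes a conditional diffusion model built on a Structured Control Embedding Module (SCEM) for low-light image enhancement. The image is decomposed into four physical prior components that guide the enhancement, achieving SOTA performance on multiple datasets.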
Details
Motivation: Low-light images often suffer from low contrast, noise, and color distortion, which hurt visual quality and downstream task performance.
Method: Propose a conditional diffusion framework with a Structured Control Embedding Module (SCEM). SCEM decomposes the low-light image into four parts (illumination, illumination-invariant features, shadow priors, and color-invariant cues) that act as control signals for a U-Net-based diffusion model trained with a simplified noise-prediction loss.
Result: Trained only on LOLv1 and evaluated without fine-tuning, the model reaches SOTA results in quantitative and perceptual metrics on LOLv2-real, LSRW, DICM, MEF, and LIME, showing strong generalization.
Conclusion: An SCEM-guided diffusion model effectively combines physical priors with generative modeling, giving structured, high-quality low-light enhancement that generalizes well.
Abstract: Low-light images often suffer from low contrast, noise, and color distortion, degrading visual quality and impairing downstream vision tasks. We propose a novel conditional diffusion framework for low-light image enhancement that incorporates a Structured Control Embedding Module (SCEM). SCEM decomposes a low-light image into four informative components including illumination, illumination-invariant features, shadow priors, and color-invariant cues. These components serve as control signals that condition a U-Net-based diffusion model trained with a simplified noise-prediction loss. Thus, the proposed SCEM-equipped diffusion method enforces structured enhancement guided by physical priors. In experiments, our model is trained only on the LOLv1 dataset and evaluated without fine-tuning on LOLv2-real, LSRW, DICM, MEF, and LIME. The method achieves state-of-the-art performance in quantitative and perceptual metrics, demonstrating strong generalization across benchmarks. https://casted.github.io/scem/.
[175] Percept-Aware Surgical Planning for Visual Cortical Prostheses with Vascular Avoidance
Galen Pogoncheff,Alvin Wang,Jacob Granley,Michael Beyeler
Main category: cs.CV
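TL;DR: This paper proposes a percept-aware surgical planning framework for cortical visual prostheses. Electrode placement is formulated as a constrained optimization problem in anatomical space, and electrode positions are optimized end-to-end through a differentiable forward model of prosthetic vision, improving task-level perceptual performance while ensuring vascular safety and gray matter feasibility.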
Details
Motivation: Existing surgical planning strategies for cortical visual prostheses focus on visual field coverage and anatomical heuristics and do not directly optimize predicted perceptual outcomes under safety constraints.
Method: Treat electrode coordinates as learnable parameters and optimize them end-to-end with a differentiable forward model of prosthetic vision; the objective minimizes task-level perceptual error and includes vascular avoidance and gray matter feasibility constraints; multi-electrode threads can be co-optimized under a fixed insertion budget.
Result: On folded cortical geometry based on FreeSurfer fsaverage, the method consistently outperforms coverage-based placement strategies on simulated reading and natural image tasks; the vascular safety constraints eliminate margin violations while preserving perceptual performance.
Conclusion: Differentiable percept models can guide anatomically grounded, safety-aware computer-assisted planning of cortical neural interfaces and lay a foundation for optimizing next-generation visual prostheses.
Abstract: Cortical visual prostheses aim to restore sight by electrically stimulating neurons in early visual cortex (V1). With the emergence of high-density and flexible neural interfaces, electrode placement within three-dimensional cortex has become a critical surgical planning problem. Existing strategies emphasize visual field coverage and anatomical heuristics but do not directly optimize predicted perceptual outcomes under safety constraints. We present a percept-aware framework for surgical planning of cortical visual prostheses that formulates electrode placement as a constrained optimization problem in anatomical space. Electrode coordinates are treated as learnable parameters and optimized end-to-end using a differentiable forward model of prosthetic vision. The objective minimizes task-level perceptual error while incorporating vascular avoidance and gray matter feasibility constraints. Evaluated on simulated reading and natural image tasks using realistic folded cortical geometry (FreeSurfer fsaverage), percept-aware optimization consistently improves reconstruction fidelity relative to coverage-based placement strategies. Importantly, vascular safety constraints eliminate margin violations while preserving perceptual performance. The framework further enables co-optimization of multi-electrode thread configurations under fixed insertion budgets. These results demonstrate how differentiable percept models can inform anatomically grounded, safety-aware computer-assisted planning for cortical neural interfaces and provide a foundation for optimizing next-generation visual prostheses.
[176] Unsupervised Semantic Segmentation in Synchrotron Computed Tomography with Self-Correcting Pseudo Labels
Austin Yunker,Peter Kenesei,Hemant Sharma,Jun-Sang Park,Antonino Miceli,Rajkumar Kettimuthu
Main category: cs.CV
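TL;DR: This paper proposes a new framework that automatically segments high-resolution synchrotron CT (SR-CT) data without manual annotation. Pseudo labels are generated by clustering and then self-corrected with the Unbiased Teacher method, substantially improving segmentation accuracy.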
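The first stage described in this entry, clustering voxel values into an initial semantic map, can be sketched with a small one-dimensional k-means. The entry does not say which clustering algorithm or how many clusters the authors use, so k-means, the quantile initialization, and the function name `kmeans_pseudo_labels` are assumptions for illustration only.

```python
import numpy as np

def kmeans_pseudo_labels(volume, k=3, iters=20):
    """Cluster voxel intensities with 1-D k-means and return a label volume.
    Voxels with similar attenuation values end up in the same pseudo class."""
    v = volume.reshape(-1).astype(float)
    # Spread the initial centers over the intensity range (assumed initialization).
    centers = np.quantile(v, np.linspace(0, 1, k + 2)[1:-1])
    for _ in range(iters):
        labels = np.argmin(np.abs(v[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = v[labels == j].mean()
    labels = np.argmin(np.abs(v[:, None] - centers[None, :]), axis=1)
    return labels.reshape(volume.shape), centers

# Toy "volume" with two obvious intensity groups.
toy = np.array([[0.0, 0.1], [0.9, 1.0]])
toy_labels, toy_centers = kmeans_pseudo_labels(toy, k=2)
```

Labels produced this way are noisy at material boundaries, which is why the entry follows the clustering step with a segmentation model and a self-correction stage.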
Details
Motivation: Synchrotron CT produces very large volumes of high-resolution data. Manual annotation is slow and impractical, yet deep learning depends on large amounts of labeled data, so efficient unsupervised or weakly supervised segmentation is needed.
Method: First generate pseudo labels by clustering voxel gray values to build an initial semantic map; then train a segmentation model on the pseudo labels and apply the Unbiased Teacher framework to self-correct them.
Result: On a magnesium crystal SR-CT sample, pixel-wise accuracy and mIoU improve by 13.31% and 15.94%, respectively, over the baseline pseudo labels; generalization and robustness are checked on two additional samples.
Conclusion: The framework overcomes the annotation bottleneck for SR-CT data and delivers accurate, transferable automatic segmentation without human intervention, offering a practical approach to large-scale 3D image analysis.
Abstract: X-ray computed tomography (CT) is a widely used imaging technique that provides detailed examinations into the internal structure of an object, with synchrotron CT (SR-CT) enabling improved data quality by using higher energy, monochromatic X-rays. While SR-CT allows for improved resolution, time-resolved experimentation, and reduced imaging artifacts, it also produces significantly larger datasets than conventional CT. Accurate and efficient evaluation of these datasets is a critical component of these workflows; yet it is often done manually, representing a major bottleneck in the analysis phase. While deep learning has emerged as a powerful tool capable of providing a wide range of purely data-driven solutions, it requires a substantial amount of labeled data for training, and manual annotation of SR-CT datasets is impractical. In this paper, we introduce a novel framework that enables automatic segmentation of large, high-resolution SR-CT datasets by eliminating the need to hand label images for deep learning training. First, we generate pseudo labels by clustering on the voxel values, identifying regions in the volume with similar attenuation coefficients and producing an initial semantic map. Afterwards, we train a segmentation model on the pseudo labels before utilizing the Unbiased Teacher approach to self-correct them, ensuring accurate final segmentations. We find our approach improves pixel-wise accuracy and mIoU by 13.31% and 15.94%, respectively, over the baseline pseudo labels when using a magnesium crystal SR-CT sample. Additionally, we extensively evaluate the different components of our workflow including segmentation model, loss function, pseudo labeling strategy, and input type. Finally, we evaluate our approach on two additional samples, highlighting our framework's ability to produce segmentations that are considerably better than the original pseudo labels.
[177] DiffSOS: Acoustic Conditional Diffusion Model for Speed-of-Sound Reconstruction in Ultrasound Computed Tomography
Yujia Wu,Shuoqi Chen,Shiru Wang,Yucheng Tang,Petr Bruza,Geoffrey P. Luke
Main category: cs.CV
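TL;DR: This paper proposes DiffSOS, a speed-of-sound (SoS) reconstruction method based on a conditional diffusion model. Combining a physics-guided ControlNet with a hybrid loss function, it achieves fast, interpretable quantitative ultrasound CT imaging with high fidelity.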
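One common way to turn a stochastic generative model into a pixel-wise uncertainty estimate, as this entry describes, is to draw several reconstructions for the same input and take the per-pixel spread. The sketch below shows only that aggregation step; `sample_fn` stands in for a trained sampler and `toy_sampler` is a made-up placeholder, so neither reflects the actual DiffSOS model.

```python
import numpy as np

def uncertainty_map(sample_fn, waveform, n_samples=8, seed=0):
    """Draw several reconstructions and return (mean map, per-pixel std map)."""
    rng = np.random.default_rng(seed)
    samples = np.stack([sample_fn(waveform, rng) for _ in range(n_samples)])
    return samples.mean(axis=0), samples.std(axis=0)

def toy_sampler(waveform, rng):
    """Placeholder sampler: a constant 1500 m/s map with noise in one pixel only."""
    out = np.full((2, 2), 1500.0)
    out[0, 0] += rng.normal(0.0, 10.0)
    return out

mean_map, std_map = uncertainty_map(toy_sampler, waveform=None)
```

Pixels where the samples agree get a standard deviation near zero, while pixels the model is unsure about stand out in `std_map`.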
Details
Motivation: Full Waveform Inversion (FWI) is computationally expensive, while current deep learning methods produce oversmoothed reconstructions that lack detail, limiting the clinical usefulness of ultrasound CT.
Method: Propose DiffSOS, built on a conditional diffusion model: an acoustic ControlNet strictly grounds the denoising process in the measured waveforms; a hybrid loss combines noise prediction, spatial reconstruction, and noise frequency content; DDIM sampling accelerates inference (only 10 steps); and the stochasticity of generation is used to estimate pixel-wise uncertainty.
Result: On the OpenPros USCT benchmark, Multi-scale Structural Similarity reaches 0.957, clearly better than existing state-of-the-art methods; reconstruction is near real-time and comes with a pixel-wise confidence map.
Conclusion: DiffSOS combines high accuracy, high efficiency, and interpretability, giving ultrasound CT an approach that offers both fidelity and a reliability measure and that can support safer and faster clinical diagnosis.
Abstract: Accurate Speed-of-Sound (SoS) reconstruction from acoustic waveforms is a cornerstone of ultrasound computed tomography (USCT), enabling quantitative velocity mapping that reveals subtle anatomical details and pathological variations often invisible in conventional imaging. However, practical utility is hindered by the limitations of existing algorithms; traditional Full Waveform Inversion (FWI) is computationally intensive, while current deep learning approaches tend to produce oversmoothed results lacking fine details. We propose DiffSOS, a conditional diffusion model that directly maps acoustic waveforms to SoS maps. Our framework employs a specialized acoustic ControlNet to strictly ground the denoising process in physical wave measurements. To ensure structural consistency, we optimize a hybrid loss function that integrates noise prediction, spatial reconstruction, and noise frequency content. To accelerate inference, we employ stochastic Denoising Diffusion Implicit Model (DDIM) sampling, achieving near real-time reconstruction with only 10 steps. Crucially, we exploit the stochastic generative nature of our framework to estimate pixel-wise uncertainty, providing a measure of reliability that is often absent in deterministic approaches. Evaluated on the OpenPros USCT benchmark, DiffSOS significantly outperforms state-of-the-art networks, achieving an average Multi-scale Structural Similarity of 0.957. Our approach provides high-fidelity SoS maps with a principled measure of confidence, facilitating safer and faster clinical interpretation.
[178] SSR: Pushing the Limit of Spatial Intelligence with Structured Scene Reasoning
Yi Zhang,Youya Xia,Yong Wang,Meng Song,Xin Wu,Wenjun Wan,Bingbing Liu,AiXue Ye,Hongbo Zhang,Feng Wen
Main category: cs.CV
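TL;DR: This paper proposes the SSR framework, which uses lightweight cross-modal alignment and structured scene graph generation to substantially improve MLLM performance on geometric reasoning and spatial intelligence tasks. At a 7B parameter scale it scores 73.9 on VSI-Bench, surpassing larger models.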
Details
Motivation: Existing multimodal large language models (MLLMs) lack spatial awareness and struggle with fine-grained geometric reasoning; they also suffer from high modality-alignment costs and insufficient precision in structural modeling.
Method: Propose the SSR framework: 1) anchor 3D geometric features to the LLM's pre-aligned 2D semantics through cross-modal addition and token interleaving; 2) design a scene graph generation pipeline that chains local triplets defined by relative coordinates; 3) introduce an incremental generation algorithm that builds language-model-friendly structural scaffolds; 4) extend to the global 3D grounding task, supporting absolute metric precision across heterogeneous data sources.
Result: SOTA results on spatial intelligence benchmarks such as VSI-Bench (73.9 for the 7B model), clearly ahead of models with more parameters, together with accurate global 3D grounding across data sources.
Conclusion: Efficient cross-modal feature alignment and structured scene representation, rather than simply scaling up the model, are the key to real spatial intelligence.
Abstract: While Multimodal Large Language Models (MLLMs) excel in semantic tasks, they frequently lack the "spatial sense" essential for sophisticated geometric reasoning. Current models typically suffer from exorbitant modality-alignment costs and deficiency in fine-grained structural modeling precision. We introduce SSR, a framework designed for Structured Scene Reasoning that seamlessly integrates 2D and 3D representations via a lightweight alignment mechanism. To minimize training overhead, our framework anchors 3D geometric features to the large language model's pre-aligned 2D visual semantics through cross-modal addition and token interleaving, effectively obviating the necessity for large-scale alignment pre-training. To underpin complex spatial reasoning, we propose a novel scene graph generation pipeline that represents global layouts as a chain of independent local triplets defined by relative coordinates. This is complemented by an incremental generation algorithm, enabling the model to construct "language-model-friendly" structural scaffolds for complex environments. Furthermore, we extend these capabilities to the global 3D grounding task, achieving absolute metric precision across heterogeneous data sources. At a 7B parameter scale, SSR achieves state-of-the-art performance on multiple spatial intelligence benchmarks, notably scoring 73.9 on VSI-Bench. Our approach significantly outperforms much larger models, demonstrating that efficient feature alignment and structured scene reasoning are the cornerstones of authentic spatial intelligence.
[179] PointAlign: Feature-Level Alignment Regularization for 3D Vision-Language Models
Yuanhao Su,Shaofeng Zhang,Xiaosong Jia,Qi Fan
Main category: cs.CV
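TL;DR: This paper proposes PointAlign, a feature-level alignment regularization method that explicitly supervises intermediate point cloud tokens in 3D vision-language modeling. This preserves geometric-semantic information and mitigates the geometric degradation caused by relying on language modeling alone.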
Details
Motivation: 3D vision-language models are limited by the scarcity of paired 3D-text data, and existing methods supervise only with a language-side next-token prediction loss, so 3D geometric information degrades badly and is wasted in intermediate representations.
Method: Propose PointAlign, which adds a consistency loss that constrains the intermediate point cloud tokens inside the LLM to align with the original visual input tokens; only a lightweight alignment projector and LoRA adapters are trained, giving feature-level supervision at low cost.
Result: Experiments on ModelNet40 and Objaverse show an average gain of 2.08 percentage points on classification, 7.50 percentage points on open-vocabulary Objaverse classification, and 4.88 percentage points on 3D object captioning (evaluated by Qwen2-72B-Instruct).
Conclusion: PointAlign mitigates geometric degradation in 3D VLMs and improves multi-task performance under limited data, while remaining efficient and practical.
Abstract: The development of 3D Vision-Language Models (VLMs), crucial for applications in robotics, autonomous driving, and augmented reality, is severely constrained by the scarcity of paired 3D-text data. Existing methods rely solely on next-token prediction loss, using only language tokens for supervision. This results in inefficient utilization of limited 3D data and leads to a significant degradation and loss of valuable geometric information in intermediate representations. To address these limitations, we propose PointAlign, a novel feature-level alignment regularization method. PointAlign explicitly supervises intermediate point cloud tokens to preserve fine-grained 3D geometric-semantic information throughout the language modeling process. Specifically, we constrain the intermediate point cloud tokens within the LLM to align with visual input tokens via a consistency loss. By training only a lightweight alignment projector and LoRA adapters, PointAlign achieves explicit feature-level supervision with minimal computational overhead, effectively preventing geometric degradation. Extensive experiments on ModelNet40 and Objaverse datasets demonstrate that our method achieves 2.08 pp improvement on average for classification tasks, with a substantial 7.50 pp gain on the challenging open-vocabulary Objaverse classification task and 4.88 pp improvement on 3D object captioning evaluated by Qwen2-72B-Instruct, validating the effectiveness of PointAlign. Code is publicly available at https://github.com/yharoldsu0627/PointAlign.
[180] DiffTrans: Differentiable Geometry-Materials Decomposition for Reconstructing Transparent Objects
Changpu Li,Shuang Wu,Songlin Tang,Guangming Lu,Jun Yu,Wenjie Pei
Main category: cs.CV
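TL;DR: This paper proposes DiffTrans, a differentiable rendering framework for transparent objects. Using FlexiCubes and a recursive differentiable ray tracer, it efficiently and jointly optimizes the geometry and materials of transparent objects in complex scenes.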
Details
Motivation: Existing methods are restricted by scene-specific assumptions (such as uniform topology or ideal transparency) and do not handle real-world transparent objects with diverse topology and complex texture.
Method: Build the initial geometry with FlexiCubes using dilation and smoothness regularization; model the scene environment with an environment light radiance field; design a recursive differentiable ray tracer, implemented in CUDA, that jointly optimizes geometry, index of refraction, and absorption rate end-to-end.
Result: On multiple benchmarks DiffTrans clearly outperforms existing methods in complex scenes (especially with diverse topology and complex texture) and is computationally efficient.
Conclusion: DiffTrans provides a general, efficient, end-to-end solution for transparent object reconstruction that removes the dependence on idealized assumptions and improves practical applicability.
Abstract: Reconstructing transparent objects from a set of multi-view images is a challenging task due to the complicated nature and indeterminate behavior of light propagation. Typical methods are primarily tailored to specific scenarios, such as objects following a uniform topology, exhibiting ideal transparency and surface specular reflections, or with only surface materials, which substantially constrains their practical applicability in real-world settings. In this work, we propose a differentiable rendering framework for transparent objects, dubbed DiffTrans, which allows for efficient decomposition and reconstruction of the geometry and materials of transparent objects, thereby reconstructing transparent objects accurately in intricate scenes with diverse topology and complex texture. Specifically, we first utilize FlexiCubes with dilation and smoothness regularization as the iso-surface representation to reconstruct an initial geometry efficiently from the multi-view object silhouette. Meanwhile, we employ the environment light radiance field to recover the environment of the scene. Then we devise a recursive differentiable ray tracer to further optimize the geometry, index of refraction and absorption rate simultaneously in a unified and end-to-end manner, leading to high-quality reconstruction of transparent objects in intricate scenes. A prominent advantage of the designed ray tracer is that it can be implemented in CUDA, enabling a significantly reduced computational cost. Extensive experiments on multiple benchmarks demonstrate the superior reconstruction performance of our DiffTrans compared with other methods, especially in intricate scenes involving transparent objects with diverse topology and complex texture. The code is available at https://github.com/lcp29/DiffTrans.
[181] Station2Radar: query conditioned gaussian splatting for precipitation field
Doyi Kim,Minseok Seo,Changick Kim
Main category: cs.CV
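TL;DR: This paper proposes the Query-Conditioned Gaussian Splatting (QCGS) framework, the first to fuse automatic weather station (AWS) observations with satellite imagery to generate precipitation fields. It enables efficient, resolution-flexible, real-time precipitation modeling while preserving sharp structures, improving RMSE by more than 50% over conventional products.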
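The idea of rendering only at queried locations, instead of rasterizing a full image plane, can be sketched as evaluating a sum of 2D Gaussians at a list of query coordinates. This is a simplified illustration and not the QCGS renderer: it assumes isotropic Gaussians with a scalar scale each, and `render_queries` and its parameter names are hypothetical.

```python
import numpy as np

def render_queries(queries, centers, scales, amplitudes):
    """Evaluate a sum of isotropic 2D Gaussians at the query points only.
    queries: (Q, 2), centers: (G, 2), scales: (G,), amplitudes: (G,).
    Returns one value per query, shape (Q,)."""
    d2 = ((queries[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)  # (Q, G)
    return (amplitudes * np.exp(-0.5 * d2 / scales**2)).sum(axis=1)

# One Gaussian of amplitude 5 at the origin, queried near and far from it.
centers = np.array([[0.0, 0.0]])
scales = np.array([1.0])
amplitudes = np.array([5.0])
values = render_queries(np.array([[0.0, 0.0], [10.0, 10.0]]), centers, scales, amplitudes)
```

The cost scales with the number of queries times the number of Gaussians, so skipping non-precipitating areas saves work, and any output resolution can be obtained by choosing a denser query grid.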
Details
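Motivation: Each data source used for precipitation forecasting has limits: radar is accurate but has limited coverage and high cost; weather stations are accurate but spatially sparse; satellites give dense, high-resolution coverage but cannot retrieve rainfall directly. Fusing these heterogeneous sources is needed to improve the quality and efficiency of precipitation field modeling.
Method: Propose Query-Conditioned Gaussian Splatting (QCGS): 1) a radar point proposal network identifies rainfall-support locations; 2) an implicit neural representation (INR) network predicts the Gaussian parameters for each point; 3) Gaussians are rendered only for the queried precipitation regions and non-precipitating areas are skipped, balancing efficiency and structural fidelity.
Result: Evaluated against benchmark precipitation products, QCGS improves RMSE by more than 50% over conventional gridded precipitation products and performs stably across multiple spatiotemporal scales.
Conclusion: QCGS is the first framework to fuse AWS observations with satellite imagery for precipitation field generation. It provides efficient, resolution-adaptive, real-time reconstruction with better accuracy and lower computation, offering a new approach to fine-grained precipitation forecasting driven by multi-source remote sensing data.
Abstract: Precipitation forecasting relies on heterogeneous data. Weather radar is accurate, but coverage is geographically limited and costly to maintain. Weather stations provide accurate but sparse point measurements, while satellites offer dense, high-resolution coverage without direct rainfall retrieval. To overcome these limitations, we propose Query-Conditioned Gaussian Splatting (QCGS), the first framework to fuse automatic weather station (AWS) observations with satellite imagery for generating precipitation fields. Unlike conventional 2D Gaussian splatting, which renders the entire image plane, QCGS selectively renders only queried precipitation regions, avoiding unnecessary computation in non-precipitating areas while preserving sharp precipitation structures. The framework combines a radar point proposal network that identifies rainfall-support locations with an implicit neural representation (INR) network that predicts Gaussian parameters for each point. QCGS enables efficient, resolution-flexible precipitation field generation in real time. Through extensive evaluation with benchmark precipitation products, QCGS demonstrates over 50% improvement in RMSE compared to conventional gridded precipitation products, and consistently maintains high performance across multiple spatiotemporal scales.
[182] An Interpretable Local Editing Model for Counterfactual Medical Image Generation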
Hyungi Min,Taeseung You,Hangyeul Lee,Yeongjae Cho,Sungzoon Cho
Main category: cs.CV
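TL;DR: This paper proposes the InstructX2X model, which achieves interpretable counterfactual medical image generation through region-specific editing that avoids changes to non-target attributes. It also introduces the MIMIC-EDIT-INSTRUCTION dataset and reaches SOTA performance on counterfactual chest X-ray generation.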
Details
Motivation: Existing counterfactual medical image generation methods have two problems: they cannot prevent unintended changes to non-target attributes (such as demographic features), and their editing process lacks interpretability, which limits clinical use.
Method: Propose the InstructX2X model, which uses Region-Specific Editing to confine modifications to the target anatomical region and produces a Guidance Map for visual interpretability; also build the MIMIC-EDIT-INSTRUCTION dataset, derived from expert-verified medical visual question answering pairs.
Result: SOTA on the major evaluation metrics; the model generates high-quality, high-fidelity counterfactual chest X-ray images together with interpretable guidance maps of the edits.
Conclusion: InstructX2X addresses the two core challenges of counterfactual medical image generation, unintended modification and interpretability, and offers a new approach to trustworthy AI-assisted diagnosis.
Abstract: Counterfactual medical image generation has emerged as a critical tool for enhancing AI-driven systems in the medical domain by answering "what-if" questions. However, existing approaches face two fundamental limitations: First, they fail to prevent unintended modifications, resulting in collateral changes in demographic attributes when only disease features should be affected. Second, they lack interpretability in their editing process, which significantly limits their utility in real-world medical applications. To address these limitations, we present InstructX2X, a novel interpretable local editing model for counterfactual medical image generation featuring Region-Specific Editing. This approach restricts modifications to specific regions, effectively preventing unintended changes while simultaneously providing a Guidance Map that offers inherently interpretable visual explanations of the editing process. Additionally, we introduce MIMIC-EDIT-INSTRUCTION, a dataset for counterfactual medical image generation derived from expert-verified medical VQA pairs. Through extensive experiments, InstructX2X achieves state-of-the-art performance across all major evaluation metrics. Our model successfully generates high-quality counterfactual chest X-ray images along with interpretable explanations.
[183] Taxonomy-Aware Representation Alignment for Hierarchical Visual Recognition with Large Multimodal Models
Hulingxiao He,Zhi Tan,Yuxin Peng
Main category: cs.CV
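TL;DR: This paper proposes Taxonomy-Aware Representation Alignment (TARA). It aligns the intermediate visual representations of large multimodal models (LMMs) with those of biology foundation models (BFMs), and aligns the first answer token with the ground-truth label. This improves hierarchical consistency and fine-grained accuracy in hierarchical visual recognition (HVR), for both known and novel biological categories.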
Details
Motivation: 现有大多模态模型(LMMs)在细粒度视觉识别(FGVR)中表现优异,但在需预测从粗到细完整标签路径的层级视觉识别(HVR)任务上仍受限,尤其难以泛化至训练中未见的新类别。 Method: 提出TARA策略:1)利用生物学基础模型(BFMs)提供的富含层级关系的表征;2)对齐LMMs中间视觉特征与BFMs表征;3)对齐首答案词元表征与真实标签路径,以灵活适配不同粒度的用户查询意图。 Result: TARA显著提升了LMMs在层级一致性(hierarchical consistency)与叶节点准确率(leaf node accuracy)上的性能,尤其在复杂生物分类体系中对已知和新颖类别均实现可靠识别。 Conclusion: TARA是一种简单而有效的方式,能将外部结构化分类知识注入LMMs,增强其在层级视觉理解任务中的泛化能力与可解释性。 Abstract: A high-performing, general-purpose visual understanding model should map visual inputs to a taxonomic tree of labels, identify novel categories beyond the training set for which few or no publicly available images exist. Large Multimodal Models (LMMs) have achieved remarkable progress in fine-grained visual recognition (FGVR) for known categories. However, they remain limited in hierarchical visual recognition (HVR) that aims at predicting consistent label paths from coarse to fine categories, especially for novel categories. To tackle these challenges, we propose Taxonomy-Aware Representation Alignment (TARA), a simple yet effective strategy to inject taxonomic knowledge into LMMs. TARA leverages representations from biology foundation models (BFMs) that encode rich biological relationships through hierarchical contrastive learning. By aligning the intermediate representations of visual features with those of BFMs, LMMs are encouraged to extract discriminative visual cues well structured in the taxonomy tree. Additionally, we align the representations of the first answer token with the ground-truth label, flexibly bridging the gap between contextualized visual features and categories of varying granularity according to user intent. Experiments demonstrate that TARA consistently enhances LMMs' hierarchical consistency and leaf node accuracy, enabling reliable recognition of both known and novel categories within complex biological taxonomies. 
Code is available at https://github.com/PKU-ICST-MIPL/TARA_CVPR2026.
[184] TAP-SLF: Parameter-Efficient Adaptation of Vision Foundation Models for Multi-Task Ultrasound Image Analysis
Hui Wan,Libin Lan
Main category: cs.CV
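TL;DR: This paper proposes the TAP-SLF framework, which combines task-aware soft prompts with selective top-layer fine-tuning via LoRA to adapt vision foundation models efficiently to multi-task ultrasound image analysis, markedly improving generalization while reducing computational cost.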
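A minimal sketch of the two ingredients named above (soft prompts prepended to the input sequence, and LoRA applied only to the top layers), assuming NumPy and a toy stack of linear layers in place of a real vision encoder. The helper names (`prepend_task_prompt`, `make_lora_layer`, `adapt_top_layers`), the rank, and the zero initialization of `B` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def prepend_task_prompt(tokens, prompt):
    """Prepend task-specific soft-prompt vectors to the input token sequence."""
    return np.concatenate([prompt, tokens], axis=0)

def make_lora_layer(weight, rank, rng):
    """Pair a frozen weight W with a trainable low-rank update B @ A."""
    d_out, d_in = weight.shape
    return {
        "W": weight,                                     # frozen pre-trained weight
        "A": rng.normal(0.0, 0.01, size=(rank, d_in)),   # trainable
        "B": np.zeros((d_out, rank)),                    # trainable, zero so the update starts at 0
    }

def adapt_top_layers(weights, num_top, rank, seed=0):
    """Leave the lower layers as plain frozen arrays; add LoRA to the top num_top layers."""
    rng = np.random.default_rng(seed)
    split = len(weights) - num_top
    frozen = list(weights[:split])
    adapted = [make_lora_layer(w, rank, rng) for w in weights[split:]]
    return frozen + adapted

def forward(layers, x):
    """Apply every layer; LoRA layers use W + B @ A as their effective weight."""
    for layer in layers:
        if isinstance(layer, dict):
            x = x @ (layer["W"] + layer["B"] @ layer["A"]).T
        else:
            x = x @ layer.T
    return x

def trainable_fraction(layers):
    """Share of all parameters that would receive gradient updates."""
    trainable = sum(l["A"].size + l["B"].size for l in layers if isinstance(l, dict))
    frozen = sum(l["W"].size if isinstance(l, dict) else l.size for l in layers)
    return trainable / (trainable + frozen)
```

Because `B` starts at zero, the adapted model initially reproduces the frozen backbone exactly; only `A`, `B`, and the prompt vectors would be updated during fine-tuning.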
Details
Motivation: Vision foundation models applied to multi-task learning on medical images suffer from overfitting and high computational cost, and existing adaptation methods model neither task awareness nor layer sensitivity. Method: Task-Aware Prompting and Selective Layer Fine-Tuning (TAP-SLF) injects task-aware soft prompts into the input sequence and applies LoRA for parameter-efficient fine-tuning only to the top few encoder layers, keeping the rest of the backbone frozen. Result: TAP-SLF placed fifth on the FMC_UIA 2026 Challenge test set; an 8:2 split of the official training set confirms its effectiveness and efficiency across segmentation, classification, detection, and regression. Conclusion: Combining task-aware prompts with selective top-layer fine-tuning improves the multi-task adaptability of VFMs on limited medical data, balancing performance, generalization, and parameter efficiency. Abstract: Executing multiple tasks simultaneously in medical image analysis, including segmentation, classification, detection, and regression, often introduces significant challenges regarding model generalizability and the optimization of shared feature representations. While Vision Foundation Models (VFMs) provide powerful general representations, full fine-tuning on limited medical data is prone to overfitting and incurs high computational costs. Moreover, existing parameter-efficient fine-tuning approaches typically adopt task-agnostic adaptation protocols, overlooking both task-specific mechanisms and the varying sensitivity of model layers during fine-tuning. In this work, we propose Task-Aware Prompting and Selective Layer Fine-Tuning (TAP-SLF), a unified framework for multi-task ultrasound image analysis. TAP-SLF incorporates task-aware soft prompts to encode task-specific priors into the input token sequence and applies LoRA to selected top layers of the encoder. This strategy updates only a small fraction of the VFM parameters while keeping the pre-trained backbone frozen. By combining task-aware prompts with selective high-layer fine-tuning, TAP-SLF enables efficient VFM adaptation to diverse medical tasks within a shared backbone.
Results on the FMC_UIA 2026 Challenge test set, where TAP-SLF placed fifth, combined with evaluations on the officially released training dataset using an 8:2 train-test split, demonstrate that task-aware prompting and selective layer tuning are effective strategies for efficient VFM adaptation.
[185] Self-Correction Inside the Model: Leveraging Layer Attention to Mitigate Hallucinations in Large Vision Language Models
April Fu
Main category: cs.CV
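TL;DR: This paper proposes ICLA, an internal self-correction mechanism that uses layer attention to operate directly on hidden states during generation, enabling self-correction without external signals. It markedly improves the visual grounding of large vision-language models (LVLMs) and mitigates hallucination.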
Details
Motivation: Despite progress in LVLMs, hallucination remains a serious problem; as models grow stronger, previously reported hallucination patterns (such as linguistic bias and overthinking) become inconsistent, so existing mitigation methods lose effectiveness. Method: ICLA is an internal self-correction mechanism based on layer attention: during generation, each layer selectively retrieves information from all preceding layers through diagonal cross-layer attention, achieving self-refinement without external signals. It adds only a small number of parameters (0.2M and 0.1M) and needs only lightweight training. Result: On LLaVA1.5-7B and Qwen2.5-VL-7B, ICLA consistently improves visual grounding across multiple hallucination benchmarks and remains effective for more advanced LVLMs. Conclusion: ICLA is an efficient, lightweight internal self-correction mechanism that needs no external supervision and offers a new direction for addressing hallucination in advanced LVLMs. Abstract: Although Large Vision-Language Models (LVLMs) have made substantial progress, hallucination, where generated text is not grounded in the visual input, remains a challenge. As LVLMs become stronger, previously reported hallucination patterns, such as linguistic bias and the overthinking phenomenon, become far less consistent, making the corresponding mitigation techniques substantially less effective. In this paper, we introduce an Internal self-Correction mechanism utilizing Layer Attention (ICLA) that operates directly on hidden states during generation. Each layer selectively retrieves information from all preceding layers through a diagonal cross-layer attention mechanism, enabling self-refinement without any external correction signals. By introducing and training only 0.2M and 0.1M additional parameters on LLaVA1.5-7B and Qwen2.5-VL-7B, respectively, ICLA consistently improves visual grounding across multiple hallucination benchmarks, demonstrating its effectiveness for more advanced LVLMs.
[186] Mamba-CAD: State Space Model For 3D Computer-Aided Design Generative Modeling
Xueyang Li,Yunzhong Lou,Yu Song,Xiangdong Zhou
Main category: cs.CV
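TL;DR: This paper proposes Mamba-CAD, a self-supervised generative model built on the Mamba architecture for modeling long-sequence parametric CAD models in industry. Through encoder-decoder pre-training combined with GAN-based generation, it markedly improves the generation of valid long parametric CAD sequences.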
Details
Motivation: Industrial CAD models, especially component-level ones, are fine-grained and structurally complex, so they need longer parametric sequences to describe, which existing sequence models struggle to handle. Method: A Mamba-based encoder-decoder framework learns latent representations with CAD reconstruction as the self-supervised pre-training task; the learned representation then guides a GAN to generate fake representations, which the Mamba-CAD decoder recovers into parametric CAD sequences. A new dataset of 77,078 CAD models with long sequences is also built. Result: Effectiveness is validated on multiple metrics, most notably the maximum length of valid generated parametric CAD sequences. Conclusion: Mamba-CAD can model and generate complex, industrial-scale long-sequence CAD models efficiently, offering a new paradigm for CAD generative modeling. Abstract: Computer-Aided Design (CAD) generative modeling has strong, long-term applications in industry. Recently, the parametric CAD sequence, which captures the design logic of an object, has been widely mined by sequence models. However, industrial CAD models, especially component objects, are fine-grained and complex, requiring a longer parametric CAD sequence to define. To address this problem, we introduce Mamba-CAD, a self-supervised generative model for complex industrial CAD models that can model longer parametric CAD sequences. Specifically, we first design an encoder-decoder framework based on a Mamba architecture and pair it with a CAD reconstruction task for pre-training to model the latent representation of CAD models; we then use the learned representation to guide a generative adversarial network to produce fake representations of CAD models, which are finally recovered into parametric CAD sequences via the decoder of Mamba-CAD. To train Mamba-CAD, we further create a new dataset consisting of 77,078 CAD models with longer parametric CAD sequences. Comprehensive experiments demonstrate the effectiveness of our model under various evaluation metrics, especially in the generation length of valid parametric CAD sequences. The code and dataset are available at https://github.com/Sunny-Hack/Code-for-Mamba-CAD-AAAI-2025-.
[187] SesaHand: Enhancing 3D Hand Reconstruction via Controllable Generation with Semantic and Structural Alignment
Zhuoran Zhao,Xianghao Kong,Linlin Yang,Zheng Wei,Pan Hui,Anyi Rao
Main category: cs.CV
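TL;DR: This paper proposes SesaHand, which improves controllable hand image generation through semantic alignment (using a vision-language model to generate image captions and extracting behavior semantics from them) and structural alignment (hierarchical structural fusion and hand structure attention enhancement), thereby improving 3D hand reconstruction.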
Details
Motivation: Existing synthetic data from game engines lacks diversity in textures and environments and omits arms and interacting objects; generative models are promising but suffer from misalignment. Method: SesaHand uses (1) semantic alignment, which applies Chain-of-Thought inference to extract human behavior semantics from VLM-generated image captions and suppress irrelevant environmental details; and (2) structural alignment, which introduces hierarchical structural fusion to integrate multi-granularity structural information, together with a hand structure attention enhancement module. Result: SesaHand outperforms prior methods in generation performance and improves downstream 3D hand reconstruction accuracy. Conclusion: Aligning both semantics and structure markedly improves the quality and usefulness of generated hand images, providing better synthetic training data for 3D hand reconstruction. Abstract: Recent studies on 3D hand reconstruction have demonstrated the effectiveness of synthetic training data to improve estimation performance. However, most methods rely on game engines to synthesize hand images, which often lack diversity in textures and environments, and fail to include crucial components like arms or interacting objects. Generative models are promising alternatives to generate diverse hand images, but still suffer from misalignment issues. In this paper, we present SesaHand, which enhances controllable hand image generation from both semantic and structural alignment perspectives for 3D hand reconstruction. Specifically, for semantic alignment, we propose a pipeline with Chain-of-Thought inference to extract human behavior semantics from image captions generated by the Vision-Language Model. These semantics suppress human-irrelevant environmental details and ensure sufficient human-centric context for hand image generation. For structural alignment, we introduce hierarchical structural fusion to integrate structural information of different granularity for feature refinement, better aligning the hand and the overall human body in generated images. We further propose a hand structure attention enhancement method to efficiently enhance the model's attention on hand regions. Experiments demonstrate that our method not only outperforms prior work in generation performance but also improves 3D hand reconstruction with the generated hand images.
[188] Improved Adversarial Diffusion Compression for Real-World Video Super-Resolution
Bin Chen,Weiqi Li,Shijie Zhao,Xuanyu Zhang,Junlin Li,Li Zhang,Jian Zhang
Main category: cs.CV
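TL;DR: This paper proposes an improved adversarial diffusion compression (ADC) method for real-world video super-resolution (Real-VSR). It distills DOVE, a large diffusion Transformer (DiT) teacher with 3D spatio-temporal attention, into a lightweight 2D Stable Diffusion backbone augmented with 1D temporal convolutions, and uses a dual-head adversarial distillation scheme. The result is much faster inference and far fewer parameters while preserving both spatial detail and temporal consistency.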
Details
Motivation: Existing diffusion models work well for Real-VSR but are slow at inference; one-step networks such as SeedVR2, DOVE, and DLoRAL still have large parameter counts and high latency; and applying ADC directly fails to balance spatial detail and temporal consistency because it lacks temporal modeling and is limited by standard adversarial learning. Method: (1) Distill the DiT teacher DOVE, which has 3D spatio-temporal attention, into the pruned 2D Stable Diffusion backbone AdcSR with added lightweight 1D temporal convolutions; (2) design an adversarial distillation strategy with dual-head discriminators in the pixel and feature domains, optimizing detail fidelity and temporal consistency separately. Result: The compressed AdcVSR model has 95% fewer parameters and runs 8 times faster than DOVE while maintaining competitive video quality. Conclusion: The improved ADC method addresses the key challenge of balancing efficiency with spatio-temporal quality in Real-VSR and offers a new paradigm for lightweight, high-quality video super-resolution. Abstract: While many diffusion models have achieved impressive results in real-world video super-resolution (Real-VSR) by generating rich and realistic details, their reliance on multi-step sampling leads to slow inference. One-step networks like SeedVR2, DOVE, and DLoRAL alleviate this by condensing generation into a single step, yet they remain heavy, with billions of parameters and multi-second latency. Recent adversarial diffusion compression (ADC) offers a promising path via pruning and distilling these models into a compact AdcSR network, but directly applying it to Real-VSR fails to balance spatial details and temporal consistency due to its lack of temporal awareness and the limitations of standard adversarial learning. To address these challenges, we propose an improved ADC method for Real-VSR. Our approach distills a large diffusion Transformer (DiT) teacher, DOVE, equipped with 3D spatio-temporal attention, into a pruned 2D Stable Diffusion (SD)-based AdcSR backbone augmented with lightweight 1D temporal convolutions, achieving significantly higher efficiency. In addition, we introduce a dual-head adversarial distillation scheme, in which discriminators in both pixel and feature domains explicitly disentangle the discrimination of details and consistency into two heads, enabling both objectives to be effectively optimized without sacrificing one for the other.
Experiments demonstrate that the resulting compressed AdcVSR model reduces the parameter count by 95% and achieves an 8$\times$ acceleration over its DiT teacher DOVE, while maintaining competitive video quality.
[189] Explainable Continuous-Time Mask Refinement with Local Self-Similarity Priors for Medical Image Segmentation
Rajdeep Chatterjee,Sudip Chakrabarty,Trishaani Acharjee
Main category: cs.CV
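TL;DR: This paper proposes LSS-LTCNet, an explainable semantic segmentation framework that combines Local Self-Similarity (LSS) with Liquid Time-Constant (LTC) boundary dynamics for accurate foot ulcer segmentation. On the MICCAI FUSeg dataset it reaches a Dice score of 86.96% and an HD95 of 8.91 pixels with only 25.70M parameters, balancing accuracy, efficiency, and explainability.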
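For readers unfamiliar with the Dice metric quoted above, the following is a minimal sketch of how it is commonly computed for binary masks, assuming NumPy. The function name `dice_score` and the smoothing term `eps` are illustrative choices; the exact evaluation code used by the authors is not given in the source.

```python
import numpy as np

def dice_score(pred, target, eps=1e-8):
    """Dice = 2 * |P intersect T| / (|P| + |T|) for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    # eps keeps the ratio defined (and equal to 1) when both masks are empty
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)
```

A score of 1 means the predicted and reference masks overlap perfectly; the reported 86.96% corresponds to a value of 0.8696 on this scale.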
Details
Motivation: Foot ulcer boundaries are hard to segment because of tissue heterogeneity and low contrast with surrounding skin, which conventional intensity-based networks handle poorly. Method: LSS-LTCNet has two parts: (1) a Local Self-Similarity (LSS) mechanism extracts illumination-invariant texture features to explicitly separate necrotic tissue from background artifacts; (2) a Liquid Time-Constant (LTC) refinement module models boundary evolution as an ODE-governed dynamic system and iteratively refines the segmentation mask over continuous time steps. Result: On the MICCAI FUSeg dataset it achieves state-of-the-art boundary alignment, with a Dice score of 86.96% and an HD95 of 8.91 pixels; with only 25.70M parameters it is markedly more efficient than heavier U-Net and Transformer baselines. Conclusion: By fusing structural priors with continuous-time neural dynamics, LSS-LTCNet delivers accurate, efficient foot ulcer segmentation with built-in visual auditability, suited to computer-aided diagnosis in mobile healthcare settings. Abstract: Accurate semantic segmentation of foot ulcers is essential for automated wound monitoring, yet boundary delineation remains challenging due to tissue heterogeneity and poor contrast with surrounding skin. To overcome the limitations of standard intensity-based networks, we present LSS-LTCNet, an ante-hoc explainable framework that combines deterministic structural priors with continuous-time neural dynamics. Our architecture departs from traditional black-box models by employing a Local Self-Similarity (LSS) mechanism that extracts dense, illumination-invariant texture descriptors to explicitly disentangle necrotic tissue from background artifacts. To enforce topological precision, we introduce a Liquid Time-Constant (LTC) refinement module that treats boundary evolution as an ODE-governed dynamic system, iteratively refining masks over continuous time steps. Comprehensive evaluation on the MICCAI FUSeg dataset demonstrates that LSS-LTCNet achieves state-of-the-art boundary alignment, with a peak Dice score of 86.96% and a 95th percentile Hausdorff Distance (HD95) of 8.91 pixels. Requiring merely 25.70M parameters, the model significantly outperforms heavier U-Net and transformer baselines in efficiency. By providing inherent visual audit trails alongside high-fidelity predictions, LSS-LTCNet offers a robust and transparent solution for computer-aided diagnosis in mobile healthcare (mHealth) settings.
[190] ReMoT: Reinforcement Learning with Motion Contrast Triplets
Cong Wan,Zeyu Guo,Jiangyang Li,SongLin Dong,Yifan Bai,Lin Peng,Zhiheng Ma,Yihong Gong
Main category: cs.CV
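TL;DR: ReMoT is a unified training paradigm that builds the large-scale motion-contrast dataset ReMoT-16K and applies Group Relative Policy Optimization, markedly improving vision-language models on spatio-temporal consistency reasoning, with a 25.1% performance gain on spatio-temporal reasoning tasks.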
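The core step of Group Relative Policy Optimization, as it is generally described, is to score each sampled response against the other responses drawn for the same prompt instead of against a learned value function. A minimal sketch of that normalization, using only the Python standard library; the binary rewards in the usage note are a hypothetical stand-in for "answered the motion-contrast question correctly", not the reward defined by the authors.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and standard deviation of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every response gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]
```

For a group of four sampled answers with rewards `[1.0, 0.0, 0.0, 1.0]`, the correct answers receive a positive advantage and the incorrect ones a negative advantage of equal size; a group whose answers are all correct or all wrong yields zero advantage and therefore no policy update.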
Details
Motivation: Address the fundamental weakness of vision-language models (VLMs) in spatio-temporal consistency, which is especially damaging in critical applications such as navigation, robotics, and autonomous driving. Method: The ReMoT training paradigm has two parts: (1) a rule-based automatic framework that generates the large-scale motion-contrast dataset ReMoT-16K; (2) Group Relative Policy Optimization for learning contrastive reasoning efficiently. The authors also build the first benchmark of fine-grained motion contrast triplets. Result: State-of-the-art performance on the new motion-contrast benchmark and several standard VLM benchmarks, with a 25.1% gain on spatio-temporal reasoning tasks. Conclusion: ReMoT improves VLMs' modeling of spatio-temporal consistency and confirms the key role of motion-contrast learning and relative policy optimization in VLM spatio-temporal reasoning. Abstract: We present ReMoT, a unified training paradigm to systematically address the fundamental shortcomings of VLMs in spatio-temporal consistency, a critical failure point in navigation, robotics, and autonomous driving. ReMoT integrates two core components: (1) A rule-based automatic framework that generates ReMoT-16K, a large-scale (16.5K triplets) motion-contrast dataset derived from video meta-annotations, surpassing costly manual or model-based generation. (2) Group Relative Policy Optimization, which we empirically show yields the best performance and data efficiency for learning this contrastive reasoning, far exceeding standard Supervised Fine-Tuning. We also construct the first benchmark for fine-grained motion contrast triplets to measure a VLM's discrimination of subtle motion attributes (e.g., opposing directions). The resulting model achieves state-of-the-art performance on our new benchmark and multiple standard VLM benchmarks, culminating in a 25.1% performance leap on spatio-temporal reasoning tasks.
[191] OPGAgent: An Agent for Auditable Dental Panoramic X-ray Interpretation
Zhaolin Yu,Litao Yang,Ben Babicka,Ming Hu,Jing Hao,Anthony Huang,James Huang,Yueming Jin,Jiasong Wu,Zongyuan Ge
Main category: cs.CV
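TL;DR: This paper proposes OPGAgent, a multi-tool agentic system for auditable interpretation of dental panoramic radiographs (OPGs). It improves the accuracy and interpretability of multi-task analysis through hierarchical evidence gathering, a specialized toolbox, and a consensus subagent, and introduces the OPG-Bench evaluation protocol for comprehensive assessment of structured report quality.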
Details
Motivation: Existing vision-language models (VLMs) are flexible for multi-task OPG analysis but underperform on individual tasks, whereas task-specific models are accurate but not general; agentic multi-tool approaches have not yet been explored in dental imaging. Method: OPGAgent comprises (1) a hierarchical evidence gathering module that invokes tools dynamically at the global, quadrant, and tooth levels; (2) a specialized toolbox covering spatial, detection, utility, and expert models; and (3) a consensus subagent based on anatomical constraints. The authors also build the OPG-Bench evaluation protocol, which assesses structured reports through (Location, Field, Value) triples. Result: On the self-built OPG-Bench and the public MMOral-OPG benchmark, OPGAgent outperforms current dental VLMs and medical agent frameworks in both structured-report and VQA evaluation. Conclusion: A multi-tool agent architecture can combine versatility with high accuracy in OPG analysis; paired with a structured clinical reporting protocol, it improves auditability and clinical trustworthiness, offering a new paradigm for dental AI. Abstract: Orthopantomograms (OPGs) are the standard panoramic radiograph in dentistry, used for full-arch screening across multiple diagnostic tasks. While Vision Language Models (VLMs) now allow multi-task OPG analysis through natural language, they underperform task-specific models on most individual tasks. Agentic systems that orchestrate specialized tools offer a path to both versatility and accuracy, yet this approach remains unexplored in dental imaging. To address this gap, we propose OPGAgent, a multi-tool agentic system for auditable OPG interpretation. OPGAgent coordinates specialized perception modules with a consensus mechanism through three components: (1) a Hierarchical Evidence Gathering module that decomposes OPG analysis into global, quadrant, and tooth-level phases with dynamic tool invocation, (2) a Specialized Toolbox encapsulating spatial, detection, utility, and expert zoos, and (3) a Consensus Subagent that resolves conflicts through anatomical constraints. We further propose OPG-Bench, a structured-report protocol based on (Location, Field, Value) triples derived from real clinical reports, which enables a comprehensive review of findings and hallucinations, extending beyond the limitations of VQA indicators. On our OPG-Bench and the public MMOral-OPG benchmark, OPGAgent outperforms current dental VLMs and medical agent frameworks across both structured-report and VQA evaluation. Code will be released upon acceptance.
[192] DreamWorld: Unified World Modeling in Video Generation
Boming Tan,Xiangdong Zhang,Ning Liao,Yuqing Zhang,Shaofeng Zhang,Xue Yang,Qi Fan,Yanyong Zhang
Main category: cs.CV
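TL;DR: This paper proposes the DreamWorld framework, which integrates multiple kinds of world knowledge (such as physical commonsense, 3D consistency, and temporal consistency) through a joint world modeling paradigm and introduces Consistent Constraint Annealing (CCA) and Multi-Source Inner-Guidance, markedly improving the world consistency of generated videos.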
Details
Motivation: Existing video generation models are only superficially plausible and lack a unified, coherent understanding of the world; injecting a single kind of knowledge or using rigid alignment strategies is not enough to build a world model spanning multiple heterogeneous dimensions (such as physical commonsense, 3D structure, and temporal consistency). Method: DreamWorld is a unified framework with a joint world modeling paradigm that predicts video pixels and foundation-model features together; Consistent Constraint Annealing (CCA) mitigates the visual instability caused by conflicting heterogeneous objectives during training; and Multi-Source Inner-Guidance reinforces world priors at inference. Result: DreamWorld improves on Wan2.1 by 2.26 points on VBench, with markedly better world consistency; the code will be open-sourced. Conclusion: Jointly modeling multi-dimensional world knowledge, supported by progressive constraints and inference-time guidance, is an effective route to deeper world understanding in video generation. Abstract: Despite impressive progress in video generation, existing models remain limited to surface-level plausibility, lacking a coherent and unified understanding of the world. Prior approaches typically incorporate only a single form of world-related knowledge or rely on rigid alignment strategies to introduce additional knowledge. However, aligning a single form of world knowledge is insufficient to constitute a world model, which requires jointly modeling multiple heterogeneous dimensions (e.g., physical commonsense, 3D and temporal consistency). To address this limitation, we introduce DreamWorld, a unified framework that integrates complementary world knowledge into video generators via a Joint World Modeling Paradigm, jointly predicting video pixels and features from foundation models to capture temporal dynamics, spatial geometry, and semantic consistency. However, naively optimizing these heterogeneous objectives can lead to visual instability and temporal flickering. To mitigate this issue, we propose Consistent Constraint Annealing (CCA) to progressively regulate world-level constraints during training, and Multi-Source Inner-Guidance to enforce learned world priors at inference. Extensive evaluations show that DreamWorld improves world consistency, outperforming Wan2.1 by 2.26 points on VBench. Code will be made publicly available at https://github.com/ABU121111/DreamWorld.
[193] High Dynamic Range Imaging Based on an Asymmetric Event-SVE Camera System
Pengju Sun,Banglei Guan,Jing Tao,Zhenbao Yu,Xuanyu Bai,Yang Shang,Qifeng Yu
Main category: cs.CV
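TL;DR: This paper proposes a hardware-algorithm co-designed HDR imaging system that combines a spatially varying exposure (SVE) micro-attenuation camera with an event camera. It achieves accurate cross-modal alignment and fusion under non-coaxial, heterogeneous optics, markedly improving highlight recovery, edge fidelity, and robustness.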
Details
Motivation: Conventional cameras overexpose under extreme illumination and struggle to produce high-quality HDR images; event cameras offer high dynamic range and microsecond temporal resolution but lack complete spatial information; SVE sensors capture radiometric diversity in a single shot but need to be complemented by event data. Method: An asymmetric dual-modality hardware system (SVE micro-attenuation camera plus event camera), a two-stage cross-modal alignment framework (feature-guided coarse homography estimation followed by refinement with multi-scale spatial pooling and frequency-domain filtering), and an HDR reconstruction network with convolutional fusion, mutual-information regularization, and a learnable fusion loss. Result: On synthetic benchmarks and real scenes, the system outperforms frame-only and event-only HDR methods in highlight recovery, edge fidelity, and robustness. Conclusion: Jointly optimizing optical design, cross-modal alignment, and computational fusion provides an effective foundation for reliable HDR perception in highly dynamic, radiometrically complex environments. Abstract: High dynamic range (HDR) imaging under extreme illumination remains challenging for conventional cameras due to overexposure. Event cameras provide microsecond temporal resolution and high dynamic range, while spatially varying exposure (SVE) sensors offer single-shot radiometric diversity. We present a hardware-algorithm co-designed HDR imaging system that tightly integrates an SVE micro-attenuation camera with an event sensor in an asymmetric dual-modality configuration. To handle non-coaxial geometry and heterogeneous optics, we develop a two-stage cross-modal alignment framework that combines feature-guided coarse homography estimation with a multi-scale refinement module based on spatial pooling and frequency-domain filtering. On top of aligned representations, we develop a cross-modal HDR reconstruction network with convolutional fusion, mutual-information regularization, and a learnable fusion loss that adaptively balances intensity cues and event-derived structural constraints. Comprehensive experiments on both synthetic benchmarks and real captures demonstrate that the proposed system consistently improves highlight recovery, edge fidelity, and robustness compared with frame-only or event-only HDR pipelines. The results indicate that jointly optimizing optical design, cross-modal alignment, and computational fusion provides an effective foundation for reliable HDR perception in highly dynamic and radiometrically challenging environments.
[194] U-VLM: Hierarchical Vision Language Modeling for Report Generation
Pengcheng Shi,Minghui Zhang,Kehan Song,Jiaqi Liu,Yun Gu,Xinglin Zhang
Main category: cs.CV
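TL;DR: This paper proposes U-VLM, which improves automatic report generation for 3D medical imaging through staged progressive training and multi-layer visual feature injection, reaching state-of-the-art results on multiple datasets.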
Details
Motivation: Existing vision-language models for 3D medical report generation have two limitations: they do not use segmentation-pretrained encoders, and they inject visual features only at the input layer of the language model, losing multi-scale information. Method: U-VLM introduces (1) a progressive training strategy from segmentation to classification to report generation; and (2) injection of features from each U-Net encoder level into the corresponding language model layers, giving hierarchical vision-language modeling. Result: U-VLM clearly surpasses the state of the art on CT-RATE and AbdomenAtlas 3.0 (for example, F1 of 0.414 vs 0.258 on CT-RATE); a 0.1B-parameter decoder trained from scratch outperforms 7B+ pre-trained large language models; ablations show that progressive pretraining improves F1 and multi-layer injection improves BLEU-mean. Conclusion: Well-designed vision encoder pretraining is more effective than relying on very large pre-trained language models; staged training and multi-layer feature fusion are key to better 3D medical report generation. Abstract: Automated radiology report generation is key for reducing radiologist workload and improving diagnostic consistency, yet generating accurate reports for 3D medical imaging remains challenging. Existing vision-language models face two limitations: they do not leverage segmentation-pretrained encoders, and they inject visual features only at the input layer of language models, losing multi-scale information. We propose U-VLM, which enables hierarchical vision-language modeling in both training and architecture: (1) progressive training from segmentation to classification to report generation, and (2) multi-layer visual injection that routes U-Net encoder features to corresponding language model layers. Each training stage can leverage different datasets without unified annotations. U-VLM achieves state-of-the-art performance on CT-RATE (F1: 0.414 vs 0.258, BLEU-mean: 0.349 vs 0.305) and AbdomenAtlas 3.0 (F1: 0.624 vs 0.518 for segmentation-based detection) using only a 0.1B decoder trained from scratch, demonstrating that well-designed vision encoder pretraining outweighs the benefits of 7B+ pre-trained language models. Ablation studies show that progressive pretraining significantly improves F1, while multi-layer injection improves BLEU-mean. Code is available at https://github.com/yinghemedical/U-VLM.
[195] TokenCom: Vision-Language Model for Multimodal and Multitask Token Communications
Feibo Jiang,Siwei Tu,Li Dong,Xiaolong Li,Kezhi Wang,Cunhua Pan,Zhu Han,Jiangzhou Wang
Main category: cs.CV
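TL;DR: This paper proposes the TaiChi framework, which uses a dual visual tokenizer, a Bilateral Attention Network, and a Kolmogorov-Arnold Network projector to improve the cross-modal alignment and efficiency of vision-language models in token-level communication.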
Details
Motivation: Existing vision-language models are limited by coarse token granularity, overlong visual token sequences, and inadequate cross-modal alignment. Method: TaiChi includes (1) a dual-resolution visual tokenizer that jointly extracts pixel-level detail and global semantics; (2) a Bilateral Attention Network (BAN) that fuses multi-scale visual tokens; (3) a KAN-based modality projector with learnable activation functions for precise nonlinear cross-modal alignment; and (4) integration into a multimodal, multitask token communication system with a joint VLM-channel coding scheme. Result: Experiments show that TaiChi outperforms existing methods and confirm the feasibility and effectiveness of the TaiChi-driven token communication system. Conclusion: TaiChi eases key bottlenecks of VLMs in token-level intelligent communication and offers a new paradigm for efficient, compact, and precise cross-modal representation and transmission. Abstract: Visual-Language Models (VLMs), with their strong capabilities in image and text understanding, offer a solid foundation for intelligent communications. However, their effectiveness is constrained by limited token granularity, overlong visual token sequences, and inadequate cross-modal alignment. To overcome these challenges, we propose TaiChi, a novel VLM framework designed for token communications. TaiChi adopts a dual-visual tokenizer architecture that processes both high- and low-resolution images to collaboratively capture pixel-level details and global conceptual features. A Bilateral Attention Network (BAN) is introduced to fuse multi-scale visual tokens, thereby enhancing visual understanding and producing compact visual tokens. In addition, a Kolmogorov-Arnold Network (KAN)-based modality projector with learnable activation functions is employed to achieve precise nonlinear alignment from visual features to the text semantic space, thus minimizing information loss. Finally, TaiChi is integrated into a multimodal and multitask token communication system equipped with a joint VLM-channel coding scheme. Experimental results validate the superior performance of TaiChi, as well as the feasibility and effectiveness of the TaiChi-driven token communication system.
[196] RAISE: Requirement-Adaptive Evolutionary Refinement for Training-Free Text-to-Image Alignment
Liyao Jiang,Ruichen Chen,Chao Gao,Di Niu
Main category: cs.CV
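TL;DR: This paper proposes RAISE, a training-free, requirement-driven evolutionary framework for text-to-image generation that improves prompt-image alignment through dynamic verification and adaptive refinement.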
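A toy sketch of a requirement-driven refinement loop of the kind summarized above, in plain Python. Everything here is a hypothetical stand-in: candidates are modeled as sets of satisfied attributes instead of images, `requirements` maps each checklist item to a check function (a role a vision-language model verifier might play), and `refine` represents any refinement action. It illustrates only the control flow of "verify against a checklist, keep the best candidates, and spend more rounds only while requirements remain unmet".

```python
def unmet(requirements, candidate):
    """Return the checklist items that the candidate does not yet satisfy."""
    return [name for name, check in requirements.items() if not check(candidate)]

def refine_until_satisfied(candidates, requirements, refine, max_rounds=5):
    """Evolve a population until one candidate passes every requirement."""
    rounds = 0
    while rounds < max_rounds:
        ranked = sorted(candidates, key=lambda c: len(unmet(requirements, c)))
        best = ranked[0]
        if not unmet(requirements, best):
            return best, rounds              # easy prompts stop early
        survivors = ranked[: max(1, len(ranked) // 2)]
        # each survivor is refined only with respect to its own unmet items
        candidates = [refine(c, unmet(requirements, c)) for c in survivors]
        rounds += 1
    best = min(candidates, key=lambda c: len(unmet(requirements, c)))
    return best, rounds
```

The number of rounds returned grows with the number of unmet requirements, which is the sense in which the compute budget adapts to prompt difficulty.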
Details
Motivation: Existing text-to-image diffusion models struggle to align faithfully with complex prompts (multiple objects, relations, and fine-grained attributes); training-free inference-time scaling methods cannot adapt to prompt difficulty, while reflection-tuned models depend on specific datasets and generalize poorly. Method: RAISE models image generation as a requirement-driven adaptive scaling process: at inference time it evolves a population of candidate images using refinement actions such as prompt rewriting, noise resampling, and instructional editing, verifies each round against a structured requirement checklist, and allocates further computation only where needed. Result: State-of-the-art alignment on GenEval and DrawBench (0.94 overall on GenEval), with 30-40% fewer generated samples and 80% fewer vision-language model calls. Conclusion: RAISE delivers efficient, general, model-agnostic multi-round self-improvement, markedly improving alignment on complex prompts as well as computational efficiency. Abstract: Recent text-to-image (T2I) diffusion models achieve remarkable realism, yet faithful prompt-image alignment remains challenging, particularly for complex prompts with multiple objects, relations, and fine-grained attributes. Existing training-free inference-time scaling methods rely on fixed iteration budgets that cannot adapt to prompt difficulty, while reflection-tuned models require carefully curated reflection datasets and extensive joint fine-tuning of diffusion and vision-language models, often overfitting to reflection-path data and lacking transferability across models. We introduce RAISE (Requirement-Adaptive Self-Improving Evolution), a training-free, requirement-driven evolutionary framework for adaptive T2I generation. RAISE formulates image generation as a requirement-driven adaptive scaling process, evolving a population of candidates at inference time through a diverse set of refinement actions, including prompt rewriting, noise resampling, and instructional editing. Each generation is verified against a structured checklist of requirements, enabling the system to dynamically identify unsatisfied items and allocate further computation only where needed. This achieves adaptive test-time scaling that aligns computational effort with semantic query complexity. On GenEval and DrawBench, RAISE attains state-of-the-art alignment (0.94 overall GenEval) while incurring fewer generated samples (reduced by 30-40%) and VLM calls (reduced by 80%) than prior scaling and reflection-tuned baselines, demonstrating efficient, generalizable, and model-agnostic multi-round self-improvement.
Code is available at https://github.com/LiyaoJiang1998/RAISE.
[197] Random Wins All: Rethinking Grouping Strategies for Vision Tokens
Qihang Fan,Yuang Ai,Huaibo Huang,Ran He
Main category: cs.CV
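TL;DR: This paper proposes a simple, efficient random grouping strategy for token grouping in Vision Transformers as a replacement for complex grouping designs. It performs strongly across multiple vision and cross-modal tasks, and the paper analyzes four key elements that a grouping strategy needs.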
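A minimal sketch of random token grouping followed by attention within each group, assuming NumPy. The function names, the use of a seeded permutation to obtain a fixed grouping pattern, and the omission of learned query/key/value projections are simplifying assumptions for illustration; the paper's actual layer design is not specified in the text here.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def make_fixed_random_groups(num_tokens, group_size, seed=0):
    """Randomly partition token indices into equal groups; a fixed seed fixes the pattern."""
    assert num_tokens % group_size == 0, "group_size must divide num_tokens"
    perm = np.random.default_rng(seed).permutation(num_tokens)
    return perm.reshape(-1, group_size)

def grouped_self_attention(tokens, groups):
    """Self-attention restricted to each group (identity Q/K/V projections for brevity)."""
    out = np.empty_like(tokens)
    scale = tokens.shape[-1] ** -0.5
    for idx in groups:
        x = tokens[idx]                      # (group_size, dim)
        attn = softmax(x @ x.T * scale)      # (group_size, group_size)
        out[idx] = attn @ x
    return out
```

With N tokens and groups of size g, each token attends to g tokens instead of N, so the attention cost falls from order N² to order N·g, while a random partition still lets each group draw tokens from across the whole image.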
Details
Motivation: Token grouping methods for Vision Transformers are complex and varied; the authors question whether this complexity is necessary and ask whether a simpler, more unified grouping strategy can replace them. Method: Random grouping, a fast and simple random partition of vision tokens, together with an analysis and validation of the key elements of a grouping strategy (positional information, head feature diversity, global receptive field, and fixed grouping pattern). Result: Random grouping outperforms almost all other grouping methods across multiple baselines; its advantage is larger on downstream tasks such as object detection; and it is also validated on multimodal tasks including point cloud processing and vision-language models. Conclusion: As long as the four key design elements are met, an extremely simple random grouping strategy handles a wide range of vision and cross-modal tasks efficiently and effectively, with no need for complex grouping mechanisms. Abstract: Since Transformers were introduced into vision architectures, their quadratic complexity has been a significant issue that many research efforts aim to address. A representative approach involves grouping tokens and then either performing self-attention within each group or pooling the tokens within each group into a single token. To this end, various carefully designed grouping strategies have been proposed to enhance the performance of Vision Transformers. Here, we pose the following questions: Are these carefully designed grouping methods truly necessary? Is there a simpler and more unified token grouping method that can replace these diverse methods? We therefore propose random grouping, a simple and fast strategy for grouping vision tokens. We validate this approach on multiple baselines, and experiments show that random grouping outperforms almost all other grouping methods. When transferred to downstream tasks, such as object detection, random grouping demonstrates even more pronounced advantages. In response to this phenomenon, we conduct a detailed analysis of the advantages of random grouping from multiple perspectives and identify several crucial elements for the design of grouping strategies: positional information, head feature diversity, global receptive field, and fixed grouping pattern. We demonstrate that as long as these four conditions are met, vision tokens require only an extremely simple grouping strategy to efficiently and effectively handle various visual tasks.
We also validate the effectiveness of our proposed random method across multiple modalities, including visual tasks, point cloud processing, and vision-language models. Code will be available at https://github.com/qhfan/random.
[198] ArtiFixer: Enhancing and Extending 3D Reconstruction with Auto-Regressive Diffusion Models
Riccardo de Lutio,Tobias Fischer,Yen-Yu Chang,Yuxuan Zhang,Jay Zhangjie Wu,Xuanchi Ren,Tianchang Shen,Katarina Tothova,Zan Gojcic,Haithem Turki
Main category: cs.CV
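TL;DR: This paper proposes a two-stage generative method that trains a bidirectional generative model and distills it into a causal auto-regressive model, markedly improving the quality and scalability of novel view synthesis from sparse views and exceeding prior methods by 1-3 dB PSNR.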
Details
Motivation: Existing methods based on generative priors repair under-observed regions poorly and have two problems: limited scalability (for example, diffusion models can generate only a limited number of views) and low generation quality (outputs inconsistent with scene content, and failure in completely unobserved regions). Method: A two-stage pipeline. The first stage trains a bidirectional generative model with a novel opacity mixing strategy that balances consistency with observations against the ability to extrapolate into unseen regions. The second stage distills it into a causal auto-regressive model that generates hundreds of frames in a single pass and can either produce novel views directly or serve as pseudo-supervision to improve the underlying 3D representation. Result: The method surpasses all existing baselines on standard datasets by a wide margin, with PSNR gains of 1-3 dB, and still produces plausible reconstructions in extremely sparse-view scenarios where existing methods fail completely. Conclusion: The method resolves the tension between scalability and quality in generative novel view synthesis and offers a new paradigm for high-quality, efficient 3D reconstruction from sparse inputs. Abstract: Per-scene optimization methods such as 3D Gaussian Splatting provide state-of-the-art novel view synthesis quality but extrapolate poorly to under-observed areas. Methods that leverage generative priors to correct artifacts in these areas hold promise but currently suffer from two shortcomings. The first is scalability, as existing methods use image diffusion models or bidirectional video models that are limited in the number of views they can generate in a single pass (and thus require a costly iterative distillation process for consistency). The second is quality itself, as generators used in prior work tend to produce outputs that are inconsistent with existing scene content and fail entirely in completely unobserved regions. To solve these, we propose a two-stage pipeline that leverages two key insights. First, we train a powerful bidirectional generative model with a novel opacity mixing strategy that encourages consistency with existing observations while retaining the model's ability to extrapolate novel content in unseen areas. Second, we distill it into a causal auto-regressive model that generates hundreds of frames in a single pass. This model can directly produce novel views or serve as pseudo-supervision to improve the underlying 3D representation in a simple and highly efficient manner. We evaluate our method extensively and demonstrate that it can generate plausible reconstructions in scenarios where existing approaches fail completely.
When measured on commonly benchmarked datasets, we outperform all existing baselines by a wide margin, exceeding prior state-of-the-art methods by 1-3 dB PSNR.
[199] COG: Confidence-aware Optimal Geometric Correspondence for Unsupervised Single-reference Novel Object Pose Estimation
Yuchen Che,Jingtu Wu,Hao Zheng,Asako Kanezaki
Main category: cs.CV
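TL;DR: This paper proposes COG, an unsupervised framework that formulates cross-view correspondence estimation as a confidence-aware optimal transport problem. By predicting point-wise confidences and incorporating semantic priors from vision foundation models, it achieves robust 6DoF pose estimation.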
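One common way to solve an optimal transport problem with prescribed marginals is entropy-regularized Sinkhorn iteration. The sketch below, assuming NumPy, shows how point-wise confidences could be normalized and used as those marginals so that low-confidence points receive little correspondence mass. The choice of Sinkhorn, the regularization strength, and the function names are assumptions for illustration; the source does not state which solver or formulation COG uses.

```python
import numpy as np

def confidence_marginals(confidence):
    """Turn non-negative per-point confidences into a probability vector."""
    c = np.asarray(confidence, dtype=float)
    return c / c.sum()

def sinkhorn(cost, row_marginal, col_marginal, reg=0.5, num_iters=200):
    """Entropy-regularized optimal transport via alternating marginal scaling."""
    kernel = np.exp(-cost / reg)
    u = np.ones_like(row_marginal)
    v = np.ones_like(col_marginal)
    for _ in range(num_iters):
        u = row_marginal / (kernel @ v)      # match row sums
        v = col_marginal / (kernel.T @ u)    # match column sums
    # soft correspondence matrix; entry (i, j) is the mass moved from point i to point j
    return u[:, None] * kernel * v[None, :]
```

The returned plan is a dense, differentiable soft assignment, in contrast to the discrete one-to-one matching that the entry describes as non-differentiable.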
Details
Motivation: Existing methods rely on discrete one-to-one matching, which is non-differentiable and tends to collapse onto sparse keypoints, making it hard to cope with occlusions, viewpoint changes, and outliers. Method: The Confidence-aware Optimal Geometric Correspondence (COG) framework formulates correspondence estimation as a confidence-weighted optimal transport problem, uses point-wise confidences as the transport marginals, and regularizes the correspondences with semantic priors from vision foundation models. Result: Unsupervised COG performs on par with supervised methods, and supervised COG surpasses existing supervised methods. Conclusion: Explicitly integrating confidence into correspondence matching and pose estimation improves robustness and generalization and supports end-to-end unsupervised learning. Abstract: Estimating the 6DoF pose of a novel object with a single reference view is challenging due to occlusions, viewpoint changes, and outliers. A core difficulty lies in finding robust cross-view correspondences, as existing methods often rely on discrete one-to-one matching that is non-differentiable and tends to collapse onto sparse keypoints. We propose Confidence-aware Optimal Geometric Correspondence (COG), an unsupervised framework that formulates correspondence estimation as a confidence-aware optimal transport problem. COG produces balanced soft correspondences by predicting point-wise confidences and injecting them as optimal transport marginals, suppressing non-overlapping regions. Semantic priors from vision foundation models further regularize the correspondences, leading to stable pose estimation. This design integrates confidence into the correspondence finding and pose estimation pipeline, enabling unsupervised learning. Experiments show unsupervised COG achieves comparable performance to supervised methods, and supervised COG outperforms them.
[200] M$^2$: Dual-Memory Augmentation for Long-Horizon Web Agents via Trajectory Summarization and Insight Retrieval
Dawei Yan,Haokui Zhang,Guangda Huzhang,Yang Li,Yibo Wang,Qing-Guo Chen,Zhao Xu,Weihua Luo,Ying Li,Wei Dong,Chunhua Shen
Main category: cs.CV
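TL;DR: This paper proposes M², a training-free framework built on a dual-tier memory mechanism that improves the context efficiency and decision-making robustness of multimodal large language models (MLLMs) on long-horizon web navigation tasks, markedly raising success rates while reducing token consumption and computational overhead.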
Details
Motivation: Existing MLLM-based web navigation agents incur high computational costs and show insufficient reasoning ability on long-horizon tasks, and they depend on large amounts of data and training.
Method: Proposes the training-free M² framework, which combines Dynamic Trajectory Summarization (internal memory) to compress the interaction history with Insight Retrieval Augmentation (external memory) to retrieve actionable guidelines from an offline insight bank.
Result: Evaluated on WebVoyager and OnlineMind2Web, M² raises the success rate of Qwen3-VL-32B by up to 19.6% and reduces tokens by 58.7%; proprietary models such as Claude gain up to 12.5% in accuracy with significantly lower computational overhead.
Conclusion: Through a lightweight, training-free memory-augmentation design, M² effectively eases the context and reasoning bottlenecks of MLLMs in long-horizon web navigation and shows strong generalization and practicality.
Abstract: Multimodal Large Language Models (MLLMs) based agents have demonstrated remarkable potential in autonomous web navigation. However, handling long-horizon tasks remains a critical bottleneck. Prevailing strategies often rely heavily on extensive data collection and model training, yet still struggle with high computational costs and insufficient reasoning capabilities when facing complex, long-horizon scenarios. To address this, we propose M$^2$, a training-free, memory-augmented framework designed to optimize context efficiency and decision-making robustness. Our approach incorporates a dual-tier memory mechanism that synergizes Dynamic Trajectory Summarization (Internal Memory) to compress verbose interaction history into concise state updates, and Insight Retrieval Augmentation (External Memory) to guide the agent with actionable guidelines retrieved from an offline insight bank. Extensive evaluations across WebVoyager and OnlineMind2Web demonstrate that M$^2$ consistently surpasses baselines, yielding up to a 19.6% success rate increase and 58.7% token reduction for Qwen3-VL-32B, while proprietary models like Claude achieve accuracy gains up to 12.5% alongside significantly lower computational overhead.
[201] Hierarchical Classification for Improved Histopathology Image Analysis
Keunho Byeon,Jinsol Song,Seong Min Hong,Yosep Chong,Jin Tae Kwak
Main category: cs.CV
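TL;DR: This paper proposes HiClass, a framework for hierarchical classification of whole-slide images that improves both coarse-grained and fine-grained pathology image classification through bidirectional feature integration and tailored loss functions.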
Details
Motivation: Existing deep learning methods for whole-slide pathology image analysis mostly use flat classification and ignore the hierarchical relationships among class labels.
Method: Building on multiple instance learning, introduces bidirectional feature integration to promote information exchange between coarse-grained and fine-grained features, and designs a hierarchical consistency loss, intra- and inter-class distance losses, and a group-wise cross-entropy loss.
Result: On a gastric biopsy dataset with 4 coarse-grained and 14 fine-grained classes, HiClass achieves better performance on both coarse-grained and fine-grained classification.
Conclusion: HiClass effectively models the hierarchical coarse-grained and fine-grained features of pathology images and significantly improves whole-slide image classification.
Abstract: Whole-slide image analysis is essential for diagnostic tasks in pathology, yet existing deep learning methods primarily rely on flat classification, ignoring hierarchical relationships among class labels. In this study, we propose HiClass, a hierarchical classification framework for improved histopathology image analysis, that enhances both coarse-grained and fine-grained WSI classification. Built upon a multiple instance learning approach, HiClass extends it by introducing bidirectional feature integration that facilitates information exchange between coarse-grained and fine-grained feature representations, effectively learning hierarchical features. Moreover, we introduce tailored loss functions, including hierarchical consistency loss, intra- and inter-class distance loss, and group-wise cross-entropy loss, to further optimize hierarchical learning. We assess the performance of HiClass on a gastric biopsy dataset with 4 coarse-grained and 14 fine-grained classes, achieving superior classification performance for both coarse-grained classification and fine-grained classification. These results demonstrate the effectiveness of HiClass in improving WSI classification by capturing coarse-grained and fine-grained histopathological characteristics.
[202] What Do Visual Tokens Really Encode? Uncovering Sparsity and Redundancy in Multimodal Large Language Models
Yingqi Fan,Junlong Tong,Anhao Zhao,Xiaoyu Shen
Main category: cs.CV
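TL;DR: This paper proposes the EmbedLens analysis framework and finds that visual tokens in MLLMs are semantically sparse: only about 60% of them, the "alive" tokens, carry image semantics, and these already contain rich fine-grained information. Internal visual computation is redundant for most tasks, and visual information is better injected at intermediate layers than processed in shallow layers.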
Details
Motivation: The internal structure and semantic processing of visual tokens in existing MLLMs remain unclear, and fine-grained analysis is needed to improve model efficiency and interpretability.
Method: Proposes a two-fold analytical framework and a new probing tool, EmbedLens, and systematically examines the distribution, encoding capacity, and processing necessity of visual tokens through semantic categorization (sink/dead/alive), a patch-compression benchmark, and an analysis of internal visual computation.
Result: Visual tokens show pronounced semantic sparsity (only ≈60% are "alive" and carry image-specific semantics); alive tokens already encode fine-grained visual cues at the input stage; internal visual computation is redundant for most tasks; for strongly vision-centric tasks, alive tokens naturally align with intermediate LLM layers rather than the initial embedding layer.
Conclusion: Visual token processing follows a unified mechanism: inactive tokens can be selectively pruned, internal visual computation minimized, and alive tokens injected directly into intermediate layers, yielding more efficient and interpretable MLLM architectures.
Abstract: Multimodal large language models (MLLMs) project visual tokens into the embedding space of language models, yet the internal structuring and processing of visual semantics remain poorly understood. In this work, we introduce a two-fold analytical framework featuring a novel probing tool, $\textbf{EmbedLens}$, to conduct a fine-grained analysis. We uncover a pronounced semantic sparsity at the input level: visual tokens consistently partition into sink, dead, and alive categories. Remarkably, only the alive tokens, comprising $\approx60\%$ of the total input, carry image-specific meaning. Furthermore, using a targeted patch-compression benchmark, we demonstrate that these alive tokens already encode rich, fine-grained cues (e.g., objects, colors, and OCR) prior to entering the LLM. Internal visual computations (such as visual attention and feed-forward networks) are redundant for most standard tasks. For the small subset of highly vision-centric tasks that actually benefit from internal processing, we reveal that alive tokens naturally align with intermediate LLM layers rather than the initial embedding space, indicating that shallow-layer processing is unnecessary and that direct mid-layer injection is sufficient. Ultimately, our findings provide a unified mechanistic view of visual token processing, paving the way for more efficient and interpretable MLLM architectures through selective token pruning, minimized visual computation, and mid-layer injection.
The code is released at: https://github.com/EIT-NLP/EmbedLens.
[203] Multimodal Adaptive Retrieval Augmented Generation through Internal Representation Learning
Ruoshuang Du,Xin Sun,Qiang Liu,Bowen Song,Zhongqi Chen,Weiqiang Wang,Liang Wang
Main category: cs.CV
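TL;DR: This paper proposes Multimodal Adaptive Retrieval Augmented Generation (MMA-RAG), which dynamically assesses the model's confidence in its internal knowledge to decide whether to bring in externally retrieved information, thereby mitigating hallucinations in visual question answering (VQA).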
Details
Motivation: Existing VQA systems are unreliable because of hallucinations; static retrieval augmentation tends to introduce irrelevant or conflicting visual evidence, especially in visual RAG, where visually similar but semantically incorrect content may be retrieved.
Method: Proposes the MMA-RAG framework, whose core is a decision classifier trained through layer-wise analysis; it uses joint internal visual and textual representations to guide reverse image retrieval and to decide dynamically whether to incorporate external knowledge.
Result: Answer performance improves significantly on three VQA datasets; ablation studies confirm the key role of internal representations in adaptive retrieval decisions.
Conclusion: MMA-RAG effectively balances the use of external knowledge against inference robustness and is applicable to diverse multimodal scenarios.
Abstract: Visual Question Answering systems face reliability issues due to hallucinations, where models generate answers misaligned with visual input or factual knowledge. While Retrieval Augmented Generation frameworks mitigate this issue by incorporating external knowledge, static retrieval often introduces irrelevant or conflicting content, particularly in visual RAG settings where visually similar but semantically incorrect evidence may be retrieved. To address this, we propose Multimodal Adaptive RAG (MMA-RAG), which dynamically assesses the confidence in the internal knowledge of the model to decide whether to incorporate the retrieved external information into the generation process. Central to MMA-RAG is a decision classifier trained through a layer-wise analysis, which leverages joint internal visual and textual representations to guide the use of reverse image retrieval. Experiments demonstrated that the model achieves a significant improvement in response performance on three VQA datasets. Meanwhile, ablation studies highlighted the importance of internal representations in adaptive retrieval decisions. In general, the experimental results demonstrated that MMA-RAG effectively balances external knowledge utilization and inference robustness in diverse multimodal scenarios.
[204] Wavelet-based Frame Selection by Detecting Semantic Boundary for Long Video Understanding
Wang Chen,Yuhui Zeng,Yongdong Luo,Tianyu Xie,Luojun Lin,Jiayi Ji,Yan Zhang,Xiawu Zheng
Main category: cs.CV
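TL;DR: This paper proposes WFS-SB, a training-free, wavelet-based video frame selection method that improves long video understanding by detecting semantic boundaries (pivotal moments of narrative change), significantly outperforming existing methods.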
Details
Motivation: Existing frame selection methods consider only the relevance between frames and the query and ignore the narrative structure of the video, whereas effective video understanding depends more on capturing semantic shifts (pivotal moments of narrative change).
Method: Applies the wavelet transform to decompose the query-frame similarity signal into multiple scales, extracts a clean semantic change signal from the coarsest scale, and uses its local extrema as semantic boundaries that segment the video into clips. A two-stage strategy follows: first, a frame budget is adaptively allocated to each clip according to a composite importance score; second, Maximal Marginal Relevance (MMR) selects a diverse yet relevant set of frames within each clip.
Result: Improves LVLM accuracy by 5.5%, 9.5%, and 6.2% on VideoMME, MLVU, and LongVideoBench respectively, consistently surpassing SOTA methods.
Conclusion: Semantic boundaries support holistic long video understanding better than high-relevance frames alone; WFS-SB offers a robust, efficient, training-free paradigm for frame selection.
Abstract: Frame selection is crucial due to high frame redundancy and limited context windows when applying Large Vision-Language Models (LVLMs) to long videos. Current methods typically select frames with high relevance to a given query, resulting in a disjointed set of frames that disregard the narrative structure of video. In this paper, we introduce Wavelet-based Frame Selection by Detecting Semantic Boundary (WFS-SB), a training-free framework that presents a new perspective: effective video understanding hinges not only on high relevance but, more importantly, on capturing semantic shifts - pivotal moments of narrative change that are essential to comprehending the holistic storyline of video. However, direct detection of abrupt changes in the query-frame similarity signal is often unreliable due to high-frequency noise arising from model uncertainty and transient visual variations. To address this, we leverage the wavelet transform, which provides an ideal solution through its multi-resolution analysis in both time and frequency domains. By applying this transform, we decompose the noisy signal into multiple scales and extract a clean semantic change signal from the coarsest scale. We identify the local extrema of this signal as semantic boundaries, which segment the video into coherent clips. Building on this, WFS-SB comprises a two-stage strategy: first, adaptively allocating a frame budget to each clip based on a composite importance score; and second, within each clip, employing the Maximal Marginal Relevance approach to select a diverse yet relevant set of frames.
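The boundary detection and in-clip selection steps described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the undecimated Haar-style detail at a single coarse scale, the `min_strength` threshold, the cosine-similarity redundancy term, and all function names are hypothetical choices made for the example, and the adaptive per-clip budget allocation is omitted.

```python
import numpy as np

def coarse_haar_detail(sim, scale):
    """Haar-style detail at one coarse scale (assumed stand-in for the
    paper's coarsest-scale change signal): mean of the next `scale`
    samples minus mean of the previous `scale` samples."""
    x = np.asarray(sim, dtype=float)
    n = len(x)
    c = np.concatenate(([0.0], np.cumsum(x)))  # prefix sums for fast window means
    d = np.zeros(n)
    for t in range(scale, n - scale + 1):
        right = (c[t + scale] - c[t]) / scale
        left = (c[t] - c[t - scale]) / scale
        d[t] = right - left
    return d

def semantic_boundaries(detail, min_strength):
    """Frame indices where |detail| is a local maximum above a threshold."""
    mag = np.abs(detail)
    return [t for t in range(1, len(mag) - 1)
            if mag[t] >= min_strength and mag[t] > mag[t - 1] and mag[t] >= mag[t + 1]]

def mmr_select(relevance, features, k, lam=0.7):
    """Maximal Marginal Relevance: greedily pick frames that are relevant
    to the query but dissimilar to frames already selected."""
    feats = np.asarray(features, dtype=float)
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = feats @ feats.T  # cosine similarity between frames
    selected, candidates = [], list(range(len(relevance)))
    while candidates and len(selected) < k:
        def score(i):
            redundancy = max((sim[i, j] for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

In use, `coarse_haar_detail` would be applied to the per-frame query similarity scores, `semantic_boundaries` would split the video into clips, and `mmr_select` would then be run inside each clip with that clip's frame budget as `k`.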
Extensive experiments show that WFS-SB significantly boosts LVLM performance, e.g., improving accuracy by 5.5% on VideoMME, 9.5% on MLVU, and 6.2% on LongVideoBench, consistently outperforming state-of-the-art methods.
[205] MLLM-4D: Towards Visual-based Spatial-Temporal Intelligence
Xingyilang Yin,Chengzhengxu Li,Jiahao Chang,Chi-Man Pun,Xiaodong Cun
Main category: cs.CV
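TL;DR: This paper proposes the MLLM-4D framework, which markedly improves the 4D spatiotemporal understanding and reasoning of multimodal large language models on 2D video input by building high-quality 4D spatiotemporal instruction datasets and a new post-training strategy (with ST-CoT prompting and ST-reward).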
Details
Motivation: Humans are born with vision-based 4D spatiotemporal intelligence, whereas current multimodal large language models (MLLMs) face a significant bottleneck in this respect.
Method: Proposes the MLLM-4D framework: 1) builds low-cost, high-quality 4D spatiotemporal instruction datasets (MLLM4D-2M, MLLM4D-R1-30k, MLLM4D-Bench); 2) uses supervised fine-tuning (SFT) to establish foundational 4D understanding, then applies reinforcement fine-tuning (RFT) with Group Relative Policy Optimization (GRPO), Spatiotemporal Chain of Thought (ST-CoT) prompting, and spatiotemporal reward functions (ST-reward), without modifying the model architecture.
Result: MLLM-4D achieves state-of-the-art (SOTA) spatiotemporal understanding and reasoning from purely 2D RGB input.
Conclusion: MLLM-4D effectively closes a key gap in the 4D spatiotemporal perception and reasoning of MLLMs and provides a scalable solution, requiring no architectural changes, for extracting high-level spatiotemporal semantics from 2D video.
Abstract: Humans are born with vision-based 4D spatial-temporal intelligence, which enables us to perceive and reason about the evolution of 3D space over time from purely visual inputs. Despite its importance, this capability remains a significant bottleneck for current multimodal large language models (MLLMs). To tackle this challenge, we introduce MLLM-4D, a comprehensive framework designed to bridge the gaps in training data curation and model post-training for spatiotemporal understanding and reasoning. On the data front, we develop a cost-efficient data curation pipeline that repurposes existing stereo video datasets into high-quality 4D spatiotemporal instructional data. This results in the MLLM4D-2M and MLLM4D-R1-30k datasets for Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT), alongside MLLM4D-Bench for comprehensive evaluation. Regarding model training, our post-training strategy establishes a foundational 4D understanding via SFT and further catalyzes 4D reasoning capabilities by employing Group Relative Policy Optimization (GRPO) with specialized Spatiotemporal Chain of Thought (ST-CoT) prompting and Spatiotemporal reward functions (ST-reward) without involving the modification of architecture. Extensive experiments demonstrate that MLLM-4D achieves state-of-the-art spatial-temporal understanding and reasoning capabilities from purely 2D RGB inputs. Project page: https://github.com/GVCLab/MLLM-4D.
[206] Vision-TTT: Efficient and Expressive Visual Representation Learning with Test-Time Training
Quan Kong,Yanru Xiao,Yuhao Shen,Cong Wang
Main category: cs.CV
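TL;DR: This paper proposes Vision-TTT, a linear-time visual modeling method based on Test-Time Training (TTT) that models global 2D visual correlations through bidirectional scanning and a Conv2d module, achieving high accuracy on ImageNet while substantially reducing computation and memory use and increasing inference speed.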
Details
Motivation: Vision Transformers (ViTs) are limited by the quadratic complexity of self-attention, and an efficient, expressive linear-time alternative is needed.
Method: Brings Test-Time Training (TTT) into vision as Vision-TTT, using a bidirectional scan strategy and a Conv2d module to achieve self-supervised sequence compression and global modeling of 2D visual features.
Result: The Vision-TTT models (Vittt-T/S/B) reach 77.3%, 81.2%, and 82.5% Top-1 accuracy on ImageNet; at 1280×1280 resolution, Vittt-T uses 79.4% fewer FLOPs, runs 4.38x faster, and needs 88.9% less memory than DeiT-T.
Conclusion: Vision-TTT combines high expressiveness with high efficiency and is a promising candidate for a next-generation general-purpose visual backbone.
Abstract: Learning efficient and expressive visual representation has long been the pursuit of computer vision research. While Vision Transformers (ViTs) gradually replace traditional Convolutional Neural Networks (CNNs) as more scalable vision learners, their applications are plagued by the quadratic complexity of the self-attention mechanism. To address the challenge, we introduce a new linear-time sequence modeling method Test-Time Training (TTT) into vision and propose Vision-TTT, which compresses the visual token sequence in a novel self-supervised learning manner. By incorporating bidirectional scan strategy and the Conv2d module, Vision-TTT effectively extends vanilla TTT to model 2D visual correlations with global receptive fields. Extensive experiments show that \texttt{Vittt-T/S/B} achieve 77.3%, 81.2%, 82.5% Top-1 accuracy on ImageNet classification and also greatly outperform their counterparts on downstream tasks. At 1280x1280 resolution, \texttt{Vittt-T} reduces FLOPs by 79.4% and runs 4.38x faster with 88.9% less memory than DeiT-T. These results demonstrate the expressiveness and efficiency of Vision-TTT as a strong candidate for the next-generation generic visual backbone.
[207] Jano: Adaptive Diffusion Generation with Early-stage Convergence Awareness
Yuyang Chen,Linqian Zeng,Yijin ZHou,Hengjie Li,Jidong Zhai
Main category: cs.CV
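TL;DR: Jano is a training-free acceleration framework that identifies the heterogeneous convergence patterns of different regions of generated content during denoising and schedules computation adaptively per region, achieving an average 2.0x speedup while preserving generation quality.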
Details
Motivation: Existing diffusion models (especially DiTs) are computationally expensive, and conventional acceleration methods apply content-agnostic uniform optimization, ignoring that different regions of generated content converge at different speeds during denoising.
Method: Proposes the Jano framework: 1) an early-stage complexity recognition algorithm that accurately determines each region's convergence requirements within the initial denoising steps; 2) an adaptive token scheduling mechanism that allocates computational resources dynamically at runtime.
Result: Evaluation on state-of-the-art models shows an average 2.0x speedup, up to 2.4x, without harming generation quality.
Conclusion: The work breaks with the uniform-processing assumption, demonstrates that region-aware acceleration is effective and practical, and offers an efficient, feasible route for large-scale content generation.
Abstract: Diffusion models have achieved remarkable success in generative AI, yet their computational efficiency remains a significant challenge, particularly for Diffusion Transformers (DiTs) requiring intensive full-attention computation. While existing acceleration approaches focus on content-agnostic uniform optimization strategies, we observe that different regions in generated content exhibit heterogeneous convergence patterns during the denoising process. We present Jano, a training-free framework that leverages this insight for efficient region-aware generation. Jano introduces an early-stage complexity recognition algorithm that accurately identifies regional convergence requirements within initial denoising steps, coupled with an adaptive token scheduling runtime that optimizes computational resource allocation. Through comprehensive evaluation on state-of-the-art models, Jano achieves substantial acceleration (average 2.0 times speedup, up to 2.4 times) while preserving generation quality. Our work challenges conventional uniform processing assumptions and provides a practical solution for accelerating large-scale content generation. The source code of our implementation is available at https://github.com/chen-yy20/Jano.
[208] Mesh-Pro: Asynchronous Advantage-guided Ranking Preference Optimization for Artist-style Quadrilateral Mesh Generation
Zhen Zhou,Jian Liu,Biwen Lei,Jing Xu,Haohan Weng,Yiling Zhu,Zhuo Chen,Junfeng Fan,Yunkai Ma,Dazhao Du,Song Guo,Fengshui Jing,Chunchao Guo
Main category: cs.CV
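TL;DR: This paper proposes ARPO, an asynchronous online reinforcement learning framework for 3D mesh generation, together with the Mesh-Pro model, significantly improving training efficiency and generation quality.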
Details
Motivation: Reinforcement learning in existing 3D generation is held back by the inefficiency and poor generalization of offline DPO methods, and a more efficient online RL approach is needed.
Method: Proposes an asynchronous online RL framework, the Advantage-guided Ranking Preference Optimization (ARPO) algorithm, and the Mesh-Pro model, which includes a diagonal-aware mixed triangular-quadrilateral mesh tokenization and a ray-based reward for geometric integrity.
Result: Mesh-Pro achieves SOTA on artistic and dense mesh generation; asynchronous RL is 3.75x faster than synchronous RL; ARPO strikes a better balance between training efficiency and generalization.
Conclusion: This work is the first to bring efficient online reinforcement learning successfully to 3D mesh generation, providing a new paradigm and practical tools for the field.
Abstract: Reinforcement learning (RL) has demonstrated remarkable success in text and image generation, yet its potential in 3D generation remains largely unexplored. Existing attempts typically rely on offline direct preference optimization (DPO) methods, which suffer from low training efficiency and limited generalization. In this work, we aim to enhance both the training efficiency and generation quality of RL in 3D mesh generation. Specifically, (1) we design the first asynchronous online RL framework tailored for 3D mesh generation post-training efficiency improvement, which is 3.75$\times$ faster than synchronous RL. (2) We propose Advantage-guided Ranking Preference Optimization (ARPO), a novel RL algorithm that achieves a better trade-off between training efficiency and generalization than current RL algorithms designed for 3D mesh generation, such as DPO and group relative policy optimization (GRPO). (3) Based on asynchronous ARPO, we propose Mesh-Pro, which additionally introduces a novel diagonal-aware mixed triangular-quadrilateral tokenization for mesh representation and a ray-based reward for geometric integrity. Mesh-Pro achieves state-of-the-art performance on artistic and dense meshes.
[209] TP-Spikformer: Token Pruned Spiking Transformer
Wenjie Wei,Xiaolong Zhou,Malu Zhang,Ammar Belatreche,Qian Sun,Yimeng Shan,Dehao Zhang,Zijian Zhou,Zeyu Ma,Yang Yang,Haizhou Li
Main category: cs.CV
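TL;DR: This paper proposes TP-Spikformer, a simple and effective token pruning method for spiking transformers that uses a spatiotemporal information-retaining criterion and a block-level early stopping strategy to reduce computation and storage overhead while remaining competitive, and that supports training-free deployment.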
Details
Motivation: Existing spiking transformers improve accuracy but are large and computationally demanding, which makes them hard to deploy on resource-constrained devices.
Method: Proposes a token importance evaluation mechanism based on a heuristic spatiotemporal information-retaining criterion, and designs a block-level early stopping strategy that prunes tokens while retaining information instead of removing them outright.
Result: The effectiveness, efficiency, and scalability of TP-Spikformer are validated across several architectures, including Spikformer, QKFormer, and Spike-driven Transformer, and across tasks including image classification, object detection, semantic segmentation, and event-based object tracking; notably, it supports training-free deployment.
Conclusion: TP-Spikformer is an efficient and practical approach that can help bring spiking neural networks into real-world, low-resource settings.
Abstract: Spiking neural networks (SNNs) offer an energy-efficient alternative to traditional neural networks due to their event-driven computing paradigm. However, recent advancements in spiking transformers have focused on improving accuracy with large-scale architectures, which require significant computational resources and limit deployment on resource-constrained devices. In this paper, we propose a simple yet effective token pruning method for spiking transformers, termed TP-Spikformer, that reduces storage and computational overhead while maintaining competitive performance. Specifically, we first introduce a heuristic spatiotemporal information-retaining criterion that comprehensively evaluates tokens' importance, assigning higher scores to informative tokens for retention and lower scores to uninformative ones for pruning. Based on this criterion, we propose an information-retaining token pruning framework that employs a block-level early stopping strategy for uninformative tokens, instead of removing them outright. This also helps preserve more information during token pruning. We demonstrate the effectiveness, efficiency and scalability of TP-Spikformer through extensive experiments across diverse architectures, including Spikformer, QKFormer and Spike-driven Transformer V1 and V3, and a range of tasks such as image classification, object detection, semantic segmentation and event-based object tracking. Particularly, TP-Spikformer performs well in a training-free manner.
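The idea of stopping uninformative tokens early rather than deleting them might be illustrated as follows. This is only a sketch under stated assumptions: the abstract does not specify the scoring criterion, so importance scores are taken as a given input, and `early_stop_block`, `keep_ratio`, and `block_fn` are hypothetical names introduced for the example.

```python
import numpy as np

def early_stop_block(tokens, scores, keep_ratio, block_fn):
    """Run one transformer block only on the highest-scoring tokens.

    Low-scoring tokens skip the block and keep their current features,
    so they stay in the sequence instead of being removed.
    `scores` is an assumed per-token importance value (higher = keep).
    """
    tokens = np.asarray(tokens, dtype=float)
    scores = np.asarray(scores, dtype=float)
    k = max(1, int(round(keep_ratio * len(scores))))
    keep = np.sort(np.argsort(scores)[-k:])  # indices of the top-k tokens
    out = tokens.copy()                      # early-stopped tokens pass through unchanged
    out[keep] = block_fn(tokens[keep])       # only informative tokens are processed
    return out, keep
```

Because the output keeps the original sequence length, later stages that expect a full token grid (for example dense prediction heads) can still consume it, which is one plausible reason to prefer early stopping over outright removal.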
These results reveal its potential as an efficient and practical solution for deploying SNNs in real-world applications with limited computational resources.
[210] CaptionFool: Universal Image Captioning Model Attacks
Swapnil Parekh
Main category: cs.CV
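TL;DR: This paper proposes CaptionFool, a universal adversarial attack on state-of-the-art transformer image captioning models that, by modifying only about 1.2% of image patches, generates arbitrary target captions (including policy-violating ones) with a high success rate, exposing security risks in multimodal models.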
Details
Motivation: Image captioning models are vulnerable to adversarial attacks, and deployed models lack robustness to malicious inputs, so their security vulnerabilities need to be assessed.
Method: Proposes CaptionFool, an input-agnostic universal adversarial attack that optimizes perturbations on a very small number of image patches (7 of 577) to steer transformer captioning models toward a specified target caption, including slang expressions that evade content moderation.
Result: Achieves a 94-96% attack success rate on SOTA transformer captioning models and can generate "slang"-style violating captions that evade existing content filters.
Conclusion: Current vision-language models have serious security weaknesses in real deployments, and targeted robust defenses must be developed.
Abstract: Image captioning models are encoder-decoder architectures trained on large-scale image-text datasets, making them susceptible to adversarial attacks. We present CaptionFool, a novel universal (input-agnostic) adversarial attack against state-of-the-art transformer-based captioning models. By modifying only 7 out of 577 image patches (approximately 1.2% of the image), our attack achieves 94-96% success rate in generating arbitrary target captions, including offensive content. We further demonstrate that CaptionFool can generate "slang" terms specifically designed to evade existing content moderation filters. Our findings expose critical vulnerabilities in deployed vision-language models and underscore the urgent need for robust defenses against such attacks. Warning: This paper contains model outputs which are offensive in nature.
[211] RAFM: Retrieval-Augmented Flow Matching for Unpaired CBCT-to-CT Translation
Xianhao Zhou,Jianghao Wu,Lanfeng Zhong,Ku Zhao,Jinlong He,Shaoting Zhang,Guotai Wang
Main category: cs.CV
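TL;DR: This paper proposes RAFM, a new method that introduces a retrieval-augmented flow matching mechanism to improve the training stability and performance of Rectified Flow on unpaired CBCT-to-CT synthesis.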
Details
Motivation: CBCT is widely used in radiotherapy but suffers from severe artifacts and unreliable HU values, and high-quality paired CBCT-CT data are hard to obtain; existing unpaired translation methods perform poorly and train unstably on small medical datasets.
Method: Proposes Retrieval-Augmented Flow Matching (RAFM), which uses a frozen DINOv3 encoder and a global CT memory bank to construct retrieval-guided pseudo pairs, improving distribution-level coupling quality and stabilizing unpaired flow matching training.
Result: Under a strict subject-level unpaired protocol on SynthRAD2023, RAFM outperforms existing methods on FID, MAE, SSIM, PSNR, and SegScore.
Conclusion: RAFM effectively addresses the instability of flow matching training for unpaired CBCT-to-CT synthesis on small medical datasets, significantly improves sCT quality, and shows potential for clinical use.
Abstract: Cone-beam CT (CBCT) is routinely acquired in radiotherapy but suffers from severe artifacts and unreliable Hounsfield Unit (HU) values, limiting its direct use for dose calculation. Synthetic CT (sCT) generation from CBCT is therefore an important task, yet paired CBCT--CT data are often unavailable or unreliable due to temporal gaps, anatomical variation, and registration errors. In this work, we introduce rectified flow (RF) into unpaired CBCT-to-CT translation in medical imaging. Although RF is theoretically compatible with unpaired learning through distribution-level coupling and deterministic transport, its practical effectiveness under small medical datasets and limited batch sizes remains underexplored. Direct application with random or batch-local pseudo pairing can produce unstable supervision due to semantically mismatched endpoint samples. To address this challenge, we propose Retrieval-Augmented Flow Matching (RAFM), which adapts RF to the medical setting by constructing retrieval-guided pseudo pairs using a frozen DINOv3 encoder and a global CT memory bank. This strategy improves empirical coupling quality and stabilizes unpaired flow-based training. Experiments on SynthRAD2023 under a strict subject-level true-unpaired protocol show that RAFM outperforms existing methods across FID, MAE, SSIM, PSNR, and SegScore. The code is available at https://github.com/HiLab-git/RAFM.git.
[212] Adaptive Dynamic Dehazing via Instruction-Driven and Task-Feedback Closed-Loop Optimization for Diverse Downstream Task Adaptation
Yafei Zhang,Shuaitian Song,Huafeng Li,Shujuan Wang,Yu Liu
Main category: cs.CV
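TL;DR: This paper proposes an adaptive dynamic dehazing framework with a closed-loop optimization mechanism that combines downstream task feedback with user text instructions to adjust dehazing results dynamically at inference time, adapting to multiple downstream vision tasks without retraining.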
Details
Motivation: In real-world vision systems, dehazing must not only improve image visibility but also meet the specific needs of different downstream tasks; existing methods lack task adaptivity and interactivity.
Method: Proposes a closed-loop adaptive dynamic dehazing framework with two core mechanisms: (1) a task feedback loop driven by performance on multiple downstream tasks; (2) an interface that accepts user text instructions to guide high-level task preferences.
Result: Experiments on a range of vision tasks show strong effectiveness, robustness, and generalizability.
Conclusion: The work establishes a new paradigm for interactive, task-adaptive dehazing in which the dehazing model works together with downstream applications.
Abstract: In real-world vision systems, haze removal is required not only to enhance image visibility but also to meet the specific needs of diverse downstream tasks. To address this challenge, we propose a novel adaptive dynamic dehazing framework that incorporates a closed-loop optimization mechanism. It enables feedback-driven refinement based on downstream task performance and user instruction-guided adjustment during inference, allowing the model to satisfy the specific requirements of multiple downstream tasks without retraining. Technically, our framework integrates two complementary and innovative mechanisms: (1) a task feedback loop that dynamically modulates dehazing outputs based on performance across multiple downstream tasks, and (2) a text instruction interface that allows users to specify high-level task preferences. This dual-guidance strategy enables the model to adapt its dehazing behavior after training, tailoring outputs in real time to the evolving needs of multiple tasks. Extensive experiments across various vision tasks demonstrate the strong effectiveness, robustness, and generalizability of our approach. These results establish a new paradigm for interactive, task-adaptive dehazing that actively collaborates with downstream applications.
[213] Cross-Scale Pansharpening via ScaleFormer and the PanScale Benchmark
Ke Cao,Xuanhua He,Xueheng Li,Lingting Zhu,Yingying Wang,Ao Ma,Zhanjie Zhang,Man Zhou,Chengjun Xie,Jie Zhang
Main category: cs.CV
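TL;DR: This paper introduces the PanScale dataset and the PanScale-Bench benchmark for evaluating cross-scale pansharpening, and designs the ScaleFormer model, which treats generalization across image resolutions as generalization across sequence lengths to achieve high-quality fusion at different scales.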
Details
Motivation: Existing pansharpening methods are mostly evaluated in low-resolution settings and generalize poorly to real high-resolution scenarios; the data, algorithmic, and computational challenges of cross-scale pansharpening need to be addressed.
Method: Builds PanScale, the first large-scale cross-scale pansharpening dataset, and the accompanying PanScale-Bench benchmark; proposes the ScaleFormer architecture, which tokenizes images into patch sequences of variable length but consistent resolution, introduces a Scale-Aware Patchify module to support training on fixed-size crops, and decouples intra-patch spatial feature learning from inter-patch sequence modeling, with rotary positional encoding to strengthen extrapolation across scales.
Result: Experiments show that ScaleFormer outperforms current SOTA methods in both fusion quality and cross-scale generalization.
Conclusion: ScaleFormer effectively addresses the generalization problem in cross-scale pansharpening, and the PanScale resources provide an important foundation for this line of work.
Abstract: Pansharpening aims to generate high-resolution multi-spectral images by fusing the spatial detail of panchromatic images with the spectral richness of low-resolution MS data. However, most existing methods are evaluated under limited, low-resolution settings, limiting their generalization to real-world, high-resolution scenarios. To bridge this gap, we systematically investigate the data, algorithmic, and computational challenges of cross-scale pansharpening. We first introduce PanScale, the first large-scale, cross-scale pansharpening dataset, accompanied by PanScale-Bench, a comprehensive benchmark for evaluating generalization across varying resolutions and scales. To realize scale generalization, we propose ScaleFormer, a novel architecture designed for multi-scale pansharpening. ScaleFormer reframes generalization across image resolutions as generalization across sequence lengths: it tokenizes images into patch sequences of the same resolution but variable length proportional to image scale. A Scale-Aware Patchify module enables training for such variations from fixed-size crops. ScaleFormer then decouples intra-patch spatial feature learning from inter-patch sequential dependency modeling, incorporating Rotary Positional Encoding to enhance extrapolation to unseen scales. Extensive experiments show that our approach outperforms SOTA methods in fusion quality and cross-scale generalization. The datasets and source code are available upon acceptance.
[214] Multiple Inputs and Mixwd data for Alzheimer's Disease Classification Based on 3D Vision Transformer
Juan A. Castro-Silva,Maria N. Moreno Garcia,Diego H. Peluffo-Ordoñez
Main category: cs.CV
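TL;DR: This paper proposes MIMD-3DVT, a multiple-input, mixed-data 3D Vision Transformer for Alzheimer's disease (AD) diagnosis that jointly processes consecutive MRI slices, fuses multiple 3D brain-region images, and integrates demographic, cognitive assessment, and imaging data, reaching 97.14% accuracy on the ADNI, AIBL, and OASIS datasets and outperforming existing methods.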
Details
Motivation: Existing MRI-based AD diagnosis methods have three shortcomings: 2D Transformers ignore 3D contextual information; ROI models focus on only a few brain regions and overlook the fact that AD affects multiple regions; single-modality classification cannot meet the clinical need for judgments that draw on multiple data sources.
Method: Proposes the MIMD-3DVT model: 1) a 3D Vision Transformer jointly processes consecutive MRI slices to preserve spatial and feature-dimension information; 2) multiple 3D ROI imaging inputs are fused; 3) mixed multimodal data from demographics, cognitive scale scores, and brain imaging are integrated.
Result: On the combined ADNI, AIBL, and OASIS dataset, MIMD-3DVT reaches 97.14% accuracy on binary classification of Normal Cognition versus Alzheimer's Disease, outperforming the current best methods.
Conclusion: By modeling 3D structural information, widening ROI coverage, and fusing heterogeneous multi-source data, MIMD-3DVT improves AD diagnosis performance and offers clinicians a more robust and interpretable AI-assisted diagnostic tool.
Abstract: The current methods for diagnosing Alzheimer's Disease using Magnetic Resonance Imaging (MRI) have significant limitations. Many previous studies used 2D Transformers to analyze individual brain slices independently, potentially losing critical 3D contextual information. Region of interest-based models often focus on only a few brain regions despite Alzheimer's affecting multiple areas. Additionally, most classification models rely on a single test, whereas diagnosing Alzheimer's requires a multifaceted approach integrating diverse data sources for a more accurate assessment. This study introduces a novel methodology called the Multiple Inputs and Mixed Data 3D Vision Transformer (MIMD-3DVT). This method processes consecutive slices together to capture the feature dimensions and spatial information, fuses multiple 3D ROI imaging data inputs, and integrates mixed data from demographic factors, cognitive assessments, and brain imaging. The proposed methodology was experimentally evaluated using a combined dataset that included the Alzheimer's Disease Neuroimaging Initiative (ADNI), the Australian Imaging, Biomarker, and Lifestyle Flagship Study of Ageing (AIBL), and the Open Access Series of Imaging Studies (OASIS). Our MIMD-3DVT, utilizing single or multiple ROIs, achieved an accuracy of 97.14%, outperforming the state-of-the-art methods in distinguishing between Normal Cognition and Alzheimer's Disease.
[215] Weakly Supervised Video Anomaly Detection with Anomaly-Connected Components and Intention Reasoning
Yu Wang,Shengjie Zhao
Main category: cs.CV
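TL;DR: This paper proposes the LAS-VAD framework, which improves weakly supervised video anomaly detection through an anomaly-connected component mechanism, an intention awareness mechanism, and modeling of anomaly attribute information.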
Details
Motivation: Existing weakly supervised video anomaly detection methods lack frame-level annotations and struggle to learn anomaly semantics effectively.
Method: Proposes the LAS-VAD framework, comprising an anomaly-connected component mechanism (which groups frames semantically), an intention awareness mechanism (which distinguishes similar normal and abnormal behaviors), and anomaly attribute modeling (which uses characteristic features of anomalies such as explosions to aid detection).
Result: Significantly outperforms current state-of-the-art methods on the two benchmark datasets XD-Violence and UCF-Crime.
Conclusion: By introducing semantic grouping, intention discrimination, and attribute guidance, LAS-VAD eases the difficulty of learning anomaly semantics under weak supervision and improves detection accuracy.
Abstract: Weakly supervised video anomaly detection (WS-VAD) involves identifying the temporal intervals that contain anomalous events in untrimmed videos, where only video-level annotations are provided as supervisory signals. However, a key limitation persists in WS-VAD, as dense frame-level annotations are absent, which often leaves existing methods struggling to learn anomaly semantics effectively. To address this issue, we propose a novel framework named LAS-VAD, short for Learning Anomaly Semantics for WS-VAD, which integrates an anomaly-connected component mechanism and an intention awareness mechanism. The former is designed to assign video frames into distinct semantic groups within a video, and frame segments within the same group are deemed to share identical semantic information. The latter leverages an intention-aware strategy to distinguish between similar normal and abnormal behaviors (e.g., taking items and stealing). To further model the semantic information of anomalies, as anomaly occurrence is accompanied by distinct characteristic attributes (e.g., explosions are characterized by flames and thick smoke), we additionally incorporate anomaly attribute information to guide accurate detection. Extensive experiments on two benchmark datasets, XD-Violence and UCF-Crime, demonstrate that our LAS-VAD outperforms current state-of-the-art methods with remarkable gains.
[216] Geometry OR Tracker: Universal Geometric Operating Room Tracking
Yihua Shao,Kang Chen,Feng Xue,Siyu Chen,Long Bai,Hongyuan Yu,Hao Tang,Jinlin Wu,Nassir Navab
Main category: cs.CV
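TL;DR: This paper proposes Geometry OR Tracker, a two-stage method that addresses the geometric inconsistency caused by inaccurate calibration in multi-view 3D tracking in operating rooms: a multi-view metric geometry rectification module first unifies camera scale and geometry, and occlusion-robust 3D point tracking is then performed in a unified operating room world coordinate frame.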
Details
Motivation: Multi-view 3D tracking in operating rooms requires accurate physical quantities (such as distances in meters), but in clinical practice camera calibration and RGB-D registration are unreliable, causing cross-view geometric inconsistency and "ghosting" that degrade 3D trajectory accuracy.
Method: Proposes the two-stage Geometry OR Tracker: the first stage, Multi-view Metric Geometry Rectification, corrects camera calibration under a single global scale; the second stage, Occlusion-Robust 3D Point Tracking, tracks directly in the unified OR world coordinate frame.
Result: On the MM-OR benchmark, the rectification module reduces cross-view depth disagreement by more than 30x; ablation studies confirm that improved geometric consistency markedly strengthens world-frame tracking accuracy.
Conclusion: Geometric consistency is a key prerequisite for multi-view 3D tracking performance, and the proposed rectification method improves the robustness and accuracy of tracking in clinical settings.
Abstract: In operating rooms (OR), world-scale multi-view 3D tracking supports downstream applications such as surgeon behavior recognition, where physically meaningful quantities such as distances and motion statistics must be measured in meters. However, real clinical deployments rarely satisfy the geometric prerequisites for stable multi-view fusion and tracking: camera calibration and RGB-D registration are always unreliable, leading to cross-view geometric inconsistency that produces "ghosting" during fusion and degrades 3D trajectories in a shared OR coordinate frame. To address this, we introduce Geometry OR Tracker, a two-stage pipeline that first rectifies imprecise calibration into a scale-consistent and geometrically consistent camera setup with a single global scale via a Multi-view Metric Geometry Rectification module, and then performs Occlusion-Robust 3D Point Tracking directly in the unified OR world frame. On the MM-OR benchmark, improved geometric consistency translates into tracking gains: our rectification front-end reduces cross-view depth disagreement by more than 30$\times$ compared to raw calibration. Ablation studies further demonstrate the relationship between calibration quality and tracking accuracy, showing that improved geometric consistency yields stronger world-frame tracking.
[217] MIDAS: Multi-Image Dispersion and Semantic Reconstruction for Jailbreaking MLLMs
Yilian Liu,Xiaojun Jia,Guoshun Nan,Jiuyang Lyu,Zhican Chen,Tao Guan,Shuyuan Luo,Zhongyi Zhai,Yang Liu
Main category: cs.CV
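TL;DR: This paper proposes MIDAS, a framework that jailbreaks multimodal large language models (MLLMs) through multi-image dispersion and semantic reconstruction, substantially raising the attack success rate against strongly aligned closed-source models.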
Details
Motivation: Existing jailbreak methods rely only on single-image masking or isolated visual cues, yielding short reasoning paths and limited effectiveness, especially against strongly aligned commercial closed-source MLLMs.
Method: Proposes Multi-Image Dispersion and Semantic Reconstruction (MIDAS): harmful semantics are decomposed into risk-bearing subunits and dispersed across multiple images, and cross-image reasoning gradually reconstructs the malicious intent, forcing the model into longer, more structured multi-image chained reasoning.
Result: Achieves an average attack success rate of 81.46% across 4 closed-source MLLMs, significantly outperforming state-of-the-art methods; experiments cover multiple datasets and models.
Conclusion: By increasing visual dependence, delaying the exposure of malicious semantics, and weakening security attention, MIDAS bypasses current MLLM safety mechanisms and reveals a new vulnerability in multimodal alignment.
Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable performance but remain vulnerable to jailbreak attacks that can induce harmful content and undermine their secure deployment. Previous studies have shown that introducing additional inference steps, which disrupt security attention, can make MLLMs more susceptible to being misled into generating malicious content. However, these methods rely on single-image masking or isolated visual cues, which only modestly extend reasoning paths and thus achieve limited effectiveness, particularly against strongly aligned commercial closed-source models. To address this problem, in this paper, we propose Multi-Image Dispersion and Semantic Reconstruction (MIDAS), a multimodal jailbreak framework that decomposes harmful semantics into risk-bearing subunits, disperses them across multiple visual clues, and leverages cross-image reasoning to gradually reconstruct the malicious intent, thereby bypassing existing safety mechanisms. The proposed MIDAS enforces longer and more structured multi-image chained reasoning, substantially increases the model's reliance on visual cues while delaying the exposure of malicious semantics and significantly reducing the model's security attention, thereby improving the performance of jailbreak against advanced MLLMs. Extensive experiments across different datasets and MLLMs demonstrate that the proposed MIDAS outperforms state-of-the-art jailbreak attacks for MLLMs and achieves an average attack success rate of 81.46% across 4 closed-source MLLMs.
Our code is available at this [link](https://github.com/Winnie-Lian/MIDAS).
[218] Decoupling Stability and Plasticity for Multi-Modal Test-Time Adaptation
Yongbo He,Zirun Guo,Tao Jin
Main category: cs.CV
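TL;DR: This paper proposes DASP, a framework that decouples each adapter into stable and plastic components to address negative transfer and catastrophic forgetting in multi-modal test-time adaptation, applying plastic updates to the biased modality and preserving stability for the unbiased modality.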
Details
Motivation: Existing multi-modal test-time adaptation methods often suffer catastrophic forgetting in the biased modality and negative transfer in the unbiased modality.
Method: Proposes the Decoupling Adaptation for Stability and Plasticity (DASP) framework. Based on a diagnosis of differences in interdimensional redundancy between modalities in the unified latent space, it updates the plastic component and freezes the stable component for the biased modality; for the unbiased modality, it bypasses the plastic component and updates only the stable component with KL regularization.
Result: Significantly outperforms state-of-the-art methods on multiple multi-modal benchmarks.
Conclusion: Through its asymmetric adaptation strategy, DASP balances adaptation flexibility in new domains with retention of generalizable knowledge, effectively mitigating key challenges in multi-modal test-time adaptation.
Abstract: Adapting pretrained multi-modal models to evolving test-time distributions, known as multi-modal test-time adaptation, presents a significant challenge. Existing methods frequently encounter negative transfer in the unbiased modality and catastrophic forgetting in the biased modality. To address these challenges, we propose Decoupling Adaptation for Stability and Plasticity (DASP), a novel diagnose-then-mitigate framework. Our analysis reveals a critical discrepancy within the unified latent space: the biased modality exhibits substantially higher interdimensional redundancy (i.e., strong correlations across feature dimensions) compared to the unbiased modality. Leveraging this insight, DASP identifies the biased modality and implements an asymmetric adaptation strategy. This strategy employs a decoupled architecture where each modality-specific adapter is divided into stable and plastic components. The asymmetric mechanism works as follows: for the biased modality, which requires plasticity, the plastic component is activated and updated to capture domain-specific information, while the stable component remains fixed. Conversely, for the unbiased modality, which requires stability, the plastic component is bypassed, and the stable component is updated using KL regularization to prevent negative transfer. This asymmetric design enables the model to adapt flexibly to new domains while preserving generalizable knowledge. Comprehensive evaluations on diverse multi-modal benchmarks demonstrate that DASP significantly outperforms state-of-the-art methods.
[219] WildActor: Unconstrained Identity-Preserving Video Generation
Qin Guo,Tianyu Yang,Xuanhua He,Fei Shen,Yong Zhang,Zhuoliang Kang,Xiaoming Wei,Dan Xu
Main category: cs.CV
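TL;DR: This paper presents Actor-18M, a large-scale human video dataset, and the WildActor framework, which address full-body identity consistency for digital actors across multiple viewpoints and dynamic motions, improving generation quality through asymmetric identity-preserving attention and viewpoint-adaptive Monte Carlo sampling.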
Details
Motivation: Existing methods struggle to keep a digital actor's full-body identity consistent across dynamic shots, multiple viewpoints, and complex motions, suffering from face-centric behavior, body inconsistency, or pose locking.
Method: Builds the Actor-18M dataset (1.6M videos / 18M images) and proposes the WildActor framework, which introduces an Asymmetric Identity-Preserving Attention mechanism and a Viewpoint-Adaptive Monte Carlo Sampling strategy that iteratively re-weights reference conditions to balance manifold coverage.
Result: On the newly proposed Actor-Bench, WildActor significantly outperforms existing methods under diverse shot compositions, large viewpoint changes, and vigorous motion, achieving better full-body identity consistency.
Conclusion: Together, Actor-18M and WildActor provide a high-quality data foundation and an effective technical path for production-grade human video generation, advancing identity-consistent, any-view controllable human video generation.
Abstract: Production-ready human video generation requires digital actors to maintain strictly consistent full-body identities across dynamic shots, viewpoints and motions, a setting that remains challenging for existing methods. Prior methods often suffer from face-centric behavior that neglects body-level consistency, or produce copy-paste artifacts where subjects appear rigid due to pose locking. We present Actor-18M, a large-scale human video dataset designed to capture identity consistency under unconstrained viewpoints and environments. Actor-18M comprises 1.6M videos with 18M corresponding human images, covering both arbitrary views and canonical three-view representations. Leveraging Actor-18M, we propose WildActor, a framework for any-view conditioned human video generation. We introduce an Asymmetric Identity-Preserving Attention mechanism coupled with a Viewpoint-Adaptive Monte Carlo Sampling strategy that iteratively re-weights reference conditions by marginal utility for balanced manifold coverage. Evaluated on the proposed Actor-Bench, WildActor consistently preserves body identity under diverse shot compositions, large viewpoint transitions, and substantial motions, surpassing existing methods in these challenging settings.
[220] AlignVAR: Towards Globally Consistent Visual Autoregression for Image Super-Resolution
Cencen Liu,Dongyang Zhang,Wen Yin,Jielei Wang,Tianyu Li,Ji Guo,Wenbo Jiang,Guoqing Wang,Guoming Lu
Main category: cs.CV
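TL;DR: This paper proposes AlignVAR, a visual autoregressive framework for image super-resolution (ISR) that uses Spatial Consistency Autoregression (SCA) and a Hierarchical Consistency Constraint (HCC) to address locality-biased attention and error accumulation from residual-only supervision, significantly improving structural consistency and perceptual quality while offering faster inference and fewer parameters.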
Details
Motivation: Visual autoregressive (VAR) models have shown advantages in image generation but remain underexplored for ISR, where they face two challenges: locality-biased attention that fragments structures, and residual-only supervision that accumulates errors across scales.
Method: Proposes the AlignVAR framework with two core components: (1) Spatial Consistency Autoregression (SCA), which uses an adaptive mask to reweight attention and strengthen long-range dependencies; and (2) Hierarchical Consistency Constraint (HCC), which adds full reconstruction supervision at each scale to complement residual learning and stabilize coarse-to-fine refinement.
Result: AlignVAR consistently outperforms existing generative methods in structural coherence and perceptual fidelity, with over 10x faster inference and nearly 50% fewer parameters than leading diffusion models.
Conclusion: AlignVAR establishes a new paradigm for efficient image super-resolution that balances performance, efficiency, and global consistency.
Abstract: Visual autoregressive (VAR) models have recently emerged as a promising alternative for image generation, offering stable training, non-iterative inference, and high-fidelity synthesis through next-scale prediction. This encourages the exploration of VAR for image super-resolution (ISR), yet its application remains underexplored and faces two critical challenges: locality-biased attention, which fragments spatial structures, and residual-only supervision, which accumulates errors across scales. Both severely compromise the global consistency of reconstructed images. To address these issues, we propose AlignVAR, a globally consistent visual autoregressive framework tailored for ISR, featuring two key components: (1) Spatial Consistency Autoregression (SCA), which applies an adaptive mask to reweight attention toward structurally correlated regions, thereby mitigating excessive locality and enhancing long-range dependencies; and (2) Hierarchical Consistency Constraint (HCC), which augments residual learning with full reconstruction supervision at each scale, exposing accumulated deviations early and stabilizing the coarse-to-fine refinement process. Extensive experiments demonstrate that AlignVAR consistently enhances structural coherence and perceptual fidelity over existing generative methods, while delivering over 10x faster inference with nearly 50% fewer parameters than leading diffusion-based approaches, establishing a new paradigm for efficient ISR.
[221] UNICBench: UNIfied Counting Benchmark for MLLM
Chenggang Rong,Tao Han,Zhiyuan Zhao,Yaowu Fan,Jia Wan,Song Guo,Yuan Yuan,Junyu Gao
Main category: cs.CV
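TL;DR: This paper presents UNICBench, a unified multimodal counting benchmark and evaluation toolkit for comprehensively assessing the counting ability of multimodal large language models (MLLMs) across images, text, and audio.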
Details
Motivation: There is currently no unified multimodal counting dataset for rigorously evaluating MLLM counting ability across images, text, and audio.
Method: Builds the UNICBench benchmark covering images, documents, and audio, with a three-level capability taxonomy and difficulty tags, and evaluates 45 SOTA MLLMs across modalities under a standardized protocol.
Result: Models perform well on basic counting tasks but show significant gaps on reasoning tasks and the hardest subsets, exposing long-tail errors and substantial room for improvement.
Conclusion: UNICBench provides a rigorous, comparable basis for evaluating multimodal counting and releases an open toolkit to advance research in this direction.
Abstract: Counting is a core capability for multimodal large language models (MLLMs), yet there is no unified counting dataset to rigorously evaluate this ability across image, text, and audio. We present UNICBench, a unified multimodal, multi-level counting benchmark and evaluation toolkit with accurate ground truth, deterministic numeric parsing, and stratified reporting. The corpus comprises 5,300 images (5,508 QA), 872 documents (5,888 QA), and 2,069 audio clips (2,905 QA), annotated with a three-level capability taxonomy and difficulty tags. Under a standardized protocol with fixed splits/prompts/seeds and modality-specific matching rules, we evaluate 45 state-of-the-art MLLMs across modalities. Results show strong performance on some basic counting tasks but significant gaps on reasoning and the hardest partitions, highlighting long-tail errors and substantial headroom for improving general counting. UNICBench offers a rigorous and comparable basis for measurement and a public toolkit to accelerate progress.
[222] Data-Centric Benchmark for Label Noise Estimation and Ranking in Remote Sensing Image Segmentation
Keiller Nogueira,Codrut-Andrei Diaconu,Dávid Kerekes,Jakob Gawlikowski,Cédric Léonard,Nassim Ait Ali Braham,June Moh Goo,Zichao Zeng,Zhipeng Liu,Pallavi Jain,Andrea Nascetti,Ronny Hänsch
Main category: cs.CV
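TL;DR: This paper proposes a data-centric benchmark for identifying, quantifying, and ranking samples with label noise in semantic segmentation of remote sensing images, combining strategies based on model uncertainty, prediction consistency, and representation analysis that significantly outperform existing baselines.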
Details
Motivation: High-quality pixel-level annotation is costly and prone to noise, and annotation errors can severely degrade the performance and robustness of segmentation models, so reliable mechanisms are needed to identify and quantify label noise in training samples.
Method: Proposes a new data-centric benchmark, a public dataset, and two complementary strategies based on model uncertainty, prediction consistency, and representation analysis to identify, quantify, and rank training samples by label-noise level.
Result: The proposed methods consistently outperform existing baselines across a range of experimental settings.
Conclusion: The work provides effective tools for detecting and assessing label noise in remote sensing semantic segmentation, and releases the dataset and code.
Abstract: High-quality pixel-level annotations are essential for the semantic segmentation of remote sensing imagery. However, such labels are expensive to obtain and often affected by noise due to the labor-intensive and time-consuming nature of pixel-wise annotation, which makes it challenging for human annotators to label every pixel accurately. Annotation errors can significantly degrade the performance and robustness of modern segmentation models, motivating the need for reliable mechanisms to identify and quantify noisy training samples. This paper introduces a novel data-centric benchmark, together with a novel, publicly available dataset and two techniques for identifying, quantifying, and ranking training samples according to their level of label noise in remote sensing semantic segmentation. The proposed methods leverage complementary strategies based on model uncertainty, prediction consistency, and representation analysis, and consistently outperform established baselines across a range of experimental settings. The outcomes of this work are publicly available at https://github.com/keillernogueira/label_noise_segmentation.
[223] IdGlow: Dynamic Identity Modulation for Multi-Subject Generation
Honghao Cai,Xiangyuan Wang,Yunhao Bai,Tianze Zhou,Sijie Xu,Yuyang Hao,Zezhou Cui,Yuyuan Yang,Wei Zhu,Yibo Chen,Xu Tang,Yao Hu,Zhen Li
Main category: cs.CV
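TL;DR: IdGlow is a mask-free, two-stage multi-subject image generation framework built on Flow Matching diffusion models; through task-adaptive timestep scheduling, vision-language-model-driven prompt synthesis, and fine-grained group-level Direct Preference Optimization, it resolves the stability-plasticity conflict in multi-identity fusion.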
Details
Motivation: Existing multi-subject image generation methods struggle to balance identity stability with structural plasticity and perform poorly on tasks that need complex deformation (such as age transformation); this is the "stability-plasticity dilemma".
Method: Proposes IdGlow. The first stage is supervised fine-tuning (SFT), which introduces a linear-decay timestep schedule and a temporal gating mechanism to balance natural composition with identity fidelity, and uses a badcase-driven VLM for context-aware prompt synthesis. The second stage applies fine-grained group-level DPO with a weighted margin to remove multi-subject artifacts and improve texture harmony and identity fidelity.
Result: On two challenging benchmarks, direct multi-person fusion and age-transformed group generation, IdGlow substantially eases the stability-plasticity conflict and achieves a better Pareto balance between facial fidelity and commercial-grade aesthetic quality.
Conclusion: Through mask-free design, dynamic temporal control, and preference-driven joint optimization, IdGlow offers a new paradigm for controllable multi-subject image generation and overcomes the performance bottleneck of traditional methods under complex semantic deformation.
Abstract: Multi-subject image generation requires seamlessly harmonizing multiple reference identities within a coherent scene. However, existing methods relying on rigid spatial masks or localized attention often struggle with the "stability-plasticity dilemma," particularly failing in tasks that require complex structural deformations, such as identity-preserving age transformation. To address this, we present IdGlow, a mask-free, progressive two-stage framework built upon Flow Matching diffusion models. In the supervised fine-tuning (SFT) stage, we introduce task-adaptive timestep scheduling aligned with diffusion generative dynamics: a linear decay schedule that progressively relaxes constraints for natural group composition, and a temporal gating mechanism that concentrates identity injection within a critical semantic window, successfully preserving adult facial semantics without overriding child-like anatomical structures. To resolve attribute leakage and semantic ambiguity without explicit layout inputs, we further integrate a badcase-driven Vision-Language Model (VLM) for precise, context-aware prompt synthesis. In the second stage, we design a Fine-Grained Group-Level Direct Preference Optimization (DPO) with a weighted margin formulation to simultaneously eliminate multi-subject artifacts, elevate texture harmony, and recalibrate identity fidelity towards real-world distributions.
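The second-stage objective can be illustrated with a minimal Python sketch of a DPO-style loss with a margin term. The abstract does not give IdGlow's weighted margin formulation, so the `margin` argument, the `beta` default, and the function name are assumptions made for illustration; only the standard DPO log-sigmoid form is taken as known. A flow-matching model would also need a surrogate for the log-likelihood inputs, which is not shown here.

```python
import math

def dpo_margin_loss(policy_logp_win, policy_logp_lose,
                    ref_logp_win, ref_logp_lose,
                    beta=0.1, margin=0.0):
    """Standard DPO loss for one preference pair, with an assumed margin.

    Each argument is the log-likelihood of the preferred ("win") or
    rejected ("lose") sample under the trained policy or frozen reference.
    """
    # Implicit reward gap: how much more the policy favors the winner
    # than the reference model does.
    reward_gap = beta * ((policy_logp_win - ref_logp_win)
                         - (policy_logp_lose - ref_logp_lose))
    # Hypothetical margin: the gap must exceed `margin` before the loss is small.
    logit = reward_gap - margin
    # -log(sigmoid(logit)), written in a numerically stable form.
    if logit >= 0:
        return math.log1p(math.exp(-logit))
    return -logit + math.log1p(math.exp(logit))

# A policy identical to the reference gives log(2) when the margin is 0.
baseline_loss = dpo_margin_loss(-5.0, -6.0, -5.0, -6.0)
# A positive margin raises the loss for the same pair.
margin_loss = dpo_margin_loss(-5.0, -6.0, -5.0, -6.0, margin=0.5)
```

A per-pair margin is one plausible way to weight preference pairs differently; how the authors actually set or weight it is not stated in the abstract.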
Extensive experiments on two challenging benchmarks -- direct multi-person fusion and age-transformed group generation -- demonstrate that IdGlow fundamentally mitigates the stability-plasticity conflict, achieving a superior Pareto balance between state-of-the-art facial fidelity and commercial-grade aesthetic quality.
[224] Linking Modality Isolation in Heterogeneous Collaborative Perception
Changxing Liu,Zichen Chao,Siheng Chen
Main category: cs.CV
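TL;DR: This paper proposes CodeAlign, a cross-modal alignment framework that needs no co-occurring data; it uses feature-code-feature (FCF) translation to enable efficient collaborative perception across agents with heterogeneous modalities, sharply reducing parameter count and communication overhead while reaching SOTA performance on multiple datasets.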
Details
Motivation: In collaborative perception, modality heterogeneity across agents creates domain gaps; when different modalities never appear together in the same frame (modality isolation), conventional alignment methods that rely on supervision from spatially overlapping observations fail.
Method: Proposes the CodeAlign framework, which builds a codebook for each modality to map modality-specific features into a compact code space, and aligns modalities through FCF translation without spatial-correspondence constraints (feature → target-modality code → target-modality feature).
Result: On the OPV2V and DAIR-V2X datasets, when integrating three modalities, it needs only 8% of the training parameters of prior methods, reduces communication load by 1024x, and achieves the best perception performance.
Conclusion: CodeAlign is the first to achieve efficient cross-modal alignment without co-occurring samples; codebook regularization and FCF translation overcome modality isolation and offer a scalable, low-overhead paradigm for heterogeneous multi-agent collaborative perception.
Abstract: Collaborative perception leverages data exchange among multiple agents to enhance overall perception capabilities. However, heterogeneity across agents introduces domain gaps that hinder collaboration, and this is further exacerbated by an underexplored issue: modality isolation. It arises when multiple agents with different modalities never co-occur in any training data frame, enlarging cross-modal domain gaps. Existing alignment methods rely on supervision from spatially overlapping observations and thus fail to handle modality isolation. To address this challenge, we propose CodeAlign, the first efficient, co-occurrence-free alignment framework that smoothly aligns modalities via cross-modal feature-code-feature (FCF) translation. The key idea is to explicitly identify the representation consistency through a codebook, and directly learn mappings between modality-specific feature spaces, thereby eliminating the need for spatial correspondence. Codebooks regularize feature spaces into code spaces, providing compact yet expressive representations. With a prepared code space for each modality, CodeAlign learns FCF translations that map features to the corresponding codes of other modalities, which are then decoded back into features in the target code space, enabling effective alignment. Experiments show that, when integrating three modalities, CodeAlign requires only 8% of the training parameters of prior alignment methods, reduces communication load by 1024x, and achieves state-of-the-art perception performance on both the OPV2V and DAIR-V2X datasets.
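The feature-code-feature path described above can be pictured with a small Python sketch. It is a toy under stated assumptions and not the authors' implementation: the linear scoring translator, the function names, and all numeric values are invented for illustration, and a real system would learn both the codebooks and the translator.

```python
def translate_to_target_code(feature, translator_weights):
    """Score every target-modality code and return the best index.

    `translator_weights` is a hypothetical stand-in for a learned translator:
    one weight row per entry in the target codebook.
    """
    scores = [sum(w * f for w, f in zip(row, feature)) for row in translator_weights]
    return max(range(len(scores)), key=lambda i: scores[i])

def decode_code(code_index, target_codebook):
    """Look up the target-modality feature vector for a code index."""
    return target_codebook[code_index]

# Toy values, invented for illustration.
source_feature = [1.0, 0.0]  # feature from the source modality
translator_weights = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.3]]
target_codebook = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.5, 0.5, 0.5]]

# Feature -> target-modality code -> target-modality feature.
code_index = translate_to_target_code(source_feature, translator_weights)
target_feature = decode_code(code_index, target_codebook)
```

In this toy setup a single integer stands in for a whole feature vector, which suggests how a code-based exchange could be compact; the 1024x figure in the abstract is the authors' reported result and is not derived here.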
Code will be released on https://github.com/cxliu0314/CodeAlign.
[225] Exploring Spatiotemporal Feature Propagation for Video-Level Compressive Spectral Reconstruction: Dataset, Model and Benchmark
Lijing Cai,Zhan Shi,Chenglong Huang,Jinyao Wu,Qiping Li,Zikang Huo,Linsen Chen,Chongde Zi,Xun Cao
Main category: cs.CV
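TL;DR: This paper proposes PG-SVRT, a reconstruction method for dynamic hyperspectral video, and builds DynaSpec, the first high-quality dynamic hyperspectral image dataset; using spatial-then-temporal attention and a bridged-token design, it reduces computational cost while maintaining reconstruction quality, spectral fidelity, and temporal consistency.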
Details
Motivation: Existing spectral compressive imaging (SCI) reconstruction methods mostly work on single frames: the encoding process masks spatial-spectral features, a single compressed measurement cannot reliably recover the missing information, and frame-by-frame reconstruction causes temporal inconsistency.
Method: Builds DynaSpec, the first dynamic hyperspectral image dataset; proposes the Propagation-Guided Spectral Video Reconstruction Transformer (PG-SVRT), which applies spatial-then-temporal attention and introduces a bridged token to reduce computational complexity; and builds a DD-CASSI prototype system for real-world data acquisition and validation.
Result: PG-SVRT outperforms existing methods in reconstruction quality, spectral fidelity, and temporal consistency with the lowest FLOPs; both simulation and real-system experiments validate its effectiveness.
Conclusion: Lifting spectral reconstruction from the image level to the video level is feasible and effective; by exploiting inter-frame complementarity and temporal continuity, PG-SVRT markedly improves dynamic spectral video reconstruction.
Abstract: Recently, Spectral Compressive Imaging (SCI) has achieved remarkable success, unlocking significant potential for dynamic spectral vision. However, existing reconstruction methods, primarily image-based, suffer from two limitations: (i) the encoding process masks spatial-spectral features, leading to uncertainty in reconstructing missing information from single compressed measurements, and (ii) the frame-by-frame reconstruction paradigm fails to ensure temporal consistency, which is crucial in video perception. To address these challenges, this paper seeks to advance spectral reconstruction from the image level to the video level, leveraging the complementary features and temporal continuity across adjacent frames in dynamic scenes. Initially, we construct the first high-quality dynamic hyperspectral image dataset (DynaSpec), comprising 30 sequences obtained through frame-scanning acquisition. Subsequently, we propose the Propagation-Guided Spectral Video Reconstruction Transformer (PG-SVRT), which employs spatial-then-temporal attention to effectively reconstruct spectral features from abundant video information, while using a bridged token to reduce computational complexity. Finally, we conduct simulation experiments to assess the performance of four SCI systems, and construct a DD-CASSI prototype for real-world data collection and benchmarking. Extensive experiments demonstrate that PG-SVRT achieves superior performance in reconstruction quality, spectral fidelity, and temporal consistency, while maintaining minimal FLOPs.
Project page: https://github.com/nju-cite/DynaSpec
[226] Position: Evaluation of Visual Processing Should Be Human-Centered, Not Metric-Centered
Jinfan Hu,Fanghua Yu,Zhiyuan You,Xiang Yin,Hongyu An,Xinqi Lin,Chao Dong,Jinjin Gu
Main category: cs.CV
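TL;DR: This paper argues that the evaluation of modern visual processing systems should no longer rely primarily on single-metric image quality assessment benchmarks and should instead move toward a more human-centered, context-aware, and fine-grained evaluation paradigm.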
Details
Motivation: Single-metric image quality assessment (IQA) benchmarks are increasingly out of step with human perception and user preferences, which may constrain innovation and misguide research.
Method: Argues for rebalancing evaluation paradigms, emphasizing human-centered, context-aware, and fine-grained assessment of visual model outputs.
Result: Calls on the visual processing community to shift its evaluation paradigm to better match human perception and real application needs.
Conclusion: Objective metrics should not be abandoned entirely, but they need to be combined with evaluation approaches closer to human perception and application scenarios so that the evaluation system can be rebuilt and upgraded.
Abstract: This position paper argues that the evaluation of modern visual processing systems should no longer be driven primarily by single-metric image quality assessment benchmarks, particularly in the era of generative and perception-oriented methods. Image restoration exemplifies this divergence: while objective IQA metrics enable reproducible, scalable evaluation, they have increasingly drifted apart from human perception and user preferences. We contend that this mismatch risks constraining innovation and misguiding research progress across visual processing tasks. Rather than rejecting metrics altogether, this paper calls for a rebalancing of evaluation paradigms, advocating a more human-centered, context-aware, and fine-grained approach to assessing visual models' outcomes.
[227] Exploring 3D Dataset Pruning
Xiaohan Zhao,Xinyi Shang,Jiacheng Liu,Zhiqiang Shen
Main category: cs.CV
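TL;DR: This paper proposes a pruning method for 3D datasets that uses representation-aware subset selection and prior-invariant teacher supervision to resolve the conflict between the OA and mAcc metrics under long-tail distributions and improve both together.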
Details
Motivation: Existing dataset pruning methods mainly target 2D images, and 3D data pruning is little studied; 3D data commonly have long-tail class distributions, which puts the optimization targets of the common metrics OA and mAcc in conflict and makes pruning harder.
Method: Formulates pruning as approximating the full-data expected risk with a weighted subset, identifying two key errors, coverage error and prior-mismatch bias; then proposes representation-aware subset selection (with per-class retention quotas) and prior-invariant teacher supervision based on calibrated soft labels and embedding-geometry distillation.
Result: Experiments on multiple 3D datasets show that the method improves OA and mAcc together across settings and supports adjusting the trade-off between them according to downstream preferences.
Conclusion: This is the first systematic study of 3D dataset pruning; the proposed framework effectively eases the metric conflict caused by long-tail distributions and offers a new approach to efficient 3D model training.
Abstract: Dataset pruning has been widely studied for 2D images to remove redundancy and accelerate training, while pruning methods specific to 3D data remain largely unexplored. In this work, we study dataset pruning for 3D data, where the commonly observed long-tail class distribution makes optimization under the conventional evaluation metrics Overall Accuracy (OA) and Mean Accuracy (mAcc) inherently conflicting and makes pruning particularly challenging. To address this, we formulate pruning as approximating the full-data expected risk with a weighted subset, which reveals two key errors: coverage error from insufficient representativeness and prior-mismatch bias from inconsistency between subset-induced class weights and target metrics. We propose representation-aware subset selection with per-class retention quotas for long-tail coverage, and prior-invariant teacher supervision using calibrated soft labels and embedding-geometry distillation. The retention quota also serves as a switch to control the OA-mAcc trade-off. Extensive experiments on 3D datasets show that our method can improve both metrics across multiple settings while adapting to different downstream preferences. Our code is available at https://github.com/XiaohanZhao123/3D-Dataset-Pruning.
[228] RC-GeoCP: Geometric Consensus for Radar-Camera Collaborative Perception
Xiaokai Bai,Lianqing Zheng,Runwei Guan,Siyuan Cao,Huiliang Shen
Main category: cs.CV
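TL;DR: This paper proposes RC-GeoCP, the first framework to fuse 4D radar and images in collaborative perception; it resolves cross-agent alignment through radar-anchored geometric consensus and designs geometric structure rectification, uncertainty-aware communication, and consensus-driven assembly modules, achieving SOTA performance with significantly lower communication overhead on a newly built radar-camera collaborative perception benchmark.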
Details
Motivation: LiDAR systems are costly and degrade in adverse weather, while the potential of collaborative perception with cameras and 4D radar has not been fully explored in multi-agent settings.
Method: Proposes the RC-GeoCP framework with three parts: (1) Geometric Structure Rectification (GSR), which aligns visual semantics using radar geometry; (2) Uncertainty-Aware Communication (UAC), which selects features to transmit based on conditional entropy reduction; and (3) Consensus-Driven Assembler (CDA), which aggregates multi-agent information through shared geometric anchors.
Result: Achieves SOTA performance on the newly built radar-camera collaborative perception benchmarks on V2X-Radar and V2X-R while significantly reducing communication overhead.
Conclusion: Radar and cameras are complementary in collaborative perception, and building consensus with radar as the geometric anchor is an effective paradigm for improving the robustness and efficiency of multi-agent perception.
Abstract: Collaborative perception (CP) enhances scene understanding through multi-agent information sharing. While LiDAR-centric systems offer precise geometry, high costs and performance degradation in adverse weather necessitate multi-modal alternatives. Although cameras provide dense visual semantics and 4D radar provides robust spatial measurements, the synergy between them remains underexplored in collaborative settings. This work introduces RC-GeoCP, the first framework to explore the fusion of 4D radar and images in CP. To resolve misalignment caused by depth ambiguity and spatial dispersion across agents, RC-GeoCP establishes a radar-anchored geometric consensus. Specifically, Geometric Structure Rectification (GSR) aligns visual semantics with geometry derived from radar to generate spatially grounded, geometry-consistent representations. Uncertainty-Aware Communication (UAC) formulates selective transmission as a conditional entropy reduction process to prioritize informative features based on inter-agent disagreement. Finally, the Consensus-Driven Assembler (CDA) aggregates multi-agent information via shared geometric anchors to form a globally coherent representation. We establish the first unified radar-camera CP benchmark on V2X-Radar and V2X-R, demonstrating state-of-the-art performance with significantly reduced communication overhead. Code will be released soon.
[229] Stateful Cross-layer Vision Modulation
Ying Liu,Yudong Han,Kean Shi,Liyuan Pan
Main category: cs.CV
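TL;DR: This paper proposes a cross-layer memory-modulated vision framework (SCVM) that introduces a recursively updated cross-layer memory state and a layer-wise feedback modulation mechanism into the vision encoder to control how visual representations evolve, mitigating fine-grained information loss and semantic distribution mismatch and improving multimodal large language model performance without modifying or fine-tuning the language model.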
Details
Motivation: Existing MLLMs use static fusion, so fine-grained information from early layers is suppressed during hierarchical abstraction, and feeding shallow-layer features directly to the LLM tends to cause semantic distribution mismatch that requires extra adaptation or fine-tuning.
Method: Proposes the SCVM framework: (1) a recursively updated cross-layer memory state inside the vision encoder to model long-range inter-layer dependencies; (2) a layer-wise feedback modulation mechanism that refreshes each layer's token representations from the accumulated memory; and (3) an auxiliary semantic alignment objective that supervises the final memory state to progressively compress and reinforce task-relevant information.
Result: Achieves consistent gains on multiple visual question answering and hallucination evaluation benchmarks without increasing the number of visual tokens, adding vision encoders, or modifying or fine-tuning the language model.
Conclusion: By controlling the evolution of visual representations rather than fusing them afterward, SCVM effectively mitigates information loss and semantic mismatch and offers a new paradigm for visual representation learning in MLLMs.
Abstract: Recent multimodal large language models (MLLMs) widely adopt multi-layer visual feature fusion to enhance visual representation. However, existing approaches typically perform static concatenation or weighted aggregation after visual encoding, without intervening in the representation formation process itself. As a result, fine-grained details from early layers may be progressively suppressed during hierarchical abstraction. Moreover, directly introducing shallow-layer features into the language model often leads to semantic distribution mismatch with the visual feature space that the LLM's cross-attention layers were pretrained on, which typically requires additional adaptation or fine-tuning of the LLM. To address these limitations, we revisit visual representation learning from the perspective of representation evolution control and propose a cross-layer memory-modulated vision framework (SCVM). Specifically, we introduce a recursively updated cross-layer memory state inside the vision encoder to model long-range inter-layer dependencies. We further design a layer-wise feedback modulation mechanism that refreshes token representations at each layer based on the accumulated memory, thereby structurally regulating the representation evolution trajectory. In addition, we incorporate an auxiliary semantic alignment objective that explicitly supervises the final memory state, encouraging progressive compression and reinforcement of task-relevant information.
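The idea of a memory state carried across encoder layers can be sketched in a few lines of Python. The abstract does not specify SCVM's update or modulation rules, so everything concrete below is an assumption chosen for illustration: the exponential-moving-average memory update, the additive feedback, the `update_rate` and `feedback_scale` parameters, and the toy scaling "layers".

```python
def mean_token(tokens):
    """Average a list of token vectors into one summary vector."""
    count = len(tokens)
    return [sum(column) / count for column in zip(*tokens)]

def run_encoder_with_memory(tokens, layers, update_rate=0.5, feedback_scale=0.1):
    """Apply layers in order while carrying a memory vector across them."""
    memory = [0.0] * len(tokens[0])
    for layer in layers:
        tokens = [layer(token) for token in tokens]
        # Recursive update (assumed form): blend old memory with this layer's summary.
        summary = mean_token(tokens)
        memory = [(1 - update_rate) * m + update_rate * s
                  for m, s in zip(memory, summary)]
        # Feedback modulation (assumed form): shift every token by the memory.
        tokens = [[t + feedback_scale * m for t, m in zip(token, memory)]
                  for token in tokens]
    return tokens, memory

# Two toy "layers" that only scale each token.
layers = [lambda token: [2.0 * value for value in token],
          lambda token: [0.5 * value for value in token]]
tokens, memory = run_encoder_with_memory([[1.0, 0.0], [0.0, 1.0]], layers)
```

The number and shape of tokens are unchanged by the loop, which matches the abstract's claim that the method does not expand visual tokens; the final `memory` is the quantity an auxiliary alignment objective could supervise.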
Experimental results on multiple visual question answering and hallucination evaluation benchmarks demonstrate that SCVM achieves consistent performance improvements without expanding visual tokens, introducing additional vision encoders, or modifying or fine-tuning the language model.
[230] Act Like a Pathologist: Tissue-Aware Whole Slide Image Reasoning
Wentao Huang,Weimin Lyu,Peiliang Lou,Qingqiao Hu,Xiaoling Hu,Shahira Abousamra,Wenchao Han,Ruifeng Guo,Jiawei Zhou,Chao Chen,Chen Wang
Main category: cs.CV
TL;DR: 本文提出HistoSelect框架,通过问题引导、组织感知和粗到细的检索策略,模拟病理学家阅片方式,在减少70%视觉token的同时提升病理问答准确率。
Details
Motivation: 现有模型无法像病理学家一样根据临床问题选择性聚焦关键区域,导致注意力分散且效率低下。 Method: 提出HistoSelect框架,包含两阶段:1)组采样器识别问题相关的组织区域;2)补丁选择器在这些区域内检索最具信息量的图像块。 Result: 在35.6万问答对上验证,视觉token使用减少70%,三个病理问答任务准确率均提升,并生成可解释、符合病理学家判断的定位结果。 Conclusion: 将人类式的搜索与注意模式引入全切片图像(WSI)推理,是构建实用可靠病理视觉语言模型的重要方向。 Abstract: Computational pathology has advanced rapidly in recent years, driven by domain-specific image encoders and growing interest in using vision-language models to answer natural-language questions about diseases. Yet, the core problem behind pathology question-answering remains unsolved, considering that a gigapixel slide contains far more information than necessary for a given question. Pathologists naturally navigate tissue and morphology complexity by scanning broadly, and zooming in selectively according to the clinical questions. Current models, in contrast, rely on uniform patch sampling or broad attention maps, often attending equally to irrelevant regions while overlooking key visual evidence. In this work, we try to bring models closer to how humans actually examine slides. We propose a question-guided, tissue-aware, and coarse-to-fine retrieval framework, HistoSelect, that consists of two key components: a group sampler that identifies question-relevant tissue regions, followed by a patch selector that retrieves the most informative patches within those regions. By selecting only the most informative patches, our method becomes significantly more efficient: reducing visual token usage by 70% on average, while improving accuracy across three pathology QA tasks. Evaluated on 356,000 question-answer pairs, our approach outperforms existing methods and produces answers grounded in interpretable, pathologist-consistent regions. Our results suggest that bringing human-like search and attention patterns into WSI reasoning is a promising direction for building practical and reliable pathology VLMs.[231] Direct low-field MRI super-resolution using undersampled k-space
Daniel Tweneboah Anyimadu,Mohammed M. Abdelsamea,Ahmed Karam Eldaly
Main category: cs.CV
TL;DR: 本文提出了一种基于k空间的双通道U-Net框架,直接从欠采样的低场MRI k空间数据重建高质量高场MRI-like图像,在脑部低场MRI实验中显著优于空间域方法,并达到与全k空间采集相当的图像质量。
Details
Motivation: 低场MRI成本低但成像时间长、图像质量差;现有加速方法(如k空间欠采样)与超分辨率/图像质量迁移(SR/IQT)多在空间域进行,缺乏直接在k空间实现高质量重建的有效方法。 Method: 提出k空间双通道U-Net,分别处理欠采样k空间的实部和虚部,以恢复缺失的频率信息,实现端到端的高场MRI-like图像重建。 Result: 在低场脑MRI数据上,所提k空间方法在图像质量上持续优于对应的空间域方法;由欠采样k空间重建的图像质量可媲美全k空间采集结果。 Conclusion: 首次实现了从欠采样低场k空间直接进行超分辨率与图像质量迁移,验证了k空间驱动重建的有效性与优越性,为低场MRI临床实用化提供了新路径。 Abstract: Low-field magnetic resonance imaging (MRI) provides affordable access to diagnostic imaging but suffers from prolonged acquisition and limited image quality. Accelerated imaging can be achieved with k-space undersampling, while super-resolution (SR) and image quality transfer (IQT) methods typically rely on spatial-domain post-processing. In this work, we propose a novel framework for reconstructing high-field MR like images directly from undersampled low-field k-space. Our approach employs a k-space dual channel U-Net that processes the real and imaginary components of undersampled k-space to restore missing frequency content. Experiments on low-field brain MRI demonstrate that our k-space-driven image enhancement consistently outperforms the counterpart spatial-domain method. Furthermore, reconstructions from undersampled k-space achieve image quality comparable to full k-space acquisitions. To the best of our knowledge, this is the first work that investigates low-field MRI SR/IQT directly from undersampled k-space.[232] Specializing Foundation Models via Mixture of Low-Rank Experts for Comprehensive Head CT Analysis
Youngjin Yoo,Han Liu,Bogdan Georgescu,Yanbo Zhang,Sasa Grbic,Michael Baumgartner,Thomas J. Re,Jyotipriya Das,Poikavila Ullaskrishnan,Eva Eibenberger,Andrei Chekkoury,Uttam K. Bodanapally,Savvas Nicolaou,Pina C. Sanelli,Thomas J. Schroeppel,Yvonne W. Lui,Eli Gibson
Main category: cs.CV
TL;DR: 本文提出了一种名为MoLRE的混合低秩专家框架,用于提升基础模型在多标签头颅CT诊断任务中的性能,通过无监督软路由和多个专用低秩适配器实现条件特征自适应,在仅增加<0.5%参数下显著提升检测AUC,尤其对通用及医学领域基础模型效果更优。
Details
Motivation: Existing parameter-efficient fine-tuning methods (such as LoRA) apply a uniform adaptation in complex multi-label medical diagnosis tasks (such as multi-finding detection in head CT), which copes poorly with the heterogeneity of pathology types and limits performance gains.
Method: Proposes the Mixture of Low-Rank Experts (MoLRE) framework, which extends LoRA with multiple low-rank expert adapters and an unsupervised soft routing mechanism, enabling conditional feature adaptation without explicit pathology labels.
Result: Validated on 6 SOTA medical imaging foundation models (2D and 3D; general-domain, medical-domain, and head CT-specific; 7M to 431M parameters), MoLRE improves every model on detection of 75 head CT findings; DINOv3-Base and MedGemma gain +4.6% and +4.3% respectively, and MoLRE with MedGemma reaches the highest average AUC of 0.917.
Conclusion: MoLRE is a lightweight, general adaptation method that needs no extra supervision and markedly improves foundation models on clinical multi-label diagnosis; the systematic benchmark reveals non-obvious interactions among pretraining domain, architecture, and scale, underscoring the importance of evaluating on the specific clinical task.
Abstract: Foundation models pre-trained on large-scale datasets demonstrate strong transfer learning capabilities; however, their adaptation to complex multi-label diagnostic tasks, such as comprehensive head CT finding detection, remains understudied. Standard parameter-efficient fine-tuning methods such as LoRA apply uniform adaptations across pathology types, which may limit performance for diverse medical findings. We propose a Mixture of Low-Rank Experts (MoLRE) framework that extends LoRA with multiple specialized low-rank adapters and unsupervised soft routing. This approach enables conditional feature adaptation with less than 0.5% additional parameters and without explicit pathology supervision. We present a comprehensive benchmark of MoLRE across six state-of-the-art medical imaging foundation models spanning 2D and 3D architectures, general-domain, medical-domain, and head CT-specific pretraining, and model sizes ranging from 7M to 431M parameters. Using over 70,000 non-contrast head CT scans with 75 annotated findings (including hemorrhage, infarction, trauma, mass lesions, structural abnormalities, and chronic changes), our experiments demonstrate consistent performance improvements across all models. Gains vary substantially: general-purpose and medical-domain models show the largest improvements (DINOv3-Base: +4.6%; MedGemma: +4.3%), whereas 3D CT-specialized or very large models show more modest gains (+0.2% to +1.3%). The combination of MoLRE and MedGemma achieves the highest average detection AUC of 0.917.
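A mixture of low-rank experts with soft routing can be sketched in plain Python as a frozen linear layer plus a gate-weighted sum of LoRA-style updates. This is a minimal illustration under assumptions, not the authors' implementation: the linear router, the softmax gating, the omission of LoRA's scaling factor, and all names and toy values are hypothetical.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def softmax(scores):
    """Convert router scores into gates that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def molre_forward(x, frozen_weight, experts, router_weight):
    """Frozen layer output plus a soft-routed sum of low-rank expert updates.

    Each expert is a (down, up) pair of matrices, as in LoRA's A and B.
    """
    gates = softmax(matvec(router_weight, x))  # one gate per expert
    output = matvec(frozen_weight, x)          # frozen pretrained path
    for gate, (down, up) in zip(gates, experts):
        update = matvec(up, matvec(down, x))   # rank-limited update
        output = [o + gate * u for o, u in zip(output, update)]
    return output, gates

# Toy setup: identity frozen weight and two rank-1 experts.
x = [1.0, 2.0]
frozen_weight = [[1.0, 0.0], [0.0, 1.0]]
experts = [([[1.0, 0.0]], [[0.5], [0.0]]),
           ([[0.0, 1.0]], [[0.0], [0.5]])]
router_weight = [[1.0, 0.0], [0.0, 1.0]]
output, gates = molre_forward(x, frozen_weight, experts, router_weight)
```

Because the gates depend only on the input `x`, the routing needs no pathology labels, which is the sense in which the abstract calls it unsupervised; in practice the router and experts would be trained jointly on the downstream loss.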
These findings highlight the importance of systematic benchmarking on target clinical tasks, as pretraining domain, architecture, and model scale interact in non-obvious ways.[233] CoLC: Communication-Efficient Collaborative Perception with LiDAR Completion
Yushan Han,Hui Zhang,Qiming Xia,Yi Jin,Yidong Li
Main category: cs.CV
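TL;DR: This paper proposes CoLC, a communication-efficient early-fusion collaborative perception framework. LiDAR completion and three complementary designs (FAPS sampling, CEEF fusion, DGDA alignment) restore scene completeness under low bandwidth, balancing perception performance against communication cost while remaining robust under heterogeneous models.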
Details
Motivation: Early fusion offers perceptual complementarity and robustness to model heterogeneity, but its high communication overhead limits practical deployment. Most existing work therefore turns to intermediate or late fusion and gives up the advantages of early fusion.
Method: Proposes the CoLC framework, comprising: 1) Foreground-Aware Point Sampling (FAPS) for sparse but informative point cloud sampling; 2) Completion-Enhanced Early Fusion (CEEF), which completes pillars from sparse inputs and fuses them; 3) Dense-Guided Dual Alignment (DGDA), which enforces semantic and geometric consistency between completed and dense features during training.
Result: Experiments on simulated and real-world datasets show that CoLC achieves a better perception-communication trade-off and stays robust under heterogeneous model settings.
Conclusion: CoLC achieves high-quality early collaborative perception at low communication cost, offering a feasible option for practical vehicular collaborative systems.
Abstract: Collaborative perception empowers autonomous agents to share complementary information and overcome perception limitations. While early fusion offers more perceptual complementarity and is inherently robust to model heterogeneity, its high communication cost has limited its practical deployment, prompting most existing works to favor intermediate or late fusion. To address this, we propose a communication-efficient early Collaborative perception framework that incorporates LiDAR Completion to restore scene completeness under sparse transmission, dubbed CoLC. Specifically, CoLC integrates three complementary designs. First, each neighbor agent applies Foreground-Aware Point Sampling (FAPS) to selectively transmit informative points that retain essential structural and contextual cues under bandwidth constraints. The ego agent then employs Completion-Enhanced Early Fusion (CEEF) to reconstruct dense pillars from the received sparse inputs and adaptively fuse them with its own observations, thereby restoring spatial completeness. Finally, the Dense-Guided Dual Alignment (DGDA) strategy enforces semantic and geometric consistency between the enhanced and dense pillars during training, ensuring consistent and robust feature learning. Experiments on both simulated and real-world datasets demonstrate that CoLC achieves superior perception-communication trade-offs and remains robust under heterogeneous model settings. The code is available at https://github.com/CatOneTwo/CoLC.
[234] SCOUT: Fast Spectral CT Imaging in Ultra LOw-data Regimes via PseUdo-label GeneraTion
Guoquan Wei,Liu Shi,Shaoyu Wang,Mohan Li,Cunfeng Wei,Qiegen Liu
Main category: cs.CV
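TL;DR: This paper proposes a CT reconstruction method for ultra-low raw data conditions that needs no external data and no pre-training. It uses spatial nonlocal similarity and the conjugate properties of the projection domain to generate pseudo-3D data for self-supervised training, achieving fast, high-fidelity reconstruction while suppressing ring artifacts and improving detail recovery.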
Details
Motivation: Existing CT reconstruction methods either take too long to reconstruct or rely too heavily on data-driven models, overlooking the valuable information contained in the raw medical 3D data itself.
Method: Proposes a self-supervised reconstruction method based on spatial nonlocal similarity and the conjugate properties of the projection domain; it generates pseudo-3D data so that reconstruction under ultra-low raw data conditions needs neither external data nor pre-training.
Result: The method achieves high-fidelity reconstruction in a very short time, suppresses detector-induced ring artifacts, and shows what the authors describe as unprecedented detail recovery.
Conclusion: The method offers a new paradigm for research that uses only unlabeled raw projection data.
Abstract: Noise and artifacts during computed tomography (CT) scans are a fundamental challenge affecting disease diagnosis. However, current methods either involve excessively long reconstruction times or rely on data-driven models for optimization, failing to adequately consider the valuable information inherent in the data itself, especially medical 3D data. This work proposes a reconstruction method under ultra-low raw data conditions, requiring no external data and avoiding lengthy pre-training processes. By leveraging spatial nonlocal similarity and the conjugate properties of the projection domain to generate pseudo-3D data for self-supervised training, high-fidelity results can be achieved in a very short time. Extensive experiments demonstrate that this method not only mitigates detector-induced ring artifacts but also exhibits unprecedented capabilities in detail recovery. This method provides a new paradigm for research using unlabeled raw projection data. Code is available at https://github.com/yqx7150/SCOUT.
[235] STMI: Segmentation-Guided Token Modulation with Cross-Modal Hypergraph Interaction for Multi-Modal Object Re-Identification
Xingguo Xu,Zhanyu Liu,Weixiang Zhou,Yuansheng Gao,Junjie Cao,Yuhao Wang,Jixiang Luo,Dell Zhang
Main category: cs.CV
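TL;DR: This paper proposes the STMI framework, which improves multi-modal object re-identification through three modules: segmentation-guided feature modulation, semantic token reallocation, and cross-modal hypergraph interaction.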
Details
Motivation: Existing methods rely on hard token filtering or simple fusion strategies, which tend to lose discriminative cues and increase background interference.
Method: Proposes the STMI framework, comprising: (1) a Segmentation-Guided Feature Modulation (SFM) module that uses SAM-generated masks to enhance foreground representations and suppress background noise; (2) a Semantic Token Reallocation (STR) module that uses learnable query tokens and an adaptive reallocation mechanism to extract compact, informative representations; (3) a Cross-Modal Hypergraph Interaction (CHI) module that builds a unified hypergraph across modalities to capture high-order semantic relationships.
Result: Experiments on the public benchmarks RGBNT201, RGBNT100, and MSVR310 confirm the effectiveness and robustness of STMI.
Conclusion: STMI mitigates the loss of discriminative cues and the background interference seen in multi-modal ReID, and significantly improves retrieval performance.
Abstract: Multi-modal object Re-Identification (ReID) aims to exploit complementary information from different modalities to retrieve specific objects. However, existing methods often rely on hard token filtering or simple fusion strategies, which can lead to the loss of discriminative cues and increased background interference. To address these challenges, we propose STMI, a novel multi-modal learning framework consisting of three key components: (1) Segmentation-Guided Feature Modulation (SFM) module leverages SAM-generated masks to enhance foreground representations and suppress background noise through learnable attention modulation; (2) Semantic Token Reallocation (STR) module employs learnable query tokens and an adaptive reallocation mechanism to extract compact and informative representations without discarding any tokens; (3) Cross-Modal Hypergraph Interaction (CHI) module constructs a unified hypergraph across modalities to capture high-order semantic relationships. Extensive experiments on public benchmarks (i.e., RGBNT201, RGBNT100, and MSVR310) demonstrate the effectiveness and robustness of our proposed STMI framework in multi-modal ReID scenarios.
[236] TokenSplat: Token-aligned 3D Gaussian Splatting for Feed-forward Pose-free Reconstruction
Yihui Li,Chengxin Lv,Zichen Tang,Hongyu Yang,Di Huang
Main category: cs.CV
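TL;DR: TokenSplat is a feed-forward framework that jointly reconstructs 3D Gaussians and estimates camera poses from unposed multi-view images. Its core components are a token-aligned Gaussian prediction module and an asymmetric dual-flow decoder, which improve reconstruction quality and pose estimation accuracy.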
Details
Motivation: Address the challenge of jointly performing 3D reconstruction and camera pose estimation from unposed multi-view images, improving robustness and accuracy especially when no initial pose prior is available.
Method: Proposes a Token-aligned Gaussian Prediction module that aligns cross-view semantic information in feature space; introduces learnable camera tokens and an Asymmetric Dual-Flow Decoder (ADF-Decoder) that enforces directionally constrained communication between camera and image tokens, keeping a clean factorization within a feed-forward architecture.
Result: In the pose-free setting, it significantly improves 3D reconstruction fidelity, novel-view synthesis quality, and camera pose estimation accuracy, outperforming existing pose-free methods.
Conclusion: TokenSplat's end-to-end feed-forward design achieves high-quality joint reconstruction and pose estimation without iterative optimization, balancing efficiency and performance.
Abstract: We present TokenSplat, a feed-forward framework for joint 3D Gaussian reconstruction and camera pose estimation from unposed multi-view images. At its core, TokenSplat introduces a Token-aligned Gaussian Prediction module that aligns semantically corresponding information across views directly in the feature space. Guided by coarse token positions and fusion confidence, it aggregates multi-scale contextual features to enable long-range cross-view reasoning and reduce redundancy from overlapping Gaussians. To further enhance pose robustness and disentangle viewpoint cues from scene semantics, TokenSplat employs learnable camera tokens and an Asymmetric Dual-Flow Decoder (ADF-Decoder) that enforces directionally constrained communication between camera and image tokens. This maintains clean factorization within a feed-forward architecture, enabling coherent reconstruction and stable pose estimation without iterative refinement. Extensive experiments demonstrate that TokenSplat achieves higher reconstruction fidelity and novel-view synthesis quality in pose-free settings, and significantly improves pose estimation accuracy compared to prior pose-free methods. Project page: https://kidleyh.github.io/tokensplat/.
[237] Towards Universal Khmer Text Recognition
Marry Kong,Rina Buoy,Sovisal Chenda,Nguonly Taing,Masakazu Iwamura,Koichi Kise
Main category: cs.CV
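TL;DR: This paper proposes a universal Khmer text recognition (UKTR) framework. Its modality-aware adaptive feature selection (MAFS) technique handles printed, handwritten, and scene text in one model and improves cross-modality generalization under data scarcity. The paper also releases the first comprehensive benchmark for universal Khmer OCR.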
Details
Motivation: Khmer OCR faces low resources, a complex script, and imbalanced data across modalities (printed, handwritten, scene text). Existing modality-specific models cannot share knowledge and are costly to deploy, while naive mixed training hurts the underrepresented modalities.
Method: Proposes the universal Khmer text recognition (UKTR) framework, built around a modality-aware adaptive feature selection (MAFS) mechanism that adjusts visual feature extraction according to the modality of the input image for robust cross-modality recognition; also builds and releases the first comprehensive benchmark for universal Khmer OCR.
Result: Achieves SOTA performance across multiple modalities and releases the first universal Khmer OCR benchmark dataset and models.
Conclusion: UKTR and its MAFS method mitigate data scarcity and modality imbalance in multi-modality OCR for a low-resource language, offering a scalable, lightweight, and robust approach to general document understanding.
Abstract: Khmer is a low-resource language characterized by a complex script, presenting significant challenges for optical character recognition (OCR). While document printed text recognition has advanced because of available datasets, performance on other modalities, such as handwritten and scene text, remains limited by data scarcity. Training modality-specific models for each modality does not allow cross-modality transfer learning, from which modalities with limited data could otherwise benefit. Moreover, deploying many modality-specific models results in significant memory overhead and requires error-prone routing of each input image to the appropriate model. On the other hand, simply training on a combined dataset with a non-uniform data distribution across different modalities often leads to degraded performance on underrepresented modalities. To address these issues, we propose a universal Khmer text recognition (UKTR) framework capable of handling diverse text modalities. Central to our method is a novel modality-aware adaptive feature selection (MAFS) technique designed to adapt visual features according to a particular input image modality and enhance recognition robustness across modalities. Extensive experiments demonstrate that our model achieves state-of-the-art (SoTA) performance. Furthermore, we introduce the first comprehensive benchmark for universal Khmer text recognition, which we release to the community to facilitate future research. Our datasets and models can be accessed via this gated repository (in review).
[238] Towards Khmer Scene Document Layout Detection
Marry Kong,Rina Buoy,Sovisal Chenda,Nguonly Taing,Masakazu Iwamura,Koichi Kise
Main category: cs.CV
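TL;DR: This paper presents the first comprehensive study of Khmer scene document layout detection, contributing a dedicated dataset, an open-source document augmentation tool, and layout detection baselines based on YOLO with oriented bounding boxes.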
Details
Motivation: Document layout analysis for Latin scripts has advanced significantly, but for Khmer it remains severely limited by the scarcity of annotated data, especially for scene documents with perspective distortion and complex backgrounds. Khmer script is also structurally complex (e.g., diacritics and multi-layer character stacking), so existing Latin-based models struggle to delineate semantic layout units in dense text regions.
Method: Proposes a framework with three parts: (1) a robust training and benchmarking dataset dedicated to Khmer scene layouts; (2) an open-source document augmentation tool that synthesizes realistic scene documents to expand training data; (3) layout detection baselines built on YOLO architectures with oriented bounding boxes (OBB) to handle geometric distortion.
Result: Builds the first benchmark dataset and augmentation tool for Khmer scene document layout detection and provides OBB-based baselines that handle geometric distortion; all models, code, and datasets are to be released publicly.
Conclusion: The work fills a research gap in Khmer document layout analysis and gives the Khmer document analysis and recognition (DAR) community key infrastructure and reproducible baselines, advancing document intelligence for low-resource languages.
Abstract: While document layout analysis for Latin scripts has advanced significantly, driven by the advent of large multimodal models (LMMs), progress for the Khmer language remains constrained because of the scarcity of annotated training data. This gap is particularly acute for scene documents, where perspective distortions and complex backgrounds challenge traditional methods. Given the structural complexities of Khmer script, such as diacritics and multi-layer character stacking, existing Latin-based layout analysis models fail to accurately delineate semantic layout units, particularly for dense text regions (e.g., list items). In this paper, we present the first comprehensive study on Khmer scene document layout detection. We contribute a novel framework comprising three key elements: (1) a robust training and benchmarking dataset specifically for Khmer scene layouts; (2) an open-source document augmentation tool capable of synthesizing realistic scene documents to scale training data; and (3) layout detection baselines utilizing YOLO-based architectures with oriented bounding boxes (OBB) to handle geometric distortions. To foster further research in the Khmer document analysis and recognition (DAR) community, we release our models, code, and datasets in this gated repository (in review).
[239] A Reconstruction System for Industrial Pipeline Inner Walls Using Panoramic Image Stitching with Endoscopic Imaging
Rui Ma,Yifeng Wang,Ziteng Yang,Xinghui Li
Main category: cs.CV
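TL;DR: This paper proposes a visual analysis and reconstruction system for pipeline inner walls based on industrial endoscopes and panoramic image stitching. Polar coordinate transformation and image stitching unwrap annular video frames into planar panoramas, markedly improving the efficiency of defect detection and condition assessment.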
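The polar unwrapping step mentioned in this summary can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's implementation: the pipe axis is assumed to project to a known centre `(cx, cy)`, the wall is assumed to occupy the ring between `r_inner` and `r_outer`, and nearest-neighbour sampling stands in for whatever interpolation the authors use. The function name `unwrap_annulus` is hypothetical.

```python
import math

def unwrap_annulus(img, cx, cy, r_inner, r_outer, n_radius, n_angle):
    """Unwrap the ring between r_inner and r_outer into a rectangle.

    img is a list of rows (img[y][x]). Output rows index radius from
    inner to outer; output columns index angle from 0 towards 2*pi.
    """
    height, width = len(img), len(img[0])
    out = []
    for i in range(n_radius):
        # Evenly spaced radii, inclusive of both ends when n_radius > 1.
        frac = i / (n_radius - 1) if n_radius > 1 else 0.0
        r = r_inner + frac * (r_outer - r_inner)
        row = []
        for j in range(n_angle):
            theta = 2.0 * math.pi * j / n_angle
            x = int(round(cx + r * math.cos(theta)))
            y = int(round(cy + r * math.sin(theta)))
            # Clamp so samples near the border stay inside the frame.
            x = min(max(x, 0), width - 1)
            y = min(max(y, 0), height - 1)
            row.append(img[y][x])
        out.append(row)
    return out
```

Each unwrapped key frame becomes a horizontal strip; stitching consecutive strips along the pipe axis would then yield the planar panorama the abstract describes.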
Details
Motivation: Visual analysis and reconstruction of pipeline inner walls remain challenging in industrial inspection; traditional frame-by-frame video review is inefficient, so an efficient and intuitive visual reconstruction approach is needed.
Method: Builds a dedicated reconstruction system based on panoramic image stitching with a custom GUI; it extracts key frames from endoscope video and combines polar coordinate transformation with image stitching to unwrap annular inner-wall frames into planar panoramic images.
Result: Experiments show that the method processes industrial endoscope video efficiently and that the stitched panoramas preserve the detailed features of the pipeline inner wall, giving intuitive and accurate visual support for defect detection and condition assessment.
Conclusion: Compared with traditional frame-by-frame review, the method significantly improves the efficiency of pipeline inner-wall reconstruction and has considerable engineering value.
Abstract: Visual analysis and reconstruction of pipeline inner walls remain challenging in industrial inspection scenarios. This paper presents a dedicated reconstruction system for pipeline inner walls via industrial endoscopes, which is built on panoramic image stitching technology. Equipped with a custom graphical user interface (GUI), the system extracts key frames from endoscope video footage, and integrates polar coordinate transformation with image stitching techniques to unwrap annular video frames of pipeline inner walls into planar panoramic images. Experimental results demonstrate that the proposed method enables efficient processing of industrial endoscope videos, and the generated panoramic stitched images preserve all detailed features of pipeline inner walls in their entirety. This provides intuitive and accurate visual support for defect detection and condition assessment of pipeline inner walls. In comparison with the traditional frame-by-frame video review method, the proposed approach significantly elevates the efficiency of pipeline inner wall reconstruction and exhibits considerable engineering application value.
[240] Diversity over Uniformity: Rethinking Representation in Generated Image Detection
Qinghui He,Haifeng Zhang,Qiao Qin,Bo Liu,Xiuli Bi,Bin Xiao
Main category: cs.CV
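TL;DR: This paper proposes an anti-feature-collapse learning framework that improves how well generated image detectors generalize to unseen generative mechanisms. It filters task-irrelevant components and suppresses representational overlap among different forgery cues, keeping the discriminative evidence diverse and complementary across multiple perspectives.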
Details
Motivation: Existing generated image detectors come to rely, after training, on a few highly salient forgery cues, so they generalize poorly to unknown generative mechanisms.
Method: Proposes an anti-feature-collapse learning framework that filters task-irrelevant components in the representation space and suppresses excessive overlap among different forgery cues, preventing discriminative information from collapsing into a few dominant feature directions.
Result: Significantly outperforms state-of-the-art methods on multiple public benchmarks, with a 5.02% accuracy improvement in cross-model scenarios and stronger generalization and detection reliability.
Conclusion: Maintaining diverse, complementary, multi-perspective discriminative evidence improves the robustness and reliability of generated image detectors on unknown generative mechanisms.
Abstract: With the rapid advancement of generative models, generated image detection has become an important task in visual forensics. Although existing methods have achieved remarkable progress, they often rely, after training, on only a small subset of highly salient forgery cues, which limits their ability to generalize to unseen generative mechanisms. We argue that reliable generated image detection should not depend on a single decision path but should preserve multiple judgment perspectives, enabling the model to understand the differences between real and generated images from diverse viewpoints. Based on this idea, we propose an anti-feature-collapse learning framework that filters task-irrelevant components and suppresses excessive overlap among different forgery cues in the representation space, preventing discriminative information from collapsing into a few dominant feature directions. This design maintains diverse and complementary evidence within the model, reduces reliance on a small set of salient cues, and enhances robustness under unseen generative settings. Extensive experiments on multiple public benchmarks demonstrate that the proposed method significantly outperforms the state-of-the-art approaches in cross-model scenarios, achieving an accuracy improvement of 5.02% and exhibiting superior generalization and detection reliability. The source code is available at https://github.com/Yanmou-Hui/DoU.
[241] BornoViT: A Novel Efficient Vision Transformer for Bengali Handwritten Basic Characters Classification
Rafi Hassan Chowdhury,Naimul Haque,Kaniz Fatiha
Main category: cs.CV
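TL;DR: This paper proposes BornoViT, a lightweight Vision Transformer for efficient classification of Bengali handwritten characters and digits. It keeps accuracy high while sharply reducing parameter count, model size, and computation, making it suitable for resource-constrained environments.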
Details
Motivation: Bengali handwritten character classification is challenging because the characters are complex and variable; mainstream models are computationally expensive and data-hungry, which makes them unsuitable for resource-limited languages.
Method: Proposes BornoViT, a lightweight Vision Transformer that uses a simplified deep convolutional neural network (DCNN) structure to cut parameters (0.65M), model size (0.62 MB), and computation (0.16 GFLOPs).
Result: Reaches 95.77% accuracy on the BanglaLekha Isolated dataset and 91.51% on the self-collected Bornomala dataset (approximately 222 samples from different age groups), with better efficiency than existing SOTA methods.
Conclusion: BornoViT strikes a good balance between accuracy and efficiency and offers a practical lightweight approach to handwriting recognition for low-resource languages.
Abstract: Handwritten character classification in the Bengali script is a significant challenge due to the complexity and variability of the characters. The models commonly used for classification are often computationally expensive and data-hungry, making them unsuitable for resource-limited languages such as Bengali. In this experiment, we propose a novel, efficient, and lightweight Vision Transformer model that effectively classifies Bengali handwritten basic characters and digits, addressing several shortcomings of traditional methods. The proposed solution utilizes a deep convolutional neural network (DCNN) in a more simplified manner compared to traditional DCNN architectures, with the aim of reducing computational burden. With only 0.65 million parameters, a model size of 0.62 MB, and 0.16 GFLOPs, our model, BornoViT, is significantly lighter than current state-of-the-art models, making it more suitable for resource-limited environments, which is essential for Bengali handwritten character classification. BornoViT was evaluated on the BanglaLekha Isolated dataset, achieving an accuracy of 95.77%, and demonstrating superior efficiency compared to existing state-of-the-art approaches. Furthermore, the model was evaluated on our self-collected dataset, Bornomala, consisting of approximately 222 samples from different age groups, where it achieved an accuracy of 91.51%.
[242] Stroke outcome and evolution prediction from CT brain using a spatiotemporal diffusion autoencoder
Adam Marcus,Paul Bentley,Daniel Rueckert
Main category: cs.CV
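TL;DR: This paper proposes a self-supervised method based on diffusion probabilistic models that extracts semantically rich stroke representations from CT images, then refines them by incorporating longitudinal images and time from onset, improving prediction of stroke severity and functional outcome.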
Details
Motivation: Stroke outcome prediction is essential for individualized clinical decision-making, but existing methods still struggle to model the ultimate fate of brain tissue, especially when labels are scarce.
Method: Uses diffusion probabilistic models for self-supervised learning of stroke representations from CT images, then extends the model to incorporate longitudinal imaging and time from stroke onset.
Result: On a minimally labeled dataset of 5,824 CT images from 3,573 patients across two medical centers, the method achieves the best performance on predicting next-day severity and functional outcome at discharge.
Conclusion: The method improves stroke imaging representation learning with little or no supervision and offers a new paradigm for clinical outcome prediction.
Abstract: Stroke is a major cause of death and disability worldwide. Accurate outcome and evolution prediction has the potential to revolutionize stroke care by individualizing clinical decision-making leading to better outcomes. However, despite a plethora of attempts and the rich data provided by neuroimaging, modelling the ultimate fate of brain tissue remains a challenging task. In this work, we apply recent ideas in the field of diffusion probabilistic models to generate a self-supervised semantically meaningful stroke representation from Computed Tomography (CT) images. We then improve this representation by extending the method to accommodate longitudinal images and the time from stroke onset. The effectiveness of our approach is evaluated on a dataset consisting of 5,824 CT images from 3,573 patients across two medical centers with minimal labels. Comparative experiments show that our method achieves the best performance for predicting next-day severity and functional outcome at discharge.
[243] Analyzing and Improving Fast Sampling of Text-to-Image Diffusion Models
Zhenyu Zhou,Defang Chen,Siwei Lyu,Chun Chen,Can Wang
Main category: cs.CV
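TL;DR: This paper proposes TORS, a sampling time schedule motivated by geometric properties of diffusion models revealed through the Frenet-Serret formulas. It is training-free and produces high-quality images in only 10 sampling steps on Flux.1-Dev and Stable Diffusion 3.5.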
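One way to picture a schedule with "uniform geometric variation" is to place sampling times so that the trajectory's tangent direction turns by the same angle in every step. The sketch below does exactly that on a densely sampled reference trajectory. It is an illustrative assumption, not the paper's TORS algorithm: the function name `equal_rotation_schedule`, the use of tangent vectors from a dense pilot run, and the linear interpolation are all hypothetical choices.

```python
import math

def angle_between(u, v):
    """Angle in radians between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.acos(max(-1.0, min(1.0, dot / norm)))

def equal_rotation_schedule(times, tangents, n_steps):
    """Pick n_steps + 1 times so each step covers equal tangent rotation.

    times and tangents describe a dense reference trajectory (assumed
    to come from a pilot run); tangents[i] is the direction at times[i].
    """
    # Cumulative rotation of the tangent direction along the dense grid.
    cum = [0.0]
    for prev, cur in zip(tangents, tangents[1:]):
        cum.append(cum[-1] + angle_between(prev, cur))
    total = cum[-1]
    schedule = []
    idx = 0
    for k in range(n_steps + 1):
        target = total * k / n_steps
        # Advance to the dense interval that brackets the target rotation.
        while idx < len(cum) - 2 and cum[idx + 1] < target:
            idx += 1
        span = cum[idx + 1] - cum[idx]
        frac = (target - cum[idx]) / span if span > 0 else 0.0
        schedule.append(times[idx] + frac * (times[idx + 1] - times[idx]))
    return schedule
```

In this toy construction, stretches where the direction barely changes receive few steps and sharply curving stretches receive more, which matches the intuition of spending a small step budget where the trajectory bends.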
Details
Motivation: Existing training-free sampling acceleration methods were developed independently, with no systematic study of their overall performance and compatibility; text-to-image diffusion models also still struggle to produce high-quality images with few sampling steps.
Method: Systematically analyzes the design space of diffusion sampling and identifies the sampling time schedule as the key factor; based on geometric properties revealed through the Frenet-Serret formulas, proposes the constant total rotation schedule (TORS), which keeps geometric variation uniform along the sampling trajectory.
Result: With only 10 sampling steps on Flux.1-Dev and Stable Diffusion 3.5, TORS outperforms previous training-free acceleration methods and produces high-quality images; experiments show it adapts well to unseen models, hyperparameters, and downstream tasks.
Conclusion: The sampling time schedule is central to training-free acceleration, and TORS provides a geometrically intuitive, general, and high-performing scheduling approach for efficient diffusion sampling.
Abstract: Text-to-image diffusion models have achieved unprecedented success but still struggle to produce high-quality results under limited sampling budgets. Existing training-free sampling acceleration methods are typically developed independently, leaving the overall performance and compatibility among these methods unexplored. In this paper, we bridge this gap by systematically elucidating the design space, and our comprehensive experiments identify the sampling time schedule as the most pivotal factor. Inspired by the geometric properties of diffusion models revealed through the Frenet-Serret formulas, we propose constant total rotation schedule (TORS), a scheduling strategy that ensures uniform geometric variation along the sampling trajectory. TORS outperforms previous training-free acceleration methods and produces high-quality images with 10 sampling steps on Flux.1-Dev and Stable Diffusion 3.5. Extensive experiments underscore the adaptability of our method to unseen models, hyperparameters, and downstream applications.
[244] Unified Vision-Language Modeling via Concept Space Alignment
Yifu Qiu,Paul-Ambroise Duquenne,Holger Schwenk
Main category: cs.CV
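TL;DR: This paper introduces V-SONAR, a vision-language embedding space extended from the SONAR text embedding space, which supports 1500 text languages and 177 speech languages. A post-hoc alignment method maps a vision encoder into the SONAR space, giving competitive video retrieval and leading multilingual video captioning results. Building on V-SONAR, the paper shows that the LCM can perform zero-shot visual concept understanding and introduces V-LCM, its vision-language instruction-tuned version, which significantly outperforms existing vision-language models on 61 of the 62 tested languages, low-resource languages in particular.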
Details
Motivation: Existing vision-language models generalize poorly in multilingual settings, especially for low-resource languages. The work aims to address this, improve zero-shot cross-modal understanding, and reuse an existing strong text embedding space (SONAR) to lower training cost.
Method: Proposes a post-hoc alignment pipeline that maps vision encoder representations into the SONAR text embedding space to build V-SONAR; on top of it, extends the Large Concept Model (LCM) to V-LCM, which models a unified sequence of vision-language latent embeddings and is instruction-tuned with a latent diffusion objective for next-embedding prediction.
Result: V-SONAR is competitive on text-to-video retrieval and surpasses state-of-the-art models on video captioning (DREAM-1K, PE-VIDEO); the LCM achieves zero-shot single- and multi-visual concept understanding; V-LCM matches SOTA models on image/video captioning and question answering and significantly outperforms them on 61 of the 62 tested languages, many of them low-resource.
Conclusion: V-SONAR and V-LCM show that reusing a large-scale text embedding space with lightweight visual alignment and instruction tuning is effective, offering a new paradigm for efficient, scalable, and strongly generalizing multilingual multimodal models.
Abstract: We introduce V-SONAR, a vision-language embedding space extended from the text-only embedding space SONAR (Omnilingual Embeddings Team et al., 2026), which supports 1500 text languages and 177 speech languages. To construct V-SONAR, we propose a post-hoc alignment pipeline that maps the representations of an existing vision encoder into the SONAR space. We thoroughly evaluate V-SONAR and show that its embeddings achieve competitive performance on text-to-video retrieval. Equipped with the OMNISONAR text decoder, V-SONAR further surpasses state-of-the-art vision-language models on video captioning tasks, including DREAM-1K (BLEU 23.9 vs. 19.6) and PE-VIDEO (BLEU 39.0 vs. 30.0). Leveraging V-SONAR, we first demonstrate that the Large Concept Model (LCM; LCM team et al. 2024) operating in SONAR and trained with English text only, can perform both single- and multi-visual concept understanding in a zero-shot manner. Finally, we introduce V-LCM, which extends the LCM with vision-language instruction tuning. V-LCM encodes vision and language inputs into a unified sequence of latent embeddings via V-SONAR and SONAR, and it is trained with the same latent diffusion objective for next-embedding prediction as in LCM's text-only pre-training. Experiments on a large-scale multilingual and multimodal instruction-tuning data mixture highlight the potential of V-LCM: V-LCM matches state-of-the-art vision-language models on tasks covering image/video captioning and question answering, while significantly outperforming them across 61 rich- to low-resource languages out of all 62 tested languages.
[245] DUCX: Decomposing Unfairness in Tool-Using Chest X-ray Agents
Zikang Xu,Ruinan Jin,Xiaoxiao Li
Main category: cs.CV
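TL;DR: This paper presents DUCX, a systematic audit of where bias arises in chest X-ray diagnostic agents. It decomposes bias into tool exposure bias, tool transition bias, and model reasoning bias, and shows experimentally that end-to-end evaluation alone misses significant subgroup disparities in intermediate stages, which calls for process-level fairness auditing and debiasing.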
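For readers unfamiliar with the end-to-end metric such audits typically report, one common definition of the equalized odds difference is the larger of the true-positive-rate gap and the false-positive-rate gap across subgroups. The sketch below implements that definition for binary labels. It assumes every subgroup contains at least one positive and one negative example, and it is not taken from the paper's code; the function names are illustrative.

```python
def group_rates(y_true, y_pred):
    """Return (true positive rate, false positive rate) for 0/1 labels."""
    pos = [p for t, p in zip(y_true, y_pred) if t == 1]
    neg = [p for t, p in zip(y_true, y_pred) if t == 0]
    # Assumes both lists are non-empty for every subgroup.
    return sum(pos) / len(pos), sum(neg) / len(neg)

def equalized_odds_difference(y_true, y_pred, groups):
    """Largest TPR or FPR gap between any two subgroups."""
    by_group = {}
    for t, p, g in zip(y_true, y_pred, groups):
        labels, preds = by_group.setdefault(g, ([], []))
        labels.append(t)
        preds.append(p)
    rates = [group_rates(labels, preds) for labels, preds in by_group.values()]
    tpr_gap = max(r[0] for r in rates) - min(r[0] for r in rates)
    fpr_gap = max(r[1] for r in rates) - min(r[1] for r in rates)
    return max(tpr_gap, fpr_gap)
```

A stage-wise audit in the spirit of the summary would compute gaps like this not only on final answers but also on subsets conditioned on intermediate behaviour, for example only the cases where a given tool was invoked.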
Details
Motivation: Tool-using medical agents can improve chest X-ray question answering, but their more complex pipelines open new pathways for bias that standalone models do not have, so a systematic fairness audit is needed.
Method: Proposes a stage-wise fairness decomposition that separates end-to-end bias from three agent-specific sources: tool exposure bias, tool transition bias, and model reasoning bias; instantiates agents with MedRAX and runs experiments across five driver backbones, quantifying subgroup performance gaps at each stage.
Result: (i) Significant demographic gaps exist end to end (equalized odds difference up to 20.79%, fairness-utility tradeoff as low as 28.65%); (ii) intermediate stages show even larger disparities (e.g., a utility gap of up to 50% conditioned on tool availability) that cannot be predicted from end-to-end evaluation.
Conclusion: End-to-end fairness evaluation alone is insufficient; process-level fairness auditing and intervention are needed for the equitable deployment of clinical agentic systems.
Abstract: Tool-using medical agents can improve chest X-ray question answering by orchestrating specialized vision and language modules, but this added pipeline complexity also creates new pathways for demographic bias beyond standalone models. We present DUCX (Decomposing Unfairness in Chest X-ray agents), a systematic audit of chest X-ray agents instantiated with MedRAX. To localize where disparities arise, we introduce a stage-wise fairness decomposition that separates end-to-end bias from three agent-specific sources: tool exposure bias (utility gaps conditioned on tool presence), tool transition bias (subgroup differences in tool-routing patterns), and model reasoning bias (subgroup differences in synthesis behaviors). Extensive experiments on tool-using agentic frameworks across five driver backbones reveal that (i) demographic gaps persist in end-to-end performance, with equalized odds up to 20.79%, and the lowest fairness-utility tradeoff down to 28.65%, and (ii) intermediate behaviors, tool usage, transition patterns, and reasoning traces exhibit distinct subgroup disparities that are not predictable from end-to-end evaluation alone (e.g., conditioned on segmentation-tool availability, the subgroup utility gap reaches as high as 50%). Our findings underscore the need for process-level fairness auditing and debiasing to ensure the equitable deployment of clinical agentic systems. Code is available here: https://anonymous.4open.science/r/DUCK-E5FE/README.md
[246] From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents
Niu Lian,Yuting Wang,Hanshu Yao,Jinpeng Wang,Bin Chen,Yaowei Wang,Min Zhang,Shu-Tao Xia
Main category: cs.CV
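TL;DR: This paper proposes MM-Mem, a hierarchical multimodal memory architecture inspired by Fuzzy-Trace Theory. A three-level structure (Sensory Buffer, Episodic Stream, Symbolic Schema) supports long-horizon video understanding; a Semantic Information Bottleneck objective with SIB-GRPO optimization and an entropy-driven top-down retrieval strategy complete the design, which is validated on multiple benchmarks.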
Details
Motivation: Existing approaches to long-horizon video understanding are limited: vision-centric methods incur high latency and redundancy, text-centric methods lose detail and hallucinate, and general models are constrained by context windows and static memory mechanisms that do not mirror efficient human cognition.
Method: Proposes the MM-Mem hierarchical memory architecture (Sensory Buffer, Episodic Stream, Symbolic Schema), which follows Fuzzy-Trace Theory to distill verbatim traces progressively into gist; designs a Semantic Information Bottleneck objective and the SIB-GRPO optimization algorithm; adopts an entropy-driven top-down memory retrieval strategy.
Result: On 4 benchmarks, MM-Mem significantly outperforms existing methods on both offline and streaming tasks, showing strong generalization and robustness.
Conclusion: Hierarchical, dynamic, cognitively aligned memory is a key route to better long-horizon video understanding in multimodal large models, and MM-Mem offers a new paradigm for more human-like, efficient multimodal reasoning systems.
Abstract: While multimodal large language models have demonstrated impressive short-term reasoning, they struggle with long-horizon video understanding due to limited context windows and static memory mechanisms that fail to mirror human cognitive efficiency. Existing paradigms typically fall into two extremes: vision-centric methods that incur high latency and redundancy through dense visual accumulation, or text-centric approaches that suffer from detail loss and hallucination via aggressive captioning. To bridge this gap, we propose MM-Mem, a pyramidal multimodal memory architecture grounded in Fuzzy-Trace Theory. MM-Mem structures memory hierarchically into a Sensory Buffer, Episodic Stream, and Symbolic Schema, enabling the progressive distillation of fine-grained perceptual traces (verbatim) into high-level semantic schemas (gist). Furthermore, to govern the dynamic construction of memory, we derive a Semantic Information Bottleneck objective and introduce SIB-GRPO to optimize the trade-off between memory compression and task-relevant information retention. In inference, we design an entropy-driven top-down memory retrieval strategy, which first queries the abstract Symbolic Schema and progressively "drills down" to the Sensory Buffer and Episodic Stream under high uncertainty. Extensive experiments across 4 benchmarks confirm the effectiveness of MM-Mem on both offline and streaming tasks, demonstrating robust generalization and validating the effectiveness of cognition-inspired memory organization. Code is available at https://github.com/EliSpectre/MM-Mem.
[247] Neural Functional Alignment Space: Brain-Referenced Representation of Artificial Neural Networks
Ruiyu Yan,Hanqi Jiang,Yi Pan,Xiaobo Li,Tianming Liu,Xi Jiang,Lin Zhao
Main category: cs.CV
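TL;DR: This paper proposes the Neural Functional Alignment Space (NFAS), a brain-referenced representational framework that allows comparison across models. It models the intrinsic dynamical evolution of stimulus representations across network depth, applies Dynamic Mode Decomposition (DMD), and anchors the result to brain signals. On 45 pretrained multimodal models, it reveals modality-specific clustering and a trend toward cross-modal integration.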
Details
Motivation: Existing alignment methods mostly rely on layer-wise features or task activations, do not model how representations evolve dynamically, and make unified comparison across models and modalities at the level of brain function difficult.
Method: Models layer-wise embeddings as a depth-wise dynamical trajectory, applies Dynamic Mode Decomposition (DMD) to extract the stable mode, and projects it into a biologically anchored coordinate system defined by distributed neural responses; also proposes the Signal-to-Noise Consistency Index (SNCI) to quantify cross-model consistency at the modality level.
Result: Across 45 pretrained models spanning vision, audio, and language, NFAS shows structured organization, including modality-specific clustering and cross-modal convergence in integrative cortical systems.
Conclusion: Representation dynamics provide a principled basis for functional alignment of artificial neural networks, and NFAS supports model comparison and interpretation that is better grounded in neuroscience.
Abstract: We propose the Neural Functional Alignment Space (NFAS), a brain-referenced representational framework for characterizing artificial neural networks on equal functional grounds. NFAS departs from conventional alignment approaches that rely on layer-wise features or task-specific activations by modeling the intrinsic dynamical evolution of stimulus representations across network depth. Specifically, we model layer-wise embeddings as a depth-wise dynamical trajectory and apply Dynamic Mode Decomposition (DMD) to extract the stable mode. This representation is then projected into a biologically anchored coordinate system defined by distributed neural responses. We also introduce the Signal-to-Noise Consistency Index (SNCI) to quantify cross-model consistency at the modality level. Across 45 pretrained models spanning vision, audio, and language, NFAS reveals structured organization within this brain-referenced space, including modality-specific clustering and cross-modal convergence in integrative cortical systems. Our findings suggest that representation dynamics provide a principled basis for
[248] Learning to Read Where to Look: Disease-Aware Vision-Language Pretraining for 3D CT
Simon Ging,Philipp Arnold,Sebastian Walter,Hani Alnahas,Hannah Bast,Elmar Kotter,Jiancheng Yang,Behzad Bozorgtabar,Thomas Brox
Main category: cs.CV
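TL;DR: This paper presents a 3D CT vision-language model trained jointly on a large single-center set of report-volume pairs (98k pairs) and public data. It combines SigLIP-style contrastive learning with prompt-based disease supervision and introduces the task of intra-scan snippet localization, which grounds a text snippet to an axial position in the CT volume (MAE 36.3 mm) while preserving retrieval and classification performance in one unified multi-task model.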
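As background for "SigLIP-style" contrastive learning, the pairwise sigmoid loss treats every image-text pair in a batch as an independent binary classification: matched pairs are positives and all other pairings are negatives. The sketch below computes that loss in plain Python for L2-normalized embeddings. The temperature and bias defaults follow the commonly cited SigLIP initialization and are assumptions here, as is the function name; the paper's exact training configuration is not given in the abstract.

```python
import math

def log_sigmoid(a):
    """Numerically stable log(sigmoid(a))."""
    if a >= 0:
        return -math.log1p(math.exp(-a))
    return a - math.log1p(math.exp(a))

def sigmoid_contrastive_loss(vol_emb, txt_emb, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss over a batch of volume and report embeddings.

    vol_emb[i] and txt_emb[i] are assumed to be L2-normalized and to
    belong to the same study; every other pairing counts as a negative.
    """
    n = len(vol_emb)
    total = 0.0
    for i, vol in enumerate(vol_emb):
        for j, txt in enumerate(txt_emb):
            sim = sum(a * b for a, b in zip(vol, txt))
            label = 1.0 if i == j else -1.0
            total += log_sigmoid(label * (temperature * sim + bias))
    return -total / n
```

Unlike a softmax-based contrastive loss, each pair contributes independently, so the loss does not need a normalization over the whole batch.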
Details
Motivation: Existing 3D CT vision-language models depend on limited public data, provide only coarse global supervision, and do not exploit the precise image location information implicit in reports (e.g., "series X, image Y").
Method: Uses SigLIP-style contrastive pretraining plus prompt-based disease supervision; automatically mines 262k report snippet-CT slice pairs and defines the intra-scan snippet localization task (regressing axial depth); jointly optimizes retrieval, classification, and localization objectives end to end.
Result: On CT-RATE, text-to-image retrieval reaches R@10 31.5 (SOTA) and disease classification AUC 83.8; on Rad-ChestCT, AUC 77.0; snippet localization MAE is 36.3 mm (versus 67.0 mm for the best baseline); joint training on the three tasks does not degrade retrieval or classification.
Conclusion: By combining large-scale real clinical data, fine-grained disease supervision, and the newly proposed intra-scan localization task, the paper builds the first unified 3D CT vision-language model supporting retrieval, classification, and text-to-slice localization, and demonstrates the value of the structured spatial information in clinical reports.
Abstract: Recent 3D CT vision-language models align volumes with reports via contrastive pretraining, but typically rely on limited public data and provide only coarse global supervision. We train a 3D CT vision-language model on 98k report-volume pairs (50k patients) collected at a single hospital, combined with public datasets, using SigLIP-style contrastive pretraining together with prompt-based disease supervision in the shared vision-text embedding space. On CT-RATE, our model achieves state-of-the-art text-to-image retrieval (R@10 31.5 vs. 22.2) and competitive disease classification (AUC 83.8 vs. 83.8), with consistent results on Rad-ChestCT (AUC 77.0 vs. 77.3). We further observe that radiologists routinely reference specific images within their reports (e.g., "series X, image Y"), linking textual descriptions to precise axial locations. We automatically mine 262k such snippet-slice pairs and introduce the task of intra-scan snippet localization (predicting the axial depth referred to by a text snippet), reducing mean absolute error to 36.3 mm at 12 mm feature resolution, compared with 67.0 mm for the best baseline. Adding this localization objective leaves retrieval and classification broadly unchanged within confidence bounds, yielding a single unified model for retrieval, classification, and intra-scan grounding.
[249] NERFIFY: A Multi-Agent Framework for Turning NeRF Papers into Code
Seemandhar Jain,Keshav Gupta,Kunal Gupta,Manmohan Chandraker
Main category: cs.CV
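TL;DR: NERFIFY is a multi-agent framework that automatically and reliably converts NeRF research papers into trainable Nerfstudio plugins, substantially improving reproduction efficiency and code quality.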
Details
Motivation: Reproducing NeRF papers is costly, and general-purpose models such as GPT-5 struggle to generate runnable code, so a domain-specific automated code generation tool is needed.
Method: Proposes six key techniques: constraining LLM generation with a context-free grammar (CFG) for Nerfstudio; Graph-of-Thought multi-file collaborative generation; component retrieval and integration based on citation graphs; iterative refinement with visual feedback combining PSNR, geometry, and a VLM; a knowledge enhancement mechanism that supports improving methods; and an evaluation benchmark covering 30 NeRF papers.
Result: On papers without public implementations, the generated code matches expert human implementations in visual quality (within ±0.5 dB PSNR and ±0.2 SSIM), while cutting implementation time from weeks to minutes.
Conclusion: Domain-aware design can solve the code translation problem for complex vision papers, accelerating and broadening access to reproducible research.
Abstract: The proliferation of neural radiance field (NeRF) research requires significant efforts to reimplement papers before building upon them. We introduce NERFIFY, a multi-agent framework that reliably converts NeRF research papers into trainable Nerfstudio plugins, in contrast to generic paper-to-code methods and frontier models like GPT-5 that usually fail to produce runnable code. NERFIFY achieves domain-specific executability through six key innovations: (1) Context-free grammar (CFG): LLM synthesis is constrained by Nerfstudio formalized as a CFG, ensuring generated code satisfies architectural invariants. (2) Graph-of-Thought code synthesis: Specialized multi-file-agents generate repositories in topological dependency order, validating contracts and errors at each node. (3) Compositional citation recovery: Agents automatically retrieve and integrate components (samplers, encoders, proposal networks) from citation graphs of references. (4) Visual feedback: Artifacts are diagnosed through PSNR-minima ROI analysis, cross-view geometric validation, and VLM-guided patching to iteratively improve quality. (5) Knowledge enhancement: Beyond reproduction, methods can be improved with novel optimizations. (6) Benchmarking: An evaluation framework is designed for NeRF paper-to-code synthesis across 30 diverse papers. On papers without public implementations, NERFIFY achieves visual quality matching expert human code (+/-0.5 dB PSNR, +/-0.2 SSIM) while reducing implementation time from weeks to minutes. NERFIFY demonstrates that a domain-aware design enables code translation for complex vision papers, enabling accelerated and democratized reproducible research. Code, data and implementations will be publicly released.
[250] COMBAT: Conditional World Models for Behavioral Agent Training
Anmol Agarwal,Pranay Meshram,Sumer Singh,Saurav Suman,Andrew Lapp,Shahbuland Matiana,Louis Castricato,Spencer Frazier
Main category: cs.CV
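TL;DR: This paper proposes COMBAT, a real-time, action-controlled world model based on diffusion models that simulates a reactive, dynamic opponent in the fighting game Tekken 3. Intelligent behavior emerges from training on single-player input data alone, with no opponent action labels.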
Details
Motivation: Existing video generation and world models struggle to model dynamic agents that actively interact with and respond to player actions; this paper aims to fill that gap.
Method: A 1.2-billion-parameter Diffusion Transformer is conditioned on latent representations from a deep compression autoencoder, with causal distillation and diffusion forcing used to achieve real-time inference; training uses only single-player inputs, with no supervision of opponent actions.
Result: The model generates reactive, strategic opponent behavior in Tekken 3 and exhibits emergent complex agent characteristics; new evaluation methods are proposed to verify this behavior.
Conclusion: Diffusion models can serve as an effective basis for interactive world models. COMBAT shows that controllable, responsive dynamic agents can be trained under partial observability and without explicit policy supervision.
Abstract: Recent advances in video generation have spurred the development of world models capable of simulating 3D-consistent environments and interactions with static objects. However, a significant limitation remains in their ability to model dynamic, reactive agents that can intelligently influence and interact with the world. To address this gap, we introduce COMBAT, a real-time, action-controlled world model trained on the complex 1v1 fighting game Tekken 3. Our work demonstrates that diffusion models can successfully simulate a dynamic opponent that reacts to player actions, learning its behavior implicitly. Our approach utilizes a 1.2 billion parameter Diffusion Transformer, conditioned on latent representations from a deep compression autoencoder. We employ state-of-the-art techniques, including causal distillation and diffusion forcing, to achieve real-time inference. Crucially, we observe the emergence of sophisticated agent behavior by training the model solely on single-player inputs, without any explicit supervision for the opponent's policy. Unlike traditional imitation learning methods, which require complete action labels, COMBAT learns effectively from partially observed data to generate responsive behaviors for a controllable Player 1. We present an extensive study and introduce novel evaluation methods to benchmark this emergent agent behavior, establishing a strong foundation for training interactive agents within diffusion-based world models.
[251] MME: Mixture of Mesh Experts with Random Walk Transformer Gating
Amir Belder,Ayellet Tal
Main category: cs.CV
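TL;DR: This paper proposes a Mixture of Experts (MoE) framework for mesh analysis that uses a newly designed gating mechanism and a dynamic loss balancing strategy to improve expert specialization and knowledge sharing, achieving state-of-the-art performance on mesh classification, retrieval, and semantic segmentation.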
Details
Motivation: Existing mesh analysis methods each have their strengths but suit different object classes, so their advantages need to be combined.
Method: A new gate architecture based on random walks and an attention mechanism is proposed, together with a dynamic loss balancing scheme that adaptively trades off a diversity loss (which promotes expert specialization) against a similarity loss (which promotes knowledge sharing).
Result: State-of-the-art performance on mesh classification, retrieval, and semantic segmentation.
Conclusion: The proposed MoE framework effectively integrates heterogeneous mesh analysis methods and improves overall performance, with good generalization and practicality.
Abstract: In recent years, various methods have been proposed for mesh analysis, each offering distinct advantages and often excelling on different object classes. We present a novel Mixture of Experts (MoE) framework designed to harness the complementary strengths of these diverse approaches. We propose a new gate architecture that encourages each expert to specialise in the classes it excels in. Our design is guided by two key ideas: (1) random walks over the mesh surface effectively capture the regions that individual experts attend to, and (2) an attention mechanism that enables the gate to focus on the areas most informative for each expert's decision-making. To further enhance performance, we introduce a dynamic loss balancing scheme that adjusts a trade-off between diversity and similarity losses throughout the training, where diversity prompts expert specialization, and similarity enables knowledge sharing among the experts. Our framework achieves state-of-the-art results in mesh classification, retrieval, and semantic segmentation tasks. Our code is available at: https://github.com/amirbelder/MME-Mixture-of-Mesh-Experts.
[252] Neural Discrimination-Prompted Transformers for Efficient UHD Image Restoration and Enhancement
Cong Wang,Jinshan Pan,Liyan Wang,Wei Wang,Yang Yang
Main category: cs.CV
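TL;DR: This paper proposes UHDPromer, a Transformer model based on Neural Discrimination Priors (NDP) for ultra-high-definition (UHD) image restoration and enhancement. It improves low-resolution feature representation through the NDPA attention mechanism and the NDPN network structure and, combined with a super-resolution-guided reconstruction strategy, achieves state-of-the-art performance with efficient computation on several UHD tasks.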
Details
Motivation: Implicit neural differences exist between high-resolution and low-resolution features; exploiting them can improve low-resolution feature representation and thereby UHD image restoration.
Method: Neural Discrimination Priors (NDP) are introduced to measure the feature differences, and Neural Discrimination-Prompted Attention (NDPA) and a Neural Discrimination-Prompted Network (NDPN) are built on them. NDPA incorporates NDP into the attention mechanism to perceive discrimination information globally, NDPN uses an NDP-guided continuous gating mechanism to selectively pass beneficial content, and a super-resolution-guided reconstruction strategy is added.
Result: State-of-the-art performance on three UHD image restoration tasks (low-light enhancement, dehazing, and deblurring), together with the best computational efficiency.
Conclusion: UHDPromer validates the use of neural discrimination priors to model differences across resolutions and offers a new approach to UHD image restoration that combines high performance with high efficiency.
Abstract: We propose a simple yet effective UHDPromer, a neural discrimination-prompted Transformer, for Ultra-High-Definition (UHD) image restoration and enhancement. Our UHDPromer is inspired by an interesting observation that there implicitly exist neural differences between high-resolution and low-resolution features, and exploring such differences can facilitate low-resolution feature representation. To this end, we first introduce Neural Discrimination Priors (NDP) to measure the differences and then integrate NDP into the proposed Neural Discrimination-Prompted Attention (NDPA) and Neural Discrimination-Prompted Network (NDPN). The proposed NDPA re-formulates the attention by incorporating NDP to globally perceive useful discrimination information, while the NDPN explores a continuous gating mechanism guided by NDP to selectively permit the passage of beneficial content. To enhance the quality of restored images, we propose a super-resolution-guided reconstruction approach, which is guided by super-resolving low-resolution features to facilitate final UHD image restoration. Experiments show that UHDPromer achieves the best computational efficiency while still maintaining state-of-the-art performance on 3 UHD image restoration and enhancement tasks, including low-light image enhancement, image dehazing, and image deblurring. The source codes and pre-trained models will be made available at https://github.com/supersupercong/uhdpromer.
[253] PPC-MT: Parallel Point Cloud Completion with Mamba-Transformer Hybrid Architecture
Jie Li,Shengwei Tian,Long Yu,Xin Ning
Main category: cs.CV
TL;DR: PPC-MT is a parallel point cloud completion framework using PCA-guided ordering and a hybrid Mamba-Transformer architecture to balance reconstruction quality and efficiency.
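The PCA-guided ordering mentioned in the summary above can be illustrated with a small standard-library sketch. This is not the authors' implementation: the function names `principal_axis` and `pca_order_and_split`, the use of power iteration, and the equal-size chunking are all assumptions made for illustration. It only shows the general idea of projecting an unordered point cloud onto its first principal component, sorting along that axis, and cutting the ordered set into subsets that could then be processed in parallel.

```python
import math

def principal_axis(points, iters=100):
    """Approximate the first principal component of 3D points by power iteration."""
    n = len(points)
    mean = [sum(p[i] for p in points) / n for i in range(3)]
    centered = [[p[i] - mean[i] for i in range(3)] for p in points]
    # 3x3 covariance matrix of the centered points
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(3)]
           for i in range(3)]
    # Start vector (assumption: not orthogonal to the dominant axis)
    v = [1.0 / math.sqrt(3.0)] * 3
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return mean, v

def pca_order_and_split(points, num_subsets):
    """Sort points along the principal axis, then cut into contiguous subsets."""
    mean, axis = principal_axis(points)
    ordered = sorted(
        points,
        key=lambda p: sum((p[i] - mean[i]) * axis[i] for i in range(3)),
    )
    size = math.ceil(len(ordered) / num_subsets)
    return [ordered[k:k + size] for k in range(0, len(ordered), size)]

# Hypothetical toy cloud that is elongated along the x axis
cloud = [(3.0, 0.0, 0.0), (1.0, 0.1, 0.0), (0.0, 0.0, 0.0), (2.0, -0.1, 0.0)]
subsets = pca_order_and_split(cloud, num_subsets=2)
```

In this toy case the points end up ordered along x, so each subset covers one contiguous half of the shape. The sign of a principal axis is arbitrary in general, so the ordering may come out reversed for other inputs.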
Details
Motivation: Existing point cloud completion methods struggle to balance high-quality reconstruction with computational efficiency.
Method: PPC-MT introduces a PCA-guided parallel completion strategy that orders unordered point clouds, decomposes them into subsets, and reconstructs them in parallel using a multi-head reconstructor; it combines Mamba (for linear-complexity encoding) and Transformer (for fine-grained decoding).
Result: PPC-MT outperforms state-of-the-art methods on PCN, ShapeNet-55/34, and KITTI across multiple metrics, improving uniformity, detail fidelity, and efficiency.
Conclusion: The structured parallel synthesis paradigm with hybrid Mamba-Transformer effectively balances efficiency and reconstruction accuracy for point cloud completion.
Abstract: Existing point cloud completion methods struggle to balance high-quality reconstruction with computational efficiency. To address this, we propose PPC-MT, a novel parallel framework for point cloud completion leveraging a hybrid Mamba-Transformer architecture. Our approach introduces an innovative parallel completion strategy guided by Principal Component Analysis (PCA), which imposes a geometrically meaningful structure on unordered point clouds, transforming them into ordered sets and decomposing them into multiple subsets. These subsets are reconstructed in parallel using a multi-head reconstructor. This structured parallel synthesis paradigm significantly enhances the uniformity of point distribution and detail fidelity, while preserving computational efficiency. By integrating Mamba's linear complexity for efficient feature extraction during encoding with the Transformer's capability to model fine-grained multi-sequence relationships during decoding, PPC-MT effectively balances efficiency and reconstruction accuracy. Extensive quantitative and qualitative experiments on benchmark datasets, including PCN, ShapeNet-55/34, and KITTI, demonstrate that PPC-MT outperforms state-of-the-art methods across multiple metrics, validating the efficacy of our proposed framework.
[254] MMTA: Multi Membership Temporal Attention for Fine-Grained Stroke Rehabilitation Assessment
Halil Ismail Helvaci,Justin Huber,Jihye Bae,Sen-ching Samson Cheung
Main category: cs.CV
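TL;DR: This paper proposes Multi-Membership Temporal Attention (MMTA), a high-resolution temporal Transformer for fine-grained rehabilitation action segmentation. By letting multiple locally normalized temporal attention windows act on the same frame simultaneously, it improves boundary sensitivity and context modeling, applies to both video and IMU data, and significantly outperforms existing methods on several benchmarks.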
Details
Motivation: Existing temporal action segmentation (TAS) models struggle to capture sub-second micro-movements while preserving overall movement context, which blurs phase-transition boundaries and undermines the reliability of rehabilitation assessment.
Method: MMTA is a new temporal attention mechanism in which each frame attends to multiple locally normalized temporal attention windows within a single layer; the resulting views are fused through feature-space overlap resolution. It supports unified single-stage modeling of video and IMU inputs.
Result: Edit Score improves by +1.3 (video) and +1.6 (IMU) on StrokeRehab and by +3.3 on 50Salads; ablations show that the gains come from the multi-membership temporal views rather than architectural complexity.
Conclusion: MMTA offers an efficient, accurate, multimodal-compatible single-stage solution for resource-constrained rehabilitation assessment, markedly improving boundary sensitivity and contextual consistency in fine-grained action segmentation.
Abstract: To empower the iterative assessments involved during a person's rehabilitation, automated assessment of a person's abilities during daily activities requires temporally precise segmentation of fine-grained actions in therapy videos. Existing temporal action segmentation (TAS) models struggle to capture sub-second micro-movements while retaining exercise context, blurring rapid phase transitions and limiting reliable downstream assessment of motor recovery. We introduce Multi-Membership Temporal Attention (MMTA), a high-resolution temporal transformer for fine-grained rehabilitation assessment. Unlike standard temporal attention, which assigns each frame a single attention context per layer, MMTA lets each frame attend to multiple locally normalized temporal attention windows within the same layer. We fuse these concurrent temporal views via feature-space overlap resolution, preserving competing local contexts near transitions while enabling longer-range reasoning through layer-wise propagation. This increases boundary sensitivity without additional depth or multi-stage refinement. MMTA supports both video and wearable IMU inputs within a unified single-stage architecture, making it applicable to both clinical and home settings. MMTA consistently improves over the Global Attention transformer, boosting Edit Score by +1.3 (Video) and +1.6 (IMU) on StrokeRehab while further improving 50Salads by +3.3. Ablations confirm that performance gains stem from multi-membership temporal views rather than architectural complexity, offering a practical solution for resource-constrained rehabilitation assessment.
[255] Uncertainty-Aware Concept and Motion Segmentation for Semi-Supervised Angiography Videos
Yu Luo,Guangyu Wei,Yangfan Li,Jieyu He,Yueming Lyu
Main category: cs.CV
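TL;DR: This paper proposes SMART, a SAM3-based semi-supervised vessel segmentation framework that uses motion-aware consistency and progressive confidence regularization to address blurred boundaries, low contrast, complex motion, and scarce annotations in XCA videos, reaching state-of-the-art performance on multiple datasets.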
Details
Motivation: Vessel segmentation in X-ray coronary angiography (XCA) sequences is crucial for diagnosing coronary artery disease but faces blurred boundaries, inconsistent radiation contrast, complex motion patterns, and scarce annotated data; conventional semi-supervised learning methods struggle to model temporal dynamics and to provide reliable uncertainty estimates.
Method: SMART, a SAM3-based teacher-student framework, makes three contributions: (1) a teacher-student framework built on SAM3's promptable segmentation design; (2) vessel mask warping and a motion consistency loss to model vessel dynamics; (3) a progressive confidence-aware consistency regularization that mitigates unreliable teacher predictions on low-quality images.
Result: On three XCA sequence datasets from different medical institutions, SMART significantly outperforms existing methods and reaches state-of-the-art performance while needing very few annotations.
Conclusion: SMART is an efficient, robust semi-supervised method for vessel segmentation in XCA videos that balances performance with clinical practicality and is particularly suited to real medical settings where annotations are scarce.
Abstract: Segmentation of the main coronary artery from X-ray coronary angiography (XCA) sequences is crucial for the diagnosis of coronary artery diseases. However, this task is challenging due to issues such as blurred boundaries, inconsistent radiation contrast, complex motion patterns, and a lack of annotated images for training. Although Semi-Supervised Learning (SSL) can alleviate the annotation burden, conventional methods struggle with complicated temporal dynamics and unreliable uncertainty quantification. To address these challenges, we propose SAM3-based Teacher-student framework with Motion-Aware consistency and Progressive Confidence Regularization (SMART), a semi-supervised vessel segmentation approach for X-ray angiography videos. First, our method utilizes SAM3's unique promptable concept segmentation design and innovates a SAM3-based teacher-student framework to maximize the performance potential of both the teacher and the student. Second, we enhance segmentation by integrating the vessel mask warping technique and motion consistency loss to model complex vessel dynamics. To address the issue of unreliable teacher predictions caused by blurred boundaries and minimal contrast, we further propose a progressive confidence-aware consistency regularization to mitigate the risk of unreliable outputs. Extensive experiments on three datasets of XCA sequences from different institutions demonstrate that SMART achieves state-of-the-art performance while requiring significantly fewer annotations, making it particularly valuable for real-world clinical applications where labeled data is scarce. Our code is available at: https://github.com/qimingfan10/SMART.
[256] VEMamba: Efficient Isotropic Reconstruction of Volume Electron Microscopy with Axial-Lateral Consistent Mamba
Longmi Gao,Pan Gao
Main category: cs.CV
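TL;DR: This paper proposes the VEMamba framework, which achieves efficient, high-quality isotropic reconstruction of volume electron microscopy (VEM) data through a 3D dependency reordering paradigm (comprising the ALCSSM and DWAM modules) and a degradation-aware MoCo training strategy.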
Details
Motivation: Existing methods for isotropic VEM reconstruction neglect axial information and simulate degradation unrealistically, which limits reconstruction quality.
Method: VEMamba comprises (1) an Axial-Lateral Chunking Selective Scan Module (ALCSSM) that re-maps 3D spatial dependencies into optimized 1D sequences suited to Mamba-based modeling; (2) a Dynamic Weights Aggregation Module (DWAM) that adaptively aggregates the sequence outputs; and (3) realistic degradation simulation combined with MoCo contrastive learning to improve robustness.
Result: On simulated and real-world anisotropic VEM datasets, VEMamba achieves leading performance across multiple metrics at a lower computational cost.
Conclusion: VEMamba is an efficient, robust new paradigm for isotropic VEM reconstruction that balances modeling capacity with computational efficiency; the code is open source.
Abstract: Volume Electron Microscopy (VEM) is crucial for 3D tissue imaging but often produces anisotropic data with poor axial resolution, hindering visualization and downstream analysis. Existing methods for isotropic reconstruction often suffer from neglecting abundant axial information and employing simple downsampling to simulate anisotropic data. To address these limitations, we propose VEMamba, an efficient framework for isotropic reconstruction. The core of VEMamba is a novel 3D Dependency Reordering paradigm, implemented via two key components: an Axial-Lateral Chunking Selective Scan Module (ALCSSM), which intelligently re-maps complex 3D spatial dependencies (both axial and lateral) into optimized 1D sequences for efficient Mamba-based modeling, explicitly enforcing axial-lateral consistency; and a Dynamic Weights Aggregation Module (DWAM) to adaptively aggregate these reordered sequence outputs for enhanced representational power. Furthermore, we introduce a realistic degradation simulation and then leverage Momentum Contrast (MoCo) to integrate this degradation-aware knowledge into the network for superior reconstruction. Extensive experiments on both simulated and real-world anisotropic VEM datasets demonstrate that VEMamba achieves highly competitive performance across various metrics while maintaining a lower computational footprint. The source code is available on GitHub: https://github.com/I2-Multimedia-Lab/VEMamba
[257] pySpatial: Generating 3D Visual Programs for Zero-Shot Spatial Reasoning
Zhanpeng Luo,Ce Zhang,Silong Yong,Cunxi Dai,Qianwei Wang,Haoxi Ran,Guanya Shi,Katia Sycara,Yaqi Xie
Main category: cs.CV
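TL;DR: This paper proposes the pySpatial framework, which lets multi-modal large language models (MLLMs) call 3D spatial tools (such as 3D reconstruction, pose estimation, and novel-view rendering) through Python code generation. It achieves zero-shot 3D spatial understanding and reasoning without fine-tuning and significantly outperforms existing MLLM baselines on benchmarks such as MindCube and Omni3D-Bench and in real-world robot navigation.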
Details
Motivation: Multi-modal large language models (MLLMs) perform well on general perception and reasoning but still fall clearly short on tasks that require 3D spatial understanding.
Method: pySpatial is a visual programming framework in which the MLLM generates Python code that calls spatial tools (such as 3D reconstruction, camera-pose recovery, and novel-view rendering), converting a 2D image sequence into an explorable 3D scene that supports explicit, structured spatial reasoning. The whole process needs no gradient-based fine-tuning and is fully zero-shot.
Result: pySpatial consistently surpasses strong MLLM baselines on the MindCube and Omni3D-Bench benchmarks, for example outperforming GPT-4.1-mini by 12.94% on MindCube; in real-world indoor navigation experiments, a robot successfully traverses complex environments along the generated routes.
Conclusion: pySpatial gives MLLMs zero-shot, training-free 3D spatial understanding and manipulation abilities, substantially improving their practicality and generalization on spatial reasoning and embodied intelligence tasks.
Abstract: Multi-modal Large Language Models (MLLMs) have demonstrated strong capabilities in general-purpose perception and reasoning, but they still struggle with tasks that require spatial understanding of the 3D world. To address this, we introduce pySpatial, a visual programming framework that equips MLLMs with the ability to interface with spatial tools via Python code generation. Given an image sequence and a natural-language query, the model composes function calls to spatial tools including 3D reconstruction, camera-pose recovery, novel-view rendering, etc. These operations convert raw 2D inputs into an explorable 3D scene, enabling MLLMs to reason explicitly over structured spatial representations. Notably, pySpatial requires no gradient-based fine-tuning and operates in a fully zero-shot setting. Experimental evaluations on the challenging MindCube and Omni3D-Bench benchmarks demonstrate that our framework pySpatial consistently surpasses strong MLLM baselines; for instance, it outperforms GPT-4.1-mini by 12.94% on MindCube. Furthermore, we conduct real-world indoor navigation experiments where the robot can successfully traverse complex environments using route plans generated by pySpatial, highlighting the practical effectiveness of our approach.
[258] ShiftLUT: Spatial Shift Enhanced Look-Up Tables for Efficient Image Restoration
Xiaolong Zeng,Yitong Yu,Shiyao Xiong,Jinhua Hao,Ming Sun,Chao Zhou,Bin Wang
Main category: cs.CV
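TL;DR: This paper proposes the ShiftLUT framework, which uses learnable spatial shifts, an asymmetric dual-branch structure, and an error-bounded adaptive sampling compression strategy to substantially enlarge the receptive field of LUT-based methods and improve image restoration performance while remaining efficient.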
Details
Motivation: Existing look-up table (LUT) based image restoration methods incur extra computation and storage overhead when expanding the receptive field, which makes them hard to deploy on edge devices.
Method: ShiftLUT has three core components: (1) a Learnable Spatial Shift module (LSS) that expands the receptive field with channel-wise spatial offsets; (2) an asymmetric dual-branch architecture that allocates more computation to the information-dense branch to reduce latency; and (3) Error-bounded Adaptive Sampling (EAS) for feature-level LUT compression.
Result: Compared with the state-of-the-art method TinyLUT, ShiftLUT has a 3.8× larger receptive field and improves average PSNR by over 0.21 dB across multiple benchmarks, while keeping storage small and inference latency low.
Conclusion: ShiftLUT strikes a better balance between efficiency and performance and offers a new approach to efficient image restoration on edge devices.
Abstract: Look-Up Table based methods have emerged as a promising direction for efficient image restoration tasks. Recent LUT-based methods focus on improving their performance by expanding the receptive field. However, they inevitably introduce extra computational and storage overhead, which hinders their deployment in edge devices. To address this issue, we propose ShiftLUT, a novel framework that attains the largest receptive field among all LUT-based methods while maintaining high efficiency. Our key insight lies in three complementary components. First, Learnable Spatial Shift module (LSS) is introduced to expand the receptive field by applying learnable, channel-wise spatial offsets on feature maps. Second, we propose an asymmetric dual-branch architecture that allocates more computation to the information-dense branch, substantially reducing inference latency without compromising restoration quality. Finally, we incorporate a feature-level LUT compression strategy called Error-bounded Adaptive Sampling (EAS) to minimize the storage overhead. Compared to the previous state-of-the-art method TinyLUT, ShiftLUT achieves a 3.8× larger receptive field and improves an average PSNR by over 0.21 dB across multiple standard benchmarks, while maintaining a small storage size and inference time.
[259] UD-SfPNet: An Underwater Descattering Shape-from-Polarization Network for 3D Normal Reconstruction
Puyun Wang,Kaimin Yu,Huayang He,Feng Huang,Xianyu Wu,Yating Chen
Main category: cs.CV
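TL;DR: This paper proposes UD-SfPNet, a network that jointly performs descattering and shape-from-polarization reconstruction to improve surface normal prediction accuracy in underwater optical 3D imaging.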
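Surface normal prediction of this kind is typically scored by the mean angular error between predicted and ground-truth normals, the metric this entry reports in degrees. The short sketch below shows how such a metric is commonly computed. It is an illustration only: the function names and the toy vectors are hypothetical, and the paper's exact evaluation protocol (for example masking or per-image averaging) is not specified in this digest.

```python
import math

def angular_error_deg(n_pred, n_true):
    """Angle in degrees between two 3D normal vectors (need not be unit length)."""
    dot = sum(a * b for a, b in zip(n_pred, n_true))
    norm = (math.sqrt(sum(a * a for a in n_pred))
            * math.sqrt(sum(b * b for b in n_true)))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain
    cos_angle = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_angle))

def mean_angular_error_deg(pred_normals, true_normals):
    """Average per-pixel angular error over paired normal lists."""
    errors = [angular_error_deg(p, t) for p, t in zip(pred_normals, true_normals)]
    return sum(errors) / len(errors)

# Hypothetical two-pixel example: one exact match and one 90-degree miss
pred = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]
true = [(0.0, 0.0, 1.0), (0.0, 1.0, 0.0)]
mean_err = mean_angular_error_deg(pred, true)
```

Lower values are better, and a value of 0 means every predicted normal is parallel to its ground-truth counterpart.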
Details
Motivation: Underwater optical imaging is severely degraded by scattering, while polarization imaging offers both descattering and shape-from-polarization (SfP) recovery; the two need to be modeled jointly to avoid the error accumulation of sequential processing.
Method: The UD-SfPNet framework jointly models polarization image descattering and SfP normal estimation; a color embedding module strengthens geometric consistency, and a detail enhancement convolution module preserves high-frequency geometric detail.
Result: On the MuS-Polar3D dataset the method achieves a mean surface normal angular error of 15.12°, the lowest among the compared methods; the code is open source.
Conclusion: Combining descattering with polarization-based shape inference significantly improves underwater 3D reconstruction accuracy, and UD-SfPNet has practical application value.
Abstract: Underwater optical imaging is severely hindered by scattering, but polarization imaging offers the unique dual advantages of descattering and shape-from-polarization (SfP) 3D reconstruction. To exploit these advantages, this paper proposes UD-SfPNet, an underwater descattering shape-from-polarization network that leverages polarization cues for improved 3D surface normal prediction. The framework jointly models polarization-based image descattering and SfP normal estimation in a unified pipeline, avoiding error accumulation from sequential processing and enabling global optimization across both tasks. UD-SfPNet further incorporates a novel color embedding module to enhance geometric consistency by exploiting the relationship between color encodings and surface orientation. A detail enhancement convolution module is also included to better preserve high-frequency geometric details that are lost under scattering. Experiments on the MuS-Polar3D dataset show that the proposed method significantly improves reconstruction accuracy, achieving a mean surface normal angular error of 15.12° (the lowest among compared methods). These results confirm the efficacy of combining descattering with polarization-based shape inference, and highlight the practical significance and potential applications of UD-SfPNet for optical 3D imaging in challenging underwater environments. The code is available at https://github.com/WangPuyun/UD-SfPNet.
[260] On the Exact Algorithmic Extraction of Finite Tesselations Through Prime Extraction of Minimal Representative Forms
Sushish Baral,Paulo Garcia,Warisa Sritriratanarak
Main category: cs.CV
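TL;DR: This paper proposes a hierarchical algorithm that finds exact, axis-aligned rectangular tessellations in finite planar grids. It supports symbolic recognition of multi-level nested repeating patterns and is deterministic, efficient, and scalable.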
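To make the notion of an exact, axis-aligned rectangular tessellation concrete, here is a deliberately simple brute-force sketch. It is not the paper's hierarchical algorithm (no composite discovery, pruning, or memoization) and handles only the case where a single tile repeats across the whole grid. The function name `minimal_tile` is hypothetical.

```python
def minimal_tile(grid):
    """Return ((height, width), tile) for the smallest tile that exactly tiles `grid`.

    Only tile sizes that divide the grid dimensions are tried, smallest area
    first. The full grid is always a valid (trivial) tile, so a result is
    always returned.
    """
    rows, cols = len(grid), len(grid[0])
    candidates = sorted(
        ((h, w)
         for h in range(1, rows + 1) if rows % h == 0
         for w in range(1, cols + 1) if cols % w == 0),
        key=lambda hw: (hw[0] * hw[1], hw[0]),
    )
    for h, w in candidates:
        # Exact check: every cell must equal the tile cell it maps onto
        if all(grid[r][c] == grid[r % h][c % w]
               for r in range(rows) for c in range(cols)):
            return (h, w), [row[:w] for row in grid[:h]]

# A 4x4 grid built from a repeating 2x2 tile
grid = [
    [1, 2, 1, 2],
    [3, 4, 3, 4],
    [1, 2, 1, 2],
    [3, 4, 3, 4],
]
shape, tile = minimal_tile(grid)
```

Here `shape` is `(2, 2)` and `tile` is `[[1, 2], [3, 4]]`. Because the comparison is exact equality, a single differing cell makes the sketch fall back to a larger tile, which is the deterministic behaviour that separates this setting from statistical pattern recognition on noisy data.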
Details
Motivation: Existing statistical methods suit noisy data but lack exactness, while deterministic extraction of periodic structures in symbolic analysis is still immature, with a clear gap in settings where multiple patterns coexist within a hierarchical structure.
Method: A hierarchical algorithm consisting of composite discovery (dual inspection and breadth-first pruning) to identify rectangular regions with internal repetition, normalization to a minimal representative form, and prime extraction (selective duplication and hierarchical memoization) to handle irregular dimensions and improve computational efficiency.
Result: Scalability is evaluated on grids from 2×2 to 32×32: overlap detection on simple repeating tiles takes under 1 ms, whereas complex patterns that require exhaustive search and systematic exploration show exponential growth in time; the algorithm deterministically identifies exact, axis-aligned, rectangular tessellations.
Conclusion: The algorithm fills a key gap in deterministic periodic-structure extraction for symbolic grid analysis and applies to puzzle-solving reasoning tasks and to identifying exact repeating structures in discrete symbolic domains.
Abstract: The identification of repeating patterns in discrete grids is rudimentary within symbolic reasoning, algorithm synthesis and structural optimization across diverse computational domains. Although statistical approaches targeting noisy data can approximately recognize patterns, symbolic analysis utilizing deterministic extraction of periodic structures is underdeveloped. This paper aims to fill this gap by employing a hierarchical algorithm that discovers exact tessellations in finite planar grids, addressing the problem where multiple independent patterns may coexist within a hierarchical structure. The proposed method utilizes composite discovery (dual inspection and breadth-first pruning) for identifying rectangular regions with internal repetition, normalization to a minimal representative form, and prime extraction (selective duplication and hierarchical memoization) to account for irregular dimensions and to achieve efficient computation time. We evaluate scalability on grid sizes from 2x2 to 32x32, showing overlap detection on simple repeating tiles exhibits processing time under 1ms, while complex patterns which require exhaustive search and systematic exploration shows exponential growth. This algorithm provides deterministic behavior for exact, axis-aligned, rectangular tessellations, addressing a critical gap in symbolic grid analysis techniques, applicable to puzzle solving reasoning tasks and identification of exact repeating structures in discrete symbolic domains.
[261] VGGT-Det: Mining VGGT Internal Priors for Sensor-Geometry-Free Multi-View Indoor 3D Object Detection
Yang Cao,Feize Wu,Dave Zhenyu Chen,Yingji Zhong,Lanqing Hong,Dan Xu
Main category: cs.CV
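TL;DR: This paper proposes VGGT-Det, the first multi-view indoor 3D object detection framework that needs no sensor geometry (such as camera poses or depth), significantly surpassing existing methods on ScanNet and ARKitScenes.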
Details
Motivation: Existing methods depend on precise multi-view camera calibration that is costly and hard to obtain, which limits real-world deployment; this paper targets multi-view indoor 3D detection without sensor-provided geometric inputs (SG-Free).
Method: Building on the VGGT model, its encoder is embedded into a Transformer-based detection pipeline, and two new modules are designed: (i) Attention-Guided Query Generation (AG), which uses VGGT attention maps as semantic priors to initialize object queries; and (ii) Query-Driven Feature Aggregation (QD), which uses a learnable "See-Query" to dynamically aggregate multi-level geometric features across VGGT layers for 2D-to-3D lifting.
Result: mAP@0.25 improves by 4.4 on ScanNet and by 8.6 on ARKitScenes; ablations show that AG and QD effectively exploit the semantic and geometric priors learned inside VGGT.
Conclusion: VGGT-Det demonstrates that multi-view indoor 3D detection is feasible and effective without explicit geometric inputs, by mining strong 3D cues from images alone, and offers a new paradigm for more practical deployment.
Abstract: Current multi-view indoor 3D object detectors rely on sensor geometry that is costly to obtain (i.e., precisely calibrated multi-view camera poses) to fuse multi-view information into a global scene representation, limiting deployment in real-world scenes. We target a more practical setting: Sensor-Geometry-Free (SG-Free) multi-view indoor 3D object detection, where there are no sensor-provided geometric inputs (multi-view poses or depth). Recent Visual Geometry Grounded Transformer (VGGT) shows that strong 3D cues can be inferred directly from images. Building on this insight, we present VGGT-Det, the first framework tailored for SG-Free multi-view indoor 3D object detection. Rather than merely consuming VGGT predictions, our method integrates VGGT encoder into a transformer-based pipeline. To effectively leverage both the semantic and geometric priors from inside VGGT, we introduce two novel key components: (i) Attention-Guided Query Generation (AG): exploits VGGT attention maps as semantic priors to initialize object queries, improving localization by focusing on object regions while preserving global spatial structure; (ii) Query-Driven Feature Aggregation (QD): a learnable See-Query interacts with object queries to 'see' what they need, and then dynamically aggregates multi-level geometric features across VGGT layers that progressively lift 2D features into 3D. Experiments show that VGGT-Det significantly surpasses the best-performing method in the SG-Free setting by 4.4 and 8.6 mAP@0.25 on ScanNet and ARKitScenes, respectively. Ablation study shows that VGGT's internally learned semantic and geometric priors can be effectively leveraged by our AG and QD.
[262] Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards
Seungwook Kim,Minsu Cho
Main category: cs.CV
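TL;DR: This paper proposes the ARC framework, which replaces external reward supervision with a self-confidence signal from the model's own self-denoising process, enabling post-training of text-to-image generative models without additional data, annotations, or reward models.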
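As a rough illustration of turning noise-recovery accuracy into a scalar reward, the sketch below scores each probe by the negative mean squared error between the injected noise and the model's predicted noise, then averages over probes. This is an assumed form chosen for simplicity, not the reward defined in the paper. The function names, the use of MSE, and the toy numbers are hypothetical, and the noise prediction itself (which would come from the diffusion model) is passed in as plain lists.

```python
def noise_recovery_reward(injected, predicted):
    """Negative MSE between injected and predicted noise (higher means more confident)."""
    mse = sum((e - p) ** 2 for e, p in zip(injected, predicted)) / len(injected)
    return -mse

def self_confidence_reward(probes):
    """Average reward over several (injected_noise, predicted_noise) probes."""
    return sum(noise_recovery_reward(e, p) for e, p in probes) / len(probes)

# Hypothetical probes: the first sample recovers its noise exactly, the second does not
confident_probes = [([0.5, -1.0], [0.5, -1.0]), ([0.2, 0.1], [0.2, 0.1])]
uncertain_probes = [([0.5, -1.0], [0.0, 0.0]), ([0.2, 0.1], [1.0, 1.0])]

r_confident = self_confidence_reward(confident_probes)
r_uncertain = self_confidence_reward(uncertain_probes)
```

Under this assumed form, a generation whose injected noise is recovered more accurately receives a higher reward, so a policy-gradient style update would reinforce it. No external reward model or annotation is involved.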
Details
Motivation: To better match text-to-image generative models to human preferences, factuality, and aesthetics while avoiding the cost and bias of relying on external reward supervision.
Method: The ARC (Adaptive Rewarding by self-Confidence) framework uses self-denoising probes to evaluate how accurately the model recovers injected noise, producing an intrinsic self-confidence signal that is converted into scalar rewards for unsupervised post-training.
Result: ARC consistently outperforms the baseline on compositional generation, text rendering, and text-image alignment; combined with external rewards it brings complementary gains and alleviates reward hacking.
Conclusion: Internal self-confidence is a feasible and effective alternative to external rewards, and ARC offers a new paradigm for unsupervised post-training of text-to-image models.
Abstract: Text-to-image generation powers content creation across design, media, and data augmentation. Post-training of text-to-image generative models is a promising path to better match human preferences, factuality, and improved aesthetics. We introduce ARC (Adaptive Rewarding by self-Confidence), a post-training framework that replaces external reward supervision with an internal self-confidence signal, obtained by evaluating how accurately the model recovers injected noise under self-denoising probes. ARC converts this intrinsic signal into scalar rewards, enabling fully unsupervised optimization without additional datasets, annotators, or reward models. Empirically, by reinforcing high-confidence generations, ARC delivers consistent gains in compositional generation, text rendering and text-image alignment over the baseline. We also find that integrating ARC with external rewards results in a complementary improvement, with alleviated reward hacking.
[263] DriveCode: Domain Specific Numerical Encoding for LLM-Based Autonomous Driving
Zhiye Wang,Yanbo Jiang,Rui Zhou,Bo Zhang,Fang Zhang,Zhenhua Xu,Yaqin Zhang,Jianqiang Wang
Main category: cs.CV
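TL;DR: This paper proposes DriveCode, a new numerical encoding method that represents numbers as dedicated embeddings rather than discrete text tokens, to overcome the precision and efficiency limits of numerical reasoning in large language models for autonomous driving.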
Details
Motivation: Existing large language models discretize numbers into tokens, which limits precise numerical reasoning, affects sensor data processing and control command generation, and hinders practical deployment of LLMs in autonomous driving.
Method: DriveCode uses a number projector to map numbers into the language model's hidden space, where they take part in multimodal sequence modeling as dedicated embeddings that integrate seamlessly with visual and textual features.
Result: Evaluated on the OmniDrive, DriveGPT4, and DriveGPT4-V2 datasets, DriveCode outperforms baseline methods on trajectory prediction and control signal generation.
Conclusion: DriveCode effectively improves the numerical modeling ability of LLMs on autonomous driving tasks and opens a new path toward accurate, efficient end-to-end autonomous driving systems.
Abstract: Large language models (LLMs) have shown great promise for autonomous driving. However, discretizing numbers into tokens limits precise numerical reasoning, fails to reflect the positional significance of digits in the training objective, and makes it difficult to achieve both decoding efficiency and numerical precision. These limitations affect both the processing of sensor measurements and the generation of precise control commands, creating a fundamental barrier for deploying LLM-based autonomous driving systems. In this paper, we introduce DriveCode, a novel numerical encoding method that represents numbers as dedicated embeddings rather than discrete text tokens. DriveCode employs a number projector to map numbers into the language model's hidden space, enabling seamless integration with visual and textual features in a unified multimodal sequence. Evaluated on OmniDrive, DriveGPT4, and DriveGPT4-V2 datasets, DriveCode demonstrates superior performance in trajectory prediction and control signal generation, confirming its effectiveness for LLM-based autonomous driving systems.
[264] Learning to Weigh Waste: A Physics-Informed Multimodal Fusion Framework and Large-Scale Dataset for Commercial and Industrial Applications
Md. Adnanul Islam,Wasimul Karim,Md Mahbub Alam,Subhey Sadi Rahman,Md. Abdur Rahman,Arefin Ittesafun Abian,Mohaimenul Azam Khan Raiaan,Kheng Cher Yeo,Deepika Mathur,Sami Azam
Main category: cs.CV
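TL;DR: This paper proposes a Multimodal Weight Predictor (MWP) framework that combines RGB images with physics-informed metadata (such as dimensions, camera distance, and camera height) to estimate waste weight accurately on the real-world Waste-Weight-10K dataset, and adds an explanation module to improve trustworthiness.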
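This entry trains with Mean Squared Logarithmic Error (MSLE) and reports MAE and MAPE. The sketch below gives the standard textbook definitions of the three quantities in plain Python, to show why a log-scale loss is a natural fit when targets span a few kilograms to several tonnes. The function names and the toy weights are illustrative, and the `log(1 + y)` form of MSLE is the common convention rather than something this digest confirms for the paper.

```python
import math

def msle(y_true, y_pred):
    """Mean Squared Logarithmic Error using log(1 + y), the common convention."""
    return sum((math.log1p(t) - math.log1p(p)) ** 2
               for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean Absolute Error, in the same unit as the targets (kg here)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error, in percent; targets must be non-zero."""
    return 100.0 * sum(abs(t - p) / abs(t)
                       for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical weights in kg: both predictions are 10% too high
weights_true = [10.0, 1000.0]
weights_pred = [11.0, 1100.0]

light_msle = msle([10.0], [11.0])
heavy_msle = msle([1000.0], [1100.0])
overall_mae = mae(weights_true, weights_pred)
overall_mape = mape(weights_true, weights_pred)
```

The two items contribute squared errors of 1 and 10,000 kg² to an ordinary MSE, so the heavy item would dominate training. Their MSLE terms are of similar size, because the loss depends mainly on the ratio between prediction and target.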
Details
Motivation: Existing image-based waste weight estimation is limited because objects that look alike can differ in density and because apparent size changes with viewpoint, which makes accurate weight estimation difficult.
Method: The Multimodal Weight Predictor (MWP) framework uses a Vision Transformer to extract image features and a dedicated metadata encoder for geometric and category information, and fuses visual and physical cues through Stacked Mutual Attention Fusion; it is trained with a Mean Squared Logarithmic Error loss and integrates SHAP and a large language model for physically grounded explanations.
Result: On the Waste-Weight-10K test set the method reaches 88.06 kg MAE, 6.39% MAPE, and R² = 0.9548; for light objects (0–100 kg) MAE is 2.38 kg and MAPE is 3.1%; for heavy waste (1000–2000 kg) MAPE is 11.1%; predictions come with explanations.
Conclusion: By fusing visual features with physical metadata, and with the help of attention-based fusion and log-error training, MWP substantially improves robustness and accuracy across the weight range while remaining interpretable, making it suitable for real logistics and recycling settings.
Abstract: Accurate weight estimation of commercial and industrial waste is important for efficient operations, yet image-based estimation remains difficult because similar-looking objects may have different densities, and the visible size changes with camera distance. Addressing this problem, we propose Multimodal Weight Predictor (MWP) framework that estimates waste weight by combining RGB images with physics-informed metadata, including object dimensions, camera distance, and camera height. We also introduce Waste-Weight-10K, a real-world dataset containing 10,421 synchronized image-metadata collected from logistics and recycling sites. The dataset covers 11 waste categories and a wide weight range from 3.5 to 3,450 kg. Our model uses a Vision Transformer for visual features and a dedicated metadata encoder for geometric and category information, combining them with Stacked Mutual Attention Fusion that allows visual and physical cues guide each other. This helps the model manage perspective effects and link objects to material properties. To ensure stable performance across the wide weight range, we train the model using Mean Squared Logarithmic Error. On the test set, the proposed method achieves 88.06 kg Mean Absolute Error (MAE), 6.39% Mean Absolute Percentage Error (MAPE), and an R2 coefficient of 0.9548. The model shows strong accuracy for light objects in the 0-100 kg range with 2.38 kg MAE and 3.1% MAPE, maintaining reliable performance for heavy waste in the 1000-2000 kg range with 11.1% MAPE. Finally, we incorporate a physically grounded explanation module using Shapley Additive Explanations (SHAP) and a large language model to provide clear, human-readable explanations for each prediction.
[265] Seeing Beyond 8bits: Subjective and Objective Quality Assessment of HDR-UGC Videos
Shreshth Saini,Bowen Chen,Neil Birkbeck,Yilin Wang,Balu Adsumilli,Alan C. Bovik
Main category: cs.CV
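TL;DR: This paper proposes HDR-Q, the first multimodal large language model for quality assessment of HDR user-generated videos, and builds the large-scale subjective dataset Beyond8Bits to advance the field.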
Details
Motivation: Existing perceptual video quality assessment (VQA) systems are designed mainly for Standard Dynamic Range (SDR) and cope poorly with distortions specific to High Dynamic Range (HDR) UGC videos (such as near-black crushing, highlight clipping, banding, and exposure flicker), so new methods and datasets adapted to HDR are needed.
Method: Beyond8Bits, a large-scale subjective dataset of 44K HDR UGC videos with 1.5M crowd ratings, is built. The HDR-Q model comprises (i) an HDR-aware vision encoder that produces HDR-sensitive embeddings and (ii) HDR-Aware Policy Optimization (HAPO), a reinforcement learning fine-tuning framework that combines an HDR-SDR contrastive KL divergence with a Gaussian weighted regression reward.
Result: HDR-Q reaches state-of-the-art performance on Beyond8Bits and on public HDR-VQA benchmarks.
Conclusion: HDR-Q is the first successful application of a multimodal large language model to HDR UGC video quality assessment; HAPO and the HDR-aware encoder markedly improve modeling of HDR-specific distortions, and Beyond8Bits provides key data for future research.
Abstract: High Dynamic Range (HDR) user-generated (UGC) videos are rapidly proliferating across social platforms, yet most perceptual video quality assessment (VQA) systems remain tailored to Standard Dynamic Range (SDR). HDR has a higher bit depth, wide color gamut, and elevated luminance range, exposing distortions such as near-black crushing, highlight clipping, banding, and exposure flicker that amplify UGC artifacts and challenge SDR models. To catalyze progress, we curate Beyond8Bits, a large-scale subjective dataset of 44K videos from 6.5K sources with over 1.5M crowd ratings, spanning diverse scenes, capture conditions, and compression settings. We further introduce HDR-Q, the first Multimodal Large Language Model (MLLM) for HDR-UGC VQA. We propose (i) a novel HDR-aware vision encoder to produce HDR-sensitive embeddings, and (ii) HDR-Aware Policy Optimization (HAPO), an RL finetuning framework that anchors reasoning to HDR cues. HAPO augments GRPO via an HDR-SDR contrastive KL that encourages token reliance on HDR inputs and a Gaussian weighted regression reward for fine-grained MOS calibration. Across Beyond8Bits and public HDR-VQA benchmarks, HDR-Q delivers state-of-the-art performance.
[266] Mobile-VTON: High-Fidelity On-Device Virtual Try-On
Zhenchen Wan,Ce Chen,Runqi Lin,Jiaxin Huang,Tianxi Chen,Yanwu Xu,Tongliang Liu,Mingming Gong
Main category: cs.CV
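TL;DR: Mobile-VTON is a high-quality, privacy-preserving virtual try-on framework that runs offline on commodity mobile devices. It uses a three-module TeacherNet–GarmentNet–TryonNet architecture and several lightweight design choices to deploy efficiently on-device without sacrificing visual quality.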
Details
Motivation: Most existing virtual try-on systems rely on cloud GPUs and require users to upload photos, which risks privacy leaks and makes local deployment on mobile devices difficult.
Method: Proposes a modular TGT architecture (TeacherNet–GarmentNet–TryonNet) that combines Feature-Guided Adversarial (FGA) distillation, a trajectory-consistency loss, latent concatenation, and lightweight cross-modal conditioning to jointly optimize knowledge distillation, garment-semantics preservation, and person-garment alignment.
Result: On VITON-HD and DressCode at 1024×768 resolution, generation quality matches or exceeds server-side baselines while running fully offline with low computational overhead.
Conclusion: Shows that high-quality virtual try-on can run entirely offline on mobile devices, balancing performance, privacy, and practicality, and offers a secure, workable solution for real-world use.
Abstract: Virtual try-on (VTON) has recently achieved impressive visual fidelity, but most existing systems require uploading personal photos to cloud-based GPUs, raising privacy concerns and limiting on-device deployment. To address this, we present Mobile-VTON, a high-quality, privacy-preserving framework that enables fully offline virtual try-on on commodity mobile devices using only a single user image and a garment image. Mobile-VTON introduces a modular TeacherNet–GarmentNet–TryonNet (TGT) architecture that integrates knowledge distillation, garment-conditioned generation, and garment alignment into a unified pipeline optimized for on-device efficiency. Within this framework, we propose a Feature-Guided Adversarial (FGA) Distillation strategy that combines teacher supervision with adversarial learning to better match real-world image distributions. GarmentNet is trained with a trajectory-consistency loss to preserve garment semantics across diffusion steps, while TryonNet uses latent concatenation and lightweight cross-modal conditioning to enable robust garment-to-person alignment without large-scale pretraining. By combining these components, Mobile-VTON achieves high-fidelity generation with low computational overhead. Experiments on VITON-HD and DressCode at 1024×768 show that it matches or outperforms strong server-based baselines while running entirely offline. These results demonstrate that high-quality VTON is not only feasible but also practical on-device, offering a secure solution for real-world applications.
[267] StegoNGP: 3D Cryptographic Steganography using Instant-NGP
Wenxiang Jiang,Yujun Lan,Shuo Zhao,Yuanshan Liu,Mingzhu Zhou,Jinxin Wang
Main category: cs.CV
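TL;DR: This paper proposes StegoNGP, a 3D cryptographic steganography method that needs no extra parameters. It uses the Instant-NGP hash encoding function as a key-controlled scene switcher to embed a cover scene and a hidden scene in a single model, leaving the model's architecture and parameter count unchanged, and adds a multi-key mechanism for stronger security and robustness.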
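To make the "key-controlled scene switcher" idea concrete, here is a minimal Python sketch of a keyed spatial hash. The per-axis primes follow the spatial hash described for Instant-NGP; everything about how the key enters (`keyed_hash`, XOR-ing the key into the hash, one key per resolution level) is an assumption made for illustration, since this entry does not specify StegoNGP's actual key-mixing rule.

```python
# Illustrative sketch only: the key-mixing rule is an assumption, not StegoNGP's published scheme.
PRIMES = (1, 2654435761, 805459861)  # per-axis primes of the Instant-NGP spatial hash


def keyed_hash(grid_coords, table_size, key=0):
    """Spatial hash of integer grid coordinates; key=0 reduces to the unkeyed hash."""
    h = key  # hypothetical choice: the key is XOR-ed into the hash
    for c, p in zip(grid_coords, PRIMES):
        h ^= c * p
    return h % table_size


def multi_level_indices(point, resolutions, table_size, keys):
    """One hash-table index per resolution level, each level using its own key."""
    indices = []
    for res, key in zip(resolutions, keys):
        grid = tuple(int(v * res) for v in point)  # lower grid vertex at this level
        indices.append(keyed_hash(grid, table_size, key))
    return indices


point = (0.3, 0.7, 0.5)
resolutions = (16, 32, 64)
table_size = 2 ** 14
cover = multi_level_indices(point, resolutions, table_size, keys=(0, 0, 0))
hidden = multi_level_indices(point, resolutions, table_size, keys=(11, 22, 33))
# A partially leaked key only reaches the hidden entries on the levels it gets right.
partial = multi_level_indices(point, resolutions, table_size, keys=(11, 0, 33))
```

Under this assumed rule, the same network weights are read through different table entries depending on the key, so the default key and the secret key address different feature sets without any change to architecture or parameter count.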
Details
Motivation: Existing neural steganography methods depend on external decoders, require architectural changes, have limited capacity, and are easy to detect, making it hard to securely embed high-capacity 3D scene data in Instant-NGP.
Method: Proposes the parameter-free StegoNGP method, which uses the Instant-NGP hash encoding function as a key-controlled scene switcher: a default key maps to the cover scene and a secret key to the hidden scene. A Multi-Key scheme with independent keys across hash levels enlarges the key space and improves resistance to partial key disclosure.
Result: Experiments show that StegoNGP can hide a complete 3D scene at high quality with strong imperceptibility and security; the model looks identical to a standard Instant-NGP.
Conclusion: StegoNGP offers a new paradigm for high-capacity, undetectable information hiding in neural fields that is both practical and secure.
Abstract: Recently, Instant Neural Graphics Primitives (Instant-NGP) has achieved significant success in rapid 3D scene reconstruction, but securely embedding high-capacity hidden data, such as an entire 3D scene, remains a challenge. Existing methods rely on external decoders, require architectural modifications, and suffer from limited capacity, which makes them easily detectable. We propose a novel parameter-free 3D Cryptographic Steganography using Instant-NGP (StegoNGP), which leverages the Instant-NGP hash encoding function as a key-controlled scene switcher. By associating a default key with a cover scene and a secret key with a hidden scene, our method trains a single model to interweave both representations within the same network weights. The resulting model is indistinguishable from a standard Instant-NGP in architecture and parameter count. We also introduce an enhanced Multi-Key scheme, which assigns multiple independent keys across hash levels, dramatically expanding the key space and providing high robustness against partial key disclosure attacks. Experimental results demonstrate that StegoNGP can hide a complete high-quality 3D scene with strong imperceptibility and security, providing a new paradigm for high-capacity, undetectable information hiding in neural fields. The code can be found at https://github.com/jiang-wenxiang/StegoNGP.
[268] Decoupling Motion and Geometry in 4D Gaussian Splatting
Yi Zhang,Yulei Kang,Jian-Fang Hu
Main category: cs.CV
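TL;DR: This paper proposes VeGaS, a velocity-based 4D Gaussian Splatting framework that introduces a Galilean shearing matrix and a Geometric Deformation Network to decouple Gaussian motion from geometric attributes, markedly improving high-fidelity reconstruction of dynamic scenes.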
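The decoupling claim can be checked on a toy case. The sketch below assumes a constant velocity and a 4D Gaussian whose space and time components start out uncorrelated; it applies a Galilean shear (x, t) -> (x + v t, t) and then conditions on time. The helper names are hypothetical, and the paper's time-varying velocity and deformation network are not modeled here.

```python
import numpy as np


def galilean_shear(velocity):
    """4x4 shear mapping (x, t) to (x + v * t, t) for a constant 3D velocity."""
    shear = np.eye(4)
    shear[:3, 3] = velocity
    return shear


def condition_on_time(mean, cov, t):
    """Mean and covariance of the 3D spatial part of a 4D Gaussian given time t."""
    mean_x, mean_t = mean[:3], mean[3]
    cov_xx, cov_xt, var_t = cov[:3, :3], cov[:3, 3], cov[3, 3]
    cond_mean = mean_x + cov_xt * (t - mean_t) / var_t
    cond_cov = cov_xx - np.outer(cov_xt, cov_xt) / var_t
    return cond_mean, cond_cov


# Toy Gaussian: static shape in space, spread 0.5 in time (illustrative values).
shape_cov = np.diag([0.04, 0.01, 0.09])
mean = np.array([0.0, 0.0, 0.0, 0.5])
cov = np.zeros((4, 4))
cov[:3, :3] = shape_cov
cov[3, 3] = 0.25

velocity = np.array([1.0, -2.0, 0.5])
shear = galilean_shear(velocity)
moved_mean = shear @ mean
moved_cov = shear @ cov @ shear.T

center, shape = condition_on_time(moved_mean, moved_cov, t=0.8)
```

In this toy setting the conditional center moves along `velocity * t` while the conditional covariance stays equal to `shape_cov`, which is one way to read the statement that motion is isolated from the geometry-related conditional covariance.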
Details
Motivation: Existing 4D Gaussian Splatting methods couple Gaussian motion and geometric attributes in a single covariance, which limits their ability to model complex motion and produces visual artifacts.
Method: Proposes the VeGaS framework: 1) a Galilean shearing matrix explicitly introduces time-varying velocity to decouple motion from geometry; 2) a Geometric Deformation Network refines Gaussian shapes and orientations using spatio-temporal context and velocity cues.
Result: Experiments on multiple public datasets show that VeGaS achieves state-of-the-art performance.
Conclusion: By decoupling motion and geometry, VeGaS improves the quality and robustness of dynamic scene reconstruction and offers a new direction for 4D reconstruction.
Abstract: High-fidelity reconstruction of dynamic scenes is an important yet challenging problem. While recent 4D Gaussian Splatting (4DGS) has demonstrated the ability to model temporal dynamics, it couples Gaussian motion and geometric attributes within a single covariance formulation, which limits its expressiveness for complex motions and often leads to visual artifacts. To address this, we propose VeGaS, a novel velocity-based 4D Gaussian Splatting framework that decouples Gaussian motion and geometry. Specifically, we introduce a Galilean shearing matrix that explicitly incorporates time-varying velocity to flexibly model complex non-linear motions, while strictly isolating the effects of Gaussian motion from the geometry-related conditional Gaussian covariance. Furthermore, a Geometric Deformation Network is introduced to refine Gaussian shapes and orientations using spatio-temporal context and velocity cues, enhancing temporal geometric modeling. Extensive experiments on public datasets demonstrate that VeGaS achieves state-of-the-art performance.
[269] PreciseCache: Precise Feature Caching for Efficient and High-fidelity Video Generation
Jiangshan Wang,Kang Zhao,Jiayi Guo,Jiayu Wang,Hang Guo,Chenyang Zhu,Xiu Li,Xiangyu Yue
Main category: cs.CV
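TL;DR: This paper proposes PreciseCache, a framework that precisely detects and skips truly redundant computation at both the step level and the block level, substantially accelerating inference of video generation models without loss of generation quality.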
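As a rough illustration of a low-frequency difference test for step-wise caching, the sketch below low-passes two 2D feature maps with an FFT mask and compares them. The mask size (`keep`), the relative-L2 form of the difference, and the threshold are all assumptions for this example; the entry does not give the paper's exact LFD definition or how it is obtained cheaply at inference time.

```python
import numpy as np


def low_frequency_difference(current, cached, keep=0.25):
    """Relative L2 gap between the low-frequency parts of two 2D feature maps.

    `keep` is the fraction of the spectrum retained per axis (an illustrative choice).
    """
    def low_pass(x):
        spectrum = np.fft.fftshift(np.fft.fft2(x))  # zero frequency moved to the center
        h, w = x.shape
        kh, kw = max(1, int(h * keep / 2)), max(1, int(w * keep / 2))
        mask = np.zeros((h, w), dtype=bool)
        mask[h // 2 - kh: h // 2 + kh + 1, w // 2 - kw: w // 2 + kw + 1] = True
        return np.where(mask, spectrum, 0)

    low_current, low_cached = low_pass(current), low_pass(cached)
    gap = np.linalg.norm(low_current - low_cached)
    return float(gap / (np.linalg.norm(low_cached) + 1e-8))


def should_skip_step(current, cached, threshold=0.05):
    """Reuse cached features when the low-frequency content has barely changed."""
    return low_frequency_difference(current, cached) < threshold
```

With this definition, a change confined to the highest spatial frequency (for example a checkerboard pattern) leaves the score near zero and the step is treated as redundant, while a shift in the mean of the feature map raises the score and forces recomputation.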
Details
Motivation: Existing feature-caching acceleration methods cannot identify which features are truly redundant, so they skip important computations and noticeably degrade generation quality.
Method: Proposes the PreciseCache framework with two components: LFCache (step-wise caching based on the Low-Frequency Difference, LFD) and BlockCache (skipping redundant computation at the network-block level).
Result: Achieves an average 2.6x speedup across several backbones with no noticeable quality loss.
Conclusion: PreciseCache is a plug-and-play, precise, and general acceleration scheme for video generation that addresses the trade-off between speed and quality.
Abstract: High computational costs and slow inference hinder the practical application of video generation models. While prior works accelerate the generation process through feature caching, they often suffer from notable quality degradation. In this work, we reveal that this issue arises from their inability to distinguish truly redundant features, which leads to the unintended skipping of computations on important features. To address this, we propose PreciseCache, a plug-and-play framework that precisely detects and skips truly redundant computations, thereby accelerating inference without sacrificing quality. Specifically, PreciseCache contains two components: LFCache for step-wise caching and BlockCache for block-wise caching. For LFCache, we compute the Low-Frequency Difference (LFD) between the prediction features of the current step and those from the previous cached step. Empirically, we observe that LFD serves as an effective measure of step-wise redundancy, accurately detecting highly redundant steps whose computation can be skipped through reusing cached features. To further accelerate generation within each non-skipped step, we propose BlockCache, which precisely detects and skips redundant computations at the block level within the network. Extensive experiments on various backbones demonstrate the effectiveness of our PreciseCache, which achieves an average of 2.6x speedup without noticeable quality loss. Source code will be released.
[270] EraseAnything++: Enabling Concept Erasure in Rectified Flow Transformers Leveraging Multi-Object Optimization
Zhaoxin Fan,Nanxiang Jiang,Daiheng Gao,Shiji Zhou,Wenjun Wu
Main category: cs.CV
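TL;DR: This paper proposes EraseAnything++, a unified framework for removing undesired concepts from flow-matching-based text-to-image and text-to-video diffusion models while preserving generation quality. It formulates concept erasure as a constrained multi-objective optimization problem and achieves efficient, utility-preserving erasure through implicit gradient surgery, LoRA fine-tuning, and attention regularization; for video, an anchor-and-propagate mechanism strengthens temporal consistency. Experiments show new SOTA results in erasure effectiveness, generative fidelity, and temporal consistency.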
Details
Motivation: Existing concept erasure methods generalize poorly to newer T2I/T2V models built on flow matching and transformer architectures (such as Stable Diffusion v3, Flux, and OpenSora), and perform especially badly on long-horizon video generation.
Method: Proposes the EraseAnything++ framework: 1) formulates concept erasure as constrained multi-objective optimization; 2) designs a utility-preserving unlearning strategy based on implicit gradient surgery; 3) combines LoRA parameter tuning with attention-level regularization to anchor erasure on key visual representations and propagate it across spatial and temporal dimensions; 4) for video, adds an anchor-and-propagate mechanism that initializes erasure on reference frames and enforces it through subsequent layers to suppress temporal drift.
Result: On image and video benchmarks, EraseAnything++ substantially outperforms prior methods in erasure effectiveness, generative fidelity, and temporal consistency, setting a new SOTA for concept erasure in next-generation diffusion models.
Conclusion: EraseAnything++ provides a general, efficient, quality-preserving concept erasure solution for modern flow-matching and transformer-based T2I/T2V models, addressing the poor generalization and temporal inconsistency of existing methods.
Abstract: Removing undesired concepts from large-scale text-to-image (T2I) and text-to-video (T2V) diffusion models while preserving overall generative quality remains a major challenge, particularly as modern models such as Stable Diffusion v3, Flux, and OpenSora employ flow-matching and transformer-based architectures and extend to long-horizon video generation. Existing concept erasure methods, designed for earlier T2I/T2V models, often fail to generalize to these paradigms. To address this issue, we propose EraseAnything++, a unified framework for concept erasure in both image and video diffusion models with flow-matching objectives. Central to our approach is formulating concept erasure as a constrained multi-objective optimization problem that explicitly balances concept removal with preservation of generative utility. To solve the resulting conflicting objectives, we introduce an efficient utility-preserving unlearning strategy based on implicit gradient surgery. Furthermore, by integrating LoRA-based parameter tuning with attention-level regularization, our method anchors erasure on key visual representations and propagates it consistently across spatial and temporal dimensions. In the video setting, we further enhance consistency through an anchor-and-propagate mechanism that initializes erasure on reference frames and enforces it throughout subsequent transformer layers, thereby mitigating temporal drift. Extensive experiments on both image and video benchmarks demonstrate that EraseAnything++ substantially outperforms prior methods in erasure effectiveness, generative fidelity, and temporal consistency, establishing a new state of the art for concept erasure in next-generation diffusion models.
[271] Fake It Right: Injecting Anatomical Logic into Synthetic Supervised Pre-training for Medical Segmentation
Jiaqi Tang,Mengyan Zheng,Shu Zhang,Fandong Zhang,Qingchao Chen
Main category: cs.CV
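TL;DR: This paper proposes an anatomy-informed synthetic supervised pre-training framework. By introducing a lightweight shape bank built from real anatomical structures and a structure-aware sequential placement strategy for patch synthesis, it significantly improves ViT performance on 3D medical image segmentation while remaining privacy-compliant and scalable.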
Details
Motivation: Formula-Driven Supervised Learning (FDSL) protects privacy, but its generic geometric shapes create a semantic gap: they lack the morphological fidelity, fixed spatial layouts, and inter-organ relationships of real anatomy, so models struggle to learn global structural priors.
Method: Builds a lightweight anatomical shape bank from de-identified, label-only segmentation masks of 5 subjects; designs a structure-aware sequential placement strategy for patch synthesis, using spatial anchors for plausible localization and a topological graph to constrain inter-organ interactions (e.g., preventing impossible overlaps).
Result: On BTCV and MSD, significantly outperforms state-of-the-art FDSL baselines (+1.74%) and SSL methods (up to +1.66%), with performance improving steadily as synthetic data volume grows.
Conclusion: The method offers an efficient, scalable, privacy-compliant data alternative for 3D medical segmentation and bridges the key gap between synthetic data and real anatomical semantics.
Abstract: Vision Transformers (ViTs) excel in 3D medical segmentation but require massive annotated datasets. While Self-Supervised Learning (SSL) mitigates this using unlabeled data, it still faces strict privacy and logistical barriers. Formula-Driven Supervised Learning (FDSL) offers a privacy-preserving alternative by pre-training on synthetic mathematical primitives. However, a critical semantic gap limits its efficacy: generic shapes lack the morphological fidelity, fixed spatial layouts, and inter-organ relationships of real anatomy, preventing models from learning essential global structural priors. To bridge this gap, we propose an Anatomy-Informed Synthetic Supervised Pre-training framework unifying FDSL's infinite scalability with anatomical realism. We replace basic primitives with a lightweight shape bank with de-identified, label-only segmentation masks from 5 subjects. Furthermore, we introduce a structure-aware sequential placement strategy to govern the patch synthesis process. Instead of random placement, we enforce physiological plausibility using spatial anchors for correct localization and a topological graph to manage inter-organ interactions (e.g., preventing impossible overlaps). Extensive experiments on BTCV and MSD datasets demonstrate that our method significantly outperforms state-of-the-art FDSL baselines and SSL methods by 1.74% and up to 1.66%, while exhibiting a robust scaling effect where performance improves with increased synthetic data volume. This provides a data-efficient, privacy-compliant solution for medical segmentation. The code will be made publicly available upon acceptance.
[272] Event-Anchored Frame Selection for Effective Long-Video Understanding
Wang Chen,Yongdong Luo,Yuhui Zeng,Luojun Lin,Tianyu Xie,Fei Chao,Rongrong Ji,Xiawu Zheng
Main category: cs.CV
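TL;DR: This paper proposes Event-Anchored Frame Selection (EFS), a hierarchical, event-aware keyframe selection method that improves long-video understanding. It is training-free and plug-and-play, and significantly raises accuracy on several benchmarks.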
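The three stages of an event-anchored pipeline (segment into events, pick one anchor per event, refine globally with MMR) can be sketched in a few lines of Python. This is an illustration under simplifying assumptions: events are cut wherever consecutive-frame cosine similarity drops below a fixed threshold, and the refinement is plain greedy MMR with a fixed trade-off `lam` rather than the adaptive scheme the paper describes. Function names and default values are hypothetical.

```python
import numpy as np


def segment_events(frame_emb, threshold=0.8):
    """Assign an event id to each frame; a new event starts when similarity to the previous frame drops."""
    emb = frame_emb / np.linalg.norm(frame_emb, axis=1, keepdims=True)
    consecutive = np.sum(emb[1:] * emb[:-1], axis=1)
    boundaries = consecutive < threshold
    return np.concatenate([[0], np.cumsum(boundaries)])


def select_frames(frame_emb, relevance, event_ids, budget, lam=0.7):
    """Pick `budget` frame indices: one anchor per event, then greedy MMR fill-in."""
    emb = frame_emb / np.linalg.norm(frame_emb, axis=1, keepdims=True)
    sim = emb @ emb.T
    # Stage 1: the most query-relevant frame of each event becomes its anchor.
    selected = []
    for event in np.unique(event_ids):
        members = np.flatnonzero(event_ids == event)
        selected.append(int(members[np.argmax(relevance[members])]))
    # Stage 2: add frames that are relevant but dissimilar to those already chosen.
    while len(selected) < budget:
        rest = [i for i in range(len(relevance)) if i not in selected]
        if not rest:
            break
        scores = [lam * relevance[i] - (1 - lam) * sim[i, selected].max() for i in rest]
        selected.append(rest[int(np.argmax(scores))])
    # If there are more events than budget, anchors are simply truncated in event order.
    return sorted(selected[:budget])
```

`relevance` stands for any per-frame query-relevance score (for example an image-text similarity), and `frame_emb` for per-frame visual embeddings such as DINO features; both are inputs the caller must supply.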
Details
Motivation: Massive frame redundancy and limited context windows make efficient keyframe selection essential for long-video understanding; existing methods use a flat sampling paradigm that ignores the video's semantic structure.
Method: EFS uses self-supervised DINO embeddings to partition the video into visually homogeneous temporal segments (proxies for semantic events), selects the most query-relevant frame within each event as an anchor, and performs global refinement with an adaptive Maximal Marginal Relevance (MMR) scheme, balancing event coverage, query relevance, and visual diversity.
Result: As a training-free, plug-and-play module, EFS improves the accuracy of LLaVA-Video-7B by 4.7%, 4.9%, and 8.8% on VideoMME, LongVideoBench, and MLVU, respectively.
Conclusion: EFS is an efficient, general, training-free keyframe selection method that significantly improves existing LVLMs on long-video understanding.
Abstract: Massive frame redundancy and limited context window make efficient frame selection crucial for long-video understanding with large vision-language models (LVLMs). Prevailing approaches, however, adopt a flat sampling paradigm which treats the video as an unstructured collection of frames. In this paper, we introduce Event-Anchored Frame Selection (EFS), a hierarchical, event-aware pipeline. Leveraging self-supervised DINO embeddings, EFS first partitions the video stream into visually homogeneous temporal segments, which serve as proxies for semantic events. Within each event, it then selects the most query-relevant frame as an anchor. These anchors act as structural priors that guide a global refinement stage using an adaptive Maximal Marginal Relevance (MMR) scheme. This pipeline ensures the final keyframe set jointly optimizes for event coverage, query relevance, and visual diversity. As a training-free, plug-and-play module, EFS can be seamlessly integrated into off-the-shelf LVLMs, yielding substantial gains on challenging video understanding benchmarks. Specifically, when applied to LLaVA-Video-7B, EFS improves accuracy by 4.7%, 4.9%, and 8.8% on VideoMME, LongVideoBench, and MLVU, respectively.
[273] The Texture-Shape Dilemma: Boundary-Safe Synthetic Generation for 3D Medical Transformers
Jiaqi Tang,Weixuan Xu,Shu Zhang,Fandong Zhang,Qingchao Chen
Main category: cs.CV
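TL;DR: This paper proposes a physics-inspired, spatially decoupled synthesis framework that resolves the boundary aliasing caused by introducing high-frequency textures in Formula-Driven Supervised Learning (FDSL) for medical images, improving Vision Transformer performance in data-scarce, privacy-sensitive settings.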
Details
Motivation: Vision Transformers in medical image analysis face data scarcity and privacy constraints, while existing FDSL generates only simple geometric shapes and ignores the tissue textures and noise patterns of CT/MRI, falling short of real clinical needs.
Method: Proposes a physics-inspired spatially-decoupled synthesis framework: it first builds a gradient-shielded buffer zone based on boundary distance to stabilize shape learning, then injects physics-driven spectral textures into the object core, reconciling robust shape representation with invariance to acquisition noise.
Result: Significantly outperforms prior FDSL methods and self-supervised methods trained on real data, by 1.43% on BTCV and up to 1.51% on MSD.
Conclusion: The method provides a scalable, annotation-free training foundation for medical ViTs and narrows the semantic gap between synthetic data and real modalities.
Abstract: Vision Transformers (ViTs) have revolutionized medical image analysis, yet their data-hungry nature clashes with the scarcity and privacy constraints of clinical archives. Formula-Driven Supervised Learning (FDSL) has emerged as a promising solution to this bottleneck, synthesizing infinite annotated samples from mathematical formulas without utilizing real patient data. However, existing FDSL paradigms rely on simple geometric shapes with homogeneous intensities, creating a substantial gap by neglecting tissue textures and noise patterns inherent in modalities like CT and MRI. In this paper, we identify a critical optimization conflict termed boundary aliasing: when high-frequency synthetic textures are naively added, they corrupt the image gradient signals necessary for learning structural boundaries, causing the model to fail in delineating real anatomical margins. To bridge this gap, we propose a novel Physics-inspired Spatially-Decoupled Synthesis framework. Our approach orthogonalizes the synthesis process: it first constructs a gradient-shielded buffer zone based on boundary distance to ensure stable shape learning, and subsequently injects physics-driven spectral textures into the object core. This design effectively reconciles robust shape representation learning with invariance to acquisition noise. Extensive experiments on the BTCV and MSD datasets demonstrate that our method significantly outperforms previous FDSL, as well as SSL methods trained on real-world medical datasets, by 1.43% on BTCV and up to 1.51% on the MSD tasks, offering a scalable, annotation-free foundation for medical ViTs. The code will be made publicly available upon acceptance.
[274] Foundation Models in Remote Sensing: Evolving from Unimodality to Multimodality
Danfeng Hong,Chenyu Li,Xuyang Li,Gustau Camps-Valls,Jocelyn Chanussot
Main category: cs.CV
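TL;DR: This paper presents a comprehensive technical survey of foundation models in remote sensing, tracing their development from unimodal to multimodal, and offers beginners a practical guide to training and applying them.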
Details
Motivation: The volume and diversity of remote sensing data are growing rapidly, creating an urgent need for stronger modeling and understanding capabilities; foundation models have the potential to transform the field.
Method: Conducts a systematic technical survey, categorizes existing models as unimodal or multimodal, and includes a tutorial-style section on model training and practical application.
Result: Builds a knowledge framework for foundation models in remote sensing, clarifies their definition, necessity, and development path, and provides guidance for getting started and for practice.
Conclusion: The survey is an important entry point for research at the intersection of remote sensing and foundation models, helping researchers quickly master and effectively apply the relevant techniques.
Abstract: Remote sensing (RS) techniques are increasingly crucial for deepening our understanding of the planet. As the volume and diversity of RS data continue to grow exponentially, there is an urgent need for advanced data modeling and understanding capabilities to manage and interpret these vast datasets effectively. Foundation models present significant new growth opportunities and immense potential to revolutionize the RS field. In this paper, we conduct a comprehensive technical survey on foundation models in RS, offering a brand-new perspective by exploring their evolution from unimodality to multimodality. We hope this work serves as a valuable entry point for researchers interested in both foundation models and RS and helps them launch new projects or explore new research topics in this rapidly evolving area. This survey addresses the following three key questions: What are foundation models in RS? Why are foundation models needed in RS? How can we effectively guide junior researchers in gaining a comprehensive and practical understanding of foundation models in RS applications? More specifically, we begin by outlining the background and motivation, emphasizing the importance of foundation models in RS. We then review existing foundation models in RS, systematically categorizing them into unimodal and multimodal approaches. Additionally, we provide a tutorial-like section to guide researchers, especially beginners, on how to train foundation models in RS and apply them to real-world tasks. The survey aims to equip researchers in RS with a deeper and more efficient understanding of foundation models, enabling them to get started easily and effectively apply these models across various RS applications.
[275] MLRecon: Robust Markerless Freehand 3D Ultrasound Reconstruction via Coarse-to-Fine Pose Estimation
Yi Zhang,Puxun Tu,Kun Wang,Yulin Yan,Tao Ying,Xiaojun Chen
Main category: cs.CV
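TL;DR: This paper proposes MLRecon, a markerless, drift-resilient 6D probe pose tracking framework based on a single RGB-D camera. It combines vision foundation models with a dual-stage pose refinement network to significantly improve the accuracy and robustness of freehand 3D ultrasound reconstruction.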
Details
Motivation: Existing tracking methods for freehand 3D ultrasound reconstruction are limited by high cost, intrusiveness, or severe cumulative drift; a low-cost, non-intrusive, high-accuracy solution is needed.
Method: Proposes the MLRecon framework: 1) continuous markerless probe tracking using vision foundation models; 2) a vision-guided divergence detector for automatic failure recovery; 3) a dual-stage pose refinement network that separates high-frequency jitter from low-frequency bias to preserve the motion trajectory.
Result: Average position error is as low as 0.88 mm on complex trajectories, and the mean surface accuracy of the 3D reconstruction is sub-millimeter, significantly better than existing sensor-aided and sensorless methods.
Conclusion: MLRecon offers a low-cost, accessible approach to 3D ultrasound imaging for resource-limited clinical settings and sets a new benchmark for markerless freehand ultrasound reconstruction.
Abstract: Freehand 3D ultrasound (US) reconstruction promises volumetric imaging with the flexibility of standard 2D probes, yet existing tracking paradigms face a restrictive trilemma: marker-based systems demand prohibitive costs, inside-out methods require intrusive sensor attachment, and sensorless approaches suffer from severe cumulative drift. To overcome these limitations, we present MLRecon, a robust markerless 3D US reconstruction framework delivering drift-resilient 6D probe pose tracking using a single commodity RGB-D camera. Leveraging the generalization power of vision foundation models, our pipeline enables continuous markerless tracking of the probe, augmented by a vision-guided divergence detector that autonomously monitors tracking integrity and triggers failure recovery to ensure uninterrupted scanning. Crucially, we further propose a dual-stage pose refinement network that explicitly disentangles high-frequency jitter from low-frequency bias, effectively denoising the trajectory while maintaining the kinematic fidelity of operator maneuvers. Experiments demonstrate that MLRecon significantly outperforms competing sensorless and sensor-aided methods, achieving average position errors as low as 0.88 mm on complex trajectories and yielding high-quality 3D reconstructions with sub-millimeter mean surface accuracy. This establishes a new benchmark for low-cost, accessible volumetric US imaging in resource-limited clinical settings.
[276] Let Your Image Move with Your Motion! -- Implicit Multi-Object Multi-Motion Transfer
Yuze Li,Dong Gong,Xiao Cao,Junchao Yuan,Dongsheng Li,Lei Zhou,Yun Sing Koh,Cheng Yan,Xinyu Zhang
Main category: cs.CV
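TL;DR: This paper proposes FlexiMMT, the first implicit image-to-video (I2V) motion transfer framework that supports multi-object, multi-motion transfer. A Motion Decoupled Mask Attention Mechanism and a Differentiated Mask Propagation Mechanism resolve cross-object motion entanglement, enabling precise, compositional multi-motion transfer.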
Details
Motivation: Existing motion transfer methods target single-object scenes and perform poorly when multiple objects need different motion patterns.
Method: Proposes the FlexiMMT framework, comprising a Motion Decoupled Mask Attention Mechanism (which constrains attention with object-specific masks) and a Differentiated Mask Propagation Mechanism (which derives object masks directly from diffusion attention and propagates them across frames).
Result: Achieves precise, compositional, SOTA performance on I2V multi-object multi-motion transfer.
Conclusion: FlexiMMT is the first to explicitly enable multi-object, multi-motion transfer; it decouples motion representations and supports arbitrary motion-to-object mappings, offering a new paradigm for controllable video generation.
Abstract: Motion transfer has emerged as a promising direction for controllable video generation, yet existing methods largely focus on single-object scenarios and struggle when multiple objects require distinct motion patterns. In this work, we present FlexiMMT, the first implicit image-to-video (I2V) motion transfer framework that explicitly enables multi-object, multi-motion transfer. Given a static multi-object image and multiple reference videos, FlexiMMT independently extracts motion representations and accurately assigns them to different objects, supporting flexible recombination and arbitrary motion-to-object mappings. To address the core challenge of cross-object motion entanglement, we introduce a Motion Decoupled Mask Attention Mechanism that uses object-specific masks to constrain attention, ensuring that motion and text tokens only influence their designated regions. We further propose a Differentiated Mask Propagation Mechanism that derives object-specific masks directly from diffusion attention and progressively propagates them across frames efficiently. Extensive experiments demonstrate that FlexiMMT achieves precise, compositional, and state-of-the-art performance in I2V-based multi-object multi-motion transfer.
[277] Dr.Occ: Depth- and Region-Guided 3D Occupancy from Surround-View Cameras for Autonomous Driving
Xubo Zhu,Haoyang Zhang,Fei He,Rui Wu,Yanhu Shan,Wen Yang,Huai Yu
Main category: cs.CV
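TL;DR: This paper proposes the Dr.Occ framework, which uses a depth-guided view transformer (D²-VFormer) and a region-guided expert transformer (R/R²-EFormer) to address geometric misalignment and spatial class imbalance in 3D semantic occupancy prediction, significantly improving mIoU and IoU on Occ3D-nuScenes.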
Details
Motivation: Existing methods suffer from geometric misalignment in view transformation (due to the lack of pixel-level accurate depth estimation) and severe spatial class imbalance (semantic categories show strong spatial anisotropy).
Method: Proposes the Dr.Occ framework: 1) a depth-guided 2D-to-3D View Transformer (D²-VFormer) that uses high-quality dense depth cues from MoGe-2 to build reliable geometric priors; 2) a region-guided Expert Transformer (R/R²-EFormer) that, following the MoE idea, adaptively allocates region-specific experts to handle spatial semantic variation.
Result: On the Occ3D-nuScenes benchmark, Dr.Occ improves the strong baseline BEVDet4D by 7.43% mIoU and 3.09% IoU in the vision-only setting.
Conclusion: Depth guidance ensures geometric alignment and region experts enhance semantic learning; the two are complementary and together improve 3D semantic occupancy prediction.
Abstract: 3D semantic occupancy prediction is crucial for autonomous driving perception, offering comprehensive geometric scene understanding and semantic recognition. However, existing methods struggle with geometric misalignment in view transformation due to the lack of pixel-level accurate depth estimation, and severe spatial class imbalance where semantic categories exhibit strong spatial anisotropy. To address these challenges, we propose Dr.Occ, a depth- and region-guided occupancy prediction framework. Specifically, we introduce a depth-guided 2D-to-3D View Transformer (D²-VFormer) that effectively leverages high-quality dense depth cues from MoGe-2 to construct reliable geometric priors, thereby enabling precise geometric alignment of voxel features. Moreover, inspired by the Mixture-of-Experts (MoE) framework, we propose a region-guided Expert Transformer (R/R²-EFormer) that adaptively allocates region-specific experts to focus on different spatial regions, effectively addressing spatial semantic variations. Thus, the two components make complementary contributions: depth guidance ensures geometric alignment, while region experts enhance semantic learning. Experiments on the Occ3D-nuScenes benchmark demonstrate that Dr.Occ improves the strong baseline BEVDet4D by 7.43% mIoU and 3.09% IoU under the full vision-only setting.
[278] GeodesicNVS: Probability Density Geodesic Flow Matching for Novel View Synthesis
Xuqin Wang,Tao Wu,Yanfeng Zhang,Lu Liu,Mingwei Sun,Yongliang Wang,Niclas Zeller,Daniel Cremers
Main category: cs.CV
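TL;DR: This paper proposes a Data-to-Data Flow Matching framework and its extension PDG-FM. Using deterministic mappings and geodesic flow constraints based on the density of pretrained diffusion models, it improves consistency and geometric coherence in novel view synthesis and outperforms diffusion-based baselines.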
Details
Motivation: Existing diffusion-based novel view synthesis methods model stochastic noise, which causes cross-view inconsistency, and they lack explicit modeling of deterministic structure and geometric consistency.
Method: Proposes a Data-to-Data Flow Matching framework that learns deterministic mappings between paired views; further introduces PDG-FM, which defines geodesic interpolants from the probability density of pretrained diffusion models and constrains flow trajectories to high-density regions of the data manifold.
Result: Surpasses diffusion baselines on novel view synthesis, with stronger structural coherence and smoother view transitions.
Conclusion: Introducing data-dependent geometric regularization into deterministic flow matching effectively improves view consistency and generation quality.
Abstract: Recent advances in generative modeling have substantially enhanced novel view synthesis, yet maintaining consistency across viewpoints remains challenging. Diffusion-based models rely on stochastic noise-to-data transitions, which obscure deterministic structures and yield inconsistent view predictions. We propose a Data-to-Data Flow Matching framework that learns deterministic transformations directly between paired views, enhancing view-consistent synthesis through explicit data coupling. To further enhance geometric coherence, we introduce Probability Density Geodesic Flow Matching (PDG-FM), which constrains flow trajectories using geodesic interpolants derived from probability density metrics of pretrained diffusion models. Such alignment with high-density regions of the data manifold promotes more realistic interpolants between samples. Empirically, our method surpasses diffusion-based NVS baselines, demonstrating improved structural coherence and smoother transitions across views. These results highlight the advantages of incorporating data-dependent geometric regularization into deterministic flow matching for consistent novel view generation.
[279] Implementation of Licensed Plate Detection and Noise Removal in Image Processing
Yiquan Gao
Main category: cs.CV
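TL;DR: This paper describes the background of license plate recognition systems in Malaysia and their potential applications in electronic parking payment, highway toll collection, traffic surveillance, and police enforcement.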
Details
Motivation: As the number of vehicles in Malaysia grows rapidly, demand for license plate recognition systems is increasing.
Method: Based on image processing techniques, including automatic number-plate recognition (ANPR), automatic vehicle identification, and optical character recognition (OCR).
Result: License plate recognition systems have been applied in several real-world scenarios and can potentially be combined with techniques from other fields.
Conclusion: License plate recognition systems are valuable for traffic management and law enforcement, and could extend to interdisciplinary fields such as biology and aerospace.
Abstract: A car license plate recognition system is an image processing technology used to identify vehicles by capturing their license plates. The technology is also known as automatic number-plate recognition, automatic vehicle identification, or optical character recognition for cars. In Malaysia, the number of vehicles is increasing rapidly, and the large number of vehicles on the road has created considerable demand for car license plate recognition systems. Such a system can be implemented in electronic parking payment systems, highway toll-fee systems, and traffic surveillance systems, and as a police enforcement tool. Additionally, car license plate recognition technology has the potential to be combined with techniques from other fields, such as biology and aerospace, to solve specialized problems.
[280] RaUF: Learning the Spatial Uncertainty Field of Radar
Shengpeng Wang,Kuangyu Wang,Wei Wang
Main category: cs.CV
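TL;DR: This paper proposes the RaUF framework, which learns a spatial uncertainty field by modeling the anisotropic physical properties of radar measurements, resolving ambiguous feature-to-label mapping, and uses a Bidirectional Domain Attention mechanism to improve detection reliability.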
Details
Motivation: Millimeter-wave radar has advantages in adverse weather but suffers from low spatial resolution, severe azimuth ambiguity, and clutter; existing methods overlook the ambiguous feature-to-label mapping, which makes geometric inference ill-posed and harms downstream perception.
Method: Proposes the RaUF framework: 1) an anisotropic probabilistic model that learns fine-grained uncertainty; 2) a Bidirectional Domain Attention mechanism that fuses spatial structure with Doppler consistency to suppress spurious and multipath returns.
Result: Experiments on public benchmarks and real-world datasets show that RaUF delivers highly reliable, well-calibrated spatial detections; downstream case studies further confirm its reliability and scalability in complex real-world driving scenarios.
Conclusion: Through physically grounded uncertainty modeling and cross-domain attention, RaUF significantly improves the robustness and trustworthiness of millimeter-wave radar spatial perception, offering a new approach to autonomous driving perception in adverse weather.
Abstract: Millimeter-wave radar offers unique advantages in adverse weather but suffers from low spatial fidelity, severe azimuth ambiguity, and clutter-induced spurious returns. Existing methods mainly focus on improving spatial perception effectiveness via coarse-to-fine cross-modal supervision, yet often overlook the ambiguous feature-to-label mapping, which may lead to ill-posed geometric inference and pose fundamental challenges to downstream perception tasks. In this work, we propose RaUF, a spatial uncertainty field learning framework that models radar measurements through their physically grounded anisotropic properties. To resolve conflicting feature-to-label mapping, we design an anisotropic probabilistic model that learns fine-grained uncertainty. To further enhance reliability, we propose a Bidirectional Domain Attention mechanism that exploits the mutual complementarity between spatial structure and Doppler consistency, effectively suppressing spurious or multipath-induced reflections. Extensive experiments on public benchmarks and real-world datasets demonstrate that RaUF delivers highly reliable spatial detections with well-calibrated uncertainty. Moreover, downstream case studies further validate the enhanced reliability and scalability of RaUF under challenging real-world driving scenarios.
[281] Content-Aware Frequency Encoding for Implicit Neural Representations with Fourier-Chebyshev Features
Junbo Ke,Yangyang Xu,You-Wei Wen,Chao Wang
Main category: cs.CV
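TL;DR: This paper proposes Content-Aware Frequency Encoding (CAFE), which improves on Fourier features by combining parallel linear layers through a Hadamard product to explicitly and efficiently synthesize a broader set of frequency bases. Its extension, CAFE+, adds Chebyshev features for greater stability and expressiveness.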
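One way to see why a Hadamard product of linear maps over Fourier features widens the available frequencies is the product-to-sum identity sin(a)·sin(b) = ½[cos(a − b) − cos(a + b)]: multiplying two sinusoids yields their sum and difference frequencies. The Python sketch below checks this numerically. The weights are set by hand purely for illustration (in the method they would be learned), and the function names are hypothetical.

```python
import numpy as np


def fourier_features(x, freqs):
    """Stack sin and cos features of a 1D coordinate array for the given frequencies."""
    angles = 2 * np.pi * np.outer(x, freqs)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)


def hadamard_encoding(x, freqs, weights):
    """Elementwise product of several linear maps applied to the same Fourier features."""
    feats = fourier_features(x, freqs)
    out = np.ones((len(x), weights[0].shape[1]))
    for w in weights:
        out = out * (feats @ w)
    return out


def dominant_frequencies(signal, top=2):
    """Indices of the `top` strongest bins in the real FFT of a signal."""
    spectrum = np.abs(np.fft.rfft(signal))
    return sorted(int(i) for i in np.argsort(spectrum)[-top:])


x = np.arange(256) / 256
freqs = [3, 5]                                    # base frequencies: features are [sin3, sin5, cos3, cos5]
pick_sin3 = np.array([[1.0], [0.0], [0.0], [0.0]])  # hand-set weights standing in for learned ones
pick_sin5 = np.array([[0.0], [1.0], [0.0], [0.0]])
encoded = hadamard_encoding(x, freqs, [pick_sin3, pick_sin5])[:, 0]
new_freqs = dominant_frequencies(encoded)
```

Here the base set contains only frequencies 3 and 5, yet the product branch carries energy at 2 and 8, neither of which the MLP would otherwise have to compose on its own.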
Details
Motivation: Implicit Neural Representations (INRs) have a spectral bias that makes high-frequency details hard to capture; existing methods based on fixed Fourier bases are inefficient and limited in representational capacity.
Method: Proposes CAFE, which builds on Fourier features with multiple parallel linear layers combined via a Hadamard product for dynamic frequency synthesis; further proposes CAFE+, which adds Chebyshev features as a complement.
Result: Significantly outperforms existing methods across multiple benchmarks, confirming effectiveness and efficiency.
Conclusion: CAFE and its extension CAFE+ adaptively select task-relevant frequencies, improving the ability of INRs to model high-frequency detail with both expressiveness and stability.
Abstract: Implicit Neural Representations (INRs) have emerged as a powerful paradigm for various signal processing tasks, but their inherent spectral bias limits the ability to capture high-frequency details. Existing methods partially mitigate this issue by using Fourier-based features, which usually rely on fixed frequency bases. This forces multi-layer perceptrons (MLPs) to inefficiently compose the required frequencies, thereby constraining their representational capacity. To address this limitation, we propose Content-Aware Frequency Encoding (CAFE), which builds upon Fourier features through multiple parallel linear layers combined via a Hadamard product. CAFE can explicitly and efficiently synthesize a broader range of frequency bases, while the learned weights enable the selection of task-relevant frequencies. Furthermore, we extend this framework to CAFE+, which incorporates Chebyshev features as a complementary component to Fourier bases. This combination provides a stronger and more stable frequency representation. Extensive experiments across multiple benchmarks validate the effectiveness and efficiency of our approach, consistently achieving superior performance over existing methods. Our code is available at https://github.com/JunboKe0619/CAFE.
[282] Vision-Language Feature Alignment for Road Anomaly Segmentation
Zhuolin He,Jiacheng Tang,Jian Pu,Xiangyang Xue
Main category: cs.CV
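TL;DR: This paper proposes VL-Anomaly, a road anomaly segmentation framework built on vision-language models. It aligns visual features with text embeddings through prompt learning and fuses multiple information sources to improve the accuracy and robustness of anomaly detection.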
Details
Motivation: Existing anomaly segmentation methods based on pixel statistics produce many false positives on normal background regions such as sky and vegetation, and have low recall on true out-of-distribution (OOD) instances, threatening autonomous driving safety.
Method: Proposes the VL-Anomaly framework: 1) a prompt learning-driven alignment module that aligns Mask2Former visual features with CLIP text embeddings of known categories, suppressing spurious anomaly responses in background regions; 2) a multi-source inference strategy that fuses text-guided similarity, CLIP image-text similarity, and detector confidence.
Result: Achieves SOTA performance on benchmarks including RoadAnomaly, SMIYC, and Fishyscapes.
Conclusion: Semantic priors from pre-trained vision-language models can significantly improve the reliability of road anomaly segmentation, reducing false positives and misses and making autonomous driving systems safer.
Abstract: Safe autonomous systems in complex environments require robust road anomaly segmentation to identify unknown obstacles. However, existing approaches often rely on pixel-level statistics to determine whether a region appears anomalous. This reliance leads to high false-positive rates on semantically normal background regions such as sky or vegetation, and poor recall of true out-of-distribution (OOD) instances, thereby posing safety risks for robotic perception and decision-making. To address these challenges, we propose VL-Anomaly, a vision-language anomaly segmentation framework that incorporates semantic priors from pre-trained Vision-Language Models (VLMs). Specifically, we design a prompt learning-driven alignment module that adapts Mask2Former's visual features to CLIP text embeddings of known categories, effectively suppressing spurious anomaly responses in background regions. At inference time, we further introduce a multi-source inference strategy that integrates text-guided similarity, CLIP-based image-text similarity and detector confidence, enabling more reliable anomaly prediction by leveraging complementary information sources. Extensive experiments demonstrate that VL-Anomaly achieves state-of-the-art performance on benchmark datasets including RoadAnomaly, SMIYC and Fishyscapes. Code is released at https://github.com/NickHezhuolin/VL-aligner-Road-anomaly-segment.
[283] Reparameterized Tensor Ring Functional Decomposition for Multi-Dimensional Data Recovery
Yangyang Xu,Junbo Ke,You-Wei Wen,Chao Wang
Main category: cs.CV
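TL;DR: This paper proposes a tensor ring (TR) functional decomposition based on implicit neural representations (INRs) for both meshgrid and non-meshgrid data, and uses frequency-domain analysis to show that the spectral structure of TR factors limits high-frequency modeling capacity. To address this, it designs a reparameterized TR functional decomposition framework that combines learnable latent tensors with fixed bases, provides theoretical guarantees and an initialization scheme, and achieves superior performance on image inpainting, denoising, super-resolution, and point cloud recovery.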
Details
Motivation: Traditional tensor ring (TR) decomposition is restricted to discrete meshgrid data and struggles to model continuous, non-meshgrid high-order data; its high-frequency modeling capacity is also limited by the spectral structure of the TR factors, making fine details hard to reconstruct. Method: Proposes a TR functional decomposition framework that parameterizes TR factors with implicit neural representations; identifies the spectral bottleneck through frequency-domain analysis; introduces a reparameterized form in which each TR factor is a structured combination of a learnable latent tensor and a fixed basis; designs a theoretically grounded initialization for the fixed basis and proves the Lipschitz continuity of the model. Result: Consistently outperforms existing methods on image inpainting, denoising, super-resolution, and point cloud recovery. Conclusion: The reparameterized TR functional decomposition improves training dynamics and high-frequency modeling capacity, combines theoretical rigor with practical effectiveness, and extends the applicability of TR decomposition to continuous signal modeling. Abstract: Tensor Ring (TR) decomposition is a powerful tool for high-order data modeling, but is inherently restricted to discrete forms defined on fixed meshgrids. In this work, we propose a TR functional decomposition for both meshgrid and non-meshgrid data, where factors are parameterized by Implicit Neural Representations (INRs). However, optimizing this continuous framework to capture fine-scale details is intrinsically difficult. Through a frequency-domain analysis, we demonstrate that the spectral structure of TR factors determines the frequency composition of the reconstructed tensor and limits the high-frequency modeling capacity. To mitigate this, we propose a reparameterized TR functional decomposition, in which each TR factor is a structured combination of a learnable latent tensor and a fixed basis. This reparameterization is theoretically shown to improve the training dynamics of TR factor learning. We further derive a principled initialization scheme for the fixed basis and prove the Lipschitz continuity of our proposed model. Extensive experiments on image inpainting, denoising, super-resolution, and point cloud recovery demonstrate that our method achieves consistently superior performance over existing approaches. Code is available at https://github.com/YangyangXu2002/RepTRFD.
[284] SMR-Net: Robot Snap Detection Based on Multi-Scale Features and Self-Attention Network
Kuanxu Hou
Main category: cs.CV
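TL;DR: This paper proposes SMR-Net, a self-attention-based multi-scale object detection algorithm, together with a dedicated sensor, significantly improving the detection and localization accuracy of transparent or low-contrast snaps in robotic automated assembly.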
Details
Motivation: Traditional visual methods are not robust and produce large localization errors on transparent or low-contrast snaps, failing to meet the demands of high-precision assembly. Method: Designs a dedicated sensor and proposes SMR-Net: an attention-embedded feature extractor strengthens key features and suppresses noise; three multi-scale feature maps are processed in parallel (with standard and dilated convolutions) to unify dimensions while preserving resolution; an adaptive reweighting network dynamically fuses the features into fine representations that integrate details and global semantics. Result: On the Type A and Type B snap datasets, IoU improves by 6.52% and 5.8% and mAP by 2.8% and 1.5%, respectively, significantly outperforming Faster R-CNN. Conclusion: SMR-Net shows clear advantages in complex snap detection and localization, providing effective technical support for high-precision robotic automated assembly. Abstract: In robot automated assembly, snap assembly precision and efficiency directly determine overall production quality. As a core prerequisite, snap detection and localization critically affect subsequent assembly success. Traditional visual methods suffer from poor robustness and large localization errors when handling complex scenarios (e.g., transparent or low-contrast snaps), failing to meet high-precision assembly demands. To address this, this paper designs a dedicated sensor and proposes SMR-Net, a self-attention-based multi-scale object detection algorithm, to synergistically enhance detection and localization performance. SMR-Net adopts an attention-enhanced multi-scale feature fusion architecture: raw sensor data is encoded via an attention-embedded feature extractor to strengthen key snap features and suppress noise; three multi-scale feature maps are processed in parallel with standard and dilated convolution for dimension unification while preserving resolution; an adaptive reweighting network dynamically assigns weights to fused features, generating fine representations integrating details and global semantics. Experimental results on Type A and Type B snap datasets show SMR-Net outperforms traditional Faster R-CNN significantly: Intersection over Union (IoU) improves by 6.52% and 5.8%, and mean Average Precision (mAP) increases by 2.8% and 1.5% respectively. This fully demonstrates the method's superiority in complex snap detection and localization tasks.
[285] From Intuition to Investigation: A Tool-Augmented Reasoning MLLM Framework for Generalizable Face Anti-Spoofing
Haoyuan Zhang,Keyao Wang,Guosheng Zhang,Haixiao Yue,Zhiwen Tan,Siran Peng,Tianshuo Zhang,Xiao Tan,Kunbin Chen,Wei He,Jingdong Wang,Ajian Liu,Xiangyu Zhu,Zhen Lei
Main category: cs.CV
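TL;DR: This paper proposes TAR-FAS, a new face anti-spoofing framework that combines multimodal large language models (MLLMs) with external visual tools under a Chain-of-Thought with Visual Tools (CoT-VT) paradigm to improve cross-domain generalization and the recognition of fine-grained spoof cues. It also builds the ToolFAS-16K dataset and the DT-GRPO training method; experiments show SOTA performance under a strict cross-domain protocol.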
Details
Motivation: Existing MLLM-based face anti-spoofing methods rely only on textual descriptions and struggle to capture fine-grained visual cues, which limits cross-domain generalization. Method: Proposes the TAR-FAS framework, introducing the Chain-of-Thought with Visual Tools (CoT-VT) paradigm; designs a tool-augmented data annotation pipeline and builds the ToolFAS-16K dataset; proposes the DT-GRPO training strategy so that the model learns to invoke multiple visual tools autonomously and efficiently. Result: Under a highly challenging one-to-eleven cross-domain protocol, TAR-FAS significantly outperforms existing methods, reaching SOTA performance while providing an interpretable, fine-grained visual analysis process. Conclusion: Combining external visual tools with MLLM reasoning effectively improves the robustness and trustworthiness of FAS systems, and the CoT-VT paradigm offers a new direction for multimodal security tasks. Abstract: Face recognition remains vulnerable to presentation attacks, calling for robust Face Anti-Spoofing (FAS) solutions. Recent MLLM-based FAS methods reformulate the binary classification task as the generation of brief textual descriptions to improve cross-domain generalization. However, their generalizability is still limited, as such descriptions mainly capture intuitive semantic cues (e.g., mask contours) while struggling to perceive fine-grained visual patterns. To address this limitation, we incorporate external visual tools into MLLMs to encourage deeper investigation of subtle spoof clues. Specifically, we propose the Tool-Augmented Reasoning FAS (TAR-FAS) framework, which reformulates the FAS task as a Chain-of-Thought with Visual Tools (CoT-VT) paradigm, allowing MLLMs to begin with intuitive observations and adaptively invoke external visual tools for fine-grained investigation. To this end, we design a tool-augmented data annotation pipeline and construct the ToolFAS-16K dataset, which contains multi-turn tool-use reasoning trajectories. Furthermore, we introduce a tool-aware FAS training pipeline, where Diverse-Tool Group Relative Policy Optimization (DT-GRPO) enables the model to autonomously learn efficient tool use. Extensive experiments under a challenging one-to-eleven cross-domain protocol demonstrate that TAR-FAS achieves SOTA performance while providing fine-grained visual investigation for trustworthy spoof detection.
[286] MM-DeepResearch: A Simple and Effective Multimodal Agentic Search Baseline
Huanjin Yao,Qixiang Yin,Min Yang,Ziwang Zhao,Yibo Wang,Haotian Luo,Jingyi Zhang,Jiaxing Huang
Main category: cs.CV
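TL;DR: This paper proposes MM-DeepResearch, a multimodal research agent capable of explicit reasoning, multi-tool invocation, and cross-modal information synthesis. Three designs address data scarcity, inefficient trajectories, and high API cost: Hyper-Search (hypergraph-based generation of search-intensive multimodal QA data), DR-TTS (decomposing tasks by search tool type and recombining experts via tree search), and an offline search engine.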
Details
Motivation: Existing multimodal research agents face three challenges: a lack of search-intensive multimodal QA data, a lack of effective search trajectories, and the prohibitive cost of training with online search APIs. Method: Proposes three key techniques: 1) Hyper-Search, which models cross-modal associations between image and text nodes with a hypergraph to generate search-intensive QA pairs that require multiple tool calls; 2) DR-TTS, which decomposes tasks by search tool type, trains a dedicated expert for each tool, and jointly explores effective search trajectories via tree search; 3) an offline search engine supporting multiple tools, enabling API-free agentic reinforcement learning. Result: MM-DeepResearch shows clear superiority on multiple benchmarks, validating the effectiveness and generalization of the approach. Conclusion: Through structured multimodal modeling, a divide-and-conquer tool-collaboration mechanism, and a low-cost offline training paradigm, MM-DeepResearch provides a systematic solution for building high-performance, scalable multimodal research agents. Abstract: We aim to develop a multimodal research agent capable of explicit reasoning and planning, multi-tool invocation, and cross-modal information synthesis, enabling it to conduct deep research tasks. However, we observe three main challenges in developing such agents: (1) scarcity of search-intensive multimodal QA data, (2) lack of effective search trajectories, and (3) prohibitive cost of training with online search APIs. To tackle them, we first propose Hyper-Search, a hypergraph-based QA generation method that models and connects visual and textual nodes within and across modalities, enabling the generation of search-intensive multimodal QA pairs that require invoking various search tools to solve. Second, we introduce DR-TTS, which first decomposes search-involved tasks into several categories according to search tool types and optimizes a specialized search tool expert for each tool. It then recomposes the tool experts to jointly explore search trajectories via tree search, producing trajectories that successfully solve complex tasks using various search tools. Third, we build an offline search engine supporting multiple search tools, enabling agentic reinforcement learning without using costly online search APIs. With the three designs, we develop MM-DeepResearch, a powerful multimodal deep research agent, and extensive results show its superiority across benchmarks. Code is available at https://github.com/HJYao00/MM-DeepResearch
[287] Unleashing VLA Potentials in Autonomous Driving via Explicit Learning from Failures
Yuechen Luo,Qimao Chen,Fang Li,Shaoqing Xu,Jaxin Liu,Ziying Song,Zhi-xin Yang,Fuxi Wen
Main category: cs.CV
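TL;DR: This paper proposes the ELF-VLA framework, which augments reinforcement learning with structured failure-diagnosis feedback to overcome the performance bottleneck that sparse rewards cause for VLA models in autonomous driving, significantly improving both long-tail scenario behavior and overall performance.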
Details
Motivation: During RL optimization, VLA models often have their exploration constrained by supervised fine-tuning and hit zero-reward failures in long-tail scenarios, while a conventional scalar reward cannot reveal the root cause of a failure (for example, a planning, reasoning, or execution error). Method: Proposes the ELF-VLA framework: replaces the vague scalar reward with interpretable, structured failure-diagnosis reports; the policy uses this feedback to generate a "Feedback-Guided Refinement"; high-reward refined samples are injected into the RL training batch to provide a targeted gradient. Result: Achieves SOTA on the NAVSIM benchmark, with significant gains in overall PDMS, EPDMS score, and high-level planning accuracy, confirming that the method unlocks the model's latent capabilities. Conclusion: Explicitly learning from failures effectively breaks through the RL optimization bottleneck and makes VLA models more robust in critical long-tail driving scenarios. Abstract: Vision-Language-Action (VLA) models for autonomous driving often hit a performance plateau during Reinforcement Learning (RL) optimization. This stagnation arises from exploration capabilities constrained by previous Supervised Fine-Tuning (SFT), leading to persistent failures in long-tail scenarios. In these critical situations, all explored actions yield a zero-value driving score. This information-sparse reward signals a failure, yet fails to identify its root cause -- whether it is due to incorrect planning, flawed reasoning, or poor trajectory execution. To address this limitation, we propose VLA with Explicit Learning from Failures (ELF-VLA), a framework that augments RL with structured diagnostic feedback. Instead of relying on a vague scalar reward, our method produces detailed, interpretable reports that identify the specific failure mode. The VLA policy then leverages this explicit feedback to generate a Feedback-Guided Refinement. By injecting these corrected, high-reward samples back into the RL training batch, our approach provides a targeted gradient, which enables the policy to solve critical scenarios that unguided exploration cannot. Extensive experiments demonstrate that our method unlocks the latent capabilities of VLA models, achieving state-of-the-art (SOTA) performance on the public NAVSIM benchmark for overall PDMS, EPDMS score and high-level planning accuracy.
[288] LLaDA-o: An Effective and Length-Adaptive Omni Diffusion Model
Zebin You,Xiaolu Zhang,Jun Zhou,Chongxuan Li,Ji-Rong Wen
Main category: cs.CV
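TL;DR: LLaDA-o is a length-adaptive multimodal diffusion model that decouples the diffusion processes for text understanding and image generation while sharing an efficient attention backbone, reaching SOTA performance on multimodal understanding and generation tasks.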
Details
Motivation: Existing unified multimodal diffusion models incur redundant computation across modalities and cannot flexibly support variable-length text, calling for a more efficient and flexible modeling paradigm. Method: Proposes the Mixture of Diffusion (MoD) framework, which decouples discrete masked diffusion (for text understanding) from continuous diffusion (for visual generation) while sharing a lightweight attention backbone; introduces a data-centric length adaptation strategy that enables flexible decoding without architectural changes. Result: Achieves SOTA among omni-diffusion models on multimodal understanding and generation benchmarks, and scores 87.04 on DPG-Bench for text-to-image generation. Conclusion: Unified, length-adaptive omni diffusion modeling is feasible and efficient; LLaDA-o offers a new paradigm for large multimodal models that covers both understanding and generation. Abstract: We present LLaDA-o, an effective and length-adaptive omni diffusion model for multimodal understanding and generation. LLaDA-o is built on a Mixture of Diffusion (MoD) framework that decouples discrete masked diffusion for text understanding and continuous diffusion for visual generation, while coupling them through a shared, simple, and efficient attention backbone that reduces redundant computation for fixed conditions. Building on MoD, we further introduce a data-centric length adaptation strategy that enables flexible-length decoding in multimodal settings without architectural changes. Extensive experiments show that LLaDA-o achieves state-of-the-art performance among omni-diffusion models on multimodal understanding and generation benchmarks, and reaches 87.04 on DPG-Bench for text-to-image generation, supporting the effectiveness of unified omni diffusion modeling. Code is available at https://github.com/ML-GSAI/LLaDA-o.
[289] Flow Matching-enabled Test-Time Refinement for Unsupervised Cardiac MR Registration
Yunguan Fu,Wenjia Bai,Wen Yan,Matthew J Clarkson,Rhodri Huw Davies,Yipeng Hu
Main category: cs.CV
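TL;DR: FlowReg is a flow-matching-based unsupervised image registration framework that achieves high-quality cardiac cine MR registration in as few as two steps and supports further refinement. Warmup-reflow training and an Initial Guess strategy improve performance; it surpasses existing methods on multiple tasks with a very small parameter increase and no segmentation labels.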
Details
Motivation: Diffusion-based unsupervised image registration has been explored for cardiac cine MR, but costly multi-step inference limits its practicality. Method: Proposes the FlowReg framework, which performs flow matching in displacement field space; adopts a warmup-reflow training strategy (a single-step network first acts as the teacher, then a student is trained to refine from arbitrary intermediate states); introduces an Initial Guess strategy that feeds the model prediction back as the next starting point. Result: Across six tasks on the ACDC and MM2 datasets (including cross-dataset generalization), the Dice score improves by 0.6% on average on five tasks, with the largest gain in the left ventricle (+1.09%); the LVEF estimation error falls by 2.58 percentage points on average across all six tasks; only 0.7% extra parameters are added and no segmentation labels are needed. Conclusion: With a minimal design, FlowReg delivers efficient, accurate, and scalable unsupervised cardiac image registration that clearly outperforms existing methods and has potential for clinical use. Abstract: Diffusion-based unsupervised image registration has been explored for cardiac cine MR, but expensive multi-step inference limits practical use. We propose FlowReg, a flow-matching framework in displacement field space that achieves strong registration in as few as two steps and supports further refinement with more steps. FlowReg uses warmup-reflow training: a single-step network first acts as a teacher, then a student learns to refine from arbitrary intermediate states, removing the need for a pre-trained model as in existing methods. An Initial Guess strategy feeds back the model prediction as the next starting point, improving refinement from step two onward. On ACDC and MM2 across six tasks (including cross-dataset generalization), FlowReg outperforms the state of the art on five tasks (+0.6% mean Dice score on average), with the largest gain in the left ventricle (+1.09%), and reduces LVEF estimation error on all six tasks (-2.58 percentage points), using only 0.7% extra parameters and no segmentation labels. Anonymized code is available at https://github.com/mathpluscode/FlowReg.
[290] Adaptive Augmentation-Aware Latent Learning for Robust LiDAR Semantic Segmentation
Wangkai Li,Zhaoyang Li,Yuwen Pan,Rui Sun,Yujia Chen,Tianzhu Zhang
Main category: cs.CV
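TL;DR: This paper proposes the A3Point framework, which uses adaptive augmentation-aware latent learning to make LiDAR point cloud semantic segmentation more robust in adverse weather. Its core components, semantic confusion prior (SCP) learning and semantic shift region (SSR) localization, mitigate augmentation-induced semantic shift and reach SOTA on multiple benchmarks.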
Details
Motivation: Adverse weather significantly degrades LiDAR point cloud semantic segmentation, and existing augmentation-based methods cannot balance minor and aggressive augmentations, so they fail to exploit the full potential of augmentation. Method: Proposes the A3Point framework, comprising a semantic confusion prior (SCP) latent learning module that models the model's inherent semantic confusion, and a semantic shift region (SSR) localization module that decouples semantic confusion from semantic shift, enabling adaptive optimization for different disturbance levels. Result: Achieves SOTA performance on multiple standard generalized LiDAR segmentation benchmarks under adverse weather. Conclusion: A3Point effectively exploits diverse augmentations while suppressing the semantic shift they induce, significantly improving generalization and robustness under out-of-distribution adverse weather conditions. Abstract: Adverse weather conditions significantly degrade the performance of LiDAR point cloud semantic segmentation networks by introducing large distribution shifts. Existing augmentation-based methods attempt to enhance robustness by simulating weather interference during training. However, they struggle to fully exploit the potential of augmentations due to the trade-off between minor and aggressive augmentations. To address this, we propose A3Point, an adaptive augmentation-aware latent learning framework that effectively utilizes a diverse range of augmentations while mitigating the semantic shift, which refers to the change in the semantic meaning caused by augmentations. A3Point consists of two key components: semantic confusion prior (SCP) latent learning, which captures the model's inherent semantic confusion information, and semantic shift region (SSR) localization, which decouples semantic confusion and semantic shift, enabling adaptive optimization strategies for different disturbance levels. Extensive experiments on multiple standard generalized LiDAR segmentation benchmarks under adverse weather demonstrate the effectiveness of our method, setting new state-of-the-art results.
[291] Beyond Global Similarity: Towards Fine-Grained, Multi-Condition Multimodal Retrieval
Xuan Lu,Kangle Li,Haohang Huang,Rui Meng,Wenjun Zeng,Xiaoyu Shen
Main category: cs.CV
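TL;DR: This paper introduces the MCMR benchmark for evaluating multi-condition cross-modal retrieval. It covers five product domains, emphasizes fine-grained, multi-constraint image-text matching, and its experiments reveal how different models differ in multi-condition understanding.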
Details
Motivation: Existing benchmarks focus mainly on coarse-grained or single-condition alignment and do not reflect real-world user queries that involve multiple interdependent cross-modal constraints, so a fine-grained, multi-condition retrieval benchmark closer to practice is needed. Method: Builds MCMR, a large-scale multi-condition multimodal retrieval benchmark covering five product domains and preserving rich long-form metadata; designs natural-language queries that combine visual and textual attributes; systematically evaluates a range of MLLM-based retrievers and vision-language rerankers. Result: Experiments show that (i) models exhibit clear modality asymmetries; (ii) visual cues dominate early-rank precision, while textual metadata stabilizes long-tail ordering; (iii) MLLM-based pointwise rerankers markedly improve fine-grained matching. Conclusion: MCMR provides a challenging and diagnostic new benchmark for advancing multimodal retrieval toward compositional, constraint-aware, and interpretable understanding. Abstract: Recent advances in multimodal large language models (MLLMs) have substantially expanded the capabilities of multimodal retrieval, enabling systems to align and retrieve information across visual and textual modalities. Yet, existing benchmarks largely focus on coarse-grained or single-condition alignment, overlooking real-world scenarios where user queries specify multiple interdependent constraints across modalities. To bridge this gap, we introduce MCMR (Multi-Conditional Multimodal Retrieval): a large-scale benchmark designed to evaluate fine-grained, multi-condition cross-modal retrieval under natural-language queries. MCMR spans five product domains: upper and bottom clothing, jewelry, shoes, and furniture. It also preserves rich long-form metadata essential for compositional matching. Each query integrates complementary visual and textual attributes, requiring models to jointly satisfy all specified conditions for relevance. We benchmark a diverse suite of MLLM-based multimodal retrievers and vision-language rerankers to assess their condition-aware reasoning abilities. Experimental results reveal: (i) distinct modality asymmetries across models; (ii) visual cues dominate early-rank precision, while textual metadata stabilizes long-tail ordering; and (iii) MLLM-based pointwise rerankers markedly improve fine-grained matching by explicitly verifying query-candidate consistency. Overall, MCMR establishes a challenging and diagnostic benchmark for advancing multimodal retrieval toward compositional, constraint-aware, and interpretable understanding. Our code and dataset are available at https://github.com/EIT-NLP/MCMR
[292] Can Vision Language Models Assess Graphic Design Aesthetics? A Benchmark, Evaluation, and Dataset Perspective
Arctanx An,Shizhao Sun,Danqing Huang,Mingxi Cheng,Yan Gao,Ji Li,Yu Qiao,Jiang Bian
Main category: cs.CV
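TL;DR: This paper introduces the AesEval-Bench benchmark, which systematically evaluates how well vision language models (VLMs) assess the aesthetic quality of graphic design, and builds a dedicated training dataset to improve VLM performance.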
Details
Motivation: Aesthetic assessment of graphic design is underexplored for VLMs, and prior work suffers from three problems: narrow benchmarks, a lack of systematic model comparison, and limited training data. Method: Builds the AesEval-Bench benchmark, covering four dimensions, twelve indicators, and three quantifiable tasks (aesthetic judgment, region selection, and precise localization); systematically evaluates multiple classes of VLMs; constructs a training dataset using human-guided VLM labeling and indicator-grounded reasoning. Result: Reveals a significant gap between current VLMs and human judgment on aesthetic assessment, and shows that the proposed training strategy effectively improves VLM performance in this domain. Conclusion: This work establishes the first systematic framework for aesthetic quality assessment in graphic design, offering a new benchmark and methodology for VLM-based design understanding. Abstract: Assessing the aesthetic quality of graphic design is central to visual communication, yet remains underexplored in vision language models (VLMs). We investigate whether VLMs can evaluate design aesthetics in ways comparable to humans. Prior work faces three key limitations: benchmarks restricted to narrow principles and coarse evaluation protocols, a lack of systematic VLM comparisons, and limited training data for model improvement. In this work, we introduce AesEval-Bench, a comprehensive benchmark spanning four dimensions, twelve indicators, and three fully quantifiable tasks: aesthetic judgment, region selection, and precise localization. Then, we systematically evaluate proprietary, open-source, and reasoning-augmented VLMs, revealing clear performance gaps against the nuanced demands of aesthetic assessment. Moreover, we construct a training dataset to fine-tune VLMs for this domain, leveraging human-guided VLM labeling to produce task labels at scale and indicator-grounded reasoning to tie abstract indicators to concrete design regions. Together, our work establishes the first systematic framework for aesthetic quality assessment in graphic design. Our code and dataset will be released at: https://github.com/arctanxarc/AesEval-Bench
[293] Differential privacy representation geometry for medical image analysis
Soroosh Tayebi Arasteh,Marziyeh Mohammadi,Sven Nebelung,Daniel Truhn
Main category: cs.CV
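TL;DR: This paper proposes the DP-RGMI framework, which analyzes the structured effect of differential privacy (DP) on the representation space of medical images, decomposes performance degradation into encoder geometry changes and task-head utilization, and shows that DP mainly alters representation anisotropy rather than uniformly collapsing features.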
Details
Motivation: Prior work evaluates the effect of differential privacy in medical imaging only through end-to-end performance, leaving the mechanism unclear; a deeper understanding of the source of privacy-induced utility loss is needed. Method: Proposes the DP-RGMI framework, which treats DP as a structured transformation of representation space; quantifies encoder geometry (representation displacement and spectral effective dimension) and task-head utilization (the gap between linear-probe and end-to-end performance); conducts an empirical analysis on four chest X-ray datasets (over 590,000 images) with multiple pretrained initializations. Result: DP is consistently accompanied by a utilization gap, even when linear separability is largely preserved; the geometric metrics show non-monotonic reshaping that depends on initialization and dataset, indicating that DP alters representation anisotropy; utilization is robustly associated with end-to-end performance, while the geometric metrics capture additional prior- and dataset-related variation. Conclusion: DP-RGMI offers a reproducible analysis framework for diagnosing privacy-induced failure modes and guiding privacy model selection. Abstract: The effect of differential privacy (DP) in medical imaging is typically evaluated only through end-to-end performance, leaving the mechanism of privacy-induced utility loss unclear. We introduce Differential Privacy Representation Geometry for Medical Imaging (DP-RGMI), a framework that interprets DP as a structured transformation of representation space and decomposes performance degradation into encoder geometry and task-head utilization. Geometry is quantified by representation displacement from initialization and spectral effective dimension, while utilization is measured as the gap between linear-probe and end-to-end utility. Across over 594,000 images from four chest X-ray datasets and multiple pretrained initializations, we show that DP is consistently associated with a utilization gap even when linear separability is largely preserved. At the same time, displacement and spectral dimension exhibit non-monotonic, initialization- and dataset-dependent reshaping, indicating that DP alters representation anisotropy rather than uniformly collapsing features. Correlation analysis reveals that the association between end-to-end performance and utilization is robust across datasets but can vary by initialization, while geometric quantities capture additional prior- and dataset-conditioned variation. These findings position DP-RGMI as a reproducible framework for diagnosing privacy-induced failure modes and informing privacy model selection.
[294] HeroGS: Hierarchical Guidance for Robust 3D Gaussian Splatting under Sparse Views
Jiashu Li,Xumeng Han,Zhaoyang Wei,Zipeng Wang,Kuiran Wang,Guorong Li,Zhenjun Han,Jianbin Jiao
Main category: cs.CV
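TL;DR: This paper proposes HeroGS, a hierarchically guided framework for robust 3D Gaussian Splatting that significantly improves the reconstruction quality and rendering fidelity of 3DGS under sparse views.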
Details
Motivation: Under sparse views, 3D Gaussian Splatting (3DGS) receives insufficient supervision, which leads to irregular Gaussian distributions that show up as globally sparse coverage, blurred backgrounds, and high-frequency distortion. Method: Proposes a three-level guidance framework: image-level pseudo-dense supervision, feature-level Feature-Adaptive Densification and Pruning (FADP), and parameter-level Co-Pruned Geometry Consistency (CPG). Result: Significantly outperforms existing methods under sparse views, achieving high-fidelity reconstruction and high-quality real-time rendering. Conclusion: The hierarchical guidance strategy effectively constrains and optimizes the overall Gaussian distribution, improving structural fidelity and rendering quality. Abstract: 3D Gaussian Splatting (3DGS) has recently emerged as a promising approach in novel view synthesis, combining photorealistic rendering with real-time efficiency. However, its success heavily relies on dense camera coverage; under sparse-view conditions, insufficient supervision leads to irregular Gaussian distributions, characterized by globally sparse coverage, blurred background, and distorted high-frequency areas. To address this, we propose HeroGS, Hierarchical Guidance for Robust 3D Gaussian Splatting, a unified framework that establishes hierarchical guidance across the image, feature, and parameter levels. At the image level, sparse supervision is converted into pseudo-dense guidance, globally regularizing the Gaussian distributions and forming a consistent foundation for subsequent optimization. Building upon this, Feature-Adaptive Densification and Pruning (FADP) at the feature level leverages low-level features to refine high-frequency details and adaptively densifies Gaussians in background regions. The optimized distributions then support Co-Pruned Geometry Consistency (CPG) at the parameter level, which guides geometric consistency through parameter freezing and co-pruning, effectively removing inconsistent splats. The hierarchical guidance strategy effectively constrains and optimizes the overall Gaussian distributions, thereby enhancing both structural fidelity and rendering quality. Extensive experiments demonstrate that HeroGS achieves high-fidelity reconstructions and consistently surpasses state-of-the-art baselines under sparse-view conditions.
[295] Data-Efficient Brushstroke Generation with Diffusion Models for Oil Painting
Dantong Qin,Alessandro Bozzon,Xian Yang,Xun Zhang,Yike Guo,Pan Wang
Main category: cs.CV
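TL;DR: This paper proposes StrokeDiff, a diffusion-based brushstroke generation framework that combines Smooth Regularization (SmR) with a Bézier-curve conditioning module to learn controllable, diverse, and structurally consistent brushstrokes from a very small sample (n=470) and to integrate them into a painting pipeline.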
Details
Motivation: Existing generative models struggle to learn expressive, controllable visual primitives from scarce hand-drawn brushstroke data, which limits their use in process-aware multimedia creation. Method: Proposes the StrokeDiff framework, introducing Smooth Regularization (SmR) to inject stochastic visual priors that stabilize diffusion training under sparse supervision, and a Bézier-curve-driven conditioning module for controllable generation; further builds a complete stroke-based painting pipeline covering prediction, generation, ordering, and compositing. Result: Generates diverse and structurally coherent brushstrokes from few samples and yields paintings with noticeably richer texture and layering, as validated by both automatic metrics and human evaluation. Conclusion: Data-efficient primitive modeling can support expressive and structured multimedia content creation, offering a new paradigm for generative modeling when style data is scarce. Abstract: Many creative multimedia systems are built upon visual primitives such as strokes or textures, which are difficult to collect at scale and fundamentally different from natural image data. This data scarcity makes it challenging for modern generative models to learn expressive and controllable primitives, limiting their use in process-aware content creation. In this work, we study the problem of learning human-like brushstroke generation from a small set of hand-drawn samples (n=470) and propose StrokeDiff, a diffusion-based framework with Smooth Regularization (SmR). SmR injects stochastic visual priors during training, providing a simple mechanism to stabilize diffusion models under sparse supervision without altering the inference process. We further show how the learned primitives can be made controllable through a Bézier-based conditioning module and integrated into a complete stroke-based painting pipeline, including prediction, generation, ordering, and compositing. This demonstrates how data-efficient primitive modeling can support expressive and structured multimedia content creation. Experiments indicate that the proposed approach produces diverse and structurally coherent brushstrokes and enables paintings with richer texture and layering, validated by both automatic metrics and human evaluation.
[296] GroundedSurg: A Multi-Procedure Benchmark for Language-Conditioned Surgical Tool Segmentation
Tajamul Ashraf,Abrar Ul Riyaz,Wasif Tak,Tavaheed Tariq,Sonia Yadav,Moloud Abdar,Janibul Bashir
Main category: cs.CV
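TL;DR: This paper introduces GroundedSurg, the first language-guided, instance-level surgical instrument grounding benchmark for clinical surgical scenes, aimed at improving the ability of vision-language models to resolve referring expressions and localize at the pixel level in real surgical environments.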
Details
Motivation: Existing surgical tool benchmarks evaluate only category-level segmentation and cannot meet the clinical need to identify a specific instrument instance based on its functional role, spatial relation, or anatomical interaction. Method: Builds the GroundedSurg dataset, containing surgical images, natural-language descriptions, and structured spatial annotations (bounding boxes and point anchors) across multiple procedure types and imaging conditions, and designs a task that jointly evaluates linguistic reference resolution and pixel-level localization. Result: Experiments show substantial performance gaps for current mainstream segmentation models and vision-language models on the benchmark, confirming the difficulty and clinical relevance of the task. Conclusion: GroundedSurg provides a new benchmark and evaluation standard for clinically trustworthy surgical vision-language understanding, underscoring the urgency of developing clinically deployable multimodal reasoning. Abstract: Clinically reliable perception of surgical scenes is essential for advancing intelligent, context-aware intraoperative assistance such as instrument handoff guidance, collision avoidance, and workflow-aware robotic support. Existing surgical tool benchmarks primarily evaluate category-level segmentation, requiring models to detect all instances of predefined instrument classes. However, real-world clinical decisions often require resolving references to a specific instrument instance based on its functional role, spatial relation, or anatomical interaction, capabilities not captured by current evaluation paradigms. We introduce GroundedSurg, the first language-conditioned, instance-level surgical grounding benchmark. Each instance pairs a surgical image with a natural-language description targeting a single instrument, accompanied by structured spatial grounding annotations including bounding boxes and point-level anchors. The dataset spans ophthalmic, laparoscopic, robotic, and open procedures, encompassing diverse instrument types, imaging conditions, and operative complexities. By jointly evaluating linguistic reference resolution and pixel-level localization, GroundedSurg enables a systematic and realistic evaluation of vision-language models in clinically realistic multi-instrument scenes. Extensive experiments demonstrate substantial performance gaps across modern segmentation and VLMs, highlighting the urgent need for clinically grounded vision-language reasoning in surgical AI systems. Code and data are publicly available at https://github.com/gaash-lab/GroundedSurg
[297] DeAR: Fine-Grained VLM Adaptation by Decomposing Attention Head Roles
Yiming Ma,Hongkun Yang,Lionel Z. Wang,Bin Chen,Weizhi Xian,Jianzhi Teng
Main category: cs.CV
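TL;DR: This paper proposes the DeAR framework, which achieves fine-grained adaptation of vision-language models by decomposing attention head roles, avoiding the trade-off between task adaptation and zero-shot generalization.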
Details
Motivation: Existing prompt learning methods rest on a layer-centric assumption, which leads to uncontrolled interactions between learnable tokens and original tokens and harms the model's zero-shot generalization. Method: Proposes the DeAR framework, which defines a Concept Entropy metric to classify deep-layer attention heads by function (Attribute, Generalization, Mixed), and designs specialized attribute tokens and a role-based attention mask mechanism, complemented by a task-adaptive fusion strategy. Result: Experiments on 15 datasets show that DeAR strikes a better balance between task adaptation and generalization and outperforms prior methods. Conclusion: Functional specialization in VLMs occurs at the level of attention heads rather than layers; by finely controlling attention head roles, DeAR effectively decouples task adaptation from zero-shot generalization. Abstract: Prompt learning is a dominant paradigm for adapting pre-trained Vision-Language Models (VLMs) to downstream tasks. However, existing methods often rely on a simplistic, layer-centric view, assuming shallow layers capture general features while deep layers handle task-specific knowledge. This assumption results in uncontrolled interactions between learnable tokens and original tokens. Task-specific knowledge can degrade the model's core generalization, creating a trade-off between task adaptation and the preservation of zero-shot generalization. To address this, we challenge the layer-centric view and propose DeAR, a framework that achieves fine-grained VLM adaptation by Decomposing Attention head Roles. We posit that the functional specialization within VLMs occurs not between layers, but at the finer-grained level of individual attention heads in the deeper layers. Based on this insight, we introduce a novel metric, Concept Entropy, to systematically classify attention heads into distinct functional roles: Attribute, Generalization, and Mixed. Guided by these roles, we introduce specialized attribute tokens and a Role-Based Attention Mask mechanism to precisely control information flow, ensuring generalization heads remain isolated from task-specific knowledge. We further incorporate a Task-Adaptive Fusion Strategy for inference. Extensive experiments on fifteen datasets show that DeAR achieves a strong balance between task adaptation and generalization, outperforming previous methods across various tasks.
[298] GuiDINO: Rethinking Vision Foundation Model in Medical Image Segmentation
Zhuonan Liang,Wei Guo,Jie Gan,Yaxuan Song,Runnan Chen,Hang Chang,Weidong Cai
Main category: cs.CV
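TL;DR: GuiDINO is a lightweight framework that uses the DINOv3 foundation model to generate spatial guide masks, improving medical image segmentation through a TokenBook mechanism and a guide supervision loss while avoiding full fine-tuning.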
Details
Motivation: Foundation vision models suffer from domain shift in medical image analysis and are hard to adapt directly to segmentation, so a more efficient, low-overhead adaptation approach is needed. Method: Proposes the GuiDINO framework: extracts visual features with DINOv3 and generates a spatial guide mask via a lightweight TokenBook mechanism; the mask gates feature activations in multiple segmentation backbones; introduces a guide supervision loss (with an optional boundary-focused hinge loss); supports parameter-efficient fine-tuning of DINOv3 with LoRA. Result: Across diverse medical datasets and nnUNet-style inference, GuiDINO consistently improves segmentation accuracy and boundary robustness. Conclusion: GuiDINO offers a practical alternative to full fine-tuning for medical image segmentation and broadens how foundation models can be applied in medical vision. Abstract: Foundation vision models are increasingly adopted in medical image analysis. Due to domain shift, these pretrained models misalign with medical image segmentation needs without being fully fine-tuned or lightly adapted. We introduce GuiDINO, a framework that repositions the native foundation model to act as a visual guidance generator for downstream segmentation. GuiDINO extracts visual feature representations from DINOv3 and converts them into a spatial guide mask via a lightweight TokenBook mechanism, which aggregates token-prototype similarities. This guide mask gates feature activations in multiple segmentation backbones, thereby injecting foundation-model priors while preserving the inductive biases and efficiency of dedicated medical architectures. Training relies on a guide supervision loss that aligns the guide mask to ground-truth regions, optionally augmented by a boundary-focused hinge loss to sharpen fine structures. GuiDINO also supports parameter-efficient adaptation through LoRA on the DINOv3 guide backbone. Across diverse medical datasets and nnUNet-style inference, GuiDINO consistently improves segmentation quality and boundary robustness, suggesting a practical alternative to fine-tuning and offering a new perspective on how foundation models can best serve medical vision. Code is available at https://github.com/Hi-FishU/GuiDINO
[299] Improved MambdaBDA Framework for Robust Building Damage Assessment Across Disaster Domains
Alp Eren Gençoğlu,Hazım Kemal Ekenel
Main category: cs.CV
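TL;DR: This paper proposes a three-module enhancement of MambaBDA, a post-disaster building damage assessment (BDA) model, to address class imbalance, background clutter, and cross-disaster domain shift, markedly improving generalization both in-domain and cross-domain (especially on unseen disasters).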
Details
Motivation: Reliable post-disaster building damage assessment is hampered by severe class imbalance, background clutter, and domain shift across disaster types and geographic regions.
Method: Adds three modules to MambaBDA: (i) Focal Loss to mitigate class imbalance; (ii) lightweight Attention Gates to suppress irrelevant context; (iii) a compact Alignment Module that spatially warps pre-event features to align with post-event content.
Result: Validated on several satellite datasets, including xBD, Pakistan Flooding, Turkey Earthquake, and Ida Hurricane; in-domain performance improves by 0.8% to 5%, and cross-dataset (unseen disaster) performance by up to 27%.
Conclusion: The proposed modular enhancements markedly improve MambaBDA's generalization, especially on unseen disaster scenarios.
Abstract: Reliable post-disaster building damage assessment (BDA) from satellite imagery is hindered by severe class imbalance, background clutter, and domain shift across disaster types and geographies. In this work, we address these problems and explore ways to improve MambaBDA, the BDA network of the ChangeMamba architecture and one of the most successful BDA models. The approach enhances MambaBDA with three modular components: (i) Focal Loss to mitigate class imbalance in damage classification, (ii) lightweight Attention Gates to suppress irrelevant context, and (iii) a compact Alignment Module to spatially warp pre-event features toward post-event content before decoding. We experiment on multiple satellite imagery datasets, including xBD, Pakistan Flooding, Turkey Earthquake, and Ida Hurricane, and conduct in-domain and cross-dataset tests. The proposed modular enhancements yield consistent improvements over the baseline model, with 0.8% to 5% performance gains in-domain, and up to 27% on unseen disasters. This indicates that the proposed enhancements are especially beneficial for the generalization capability of the system.
[300] ClinCoT: Clinical-Aware Visual Chain-of-Thought for Medical Vision Language Models
Xiwei Liu,Yulong Li,Xinlin Zhuang,Xuhui Li,Jianxu Chen,Haolin Yang,Imran Razzak,Yutong Xie
Main category: cs.CV
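TL;DR: This paper proposes the ClinCoT framework, which improves the factual grounding of medical vision-language models and reduces hallucination through visual-driven, clinical-aware chain-of-thought reasoning.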
Details
Motivation: Existing medical alignment methods optimize only at the response level, leaving intermediate reasoning weakly tied to visual regions; CoT methods are text-centric and struggle to integrate clinical visual cues.
Method: Proposes ClinCoT: 1) a pipeline that generates clinically grounded preference data from hypothesis-driven region proposals; 2) multiple Med-LLM evaluators that score and rank responses for supervised training; 3) a margin-aware optimization strategy based on score differences to strengthen region-level reasoning paths; 4) iterative learning that dynamically updates the preference data.
Result: On three medical VQA and report generation benchmarks, ClinCoT markedly improves factual grounding and outperforms existing preference-based alignment methods.
Conclusion: ClinCoT lifts preference optimization from the response level to visual-driven reasoning, mitigating factual hallucination in medical VLMs and making clinical decision support more reliable.
Abstract: Medical Vision-Language Models have shown promising potential in clinical decision support, yet they remain prone to factual hallucinations due to insufficient grounding in localized pathological evidence. Existing medical alignment methods primarily operate at the response level through preference optimization, improving output correctness but leaving intermediate reasoning weakly connected to visual regions. Although chain-of-thought (CoT) enhances multimodal reasoning, it remains largely text-centric, limiting effective integration of clinical visual cues. To address this gap, we propose ClinCoT, a clinical-aware visual chain-of-thought framework that transforms preference optimization from response-level correction to visual-driven reasoning. We introduce an automatic data generation pipeline that constructs clinically grounded preference pairs through reasoning with hypothesis-driven region proposals. Multiple Med-LLM evaluators rank and assign scores to each response, and these rankings serve as supervision to train the target model. We further introduce a scoring-based margin-aware optimization strategy that incorporates both preference ranking and score difference to refine region-level reasoning trajectories. To maintain alignment as the model's policy evolves during training, we adopt an iterative learning scheme that dynamically regenerates preference data.
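The abstract does not give the form of the scoring-based margin-aware objective. A minimal sketch, assuming a DPO-style pairwise loss whose margin grows with the evaluator score gap; the function names and the `beta` and `gamma` parameters are illustrative assumptions, not the authors' formulation:

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def margin_aware_preference_loss(logratio_chosen, logratio_rejected,
                                 score_chosen, score_rejected,
                                 beta=0.1, gamma=1.0):
    """Pairwise preference loss with a score-dependent margin (hypothetical).

    logratio_*: log pi_theta(y|x) - log pi_ref(y|x) for each response.
    score_*:    evaluator scores assigned to the two responses.
    """
    # Assumed design: a larger score gap demands a larger log-ratio gap.
    margin = gamma * (score_chosen - score_rejected)
    logit = beta * (logratio_chosen - logratio_rejected) - margin
    # -log(sigmoid(logit)) is the same as softplus(-logit).
    return softplus(-logit)
```

With equal scores the margin vanishes and the loss reduces to a plain pairwise logistic loss; a wider score gap penalizes the same log-ratio difference more heavily.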
Extensive experiments on three medical VQA and report generation benchmarks demonstrate that ClinCoT consistently improves factual grounding and achieves superior performance compared with existing preference-based alignment methods.
[301] Predictive Reasoning with Augmented Anomaly Contrastive Learning for Compositional Visual Relations
Chengtai Li,Yuting He,Jianfeng Ren,Ruibin Bai,Yitian Zhao,Heng Yu,Xudong Jiang
Main category: cs.CV
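TL;DR: This paper proposes PR-A²CL, which tackles compositional visual relations (CVR) reasoning through a predict-and-verify paradigm and augmented anomaly contrastive learning, significantly outperforming existing models on multiple benchmarks.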
Details
Motivation: Compositional visual relations (CVR) are under-studied because of their complexity, and existing methods struggle to model the large number of compositional rules.
Method: Proposes Predictive Reasoning with Augmented Anomaly Contrastive Learning (PR-A²CL), comprising an augmented anomaly contrastive learning module (raising similarity among normal samples and lowering similarity between normal and anomalous samples) and a predict-and-verify paradigm (multiple Predictive Anomaly Reasoning Blocks iteratively predict and verify the features of the fourth image).
Result: Significantly surpasses state-of-the-art visual reasoning models on the SVRT, CVR, and MC²R datasets.
Conclusion: PR-A²CL improves the modeling of and reasoning over complex compositional visual relations, confirming the value of the predict-and-verify paradigm and anomaly contrastive learning for CVR tasks.
Abstract: While visual reasoning for simple analogies has received significant attention, compositional visual relations (CVR) remain relatively unexplored due to their greater complexity. To solve CVR tasks, i.e., to identify an outlier image given three other images that follow the same compositional rules, we propose Predictive Reasoning with Augmented Anomaly Contrastive Learning (PR-A²CL). To address the challenge of modelling abundant compositional rules, an Augmented Anomaly Contrastive Learning is designed to distil discriminative and generalizable features by maximizing similarity among normal instances while minimizing similarity between normal and anomalous outliers. More importantly, a predict-and-verify paradigm is introduced for rule-based reasoning, in which a series of Predictive Anomaly Reasoning Blocks (PARBs) iteratively leverage features from three out of the four images to predict those of the remaining one. Throughout the subsequent verification stage, the PARBs progressively pinpoint the specific discrepancies attributable to the underlying rules. Experimental results on SVRT, CVR and MC²R datasets show that PR-A²CL significantly outperforms state-of-the-art reasoning models.
[302] Teacher-Guided Causal Interventions for Image Denoising: Orthogonal Content-Noise Disentanglement in Vision Transformers
Kuai Jiang,Zhaoyan Ding,Guijuan Zhang,Dianjie Lu,Zhuoran Zheng
Main category: cs.CV
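TL;DR: This paper proposes TCD-Net, which disentangles content and noise in image denoising through causal intervention to improve out-of-distribution robustness; it introduces environmental bias adjustment, orthogonal dual-branch disentanglement, and a causal prior guided by Nano Banana Pro, achieving real-time, high-performance denoising.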
Details
Motivation: Conventional image denoising models tend to learn spurious correlations between environmental factors and noise, and struggle to distinguish high-frequency textures from random noise, causing lost details or residual noise; purely correlational modeling confuses intrinsic content with extrinsic noise, harming robustness under distribution shift.
Method: Proposes the Teacher-Guided Causal Disentanglement Network (TCD-Net), which performs structured feature-space interventions within a ViT framework: (1) an Environmental Bias Adjustment (EBA) module applies a de-centering projection to remove global environmental bias; (2) an orthogonality-constrained dual-branch disentanglement head forces content and noise representations apart; (3) Nano Banana Pro is used to generate a causal prior that pulls content representations back onto the natural-image manifold.
Result: TCD-Net surpasses mainstream methods on multiple benchmarks in both fidelity and efficiency, reaching a real-time speed of 104.2 FPS on a single RTX 5090 GPU.
Conclusion: Causal intervention effectively disentangles the content and noise generation mechanisms in image denoising, markedly improving generalization and robustness; TCD-Net offers a new paradigm for causality-based low-level vision.
Abstract: Conventional image denoising models often inadvertently learn spurious correlations between environmental factors and noise patterns. Moreover, due to high-frequency ambiguity, they struggle to reliably distinguish subtle textures from stochastic noise, resulting in over-removed details or residual noise artifacts. We therefore revisit denoising via causal intervention, arguing that purely correlational fitting entangles intrinsic content with extrinsic noise, which directly degrades robustness under distribution shifts. Motivated by this, we propose the Teacher-Guided Causal Disentanglement Network (TCD-Net), which explicitly decomposes the generative mechanism via structured interventions on feature spaces within a Vision Transformer framework. Specifically, our method integrates three key components: (1) An Environmental Bias Adjustment (EBA) module projects features into a stable, de-centered subspace to suppress global environmental bias (de-confounding). (2) A dual-branch disentanglement head employs an orthogonality constraint to force a strict separation between content and noise representations, preventing information leakage. (3) To resolve structural ambiguity, we leverage Nano Banana Pro, Google's reasoning-guided AI image generation model, to guide a causal prior, effectively pulling content representations back onto the natural-image manifold.
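The orthogonality constraint in component (2) can be illustrated with a simple penalty. This is a minimal sketch, assuming the constraint is expressed as the mean squared cosine similarity between paired content and noise feature vectors; the abstract does not state the exact loss, and the function names are hypothetical:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)  # epsilon guards against zero vectors

def orthogonality_penalty(content_feats, noise_feats):
    """Mean squared cosine similarity over paired content/noise vectors.

    The value is 0 when every pair is orthogonal and approaches 1 when
    every pair is parallel (assumed form of the constraint).
    """
    sims = [cosine(c, n) ** 2 for c, n in zip(content_feats, noise_feats)]
    return sum(sims) / len(sims)
```

Adding such a term to the training loss pushes the two branches toward carrying non-overlapping directions in feature space, which is one common way to discourage information leakage between them.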
Extensive experiments demonstrate that TCD-Net outperforms mainstream methods across multiple benchmarks in both fidelity and efficiency, achieving a real-time speed of 104.2 FPS on a single RTX 5090 GPU.
[303] ArtLLM: Generating Articulated Assets via 3D LLM
Penghao Wang,Siyuan Xie,Hongyu Yan,Xianghui Yang,Jingwei Huang,Chunchao Guo,Jiayuan Gu
Main category: cs.CV
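TL;DR: This paper proposes the ArtLLM framework, which uses a 3D multimodal large language model to autoregressively generate high-quality articulated structures (parts and joints) from complete 3D meshes without manual fitting or a fixed part library, markedly improving part layout and joint prediction accuracy and showing promise for digital twins and robot learning.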
Details
Motivation: Existing methods for reconstructing articulated 3D objects are limited: optimization-based methods are slow and support only single joints, while retrieval-based methods depend on a fixed part library, leading to repetitive geometry and poor generalization.
Method: Builds the ArtLLM framework around a 3D multimodal large language model trained on a large-scale articulation dataset; it predicts a variable number of parts and joints (the kinematic structure) in a unified manner from the point cloud, and this layout then conditions a 3D generative model that synthesizes high-fidelity part geometry.
Result: On the PartNet-Mobility dataset, ArtLLM significantly surpasses SOTA methods in part layout accuracy and joint prediction and generalizes well to real-world objects; it is applied to digital twin construction and robot learning.
Conclusion: ArtLLM achieves end-to-end, scalable, high-fidelity generation of articulated 3D assets, offering a new paradigm for gaming, simulation, and robotics.
Abstract: Creating interactive digital environments for gaming, robotics, and simulation relies on articulated 3D objects whose functionality emerges from their part geometry and kinematic structure. However, existing approaches remain fundamentally limited: optimization-based reconstruction methods require slow, per-object joint fitting and typically handle only simple, single-joint objects, while retrieval-based methods assemble parts from a fixed library, leading to repetitive geometry and poor generalization. To address these challenges, we introduce ArtLLM, a novel framework for generating high-quality articulated assets directly from complete 3D meshes. At its core is a 3D multimodal large language model trained on a large-scale articulation dataset curated from both existing articulation datasets and procedurally generated objects. Unlike prior work, ArtLLM autoregressively predicts a variable number of parts and joints, inferring their kinematic structure in a unified manner from the object's point cloud. This articulation-aware layout then conditions a 3D generative model to synthesize high-fidelity part geometries. Experiments on the PartNet-Mobility dataset show that ArtLLM significantly outperforms state-of-the-art methods in both part layout accuracy and joint prediction, while generalizing robustly to real-world objects. Finally, we demonstrate its utility in constructing digital twins, highlighting its potential for scalable robot learning.
[304] TC-SSA: Token Compression via Semantic Slot Aggregation for Gigapixel Pathology Reasoning
Zhuo Chen,Shawn Young,Lijian Xu
Main category: cs.CV
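TL;DR: This paper proposes TC-SSA, which compresses WSI visual tokens through semantic slot aggregation, reducing the token count to 1.7% of the original sequence while retaining diagnostically critical information, and markedly improving the efficiency and performance of large models in computational pathology.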
Details
Motivation: Large vision-language models in computational pathology face a computational bottleneck from the gigapixel scale of whole slide images (WSIs), and conventional spatial sampling tends to lose critical diagnostic information.
Method: Proposes the TC-SSA (Token Compression via Semantic Slot Aggregation) framework: a gated routing module with sparse Top-2 routing assigns patch features to a fixed number of semantic slots and aggregates them by weight, giving learnable token compression with global slide coverage.
Result: Reaches 78.34% overall accuracy on SlideBench (TCGA); on MIL classification tasks such as TCGA-BRCA, AUC reaches up to 98.27%; the token count is compressed to 1.7% of the original.
Conclusion: Learnable semantic aggregation offers an effective trade-off between efficiency and diagnostic performance, providing a new paradigm for gigapixel pathology reasoning.
Abstract: The application of large vision-language models to computational pathology holds great promise for diagnostic assistants but faces a critical computational bottleneck: the gigapixel scale of Whole Slide Images (WSIs). A single WSI typically contains over 10⁵ patches, creating sequence lengths that exceed the constraints of standard Transformer architectures. Existing solutions often resort to spatial sampling, which risks discarding diagnostically critical evidence. To address this, we propose TC-SSA (Token Compression via Semantic Slot Aggregation), a learnable token compression framework that aggregates patch features into a fixed number of semantic slots. A gated routing module assigns patches to slots using sparse Top-2 routing, followed by weighted aggregation, enabling global slide coverage under a strict token budget. The resulting representation retains diagnostically relevant information while reducing the number of visual tokens to 1.7% of the original sequence. On SlideBench (TCGA), our model achieves 78.34% overall accuracy and 77.14% on the diagnosis subset, outperforming sampling-based baselines under comparable token budgets. The method also generalizes to MIL classification, reaching AUC of 95.83% on TCGA-BRCA, 98.27% on TCGA-NSCLC and 79.80% on PANDA. These results suggest that learnable semantic aggregation provides an effective trade-off between efficiency and diagnostic performance for gigapixel pathology reasoning.
[305] ConVibNet: Needle Detection during Continuous Insertion via Frequency-Inspired Features
Jiamei Guo,Zhehao Duan,Maria Neiiendam,Dianye Huang,Nassir Navab,Zhongliang Jiang
Main category: cs.CV
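TL;DR: This paper proposes ConVibNet, a real-time framework for needle detection under ultrasound guidance that exploits temporal dependencies; an intersection-and-difference loss improves needle-tip localization and shaft-angle estimation, yielding a tip error of 2.80 ± 2.42 mm and an angle error of 1.69 ± 2.00° on a self-collected dataset, outperforming baselines while remaining real-time.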
Details
Motivation: Needles are poorly visible in ultrasound images and are easily affected by artifacts, occlusion, and low contrast, and existing methods struggle to support real-time tracking during continuous insertion.
Method: Proposes ConVibNet, an extension of VibNet that exploits temporal dependencies across successive ultrasound frames to continuously estimate needle-tip position and shaft angle; introduces an intersection-and-difference loss to better model needle-tip motion correlations; builds a dedicated dataset for training and evaluation.
Result: On the self-collected dataset, ConVibNet achieves a tip error of 2.80 ± 2.42 mm and an angle error of 1.69 ± 2.00°, improving tip localization by 0.75 mm over the best baseline while retaining real-time inference.
Conclusion: By combining temporal modeling with a new loss function, ConVibNet markedly improves the accuracy and robustness of real-time needle detection under ultrasound guidance and shows potential for integration into autonomous insertion systems.
Abstract: Purpose: Ultrasound-guided needle interventions are widely used in clinical practice, but their success critically depends on accurate needle placement, which is frequently hindered by the poor and intermittent visibility of needles in ultrasound images. Existing approaches remain limited by artifacts, occlusions, and low contrast, and often fail to support real-time continuous insertion. To overcome these challenges, this study introduces a robust real-time framework for continuous needle detection. Methods: We present ConVibNet, an extension of VibNet for detecting needles with significantly reduced visibility, addressing real-time, continuous needle tracking during insertion. ConVibNet leverages temporal dependencies across successive ultrasound frames to enable continuous estimation of both needle tip position and shaft angle in dynamic scenarios. To strengthen temporal awareness of needle-tip motion, we introduce a novel intersection-and-difference loss that explicitly leverages motion correlations across consecutive frames. In addition, we curated a dedicated dataset for model development and evaluation. Results: The performance of the proposed ConVibNet model was evaluated on our dataset, demonstrating superior accuracy compared to the baseline VibNet and UNet-LSTM models. Specifically, ConVibNet achieved a tip error of 2.80 ± 2.42 mm and an angle error of 1.69 ± 2.00 deg. These results represent a 0.75 mm improvement in tip localization accuracy over the best-performing baseline, while preserving real-time inference capability.
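The reported tip and angle errors are mean ± standard deviation summaries. A minimal sketch of how such metrics could be computed, assuming the tip error is the Euclidean distance between predicted and ground-truth tip positions and the angle error is the wrapped difference between shaft orientations; the function names, the `mm_per_pixel` calibration factor, and the use of the population standard deviation are assumptions, not details from the paper:

```python
import math
import statistics

def tip_error_mm(pred_tip, gt_tip, mm_per_pixel=1.0):
    """Euclidean distance between predicted and true tip, converted to mm."""
    dist_px = math.hypot(pred_tip[0] - gt_tip[0], pred_tip[1] - gt_tip[1])
    return dist_px * mm_per_pixel

def angle_error_deg(pred_deg, gt_deg):
    """Absolute orientation difference of two lines, wrapped into [0, 90]."""
    diff = abs(pred_deg - gt_deg) % 180.0  # a line's orientation repeats every 180 degrees
    return min(diff, 180.0 - diff)

def mean_std(values):
    """Return (mean, population standard deviation) of a list of errors."""
    return statistics.mean(values), statistics.pstdev(values)
```

For example, `mean_std([tip_error_mm(p, g, 0.2) for p, g in pairs])` would give the two numbers behind a "mean ± std mm" entry, where `pairs` is a hypothetical list of predicted and ground-truth tip coordinates.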
Conclusion: ConVibNet advances real-time needle detection in ultrasound-guided interventions by integrating temporal correlation modeling with a novel intersection-and-difference loss, thereby improving accuracy and robustness and demonstrating high potential for integration into autonomous insertion systems.
[306] GRAD-Former: Gated Robust Attention-based Differential Transformer for Change Detection
Durgesh Ameta,Ujjwal Mishra,Praful Hambarde,Amit Shukla
Main category: cs.CV
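TL;DR: This paper proposes GRAD-Former, a new remote-sensing change detection framework that combines global-local context modeling with efficient computation; its AFRAR module (comprising SEA and GLFR) improves feature selectivity and representation, surpassing existing SOTA methods on multiple datasets with fewer parameters.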
Details
Motivation: Existing CNN-, Transformer-, and SSM-based methods struggle to delineate change regions precisely in high-resolution remote sensing imagery; Transformers in particular suffer from high computational complexity, poor performance with limited data, and under-use of spatial information.
Method: Proposes the GRAD-Former framework, whose core is an encoder with an Adaptive Feature Relevance and Refinement (AFRAR) module containing two sub-modules, Selective Embedding Amplification (SEA) and Global-Local Feature Refinement (GLFR), which use gating mechanisms and differential attention, respectively, generating multiple softmax heaps to amplify key features and suppress irrelevant ones.
Result: Surpasses current SOTA methods on three mainstream change detection datasets (LEVIR-CD, CDD, DSIFN-CD), leading on all metrics with fewer model parameters.
Conclusion: GRAD-Former markedly improves change detection accuracy on VHR remote sensing imagery while staying efficient, setting a new benchmark for the field.
Abstract: Change detection (CD) in remote sensing aims to identify semantic differences between satellite images captured at different times. While deep learning has significantly advanced this field, existing approaches based on convolutional neural networks (CNNs), transformers and Selective State Space Models (SSMs) still struggle to precisely delineate change regions. In particular, traditional transformer-based methods suffer from quadratic computational complexity when applied to very high-resolution (VHR) satellite images and often perform poorly with limited training data, leading to under-utilization of the rich spatial information available in VHR imagery. We present GRAD-Former, a novel framework that enhances contextual understanding while maintaining efficiency through reduced model size. The proposed framework consists of a novel encoder with an Adaptive Feature Relevance and Refinement (AFRAR) module, together with fusion and decoder blocks. AFRAR integrates global-local contextual awareness through two proposed components: the Selective Embedding Amplification (SEA) module and the Global-Local Feature Refinement (GLFR) module. SEA and GLFR leverage gating mechanisms and differential attention, respectively, generating multiple softmax heaps to capture important features while minimizing the capture of irrelevant features. Multiple experiments across three challenging CD datasets (LEVIR-CD, CDD, DSIFN-CD) demonstrate GRAD-Former's superior performance compared to existing approaches.
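Differential attention, which the abstract attributes to the GLFR module, is commonly formulated as the difference of two softmax attention maps. A minimal sketch under that assumption; whether GRAD-Former uses exactly this form is not stated in the abstract, and the fixed scalar `lam` and the function names are illustrative:

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def scores(q, k):
    """Scaled dot-product scores q @ k^T / sqrt(d) for lists of row vectors."""
    scale = 1.0 / math.sqrt(len(q[0]))
    return [[scale * sum(a * b for a, b in zip(qi, kj)) for kj in k] for qi in q]

def differential_attention(q1, k1, q2, k2, v, lam=0.5):
    """Attend with softmax(q1 k1^T) - lam * softmax(q2 k2^T) (assumed form)."""
    a1 = [softmax(r) for r in scores(q1, k1)]
    a2 = [softmax(r) for r in scores(q2, k2)]
    # Subtracting the second map cancels attention mass the two maps share.
    weights = [[x - lam * y for x, y in zip(r1, r2)] for r1, r2 in zip(a1, a2)]
    out = [[sum(w * vj[c] for w, vj in zip(wrow, v)) for c in range(len(v[0]))]
           for wrow in weights]
    return out, weights
```

The subtraction acts as a noise-cancelling step: attention that both maps assign to the same uninformative tokens is removed, leaving the weights that differ between them.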
Notably, GRAD-Former outperforms the current state-of-the-art models across all the metrics and all the datasets while using fewer parameters. Our framework establishes a new benchmark for remote sensing change detection performance. Our code will be released at: https://github.com/Ujjwal238/GRAD-Former
[307] BeautyGRPO: Aesthetic Alignment for Face Retouching via Dynamic Path Guidance and Fine-Grained Preference Modeling
Jiachen Yang,Xianhui Lin,Yi Dong,Zebiao Zheng,Xing Liu,Hong Gu,Yanmei Fang
Main category: cs.CV
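TL;DR: This paper proposes the BeautyGRPO framework, which builds FRPref-10K, a fine-grained preference dataset, and a dedicated reward model, and introduces a Dynamic Path Guidance (DPG) mechanism that balances exploration with high fidelity in reinforcement learning, markedly improving face retouching in texture quality, blemish removal, and aesthetic alignment.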
Details
Motivation: Existing methods face a fundamental trade-off: supervised learning struggles to model subjective aesthetic preferences, while online reinforcement learning can align with preferences but its stochastic exploration tends to introduce noise and drift in high-fidelity retouching.
Method: Proposes the BeautyGRPO reinforcement learning framework, builds FRPref-10K, a fine-grained face-retouching preference dataset covering five dimensions, trains a dedicated reward model, and designs Dynamic Path Guidance (DPG), which dynamically replans the sampling trajectory along an anchor-driven ODE path to suppress stochastic drift while keeping exploration controlled.
Result: BeautyGRPO outperforms existing face retouching and general image editing methods in texture quality, blemish-removal accuracy, and alignment with human aesthetics.
Conclusion: BeautyGRPO resolves the tension between preference alignment and high-fidelity generation in face retouching, and DPG offers generative reinforcement learning a new paradigm that balances stability and exploration.
Abstract: Face retouching requires removing subtle imperfections while preserving unique facial identity features, in order to enhance overall aesthetic appeal. However, existing methods suffer from a fundamental trade-off. Supervised learning on labeled data is constrained to pixel-level label mimicry, failing to capture complex subjective human aesthetic preferences. Conversely, while online reinforcement learning (RL) excels at preference alignment, its stochastic exploration paradigm conflicts with the high-fidelity demands of face retouching and often introduces noticeable noise artifacts due to accumulated stochastic drift. To address these limitations, we propose BeautyGRPO, a reinforcement learning framework that aligns face retouching with human aesthetic preferences. We construct FRPref-10K, a fine-grained preference dataset covering five key retouching dimensions, and train a specialized reward model capable of evaluating subtle perceptual differences. To reconcile exploration and fidelity, we introduce Dynamic Path Guidance (DPG). DPG stabilizes the stochastic sampling trajectory by dynamically computing an anchor-based ODE path and replanning a guided trajectory at each sampling timestep, effectively correcting stochastic drift while maintaining controlled exploration.
Extensive experiments show that BeautyGRPO outperforms both specialized face retouching methods and general image editing models, achieving superior texture quality, more accurate blemish removal, and overall results that better align with human aesthetic preferences.
[308] FREE-Edit: Using Editing-aware Injection in Rectified Flow Models for Zero-shot Image-Driven Video Editing
Maomao Li,Yunfei Liu,Yu Li
Main category: cs.CV
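TL;DR: This paper proposes an editing-aware attention injection method (REE) and FREE-Edit, a zero-shot image-driven video editing framework, which use an editing mask and optical-flow tracking to modulate each token's feature injection strength, avoiding interference in edited regions and markedly improving editing quality without fine-tuning.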
Details
Motivation: Existing image-driven video editing methods struggle to balance attention injection strength: too much causes semantic conflicts with the source video, too little preserves too little of the source; a mechanism that adapts injection strength is needed.
Method: Proposes Editing-awaRE (REE) injection: an editing mask is built from the pixel difference between the source and edited first frames and propagated to all frames via optical flow, and per-token injection strength is generated from it (no injection in edited regions); on this basis, the zero-shot FREE-Edit framework is built on rectified-flow models.
Result: FREE-Edit produces high-quality output across a variety of image-driven video editing tasks without fine-tuning or training, outperforming existing methods.
Conclusion: An editing-aware dynamic injection strategy preserves source-video motion and layout more precisely, and combined with rectified-flow models it enables efficient zero-shot image-driven video editing.
Abstract: Image-driven video editing aims to propagate edited content from the modified first frame to the remaining frames. Existing methods usually invert the source video to noise using a pre-trained image-to-video (I2V) model and then guide the sampling process using the edited first frame. Generally, a popular choice for maintaining motion and layout from the source video is intervening in the denoising process by injecting attention during reconstruction. However, such injection often leads to unsatisfactory results, where excessive injection leads to conflicting semantics from the source video while insufficient injection brings limited source representation. Recognizing this, we propose an Editing-awaRE (REE) injection method to modulate the injection intensity of each token. Specifically, we first compute the pixel difference between the source and edited first frame to form a corresponding editing mask. Next, we track the editing area throughout the entire video by using optical flow to warp the first-frame mask. Then, editing-aware feature injection intensity for each token is generated accordingly, where injection is not conducted on editing areas. Building upon REE injection, we further propose a zero-shot image-driven video editing framework with recently emerged rectified-flow models, dubbed FREE-Edit. Without fine-tuning or training, our FREE-Edit demonstrates effectiveness in various image-driven video editing scenarios, showing its capability to produce higher-quality outputs compared with existing techniques.
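The three REE steps (difference mask, flow-based tracking, per-token injection strength) can be sketched on toy grids. This is an illustrative sketch only: it assumes grayscale frames stored as 2D lists, a hypothetical difference `threshold`, integer per-pixel flow with nearest-neighbor forward warping, and a binary strength of 1 minus the mask; a real pipeline would use dense floating-point flow and operate on latent tokens.

```python
def editing_mask(source, edited, threshold=0.1):
    """Mark pixels whose first-frame value changed by more than threshold."""
    return [[1 if abs(s - e) > threshold else 0 for s, e in zip(rs, re)]
            for rs, re in zip(source, edited)]

def warp_mask(mask, flow):
    """Forward-warp a binary mask with integer flow vectors (dy, dx) per pixel."""
    h, w = len(mask), len(mask[0])
    warped = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                dy, dx = flow[y][x]
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w:  # drop pixels that leave the frame
                    warped[ny][nx] = 1
    return warped

def injection_strength(mask):
    """Inject source features everywhere except the edited region."""
    return [[1.0 - m for m in row] for row in mask]
```

Applying `warp_mask` frame by frame with the flow from the first frame to each later frame yields one strength map per frame, so edited regions stay free of source features as they move.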
Project page: https://free-edit.github.io/page/
[309] TripleSumm: Adaptive Triple-Modality Fusion for Video Summarization
Sumin Kim,Hyemin Jeong,Mingu Kang,Yejin Kim,Yoori Oh,Joonseok Lee
Main category: cs.CV
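TL;DR: This paper proposes TripleSumm, which adaptively weights and fuses visual, text, and audio modalities at the frame level, addressing the difficulty existing video summarization methods have with complex videos because of static or modality-agnostic fusion; it also builds MoSu, the first large-scale tri-modal video summarization benchmark.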
Details
Motivation: Existing video summarization methods use static or modality-agnostic fusion strategies that cannot model the frame-dependent dynamics of modality saliency in video, and a comprehensive multimodal video summarization benchmark is lacking.
Method: Proposes the TripleSumm architecture, which adaptively weights and fuses the visual, text, and audio modalities at the frame level, and builds MoSu, the first large-scale tri-modal video summarization benchmark.
Result: TripleSumm significantly outperforms existing methods on four benchmarks (including MoSu), reaching SOTA performance.
Conclusion: Frame-level adaptive multimodal fusion effectively improves video summarization, and the MoSu benchmark provides important support for future research.
Abstract: The exponential growth of video content necessitates effective video summarization to efficiently extract key information from long videos. However, current approaches struggle to fully comprehend complex videos, primarily because they employ static or modality-agnostic fusion strategies. These methods fail to account for the dynamic, frame-dependent variations in modality saliency inherent in video data. To overcome these limitations, we propose TripleSumm, a novel architecture that adaptively weights and fuses the contributions of visual, text, and audio modalities at the frame level. Furthermore, a significant bottleneck for research into multimodal video summarization has been the lack of comprehensive benchmarks. Addressing this bottleneck, we introduce MoSu (Most Replayed Multimodal Video Summarization), the first large-scale benchmark that provides all three modalities. Extensive experiments demonstrate that TripleSumm achieves state-of-the-art performance, outperforming existing methods by a significant margin on four benchmarks, including MoSu. Our code and dataset are available at https://github.com/smkim37/TripleSumm.
[310] VP-Hype: A Hybrid Mamba-Transformer Framework with Visual-Textual Prompting for Hyperspectral Image Classification
Abdellah Zakaria Sellam,Fadi Abdeladhim Zidi,Salah Eddine Bekhouche,Ihssen Houhou,Marouane Tliba,Cosimo Distante,Abdenour Hadid
Main category: cs.CV
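TL;DR: VP-Hype is a new hybrid architecture that combines the linear efficiency of state-space models (SSMs) with the relational modeling of Transformers for efficient hyperspectral image classification with low dependence on labeled samples.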
Details
Motivation: Hyperspectral image classification faces a tension between high-dimensional spectral data and extremely scarce labeled samples, and the quadratic computational complexity of standard Transformers hinders scaling.
Method: Proposes the VP-Hype framework: 1) a robust 3D-CNN spectral front-end; 2) a Hybrid Mamba-Transformer backbone that replaces conventional attention blocks, balancing long-range dependency modeling with computational efficiency; 3) dual-modal visual and textual prompts to ease label scarcity.
Result: With only 2% of samples used for training, Overall Accuracy (OA) reaches 99.69% on Salinas and 99.45% on Longkou, setting a new SOTA in low-data regimes.
Conclusion: Combining hybrid sequence modeling with multi-modal prompting offers a reliable path to high-performance, sample-efficient remote sensing image analysis.
Abstract: Accurate classification of hyperspectral imagery (HSI) is often frustrated by the tension between high-dimensional spectral data and the extreme scarcity of labeled training samples. While hierarchical models like LoLA-SpecViT have demonstrated the power of local windowed attention and parameter-efficient fine-tuning, the quadratic complexity of standard Transformers remains a barrier to scaling. We introduce VP-Hype, a framework that rethinks HSI classification by unifying the linear-time efficiency of State-Space Models (SSMs) with the relational modeling of Transformers in a novel hybrid architecture. Building on a robust 3D-CNN spectral front-end, VP-Hype replaces conventional attention blocks with a Hybrid Mamba-Transformer backbone to capture long-range dependencies with significantly reduced computational overhead. Furthermore, we address the label-scarcity problem by integrating dual-modal Visual and Textual Prompts that provide context-aware guidance for the feature extraction process. Our experimental evaluation demonstrates that VP-Hype establishes a new state of the art in low-data regimes. Specifically, with a training sample distribution of only 2%, the model achieves Overall Accuracy (OA) of 99.69% on the Salinas dataset and 99.45% on the Longkou dataset. These results suggest that the convergence of hybrid sequence modeling and multi-modal prompting provides a robust path forward for high-performance, sample-efficient remote sensing.
[311] RnG: A Unified Transformer for Complete 3D Modeling from Partial Observations
Mochu Xiang,Zhelun Shen,Xuesong Li,Jiahui Ren,Jing Zhang,Chen Zhao,Shanshan Liu,Haocheng Feng,Jingdong Wang,Yuchao Dai
Main category: cs.CV
TL;DR: RnG是一种新型前馈Transformer模型,通过重建引导的因果注意力机制,统一3D重建与生成任务,能从稀疏2D图像推断完整3D结构并实时渲染高质量新视角RGBD图像。
Details
Motivation: Existing generalizable 3D reconstruction models can model only observed regions and cannot infer unseen geometry, so a method that recovers complete 3D structure from partial 2D observations is needed.
Method: Proposes the RnG model, whose core is a reconstruction-guided causal attention mechanism that treats the KV-cache as an implicit 3D representation and supports queries from arbitrary poses to render novel-view RGBD output.
Result: Reaches SOTA performance on generalizable 3D reconstruction and novel view generation, with real-time interactive capability.
Conclusion: RnG unifies reconstruction and generation: it accurately reconstructs visible geometry and also generates plausible, coherent unseen geometry and appearance, advancing the path from sparse views to complete 3D understanding.
Abstract: Humans perceive the 3D world through 2D observations from limited viewpoints. While recent feed-forward generalizable 3D reconstruction models excel at recovering 3D structures from sparse images, their representations are often confined to observed regions, leaving unseen geometry unmodeled. This raises a key, fundamental challenge: Can we infer a complete 3D structure from partial 2D observations? We present RnG (Reconstruction and Generation), a novel feed-forward Transformer that unifies these two tasks by predicting an implicit, complete 3D representation. At the core of RnG, we propose a reconstruction-guided causal attention mechanism that separates reconstruction and generation at the attention level, and treats the KV-cache as an implicit 3D representation. Then, arbitrary poses can efficiently query this cache to render high-fidelity, novel-view RGBD outputs. As a result, RnG not only accurately reconstructs visible geometry but also generates plausible, coherent unseen geometry and appearance. Our method achieves state-of-the-art performance in both generalizable 3D reconstruction and novel view generation, while operating efficiently enough for real-time interactive applications. Project page: https://npucvr.github.io/RnG
[312] VisNec: Measuring and Leveraging Visual Necessity for Multimodal Instruction Tuning
Mingkang Dong,Hongyi Cai,Jie Li,Sifan Zhou,Bin Ren,Kunyu Peng,Yuqian Fu
Main category: cs.CV
TL;DR: 本文提出VisNec框架,通过量化视觉输入的边际贡献来筛选真正需要视觉推理的指令样本,显著提升多模态指令微调的效率与效果。
Details
Motivation: Existing multimodal instruction datasets contain many visually redundant samples (solvable from text alone) and samples with multimodally misaligned supervision, which hurt model training.
Method: Proposes the VisNec (Visual Necessity Score) framework, which assesses each sample's visual necessity by comparing predictive loss with and without visual input; combined with semantic clustering, it selects high-necessity samples within each cluster to preserve task diversity.
Result: On LLaVA-665K, only 15% of the data, selected by VisNec, reaches 100.2% of full-data performance; on Vision-Flan-186K, the selection uses less data and surpasses full-data training by 15.8%.
Conclusion: Quantifying and exploiting visual necessity is an effective route to efficient, robust multimodal instruction tuning.
Abstract: The effectiveness of multimodal instruction tuning depends not only on dataset scale, but critically on whether training samples genuinely require visual reasoning. However, existing instruction datasets often contain a substantial portion of visually redundant samples (solvable from text alone), as well as multimodally misaligned supervision that can degrade learning. To address this, we propose VisNec (Visual Necessity Score), a principled data selection framework that measures the marginal contribution of visual input during instruction tuning. By comparing predictive loss with and without visual context, VisNec identifies whether a training instance is vision-critical, redundant, or misaligned. To preserve task diversity, we combine VisNec with semantic clustering and select high-necessity samples within each cluster. Across 10 downstream benchmarks, training on only 15% of the LLaVA-665K dataset selected by VisNec achieves 100.2% of full-data performance. On the smaller Vision-Flan-186K dataset, our selection not only further reduces data size but also surpasses full-data training by 15.8%. These results demonstrate that measuring and leveraging visual necessity provides an effective solution for both efficient and robust multimodal instruction tuning. Code and selected subsets will be released upon acceptance.
[313] CoSMo3D: Open-World Promptable 3D Semantic Part Segmentation through LLM-Guided Canonical Spatial Modeling
Li Jin,Weikai Chen,Yujie Wang,Yingda Yin,Zeyu Hu,Runze Zhang,Keyang Luo,Shengju Qian,Xin Wang,Xueying Qin
Main category: cs.CV
TL;DR: 本文提出了一种名为\methodName{}的新方法,通过从数据中学习潜在的规范参考系,实现开放世界可提示的3D语义分割,显著提升了部分语义的稳定性和可迁移性。
Details
Motivation: Existing open-world promptable 3D semantic segmentation methods infer semantics in the input sensor coordinate frame and are not robust, whereas humans understand functional parts by mentally rotating objects into a canonical space; this paper aims to close that gap.
Method: Proposes CoSMo3D, which builds a unified canonical dataset (through LLM-guided intra- and cross-category alignment) and designs a dual-branch architecture (with canonical map anchoring and canonical box calibration) to realize canonicality inside the model, collapsing pose variation and symmetry into a stable canonical embedding.
Result: Experiments show that CoSMo3D reaches a new SOTA on open-world promptable 3D segmentation.
Conclusion: Moving semantic reasoning from input pose space to a canonical embedding space markedly improves the stability and transferability of part semantics, offering a paradigm for 3D understanding that is closer to human cognition.
Abstract: Open-world promptable 3D semantic segmentation remains brittle as semantics are inferred in the input sensor coordinates. Humans, in contrast, interpret parts via functional roles in a canonical space -- wings extend laterally, handles protrude to the side, and legs support from below. Psychophysical evidence shows that we mentally rotate objects into canonical frames to reveal these roles. To fill this gap, we propose CoSMo3D, which attains canonical space perception by inducing a latent canonical reference frame learned directly from data. By construction, we create a unified canonical dataset through LLM-guided intra- and cross-category alignment, exposing canonical spatial regularities across 200 categories. By induction, we realize canonicality inside the model through a dual-branch architecture with canonical map anchoring and canonical box calibration, collapsing pose variation and symmetry into a stable canonical embedding. This shift from input pose space to canonical embedding yields far more stable and transferable part semantics. Experimental results show that CoSMo3D establishes a new state of the art in open-world promptable 3D segmentation.
[314] Monocular 3D Object Position Estimation with VLMs for Human-Robot Interaction
Ari Wahl,Dorian Gawlinski,David Przewozny,Paul Chojecki,Felix Bießmann,Sebastian Bosse
Main category: cs.CV
TL;DR: 本文提出了一种针对单目RGB图像、自然语言指令和机器人状态,实现3D物体位置估计的微调视觉-语言模型(VLM),在自建数据集上达到13mm中位MAE,较基线提升5倍。
Details
Motivation: Pre-trained vision-language models (VLMs) have rich world knowledge and 2D detection ability but are not adapted to 3D coordinate detection, which limits their support for intuitive human-robot interaction in embodied AI.
Method: Builds a heterogeneous dataset of more than 100,000 images; fine-tunes a general-purpose VLM with QLoRA and adds a custom regression head; introduces conditional routing so that the model keeps its general visual query ability while gaining 3D position estimation.
Result: On the test set, the median absolute error (MAE) of 3D position prediction is 13 mm, a five-fold improvement over the baseline without fine-tuning; about 25% of predictions are accurate enough for the robot to manipulate objects directly.
Conclusion: A fine-tuned VLM can effectively support 3D localization from monocular vision and language instructions, offering a feasible end-to-end perception, understanding, and localization paradigm for embodied AI.
Abstract: Pre-trained general-purpose Vision-Language Models (VLMs) hold the potential to enhance intuitive human-machine interactions due to their rich world knowledge and 2D object detection capabilities. However, VLMs for 3D coordinate detection tasks are rare. In this work, we investigate interactive abilities of VLMs by returning 3D object positions given a monocular RGB image from a wrist-mounted camera, natural language input, and robot states. We collected and curated a heterogeneous dataset of more than 100,000 images and finetuned a VLM using QLoRA with a custom regression head. By implementing conditional routing, our model maintains its ability to process general visual queries while adding specialized 3D position estimation capabilities. Our results demonstrate robust predictive performance with a median MAE of 13 mm on the test set and a five-fold improvement over a simpler baseline without finetuning. In about 25% of the cases, predictions are within a range considered acceptable for the robot to interact with objects.
[315] Towards Policy-Adaptive Image Guardrail: Benchmark and Method
Caiyong Piao,Zhiyuan Yan,Haoming Xu,Yunzhen Zhao,Kaiqing Lin,Feiyang Xu,Shuigeng Zhou
Main category: cs.CV
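TL;DR: This paper proposes SafeGuard-VL, which uses reinforcement learning with verifiable rewards to improve the cross-policy generalization of vision-language models under changing safety policies, and builds the SafeEditBench benchmark to evaluate adaptation to unseen safety policies.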
Details
Motivation: Existing guardrail methods based on vision-language models (VLMs) overfit to a fixed safety policy, generalize poorly to new policies, and can even lose basic instruction-following ability and general knowledge, so their policy adaptability and robustness need to improve.
Method: (1) SafeEditBench: image-editing models generate policy-aligned safe-unsafe image pairs, which human annotators label under five different policies. (2) SafeGuard-VL: reinforcement learning with verifiable rewards (RLVR) replaces supervised fine-tuning (SFT) under a fixed policy with policy-grounded rewards, strengthening adaptation to evolving policies.
Result: SafeGuard-VL clearly outperforms baselines across multiple safety policies and does better on cross-policy generalization, instruction following, and retention of general knowledge; SafeEditBench exposes severe limitations of existing VLMs in policy transfer.
Conclusion: Reinforcement learning with policy-grounded rewards is an effective way to build dynamic, robust visual safety guardrails, and SafeEditBench offers a new standard for evaluating the safety generalization of VLMs.
Abstract: Accurate rejection of sensitive or harmful visual content, i.e., harmful image guardrail, is critical in many application scenarios. This task must continuously adapt to the evolving safety policies and content across various domains and over time. However, traditional classifiers, confined to fixed categories, require frequent retraining when new policies are introduced. Vision-language models (VLMs) offer a more adaptable and generalizable foundation for dynamic safety guardrails. Despite this potential, existing VLM-based safeguarding methods are typically trained and evaluated under only a fixed safety policy. We find that these models are heavily overfitted to the seen policy, fail to generalize to unseen policies, and even lose the basic instruction-following ability and general knowledge. To address this issue, in this paper we make two key contributions. First, we benchmark the cross-policy generalization performance of existing VLMs with SafeEditBench, a new evaluation suite. SafeEditBench leverages image-editing models to convert unsafe images into safe counterparts, producing policy-aligned datasets where each safe-unsafe image pair remains visually similar except for localized regions violating specific safety rules. Human annotators then provide accurate safe/unsafe labels under five distinct policies, enabling fine-grained assessment of policy-aware generalization. Second, we introduce SafeGuard-VL, a reinforcement learning-based method with verifiable rewards (RLVR) for robust unsafe-image guardrails. Instead of relying solely on supervised fine-tuning (SFT) under fixed policies, SafeGuard-VL explicitly optimizes the model with policy-grounded rewards, promoting verifiable adaptation across evolving policies. Extensive experiments verify the effectiveness of our method for unsafe image guardrails across various policies.
[316] AgilePruner: An Empirical Study of Attention and Diversity for Adaptive Visual Token Pruning in Large Vision-Language Models
Changwoo Baek,Jouwon Song,Sohyeon Kim,Kyeongbo Kong
Main category: cs.CV
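TL;DR: This paper analyzes visual token pruning methods using effective rank (erank) and attention entropy. It finds that diversity-based pruning is prone to hallucination, that attention-based pruning works better on simple images while diversity-based pruning works better on complex ones, and it proposes an adaptive pruning mechanism that improves performance and suppresses hallucination.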
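The two diagnostics named in the summary above can be sketched as follows. The sketch assumes the common definition of effective rank as the exponential of the Shannon entropy of the normalised singular values, and treats attention entropy as the Shannon entropy of attention scores normalised to sum to one; the digest does not give the paper's exact formulas, and both function names are hypothetical.

```python
import numpy as np

def effective_rank(features, eps=1e-12):
    """exp(Shannon entropy) of the normalised singular values of a (tokens, dim) matrix.

    Assumed definition; a value near the number of tokens means diverse features,
    a value near 1 means the tokens are almost collinear.
    """
    s = np.linalg.svd(np.asarray(features, dtype=float), compute_uv=False)
    p = s / (s.sum() + eps)
    p = p[p > eps]                      # drop numerically zero singular values
    return float(np.exp(-(p * np.log(p)).sum()))

def attention_entropy(scores, eps=1e-12):
    """Shannon entropy (in nats) of attention scores normalised to sum to 1."""
    a = np.asarray(scores, dtype=float)
    a = a / (a.sum() + eps)
    a = a[a > eps]
    return float(-(a * np.log(a)).sum())

# Four orthogonal token features are maximally diverse; four identical ones are not.
print(effective_rank(np.eye(4)), effective_rank(np.ones((4, 4))))
# Uniform attention has maximal entropy; attention on a single token has none.
print(attention_entropy([0.25, 0.25, 0.25, 0.25]), attention_entropy([1.0, 0.0, 0.0, 0.0]))
```

Under these assumptions, low attention entropy would correspond to the "concentrated visual evidence" case in which the abstract reports attention-based pruning to be more effective.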
Details
Motivation: Existing visual token pruning methods for LVLMs (attention-based or diversity-based) lack in-depth analysis of their characteristics and limitations, in particular of how preserved feature diversity relates to hallucination.
Method: Uses effective rank (erank) to quantify feature diversity and the entropy of attention scores to analyze processing mechanisms; compares the two families of pruning methods on the CHAIR dataset; designs an image-aware adaptive hybrid pruning mechanism based on the findings.
Result: Diversity-based pruning preserves less diversity than intended and is associated with higher hallucination rates; attention-based pruning suits simple images and diversity-based pruning suits complex images; the proposed adaptive pruning performs well on standard benchmarks and on hallucination evaluations.
Conclusion: Visual token pruning must account for both image complexity and diversity fidelity; adaptive, image-aware pruning is an effective route to more efficient and reliable LVLMs.
Abstract: Large Vision-Language Models (LVLMs) have adopted visual token pruning strategies to mitigate substantial computational overhead incurred by extensive visual token sequences. While prior works primarily focus on either attention-based or diversity-based pruning methods, in-depth analysis of these approaches' characteristics and limitations remains largely unexplored. In this work, we conduct thorough empirical analysis using effective rank (erank) as a measure of feature diversity and attention score entropy to investigate visual token processing mechanisms and analyze the strengths and weaknesses of each approach. Our analysis reveals two insights: (1) Our erank-based quantitative analysis shows that many diversity-oriented pruning methods preserve substantially less feature diversity than intended; moreover, analysis using the CHAIR dataset reveals that the diversity they do retain is closely tied to increased hallucination frequency compared to attention-based pruning. (2) We further observe that attention-based approaches are more effective on simple images where visual evidence is concentrated, while diversity-based methods better handle complex images with distributed features. Building on these empirical insights, we show that incorporating image-aware adjustments into existing hybrid pruning strategies consistently improves their performance. We also provide a minimal instantiation of our empirical findings through a simple adaptive pruning mechanism, which achieves strong and reliable performance across standard benchmarks as well as hallucination-specific evaluations. Our project page is available at https://cvsp-lab.github.io/AgilePruner.
[317] The MAMA-MIA Challenge: Advancing Generalizability and Fairness in Breast MRI Tumor Segmentation and Treatment Response Prediction
Lidia Garrucho,Smriti Joshi,Kaisar Kushibar,Richard Osuala,Maciej Bobowicz,Xavier Bargalló,Paulius Jaruševičius,Kai Geissler,Raphael Schäfer,Muhammad Alberb,Tony Xu,Anne Martel,Daniel Sleiman,Navchetan Awasthi,Hadeel Awwad,Joan C. Vilanova,Robert Martí,Daan Schouten,Jeong Hoon Lee,Mirabela Rusu,Eleonora Poeta,Luisa Vargas,Eliana Pastor,Maria A. Zuluaga,Jessica Kächele,Dimitrios Bounias,Alexandra Ertl,Katarzyna Gwoździewicz,Maria-Laura Cosaka,Pasant M. Abo-Elhoda,Sara W. Tantawy,Shorouq S. Sakrana,Norhan O. Shawky-Abdelfatah,Amr Muhammad Abdo-Salem,Androniki Kozana,Eugen Divjak,Gordana Ivanac,Katerina Nikiforaki,Michail E. Klontzas,Rosa García-Dosdá,Meltem Gulsun-Akpinar,Oğuz Lafcı,Carlos Martín-Isla,Oliver Díaz,Laura Igual,Karim Lekadir
Main category: cs.CV
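TL;DR: The MAMA-MIA Challenge establishes a multi-center, cross-continental breast MRI benchmark that jointly evaluates tumor segmentation and pCR prediction and emphasizes fairness and generalization across subgroups.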
Details
Motivation: Existing AI models for breast MRI are mostly developed on single-center data, generalize poorly, and overlook performance differences across demographic subgroups (e.g., age, menopausal status, breast density).
Method: Organizes the MAMA-MIA Challenge with a training set of 1,506 patients from multiple US institutions and an external test set of 574 patients from three independent European centers; a unified scoring framework combines primary-task performance (tumor segmentation and pCR prediction) with subgroup consistency (age, menopausal status, breast density).
Result: Twenty-six international teams participated; performance varied widely under external testing, and overall accuracy traded off against subgroup fairness, confirming how hard cross-center and cross-continental generalization is.
Conclusion: The challenge provides standardized data, evaluation protocols, and public resources to advance robust and fair AI imaging systems for breast cancer.
Abstract: Breast cancer is the most frequently diagnosed malignancy among women worldwide and a leading cause of cancer-related mortality. Dynamic contrast-enhanced magnetic resonance imaging plays a central role in tumor characterization and treatment monitoring, particularly in patients receiving neoadjuvant chemotherapy. However, existing artificial intelligence models for breast magnetic resonance imaging are often developed using single-center data and evaluated using aggregate performance metrics, limiting their generalizability and obscuring potential performance disparities across demographic subgroups. The MAMA-MIA Challenge was designed to address these limitations by introducing a large-scale benchmark that jointly evaluates primary tumor segmentation and prediction of pathologic complete response using pre-treatment magnetic resonance imaging only. The training cohort comprised 1,506 patients from multiple institutions in the United States, while evaluation was conducted on an external test set of 574 patients from three independent European centers to assess cross-continental and cross-institutional generalization. A unified scoring framework combined predictive performance with subgroup consistency across age, menopausal status, and breast density. Twenty-six international teams participated in the final evaluation phase. Results demonstrate substantial performance variability under external testing and reveal trade-offs between overall accuracy and subgroup fairness. The challenge provides standardized datasets, evaluation protocols, and public resources to promote the development of robust and equitable artificial intelligence systems for breast cancer imaging.
[318] Cross-Modal Guidance for Fast Diffusion-Based Computed Tomography
Timofey Efimov,Singanallur Venkatakrishnan,Maliha Hossain,Haley Duba-Sullivan,Amirkoushyar Ziabari
Main category: cs.CV
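TL;DR: This paper proposes a method that incorporates an auxiliary imaging modality (such as X-ray CT) to improve sparse-view neutron CT reconstruction without retraining the diffusion model, and analyzes the effect of an imperfect auxiliary modality.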
Details
Motivation: Neutron CT data are costly to acquire and sparsely sampled, so high-quality reconstruction is hard even with diffusion models; bringing in an easily available complementary modality (such as X-ray CT) usually requires large-scale retraining, which is impractical.
Method: Designs a cross-modal guidance mechanism that injects information from the auxiliary modality (X-ray CT) into diffusion-based neutron CT reconstruction without modifying or retraining the existing diffusion prior.
Result: On sparse-view neutron CT, using X-ray CT of the same samples as side information substantially improves reconstruction quality; the study also examines how imperfections in the auxiliary modality affect the guidance.
Conclusion: Cross-modal guidance is possible without retraining the diffusion model, which gives a practical and flexible way to accelerate imaging with costly modalities.
Abstract: Diffusion models have emerged as powerful priors for solving inverse problems in computed tomography (CT). In certain applications, such as neutron CT, it can be expensive to collect large amounts of measurements even for a single scan, leading to sparse data sets from which it is challenging to obtain high quality reconstructions even with diffusion models. One strategy to mitigate this challenge is to leverage a complementary, easily available imaging modality; however, such approaches typically require retraining the diffusion model with large datasets. In this work, we propose incorporating an additional modality without retraining the diffusion prior, enabling accelerated imaging of costly modalities. We further examine the impact of imperfect side modalities on cross-modal guidance. Our method is evaluated on sparse-view neutron computed tomography, where reconstruction quality is substantially improved by incorporating X-ray computed tomography of the same samples.
[319] FoSS: Modeling Long Range Dependencies and Multimodal Uncertainty in Trajectory Prediction via Fourier State Space Integration
Yizhou Huang,Gengze Jiang,Yihua Cheng,Kezhi Wang
Main category: cs.CV
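TL;DR: FoSS is a dual-branch trajectory prediction framework that combines frequency-domain analysis (Fourier transform with helix reordering) with linear-time sequence modeling (selective state-space models), substantially reducing computation and parameter count while maintaining high accuracy.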
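The first step of the frequency-domain branch, splitting a trajectory into amplitude and phase with a discrete Fourier transform, can be sketched as below. This is only an illustration of the decomposition itself: the helix reordering and the state-space modules are not reproduced, the function names are hypothetical, and the toy trajectory is made up.

```python
import numpy as np

def to_amplitude_phase(trajectory):
    """DFT along the time axis of a (T, 2) trajectory.

    Returns per-frequency amplitude and phase, each of shape (T, 2).
    """
    spectrum = np.fft.fft(np.asarray(trajectory, dtype=float), axis=0)
    return np.abs(spectrum), np.angle(spectrum)

def from_amplitude_phase(amplitude, phase):
    """Inverse of to_amplitude_phase; the imaginary part is numerical noise for real input."""
    return np.fft.ifft(amplitude * np.exp(1j * phase), axis=0).real

# Toy trajectory: steady motion in x, one oscillation in y over 8 time steps.
t = np.arange(8)
traj = np.stack([t * 1.0, np.sin(2 * np.pi * t / 8)], axis=1)

amp, phase = to_amplitude_phase(traj)
recon = from_amplitude_phase(amp, phase)
print(np.allclose(recon, traj))  # amplitude and phase together lose no information
```

Because the pair (amplitude, phase) is an invertible re-encoding of the trajectory, a model can process the two components separately, as the abstract describes, and still recover positions in the time domain.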
Details
Motivation: Existing trajectory prediction methods struggle to balance modeling power and computational efficiency: attention has O(N²) complexity, and RNNs struggle to capture long-range dependencies and local dynamics.
Method: Proposes the dual-branch FoSS framework. The frequency-domain branch applies a DFT to trajectories, then uses helix reordering and two selective SSMs (Coarse2Fine-SSM and SpecEvolve-SSM) to process amplitude and phase. The time-domain branch uses a dynamic selective SSM to emulate self-attention. Cross-attention fuses the two representations, and learnable queries with a weighted fusion head produce multiple candidate trajectories and an expression of uncertainty.
Result: State-of-the-art accuracy on Argoverse 1 and 2 with 22.5% less computation and over 40% fewer parameters; ablations confirm that each module is necessary.
Conclusion: Jointly designing frequency-domain modeling and linear-time sequence modeling improves the accuracy-efficiency trade-off in trajectory prediction, and FoSS offers a new approach for efficient, safe autonomous driving.
Abstract: Accurate trajectory prediction is vital for safe autonomous driving, yet existing approaches struggle to balance modeling power and computational efficiency. Attention-based architectures incur quadratic complexity with increasing agents, while recurrent models struggle to capture long-range dependencies and fine-grained local dynamics. Building upon this, we present FoSS, a dual-branch framework that unifies frequency-domain reasoning with linear-time sequence modeling. The frequency-domain branch performs a discrete Fourier transform to decompose trajectories into amplitude components encoding global intent and phase components capturing local variations, followed by a progressive helix reordering module that preserves spectral order; two selective state-space submodules, Coarse2Fine-SSM and SpecEvolve-SSM, refine spectral features with O(N) complexity. In parallel, a time-domain dynamic selective SSM reconstructs self-attention behavior in linear time to retain long-range temporal context. A cross-attention layer fuses temporal and spectral representations, while learnable queries generate multiple candidate trajectories, and a weighted fusion head expresses motion uncertainty. Experiments on Argoverse 1 and Argoverse 2 benchmarks demonstrate that FoSS achieves state-of-the-art accuracy while reducing computation by 22.5% and parameters by over 40%. Comprehensive ablations confirm the necessity of each component.
[320] Multi-Level Bidirectional Decoder Interaction for Uncertainty-Aware Breast Ultrasound Analysis
Abdullah Al Shafi,Md Kawsar Mahmud Khan Zunayed,Safin Ahmmed,Sk Imran Hossain,Engelbert Mephu Nguifo
Main category: cs.CV
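TL;DR: This paper proposes a multi-task learning framework for breast ultrasound images that improves both lesion segmentation and tissue classification through multi-level decoder interaction and uncertainty-aware adaptive coordination.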
Details
Motivation: Conventional multi-task learning methods suffer from task interference and fixed coordination strategies, and cannot adapt to differences in prediction difficulty between samples.
Method: Designs multi-level Task Interaction Modules (bidirectional segmentation-classification communication at every decoder level) and an Uncertainty-Proxy Attention mechanism (adaptive weighting of base versus enhanced features based on feature activation variance), and adds multi-scale context fusion to capture morphological cues from lesions of different sizes.
Result: Competitive performance on multiple public breast ultrasound datasets, for example 74.5% lesion IoU and 90.6% classification accuracy on BUSI; ablations show that multi-level decoder interaction clearly outperforms conventional approaches such as encoder-level sharing.
Conclusion: Bidirectional task interaction inside a multi-level decoder, combined with uncertainty-driven dynamic coordination, is an effective way to improve multi-task learning for breast ultrasound.
Abstract: Breast ultrasound interpretation requires simultaneous lesion segmentation and tissue classification. However, conventional multi-task learning approaches suffer from task interference and rigid coordination strategies that fail to adapt to instance-specific prediction difficulty. We propose a multi-task framework addressing these limitations through multi-level decoder interaction and uncertainty-aware adaptive coordination. Task Interaction Modules operate at all decoder levels, establishing bidirectional segmentation-classification communication during spatial reconstruction through attention-weighted pooling and multiplicative modulation. Unlike prior single-level or encoder-only approaches, this multi-level design captures scale-specific task synergies across semantic-to-spatial scales, producing complementary task interaction streams. Uncertainty-Proxy Attention adaptively weights base versus enhanced features at each level using feature activation variance, enabling per-level and per-sample task balancing without heuristic tuning. To support instance-adaptive prediction, multi-scale context fusion captures morphological cues across varying lesion sizes. Evaluation on multiple publicly available breast ultrasound datasets demonstrates competitive performance, including 74.5% lesion IoU and 90.6% classification accuracy on the BUSI dataset. Ablation studies confirm that multi-level task interaction provides significant performance gains, validating that decoder-level bidirectional communication is more effective than conventional encoder-only parameter sharing. The code is available at: https://github.com/C-loud-Nine/Uncertainty-Aware-Multi-Level-Decoder-Interaction.
[321] When Does RL Help Medical VLMs? Disentangling Vision, SFT, and RL Gains
Ahmadreza Jeddi,Kimia Shaban,Negin Baghbanzadeh,Natasha Sharan,Abhishek Moturu,Elham Dolatabadi,Babak Taati
Main category: cs.CV
TL;DR: 本文通过控制实验研究了强化学习(RL)在医学视觉语言模型(VLMs)后训练中的作用,发现RL主要优化已有能力的输出分布(提升Acc@1和采样效率),而非从零构建推理能力;其效果依赖于监督微调(SFT)预先提供的基础支持;据此提出边界感知的RL训练策略,并在多个医学VQA基准上验证了有效性。
Details
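Motivation: RL is widely used to post-train medical VLMs, but it is unclear whether RL genuinely improves medical visual reasoning or only reinforces behaviors already induced by supervised fine-tuning (SFT); the separate contributions of RL, SFT, and the vision component need to be disentangled.
Method: Runs controlled experiments on the multi-modality MedMNIST testbed along three axes: visual perception (VLM vision towers versus vision-only baselines), reasoning support and sampling efficiency (Accuracy@1 versus Pass@K), and cross-modality transfer of RL gains. It then proposes a boundary-aware RL recipe and applies RL post-training to an OctoMed-initialized model on a small, balanced subset of PMC multiple-choice VQA.
Result: RL is most effective when the model already has non-trivial support (high Pass@K); it mainly sharpens the output distribution, improving Acc@1 and sampling efficiency, while SFT expands support and lays the groundwork for RL. The proposed recipe achieves strong average performance across six medical VQA benchmarks.
Conclusion: RL does not build medical visual reasoning on its own; it refines capabilities that SFT has already established. The two are complementary: SFT extends the capability boundary and RL improves performance within it, which argues for a boundary-aware, coordinated training approach.
Abstract: Reinforcement learning (RL) is increasingly used to post-train medical Vision-Language Models (VLMs), yet it remains unclear whether RL improves medical visual reasoning or mainly sharpens behaviors already induced by supervised fine-tuning (SFT). We present a controlled study that disentangles these effects along three axes: vision, SFT, and RL. Using MedMNIST as a multi-modality testbed, we probe visual perception by benchmarking VLM vision towers against vision-only baselines, quantify reasoning support and sampling efficiency via Accuracy@1 versus Pass@K, and evaluate when RL closes the support gap and how gains transfer across modalities. We find that RL is most effective when the model already has non-trivial support (high Pass@K): it primarily sharpens the output distribution, improving Acc@1 and sampling efficiency, while SFT expands support and makes RL effective. Based on these findings, we propose a boundary-aware recipe and instantiate it by RL post-training an OctoMed-initialized model on a small, balanced subset of PMC multiple-choice VQA, achieving strong average performance across six medical VQA benchmarks.
[322] AG-VAS: Anchor-Guided Zero-Shot Visual Anomaly Segmentation with Large Multimodal Models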
Zhen Qu,Xian Tao,Xiaoyi Bao,Dingrong Wang,ShiChen Qu,Zhengtao Zhang,Xingang Wang
Main category: cs.CV
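TL;DR: This paper proposes AG-VAS, a framework that introduces learnable semantic anchor tokens ([SEG], [NOR], [ANO]) and a Semantic-Pixel Alignment Module (SPAM) to improve large multimodal models on zero-shot visual anomaly segmentation.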
Details
Motivation: Existing anomaly segmentation methods based on large multimodal models (LMMs) are limited because anomaly concepts are abstract and lack stable visual prototypes, and because high-level semantics are only weakly aligned with pixel-level features.
Method: Proposes the AG-VAS framework: (1) three learnable semantic anchor tokens; (2) a Semantic-Pixel Alignment Module (SPAM) that strengthens cross-modal alignment; (3) an Anchor-Guided Mask Decoder (AGMD) for precise localization; (4) the Anomaly-Instruct20K instruction dataset.
Result: AG-VAS achieves consistent state-of-the-art performance in the zero-shot setting on six industrial and medical benchmarks.
Conclusion: Through a unified anchor-guided paradigm and cross-modal alignment, AG-VAS eases the tension between semantic abstraction and spatial localization accuracy in zero-shot anomaly segmentation with LMMs, and suggests a direction for general visual anomaly understanding.
Abstract: Large multimodal models (LMMs) exhibit strong task generalization capabilities, offering new opportunities for zero-shot visual anomaly segmentation (ZSAS). However, existing LMM-based segmentation approaches still face fundamental limitations: anomaly concepts are inherently abstract and context-dependent, lacking stable visual prototypes, and the weak alignment between high-level semantic embeddings and pixel-level spatial features hinders precise anomaly localization. To address these challenges, we present AG-VAS (Anchor-Guided Visual Anomaly Segmentation), a new framework that expands the LMM vocabulary with three learnable semantic anchor tokens, [SEG], [NOR], and [ANO], establishing a unified anchor-guided segmentation paradigm. Specifically, [SEG] serves as an absolute semantic anchor that translates abstract anomaly semantics into explicit, spatially grounded visual entities (e.g., holes or scratches), while [NOR] and [ANO] act as relative anchors that model the contextual contrast between normal and abnormal patterns across categories. To further enhance cross-modal alignment, we introduce a Semantic-Pixel Alignment Module (SPAM) that aligns language-level semantic embeddings with high-resolution visual features, along with an Anchor-Guided Mask Decoder (AGMD) that performs anchor-conditioned mask prediction for precise anomaly localization. In addition, we curate Anomaly-Instruct20K, a large-scale instruction dataset that organizes anomaly knowledge into structured descriptions of appearance, shape, and spatial attributes, facilitating effective learning and integration of the proposed semantic anchors. Extensive experiments on six industrial and medical benchmarks demonstrate that AG-VAS achieves consistent state-of-the-art performance in the zero-shot setting.
[323] Open-Vocabulary vs Supervised Learning Methods for Post-Disaster Visual Scene Understanding
Anna Michailidou,Georgios Angelidis,Vasileios Argyriou,Panagiotis Sarigiannidis,Georgios Th. Papadopoulos
Main category: cs.CV
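TL;DR: This paper compares supervised learning with open-vocabulary vision models for post-disaster scene understanding and finds that supervised learning remains the most reliable approach when the label space is fixed and annotations are available, especially for small-object detection and fine boundary delineation in cluttered scenes.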
Details
Motivation: Automated interpretation of post-disaster aerial imagery is hampered by clutter, high visual variability, and cross-event domain shift, and supervised methods depend on costly task-specific annotations with limited coverage. Open-vocabulary and foundation vision models reduce the dependence on fixed label sets and extensive annotation through large-scale pretraining and vision-language representations, which makes them attractive for post-disaster settings where visual concepts are ambiguous and data are scarce.
Method: Compares supervised learning and open-vocabulary vision models on post-disaster semantic segmentation and object detection across several datasets, including FloodNet+, RescueNet, DFire, and LADD, and analyzes performance trends, failure modes, and practical trade-offs.
Result: Across all benchmarks, supervised learning is the most stable and reliable when the label space is fixed and annotations are available, and it clearly outperforms open-vocabulary models on small objects and on fine boundary delineation in cluttered scenes.
Conclusion: Open-vocabulary models show potential for generalization, but supervised learning is still the more reliable choice for practical disaster response today; future work should combine the strengths of both to improve robustness and adaptability.
Abstract: Aerial imagery is critical for large-scale post-disaster damage assessment. Automated interpretation remains challenging due to clutter, visual variability, and strong cross-event domain shift, while supervised approaches still rely on costly, task-specific annotations with limited coverage across disaster types and regions. Recent open-vocabulary and foundation vision models offer an appealing alternative, by reducing dependence on fixed label sets and extensive task-specific annotations. Instead, they leverage large-scale pretraining and vision-language representations. These properties are particularly relevant for post-disaster domains, where visual concepts are ambiguous and data availability is constrained. In this work, we present a comparative evaluation of supervised learning and open-vocabulary vision models for post-disaster scene understanding, focusing on semantic segmentation and object detection across multiple datasets, including FloodNet+, RescueNet, DFire, and LADD. We examine performance trends, failure modes, and practical trade-offs between different learning paradigms, providing insight into their applicability for real-world disaster response. The most notable finding across all evaluated benchmarks is that supervised training remains the most reliable approach when the label space is fixed and annotations are available, especially for small objects and fine boundary delineation in cluttered scenes.
[324] You Only Need One Stage: Novel-View Synthesis From A Single Blind Face Image
Taoyue Wang,Xiang Zhang,Xiaotian Li,Huiyuan Yang,Lijun Yin
Main category: cs.CV
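TL;DR: This paper proposes NVB-Face, a one-stage method that generates consistent novel-view face images directly from a single low-quality face image. It skips the conventional two-stage pipeline (restore, then reconstruct) by extracting single-view features, converting them into 3D-aware multi-view latent representations, and synthesizing the output with a diffusion model.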
Details
Motivation: Existing novel-view synthesis methods rely on high-resolution input; degraded images must first be restored and then synthesized, and poor restoration leads to inconsistent and distorted results.
Method: Proposes NVB-Face, which extracts single-view features directly from the blind face image, uses a feature manipulator to map them to 3D-aware multi-view latent representations, and synthesizes novel-view images end to end with a diffusion model.
Result: Significantly outperforms conventional two-stage methods in both consistency and fidelity.
Conclusion: A one-stage end-to-end framework avoids the propagation of restoration errors and improves the quality and consistency of novel-view face synthesis.
Abstract: We propose a novel one-stage method, NVB-Face, for generating consistent Novel-View images directly from a single Blind Face image. Existing approaches to novel-view synthesis for objects or faces typically require a high-resolution RGB image as input. When dealing with degraded images, the conventional pipeline follows a two-stage process: first restoring the image to high resolution, then synthesizing novel views from the restored result. However, this approach is highly dependent on the quality of the restored image, often leading to inaccuracies and inconsistencies in the final output. To address this limitation, we extract single-view features directly from the blind face image and introduce a feature manipulator that transforms these features into 3D-aware, multi-view latent representations. Leveraging the powerful generative capacity of a diffusion model, our framework synthesizes high-quality, consistent novel-view face images. Experimental results show that our method significantly outperforms traditional two-stage approaches in both consistency and fidelity.
[325] Perspective-Equivariant Fine-tuning for Multispectral Demosaicing without Ground Truth
Andrew Wang,Mike Davies
Main category: cs.CV
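TL;DR: This paper proposes PEFD, a multispectral demosaicing method that needs no ground truth (GT). It exploits the projective geometry of camera imaging and pretrained foundation models, and approaches supervised performance on endoscopic and autonomous-driving datasets.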
Details
Motivation: Classical methods produce blurry results, and supervised learning depends on ground truth from slow, costly line-scanning systems, so an efficient multispectral demosaicing method that needs no ground truth is required.
Method: Proposes the Perspective-Equivariant Fine-tuning for Demosaicing (PEFD) framework, which (a) exploits the projective geometry of camera imaging to use a richer group structure and recover more null-space information, and (b) fine-tunes, without ground truth, foundation models pretrained on 1-3 channel images.
Result: On intraoperative and automotive multispectral datasets, PEFD recovers fine structures such as blood vessels and preserves spectral fidelity, substantially outperforming recent methods and approaching supervised performance.
Conclusion: By combining geometric priors with transfer learning, PEFD achieves high-quality multispectral demosaicing without ground truth and makes real-time applications more feasible.
Abstract: Multispectral demosaicing is crucial to reconstruct full-resolution spectral images from snapshot mosaiced measurements, enabling real-time imaging from neurosurgery to autonomous driving. Classical methods are blurry, while supervised learning requires costly ground truth (GT) obtained from slow line-scanning systems. We propose Perspective-Equivariant Fine-tuning for Demosaicing (PEFD), a framework that learns multispectral demosaicing from mosaiced measurements alone. PEFD a) exploits the projective geometry of camera-based imaging systems to leverage a richer group structure than previous demosaicing methods to recover more null-space information, and b) learns efficiently without GT by adapting pretrained foundation models designed for 1-3 channel imaging. On intraoperative and automotive datasets, PEFD recovers fine details such as blood vessels and preserves spectral fidelity, substantially outperforming recent approaches, nearing supervised performance.
[326] MixerCSeg: An Efficient Mixer Architecture for Crack Segmentation via Decoupled Mamba Attention
Zilong Zhao,Zhengming Ding,Pei Niu,Wenhao Sun,Feng Guo
Main category: cs.CV
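TL;DR: This paper proposes MixerCSeg, a mixer architecture that combines ideas from CNNs, Transformers, and Mamba to model local texture, global dependencies, and sequential context within a single encoder. Together with the DEGConv and SRF modules, which improve edge sensitivity and multi-scale detail, it reaches state-of-the-art performance at low computational cost.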
Details
Motivation: Existing crack segmentation encoders based on CNNs, Transformers, or Mamba each capture only part of the spatial or structural information and cannot fully model complex crack patterns.
Method: Proposes the MixerCSeg encoder with a CNN-like local pathway, a Transformer-style global pathway, and a Mamba-inspired sequential pathway. Its core TransMixer module exploits Mamba's latent attention behavior and models locality and global awareness together. The design adds a Direction-guided Edge Gated Convolution (DEGConv), a Spatial Refinement Multi-Level Fusion (SRF) module, and a spatial block processing strategy to improve structural fidelity.
Result: State-of-the-art performance on multiple crack segmentation benchmarks with only 2.05 GFLOPs and 2.54 M parameters.
Conclusion: By combining several modeling paradigms with lightweight modules, MixerCSeg improves the accuracy and structural expressiveness of pixel-level crack segmentation while keeping complexity low.
Abstract: Feature encoders play a key role in pixel-level crack segmentation by shaping the representation of fine textures and thin structures. Existing CNN-, Transformer-, and Mamba-based models each capture only part of the required spatial or structural information, leaving clear gaps in modeling complex crack patterns. To address this, we present MixerCSeg, a mixer architecture designed like a coordinated team of specialists, where CNN-like pathways focus on local textures, Transformer-style paths capture global dependencies, and Mamba-inspired flows model sequential context within a single encoder. At the core of MixerCSeg is the TransMixer, which explores Mamba's latent attention behavior while establishing dedicated pathways that naturally express both locality and global awareness. To further enhance structural fidelity, we introduce a spatial block processing strategy and a Direction-guided Edge Gated Convolution (DEGConv) that strengthens edge sensitivity under irregular crack geometries with minimal computational overhead. A Spatial Refinement Multi-Level Fusion (SRF) module is then employed to refine multi-scale details without increasing complexity. Extensive experiments on multiple crack segmentation benchmarks show that MixerCSeg achieves state-of-the-art performance with only 2.05 GFLOPs and 2.54 M parameters, demonstrating both efficiency and strong representational capability. The code is available at https://github.com/spiderforest/MixerCSeg.
[327] TIMI: Training-Free Image-to-3D Multi-Instance Generation with Spatial Fidelity
Xiao Cai,Lianli Gao,Pengpeng Zeng,Ji Zhang,Heng Tao Shen,Jingkuan Song
Main category: cs.CV
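TL;DR: This paper proposes TIMI, a training-free framework for Image-to-3D multi-instance generation. Its Instance-aware Separation Guidance (ISG) and Spatial-stabilized Geometry-adaptive Update (SGU) modules substantially improve spatial fidelity without any additional training cost.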
Details
Motivation: Pre-trained Image-to-3D models already contain meaningful spatial priors, but these are underused, which leads to instance entanglement; existing fine-tuning approaches are expensive to train and still struggle to guarantee spatial fidelity.
Method: Proposes the training-free TIMI framework with two core modules: (1) Instance-aware Separation Guidance (ISG), which promotes instance disentanglement in the early denoising stage; (2) Spatial-stabilized Geometry-adaptive Update (SGU), which preserves the geometric characteristics of instances and their relative relationships.
Result: Outperforms existing methods on both global layout and local instance detail, with no additional training and faster inference.
Conclusion: TIMI makes effective use of the spatial priors in pre-trained I23D models and achieves high-quality, efficient, training-free multi-instance 3D generation.
Abstract: Precise spatial fidelity in Image-to-3D multi-instance generation is critical for downstream real-world applications. Recent work attempts to address this by fine-tuning pre-trained Image-to-3D (I23D) models on multi-instance datasets, which incurs substantial training overhead and struggles to guarantee spatial fidelity. In fact, we observe that pre-trained I23D models already possess meaningful spatial priors, which remain underutilized as evidenced by instance entanglement issues. Motivated by this, we propose TIMI, a novel Training-free framework for Image-to-3D Multi-Instance generation that achieves high spatial fidelity. Specifically, we first introduce an Instance-aware Separation Guidance (ISG) module, which facilitates instance disentanglement during the early denoising stage. Next, to stabilize the guidance introduced by ISG, we devise a Spatial-stabilized Geometry-adaptive Update (SGU) module that promotes the preservation of the geometric characteristics of instances while maintaining their relative relationships. Extensive experiments demonstrate that our method yields better performance in terms of both global layout and distinct local instances compared to existing multi-instance methods, without requiring additional training and with faster inference speed.
[328] Continuous Exposure-Time Modeling for Realistic Atmospheric Turbulence Synthesis
Junwei Zeng,Dong Liang,Sheng-Jun Huang,Kun Zhan,Songcan Chen
Main category: cs.CV
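TL;DR: This paper proposes an exposure-time-dependent modulation transfer function (ET-MTF) that models turbulence-induced blur more realistically, and uses it to build ET-Turb, a large-scale synthetic dataset that markedly improves the real-world generalization of turbulence restoration models.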
Details
Motivation: Existing turbulence synthesis methods oversimplify the relationship between exposure time and blur (for example, by assuming fixed or binary exposure settings), which yields unrealistic synthetic data and poor generalization.
Method: Revisits MTF theory and proposes the exposure-time-dependent ET-MTF; derives from it a tilt-invariant point spread function (PSF) and combines it with a spatially varying blur-width field for physically accurate blur synthesis; builds the large-scale ET-Turb dataset of 5,083 videos on this pipeline.
Result: Models trained on ET-Turb produce more realistic restorations and generalize better on real turbulence data, clearly surpassing models trained on other synthetic datasets.
Conclusion: Continuous exposure-time modeling is key to realistic turbulence synthesis and to the generalization of downstream models, and ET-Turb provides a high-quality benchmark dataset for the field.
Abstract: Atmospheric turbulence significantly degrades long-range imaging by introducing geometric warping and exposure-time-dependent blur, which adversely affects both visual quality and the performance of high-level vision tasks. Existing methods for synthesizing turbulence effects often oversimplify the relationship between blur and exposure-time, typically assuming fixed or binary exposure settings. This leads to unrealistic synthetic data and limited generalization capability of trained models. To address this gap, we revisit the modulation transfer function (MTF) formulation and propose a novel Exposure-Time-dependent MTF (ET-MTF) that models blur as a continuous function of exposure-time. For blur synthesis, we derive a tilt-invariant point spread function (PSF) from the ET-MTF, which, when integrated with a spatially varying blur-width field, provides a comprehensive and physically accurate characterization of turbulence-induced blur. Building on this synthesis pipeline, we construct ET-Turb, a large-scale synthetic turbulence dataset that explicitly incorporates continuous exposure-time modeling across diverse optical and atmospheric conditions. The dataset comprises 5,083 videos (2,005,835 frames), partitioned into 3,988 training and 1,095 test videos. Extensive experiments demonstrate that models trained on ET-Turb produce more realistic restorations and achieve superior generalization on real-world turbulence data compared to those trained on other datasets. The dataset is publicly available at: github.com/Jun-Wei-Zeng/ET-Turb.
[329] Token Reduction via Local and Global Contexts Optimization for Efficient Video Large Language Models
Jinlong Li,Liyuan Jiang,Haonan Zhang,Nicu Sebe
Main category: cs.CV
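TL;DR: This paper proposes AOT, which uses a local-global optimal transport mechanism to compress redundant visual tokens in video large language models without training, preserving spatiotemporal information while improving computational efficiency.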
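The idea of folding pruned tokens into kept anchor tokens with optimal transport can be sketched as below. This is an illustration under stated assumptions, not the paper's algorithm: it assumes an entropic-regularised plan computed by Sinkhorn iterations, uniform marginals, a cosine-distance cost, and a fixed blending weight `alpha`. All function names and parameters are hypothetical, and the anchor selection step is not shown.

```python
import numpy as np

def sinkhorn_plan(cost, reg=0.1, n_iters=200):
    """Entropic-regularised transport plan between uniform marginals (Sinkhorn iterations)."""
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-cost / reg)             # Gibbs kernel
    u = np.ones(n)
    for _ in range(n_iters):
        v = b / (K.T @ u)               # rescale so that column sums approach b
        u = a / (K @ v)                 # rescale so that row sums equal a
    return u[:, None] * K * v[None, :]

def aggregate_into_anchors(anchors, pruned, alpha=0.5):
    """Blend each anchor with the transport-weighted mean of the pruned tokens."""
    an = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    pn = pruned / np.linalg.norm(pruned, axis=1, keepdims=True)
    cost = 1.0 - pn @ an.T                              # (n_pruned, n_anchors) cosine distance
    plan = sinkhorn_plan(cost)
    weights = plan / plan.sum(axis=0, keepdims=True)    # each column sums to 1
    return (1.0 - alpha) * anchors + alpha * (weights.T @ pruned)

# Hypothetical token features: 3 kept anchors and 5 pruned tokens of dimension 8.
rng = np.random.default_rng(0)
anchors = rng.normal(size=(3, 8))
pruned = rng.normal(size=(5, 8))
merged = aggregate_into_anchors(anchors, pruned)
print(merged.shape)  # same number of tokens as the anchors
```

The token count after merging equals the number of anchors, so the sequence passed to the language model shrinks while every pruned token still contributes to some anchor.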
Details
Motivation: Video large language models are inefficient because of redundant visual tokens; existing pruning methods address only intra-frame spatial redundancy or prune in shallow LLM layers, so they do not fully exploit joint spatiotemporal compression and tend to lose subtle but important context.
Method: Proposes a local-global optimal transport framework built on intra-frame and inter-frame token Anchors (AOT). Within each frame, attention guidance selects local- and global-aware token anchors, and optimal transport aggregates information from the pruned tokens into them. Across frames, the first frame of each temporal clip serves as the keyframe anchor, and optimal transport merges similar information from the following frames while distinct tokens are kept to represent temporal dynamics. The whole process is training-free.
Result: On multiple short- and long-video benchmarks, AOT delivers competitive performance on leading video LLMs and substantially improves computational efficiency while preserving temporal and visual fidelity.
Conclusion: AOT offers a new, efficient, training-free approach to video token compression that balances compression rate, information completeness, and temporal modeling.
Abstract: Video Large Language Models (VLLMs) demonstrate strong video understanding but suffer from inefficiency due to redundant visual tokens. Existing pruning methods primarily target intra-frame spatial redundancy or prune inside the LLM with shallow-layer overhead, yielding suboptimal spatiotemporal reduction and underutilizing long-context compressibility. They also often discard subtle yet informative context from merged or pruned tokens. In this paper, we propose a new perspective that builds token Anchors within and across frames to comprehensively aggregate the informative contexts via local-global Optimal Transport (AOT). Specifically, we first establish local- and global-aware token anchors within each frame under attention guidance, after which optimal transport aggregates the informative contexts from pruned tokens, constructing intra-frame token anchors. Then, building on the temporal frame clips, the first frame within each clip is treated as the keyframe anchor to ensemble similar information from consecutive frames through optimal transport, while keeping distinct tokens to represent temporal dynamics, leading to efficient token reduction in a training-free manner. Extensive evaluations show that our proposed AOT obtains competitive performance across various short- and long-video benchmarks on leading video LLMs, obtaining substantial computational efficiency while preserving temporal and visual fidelity. Project webpage: https://tyroneli.github.io/AOT.
[330] UETrack: A Unified and Efficient Framework for Single Object Tracking
Ben Kang,Jie Zhao,Xin Chen,Wanting Geng,Bin Zhang,Lu Zhang,Dong Wang,Huchuan Lu
Main category: cs.CV
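TL;DR: UETrack is an efficient single object tracking framework that supports RGB, depth, thermal, event-camera, and language inputs. A Token-Pooling-based Mixture-of-Experts mechanism and a target-aware adaptive distillation strategy raise multi-modal tracking accuracy while keeping inference fast.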
Details
Motivation: Most existing trackers are limited to the RGB modality, while multi-modal methods tend to be structurally complex and computationally heavy, which makes them hard to deploy on resource-constrained devices.
Method: Proposes the UETrack framework with two core components: 1) a Token-Pooling-based Mixture-of-Experts mechanism for feature aggregation and expert specialization; 2) a target-aware adaptive distillation strategy that distills selectively according to sample characteristics to reduce redundant supervision.
Result: Evaluated on 12 benchmarks and 3 hardware platforms (GPU/CPU/AGX). UETrack-B reaches 69.2% AUC on LaSOT and runs at 163/56/60 FPS on the three platforms respectively, clearly outperforming previous methods.
Conclusion: UETrack achieves a strong speed-accuracy balance on multi-modal tracking and is practical to deploy across platforms.
Abstract: With growing real-world demands, efficient tracking has received increasing attention. However, most existing methods are limited to RGB inputs and struggle in multi-modal scenarios. Moreover, current multi-modal tracking approaches typically use complex designs, making them too heavy and slow for resource-constrained deployment. To tackle these limitations, we propose UETrack, an efficient framework for single object tracking. UETrack demonstrates high practicality and versatility, efficiently handling multiple modalities including RGB, Depth, Thermal, Event, and Language, and addresses the gap in efficient multi-modal tracking. It introduces two key components: a Token-Pooling-based Mixture-of-Experts mechanism that enhances modeling capacity through feature aggregation and expert specialization, and a Target-aware Adaptive Distillation strategy that selectively performs distillation based on sample characteristics, reducing redundant supervision and improving performance. Extensive experiments on 12 benchmarks across 3 hardware platforms show that UETrack achieves a superior speed-accuracy trade-off compared to previous methods. For instance, UETrack-B achieves 69.2% AUC on LaSOT and runs at 163/56/60 FPS on GPU/CPU/AGX, demonstrating strong practicality and versatility. Code is available at https://github.com/kangben258/UETrack.
[331] UniTalking: A Unified Audio-Video Framework for Talking Portrait Generation
Hebeizi Li,Zihao Liang,Benyuan Sun,Zihao Yin,Xiao Sha,Chenliang Wang,Yi Yang
Main category: cs.CV
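TL;DR: This paper presents UniTalking, a unified end-to-end diffusion framework that generates high-fidelity speech and lip-synchronized video, models audio-video temporal correspondence with multi-modal Transformer blocks, and supports personalized voice cloning.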
Details
Motivation: State-of-the-art audio-video generation models such as Veo3 and Sora2 are closed-source, so their architectures and training paradigms are inaccessible; open, high-performing alternatives are needed.
Method: Proposes a unified diffusion framework built on multi-modal Transformer blocks, using a shared self-attention mechanism to model the fine-grained temporal correspondence between audio and video latents. It reuses the strong priors of a pre-trained video generation model to improve visual fidelity and training efficiency, and integrates a personalized voice cloning module.
Result: Outperforms existing open-source methods in lip-sync accuracy, audio naturalness, and overall perceptual quality, producing highly realistic talking portraits.
Conclusion: UniTalking stays open and accessible while approaching the audio-video generation performance of closed-source SOTA models, and adds personalized voice cloning.
Abstract: While state-of-the-art audio-video generation models like Veo3 and Sora2 demonstrate remarkable capabilities, their closed-source nature makes their architectures and training paradigms inaccessible. To bridge this gap in accessibility and performance, we introduce UniTalking, a unified, end-to-end diffusion framework for generating high-fidelity speech and lip-synchronized video. At its core, our framework employs Multi-Modal Transformer Blocks to explicitly model the fine-grained temporal correspondence between audio and video latent tokens via a shared self-attention mechanism. By leveraging powerful priors from a pre-trained video generation model, our framework ensures state-of-the-art visual fidelity while enabling efficient training. Furthermore, UniTalking incorporates a personalized voice cloning capability, allowing the generation of speech in a target style from a brief audio reference. Qualitative and quantitative results demonstrate that our method produces highly realistic talking portraits, achieving superior performance over existing open-source approaches in lip-sync accuracy, audio naturalness, and overall perceptual quality.
[332] SeaVIS: Sound-Enhanced Association for Online Audio-Visual Instance Segmentation
Yingjian Zhu,Ying Wang,Yuyang Hong,Ruohao Guo,Kun Ding,Xin Gu,Bin Fan,Shiming Xiang
Main category: cs.CV
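TL;DR: This paper presents SeaVIS, the first online framework for audio-visual instance segmentation (AVIS). Using Causal Cross Attention Fusion (CCAF) and an Audio-Guided Contrastive Learning (AGCL) strategy, it identifies, segments, and tracks sounding instances in continuous video streams in real time.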
Details
Motivation: Existing AVIS methods are mostly offline, cannot associate instances across consecutive clips, and struggle to distinguish an object's sounding and silent states, so silent objects are wrongly segmented. This makes them unsuitable for real continuous video streams.
Method: Proposes the online SeaVIS framework. A Causal Cross Attention Fusion (CCAF) module fuses current-frame visual features with the audio history under causal constraints, and Audio-Guided Contrastive Learning (AGCL) produces instance prototypes that encode both appearance and sounding activity, suppressing silent instances during association.
Result: On the AVISeg dataset, SeaVIS surpasses existing SOTA methods on multiple metrics while keeping an inference speed suitable for real-time processing.
Conclusion: SeaVIS addresses the key AVIS challenges of online processing and sounding-state modeling, improving audio-following capability and real-time performance and offering a practical option for real applications.
Abstract: Recently, an audio-visual instance segmentation (AVIS) task has been introduced, aiming to identify, segment and track individual sounding instances in videos. However, prevailing methods primarily adopt an offline paradigm that cannot associate detected instances across consecutive clips, making them unsuitable for real-world scenarios that involve continuous video streams. To address this limitation, we introduce SeaVIS, the first online framework designed for audio-visual instance segmentation. SeaVIS leverages the Causal Cross Attention Fusion (CCAF) module to enable efficient online processing, which integrates visual features from the current frame with the entire audio history under strict causal constraints. A major challenge for conventional VIS methods is that appearance-based instance association fails to distinguish between an object's sounding and silent states, resulting in the incorrect segmentation of silent objects. To tackle this, we employ an Audio-Guided Contrastive Learning (AGCL) strategy to generate instance prototypes that encode not only visual appearance but also sounding activity. In this way, instances preserved during per-frame prediction that do not emit sound can be effectively suppressed during the instance association process, thereby significantly enhancing the audio-following capability of SeaVIS.
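The causal constraint in CCAF can be sketched as single-head cross attention in which the visual query of frame t sees only audio features from steps 0..t. This is a minimal illustration, not the SeaVIS module: the function name, the scaled dot-product scoring, and the absence of learned projections are assumptions made for brevity.

```python
import math

def causal_cross_attention(visual_query, audio_feats, t):
    """Attend from the frame-t visual query to audio features 0..t only.

    Future audio (indices > t) is never read, which is the causal constraint.
    """
    history = audio_feats[: t + 1]
    scale = math.sqrt(len(visual_query))
    logits = [sum(q * k for q, k in zip(visual_query, a)) / scale for a in history]
    peak = max(logits)  # subtract the max for a numerically stable softmax
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(history[0])
    return [sum(w * a[d] for w, a in zip(weights, history)) for d in range(dim)]

query = [1.0, 0.5]
audio = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
fused_t1 = causal_cross_attention(query, audio, t=1)
```

Because the slice stops at `t`, replacing `audio[2]` with any other vector leaves `fused_t1` unchanged, which is the property an online system needs.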
Extensive experiments conducted on the AVISeg dataset demonstrate that SeaVIS surpasses existing state-of-the-art models across multiple evaluation metrics while maintaining a competitive inference speed suitable for real-time processing.
[333] DOCFORGE-BENCH: A Comprehensive Benchmark for Document Forgery Detection and Analysis
Zengqi Zhao,Weidi Xia,Peter Wei,Yan Zhang,Yiyi Zhang,Jane Mo,Tiannan Zhang,Yuanqin Dai,Zexi Chen,Simiao Ren
Main category: cs.CV
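TL;DR: This paper presents DOCFORGE-BENCH, the first unified zero-shot benchmark for document forgery detection, evaluating 14 methods on 8 datasets. Existing methods show a pervasive calibration failure (high AUC but very low F1), because tampered regions cover so few pixels that a fixed threshold such as 0.5 is badly miscalibrated. Tuning the threshold on a handful of images improves performance markedly, which indicates that calibration rather than retraining is the key bottleneck. The benchmark does not yet cover new forgery types produced by generative AI, leaving an important research gap.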
Details
Motivation: Existing evaluations of document forgery detection mostly rely on fine-tuning, which departs from real deployments where no labels are available. There is no unified zero-shot benchmark covering multiple forgery types, and the calibration problem that arises when tampered pixels are rare has not been exposed.
Method: Builds the zero-shot DOCFORGE-BENCH benchmark, which evaluates 14 pretrained methods on 8 document forgery datasets (text tampering, receipt forgery, identity document manipulation); contrasts Pixel-AUC with Pixel-F1; and designs an Oracle-F1 analysis plus a controlled threshold-calibration experiment (threshold tuned on only 10 images) to locate the bottleneck.
Result: All methods perform poorly zero-shot: Pixel-AUC >= 0.76 but Pixel-F1 near zero. The large AUC-F1 gap stems from tampered pixels making up only 0.27-4.17% of each image, far less than in natural-image benchmarks. Oracle-F1 is 2-10x the fixed-threshold F1, and calibrating the threshold on 10 images recovers 39-55% of the Oracle-F1 gap.
Conclusion: Document forgery detection remains unsolved. The core bottleneck is calibration of the output scores rather than representation learning, so zero-shot deployment needs lightweight threshold adaptation. Existing benchmarks do not cover generative-AI forgeries (diffusion models, LLMs) and need updating.
Abstract: We present DOCFORGE-BENCH, the first unified zero-shot benchmark for document forgery detection, evaluating 14 methods across eight datasets spanning text tampering, receipt forgery, and identity document manipulation. Unlike fine-tuning-oriented evaluations such as ForensicHub [Du et al., 2025], DOCFORGE-BENCH applies all methods with their published pretrained weights and no domain adaptation -- a deliberate design choice that reflects the realistic deployment scenario where practitioners lack labeled document training data. Our central finding is a pervasive calibration failure invisible under single-threshold protocols: methods achieve moderate Pixel-AUC (>=0.76) yet near-zero Pixel-F1. This AUC-F1 gap is not a discrimination failure but a score-distribution shift: tampered regions occupy only 0.27-4.17% of pixels in document images -- an order of magnitude less than in natural image benchmarks -- making the standard tau=0.5 threshold catastrophically miscalibrated. Oracle-F1 is 2-10x higher than fixed-threshold Pixel-F1, confirming that calibration, not representation, is the bottleneck. A controlled calibration experiment validates this: adapting a single threshold on N=10 domain images recovers 39-55% of the Oracle-F1 gap, demonstrating that threshold adaptation -- not retraining -- is the key missing step for practical deployment. Overall, no evaluated method works reliably out-of-the-box on diverse document types, underscoring that document forgery detection remains an unsolved problem.
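The AUC-F1 gap described above can be reproduced on toy data. The sketch below is illustrative only: the scores are synthetic, the helper names are hypothetical, and the threshold is chosen on the same pixels it is scored on (an oracle setting), whereas the paper adapts it on a small set of calibration images.

```python
def f1_at(scores, labels, tau):
    """Pixel-level F1 when pixels with score >= tau are predicted tampered."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= tau and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= tau and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < tau and y == 1)
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)

def pairwise_auc(scores, labels):
    """AUC as the fraction of (tampered, clean) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def calibrate_threshold(scores, labels):
    """Pick the candidate threshold with the highest F1 on the given pixels."""
    return max(sorted(set(scores)), key=lambda tau: f1_at(scores, labels, tau))

# Synthetic image: 98 clean pixels and 2 tampered pixels (2% tampered).
# Every score sits far below 0.5, but tampered pixels still rank highest.
clean = [0.01 + 0.001 * i for i in range(98)]
tampered = [0.20, 0.30]
scores = clean + tampered
labels = [0] * len(clean) + [1] * len(tampered)

auc = pairwise_auc(scores, labels)
fixed_f1 = f1_at(scores, labels, 0.5)
tau = calibrate_threshold(scores, labels)
calibrated_f1 = f1_at(scores, labels, tau)
```

In this toy case the ranking is perfect, so the detector discriminates well, yet the fixed 0.5 threshold flags nothing. Only the threshold changes between the two F1 values.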
We further note that all eight datasets predate the era of generative AI editing; benchmarks covering diffusion- and LLM-based document forgeries represent a critical open gap on the modern attack surface.
[334] Unifying Language-Action Understanding and Generation for Autonomous Driving
Xinyang Wang,Qian Liu,Wenjie Ding,Zhao Yang,Wei Li,Chang Liu,Bailin Li,Kun Zhan,Xianpeng Lang,Wei Chen
Main category: cs.CV
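TL;DR: This paper proposes LinkVLA, which unifies language and action in a shared discrete codebook, adds an auxiliary action-understanding task, and uses a two-step coarse-to-fine (C2F) action decoding method, improving instruction-following accuracy, driving performance, and inference efficiency of vision-language-action models for autonomous driving.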
Details
Motivation: Existing vision-language-action (VLA) models for autonomous driving have two key problems: persistent misalignment between language instructions and action outputs, and inefficient auto-regressive action generation.
Method: 1) A structural link: language and action tokens are mapped into a shared discrete codebook and processed jointly in a single multi-modal model. 2) A semantic link: an auxiliary action-understanding task trains the model to generate descriptive text from trajectories, establishing a bidirectional language-action mapping. 3) A two-step coarse-to-fine (C2F) action generation method replaces step-by-step auto-regressive decoding.
Result: On closed-loop driving benchmarks, LinkVLA consistently improves instruction-following accuracy and driving performance while cutting inference time by 86%.
Conclusion: With structural and semantic alignment plus an efficient decoding strategy, LinkVLA addresses the core bottlenecks of current VLA models in autonomous driving and offers a more robust, more efficient paradigm for end-to-end driving.
Abstract: Vision-Language-Action (VLA) models are emerging as a promising paradigm for end-to-end autonomous driving, valued for their potential to leverage world knowledge and reason about complex driving scenes. However, existing methods suffer from two critical limitations: a persistent misalignment between language instructions and action outputs, and the inherent inefficiency of typical auto-regressive action generation. In this paper, we introduce LinkVLA, a novel architecture that directly addresses these challenges to enhance both alignment and efficiency. First, we establish a structural link by unifying language and action tokens into a shared discrete codebook, processed within a single multi-modal model. This structurally enforces cross-modal consistency from the ground up. Second, to create a deep semantic link, we introduce an auxiliary action understanding objective that trains the model to generate descriptive captions from trajectories, fostering a bidirectional language-action mapping. Finally, we replace the slow, step-by-step generation with a two-step coarse-to-fine generation method (C2F) that efficiently decodes the action sequence, saving 86% inference time. Experiments on closed-loop driving benchmarks show consistent gains in instruction following accuracy and driving performance, alongside reduced inference latency.
[335] Deepfake Forensics Adapter: A Dual-Stream Network for Generalizable Deepfake Detection
Jianfeng Liao,Yichen Wei,Raymond Chan Ching Bon,Shulan Wang,Kam-Pui Chow,Kwok-Yan Lam
Main category: cs.CV
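TL;DR: This paper proposes Deepfake Forensics Adapter (DFA), a dual-stream framework that combines vision-language foundation models such as CLIP with targeted forensic analysis, achieving highly generalizable deepfake detection without fine-tuning CLIP's parameters.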
Details
Motivation: Existing deepfake detection methods generalize poorly to emerging forgery patterns, while rapidly advancing forgery techniques pose serious threats to public safety and society.
Method: Proposes the dual-stream DFA framework: 1) a Global Feature Adapter identifies global inconsistencies in the image; 2) a Local Anomaly Stream uses facial structure priors to sharpen perception of local forgery cues; 3) an Interactive Fusion Classifier fuses global and local features through a transformer encoder. CLIP's parameters stay frozen throughout.
Result: On the DFDC dataset, DFA reaches frame-level AUC/EER of 0.816/0.256 and video-level AUC/EER of 0.836/0.251, a 4.8% video-level AUC improvement over previous methods.
Conclusion: Beyond state-of-the-art performance, DFA points to a feasible and effective direction for building robust detection systems that generalize to evolving deepfake threats.
Abstract: The rapid advancement of deepfake generation techniques poses significant threats to public safety and causes societal harm through the creation of highly realistic synthetic facial media. While existing detection methods demonstrate limitations in generalizing to emerging forgery patterns, this paper presents Deepfake Forensics Adapter (DFA), a novel dual-stream framework that synergizes vision-language foundation models with targeted forensics analysis. Our approach integrates a pre-trained CLIP model with three core components to achieve specialized deepfake detection by leveraging the powerful general capabilities of CLIP without changing CLIP parameters: 1) A Global Feature Adapter is used to identify global inconsistencies in image content that may indicate forgery, 2) A Local Anomaly Stream enhances the model's ability to perceive local facial forgery cues by explicitly leveraging facial structure priors, and 3) An Interactive Fusion Classifier promotes deep interaction and fusion between global and local features using a transformer encoder. Extensive evaluations on frame-level and video-level benchmarks demonstrate the superior generalization capabilities of DFA, particularly achieving state-of-the-art performance on the challenging DFDC dataset with frame-level AUC/EER of 0.816/0.256 and video-level AUC/EER of 0.836/0.251, representing a 4.8% video AUC improvement over previous methods.
Our framework not only demonstrates state-of-the-art performance, but also points out a feasible and effective direction for developing a robust deepfake detection system with enhanced generalization capabilities against the evolving deepfake threats. Our code is available at https://github.com/Liao330/DFA.git
[336] VidDoS: Universal Denial-of-Service Attack on Video-based Large Language Models
Duoxun Tang,Dasen Dai,Jiyao Wang,Xiao Yang,Jianyu Wang,Siqi Cai
Main category: cs.CV
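TL;DR: This paper proposes VidDoS, the first universal energy-latency attack (ELA) framework for video large language models (Video-LLMs). Using masked teacher forcing, a refusal penalty, and early-termination suppression, it produces universal triggers that need no inference-time gradient computation, inflating token output by more than 205x and latency by more than 15x and causing serious safety problems in scenarios such as autonomous driving.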
Details
Motivation: Video-LLMs are increasingly used in safety-critical applications but are vulnerable to energy-latency attacks (ELAs). Existing image-centric methods fail because temporal aggregation dilutes single-frame perturbations, and real-time requirements make per-sample optimization impractical for continuous video streams.
Method: Proposes the VidDoS framework, which uses universal optimization to generate instance-agnostic triggers. Its core parts are masked teacher forcing (steering the model toward expensive target sequences) and a refusal penalty with early-termination suppression (overriding conciseness priors). No inference-time gradient computation is needed.
Result: Tested on three mainstream Video-LLMs and three video datasets (covering video question answering and autonomous driving), VidDoS expands token output by more than 205x and inference latency by more than 15x. Simulations of real-time autonomous driving streams show that it triggers critical safety violations.
Conclusion: VidDoS exposes a realistic, high-hazard ELA threat to Video-LLMs and calls on the community to recognize and mitigate it.
Abstract: Video-LLMs are increasingly deployed in safety-critical applications but are vulnerable to Energy-Latency Attacks (ELAs) that exhaust computational resources. Current image-centric methods fail because temporal aggregation mechanisms dilute individual frame perturbations. Additionally, real-time demands make instance-wise optimization impractical for continuous video streams. We introduce VidDoS, which is the first universal ELA framework tailored for Video-LLMs. Our method leverages universal optimization to create instance-agnostic triggers that require no inference-time gradient calculation. We achieve this through masked teacher forcing to steer models toward expensive target sequences, combined with a refusal penalty and early-termination suppression to override conciseness priors. Testing across three mainstream Video-LLMs and three video datasets, which include video question answering and autonomous driving scenarios, shows extreme degradation. VidDoS induces a token expansion of more than 205x and inflates the inference latency by more than 15x relative to clean baselines. Simulations of real-time autonomous driving streams further reveal that this induced latency leads to critical safety violations. We urge the community to recognize and mitigate these high-hazard ELAs in Video-LLMs.
[337] UltraStar: Semantic-Aware Star Graph Modeling for Echocardiography Navigation
Teng Wang,Haojun Jiang,Chenxi Li,Diwen Wang,Yihang Tang,Zhenguo Sun,Yujiao Deng,Shiji Song,Gao Huang
Main category: cs.CV
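TL;DR: This paper proposes UltraStar, which recasts ultrasound probe navigation from path regression to anchor-based global localization. It builds a Star Graph that uses historical keyframes as spatial anchors and adds a semantic-aware sampling strategy to improve localization accuracy and robustness.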
Details
Motivation: Echocardiographic diagnosis depends on skilled operators, who are in short supply. Existing automated probe navigation methods model noisy historical trajectories poorly and tend to overfit on long sequences.
Method: Proposes the UltraStar framework: 1) a Star Graph treats historical keyframes as spatial anchors connected directly to the current view, explicitly modeling geometric constraints; 2) a semantic-aware sampling strategy actively selects representative landmarks from massive history logs to reduce redundancy.
Result: Experiments on a dataset of over 1.31 million samples show that UltraStar outperforms baselines and degrades less as input length grows, supporting its topology as a better way to model history under noisy exploration.
Conclusion: Anchor-based global localization suits noisy clinical scanning history better than conventional sequential path regression, and UltraStar offers a new approach to robust, scalable automated probe navigation.
Abstract: Echocardiography is critical for diagnosing cardiovascular diseases, yet the shortage of skilled sonographers hinders timely patient care, due to high operational difficulties. Consequently, research on automated probe navigation has significant clinical potential. To achieve robust navigation, it is essential to leverage historical scanning information, mimicking how experts rely on past feedback to adjust subsequent maneuvers. Practical scanning data collected from sonographers typically consists of noisy trajectories inherently generated through trial-and-error exploration. However, existing methods typically model this history as a sequential chain, forcing models to overfit these noisy paths, leading to performance degradation on long sequences. In this paper, we propose UltraStar, which reformulates probe navigation from path regression to anchor-based global localization. By establishing a Star Graph, UltraStar treats historical keyframes as spatial anchors connected directly to the current view, explicitly modeling geometric constraints for precise positioning. We further enhance the Star Graph with a semantic-aware sampling strategy that actively selects the representative landmarks from massive history logs, reducing redundancy for accurate anchoring. Extensive experiments on a dataset with over 1.31 million samples demonstrate that UltraStar outperforms baselines and scales better with longer input lengths, revealing a more effective topology for history modeling under noisy exploration.
[338] WildCross: A Cross-Modal Large Scale Benchmark for Place Recognition and Metric Depth Estimation in Natural Environments
Joshua Knights,Joseph Reid,Kaushik Roy,David Hall,Mark Cox,Peyman Moghadam
Main category: cs.CV
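TL;DR: This paper presents WildCross, a cross-modal benchmark dataset for place recognition and metric depth estimation in large-scale natural environments. It contains 476K RGB frames with accompanying depth, surface normals, 6DoF poses, and lidar submaps, and is validated with comprehensive experiments on multi-modal perception tasks.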
Details
Motivation: Existing robotics datasets are mostly captured in structured urban environments and do not meet the needs of robot perception in unstructured natural environments; 2D and 3D scene understanding also remain to be bridged.
Method: Builds the WildCross cross-modal benchmark, consisting of a large number of RGB sequences with semi-dense depth, surface normals, accurate 6DoF poses, and synchronized dense lidar submaps, and runs experiments on visual, lidar, and cross-modal place recognition as well as metric depth estimation.
Result: The experiments show WildCross to be a challenging benchmark for multi-modal robotic perception, valuable for tasks such as place recognition and metric depth estimation.
Conclusion: WildCross fills the gap in multi-modal robotic perception benchmarks for natural environments and supports progress in 2D/3D fusion and field robotics.
Abstract: Recent years have seen a significant increase in demand for robotic solutions in unstructured natural environments, alongside growing interest in bridging 2D and 3D scene understanding. However, existing robotics datasets are predominantly captured in structured urban environments, making them inadequate for addressing the challenges posed by complex, unstructured natural settings. To address this gap, we propose WildCross, a cross-modal benchmark for place recognition and metric depth estimation in large-scale natural environments. WildCross comprises over 476K sequential RGB frames with semi-dense depth and surface normal annotations, each aligned with accurate 6DoF poses and synchronized dense lidar submaps. We conduct comprehensive experiments on visual, lidar, and cross-modal place recognition, as well as metric depth estimation, demonstrating the value of WildCross as a challenging benchmark for multi-modal robotic perception tasks. We provide access to the code repository and dataset at https://csiro-robotics.github.io/WildCross.
[339] SCATR: Mitigating New Instance Suppression in LiDAR-based Tracking-by-Attention via Second Chance Assignment and Track Query Dropout
Brian Cheong,Letian Wang,Sandro Papais,Steven L. Waslander
Main category: cs.CV
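TL;DR: This paper proposes SCATR, a new LiDAR-based tracking-by-attention (TBA) model. Two training strategies, Second Chance Assignment and Track Query Dropout, substantially reduce false negatives and close the performance gap between TBA and conventional tracking-by-detection (TBD) methods.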
Details
Motivation: LiDAR-based tracking-by-attention (TBA) frameworks suffer from high false-negative rates, leaving them well behind tracking-by-detection (TBD) methods.
Method: Proposes two architecture-agnostic training strategies: Second Chance Assignment (unassigned track queries are concatenated to the proposal queries before bipartite matching, easing the conflict between detection and tracking) and Track Query Dropout (some track queries are randomly dropped to make the decoder robust to newborn or missing tracks).
Result: On the nuScenes tracking benchmark, SCATR achieves state-of-the-art results among LiDAR-based TBA methods with a 7.6% AMOTA improvement, closing the long-standing gap to TBD methods. Ablation studies confirm the effectiveness and generalization of both strategies.
Conclusion: With targeted training strategies, SCATR systematically addresses the high false-negative rate of LiDAR-based TBA and moves the paradigm closer to practical use.
Abstract: LiDAR-based tracking-by-attention (TBA) frameworks inherently suffer from high false negative errors, leading to a significant performance gap compared to traditional LiDAR-based tracking-by-detection (TBD) methods. This paper introduces SCATR, a novel LiDAR-based TBA model designed to address this fundamental challenge systematically. SCATR leverages recent progress in vision-based tracking and incorporates targeted training strategies specifically adapted for LiDAR. Our work's core innovations are two architecture-agnostic training strategies for TBA methods: Second Chance Assignment and Track Query Dropout. Second Chance Assignment is a novel ground truth assignment that concatenates unassigned track queries to the proposal queries before bipartite matching, giving these track queries a second chance to be assigned to a ground truth object and effectively mitigating the conflict between detection and tracking tasks inherent in tracking-by-attention. Track Query Dropout is a training method that diversifies supervised object query configurations to efficiently train the decoder to handle different track query sets, enhancing robustness to missing or newborn tracks. Experiments on the nuScenes tracking benchmark demonstrate that SCATR achieves state-of-the-art performance among LiDAR-based TBA methods, outperforming previous works by 7.6% AMOTA and successfully bridging the long-standing performance gap between LiDAR-based TBA and TBD methods. Ablation studies further validate the effectiveness and generalization of Second Chance Assignment and Track Query Dropout.
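The two training strategies can be sketched on toy cost matrices. This is a hedged illustration rather than the SCATR code: all function names are hypothetical, the brute-force matcher stands in for a Hungarian solver and only suits tiny inputs, and independent per-query dropout is an assumed form of Track Query Dropout.

```python
import itertools
import random

def min_cost_matching(cost):
    """Brute-force bipartite matching: rows are ground-truth objects,
    columns are queries. Requires at least as many columns as rows."""
    n_gt, n_q = len(cost), len(cost[0])
    best_cols, best_cost = None, float("inf")
    for cols in itertools.permutations(range(n_q), n_gt):
        total = sum(cost[i][cols[i]] for i in range(n_gt))
        if total < best_cost:
            best_cols, best_cost = cols, total
    return list(best_cols)

def second_chance_assignment(cost_proposals, cost_unassigned_tracks):
    """Match remaining ground truths against proposals plus unassigned tracks.

    Concatenating the track-query columns before matching gives those
    queries a second chance to be assigned to a ground-truth object.
    """
    cost = [p + t for p, t in zip(cost_proposals, cost_unassigned_tracks)]
    n_prop = len(cost_proposals[0])
    return [
        ("proposal", c) if c < n_prop else ("track", c - n_prop)
        for c in min_cost_matching(cost)
    ]

def track_query_dropout(track_queries, p, rng):
    """Keep each track query independently with probability 1 - p."""
    return [q for q in track_queries if rng.random() >= p]

# One ground-truth object, two proposal queries, one unassigned track query.
assignment = second_chance_assignment([[0.9, 0.8]], [[0.1]])
kept = track_query_dropout(["t0", "t1", "t2"], p=0.5, rng=random.Random(0))
```

In the toy case the unassigned track query has the lowest cost, so it wins the ground-truth object that would otherwise go to a proposal query.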
Code can be found at the following link: https://github.com/TRAILab/SCATR
[340] ATA: Bridging Implicit Reasoning with Attention-Guided and Action-Guided Inference for Vision-Language Action Models
Cheng Yang,Jianhao Jiao,Lingyi Huang,Jinqi Xiao,Zhexiang Tang,Yu Gong,Yibiao Ying,Yang Sui,Jintian Lin,Wen Huang,Bo Yuan
Main category: cs.CV
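TL;DR: This paper proposes ATA, a training-free, plug-and-play implicit reasoning framework. Attention-guided and action-guided strategies make Vision-Language-Action (VLA) models adapt their visual inputs, raising task success and robustness while preserving or even improving inference efficiency.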
Details
Motivation: Existing VLA models rely on explicit reasoning (such as chain-of-thought or visual grounding annotations), which brings high annotation cost, slow dataset construction, retraining, and slower inference. A more efficient, lightweight way to strengthen reasoning is needed.
Method: Proposes the ATA framework, which needs no extra training or annotation. It realizes implicit reasoning by combining attention maps with an action-guided region of interest (RoI), and uses complementary attention-guided and action-guided strategies to refine visual inputs dynamically.
Result: ATA is validated on several VLA benchmarks: it consistently improves task success and robustness while maintaining or speeding up inference, and it is plug-and-play, lightweight, and generalizes well.
Conclusion: ATA gives VLA models an efficient, training-free, annotation-free paradigm for implicit reasoning, avoiding the resource and efficiency bottlenecks of explicit reasoning and making VLA models more practical.
Abstract: Vision-Language-Action (VLA) models rely on current observations, including images, language instructions, and robot states, to predict actions and complete tasks. While accurate visual perception is crucial for precise action prediction and execution, recent work has attempted to further improve performance by introducing explicit reasoning during inference. However, such approaches face significant limitations. They often depend on data-intensive resources such as Chain-of-Thought (CoT) style annotations to decompose tasks into step-by-step reasoning, and in many cases require additional visual grounding annotations (e.g., bounding boxes or masks) to highlight relevant image regions. Moreover, they involve time-consuming dataset construction, labeling, and retraining, which ultimately results in longer inference sequences and reduced efficiency. To address these challenges, we propose ATA, a novel training-free framework that introduces implicit reasoning into VLA inference through complementary attention-guided and action-guided strategies. Unlike CoT or explicit visual-grounding methods, ATA formulates reasoning implicitly by integrating attention maps with an action-based region of interest (RoI), thereby adaptively refining visual inputs without requiring extra training or annotations. ATA is a plug-and-play implicit reasoning approach for VLA models, lightweight yet effective. Extensive experiments show that it consistently improves task success and robustness while preserving, and even enhancing, inference efficiency.
[341] Radiometrically Consistent Gaussian Surfels for Inverse Rendering
Kyu Beom Han,Jaeyoon Kim,Woo Jae Kim,Jinhwan Seo,Sung-eui Yoon
Main category: cs.CV
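TL;DR: This paper proposes Radiometrically Consistent Gaussian Surfels (RadioGS). A radiometric consistency constraint addresses the difficulty Gaussian splatting has in disentangling materials from indirect illumination in inverse rendering, improving the modeling of inter-reflection and supporting fast relighting.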
Details
Motivation: Existing Gaussian-splatting-based inverse rendering methods lack supervision when modeling indirect radiance from unobserved views, so materials and global illumination (especially indirect illumination) are hard to disentangle accurately.
Method: Introduces radiometric consistency, a physically-based constraint that minimizes the residual between each Gaussian primitive's learned radiance and its physically-based rendered counterpart; builds the RadioGS framework, which integrates the constraint efficiently using Gaussian surfels and 2D Gaussian ray tracing; and designs a finetuning-based fast relighting strategy.
Result: Outperforms existing Gaussian-based methods on standard inverse rendering benchmarks while staying computationally efficient (rendering cost under relighting <10 ms), and models inter-reflection more accurately.
Conclusion: Radiometric consistency provides effective physical supervision for unobserved views, and RadioGS strikes a better balance between accuracy and efficiency, making Gaussian splatting more practical for inverse rendering.
Abstract: Inverse rendering with Gaussian Splatting has advanced rapidly, but accurately disentangling material properties from complex global illumination effects, particularly indirect illumination, remains a major challenge. Existing methods often query indirect radiance from Gaussian primitives pre-trained for novel-view synthesis. However, these pre-trained Gaussian primitives are supervised only towards limited training viewpoints, and thus lack supervision for modeling indirect radiance from unobserved views. To address this issue, we introduce radiometric consistency, a novel physically-based constraint that provides supervision towards unobserved views by minimizing the residual between each Gaussian primitive's learned radiance and its physically-based rendered counterpart. Minimizing the residual for unobserved views establishes a self-correcting feedback loop that provides supervision from both physically-based rendering and novel-view synthesis, enabling accurate modeling of inter-reflection. We then propose Radiometrically Consistent Gaussian Surfels (RadioGS), an inverse rendering framework built upon our principle that efficiently integrates radiometric consistency using Gaussian surfels and 2D Gaussian ray tracing. We further propose a finetuning-based relighting strategy that adapts Gaussian surfel radiances to new illuminations within minutes, achieving low rendering cost (<10ms).
Extensive experiments on existing inverse rendering benchmarks show that RadioGS outperforms existing Gaussian-based methods in inverse rendering while retaining computational efficiency.
[342] Tri-path DINO: Feature Complementary Learning for Remote Sensing Multi-Class Change Detection
Kai Zheng,Hang-Cheng Dong,Zhenkai Wu,Fupeng Wei,Wei Zhang
Main category: cs.CV
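TL;DR: This paper proposes the Tri-path DINO architecture. A three-path complementary feature learning strategy, combining a DINOv3 backbone, an auxiliary siamese path, and a multi-scale attention decoder, achieves the best performance on remote sensing multi-class change detection and improves interpretability.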
Details
Motivation: Multi-class change detection in remote sensing imagery is constrained by complex scene variations and the scarcity of fine-grained annotations.
Method: Proposes the Tri-path DINO architecture: 1) a DINOv3 backbone extracts coarse-grained features; 2) an auxiliary siamese path progressively aggregates intermediate features to strengthen fine-grained feature learning; 3) the decoder adds a multi-scale attention mechanism in which parallel convolutions adaptively capture context under different receptive fields.
Result: Achieves the best MCD performance on both the Gaza change and SECOND datasets. GradCAM visualizations confirm that the main and auxiliary paths focus on semantic-level and structural-level changes respectively, giving good interpretability.
Conclusion: The three-path complementary strategy provides a robust, interpretable solution for remote sensing change detection and supports rapid, accurate damage assessment.
Abstract: In remote sensing imagery, multi-class change detection (MCD) is crucial for fine-grained monitoring, yet it has long been constrained by complex scene variations and the scarcity of detailed annotations. To address this, we propose the Tri-path DINO architecture, which adopts a three-path complementary feature learning strategy to facilitate the rapid adaptation of pre-trained foundation models to complex vertical domains. Specifically, we employ the DINOv3 pre-trained model as the backbone feature extraction network to learn coarse-grained features. An auxiliary path also adopts a siamese structure, progressively aggregating intermediate features from the siamese encoder to enhance the learning of fine-grained features. Finally, a multi-scale attention mechanism is introduced to augment the decoder network, where parallel convolutions adaptively capture and enhance contextual information under different receptive fields. The proposed method achieves optimal performance on the MCD task on both the Gaza facility damage assessment dataset (Gaza change) and the classic SECOND dataset. GradCAM visualizations further confirm that the main and auxiliary paths naturally focus on coarse-grained semantic changes and fine-grained structural details, respectively. This synergistic complementarity provides a robust and interpretable solution for advanced change detection tasks, offering a basis for rapid and accurate damage assessment.
[343] OMG-Avatar: One-shot Multi-LOD Gaussian Head Avatar
Jianqiang Ren,Lin Liu,Steven Hoi
Main category: cs.CV
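TL;DR: OMG-Avatar is a new method for generating an animatable 3D head avatar from a single image. It uses a multi-level-of-detail Gaussian representation and a transformer architecture to deliver high-quality reconstruction and reenactment in 0.2 seconds.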
Details
Motivation: Existing 3DMM-based methods struggle to model non-head regions such as the shoulders and cannot easily accommodate different hardware capabilities and inference speed requirements.
Method: Proposes a one-shot method based on a multi-level-of-detail (Multi-LOD) Gaussian representation. A transformer extracts global features, projection-based sampling obtains local features, and a depth buffer guides their fusion. A coarse-to-fine learning paradigm and a multi-region decomposition scheme (head and shoulders predicted separately, then combined across regions) are also introduced.
Result: Outperforms current state-of-the-art methods in reconstruction quality, expression reenactment, and computational efficiency.
Conclusion: OMG-Avatar delivers efficient, high-quality, hardware-adaptive single-image 3D head reconstruction and animation, offering a new option for real-time virtual human applications.
Abstract: We propose OMG-Avatar, a novel One-shot method that leverages a Multi-LOD (Level-of-Detail) Gaussian representation for animatable 3D head reconstruction from a single image in 0.2s. Our method enables LOD head avatar modeling using a unified model that accommodates diverse hardware capabilities and inference speed requirements. To capture both global and local facial characteristics, we employ a transformer-based architecture for global feature extraction and projection-based sampling for local feature acquisition. These features are effectively fused under the guidance of a depth buffer, ensuring occlusion plausibility. We further introduce a coarse-to-fine learning paradigm to support Level-of-Detail functionality and enhance the perception of hierarchical details. To address the limitations of 3DMMs in modeling non-head regions such as the shoulders, we introduce a multi-region decomposition scheme in which the head and shoulders are predicted separately and then integrated through cross-region combination. Extensive experiments demonstrate that OMG-Avatar outperforms state-of-the-art methods in reconstruction quality, reenactment performance, and computational efficiency.
[344] Retrieval, Refinement, and Ranking for Text-to-Video Generation via Prompt Optimization and Test-Time Scaling
Zillur Rahman,Alex Sheng,Cristian Meo
Main category: cs.CV
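TL;DR: This paper proposes 3R, a training-free, RAG-based prompt optimization framework. Three strategies (RAG-based modifier extraction, diffusion-based preference optimization, and temporal frame interpolation) improve the static fidelity and dynamic coherence of videos generated by T2V models.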
Details
Motivation: Existing T2V models are highly sensitive to input prompts, and current improvement methods depend on complex post-editing or expensive fine-tuning, which limits scalability and accessibility.
Method: Proposes the 3R framework: 1) RAG-based modifier extraction to enrich contextual grounding; 2) diffusion-based Preference Optimization to align outputs with human preferences; 3) temporal frame interpolation to ensure temporal consistency. The T2V model itself is never modified or trained.
Result: Experiments show that 3R improves the static fidelity and dynamic coherence of generated videos, confirming the value of prompt optimization.
Conclusion: Prompt optimization is a key route to better T2V generation quality, and 3R offers an efficient, general, training-free solution.
Abstract: While large-scale datasets have driven significant progress in Text-to-Video (T2V) generative models, these models remain highly sensitive to input prompts, demonstrating that prompt design is critical to generation quality. Current methods for improving video output often fall short: they either depend on complex, post-editing models, risking the introduction of artifacts, or require expensive fine-tuning of the core generator, which severely limits both scalability and accessibility. In this work, we introduce 3R, a novel RAG-based prompt optimization framework. 3R utilizes the power of current state-of-the-art T2V diffusion models and vision-language models. It can be used with any T2V model without any kind of model training. The framework leverages three key strategies: RAG-based modifier extraction for enriched contextual grounding, diffusion-based Preference Optimization for aligning outputs with human preferences, and temporal frame interpolation for producing temporally consistent visual content. Together, these components enable more accurate, efficient, and contextually aligned text-to-video generation. Experimental results demonstrate the efficacy of 3R in enhancing the static fidelity and dynamic coherence of generated videos, underscoring the importance of optimizing user prompts.
[345] FACE: A Face-based Autoregressive Representation for High-Fidelity and Efficient Mesh Generation
Hanxiao Wang,Yuan-Chen Guo,Ying-Tian Liu,Zi-Xin Zou,Biao Zhang,Weize Quan,Ding Liang,Yan-Pei Cao,Dong-Ming Yan
Main category: cs.CV
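TL;DR: FACE is a new autoregressive autoencoder framework that generates 3D meshes at the face level, with each triangle face as one token. This sharply reduces sequence length and compute cost while maintaining or improving reconstruction quality, and it supports high-quality single-image-to-mesh generation.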
Details
Motivation: Existing autoregressive 3D mesh generation models flatten meshes into long vertex-coordinate sequences, which makes computation prohibitively expensive and hinders efficient generation of high-fidelity geometry. The root problem is operating at the wrong semantic level.
Method: Proposes the FACE framework, an Autoregressive Autoencoder (ARAE) that models meshes at the face level with a one-face-one-token strategy. It pairs a VecSet encoder with a face-level decoder and trains a latent diffusion model on top for single-image-to-mesh generation.
Result: Sequence length drops by a factor of nine, giving a compression ratio of 0.11 (half the previous SOTA). FACE reaches SOTA reconstruction quality on standard benchmarks and achieves high-fidelity single-image-to-mesh generation.
Conclusion: FACE offers a simple, scalable, and powerful paradigm that lowers the barrier to high-quality structured 3D content creation.
Abstract: Autoregressive models for 3D mesh generation suffer from a fundamental limitation: they flatten meshes into long vertex-coordinate sequences. This results in prohibitive computational costs, hindering the efficient synthesis of high-fidelity geometry. We argue this bottleneck stems from operating at the wrong semantic level. We introduce FACE, a novel Autoregressive Autoencoder (ARAE) framework that reconceptualizes the task by generating meshes at the face level. Our one-face-one-token strategy treats each triangle face, the fundamental building block of a mesh, as a single, unified token. This simple yet powerful design reduces the sequence length by a factor of nine, leading to an unprecedented compression ratio of 0.11, halving the previous state-of-the-art. This dramatic efficiency gain does not compromise quality; by pairing our face-level decoder with a powerful VecSet encoder, FACE achieves state-of-the-art reconstruction quality on standard benchmarks. The versatility of the learned latent space is further demonstrated by training a latent diffusion model that achieves high-fidelity, single-image-to-mesh generation. FACE provides a simple, scalable, and powerful paradigm that lowers the barrier to high-quality structured 3D content creation.
[346] Better Matching, Less Forgetting: A Quality-Guided Matcher for Transformer-based Incremental Object Detection
Qirui Wu,Shizhou Zhang,De Cheng,Yinghui Xing,Lingyan Ran,Dahu Shi,Peng Wang
Main category: cs.CV
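TL;DR: This paper proposes the Q-MCMF matcher, a new method that targets the "background foregrounding" problem caused by forced assignment in Hungarian matching when DETR-like detectors learn incrementally; it effectively mitigates catastrophic forgetting and significantly improves incremental object detection performance.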
Details
Motivation: Existing incremental object detection methods face a new source of forgetting in DETR architectures, "background foregrounding", which stems from the exhaustiveness constraint of Hungarian matching: low-IoU background predictions are wrongly supervised as foreground classes, damaging existing representations.
Method: Proposes the Quality-guided Min-Cost Max-Flow (Q-MCMF) matcher: it builds a flow graph, prunes unreliable matches according to geometric quality, and then jointly optimizes matching cost and the number of valid assignments, avoiding forced assignment.
Result: Across multiple incremental settings on COCO, the method consistently outperforms existing SOTA methods.
Conclusion: By removing the harmful supervision caused by background foregrounding while retaining valid foreground learning signals, Q-MCMF significantly mitigates catastrophic forgetting of DETR-like models in incremental detection.
Abstract: Incremental Object Detection (IOD) aims to continuously learn new object classes without forgetting previously learned ones. A persistent challenge is catastrophic forgetting, primarily attributed to background shift in conventional detectors. While pseudo-labeling mitigates this in dense detectors, we identify a novel, distinct source of forgetting specific to DETR-like architectures: background foregrounding. This arises from the exhaustiveness constraint of the Hungarian matcher, which forcibly assigns every ground truth target to one prediction, even when predictions primarily cover background regions (i.e., low IoU). This erroneous supervision compels the model to misclassify background features as specific foreground classes, disrupting learned representations and accelerating forgetting. To address this, we propose a Quality-guided Min-Cost Max-Flow (Q-MCMF) matcher. To avoid forced assignments, Q-MCMF builds a flow graph and prunes implausible matches based on geometric quality. It then optimizes for the final matching that minimizes cost and maximizes valid assignments. This strategy eliminates harmful supervision from background foregrounding while maximizing foreground learning signals. Extensive experiments on the COCO dataset under various incremental settings demonstrate that our method consistently outperforms existing state-of-the-art approaches.
[347] Boosting AI Reliability with an FSM-Driven Streaming Inference Pipeline: An Industrial Case
Yutian Zhang,Zhongyi Pei,Yi Mao,Chen Wang,Lin Liu,Jianmin Wang
Main category: cs.CV
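TL;DR: This paper proposes a streaming inference pipeline that incorporates prior knowledge: by integrating an object detection model with a finite state machine (FSM) encoding knowledge of operational scenarios, it improves the robustness and accuracy of AI models on the task of counting excavator workloads in industrial video surveillance.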
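How an FSM can guide and correct per-frame predictions is easiest to see in a small sketch. The Python example below is only an illustration under stated assumptions: the state names (`idle`, `digging`, `swinging`, `dumping`), the transition table, and the rule that one workload is a completed swing-to-dump step are all hypothetical and are not taken from the paper.

```python
# Hypothetical excavator states and the transitions a domain expert allows.
ALLOWED = {
    "idle": {"idle", "digging"},
    "digging": {"digging", "swinging"},
    "swinging": {"swinging", "dumping"},
    "dumping": {"dumping", "idle", "digging"},
}


def count_workloads(frame_labels):
    """Count dig-swing-dump cycles, ignoring labels the FSM cannot reach."""
    state = "idle"
    completed = 0
    for label in frame_labels:
        if label not in ALLOWED[state]:
            continue  # implausible jump: treat as detector noise, keep the state
        if state == "swinging" and label == "dumping":
            completed += 1  # one full workload finished
        state = label
    return completed


# Per-frame labels as a detector might emit them; the third label is noise.
labels = ["idle", "digging", "dumping", "swinging", "dumping",
          "idle", "digging", "swinging", "dumping"]
print(count_workloads(labels))  # prints 2
```

The spurious `dumping` label right after `digging` is rejected because the transition table does not allow it, so it cannot inflate the count.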
Details
Motivation: Industrial AI applications are limited by poor robustness in scenarios not covered by training data, which leads to prediction bias and brittleness.
Method: Proposes a streaming inference pipeline that couples an object detection model with a finite state machine (FSM) built from domain knowledge; the FSM models the operational process to guide and correct the model's real-time predictions on video streams.
Result: On a real-world dataset with 7,000+ images, 12 site videos, and 300+ excavator workloads, the method outperforms the original solution based on manual heuristic rules in both performance and robustness.
Conclusion: Explicitly introducing structured prior knowledge (such as an FSM) can effectively improve the reliability of data-driven AI models in dynamic industrial scenarios, offering a viable paradigm for edge/streaming AI deployment.
Abstract: The widespread adoption of AI in industry is often hampered by its limited robustness when faced with scenarios absent from training data, leading to prediction bias and vulnerabilities. To address this, we propose a novel streaming inference pipeline that enhances data-driven models by explicitly incorporating prior knowledge. This paper presents the work on an industrial AI application that automatically counts excavator workloads from surveillance videos. Our approach integrates an object detection model with a Finite State Machine (FSM), which encodes knowledge of operational scenarios to guide and correct the AI's predictions on streaming data. In experiments on a real-world dataset of over 7,000 images from 12 site videos, encompassing more than 300 excavator workloads, our method demonstrates superior performance and greater robustness compared to the original solution based on manual heuristic rules. We will release the code at https://github.com/thulab/video-streamling-inference-pipeline.
[348] Benchmarking Semantic Segmentation Models via Appearance and Geometry Attribute Editing
Zijin Yin,Bing Li,Kongming Liang,Hao Sun,Zhongjiang He,Zhanyu Ma,Jun Guo
Main category: cs.CV
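TL;DR: This paper proposes Gen4Seg, an automatic data generation pipeline that uses diffusion models to edit visual attributes of real images (such as color, material, size, position, weather, and style) in order to stress-test semantic segmentation models; it builds two new benchmarks, Pascal-EA and COCO-EA, and reveals the robustness limits of current models under geometric and appearance variations as well as the limits of data augmentation.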
Details
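Motivation: Existing evaluation paradigms focus only on global weather or style transfer and lack systematic stress tests of object-level and image-level changes in appearance and geometry attributes; annotation is also costly, so a controllable image editing method that can reuse the original labels is needed.
Method: Builds the Gen4Seg pipeline on diffusion models to precisely edit object color, material, size, and position, as well as image-level weather and style, in real images while preserving structural information and reusing the original segmentation labels; two new benchmarks, Pascal-EA and COCO-EA, are built with it.
Result: 1) Advanced open-vocabulary models are no more robust than closed-set models under geometric variations; 2) augmentation methods such as CutOut and CutMix do little to improve robustness to appearance variations; 3) used as data augmentation, the pipeline improves both in-distribution and out-of-distribution performance.
Conclusion: Generative models can serve as efficient automated analysis tools for evaluating and strengthening semantic segmentation models; the work exposes robustness bottlenecks and provides practical guidance and new benchmarks for building more reliable segmentation models.
Abstract: Semantic segmentation plays a pivotal role in various applications such as autonomous driving and medical image analysis. When deploying segmentation models in practice, it is critical to test their behaviors in varied and complex scenes in advance. In this paper, we construct an automatic data generation pipeline Gen4Seg to stress-test semantic segmentation models by generating various challenging samples with different attribute changes. Beyond previous evaluation paradigms focusing solely on global weather and style transfer, we investigate variations in both appearance and geometry attributes at the object and image level. These include object color, material, size, position, as well as image-level variations such as weather and style. To achieve this, we propose to edit visual attributes of existing real images with precise control of structural information, empowered by diffusion models. In this way, the existing segmentation labels can be reused for the edited images, which greatly reduces the labor costs. Using our pipeline, we construct two new benchmarks, Pascal-EA and COCO-EA. We benchmark a wide variety of semantic segmentation models, spanning from closed-set models to open-vocabulary large models.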
We have several key findings: 1) advanced open-vocabulary models do not exhibit greater robustness compared to closed-set methods under geometric variations; 2) data augmentation techniques, such as CutOut and CutMix, are limited in enhancing robustness against appearance variations; 3) our pipeline can also be employed as a data augmentation tool and improves both in-distribution and out-of-distribution performance. Our work suggests the potential of generative models as effective tools for automatically analyzing segmentation models, and we hope our findings will assist practitioners and researchers in developing more robust and reliable segmentation models.
[349] RA-Det: Towards Universal Detection of AI-Generated Images via Robustness Asymmetry
Xinchang Wang,Yunhao Chen,Yuechen Zhang,Congcong Bian,Zihao Guo,Xingjun Ma,Hui Li
Main category: cs.CV
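TL;DR: This paper proposes RA-Det, a behavior-based detection method for generated images that relies on robustness asymmetry, the difference in how images respond to perturbations; it does not depend on appearance features or model fingerprints, generalizes across models, and achieves significant gains across 14 generative models and against more than 10 strong baselines.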
Details
Motivation: As generated images become increasingly realistic, existing appearance-based forgery detectors lose stability, which motivates detection based on image behavior, that is, the response to controlled perturbations.
Method: Observes that natural images keep more stable semantic representations under small structured perturbations, whereas generated images show larger feature drift, a phenomenon termed robustness asymmetry; on this basis proposes the RA-Det framework, which turns this behavioral difference into a discriminative signal, and provides a theoretical analysis linking it to memorization tendencies in generative models.
Result: Evaluated across 14 generative models and against 10+ strong detectors, RA-Det improves average performance by 7.81%; it is data- and model-agnostic, needs no generator fingerprints, and transfers to unseen generators.
Conclusion: Robustness asymmetry is a stable, general cue for synthetic-image detection, and well-designed probes can turn it into a practical, universal detector.
Abstract: Recent image generators produce photo-realistic content that undermines the reliability of downstream recognition systems. As visual appearance cues become less pronounced, appearance-driven detectors that rely on forensic cues or high-level representations lose stability. This motivates a shift from appearance to behavior, focusing on how images respond to controlled perturbations rather than how they look. In this work, we identify a simple and universal behavioral signal. Natural images preserve stable semantic representations under small, structured perturbations, whereas generated images exhibit markedly larger feature drift. We refer to this phenomenon as robustness asymmetry and provide a theoretical analysis that establishes a lower bound connecting this asymmetry to memorization tendencies in generative models, explaining its prevalence across architectures. Building on this insight, we introduce Robustness Asymmetry Detection (RA-Det), a behavior-driven detection framework that converts robustness asymmetry into a reliable decision signal. Evaluated across 14 diverse generative models and against more than 10 strong detectors, RA-Det achieves superior performance, improving the average performance by 7.81 percent. The method is data- and model-agnostic, requires no generator fingerprints, and transfers across unseen generators. Together, these results indicate that robustness asymmetry is a stable, general cue for synthetic-image detection and that carefully designed probing can turn this cue into a practical, universal detector.
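The feature-drift signal behind robustness asymmetry can be illustrated with a minimal Python sketch. It is not the RA-Det implementation: the encoder, the perturbation, the number of trials, and the decision threshold are all hypothetical stand-ins, and a real system would use a pretrained vision encoder and the structured perturbations the paper describes.

```python
import math
import random


def cosine_distance(a, b):
    """1 - cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def feature_drift(image, encoder, perturb, trials=8, seed=0):
    """Average feature change of `encoder` under repeated small perturbations."""
    rng = random.Random(seed)
    base = encoder(image)
    drifts = [cosine_distance(base, encoder(perturb(image, rng)))
              for _ in range(trials)]
    return sum(drifts) / len(drifts)


def flag_as_generated(drift, threshold):
    """Hypothetical decision rule: large drift suggests a generated image."""
    return drift > threshold


# Toy stand-ins so the sketch runs without any vision library.
def toy_encoder(img):
    return [sum(img), sum(v * v for v in img), max(img)]


def toy_noise(img, rng, eps=0.3):
    return [v + rng.uniform(-eps, eps) for v in img]


image = [0.2, 0.5, 0.9, 0.4]  # a flattened "image" of four pixels
drift = feature_drift(image, toy_encoder, toy_noise)
print(flag_as_generated(drift, threshold=0.5))
```

The threshold here is arbitrary; in practice it would have to be calibrated on held-out natural and generated images.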
The source code is publicly available on GitHub.
[350] Training-Free Spatio-temporal Decoupled Reasoning Video Segmentation with Adaptive Object Memory
Zhengtong Zhu,Jiaqing Fan,Zhixuan Liu,Fanzhang Li
Main category: cs.CV
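TL;DR: This paper proposes SDAM, a training-free spatio-temporal decoupled framework for reasoning video object segmentation that uses pre-trained models, an adaptive object memory module, and a spatio-temporal decoupling strategy to significantly improve segmentation accuracy and temporal stability.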
Details
Motivation: Existing methods rely on fine-tuning multimodal large language models, which is resource-intensive; they also couple the processing of spatial and temporal information, which hurts temporal stability.
Method: Proposes the training-free SDAM framework, comprising an Adaptive Object Memory module (which selects and memorizes key objects based on motion cues) and a spatio-temporal decoupling mechanism (precise localization and segmentation in the spatial domain; stable cross-frame propagation driven by key-object information in the temporal domain).
Result: Achieves excellent results on five benchmark datasets: Ref-YouTubeVOS, Ref-DAVIS17, MeViS, ReasonVOS, and ReVOS.
Conclusion: Without any fine-tuning, SDAM surpasses methods that require training, significantly improving the accuracy and temporal stability of reasoning video object segmentation while remaining efficient.
Abstract: Reasoning Video Object Segmentation (ReasonVOS) is a challenging task that requires stable object segmentation across video sequences using implicit and complex textual inputs. Previous methods fine-tune Multimodal Large Language Models (MLLMs) to produce segmentation outputs, which demands substantial resources. Additionally, some existing methods are coupled in the processing of spatio-temporal information, which affects the temporal stability of the model to some extent. To address these issues, we propose Training-Free Spatio-temporal Decoupled Reasoning Video Segmentation with Adaptive Object Memory (SDAM). We aim to design a training-free reasoning video segmentation framework that outperforms existing methods requiring fine-tuning, using only pre-trained models. Meanwhile, we propose an Adaptive Object Memory module that selects and memorizes key objects based on motion cues in different video sequences. Finally, we propose Spatio-temporal Decoupling for stable temporal propagation. In the spatial domain, we achieve precise localization and segmentation of target objects, while in the temporal domain, we leverage key object temporal information to drive stable cross-frame propagation. Our method achieves excellent results on five benchmark datasets, including Ref-YouTubeVOS, Ref-DAVIS17, MeViS, ReasonVOS, and ReVOS.
[351] PathMoE: Interpretable Multimodal Interaction Experts for Pediatric Brain Tumor Classification
Jian Yu,Joakim Nguyen,Jinrui Fang,Awais Naeem,Zeyuan Cao,Sanjay Krishnan,Nicholas Konz,Tianlong Chen,Chandra Krishnan,Hairong Wang,Edward Castillo,Ying Ding,Ankita Shukla
Main category: cs.CV
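TL;DR: This paper proposes PathMoE, an interpretable multimodal framework that fuses H&E slides, pathology reports, and cell graphs, using an interaction-aware mixture-of-experts architecture to improve classification of pediatric central nervous system tumors and to enhance model interpretability.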
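A generic input-dependent gate over interaction experts can be sketched as follows. This Python example shows the standard softmax-gated mixture only, not PathMoE's actual architecture: the three expert outputs stand in for hypothetical uniqueness, redundancy, and synergy experts, and the gate logits are passed in directly rather than produced by a learned network.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of gate logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gated_mixture(expert_outputs, gate_logits):
    """Weight each expert's output vector by its gate probability and sum."""
    weights = softmax(gate_logits)
    dim = len(expert_outputs[0])
    mixed = [sum(w * out[i] for w, out in zip(weights, expert_outputs))
             for i in range(dim)]
    return mixed, weights


# Hypothetical outputs of uniqueness, redundancy, and synergy experts.
experts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Gate logits for one sample; a real gate would compute these from the input.
mixed, weights = gated_mixture(experts, [2.0, 0.5, 0.5])
print(weights)  # per-sample expert weights, readable as an explanation
print(mixed)
```

Because the weights are computed per sample and sum to one, they can be read directly as how much each interaction type contributed to that prediction.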
Details
Motivation: Accurate classification of pediatric central nervous system tumors is difficult because of histological complexity and limited training data; existing pathology foundation models fail to fully exploit complementary information such as clinical text and tissue microarchitecture.
Method: Proposes the PathMoE framework, which builds on state-of-the-art foundation models for each modality and adopts an interaction-aware Mixture-of-Experts architecture; an input-dependent gating mechanism dynamically weights uniqueness, redundancy, and synergy across modalities, integrating H&E whole-slide images, pathology text, and nuclei-level cell graphs.
Result: On the internal PBT dataset, fusing the three modalities raises macro-F1 from 0.762 to 0.799 (+0.037); on the external TCGA datasets, WSI plus graph raises macro-F1 from 0.668 to 0.709 (+0.041), clearly outperforming image-only baselines while providing sample-level interpretability.
Conclusion: PathMoE effectively improves pediatric brain tumor classification and is of particular clinical value for rare subtypes; the interpretability of its modality interactions strengthens clinician trust and diagnostic validation.
Abstract: Accurate classification of pediatric central nervous system tumors remains challenging due to histological complexity and limited training data. While pathology foundation models have advanced whole-slide image (WSI) analysis, they often fail to leverage the rich, complementary information found in clinical text and tissue microarchitecture. To this end, we propose PathMoE, an interpretable multimodal framework that integrates H&E slides, pathology reports, and nuclei-level cell graphs via an interaction-aware mixture-of-experts architecture built on state-of-the-art foundation models for each modality. By training specialized experts to capture modality uniqueness, redundancy, and synergy, PathMoE employs an input-dependent gating mechanism that dynamically weights these interactions, providing sample-level interpretability. We evaluate our framework on two dataset-specific classification tasks on an internal pediatric brain tumor dataset (PBT) and external TCGA datasets. PathMoE improves macro-F1 from 0.762 to 0.799 (+0.037) on PBT when integrating WSI, text, and graph modalities; on TCGA, augmenting WSI with graph knowledge improves macro-F1 from 0.668 to 0.709 (+0.041). These results demonstrate significant performance gains over state-of-the-art image-only baselines while revealing the specific modality interactions driving individual predictions.
This interpretability is particularly critical for rare tumor subtypes, where transparent model reasoning is essential for clinical trust and diagnostic validation.
[352] Pri4R: Learning World Dynamics for Vision-Language-Action Models with Privileged 4D Representation
Jisoo Kim,Jungbin Cho,Sanghyeok Chu,Ananya Bal,Jinhyung Kim,Gunhee Lee,Sihaeng Lee,Seung Hwan Kim,Bohyung Han,Hyunmin Lee,Laszlo A. Jeni,Seungryong Kim
Main category: cs.CV
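TL;DR: This paper proposes Pri4R, which introduces privileged 4D information (3D point track prediction) during training to strengthen the implicit understanding of world dynamics in Vision-Language-Action (VLA) models, improving their modeling of physical interaction without adding inference overhead.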
Details
Motivation: Existing VLA models have strong semantic understanding but lack the ability to model the spatiotemporal dynamics of physical interaction; humans learn both how their bodies move and how the environment responds, whereas these models do not yet have a comparable awareness of world dynamics.
Method: Proposes Pri4R, which adds a lightweight 3D point track head to the VLA model and injects VLA features into it to jointly predict future 3D point trajectories; during training, 4D (3D + time) point tracks serve as the supervision signal so that the shared representation space incorporates evolving scene geometry; at inference the original VLA architecture is reused unchanged, with no extra inputs, outputs, or computational overhead.
Result: Gains of 10% on LIBERO-Long and 40% on RoboCasa; 3D point track prediction is shown to be an effective supervision target for learning action-world dynamics; ablations support the design choices.
Conclusion: Pri4R is a simple, efficient, plug-and-play method that significantly improves the physical interaction capability of VLA models at zero inference overhead, offering a new direction for building embodied agents that better respect physical laws.
Abstract: Humans learn not only how their bodies move, but also how the surrounding world responds to their actions. In contrast, while recent Vision-Language-Action (VLA) models exhibit impressive semantic understanding, they often fail to capture the spatiotemporal dynamics governing physical interaction. In this paper, we introduce Pri4R, a simple yet effective approach that endows VLA models with an implicit understanding of world dynamics by leveraging privileged 4D information during training. Specifically, Pri4R augments VLAs with a lightweight point track head that predicts 3D point tracks. By injecting VLA features into this head to jointly predict future 3D trajectories, the model learns to incorporate evolving scene geometry within its shared representation space, enabling more physically aware context for precise control. Due to its architectural simplicity, Pri4R is compatible with dominant VLA design patterns with minimal changes. During inference, we run the model using the original VLA architecture unchanged; Pri4R adds no extra inputs, outputs, or computational overhead. Across simulation and real-world evaluations, Pri4R significantly improves performance on challenging manipulation tasks, including a +10% gain on LIBERO-Long and a +40% gain on RoboCasa. We further show that 3D point track prediction is an effective supervision target for learning action-world dynamics, and validate our design choices through extensive ablations.
[353] Align-cDAE: Alzheimer's Disease Progression Modeling with Attention-Aligned Conditional Diffusion Auto-Encoder
Ayantika Das,Keerthi Ram,Mohanasankar Sivaprakasam
Main category: cs.CV
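TL;DR: This paper proposes a multimodal alignment framework based on a diffusion autoencoder for more precise modeling of longitudinal brain-image progression in Alzheimer's disease; by explicitly aligning non-imaging conditions with image features and structuring the latent space (separating progression-related and subject-identity subspaces), it improves the anatomical accuracy of generated images in disease-relevant brain regions.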
Details
Motivation: When existing diffusion models use non-imaging modalities (such as clinical data) to generate brain-image disease progression, they lack an explicit mechanism for aligning multimodal information with image features, and their latent space is not structured for disease progression, resulting in imprecise generation control and insufficient anatomical plausibility.
Method: Proposes a diffusion autoencoder framework: 1) an explicit alignment objective makes the model focus on brain regions related to disease progression; 2) the latent space is split into two subspaces that encode progression-related conditions and subject identity, respectively, enabling controllable generation.
Result: The method achieves higher anatomical precision on longitudinal Alzheimer's brain-image generation and modulates progression-specific regions more accurately, validating modality alignment and latent-space structuring.
Conclusion: Explicit multimodal alignment and structured latent representations are key to improving the controllability and biological plausibility of diffusion models for modeling neurodegenerative disease progression.
Abstract: Generative AI framework-based modeling and prediction of longitudinal human brain images offer an efficient mechanism to track neurodegenerative progression, essential for the assessment of diseases like Alzheimer's. Among the existing generative approaches, recent diffusion-based models have emerged as an effective alternative to generate disease progression images. Incorporating multi-modal and non-imaging attributes as conditional information into diffusion frameworks has been shown to improve controllability during such generations. However, existing methods do not explicitly ensure that information from non-imaging conditioning modalities is meaningfully aligned with image features to introduce desirable changes in the generated images, such as modulation of progression-specific regions. Further, more precise control over the generation process can be achieved by introducing progression-relevant structure into the internal representations of the model, which is lacking in existing approaches. To address these limitations, we propose a diffusion autoencoder-based framework for disease progression modeling that explicitly enforces alignment between different modalities. The alignment is enforced by introducing an explicit objective function that enables the model to focus on the regions exhibiting progression-related changes. Further, we devise a mechanism to better structure the latent representational space of the diffusion auto-encoding framework.
Specifically, we assign separate latent subspaces for integrating progression-related conditions and retaining subject-specific identity information, allowing better-controlled image generation. These results demonstrate that enforcing alignment and better structuring of the latent representational space of the diffusion auto-encoding framework leads to more anatomically precise modeling of Alzheimer's disease progression.
[354] TopoMaskV3: 3D Mask Head with Dense Offset and Height Predictions for Road Topology Understanding
Muhammet Esat Kalfaoglu,Halil Ibrahim Ozturk,Ozsel Kilinc,Alptekin Temizel
Main category: cs.CV
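TL;DR: This paper proposes TopoMaskV3, a mask-based method for 3D road topology understanding that uses a dense offset field and a height map for sub-grid correction and direct 3D estimation; it also introduces, for the first time, geographically disjoint data splits and a long-range benchmark, significantly improving generalization and performance.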
Details
Motivation: Existing mask-based road topology methods (such as TopoMaskV2) are limited to 2D and suffer from severe discretization artifacts, requiring fusion with parametric heads; road topology evaluation also suffers from geographic data leakage, which distorts the assessment of generalization.
Method: Proposes TopoMaskV3 with two new dense prediction heads: a dense offset field for sub-grid correction (within the BEV resolution) and a dense height map for direct 3D estimation; also introduces geographically disjoint data splits and a ±100 m long-range benchmark.
Result: Reaches 28.5 OLS on the geographically disjoint benchmark, the current state of the art; analysis shows the mask representation resists geographic overfitting better than Bezier, and that LiDAR fusion helps more at long range and yields larger gains on the original overlapping split, hinting at overlap-induced memorization.
Conclusion: TopoMaskV3 demonstrates that the dense-mask paradigm can evolve into a robust, standalone 3D road topology predictor, and highlights the importance of geographic data splits for fair evaluation.
Abstract: Mask-based paradigms for road topology understanding, such as TopoMaskV2, offer a complementary alternative to query-based methods by generating centerlines via a dense rasterized intermediate representation. However, prior work was limited to 2D predictions and suffered from severe discretization artifacts, necessitating fusion with parametric heads. We introduce TopoMaskV3, which advances this pipeline into a robust, standalone 3D predictor via two novel dense prediction heads: a dense offset field for sub-grid discretization correction within the existing BEV resolution, and a dense height map for direct 3D estimation. Beyond the architecture, we are the first to address geographic data leakage in road topology evaluation by introducing (1) geographically distinct splits to prevent memorization and ensure fair generalization, and (2) a long-range (±100 m) benchmark. TopoMaskV3 achieves state-of-the-art 28.5 OLS on this geographically disjoint benchmark, surpassing all prior methods. Our analysis shows that the mask representation is more robust to geographic overfitting than Bezier, while LiDAR fusion is most beneficial at long range and exhibits larger relative gains on the overlapping original split, suggesting overlap-induced memorization effects.
[355] Cryo-Bench: Benchmarking Foundation Models for Cryosphere Applications
Saurabh Kaushik,Lalit Maurya,Beth Tellman
Main category: cs.CV
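TL;DR: This paper presents Cryo-Bench, the first benchmark of Geo-Foundation Models (GFMs) for cryosphere applications, covering four key components (glaciers, glacial lakes, sea ice, and calving fronts); it systematically evaluates 14 GFMs and UNet/ViT baselines, finding that UNet performs best with a frozen encoder while some GFMs do better in the few-shot setting, and that full fine-tuning needs learning-rate tuning to deliver significant gains.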
Details
Motivation: Benchmarking of existing Geo-Foundation Models (GFMs) on cryosphere tasks is severely limited by the lack of suitable public datasets; a dedicated benchmark is needed to advance the use and improvement of these models in this key part of the Earth system.
Method: Builds Cryo-Bench, a multi-sensor, multi-region benchmark covering four cryospheric components (debris-covered glaciers, glacial lakes, sea ice, and calving fronts); evaluates 14 GFMs and UNet/ViT baselines under three settings: frozen encoder, few-shot (10% of the data), and full fine-tuning (with learning-rate optimization); uses mIoU as the main metric and compares models across five datasets.
Result: With a frozen encoder, UNet reaches an average mIoU of 66.38, ahead of TerraMind (64.02); in the few-shot setting, DOFA (59.53) and TerraMind (56.62) outperform UNet (56.60); full fine-tuning is unstable, but combined with learning-rate optimization it yields an average relative improvement of 12.77% on GLID and CaFFe; GFMs show strong domain adaptation even though their pretraining data contains very few cryosphere samples.
Conclusion: GFMs have practical potential for cryosphere tasks; encoder fine-tuning with hyperparameter optimization is recommended for best performance, while frozen encoders suit rapid deployment; Cryo-Bench provides a standardized evaluation framework and open-source resources for follow-up research.
Abstract: Geo-Foundation Models (GFMs) have been evaluated across diverse Earth observation tasks spanning multiple domains and have demonstrated strong potential for producing reliable maps even with sparse labels. However, benchmarking GFMs for Cryosphere applications has remained limited, primarily due to the lack of suitable evaluation datasets. To address this gap, we introduce Cryo-Bench, a benchmark compiled to evaluate GFM performance across key Cryospheric components. Cryo-Bench includes debris-covered glaciers, glacial lakes, sea ice, and calving fronts, spanning multiple sensors and broad geographic regions. We evaluate 14 GFMs alongside UNet and ViT baselines to assess their advantages, limitations, and optimal usage strategies. With a frozen encoder, UNet achieves the highest average mIoU of 66.38, followed by TerraMind at 64.02, across the five evaluation datasets included in Cryo-Bench. In the few-shot setting (10% input data), GFMs such as DOFA and TerraMind outperform UNet, achieving mIoU scores of 59.53 and 56.62, respectively, compared to UNet's 56.60. When fully fine-tuning GFMs, we observe inconsistent performance across datasets and models. However, tuning the learning rate along with fine-tuning substantially improves GFM performance. For example, evaluation on two representative datasets (GLID and CaFFe) shows an average relative improvement of 12.77%.
Despite having minimal Cryosphere representation in their pretraining data, GFMs exhibit notable domain adaptation capabilities and produce meaningful results across tasks. Based on our findings, we recommend encoder fine-tuning with hyperparameter optimization to achieve the best possible performance, while using frozen encoders when users need quick results without extensive experimentation. (GitHub: https://github.com/Sk-2103/Cryo-Bench)
[356] SkeleGuide: Explicit Skeleton Reasoning for Context-Aware Human-in-Place Image Synthesis
Chuqiao Wu,Jin Song,Yiyun Fei
Main category: cs.CV
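TL;DR: This paper proposes the SkeleGuide framework, which uses explicit skeletal reasoning to address distorted limbs and unnatural poses in generated human images, and introduces the PoseInverter module for fine-grained user control over pose; experiments show it significantly outperforms existing methods in fidelity and scene consistency.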
Details
Motivation: Current generative models struggle to produce human images with plausible structure and natural poses, mainly because they lack explicit reasoning over human skeletal structure.
Method: Proposes the SkeleGuide framework, which jointly trains reasoning and rendering stages and learns to produce an internal pose that acts as a strong structural prior; also designs the PoseInverter module, which decodes the implicit pose into an explicit, editable pose representation.
Result: SkeleGuide significantly outperforms both specialized and general-purpose generative models at producing high-fidelity, context-aware human images.
Conclusion: Explicitly modeling human skeletal structure is a key step toward robust, plausible human image synthesis.
Abstract: Generating realistic and structurally plausible human images in existing scenes remains a significant challenge for current generative models, which often produce artifacts like distorted limbs and unnatural poses. We attribute this systemic failure to an inability to perform explicit reasoning over human skeletal structure. To address this, we introduce SkeleGuide, a novel framework built upon explicit skeletal reasoning. Through joint training of its reasoning and rendering stages, SkeleGuide learns to produce an internal pose that acts as a strong structural prior, guiding the synthesis towards high structural integrity. For fine-grained user control, we introduce PoseInverter, a module that decodes this internal latent pose into an explicit and editable format. Extensive experiments demonstrate that SkeleGuide significantly outperforms both specialized and general-purpose models in generating high-fidelity, contextually aware human images. Our work provides compelling evidence that explicitly modeling skeletal structure is a fundamental step towards robust and plausible human image synthesis.
[357] InterCoG: Towards Spatially Precise Image Editing with Interleaved Chain-of-Grounding Reasoning
Yecong Wan,Fan Li,Chunwei Wang,Hao Wu,Mingwen Shao,Wangmeng Zuo
Main category: cs.CV
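TL;DR: This paper proposes the InterCoG framework, which combines spatial-relation reasoning in text with visual grounding to enable fine-grained image editing in complex multi-entity scenes, and builds the GroundEdit-45K dataset and the GroundEdit-Bench evaluation benchmark.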
Details
Motivation: Existing unified editing models struggle with fine-grained edits that require spatial reasoning in complex multi-entity scenes where targets are not salient.
Method: Proposes the text-vision Interleaved Chain-of-Grounding (InterCoG) framework: it first reasons over spatial relations in text to determine the target's location and identity, then performs visual grounding by generating bounding boxes and masks, and finally rewrites the editing description; it adds two auxiliary training modules, multimodal grounding reconstruction supervision and multimodal grounding reasoning alignment; and it builds the GroundEdit-45K dataset and the GroundEdit-Bench benchmark.
Result: Achieves more precise image editing in spatially complex, multi-entity scenes; experiments confirm the method's advantage.
Conclusion: InterCoG improves the accuracy and interpretability of fine-grained image editing in complex real-world scenes and offers a new paradigm for editing tasks based on spatial reasoning.
Abstract: Emerging unified editing models have demonstrated strong capabilities in general object editing tasks. However, it remains a significant challenge to perform fine-grained editing in complex multi-entity scenes, particularly those where targets are not visually salient and require spatial reasoning. To this end, we propose InterCoG, a novel text-vision Interleaved Chain-of-Grounding reasoning framework for fine-grained image editing in complex real-world scenes. The key insight of InterCoG is to first perform object position reasoning solely within text that includes spatial relation details to explicitly deduce the location and identity of the edited target. It then conducts visual grounding via highlighting the editing targets with generated bounding boxes and masks in pixel space, and finally rewrites the editing description to specify the intended outcomes. To further facilitate this paradigm, we propose two auxiliary training modules: multimodal grounding reconstruction supervision and multimodal grounding reasoning alignment to enforce spatial localization accuracy and reasoning interpretability, respectively. We also construct GroundEdit-45K, a dataset comprising 45K grounding-oriented editing samples with detailed reasoning annotations, and GroundEdit-Bench for grounding-aware editing evaluation. Extensive experiments substantiate the superiority of our approach in highly precise edits under spatially intricate and multi-entity scenes.
[358] PPEDCRF: Privacy-Preserving Enhanced Dynamic CRF for Location-Privacy Protection for Sequence Videos with Minimal Detection Degradation
Bo Ma,Jinsong Wu,Weiqi Yan,Catherine Shi,Minh Nguyen
Main category: cs.CV
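TL;DR: This paper proposes the PPEDCRF framework, which protects the location privacy of dashcam videos by injecting calibrated perturbations only into location-sensitive background regions while preserving performance on foreground object detection and segmentation.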
Details
Motivation: Even with GPS metadata removed, dashcam videos can still reveal where they were recorded: an attacker can match background visual cues (such as buildings and road layouts) against street-view imagery, creating a risk of location-privacy leakage.
Method: Proposes PPEDCRF, a privacy-preserving enhanced dynamic conditional random field framework comprising: (i) a dynamic CRF that discovers and tracks location-sensitive background regions across frames; (ii) a normalized control penalty (NCP) that allocates perturbation strength according to a hierarchical sensitivity model; (iii) a utility-preserving noise injection module that minimizes interference with detection and segmentation.
Result: Experiments on public driving datasets show that PPEDCRF significantly reduces the success of background-retrieval location inference attacks (e.g., Top-k accuracy) while maintaining competitive object detection (mAP) and segmentation performance compared with baselines such as global noise, white-noise masking, and feature-based anonymization.
Conclusion: PPEDCRF protects the location privacy of driving videos while effectively preserving the utility of downstream vision tasks (such as detection and segmentation), offering a practical privacy-enhancing option for sharing in-vehicle visual data.
Abstract: Dashcam videos collected by autonomous or assisted-driving systems are increasingly shared for safety auditing and model improvement. Even when explicit GPS metadata are removed, an attacker can still infer the recording location by matching background visual cues (e.g., buildings and road layouts) against large-scale street-view imagery. This paper studies location-privacy leakage under a background-based retrieval attacker, and proposes PPEDCRF, a privacy-preserving enhanced dynamic conditional random field framework that injects calibrated perturbations only into inferred location-sensitive background regions while preserving foreground detection utility. PPEDCRF consists of three components: (i) a dynamic CRF that enforces temporal consistency to discover and track location-sensitive regions across frames, (ii) a normalized control penalty (NCP) that allocates perturbation strength according to a hierarchical sensitivity model, and (iii) a utility-preserving noise injection module that minimizes interference to object detection and segmentation. Experiments on public driving datasets demonstrate that PPEDCRF significantly reduces location-retrieval attack success (e.g., Top-k retrieval accuracy) while maintaining competitive detection performance (e.g., mAP and segmentation metrics) compared with common baselines such as global noise, white-noise masking, and feature-based anonymization.
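The idea of perturbing only location-sensitive background pixels can be illustrated with a short Python sketch. It is a simplification under stated assumptions, not the PPEDCRF implementation: the per-pixel sensitivity map is supplied by hand here (in the paper it would come from the dynamic CRF and the NCP), and Gaussian noise with a hypothetical `max_noise` scale stands in for the calibrated perturbation.

```python
import random


def perturb_background(frame, sensitivity, max_noise=20.0, seed=0):
    """Add noise scaled by per-pixel sensitivity; zero sensitivity leaves a pixel untouched.

    frame: rows of grayscale values in [0, 255].
    sensitivity: rows of values in [0, 1]; 0 marks foreground to preserve.
    """
    rng = random.Random(seed)
    out = []
    for row, sens_row in zip(frame, sensitivity):
        new_row = []
        for pixel, s in zip(row, sens_row):
            noise = rng.gauss(0.0, max_noise * s) if s > 0 else 0.0
            # Clamp so the result stays a valid pixel intensity.
            new_row.append(min(255.0, max(0.0, pixel + noise)))
        out.append(new_row)
    return out


frame = [[100, 150], [200, 50]]
# Hypothetical sensitivity map: top-left and bottom-right are foreground.
sensitivity = [[0.0, 1.0], [0.5, 0.0]]
protected = perturb_background(frame, sensitivity)
print(protected[0][0], protected[1][1])  # foreground pixels are unchanged
```

Keeping the foreground untouched is what lets a downstream detector see the same object pixels as before, while the background cues an attacker would match against street-view imagery are degraded.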
The source code is available at https://github.com/mabo1215/PPEDCRF.git
[359] Preference Score Distillation: Leveraging 2D Rewards to Align Text-to-3D Generation with Human Preference
Jiaqi Leng,Shuyuan Tu,Haidong Cao,Sicheng Xie,Daoguo Dong,Zuxuan Wu,Yu-Gang Jiang
Main category: cs.CV
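TL;DR: This paper proposes Preference Score Distillation (PSD), an optimization framework that needs no 3D training data and relies on pretrained 2D reward models to improve human preference alignment in text-to-3D generation; its core idea is to cast preference alignment as a classifier-free guidance (CFG)-style mechanism and to co-optimize preference scores and negative text embeddings.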
Details
Motivation: Human preference alignment is critical for text-to-3D generation yet underexplored; existing methods depend on task-specific fine-tuning, which is hard to carry out in data-scarce 3D domains.
Method: Proposes the Preference Score Distillation (PSD) framework: 1) builds an implicit reward model and treats preference alignment as a CFG-style mechanism to sidestep the incompatibility of pixel-level gradients; 2) jointly updates preference scores and negative text embeddings during optimization, refining the negative text embeddings online to strengthen alignment.
Result: PSD outperforms existing methods on aesthetic metrics, integrates seamlessly with diverse generation pipelines, and is highly extensible.
Conclusion: This work is the first to unify human preference alignment with CFG theory under the score distillation framework, achieving efficient preference alignment without 3D annotated data.
Abstract: Human preference alignment presents a critical yet underexplored challenge for diffusion models in text-to-3D generation. Existing solutions typically require task-specific fine-tuning, posing significant hurdles in data-scarce 3D domains. To address this, we propose Preference Score Distillation (PSD), an optimization-based framework that leverages pretrained 2D reward models for human-aligned text-to-3D synthesis without 3D training data. Our key insight stems from the incompatibility of pixel-level gradients: due to the absence of noisy samples during reward model training, direct application of 2D reward gradients disturbs the denoising process. Noticing that a similar issue occurs in the naive classifier guidance in conditioned diffusion models, we fundamentally rethink preference alignment as a classifier-free guidance (CFG)-style mechanism through our implicit reward model. Furthermore, recognizing that frozen pretrained diffusion models constrain performance, we introduce an adaptive strategy to co-optimize preference scores and negative text embeddings. By incorporating CFG during optimization, online refinement of negative text embeddings dynamically enhances alignment. To our knowledge, we are the first to bridge human preference alignment with CFG theory under the score distillation framework. Experiments demonstrate the superiority of PSD in aesthetic metrics, seamless integration with diverse pipelines, and strong extensibility.
[360] Dehallu3D: Hallucination-Mitigated 3D Generation from Single Image via Cyclic View Consistency Refinement
Xiwen Wang,Shichao Zhang,Hailun Zhang,Ruowei Wang,Mao Li,Chenyu Zhou,Qijun Zhao,Ji-Zhe Zhou
Main category: cs.CV
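TL;DR: This paper proposes Dehallu3D, which introduces a multi-view continuity constraint and an adaptive smoothness mechanism to mitigate hallucinations (such as odd holes or protrusions) in large 3D reconstruction models, and designs the ORM metric to quantify geometric fidelity, significantly improving the quality of generated 3D meshes.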
Details
Motivation: Large 3D reconstruction models suffer from severe hallucinations (such as structural anomalies), leading to failed 3D prints or insufficient immersion in virtual scenes; existing methods reconstruct from sparse multi-view images with large viewpoint gaps and discontinuities, which aggravates hallucination.
Method: Proposes Dehallu3D, a plug-and-play optimization module with an adjacent consistency constraint (ensuring geometric continuity across views) and an adaptive smoothness constraint (preserving sharp geometric detail); also designs the Outlier Risk Measure (ORM) metric to assess geometric fidelity.
Result: Experiments show that Dehallu3D effectively removes hallucinated structural anomalies while preserving detail with high fidelity, improving 3D generation quality.
Conclusion: Modeling continuity across dense intermediate viewpoints together with controllable smoothness can systematically mitigate hallucinations in large 3D models; ORM offers a new dimension for assessing the reliability of 3D generation.
Abstract: Large 3D reconstruction models have revolutionized the 3D content generation field, enabling broad applications in virtual reality and gaming. Just like other large models, large 3D reconstruction models suffer from hallucinations as well, introducing structural outliers (e.g., odd holes or protrusions) that deviate from the input data. However, unlike other large models, hallucinations in large 3D reconstruction models remain severely underexplored, leading to malformed 3D-printed objects or insufficient immersion in virtual scenes. Such hallucinations mainly arise because existing methods reconstruct 3D content from sparsely generated multi-view images, which suffer from large viewpoint gaps and discontinuities. To mitigate hallucinations by eliminating the outliers, we propose Dehallu3D for 3D mesh generation. Our key idea is to design a balanced multi-view continuity constraint to enforce smooth transitions across dense intermediate viewpoints, while avoiding over-smoothing that could erase sharp geometric features. Therefore, Dehallu3D employs a plug-and-play optimization module with two key constraints: (i) adjacent consistency to ensure geometric continuity across views, and (ii) adaptive smoothness to retain fine details. We further propose the Outlier Risk Measure (ORM) metric to quantify geometric fidelity in 3D generation from the perspective of outliers. Extensive experiments show that Dehallu3D achieves high-fidelity 3D generation by effectively preserving structural details while removing hallucinated outliers.
[361] YCDa: YCbCr Decoupled Attention for Real-time Realistic Camouflaged Object Detection
PeiHuang Zheng,Yunlong Zhao,Zheng Cui,Yang Li
Main category: cs.CV
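TL;DR: This paper proposes YCDa, which decouples chrominance and luminance information and dynamically allocates channel attention to improve real-time camouflaged object detection.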
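As a rough illustration of the chrominance-luminance decoupling that YCDa builds on, the sketch below converts an 8-bit RGB pixel to YCbCr with the standard full-range BT.601 (JFIF) coefficients and then scales the luminance and the centered chrominance with separate gates. The function names and the hand-set gate values are hypothetical; in YCDa the weights would come from a learned dynamic attention module whose design is not described in this summary.

```python
def rgb_to_ycbcr(r, g, b):
    """Full-range BT.601 (JFIF) RGB -> YCbCr for 8-bit channel values."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


def decouple(r, g, b):
    """Split a pixel into luminance and chrominance centered at zero."""
    y, cb, cr = rgb_to_ycbcr(r, g, b)
    return y, (cb - 128.0, cr - 128.0)


def gate(luma, chroma, luma_weight, chroma_weight):
    """Hypothetical stand-in for channel attention: scale each stream.

    A small chroma_weight suppresses unreliable color while keeping luminance.
    """
    return luma * luma_weight, tuple(c * chroma_weight for c in chroma)


if __name__ == "__main__":
    luma, chroma = decouple(255, 0, 0)  # a pure red pixel
    print(gate(luma, chroma, luma_weight=1.0, chroma_weight=0.2))
```

Gating the centered chrominance rather than the raw Cb/Cr values means a weight of zero yields a neutral gray of the same brightness instead of a color shift.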
Details
Motivation: Inspired by how the human visual system shifts from relying on color to relying on luminance and texture in camouflaged environments, this work addresses the performance drop of existing detectors when color cues are unreliable.
Method: YCDa is an early-stage feature processing strategy that separates chrominance and luminance information at the input stage and uses dynamic channel attention to amplify discriminative cues and suppress misleading color noise; it is plug-and-play and only requires replacing the first downsampling layer.
Result: YCDa-YOLO12s improves mAP by 112% over the baseline on COD10K-D and sets a new state of the art for real-time camouflaged object detection on the COD-D datasets.
Conclusion: YCDa effectively improves robustness and detection accuracy in complex camouflage scenes with minimal computational overhead, showing strong practicality and generalization.
Abstract: Human vision exhibits remarkable adaptability in perceiving objects under camouflage. When color cues become unreliable, the visual system instinctively shifts its reliance from chrominance (color) to luminance (brightness and texture), enabling more robust perception in visually confusing environments. Drawing inspiration from this biological mechanism, we propose YCDa, an efficient early-stage feature processing strategy that embeds this "chrominance-luminance decoupling and dynamic attention" principle into modern real-time detectors. Specifically, YCDa separates color and luminance information in the input stage and dynamically allocates attention across channels to amplify discriminative cues while suppressing misleading color noise. The strategy is plug-and-play and can be integrated into existing detectors by simply replacing the first downsampling layer. Extensive experiments on multiple baselines demonstrate that YCDa consistently improves performance with negligible overhead as shown in Fig. Notably, YCDa-YOLO12s achieves a 112% improvement in mAP over the baseline on COD10K-D and sets new state-of-the-art results for real-time camouflaged object detection across COD-D datasets.
[362] Sparse View Distractor-Free Gaussian Splatting
Yi Gu,Zhaorui Wang,Jiahang Cao,Jiaxu Wang,Mingle Zhao,Dongjun Ye,Renjing Xu
Main category: cs.CV
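TL;DR: This paper proposes a method for enhancing distractor-free 3D Gaussian Splatting (3DGS) under sparse views, using the geometry foundation model VGGT and vision-language models (VLMs) to supply priors that make static scene reconstruction more robust.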
Details
Motivation: Existing distractor-free 3DGS methods degrade significantly under sparse inputs, mainly because they rely on color-residual heuristics that become unreliable with few observations.
Method: Uses the geometry foundation model VGGT to estimate camera parameters and generate dense initial 3D points; uses VGGT's attention maps for semantic entity matching; uses vision-language models to identify and preserve large static regions; and integrates these priors into existing distractor-free 3DGS frameworks.
Result: Extensive experiments verify that the method suppresses transient distractors under sparse views and improves both the robustness of 3DGS training and reconstruction quality.
Conclusion: Introducing multi-source priors (geometric, semantic, and linguistic) substantially alleviates the degradation of distractor-free 3DGS under sparse views, offering a new approach to efficient static modeling in dynamic scenes.
Abstract: 3D Gaussian Splatting (3DGS) enables efficient training and fast novel view synthesis in static environments. To address challenges posed by transient objects, distractor-free 3DGS methods have emerged and shown promising results when dense image captures are available. However, their performance degrades significantly under sparse input conditions. This limitation primarily stems from the reliance on color residual heuristics to guide the training, which become unreliable with limited observations. In this work, we propose a framework to enhance distractor-free 3DGS under sparse-view conditions by incorporating rich prior information. Specifically, we first adopt the geometry foundation model VGGT to estimate camera parameters and generate a dense set of initial 3D points. Then, we harness the attention maps from VGGT for efficient and accurate semantic entity matching. Additionally, we utilize Vision-Language Models (VLMs) to further identify and preserve the large static regions in the scene. We also demonstrate how these priors can be seamlessly integrated into existing distractor-free 3DGS methods. Extensive experiments confirm the effectiveness and robustness of our approach in mitigating transient distractors for sparse-view 3DGS training.
[363] What Helps -- and What Hurts: Bidirectional Explanations for Vision Transformers
Qin Su,Tie Luo
Main category: cs.CV
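TL;DR: This paper proposes BiCAM, a bidirectional class activation mapping method that captures both supportive and suppressive feature contributions to improve the interpretability of ViT models, and introduces a Positive-to-Negative Ratio (PNR) for lightweight adversarial example detection.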
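To make the idea of signed attributions and an attribution-balance ratio concrete, here is a minimal sketch. The exact definition of PNR is not given in this summary, so the formula below (total positive attribution divided by total magnitude of negative attribution) is an assumption, as are the function names and the threshold-based detector.

```python
def split_signed(attributions):
    """Separate a flat list of signed attributions into supportive and suppressive parts."""
    positive = [max(a, 0.0) for a in attributions]
    negative = [min(a, 0.0) for a in attributions]
    return positive, negative


def positive_to_negative_ratio(attributions, eps=1e-8):
    """Assumed PNR: sum of positive attributions over magnitude of negative ones."""
    positive, negative = split_signed(attributions)
    return sum(positive) / (-sum(negative) + eps)


def looks_adversarial(attributions, low, high):
    """Hypothetical detector: flag inputs whose ratio leaves a range seen on clean data."""
    ratio = positive_to_negative_ratio(attributions)
    return ratio < low or ratio > high


if __name__ == "__main__":
    patch_scores = [0.5, -0.25, 0.25, -0.25]
    print(positive_to_negative_ratio(patch_scores))
```

A detector of this kind needs no retraining of the classifier; only the acceptable ratio range has to be estimated from clean examples.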
Details
Motivation: Vision Transformers (ViTs) perform strongly in visual recognition, but their decision process is hard to interpret; existing CAM methods ignore negative signals, leaving explanations incomplete.
Method: Proposes BiCAM, a bidirectional class activation mapping method that preserves signed (positive/negative) attributions, and introduces a Positive-to-Negative Ratio (PNR) that quantifies attribution balance and supports adversarial example detection without retraining.
Result: On ImageNet, VOC, and COCO, BiCAM improves localization accuracy and attribution faithfulness, remains computationally efficient, and is compatible with multiple ViT variants such as DeiT and Swin.
Conclusion: Modeling both supportive and suppressive evidence is essential for understanding transformer-based vision models; BiCAM provides a more complete, contrastive, and practical approach to ViT interpretability.
Abstract: Vision Transformers (ViTs) achieve strong performance in visual recognition, yet their decision-making remains difficult to interpret. We propose BiCAM, a bidirectional class activation mapping method that captures both supportive (positive) and suppressive (negative) contributions to model predictions. Unlike prior CAM-based approaches that discard negative signals, BiCAM preserves signed attributions to produce more complete and contrastive explanations. BiCAM further introduces a Positive-to-Negative Ratio (PNR) that summarizes attribution balance and enables lightweight detection of adversarial examples without retraining. Across ImageNet, VOC, and COCO, BiCAM improves localization and faithfulness while remaining computationally efficient. It generalizes to multiple ViT variants, including DeiT and Swin. These results suggest the importance of modeling both supportive and suppressive evidence for interpreting transformer-based vision models.
[364] Coarse-to-Fine Monocular Re-Localization in OpenStreetMap via Semantic Alignment
Yuchen Zou,Xiao Hu,Dexing Zhong,Yuqing Tang
Main category: cs.CV
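TL;DR: This paper proposes a hierarchical semantic-alignment re-localization framework based on OpenStreetMap (OSM), which uses DINO-ViT to extract image semantics and align them with OSM across modalities, combined with a coarse-to-fine search strategy that improves localization accuracy and efficiency.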
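The coarse-to-fine idea can be sketched generically: instead of scoring every candidate position on a dense grid, sample a sparse grid, keep the best sample, and repeat on a smaller window around it. The scoring function below is a hypothetical placeholder; in the paper it would be the semantic agreement between image features and an OSM tile, which this summary does not specify.

```python
def coarse_to_fine_search(score, lo, hi, steps=5, levels=4):
    """Maximise score(x, y) over the square [lo, hi]^2.

    Each level samples a steps x steps grid, then zooms in so that the
    next grid spans one cell on each side of the best sample.
    """
    cx = cy = (lo + hi) / 2.0
    half = (hi - lo) / 2.0
    best = (cx, cy)
    for _ in range(levels):
        spacing = 2.0 * half / (steps - 1)
        candidates = [
            (cx - half + i * spacing, cy - half + j * spacing)
            for i in range(steps)
            for j in range(steps)
        ]
        best = max(candidates, key=lambda p: score(p[0], p[1]))
        cx, cy = best
        half = spacing
    return best


if __name__ == "__main__":
    # Placeholder score peaking at (3.2, -1.7); a real system would compare
    # image semantics against the map at each candidate position.
    def toy_score(x, y):
        return -((x - 3.2) ** 2 + (y + 1.7) ** 2)

    print(coarse_to_fine_search(toy_score, -8.0, 8.0, steps=5, levels=6))
```

With 5 x 5 samples per level, six levels cost 150 evaluations, whereas a dense grid at the final 0.125 spacing over the same square would need 129 x 129 = 16,641. The trade-off is that a coarse level can lock onto the wrong region when the score has several peaks.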
Details
Motivation: Traditional monocular re-localization relies on dense maps, which scale poorly and carry privacy risks; OSM is lightweight, privacy-preserving, and globally available, but suffers from cross-modal discrepancies between images and OSM and from the high computational cost of global matching.
Method: Proposes a hierarchical search framework: (1) the semantic awareness of DINO-ViT is used to deconstruct the visual elements of an image and establish semantic relationships with OSM; (2) a coarse-to-fine search paradigm replaces global dense matching, enabling efficient progressive refinement.
Result: Experiments show significant gains in localization accuracy and speed; when trained on a single dataset, the method's 3° orientation recall even exceeds the 5° recall of state-of-the-art methods.
Conclusion: The OSM-based semantic alignment and hierarchical search framework mitigates both the cross-modal discrepancy and the computational cost, offering a privacy-preserving, scalable approach to monocular re-localization.
Abstract: Monocular re-localization plays a crucial role in enabling intelligent agents to achieve human-like perception. However, traditional methods rely on dense maps, which face scalability limitations and privacy risks. OpenStreetMap (OSM), as a lightweight map that protects privacy, offers semantic and geometric information with global scalability. Nonetheless, there are still challenges in using OSM for localization: the inherent cross-modal discrepancies between natural images and OSM, as well as the high computational cost of global map-based localization. In this paper, we propose a hierarchical search framework with semantic alignment for localization in OSM. First, the semantic awareness capability of DINO-ViT is utilised to deconstruct visual elements to establish semantic relationships with OSM. Second, a coarse-to-fine search paradigm is designed to replace global dense matching, enabling efficient progressive refinement. Extensive experiments demonstrate that our method significantly improves both localization accuracy and speed. When trained on a single dataset, the 3° orientation recall of our method even outperforms the 5° recall of state-of-the-art methods.
[365] Adaptive Spectral Feature Forecasting for Diffusion Sampling Acceleration
Jiaqi Han,Juntong Shi,Puheng Li,Haotian Ye,Qiushan Guo,Stefano Ermon
Main category: cs.CV
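TL;DR: This paper proposes Spectrum, which uses Chebyshev polynomials to forecast the latent features of diffusion models globally, enabling training-free, long-range feature reuse with controlled error that substantially speeds up inference while maintaining high generation quality.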
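The core recipe named here, fitting Chebyshev coefficients by ridge regression and evaluating the fit at a later time, can be sketched for a single scalar feature. This is only an illustration under stated assumptions: the mapping of diffusion steps onto [-1, 1], the polynomial degree, the regularisation strength, and all function names are hypothetical, and a real implementation would fit every feature element of the denoiser rather than one number.

```python
def chebyshev_basis(t, degree):
    """T_0..T_degree at t in [-1, 1], using T_{k+1} = 2 t T_k - T_{k-1}."""
    values = [1.0, t]
    for _ in range(2, degree + 1):
        values.append(2.0 * t * values[-1] - values[-2])
    return values[: degree + 1]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ridge(times, features, degree, lam=1e-6):
    """Ridge regression: solve (Phi^T Phi + lam I) c = Phi^T y."""
    phi = [chebyshev_basis(t, degree) for t in times]
    k = degree + 1
    gram = [[sum(p[i] * p[j] for p in phi) + (lam if i == j else 0.0)
             for j in range(k)] for i in range(k)]
    rhs = [sum(p[i] * y for p, y in zip(phi, features)) for i in range(k)]
    return solve(gram, rhs)


def forecast(coeffs, t):
    """Evaluate the fitted Chebyshev expansion at time t."""
    basis = chebyshev_basis(t, len(coeffs) - 1)
    return sum(c * b for c, b in zip(coeffs, basis))


if __name__ == "__main__":
    seen_times = [-1.0, -0.5, 0.0, 0.25, 0.5]   # steps already computed
    seen_feats = [t * t for t in seen_times]    # toy feature trajectory
    coeffs = fit_ridge(seen_times, seen_feats, degree=2)
    print(forecast(coeffs, 0.9))                # a skipped future step
```

Because the fit uses all cached steps at once, the forecast does not chain one local extrapolation onto the next, which is the intuition behind the non-compounding error claim in the abstract.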
Details
Motivation: Diffusion model inference is slow, and existing feature caching methods based on local approximation accumulate error quickly under large skips, degrading generation quality.
Method: Treats the denoiser's latent features as functions of time, models them with Chebyshev polynomials, and fits the basis coefficients via ridge regression to forecast features at multiple future diffusion steps; no additional training is required.
Result: Achieves up to 4.79× and 4.67× inference speedups on FLUX.1 and Wan2.1-14B respectively, with sample quality clearly better than the baselines.
Conclusion: Spectrum is a training-free global feature forecasting method with theoretical error guarantees that relieves the diffusion inference bottleneck while balancing speed and generation quality.
Abstract: Diffusion models have become the dominant tool for high-fidelity image and video generation, yet are critically bottlenecked by their inference speed due to the numerous iterative passes of Diffusion Transformers. To reduce the exhaustive compute, recent works resort to the feature caching and reusing scheme that skips network evaluations at selected diffusion steps by using cached features in previous steps. However, their preliminary design solely relies on local approximation, causing errors to grow rapidly with large skips and leading to degraded sample quality at high speedups. In this work, we propose spectral diffusion feature forecaster (Spectrum), a training-free approach that enables global, long-range feature reuse with tightly controlled error. In particular, we view the latent features of the denoiser as functions over time and approximate them with Chebyshev polynomials. Specifically, we fit the coefficient for each basis via ridge regression, which is then leveraged to forecast features at multiple future diffusion steps. We theoretically reveal that our approach admits more favorable long-horizon behavior and yields an error bound that does not compound with the step size. Extensive experiments on various state-of-the-art image and video diffusion models consistently verify the superiority of our approach. Notably, we achieve up to 4.79× speedup on FLUX.1 and 4.67× speedup on Wan2.1-14B, while maintaining much higher sample quality compared with the baselines.
[366] DriveCombo: Benchmarking Compositional Traffic Rule Reasoning in Autonomous Driving
Enhui Ma,Jiahuan Zhang,Guantian Zheng,Tao Tang,Shengbo Eben Li,Yuhang Lu,Xia Zhou,Xueyang Zhang,Yifei Zhan,Kun Zhan,Zhihui Hao,Xianpeng Lang,Kaicheng Yu
Main category: cs.CV
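TL;DR: This paper proposes DriveCombo, a benchmark for evaluating how well multimodal large language models (MLLMs) reason about combined, concurrent, and conflicting traffic rules, together with a Five-Level Cognitive Ladder and a Rule2Scene Agent that support joint vision-language reasoning; experiments show that existing MLLMs degrade markedly in complex rule scenarios, while fine-tuning on DriveCombo improves rule understanding and downstream planning.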
Details
Motivation: Existing autonomous-driving benchmarks are mostly limited to single-rule recognition (such as traffic sign recognition) and overlook the multi-rule concurrency and conflicts of real driving, so models do well on simple tasks but tend to violate rules in complex real situations.
Method: Proposes DriveCombo, a text- and vision-based benchmark for compositional traffic rule reasoning; builds a systematic Five-Level Cognitive Ladder evaluation framework covering cognitive stages from single-rule understanding to multi-rule integration and conflict resolution; designs the Rule2Scene Agent, which maps language-described traffic rules to dynamic driving scenes for scene-level visual reasoning about traffic rules.
Result: Evaluation of 14 mainstream MLLMs shows that performance drops markedly as task complexity rises, especially under rule conflicts; after splitting the DriveCombo dataset and fine-tuning on the training set, both traffic rule reasoning and downstream planning improve substantially.
Conclusion: DriveCombo fills a gap in evaluating complex traffic rule reasoning and provides a quantifiable way to evaluate and train compliant, intelligent end-to-end autonomous driving systems.
Abstract: Multimodal Large Language Models (MLLMs) are rapidly becoming the intelligence brain of end-to-end autonomous driving systems. A key challenge is to assess whether MLLMs can truly understand and follow complex real-world traffic rules. However, existing benchmarks mainly focus on single-rule scenarios like traffic sign recognition, neglecting the complexity of multi-rule concurrency and conflicts in real driving. Consequently, models perform well on simple tasks but often fail or violate rules in complex real-world situations. To bridge this gap, we propose DriveCombo, a text and vision-based benchmark for compositional traffic rule reasoning. Inspired by human drivers' cognitive development, we propose a systematic Five-Level Cognitive Ladder that evaluates reasoning from single-rule understanding to multi-rule integration and conflict resolution, enabling quantitative assessment across cognitive stages. We further propose a Rule2Scene Agent that maps language-based traffic rules to dynamic driving scenes through rule crafting and scene generation, enabling scene-level traffic rule visual reasoning. Evaluations of 14 mainstream MLLMs reveal performance drops as task complexity grows, particularly during rule conflicts. After splitting the dataset and fine-tuning on the training set, we further observe substantial improvements in both traffic rule reasoning and downstream planning capabilities. These results highlight the effectiveness of DriveCombo in advancing compliant and intelligent autonomous driving systems.
[367] MSP-ReID: Hairstyle-Robust Cloth-Changing Person Re-Identification
Xiangyang He,Lin Wan
Main category: cs.CV
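TL;DR: This paper proposes the MSP framework, which uses hairstyle-oriented augmentation, cloth-preserved random erasing, and region-based parsing attention to address hairstyle distraction in cloth-changing person re-identification, improving robustness and performance.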
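A minimal sketch of ratio-controlled erasing restricted to clothing regions follows. It erases a fixed fraction of individual clothing pixels, which is a simplification: the summary does not say whether CPRE erases pixels or rectangular patches, and the function name, the mask format, and the fill value are assumptions made for illustration.

```python
import random


def cloth_preserved_erase(image, cloth_mask, ratio, fill=0, seed=None):
    """Erase a fraction `ratio` of the pixels marked True in `cloth_mask`.

    `image` and `cloth_mask` are equally sized lists of rows. Pixels outside
    the clothing mask (face, limbs, background) are never modified, so the
    body outline and context are kept.
    """
    rng = random.Random(seed)
    cloth_pixels = [
        (r, c)
        for r, row in enumerate(cloth_mask)
        for c, inside in enumerate(row)
        if inside
    ]
    n_erase = int(round(ratio * len(cloth_pixels)))
    erased = [row[:] for row in image]  # copy so the input stays intact
    for r, c in rng.sample(cloth_pixels, n_erase):
        erased[r][c] = fill
    return erased


if __name__ == "__main__":
    img = [[9] * 4 for _ in range(4)]
    mask = [[False] * 4, [False] * 4, [True] * 4, [True] * 4]
    for row in cloth_preserved_erase(img, mask, ratio=0.5, seed=0):
        print(row)
```

Tying the erased amount to the clothing area, rather than to the whole image, is what keeps the augmentation from wiping out a small clothing region entirely or barely touching a large one.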
Details
Motivation: Existing CC-ReID methods treat the head as a whole without separating face from hair, which leads to over-reliance on volatile hairstyle cues and degraded performance when hairstyles change.
Method: Proposes the MSP framework with three core modules: (1) Hairstyle-Oriented Augmentation (HSOA) generates intra-identity hairstyle diversity to reduce hairstyle dependence; (2) Cloth-Preserved Random Erasing (CPRE) performs ratio-controlled erasing within clothing regions, suppressing texture bias while retaining body structure; (3) Region-based Parsing Attention (RPA) introduces parsing-guided priors that highlight face and limb regions and suppress hair features.
Result: Achieves state-of-the-art performance on multiple CC-ReID benchmarks, validating the effectiveness and robustness of the method.
Conclusion: MSP offers a stable and practical approach to long-term person re-identification that mitigates hairstyle distraction while preserving structural information.
Abstract: Cloth-Changing Person Re-Identification (CC-ReID) aims to match the same individual across cameras under varying clothing conditions. Existing approaches often remove apparel and focus on the head region to reduce clothing bias. However, treating the head holistically without distinguishing between face and hair leads to over-reliance on volatile hairstyle cues, causing performance degradation under hairstyle changes. To address this issue, we propose the Mitigating Hairstyle Distraction and Structural Preservation (MSP) framework. Specifically, MSP introduces Hairstyle-Oriented Augmentation (HSOA), which generates intra-identity hairstyle diversity to reduce hairstyle dependence and enhance attention to stable facial and body cues. To prevent the loss of structural information, we design Cloth-Preserved Random Erasing (CPRE), which performs ratio-controlled erasing within clothing regions to suppress texture bias while retaining body shape and context. Furthermore, we employ Region-based Parsing Attention (RPA) to incorporate parsing-guided priors that highlight face and limb regions while suppressing hair features. Extensive experiments on multiple CC-ReID benchmarks demonstrate that MSP achieves state-of-the-art performance, providing a robust and practical solution for long-term person re-identification.
[368] QCAgent: An agentic framework for quality-controllable pathology report generation from whole slide image
Rundong Wang,Wei Ba,Ying Zhou,Yingtai Li,Bowen Liu,Baizhi Wang,Yuhao Wang,Zhidong Yang,Kun Zhang,Rui Yan,S. Kevin Zhou
Main category: cs.CV
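TL;DR: This paper proposes QCAgent, a framework that combines a critique mechanism guided by a user-defined checklist with iterative region re-identification based on text-image semantic retrieval, enabling controllable, verifiable, evidence-grounded pathology report generation.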
Details
Motivation: Existing methods can generate diagnostic descriptions for whole-slide images (WSIs) but cannot align fine-grained statements with localized visual evidence, and they offer no control over which diagnostic details are selected or how they are verified.
Method: Proposes QCAgent, an agentic framework that (i) introduces a customized critique mechanism guided by a user-defined checklist and (ii) iteratively re-identifies informative WSI regions, based on the critique feedback and text-patch semantic retrieval, to enrich and reconcile the report.
Result: Experiments show that, by making report requirements explicitly prompt-defined, constraint-aware, and refined through evidence, QCAgent achieves controllable generation of clinically meaningful, high-coverage pathology reports.
Conclusion: QCAgent provides a quality-controllable, verifiable, and evidence-traceable approach to WSI-driven pathology report generation that is closer to the actual diagnostic workflow of pathologists.
Abstract: Recent methods for pathology report generation from whole-slide images (WSIs) are capable of producing slide-level diagnostic descriptions but fail to ground fine-grained statements in localized visual evidence. Furthermore, they lack control over which diagnostic details to include and how to verify them. Inspired by emerging agentic analysis paradigms and the diagnostic workflow of pathologists, who selectively examine multiple fields of view, we propose QCAgent, an agentic framework for quality-controllable WSI report generation. The core innovations of this framework are as follows: (i) it incorporates a customized critique mechanism guided by a user-defined checklist specifying required diagnostic details and constraints; (ii) it re-identifies informative regions in the WSI based on the critique feedback and text-patch semantic retrieval, a process that iteratively enriches and reconciles the report. Experiments demonstrate that by making report requirements explicitly prompt-defined, constraint-aware, and verifiable through evidence-grounded refinement, QCAgent enables controllable generation of clinically meaningful and high-coverage pathology reports from WSIs.
[369] PromptStereo: Zero-Shot Stereo Matching via Structure and Motion Prompts
Xianqi Wang,Hao Yang,Hangtian Wang,Junda Cheng,Gangwei Xu,Min Lin,Xin Yang
Main category: cs.CV
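TL;DR: This paper proposes the Prompt Recurrent Unit (PRU), which builds on the decoder of monocular depth foundation models and injects monocular structure and stereo motion cues as prompts to strengthen zero-shot generalization in the iterative refinement stage, achieving state-of-the-art zero-shot stereo matching.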
Details
Motivation: Existing methods make insufficient use of monocular depth priors in the iterative refinement stage of stereo matching, and conventional GRU architectures struggle to model them, limiting zero-shot generalization.
Method: Proposes the Prompt Recurrent Unit (PRU), which injects monocular structure and stereo motion cues as prompts into the decoder of a monocular depth foundation model, adding absolute stereo-scale information while preserving the monocular depth priors.
Result: PromptStereo achieves state-of-the-art zero-shot generalization on multiple datasets with comparable or faster inference speed.
Conclusion: Prompt-guided iterative refinement is an effective and promising direction for improving zero-shot stereo matching.
Abstract: Modern stereo matching methods have leveraged monocular depth foundation models to achieve superior zero-shot generalization performance. However, most existing methods primarily focus on extracting robust features for cost volume construction or disparity initialization. At the same time, the iterative refinement stage, which is also crucial for zero-shot generalization, remains underexplored. Some methods treat monocular depth priors as guidance for iteration, but conventional GRU-based architectures struggle to exploit them due to the limited representation capacity. In this paper, we propose Prompt Recurrent Unit (PRU), a novel iterative refinement module based on the decoder of monocular depth foundation models. By integrating monocular structure and stereo motion cues as prompts into the decoder, PRU enriches the latent representations of monocular depth foundation models with absolute stereo-scale information while preserving their inherent monocular depth priors. Experiments demonstrate that our PromptStereo achieves state-of-the-art zero-shot generalization performance across multiple datasets, while maintaining comparable or faster inference speed. Our findings highlight prompt-guided iterative refinement as a promising direction for zero-shot stereo matching.
[370] A Diffusion-Driven Fine-Grained Nodule Synthesis Framework for Enhanced Lung Nodule Detection from Chest Radiographs
Aryan Goyal,Shreshtha Singh,Ashish Mittal,Manoj Tadepalli,Piyush Kumar,Preetham Putha
Main category: cs.CV
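TL;DR: This paper proposes a synthetic lung nodule generation framework based on diffusion models and low-rank adaptation (LoRA) that supports fine-grained control over nodule size, shape, and radiological characteristics (such as texture and boundary), and improves LoRA composition with an orthogonality loss, significantly boosting downstream nodule detection.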
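One common way to express an orthogonality penalty between two LoRA adapters is the squared Frobenius norm of the product of their down-projection matrices, sketched below. The exact form of the paper's loss is not given in this summary, so this formula, the choice to penalise the down-projections, and the function names are all assumptions.

```python
def matmul_transpose(a, b):
    """Return a @ b^T for matrices stored as lists of rows."""
    return [
        [sum(x * y for x, y in zip(row_a, row_b)) for row_b in b]
        for row_a in a
    ]


def orthogonality_penalty(a1, a2):
    """Squared Frobenius norm of A1 A2^T.

    The value is zero exactly when every row of A1 is orthogonal to every
    row of A2, i.e. when the two adapters read from non-overlapping
    directions of the input space.
    """
    overlap = matmul_transpose(a1, a2)
    return sum(v * v for row in overlap for v in row)


if __name__ == "__main__":
    texture_lora = [[1.0, 0.0, 0.0, 0.0]]    # rank-1 adapter, hypothetical
    boundary_lora = [[0.0, 1.0, 0.0, 0.0],
                     [0.0, 0.0, 1.0, 0.0]]   # rank-2 adapter, hypothetical
    print(orthogonality_penalty(texture_lora, boundary_lora))
```

Added to the training objective with a small weight, a term like this discourages two characteristic-specific adapters from competing for the same parameter directions when they are composed.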
Details
Motivation: Lung nodules appear faint and highly variable on chest radiographs, real annotated data are expensive to obtain, and existing synthesis methods lack fine-grained control over radiological characteristics, so they do little to relieve data scarcity.
Method: Proposes a diffusion-based nodule synthesis framework: (1) a base diffusion model is trained conditioned on nodule masks to control size and shape; (2) a dedicated LoRA module is trained for each radiological characteristic (such as texture and boundary); (3) for joint control of multiple characteristics, the LoRA composition strategy is improved with an orthogonality loss that addresses overlapping attention regions and non-orthogonal parameter spaces.
Result: Experiments on in-house and public datasets show significantly improved downstream nodule detection; radiologist evaluations confirm fine-grained controllability of the generated nodules; the method outperforms existing CXR nodule generation approaches on multiple quantitative metrics.
Conclusion: The proposed diffusion-plus-LoRA framework and its orthogonalized composition strategy achieve controllable multi-characteristic lung nodule synthesis, offering a new approach to data augmentation in medical imaging.
Abstract: Early detection of lung cancer in chest radiographs (CXRs) is crucial for improving patient outcomes, yet nodule detection remains challenging due to their subtle appearance and variability in radiological characteristics like size, texture, and boundary. For robust analysis, this diversity must be well represented in training datasets for deep learning based Computer-Assisted Diagnosis (CAD) systems. However, assembling such datasets is costly and often impractical, motivating the need for realistic synthetic data generation. Existing methods lack fine-grained control over synthetic nodule generation, limiting their utility in addressing data scarcity. This paper proposes a novel diffusion-based framework with low-rank adaptation (LoRA) adapters for characteristic-controlled nodule synthesis on CXRs. We begin by addressing size and shape control through nodule-mask-conditioned training of the base diffusion model. To achieve individual characteristic control, we train separate LoRA modules, each dedicated to a specific radiological feature. However, since nodules rarely exhibit isolated characteristics, effective multi-characteristic control requires a balanced integration of features. We address this by leveraging the dynamic composability of LoRAs and revisiting existing merging strategies. Building on this, we identify two key issues: overlapping attention regions and non-orthogonal parameter spaces. To overcome these limitations, we introduce a novel orthogonality loss term during LoRA composition training. Extensive experiments on both in-house and public datasets demonstrate improved downstream nodule detection. Radiologist evaluations confirm the fine-grained controllability of our generated nodules, and across multiple quantitative metrics, our method surpasses existing nodule generation approaches for CXRs.
[371] FastLightGen: Fast and Light Video Generation with Fewer Steps and Parameters
Shao Shitong,Gu Yufei,Xie Zeke
Main category: cs.CV
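TL;DR: This paper proposes FastLightGen, an algorithm that jointly distills model size and inference steps to compress large video generation models into fast, lightweight versions, substantially improving inference efficiency while maintaining visual quality.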
Details
Motivation: Existing video generation models are computationally expensive and hard to deploy; prior acceleration methods optimize either the number of sampling steps or the model size alone, leaving joint compression of the two unexplored.
Method: Proposes FastLightGen, which constructs an optimal teacher model and distills both parameter count and sampling steps within a unified framework.
Result: On HunyuanVideo-ATI2V and WanX-TI2V, a generator with 4-step sampling and 30% parameter pruning achieves the best visual quality under a constrained inference budget, and FastLightGen consistently outperforms competing methods.
Conclusion: FastLightGen sets a new state of the art in efficient video generation and demonstrates that jointly compressing model size and sampling steps is both effective and feasible.
Abstract: The recent advent of powerful video generation models, such as Hunyuan, WanX, Veo3, and Kling, has inaugurated a new era in the field. However, the practical deployment of these models is severely impeded by their substantial computational overhead, which stems from enormous parameter counts and the iterative, multi-step sampling process required during inference. Prior research on accelerating generative models has predominantly followed two distinct trajectories: reducing the number of sampling steps (e.g., LCM, DMD, and MagicDistillation) or compressing the model size for more efficient inference (e.g., ICMD). The potential of simultaneously compressing both to create a fast and lightweight model remains an unexplored avenue. In this paper, we propose FastLightGen, an algorithm that transforms large, computationally expensive models into fast, lightweight counterparts. The core idea is to construct an optimal teacher model, one engineered to maximize student performance, within a synergistic framework for distilling both model size and inference steps. Our extensive experiments on HunyuanVideo-ATI2V and WanX-TI2V reveal that a generator using 4-step sampling and 30% parameter pruning achieves optimal visual quality under a constrained inference budget. Furthermore, FastLightGen consistently outperforms all competing methods, establishing a new state-of-the-art in efficient video generation.
[372] DiffusionXRay: A Diffusion and GAN-Based Approach for Enhancing Digitally Reconstructed Chest Radiographs
Aryan Goyal,Ashish Mittal,Pranav Rao,Manoj Tadepalli,Preetham Putha
Main category: cs.CV
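TL;DR: This paper proposes DiffusionXRay, a two-stage image restoration method that combines denoising diffusion probabilistic models (DDPMs) and generative adversarial networks (GANs) to improve the quality of synthetic chest X-rays (such as low-quality images produced by DRR), easing the scarcity of high-quality annotated data for lung nodule diagnosis.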
Details
Motivation: Deep-learning-assisted lung cancer diagnosis depends on large amounts of high-quality annotated data, but real lung nodules (especially small ones) are difficult and costly to annotate; existing synthesis methods such as DRR produce poor-quality X-ray images (blur, loss of structure), which limits model generalization.
Method: Proposes DiffusionXRay. Stage one generates low-quality CXRs in two ways, DDPM-LQ and MUNIT-LQ, posed as a style transfer problem; stage two trains a DDPM for image restoration on paired low-quality and high-quality images. The approach combines the modeling capacity of DDPMs with the detail generation of GANs.
Result: Both quantitative metrics and radiologist assessment show clear gains in image clarity, contrast, and diagnostic value, while subtle but clinically important image features (such as small nodules) are preserved.
Conclusion: DiffusionXRay offers a new approach to the quality bottleneck in medical image synthesis and restoration, and may improve the robustness and clinical applicability of AI diagnostic models trained on synthetic data.
Abstract: Deep learning-based automated diagnosis of lung cancer has emerged as a crucial advancement that enables healthcare professionals to detect and initiate treatment earlier. However, these models require extensive training datasets with diverse case-specific properties. High-quality annotated data is particularly challenging to obtain, especially for cases with subtle pulmonary nodules that are difficult to detect even for experienced radiologists. This scarcity of well-labeled datasets can limit model performance and generalization across different patient populations. Generating digitally reconstructed radiographs (DRRs) from CT scans, which yields synthetic frontal chest X-rays with artificially inserted lung nodules, offers one potential solution. However, this approach suffers from significant image quality degradation, particularly in the form of blurred anatomical features and loss of fine lung field structures. To overcome this, we introduce DiffusionXRay, a novel image restoration pipeline for chest X-ray images that synergistically leverages denoising diffusion probabilistic models (DDPMs) and generative adversarial networks (GANs). DiffusionXRay incorporates a unique two-stage training process. First, we investigate two independent approaches, DDPM-LQ and GAN-based MUNIT-LQ, to generate low-quality CXRs, addressing the challenge of training data scarcity and posing this as a style transfer problem. Subsequently, we train a DDPM-based model on paired low-quality and high-quality images, enabling it to learn the nuances of X-ray image restoration. Our method demonstrates promising results in enhancing image clarity, contrast, and overall diagnostic value of chest X-rays while preserving subtle yet clinically significant artifacts, validated by both quantitative metrics and expert radiological assessment.
[373] CoopDiff: A Diffusion-Guided Approach for Cooperation under Corruptions
Gong Chen,Chaokun Zhang,Pengcheng Lv
Main category: cs.CV
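TL;DR: This paper proposes CoopDiff, a diffusion-based cooperative perception framework that uses a denoising mechanism to improve robustness and generalization under diverse and unpredictable corruptions.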
Details
Motivation: Diverse and unpredictable corruptions in real scenes (such as environmental and sensor distortions) severely weaken the robustness and generalization of cooperative perception, calling for more robust modeling.
Method: CoopDiff adopts a teacher-student paradigm. The teacher performs quality-aware voxel-level early fusion with semantic guidance and uses a diffusion denoiser to produce clean supervision features. The student has a dual-branch diffusion structure: it first separates the ego and cooperative input streams to reconstruct the teacher's targets, then uses ego-guided cross-attention for balanced decoding under degradation.
Result: On two multi-degradation benchmarks with six corruption types each (OPV2Vn and DAIR-V2Xn), CoopDiff outperforms prior methods on every corruption type, lowers the relative corruption error, and supports a tunable trade-off between precision and inference efficiency.
Conclusion: The inherent denoising property of diffusion models effectively strengthens the robustness and adaptability of cooperative perception under complex degradations, and CoopDiff offers a new approach for practical deployment.
Abstract: Cooperative perception lets agents share information to expand coverage and improve scene understanding. However, in real-world scenarios, diverse and unpredictable corruptions undermine its robustness and generalization. To address these challenges, we introduce CoopDiff, a diffusion-based cooperative perception framework that mitigates corruptions via a denoising mechanism. CoopDiff adopts a teacher-student paradigm: the Quality-Aware Teacher performs voxel-level early fusion with Quality of Interest weighting and semantic guidance, then produces clean supervision features via a diffusion denoiser. The Dual-Branch Diffusion Student first separates ego and cooperative streams in encoding to reconstruct the teacher's clean targets. Then, an Ego-Guided Cross-Attention mechanism facilitates balanced decoding under degradation by adaptively integrating ego and cooperative features. We evaluate CoopDiff on two constructed multi-degradation benchmarks, OPV2Vn and DAIR-V2Xn, each incorporating six corruption types, including environmental and sensor-level distortions. Benefiting from the inherent denoising properties of diffusion, CoopDiff consistently outperforms prior methods across all degradation types and lowers the relative corruption error. Furthermore, it offers a tunable balance between precision and inference efficiency.
[374] MVR: Multi-view Video Reward Shaping for Reinforcement Learning
Lirui Luo,Guoxi Zhang,Hongming Xu,Yaodong Yang,Cong Fang,Qing Li
Main category: cs.CV
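TL;DR: This paper proposes the Multi-View Video Reward Shaping (MVR) framework, which uses multi-view videos and the video-text similarity of a frozen pre-trained vision-language model (VLM) to model state relevance, and designs a state-dependent reward shaping method to improve reinforcement learning policies on complex dynamic tasks such as humanoid locomotion and manipulation.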
Details
Motivation: Existing reward augmentation methods based on single images and VLM image-text similarity have two main flaws: linearly adding the scores can change the optimal policy, and single static views struggle to capture dynamic behaviors that span multiple states and viewpoints and are prone to occlusion.
Method: Proposes the MVR framework: (1) capture multi-view videos to represent dynamic behavior; (2) compute video-text similarity with a frozen VLM to build a state relevance function; (3) design a state-dependent reward shaping formulation that automatically reduces the weight of VLM guidance as the task progresses.
Result: MVR clearly outperforms baselines on HumanoidBench (humanoid locomotion) and MetaWorld (manipulation) tasks; ablation studies confirm the value of multi-view video and state-dependent shaping.
Conclusion: Through multi-view video modeling and adaptive reward shaping, MVR mitigates the static-pose bias and viewpoint limitations of image-based methods, providing a more robust and interpretable way to design rewards for complex dynamic tasks.
Abstract: Reward design is of great importance for solving complex tasks with reinforcement learning. Recent studies have explored using image-text similarity produced by vision-language models (VLMs) to augment rewards of a task with visual feedback. A common practice linearly adds VLM scores to task or success rewards without explicit shaping, potentially altering the optimal policy. Moreover, such approaches, often relying on single static images, struggle with tasks whose desired behavior involves complex, dynamic motions spanning multiple visually different states. Furthermore, single viewpoints can occlude critical aspects of an agent's behavior. To address these issues, this paper presents Multi-View Video Reward Shaping (MVR), a framework that models the relevance of states regarding the target task using videos captured from multiple viewpoints. MVR leverages video-text similarity from a frozen pre-trained VLM to learn a state relevance function that mitigates the bias towards specific static poses inherent in image-based methods. Additionally, we introduce a state-dependent reward shaping formulation that integrates task-specific rewards and VLM-based guidance, automatically reducing the influence of VLM guidance once the desired motion pattern is achieved. We confirm the efficacy of the proposed framework with extensive experiments on challenging humanoid locomotion tasks from HumanoidBench and manipulation tasks from MetaWorld, verifying the design choices through ablation studies.
[375] Cross-modal Identity Mapping: Minimizing Information Loss in Modality Conversion via Reinforcement Learning
Haonan Jia,Shichao Dong,Xin Dong,Zenghui Sun,Jin Wang,Jinsong Lan,Xiaoyong Zhu,Bo Zheng,Kaifu Zhang
Main category: cs.CV
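TL;DR: This paper proposes Cross-modal Identity Mapping (CIM), a reinforcement learning framework that improves the image captioning ability of large vision-language models (LVLMs) by measuring the similarity of images retrieved via text search to quantify information loss in modality conversion; it requires no additional annotations and significantly improves relation reasoning on benchmarks such as COCO-LN500.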
Details
Motivation: LVLMs often omit or misrepresent key visual content in image captions, and existing methods struggle to measure the information lost in visual-to-text conversion, mainly because of the modality gap.
Method: Proposes the Cross-modal Identity Mapping (CIM) reinforcement learning framework, which quantifies information loss from two perspectives, Gallery Representation Consistency and Query-gallery Image Relevance, and uses these signals to supervise the LVLM toward an approximate identity mapping from images to captions.
Result: On the COCO-LN500 benchmark, CIM improves the relation reasoning of Qwen2.5-VL-7B by 20% and outperforms supervised fine-tuning.
Conclusion: By introducing a learnable, annotation-free cross-modal consistency objective, CIM effectively reduces information loss in LVLMs and improves the accuracy and detail fidelity of image captions.
Abstract: Large Vision-Language Models (LVLMs) often omit or misrepresent critical visual content in generated image captions. Minimizing such information loss will force LVLMs to focus on image details to generate precise descriptions. However, measuring information loss during modality conversion is inherently challenging due to the modal gap between visual content and text output. In this paper, we argue that the quality of an image caption is positively correlated with the similarity between images retrieved via text search using that caption. Based on this insight, we further propose Cross-modal Identity Mapping (CIM), a reinforcement learning framework that enhances image captioning without requiring additional annotations. Specifically, the method quantitatively evaluates the information loss from two perspectives: Gallery Representation Consistency and Query-gallery Image Relevance. Supervised under these metrics, LVLM minimizes information loss and aims to achieve identity mapping from images to captions. The experimental results demonstrate the superior performance of our method in image captioning, even when compared with Supervised Fine-Tuning. Particularly, on the COCO-LN500 benchmark, CIM achieves a 20% improvement in relation reasoning on Qwen2.5-VL-7B. The code will be released when the paper is accepted.
[376] Towards Principled Dataset Distillation: A Spectral Distribution Perspective
Ruixi Wu,Shaobo Wang,Jiahuan Chen,Zhiyuan Liu,Yicun Yang,Zhaorun Chen,Zekai Li,Kaixin Li,Xinming Wang,Hongzhu Yi,Kai Wang,Linfeng Zhang
Main category: cs.CV
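TL;DR: This paper proposes Class-Aware Spectral Distribution Matching (CSDM), a new method that addresses the performance drop of dataset distillation on long-tailed data. It performs class-aware distribution alignment through the Spectral Distribution Distance (SDD) and uses amplitude-phase decomposition to adaptively enhance the realism of tail classes, markedly improving distillation quality and stability on long-tailed data.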
Details
Motivation: Existing dataset distillation (DD) methods degrade substantially on long-tailed datasets, mainly because distribution discrepancy measures are designed heuristically and imbalanced classes are treated uniformly.
Method: Proposes Class-Aware Spectral Distribution Matching (CSDM), which maps samples into frequency space and defines the Spectral Distribution Distance (SDD); SDD is further decomposed into amplitude and phase to adaptively enhance tail classes.
Result: On CIFAR-10-LT with only 10 images per class, CSDM improves on state-of-the-art methods by 14.0%; when the number of tail-class images falls from 500 to 25, performance drops by only 5.7%, showing strong stability.
Conclusion: By modeling class-aware distribution matching from a spectral perspective, CSDM effectively mitigates the degradation of dataset distillation in long-tailed settings and offers a new approach to efficient training on imbalanced data.
Abstract: Dataset distillation (DD) aims to compress large-scale datasets into compact synthetic counterparts for efficient model training. However, existing DD methods exhibit substantial performance degradation on long-tailed datasets. We identify two fundamental challenges: heuristic design choices for distribution discrepancy measure and uniform treatment of imbalanced classes. To address these limitations, we propose Class-Aware Spectral Distribution Matching (CSDM), which reformulates distribution alignment via the spectrum of a well-behaved kernel function. This technique maps the original samples into frequency space, resulting in the Spectral Distribution Distance (SDD). To mitigate class imbalance, we exploit the unified form of SDD to perform amplitude-phase decomposition, which adaptively prioritizes the realism in tail classes. On CIFAR-10-LT, with 10 images per class, CSDM achieves a 14.0% improvement over state-of-the-art DD methods, with only a 5.7% performance drop when the number of images in tail classes decreases from 500 to 25, demonstrating strong stability on long-tailed data.
[377] Search Multilayer Perceptron-Based Fusion for Efficient and Accurate Siamese Tracking
Tianqi Shen,Huakao Lin,Ning An
Main category: cs.CV
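TL;DR: This paper proposes a lightweight Siamese visual tracker based on multilayer perceptrons (MLPs). Differentiable neural architecture search (DNAS) automatically optimizes channel width and depth, reaching the best accuracy-efficiency trade-off while remaining real-time.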
Details
Motivation: Existing convolution- or Transformer-based Siamese trackers struggle to deliver pixel-level interaction efficiently on resource-constrained devices, leaving accuracy and efficiency out of balance. Method: Designs an MLP-based fusion module for the Siamese neck and builds a hierarchical MLP search space; a customized relaxation strategy lets DNAS decouple channel width from other architectural choices and automatically balance width and depth. Result: The tracker achieves SOTA accuracy-efficiency trade-offs on four general-purpose and three aerial tracking benchmarks while running in real time on both GPUs and NPUs. Conclusion: MLP structures combined with decoupled DNAS can effectively ease the accuracy-efficiency tension in Siamese tracking and offer a new paradigm for edge deployment. Abstract: Siamese visual trackers have recently advanced through increasingly sophisticated fusion mechanisms built on convolutional or Transformer architectures. However, both struggle to deliver pixel-level interactions efficiently on resource-constrained hardware, leading to a persistent accuracy-efficiency imbalance. Motivated by this limitation, we redesign the Siamese neck with a simple yet effective Multilayer Perceptron (MLP)-based fusion module that enables pixel-level interaction with minimal structural overhead. Nevertheless, naively stacking MLP blocks introduces a new challenge: computational cost can scale quadratically with channel width. To overcome this, we construct a hierarchical search space of carefully designed MLP modules and introduce a customized relaxation strategy that enables differentiable neural architecture search (DNAS) to decouple channel-width optimization from other architectural choices. This targeted decoupling automatically balances channel width and depth, yielding a low-complexity architecture. The resulting tracker achieves state-of-the-art accuracy-efficiency trade-offs. It ranks among the top performers on four general-purpose and three aerial tracking benchmarks, while maintaining real-time performance on both resource-constrained Graphics Processing Units (GPUs) and Neural Processing Units (NPUs).
[378] WhisperNet: A Scalable Solution for Bandwidth-Efficient Collaboration
Gong Chen,Chaokun Zhang,Xinyan Zhao
Main category: cs.CV
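TL;DR: WhisperNet is a bandwidth-aware collaborative perception framework: senders generate lightweight saliency metadata, the receiver dynamically plans global feature requests and allocation, and a collaborative feature routing module aligns the information, substantially cutting communication overhead while improving detection performance.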
Details
Motivation: Existing collaborative perception methods are limited by communication bandwidth: fixed-rate compression does not adapt to the environment, and spatial selection improves efficiency at the expense of global context. Method: Proposes a receiver-centric paradigm: senders generate lightweight saliency metadata; the receiver formulates a global request plan that dynamically allocates the bandwidth budget across agents and feature channels; a collaborative feature routing module aligns cross-agent features before fusion. Result: On the OPV2V dataset, AP@0.7 improves by 2.4% with only 0.5% of the communication cost; as a plug-and-play module, it still strengthens strong baselines at 5% bandwidth and remains robust under localization noise. Conclusion: Globally coordinating what to share and where to share it is key to efficient collaborative perception. Abstract: Collaborative perception is vital for autonomous driving yet remains constrained by tight communication budgets. Earlier work reduced bandwidth by compressing full feature maps with fixed-rate encoders, which adapts poorly to a changing environment. It later evolved into spatial selection methods that improve efficiency by focusing on salient regions, but this object-centric approach often sacrifices global context, weakening holistic scene understanding. To overcome these limitations, we introduce WhisperNet, a bandwidth-aware framework that proposes a novel, receiver-centric paradigm for global coordination across agents. Senders generate lightweight saliency metadata, while the receiver formulates a global request plan that dynamically budgets feature contributions across agents and features, retrieving only the most informative features. A collaborative feature routing module then aligns related messages before fusion to ensure structural consistency. Extensive experiments show that WhisperNet achieves state-of-the-art performance, improving AP@0.7 on OPV2V by 2.4% with only 0.5% of the communication cost. As a plug-and-play component, it boosts strong baselines with merely 5% of full bandwidth while maintaining robustness under localization noise. These results demonstrate that globally coordinated allocation across what and where to share is the key to achieving efficient collaborative perception.
[379] Dual Distillation for Few-Shot Anomaly Detection
Le Dong,Qinzhong Tan,Chunlei Li,Jingliang Hu,Yilei Shi,Weisheng Dong,Xiao Xiang Zhu,Lichao Mou
Main category: cs.CV
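TL;DR: This paper proposes D²4FAD, a dual distillation framework for few-shot anomaly detection in medical imaging. It detects anomalies in new tasks from only a few normal reference images, uses a dynamic weighting mechanism to improve performance, and reaches SOTA on a benchmark spanning multiple organs, modalities, and diseases.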
Details
Motivation: Existing unsupervised anomaly detection methods need large amounts of normal training data and generalize poorly across anatomical structures, while in medical settings anomaly annotations are scarce and normal data is hard to obtain, so methods that detect anomalies effectively from few samples are needed. Method: Proposes the dual distillation framework D²4FAD: a pre-trained encoder serves as the teacher network and extracts multi-scale features; the student decoder performs knowledge distillation on query images and self-distillation on support images; a query-conditioned learnable weighting mechanism dynamically assesses the reference value of each support image. Result: On a self-built comprehensive benchmark of 13,084 images covering four organs, four modalities, and five disease categories, D²4FAD significantly outperforms existing methods and sets a new SOTA for few-shot medical anomaly detection. Conclusion: D²4FAD validates dual distillation and dynamic weighting for few-shot medical anomaly detection and offers a practical, robust paradigm for low-resource clinical settings. Abstract: Anomaly detection is a critical task in computer vision with profound implications for medical imaging, where identifying pathologies early can directly impact patient outcomes. While recent unsupervised anomaly detection approaches show promise, they require substantial normal training data and struggle to generalize across anatomical contexts. We introduce D²4FAD, a novel dual distillation framework for few-shot anomaly detection that identifies anomalies in previously unseen tasks using only a small number of normal reference images. Our approach leverages a pre-trained encoder as a teacher network to extract multi-scale features from both support and query images, while a student decoder learns to distill knowledge from the teacher on query images and self-distill on support images. We further propose a learn-to-weight mechanism that dynamically assesses the reference value of each support image conditioned on the query, optimizing anomaly detection performance. To evaluate our method, we curate a comprehensive benchmark dataset comprising 13,084 images across four organs, four imaging modalities, and five disease categories. Extensive experiments demonstrate that D²4FAD significantly outperforms existing approaches, establishing a new state-of-the-art in few-shot medical anomaly detection. Code is available at https://github.com/ttttqz/D24FAD.
[380] Preoperative-to-intraoperative Liver Registration for Laparoscopic Surgery via Latent-Grounded Correspondence Constraints
Ruize Cui,Jialun Pei,Haiqiao Wang,Jun Zhou,Jeremy Yuen-Chun Teoh,Pheng-Ann Heng,Jing Qin
Main category: cs.CV
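TL;DR: This paper proposes Land-Reg, a framework that explicitly learns interpretable 2D-3D anatomical landmark correspondences to improve the registration accuracy and stability of AR navigation in laparoscopic liver surgery.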
Details
Motivation: Existing AR registration methods for laparoscopic liver surgery do not explicitly model reliable 2D-3D geometric correspondences, which results in poor interpretability and unstable registration in clinical use. Method: Proposes Land-Reg: (1) a cross-modal latent alignment module for rigid registration; (2) an uncertainty-enhanced overlap landmark detector that estimates 2D-3D correspondences; (3) a shape-constrained supervision strategy based on reprojection consistency and local-isometric regularization for non-rigid registration, with rendered-mask alignment to enforce global shape consistency. Result: On the P2ILF dataset, Land-Reg outperforms existing methods on both rigid pose estimation and non-rigid deformation. Conclusion: By modeling interpretable landmark correspondences, Land-Reg significantly improves the accuracy, robustness, and clinical applicability of cross-modal registration. Abstract: In laparoscopic liver surgery, augmented reality technology enhances intraoperative anatomical guidance by overlaying 3D liver models from preoperative CT/MRI onto laparoscopic 2D views. However, existing registration methods lack explicit modeling of reliable 2D-3D geometric correspondences supported by latent evidence, leading to limited interpretability and potentially unstable alignment in clinical scenarios. In this work, we introduce Land-Reg, a correspondence-driven deformable registration framework that explicitly learns latent-grounded 2D-3D landmark correspondences as an interpretable intermediate representation to bridge cross-modal alignment. For rigid registration, Land-Reg uses a Cross-modal Latent Alignment module to map multi-modal features into a unified latent space. Further, an Uncertainty-enhanced Overlap Landmark Detector with similarity matching is proposed to robustly estimate explicit 2D-3D landmark correspondences. For non-rigid registration, we design a novel shape-constrained supervision strategy that anchors shape deformation to matched landmarks through reprojection consistency and incorporates local-isometric regularization to alleviate inherent 2D-3D depth ambiguity, while a rendered-mask alignment enforces global shape consistency. Experimental results on the P2ILF dataset demonstrate the superiority of our method on both rigid pose estimation and non-rigid deformation. Our code will be available at https://github.com/cuiruize/Land-Reg.
[381] Learning Domain-Aware Task Prompt Representations for Multi-Domain All-in-One Image Restoration
Guanglu Dong,Chunlei Li,Chao Ren,Jingliang Hu,Yilei Shi,Xiao Xiang Zhu,Lichao Mou
Main category: cs.CV
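TL;DR: This paper proposes DATPRL-IR, the first multi-domain all-in-one image restoration method. Through Domain-Aware Task Prompt Representation Learning it models multiple domains and tasks in a unified way and significantly outperforms existing SOTA methods across several image domains.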
Details
Motivation: Existing all-in-one image restoration (AiOIR) methods are usually confined to a single image domain (such as natural, medical, or remote sensing images) and lack cross-domain generalization; this work extends AiOIR to multi-domain scenarios and addresses knowledge transfer and task adaptation across image domains. Method: Proposes the Domain-Aware Task Prompt Representation Learning framework: it builds a task prompt pool and a domain prompt pool; a Prompt Composition Mechanism (PCM) adaptively selects and combines task prompts and domain prompts for each input image, producing instance-level task and domain representations; the two are fused into a domain-aware task prompt representation that guides a unified restoration network. The domain prompts distill domain priors from multimodal large language models. Result: Across multiple image domains (natural, medical, remote sensing, etc.) and restoration tasks (denoising, super-resolution, deblurring, etc.), DATPRL-IR significantly outperforms existing SOTA methods and shows strong generalization and cross-domain adaptability. Conclusion: DATPRL-IR is the first to achieve multi-domain all-in-one image restoration, validates domain-aware prompt representation learning, and offers a new paradigm for general image restoration models. Abstract: Recently, significant breakthroughs have been made in all-in-one image restoration (AiOIR), which can handle multiple restoration tasks with a single model. However, existing methods typically focus on a specific image domain, such as natural scenes, medical imaging, or remote sensing. In this work, we aim to extend AiOIR to multiple domains and propose the first multi-domain all-in-one image restoration method, DATPRL-IR, based on our proposed Domain-Aware Task Prompt Representation Learning. Specifically, we first construct a task prompt pool containing multiple task prompts, in which task-related knowledge is implicitly encoded. For each input image, the model adaptively selects the most relevant task prompts and composes them into an instance-level task representation via a prompt composition mechanism (PCM). Furthermore, to endow the model with domain awareness, we introduce another domain prompt pool and distill domain priors from multimodal large language models into the domain prompts. PCM is utilized to combine the adaptively selected domain prompts into a domain representation for each input image. Finally, the two representations are fused to form a domain-aware task prompt representation which can make full use of both specific and shared knowledge across tasks and domains to guide the subsequent restoration process. Extensive experiments demonstrate that our DATPRL-IR significantly outperforms existing SOTA image restoration methods, while exhibiting strong generalization capabilities.
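The select-and-compose step described in the abstract above can be illustrated with a small sketch. This is not the authors' implementation: the function name `compose_prompts`, the cosine-similarity matching against per-prompt keys, the `top_k` selection, and the softmax weighting are all assumptions chosen to show one plausible way a prompt pool could be queried per input image.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)

def compose_prompts(query, keys, prompts, top_k=2):
    """Hypothetical prompt composition: pick the top_k prompts whose keys
    best match the image query, then blend them with softmax weights."""
    scores = [cosine(query, key) for key in keys]
    chosen = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Softmax over the selected scores only (assumed weighting scheme).
    peak = max(scores[i] for i in chosen)
    exps = [math.exp(scores[i] - peak) for i in chosen]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(prompts[0])
    composed = [sum(w * prompts[i][d] for w, i in zip(weights, chosen))
                for d in range(dim)]
    return chosen, weights, composed

# Toy pool: three prompts with 2-D keys and 3-D prompt vectors (made-up values).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
prompts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
chosen, weights, task_repr = compose_prompts([0.9, 0.1], keys, prompts, top_k=2)
```

In this toy setup the same function could be called once on a task pool and once on a domain pool; how the two resulting representations are fused is not specified in the abstract, so it is left out here.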
Code is available at https://github.com/GuangluDong0728/DATPRL-IR.
[382] Action-Guided Attention for Video Action Anticipation
Tsung-Ming Tai,Sofia Casarin,Andrea Pilzer,Werner Nutt,Oswald Lanz
Main category: cs.CV
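TL;DR: This paper proposes Action-Guided Attention (AGA), a new attention mechanism for video action anticipation. It uses predicted action sequences as queries and keys to steer the model toward past moments relevant to future actions, and fuses current-frame features through a gating function, improving the modeling of latent intentions and generalization.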
Details
Motivation: Existing Transformer-based methods rely on dot-product attention over pixel-level representations, which lacks high-level semantics, overfits to explicit visual cues, fails to capture latent intentions, and generalizes poorly. Method: Proposes Action-Guided Attention (AGA), which uses predicted action sequences as queries and keys to focus attention on past segments relevant to future actions and fuses the current frame embedding through a dedicated gating function; it also supports post-training analysis of action dependencies and counterfactual evidence. Result: On the EPIC-Kitchens-100 benchmark, AGA generalizes well from the validation set to the unseen test set; post-training analysis reveals the action dependencies the model has captured and the counterfactual evidence it has internalized, improving interpretability. Conclusion: By guiding attention with action semantics, AGA improves intention modeling and generalization in video action anticipation and provides a transparent, interpretable basis for its predictions. Abstract: Anticipating future actions in videos is challenging, as the observed frames provide only evidence of past activities, requiring the inference of latent intentions to predict upcoming actions. Existing transformer-based approaches, which rely on dot-product attention over pixel representations, often lack the high-level semantics necessary to model video sequences for effective action anticipation. As a result, these methods tend to overfit to explicit visual cues present in the past frames, limiting their ability to capture underlying intentions and degrading generalization to unseen samples. To address this, we propose Action-Guided Attention (AGA), an attention mechanism that explicitly leverages predicted action sequences as queries and keys to guide sequence modeling. Our approach encourages the attention module to emphasize relevant moments from the past based on the upcoming activity and to combine this information with the current frame embedding via a dedicated gating function. The design of AGA enables post-training analysis of the knowledge discovered from the training set. Experiments on the widely adopted EPIC-Kitchens-100 benchmark demonstrate that AGA generalizes well from validation to unseen test sets. Post-training analysis can further examine the action dependencies captured by the model and the counterfactual evidence it has internalized, offering transparent and interpretable insights into its anticipative predictions.
[383] An Analysis of Multi-Task Architectures for the Hierarchic Multi-Label Problem of Vehicle Model and Make Classification
Alexandru Manole,Laura Diosan
Main category: cs.CV
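TL;DR: This paper studies how multi-task learning (parallel and cascaded architectures) affects deep learning models such as CNNs and Transformers on the hierarchical multi-label task of car make and model classification, and validates its effectiveness on the StanfordCars and CompCars datasets.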
Details
Motivation: Most real-world information is hierarchically structured, yet many current deep learning methods do not exploit this semantically rich structure; inspired by how human learning uses hierarchy, this work explores the potential of multi-task learning for hierarchical classification. Method: Uses parallel and cascaded multi-task architectures with CNN and Transformer models, and systematically evaluates performance on StanfordCars and CompCars under different dropout rates and loss weightings. Result: Multi-task learning improves CNN performance on both datasets; on CompCars it yields significant gains for both CNNs and Transformers. Conclusion: Multi-task learning effectively exploits hierarchical semantic structure and is broadly applicable and practical for fine-grained car classification, with the most pronounced gains on the more complex CompCars dataset. Abstract: Most information in our world is organized hierarchically; however, many Deep Learning approaches do not leverage this semantically rich structure. Research suggests that human learning benefits from exploiting the hierarchical structure of information, and intelligent models could similarly take advantage of this through multi-task learning. In this work, we analyze the advantages and limitations of multi-task learning in a hierarchical multi-label classification problem: car make and model classification. Considering both parallel and cascaded multi-task architectures, we evaluate their impact on different Deep Learning classifiers (CNNs, Transformers) while varying key factors such as dropout rate and loss weighting to gain deeper insight into the effectiveness of this approach. The tests are conducted on two established benchmarks: StanfordCars and CompCars. We observe the effectiveness of the multi-task paradigm on both datasets, improving the performance of the investigated CNN in almost all scenarios. Furthermore, the approach yields significant improvements on the CompCars dataset for both types of models.
[384] NeuroSymb-MRG: Differentiable Abductive Reasoning with Active Uncertainty Minimization for Radiology Report Generation
Rong Fu,Yiqing Lyu,Chunlei Meng,Muge Qi,Yabin Jin,Qi Zhao,Li Bao,Juntao Gao,Fuqian Shi,Nilanjan Dey,Wei Luo,Simon Fong
Main category: cs.CV
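TL;DR: This paper proposes NeuroSymb-MRG, a framework that combines neuro-symbolic abductive reasoning with active uncertainty minimization to generate structured, clinically grounded radiology reports, significantly improving factual consistency and language quality.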
Details
Motivation: Existing methods for automatic radiology report generation suffer from visual-linguistic bias, factual inconsistency, and a lack of explicit multi-hop clinical reasoning, so a more reliable and interpretable generation paradigm is needed. Method: NeuroSymb-MRG maps image features to probabilistic clinical concepts, builds differentiable logic-based reasoning chains, decodes them into templated clauses, and refines the text through retrieval and constrained language-model editing; an active sampling loop based on rule-level uncertainty and diversity supports clinician-in-the-loop adjudication and iterative promptbook refinement. Result: On standard benchmarks, the method significantly outperforms representative baselines on factual consistency and on standard language metrics (such as BLEU and CIDEr). Conclusion: NeuroSymb-MRG validates neuro-symbolic reasoning and human-in-the-loop, uncertainty-driven refinement for clinical report generation and offers a new paradigm for highly reliable AI-assisted diagnostic documentation. Abstract: Automatic generation of radiology reports seeks to reduce clinician workload while improving documentation consistency. Existing methods that adopt encoder-decoder or retrieval-augmented pipelines achieve progress in fluency but remain vulnerable to visual-linguistic biases, factual inconsistency, and lack of explicit multi-hop clinical reasoning. We present NeuroSymb-MRG, a unified framework that integrates NeuroSymbolic abductive reasoning with active uncertainty minimization to produce structured, clinically grounded reports. The system maps image features to probabilistic clinical concepts, composes differentiable logic-based reasoning chains, decodes those chains into templated clauses, and refines the textual output via retrieval and constrained language-model editing. An active sampling loop driven by rule-level uncertainty and diversity guides clinician-in-the-loop adjudication and promptbook refinement. Experiments on standard benchmarks demonstrate consistent improvements in factual consistency and standard language metrics compared to representative baselines.
[385] StepVAR: Structure-Texture Guided Pruning for Visual Autoregressive Models
Keli Liu,Zhendong Wang,Wengang Zhou,Houqiang Li
Main category: cs.CV
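TL;DR: This paper proposes StepVAR, a training-free token pruning framework that accelerates inference of visual autoregressive (VAR) models by jointly considering structural and textural importance, significantly improving high-resolution generation efficiency without sacrificing quality.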
Details
Motivation: The inference cost of existing VAR models grows quadratically at high resolutions, and conventional pruning methods ignore global structural coherence, which degrades semantics. Method: StepVAR uses a lightweight high-pass filter to extract local texture details, applies PCA to preserve global structural information, and introduces a nearest neighbor feature propagation strategy to reconstruct dense feature maps from sparse tokens. Result: Validated on several SOTA text-to-image and text-to-video VAR models, StepVAR delivers substantial inference speedups while maintaining generation quality and outperforms existing acceleration methods. Conclusion: StepVAR is a general, efficient, training-free acceleration scheme for VAR inference that preserves both structural integrity and texture fidelity. Abstract: Visual AutoRegressive (VAR) models based on next-scale prediction enable efficient hierarchical generation, yet the inference cost grows quadratically at high resolutions. We observe that the computationally intensive later scales predominantly refine high-frequency textures and exhibit substantial spatial redundancy, in contrast to earlier scales that determine the global structural layout. Existing pruning methods primarily focus on high-frequency detection for token selection, often overlooking structural coherence and consequently degrading global semantics. To address this limitation, we propose StepVAR, a training-free token pruning framework that accelerates VAR inference by jointly considering structural and textural importance. Specifically, we employ a lightweight high-pass filter to capture local texture details, while leveraging Principal Component Analysis (PCA) to preserve global structural information. This dual-criterion design enables the model to retain tokens critical for both fine-grained fidelity and overall composition. To maintain valid next-scale prediction under sparse tokens, we further introduce a nearest neighbor feature propagation strategy to reconstruct dense feature maps from pruned representations. Extensive experiments on state-of-the-art text-to-image and text-to-video VAR models demonstrate that StepVAR achieves substantial inference speedups while maintaining generation quality.
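As an illustration of the nearest neighbor feature propagation idea mentioned in the abstract above, the sketch below fills every pruned grid position with the feature of its closest retained token. It is only a toy under stated assumptions: the function name `propagate_nearest`, the dictionary-based token layout, the squared Euclidean grid distance, and the tie-breaking rule are hypothetical and not taken from the paper.

```python
def propagate_nearest(kept, height, width):
    """Rebuild a dense height x width feature grid from sparse kept tokens.

    kept maps (row, col) -> feature list for tokens that survived pruning.
    Each pruned position copies the feature of the nearest kept token,
    measured by squared Euclidean distance on grid coordinates (assumed).
    """
    if not kept:
        raise ValueError("at least one token must be kept")
    positions = sorted(kept)  # sorted so that distance ties resolve deterministically
    dense = []
    for r in range(height):
        row = []
        for c in range(width):
            if (r, c) in kept:
                row.append(list(kept[(r, c)]))
            else:
                nearest = min(positions,
                              key=lambda p: (p[0] - r) ** 2 + (p[1] - c) ** 2)
                row.append(list(kept[nearest]))
        dense.append(row)
    return dense

# Toy example: a 3x3 grid where only two corner tokens were kept (made-up values).
kept = {(0, 0): [1.0, 0.0], (2, 2): [0.0, 1.0]}
dense = propagate_nearest(kept, 3, 3)
```

A brute-force search like this costs O(H·W·K) for K kept tokens; a real system would presumably use a vectorized or precomputed index map instead.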
Quantitative and qualitative evaluations consistently show that our method outperforms existing acceleration approaches, validating its effectiveness and general applicability across diverse VAR architectures.
[386] Unifying Heterogeneous Multi-Modal Remote Sensing Detection Via Language-Pivoted Pretraining
Yuxuan Li,Yuming Chen,Yunheng Li,Ming-Ming Cheng,Xiang Li,Jian Yang
Main category: cs.CV
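TL;DR: This paper proposes BabelRS, a framework that uses language as a semantic pivot to decouple multi-modal alignment from downstream task learning, improving training stability and generalization in heterogeneous remote sensing object detection.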
Details
Motivation: Existing methods adopt a late alignment paradigm in which modality alignment and task optimization are tightly coupled, causing unstable training and poor generalization. Method: Proposes BabelRS, which comprises Concept-Shared Instruction Aligning (CSIA), aligning each sensor modality with language as the pivot, and Layerwise Visual-Semantic Annealing (LVSA), progressively aggregating multi-scale visual features to match detection granularity. Result: Experiments show that BabelRS markedly improves training stability and consistently outperforms SOTA methods on multiple benchmarks without extra tricks. Conclusion: A language-driven pretraining framework can effectively decouple modality alignment from task learning and offers a new paradigm for heterogeneous multi-modal remote sensing detection. Abstract: Heterogeneous multi-modal remote sensing object detection aims to accurately detect objects from diverse sensors (e.g., RGB, SAR, Infrared). Existing approaches largely adopt a late alignment paradigm, in which modality alignment and task-specific optimization are entangled during downstream fine-tuning. This tight coupling complicates optimization and often results in unstable training and suboptimal generalization. To address these limitations, we propose BabelRS, a unified language-pivoted pretraining framework that explicitly decouples modality alignment from downstream task learning. BabelRS comprises two key components: Concept-Shared Instruction Aligning (CSIA) and Layerwise Visual-Semantic Annealing (LVSA). CSIA aligns each sensor modality to a shared set of linguistic concepts, using language as a semantic pivot to bridge heterogeneous visual representations. To further mitigate the granularity mismatch between high-level language representations and dense detection objectives, LVSA progressively aggregates multi-scale visual features to provide fine-grained semantic guidance. Extensive experiments demonstrate that BabelRS stabilizes training and consistently outperforms state-of-the-art methods without bells and whistles. Code: https://github.com/zcablii/SM3Det.
[387] Efficient Test-Time Optimization for Depth Completion via Low-Rank Decoder Adaptation
Minseok Seo,Wonjun Lee,Jaehyuk Jang,Changick Kim
Main category: cs.CV
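TL;DR: This paper proposes a lightweight test-time adaptation method for zero-shot depth completion that updates only a low-dimensional subspace of the decoder, markedly improving the balance between efficiency and accuracy.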
Details
Motivation: Existing zero-shot depth completion methods rely either on computationally expensive diffusion-based test-time optimization or on visual-prompt methods that still require repeated forward-backward passes, so inference is slow. Method: Observing that depth foundation models concentrate depth-relevant information in a low-dimensional subspace of the decoder, the method adapts only this subspace at test time, optimized with sparse depth supervision. Result: It achieves SOTA performance on five indoor and outdoor datasets and establishes a new Pareto frontier between accuracy and efficiency for test-time adaptation. Conclusion: Adapting only the decoder's low-dimensional subspace is enough for efficient and accurate zero-shot depth completion, confirming the effectiveness and practicality of the strategy. Abstract: Zero-shot depth completion has gained attention for its ability to generalize across environments without sensor-specific datasets or retraining. However, most existing approaches rely on diffusion-based test-time optimization, which is computationally expensive due to iterative denoising. Recent visual-prompt-based methods reduce training cost but still require repeated forward-backward passes through the full frozen network to optimize input-level prompts, resulting in slow inference. In this work, we show that adapting only the decoder is sufficient for effective test-time optimization, as depth foundation models concentrate depth-relevant information within a low-dimensional decoder subspace. Based on this insight, we propose a lightweight test-time adaptation method that updates only this low-dimensional subspace using sparse depth supervision. Our approach achieves state-of-the-art performance, establishing a new Pareto frontier between accuracy and efficiency for test-time adaptation. Extensive experiments on five indoor and outdoor datasets demonstrate consistent improvements over prior methods, highlighting the practicality of fast zero-shot depth completion.
[388] Downstream Task Inspired Underwater Image Enhancement: A Perception-Aware Study from Dataset Construction to Network Design
Bosen Lin,Feng Gao,Yanwei Yu,Junyu Dong,Qian Du
Main category: cs.CV
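TL;DR: This paper proposes DTI-UIE, a downstream-task-oriented underwater image enhancement framework that combines a human visual perception model with a task-driven loss to improve tasks such as semantic segmentation and object detection.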
Details
Motivation: Most existing underwater image enhancement methods target human visual perception and neglect the reconstruction of high-frequency details that matter for downstream recognition tasks such as segmentation and detection. Method: Designs a two-branch network with a task-aware attention module; adopts multi-stage training and a task-driven perceptual loss; and automatically builds a task-oriented UIE dataset, TI-UIED, using several task-specific networks. Result: Performance improves significantly on downstream tasks including semantic segmentation, object detection, and instance segmentation; the code is open source. Conclusion: DTI-UIE validates the integration of downstream-task priors into image enhancement and offers a new paradigm for task-oriented enhancement. Abstract: In real underwater environments, downstream image recognition tasks such as semantic segmentation and object detection often face challenges posed by problems like blurring and color inconsistencies. Underwater image enhancement (UIE) has emerged as a promising preprocessing approach, aiming to improve the recognizability of targets in underwater images. However, most existing UIE methods mainly focus on enhancing images for human visual perception, frequently failing to reconstruct high-frequency details that are critical for task-specific recognition. To address this issue, we propose a Downstream Task-Inspired Underwater Image Enhancement (DTI-UIE) framework, which leverages a human visual perception model to enhance images effectively for underwater vision tasks. Specifically, we design an efficient two-branch network with a task-aware attention module for feature mixing. The network benefits from a multi-stage training framework and a task-driven perceptual loss. Additionally, inspired by human perception, we automatically construct a Task-Inspired UIE Dataset (TI-UIED) using various task-specific networks. Experimental results demonstrate that DTI-UIE significantly improves task performance by generating preprocessed images that are beneficial for downstream tasks such as semantic segmentation, object detection, and instance segmentation. The code is publicly available at https://github.com/oucailab/DTIUIE.
[389] Non-verbal Real-time Human-AI Interaction in Constrained Robotic Environments
Dragos Costea,Alina Marcu,Cristina Lazar,Marius Leordeanu
Main category: cs.CV
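TL;DR: This paper presents the first framework that generates natural non-verbal interaction between humans and AI in real time. It operates on 2D keypoints and uses lightweight models that run at up to 100 FPS. Experiments show that pretraining significantly reduces motion error, but a statistically distinguishable reality gap remains between AI-generated and human motion, and the performance drop stems mainly from temporal coherence rather than image fidelity.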
Details
Motivation: To examine whether current generative AI models go beyond surface mimicry in non-verbal body motion and genuinely take part in an expressive body-language dialogue. Method: Builds the first real-time human-AI non-verbal interaction generation framework based on 2D body keypoints, using four lightweight architectures that reach 100 FPS on an NVIDIA Orin Nano; trains on 437 human video clips with pretraining on synthetic sequences; evaluates generalization on keypoints extracted from the outputs of text-to-video systems such as SORA and VEO. Result: Pretraining significantly reduces motion error without sacrificing speed; performance drops noticeably on SORA-generated keypoints but much less on VEO, indicating that temporal coherence affects real-world performance more than image fidelity; statistically distinguishable differences remain between AI and human motion. Conclusion: AI-generated body motion has not yet reached human-level statistical fidelity, with temporal modeling as the key bottleneck; real-world interaction should prioritize dynamic coherence of motion over static frame quality. Abstract: We study the ongoing debate regarding the statistical fidelity of AI-generated data compared to human-generated data in the context of non-verbal communication using full body motion. Concretely, we ask if contemporary generative models move beyond surface mimicry to participate in the silent, but expressive dialogue of body language. We tackle this question by introducing the first framework that generates a natural non-verbal interaction between Human and AI in real-time from 2D body keypoints. Our experiments utilize four lightweight architectures which run at up to 100 FPS on an NVIDIA Orin Nano, effectively closing the perception-action loop needed for natural Human-AI interaction. We trained on 437 human video clips and demonstrated that pretraining on synthetically generated sequences reduces motion errors significantly, without sacrificing speed. Yet, a measurable reality gap persists. When the best model is evaluated on keypoints extracted from cutting-edge text-to-video systems, such as SORA and VEO, we observe that performance drops on SORA-generated clips. However, it degrades far less on VEO, suggesting that temporal coherence, not image fidelity, drives real-world performance. Our results demonstrate that statistically distinguishable differences persist between Human and AI motion.
[390] Neural Operator-Grounded Continuous Tensor Function Representation and Its Applications
Ruoyang Su,Xi-Le Zhao,Sheng Liu,Wei-Hao Wu,Yisi Luo,Michael K. Ng
Main category: cs.CV
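TL;DR: This paper proposes a neural operator-grounded continuous tensor function representation (NO-CTR) that replaces the conventional discrete, linear mode-n product with continuous, nonlinear mode-n operators, modeling real-world data more faithfully and reducing discretization error. The authors prove a universal approximation property and demonstrate superior results on several beyond-grid data completion tasks.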
Details
Motivation: Existing continuous tensor function representations are limited by the discrete, linear mode-n product, which restricts their expressive potential and tends to introduce discretization artifacts; a genuinely continuous, nonlinear mode-n operation is needed. Method: Proposes neural operator-based continuous nonlinear mode-n operators that map a continuous core tensor function directly to a continuous target tensor function; builds the NO-CTR representation framework and designs an NO-CTR-based multi-dimensional data completion model. Result: NO-CTR significantly outperforms classic discrete tensor methods and existing continuous tensor methods on completion tasks for multi-spectral images, color videos, multi-resolution Sentinel-2 images, and point clouds. Conclusion: NO-CTR provides a more expressive and theoretically grounded paradigm for continuous tensor function representation that handles regular grids, variable-resolution grids, and grid-free data in a unified way, overcoming the discretization limits of conventional methods. Abstract: Recently, continuous tensor functions have attracted increasing attention because they can represent data both on mesh grids and beyond mesh grids in a unified way. However, since the mode-n product is essentially discrete and linear, the potential of current continuous tensor function representations is still locked. To break this bottleneck, we suggest neural operator-grounded mode-n operators as a continuous and nonlinear alternative to the discrete and linear mode-n product. Instead of mapping the discrete core tensor to the discrete target tensor, the proposed mode-n operator directly maps the continuous core tensor function to the continuous target tensor function, which provides a genuinely continuous representation of real-world data and can ameliorate discretization artifacts. Empowered with continuous and nonlinear mode-n operators, we propose a neural operator-grounded continuous tensor function representation (abbreviated as NO-CTR), which can more faithfully represent complex real-world data compared with classic discrete tensor representations and continuous tensor function representations. Theoretically, we also prove that any continuous tensor function can be approximated by NO-CTR. To examine the capability of NO-CTR, we suggest an NO-CTR-based multi-dimensional data completion model. Extensive experiments across various data on regular mesh grids (multi-spectral images and color videos), on mesh grids with different resolutions (Sentinel-2 images), and beyond mesh grids (point clouds) demonstrate the superiority of NO-CTR.
[391] Affine Correspondences in Stereo Vision: Theory, Practice, and Limitations
Levente Hajder
Main category: cs.CV
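TL;DR: This paper reviews the fundamentals of affine transformations and epipolar geometry, studies how affine transformation accuracy affects 3D reconstruction quality, and proposes new methods for estimating local affine transformations from image directions and the fundamental matrix. Experiments on synthetic and real data (using a calibration object with three perpendicular chessboard planes) show surface normal reconstruction errors of around a few degrees.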
Details
Motivation: Affine transformations have broad potential in stereo vision (for example, estimating surface normals, fundamental and essential matrices, and 3D reconstruction), but the effect of their accuracy on reconstruction quality still needs systematic analysis, and existing local affine transformation estimators can be improved. Method: (1) reviews the theory of affine transformations and epipolar geometry; (2) analyzes how affine estimation error affects 3D reconstruction quality; (3) proposes a new local affine transformation estimator that combines corresponding image directions with the fundamental matrix; (4) designs a dedicated calibration object with three perpendicular chessboard planes for quantitative evaluation on real scenes. Result: On synthetic and real data, quantitative evaluation based on surface normal reconstruction accuracy shows normal estimation errors of a few degrees in typical test scenes; special stereo poses and plane orientations are analyzed in detail. Conclusion: Affine transformation accuracy significantly affects 3D reconstruction quality; the proposed method reaches surface normal accuracy on the order of a few degrees in real scenes, confirming its effectiveness and practicality. Abstract: Affine transformations have been recently used for stereo vision. They can be exploited in various computer vision applications, e.g., when estimating surface normals, homographies, and fundamental and essential matrices. Even full 3D reconstruction can be obtained by using affine correspondences. First, this paper reviews the fundamental statements for affine transformations and epipolar geometry. Then it investigates how the transformation accuracy influences the quality of the 3D reconstruction. In addition, we propose novel techniques for estimating the local affine transformation from corresponding image directions; the fundamental matrix of the processed image pair can also be exploited. Both synthetic and real-data quantitative evaluations are carried out, based on the accuracy of the reconstructed surface normals. For the latter, a special object containing three perpendicular planes with chessboard patterns is constructed. We conclude that the estimation accuracy is around a few degrees for realistic test cases. Special stereo poses and plane orientations are also evaluated in detail.
[392] LEAR: Learning Edge-Aware Representations for Event-to-LiDAR Localization
Kuangyi Chen,Jun Zhang,Yuxi Hu,Yi Zhou,Friedrich Fraundorfer
Main category: cs.CV
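TL;DR: LEAR is a dual-task learning framework for GPS-denied and visually degraded environments. It jointly estimates edge structures and dense event-depth flow fields to align event cameras with LiDAR point clouds, improving localization accuracy.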
Details
Motivation: Event cameras offer high temporal resolution and robustness, but the modality gap between their sparse, asynchronous events and dense LiDAR maps makes direct registration difficult and ill-posed. Method: Proposes LEAR, which injects modality-invariant geometric cues into the motion representation through a cross-modal fusion mechanism and uses an iterative refinement strategy to enforce mutual consistency between edge detection and event-depth flow estimation. Result: On several popular and challenging datasets, LEAR outperforms the best existing methods; code, models, and demo videos are publicly available. Conclusion: With edge-aware, depth-aligned flow fields, LEAR significantly improves the robustness and accuracy of PnP pose solving and effectively bridges the sensing-modality gap between event cameras and LiDAR. Abstract: Event cameras offer high-temporal-resolution sensing that remains reliable under high-speed motion and challenging lighting, making them promising for localization from LiDAR point clouds in GPS-denied and visually degraded environments. However, aligning sparse, asynchronous events with dense LiDAR maps is fundamentally ill-posed, as direct correspondence estimation suffers from modality gaps. We propose LEAR, a dual-task learning framework that jointly estimates edge structures and dense event-depth flow fields to bridge the sensing-modality divide. Instead of treating edges as a post-hoc aid, LEAR couples them with flow estimation through a cross-modal fusion mechanism that injects modality-invariant geometric cues into the motion representation, and an iterative refinement strategy that enforces mutual consistency between the two tasks over multiple update steps. This synergy produces edge-aware, depth-aligned flow fields that enable more robust and accurate pose recovery via Perspective-n-Point (PnP) solvers. On several popular and challenging datasets, LEAR achieves superior performance over the best prior method. The source code, trained models, and demo videos are made publicly available online.
[393] FireRed-OCR Technical Report
Hao Wu,Haoran Lou,Xinyue Li,Zuodong Zhong,Zhaojun Sun,Phellon Chen,Xuanhe Zhou,Kai Zuo,Yibo Chen,Xu Tang,Yao Hu,Boxiang Zhou,Jian Wu,Yongji Wu,Wenxin Yu,Yingmiao Liu,Yuhao Huang,Manjie Xu,Gang Liu,Yidong Ma,Zhichao Sun,Changhao Qiao
Main category: cs.CV
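TL;DR: FireRed-OCR proposes a three-stage progressive training framework that specializes a general vision-language model (Qwen3-VL) into a high-precision structured OCR model. It builds high-quality data with a "Geometry + Semantics" data factory, adds format-constrained reinforcement learning, and reaches a SOTA score of 92.94% on OmniDocBench.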
Details
Motivation: General VLMs are prone to "structural hallucination" when parsing complex documents and fall short of the pixel-level precision and structural integrity that industrial OCR requires; high-quality structured annotations are also scarce. Method: (1) builds a "Geometry + Semantics" data factory that synthesizes a balanced dataset through geometric feature clustering and multi-dimensional tagging; (2) proposes three-stage progressive training: multi-task pre-alignment, then specialized supervised fine-tuning (SFT) to produce full-image Markdown, then format-constrained Group Relative Policy Optimization (GRPO) reinforcement learning to ensure syntactic and structural correctness. Result: An overall score of 92.94% on OmniDocBench v1.5, significantly ahead of DeepSeek-OCR 2 and OCRVerse, with SOTA results on text, formula, table, and reading order metrics; code and model weights are open source. Conclusion: FireRed-OCR shows that a general VLM can be efficiently converted into a structured document parsing expert and provides a systematic methodology and practical path for the "General VLM to Specialized Structural Expert" paradigm. Abstract: We present FireRed-OCR, a systematic framework to specialize general VLMs into high-performance OCR models. Large Vision-Language Models (VLMs) have demonstrated impressive general capabilities but frequently suffer from "structural hallucination" when processing complex documents, limiting their utility in industrial OCR applications. In this paper, we introduce FireRed-OCR, a novel framework designed to transform general-purpose VLMs (based on Qwen3-VL) into pixel-precise structural document parsing experts. To address the scarcity of high-quality structured data, we construct a "Geometry + Semantics" Data Factory. Unlike traditional random sampling, our pipeline leverages geometric feature clustering and multi-dimensional tagging to synthesize and curate a highly balanced dataset, effectively handling long-tail layouts and rare document types. Furthermore, we propose a Three-Stage Progressive Training strategy that guides the model from pixel-level perception to logical structure generation. This curriculum includes: (1) Multi-task Pre-alignment to ground the model's understanding of document structure; (2) Specialized SFT for standardizing full-image Markdown output; and (3) Format-Constrained Group Relative Policy Optimization (GRPO), which utilizes reinforcement learning to enforce strict syntactic validity and structural integrity (e.g., table closure, formula syntax).
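To make the third stage more concrete, here is a minimal sketch of how a format-based reward could feed a group-relative advantage in the GRPO style. The reward function `format_reward` and its two checks (balanced `<table>` tags, an even number of `$$` delimiters) are hypothetical stand-ins for the "table closure, formula syntax" examples in the abstract, not the report's actual reward; the advantage normalizes each reward by the group mean and standard deviation, and the use of the population standard deviation is an assumption.

```python
import statistics

def format_reward(markdown):
    """Hypothetical format reward: 1.0 if tables and display formulas are closed."""
    tables_closed = markdown.count("<table>") == markdown.count("</table>")
    formulas_closed = markdown.count("$$") % 2 == 0
    return 1.0 if (tables_closed and formulas_closed) else 0.0

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std (assumed choice)
    return [(r - mean) / (std + eps) for r in rewards]

# One group of sampled outputs for the same document image (made-up strings).
samples = [
    "<table><tr><td>1</td></tr></table>",  # closed table
    "<table><tr><td>1</td></tr>",          # missing </table>
    "$$E = mc^2$$",                        # closed formula
    "$$E = mc^2",                          # unclosed formula
]
rewards = [format_reward(s) for s in samples]
advantages = group_relative_advantages(rewards)
```

Outputs that satisfy the format checks receive positive advantages relative to their group and malformed ones receive negative advantages, which is the signal a policy-gradient update would then use.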
Extensive evaluations on OmniDocBench v1.5 demonstrate that FireRed-OCR achieves state-of-the-art performance with an overall score of 92.94%, significantly outperforming strong baselines such as DeepSeek-OCR 2 and OCRVerse across text, formula, table, and reading order metrics. We open-source our code and model weights to facilitate the "General VLM to Specialized Structural Expert" paradigm.
[394] GroupEnsemble: Efficient Uncertainty Estimation for DETR-based Object Detection
Yutong Yang,Katarina Popović,Julian Wiederer,Markus Braun,Vasileios Belagiannis,Bin Yang
Main category: cs.CV
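TL;DR: This paper proposes GroupEnsemble, which feeds multiple independent groups of queries in parallel into the decoder of DETR-like models so that a single forward pass efficiently estimates both spatial and semantic uncertainty, balancing accuracy and efficiency.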
Details
Motivation: Existing DETR-like models provide only semantic confidence and lack spatial uncertainty estimates; Deep Ensembles work well but carry a large memory cost, while MC Dropout has high inference latency. An efficient and accurate uncertainty estimation scheme is needed.
Method: GroupEnsemble injects multiple non-interacting groups of object queries into the Transformer decoder in parallel (isolated by an attention mask). Each group independently predicts a complete detection set, and the decoder's inherent parallelism yields ensemble-style uncertainty estimation in a single forward pass.
Result: Validated on Cityscapes (autonomous driving) and COCO (everyday scenes); a hybrid of MC-Dropout and GroupEnsemble outperforms Deep Ensembles on several metrics at significantly lower compute and memory cost.
Conclusion: GroupEnsemble is a lightweight, efficient, and practical approach to uncertainty estimation, particularly suited to resource-constrained real-time detection systems such as autonomous driving, and it provides key reliability support for deploying DETR-like models.
Abstract: Detection Transformer (DETR) and its variants show strong performance on object detection, a key task for autonomous systems. However, a critical limitation of these models is that their confidence scores only reflect semantic uncertainty, failing to capture the equally important spatial uncertainty. This results in an incomplete assessment of the detection reliability. On the other hand, Deep Ensembles can tackle this by providing high-quality spatial uncertainty estimates. However, their immense memory consumption makes them impractical for real-world applications. A cheaper alternative, Monte Carlo (MC) Dropout, suffers from high latency due to the need for multiple forward passes during inference to estimate uncertainty. To address these limitations, we introduce GroupEnsemble, an efficient and effective uncertainty estimation method for DETR-like models. GroupEnsemble simultaneously predicts multiple individual detection sets by feeding additional diverse groups of object queries to the transformer decoder during inference. Each query group is transformed by the shared decoder in isolation and predicts a complete detection set for the same input. An attention mask is applied to the decoder to prevent inter-group query interactions, ensuring each group detects independently to achieve reliable ensemble-based uncertainty estimation. By leveraging the decoder's inherent parallelism, GroupEnsemble efficiently estimates uncertainty in a single forward pass without sequential repetition.
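The inter-group isolation described above amounts to a block-diagonal self-attention mask over the concatenated queries. The sketch below is a minimal pure-Python illustration, not the authors' implementation; the function name and the convention that `True` means "blocked" are assumptions made for this example.

```python
def group_attention_mask(num_groups: int, queries_per_group: int):
    """Return an N x N boolean mask, N = num_groups * queries_per_group.

    mask[i][j] is True when query i must NOT attend to query j,
    i.e. when the two queries belong to different groups.
    """
    # Group index of every query in the concatenated query sequence.
    group_id = [g for g in range(num_groups) for _ in range(queries_per_group)]
    return [[gi != gj for gj in group_id] for gi in group_id]

# Three groups of two queries each: attention stays inside 2x2 diagonal blocks.
mask = group_attention_mask(num_groups=3, queries_per_group=2)
blocked = sum(cell for row in mask for cell in row)
```

Because every group sees the same image features but never another group's queries, the per-group detection sets behave like ensemble members computed in one decoder pass.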
We validated our method under autonomous driving scenes and common daily scenes using the Cityscapes and COCO datasets, respectively. The results show that a hybrid approach combining MC-Dropout and GroupEnsemble outperforms Deep Ensembles on several metrics at a fraction of the cost. The code is available at https://github.com/yutongy98/GroupEnsemble.
[395] Streaming Real-Time Trajectory Prediction Using Endpoint-Aware Modeling
Alexander Prutsch,David Schinagl,Horst Possegger
Main category: cs.CV
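TL;DR: This paper proposes a lightweight, highly accurate streaming trajectory prediction method that uses the endpoints of previous predictions as anchors to propagate temporal context. It avoids multi-stage refinement, delivers continuous and consistent real-time predictions at low latency, and achieves state-of-the-art performance on Argoverse 2.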
Details
Motivation: Most existing trajectory prediction methods operate on independent snapshots and ignore global temporal continuity, whereas autonomous driving requires consistent real-time predictions from continuous streaming input at low latency.
Method: An endpoint-aware temporal context propagation mechanism uses the trajectory endpoints of previous predictions as anchors to guide the scene encoder toward relevant context, without iterative refinement or multi-stage decoding.
Result: State-of-the-art streaming prediction on the Argoverse 2 multi-agent and single-agent benchmarks, with significantly lower inference latency and resource consumption.
Conclusion: Streaming modeling combined with endpoint-guided context propagation balances accuracy, speed, and consistency, making it better suited to deployment in real autonomous driving systems.
Abstract: Future trajectories of neighboring traffic agents have a significant influence on the path planning and decision-making of autonomous vehicles. While trajectory forecasting is a well-studied field, research mainly focuses on snapshot-based prediction, where each scenario is treated independently of its global temporal context. However, real-world autonomous driving systems need to operate in a continuous setting, requiring real-time processing of data streams with low latency and consistent predictions over successive timesteps. We leverage this continuous setting to propose a lightweight yet highly accurate streaming-based trajectory forecasting approach. We integrate valuable information from previous predictions with a novel endpoint-aware modeling scheme. Our temporal context propagation uses the trajectory endpoints of the previous forecasts as anchors to extract targeted scenario context encodings. Our approach efficiently guides its scene encoder to extract highly relevant context information without needing refinement iterations or segment-wise decoding. Our experiments highlight that our approach effectively relays information across consecutive timesteps. Unlike methods using multi-stage refinement processing, our approach significantly reduces inference latency, making it well-suited for real-world deployment. We achieve state-of-the-art streaming trajectory prediction results on the Argoverse 2 multi-agent and single-agent benchmarks, while requiring substantially fewer resources.
[396] CTForensics: A Comprehensive Dataset and Method for AI-Generated CT Image Detection
Yiheng Li,Zichang Tan,Guoqing Xu,Yijun Ye,Yang Yang,Zhen Lei
Main category: cs.CV
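TL;DR: This paper presents the CTForensics dataset and the ESF-CTFD detection model, which aim to improve the generalization and performance of CT image forgery detection by fusing multi-scale spatial and frequency features to identify CT-specific forgery traces efficiently.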
Details
Motivation: Existing research on CT forgery detection is limited by the lack of datasets that reflect real-world generalization requirements, and it reuses detection methods built for natural images, which struggle to capture CT-specific forgery artifacts.
Method: Builds the CTForensics dataset covering ten CT generative methods, and proposes the CNN-based ESF-CTFD model, which comprises a Wavelet-Enhanced Central Stem, a Spatial Process Block (multi-scale feature fusion), and a Frequency Process Block (frequency-domain modeling).
Result: ESF-CTFD outperforms existing methods across multiple CT generative models and shows stronger cross-model generalization.
Conclusion: Together, the CTForensics dataset and the ESF-CTFD model move medical image forgery detection closer to clinical practice and provide key technical support for the safe use of generative AI in medicine.
Abstract: With the rapid development of generative AI in medical imaging, synthetic Computed Tomography (CT) images have demonstrated great potential in applications such as data augmentation and clinical diagnosis, but they also introduce serious security risks. Despite the increasing security concerns, existing studies on CT forgery detection are still limited and fail to adequately address real-world challenges. These limitations are mainly reflected in two aspects: the absence of datasets that can effectively evaluate model generalization to reflect the real-world application requirements, and the reliance on detection methods designed for natural images that are insensitive to CT-specific forgery artifacts. In view of this, we propose CTForensics, a comprehensive dataset designed to systematically evaluate the generalization capability of CT forgery detection methods, which includes ten diverse CT generative methods. Moreover, we introduce the Enhanced Spatial-Frequency CT Forgery Detector (ESF-CTFD), an efficient CNN-based neural network that captures forgery cues across the wavelet, spatial, and frequency domains. First, it transforms the input CT image into three scales and extracts features at each scale via the Wavelet-Enhanced Central Stem. Then, starting from the largest-scale features, the Spatial Process Block gradually performs feature fusion with the smaller-scale ones. Finally, the Frequency Process Block learns frequency-domain information for predicting the final results.
Experiments demonstrate that ESF-CTFD consistently outperforms existing methods and exhibits superior generalization across different CT generative models.
[397] Resolving Blind Inverse Problems under Dynamic Range Compression via Structured Forward Operator Modeling
Muyu Liu,Xuanyu Tian,Chenhe Du,Qing Wu,Hongjiang Wei,Yuyao Zhang
Main category: cs.CV
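TL;DR: This paper proposes CaMB-Diff, which models the forward process of unknown dynamic range compression with a cascaded monotonic Bernstein (CaMB) operator that enforces monotonicity as a hard constraint, and combines it with a plug-and-play diffusion model to achieve zero-shot, high-fidelity, physically consistent radiometric recovery.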
Details
Motivation: UDRC tasks (such as low-light enhancement and HDR reconstruction) are blind inverse problems with an unknown forward model and irreversible information loss, which makes radiometric fidelity hard to recover.
Method: Proposes the cascaded monotonic Bernstein (CaMB) operator as a monotonicity-driven parameterization of the forward model and embeds it in a plug-and-play diffusion framework (CaMB-Diff), where the diffusion model supplies a geometric prior and CaMB explicitly corrects radiometric distortions.
Result: On zero-shot UDRC tasks including low-light enhancement, low-field MRI enhancement, and HDR reconstruction, CaMB-Diff significantly outperforms existing state-of-the-art methods in both signal fidelity and physical consistency; experiments also validate CaMB's effectiveness at modeling the unknown forward operator.
Conclusion: Monotonicity is the key physical invariant of UDRC tasks; as a hard-coded inductive bias, CaMB enables stable estimation of the forward model; CaMB-Diff couples physical guidance with data-driven optimization and offers a new paradigm for blind inverse problems.
Abstract: Recovering radiometric fidelity from unknown dynamic range compression (UDRC), such as low-light enhancement and HDR reconstruction, is a challenging blind inverse problem, due to the unknown forward model and irreversible information loss introduced by compression. To address this challenge, we first identify monotonicity as the fundamental physical invariant shared across UDRC tasks. Leveraging this insight, we introduce the cascaded monotonic Bernstein (CaMB) operator to parameterize the unknown forward model. CaMB enforces monotonicity as a hard architectural inductive bias, constraining optimization to physically consistent mappings and enabling robust and stable operator estimation. We further integrate CaMB with a plug-and-play diffusion framework, proposing CaMB-Diff. Within this framework, the diffusion model serves as a powerful geometric prior for structural and semantic recovery, while CaMB explicitly models and corrects radiometric distortions through a physically grounded forward operator. Extensive experiments on a variety of zero-shot UDRC tasks, including low-light enhancement, low-field MRI enhancement, and HDR reconstruction, demonstrate that CaMB-Diff significantly outperforms state-of-the-art zero-shot baselines in terms of both signal fidelity and physical consistency. Moreover, we empirically validate the effectiveness of the proposed CaMB parameterization in accurately modeling the unknown forward operator.
[398] Generative Visual Chain-of-Thought for Image Editing
Zijin Yin,Tiankai Hang,Yiji Cheng,Shiyi Zhang,Runze He,Yu Xu,Chunyu Wang,Bing Li,Zheng Chang,Kongming Liang,Qinglin Lu,Zhanyu Ma
Main category: cs.CV
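TL;DR: This paper proposes Generative Visual Chain-of-Thought (GVCoT), a unified framework that jointly optimizes visual reasoning and editing end to end. It first generates spatial cues to localize the target region and then performs the edit, improving the accuracy of image editing driven by fine-grained spatial instructions in complex scenes.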
Details
Motivation: Existing image editing methods struggle to localize the region to edit under complex scenes and nuanced spatial instructions.
Method: Proposes the GVCoT framework, which jointly optimizes the visual tokens of the reasoning phase (generating spatial cues for localization) and the editing phase; builds the large-scale GVCoT-Edit-Instruct dataset (1.8M samples) and adopts a progressive training strategy (supervised fine-tuning followed by reinforcement learning); and introduces a new benchmark, SREdit-Bench, for rigorous evaluation.
Result: Consistently outperforms current state-of-the-art models on SREdit-Bench and ImgEdit.
Conclusion: GVCoT enables more interpretable and more precise image editing and points to a new direction for future research.
Abstract: Existing image editing methods struggle to perceive where to edit, especially under complex scenes and nuanced spatial instructions. To address this issue, we propose Generative Visual Chain-of-Thought (GVCoT), a unified framework that performs native visual reasoning by first generating spatial cues to localize the target region and then executing the edit. Unlike prior text-only CoT or tool-dependent visual CoT paradigms, GVCoT jointly optimizes visual tokens generated during the reasoning and editing phases in an end-to-end manner. This fosters the emergence of innate spatial reasoning ability and enables more effective utilization of visual-domain cues. The main challenge of training GVCoT lies in the scarcity of large-scale editing data with precise edit region annotations; to this end, we construct GVCoT-Edit-Instruct, a dataset of 1.8M high-quality samples spanning 19 tasks. We adopt a progressive training strategy: supervised fine-tuning to build foundational localization ability in the reasoning trace before final editing, followed by reinforcement learning to further improve reasoning and editing quality. Finally, we introduce SREdit-Bench, a new benchmark designed to comprehensively stress-test models under sophisticated scenes and fine-grained referring expressions. Experiments demonstrate that GVCoT consistently outperforms state-of-the-art models on SREdit-Bench and ImgEdit. We hope our GVCoT will inspire future research toward interpretable and precise image editing.
[399] Zero-shot Low-Field MRI Enhancement via Diffusion-Based Adaptive Contrast Transport
Muyu Liu,Chenhe Du,Xuanyu Tian,Qing Wu,Xiao Wang,Haonan Zhang,Hongjiang Wei,Yuyao Zhang
Main category: cs.CV
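TL;DR: This paper proposes the DACT framework, which uses a pre-trained high-field diffusion prior and a physics-informed adaptive forward model to reconstruct high-field-quality images from low-field MRI without paired supervision. A differentiable Sinkhorn optimal transport module dynamically corrects the intensity distribution shift, markedly improving structural detail and tissue contrast.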
Details
Motivation: Low-field MRI is limited by low signal-to-noise ratio and distorted tissue contrast, and reconstructing high-field-quality images from low-field data is challenged by scarce paired training data and an unknown non-linear contrast transformation operator.
Method: Proposes the Diffusion-Based Adaptive Contrast Transport (DACT) framework, which combines a pre-trained high-field diffusion prior with a physically modeled adaptive forward model and introduces a differentiable Sinkhorn optimal transport module that explicitly models and corrects the intensity distribution shift between low-field and high-field domains during reverse diffusion.
Result: Experiments on simulated and real clinical low-field datasets show that DACT achieves state-of-the-art performance, with reconstructions that have better structural detail and accurate tissue contrast.
Conclusion: DACT is an effective zero-shot method for high-field-quality reconstruction that moves beyond the traditional linear degradation assumption and preserves both anatomical fidelity and contrast realism.
Abstract: Low-field (LF) magnetic resonance imaging (MRI) democratizes access to diagnostic imaging but is fundamentally limited by low signal-to-noise ratio and significant tissue contrast distortion due to field-dependent relaxation dynamics. Reconstructing high-field (HF) quality images from LF data is a blind inverse problem, severely challenged by the scarcity of paired training data and the unknown, non-linear contrast transformation operator. Existing zero-shot methods, which assume simplified linear degradation, often fail to recover authentic tissue contrast. In this paper, we propose DACT (Diffusion-Based Adaptive Contrast Transport), a novel zero-shot framework that restores HF-quality images without paired supervision. DACT synergizes a pre-trained HF diffusion prior to ensure anatomical fidelity with a physically-informed adaptive forward model. Specifically, we introduce a differentiable Sinkhorn optimal transport module that explicitly models and corrects the intensity distribution shift between LF and HF domains during the reverse diffusion process. This allows the framework to dynamically learn the intractable contrast mapping while preserving topological consistency. Extensive experiments on simulated and real clinical LF datasets demonstrate that DACT achieves state-of-the-art performance, yielding reconstructions with superior structural detail and correct tissue contrast.
[400] LaST-VLA: Thinking in Latent Spatio-Temporal Space for Vision-Language-Action in Autonomous Driving
Yuechen Luo,Fang Li,Shaoqing Xu,Yang Ji,Zehan Zhang,Bing Wang,Yuannan Shen,Jianwei Cui,Long Chen,Guang Chen,Hangjun Ye,Zhi-Xin Yang,Fuxi Wen
Main category: cs.CV
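TL;DR: This paper proposes LaST-VLA, a framework that shifts the reasoning paradigm of vision-language-action models from discrete symbols to physically grounded latent spatio-temporal chain-of-thought reasoning. A dual-feature alignment mechanism fuses geometric constraints with dynamic foresight, and progressive supervised fine-tuning plus reinforcement learning with Group Relative Policy Optimization sets new records on several autonomous driving benchmarks.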
Details
Motivation: Existing VLA models rely on explicit textual chain-of-thought reasoning, which leads to semantic-perceptual decoupling and perceptual-symbolic conflicts, while unconstrained latent chain-of-thought reasoning lacks physical meaning.
Method: Proposes the Latent Spatio-Temporal VLA (LaST-VLA), which introduces a dual-feature alignment mechanism (fusing geometric constraints from 3D foundation models with dynamic foresight from world models), adopts a progressive supervised fine-tuning (SFT) strategy, and applies Group Relative Policy Optimization (GRPO) reinforcement learning to ensure safety and rule compliance.
Result: Sets new records on NAVSIM v1 (91.3 PDMS) and v2 (87.1 EPDMS), and shows strong spatio-temporal reasoning on SURDS and NuDynamics.
Conclusion: Grounding latent reasoning in physics and spatio-temporal structure narrows the gap between perception and decision-making, and LaST-VLA offers a more robust, interpretable, and physically consistent paradigm for end-to-end autonomous driving.
Abstract: While Vision-Language-Action (VLA) models have revolutionized autonomous driving by unifying perception and planning, their reliance on explicit textual Chain-of-Thought (CoT) leads to semantic-perceptual decoupling and perceptual-symbolic conflicts. Recent shifts toward latent reasoning attempt to bypass these bottlenecks by thinking in continuous hidden space. However, without explicit intermediate constraints, standard latent CoT often operates as a physics-agnostic representation. To address this, we propose the Latent Spatio-Temporal VLA (LaST-VLA), a framework shifting the reasoning paradigm from discrete symbolic processing into a physically grounded Latent Spatio-Temporal CoT. By implementing a dual-feature alignment mechanism, we distill geometric constraints from 3D foundation models and dynamic foresight from world models directly into the latent space. Coupled with a progressive SFT training strategy that transitions from feature alignment to trajectory generation, and refined via Reinforcement Learning with Group Relative Policy Optimization (GRPO) to ensure safety and rule compliance, LaST-VLA sets a new record on NAVSIM v1 (91.3 PDMS) and NAVSIM v2 (87.1 EPDMS), while excelling in spatial-temporal reasoning on SURDS and NuDynamics benchmarks.
[401] BAWSeg: A UAV Multispectral Benchmark for Barley Weed Segmentation
Haitian Wang,Xinyu Wang,Muhammad Ibrahim,Dustin Severtson,Ajmal Mian
Main category: cs.CV
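TL;DR: This paper proposes VISA, a two-stream network that decouples spectral radiance features from vegetation-index features and uses attention mechanisms to improve pixel-level segmentation of small weed patches in UAV multispectral imagery, and it releases the BAWSeg dataset for training and evaluation.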
Details
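Motivation: Existing multispectral weed segmentation methods are limited by radiometric drift, mixed-pixel interference, and mutual interference between radiance-scale features and normalized index features, so they struggle to identify small weed clusters embedded in crop canopies reliably.
Method: Proposes VISA (Vegetation-Index and Spectral Attention), a two-stream segmentation network: the radiance stream processes five-band reflectance with residual spectral-spatial attention, and the index stream processes vegetation-index maps with windowed self-attention, state-space layers, and Slot Attention. It also builds BAWSeg, a four-year, radiometrically calibrated, finely annotated UAV multispectral dataset of commercial barley paddocks.
Result: On BAWSeg, VISA reaches 75.6% mIoU and 63.5% weed IoU with 22.8M parameters; under cross-plot and cross-year generalization it maintains 71.2% and 69.2% mIoU, respectively, all above the SegFormer-B1 baseline.
Conclusion: By decoupling and jointly modeling radiance and index cues, VISA markedly improves the accuracy and robustness of weed segmentation in complex field scenes while remaining computationally efficient, and BAWSeg provides the first large-scale benchmark dataset for deployment-oriented evaluation of this task.
Abstract: Accurate weed mapping in cereal fields requires pixel-level segmentation from UAV imagery that remains reliable across fields, seasons, and illumination. Existing multispectral pipelines often depend on thresholded vegetation indices, which are brittle under radiometric drift and mixed crop-weed pixels, or on single-stream CNN and Transformer backbones that ingest stacked bands and indices, where radiance cues and normalized index cues interfere and reduce sensitivity to small weed clusters embedded in crop canopies. We propose VISA (Vegetation-Index and Spectral Attention), a two-stream segmentation network that decouples these cues and fuses them at native resolution. The radiance stream learns from calibrated five-band reflectance using residual spectral-spatial attention to preserve fine textures and row boundaries that are attenuated by ratio indices. The index stream operates on vegetation-index maps with windowed self-attention to model local structure efficiently, state-space layers to propagate field-scale context without quadratic attention cost, and Slot Attention to form stable region descriptors that improve discrimination of sparse weeds under canopy mixing.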
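To make the distinction between radiance cues and ratio-index cues concrete, the sketch below computes NDVI, a widely used normalized vegetation index defined as (NIR - Red) / (NIR + Red). The abstract does not say which indices BAWSeg derives, so NDVI and the reflectance values here are illustrative assumptions only.

```python
def ndvi(nir: float, red: float, eps: float = 1e-8) -> float:
    """Normalized Difference Vegetation Index from NIR and red reflectance."""
    return (nir - red) / (nir + red + eps)

# Illustrative reflectance values (assumed, not taken from BAWSeg).
vegetation = ndvi(nir=0.50, red=0.08)
bare_soil = ndvi(nir=0.30, red=0.25)

# The same vegetation pixel at half the brightness: the ratio index is
# almost unchanged, which is why indices are robust to illumination but
# also discard the brightness detail that a radiance stream can keep.
dim_vegetation = ndvi(nir=0.25, red=0.04)
```

This invariance to a common brightness scale is one way to read the abstract's remark that fine textures and row boundaries are attenuated by ratio indices.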
To support supervised training and deployment-oriented evaluation, we introduce BAWSeg, a four-year UAV multispectral dataset collected over commercial barley paddocks in Western Australia, providing radiometrically calibrated blue, green, red, red edge, and near-infrared orthomosaics, derived vegetation indices, and dense crop, weed, and other labels with leakage-free block splits. On BAWSeg, VISA achieves 75.6% mIoU and 63.5% weed IoU with 22.8M parameters, outperforming a multispectral SegFormer-B1 baseline by 1.2 mIoU and 1.9 weed IoU. Under cross-plot and cross-year protocols, VISA maintains 71.2% and 69.2% mIoU, respectively. The BAWSeg data, VISA code, and trained models will be released upon publication.
[402] MobileMold: A Smartphone-Based Microscopy Dataset for Food Mold Detection
Dinh Nam Pham,Leonard Prokisch,Bennet Meyer,Jonas Thumbs
Main category: cs.CV
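TL;DR: This paper introduces MobileMold, an open dataset of smartphone microscopy images for food mold detection and food classification, and establishes baseline models on it for mold detection and food-type classification (including multi-task learning). The baselines reach near-ceiling performance, and saliency visualizations are provided to improve interpretability.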
Details
Motivation: To make food safety inspection more accessible by using low-cost, portable smartphone clip-on microscopes to identify food mold in everyday settings, compensating for the limits of inspection by the naked eye.
Method: Builds the MobileMold dataset of 4,941 handheld microscopy images (covering 11 food types, 4 smartphones, 3 microscopes, and varied real-world conditions) and, using several pretrained deep learning models and data augmentation strategies, establishes baselines for mold detection, food classification, and joint multi-task prediction, supplemented by saliency maps for visual explanation.
Result: In the multi-task setting for mold detection and food classification, the model reaches near-ceiling performance with accuracy 0.9954, F1 0.9954, and Matthews correlation coefficient 0.9907; saliency maps successfully localize the mold regions the model attends to.
Conclusion: The MobileMold dataset effectively supports research on mobile food spoilage detection and advances accessible food-safety sensing, mobile imaging, and smart attachment applications.
Abstract: Smartphone clip-on microscopes turn everyday devices into low-cost, portable imaging systems that can even reveal fungal structures at the microscopic level, enabling mold inspection beyond unaided visual checks. In this paper, we introduce MobileMold, an open smartphone-based microscopy dataset for food mold detection and food classification. MobileMold contains 4,941 handheld microscopy images spanning 11 food types, 4 smartphones, 3 microscopes, and diverse real-world conditions. Beyond the dataset release, we establish baselines for (i) mold detection and (ii) food-type classification, including a multi-task setting that predicts both attributes. Across multiple pretrained deep learning architectures and augmentation strategies, we obtain near-ceiling performance (accuracy = 0.9954, F1 = 0.9954, MCC = 0.9907), validating the utility of our dataset for detecting food spoilage. To increase transparency, we complement our evaluation with saliency-based visual explanations highlighting mold regions associated with the model's predictions. MobileMold aims to contribute to research on accessible food-safety sensing, mobile imaging, and exploring the potential of smartphones enhanced with attachments.
[403] physfusion: A Transformer-based Dual-Stream Radar and Vision Fusion Framework for Open Water Surface Object Detection
Yuting Wan,Liguo Sun,Jiuwu Hao,Zao Zhang,Pin LV
Main category: cs.CV
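TL;DR: This paper proposes PhysFusion, a physics-informed radar-image fusion detection framework for water-surface object detection by unmanned surface vehicles (USVs) in complex sea conditions. Physical modeling improves the use of sparse, highly variable radar point clouds, and the framework achieves state-of-the-art performance on multiple datasets.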
Details
Motivation: Water-surface object detection faces wave clutter, specular reflections, and weak appearance cues at long range; millimeter-wave radar can compensate for poor illumination, but its point clouds are sparse and intermittent with sharply varying radar cross-section (RCS), so conventional fusion methods struggle to exploit radar information effectively.
Method: Proposes the PhysFusion framework with three core modules: (1) a Physics-Informed Radar Encoder (PIR Encoder) with an RCS Mapper and Quality Gate that converts point attributes into scattering priors and estimates point reliability; (2) a Radar-guided Interactive Fusion Module (RIFM) that uses a dual-stream radar backbone (a point-based local stream plus a global Transformer stream based on Scattering-Aware Self-Attention) for query-level radar-image fusion; (3) a Temporal Query Aggregation module (TQA) that aggregates frame-wise fused queries over a short time window to strengthen temporal consistency.
Result: Reaches 59.7% mAP50:95 and 90.3% mAP50 on WaterScenes (T=5) with only 5.6M parameters and 12.5G FLOPs; reaches 94.8% mAP50 and 46.2% mAP50:95 on FLOW in the radar+camera setting; ablations show that the PIR Encoder, SASA, and RIFM each contribute significantly.
Conclusion: By introducing physical priors (such as RCS modeling) and scattering-aware attention, PhysFusion markedly improves the robustness and accuracy of cross-modal radar-image fusion in water-surface scenes and offers an efficient, interpretable paradigm for USV perception.
Abstract: Detecting water-surface targets for Unmanned Surface Vehicles (USVs) is challenging due to wave clutter, specular reflections, and weak appearance cues in long-range observations. Although 4D millimeter-wave radar complements cameras under degraded illumination, maritime radar point clouds are sparse and intermittent, with reflectivity attributes exhibiting heavy-tailed variations under scattering and multipath, making conventional fusion designs struggle to exploit radar cues effectively. We propose PhysFusion, a physics-informed radar-image detection framework for water-surface perception. The framework integrates: (1) a Physics-Informed Radar Encoder (PIR Encoder) with an RCS Mapper and Quality Gate, transforming per-point radar attributes into compact scattering priors and predicting point-wise reliability for robust feature learning under clutter; (2) a Radar-guided Interactive Fusion Module (RIFM) performing query-level radar-image fusion between semantically enriched radar features and multi-scale visual features, with the radar branch modeled by a dual-stream backbone including a point-based local stream and a transformer-based global stream using Scattering-Aware Self-Attention (SASA); and (3) a Temporal Query Aggregation module (TQA) aggregating frame-wise fused queries over a short temporal window for temporally consistent representations.
Experiments on WaterScenes and FLOW demonstrate that PhysFusion achieves 59.7% mAP50:95 and 90.3% mAP50 on WaterScenes (T=5 radar history) using 5.6M parameters and 12.5G FLOPs, and reaches 94.8% mAP50 and 46.2% mAP50:95 on FLOW under radar+camera setting. Ablation studies quantify the contributions of PIR Encoder, SASA-based global reasoning, and RIFM.
[404] PreSight: Preoperative Outcome Prediction for Parkinson's Disease via Region-Prior Morphometry and Patient-Specific Weighting
Yand Wang,Chen Zhang,Lanyun Zhu,Yixin Chen,Qunbo Wang,Yutong Bai,Jurgen Germann,Yinghong Wen,Shuai Shao
Main category: cs.CV
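TL;DR: This paper proposes PreSight, a model that fuses clinical prior knowledge with preoperative MRI and deformation-based morphometry (DBM) and adapts regional importance through a patient-specific weighting module, enabling individualized prediction of motor benefit after Parkinson's disease surgery with interpretable, well-calibrated decision support.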
Details
Motivation: Accurately predicting the improvement rate of Parkinson's disease surgery before the operation is clinically valuable but very challenging because imaging signals are weak and patients are highly heterogeneous.
Method: Proposes PreSight, which fuses clinical priors, preoperative MRI, and DBM features, introduces a patient-specific weighting module to adapt regional importance, and outputs calibrated predictions with patient-level explanations end to end.
Result: Validated on a real-world two-center cohort of 400 cases, with 88.89% accuracy on internal validation and 85.29% on the external-center test; outperforms clinical, imaging-only, and multimodal baselines; shows better probability calibration and higher decision-curve net benefit; ablations confirm the effectiveness of DBM and the weighting module and show that the model attends to disease-relevant brain regions in a patient-specific manner.
Conclusion: Combining clinical prior knowledge with region-adaptive morphometry can provide reliable preoperative decision support in routine clinical practice.
Abstract: Preoperative improvement rate prediction for Parkinson's disease surgery is clinically important yet difficult because imaging signals are subtle and patients are heterogeneous. We address this setting, where only information available before surgery is used, and the goal is to predict patient-specific postoperative motor benefit. We present PreSight, a presurgical outcome model that fuses clinical priors with preoperative MRI and deformation-based morphometry (DBM) and adapts regional importance through a patient-specific weighting module. The model produces end-to-end, calibrated, decision-ready predictions with patient-level explanations. We evaluate PreSight on a real-world two-center cohort of 400 subjects with multimodal presurgical inputs and postoperative improvement labels. PreSight outperforms strong clinical, imaging-only, and multimodal baselines. It attains 88.89% accuracy on internal validation and 85.29% on an external-center test for responder classification and shows better probability calibration and higher decision-curve net benefit. Ablations and analyses confirm the contribution of DBM and the patient-specific weighting module and indicate that the model emphasizes disease-relevant regions in a patient-specific manner. These results demonstrate that integrating clinical prior knowledge with region-adaptive morphometry enables reliable presurgical decision support in routine practice.
[405] Robust White Blood Cell Classification with Stain-Normalized Decoupled Learning and Ensembling
Luu Le,Hoang-Loc Cao,Ha-Hieu Pham,Thanh-Huy Nguyen,Ulas Bagci
Main category: cs.CV
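TL;DR: This paper proposes a stain-normalized, decoupled training framework for white blood cell classification. It learns transferable features with instance-balanced sampling, rebalances the classifier with class-aware sampling and a hybrid loss (effective-number weighting plus focal modulation), and improves robustness with multi-backbone ensembling and test-time augmentation, taking first place in the WBCBench 2026 challenge.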
Details
Motivation: Real-world white blood cell data show large appearance differences caused by staining and scanning, and severe imbalance in which common classes dominate while rare but clinically important classes are underrepresented.
Method: Applies stain normalization as preprocessing; uses decoupled training, in which stage one learns transferable representations with instance-balanced sampling and stage two optimizes the classifier with class-aware sampling and a hybrid loss combining effective-number weighting and focal modulation; and applies multi-backbone ensembling with test-time augmentation at inference.
Result: First place on the leaderboard of the WBCBench 2026 Robust White Blood Cell Classification Challenge held at ISBI 2026.
Conclusion: The proposed framework effectively mitigates stain variation and class imbalance and markedly improves the robustness and generalization of white blood cell classifiers in real-world settings.
Abstract: White blood cell (WBC) classification is fundamental for hematology applications such as infection assessment, leukemia screening, and treatment monitoring. However, real-world WBC datasets present substantial appearance variations caused by staining and scanning conditions, as well as severe class imbalance in which common cell types dominate while rare but clinically important categories are underrepresented. To address these challenges, we propose a stain-normalized, decoupled training framework that first learns transferable representations using instance-balanced sampling, and then rebalances the classifier with class-aware sampling and a hybrid loss combining effective-number weighting and focal modulation. At inference, we further enhance robustness by ensembling various trained backbones with test-time augmentation. Our approach achieved the top rank on the leaderboard of the WBCBench 2026: Robust White Blood Cell Classification Challenge at ISBI 2026.
[406] Process Over Outcome: Cultivating Forensic Reasoning for Generalizable Multimodal Manipulation Detection
Yuchen Zhang,Yaxiong Wang,Kecheng Han,Yujiao Wu,Lianwei Wu,Li Zhu,Zhedong Zheng
Main category: cs.CV
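TL;DR: This paper proposes REFORM, a reasoning-driven framework that improves the generalization of detectors for generative-AI media manipulation. It models the forensic reasoning process through a three-stage curriculum and introduces ROM, a large-scale dataset with rich reasoning annotations.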
Details
Motivation: Existing manipulation detection methods rely on outcome-oriented classification of manipulation types, lack interpretability, and tend to overfit superficial artifacts, so they generalize poorly to unseen manipulation patterns.
Method: Proposes the REFORM framework with a three-stage curriculum: first induce forensic rationales, then align reasoning with final judgments, and finally refine logical consistency via reinforcement learning; also builds ROM, a dataset with reasoning annotations.
Result: Achieves 81.52% accuracy on ROM, 76.65% accuracy on DGM4, and 74.9 F1 on MMFakeBench, ahead of existing methods.
Conclusion: Introducing explicit forensic reasoning markedly improves the generalization and interpretability of manipulation detection, and REFORM validates the shift from outcome fitting to process modeling.
Abstract: Recent advances in generative AI have significantly enhanced the realism of multimodal media manipulation, thereby posing substantial challenges to manipulation detection. Existing manipulation detection and grounding approaches predominantly focus on manipulation type classification under result-oriented supervision, which not only lacks interpretability but also tends to overfit superficial artifacts. In this paper, we argue that generalizable detection requires incorporating explicit forensic reasoning, rather than merely classifying a limited set of manipulation types, which fails to generalize to unseen manipulation patterns. To this end, we propose REFORM, a reasoning-driven framework that shifts learning from outcome fitting to process modeling. REFORM adopts a three-stage curriculum that first induces forensic rationales, then aligns reasoning with final judgments, and finally refines logical consistency via reinforcement learning. To support this paradigm, we introduce ROM, a large-scale dataset with rich reasoning annotations. Extensive experiments show that REFORM establishes new state-of-the-art performance with superior generalization, achieving 81.52% ACC on ROM, 76.65% ACC on DGM4, and 74.9 F1 on MMFakeBench.
[407] Event-Only Drone Trajectory Forecasting with RPM-Modulated Kalman Filtering
Hari Prasanth S. M.,Pejman Habibiroudkenar,Eerik Alamikkotervo,Dimitrios Bouzoulas,Risto Ojala
Main category: cs.CV
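TL;DR: This paper proposes a drone trajectory prediction method based solely on event camera data. It extracts propeller rotational speed (RPM) directly from the event stream and feeds it into an RPM-aware Kalman filtering framework, achieving robust short- and medium-horizon trajectory prediction without RGB images or training data.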
Details
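Motivation: Event cameras offer high temporal resolution and are well suited to observing fast-moving aerial targets, but their use in drone trajectory prediction remains limited; existing methods mostly depend on RGB images or require large amounts of training data.
Method: Extracts propeller rotational speed (RPM) directly from raw event data and designs an RPM-aware Kalman filtering framework for fusion, yielding purely event-driven trajectory prediction.
Result: On the FRED dataset, the method outperforms learning-based methods and a standard Kalman filter in average distance error and final distance error at 0.4 s and 0.8 s horizons.
Conclusion: The proposed event-only, RGB-free method, which needs no training data, is robust and accurate for short- to medium-horizon trajectory prediction and extends the practical value of event cameras for forecasting dynamic aerial targets.
Abstract: Event cameras provide high-temporal-resolution visual sensing that is well suited for observing fast-moving aerial objects; however, their use for drone trajectory prediction remains limited. This work introduces an event-only drone forecasting method that exploits propeller-induced motion cues. Propeller rotational speed is extracted directly from raw event data and fused within an RPM-aware Kalman filtering framework. Evaluations on the FRED dataset show that the proposed method outperforms learning-based approaches and a vanilla Kalman filter in terms of average distance error and final distance error at 0.4 s and 0.8 s forecasting horizons. The results demonstrate robust and accurate short- and medium-horizon trajectory forecasting without reliance on RGB imagery or training data.
[408] MAP-Diff: Multi-Anchor Guided Diffusion for Progressive 3D Whole-Body Low-Dose PET Denoising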
Peiyuan Jing,Chun-Wun Cheng,Liutao Yang,Zhenxuan Zhang,Thiago V. Lima,Klaus Strobel,Antoine Leimgruber,Angelica Aviles-Rivero,Guang Yang,Javier A. Montoya-Zegarra
Main category: cs.CV
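TL;DR: This paper proposes MAP-Diff, a multi-anchor guided diffusion model for progressive 3D whole-body denoising of low-dose PET images. It uses clinical intermediate-dose scans as trajectory anchors and applies timestep-dependent supervision to the reverse process to reconstruct dose-aligned intermediate states, significantly outperforming existing CNN, Transformer, GAN, and diffusion methods on multiple datasets.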
Details
Motivation: Low-dose PET reduces radiation exposure but suffers from severe noise and quantitative degradation; the reverse trajectories of existing diffusion denoising models are unconstrained and not aligned with the progressive process of PET dose formation.
Method: Proposes the MAP-Diff framework, which uses clinically acquired intermediate-dose PET scans as "trajectory anchors", calibrates anchor timesteps by matching simulated degradation against real multi-dose pairs, and designs a timestep-weighted anchor loss to stabilize stage-wise learning, regularizing the reverse diffusion process toward dose-consistent intermediate states.
Result: Outperforms CNN, Transformer, GAN, and 3D DDPM baselines on both internal (Siemens Biograph Vision Quadra) and external (United Imaging uEXPLORER) datasets; on the internal dataset PSNR rises to 43.71 dB (+1.23 dB), SSIM reaches 0.986, and NMAE falls to 0.103; on the external dataset PSNR reaches 34.42 dB and NMAE is 0.141, with good generalization across scanners.
Conclusion: Through dose-aligned multi-anchor supervision, MAP-Diff improves the quantitative accuracy and clinical consistency of low-dose PET denoising and offers a new paradigm for progressive, interpretable medical image generation.
Abstract: Low-dose Positron Emission Tomography (PET) reduces radiation exposure but suffers from severe noise and quantitative degradation. Diffusion-based denoising models achieve strong final reconstructions, yet their reverse trajectories are typically unconstrained and not aligned with the progressive nature of PET dose formation. We propose MAP-Diff, a multi-anchor guided diffusion framework for progressive 3D whole-body PET denoising. MAP-Diff introduces clinically observed intermediate-dose scans as trajectory anchors and enforces timestep-dependent supervision to regularize the reverse process toward dose-aligned intermediate states. Anchor timesteps are calibrated via degradation matching between simulated diffusion corruption and real multi-dose PET pairs, and a timestep-weighted anchor loss stabilizes stage-wise learning. At inference, the model requires only ultra-low-dose input while enabling progressive, dose-consistent intermediate restoration. Experiments on internal (Siemens Biograph Vision Quadra) and cross-scanner (United Imaging uEXPLORER) datasets show consistent improvements over strong CNN-, Transformer-, GAN-, and diffusion-based baselines. On the internal dataset, MAP-Diff improves PSNR from 42.48 dB to 43.71 dB (+1.23 dB), increases SSIM to 0.986, and reduces NMAE from 0.115 to 0.103 (-0.012) compared to 3D DDPM.
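As a rough aid for reading the PSNR figures, recall that PSNR = 10 log10(MAX^2 / MSE), so a gain of g dB at a fixed peak value corresponds to an MSE ratio of 10^(-g/10). The sketch below applies this to the reported +1.23 dB. It assumes a single image pair with an unchanged peak value; a dataset-averaged PSNR does not translate exactly, so the result is indicative only.

```python
def mse_ratio_from_psnr_gain(gain_db: float) -> float:
    """MSE_new / MSE_old implied by a PSNR gain in dB, for a fixed peak value."""
    return 10 ** (-gain_db / 10)

# Reported internal-dataset figures: 42.48 dB (3D DDPM) vs 43.71 dB (MAP-Diff).
ratio = mse_ratio_from_psnr_gain(43.71 - 42.48)
mse_reduction_percent = (1 - ratio) * 100
```

Under these assumptions the ratio comes out at roughly 0.75, i.e. about a quarter less mean squared error.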
Performance gains generalize across scanners, achieving 34.42 dB PSNR and 0.141 NMAE on the external cohort, outperforming all competing methods.
[409] NICO-RAG: Multimodal Hypergraph Retrieval-Augmented Generation for Understanding the Nicotine Public Health Crisis
Manuel Serna-Aguilera,Raegan Anderes,Page Dobbs,Khoa Luu
Main category: cs.CV
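TL;DR: This paper introduces the NICO dataset and the NICO-RAG framework to address the public health challenges posed by nicotine product innovation. They support public health research with multimodal data (images and text) and enable efficient, factually accurate image retrieval and question answering without the added cost of processing image tokens.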
Details
Motivation: The nicotine addiction crisis remains severe, and the tobacco industry's new flavored nicotine products (such as nicotine pouches) have undermined years of tobacco-control progress; existing research is limited by small data scale and a weak ability to link multimodal data. Method: Builds the NICO dataset of over 200,000 multimodal samples (images and text) covering 55 brands, and proposes the NICO-RAG framework, which uses hypergraphs to organize entities and relations extracted from images and text, avoids the high cost of language models and image-token processing at retrieval time, and supports joint image retrieval based on visual and semantic similarity. Result: Experiments show that, without processing additional image tokens, NICO-RAG answers over one hundred questions with performance comparable to the state-of-the-art RAG method adapted for images. Conclusion: The NICO dataset and the NICO-RAG framework give public health research a scalable, low-cost, fact-driven multimodal analysis tool that helps monitor and respond to the risks of nicotine product innovation more effectively. Abstract: The nicotine addiction public health crisis continues to be pervasive. In this century alone, the tobacco industry has released and marketed new products in an aggressive effort to lure new and young customers for life. Such innovations and product development, namely flavored nicotine or tobacco such as nicotine pouches, have undone years of anti-tobacco campaign work. Past work is limited both in scope and in its ability to connect large-scale data points. Thus, we introduce the Nicotine Innovation Counter-Offensive (NICO) Dataset to provide public health researchers with over 200,000 multimodal samples, including images and text descriptions, on 55 tobacco and nicotine product brands. In addition, to provide public health researchers with factual connections across a large-scale dataset, we propose NICO-RAG, a retrieval-augmented generation (RAG) framework that can retrieve image features without incurring the high cost of language models or the added cost of processing image tokens with large-scale datasets such as NICO. At construction time, NICO-RAG organizes image- and text-extracted entities and relations into hypergraphs to produce responses that are as factual as possible. This joint multimodal knowledge representation enables NICO-RAG to retrieve images for query answering not only by visual similarity but also by the semantic similarity of image descriptions.
Experiments on over 100 questions show that, without needing to process additional tokens from images, NICO-RAG performs comparably to the state-of-the-art RAG method adapted for images.
[410] WorldStereo: Bridging Camera-Guided Video Generation and Scene Reconstruction via 3D Geometric Memories
Yisu Zhang,Chenjie Cao,Tengfei Wang,Xuhui Zuo,Junta Wu,Jianke Zhu,Chunchao Guo
Main category: cs.CV
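TL;DR: This paper proposes WorldStereo, a framework that uses two geometric memory modules (a global-geometric memory and a spatial-stereo memory) to generate camera-controllable, multi-view-consistent videos and to support high-quality 3D reconstruction, combining efficiency with generalization.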
Details
Motivation: Existing video diffusion models generate high-quality videos but fall short in camera controllability and multi-view content consistency, which makes 3D scene reconstruction difficult. Method: Proposes the WorldStereo framework, comprising a global-geometric memory (incrementally updated point clouds that provide coarse structural priors and precise camera control) and a spatial-stereo memory (3D correspondences that constrain the attention receptive field to focus on details), and adopts a control-branch architecture that reuses a distilled VDM backbone without joint training. Result: Achieves strong performance on multiple camera-guided video generation and 3D reconstruction benchmarks; produces high-fidelity 3D results starting from perspective or panoramic images, showing strong world-modeling ability. Conclusion: WorldStereo bridges controllable video generation and 3D reconstruction and offers an efficient, flexible, and scalable new paradigm for world models. Abstract: Recent advances in foundational Video Diffusion Models (VDMs) have yielded significant progress. Yet, despite the remarkable visual quality of generated videos, reconstructing consistent 3D scenes from these outputs remains challenging, due to limited camera controllability and inconsistent generated content when viewed from distinct camera trajectories. In this paper, we propose WorldStereo, a novel framework that bridges camera-guided video generation and 3D reconstruction via two dedicated geometric memory modules. Formally, the global-geometric memory enables precise camera control while injecting coarse structural priors through incrementally updated point clouds. Moreover, the spatial-stereo memory constrains the model's attention receptive fields with 3D correspondence to focus on fine-grained details from the memory bank. These components enable WorldStereo to generate multi-view-consistent videos under precise camera control, facilitating high-quality 3D reconstruction. Furthermore, the flexible control branch-based WorldStereo shows impressive efficiency, benefiting from the distribution matching distilled VDM backbone without joint training. Extensive experiments across both camera-guided video generation and 3D reconstruction benchmarks demonstrate the effectiveness of our approach. Notably, we show that WorldStereo acts as a powerful world model, tackling diverse scene generation tasks (whether starting from perspective or panoramic images) with high-fidelity 3D results. Models will be released.
[411] ORGAN: Object-Centric Representation Learning using Cycle Consistent Generative Adversarial Networks
Joël Küchler,Ellen van Maren,Vaiva Vasiliauskaitė,Katarina Vulić,Reza Abbasi-Asl,Stephan J. Ihle
Main category: cs.CV
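TL;DR: This paper proposes ORGAN, a new object-centric representation learning method based on cycle-consistent generative adversarial networks (CycleGAN). Compared with the dominant autoencoder architectures, it performs well on both synthetic and real-world datasets, is particularly good at complex scenes with many objects and low visual contrast, and supports object manipulation and good scalability.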
Details
Motivation: Object-centric representation learning aims to extract information from images without supervision, but existing methods (mainly autoencoders) are limited on complex real-world data (e.g., many objects, low contrast). Method: Proposes ORGAN, an object-centric representation learning framework based on cycle-consistent generative adversarial networks (CycleGAN) that replaces the dominant autoencoder architecture. Result: Matches SOTA on synthetic datasets; is the only tested approach that handles real-world datasets with many objects and low contrast; produces an expressive latent space that supports object manipulation; scales well with the number of objects and image size. Conclusion: ORGAN offers a more robust, expressive, and scalable new paradigm for object-centric representation learning and significantly extends the technique's applicability to real-world scenes. Abstract: Although data generation is often straightforward, extracting information from data is more difficult. Object-centric representation learning can extract information from images in an unsupervised manner. It does so by segmenting an image into its subcomponents: the objects. Each object is then represented in a low-dimensional latent space that can be used for downstream processing. Object-centric representation learning is dominated by autoencoder architectures (AEs). Here, we present ORGAN, a novel approach for object-centric representation learning, which is based on cycle-consistent Generative Adversarial Networks instead. We show that it performs similarly to other state-of-the-art approaches on synthetic datasets, while at the same time being the only approach tested here capable of handling more challenging real-world datasets with many objects and low visual contrast. Complementing these results, ORGAN creates expressive latent space representations that allow for object manipulation. Finally, we show that ORGAN scales well both with respect to the number of objects and the size of the images, giving it a unique edge over current state-of-the-art approaches.
[412] MMNavAgent: Multi-Magnification WSI Navigation Agent for Clinically Consistent Whole-Slide Analysis
Zhengyang Xu,Han Li,Jingsong Liu,Linrui Xie,Xun Ma,Xin You,Shihui Zu,Ayako Ito,Xinyu Hao,Hongming Xu,Shaohua Kevin Zhou,Nassir Navab,Peter J. Schüffler
Main category: cs.CV
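TL;DR: This paper proposes MMNavAgent, a clinically consistent multi-magnification whole-slide image (WSI) navigation agent. Using a Cross-Magnification navigation Tool (CMT) and a Magnification Selection Tool (MST), it mimics the way pathologists switch magnifications dynamically and integrate global and cell-level information during diagnosis, and it significantly improves diagnostic performance on a public dataset.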
Details
Motivation: Existing AI navigation methods mostly operate at a single fixed magnification or depend on a preset magnification order; they cannot model how pathologists dynamically and selectively examine slides across magnifications and fuse multi-scale evidence in actual diagnosis, and so are inconsistent with the clinical workflow. Method: Proposes MMNavAgent, a multi-magnification WSI navigation agent with two core tools: (1) a Cross-Magnification navigation Tool (CMT) that aggregates contextual information from adjacent magnifications to strengthen discriminative representations along the navigation path; (2) a Magnification Selection Tool (MST) that uses memory-driven reasoning within the agent framework for interactive, adaptive magnification selection. Result: Experiments on a public dataset show gains of 1.45% in AUC and 2.93% in BACC over a non-agent baseline. Conclusion: MMNavAgent better matches the real pathology diagnostic workflow; by explicitly modeling cross-magnification interaction and adaptive magnification selection, it effectively improves automated WSI diagnosis. Abstract: Recent AI navigation approaches aim to improve Whole-Slide Image (WSI) diagnosis by modeling spatial exploration and selecting diagnostically relevant regions, yet most operate at a single fixed magnification or rely on predefined magnification traversal. In clinical practice, pathologists examine slides across multiple magnifications and selectively inspect only necessary scales, dynamically integrating global and cellular evidence in a sequential manner. This mismatch prevents existing methods from modeling cross-magnification interactions and adaptive magnification selection inherent to real diagnostic workflows. To address this, we propose a clinically consistent Multi-Magnification WSI Navigation Agent (MMNavAgent) that explicitly models multi-magnification interaction and adaptive magnification selection. Specifically, we introduce a Cross-Magnification navigation Tool (CMT) that aggregates contextual information from adjacent magnifications to enhance discriminative representations along the navigation path. We further introduce a Magnification Selection Tool (MST) that leverages memory-driven reasoning within the agent framework to enable interactive and adaptive magnification selection, mimicking the sequential decision process of pathologists. Extensive experiments on a public dataset demonstrate improved diagnostic performance, with gains of 1.45% in AUC and 2.93% in BACC over a non-agent baseline. Code will be public upon acceptance.
[413] From Pixels to Patches: Pooling Strategies for Earth Embeddings
Isaac Corley,Caleb Robinson,Inbal Becker-Reshef,Juan M. Lavista Ferres
Main category: cs.CV
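TL;DR: This paper studies how to aggregate pixel-level embeddings into patch-level representations in geospatial foundation models. It evaluates a range of pooling methods, shows their advantage in geographic generalization, and recommends GeM as a drop-in replacement for mean pooling.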
Details
Motivation: As geospatial foundation models shift from patch-level to pixel-level embeddings, large numbers of pixel vectors must be aggregated into patch representations that preserve class discriminability and match downstream label resolution; the default mean pooling loses within-patch variability, and accuracy can drop by more than 10% under spatial shift. Method: Builds the EuroSAT-Embed dataset (81,000 GeoTIFF embeddings) covering three foundation models, AlphaEarth, OlmoEarth, and Tessera; systematically evaluates 11 training-free and 2 parametric pooling methods, benchmarked under random and geographically disjoint test splits. Result: Richer pooling schemes reduce the geographic generalization gap by up to 40% relative to mean pooling and raise accuracy on spatial splits by up to 5%; GeM pooling is a zero-cost replacement; Stats pooling performs best but quadruples the dimensionality; pooling effectiveness varies with embedding source and dimensionality, and higher-dimensional embeddings benefit more from distributional statistics. Conclusion: Mean pooling is not the best choice; GeM is a practical, zero-overhead upgrade; Stats pooling suits settings that demand maximum accuracy; the pooling strategy should be designed together with the embedding source and dimensionality. Abstract: As geospatial foundation models shift from patch-level to pixel-level embeddings, practitioners must aggregate thousands of pixel vectors into patch representations that preserve class-discriminative signal while matching downstream label resolution. The default choice, mean pooling, discards within-patch variability and can drop accuracy by more than 10% under spatial shift. To evaluate this effect, we introduce EuroSAT-Embed: 81,000 embedding GeoTIFFs derived from three foundation models: AlphaEarth, OlmoEarth, and Tessera. We benchmark 11 training-free and 2 parametric pooling methods under both random and geographically disjoint test splits. Our results show that richer pooling schemes reduce the geographic generalization gap by up to 40% relative to mean pooling and increase accuracy by up to 5% on spatial splits. We recommend Generalized Mean Pooling (GeM) as a drop-in replacement for mean pooling: it improves accuracy without increasing embedding dimensionality. For maximum accuracy, Stats pooling (concatenation of min/max/mean/std pooling) performs best at 4x the embedding size. We further find that pooling effectiveness varies across embedding sources and that higher-dimensional embeddings benefit most from distributional statistics.
[414] Detection-Gated Glottal Segmentation with Zero-Shot Cross-Dataset Transfer and Clinical Feature Extraction
Harikrishnan Unnikrishnan
Main category: cs.CV
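TL;DR: This paper proposes a detection-gated deep learning pipeline (YOLOv8 + U-Net + a temporal consistency module) that achieves accurate, well-generalizing glottal segmentation in HSV from a small training set and supports zero-shot cross-dataset transfer and real-time clinical use.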
Details
Motivation: Existing deep learning models for glottal segmentation in HSV videos tend to produce artifacts and generalize poorly, so they struggle to meet the needs of different clinical settings. Method: Proposes a detection-gated pipeline: YOLOv8 detects the glottal region, a U-Net performs fine segmentation, and a temporal consistency wrapper suppresses false detections caused by glottal closure and instrument occlusion; the model is trained on only 600 frames of the GIRAFE dataset and evaluated by zero-shot transfer on BAGLS. Result: DSC of 0.81 on GIRAFE and 0.85 on BAGLS (in-distribution); validation on a downstream clinical cohort shows that automatically extracted measures such as Open Quotient and CV agree with clinical standards; the CV value significantly distinguishes healthy from pathological voices (p=0.006). Conclusion: The lightweight architecture runs at about 35 frames/s, supports real-time clinical use and standardized cross-platform biomarker extraction, and its code and models are open-sourced. Abstract: Background: Accurate glottal segmentation in high-speed videoendoscopy (HSV) is essential for extracting kinematic biomarkers of laryngeal function. However, existing deep learning models often produce spurious artifacts in non-glottal frames and fail to generalize across different clinical settings. Methods: We propose a detection-gated pipeline that integrates a YOLOv8-based detector with a U-Net segmenter. A temporal consistency wrapper ensures robustness by suppressing false positives during glottal closure and instrument occlusion. The model was trained on a limited subset of the GIRAFE dataset (600 frames) and evaluated via zero-shot transfer on the large-scale BAGLS dataset. Results: The pipeline achieved state-of-the-art performance on the GIRAFE benchmark (DSC 0.81) and demonstrated superior generalizability on BAGLS (DSC 0.85, in-distribution) without institutional fine-tuning. Downstream validation on a 65-subject clinical cohort confirmed that automated kinematic features (Open Quotient, coefficient of variation) remained consistent with established clinical benchmarks. The coefficient of variation (CV) of the glottal area was found to be a significant marker for distinguishing healthy from pathological vocal function (p=0.006). Conclusions: The detection-gated architecture provides a lightweight, computationally efficient solution (~35 frames/s) for real-time clinical use. By enabling robust zero-shot transfer, this framework facilitates the standardized, large-scale extraction of clinical biomarkers across diverse endoscopy platforms.
Code, trained weights, and evaluation scripts are released at https://github.com/hari-krishnan/openglottal.
[415] FluxMem: Adaptive Hierarchical Memory for Streaming Video Understanding
Yiweng Xie,Bo He,Junke Wang,Xiangyu Zheng,Ziyi Ye,Zuxuan Wu
Main category: cs.CV
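TL;DR: FluxMem is a training-free framework for streaming video understanding that significantly improves efficiency and performance through adaptive two-stage visual memory compression (Temporal Adjacency Selection and Spatial Domain Consolidation).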
Details
Motivation: Addresses the high computational cost and memory footprint caused by redundant visual information in streaming video understanding, while avoiding manual tuning and adapting to dynamic scenes. Method: Proposes the FluxMem framework with two core modules: (1) a Temporal Adjacency Selection (TAS) module that removes redundant visual tokens across adjacent frames; (2) a Spatial Domain Consolidation (SDC) module that merges repetitive spatial regions within a single frame; both use an adaptive token compression mechanism driven by scene statistics. Result: Scores 76.4 on StreamingBench and 67.2 on OVO-Bench (real-time setting) among online benchmarks, with latency reduced by 69.9% and peak GPU memory by 34.5%; reaches 73.1 on offline MLVU using only 35% of the visual tokens. Conclusion: FluxMem delivers efficient, adaptive, training-free streaming video understanding, sets a new SOTA on online benchmarks while keeping strong offline performance, and supports the effectiveness and practicality of the training-free compression paradigm. Abstract: This paper presents FluxMem, a training-free framework for efficient streaming video understanding. FluxMem adaptively compresses redundant visual memory through a hierarchical, two-stage design: (1) a Temporal Adjacency Selection (TAS) module removes redundant visual tokens across adjacent frames, and (2) a Spatial Domain Consolidation (SDC) module further merges spatially repetitive regions within each frame into compact representations. To adapt effectively to dynamic scenes, we introduce a self-adaptive token compression mechanism in both TAS and SDC, which automatically determines the compression rate based on intrinsic scene statistics rather than manual tuning. Extensive experiments demonstrate that FluxMem achieves new state-of-the-art results on existing online video benchmarks, reaching 76.4 on StreamingBench and 67.2 on OVO-Bench under real-time settings, while reducing latency by 69.9% and peak GPU memory by 34.5% on OVO-Bench. Furthermore, it maintains strong offline performance, achieving 73.1 on MLVU while using 65% fewer visual tokens.
[416] A 3D mesh convolution-based autoencoder for geometry compression
Germain Bregeon,Marius Preda,Radu Ispas,Titus Zaharia
Main category: cs.CV
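TL;DR: This paper proposes a 3D mesh convolution-based autoencoder for compressing the geometry of irregular meshes. It needs no preprocessing and no manifold or watertightness assumptions, preserves connectivity through face-feature learning and dedicated pooling/unpooling operations, and outperforms existing methods on both reconstruction and latent-space classification.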
Details
Motivation: Addresses compression of irregular 3D mesh data without relying on preprocessing, manifold structure, or watertightness. Method: Designs a face-based 3D mesh convolutional autoencoder with connectivity-preserving pooling and unpooling operations, which compresses the mesh into a compact base mesh space and reconstructs the original connectivity and geometric detail. Result: On multi-class datasets, both geometry reconstruction accuracy and latent-space classification performance exceed the current best methods. Conclusion: The method provides an efficient, general, and learnable compression framework for irregular meshes that balances geometric fidelity and semantic separability. Abstract: In this paper, we introduce a novel 3D mesh convolution-based autoencoder for geometry compression, able to deal with irregular mesh data without requiring either preprocessing or manifold/watertightness conditions. The proposed approach extracts meaningful latent representations by learning features directly from the mesh faces, while preserving connectivity through dedicated pooling and unpooling operations. The encoder compresses the input mesh into a compact base mesh space, which ensures that the latent space remains comparable. The decoder reconstructs the original connectivity and restores the compressed geometry to its full resolution. Extensive experiments on multi-class datasets demonstrate that our method outperforms state-of-the-art approaches in both 3D mesh geometry reconstruction and latent space classification tasks. Code available at: github.com/germainGB/MeshConv3D
[417] LiftAvatar: Kinematic-Space Completion for Expression-Controlled 3D Gaussian Avatar Animation
Hualiang Wei,Shunran Jia,Jialun Liu,Wenhui Li
Main category: cs.CV
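TL;DR: LiftAvatar is a new paradigm that completes sparse monocular observations (such as facial expressions and head pose) in kinematic space and uses the completed signals to drive high-fidelity avatar animation. It is a fine-grained, expression-controllable large-scale video diffusion Transformer that synthesizes high-quality, temporally coherent expression sequences from one or more reference images.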
Details
Motivation: Addresses the limited expressiveness and reconstruction artifacts of 3D Gaussian Splatting-based avatar methods caused by sparse kinematic signals in everyday monocular videos. Method: Proposes a multi-granularity expression control scheme (combining shading maps with expression coefficients) and a multi-reference conditioning mechanism, lifting sparse inputs into a richer kinematic representation, and acts as a plug-and-play module that enhances downstream 3D avatar pipelines. Result: Across experiments, significantly improves the animation quality and quantitative metrics of existing SOTA 3D avatar methods, especially under extreme or unseen expressions; also supports distilling priors from large-scale video generative models. Conclusion: Through kinematic-space completion and controllable diffusion modeling, LiftAvatar bridges the gap between 2D generation and 3D avatar reconstruction, improving expressive diversity, temporal consistency, and 3D geometric consistency. Abstract: We present LiftAvatar, a new paradigm that completes sparse monocular observations in kinematic space (e.g., facial expressions and head pose) and uses the completed signals to drive high-fidelity avatar animation. LiftAvatar is a fine-grained, expression-controllable large-scale video diffusion Transformer that synthesizes high-quality, temporally coherent expression sequences conditioned on single or multiple reference images. The key idea is to lift incomplete input data into a richer kinematic representation, thereby strengthening both reconstruction and animation in downstream 3D avatar pipelines. To this end, we introduce (i) a multi-granularity expression control scheme that combines shading maps with expression coefficients for precise and stable driving, and (ii) a multi-reference conditioning mechanism that aggregates complementary cues from multiple frames, enabling strong 3D consistency and controllability. As a plug-and-play enhancer, LiftAvatar directly addresses the limited expressiveness and reconstruction artifacts of 3D Gaussian Splatting-based avatars caused by sparse kinematic cues in everyday monocular videos. By expanding incomplete observations into diverse pose-expression variations, LiftAvatar also enables effective prior distillation from large-scale video generative models into 3D pipelines, leading to substantial gains.
Extensive experiments show that LiftAvatar consistently boosts animation quality and quantitative metrics of state-of-the-art 3D avatar methods, especially under extreme, unseen expressions.
[418] Stereo-Inertial Poser: Towards Metric-Accurate Shape-Aware Motion Capture Using Sparse IMUs and a Single Stereo Camera
Tutian Tang,Xingyu Ji,Yutong Li,MingHao Liu,Wenqiang Xu,Cewu Lu
Main category: cs.CV
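TL;DR: This paper proposes Stereo-Inertial Poser, a system that uses a single stereo camera and six IMUs for real-time, metric-accurate, shape-aware 3D human motion capture, addressing the depth ambiguity and shape-agnostic modeling of monocular vision.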
Details
Motivation: Existing monocular visual-inertial motion capture systems suffer from inaccurate global translation scale (due to monocular depth ambiguity) and local motion estimation that ignores differences in body shape. Method: Replaces the monocular RGB camera with a stereo camera and resolves depth ambiguity through calibrated baseline geometry, enabling direct 3D keypoint extraction and body shape parameter estimation; fuses IMU data with visual cues to predict drift-compensated joint positions and root motion; introduces a novel shape-aware fusion module that dynamically reconciles anthropometric differences with global translation. Result: The end-to-end pipeline runs at over 200 FPS with no optimization-based post-processing; quantitative evaluation shows SOTA performance; qualitative results show drift-free global translation over long recordings and reduced foot skating. Conclusion: Stereo-Inertial Poser significantly outperforms existing methods in accuracy, real-time performance, and robustness, offering a new paradigm for low-cost, high-accuracy motion capture. Abstract: Recent advancements in visual-inertial motion capture systems have demonstrated the potential of combining monocular cameras with sparse inertial measurement units (IMUs) as cost-effective solutions, which effectively mitigate occlusion and drift issues inherent in single-modality systems. However, they are still limited by metric inaccuracies in global translations stemming from monocular depth ambiguity, and shape-agnostic local motion estimations that ignore anthropometric variations. We present Stereo-Inertial Poser, a real-time motion capture system that leverages a single stereo camera and six IMUs to estimate metric-accurate and shape-aware 3D human motion. By replacing the monocular RGB with stereo vision, our system resolves depth ambiguity through calibrated baseline geometry, enabling direct 3D keypoint extraction and body shape parameter estimation. IMU data and visual cues are fused for predicting drift-compensated joint positions and root movements, while a novel shape-aware fusion module dynamically harmonizes anthropometric variations with global translations. Our end-to-end pipeline achieves over 200 FPS without optimization-based post-processing, enabling real-time deployment. Quantitative evaluations across various datasets demonstrate state-of-the-art performance. Qualitative results show our method produces drift-free global translation over long recordings and reduces foot-skating effects.
[419] SimRecon: SimReady Compositional Scene Reconstruction from Real Videos
Chong Xia,Kai Zhu,Zizhuo Wang,Fangfu Liu,Zhizheng Zhang,Yueqi Duan
Main category: cs.CV
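TL;DR: This paper proposes SimRecon, a framework that reconstructs cluttered scenes in an object-centric way through a three-stage "Perception-Generation-Simulation" pipeline. It introduces two bridging modules, Active Viewpoint Optimization and a Scene Graph Synthesizer, to improve the visual fidelity of generated assets and the physical plausibility of the final scene.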
Details
Motivation: Conventional appearance-based compositional reconstruction methods generalize poorly to real-world scenes and struggle to achieve both visual fidelity and physical plausibility. Method: Proposes the SimRecon framework with three stages (scene-level semantic reconstruction, single-object generation, and assembly in a simulator), plus two bridging modules: Active Viewpoint Optimization (to improve visual fidelity) and a Scene Graph Synthesizer (to ensure physical plausibility). Result: Experiments on the ScanNet dataset show that the method outperforms existing state-of-the-art approaches. Conclusion: Through its structured, staged design and targeted bridging modules, SimRecon improves the quality and usability of object-centric reconstruction in cluttered real-world scenes, providing a more reliable way to generate 3D assets for simulation and interactive applications. Abstract: Compositional scene reconstruction seeks to create object-centric representations rather than holistic scenes from real-world videos, which is natively applicable to simulation and interaction. Conventional compositional reconstruction approaches primarily emphasize visual appearance and show limited generalization ability to real-world scenarios. In this paper, we propose SimRecon, a framework that realizes a "Perception-Generation-Simulation" pipeline towards cluttered scene reconstruction, which first conducts scene-level semantic reconstruction from video input, then performs single-object generation, and finally assembles these assets in the simulator. However, naively combining these three stages leads to visual infidelity of generated assets and physical implausibility of the final scene, a problem particularly severe for complex scenes. Thus, we further propose two bridging modules between the three stages to address this problem. To be specific, for the transition from Perception to Generation, critical for visual fidelity, we introduce Active Viewpoint Optimization, which actively searches in 3D space to acquire optimal projected images as conditions for single-object completion. Moreover, for the transition from Generation to Simulation, essential for physical plausibility, we propose a Scene Graph Synthesizer, which guides the construction from scratch in 3D simulators, mirroring the native, constructive principle of the real world. Extensive experiments on the ScanNet dataset validate our method's superior performance over previous state-of-the-art approaches.
[420] OnlineX: Unified Online 3D Reconstruction and Understanding with Active-to-Stable State Evolution
Chong Xia,Fangfu Liu,Yule Wang,Yize Pang,Yueqi Duan
Main category: cs.CV
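TL;DR: This paper proposes OnlineX, an online 3D Gaussian Splatting framework that reconstructs 3D visual appearance and language fields in real time from streaming images alone. A decoupled active-to-stable state mechanism addresses cumulative drift, and joint modeling of visual and language information significantly improves novel view synthesis and semantic understanding.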
Details
Motivation: Existing generalizable 3D Gaussian Splatting methods are mostly offline and cannot support online scenarios that need continuous reconstruction, such as robotics and VR/AR; online reconstruction also suffers from cumulative drift caused by the conflicting dual roles of the memory state. Method: Proposes the OnlineX framework: (1) decoupled active (dynamically updating local geometry) and stable (conservatively accumulating global structure) states whose information is fused; (2) joint modeling of the 3D visual appearance field and the language field; (3) an implicit Gaussian fusion module to improve reconstruction quality. Result: On mainstream datasets, OnlineX surpasses prior methods in both novel view synthesis and semantic understanding, is robust to input sequences of different lengths, and supports real-time inference. Conclusion: OnlineX achieves efficient, stable, and semantically enriched online 3D scene reconstruction, offering a feasible technical path for dynamic interactive applications in the real world. Abstract: Recent advances in generalizable 3D Gaussian Splatting (3DGS) have enabled rapid 3D scene reconstruction within seconds, eliminating the need for per-scene optimization. However, existing methods primarily follow an offline reconstruction paradigm, lacking the capacity for continuous reconstruction, which limits their applicability to online scenarios such as robotics and VR/AR. In this paper, we introduce OnlineX, a feed-forward framework that reconstructs both 3D visual appearance and language fields in an online manner using only streaming images. A key challenge in the online formulation is the cumulative drift issue, which is rooted in the fundamental conflict between two opposing roles of the memory state: an active role that constantly refreshes to capture high-frequency local geometry, and a stable role that conservatively accumulates and preserves the long-term global structure. To address this, we introduce a decoupled active-to-stable state evolution paradigm. Our framework decouples the memory state into a dedicated active state and a persistent stable state, and then cohesively fuses the information from the former into the latter to achieve both fidelity and stability. Moreover, we jointly model visual appearance and language fields and incorporate an implicit Gaussian fusion module to enhance reconstruction quality.
Experiments on mainstream datasets demonstrate that our method consistently outperforms prior work in novel view synthesis and semantic understanding, showcasing robust performance across input sequences of varying lengths with real-time inference speed.
[421] OmniLottie: Generating Vector Animations via Parameterized Lottie Tokens
Yiying Yang,Wei Cheng,Sijin Chen,Honghao Fu,Xianfang Zeng,Yujun Cai,Gang Yu,Xingjun Ma
Main category: cs.CV
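TL;DR: OmniLottie is a framework that generates high-quality vector animations from multimodal instructions. It combines a structured tokenizer designed for Lottie JSON with a pretrained multimodal large model and is trained and validated on the large-scale MMLottie-2M dataset.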
Details
Motivation: Raw Lottie JSON files contain large amounts of invariant structural metadata and formatting tokens, making them hard to use directly for learning vector animation generation; flexible control over motion and visual content is also needed. Method: Proposes a structured Lottie tokenizer that converts JSON into command sequences of shapes, animation functions, and control parameters; builds the OmniLottie framework on a pretrained vision-language model using this representation; and releases the large-scale multimodal dataset MMLottie-2M. Result: Experiments show that OmniLottie generates vivid, semantically aligned vector animations that closely follow multimodal human instructions. Conclusion: OmniLottie addresses the structural redundancy problem in Lottie animation generation and offers a new paradigm and a practical benchmark for multimodal vector animation generation. Abstract: OmniLottie is a versatile framework that generates high-quality vector animations from multimodal instructions. For flexible motion and visual content control, we focus on Lottie, a lightweight JSON format for representing both shapes and animation behaviors. However, the raw Lottie JSON files contain extensive invariant structural metadata and formatting tokens, posing significant challenges for learning vector animation generation. Therefore, we introduce a well-designed Lottie tokenizer that transforms JSON files into structured sequences of commands and parameters representing shapes, animation functions and control parameters. This tokenizer enables us to build OmniLottie upon pretrained vision language models to follow multimodal interleaved instructions and generate high-quality vector animations. To further advance research in vector animation generation, we curate MMLottie-2M, a large-scale dataset of professionally designed vector animations paired with textual and visual annotations. With extensive experiments, we validate that OmniLottie can produce vivid and semantically aligned vector animations that adhere closely to multimodal human instructions.
[422] Is Bigger Always Better? Efficiency Analysis in Resource-Constrained Small Object Detection
Kwame Mbobda-Kuate,Gabriel Kasmi
Main category: cs.CV
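TL;DR: This paper challenges the default assumption in computer vision that "bigger models + more data = better performance". A systematic efficiency analysis in a resource-constrained Earth observation (EO) setting finds that, for rooftop PV detection, the small model (YOLO11N) is both more efficient and more accurate than the large one (YOLO11X); that input resolution is the most important resource allocation lever, whereas adding data yields little at low resolution; and that small, high-resolution configurations are always Pareto-optimal in the joint accuracy-throughput space.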
Details
Motivation: Tests whether scaling laws hold in resource-constrained Earth observation (EO) tasks, since this assumption drives model selection in CV but has not been empirically tested in EO. Method: On rooftop PV detection in Madagascar, systematically analyzes how three scaling dimensions (model size, dataset size, and input resolution) affect model efficiency (mAP50 per unit of model size), and assesses Pareto optimality in the joint accuracy-throughput space. Result: YOLO11N achieves the highest efficiency (24x higher than YOLO11X) and the highest absolute mAP50 (0.617); higher resolution yields a +120% efficiency gain, whereas more data at low resolution is almost ineffective; across all 44 configurations, small-model, high-resolution setups are Pareto-dominant. Conclusion: In data-scarce EO settings, "bigger" is not only unnecessary but can harm efficiency and performance; input resolution should be optimized first, rather than blindly scaling up the model or the data. Abstract: Scaling laws assume larger models trained on more data consistently outperform smaller ones -- an assumption that drives model selection in computer vision but remains untested in resource-constrained Earth observation (EO). We conduct a systematic efficiency analysis across three scaling dimensions: model size, dataset size, and input resolution, on rooftop PV detection in Madagascar. Optimizing for model efficiency (mAP$_{50}$ per unit of model size), we find a consistent efficiency inversion: YOLO11N achieves both the highest efficiency ($24\times$ higher than YOLO11X) and the highest absolute mAP$_{50}$ (0.617). Resolution is the dominant resource allocation lever ($+$120% efficiency gain), while additional data yields negligible returns at low resolution. These findings are robust to the deployment objective: small high-resolution configurations are Pareto-dominant across all 44 setups in the joint accuracy-throughput space, leaving no tradeoff to resolve. In data-scarce EO, bigger is not just unnecessary: it can be worse.
[423] 3D Field of Junctions: A Noise-Robust, Training-Free Structural Prior for Volumetric Inverse Problems
Namhoon Kim,Narges Moeini,Justin Romberg,Sara Fridovich-Keil
Main category: cs.CV
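TL;DR: This paper proposes a new, fully volumetric 3D Field of Junctions (3D FoJ) representation for volume denoising and for use as a structural prior. It requires no training data, preserves and enhances 3D edge and corner structures at low signal-to-noise ratio, and outperforms classical and neural methods on low-dose CT, cryo-electron tomography, and denoising of lidar point clouds captured in adverse weather.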
Details
Motivation: Many 3D imaging inverse problems face high measurement noise, and the existing 2D image denoising method Field of Junctions performs strongly, which motivates an extension to the 3D volumetric domain. Method: Proposes a fully volumetric 3D Field of Junctions (3D FoJ) representation that optimizes a junction of 3D wedges to explain each 3D patch and enforces consistency between overlapping patches; uses it as an unsupervised structural prior, combined with projected or proximal gradient descent, for low-SNR volumetric inverse problems. Result: On three low-SNR 3D imaging tasks (low-dose X-ray CT, cryogenic electron tomography (cryo-ET), and denoising of lidar point clouds in adverse weather), 3D FoJ outperforms classical and neural methods. Conclusion: 3D FoJ is an effective, general, training-free volumetric structural prior with clear advantages in preserving geometric detail and robust denoising, applicable to a range of low-SNR 3D imaging inverse problems. Abstract: Volume denoising is a foundational problem in computational imaging, as many 3D imaging inverse problems face high levels of measurement noise. Inspired by the strong 2D image denoising properties of Field of Junctions (ICCV 2021), we propose a novel, fully volumetric 3D Field of Junctions (3D FoJ) representation that optimizes a junction of 3D wedges that best explain each 3D patch of a full volume, while encouraging consistency between overlapping patches. In addition to direct volume denoising, we leverage our 3D FoJ representation as a structural prior that: (i) requires no training data, and thus precludes the risk of hallucination, (ii) preserves and enhances sharp edge and corner structures in 3D, even under low signal to noise ratio (SNR), and (iii) can be used as a drop-in denoising representation via projected or proximal gradient descent for any volumetric inverse problem with low SNR. We demonstrate successful volume reconstruction and denoising with 3D FoJ across three diverse 3D imaging tasks with low-SNR measurements: low-dose X-ray computed tomography (CT), cryogenic electron tomography (cryo-ET), and denoising point clouds such as those from lidar in adverse weather. Across these challenging low-SNR volumetric imaging problems, 3D FoJ outperforms a mixture of classical and neural methods.
[424] Bridging the gap between Performance and Interpretability: An Explainable Disentangled Multimodal Framework for Cancer Survival Prediction
Aniek Eijpe,Soufyan Lakbir,Melis Erdal Cesur,Sara P. Oliveira,Angelos Chatzimparmpas,Sanne Abeln,Wilson Silva
Main category: cs.CV
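TL;DR: This paper proposes DIMAFx, a framework for cancer survival prediction that disentangles modality-specific and modality-shared representations of histopathology whole-slide images and transcriptomics data, combining high accuracy with high interpretability.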
Details
Motivation: Multimodal survival prediction models are increasingly accurate, but their complexity weakens interpretability and limits understanding of how different data sources affect predictions. Method: Proposes DIMAFx, an explainable multimodal framework that, together with SHAP, learns disentangled modality-specific and modality-shared representations from histopathology whole-slide images and transcriptomics data, and systematically reveals key multimodal interactions and their biological meaning. Result: Achieves SOTA performance and improved representation disentanglement across multiple cancer cohorts; in breast cancer prediction, the most predictive features are modality-shared information (such as solid tumor morphology associated with the late estrogen response pathway), and modality-specific microenvironmental signals are also identified (such as interacting adipose and stromal morphologies). Conclusion: Multimodal models can combine performance with interpretability, providing strong support for precision medicine. Abstract: While multimodal survival prediction models are increasingly accurate, their complexity often reduces interpretability, limiting insight into how different data sources influence predictions. To address this, we introduce DIMAFx, an explainable multimodal framework for cancer survival prediction that produces disentangled, interpretable modality-specific and modality-shared representations from histopathology whole-slide images and transcriptomics data. Across multiple cancer cohorts, DIMAFx achieves state-of-the-art performance and improved representation disentanglement. Leveraging its interpretable design and SHapley Additive exPlanations, DIMAFx systematically reveals key multimodal interactions and the biological information encoded in the disentangled representations. In breast cancer survival prediction, the most predictive features contain modality-shared information, including one capturing solid tumor morphology contextualized primarily by late estrogen response, where higher-grade morphology aligned with pathway upregulation and increased risk, consistent with known breast cancer biology. Key modality-specific features capture microenvironmental signals from interacting adipose and stromal morphologies. These results show that multimodal models can overcome the traditional trade-off between performance and explainability, supporting their application in precision medicine.
[425] GeoDiT: Point-Conditioned Diffusion Transformer for Satellite Image Synthesis
Srikumar Sastry,Dan Cher,Brian Wei,Aayush Dhakal,Subash Khanal,Dev Gupta,Nathan Jacobs
Main category: cs.CV
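TL;DR: This paper proposes GeoDiT, a diffusion transformer for text-to-satellite image generation with point-based control. Spatial point locations and their associated text descriptions provide semantically rich control over generation, and an adaptive local attention mechanism improves performance.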
Details
Motivation: Existing controlled satellite image generation models rely on pixel-level annotation maps that are time-consuming to acquire yet semantically limited, so a more efficient and semantically richer form of control is needed. Method: Proposes a point-based conditioning framework that controls generation through the spatial location of points and their text descriptions; designs an adaptive local attention mechanism that regulates attention scores based on the point queries; systematically evaluates remote-sensing-specific design choices, including the satellite image representation and the geolocation representation. Result: GeoDiT surpasses state-of-the-art remote sensing generative models in generation performance. Conclusion: The point-based control framework combines flexibility, annotation friendliness, and computational simplicity, offering a new paradigm for text-to-satellite image generation. Abstract: We introduce GeoDiT, a diffusion transformer designed for text-to-satellite image generation with point-based control. Existing controlled satellite image generative models often require pixel-level maps that are time-consuming to acquire, yet semantically limited. To address this limitation, we introduce a novel point-based conditioning framework that controls the generation process through the spatial location of the points and the textual description associated with each point, providing semantically rich control signals. This approach enables flexible, annotation-friendly, and computationally simple inference for satellite image generation. To this end, we introduce an adaptive local attention mechanism that effectively regularizes the attention scores based on the input point queries. We systematically evaluate various domain-specific design choices for training GeoDiT, including the selection of satellite image representation for alignment and geolocation representation for conditioning. Our experiments demonstrate that GeoDiT achieves impressive generation performance, surpassing the state-of-the-art remote sensing generative models.
[426] Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance
Yiqi Lin,Guoqiang Liang,Ziyun Zeng,Zechen Bai,Yanzhe Chen,Mike Zheng Shou
Main category: cs.CV
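TL;DR: This paper proposes Kiwi-Edit, a new video editing method that, by building the high-quality reference-guided dataset RefVIE and using a multi-stage training strategy, sets a new state of the art in instruction following and reference fidelity.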
Details
Motivation: Existing instruction-based video editing methods struggle to control visual details precisely, while reference-guided methods are limited by the scarcity of high-quality paired training data. Method: Designs a scalable data generation pipeline that uses image generative models to synthesize reference scaffolds, and builds the RefVIE dataset and the RefVIE-Bench evaluation benchmark; proposes a unified architecture, Kiwi-Edit, that fuses learnable queries with latent visual features and is trained with a progressive multi-stage curriculum. Result: Significantly improves instruction following and reference fidelity in controllable video editing, reaching a new state of the art; all data, models, and code are open-sourced. Conclusion: Combining high-quality synthetic data with a jointly optimized unified architecture can overcome the limits of language expressiveness and the data bottleneck in current video editing, advancing the reference-guided editing paradigm. Abstract: Instruction-based video editing has witnessed rapid progress, yet current methods often struggle with precise visual control, as natural language is inherently limited in describing complex visual nuances. Although reference-guided editing offers a robust solution, its potential is currently bottlenecked by the scarcity of high-quality paired training data. To bridge this gap, we introduce a scalable data generation pipeline that transforms existing video editing pairs into high-fidelity training quadruplets, leveraging image generative models to create synthesized reference scaffolds. Using this pipeline, we construct RefVIE, a large-scale dataset tailored for instruction-reference-following tasks, and establish RefVIE-Bench for comprehensive evaluation. Furthermore, we propose a unified editing architecture, Kiwi-Edit, that synergizes learnable queries and latent visual features for reference semantic guidance. Our model achieves significant gains in instruction following and reference fidelity via a progressive multi-stage training curriculum. Extensive experiments demonstrate that our data and architecture establish a new state-of-the-art in controllable video editing. All datasets, models, and code are released at https://github.com/showlab/Kiwi-Edit.
[427] Leveraging Model Soups to Classify Intangible Cultural Heritage Images from the Mekong Delta
Quoc-Khang Tran,Minh-Thien Nguyen,Nguyen-Khang Pham
Main category: cs.CV
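TL;DR: This paper proposes a robust framework combining the CoAtNet architecture with model soups to address data scarcity, high inter-class similarity, and domain heterogeneity in classifying intangible cultural heritage (ICH) images from the Mekong Delta. A diversity-aware checkpoint averaging strategy significantly reduces variance and improves generalization without increasing inference cost, achieving state-of-the-art results on the ICH-17 dataset.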
Details
Motivation: Classifying intangible cultural heritage (ICH) images from the Mekong Delta is a low-resource problem with little annotated data, high visual similarity between classes, and strong domain heterogeneity; conventional deep learning models tend to show high variance or overfit to spurious correlations, and generalize poorly. Method: Proposes a framework that combines CoAtNet (a hybrid architecture joining convolution and self-attention) with model soups (a lightweight weight-space ensembling method that averages checkpoints from a single training trajectory); uses two strategies, greedy soup and uniform soup, to select diverse checkpoints; analyzes the ensembling effect with bias-variance decomposition, cross-entropy-based distance metrics, and Multidimensional Scaling (MDS). Result: On the ICH-17 dataset (7,406 images, 17 classes), achieves 72.36% top-1 accuracy and 69.28% macro F1-score, outperforming strong baselines such as ResNet-50, DenseNet-121, and ViT. Conclusion: Diversity-aware checkpoint averaging is a principled and efficient way to suppress variance, and can significantly improve generalization in culturally rich but data-scarce settings. Abstract: The classification of Intangible Cultural Heritage (ICH) images in the Mekong Delta poses unique challenges due to limited annotated data, high visual similarity among classes, and domain heterogeneity. In such low-resource settings, conventional deep learning models often suffer from high variance or overfit to spurious correlations, leading to poor generalization. To address these limitations, we propose a robust framework that integrates the hybrid CoAtNet architecture with model soups, a lightweight weight-space ensembling technique that averages checkpoints from a single training trajectory without increasing inference cost. CoAtNet captures both local and global patterns through stage-wise fusion of convolution and self-attention. We apply two ensembling strategies - greedy and uniform soup - to selectively combine diverse checkpoints into a final model. Beyond performance improvements, we analyze the ensembling effect through the lens of bias-variance decomposition. Our findings show that model soups reduce variance by stabilizing predictions across diverse model snapshots, while introducing minimal additional bias. Furthermore, using cross-entropy-based distance metrics and Multidimensional Scaling (MDS), we show that model soups select geometrically diverse checkpoints, unlike Soft Voting, which blends redundant models centered in output space.
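The checkpoint averaging described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: checkpoints are assumed to be plain dictionaries mapping a parameter name to a flat list of floats, the names `uniform_soup`, `greedy_soup`, and `score` are hypothetical, and the greedy rule (rank by held-out score, keep a checkpoint only if the averaged model does not get worse) follows the commonly described model-soup procedure rather than anything stated in the abstract.

```python
from typing import Callable, Dict, List

# Assumption: a checkpoint is a mapping from parameter name to a flat list of weights.
Checkpoint = Dict[str, List[float]]


def uniform_soup(checkpoints: List[Checkpoint]) -> Checkpoint:
    """Average every parameter element-wise across all checkpoints."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    soup: Checkpoint = {}
    for name in checkpoints[0]:
        # zip(*...) groups the i-th weight of each checkpoint together.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        soup[name] = [sum(col) / len(checkpoints) for col in columns]
    return soup


def greedy_soup(
    checkpoints: List[Checkpoint],
    score: Callable[[Checkpoint], float],
) -> Checkpoint:
    """Add checkpoints in descending score order, keeping each one only if
    the averaged model's held-out score does not decrease."""
    ranked = sorted(checkpoints, key=score, reverse=True)
    kept = [ranked[0]]
    best = score(uniform_soup(kept))
    for ckpt in ranked[1:]:
        candidate = score(uniform_soup(kept + [ckpt]))
        if candidate >= best:
            kept.append(ckpt)
            best = candidate
    return uniform_soup(kept)
```

Because the result is a single set of averaged weights, inference cost matches that of one model, which is the property the abstract highlights. In practice `score` would evaluate the averaged weights on a validation set.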
Evaluated on the ICH-17 dataset (7,406 images across 17 classes), our approach achieves state-of-the-art results with 72.36% top-1 accuracy and 69.28% macro F1-score, outperforming strong baselines including ResNet-50, DenseNet-121, and ViT. These results underscore that diversity-aware checkpoint averaging provides a principled and efficient way to reduce variance and enhance generalization in culturally rich, data-scarce classification tasks.
[428] Sketch2Colab: Sketch-Conditioned Multi-Human Animation via Controllable Flow Distillation
Divyanshu Daiya,Aniket Bera
Main category: cs.CV
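TL;DR: Sketch2Colab converts 2D storyboard sketches into physically plausible, object-aware 3D multi-human motion. It combines a sketch-driven diffusion prior, latent-space rectified-flow distillation, differentiable energy constraints, and continuous-time Markov chain event planning to achieve precise interaction control and fast inference.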
Details
Motivation: Existing diffusion-based motion generators, when required to satisfy complex multi-agent interaction constraints (such as contacts, timing, and joint control), suffer from high training cost, expensive posterior guidance, and degraded performance under strong conditioning; a new paradigm that balances precision, physical plausibility, and inference efficiency is needed. Method: Proposes a two-stage framework: (1) learn a sketch-driven diffusion prior; (2) distill it into a rectified-flow student model in latent space. Differentiable keyframe, trajectory, and physics energy functions directly shape the transport field, and a coupled continuous-time Markov chain (CTMC) planner models discrete interaction events (such as grasps and handoffs), dynamically modulating motion phases. Result: Achieves state-of-the-art constraint adherence and perceptual quality on the CORE4D and InterHuman datasets, with significantly faster inference than diffusion-only baselines. Conclusion: By jointly optimizing geometric priors, physical constraints, and event logic, Sketch2Colab achieves fine-grained, robust, controllable generation from sketch instructions while preserving motion realism, offering a new approach to multi-agent collaborative motion synthesis. Abstract: We present Sketch2Colab, which turns storyboard-style 2D sketches into coherent, object-aware 3D multi-human motion with fine-grained control over agents, joints, timing, and contacts. Conventional diffusion-based motion generators have advanced realism; however, achieving precise adherence to rich interaction constraints typically demands extensive training and/or costly posterior guidance, and performance can degrade under strong multi-entity conditioning. Sketch2Colab instead first learns a sketch-driven diffusion prior and then distills it into an efficient rectified-flow student operating in latent space for fast, stable sampling. Differentiable energies over keyframes, trajectories, and physics-based constraints directly shape the student's transport field, steering samples toward motions that faithfully satisfy the storyboard while remaining physically plausible. To capture coordinated interaction, we augment the continuous flow with a continuous-time Markov chain (CTMC) planner that schedules discrete events such as touches, grasps, and handoffs, modulating the dynamics to produce crisp, well-phased human-object-human collaborations. Experiments on CORE4D and InterHuman show that Sketch2Colab achieves state-of-the-art constraint adherence and perceptual quality while offering significantly faster inference than diffusion-only baselines.
[429] From Leaderboard to Deployment: Code Quality Challenges in AV Perception Repositories
Mateus Karvat,Bram Adams,Sidney Givigi
Main category: cs.CV
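TL;DR: This paper presents the first large-scale empirical study of software quality in autonomous vehicle perception model code. Analyzing 178 models from the KITTI and NuScenes leaderboards, it finds that only 7.3% meet basic production-readiness criteria, and it proposes guidelines for addressing security issues along with recommended CI/CD practices.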
Details
Motivation: Evaluation of AV perception models relies too heavily on benchmark performance metrics and neglects code quality, production readiness, and long-term maintainability, so research outputs struggle to meet the international safety standards required of safety-critical systems such as autonomous driving. Method: Conducts a systematic static analysis of 178 unique model repositories from the KITTI and NuScenes 3D object detection leaderboards, using Pylint, Bandit, and Radon to assess code errors, security vulnerabilities, maintainability, and development practices. Result: Only 7.3% of repositories meet the basic production-readiness criteria of zero critical errors and no high-severity security vulnerabilities; the top five issue types account for almost 80% of security issues; adoption of CI/CD pipelines correlates positively with code maintainability. Conclusion: Leaderboard performance does not reflect production readiness; targeted interventions (such as secure coding guidelines and wider CI/CD adoption) are needed to improve the quality and safety of AV perception code. Abstract: Autonomous vehicle (AV) perception models are typically evaluated solely on benchmark performance metrics, with limited attention to code quality, production readiness and long-term maintainability. This creates a significant gap between research excellence and real-world deployment in safety-critical systems subject to international safety standards. To address this gap, we present the first large-scale empirical study of software quality in AV perception repositories, systematically analyzing 178 unique models from the KITTI and NuScenes 3D Object Detection leaderboards. Using static analysis tools (Pylint, Bandit, and Radon), we evaluated code errors, security vulnerabilities, maintainability, and development practices. Our findings revealed that only 7.3% of the studied repositories meet basic production-readiness criteria, defined as having zero critical errors and no high-severity security vulnerabilities. Security issues are highly concentrated, with the top five issues responsible for almost 80% of occurrences, which prompted us to develop a set of actionable guidelines to prevent them. Additionally, the adoption of Continuous Integration/Continuous Deployment pipelines was correlated with better code maintainability. Our findings highlight that leaderboard performance does not reflect production readiness and that targeted interventions could substantially improve the quality and safety of AV perception code.
[430] Adaptive Confidence Regularization for Multimodal Failure Detection
Moru Liu,Hao Dong,Olga Fink,Mario Trapp
Main category: cs.CV
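TL;DR: This paper proposes Adaptive Confidence Regularization (ACR), a framework for detecting failures of multimodal models that improves reliability through an adaptive confidence loss and multimodal feature swapping.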
Details
Motivation: Deploying multimodal models in high-stakes domains (such as autonomous driving and medical diagnostics) requires not only strong predictive performance but also reliable failure detection, yet multimodal failure detection has been largely unexplored. Method: Proposes the ACR framework with two core components: (1) an adaptive confidence loss that penalizes "confidence degradation", where the confidence of the multimodal prediction falls below that of any unimodal branch; (2) multimodal feature swapping, a novel outlier synthesis technique that generates failure-aware training samples. Result: Extensive experiments across four datasets, three modalities, and multiple evaluation settings show that ACR achieves consistent and robust gains. Conclusion: ACR effectively improves the ability of multimodal models to recognize and reject failure cases, significantly strengthening their reliability in critical applications. Abstract: The deployment of multimodal models in high-stakes domains, such as self-driving vehicles and medical diagnostics, demands not only strong predictive performance but also reliable mechanisms for detecting failures. In this work, we address the largely unexplored problem of failure detection in multimodal contexts. We propose Adaptive Confidence Regularization (ACR), a novel framework specifically designed to detect multimodal failures. Our approach is driven by a key observation: in most failure cases, the confidence of the multimodal prediction is significantly lower than that of at least one unimodal branch, a phenomenon we term confidence degradation. To mitigate this, we introduce an Adaptive Confidence Loss that penalizes such degradations during training. In addition, we propose Multimodal Feature Swapping, a novel outlier synthesis technique that generates challenging, failure-aware training examples. By training with these synthetic failures, ACR learns to more effectively recognize and reject uncertain predictions, thereby improving overall reliability. Extensive experiments across four datasets, three modalities, and multiple evaluation settings demonstrate that ACR achieves consistent and robust gains. The source code will be available at https://github.com/mona4399/ACR.
[431] HiFi-Inpaint: Towards High-Fidelity Reference-Based Inpainting for Generating Detail-Preserving Human-Product Images
Yichen Liu,Donghao Zhou,Jie Wang,Xin Gao,Guisheng Liu,Jiatong Li,Quanwei Zhang,Qiang Lyu,Lanqing Guo,Shilei Wen,Weiqiang Wang,Pheng-Ann Heng
Main category: cs.CV
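TL;DR: This paper proposes the HiFi-Inpaint framework, which uses Shared Enhancement Attention (SEA) and a Detail-Aware Loss (DAL) to improve product-detail fidelity in reference-guided inpainting, and builds a new dataset, HP-Image-40K, significantly improving the quality of generated human-product images.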