cs.CL

[1] Prune as You Generate: Online Rollout Pruning for Faster and Better RLVR

Haobo Xu,Sirui Chen,Ruizhong Qiu,Yuchen Yan,Chen Luo,Monica Cheng,Jingrui He,Hanghang Tong

Main category: cs.CL

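TL;DR: This paper proposes arrol, which prunes rollouts online and balances the correctness distribution of the survivors to improve the efficiency and effectiveness of RLVR training, raising accuracy and speeding up training across multiple models and algorithms.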

Details

Motivation: Existing RLVR methods such as GRPO and DAPO are computationally expensive, and because relative advantages are often sparse, within-group reward variance is low and the learning signal is weak.

Method: arrol prunes rollouts online during generation. It trains a lightweight quality head to predict the success probability of partial rollouts so that pruning can happen early, and reuses that head to weight candidates for better test-time scaling. At the system level, pruning and re-batching are done inside the inference engine.

Result: On Qwen-3 and LLaMA-3.2 (1B-8B), arrol improves average accuracy by +2.30 to +2.99, speeds up training by up to 1.7x, and adds up to +8.33 accuracy in test-time scaling.

Conclusion: arrol is an efficient, practical RLVR optimization method that combines faster training with better performance and applies to multiple base models and RLVR algorithms.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capabilities of Large Language Models (LLMs). However, methods such as GRPO and DAPO suffer from substantial computational cost, since they rely on sampling many rollouts for each prompt. Moreover, in RLVR the relative advantage is often sparse: many groups of samples become nearly all-correct or all-incorrect, yielding low within-group reward variance and thus weak learning signals. In this paper, we introduce arrol (Accelerating RLVR via online Rollout Pruning), an online rollout pruning method that prunes rollouts during generation while explicitly steering the surviving ones toward a more correctness-balanced mix to enhance learning signals. Specifically, arrol trains a lightweight quality head on-the-fly to predict the success probability of partial rollouts and uses it to make early pruning decisions. The learned quality head can further weigh candidates to improve inference accuracy during test-time scaling. To improve efficiency, we present a system design that prunes rollouts inside the inference engine and re-batches the remaining ones for log-probability computation and policy updates. Across GRPO and DAPO on Qwen-3 and LLaMA-3.2 models (1B-8B), arrol improves average accuracy by +2.30 to +2.99 while achieving up to 1.7x training speedup, and yields up to +8.33 additional gains in average accuracy in test-time scaling. The code is available at https://github.com/Hsu1023/ARRoL.
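The weak-signal problem described in the abstract can be illustrated with a minimal sketch. It assumes the commonly used group-normalized advantage, (reward - group mean) / (group std + eps); the pruning rule `prune_to_balance` is purely hypothetical and is not taken from the arrol code base. It only shows what "correctness-balanced" survivors could look like when a quality head supplies per-rollout success estimates.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def prune_to_balance(predicted_success, keep):
    """Hypothetical rule: keep `keep` rollouts by alternating between the
    highest and lowest predicted success, so survivors are mixed."""
    order = sorted(range(len(predicted_success)), key=lambda i: predicted_success[i])
    lo, hi = 0, len(order) - 1
    kept, take_high = [], True
    while len(kept) < keep and lo <= hi:
        if take_high:
            kept.append(order[hi])
            hi -= 1
        else:
            kept.append(order[lo])
            lo += 1
        take_high = not take_high
    return sorted(kept)

# An all-correct group carries no learning signal: every advantage is zero.
print(group_advantages([1, 1, 1, 1]))
# A balanced group yields advantages of roughly +1 and -1.
print(group_advantages([1, 0, 1, 0]))
```

With binary rewards, a group whose rollouts all succeed or all fail has zero variance, so every advantage vanishes; keeping a mix of likely-correct and likely-incorrect rollouts avoids that case.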

[2] LogSigma at SemEval-2026 Task 3: Uncertainty-Weighted Multitask Learning for Dimensional Aspect-Based Sentiment Analysis

Baraa Hikal,Jonas Becker,Bela Gipp

Main category: cs.CL

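TL;DR: LogSigma is a system for SemEval-2026 Task 3 (Dimensional Aspect-Based Sentiment Analysis, DimABSA) that uses learned homoscedastic uncertainty to automatically balance the continuous Valence and Arousal regression tasks, taking 1st place on multilingual datasets.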

Details

Motivation: Traditional ABSA predicts discrete sentiment labels, whereas DimABSA requires continuous Valence and Arousal scores (1-9). The two differ markedly in prediction difficulty across languages and domains, so the tasks need to be balanced dynamically.

Method: Regression losses are weighted with learned homoscedastic uncertainty (task-specific log-variance parameters), combined with language-specific encoders and multi-seed ensembling.

Result: 1st place on five datasets across both tracks. The learned variance weights differ substantially across languages (for example 0.66x for German and 2.18x for English), confirming that task balancing must adapt to the language.

Conclusion: The optimal Valence-Arousal balance is language-dependent and cannot be set in advance; the uncertainty modeling used in LogSigma improves multilingual prediction of continuous sentiment dimensions.

Abstract: This paper describes LogSigma, our system for SemEval-2026 Task 3: Dimensional Aspect-Based Sentiment Analysis (DimABSA). Unlike traditional Aspect-Based Sentiment Analysis (ABSA), which predicts discrete sentiment labels, DimABSA requires predicting continuous Valence and Arousal (VA) scores on a 1-9 scale. A central challenge is that Valence and Arousal differ in prediction difficulty across languages and domains. We address this using learned homoscedastic uncertainty, where the model learns task-specific log-variance parameters to automatically balance each regression objective during training. Combined with language-specific encoders and multi-seed ensembling, LogSigma achieves 1st place on five datasets across both tracks. The learned variance weights vary substantially across languages due to differing Valence-Arousal difficulty profiles, from 0.66x for German to 2.18x for English, demonstrating that optimal task balancing is language-dependent and cannot be determined a priori.
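As an illustration of how log-variance parameters can balance two regression losses, here is a minimal sketch of one widely used homoscedastic-uncertainty formulation, in which each task loss is scaled by exp(-s) and regularized by s, where s is the log-variance. That LogSigma uses exactly this form is an assumption; the function names are illustrative.

```python
import math

def uncertainty_weighted_loss(task_losses, log_vars):
    """Sum over tasks of 0.5 * exp(-s) * loss + 0.5 * s, with s = log(sigma^2)."""
    total = 0.0
    for loss, s in zip(task_losses, log_vars):
        total += 0.5 * math.exp(-s) * loss + 0.5 * s
    return total

def optimal_log_var(loss):
    """Setting the derivative in s to zero gives s = log(loss):
    harder tasks (larger loss) receive a larger variance, hence a smaller weight."""
    return math.log(loss)

valence_loss, arousal_loss = 1.0, 2.0  # made-up values for illustration
print(uncertainty_weighted_loss([valence_loss, arousal_loss], [0.0, 0.0]))
```

In training, the log-variances would be learnable parameters updated by gradient descent together with the model weights; the closed-form optimum above only shows which way they move.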

[3] Toward domain-specific machine translation and quality estimation systems

Javad Pourmostafa Roshan Sharami

Main category: cs.CL

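TL;DR: This dissertation studies domain adaptation for machine translation (MT) and quality estimation (QE) under domain mismatch. It proposes methods based on data selection, subword tokenization alignment, lightweight data augmentation, and QE-guided in-context learning, improving performance in cross-domain, low-resource, and zero-shot settings.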

Details

Motivation: MT and QE perform well in general domains but degrade under domain mismatch, so efficient adaptation methods for specialized domains are needed.

Method: Four data-driven contributions: (1) similarity-based data selection for MT; (2) staged QE training that combines domain adaptation with lightweight augmentation; (3) an analysis of how subword tokenization and vocabulary alignment affect fine-tuning; (4) QE-guided in-context learning for large language models.

Result: Small in-domain datasets outperform much larger generic ones; the QE method improves performance in multilingual, zero-shot, and cross-lingual settings; aligned tokenization-vocabulary configurations improve stability and translation quality; QE-guided retrieval outperforms standard retrieval and supports reference-free evaluation.

Conclusion: Domain adaptation depends heavily on the data selection strategy, on representation consistency (such as tokenization alignment), and on efficient adaptation mechanisms. The proposed methods enable MT and QE systems that run robustly in specialized domains.

Abstract: Machine Translation (MT) and Quality Estimation (QE) perform well in general domains but degrade under domain mismatch. This dissertation studies how to adapt MT and QE systems to specialized domains through a set of data-focused contributions. Chapter 2 presents a similarity-based data selection method for MT. Small, targeted in-domain subsets outperform much larger generic datasets and reach strong translation quality at lower computational cost. Chapter 3 introduces a staged QE training pipeline that combines domain adaptation with lightweight data augmentation. The method improves performance across domains, languages, and resource settings, including zero-shot and cross-lingual cases. Chapter 4 studies the role of subword tokenization and vocabulary in fine-tuning. Aligned tokenization-vocabulary setups lead to stable training and better translation quality, while mismatched configurations reduce performance. Chapter 5 proposes a QE-guided in-context learning method for large language models. QE models select examples that improve translation quality without parameter updates and outperform standard retrieval methods. The approach also supports a reference-free setup, reducing reliance on a single reference set. These results show that domain adaptation depends on data selection, representation, and efficient adaptation strategies. The dissertation provides methods for building MT and QE systems that perform reliably in domain-specific settings.
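The idea of similarity-based data selection can be sketched in a few lines. This sketch assumes bag-of-words cosine similarity against a small set of in-domain seed sentences; the dissertation's actual similarity measure is not stated in the abstract, so this is a stand-in, and all names are illustrative.

```python
import math
from collections import Counter

def bow(text):
    """Lowercased whitespace bag-of-words (stand-in for a real sentence embedding)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_in_domain(pool, seed_sentences, k):
    """Rank generic-pool sentences by similarity to the in-domain seeds; keep the top k."""
    centroid = Counter()
    for sentence in seed_sentences:
        centroid.update(bow(sentence))
    ranked = sorted(pool, key=lambda s: cosine(bow(s), centroid), reverse=True)
    return ranked[:k]

pool = [
    "the patient received a dose of insulin",
    "the football match ended in a draw",
    "insulin dose was adjusted for the patient",
]
print(select_in_domain(pool, ["patient insulin dose"], k=2))
```

The selected subset would then be used for fine-tuning in place of the full generic pool.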

[4] LLM-Driven Reasoning for Constraint-Aware Feature Selection in Industrial Systems

Yuhang Zhou,Zhuokai Zhao,Ke Li,Spilios Evmorfos,Gökalp Demirci,Mingyi Wang,Qiao Liu,Qifei Wang,Serena Li,Weiwei Li,Tingting Wang,Mingze Gao,Gedi Zhou,Abhishek Kumar,Xiangjun Fan,Lizhu Zhang,Jiayi Liu

Main category: cs.CL

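TL;DR: This paper proposes the MoFA framework, which uses large language models for interpretable, constraint-aware, sequential feature selection based on semantic and quantitative information. It targets industrial settings with scarce labeled data and is validated in three real-world applications.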

Details

Motivation: Traditional feature selection relies on labeled data and statistical heuristics, which makes it hard to apply in production environments where labeled data are limited and multiple operational constraints must be met.

Method: Model Feature Agent (MoFA) is a model-driven framework that places feature definitions, importance scores, correlations, and metadata into structured prompts and performs sequential feature selection through interpretable, constraint-aware reasoning.

Result: Validated in real industrial scenarios: (1) it improves accuracy in interest and time-worthiness prediction while reducing feature group complexity; (2) in value model enhancement it discovers high-order interaction terms that substantially raise online user engagement; (3) in notification behavior prediction it selects compact, efficient feature subsets that improve both accuracy and inference efficiency.

Conclusion: LLM-driven, reasoning-based feature selection is practical and effective in real production systems and offers a new approach for label-scarce settings.

Abstract: Feature selection is a crucial step in large-scale industrial machine learning systems, directly affecting model accuracy, efficiency, and maintainability. Traditional feature selection methods rely on labeled data and statistical heuristics, making them difficult to apply in production environments where labeled data are limited and multiple operational constraints must be satisfied. To address this, we propose Model Feature Agent (MoFA), a model-driven framework that performs sequential, reasoning-based feature selection using both semantic and quantitative feature information. MoFA incorporates feature definitions, importance scores, correlations, and metadata (e.g., feature groups or types) into structured prompts and selects features through interpretable, constraint-aware reasoning. We evaluate MoFA in three real-world industrial applications: (1) True Interest and Time-Worthiness Prediction, where it improves accuracy while reducing feature group complexity, (2) Value Model Enhancement, where it discovers high-order interaction terms that yield substantial engagement gains in online experiments, and (3) Notification Behavior Prediction, where it selects compact, high-value feature subsets that improve both model accuracy and inference efficiency. Together, these results demonstrate the practicality and effectiveness of LLM-based reasoning for feature selection in real production systems.

[5] Exons-Detect: Identifying and Amplifying Exonic Tokens via Hidden-State Discrepancy for Robust AI-Generated Text Detection

Xiaowei Zhu,Yubing Ren,Fang Fang,Shi Wang,Yanan Cao,Li Guo

Main category: cs.CL

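TL;DR: This paper proposes Exons-Detect, a training-free method for detecting AI-generated text that uses exon-aware token reweighting to improve robustness and interpretability, reaching state-of-the-art results on multiple metrics.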

Details

Motivation: Existing training-free detectors assume that all tokens contribute uniformly, so they struggle with short texts and localized edits. The spread of AI-generated content also brings societal risks such as misinformation and copyright problems, which calls for more robust and reliable detection.

Method: Exons-Detect measures hidden-state discrepancy under a dual-model setting to identify and amplify informative "exonic" tokens, then computes an interpretable translation score from the importance-weighted token sequence.

Result: On DetectRL, average AUROC improves by 2.2% (relative) over the strongest baseline, with markedly better robustness to adversarial attacks and varying input lengths.

Conclusion: Exons-Detect shows that token-level differentiated modeling is effective for training-free detection and offers a new approach to interpretable, robust AI-text detection.

Abstract: The rapid advancement of large language models has increasingly blurred the boundary between human-written and AI-generated text, raising societal risks such as misinformation dissemination, authorship ambiguity, and threats to intellectual property rights. These concerns highlight the urgent need for effective and reliable detection methods. While existing training-free approaches often achieve strong performance by aggregating token-level signals into a global score, they typically assume uniform token contributions, making them less robust under short sequences or localized token modifications. To address these limitations, we propose Exons-Detect, a training-free method for AI-generated text detection based on an exon-aware token reweighting perspective. Exons-Detect identifies and amplifies informative exonic tokens by measuring hidden-state discrepancy under a dual-model setting, and computes an interpretable translation score from the resulting importance-weighted token sequence. Empirical evaluations demonstrate that Exons-Detect achieves state-of-the-art detection performance and exhibits strong robustness to adversarial attacks and varying input lengths. In particular, it attains a 2.2% relative improvement in average AUROC over the strongest prior baseline on DetectRL.

[6] Closing the Confidence-Faithfulness Gap in Large Language Models

Miranda Muqing Miao,Lyle Ungar

Main category: cs.CL

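TL;DR: Through a mechanistic interpretability analysis, this paper finds that the verbalized confidence and actual accuracy of large language models (LLMs) are each encoded linearly but orthogonally to one another. Reasoning interferes with the confidence representation, producing a "Reasoning Contamination Effect". Based on this, the authors propose a two-stage adaptive steering method that substantially improves calibration alignment.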

Details

Motivation: LLMs often verbalize confidence that is detached from their actual accuracy, and the geometric mechanism behind this is not well understood.

Method: Linear probes and contrastive activation addition (CAA) steering are used for a mechanistic interpretability analysis across three open-weight models and four datasets, followed by a two-stage adaptive steering pipeline.

Result: The calibration signal and the verbalized confidence signal are orthogonal in the linear space; a "Reasoning Contamination Effect" is identified; the two-stage steering method substantially improves calibration alignment on all evaluated models.

Conclusion: Verbalized confidence and the internal calibration signal are both linearly decodable but orthogonal to each other, and reasoning disrupts the verbalized confidence direction. Steering the output toward the model's internal accuracy estimate mitigates miscalibration.

Abstract: Large language models (LLMs) tend to verbalize confidence scores that are largely detached from their actual accuracy, yet the geometric relationship governing this behavior remains poorly understood. In this work, we present a mechanistic interpretability analysis of verbalized confidence, using linear probes and contrastive activation addition (CAA) steering to show that calibration and verbalized confidence signals are encoded linearly but are orthogonal to one another -- a finding consistent across three open-weight models and four datasets. Interestingly, when models are prompted to simultaneously reason through a problem and verbalize a confidence score, the reasoning process disrupts the verbalized confidence direction, exacerbating miscalibration. We term this the "Reasoning Contamination Effect." Leveraging this insight, we introduce a two-stage adaptive steering pipeline that reads the model's internal accuracy estimate and steers verbalized output to match it, substantially improving calibration alignment across all evaluated models.
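The two geometric ingredients in the abstract, a steering direction and an orthogonality check, can be sketched on toy vectors. This assumes the usual construction of a contrastive steering vector as the difference between mean activations of two contrasting prompt sets; the layer choice, the probe, and the paper's adaptive two-stage logic are not reproduced, and all names are illustrative.

```python
import math

def mean_vector(vectors):
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def steering_vector(positive_acts, negative_acts):
    """Difference of mean activations between two contrasting prompt sets."""
    mp, mn = mean_vector(positive_acts), mean_vector(negative_acts)
    return [p - q for p, q in zip(mp, mn)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def steer(hidden, direction, alpha):
    """Add alpha * direction to a hidden state (activation addition)."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Toy 2-D activations: the "confidence" direction lies on the first axis.
confidence_dir = steering_vector([[1.0, 0.0], [3.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]])
accuracy_dir = [0.0, 1.0]  # toy probe direction for the internal accuracy signal
print(cosine(confidence_dir, accuracy_dir))   # near zero means orthogonal
print(steer([1.0, 1.0], confidence_dir, 0.5))
```

A cosine near zero between the two directions is what "orthogonal" means here: moving along one direction leaves the projection onto the other unchanged.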

[7] OMIND: Framework for Knowledge Grounded Finetuning and Multi-Turn Dialogue Benchmark for Mental Health LLMs

Suraj Racha,Prashant Harish Joshi,Utkarsh Maurya,Nitin Yadav,Mridul Sharma,Ananya Kunisetty,Saranya Darisipudi,Nirmal Punjabi,Ganesh Ramakrishnan

Main category: cs.CL

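TL;DR: This paper proposes the oMind framework. By building a high-quality multi-task SFT dataset (about 164k samples) and a new multi-turn dialogue benchmark, oMind-Chat, it addresses three challenges for LLMs in mental health: low-quality training data, limited training paradigms, and the difficulty of evaluating multi-turn dialogue. It reports clear gains in both core capabilities and conversational performance.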

Details

Motivation: Mental health problems are a growing concern and LLMs have large potential in this area, but three challenges remain: a lack of high-quality, interpretable, knowledge-grounded training data; training paradigms restricted to core capabilities; and the difficulty of evaluating multi-turn dialogue.

Method: The oMind framework includes (1) a data generation pipeline based on structured knowledge retrieval, LLM-based pruning, and human review, producing a multi-task supervised fine-tuning (SFT) dataset of about 164k samples; (2) oMind-Chat, a multi-turn dialogue benchmark with expert annotations and turn-level and conversation-level rubrics; (3) training and alignment of LLMs for diverse capabilities, conversation in particular.

Result: oMind LLMs consistently outperform baselines on core capabilities and dialogue tasks; oMind-LLM shows markedly better reasoning, with a win rate of up to 80%.

Conclusion: The oMind framework eases key bottlenecks to deploying LLMs in mental health and provides an approach and practical resources for building trustworthy, interpretable, knowledge-driven medical dialogue systems.

Abstract: Large Language Models (LLMs) have shown remarkable capabilities for complex tasks, yet adaptation to the medical domain, specifically mental health, poses specific challenges. Mental health is a rising concern globally, and LLMs have large potential to help address it. We highlight three primary challenges for LLMs in mental health: lack of high quality, interpretable, and knowledge-grounded training data; training paradigms restricted to core capabilities; and evaluation of multi-turn dialogue settings. To address them, we present the oMind framework, which includes training and aligning LLM agents for diverse capabilities including conversations, and a high quality ~164k multi-task SFT dataset produced by our generation pipeline based on structured knowledge retrieval, LLM-based pruning, and review actions. We also introduce oMind-Chat, a novel multi-turn benchmark dataset with expert-annotated turn-level and conversation-level rubrics. Our diverse experiments on both core capabilities and conversations show that oMind LLMs consistently outperform baselines. oMind-LLM also shows significantly better reasoning with up to 80% win rate.

[8] Do LLMs Know What They Know? Measuring Metacognitive Efficiency with Signal Detection Theory

Jon-Paul Cacioli

Main category: cs.CL

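TL;DR: This paper proposes a new evaluation framework based on Type-2 Signal Detection Theory (Type-2 SDT) that uses meta-d' and M-ratio to separate an LLM's discrimination ability (Type-1) from its metacognitive sensitivity (Type-2), revealing real differences in how well models "know what they know" instead of relying only on traditional calibration metrics.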

Details

Motivation: Existing confidence evaluations for LLMs (such as ECE and Brier score) conflate how much a model knows (Type-1 sensitivity) with how well it knows what it knows (Type-2 metacognitive sensitivity), and provide no independent, interpretable measure of metacognition.

Method: Type-2 Signal Detection Theory is applied, with meta-d' measuring metacognitive sensitivity and M-ratio measuring metacognitive efficiency. Four mainstream LLMs are evaluated over 224,000 factual QA trials, with analyses of temperature manipulation, domain specificity, and agreement between metrics.

Result: (1) Metacognitive efficiency differs substantially across models and is unrelated to Type-1 performance (Mistral has the highest d' but the lowest M-ratio); (2) metacognitive efficiency is domain-specific; (3) temperature mainly shifts the confidence criterion while meta-d' stays relatively stable; (4) AUROC_2 and M-ratio produce fully inverted model rankings.

Conclusion: The meta-d' framework separates models that truly have metacognitive capacity from those that only appear well calibrated because of criterion placement, giving a sounder basis for trustworthy deployment, human-AI collaboration, and model selection.

Abstract: Standard evaluation of LLM confidence relies on calibration metrics (ECE, Brier score) that conflate two distinct capacities: how much a model knows (Type-1 sensitivity) and how well it knows what it knows (Type-2 metacognitive sensitivity). We introduce an evaluation framework based on Type-2 Signal Detection Theory that decomposes these capacities using meta-d' and the metacognitive efficiency ratio M-ratio. Applied to four LLMs (Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.3, Llama-3-8B-Base, Gemma-2-9B-Instruct) across 224,000 factual QA trials, we find: (1) metacognitive efficiency varies substantially across models even when Type-1 sensitivity is similar -- Mistral achieves the highest d' but the lowest M-ratio; (2) metacognitive efficiency is domain-specific, with different models showing different weakest domains, invisible to aggregate metrics; (3) temperature manipulation shifts Type-2 criterion while meta-d' remains stable for two of four models, dissociating confidence policy from metacognitive capacity; (4) AUROC_2 and M-ratio produce fully inverted model rankings, demonstrating these metrics answer fundamentally different evaluation questions. The meta-d' framework reveals which models "know what they don't know" versus which merely appear well-calibrated due to criterion placement -- a distinction with direct implications for model selection, deployment, and human-AI collaboration. Pre-registered analysis; code and data publicly available.
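The quantities named in the abstract can be illustrated with a minimal sketch using textbook signal detection definitions: d' = z(hit rate) - z(false-alarm rate), M-ratio = meta-d' / d', and a Type-2 AUROC computed from confidence on correct versus incorrect trials. Estimating meta-d' itself requires fitting a Type-2 model and is taken as a given input here; how the paper maps QA trials onto hits and false alarms is not specified in the abstract, so the example numbers are made up.

```python
from statistics import NormalDist

def d_prime(hit_rate, false_alarm_rate):
    """Type-1 sensitivity; rates must lie strictly between 0 and 1."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(false_alarm_rate)

def m_ratio(meta_d_prime, d_prime_value):
    """Metacognitive efficiency: Type-2 sensitivity per unit of Type-1 sensitivity."""
    return meta_d_prime / d_prime_value

def auroc2(conf_correct, conf_incorrect):
    """Probability that a correct trial gets higher confidence than an
    incorrect one (ties count as 0.5)."""
    wins = 0.0
    for c in conf_correct:
        for i in conf_incorrect:
            if c > i:
                wins += 1.0
            elif c == i:
                wins += 0.5
    return wins / (len(conf_correct) * len(conf_incorrect))

d = d_prime(0.84, 0.16)
print(round(d, 2), m_ratio(1.0, d))
print(auroc2([0.9, 0.8], [0.1, 0.2]))
```

Because M-ratio divides by d', a model can have a high AUROC_2 simply because it has high Type-1 sensitivity while still using that information inefficiently, which is one way the two rankings can disagree.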

[9] Goodness-of-pronunciation without phoneme time alignment

Jeremy H. M. Wong,Nancy F. Chen

Main category: cs.CL

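TL;DR: This paper proposes a new way to use weakly-supervised ASR models for speech evaluation in low-resource languages. By building a phoneme confusion network, using word-level speaking rate and duration, and fusing phoneme-level and frame-level features with cross-attention, it avoids phoneme time alignment and matches the performance of standard frame-synchronous features.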

Details

Motivation: Limited ASR training data makes it hard to extend speech evaluation to low-resource languages. Open-source weakly-supervised models cover many languages but are frame-asynchronous and not phonemic, which hinders feature extraction for speech evaluation.

Method: (1) Map ASR hypotheses to a phoneme confusion network to compute phoneme posteriors; (2) use word-level instead of phoneme-level speaking rate and duration; (3) fuse phoneme-level and frame-level features with a cross-attention architecture, removing the need for phoneme time alignment.

Result: On English speechocean762 and a low-resource Tamil dataset, the method performs comparably with standard frame-synchronous features.

Conclusion: The method resolves the incompatibility between weakly-supervised ASR models and feature extraction for speech evaluation, offering a viable path for low-resource languages.

Abstract: In speech evaluation, an Automatic Speech Recognition (ASR) model often computes time boundaries and phoneme posteriors for input features. However, limited data for ASR training hinders expansion of speech evaluation to low-resource languages. Open-source weakly-supervised models are capable of ASR over many languages, but they are frame-asynchronous and not phonemic, hindering feature extraction for speech evaluation. This paper proposes to overcome incompatibilities for feature extraction with weakly-supervised models, easing expansion of speech evaluation to low-resource languages. Phoneme posteriors are computed by mapping ASR hypotheses to a phoneme confusion network. Word-level rather than phoneme-level speaking rate and duration are used. Phoneme and frame-level features are combined using a cross-attention architecture, obviating phoneme time alignment. This performs comparably with standard frame-synchronous features on English speechocean762 and low-resource Tamil datasets.

[10] To Write or to Automate Linguistic Prompts, That Is the Question

Marina Sánchez-Torrón,Daria Akselrod,Jason Rauchwerk

Main category: cs.CL

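TL;DR: This paper systematically compares hand-crafted zero-shot prompts, base DSPy signatures, and GEPA-optimized DSPy signatures on translation, terminology insertion, and language quality assessment. Results vary by task, and GEPA optimization matches expert prompts in most cases.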

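Details

Motivation: To examine whether automatic prompt optimization can replace expert prompt engineering in linguistic tasks, a question that has not been studied.

Method: For five model configurations and three tasks (translation, terminology insertion, and language quality assessment), the paper compares hand-crafted zero-shot prompts, base DSPy signatures, and GEPA-optimized DSPy signatures, with statistical significance testing.

Result: In terminology insertion, optimized and manual prompts show no significant quality difference. In translation, each approach wins on different models. In language quality assessment, expert prompts are better at error detection and optimized prompts are better at characterization. GEPA clearly improves on base DSPy signatures, and most expert-versus-optimized comparisons show no statistically significant difference.

Conclusion: Automatic prompt optimization such as GEPA can match expert prompts on most linguistic tasks, but the two have different requirements: GEPA needs gold-standard labeled data, while expert prompts depend on domain knowledge and iterative refinement.

Abstract: LLM performance is highly sensitive to prompt design, yet whether automatic prompt optimization can replace expert prompt engineering in linguistic tasks remains unexplored. We present the first systematic comparison of hand-crafted zero-shot expert prompts, base DSPy signatures, and GEPA-optimized DSPy signatures across translation, terminology insertion, and language quality assessment, evaluating five model configurations. Results are task-dependent. In terminology insertion, optimized and manual prompts produce mostly statistically indistinguishable quality. In translation, each approach wins on different models. In LQA, expert prompts achieve stronger error detection while optimization improves characterization. Across all tasks, GEPA elevates minimal DSPy signatures, and the majority of expert-optimized comparisons show no statistically significant difference. We note that the comparison is asymmetric: GEPA optimization searches programmatically over gold-standard splits, whereas expert prompts require in principle no labeled data, relying instead on domain expertise and iterative refinement.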

[11] Prompt Attack Detection with LLM-as-a-Judge and Mixture-of-Models

Hieu Xuan Le,Benjamin Goh,Quy Anh Tang

Main category: cs.CL

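TL;DR: This paper proposes using lightweight general-purpose LLMs (such as gemini-2.0-flash-lite-001) as low-latency security judges. A structured reasoning process (intent decomposition, safety-signal verification, harm assessment, and self-reflection) detects prompt attacks effectively, and the system is deployed for public service chatbots in Singapore.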

Details

Motivation: Lightweight classifiers and rule-based systems generalize poorly under distribution shift, while high-capacity LLM judges cannot meet the low latency and low cost required for live protection, leaving a deployment gap.

Method: A structured reasoning process (explicit intent decomposition, safety-signal verification, harm assessment, and self-reflection), together with carefully designed prompts and output formats, guides lightweight general-purpose LLMs in making security judgments. A Mixture-of-Models (MoM) strategy is also explored.

Result: Lightweight general-purpose LLMs such as gemini-2.0-flash-lite-001 serve as effective low-latency security judges and perform well on a dataset that mixes real benign queries with red-team-generated attacks. The MoM setup brings only modest gains.

Conclusion: With suitable prompt engineering and structured reasoning, lightweight general-purpose LLMs can reliably act as real-time security gatekeepers under strict production constraints, closing the current deployment gap in LLM security.

Abstract: Prompt attacks, including jailbreaks and prompt injections, pose a critical security risk to Large Language Model (LLM) systems. In production, guardrails must mitigate these attacks under strict low-latency constraints, resulting in a deployment gap in which lightweight classifiers and rule-based systems struggle to generalize under distribution shift, while high-capacity LLM-based judges remain too slow or costly for live enforcement. In this work, we examine whether lightweight, general-purpose LLMs can reliably serve as security judges under real-world production constraints. Through careful prompt and output design, lightweight LLMs are guided through a structured reasoning process involving explicit intent decomposition, safety-signal verification, harm assessment, and self-reflection. We evaluate our method on a curated dataset combining benign queries from real-world chatbots with adversarial prompts generated via automated red teaming (ART), covering diverse and evolving patterns. Our results show that general-purpose LLMs, such as gemini-2.0-flash-lite-001, can serve as effective low-latency judges for live guardrails. This configuration is currently deployed in production as a centralized guardrail service for public service chatbots in Singapore. We additionally evaluate a Mixture-of-Models (MoM) setting to assess whether aggregating multiple LLM judges improves prompt-attack detection performance relative to single-model judges, with only modest gains observed.
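The structured judge output and the Mixture-of-Models aggregation can be sketched as below. The field names mirror the four reasoning steps named in the abstract but are hypothetical, as is the simple majority-vote rule; the paper's actual output schema and aggregation scheme are not described in the abstract.

```python
from dataclasses import dataclass, field

@dataclass
class JudgeVerdict:
    """Hypothetical structured output of one LLM judge."""
    intent: str                  # explicit intent decomposition
    safety_signals: list = field(default_factory=list)  # safety-signal verification
    harm: str = ""               # harm assessment
    reflection: str = ""         # self-reflection
    is_attack: bool = False      # final decision

def mixture_of_models(verdicts, threshold=0.5):
    """Flag the prompt when the share of judges voting 'attack' exceeds the threshold."""
    votes = sum(1 for v in verdicts if v.is_attack)
    return votes / len(verdicts) > threshold

verdicts = [
    JudgeVerdict(intent="override system prompt", is_attack=True),
    JudgeVerdict(intent="override system prompt", is_attack=True),
    JudgeVerdict(intent="ask about policies", is_attack=False),
]
print(mixture_of_models(verdicts))
```

Forcing each judge to fill the reasoning fields before the final flag is one way to obtain the structured process from a single lightweight model call.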

[12] Cross-Preference Learning for Sentence-Level and Context-Aware Machine Translation

Ying Li,Xinglin Lyu,Junhui Li,Jinlong Yang,Hengchao Shang,Min Zhang,Shimin Tao,Daimeng Wei

Main category: cs.CL

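TL;DR: This paper proposes Cross-Preference Learning (CPL), a preference learning framework for improving context-aware machine translation (MT). CPL models the intra-condition and cross-condition preference relations between sentence-level and context-level translations, explicitly guiding the model to use context adaptively, and improves translation quality and robustness without changing the model architecture.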

Details

Motivation: Context-aware MT uses document-level information, but its performance is unstable because contextual signals help some sentences more than others. Existing training objectives do not model this variability explicitly, which limits a model's ability to use context adaptively.

Method: Cross-Preference Learning (CPL) is a preference learning framework that integrates intra-condition and cross-condition preferences into the preference optimization objective, explicitly supervising when and how context improves translation quality.

Result: On several public context-aware MT tasks with Qwen3-4B, Qwen3-8B, and Llama-3-8B, CPL yields consistent gains in quality and robustness under both input conditions, with no architectural changes.

Conclusion: By explicitly modeling the complementarity of sentence-level and context-level translation, CPL improves the adaptivity and generalization of context-aware MT and serves as a general, plug-and-play training approach.

Abstract: Context-aware machine translation (MT) leverages document-level information, yet it does not consistently outperform sentence-level MT, as contextual signals are unevenly beneficial across sentences. Existing training objectives do not explicitly model this variability, limiting a model's ability to adaptively exploit context. In this paper, we propose Cross-Preference Learning (CPL), a preference-based training framework that explicitly captures the complementary benefits of sentence-level and context-aware MT. CPL achieves this by integrating both intra- and cross-condition preferences into the preference optimization objective. The introduction of intra- and cross-condition preferences provides explicit supervision on when and how contextual information improves translation quality. We validate the proposed approach on several public context-aware MT tasks using multiple models, including Qwen3-4B, Qwen3-8B, and Llama-3-8B. Experimental results demonstrate consistent improvements in translation quality and robustness across both input conditions, achieved without any architectural modifications.

[13] Probing the Lack of Stable Internal Beliefs in LLMs

Yifan Luo,Kangping Xu,Yanzhen Lu,Yang Yuan,Andrew Chi-Chih Yao

Main category: cs.CL

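TL;DR: This paper examines whether large language models (LLMs) can maintain implicit consistency, that is, sustained adherence to an unstated goal, across multi-turn dialogue. Current LLMs struggle to hold an implicit goal stable without explicit context, which exposes a key limitation for persona-driven modeling.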

Details

Motivation: Persona-driven LLMs must behave consistently across turns to simulate personality traits such as persistence or reliability, but existing models lack stable internal representations that anchor their responses.

Method: A 20-question-style riddle game: the LLM secretly selects a target and answers the user's guesses only with "yes" or "no". The evaluation measures how well the implicit goal (the selected target) is preserved across turns.

Result: When the selected target is not written explicitly into the context, the LLM's implicit goal drifts frequently and implicit consistency is hard to maintain. Consistency improves markedly only when the target is provided explicitly.

Conclusion: Current LLMs lack a mechanism for anchoring implicit goals over time, which limits realistic personality modeling. Techniques that keep implicit goals stable across turns are needed for applications such as dialogue systems.

Abstract: Persona-driven large language models (LLMs) require consistent behavioral tendencies across interactions to simulate human-like personality traits, such as persistence or reliability. However, current LLMs often lack stable internal representations that anchor their responses over extended dialogues. This work explores whether LLMs can maintain "implicit consistency", defined as persistent adherence to an unstated goal in multi-turn interactions. We designed a 20-question-style riddle game paradigm where an LLM is tasked with secretly selecting a target and responding to users' guesses with "yes/no" answers. Through evaluations, we find that LLMs struggle to preserve latent consistency: their implicit "goals" shift across turns unless they are explicitly given their selected target in context. These findings highlight critical limitations in the building of persona-driven LLMs and underscore the need for mechanisms that anchor implicit goals over time, which is key to realistic personality modeling in interactive applications such as dialogue systems.
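One way to check implicit consistency in such a game is to ask whether any single candidate target explains all of the yes/no answers given so far. The sketch below assumes a small, hypothetical candidate table that maps each target to its properties; it is not the paper's evaluation protocol.

```python
def consistent_targets(candidates, transcript):
    """Return the candidates compatible with every (property, answer) pair.

    candidates: dict mapping target name -> set of properties it has
    transcript: list of (property, answered_yes) tuples
    """
    return [
        name
        for name, properties in candidates.items()
        if all((prop in properties) == answered_yes for prop, answered_yes in transcript)
    ]

# Hypothetical candidate table for illustration.
candidates = {
    "cat": {"animal", "furry"},
    "fish": {"animal", "swims"},
    "rock": set(),
}

transcript = [("animal", True), ("furry", True)]
print(consistent_targets(candidates, transcript))

# A later "yes" to "swims" leaves no single target that fits every answer,
# which signals that the implicit goal has drifted.
transcript.append(("swims", True))
print(consistent_targets(candidates, transcript))
```

An empty result is direct evidence of drift; a result that never narrows to the originally chosen target is weaker evidence of the same problem.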

[14] A Catalog of Basque Dialectal Resources: Online Collections and Standard-to-Dialectal Adaptations

Jaione Bengoetxea,Itziar Gonzalez-Dios,Rodrigo Agerri

Main category: cs.CL

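TL;DR: This paper systematically catalogs contemporary Basque dialectal data resources, distinguishing data originally written in dialect online from data adapted from the standard variety. It also manually builds a high-quality dialectal XNLI test set and evaluates the quality of automatically adapted data.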

Details

Motivation: To address data scarcity, the core limitation in dialectal NLP.

Method: Basque dialectal resources are collected and classified systematically into native online dialectal data (news, social media, dictionaries, and so on) and standard-to-dialect adapted data (both manual and automatic). The XNLI test set is manually adapted into three Basque dialects, and the automatically adapted BasPhyCowest dataset is assessed for quality by native speakers.

Result: The first comprehensive catalog of Basque dialectal data resources; a high-quality parallel XNLI evaluation set covering the Western, Central, and Navarrese-Lapurdian dialects; and evidence that automatically adapted data can serve as silver data under certain conditions.

Conclusion: The catalog gives Basque dialectal NLP research a solid data foundation. Manually adapted data are highly reliable, and automatically adapted data that have been manually validated can act as a supplementary silver resource to ease data scarcity.

Abstract: Recent research on dialectal NLP has identified data scarcity as a primary limitation. To address this limitation, this paper presents a catalog of contemporary Basque dialectal data and resources, offering a systematic and comprehensive compilation of the dialectal data currently available in Basque. Two types of data sources have been distinguished: online data originally written in some dialect, and standard-to-dialect adapted data. The former includes all dialectal data that can be found online, such as news and radio sites, informal tweets, as well as online resources such as dictionaries, atlases, grammar rules, or videos. The latter consists of data that has been adapted from the standard variety to dialectal varieties, either manually or automatically. Regarding the manual adaptation, the test split of the XNLI Natural Language Inference dataset was manually adapted into three Basque dialects: Western, Central, and Navarrese-Lapurdian, yielding a high-quality parallel gold standard evaluation dataset. With respect to the automatic dialectal adaptation, the automatically adapted physical commonsense dataset (BasPhyCowest) underwent additional manual evaluation by native speakers to assess its quality and determine whether it could serve as a viable substitute for full manual adaptation (i.e., silver data creation).

[15] A Decade-Scale Benchmark Evaluating LLMs' Clinical Practice Guidelines Detection and Adherence in Multi-turn Conversations

Andong Tan,Shuyu Dai,Jinglu Wang,Fengtao Zhou,Yan Lu,Xi Wang,Yingcong Chen,Can Yang,Shujie Liu,Hao Chen

Main category: cs.CL

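TL;DR: This paper proposes CPGBench, an automated benchmark framework for evaluating how well large language models (LLMs) detect and adhere to clinical practice guidelines (CPGs) in multi-turn clinical conversations. Using test conversations generated from 32,155 clinical recommendations drawn from sources worldwide, it finds that LLMs detect guideline content reasonably well (71.1%-89.6%) but fall well short in citing the correct source (3.6%-29.7%) and in actually following the guidelines (21.8%-63.2%). A human evaluation by 56 clinicians confirms the reliability of the results.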

Details

Motivation: LLMs are increasingly used in healthcare, but their ability to recognize and follow clinical practice guidelines (CPGs) in real conversations is unclear. A systematic evaluation framework is needed to support safe and reliable clinical use.

Method: CPGBench collects 3,418 CPG documents published over the last decade by 9 countries/regions and 2 international organizations, and extracts 32,155 structured clinical recommendations with metadata such as institution, date, specialty, recommendation strength, and evidence level. One multi-turn simulated clinical conversation is generated per recommendation. Eight leading LLMs are evaluated on guideline detection, source referencing, and clinical adherence, supported by human validation from 56 clinicians across specialties.

Result: LLMs detect clinical recommendations at a high rate (71.1%-89.6%) but reference the correct guideline title rarely (3.6%-29.7%). Adherence rates vary widely, from 21.8% to 63.2%. Human evaluation confirms the validity of the automatic evaluation. CPGBench is the first benchmark to reveal systematically which specific recommendations LLMs fail on.

Conclusion: Current LLMs partly understand guideline content but have serious shortcomings in source attribution and in clinical application, and need targeted improvement before they can be deployed safely in real clinical settings.

Abstract: Clinical practice guidelines (CPGs) play a pivotal role in ensuring evidence-based decision-making and improving patient outcomes. While Large Language Models (LLMs) are increasingly deployed in healthcare scenarios, it is unclear to what extent LLMs can identify and adhere to CPGs during conversations. To address this gap, we introduce CPGBench, an automated framework benchmarking the clinical guideline detection and adherence capabilities of LLMs in multi-turn conversations. We collect 3,418 CPG documents from 9 countries/regions and 2 international organizations published in the last decade, spanning 24 specialties. From these documents, we extract 32,155 clinical recommendations with corresponding publication institute, date, country, specialty, recommendation strength, evidence level, etc. One multi-turn conversation is generated for each recommendation to evaluate the detection and adherence capabilities of 8 leading LLMs. We find that 71.1%-89.6% of recommendations can be correctly detected, while only 3.6%-29.7% of the corresponding titles can be correctly referenced, revealing the gap between knowing the guideline contents and knowing where they come from. The adherence rates range from 21.8% to 63.2% across models, indicating a large gap between knowing the guidelines and being able to apply them. To confirm the validity of our automatic analysis, we further conduct a comprehensive human evaluation involving 56 clinicians from different specialties. To our knowledge, CPGBench is the first benchmark systematically revealing which clinical recommendations LLMs fail to detect or adhere to during conversations. Given that each clinical recommendation may affect a large population and that clinical applications are inherently safety critical, addressing these gaps is crucial for the safe and responsible deployment of LLMs in real world clinical practice.

[16] SafeMath: Inference-time Safety improves Math Accuracy

Sagnik Basu,Subhrajit Mitra,Aman Juneja,Somnath Banerjee,Rima Hazra,Animesh Mukherjee

Main category: cs.CL

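TL;DR: This paper shows that math word problems can carry biased, harmful, or unethical content, builds the ToxicGSM dataset, and proposes SafeMath, which preserves or even improves mathematical reasoning accuracy while enforcing safety.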

Details

Motivation: Prior work shows that large language models can be manipulated by adversarial or seemingly benign inputs. Math word problems, being natural language narratives, can become a hidden channel for biased, harmful, or psychologically damaging content, with higher risk in educational settings involving children.

Method: Build ToxicGSM, a dataset of 1,900 arithmetic problems that embed harmful or sensitive context while keeping the math task well defined; audit existing LLMs jointly for safety and mathematical correctness; propose a safety alignment technique called SafeMath.

Result: Experiments show that SafeMath substantially reduces harmful outputs while maintaining or improving mathematical reasoning performance. Linguistic harm and mathematical reasoning can be disentangled, so safety alignment need not cost accuracy.

Conclusion: Hidden harm in math word problems deserves attention. ToxicGSM provides a benchmark for this line of work, and SafeMath offers a workable alignment approach that balances safety and accuracy.

Abstract: Recent research points toward LLMs being manipulated through adversarial and seemingly benign inputs, resulting in harmful, biased, or policy-violating outputs. In this paper, we study an underexplored issue concerning harmful and toxic mathematical word problems. We show that math questions, particularly those framed as natural language narratives, can serve as a subtle medium for propagating biased, unethical, or psychologically harmful content, with heightened risks in educational settings involving children. To support a systematic study of this phenomenon, we introduce ToxicGSM, a dataset of 1.9k arithmetic problems in which harmful or sensitive context is embedded while preserving mathematically well-defined reasoning tasks. Using this dataset, we audit the behaviour of existing LLMs and analyse the trade-offs between safety enforcement and mathematical correctness. We further propose SafeMath -- a safety alignment technique that reduces harmful outputs while maintaining, and in some cases improving, mathematical reasoning performance. Our results highlight the importance of disentangling linguistic harm from math reasoning and demonstrate that effective safety alignment need not come at the cost of accuracy. We release the source code and dataset at https://github.com/Swagnick99/SafeMath/tree/main.

[17] Translation or Recitation? Calibrating Evaluation Scores for Machine Translation of Extremely Low-Resource Languages

Danlu Chen,Ka Sing He,Jiahe Tian,Chenghao Xiao,Zhaofeng Wu,Taylor Berg-Kirkpatrick,Freda Shi

Main category: cs.CL

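TL;DR: This paper proposes the FRED difficulty metrics (Fertility Ratio, Retrieval Proxy, Pre-training Exposure, Corpus Diversity) to quantify the intrinsic difficulty of extremely low-resource machine translation datasets. It shows that performance differences stem mainly from train-test overlap and pre-training exposure, not from model capability, and points out that extinct and non-Latin indigenous languages face a fundamental transfer bottleneck because of poor tokenization coverage (high token fertility).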

Details

Motivation: Reported performance in extremely low-resource MT varies widely across language pairs and is hard to compare. Researchers working on specific language groups, such as ancient languages, cannot tell whether breakthroughs in other contexts (for example African or Native American languages) reflect better methods or artifacts of benchmark construction.

Method: Four dataset-intrinsic difficulty metrics are proposed: Fertility Ratio (F), Retrieval Proxy (R), Pre-training Exposure (E), and Corpus Diversity (D). They are used to explain the sources of performance variation and to analyze tokenization coverage.

Result: Result variability is explained mainly by train-test overlap and pre-training exposure, not by model capability. Extinct and non-Latin indigenous languages show high token fertility, which limits transfer from high-resource languages.

Conclusion: The FRED metrics make the evaluation of cross-lingual transfer more transparent and reliable and give the XLR MT community a firmer foundation.

Abstract: The landscape of extremely low-resource machine translation (MT) is characterized by perplexing variability in reported performance, often making results across different language pairs difficult to contextualize. For researchers focused on specific language groups -- such as ancient languages -- it is nearly impossible to determine if breakthroughs reported in other contexts (e.g., native African or American languages) result from superior methodologies or are merely artifacts of benchmark collection. To address this problem, we introduce the FRED Difficulty Metrics, which include the Fertility Ratio (F), Retrieval Proxy (R), Pre-training Exposure (E), and Corpus Diversity (D) and serve as dataset-intrinsic metrics to contextualize reported scores. These metrics reveal that a significant portion of result variability is explained by train-test overlap and pre-training exposure rather than model capability. Additionally, we identify that some languages -- particularly extinct and non-Latin indigenous languages -- suffer from poor tokenization coverage (high token fertility), highlighting a fundamental limitation of transferring models from high-resource languages that lack a shared vocabulary. By providing these indices alongside performance scores, we enable more transparent evaluation of cross-lingual transfer and provide a more reliable foundation for the XLR MT community.
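Two of the quantities discussed in the abstract are easy to make concrete. The sketch below uses the usual definition of token fertility (tokens produced per whitespace word) and an exact-match train-test overlap rate. The paper's precise definitions of its F and R metrics may differ, and the character-level tokenizer is only a stand-in for a real subword tokenizer.

```python
def token_fertility(sentences, tokenize):
    """Average number of tokens per whitespace-separated word."""
    words = sum(len(s.split()) for s in sentences)
    tokens = sum(len(tokenize(s)) for s in sentences)
    return tokens / words

def train_test_overlap(train_sentences, test_sentences):
    """Share of test sentences that appear verbatim in the training set."""
    train_set = set(train_sentences)
    hits = sum(1 for s in test_sentences if s in train_set)
    return hits / len(test_sentences)

def char_tokenize(sentence):
    """Worst-case stand-in: every non-space character becomes its own token."""
    return [ch for ch in sentence if ch != " "]

corpus = ["ab cd"]
print(token_fertility(corpus, str.split))      # 1.0: one token per word
print(token_fertility(corpus, char_tokenize))  # 2.0: words are fragmented
print(train_test_overlap(["a", "b"], ["a", "c"]))
```

A tokenizer trained mostly on high-resource languages tends to fall back to many short pieces for unseen scripts, pushing fertility up, while a high overlap rate means a model can score well by recalling training sentences instead of translating.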

[18] Comparing Natural and Synthetic Structured Data: A Study of the Passive Verb Alternation in French and Italian

Giuseppe Samo,Paola Merlo

Main category: cs.CL

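TL;DR: This study compares the effects of natural and synthetic data on training and evaluating large language models (LLMs), focusing on the passive verb alternation in French and Italian. Models trained on natural data perform robustly on both natural and synthetic test sets, whereas models trained only on synthetic data struggle to generalize to natural sentences, underscoring the importance of natural data in linguistic evaluation.

Details Motivation: To examine the relative roles of natural and synthetic data in training and evaluating the linguistic abilities of LLMs (in particular their syntactic and semantic knowledge), using the passive verb alternation as the test phenomenon. Method: Uses Blackbird Language Matrices (BLMs) as structured probing datasets and compares templates instantiated with natural sentences extracted from Universal Dependencies against templates of synthetic sentences; training and generalization experiments are run on the passive verb alternation in French and Italian. Result: Models trained only on synthetic data reach ceiling performance on synthetic tests but do not reliably generalize to natural sentences; models trained on natural data perform robustly on both natural and synthetic tests, showing a stronger ability to capture abstract linguistic patterns. Conclusion: Natural data is more effective than synthetic data for training LLMs that generalize linguistically, and structured evaluation setups such as BLMs are important for probing LLMs' linguistic knowledge. Abstract: This study compares the impact of natural and synthetic data on training and evaluating large language models (LLMs), using the case of passive verb alternation in French and Italian. We use Blackbird Language Matrices (BLMs), structured datasets designed to probe linguistic knowledge of underlying patterns across sentence sets. We compare structured templates instantiated with natural sentences extracted from Universal Dependencies to structured templates of synthetic sentences. Experiments show that while models achieve ceiling performance when trained and tested on synthetic datasets, they do not reliably generalize to natural sentences. In contrast, models trained on natural data exhibit robust performance across both natural and synthetic test suites, demonstrating their superior ability to capture abstract linguistic patterns. These results corroborate the value of natural data and of structured set ups in linguistic evaluation for probing LLMs' syntactic and semantic knowledge.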

[19] MolQuest: A Benchmark for Agentic Evaluation of Abductive Reasoning in Chemical Structure Elucidation

Taolin Han,Shuang Wu,Jinghang Wang,Yuhao Zhou,Renquan Lv,Bing Zhao,Wei Hu

Main category: cs.CL

TL;DR: This paper proposes MolQuest, an agentic evaluation framework built on real chemical experimental data, for assessing the dynamic reasoning of large language models in molecular structure elucidation. The framework emphasizes multi-turn interaction, integration of multiple spectral sources, and iterative refinement of hypotheses. Even state-of-the-art models reach only about 50% accuracy, exposing a marked weakness in strategic scientific reasoning.

Details Motivation: Existing scientific benchmarks are mostly static, single-turn question answering, which cannot measure a model's dynamic reasoning in complex scientific tasks that need multi-step iteration and experimental interaction. Method: Builds MolQuest, an agentic evaluation framework based on real chemical experimental data that models molecular structure elucidation as a multi-turn interactive task. Models must proactively plan experimental steps, integrate heterogeneous spectral data such as NMR and MS, and iteratively revise structural hypotheses. Result: Current frontier LLMs perform poorly in authentic scientific scenarios: SOTA models reach only about 50% accuracy, and most models fall below 30%. Conclusion: MolQuest offers a reproducible and extensible approach to science-oriented LLM evaluation and exposes a key gap in current LLMs' strategic scientific reasoning, pointing toward AI that can take part in the scientific process. Abstract: Large language models (LLMs) hold considerable potential for advancing scientific discovery, yet systematic assessment of their dynamic reasoning in real-world research remains limited. Current scientific evaluation benchmarks predominantly rely on static, single-turn Question Answering (QA) formats, which are inadequate for measuring model performance in complex scientific tasks that require multi-step iteration and experimental interaction. To address this gap, we introduce MolQuest, a novel agent-based evaluation framework for molecular structure elucidation built upon authentic chemical experimental data. Unlike existing datasets, MolQuest formalizes molecular structure elucidation as a multi-turn interactive task, requiring models to proactively plan experimental steps, integrate heterogeneous spectral sources (e.g., NMR, MS), and iteratively refine structural hypotheses. This framework systematically evaluates LLMs' abductive reasoning and strategic decision-making abilities within a vast and complex chemical space. Empirical results reveal that contemporary frontier models exhibit significant limitations in authentic scientific scenarios: notably, even state-of-the-art (SOTA) models achieve an accuracy of only approximately 50%, while the performance of most other models remains below the 30% threshold. This work provides a reproducible and extensible framework for science-oriented LLM evaluation; our findings highlight the critical gap in current LLMs' strategic scientific reasoning, setting a clear direction for future research toward AI that can actively participate in the scientific process.

[20] CRAFT: Grounded Multi-Agent Coordination Under Partial Information

Abhijnan Nath,Hannah VanderHoeven,Nikhil Krishnaswamy

Main category: cs.CL

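TL;DR: This paper introduces CRAFT, a multi-agent benchmark for evaluating the pragmatic communication of large language models under strict partial information. Several agents with complementary but incomplete views must cooperate in natural language to build a shared 3D structure that no single agent can fully observe. Stronger reasoning ability does not necessarily lead to better collaboration, and multi-agent coordination remains a fundamental challenge for current language models.

Details Motivation: Existing evaluations of multi-agent collaboration and pragmatic communication in LLMs lack a benchmark with a strict partial-information setting; the specific causes of collaboration failure (spatial grounding, belief modeling, pragmatic communication errors) need to be separated and behavioral failure patterns characterized. Method: Builds the CRAFT multi-agent benchmark, formalized as a multi-sender pragmatic reasoning task; designs a diagnostic framework that attributes failures to spatial grounding, belief modeling, and pragmatic communication errors, and establishes a taxonomy of behavioral failure profiles for frontier and open-weight models. Result: Experiments on 8 open-weight and 7 frontier models (including reasoning models) show that stronger reasoning does not guarantee better coordination, that smaller open-weight models often match or outperform frontier systems, and that better individual communication does not guarantee successful collaboration. Conclusion: Multi-agent coordination remains a fundamentally unsolved challenge for current language models; evaluation and modeling need to go beyond single-agent capability and target collaboration directly. Abstract: We introduce CRAFT, a multi-agent benchmark for evaluating pragmatic communication in large language models under strict partial information. In this setting, multiple agents with complementary but incomplete views must coordinate through natural language to construct a shared 3D structure that no single agent can fully observe. We formalize this problem as a multi-sender pragmatic reasoning task and provide a diagnostic framework that decomposes failures into spatial grounding, belief modeling and pragmatic communication errors, including a taxonomy of behavioral failure profiles in both frontier and open-weight models. Across a diverse set of models, including 8 open-weight and 7 frontier models (among them reasoning models), we find that stronger reasoning ability does not reliably translate to better coordination: smaller open-weight models often match or outperform frontier systems, and improved individual communication does not guarantee successful collaboration. These results suggest that multi-agent coordination remains a fundamentally unsolved challenge for current language models. Our code can be found at https://github.com/csu-signal/CRAFT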

[21] When Hate Meets Facts: LLMs-in-the-Loop for Check-worthiness Detection in Hate Speech

Nicolás Benjamín Ocampo,Tommaso Caselli,Davide Ceolin

Main category: cs.CL

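TL;DR: This paper takes a new view that combines hate speech (HS) with check-worthiness. It builds WSF-ARG+, the first dataset carrying both kinds of labels, and designs an LLM-in-the-loop annotation framework to improve annotation efficiency and quality. Experiments show that the framework substantially reduces human effort and that adding check-worthiness information improves LLM performance on HS detection.

Details Motivation: Hate speech often spreads in fact-like form, especially in coordinated harassment and extremist propaganda. Handling hate speech and misinformation separately can deepen prejudice, reinforce stereotypes, damage public debate, and make content moderation harder, so the two need to be addressed jointly. Method: Builds WSF-ARG+, the first dataset combining hate speech and check-worthiness labels; proposes an LLM-in-the-loop annotation framework tested with 12 open-weight LLMs for identifying check-worthy claims and validated through extensive human evaluation; adds check-worthiness labels as an auxiliary signal for HS detection. Result: The LLM-in-the-loop framework reduces human effort without compromising annotation quality; HS messages containing check-worthy claims show higher levels of harassment and hate; adding check-worthiness labels improves LLM-based HS detection by up to 0.213 macro-F1, and by 0.154 on average for large models. Conclusion: Check-worthiness is a key dimension for understanding and detecting hate speech; jointly addressing HS and fact-checking is both theoretically sound and practically valuable for dataset construction, annotation efficiency, and model performance. Abstract: Hateful content online is often expressed using fact-like, not necessarily correct information, especially in coordinated online harassment campaigns and extremist propaganda. Failing to jointly address hate speech (HS) and misinformation can deepen prejudice, reinforce harmful stereotypes, and expose bystanders to psychological distress, while polluting public debate. Moreover, these messages require more effort from content moderators because they must assess both harmfulness and veracity, i.e., fact-check them. To address this challenge, we release WSF-ARG+, the first dataset which combines hate speech with check-worthiness information. We also introduce a novel LLM-in-the-loop framework to facilitate the annotation of check-worthy claims. We run our framework, testing it with 12 open-weight LLMs of different sizes and architectures. We validate it through extensive human evaluation, and show that our LLM-in-the-loop framework reduces human effort without compromising the annotation quality of the data. Finally, we show that HS messages with check-worthy claims show significantly higher harassment and hate, and that incorporating check-worthiness labels improves LLM-based HS detection by up to 0.213 macro-F1, and by 0.154 macro-F1 on average for large models.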
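For readers unfamiliar with the macro-F1 figures quoted above, here is a minimal sketch of the metric: the unweighted mean of per-class F1 scores. The labels and predictions are invented toy data, not taken from WSF-ARG+.

```python
from typing import List


def macro_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Unweighted mean of per-class F1 over every label seen in truth or prediction."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example (hypothetical labels).
y_true = ["hate", "hate", "none", "none"]
y_pred = ["hate", "none", "none", "none"]
score = macro_f1(y_true, y_pred)
print(f"macro-F1 = {score:.3f}")
```

Because every class counts equally, macro-F1 is sensitive to performance on the rarer class, which is one reason it is commonly reported for imbalanced detection tasks.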

[22] Separate Before You Compress: The WWHO Tokenization Architecture

Kusal Darshana

Main category: cs.CL

TL;DR: This paper proposes a new tokenization algorithm, SGPE (Syllable-aware Grapheme Pair Encoding), and the three-layer WWHO architecture to address the "Token Tax" that arises when BPE tokenizers split conjuncts in Abugida scripts such as Sinhala and Devanagari. The approach substantially reduces token counts while guaranteeing that syllables are never split.

Details Motivation: Standard BPE tokenizers break the multi-codepoint conjuncts of structurally complex Abugida scripts (e.g., Sinhala, Devanagari) into meaningless sub-character units, which lowers reasoning efficiency, raises inference cost, and imposes a significant "Token Tax" on languages of the Global South. Method: Proposes the three-layer WWHO (Where-What-How Often) architecture and the SGPE (Syllable-aware Grapheme Pair Encoding) algorithm, which decouple the linguistic rules of the script from statistical compression, support seamless multilingual tokenization, and ensure zero syllable breakage (the Linguistic Zero-Breakage Guarantee). Result: On Sinhala, TWR reaches 1.274 (61.7% fewer tokens than o200k); on Hindi, TWR is 1.181 (27.0% fewer); on mixed-script data the overall TWR is 1.240, with reductions of 36.7%, 39.6%, and 60.2% relative to o200k, Llama 4 Scout, and DeepSeek V3, respectively, extending the usable context window by up to 4.38 times. Conclusion: SGPE mitigates token inflation for Abugida scripts and improves tokenization efficiency while preserving linguistic integrity, providing a scalable and fairer tokenization basis for LLMs on low-resource and complex-script languages. Abstract: Current Large Language Models (LLMs) mostly use BPE (Byte Pair Encoding) based tokenizers, which are very effective for simple structured Latin scripts such as English. However, standard BPE tokenizers struggle to process complex Abugida scripts due to their structural complexity. The problem is that these tokenizers break complex conjuncts, which are multi-codepoint grapheme clusters, into meaningless sub-character units. This degrades the LLM's reasoning efficiency by forcing it to learn basic orthographic structures at inference time and raises inference costs, resulting in a significant "Token Tax" for the Global South. We propose a new three-layer architecture, the WWHO (Where-What-How Often), and an algorithm named SGPE (Syllable-aware Grapheme Pair Encoding) that separates the linguistic rules of the script from the statistical compression process while enabling seamless multilingual tokenization. Using Sinhala and Devanagari (Hindi/Sanskrit) as highly complex Abugida scripts, we trained WWHO on a cleaned 30-million-sentence dataset and evaluated on a 1,499,950-sentence test set. For Sinhala, SGPE achieves a Token to Word Ratio (TWR) of 1.274 with 4.83 characters per token, representing a 61.7 percent reduction in tokens compared to OpenAI's o200k base. For Hindi, it achieves a TWR of 1.181 (27.0 percent reduction vs o200k).
On the mixed-script (Sinhala, Devanagari, and English) dataset, SGPE achieves an overall TWR of 1.240, representing token reductions of 36.7 percent, 39.6 percent, and 60.2 percent relative to o200k base, Llama 4 Scout, and DeepSeek V3, respectively. This effectively extends the usable context window by up to 4.38 times for these Abugida languages while ensuring a Linguistic Zero-Breakage Guarantee, which ensures that no valid syllable is ever split across multiple tokens.
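The quantities reported above (Token to Word Ratio, percentage token reduction, effective context extension) are related by simple arithmetic. The sketch below shows one plausible way to compute them; the counts are invented, and treating the context multiplier as baseline tokens divided by new tokens is an assumption, since the abstract does not state how the 4.38 figure was derived.

```python
def token_to_word_ratio(n_tokens: int, n_words: int) -> float:
    """TWR: tokens emitted per word of input text."""
    return n_tokens / n_words


def percent_reduction(baseline_tokens: int, new_tokens: int) -> float:
    """Percentage of tokens saved relative to a baseline tokenizer."""
    return 100.0 * (baseline_tokens - new_tokens) / baseline_tokens


def context_multiplier(baseline_tokens: int, new_tokens: int) -> float:
    """How many times more text fits in a fixed token budget (assumed definition)."""
    return baseline_tokens / new_tokens


# Hypothetical counts for the same 1,000-word text under two tokenizers.
n_words, baseline_tokens, new_tokens = 1000, 3000, 1200
twr = token_to_word_ratio(new_tokens, n_words)
reduction = percent_reduction(baseline_tokens, new_tokens)
multiplier = context_multiplier(baseline_tokens, new_tokens)
print(f"TWR={twr:.3f}, reduction={reduction:.1f}%, context x{multiplier:.2f}")
```

Under this definition a reduction of r percent corresponds to a multiplier of 1 / (1 - r/100), so the multiplier grows quickly as the reduction approaches 100 percent.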

[23] Beyond Detection: Rethinking Education in the Age of AI-writing

Maria Marina,Alexander Panchenko,Vasily Konovalov

Main category: cs.CL

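TL;DR: This paper examines the deeper impact of generative AI (such as ChatGPT) on writing education. It stresses that writing, as a process of training thought, cannot be replaced, and argues for responding through better pedagogy and the ability to recognize AI-generated text rather than through bans.

Details Motivation: As generative AI enters education and everyday life, writing risks being reduced to a tool-driven, automated formality stripped of its cognitive value; the authors are concerned that human deep learning and the development of thinking will suffer. Method: Draws on cognitive psychology and educational theory, combined with real classroom practice, for analysis and argument. Result: Shows that the writing process itself, not only its product, is central to human deep learning; assesses the limits of current AI-text detection; and proposes an educational response centered on smarter pedagogy and AI literacy. Conclusion: Writing is how thinking takes shape and cannot be outsourced; in an era when text can be faked, real learning still has to be experienced first-hand, and recognizing machine-generated language will become a key 21st-century literacy. Abstract: As generative AI tools like ChatGPT enter classrooms, workplaces and everyday thinking, writing is at risk of becoming a formality -- outsourced, automated and stripped of its cognitive value. But writing is not just output; it is how we learn to think. This paper explores what is lost when we let machines write for us, drawing on cognitive psychology, educational theory and real classroom practices. We argue that the process of writing -- messy, slow, often frustrating -- is where human deep learning happens. The paper also explores the current possibilities of AI-text detection, how educators can adapt through smarter pedagogy rather than bans, and why the ability to recognize machine-generated language may become a critical literacy of the 21st century. In a world where writing can be faked, learning cannot.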

[24] Adaptive Chunking: Optimizing Chunking-Method Selection for RAG

Paulo Roberto de Moura Júnior,Jean Lelong,Annabelle Blangero

Main category: cs.CL

TL;DR: This paper proposes Adaptive Chunking, a framework that evaluates each document with five intrinsic, document-based metrics (such as references completeness and intrachunk cohesion) and selects the best chunking strategy per document, significantly improving RAG performance.

Details Motivation: Existing "one-size-fits-all" chunking cannot adapt to diverse text structures, and there is no framework for evaluating chunking quality independently of downstream tasks. Method: Introduces five new intrinsic metrics (RC, ICC, DCC, BI, SC) as an evaluation suite; designs two new chunkers (an LLM-regex splitter and a split-then-merge recursive splitter) with accompanying post-processing; and implements metric-guided adaptive chunking. Result: On datasets spanning legal, technical, and social science domains, answer correctness rises from 62-64% to 72%, and the number of successfully answered questions increases by over 30% (65 vs. 49), without changing models or prompts. Conclusion: Document-aware adaptive chunking guided by intrinsic metrics is a practical and effective way to make RAG more robust. Abstract: The effectiveness of Retrieval-Augmented Generation (RAG) is highly dependent on how documents are chunked, that is, segmented into smaller units for indexing and retrieval. Yet, commonly used "one-size-fits-all" approaches often fail to capture the nuanced structure and semantics of diverse texts. Despite its central role, chunking lacks a dedicated evaluation framework, making it difficult to assess and compare strategies independently of downstream performance. We challenge this paradigm by introducing Adaptive Chunking, a framework that selects the most suitable chunking strategy for each document based on a set of five novel intrinsic, document-based metrics: References Completeness (RC), Intrachunk Cohesion (ICC), Document Contextual Coherence (DCC), Block Integrity (BI), and Size Compliance (SC), which directly assess chunking quality across key dimensions. To support this framework, we also introduce two new chunkers, an LLM-regex splitter and a split-then-merge recursive splitter, alongside targeted post-processing techniques. On a diverse corpus spanning legal, technical, and social science domains, our metric-guided adaptive method significantly improves downstream RAG performance. Without changing models or prompts, our framework increases RAG outcomes, raising answer correctness to 72% (from 62-64%) and increasing the number of successfully answered questions by over 30% (65 vs. 49). These results demonstrate that adaptive, document-aware chunking, guided by a complementary suite of intrinsic metrics, offers a practical and effective path to more robust RAG systems. Code available at https://github.com/ekimetrics/adaptive-chunking.
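The idea of choosing a chunker per document from intrinsic scores can be sketched as follows. This is not the authors' implementation: the `size_compliance` function is a guessed stand-in for the Size Compliance (SC) metric, whose exact definition is not given in the abstract, and both chunkers, the length bounds, and the plain averaging of metric scores are hypothetical.

```python
from typing import Callable, Dict, List

Chunker = Callable[[str], List[str]]
Metric = Callable[[str, List[str]], float]


def size_compliance(doc: str, chunks: List[str], min_len: int = 20, max_len: int = 60) -> float:
    """Assumed metric: fraction of chunks whose character length lies within bounds."""
    if not chunks:
        return 0.0
    return sum(min_len <= len(c) <= max_len for c in chunks) / len(chunks)


def select_chunker(doc: str, chunkers: Dict[str, Chunker], metrics: List[Metric]) -> str:
    """Return the name of the chunker with the highest mean metric score on this document."""
    def mean_score(name: str) -> float:
        chunks = chunkers[name](doc)
        return sum(m(doc, chunks) for m in metrics) / len(metrics)
    return max(chunkers, key=mean_score)


def fixed_width(width: int) -> Chunker:
    """Hypothetical baseline: cut the text every `width` characters."""
    return lambda doc: [doc[i:i + width] for i in range(0, len(doc), width)]


def by_sentence(doc: str) -> List[str]:
    """Hypothetical baseline: split on full stops."""
    return [s.strip() for s in doc.split(".") if s.strip()]


doc = "Chunking splits a document into units. Retrieval then indexes each unit separately."
chunkers = {"fixed10": fixed_width(10), "sentence": by_sentence}
best = select_chunker(doc, chunkers, [size_compliance])
print(best)
```

In a fuller version each of the five metrics would be a separate `Metric` function, and the aggregation rule would itself be a design choice worth validating against downstream RAG accuracy.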

[25] Large Language Model as Token Compressor and Decompressor

Wenbing Li,Zikai Song,Jielei Zhang,Tianhao Zhao,Junkai Lin,Yiran Wang,Wei Yang

Main category: cs.CL

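TL;DR: This paper proposes using an off-the-shelf large language model (LLM) as an efficient token compressor and decompressor. The model is fine-tuned to compress long text into custom discrete Z-tokens and to reconstruct the original exactly, achieving up to 18x token compression on several long-text datasets while preserving reconstruction fidelity and downstream performance.

Details Motivation: Long-text processing suffers from high token overhead and costly context extension, which calls for a lightweight, content-adaptive token compression mechanism. Method: Designs a self-expressive autoencoding learning framework that fine-tunes a pretrained LLM with lightweight LoRA-based adapter heads, turning it into a codec that produces variable-length discrete latent codes (Z-tokens) and reconstructs the original text exactly. Result: Achieves up to 18x token reduction on Wikipedia, CNN/DailyMail, HotpotQA, and Qulac-style long-query datasets, while preserving reconstruction fidelity and downstream performance; supports prompt compression and autoregressive generation in the Z-token space. Conclusion: An off-the-shelf LLM can be efficiently adapted into a content-adaptive token codec, offering a potential path toward token-efficient long-context reasoning. Abstract: In this paper, we establish the novel insight that an off-the-shelf LLM can function as an excellent token compressor and decompressor. To demonstrate, we design a self-expressive autoencoding learning framework that fine-tunes a pretrained LLM to translate long texts into a compact internal language of discrete, variable-length latent codes, termed Z-tokens, and to reconstruct the original text exactly from them. The resulting representation is content-adaptive: semantically dense segments receive more Z-tokens, while redundant or predictable regions are aggressively compressed, via lightweight LoRA-based adapter heads. Empirically, our method achieves up to 18 times token reduction on Wikipedia, CNN/DailyMail, HotpotQA, and Qulac-style long-query datasets, while preserving reconstruction fidelity and downstream performance. This simple yet effective design supports applications including prompt compression and autoregressive generation directly in the Z-token space, offering a potential pathway toward token-efficient long-context reasoning.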

[26] TAPO: Translation Augmented Policy Optimization for Multilingual Mathematical Reasoning

Xu Huang,Zhejian Lai,Zixian Huang,Jiajun Chen,Shujian Huang

Main category: cs.CL

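TL;DR: This paper proposes TAPO, a translation-augmented policy optimization framework that improves large language models on multilingual mathematical reasoning by using English as a pivot, an understand-then-reason paradigm, and a step-level relative advantage mechanism.

Details Motivation: LLMs perform well on English mathematical reasoning, but a significant gap persists in multilingual settings, largely attributed to deficiencies in language understanding. Method: Builds the Translation-Augmented Policy Optimization (TAPO) framework on GRPO, using English as the pivot language and an understand-then-reason paradigm, and introduces a step-level relative advantage mechanism that decouples understanding from reasoning so that translation quality rewards can be integrated. Result: TAPO outperforms baseline methods on both multilingual mathematical reasoning and translation tasks, and generalizes well to unseen languages and out-of-domain tasks. Conclusion: TAPO effectively combines language understanding with reasoning, is compatible with various models, and generalizes well, offering a new approach to multilingual reasoning. Abstract: Large Language Models (LLMs) have demonstrated remarkable proficiency in English mathematical reasoning, yet a significant performance disparity persists in multilingual contexts, largely attributed to deficiencies in language understanding. To bridge this gap, we introduce Translation-Augmented Policy Optimization (TAPO), a novel reinforcement learning framework built upon GRPO. TAPO enforces an explicit alignment strategy where the model leverages English as a pivot and follows an understand-then-reason paradigm. Crucially, we employ a step-level relative advantage mechanism that decouples understanding from reasoning, allowing the integration of translation quality rewards without introducing optimization conflicts. Extensive experiments reveal that TAPO effectively synergizes language understanding with reasoning capabilities and is compatible with various models. It outperforms baseline methods in both multilingual mathematical reasoning and translation tasks, while generalizing well to unseen languages and out-of-domain tasks.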

[27] Navigating the Prompt Space: Improving LLM Classification of Social Science Texts Through Prompt Engineering

Erkan Gunes,Christoffer Florczak,Tevfik Murat Yildirim

Main category: cs.CL

TL;DR: This paper explores how prompt engineering can improve large language model (LLM) performance on social science text classification. A modest increase in prompt context substantially improves accuracy, but adding too much can reduce it, and the effect varies with model, task, and batch size, so each task needs its own validation.

Details Motivation: Using LLMs for text classification in social science research is cost-effective, but performance varies widely, so a systematic way to improve accuracy is needed. Method: Systematically varies three aspects of prompt engineering (label descriptions, instructional nudges, and few-shot examples) and tests them empirically on two different cases. Result: A minimal increase in prompt context yields the largest performance gain; further increases bring only marginal gains and sometimes reduce accuracy; there is substantial heterogeneity across models, tasks, and batch sizes. Conclusion: General prompt-design rules cannot be relied on; every LLM text-coding task must be validated and tuned individually. Abstract: Recent developments in text classification using Large Language Models (LLMs) in the social sciences suggest that costs can be cut significantly, while performance can sometimes rival existing computational methods. However, with a wide variance in performance in current tests, we move to the question of how to maximize performance. In this paper, we focus on prompt context as a possible avenue for increasing accuracy by systematically varying three aspects of prompt engineering: label descriptions, instructional nudges, and few shot examples. Across two different examples, our tests illustrate that a minimal increase in prompt context yields the highest increase in performance, while further increases in context only tend to yield marginal performance increases thereafter. Alarmingly, increasing prompt context sometimes decreases accuracy. Furthermore, our tests suggest substantial heterogeneity across models, tasks, and batch size, underlining the need for individual validation of each LLM coding task rather than reliance on general rules.

[28] Translation Asymmetry in LLMs as a Data Augmentation Factor: A Case Study for 6 Romansh Language Varieties

Jannis Vamvas,Ignacio Pérez Prat,Angela Heldstab,Dominic P. Fischer,Sina Ahmadi,Rico Sennrich

Main category: cs.CL

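TL;DR: This paper proposes a new data augmentation strategy for machine translation of the low-resource language Romansh. Aligning augmentation with the resource gradient substantially improves translation quality for the lowest-resource variety (+23 BLEU) and yields the first fluent translations in the individual Romansh varieties.

Details Motivation: Existing low-resource MT methods that use large language models (LLMs) to generate synthetic data fail for Romansh because LLMs tend to confuse its 6 distinct varieties. Method: Aligns the direction of data augmentation with the resource gradient between source and target language, instead of simply relying on LLMs to generate data from higher-resource languages. Result: On the lowest-resource Romansh variety, the approach surpasses Gemini 3 Pro by 23 BLEU; a human evaluation confirms that it yields the first model that produces fluent translations in the individual Romansh varieties. Conclusion: The direction of data augmentation should follow the resource gradient, which matters greatly for machine translation of low-resource languages with multiple varieties. Abstract: Recent strategies for low-resource machine translation rely on LLMs to generate synthetic data from higher-resource languages. We find that this method fails for Romansh, because LLMs tend to confuse its 6 distinct language varieties. Our experiments show that instead, the direction of data augmentation should be aligned with the resource gradient between source and target language. This approach surpasses Gemini 3 Pro in the lowest-resource variety of Romansh by 23 BLEU. A human evaluation confirms that our experiments yield the first model that generates fluent translations in the individual Romansh varieties.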

Pietro Dell'Oglio,Alessandro Bondielli,Francesco Marcelloni,Lucia C. Passaro

Main category: cs.CL

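TL;DR: This paper systematically evaluates 12 representative fake news detection methods, spanning traditional machine learning, deep learning, transformers, and cross-domain architectures, with a unified binary (real/fake) setup on 10 English text datasets. It examines in-domain, multi-domain, and cross-domain generalization and finds that fine-tuned models generalize poorly, cross-domain models need large amounts of data, and large language models show more promise in zero-shot and few-shot settings.

Details Motivation: The production and spread of fake news have grown more sophisticated with large language models and social media, and the robustness and generalization of existing detectors in realistic conditions (domain shift, out-of-distribution data) remain unclear, so a systematic, standardized evaluation is needed. Method: Selects 12 representative fake news detection methods and evaluates them as binary (Real/Fake) classifiers on 10 public English text datasets, harmonizing each dataset's original labels under a single protocol; runs in-domain, multi-domain, and cross-domain experiments; compares traditional models, deep models, transformers, cross-domain architectures, and LLMs (zero- and few-shot). Result: Fine-tuned models do well in-domain but generalize poorly across domains; specialized cross-domain architectures narrow the gap but depend on large amounts of labeled data; LLMs show stronger adaptability in zero- and few-shot settings; all results are limited to English text and affected by inherent dataset biases. Conclusion: Current fake news detection methods still face significant generalization challenges in realistic settings; robustness evaluation deserves more attention than raw performance gains; LLMs offer a new route for data scarcity and domain transfer, but their pre-training exposure and dataset confounds must be treated with caution. Abstract: In recent years, fake news detection has received increasing attention in public debate and scientific research. Despite advances in detection techniques, the production and spread of false information have become more sophisticated, driven by Large Language Models (LLMs) and the amplification power of social media. We present a critical assessment of 12 representative fake news detection approaches, spanning traditional machine learning, deep learning, transformers, and specialized cross-domain architectures. We evaluate these methods on 10 publicly available datasets differing in genre, source, topic, and labeling rationale. We address text-only English fake news detection as a binary classification task by harmonizing labels into "Real" and "Fake" to ensure a consistent evaluation protocol. We acknowledge that label semantics vary across datasets and that harmonization inevitably removes such semantic nuances. Each dataset is treated as a distinct domain. We conduct in-domain, multi-domain and cross-domain experiments to simulate real-world scenarios involving domain shift and out-of-distribution data. Fine-tuned models perform well in-domain but struggle to generalize. Cross-domain architectures can reduce this gap but are data-hungry, while LLMs offer a promising alternative through zero- and few-shot learning. Given inherent dataset confounds and possible pre-training exposure, results should be interpreted as robustness evaluations within this English, text-only protocol.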

[30] Humans vs Vision-Language Models: A Unified Measure of Narrative Coherence

Nikolai Ilinykh,Hyewon Jang,Shalom Lappin,Asad Sayeed,Sharid Loáiciga

Main category: cs.CL

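TL;DR: This paper studies coherence in visually grounded narratives by comparing human-written stories with those generated by vision-language models (VLMs). Although VLM output is fluent on the surface, its discourse organization differs systematically from human narratives.

Details Motivation: To understand whether vision-language models achieve narrative coherence comparable to humans when generating visually grounded stories, and to reveal potential shortcomings. Method: On the Visual Writing Prompts corpus, computes a narrative coherence score from metrics covering coreference, discourse relation types, topic continuity, character persistence, and multimodal character grounding, and compares the results. Result: VLMs show coherence profiles that are broadly similar to one another but differ systematically from those of humans; differences on individual measures are subtle and become clearer when considered jointly; model narratives differ from human ones in how discourse is organized across a visually grounded story. Conclusion: Despite surface fluency, current VLM-generated visual narratives still differ systematically from human ones in deeper narrative structure and discourse organization. Abstract: We study narrative coherence in visually grounded stories by comparing human-written narratives with those generated by vision-language models (VLMs) on the Visual Writing Prompts corpus. Using a set of metrics that capture different aspects of narrative coherence, including coreference, discourse relation types, topic continuity, character persistence, and multimodal character grounding, we compute a narrative coherence score. We find that VLMs show broadly similar coherence profiles that differ systematically from those of humans. In addition, differences for individual measures are often subtle, but they become clearer when considered jointly. Overall, our results indicate that, despite human-like surface fluency, model narratives exhibit systematic differences from those of humans in how they organise discourse across a visually grounded story. Our code is available at https://github.com/GU-CLASP/coherence-driven-humans.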

[31] PICon: A Multi-Turn Interrogation Framework for Evaluating Persona Agent Consistency

Minseo Kim,Sujeong Im,Junseong Choi,Junhee Lee,Chaeeun Shim,Edward Choi

Main category: cs.CL

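TL;DR: This paper proposes PICon, a framework that uses logically chained multi-turn questioning to evaluate LLM-based persona agents along three dimensions: internal consistency, external consistency, and retest consistency. Existing systems still fall well short of the human baseline.

Details Motivation: There is no systematic method for verifying whether a persona agent's responses stay free of contradictions and factual errors throughout an interaction. Inspired by the interrogation principle that an elaborate fabricated identity will eventually expose its contradictions, the authors seek an evaluation mechanism that surfaces latent inconsistencies. Method: Proposes the PICon evaluation framework, which uses logically linked multi-turn questioning to quantify persona agents on internal consistency (freedom from self-contradiction), external consistency (alignment with real-world facts), and retest consistency (stability under repeated questioning), and compares 7 groups of agents with 63 real human participants. Result: None of the tested persona agents reached the human baseline on the three consistency dimensions, showing clear contradictions and evasive answers; this indicates that PICon can expose the fragility of existing systems. Conclusion: PICon provides a conceptual foundation and a practical evaluation method for persona agents, and underlines the need for rigorous consistency checks before treating them as substitutes for human participants. Abstract: Large language model (LLM)-based persona agents are rapidly being adopted as scalable proxies for human participants across diverse domains. Yet there is no systematic method for verifying whether a persona agent's responses remain free of contradictions and factual inaccuracies throughout an interaction. A principle from interrogation methodology offers a lens: no matter how elaborate a fabricated identity, systematic interrogation will expose its contradictions. We apply this principle to propose PICon, an evaluation framework that probes persona agents through logically chained multi-turn questioning. PICon evaluates consistency along three core dimensions: internal consistency (freedom from self-contradiction), external consistency (alignment with real-world facts), and retest consistency (stability under repetition). Evaluating seven groups of persona agents alongside 63 real human participants, we find that even systems previously reported as highly consistent fail to meet the human baseline across all three dimensions, revealing contradictions and evasive responses under chained questioning. This work provides both a conceptual foundation and a practical methodology for evaluating persona agents before trusting them as substitutes for human participants. We provide the source code and an interactive demo at: https://kaist-edlab.github.io/picon/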

[32] Beyond Via: Analysis and Estimation of the Impact of Large Language Models in Academic Papers

Mingmeng Geng,Yuhang Dong,Thierry Poibeau

Main category: cs.CL

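TL;DR: By analyzing arXiv papers, this work reveals subtle effects of large language models (LLMs) on academic word usage, such as more frequent "beyond" and "via" in titles and less frequent "the" and "of" in abstracts. It notes that current classifiers struggle to identify the specific generating model, and uses an interpretable linear approach to quantify how heterogeneous and dynamic real-world LLM usage is.

Details Motivation: Prior work has paid too little attention to the latent, subtle influence of LLMs on academic writing habits, in particular the real-world patterns of textual change that arise from similarities and differences across models. Method: Performs word-frequency and trend analysis on a corpus of arXiv papers; designs a direct, highly interpretable linear approach that accounts for differences between models and prompts to quantify the effect of LLMs on word usage; adds multi-class classification experiments to test whether the generating model can be identified. Result: Finds marked, previously under-discussed shifts in word usage (e.g., "beyond" and "via" rising in titles, "the" and "of" falling in abstracts); confirms that current classifiers perform poorly at identifying the generating LLM in multi-class tasks; shows that real-world LLM usage is heterogeneous and evolves over time. Conclusion: LLMs are quietly reshaping the language of academic text, with effects that are shared across models yet vary by model and prompt; more robust, interpretable methods are needed to monitor and understand this dynamic influence. Abstract: Through an analysis of arXiv papers, we report several shifts in word usage that are likely driven by large language models (LLMs) but have not previously received sufficient attention, such as the increased frequency of "beyond" and "via" in titles and the decreased frequency of "the" and "of" in abstracts. Due to the similarities among different LLMs, experiments show that current classifiers struggle to accurately determine which specific model generated a given text in multi-class classification tasks. Meanwhile, variations across LLMs also result in evolving patterns of word usage in academic papers. By adopting a direct and highly interpretable linear approach and accounting for differences between models and prompts, we quantitatively assess these effects and show that real-world LLM usage is heterogeneous and dynamic.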
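The kind of word-usage trend described above can be measured with a very small script. The sketch below computes, per year, the share of titles containing a given word; the titles are invented toy data rather than arXiv records, and the paper's linear estimation step is not reproduced here.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def yearly_word_share(records: Iterable[Tuple[int, str]], word: str) -> Dict[int, float]:
    """For each year, the fraction of titles that contain `word` as a whole word."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    totals: Dict[int, int] = defaultdict(int)
    hits: Dict[int, int] = defaultdict(int)
    for year, title in records:
        totals[year] += 1
        if pattern.search(title):
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}


# Toy (year, title) pairs, made up for illustration.
titles = [
    (2021, "A Study of Parsing"),
    (2021, "Learning via Gradients"),
    (2024, "Faster Decoding via Speculation"),
    (2024, "Alignment via Feedback"),
]
share = yearly_word_share(titles, "via")
print(share)
```

Matching on word boundaries avoids counting substrings (for example "via" inside "trivial"), which matters for short function words such as "of" and "the".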

[33] Measuring What Matters -- or What's Convenient?: Robustness of LLM-Based Scoring Systems to Construct-Irrelevant Factors

Cole Walsh,Rodica Ivan

Main category: cs.CL

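TL;DR: This study examines how construct-irrelevant factors affect a dual-architecture automated scoring system based on large language models (LLMs). The system is largely robust to padding with meaningless text, spelling errors, and writing sophistication, but is sensitive to large duplicated passages and off-topic responses, and overall shows good construct relevance.

Details Motivation: With LLMs widely used in automated scoring, their robustness to construct-irrelevant factors (such as irrelevant text or spelling errors) and adversarial conditions becomes a key question, with particular concern about "hallucinations" and scoring bias. Method: Uses a dual-architecture LLM-based scoring system on short essay-like open-response items in a situational judgment test and systematically tests several construct-irrelevant perturbations: padding with meaningless text, spelling errors, changes in writing sophistication, duplication of large passages, and off-topic responses, comparing the resulting score changes. Result: The system is largely insensitive to meaningless padding, spelling errors, and writing sophistication; duplicating large passages lowers average scores (the opposite of earlier findings for non-LLM systems); off-topic responses are heavily penalized. Conclusion: When designed with construct relevance in mind, LLM-based automated scoring systems can be robust, which supports trustworthy use of AI in future educational assessment. Abstract: Automated systems have been widely adopted across the educational testing industry for open-response assessment and essay scoring. These systems commonly achieve performance levels comparable to or better than those of trained human raters, but have frequently been demonstrated to be vulnerable to the influence of construct-irrelevant factors (i.e., features of responses that are unrelated to the construct assessed) and adversarial conditions. Given the rising usage of large language models in automated scoring systems, there is a renewed focus on "hallucinations" and the robustness of these LLM-based automated scoring approaches to construct-irrelevant factors. This study investigates the effects of construct-irrelevant factors on a dual-architecture LLM-based scoring system designed to score short essay-like open-response items in a situational judgment test. It was found that the scoring system was generally robust to padding responses with meaningless text, spelling errors, and writing sophistication. Duplicating large passages of text resulted in lower scores predicted by the system, on average, contradicting results from previous studies of non-LLM-based scoring systems, while off-topic responses were heavily penalized by the scoring system. These results provide encouraging support for the robustness of future LLM-based scoring systems when designed with construct relevance in mind.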

[34] Self-Improvement of Large Language Models: A Technical Overview and Future Outlook

Haoyan Yang,Mario Xerri,Solha Park,Huajian Zhang,Yiyang Feng,Sai Akhil Kogilathota,Jiawei Zhou

Main category: cs.CL

TL;DR: This paper presents a system-level unified framework for self-improving large language models. It models self-improvement as a closed-loop lifecycle of four stages (data acquisition, data selection, model optimization, and inference refinement) and adds an autonomous evaluation layer that monitors and guides the whole cycle.

Details Motivation: Human supervision is costly and scales poorly, and its feedback becomes less informative as models approach human-level ability; at the same time, models' growing capacity for autonomous decision-making and action makes it possible to automate parts of the development process. Method: Builds a closed-loop self-improvement framework driven by the model itself, comprising data acquisition, data selection, model optimization, inference refinement, and an autonomous evaluation layer, and systematically analyzes representative techniques for each stage. Result: Provides a systematic, structured unified framework for self-improving LLMs, clarifies the technical lineage of each component, and identifies key limitations of current methods. Conclusion: Self-improvement is a key path toward fully autonomous, evolving LLMs and requires system-level design of how the stages work together; future research should focus on evaluation reliability, closed-loop stability, and cross-task generalization. Abstract: As large language models (LLMs) continue to advance, improving them solely through human supervision is becoming increasingly costly and limited in scalability. As models approach human-level capabilities in certain domains, human feedback may no longer provide sufficiently informative signals for further improvement. At the same time, the growing ability of models to make autonomous decisions and execute complex actions naturally enables abstractions in which components of the model development process can be progressively automated. Together, these challenges and opportunities have driven increasing interest in self-improvement, where models autonomously generate data, evaluate outputs, and iteratively refine their own capabilities. In this paper, we present a system-level perspective on self-improving language models and introduce a unified framework that organizes existing techniques. We conceptualize the self-improvement system as a closed-loop lifecycle, consisting of four tightly coupled processes: data acquisition, data selection, model optimization, and inference refinement, along with an autonomous evaluation layer. Within this framework, the model itself plays a central role in driving each stage: collecting or generating data, selecting informative signals, updating its parameters, and refining outputs, while the autonomous evaluation layer continuously monitors progress and guides the improvement cycle across stages. Following this lifecycle perspective, we systematically review and analyze representative methods for each component from a technical standpoint.
We further discuss current limitations and outline our vision for future research toward fully self-improving LLMs.

[35] S2D2: Fast Decoding for Diffusion LLMs via Training-Free Self-Speculation

Ligong Han,Hao Wang,Han Gao,Kai Xu,Akash Srivastava

Main category: cs.CL

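TL;DR: This paper proposes S2D2, a training-free self-speculative decoding framework for block-diffusion language models. It uses the same pretrained model, run with block size 1, as both drafter and verifier, improving the accuracy-speed tradeoff of decoding without any additional training.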

Details Motivation: When block-diffusion language models generate quickly in the few-step regime, confidence-thresholded decoding is brittle: aggressive thresholds hurt quality, while conservative thresholds waste compute; existing fixes tend to require additional training or extra test-time compute. Method: S2D2 exploits the fact that a block-diffusion model reduces to an autoregressive model when each block holds a single token, reusing the same pretrained model as both drafter and verifier; it inserts a speculative verification step into standard block-diffusion decoding and uses lightweight routing policies to decide dynamically whether to run verification. Result: Across three mainstream block-diffusion families, including SDAR and LLaDA2.1-Mini, S2D2 improves on strong confidence-thresholding baselines: on SDAR it is up to 4.7× faster than autoregressive decoding and up to 1.57× faster than a tuned dynamic decoding baseline while improving accuracy by up to 4.5 points; on LLaDA2.1-Mini it remains complementary to built-in self-correction and, in a conservative setting, is 4.4× faster than the static baseline with slightly higher accuracy. Conclusion: S2D2 is an efficient, general, training-free decoding method for block-diffusion models that achieves a better accuracy-speed balance through self-speculation and offers a new option for practical deployment. Abstract: Block-diffusion language models offer a promising path toward faster-than-autoregressive generation by combining block-wise autoregressive decoding with within-block parallel denoising. However, in the few-step regime needed for practical acceleration, standard confidence-thresholded decoding is often brittle: aggressive thresholds hurt quality, while conservative thresholds require unnecessary denoising steps. Existing approaches that address this issue either require additional training or incur extra test-time compute. We present S2D2, a training-free self-speculative decoding framework for block-diffusion language models. Our key observation is that a block-diffusion model becomes autoregressive when the block size is reduced to one, allowing the same pretrained model to act as both drafter and verifier. S2D2 inserts a speculative verification step into standard block-diffusion decoding and uses lightweight routing policies to decide when verification is worth its cost. This yields a hybrid decoding trajectory in which diffusion proposes tokens in parallel, while the autoregressive mode acts as a local sequence-level critic. Across three mainstream block-diffusion families, S2D2 consistently improves the accuracy-speed tradeoff over strong confidence-thresholding baselines. On SDAR, we observe up to $4.7\times$ speedup over autoregressive decoding, and up to $1.57\times$ over a tuned dynamic decoding baseline while improving accuracy by up to $4.5$ points. 
On LLaDA2.1-Mini, S2D2 remains complementary to built-in self-correction, including a conservative setting where it is $4.4\times$ faster than the static baseline with slightly higher accuracy.
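
The draft-then-verify idea described above can be illustrated with a minimal sketch. It assumes the simplest greedy acceptance rule from speculative decoding (keep the longest drafted prefix that the verifier would also have produced); the abstract does not specify S2D2's actual acceptance rule or routing policy, and `verifier_next` and `toy_verifier` are hypothetical stand-ins, not part of any released code.

```python
from typing import Callable, List, Sequence


def verify_draft(prefix: Sequence[int],
                 draft: Sequence[int],
                 verifier_next: Callable[[Sequence[int]], int]) -> List[int]:
    """Greedy speculative verification (illustrative, not S2D2's exact rule).

    `verifier_next(context)` is a hypothetical stand-in for one greedy
    autoregressive (block size 1) step of the model; it returns a token id.
    A real system would score all draft positions in one forward pass;
    the loop here is sequential only for readability.
    """
    accepted: List[int] = []
    context = list(prefix)
    for token in draft:
        expected = verifier_next(context)
        if token != expected:
            # First disagreement: keep the verifier's token and stop.
            accepted.append(expected)
            break
        accepted.append(token)
        context.append(token)
    return accepted


def toy_verifier(context: Sequence[int]) -> int:
    """Toy verifier for the demo: always predicts (last token + 1)."""
    return context[-1] + 1


# The third drafted token (99) disagrees with the verifier, which expects 13.
result = verify_draft(prefix=[10], draft=[11, 12, 99, 14], verifier_next=toy_verifier)
```

In this toy run the first two drafted tokens are accepted and the third is replaced by the verifier's choice, so `result` holds three tokens. A routing policy in the sense of the abstract would decide, per block, whether calling `verify_draft` is worth its cost.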

[36] Natural-Language Agent Harnesses

Linyue Pan,Lexiao Zou,Shuo Guo,Jingchen Ni,Hai-Tao Zheng

Main category: cs.CL

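TL;DR: This paper proposes Natural-Language Agent Harnesses (NLAHs) and an Intelligent Harness Runtime (IHR), which decouple agent control logic from hard-coded controllers by externalizing it as natural language executed by a shared runtime, improving portability, comparability, and suitability for study.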

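Details Motivation: Agent performance depends on "harness engineering", but harness design is usually embedded in controller code and runtime conventions, making it hard to transfer, compare, and study as a scientific object. Method: The paper proposes Natural-Language Agent Harnesses (NLAHs), which describe harness behavior in editable natural language, and the Intelligent Harness Runtime (IHR), which executes NLAHs uniformly through explicit contracts, durable artifacts, and lightweight adapters. Result: Controlled experiments on coding and computer-use benchmarks evaluate operational viability, module ablation, and code-to-text harness migration. Conclusion: Externalizing high-level control logic as executable natural-language artifacts is feasible and effective, offering a more scientific, reproducible, and interoperable approach to agent engineering. Abstract: Agent performance increasingly depends on harness engineering, yet harness design is usually buried in controller code and runtime-specific conventions, making it hard to transfer, compare, and study as a scientific object. We ask whether the high-level control logic of an agent harness can instead be externalized as a portable executable artifact. We introduce Natural-Language Agent Harnesses (NLAHs), which express harness behavior in editable natural language, and Intelligent Harness Runtime (IHR), a shared runtime that executes these harnesses through explicit contracts, durable artifacts, and lightweight adapters. Across coding and computer-use benchmarks, we conduct controlled evaluations of operational viability, module ablation, and code-to-text harness migration.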

cs.CV [Back]

[37] From Weights to Concepts: Data-Free Interpretability of CLIP via Singular Vector Decomposition

Francesco Gentile,Nicola Dall'Asen,Francesco Tonini,Massimiliano Mancini,Lorenzo Vaquero,Elisa Ricci

Main category: cs.CV

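TL;DR: This paper proposes SITH, a data-free, training-free framework for weight-space interpretability of CLIP's vision transformer. It uses singular vector decomposition together with the COMP algorithm to produce fine-grained, semantically coherent intra-head explanations, and supports interpretable model editing and analysis of adaptation.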

Details Motivation: Existing interpretability methods rely on activations, which makes them dataset-dependent, sensitive to data bias, and coarse in granularity; a data-free analysis that works directly in weight space is needed. Method: SITH decomposes the value-output matrix of each attention head in CLIP's vision transformer into singular vectors and uses a new algorithm, COMP, to explain each singular vector as a sparse, semantically coherent combination of human-interpretable concepts. Result: SITH yields coherent and faithful intra-head explanations (validated through reconstruction fidelity and interpretability experiments); supports precise concept-level edits in weight space (amplifying or suppressing specific concepts) that improve downstream performance; and shows that fine-tuning mainly reweights a stable semantic basis rather than learning new features. Conclusion: SITH provides the first fully data-free, training-free, fine-grained, and semantically interpretable weight-space analysis paradigm for large vision-language models, combining theoretical insight with practical editing capability. Abstract: As vision-language models are deployed at scale, understanding their internal mechanisms becomes increasingly critical. Existing interpretability methods predominantly rely on activations, making them dataset-dependent, vulnerable to data bias, and often restricted to coarse head-level explanations. We introduce SITH (Semantic Inspection of Transformer Heads), a fully data-free, training-free framework that directly analyzes CLIP's vision transformer in weight space. For each attention head, we decompose its value-output matrix into singular vectors and interpret each one via COMP (Coherent Orthogonal Matching Pursuit), a new algorithm that explains them as sparse, semantically coherent combinations of human-interpretable concepts. We show that SITH yields coherent, faithful intra-head explanations, validated through reconstruction fidelity and interpretability experiments. This allows us to use SITH for precise, interpretable weight-space model edits that amplify or suppress specific concepts, improving downstream performance without retraining. Furthermore, we use SITH to study model adaptation, showing how fine-tuning primarily reweights a stable semantic basis rather than learning entirely new features.
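
To make the idea of explaining a vector as a sparse combination of concepts concrete, here is a minimal sketch of plain matching pursuit, the classical greedy algorithm that the name COMP builds on. It is not the COMP algorithm itself (the abstract does not describe its coherence criterion), the singular value decomposition step is omitted, and the concept names and vectors are hypothetical toy values.

```python
from typing import Dict, List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def matching_pursuit(target: Sequence[float],
                     atoms: Dict[str, Sequence[float]],
                     n_terms: int) -> Dict[str, float]:
    """Greedy sparse decomposition of `target` over unit-norm `atoms`.

    Each step picks the atom most correlated with the residual and
    subtracts its projection. Atoms are assumed to have unit norm.
    """
    residual: List[float] = list(target)
    coeffs: Dict[str, float] = {}
    for _ in range(n_terms):
        name = max(atoms, key=lambda k: abs(dot(residual, atoms[k])))
        weight = dot(residual, atoms[name])
        if abs(weight) < 1e-12:
            break  # residual is already explained
        coeffs[name] = coeffs.get(name, 0.0) + weight
        residual = [r - weight * a for r, a in zip(residual, atoms[name])]
    return coeffs


# Hypothetical concept dictionary (toy 3-D vectors, not real embeddings).
concepts = {
    "dog": [1.0, 0.0, 0.0],
    "grass": [0.0, 1.0, 0.0],
    "sky": [0.0, 0.0, 1.0],
}
singular_vector = [2.0, 0.0, 0.5]  # stand-in for one singular vector
explanation = matching_pursuit(singular_vector, concepts, n_terms=2)
```

With these orthonormal toy atoms, the two selected concepts are "dog" and "sky". An orthogonal matching pursuit variant would additionally re-fit all selected coefficients by least squares after each step, which matters when the atoms are correlated.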

[38] KitchenTwin: Semantically and Geometrically Grounded 3D Kitchen Digital Twins

Quanyun Wu,Kyle Gao,Daniel Long,David A. Clausi,Jonathan Li,Yuhao Chen

Main category: cs.CV

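TL;DR: This paper proposes a scale-aware 3D fusion framework that uses a vision-language model (VLM)-guided geometric anchor mechanism to resolve scale and coordinate mismatches between transformer-predicted point clouds and locally reconstructed meshes, producing metrically consistent, semantically grounded digital twin environments.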

Details Motivation: Transformer-based 3D reconstruction from monocular video suffers from inherent scale ambiguity and inconsistent coordinate conventions, so scale-free point clouds cannot be reliably fused with local object meshes, which limits their use for embodied AI that needs real-world metric geometry. Method: The paper proposes a scale-aware 3D fusion framework comprising (1) a VLM-guided geometric anchor mechanism that recovers real-world metric scale, and (2) a geometry-aware registration pipeline that combines gravity-aligned vertical estimation, Manhattan-world structural constraints, and collision-free local refinement. Result: Experiments in real indoor kitchen environments show improved cross-network object alignment and geometric consistency, benefiting downstream tasks such as multi-primitive fitting and metric measurement; the authors also release an open-source indoor digital twin dataset with metric scale, semantic annotations, and registered object meshes. Conclusion: The work bridges the gap between visual reconstruction and embodied AI's need for metrically consistent digital twins, offering a new approach to building interactive, measurable, semantically rich 3D environments. Abstract: Embodied AI training and evaluation require object-centric digital twin environments with accurate metric geometry and semantic grounding. Recent transformer-based feedforward reconstruction methods can efficiently predict global point clouds from sparse monocular videos, yet these geometries suffer from inherent scale ambiguity and inconsistent coordinate conventions. This mismatch prevents the reliable fusion of these dimensionless point cloud predictions with locally reconstructed object meshes. We propose a novel scale-aware 3D fusion framework that registers visually grounded object meshes with transformer-predicted global point clouds to construct metrically consistent digital twins. Our method introduces a Vision-Language Model (VLM)-guided geometric anchor mechanism that resolves this fundamental coordinate mismatch by recovering an accurate real-world metric scale. To fuse these networks, we propose a geometry-aware registration pipeline that explicitly enforces physical plausibility through gravity-aligned vertical estimation, Manhattan-world structural constraints, and collision-free local refinement. Experiments on real indoor kitchen environments demonstrate improved cross-network object alignment and geometric consistency for downstream tasks, including multi-primitive fitting and metric measurement. We additionally introduce an open-source indoor digital twin dataset with metrically scaled scenes and semantically grounded and registered object-centric mesh annotations.

[39] UniICL: Systematizing Unified Multimodal In-context Learning through a Capability-Oriented Taxonomy

Yicheng Xu,Jiangning Zhang,Zhucun Xue,Teng Hu,Ran Yi,Xiaobin Hu,Yong Liu,Dacheng Tao

Main category: cs.CV

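TL;DR: This paper proposes a six-level capability-oriented taxonomy to diagnose the non-monotonic, task-dependent behavior of in-context learning in unified multimodal models, and builds the large-scale dataset UniICL-760K and the benchmark UniICL-Bench. It further designs a lightweight, plug-and-play Context-Adaptive Prototype Modulator that improves the stability and performance of few-shot adaptation.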

Details Motivation: In unified multimodal models, in-context learning is unstable, non-monotonic, and highly task-dependent because of cross-modal interference and varying cognitive demands, calling for systematic diagnosis and stabilization. Method: The paper proposes a six-level capability-oriented taxonomy of demonstration roles; builds UniICL-760K, a large-scale corpus of 8-shot in-context learning episodes across 15 subtasks, together with the UniICL-Bench benchmark; and designs a lightweight plug-and-play module, the Context-Adaptive Prototype Modulator. Result: On UniICL-Bench, the method outperforms larger-parameter multimodal large language model baselines on most understanding-oriented in-context learning tasks. Conclusion: Systematic analysis based on a cognitive capability taxonomy, combined with a lightweight architectural intervention, can mitigate the sensitivity and instability of in-context learning in unified multimodal models and improve few-shot generalization. Abstract: In-context Learning enables training-free adaptation via demonstrations but remains highly sensitive to example selection and formatting. In unified multimodal models spanning understanding and generation, this sensitivity is exacerbated by cross-modal interference and varying cognitive demands. Consequently, In-context Learning efficacy is often non-monotonic and highly task-dependent. To diagnose these behaviors, we introduce a six-level capability-oriented taxonomy that categorizes the functional role of demonstrations from basic perception to high-order discernment. Guided by this cognitive framework, we construct UniICL-760K, a large-scale corpus featuring curated 8-shot In-context Learning episodes across 15 subtasks, alongside UniICL-Bench for rigorous, controlled evaluation. As an architectural intervention to stabilize few-shot adaptation, we propose the Context-Adaptive Prototype Modulator, a lightweight, plug-and-play module. Evaluations on UniICL-Bench show that our approach yields highly competitive unified results, outperforming larger-parameter multimodal large language model baselines on most understanding In-context Learning tasks. Data and code will be available soon at https://github.com/xuyicheng-zju/UniICL.

[40] LLaVA-LE: Large Language-and-Vision Assistant for Lunar Exploration

Gokce Inal,Pouyan Navard,Alper Yilmaz

Main category: cs.CV

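TL;DR: This paper presents LLaVA-LE, a vision-language model specialized for lunar surface and subsurface characterization, builds the large-scale multimodal lunar dataset LUCID, and uses a two-stage fine-tuning strategy that substantially improves performance on lunar terrain analysis tasks.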

Details Motivation: Vision-language models (VLMs) have seen little use in planetary science, mainly because no large-scale dataset pairs real planetary imagery with detailed scientific descriptions. Method: The authors build LUCID, a multimodal lunar dataset with 96k high-resolution panchromatic images and captions plus 81k question-answer pairs, and fine-tune LLaVA on it with a two-stage curriculum: stage 1 performs concept alignment for domain-specific terrain description, and stage 2 performs instruction-tuned visual question answering. Result: On benchmarks for lunar terrain analysis, LLaVA-LE achieves a 3.3x overall gain over base LLaVA and 2.1x over the stage 1 model; its reasoning score of 1.070 exceeds the judge's own reference score. Conclusion: Domain-specific multimodal data and instruction tuning can effectively advance the use of vision-language models in planetary exploration. Abstract: Recent advances in multimodal vision-language models (VLMs) have enabled joint reasoning over visual and textual information, yet their application to planetary science remains largely unexplored. A key hindrance is the absence of large-scale datasets that pair real planetary imagery with detailed scientific descriptions. In this work, we introduce LLaVA-LE (Large Language-and-Vision Assistant for Lunar Exploration), a vision-language model specialized for lunar surface and subsurface characterization. To enable this capability, we curate a new large-scale multimodal lunar dataset, LUCID (LUnar Caption Image Dataset), consisting of 96k high-resolution panchromatic images paired with detailed captions describing lunar terrain characteristics, and 81k question-answer (QA) pairs derived from approximately 20k images in the LUCID dataset. Leveraging this dataset, we fine-tune LLaVA using a two-stage training curriculum: (1) concept alignment for domain-specific terrain description, and (2) instruction-tuned visual question answering. We further design evaluation benchmarks spanning multiple levels of reasoning complexity relevant to lunar terrain analysis. Evaluated against GPT and Gemini judges, LLaVA-LE achieves a 3.3x overall performance gain over Base LLaVA and 2.1x over our Stage 1 model, with a reasoning score of 1.070, exceeding the judge's own reference score, highlighting the effectiveness of domain-specific multimodal data and instruction tuning to advance VLMs in planetary exploration. Code is available at https://github.com/OSUPCVLab/LLaVA-LE.

[41] Lookalike3D: Seeing Double in 3D

Chandan Yeshwanth,Angela Dai

Main category: cs.CV

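TL;DR: This paper introduces the new task of lookalike object detection in indoor scenes. Using multiview images and semantic priors from large vision foundation models, it designs the Lookalike3D multiview transformer, which substantially improves performance on the authors' 3DTwins dataset and benefits downstream tasks such as joint 3D reconstruction and part co-segmentation.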

Details Motivation: Existing 3D understanding and generation methods often ignore a pervasive cue in real scenes, namely repeated and similar objects, and do not model the repeated and complementary relations between objects. Method: The paper proposes Lookalike3D, a multiview image transformer that draws on strong semantic priors from large image foundation models to classify object pairs as identical, similar, or different, and builds the 3DTwins dataset with 76k annotated pairs based on ScanNet++. Result: On 3DTwins, the method improves IoU by 104% over baselines and is shown to support downstream tasks such as joint 3D object reconstruction and part co-segmentation. Conclusion: Repeated and lookalike objects are a key cue for improving the consistency and quality of 3D perception; the proposed task, model, and dataset open a new direction for 3D understanding. Abstract: 3D object understanding and generation methods produce impressive results, yet they often overlook a pervasive source of information in real-world scenes: repeated objects. We introduce the task of lookalike object detection in indoor scenes, which leverages repeated and complementary cues from identical and near-identical object pairs. Given an input scene, the task is to classify pairs of objects as identical, similar or different using multiview images as input. To address this, we present Lookalike3D, a multiview image transformer that effectively distinguishes such object pairs by harnessing strong semantic priors from large image foundation models. To support this task, we collected the 3DTwins dataset, containing 76k manually annotated identical, similar and different pairs of objects based on ScanNet++, and show an improvement of 104% IoU over baselines. We demonstrate how our method improves downstream tasks such as enabling joint 3D object reconstruction and part co-segmentation, turning repeated and lookalike objects into a powerful cue for consistent, high-quality 3D perception. Our code, dataset and models will be made publicly available.

[42] Accurate Point Measurement in 3DGS -- A New Alternative to Traditional Stereoscopic-View Based Measurements

Deyan Deng,Rongjun Qin

Main category: cs.CV

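TL;DR: This paper proposes a new 3D point measurement method based on 3D Gaussian Splatting (3DGS). It uses the high-quality, complete novel views that 3DGS renders to perform multi-view triangulation, improving geometric measurement accuracy and outperforming mesh-based measurement, especially on hard-to-reconstruct regions such as thin structures and sharp edges.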

Details Motivation: 3D Gaussian Splatting (3DGS) excels at novel view synthesis, but its potential for accurate geometric measurement is underused; existing measurement methods depend on demanding stereoscopic workstations or on incomplete and inaccurate 3D meshes, limiting accuracy and ease of use. Method: Using the high-quality, continuous multi-view images rendered by 3DGS, the user picks corresponding (congruent) points in different views, and triangulation produces an accurate 3D point; the method supports both two-view and multi-view intersection for higher accuracy and is implemented as a lightweight web application. Result: On UAV aerial datasets, the RMSE is 1-2 cm on well-defined points; on thin structures, where the mesh-based RMSE is 0.062 m, the method achieves 0.037 m; on sharp corners that are poorly reconstructed in the mesh, the method measures all points with an RMSE of 0.013 m, whereas the mesh method fails entirely. Conclusion: 3DGS is useful not only for rendering but also as a new basis for accurate geometric measurement, combining high accuracy, easy web deployment, and no need for specialized equipment or operator skill, and it outperforms direct mesh-based measurement while matching or exceeding traditional stereoscopic measurement. Abstract: 3D Gaussian Splatting (3DGS) has revolutionized real-time rendering with its state-of-the-art novel view synthesis, but its utility for accurate geometric measurement remains underutilized. Compared to multi-view stereo (MVS) point clouds or meshes, 3DGS rendered views present superior visual quality and completeness. However, current point measurement methods still rely on demanding stereoscopic workstations or direct picking on often-incomplete and inaccurate 3D meshes. As a novel view synthesizer, 3DGS renders exact source views and smoothly interpolates in-between views. This allows users to intuitively pick congruent points across different views while operating 3DGS models. By triangulating these congruent points, one can precisely generate 3D point measurements. This approach mimics traditional stereoscopic measurement but is significantly less demanding: it requires neither a stereo workstation nor specialized operator stereoscopic capability. Furthermore, it enables multi-view intersection (more than two views) for higher measurement accuracy. We implemented a web-based application to demonstrate this proof-of-concept (PoC). Using several UAV aerial datasets, we show this PoC allows users to successfully perform highly accurate point measurements, achieving accuracy matching or exceeding traditional stereoscopic methods on standard hardware. Specifically, our approach significantly outperforms direct mesh-based measurements. Quantitatively, our method achieves RMSEs in the 1-2 cm range on well-defined points. 
More critically, on challenging thin structures where mesh-based RMSE was 0.062 m, our method achieved 0.037 m. On sharp corners poorly reconstructed in the mesh, our method successfully measured all points with a 0.013 m RMSE, whereas the mesh method failed entirely. Code is available at: https://github.com/GDAOSU/3dgs_measurement_tool.
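
The multi-view intersection step mentioned above can be sketched as a least-squares intersection of viewing rays. This is a generic formulation, assuming each picked pixel has already been converted to a ray (camera center plus direction) in world coordinates; the function names are illustrative and are not taken from the linked tool.

```python
from typing import List, Sequence, Tuple

Ray = Tuple[Sequence[float], Sequence[float]]  # (origin, direction)


def normalize(v: Sequence[float]) -> List[float]:
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v]


def solve3(A: List[List[float]], b: List[float]) -> List[float]:
    """Solve a 3x3 linear system by Gauss-Jordan elimination with pivoting."""
    M = [list(A[i]) + [b[i]] for i in range(3)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(3):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][3] / M[i][i] for i in range(3)]


def intersect_rays(rays: Sequence[Ray]) -> List[float]:
    """Point minimizing the summed squared distance to all rays.

    Accumulates sum_i (I - d_i d_i^T) x = sum_i (I - d_i d_i^T) o_i.
    Needs at least two non-parallel rays, otherwise the system is singular.
    """
    A = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for origin, direction in rays:
        d = normalize(direction)
        for i in range(3):
            for j in range(3):
                proj = (1.0 if i == j else 0.0) - d[i] * d[j]
                A[i][j] += proj
                b[i] += proj * origin[j]
    return solve3(A, b)


# Three toy cameras looking at the same point (1, 2, 3).
target = [1.0, 2.0, 3.0]
origins = [[0.0, 0.0, 0.0], [5.0, 0.0, 0.0], [0.0, 5.0, 1.0]]
rays = [(o, [t - c for t, c in zip(target, o)]) for o in origins]
point = intersect_rays(rays)
```

With noise-free rays the recovered `point` coincides with the target; with real picks, adding a third or fourth view averages out picking error, which is the motivation for multi-view intersection given in the abstract.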

[43] Scalable Object Relation Encoding for Better 3D Spatial Reasoning in Large Language Models

Shengli Zhou,Minghang Zheng,Feng Zheng,Yang Liu

Main category: cs.CV

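TL;DR: This paper proposes QuatRoPE and IGRE to improve large language models on 3D spatial reasoning. QuatRoPE is a quaternion-based positional embedding with linear input length that explicitly models spatial relations between objects; IGRE limits its scope of influence so that it does not interfere with the LLM's existing capabilities. Experiments confirm their effectiveness.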

Details Motivation: Existing methods either struggle to extract spatial relations from absolute positions or scale poorly because they explicitly encode all pairwise spatial relations (quadratic complexity); in addition, the scarcity of paired 3D scene-language data limits training of spatial reasoning models. Method: The paper proposes QuatRoPE, a quaternion-based positional embedding whose input length is linear in the number of objects and which computes pairwise spatial relations explicitly through the dot product in attention layers; it also introduces IGRE (Isolated Gated RoPE Extension), which confines QuatRoPE's influence to object-related tokens and protects the LLM's original positional encoding. Result: The approach outperforms existing methods on multiple 3D spatial reasoning benchmarks, supporting the effectiveness of QuatRoPE and IGRE in preserving geometric consistency, improving reasoning, and remaining scalable. Conclusion: QuatRoPE and IGRE offer an efficient, geometrically consistent, LLM-compatible approach to 3D spatial reasoning that addresses key bottlenecks of prior methods in relation modeling, scalability, and positional-embedding interference. Abstract: Spatial reasoning focuses on locating target objects based on spatial relations in 3D scenes, which plays a crucial role in developing intelligent embodied agents. Due to the limited availability of 3D scene-language paired data, it is challenging to train models with strong reasoning ability from scratch. Previous approaches have attempted to inject 3D scene representations into the input space of Large Language Models (LLMs) and leverage the pretrained comprehension and reasoning abilities for spatial reasoning. However, models encoding absolute positions struggle to extract spatial relations from prematurely fused features, while methods explicitly encoding all spatial relations (which is quadratic in the number of objects) as input tokens suffer from poor scalability. To address these limitations, we propose QuatRoPE, a novel positional embedding method with an input length that is linear in the number of objects, and explicitly calculates pairwise spatial relations through the dot product in attention layers. QuatRoPE's holistic vector encoding of 3D coordinates guarantees a high degree of spatial consistency, maintaining fidelity to the scene's geometric integrity. Additionally, we introduce the Isolated Gated RoPE Extension (IGRE), which effectively limits QuatRoPE's influence to object-related tokens, thereby minimizing interference with the LLM's existing positional embeddings and maintaining the LLM's original capabilities. Extensive experiments demonstrate the effectiveness of our approaches. 
The code and data are available at https://github.com/oceanflowlab/QuatRoPE.
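
The claim that pairwise relations can fall out of the attention dot product rests on a property of rotary embeddings that is easy to check numerically. The sketch below uses the ordinary one-dimensional RoPE rotation on a single 2-D feature pair, not QuatRoPE's quaternion construction (which the abstract does not spell out); the value of `theta` and the toy vectors are arbitrary assumptions.

```python
import math
from typing import Tuple

Vec2 = Tuple[float, float]


def rotate(vec: Vec2, position: float, theta: float = 0.1) -> Vec2:
    """Standard RoPE-style rotation of a 2-D pair by angle position * theta."""
    angle = position * theta
    c, s = math.cos(angle), math.sin(angle)
    x, y = vec
    return (c * x - s * y, s * x + c * y)


def score(q: Vec2, k: Vec2, pos_q: float, pos_k: float) -> float:
    """Attention logit between a rotated query and a rotated key."""
    rq, rk = rotate(q, pos_q), rotate(k, pos_k)
    return rq[0] * rk[0] + rq[1] * rk[1]


q, k = (1.0, 2.0), (0.5, -1.0)
# Same offset (4) at different absolute positions gives the same logit.
score_near_origin = score(q, k, 3, 7)
score_shifted = score(q, k, 13, 17)
```

Because rotations compose, the logit depends only on the offset between the two positions, so each token carries one encoding (linear input length) while every pair's relation is still available inside attention. Extending this from scalar positions to 3D coordinates is the part QuatRoPE contributes.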

[44] Confidence-Based Mesh Extraction from 3D Gaussians

Lukas Radl,Felix Windisch,Andreas Kurz,Thomas Köhler,Michael Steiner,Markus Steinberger

Main category: cs.CV

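TL;DR: This paper proposes a self-supervised confidence framework for 3D Gaussian Splatting (3DGS) in which learnable confidence values dynamically balance photometric and geometric supervision. Combined with losses that penalize color and normal variance and an improved D-SSIM appearance model, it improves the quality of mesh extraction in unbounded scenes while remaining efficient.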

Details Motivation: 3DGS struggles to extract accurate mesh surfaces in scenes with abundant view-dependent effects, and remedies such as multi-view techniques, iterative extraction, or large pre-trained models sacrifice the inherent efficiency of 3DGS. Method: The paper introduces a self-supervised confidence framework in which learnable confidence values dynamically weight photometric and geometric losses; designs losses that penalize per-primitive color and normal variance; and improves appearance modeling by decoupling the individual terms of the D-SSIM loss. Result: The method achieves state-of-the-art results for mesh extraction in unbounded scenes while remaining computationally efficient. Conclusion: The approach mitigates the surface ambiguity caused by view-dependent effects in a simple, efficient way, without complex external modules or repeated iterations, balancing accuracy and speed. Abstract: Recently, 3D Gaussian Splatting (3DGS) greatly accelerated mesh extraction from posed images due to its explicit representation and fast software rasterization. While the addition of geometric losses and other priors has improved the accuracy of extracted surfaces, mesh extraction remains difficult in scenes with abundant view-dependent effects. To resolve the resulting ambiguities, prior works rely on multi-view techniques, iterative mesh extraction, or large pre-trained models, sacrificing the inherent efficiency of 3DGS. In this work, we present a simple and efficient alternative by introducing a self-supervised confidence framework to 3DGS: within this framework, learnable confidence values dynamically balance photometric and geometric supervision. Extending our confidence-driven formulation, we introduce losses which penalize per-primitive color and normal variance and demonstrate their benefits to surface extraction. Finally, we complement the above with an improved appearance model, by decoupling the individual terms of the D-SSIM loss. Our final approach delivers state-of-the-art results for unbounded meshes while remaining highly efficient.

[45] A Framework for Generating Semantically Ambiguous Images to Probe Human and Machine Perception

Yuqi Hu,Vasha DuTell,Ahna R. Girshick,Jennifer E. Corbett

Main category: cs.CV

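TL;DR: This paper proposes a framework that interpolates in the CLIP embedding space to generate spectra of semantically ambiguous images, used to compare where humans and machine classifiers place concept boundaries. Machine classifiers lean toward "rabbit", whereas human judgments align more closely with the CLIP embedding used for synthesis, and humans are more sensitive to the guidance scale.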

Details Motivation: To examine how humans and machine vision models differ in placing concept boundaries on semantically ambiguous images, and to reveal how well model representations align with human perception. Method: The framework interpolates between concepts such as "duck" and "rabbit" in the CLIP embedding space to generate a continuous spectrum of ambiguous images, then combines psychophysical experiments with machine classifier tests to quantify where humans and models place their boundaries and which factors (such as guidance scale) affect them. Result: Machine classifiers show a systematic preference for the "rabbit" class; human judgments are closer to the CLIP embedding used for synthesis; and guidance scale appears to affect human sensitivity more strongly than it affects machine classifiers. Conclusion: Controlled semantic ambiguity can serve as a diagnostic tool that bridges human psychophysical analysis, image classification, and generative models, helping to understand human-model alignment, model robustness, interpretability, and image synthesis methods. Abstract: The classic duck-rabbit illusion reveals that when visual evidence is ambiguous, the human brain must decide what it sees. But where exactly do human observers draw the line between "duck" and "rabbit", and do machine classifiers draw it in the same place? We use semantically ambiguous images as interpretability probes to expose how vision models represent the boundaries between concepts. We present a psychophysically-informed framework that interpolates between concepts in the CLIP embedding space to generate continuous spectra of ambiguous images, allowing us to precisely measure where and how humans and machine classifiers place their semantic boundaries. Using this framework, we show that machine classifiers are more biased towards seeing "rabbit", whereas humans are more aligned with the CLIP embedding used for synthesis, and the guidance scale seems to affect human sensitivity more strongly than machine classifiers. Our framework demonstrates how controlled ambiguity can serve as a diagnostic tool to bridge the gap between human psychophysical analysis, image classification, and generative image models, offering insight into human-model alignment, robustness, model interpretability, and image synthesis methods.
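
Interpolating between two concept embeddings can be sketched as follows. The abstract does not say which interpolation scheme the authors use; spherical interpolation (slerp) is assumed here because CLIP embeddings are usually compared after normalization, and the 2-D vectors are toy stand-ins for real text or image embeddings.

```python
import math
from typing import List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def slerp(a: Sequence[float], b: Sequence[float], t: float) -> List[float]:
    """Spherical interpolation between two embeddings, t in [0, 1].

    Inputs are normalized first. Nearly identical inputs return `a`;
    exactly opposite inputs are not handled in this sketch.
    """
    a, b = normalize(a), normalize(b)
    cos_omega = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    omega = math.acos(cos_omega)
    if omega < 1e-8:
        return a
    sin_omega = math.sin(omega)
    weight_a = math.sin((1.0 - t) * omega) / sin_omega
    weight_b = math.sin(t * omega) / sin_omega
    return [weight_a * x + weight_b * y for x, y in zip(a, b)]


# Toy stand-ins for the "duck" and "rabbit" embeddings (assumed values).
duck = [1.0, 0.0]
rabbit = [0.0, 1.0]
# Five evenly spaced points from pure "duck" to pure "rabbit".
spectrum = [slerp(duck, rabbit, i / 4) for i in range(5)]
```

Each entry of `spectrum` would condition an image generator to produce one image along the ambiguity spectrum; the midpoint is equidistant from both concepts, which is where human and classifier boundaries can then be compared.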

[46] OpenCap Monocular: 3D Human Kinematics and Musculoskeletal Dynamics from a Single Smartphone Video

Selim Gilon,Emily Y. Miller,Scott D. Uhlrich

Main category: cs.CV

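TL;DR: This paper presents OpenCap Monocular, an algorithm that estimates 3D skeletal kinematics and kinetics from a single smartphone video. Validation shows low error across several movements and better performance than a baseline, and the tool is freely available through an app, a web app, and a cloud platform.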

Details Motivation: Traditional biomechanical assessment requires costly, time-intensive laboratory equipment, which limits clinical translation; scalable, accurate, portable assessment tools are needed. Method: Starting from the output of the monocular pose estimation model WHAM, the method refines 3D pose estimates via optimization, maps them to a biomechanically constrained skeletal model to compute kinematics, and estimates kinetics (such as joint moments and ground reaction forces) by combining physics-based simulation with machine learning. Result: For walking, squatting, and sit-to-stand tasks, kinematic error is low (4.8° mean absolute error for rotations, 3.4 cm for pelvis translations), with rotational and translational accuracy improved by 48% and 69%, respectively, over a regression-only baseline; kinetic estimates (such as knee extension moment and knee adduction moment) reach clinically meaningful accuracy; and ground reaction force estimates are comparable to or better than the prior two-camera system. Conclusion: OpenCap Monocular enables accurate, low-cost, easily deployed biomechanical assessment from a single smartphone, supporting broad use in clinical settings such as frailty and knee osteoarthritis. Abstract: Quantifying human movement (kinematics) and musculoskeletal forces (kinetics) at scale, such as estimating quadriceps force during a sit-to-stand movement, could transform prediction, treatment, and monitoring of mobility-related conditions. However, quantifying kinematics and kinetics traditionally requires costly, time-intensive analysis in specialized laboratories, limiting clinical translation. Scalable, accurate tools for biomechanical assessment are needed. We introduce OpenCap Monocular, an algorithm that estimates 3D skeletal kinematics and kinetics from a single smartphone video. The method refines 3D human pose estimates from a monocular pose estimation model (WHAM) via optimization, computes kinematics of a biomechanically constrained skeletal model, and estimates kinetics via physics-based simulation and machine learning. We validated OpenCap Monocular against marker-based motion capture and force plate data for walking, squatting, and sit-to-stand tasks. OpenCap Monocular achieved low kinematic error (4.8° mean absolute error for rotational degrees of freedom; 3.4 cm for pelvis translations), outperforming a regression-only computer vision baseline by 48% in rotational accuracy (p = 0.036) and 69% in translational accuracy (p < 0.001). OpenCap Monocular also estimated ground reaction forces during walking with accuracy comparable to, or better than, our prior two-camera OpenCap system. 
We demonstrate that the algorithm estimates important kinetic outcomes with clinically meaningful accuracy in applications related to frailty and knee osteoarthritis, including estimating knee extension moment during sit-to-stand transitions and knee adduction moment during walking. OpenCap Monocular is deployed via a smartphone app, web app, and secure cloud computing (https://opencap.ai), enabling free, accessible single-smartphone biomechanical assessments.

[47] TIGeR: A Unified Framework for Time, Images and Geo-location Retrieval

David G. Shatwell,Sirnam Swetha,Mubarak Shah

Main category: cs.CV

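TL;DR: This paper proposes TIGeR, a model for geo-time-aware image retrieval that uses a multimodal transformer to map images, geolocation, and time into a unified geo-temporal embedding space. It supports multiple query modes and outperforms existing methods on several tasks.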

Details Motivation: Real-world applications such as digital forensics, urban monitoring, and environmental analysis need to reason jointly about an image's visual appearance, geolocation, and time, and existing methods fall short of the more complex joint geo-temporal retrieval these applications demand. Method: The paper proposes TIGeR, a multimodal-transformer-based model that encodes image, geolocation, and time into a unified geo-temporal embedding space and supports both single-modality and multi-modality inputs; it also builds a Geo-Time Aware benchmark of image-location-time triplets with 4.5M training samples and 86k evaluation samples. Result: TIGeR improves on state-of-the-art methods by up to 16% on time-of-year prediction, 8% on time-of-day prediction, and 14% in geo-time-aware retrieval recall. Conclusion: Modeling geography and time jointly makes scene localization more robust to appearance changes, so that retrieval is based on where and when a scene is rather than on visual similarity alone, supporting the value of unified geo-temporal representations. Abstract: Many real-world applications in digital forensics, urban monitoring, and environmental analysis require jointly reasoning about visual appearance, geolocation, and time. Beyond standard geo-localization and time-of-capture prediction, these applications increasingly demand more complex capabilities, such as retrieving an image captured at the same location as a query image but at a specified target time. We formalize this problem as Geo-Time Aware Image Retrieval and curate a diverse benchmark of 4.5M paired image-location-time triplets for training and 86k high-quality triplets for evaluation. We then propose TIGeR, a multi-modal-transformer-based model that maps image, geolocation, and time into a unified geo-temporal embedding space. TIGeR supports flexible input configurations (single-modality and multi-modality queries) and uses the same representation to perform (i) geo-localization, (ii) time-of-capture prediction, and (iii) geo-time-aware retrieval. By better preserving underlying location identity under large appearance changes, TIGeR enables retrieval based on where and when a scene is, rather than purely on visual similarity. Extensive experiments show that TIGeR consistently outperforms strong baselines and state-of-the-art methods by up to 16% on time-of-year prediction, 8% on time-of-day prediction, and 14% in geo-time aware retrieval recall, highlighting the benefits of unified geo-temporal modeling.

[48] Synthetic Cardiac MRI Image Generation using Deep Generative Models

Ishan Kumarasinghe,Dasuni Kawya,Madhura Edirisooriya,Isuri Devindi,Isuru Nawinne,Vajira Thambawita

Main category: cs.CV

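TL;DR: This paper reviews recent methods for synthetic cardiac MRI (CMRI) generation, assessing them on image fidelity, downstream utility (such as segmentation performance), and privacy protection, and notes that current research lacks an integrated, evaluation-driven framework.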

Details Motivation: To address the scarcity of annotated cardiac MRI data, data variability across scanner vendors, and the privacy leakage risk posed by model memorization. Method: The review covers generative methods including GANs, VAEs, diffusion models, and flow matching; analyzes techniques such as mask-conditioned generation, vendor-style conditioning, and intensity normalization; and examines privacy assessment and protection strategies such as membership inference attacks, nearest-neighbor analyses, and differential privacy. Result: Anatomically constrained synthetic CMRI can improve the accuracy and robustness of downstream segmentation in multi-vendor settings; diffusion and flow-matching models perform better at boundary preservation and deterministic transformation; privacy evaluation is increasingly common but not yet standardized. Conclusion: Existing CMRI generation methods trade off fidelity, utility, and privacy, and a unified, evaluation-driven framework is needed to support reliable clinical use. Abstract: Synthetic cardiac MRI (CMRI) generation has emerged as a promising strategy to overcome the scarcity of annotated medical imaging data. Recent advances in GANs, VAEs, diffusion probabilistic models, and flow-matching techniques aim to generate anatomically accurate images while addressing challenges such as limited labeled datasets, vendor variability, and risks of privacy leakage through model memorization. Mask-conditioned generation improves structural fidelity by guiding synthesis with segmentation maps, while diffusion and flow-matching models offer strong boundary preservation and efficient deterministic transformations. Cross-domain generalization is further supported through vendor-style conditioning and preprocessing steps like intensity normalization. To ensure privacy, studies increasingly incorporate membership inference attacks, nearest-neighbor analyses, and differential privacy mechanisms. Utility evaluations commonly measure downstream segmentation performance, with evidence showing that anatomically constrained synthetic data can enhance accuracy and robustness across multi-vendor settings. This review aims to compare existing CMRI generation approaches through the lenses of fidelity, utility, and privacy, highlighting current limitations and the need for integrated, evaluation-driven frameworks for reliable clinical workflows.

[49] DRoPS: Dynamic 3D Reconstruction of Pre-Scanned Objects

Narek Tumanyan,Samuel Rota Bulò,Denis Rozumny,Lorenzo Porzi,Adam Harley,Tali Dekel,Peter Kontschieder,Jonathon Luiten

Main category: cs.CV

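TL;DR: This paper proposes DRoPS, which uses a static pre-scan of the dynamic object as a geometric and appearance prior. With grid-structured Gaussian primitives and CNN-driven motion parameterization, it improves dynamic scene reconstruction quality and 3D tracking accuracy under extreme novel viewpoints.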

Details Motivation: Existing methods reconstruct poorly under extreme novel viewpoints and highly articulated motion, and they fail to make full use of a static pre-scan. Method: DRoPS (1) organizes Gaussian primitives into pixel grids anchored to the object surface, forming a grid-structured, surface-aligned model, and (2) exploits this grid structure to parameterize motion with a CNN conditioned on the grids, which injects strong implicit regularization and correlates the motion of nearby points. Result: The method significantly outperforms the current state of the art in rendering quality and 3D tracking accuracy. Conclusion: By explicitly using the static pre-scan prior and structured motion modeling, DRoPS constrains the solution space and keeps geometry consistent across the sequence, addressing the key difficulties of extreme viewpoints and complex motion in dynamic scene reconstruction. Abstract: Dynamic scene reconstruction from casual videos has recently seen remarkable progress. Numerous approaches have attempted to overcome the ill-posedness of the task by distilling priors from 2D foundational models and by imposing hand-crafted regularization on the optimized motion. However, these methods struggle to reconstruct scenes from extreme novel viewpoints, especially when highly articulated motions are present. In this paper, we present DRoPS, a novel approach that leverages a static pre-scan of the dynamic object as an explicit geometric and appearance prior. While existing state-of-the-art methods fail to fully exploit the pre-scan, DRoPS leverages our novel setup to effectively constrain the solution space and ensure geometrical consistency throughout the sequence. The core of our novelty is twofold: first, we establish a grid-structured and surface-aligned model by organizing Gaussian primitives into pixel grids anchored to the object surface. Second, by leveraging the grid structure of our primitives, we parameterize motion using a CNN conditioned on those grids, injecting strong implicit regularization and correlating the motion of nearby points. Extensive experiments demonstrate that our method significantly outperforms the current state of the art in rendering quality and 3D tracking accuracy.

[50] AVControl: Efficient Framework for Training Audio-Visual Controls

Matan Ben-Yosef,Tavi Halperin,Naomi Ken Korem,Mohammad Salama,Harel Cain,Asaf Joseph,Anthony Chen,Urska Jelercic,Ofir Bibi

Main category: cs.CV

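TL;DR: This paper proposes AVControl, a framework built on the LTX-2 joint audio-visual foundation model. It trains each control modality (depth, pose, camera trajectory, audio, and others) as an independent LoRA adapter on a parallel canvas, giving lightweight, extendable, efficient training with no changes to the backbone architecture, and it outperforms the evaluated baselines on several VACE Benchmark tasks.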

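Details Motivation: Existing approaches to controlling video and audio generation either train a single monolithic model for a fixed set of modalities or introduce costly architectural changes for each new modality, lacking flexibility and efficiency. Method: AVControl is built on LTX-2 and models each control modality as an independent LoRA adapter; it introduces a "parallel canvas" that injects the reference signal as additional tokens in the attention layers without changing the original model structure, avoiding the failure on structural control that occurs when image-based in-context methods are simply extended to video. Result: On the VACE Benchmark, it outperforms all evaluated baselines on depth- and pose-guided generation, inpainting, and outpainting, and is competitive on camera control and audio-visual benchmarks; it supports a diverse set of independently trained modalities (spatially aligned controls such as depth, pose, and edges, camera trajectory with intrinsics, sparse motion control, video editing, and modular audio-visual controls); each modality needs only a small dataset and converges within a few hundred to a few thousand training steps. Conclusion: AVControl is a compute- and data-efficient, highly extendable lightweight control framework that integrates multimodal control LoRAs in a unified, decoupled, plug-and-play way, offering a new approach to controllable audio-visual generation. Abstract: Controlling video and audio generation requires diverse modalities, from depth and pose to camera trajectories and audio transformations, yet existing approaches either train a single monolithic model for a fixed set of controls or introduce costly architectural changes for each new modality. We introduce AVControl, a lightweight, extendable framework built on LTX-2, a joint audio-visual foundation model, where each control modality is trained as a separate LoRA on a parallel canvas that provides the reference signal as additional tokens in the attention layers, requiring no architectural changes beyond the LoRA adapters themselves. We show that simply extending image-based in-context methods to video fails for structural control, and that our parallel canvas approach resolves this. On the VACE Benchmark, we outperform all evaluated baselines on depth- and pose-guided generation, inpainting, and outpainting, and show competitive results on camera control and audio-visual benchmarks. Our framework supports a diverse set of independently trained modalities: spatially-aligned controls such as depth, pose, and edges, camera trajectory with intrinsics, sparse motion control, video editing, and, to our knowledge, the first modular audio-visual controls for a joint generation model. Our method is both compute- and data-efficient: each modality requires only a small dataset and converges within a few hundred to a few thousand training steps, a fraction of the budget of monolithic alternatives. We publicly release our code and trained LoRA checkpoints.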
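
For readers unfamiliar with LoRA, the adapter mechanism that lets each control modality be trained separately can be sketched as a frozen weight plus a scaled low-rank update. This is the generic LoRA forward pass on one linear layer, written with plain lists; it says nothing about how AVControl builds its parallel canvas, and all matrices below are toy values.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def lora_forward(W: Matrix, A: Matrix, B: Matrix, x: Vector, alpha: float) -> Vector:
    """Generic LoRA forward pass: y = W x + (alpha / r) * B (A x).

    W is the frozen base weight (out x in); A (r x in) and B (out x r)
    are the only trainable parameters, with rank r = number of rows of A.
    """
    rank = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


W = [[1.0, 0.0], [0.0, 1.0]]   # frozen backbone weight (toy identity)
A = [[1.0, 1.0]]               # rank-1 down-projection
B = [[2.0], [0.0]]             # rank-1 up-projection
x = [3.0, 4.0]

adapted = lora_forward(W, A, B, x, alpha=1.0)
# With B set to zero the adapter is a no-op and the backbone is unchanged.
untouched = lora_forward(W, A, [[0.0], [0.0]], x, alpha=1.0)
```

Because the backbone weight `W` is never modified, adapters for different modalities can be trained and stored independently, which is the property the abstract relies on when it describes each control as a separate LoRA.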

[51] Calibri: Enhancing Diffusion Transformers via Parameter-Efficient Calibration

Danil Tokhchukov,Aysel Mirzoeva,Andrey Kuznetsov,Konstantin Sobolev

Main category: cs.CV

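TL;DR: This paper proposes Calibri, which introduces a single learned scaling parameter and calibrates DiT components with an evolutionary algorithm, markedly improving text-to-image generation quality and reducing the number of inference steps while adjusting only about 100 parameters.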

Details Motivation: To uncover the latent capability of Diffusion Transformers (DiTs) in generative tasks and improve the performance of their denoising process. Method: Calibri frames DiT calibration as a black-box reward optimization problem, solves it efficiently with an evolutionary algorithm, and modifies only about 100 parameters. Result: Calibri consistently improves generation quality across multiple text-to-image models and reduces the number of inference steps required while maintaining high output quality. Conclusion: A single learned scaling parameter combined with a lightweight calibration strategy can significantly strengthen the generative capability of DiTs, confirming the effectiveness of parameter-efficient optimization. Abstract: In this paper, we uncover the hidden potential of Diffusion Transformers (DiTs) to significantly enhance generative tasks. Through an in-depth analysis of the denoising process, we demonstrate that introducing a single learned scaling parameter can significantly improve the performance of DiT blocks. Building on this insight, we propose Calibri, a parameter-efficient approach that optimally calibrates DiT components to elevate generative quality. Calibri frames DiT calibration as a black-box reward optimization problem, which is efficiently solved using an evolutionary algorithm and modifies just ~100 parameters. Experimental results reveal that despite its lightweight design, Calibri consistently improves performance across various text-to-image models. Notably, Calibri also reduces the inference steps required for image generation, all while maintaining high-quality outputs.
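To make the idea of black-box reward optimization over roughly 100 scaling parameters concrete, the following is a minimal sketch of an elitist (1+λ) evolution strategy. It is not Calibri's actual algorithm, which the abstract does not specify: the function `evolve_scales`, the toy reward `toy_reward`, the `TARGET` vector, and all hyperparameters are assumptions made for illustration. In a real setting the reward would come from scoring generated images.

```python
import random

def evolve_scales(reward_fn, num_params=100, generations=50,
                  population=16, sigma=0.05, seed=0):
    """Elitist (1+lambda) evolution strategy over a vector of scaling factors.

    reward_fn is treated as a black box: only its scalar output is used.
    """
    rng = random.Random(seed)
    best = [1.0] * num_params            # scale 1.0 = uncalibrated model
    best_reward = reward_fn(best)
    for _ in range(generations):
        # Sample `population` Gaussian perturbations of the current parent.
        candidates = [[s + rng.gauss(0.0, sigma) for s in best]
                      for _ in range(population)]
        scored = [(reward_fn(c), c) for c in candidates]
        top_reward, top_candidate = max(scored, key=lambda rc: rc[0])
        # Elitism: replace the parent only if a child is strictly better.
        if top_reward > best_reward:
            best, best_reward = top_candidate, top_reward
    return best, best_reward

# Hypothetical stand-in for an image-quality reward: negative squared
# distance to an arbitrary "ideal" scale vector.
TARGET = [1.0 + 0.1 * ((i % 5) - 2) for i in range(100)]

def toy_reward(scales):
    return -sum((s - t) ** 2 for s, t in zip(scales, TARGET))

initial_reward = toy_reward([1.0] * 100)
scales, final_reward = evolve_scales(toy_reward)
```

Because the strategy is elitist, the returned reward can never be worse than the starting point, which is the property that makes this style of search attractive when each reward evaluation is expensive.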

[52] Dissecting Model Failures in Abdominal Aortic Aneurysm Segmentation through Explainability-Driven Analysis

Abu Noman Md Sakib,Merjulah Roby,Zijie Zhang,Satish Muluk,Mark K. Eskandari,Ender A. Finol

Main category: cs.CV

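TL;DR: This paper proposes an Explainable AI (XAI) guided encoder shaping framework to improve segmentation of complex abdominal aortic aneurysm (AAA) CT images: an XAI field aligns predictions with encoder focus, and a lightweight refinement pathway plus a confidence prior suppress distractors while preserving subtle structures.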

Details Motivation: Existing AAA CT segmentation models often fail because they attend to irrelevant structures or miss thin, low-contrast targets; where the model looks is a key training signal, so encoder focus should be optimized explicitly. Method: A dense, attribution-based encoder focus map (the "XAI field") is computed and used in two ways: (i) the predicted probability mass is aligned to the XAI field; (ii) the XAI field is routed into a lightweight refinement pathway and a confidence prior that modulates logits at inference. XAI guidance is thus integrated into both representation learning and decoding. Result: Evaluated on clinically validated, failure-prone challenging cases, the method substantially improves segmentation over a base SAM setup. Conclusion: Explicitly using XAI guidance to optimize encoder focus is a practical and effective principle for reliable segmentation in complex scenarios. Abstract: Computed tomography image segmentation of complex abdominal aortic aneurysms (AAA) often fails because the models assign internal focus to irrelevant structures or do not focus on thin, low-contrast targets. Where the model looks is the primary training signal, and thus we propose an Explainable AI (XAI) guided encoder shaping framework. Our method computes a dense, attribution-based encoder focus map ("XAI field") from the final encoder block and uses it in two complementary ways: (i) we align the predicted probability mass to the XAI field to promote agreement between focus and output; and (ii) we route the field into a lightweight refinement pathway and a confidence prior that modulates logits at inference, suppressing distractors while preserving subtle structures. The objective terms serve only as control signals; the contribution is the integration of attribution guidance into representation and decoding. We evaluate on clinically validated challenging cases curated for failure-prone scenarios. Compared to a base SAM setup, our implementation yields substantial improvements. The observed gains suggest that explicitly optimizing encoder focus via XAI guidance is a practical and effective principle for reliable segmentation in complex scenarios.

[53] GoldiCLIP: The Goldilocks Approach for Balancing Explicit Supervision for Language-Image Pretraining

Deen Dayal Mohan,Hossein Souri,Vitali Petsiuk,Juhong Min,Gopal Sharma,Luowei Zhou,Suren Kumar

Main category: cs.CV

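TL;DR: GoldiCLIP is a data-efficient training framework for large-scale vision-language models; trained on only 30 million images (300x less data than leading methods), it achieves state-of-the-art results among data-efficient approaches on several retrieval tasks and is competitive with models trained on billion-scale data.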

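Details Motivation: Existing large-scale vision-language models depend on billions of samples, which is costly and raises the barrier to entry; prior work has tried to improve supervision quality, but each effort addresses only some of the weaknesses of contrastive pretraining. Method: GoldiCLIP is built on a Goldilocks ("just right") principle and combines three innovations: (1) text-conditioned self-distillation that aligns text-agnostic and text-conditioned features; (2) an encoder-decoder structure with an integrated VQA objective that helps the encoder generalize to non-caption-like queries; (3) an uncertainty-based mechanism that automatically weights the heterogeneous losses. Result: The model improves over the best comparable baseline by 2.2 points on MSCOCO retrieval, 2.0 points on fine-grained retrieval, and 5.9 points on question-based retrieval, with performance close to models trained on billion-scale data. Conclusion: By jointly optimizing the quality and quantity of supervision signals, GoldiCLIP shows that a high-quality, multifaceted, adaptive training strategy can greatly reduce dependence on massive data, offering a new paradigm for efficient vision-language model training. Abstract: Until recently, the success of large-scale vision-language models (VLMs) has primarily relied on billion-sample datasets, posing a significant barrier to progress. Recent works have begun to close this gap by improving supervision quality, but each addresses only a subset of the weaknesses in contrastive pretraining. We present GoldiCLIP, a framework built on a Goldilocks principle of finding the right balance of supervision signals. Our multifaceted training framework synergistically combines three key innovations: (1) a text-conditioned self-distillation method to align both text-agnostic and text-conditioned features; (2) an encoder-integrated decoder with a Visual Question Answering (VQA) objective that enables the encoder to generalize beyond caption-like queries; and (3) an uncertainty-based weighting mechanism that automatically balances all heterogeneous losses. Trained on just 30 million images, 300x less data than leading methods, GoldiCLIP achieves state-of-the-art among data-efficient approaches, improving over the best comparable baseline by 2.2 points on MSCOCO retrieval, 2.0 on fine-grained retrieval, and 5.9 on question-based retrieval, while remaining competitive with billion-scale models. Project page: https://petsi.uk/goldiclip.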
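The abstract does not state how the uncertainty-based weighting is computed. As a rough sketch, one formulation widely used in multi-task learning combines losses as Σ exp(-s_i)·L_i + s_i, where each s_i is a learnable log-variance. The function name `uncertainty_weighted_loss` and the example values below are illustrative assumptions, and GoldiCLIP's actual mechanism may differ.

```python
import math

def uncertainty_weighted_loss(losses, log_vars):
    """Combine task losses as sum_i exp(-s_i) * L_i + s_i.

    losses   -- per-task loss values L_i
    log_vars -- per-task log-variances s_i (learnable in a real model)
    A larger s_i down-weights task i, while the additive + s_i term
    penalizes pushing every s_i toward infinity.
    """
    if len(losses) != len(log_vars):
        raise ValueError("losses and log_vars must have the same length")
    return sum(math.exp(-s) * loss + s for loss, s in zip(losses, log_vars))

# Hypothetical loss values for three heterogeneous objectives.
losses = [2.0, 0.5, 1.0]
equal = uncertainty_weighted_loss(losses, [0.0, 0.0, 0.0])
down = uncertainty_weighted_loss(losses, [math.log(2.0), 0.0, 0.0])
```

With all log-variances at zero the result is the plain sum of the losses; raising the log-variance of the noisiest task halves its weight at the cost of a small regularization term.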

[54] Attention-based Pin Site Image Classification in Orthopaedic Patients with External Fixators

Yubo Wang,Marie Fridberg,Anirejuoritse Bafor,Ole Rahbek,Christopher Iobst,Søren Vedding Kold,Ming Shen

Main category: cs.CV

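TL;DR: This paper proposes a deep learning model based on an attention mechanism and a new convolution module (ERRC) that automatically classifies external-fixator pin sites by image appearance (Group A: infection or inflammation; Group B: no complications), outperforming baselines with an AUC of 0.975 and an F1-score of 0.927 using only 5.77M parameters.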

Details Motivation: Pin site infections are common, painful, and increase patient morbidity, so their identification and management need to improve; existing studies mostly focus on open wounds and lack targeted modeling of early infection signs at the metal pin/skin interface. Method: A pin site infection image dataset is built; an attention-based deep learning model highlights key infection regions and suppresses interference from the metal pins; an Efficient Redundant Reconstruction Convolution (ERRC) enriches feature maps while reducing the parameter count. Result: The model reaches an AUC of 0.975 and an F1-score of 0.927 with only 5.77M parameters, outperforms baseline methods, and agrees with visual assessments by healthcare professionals. Conclusion: The DL model can effectively distinguish pin site infection status from visual features alone and has potential as a clinical decision aid, but still needs validation on larger datasets. Abstract: Pin sites represent the interface where a metal pin or wire from the external environment passes through the skin into the internal environment of the limb. These pins or wires connect an external fixator to the bone to stabilize the bone segments in a patient with trauma or deformity. Because these pin sites represent an opportunity for external skin flora to enter the internal environment of the limb, infections of the pin site are common. These pin site infections are painful, annoying, and cause increased morbidity to the patients. Improving the identification and management of pin site infections would greatly enhance the patient experience when external fixators are used. To this end, this paper collects and produces a dataset of pin site wound infections and proposes a deep learning (DL) method to classify pin site images based on their appearance: Group A displayed signs of inflammation or infection, while Group B showed no evident complications. Unlike studies that primarily focus on open wounds, our research includes potential interventions at the metal pin/skin interface. Our attention-based deep learning model addresses this complexity by emphasizing relevant regions and minimizing distractions from the pins. Moreover, we introduce an Efficient Redundant Reconstruction Convolution (ERRC) method to enhance the richness of feature maps while reducing the number of parameters. Our model outperforms baseline methods with an AUC of 0.975 and an F1-score of 0.927, requiring only 5.77 M parameters.
These results highlight the potential of DL in differentiating pin sites only based on visual signs of infection, aligning with healthcare professional assessments, while further validation with more data remains essential.

[55] Generative Adversarial Perturbations with Cross-paradigm Transferability on Localized Crowd Counting

Alabi Mehzabin Anisha,Guangjing Wang,Sriram Chellappan

Main category: cs.CV

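TL;DR: This paper proposes a new cross-paradigm adversarial attack framework that effectively attacks both mainstream families of crowd counting and localization models (density-map-based and point-regression-based), substantially increasing error while remaining visually imperceptible and transferring with high success across multiple state-of-the-art models.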

Details Motivation: Existing adversarial attack research focuses on a single paradigm (such as density maps); cross-paradigm attacks (density map vs. point regression) remain unexplored, even though crowd counting systems in security-critical settings demand high robustness. Method: An adversarial framework optimized with a multi-task loss: scene-density-specific high-confidence logit suppression for point-regression models, peak-targeted density map suppression for density-map models, and model-agnostic perceptual constraints to keep perturbations imperceptible. Result: Mean Absolute Error increases 7x on average; the attack transfers across seven state-of-the-art crowd models with transfer ratios of 0.55 to 1.69; visual quality is preserved, and the balance of effectiveness and imperceptibility is better than that of existing transferable attacks. Conclusion: This work presents the first unified, efficient, and imperceptible cross-paradigm attack on the two mainstream crowd counting paradigms, exposes the vulnerability of current models to a unified threat, and provides a new benchmark and direction for robustness research. Abstract: State-of-the-art crowd counting and localization are primarily modeled using two paradigms: density maps and point regression. Given the field's security ramifications, there is active interest in model robustness against adversarial attacks. Recent studies have demonstrated transferability across density-map-based approaches via adversarial patches, but cross-paradigm attacks (i.e., across both density map-based models and point regression-based models) remain unexplored. We introduce a novel adversarial framework that compromises both density map and point regression architectural paradigms through a comprehensive multi-task loss optimization. For point-regression models, we employ scene-density-specific high-confidence logit suppression; for density-map approaches, we use peak-targeted density map suppression. Both are combined with model-agnostic perceptual constraints to ensure that perturbations are effective and imperceptible to the human eye. Extensive experiments demonstrate the effectiveness of our attack, achieving on average a 7X increase in Mean Absolute Error compared to clean images while maintaining competitive visual quality, and successfully transferring across seven state-of-the-art crowd models with transfer ratios ranging from 0.55 to 1.69. Our approach strikes a balance between attack effectiveness and imperceptibility compared to state-of-the-art transferable attack strategies. The source code is available at https://github.com/simurgh7/CrowdGen

[56] DCARL: A Divide-and-Conquer Framework for Autoregressive Long-Trajectory Video Generation

Junyi Ouyang,Wenbin Teng,Gonglin Chen,Yajie Zhao,Haiwei Chen

Main category: cs.CV

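TL;DR: This paper proposes DCARL, a framework that combines the strengths of divide-and-conquer and autoregressive modeling: a Keyframe Generator and an Interpolation Generator work together to produce high-quality, long (up to 32 seconds) videos with strong camera-trajectory consistency.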

Details Motivation: Existing video diffusion models scale poorly, and autoregressive models suffer from visual drift and poor controllability, making long-trajectory video generation hard to support. Method: DCARL first uses a Keyframe Generator trained without temporal compression to establish global structural anchors, then uses an autoregressive Interpolation Generator over overlapping segments to synthesize dense frames, with keyframes as global context and a single frame as local reference. Result: Trained on a large-scale internet long-trajectory video dataset, DCARL outperforms current state-of-the-art autoregressive and divide-and-conquer methods on FID/FVD (visual quality) and ATE/ARE (camera trajectory accuracy), and supports stable, high-fidelity generation of videos up to 32 seconds long. Conclusion: DCARL successfully merges the structural stability of divide-and-conquer with the high-fidelity generation of VDMs, offering a scalable, controllable, high-quality paradigm for long-trajectory video generation in world modeling. Abstract: Long-trajectory video generation is a crucial yet challenging task for world modeling primarily due to the limited scalability of existing video diffusion models (VDMs). Autoregressive models, while offering infinite rollout, suffer from visual drift and poor controllability. To address these issues, we propose DCARL, a novel divide-and-conquer, autoregressive framework that effectively combines the structural stability of the divide-and-conquer scheme with the high-fidelity generation of VDMs. Our approach first employs a dedicated Keyframe Generator trained without temporal compression to establish long-range, globally consistent structural anchors. Subsequently, an Interpolation Generator synthesizes the dense frames in an autoregressive manner with overlapping segments, utilizing the keyframes for global context and a single clean preceding frame for local coherence. Trained on a large-scale internet long trajectory video dataset, our method achieves superior performance in both visual quality (lower FID and FVD) and camera adherence (lower ATE and ARE) compared to state-of-the-art autoregressive and divide-and-conquer baselines, demonstrating stable and high-fidelity generation for long trajectory videos up to 32 seconds in length.

[57] WAFT-Stereo: Warping-Alone Field Transforms for Stereo Matching

Yihan Wang,Jia Deng

Main category: cs.CV

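TL;DR: WAFT-Stereo is a warping-based stereo matching method that needs no conventional cost volume and delivers leading accuracy with higher efficiency.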

Details Motivation: To challenge the assumption that cost volumes are necessary for stereo matching and to find a simpler, more efficient architecture. Method: WAFT-Stereo is a warping-based method that drops the cost-volume design and models disparity directly. Result: It ranks first on the ETH3D, KITTI, and Middlebury benchmarks, reduces zero-shot error on ETH3D by 81%, and runs 1.8-6.7x faster. Conclusion: Cost volumes are not a prerequisite for high-performing stereo matching; warping-based methods can improve both accuracy and efficiency. Abstract: We introduce WAFT-Stereo, a simple and effective warping-based method for stereo matching. WAFT-Stereo demonstrates that cost volumes, a common design used in many leading methods, are not necessary for strong performance and can be replaced by warping with improved efficiency. WAFT-Stereo ranks first on ETH3D, KITTI and Middlebury public benchmarks, reducing the zero-shot error by 81% on the ETH3D benchmark, while being 1.8-6.7x faster than competitive methods. Code and model weights are available at https://github.com/princeton-vl/WAFT-Stereo.

[58] NeuroVLM-Bench: Evaluation of Vision-Enabled Large Language Models for Clinical Reasoning in Neurological Disorders

Katarina Trojachanec Dineva,Stefan Andonov,Ilinka Ivanoska,Ivan Kitanovski,Sasho Gramatikov,Tamara Kostova,Monika Simjanoska Misheva,Kostadin Mishev

Main category: cs.CV

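TL;DR: This paper benchmarks 20 frontier multimodal large language models on 2D neuroimaging (MRI/CT), evaluating performance, calibration, structured-output validity, and computational efficiency across tasks such as diagnosis, subtype classification, and imaging modality recognition. Technical attribute recognition is nearly solved, whereas diagnostic reasoning (especially subtype prediction) remains challenging; Gemini-2.5-Pro and GPT-5-Chat have the strongest diagnostic performance, Gemini-2.5-Flash is the most efficient, and the open-weight MedGemma-1.5-4B stands out under few-shot prompting.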

Details Motivation: The reliability and operational trade-offs of current multimodal large language models in neuroimaging are unclear, and a systematic benchmark is needed to guide clinical use. Method: A standardized 2D MRI/CT dataset is built covering five categories, including multiple sclerosis, stroke, and brain tumors; models must simultaneously output diagnosis, subtype, modality, sequence, and anatomical plane; evaluation uses four dimensions (discriminative classification with abstention, calibration, structured-output validity, and computational efficiency) within a multi-phase framework that controls for selection bias. Result: Accuracy on technical attributes (such as modality and plane) is near saturation; among diagnostic tasks, tumor classification is the most reliable, stroke is moderate, and multiple sclerosis and rare abnormalities are hardest; few-shot prompting improves some models but adds overhead; Gemini-2.5-Pro and GPT-5-Chat are strongest overall in diagnosis, Gemini-2.5-Flash is the most efficient, and MedGemma-1.5-4B under few-shot prompting approaches the zero-shot performance of proprietary models with perfect structured output. Conclusion: The study maps the performance limits and practical trade-offs of multimodal LLMs in neuroimaging diagnosis, provides empirical grounding for standardized evaluation, and indicates that open-weight models have clinical potential in specific settings. Abstract: Recent advances in multimodal large language models enable new possibilities for image-based decision support. However, their reliability and operational trade-offs in neuroimaging remain insufficiently understood. We present a comprehensive benchmarking study of vision-enabled large language models for 2D neuroimaging using curated MRI and CT datasets covering multiple sclerosis, stroke, brain tumors, other abnormalities, and normal controls. Models are required to generate multiple outputs simultaneously, including diagnosis, diagnosis subtype, imaging modality, specialized sequence, and anatomical plane. Performance is evaluated across four directions: discriminative classification with abstention, calibration, structured-output validity, and computational efficiency. A multi-phase framework ensures fair comparison while controlling for selection bias. Across twenty frontier multimodal models, the results show that technical imaging attributes such as modality and plane are nearly solved, whereas diagnostic reasoning, especially subtype prediction, remains challenging. Tumor classification emerges as the most reliable task, stroke is moderately solvable, while multiple sclerosis and rare abnormalities remain difficult. Few-shot prompting improves performance for several models but increases token usage, latency, and cost. Gemini-2.5-Pro and GPT-5-Chat achieve the strongest overall diagnostic performance, while Gemini-2.5-Flash offers the best efficiency-performance trade-off.
Among open-weight architectures, MedGemma-1.5-4B demonstrates the most promising results, as under few-shot prompting, it approaches the zero-shot performance of several proprietary models, while maintaining perfect structured output. These findings provide practical insights into performance, reliability, and efficiency trade-offs, supporting standardized evaluation of multimodal LLMs in neuroimaging.

[59] CORA: A Pathology Synthesis Driven Foundation Model for Coronary CT Angiography Analysis and MACE Risk Assessment

Jinkui Hao,Gorkem Durak,Halil Ertugrul Aktas,Ulas Bagci,Bradley D. Allen,Nilay S. Shah,Bo Zhou

Main category: cs.CV

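TL;DR: This paper proposes CORA, a pathology-centric, synthesis-driven 3D vision foundation model that uses an anatomy-guided lesion synthesis engine for self-supervised training on 12,801 unlabeled coronary CTA volumes, substantially improving plaque characterization, stenosis detection, and coronary artery segmentation, and strengthening 30-day major adverse cardiac event (MACE) risk stratification when coupled with a large language model.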

Details Motivation: Existing deep learning methods are limited by the scarcity of expert-annotated data; mainstream label-free pretraining (such as masked image modeling) is biased toward global anatomical statistics and struggles to capture localized pathological features such as coronary plaques. Method: CORA uses a pathology-centric, synthesis-driven self-supervised framework: an anatomy-guided lesion synthesis engine generates simulated vascular abnormalities so that representation learning focuses on clinically relevant lesions rather than background anatomy; the model is trained on 12,801 unlabeled CCTA volumes and is further coupled with a large language model to form a multimodal framework for MACE risk prediction. Result: On multi-center datasets from nine independent hospitals, CORA consistently outperforms state-of-the-art 3D vision foundation models on plaque characterization, stenosis detection, and coronary artery segmentation, with gains of up to 29%; coupling with a large language model significantly improves 30-day MACE risk stratification. Conclusion: CORA is a scalable and extensible 3D cardiovascular vision foundation model that offers a new paradigm for unifying anatomical assessment with cardiovascular risk prediction and moves label-free medical image modeling closer to clinical use. Abstract: Coronary artery disease, the leading cause of cardiovascular mortality worldwide, can be assessed non-invasively by coronary computed tomography angiography (CCTA). Despite progress in automated CCTA analysis using deep learning, clinical translation is constrained by the scarcity of expert-annotated datasets. Furthermore, widely adopted label-free pretraining strategies, such as masked image modeling, are intrinsically biased toward global anatomical statistics, frequently failing to capture the spatially localized pathological features of coronary plaques. Here, we introduce CORA, a 3D vision foundation model for comprehensive cardiovascular risk assessment. CORA learns directly from volumetric CCTA via a pathology-centric, synthesis-driven self-supervised framework. By utilizing an anatomy-guided lesion synthesis engine, the model is explicitly trained to detect simulated vascular abnormalities, biasing representation learning toward clinically relevant disease features rather than dominant background anatomy. We trained CORA on a large-scale cohort of 12,801 unlabeled CCTA volumes and comprehensively evaluated the model across multi-center datasets from nine independent hospitals. Across diagnostic and anatomical tasks, including plaque characterization, stenosis detection, and coronary artery segmentation, CORA consistently outperformed the state-of-the-art 3D vision foundation models, achieving up to a 29% performance gain.
Crucially, by coupling the imaging encoder with a large language model, we extended CORA into a multimodal framework that significantly improved 30-day major adverse cardiac event (MACE) risk stratification. Our results establish CORA as a scalable and extensible foundation for unified anatomical assessment and cardiovascular risk prediction.

[60] Towards automatic smoke detector inspection: Recognition of the smoke detectors in industrial facilities and preparation for future drone integration

Lukas Kratochvila,Jakub Stefansky,Simon Bilik,Robert Rous,Tomas Zemcik,Michal Wolny,Frantisek Rusnak,Ondrej Cech,Karel Horak

Main category: cs.CV

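TL;DR: This paper presents a smoke detector recognition method for drone-based automatic inspection systems, comparing YOLOv11, SSD, and RT-DETRv2 along with several data training strategies on two challenging test sets; YOLOv11n performs best with mAP@0.5 = 0.884, and the code, models, and dataset are released.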

Details Motivation: Smoke detectors are often mounted high up or in hard-to-reach places, so manual inspection is dangerous, slow, and costly; an automatic recognition system that can be integrated into drones is needed. Method: Three object detectors are compared: YOLOv11, SSD, and RT-DETRv2 (with backbones of different sizes); several training strategies combining real and semi-synthetic data with data augmentation are explored; performance is evaluated on two test sets with challenging conditions such as motion blur, small objects, and occlusion. Result: YOLOv11n achieves the best mAP@0.5 of 0.884; all models were evaluated for robustness on both real-world test sets; code, pretrained models, and dataset are public. Conclusion: YOLOv11n balances accuracy and efficiency on resource-constrained embedded drone platforms and is the preferred choice for automatic smoke detector recognition; semi-synthetic data and targeted augmentation have practical value for robust detection of small and blurred objects. Abstract: Fire safety consists of a complex pipeline, and it is a very important topic of concern. One of its front-line components is the smoke detector, which is supposed to raise an alarm before a massive fire develops. As smoke detectors are often difficult to reach due to high ceilings or problematic locations, an automatic inspection system would be very beneficial, as it could allow faster inspections, spare workers from dangerous work at heights, and make the whole process cheaper. In this study, we present the smoke detector recognition part of the automatic inspection system, which could easily be integrated into a drone system. As part of our research, we compare two popular convolution-based object detectors, YOLOv11 and SSD, widely used on embedded devices, together with the state-of-the-art transformer-based RT-DETRv2 with backbones of different sizes. Because collecting a sufficient amount of training data in a real-world environment is complicated, we also compare several training strategies using real and semi-synthetic data together with various augmentation methods. For robust testing, all models were evaluated on two test datasets with expected and difficult appearances of the smoke detectors, including motion blur, small resolution, and incomplete objects. The best-performing detector is YOLOv11n, which reaches an average mAP@0.5 score of 0.884. Our code, pretrained models and dataset are publicly available.
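The "@0.5" in mAP@0.5 refers to the intersection-over-union (IoU) threshold at which a predicted box counts as a correct detection. The sketch below shows only that matching criterion, not the full mAP computation; the helper names `iou` and `is_true_positive` and the `(x_min, y_min, x_max, y_max)` box layout are assumptions for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Clamp at zero so disjoint boxes produce an empty intersection.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(pred_box, gt_box, threshold=0.5):
    """A prediction matches a ground-truth box when IoU >= threshold."""
    return iou(pred_box, gt_box) >= threshold

# A prediction shifted by half its width overlaps too little to count.
shifted = is_true_positive((0, 0, 2, 2), (1, 0, 3, 2))
# A prediction slightly taller than the ground truth still counts.
close = is_true_positive((0, 0, 4, 4), (0, 0, 4, 3))
```

Small or partially visible detectors, as in the difficult test set, tend to suffer most from this criterion because a few pixels of localization error change the IoU substantially.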

[61] OptiSAR-Net++: A Large-Scale Benchmark and Transformer-Free Framework for Cross-Domain Remote Sensing Visual Grounding

Xiaoyu Tang,Jun Dong,Jintao Cheng,Rui Fan

Main category: cs.CV

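TL;DR: This paper introduces the new task of cross-domain remote sensing visual grounding (CD-RSVG), builds OptSAR-RSVG, the first large-scale optical/SAR benchmark dataset, and proposes the efficient OptiSAR-Net++ framework, which reaches state-of-the-art accuracy and efficiency through PL-MoE, a contrastive learning paradigm, dynamic adversarial negative sampling, and a text-guided dual-gate fusion module.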

Details Motivation: Existing remote sensing visual grounding methods are limited to a single sensor (optical or SAR) and cannot meet real-world needs for multi-source data, so grounding methods that work across domains (optical + SAR) are needed. Method: OptiSAR-Net++ comprises: 1) a patch-level Low-Rank Adaptation Mixture of Experts (PL-MoE) for cross-domain feature decoupling; 2) a CLIP-based contrastive paradigm with dynamic adversarial negative sampling, replacing computationally expensive Transformer decoding; 3) a text-guided dual-gate fusion module (TGDF-SSA) and a region-aware auxiliary head to improve semantic-visual alignment and spatial modeling. Result: The method reaches state-of-the-art performance on both the new OptSAR-RSVG benchmark and the existing DIOR-RSVG benchmark, with clear gains in localization accuracy and inference efficiency. Conclusion: CD-RSVG is a practically valuable new task, and OptiSAR-Net++ offers an efficient, robust solution for joint understanding of multi-source remote sensing imagery and natural language; code and dataset will be released. Abstract: Remote sensing visual grounding (RSVG) aims to localize specific targets in remote sensing images using natural language expressions. However, existing methods are restricted to single-sensor domains, i.e., either optical or synthetic aperture radar (SAR), limiting their real-world applicability. In this paper, we introduce the Cross-Domain RSVG (CD-RSVG) task and construct OptSAR-RSVG, the first large-scale benchmark dataset for this setting. To tackle the challenges of cross-domain feature modeling, computational inefficiency, and fine-grained semantic discrimination, we propose OptiSAR-Net++. Our framework features a patch-level Low-Rank Adaptation Mixture of Experts (PL-MoE) for efficient cross-domain feature decoupling. To mitigate the substantial computational overhead of Transformer decoding frameworks, we adopt a CLIP-based contrastive paradigm and further incorporate dynamic adversarial negative sampling, thereby transforming generative regression into an efficient cross-modal matching process. Additionally, a text-guided dual-gate fusion module (TGDF-SSA) and a region-aware auxiliary head are introduced to enhance semantic-visual alignment and spatial modeling. Extensive experiments demonstrate that OptiSAR-Net++ achieves SOTA performance on both OptSAR-RSVG and DIOR-RSVG benchmarks, offering significant advantages in localization accuracy and efficiency. Our code and dataset will be made publicly available.

[62] SurgPhase: Time efficient pituitary tumor surgery phase recognition via an interactive web platform

Yan Meng,Jack Cook,X. Y. Han,Kaan Duman,Shauna Otto,Dhiraj Pangal,Jonathan Chainey,Ruth Lau,Margaux Masson-Forsythe,Daniel A. Donoho,Danielle Levy,Gabriel Zada,Sébastien Froelich,Juan Fernandez-Miranda,Mike Chang

Main category: cs.CV

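TL;DR: This paper presents a comprehensive framework for surgical phase recognition in pituitary tumor surgery videos that combines self-supervised representation learning, robust temporal modeling, and scalable annotation strategies, reaching 90% accuracy on the test set and using a surgeon-facing collaborative platform for continued data growth and iterative model improvement.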

Details Motivation: Accurate surgical phase recognition is essential for analyzing procedural workflows, supporting intraoperative decisions, and enabling data-driven improvements in surgical education and performance evaluation, but it faces scarce labeled data, class imbalance, and high procedural variability. Method: A ResNet-50 is pretrained with self-supervision on 251 unlabeled PTS videos to extract features, then fine-tuned on 81 labeled procedures using focal loss, gradual layer unfreezing, and dynamic sampling; a collaborative online platform lets surgeons upload videos, receive automated phase analysis, and contribute data. Result: The method achieves 90% accuracy on a held-out test set, outperforms current state-of-the-art approaches, and generalizes well across different surgical cases. Conclusion: The framework mitigates the shortage of labeled data and improves both phase recognition performance and clinical utility, while the platform design offers a workable model for continual learning and cross-center collaboration. Abstract: Accurate surgical phase recognition is essential for analyzing procedural workflows, supporting intraoperative decision-making, and enabling data-driven improvements in surgical education and performance evaluation. In this work, we present a comprehensive framework for phase recognition in pituitary tumor surgery (PTS) videos, combining self-supervised representation learning, robust temporal modeling, and scalable data annotation strategies. Our method achieves 90% accuracy on a held-out test set, outperforming current state-of-the-art approaches and demonstrating strong generalization across variable surgical cases. A central contribution of this work is the integration of a collaborative online platform designed for surgeons to upload surgical videos, receive automated phase analysis, and contribute to a growing dataset. This platform not only facilitates large-scale data collection but also fosters knowledge sharing and continuous model improvement. To address the challenge of limited labeled data, we pretrain a ResNet-50 model using the self-supervised framework on 251 unlabeled PTS videos, enabling the extraction of high-quality feature representations. Fine-tuning is performed on a labeled dataset of 81 procedures using a modified training regime that incorporates focal loss, gradual layer unfreezing, and dynamic sampling to address class imbalance and procedural variability.
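Focal loss, mentioned above as part of the fine-tuning regime, addresses class imbalance by scaling cross-entropy with a factor (1 - p_t)^γ so that well-classified examples contribute little. The sketch below is the standard single-example form; the `gamma` and `alpha` values and the example probability vectors are illustrative assumptions, not the settings used in the paper.

```python
import math

def focal_loss(probs, target, gamma=2.0, alpha=1.0):
    """Focal loss for one example: -alpha * (1 - p_t)^gamma * log(p_t).

    probs  -- predicted class probabilities (summing to 1)
    target -- index of the true class
    With gamma = 0 and alpha = 1 this reduces to plain cross-entropy.
    """
    p_t = probs[target]
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)

# A confidently correct frame (e.g. from a frequent phase) ...
easy = focal_loss([0.9, 0.05, 0.05], target=0)
# ... versus a poorly predicted frame (e.g. from a rare phase).
hard = focal_loss([0.2, 0.5, 0.3], target=0)
# Cross-entropy on the easy frame, for comparison.
ce_easy = focal_loss([0.9, 0.05, 0.05], target=0, gamma=0.0)
```

The easy example's loss shrinks by a factor of about 100 relative to cross-entropy, so gradient updates are dominated by the hard, typically under-represented phases.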

[63] Self-Supervised Learning for Knee Osteoarthritis: Diagnostic Limitations and Prognostic Value of Uncurated Hospital Data

Haresh Rengaraj Rajamohan,Yuxuan Chen,Kyunghyun Cho,Cem M. Deniz

Main category: cs.CV

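TL;DR: This study evaluates self-supervised learning (SSL) for diagnostic and prognostic modeling of knee osteoarthritis (OA) and finds that multimodal image-text SSL, although it fails to improve diagnosis because of severe data bias (93% KL grade 3), significantly improves prediction of 4-year structural progression, outperforming ImageNet pretraining especially when labeled data is scarce.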

Details Motivation: To determine whether self-supervised learning (in particular, using real-world unlabeled medical image-report data) can yield better representations than ImageNet pretraining for knee OA diagnosis and prognosis. Method: Two kinds of SSL pretraining are compared: (i) image-only SSL on knee radiographs from the OAI, MOST, and NYU cohorts; (ii) multimodal image-text SSL on uncurated hospital knee radiographs paired with radiologist reports; linear probing and full fine-tuning are evaluated on KL grade diagnosis and 4-year structural progression prediction. Result: Image-only SSL is slightly more accurate for diagnosis under linear probing but no better than ImageNet under full fine-tuning; multimodal SSL does not improve diagnosis because its pretraining data is heavily skewed toward severe OA (93% KL 3); on prognosis, however, it wins clearly (for example, MOST external validation AUROC of 0.701 vs. 0.599 for ImageNet with 10% labeled data). Conclusion: How well the pretraining data distribution matches the downstream task determines whether SSL helps: heavily biased clinical data is a poor fit for a balanced diagnostic task but can effectively support prognostic modeling whose distribution it matches. Abstract: This study assesses whether self-supervised learning (SSL) improves knee osteoarthritis (OA) modeling for diagnosis and prognosis relative to ImageNet-pretrained initialization. We compared (i) image-only SSL pretrained on knee radiographs from the OAI, MOST, and NYU cohorts, and (ii) multimodal image-text SSL pretrained on uncurated hospital knee radiographs paired with radiologist impressions. For diagnostic Kellgren-Lawrence (KL) grade prediction, SSL offered mixed results. While image-only SSL improved accuracy during linear probing (frozen encoder), it did not outperform ImageNet pretraining during full fine-tuning. Similarly, multimodal SSL failed to improve grading performance. We attribute this to severe bias in the uncurated hospital pretraining corpus (93% estimated KL grade 3), which limited alignment with the balanced diagnostic task. In contrast, this same multimodal initialization significantly improved prognostic modeling. It outperformed ImageNet baselines in predicting 4-year structural incidence and progression, including on external validation (MOST AUROC: 0.701 vs. 0.599 at 10% labeled data). Overall, while uncurated hospital image-text data may be ineffective for learning diagnosis due to severity bias, it provides a strong signal for prognostic modeling when the downstream task aligns with the pretraining data distribution.

[64] ICTPolarReal: A Polarized Reflection and Material Dataset of Real World Objects

Jing Yang,Krithika Dharanikota,Emily Jia,Haiwei Chen,Yajie Zhao

Main category: cs.CV

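TL;DR: This paper presents a large-scale real-world polarized reflection and material dataset with 1.2 million high-resolution images of 218 everyday objects across multiple acquisition dimensions; inverse and forward rendering models trained on it improve markedly on material decomposition, relighting, and sparse-view 3D reconstruction.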

Details Motivation: Modeling real material reflectance is hard mainly because measured reflectance data is scarce; existing methods rely on synthetic data with simplified illumination and limited realism, so models generalize poorly to real images. Method: A large-scale real-world polarized reflection and material dataset is captured with an 8-camera, 346-light Light Stage equipped with cross/parallel polarization, covering five dimensions (multiview, multi-illumination, polarization, reflectance separation, and material attributes) and providing diffuse/specular separation, analytically derived diffuse albedo, specular albedo, and surface normals. Result: State-of-the-art inverse and forward rendering models trained and evaluated on the dataset show significantly better material separation accuracy, illumination fidelity, and geometric consistency on intrinsic image decomposition, relighting, and sparse-view 3D reconstruction. Conclusion: The work sets a new benchmark for physically grounded material understanding and helps inverse rendering models move from synthetic training to real-world generalization. Abstract: Accurately modeling how real-world materials reflect light remains a core challenge in inverse rendering, largely due to the scarcity of real measured reflectance data. Existing approaches rely heavily on synthetic datasets with simplified illumination and limited material realism, preventing models from generalizing to real-world images. We introduce a large-scale polarized reflection and material dataset of real-world objects, captured with an 8-camera, 346-light Light Stage equipped with cross/parallel polarization. Our dataset spans 218 everyday objects across five acquisition dimensions (multiview, multi-illumination, polarization, reflectance separation, and material attributes), yielding over 1.2M high-resolution images with diffuse-specular separation and analytically derived diffuse albedo, specular albedo, and surface normals. Using this dataset, we train and evaluate state-of-the-art inverse and forward rendering models on intrinsic decomposition, relighting, and sparse-view 3D reconstruction, demonstrating significant improvements in material separation, illumination fidelity, and geometric consistency. We hope that our work can establish a new foundation for physically grounded material understanding and enable real-world generalization beyond synthetic training regimes. Project page: https://jingyangcarl.github.io/ICTPolarReal/

[65] TIGFlow-GRPO: Trajectory Forecasting via Interaction-Aware Flow Matching and Reward-Driven Optimization

Xuepeng Jing,Wenhuan Lu,Hao Meng,Zhizhi Yu,Jianguo Wei

Main category: cs.CV

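TL;DR: This paper proposes TIGFlow-GRPO, a two-stage generative framework that aligns flow-based trajectory generation with behavioral rules, improving the social compliance and physical feasibility of human trajectory forecasts.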

Details Motivation: Existing methods based on Conditional Flow Matching (CFM) rely mainly on supervised fitting and do not adequately reflect social norms and scene constraints. Method: The first stage builds a CFM predictor with a Trajectory-Interaction-Graph (TIG) module that models fine-grained spatio-temporal interactions; the second stage applies Flow-GRPO post-training, reformulating deterministic flow rollout as stochastic SDE sampling and using a composite reward that combines view-aware social compliance with map-aware physical feasibility. Result: On the ETH/UCY and SDD datasets, TIGFlow-GRPO improves forecasting accuracy and long-horizon stability and generates trajectories that are more socially compliant and physically feasible. Conclusion: The framework effectively connects flow-based trajectory modeling with behavior-aware alignment and suits intelligent systems in dynamic multimedia environments. Abstract: Human trajectory forecasting is important for intelligent multimedia systems operating in visually complex environments, such as autonomous driving and crowd surveillance. Although Conditional Flow Matching (CFM) has shown strong ability in modeling trajectory distributions from spatio-temporal observations, existing approaches still focus primarily on supervised fitting, which may leave social norms and scene constraints insufficiently reflected in generated trajectories. To address this issue, we propose TIGFlow-GRPO, a two-stage generative framework that aligns flow-based trajectory generation with behavioral rules. In the first stage, we build a CFM-based predictor with a Trajectory-Interaction-Graph (TIG) module to model fine-grained visual-spatial interactions and strengthen context encoding. This stage captures both agent-agent and agent-scene relations more effectively, providing more informative conditional features for subsequent alignment. In the second stage, we perform Flow-GRPO post-training, where deterministic flow rollout is reformulated as stochastic ODE-to-SDE sampling to enable trajectory exploration, and a composite reward combines view-aware social compliance with map-aware physical feasibility. By evaluating trajectories explored through SDE rollout, GRPO progressively steers multimodal predictions toward behaviorally plausible futures. Experiments on the ETH/UCY and SDD datasets show that TIGFlow-GRPO improves forecasting accuracy and long-horizon stability while generating trajectories that are more socially compliant and physically feasible.
These results suggest that the proposed framework provides an effective way to connect flow-based trajectory modeling with behavior-aware alignment in dynamic multimedia environments.
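The GRPO step described above scores a group of sampled trajectories and pushes the model toward those that beat the group average. A minimal sketch of that group-relative advantage, paired with a weighted composite reward, is shown below. The equal weights, the reward values, and the function names `composite_reward` and `group_relative_advantages` are hypothetical; the paper's actual reward terms and weighting are not given in the abstract.

```python
import statistics

def composite_reward(social, physical, w_social=0.5, w_physical=0.5):
    """Weighted sum of a social-compliance score and a physical-feasibility
    score (weights are illustrative assumptions)."""
    return w_social * social + w_physical * physical

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four hypothetical trajectories sampled for the same scene, each scored as
# (social compliance, physical feasibility) in [0, 1].
scores = [(0.9, 0.8), (0.4, 0.9), (0.7, 0.3), (0.2, 0.2)]
rewards = [composite_reward(s, p) for s, p in scores]
advantages = group_relative_advantages(rewards)
```

Trajectories with above-average reward get positive advantages and are reinforced; if every sample in the group scores the same, all advantages are zero and the group contributes no learning signal, which is why stochastic SDE sampling is needed to create diversity within a group.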

[66] Infinite Gaze Generation for Videos with Autoregressive Diffusion

Jenna Kang,Colin Groth,Tong Wu,Finley Torrens,Patsorn Sangkloy,Gordon Wetzstein,Qi Sun

Main category: cs.CV

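TL;DR: This paper proposes a generative framework based on an autoregressive diffusion model for raw gaze prediction in videos of unbounded length, producing gaze trajectories with continuous spatial coordinates and high-resolution timestamps and significantly outperforming existing methods in long-range spatio-temporal accuracy and trajectory realism.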

Details Motivation: Traditional saliency maps and scanpaths cannot capture the fine-grained temporal dynamics of raw gaze data, and existing models are limited to short windows (about 3-5 seconds), so they cannot model the long-range behavioral dependencies found in real-world content. Method: An autoregressive diffusion model conditioned on a saliency-aware visual latent space generates raw gaze trajectories for videos of arbitrary length, outputting continuous spatial coordinates and high-resolution timestamps. Result: Quantitative and qualitative evaluations show that the method significantly outperforms existing approaches in long-range spatio-temporal accuracy and trajectory realism. Conclusion: The generative framework achieves high-fidelity, long-range-consistent raw gaze prediction for videos of unbounded length, advancing video scene understanding and multimodal interaction. Abstract: Predicting human gaze in video is fundamental to advancing scene understanding and multimodal interaction. While traditional saliency maps provide spatial probability distributions and scanpaths offer ordered fixations, both abstractions often collapse the fine-grained temporal dynamics of raw gaze. Furthermore, existing models are typically constrained to short-term windows (approximately 3-5 s), failing to capture the long-range behavioral dependencies inherent in real-world content. We present a generative framework for infinite-horizon raw gaze prediction in videos of arbitrary length. By leveraging an autoregressive diffusion model, we synthesize gaze trajectories characterized by continuous spatial coordinates and high-resolution timestamps. Our model is conditioned on a saliency-aware visual latent space. Quantitative and qualitative evaluations demonstrate that our approach significantly outperforms existing approaches in long-range spatio-temporal accuracy and trajectory realism.

[67] Beyond Attention Magnitude: Leveraging Inter-layer Rank Consistency for Efficient Vision-Language-Action Models

Peiju Liu,Jinming Liu,Xipeng Qiu,Xuanjing Huang

Main category: cs.CV

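TL;DR: This paper proposes TIES, a framework that selects visual tokens by dynamically balancing attention magnitude against inter-layer ranking consistency, cutting the token load behind VLA inference latency while improving task success rates.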

Details Motivation: Existing VLA models incur high inference latency because they process dense visual tokens, and mainstream token reduction methods select tokens statically by attention magnitude; this assumption ignores task dependence and can hurt policy performance. Method: TIES (Tau-guided Inter-layer Efficient Selection) is a dynamic token selection framework guided by inter-layer token ranking consistency, and it requires no additional training. Result: On the CogACT + SIMPLER benchmark, average success rate improves by 6% while token usage drops by 78%, with strong generalization across diverse decoders and benchmarks. Conclusion: TIES shows that dynamic, consistency-guided token selection outperforms static attention-driven selection, offering a new direction for efficient VLA models. Abstract: Vision-Language-Action (VLA) models excel in robotic manipulation but suffer from significant inference latency due to processing dense visual tokens. Existing token reduction methods predominantly rely on attention magnitude as a static selection criterion. In this work, we challenge this assumption, revealing that high-attention tokens are task-dependent and can even degrade policy performance. To address this, we introduce TIES (Tau-guided Inter-layer Efficient Selection), a dynamic framework guided by inter-layer token ranking consistency. By adaptively balancing attention magnitude with ranking consistency, TIES ensures robust token selection without requiring additional training. On the CogACT + SIMPLER benchmark, TIES improves average success rates by 6% while reducing token usage by 78%, and demonstrates strong generalization across diverse decoders and benchmarks.
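The idea of weighing attention magnitude against inter-layer ranking consistency can be illustrated with a small sketch. It assumes that "Tau" refers to a rank-correlation statistic such as Kendall's tau, which the abstract does not confirm. The blending rule in `select_tokens` is a hypothetical stand-in, not the TIES selection rule.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score lists over the same tokens."""
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must have the same length")
    n = len(scores_a)
    if n < 2:
        raise ValueError("need at least two tokens")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        product = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
        # tied pairs count as neither concordant nor discordant
    return (concordant - discordant) / (n * (n - 1) / 2)


def select_tokens(layer_a, layer_b, keep):
    """Hypothetical rule: trust layer_a's magnitude in proportion to tau."""
    tau = kendall_tau(layer_a, layer_b)
    weight = max(tau, 0.0)  # negative agreement is treated as no agreement
    # When the two layers rank tokens alike, use layer_a's attention;
    # otherwise fall back towards the mean attention of both layers.
    blended = [weight * a + (1.0 - weight) * 0.5 * (a + b)
               for a, b in zip(layer_a, layer_b)]
    ranked = sorted(range(len(blended)), key=lambda i: blended[i], reverse=True)
    return sorted(ranked[:keep])  # indices of the kept tokens, in token order
```

Calling `select_tokens(attn_layer_3, attn_layer_4, keep=56)` on two per-token attention lists would return the indices of the retained tokens; the layer choice and the value of `keep` are illustrative.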

[68] BiFM: Bidirectional Flow Matching for Few-Step Image Editing and Generation

Yasong Dai,Zeeshan Hayder,David Ahmedt-Aristizabal,Hongdong Li

Main category: cs.CV

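TL;DR: This paper proposes BiFM (Bidirectional Flow Matching), a framework that models generation and inversion jointly. By estimating average velocity fields in both directions and training with continuous time-interval supervision and a bidirectional consistency objective, it achieves high-quality few-step image generation and semantic-preserving editing.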

Details Motivation: Few-step sampling approximates the forward process poorly, which degrades editing quality; existing few-step inversion methods also depend on pretrained generators and auxiliary modules, limiting generalization and scalability. Method: BiFM learns generation and inversion jointly, estimating average velocity fields in the image-to-noise and noise-to-image directions under the constraint of a shared instantaneous velocity field; it is trained with continuous time-interval supervision, a bidirectional consistency loss, and a lightweight time-interval embedding. Result: BiFM outperforms existing few-step methods across diverse image generation and editing tasks, supports one-step inversion, and integrates into popular diffusion and flow matching backbones. Conclusion: BiFM offers a general, efficient, and scalable unified framework for few-step generation and inversion, improving editing quality and model generalization. Abstract: Recent diffusion and flow matching models have demonstrated strong capabilities in image generation and editing by progressively removing noise through iterative sampling. While this enables flexible inversion for semantic-preserving edits, few-step sampling regimes suffer from poor forward process approximation, leading to degraded editing quality. Existing few-step inversion methods often rely on pretrained generators and auxiliary modules, limiting scalability and generalization across different architectures. To address these limitations, we propose BiFM (Bidirectional Flow Matching), a unified framework that jointly learns generation and inversion within a single model. BiFM directly estimates average velocity fields in both "image $\to$ noise" and "noise $\to$ image" directions, constrained by a shared instantaneous velocity field derived from either predefined schedules or pretrained multi-step diffusion models. Additionally, BiFM introduces a novel training strategy using continuous time-interval supervision, stabilized by a bidirectional consistency objective and a lightweight time-interval embedding. This bidirectional formulation also enables one-step inversion and can integrate seamlessly into popular diffusion and flow matching backbones. Across diverse image editing and generation tasks, BiFM consistently outperforms existing few-step approaches, achieving superior performance and editability.

[69] Select, Hypothesize and Verify: Towards Verified Neuron Concept Interpretation

ZeBin Ji,Yang Hu,Xiuli Bi,Bo Liu,Bin Xiao

Main category: cs.CV

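TL;DR: This paper proposes a Select-Hypothesize-Verify framework that selects representative samples through activation-distribution analysis, generates concept hypotheses, and verifies how strongly each concept activates its neuron. This explains neuron functionality more accurately and avoids misjudgments caused by redundant or misleading concepts.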

Details Motivation: Existing neuron concept interpretation methods assume that every neuron has a well-defined function and provides discriminative features, but some neurons are redundant or misleading, which can lead to misinterpretation of the network's decision mechanism. Method: The Select-Hypothesize-Verify framework 1) selects the samples that best capture a neuron's function through activation-distribution analysis; 2) generates concept hypotheses for the selected neurons; and 3) verifies whether the generated concepts highly activate the corresponding neurons. Result: Experiments show that the generated concepts are more accurate, activating the corresponding neurons with a probability about 1.5 times that of the current state-of-the-art method. Conclusion: Adding a verification step for neuron functions and building the Select-Hypothesize-Verify framework around it improves the accuracy and reliability of neuron concept interpretation. Abstract: It is essential for understanding neural network decisions to interpret the functionality (also known as concepts) of neurons. Existing approaches describe neuron concepts by generating natural language descriptions, thereby advancing the understanding of the neural network's decision-making mechanism. However, these approaches assume that each neuron has well-defined functions and provides discriminative features for neural network decision-making. In fact, some neurons may be redundant or may offer misleading concepts. Thus, the descriptions for such neurons may cause misinterpretations of the factors driving the neural network's decisions. To address the issue, we introduce a verification of neuron functions, which checks whether the generated concept highly activates the corresponding neuron. Furthermore, we propose a Select-Hypothesize-Verify framework for interpreting neuron functionality. This framework consists of: 1) selecting activation samples that best capture a neuron's well-defined functional behavior through activation-distribution analysis; 2) forming hypotheses about concepts for the selected neurons; and 3) verifying whether the generated concepts accurately reflect the functionality of the neuron. Extensive experiments show that our method produces more accurate neuron concepts. Our generated concepts activate the corresponding neurons with a probability approximately 1.5 times that of the current state-of-the-art method.

[70] Self-Corrected Image Generation with Explainable Latent Rewards

Yinyi Luo,Hrishikesh Gokhale,Marios Savvides,Jindong Wang,Shengfeng He

Main category: cs.CV

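TL;DR: This paper proposes xLARD, a self-correcting framework in which multimodal large language models guide text-to-image generation through Explainable Latent Rewards, improving alignment on fine-grained semantics and spatial relations.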

Details Motivation: In text-to-image generation, a feed-forward generator cannot reliably anticipate how well its output will align with a complex prompt, whereas evaluating a generated image is comparatively easy; this asymmetry motivates a self-correction mechanism. Method: xLARD includes a lightweight corrector that refines latent representations using structured feedback from model-generated references, and a differentiable mapping from latent edits to interpretable reward signals, so that non-differentiable image-level evaluation can guide the latent space continuously. Result: Across diverse generation and editing tasks, xLARD improves semantic alignment and visual fidelity while preserving the generative prior. Conclusion: xLARD demonstrates that multimodal large models can evaluate and correct generation in latent space, offering a new paradigm for controllable image generation. Abstract: Despite significant progress in text-to-image generation, aligning outputs with complex prompts remains challenging, particularly for fine-grained semantics and spatial relations. This difficulty stems from the feed-forward nature of generation, which requires anticipating alignment without fully understanding the output. In contrast, evaluating generated images is more tractable. Motivated by this asymmetry, we propose xLARD, a self-correcting framework that uses multimodal large language models to guide generation through Explainable LAtent RewarDs. xLARD introduces a lightweight corrector that refines latent representations based on structured feedback from model-generated references. A key component is a differentiable mapping from latent edits to interpretable reward signals, enabling continuous latent-level guidance from non-differentiable image-level evaluations. This mechanism allows the model to understand, assess, and correct itself during generation. Experiments across diverse generation and editing tasks show that xLARD improves semantic alignment and visual fidelity while maintaining generative priors. Code is available at https://yinyiluo.github.io/xLARD/.

[71] PASDiff: Physics-Aware Semantic Guidance for Joint Real-world Low-Light Face Enhancement and Restoration

Yilin Ni,Wenjie Li,Zhengxue Wang,Juncheng Li,Guangwei Gao,Jian Yang

Main category: cs.CV

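TL;DR: This paper proposes PASDiff, a training-free physics-aware semantic diffusion method that combines photometric constraints with facial structure injection to handle the multiple degradations of low-light face images, and introduces WildDark-Face, a real-world benchmark.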

Details Motivation: Real-world low-light face images suffer from multiple degradations, including low illumination, blur, noise, and low visibility; existing cascaded methods accumulate errors severely, and generic joint models lack explicit facial priors and struggle to recover clear facial structure. Method: PASDiff 1) introduces photometric constraints based on inverse intensity weighting and Retinex theory to recover visibility and natural chromaticity plausibly; 2) uses Style-Agnostic Structural Injection (SASI) to extract structure from an off-the-shelf facial prior while filtering out its photometric biases, harmonizing identity features with the physical constraints; and 3) is evaluated on WildDark-Face, a new real-world benchmark of 700 low-light face images with complex degradations. Result: PASDiff significantly outperforms existing methods and strikes a better balance among natural illumination, color recovery, and identity consistency. Conclusion: By combining physical modeling with semantic priors, PASDiff achieves high-quality low-light face enhancement without training, supporting the joint use of physical constraints and structural guidance. Abstract: Face images captured in real-world low light suffer multiple degradations: low illumination, blur, noise, low visibility, and so on. Existing cascaded solutions often suffer from severe error accumulation, while generic joint models lack explicit facial priors and struggle to resolve clear face structures. In this paper, we propose PASDiff, a training-free Physics-Aware Semantic Diffusion method. To achieve a plausible illumination and color distribution, we leverage inverse intensity weighting and Retinex theory to introduce photometric constraints, thereby reliably recovering visibility and natural chromaticity. To faithfully reconstruct facial details, our Style-Agnostic Structural Injection (SASI) extracts structures from an off-the-shelf facial prior while filtering out its intrinsic photometric biases, seamlessly harmonizing identity features with physical constraints. Furthermore, we construct WildDark-Face, a real-world benchmark of 700 low-light facial images with complex degradations. Extensive experiments demonstrate that PASDiff significantly outperforms existing methods, achieving a superior balance among natural illumination, color recovery, and identity consistency.

[72] MoE-GRPO: Optimizing Mixture-of-Experts via Reinforcement Learning in Vision-Language Models

Dohwan Ko,Jinyoung Park,Seoung Choi,Sanghyeok Lee,Seohyun Lee,Hyunwoo J. Kim

Main category: cs.CV

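TL;DR: This paper proposes MoE-GRPO, a reinforcement learning (GRPO) framework for optimizing expert routing in Mixture-of-Experts (MoE) vision-language models (VLMs). It increases the diversity of expert selection, mitigates expert overfitting, and enables task-level expert specialization.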

Details Motivation: The deterministic top-K routing used by MoE-based VLMs can overlook better expert combinations and lead to expert overfitting, so routing needs more diversity and adaptivity. Method: Expert selection is formulated as a sequential decision-making problem and optimized with Group Relative Policy Optimization (GRPO); a modality-aware router guidance discourages exploration of experts that are infrequently activated for a given modality. Result: On multi-modal image and video benchmarks, MoE-GRPO consistently outperforms standard top-K routing and its variants, with more diverse expert selection, less expert overfitting, and task-level expert specialization. Conclusion: With RL-driven adaptive routing and modality-aware guidance, MoE-GRPO improves the generalization, stability, and efficiency of MoE-based VLMs, offering a new paradigm for sparse multi-modal modeling. Abstract: Mixture-of-Experts (MoE) has emerged as an effective approach to reduce the computational overhead of Transformer architectures by sparsely activating a subset of parameters for each token while preserving high model capacity. This paradigm has recently been extended to Vision-Language Models (VLMs), enabling scalable multi-modal understanding with reduced computational cost. However, the widely adopted deterministic top-K routing mechanism may overlook more optimal expert combinations and lead to expert overfitting. To address this limitation and improve the diversity of expert selection, we propose MoE-GRPO, a reinforcement learning (RL)-based framework for optimizing expert routing in MoE-based VLMs. Specifically, we formulate expert selection as a sequential decision-making problem and optimize it using Group Relative Policy Optimization (GRPO), allowing the model to learn adaptive expert routing policies through exploration and reward-based feedback. Furthermore, we introduce a modality-aware router guidance that enhances training stability and efficiency by discouraging the router from exploring experts that are infrequently activated for a given modality. Extensive experiments on multi-modal image and video benchmarks show that MoE-GRPO consistently outperforms standard top-K routing and its variants by promoting more diverse expert selection, thereby mitigating expert overfitting and enabling task-level expert specialization.
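GRPO scores each sampled rollout relative to the other rollouts drawn for the same input. The sketch below shows that group-relative normalization in its commonly described form (reward minus group mean, divided by group standard deviation). Treating each rollout as one sampled expert-routing trajectory, and the `eps` stabilizer, are assumptions for illustration rather than details from the abstract.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts for the same input.

    Each entry of `rewards` is the scalar reward of one sampled rollout
    (here, hypothetically, one sampled routing trajectory).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    # eps keeps the division finite when every rollout earns the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When all rollouts in a group earn the same reward, every advantage is zero and the group contributes no gradient, which is one reason exploration over diverse routings matters in this setting.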

[73] Few-Shot Left Atrial Wall Segmentation in 3D LGE MRI via Meta-Learning

Yusri Al-Sanaani,Rebecca Thornhill,Pablo Nery,Elena Pena,Robert deKemp,Calum Redpath,David Birnie,Sreeraman Rajan

Main category: cs.CV

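TL;DR: This paper proposes a few-shot (K-shot) 3D left atrial wall segmentation framework based on Model-Agnostic Meta-Learning (MAML). Combining multi-task meta-training with a boundary-aware composite loss, it improves thin-structure segmentation accuracy and robustness when annotations are scarce.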

Details Motivation: The left atrial wall is hard to segment in late gadolinium enhancement MRI because it is thin, has low contrast, and expert annotations are scarce. Method: A Model-Agnostic Meta-Learning (MAML) framework performs 5/10/20-shot 3D left atrial wall segmentation; meta-training combines the wall task with auxiliary left and right atrial cavity tasks, and a boundary-aware composite loss emphasizes thin-structure boundary accuracy. Result: On the hold-out test set, 5-shot Dice reaches 0.64 (vs. 0.52 for supervised fine-tuning) with HD95 of 5.70 mm (vs. 7.60 mm); 20-shot approaches the fully supervised reference (0.69 vs. 0.71); performance stays robust under an unseen domain shift and on a local cohort (5-shot Dice of 0.59 and 0.57, respectively). Conclusion: The MAML approach yields more accurate and reliable left atrial wall boundaries from very few annotations, which could support assessment of atrial remodeling with minimal additional labeling. Abstract: Segmenting the left atrial wall from late gadolinium enhancement magnetic resonance images (MRI) is challenging due to the wall's thin geometry, low contrast, and the scarcity of expert annotations. We propose a Model-Agnostic Meta-Learning (MAML) framework for K-shot (K = 5, 10, 20) 3D left atrial wall segmentation that is meta-trained on the wall task together with auxiliary left atrial and right atrial cavity tasks and uses a boundary-aware composite loss to emphasize thin-structure accuracy. We evaluated MAML segmentation performance on a hold-out test set and assessed robustness under an unseen synthetic shift and on a distinct local cohort. On the hold-out test set, MAML appeared to improve segmentation performance compared to the supervised fine-tuning model, achieving a Dice score (DSC) of 0.64 vs. 0.52 and HD95 of 5.70 vs. 7.60 mm at 5-shot, and approached the fully supervised reference at 20-shot (0.69 vs. 0.71 DSC). Under unseen shift, performance degraded but remained robust: at 5-shot, MAML attained 0.59 DSC and 5.99 mm HD95 on the unseen domain shift and 0.57 DSC and 6.01 mm HD95 on the local cohort, with consistent gains as K increased. These results suggest that more accurate and reliable thin-wall boundaries are achievable in low-shot adaptation, potentially enabling clinical translation with minimal additional labeling for the assessment of atrial remodeling.
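For readers unfamiliar with the Dice score (DSC) reported above, the following minimal sketch computes it for two binary masks given as flat sequences. The flattening of a 3D volume and the convention of returning 1.0 when both masks are empty are assumptions for illustration, not details of the paper's evaluation code.

```python
def dice_score(pred, target):
    """Dice = 2 * |pred AND target| / (|pred| + |target|) for binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total
```

Because the wall occupies few voxels, a one-voxel boundary error removes a large share of the overlap, which is why thin structures tend to score lower on Dice than cavities do.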

[74] Towards Video Anomaly Detection from Event Streams: A Baseline and Benchmark Datasets

Peng Wu,Yuting Yan,Guansong Pang,Yujia Sun,Qingsen Yan,Peng Wang,Yanning Zhang

Main category: cs.CV

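TL;DR: This paper proposes EWAD, presented as the first video anomaly detection framework for event streams, and builds several benchmark datasets with synchronized event and RGB recordings to advance event-based anomaly detection research.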

Details Motivation: Event cameras offer low redundancy, a focus on dynamic motion, and inherent privacy preservation, which suit video anomaly detection well, but the lack of dedicated datasets and effective modeling strategies has held the field back. Method: The EWAD framework has three key components: an event-density-aware dynamic sampling strategy, a density-modulated temporal modeling approach, and an RGB-to-event knowledge distillation mechanism. Result: On three newly built benchmarks, EWAD significantly outperforms existing methods, supporting the potential of event-driven modeling for video anomaly detection. Conclusion: This work is a first systematic step toward establishing event-stream video anomaly detection as a research direction, providing benchmark datasets and an effective framework as a basis for follow-up work. Abstract: Event-based vision, characterized by low redundancy, focus on dynamic motion, and inherent privacy-preserving properties, naturally fits the demands of video anomaly detection (VAD). However, the absence of dedicated event-stream anomaly detection datasets and effective modeling strategies has significantly hindered progress in this field. In this work, we take the first major step toward establishing event-based VAD as a unified research direction. We first construct multiple event-stream based benchmarks for video anomaly detection, featuring synchronized event and RGB recordings. Leveraging the unique properties of events, we then propose an EVent-centric spatiotemporal Video Anomaly Detection framework, namely EWAD, with three key innovations: an event density aware dynamic sampling strategy to select temporally informative segments; a density-modulated temporal modeling approach that captures contextual relations from sparse event streams; and an RGB-to-event knowledge distillation mechanism to enhance event-based representations under weak supervision. Extensive experiments on three benchmarks demonstrate that our EWAD achieves significant improvements over existing approaches, highlighting the potential and effectiveness of event-driven modeling for video anomaly detection. The benchmark datasets will be made publicly available.

[75] C2W-Tune: Cavity-to-Wall Transfer Learning for Thin Atrial Wall Segmentation in 3D Late Gadolinium-enhanced Magnetic Resonance

Yusri Al-Sanaani,Rebecca Thornhill,Sreeraman Rajan

Main category: cs.CV

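TL;DR: This paper proposes C2W-Tune, a two-stage transfer framework that uses a high-accuracy left atrial cavity segmentation model as an anatomical prior to improve segmentation of the thin left atrial wall in 3D LGE-MRI. With cavity-to-wall weight transfer and progressive layer unfreezing, it clearly beats a from-scratch baseline on several boundary metrics.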

Details Motivation: The left atrial wall is hard to segment accurately in 3D LGE-MRI because it is thin, anatomically complex, and low in contrast, yet accurate segmentation is essential for wall thickness mapping and fibrosis quantification. Method: In C2W-Tune, Stage 1 pre-trains a 3D U-Net with a ResNeXt encoder and instance normalization to segment the left atrial cavity; Stage 2 transfers the weights and fine-tunes the network for wall segmentation with a progressive layer-unfreezing schedule. Result: On the 2018 LA Segmentation Challenge dataset, wall Dice reaches 0.814 (+0.191), Surface Dice at 1 mm reaches 0.731 (+0.178), HD95 drops to 2.55 mm, and ASSD drops to 0.63 mm; with only 70 training volumes it still reaches Dice 0.78 and HD95 3.15 mm, exceeding multi-class benchmarks. Conclusion: Anatomically grounded task transfer combined with controlled fine-tuning improves boundary accuracy for thin left atrial wall segmentation in 3D LGE-MRI. Abstract: Accurate segmentation of the left atrial (LA) wall in 3D late gadolinium-enhanced MRI (LGE-MRI) is essential for wall thickness mapping and fibrosis quantification, yet it remains challenging due to the wall's thinness, complex anatomy, and low contrast. We propose C2W-Tune, a two-stage cavity-to-wall transfer framework that leverages a high-accuracy LA cavity model as an anatomical prior to improve thin-wall delineation. Using a 3D U-Net with a ResNeXt encoder and instance normalization, Stage 1 pre-trains the network to segment the LA cavity, learning robust atrial representations. Stage 2 transfers these weights and adapts the network to LA wall segmentation using a progressive layer-unfreezing schedule to preserve endocardial features while enabling wall-specific refinement. Experiments on the 2018 LA Segmentation Challenge dataset demonstrate substantial gains over an architecture-matched baseline trained from scratch: wall Dice improves from 0.623 to 0.814, and Surface Dice at 1 mm improves from 0.553 to 0.731. Boundary errors were substantially reduced, with the 95th-percentile Hausdorff distance (HD95) decreasing from 2.95 mm to 2.55 mm and the average symmetric surface distance (ASSD) from 0.71 mm to 0.63 mm. Furthermore, even with reduced supervision (70 training volumes sampled from the same training pool), C2W-Tune achieved a Dice score of 0.78 and an HD95 of 3.15 mm, maintaining competitive performance and exceeding multi-class benchmarks that typically report Dice values around 0.6-0.7. These results show that anatomically grounded task transfer with controlled fine-tuning improves boundary accuracy for thin LA wall segmentation in 3D LGE-MRI.
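A progressive layer-unfreezing schedule can be sketched as a function that decides which parameter groups are trainable at a given epoch. The group names, the head-first unfreezing order, and the fixed `epochs_per_stage` are all hypothetical; the abstract does not specify the schedule C2W-Tune uses.

```python
def trainable_groups(epoch, groups, epochs_per_stage):
    """Return the layer groups that are trainable at `epoch`.

    `groups` is ordered from the earliest encoder block to the output head.
    Assumed schedule: start with only the last group trainable, then unfreeze
    one more group (moving towards the input) every `epochs_per_stage` epochs.
    """
    if epochs_per_stage <= 0:
        raise ValueError("epochs_per_stage must be positive")
    n_unfrozen = min(len(groups), 1 + epoch // epochs_per_stage)
    return groups[len(groups) - n_unfrozen:]


# Hypothetical grouping of a U-Net-style network
LAYER_GROUPS = ["encoder_early", "encoder_late", "decoder", "head"]
```

In a training loop one would set `requires_grad` (or the framework's equivalent) to true only for parameters in the returned groups, so that early cavity-trained features change last.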

[76] Bilingual Text-to-Motion Generation: A New Benchmark and Baselines

Wanjiang Weng,Xiaofeng Tan,Xiangbo Shu,Guo-Sen Xie,Pan Zhou,Hongsong Wang

Main category: cs.CV

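TL;DR: This paper introduces BiHumanML3D, presented as the first bilingual text-to-motion benchmark, and BiMD, a bilingual motion diffusion model with Cross-Lingual Alignment (CLA) that significantly outperforms monolingual models and translation baselines on bilingual motion generation.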

Details Motivation: Existing text-to-motion methods are held back by the lack of bilingual datasets and the weak cross-lingual semantic understanding of language models. Method: The authors build BiHumanML3D, a bilingual text-motion benchmark (LLM-assisted annotation plus manual correction), and propose the Bilingual Motion Diffusion model (BiMD), whose core is an explicit Cross-Lingual Alignment (CLA) module. Result: On BiHumanML3D, BiMD achieves an FID of 0.045 (vs. 0.169) and R@3 of 82.8% (vs. 80.8%), significantly outperforming monolingual models and translation baselines; it also supports zero-shot code-switching. Conclusion: The bilingual dataset BiHumanML3D and the CLA alignment strategy are both important and effective for cross-lingual motion synthesis, laying groundwork for this direction. Abstract: Text-to-motion generation holds significant potential for cross-linguistic applications, yet it is hindered by the lack of bilingual datasets and the poor cross-lingual semantic understanding of existing language models. To address these gaps, we introduce BiHumanML3D, the first bilingual text-to-motion benchmark, constructed via LLM-assisted annotation and rigorous manual correction. Furthermore, we propose a simple yet effective baseline, Bilingual Motion Diffusion (BiMD), featuring Cross-Lingual Alignment (CLA). CLA explicitly aligns semantic representations across languages, creating a robust conditional space that enables high-quality motion generation from bilingual inputs, including zero-shot code-switching scenarios. Extensive experiments demonstrate that BiMD with CLA achieves an FID of 0.045 vs. 0.169 and R@3 of 82.8% vs. 80.8%, significantly outperforming monolingual diffusion models and translation baselines on BiHumanML3D, underscoring the critical necessity and reliability of our dataset and the effectiveness of our alignment strategy for cross-lingual motion synthesis. The dataset and code are released at https://wengwanjiang.github.io/BilingualT2M-page

[77] Relaxed Rigidity with Ray-based Grouping for Dynamic Gaussian Splatting

Junoh Leea,Junmyeong Lee,Yeon-Ji Song,Inhwan Bae,Jisu Shin,Hae-Gon Jeon,Jin-Hwa Kim

Main category: cs.CV

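TL;DR: This paper proposes a method that explicitly keeps the local geometric structure of Gaussians consistent over time in 4D scenes. Using view-space ray grouping with $\alpha$-weighted constraints, and no external priors such as optical flow, it improves the physical plausibility and quality of monocular dynamic 3D reconstruction.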

Details Motivation: Dynamic 3D reconstruction methods based on 3D Gaussian Splatting struggle to model physically realistic motion, especially on monocular video, where incoherent Gaussian motion degrades local geometric structure and reconstruction quality; current state-of-the-art methods rely heavily on external priors such as optical flow or 2D tracks to maintain temporal coherence. Method: A view-space ray grouping strategy clusters the Gaussians intersected by the same ray whose $\alpha$-blending weights exceed a threshold, and a spatial-distribution consistency constraint is applied to each group to stabilize its local geometry, which enforces more physically plausible motion. Result: The method is validated on several challenging monocular dynamic datasets; integrated into two different baseline models, it significantly outperforms existing methods in both temporal consistency and reconstruction quality. Conclusion: By explicitly keeping the local geometry of Gaussians stable over time, the method removes the reliance on external priors and yields more physically plausible, higher-quality 4D dynamic scene reconstruction. Abstract: The reconstruction of dynamic 3D scenes using 3D Gaussian Splatting has shown significant promise. A key challenge, however, remains in modeling realistic motion, as most methods fail to align the motion of Gaussians with real-world physical dynamics. This misalignment is particularly problematic for monocular video datasets, where failing to maintain coherent motion undermines local geometric structure, ultimately leading to degraded reconstruction quality. Consequently, many state-of-the-art approaches rely heavily on external priors, such as optical flow or 2D tracks, to enforce temporal coherence. In this work, we propose a novel method to explicitly preserve the local geometric structure of Gaussians across time in 4D scenes. Our core idea is to introduce a view-space ray grouping strategy that clusters Gaussians intersected by the same ray, considering only those whose $\alpha$-blending weights exceed a threshold. We then apply constraints to these groups to maintain a consistent spatial distribution, effectively preserving their local geometry. This approach enforces a more physically plausible motion model by ensuring that local geometry remains stable over time, eliminating the reliance on external guidance. We demonstrate the efficacy of our method by integrating it into two distinct baseline models. Extensive experiments on challenging monocular datasets show that our approach significantly outperforms existing methods, achieving superior temporal consistency and reconstruction quality.

[78] Probabilistic Concept Graph Reasoning for Multimodal Misinformation Detection

Ruichao Yang,Wei Gao,Xiaobin Zhu,Jing Ma,Hongzhan Lin,Ziyang Luo,Bo-Wen Zhang,Xu-Cheng Yin

Main category: cs.CV

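TL;DR: This paper proposes Probabilistic Concept Graph Reasoning (PCGR), an interpretable and evolvable framework for multimodal misinformation detection that builds a human-understandable concept graph and reasons over it with hierarchical attention, achieving high accuracy and strong robustness.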

Details Motivation: Traditional multimodal misinformation detectors are opaque black boxes that cope poorly with new manipulation tactics, so an interpretable and evolvable solution is needed. Method: PCGR follows a build-then-infer paradigm: multimodal large language models (MLLMs) first discover and validate high-level concepts automatically to build a human-understandable concept graph; hierarchical attention is then applied over the graph to infer veracity and produce traceable reasoning chains. Result: PCGR achieves state-of-the-art accuracy on multimodal misinformation detection and strong robustness to emerging manipulation types, outperforming prior methods in both coarse detection and fine-grained manipulation recognition. Conclusion: Through structured, concept-driven reasoning, PCGR improves the interpretability, adaptability, and performance of multimodal misinformation detection. Abstract: Multimodal misinformation poses an escalating challenge that often evades traditional detectors, which are opaque black boxes and fragile against new manipulation tactics. We present Probabilistic Concept Graph Reasoning (PCGR), an interpretable and evolvable framework that reframes multimodal misinformation detection (MMD) as structured and concept-based reasoning. PCGR follows a build-then-infer paradigm, which first constructs a graph of human-understandable concept nodes, including novel high-level concepts automatically discovered and validated by multimodal large language models (MLLMs), and then applies hierarchical attention over this concept graph to infer claim veracity. This design produces interpretable reasoning chains linking evidence to conclusions. Experiments demonstrate that PCGR achieves state-of-the-art MMD accuracy and robustness to emerging manipulation types, outperforming prior methods in both coarse detection and fine-grained manipulation recognition.

[79] Distributed Real-Time Vehicle Control for Emergency Vehicle Transit: A Scalable Cooperative Method

WenXi Wang,JunQi Zhang

Main category: cs.CV

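TL;DR: This paper proposes a scalable distributed vehicle control method in which ordinary vehicles cooperate online, using only local information, to make way for emergency vehicles. It combines fast response, low impact on normal traffic, and strong generalization.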

Details Motivation: Existing methods for rapid emergency vehicle transit (centralized optimization and reinforcement learning) are computationally expensive and scale poorly to different traffic scenarios, so a low-overhead, adaptable, and safe distributed solution is needed. Method: A distributed vehicle control framework based on local information is proposed and shown theoretically to be approximately equivalent to decision-making with global information; a distributed conflict resolution mechanism then guarantees safety and avoids a single point of failure. Result: Simulations on real-world traffic datasets show faster decision-making, less disturbance to ordinary vehicles, and strong scalability across traffic densities and road configurations. Conclusion: The distributed method overcomes the limits of centralized and learning-based approaches, striking a better balance among real-time performance, safety, scalability, and practicality for emergency transit in connected-vehicle environments. Abstract: Rapid transit of emergency vehicles is critical for saving lives and reducing property loss but often relies on surrounding ordinary vehicles to cooperatively adjust their driving behaviors. It is important to ensure rapid transit of emergency vehicles while minimizing the impact on ordinary vehicles. Centralized mathematical solvers and reinforcement learning are the state-of-the-art methods. The former obtain optimal solutions but are only practical for small-scale scenarios. The latter learns implicitly through extensive centralized training, but the trained model exhibits limited scalability to different traffic conditions. Hence, existing methods suffer from two fundamental limitations: high computational cost and lack of scalability. To overcome the above limitations, this work proposes a scalable distributed vehicle control method, where vehicles adjust their driving behaviors in a distributed manner online using only local instead of global information. We prove that the proposed distributed method using only local information is approximately equivalent to the one using global information, which enables vehicles to evaluate their candidate states and make approximately optimal decisions in real time without pre-training and with natural adaptability to varying traffic conditions. Then, a distributed conflict resolution mechanism is further proposed to guarantee vehicles' safety by avoiding their decision conflicts, which eliminates the single-point-of-failure risk of centralized methods and provides deterministic safety guarantees that learned methods cannot offer. Compared with existing methods, simulation experiments based on real-world traffic datasets demonstrate that the proposed method achieves faster decision-making, has less impact on ordinary vehicles, and maintains much stronger scalability across different traffic densities and road configurations.

[80] Interpretable Zero-shot Referring Expression Comprehension with Query-driven Scene Graphs

Yike Wu,Necva Bolucu,Stephen Wan,Dadong Wang,Jiahao Xia,Jian Zhang

Main category: cs.CV

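TL;DR: This paper proposes SGREC, which uses query-driven scene graphs as a structured intermediary between a vision-language model (VLM) and a large language model (LLM) to achieve interpretable zero-shot referring expression comprehension (REC), with leading results on several benchmarks.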

Details Motivation: Existing VLMs struggle to capture fine-grained visual details and complex object relationships, while LLMs excel at high-level semantic reasoning but cannot directly abstract visual features; combining their strengths should improve zero-shot REC performance and interpretability. Method: SGREC first uses a VLM to build a query-driven scene graph (encoding spatial relationships, descriptive captions, and object interactions), then feeds this structured textual representation to an LLM for target localization and interpretable reasoning. Result: SGREC achieves the best top-1 accuracy on zero-shot REC benchmarks including RefCOCO val (66.78%), RefCOCO+ testB (53.43%), and RefCOCOg val (73.28%). Conclusion: By introducing query-driven scene graphs, SGREC bridges the gap between image regions and semantic understanding, delivering strong performance and interpretability in zero-shot REC. Abstract: Zero-shot referring expression comprehension (REC) aims to locate target objects in images given natural language queries without relying on task-specific training data, demanding strong visual understanding capabilities. Existing Vision-Language Models (VLMs), such as CLIP, commonly address zero-shot REC by directly measuring feature similarities between textual queries and image regions. However, these methods struggle to capture fine-grained visual details and understand complex object relationships. Meanwhile, although Large Language Models (LLMs) excel at high-level semantic reasoning, their inability to directly abstract visual features into textual semantics limits their application in REC tasks. To overcome these limitations, we propose SGREC, an interpretable zero-shot REC method leveraging query-driven scene graphs as structured intermediaries. Specifically, we first employ a VLM to construct a query-driven scene graph that explicitly encodes spatial relationships, descriptive captions, and object interactions relevant to the given query. By leveraging this scene graph, we bridge the gap between low-level image regions and higher-level semantic understanding required by LLMs. Finally, an LLM infers the target object from the structured textual representation provided by the scene graph, responding with detailed explanations for its decisions that ensure interpretability in the inference process. Extensive experiments show that SGREC achieves the best top-1 accuracy on most zero-shot REC benchmarks, including RefCOCO val (66.78%), RefCOCO+ testB (53.43%), and RefCOCOg val (73.28%), highlighting its strong visual scene understanding.

[81] Improving Fine-Grained Rice Leaf Disease Detection via Angular-Compactness Dual Loss Learning

Md. Rokon Mia,Rakib Hossain Sajib,Abdullah Al Noman,Abir Ahmed,B M Taslimul Haque

Main category: cs.CV

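TL;DR: This paper proposes a dual-loss framework combining Center Loss and ArcFace Loss to improve fine-grained classification of rice leaf diseases. It exceeds 99% accuracy on the Rice Leaf Dataset without modifying the backbone architecture, which makes it practical for agricultural deployment.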

Details Motivation: Conventional deep learning models rely on cross-entropy loss, which struggles with the high intra-class variance and inter-class similarity of rice disease datasets and makes precise identification difficult. Method: A dual-loss framework fuses Center Loss with ArcFace Loss (an additive angular margin loss) and is applied to three mainstream backbones: InceptionNetV3, DenseNet201, and EfficientNetB0. Result: On the Rice Leaf Dataset the three backbones reach classification accuracies of 99.6%, 99.2%, and 99.2% respectively, indicating that angular-margin and center-based constraints substantially strengthen feature discriminability. Conclusion: The dual-loss framework needs no architectural changes and combines high performance with practicality, offering an efficient solution for fine-grained classification of plant pathology images. Abstract: Early detection of rice leaf diseases is critical, as rice is a staple crop supporting a substantial share of the world's population. Timely identification of these diseases enables more effective intervention and significantly reduces the risk of large-scale crop losses. However, traditional deep learning models primarily rely on cross entropy loss, which often struggles with high intra-class variance and inter-class similarity, common challenges in plant pathology datasets. To tackle this, we propose a dual-loss framework that combines Center Loss and ArcFace Loss to enhance fine-grained classification of rice leaf diseases. The method is applied to three state-of-the-art backbone architectures: InceptionNetV3, DenseNet201, and EfficientNetB0, trained on the public Rice Leaf Dataset. Our approach achieves significant performance gains, with accuracies of 99.6%, 99.2% and 99.2% respectively. The results demonstrate that angular margin-based and center-based constraints substantially boost the discriminative strength of feature embeddings. In particular, the framework does not require major architectural modifications, making it efficient and practical for real-world deployment in farming environments.
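How an angular-margin term and a center term can be combined is sketched below for a single embedding, in plain Python. The scale `scale=30.0`, margin `margin=0.5`, and weighting `lam=0.01` are illustrative assumptions, not the paper's settings. The sketch also omits the handling that production ArcFace implementations add for angles near pi, and assumes non-zero vectors.

```python
import math


def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def arcface_loss(embedding, class_weights, label, scale=30.0, margin=0.5):
    """Cross-entropy over scaled cosines, with an angular margin on the true class."""
    e = _normalize(embedding)
    logits = []
    for k, w in enumerate(class_weights):
        cos = sum(a * b for a, b in zip(e, _normalize(w)))
        cos = max(-1.0, min(1.0, cos))  # guard acos against rounding error
        if k == label:
            cos = math.cos(math.acos(cos) + margin)  # penalize the target angle
        logits.append(scale * cos)
    m = max(logits)  # log-sum-exp with max subtraction for stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def center_loss(embedding, centers, label):
    """Half the squared distance between the embedding and its class center."""
    return 0.5 * sum((x - c) ** 2 for x, c in zip(embedding, centers[label]))


def dual_loss(embedding, class_weights, centers, label, lam=0.01):
    """Angular-margin term plus a weighted compactness term (weighting assumed)."""
    return (arcface_loss(embedding, class_weights, label)
            + lam * center_loss(embedding, centers, label))
```

The margin term pushes classes apart in angle, while the center term pulls same-class embeddings together, which matches the "angular" and "compactness" halves of the paper's title.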

[82] Few TensoRF: Enhance the Few-shot on Tensorial Radiance Fields

Thanh-Hai Le,Hoang-Hau Tran,Trong-Nghia Vu

Main category: cs.CV

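TL;DR: Few TensoRF combines the efficient tensor representation of TensoRF with FreeNeRF's frequency-driven few-shot regularization, improving the stability and quality of 3D reconstruction from sparse views while keeping training fast.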

Details Motivation: 3D reconstruction from sparse input views is unstable and low in quality; the goal is to address this while balancing efficiency and data effectiveness. Method: The tensor representation of TensoRF is combined with FreeNeRF's frequency-driven few-shot regularization, adding frequency and occlusion masks. Result: On Synthesis NeRF, PSNR rises from 21.45 dB to 23.70 dB (24.52 dB with fine-tuning) while training time stays at about 10-15 minutes; on THuman 2.0 it reaches 27.37-34.00 dB with only 8 images. Conclusion: Few TensoRF is an efficient and data-effective 3D reconstruction method suitable for diverse scenes. Abstract: This paper presents Few TensoRF, a 3D reconstruction framework that combines TensoRF's efficient tensor-based representation with FreeNeRF's frequency-driven few-shot regularization. Using TensoRF to significantly accelerate rendering speed and introducing frequency and occlusion masks, the method improves stability and reconstruction quality under sparse input views. Experiments on the Synthesis NeRF benchmark show that Few TensoRF improves the average PSNR from 21.45 dB (TensoRF) to 23.70 dB, with the fine-tuned version reaching 24.52 dB, while maintaining TensoRF's fast $\approx$10-15 minute training time. Experiments on the THuman 2.0 dataset further demonstrate competitive performance in human body reconstruction, achieving 27.37-34.00 dB with only eight input images. These results highlight Few TensoRF as an efficient and data-effective solution for real-time 3D reconstruction across diverse scenes.
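The PSNR figures above follow the standard definition, 10 log10(MAX^2 / MSE). A minimal sketch, assuming pixel values normalized to [0, 1] and images passed as flat sequences (both illustrative choices), makes the scale concrete:

```python
import math


def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    if len(reference) != len(rendered) or not reference:
        raise ValueError("images must be non-empty and the same size")
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Because the scale is logarithmic, the reported gain from 21.45 dB to 23.70 dB corresponds to the mean squared error shrinking by a factor of roughly 10^0.225, or about 1.7.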

[83] GDPO-Listener: Expressive Interactive Head Generation via Auto-Regressive Flow Matching and Group reward-Decoupled Policy Optimization

Zhangyu Jin,Maksim Siniukov,Deuksin Kwon,Ashutosh Chaubey,Mohammad Soleymani

Main category: cs.CV

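TL;DR: This paper proposes GDPO-Listener, a framework that uses auto-regressive flow matching, Group reward-Decoupled Policy Optimization, and semantic text control to make 3D head motion for speakers and listeners in dyadic interaction more expressive, dynamic, and controllable.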

Details Motivation: Existing methods tend to suffer from regression to the mean when generating listener head motion (the motion collapses toward stillness) and lack the capacity to model complex nonverbal motions. Method: 1) An Auto-Regressive Flow Matching architecture enables stable supervised learning; 2) Group reward-Decoupled Policy Optimization (GDPO) normalizes rewards independently for each group of FLAME parameters to encourage high-variance, expressive motion; 3) an explicit semantic text control mechanism supports customizable responses. Result: Experiments on the Seamless Interaction and DualTalk datasets show better long-term kinematic variance, visual expressivity, and semantic controllability than existing baselines. Conclusion: GDPO-Listener addresses the collapse of listener motion and offers a more natural, rich, and controllable approach to 3D head motion synthesis for dyadic virtual human interaction. Abstract: Generating realistic 3D head motion for dyadic interactions is a significant challenge in virtual human synthesis. While recent methods achieve impressive results with speaking heads, they frequently suffer from the 'Regression-to-the-Mean' problem in listener motions, collapsing into static faces, and lack the parameter space for complex nonverbal motions. In this paper, we propose GDPO-Listener, a novel framework that achieves highly expressive speaking and listening motion generation. First, we introduce an Auto-Regressive Flow Matching architecture enabling stable supervised learning. Second, to overcome kinematic stillness, we apply the Group reward-Decoupled Policy Optimization (GDPO). By isolating reward normalization across distinct FLAME parameter groups, GDPO explicitly incentivizes high variance expressive generations. Finally, we enable explicit semantic text control for customizable responses. Extensive evaluations across the Seamless Interaction and DualTalk datasets demonstrate superior performance compared to existing baselines on long-term kinematic variance, visual expressivity and semantic controllability.

[84] VideoTIR: Accurate Understanding for Long Videos with Efficient Tool-Integrated Reasoning

Zhe Gao,Shiyu Shen,Taifeng Chai,Weinong Wang,Haotian Xu,Xing W,Wenbin Li,Qi Fan,Yang Gao,Dacheng Tao

Main category: cs.CV

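TL;DR: This paper proposes VideoTIR, which uses reinforcement learning to optimize multi-level tool calling, improving long-video understanding and mitigating the hallucinations of multimodal large language models on this task.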

Details Motivation: Existing MLLMs are prone to hallucination in long-video understanding, mainly because of the imbalance between textual and visual tokens; SFT-based tool-calling methods exist, but they depend on large amounts of high-quality annotated data and have constrained tool-calling trajectories. Method: A reinforcement-learning-based VideoTIR framework that supports both Zero-RL and SFT cold start; Toolkit Action Grouped Policy Optimization (TAGPO) to make tool calling more efficient; and a sandbox-based trajectory synthesis framework to generate high-quality training trajectories. Result: Experiments on three long-video QA benchmarks show that VideoTIR outperforms existing methods in both accuracy and efficiency. Conclusion: Through RL-driven coordination of multi-level tools and efficient policy optimization, VideoTIR mitigates hallucination in long-video understanding and offers a new paradigm for MLLMs on complex video understanding tasks. Abstract: Existing Multimodal Large Language Models (MLLMs) often suffer from hallucinations in long video understanding (LVU), primarily due to the imbalance between textual and visual tokens. Observing that MLLMs handle short visual inputs well, recent LVU works alleviate hallucinations by automatically parsing the vast visual data into manageable segments that can be effectively processed by MLLMs. SFT-based tool-calling methods can serve this purpose, but they typically require vast amounts of fine-grained, high-quality data and suffer from constrained tool-calling trajectories. We propose a novel VideoTIR that leverages Reinforcement Learning (RL) to encourage proper usage of comprehensive multi-level toolkits for efficient long video understanding. VideoTIR explores both Zero-RL and SFT cold-starting to enable MLLMs to retrieve and focus on meaningful video segments/images/regions, enhancing long video understanding both accurately and efficiently. To reduce redundant tool-calling, we propose Toolkit Action Grouped Policy Optimization (TAGPO), which enhances the efficiency of the calling process through stepwise reward assignment and reuse of failed rollouts. Additionally, we develop a sandbox-based trajectory synthesis framework to generate high-quality trajectory data. Extensive experiments on three long-video QA benchmarks demonstrate the effectiveness and efficiency of our method.

[85] CARE: Training-Free Controllable Restoration for Medical Images via Dual-Latent Steering

Xu Liu

Main category: cs.CV

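TL;DR: This paper proposes CARE, a training-free controllable framework for medical image restoration. It balances structure preservation and prior-guided refinement with a dual-latent branch strategy, and a risk-aware adaptive controller dynamically adjusts each branch's contribution, enabling conservative or enhancement-focused restoration modes without retraining the model.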

Details Motivation: Existing medical image restoration methods usually depend on task-specific retraining and offer little control over the trade-off between faithful reconstruction and prior-driven enhancement; overly aggressive restoration may introduce hallucinated details or alter diagnostically critical structures, which makes clinical use unsafe. Method: The CARE framework uses a dual-latent restoration strategy: one branch enforces data fidelity and anatomical consistency, while the other uses a generative prior to recover missing or degraded information. A risk-aware adaptive controller dynamically adjusts the weights of the two branches based on restoration uncertainty and local structural reliability. Result: Evaluated on noisy and incomplete medical imaging scenarios, CARE achieves strong restoration quality, better preserves clinically relevant structures, and reduces the risk of implausible reconstructions. Conclusion: CARE offers a practical path toward safer, more controllable, plug-and-play medical image restoration. Abstract: Medical image restoration is essential for improving the usability of noisy, incomplete, and artifact-corrupted clinical scans, yet existing methods often rely on task-specific retraining and offer limited control over the trade-off between faithful reconstruction and prior-driven enhancement. This lack of controllability is especially problematic in clinical settings, where overly aggressive restoration may introduce hallucinated details or alter diagnostically important structures. In this work, we propose CARE, a training-free controllable restoration framework for real-world medical images that explicitly balances structure preservation and prior-guided refinement during inference. CARE uses a dual-latent restoration strategy, in which one branch enforces data fidelity and anatomical consistency while the other leverages a generative prior to recover missing or degraded information. A risk-aware adaptive controller dynamically adjusts the contribution of each branch based on restoration uncertainty and local structural reliability, enabling conservative or enhancement-focused restoration modes without additional model training. We evaluate CARE on noisy and incomplete medical imaging scenarios and show that it achieves strong restoration quality while better preserving clinically relevant structures and reducing the risk of implausible reconstructions. The proposed approach offers a practical step toward safer, more controllable, and more deployment-ready medical image restoration.

[86] GeoNDC: A Queryable Neural Data Cube for Planetary-Scale Earth Observation

Jianbo Qi,Mengyao Li,Baogui Jiang,Yidan Chen,Qiao Wang

Main category: cs.CV

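TL;DR: This paper proposes GeoNDC, a queryable Earth observation data cube built on implicit neural fields. It encodes remote sensing data as a continuous spatiotemporal representation, enabling efficient compression, on-demand queries, and high-fidelity reconstruction.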

Details Motivation: Satellite remote sensing data are currently stored as discrete raster files, which makes storage, transmission, and querying costly, and there is no unified, compact, analysis-ready data representation. Method: GeoNDC is a queryable neural data cube that models planetary-scale Earth observation data with a continuous spatiotemporal implicit neural field, supporting direct spatiotemporal queries, continuous-time reconstruction, and end-to-end compression. Result: Validated on MODIS, Sentinel-2, and HiGLASS data: a compression ratio of about 95:1 (20 years of MODIS in only 0.44 GB) with high spectral fidelity (R² > 0.98, RMSE = 0.021); temporal reconstruction under cloud occlusion with R² > 0.85; biophysical product reconstruction with R² > 0.98; and direct spatiotemporal queries on consumer hardware. Conclusion: GeoNDC provides a unified AI-native representation that integrates query, reconstruction, and compression, and could serve as an analysis-ready data layer on top of raw remote sensing archives. Abstract: Satellite Earth observation has accumulated massive spatiotemporal archives essential for monitoring environmental change, yet these remain organized as discrete raster files, making them costly to store, transmit, and query. We present GeoNDC, a queryable neural data cube that encodes planetary-scale Earth observation data as a continuous spatiotemporal implicit neural field, enabling on-demand queries and continuous-time reconstruction without full decompression. Experiments on a 20-year global MODIS MCD43A4 reflectance record (7 bands, 5 km, 8-day sampling) show that the learned representation supports direct spatiotemporal queries on consumer hardware. On Sentinel-2 imagery (10 m), continuous temporal parameterization recovers cloud-free dynamics with high fidelity ($R^2 > 0.85$) under simulated 2-km cloud occlusion. On HiGLASS biophysical products (LAI and FPAR), GeoNDC attains near-perfect accuracy ($R^2 > 0.98$). The representation compresses the 20-year MODIS archive to 0.44 GB (approximately 95:1 relative to an optimized Int16 baseline) with high spectral fidelity (mean $R^2 > 0.98$, mean RMSE $= 0.021$). These results suggest GeoNDC offers a unified AI-native representation for planetary-scale Earth observation, complementing raw archives with a compact, analysis-ready data layer integrating query, reconstruction, and compression in a single framework.
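As a quick sanity check on the compression figures, the size of the Int16 baseline implied by the abstract can be recovered by multiplying the compressed size by the stated ratio. The helper name below is illustrative; the result is only as precise as the rounded "approximately 95:1" figure.

```python
def implied_baseline_gb(compressed_gb, ratio):
    """Uncompressed size implied by a compressed size and a compression ratio."""
    return compressed_gb * ratio

# 0.44 GB at roughly 95:1, as stated in the abstract
print(round(implied_baseline_gb(0.44, 95), 1))  # 41.8
```

So the optimized Int16 baseline for the 20-year MODIS record would be on the order of 42 GB.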

[87] MoRGS: Efficient Per-Gaussian Motion Reasoning for Streamable Dynamic 3D Scenes

Wonjoon Lee,Sungmin Woo,Donghyeong Kim,Jungho Lee,Sangheon Park,Sangyoun Lee

Main category: cs.CV

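TL;DR: This paper proposes MoRGS, a framework that uses optical flow as a lightweight motion cue and learns a per-Gaussian motion offset field and motion confidence, enabling efficient online 4D dynamic scene reconstruction with markedly better motion fidelity and temporal consistency.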

Details Motivation: Existing online dynamic scene reconstruction methods cannot model the true 3D motion of each Gaussian; relying on photometric loss alone distorts motion estimates and fails to separate dynamic from static regions. Method: MoRGS (1) uses optical flow on a sparse set of key views as lightweight motion supervision; (2) learns a per-Gaussian motion offset field that aligns projected 3D motion with observed flow; (3) introduces a per-Gaussian motion confidence that weights attribute updates and suppresses redundant motion in static regions. Result: On multiple dynamic scene datasets, MoRGS achieves the best reconstruction quality and motion fidelity among online methods while maintaining streamable performance. Conclusion: By explicitly modeling per-Gaussian motion and fusing motion cues across views and time, MoRGS improves the realism and stability of online 4D reconstruction without sacrificing efficiency. Abstract: Online reconstruction of dynamic scenes aims to learn from streaming multi-view inputs under low-latency constraints. The fast training and real-time rendering capabilities of 3D Gaussian Splatting have made on-the-fly reconstruction practically feasible, enabling online 4D reconstruction. However, existing online approaches, despite their efficiency and visual quality, fail to learn per-Gaussian motion that reflects true scene dynamics. Without explicit motion cues, appearance and motion are optimized solely under photometric loss, causing per-Gaussian motion to chase pixel residuals rather than true 3D motion. To address this, we propose MoRGS, an efficient online per-Gaussian motion reasoning framework that explicitly models per-Gaussian motion to improve 4D reconstruction quality. Specifically, we leverage optical flow on a sparse set of key views as lightweight motion cues that regularize per-Gaussian motion beyond photometric supervision. To compensate for the sparsity of flow supervision, we learn a per-Gaussian motion offset field that reconciles discrepancies between projected 3D motion and observed flow across views and time. In addition, we introduce a per-Gaussian motion confidence that separates dynamic from static Gaussians and weights Gaussian attribute residual updates, thereby suppressing redundant motion in static regions for better temporal consistency and accelerating the modeling of large motions. Extensive experiments demonstrate that MoRGS achieves state-of-the-art reconstruction quality and motion fidelity among online methods, while maintaining streamable performance.

[88] GaussFusion: Improving 3D Reconstruction in the Wild with A Geometry-Informed Video Generator

Liyuan Zhu,Manjunath Narayana,Michal Stary,Will Hutchcroft,Gordon Wetzstein,Iro Armeni

Main category: cs.CV

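TL;DR: GaussFusion proposes a geometry-informed video generation method that improves 3D Gaussian splatting (3DGS) reconstruction in the wild, substantially reducing floaters, flickering, and blur, supporting multiple reconstruction paradigms, and enabling real-time, high-quality novel-view synthesis.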

Details Motivation: Address the common artifacts of 3D Gaussian splatting in real scenes, such as floaters, flickering, and blur, caused by camera pose errors, incomplete coverage, and noisy geometry initialization. Method: A geometry-informed video-to-video generator that takes rendered Gaussian primitive buffers (depth, normals, opacity, and covariance) as input, plus an artifact synthesis pipeline that simulates diverse degradation patterns to improve generalization. Result: State-of-the-art performance on novel-view synthesis benchmarks; an efficient variant runs in real time at 21 FPS, supporting interactive 3D applications. Conclusion: GaussFusion is a general, robust, and efficient post-processing framework that works with both optimization-based and feed-forward 3DGS methods and markedly improves in-the-wild reconstruction quality and temporal consistency. Abstract: We present GaussFusion, a novel approach for improving 3D Gaussian splatting (3DGS) reconstructions in the wild through geometry-informed video generation. GaussFusion mitigates common 3DGS artifacts, including floaters, flickering, and blur caused by camera pose errors, incomplete coverage, and noisy geometry initialization. Unlike prior RGB-based approaches limited to a single reconstruction pipeline, our method introduces a geometry-informed video-to-video generator that refines 3DGS renderings across both optimization-based and feed-forward methods. Given an existing reconstruction, we render a Gaussian primitive video buffer encoding depth, normals, opacity, and covariance, which the generator refines to produce temporally coherent, artifact-free frames. We further introduce an artifact synthesis pipeline that simulates diverse degradation patterns, ensuring robustness and generalization. GaussFusion achieves state-of-the-art performance on novel-view synthesis benchmarks, and an efficient variant runs in real time at 21 FPS while maintaining similar performance, enabling interactive 3D applications.

[89] Synergistic Event-SVE Imaging for Quantitative Propellant Combustion Diagnostics

Jing Tao,Taihang Lei,Banglei Guan,Ying Qu,Xudong Na,Likun Ma,Yang Shang,Qifeng Yu

Main category: cs.CV

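TL;DR: This paper proposes a closed-loop Event-SVE measurement system that combines a spatially variant exposure (SVE) camera with neuromorphic event cameras, enabling real-time 3D monitoring of combustion under extreme conditions such as high dynamic range, heavy smoke, and microsecond-scale particle motion.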

Details Motivation: In real-time monitoring of high-energy propellant combustion, conventional imaging suffers from saturation, motion blur, and unstable particle extraction caused by high dynamic range, microsecond-scale particle motion, and heavy smoke. Method: A closed-loop Event-SVE system: the SVE camera produces HDR images with a smoke-aware fusion strategy; a multi-cue smoke-likelihood map separates particle emission from smoke scattering; the HDR intensity serves as the absolute-intensity reference for the event cameras to suppress smoke-driven event artifacts; and, on the cleaned event data, a stereo event-based 3D pipeline estimates separation height and equivalent particle size through feature extraction and triangulation. Result: Experiments on boron-based propellants yield multimodal equivalent-radius statistics and capture fast separation transients that conventional sensors struggle to observe; the maximum calibration error is 0.56%; microsecond-resolved 3D combustion measurement is achieved under smoke occlusion. Conclusion: The framework provides a practical, calibration-consistent route to microsecond-resolved 3D measurement of high-dynamic-range combustion under smoke interference. Abstract: Real-time monitoring of high-energy propellant combustion is difficult. Extreme high dynamic range (HDR), microsecond-scale particle motion, and heavy smoke often occur together. These conditions drive saturation, motion blur, and unstable particle extraction in conventional imaging. We present a closed-loop Event-SVE measurement system that couples a spatially variant exposure (SVE) camera with a stereo pair of neuromorphic event cameras. The SVE branch produces HDR maps with an explicit smoke-aware fusion strategy. A multi-cue smoke-likelihood map is used to separate particle emission from smoke scattering, yielding calibrated intensity maps for downstream analysis. The resulting HDR maps also provide the absolute-intensity reference missing in event cameras. This reference is used to suppress smoke-driven event artifacts and to improve particle-state discrimination. Based on the cleaned event observations, a stereo event-based 3D pipeline estimates separation height and equivalent particle size through feature extraction and triangulation (maximum calibration error 0.56%). Experiments on boron-based propellants show multimodal equivalent-radius statistics. The system also captures fast separation transients that are difficult to observe with conventional sensors. Overall, the proposed framework provides a practical, calibration-consistent route to microsecond-resolved 3D combustion measurement under smoke-obscured HDR conditions.

[90] Learning Explicit Continuous Motion Representation for Dynamic Gaussian Splatting from Monocular Videos

Xuankai Zhang,Junjin Xiao,Shangwei Huang,Wei-shi Zheng,Qing Zhang

Main category: cs.CV

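TL;DR: This paper proposes a method for high-quality dynamic Gaussian splatting from monocular video. It explicitly models the continuous position and orientation deformation of Gaussians with SE(3) B-spline motion bases, and adds an adaptive control mechanism, a soft segment reconstruction strategy, and a multi-view diffusion model to improve modeling capacity, reduce motion interference, and avoid overfitting.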

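Details Motivation: Existing dynamic Gaussian splatting methods struggle to model complex continuous motion accurately and are vulnerable to long-interval motion interference and the limits of a monocular viewpoint, which caps novel-view synthesis quality. Method: Explicitly models continuous position and orientation deformation of Gaussians with SE(3) B-spline motion bases; an adaptive mechanism dynamically adjusts the number of motion bases and control points; a soft segment reconstruction strategy mitigates long-interval motion interference; and a multi-view diffusion model supplies cross-view priors to suppress overfitting. Result: The method clearly outperforms state-of-the-art methods on novel-view synthesis, giving higher-quality and more robust dynamic scene reconstruction. Conclusion: Through geometry-aware continuous motion modeling and multi-view regularization, the method improves the accuracy and generalization of Gaussian splatting reconstruction for monocular dynamic scenes. Abstract: We present an approach for high-quality dynamic Gaussian Splatting from monocular videos. To this end, we go one step beyond previous methods and explicitly model the continuous position and orientation deformation of dynamic Gaussians, using SE(3) B-spline motion bases with a compact set of control points. To improve computational efficiency while enhancing the ability to model complex motions, an adaptive control mechanism is devised to dynamically adjust the number of motion bases and control points. Besides, we develop a soft segment reconstruction strategy to mitigate long-interval motion interference, and employ a multi-view diffusion model to provide multi-view cues for avoiding overfitting to training views. Extensive experiments demonstrate that our method outperforms state-of-the-art methods in novel view synthesis. Our code is available at https://github.com/hhhddddddd/se3bsplinegs.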

[91] GIFT: Global Irreplaceability Frame Targeting for Efficient Video Understanding

Junpeng Ma,Sashuai Zhou,Guanghao Li,Xin Gao,Yue Cao,Hengyu Zeng,Yuxiang Yan,Zhibin Wang,Jun Song,Bo Zheng,Shanghang Zhang,Jian Pu

Main category: cs.CV

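TL;DR: This paper proposes the GIFT framework, which selects keyframes by assessing each frame's intrinsic irreplaceability, addressing the high computational cost of dense frame processing in existing video large language models.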

Details Motivation: Existing keyframe selection methods decide greedily and evaluate relevance and diversity separately, so they tend to fall into local optima and mistakenly select irrelevant noise frames. Method: GIFT (Global Irreplaceability Frame Targeting) first introduces Directed Diversity to quantify a frame's uniqueness conditioned on relevance, yielding a unified irreplaceability score; a budget-aware refinement strategy then selects the core frames with the highest irreplaceability and, as the budget grows, progressively adds key temporal context around them. Result: On LLaVA-Video-7B, GIFT achieves a maximum average improvement of 12.5% over uniform sampling across long-form video benchmarks. Conclusion: GIFT is a training-free keyframe selection method that improves the understanding performance of video large language models more accurately and efficiently. Abstract: Video Large Language Models (VLMs) have achieved remarkable success in video understanding, but the significant computational cost from processing dense frames severely limits their practical application. Existing methods alleviate this by selecting keyframes, but their greedy decision-making, combined with a decoupled evaluation of relevance and diversity, often falls into local optima and results in erroneously selecting irrelevant noise frames. To address these challenges, we propose GIFT: Global Irreplaceability Frame Targeting, a novel training-free framework that selects frames by assessing their intrinsic irreplaceability. Specifically, we first introduce Directed Diversity to quantify a frame's uniqueness conditioned on relevance, which allows us to formulate a unified irreplaceability score. Subsequently, our Budget-Aware Refinement strategy employs an adaptive iterative process that first secures a core set of frames with the highest irreplaceability, and then shifts its priority to building crucial temporal context around these selections as the budget expands. Extensive experiments demonstrate that GIFT achieves a maximum average improvement of 12.5% across long-form video benchmarks on LLaVA-Video-7B compared to uniform sampling.

[92] Z-Erase: Enabling Concept Erasure in Single-Stream Diffusion Transformers

Nanxiang Jiang,Zhaoxin Fan,Baisen Wang,Daiheng Gao,Junhang Cheng,Jifeng Guo,Yalan Qin,Yeying Jin,Hongwei Zheng,Faguo Wu,Wenjun Wu

Main category: cs.CV

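TL;DR: This paper proposes Z-Erase, the first concept erasure method designed for single-stream diffusion transformers such as Z-Image. Using a decoupled update mechanism and a Lagrangian-guided adaptive modulation algorithm, it balances erasure of sensitive concepts against image fidelity while avoiding generation collapse, and its convergence is proven theoretically.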

Details Motivation: Existing concept erasure methods tend to cause generation collapse on single-stream diffusion transformers such as Z-Image, because text and image tokens share parameters and are modeled as one sequence, whereas earlier methods were mostly designed for U-Net or dual-stream architectures. Method: A Stream Disentangled Concept Erasure Framework decouples parameter updates; within it, Lagrangian-Guided Adaptive Erasure Modulation, a constrained-optimization algorithm, adaptively modulates erasure; a convergence analysis proves that the method can reach a Pareto stationary point. Result: Z-Erase overcomes generation collapse and achieves state-of-the-art performance across a range of erasure tasks. Conclusion: Z-Erase is the first concept erasure method for single-stream T2I diffusion transformers, combining stability, controllability, and theoretical guarantees, and offers a new approach to safety alignment for this emerging architecture. Abstract: Concept erasure serves as a vital safety mechanism for removing unwanted concepts from text-to-image (T2I) models. While extensively studied in U-Net and dual-stream architectures (e.g., Flux), this task remains under-explored in the recent emerging paradigm of single-stream diffusion transformers (e.g., Z-Image). In this new paradigm, text and image tokens are processed as a single unified sequence via shared parameters. Consequently, directly applying prior erasure methods typically leads to generation collapse. To bridge this gap, we introduce Z-Erase, the first concept erasure method tailored for single-stream T2I models. To guarantee stable image generation, Z-Erase first proposes a Stream Disentangled Concept Erasure Framework that decouples updates and enables existing methods on single-stream models. Subsequently, within this framework, we introduce Lagrangian-Guided Adaptive Erasure Modulation, a constrained algorithm that further balances the sensitive erasure-preservation trade-off. Moreover, we provide a rigorous convergence analysis proving that Z-Erase can converge to a Pareto stationary point. Experiments demonstrate that Z-Erase successfully overcomes the generation collapse issue, achieving state-of-the-art performance across a wide range of tasks.

[93] Bridging Perception and Reasoning: Token Reweighting for RLVR in Multimodal LLMs

Jinda Lu,Junkang Wu,Jinghan Li,Kexin Huang,Shuo Yang,Guoyin Wang,Jiancan Wu,Xiang Wang,Xiangnan He

Main category: cs.CV

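TL;DR: This paper proposes the Token-Reweighting (ToR) strategy to address the optimization difficulty that arises in reinforcement learning for multimodal large language models because perception and reasoning tokens are coupled. By dynamically reweighting critical tokens, it explicitly models their interdependence and achieves state-of-the-art performance on multiple benchmarks.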

Details Motivation: In RLVR, multimodal large language models (MLLMs) face a fundamental challenge: perception-related tokens (visual grounding) and reasoning-related tokens (symbolic reasoning) are inherently coupled and hard to optimize in isolation. Method: A plug-and-play Token-Reweighting (ToR) strategy that identifies critical tokens of both types through token-level empirical analysis and dynamically reweights them during RLVR training to model their interdependence explicitly; it can be applied on top of existing methods such as GRPO and DAPO. Result: ToR delivers consistent gains across multiple multimodal reasoning benchmarks, achieving both accurate visual grounding and coherent reasoning at a state-of-the-art level. Conclusion: Perception and reasoning tokens are inherently coupled and should be modeled jointly rather than optimized separately; by dynamically reweighting tokens, ToR captures this dependence and is an effective, general strategy for improving MLLMs under RLVR. Abstract: Extending Reinforcement Learning with Verifiable Rewards (RLVR) to multimodal large language models (MLLMs) faces a fundamental challenge: their responses inherently interleave perception-related tokens, which ground visual content, with reasoning-related tokens, which construct reasoning chains. These token types instantiate distinct yet interdependent capacities (visual grounding and symbolic reasoning), making isolated optimization insufficient. Through token-level empirical analysis, we demonstrate that optimizing either perception- or reasoning-only tokens consistently underperforms full optimization, underscoring their inherent coupling. To address this, we propose a plug-and-play Token-Reweighting (ToR) strategy that explicitly models this interdependence by identifying critical tokens of both types and dynamically reweighting them during RLVR training. Applied on top of existing methods (e.g., GRPO and DAPO), ToR delivers consistent performance gains across multiple multi-modal reasoning benchmarks, achieving state-of-the-art performance with both accurate visual grounding and coherent reasoning.
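The abstract does not say how critical tokens are identified or how the weights are set, so the following is only a generic sketch of token-level loss reweighting under stated assumptions: a boolean mask marking critical tokens is given, critical tokens receive a fixed multiplier `boost`, and the loss is a weight-normalized mean. All names and the weighting rule are hypothetical.

```python
def reweighted_loss(token_losses, critical_mask, boost=2.0):
    """Weighted mean of per-token losses, upweighting tokens flagged as critical.

    token_losses:  per-token loss values for one response
    critical_mask: True where a token is treated as critical (assumed given)
    boost:         hypothetical multiplier applied to critical tokens
    """
    weights = [boost if is_critical else 1.0 for is_critical in critical_mask]
    weighted_sum = sum(w * loss for w, loss in zip(weights, token_losses))
    return weighted_sum / sum(weights)

# Two tokens, the second flagged as critical: (1*1.0 + 2*3.0) / 3
print(reweighted_loss([1.0, 3.0], [False, True]))
```

With no tokens flagged, the function reduces to the plain mean, which makes the reweighting easy to switch off for comparison.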

[94] Visual Attention Drifts, but Anchors Hold: Mitigating Hallucination in Multimodal Large Language Models via Cross-Layer Visual Anchors

Chengxu Yang,Jingling Yuan,Chuang Hu,Jiawei Jiang

Main category: cs.CV

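TL;DR: This paper proposes CLVA (Cross-Layer Visual Anchors), which mitigates object hallucination in multimodal large language models by reinforcing critical mid-layer visual features and suppressing the noise that deep-layer attention regresses toward, without extra training and with little computational overhead.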

Details Motivation: Existing methods offer little interpretability regarding attention drift in the final layers and pay insufficient attention to how visual features evolve across layers. Method: CLVA is a training-free method that uses critical mid-layer visual anchors captured from attention dynamics to reinforce mid-layer features and suppress the regression of early visual noise into deep layers. Result: CLVA is validated across multiple architectures and benchmarks, substantially mitigating object hallucination without a significant increase in computation time or GPU memory. Conclusion: Deep-layer attention hallucination stems from early visual noise resurfacing in deep layers; mid-layer visual anchors are more reliable than final-layer ones; CLVA offers an efficient, interpretable, plug-and-play approach to suppressing hallucination. Abstract: Multimodal Large Language Models often suffer from object hallucination. While existing research utilizes attention enhancement and visual retracing, we find these works lack sufficient interpretability regarding attention drift in final model stages. In this paper, we investigate the layer-wise evolution of visual features and discover that hallucination stems from deep-layer attention regressing toward initial visual noise from early layers. We observe that output reliability depends on acquiring visual anchors at intermediate layers rather than final layers. Based on these insights, we propose CLVA, which stands for Cross-Layer Visual Anchors, a training-free method that reinforces critical mid-layer features while suppressing regressive noise. This approach effectively pulls deep-layer attention back to correct visual regions by utilizing essential anchors captured from attention dynamics. We evaluate our method across diverse architectures and benchmarks, demonstrating outstanding performance without significant increase in computational time and GPU memory.

[95] THEMIS: Towards Holistic Evaluation of MLLMs for Scientific Paper Fraud Forensics

Tzu-Yen Ma,Bo Zhang,Zichen Tang,Junpeng Ding,Haolin Tian,Yuanze Li,Zhuodi Hao,Zixin Ding,Zirui Wang,Xinyu Yu,Shiyao Peng,Yizhuo Zhao,Ruomeng Jiang,Yiling Huang,Peizhi Zhao,Jiayuan Chen,Weisheng Tan,Haocheng Gao,Yang Liu,Jiacheng Liu,Zhongjun Yang,Jiayu Huang,Haihong E

Main category: cs.CV

TL;DR: THEMIS is a new multi-task benchmark for evaluating multimodal large language models (MLLMs) on visual fraud reasoning using real-world academic fraud cases, featuring complex images, diverse fraud types, fine-grained manipulations, and multi-dimensional capability assessment.

Details Motivation: Existing benchmarks lack realism, diversity, and granularity in evaluating MLLMs’ ability to reason about visual academic fraud; THEMIS addresses these gaps with real retracted-paper cases and synthetic multimodal data. Method: THEMIS constructs a benchmark of 4,000+ questions across seven real-world scenarios, incorporates 60.47% complex-texture images, covers five fraud types and 16 fine-grained manipulation operations per sample, and maps fraud types to five core visual fraud reasoning capabilities for multi-dimensional evaluation. Result: Experiments on 16 leading MLLMs show even the top-performing model (GPT-5) achieves only 56.15% overall accuracy, confirming THEMIS’s rigor and highlighting current models’ limitations in visual fraud reasoning. Conclusion: THEMIS sets a new standard for evaluating MLLMs in complex, real-world visual fraud reasoning and is expected to drive progress in developing more robust and capable models for academic integrity applications. Abstract: We present THEMIS, a novel multi-task benchmark designed to comprehensively evaluate multimodal large language models (MLLMs) on visual fraud reasoning within real-world academic scenarios. Compared to existing benchmarks, THEMIS introduces three major advances. (1) Real-World Scenarios and Complexity: Our benchmark comprises over 4,000 questions spanning seven scenarios, derived from authentic retracted-paper cases and carefully curated multimodal synthetic data. With 60.47% complex-texture images, THEMIS bridges the critical gap between existing benchmarks and the complexity of real-world academic fraud. (2) Fraud-Type Diversity and Granularity: THEMIS systematically covers five challenging fraud types and introduces 16 fine-grained manipulation operations. On average, each sample undergoes multiple stacked manipulation operations, with the diversity and difficulty of these manipulations demanding a high level of visual fraud reasoning from the models. 
(3) Multi-Dimensional Capability Evaluation: We establish a mapping from fraud types to five core visual fraud reasoning capabilities, thereby enabling an evaluation that reveals the distinct strengths and specific weaknesses of different models across these core capabilities. Experiments on 16 leading MLLMs show that even the best-performing model, GPT-5, achieves an overall performance of only 56.15%, demonstrating that our benchmark presents a stringent test. We expect THEMIS to advance the development of MLLMs for complex, real-world fraud reasoning tasks.

[96] Pixelis: Reasoning in Pixels, from Seeing to Acting

Yunpeng Zhou

Main category: cs.CV

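TL;DR: Pixelis is a vision-language agent that operates directly in pixel space. It learns and acts through an executable pixel-level toolchain (zoom, segmentation, tracking, OCR, and so on), achieving embodied adaptation in the physical world without external feedback.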

Details Motivation: Most existing vision-language systems are static observers that lack the ability to act and a mechanism for safe online improvement, so they fall short of generalizable, physically grounded visual intelligence; the shift needed is from describing pixels to acting on pixels. Method: Pixelis trains in three phases: (1) supervised fine-tuning, which learns a pixel-tool grammar with a masked imitation loss; (2) curiosity-coherence reward fine-tuning, which combines prediction-error curiosity, adjacent-step coherence, and an efficiency prior; (3) pixel test-time reinforcement learning, which performs label-free adaptation through neighbor retrieval, trajectory voting, and KL-anchored safe updates. Result: Across six public image and video benchmarks, the average relative gain is +4.08% (peaking at +6.03%), with shorter, auditable toolchains and stable, KL-constrained drift during test-time learning. Conclusion: Pixelis shows that acting directly in pixel space is a key path toward embodied, adaptive, physically grounded multimodal intelligence, and offers a new paradigm for moving beyond purely linguistic tokens toward real-world interaction. Abstract: Most vision-language systems are static observers: they describe pixels, do not act, and cannot safely improve under shift. This passivity limits generalizable, physically grounded visual intelligence. Learning through action, not static description, is essential beyond curated data. We present Pixelis, a pixel-space agent that operates directly on images and videos via a compact set of executable operations (zoom/crop, segment, track, OCR, temporal localization) and learns from its consequences. Pixelis trains in three phases: (1) Supervised Fine-Tuning learns a pixel-tool grammar from Chain-of-Thought-Action traces with a masked imitation loss that upweights operation/argument tokens and auxiliary heads to stabilize pixel-grounded arguments; (2) Curiosity-Coherence Reward Fine-Tuning optimizes a dual-drive objective marrying prediction-error curiosity with adjacent-step coherence and a mild efficiency prior under a KL anchor, yielding short, valid, structured toolchains; (3) Pixel Test-Time RL performs label-free adaptation by retrieving neighbors, voting over complete trajectories rather than answers, and updating toward short, high-fidelity exemplars while constraining drift with a KL-to-EMA safety control. Across six public image and video benchmarks, Pixelis yields consistent improvements: the average relative gain is +4.08% over the same 8B baseline (peaking at +6.03% on VSI-Bench), computed as (ours-baseline)/baseline, while producing shorter, auditable toolchains and maintaining in-corridor KL during test-time learning. 
Acting within pixels, rather than abstract tokens, grounds multimodal perception in the physical world, linking visual reasoning with actionable outcomes, and enables embodied adaptation without external feedback.
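The abstract defines its headline metric as (ours-baseline)/baseline. A small sketch makes the distinction from an absolute point difference explicit. The scores below are made-up values chosen only so that the relative gain comes out at 4.08%; they are not results from the paper.

```python
def relative_gain(ours, baseline):
    """Relative improvement, computed as (ours - baseline) / baseline."""
    return (ours - baseline) / baseline

# Hypothetical benchmark scores, for illustration only
baseline_score, our_score = 50.0, 52.04
print(f"absolute gain: {our_score - baseline_score:.2f} points")
print(f"relative gain: {100 * relative_gain(our_score, baseline_score):.2f}%")
```

At a baseline of 50 points, a +4.08% relative gain is only about 2 points in absolute terms, which is why the choice of formula matters when comparing reported numbers.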

[97] Label What Matters: Modality-Balanced and Difficulty-Aware Multimodal Active Learning

Yuqiao Zeng,Xu Wang,Tengfei Liang,Yiqing Hao,Yi Jin,Hui Yu

Main category: cs.CV

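TL;DR: This paper proposes RL-MBA, a framework that uses reinforcement learning for modality-balanced, difficulty-aware multimodal active learning. By adaptively adjusting modality weights and estimating sample difficulty through evidential fusion, it improves classification accuracy and modality fairness under limited labeling budgets.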

Details Motivation: Existing multimodal active learning methods usually assume that modality importance stays constant across annotation rounds, ignoring how modality value and sample difficulty change during training, so their selection strategies lack adaptability. Method: RL-MBA models sample selection as a Markov Decision Process with two core modules: (1) Adaptive Modality Contribution Balancing (AMCB), which dynamically adjusts modality weights via reinforcement feedback; (2) Evidential Fusion for Difficulty-Aware Policy Adjustment (EFDA), which estimates sample difficulty from uncertainty to guide selection. Result: On Food101, KineticsSound, and VGGSound, RL-MBA consistently outperforms strong baselines under limited labeling budgets, improving both classification accuracy and modality fairness. Conclusion: Dynamically modeling modality contribution and sample difficulty is essential for multimodal active learning; RL-MBA uses reinforcement learning to select samples for annotation more robustly, fairly, and efficiently. Abstract: Multimodal learning integrates complementary information from different modalities such as image, text, and audio to improve model performance, but its success relies on large-scale labeled data, which is costly to obtain. Active learning (AL) mitigates this challenge by selectively annotating informative samples. In multimodal settings, many approaches implicitly assume that modality importance is stable across rounds and keep selection rules fixed at the fusion stage, which leaves them insensitive to the dynamic nature of multimodal learning, where the relative value of modalities and the difficulty of instances shift as training proceeds. To address this issue, we propose RL-MBA, a reinforcement-learning framework for modality-balanced, difficulty-aware multimodal active learning. RL-MBA models sample selection as a Markov Decision Process, where the policy adapts to modality contributions, uncertainty, and diversity, and the reward encourages accuracy gains and balance. Two key components drive this adaptability: (1) Adaptive Modality Contribution Balancing (AMCB), which dynamically adjusts modality weights via reinforcement feedback, and (2) Evidential Fusion for Difficulty-Aware Policy Adjustment (EFDA), which estimates sample difficulty via uncertainty-based evidential fusion to prioritize informative samples. Experiments on Food101, KineticsSound, and VGGSound demonstrate that RL-MBA consistently outperforms strong baselines, improving both classification accuracy and modality fairness under limited labeling budgets.

[98] MSRL: Scaling Generative Multimodal Reward Modeling via Multi-Stage Reinforcement Learning

Chenglong Wang,Yifu Huo,Yang Gan,Qiaozhi He,Qi Meng,Bei Li,Yan Wang,Junfu Liu,Tianhua Zhou,Jingbo Zhu,Tong Xiao

Main category: cs.CV

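TL;DR: This paper proposes a Multi-Stage Reinforcement Learning (MSRL) approach that first trains on large-scale textual preference data, then progressively transfers to multimodal tasks, and adds cross-modal knowledge distillation, substantially improving generative multimodal reward models (MRMs) with little or no multimodal annotation.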

Details Motivation: Multimodal reward models (MRMs) trained with reinforcement learning from verifiable rewards (RLVR) depend heavily on multimodal preference annotations that are expensive and hard to obtain, which limits scalability. Method: The MSRL framework has three stages: first, learn general reward reasoning from large-scale textual preference data; second, transfer to multimodal tasks through caption-based reinforcement learning; third, perform fully multimodal reinforcement learning. Cross-modal knowledge distillation is added to strengthen preference generalization. Result: MSRL raises accuracy from 66.6% to 75.9% on VL-RewardBench and from 70.2% to 75.7% on GenAI-Bench, without additional multimodal preference annotations. Conclusion: MSRL offers an efficient, scalable RLVR training paradigm that substantially reduces the dependence of multimodal reward modeling on annotated data and supports the practical use of MRMs in visual understanding and generation tasks. Abstract: Recent advances in multimodal reward modeling have been largely driven by a paradigm shift from discriminative to generative approaches. Building on this progress, recent studies have further employed reinforcement learning from verifiable rewards (RLVR) to enhance multimodal reward models (MRMs). Despite their success, RLVR-based training typically relies on labeled multimodal preference data, which are costly and labor-intensive to obtain, making it difficult to scale MRM training. To overcome this limitation, we propose a Multi-Stage Reinforcement Learning (MSRL) approach, which can achieve scalable RL for MRMs with limited multimodal data. MSRL replaces the conventional RLVR-based training paradigm by first learning a generalizable reward reasoning capability from large-scale textual preference data, and then progressively transferring this capability to multimodal tasks through caption-based and fully multimodal reinforcement-learning stages. Furthermore, we introduce a cross-modal knowledge distillation approach to improve preference generalization within MSRL. Extensive experiments demonstrate that MSRL effectively scales the RLVR-based training of generative MRMs and substantially improves their performance across both visual understanding and visual generation tasks (e.g., from 66.6% to 75.9% on VL-RewardBench and from 70.2% to 75.7% on GenAI-Bench), without requiring additional multimodal preference annotations. Our code is available at: https://github.com/wangclnlp/MSRL.

[99] MoireMix: A Formula-Based Data Augmentation for Improving Image Classification Robustness

Yuto Matsuo,Yoshihiro Fukuhara,Yuki M. Asano,Rintaro Yanagi,Hirokatsu Kataoka,Akio Nakamura

Main category: cs.CV

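TL;DR: This paper proposes a lightweight, storage-free procedural data augmentation method based on Moire interference patterns that needs no external data and markedly improves the performance of image classification models (especially ViTs) on robustness benchmarks such as ImageNet-C, ImageNet-R, and adversarial examples.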

Details Motivation: Existing data augmentation methods, such as diffusion-based synthesis or feature mixing, are computationally expensive or depend on external data, so an efficient, lightweight approach that needs no external data is desirable. Method: A procedural augmentation method based on analytic Moire interference patterns that uses a closed-form mathematical formula to generate multi-scale structured perturbation textures on the fly; the textures are synthesized in memory, mixed into training images, and then discarded. Result: With Vision Transformers, the method consistently outperforms standard augmentation and other external-data-free augmentation methods on ImageNet-C, ImageNet-R, and adversarial benchmarks, at a cost of only 0.0026 seconds per image. Conclusion: Analytic interference patterns are a practical, efficient alternative to generative augmentation and offer a new direction for robustness training. Abstract: Data augmentation is a key technique for improving the robustness of image classification models. However, many recent approaches rely on diffusion-based synthesis or complex feature mixing strategies, which introduce substantial computational overhead or require external datasets. In this work, we explore a different direction: procedural augmentation based on analytic interference patterns. Unlike conventional augmentation methods that rely on stochastic noise, feature mixing, or generative models, our approach exploits Moire interference to generate structured perturbations spanning a wide range of spatial frequencies. We propose a lightweight augmentation method that procedurally generates Moire textures on-the-fly using a closed-form mathematical formulation. The patterns are synthesized directly in memory with negligible computational cost (0.0026 seconds per image), mixed with training images during training, and immediately discarded, enabling a storage-free augmentation pipeline without external data. Extensive experiments with Vision Transformers demonstrate that the proposed method consistently improves robustness across multiple benchmarks, including ImageNet-C, ImageNet-R, and adversarial benchmarks, outperforming standard augmentation baselines and existing external-data-free augmentation approaches. These results suggest that analytic interference patterns provide a practical and efficient alternative to data-driven generative augmentation methods.
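The abstract does not give the paper's closed-form formula, so the sketch below uses a textbook construction instead: multiplying two sinusoidal gratings that differ slightly in frequency and orientation, whose beat produces Moire fringes. The parameter values, the grayscale [0, 1] image format, and the linear mixing rule are all assumptions made for illustration.

```python
import math

def grating(x, y, freq, theta):
    """Sinusoidal grating with values in [0, 1] at pixel (x, y)."""
    phase = 2.0 * math.pi * freq * (x * math.cos(theta) + y * math.sin(theta))
    return 0.5 * (1.0 + math.cos(phase))

def moire_pattern(height, width, freq=0.20, delta_freq=0.02, delta_theta=0.05):
    """Product of two nearly identical gratings; their beat forms the fringes."""
    return [[grating(x, y, freq, 0.0) * grating(x, y, freq + delta_freq, delta_theta)
             for x in range(width)]
            for y in range(height)]

def mix(image, pattern, alpha=0.3):
    """Linear blend of a grayscale image (values in [0, 1]) with the pattern."""
    return [[(1.0 - alpha) * pixel + alpha * texture
             for pixel, texture in zip(image_row, pattern_row)]
            for image_row, pattern_row in zip(image, pattern)]

# Generate a small pattern, blend it into a blank image, then discard it
pattern = moire_pattern(4, 6)
augmented = mix([[0.0] * 6 for _ in range(4)], pattern, alpha=0.5)
```

Because the pattern is a pure function of pixel coordinates and a few parameters, nothing needs to be stored between training steps, which matches the storage-free pipeline the abstract describes.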

[100] AnyDoc: Enhancing Document Generation via Large-Scale HTML/CSS Data Synthesis and Height-Aware Reinforcement Optimization

Jiawei Lin,Wanrong Zhu,Vlad I Morariu,Christopher Tensmeyer

Main category: cs.CV

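TL;DR: AnyDoc is a document generation framework that covers multiple tasks and categories. It builds DocHTML, a large-scale synthetic dataset, and fine-tunes multi-modal large language models with height-aware reinforcement learning (HARL), achieving significant gains on three tasks: intention-to-document, document derendering, and element-to-document.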

Details Motivation: Existing document generation methods are limited by human-annotated data that is small in scale and narrow in category coverage, which makes general, robust document generation hard to support. Method: The AnyDoc framework (1) builds a scalable pipeline that automatically synthesizes HTML/CSS documents, producing the large-scale DocHTML dataset with rich metadata (about 265k samples, 111 categories, 32 styles); (2) fine-tunes multi-modal large language models (MLLMs) on this dataset to support three document generation tasks; and (3) adds a height-aware reinforcement learning (HARL) post-training stage that uses the height difference as the reward signal to mitigate content overflow. Result: AnyDoc significantly outperforms general-purpose MLLMs and task-specific baselines on intention-to-document, document derendering, and element-to-document, as shown by qualitative and quantitative experiments. Conclusion: AnyDoc shows that large-scale synthetic data and task-tailored alignment training (HARL in particular) are key to general document generation, and offers a new paradigm for building unified, controllable, and scalable AI document generation systems. Abstract: Document generation has gained growing attention in the field of AI-driven content creation. In this work, we push its boundaries by introducing AnyDoc, a framework capable of handling multiple generation tasks across a wide spectrum of document categories, all represented in a unified HTML/CSS format. To overcome the limited coverage and scale of existing human-crafted document datasets, AnyDoc first establishes a scalable data synthesis pipeline to automatically generate documents in HTML/CSS form. This pipeline yields DocHTML, a large-scale dataset containing 265,206 document samples, while spanning 111 categories and 32 distinct styles. Additionally, all documents are equipped with comprehensive metadata, including design intentions, HTML/CSS source code, visual assets, and rendered screenshots. Building on the curated dataset, AnyDoc fine-tunes multi-modal large language models (MLLMs) to achieve three practical document generation tasks: intention-to-document, document derendering, and element-to-document. To address the content overflow issue observed during fine-tuning, AnyDoc further incorporates a height-aware reinforcement learning (HARL) post-training procedure. By defining a reward function based on the difference between predicted and target document heights, overflow is penalized and gradually mitigated during HARL, thereby enhancing overall performance. Qualitative and quantitative experiments demonstrate that AnyDoc outperforms both general-purpose MLLMs and task-specific baselines across all three tasks.
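
The abstract states only that the HARL reward is based on the difference between predicted and target document heights. A minimal sketch of one such reward follows; the linear form, the normalization by the target height, and the name `height_reward` are assumptions made for illustration, not the paper's definition.

```python
def height_reward(pred_height, target_height):
    """Hypothetical height-aware reward in [0, 1].

    Returns 1.0 when the rendered height matches the target and falls
    linearly with the relative height gap, so overflowing (or badly
    underfilled) documents receive a lower reward.
    """
    if target_height <= 0:
        raise ValueError("target_height must be positive")
    gap = abs(pred_height - target_height) / target_height
    return max(0.0, 1.0 - gap)
```

For example, a document rendered at 1500 px against a 1000 px target has a relative gap of 0.5 and would receive a reward of 0.5 under this sketch.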

[101] AirSplat: Alignment and Rating for Robust Feed-Forward 3D Gaussian Splatting

Minh-Quan Viet Bui,Jaeho Moon,Munchurl Kim

Main category: cs.CV

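TL;DR: This paper proposes the AirSplat framework, which uses Self-Consistent Pose Alignment (SCPA) and Rating-based Opacity Matching (ROM) to transfer the geometric priors of 3D Vision Foundation Models (3DVFMs) to pose-free novel view synthesis (NVS), significantly improving reconstruction quality.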

Details Motivation: Although 3D Vision Foundation Models excel at zero-shot geometry estimation, applying them directly to generalizable novel view synthesis remains challenging; their geometric priors need to be adapted to high-quality, pose-free NVS. Method: AirSplat is a training framework with two key techniques: (1) Self-Consistent Pose Alignment (SCPA), which builds a feedback loop during training to provide pixel-aligned supervision and reduce pose-geometry discrepancy; and (2) Rating-based Opacity Matching (ROM), which uses the local 3D geometry consistency knowledge of a sparse-view NVS teacher model to filter out low-quality rendering primitives. Result: On large-scale benchmarks, AirSplat significantly outperforms state-of-the-art pose-free NVS methods in reconstruction quality. Conclusion: AirSplat shows that adapting 3DVFMs to NVS is feasible and effective, and offers a new paradigm for achieving accurate geometry estimation and high-quality view synthesis at the same time. Abstract: While 3D Vision Foundation Models (3DVFMs) have demonstrated remarkable zero-shot capabilities in visual geometry estimation, their direct application to generalizable novel view synthesis (NVS) remains challenging. In this paper, we propose AirSplat, a novel training framework that effectively adapts the robust geometric priors of 3DVFMs into high-fidelity, pose-free NVS. Our approach introduces two key technical contributions: (1) Self-Consistent Pose Alignment (SCPA), a training-time feedback loop that ensures pixel-aligned supervision to resolve pose-geometry discrepancy; and (2) Rating-based Opacity Matching (ROM), which leverages the local 3D geometry consistency knowledge from a sparse-view NVS teacher model to filter out degraded primitives. Experimental results on large-scale benchmarks demonstrate that our method significantly outperforms state-of-the-art pose-free NVS approaches in reconstruction quality. Our AirSplat highlights the potential of adapting 3DVFMs to enable simultaneous visual geometry estimation and high-quality view synthesis.

[102] Denoise and Align: Towards Source-Free UDA for Robust Panoramic Semantic Segmentation

Yaowen Chang,Zhen Cao,Xu Zheng,Xiaoxin Mi,Zhen Dong

Main category: cs.CV

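TL;DR: This paper proposes the DAPASS framework to address geometric distortion and pseudo-label noise in panoramic semantic segmentation under the source-free unsupervised domain adaptation (SFUDA) setting. Its two modules, PCGD and CRAM, improve cross-domain performance and reach state-of-the-art results on multiple benchmarks.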

Details Motivation: Panoramic semantic segmentation suffers from severe geometric distortion and costly dense annotation. In the source-free unsupervised domain adaptation (SFUDA) setting, where source data is inaccessible, domain shift is amplified, pseudo-labels become unreliable, and minority-class performance drops sharply. Method: The DAPASS framework has two modules: (1) Panoramic Confidence-Guided Denoising (PCGD), which uses perturbation consistency and neighborhood-level confidence to produce high-fidelity, class-balanced pseudo-labels; and (2) the Contextual Resolution Adversarial Module (CRAM), which adversarially aligns high-resolution local details with low-resolution global semantics to reduce scale variance and distortion. Result: DAPASS reaches 55.04% mIoU (+2.05%) on Cityscapes-to-DensePASS (outdoor) and 70.38% mIoU (+1.54%) on Stanford2D3D (indoor), the best results reported to date. Conclusion: DAPASS mitigates pseudo-label noise and geometric distortion in SFUDA panoramic segmentation, shows that robust knowledge transfer without source data is feasible, and advances 360° scene understanding in privacy-sensitive settings. Abstract: Panoramic semantic segmentation is pivotal for comprehensive 360° scene understanding in critical applications like autonomous driving and virtual reality. However, progress in this domain is constrained by two key challenges: the severe geometric distortions inherent in panoramic projections and the prohibitive cost of dense annotation. While Unsupervised Domain Adaptation (UDA) from label-rich pinhole-camera datasets offers a viable alternative, many real-world tasks impose a stricter source-free (SFUDA) constraint where source data is inaccessible for privacy or proprietary reasons. This constraint significantly amplifies the core problems of domain shift, leading to unreliable pseudo-labels and dramatic performance degradation, particularly for minority classes. To overcome these limitations, we propose the DAPASS framework. DAPASS introduces two synergistic modules to robustly transfer knowledge without source data. First, our Panoramic Confidence-Guided Denoising (PCGD) module generates high-fidelity, class-balanced pseudo-labels by enforcing perturbation consistency and incorporating neighborhood-level confidence to filter noise. Second, a Contextual Resolution Adversarial Module (CRAM) explicitly addresses scale variance and distortion by adversarially aligning fine-grained details from high-resolution crops with global semantics from low-resolution contexts. DAPASS achieves state-of-the-art performance on outdoor (Cityscapes-to-DensePASS) and indoor (Stanford2D3D) benchmarks, yielding 55.04% (+2.05%) and 70.38% (+1.54%) mIoU, respectively.

[103] Robust Principal Component Completion

Yinjian Wang,Wei Li,Yuanyuan Gui,James E. Fowler,Gemine Vivone

Main category: cs.CV

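TL;DR: This paper proposes robust principal component completion (RPCC), a new framework that models a probabilistic sparse tensor factorization with variational Bayesian inference and learns the support of the sparse component directly. This avoids post-hoc thresholding and significantly improves foreground extraction and anomaly detection.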

Details Motivation: Classical RPCA assumes that the low-rank and sparse components simply add, but in practice the sparse foreground often occludes or replaces background elements, so the model is mismatched. Method: The RPCC framework identifies the sparse component indirectly through its support. It uses a fully probabilistic Bayesian sparse tensor factorization solved by variational Bayesian inference, and convergence to a hard classifier is shown theoretically. Result: The approach is near-optimal on synthetic data and robust on real color video (foreground extraction) and hyperspectral data (anomaly detection); the source code and appendices are publicly available. Conclusion: By modeling the support and using Bayesian inference, RPCC overcomes the structural assumption of RPCA and makes sparse-component identification more accurate and practical. Abstract: Robust principal component analysis (RPCA) seeks a low-rank component and a sparse component from their summation. Yet, in many applications of interest, the sparse foreground actually replaces, or occludes, elements from the low-rank background. To address this mismatch, a new framework is proposed in which the sparse component is identified indirectly through determining its support. This approach, called robust principal component completion (RPCC), is solved via variational Bayesian inference applied to a fully probabilistic Bayesian sparse tensor factorization. Convergence to a hard classifier for the support is shown, thereby eliminating the post-hoc thresholding required of most prior RPCA-driven approaches. Experimental results reveal that the proposed approach delivers near-optimal estimates on synthetic data as well as robust foreground-extraction and anomaly-detection performance on real color video and hyperspectral datasets, respectively. Source implementation and Appendices are available at https://github.com/WongYinJ/BCP-RPCC.
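
One way to write down the contrast described above is shown below. The notation is ours rather than the paper's: $X$ is the observation, $L$ the low-rank background, $S$ the sparse component, $F$ a hypothetical foreground term, $\Omega$ a binary support mask, and $\odot$ the elementwise product.

$$
\text{RPCA (additive):}\quad X = L + S
\qquad\qquad
\text{replacement view:}\quad X = (1 - \Omega) \odot L + \Omega \odot F,\quad \Omega_{ij\cdots} \in \{0, 1\}
$$

Under the replacement view each entry of $X$ comes either from the background or from the foreground, so estimating $\Omega$ identifies the sparse component indirectly. If inference converges to hard 0/1 values for $\Omega$, as the abstract reports for the support classifier, no threshold has to be chosen afterwards.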

[104] EgoXtreme: A Dataset for Robust Object Pose Estimation in Egocentric Views under Extreme Conditions

Taegyoon Yoon,Yegyu Han,Seojin Ji,Jaewoo Park,Sojeong Kim,Taein Kwon,Hyung-Sin Kim

Main category: cs.CV

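TL;DR: This paper introduces EgoXtreme, a new large-scale dataset for 6D object pose estimation from real-world egocentric views. It covers three challenging scenarios (industrial maintenance, sports, and emergency rescue) to address the shortcomings of existing benchmarks under extreme conditions such as motion blur, dynamic illumination, and visual obstruction, and it shows both the limits of current methods under these conditions and the potential of temporal information.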

Details Motivation: Existing 6D pose estimation benchmarks do not reflect the severe motion blur, dynamic illumination, and visual obstruction found in real egocentric scenarios (such as smart-glasses applications), leaving a large gap between lab performance and real deployment. Method: EgoXtreme, a large-scale 6D pose dataset captured entirely from an egocentric view, is built to cover three highly challenging scenarios. Existing generalizable pose estimators are evaluated on it systematically, and the effectiveness of image restoration (such as deblurring) and of tracking-based methods is analyzed. Result: State-of-the-art generalizable pose estimators generalize markedly worse under EgoXtreme's extreme conditions, especially low light; image restoration alone brings no improvement; tracking methods that use temporal information do improve performance. Conclusion: EgoXtreme fills the gap in benchmarks for real-world egocentric 6D pose estimation and is a key resource for developing and evaluating robust models. Abstract: Smart glasses are emerging as a useful device since they provide plenty of insights in hands-busy, eyes-on-task situations. To understand the context of the wearer, 6D object pose estimation in egocentric view is becoming essential. However, existing 6D object pose estimation benchmarks fail to capture the challenges of real-world egocentric applications, which are often dominated by severe motion blur, dynamic illumination, and visual obstructions. This discrepancy creates a significant gap between controlled lab data and chaotic real-world application. To bridge this gap, we introduce EgoXtreme, a new large-scale 6D pose estimation dataset captured entirely from an egocentric perspective. EgoXtreme features three challenging scenarios - industrial maintenance, sports, and emergency rescue - designed to introduce severe perceptual ambiguities through extreme lighting, heavy motion blur, and smoke. Evaluations of state-of-the-art generalizable pose estimators on EgoXtreme indicate that their generalization fails to hold in extreme conditions, especially under low light. We further demonstrate that simply applying image restoration (e.g., deblurring) offers no positive improvement for extreme conditions. In contrast, a tracking-based approach does show a performance gain, implying that using temporal information in fast-motion scenarios is meaningful. We conclude that EgoXtreme is an essential resource for developing and evaluating the next generation of pose estimation models robust enough for real-world egocentric vision. The dataset and code are available at https://taegyoun88.github.io/EgoXtreme/

[105] SAVe: Self-Supervised Audio-visual Deepfake Detection Exploiting Visual Artifacts and Audio-visual Misalignment

Sahibzada Adil Shahzad,Ammarah Hashmi,Junichi Yamagishi,Yusuke Yasuda,Yu Tsao,Chia-Wen Lin,Yan-Tsung Peng,Hsin-Min Wang

Main category: cs.CV

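TL;DR: This paper proposes SAVe, a self-supervised audio-visual deepfake detection framework trained entirely on authentic videos. It generates identity-preserving, region-aware pseudo-manipulated samples and models lip-speech synchronization to improve detection robustness and cross-dataset generalization.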

Details Motivation: Existing audio-visual deepfake detectors rely heavily on synthetic forgeries for training, which introduces dataset and generator bias and makes it hard to generalize to unseen forgery types. Method: SAVe follows a self-supervised paradigm: (1) it generates identity-preserving, region-aware self-blended pseudo-forgeries on the fly from authentic videos to learn visual cues at multiple facial granularities; and (2) it adds an audio-visual alignment module that models lip-speech synchronization and detects temporal misalignment patterns. Result: On FakeAVCeleb and AV-LipSync-TIMIT, SAVe shows competitive in-domain performance and strong cross-dataset generalization. Conclusion: Self-supervised learning is a scalable, robust paradigm for audio-visual deepfake detection that reduces the dependence on synthetic data. Abstract: Multimodal deepfakes can exhibit subtle visual artifacts and cross-modal inconsistencies, which remain challenging to detect, especially when detectors are trained primarily on curated synthetic forgeries. Such synthetic dependence can introduce dataset and generator bias, limiting scalability and robustness to unseen manipulations. We propose SAVe, a self-supervised audio-visual deepfake detection framework that learns entirely on authentic videos. SAVe generates on-the-fly, identity-preserving, region-aware self-blended pseudo-manipulations to emulate tampering artifacts, enabling the model to learn complementary visual cues across multiple facial granularities. To capture cross-modal evidence, SAVe also models lip-speech synchronization via an audio-visual alignment component that detects temporal misalignment patterns characteristic of audio-visual forgeries. Experiments on FakeAVCeleb and AV-LipSync-TIMIT demonstrate competitive in-domain performance and strong cross-dataset generalization, highlighting self-supervised learning as a scalable paradigm for multimodal deepfake detection.

[106] FD$^2$: A Dedicated Framework for Fine-Grained Dataset Distillation

Hongxu Ma,Guang Li,Shijie Wang,Dongzhan Zhou,Baoli Sun,Takahiro Ogawa,Miki Haseyama,Zhihui Wang

Main category: cs.CV

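TL;DR: This paper proposes FD$^2$, a framework dedicated to fine-grained dataset distillation. It localizes discriminative regions, builds fine-grained representations, and introduces counterfactual attention learning and fine-grained constraints in the pretraining and distillation stages to make distilled samples more discriminative and diverse.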

Details Motivation: Existing decoupled dataset distillation methods rely on coarse class labels. On fine-grained datasets this leaves large intra-class variation with insufficient inter-class separation, and makes same-class samples overly similar, which weakens localized discriminative cues. Method: The FD$^2$ framework (1) uses counterfactual attention learning during pretraining to aggregate discriminative representations and update class prototypes; and (2) adds, during distillation, a fine-grained characteristic constraint (align each sample with its own class prototype and repel the others) and a similarity constraint (diversify attention across same-class samples). Result: Experiments on multiple fine-grained and general datasets show that FD$^2$ integrates seamlessly into decoupled distillation pipelines and improves performance in most settings, indicating strong generalization and transferability. Conclusion: FD$^2$ mitigates the lack of discriminability and diversity in fine-grained dataset distillation and offers a new approach to data-efficient learning for fine-grained visual tasks. Abstract: Dataset distillation (DD) compresses a large training set into a small synthetic set, reducing storage and training cost, and has shown strong results on general benchmarks. Decoupled DD further improves efficiency by splitting the pipeline into pretraining, sample distillation, and soft-label generation. However, existing decoupled methods largely rely on coarse class-label supervision and optimize samples within each class in a nearly identical manner. On fine-grained datasets, this often yields distilled samples that (i) retain large intra-class variation with subtle inter-class differences and (ii) become overly similar within the same class, limiting localized discriminative cues and hurting recognition. To solve these problems, we propose FD$^{2}$, a dedicated framework for Fine-grained Dataset Distillation. FD$^{2}$ localizes discriminative regions and constructs fine-grained representations for distillation. During pretraining, counterfactual attention learning aggregates discriminative representations to update class prototypes. During distillation, a fine-grained characteristic constraint aligns each sample with its class prototype while repelling others, and a similarity constraint diversifies attention across same-class samples. Experiments on multiple fine-grained and general datasets show that FD$^{2}$ integrates seamlessly with decoupled DD and improves performance in most settings, indicating strong transferability.

[107] Learning to Rank Caption Chains for Video-Text Alignment

Ansel Blume,Burak Uzkent,Shalini Chaudhuri,Garin Kessler

Main category: cs.CV

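TL;DR: This paper proposes a video-text alignment method based on ranking optimization as a replacement for conventional binary DPO. It builds totally ordered caption chains by repeatedly degrading captions, outperforms standard DPO on long-form generation and assessment, and finds that the vision encoder must be finetuned, which challenges the received view that DPO acts only on the language layers.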

Details Motivation: Standard DPO uses a binary winner-takes-all strategy and ignores the fact that, in vision-language models, a "losing" response may still be highly faithful to the visual input; it lacks a fine-grained model of how faithful a response is to what is seen. Method: A video-text alignment method based on ranking optimization is proposed. Repeated caption degradation is used to generate challenging, totally ordered chains of video captions at scale, and the need to finetune the vision encoder is tested empirically. Result: Ranking optimization outperforms binary DPO on long-form content generation and assessment, and its effectiveness depends on finetuning the vision encoder, which indicates that DPO is not a purely language-reweighting process. Conclusion: Ranking optimization suits vision-language alignment better than binary DPO; DPO-style methods in practice affect the multimodal encoder as well, so their mechanism and scope of application should be re-examined. Abstract: Direct preference optimization (DPO) is an effective technique to train language models to generate preferred over dispreferred responses. However, this binary "winner-takes-all" approach is suboptimal for vision-language models whose response quality is highly dependent on visual content. In particular, a response may still be faithful to the visual inputs even if it is less preferable than an alternative. The standard Bradley-Terry DPO formulation lacks this nuance, upweighting winning responses without sufficient regard for whether the "losing" response still maintains high visual fidelity. In this work, we investigate ranking optimization as an alternative that more precisely situates responses' faithfulness to visual inputs. We focus on video-text alignment using detailed video captions, proposing a method to generate challenging, totally ordered caption chains at scale through repeated caption degradation. Our results show ranking optimization outperforms binary DPO for long-form content generation and assessment, and importantly, we find that these approaches require finetuning of the vision encoder to be effective, challenging the view of DPO as purely a language-reweighting process.
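
The abstract does not say which ranking objective is used. As an illustration of how a ranking loss over a totally ordered chain generalizes the pairwise Bradley-Terry loss, the sketch below uses the Plackett-Luce likelihood, a standard listwise choice and an assumption on our part. The `scores` stand in for per-caption scores (in DPO these would typically be scaled log-probability ratios against a reference model); the function names are hypothetical.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def bradley_terry_nll(score_win, score_lose):
    """Pairwise loss: -log sigmoid(score_win - score_lose)."""
    return math.log1p(math.exp(-(score_win - score_lose)))


def plackett_luce_nll(scores):
    """Listwise loss for scores ordered from best caption to worst.

    At each position the caption that should rank first among the
    remaining ones is treated as the winner of a softmax over them.
    """
    return sum(
        logsumexp(scores[i:]) - scores[i] for i in range(len(scores) - 1)
    )
```

With a chain of length two the listwise loss reduces to the pairwise one, and a chain scored in the correct order yields a lower loss than the same scores reversed, so every degraded caption in the chain contributes a graded signal instead of a single win/lose label.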

[108] Photon: Speedup Volume Understanding with Efficient Multimodal Large Language Models

Chengyu Fang,Heng Guo,Zheng Jiang,Chunming He,Xiu Li,Minfeng Xu

Main category: cs.CV

TL;DR: Photon is a novel framework for 3D medical visual question answering that uses adaptive, instruction-conditioned token scheduling and surrogate gradient propagation to reduce computational cost without sacrificing accuracy.

Details Motivation: Scaling multimodal large language models to 3D medical imaging is hindered by high computational costs, and prior methods disrupt volumetric continuity or obscure subtle findings. Method: Photon represents 3D volumes with variable-length token sequences, introduces instruction-conditioned token scheduling and surrogate gradient propagation, incorporates a custom backpropagation rule with gradient restoration, and applies regularization objectives to mitigate language-only bias. Result: Photon achieves state-of-the-art accuracy on diverse medical visual question answering tasks while reducing resource usage and accelerating both training and inference. Conclusion: Photon effectively balances computational efficiency and diagnostic accuracy in 3D clinical VQA by enabling adaptive, differentiable token compression grounded in visual evidence. Abstract: Multimodal large language models are promising for clinical visual question answering tasks, but scaling to 3D imaging is hindered by high computational costs. Prior methods often rely on 2D slices or fixed-length token compression, disrupting volumetric continuity and obscuring subtle findings. We present Photon, a framework that represents 3D medical volumes with token sequences of variable length. Photon introduces instruction-conditioned token scheduling and surrogate gradient propagation to adaptively reduce tokens during both training and inference, which lowers computational cost while mitigating the attention dilution caused by redundant tokens. It incorporates a custom backpropagation rule with gradient restoration to enable differentiable optimization despite discrete token drop. To stabilize token compression and ensure reliable use of visual evidence, Photon further applies regularization objectives that mitigate language-only bias and improve reliability. Experiments on diverse medical visual question answering tasks show that Photon achieves state-of-the-art accuracy while reducing resource usage and accelerating both training and inference.

[109] A Semantically Disentangled Unified Model for Multi-category 3D Anomaly Detection

SuYeon Kim,Wongyu Lee,MyeongAh Cho

Main category: cs.CV

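TL;DR: This paper proposes a semantically disentangled unified model for 3D anomaly detection. It combines coarse-to-fine global tokenization, category-conditioned contrastive learning, and a geometry-guided decoder to resolve inter-category entanglement, significantly improving detection performance and reliability.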

Details Motivation: Existing unified models for multi-category 3D anomaly detection suffer from Inter-Category Entanglement (ICE), which leads to incorrect semantic priors and unreliable anomaly scores. Method: A semantically disentangled unified model is proposed with three components: (i) coarse-to-fine global tokenization to form an instance-level semantic identity; (ii) category-conditioned contrastive learning to disentangle category semantics; and (iii) a geometry-guided decoder for semantically consistent reconstruction. Result: The method reaches state-of-the-art results on Real3D-AD and Anomaly-ShapeNet, improving object-level AUROC by 2.8% for unified models and 9.1% for category-specific models. Conclusion: The method mitigates inter-category entanglement and makes unified 3D anomaly detection more accurate and reliable. Abstract: 3D anomaly detection targets the detection and localization of defects in 3D point clouds trained solely on normal data. While a unified model improves scalability by learning across multiple categories, it often suffers from Inter-Category Entanglement (ICE), where latent features from different categories overlap, causing the model to adopt incorrect semantic priors during reconstruction and ultimately yielding unreliable anomaly scores. To address this issue, we propose the Semantically Disentangled Unified Model for 3D Anomaly Detection, which reconstructs features conditioned on disentangled semantic representations. Our framework consists of three key components: (i) Coarse-to-Fine Global Tokenization for forming instance-level semantic identity, (ii) Category-Conditioned Contrastive Learning for disentangling category semantics, and (iii) a Geometry-Guided Decoder for semantically consistent reconstruction. Extensive experiments on Real3D-AD and Anomaly-ShapeNet demonstrate that our method achieves state-of-the-art for both unified and category-specific models, improving object-level AUROC by 2.8% and 9.1%, respectively, while enhancing the reliability of unified 3D anomaly detection.

[110] SportSkills: Physical Skill Learning from Sports Instructional Videos

Kumar Ashutosh,Chi Hsuan Wu,Kristen Grauman

Main category: cs.CV

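TL;DR: This paper introduces SportSkills, the first large-scale sports video dataset aimed at physical skill learning, with more than 360k instructional videos and more than 630k visual demonstrations. It also proposes a new task, mistake-conditioned instructional video retrieval, which significantly improves the ability of video models to give personalized guidance.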

Details Motivation: Existing large-scale video datasets focus on general human activity and lack in-depth coverage of the fine-grained actions needed for physical skill learning. Method: The SportSkills dataset is built (more than 360k instructional videos, more than 630k visual demonstrations, 55 sports), its value for fine-grained action understanding is validated experimentally, and the task of mistake-conditioned instructional video retrieval is introduced. Result: With the same model, representations trained on SportSkills achieve gains of up to 4x over training on traditional activity-centric datasets; evaluations by professional coaches show that the retrieval approach significantly improves the ability of video models to provide personalized visual instruction. Conclusion: SportSkills fills the gap in high-quality video data for physical skill learning and moves the field from representation learning toward actionable feedback generation. Abstract: Current large-scale video datasets focus on general human activity, but lack depth of coverage on fine-grained activities needed to address physical skill learning. We introduce SportSkills, the first large-scale sports dataset geared towards physical skill learning with in-the-wild video. SportSkills has more than 360k instructional videos containing more than 630k visual demonstrations paired with instructional narrations explaining the know-how behind the actions from 55 varied sports. Through a suite of experiments, we show that SportSkills unlocks the ability to understand fine-grained differences between physical actions. Our representation achieves gains of up to 4x with the same model trained on traditional activity-centric datasets. Crucially, building on SportSkills, we introduce the first large-scale task formulation of mistake-conditioned instructional video retrieval, bridging representation learning and actionable feedback generation (e.g., "here's my execution of a skill; which video clip should I watch to improve it?"). Formal evaluations by professional coaches show our retrieval approach significantly advances the ability of video models to personalize visual instructions for a user query.

[111] Towards Foundation Models for 3D Scene Understanding: Instance-Aware Self-Supervised Learning for Point Clouds

Bin Yang,Mohamed Abdelsamad,Miao Zhang,Alexandru Paul Condurache

Main category: cs.CV

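TL;DR: This paper proposes PointINS, an instance-oriented self-supervised learning framework that uses geometry-aware learning to enrich point cloud representations and strengthen instance awareness, yielding significant gains on indoor instance segmentation and outdoor panoptic segmentation.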

Details Motivation: Existing self-supervised learning methods for point clouds improve semantic understanding but transfer poorly to instance localization; this lack of instance awareness holds back general 3D foundation models. Method: PointINS introduces an orthogonal offset branch that jointly learns high-level semantics and geometric reasoning, and proposes two regularization strategies, Offset Distribution Regularization (ODR) and Spatial Clustering Regularization (SCR), to make instance localization more robust. Result: Experiments on five datasets show that PointINS improves indoor instance segmentation by 3.5% mAP and outdoor panoptic segmentation by 4.1% PQ on average. Conclusion: PointINS bridges the gap between semantic awareness and instance awareness and opens a path toward scalable 3D foundation models. Abstract: Recent advances in self-supervised learning (SSL) for point clouds have substantially improved 3D scene understanding without human annotations. Existing approaches emphasize semantic awareness by enforcing feature consistency across augmented views or by masked scene modeling. However, the resulting representations transfer poorly to instance localization, and often require full finetuning for strong performance. Instance awareness is a fundamental component of 3D perception, so bridging this gap is crucial for progressing toward true 3D foundation models that support all downstream tasks on 3D data. In this work, we introduce PointINS, an instance-oriented self-supervised framework that enriches point cloud representations through geometry-aware learning. PointINS employs an orthogonal offset branch to jointly learn high-level semantic understanding and geometric reasoning, yielding instance awareness. We identify two consistent properties essential for robust instance localization and formulate them as complementary regularization strategies: Offset Distribution Regularization (ODR), which aligns predicted offsets with empirically observed geometric priors, and Spatial Clustering Regularization (SCR), which enforces local coherence by regularizing offsets with pseudo-instance masks. Through extensive experiments across five datasets, PointINS achieves on average +3.5% mAP improvement for indoor instance segmentation and +4.1% PQ gain for outdoor panoptic segmentation, paving the way for scalable 3D foundation models.

[112] ET-SAM: Efficient Point Prompt Prediction in SAM for Unified Scene Text Detection and Layout Analysis

Xike Zhang,Maoyuan Ye,Juhua Liu,Bo Du

Main category: cs.CV

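TL;DR: This paper proposes ET-SAM, an efficient SAM-based framework for unified scene text detection and layout analysis. A lightweight point decoder produces word heatmaps to cut the number of foreground point prompts, and a joint training strategy combines heterogeneous text annotations from multiple sources, giving about 3x faster inference together with better results on several benchmarks.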

Details Motivation: Existing SAM-based methods rely on pixel-level text segmentation to sample large numbers of foreground points as prompts, which causes high inference latency and limited data utilization. Method: The ET-SAM framework comprises a lightweight point decoder (which produces word heatmaps to reduce the number of foreground points) and a hierarchical mask decoder. A joint training strategy combines datasets with multi-level, word-level only, and line-level only annotations, and three sets of learnable task prompts are introduced for the different annotation types. Result: Compared with the previous SAM-based architecture, ET-SAM achieves about 3x inference acceleration with competitive performance on HierText, and improves the average F-score by 11.0% on Total-Text, CTW1500, and ICDAR15. Conclusion: ET-SAM relieves the efficiency and data-utilization bottlenecks that pixel-level segmentation imposes on SAM in text tasks, delivering efficient, general, high-performance unified text detection and layout analysis. Abstract: Previous works based on Segment Anything Model (SAM) have achieved promising performance in unified scene text detection and layout analysis. However, the typical reliance on pixel-level text segmentation for sampling thousands of foreground points as prompts leads to unsatisfactory inference latency and limited data utilization. To address the above issues, we propose ET-SAM, an Efficient framework with two decoders for unified scene Text detection and layout analysis based on SAM. Technically, we customize a lightweight point decoder that produces word heatmaps for achieving a few foreground points, thereby eliminating excessive point prompts and accelerating inference. Without the dependence on pixel-level segmentation, we further design a joint training strategy to leverage existing data with heterogeneous text-level annotations. Specifically, the datasets with multi-level, word-level only, and line-level only annotations are combined in parallel as a unified training set. For these datasets, we introduce three corresponding sets of learnable task prompts in both the point decoder and hierarchical mask decoder to mitigate discrepancies across datasets. Extensive experiments demonstrate that, compared to the previous SAM-based architecture, ET-SAM achieves about 3$\times$ inference acceleration while obtaining competitive performance on HierText, and improves an average of 11.0% F-score on Total-Text, CTW1500, and ICDAR15.
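
To make the idea of obtaining "a few foreground points" from a word heatmap concrete, here is a minimal sketch that keeps only thresholded local maxima as point prompts. How ET-SAM actually selects points is not described in the abstract; the peak-picking rule, the 8-neighbour window, the default threshold, and the name `heatmap_peaks` are all assumptions for illustration.

```python
def heatmap_peaks(heatmap, threshold=0.5):
    """Return (row, col) positions of strict local maxima above `threshold`.

    `heatmap` is a list of equal-length rows of floats. A cell is kept
    when it is at least `threshold` and strictly greater than all of
    its (up to 8) neighbours, so flat plateaus produce no point.
    """
    rows, cols = len(heatmap), len(heatmap[0])
    peaks = []
    for r in range(rows):
        for c in range(cols):
            value = heatmap[r][c]
            if value < threshold:
                continue
            neighbours = [
                heatmap[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            ]
            if all(value > n for n in neighbours):
                peaks.append((r, c))
    return peaks
```

Each returned position would serve as one point prompt, so the number of prompts scales with the number of words rather than with the number of foreground pixels.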

[113] Knowledge-Guided Adversarial Training for Infrared Object Detection via Thermal Radiation Modeling

Shiji Zhao,Shukun Xiong,Maoxun Yuan,Yao Huang,Ranjie Duan,Qing Guo,Jiansheng Chen,Haibin Duan,Xingxing Wei

Main category: cs.CV

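TL;DR: This paper proposes Knowledge-Guided Adversarial Training (KGAT), an adversarial training method guided by infrared physical knowledge. By modeling and quantifying the thermal radiation relations between classes, it improves both the clean accuracy of infrared object detectors and their robustness to adversarial examples and common corruptions.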

Details Motivation: Most existing infrared object detection methods are purely data-driven and ignore the physical characteristics specific to infrared images (such as thermal radiation relations), which limits robustness. The relative thermal radiation relations between classes stay stable under complex interference and can serve as prior knowledge. Method: Inter-class thermal radiation relations are modeled from the rank order of gray values and their stability is quantified. This physical knowledge is embedded into adversarial training through a knowledge-guided loss that makes predictions consistent with the actual physical laws. Result: On three infrared datasets and six mainstream detection models, KGAT significantly improves clean accuracy as well as robustness to adversarial attacks and common corruptions (such as noise and blur). Conclusion: Adversarial training that incorporates infrared physical priors effectively improves robustness, showing that combining domain knowledge with deep learning is both important and feasible for infrared detection. Abstract: In complex environments, infrared object detection exhibits broad applicability and stability across diverse scenarios. However, infrared object detection is vulnerable to both common corruptions and adversarial examples, leading to potential security risks. To improve the robustness of infrared object detection, current methods mostly adopt a data-driven paradigm, which only superficially drives the network to fit the training data without specifically considering the unique characteristics of infrared images, resulting in limited robustness. In this paper, we revisit infrared physical knowledge and find that relative thermal radiation relations between different classes can be regarded as a reliable knowledge source under the complex scenarios of adversarial examples and common corruptions. Thus, we theoretically model thermal radiation relations based on the rank order of gray values for different classes, and further quantify the stability of various inter-class thermal radiation relations. Based on the above theoretical framework, we propose Knowledge-Guided Adversarial Training (KGAT) for infrared object detection, in which infrared physical knowledge is embedded into the adversarial training process, and the predicted results are optimized to be consistent with the actual physical laws. Extensive experiments on three infrared datasets and six mainstream infrared object detection models demonstrate that KGAT effectively enhances both clean accuracy and robustness against adversarial attacks and common corruptions.

[114] AG-EgoPose: Leveraging Action-Guided Motion and Kinematic Joint Encoding for Egocentric 3D Pose Estimation

Md Mushfiqur Azam,John Quarles,Kevin Desai

Main category: cs.CV

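TL;DR: This paper proposes AG-EgoPose, a dual-stream framework that combines short- and long-range motion context with fine-grained spatial cues for egocentric 3D human pose estimation from fisheye cameras, significantly improving performance.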

Details Motivation: Egocentric views suffer from severe perspective distortion, limited body visibility, and complex camera motion, and existing single-frame or limited-temporal methods make poor use of the motion context in video. Method: A dual-stream architecture is designed: the spatial stream uses a weight-sharing ResNet-18 to generate 2D heatmaps and spatial feature tokens; the temporal stream uses ResNet-50 plus an action recognition backbone to extract motion dynamics; the two are fused and refined in a transformer decoder with learnable joint tokens. Result: The method reaches state-of-the-art performance on real-world datasets, leading in both quantitative and qualitative metrics. Conclusion: By jointly modeling spatio-temporal information and anatomical constraints, AG-EgoPose improves the robustness and accuracy of egocentric 3D pose estimation. Abstract: Egocentric 3D human pose estimation remains challenging due to severe perspective distortion, limited body visibility, and complex camera motion inherent in first-person viewpoints. Existing methods typically rely on single-frame analysis or limited temporal fusion, which fails to effectively leverage the rich motion context available in egocentric videos. We introduce AG-EgoPose, a novel dual-stream framework that integrates short- and long-range motion context with fine-grained spatial cues for robust pose estimation from fisheye camera input. Our framework features two parallel streams. A spatial stream uses a weight-sharing ResNet-18 encoder-decoder to generate 2D joint heatmaps and corresponding joint-specific spatial feature tokens. Simultaneously, a temporal stream uses a ResNet-50 backbone to extract visual features, which are then processed by an action recognition backbone to capture the motion dynamics. These complementary representations are fused and refined in a transformer decoder with learnable joint tokens, which allows for the joint-level integration of spatial and temporal evidence while maintaining anatomical constraints. Experiments on real-world datasets demonstrate that AG-EgoPose achieves state-of-the-art performance in both quantitative and qualitative metrics. Code is available at: https://github.com/Mushfiq5647/AG-EgoPose.

[115] VolDiT: Controllable Volumetric Medical Image Synthesis with Diffusion Transformers

Marvin Seyfarth,Salman Ul Hassan Dar,Yannik Frisch,Philipp Wild,Norbert Frey,Florian André,Sandy Engelhardt

Main category: cs.CV

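TL;DR: This paper proposes VolDiT, the first purely transformer-based 3D diffusion model for volumetric medical image synthesis. It uses volumetric patch embeddings, global self-attention, and a timestep-gated control adapter to achieve high-fidelity, controllable, globally coherent generation.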

Details Motivation: Most existing 3D medical image generation methods are built on convolutional U-Nets, which are limited in receptive field, global modeling, and flexible conditioning; an architecture with stronger global modeling and structured control is needed. Method: VolDiT (1) adopts a purely transformer backbone with native 3D volumetric patch embeddings and global self-attention over 3D tokens; and (2) introduces a timestep-gated control adapter that maps segmentation masks to learnable control tokens, enabling token-level spatial conditioning. Result: On high-resolution 3D medical image synthesis tasks, VolDiT improves global coherence, generative fidelity, and controllability over state-of-the-art U-Net-based latent diffusion models. Conclusion: Purely transformer-based diffusion models are a flexible and powerful new paradigm for volumetric medical image synthesis and lay the groundwork for controllable, multimodal, high-precision 3D generation. Abstract: Diffusion models have become a leading approach for high-fidelity medical image synthesis. However, most existing methods for 3D medical image generation rely on convolutional U-Net backbones within latent diffusion frameworks. While effective, these architectures impose strong locality biases and limited receptive fields, which may constrain scalability, global context integration, and flexible conditioning. In this work, we introduce VolDiT, the first purely transformer-based 3D Diffusion Transformer for volumetric medical image synthesis. Our approach extends diffusion transformers to native 3D data through volumetric patch embeddings and global self-attention operating directly over 3D tokens. To enable structured control, we propose a timestep-gated control adapter that maps segmentation masks into learnable control tokens that modulate transformer layers during denoising. This token-level conditioning mechanism allows precise spatial guidance while preserving the modeling advantages of transformer architectures. We evaluate our model on high-resolution 3D medical image synthesis tasks and compare it to state-of-the-art 3D latent diffusion models based on U-Nets. Results demonstrate improved global coherence, superior generative fidelity, and enhanced controllability. Our findings suggest that fully transformer-based diffusion models provide a flexible foundation for volumetric medical image synthesis. The code and models trained on public data are available at https://github.com/Cardio-AI/voldit.

[116] AnyID: Ultra-Fidelity Universal Identity-Preserving Video Generation from Any Visual References

Jiahao Wang,Hualian Sheng,Sijia Cai,Yuxiao Yang,Weizhan Zhang,Caixia Yan,Bing Deng,Jieping Ye

Main category: cs.CV

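TL;DR: This paper presents AnyID, a framework that combines a scalable omni-referenced architecture with a primary-referenced generation paradigm for high-fidelity identity-preserving video generation, and uses reinforcement learning to improve identity fidelity and prompt controllability.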

Details Motivation: Existing methods are usually designed and optimized for a single identity reference, which limits creative flexibility; relying on a single reference also makes the problem ill-posed and makes it hard to reproduce an identity faithfully in new scenes. Method: AnyID comprises (1) a scalable omni-referenced architecture that unifies heterogeneous identity inputs, and (2) a primary-referenced generation paradigm that treats one reference as a canonical anchor and uses a differential prompt for precise attribute-level control. The model is trained on a large-scale dataset and fine-tuned with reinforcement learning on a preference dataset built from human evaluations. Result: AnyID achieves ultra-high identity fidelity and superior attribute-level controllability across task settings. Conclusion: AnyID addresses the combined challenge of multi-source identity inputs and high-fidelity controllable generation, improving identity preservation and editing in video generation. Abstract: Identity-preserving video generation offers powerful tools for creative expression, allowing users to customize videos featuring their beloved characters. However, prevailing methods are typically designed and optimized for a single identity reference. This underlying assumption restricts creative flexibility by inadequately accommodating diverse real-world input formats. Relying on a single source also constitutes an ill-posed scenario, causing an inherently ambiguous setting that makes it difficult for the model to faithfully reproduce an identity across novel contexts. To address these issues, we present AnyID, an ultra-fidelity identity-preservation video generation framework that features two core contributions. First, we introduce a scalable omni-referenced architecture that effectively unifies heterogeneous identity inputs (e.g., faces, portraits, and videos) into a cohesive representation. Second, we propose a primary-referenced generation paradigm, which designates one reference as a canonical anchor and uses a novel differential prompt to enable precise, attribute-level controllability. We conduct training on a large-scale, meticulously curated dataset to ensure robustness and high fidelity, and then perform a final fine-tuning stage using reinforcement learning. This process leverages a preference dataset constructed from human evaluations, where annotators performed pairwise comparisons of videos based on two key criteria: identity fidelity and prompt controllability. Extensive evaluations validate that AnyID achieves ultra-high identity fidelity as well as superior attribute-level controllability across different task settings.

[117] CardioDiT: Latent Diffusion Transformers for 4D Cardiac MRI Synthesis

Marvin Seyfarth,Sarah Kaye Müller,Arman Ghanaat,Isabelle Ayx,Fabian Fastenrath,Philipp Wild,Alexander Hertel,Theano Papavassiliu,Salman Ul Hassan Dar,Sandy Engelhardt

Main category: cs.CV

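TL;DR: This paper presents CardioDiT, a fully 4D latent diffusion framework based on diffusion transformers for short-axis cine cardiac MRI (CMR) synthesis. By jointly modeling the 3D spatial and temporal dimensions, it improves the spatiotemporal consistency and physiological plausibility of cardiac dynamics.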

Details Motivation: For temporally synchronized 4D medical images such as cine CMR, existing generative models mostly factorize space and time or rely on auxiliary mechanisms (e.g., anatomical masks) for temporal consistency, which introduces structural biases and can cause spatiotemporal discontinuities or distorted physiological dynamics. Method: CardioDiT first encodes 2D+t slices into compact latents with a spatiotemporal VQ-VAE; a diffusion transformer then models the complete 3D+t volume jointly and end to end, so space and time stay tightly coupled throughout generation. Result: On public and private CMR datasets, CardioDiT improves inter-slice consistency and temporal motion coherence and produces images whose cardiac function distributions are closer to real data, outperforming baselines with progressively stronger spatiotemporal coupling. Conclusion: Explicit 4D modeling with a diffusion transformer provides a more principled basis for spatiotemporal cardiac image synthesis and avoids the limitations of factorized architectures. Abstract: Latent diffusion models (LDMs) have recently achieved strong performance in 3D medical image synthesis. However, modalities like cine cardiac MRI (CMR), representing a temporally synchronized 3D volume across the cardiac cycle, add an additional dimension that most generative approaches do not model directly. Instead, they factorize space and time or enforce temporal consistency through auxiliary mechanisms such as anatomical masks. Such strategies introduce structural biases that may limit global context integration and lead to subtle spatiotemporal discontinuities or physiologically inconsistent cardiac dynamics. We investigate whether a unified 4D generative model can learn continuous cardiac dynamics without architectural factorization. We propose CardioDiT, a fully 4D latent diffusion framework for short-axis cine CMR synthesis based on diffusion transformers. A spatiotemporal VQ-VAE encodes 2D+t slices into compact latents, which a diffusion transformer then models jointly as complete 3D+t volumes, coupling space and time throughout the generative process. We evaluate CardioDiT on public CMR datasets and a larger private cohort, comparing it to baselines with progressively stronger spatiotemporal coupling. Results show improved inter-slice consistency, temporally coherent motion, and realistic cardiac function distributions, suggesting that explicit 4D modeling with a diffusion transformer provides a principled foundation for spatiotemporal cardiac image synthesis. Code and models trained on public data are available at https://github.com/Cardio-AI/cardiodit.

[118] TacSIm: A Dataset and Benchmark for Football Tactical Style Imitation

Peng Wen,Yuting Wang,Qiurui Wang

Main category: cs.CV

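TL;DR: This paper presents TacSIm, a large-scale dataset and benchmark for football tactical style imitation. It aims to reproduce the tactical behavior of real teams from broadcast video rather than only optimizing reward objectives, supports evaluation by spatial occupancy similarity and movement vector similarity, and enables quantitative and visual assessment of full-team behavior in a unified virtual environment.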

Details Motivation: Existing football imitation research focuses mostly on reward-oriented objectives (e.g., goals scored, win rate) and neglects accurate reproduction of real teams' tactical behavior. Method: The authors build the TacSIm dataset and benchmark by extracting the starting positions and actions of all 22 players from single-view Premier League broadcast video and mapping them to a standard pitch coordinate system; they define a tactical style imitation task and an evaluation protocol based on spatial occupancy similarity and movement vector similarity, and run several baseline methods in a unified virtual environment to generate full-team behavior. Result: TacSIm enables quantifiable and visual evaluation of a whole team's tactical style in offensive and defensive scenarios and provides the first style-alignment benchmark for football tactical imitation. Conclusion: TacSIm fills a gap in imitation and evaluation of tactical behavior in football AI and encourages a shift from "win-oriented" to "style-aligned" modeling. Abstract: Current football imitation research primarily aims to optimize reward-based objectives, such as goals scored or win rate proxies, paying less attention to accurately replicating real-world team tactical behaviors. We introduce TacSIm, a large-scale dataset and benchmark for Tactical Style Imitation in football. TacSIm imitates the actions of all 11 players of one team in given broadcast footage of Premier League matches under a single broadcast view. Given offensive or defensive broadcast footage, TacSIm projects the starting positions and actions of all 22 players from both sides onto a standard pitch coordinate system. TacSIm offers an explicit style imitation task and evaluation protocols. Tactical style imitation is measured using spatial occupancy similarity and movement vector similarity within a defined time window, supporting the evaluation of spatial and temporal similarities for one team. We run multiple baseline methods in a unified virtual environment to generate full-team behaviors, enabling both quantitative and visual assessment of tactical coordination. By using unified data and metrics from broadcast to simulation, TacSIm establishes a rigorous benchmark for measuring and modeling style-aligned tactical imitation in football.
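
To make the two TacSIm evaluation signals concrete, here is a minimal sketch of what an occupancy similarity and a movement vector similarity could look like. The abstract does not give the exact formulas, so everything below is an assumption: occupancy is compared by histogram intersection over a coarse grid, movement by mean cosine similarity of per-player displacements, and the 105 m x 68 m pitch and 7 x 4 grid are illustrative defaults.

```python
import math


def occupancy_grid(positions, pitch=(105.0, 68.0), cells=(7, 4)):
    """Normalized histogram of player positions over a coarse pitch grid."""
    grid = [0.0] * (cells[0] * cells[1])
    for x, y in positions:
        cx = min(int(x / pitch[0] * cells[0]), cells[0] - 1)
        cy = min(int(y / pitch[1] * cells[1]), cells[1] - 1)
        grid[cy * cells[0] + cx] += 1.0
    total = sum(grid)
    return [v / total for v in grid] if total else grid


def occupancy_similarity(pos_a, pos_b):
    """Histogram intersection in [0, 1]; 1 means identical occupancy."""
    grid_a, grid_b = occupancy_grid(pos_a), occupancy_grid(pos_b)
    return sum(min(a, b) for a, b in zip(grid_a, grid_b))


def movement_similarity(start, end_a, end_b):
    """Mean cosine similarity between per-player displacement vectors."""
    sims = []
    for (sx, sy), (ax, ay), (bx, by) in zip(start, end_a, end_b):
        va, vb = (ax - sx, ay - sy), (bx - sx, by - sy)
        na, nb = math.hypot(*va), math.hypot(*vb)
        if na == 0 or nb == 0:
            # Two stationary players agree; one moving and one still do not.
            sims.append(1.0 if na == nb else 0.0)
        else:
            sims.append((va[0] * vb[0] + va[1] * vb[1]) / (na * nb))
    return sum(sims) / len(sims)


# Hypothetical two-player example: real vs. imitated end positions.
start = [(10.0, 10.0), (50.0, 30.0)]
real_end = [(15.0, 10.0), (50.0, 35.0)]
imitated_end = [(14.0, 10.0), (50.0, 33.0)]
print(occupancy_similarity(real_end, imitated_end))
print(movement_similarity(start, real_end, imitated_end))
```

The split mirrors the abstract's distinction: the occupancy term ignores who moved where and only asks whether the team covers the same regions, while the movement term checks whether each player moved in the same direction.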

[119] CIV-DG: Conditional Instrumental Variables for Domain Generalization in Medical Imaging

Shaojin Bai,Yuting Su,Weizhi Nie

Main category: cs.CV

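TL;DR: This paper presents CIV-DG, a framework that uses Conditional Instrumental Variables to address cross-site generalization failures caused by selection bias in medical AI. A DeepGMM architecture disentangles pathological semantics from scanner artifacts, and the method significantly improves generalization on Camelyon17 and chest X-ray datasets.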

Details Motivation: Cross-site generalization in medical AI is limited by selection bias: patient demographics (e.g., age, severity) non-randomly influence hospital assignment, producing site-specific spurious correlations that conventional domain generalization methods cannot handle. Method: CIV-DG is a causal framework that relaxes the strict random-assignment assumption of standard instrumental variable methods and introduces conditional instrumental variables to disentangle pathological semantics from scanner-induced artifacts. Built on the Deep Generalized Method of Moments (DeepGMM), it uses a conditional critic to minimize moment violations within demographic strata and to enforce orthogonality between the instrument and the error. Result: On Camelyon17 and large-scale chest X-ray datasets, CIV-DG significantly outperforms leading domain generalization baselines. Conclusion: Conditional causal mechanisms such as CIV-DG can mitigate structural confounding and are a key route to robust medical AI. Abstract: Cross-site generalizability in medical AI is fundamentally compromised by selection bias, a structural mechanism where patient demographics (e.g., age, severity) non-randomly dictate hospital assignment. Conventional Domain Generalization (DG) paradigms, which predominantly target image-level distribution shifts, fail to address the resulting spurious correlations between site-specific variations and diagnostic labels. To surmount this identifiability barrier, we propose CIV-DG, a causal framework that leverages Conditional Instrumental Variables to disentangle pathological semantics from scanner-induced artifacts. By relaxing the strict random assignment assumption of standard IV methods, CIV-DG accommodates complex clinical scenarios where hospital selection is endogenously driven by patient demographics. We instantiate this theory via a Deep Generalized Method of Moments (DeepGMM) architecture, employing a conditional critic to minimize moment violations and enforce instrument-error orthogonality within demographic strata. Extensive experiments on the Camelyon17 benchmark and large-scale Chest X-Ray datasets demonstrate that CIV-DG significantly outperforms leading baselines, validating the efficacy of conditional causal mechanisms in resolving structural confounding for robust medical AI.

[120] Free-Lunch Long Video Generation via Layer-Adaptive O.O.D Correction

Jiahao Tian,Chenxi Song,Wei Cheng,Chi Zhang

Main category: cs.CV

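TL;DR: This paper presents FreeLOC, a training-free framework that uses Video-based Relative Position Re-encoding (VRPR) and Tiered Sparse Attention (TSA) to counter the quality degradation that pre-trained video diffusion models suffer on long videos because of distribution shift in frame-level relative positions and context length. A layer-adaptive probing mechanism improves efficiency.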

Details Motivation: Pre-trained video diffusion models are usually trained on short clips, and applying them directly to long-video generation noticeably degrades visual quality, mainly because of two out-of-distribution (O.O.D) problems: frame-level relative position and context length. Method: FreeLOC comprises (1) Video-based Relative Position Re-encoding (VRPR), which hierarchically re-encodes temporal relative positions to align with the pre-trained distribution; (2) Tiered Sparse Attention (TSA), which structures attention sparsity across temporal scales to retain both local detail and long-range dependencies; and (3) a layer-adaptive probing mechanism that identifies each transformer layer's sensitivity to O.O.D issues so the techniques can be applied selectively and efficiently. Result: Across multiple experiments, FreeLOC significantly outperforms existing training-free methods and reaches state-of-the-art temporal consistency and visual quality. Conclusion: FreeLOC is a training-free, layer-adaptive, general framework that mitigates the two key O.O.D problems in long-video generation and offers a new approach to long-video synthesis with pre-trained models. Abstract: Generating long videos using pre-trained video diffusion models, which are typically trained on short clips, presents a significant challenge. Directly applying these models for long-video inference often leads to a notable degradation in visual quality. This paper identifies that this issue primarily stems from two out-of-distribution (O.O.D) problems: frame-level relative position O.O.D and context-length O.O.D. To address these challenges, we propose FreeLOC, a novel training-free, layer-adaptive framework that introduces two core techniques: Video-based Relative Position Re-encoding (VRPR) for frame-level relative position O.O.D, a multi-granularity strategy that hierarchically re-encodes temporal relative positions to align with the model's pre-trained distribution, and Tiered Sparse Attention (TSA) for context-length O.O.D, which preserves both local detail and long-range dependencies by structuring attention density across different temporal scales. Crucially, we introduce a layer-adaptive probing mechanism that identifies the sensitivity of each transformer layer to these O.O.D issues, allowing for the selective and efficient application of our methods. Extensive experiments demonstrate that our approach significantly outperforms existing training-free methods, achieving state-of-the-art results in both temporal consistency and visual quality. Code is available at https://github.com/Westlake-AGI-Lab/FreeLOC.
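
The relative-position O.O.D problem described for FreeLOC can be illustrated with a small sketch. This is not the VRPR algorithm from the paper, whose details the abstract does not give; it is a hypothetical multi-granularity mapping in which nearby frame distances keep their exact value, distant ones are grouped coarsely, and the result is clamped so it never exceeds the largest distance seen during pre-training. The function name and all parameters are assumptions.

```python
def reencode_relative_position(distance, trained_max, fine_range, group):
    """Map a frame distance into the range seen during pre-training.

    Hypothetical scheme: distances up to `fine_range` are kept exactly,
    larger ones are bucketed in groups of `group` frames, and the result
    is clamped to `trained_max`.
    """
    sign = -1 if distance < 0 else 1
    magnitude = abs(distance)
    if magnitude <= fine_range:
        return sign * magnitude
    coarse = fine_range + (magnitude - fine_range + group - 1) // group
    return sign * min(coarse, trained_max)


# A model trained on clips where frames are at most 16 apart (assumed),
# applied to a video where frames can be 100 apart.
for d in (3, 20, 100, -100):
    print(d, "->", reencode_relative_position(d, trained_max=16, fine_range=8, group=4))
```

The design intent is that local ordering stays precise where it matters for motion detail, while long-range positions collapse to values the pre-trained model has already seen.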

[121] SDD-YOLO: A Small-Target Detection Framework for Ground-to-Air Anti-UAV Surveillance with Edge-Efficient Deployment

Pengyu Chen,Haotian Sa,Yiwei Hu,Yuhan Cheng,Junbo Wang

Main category: cs.CV

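TL;DR: This paper presents SDD-YOLO, a small-target detection framework for ground-to-air (G2A) observation of UAVs. With a high-resolution P2 detection head, a DFL-free and NMS-free architecture, and the MuSGD training strategy, it reaches 86.0% mAP@0.5 on the self-built large-scale DroneSOD-30K dataset, clearly ahead of YOLOv5n, with inference speeds suited to edge deployment.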

Details Motivation: Ground-to-air (G2A) UAV observation involves extremely low target pixel occupancy, cluttered backgrounds, and strict real-time requirements; existing YOLO models lack sufficient feature resolution for sub-pixel targets and are complex to deploy. Method: SDD-YOLO introduces a high-resolution P2 detection head (4x downsampling) to capture fine detail of tiny targets; adopts the DFL-free and NMS-free architecture of YOLO26 to simplify inference; uses the MuSGD hybrid training strategy (with ProgLoss and STAL) to reduce gradient oscillation caused by sparse small-target signals; and builds DroneSOD-30K, a large-scale G2A dataset of about 30,000 annotated images. Result: SDD-YOLO-n reaches 86.0% mAP@0.5 on DroneSOD-30K, 7.8 percentage points above YOLOv5n, and runs at 226 FPS on an NVIDIA RTX 5090 and 35 FPS on an Intel Xeon CPU. Conclusion: SDD-YOLO addresses small-target detection in G2A anti-UAV scenarios with a good balance of accuracy and speed, offering a lightweight, efficient option for practical security systems. Abstract: Detecting small unmanned aerial vehicles (UAVs) from a ground-to-air (G2A) perspective presents significant challenges, including extremely low pixel occupancy, cluttered aerial backgrounds, and strict real-time constraints. Existing YOLO-based detectors are primarily optimized for general object detection and often lack adequate feature resolution for sub-pixel targets, while introducing complexities during deployment. In this paper, we propose SDD-YOLO, a small-target detection framework tailored for G2A anti-UAV surveillance. To capture fine-grained spatial details critical for micro-targets, SDD-YOLO introduces a P2 high-resolution detection head operating at 4 times downsampling. Furthermore, we integrate the recent architectural advancements from YOLO26, including a DFL-free, NMS-free architecture for streamlined inference, and the MuSGD hybrid training strategy with ProgLoss and STAL, which substantially mitigates gradient oscillation on sparse small-target signals. To support our evaluation, we construct DroneSOD-30K, a large-scale G2A dataset comprising approximately 30,000 annotated images covering diverse meteorological conditions. Experiments demonstrate that SDD-YOLO-n achieves a mAP@0.5 of 86.0% on DroneSOD-30K, surpassing the YOLOv5n baseline by 7.8 percentage points. Extensive inference analysis shows our model attains 226 FPS on an NVIDIA RTX 5090 and 35 FPS on an Intel Xeon CPU, demonstrating exceptional efficiency for future edge deployment.
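
The benefit of a P2 head at 4x downsampling for tiny targets comes down to simple arithmetic, sketched below. The 640-pixel input size and the 8-pixel target are assumptions for illustration, and the P3/P4/P5 strides of 8/16/32 follow the usual YOLO convention rather than anything stated in the abstract.

```python
def feature_map_size(image_size, stride):
    """Side length of the feature map produced at a given downsampling stride."""
    return image_size // stride


def target_extent_in_cells(target_pixels, stride):
    """How many feature-map cells a target spans along one axis."""
    return target_pixels / stride


IMAGE_SIZE = 640      # assumed input resolution
TARGET_PIXELS = 8     # assumed width of a distant UAV in pixels

for name, stride in (("P2", 4), ("P3", 8), ("P4", 16), ("P5", 32)):
    cells = target_extent_in_cells(TARGET_PIXELS, stride)
    size = feature_map_size(IMAGE_SIZE, stride)
    print(f"{name}: {size}x{size} map, target spans {cells:.2f} cells")
```

Under these assumptions the target covers two cells per axis at P2 but only a quarter of a cell at P5, which is the sense in which coarser heads lack feature resolution for such targets. The cost is a feature map with four times as many cells as P3.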

[122] Training-free Detection and 6D Pose Estimation of Unseen Surgical Instruments

Jonas Hein,Lilian Calvet,Matthias Seibold,Siyu Tang,Marc Pollefeys,Philipp Fürnstahl

Main category: cs.CV

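TL;DR: This paper presents a training-free multi-view 6D pose estimation method that needs only a textured CAD model to detect unseen surgical instruments and estimate their pose with high accuracy.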

Details Motivation: Supervised methods generalize poorly to new or unseen instruments and depend on large amounts of annotated data, whereas clinical practice often requires rapid adaptation to new instruments. Method: The pipeline has two stages: (1) mask proposals are generated by matching against rendered templates with a pre-trained feature extractor, then triangulated into 3D instances and filtered by multi-view geometric consistency; (2) pose hypotheses are refined iteratively and scored with feature-metric scores using cross-view attention, followed by a final refinement with multi-view, occlusion-aware contour registration. Result: On the real-world MVPSP surgical dataset, the method reaches millimeter-level accuracy on par with supervised methods while generalizing fully to unseen instruments. Conclusion: By combining foundation models, multi-view geometry, and contour-based refinement, the method achieves accurate and robust 6D pose estimation of surgical instruments without task-specific training and is suited to dynamic clinical environments. Abstract: Purpose: Accurate detection and 6D pose estimation of surgical instruments are crucial for many computer-assisted interventions. However, supervised methods lack flexibility for new or unseen tools and require extensive annotated data. This work introduces a training-free pipeline for accurate multi-view 6D pose estimation of unseen surgical instruments, which only requires a textured CAD model as prior knowledge. Methods: Our pipeline consists of two main stages. First, for detection, we generate object mask proposals in each view and score their similarity to rendered templates using a pre-trained feature extractor. Detections are matched across views, triangulated into 3D instance candidates, and filtered using multi-view geometric consistency. Second, for pose estimation, a set of pose hypotheses is iteratively refined and scored using feature-metric scores with cross-view attention. The best hypothesis undergoes a final refinement using a novel multi-view, occlusion-aware contour registration, which minimizes reprojection errors of unoccluded contour points. Results: The proposed method was rigorously evaluated on real-world surgical data from the MVPSP dataset. The method achieves millimeter-accurate pose estimates that are on par with supervised methods under controlled conditions, while maintaining full generalization to unseen instruments. These results demonstrate the feasibility of training-free, marker-less detection and tracking in surgical scenes, and highlight the unique challenges in surgical environments.
Conclusion: We present a novel and flexible pipeline that effectively combines state-of-the-art foundational models, multi-view geometry, and contour-based refinement for high-accuracy 6D pose estimation of surgical instruments without task-specific training. This approach enables robust instrument tracking and scene understanding in dynamic clinical environments.
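
The final refinement step in this entry minimizes reprojection errors of unoccluded contour points. The sketch below shows that quantity for a single view under a plain pinhole camera model, which is an assumption; the function names, intrinsics, and sample points are hypothetical, and a multi-view version would sum the same error over all cameras.

```python
import math


def project(point, fx, fy, cx, cy):
    """Project a 3D point in camera coordinates to pixel coordinates (pinhole model)."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


def mean_reprojection_error(points_cam, observed, visible, intrinsics):
    """Mean pixel distance between projected and observed contour points.

    Points flagged as occluded (visible == False) are skipped, which is
    the occlusion-aware part of the objective.
    """
    errors = [
        math.dist(project(p, *intrinsics), o)
        for p, o, v in zip(points_cam, observed, visible)
        if v
    ]
    return sum(errors) / len(errors) if errors else 0.0


intrinsics = (500.0, 500.0, 320.0, 240.0)           # fx, fy, cx, cy (assumed)
contour_points = [(0.0, 0.0, 1.0), (0.1, 0.0, 1.0)]  # metres, camera frame
observed_pixels = [(320.0, 240.0), (373.0, 244.0)]

print(mean_reprojection_error(contour_points, observed_pixels, [True, True], intrinsics))
print(mean_reprojection_error(contour_points, observed_pixels, [True, False], intrinsics))
```

A pose optimizer would adjust the instrument pose that produces `points_cam` until this error stops decreasing; excluding occluded points keeps hands or tissue covering the instrument from pulling the estimate away.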

[123] An Image Dataset of Common Skin Diseases of Bangladesh and Benchmarking Performance with Machine Learning Models

Sazzad Hossain,Saiful Islam,Muhammad Ibrahim,Md. Rasel Ahmed,Md Shuayb,Ahmedul Kabir

Main category: cs.CV

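TL;DR: This paper builds a public image dataset of five common skin diseases in Bangladesh (Contact Dermatitis, Vitiligo, Eczema, Scabies, and Tinea Ringworm; 1612 images including augmented ones) and evaluates the classification performance of several machine learning and deep learning models on it, aiming to ease the shortage of dermatology resources in primary care.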

Details Motivation: Densely populated regions such as Bangladesh lack enough dermatologists and diagnostic equipment, which raises the risk of missed or wrong diagnoses and calls for AI-based diagnostic aids. Method: The authors build a clinically sourced image dataset covering five skin diseases (1612 images, including augmented data) and run classification experiments with several traditional machine learning models (e.g., SVM, RF) and deep learning models (e.g., VGG16, ResNet50). Result: Classification performance of the different models on the dataset is reported (specific metrics are not detailed), showing that the dataset can be used to train effective skin disease recognition models. Conclusion: The public dataset is regionally representative and covers widely occurring diseases, so it can support AI-based automated skin disease diagnosis research worldwide and may help bring intelligent tools to primary care. Abstract: Skin diseases are a major public health concern worldwide, and their detection is often challenging without access to dermatological expertise. In countries like Bangladesh, which is highly populated, the number of qualified skin specialists and diagnostic instruments is insufficient to meet the demand. The resulting lack of proper detection and treatment of skin diseases may lead to severe health consequences, including death. Skin diseases commonly change the color, texture, and pattern of the skin, and in this era of artificial intelligence and machine learning, we are able to detect them by using image processing and computer vision techniques. In response to this challenge, we develop a publicly available dataset focused on common skin disease detection using machine learning techniques. We focus on five prevalent skin diseases in Bangladesh: Contact Dermatitis, Vitiligo, Eczema, Scabies, and Tinea Ringworm. The dataset consists of 1612 images (of which 250 are distinct and the rest are augmented), collected directly from patients at the outpatient department of Faridpur Medical College, Faridpur, Bangladesh. The data comprise 302, 381, 301, 316, and 312 images of Dermatitis, Eczema, Scabies, Tinea Ringworm, and Vitiligo, respectively. Although the data are collected regionally, the selected diseases are common across many countries, especially in South Asia, making the dataset potentially valuable for global applications in machine learning-based dermatology. We also apply several machine learning and deep learning models on the dataset and report classification performance.
We expect this research to attract attention from machine learning and deep learning researchers and practitioners working in the field of automated disease diagnosis.

[124] A Unified Spatial Alignment Framework for Highly Transferable Transformation-Based Attacks on Spatially Structured Tasks

Jiaming Liang,Chi-Man Pun

Main category: cs.CV

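TL;DR: This paper presents the Spatial Alignment Framework (SAF), a highly transferable transformation-based adversarial attack method for spatially structured tasks such as semantic segmentation and object detection. It resolves spatial misalignment by transforming inputs and labels synchronously and substantially lowers model performance metrics.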

Details Motivation: Existing transformation-based adversarial attacks (TAAs) transfer well on non-structured classification tasks but perform poorly or fail on spatially structured tasks such as semantic segmentation and object detection, mainly because spatially structured labels are not transformed together with the input, which yields erroneous gradients. Method: The unified Spatial Alignment Framework (SAF) includes a Spatial Alignment (SA) algorithm that applies each adversarial transformation synchronously and consistently to both the input image and its spatially structured labels (e.g., segmentation masks, detection boxes). Result: In non-targeted attacks, average mIoU drops from 24.50 to 11.34 on Cityscapes and from 49.91 to 31.80 on Kvasir-SEG, and average mAP on COCO drops from 17.89 to 5.25, confirming SAF's key role for adversarial attacks on structured tasks. Conclusion: Synchronous transformation of spatial labels is the key to improving TAA transferability on structured vision tasks, and SAF provides a general, effective design pattern for attacks on such tasks. Abstract: Transformation-based adversarial attacks (TAAs) demonstrate strong transferability when deceiving classification models. However, existing TAAs often perform unsatisfactorily or even fail when applied to structured tasks such as semantic segmentation and object detection. Encouragingly, recent studies that categorize transformations into non-spatial and spatial transformations inspire us to address this challenge. We find that for non-structured tasks, labels are spatially non-structured, and thus TAAs are not required to adjust labels when applying spatial transformations. In contrast, for structured tasks, labels are spatially structured, and failing to transform labels synchronously with inputs can cause spatial misalignment and yield erroneous gradients. To address these issues, we propose a novel unified Spatial Alignment Framework (SAF) for highly transferable TAAs on spatially structured tasks, where the TAAs spatially transform labels synchronously with the input using the proposed Spatial Alignment (SA) algorithm. Extensive experiments demonstrate the crucial role of our SAF for TAAs on structured tasks. Specifically, in non-targeted attacks, our SAF degrades the average mIoU on Cityscapes from 24.50 to 11.34, and on Kvasir-SEG from 49.91 to 31.80, while reducing the average mAP of COCO from 17.89 to 5.25.
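
The core idea of SAF, transforming labels together with the input, can be shown with a minimal sketch. It uses a horizontal flip as the spatial transformation and applies it to an image, a segmentation mask, and bounding boxes at once. This is an illustration of synchronized transformation in general, not the paper's SA algorithm; the function names and the toy data are assumptions.

```python
def hflip_grid(grid):
    """Horizontally flip a 2D grid (image channel or segmentation mask)."""
    return [list(reversed(row)) for row in grid]


def hflip_box(box, width):
    """Horizontally flip an (x_min, y_min, x_max, y_max) box in an image of given width."""
    x_min, y_min, x_max, y_max = box
    return (width - x_max, y_min, width - x_min, y_max)


def aligned_hflip(image, mask, boxes):
    """Apply the same flip to the input and to its spatially structured labels."""
    width = len(image[0])
    return hflip_grid(image), hflip_grid(mask), [hflip_box(b, width) for b in boxes]


image = [[1, 2, 3]]
mask = [[0, 0, 1]]        # the object occupies the right-most pixel
boxes = [(2, 0, 3, 1)]

flipped_image, flipped_mask, flipped_boxes = aligned_hflip(image, mask, boxes)
print(flipped_image, flipped_mask, flipped_boxes)
```

If only the image were flipped, the loss would compare predictions for an object now on the left against a mask that still marks the right, which is the spatial misalignment and erroneous gradient that the abstract describes.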

[125] FEAST: Fully Connected Expressive Attention for Spatial Transcriptomics

Taejin Jeong,Joohyeok Kim,Jinyeong Kim,Chanyoung Kim,Seong Jae Hwang

Main category: cs.CV

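TL;DR: This paper presents FEAST, a framework that combines a fully connected graph, negative-aware attention, and an off-grid sampling strategy to improve gene expression prediction for spatial transcriptomics and to produce biologically meaningful attention maps.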

Details Motivation: Spatial transcriptomics (ST) is expensive, which limits its adoption; existing graph neural network methods depend on pre-defined sparse graphs and struggle to capture complex biological interactions. Method: FEAST (1) models the tissue as a fully connected graph so that all pairs of spots can interact; (2) introduces negative-aware attention to model both excitatory and inhibitory interactions; and (3) uses an off-grid sampling strategy to enrich morphological context. Result: On several public ST datasets, FEAST surpasses state-of-the-art methods in gene expression prediction and produces interpretable, biologically plausible attention maps. Conclusion: With a more flexible graph structure, finer-grained attention modeling, and richer image context, FEAST improves both the accuracy and the interpretability of spatial gene expression inference. Abstract: Spatial Transcriptomics (ST) provides spatially-resolved gene expression, offering crucial insights into tissue architecture and complex diseases. However, its prohibitive cost limits widespread adoption, leading to significant attention on inferring spatial gene expression from readily available whole slide images. While graph neural networks have been proposed to model interactions between tissue regions, their reliance on pre-defined sparse graphs prevents them from considering potentially interacting spot pairs, resulting in a structural limitation in capturing complex biological relationships. To address this, we propose FEAST (Fully connected Expressive Attention for Spatial Transcriptomics), an attention-based framework that models the tissue as a fully connected graph, enabling the consideration of all pairwise interactions. To better reflect biological interactions, we introduce negative-aware attention, which models both excitatory and inhibitory interactions, capturing essential negative relationships that standard attention often overlooks. Furthermore, to mitigate the information loss from truncated or ignored context in standard spot image extraction, we introduce an off-grid sampling strategy that gathers additional images from intermediate regions, allowing the model to capture a richer morphological context. Experiments on public ST datasets show that FEAST surpasses state-of-the-art methods in gene expression prediction while providing biologically plausible attention maps that clarify positive and negative interactions. Our code is available at https://github.com/starforTJ/FEAST.

[126] Semantic-Aware Prefix Learning for Token-Efficient Image Generation

Qingfeng Li,Haoxian Zhang,Xu He,Songlin Tang,Zhixue Fang,Xiaoqiang Liu,Pengfei Wan,Guoqi Li

Main category: cs.CV

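TL;DR: This paper presents SMAP, a semantic-aware prefix tokenizer that makes semantic information indispensable during training by injecting class-level semantic conditions and using a tail token dropping strategy. The hybrid generator CARD is used to verify that the resulting semantically aligned latent space is effective for image generation.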

Details Motivation: Existing visual tokenizers are mostly trained with reconstruction objectives, which gives latent representations weak semantics; methods that strengthen semantic alignment use the semantic signal only as an auxiliary regularizer rather than a necessary part of representation learning. Method: The SMAP tokenizer injects class-level semantic conditions into a query-based 1D tokenization framework and uses tail token dropping so that, under reduced token budgets, the semantic conditions and the early latent prefix must carry more of the representation. The hybrid generator CARD (causal autoregressive plus diffusion) is introduced to verify generative capability. Result: On ImageNet, SMAP improves reconstruction quality in both discrete and continuous tokenization settings, and its semantically aligned latent space yields strong downstream generation under compact token budgets. Conclusion: Semantic information can be designed as a core driver of tokenization rather than an auxiliary term; SMAP enforces this structurally and improves the semantic fidelity and usefulness of latent representations for generative modeling. Abstract: Visual tokenizers play a central role in latent image generation by bridging high-dimensional images and tractable generative modeling. However, most existing tokenizers are still trained with reconstruction-dominated objectives, which often yield latent representations that are only weakly grounded in high-level semantics. Recent approaches improve semantic alignment, but typically treat semantic signals as auxiliary regularization rather than making them functionally necessary for representation learning. We propose SMAP, a SeMantic-Aware Prefix tokenizer that injects class-level semantic conditions into a query-based 1D tokenization framework. To make semantics indispensable during training, SMAP introduces a tail token dropping strategy, which forces semantic conditions and early latent prefixes to bear increasing responsibility under progressively reduced token budgets. To verify that the resulting latent space is useful for generation rather than reconstruction alone, we further introduce CARD, a hybrid Causal AutoRegressive-Diffusion generator. Extensive experiments on ImageNet show that SMAP consistently improves reconstruction quality across discrete and continuous tokenization settings, and that its semantically grounded latent space yields strong downstream generation performance under compact token budgets.
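
Tail token dropping under a shrinking token budget, as described for SMAP, might look like the sketch below. The abstract does not specify the schedule, so the linear decrease, the function names, and the budget values are all assumptions; the only property taken from the source is that the tail of a 1D token sequence is removed and the budget is progressively reduced.

```python
def token_budget(step, total_steps, full, minimum):
    """Hypothetical linear schedule: budget shrinks from `full` to `minimum` tokens."""
    step = max(0, min(step, total_steps))
    return full - round((full - minimum) * step / total_steps)


def keep_prefix(tokens, budget):
    """Drop the tail of a 1D token sequence, keeping only the first `budget` tokens."""
    return tokens[:budget]


latent_tokens = list(range(32))  # stand-in for a 32-token 1D latent sequence
for step in (0, 50, 100):
    budget = token_budget(step, total_steps=100, full=32, minimum=4)
    print(step, budget, keep_prefix(latent_tokens, budget)[-1])
```

Because the decoder must still reconstruct the image from the shortened prefix, the early tokens and the class-level condition are pushed to carry the most important information, which is the effect the abstract attributes to the strategy.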

[127] Activation Matters: Test-time Activated Negative Labels for OOD Detection with Vision-Language Models

Yabin Zhang,Maya Varma,Yunhe Gao,Jean-Benoit Delbrouck,Jiaming Liu,Chong Wang,Curtis Langlotz

Main category: cs.CV

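TL;DR: This paper presents TANL, a training-free, test-efficient, and theoretically grounded OOD detection method. It dynamically mines negative labels that respond strongly to OOD samples at test time and uses an activation-aware score function, which markedly improves detection (e.g., FPR95 on ImageNet drops from 17.5% to 9.8%).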

Details Motivation: Existing OOD detection methods based on fixed, distant negative labels struggle to capture OOD characteristics because those labels are only weakly activated by OOD samples. Method: Test-time Activated Negative Labels (TANL) identifies high-confidence test samples online and accumulates their assignment probabilities over the corpus to build a label activation metric, which is used to select negative labels adapted to the test distribution; the paper further introduces a batch-adaptive variant and an activation-aware score function that gives more weight to strongly activated negative labels. Result: The method is validated across multiple backbones and task settings; on ImageNet, FPR95 drops to 9.8% (from 17.5%). It is training-free, efficient at test time, and robust. Conclusion: TANL is a novel, practical, and theoretically interpretable OOD detection framework that clearly outperforms static negative-label methods by using negative-label activation information dynamically and adaptively. Abstract: Out-of-distribution (OOD) detection aims to identify samples that deviate from in-distribution (ID). One popular pipeline addresses this by introducing negative labels distant from ID classes and detecting OOD based on their distance to these labels. However, such labels may present poor activation on OOD samples, failing to capture the OOD characteristics. To address this, we propose Test-time Activated Negative Labels (TANL) by dynamically evaluating activation levels across the corpus dataset and mining candidate labels with high activation responses during the testing process. Specifically, TANL identifies high-confidence test images online and accumulates their assignment probabilities over the corpus to construct a label activation metric. Such a metric leverages historical test samples to adaptively align with the test distribution, enabling the selection of distribution-adaptive activated negative labels. By further exploring the activation information within the current testing batch, we introduce a more fine-grained, batch-adaptive variant. To fully utilize label activation knowledge, we propose an activation-aware score function that emphasizes negative labels with stronger activations, boosting performance and enhancing its robustness to the label number. Our TANL is training-free, test-efficient, and grounded in theoretical justification. Experiments on diverse backbones and wide task settings validate its effectiveness. Notably, on the large-scale ImageNet benchmark, TANL significantly reduces the FPR95 from 17.5% to 9.8%.
Code is available at https://github.com/YBZh/OpenOOD-VLM.
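
Negative-label OOD scoring, the pipeline that TANL builds on, can be sketched as follows. The score is the softmax probability mass that an image assigns to ID labels when ID and negative labels compete. The optional per-label weights are a hypothetical stand-in for the activation-aware emphasis mentioned in the abstract; the exact TANL score function is not given there, and the similarity values and temperature below are made up.

```python
import math


def id_score(id_sims, neg_sims, neg_weights=None, temperature=0.01):
    """Probability mass on ID labels; a low score suggests an OOD sample.

    id_sims / neg_sims: image-text similarities to ID and negative labels.
    neg_weights: optional positive weights (assumed) that emphasize
    strongly activated negative labels.
    """
    if neg_weights is None:
        neg_weights = [1.0] * len(neg_sims)
    top = max(id_sims + neg_sims)  # subtract the max for numerical stability
    id_mass = sum(math.exp((s - top) / temperature) for s in id_sims)
    neg_mass = sum(
        w * math.exp((s - top) / temperature)
        for s, w in zip(neg_sims, neg_weights)
    )
    return id_mass / (id_mass + neg_mass)


# Hypothetical similarities for one image.
print(id_score([0.30, 0.22], [0.25, 0.10]))                 # unweighted
print(id_score([0.30, 0.22], [0.25, 0.10], [3.0, 1.0]))     # first negative emphasized
```

A sample is flagged as OOD when its score falls below a threshold. Raising the weight of a negative label that recent test data activates strongly lowers the score for images resembling that label, which is the intuition behind selecting activated rather than merely distant negative labels.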

[128] Hyperspectral Trajectory Image for Multi-Month Trajectory Anomaly Detection

Md Awsafur Rahman,Chandrakanth Gudavalli,Hardik Prajapati,B. S. Manjunath

Main category: cs.CV

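TL;DR: This paper presents TITAnD, which recasts trajectory anomaly detection as a vision problem. It unifies dense and sparse trajectory representations in a Hyperspectral Trajectory Image (HTI) and designs a Cyclic Factorized Transformer (CFT) to model the cyclic structure of human behavior, enabling efficient anomaly detection over multi-month dense GPS trajectories for the first time.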

Details Motivation: Existing dense GPS methods have quadratic cost and cannot support multi-month analysis; sparse stay-point methods scale but discard fine-grained evidence, which forces separate architectures and prevents knowledge transfer. The authors argue this bottleneck is unnecessary because human trajectories have a natural two-dimensional cyclic structure along within-day and across-day axes. Method: TITAnD (1) represents a trajectory as a Hyperspectral Trajectory Image (HTI), a day x time-of-day grid whose channels encode spatial, semantic, temporal, and kinematic information; (2) turns anomaly detection into image classification (agent level) and semantic segmentation (temporal localization); and (3) introduces the Cyclic Factorized Transformer (CFT), which factorizes attention along the two temporal axes, embedding a cyclic prior and greatly reducing computation. Result: TITAnD achieves the best AUC-PR on both dense and sparse benchmarks, outperforms vision models such as UNet, and is 11-75x faster than a standard Transformer with comparable memory; it is the first to detect anomalies over multi-month dense GPS trajectories. Conclusion: Recasting trajectory anomaly detection as a vision problem and combining it with structure-aware modeling (such as a cyclic prior) is key to achieving accuracy, efficiency, and generalization together. Abstract: Trajectory anomaly detection underpins applications from fraud detection to urban mobility analysis. Dense GPS methods preserve fine-grained evidence such as abnormal speeds and short-duration events, but their quadratic cost makes multi-month analysis intractable; consequently, no existing approach detects anomalies over multi-month dense GPS trajectories. The field instead relies on scalable sparse stay-point methods that discard this evidence, forcing separate architectures for each regime and preventing knowledge transfer. We argue this bottleneck is unnecessary: human trajectories, dense or sparse, share a natural two-dimensional cyclic structure along within-day and across-day axes. We therefore propose TITAnD (Trajectory Image Transformer for Anomaly Detection), which reformulates trajectory anomaly detection as a vision problem by representing trajectories as a Hyperspectral Trajectory Image (HTI): a day x time-of-day grid whose channels encode spatial, semantic, temporal, and kinematic information from either modality, unifying both under a single representation. Under this formulation, agent-level detection reduces to image classification and temporal localization to semantic segmentation. To model this representation, we introduce the Cyclic Factorized Transformer (CFT), which factorizes attention along the two temporal axes, encoding the cyclic inductive bias of human routines, while reducing attention cost by orders of magnitude and enabling dense multi-month anomaly detection for the first time.
Empirically, TITAnD achieves the best AUC-PR across sparse and dense benchmarks, surpassing vision models like UNet while being 11-75x faster than the Transformer with comparable memory, demonstrating that vision reformulation and structure-aware modeling are jointly essential. Code will be made public soon.
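
The day x time-of-day layout of the HTI and the savings from factorized attention can be sketched in a few lines. The binning below builds one channel of such a grid from timestamped samples; the mean aggregation, the function names, and the sample values are assumptions. The cost comparison assumes axial-style factorization, where each cell attends only along its own row and its own column, which is one plausible reading of "factorizes attention along the two temporal axes" and not a confirmed description of CFT.

```python
SECONDS_PER_DAY = 86400


def build_hti_channel(samples, num_days, bins_per_day):
    """Bin (seconds_since_start, value) samples into a [day][time-of-day] grid of means."""
    seconds_per_bin = SECONDS_PER_DAY // bins_per_day
    sums = [[0.0] * bins_per_day for _ in range(num_days)]
    counts = [[0] * bins_per_day for _ in range(num_days)]
    for timestamp, value in samples:
        day, second = divmod(int(timestamp), SECONDS_PER_DAY)
        if not 0 <= day < num_days:
            continue  # sample outside the observation window
        time_bin = min(second // seconds_per_bin, bins_per_day - 1)
        sums[day][time_bin] += value
        counts[day][time_bin] += 1
    return [
        [s / c if c else 0.0 for s, c in zip(sum_row, count_row)]
        for sum_row, count_row in zip(sums, counts)
    ]


def full_attention_pairs(days, bins):
    """Query-key pairs when every grid cell attends to every other cell."""
    return (days * bins) ** 2


def factorized_attention_pairs(days, bins):
    """Pairs when each cell attends along its row and its column only (assumed)."""
    return days * bins * (days + bins)


speeds = [(0, 2.0), (10, 4.0), (SECONDS_PER_DAY + 43200, 6.0)]  # hypothetical speed samples
grid = build_hti_channel(speeds, num_days=2, bins_per_day=2)
print(grid)
print(full_attention_pairs(90, 288) / factorized_attention_pairs(90, 288))
```

For a three-month trajectory at 5-minute resolution (90 days x 288 bins), the factorized count is roughly 69 times smaller under this assumption, which gives a feel for why multi-month dense analysis becomes tractable.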

[129] Towards Practical Lossless Neural Compression for LiDAR Point Clouds

Pengpeng Yu,Haoran Li,Runqing Jiang,Dingquan Li,Jing Wang,Liang Lin,Yulan Guo

Main category: cs.CV

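TL;DR: This paper presents an efficient lossless predictive coding method for LiDAR point clouds. Two lightweight modules, geometry re-densification and cross-scale feature propagation, improve compression performance while keeping coding fast, and an integer-only inference pipeline ensures bit-exact cross-platform consistency and stable entropy coding.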

Details Motivation: The geometric detail of LiDAR point clouds is extremely sparse, which makes context modeling inefficient and limits the speed and performance of existing compression methods. Method: The compact representation framework includes a Geometry Re-Densification Module (iterative densification, feature extraction, and re-sparsification) and a Cross-scale Feature Propagation Module (which uses multi-resolution occupancy cues to guide hierarchical feature propagation), plus an integer-only inference pipeline for bit-exactness. Result: The method reaches competitive lossless compression performance at real-time coding speed and avoids the entropy-coding collapse seen in existing neural compression methods. Conclusion: The method balances efficiency, performance, and deployment robustness, offering a practical approach to lossless LiDAR point cloud compression. Abstract: LiDAR point clouds are fundamental to various applications, yet the extreme sparsity of high-precision geometric details hinders efficient context modeling, thereby limiting the compression speed and performance of existing methods. To address this challenge, we propose a compact representation for efficient predictive lossless coding. Our framework comprises two lightweight modules. First, the Geometry Re-Densification Module iteratively densifies encoded sparse geometry, extracts features at a dense scale, and then sparsifies the features for predictive coding. This module avoids costly computation on highly sparse details while maintaining a lightweight prediction head. Second, the Cross-scale Feature Propagation Module leverages occupancy cues from multiple resolution levels to guide hierarchical feature propagation, enabling information sharing across scales and reducing redundant feature extraction. Additionally, we introduce an integer-only inference pipeline to enable bit-exact cross-platform consistency, which avoids the entropy-coding collapse observed in existing neural compression methods and further accelerates coding. Experiments demonstrate competitive compression performance at real-time speed. Code will be released upon acceptance. Code is available at https://github.com/pengpeng-yu/FastPCC.
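
Why integer-only inference gives bit-exact results can be seen from a small, general example that is independent of the paper's implementation. Floating-point addition is not associative, so two platforms that sum the same values in a different order can produce slightly different probabilities, and an entropy decoder that disagrees with the encoder by even one bit fails. Integer addition has no such dependence on order. The 16-bit fixed-point format below is an assumption for illustration.

```python
def to_fixed(value, frac_bits=16):
    """Quantize a real number to a fixed-point integer (assumed 16 fractional bits)."""
    return round(value * (1 << frac_bits))


values = [0.1, 0.2, 0.3]

# Floating point: the result depends on the order of accumulation.
float_left = (values[0] + values[1]) + values[2]
float_right = values[0] + (values[1] + values[2])
print(float_left == float_right, float_left, float_right)

# Fixed point: integer addition is exact, so any order gives the same bits.
fixed = [to_fixed(v) for v in values]
fixed_left = (fixed[0] + fixed[1]) + fixed[2]
fixed_right = fixed[0] + (fixed[1] + fixed[2])
print(fixed_left == fixed_right, fixed_left)
```

In a learned codec the encoder and decoder must derive identical probability tables from the network output, so removing order-dependent rounding is what keeps decoding stable across different hardware and libraries.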

[130] ViewSplat: View-Adaptive Dynamic Gaussian Splatting for Feed-Forward Synthesis

Moonyeon Jeong,Seunggi Min,Suhyeon Lee,Hongje Seong

Main category: cs.CV

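TL;DR: ViewSplat is a view-adaptive 3D Gaussian splatting network in which dynamic MLPs adjust Gaussian attributes according to the target view at render time, markedly improving novel view synthesis quality while keeping fast inference and real-time rendering.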

Details Motivation: Existing feed-forward 3D Gaussian splatting methods are limited by the expressive capacity of static Gaussian primitives and struggle to stay faithful across all viewpoints, leaving a fidelity gap. Method: ViewSplat introduces view-adaptive dynamic splatting: it first predicts base Gaussian parameters and the weights of dynamic MLPs; at render time, the MLPs take the target view coordinates as input and output view-dependent residual corrections for each Gaussian attribute (position, scale, rotation, opacity, color). Result: The method reaches state-of-the-art fidelity in novel view synthesis with 17 FPS inference and 154 FPS rendering. Conclusion: View-adaptive dynamic splatting compensates for the limits of static representations and combines high fidelity with high efficiency. Abstract: We present ViewSplat, a view-adaptive 3D Gaussian splatting network for novel view synthesis from unposed images. While recent feed-forward 3D Gaussian splatting has significantly accelerated 3D scene reconstruction by bypassing per-scene optimization, a fundamental fidelity gap remains. We attribute this bottleneck to the limited capacity of single-step feed-forward networks to regress static Gaussian primitives that satisfy all viewpoints. To address this limitation, we shift the paradigm from static primitive regression to view-adaptive dynamic splatting. Instead of a rigid Gaussian representation, our pipeline learns a view-adaptable latent representation. Specifically, ViewSplat initially predicts base Gaussian primitives alongside the weights of dynamic MLPs. During rendering, these MLPs take target view coordinates as input and predict view-dependent residual updates for each Gaussian attribute (i.e., 3D position, scale, rotation, opacity, and color). This mechanism, which we term view-adaptive dynamic splatting, allows each primitive to rectify initial estimation errors, effectively capturing high-fidelity appearances. Extensive experiments demonstrate that ViewSplat achieves state-of-the-art fidelity while maintaining fast inference (17 FPS) and real-time rendering (154 FPS).
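
The view-dependent residual update described for ViewSplat can be sketched as a base attribute plus the output of a small MLP that receives the target view. The two-layer shape, the ReLU activation, the hand-picked weights, and the function names below are all assumptions; in the paper the MLP weights are themselves predicted by the network, which this sketch does not model.

```python
def linear(inputs, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, bias)]


def residual_mlp(view, params):
    """Tiny two-layer MLP mapping view coordinates to an attribute residual."""
    w1, b1, w2, b2 = params
    hidden = [max(0.0, h) for h in linear(view, w1, b1)]  # ReLU is an assumption
    return linear(hidden, w2, b2)


def adapt_attribute(base, view, params):
    """Base Gaussian attribute plus a view-dependent correction."""
    return [b + r for b, r in zip(base, residual_mlp(view, params))]


base_position = [0.0, 1.0, 2.0]
params = (
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],    # w1: 3 inputs -> 2 hidden units
    [0.0, 0.0],                            # b1
    [[0.1, 0.0], [0.0, 0.1], [0.0, 0.0]],  # w2: 2 hidden units -> 3 outputs
    [0.0, 0.0, 0.0],                       # b2
)
adapted = adapt_attribute(base_position, [1.0, -2.0, 0.5], params)
print(adapted)
```

With all-zero MLP weights the function returns the base attribute unchanged, so the static representation is the special case in which the residual is zero for every view.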

[131] EagleNet: Energy-Aware Fine-Grained Relationship Learning Network for Text-Video Retrieval

Yuhan Chen,Pengwen Dai,Chuan Wang,Dayan Wu,Xiaochun Cao

Main category: cs.CV

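TL;DR: This paper proposes EagleNet, which uses Fine-Grained Relationship Learning (FRL) and Energy-Aware Matching (EAM) to enrich text expressiveness with video frame context, improving text-video retrieval.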

Details Motivation: Existing methods model only the interactions between text and video or frames and ignore the rich relationships among frames within a video, so the expanded text lacks frame context and ends up semantically inconsistent with the video. Method: The Energy-Aware Fine-Grained Relationship Learning Network (EagleNet): 1) builds a graph over text candidates and frames; 2) uses Fine-Grained Relationship Learning (FRL) to model text-frame relationships and aggregate them into a context-aware text embedding; 3) uses Energy-Aware Matching (EAM) to model the energy of text-frame interactions and better fit the real distribution; 4) replaces the softmax loss with a sigmoid contrastive loss for better cross-modal alignment and training stability. Result: State-of-the-art or leading performance on four mainstream text-video retrieval datasets: MSRVTT, DiDeMo, MSVD, and VATEX. Conclusion: By explicitly modeling frame context and fine-grained text-frame interactions, and by adding an energy-aware mechanism and an improved loss, EagleNet yields richer, more context-aware text representations and markedly better text-video retrieval. Abstract: Text-video retrieval tasks have seen significant improvements due to the recent development of large-scale vision-language pre-trained models. Traditional methods primarily focus on video representations or cross-modal alignment, while recent works shift toward enriching text expressiveness to better match the rich semantics in videos. However, these methods use only interactions between text and frames/video, and ignore rich interactions among the internal frames within a video, so the final expanded text cannot capture frame contextual information, leading to disparities between text and video. In response, we introduce Energy-Aware Fine-Grained Relationship Learning Network (EagleNet) to generate accurate and context-aware enriched text embeddings. Specifically, the proposed Fine-Grained Relationship Learning mechanism (FRL) first constructs a text-frame graph by the generated text candidates and frames, then learns relationships among texts and frames, which are finally used to aggregate text candidates into an enriched text embedding that incorporates frame contextual information. To further improve fine-grained relationship learning in FRL, we design Energy-Aware Matching (EAM) to model the energy of text-frame interactions and thus accurately capture the distribution of real text-video pairs. Moreover, for more effective cross-modal alignment and stable training, we replace the conventional softmax-based contrastive loss with the sigmoid loss. Extensive experiments have demonstrated the superiority of EagleNet across MSRVTT, DiDeMo, MSVD, and VATEX. Codes are available at https://github.com/draym28/EagleNet.
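The abstract above says only that the softmax-based contrastive loss is replaced by "the sigmoid loss". As a minimal sketch of what such a pairwise loss can look like, the Python below assumes a SigLIP-style formulation in which every text-video pair is scored as an independent binary decision. The function names, the `temperature` and `bias` values, and the exact form are illustrative assumptions, not details taken from EagleNet.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid_contrastive_loss(sim, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss over an N x N text-video similarity matrix.

    sim[i][j] is the similarity of text i and video j; diagonal entries
    are the matched pairs. temperature and bias are hypothetical values.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0
            logit = label * (temperature * sim[i][j] + bias)
            # -log(sigmoid(logit)) == softplus(-logit)
            total += softplus(-logit)
    return total / n


aligned = [[1.0, 0.0], [0.0, 1.0]]    # matched pairs score highest
shuffled = [[0.0, 1.0], [1.0, 0.0]]   # matched pairs score lowest
print(sigmoid_contrastive_loss(aligned) < sigmoid_contrastive_loss(shuffled))
```

Unlike a softmax loss, each pair contributes its own term, so no normalization over the whole batch is needed.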

[132] V2U4Real: A Real-world Large-scale Dataset for Vehicle-to-UAV Cooperative Perception

Weijia Li,Haoen Xiang,Tianxu Wang,Shuaibing Wu,Qiming Xia,Cheng Wang,Chenglu Wen

Main category: cs.CV

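TL;DR: This paper presents V2U4Real, the first large-scale real-world multi-modal dataset for Vehicle-to-UAV (V2U) cooperative perception, aiming to overcome the limits of ground-level cooperative perception (such as V2V/V2I) under large-scale occlusion and at long range.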

Details Motivation: Existing Vehicle-to-Vehicle (V2V) and Vehicle-to-Infrastructure (V2I) cooperative perception is confined to ground-level viewpoints and cannot handle large-scale occlusions or long-range perception in complex environments, which calls for a cross-view paradigm such as adding a UAV viewpoint. Method: The authors build V2U4Real, the first real-world multi-modal Vehicle-to-UAV (V2U) cooperative perception dataset, with multi-view LiDAR point clouds and RGB images captured synchronously by a ground vehicle and a UAV and annotated with fine-grained 3D labels; they also establish three benchmark tasks: single-agent detection, cooperative detection, and object tracking. Result: Evaluations of several state-of-the-art models show that V2U cooperation significantly improves perception robustness and long-range awareness; the dataset and code are open-sourced. Conclusion: V2U4Real fills the gap in cross-platform, cross-view cooperative perception datasets and provides a key resource and evaluation standard for overcoming perception bottlenecks in future intelligent driving systems. Abstract: Modern autonomous vehicle perception systems are often constrained by occlusions, blind spots, and limited sensing range. While existing cooperative perception paradigms, such as Vehicle-to-Vehicle (V2V) and Vehicle-to-Infrastructure (V2I), have demonstrated their effectiveness in mitigating these challenges, they remain limited to ground-level collaboration and cannot fully address large-scale occlusions or long-range perception in complex environments. To advance research in cross-view cooperative perception, we present V2U4Real, the first large-scale real-world multi-modal dataset for Vehicle-to-UAV (V2U) cooperative object perception. V2U4Real is collected by a ground vehicle and a UAV equipped with multi-view LiDARs and RGB cameras. The dataset covers urban streets, university campuses, and rural roads under diverse traffic scenarios, comprising over 56K LiDAR frames, 56K multi-view camera images, and 700K annotated 3D bounding boxes across four classes. To support a wide range of research tasks, we establish benchmarks for single-agent 3D object detection, cooperative 3D object detection, and object tracking. Comprehensive evaluations of several state-of-the-art models demonstrate the effectiveness of V2U cooperation in enhancing perception robustness and long-range awareness. The V2U4Real dataset and codebase are available at https://github.com/VjiaLi/V2U4Real.

[133] Towards Controllable Low-Light Image Enhancement: A Continuous Multi-illumination Dataset and Efficient State Space Framework

Hongru Han,Tingrui Guo,Liming Zhang,Yan Su,Qiwen Xu,Zhuohua Ye

Main category: cs.CV

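TL;DR: This paper proposes CLE-RWKV, a framework for Controllable Low-light Enhancement (CLE). With the new Light100 benchmark and a noise-decoupled supervision strategy in the HVI color space, it addresses the luminance inconsistency that the ill-posed nature of the task causes in conventional LLIE methods, and avoids reliance on gt-mean post-processing.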

Details Motivation: Conventional low-light image enhancement (LLIE) is modeled as a deterministic mapping, but the task is inherently ill-posed (it depends on unknown ambient and sensor parameters), which causes luminance mismatches between predictions and labels and often requires gt-mean post-processing; this work recasts it as a well-posed conditional problem. Method: The authors propose the Controllable Low-light Enhancement (CLE) paradigm; build a new benchmark, Light100, with continuous real-world illumination changes; apply noise-decoupled supervision in the HVI color space to separate illumination modulation from texture restoration; and design a Space-to-Depth (S2D) strategy that adapts RWKV-style State Space Models (SSMs) to dense prediction, recovering local inductive biases and bridging the scanning gap. Result: Experiments on seven benchmarks show that CLE-RWKV is competitive in both performance and luminance controllability, significantly reduces reliance on gt-mean post-processing, and offers a practical alternative for real-world multi-illumination scenes. Conclusion: Moving LLIE from deterministic mapping to controllable conditional generation is a more reasonable and practical paradigm; CLE-RWKV, combining Light100, HVI decoupled supervision, and the S2D architecture, mitigates ill-posedness and improves luminance controllability and color fidelity. Abstract: Low-light image enhancement (LLIE) has traditionally been formulated as a deterministic mapping. However, this paradigm often struggles to account for the ill-posed nature of the task, where unknown ambient conditions and sensor parameters create a multimodal solution space. Consequently, state-of-the-art methods frequently encounter luminance discrepancies between predictions and labels, often necessitating "gt-mean" post-processing to align output luminance for evaluation. To address this fundamental limitation, we propose a transition toward Controllable Low-light Enhancement (CLE), explicitly reformulating the task as a well-posed conditional problem. To this end, we introduce CLE-RWKV, a holistic framework supported by Light100, a new benchmark featuring continuous real-world illumination transitions. To resolve the conflict between luminance control and chromatic fidelity, a noise-decoupled supervision strategy in the HVI color space is employed, effectively separating illumination modulation from texture restoration. Architecturally, to adapt efficient State Space Models (SSMs) for dense prediction, we leverage a Space-to-Depth (S2D) strategy. By folding spatial neighborhoods into channel dimensions, this design allows the model to recover local inductive biases and effectively bridge the "scanning gap" inherent in flattened visual sequences without sacrificing linear complexity. Experiments across seven benchmarks demonstrate that our approach achieves competitive performance and robust controllability, providing a real-world multi-illumination alternative that significantly reduces the reliance on gt-mean post-processing.
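The Space-to-Depth step described above ("folding spatial neighborhoods into channel dimensions") can be illustrated with a generic rearrangement. The sketch below is plain Python over nested lists and shows only the standard S2D operation; the function name, the block size, and the channel ordering within each cell are assumptions, and it makes no claim about how CLE-RWKV integrates the step with its RWKV blocks.

```python
def space_to_depth(image, block=2):
    """Fold each block x block spatial neighborhood into the channel axis.

    image: nested list with shape [H][W][C]; H and W must be divisible
    by block. Returns a nested list with shape
    [H // block][W // block][C * block * block].
    """
    h, w = len(image), len(image[0])
    if h % block or w % block:
        raise ValueError("height and width must be divisible by block")
    out = []
    for i in range(0, h, block):
        row = []
        for j in range(0, w, block):
            cell = []
            # Visit the neighborhood in row-major order and append channels.
            for di in range(block):
                for dj in range(block):
                    cell.extend(image[i + di][j + dj])
            row.append(cell)
        out.append(row)
    return out


# A 4 x 4 single-channel image whose pixel value equals its flat index.
image = [[[r * 4 + c] for c in range(4)] for r in range(4)]
folded = space_to_depth(image)
print(folded[0][0])  # the top-left 2 x 2 neighborhood: [0, 1, 4, 5]
```

After folding, each token in the flattened sequence already carries its 2 x 2 neighborhood, so adjacent pixels are no longer separated by a full scan line.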

[134] Adaptive Learned Image Compression with Graph Neural Networks

Yunuo Chen,Bing He,Zezheng Lyu,Hongwei Hu,Qunshan Gu,Yuan Tian,Guo Lu

Main category: cs.CV

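TL;DR: This paper proposes GLIC, an adaptive image compression framework based on Graph Neural Networks (GNNs). By building dual-scale graphs and dynamically adjusting the number of neighbors per node, it overcomes the rigidity of CNNs and Transformers in modeling image redundancy and reaches state-of-the-art performance on several datasets.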

Details Motivation: Mainstream learned image compression methods (CNNs, Transformers) use fixed receptive fields and static connectivity patterns, so they cannot adaptively model spatially varying redundancy in images, especially global redundancy. Method: The GNN-based GLIC framework builds dual-scale graphs to obtain flexible, data-driven receptive fields and introduces adaptive connectivity that sets each node's neighbor count according to local content complexity. Result: On Kodak, Tecnick, and CLIC, GLIC achieves BD-rate reductions of 19.29%, 21.69%, and 18.71% relative to VTM-9.1, reaching state-of-the-art performance. Conclusion: Content-adaptive graph modeling significantly improves the ability to capture diverse redundancy patterns, demonstrating the effectiveness and potential of GNNs for efficient, adaptive image compression. Abstract: Efficient image compression relies on modeling both local and global redundancy. Most state-of-the-art (SOTA) learned image compression (LIC) methods are based on CNNs or Transformers, which are inherently rigid. Standard CNN kernels and window-based attention mechanisms impose fixed receptive fields and static connectivity patterns, which potentially couple non-redundant pixels simply due to their proximity in Euclidean space. This rigidity limits the model's ability to adaptively capture spatially varying redundancy across the image, particularly at the global level. To overcome these limitations, we propose a content-adaptive image compression framework based on Graph Neural Networks (GNNs). Specifically, our approach constructs dual-scale graphs that enable flexible, data-driven receptive fields. Furthermore, we introduce adaptive connectivity by dynamically adjusting the number of neighbors for each node based on local content complexity. These innovations empower our Graph-based Learned Image Compression (GLIC) model to effectively model diverse redundancy patterns across images, leading to more efficient and adaptive compression. Experiments demonstrate that GLIC achieves state-of-the-art performance, achieving BD-rate reductions of 19.29%, 21.69%, and 18.71% relative to VTM-9.1 on Kodak, Tecnick, and CLIC, respectively. Code will be released at https://github.com/UnoC-727/GLIC.

[135] MACRO: Advancing Multi-Reference Image Generation with Structured Long-Context Data

Zhekai Chen,Yuqing Wang,Manyuan Zhang,Xihui Liu

Main category: cs.CV

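TL;DR: This paper introduces the MacroData dataset and the MacroBench benchmark to address the performance bottleneck in multi-reference image generation caused by scarce data and missing evaluation standards.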

Details Motivation: Existing models degrade sharply when conditioned on multiple visual references; the root cause is the lack of datasets with structured, long-context supervision, along with the absence of standardized evaluation protocols. Method: The authors build MacroData, a large-scale dataset of 400K samples with up to 10 reference images each, organized along four dimensions (Customization, Illustration, Spatial reasoning, and Temporal dynamics); they also propose MacroBench, an evaluation benchmark of 4,000 samples, and run fine-tuning and ablation experiments. Result: Fine-tuning on MacroData substantially improves multi-reference image generation; ablations show synergistic gains from cross-task co-training and from long-context handling strategies. Conclusion: MacroData and MacroBench provide high-quality data and a unified evaluation standard for multi-reference image generation, advancing the field. Abstract: Generating images conditioned on multiple visual references is critical for real-world applications such as multi-subject composition, narrative illustration, and novel view synthesis, yet current models suffer from severe performance degradation as the number of input references grows. We identify the root cause as a fundamental data bottleneck: existing datasets are dominated by single- or few-reference pairs and lack the structured, long-context supervision needed to learn dense inter-reference dependencies. To address this, we introduce MacroData, a large-scale dataset of 400K samples, each containing up to 10 reference images, systematically organized across four complementary dimensions -- Customization, Illustration, Spatial reasoning, and Temporal dynamics -- to provide comprehensive coverage of the multi-reference generation space. Recognizing the concurrent absence of standardized evaluation protocols, we further propose MacroBench, a benchmark of 4,000 samples that assesses generative coherence across graded task dimensions and input scales. Extensive experiments show that fine-tuning on MacroData yields substantial improvements in multi-reference generation, and ablation studies further reveal synergistic benefits of cross-task co-training and effective strategies for handling long-context complexity. The dataset and benchmark will be publicly released.

[136] HeSS: Head Sensitivity Score for Sparsity Redistribution in VGGT

Yongsung Kim,Wooseok Song,Jaihyun Lew,Hun Hwangbo,Jaehoon Lee,Sungroh Yoon

Main category: cs.CV

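TL;DR: This paper proposes a two-stage sparsification method based on differences in head sensitivity (HeSS) to mitigate the accuracy loss that uniform sparsity causes in the global attention layers of the Visual Geometry Grounded Transformer (VGGT).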

Details Motivation: Existing sparsification methods apply one uniform sparsity pattern to all attention heads and ignore how differently heads respond to sparsification, which leads to substantial accuracy loss. Method: A two-stage pipeline. The first stage quantifies each head's sensitivity to sparsification with a new metric, the Head Sensitivity Score (HeSS), based on a Hessian approximation with respect to two error terms on a small calibration set. The second stage uses HeSS to redistribute the total attention budget, keeping denser attention for sensitive heads and applying higher sparsity to robust ones. Result: Experiments show that HeSS captures the heterogeneity of head sensitivity, and that the method significantly mitigates performance degradation at high sparsity while remaining robust across sparsity levels. Conclusion: Attention heads in the global attention layers have heterogeneous sparsification sensitivity; head-aware sparsification guided by HeSS is an efficient and robust acceleration strategy and suggests a route to lightweight deployment of large models such as VGGT. Abstract: Visual Geometry Grounded Transformer (VGGT) has advanced 3D vision, yet its global attention layers suffer from quadratic computational costs that hinder scalability. Several sparsification-based acceleration techniques have been proposed to alleviate this issue, but they often suffer from substantial accuracy degradation. We hypothesize that the accuracy degradation stems from the heterogeneity in head-wise sparsification sensitivity, as the existing methods apply a uniform sparsity pattern across all heads. Motivated by this hypothesis, we present a two-stage sparsification pipeline that effectively quantifies and exploits head-wise sparsification sensitivity. In the first stage, we measure head-wise sparsification sensitivity using a novel metric, the Head Sensitivity Score (HeSS), which approximates the Hessian with respect to two distinct error terms on a small calibration set. In the inference stage, we perform HeSS-Guided Sparsification, leveraging the pre-computed HeSS to reallocate the total attention budget, assigning denser attention to sensitive heads and sparser attention to more robust ones. We demonstrate that HeSS effectively captures head-wise sparsification sensitivity and empirically confirm that attention heads in the global attention layers exhibit heterogeneous sensitivity characteristics. Extensive experiments further show that our method effectively mitigates performance degradation under high sparsity, demonstrating strong robustness across varying sparsification levels. Code is available at https://github.com/libary753/HeSS.
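The budget reallocation idea in the abstract above (denser attention for sensitive heads, sparser for robust ones, under a fixed total budget) can be sketched as a proportional split with per-head limits. The code below is a hypothetical illustration only: the score values, the `min_keep` and `max_keep` limits, and the proportional rule are assumptions, and the actual HeSS-Guided Sparsification may allocate the budget differently.

```python
def reallocate_budget(scores, mean_keep, min_keep=0.05, max_keep=1.0):
    """Split a total attention budget across heads by sensitivity score.

    scores: non-negative per-head sensitivity scores (hypothetical).
    mean_keep: target average fraction of attention entries kept per head.
    Returns per-head keep ratios in [min_keep, max_keep] with that mean.
    """
    n = len(scores)
    if not min_keep <= mean_keep <= max_keep:
        raise ValueError("mean_keep must lie between min_keep and max_keep")
    keep = [min_keep] * n                 # every head receives the floor
    remaining = (mean_keep - min_keep) * n
    active = set(range(n))                # heads that can still grow
    while remaining > 1e-12 and active:
        total = sum(scores[i] for i in active)
        if total > 0:
            share = {i: remaining * scores[i] / total for i in active}
        else:
            share = {i: remaining / len(active) for i in active}
        remaining = 0.0
        for i in list(active):
            new = keep[i] + share[i]
            if new >= max_keep:
                # Cap this head and hand the overflow to the others.
                remaining += new - max_keep
                new = max_keep
                active.discard(i)
            keep[i] = new
    return keep


ratios = reallocate_budget([4.0, 1.0, 1.0, 2.0], mean_keep=0.25)
print(ratios)  # the most sensitive head keeps the largest share
```

A uniform scheme would give every head the same 0.25 keep ratio; here the same total is spent, but unevenly.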

[137] Image Rotation Angle Estimation: Comparing Circular-Aware Methods

Maximilian Woehrer

Main category: cs.CV

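TL;DR: This paper systematically studies five circular-aware methods for automatic image rotation estimation across many modern architectures. The probabilistic circular Gaussian distribution method is the most robust, while classification is the most accurate on well-matched backbones; the best configurations reach mean absolute errors of 1.23°, 3.71°, and 2.84° on DRC-D, COCO 2014, and COCO 2017, respectively.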

Details Motivation: Automatic image rotation estimation is a key preprocessing step in vision pipelines, but angles have circular topology, which creates discontinuities at the boundary that standard regression cannot model accurately. Method: A systematic evaluation of five circular-aware methods: direct angle regression with a circular loss, classification via angular binning, unit-vector regression, a phase-shifting coder, and a circular Gaussian distribution; all use ImageNet-pretrained models whose output heads are adapted for rotation prediction. Result: The circular Gaussian distribution is the most robust across backbones; classification is the most accurate on well-matched backbones such as EfficientViT-B3 (MAE of 1.23° on DRC-D) but suffers training instabilities on others; the circular Gaussian distribution with MambaOut Base reaches 1.24° with greater robustness; MAE on COCO 2014 and COCO 2017 is 3.71° and 2.84°, substantially better than prior work. Conclusion: Probabilistic modeling that respects circular structure (such as the circular Gaussian distribution) generalizes better and is more stable than deterministic regression or classification, making it the preferable approach to image rotation estimation. Abstract: Automatic image rotation estimation is a key preprocessing step in many vision pipelines. This task is challenging because angles have circular topology, creating boundary discontinuities that hinder standard regression methods. We present a comprehensive study of five circular-aware methods for global orientation estimation: direct angle regression with circular loss, classification via angular binning, unit-vector regression, phase-shifting coder, and circular Gaussian distribution. Using transfer learning from ImageNet-pretrained models, we systematically evaluate these methods across sixteen modern architectures by adapting their output heads for rotation-specific predictions. Our results show that probabilistic methods, particularly the circular Gaussian distribution, are the most robust across architectures, while classification achieves the best accuracy on well-matched backbones but suffers training instabilities on others. The best configuration (classification with EfficientViT-B3) achieves a mean absolute error (MAE) of 1.23° (mean across five independent runs) on the DRC-D dataset, while the circular Gaussian distribution with MambaOut Base achieves a virtually identical 1.24° with greater robustness across backbones. Training and evaluating our top-performing method-architecture combinations on COCO 2014, the best configuration reaches 3.71° MAE, improving substantially over prior work, with further improvement to 2.84° on the larger COCO 2017 dataset.
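The boundary discontinuity mentioned above is easy to see numerically: 359° and 1° are 2° apart on the circle, yet a plain absolute difference reports 358°. The sketch below shows a wrap-around angular error and the decoding step used by unit-vector regression. Both are generic formulations with assumed function names, not the paper's exact losses or output heads.

```python
import math


def circular_error_deg(pred, target):
    """Smallest absolute difference between two angles, in degrees."""
    diff = (pred - target + 180.0) % 360.0 - 180.0
    return abs(diff)


def angle_from_unit_vector(cos_val, sin_val):
    """Decode a (cos, sin) regression output into an angle in [0, 360)."""
    return math.degrees(math.atan2(sin_val, cos_val)) % 360.0


def circular_mae(preds, targets):
    """Mean absolute error that respects wrap-around at 0/360 degrees."""
    errors = [circular_error_deg(p, t) for p, t in zip(preds, targets)]
    return sum(errors) / len(errors)


print(abs(359.0 - 1.0))                 # naive error: 358.0
print(circular_error_deg(359.0, 1.0))   # circular error: 2.0
```

Regressing a (cos, sin) pair avoids the jump entirely because the target is continuous on the circle; `atan2` recovers the angle afterwards.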

[138] InstanceAnimator: Multi-Instance Sketch Video Colorization

Yinhan Zhang,Yue Ma,Bingyuan Wang,Kunyu Feng,Yeying Jin,Qifeng Chen,Anyi Rao,Zeyu Wang

Main category: cs.CV

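TL;DR: This paper proposes InstanceAnimator, a Diffusion Transformer framework for multi-instance sketch video colorization. A Canvas Guidance Condition, an Instance Matching Mechanism, and an Adaptive Decoupled Control Module address three limitations of existing methods: user control, instance alignment, and detail fidelity.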

Details Motivation: Existing methods have three core limitations: inflexible user control due to reliance on a single reference frame, poor instance controllability that causes misalignment in multi-character scenes, and degraded detail fidelity in fine-grained regions. Method: Three innovations: 1) a Canvas Guidance Condition that allows free placement of reference elements and background; 2) an Instance Matching Mechanism that fuses instance features with the sketches for precise control over multiple characters; 3) an Adaptive Decoupled Control Module that injects semantic features from characters, backgrounds, and text into the diffusion process to improve detail quality. Result: Extensive experiments show that InstanceAnimator delivers better user control, higher visual quality, and stronger instance consistency on multi-instance colorization. Conclusion: InstanceAnimator overcomes key shortcomings of existing sketch video colorization methods and offers a new approach to multi-instance, controllable, high-fidelity video colorization. Abstract: We propose InstanceAnimator, a novel Diffusion Transformer framework for multi-instance sketch video colorization. Existing methods suffer from three core limitations: inflexible user control due to heavy reliance on single reference frames, poor instance controllability leading to misalignment in multi-character scenarios, and degraded detail fidelity in fine-grained regions. To address these challenges, we introduce three corresponding innovations. First, a Canvas Guidance Condition eliminates workflow fragmentation by allowing free placement of reference elements and background, enabling unprecedented user flexibility. Second, an Instance Matching Mechanism resolves misalignment by integrating instance features with the sketches, ensuring precise control over multiple characters. Third, an Adaptive Decoupled Control Module enhances detail fidelity by injecting semantic features from characters, backgrounds, and text conditions into the diffusion process. Extensive experiments demonstrate that InstanceAnimator achieves superior multi-instance colorization with enhanced user control, high visual quality, and strong instance consistency.

[139] CLIP-RD: Relational Distillation for Efficient CLIP Knowledge Distillation

Jeannie Chung,Hanna Jang,Ingyeong Yang,Uiwon Hwang,Jaehyung Sim

Main category: cs.CV

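TL;DR: This paper proposes CLIP-RD, a relational knowledge distillation framework with Vertical Relational Distillation (VRD) and Cross Relational Distillation (XRD) that better preserves the structural relationships of the CLIP teacher's multimodal embeddings and improves the zero-shot performance of lightweight student models.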

Details Motivation: Existing CLIP distillation methods do not explicitly model the multi-directional relational dependencies between teacher and student embeddings, so the student struggles to preserve the structural relationships encoded by the teacher. Method: VRD enforces distribution-level consistency of teacher-student distillation strength across modalities, and XRD imposes bidirectional symmetry on cross-modal teacher-student similarity distributions; together they model multi-directional relational structure. Result: CLIP-RD outperforms existing methods by 0.8 percentage points on zero-shot classification. Conclusion: By explicitly modeling multi-directional relationships between teacher and student embeddings, CLIP-RD aligns the student's embedding geometry more faithfully with the teacher's and improves distillation. Abstract: CLIP aligns image and text embeddings via contrastive learning and demonstrates strong zero-shot generalization. Its large-scale architecture requires substantial computational and memory resources, motivating the distillation of its capabilities into lightweight student models. However, existing CLIP distillation methods do not explicitly model multi-directional relational dependencies between teacher and student embeddings, limiting the student's ability to preserve the structural relationships encoded by the teacher. To address this, we propose a relational knowledge distillation framework that introduces two novel methods, Vertical Relational Distillation (VRD) and Cross Relational Distillation (XRD). VRD enforces consistency of teacher-student distillation strength across modalities at the distribution level, while XRD imposes bidirectional symmetry on cross-modal teacher-student similarity distributions. By jointly modeling multi-directional relational structures, CLIP-RD promotes faithful alignment of the student embedding geometry with that of the teacher, outperforming existing methods by 0.8 percentage points.

[140] Multimodal Dataset Distillation via Phased Teacher Models

Shengbin Guo,Hang Zhao,Senqiao Yang,Chenyang Jiang,Yuhang Cheng,Xiangru Peng,Rui Shao,Zhuotao Tian

Main category: cs.CV

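TL;DR: This paper proposes PTM-ST, a phased teacher model framework that uses stage-aware modeling and a shortcut-based trajectory construction strategy to better capture the teacher's evolving knowledge in multimodal dataset distillation, significantly improving student performance and distilled data quality.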

Details Motivation: Existing methods fail to capture the complex, dynamically evolving knowledge in the later training stages of teacher models, which degrades student performance and distilled data quality; they also suffer from large cross-stage performance gaps and unstable teacher trajectories. Method: Phased Teacher Model with Shortcut Trajectory (PTM-ST) combines stage-aware teacher modeling with a shortcut-based trajectory construction strategy to fit the teacher's learning dynamics in each training phase. Result: Theoretical analysis and experiments show that PTM-ST mitigates optimization oscillations and inter-phase knowledge gaps and reduces storage overhead; it surpasses state-of-the-art methods on Flickr30k and COCO, with up to 13.5% improvement and an average gain of 9.53% on Flickr30k. Conclusion: By modeling the teacher's learning dynamics, PTM-ST improves the stability and expressiveness of multimodal dataset distillation and offers a new approach to efficient knowledge compression and transfer. Abstract: Multimodal dataset distillation aims to construct compact synthetic datasets that enable efficient compression and knowledge transfer from large-scale image-text data. However, existing approaches often fail to capture the complex, dynamically evolving knowledge embedded in the later training stages of teacher models. This limitation leads to degraded student performance and compromises the quality of the distilled data. To address critical challenges such as pronounced cross-stage performance gaps and unstable teacher trajectories, we propose Phased Teacher Model with Shortcut Trajectory (PTM-ST) -- a novel phased distillation framework. PTM-ST leverages stage-aware teacher modeling and a shortcut-based trajectory construction strategy to accurately fit the teacher's learning dynamics across distinct training phases. This enhances both the stability and expressiveness of the distillation process. Through theoretical analysis and comprehensive experiments, we show that PTM-ST significantly mitigates optimization oscillations and inter-phase knowledge gaps, while also reducing storage overhead. Our method consistently surpasses state-of-the-art baselines on Flickr30k and COCO, achieving up to 13.5% absolute improvement and an average gain of 9.53% on Flickr30k. Code: https://github.com/Previsior/PTM-ST.

[141] FSGNet: A Frequency-Aware and Semantic Guidance Network for Infrared Small Target Detection

Yingmei Zhang,Wangtao Bao,Yong Yang,Weiguo Wan,Qin Xiao,Xueting Zou

Main category: cs.CV

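TL;DR: This paper proposes FSGNet, a lightweight and efficient infrared small target detection framework that combines frequency-aware and semantic guidance mechanisms. Multi-directional interactive attention, a multi-scale frequency-aware module, and global semantic guidance flows mitigate semantic degradation in U-Net and improve the localization accuracy and robustness for small targets.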

Details Motivation: In infrared small target detection, U-Net suffers from semantic degradation when passing features from deep to shallow layers, which limits precise localization of small targets. Method: The FSGNet framework includes: 1) a multi-directional interactive attention module in the encoder to increase sensitivity to small, low-contrast targets; 2) an FFT-based multi-scale frequency-aware module that suppresses background interference in the skip connections; 3) global pooling at the deepest layer, followed by upsampling and semantic guidance flows that keep semantics consistent across scales. Result: Experiments on four public IRSTD datasets show that FSGNet achieves superior detection performance with high efficiency, good practicality, and robustness. Conclusion: FSGNet mitigates semantic degradation in U-Net and, by combining frequency awareness with semantic guidance, improves small target detection accuracy and localization; it is a lightweight and practical IRSTD method. Abstract: Infrared small target detection (IRSTD) aims to identify and distinguish small targets from complex backgrounds. Leveraging the powerful multi-scale feature fusion capability of the U-Net architecture, IRSTD has achieved significant progress. However, U-Net suffers from semantic degradation when transferring high-level features from deep to shallow layers, limiting the precise localization of small targets. To address this issue, this paper proposes FSGNet, a lightweight and effective detection framework incorporating frequency-aware and semantic guidance mechanisms. Specifically, a multi-directional interactive attention module is proposed throughout the encoder to capture fine-grained and directional features, enhancing the network's sensitivity to small, low-contrast targets. To suppress background interference propagated through skip connections, a multi-scale frequency-aware module leverages Fast Fourier transform to filter out target-similar clutter while preserving salient target structures. At the deepest layer, a global pooling module captures high-level semantic information, which is subsequently upsampled and propagated to each decoder stage through the global semantic guidance flows, ensuring semantic consistency and precise localization across scales. Extensive experiments on four public IRSTD datasets demonstrate that FSGNet achieves superior detection performance and maintains high efficiency, highlighting its practical applicability and robustness. The codes will be released on https://github.com/Wangtao-Bao/FSGNet.

[142] PMT: Plain Mask Transformer for Image and Video Segmentation with Frozen Vision Encoders

Niccolò Cavagnero,Narges Norouzi,Gijs Dubbelman,Daan de Geus

Main category: cs.CV

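TL;DR: This paper proposes the Plain Mask Decoder (PMD) and the Plain Mask Transformer (PMT), which deliver efficient image and video segmentation while keeping the Vision Foundation Model (VFM) encoder frozen and shareable, balancing accuracy and speed.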

Details Motivation: Existing VFM-based encoder-only segmentation models (such as EoMT and VidEoMT) are fast but require finetuning the encoder, giving up the benefit of sharing one frozen encoder across tasks; a design is needed that keeps frozen VFM features while retaining encoder-only simplicity and low latency. Method: A lightweight Transformer-based Plain Mask Decoder (PMD) operates directly on frozen VFM features; the resulting Plain Mask Transformer (PMT) handles both image and video segmentation without modifying the encoder. Result: On image segmentation, PMT matches the frozen-encoder state of the art while running about 3x faster; on video segmentation, it performs on par with fully finetuned methods and is 8x faster than the best frozen-encoder models. Conclusion: PMT reconciles frozen VFM features with efficient decoding, improving segmentation performance and speed while keeping the encoder shareable, general, and low-latency, which suits large-scale deployment. Abstract: Vision Foundation Models (VFMs) pre-trained at scale enable a single frozen encoder to serve multiple downstream tasks simultaneously. Recent VFM-based encoder-only models for image and video segmentation, such as EoMT and VidEoMT, achieve competitive accuracy with remarkably low latency, yet they require finetuning the encoder, sacrificing the multi-task encoder sharing that makes VFMs practically attractive for large-scale deployment. To reconcile encoder-only simplicity and speed with frozen VFM features, we propose the Plain Mask Decoder (PMD), a fast Transformer-based segmentation decoder that operates on top of frozen VFM features. The resulting model, the Plain Mask Transformer (PMT), preserves the architectural simplicity and low latency of encoder-only designs while keeping the encoder representation unchanged and shareable. The design seamlessly applies to both image and video segmentation, inheriting the generality of the encoder-only framework. On standard image segmentation benchmarks, PMT matches the frozen-encoder state of the art while running up to ~3x faster. For video segmentation, it even performs on par with fully finetuned methods, while being up to 8x faster than state-of-the-art frozen-encoder models. Code: https://github.com/tue-mps/pmt.

[143] LaMP: Learning Vision-Language-Action Policies with 3D Scene Flow as Latent Motion Prior

Xinkai Wang,Chenyi Wang,Yifu Xu,Mingzhe Ye,Fu-Cheng Zhang,Jialin Tian,Xinyu Zhan,Lifeng Zhu,Cewu Lu,Lixin Yang

Main category: cs.CV

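TL;DR: LaMP is a dual-expert Vision-Language-Action (VLA) framework that uses dense 3D scene flow as a latent motion prior to improve the generalization and robustness of robotic manipulation in complex and out-of-distribution (OOD) scenarios.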

Details Motivation: Existing VLA models regress actions directly from 2D semantic features and cannot explicitly model 3D physical interactions, so their performance drops under unfamiliar spatial dynamics. Method: A dual-expert architecture: the Motion Expert generates a partially denoised, one-step 3D scene flow via flow matching; the Action Expert predicts the action policy from the Motion Expert's hidden states (through gated cross-attention), without full multi-step reconstruction. Result: LaMP outperforms existing VLA baselines on the LIBERO, LIBERO-Plus, and SimplerEnv-WidowX simulation benchmarks and in real-world experiments; under LIBERO-Plus OOD perturbations, average success rate improves by 9.7%. Conclusion: Using 3D scene flow as an explicit motion prior strengthens a VLA model's ability to model spatial dynamics and its robustness when generalizing. Abstract: We introduce LaMP, a dual-expert Vision-Language-Action framework that embeds dense 3D scene flow as a latent motion prior for robotic manipulation. Existing VLA models regress actions directly from 2D semantic visual features, forcing them to learn complex 3D physical interactions implicitly. This implicit learning strategy degrades under unfamiliar spatial dynamics. LaMP addresses this limitation by aligning a flow-matching Motion Expert with a policy-predicting Action Expert through gated cross-attention. Specifically, the Motion Expert generates a one-step partially denoised 3D scene flow, and its hidden states condition the Action Expert without full multi-step reconstruction. We evaluate LaMP on the LIBERO, LIBERO-Plus, and SimplerEnv-WidowX simulation benchmarks as well as real-world experiments. LaMP consistently outperforms evaluated VLA baselines across LIBERO, LIBERO-Plus, and SimplerEnv-WidowX benchmarks, achieving the highest reported average success rates under the same training budgets. On LIBERO-Plus OOD perturbations, LaMP shows improved robustness with an average 9.7% gain over the strongest prior baseline. Our project page is available at https://summerwxk.github.io/lamp-project-page/.

[144] HiSpatial: Taming Hierarchical 3D Spatial Understanding in Vision-Language Models

Huizhi Liang,Yichao Shen,Yu Deng,Sicheng Xu,Zhiyuan Feng,Tong Zhang,Yaobo Liang,Jiaolong Yang

Main category: cs.CV

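TL;DR: This paper proposes a hierarchical framework that decomposes the learning of 3D spatial understanding in vision-language models (VLMs) into four progressive levels, and builds a large-scale 3D spatial VQA dataset and an RGB-D VLM, significantly improving spatial understanding and reasoning.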

Details Motivation: Human-like spatial intelligence requires a VLM to infer 3D structure from 2D observations, recognize object properties and relations in 3D space, and perform high-level spatial reasoning. Method: A four-level progressive framework for spatial understanding; an automated pipeline that generates 3D spatial VQA data from about 5M images with over 45M objects; and an RGB-D VLM that takes metric-scale point maps as auxiliary inputs. Result: State-of-the-art results on multiple spatial understanding and reasoning benchmarks, surpassing specialized spatial models and large systems such as Gemini-2.5-pro and GPT-5; the analysis also reveals dependencies among the hierarchical task levels. Conclusion: Multi-level task design helps 3D spatial intelligence emerge in VLMs, and the proposed framework and model offer a new approach to building general-purpose VLMs with spatial abilities. Abstract: Achieving human-like spatial intelligence for vision-language models (VLMs) requires inferring 3D structures from 2D observations, recognizing object properties and relations in 3D space, and performing high-level spatial reasoning. In this paper, we propose a principled hierarchical framework that decomposes the learning of 3D spatial understanding in VLMs into four progressively complex levels, from geometric perception to abstract spatial reasoning. Guided by this framework, we construct an automated pipeline that processes approximately 5M images with over 45M objects to generate 3D spatial VQA pairs across diverse tasks and scenes for VLM supervised fine-tuning. We also develop an RGB-D VLM incorporating metric-scale point maps as auxiliary inputs to further enhance spatial understanding. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on multiple spatial understanding and reasoning benchmarks, surpassing specialized spatial models and large proprietary systems such as Gemini-2.5-pro and GPT-5. Moreover, our analysis reveals clear dependencies among hierarchical task levels, offering new insights into how multi-level task design facilitates the emergence of 3D spatial intelligence.

[145] VideoWeaver: Multimodal Multi-View Video-to-Video Transfer for Embodied Agents

George Eskandar,Fengyi Shen,Mohammad Altillawi,Dong Chen,Yang Bai,Liudi Yang,Ziyuan Liu

Main category: cs.CV

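TL;DR: This paper presents VideoWeaver, the first framework for multi-view video-to-video (V2V) translation. A shared 4D latent space and training views at distinct diffusion timesteps give cross-view consistency and scalability, significantly improving multi-view resimulation quality for robot policy transfer.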

Details Motivation: Existing single-view V2V methods cannot keep appearance consistent across multiple synchronized cameras, and standard Transformers do not scale to multi-view settings because cross-view attention is costly, which limits their use in embodied AI tasks such as robot policy learning. Method: VideoWeaver starts from a single-view flow-based V2V model; it then grounds all views in a shared 4D latent space built with the Pi3 spatial foundation model to align them; and it trains views at distinct diffusion timesteps to model both joint and conditional view distributions, enabling autoregressive generation of new views. Result: On single-view benchmarks it matches or exceeds the state of the art; it also achieves, for the first time, physically and stylistically consistent multi-view translation, including challenging egocentric and heterogeneous-camera setups used for world randomization in robotics. Conclusion: VideoWeaver offers a scalable and consistent approach to multi-view V2V translation and supports cross-environment policy transfer in embodied AI without additional data collection. Abstract: Recent progress in video-to-video (V2V) translation has enabled realistic resimulation of embodied AI demonstrations, a capability that allows pretrained robot policies to be transferable to new environments without additional data collection. However, prior works can only operate on a single view at a time, while embodied AI tasks are commonly captured from multiple synchronized cameras to support policy learning. Naively applying single-view models independently to each camera leads to inconsistent appearance across views, and standard transformer architectures do not scale to multi-view settings due to the quadratic cost of cross-view attention. We present VideoWeaver, the first multimodal multi-view V2V translation framework. VideoWeaver is initially trained as a single-view flow-based V2V model. To achieve an extension to the multi-view regime, we propose to ground all views in a shared 4D latent space derived from a feed-forward spatial foundation model, namely, Pi3. This encourages view-consistent appearance even under wide baselines and dynamic camera motion. To scale beyond a fixed number of cameras, we train views at distinct diffusion timesteps, enabling the model to learn both joint and conditional view distributions. This in turn allows autoregressive synthesis of new viewpoints conditioned on existing ones. Experiments show superior or similar performance to the state-of-the-art on the single-view translation benchmarks and, for the first time, physically and stylistically consistent multi-view translations, including challenging egocentric and heterogeneous-camera setups central to world randomization for robot learning.

[146] DC-Reg: Globally Optimal Point Cloud Registration via Tight Bounding with Difference of Convex Programming

Wei Lian,Fei Ma,Hang Pan,Zhesen Cui,Wangmeng Zuo

Main category: cs.CV

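TL;DR: This paper proposes DC-Reg, which uses the Difference of Convex (DC) programming paradigm to build a holistic concave underestimator that significantly tightens the lower bounds in Branch-and-Bound (BnB) search, yielding globally optimal point cloud registration that performs especially well under partial overlap and large misalignment.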

Details Motivation: Globally optimal point cloud registration is hard to obtain under partial overlap and large misalignment; methods that jointly optimize the transformation and the correspondences face a non-convex coupled objective and tend to get stuck in local minima or converge too slowly. Method: DC-Reg, based on Difference of Convex (DC) programming, constructs a holistic concave underestimator of the coupled objective and computes tight lower bounds efficiently via Linear Assignment Problems (LAP) at the vertices of the search boxes, replacing conventional term-wise relaxations. Result: The framework is validated on 2D similarity and 3D rigid registration; experiments on synthetic data and the 3DMatch benchmark show that DC-Reg converges faster than existing global methods and is more robust to extreme noise and outliers. Conclusion: By capturing the joint structural interaction between the transformation parameters $\boldsymbol{\theta}$ and the correspondence matrix $\mathbf{P}$, DC-Reg obtains tighter lower bounds and more efficient global optimization for robust point cloud registration. Abstract: Achieving globally optimal point cloud registration under partial overlaps and large misalignments remains a fundamental challenge. While simultaneous transformation ($\boldsymbol{\theta}$) and correspondence ($\mathbf{P}$) estimation has the advantage of being robust to nonrigid deformation, its non-convex coupled objective often leads to local minima for heuristic methods and prohibitive convergence times for existing global solvers due to loose lower bounds. To address this, we propose DC-Reg, a robust globally optimal framework that significantly tightens the Branch-and-Bound (BnB) search. Our core innovation is the derivation of a holistic concave underestimator for the coupled transformation-assignment objective, grounded in the Difference of Convex (DC) programming paradigm. Unlike prior works that rely on term-wise relaxations (e.g., McCormick envelopes) which neglect variable interplay, our holistic DC decomposition captures the joint structural interaction between $\boldsymbol{\theta}$ and $\mathbf{P}$. This formulation enables the computation of remarkably tight lower bounds via efficient Linear Assignment Problems (LAP) evaluated at the vertices of the search boxes. We validate our framework on 2D similarity and 3D rigid registration, utilizing rotation-invariant features for the latter to achieve high efficiency without sacrificing optimality. Experimental results on synthetic data and the 3DMatch benchmark demonstrate that DC-Reg achieves significantly faster convergence and superior robustness to extreme noise and outliers compared to state-of-the-art global techniques.
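For context on the term-wise relaxations that the abstract contrasts DC-Reg against, the sketch below evaluates the standard McCormick lower envelope of a single bilinear term x*y on a box. It illustrates only that baseline and is not DC-Reg's holistic DC underestimator; the function name and the box limits in the example are arbitrary assumptions.

```python
def mccormick_lower_bound(x, y, x_lo, x_hi, y_lo, y_hi):
    """Convex underestimator of x*y for x in [x_lo, x_hi], y in [y_lo, y_hi].

    It is the maximum of two supporting planes, each derived from
    (x - x_lo) * (y - y_lo) >= 0 and (x_hi - x) * (y_hi - y) >= 0.
    """
    plane_lo = x_lo * y + x * y_lo - x_lo * y_lo
    plane_hi = x_hi * y + x * y_hi - x_hi * y_hi
    return max(plane_lo, plane_hi)


# Compare the bound with the true product at the center of a box.
box = (-1.0, 2.0, 0.0, 3.0)   # x_lo, x_hi, y_lo, y_hi
x, y = 0.5, 1.5
print(x * y, mccormick_lower_bound(x, y, *box))
```

The bound is exact at the four corners of the box and loosest in the interior. Because each bilinear term is relaxed independently, couplings between terms are ignored, which is the looseness the abstract attributes to term-wise relaxations.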

[147] CIAR: Interval-based Collaborative Decoding for Image Generation Acceleration

Keming Ye,Zhou Zhao,Fan Wu,Shengyu Zhang

Main category: cs.CV

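TL;DR: This paper proposes CIAR, a framework that uses cloud-device collaboration and on-device self-verification to improve the inference efficiency of autoregressive image generation models, substantially reducing latency and the number of cloud requests while preserving image quality.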

Details Motivation: Autoregressive image generation models are computationally intensive and sequential, which causes high latency in on-device deployment; the resource waste caused by the large vocabulary and spatial redundancy of visual synthesis needs to be addressed. Method: Proposes the CIAR framework: (1) an on-device token uncertainty quantifier based on continuous probability intervals, which avoids repeatedly verifying redundant tokens; (2) an interval-enhanced decoding module paired with a distribution alignment training strategy, which speeds up decoding while maintaining visual fidelity and semantic consistency. Result: Experiments show that, compared with existing methods, CIAR achieves a 2.18x speed-up and reduces cloud requests by 70% without loss of image quality. Conclusion: CIAR effectively balances efficiency and quality in autoregressive image generation and offers a feasible path to efficient on-device visual synthesis. Abstract: Auto-regressive (AR) models have recently made notable progress in image generation, achieving performance comparable to diffusion-based approaches. However, their computational intensity and sequential nature impede on-device deployment, causing disruptive latency. We address this via a cloud-device collaboration framework, CIAR, which utilizes on-device self-verification to handle two key properties of visual synthesis: the vast token vocabulary required for high-fidelity images and inherent spatial redundancy, which leads to extreme predictability in homogeneous regions, while object boundaries exhibit high uncertainty. Uniform verification wastes resources on such redundant tokens. Our solution centers on an on-device token uncertainty quantifier, which adopts continuous probability intervals to accelerate processing and make it feasible for large visual vocabularies instead of conventional discrete solution sets. Additionally, we incorporate an Interval-enhanced decoding module to further speed up decoding while maintaining visual fidelity and semantic consistency via a distribution alignment training strategy. Extensive experiments demonstrate that CIAR achieves a 2.18x speed-up and reduces cloud requests by 70%, while preserving image quality compared to existing methods.

[148] AdaSFormer: Adaptive Serialized Transformers for Monocular Semantic Scene Completion from Indoor Environments

Xuzhi Wang,Xinran Wu,Song Wang,Lingdong Kong,Ziping Zhao

Main category: cs.CV

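TL;DR: This paper proposes AdaSFormer, a serialized transformer framework designed for indoor monocular semantic scene completion (MSSC); with three innovations (an adaptive serialized transformer, center-relative positional encoding, and convolution-modulated layer normalization), it achieves state-of-the-art performance on NYUv2 and Occ-ScanNet.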

Details Motivation: Indoor monocular semantic scene completion is harder than its outdoor counterpart, mainly because of complex spatial layouts and severe occlusions; existing transformers are difficult to apply effectively to this task because of their high memory cost and their difficulty in reconstructing fine-grained details. Method: Proposes the AdaSFormer framework with three core designs: (1) an adaptive serialized transformer with learnable shifts that dynamically adjust receptive fields; (2) a center-relative positional encoding that strengthens the modeling of spatial information; (3) a convolution-modulated layer normalization that fuses the heterogeneous features of convolutions and transformers. Result: Achieves state-of-the-art performance on the NYUv2 and Occ-ScanNet datasets. Conclusion: AdaSFormer overcomes the high memory consumption and weak detail reconstruction of transformers in indoor MSSC, and supports the value of jointly designing serialization and architecture for understanding complex indoor scenes. Abstract: Indoor monocular semantic scene completion (MSSC) is notably more challenging than its outdoor counterpart due to complex spatial layouts and severe occlusions. While transformers are well suited for modeling global dependencies, their high memory cost and difficulty in reconstructing fine-grained details have limited their use in indoor MSSC. To address these limitations, we introduce AdaSFormer, a serialized transformer framework tailored for indoor MSSC. Our model features three key designs: (1) an Adaptive Serialized Transformer with learnable shifts that dynamically adjust receptive fields; (2) a Center-Relative Positional Encoding that captures spatial information richness; and (3) a Convolution-Modulated Layer Normalization that bridges heterogeneous representations between convolutional and transformer features. Extensive experiments on NYUv2 and Occ-ScanNet demonstrate that AdaSFormer achieves state-of-the-art performance. The code is publicly available at: https://github.com/alanWXZ/AdaSFormer.

[149] Knowledge-Guided Failure Prediction: Detecting When Object Detectors Miss Safety-Critical Objects

Jakob Paul Zimmermann,Gerrit Holzbach,David Lerch

Main category: cs.CV

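TL;DR: This paper proposes Knowledge-Guided Failure Prediction (KGFP), a new framework for detecting at runtime when an object detector misses safety-critical objects such as pedestrians; it identifies anomalies by measuring the semantic inconsistency between detector features and visual foundation model embeddings, substantially improving recall and outperforming existing OOD detection methods.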

Details Motivation: Object detectors can fail silently in safety-critical scenarios (for example, by missing pedestrians), while traditional OOD detection methods only look at input distribution shift and cannot directly predict functional failures of the detector itself. Method: Proposes the KGFP framework, which uses a dual-encoder architecture to compute the angular distance between the detector's internal features and visual foundation model embeddings, treating the degree of semantic misalignment as the failure signal; when the detector operates outside its competence or the foundation model encounters novel inputs, the two embeddings diverge and produce a high-angle signal that flags unsafe images. Result: On COCO person detection, with the threshold set at a 5% false positive rate, using KGFP as a selective-prediction gate raises person recall from 64.3% to 84.5%; it clearly outperforms OOD baselines on all six COCO-O visual domains. Conclusion: KGFP is an effective, general runtime monitoring method for detectors that reliably identifies the risk of missed detections and is valuable for safety-critical applications. Abstract: Object detectors deployed in safety-critical environments can fail silently, e.g. missing pedestrians, workers, or other safety-critical objects without emitting any warning. Traditional Out Of Distribution (OOD) detection methods focus on identifying unfamiliar inputs, but do not directly predict functional failures of the detector itself. We introduce Knowledge Guided Failure Prediction (KGFP), a representation-based monitoring framework that treats missed safety-critical detections as anomalies to be detected at runtime. KGFP measures semantic misalignment between internal object detector features and visual foundation model embeddings using a dual-encoder architecture with an angular distance metric. A key property is that when either the detector is operating outside its competence or the visual foundation model itself encounters novel inputs, the two embeddings diverge, producing a high-angle signal that reliably flags unsafe images. We compare our novel KGFP method to baseline OOD detection methods. On COCO person detection, applying KGFP as a selective-prediction gate raises person recall among accepted images from 64.3% to 84.5% at 5% False Positive Rate (FPR), and maintains strong performance across six COCO-O visual domains, outperforming OOD baselines by large margins. Our code, models, and features are published at https://gitlab.cc-asp.fraunhofer.de/iosb_public/KGFP.
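
The angular-distance gate described in the abstract can be illustrated with a minimal sketch. It assumes both embeddings have already been projected into a shared space of equal dimension; the function names and the threshold value are hypothetical and are not taken from the KGFP code release.

```python
import math


def angular_distance(u, v):
    """Angle in radians between two embedding vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cosine)


def flag_unsafe(detector_emb, foundation_emb, threshold):
    """Flag an image when the two embeddings diverge beyond the threshold.

    The threshold is an assumed, pre-calibrated value (for example, chosen
    on held-out data to hit a target false positive rate).
    """
    return angular_distance(detector_emb, foundation_emb) > threshold
```

With a threshold of 0.5 rad, identical embeddings pass the gate and orthogonal embeddings (angle π/2) are flagged.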

[150] Challenges in Hyperspectral Imaging for Autonomous Driving: The HSI-Drive Case

Koldo Basterretxea,Jon Gutiérrez-Zaballa,Javier Echanobe

Main category: cs.CV

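TL;DR: This paper analyses the challenges and technical options for using hyperspectral imaging (HSI) in autonomous driving (AD), focusing on key issues such as lighting variation, dynamic scenes, real-time operation, and the resource limits of embedded platforms, and evaluates several HSI vision techniques using experimental results from the latest HSI-Drive dataset.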

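Details Motivation: Hyperspectral imaging holds great promise for autonomous driving but faces practical challenges such as uncontrolled lighting, wide depth-of-field ranges, fast-moving objects, real-time requirements, and the limited compute of embedded platforms, which call for suitable hardware selection and custom algorithms. Method: Systematically analyses research progress on HSI vision techniques for autonomous driving, and evaluates and compares them using experimental results from the latest HSI-Drive dataset. Result: Identifies the key constraints on applying HSI technology in AD and points out design directions for spectral-spatial information fusion algorithms aimed at embedded real-time systems. Conclusion: Making HSI practical for autonomous driving depends not only on the choice of sensor technology but also on the co-design of lightweight, robust, dedicated vision algorithms; the HSI-Drive dataset provides an important experimental benchmark for the field. Abstract: The use of hyperspectral imaging (HSI) in autonomous driving (AD), while promising, faces many challenges related to the specifics and requirements of this application domain. On the one hand, there are non-controlled and variable lighting conditions, wide depth-of-field ranges, and dynamic scenes with fast-moving objects. On the other hand, there are the requirements for real-time operation and the limited computational resources of embedded platforms. The combination of these factors determines both the criteria for selecting appropriate HSI technologies and the development of custom vision algorithms that leverage the spectral and spatial information obtained from the sensors. In this article, we analyse several techniques explored in the research of HSI-based vision systems with application to AD, using as an example results obtained from experiments using data from the most recent version of the HSI-Drive dataset.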

[151] CHIRP dataset: towards long-term, individual-level, behavioral monitoring of bird populations in the wild

Alex Hoi Hang Chan,Neha Singhal,Onur Kocahan,Andrea Meltzer,Saverio Lubrano,Miyako H. Warrington,Michel Griesser,Fumihiro Kano,Hemal Naik

Main category: cs.CV

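TL;DR: This paper introduces the CHIRP dataset and the CORVID method for individual re-identification and long-term behavioral monitoring of wild birds, aiming to bridge the gap between computer vision research and biological applications.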

Details Motivation: Long-term behavioral monitoring of individual animals is essential for conservation and evolutionary biology, but automated behavior monitoring of wild populations remains challenging for existing computer vision methods, mainly because of the lack of biologically relevant datasets that support multiple vision tasks. Method: Proposes the CHIRP dataset (covering individual re-identification, action recognition, 2D keypoint estimation, object detection, and instance segmentation) and the CORVID method (a probability-based video re-identification pipeline built on the segmentation and classification of colored leg rings). Result: CORVID outperforms state-of-the-art re-identification methods on application-specific benchmarks (such as feeding rates and co-occurrence rates); the CHIRP dataset supports multi-task learning and biologically interpretable evaluation. Conclusion: This work provides a blueprint for building real-world datasets from ethically approved biological studies and advances the integration of computer vision with ecology and behavioral science. Abstract: Long-term behavioral monitoring of individual animals is crucial for studying behavioral changes that occur over different time scales, especially for conservation and evolutionary biology. Computer vision methods have proven to benefit biodiversity monitoring, but automated behavior monitoring in wild populations remains challenging. This stems from the lack of datasets that cover a range of computer vision tasks necessary to extract biologically meaningful measurements of individual animals. Here, we introduce such a dataset (CHIRP) with a new method (CORVID) for individual re-identification of wild birds. The CHIRP (Combining beHaviour, Individual Re-identification and Postures) dataset is curated from a long-term population of wild Siberian jays studied in Swedish Lapland, supporting re-identification (re-id), action recognition, 2D keypoint estimation, object detection, and instance segmentation. In addition to traditional task-specific benchmarking, we introduce application-specific benchmarking with biologically relevant metrics (feeding rates, co-occurrence rates) to evaluate the performance of models in real-world use cases. Finally, we present CORVID (COlouR-based Video re-ID), a novel pipeline for individual identification of birds based on the segmentation and classification of colored leg rings, a widespread approach for visual identification of individual birds. CORVID offers a probability-based id tracking method by matching the detected combination of color rings with a database. We use application-specific benchmarking to show that CORVID outperforms state-of-the-art re-id methods. We hope this work offers the community a blueprint for curating real-world datasets from ethically approved biological studies to bridge the gap between computer vision research and biological applications.
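
As a rough illustration of probability-based matching between a detected ring combination and a database, the sketch below scores each known bird by the product of per-ring color probabilities and then normalizes. The data layout, the independence assumption across rings, and all names are hypothetical; the actual CORVID pipeline may combine evidence differently.

```python
def match_ring_combination(ring_probs, database):
    """Score each known bird against per-ring color probabilities.

    ring_probs: list of dicts mapping color name -> classifier probability,
                one dict per ring position (assumed output of a classifier).
    database:   dict mapping bird id -> tuple of ring colors, same order.
    Returns a dict mapping bird id -> normalized match probability.
    """
    scores = {}
    for bird_id, combination in database.items():
        score = 1.0
        # Treat ring positions as independent (a simplifying assumption).
        for probs, color in zip(ring_probs, combination):
            score *= probs.get(color, 0.0)
        scores[bird_id] = score
    total = sum(scores.values())
    if total == 0:
        # No database entry is compatible with the detected colors.
        return {bird_id: 0.0 for bird_id in scores}
    return {bird_id: score / total for bird_id, score in scores.items()}


# Toy usage with two ring positions and two known birds.
detected = [{"red": 0.9, "blue": 0.1}, {"green": 0.8, "white": 0.2}]
known_birds = {"A": ("red", "green"), "B": ("blue", "white")}
matches = match_ring_combination(detected, known_birds)
best_bird = max(matches, key=matches.get)
```

Accumulating such per-frame probabilities over a video track would give an identity estimate that tolerates occasional misclassified rings.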

[152] Beyond the Golden Data: Resolving the Motion-Vision Quality Dilemma via Timestep Selective Training

Xiangyang Luo,Qingyu Li,Yuming Li,Guanbo Huang,Yongjie Zhu,Wenyu Qin,Meng Wang,Pengfei Wan,Shao-Lun Huang

Main category: cs.CV

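TL;DR: This paper proposes Timestep-aware Quality Decoupling (TQD), which adjusts the per-timestep sampling distribution of quality-imbalanced videos (motion-rich or visually high-quality) during training to ease the Motion-Vision Quality Dilemma, improving video generation performance without relying on perfect data.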

Details Motivation: Video generation models depend heavily on "golden data" that combines high visual quality with high motion quality, but in practice the two are negatively correlated (the Motion-Vision Quality Dilemma) and are hard to obtain together. Method: Building on findings from the hierarchical learning dynamics of video diffusion models and a gradient analysis of quality-degraded samples, proposes a timestep selection mechanism and designs TQD, which biases sampling toward higher or lower timesteps depending on the data type (motion-rich or visually high-quality) to achieve quality-decoupled training. Result: Using only separated, quality-imbalanced data, TQD surpasses conventional training that uses better data; it also further improves results when training on high-quality data. Conclusion: Perfect video training data is not necessary; a timestep-aware sampling strategy matched to the model's learning dynamics can make effective use of quality-imbalanced real-world data, offering a new paradigm for video generation data engineering. Abstract: Recent advances in video generation models have achieved impressive results. However, these models heavily rely on the use of high-quality data that combines both high visual quality and high motion quality. In this paper, we identify a key challenge in video data curation: the Motion-Vision Quality Dilemma. We discovered that visual quality and motion intensity inherently exhibit a negative correlation, making it hard to obtain golden data that excels in both aspects. To address this challenge, we first examine the hierarchical learning dynamics of video diffusion models and conduct gradient-based analysis on quality-degraded samples. We discover that quality-imbalanced data can produce gradients similar to golden data at appropriate timesteps. Based on this, we introduce the novel concept of Timestep selection in Training Process. We propose Timestep-aware Quality Decoupling (TQD), which modifies the data sampling distribution to better match the model's learning process. Specifically, the sampling distribution is skewed toward higher timesteps for motion-rich data, while high visual quality data is more likely to be sampled during lower timesteps. Through extensive experiments, we demonstrate that TQD enables training exclusively on separated imbalanced data to achieve performance surpassing conventional training with better data, challenging the necessity of perfect data in video generation. Moreover, our method also boosts model performance when trained on high-quality data, showcasing its effectiveness across different data scenarios.
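
The timestep-skewing idea can be sketched as follows. The linear weighting is an assumption chosen purely for illustration, and the data-type labels and function names are hypothetical; the abstract does not specify the exact distribution used by TQD.

```python
import random


def timestep_weights(num_timesteps, data_type):
    """Unnormalized sampling weights over diffusion timesteps 0..T-1.

    The linear ramps below are an illustrative assumption, not the
    distribution used in the paper.
    """
    if data_type == "motion_rich":
        # Favor higher (noisier) timesteps.
        return [t + 1 for t in range(num_timesteps)]
    if data_type == "visual_quality":
        # Favor lower (less noisy) timesteps.
        return [num_timesteps - t for t in range(num_timesteps)]
    # Fall back to uniform sampling for any other data.
    return [1] * num_timesteps


def sample_timestep(num_timesteps, data_type, rng):
    """Draw one training timestep for a sample of the given data type."""
    weights = timestep_weights(num_timesteps, data_type)
    return rng.choices(range(num_timesteps), weights=weights, k=1)[0]


rng = random.Random(0)
motion_draws = [sample_timestep(1000, "motion_rich", rng) for _ in range(2000)]
visual_draws = [sample_timestep(1000, "visual_quality", rng) for _ in range(2000)]
```

Under these weights the motion-rich draws concentrate near the high end of the timestep range and the visual-quality draws near the low end.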

[153] BFMD: A Full-Match Badminton Dense Dataset for Dense Shot Captioning

Ning Ding,Keisuke Fujii,Toru Tamaki

Main category: cs.CV

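TL;DR: This paper presents BFMD, the first densely annotated full-match badminton dataset, and builds a VideoMAE-based multimodal captioning framework with a semantic feedback mechanism that improves the semantic consistency of captions, outperforming RGB-only baselines on the shot captioning task.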

Details Motivation: Existing badminton datasets mostly consist of short clips or task-specific annotations and lack full-match data with dense multimodal annotations, which makes accurate shot caption generation and match-level tactical analysis difficult. Method: Builds the BFMD dataset (19 broadcast matches, over 20 hours, 16,751 hit events, with hierarchical annotations); proposes a VideoMAE-based multimodal captioning framework with a semantic feedback mechanism that uses shot semantics to guide caption generation. Result: Multimodal modeling and semantic feedback improve shot caption quality; the potential for analyzing the temporal evolution of tactical patterns across full matches is demonstrated on BFMD. Conclusion: BFMD fills the gap in densely annotated, multimodal full-match data, and the proposed method improves the semantic consistency of shot captions, providing a new benchmark and tool for badminton tactical understanding and analysis. Abstract: Understanding tactical dynamics in badminton requires analyzing entire matches rather than isolated clips. However, existing badminton datasets mainly focus on short clips or task-specific annotations and rarely provide full-match data with dense multimodal annotations. This limitation makes it difficult to generate accurate shot captions and perform match-level analysis. To address this limitation, we introduce the first Badminton Full Match Dense (BFMD) dataset, with 19 broadcast matches (including both singles and doubles) covering over 20 hours of play, comprising 1,687 rallies and 16,751 hit events, each annotated with a shot caption. The dataset provides hierarchical annotations including match segments, rally events, and dense rally-level multimodal annotations such as shot types, shuttle trajectories, player pose keypoints, and shot captions. We develop a VideoMAE-based multimodal captioning framework with a Semantic Feedback mechanism that leverages shot semantics to guide caption generation and improve semantic consistency. Experimental results demonstrate that multimodal modeling and semantic feedback improve shot caption quality over RGB-only baselines. We further showcase the potential of BFMD by analyzing the temporal evolution of tactical patterns across full matches.

[154] Insights on back marking for the automated identification of animals

David Brunner,Marie Bordes,Elisabeth Mayrhuber,Stephan M. Winkler,Viktoria Dorfer,Maciej Oczak

Main category: cs.CV

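TL;DR: This study examines how to design back marks that support machine learning-based identification of individual pigs, analyses the classification performance of a ResNet-50 model on ten pigs with unique back marks, and derives mark design guidelines that account for motion blur, view-angle changes, occlusion, and data augmentation strategies.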

Details Motivation: Existing research offers little guidance on designing back marks for individual-level monitoring of uniform-looking species such as pigs, while the rise of machine learning monitoring solutions creates a need for design guidelines for marks that algorithms can recognise effectively. Method: Trains a ResNet-50 neural network to classify ten pigs with unique back marks and analyses the model's predictions, assessing mark design effectiveness with respect to motion blur, multiple view angles, occlusion, and common data augmentations (colour, flip, crop). Result: Finds that back marks must remain unambiguous under motion blur, multiple view angles, and behaviour-induced occlusion, and that the design must be compatible with the data augmentation strategies commonly used in training. Conclusion: The study provides empirical evidence for optimizing back mark design and can help improve identification robustness and accuracy in future animal monitoring research and practical applications. Abstract: To date, there is little research on how to design back marks to best support individual-level monitoring of uniform looking species like pigs. With the recent surge of machine learning-based monitoring solutions, there is a particular need for guidelines on the design of marks that can be effectively recognised by such algorithms. This study provides valuable insights on effective back mark design, based on the analysis of a machine learning model, trained to distinguish pigs via their back marks. Specifically, a neural network of type ResNet-50 was trained to classify ten pigs with unique back marks. The analysis of the model's predictions highlights the significance of certain design choices, even in controlled settings. Most importantly, the set of back marks must be designed such that each mark remains unambiguous under conditions of motion blur, diverse view angles and occlusions, caused by animal behaviour. Further, the back mark design must consider data augmentation strategies commonly employed during model training, like colour, flip and crop augmentations. The generated insights can support individual-level monitoring in future studies and real-world applications by optimizing back mark design.

[155] GeoHeight-Bench: Towards Height-Aware Multimodal Reasoning in Remote Sensing

Xuran Hu,Zhitong Xiong,Zhongcheng Hong,Yifang Ban,Xiaoxiang Zhu,Wufan Zhao

Main category: cs.CV

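TL;DR: This paper proposes a height-aware evaluation framework and baseline model for multimodal models in Earth observation, aiming to address the weak spatial reasoning that results from existing large models ignoring the vertical dimension (height).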

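Details Motivation: Current Large Multimodal Models (LMMs) in Earth observation generally ignore the "vertical" dimension (height information), which limits their spatial reasoning in complex remote sensing geometries and disaster scenarios, where physical spatial structure often matters more than planar texture. Method: Builds a VLM-driven, height-aware remote sensing data generation pipeline that combines systematic prompt engineering with metadata extraction; uses it to establish two complementary benchmarks, GeoHeight-Bench (relative height analysis) and GeoHeight-Bench+ (holistic terrain-aware reasoning); and proposes GeoHeightChat, the first height-aware remote sensing LMM baseline, which jointly models visual semantics and implicitly injected height geometric features. Result: The GeoHeightChat baseline mitigates the model's "vertical blind spot", confirms on the proposed benchmarks that height perception is necessary for remote sensing understanding, and enables a new paradigm of interactive height reasoning on optical models. Conclusion: Introducing the height dimension is key to improving the spatial understanding of remote sensing multimodal models; the proposed evaluation framework and baseline lay a foundation for height-aware Earth observation AI research. Abstract: Current Large Multimodal Models (LMMs) in Earth Observation typically neglect the critical "vertical" dimension, limiting their reasoning capabilities in complex remote sensing geometries and disaster scenarios where physical spatial structures often outweigh planar visual textures. To bridge this gap, we introduce a comprehensive evaluation framework dedicated to height-aware remote sensing understanding. First, to overcome the severe scarcity of annotated data, we develop a scalable, VLM-driven data generation pipeline utilizing systematic prompt engineering and metadata extraction. This pipeline constructs two complementary benchmarks: GeoHeight-Bench for relative height analysis, and a more challenging GeoHeight-Bench+ for holistic, terrain-aware reasoning. Furthermore, to validate the necessity of height perception, we propose GeoHeightChat, the first height-aware remote sensing LMM baseline. Serving as a strong proof of concept, our baseline demonstrates that synergizing visual semantics with implicitly injected height geometric features effectively mitigates the "vertical blind spot", successfully unlocking a new paradigm of interactive height reasoning in existing optical models.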

[156] Hierarchy-Guided Multimodal Representation Learning for Taxonomic Inference

Sk Miraj Ahmed,Xi Yu,Yunqi Li,Yuewei Lin,Wei Xu

Main category: cs.CV

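TL;DR: This paper proposes two hierarchy-aware multimodal learning methods, CLiBD-HiR and CLiBD-HiR-Fuse, which introduce Hierarchical Information Regularization and a lightweight fusion predictor to substantially improve taxonomic classification accuracy, with especially strong gains when DNA data is incomplete or noisy.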

Details Motivation: Existing methods treat taxonomy as a flat label space and ignore the inherent hierarchical structure of biological classification, which leads to poor robustness under noise and missing modalities. Method: Proposes CLiBD-HiR (which introduces Hierarchical Information Regularization, HiR, to shape the embedding geometry) and CLiBD-HiR-Fuse (which adds a lightweight fusion predictor that supports image-only, DNA-only, or joint inference and is robust to modality corruption). Result: On large-scale biodiversity benchmarks, classification accuracy improves by over 14 percent compared with strong multimodal baselines, with particularly large gains when DNA is partially missing or corrupted. Conclusion: Explicitly modeling the biological taxonomic hierarchy, combined with a flexible fusion mechanism, is key to building practical biodiversity foundation models. Abstract: Accurate biodiversity identification from large-scale field data is a foundational problem with direct impact on ecology, conservation, and environmental monitoring. In practice, the core task is taxonomic prediction - inferring order, family, genus, or species from imperfect inputs such as specimen images, DNA barcodes, or both. Existing multimodal methods often treat taxonomy as a flat label space and therefore fail to encode the hierarchical structure of biological classification, which is critical for robustness under noise and missing modalities. We present two end-to-end variants for hierarchy-aware multimodal learning: CLiBD-HiR, which introduces Hierarchical Information Regularization (HiR) to shape embedding geometry across taxonomic levels, yielding structured and noise-robust representations; and CLiBD-HiR-Fuse, which additionally trains a lightweight fusion predictor that supports image-only, DNA-only, or joint inference and is resilient to modality corruption. Across large-scale biodiversity benchmarks, our approach improves taxonomic classification accuracy by over 14 percent compared to strong multimodal baselines, with particularly large gains under partial and corrupted DNA conditions. These results highlight that explicitly encoding biological hierarchy, together with flexible fusion, is key for practical biodiversity foundation models.

[157] UNIC: Neural Garment Deformation Field for Real-time Clothed Character Animation

Chengfeng Zhao,Junbo Qi,Yulou Liu,Zhiyang Dou,Minchen Li,Taku Komura,Ziwei Liu,Wenping Wang,Yuan Liu

Main category: cs.CV

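TL;DR: This paper proposes UNIC, which uses an instance-specific neural deformation field to drive garment deformation of virtual characters in real time, avoiding the handling of complex topologies and improving deformation quality and efficiency.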

Details Motivation: Physics simulation methods are computationally expensive and hard to run in real time; existing learning-based methods built on graph neural networks struggle to model the fine deformation of garments with complex topologies. Method: Proposes UNIC, a neural deformation field-based method that learns an instance-specific mapping from 3D points to deformation offsets and only needs to generalize to new motion sequences, not to new garments. Result: Experiments on a variety of garment meshes show that UNIC outperforms baseline methods in both deformation quality and runtime efficiency, making it suitable for real-time interactive scenarios such as games. Conclusion: UNIC achieves high-quality, real-time garment animation through instance-specific neural deformation fields, balancing accuracy, efficiency, and practicality. Abstract: Simulating physically realistic garment deformations is an essential task for virtual immersive experience, which is often achieved by physics simulation methods. However, these methods are typically time-consuming, computationally demanding, and require costly hardware, which is not suitable for real-time applications. Recent learning-based methods tried to resolve this problem by training graph neural networks to learn the garment deformation on vertices, which, however, fail to capture the intricate deformation of complex garment meshes with complex topologies. In this paper, we introduce a novel neural deformation field-based method, named UNIC, to animate the garments of an avatar in real time, given the motion sequences. Our key idea is to learn the instance-specific neural deformation field to animate the garment meshes. Such an instance-specific learning scheme does not require UNIC to generalize to new garments but only to new motion sequences, which greatly reduces the difficulty in training and improves the deformation quality. Moreover, neural deformation fields map the 3D points to their deformation offsets, which not only avoids handling topologies of the complex garments but also injects a natural smoothness constraint in the deformation learning. Extensive experiments have been conducted on various kinds of garment meshes to demonstrate the effectiveness and efficiency of UNIC over baseline methods, making it potentially practical and useful in real-world interactive applications like video games.

[158] DeepFAN, a transformer-based deep learning model for human-artificial intelligence collaborative assessment of incidental pulmonary nodules in CT scans: a multi-reader, multi-case trial

Zhenchen Zhu,Ge Hu,Weixiong Tan,Kai Gao,Chao Sun,Zhen Zhou,Kepei Xu,Wei Han,Meixia Shang,Xiaoming Qiu,Yiqing Tan,Jinhua Wang,Zhoumeng Ying,Li Peng,Wei Song,Lan Song,Zhengyu Jin,Nan Hong,Yizhou Yu

Main category: cs.CV

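TL;DR: This paper proposes DeepFAN, a transformer-based deep learning model for classifying pulmonary nodules as benign or malignant, and validates its effectiveness in assisting junior radiologists through a multi-center clinical trial.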

Details Motivation: Existing deep learning methods do not fully integrate global and local features and lack validation in clinical trials; meanwhile, the spread of CT has increased the number of detected lung nodules, creating a need for better diagnostic accuracy and consistency. Method: Builds the transformer-based DeepFAN model, trains it on more than 10,000 pathology-confirmed lung nodules, and runs a prospective multi-reader, multi-case clinical trial (400 cases from three independent medical institutions) to evaluate how well it assists junior radiologists. Result: DeepFAN reached AUCs of 0.939 and 0.954 on the internal test set and the clinical trial dataset, respectively; the average AUC of the 12 readers improved by 10.9%, with significant gains in accuracy, sensitivity, and specificity as well; nodule-level inter-reader consistency rose from fair to moderate (k from 0.313 to 0.421). Conclusion: DeepFAN can effectively help junior radiologists improve diagnostic performance, which may help homogenize diagnostic quality and reduce unnecessary follow-up of indeterminate pulmonary nodules. Abstract: The widespread adoption of CT has notably increased the number of detected lung nodules. However, current deep learning methods for classifying benign and malignant nodules often fail to comprehensively integrate global and local features, and most of them have not been validated through clinical trials. To address this, we developed DeepFAN, a transformer-based model trained on over 10K pathology-confirmed nodules and further conducted a multi-reader, multi-case clinical trial to evaluate its efficacy in assisting junior radiologists. DeepFAN achieved diagnostic area under the curve (AUC) of 0.939 (95% CI 0.930-0.948) on an internal test set and 0.954 (95% CI 0.934-0.973) on the clinical trial dataset involving 400 cases across three independent medical institutions. Explainability analysis indicated higher contributions from global than local features. Twelve readers' average performance significantly improved by 10.9% (95% CI 8.3%-13.5%) in AUC, 10.0% (95% CI 8.9%-11.1%) in accuracy, 7.6% (95% CI 6.1%-9.2%) in sensitivity, and 12.6% (95% CI 10.9%-14.3%) in specificity (P<0.001 for all). Nodule-level inter-reader diagnostic consistency improved from fair to moderate (overall k: 0.313 vs. 0.421; P=0.019). In conclusion, DeepFAN effectively assisted junior radiologists and may help homogenize diagnostic quality and reduce unnecessary follow-up of indeterminate pulmonary nodules. Chinese Clinical Trial Registry: ChiCTR2400084624.
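
The "fair to moderate" wording for the agreement statistic follows the commonly used Landis and Koch bands for kappa. The sketch below computes Cohen's kappa for two readers and maps a kappa value to its band. It is illustrative only: the abstract reports an overall k across twelve readers and does not state which multi-reader statistic was used, so the two-reader formula and the function names here are assumptions.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two readers labeling the same cases."""
    n = len(labels_a)
    # Observed agreement: fraction of cases where both readers agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each reader's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


def agreement_band(kappa):
    """Landis and Koch verbal bands for a kappa value."""
    if kappa < 0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# The two values reported in the abstract fall in adjacent bands.
bands = (agreement_band(0.313), agreement_band(0.421))
```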

[159] LanteRn: Latent Visual Structured Reasoning

André G. Viveiros,Nuno Gonçalves,Matthias Lindemann,André Martins

Main category: cs.CV

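TL;DR: This paper proposes LanteRn, a framework that lets large vision-language models reason in latent space by interleaving language with compact visual representations, avoiding pixel-level computation and reliance on external tools, and achieving gains on several visual reasoning benchmarks.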

Details Motivation: Existing large multimodal models have limited visual reasoning ability, especially for fine-grained spatial and visual understanding, and typically just convert visual content into text descriptions, which limits their performance. Method: Proposes the LanteRn framework, which gives a vision-language transformer the ability to generate and attend to continuous visual "thought embeddings"; training has two stages: supervised fine-tuning to align visual features with latent states, followed by reinforcement learning to align latent reasoning with task objectives. Result: On three perception-centric benchmarks (VisCoT, V*, and Blink), LanteRn shows consistent improvements in visual grounding and fine-grained reasoning. Conclusion: Using the model's internal latent space for visual reasoning is a more efficient and promising route to multimodal reasoning. Abstract: While language reasoning models excel in many tasks, visual reasoning remains challenging for current large multimodal models (LMMs). As a result, most LMMs default to verbalizing perceptual content into text, a strong limitation for tasks requiring fine-grained spatial and visual understanding. While recent approaches take steps toward thinking with images by invoking tools or generating intermediate images, they either rely on external modules, or incur unnecessary computation by reasoning directly in pixel space. In this paper, we introduce LanteRn, a framework that enables LMMs to interleave language with compact latent visual representations, allowing visual reasoning to occur directly in latent space. LanteRn augments a vision-language transformer with the ability to generate and attend to continuous visual thought embeddings during inference. We train the model in two stages: supervised fine-tuning to ground visual features in latent states, followed by reinforcement learning to align latent reasoning with task-level utility. We evaluate LanteRn on three perception-centric benchmarks (VisCoT, V*, and Blink), observing consistent improvements in visual grounding and fine-grained reasoning. These results suggest that internal latent representations provide a promising direction for more efficient multimodal reasoning.

[160] Designing Any Imaging System from Natural Language: Agent-Constrained Composition over a Finite Primitive Basis

Chengshuai Yang

Main category: cs.CV

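TL;DR: This paper proposes the spec.md specification format and three autonomous agents (Plan, Judge, Execute) that automatically turn a one-sentence natural-language description into a validated forward model; a design-to-real error theorem decomposes the reconstruction error, yielding automated imaging system design on par with expert-library quality.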

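Details Motivation: Designing computational imaging systems depends on specialists, takes a long time, and has a high barrier to entry; the goal is to let the broader scientific community take part in prototyping imaging instruments. Method: Proposes the structured specification format spec.md; builds three autonomous agents, Plan, Judge, and Execute; and establishes a design-to-real error theorem that decomposes the total reconstruction error into five independently bounded terms, each linked to a corrective action. Result: On 6 real-data modalities covering all 5 carrier families, the automated pipeline reaches expert-library quality (98.1 ± 4.2%); it generates 10 novel designs that support composite chains from 3D to 5D, going beyond the compositional reach of single-modality tools. Conclusion: The method substantially lowers the barrier to designing computational imaging systems, improves scalability and cross-modality composition, and promotes automated, standardized development of imaging instruments. Abstract: Designing a computational imaging system -- selecting operators, setting parameters, validating consistency -- requires weeks of specialist effort per modality, creating an expertise bottleneck that excludes the broader scientific community from prototyping imaging instruments. We introduce spec.md, a structured specification format, and three autonomous agents -- Plan, Judge, and Execute -- that translate a one-sentence natural-language description into a validated forward model with bounded reconstruction error. A design-to-real error theorem decomposes total reconstruction error into five independently bounded terms, each linked to a corrective action. On 6 real-data modalities spanning all 5 carrier families, the automated pipeline matches expert-library quality (98.1 +/- 4.2%). Ten novel designs -- composing primitives into chains from 3D to 5D -- demonstrate compositional reach beyond any single-modality tool.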

[161] Just Zoom In: Cross-View Geo-Localization via Autoregressive Zooming

Yunus Talha Erzurumlu,Jiyong Kwag,Alper Yilmaz

Main category: cs.CV

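TL;DR: This paper proposes Just Zoom In, a new method that performs cross-view geo-localization (CVGL) by autoregressively zooming step by step over a city-scale satellite map, removing the dependence on conventional contrastive learning and hard negative mining, and achieving state-of-the-art performance on a newly built realistic benchmark.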

Details Motivation: Existing CVGL methods model the problem as image retrieval under contrastive learning, which ties them to large-batch training and hard negative mining and ignores both the geometric structure of maps and the coverage mismatch between street-view and overhead imagery, leaving retrieval targets ambiguous and explicit spatial reasoning difficult. Method: Proposes Just Zoom In: starting from a coarse satellite view, the model takes a short sequence of autoregressive "zoom" decisions to progressively localize a terminal satellite cell at the target resolution, without contrastive losses or hard negative mining. It also builds a realistic benchmark from crowd-sourced street views and high-resolution satellite imagery. Result: On the new benchmark, Just Zoom In improves Recall@1 within 50 m by 5.5% and Recall@1 within 100 m by 9.6%, clearly outperforming the strongest contrastive-retrieval baseline. Conclusion: Sequential coarse-to-fine spatial reasoning suits cross-view geo-localization better than conventional embedding-based retrieval and makes more effective use of map structure and spatial relationships. Abstract: Cross-view geo-localization (CVGL) estimates a camera's location by matching a street-view image to geo-referenced overhead imagery, enabling GPS-denied localization and navigation. Existing methods almost universally formulate CVGL as an image-retrieval problem in a contrastively trained embedding space. This ties performance to large batches and hard negative mining, and it ignores both the geometric structure of maps and the coverage mismatch between street-view and overhead imagery. In particular, salient landmarks visible from the street view can fall outside a fixed satellite crop, making retrieval targets ambiguous and limiting explicit spatial inference over the map. We propose Just Zoom In, an alternative formulation that performs CVGL via autoregressive zooming over a city-scale overhead map. Starting from a coarse satellite view, the model takes a short sequence of zoom-in decisions to select a terminal satellite cell at a target resolution, without contrastive losses or hard negative mining. We further introduce a realistic benchmark with crowd-sourced street views and high-resolution satellite imagery that reflects real capture conditions. On this benchmark, Just Zoom In achieves state-of-the-art performance, improving Recall@1 within 50 m by 5.5% and Recall@1 within 100 m by 9.6% over the strongest contrastive-retrieval baseline. These results demonstrate the effectiveness of sequential coarse-to-fine spatial reasoning for cross-view geo-localization.
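
To make the coarse-to-fine formulation concrete, the sketch below decodes a sequence of zoom decisions into a terminal map cell. It assumes each decision picks one of four quadrants of the current view; the abstract does not state the branching factor, and the function name and coordinate convention are hypothetical.

```python
def zoom_to_cell(bounds, decisions):
    """Decode a sequence of quadrant choices into a terminal map cell.

    bounds:    (x_min, y_min, x_max, y_max) of the starting coarse view.
    decisions: iterable of quadrant indices in {0, 1, 2, 3}, where
               index % 2 selects the column and index // 2 the row.
               The 2x2 split is an illustrative assumption.
    """
    x_min, y_min, x_max, y_max = bounds
    for quadrant in decisions:
        if quadrant not in (0, 1, 2, 3):
            raise ValueError(f"invalid quadrant index: {quadrant}")
        x_mid = (x_min + x_max) / 2
        y_mid = (y_min + y_max) / 2
        column, row = quadrant % 2, quadrant // 2
        # Keep the half of each axis that contains the chosen quadrant.
        x_min, x_max = (x_min, x_mid) if column == 0 else (x_mid, x_max)
        y_min, y_max = (y_min, y_mid) if row == 0 else (y_mid, y_max)
    return (x_min, y_min, x_max, y_max)


# Two zoom steps shrink an 8 x 8 map to a 2 x 2 cell.
cell = zoom_to_cell((0, 0, 8, 8), [0, 3])
```

In an autoregressive model, each quadrant index would be predicted conditioned on the street-view image and the previous zoom steps, so a short decision sequence addresses a very large number of cells.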

[162] Seeing to Ground: Visual Attention for Hallucination-Resilient MDLLMs

Vishal Narnaware,Animesh Gupta,Kevin Zhai,Zhenyi Wang,Mubarak Shah

Main category: cs.CV

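TL;DR: This paper proposes VISAGE, a framework that calibrates the objective at inference time to mitigate multimodal hallucination in Multimodal Diffusion Large Language Models (MDLLMs); its core idea is to estimate the language-vision objective mismatch from the spatial entropy of cross-attention and to re-rank tokens by enforcing localization consensus across attention heads, improving visual grounding and robustness.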

Details Motivation: Existing MDLLMs rank tokens by text-only likelihood and ignore localized visual support, which causes objective mismatch and multimodal hallucination. Method: VISAGE is a training-free decoding framework that estimates the proxy discrepancy by quantifying the spatial entropy of cross-attention distributions, enforces spatial consensus across attention heads, penalizes uniform distributions, and re-ranks tokens to strengthen visual grounding. Result: Achieves relative gains of 8.59% on MMMU-val and 7.75% on HallusionBench, demonstrating the framework's robustness on hallucination-sensitive and general-purpose benchmarks. Conclusion: VISAGE reinterprets hallucination as a localized optimization error and, by calibrating the objective at inference time, reduces reliance on language shortcuts and improves the visual fidelity and stability of multimodal generation. Abstract: Multimodal Diffusion Large Language Models (MDLLMs) achieve high-concurrency generation through parallel masked decoding, yet the architectures remain prone to multimodal hallucinations. This structural vulnerability stems from an algorithmic flaw: the decoder ranks candidate tokens based on textual likelihood without verifying localized visual support. We establish that this language-only ranking induces an objective mismatch, where language probability mass acts as a misspecified proxy for the intended multimodal task. Consequently, we reinterpret hallucination as a localized optimization error, a phenomenon where the decoder exploits language shortcuts to maximize a proxy score at the expense of visual grounding. To address this objective mismatch, we introduce VISAGE, a training-free decoding framework that calibrates the objective at inference time. VISAGE estimates the proxy discrepancy by quantifying the spatial entropy of cross-attention distributions. By enforcing a localization consensus across attention heads, the method penalizes spatially uniform distributions and re-ranks token commitments to favor visually grounded outcomes. We provide an analytical stability guarantee establishing that VISAGE maintains a bounded objective loss under estimation error. Evaluations across hallucination-sensitive and general-purpose benchmarks demonstrate the robustness of the framework, yielding relative gains of 8.59% on MMMU-val and 7.75% on HallusionBench.
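
The entropy-based re-ranking idea can be sketched in a few lines. This is a simplified illustration under stated assumptions: attention maps are flattened lists of non-negative weights over image patches, the consensus across heads is approximated by a plain mean of normalized entropies, and the penalty weight and function names are hypothetical rather than VISAGE's actual scoring rule.

```python
import math


def spatial_entropy(attention):
    """Shannon entropy (nats) of one flattened cross-attention map."""
    total = sum(attention)
    probs = [a / total for a in attention]
    return -sum(p * math.log(p) for p in probs if p > 0)


def rerank_tokens(candidates, penalty=1.0):
    """Re-rank candidate tokens, penalizing spatially diffuse attention.

    candidates: list of (token, log_likelihood, heads), where heads is a
                list of flattened attention maps with more than one patch.
    Score = log-likelihood minus penalty times the mean normalized entropy
    over heads (1.0 means perfectly uniform attention). Averaging heads is
    a stand-in for the paper's localization consensus.
    """
    scored = []
    for token, log_likelihood, heads in candidates:
        max_entropy = math.log(len(heads[0]))
        mean_entropy = sum(spatial_entropy(h) for h in heads) / len(heads)
        scored.append((log_likelihood - penalty * mean_entropy / max_entropy, token))
    scored.sort(reverse=True)
    return [token for _, token in scored]


# "dog" is likelier under the language prior but has uniform attention.
ranking = rerank_tokens([
    ("cat", -1.0, [[0.97, 0.01, 0.01, 0.01]]),
    ("dog", -0.8, [[0.25, 0.25, 0.25, 0.25]]),
])
```

Here the sharply localized candidate overtakes the one favored by text likelihood alone, which is the qualitative behavior the abstract describes.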

[163] Out of Sight but Not Out of Mind: Hybrid Memory for Dynamic Video World Models

Kaijin Chen,Dingkang Liang,Xin Zhou,Yikang Ding,Xiaoqiang Liu,Pengfei Wan,Xiang Bai

Main category: cs.CV

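TL;DR: This paper proposes the Hybrid Memory paradigm and the HyDRA architecture to preserve the identity and motion continuity of dynamic subjects that reappear after leaving the view in video world models, and builds HM-World, the first large-scale video dataset dedicated to hybrid memory.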

Details Motivation: The memory mechanisms of existing video world models treat the environment as a static canvas and struggle when dynamic subjects briefly leave the view and then reappear, producing frozen, distorted, or vanishing subjects. Method: Proposes the Hybrid Memory paradigm, which requires a model to both precisely archive static backgrounds and continuously track dynamic subjects; builds the HM-World dataset (59K high-fidelity clips with decoupled camera and subject trajectories, 17 scenes, 49 subjects, and carefully designed exit-entry events); and designs the HyDRA memory architecture, which compresses memory into tokens and uses a spatiotemporal relevance-driven retrieval mechanism to selectively attend to motion cues. Result: Experiments on HM-World show that HyDRA significantly outperforms state-of-the-art methods in dynamic subject consistency and overall generation quality. Conclusion: Hybrid Memory is an effective new paradigm for improving the long-term consistent modeling of dynamic subjects in video world models, and HyDRA and HM-World supply the key architecture and benchmark for this direction. Abstract: Video world models have shown immense potential in simulating the physical world, yet existing memory mechanisms primarily treat environments as static canvases. When dynamic subjects hide out of sight and later re-emerge, current methods often struggle, leading to frozen, distorted, or vanishing subjects. To address this, we introduce Hybrid Memory, a novel paradigm requiring models to simultaneously act as precise archivists for static backgrounds and vigilant trackers for dynamic subjects, ensuring motion continuity during out-of-view intervals. To facilitate research in this direction, we construct HM-World, the first large-scale video dataset dedicated to hybrid memory. It features 59K high-fidelity clips with decoupled camera and subject trajectories, encompassing 17 diverse scenes, 49 distinct subjects, and meticulously designed exit-entry events to rigorously evaluate hybrid coherence. Furthermore, we propose HyDRA, a specialized memory architecture that compresses memory into tokens and utilizes a spatiotemporal relevance-driven retrieval mechanism. By selectively attending to relevant motion cues, HyDRA effectively preserves the identity and motion of hidden subjects. Extensive experiments on HM-World demonstrate that our method significantly outperforms state-of-the-art approaches in both dynamic subject consistency and overall generation quality.

[164] No Hard Negatives Required: Concept Centric Learning Leads to Compositionality without Degrading Zero-shot Capabilities of Contrastive Models

Hai X. Pham,David T. Hoffmann,Ricardo Guerrero,Brais Martinez

Main category: cs.CV

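TL;DR: This paper proposes an improvement to contrastive vision-language models that requires no hard negatives and adds no inference overhead: concept-centric short-text alignment and cross-modal attention pooling substantially improve performance on compositionality tasks while maintaining or even improving zero-shot and retrieval capabilities.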

Details Motivation: Existing contrastive vision-language models struggle to learn compositional representations; generating hard negatives helps but generalizes poorly and damages basic capabilities such as zero-shot classification and retrieval, making it impractical. Method: 1) Extract short concept-centric text fragments with standard NLP tools and align them with the image; 2) introduce a parameter-free cross-modal attention pooling mechanism to obtain concept-centric visual embeddings from the image encoder; both are supported by simple auxiliary contrastive losses. Result: Achieves SOTA performance on standard compositionality benchmarks while maintaining or improving zero-shot classification and cross-modal retrieval, with no increase in inference cost. Conclusion: The compositionality deficit stems from long captions that do not require compositional representations and from global pooling that destroys binding information; concept-centric alignment and attention pooling address both without hard negatives or scaling up the model. Abstract: Contrastive vision-language (V&L) models remain a popular choice for various applications. However, several limitations have emerged, most notably the limited ability of V&L models to learn compositional representations. Prior methods often addressed this limitation by generating custom training data to obtain hard negative samples. Hard negatives have been shown to improve performance on compositionality tasks, but are often specific to a single benchmark, do not generalize, and can cause substantial degradation of basic V&L capabilities such as zero-shot or retrieval performance, rendering them impractical. In this work we follow a different approach. We identify two root causes that limit compositionality performance of V&Ls: 1) Long training captions do not require a compositional representation; and 2) The final global pooling in the text and image encoders leads to a complete loss of the necessary information to learn binding in the first place. As a remedy, we propose two simple solutions: 1) We obtain short concept centric caption parts using standard NLP software and align those with the image; and 2) We introduce a parameter-free cross-modal attention-pooling to obtain concept centric visual embeddings from the image encoder. With these two changes and simple auxiliary contrastive losses, we obtain SOTA performance on standard compositionality benchmarks, while maintaining or improving strong zero-shot and retrieval capabilities. This is achieved without increasing inference cost. We release the code for this work at https://github.com/SamsungLabs/concept_centric_clip.
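The parameter-free cross-modal attention pooling mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything here is an assumption: the text embedding is used as the query, image patch tokens serve as both keys and values, and the names `attention_pool` and `temperature` are hypothetical. No learned weights are involved, which is what "parameter-free" is taken to mean.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(text_query, patch_tokens, temperature=1.0):
    """Pool patch tokens into one vector, weighted by similarity to the text query.

    Assumption: no projections or learned parameters; the weights come
    directly from query-patch dot products.
    """
    scores = [dot(text_query, p) / temperature for p in patch_tokens]
    weights = softmax(scores)
    dim = len(patch_tokens[0])
    pooled = [sum(w * p[i] for w, p in zip(weights, patch_tokens)) for i in range(dim)]
    return pooled, weights

# Toy example: one patch matches the concept direction, two do not.
patches = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
query = [5.0, 0.0]
pooled, weights = attention_pool(query, patches)
```

In this toy case the pooled embedding is dominated by the first patch, whereas a plain global average would mix all three patches equally, which is the loss of binding information the abstract attributes to global pooling.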

[165] AnyHand: A Large-Scale Synthetic Dataset for RGB(-D) Hand Pose Estimation

Chen Si,Yulin Liu,Bo Ai,Jianwen Xie,Rolandos Alexandros Potamias,Chuanxia Zheng,Hao Su

Main category: cs.CV

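TL;DR: AnyHand is a large-scale synthetic dataset of 2.5M single-hand and 4.1M hand-object interaction RGB-D images for improving 3D hand pose estimation; it significantly improves the performance and generalization of existing methods in both RGB-only and RGB-D settings, and the paper also proposes a lightweight depth fusion module.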

Details Motivation: Existing real-world hand datasets have limited coverage, and prior synthetic datasets rarely provide occlusions, arm details, and aligned depth together at scale, which limits progress in 3D hand pose estimation. Method: Builds AnyHand, a large-scale synthetic RGB-D dataset with rich geometric annotations, and uses it to extend the training sets of existing baselines without changing the model architecture or training scheme; also designs a plug-and-play lightweight depth fusion module to augment RGB-based models. Result: Significant gains on benchmarks such as FreiHAND and HO-3D; stronger generalization to the out-of-domain HO-Cap dataset without fine-tuning; superior RGB-D performance on HO-3D once the depth module is integrated. Conclusion: AnyHand shows that high-quality, large-scale, multimodal synthetic data substantially advances 3D hand pose estimation, particularly in robustness, generalization, and multimodal fusion. Abstract: We present AnyHand, a large-scale synthetic dataset designed to advance the state of the art in 3D hand pose estimation from both RGB-only and RGB-D inputs. While recent works with foundation approaches have shown that an increase in the quantity and diversity of training data can markedly improve performance and robustness in hand pose estimation, existing real-world-collected datasets on this task are limited in coverage, and prior synthetic datasets rarely provide occlusions, arm details, and aligned depth together at scale. To address this bottleneck, our AnyHand contains 2.5M single-hand and 4.1M hand-object interaction RGB-D images, with rich geometric annotations. In the RGB-only setting, we show that extending the original training sets of existing baselines with AnyHand yields significant gains on multiple benchmarks (FreiHAND and HO-3D), even when keeping the architecture and training scheme fixed. More impressively, the model trained with AnyHand shows stronger generalization to the out-of-domain HO-Cap dataset, without any fine-tuning. We also contribute a lightweight depth fusion module that can be easily integrated into existing RGB-based models. Trained with AnyHand, the resulting RGB-D model achieves superior performance on the HO-3D benchmark, showing the benefits of depth integration and the effectiveness of our synthetic data.

[166] PixelSmile: Toward Fine-Grained Facial Expression Editing

Jiabin Hua,Hengyuan Xu,Aojie Li,Wei Cheng,Gang Yu,Xingjun Ma,Yu-Gang Jiang

Main category: cs.CV

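TL;DR: This paper proposes PixelSmile, a diffusion framework that achieves fine-grained, continuous, and controllable facial expression editing through fully symmetric joint training and textual latent interpolation, and builds the FFE dataset and the FFE-Bench benchmark.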

Details Motivation: Fine-grained facial expression editing has long been limited by intrinsic semantic overlap. Method: Builds the Flex Facial Expression (FFE) dataset with continuous affective annotations and the FFE-Bench benchmark; proposes the PixelSmile diffusion framework, which disentangles expression semantics via fully symmetric joint training, combines intensity supervision with contrastive learning, and achieves linearly controllable expression editing through textual latent interpolation. Result: PixelSmile performs well on structural confusion suppression, editing accuracy, linear controllability, and identity preservation, and supports smooth expression blending. Conclusion: PixelSmile effectively addresses expression semantic overlap and enables continuous, controllable, fine-grained, and identity-robust expression editing. Abstract: Fine-grained facial expression editing has long been limited by intrinsic semantic overlap. To address this, we construct the Flex Facial Expression (FFE) dataset with continuous affective annotations and establish FFE-Bench to evaluate structural confusion, editing accuracy, linear controllability, and the trade-off between expression editing and identity preservation. We propose PixelSmile, a diffusion framework that disentangles expression semantics via fully symmetric joint training. PixelSmile combines intensity supervision with contrastive learning to produce stronger and more distinguishable expressions, achieving precise and stable linear expression control through textual latent interpolation. Extensive experiments demonstrate that PixelSmile achieves superior disentanglement and robust identity preservation, confirming its effectiveness for continuous, controllable, and fine-grained expression editing, while naturally supporting smooth expression blending.

[167] PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference

Xiaofeng Mao,Shaohao Rui,Kaining Ying,Bo Zheng,Chuanhao Li,Mingmin Chi,Kaipeng Zhang

Main category: cs.CV

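TL;DR: This paper proposes the PackForcing framework, which efficiently compresses generation history using a three-partition KV-cache strategy (Sink/Mid/Recent tokens) together with dynamic context selection and Temporal RoPE Adjustment, generating 2-minute 832x480 videos on a single H200 GPU with a KV cache of only 4 GB and supporting 24x temporal extrapolation.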

Details Motivation: Addresses the bottlenecks autoregressive video diffusion models face in long-video generation: linear KV-cache growth, temporal repetition, and compounding errors. Method: Proposes the unified PackForcing framework: 1) a three-partition KV-cache strategy in which Sink tokens preserve early anchor frames at full resolution to maintain global semantics, Mid tokens achieve 32x spatiotemporal compression via a dual-branch network, and Recent tokens maintain local temporal coherence; 2) a dynamic top-k context selection mechanism that bounds the number of Mid tokens; 3) a continuous Temporal RoPE Adjustment that re-aligns position gaps. Result: Generates 2-minute videos (832x480, 16 FPS) on a single H200 GPU with the KV cache bounded at 4 GB; achieves 5 s to 120 s (24x) temporal extrapolation, either zero-shot or trained only on 5-second clips; reaches a temporal consistency of 26.07 and a dynamic degree of 56.25 on VBench, both SOTA. Conclusion: Short-video supervision is sufficient for high-quality long-video synthesis; hierarchical context compression is an effective paradigm for overcoming the length and efficiency bottlenecks of autoregressive video generation. Abstract: Autoregressive video diffusion models have demonstrated remarkable progress, yet they remain bottlenecked by intractable linear KV-cache growth, temporal repetition, and compounding errors during long-video generation. To address these challenges, we present PackForcing, a unified framework that efficiently manages the generation history through a novel three-partition KV-cache strategy. Specifically, we categorize the historical context into three distinct types: (1) Sink tokens, which preserve early anchor frames at full resolution to maintain global semantics; (2) Mid tokens, which achieve a massive spatiotemporal compression (32x token reduction) via a dual-branch network fusing progressive 3D convolutions with low-resolution VAE re-encoding; and (3) Recent tokens, kept at full resolution to ensure local temporal coherence. To strictly bound the memory footprint without sacrificing quality, we introduce a dynamic top-$k$ context selection mechanism for the mid tokens, coupled with a continuous Temporal RoPE Adjustment that seamlessly re-aligns position gaps caused by dropped tokens with negligible overhead. Empowered by this principled hierarchical context compression, PackForcing can generate coherent 2-minute, 832x480 videos at 16 FPS on a single H200 GPU. It achieves a bounded KV cache of just 4 GB and enables a remarkable 24x temporal extrapolation (5s to 120s), operating effectively either zero-shot or trained on merely 5-second clips. Extensive results on VBench demonstrate state-of-the-art temporal consistency (26.07) and dynamic degree (56.25), proving that short-video supervision is sufficient for high-quality, long-video synthesis. https://github.com/ShandaAI/PackForcing
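To make the three-partition idea concrete, here is a minimal sketch of how a bounded context could be assembled from a growing history. It is not the authors' implementation: the function names, the partition sizes, and the per-frame `relevance` scores are all hypothetical, and the 32x compression of Mid tokens and the Temporal RoPE Adjustment are not modeled. It only shows why the context size stays constant as the video gets longer.

```python
def partition_history(num_frames, num_sink, num_recent):
    """Split frame indices 0..num_frames-1 into sink, mid, and recent groups."""
    indices = list(range(num_frames))
    recent_start = max(num_sink, num_frames - num_recent)
    sink = indices[:num_sink]          # earliest anchor frames
    mid = indices[num_sink:recent_start]  # everything in between
    recent = indices[recent_start:]    # most recent frames
    return sink, mid, recent

def select_mid(mid, relevance, k):
    """Keep the k mid entries with the highest relevance, in temporal order."""
    ranked = sorted(mid, key=lambda i: relevance[i], reverse=True)[:k]
    return sorted(ranked)

def build_context(num_frames, relevance, num_sink=2, num_recent=4, k=3):
    """Assemble a context of at most num_sink + k + num_recent entries."""
    sink, mid, recent = partition_history(num_frames, num_sink, num_recent)
    return sink + select_mid(mid, relevance, k) + recent

# Hypothetical relevance scores; a real system would derive them from attention.
scores_100 = [(i * 37) % 101 for i in range(100)]
scores_1000 = [(i * 37) % 101 for i in range(1000)]
ctx_100 = build_context(100, scores_100)
ctx_1000 = build_context(1000, scores_1000)
ctx_short = build_context(5, [0.0] * 5)
```

Both `ctx_100` and `ctx_1000` contain nine entries, so the context does not grow with history length. For scale, the 2-minute videos at 16 FPS reported in the abstract correspond to 120 × 16 = 1920 frames, and 120 s / 5 s gives the stated 24x extrapolation.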

[168] Vega: Learning to Drive with Natural Language Instructions

Sicheng Zuo,Yuxuan Li,Wenzhao Zheng,Zheng Zhu,Jie Zhou,Jiwen Lu

Main category: cs.CV

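TL;DR: This paper proposes Vega, a unified vision-language-world-action model for instruction-based driving generation and planning, and builds the large-scale driving dataset InstructScene.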

Details Motivation: Existing autonomous driving models use language only for scene description or reasoning and lack the ability to follow diverse user instructions for personalized driving. Method: Builds InstructScene, a large-scale driving dataset of around 100,000 scenes annotated with driving instructions and corresponding trajectories; proposes the Vega model, which uses the autoregressive paradigm to process visual and language inputs and the diffusion paradigm to generate future predictions and trajectories, with joint attention and individual projection layers for multimodal interaction. Result: Experiments show the method outperforms existing methods in both planning performance and instruction following. Conclusion: Vega offers a new path toward more intelligent and personalized autonomous driving systems. Abstract: Vision-language-action models have reshaped autonomous driving to incorporate languages into the decision-making process. However, most existing pipelines only utilize the language modality for scene descriptions or reasoning and lack the flexibility to follow diverse user instructions for personalized driving. To address this, we first construct a large-scale driving dataset (InstructScene) containing around 100,000 scenes annotated with diverse driving instructions with the corresponding trajectories. We then propose a unified Vision-Language-World-Action model, Vega, for instruction-based generation and planning. We employ the autoregressive paradigm to process visual inputs (vision) and language instructions (language) and the diffusion paradigm to generate future predictions (world modeling) and trajectories (action). We perform joint attention to enable interactions between the modalities and use individual projection layers for different modalities for more capabilities. Extensive experiments demonstrate that our method not only achieves superior planning performance but also exhibits strong instruction-following abilities, paving the way for more intelligent and personalized driving systems.

[169] RefAlign: Representation Alignment for Reference-to-Video Generation

Lei Wang,YuXin Song,Ge Wu,Haocheng Feng,Hang Zhou,Jingdong Wang,Yaxing Wang,Jian Yang

Main category: cs.CV

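TL;DR: This paper proposes the RefAlign framework, which explicitly aligns reference image features with the semantic space of a visual foundation model (VFM) to improve identity consistency and semantic discriminability in reference-to-video generation, balancing text controllability and reference fidelity with no inference overhead after training.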

Details Motivation: Existing reference-to-video (R2V) methods rely on auxiliary multimodal features to mitigate information leakage in the VAE latent space, but still struggle with copy-paste artifacts and multi-subject confusion, which stem from modality mismatch across heterogeneous encoder features. Method: Proposes the RefAlign framework, whose core is a reference alignment loss: during training it pulls the DiT reference-branch features and VFM features of the same subject closer and pushes apart the corresponding features of different subjects; it is applied only during training and adds no inference overhead. Result: On the OpenS2V-Eval benchmark, RefAlign surpasses current state-of-the-art methods in TotalScore, validating explicit reference alignment for R2V tasks. Conclusion: Explicitly aligning reference features to the VFM semantic space is a simple and effective approach that better balances text controllability and reference fidelity, offering a new direction for R2V generation. Abstract: Reference-to-video (R2V) generation is a controllable video synthesis paradigm that constrains the generation process using both text prompts and reference images, enabling applications such as personalized advertising and virtual try-on. In practice, existing R2V methods typically introduce additional high-level semantic or cross-modal features alongside the VAE latent representation of the reference image and jointly feed them into the diffusion Transformer (DiT). These auxiliary representations provide semantic guidance and act as implicit alignment signals, which can partially alleviate pixel-level information leakage in the VAE latent space. However, they may still struggle to address copy-paste artifacts and multi-subject confusion caused by modality mismatch across heterogeneous encoder features. In this paper, we propose RefAlign, a representation alignment framework that explicitly aligns DiT reference-branch features to the semantic space of a visual foundation model (VFM). The core of RefAlign is a reference alignment loss that pulls the reference features and VFM features of the same subject closer to improve identity consistency, while pushing apart the corresponding features of different subjects to enhance semantic discriminability. This simple yet effective strategy is applied only during training, incurring no inference-time overhead, and achieves a better balance between text controllability and reference fidelity. Extensive experiments on the OpenS2V-Eval benchmark demonstrate that RefAlign outperforms current state-of-the-art methods in TotalScore, validating the effectiveness of explicit reference alignment for R2V tasks.
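The pull-together, push-apart behavior described for the reference alignment loss resembles a standard contrastive objective. The sketch below uses an InfoNCE-style loss over cosine similarities as a stand-in; the abstract does not state the actual loss form, so the function `alignment_loss`, the `temperature` value, and the feature shapes are assumptions chosen for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def alignment_loss(ref_feats, vfm_feats, temperature=0.1):
    """InfoNCE-style loss (an assumed stand-in, not the paper's exact loss).

    ref_feats[i] and vfm_feats[i] belong to the same subject (positive pair);
    vfm_feats[j] for j != i act as negatives from different subjects.
    """
    total = 0.0
    for i, ref in enumerate(ref_feats):
        logits = [cosine(ref, v) / temperature for v in vfm_feats]
        m = max(logits)
        # Stable log-sum-exp over all candidates.
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(ref_feats)

vfm = [[1.0, 0.0], [0.0, 1.0]]
aligned_refs = [[1.0, 0.0], [0.0, 1.0]]   # each reference matches its own subject
swapped_refs = [[0.0, 1.0], [1.0, 0.0]]   # each reference matches the other subject
low = alignment_loss(aligned_refs, vfm)
high = alignment_loss(swapped_refs, vfm)
```

The loss is near zero when each reference feature matches its own subject and large when subjects are confused, which is the gradient signal that would discourage multi-subject confusion. Because such a term only enters the training objective, it adds no cost at inference, consistent with the abstract.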

[170] MuRF: Unlocking the Multi-Scale Potential of Vision Foundation Models

Bocheng Zou,Mu Cai,Mark Stanley,Dingfu Lu,Yong Jae Lee

Main category: cs.CV

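TL;DR: This paper proposes Multi-Resolution Fusion (MuRF), an inference-time multi-resolution feature fusion strategy that processes an image at several resolutions through a frozen vision foundation model and fuses the resulting features to improve performance, requiring no additional training and remaining compatible with multiple VFM architectures.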

Details Motivation: Existing vision foundation models (VFMs) typically use a single fixed-scale input at inference, overlooking the complementarity of multi-scale perception: low resolution favors global semantic recognition and high resolution favors fine-grained recognition. Method: Proposes MuRF: run forward passes on the same image at multiple resolutions through a frozen VFM, extract features at each scale, and fuse them into a unified representation; the method needs no fine-tuning or retraining and is not tied to a specific VFM architecture. Result: MuRF significantly improves performance across multiple key computer vision tasks and generalizes to different VFM families such as DINOv2 and SigLIP2. Conclusion: MuRF is a simple, general, training-free inference enhancement that exploits the synergy of multi-scale information and provides plug-and-play gains for deploying vision foundation models. Abstract: Vision Foundation Models (VFMs) have become the cornerstone of modern computer vision, offering robust representations across a wide array of tasks. While recent advances allow these models to handle varying input sizes during training, inference typically remains restricted to a single, fixed scale. This prevalent single-scale paradigm overlooks a fundamental property of visual perception: varying resolutions offer complementary inductive biases, where low-resolution views excel at global semantic recognition and high-resolution views are essential for fine-grained refinement. In this work, we propose Multi-Resolution Fusion (MuRF), a simple yet universally effective strategy to harness this synergy at inference time. Instead of relying on a single view, MuRF constructs a unified representation by processing an image at multiple resolutions through a frozen VFM and fusing the resulting features. The universality of MuRF is its most compelling attribute. It is not tied to a specific architecture, serving instead as a fundamental, training-free enhancement to visual representation. We empirically validate this by applying MuRF to a broad spectrum of critical computer vision tasks across multiple distinct VFM families - primarily DINOv2, but also demonstrating successful generalization to contrastive models like SigLIP2.
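The multi-resolution pipeline described here (resize, encode with a frozen model, fuse) can be sketched in a few lines. This is a toy illustration under stated assumptions: `toy_encoder` is a hypothetical stand-in for a frozen VFM, nearest-neighbour resizing replaces a real image resize, and concatenation is only one possible fusion operator, since the abstract does not specify how features are fused.

```python
def resize_nearest(image, size):
    """Nearest-neighbour resize of a square 2D list to size x size."""
    src = len(image)
    return [[image[r * src // size][c * src // size] for c in range(size)]
            for r in range(size)]

def toy_encoder(image):
    """Hypothetical stand-in for a frozen VFM: a 2-dim global feature (mean, max)."""
    flat = [p for row in image for p in row]
    return [sum(flat) / len(flat), max(flat)]

def murf_features(image, sizes, encoder):
    """Encode the image at each resolution and concatenate the features.

    The encoder is never updated, so the procedure is training-free.
    """
    fused = []
    for size in sizes:
        fused.extend(encoder(resize_nearest(image, size)))
    return fused

# A 4x4 image with a single bright pixel that a coarse view can miss.
image = [
    [0, 0, 0, 0],
    [0, 9, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
fused = murf_features(image, sizes=(2, 4, 8), encoder=toy_encoder)
```

In this example the 2x2 view misses the bright pixel entirely while the 4x4 and 8x8 views retain it, so the fused vector carries information that no single resolution provides on its own.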

[171] Less Gaussians, Texture More: 4K Feed-Forward Textured Splatting

Yixing Lao,Xuyang Bai,Xiaoyang Wu,Nuoyuan Yan,Zixin Luo,Tian Fang,Jean-Daniel Nahmias,Yanghai Tsin,Shiwei Li,Hengshuang Zhao

Main category: cs.CV

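TL;DR: LGTM is a new feed-forward 3D Gaussian Splatting framework that predicts compact Gaussian primitives with per-primitive textures, decoupling geometric complexity from rendering resolution and enabling 4K novel view synthesis without per-scene optimization.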

Details Motivation: Existing feed-forward 3D Gaussian Splatting methods predict pixel-aligned primitives, so the primitive count grows quadratically with resolution, making it hard to scale to high resolutions such as 4K. Method: Proposes the LGTM framework, which pairs compact Gaussian primitives with per-primitive textures to decouple the geometric representation from rendering resolution. Result: Achieves high-quality 4K novel view synthesis without per-scene optimization while using significantly fewer Gaussian primitives. Conclusion: LGTM overcomes the resolution scaling bottleneck of feed-forward Gaussian splatting and offers a scalable solution for high-resolution novel view synthesis. Abstract: Existing feed-forward 3D Gaussian Splatting methods predict pixel-aligned primitives, leading to a quadratic growth in primitive count as resolution increases. This fundamentally limits their scalability, making high-resolution synthesis such as 4K intractable. We introduce LGTM (Less Gaussians, Texture More), a feed-forward framework that overcomes this resolution scaling barrier. By predicting compact Gaussian primitives coupled with per-primitive textures, LGTM decouples geometric complexity from rendering resolution. This approach enables high-fidelity 4K novel view synthesis without per-scene optimization, a capability previously out of reach for feed-forward methods, all while using significantly fewer Gaussian primitives. Project page: https://yxlao.github.io/lgtm/
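The quadratic growth mentioned in the abstract follows from simple arithmetic, sketched below. Two assumptions are made that the abstract does not state: a pixel-aligned method emits exactly one primitive per pixel per input view, and "4K" means 3840x2160. The function name is hypothetical.

```python
def pixel_aligned_primitives(width, height, views=1):
    """Primitive count assuming one Gaussian per pixel per input view."""
    return width * height * views

base = pixel_aligned_primitives(960, 540)    # quarter of 4K along each side
uhd = pixel_aligned_primitives(3840, 2160)   # assumed 4K resolution
ratio = uhd / base                           # 4x per side gives 16x primitives
```

Under these assumptions a single 4K view already implies about 8.3 million primitives, 16 times the count at 960x540, which is why decoupling the primitive count from the rendering resolution matters.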

[172] ShotStream: Streaming Multi-Shot Video Generation for Interactive Storytelling

Yawen Luo,Xiaoyu Shi,Junhao Zhuang,Yutian Chen,Quande Liu,Xintao Wang,Pengfei Wan,Tianfan Xue

Main category: cs.CV

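TL;DR: ShotStream is a novel causal multi-shot video generation architecture that supports interactive storytelling and on-the-fly frame generation; it addresses inter-shot consistency and error accumulation with a dual-cache mechanism and a two-stage distillation strategy, achieving 16 FPS generation on a single GPU.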

Details Motivation: Existing bidirectional multi-shot video generation architectures suffer from limited interactivity and high latency, making long narratives and real-time interactive storytelling difficult. Method: Proposes the causal ShotStream architecture, reformulating the task as next-shot generation conditioned on historical context; distills a bidirectional model into a causal student via Distribution Matching Distillation; introduces a dual-cache (global/local) mechanism for inter-shot and intra-shot consistency, with a RoPE discontinuity indicator to distinguish the two caches; designs a two-stage self-forcing distillation strategy to mitigate error accumulation. Result: Achieves sub-second latency and 16 FPS generation on a single GPU; the multi-shot videos are coherent, and quality matches or exceeds slower bidirectional models. Conclusion: ShotStream offers an efficient and robust new paradigm for real-time interactive multi-shot video generation, advancing AI-driven dynamic storytelling. Abstract: Multi-shot video generation is crucial for long narrative storytelling, yet current bidirectional architectures suffer from limited interactivity and high latency. We propose ShotStream, a novel causal multi-shot architecture that enables interactive storytelling and efficient on-the-fly frame generation. By reformulating the task as next-shot generation conditioned on historical context, ShotStream allows users to dynamically instruct ongoing narratives via streaming prompts. We achieve this by first fine-tuning a text-to-video model into a bidirectional next-shot generator, which is then distilled into a causal student via Distribution Matching Distillation. To overcome the challenges of inter-shot consistency and error accumulation inherent in autoregressive generation, we introduce two key innovations. First, a dual-cache memory mechanism preserves visual coherence: a global context cache retains conditional frames for inter-shot consistency, while a local context cache holds generated frames within the current shot for intra-shot consistency. And a RoPE discontinuity indicator is employed to explicitly distinguish the two caches to eliminate ambiguity. Second, to mitigate error accumulation, we propose a two-stage distillation strategy. This begins with intra-shot self-forcing conditioned on ground-truth historical shots and progressively extends to inter-shot self-forcing using self-generated histories, effectively bridging the train-test gap. Extensive experiments demonstrate that ShotStream generates coherent multi-shot videos with sub-second latency, achieving 16 FPS on a single GPU. It matches or exceeds the quality of slower bidirectional models, paving the way for real-time interactive storytelling. Training and inference code, as well as the models, are available on our