cs.CL

[1] When Consistency Becomes Bias: Interviewer Effects in Semi-Structured Clinical Interviews

Hasindri Watawana,Sergio Burdisso,Diego A. Moreno-Galván,Fernando Sánchez-Vega,A. Pastor López-Monroy,Petr Motlicek,Esaú Villatoro-Tello

Main category: cs.CL

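TL;DR: This paper shows that in automatic depression detection from doctor-patient conversations, models often reach high accuracy by exploiting the interviewer's fixed prompts rather than the patient's language, which introduces a systematic bias. The authors call for restricting the use of interviewer turns, focusing on participant language, and localizing decision evidence by time and speaker to improve interpretability and validity.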

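Details Motivation: Existing depression detection models perform strongly but lack interpretability: it is unclear whether predictions rest on patient language or on confounds such as interviewer prompts. Method: The authors analyze three public datasets (ANDROIDS, DAIC-WOZ, E-DAIC), test whether models rely on fixed prompts and positional information in interviewer turns, and compare model performance and the distribution of decision evidence when only participant utterances are used. Result: With interviewer turns, models can reach high classification accuracy from scripted prompts alone. Restricted to participant utterances, decision evidence is spread more broadly and depends more on genuine linguistic features; performance drops, but the results are more interpretable and more meaningful for generalization. Conclusion: Current evaluations carry a systematic bias that holds across datasets and is independent of model architecture. Interviewer prompts should be excluded, analysis should focus on participant language, and fine-grained attribution by time and speaker should be used to confirm that models learn genuine linguistic markers of depression. Abstract: Automatic depression detection from doctor-patient conversations has gained momentum thanks to the availability of public corpora and advances in language modeling. However, interpretability remains limited: strong performance is often reported without revealing what drives predictions. We analyze three datasets (ANDROIDS, DAIC-WOZ, and E-DAIC) and identify a systematic bias from interviewer prompts in semi-structured interviews. Models trained on interviewer turns exploit fixed prompts and positions to distinguish depressed from control subjects, often achieving high classification scores without using participant language. Restricting models to participant utterances distributes decision evidence more broadly and reflects genuine linguistic cues. While semi-structured protocols ensure consistency, including interviewer prompts inflates performance by leveraging script artifacts. Our results highlight a cross-dataset, architecture-agnostic bias and emphasize the need for analyses that localize decision evidence by time and speaker to ensure models learn from participants' language.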

[2] Demystifying When Pruning Works via Representation Hierarchies

Shwai He,Guoheng Sun,Haichao Zhang,Yun Fu,Ang Li

Main category: cs.CL

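TL;DR: This paper analyzes how network pruning affects language models from a representation-hierarchy perspective. Pruning is relatively robust in the embedding and logit spaces, but the nonlinear logit-to-probability transformation amplifies perturbations, which significantly degrades generative tasks; non-generative tasks are less affected because they rely on a stable probability subspace and the embedding space.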

Details Motivation: To explain why network pruning works on non-generative tasks yet often fails on generative ones. Method: Decompose the internal computation of language models into a representation hierarchy (embedding, logit, probability) and analyze how pruning affects each space and how perturbations propagate. Result: The embedding and logit spaces are robust to pruning-induced perturbations, but the nonlinear logit-to-probability transformation amplifies them, and they accumulate across time steps, degrading generation quality; the stability of the probability subspace underpins the effectiveness of pruning on non-generative tasks. Conclusion: The effect of pruning depends strongly on the task type and on the properties of the representation space the task relies on, so the pruning strategy should be chosen to suit the task. Abstract: Network pruning, which removes less important parameters or architectures, is often expected to improve efficiency while preserving performance. However, this expectation does not consistently hold across language tasks: pruned models can perform well on non-generative tasks but frequently fail in generative settings. To understand this discrepancy, we analyze network pruning from a representation-hierarchy perspective, decomposing the internal computation of language models into three sequential spaces: embedding (hidden representations), logit (pre-softmax outputs), and probability (post-softmax distributions). We find that representations in the embedding and logit spaces are largely robust to pruning-induced perturbations. However, the nonlinear transformation from logits to probabilities amplifies these deviations, which accumulate across time steps and lead to substantial degradation during generation. In contrast, the stability of the categorical-token probability subspace, together with the robustness of the embedding space, supports the effectiveness of pruning for non-generative tasks such as retrieval and multiple-choice selection. Our analysis disentangles the effects of pruning across tasks and provides practical guidance for its application. Code is available at https://github.com/CASE-Lab-UMD/Pruning-on-Representations
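
The amplification described in this entry can be illustrated with a toy example. The sketch below is not the authors' analysis: the logit values are made up, and the relative L2 change is just one convenient way to compare a perturbation before and after softmax.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def relative_change(before, after):
    """L2 norm of the difference, divided by the L2 norm of the original."""
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(after, before)))
    base = math.sqrt(sum(b ** 2 for b in before))
    return diff / base

# Hypothetical logits for a dense model and a slightly perturbed pruned model.
dense_logits = [5.0, 4.8, 1.0]
pruned_logits = [4.8, 5.0, 1.0]

dense_probs = softmax(dense_logits)
pruned_probs = softmax(pruned_logits)

logit_shift = relative_change(dense_logits, pruned_logits)
prob_shift = relative_change(dense_probs, pruned_probs)

print(f"relative logit change:       {logit_shift:.3f}")
print(f"relative probability change: {prob_shift:.3f}")
print("top token changed:",
      dense_probs.index(max(dense_probs)) != pruned_probs.index(max(pruned_probs)))
```

In this example the two leading logits are close, so a shift of a few percent in logit space flips the top token; with well-separated logits the same perturbation would matter far less.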

[3] Fine-Tuning A Large Language Model for Systematic Review Screening

Kweku Yamoah,Noah Schroeder,Emmanuel Dorley,Neha Rani,Caleb Schutz

Main category: cs.CL

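TL;DR: This paper explores fine-tuning a large language model (LLM) to make title and abstract screening in systematic reviews more efficient; the fine-tuned model substantially outperforms the base model across metrics.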

Details Motivation: Traditional systematic reviews are time-consuming and labor-intensive, especially at the title and abstract screening stage; existing prompting-based LLM approaches give inconsistent results, which the authors attribute mainly to insufficient context. Method: Fine-tune a 1.2-billion-parameter open-weight LLM on more than 8,500 human-rated titles and abstracts from a systematic review, framed as binary classification (include or exclude). Result: The weighted F1 score of the fine-tuned model improves by 80.79% over the base model; on the full dataset of 8,277 studies, it reaches 86.40% agreement with the human coder, a 91.18% true positive rate, an 86.38% true negative rate, and identical results across multiple inference runs. Conclusion: Fine-tuning a small LLM for a specific task can substantially improve initial screening in systematic reviews and shows practical promise. Abstract: Systematic reviews traditionally have taken considerable amounts of human time and energy to complete, in part due to the extensive number of titles and abstracts that must be reviewed for potential inclusion. Recently, researchers have begun to explore how to use large language models (LLMs) to make this process more efficient. However, research to date has shown inconsistent results. We posit this is because prompting alone may not provide sufficient context for the model(s) to perform well. In this study, we fine-tune a small 1.2 billion parameter open-weight LLM specifically for study screening in the context of a systematic review in which humans rated more than 8500 titles and abstracts for potential inclusion. Our results showed strong performance improvements from the fine-tuned model, with the weighted F1 score improving 80.79% compared to the base model. When run on the full dataset of 8,277 studies, the fine-tuned model had 86.40% agreement with the human coder, a 91.18% true positive rate, an 86.38% true negative rate, and perfect agreement across multiple inference runs. Taken together, our results show that there is promise for fine-tuning LLMs for title and abstract screening in large-scale systematic reviews.
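
For readers who want to see how the three reported screening rates relate, here is a minimal sketch. The function name and the five example labels are hypothetical, and it assumes both classes occur among the human labels.

```python
def screening_metrics(human_labels, model_labels):
    """Return agreement, true positive rate, and true negative rate.

    Labels are 1 (include) or 0 (exclude); the human label is the reference.
    Assumes at least one included and one excluded record.
    """
    pairs = list(zip(human_labels, model_labels))
    tp = sum(1 for h, m in pairs if h == 1 and m == 1)
    tn = sum(1 for h, m in pairs if h == 0 and m == 0)
    fp = sum(1 for h, m in pairs if h == 0 and m == 1)
    fn = sum(1 for h, m in pairs if h == 1 and m == 0)
    return {
        "agreement": (tp + tn) / len(pairs),
        "true_positive_rate": tp / (tp + fn),
        "true_negative_rate": tn / (tn + fp),
    }

# Made-up labels for five records, for illustration only.
human = [1, 1, 0, 0, 0]
model = [1, 0, 0, 0, 1]
metrics = screening_metrics(human, model)
print(metrics)
```

Overall agreement is a prevalence-weighted average of the two rates, so when excluded studies dominate, agreement sits close to the true negative rate, as with the reported 86.40% and 86.38%.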

[4] Evaluating Fine-Tuned LLM Model For Medical Transcription With Small Low-Resource Languages Validated Dataset

Mohammed Nowshad Ruhani Chowdhury,Mohammed Nowaz Rabbani Chowdhury,Sakari Lukkarinen

Main category: cs.CL

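TL;DR: This study fine-tunes LLaMA 3.1-8B on a small, validated corpus of simulated Finnish clinical conversations to test its effectiveness for Finnish medical transcription. The approach performs well on semantic similarity and supports the feasibility of building privacy-oriented, domain-specific LLMs in low-resource language settings.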

Details Motivation: To address the heavy administrative burden of electronic health records (EHRs), which contributes to physician burnout, in low-resource languages such as Finnish, and to improve clinical documentation quality in support of patient safety and continuity of care. Method: Fine-tune LLaMA 3.1-8B on a corpus of simulated clinical conversations produced by students at Metropolia University of Applied Sciences, using controlled preprocessing and optimization, and evaluate with sevenfold cross-validation. Result: BLEU = 0.1214, ROUGE-L = 0.4982, BERTScore F1 = 0.8230, indicating low n-gram overlap but high semantic similarity. Conclusion: Fine-tuning LLaMA 3.1-8B can effectively support transcription of spoken Finnish medical discourse; the study supports the feasibility of a privacy-oriented, domain-adapted, lightweight medical LLM and gives directions for future work. Abstract: Clinical documentation is a critical factor for patient safety, diagnosis, and continuity of care. The administrative burden of EHRs is a significant factor in physician burnout. This is a critical issue for low-resource languages, including Finnish. This study aims to investigate the effectiveness of a domain-aligned natural language processing (NLP) large language model for medical transcription in Finnish by fine-tuning LLaMA 3.1-8B on a small validated corpus of simulated clinical conversations by students at Metropolia University of Applied Sciences. The fine-tuning process for medical transcription used a controlled preprocessing and optimization approach. The fine-tuning effectiveness was evaluated by sevenfold cross-validation. The evaluation metrics for fine-tuned LLaMA 3.1-8B were BLEU = 0.1214, ROUGE-L = 0.4982, and BERTScore F1 = 0.8230. The results showed a low n-gram overlap but a strong semantic similarity with reference transcripts. This study indicates that fine-tuning can be an effective approach for translation of medical discourse in spoken Finnish and supports the feasibility of fine-tuning a privacy-oriented domain-specific large language model for clinical documentation in Finnish. It also provides directions for future work.

[5] Enhancing Structured Meaning Representations with Aspect Classification

Claire Benét Post,Paul Bontempo,August Milliken,Alvin Po-Chun Chen,Nicholas Derby,Saksham Khatwani,Sumeyye Nabieva,Karthik Sairam,Alexis Palmer

Main category: cs.CL

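TL;DR: This paper introduces a new dataset of English sentences annotated with UMR aspect labels over AMR graphs that lack the aspect feature, aiming to advance systems that automatically predict aspectual information.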

Details Motivation: Aspect, which describes the internal temporal structure of events, is an important feature of semantic representations, yet it is sparsely annotated in existing meaning representation frameworks, which has limited both manual annotation and the development of automatic prediction systems. Method: Build a new English dataset that adds UMR aspect labels to AMR graphs, design an annotation scheme based on the UMR aspect lattice with a multi-step adjudication process to ensure annotation quality, and run baseline experiments with three modeling approaches. Result: Provides initial benchmark results for automatic UMR aspect prediction, demonstrates the dataset's utility for automation, and lays a foundation for integrating aspect into semantic representations more broadly. Conclusion: The work fills a gap in UMR aspect annotation resources, supporting high-quality manual annotation and providing data and a technical starting point for later automatic aspect prediction and richer semantic representations. Abstract: To fully capture the meaning of a sentence, semantic representations should encode aspect, which describes the internal temporal structure of events. In graph-based meaning representation frameworks such as Uniform Meaning Representations (UMR), aspect lets one know how events unfold over time, including distinctions such as states, activities, and completed events. Despite its importance, aspect remains sparsely annotated across semantic meaning representation frameworks. This has, in turn, hindered not only current manual annotation, but also the development of automatic systems capable of predicting aspectual information. In this paper, we introduce a new dataset of English sentences annotated with UMR aspect labels over Abstract Meaning Representation (AMR) graphs that lack the feature. We describe the annotation scheme and guidelines used to label eventive predicates according to the UMR aspect lattice, as well as the annotation pipeline used to ensure consistency and quality across annotators through a multi-step adjudication process. To demonstrate the utility of our dataset for future automation, we present baseline experiments using three modeling approaches. Our results establish initial benchmarks for automatic UMR aspect prediction and provide a foundation for integrating aspect into semantic meaning representations more broadly.

[6] Synthetic Rewriting as a Quality Multiplier: Evidence from Portuguese Continued Pretraining

Thales Sales Almeida,Rodrigo Nogueira,Hélio Pedrini

Main category: cs.CL

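TL;DR: This paper studies how synthetic rewriting interacts with source data quality in Portuguese continued pretraining. Rewriting high-quality data substantially improves model performance, while rewriting low-quality data helps little, indicating that synthetic rewriting acts mainly as a quality multiplier rather than a substitute for data curation, and that the effect depends on model scale.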

Details Motivation: Most work on synthetic data generation focuses on English and does not systematically control for source data quality; this paper addresses the interaction between source quality and synthetic rewriting for Portuguese. Method: Starting from ClassiCC-PT, a Portuguese corpus annotated with STEM and Educational quality scores, construct two 10B-token subsets of high and low quality; rewrite each subset into four styles with a 7B instruction-tuned model, producing about 40B tokens of synthetic data per condition; run continued pretraining on English-centric base models at 1.1B and 7B parameters; evaluate on PoETa V2, a 44-task Portuguese benchmark. Result: At 7B, rewriting high-quality data yields a +3.4 NPM gain over the same data unmodified, while rewriting low-quality data yields only +0.5 NPM; at 1.1B the interaction is weaker, with unmodified low-quality data performing comparably to rewritten high-quality data. Conclusion: Synthetic rewriting acts mainly as a quality multiplier and cannot replace data curation; its benefit is scale-dependent and appears stronger at the larger of the two scales tested. Abstract: Synthetic data generation through document rewriting has emerged as a promising technique for improving language model pretraining, yet most studies focus on English and do not systematically control for the quality of the source data being rewritten. We present a controlled study of how synthetic rewriting interacts with source data quality in the context of Portuguese continued pretraining. Starting from ClassiCC-PT, a Portuguese corpus annotated with STEM and Educational quality scores, we construct two 10B-token subsets at different quality levels and rewrite each into four styles using a 7B instruction-tuned model, producing approximately 40B tokens of synthetic data per condition. We train two English-centric base models (1.1B and 7B parameters) on each condition and evaluate on PoETa V2, a comprehensive 44-task Portuguese benchmark. At the 7B scale, rewriting high-quality data yields a +3.4 NPM gain over the same data unmodified, while rewriting low-quality data provides only +0.5 NPM. At the 1.1B scale, this interaction is weaker, with unmodified low-quality data performing comparably to rewritten high-quality data. Our results demonstrate that synthetic rewriting acts primarily as a quality multiplier rather than a substitute for data curation, and that this effect is scale-dependent.

[7] Prune as You Generate: Online Rollout Pruning for Faster and Better RLVR

Haobo Xu,Sirui Chen,Ruizhong Qiu,Yuchen Yan,Chen Luo,Monica Cheng,Jingrui He,Hanghang Tong

Main category: cs.CL

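TL;DR: This paper proposes arrol, which prunes rollouts online and balances correctness among the survivors to accelerate Reinforcement Learning with Verifiable Rewards (RLVR) training, improving both accuracy and training speed.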

Details Motivation: Existing RLVR methods such as GRPO and DAPO are computationally expensive because they sample many rollouts per prompt, and their learning signal is weak because the relative advantage is often sparse. Method: arrol is an online rollout pruning method: it trains a lightweight quality head on the fly to predict the success probability of partial rollouts and prune early, and uses that head to weigh candidates at test time; at the system level, it prunes inside the inference engine and re-batches the remaining rollouts. Result: Across GRPO and DAPO on Qwen-3 and LLaMA-3.2 models (1B-8B), arrol improves average accuracy by +2.30 to +2.99, speeds up training by up to 1.7x, and yields up to +8.33 additional average accuracy in test-time scaling. Conclusion: arrol mitigates both the high compute cost and the weak learning signal in RLVR, improving training efficiency and inference performance together. Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capabilities of Large Language Models (LLMs). However, methods such as GRPO and DAPO suffer from substantial computational cost, since they rely on sampling many rollouts for each prompt. Moreover, in RLVR the relative advantage is often sparse: many samples become nearly all-correct or all-incorrect, yielding low within-group reward variance and thus weak learning signals. In this paper, we introduce arrol (Accelerating RLVR via online Rollout Pruning), an online rollout pruning method that prunes rollouts during generation while explicitly steering the surviving ones toward a more correctness-balanced mix to enhance learning signals. Specifically, arrol trains a lightweight quality head on-the-fly to predict the success probability of partial rollouts and uses it to make early pruning decisions. The learned quality head can further weigh candidates to improve inference accuracy during test-time scaling. To improve efficiency, we present a system design that prunes rollouts inside the inference engine and re-batches the remaining ones for log-probability computation and policy updates. Across GRPO and DAPO on Qwen-3 and LLaMA-3.2 models (1B-8B), arrol improves average accuracy by +2.30 to +2.99 while achieving up to 1.7x training speedup, and yielding up to +8.33 additional gains in average accuracy in test-time scaling. The code is available at https://github.com/Hsu1023/ARRoL.
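
The point about low within-group reward variance can be made concrete with the group-normalized advantage commonly used in GRPO-style training. This is a sketch of that general idea, not arrol's code; the function name and the `eps` stabilizer are assumptions, and actual implementations differ in details.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Binary verifiable rewards for four rollouts of the same prompt.
all_correct = [1.0, 1.0, 1.0, 1.0]
balanced = [1.0, 1.0, 0.0, 0.0]

all_correct_adv = group_relative_advantages(all_correct)
balanced_adv = group_relative_advantages(balanced)

print(all_correct_adv)  # every advantage is zero, so the group gives no gradient signal
print(balanced_adv)     # correct and incorrect rollouts get opposite-signed advantages
```

A group that is all-correct or all-incorrect contributes nothing to the update under this scheme, which is why keeping the surviving rollouts correctness-balanced preserves the learning signal.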

[8] LogSigma at SemEval-2026 Task 3: Uncertainty-Weighted Multitask Learning for Dimensional Aspect-Based Sentiment Analysis

Baraa Hikal,Jonas Becker,Bela Gipp

Main category: cs.CL

TL;DR: LogSigma is a system for SemEval-2026 Task 3 (DimABSA), predicting continuous Valence and Arousal scores using learned homoscedastic uncertainty to balance regression objectives, achieving first place on five datasets.

Details Motivation: Traditional ABSA predicts discrete sentiment labels, but DimABSA requires continuous Valence and Arousal scores; prediction difficulty varies across languages and domains, necessitating adaptive task balancing. Method: LogSigma employs learned homoscedastic uncertainty with task-specific log-variance parameters to dynamically balance Valence and Arousal regression objectives, combined with language-specific encoders and multi-seed ensembling. Result: LogSigma achieves 1st place on five datasets across both tracks of SemEval-2026 Task 3; learned variance weights differ substantially across languages (e.g., 0.66x for German, 2.18x for English), confirming language-dependent optimal balancing. Conclusion: Optimal balancing of Valence and Arousal regression tasks is language-dependent and must be learned rather than set a priori; LogSigma’s uncertainty-aware approach effectively addresses cross-lingual DimABSA challenges. Abstract: This paper describes LogSigma, our system for SemEval-2026 Task 3: Dimensional Aspect-Based Sentiment Analysis (DimABSA). Unlike traditional Aspect-Based Sentiment Analysis (ABSA), which predicts discrete sentiment labels, DimABSA requires predicting continuous Valence and Arousal (VA) scores on a 1-9 scale. A central challenge is that Valence and Arousal differ in prediction difficulty across languages and domains. We address this using learned homoscedastic uncertainty, where the model learns task-specific log-variance parameters to automatically balance each regression objective during training. Combined with language-specific encoders and multi-seed ensembling, LogSigma achieves 1st place on five datasets across both tracks. The learned variance weights vary substantially across languages due to differing Valence-Arousal difficulty profiles, from 0.66x for German to 2.18x for English, demonstrating that optimal task balancing is language-dependent and cannot be determined a priori.
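
A minimal sketch of uncertainty-weighted multitask loss follows, using the widely cited homoscedastic-uncertainty form for regression (each task contributes 0.5 * exp(-s) * loss + 0.5 * s, with s the log-variance). Whether LogSigma uses exactly this form is an assumption, and the loss values are made up.

```python
import math

def uncertainty_weighted_loss(task_losses, log_variances):
    """Combine per-task losses with log-variance weights.

    Each task contributes 0.5 * exp(-s) * loss + 0.5 * s, where s = log(sigma^2).
    The second term stops the model from inflating s to ignore a task.
    """
    return sum(
        0.5 * math.exp(-s) * loss + 0.5 * s
        for loss, s in zip(task_losses, log_variances)
    )

# Hypothetical Valence and Arousal regression losses.
losses = [2.0, 0.5]

equal_weighting = uncertainty_weighted_loss(losses, [0.0, 0.0])

# For a fixed loss L, the objective above is minimized at s = log(L).
best_log_variances = [math.log(loss) for loss in losses]
learned_weighting = uncertainty_weighted_loss(losses, best_log_variances)
task_weights = [math.exp(-s) for s in best_log_variances]

print(equal_weighting, learned_weighting, task_weights)
```

In training, the log-variances would be learnable parameters updated by gradient descent alongside the model; the closed-form optimum is used here only to show that the harder task ends up with the smaller weight.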

A. Feder Cooper,Mark A. Lemley,Christopher De Sa,Lea Duesterwald,Allison Casasola,Jamie Hayes,Katherine Lee,Daniel E. Ho,Percy Liang

Main category: cs.CL

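TL;DR: This paper proposes decoding-constrained beam search, an efficient and deterministic way to estimate near-verbatim memorization extraction risk in large language models. It greatly reduces compute cost and surfaces risk patterns that verbatim-only methods miss.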

Details Motivation: Standard greedy-decoding methods miss how extraction risk varies across sequences. Probabilistic extraction addresses this but is tractable only for verbatim memorization, missing near-verbatim instances that pose similar privacy and copyright risks, and Monte Carlo estimation of near-verbatim extraction risk is very expensive. Method: Introduce decoding-constrained beam search, which gives deterministic lower bounds on near-verbatim extraction risk at low compute cost. Result: The method costs about as much as 20 Monte Carlo samples per sequence, whereas reliable MC estimation can require about 100,000; experiments show many more extractable sequences, substantially larger per-sequence extraction mass, and patterns in how near-verbatim extraction risk varies across model sizes and types of text. Conclusion: Decoding-constrained beam search is an efficient, deterministic, and practical tool for quantifying near-verbatim extraction risk, filling a gap in existing methods for assessing privacy and copyright risk. Abstract: Recent work shows that standard greedy-decoding extraction methods for quantifying memorization in LLMs miss how extraction risk varies across sequences. Probabilistic extraction -- computing the probability of generating a target suffix given a prefix under a decoding scheme -- addresses this, but is tractable only for verbatim memorization, missing near-verbatim instances that pose similar privacy and copyright risks. Quantifying near-verbatim extraction risk is expensive: the set of near-verbatim suffixes is combinatorially large, and reliable Monte Carlo (MC) estimation can require ~100,000 samples per sequence. To mitigate this cost, we introduce decoding-constrained beam search, which yields deterministic lower bounds on near-verbatim extraction risk at a cost comparable to ~20 MC samples per sequence. Across experiments, our approach surfaces information invisible to verbatim methods: many more extractable sequences, substantially larger per-sequence extraction mass, and patterns in how near-verbatim extraction risk manifests across model sizes and types of text.
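
The sketch below shows only why summing over a handful of found suffixes gives a deterministic lower bound; it does not reproduce the paper's beam search or its definition of near-verbatim. All names, tokens, and probabilities are hypothetical.

```python
import math

def sequence_probability(token_log_probs):
    """Probability of a whole suffix: the product of its per-token probabilities."""
    return math.exp(sum(token_log_probs))

def near_verbatim_lower_bound(found_suffixes):
    """Sum the probabilities of the distinct suffixes a search has found.

    Distinct suffixes are mutually exclusive outcomes, so the sum over any
    subset of the near-verbatim set cannot exceed that set's total mass.
    """
    return sum(sequence_probability(lp) for lp in found_suffixes.values())

# Hypothetical per-token log-probabilities given the prefix, under the decoding scheme.
found = {
    ("quick", "brown"): [math.log(0.9), math.log(0.8)],  # verbatim target suffix
    ("quick", "brwon"): [math.log(0.9), math.log(0.1)],  # one-token variant
}

verbatim_mass = sequence_probability(found[("quick", "brown")])
lower_bound = near_verbatim_lower_bound(found)

print(f"verbatim mass: {verbatim_mass:.2f}, near-verbatim lower bound: {lower_bound:.2f}")
```

Because the bound only ever adds probability from suffixes that were actually scored, it never overstates the risk, and it needs no sampling.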

[10] Toward domain-specific machine translation and quality estimation systems

Javad Pourmostafa Roshan Sharami

Main category: cs.CL

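TL;DR: This dissertation studies how to adapt machine translation (MT) and quality estimation (QE) systems to specialized domains through data-focused methods: similarity-based data selection, staged QE training, aligned subword tokenization and vocabulary, and QE-guided in-context learning for large language models. The methods improve performance in cross-domain, low-resource, and zero-shot settings.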

Details Motivation: MT and QE perform well in general domains but degrade markedly under domain mismatch, so efficient adaptation methods for specialized domains are needed. Method: Four data-focused methods: (1) similarity-based data selection for MT; (2) a staged QE training pipeline combining domain adaptation with lightweight data augmentation; (3) an analysis of how subword tokenization and vocabulary alignment affect fine-tuning; (4) QE-guided in-context learning for large language models without parameter updates. Result: Small, targeted in-domain subsets outperform much larger generic datasets; the QE pipeline improves performance across languages, domains, and zero-shot settings; aligned tokenization-vocabulary setups give more stable training and better translation quality; QE-guided in-context learning outperforms standard retrieval and supports a reference-free setup. Conclusion: Domain adaptation depends on data selection, representation, and efficient adaptation strategies; the dissertation provides methods for building robust domain-specific MT and QE systems. Abstract: Machine Translation (MT) and Quality Estimation (QE) perform well in general domains but degrade under domain mismatch. This dissertation studies how to adapt MT and QE systems to specialized domains through a set of data-focused contributions. Chapter 2 presents a similarity-based data selection method for MT. Small, targeted in-domain subsets outperform much larger generic datasets and reach strong translation quality at lower computational cost. Chapter 3 introduces a staged QE training pipeline that combines domain adaptation with lightweight data augmentation. The method improves performance across domains, languages, and resource settings, including zero-shot and cross-lingual cases. Chapter 4 studies the role of subword tokenization and vocabulary in fine-tuning. Aligned tokenization-vocabulary setups lead to stable training and better translation quality, while mismatched configurations reduce performance. Chapter 5 proposes a QE-guided in-context learning method for large language models. QE models select examples that improve translation quality without parameter updates and outperform standard retrieval methods. The approach also supports a reference-free setup, reducing reliance on a single reference set. These results show that domain adaptation depends on data selection, representation, and efficient adaptation strategies. The dissertation provides methods for building MT and QE systems that perform reliably in domain-specific settings.

[11] LLM-Driven Reasoning for Constraint-Aware Feature Selection in Industrial Systems

Yuhang Zhou,Zhuokai Zhao,Ke Li,Spilios Evmorfos,Gökalp Demirci,Mingyi Wang,Qiao Liu,Qifei Wang,Serena Li,Weiwei Li,Tingting Wang,Mingze Gao,Gedi Zhou,Abhishek Kumar,Xiangjun Fan,Lizhu Zhang,Jiayi Liu

Main category: cs.CL

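TL;DR: This paper proposes MoFA, a framework that uses a large language model for interpretable, constraint-aware sequential feature selection based on semantic and quantitative feature information. It targets industrial settings where labeled data are scarce and is evaluated in three real-world applications.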

Details Motivation: Traditional feature selection relies on labeled data and statistical heuristics, which makes it hard to apply in production environments where labeled data are limited and multiple operational constraints must be met. Method: Model Feature Agent (MoFA), a model-driven framework that places feature definitions, importance scores, correlations, and metadata into structured prompts and selects features sequentially through interpretable, constraint-aware reasoning. Result: In real industrial settings, MoFA (1) improves prediction accuracy while reducing feature group complexity; (2) discovers high-order interaction terms that yield substantial engagement gains in online experiments; and (3) selects compact, high-value feature subsets that improve both accuracy and inference efficiency. Conclusion: LLM-driven, reasoning-based feature selection is practical and effective in real production systems and offers a new approach for label-scarce settings. Abstract: Feature selection is a crucial step in large-scale industrial machine learning systems, directly affecting model accuracy, efficiency, and maintainability. Traditional feature selection methods rely on labeled data and statistical heuristics, making them difficult to apply in production environments where labeled data are limited and multiple operational constraints must be satisfied. To address this, we propose Model Feature Agent (MoFA), a model-driven framework that performs sequential, reasoning-based feature selection using both semantic and quantitative feature information. MoFA incorporates feature definitions, importance scores, correlations, and metadata (e.g., feature groups or types) into structured prompts and selects features through interpretable, constraint-aware reasoning. We evaluate MoFA in three real-world industrial applications: (1) True Interest and Time-Worthiness Prediction, where it improves accuracy while reducing feature group complexity, (2) Value Model Enhancement, where it discovers high-order interaction terms that yield substantial engagement gains in online experiments, and (3) Notification Behavior Prediction, where it selects compact, high-value feature subsets that improve both model accuracy and inference efficiency. Together, these results demonstrate the practicality and effectiveness of LLM-based reasoning for feature selection in real production systems.

[12] Exons-Detect: Identifying and Amplifying Exonic Tokens via Hidden-State Discrepancy for Robust AI-Generated Text Detection

Xiaowei Zhu,Yubing Ren,Fang Fang,Shi Wang,Yanan Cao,Li Guo

Main category: cs.CL

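TL;DR: This paper proposes Exons-Detect, a training-free method for detecting AI-generated text that uses exon-aware token reweighting to improve robustness and interpretability, reaching state-of-the-art detection performance.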

Details Motivation: Existing training-free detectors assume uniform token contributions, which makes them less robust to short texts and localized modifications; given the societal risks around misinformation, authorship ambiguity, and intellectual property, more robust and reliable detection is needed. Method: Exons-Detect identifies and amplifies informative exonic tokens by measuring hidden-state discrepancy under a dual-model setting, weights tokens by importance, and computes an interpretable translation score from the result. Result: A 2.2% relative improvement in average AUROC over the strongest prior baseline on DetectRL, with strong robustness to adversarial attacks and varying input lengths. Conclusion: Exons-Detect shows that modeling token heterogeneity matters for training-free detection and offers a robust, interpretable, training-free approach to identifying AI-generated text. Abstract: The rapid advancement of large language models has increasingly blurred the boundary between human-written and AI-generated text, raising societal risks such as misinformation dissemination, authorship ambiguity, and threats to intellectual property rights. These concerns highlight the urgent need for effective and reliable detection methods. While existing training-free approaches often achieve strong performance by aggregating token-level signals into a global score, they typically assume uniform token contributions, making them less robust under short sequences or localized token modifications. To address these limitations, we propose Exons-Detect, a training-free method for AI-generated text detection based on an exon-aware token reweighting perspective. Exons-Detect identifies and amplifies informative exonic tokens by measuring hidden-state discrepancy under a dual-model setting, and computes an interpretable translation score from the resulting importance-weighted token sequence. Empirical evaluations demonstrate that Exons-Detect achieves state-of-the-art detection performance and exhibits strong robustness to adversarial attacks and varying input lengths. In particular, it attains a 2.2% relative improvement in average AUROC over the strongest prior baseline on DetectRL.

[13] Imperative Interference: Social Register Shapes Instruction Topology in Large Language Models

Tony Mason

Main category: cs.CL

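TL;DR: This paper finds that models treat instructions as social acts rather than technical specifications: the force of an instruction depends on the social register of the language, such as the imperative mood. Rewriting imperatives as declaratives substantially reduces cross-linguistic differences in instruction following, which suggests that constitutional AI principles written in the imperative mood may produce language-dependent alignment.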

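Details Motivation: To explain why multilingual LLMs show opposite interaction topologies for semantically identical instructions across languages (cooperative in English, competitive in Spanish), and to test whether social register, such as mood and politeness, affects how models interpret and follow instructions. Method: Instruction-level ablation experiments across four languages and four models, using 22 hand-authored probes against a production system prompt decomposed into 56 blocks; imperative instructions are systematically rewritten as declaratives, and a permutation test assesses the change in cross-linguistic variance along with spillover effects. Result: Rewriting a single instruction block from imperative to declarative reduces cross-linguistic variance by 81% (p = 0.029); rewriting three of eleven imperative blocks shifts Spanish instruction topology from competitive to cooperative and affects the unrewritten blocks; models treat "NEVER do X" and "X: disabled" as pragmatically different. Conclusion: Models process instructions according to language-specific social register at inference time, and this plausibly arises during training; constitutional AI principles authored in the imperative mood may therefore create language-dependent alignment, which the author states as a testable prediction. Abstract: System prompt instructions that cooperate in English compete in Spanish, with the same semantic content, but opposite interaction topology. We present instruction-level ablation experiments across four languages and four models showing that this topology inversion is mediated by social register: the imperative mood carries different obligatory force across speech communities, and models trained on multilingual data have learned these conventions. Declarative rewriting of a single instruction block reduces cross-linguistic variance by 81% (p = 0.029, permutation test). Rewriting three of eleven imperative blocks shifts Spanish instruction topology from competitive to cooperative, with spillover effects on unrewritten blocks. These findings suggest that models process instructions as social acts, not technical specifications: "NEVER do X" is an exercise of authority whose force is language-dependent, while "X: disabled" is a factual description that transfers across languages. If register mediates instruction-following at inference time, it plausibly does so during training. We state this as a testable prediction: constitutional AI principles authored in imperative mood may create language-dependent alignment. Corpus: 22 hand-authored probes against a production system prompt decomposed into 56 blocks.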
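
The reported p-value comes from a permutation test. As a generic illustration of how such a test works, the sketch below runs an exact two-sample permutation test on the difference in group means. The paper's test statistic concerns cross-linguistic variance, so this is an analogy rather than a reproduction, and the scores are invented.

```python
from itertools import combinations

def exact_permutation_p_value(group_a, group_b):
    """Two-sided exact permutation test on the absolute difference in group means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    total = sum(pooled)

    def mean_gap(sum_a):
        return abs(sum_a / n_a - (total - sum_a) / n_b)

    observed = mean_gap(sum(group_a))
    # Enumerate every way of relabeling n_a of the pooled values as group A.
    splits = list(combinations(range(len(pooled)), n_a))
    as_extreme = sum(
        1 for idx in splits
        if mean_gap(sum(pooled[i] for i in idx)) >= observed - 1e-12
    )
    return as_extreme / len(splits)

# Hypothetical scores under imperative and declarative phrasing.
imperative = [7.0, 8.0, 9.0]
declarative = [1.0, 2.0, 3.0]
p_value = exact_permutation_p_value(imperative, declarative)
print(p_value)
```

With three values per group there are only 20 relabelings, so even perfectly separated groups cannot give a two-sided p-value below 0.1; small designs put a floor on how small a permutation p-value can be.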

[14] Approaches to Analysing Historical Newspapers Using LLMs

Filip Dobranić,Tina Munda,Oliver Pejić,Vojko Gorjanc,Uroš Šmajdek,David Bordon,Jakob Lenardič,Tjaša Konovšek,Kristina Pahor de Maiti Tekavčič,Ciril Bohak,Darja Fišer

Main category: cs.CL

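TL;DR: This study combines topic modelling, LLM-based aspect-level sentiment analysis, named-entity graph visualisation, and qualitative discourse analysis to examine how the Slovene historical newspapers Slovenec and Slovenski narod represented collective identities, political orientations, and national belonging at the turn of the twentieth century. It uses BERTopic to identify thematic differences, selects the Slovene-adapted GaMS3-12B-Instruct model for sentiment analysis of OCR-degraded text, and builds entity-relation graphs, combining quantitative network analysis with critical discourse analysis to trace the intertwined development of historical political and socio-economic identities.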

Details Motivation: To examine how collective identities, political orientations, and national belonging were represented in Slovene historical newspapers at the turn of the twentieth century, to handle the computational challenges of OCR-degraded text, and to bridge scalable computational methods and critical interpretation in the digital humanities. Method: BERTopic topic modelling; an evaluation of four instruction-following LLMs for sentiment classification, with the Slovene-adapted GaMS3-12B-Instruct selected; NER-based entity-relation graphs; and a mixed-methods analysis combining quantitative network analysis with critical discourse analysis. Result: The two newspapers share concerns but differ clearly in ideology (conservative-Catholic versus liberal-progressive); GaMS3-12B-Instruct performs better on neutral sentiment than on positive or negative sentiment; some groups appear mainly in neutral descriptive contexts and others more often in evaluative or conflict-related discourse; the entity graphs show how collective identities relate to places and how historical identities intertwine. Conclusion: Combining scalable computational methods with critical humanistic interpretation effectively supports in-depth digital humanities research on noisy, low-quality historical newspaper data. Abstract: This study presents a computational analysis of the Slovene historical newspapers Slovenec and Slovenski narod from the sPeriodika corpus, combining topic modelling, large language model (LLM)-based aspect-level sentiment analysis, entity-graph visualisation, and qualitative discourse analysis to examine how collective identities, political orientations, and national belonging were represented in public discourse at the turn of the twentieth century. Using BERTopic, we identify major thematic patterns and show both shared concerns and clear ideological differences between the two newspapers, reflecting their conservative-Catholic and liberal-progressive orientations. We further evaluate four instruction-following LLMs for targeted sentiment classification in OCR-degraded historical Slovene and select the Slovene-adapted GaMS3-12B-Instruct model as the most suitable for large-scale application, while also documenting important limitations, particularly its stronger performance on neutral sentiment than on positive or negative sentiment. Applied at dataset scale, the model reveals meaningful variation in the portrayal of collective identities, with some groups appearing predominantly in neutral descriptive contexts and others more often in evaluative or conflict-related discourse. We then create NER graphs to explore the relationships between collective identities and places. We apply a mixed methods approach to analyse the named entity graphs, combining quantitative network analysis with critical discourse analysis. The investigation focuses on the emergence and development of intertwined historical political and socionomic identities. Overall, the study demonstrates the value of combining scalable computational methods with critical interpretation to support digital humanities research on noisy historical newspaper data.

[15] Closing the Confidence-Faithfulness Gap in Large Language Models

Miranda Muqing Miao,Lyle Ungar

Main category: cs.CL

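TL;DR: Using mechanistic interpretability, this paper finds that in large language models (LLMs) verbalized confidence and calibration are each encoded linearly but are orthogonal to one another. Reasoning disrupts the verbalized-confidence direction, an effect the authors call the "Reasoning Contamination Effect", and a two-stage adaptive steering method built on this finding substantially improves calibration alignment.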

Details Motivation: LLMs often verbalize confidence that is detached from their actual accuracy, and the geometric mechanism behind this is poorly understood. Method: Linear probes and contrastive activation addition (CAA) steering applied to three open-weight models and four datasets, plus a two-stage adaptive steering pipeline that aligns verbalized confidence with the model's internal accuracy estimate. Result: Calibration and verbalized confidence signals are encoded linearly but orthogonally; a "Reasoning Contamination Effect" is identified; the two-stage steering method substantially improves calibration alignment across all evaluated models. Conclusion: Verbalized confidence and actual calibration are separate, orthogonal signals inside the model, and intervening on internal representations can make confidence statements more reliable, which offers a new route to more reliable LLMs. Abstract: Large language models (LLMs) tend to verbalize confidence scores that are largely detached from their actual accuracy, yet the geometric relationship governing this behavior remains poorly understood. In this work, we present a mechanistic interpretability analysis of verbalized confidence, using linear probes and contrastive activation addition (CAA) steering to show that calibration and verbalized confidence signals are encoded linearly but are orthogonal to one another -- a finding consistent across three open-weight models and four datasets. Interestingly, when models are prompted to simultaneously reason through a problem and verbalize a confidence score, the reasoning process disrupts the verbalized confidence direction, exacerbating miscalibration. We term this the "Reasoning Contamination Effect." Leveraging this insight, we introduce a two-stage adaptive steering pipeline that reads the model's internal accuracy estimate and steers verbalized output to match it, substantially improving calibration alignment across all evaluated models.

[16] OMIND: Framework for Knowledge Grounded Finetuning and Multi-Turn Dialogue Benchmark for Mental Health LLMs

Suraj Racha,Prashant Harish Joshi,Utkarsh Maurya,Nitin Yadav,Mridul Sharma,Ananya Kunisetty,Saranya Darisipudi,Nirmal Punjabi,Ganesh Ramakrishnan

Main category: cs.CL

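TL;DR: This paper presents the oMind framework, which addresses three challenges in applying large language models (LLMs) to mental health: the lack of high-quality, interpretable, knowledge-grounded training data; training paradigms restricted to core capabilities; and the difficulty of evaluating multi-turn dialogue. With a multi-task supervised fine-tuning dataset of about 164k examples and a new multi-turn dialogue benchmark, oMind-Chat, it substantially improves LLM performance on mental health conversation and reasoning.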

Details Motivation: Mental health is a rising concern globally, and adapting LLMs to this domain faces three challenges: missing high-quality data, narrow training paradigms, and hard-to-evaluate multi-turn dialogue. Method: The oMind framework includes a data generation pipeline based on structured knowledge retrieval, LLM-based pruning, and review actions, yielding a ~164k multi-task SFT dataset; an expert-annotated multi-turn dialogue benchmark, oMind-Chat; and training and alignment of LLM agents for conversation and core capabilities. Result: oMind LLMs consistently outperform baselines on core capabilities and conversation tasks, with up to an 80% win rate on reasoning. Conclusion: The oMind framework improves the suitability and reliability of LLMs for mental health and offers a workable path toward professional, safe, and interpretable AI support systems. Abstract: Large Language Models (LLMs) have shown remarkable capabilities for complex tasks, yet adaptation to the medical domain, specifically mental health, poses specific challenges. Mental health is a rising concern globally, and LLMs have large potential to help address it. We highlight three primary challenges for LLMs in mental health: lack of high quality interpretable and knowledge grounded training data; training paradigms restricted to core capabilities; and evaluation of multi turn dialogue settings. To address these, we present the oMind framework, which includes training and aligning LLM agents for diverse capabilities including conversations, and a high quality ~164k multi-task SFT dataset produced by our generation pipeline based on Structured Knowledge retrieval, LLM based pruning, and review actions. We also introduce oMind-Chat, a novel multi turn benchmark dataset with expert annotated turn level and conversation level rubrics. Our diverse experiments on both core capabilities and conversations show oMind LLMs consistently outperform baselines. oMind-LLM also shows significantly better reasoning with up to 80% win rate.

[17] Do LLMs Know What They Know? Measuring Metacognitive Efficiency with Signal Detection Theory

Jon-Paul Cacioli

Main category: cs.CL

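TL;DR: This paper proposes an evaluation framework based on Type-2 signal detection theory that uses meta-d' and M-ratio to separate an LLM's Type-1 sensitivity (how much it knows) from its Type-2 metacognitive sensitivity (how well it knows what it knows), revealing substantial differences across models in how well they "know what they don't know."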

Details Motivation: Existing LLM confidence metrics (such as ECE and Brier score) conflate how much a model knows (Type-1 sensitivity) with how well it knows what it knows (Type-2 metacognitive sensitivity), so the two need to be evaluated separately. Method: Builds on Type-2 Signal Detection Theory (Type-2 SDT) with meta-d' and M-ratio, evaluates four LLMs on 224,000 factual QA trials, and analyzes temperature manipulation, domain specificity, and metric comparison. Result: (1) Metacognitive efficiency varies substantially across models even when Type-1 sensitivity is similar; (2) metacognitive efficiency is domain-specific; (3) temperature shifts the Type-2 criterion while meta-d' remains stable for two of the four models; (4) AUROC_2 and M-ratio produce fully inverted model rankings. Conclusion: The meta-d' framework distinguishes models with real metacognitive capacity from models that merely appear well-calibrated because of their confidence policy, which matters for model selection, deployment, and human-AI collaboration. Abstract: Standard evaluation of LLM confidence relies on calibration metrics (ECE, Brier score) that conflate two distinct capacities: how much a model knows (Type-1 sensitivity) and how well it knows what it knows (Type-2 metacognitive sensitivity). We introduce an evaluation framework based on Type-2 Signal Detection Theory that decomposes these capacities using meta-d' and the metacognitive efficiency ratio M-ratio. Applied to four LLMs (Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.3, Llama-3-8B-Base, Gemma-2-9B-Instruct) across 224,000 factual QA trials, we find: (1) metacognitive efficiency varies substantially across models even when Type-1 sensitivity is similar -- Mistral achieves the highest d' but the lowest M-ratio; (2) metacognitive efficiency is domain-specific, with different models showing different weakest domains, invisible to aggregate metrics; (3) temperature manipulation shifts Type-2 criterion while meta-d' remains stable for two of four models, dissociating confidence policy from metacognitive capacity; (4) AUROC_2 and M-ratio produce fully inverted model rankings, demonstrating these metrics answer fundamentally different evaluation questions. The meta-d' framework reveals which models "know what they don't know" versus which merely appear well-calibrated due to criterion placement -- a distinction with direct implications for model selection, deployment, and human-AI collaboration. Pre-registered analysis; code and data publicly available.
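
Two of the simpler quantities mentioned in this entry can be sketched directly. The code below computes the Type-1 sensitivity d' from a hit rate and a false-alarm rate, and a Type-2 AUROC as the probability that a correct trial receives higher confidence than an incorrect one. The names `d_prime` and `auroc2` are illustrative assumptions, not the paper's code, and the meta-d' fit itself (a maximum-likelihood procedure) is not reproduced; M-ratio is then meta-d' divided by d'.

```python
from statistics import NormalDist

def d_prime(hit_rate, false_alarm_rate):
    """Type-1 sensitivity: z(hit rate) - z(false-alarm rate).

    Rates must lie strictly between 0 and 1, otherwise the inverse
    normal CDF is undefined.
    """
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(false_alarm_rate)

def auroc2(confidences, correct):
    """Type-2 AUROC: P(confidence on a correct trial > confidence on an
    incorrect trial), counting ties as one half."""
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect trial")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy trials (hypothetical): confidence perfectly separates right from wrong.
toy_auroc = auroc2([0.9, 0.8, 0.3, 0.6], [True, True, False, False])
```

The pairwise AUROC above is quadratic in the number of trials; at the scale of 224,000 trials a rank-based formulation would be the more practical choice.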

[18] Goodness-of-pronunciation without phoneme time alignment

Jeremy H. M. Wong,Nancy F. Chen

Main category: cs.CL

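TL;DR: This paper proposes a method for speech evaluation in low-resource languages using weakly-supervised ASR models. It builds a phoneme confusion network, uses word-level speaking rate and duration features, and combines phoneme-level and frame-level features with cross-attention, which removes the need for phoneme time alignment. On English and Tamil datasets it performs comparably with conventional frame-synchronous features.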

Details Motivation: Limited ASR training data holds back speech evaluation in low-resource languages; open-source weakly-supervised models support many languages but are frame-asynchronous and not phonemic, which hinders feature extraction for speech evaluation. Method: Computes phoneme posteriors by mapping ASR hypotheses to a phoneme confusion network; uses word-level rather than phoneme-level speaking rate and duration; combines phoneme-level and frame-level features with a cross-attention architecture, avoiding phoneme time alignment. Result: On English speechocean762 and low-resource Tamil datasets, the method performs comparably with standard frame-synchronous features. Conclusion: The method resolves the incompatibility between weakly-supervised ASR models and feature extraction for speech evaluation, offering a feasible path for low-resource languages. Abstract: In speech evaluation, an Automatic Speech Recognition (ASR) model often computes time boundaries and phoneme posteriors for input features. However, limited data for ASR training hinders expansion of speech evaluation to low-resource languages. Open-source weakly-supervised models are capable of ASR over many languages, but they are frame-asynchronous and not phonemic, hindering feature extraction for speech evaluation. This paper proposes to overcome incompatibilities for feature extraction with weakly-supervised models, easing expansion of speech evaluation to low-resource languages. Phoneme posteriors are computed by mapping ASR hypotheses to a phoneme confusion network. Word-level rather than phoneme-level speaking rate and duration are used. Phoneme and frame-level features are combined using a cross-attention architecture, obviating phoneme time alignment. This approach performs comparably with standard frame-synchronous features on English speechocean762 and low-resource Tamil datasets.

[19] To Write or to Automate Linguistic Prompts, That Is the Question

Marina Sánchez-Torrón,Daria Akselrod,Jason Rauchwerk

Main category: cs.CL

TL;DR: This paper compares expert-designed prompts with automatically optimized prompts (GEPA) for linguistic tasks, finding task-dependent results and showing that automatic optimization can often match expert performance.

Details Motivation: To explore whether automatic prompt optimization can replace expert prompt engineering in linguistic tasks, given the sensitivity of LLM performance to prompt design. Method: Systematic comparison of hand-crafted zero-shot expert prompts, base DSPy signatures, and GEPA-optimized DSPy signatures across translation, terminology insertion, and language quality assessment, using five model configurations. Result: Results vary by task: in terminology insertion, optimized and manual prompts yield statistically indistinguishable quality; in translation, each approach wins on different models; in LQA, expert prompts excel at error detection while optimization improves characterization; GEPA consistently improves base DSPy signatures, and most expert-optimized comparisons show no significant difference. Conclusion: Automatic prompt optimization (e.g., GEPA) can often match expert prompt engineering in linguistic tasks, though the approaches differ fundamentally—GEPA relies on labeled data and programmatic search, whereas expert prompts rely on domain knowledge and iterative refinement. Abstract: LLM performance is highly sensitive to prompt design, yet whether automatic prompt optimization can replace expert prompt engineering in linguistic tasks remains unexplored. We present the first systematic comparison of hand-crafted zero-shot expert prompts, base DSPy signatures, and GEPA-optimized DSPy signatures across translation, terminology insertion, and language quality assessment, evaluating five model configurations. Results are task-dependent. In terminology insertion, optimized and manual prompts produce mostly statistically indistinguishable quality. In translation, each approach wins on different models. In LQA, expert prompts achieve stronger error detection while optimization improves characterization. 
Across all tasks, GEPA elevates minimal DSPy signatures, and the majority of expert-optimized comparisons show no statistically significant difference. We note that the comparison is asymmetric: GEPA optimization searches programmatically over gold-standard splits, whereas expert prompts require in principle no labeled data, relying instead on domain expertise and iterative refinement.

[20] Prompt Attack Detection with LLM-as-a-Judge and Mixture-of-Models

Hieu Xuan Le,Benjamin Goh,Quy Anh Tang

Main category: cs.CL

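TL;DR: This paper uses lightweight general-purpose LLMs (such as gemini-2.0-flash-lite-001) as low-latency security judges. A structured reasoning process (intent decomposition, safety-signal verification, harm assessment, and self-reflection) detects prompt attacks effectively, and the system is deployed for public service chatbots in Singapore.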

Details Motivation: Lightweight classifiers and rule-based systems generalize poorly under distribution shift, while high-capacity LLM judges are too slow or costly for real-time guardrails, leaving a deployment gap. Method: Designs a structured reasoning process (explicit intent decomposition, safety-signal verification, harm assessment, self-reflection) and tunes the prompt and output format so that lightweight general-purpose LLMs can make real-time security judgments; also explores a Mixture-of-Models (MoM) strategy. Result: Lightweight general-purpose LLMs (such as gemini-2.0-flash-lite-001) detect attacks effectively on a realistic adversarial dataset and are deployed as a centralized guardrail service for public service chatbots in Singapore; MoM brings only modest gains. Conclusion: With suitable prompt engineering and guided reasoning, lightweight general-purpose LLMs can reliably act as real-time security judges under strict production constraints, helping close the deployment gap for LLM guardrails. Abstract: Prompt attacks, including jailbreaks and prompt injections, pose a critical security risk to Large Language Model (LLM) systems. In production, guardrails must mitigate these attacks under strict low-latency constraints, resulting in a deployment gap in which lightweight classifiers and rule-based systems struggle to generalize under distribution shift, while high-capacity LLM-based judges remain too slow or costly for live enforcement. In this work, we examine whether lightweight, general-purpose LLMs can reliably serve as security judges under real-world production constraints. Through careful prompt and output design, lightweight LLMs are guided through a structured reasoning process involving explicit intent decomposition, safety-signal verification, harm assessment, and self-reflection. We evaluate our method on a curated dataset combining benign queries from real-world chatbots with adversarial prompts generated via automated red teaming (ART), covering diverse and evolving patterns. Our results show that general-purpose LLMs, such as gemini-2.0-flash-lite-001, can serve as effective low-latency judges for live guardrails. This configuration is currently deployed in production as a centralized guardrail service for public service chatbots in Singapore. We additionally evaluate a Mixture-of-Models (MoM) setting to assess whether aggregating multiple LLM judges improves prompt-attack detection performance relative to single-model judges, with only modest gains observed.
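
The abstract does not say how the Mixture-of-Models setting aggregates its judges. One plausible reading, shown here purely as an assumption, is a vote over per-judge verdicts; `mixture_verdict` and its `threshold` parameter are hypothetical names, and the judge outputs are hard-coded rather than produced by real model calls.

```python
def mixture_verdict(judge_flags, threshold=0.5):
    """Aggregate boolean judge verdicts (True = prompt attack detected).

    Returns True when the fraction of judges flagging the prompt is
    strictly greater than `threshold`; the default is a strict majority.
    """
    if not judge_flags:
        raise ValueError("at least one judge verdict is required")
    return sum(judge_flags) / len(judge_flags) > threshold

# Hypothetical verdicts from three judges for one incoming prompt.
flagged = mixture_verdict([True, True, False])
tie = mixture_verdict([True, False])
```

Lowering `threshold` trades more false positives for fewer missed attacks. Because every added judge also adds latency, the modest gains reported in the abstract fit a single-judge deployment.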

[21] Cross-Preference Learning for Sentence-Level and Context-Aware Machine Translation

Ying Li,Xinglin Lyu,Junhui Li,Jinlong Yang,Hengchao Shang,Min Zhang,Shimin Tao,Daimeng Wei

Main category: cs.CL

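TL;DR: This paper proposes Cross-Preference Learning (CPL), a preference learning framework for improving context-aware machine translation (MT). By introducing intra-condition and cross-condition preference signals, it explicitly models how much context helps different sentences, which strengthens the model's ability to use context adaptively without changing the model architecture.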

Details Motivation: Context-aware MT uses document-level information, but its gains are unstable because context helps some sentences more than others; existing training objectives do not model this variability, which limits adaptive use of context. Method: Proposes Cross-Preference Learning (CPL), a preference learning framework that integrates intra-condition and cross-condition preferences into the preference optimization objective, giving explicit supervision on when and how context improves translation quality. Result: On several public context-aware MT tasks, with models including Qwen3-4B, Qwen3-8B, and Llama-3-8B, CPL yields consistent gains in translation quality and robustness under both input conditions. Conclusion: CPL is a general, plug-and-play training approach that improves context-aware MT without architectural changes. Abstract: Context-aware machine translation (MT) leverages document-level information, yet it does not consistently outperform sentence-level MT, as contextual signals are unevenly beneficial across sentences. Existing training objectives do not explicitly model this variability, limiting a model's ability to adaptively exploit context. In this paper, we propose Cross-Preference Learning (CPL), a preference-based training framework that explicitly captures the complementary benefits of sentence-level and context-aware MT. CPL achieves this by integrating both intra- and cross-condition preferences into the preference optimization objective. The introduction of intra- and cross-condition preferences provides explicit supervision on when and how contextual information improves translation quality. We validate the proposed approach on several public context-aware MT tasks using multiple models, including Qwen3-4B, Qwen3-8B, and Llama-3-8B. Experimental results demonstrate consistent improvements in translation quality and robustness across both input conditions, achieved without any architectural modifications.

[22] Probing the Lack of Stable Internal Beliefs in LLMs

Yifan Luo,Kangping Xu,Yanzhen Lu,Yang Yuan,Andrew Chi-Chih Yao

Main category: cs.CL

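TL;DR: This paper examines whether large language models (LLMs) can maintain implicit consistency, meaning persistent adherence to an unstated goal, across multi-turn dialogue. It finds that current LLMs lack stable internal representations and struggle to hold an implicit goal without explicit prompting, which limits persona modeling.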

Details Motivation: Persona-driven LLMs need stable behavioral tendencies (such as persistence or reliability) across interactions, but current LLMs lack stable internal representations that anchor their responses, so persona consistency suffers. Method: Designs a 20-question-style riddle game in which the LLM secretly selects a target and answers the user's guesses with only "yes/no"; multi-turn evaluation tests whether the LLM implicitly keeps to its chosen target. Result: LLMs struggle to maintain implicit consistency: their implied "goal" drifts across turns unless the selected target is explicitly provided in context. Conclusion: Current LLMs have a key weakness in keeping implicit goals over time, and mechanisms that anchor such goals are needed for realistic persona-driven dialogue systems. Abstract: Persona-driven large language models (LLMs) require consistent behavioral tendencies across interactions to simulate human-like personality traits, such as persistence or reliability. However, current LLMs often lack stable internal representations that anchor their responses over extended dialogues. This work explores whether LLMs can maintain "implicit consistency", defined as persistent adherence to an unstated goal in multi-turn interactions. We designed a 20-question-style riddle game paradigm where an LLM is tasked with secretly selecting a target and responding to users' guesses with "yes/no" answers. Through evaluations, we find that LLMs struggle to preserve latent consistency: their implicit "goals" shift across turns unless explicitly provided their selected target in context. These findings highlight critical limitations in the building of persona-driven LLMs and underscore the need for mechanisms that anchor implicit goals over time, which is key to realistic personality modeling in interactive applications such as dialogue systems.

[23] A Catalog of Basque Dialectal Resources: Online Collections and Standard-to-Dialectal Adaptations

Jaione Bengoetxea,Itziar Gonzalez-Dios,Rodrigo Agerri

Main category: cs.CL

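TL;DR: This paper compiles contemporary Basque dialectal data and resources, distinguishing online data originally written in dialect from standard-to-dialect adapted data. It also manually builds a high-quality dialectal XNLI test set and evaluates the quality of automatically adapted data.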

Details Motivation: Data scarcity is a primary limitation in dialectal NLP. Method: Systematically catalogs and classifies Basque dialectal resources, covering native online dialectal data and standard-to-dialect adapted data; manually adapts the XNLI test set into three Basque dialects; has native speakers manually evaluate the automatically adapted BasPhyCowest dataset. Result: A catalog of Basque dialectal data from many sources; a high-quality parallel XNLI evaluation set in three dialects; and an assessment of whether automatically adapted data can serve as silver data. Conclusion: The catalog gives Basque dialectal NLP a solid data foundation; manually adapted data is highly reliable, and automatically adapted data that has been checked by humans can serve as a supplementary silver resource. Abstract: Recent research on dialectal NLP has identified data scarcity as a primary limitation. To address this limitation, this paper presents a catalog of contemporary Basque dialectal data and resources, offering a systematic and comprehensive compilation of the dialectal data currently available in Basque. Two types of data sources have been distinguished: online data originally written in some dialect, and standard-to-dialect adapted data. The former includes all dialectal data that can be found online, such as news and radio sites, informal tweets, as well as online resources such as dictionaries, atlases, grammar rules, or videos. The latter consists of data that has been adapted from the standard variety to dialectal varieties, either manually or automatically. Regarding the manual adaptation, the test split of the XNLI Natural Language Inference dataset was manually adapted into three Basque dialects: Western, Central, and Navarrese-Lapurdian, yielding a high-quality parallel gold standard evaluation dataset. With respect to the automatic dialectal adaptation, the automatically adapted physical commonsense dataset (BasPhyCowest) underwent additional manual evaluation by native speakers to assess its quality and determine whether it could serve as a viable substitute for full manual adaptation (i.e., silver data creation).

[24] A Decade-Scale Benchmark Evaluating LLMs' Clinical Practice Guidelines Detection and Adherence in Multi-turn Conversations

Andong Tan,Shuyu Dai,Jinglu Wang,Fengtao Zhou,Yan Lu,Xi Wang,Yingcong Chen,Can Yang,Shujie Liu,Hao Chen

Main category: cs.CL

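TL;DR: This paper introduces CPGBench, an automated benchmark framework for evaluating how well large language models (LLMs) detect and adhere to clinical practice guidelines (CPGs) in multi-turn clinical conversations. Models can identify guideline content to some extent, but perform poorly at citing the source correctly and at actually following the guidelines.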

Details Motivation: It is unclear how well LLMs identify and adhere to clinical practice guidelines (CPGs) in clinical conversations, and a systematic evaluation framework is needed to support safe deployment. Method: Builds CPGBench by collecting 3,418 CPG documents from the last decade across 9 countries/regions and 2 international organizations, extracting 32,155 structured clinical recommendations, generating one multi-turn conversation test case per recommendation, and evaluating 8 leading LLMs on guideline detection, source referencing, and adherence, backed by human validation from 56 clinicians. Result: Detection accuracy is 71.1%-89.6%, but guideline titles are correctly referenced in only 3.6%-29.7% of cases; adherence rates are only 21.8%-63.2%; human evaluation supports the validity of the automatic analysis. Conclusion: Current LLMs have clear deficiencies in tracing and applying clinical guidelines, and targeted improvement is needed for safe and reliable use in real clinical settings. Abstract: Clinical practice guidelines (CPGs) play a pivotal role in ensuring evidence-based decision-making and improving patient outcomes. While Large Language Models (LLMs) are increasingly deployed in healthcare scenarios, it is unclear to what extent LLMs can identify and adhere to CPGs during conversations. To address this gap, we introduce CPGBench, an automated framework benchmarking the clinical guideline detection and adherence capabilities of LLMs in multi-turn conversations. We collect 3,418 CPG documents from 9 countries/regions and 2 international organizations published in the last decade, spanning 24 specialties. From these documents, we extract 32,155 clinical recommendations with corresponding publication institute, date, country, specialty, recommendation strength, evidence level, etc. One multi-turn conversation is generated for each recommendation accordingly to evaluate the detection and adherence capabilities of 8 leading LLMs. We find that 71.1%-89.6% of recommendations can be correctly detected, while only 3.6%-29.7% of the corresponding titles can be correctly referenced, revealing the gap between knowing the guideline contents and knowing where they come from. The adherence rates range from 21.8% to 63.2% across models, indicating a large gap between knowing the guidelines and being able to apply them. To confirm the validity of our automatic analysis, we further conduct a comprehensive human evaluation involving 56 clinicians from different specialties. 
To our knowledge, CPGBench is the first benchmark systematically revealing which clinical recommendations LLMs fail to detect or adhere to during conversations. Given that each clinical recommendation may affect a large population and that clinical applications are inherently safety critical, addressing these gaps is crucial for the safe and responsible deployment of LLMs in real world clinical practice.

[25] SafeMath: Inference-time Safety improves Math Accuracy

Sagnik Basu,Subhrajit Mitra,Aman Juneja,Somnath Banerjee,Rima Hazra,Animesh Mukherjee

Main category: cs.CL

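TL;DR: This paper shows that math word problems can carry harmful, biased, or psychologically damaging content. It builds the ToxicGSM dataset and proposes SafeMath, a method that improves safety while maintaining, and in some cases improving, mathematical reasoning accuracy.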

Details Motivation: LLMs can be manipulated by adversarial or seemingly benign inputs into producing harmful, biased, or policy-violating outputs; math word problems, as natural language narratives, can be a hidden channel for harmful content in educational settings (especially those involving children), and this issue is underexplored. Method: Builds ToxicGSM, a dataset of 1.9k arithmetic problems that embed harmful or sensitive context while remaining mathematically well-defined; uses it to audit existing LLMs jointly on safety and math performance; proposes a safety alignment technique called SafeMath. Result: Models show a trade-off between safety and accuracy on toxic math problems; SafeMath reduces harmful outputs while maintaining, and in some cases improving, math reasoning accuracy; linguistic harm and mathematical reasoning can be disentangled. Conclusion: Hidden harm in math word problems deserves attention; safety alignment such as SafeMath can improve safety without sacrificing math performance; evaluation should cover both language safety and task correctness. Abstract: Recent research points toward LLMs being manipulated through adversarial and seemingly benign inputs, resulting in harmful, biased, or policy-violating outputs. In this paper, we study an underexplored issue concerning harmful and toxic mathematical word problems. We show that math questions, particularly those framed as natural language narratives, can serve as a subtle medium for propagating biased, unethical, or psychologically harmful content, with heightened risks in educational settings involving children. To support a systematic study of this phenomenon, we introduce ToxicGSM, a dataset of 1.9k arithmetic problems in which harmful or sensitive context is embedded while preserving mathematically well-defined reasoning tasks. Using this dataset, we audit the behaviour of existing LLMs and analyse the trade-offs between safety enforcement and mathematical correctness. We further propose SafeMath -- a safety alignment technique that reduces harmful outputs while maintaining, and in some cases improving, mathematical reasoning performance. Our results highlight the importance of disentangling linguistic harm from math reasoning and demonstrate that effective safety alignment need not come at the cost of accuracy. We release the source code and dataset at https://github.com/Swagnick99/SafeMath/tree/main.

[26] Translation or Recitation? Calibrating Evaluation Scores for Machine Translation of Extremely Low-Resource Languages

Danlu Chen,Ka Sing He,Jiahe Tian,Chenghao Xiao,Zhaofeng Wu,Taylor Berg-Kirkpatrick,Freda Shi

Main category: cs.CL

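TL;DR: This paper introduces the FRED Difficulty Metrics (Fertility Ratio, Retrieval Proxy, Pre-training Exposure, Corpus Diversity) to quantify the intrinsic difficulty of extremely low-resource machine translation datasets. It shows that performance differences stem largely from train-test overlap and pre-training exposure rather than model capability, and that extinct and non-Latin indigenous languages face a fundamental transfer problem because of poor tokenization coverage.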

Details Motivation: Reported performance in extremely low-resource MT varies widely across language pairs, and it is hard to tell whether differences come from better methods or from benchmark artifacts; researchers working on specific groups such as ancient languages lack a comparable basis for evaluation. Method: Proposes four dataset-intrinsic difficulty metrics, Fertility Ratio (F), Retrieval Proxy (R), Pre-training Exposure (E), and Corpus Diversity (D), which together form FRED and are used to break down sources of performance variation and diagnose tokenization and transfer bottlenecks. Result: Result variability is explained mainly by train-test overlap and pre-training exposure; extinct and non-Latin indigenous languages show high token fertility, meaning poor tokenization coverage, which limits cross-lingual transfer. Conclusion: FRED gives the XLR MT community a more transparent and reliable basis for evaluating cross-lingual transfer, and stresses attention to intrinsic data properties rather than model improvements alone, in support of more comparable and reproducible low-resource MT research. Abstract: The landscape of extremely low-resource machine translation (MT) is characterized by perplexing variability in reported performance, often making results across different language pairs difficult to contextualize. For researchers focused on specific language groups -- such as ancient languages -- it is nearly impossible to determine if breakthroughs reported in other contexts (e.g., native African or American languages) result from superior methodologies or are merely artifacts of benchmark collection. To address this problem, we introduce the FRED Difficulty Metrics, which include the Fertility Ratio (F), Retrieval Proxy (R), Pre-training Exposure (E), and Corpus Diversity (D) and serve as dataset-intrinsic metrics to contextualize reported scores. These metrics reveal that a significant portion of result variability is explained by train-test overlap and pre-training exposure rather than model capability. Additionally, we identify that some languages -- particularly extinct and non-Latin indigenous languages -- suffer from poor tokenization coverage (high token fertility), highlighting a fundamental limitation of transferring models from high-resource languages that lack a shared vocabulary. By providing these indices alongside performance scores, we enable more transparent evaluation of cross-lingual transfer and provide a more reliable foundation for the XLR MT community.
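
The abstract does not give the exact FRED definitions, so the sketch below only illustrates two related and commonly used quantities: token fertility as tokens per whitespace-delimited word, and train-test overlap as the share of test sentences that also appear verbatim in the training set. The function names and the character-level toy tokenizer are assumptions for illustration; the paper's Fertility Ratio and Retrieval Proxy may be defined differently.

```python
def token_fertility(sentences, tokenize):
    """Average number of tokens per whitespace-delimited word.

    `tokenize` is any callable mapping a string to a list of tokens.
    """
    n_tokens = sum(len(tokenize(s)) for s in sentences)
    n_words = sum(len(s.split()) for s in sentences)
    if n_words == 0:
        raise ValueError("no words found in the input sentences")
    return n_tokens / n_words

def train_test_overlap(train_sentences, test_sentences):
    """Fraction of test sentences that occur verbatim in the training set."""
    if not test_sentences:
        raise ValueError("test set is empty")
    seen = {s.strip() for s in train_sentences}
    return sum(s.strip() in seen for s in test_sentences) / len(test_sentences)

# Toy tokenizer (hypothetical): one token per non-space character, which
# mimics a tokenizer with very poor coverage of the script.
char_tokenizer = lambda s: [ch for ch in s if not ch.isspace()]
fertility = token_fertility(["ab cd"], char_tokenizer)
overlap = train_test_overlap(["a", "b"], ["a", "c"])
```

High fertility means each word is spread over many tokens, which is the poor-coverage symptom the abstract describes. High overlap suggests a reported score may reflect recitation of training data more than translation ability.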

[27] Comparing Natural and Synthetic Structured Data: A Study of the Passive Verb Alternation in French and Italian

Giuseppe Samo,Paola Merlo

Main category: cs.CL

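TL;DR: This study compares the effect of natural and synthetic data on training and evaluating large language models (LLMs), focusing on the passive verb alternation in French and Italian. Models trained on natural data perform robustly on both natural and synthetic test sets, while models trained only on synthetic data do not generalize reliably to natural sentences, which underlines the value of natural data in linguistic evaluation.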

Details Motivation: To examine the relative roles of natural and synthetic data in training and evaluating LLMs' linguistic abilities (syntactic and semantic knowledge in particular), with a focus on the passive verb alternation as an abstract linguistic pattern. Method: Uses Blackbird Language Matrices (BLMs), structured datasets, to compare templates instantiated with natural sentences extracted from Universal Dependencies against templates of synthetic sentences; runs training and generalization experiments on the passive verb alternation in French and Italian. Result: Models reach ceiling performance when trained and tested on synthetic data but do not reliably generalize to natural sentences; models trained on natural data perform robustly on both natural and synthetic test sets, showing a stronger grasp of abstract linguistic patterns. Conclusion: Natural data is better than synthetic data for training LLMs that generalize linguistically, and structured evaluation setups such as BLMs effectively reveal a model's deeper linguistic knowledge. Abstract: This study compares the impact of natural and synthetic data on training and evaluating large language models (LLMs), using the case of passive verb alternation in French and Italian. We use Blackbird Language Matrices (BLMs), structured datasets designed to probe linguistic knowledge of underlying patterns across sentence sets. We compare structured templates instantiated with natural sentences extracted from Universal Dependencies to structured templates of synthetic sentences. Experiments show that while models achieve ceiling performance when trained and tested on synthetic datasets, they do not reliably generalize to natural sentences. In contrast, models trained on natural data exhibit robust performance across both natural and synthetic test suites, demonstrating their superior ability to capture abstract linguistic patterns. These results corroborate the value of natural data and of structured setups in linguistic evaluation for probing LLMs' syntactic and semantic knowledge.

[28] MolQuest: A Benchmark for Agentic Evaluation of Abductive Reasoning in Chemical Structure Elucidation

Taolin Han,Shuang Wu,Jinghang Wang,Yuhao Zhou,Renquan Lv,Bing Zhao,Wei Hu

Main category: cs.CL

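TL;DR: This paper introduces MolQuest, an agent-based evaluation framework built on real chemical experimental data, to assess the dynamic reasoning of large language models in molecular structure elucidation. Current frontier models perform poorly in this realistic scientific setting: the best reach only about 50% accuracy, and most fall below 30%.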

Details Motivation: Existing scientific benchmarks mostly use static, single-turn question answering, which cannot measure model performance on complex scientific tasks that need multi-step iteration and experimental interaction. Method: Builds the MolQuest framework, which models molecular structure elucidation as a multi-turn interactive task in which models plan experimental steps, integrate multiple spectral sources (such as NMR and MS), and iteratively refine structural hypotheses; this systematically evaluates abductive reasoning and strategic decision-making. Result: State-of-the-art models reach only about 50% accuracy on MolQuest and most models stay below 30%, exposing clear weaknesses in strategic scientific reasoning. Conclusion: MolQuest offers a reproducible and extensible approach to science-oriented LLM evaluation, reveals the gap in LLMs' ability to take an active part in scientific research, and points to directions for future work. Abstract: Large language models (LLMs) hold considerable potential for advancing scientific discovery, yet systematic assessment of their dynamic reasoning in real-world research remains limited. Current scientific evaluation benchmarks predominantly rely on static, single-turn Question Answering (QA) formats, which are inadequate for measuring model performance in complex scientific tasks that require multi-step iteration and experimental interaction. To address this gap, we introduce MolQuest, a novel agent-based evaluation framework for molecular structure elucidation built upon authentic chemical experimental data. Unlike existing datasets, MolQuest formalizes molecular structure elucidation as a multi-turn interactive task, requiring models to proactively plan experimental steps, integrate heterogeneous spectral sources (e.g., NMR, MS), and iteratively refine structural hypotheses. This framework systematically evaluates LLMs' abductive reasoning and strategic decision-making abilities within a vast and complex chemical space. Empirical results reveal that contemporary frontier models exhibit significant limitations in authentic scientific scenarios: notably, even state-of-the-art (SOTA) models achieve an accuracy of only approximately 50%, while the performance of most other models remains below the 30% threshold. This work provides a reproducible and extensible framework for science-oriented LLM evaluation. Our findings highlight the critical gap in current LLMs' strategic scientific reasoning, setting a clear direction for future research toward AI that can actively participate in the scientific process.

[29] CRAFT: Grounded Multi-Agent Coordination Under Partial Information

Abhijnan Nath,Hannah VanderHoeven,Nikhil Krishnaswamy

Main category: cs.CL

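TL;DR: This paper introduces CRAFT, a multi-agent benchmark for evaluating pragmatic communication in large language models under strict partial information. Several agents with complementary but incomplete views must collaborate through natural language to build a shared 3D structure that no single agent can fully observe. The problem is formalized as a multi-sender pragmatic reasoning task, with a diagnostic framework that breaks failures down into spatial grounding, belief modeling, and pragmatic communication errors. Experiments show that stronger reasoning does not reliably improve coordination, that smaller open-weight models often match or outperform frontier models, and that better individual communication does not guarantee successful collaboration, so multi-agent coordination remains a fundamental challenge for current language models.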

Details Motivation: LLMs lack systematic evaluation in multi-agent collaboration, and their coordination ability is unclear in settings that are strictly partially observable and require pragmatic language communication. Method: Introduces the CRAFT multi-agent benchmark, formalized as a multi-sender pragmatic reasoning task; builds a collaborative 3D structure environment; designs a diagnostic framework that decomposes failure types (spatial grounding, belief modeling, pragmatic communication) and a taxonomy of behavioral failures; evaluates 8 open-weight and 7 frontier models, including reasoning models. Result: Stronger reasoning does not necessarily improve collaboration; smaller open-weight models often match or outperform frontier models; improved individual communication does not guarantee team success; multi-agent coordination remains unsolved. Conclusion: Current LLMs have fundamental limits in multi-agent pragmatic communication and collaboration, and new modeling ideas and evaluation approaches are needed. Abstract: We introduce CRAFT, a multi-agent benchmark for evaluating pragmatic communication in large language models under strict partial information. In this setting, multiple agents with complementary but incomplete views must coordinate through natural language to construct a shared 3D structure that no single agent can fully observe. We formalize this problem as a multi-sender pragmatic reasoning task and provide a diagnostic framework that decomposes failures into spatial grounding, belief modeling and pragmatic communication errors, including a taxonomy of behavioral failure profiles in both frontier and open-weight models. Across a diverse set of models, including 8 open-weight and 7 frontier models (among them reasoning models), we find that stronger reasoning ability does not reliably translate to better coordination: smaller open-weight models often match or outperform frontier systems, and improved individual communication does not guarantee successful collaboration. These results suggest that multi-agent coordination remains a fundamentally unsolved challenge for current language models. Our code can be found at https://github.com/csu-signal/CRAFT

[30] When Hate Meets Facts: LLMs-in-the-Loop for Check-worthiness Detection in Hate Speech

Nicolás Benjamín Ocampo,Tommaso Caselli,Davide Ceolin

Main category: cs.CL

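TL;DR: This paper releases the WSF-ARG+ dataset and an LLM-in-the-loop annotation framework that handle hate speech and check-worthy claims jointly, improving detection while reducing the human annotation burden.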

Details Motivation: Hate speech often spreads in fact-like form; treating hate speech and misinformation separately can deepen prejudice, reinforce stereotypes, and make content moderation harder, so the two should be modeled jointly. Method: Builds WSF-ARG+, the first dataset that combines hate speech with check-worthiness labels; designs an LLM-in-the-loop framework that uses 12 open-weight LLMs to help annotate check-worthy claims, validated through human evaluation. Result: Hate speech messages that contain check-worthy claims show higher levels of harassment and hate; adding check-worthiness labels improves LLM-based hate speech detection by 0.154 macro-F1 on average for large models and by up to 0.213. Conclusion: Jointly modeling hate speech and check-worthiness improves detection performance and moderation efficiency, and the LLM-in-the-loop framework reduces human annotation effort without lowering quality. Abstract: Hateful content online is often expressed using fact-like, not necessarily correct information, especially in coordinated online harassment campaigns and extremist propaganda. Failing to jointly address hate speech (HS) and misinformation can deepen prejudice, reinforce harmful stereotypes, and expose bystanders to psychological distress, while polluting public debate. Moreover, these messages require more effort from content moderators because they must assess both harmfulness and veracity, i.e., fact-check them. To address this challenge, we release WSF-ARG+, the first dataset which combines hate speech with check-worthiness information. We also introduce a novel LLM-in-the-loop framework to facilitate the annotation of check-worthy claims. We run our framework, testing it with 12 open-weight LLMs of different sizes and architectures. We validate it through extensive human evaluation, and show that our LLM-in-the-loop framework reduces human effort without compromising the annotation quality of the data. Finally, we show that HS messages with check-worthy claims show significantly higher harassment and hate, and that incorporating check-worthiness labels improves LLM-based HS detection by up to 0.213 macro-F1, and by 0.154 macro-F1 on average for large models.
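
Since the gains in this entry are reported in macro-F1, a short sketch of that metric may help when reading the numbers. Macro-F1 is the unweighted mean of per-class F1 scores, so a minority class counts as much as the majority class. The implementation below is a plain-Python illustration with hypothetical toy labels, not the authors' evaluation code.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy labels (hypothetical): 1 = hate speech, 0 = not hate speech.
score = macro_f1([1, 1, 0, 0], [1, 0, 0, 0])
```

In the toy case the per-class F1 values are 2/3 for class 1 and 4/5 for class 0, giving a macro-F1 of 11/15. An absolute gain of 0.154 is therefore large on a scale that runs from 0 to 1.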

[31] Separate Before You Compress: The WWHO Tokenization Architecture

Kusal Darshana

Main category: cs.CL

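TL;DR: This paper proposes SGPE, a new tokenization algorithm, and WWHO, a three-layer architecture, for Abugida scripts (such as Sinhala and Devanagari). Together they substantially reduce token counts, make LLM processing of complex scripts more efficient, and guarantee that syllables are never split.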

Details Motivation: Standard BPE tokenizers split the multi-codepoint conjuncts of structurally complex Abugida scripts (such as Sinhala and Devanagari) into meaningless sub-units, which lowers reasoning efficiency and raises the "Token Tax", especially for speakers of Global South languages. Method: Proposes the three-layer WWHO (Where-What-How Often) architecture and the Syllable-aware Grapheme Pair Encoding (SGPE) algorithm, which decouple the script's linguistic rules from statistical compression and support seamless multilingual tokenization; trained on a cleaned 30-million-sentence dataset and evaluated on a test set of about 1.5 million sentences. Result: For Sinhala, TWR is 1.274 (61.7% fewer tokens than o200k); for Hindi, TWR is 1.181 (27.0% fewer); on mixed-script data, overall TWR is 1.240, with reductions of 36.7%, 39.6%, and 60.2% relative to o200k, Llama 4 Scout, and DeepSeek V3; the usable context window extends by up to 4.38 times, with a "Linguistic Zero-Breakage Guarantee". Conclusion: SGPE with WWHO addresses inefficient tokenization of Abugida scripts in LLMs while balancing linguistic soundness and compression, providing a key technique for efficient LLM use in Global South languages. Abstract: Current Large Language Models (LLMs) mostly use BPE (Byte Pair Encoding) based tokenizers, which are very effective for structurally simple Latin-script text such as English. However, standard BPE tokenizers struggle to process complex Abugida scripts due to their structural complexity. The problem is that these tokenizers break complex conjuncts, which are multi-codepoint grapheme clusters, into meaningless sub-character units. This degrades the LLM's reasoning efficiency by forcing it to learn basic orthographic structures at inference time and raises inference costs, resulting in a significant "Token Tax" for the Global South. We propose a new three-layer architecture, WWHO (Where-What-How Often), and an algorithm named SGPE (Syllable-aware Grapheme Pair Encoding) that separates the linguistic rules of the script from the statistical compression process while enabling seamless multilingual tokenization. Using Sinhala and Devanagari (Hindi/Sanskrit) as highly complex Abugida scripts, we trained WWHO on a cleaned 30-million-sentence dataset and evaluated on a 1,499,950-sentence test set. For Sinhala, SGPE achieves a Token to Word Ratio (TWR) of 1.274 with 4.83 characters per token, representing a 61.7 percent reduction in tokens compared to OpenAI's o200k base. For Hindi, it achieves a TWR of 1.181 (27.0 percent reduction vs o200k). 
On the mixed-script (Sinhala, Devanagari, and English) dataset, SGPE achieves an overall TWR of 1.240, representing token reductions of 36.7 percent, 39.6 percent, and 60.2 percent relative to o200k base, Llama 4 Scout, and DeepSeek V3, respectively. This effectively extends the usable context window by up to 4.38 times for these Abugida languages while ensuring a Linguistic Zero-Breakage Guarantee, which ensures that no valid syllable is ever split across multiple tokens.
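
The relation between a token reduction and the usable context window is simple arithmetic, sketched below. If a tokenizer needs a fraction r fewer tokens for the same text, a fixed token budget holds 1 / (1 - r) times as much text. The function names are illustrative assumptions; only the 61.7 percent figure comes from the abstract, and the baseline behind the reported 4.38x maximum is not stated there.

```python
def token_word_ratio(n_tokens, n_words):
    """Token to Word Ratio (TWR): average tokens emitted per word."""
    return n_tokens / n_words

def token_reduction(baseline_tokens, new_tokens):
    """Fractional reduction in token count relative to a baseline tokenizer."""
    return 1.0 - new_tokens / baseline_tokens

def context_multiplier(reduction):
    """How much more text fits in a fixed token budget after the reduction."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)

# Sinhala figure from the abstract: 61.7 percent fewer tokens than o200k base.
sinhala_multiplier = context_multiplier(0.617)  # about 2.61
```

Under this relation, a 4.38x extension corresponds to a reduction of roughly 77 percent against whichever baseline that maximum was measured on.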

[32] Beyond Detection: Rethinking Education in the Age of AI-writing

Maria Marina,Alexander Panchenko,Vasily Konovalov

Main category: cs.CL

TL;DR: This paper argues that writing is essential for human deep learning and cognitive development, and warns against outsourcing it to AI tools like ChatGPT; it advocates for pedagogical adaptation and AI-text literacy instead of bans.

Details Motivation: The growing use of generative AI in education and daily life risks reducing writing to a mere formality, undermining its role in thinking and learning. Method: Drawing on cognitive psychology, educational theory, and real classroom practices, the paper analyzes the cognitive value of writing and evaluates current AI-text detection capabilities. Result: The paper concludes that the writing process itself, despite being messy and slow, is where deep human learning occurs, and that recognizing AI-generated text is becoming a vital 21st-century literacy. Conclusion: Writing must be preserved as a cognitive tool; educators should focus on smarter pedagogy and fostering AI-text literacy rather than imposing bans. Abstract: As generative AI tools like ChatGPT enter classrooms, workplaces and everyday thinking, writing is at risk of becoming a formality -- outsourced, automated and stripped of its cognitive value. But writing is not just output; it is how we learn to think. This paper explores what is lost when we let machines write for us, drawing on cognitive psychology, educational theory and real classroom practices. We argue that the process of writing -- messy, slow, often frustrating -- is where deep human learning happens. The paper also explores the current possibilities of AI-text detection, how educators can adapt through smarter pedagogy rather than bans, and why the ability to recognize machine-generated language may become a critical literacy of the 21st century. In a world where writing can be faked, learning cannot.

[33] Adaptive Chunking: Optimizing Chunking-Method Selection for RAG

Paulo Roberto de Moura Júnior,Jean Lelong,Annabelle Blangero

Main category: cs.CL

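TL;DR: This paper proposes Adaptive Chunking, a framework that uses five novel intrinsic document metrics (such as References Completeness and Intrachunk Cohesion) to evaluate chunking strategies and select the most suitable one for each document, significantly improving RAG performance.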

Details Motivation: Chunking methods in existing RAG pipelines are mostly one-size-fits-all and adapt poorly to diverse text structures; chunking also lacks a quality evaluation framework that is independent of the downstream task. Method: Proposes the Adaptive Chunking framework, defines five intrinsic chunking-quality metrics (RC, ICC, DCC, BI, SC), and designs two new chunkers (an LLM-regex splitter and a split-then-merge recursive splitter) plus post-processing techniques, enabling metric-guided adaptive chunker selection. Result: Validated on a multi-domain corpus spanning legal, technical, and social science texts, the framework raises RAG answer correctness from 62-64% to 72% and increases the number of successfully answered questions by over 30% (65 vs. 49), without changing models or prompts. Conclusion: Document-aware adaptive chunking combined with intrinsic metric evaluation is a practical and effective path to more robust RAG. Abstract: The effectiveness of Retrieval-Augmented Generation (RAG) is highly dependent on how documents are chunked, that is, segmented into smaller units for indexing and retrieval. Yet, commonly used "one-size-fits-all" approaches often fail to capture the nuanced structure and semantics of diverse texts. Despite its central role, chunking lacks a dedicated evaluation framework, making it difficult to assess and compare strategies independently of downstream performance. We challenge this paradigm by introducing Adaptive Chunking, a framework that selects the most suitable chunking strategy for each document based on a set of five novel intrinsic, document-based metrics: References Completeness (RC), Intrachunk Cohesion (ICC), Document Contextual Coherence (DCC), Block Integrity (BI), and Size Compliance (SC), which directly assess chunking quality across key dimensions. To support this framework, we also introduce two new chunkers, an LLM-regex splitter and a split-then-merge recursive splitter, alongside targeted post-processing techniques. On a diverse corpus spanning legal, technical, and social science domains, our metric-guided adaptive method significantly improves downstream RAG performance. Without changing models or prompts, our framework improves RAG outcomes, raising answer correctness to 72% (from 62-64%) and increasing the number of successfully answered questions by over 30% (65 vs. 49). These results demonstrate that adaptive, document-aware chunking, guided by a complementary suite of intrinsic metrics, offers a practical and effective path to more robust RAG systems. Code available at https://github.com/ekimetrics/adaptive-chunking.
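
The per-document selection idea in this entry can be illustrated with a minimal sketch. It is not the authors' implementation: the two chunkers, the `size_compliance` function (a simplified stand-in loosely inspired by the Size Compliance metric), the length bounds, and the unweighted averaging of metric scores are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

Chunker = Callable[[str], List[str]]
Metric = Callable[[List[str]], float]


def fixed_size_chunker(size: int) -> Chunker:
    """Return a chunker that cuts text into character windows of at most `size`."""
    def chunk(text: str) -> List[str]:
        return [text[i:i + size] for i in range(0, len(text), size)]
    return chunk


def paragraph_chunker(text: str) -> List[str]:
    """Split on blank lines and drop empty pieces."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def size_compliance(chunks: List[str], lo: int = 20, hi: int = 200) -> float:
    """Hypothetical metric: share of chunks whose length lies in [lo, hi]."""
    if not chunks:
        return 0.0
    return sum(lo <= len(c) <= hi for c in chunks) / len(chunks)


def select_chunker(
    text: str,
    chunkers: Dict[str, Chunker],
    metrics: List[Metric],
) -> Tuple[str, List[str]]:
    """Run every chunker on one document and keep the best mean metric score."""
    best_name, best_chunks, best_score = "", [], float("-inf")
    for name, chunker in chunkers.items():
        chunks = chunker(text)
        score = sum(metric(chunks) for metric in metrics) / len(metrics)
        if score > best_score:
            best_name, best_chunks, best_score = name, chunks, score
    return best_name, best_chunks
```

Because the choice is made per document, a corpus with mixed structure can end up using a different chunker for each file, which is the behaviour the abstract contrasts with one-size-fits-all chunking.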

[34] Large Language Model as Token Compressor and Decompressor

Wenbing Li,Zikai Song,Jielei Zhang,Tianhao Zhao,Junkai Lin,Yiran Wang,Wei Yang

Main category: cs.CL

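TL;DR: This paper proposes using an off-the-shelf large language model (LLM) as an efficient token compressor and decompressor: a self-expressive autoencoding framework fine-tunes the LLM to produce content-adaptive discrete latent codes (Z-tokens), achieving up to 18 times token compression while preserving reconstruction accuracy and downstream task performance.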

Details Motivation: Long-text processing suffers from high token overhead and limited context length, so an efficient, semantics-preserving compression mechanism is needed. Method: Designs a self-expressive autoencoding learning framework that fine-tunes a pretrained LLM with lightweight LoRA-based adapter heads, turning it into a token compressor-decompressor that encodes long texts into variable-length discrete latent codes (Z-tokens) and reconstructs the original text exactly. Result: Achieves up to 18 times token compression on Wikipedia, CNN/DailyMail, HotpotQA, and Qulac-style long-query datasets with high reconstruction fidelity and preserved downstream performance, and supports prompt compression and autoregressive generation in the Z-token space. Conclusion: An off-the-shelf LLM can be effectively repurposed as a content-adaptive token compressor, offering a new path toward efficient long-context reasoning. Abstract: In this paper, we establish the novel insight that an off-the-shelf LLM can function as an excellent token compressor and decompressor. To demonstrate this, we design a self-expressive autoencoding learning framework that fine-tunes a pretrained LLM to translate long texts into a compact internal language of discrete, variable-length latent codes, termed Z-tokens, and to reconstruct the original text exactly from them. The resulting representation is content-adaptive: semantically dense segments receive more Z-tokens, while redundant or predictable regions are aggressively compressed, via lightweight LoRA-based adapter heads. Empirically, our method achieves up to 18 times token reduction on Wikipedia, CNN/DailyMail, HotpotQA, and Qulac-style long-query datasets, while preserving reconstruction fidelity and downstream performance. This simple yet effective design supports applications including prompt compression and autoregressive generation directly in the Z-token space, offering a potential pathway toward token-efficient long-context reasoning.

[35] TAPO: Translation Augmented Policy Optimization for Multilingual Mathematical Reasoning

Xu Huang,Zhejian Lai,Zixian Huang,Jiajun Chen,Shujian Huang

Main category: cs.CL

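TL;DR: This paper proposes TAPO, a reinforcement learning framework that uses English as a pivot, follows an understand-then-reason paradigm, and introduces a step-level relative advantage mechanism, effectively improving large language model performance on multilingual mathematical reasoning.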

Details Motivation: Large language models excel at English mathematical reasoning, but a significant performance gap remains in multilingual settings, largely due to deficiencies in language understanding. Method: Proposes Translation-Augmented Policy Optimization (TAPO), built on GRPO, which uses English as a pivot, follows a two-stage understand-then-reason paradigm, and uses a step-level relative advantage mechanism to decouple understanding from reasoning so that translation quality rewards can be integrated. Result: TAPO outperforms baseline methods on both multilingual mathematical reasoning and translation tasks and generalizes well to unseen languages and out-of-domain tasks. Conclusion: TAPO effectively combines language understanding with reasoning, is compatible with various models, and generalizes well, offering a new approach to multilingual reasoning. Abstract: Large Language Models (LLMs) have demonstrated remarkable proficiency in English mathematical reasoning, yet a significant performance disparity persists in multilingual contexts, largely attributed to deficiencies in language understanding. To bridge this gap, we introduce Translation-Augmented Policy Optimization (TAPO), a novel reinforcement learning framework built upon GRPO. TAPO enforces an explicit alignment strategy where the model leverages English as a pivot and follows an understand-then-reason paradigm. Crucially, we employ a step-level relative advantage mechanism that decouples understanding from reasoning, allowing the integration of translation quality rewards without introducing optimization conflicts. Extensive experiments reveal that TAPO effectively synergizes language understanding with reasoning capabilities and is compatible with various models. It outperforms baseline methods in both multilingual mathematical reasoning and translation tasks, while generalizing well to unseen languages and out-of-domain tasks.

[36] Navigating the Prompt Space: Improving LLM Classification of Social Science Texts Through Prompt Engineering

Erkan Gunes,Christoffer Florczak,Tevfik Murat Yildirim

Main category: cs.CL

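TL;DR: This paper examines how prompt engineering can improve the performance of large language models (LLMs) on social science text classification. A modest increase in prompt context yields the largest accuracy gain, while further increases can reduce performance; effects vary across models, tasks, and batch sizes, so each task needs its own validation.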

Details Motivation: LLM-based text classification in the social sciences promises cost savings, but performance varies widely, so systematic ways to improve accuracy are needed. Method: Systematically varies three aspects of prompt engineering (label descriptions, instructional nudges, and few-shot examples) and tests them empirically on two different cases. Result: A minimal increase in prompt context yields the largest performance gain; further increases bring only marginal gains and sometimes reduce accuracy; there is substantial heterogeneity across models, tasks, and batch sizes. Conclusion: LLM text-coding tasks cannot rely on general rules; each task must be validated and tuned individually. Abstract: Recent developments in text classification using Large Language Models (LLMs) in the social sciences suggest that costs can be cut significantly, while performance can sometimes rival existing computational methods. However, with a wide variance in performance in current tests, we move to the question of how to maximize performance. In this paper, we focus on prompt context as a possible avenue for increasing accuracy by systematically varying three aspects of prompt engineering: label descriptions, instructional nudges, and few-shot examples. Across two different examples, our tests illustrate that a minimal increase in prompt context yields the highest increase in performance, while further increases in context only tend to yield marginal performance increases thereafter. Alarmingly, increasing prompt context sometimes decreases accuracy. Furthermore, our tests suggest substantial heterogeneity across models, tasks, and batch size, underlining the need for individual validation of each LLM coding task rather than reliance on general rules.
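
To make the three varied prompt components concrete, here is a small sketch of a prompt builder in which label descriptions, an instructional nudge, and few-shot examples can each be switched on or off. The function name `build_prompt`, the prompt wording, and the layout are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Optional, Tuple


def build_prompt(
    text: str,
    labels: Dict[str, str],
    nudge: Optional[str] = None,
    examples: Optional[List[Tuple[str, str]]] = None,
) -> str:
    """Assemble a classification prompt with optional extra context.

    labels   maps label name -> description ("" means name only)
    nudge    optional instruction such as "Answer with the label only."
    examples optional (text, label) pairs used as few-shot demonstrations
    """
    label_lines = [
        f"- {name}: {description}" if description else f"- {name}"
        for name, description in labels.items()
    ]
    parts = [
        "Classify the text into exactly one of these labels:\n"
        + "\n".join(label_lines)
    ]
    if nudge:
        parts.append(nudge)
    for example_text, example_label in examples or []:
        parts.append(f"Text: {example_text}\nLabel: {example_label}")
    # The item to classify always comes last, with the label left open.
    parts.append(f"Text: {text}\nLabel:")
    return "\n\n".join(parts)
```

Keeping each component behind its own argument makes it straightforward to generate the minimal and richer prompt variants for the same text and compare accuracy per task, in line with the entry's call for task-level validation.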

[37] Translation Asymmetry in LLMs as a Data Augmentation Factor: A Case Study for 6 Romansh Language Varieties

Jannis Vamvas,Ignacio Pérez Prat,Angela Heldstab,Dominic P. Fischer,Sina Ahmadi,Rico Sennrich

Main category: cs.CL

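TL;DR: This paper proposes a new data augmentation strategy for machine translation of Romansh, a low-resource language: augmentation should follow the direction of the resource gradient. The approach surpasses Gemini 3 Pro by 23 BLEU on the lowest-resource variety and yields the first model that translates fluently into the individual Romansh varieties.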

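Details Motivation: Existing low-resource machine translation methods that rely on LLM-generated synthetic data fail for Romansh because LLMs tend to confuse its 6 varieties. Method: Aligns the direction of data augmentation with the resource gradient between source and target language, rather than relying simply on cross-lingual generation by LLMs. Result: Surpasses Gemini 3 Pro by 23 BLEU on the lowest-resource Romansh variety; a human evaluation confirms it is the first model to generate fluent translations in the individual Romansh varieties. Conclusion: The direction of data augmentation should be aligned with the language resource gradient, a principle that matters for translating low-resource languages with multiple varieties. Abstract: Recent strategies for low-resource machine translation rely on LLMs to generate synthetic data from higher-resource languages. We find that this method fails for Romansh, because LLMs tend to confuse its 6 distinct language varieties. Our experiments show that instead, the direction of data augmentation should be aligned with the resource gradient between source and target language. This approach surpasses Gemini 3 Pro in the lowest-resource variety of Romansh by 23 BLEU. A human evaluation confirms that our experiments yield the first model that generates fluent translations in the individual Romansh varieties.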

Pietro Dell'Oglio,Alessandro Bondielli,Francesco Marcelloni,Lucia C. Passaro

Main category: cs.CL

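TL;DR: This paper systematically evaluates 12 representative fake news detection methods on 10 English text datasets, covering traditional machine learning, deep learning, transformers, and cross-domain architectures. In-domain, multi-domain, and cross-domain experiments under a unified binary classification setup show that fine-tuned models generalize poorly, cross-domain models need large amounts of data, and LLMs show promise in zero- and few-shot settings.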

Details Motivation: The production and spread of fake news have grown more sophisticated, driven by LLMs and social media, and the robustness and generalization of existing detection methods under realistic conditions (such as domain shift and out-of-distribution data) remain unclear, calling for a systematic, standardized evaluation. Method: Selects 12 representative methods (traditional ML, DL, transformers, and cross-domain architectures), casts 10 heterogeneous English text datasets as a binary task (Real/Fake), and runs in-domain, multi-domain, and cross-domain experiments; labels are harmonized, with the resulting loss of semantic nuance acknowledged. Result: Fine-tuned models perform well in-domain but generalize poorly; cross-domain architectures reduce the performance drop but depend on large amounts of labeled data; LLMs show promise in zero- and few-shot settings; all results are bounded by the English, text-only setup and by possible dataset confounds and pre-training exposure. Conclusion: Current fake news detection methods still face serious challenges in cross-domain generalization; LLM-driven zero- and few-shot methods are a promising direction; future work should address more realistic evaluation protocols, multimodal information, and label modeling that preserves semantics. Abstract: In recent years, fake news detection has received increasing attention in public debate and scientific research. Despite advances in detection techniques, the production and spread of false information have become more sophisticated, driven by Large Language Models (LLMs) and the amplification power of social media. We present a critical assessment of 12 representative fake news detection approaches, spanning traditional machine learning, deep learning, transformers, and specialized cross-domain architectures. We evaluate these methods on 10 publicly available datasets differing in genre, source, topic, and labeling rationale. We address text-only English fake news detection as a binary classification task by harmonizing labels into "Real" and "Fake" to ensure a consistent evaluation protocol. We acknowledge that label semantics vary across datasets and that harmonization inevitably removes such semantic nuances. Each dataset is treated as a distinct domain. We conduct in-domain, multi-domain and cross-domain experiments to simulate real-world scenarios involving domain shift and out-of-distribution data. Fine-tuned models perform well in-domain but struggle to generalize. Cross-domain architectures can reduce this gap but are data-hungry, while LLMs offer a promising alternative through zero- and few-shot learning. Given inherent dataset confounds and possible pre-training exposure, results should be interpreted as robustness evaluations within this English, text-only protocol.

[39] Humans vs Vision-Language Models: A Unified Measure of Narrative Coherence

Nikolai Ilinykh,Hyewon Jang,Shalom Lappin,Asad Sayeed,Sharid Loáiciga

Main category: cs.CL

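TL;DR: This paper studies how vision-language models (VLMs) differ from humans in narrative coherence for visually grounded story generation. A multi-dimensional evaluation shows that, despite surface fluency, VLM narratives differ systematically from human ones in narrative structure (coreference, discourse relations, topic continuity, character persistence, and multimodal character grounding).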

Details Motivation: To examine whether narratives generated by VLMs in visually grounded settings have human-level narrative coherence rather than just surface fluency. Method: On the Visual Writing Prompts corpus, combines metrics covering coreference, discourse relation types, topic continuity, character persistence, and multimodal character grounding into a narrative coherence score, and compares human-written with VLM-generated narratives. Result: VLMs show broadly similar coherence profiles that differ systematically from those of humans; differences on individual measures are often subtle but become clearer when considered jointly; despite surface fluency, models organise discourse across a visually grounded story differently from humans. Conclusion: Current VLMs have not yet captured human-like coherence in visually grounded narratives; further modeling of discourse structure and multimodal grounding is needed. Abstract: We study narrative coherence in visually grounded stories by comparing human-written narratives with those generated by vision-language models (VLMs) on the Visual Writing Prompts corpus. Using a set of metrics that capture different aspects of narrative coherence, including coreference, discourse relation types, topic continuity, character persistence, and multimodal character grounding, we compute a narrative coherence score. We find that VLMs show broadly similar coherence profiles that differ systematically from those of humans. In addition, differences for individual measures are often subtle, but they become clearer when considered jointly. Overall, our results indicate that, despite human-like surface fluency, model narratives exhibit systematic differences from those of humans in how they organise discourse across a visually grounded story. Our code is available at https://github.com/GU-CLASP/coherence-driven-humans.

[40] PICon: A Multi-Turn Interrogation Framework for Evaluating Persona Agent Consistency

Minseo Kim,Sujeong Im,Junseong Choi,Junhee Lee,Chaeeun Shim,Edward Choi

Main category: cs.CL

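TL;DR: This paper proposes PICon, a framework that evaluates LLM-based persona agents through logically chained multi-turn questioning along three dimensions (internal, external, and retest consistency), and finds that existing systems still fall short of the human baseline.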

Details Motivation: There is no systematic method for verifying whether a persona agent stays free of contradictions and factual errors throughout an interaction; the work is inspired by a principle from interrogation methodology that systematic interrogation will expose the contradictions of even an elaborate fabricated identity. Method: Proposes the PICon evaluation framework, which uses logically chained multi-turn questioning to quantify internal consistency (freedom from self-contradiction), external consistency (alignment with real-world facts), and retest consistency (stability under repetition). Result: In a comparison of seven groups of persona agents with 63 real human participants, even systems previously reported as highly consistent fail to meet the human baseline on all three dimensions, revealing contradictions and evasive responses. Conclusion: PICon provides a conceptual foundation and a practical methodology for evaluating persona agents, and underlines the need for rigorous consistency checks before trusting them as substitutes for human participants. Abstract: Large language model (LLM)-based persona agents are rapidly being adopted as scalable proxies for human participants across diverse domains. Yet there is no systematic method for verifying whether a persona agent's responses remain free of contradictions and factual inaccuracies throughout an interaction. A principle from interrogation methodology offers a lens: no matter how elaborate a fabricated identity, systematic interrogation will expose its contradictions. We apply this principle to propose PICon, an evaluation framework that probes persona agents through logically chained multi-turn questioning. PICon evaluates consistency along three core dimensions: internal consistency (freedom from self-contradiction), external consistency (alignment with real-world facts), and retest consistency (stability under repetition). Evaluating seven groups of persona agents alongside 63 real human participants, we find that even systems previously reported as highly consistent fail to meet the human baseline across all three dimensions, revealing contradictions and evasive responses under chained questioning. This work provides both a conceptual foundation and a practical methodology for evaluating persona agents before trusting them as substitutes for human participants. We provide the source code and an interactive demo at: https://kaist-edlab.github.io/picon/

[41] Beyond Via: Analysis and Estimation of the Impact of Large Language Models in Academic Papers

Mingmeng Geng,Yuhang Dong,Thierry Poibeau

Main category: cs.CL

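TL;DR: Through an analysis of arXiv papers, this paper reveals subtle effects of large language models (LLMs) on word usage in academic writing, such as more frequent "beyond" and "via" in titles and less frequent "the" and "of" in abstracts. It shows that current classifiers struggle to identify which specific model generated a text, and it uses an interpretable linear approach to quantify the heterogeneity and dynamics of LLM usage.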

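Details Motivation: Prior work has paid insufficient attention to some LLM-driven shifts in word frequency in academic text, and lacks a quantitative characterization of how diverse and dynamic real-world LLM usage is. Method: Adopts a direct, highly interpretable linear approach that accounts for differences between LLMs and prompts, analyzes word-frequency trends in arXiv papers, and evaluates multi-class classification of texts by generating model. Result: Finds word-frequency shifts likely driven by LLMs (rising "beyond" and "via" in titles, falling "the" and "of" in abstracts); current classifiers perform poorly at multi-model identification; real-world LLM usage is shown to be heterogeneous and dynamic. Conclusion: LLMs have been quietly changing academic writing style, with effects that are partly shared across models and partly model-specific, and interpretable methods are needed to keep monitoring and assessing them. Abstract: Through an analysis of arXiv papers, we report several shifts in word usage that are likely driven by large language models (LLMs) but have not previously received sufficient attention, such as the increased frequency of "beyond" and "via" in titles and the decreased frequency of "the" and "of" in abstracts. Due to the similarities among different LLMs, experiments show that current classifiers struggle to accurately determine which specific model generated a given text in multi-class classification tasks. Meanwhile, variations across LLMs also result in evolving patterns of word usage in academic papers. By adopting a direct and highly interpretable linear approach and accounting for differences between models and prompts, we quantitatively assess these effects and show that real-world LLM usage is heterogeneous and dynamic.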
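
One simple way to picture a linear estimate of LLM influence is a two-component mixture of word frequencies, fitted by least squares. The sketch below is a generic illustration under that assumption and is not claimed to be the estimator used in the paper; the function name `estimate_llm_share` and all frequency values in the usage note are invented toy numbers.

```python
from typing import Dict


def estimate_llm_share(
    observed: Dict[str, float],
    baseline: Dict[str, float],
    llm: Dict[str, float],
) -> float:
    """Least-squares estimate of alpha in the assumed linear mixture

        observed[w] ~ (1 - alpha) * baseline[w] + alpha * llm[w]

    where each dict maps a word to its relative frequency. Only words
    present in all three dicts are used, and alpha is clipped to [0, 1].
    """
    words = sorted(observed.keys() & baseline.keys() & llm.keys())
    numerator = sum(
        (observed[w] - baseline[w]) * (llm[w] - baseline[w]) for w in words
    )
    denominator = sum((llm[w] - baseline[w]) ** 2 for w in words)
    if denominator == 0:
        raise ValueError("baseline and LLM frequencies are identical")
    return min(1.0, max(0.0, numerator / denominator))
```

For instance, with toy baseline frequencies `{"beyond": 0.001, "via": 0.002, "the": 0.060}` and toy LLM frequencies `{"beyond": 0.003, "via": 0.006, "the": 0.050}`, an observed profile lying a quarter of the way between them gives an estimate of about 0.25. A single alpha assumes one homogeneous LLM source, which is the kind of simplification the entry's findings on heterogeneous and dynamic usage argue against.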

[42] Measuring What Matters -- or What's Convenient?: Robustness of LLM-Based Scoring Systems to Construct-Irrelevant Factors

Cole Walsh,Rodica Ivan

Main category: cs.CL

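TL;DR: This study examines how construct-irrelevant factors affect a dual-architecture LLM-based automated scoring system. The system is robust to padding with meaningless text, spelling errors, and writing sophistication, but it lowers scores for large duplicated passages and off-topic content, overall showing good construct relevance.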

Details Motivation: With large language models increasingly used in automated scoring, their robustness to construct-irrelevant factors (such as spelling errors, off-topic content, and duplicated text) and the issue of "hallucinations" need closer study. Method: Uses a dual-architecture LLM-based scoring system and evaluates how it responds to several construct-irrelevant perturbations (meaningless padding, spelling errors, duplicated text, off-topic responses) applied to short essay-like responses in a situational judgment test. Result: The system is robust to padding, spelling errors, and writing sophistication; duplicating large passages lowers predicted scores on average; off-topic responses are heavily penalized. Conclusion: Future LLM-based scoring systems can be robust when designed with construct relevance in mind. Abstract: Automated systems have been widely adopted across the educational testing industry for open-response assessment and essay scoring. These systems commonly achieve performance levels comparable or superior to those of trained human raters, but have frequently been demonstrated to be vulnerable to the influence of construct-irrelevant factors (i.e., features of responses that are unrelated to the construct assessed) and adversarial conditions. Given the rising usage of large language models in automated scoring systems, there is a renewed focus on "hallucinations" and the robustness of these LLM-based automated scoring approaches to construct-irrelevant factors. This study investigates the effects of construct-irrelevant factors on a dual-architecture LLM-based scoring system designed to score short essay-like open-response items in a situational judgment test. It was found that the scoring system was generally robust to padding responses with meaningless text, spelling errors, and writing sophistication. Duplicating large passages of text resulted in lower scores predicted by the system, on average, contradicting results from previous studies of non-LLM-based scoring systems, while off-topic responses were heavily penalized by the scoring system. These results provide encouraging support for the robustness of future LLM-based scoring systems when designed with construct relevance in mind.

[43] Self-Improvement of Large Language Models: A Technical Overview and Future Outlook

Haoyan Yang,Mario Xerri,Solha Park,Huajian Zhang,Yiyang Feng,Sai Akhil Kogilathota,Jiawei Zhou

Main category: cs.CL

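TL;DR: This paper presents a unified, system-level framework for self-improving large language models, modeling self-improvement as a closed-loop lifecycle of four stages (data acquisition, data selection, model optimization, and inference refinement), with an autonomous evaluation layer that monitors and guides the whole process.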

Details Motivation: Human supervision is costly and scales poorly, and as models approach human-level capabilities in some domains, human feedback may no longer be informative enough; at the same time, models' growing ability to decide and act autonomously makes it possible to automate parts of the model development process. Method: Proposes a closed-loop lifecycle framework with four tightly coupled processes (data acquisition, data selection, model optimization, and inference refinement) and an autonomous evaluation layer spanning all stages; the model itself drives each stage while the evaluation layer provides guiding feedback. Result: Systematically reviews and technically analyzes representative methods for each component, identifies current limitations, and outlines research directions toward fully self-improving LLMs. Conclusion: Self-improvement is a key paradigm for overcoming the human-supervision bottleneck and enabling continued LLM progress; a systematic framework helps unify, compare, and advance related research. Abstract: As large language models (LLMs) continue to advance, improving them solely through human supervision is becoming increasingly costly and limited in scalability. As models approach human-level capabilities in certain domains, human feedback may no longer provide sufficiently informative signals for further improvement. At the same time, the growing ability of models to make autonomous decisions and execute complex actions naturally enables abstractions in which components of the model development process can be progressively automated. Together, these challenges and opportunities have driven increasing interest in self-improvement, where models autonomously generate data, evaluate outputs, and iteratively refine their own capabilities. In this paper, we present a system-level perspective on self-improving language models and introduce a unified framework that organizes existing techniques. We conceptualize the self-improvement system as a closed-loop lifecycle, consisting of four tightly coupled processes: data acquisition, data selection, model optimization, and inference refinement, along with an autonomous evaluation layer. Within this framework, the model itself plays a central role in driving each stage: collecting or generating data, selecting informative signals, updating its parameters, and refining outputs, while the autonomous evaluation layer continuously monitors progress and guides the improvement cycle across stages. Following this lifecycle perspective, we systematically review and analyze representative methods for each component from a technical standpoint. We further discuss current limitations and outline our vision for future research toward fully self-improving LLMs.

[44] S2D2: Fast Decoding for Diffusion LLMs via Training-Free Self-Speculation

Ligong Han,Hao Wang,Han Gao,Kai Xu,Akash Srivastava

Main category: cs.CL

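TL;DR: This paper proposes S2D2, a training-free self-speculative decoding framework for block-diffusion language models. The same pretrained model serves as drafter and, at block size one, as verifier; combined with lightweight routing policies, this substantially speeds up inference while maintaining or improving generation quality.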

Details Motivation: With few denoising steps, confidence-thresholded decoding for block-diffusion language models struggles to balance quality and speed: aggressive thresholds hurt quality, while conservative thresholds add unnecessary computation; existing fixes tend to require additional training or extra test-time compute. Method: S2D2 exploits the fact that a block-diffusion model becomes autoregressive at block size one, so the same pretrained model can act as both drafter (parallel diffusion) and verifier (a local sequence-level autoregressive critic); it inserts a speculative verification step into standard block-diffusion decoding and uses lightweight routing policies to decide when to verify. Result: Across three mainstream block-diffusion families (including SDAR and LLaDA2.1-Mini), S2D2 consistently improves the accuracy-speed tradeoff over strong confidence-thresholding baselines: on SDAR, up to 4.7 times speedup over autoregressive decoding and up to 1.57 times over a tuned dynamic decoding baseline, with accuracy gains of up to 4.5 points; on LLaDA2.1-Mini, it is complementary to built-in self-correction and, in a conservative setting, 4.4 times faster than the static baseline with slightly higher accuracy. Conclusion: S2D2 is a training-free, plug-and-play decoding optimization that eases the tension between robustness and efficiency in the practical few-step regime, offering a new route to faster-than-autoregressive generation. Abstract: Block-diffusion language models offer a promising path toward faster-than-autoregressive generation by combining block-wise autoregressive decoding with within-block parallel denoising. However, in the few-step regime needed for practical acceleration, standard confidence-thresholded decoding is often brittle: aggressive thresholds hurt quality, while conservative thresholds require unnecessary denoising steps. Existing approaches that address this issue either require additional training or incur extra test-time compute. We present S2D2, a training-free self-speculative decoding framework for block-diffusion language models. Our key observation is that a block-diffusion model becomes autoregressive when the block size is reduced to one, allowing the same pretrained model to act as both drafter and verifier. S2D2 inserts a speculative verification step into standard block-diffusion decoding and uses lightweight routing policies to decide when verification is worth its cost. This yields a hybrid decoding trajectory in which diffusion proposes tokens in parallel, while the autoregressive mode acts as a local sequence-level critic. Across three mainstream block-diffusion families, S2D2 consistently improves the accuracy-speed tradeoff over strong confidence-thresholding baselines. On SDAR, we observe up to $4.7\times$ speedup over autoregressive decoding, and up to $1.57\times$ over a tuned dynamic decoding baseline while improving accuracy by up to $4.5$ points. On LLaDA2.1-Mini, S2D2 remains complementary to built-in self-correction, including a conservative setting where it is $4.4\times$ faster than the static baseline with slightly higher accuracy.
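
The draft-then-verify pattern behind speculative decoding can be sketched in a few lines. This is the generic greedy acceptance rule, not S2D2's actual procedure: the routing policies and the block-size-one verification pass are not reproduced, and `next_token` is a hypothetical stand-in for the verifier's greedy next-token choice.

```python
from typing import Callable, List, Sequence


def verify_draft(
    prefix: Sequence[int],
    draft: Sequence[int],
    next_token: Callable[[Sequence[int]], int],
) -> List[int]:
    """Greedy speculative verification of a drafted token block.

    Drafted tokens are accepted while they match the verifier's greedy
    choice. At the first mismatch the verifier's token is emitted instead
    and the rest of the draft is discarded.
    """
    context = list(prefix)
    accepted: List[int] = []
    for token in draft:
        expected = next_token(context)
        if token != expected:
            accepted.append(expected)  # correction replaces the bad draft token
            return accepted
        accepted.append(token)
        context.append(token)
    return accepted


# Toy verifier (assumption): the next token is the previous one plus 1, mod 10.
toy_verifier = lambda ctx: (ctx[-1] + 1) % 10
print(verify_draft([1, 2], [3, 4, 9, 6], toy_verifier))  # prints [3, 4, 5]
```

A real verifier would score all drafted positions in one forward pass rather than calling the model once per token; the loop here only makes the acceptance rule explicit. Each verification step still yields at least one token, so output is never slower in tokens per verifier pass than plain autoregressive decoding.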

[45] Natural-Language Agent Harnesses

Linyue Pan,Lexiao Zou,Shuo Guo,Jingchen Ni,Hai-Tao Zheng

Main category: cs.CL

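TL;DR: This paper proposes Natural-Language Agent Harnesses (NLAHs) and the Intelligent Harness Runtime (IHR), which decouple agent control logic from hard-coded controllers by externalizing it in natural language and executing it through a shared runtime, making harnesses easier to transfer, compare, and study.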

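Details Motivation: Agent performance depends heavily on harness engineering, but harness design is usually buried in controller code and runtime-specific conventions, making it hard to transfer, compare, and study as a scientific object. Method: Introduces Natural-Language Agent Harnesses (NLAHs), which describe harness behavior in editable natural language, and the Intelligent Harness Runtime (IHR), which executes NLAHs through explicit contracts, durable artifacts, and lightweight adapters. Result: On coding and computer-use benchmarks, conducts controlled evaluations of operational viability, module ablation, and code-to-text harness migration, validating the method's effectiveness and flexibility. Conclusion: Externalizing high-level control logic as an executable natural-language artifact is feasible and beneficial; NLAHs and IHR offer a more scientific, reproducible, and interoperable paradigm for agent engineering. Abstract: Agent performance increasingly depends on harness engineering, yet harness design is usually buried in controller code and runtime-specific conventions, making it hard to transfer, compare, and study as a scientific object. We ask whether the high-level control logic of an agent harness can instead be externalized as a portable executable artifact. We introduce Natural-Language Agent Harnesses (NLAHs), which express harness behavior in editable natural language, and Intelligent Harness Runtime (IHR), a shared runtime that executes these harnesses through explicit contracts, durable artifacts, and lightweight adapters. Across coding and computer-use benchmarks, we conduct controlled evaluations of operational viability, module ablation, and code-to-text harness migration.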

cs.CV [Back]

[46] MedOpenClaw: Auditable Medical Imaging Agents Reasoning over Uncurated Full Studies

Weixiang Shen,Yanzhu Hu,Che Liu,Junde Wu,Jiayuan Zhu,Chengzhi Shen,Min Xu,Yueming Jin,Benedikt Wiestler,Daniel Rueckert,Jiazhen Pan

Main category: cs.CV

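TL;DR: This paper proposes the MEDOPENCLAW runtime and the MEDFLOWBENCH benchmark for evaluating how vision-language models actively navigate and make decisions over real clinical 3D multimodal medical imaging, and finds that current VLMs perform worse when given professional support tools because they lack precise spatial grounding.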

Details Motivation: Existing evaluations of medical imaging VLMs are limited to manually curated 2D images, which departs from clinical practice, where an agent must actively navigate 3D multi-sequence or multi-modality data to support a diagnostic decision. Method: Builds MEDOPENCLAW, an auditable runtime that lets VLMs operate dynamically within standard medical tools or viewers (e.g., 3D Slicer), and on top of it MEDFLOWBENCH, a full-study benchmark covering multi-sequence brain MRI and lung CT/PET with viewer-only, tool-use, and open-method tracks. Result: State-of-the-art models (e.g., Gemini 3.1 Pro and GPT-5.4) can solve basic study-level tasks in viewer-only mode, but their performance degrades when given professional tools, mainly due to a lack of precise spatial grounding. Conclusion: MEDOPENCLAW and MEDFLOWBENCH extend VLM evaluation from static-image perception to interactive, full-study clinical workflows, laying a foundation for auditable and reproducible medical imaging agents. Abstract: Currently, evaluating vision-language models (VLMs) in medical imaging tasks oversimplifies clinical reality by relying on pre-selected 2D images that demand significant manual labor to curate. This setup misses the core challenge of real-world diagnostics: a true clinical agent must actively navigate full 3D volumes across multiple sequences or modalities to gather evidence and ultimately support a final decision. To address this, we propose MEDOPENCLAW, an auditable runtime designed to let VLMs operate dynamically within standard medical tools or viewers (e.g., 3D Slicer). On top of this runtime, we introduce MEDFLOWBENCH, a full-study medical imaging benchmark covering multi-sequence brain MRI and lung CT/PET. It systematically evaluates medical agentic capabilities across viewer-only, tool-use, and open-method tracks. Initial results reveal a critical insight: while state-of-the-art LLMs/VLMs (e.g., Gemini 3.1 Pro and GPT-5.4) can successfully navigate the viewer to solve basic study-level tasks, their performance paradoxically degrades when given access to professional support tools due to a lack of precise spatial grounding. By bridging the gap between static-image perception and interactive clinical workflows, MEDOPENCLAW and MEDFLOWBENCH establish a reproducible foundation for developing auditable, full-study medical imaging agents.

[47] From Weights to Concepts: Data-Free Interpretability of CLIP via Singular Vector Decomposition

Francesco Gentile,Nicola Dall'Asen,Francesco Tonini,Massimiliano Mancini,Lorenzo Vaquero,Elisa Ricci

Main category: cs.CV

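TL;DR: This paper proposes SITH, a data-free, training-free method for interpreting CLIP's vision transformer in weight space. Using singular vector decomposition and the COMP algorithm, it yields fine-grained, semantically coherent intra-head explanations and supports interpretable model editing and analysis of adaptation.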

Details Motivation: Existing interpretability methods for vision-language models rely on activations, which makes them dataset-dependent, vulnerable to data bias, and often limited to coarse head-level explanations; a more robust, fine-grained, data-free approach is needed. Method: Proposes SITH: in the weight space of CLIP's vision transformer, each attention head's value-output matrix is decomposed into singular vectors, and each singular vector is explained by the new COMP (Coherent Orthogonal Matching Pursuit) algorithm as a sparse, semantically coherent combination of human-interpretable concepts. Result: SITH yields coherent, faithful intra-head explanations (validated through reconstruction fidelity and interpretability experiments); it supports precise, interpretable weight-space edits that amplify or suppress concepts and improve downstream performance without retraining; and it shows that fine-tuning primarily reweights a stable semantic basis rather than learning entirely new features. Conclusion: SITH is a data-free, training-free, weight-space method for fine-grained interpretability that improves model transparency and offers a new way to approach controllable model editing and the study of adaptation. Abstract: As vision-language models are deployed at scale, understanding their internal mechanisms becomes increasingly critical. Existing interpretability methods predominantly rely on activations, making them dataset-dependent, vulnerable to data bias, and often restricted to coarse head-level explanations. We introduce SITH (Semantic Inspection of Transformer Heads), a fully data-free, training-free framework that directly analyzes CLIP's vision transformer in weight space. For each attention head, we decompose its value-output matrix into singular vectors and interpret each one via COMP (Coherent Orthogonal Matching Pursuit), a new algorithm that explains them as sparse, semantically coherent combinations of human-interpretable concepts. We show that SITH yields coherent, faithful intra-head explanations, validated through reconstruction fidelity and interpretability experiments. This allows us to use SITH for precise, interpretable weight-space model edits that amplify or suppress specific concepts, improving downstream performance without retraining. Furthermore, we use SITH to study model adaptation, showing how fine-tuning primarily reweights a stable semantic basis rather than learning entirely new features.

[48] ReDiPrune: Relevance-Diversity Pre-Projection Token Pruning for Efficient Multimodal LLMs

An Yu,Ting Yu Tsai,Zhenfei Zhang,Weiheng Lu,Felix X. -F. Ye,Ming-Ching Chang

Main category: cs.CV

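TL;DR: ReDiPrune is a training-free visual token pruning method that selects informative visual tokens before the vision-language projector, balancing text relevance and diversity, and improves the accuracy-efficiency trade-off of multimodal large language models.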

Details Motivation: Multimodal large language models are computationally expensive because they must process many visual tokens, so an efficient, training-free way to reduce visual tokens is needed. Method: ReDiPrune prunes after the vision encoder and before the projector, scoring each token with a lightweight rule that combines text-conditioned relevance and max-min diversity, with no architectural changes or retraining. Result: Improves the accuracy-efficiency trade-off on four video and five image benchmarks; for example, on EgoSchema with LLaVA-NeXT-Video-7B, retaining only 15% of visual tokens yields a +2.0% absolute accuracy gain while reducing TFLOPs by more than 6 times. Conclusion: ReDiPrune is a plug-and-play, training-free visual token pruning method that improves the inference efficiency of multimodal large language models without hurting, and sometimes while improving, accuracy. Abstract: Recent multimodal large language models are computationally expensive because Transformers must process a large number of visual tokens. We present ReDiPrune, a training-free token pruning method applied before the vision-language projector, where visual features remain rich and discriminative. Unlike post-projection pruning methods that operate on compressed representations, ReDiPrune selects informative tokens directly from vision encoder outputs, preserving fine-grained spatial and semantic cues. Each token is scored by a lightweight rule that jointly considers text-conditioned relevance and max-min diversity, ensuring the selected tokens are both query-relevant and non-redundant. ReDiPrune is fully plug-and-play, requiring no retraining or architectural modifications, and can be seamlessly inserted between the encoder and projector. Across four video and five image benchmarks, it consistently improves the accuracy-efficiency trade-off. For example, on EgoSchema with LLaVA-NeXT-Video-7B, retaining only 15% of visual tokens yields a +2.0% absolute accuracy gain while reducing computation by more than $6\times$ in TFLOPs. Code is available at https://github.com/UA-CVML/ReDiPrune.
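
The abstract does not spell out the scoring rule, so the following is only one plausible reading of "relevance plus max-min diversity": a greedy selection that trades off cosine similarity to a text query against the minimum cosine distance to tokens already kept. The function names, the `weight` parameter, and the greedy scheme are assumptions, not the released ReDiPrune code.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_tokens(
    tokens: Sequence[Sequence[float]],
    query: Sequence[float],
    keep: int,
    weight: float = 0.5,
) -> List[int]:
    """Greedily keep `keep` token indices, balancing relevance and diversity.

    relevance = cosine(token, query)
    diversity = min cosine distance to the tokens already selected
    score     = weight * relevance + (1 - weight) * diversity
    The first pick uses relevance alone because nothing is selected yet.
    """
    relevance = [cosine(t, query) for t in tokens]
    selected: List[int] = []
    remaining = set(range(len(tokens)))

    def score(i: int) -> float:
        if not selected:
            return relevance[i]
        diversity = min(1.0 - cosine(tokens[i], tokens[j]) for j in selected)
        return weight * relevance[i] + (1.0 - weight) * diversity

    while remaining and len(selected) < keep:
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return sorted(selected)  # keep the original token order
```

Setting `weight=1.0` reduces the rule to plain top-k relevance, which tends to keep near-duplicate tokens; lowering it pushes the selection toward tokens that are far from everything already chosen.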

[49] KitchenTwin: Semantically and Geometrically Grounded 3D Kitchen Digital Twins

Quanyun Wu,Kyle Gao,Daniel Long,David A. Clausi,Jonathan Li,Yuhao Chen

Main category: cs.CV

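TL;DR: This paper proposes a scale-aware 3D fusion framework that uses a vision-language model (VLM)-guided geometric anchor mechanism to resolve scale and coordinate mismatches between transformer-predicted point clouds and locally reconstructed meshes, producing metrically consistent, semantically grounded digital twin environments.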

Details Motivation: Transformer-based feedforward reconstruction from monocular video can efficiently produce global point clouds, but these suffer from inherent scale ambiguity and inconsistent coordinate conventions, which prevents reliable fusion with locally reconstructed object meshes and hinders the object-centric, metrically accurate, semantically grounded digital twins needed for embodied AI training and evaluation. Method: Proposes a scale-aware 3D fusion framework with (1) a VLM-guided geometric anchor mechanism that recovers real-world metric scale, and (2) a geometry-aware registration pipeline that enforces physical plausibility through gravity-aligned vertical estimation, Manhattan-world structural constraints, and collision-free local refinement. Result: Experiments on real indoor kitchen environments show improved cross-network object alignment and geometric consistency, benefiting downstream tasks such as multi-primitive fitting and metric measurement; the authors also release an open-source indoor digital twin dataset with metrically scaled scenes and semantically grounded, registered object meshes. Conclusion: The method bridges the scale and coordinate gap between global reconstruction and local geometric reconstruction, offering a new approach to building high-fidelity, measurable, semantically rich digital twins. Abstract: Embodied AI training and evaluation require object-centric digital twin environments with accurate metric geometry and semantic grounding. Recent transformer-based feedforward reconstruction methods can efficiently predict global point clouds from sparse monocular videos, yet these geometries suffer from inherent scale ambiguity and inconsistent coordinate conventions. This mismatch prevents the reliable fusion of these dimensionless point cloud predictions with locally reconstructed object meshes. We propose a novel scale-aware 3D fusion framework that registers visually grounded object meshes with transformer-predicted global point clouds to construct metrically consistent digital twins. Our method introduces a Vision-Language Model (VLM)-guided geometric anchor mechanism that resolves this fundamental coordinate mismatch by recovering an accurate real-world metric scale. To fuse these networks, we propose a geometry-aware registration pipeline that explicitly enforces physical plausibility through gravity-aligned vertical estimation, Manhattan-world structural constraints, and collision-free local refinement. Experiments on real indoor kitchen environments demonstrate improved cross-network object alignment and geometric consistency for downstream tasks, including multi-primitive fitting and metric measurement. We additionally introduce an open-source indoor digital twin dataset with metrically scaled scenes and semantically grounded and registered object-centric mesh annotations.

[50] UniICL: Systematizing Unified Multimodal In-context Learning through a Capability-Oriented Taxonomy

Yicheng Xu,Jiangning Zhang,Zhucun Xue,Teng Hu,Ran Yi,Xiaobin Hu,Yong Liu,Dacheng Tao

Main category: cs.CV

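TL;DR: This paper proposes a six-level capability-oriented taxonomy for analyzing in-context learning (ICL) behavior in unified multimodal models, and builds the large-scale dataset UniICL-760K and the benchmark UniICL-Bench. It also designs a lightweight, plug-and-play Context-Adaptive Prototype Modulator that improves the stability and performance of ICL on multimodal understanding tasks.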

Details Motivation: In unified multimodal models, in-context learning is unstable, non-monotonic, and highly task-dependent because of cross-modal interference and varying cognitive demands, so systematic diagnosis and stabilization methods are needed. Method: A six-level capability-oriented taxonomy of demonstration functions; UniICL-760K, a large-scale corpus of 8-shot ICL episodes across 15 subtasks, together with the UniICL-Bench benchmark; and a lightweight Context-Adaptive Prototype Modulator module that stabilizes few-shot adaptation. Result: On UniICL-Bench, the approach outperforms larger-parameter multimodal large language model baselines on most understanding ICL tasks, showing strong competitiveness and generalization. Conclusion: Structured analysis grounded in cognitive capabilities, combined with a targeted architectural change, mitigates the sensitivity of multimodal ICL and offers a new approach to training-free adaptation of unified models. Abstract: In-context Learning enables training-free adaptation via demonstrations but remains highly sensitive to example selection and formatting. In unified multimodal models spanning understanding and generation, this sensitivity is exacerbated by cross-modal interference and varying cognitive demands. Consequently, In-context Learning efficacy is often non-monotonic and highly task-dependent. To diagnose these behaviors, we introduce a six-level capability-oriented taxonomy that categorizes the functional role of demonstrations from basic perception to high-order discernment. Guided by this cognitive framework, we construct UniICL-760K, a large-scale corpus featuring curated 8-shot In-context Learning episodes across 15 subtasks, alongside UniICL-Bench for rigorous, controlled evaluation. As an architectural intervention to stabilize few-shot adaptation, we propose the Context-Adaptive Prototype Modulator, a lightweight, plug-and-play module. Evaluations on UniICL-Bench show that our approach yields highly competitive unified results, outperforming larger-parameter multimodal large language model baselines on most understanding In-context Learning tasks. Data and code will be available soon at https://github.com/xuyicheng-zju/UniICL.

[51] BCMDA: Bidirectional Correlation Maps Domain Adaptation for Mixed Domain Semi-Supervised Medical Image Segmentation

Bentao Song,Jun Huang,Qingfeng Wang

Main category: cs.CV

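TL;DR: This paper proposes the BCMDA framework, which uses virtual domain bridging and prototype-based alignment and correction to address hindered knowledge transfer and confirmation bias in mixed domain semi-supervised medical image segmentation under domain shift and limited annotations.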

Details Motivation: Mixed domain semi-supervised medical image segmentation (MiDSS) faces two challenges: large distributional differences between labeled and unlabeled data, and inefficient learning from unlabeled data that causes severe confirmation bias. Method: The bidirectional correlation maps domain adaptation (BCMDA) framework has two parts. (1) Knowledge transfer via virtual domain bridging (KTVDB) synthesizes images from bidirectional correlation maps, mixes them with the originals using a fixed-ratio and a progressive dynamic MixUp to build a virtual domain, and then applies dual bidirectional CutMix for cross-domain knowledge transfer. (2) Prototypical alignment and pseudo label correction (PAPLC) uses learnable prototype cosine similarity classifiers for bidirectional prototype alignment between the virtual and real domains, and corrects pseudo labels to reduce confirmation bias. Result: Evaluations on three public multi-domain datasets show the method's superiority, with strong performance even when very few labeled samples are available. Conclusion: BCMDA mitigates the knowledge-transfer obstacles and confirmation bias that arise under domain shift and scarce annotations, improving mixed domain semi-supervised medical image segmentation. Abstract: In mixed domain semi-supervised medical image segmentation (MiDSS), achieving superior performance under domain shift and limited annotations is challenging. This scenario presents two primary issues: (1) distributional differences between labeled and unlabeled data hinder effective knowledge transfer, and (2) inefficient learning from unlabeled data causes severe confirmation bias. In this paper, we propose the bidirectional correlation maps domain adaptation (BCMDA) framework to overcome these issues. On the one hand, we employ knowledge transfer via virtual domain bridging (KTVDB) to facilitate cross-domain learning. First, to construct a distribution-aligned virtual domain, we leverage bidirectional correlation maps between labeled and unlabeled data to synthesize both labeled and unlabeled images, which are then mixed with the original images to generate virtual images using two strategies, a fixed ratio and a progressive dynamic MixUp. Next, dual bidirectional CutMix is used to enable initial knowledge transfer within the fixed virtual domain and gradual knowledge transfer from the dynamically transitioning labeled domain to the real unlabeled domains. On the other hand, to alleviate confirmation bias, we adopt prototypical alignment and pseudo label correction (PAPLC), which utilizes learnable prototype cosine similarity classifiers for bidirectional prototype alignment between the virtual and real domains, yielding smoother and more compact feature representations. Finally, we use prototypical pseudo label correction to generate more reliable pseudo labels.
Empirical evaluations on three public multi-domain datasets demonstrate the superiority of our method, particularly showing excellent performance even with very limited labeled samples. Code available at https://github.com/pascalcpp/BCMDA.

[52] LLaVA-LE: Large Language-and-Vision Assistant for Lunar Exploration

Gokce Inal,Pouyan Navard,Alper Yilmaz

Main category: cs.CV

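TL;DR: This paper presents LLaVA-LE, a vision-language model specialized for lunar surface and subsurface characterization, builds the large-scale multimodal lunar dataset LUCID, and uses a two-stage fine-tuning strategy that substantially improves performance on lunar terrain analysis tasks.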

Details Motivation: Vision-language models (VLMs) have seen little use in planetary science, mainly because no large-scale dataset pairs real planetary imagery with detailed scientific descriptions. Method: The authors build LUCID, a multimodal lunar dataset with 96k high-resolution panchromatic images and matching captions plus 81k question-answer pairs, and fine-tune LLaVA on it in two stages: first concept alignment for domain-specific terrain description, then instruction-tuned visual question answering. Result: On benchmarks covering several levels of reasoning about lunar terrain, LLaVA-LE achieves a 3.3x overall gain over Base LLaVA and 2.1x over the Stage 1 model. Its reasoning score of 1.070 exceeds the judge's own reference score (the judges are GPT and Gemini models). Conclusion: Domain-specific multimodal data and instruction tuning can advance the use of vision-language models in planetary exploration. Abstract: Recent advances in multimodal vision-language models (VLMs) have enabled joint reasoning over visual and textual information, yet their application to planetary science remains largely unexplored. A key hindrance is the absence of large-scale datasets that pair real planetary imagery with detailed scientific descriptions. In this work, we introduce LLaVA-LE (Large Language-and-Vision Assistant for Lunar Exploration), a vision-language model specialized for lunar surface and subsurface characterization. To enable this capability, we curate a new large-scale multimodal lunar dataset, LUCID (LUnar Caption Image Dataset) consisting of 96k high-resolution panchromatic images paired with detailed captions describing lunar terrain characteristics, and 81k question-answer (QA) pairs derived from approximately 20k images in the LUCID dataset. Leveraging this dataset, we fine-tune LLaVA using a two-stage training curriculum: (1) concept alignment for domain-specific terrain description, and (2) instruction-tuned visual question answering. We further design evaluation benchmarks spanning multiple levels of reasoning complexity relevant to lunar terrain analysis. Evaluated against GPT and Gemini judges, LLaVA-LE achieves a 3.3x overall performance gain over Base LLaVA and 2.1x over our Stage 1 model, with a reasoning score of 1.070, exceeding the judge's own reference score, highlighting the effectiveness of domain-specific multimodal data and instruction tuning to advance VLMs in planetary exploration. Code is available at https://github.com/OSUPCVLab/LLaVA-LE.

[53] Lookalike3D: Seeing Double in 3D

Chandan Yeshwanth,Angela Dai

Main category: cs.CV

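TL;DR: This paper introduces the task of lookalike object detection in indoor scenes. Using multiview images and the semantic priors of large vision foundation models, it presents the Lookalike3D multiview image transformer and the 3DTwins dataset, improving IoU as well as downstream 3D reconstruction and part co-segmentation.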

Details Motivation: Existing 3D understanding and generation methods often overlook repeated objects, a pervasive source of information in real scenes, and cannot model identical or near-identical object pairs. Method: Lookalike3D, a multiview image transformer that draws on strong semantic priors from large image foundation models to classify object pairs as identical, similar, or different; and the 3DTwins dataset of 76k annotated pairs, built on ScanNet++. Result: A 104% IoU improvement over baselines on lookalike object detection, along with gains on downstream joint 3D object reconstruction and part co-segmentation. Conclusion: Repeated and lookalike objects are a powerful cue for consistent, high-quality 3D perception, and the proposed method and dataset give this direction practical tools. Abstract: 3D object understanding and generation methods produce impressive results, yet they often overlook a pervasive source of information in real-world scenes: repeated objects. We introduce the task of lookalike object detection in indoor scenes, which leverages repeated and complementary cues from identical and near-identical object pairs. Given an input scene, the task is to classify pairs of objects as identical, similar or different using multiview images as input. To address this, we present Lookalike3D, a multiview image transformer that effectively distinguishes such object pairs by harnessing strong semantic priors from large image foundation models. To support this task, we collected the 3DTwins dataset, containing 76k manually annotated identical, similar and different pairs of objects based on ScanNet++, and show an improvement of 104% IoU over baselines. We demonstrate how our method improves downstream tasks such as enabling joint 3D object reconstruction and part co-segmentation, turning repeated and lookalike objects into a powerful cue for consistent, high-quality 3D perception. Our code, dataset and models will be made publicly available.

[54] Accurate Point Measurement in 3DGS -- A New Alternative to Traditional Stereoscopic-View Based Measurements

Deyan Deng,Rongjun Qin

Main category: cs.CV

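TL;DR: This paper proposes a 3D point measurement method based on 3D Gaussian Splatting (3DGS). It uses the high-quality, complete novel views that 3DGS renders for multi-view triangulation, which improves geometric measurement accuracy, especially on hard-to-reconstruct regions such as thin structures and sharp edges, and requires neither a stereo workstation nor operator stereoscopic capability.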

Details Motivation: 3D Gaussian Splatting (3DGS) excels at novel view synthesis, but its potential for accurate geometric measurement is underused. Existing measurement methods depend on incomplete or inaccurate meshes or on demanding stereoscopic workstations, which limits practicality and accuracy. Method: Users pick congruent points across the high-quality, complete views rendered by 3DGS, and the points are triangulated into accurate 3D coordinates. The approach supports two-view and multi-view intersection for higher accuracy and is implemented as a web-based application. Result: On several UAV aerial datasets, RMSE is 1-2 cm on well-defined points. On thin structures, mesh-based RMSE is 0.062 m versus 0.037 m for this method. On sharp corners that are poorly reconstructed in the mesh, this method measures all points with 0.013 m RMSE, while the mesh method fails entirely. Conclusion: 3DGS is suitable not only for rendering but also for accurate geometric measurement, combining ease of use, accuracy, and robustness, particularly on challenging geometry where traditional MVS or mesh methods fail. Abstract: 3D Gaussian Splatting (3DGS) has revolutionized real-time rendering with its state-of-the-art novel view synthesis, but its utility for accurate geometric measurement remains underutilized. Compared to multi-view stereo (MVS) point clouds or meshes, 3DGS rendered views present superior visual quality and completeness. However, current point measurement methods still rely on demanding stereoscopic workstations or direct picking on often-incomplete and inaccurate 3D meshes. As a novel view synthesizer, 3DGS renders exact source views and smoothly interpolates in-between views. This allows users to intuitively pick congruent points across different views while operating 3DGS models. By triangulating these congruent points, one can precisely generate 3D point measurements. This approach mimics traditional stereoscopic measurement but is significantly less demanding: it requires neither a stereo workstation nor specialized operator stereoscopic capability. Furthermore, it enables multi-view intersection (more than two views) for higher measurement accuracy. We implemented a web-based application to demonstrate this proof-of-concept (PoC). Using several UAV aerial datasets, we show this PoC allows users to successfully perform highly accurate point measurements, achieving accuracy matching or exceeding traditional stereoscopic methods on standard hardware. Specifically, our approach significantly outperforms direct mesh-based measurements. Quantitatively, our method achieves RMSEs in the 1-2 cm range on well-defined points.
More critically, on challenging thin structures where mesh-based RMSE was 0.062 m, our method achieved 0.037 m. On sharp corners poorly reconstructed in the mesh, our method successfully measured all points with a 0.013 m RMSE, whereas the mesh method failed entirely. Code is available at: https://github.com/GDAOSU/3dgs_measurement_tool.
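
Multi-view intersection, as described in the abstract above, can be sketched as a least-squares ray intersection. This is a generic textbook formulation, not the code in the linked repository. It assumes each picked pixel has already been converted into a viewing ray (camera centre plus direction) using the known camera pose; `triangulate_rays` and its helpers are hypothetical names.

```python
import math

def _normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

def _solve3(A, b):
    """Solve a 3x3 linear system by Gauss-Jordan elimination with partial pivoting."""
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("rays are (nearly) parallel; triangulation is ill-posed")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(3):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][3] / M[i][i] for i in range(3)]

def triangulate_rays(origins, directions):
    """Return the 3D point minimising the summed squared distance to all rays.

    For a ray with origin o and unit direction d, the squared distance from X
    is |(I - d d^T)(X - o)|^2, so the normal equations are
    sum(I - d d^T) X = sum((I - d d^T) o).
    """
    A = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for o, d in zip(origins, directions):
        d = _normalize(d)
        for i in range(3):
            for j in range(3):
                p = (1.0 if i == j else 0.0) - d[i] * d[j]
                A[i][j] += p
                b[i] += p * o[j]
    return _solve3(A, b)

# Three hypothetical camera centres looking at the same (noise-free) target.
target = [1.0, 2.0, 3.0]
origins = [[0.0, 0.0, 0.0], [5.0, 0.0, 0.0], [0.0, 6.0, 1.0]]
directions = [[t - o for t, o in zip(target, org)] for org in origins]
point = triangulate_rays(origins, directions)
```

With noisy picks the rays no longer meet exactly, and adding a third or fourth view averages the error down, which is the motivation for going beyond two views.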

[55] Scalable Object Relation Encoding for Better 3D Spatial Reasoning in Large Language Models

Shengli Zhou,Minghang Zheng,Feng Zheng,Yang Liu

Main category: cs.CV

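TL;DR: This paper proposes QuatRoPE, a positional embedding method that improves large language models on 3D spatial reasoning. It explicitly models spatial relations between objects with input length linear in the number of objects, and uses the IGRE mechanism to confine its influence so that the LLM's original capabilities are preserved.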

Details Motivation: Existing methods either struggle to extract spatial relations (absolute position encoding) or scale poorly (explicitly encoding all pairwise relations is quadratic in the number of objects), and the scarcity of 3D scene-language paired data limits training of spatial reasoning models. Method: QuatRoPE encodes 3D coordinates holistically as a quaternion-based positional embedding, so that the dot product in attention layers explicitly computes pairwise spatial relations while the input length stays linear in the number of objects. The Isolated Gated RoPE Extension (IGRE) applies QuatRoPE only to object-related tokens to avoid interfering with the LLM's existing positional embeddings. Result: Experiments on multiple 3D spatial reasoning benchmarks demonstrate the approach's effectiveness in preserving geometric consistency, improving reasoning, and scaling well; code and data are released. Conclusion: QuatRoPE is an efficient, geometrically consistent, and compatible positional encoding that equips LLMs for 3D spatial reasoning while keeping their original capabilities intact. Abstract: Spatial reasoning focuses on locating target objects based on spatial relations in 3D scenes, which plays a crucial role in developing intelligent embodied agents. Due to the limited availability of 3D scene-language paired data, it is challenging to train models with strong reasoning ability from scratch. Previous approaches have attempted to inject 3D scene representations into the input space of Large Language Models (LLMs) and leverage the pretrained comprehension and reasoning abilities for spatial reasoning. However, models encoding absolute positions struggle to extract spatial relations from prematurely fused features, while methods explicitly encoding all spatial relations (which is quadratic in the number of objects) as input tokens suffer from poor scalability. To address these limitations, we propose QuatRoPE, a novel positional embedding method with an input length that is linear to the number of objects, and explicitly calculates pairwise spatial relations through the dot product in attention layers. QuatRoPE's holistic vector encoding of 3D coordinates guarantees a high degree of spatial consistency, maintaining fidelity to the scene's geometric integrity. Additionally, we introduce the Isolated Gated RoPE Extension (IGRE), which effectively limits QuatRoPE's influence to object-related tokens, thereby minimizing interference with the LLM's existing positional embeddings and maintaining the LLM's original capabilities. Extensive experiments demonstrate the effectiveness of our approaches. The code and data are available at https://github.com/oceanflowlab/QuatRoPE.

[56] Is Geometry Enough? An Evaluation of Landmark-Based Gaze Estimation

Daniele Agostinelli,Thomas Agostinelli,Andrea Generosi,Maura Mengoni

Main category: cs.CV

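TL;DR: This paper systematically evaluates landmark-based gaze estimation. It introduces a standardized landmark extraction and normalization pipeline and trains lightweight regression models (XGBoost, a holistic MLP, and a binocular siamese MLP) on several large-scale datasets. In cross-domain settings the MLPs are comparable to ResNet18, suggesting that sparse geometric features suffice for robust, efficient, interpretable, and privacy-friendly edge applications.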

Details Motivation: CNN-based appearance methods are accurate but computationally expensive and hard to interpret, while landmark-based geometric methods are lightweight but their performance limits and generalization remain underexplored on modern benchmarks. Method: A standardized landmark extraction and normalization pipeline covering three large-scale datasets (Gaze360, ETH-XGaze, and GazeGene), and three lightweight models: XGBoost, a holistic MLP, and a siamese MLP that captures binocular geometry. Result: Landmark-based models perform worse in within-domain evaluation, likely because of noise from the landmark detector, but in cross-domain evaluation the MLP architectures generalize comparably to ResNet18 baselines. Conclusion: Sparse geometric features carry enough information for robust gaze estimation, offering a path to efficient, interpretable, and privacy-friendly edge applications. Abstract: Appearance-based gaze estimation frequently relies on deep Convolutional Neural Networks (CNNs). These models are accurate, but computationally expensive and act as "black boxes", offering little interpretability. Geometric methods based on facial landmarks are a lightweight alternative, but their performance limits and generalization capabilities remain underexplored in modern benchmarks. In this study, we conduct a comprehensive evaluation of landmark-based gaze estimation. We introduce a standardized pipeline to extract and normalize landmarks from three large-scale datasets (Gaze360, ETH-XGaze, and GazeGene) and train lightweight regression models, specifically Extreme Gradient Boosted trees and two neural architectures: a holistic Multi-Layer Perceptron (MLP) and a siamese MLP designed to capture binocular geometry. We find that landmark-based models exhibit lower performance in within-domain evaluation, likely due to noise introduced into the datasets by the landmark detector. Nevertheless, in cross-domain evaluation, the proposed MLP architectures show generalization capabilities comparable to those of ResNet18 baselines. These findings suggest that sparse geometric features encode sufficient information for robust gaze estimation, paving the way for efficient, interpretable, and privacy-friendly edge applications. The source code and generated landmark-based datasets are available at https://github.com/daniele-agostinelli/LandmarkGaze.git.

[57] Confidence-Based Mesh Extraction from 3D Gaussians

Lukas Radl,Felix Windisch,Andreas Kurz,Thomas Köhler,Michael Steiner,Markus Steinberger

Main category: cs.CV

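TL;DR: This paper proposes a self-supervised confidence framework for 3D Gaussian Splatting (3DGS). Learnable confidence values dynamically balance photometric and geometric supervision, and the method adds color and normal consistency losses plus an improved D-SSIM-based appearance model, improving the quality of mesh extraction in unbounded scenes while staying efficient.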

Details Motivation: Existing 3DGS methods struggle to extract accurate surfaces in scenes with abundant view-dependent effects, and remedies such as multi-view techniques, iterative extraction, or large pre-trained models sacrifice the inherent efficiency of 3DGS. Method: A self-supervised confidence framework in which learnable confidence values weight the photometric and geometric losses; losses that penalize per-primitive color and normal variance; and an improved appearance model obtained by decoupling the individual terms of the D-SSIM loss. Result: State-of-the-art results on unbounded mesh reconstruction while remaining highly efficient. Conclusion: The method reduces the surface ambiguities caused by view-dependent effects without giving up the efficiency of 3DGS, enabling high-quality, efficient mesh extraction. Abstract: Recently, 3D Gaussian Splatting (3DGS) greatly accelerated mesh extraction from posed images due to its explicit representation and fast software rasterization. While the addition of geometric losses and other priors has improved the accuracy of extracted surfaces, mesh extraction remains difficult in scenes with abundant view-dependent effects. To resolve the resulting ambiguities, prior works rely on multi-view techniques, iterative mesh extraction, or large pre-trained models, sacrificing the inherent efficiency of 3DGS. In this work, we present a simple and efficient alternative by introducing a self-supervised confidence framework to 3DGS: within this framework, learnable confidence values dynamically balance photometric and geometric supervision. Extending our confidence-driven formulation, we introduce losses which penalize per-primitive color and normal variance and demonstrate their benefits to surface extraction. Finally, we complement the above with an improved appearance model, by decoupling the individual terms of the D-SSIM loss. Our final approach delivers state-of-the-art results for unbounded meshes while remaining highly efficient.

[58] A Framework for Generating Semantically Ambiguous Images to Probe Human and Machine Perception

Yuqi Hu,Vasha DuTell,Ahna R. Girshick,Jennifer E. Corbett

Main category: cs.CV

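TL;DR: This paper proposes a framework that interpolates in the CLIP embedding space to generate spectra of semantically ambiguous images, and uses them to compare where humans and machine classifiers place concept boundaries. Models lean more towards "rabbit", human judgments align more closely with the CLIP embedding used for synthesis, and humans appear more sensitive to the guidance scale.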

Details Motivation: To examine how humans and machine vision models differ in placing concept boundaries on semantically ambiguous images, and how well model representations align with human perception. Method: Interpolate between semantic concepts (such as duck and rabbit) in the CLIP embedding space to generate continuous spectra of ambiguous images, then combine psychophysical experiments with machine classifier tests to measure the location and sensitivity of human and model semantic boundaries. Result: Machine classifiers show a stronger bias towards "rabbit"; human judgments align more closely with the CLIP embedding used for synthesis; and the guidance scale appears to affect human sensitivity more strongly than it affects the classifiers. Conclusion: Controlled semantic ambiguity can serve as a diagnostic tool that connects human psychophysical analysis, image classification, and generative image models, and it informs human-model alignment, robustness, interpretability, and image synthesis. Abstract: The classic duck-rabbit illusion reveals that when visual evidence is ambiguous, the human brain must decide what it sees. But where exactly do human observers draw the line between "duck" and "rabbit", and do machine classifiers draw it in the same place? We use semantically ambiguous images as interpretability probes to expose how vision models represent the boundaries between concepts. We present a psychophysically-informed framework that interpolates between concepts in the CLIP embedding space to generate continuous spectra of ambiguous images, allowing us to precisely measure where and how humans and machine classifiers place their semantic boundaries. Using this framework, we show that machine classifiers are more biased towards seeing "rabbit", whereas humans are more aligned with the CLIP embedding used for synthesis, and the guidance scale seems to affect human sensitivity more strongly than machine classifiers. Our framework demonstrates how controlled ambiguity can serve as a diagnostic tool to bridge the gap between human psychophysical analysis, image classification, and generative image models, offering insight into human-model alignment, robustness, model interpretability, and image synthesis methods.
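
The abstract above states only that the framework "interpolates between concepts in the CLIP embedding space"; it does not say which interpolation rule is used. The sketch below assumes spherical linear interpolation (slerp), a common choice for unit-normalised embeddings, and uses toy 3-dimensional vectors in place of real CLIP embeddings. The `slerp` helper is hypothetical.

```python
import math

def slerp(a, b, t):
    """Spherically interpolate between vectors a and b at fraction t in [0, 1].

    Both inputs are normalised to unit length first. When the vectors are
    (nearly) parallel or anti-parallel the sine of the angle is ~0, so the
    function falls back to plain linear interpolation of the unit vectors.
    """
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    a = [x / na for x in a]
    b = [x / nb for x in b]
    dot = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    omega = math.acos(dot)
    s = math.sin(omega)
    if s < 1e-8:
        return [(1.0 - t) * x + t * y for x, y in zip(a, b)]
    wa = math.sin((1.0 - t) * omega) / s
    wb = math.sin(t * omega) / s
    return [wa * x + wb * y for x, y in zip(a, b)]

# Toy stand-ins for the two concept embeddings (assumption: real ones come from CLIP).
duck = [1.0, 0.0, 0.0]
rabbit = [0.0, 1.0, 0.0]

# Five evenly spaced conditioning vectors from pure "duck" to pure "rabbit".
spectrum = [slerp(duck, rabbit, i / 4) for i in range(5)]
```

Each vector in `spectrum` would then condition an image generator; the psychophysical question is at which step observers and classifiers switch labels.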

[59] OpenCap Monocular: 3D Human Kinematics and Musculoskeletal Dynamics from a Single Smartphone Video

Selim Gilon,Emily Y. Miller,Scott D. Uhlrich

Main category: cs.CV

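TL;DR: This paper presents OpenCap Monocular, an algorithm that estimates 3D human kinematics and kinetics from a single smartphone video, lowering the cost and barrier of traditional laboratory assessment. It is validated on several everyday movements for clinically relevant use.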

Details Motivation: Traditional kinematic and kinetic analysis depends on costly, time-intensive laboratory equipment such as marker-based motion capture and force plates, which limits clinical and large-scale use. Scalable, low-cost, accurate biomechanical assessment tools are needed. Method: Refine the 3D pose output of the monocular pose estimation model WHAM via optimization, map it to a biomechanically constrained skeletal model to compute kinematics, and estimate kinetics (such as joint moments and ground reaction forces) with physics-based simulation and machine learning. Result: On walking, squatting, and sit-to-stand tasks, kinematic error is 4.8° for rotational degrees of freedom and 3.4 cm for pelvis translations, improving on a regression-only baseline by 48% (rotation) and 69% (translation). Kinetic estimates such as knee extension moment and knee adduction moment reach clinically meaningful accuracy, and ground reaction force estimates during walking are comparable to or better than the earlier two-camera OpenCap system. Conclusion: OpenCap Monocular enables accurate, scalable, freely available biomechanical assessment from a single smartphone, providing a practical tool for prediction, treatment, and monitoring of conditions such as frailty and knee osteoarthritis. Abstract: Quantifying human movement (kinematics) and musculoskeletal forces (kinetics) at scale, such as estimating quadriceps force during a sit-to-stand movement, could transform prediction, treatment, and monitoring of mobility-related conditions. However, quantifying kinematics and kinetics traditionally requires costly, time-intensive analysis in specialized laboratories, limiting clinical translation. Scalable, accurate tools for biomechanical assessment are needed. We introduce OpenCap Monocular, an algorithm that estimates 3D skeletal kinematics and kinetics from a single smartphone video. The method refines 3D human pose estimates from a monocular pose estimation model (WHAM) via optimization, computes kinematics of a biomechanically constrained skeletal model, and estimates kinetics via physics-based simulation and machine learning. We validated OpenCap Monocular against marker-based motion capture and force plate data for walking, squatting, and sit-to-stand tasks. OpenCap Monocular achieved low kinematic error (4.8° mean absolute error for rotational degrees of freedom; 3.4 cm for pelvis translations), outperforming a regression-only computer vision baseline by 48% in rotational accuracy (p = 0.036) and 69% in translational accuracy (p < 0.001). OpenCap Monocular also estimated ground reaction forces during walking with accuracy comparable to, or better than, our prior two-camera OpenCap system.
We demonstrate that the algorithm estimates important kinetic outcomes with clinically meaningful accuracy in applications related to frailty and knee osteoarthritis, including estimating knee extension moment during sit-to-stand transitions and knee adduction moment during walking. OpenCap Monocular is deployed via a smartphone app, web app, and secure cloud computing (https://opencap.ai), enabling free, accessible single-smartphone biomechanical assessments.

[60] TIGeR: A Unified Framework for Time, Images and Geo-location Retrieval

David G. Shatwell,Sirnam Swetha,Mubarak Shah

Main category: cs.CV

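TL;DR: This paper proposes TIGeR, a multi-modal transformer that maps images, geolocation, and time into a unified geo-temporal embedding space, supporting multiple query types and improving geo-localization, time-of-capture prediction, and geo-time aware image retrieval.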

Details Motivation: Real-world applications such as digital forensics, urban monitoring, and environmental analysis require joint reasoning over image appearance, geolocation, and time, and existing methods do not cover more complex needs such as retrieving an image of the same place at a specified time. Method: TIGeR, a multi-modal-transformer-based model that builds a unified geo-temporal embedding space for image, geolocation, and time; it accepts single-modality and multi-modality queries and handles geo-localization, time-of-capture prediction, and geo-time aware retrieval with one representation. Result: Improvements over state-of-the-art methods of up to 16% on time-of-year prediction, 8% on time-of-day prediction, and 14% on geo-time aware retrieval recall. Conclusion: Unified modeling of geographic and temporal information improves cross-modal retrieval and prediction, and TIGeR provides an effective framework for joint geo-temporal reasoning. Abstract: Many real-world applications in digital forensics, urban monitoring, and environmental analysis require jointly reasoning about visual appearance, geolocation, and time. Beyond standard geo-localization and time-of-capture prediction, these applications increasingly demand more complex capabilities, such as retrieving an image captured at the same location as a query image but at a specified target time. We formalize this problem as Geo-Time Aware Image Retrieval and curate a diverse benchmark of 4.5M paired image-location-time triplets for training and 86k high-quality triplets for evaluation. We then propose TIGeR, a multi-modal-transformer-based model that maps image, geolocation, and time into a unified geo-temporal embedding space. TIGeR supports flexible input configurations (single-modality and multi-modality queries) and uses the same representation to perform (i) geo-localization, (ii) time-of-capture prediction, and (iii) geo-time-aware retrieval. By better preserving underlying location identity under large appearance changes, TIGeR enables retrieval based on where and when a scene is, rather than purely on visual similarity. Extensive experiments show that TIGeR consistently outperforms strong baselines and state-of-the-art methods by up to 16% on time-of-year, 8% time-of-day prediction, and 14% in geo-time aware retrieval recall, highlighting the benefits of unified geo-temporal modeling.

[61] Synthetic Cardiac MRI Image Generation using Deep Generative Models

Ishan Kumarasinghe,Dasuni Kawya,Madhura Edirisooriya,Isuri Devindi,Isuru Nawinne,Vajira Thambawita

Main category: cs.CV

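TL;DR: This review surveys recent methods for synthetic cardiac MRI (CMRI) generation, covering GANs, VAEs, diffusion models, and flow matching. It assesses them on image fidelity, downstream utility (such as segmentation performance), and privacy (such as differential privacy and membership inference attacks), and notes the lack of a unified, evaluation-driven integrated framework.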

Details Motivation: To address obstacles to clinical use: scarce annotated cardiac MRI data, large variability across scanner vendors, and privacy leakage through model memorization. Method: A systematic review of CMRI generation methods, including mask-conditioned generation, diffusion models, flow matching, vendor-style conditioning, and intensity normalization, together with privacy assessments such as membership inference attacks, nearest-neighbor analyses, and differential privacy. Result: Anatomically constrained synthetic data can improve the accuracy and robustness of downstream segmentation in multi-vendor settings; diffusion and flow-matching models offer strong boundary preservation and efficient deterministic transformations; and privacy mechanisms are increasingly built into generation pipelines. Conclusion: An integrated evaluation framework that balances fidelity, utility, and privacy is needed to support reliable clinical workflows. Abstract: Synthetic cardiac MRI (CMRI) generation has emerged as a promising strategy to overcome the scarcity of annotated medical imaging data. Recent advances in GANs, VAEs, diffusion probabilistic models, and flow-matching techniques aim to generate anatomically accurate images while addressing challenges such as limited labeled datasets, vendor variability, and risks of privacy leakage through model memorization. Mask-conditioned generation improves structural fidelity by guiding synthesis with segmentation maps, while diffusion and flow-matching models offer strong boundary preservation and efficient deterministic transformations. Cross-domain generalization is further supported through vendor-style conditioning and preprocessing steps like intensity normalization. To ensure privacy, studies increasingly incorporate membership inference attacks, nearest-neighbor analyses, and differential privacy mechanisms. Utility evaluations commonly measure downstream segmentation performance, with evidence showing that anatomically constrained synthetic data can enhance accuracy and robustness across multi-vendor settings. This review aims to compare existing CMRI generation approaches through the lenses of fidelity, utility, and privacy, highlighting current limitations and the need for integrated, evaluation-driven frameworks for reliable clinical workflows.

[62] DRoPS: Dynamic 3D Reconstruction of Pre-Scanned Objects

Narek Tumanyan,Samuel Rota Bulò,Denis Rozumny,Lorenzo Porzi,Adam Harley,Tali Dekel,Peter Kontschieder,Jonathon Luiten

Main category: cs.CV

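TL;DR: This paper proposes DRoPS, which uses a static pre-scan of a dynamic object as a geometric and appearance prior. With grid-structured Gaussian primitives and a CNN-driven motion parameterization, it improves dynamic scene reconstruction quality and 3D tracking accuracy under extreme novel viewpoints.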

Details Motivation: Existing methods struggle to reconstruct dynamic scenes from extreme novel viewpoints and under highly articulated motion, and they do not fully exploit static pre-scan information. Method: DRoPS (1) organizes Gaussian primitives into pixel grids anchored to the object surface, giving a grid-structured, surface-aligned model, and (2) uses that grid structure to parameterize motion with a CNN conditioned on the grids, which injects strong implicit regularization and correlates the motion of nearby points. Result: The method significantly outperforms the current state of the art in rendering quality and 3D tracking accuracy. Conclusion: By explicitly using the static pre-scan prior together with structured motion modeling, DRoPS constrains the solution space and keeps geometry consistent across the sequence, addressing a key difficulty of dynamic reconstruction under extreme viewpoints. Abstract: Dynamic scene reconstruction from casual videos has seen recent remarkable progress. Numerous approaches have attempted to overcome the ill-posedness of the task by distilling priors from 2D foundational models and by imposing hand-crafted regularization on the optimized motion. However, these methods struggle to reconstruct scenes from extreme novel viewpoints, especially when highly articulated motions are present. In this paper, we present DRoPS, a novel approach that leverages a static pre-scan of the dynamic object as an explicit geometric and appearance prior. While existing state-of-the-art methods fail to fully exploit the pre-scan, DRoPS leverages our novel setup to effectively constrain the solution space and ensure geometrical consistency throughout the sequence. The core of our novelty is twofold: first, we establish a grid-structured and surface-aligned model by organizing Gaussian primitives into pixel grids anchored to the object surface. Second, by leveraging the grid structure of our primitives, we parameterize motion using a CNN conditioned on those grids, injecting strong implicit regularization and correlating the motion of nearby points. Extensive experiments demonstrate that our method significantly outperforms the current state of the art in rendering quality and 3D tracking accuracy.

[63] AVControl: Efficient Framework for Training Audio-Visual Controls

Matan Ben-Yosef,Tavi Halperin,Naomi Ken Korem,Mohammad Salama,Harel Cain,Asaf Joseph,Anthony Chen,Urska Jelercic,Ofir Bibi

Main category: cs.CV

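TL;DR: This paper proposes AVControl, a framework built on the LTX-2 joint audio-visual foundation model. It trains a separate LoRA adapter per control modality (depth, pose, camera trajectory, audio, and others) on a parallel canvas, giving lightweight, extendable, efficient training with no changes to the backbone, and reaches state-of-the-art or competitive results on several benchmarks.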

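Details Motivation: Existing approaches to controlling video and audio generation either train one monolithic model for a fixed set of controls or introduce costly architectural changes for each new modality, so they lack flexibility and extensibility. Method: On the LTX-2 joint audio-visual foundation model, each control modality (such as depth, pose, or audio) is trained as a separate LoRA adapter. A "parallel canvas" supplies the reference signal as additional tokens in the attention layers without changing the base architecture, which avoids the failure on structural control seen when image-based in-context methods are simply extended to video. Result: On the VACE Benchmark the method outperforms all evaluated baselines on depth- and pose-guided generation, inpainting, and outpainting, and is competitive on camera control and audio-visual benchmarks. It supports a diverse set of independently trained modalities, including what the authors describe as the first modular audio-visual controls for a joint generation model, and each modality needs only a small dataset and a few hundred to a few thousand training steps. Conclusion: AVControl is a compute- and data-efficient, extendable framework for multi-modal controllable audio-visual generation that needs no architectural changes beyond the LoRA adapters, and it offers a general recipe for adding new control signals. Abstract: Controlling video and audio generation requires diverse modalities, from depth and pose to camera trajectories and audio transformations, yet existing approaches either train a single monolithic model for a fixed set of controls or introduce costly architectural changes for each new modality. We introduce AVControl, a lightweight, extendable framework built on LTX-2, a joint audio-visual foundation model, where each control modality is trained as a separate LoRA on a parallel canvas that provides the reference signal as additional tokens in the attention layers, requiring no architectural changes beyond the LoRA adapters themselves. We show that simply extending image-based in-context methods to video fails for structural control, and that our parallel canvas approach resolves this. On the VACE Benchmark, we outperform all evaluated baselines on depth- and pose-guided generation, inpainting, and outpainting, and show competitive results on camera control and audio-visual benchmarks. Our framework supports a diverse set of independently trained modalities: spatially-aligned controls such as depth, pose, and edges, camera trajectory with intrinsics, sparse motion control, video editing, and, to our knowledge, the first modular audio-visual controls for a joint generation model. Our method is both compute- and data-efficient: each modality requires only a small dataset and converges within a few hundred to a few thousand training steps, a fraction of the budget of monolithic alternatives. We publicly release our code and trained LoRA checkpoints.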

[64] Calibri: Enhancing Diffusion Transformers via Parameter-Efficient Calibration

Danil Tokhchukov,Aysel Mirzoeva,Andrey Kuznetsov,Konstantin Sobolev

Main category: cs.CV

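TL;DR: This paper proposes Calibri, which builds on the observation that a single learned scaling parameter can improve DiT blocks and uses an evolutionary algorithm to calibrate DiT components. By adjusting only about 100 parameters, it improves text-to-image generation quality and reduces the number of inference steps.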

Details Motivation: To uncover the latent capability of Diffusion Transformers (DiTs) on generative tasks and improve their performance and efficiency. Method: Calibri frames DiT calibration as a black-box reward optimization problem, solves it with an evolutionary algorithm, and modifies only about 100 parameters. Result: Calibri consistently improves generation quality across several text-to-image models and reduces the inference steps required, while maintaining high-quality outputs. Conclusion: A single learned scaling parameter and a lightweight calibration strategy can markedly improve DiT performance, supporting the effectiveness of parameter-efficient optimization. Abstract: In this paper, we uncover the hidden potential of Diffusion Transformers (DiTs) to significantly enhance generative tasks. Through an in-depth analysis of the denoising process, we demonstrate that introducing a single learned scaling parameter can significantly improve the performance of DiT blocks. Building on this insight, we propose Calibri, a parameter-efficient approach that optimally calibrates DiT components to elevate generative quality. Calibri frames DiT calibration as a black-box reward optimization problem, which is efficiently solved using an evolutionary algorithm and modifies just ~100 parameters. Experimental results reveal that despite its lightweight design, Calibri consistently improves performance across various text-to-image models. Notably, Calibri also reduces the inference steps required for image generation, all while maintaining high-quality outputs.
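
To make "black-box reward optimization with an evolutionary algorithm over about 100 parameters" concrete, here is a minimal sketch of a (1+λ) evolution strategy. The abstract above does not name the specific algorithm, so this choice is an assumption, and the toy reward below merely stands in for an image-quality reward computed from a real DiT. `evolve_scales` and `toy_reward` are hypothetical names.

```python
import random

def evolve_scales(reward, n_params=100, generations=200, population=16,
                  sigma=0.02, seed=0):
    """(1+lambda) evolution strategy over a vector of scaling parameters.

    reward: black-box callable mapping a list of scales to a float (higher is
    better). Only reward values are used; no gradients are required.
    """
    rng = random.Random(seed)
    parent = [1.0] * n_params          # identity scaling = uncalibrated model
    parent_reward = reward(parent)
    for _ in range(generations):
        # Sample lambda children by Gaussian perturbation of the parent.
        children = [[s + rng.gauss(0.0, sigma) for s in parent]
                    for _ in range(population)]
        scored = [(reward(c), c) for c in children]
        top_reward, top = max(scored, key=lambda rc: rc[0])
        # Elitism: keep the parent unless a child is strictly better.
        if top_reward > parent_reward:
            parent, parent_reward = top, top_reward
    return parent, parent_reward

# Toy reward (assumption): closeness to a hidden "ideal" scale vector.
_ideal = [1.0 + 0.1 * ((i % 5) - 2) for i in range(100)]

def toy_reward(scales):
    return -sum((s - t) ** 2 for s, t in zip(scales, _ideal))

initial_reward = toy_reward([1.0] * 100)
scales, final_reward = evolve_scales(toy_reward)
```

Because the search touches only the scale vector, the backbone weights stay frozen, which is what keeps this kind of calibration parameter-efficient.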

[65] Dissecting Model Failures in Abdominal Aortic Aneurysm Segmentation through Explainability-Driven Analysis

Abu Noman Md Sakib,Merjulah Roby,Zijie Zhang,Satish Muluk,Mark K. Eskandari,Ender A. Finol

Main category: cs.CV

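TL;DR: This paper proposes an explainable AI (XAI)-guided encoder shaping framework. It builds a dense, attribution-based encoder focus map (the "XAI field"), aligns predicted probability mass with that field, and uses the field to drive a lightweight refinement pathway and a confidence prior at inference, making CT segmentation of complex abdominal aortic aneurysms more robust.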

Details Motivation: Existing AAA CT segmentation models often fail because they attend to irrelevant structures or miss thin, low-contrast targets. Where the model looks is a key training signal, so encoder focus should be optimized explicitly. Method: An XAI-guided encoder shaping framework that (1) computes a dense, attribution-based XAI field from the final encoder block, (2) aligns the predicted probability mass with the field to promote agreement between focus and output, and (3) routes the field into a lightweight refinement pathway and a confidence prior that suppress distractors and preserve subtle structures at inference. The objective terms act only as control signals; the contribution is integrating attribution guidance into representation and decoding. Result: On clinically validated, failure-prone challenging cases, the method yields substantial improvements over a base SAM setup. Conclusion: Explicitly optimizing encoder focus through XAI guidance is a practical and effective principle for reliable segmentation in complex scenarios. Abstract: Computed tomography image segmentation of complex abdominal aortic aneurysms (AAA) often fails because the models assign internal focus to irrelevant structures or do not focus on thin, low-contrast targets. Where the model looks is the primary training signal, and thus we propose an Explainable AI (XAI) guided encoder shaping framework. Our method computes a dense, attribution-based encoder focus map ("XAI field") from the final encoder block and uses it in two complementary ways: (i) we align the predicted probability mass to the XAI field to promote agreement between focus and output; and (ii) we route the field into a lightweight refinement pathway and a confidence prior that modulates logits at inference, suppressing distractors while preserving subtle structures. The objective terms serve only as control signals; the contribution is the integration of attribution guidance into representation and decoding. We evaluate clinically validated challenging cases curated for failure-prone scenarios. Compared to a base SAM setup, our implementation yields substantial improvements. The observed gains suggest that explicitly optimizing encoder focus via XAI guidance is a practical and effective principle for reliable segmentation in complex scenarios.

[66] GoldiCLIP: The Goldilocks Approach for Balancing Explicit Supervision for Language-Image Pretraining

Deen Dayal Mohan,Hossein Souri,Vitali Petsiuk,Juhong Min,Gopal Sharma,Luowei Zhou,Suren Kumar

Main category: cs.CV

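TL;DR: GoldiCLIP is a data-efficient training framework for large-scale vision-language models. Trained on only 30 million images (300x less data than leading methods), it achieves state-of-the-art results among data-efficient approaches, clearly outperforms baselines on several retrieval tasks, and remains competitive with models trained on billion-scale data.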

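Details Motivation: Existing large-scale vision-language models depend on billions of training samples, which is costly and raises the barrier to entry. Recent work improves supervision quality, but each method addresses only some of the weaknesses of contrastive pretraining and none balances the supervision signals systematically. Method: Proposes GoldiCLIP, a framework built on a Goldilocks ("just right") principle, with three innovations: (1) text-conditioned self-distillation that aligns text-agnostic and text-conditioned features; (2) an encoder-integrated decoder with a VQA objective that helps the encoder generalize to non-caption queries such as questions; (3) an uncertainty-based dynamic loss weighting mechanism that automatically balances the heterogeneous losses. Result: Trained on only 30 million images, GoldiCLIP improves over the best comparable baseline by 2.2 points on MSCOCO retrieval, 2.0 points on fine-grained retrieval, and 5.9 points on question-based retrieval, while approaching the performance of models trained on billion-scale data. Conclusion: GoldiCLIP shows that high-quality, multifaceted, adaptive supervision can deliver strong vision-language representations from far less data, offering a viable path for resource-constrained settings. Abstract: Until recently, the success of large-scale vision-language models (VLMs) has primarily relied on billion-sample datasets, posing a significant barrier to progress. Latest works have begun to close this gap by improving supervision quality, but each addresses only a subset of the weaknesses in contrastive pretraining. We present GoldiCLIP, a framework built on a Goldilocks principle of finding the right balance of supervision signals. Our multifaceted training framework synergistically combines three key innovations: (1) a text-conditioned self-distillation method to align both text-agnostic and text-conditioned features; (2) an encoder integrated decoder with Visual Question Answering (VQA) objective that enables the encoder to generalize beyond the caption-like queries; and (3) an uncertainty-based weighting mechanism that automatically balances all heterogeneous losses. Trained on just 30 million images, 300x less data than leading methods, GoldiCLIP achieves state-of-the-art among data-efficient approaches, improving over the best comparable baseline by 2.2 points on MSCOCO retrieval, 2.0 on fine-grained retrieval, and 5.9 on question-based retrieval, while remaining competitive with billion-scale models. Project page: https://petsi.uk/goldiclip.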
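
The abstract does not spell out the form of the uncertainty-based weighting. The following is a minimal sketch, assuming the commonly used homoscedastic-uncertainty formulation in which each loss L_i is paired with a learnable log-variance s_i and the total is the sum of exp(-s_i) * L_i + s_i. The function name and the example loss values are hypothetical and are not taken from GoldiCLIP.

```python
import math

def uncertainty_weighted_loss(losses, log_vars):
    """Combine heterogeneous losses using per-loss log-variances.

    Assumed form (not confirmed by the abstract):
        total = sum_i exp(-s_i) * L_i + s_i
    """
    if len(losses) != len(log_vars):
        raise ValueError("need one log-variance per loss")
    return sum(math.exp(-s) * loss + s for loss, s in zip(losses, log_vars))

# Hypothetical values for a contrastive, a distillation, and a VQA loss.
losses = [2.0, 0.5, 1.0]
equal = uncertainty_weighted_loss(losses, [0.0, 0.0, 0.0])   # reduces to a plain sum
damped = uncertainty_weighted_loss(losses, [1.0, 0.0, 0.0])  # first loss is down-weighted
```

Under this assumed form, for a fixed positive loss value L_i the term exp(-s_i) * L_i + s_i is minimized at s_i = ln L_i, so a loss that stays large is automatically scaled down rather than dominating the total.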

[67] Attention-based Pin Site Image Classification in Orthopaedic Patients with External Fixators

Yubo Wang,Marie Fridberg,Anirejuoritse Bafor,Ole Rahbek,Christopher Iobst,Søren Vedding Kold,Ming Shen

Main category: cs.CV

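TL;DR: This paper proposes a deep learning model built on an attention mechanism and a new convolution module (ERRC) that classifies external-fixator pin site images by appearance (Group A: signs of infection or inflammation; Group B: no evident complications). It outperforms baseline methods with an AUC of 0.975 and an F1-score of 0.927 while using only 5.77 M parameters.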

Details Motivation: Pin site infections are common, painful, and increase patient morbidity, so their identification and management need to improve. Existing studies focus mostly on open wounds and do not specifically model early signs of infection at the metal pin/skin interface. Method: Builds a pin site infection image dataset; proposes an attention-based deep learning model that emphasizes the relevant infected regions and suppresses distraction from the metal pins; introduces Efficient Redundant Reconstruction Convolution (ERRC) to enrich feature maps while reducing the parameter count. Result: The model reaches an AUC of 0.975 and an F1-score of 0.927 with only 5.77 M parameters, outperforming baseline methods, and its outputs agree with visual assessments by healthcare professionals. Conclusion: The DL model can distinguish pin site infection status from visual features alone and shows potential as a clinical decision aid, but validation on larger datasets is still required. Abstract: Pin sites represent the interface where a metal pin or wire from the external environment passes through the skin into the internal environment of the limb. These pins or wires connect an external fixator to the bone to stabilize the bone segments in a patient with trauma or deformity. Because these pin sites represent an opportunity for external skin flora to enter the internal environment of the limb, infections of the pin site are common. These pin site infections are painful, annoying, and cause increased morbidity to the patients. Improving the identification and management of pin site infections would greatly enhance the patient experience when external fixators are used. For this, this paper collects and produces a dataset on pin sites wound infections and proposes a deep learning (DL) method to classify pin sites images based on their appearance: Group A displayed signs of inflammation or infection, while Group B showed no evident complications. Unlike studies that primarily focus on open wounds, our research includes potential interventions at the metal pin/skin interface. Our attention-based deep learning model addresses this complexity by emphasizing relevant regions and minimizing distractions from the pins. Moreover, we introduce an Efficient Redundant Reconstruction Convolution (ERRC) method to enhance the richness of feature maps while reducing the number of parameters. Our model outperforms baseline methods with an AUC of 0.975 and an F1-score of 0.927, requiring only 5.77 M parameters. These results highlight the potential of DL in differentiating pin sites only based on visual signs of infection, aligning with healthcare professional assessments, while further validation with more data remains essential.

[68] Generative Adversarial Perturbations with Cross-paradigm Transferability on Localized Crowd Counting

Alabi Mehzabin Anisha,Guangjing Wang,Sriram Chellappan

Main category: cs.CV

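TL;DR: This paper proposes a new cross-paradigm adversarial attack framework that is effective against both mainstream families of crowd counting and localization models (density-map-based and point-regression-based), achieving high transferability and strong attack performance while keeping perturbations visually imperceptible.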

Details Motivation: Prior work focuses on adversarial attacks within a single paradigm (such as density maps), while cross-paradigm attacks (density map vs. point regression) remain unexplored. Crowd counting models are deployed in security-critical settings that demand robustness, so their vulnerability across architectures needs to be assessed. Method: Proposes an adversarial framework optimized with a multi-task loss: scene-density-specific high-confidence logit suppression for point-regression models, and peak-targeted density map suppression for density-map models. Both are combined with model-agnostic perceptual constraints to keep the perturbations imperceptible. Result: The attack transfers successfully across seven state-of-the-art crowd models (transfer ratios from 0.55 to 1.69) and raises MAE by 7x on average while maintaining good visual quality. Compared with existing transferable attack strategies, it achieves a better balance between attack effectiveness and imperceptibility. Conclusion: Cross-paradigm adversarial attacks are feasible and highly effective. The work exposes a shared vulnerability of current crowd analysis models across architectures and provides a new evaluation benchmark and direction for improving robustness. Abstract: State-of-the-art crowd counting and localization are primarily modeled using two paradigms: density maps and point regression. Given the field's security ramifications, there is active interest in model robustness against adversarial attacks. Recent studies have demonstrated transferability across density-map-based approaches via adversarial patches, but cross-paradigm attacks (i.e., across both density map-based models and point regression-based models) remain unexplored. We introduce a novel adversarial framework that compromises both density map and point regression architectural paradigms through a comprehensive multi-task loss optimization. For point-regression models, we employ scene-density-specific high-confidence logit suppression; for density-map approaches, we use peak-targeted density map suppression. Both are combined with model-agnostic perceptual constraints to ensure that perturbations are effective and imperceptible to the human eye. Extensive experiments demonstrate the effectiveness of our attack, achieving on average a 7X increase in Mean Absolute Error compared to clean images while maintaining competitive visual quality, and successfully transferring across seven state-of-the-art crowd models with transfer ratios ranging from 0.55 to 1.69. Our approach strikes a balance between attack effectiveness and imperceptibility compared to state-of-the-art transferable attack strategies. The source code is available at https://github.com/simurgh7/CrowdGen

[69] DCARL: A Divide-and-Conquer Framework for Autoregressive Long-Trajectory Video Generation

Junyi Ouyang,Wenbin Teng,Gonglin Chen,Yajie Zhao,Haiwei Chen

Main category: cs.CV

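TL;DR: This paper proposes DCARL, a framework that combines a divide-and-conquer strategy with autoregressive modeling: a keyframe generator and an interpolation generator work together to produce stable, high-fidelity long-trajectory videos.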

Details Motivation: Existing video diffusion models scale poorly, and autoregressive models suffer from visual drift and poor controllability, which makes long-trajectory video generation difficult. Method: Proposes DCARL: a Keyframe Generator trained without temporal compression first establishes global structural anchors; an Interpolation Generator then synthesizes the dense frames autoregressively over overlapping segments, using the keyframes as global context and a single clean preceding frame for local coherence. Result: Trained on a large-scale internet long-trajectory video dataset, DCARL outperforms state-of-the-art autoregressive and divide-and-conquer baselines in both visual quality (lower FID and FVD) and camera adherence (lower ATE and ARE), and supports stable, high-quality generation of videos up to 32 seconds long. Conclusion: DCARL combines the structural stability of divide-and-conquer with the high fidelity of diffusion models, offering a new approach to long-trajectory video generation. Abstract: Long-trajectory video generation is a crucial yet challenging task for world modeling primarily due to the limited scalability of existing video diffusion models (VDMs). Autoregressive models, while offering infinite rollout, suffer from visual drift and poor controllability. To address these issues, we propose DCARL, a novel divide-and-conquer, autoregressive framework that effectively combines the structural stability of the divide-and-conquer scheme with the high-fidelity generation of VDMs. Our approach first employs a dedicated Keyframe Generator trained without temporal compression to establish long-range, globally consistent structural anchors. Subsequently, an Interpolation Generator synthesizes the dense frames in an autoregressive manner with overlapping segments, utilizing the keyframes for global context and a single clean preceding frame for local coherence. Trained on a large-scale internet long trajectory video dataset, our method achieves superior performance in both visual quality (lower FID and FVD) and camera adherence (lower ATE and ARE) compared to state-of-the-art autoregressive and divide-and-conquer baselines, demonstrating stable and high-fidelity generation for long trajectory videos up to 32 seconds in length.

[70] WAFT-Stereo: Warping-Alone Field Transforms for Stereo Matching

Yihan Wang,Jia Deng

Main category: cs.CV

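TL;DR: WAFT-Stereo is a warping-based stereo matching method that dispenses with the conventional cost volume while delivering leading accuracy and higher efficiency.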

Details Motivation: To challenge the reliance of leading methods on cost volumes and explore a simpler, more efficient design for stereo matching. Method: Proposes WAFT-Stereo, a warping-based method that drops the cost volume and models feature correspondence between the left and right images directly. Result: Ranks first on the ETH3D, KITTI, and Middlebury benchmarks, reduces zero-shot error on ETH3D by 81%, and runs 1.8-6.7x faster than competitive methods. Conclusion: Cost volumes are not necessary for strong stereo matching performance; a lightweight warping-based design can deliver both accuracy and efficiency. Abstract: We introduce WAFT-Stereo, a simple and effective warping-based method for stereo matching. WAFT-Stereo demonstrates that cost volumes, a common design used in many leading methods, are not necessary for strong performance and can be replaced by warping with improved efficiency. WAFT-Stereo ranks first on ETH3D, KITTI and Middlebury public benchmarks, reducing the zero-shot error by 81% on ETH3D benchmark, while being 1.8-6.7x faster than competitive methods. Code and model weights are available at https://github.com/princeton-vl/WAFT-Stereo.
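
To illustrate what warping means in stereo matching, here is a minimal sketch that resamples one row of right-image features at positions shifted by a per-pixel disparity estimate, using linear interpolation. It assumes rectified images, so the match for left pixel x lies at x - d in the right image. The function `warp_row` and the toy values are hypothetical and do not reproduce the WAFT-Stereo architecture.

```python
import math

def warp_row(right_row, disparity):
    """Sample right_row at x - disparity[x] with linear interpolation.

    Source coordinates that fall outside the row are clamped to the border.
    """
    n = len(right_row)
    warped = []
    for x, d in enumerate(disparity):
        src = min(max(x - d, 0.0), n - 1.0)  # clamp the source coordinate
        lo = int(math.floor(src))
        hi = min(lo + 1, n - 1)
        t = src - lo                          # interpolation weight in [0, 1)
        warped.append((1.0 - t) * right_row[lo] + t * right_row[hi])
    return warped

right = [10.0, 20.0, 30.0, 40.0, 50.0]
# With a constant disparity of 1, warped[x] takes the value of right[x - 1].
shifted = warp_row(right, [1.0] * 5)
# A fractional disparity blends two neighbouring samples.
blended = warp_row(right, [0.5] * 5)
```

A warping-based method can compare the warped right features with the left features at the current disparity estimate and refine that estimate, instead of enumerating every candidate disparity as a cost volume does.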

[71] NeuroVLM-Bench: Evaluation of Vision-Enabled Large Language Models for Clinical Reasoning in Neurological Disorders

Katarina Trojachanec Dineva,Stefan Andonov,Ilinka Ivanoska,Ivan Kitanovski,Sasho Gramatikov,Tamara Kostova,Monika Simjanoska Misheva,Kostadin Mishev

Main category: cs.CV

TL;DR: This paper benchmarks 20 multimodal large language models on 2D neuroimaging tasks (MRI/CT) across diagnosis, subtype, modality, sequence, and plane prediction, evaluating classification with abstention, calibration, structured-output validity, and efficiency; results show imaging attribute recognition is nearly solved but diagnostic reasoning—especially subtypes—is hard, with Gemini-2.5-Pro and GPT-5-Chat leading in diagnostics, Gemini-2.5-Flash best in efficiency, and MedGemma-1.5-4B the top open-weight model.

Details Motivation: To systematically evaluate the reliability, operational trade-offs, and diagnostic capabilities of vision-enabled large language models in neuroimaging—where current understanding remains insufficient. Method: A comprehensive benchmarking study using curated MRI and CT datasets covering multiple sclerosis, stroke, brain tumors, other abnormalities, and normal controls; models generate five simultaneous outputs; evaluation spans discriminative classification with abstention, calibration, structured-output validity, and computational efficiency; a multi-phase framework controls for selection bias. Result: Imaging attributes (modality, plane) are nearly solved; diagnostic reasoning—especially subtype prediction—is challenging; tumor classification is most reliable, stroke moderately solvable, MS and rare abnormalities remain difficult; few-shot prompting improves some models but increases cost/latency; Gemini-2.5-Pro and GPT-5-Chat achieve strongest diagnostic performance; Gemini-2.5-Flash offers best efficiency-performance trade-off; MedGemma-1.5-4B is the top open-weight model, approaching proprietary zero-shot performance under few-shot prompting while ensuring perfect structured output. Conclusion: The study provides practical, standardized insights into performance, reliability, and efficiency trade-offs of multimodal LLMs in neuroimaging, supporting future clinical deployment and evaluation frameworks. Abstract: Recent advances in multimodal large language models enable new possibilities for image-based decision support. However, their reliability and operational trade-offs in neuroimaging remain insufficiently understood. We present a comprehensive benchmarking study of vision-enabled large language models for 2D neuroimaging using curated MRI and CT datasets covering multiple sclerosis, stroke, brain tumors, other abnormalities, and normal controls. 
Models are required to generate multiple outputs simultaneously, including diagnosis, diagnosis subtype, imaging modality, specialized sequence, and anatomical plane. Performance is evaluated across four directions: discriminative classification with abstention, calibration, structured-output validity, and computational efficiency. A multi-phase framework ensures fair comparison while controlling for selection bias. Across twenty frontier multimodal models, the results show that technical imaging attributes such as modality and plane are nearly solved, whereas diagnostic reasoning, especially subtype prediction, remains challenging. Tumor classification emerges as the most reliable task, stroke is moderately solvable, while multiple sclerosis and rare abnormalities remain difficult. Few-shot prompting improves performance for several models but increases token usage, latency, and cost. Gemini-2.5-Pro and GPT-5-Chat achieve the strongest overall diagnostic performance, while Gemini-2.5-Flash offers the best efficiency-performance trade-off. Among open-weight architectures, MedGemma-1.5-4B demonstrates the most promising results, as under few-shot prompting, it approaches the zero-shot performance of several proprietary models, while maintaining perfect structured output. These findings provide practical insights into performance, reliability, and efficiency trade-offs, supporting standardized evaluation of multimodal LLMs in neuroimaging.

[72] CORA: A Pathology Synthesis Driven Foundation Model for Coronary CT Angiography Analysis and MACE Risk Assessment

Jinkui Hao,Gorkem Durak,Halil Ertugrul Aktas,Ulas Bagci,Bradley D. Allen,Nilay S. Shah,Bo Zhou

Main category: cs.CV

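TL;DR: This paper presents CORA, a pathology-centric, synthesis-driven 3D vision foundation model. Using an anatomy-guided lesion synthesis engine, it is trained with self-supervision on 12,801 unlabeled coronary CTA volumes, markedly improving plaque characterization, stenosis detection, and coronary artery segmentation, and it strengthens 30-day MACE risk stratification when coupled with a large language model.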

Details Motivation: Clinical translation is limited by the scarcity of expert-annotated data, and existing label-free pretraining methods (such as masked image modeling) are biased toward global anatomical statistics and struggle to capture localized pathological features of coronary plaques. Method: Proposes CORA, trained with a pathology-centric, synthesis-driven self-supervised framework: an anatomy-guided lesion synthesis engine generates simulated vascular abnormalities so that representation learning focuses on clinically relevant lesions rather than background anatomy. The model is trained on 12,801 unlabeled CCTA volumes and coupled with a large language model to form a multimodal framework for MACE risk prediction. Result: On multi-center datasets from nine independent hospitals, CORA consistently outperforms state-of-the-art 3D vision foundation models on plaque characterization, stenosis detection, and coronary artery segmentation, with gains of up to 29%; the multimodal version significantly improves 30-day MACE risk stratification. Conclusion: CORA is a scalable and extensible foundation model for cardiovascular imaging that unifies anatomical assessment and risk prediction, offering a new route to clinical translation. Abstract: Coronary artery disease, the leading cause of cardiovascular mortality worldwide, can be assessed non-invasively by coronary computed tomography angiography (CCTA). Despite progress in automated CCTA analysis using deep learning, clinical translation is constrained by the scarcity of expert-annotated datasets. Furthermore, widely adopted label-free pretraining strategies, such as masked image modeling, are intrinsically biased toward global anatomical statistics, frequently failing to capture the spatially localized pathological features of coronary plaques. Here, we introduce CORA, a 3D vision foundation model for comprehensive cardiovascular risk assessment. CORA learns directly from volumetric CCTA via a pathology-centric, synthesis-driven self-supervised framework. By utilizing an anatomy-guided lesion synthesis engine, the model is explicitly trained to detect simulated vascular abnormalities, biasing representation learning toward clinically relevant disease features rather than dominant background anatomy. We trained CORA on a large-scale cohort of 12,801 unlabeled CCTA volumes and comprehensively evaluated the model across multi-center datasets from nine independent hospitals. Across diagnostic and anatomical tasks, including plaque characterization, stenosis detection, and coronary artery segmentation, CORA consistently outperformed the state-of-the-art 3D vision foundation models, achieving up to a 29% performance gain. Crucially, by coupling the imaging encoder with a large language model, we extended CORA into a multimodal framework that significantly improved 30-day major adverse cardiac event (MACE) risk stratification. Our results establish CORA as a scalable and extensible foundation for unified anatomical assessment and cardiovascular risk prediction.

[73] Beyond Attention Magnitude: Leveraging Inter-layer Rank Consistency for Efficient Vision-Language-Action Models

Peiju Liu,Jinming Liu,Xipeng Qiu,Xuanjing Huang

Main category: cs.CV

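TL;DR: This paper proposes TIES, a framework that selects visual tokens by dynamically balancing attention magnitude against inter-layer ranking consistency, substantially reducing VLA inference latency and improving task success rates without additional training.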

Details Motivation: Existing VLA models incur high inference latency from processing dense visual tokens, and mainstream token reduction methods rely on attention magnitude as a static selection criterion, overlooking its task dependence and sometimes harming policy performance. Method: Proposes TIES (Tau-guided Inter-layer Efficient Selection), a dynamic framework that uses inter-layer token ranking consistency as a guiding signal and adaptively balances attention magnitude with ranking consistency, giving robust token selection without additional training. Result: On the CogACT + SIMPLER benchmark, TIES improves the average success rate by 6% while reducing token usage by 78%, and generalizes well across diverse decoders and benchmarks. Conclusion: Attention magnitude is not a universal indicator of token importance; adding inter-layer ranking consistency guides dynamic token selection more reliably, and TIES offers a new approach to efficient VLA inference. Abstract: Vision-Language-Action (VLA) models excel in robotic manipulation but suffer from significant inference latency due to processing dense visual tokens. Existing token reduction methods predominantly rely on attention magnitude as a static selection. In this work, we challenge this assumption, revealing that high-attention tokens are task-dependent and can even degrade policy performance. To address this, we introduce TIES (Tau-guided Inter-layer Efficient Selection), a dynamic framework guided by inter-layer token ranking consistency. By adaptively balancing attention magnitude with ranking consistency, TIES ensures robust token selection without requiring additional training. On the CogACT + SIMPLER benchmark, TIES improves average success rates by 6% while reducing token usage by 78%, and demonstrates strong generalization across diverse decoders and benchmarks.
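
The name suggests that "Tau" refers to a rank-correlation statistic such as Kendall's tau, although the abstract does not define it. The following is a minimal sketch under that assumption, measuring how consistently two layers rank the same visual tokens by attention. The function name and the attention values are hypothetical, and how TIES combines this signal with attention magnitude is not specified in the abstract.

```python
from itertools import combinations

def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score lists over the same items.

    Returns a value in [-1, 1]; tied pairs count as neither concordant
    nor discordant.
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        sign = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(scores_a) * (len(scores_a) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical attention scores for four visual tokens at two adjacent layers.
layer_3 = [0.40, 0.30, 0.20, 0.10]
layer_4 = [0.35, 0.15, 0.30, 0.20]
tau = kendall_tau(layer_3, layer_4)
```

In this toy case four of the six token pairs keep their order and two flip, so the two layers agree only partially. A low value of this kind would indicate that attention magnitude at a single layer is an unstable basis for pruning.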

[74] Towards automatic smoke detector inspection: Recognition of the smoke detectors in industrial facilities and preparation for future drone integration

Lukas Kratochvila,Jakub Stefansky,Simon Bilik,Robert Rous,Tomas Zemcik,Michal Wolny,Frantisek Rusnak,Ondrej Cech,Karel Horak

Main category: cs.CV

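TL;DR: This paper presents a smoke detector recognition method for a drone-based automatic inspection system. It compares YOLOv11, SSD, and RT-DETRv2 along with several data strategies, trains and evaluates on real and semi-synthetic data, and finds that YOLOv11n performs best with mAP@0.5 = 0.884.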

Details Motivation: Smoke detectors are often installed high up or in hard-to-reach locations, which makes manual inspection dangerous and costly; a drone-based automatic recognition system is needed. Method: Compares three object detectors (YOLOv11, SSD, and RT-DETRv2, with backbones of different sizes), trained on real and semi-synthetic data with various augmentation strategies, and evaluates them on two test sets that include challenging conditions such as motion blur and low resolution. Result: YOLOv11n is the best-performing model, reaching an mAP@0.5 of 0.884; the code, pretrained models, and dataset are publicly available. Conclusion: YOLOv11n balances accuracy and efficiency on resource-constrained embedded drone platforms, supporting its feasibility for automatic inspection of fire safety equipment in practice. Abstract: Fire safety consists of a complex pipeline, and it is a very important topic of concern. One of its frontal parts are the smoke detectors, which are supposed to provide an alarm prior to a massive fire appears. As they are often difficult to reach due to high ceilings or problematic locations, an automatic inspection system would be very beneficial as it could allow faster revisions, prevent workers from dangerous work in heights, and make the whole process cheaper. In this study, we present the smoke detector recognition part of the automatic inspection system, which could easily be integrated to the drone system. As part of our research, we compare two popular convolutional-based object detectors YOLOv11 and SSD widely used on embedded devices together with the state-of-the-art transformer-based RT-DETRv2 with the backbones of different sizes. Due to a complicated way of collecting a sufficient amount of data for training in the real-world environment, we also compare several training strategies using the real and semi-synthetic data together with various augmentation methods. To achieve a robust testing, all models were evaluated on two test datasets with an expected and difficult appearance of the smoke detectors including motion blur, small resolution, or not complete objects. The best performing detector is the YOLOv11n, which reaches the average mAP@0.5 score of 0.884. Our code, pretrained models and dataset are publicly available.

[75] Bilingual Text-to-Motion Generation: A New Benchmark and Baselines

Wanjiang Weng,Xiaofeng Tan,Xiangbo Shu,Guo-Sen Xie,Pan Zhou,Hongsong Wang

Main category: cs.CV

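TL;DR: This paper introduces BiHumanML3D, the first bilingual text-to-motion benchmark, and a simple yet effective baseline, Bilingual Motion Diffusion (BiMD). Cross-Lingual Alignment (CLA) improves bilingual motion generation quality and supports zero-shot code-switching, with metrics well ahead of monolingual models and translation baselines.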

Details Motivation: Existing text-to-motion methods are held back by the lack of bilingual datasets and the weak cross-lingual semantic understanding of current language models. Method: Builds BiHumanML3D, the first bilingual text-to-motion benchmark (LLM-assisted annotation followed by manual correction), and proposes BiMD, a bilingual motion diffusion model with a Cross-Lingual Alignment (CLA) module that provides a unified conditional space for bilingual input and zero-shot code-switching. Result: BiMD with CLA achieves an FID of 0.045 (vs. 0.169) and R@3 of 82.8% (vs. 80.8%) on BiHumanML3D, significantly outperforming monolingual diffusion models and translation baselines. Conclusion: The BiHumanML3D dataset and the CLA alignment strategy are both important and reliable for cross-lingual motion synthesis and lay a foundation for this direction. Abstract: Text-to-motion generation holds significant potential for cross-linguistic applications, yet it is hindered by the lack of bilingual datasets and the poor cross-lingual semantic understanding of existing language models. To address these gaps, we introduce BiHumanML3D, the first bilingual text-to-motion benchmark, constructed via LLM-assisted annotation and rigorous manual correction. Furthermore, we propose a simple yet effective baseline, Bilingual Motion Diffusion (BiMD), featuring Cross-Lingual Alignment (CLA). CLA explicitly aligns semantic representations across languages, creating a robust conditional space that enables high-quality motion generation from bilingual inputs, including zero-shot code-switching scenarios. Extensive experiments demonstrate that BiMD with CLA achieves an FID of 0.045 vs. 0.169 and R@3 of 82.8% vs. 80.8%, significantly outperforming monolingual diffusion models and translation baselines on BiHumanML3D, underscoring the critical necessity and reliability of our dataset and the effectiveness of our alignment strategy for cross-lingual motion synthesis. The dataset and code are released at https://wengwanjiang.github.io/BilingualT2M-page

[76] OptiSAR-Net++: A Large-Scale Benchmark and Transformer-Free Framework for Cross-Domain Remote Sensing Visual Grounding

Xiaoyu Tang,Jun Dong,Jintao Cheng,Rui Fan

Main category: cs.CV

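TL;DR: This paper introduces the new task of cross-domain remote sensing visual grounding (CD-RSVG), builds OptSAR-RSVG, the first large-scale optical/SAR benchmark dataset, and designs the efficient model OptiSAR-Net++, which uses PL-MoE, CLIP-based contrastive learning, dynamic adversarial negative sampling, and text-guided dual-gate fusion to reach state-of-the-art localization accuracy and efficiency.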

Details Motivation: Existing remote sensing visual grounding methods are restricted to a single sensor (optical or SAR) and cannot meet the need for multi-source data in real-world scenarios; a unified cross-domain (optical + SAR) modeling approach is needed. Method: Proposes the OptiSAR-Net++ framework: (1) a patch-level Low-Rank Adaptation Mixture of Experts (PL-MoE) for cross-domain feature decoupling; (2) a CLIP-based contrastive paradigm with dynamic adversarial negative sampling that replaces computationally expensive Transformer decoding; (3) a text-guided dual-gate fusion module (TGDF-SSA) and a region-aware auxiliary head that improve semantic-visual alignment and spatial modeling. Result: Achieves state-of-the-art performance on both the newly built OptSAR-RSVG and the public DIOR-RSVG benchmarks, with clear gains in localization accuracy and inference efficiency. Conclusion: CD-RSVG is a more practically relevant task, and OptiSAR-Net++ demonstrates the effectiveness of cross-domain feature decoupling and efficient cross-modal matching for multi-source remote sensing understanding. Abstract: Remote sensing visual grounding (RSVG) aims to localize specific targets in remote sensing images using natural language expressions. However, existing methods are restricted to single-sensor domains, i.e., either optical or synthetic aperture radar (SAR), limiting their real-world applicability. In this paper, we introduce the Cross-Domain RSVG (CD-RSVG) task and construct OptSAR-RSVG, the first large-scale benchmark dataset for this setting. To tackle the challenges of cross-domain feature modeling, computational inefficiency, and fine-grained semantic discrimination, we propose OptiSAR-Net++. Our framework features a patch-level Low-Rank Adaptation Mixture of Experts (PL-MoE) for efficient cross-domain feature decoupling. To mitigate the substantial computational overhead of Transformer decoding frameworks, we adopt a CLIP-based contrastive paradigm and further incorporate dynamic adversarial negative sampling, thereby transforming generative regression into an efficient cross-modal matching process. Additionally, a text-guided dual-gate fusion module (TGDF-SSA) and a region-aware auxiliary head are introduced to enhance semantic-visual alignment and spatial modeling. Extensive experiments demonstrate that OptiSAR-Net++ achieves SOTA performance on both OptSAR-RSVG and DIOR-RSVG benchmarks, offering significant advantages in localization accuracy and efficiency. Our code and dataset will be made publicly available.

[77] Probabilistic Concept Graph Reasoning for Multimodal Misinformation Detection

Ruichao Yang,Wei Gao,Xiaobin Zhu,Jing Ma,Hongzhan Lin,Ziyang Luo,Bo-Wen Zhang,Xu-Cheng Yin

Main category: cs.CV

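TL;DR: This paper proposes Probabilistic Concept Graph Reasoning (PCGR), an interpretable and evolvable framework for multimodal misinformation detection that builds a graph of human-understandable concepts and reasons over it with hierarchical attention, achieving high accuracy and strong robustness.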

Details Motivation: Traditional multimodal misinformation detectors are opaque black boxes that cope poorly with new manipulation tactics; an interpretable, evolvable solution is needed. Method: PCGR follows a build-then-infer paradigm: it first uses multimodal large language models (MLLMs) to automatically discover and validate high-level concepts and construct a graph of human-understandable concept nodes, then applies hierarchical attention over that graph to produce interpretable reasoning chains. Result: PCGR achieves state-of-the-art accuracy on multimodal misinformation detection and robustness to emerging manipulation types, outperforming prior methods in both coarse detection and fine-grained manipulation recognition. Conclusion: PCGR offers an interpretable, evolvable approach to multimodal misinformation detection that combines performance with transparency and has potential for practical deployment. Abstract: Multimodal misinformation poses an escalating challenge that often evades traditional detectors, which are opaque black boxes and fragile against new manipulation tactics. We present Probabilistic Concept Graph Reasoning (PCGR), an interpretable and evolvable framework that reframes multimodal misinformation detection (MMD) as structured and concept-based reasoning. PCGR follows a build-then-infer paradigm, which first constructs a graph of human-understandable concept nodes, including novel high-level concepts automatically discovered and validated by multimodal large language models (MLLMs), and then applies hierarchical attention over this concept graph to infer claim veracity. This design produces interpretable reasoning chains linking evidence to conclusions. Experiments demonstrate that PCGR achieves state-of-the-art MMD accuracy and robustness to emerging manipulation types, outperforming prior methods in both coarse detection and fine-grained manipulation recognition.

[78] SurgPhase: Time efficient pituitary tumor surgery phase recognition via an interactive web platform

Yan Meng,Jack Cook,X. Y. Han,Kaan Duman,Shauna Otto,Dhiraj Pangal,Jonathan Chainey,Ruth Lau,Margaux Masson-Forsythe,Daniel A. Donoho,Danielle Levy,Gabriel Zada,Sébastien Froelich,Juan Fernandez-Miranda,Mike Chang

Main category: cs.CV

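TL;DR: This paper presents a surgical phase recognition framework for pituitary tumor surgery videos that combines self-supervised learning, temporal modeling, and scalable annotation strategies. It reaches 90% accuracy on a held-out test set and is accompanied by an online collaborative platform for video upload, automated analysis, and shared dataset building.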

Details Motivation: Accurate surgical phase recognition is essential for intraoperative decision support, surgical education, and performance evaluation, but it is hampered by scarce labeled data and high procedural variability. Method: Uses self-supervised pretraining on 251 unlabeled PTS videos to extract features, then fine-tunes on 81 labeled procedures with focal loss, gradual layer unfreezing, and dynamic sampling; also develops an online collaborative platform that supports video upload, automated analysis, and data sharing. Result: Reaches 90% accuracy on a held-out test set, outperforming current state-of-the-art methods and generalizing well across cases. Conclusion: The framework mitigates the shortage of labeled data, improves surgical phase recognition, and through the collaborative platform supports data accumulation and continuous model improvement, offering a practical path toward intelligent surgical analysis. Abstract: Accurate surgical phase recognition is essential for analyzing procedural workflows, supporting intraoperative decision-making, and enabling data-driven improvements in surgical education and performance evaluation. In this work, we present a comprehensive framework for phase recognition in pituitary tumor surgery (PTS) videos, combining self-supervised representation learning, robust temporal modeling, and scalable data annotation strategies. Our method achieves 90% accuracy on a held-out test set, outperforming current state-of-the-art approaches and demonstrating strong generalization across variable surgical cases. A central contribution of this work is the integration of a collaborative online platform designed for surgeons to upload surgical videos, receive automated phase analysis, and contribute to a growing dataset. This platform not only facilitates large-scale data collection but also fosters knowledge sharing and continuous model improvement. To address the challenge of limited labeled data, we pretrain a ResNet-50 model using the self-supervised framework on 251 unlabeled PTS videos, enabling the extraction of high-quality feature representations. Fine-tuning is performed on a labeled dataset of 81 procedures using a modified training regime that incorporates focal loss, gradual layer unfreezing, and dynamic sampling to address class imbalance and procedural variability.
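
The abstract names focal loss as one of the measures against class imbalance. The following is a minimal sketch of the standard definition FL(p_t) = -(1 - p_t)^gamma * log(p_t), where p_t is the predicted probability of the true class. The gamma value and the probabilities are illustrative; the settings used in the paper are not given in the abstract.

```python
import math

def focal_loss(p_true, gamma=2.0):
    """Focal loss for one sample, given the probability of its true class.

    With gamma = 0 this reduces to ordinary cross-entropy, -log(p_true).
    """
    if not 0.0 < p_true <= 1.0:
        raise ValueError("p_true must be in (0, 1]")
    return -((1.0 - p_true) ** gamma) * math.log(p_true)

easy = focal_loss(0.9)                # well-classified frame: loss is scaled down strongly
hard = focal_loss(0.1)                # misclassified frame: loss stays large
ce_easy = focal_loss(0.9, gamma=0.0)  # cross-entropy baseline for the easy frame
```

Because the factor (1 - p_t)^gamma shrinks the contribution of confidently correct samples, frames from frequent, easy phases contribute less to the gradient than frames from rare or ambiguous phases.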

[79] Self-Supervised Learning for Knee Osteoarthritis: Diagnostic Limitations and Prognostic Value of Uncurated Hospital Data

Haresh Rengaraj Rajamohan,Yuxuan Chen,Kyunghyun Cho,Cem M. Deniz

Main category: cs.CV

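TL;DR: This study evaluates self-supervised learning (SSL) for diagnostic and prognostic modeling of knee osteoarthritis (OA). Multimodal image-text SSL did not improve diagnosis because the pretraining data were severely skewed (93% estimated KL grade 3), but it significantly improved prognostic tasks such as predicting 4-year structural progression.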

Details Motivation: To determine whether self-supervised learning, in particular SSL on real-world uncurated hospital data, yields better representations than ImageNet pretraining for knee OA diagnosis (KL grading) and prognosis (structural progression). Method: Compares two kinds of SSL pretraining: (i) image-only SSL on knee radiographs from the OAI, MOST, and NYU cohorts; (ii) multimodal image-text SSL on uncurated hospital knee radiographs paired with radiologist impressions. Both are evaluated with linear probing and full fine-tuning on KL grade diagnosis and on prediction of 4-year structural incidence and progression. Result: Image-only SSL improves KL grading accuracy under linear probing but does not beat ImageNet under full fine-tuning. Multimodal SSL gives no diagnostic gain (attributed to the severe KL grade 3 skew in the pretraining data) but significantly outperforms ImageNet on prognostic tasks (for example, external validation on MOST gives AUROC 0.701 vs. 0.599 at 10% labeled data). Conclusion: Uncurated hospital image-text data are poorly suited to diagnostic modeling because of distribution bias, but they provide strong, generalizable representations when the downstream prognostic task aligns with the pretraining distribution, underlining the importance of task alignment in SSL. Abstract: This study assesses whether self-supervised learning (SSL) improves knee osteoarthritis (OA) modeling for diagnosis and prognosis relative to ImageNet-pretrained initialization. We compared (i) image-only SSL pretrained on knee radiographs from the OAI, MOST, and NYU cohorts, and (ii) multimodal image-text SSL pretrained on uncurated hospital knee radiographs paired with radiologist impressions. For diagnostic Kellgren-Lawrence (KL) grade prediction, SSL offered mixed results. While image-only SSL improved accuracy during linear probing (frozen encoder), it did not outperform ImageNet pretraining during full fine-tuning. Similarly, multimodal SSL failed to improve grading performance. We attribute this to severe bias in the uncurated hospital pretraining corpus (93% estimated KL grade 3), which limited alignment with the balanced diagnostic task. In contrast, this same multimodal initialization significantly improved prognostic modeling. It outperformed ImageNet baselines in predicting 4-year structural incidence and progression, including on external validation (MOST AUROC: 0.701 vs. 0.599 at 10% labeled data). Overall, while uncurated hospital image-text data may be ineffective for learning diagnosis due to severity bias, it provides a strong signal for prognostic modeling when the downstream task aligns with pretraining data distribution.

[80] ICTPolarReal: A Polarized Reflection and Material Dataset of Real World Objects

Jing Yang,Krithika Dharanikota,Emily Jia,Haiwei Chen,Yajie Zhao

Main category: cs.CV

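TL;DR: This paper introduces a large-scale real-world polarized reflection and material dataset containing 1.2 million high-resolution images of 218 everyday objects across multiple acquisition dimensions. Inverse and forward rendering models trained on it improve significantly on material decomposition, relighting, and sparse-view 3D reconstruction.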

Details Motivation: Modeling real-world material reflectance is difficult mainly because measured reflectance data are scarce; existing methods rely on synthetic data with simplified illumination and limited realism, so models generalize poorly to real images. Method: Builds a large-scale real-world polarized reflection and material dataset captured with an 8-camera, 346-light Light Stage equipped with cross/parallel polarization. It covers five dimensions (multiview, multi-illumination, polarization, reflectance separation, and material attributes) and provides diffuse-specular separation along with analytically derived diffuse albedo, specular albedo, and surface normals. Result: State-of-the-art inverse and forward rendering models trained and evaluated on this dataset show significant improvements in material separation accuracy, illumination fidelity, and geometric consistency on intrinsic decomposition, relighting, and sparse-view 3D reconstruction. Conclusion: The work establishes a new benchmark for physically grounded material understanding and helps inverse rendering models move from synthetic training toward real-world generalization. Abstract: Accurately modeling how real-world materials reflect light remains a core challenge in inverse rendering, largely due to the scarcity of real measured reflectance data. Existing approaches rely heavily on synthetic datasets with simplified illumination and limited material realism, preventing models from generalizing to real-world images. We introduce a large-scale polarized reflection and material dataset of real-world objects, captured with an 8-camera, 346-light Light Stage equipped with cross/parallel polarization. Our dataset spans 218 everyday objects across five acquisition dimensions (multiview, multi-illumination, polarization, reflectance separation, and material attributes), yielding over 1.2M high-resolution images with diffuse-specular separation and analytically derived diffuse albedo, specular albedo, and surface normals. Using this dataset, we train and evaluate state-of-the-art inverse and forward rendering models on intrinsic decomposition, relighting, and sparse-view 3D reconstruction, demonstrating significant improvements in material separation, illumination fidelity, and geometric consistency. We hope that our work can establish a new foundation for physically grounded material understanding and enable real-world generalization beyond synthetic training regimes. Project page: https://jingyangcarl.github.io/ICTPolarReal/

[81] TIGFlow-GRPO: Trajectory Forecasting via Interaction-Aware Flow Matching and Reward-Driven Optimization

Xuepeng Jing,Wenhuan Lu,Hao Meng,Zhizhi Yu,Jianguo Wei

Main category: cs.CV

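TL;DR: This paper proposes TIGFlow-GRPO, a two-stage generative framework that aligns flow-based trajectory generation with behavioral rules. The first stage strengthens context modeling using conditional flow matching (CFM) with a Trajectory-Interaction-Graph (TIG) module; the second stage applies Flow-GRPO post-training, reformulating deterministic flow rollout as stochastic SDE sampling and introducing a composite reward that balances social compliance and physical feasibility, which improves forecasting accuracy, long-horizon stability, and behavioral plausibility.

Details Motivation: Existing trajectory forecasting methods based on conditional flow matching (CFM) rely mainly on supervised fitting and struggle to reflect social norms and scene constraints, producing trajectories that are behaviorally implausible or physically infeasible. Method: Proposes the two-stage TIGFlow-GRPO framework. Stage one builds a CFM predictor with an embedded Trajectory-Interaction-Graph (TIG) module that explicitly models fine-grained agent-agent and agent-scene interactions. Stage two applies Flow-GRPO post-training, converting deterministic ODE rollout into stochastic SDE sampling and designing a composite reward that combines view-aware social compliance with map-aware physical feasibility, so that multimodal predictions are optimized through reinforcement learning. Result: On the ETH/UCY and SDD datasets, TIGFlow-GRPO significantly improves forecasting accuracy and long-horizon stability while generating trajectories that better respect social norms and physical constraints. Conclusion: TIGFlow-GRPO effectively bridges flow-based trajectory modeling and behavior-aware alignment, offering a new paradigm for intelligent trajectory forecasting in dynamic multimedia environments. Abstract: Human trajectory forecasting is important for intelligent multimedia systems operating in visually complex environments, such as autonomous driving and crowd surveillance. Although Conditional Flow Matching (CFM) has shown strong ability in modeling trajectory distributions from spatio-temporal observations, existing approaches still focus primarily on supervised fitting, which may leave social norms and scene constraints insufficiently reflected in generated trajectories. To address this issue, we propose TIGFlow-GRPO, a two-stage generative framework that aligns flow-based trajectory generation with behavioral rules. In the first stage, we build a CFM-based predictor with a Trajectory-Interaction-Graph (TIG) module to model fine-grained visual-spatial interactions and strengthen context encoding. This stage captures both agent-agent and agent-scene relations more effectively, providing more informative conditional features for subsequent alignment. In the second stage, we perform Flow-GRPO post-training, where deterministic flow rollout is reformulated as stochastic ODE-to-SDE sampling to enable trajectory exploration, and a composite reward combines view-aware social compliance with map-aware physical feasibility. By evaluating trajectories explored through SDE rollout, GRPO progressively steers multimodal predictions toward behaviorally plausible futures.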
Experiments on the ETH/UCY and SDD datasets show that TIGFlow-GRPO improves forecasting accuracy and long-horizon stability while generating trajectories that are more socially compliant and physically feasible. These results suggest that the proposed framework provides an effective way to connect flow-based trajectory modeling with behavior-aware alignment in dynamic multimedia environments.

[82] Infinite Gaze Generation for Videos with Autoregressive Diffusion

Jenna Kang,Colin Groth,Tong Wu,Finley Torrens,Patsorn Sangkloy,Gordon Wetzstein,Qi Sun

Main category: cs.CV

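TL;DR: This paper proposes an infinite-horizon raw gaze trajectory generation framework based on an autoregressive diffusion model, which predicts gaze for videos of arbitrary length with high spatio-temporal accuracy and realism.

Details Motivation: Traditional saliency maps and scanpaths fail to capture the fine-grained temporal dynamics of raw gaze, and existing models are limited to short windows (about 3-5 seconds), so they cannot model the long-range behavioral dependencies of real-world content. Method: Proposes an autoregressive diffusion model, conditioned on a saliency-aware visual latent space, that generates gaze trajectories with continuous spatial coordinates and high-resolution timestamps. Result: In quantitative and qualitative evaluations, the method significantly outperforms existing approaches in long-range spatio-temporal accuracy and trajectory realism. Conclusion: The generative framework achieves infinite-horizon raw gaze prediction for videos of arbitrary length, supporting better video scene understanding and multimodal interaction. Abstract: Predicting human gaze in video is fundamental to advancing scene understanding and multimodal interaction. While traditional saliency maps provide spatial probability distributions and scanpaths offer ordered fixations, both abstractions often collapse the fine-grained temporal dynamics of raw gaze. Furthermore, existing models are typically constrained to short-term windows ($\approx$ 3-5s), failing to capture the long-range behavioral dependencies inherent in real-world content. We present a generative framework for infinite-horizon raw gaze prediction in videos of arbitrary length. By leveraging an autoregressive diffusion model, we synthesize gaze trajectories characterized by continuous spatial coordinates and high-resolution timestamps. Our model is conditioned on a saliency-aware visual latent space. Quantitative and qualitative evaluations demonstrate that our approach significantly outperforms existing approaches in long-range spatio-temporal accuracy and trajectory realism.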

[83] BiFM: Bidirectional Flow Matching for Few-Step Image Editing and Generation

Yasong Dai,Zeeshan Hayder,David Ahmedt-Aristizabal,Hongdong Li

Main category: cs.CV

TL;DR: This paper proposes BiFM (Bidirectional Flow Matching), a framework that models generation and inversion jointly. By estimating average velocity fields in both directions and adding continuous time-interval supervision with a bidirectional consistency objective, it achieves high-quality image editing and generation in few steps.

Details Motivation: Existing few-step inversion methods approximate the forward process poorly and depend on pretrained generators and auxiliary modules, which limits scalability and generalization across architectures. Method: BiFM learns generation and inversion jointly, estimating average velocity fields in both the "image $\to$ noise" and "noise $\to$ image" directions, constrained by a shared instantaneous velocity field. Training uses continuous time-interval supervision, a bidirectional consistency loss, and a lightweight time-interval embedding. Result: BiFM significantly outperforms existing few-step methods across a range of image editing and generation tasks, supports one-step inversion, and integrates seamlessly into popular diffusion and flow matching backbones. Conclusion: BiFM provides a general, efficient, and scalable paradigm for few-step image generation and semantic-preserving editing. Abstract: Recent diffusion and flow matching models have demonstrated strong capabilities in image generation and editing by progressively removing noise through iterative sampling. While this enables flexible inversion for semantic-preserving edits, few-step sampling regimes suffer from poor forward process approximation, leading to degraded editing quality. Existing few-step inversion methods often rely on pretrained generators and auxiliary modules, limiting scalability and generalization across different architectures. To address these limitations, we propose BiFM (Bidirectional Flow Matching), a unified framework that jointly learns generation and inversion within a single model. BiFM directly estimates average velocity fields in both "image $\to$ noise" and "noise $\to$ image" directions, constrained by a shared instantaneous velocity field derived from either predefined schedules or pretrained multi-step diffusion models. Additionally, BiFM introduces a novel training strategy using continuous time-interval supervision, stabilized by a bidirectional consistency objective and a lightweight time-interval embedding. This bidirectional formulation also enables one-step inversion and can integrate seamlessly into popular diffusion and flow matching backbones. Across diverse image editing and generation tasks, BiFM consistently outperforms existing few-step approaches, achieving superior performance and editability.

[84] Select, Hypothesize and Verify: Towards Verified Neuron Concept Interpretation

ZeBin Ji,Yang Hu,Xiuli Bi,Bo Liu,Bin Xiao

Main category: cs.CV

TL;DR: This paper proposes a Select-Hypothesize-Verify framework that selects representative samples through activation-distribution analysis, generates concept hypotheses, and verifies how strongly each concept activates the target neuron, improving the accuracy of neuron concept interpretation.

Details Motivation: Existing neuron concept interpretation methods assume every neuron has a well-defined, discriminative function, but in practice some neurons are redundant or misleading, which makes the resulting explanations unreliable. Method: Proposes the Select-Hypothesize-Verify framework: 1) select the most representative activation samples through activation-distribution analysis; 2) generate natural-language concept hypotheses for the selected neurons; 3) verify whether each concept highly activates the corresponding neuron. Result: Experiments show that concepts generated by the method activate the target neurons with a probability about 1.5 times that of the current state-of-the-art method, giving more accurate concept interpretations. Conclusion: Adding a function-verification step within the three-stage framework significantly improves the reliability and accuracy of neuron concept interpretation and supports a more trustworthy understanding of neural network decision mechanisms. Abstract: It is essential for understanding neural network decisions to interpret the functionality (also known as concepts) of neurons. Existing approaches describe neuron concepts by generating natural language descriptions, thereby advancing the understanding of the neural network's decision-making mechanism. However, these approaches assume that each neuron has well-defined functions and provides discriminative features for neural network decision-making. In fact, some neurons may be redundant or may offer misleading concepts. Thus, the descriptions for such neurons may cause misinterpretations of the factors driving the neural network's decisions. To address the issue, we introduce a verification of neuron functions, which checks whether the generated concept highly activates the corresponding neuron. Furthermore, we propose a Select-Hypothesize-Verify framework for interpreting neuron functionality. This framework consists of: 1) selecting activation samples that best capture a neuron's well-defined functional behavior through activation-distribution analysis; 2) forming hypotheses about concepts for the selected neurons; and 3) verifying whether the generated concepts accurately reflect the functionality of the neuron. Extensive experiments show that our method produces more accurate neuron concepts. Our generated concepts activate the corresponding neurons with a probability approximately 1.5 times that of the current state-of-the-art method.

[85] Self-Corrected Image Generation with Explainable Latent Rewards

Yinyi Luo,Hrishikesh Gokhale,Marios Savvides,Jindong Wang,Shengfeng He

Main category: cs.CV

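TL;DR: This paper proposes xLARD, a framework that uses multimodal large language models and Explainable Latent Rewards to self-correct during text-to-image generation, improving alignment on fine-grained semantics and spatial relations.

Details Motivation: In text-to-image generation, a feed-forward generator cannot reliably anticipate how well it will align with a complex prompt before producing the image, whereas evaluating an image is comparatively easy; generation and evaluation are therefore asymmetric. Method: Proposes xLARD, which consists of a lightweight corrector and a differentiable mapping from latent edits to reward signals, and uses structured feedback from a multimodal large language model to guide refinement of the latent representation. Result: Experiments across diverse generation and editing tasks show that xLARD improves semantic alignment and visual fidelity while preserving the original generative priors. Conclusion: By turning non-differentiable image-level evaluation into differentiable latent-level guidance, xLARD lets the generation process understand, assess, and correct itself, offering a new paradigm for controllable text-to-image generation. Abstract: Despite significant progress in text-to-image generation, aligning outputs with complex prompts remains challenging, particularly for fine-grained semantics and spatial relations. This difficulty stems from the feed-forward nature of generation, which requires anticipating alignment without fully understanding the output. In contrast, evaluating generated images is more tractable. Motivated by this asymmetry, we propose xLARD, a self-correcting framework that uses multimodal large language models to guide generation through Explainable LAtent RewarDs. xLARD introduces a lightweight corrector that refines latent representations based on structured feedback from model-generated references. A key component is a differentiable mapping from latent edits to interpretable reward signals, enabling continuous latent-level guidance from non-differentiable image-level evaluations. This mechanism allows the model to understand, assess, and correct itself during generation. Experiments across diverse generation and editing tasks show that xLARD improves semantic alignment and visual fidelity while maintaining generative priors. Code is available at https://yinyiluo.github.io/xLARD/.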

[86] PASDiff: Physics-Aware Semantic Guidance for Joint Real-world Low-Light Face Enhancement and Restoration

Yilin Ni,Wenjie Li,Zhengxue Wang,Juncheng Li,Guangwei Gao,Jian Yang

Main category: cs.CV

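TL;DR: This paper proposes PASDiff, a training-free physics-aware semantic diffusion model that combines photometric constraints with facial structure priors to handle the multiple degradations of low-light face images.

Details Motivation: Low-light face images captured in real scenes suffer from several degradations at once, including insufficient illumination, blur, noise, and poor visibility. Existing cascaded methods accumulate errors severely, while generic joint models lack explicit facial priors and struggle to recover clear face structures. Method: Proposes PASDiff: 1) introduces photometric constraints through inverse intensity weighting and Retinex theory to recover plausible illumination and color distributions; 2) designs a Style-Agnostic Structural Injection (SASI) module that extracts structure from an off-the-shelf facial prior while filtering out its photometric biases, harmonizing identity features with physical constraints; 3) constructs WildDark-Face, a real-world low-light face dataset of 700 images. Result: The method significantly outperforms existing approaches on multiple metrics and strikes a better balance among natural illumination, color recovery, and identity consistency. Conclusion: By combining physical modeling with semantic priors, PASDiff achieves high-quality, training-free low-light face enhancement and offers a new direction for real-world applications. Abstract: Face images captured in real-world low light suffer multiple degradations: low illumination, blur, noise, and low visibility, among others. Existing cascaded solutions often suffer from severe error accumulation, while generic joint models lack explicit facial priors and struggle to resolve clear face structures. In this paper, we propose PASDiff, a training-free Physics-Aware Semantic Diffusion framework. To achieve a plausible illumination and color distribution, we leverage inverse intensity weighting and Retinex theory to introduce photometric constraints, thereby reliably recovering visibility and natural chromaticity. To faithfully reconstruct facial details, our Style-Agnostic Structural Injection (SASI) extracts structures from an off-the-shelf facial prior while filtering out its intrinsic photometric biases, seamlessly harmonizing identity features with physical constraints. Furthermore, we construct WildDark-Face, a real-world benchmark of 700 low-light facial images with complex degradations. Extensive experiments demonstrate that PASDiff significantly outperforms existing methods, achieving a superior balance among natural illumination, color recovery, and identity consistency.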

[87] MoE-GRPO: Optimizing Mixture-of-Experts via Reinforcement Learning in Vision-Language Models

Dohwan Ko,Jinyoung Park,Seoung Choi,Sanghyeok Lee,Seohyun Lee,Hyunwoo J. Kim

Main category: cs.CV

TL;DR: This paper proposes MoE-GRPO, a reinforcement learning (GRPO)-based framework for optimizing expert routing that increases the diversity of expert selection in Mixture-of-Experts (MoE) vision-language models (VLMs), mitigates expert overfitting, and enables task-level expert specialization.

Details Motivation: The deterministic top-K routing used by existing MoE-based VLMs can overlook better expert combinations, leads to expert overfitting, and lacks diversity in expert selection. Method: Formulates expert selection as a sequential decision-making problem and optimizes it with Group Relative Policy Optimization (GRPO); adds a modality-aware router guidance mechanism that discourages exploration of experts that are infrequently activated for a given modality, improving training stability and efficiency. Result: On multi-modal image and video benchmarks, MoE-GRPO significantly outperforms standard top-K routing and its variants, with more diverse expert selection, less expert overfitting, and support for task-level expert specialization. Conclusion: Through RL-driven adaptive routing and modality-aware guidance, MoE-GRPO improves the generalization and efficiency of MoE-based VLMs and offers a new paradigm for sparse multi-modal modeling. Abstract: Mixture-of-Experts (MoE) has emerged as an effective approach to reduce the computational overhead of Transformer architectures by sparsely activating a subset of parameters for each token while preserving high model capacity. This paradigm has recently been extended to Vision-Language Models (VLMs), enabling scalable multi-modal understanding with reduced computational cost. However, the widely adopted deterministic top-K routing mechanism may overlook more optimal expert combinations and lead to expert overfitting. To address this limitation and improve the diversity of expert selection, we propose MoE-GRPO, a reinforcement learning (RL)-based framework for optimizing expert routing in MoE-based VLMs. Specifically, we formulate expert selection as a sequential decision-making problem and optimize it using Group Relative Policy Optimization (GRPO), allowing the model to learn adaptive expert routing policies through exploration and reward-based feedback. Furthermore, we introduce a modality-aware router guidance that enhances training stability and efficiency by discouraging the router from exploring experts that are infrequently activated for a given modality. Extensive experiments on multi-modal image and video benchmarks show that MoE-GRPO consistently outperforms standard top-K routing and its variants by promoting more diverse expert selection, thereby mitigating expert overfitting and enabling task-level expert specialization.
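The abstract above does not spell out how GRPO scores a sampled routing decision, so the following is only a minimal sketch of the group-relative advantage commonly associated with GRPO: several rollouts are sampled for the same input, and each reward is normalized against the mean and standard deviation of its own group. The reward values and the `eps` constant are illustrative assumptions, not details from the paper.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    # eps (an assumed constant) avoids division by zero when all rewards match.
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for four routing rollouts sampled for the same input.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Rollouts that beat their group's average receive a positive advantage and the others a negative one, so no separate learned value function is needed to form the baseline.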

[88] Few-Shot Left Atrial Wall Segmentation in 3D LGE MRI via Meta-Learning

Yusri Al-Sanaani,Rebecca Thornhill,Pablo Nery,Elena Pena,Robert deKemp,Calum Redpath,David Birnie,Sreeraman Rajan

Main category: cs.CV

TL;DR: This paper proposes a few-shot (K-shot) 3D left atrial wall segmentation framework based on Model-Agnostic Meta-Learning (MAML). Combining multi-task meta-training with a boundary-aware composite loss, it improves thin-structure segmentation accuracy and robustness when annotations are scarce.

Details Motivation: The left atrial wall is hard to segment in late gadolinium enhancement MRI because it is thin, low in contrast, and rarely annotated by experts, so methods that are accurate and generalize well in the few-shot setting are needed. Method: Uses the Model-Agnostic Meta-Learning (MAML) framework for K-shot (5/10/20) 3D left atrial wall segmentation; meta-training adds left and right atrial cavity segmentation as auxiliary tasks; a boundary-aware composite loss improves boundary accuracy on thin structures. Result: On the hold-out test set at 5-shot, Dice reaches 0.64 (versus 0.52 for supervised fine-tuning) and HD95 is 5.70 mm (versus 7.60 mm); at 20-shot the method approaches fully supervised performance (0.69 vs. 0.71); performance stays robust under an unseen domain shift and on a local cohort (5-shot Dice of 0.59 and 0.57, respectively). Conclusion: The MAML approach improves the accuracy and robustness of few-shot thin-wall left atrial segmentation, which may reduce annotation cost in clinical use and support the assessment of atrial remodeling. Abstract: Segmenting the left atrial wall from late gadolinium enhancement magnetic resonance images (MRI) is challenging due to the wall's thin geometry, low contrast, and the scarcity of expert annotations. We propose a Model-Agnostic Meta-Learning (MAML) framework for K-shot (K = 5, 10, 20) 3D left atrial wall segmentation that is meta-trained on the wall task together with auxiliary left atrial and right atrial cavity tasks and uses a boundary-aware composite loss to emphasize thin-structure accuracy. We evaluated MAML segmentation performance on a hold-out test set and assessed robustness under an unseen synthetic shift and on a distinct local cohort. On the hold-out test set, MAML appeared to improve segmentation performance compared to the supervised fine-tuning model, achieving a Dice score (DSC) of 0.64 vs. 0.52 and HD95 of 5.70 vs. 7.60 mm at 5-shot, and approached the fully supervised reference at 20-shot (0.69 vs. 0.71 DSC). Under unseen shift, performance degraded but remained robust: at 5-shot, MAML attained 0.59 DSC and 5.99 mm HD95 on the unseen domain shift and 0.57 DSC and 6.01 mm HD95 on the local cohort, with consistent gains as K increased. These results suggest that more accurate and reliable thin-wall boundaries are achievable in low-shot adaptation, potentially enabling clinical translation with minimal additional labeling for the assessment of atrial remodeling.
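For readers less familiar with the Dice score (DSC) quoted above, here is a small sketch of the standard definition, DSC = 2|A ∩ B| / (|A| + |B|), on flattened binary masks. The toy masks and the empty-mask convention are assumptions for illustration; the paper's evaluation code is not shown in the abstract.

```python
def dice_score(pred, target):
    """Dice similarity coefficient for two flat binary masks (sequences of 0/1)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total

# Toy 6-voxel masks: 3 predicted voxels, 3 true voxels, 2 shared.
pred = [0, 1, 1, 1, 0, 0]
target = [0, 0, 1, 1, 1, 0]
score = dice_score(pred, target)  # 2 * 2 / (3 + 3)
```

Because the denominator counts only foreground voxels, a one-voxel boundary error costs proportionally more on a thin wall than on a large cavity, which is one reason wall Dice values tend to be lower than cavity Dice values.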

[89] Towards Video Anomaly Detection from Event Streams: A Baseline and Benchmark Datasets

Peng Wu,Yuting Yan,Guansong Pang,Yujia Sun,Qingsen Yan,Peng Wang,Yanning Zhang

Main category: cs.CV

TL;DR: This paper proposes EWAD, the first video anomaly detection framework designed for event streams, and builds several benchmark datasets with synchronized event streams and RGB video to advance event-driven anomaly detection research.

Details Motivation: Event cameras offer low redundancy, a focus on dynamic motion, and inherent privacy protection, which suits video anomaly detection well, but the lack of dedicated datasets and effective modeling methods has seriously held the field back. Method: Constructs several benchmark datasets with synchronized event streams and RGB video; proposes the EWAD framework, which comprises an event-density-aware dynamic sampling strategy, a density-modulated temporal modeling approach, and an RGB-to-event knowledge distillation mechanism. Result: On the three newly built benchmarks, EWAD significantly outperforms existing methods, demonstrating the potential and effectiveness of event-driven modeling for video anomaly detection. Conclusion: This work is the first to bring event streams into video anomaly detection systematically, establishing benchmark datasets and an effective framework that lay a foundation for the direction. Abstract: Event-based vision, characterized by low redundancy, focus on dynamic motion, and inherent privacy-preserving properties, naturally fits the demands of video anomaly detection (VAD). However, the absence of dedicated event-stream anomaly detection datasets and effective modeling strategies has significantly hindered progress in this field. In this work, we take the first major step toward establishing event-based VAD as a unified research direction. We first construct multiple event-stream based benchmarks for video anomaly detection, featuring synchronized event and RGB recordings. Leveraging the unique properties of events, we then propose an EVent-centric spatiotemporal Video Anomaly Detection framework, namely EWAD, with three key innovations: an event density aware dynamic sampling strategy to select temporally informative segments; a density-modulated temporal modeling approach that captures contextual relations from sparse event streams; and an RGB-to-event knowledge distillation mechanism to enhance event-based representations under weak supervision. Extensive experiments on three benchmarks demonstrate that our EWAD achieves significant improvements over existing approaches, highlighting the potential and effectiveness of event-driven modeling for video anomaly detection. The benchmark datasets will be made publicly available.

[90] C2W-Tune: Cavity-to-Wall Transfer Learning for Thin Atrial Wall Segmentation in 3D Late Gadolinium-enhanced Magnetic Resonance

Yusri Al-Sanaani,Rebecca Thornhill,Sreeraman Rajan

Main category: cs.CV

TL;DR: This paper proposes C2W-Tune, a two-stage transfer framework that uses a high-accuracy left atrial cavity segmentation model as an anatomical prior to improve segmentation of the thin left atrial wall in 3D LGE-MRI. With cavity pre-training followed by fine-tuning under progressive unfreezing, it clearly outperforms a baseline trained from scratch on several boundary metrics.

Details Motivation: The left atrial wall is difficult to segment accurately in 3D LGE-MRI because it is thin, anatomically complex, and low in contrast, which affects wall thickness mapping and fibrosis quantification. Method: Proposes C2W-Tune. Stage 1 pre-trains a 3D U-Net with a ResNeXt encoder and instance normalization on left atrial cavity segmentation. Stage 2 transfers the weights and fine-tunes for wall segmentation with a progressive layer-unfreezing schedule, preserving endocardial features while refining wall-specific detail. Result: On the 2018 LA Segmentation Challenge dataset, wall Dice reaches 0.814 (+0.191), Surface Dice at 1 mm reaches 0.731 (+0.178), HD95 falls to 2.55 mm (−0.40 mm), and ASSD falls to 0.63 mm (−0.08 mm). With only 70 training volumes it still achieves Dice 0.78 and HD95 3.15 mm, exceeding multi-class benchmarks (Dice around 0.6-0.7). Conclusion: Anatomically grounded task transfer with controlled fine-tuning improves boundary accuracy for thin left atrial wall segmentation in 3D LGE-MRI. Abstract: Accurate segmentation of the left atrial (LA) wall in 3D late gadolinium-enhanced MRI (LGE-MRI) is essential for wall thickness mapping and fibrosis quantification, yet it remains challenging due to the wall's thinness, complex anatomy, and low contrast. We propose C2W-Tune, a two-stage cavity-to-wall transfer framework that leverages a high-accuracy LA cavity model as an anatomical prior to improve thin-wall delineation. Using a 3D U-Net with a ResNeXt encoder and instance normalization, Stage 1 pre-trains the network to segment the LA cavity, learning robust atrial representations. Stage 2 transfers these weights and adapts the network to LA wall segmentation using a progressive layer-unfreezing schedule to preserve endocardial features while enabling wall-specific refinement. Experiments on the 2018 LA Segmentation Challenge dataset demonstrate substantial gains over an architecture-matched baseline trained from scratch: wall Dice improves from 0.623 to 0.814, and Surface Dice at 1 mm improves from 0.553 to 0.731. Boundary errors were substantially reduced, with the 95th-percentile Hausdorff distance (HD95) decreasing from 2.95 mm to 2.55 mm and the average symmetric surface distance (ASSD) from 0.71 mm to 0.63 mm. Furthermore, even with reduced supervision (70 training volumes sampled from the same training pool), C2W-Tune achieved a Dice score of 0.78 and an HD95 of 3.15 mm, maintaining competitive performance and exceeding multi-class benchmarks that typically report Dice values around 0.6-0.7.
These results show that anatomically grounded task transfer with controlled fine-tuning improves boundary accuracy for thin LA wall segmentation in 3D LGE-MRI.

[91] Relaxed Rigidity with Ray-based Grouping for Dynamic Gaussian Splatting

Junoh Leea,Junmyeong Lee,Yeon-Ji Song,Inhwan Bae,Jisu Shin,Hae-Gon Jeon,Jin-Hwa Kim

Main category: cs.CV

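TL;DR: This paper proposes a method that explicitly keeps the local geometric structure of Gaussians consistent over time in 4D scenes. Using view-space ray grouping with $\alpha$-weight-based constraints, it avoids external priors such as optical flow and improves the physical plausibility and quality of monocular dynamic 3D reconstruction.

Details Motivation: Existing dynamic scene reconstruction methods based on 3D Gaussian Splatting struggle to model motion that follows real-world physics. With monocular video in particular, inconsistent Gaussian motion distorts local geometric structure and degrades reconstruction quality, and most methods depend on external priors such as optical flow or 2D tracks. Method: Proposes a view-space ray grouping strategy that clusters the Gaussians intersected by the same ray whose $\alpha$-blending weights exceed a threshold, then constrains each group to keep a consistent spatial distribution, which stabilizes local geometry over time and yields a more physically plausible motion model. Result: Validated on several challenging monocular datasets; integrated into two baseline models, the method significantly improves temporal consistency and reconstruction quality and outperforms existing mainstream methods. Conclusion: The method models physically plausible dynamic Gaussian motion without external priors and offers a more robust, self-consistent way to model geometry over time for monocular 4D scene reconstruction. Abstract: The reconstruction of dynamic 3D scenes using 3D Gaussian Splatting has shown significant promise. A key challenge, however, remains in modeling realistic motion, as most methods fail to align the motion of Gaussians with real-world physical dynamics. This misalignment is particularly problematic for monocular video datasets, where failing to maintain coherent motion undermines local geometric structure, ultimately leading to degraded reconstruction quality. Consequently, many state-of-the-art approaches rely heavily on external priors, such as optical flow or 2D tracks, to enforce temporal coherence. In this work, we propose a novel method to explicitly preserve the local geometric structure of Gaussians across time in 4D scenes. Our core idea is to introduce a view-space ray grouping strategy that clusters Gaussians intersected by the same ray, considering only those whose $\alpha$-blending weights exceed a threshold. We then apply constraints to these groups to maintain a consistent spatial distribution, effectively preserving their local geometry. This approach enforces a more physically plausible motion model by ensuring that local geometry remains stable over time, eliminating the reliance on external guidance. We demonstrate the efficacy of our method by integrating it into two distinct baseline models. Extensive experiments on challenging monocular datasets show that our approach significantly outperforms existing methods, achieving superior temporal consistency and reconstruction quality.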
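The grouping step described above can be pictured with a short sketch. It assumes the usual front-to-back compositing weights from Gaussian splatting, w_i = alpha_i * prod_{j<i} (1 - alpha_j), and keeps the Gaussians on one ray whose weight exceeds a cutoff. The function names, the per-ray opacities, and the threshold of 0.1 are hypothetical; the paper's actual grouping and constraint terms are not given in the abstract.

```python
def blending_weights(alphas):
    """Front-to-back compositing weights: w_i = alpha_i * prod_{j<i} (1 - alpha_j)."""
    weights = []
    transmittance = 1.0  # fraction of light not yet absorbed along the ray
    for alpha in alphas:
        weights.append(alpha * transmittance)
        transmittance *= 1.0 - alpha
    return weights

def ray_group(gaussian_ids, alphas, threshold=0.1):
    """Ids of the Gaussians on one ray whose blending weight exceeds the threshold."""
    weights = blending_weights(alphas)
    return [g for g, w in zip(gaussian_ids, weights) if w > threshold]

# Hypothetical depth-sorted Gaussians hit by one ray, with their opacities.
ids = [7, 3, 12, 5]
alphas = [0.5, 0.1, 0.8, 0.9]
group = ray_group(ids, alphas)  # weights are roughly 0.5, 0.05, 0.36, 0.081
```

In this toy case the last Gaussian is very opaque but mostly occluded, so it falls below the cutoff; thresholding on the blending weight rather than raw opacity keeps only the Gaussians that actually contribute to the pixel.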

[92] Distributed Real-Time Vehicle Control for Emergency Vehicle Transit: A Scalable Cooperative Method

WenXi Wang,JunQi Zhang

Main category: cs.CV

TL;DR: This paper proposes a scalable distributed vehicle control method in which ordinary vehicles use only local information to yield cooperatively to emergency vehicles in real time, balancing efficiency, safety, and adaptability.

Details Motivation: Existing approaches to rapid emergency vehicle transit (centralized optimization solvers and reinforcement learning) are computationally expensive and hard to scale to large or dynamic traffic scenarios, so a low-overhead, adaptive, and reliably safe distributed solution is needed. Method: Proposes a distributed vehicle control method based on local information and proves that it is approximately equivalent to optimal control under global information; further designs a distributed conflict resolution mechanism that keeps decisions safe and avoids a single point of failure. Result: Simulations on real-world traffic datasets show faster decision-making, less disturbance to ordinary vehicles, and stronger scalability across traffic densities and road network structures. Conclusion: The distributed method overcomes the fundamental limitations of centralized and learning-based approaches, delivering efficient, real-time, adaptive, and scalable cooperative control for emergency vehicle transit while guaranteeing safety. Abstract: Rapid transit of emergency vehicles is critical for saving lives and reducing property loss but often relies on surrounding ordinary vehicles to cooperatively adjust their driving behaviors. It is important to ensure rapid transit of emergency vehicles while minimizing the impact on ordinary vehicles. Centralized mathematical solvers and reinforcement learning are the state-of-the-art methods. The former obtains optimal solutions but is only practical for small-scale scenarios. The latter implicitly learns through extensive centralized training, but the trained model exhibits limited scalability to different traffic conditions. Hence, existing methods suffer from two fundamental limitations: high computational cost and lack of scalability. To overcome these limitations, this work proposes a scalable distributed vehicle control method, where vehicles adjust their driving behaviors in a distributed manner online using only local instead of global information. We prove that the proposed distributed method using only local information is approximately equivalent to the one using global information, which enables vehicles to evaluate their candidate states and make approximately optimal decisions in real time without pre-training and with natural adaptability to varying traffic conditions. Then, a distributed conflict resolution mechanism is further proposed to guarantee vehicles' safety by avoiding their decision conflicts, which eliminates the single-point-of-failure risk of centralized methods and provides deterministic safety guarantees that learned methods cannot offer.
Compared with existing methods, simulation experiments based on real-world traffic datasets demonstrate that the proposed method achieves faster decision-making, less impact on ordinary vehicles, and maintains much stronger scalability across different traffic densities and road configurations.

[93] Interpretable Zero-shot Referring Expression Comprehension with Query-driven Scene Graphs

Yike Wu,Necva Bolucu,Stephen Wan,Dadong Wang,Jiahao Xia,Jian Zhang

Main category: cs.CV

TL;DR: This paper proposes SGREC, which uses query-driven scene graphs as a structured intermediary between a vision-language model (VLM) and a large language model (LLM) to achieve interpretable zero-shot referring expression comprehension (REC), reaching the best performance on multiple benchmarks.

Details Motivation: Existing VLMs struggle to capture fine-grained visual details and complex object relationships, while LLMs, although strong at high-level semantic reasoning, cannot directly abstract visual features into textual semantics, which limits their use in REC. Method: Proposes SGREC: a VLM first constructs a query-driven scene graph that explicitly encodes spatial relationships, descriptive captions, and object interactions; the structured graph then serves as a bridge from which an LLM infers the target object and produces an interpretable explanation of its decision. Result: SGREC achieves state-of-the-art results on several zero-shot REC benchmarks, including RefCOCO val (66.78%), RefCOCO+ testB (53.43%), and RefCOCOg val (73.28%). Conclusion: By introducing query-driven scene graphs, SGREC combines the visual perception of VLMs with the semantic reasoning of LLMs, significantly improving zero-shot REC performance while keeping the inference process interpretable. Abstract: Zero-shot referring expression comprehension (REC) aims to locate target objects in images given natural language queries without relying on task-specific training data, demanding strong visual understanding capabilities. Existing Vision-Language Models (VLMs), such as CLIP, commonly address zero-shot REC by directly measuring feature similarities between textual queries and image regions. However, these methods struggle to capture fine-grained visual details and understand complex object relationships. Meanwhile, although Large Language Models (LLMs) excel at high-level semantic reasoning, their inability to directly abstract visual features into textual semantics limits their application in REC tasks. To overcome these limitations, we propose SGREC, an interpretable zero-shot REC method leveraging query-driven scene graphs as structured intermediaries. Specifically, we first employ a VLM to construct a query-driven scene graph that explicitly encodes spatial relationships, descriptive captions, and object interactions relevant to the given query. By leveraging this scene graph, we bridge the gap between low-level image regions and higher-level semantic understanding required by LLMs. Finally, an LLM infers the target object from the structured textual representation provided by the scene graph, responding with detailed explanations for its decisions that ensure interpretability in the inference process.
Extensive experiments show that SGREC achieves top-1 accuracy on most zero-shot REC benchmarks, including RefCOCO val (66.78%), RefCOCO+ testB (53.43%), and RefCOCOg val (73.28%), highlighting its strong visual scene understanding.

[94] Improving Fine-Grained Rice Leaf Disease Detection via Angular-Compactness Dual Loss Learning

Md. Rokon Mia,Rakib Hossain Sajib,Abdullah Al Noman,Abir Ahmed,B M Taslimul Haque

Main category: cs.CV

TL;DR: This paper proposes a dual-loss framework combining Center Loss and ArcFace Loss to improve fine-grained classification of rice leaf diseases, reaching over 99% accuracy on the Rice Leaf Dataset.

Details Motivation: Conventional deep learning models rely on cross-entropy loss and perform poorly on rice disease datasets because of large intra-class variation and high inter-class similarity, so a more robust loss design is needed. Method: Proposes a dual-loss framework that combines Center Loss (to tighten intra-class compactness) with ArcFace Loss (to enlarge the inter-class angular margin) and integrates it into three common backbones, InceptionNetV3, DenseNet201, and EfficientNetB0, for training. Result: On the Rice Leaf Dataset, the three backbones reach classification accuracies of 99.6%, 99.2%, and 99.2%, respectively, indicating that angular-margin and center constraints together substantially improve feature discriminability. Conclusion: The dual-loss framework requires no architectural changes, is lightweight and efficient, and generalizes well in practice, making it suitable for early rice disease identification in real farming settings. Abstract: Early detection of rice leaf diseases is critical, as rice is a staple crop supporting a substantial share of the world's population. Timely identification of these diseases enables more effective intervention and significantly reduces the risk of large-scale crop losses. However, traditional deep learning models primarily rely on cross entropy loss, which often struggles with high intra-class variance and inter-class similarity, common challenges in plant pathology datasets. To tackle this, we propose a dual-loss framework that combines Center Loss and ArcFace Loss to enhance fine-grained classification of rice leaf diseases. The method is applied to three state-of-the-art backbone architectures, InceptionNetV3, DenseNet201, and EfficientNetB0, trained on the public Rice Leaf Dataset. Our approach achieves significant performance gains, with accuracies of 99.6%, 99.2% and 99.2% respectively. The results demonstrate that angular margin-based and center-based constraints substantially boost the discriminative strength of feature embeddings. In particular, the framework does not require major architectural modifications, making it efficient and practical for real-world deployment in farming environments.
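To make the two terms concrete, here is a minimal single-sample sketch of how an ArcFace-style loss and a center loss might be combined. It uses the standard textbook forms: cross-entropy over scaled cosines with an additive angular margin on the true class, plus half the squared distance to the class center. The scale, margin, and weighting factor `lam` are assumed values, and the paper's actual hyperparameters and implementation are not stated in the abstract.

```python
import math

def arcface_loss(cosines, label, scale=30.0, margin=0.5):
    """Cross-entropy over scaled cosines with an additive angular margin on the true class."""
    logits = []
    for j, c in enumerate(cosines):
        if j == label:
            theta = math.acos(max(-1.0, min(1.0, c)))  # clamp for numerical safety
            c = math.cos(theta + margin)
        logits.append(scale * c)
    peak = max(logits)  # subtract the max for a stable log-sum-exp
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[label]

def center_loss(feature, center):
    """Half the squared Euclidean distance between a feature and its class center."""
    return 0.5 * sum((f - c) ** 2 for f, c in zip(feature, center))

def dual_loss(cosines, label, feature, center, lam=0.01):
    """ArcFace term plus a weighted center term; lam is an assumed weight."""
    return arcface_loss(cosines, label) + lam * center_loss(feature, center)

# Hypothetical cosines between one embedding and three class weight vectors.
cosines = [0.8, 0.1, -0.2]
loss = dual_loss(cosines, 0, feature=[1.0, 0.0], center=[0.0, 0.0])
```

The margin makes the true-class logit harder to satisfy, which pushes classes apart in angle, while the center term pulls each embedding toward its class mean; the two pressures are complementary, which matches the abstract's intra-class versus inter-class framing.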

[95] Few TensoRF: Enhance the Few-shot on Tensorial Radiance Fields

Thanh-Hai Le,Hoang-Hau Tran,Trong-Nghia Vu

Main category: cs.CV

TL;DR: Few TensoRF is a sparse-view 3D reconstruction method that combines the TensorRF representation with FreeNeRF's frequency-driven regularization, markedly improving reconstruction quality and stability while keeping training fast (about 10-15 minutes) and rendering efficient.

Details Motivation: Addresses the instability and poor quality of 3D reconstruction from sparse input views, with the goal of better generalization and robustness in data-limited settings. Method: Combines TensorRF's efficient tensor representation with FreeNeRF's frequency-driven few-shot regularization, and introduces frequency and occlusion masks to strengthen modeling under sparse views. Result: On Synthesis NeRF, PSNR reaches 23.70 dB (24.52 dB after fine-tuning), more than 2 dB above TensorRF; on THuman 2.0, it reaches 27.37-34.00 dB with only 8 images; training time stays at about 10-15 minutes. Conclusion: Few TensoRF is an efficient, data-economical framework for real-time 3D reconstruction that applies to diverse scenes and performs especially well under sparse-view conditions. Abstract: This paper presents Few TensoRF, a 3D reconstruction framework that combines TensorRF's efficient tensor based representation with FreeNeRF's frequency driven few shot regularization. Using TensorRF to significantly accelerate rendering speed and introducing frequency and occlusion masks, the method improves stability and reconstruction quality under sparse input views. Experiments on the Synthesis NeRF benchmark show that the Few TensoRF method improves the average PSNR from 21.45 dB (TensorRF) to 23.70 dB, with the fine tuned version reaching 24.52 dB, while maintaining TensorRF's fast training time of approximately 10-15 minutes. Experiments on the THuman 2.0 dataset further demonstrate competitive performance in human body reconstruction, achieving 27.37 - 34.00 dB with only eight input images. These results highlight Few TensoRF as an efficient and data effective solution for real-time 3D reconstruction across diverse scenes.
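As a rough guide to what the reported PSNR gain means, the sketch below uses the standard definition PSNR = 10 log10(MAX^2 / MSE) and converts a dB gain into the corresponding reduction in mean squared error. It treats the numbers as if they described a single image with pixel values in [0, 1]; the paper reports PSNR averaged over scenes, so the conversion is only indicative.

```python
import math

def psnr(mse, max_value=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    if mse <= 0:
        raise ValueError("mse must be positive; PSNR is unbounded for identical images")
    return 10.0 * math.log10(max_value ** 2 / mse)

def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks when PSNR rises by gain_db."""
    return 10.0 ** (-gain_db / 10.0)

# Gain reported in the abstract: 21.45 dB (TensorRF) to 23.70 dB (Few TensoRF).
ratio = mse_ratio_for_gain(23.70 - 21.45)
```

Under these assumptions a 2.25 dB gain corresponds to the squared error falling to roughly 60% of its previous value, which is a helpful way to read dB differences that otherwise look small.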

[96] GDPO-Listener: Expressive Interactive Head Generation via Auto-Regressive Flow Matching and Group reward-Decoupled Policy Optimization

Zhangyu Jin,Maksim Siniukov,Deuksin Kwon,Ashutosh Chaubey,Mohammad Soleymani

Main category: cs.CV

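TL;DR: This paper proposes GDPO-Listener, a framework that uses auto-regressive flow matching, Group reward-Decoupled Policy Optimization, and semantic text control to make 3D head motion for both speakers and listeners in dyadic interaction more realistic and expressive, with particular gains on the problem of static listener motion.

Details Motivation: Existing methods suffer from a "regression to the mean" problem when generating listener head motion (the motion collapses toward stillness) and lack the capacity to model complex nonverbal motion. Method: Proposes the GDPO-Listener framework: 1) an auto-regressive flow matching architecture for stable supervised learning; 2) Group reward-Decoupled Policy Optimization (GDPO), which normalizes rewards independently for each group of FLAME parameters to encourage high-variance, expressive motion; 3) explicit semantic text control for customizable responses. Result: Evaluated on the Seamless Interaction and DualTalk datasets, the method outperforms existing baselines on long-term kinematic variance, visual expressivity, and semantic controllability. Conclusion: GDPO-Listener addresses the static-listener problem and improves the diversity and controllability of 3D head motion in dyadic interaction, offering a new paradigm for virtual human synthesis. Abstract: Generating realistic 3D head motion for dyadic interactions is a significant challenge in virtual human synthesis. While recent methods achieve impressive results with speaking heads, they frequently suffer from the "Regression-to-the-Mean" problem in listener motions, collapsing into static faces, and lack the parameter space for complex nonverbal motions. In this paper, we propose GDPO-Listener, a novel framework that achieves highly expressive speaking and listening motion generation. First, we introduce an Auto-Regressive Flow Matching architecture enabling stable supervised learning. Second, to overcome kinematic stillness, we apply the Group reward-Decoupled Policy Optimization (GDPO). By isolating reward normalization across distinct FLAME parameter groups, GDPO explicitly incentivizes high variance expressive generations. Finally, we enable explicit semantic text control for customizable responses. Extensive evaluations across the Seamless Interaction and DualTalk datasets demonstrate superior performance compared to existing baselines on long-term kinematic variance, visual expressivity and semantic controllability.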

[97] VideoTIR: Accurate Understanding for Long Videos with Efficient Tool-Integrated Reasoning

Zhe Gao,Shiyu Shen,Taifeng Chai,Weinong Wang,Haotian Xu,Xing W,Wenbin Li,Qi Fan,Yang Gao,Dacheng Tao

Main category: cs.CV

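TL;DR: This paper proposes VideoTIR, a reinforcement-learning-based method for long video understanding that uses multi-level tool calling, TAGPO policy optimization, and sandbox-based trajectory synthesis to substantially reduce hallucinations of multimodal large models on long videos.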

Details Motivation: Existing MLLMs tend to hallucinate in long video understanding because of the imbalance between textual and visual tokens; SFT-based tool-calling methods exist, but they depend on large amounts of high-quality annotated data and their calling trajectories are constrained. Method: Proposes the VideoTIR framework: 1) RL training that supports both Zero-RL and SFT cold-starting; 2) Toolkit Action Grouped Policy Optimization (TAGPO), which improves calling efficiency through stepwise reward assignment and reuse of failed rollouts; 3) a sandbox-based trajectory synthesis framework that generates high-quality training trajectories. Result: On three long-video QA benchmarks, VideoTIR significantly outperforms existing methods in both accuracy and inference efficiency, mitigating hallucinations and reducing redundant tool calls. Conclusion: RL-based coordinated multi-level tool calling is a viable route to better performance and robustness in long video understanding, and TAGPO and sandbox trajectory synthesis may extend to other complex multimodal decision-making tasks. Abstract: Existing Multimodal Large Language Models (MLLMs) often suffer from hallucinations in long video understanding (LVU), primarily due to the imbalance between textual and visual tokens. Observing that MLLMs handle short visual inputs well, recent LVU works alleviate hallucinations by automatically parsing the vast visual data into manageable segments that can be effectively processed by MLLMs. SFT-based tool-calling methods can serve this purpose, but they typically require vast amounts of fine-grained, high-quality data and suffer from constrained tool-calling trajectories. We propose VideoTIR, a novel method that leverages Reinforcement Learning (RL) to encourage proper usage of comprehensive multi-level toolkits for efficient long video understanding. VideoTIR explores both Zero-RL and SFT cold-starting to enable MLLMs to retrieve and focus on meaningful video segments/images/regions, enhancing long video understanding both accurately and efficiently. To reduce redundant tool-calling, we propose Toolkit Action Grouped Policy Optimization (TAGPO), which enhances the efficiency of the calling process through stepwise reward assignment and reuse of failed rollouts. Additionally, we develop a sandbox-based trajectory synthesis framework to generate high-quality trajectory data. Extensive experiments on three long-video QA benchmarks demonstrate the effectiveness and efficiency of our method.

[98] CARE: Training-Free Controllable Restoration for Medical Images via Dual-Latent Steering

Xu Liu

Main category: cs.CV

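TL;DR: This paper proposes CARE, a training-free controllable framework for medical image restoration. A dual-latent-branch strategy balances data fidelity against guidance from a generative prior, allowing a dynamic trade-off between structure preservation and detail enhancement at inference time, and a risk-aware adaptive controller improves clinical safety.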

Details Motivation: Existing medical image restoration methods rely on task-specific retraining and offer little control over the trade-off between faithful reconstruction and prior-driven enhancement, which can distort diagnostically important structures or introduce hallucinated details and therefore carries high clinical risk. Method: Proposes the CARE framework with two latent branches (one enforces data fidelity and anatomical consistency, the other uses a generative prior to recover degraded information) and a risk-aware adaptive controller that dynamically adjusts each branch's contribution according to local structural reliability and restoration uncertainty. Result: Evaluated on noisy and incomplete medical imaging scenarios, CARE achieves strong restoration quality, markedly improves the fidelity of clinically relevant structures, and reduces the risk of implausible reconstructions. Conclusion: CARE offers a retraining-free, highly controllable, and clinically safer deployment option for medical image restoration, and is an important step toward practical, trustworthy medical AI. Abstract: Medical image restoration is essential for improving the usability of noisy, incomplete, and artifact-corrupted clinical scans, yet existing methods often rely on task-specific retraining and offer limited control over the trade-off between faithful reconstruction and prior-driven enhancement. This lack of controllability is especially problematic in clinical settings, where overly aggressive restoration may introduce hallucinated details or alter diagnostically important structures. In this work, we propose CARE, a training-free controllable restoration framework for real-world medical images that explicitly balances structure preservation and prior-guided refinement during inference. CARE uses a dual-latent restoration strategy, in which one branch enforces data fidelity and anatomical consistency while the other leverages a generative prior to recover missing or degraded information. A risk-aware adaptive controller dynamically adjusts the contribution of each branch based on restoration uncertainty and local structural reliability, enabling conservative or enhancement-focused restoration modes without additional model training. We evaluate CARE on noisy and incomplete medical imaging scenarios and show that it achieves strong restoration quality while better preserving clinically relevant structures and reducing the risk of implausible reconstructions. The proposed approach offers a practical step toward safer, more controllable, and more deployment-ready medical image restoration.

[99] GeoNDC: A Queryable Neural Data Cube for Planetary-Scale Earth Observation

Jianbo Qi,Mengyao Li,Baogui Jiang,Yidan Chen,Qiao Wang

Main category: cs.CV

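TL;DR: This paper proposes GeoNDC, a queryable Earth observation data cube based on an implicit neural field. It encodes remote sensing data as a continuous spatiotemporal representation, achieving a high compression ratio (95:1) while supporting real-time queries, continuous-time reconstruction, and high-fidelity recovery.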

Details Motivation: Satellite remote sensing data is currently stored as discrete raster files, which makes storage, transmission, and querying costly, and there is no unified, efficient, AI-ready data representation. Method: Proposes GeoNDC, a neural data cube based on a continuous spatiotemporal neural implicit field that maps multi-source Earth observation data (such as MODIS, Sentinel-2, and HiGLASS) to a compact, differentiable, queryable function representation supporting on-demand spatiotemporal sampling and end-to-end training. Result: On 20 years of global MODIS data the compression ratio reaches 95:1 (0.44 GB) with high spectral fidelity (R² > 0.98, RMSE = 0.021); on Sentinel-2, temporal reconstruction under cloud occlusion reaches R² > 0.85; on HiGLASS products R² > 0.98; all experiments run on consumer hardware. Conclusion: GeoNDC provides an AI-native, unified representation paradigm for planetary-scale remote sensing data that integrates query, reconstruction, and compression, and could become an important analysis layer complementing raw data archives. Abstract: Satellite Earth observation has accumulated massive spatiotemporal archives essential for monitoring environmental change, yet these remain organized as discrete raster files, making them costly to store, transmit, and query. We present GeoNDC, a queryable neural data cube that encodes planetary-scale Earth observation data as a continuous spatiotemporal implicit neural field, enabling on-demand queries and continuous-time reconstruction without full decompression. Experiments on a 20-year global MODIS MCD43A4 reflectance record (7 bands, 5 km, 8-day sampling) show that the learned representation supports direct spatiotemporal queries on consumer hardware. On Sentinel-2 imagery (10 m), continuous temporal parameterization recovers cloud-free dynamics with high fidelity ($R^2 > 0.85$) under simulated 2-km cloud occlusion. On HiGLASS biophysical products (LAI and FPAR), GeoNDC attains near-perfect accuracy ($R^2 > 0.98$). The representation compresses the 20-year MODIS archive to 0.44 GB (approximately 95:1 relative to an optimized Int16 baseline) with high spectral fidelity (mean $R^2 > 0.98$, mean RMSE $= 0.021$). These results suggest GeoNDC offers a unified AI-native representation for planetary-scale Earth observation, complementing raw archives with a compact, analysis-ready data layer integrating query, reconstruction, and compression in a single framework.
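
As a quick sanity check on the compression figures, the sketch below back-computes the baseline size implied by the reported 0.44 GB and the approximate 95:1 ratio. The function name is illustrative, and the result is only an estimate because the abstract gives the ratio as approximate.

```python
def implied_baseline_gb(compressed_gb: float, ratio: float) -> float:
    """Size of the uncompressed baseline implied by a compression ratio."""
    return compressed_gb * ratio

# Figures from the abstract: 0.44 GB at roughly 95:1 against an Int16 baseline.
baseline_gb = implied_baseline_gb(0.44, 95.0)
```

This puts the optimized Int16 baseline for the 20-year MODIS record at roughly 42 GB.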

[100] MoRGS: Efficient Per-Gaussian Motion Reasoning for Streamable Dynamic 3D Scenes

Wonjoon Lee,Sungmin Woo,Donghyeong Kim,Jungho Lee,Sangheon Park,Sangyoun Lee

Main category: cs.CV

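TL;DR: This paper proposes the MoRGS framework, which uses optical flow as a lightweight motion cue and learns a per-Gaussian motion offset field and motion confidence to enable efficient online 4D reconstruction of dynamic scenes, substantially improving motion fidelity and temporal consistency.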

Details Motivation: Existing online 4D reconstruction methods lack explicit motion modeling, so the motion of each Gaussian is driven only by pixel residuals and does not reflect true 3D motion. Method: Proposes MoRGS: optical flow on a sparse set of key views serves as a motion regularization signal; a per-Gaussian motion offset field is learned to align projected 3D motion with the observed flow; a per-Gaussian motion confidence separates dynamic from static regions and weights attribute updates. Result: Achieves the best reconstruction quality and motion fidelity among online methods on multiple dynamic scenes while remaining streamable. Conclusion: Explicitly modeling per-Gaussian motion and combining it with multi-source motion supervision markedly improves the realism and temporal stability of online 4D reconstruction. Abstract: Online reconstruction of dynamic scenes aims to learn from streaming multi-view inputs under low-latency constraints. The fast training and real-time rendering capabilities of 3D Gaussian Splatting have made on-the-fly reconstruction practically feasible, enabling online 4D reconstruction. However, existing online approaches, despite their efficiency and visual quality, fail to learn per-Gaussian motion that reflects true scene dynamics. Without explicit motion cues, appearance and motion are optimized solely under photometric loss, causing per-Gaussian motion to chase pixel residuals rather than true 3D motion. To address this, we propose MoRGS, an efficient online per-Gaussian motion reasoning framework that explicitly models per-Gaussian motion to improve 4D reconstruction quality. Specifically, we leverage optical flow on a sparse set of key views as lightweight motion cues that regularize per-Gaussian motion beyond photometric supervision. To compensate for the sparsity of flow supervision, we learn a per-Gaussian motion offset field that reconciles discrepancies between projected 3D motion and observed flow across views and time. In addition, we introduce a per-Gaussian motion confidence that separates dynamic from static Gaussians and weights Gaussian attribute residual updates, thereby suppressing redundant motion in static regions for better temporal consistency and accelerating the modeling of large motions. Extensive experiments demonstrate that MoRGS achieves state-of-the-art reconstruction quality and motion fidelity among online methods, while maintaining streamable performance.

[101] GaussFusion: Improving 3D Reconstruction in the Wild with A Geometry-Informed Video Generator

Liyuan Zhu,Manjunath Narayana,Michal Stary,Will Hutchcroft,Gordon Wetzstein,Iro Armeni

Main category: cs.CV

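TL;DR: GaussFusion is a geometry-informed video generation method for improving 3D Gaussian splatting (3DGS) reconstruction in the wild. It mitigates floaters, flickering, and blur, supports multiple reconstruction paradigms, and enables real-time, high-quality novel-view synthesis.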

Details Motivation: Address the floaters, flickering, and blur that commonly affect 3D Gaussian splatting in real scenes because of camera pose errors, incomplete coverage, and noisy geometry initialization. Method: Proposes a geometry-informed video-to-video generator that takes as input a rendered Gaussian primitive video buffer encoding depth, normals, opacity, and covariance, and is trained with diverse synthetically generated degradation data. Result: Achieves state-of-the-art performance on novel-view synthesis benchmarks; a lightweight variant runs in real time at 21 FPS, suitable for interactive 3D applications. Conclusion: GaussFusion is a general, robust, and efficient post-processing framework that markedly improves the reconstruction quality and temporal consistency of 3DGS in complex real scenes. Abstract: We present GaussFusion, a novel approach for improving 3D Gaussian splatting (3DGS) reconstructions in the wild through geometry-informed video generation. GaussFusion mitigates common 3DGS artifacts, including floaters, flickering, and blur caused by camera pose errors, incomplete coverage, and noisy geometry initialization. Unlike prior RGB-based approaches limited to a single reconstruction pipeline, our method introduces a geometry-informed video-to-video generator that refines 3DGS renderings across both optimization-based and feed-forward methods. Given an existing reconstruction, we render a Gaussian primitive video buffer encoding depth, normals, opacity, and covariance, which the generator refines to produce temporally coherent, artifact-free frames. We further introduce an artifact synthesis pipeline that simulates diverse degradation patterns, ensuring robustness and generalization. GaussFusion achieves state-of-the-art performance on novel-view synthesis benchmarks, and an efficient variant runs in real time at 21 FPS while maintaining similar performance, enabling interactive 3D applications.

[102] Synergistic Event-SVE Imaging for Quantitative Propellant Combustion Diagnostics

Jing Tao,Taihang Lei,Banglei Guan,Ying Qu,Xudong Na,Likun Ma,Yang Shang,Qifeng Yu

Main category: cs.CV

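TL;DR: This paper proposes a closed-loop Event-SVE measurement system that couples a spatially variant exposure (SVE) camera with neuromorphic event cameras for real-time 3D monitoring of high-energy propellant combustion under extreme conditions such as heavy smoke, high dynamic range, and microsecond-scale particle motion.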

Details Motivation: When high dynamic range, microsecond-scale particle motion, and heavy smoke occur together, conventional imaging suffers from saturation, motion blur, and unstable particle extraction, which makes real-time monitoring of high-energy propellant combustion difficult. Method: Builds a closed-loop Event-SVE system: the SVE camera produces HDR images with a smoke-aware fusion strategy; a multi-cue smoke-likelihood map separates particle emission from smoke scattering to obtain calibrated intensity maps; the HDR maps serve as an absolute-intensity reference for suppressing smoke-induced event artifacts and improving particle-state discrimination; on the cleaned event data, a stereo event-based 3D pipeline performs feature extraction and triangulation to estimate separation height and equivalent particle size. Result: Experiments on boron-based propellants yield multimodal equivalent-radius statistics and capture fast separation transients that conventional sensors struggle to observe; the maximum calibration error is 0.56%; microsecond-resolved 3D combustion measurement is achieved under smoke occlusion. Conclusion: The framework provides a practical, calibration-consistent solution for microsecond-scale 3D combustion measurement under smoke interference and high dynamic range. Abstract: Real-time monitoring of high-energy propellant combustion is difficult. Extreme high dynamic range (HDR), microsecond-scale particle motion, and heavy smoke often occur together. These conditions drive saturation, motion blur, and unstable particle extraction in conventional imaging. We present a closed-loop Event-SVE measurement system that couples a spatially variant exposure (SVE) camera with a stereo pair of neuromorphic event cameras. The SVE branch produces HDR maps with an explicit smoke-aware fusion strategy. A multi-cue smoke-likelihood map is used to separate particle emission from smoke scattering, yielding calibrated intensity maps for downstream analysis. The resulting HDR maps also provide the absolute-intensity reference missing in event cameras. This reference is used to suppress smoke-driven event artifacts and to improve particle-state discrimination. Based on the cleaned event observations, a stereo event-based 3D pipeline estimates separation height and equivalent particle size through feature extraction and triangulation (maximum calibration error 0.56%). Experiments on boron-based propellants show multimodal equivalent-radius statistics. The system also captures fast separation transients that are difficult to observe with conventional sensors. Overall, the proposed framework provides a practical, calibration-consistent route to microsecond-resolved 3D combustion measurement under smoke-obscured HDR conditions.

[103] Learning Explicit Continuous Motion Representation for Dynamic Gaussian Splatting from Monocular Videos

Xuankai Zhang,Junjin Xiao,Shangwei Huang,Wei-shi Zheng,Qing Zhang

Main category: cs.CV

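TL;DR: This paper proposes a method for high-quality dynamic Gaussian splatting from monocular videos. It explicitly models continuous position and orientation deformation of the Gaussians with SE(3) B-spline motion bases, and improves performance with an adaptive control mechanism, a soft segment reconstruction strategy, and a multi-view diffusion model.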

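Details Motivation: Existing dynamic Gaussian splatting methods struggle to model continuous spatiotemporal deformation accurately and are prone to long-interval motion interference and overfitting; both modeling capacity and computational efficiency are needed. Method: Uses SE(3) B-spline motion bases to explicitly model continuous position and orientation deformation of the Gaussians; designs an adaptive mechanism that dynamically adjusts the number of motion bases and control points; proposes soft segment reconstruction to mitigate long-interval motion interference; employs a multi-view diffusion model to provide strong multi-view prior cues. Result: Significantly outperforms current state-of-the-art methods on novel view synthesis and achieves SOTA performance on several dynamic-scene datasets. Conclusion: The method maintains high rendering quality while improving the accuracy, robustness, and generalization of dynamic scene modeling, offering a new direction for monocular dynamic neural radiance fields. Abstract: We present an approach for high-quality dynamic Gaussian Splatting from monocular videos. To this end, we go one step beyond previous methods by explicitly modeling continuous position and orientation deformation of dynamic Gaussians, using SE(3) B-spline motion bases with a compact set of control points. To improve computational efficiency while enhancing the ability to model complex motions, an adaptive control mechanism is devised to dynamically adjust the number of motion bases and control points. In addition, we develop a soft segment reconstruction strategy to mitigate long-interval motion interference, and employ a multi-view diffusion model to provide multi-view cues for avoiding overfitting to training views. Extensive experiments demonstrate that our method outperforms state-of-the-art methods in novel view synthesis. Our code is available at https://github.com/hhhddddddd/se3bsplinegs.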
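
To make the notion of a continuous trajectory driven by a compact set of control points concrete, here is a minimal sketch of a uniform cubic B-spline evaluated on positions only. It is a simplification and an assumption on our part: the paper works in SE(3), which also covers orientation, and its exact parameterization is not described in the abstract. The function name `cubic_bspline_point` is hypothetical.

```python
def cubic_bspline_point(control_points, t):
    """Evaluate a uniform cubic B-spline at global parameter t in [0, 1].

    control_points is a list of equal-length tuples (at least four).
    Only translation is handled; rotation would need an SE(3) treatment.
    """
    n_segments = len(control_points) - 3
    if n_segments < 1:
        raise ValueError("need at least 4 control points")
    # Map global t to a segment index s and a local parameter u in [0, 1].
    s = min(int(t * n_segments), n_segments - 1)
    u = t * n_segments - s
    # Uniform cubic B-spline basis weights (non-negative, summing to 1).
    w = (
        (1 - u) ** 3 / 6.0,
        (3 * u**3 - 6 * u**2 + 4) / 6.0,
        (-3 * u**3 + 3 * u**2 + 3 * u + 1) / 6.0,
        u**3 / 6.0,
    )
    window = control_points[s : s + 4]
    dim = len(window[0])
    return tuple(sum(w[k] * window[k][d] for k in range(4)) for d in range(dim))

# Four control points on a straight line give a position that moves along it.
line = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
midpoint = cubic_bspline_point(line, 0.5)
```

Because the position is a smooth function of `t`, it can be queried at any timestamp, not only at observed frames, which is the property the abstract refers to as continuous motion representation.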

[104] GIFT: Global Irreplaceability Frame Targeting for Efficient Video Understanding

Junpeng Ma,Sashuai Zhou,Guanghao Li,Xin Gao,Yue Cao,Hengyu Zeng,Yuxiang Yan,Zhibin Wang,Jun Song,Bo Zheng,Shanghang Zhang,Jian Pu

Main category: cs.CV

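TL;DR: This paper proposes the GIFT framework, which selects keyframes by assessing the intrinsic irreplaceability of video frames, addressing the local optima that existing methods fall into because of greedy decisions and decoupled evaluation of relevance and diversity.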

Details Motivation: Processing dense frames is computationally expensive for video large language models, and existing keyframe selection methods rely on greedy decisions and a decoupled evaluation of relevance and diversity, so they tend to fall into local optima and select irrelevant noise frames. Method: Proposes the GIFT framework, comprising Directed Diversity, which quantifies a frame's uniqueness conditioned on relevance to build a unified irreplaceability score, and Budget-Aware Refinement, an adaptive iterative process that first selects the frames with the highest irreplaceability and then progressively builds key temporal context as the budget expands. Result: On LLaVA-Video-7B, GIFT improves over uniform sampling on long-video benchmarks by up to 12.5% on average. Conclusion: GIFT is a new training-free paradigm for keyframe selection that balances frame relevance and diversity more effectively, markedly improving video understanding while reducing computational cost. Abstract: Video Large Language Models (VLMs) have achieved remarkable success in video understanding, but the significant computational cost from processing dense frames severely limits their practical application. Existing methods alleviate this by selecting keyframes, but their greedy decision-making, combined with a decoupled evaluation of relevance and diversity, often falls into local optima and results in erroneously selecting irrelevant noise frames. To address these challenges, we propose GIFT: Global Irreplaceability Frame Targeting, a novel training-free framework that selects frames by assessing their intrinsic irreplaceability. Specifically, we first introduce Directed Diversity to quantify a frame's uniqueness conditioned on relevance, which allows us to formulate a unified irreplaceability score. Subsequently, our Budget-Aware Refinement strategy employs an adaptive iterative process that first secures a core set of frames with the highest irreplaceability, and then shifts its priority to building crucial temporal context around these selections as the budget expands. Extensive experiments demonstrate that GIFT achieves a maximum average improvement of 12.5% across long-form video benchmarks on LLaVA-Video-7B compared to uniform sampling.

[105] Z-Erase: Enabling Concept Erasure in Single-Stream Diffusion Transformers

Nanxiang Jiang,Zhaoxin Fan,Baisen Wang,Daiheng Gao,Junhang Cheng,Jifeng Guo,Yalan Qin,Yeying Jin,Hongwei Zheng,Faguo Wu,Wenjun Wu

Main category: cs.CV

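TL;DR: This paper proposes Z-Erase, the first concept erasure method designed for single-stream diffusion transformers (such as Z-Image). Using a decoupled update mechanism and a Lagrangian-guided adaptive modulation algorithm, it balances erasure of sensitive concepts against image fidelity while avoiding generation collapse; convergence is proven theoretically, and experiments show state-of-the-art performance.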

Details Motivation: Existing concept erasure methods tend to cause generation collapse on single-stream diffusion transformers (such as Z-Image); because this architecture models text and image tokens jointly with shared parameters, a dedicated erasure method is needed. Method: Proposes the Stream Disentangled Concept Erasure Framework to decouple parameter updates and, within it, the Lagrangian-Guided Adaptive Erasure Modulation algorithm, which balances erasure and preservation under constrained optimization; a convergence analysis proves convergence to a Pareto stationary point. Result: Z-Erase overcomes generation collapse and achieves SOTA performance across a range of tasks. Conclusion: Z-Erase is the first concept erasure method successfully adapted to single-stream T2I models, combining theoretical rigor with practical effectiveness and opening a new path to safe, controllable generation in this emerging paradigm. Abstract: Concept erasure serves as a vital safety mechanism for removing unwanted concepts from text-to-image (T2I) models. While extensively studied in U-Net and dual-stream architectures (e.g., Flux), this task remains under-explored in the recently emerging paradigm of single-stream diffusion transformers (e.g., Z-Image). In this new paradigm, text and image tokens are processed as a single unified sequence via shared parameters. Consequently, directly applying prior erasure methods typically leads to generation collapse. To bridge this gap, we introduce Z-Erase, the first concept erasure method tailored for single-stream T2I models. To guarantee stable image generation, Z-Erase first proposes a Stream Disentangled Concept Erasure Framework that decouples updates and enables existing methods on single-stream models. Subsequently, within this framework, we introduce Lagrangian-Guided Adaptive Erasure Modulation, a constrained algorithm that further balances the sensitive erasure-preservation trade-off. Moreover, we provide a rigorous convergence analysis proving that Z-Erase can converge to a Pareto stationary point. Experiments demonstrate that Z-Erase successfully overcomes the generation collapse issue, achieving state-of-the-art performance across a wide range of tasks.

[106] Bridging Perception and Reasoning: Token Reweighting for RLVR in Multimodal LLMs

Jinda Lu,Junkang Wu,Jinghan Li,Kexin Huang,Shuo Yang,Guoyin Wang,Jiancan Wu,Xiang Wang,Xiangnan He

Main category: cs.CV

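TL;DR: This paper proposes an extension of Reinforcement Learning with Verifiable Rewards (RLVR) to multimodal large language models (MLLMs). A Token-Reweighting (ToR) strategy models the coupling between perception-related and reasoning-related tokens, improving the joint performance of visual grounding and symbolic reasoning.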

Details Motivation: Existing RLVR methods handle poorly the interleaving of perception-related and reasoning-related tokens in MLLM responses; the two token types correspond to visual grounding and symbolic reasoning, depend on each other, and respond poorly to isolated optimization. Method: A token-level empirical analysis shows that perception and reasoning tokens need to be optimized jointly; on this basis, a plug-and-play Token-Reweighting (ToR) strategy dynamically identifies and reweights critical tokens of both types during RLVR training. Result: ToR brings consistent gains on top of baselines such as GRPO and DAPO and reaches SOTA on multiple multimodal reasoning benchmarks, with both accurate visual grounding and coherent reasoning. Conclusion: Perception and reasoning tokens are inherently coupled, and explicitly modeling their interdependence (as ToR does) is a key route to better multimodal reinforcement learning for MLLMs. Abstract: Extending Reinforcement Learning with Verifiable Rewards (RLVR) to multimodal large language models (MLLMs) faces a fundamental challenge: their responses inherently interleave perception-related tokens, which ground visual content, with reasoning-related tokens, which construct reasoning chains. These token types instantiate distinct yet interdependent capacities (visual grounding and symbolic reasoning), making isolated optimization insufficient. Through token-level empirical analysis, we demonstrate that optimizing either perception- or reasoning-only tokens consistently underperforms full optimization, underscoring their inherent coupling. To address this, we propose a plug-and-play Token-Reweighting (ToR) strategy that explicitly models this interdependence by identifying critical tokens of both types and dynamically reweighting them during RLVR training. Applied on top of existing methods (e.g., GRPO and DAPO), ToR delivers consistent performance gains across multiple multi-modal reasoning benchmarks, achieving state-of-the-art performance with both accurate visual grounding and coherent reasoning.
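
The general shape of token reweighting can be sketched as follows. This is an illustration under stated assumptions, not the ToR algorithm itself: the abstract does not say how critical tokens are identified or how weights are computed, so the `criticality` scores, the `boost` factor, and the mean-one normalization are all hypothetical.

```python
def reweight_tokens(token_losses, token_types, criticality, boost=2.0):
    """Return per-token weights and the weighted mean loss.

    token_types holds 'perception', 'reasoning', or 'other' per token.
    criticality holds a hypothetical score in [0, 1] per token.
    Perception and reasoning tokens are upweighted by their criticality;
    weights are rescaled to average 1 so the loss scale is unchanged.
    """
    raw = []
    for kind, score in zip(token_types, criticality):
        if kind in ("perception", "reasoning"):
            raw.append(1.0 + (boost - 1.0) * score)
        else:
            raw.append(1.0)
    scale = len(raw) / sum(raw)
    weights = [w * scale for w in raw]
    loss = sum(w * l for w, l in zip(weights, token_losses)) / len(token_losses)
    return weights, loss

types = ["perception", "other", "reasoning", "other"]
scores = [1.0, 0.0, 0.5, 0.0]
weights, loss = reweight_tokens([1.0, 1.0, 1.0, 1.0], types, scores)
```

Both token types are boosted in the same pass, which reflects the abstract's point that optimizing perception-only or reasoning-only tokens underperforms treating them jointly.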

[107] Learning domain-invariant features through channel-level sparsification for Out-Of Distribution Generalization

Haoran Pei,Yuguang Yang,Kexin Liu,Juan Zhang,Baochang Zhang

Main category: cs.CV

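TL;DR: This paper proposes Hierarchical Causal Dropout (HCD), which uses channel-level causal masks and a Matrix-based Mutual Information (MMI) objective to separate and debias causal features, and combines them with StyleMix-VICReg to stabilize training, outperforming existing state-of-the-art methods on OOD generalization tasks.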

Details Motivation: Deep learning models tend to rely on non-causal, domain-specific context (shortcut features), which makes cross-domain generalization unstable; existing invariance learning methods struggle to disentangle causal from spurious features in deep, highly mixed representations. Method: Proposes Hierarchical Causal Dropout (HCD): 1) channel-level causal masks enforce feature sparsity, performing a causal intervention at the representation level; 2) a Matrix-based Mutual Information (MMI) loss minimizes the mutual information between latent features and domain labels while maximizing it with class labels; 3) a StyleMix-driven VICReg module prevents the masks from removing essential causal features. Result: On multiple OOD benchmark datasets, HCD significantly outperforms current top methods, confirming its effectiveness and stability in disentangling causal features, suppressing shortcut learning, and improving cross-domain generalization. Conclusion: With an interpretable causal masking mechanism and an information-theoretic objective, HCD offers a robust and principled solution for OOD generalization and advances the practical use of causal representation learning in vision tasks. Abstract: Out-of-Distribution (OOD) generalization has become a primary metric for evaluating image analysis systems. Since deep learning models tend to capture domain-specific context, they often develop shortcut dependencies on these non-causal features, leading to inconsistent performance across different data sources. Current techniques, such as invariance learning, attempt to mitigate this. However, they struggle to isolate highly mixed features within deep latent spaces. This limitation prevents them from fully resolving the shortcut learning problem. In this paper, we propose Hierarchical Causal Dropout (HCD), a method that uses channel-level causal masks to enforce feature sparsity. This approach allows the model to separate causal features from spurious ones, effectively performing a causal intervention at the representation level. The training is guided by a Matrix-based Mutual Information (MMI) objective to minimize the mutual information between latent features and domain labels, while simultaneously maximizing the information shared with class labels. To ensure stability, we incorporate a StyleMix-driven VICReg module, which prevents the masks from accidentally filtering out essential causal data. Experimental results on OOD benchmarks show that HCD performs better than existing top-tier methods.

[108] Visual Attention Drifts, but Anchors Hold: Mitigating Hallucination in Multimodal Large Language Models via Cross-Layer Visual Anchors

Chengxu Yang,Jingling Yuan,Chuang Hu,Jiawei Jiang

Main category: cs.CV

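TL;DR: This paper proposes CLVA (Cross-Layer Visual Anchors), which mitigates object hallucination in multimodal large language models by reinforcing key mid-layer visual features and suppressing the noise that deep-layer attention regresses toward, with no additional training and little computational overhead.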

Details Motivation: Existing methods offer insufficient interpretability regarding attention drift in the final layers and pay too little attention to how mid-layer visual anchors affect output reliability. Method: Proposes the training-free CLVA method, which uses key mid-layer visual anchors captured from attention dynamics to reinforce mid-layer features and suppress the regression of deep layers toward early visual noise, pulling deep-layer attention back to the correct visual regions. Result: CLVA is validated across multiple architectures and benchmarks, markedly reducing object hallucination without a significant increase in computation time or GPU memory. Conclusion: Mid-layer visual anchors matter more than final-layer ones; regulating attention through a cross-layer mechanism suppresses hallucination effectively, requires no training, and is broadly applicable and practical. Abstract: Multimodal Large Language Models often suffer from object hallucination. While existing research utilizes attention enhancement and visual retracing, we find these works lack sufficient interpretability regarding attention drift in final model stages. In this paper, we investigate the layer-wise evolution of visual features and discover that hallucination stems from deep-layer attention regressing toward initial visual noise from early layers. We observe that output reliability depends on acquiring visual anchors at intermediate layers rather than final layers. Based on these insights, we propose CLVA, which stands for Cross-Layer Visual Anchors, a training-free method that reinforces critical mid-layer features while suppressing regressive noise. This approach effectively pulls deep-layer attention back to correct visual regions by utilizing essential anchors captured from attention dynamics. We evaluate our method across diverse architectures and benchmarks, demonstrating outstanding performance without a significant increase in computational time or GPU memory.

[109] THEMIS: Towards Holistic Evaluation of MLLMs for Scientific Paper Fraud Forensics

Tzu-Yen Ma,Bo Zhang,Zichen Tang,Junpeng Ding,Haolin Tian,Yuanze Li,Zhuodi Hao,Zixin Ding,Zirui Wang,Xinyu Yu,Shiyao Peng,Yizhuo Zhao,Ruomeng Jiang,Yiling Huang,Peizhi Zhao,Jiayuan Chen,Weisheng Tan,Haocheng Gao,Yang Liu,Jiacheng Liu,Zhongjun Yang,Jiayu Huang,Haihong E

Main category: cs.CV

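TL;DR: THEMIS is a multi-task benchmark for visual academic fraud reasoning. It contains over 4,000 questions derived from real retraction cases, covers 5 fraud types and 16 fine-grained manipulation operations, and supports multi-dimensional capability evaluation; the strongest existing MLLM (GPT-5) reaches only 56.15% accuracy, indicating that the benchmark is highly challenging.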

Details Motivation: Existing benchmarks for multimodal large models fail to reflect the complexity and diversity of real academic fraud and lack a systematic, fine-grained evaluation of visual fraud reasoning. Method: Builds the THEMIS multi-task benchmark from real retracted papers and synthetic multimodal data, covering 7 academic scenarios, 5 fraud types, and 16 fine-grained image manipulation operations; establishes a mapping from fraud types to 5 core visual fraud reasoning capabilities; evaluates 16 mainstream MLLMs. Result: GPT-5 achieves the highest overall accuracy on THEMIS at 56.15%, with other models scoring lower, confirming the benchmark's difficulty and discriminative power; models differ markedly across fraud types and reasoning capabilities. Conclusion: THEMIS fills the gap in evaluating visual fraud reasoning in real academic settings and provides a rigorous, extensible platform for advancing MLLMs on complex, realistic fraud detection tasks. Abstract: We present THEMIS, a novel multi-task benchmark designed to comprehensively evaluate multimodal large language models (MLLMs) on visual fraud reasoning within real-world academic scenarios. Compared to existing benchmarks, THEMIS introduces three major advances. (1) Real-World Scenarios and Complexity: Our benchmark comprises over 4,000 questions spanning seven scenarios, derived from authentic retracted-paper cases and carefully curated multimodal synthetic data. With 60.47% complex-texture images, THEMIS bridges the critical gap between existing benchmarks and the complexity of real-world academic fraud. (2) Fraud-Type Diversity and Granularity: THEMIS systematically covers five challenging fraud types and introduces 16 fine-grained manipulation operations. On average, each sample undergoes multiple stacked manipulation operations, with the diversity and difficulty of these manipulations demanding a high level of visual fraud reasoning from the models. (3) Multi-Dimensional Capability Evaluation: We establish a mapping from fraud types to five core visual fraud reasoning capabilities, thereby enabling an evaluation that reveals the distinct strengths and specific weaknesses of different models across these core capabilities. Experiments on 16 leading MLLMs show that even the best-performing model, GPT-5, achieves an overall performance of only 56.15%, demonstrating that our benchmark presents a stringent test. We expect THEMIS to advance the development of MLLMs for complex, real-world fraud reasoning tasks.

[110] Pixelis: Reasoning in Pixels, from Seeing to Acting

Yunpeng Zhou

Main category: cs.CV

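TL;DR: Pixelis is a vision-language agent that operates directly in pixel space. It learns and acts through executable pixel-level toolchains (zoom, segment, track, and so on), substantially improves performance on image and video tasks, and supports label-free test-time adaptation.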

Details Motivation: Most existing vision-language systems are static observers that lack the ability to interact physically and cannot safely improve under distribution shift; embodied, generalizable visual intelligence requires learning through action, not static description. Method: Pixelis trains in three phases: (1) supervised fine-tuning on Chain-of-Thought-Action traces, with a masked imitation loss and auxiliary heads that stabilize pixel-grounded arguments; (2) Curiosity-Coherence Reward Fine-Tuning, which combines prediction-error curiosity, adjacent-step coherence, and an efficiency prior; (3) pixel-level test-time reinforcement learning, which performs unsupervised online adaptation through neighbor retrieval, trajectory voting, and a KL-to-EMA safety constraint. Result: Across six public image and video benchmarks, the average relative gain is +4.08% (up to +6.03%), with shorter, auditable toolchains and KL kept within a safe corridor during test-time learning. Conclusion: Pixelis shows that acting directly in pixel space strengthens the physical grounding and actionability of visual reasoning, and offers a new paradigm for embodied adaptation without external feedback. Abstract: Most vision-language systems are static observers: they describe pixels, do not act, and cannot safely improve under shift. This passivity limits generalizable, physically grounded visual intelligence. Learning through action, not static description, is essential beyond curated data. We present Pixelis, a pixel-space agent that operates directly on images and videos via a compact set of executable operations (zoom/crop, segment, track, OCR, temporal localization) and learns from its consequences. Pixelis trains in three phases: (1) Supervised Fine-Tuning learns a pixel-tool grammar from Chain-of-Thought-Action traces with a masked imitation loss that upweights operation/argument tokens and auxiliary heads to stabilize pixel-grounded arguments; (2) Curiosity-Coherence Reward Fine-Tuning optimizes a dual-drive objective marrying prediction-error curiosity with adjacent-step coherence and a mild efficiency prior under a KL anchor, yielding short, valid, structured toolchains; (3) Pixel Test-Time RL performs label-free adaptation by retrieving neighbors, voting over complete trajectories rather than answers, and updating toward short, high-fidelity exemplars while constraining drift with a KL-to-EMA safety control. Across six public image and video benchmarks, Pixelis yields consistent improvements: the average relative gain is +4.08% over the same 8B baseline (peaking at +6.03% on VSI-Bench), computed as (ours-baseline)/baseline, while producing shorter, auditable toolchains and maintaining in-corridor KL during test-time learning. Acting within pixels, rather than abstract tokens, grounds multimodal perception in the physical world, linking visual reasoning with actionable outcomes, and enables embodied adaptation without external feedback.
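
Since the abstract defines its gain as (ours-baseline)/baseline, a short worked example may help distinguish relative gain from an absolute difference in points. The scores below are hypothetical, chosen only so that the result matches the reported +4.08%; they are not taken from the paper.

```python
def relative_gain(ours: float, baseline: float) -> float:
    """Relative improvement over a baseline, as a fraction."""
    return (ours - baseline) / baseline

# Hypothetical scores: a 2.04-point absolute gain on a 50.0 baseline.
gain = relative_gain(52.04, 50.0)
gain_percent = 100.0 * gain
```

With these illustrative numbers, an absolute difference of 2.04 points corresponds to a relative gain of 4.08%, so the same relative figure implies a smaller absolute change on a lower baseline and a larger one on a higher baseline.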

[111] Label What Matters: Modality-Balanced and Difficulty-Aware Multimodal Active Learning

Yuqiao Zeng,Xu Wang,Tengfei Liang,Yiqing Hao,Yi Jin,Hui Yu

Main category: cs.CV

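TL;DR: This paper proposes the RL-MBA framework, which uses reinforcement learning for modality-balanced, difficulty-aware multimodal active learning. By adaptively adjusting modality weights and estimating sample difficulty through evidential fusion, it improves classification accuracy and modality fairness under limited labeling budgets.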

Details Motivation: Existing multimodal active learning methods often assume that modality importance is stable and keep selection rules fixed, ignoring that the value of each modality and the difficulty of samples change dynamically during training. Method: Proposes RL-MBA, which models sample selection as a Markov Decision Process and has two core components, Adaptive Modality Contribution Balancing (AMCB) and Evidential Fusion for Difficulty-Aware Policy Adjustment (EFDA); reinforcement learning dynamically optimizes modality weights and the sample selection policy. Result: On the Food101, KineticsSound, and VGGSound datasets, RL-MBA significantly outperforms strong baselines under limited labeling budgets, improving both classification accuracy and modality fairness. Conclusion: Dynamically modeling modality contribution and sample difficulty is essential for multimodal active learning; RL-MBA provides a scalable, adaptive solution that balances performance and modality balance. Abstract: Multimodal learning integrates complementary information from different modalities such as image, text, and audio to improve model performance, but its success relies on large-scale labeled data, which is costly to obtain. Active learning (AL) mitigates this challenge by selectively annotating informative samples. In multimodal settings, many approaches implicitly assume that modality importance is stable across rounds and keep selection rules fixed at the fusion stage, which leaves them insensitive to the dynamic nature of multimodal learning, where the relative value of modalities and the difficulty of instances shift as training proceeds. To address this issue, we propose RL-MBA, a reinforcement-learning framework for modality-balanced, difficulty-aware multimodal active learning. RL-MBA models sample selection as a Markov Decision Process, where the policy adapts to modality contributions, uncertainty, and diversity, and the reward encourages accuracy gains and balance. Two key components drive this adaptability: (1) Adaptive Modality Contribution Balancing (AMCB), which dynamically adjusts modality weights via reinforcement feedback, and (2) Evidential Fusion for Difficulty-Aware Policy Adjustment (EFDA), which estimates sample difficulty via uncertainty-based evidential fusion to prioritize informative samples. Experiments on Food101, KineticsSound, and VGGSound demonstrate that RL-MBA consistently outperforms strong baselines, improving both classification accuracy and modality fairness under limited labeling budgets.

[112] MSRL: Scaling Generative Multimodal Reward Modeling via Multi-Stage Reinforcement Learning

Chenglong Wang,Yifu Huo,Yang Gan,Qiaozhi He,Qi Meng,Bei Li,Yan Wang,Junfu Liu,Tianhua Zhou,Jingbo Zhu,Tong Xiao

Main category: cs.CV

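TL;DR: This paper proposes Multi-Stage Reinforcement Learning (MSRL), which pretrains on large-scale textual preference data, progressively transfers to multimodal tasks, and adds cross-modal knowledge distillation, substantially improving generative multimodal reward models without additional multimodal annotation.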

Details Motivation: Existing multimodal reward models (MRMs) trained with reinforcement learning from verifiable rewards (RLVR) depend heavily on multimodal preference annotations, which are expensive and scarce, making training hard to scale. Method: Proposes the Multi-Stage Reinforcement Learning (MSRL) framework: the first stage learns a general reward reasoning capability from large-scale textual preference data; the second stage transfers it through caption-based reinforcement learning; the third stage performs end-to-end multimodal reinforcement learning; cross-modal knowledge distillation is added to strengthen preference generalization. Result: Accuracy rises from 66.6% to 75.9% on VL-RewardBench and from 70.2% to 75.7% on GenAI-Bench, with no new multimodal preference annotations. Conclusion: MSRL enables scalable RLVR training of generative MRMs and yields substantial gains on both visual understanding and visual generation tasks, offering a new paradigm for low-resource multimodal reward modeling. Abstract: Recent advances in multimodal reward modeling have been largely driven by a paradigm shift from discriminative to generative approaches. Building on this progress, recent studies have further employed reinforcement learning from verifiable rewards (RLVR) to enhance multimodal reward models (MRMs). Despite their success, RLVR-based training typically relies on labeled multimodal preference data, which are costly and labor-intensive to obtain, making it difficult to scale MRM training. To overcome this limitation, we propose a Multi-Stage Reinforcement Learning (MSRL) approach, which can achieve scalable RL for MRMs with limited multimodal data. MSRL replaces the conventional RLVR-based training paradigm by first learning a generalizable reward reasoning capability from large-scale textual preference data, and then progressively transferring this capability to multimodal tasks through caption-based and fully multimodal reinforcement-learning stages. Furthermore, we introduce a cross-modal knowledge distillation approach to improve preference generalization within MSRL. Extensive experiments demonstrate that MSRL effectively scales the RLVR-based training of generative MRMs and substantially improves their performance across both visual understanding and visual generation tasks (e.g., from 66.6% to 75.9% on VL-RewardBench and from 70.2% to 75.7% on GenAI-Bench), without requiring additional multimodal preference annotations. Our code is available at: https://github.com/wangclnlp/MSRL.

[113] MoireMix: A Formula-Based Data Augmentation for Improving Image Classification Robustness

Yuto Matsuo,Yoshihiro Fukuhara,Yuki M. Asano,Rintaro Yanagi,Hirokatsu Kataoka,Akio Nakamura

Main category: cs.CV

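TL;DR: This paper proposes a lightweight, storage-free procedural image augmentation based on Moire interference patterns that needs no external data and substantially improves Vision Transformer performance on several robustness benchmarks (such as ImageNet-C/R and adversarial examples).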

Details Motivation: Existing augmentation methods (such as diffusion models or feature mixing) are computationally expensive or depend on external data, so an efficient, lightweight alternative without external resources is needed. Method: A procedural augmentation based on analytic Moire interference patterns uses a closed-form mathematical formula to generate multi-scale structured perturbations on the fly, synthesizing them in memory and mixing them with images, with zero storage and zero external dependencies. Result: Experiments on Vision Transformers show that the method consistently outperforms standard augmentation and existing external-data-free augmentation on ImageNet-C, ImageNet-R, and adversarial robustness benchmarks. Conclusion: Analytic interference patterns are a practical and efficient alternative to generative augmentation and offer a new direction for robust training. Abstract: Data augmentation is a key technique for improving the robustness of image classification models. However, many recent approaches rely on diffusion-based synthesis or complex feature mixing strategies, which introduce substantial computational overhead or require external datasets. In this work, we explore a different direction: procedural augmentation based on analytic interference patterns. Unlike conventional augmentation methods that rely on stochastic noise, feature mixing, or generative models, our approach exploits Moire interference to generate structured perturbations spanning a wide range of spatial frequencies. We propose a lightweight augmentation method that procedurally generates Moire textures on-the-fly using a closed-form mathematical formulation. The patterns are synthesized directly in memory with negligible computational cost (0.0026 seconds per image), mixed with training images during training, and immediately discarded, enabling a storage-free augmentation pipeline without external data. Extensive experiments with Vision Transformers demonstrate that the proposed method consistently improves robustness across multiple benchmarks, including ImageNet-C, ImageNet-R, and adversarial benchmarks, outperforming standard augmentation baselines and existing external-data-free augmentation approaches. These results suggest that analytic interference patterns provide a practical and efficient alternative to data-driven generative augmentation methods.
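To make the idea of formula-based augmentation concrete, here is a minimal sketch in Python. The abstract does not give the closed-form formula MoireMix uses, so this example assumes one common way to obtain a Moire pattern: multiplying two sinusoidal gratings that differ slightly in angle and frequency. The function names `moire_texture` and `mix_with_moire`, the parameter ranges, and the linear blend with weight `alpha` are all hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np

def moire_texture(height, width, freq=40.0, angle_deg=5.0, scale=1.03):
    """Return a Moire-like texture in [0, 1] from two superimposed gratings (assumed formula)."""
    ys, xs = np.mgrid[0:height, 0:width].astype(np.float64)
    xs /= width   # normalise coordinates to [0, 1)
    ys /= height
    theta = np.deg2rad(angle_deg)
    grating_a = np.cos(2 * np.pi * freq * xs)
    # Second grating: slightly rotated and rescaled copy of the first.
    rotated = xs * np.cos(theta) + ys * np.sin(theta)
    grating_b = np.cos(2 * np.pi * freq * scale * rotated)
    pattern = grating_a * grating_b          # values in [-1, 1]
    return 0.5 * (pattern + 1.0)             # values in [0, 1]

def mix_with_moire(image, alpha=0.2, rng=None):
    """Blend a float image of shape (H, W, C) in [0, 1] with a random Moire texture."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    texture = moire_texture(
        h, w,
        freq=rng.uniform(10, 80),
        angle_deg=rng.uniform(1, 15),
        scale=rng.uniform(1.0, 1.1),
    )
    # The texture is created in memory and discarded after mixing; nothing is stored.
    return (1 - alpha) * image + alpha * texture[..., None]
```

Because the texture is a pure function of a few sampled parameters, it can be regenerated on every training step instead of being read from disk, which is the storage-free property the abstract emphasises.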

[114] AnyDoc: Enhancing Document Generation via Large-Scale HTML/CSS Data Synthesis and Height-Aware Reinforcement Optimization

Jiawei Lin,Wanrong Zhu,Vlad I Morariu,Christopher Tensmeyer

Main category: cs.CV

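TL;DR: AnyDoc is a multi-task, multi-category document generation framework that builds the large-scale synthetic dataset DocHTML and adds a height-aware reinforcement learning post-training strategy, substantially improving multimodal large language models on document generation tasks.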

Details Motivation: Existing document generation methods are limited by small, human-crafted datasets that cover few categories, which makes it hard to support diverse, practical document generation. Method: The AnyDoc framework 1) builds a scalable HTML/CSS document synthesis pipeline that produces the DocHTML dataset, with 265K samples, 111 categories, and 32 styles, all with full metadata; 2) fine-tunes multimodal large language models (MLLMs) on this dataset for three tasks: intention-to-document, document derendering, and element-to-document; 3) adds height-aware reinforcement learning (HARL) post-training, which uses the difference between predicted and target height as the reward signal to mitigate content overflow. Result: AnyDoc outperforms general-purpose MLLMs and task-specific baselines on all three document generation tasks, as confirmed by qualitative and quantitative experiments. Conclusion: By combining data synthesis, multi-task modeling, and HARL optimization, AnyDoc offers a feasible route to general, high-quality document generation and moves AI-driven content creation toward more practical and controllable use. Abstract: Document generation has gained growing attention in the field of AI-driven content creation. In this work, we push its boundaries by introducing AnyDoc, a framework capable of handling multiple generation tasks across a wide spectrum of document categories, all represented in a unified HTML/CSS format. To overcome the limited coverage and scale of existing human-crafted document datasets, AnyDoc first establishes a scalable data synthesis pipeline to automatically generate documents in HTML/CSS form. This pipeline yields DocHTML, a large-scale dataset containing 265,206 document samples, while spanning 111 categories and 32 distinct styles. Additionally, all documents are equipped with comprehensive metadata, including design intentions, HTML/CSS source code, visual assets, and rendered screenshots. Building on the curated dataset, AnyDoc fine-tunes multi-modal large language models (MLLMs) to achieve three practical document generation tasks: intention-to-document, document derendering, and element-to-document. To address the content overflow issue observed during fine-tuning, AnyDoc further incorporates a height-aware reinforcement learning (HARL) post-training procedure. By defining a reward function based on the difference between predicted and target document heights, overflow is penalized and gradually mitigated during HARL, thereby enhancing overall performance. Qualitative and quantitative experiments demonstrate that AnyDoc outperforms both general-purpose MLLMs and task-specific baselines across all three tasks.
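The height-aware reward can be illustrated with a small sketch. The abstract only states that the reward is based on the difference between predicted and target document heights; the exact functional form is not given. The version below assumes a reward of 1 for a perfect match that decays linearly with the relative height error and is clipped at 0. The function name `height_reward` and this particular shape are hypothetical.

```python
def height_reward(pred_height, target_height):
    """Assumed reward: 1.0 when the rendered height matches the target, 0.0 at 100% error or more."""
    if target_height <= 0:
        raise ValueError("target_height must be positive")
    relative_error = abs(pred_height - target_height) / target_height
    # Overflow (pred_height > target_height) increases the error and lowers the reward.
    return 1.0 - min(1.0, relative_error)
```

For instance, a page rendered at 1500 px against a 1000 px target would receive a reward of 0.5 under this assumed form, so a policy that overflows less is preferred during post-training.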

[115] AirSplat: Alignment and Rating for Robust Feed-Forward 3D Gaussian Splatting

Minh-Quan Viet Bui,Jaeho Moon,Munchurl Kim

Main category: cs.CV

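TL;DR: This paper proposes AirSplat, a framework that transfers the geometric priors of 3D Vision Foundation Models (3DVFMs) to high-quality, pose-free novel view synthesis (NVS), improving reconstruction quality through two techniques: Self-Consistent Pose Alignment (SCPA) and Rating-based Opacity Matching (ROM).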

Details Motivation: 3D Vision Foundation Models perform well at zero-shot geometry estimation, but applying them directly to generalizable novel view synthesis remains challenging, especially when no camera poses are provided. Method: The AirSplat training framework has two key components: (1) Self-Consistent Pose Alignment (SCPA), a training-time feedback loop that provides pixel-aligned supervision; and (2) Rating-based Opacity Matching (ROM), which uses the local 3D geometry consistency knowledge of a sparse-view NVS teacher model to filter out low-quality rendering primitives. Result: The method significantly outperforms existing pose-free NVS approaches in reconstruction quality on large-scale benchmarks. Conclusion: AirSplat shows that adapting 3DVFMs to NVS is feasible and effective, offering a new paradigm for joint visual geometry estimation and high-quality view synthesis. Abstract: While 3D Vision Foundation Models (3DVFMs) have demonstrated remarkable zero-shot capabilities in visual geometry estimation, their direct application to generalizable novel view synthesis (NVS) remains challenging. In this paper, we propose AirSplat, a novel training framework that effectively adapts the robust geometric priors of 3DVFMs into high-fidelity, pose-free NVS. Our approach introduces two key technical contributions: (1) Self-Consistent Pose Alignment (SCPA), a training-time feedback loop that ensures pixel-aligned supervision to resolve pose-geometry discrepancy; and (2) Rating-based Opacity Matching (ROM), which leverages the local 3D geometry consistency knowledge from a sparse-view NVS teacher model to filter out degraded primitives. Experimental results on large-scale benchmarks demonstrate that our method significantly outperforms state-of-the-art pose-free NVS approaches in reconstruction quality. Our AirSplat highlights the potential of adapting 3DVFMs to enable simultaneous visual geometry estimation and high-quality view synthesis.

[116] Denoise and Align: Towards Source-Free UDA for Robust Panoramic Semantic Segmentation

Yaowen Chang,Zhen Cao,Xu Zheng,Xiaoxin Mi,Zhen Dong

Main category: cs.CV

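TL;DR: This paper proposes DAPASS, a framework that addresses geometric distortion and pseudo-label noise in panoramic semantic segmentation under the source-free (SFUDA) setting, using the PCGD and CRAM modules to improve pseudo-label quality and multi-scale feature alignment and to raise mIoU.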

Details Motivation: Panoramic semantic segmentation faces severe geometric distortion and high dense-annotation costs; under source-free unsupervised domain adaptation (SFUDA), where source data is inaccessible, domain shift is amplified, pseudo-labels become unreliable, and minority-class performance drops sharply. Method: The DAPASS framework has two modules: 1) Panoramic Confidence-Guided Denoising (PCGD), which uses perturbation consistency and neighborhood confidence to produce high-fidelity, class-balanced pseudo-labels; 2) the Contextual Resolution Adversarial Module (CRAM), which adversarially aligns high-resolution local details with low-resolution global semantics to reduce scale variance and distortion. Result: DAPASS reaches 55.04% (+2.05%) mIoU on Cityscapes-to-DensePASS (outdoor) and 70.38% (+1.54%) mIoU on Stanford2D3D (indoor), setting a new state of the art. Conclusion: DAPASS overcomes key bottlenecks of panoramic segmentation under SFUDA, combining reliable pseudo-labels with distortion modeling, and offers a practical solution for privacy-sensitive scenarios. Abstract: Panoramic semantic segmentation is pivotal for comprehensive 360° scene understanding in critical applications like autonomous driving and virtual reality. However, progress in this domain is constrained by two key challenges: the severe geometric distortions inherent in panoramic projections and the prohibitive cost of dense annotation. While Unsupervised Domain Adaptation (UDA) from label-rich pinhole-camera datasets offers a viable alternative, many real-world tasks impose a stricter source-free (SFUDA) constraint where source data is inaccessible for privacy or proprietary reasons. This constraint significantly amplifies the core problems of domain shift, leading to unreliable pseudo-labels and dramatic performance degradation, particularly for minority classes. To overcome these limitations, we propose the DAPASS framework. DAPASS introduces two synergistic modules to robustly transfer knowledge without source data. First, our Panoramic Confidence-Guided Denoising (PCGD) module generates high-fidelity, class-balanced pseudo-labels by enforcing perturbation consistency and incorporating neighborhood-level confidence to filter noise. Second, a Contextual Resolution Adversarial Module (CRAM) explicitly addresses scale variance and distortion by adversarially aligning fine-grained details from high-resolution crops with global semantics from low-resolution contexts. DAPASS achieves state-of-the-art performance on outdoor (Cityscapes-to-DensePASS) and indoor (Stanford2D3D) benchmarks, yielding 55.04% (+2.05%) and 70.38% (+1.54%) mIoU, respectively.
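The two filtering signals named for PCGD, perturbation consistency and neighborhood-level confidence, can be sketched as follows. This is only an illustration under stated assumptions: it takes softmax outputs from two perturbed views of the same image, keeps a pixel when both views predict the same class and the 3x3 mean of the confidence map exceeds a threshold, and omits class balancing. The names `box_mean_3x3` and `filter_pseudo_labels`, the 3x3 window, and the threshold value are hypothetical and not taken from the paper.

```python
import numpy as np

def box_mean_3x3(conf):
    """Mean of each pixel's 3x3 neighbourhood, with edge replication at the border."""
    padded = np.pad(conf, 1, mode="edge")
    h, w = conf.shape
    total = np.zeros_like(conf, dtype=np.float64)
    for dy in range(3):
        for dx in range(3):
            total += padded[dy:dy + h, dx:dx + w]
    return total / 9.0

def filter_pseudo_labels(probs_a, probs_b, threshold=0.8, ignore_index=255):
    """probs_a, probs_b: (H, W, C) class probabilities from two perturbed views."""
    labels_a = probs_a.argmax(axis=-1)
    labels_b = probs_b.argmax(axis=-1)
    consistent = labels_a == labels_b              # perturbation consistency
    neighbourhood_conf = box_mean_3x3(probs_a.max(axis=-1))
    keep = consistent & (neighbourhood_conf >= threshold)
    # Pixels that fail either test are marked with ignore_index and skipped by the loss.
    return np.where(keep, labels_a, ignore_index)
```

Using a neighbourhood average rather than the per-pixel confidence means that an isolated confident pixel surrounded by uncertain ones is still rejected, which is one plausible reading of "neighborhood-level confidence".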

[117] Robust Principal Component Completion

Yinjian Wang,Wei Li,Yuanyuan Gui,James E. Fowler,Gemine Vivone

Main category: cs.CV

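TL;DR: This paper proposes robust principal component completion (RPCC), a framework that models a probabilistic sparse tensor factorization with variational Bayesian inference and learns the support of the sparse component directly, removing the post-hoc thresholding needed in traditional RPCA and performing well on synthetic data and on real video and hyperspectral data.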

Details Motivation: Traditional RPCA assumes that the low-rank background and the sparse foreground are additive, but in practice the sparse foreground often occludes or replaces background elements, so the model is mismatched. Method: The robust principal component completion (RPCC) framework uses a fully probabilistic Bayesian sparse tensor factorization solved by variational Bayesian inference; it is shown theoretically to converge to a hard classifier for the support, so no post-hoc thresholding is needed. Result: The method gives near-optimal estimates on synthetic data, robust foreground extraction on real color video, and effective anomaly detection on hyperspectral data. Conclusion: By modeling the support rather than an explicit sparse component, RPCC better fits occlusion and replacement scenarios, improves robustness and practicality, and comes with a theoretical convergence guarantee. Abstract: Robust principal component analysis (RPCA) seeks a low-rank component and a sparse component from their summation. Yet, in many applications of interest, the sparse foreground actually replaces, or occludes, elements from the low-rank background. To address this mismatch, a new framework is proposed in which the sparse component is identified indirectly through determining its support. This approach, called robust principal component completion (RPCC), is solved via variational Bayesian inference applied to a fully probabilistic Bayesian sparse tensor factorization. Convergence to a hard classifier for the support is shown, thereby eliminating the post-hoc thresholding required of most prior RPCA-driven approaches. Experimental results reveal that the proposed approach delivers near-optimal estimates on synthetic data as well as robust foreground-extraction and anomaly-detection performance on real color video and hyperspectral datasets, respectively. Source implementation and Appendices are available at https://github.com/WongYinJ/BCP-RPCC.
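The mismatch between the additive and the replacement view can be written out explicitly. The notation below is assumed for illustration and is not taken from the paper: $\mathcal{X}$ is the observed data, $\mathcal{L}$ the low-rank background, $\mathcal{S}$ the sparse component, $\mathcal{F}$ the foreground values, and $\Omega \in \{0,1\}$ a binary support mask of the same shape as $\mathcal{X}$.

```latex
% Additive model assumed by classical RPCA:
\mathcal{X} = \mathcal{L} + \mathcal{S}

% Replacement (occlusion) model, with \odot the elementwise product:
\mathcal{X} = (1 - \Omega) \odot \mathcal{L} + \Omega \odot \mathcal{F}

% Rewriting the replacement model in additive form shows the coupling:
\mathcal{X} = \mathcal{L} + \underbrace{\Omega \odot (\mathcal{F} - \mathcal{L})}_{\mathcal{S}}
```

Under the replacement model the implied sparse term depends on the background itself, so estimating the support $\Omega$ and completing $\mathcal{L}$ on the unoccluded entries is a more natural target than estimating $\mathcal{S}$ and thresholding it afterwards.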

[118] EgoXtreme: A Dataset for Robust Object Pose Estimation in Egocentric Views under Extreme Conditions

Taegyoon Yoon,Yegyu Han,Seojin Ji,Jaewoo Park,Sojeong Kim,Taein Kwon,Hyung-Sin Kim

Main category: cs.CV

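TL;DR: This paper introduces EgoXtreme, a dataset that targets the motion blur, dynamic illumination, and visual obstruction that existing 6D object pose estimation benchmarks miss in real egocentric applications (such as smart glasses), and uses it to show the limits of current methods under extreme conditions and the potential of temporal information.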

Details Motivation: Existing 6D pose estimation benchmarks do not reflect the severe motion blur, dynamic illumination, and visual obstruction of real egocentric scenarios (such as smart glasses), leaving a large gap between lab data and real-world use. Method: The authors build EgoXtreme, a new large-scale egocentric 6D pose estimation dataset covering three extreme scenarios (industrial maintenance, sports, and emergency rescue), evaluate existing generalizable pose estimators on it, and test the effectiveness of image restoration and tracking-based methods. Result: State-of-the-art generalizable pose estimators degrade markedly on EgoXtreme, especially in low light; image restoration alone (e.g., deblurring) brings no improvement; tracking-based methods show gains in fast-motion scenarios. Conclusion: EgoXtreme is a key resource for developing robust pose estimation models for real egocentric vision. Abstract: Smart glass is emerging as a useful device since it provides plenty of insights under hands-busy, eyes-on-task situations. To understand the context of the wearer, 6D object pose estimation in egocentric view is becoming essential. However, existing 6D object pose estimation benchmarks fail to capture the challenges of real-world egocentric applications, which are often dominated by severe motion blur, dynamic illumination, and visual obstructions. This discrepancy creates a significant gap between controlled lab data and chaotic real-world application. To bridge this gap, we introduce EgoXtreme, a new large-scale 6D pose estimation dataset captured entirely from an egocentric perspective. EgoXtreme features three challenging scenarios - industrial maintenance, sports, and emergency rescue - designed to introduce severe perceptual ambiguities through extreme lighting, heavy motion blur, and smoke. Evaluations of state-of-the-art generalizable pose estimators on EgoXtreme indicate that their generalization fails to hold in extreme conditions, especially under low light. We further demonstrate that simply applying image restoration (e.g., deblurring) offers no positive improvement for extreme conditions. In contrast, a tracking-based approach shows performance gains, implying that using temporal information in fast-motion scenarios is meaningful. We conclude that EgoXtreme is an essential resource for developing and evaluating the next generation of pose estimation models robust enough for real-world egocentric vision. The dataset and code are available at https://taegyoun88.github.io/EgoXtreme/

[119] SAVe: Self-Supervised Audio-visual Deepfake Detection Exploiting Visual Artifacts and Audio-visual Misalignment

Sahibzada Adil Shahzad,Ammarah Hashmi,Junichi Yamagishi,Yusuke Yasuda,Yu Tsao,Chia-Wen Lin,Yan-Tsung Peng,Hsin-Min Wang

Main category: cs.CV

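TL;DR: This paper proposes SAVe, a self-supervised audio-visual deepfake detection framework trained entirely on authentic videos, which generates pseudo-manipulated samples and models lip-speech synchronization to improve detection robustness and cross-dataset generalization.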

Details Motivation: Existing detectors rely too heavily on synthetic forgery data, which introduces dataset and generator bias and limits generalization to unseen manipulation types. Method: SAVe uses self-supervised learning to generate identity-preserving, region-aware self-blended pseudo-forgeries from authentic videos, and adds an audio-visual alignment module that models lip-speech synchronization to capture cross-modal inconsistencies. Result: SAVe shows competitive in-domain performance and strong cross-dataset generalization on FakeAVCeleb and AV-LipSync-TIMIT. Conclusion: Self-supervised learning is a scalable and robust paradigm for multimodal deepfake detection that reduces the dependence on synthetic data. Abstract: Multimodal deepfakes can exhibit subtle visual artifacts and cross-modal inconsistencies, which remain challenging to detect, especially when detectors are trained primarily on curated synthetic forgeries. Such synthetic dependence can introduce dataset and generator bias, limiting scalability and robustness to unseen manipulations. We propose SAVe, a self-supervised audio-visual deepfake detection framework that learns entirely on authentic videos. SAVe generates on-the-fly, identity-preserving, region-aware self-blended pseudo-manipulations to emulate tampering artifacts, enabling the model to learn complementary visual cues across multiple facial granularities. To capture cross-modal evidence, SAVe also models lip-speech synchronization via an audio-visual alignment component that detects temporal misalignment patterns characteristic of audio-visual forgeries. Experiments on FakeAVCeleb and AV-LipSync-TIMIT demonstrate competitive in-domain performance and strong cross-dataset generalization, highlighting self-supervised learning as a scalable paradigm for multimodal deepfake detection.

[120] FD$^2$: A Dedicated Framework for Fine-Grained Dataset Distillation

Hongxu Ma,Guang Li,Shijie Wang,Dongzhan Zhou,Baoli Sun,Takahiro Ogawa,Miki Haseyama,Zhihui Wang

Main category: cs.CV

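TL;DR: This paper proposes FD², a framework dedicated to fine-grained dataset distillation that localizes discriminative regions, builds fine-grained representations, and introduces counterfactual attention learning and fine-grained constraints in the pretraining and distillation stages, improving distillation results on fine-grained datasets.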

Details Motivation: Existing decoupled dataset distillation methods rely on coarse class-label supervision; on fine-grained datasets this yields large intra-class variation with small inter-class differences and overly similar samples within a class, which weakens localized discriminative cues. Method: The FD² framework 1) uses counterfactual attention learning during pretraining to aggregate discriminative representations and update class prototypes; 2) adds, during distillation, a fine-grained characteristic constraint (aligning each sample with its own class prototype while repelling others) and a similarity constraint (diversifying attention across same-class samples). Result: Experiments on multiple fine-grained and general datasets show that FD² integrates seamlessly with decoupled DD pipelines and improves performance in most settings, indicating strong generalization and transferability. Conclusion: FD² addresses the loss of discriminative information in fine-grained dataset distillation and offers a new direction for data-efficient learning in fine-grained vision tasks. Abstract: Dataset distillation (DD) compresses a large training set into a small synthetic set, reducing storage and training cost, and has shown strong results on general benchmarks. Decoupled DD further improves efficiency by splitting the pipeline into pretraining, sample distillation, and soft-label generation. However, existing decoupled methods largely rely on coarse class-label supervision and optimize samples within each class in a nearly identical manner. On fine-grained datasets, this often yields distilled samples that (i) retain large intra-class variation with subtle inter-class differences and (ii) become overly similar within the same class, limiting localized discriminative cues and hurting recognition. To solve the above-mentioned problems, we propose FD$^{2}$, a dedicated framework for Fine-grained Dataset Distillation. FD$^{2}$ localizes discriminative regions and constructs fine-grained representations for distillation. During pretraining, counterfactual attention learning aggregates discriminative representations to update class prototypes. During distillation, a fine-grained characteristic constraint aligns each sample with its class prototype while repelling others, and a similarity constraint diversifies attention across same-class samples. Experiments on multiple fine-grained and general datasets show that FD$^{2}$ integrates seamlessly with decoupled DD and improves performance in most settings, indicating strong transferability.

[121] Learning to Rank Caption Chains for Video-Text Alignment

Ansel Blume,Burak Uzkent,Shalini Chaudhuri,Garin Kessler

Main category: cs.CV

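TL;DR: This paper proposes training vision-language models with ranking optimization instead of binary direct preference optimization (DPO) to better model how faithful responses are to the visual input; it generates totally ordered, challenging caption chains through repeated caption degradation, outperforms DPO on long-form generation and assessment, and finds that the vision encoder must be finetuned, challenging the view of DPO as purely language reweighting.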

Details Motivation: Standard DPO uses a binary win/lose judgment and overlooks that a "losing" response from a vision-language model may still be highly faithful to the visual input, so visual fidelity is modeled poorly. Method: The paper proposes a ranking-optimization paradigm focused on video-text alignment, using repeated caption degradation to generate challenging, totally ordered video caption chains at scale. Result: Ranking optimization outperforms binary DPO on long-form content generation and assessment, and it is effective only when the vision encoder is finetuned, which indicates that DPO is not a purely language-level operation. Conclusion: Ranking optimization captures the faithfulness of responses to visual input more finely than binary DPO; finetuning the vision encoder is key to the gains and shows that visual and language modules need to be optimized together in multimodal preference learning. Abstract: Direct preference optimization (DPO) is an effective technique to train language models to generate preferred over dispreferred responses. However, this binary "winner-takes-all" approach is suboptimal for vision-language models whose response quality is highly dependent on visual content. In particular, a response may still be faithful to the visual inputs even if it is less preferable than an alternative. The standard Bradley-Terry DPO formulation lacks this nuance, upweighting winning responses without sufficient regard for whether the "losing" response still maintains high visual fidelity. In this work, we investigate ranking optimization as an alternative that more precisely situates responses' faithfulness to visual inputs. We focus on video-text alignment using detailed video captions, proposing a method to generate challenging, totally ordered caption chains at scale through repeated caption degradation. Our results show ranking optimization outperforms binary DPO for long-form content generation and assessment, and importantly, we find that these approaches require finetuning of the vision encoder to be effective, challenging the view of DPO as purely a language-reweighting process.
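The difference between a binary preference loss and a ranking loss over a caption chain can be sketched in a few lines. The abstract names the Bradley-Terry formulation for DPO but does not specify which ranking objective is used, so the listwise Plackett-Luce loss below is an assumption chosen as a standard example. The scores stand in for whatever per-response reward the model assigns; the function names are hypothetical.

```python
import math

def bradley_terry_loss(score_win, score_lose):
    """Binary preference loss: -log sigmoid(score_win - score_lose)."""
    return math.log1p(math.exp(-(score_win - score_lose)))

def plackett_luce_loss(scores_best_to_worst):
    """Listwise ranking loss for a totally ordered chain (assumed objective, best item first)."""
    loss = 0.0
    for k in range(len(scores_best_to_worst) - 1):
        tail = scores_best_to_worst[k:]
        m = max(tail)  # subtract the max for a numerically stable log-sum-exp
        log_norm = m + math.log(sum(math.exp(s - m) for s in tail))
        # Negative log-probability that item k is ranked first among the remaining items.
        loss += log_norm - scores_best_to_worst[k]
    return loss
```

With a chain of length two the listwise loss reduces to the Bradley-Terry loss; with longer chains every intermediate caption contributes a term, so a mildly degraded caption is pushed below the original but above a heavily degraded one instead of being treated simply as a loser.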

[122] Photon: Speedup Volume Understanding with Efficient Multimodal Large Language Models

Chengyu Fang,Heng Guo,Zheng Jiang,Chunming He,Xiu Li,Minfeng Xu

Main category: cs.CV

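TL;DR: Photon is a multimodal large language model framework for question answering on 3D medical images that uses variable-length token sequences, instruction-conditioned token scheduling, and differentiable optimization with gradient restoration to deliver efficient, accurate, and reliable visual question answering.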

Details Motivation: When extended to 3D medical imaging, existing methods suffer from high computational cost, disrupted volumetric continuity, and obscured subtle findings. Method: Photon introduces a variable-length token representation, an instruction-conditioned token scheduling mechanism, surrogate gradient propagation with gradient restoration, and regularization objectives that suppress language bias and improve the reliability of visual evidence. Result: Photon reaches state-of-the-art accuracy on a range of medical visual question answering tasks while reducing resource usage and accelerating training and inference. Conclusion: Photon maintains or improves accuracy while making 3D medical visual question answering markedly more efficient and reliable, offering a practical path for clinical multimodal AI. Abstract: Multimodal large language models are promising for clinical visual question answering tasks, but scaling to 3D imaging is hindered by high computational costs. Prior methods often rely on 2D slices or fixed-length token compression, disrupting volumetric continuity and obscuring subtle findings. We present Photon, a framework that represents 3D medical volumes with token sequences of variable length. Photon introduces instruction-conditioned token scheduling and surrogate gradient propagation to adaptively reduce tokens during both training and inference, which lowers computational cost while mitigating the attention dilution caused by redundant tokens. It incorporates a custom backpropagation rule with gradient restoration to enable differentiable optimization despite discrete token drop. To stabilize token compression and ensure reliable use of visual evidence, Photon further applies regularization objectives that mitigate language-only bias and improve reliability. Experiments on diverse medical visual question answering tasks show that Photon achieves state-of-the-art accuracy while reducing resource usage and accelerating both training and inference.

[123] A Semantically Disentangled Unified Model for Multi-category 3D Anomaly Detection

SuYeon Kim,Wongyu Lee,MyeongAh Cho

Main category: cs.CV

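TL;DR: This paper proposes a semantically disentangled unified model for 3D anomaly detection that uses coarse-to-fine global tokenization, category-conditioned contrastive learning, and a geometry-guided decoder to reduce inter-category entanglement, reaching state-of-the-art results on Real3D-AD and Anomaly-ShapeNet.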

Details Motivation: Existing unified models for multi-category 3D anomaly detection suffer from Inter-Category Entanglement (ICE), which leads to incorrect semantic priors and unreliable anomaly scores. Method: The semantically disentangled unified model comprises (i) coarse-to-fine global tokenization to form instance-level semantic identity; (ii) category-conditioned contrastive learning to disentangle category semantics; and (iii) a geometry-guided decoder for semantically consistent reconstruction. Result: On Real3D-AD and Anomaly-ShapeNet, object-level AUROC improves by 2.8% (unified models) and 9.1% (category-specific models), and unified 3D anomaly detection becomes more reliable. Conclusion: Semantic disentanglement reduces inter-category entanglement; the framework combines unified modeling with reliable detection and moves 3D anomaly detection closer to practical use. Abstract: 3D anomaly detection targets the detection and localization of defects in 3D point clouds trained solely on normal data. While a unified model improves scalability by learning across multiple categories, it often suffers from Inter-Category Entanglement (ICE), where latent features from different categories overlap, causing the model to adopt incorrect semantic priors during reconstruction and ultimately yielding unreliable anomaly scores. To address this issue, we propose the Semantically Disentangled Unified Model for 3D Anomaly Detection, which reconstructs features conditioned on disentangled semantic representations. Our framework consists of three key components: (i) Coarse-to-Fine Global Tokenization for forming instance-level semantic identity, (ii) Category-Conditioned Contrastive Learning for disentangling category semantics, and (iii) a Geometry-Guided Decoder for semantically consistent reconstruction. Extensive experiments on Real3D-AD and Anomaly-ShapeNet demonstrate that our method achieves state-of-the-art performance for both unified and category-specific models, improving object-level AUROC by 2.8% and 9.1%, respectively, while enhancing the reliability of unified 3D anomaly detection.

[124] SportSkills: Physical Skill Learning from Sports Instructional Videos

Kumar Ashutosh,Chi Hsuan Wu,Kristen Grauman

Main category: cs.CV

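TL;DR: This paper introduces SportSkills, the first large-scale sports video dataset for physical skill learning, with 360k instructional videos and 630k visual demonstrations, and proposes a mistake-conditioned instructional video retrieval task built on it, improving fine-grained action understanding and personalized instructional feedback.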

Details Motivation: Existing large-scale video datasets focus on general human activity and lack deep coverage of the fine-grained activities needed for physical skill learning. Method: The authors build the SportSkills dataset (360k instructional videos, 630k visual demonstrations, 55 sports), design fine-grained action understanding experiments, propose the new task of mistake-conditioned instructional video retrieval, and run a formal evaluation with professional coaches. Result: With the same model, representations learned on SportSkills achieve gains of up to 4x over traditional activity-centric datasets; in evaluations by professional coaches, the mistake-conditioned retrieval approach significantly improves the ability of video models to provide personalized visual instruction for a user query. Conclusion: SportSkills fills the gap in high-quality video data for physical skill learning and advances research on fine-grained action understanding and actionable instructional feedback. Abstract: Current large-scale video datasets focus on general human activity, but lack depth of coverage on fine-grained activities needed to address physical skill learning. We introduce SportSkills, the first large-scale sports dataset geared towards physical skill learning with in-the-wild video. SportSkills has more than 360k instructional videos containing more than 630k visual demonstrations paired with instructional narrations explaining the know-how behind the actions from 55 varied sports. Through a suite of experiments, we show that SportSkills unlocks the ability to understand fine-grained differences between physical actions. Our representation achieves gains of up to 4x with the same model trained on traditional activity-centric datasets. Crucially, building on SportSkills, we introduce the first large-scale task formulation of mistake-conditioned instructional video retrieval, bridging representation learning and actionable feedback generation (e.g., "here's my execution of a skill; which video clip should I watch to improve it?"). Formal evaluations by professional coaches show our retrieval approach significantly advances the ability of video models to personalize visual instructions for a user query.

[125] Towards Foundation Models for 3D Scene Understanding: Instance-Aware Self-Supervised Learning for Point Clouds

Bin Yang,Mohamed Abdelsamad,Miao Zhang,Alexandru Paul Condurache

Main category: cs.CV

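TL;DR: This paper proposes PointINS, an instance-oriented self-supervised learning framework that enriches point cloud representations through geometry-aware learning and strengthens instance awareness, with clear gains on indoor instance segmentation and outdoor panoptic segmentation.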

Details Motivation: Existing self-supervised methods for point clouds improve semantic understanding but transfer poorly to instance localization and lack instance awareness, which holds back general 3D foundation models. Method: PointINS adds an orthogonal offset branch that jointly learns high-level semantics and geometric reasoning, and proposes two regularization strategies, Offset Distribution Regularization (ODR) and Spatial Clustering Regularization (SCR), to make instance localization more robust. Result: Experiments on five datasets show average improvements of 3.5% mAP for indoor instance segmentation and 4.1% PQ for outdoor panoptic segmentation. Conclusion: PointINS bridges the gap between semantic awareness and instance awareness and offers a new path toward scalable 3D foundation models. Abstract: Recent advances in self-supervised learning (SSL) for point clouds have substantially improved 3D scene understanding without human annotations. Existing approaches emphasize semantic awareness by enforcing feature consistency across augmented views or by masked scene modeling. However, the resulting representations transfer poorly to instance localization, and often require full finetuning for strong performance. Instance awareness is a fundamental component of 3D perception, thus bridging this gap is crucial for progressing toward true 3D foundation models that support all downstream tasks on 3D data. In this work, we introduce PointINS, an instance-oriented self-supervised framework that enriches point cloud representations through geometry-aware learning. PointINS employs an orthogonal offset branch to jointly learn high-level semantic understanding and geometric reasoning, yielding instance awareness. We identify two consistent properties essential for robust instance localization and formulate them as complementary regularization strategies: Offset Distribution Regularization (ODR), which aligns predicted offsets with empirically observed geometric priors, and Spatial Clustering Regularization (SCR), which enforces local coherence by regularizing offsets with pseudo-instance masks. Through extensive experiments across five datasets, PointINS achieves on average +3.5% mAP improvement for indoor instance segmentation and +4.1% PQ gain for outdoor panoptic segmentation, paving the way for scalable 3D foundation models.

[126] ET-SAM: Efficient Point Prompt Prediction in SAM for Unified Scene Text Detection and Layout Analysis

Xike Zhang,Maoyuan Ye,Juhua Liu,Bo Du

Main category: cs.CV

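TL;DR: This paper proposes ET-SAM, an efficient SAM-based framework for unified scene text detection and layout analysis that uses a lightweight point decoder producing word heatmaps to cut the number of prompt points, and a joint training strategy that combines heterogeneously annotated data, giving about 3x faster inference while improving several benchmark scores.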

Details Motivation: Existing SAM-based methods rely on pixel-level text segmentation to sample large numbers of foreground points as prompts, which causes high inference latency and poor data utilization. Method: The ET-SAM framework contains a lightweight point decoder (which produces word heatmaps to reduce the number of foreground points) and a hierarchical mask decoder; a joint training strategy combines data with multi-level, word-level, and line-level annotations and introduces three sets of learnable task prompts for the different annotation types. Result: Compared with the previous SAM-based architecture, ET-SAM achieves about 3x inference acceleration with competitive performance on HierText, and improves the average F-score by 11.0% on Total-Text, CTW1500, and ICDAR15. Conclusion: ET-SAM keeps accuracy while substantially improving inference efficiency and the use of heterogeneously annotated data, offering a more practical solution for unified text detection and layout analysis. Abstract: Previous works based on Segment Anything Model (SAM) have achieved promising performance in unified scene text detection and layout analysis. However, the typical reliance on pixel-level text segmentation for sampling thousands of foreground points as prompts leads to unsatisfactory inference latency and limited data utilization. To address the above issues, we propose ET-SAM, an Efficient framework with two decoders for unified scene Text detection and layout analysis based on SAM. Technically, we customize a lightweight point decoder that produces word heatmaps to obtain only a few foreground points, thereby eliminating excessive point prompts and accelerating inference. Without the dependence on pixel-level segmentation, we further design a joint training strategy to leverage existing data with heterogeneous text-level annotations. Specifically, the datasets with multi-level, word-level only, and line-level only annotations are combined in parallel as a unified training set. For these datasets, we introduce three corresponding sets of learnable task prompts in both the point decoder and hierarchical mask decoder to mitigate discrepancies across datasets. Extensive experiments demonstrate that, compared to the previous SAM-based architecture, ET-SAM achieves about 3$\times$ inference acceleration while obtaining competitive performance on HierText, and improves an average of 11.0% F-score on Total-Text, CTW1500, and ICDAR15.

[127] Knowledge-Guided Adversarial Training for Infrared Object Detection via Thermal Radiation Modeling

Shiji Zhao,Shukun Xiong,Maoxun Yuan,Yao Huang,Ranjie Duan,Qing Guo,Jiansheng Chen,Haibin Duan,Xingxing Wei

Main category: cs.CV

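TL;DR: This paper proposes Knowledge-Guided Adversarial Training (KGAT), an adversarial training method guided by infrared physical knowledge that models and exploits thermal radiation relations between classes to improve the robustness and accuracy of infrared object detectors against adversarial examples and common corruptions.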

Details Motivation: Existing infrared object detection methods are mostly data-driven and ignore the physical characteristics specific to infrared images (such as thermal radiation regularities), which leaves models insufficiently robust and vulnerable to adversarial attacks and common corruptions. Method: Drawing on infrared physical knowledge, the authors theoretically model the relative thermal radiation relations between classes (represented by the rank order of gray values), quantify their stability, and embed this knowledge into the adversarial training process to build the Knowledge-Guided Adversarial Training (KGAT) framework. Result: Experiments on three infrared datasets and six mainstream infrared detection models show that KGAT improves both clean accuracy and robustness against adversarial attacks and common corruptions. Conclusion: Introducing interpretable, stable infrared physical priors (thermal radiation relations) strengthens the generalization and robustness of detectors and offers a new direction for reliable infrared perception in complex environments. Abstract: In complex environments, infrared object detection exhibits broad applicability and stability across diverse scenarios. However, infrared object detection is vulnerable to both common corruptions and adversarial examples, leading to potential security risks. To improve the robustness of infrared object detection, current methods mostly adopt a data-driven ideology, which only superficially drives the network to fit the training data without specifically considering the unique characteristics of infrared images, resulting in limited robustness. In this paper, we revisit infrared physical knowledge and find that relative thermal radiation relations between different classes can be regarded as a reliable knowledge source under the complex scenarios of adversarial examples and common corruptions. Thus, we theoretically model thermal radiation relations based on the rank order of gray values for different classes, and further quantify the stability of various inter-class thermal radiation relations. Based on the above theoretical framework, we propose Knowledge-Guided Adversarial Training (KGAT) for infrared object detection, in which infrared physical knowledge is embedded into the adversarial training process, and the predicted results are optimized to be consistent with the actual physical laws. Extensive experiments on three infrared datasets and six mainstream infrared object detection models demonstrate that KGAT effectively enhances both clean accuracy and robustness against adversarial attacks and common corruptions.

[128] AG-EgoPose: Leveraging Action-Guided Motion and Kinematic Joint Encoding for Egocentric 3D Pose Estimation

Md Mushfiqur Azam,John Quarles,Kevin Desai

Main category: cs.CV

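TL;DR: This paper proposes AG-EgoPose, a dual-stream framework that fuses short- and long-range motion context with fine-grained spatial cues for egocentric 3D human pose estimation from fisheye camera input, with clear performance gains.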

Details Motivation: Egocentric viewpoints involve severe perspective distortion, limited body visibility, and complex camera motion, and existing methods struggle to use the rich motion context in video. Method: AG-EgoPose is a dual-stream framework: the spatial stream uses a weight-sharing ResNet-18 to generate 2D joint heatmaps and spatial feature tokens; the temporal stream uses a ResNet-50 followed by an action recognition backbone to extract motion dynamics; the two are fused and refined in a transformer decoder with learnable joint tokens. Result: The method reaches state-of-the-art performance on real-world datasets in both quantitative and qualitative metrics. Conclusion: By jointly modeling spatio-temporal information and anatomical constraints, AG-EgoPose delivers robust and accurate egocentric 3D pose estimation. Abstract: Egocentric 3D human pose estimation remains challenging due to severe perspective distortion, limited body visibility, and complex camera motion inherent in first-person viewpoints. Existing methods typically rely on single-frame analysis or limited temporal fusion, which fails to effectively leverage the rich motion context available in egocentric videos. We introduce AG-EgoPose, a novel dual-stream framework that integrates short- and long-range motion context with fine-grained spatial cues for robust pose estimation from fisheye camera input. Our framework features two parallel streams: a spatial stream uses a weight-sharing ResNet-18 encoder-decoder to generate 2D joint heatmaps and corresponding joint-specific spatial feature tokens. Simultaneously, a temporal stream uses a ResNet-50 backbone to extract visual features, which are then processed by an action recognition backbone to capture the motion dynamics. These complementary representations are fused and refined in a transformer decoder with learnable joint tokens, which allows for the joint-level integration of spatial and temporal evidence while maintaining anatomical constraints. Experiments on real-world datasets demonstrate that AG-EgoPose achieves state-of-the-art performance in both quantitative and qualitative metrics. Code is available at: https://github.com/Mushfiq5647/AG-EgoPose.

[129] VolDiT: Controllable Volumetric Medical Image Synthesis with Diffusion Transformers

Marvin Seyfarth,Salman Ul Hassan Dar,Yannik Frisch,Philipp Wild,Norbert Frey,Florian André,Sandy Engelhardt

Main category: cs.CV

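TL;DR: This paper proposes VolDiT, the first purely transformer-based 3D diffusion model for volumetric medical image synthesis, which uses volumetric patch embeddings, global self-attention, and a timestep-gated control adapter to achieve high-fidelity, controllable, and globally coherent generation.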

Details Motivation: Most existing 3D medical image generation methods are based on convolutional U-Nets, which are limited in receptive field, global modeling, and flexible conditioning; a purely transformer-based diffusion architecture suited to 3D volumes is worth exploring. Method: VolDiT 1) uses 3D volumetric patch embeddings to map the raw volume to a sequence of 3D tokens; 2) applies global self-attention in every layer to model long-range dependencies; 3) introduces a timestep-gated control adapter that encodes segmentation masks as learnable control tokens and injects them into each transformer layer for precise spatial guidance. Result: On high-resolution 3D medical image synthesis, VolDiT shows better global coherence, generative fidelity, and controllability than state-of-the-art 3D latent diffusion U-Net models. Conclusion: A purely transformer-based diffusion model (VolDiT) offers a more flexible, scalable, and high-performing paradigm for volumetric medical image synthesis and shows its potential to replace conventional convolutional backbones. Abstract: Diffusion models have become a leading approach for high-fidelity medical image synthesis. However, most existing methods for 3D medical image generation rely on convolutional U-Net backbones within latent diffusion frameworks. While effective, these architectures impose strong locality biases and limited receptive fields, which may constrain scalability, global context integration, and flexible conditioning. In this work, we introduce VolDiT, the first purely transformer-based 3D Diffusion Transformer for volumetric medical image synthesis. Our approach extends diffusion transformers to native 3D data through volumetric patch embeddings and global self-attention operating directly over 3D tokens. To enable structured control, we propose a timestep-gated control adapter that maps segmentation masks into learnable control tokens that modulate transformer layers during denoising. This token-level conditioning mechanism allows precise spatial guidance while preserving the modeling advantages of transformer architectures. We evaluate our model on high-resolution 3D medical image synthesis tasks and compare it to state-of-the-art 3D latent diffusion models based on U-Nets. Results demonstrate improved global coherence, superior generative fidelity, and enhanced controllability. Our findings suggest that fully transformer-based diffusion models provide a flexible foundation for volumetric medical image synthesis. The code and models trained on public data are available at https://github.com/Cardio-AI/voldit.
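The step of turning a volume into 3D tokens can be sketched with plain NumPy. This is a generic volumetric patch embedding under assumptions, not VolDiT's actual layer: the volume is single-channel, each dimension is divisible by the patch size, and the projection weights are random placeholders where a trained model would use learned parameters. The names `patchify_3d` and `embed_patches` are hypothetical.

```python
import numpy as np

def patchify_3d(volume, patch):
    """Split a (D, H, W) volume into non-overlapping cubes and flatten each into one token."""
    d, h, w = volume.shape
    if d % patch or h % patch or w % patch:
        raise ValueError("each volume dimension must be divisible by the patch size")
    x = volume.reshape(d // patch, patch, h // patch, patch, w // patch, patch)
    x = x.transpose(0, 2, 4, 1, 3, 5)      # bring the three grid axes to the front
    return x.reshape(-1, patch ** 3)       # (num_tokens, patch^3)

def embed_patches(tokens, dim, rng=None):
    """Project flattened patches to `dim` features with placeholder (random) weights."""
    rng = np.random.default_rng(0) if rng is None else rng
    weight = rng.normal(scale=0.02, size=(tokens.shape[1], dim))
    return tokens @ weight                 # (num_tokens, dim)
```

An 8x8x8 volume with patch size 4 yields 8 tokens of 64 voxels each; global self-attention then operates over this token sequence, which is why the token count (and hence the patch size) dominates the compute cost for high-resolution volumes.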

[130] AnyID: Ultra-Fidelity Universal Identity-Preserving Video Generation from Any Visual References

Jiahao Wang,Hualian Sheng,Sijia Cai,Yuxiao Yang,Weizhan Zhang,Caixia Yan,Bing Deng,Jieping Ye

Main category: cs.CV

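TL;DR: This paper proposes AnyID, a framework that achieves high-fidelity identity-preserving video generation through a scalable omni-referenced architecture and a primary-referenced generation paradigm, and is fine-tuned with reinforcement learning to improve identity fidelity and prompt controllability.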

Details Motivation: Existing methods are typically designed and optimized for a single identity reference, which limits creative flexibility; relying on a single reference also makes identity reconstruction inherently ill-posed, so identities are hard to reproduce faithfully in new scenes. Method: The AnyID framework comprises: 1) a scalable omni-referenced architecture that handles heterogeneous identity inputs (e.g., faces, portraits, videos) in a unified way; 2) a primary-referenced generation paradigm that designates one reference as the canonical anchor and introduces a novel differential prompt for precise attribute-level control. Training uses a large-scale curated dataset, followed by a final fine-tuning stage with reinforcement learning on a preference dataset built from human evaluations. Result: Experiments show that AnyID achieves ultra-high identity fidelity and superior attribute-level controllability across different task settings. Conclusion: AnyID effectively resolves the tension between multi-source identity inputs and identity fidelity, offering a more flexible and robust paradigm for identity-preserving video generation. Abstract: Identity-preserving video generation offers powerful tools for creative expression, allowing users to customize videos featuring their beloved characters. However, prevailing methods are typically designed and optimized for a single identity reference. This underlying assumption restricts creative flexibility by inadequately accommodating diverse real-world input formats. Relying on a single source also constitutes an ill-posed scenario, causing an inherently ambiguous setting that makes it difficult for the model to faithfully reproduce an identity across novel contexts. To address these issues, we present AnyID, an ultra-fidelity identity-preservation video generation framework that features two core contributions. First, we introduce a scalable omni-referenced architecture that effectively unifies heterogeneous identity inputs (e.g., faces, portraits, and videos) into a cohesive representation. Second, we propose a primary-referenced generation paradigm, which designates one reference as a canonical anchor and uses a novel differential prompt to enable precise, attribute-level controllability. We conduct training on a large-scale, meticulously curated dataset to ensure robustness and high fidelity, and then perform a final fine-tuning stage using reinforcement learning. This process leverages a preference dataset constructed from human evaluations, where annotators performed pairwise comparisons of videos based on two key criteria: identity fidelity and prompt controllability.
Extensive evaluations validate that AnyID achieves ultra-high identity fidelity as well as superior attribute-level controllability across different task settings.

[131] CardioDiT: Latent Diffusion Transformers for 4D Cardiac MRI Synthesis

Marvin Seyfarth,Sarah Kaye Müller,Arman Ghanaat,Isabelle Ayx,Fabian Fastenrath,Philipp Wild,Alexander Hertel,Theano Papavassiliu,Salman Ul Hassan Dar,Sandy Engelhardt

Main category: cs.CV

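TL;DR: This paper proposes CardioDiT, a fully 4D latent diffusion framework based on diffusion transformers for short-axis cine cardiac MRI (CMR) synthesis. By jointly modeling the 3D spatial and temporal dimensions, it avoids the structural biases introduced by space-time factorization in conventional methods and markedly improves the spatiotemporal consistency and physiological plausibility of the generated images.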

Details Motivation: Existing generative models struggle to directly model 4D data with strong temporal synchronization such as cine cardiac MRI (CMR); the commonly used space-time factorization or auxiliary constraints tend to introduce structural biases that cause spatiotemporal discontinuities or unrealistic dynamics. Method: CardioDiT uses a spatiotemporal VQ-VAE to compress 2D+t slices into compact latents, and a diffusion transformer to model complete 3D+t volumes jointly and end to end, coupling space and time throughout. Result: Evaluated on public and private CMR datasets, CardioDiT markedly improves inter-slice consistency, temporal motion coherence, and the realism of cardiac function distributions compared with baselines. Conclusion: Explicit 4D modeling combined with a diffusion transformer provides a more principled and unified foundation for spatiotemporal cardiac image synthesis. Abstract: Latent diffusion models (LDMs) have recently achieved strong performance in 3D medical image synthesis. However, modalities like cine cardiac MRI (CMR), representing a temporally synchronized 3D volume across the cardiac cycle, add an additional dimension that most generative approaches do not model directly. Instead, they factorize space and time or enforce temporal consistency through auxiliary mechanisms such as anatomical masks. Such strategies introduce structural biases that may limit global context integration and lead to subtle spatiotemporal discontinuities or physiologically inconsistent cardiac dynamics. We investigate whether a unified 4D generative model can learn continuous cardiac dynamics without architectural factorization. We propose CardioDiT, a fully 4D latent diffusion framework for short-axis cine CMR synthesis based on diffusion transformers. A spatiotemporal VQ-VAE encodes 2D+t slices into compact latents, which a diffusion transformer then models jointly as complete 3D+t volumes, coupling space and time throughout the generative process. We evaluate CardioDiT on public CMR datasets and a larger private cohort, comparing it to baselines with progressively stronger spatiotemporal coupling. Results show improved inter-slice consistency, temporally coherent motion, and realistic cardiac function distributions, suggesting that explicit 4D modeling with a diffusion transformer provides a principled foundation for spatiotemporal cardiac image synthesis. Code and models trained on public data are available at https://github.com/Cardio-AI/cardiodit.

[132] TacSIm: A Dataset and Benchmark for Football Tactical Style Imitation

Peng Wen,Yuting Wang,Qiurui Wang

Main category: cs.CV

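TL;DR: This paper presents TacSIm, a large-scale dataset and benchmark for football tactical style imitation, which aims to accurately reproduce real teams' tactical behavior from broadcast video rather than only optimize reward objectives. It supports evaluation by spatial occupancy and movement vector similarity, and assesses full-team behavior quantitatively and visually in a unified virtual environment.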

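Details Motivation: Existing football imitation research mostly focuses on reward-oriented objectives (such as goals scored or win rate) and neglects accurate reproduction of real teams' tactical behavior. Method: The authors build the TacSIm dataset and benchmark by extracting the starting positions and actions of all 22 players from single-view broadcast video of Premier League matches and mapping them to a standard pitch coordinate system; define an evaluation protocol for tactical style imitation based on spatial occupancy similarity and movement vector similarity; and run multiple baseline methods in a unified virtual environment to generate full-team behavior. Result: TacSIm provides the first explicit task definition for tactical style imitation, standardized evaluation metrics, and a reproducible virtual experimental environment, supporting quantitative and visual analysis of tactical coordination in space and time. Conclusion: TacSIm establishes a rigorous end-to-end benchmark from broadcast video to simulation, moving football AI from outcome optimization toward style-consistent modeling of tactical behavior. Abstract: Current football imitation research primarily aims to optimize reward-based objectives, such as goals scored or win rate proxies, paying less attention to accurately replicating real-world team tactical behaviors. We introduce TacSIm, a large-scale dataset and benchmark for Tactical Style Imitation in football. TacSIm imitates the actions of all 11 players in one team in the given broadcast footage of Premier League matches under a single broadcast view. Given offensive or defensive broadcast footage, TacSIm projects the beginning positions and actions of all 22 players from both sides onto a standard pitch coordinate system. TacSIm offers an explicit style imitation task and evaluation protocols. Tactical style imitation is measured using spatial occupancy similarity and movement vector similarity within a defined time window, supporting the evaluation of spatial and temporal similarities for one team. We run multiple baseline methods in a unified virtual environment to generate full-team behaviors, enabling both quantitative and visual assessment of tactical coordination. By using unified data and metrics from broadcast to simulation, TacSIm establishes a rigorous benchmark for measuring and modeling the style-aligned tactical imitation task in football.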

[133] CIV-DG: Conditional Instrumental Variables for Domain Generalization in Medical Imaging

Shaojin Bai,Yuting Su,Weizhi Nie

Main category: cs.CV

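TL;DR: This paper proposes CIV-DG, a framework that uses Conditional Instrumental Variables to address cross-site generalization failures in medical AI caused by selection bias. A DeepGMM architecture disentangles pathological semantics from scanner artifacts, markedly improving robustness on Camelyon17 and chest X-ray datasets.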

Details Motivation: Cross-site generalization in medical AI is limited by selection bias that arises when patient demographics non-randomly determine hospital assignment; conventional domain generalization methods cannot remove the spurious correlations between site-specific artifacts and diagnostic labels. Method: CIV-DG is a causal framework based on conditional instrumental variables (CIV) that relaxes the strong random-assignment assumption of standard IV methods. It is combined with a Deep Generalized Method of Moments (DeepGMM) architecture and introduces a conditional critic that minimizes moment violations within demographic strata and enforces orthogonality between the instrument and the error. Result: Experiments on Camelyon17 and large-scale chest X-ray datasets show that CIV-DG significantly outperforms leading existing methods. Conclusion: Conditional causal mechanisms can effectively mitigate structural confounding and improve the cross-site robustness of medical AI models. Abstract: Cross-site generalizability in medical AI is fundamentally compromised by selection bias, a structural mechanism where patient demographics (e.g., age, severity) non-randomly dictate hospital assignment. Conventional Domain Generalization (DG) paradigms, which predominantly target image-level distribution shifts, fail to address the resulting spurious correlations between site-specific variations and diagnostic labels. To surmount this identifiability barrier, we propose CIV-DG, a causal framework that leverages Conditional Instrumental Variables to disentangle pathological semantics from scanner-induced artifacts. By relaxing the strict random assignment assumption of standard IV methods, CIV-DG accommodates complex clinical scenarios where hospital selection is endogenously driven by patient demographics. We instantiate this theory via a Deep Generalized Method of Moments (DeepGMM) architecture, employing a conditional critic to minimize moment violations and enforce instrument-error orthogonality within demographic strata. Extensive experiments on the Camelyon17 benchmark and large-scale Chest X-Ray datasets demonstrate that CIV-DG significantly outperforms leading baselines, validating the efficacy of conditional causal mechanisms in resolving structural confounding for robust medical AI.

[134] Free-Lunch Long Video Generation via Layer-Adaptive O.O.D Correction

Jiahao Tian,Chenxi Song,Wei Cheng,Chi Zhang

Main category: cs.CV

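TL;DR: This paper proposes FreeLOC, a framework that uses Video-based Relative Position Re-encoding (VRPR) and Tiered Sparse Attention (TSA) to address the frame-level relative position and context-length out-of-distribution problems that arise when pre-trained video diffusion models generate long videos. It requires no additional training and markedly improves the quality and temporal consistency of long-video generation.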

Details Motivation: When pre-trained video diffusion models generate long videos, out-of-distribution (O.O.D) problems in frame-level relative position and context length cause severe degradation in visual quality. Method: FreeLOC is a training-free, layer-adaptive framework comprising: 1) Video-based Relative Position Re-encoding (VRPR), which hierarchically re-encodes temporal relative positions to align with the pre-trained distribution; 2) Tiered Sparse Attention (TSA), which regulates attention density across temporal scales to retain both local detail and long-range dependencies; 3) a layer-adaptive probing mechanism that identifies each transformer layer's sensitivity to the O.O.D problems and applies the above techniques selectively. Result: FreeLOC significantly outperforms existing training-free methods on multiple benchmarks, reaching state-of-the-art temporal consistency and visual quality for long-video generation. Conclusion: FreeLOC is an efficient, plug-and-play inference-time framework that effectively mitigates key out-of-distribution challenges in long-video generation and offers a new approach to long-sequence modeling with diffusion models. Abstract: Generating long videos using pre-trained video diffusion models, which are typically trained on short clips, presents a significant challenge. Directly applying these models for long-video inference often leads to a notable degradation in visual quality. This paper identifies that this issue primarily stems from two out-of-distribution (O.O.D) problems: frame-level relative position O.O.D and context-length O.O.D. To address these challenges, we propose FreeLOC, a novel training-free, layer-adaptive framework that introduces two core techniques: Video-based Relative Position Re-encoding (VRPR) for frame-level relative position O.O.D, a multi-granularity strategy that hierarchically re-encodes temporal relative positions to align with the model's pre-trained distribution, and Tiered Sparse Attention (TSA) for context-length O.O.D, which preserves both local detail and long-range dependencies by structuring attention density across different temporal scales. Crucially, we introduce a layer-adaptive probing mechanism that identifies the sensitivity of each transformer layer to these O.O.D issues, allowing for the selective and efficient application of our methods. Extensive experiments demonstrate that our approach significantly outperforms existing training-free methods, achieving state-of-the-art results in both temporal consistency and visual quality. Code is available at https://github.com/Westlake-AGI-Lab/FreeLOC.

[135] SDD-YOLO: A Small-Target Detection Framework for Ground-to-Air Anti-UAV Surveillance with Edge-Efficient Deployment

Pengyu Chen,Haotian Sa,Yiwei Hu,Yuhan Cheng,Junbo Wang

Main category: cs.CV

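TL;DR: This paper proposes SDD-YOLO, a small-target detection framework for ground-to-air (G2A) anti-UAV surveillance. With a high-resolution P2 detection head, a DFL-free and NMS-free architecture, and the MuSGD hybrid training strategy, it reaches 86.0% mAP@0.5 on the self-built large-scale G2A dataset DroneSOD-30K while offering high frame rates and suitability for edge deployment.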

Details Motivation: Detecting small UAVs from a ground viewpoint is challenging because of extremely low pixel occupancy, cluttered sky backgrounds, and strict real-time requirements; existing YOLO models lack sufficient feature resolution for sub-pixel targets and are complex to deploy. Method: SDD-YOLO introduces a P2 high-resolution detection head (4× downsampling) to capture fine details of tiny targets; adopts the lightweight DFL-free, NMS-free architecture of YOLO26; and uses the MuSGD hybrid training strategy (with ProgLoss and STAL) to mitigate gradient oscillation caused by sparse small-target signals. The authors also build DroneSOD-30K, a large-scale G2A dataset of about 30,000 annotated images. Result: SDD-YOLO-n reaches 86.0% mAP@0.5 on DroneSOD-30K, 7.8 percentage points above YOLOv5n, with inference speeds of 226 FPS (RTX 5090) and 35 FPS (Xeon CPU). Conclusion: SDD-YOLO strikes a good balance between accuracy and efficiency, markedly improves small-UAV detection in G2A scenarios, and has strong potential for edge deployment. Abstract: Detecting small unmanned aerial vehicles (UAVs) from a ground-to-air (G2A) perspective presents significant challenges, including extremely low pixel occupancy, cluttered aerial backgrounds, and strict real-time constraints. Existing YOLO-based detectors are primarily optimized for general object detection and often lack adequate feature resolution for sub-pixel targets, while introducing complexities during deployment. In this paper, we propose SDD-YOLO, a small-target detection framework tailored for G2A anti-UAV surveillance. To capture fine-grained spatial details critical for micro-targets, SDD-YOLO introduces a P2 high-resolution detection head operating at 4 times downsampling. Furthermore, we integrate the recent architectural advancements from YOLO26, including a DFL-free, NMS-free architecture for streamlined inference, and the MuSGD hybrid training strategy with ProgLoss and STAL, which substantially mitigates gradient oscillation on sparse small-target signals. To support our evaluation, we construct DroneSOD-30K, a large-scale G2A dataset comprising approximately 30,000 annotated images covering diverse meteorological conditions. Experiments demonstrate that SDD-YOLO-n achieves a mAP@0.5 of 86.0% on DroneSOD-30K, surpassing the YOLOv5n baseline by 7.8 percentage points. Extensive inference analysis shows our model attains 226 FPS on an NVIDIA RTX 5090 and 35 FPS on an Intel Xeon CPU, demonstrating exceptional efficiency for future edge deployment.
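
To see why a detection head at 4 times downsampling can help with tiny targets, the arithmetic below compares how many feature-map cells a small object spans at different strides. The 640-pixel input size and the 8-pixel target are illustrative assumptions, not values from the paper; the names P2 to P5 follow the common feature-pyramid convention of strides 4, 8, 16, and 32.

```python
def feature_map_size(input_size: int, stride: int) -> int:
    """Side length of the feature map produced at a given downsampling stride."""
    return input_size // stride


def cells_spanned(object_px: float, stride: int) -> float:
    """Approximate number of feature cells an object covers along one axis."""
    return object_px / stride


STRIDES = {"P2": 4, "P3": 8, "P4": 16, "P5": 32}  # conventional pyramid strides
INPUT_SIZE = 640   # assumed input resolution
TARGET_PX = 8      # assumed side length of a small UAV in pixels

for name, stride in STRIDES.items():
    size = feature_map_size(INPUT_SIZE, stride)
    span = cells_spanned(TARGET_PX, stride)
    print(f"{name}: {size}x{size} map, target spans {span:.2f} cells per axis")
```

Under these assumptions the target covers two cells per axis at P2 but only a fraction of a cell at P4 and P5, where its features are easily averaged away.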

[136] Training-free Detection and 6D Pose Estimation of Unseen Surgical Instruments

Jonas Hein,Lilian Calvet,Matthias Seibold,Siyu Tang,Marc Pollefeys,Philipp Fürnstahl

Main category: cs.CV

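TL;DR: This paper presents a training-free multi-view 6D pose estimation method that requires only a textured CAD model as a prior and achieves accurate detection and pose estimation of unseen surgical instruments.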

Details Motivation: Supervised methods do not generalize to new or unseen instruments and depend on large amounts of annotated data, while surgical settings involve many instrument types that are hard to annotate, so a flexible training-free solution is needed. Method: The pipeline has two stages, detection and pose estimation. Detection generates mask proposals and matches them against rendered templates using a pre-trained feature extractor, then triangulates and filters them in 3D using multi-view geometric consistency. Pose estimation iteratively refines pose hypotheses with a cross-attention mechanism and applies a multi-view, occlusion-aware contour registration for final refinement. Result: Evaluated on the real-world surgical MVPSP dataset, the method reaches millimeter-level accuracy on par with supervised methods and generalizes fully to unseen instruments. Conclusion: By combining foundation models, multi-view geometry, and contour-based refinement, the method achieves accurate, general-purpose 6D pose estimation of surgical instruments without task-specific training, supporting robust tracking and scene understanding in dynamic clinical environments. Abstract: Purpose: Accurate detection and 6D pose estimation of surgical instruments are crucial for many computer-assisted interventions. However, supervised methods lack flexibility for new or unseen tools and require extensive annotated data. This work introduces a training-free pipeline for accurate multi-view 6D pose estimation of unseen surgical instruments, which only requires a textured CAD model as prior knowledge. Methods: Our pipeline consists of two main stages. First, for detection, we generate object mask proposals in each view and score their similarity to rendered templates using a pre-trained feature extractor. Detections are matched across views, triangulated into 3D instance candidates, and filtered using multi-view geometric consistency. Second, for pose estimation, a set of pose hypotheses is iteratively refined and scored using feature-metric scores with cross-view attention. The best hypothesis undergoes a final refinement using a novel multi-view, occlusion-aware contour registration, which minimizes reprojection errors of unoccluded contour points. Results: The proposed method was rigorously evaluated on real-world surgical data from the MVPSP dataset. The method achieves millimeter-accurate pose estimates that are on par with supervised methods under controlled conditions, while maintaining full generalization to unseen instruments. These results demonstrate the feasibility of training-free, marker-less detection and tracking in surgical scenes, and highlight the unique challenges in surgical environments.
Conclusion: We present a novel and flexible pipeline that effectively combines state-of-the-art foundational models, multi-view geometry, and contour-based refinement for high-accuracy 6D pose estimation of surgical instruments without task-specific training. This approach enables robust instrument tracking and scene understanding in dynamic clinical environments.

[137] An Image Dataset of Common Skin Diseases of Bangladesh and Benchmarking Performance with Machine Learning Models

Sazzad Hossain,Saiful Islam,Muhammad Ibrahim,Md. Rasel Ahmed,Md Shuayb,Ahmedul Kabir

Main category: cs.CV

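TL;DR: This paper builds a publicly available image dataset of five skin diseases common in Bangladesh (Contact Dermatitis, Vitiligo, Eczema, Scabies, and Tinea Ringworm; 1612 images in total, including augmented ones) and evaluates the classification performance of several machine learning and deep learning models, aiming to ease the shortage of dermatology resources in primary care.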

Details Motivation: Densely populated regions such as Bangladesh lack enough dermatologists and diagnostic equipment, so skin diseases are at high risk of being missed or misdiagnosed, with serious consequences; AI-driven image recognition offers a feasible route to assisted diagnosis in primary care. Method: Original images of five skin diseases (250 in total) were collected from outpatients at Faridpur Medical College and expanded to 1612 images through data augmentation to form a public dataset, on which traditional machine learning models (e.g., SVM, RF) and deep learning models (e.g., ResNet, VGG) were systematically evaluated. Result: The work provides the first open-source image dataset focused on skin diseases common in South Asia and reports benchmark classification accuracy for several models on it (no specific figures are given in the text, but a complete experimental evaluation is indicated). Conclusion: The dataset is regionally representative and potentially applicable worldwide, and can advance research on and deployment of AI-based skin disease diagnosis in resource-constrained regions. Abstract: Skin diseases are a major public health concern worldwide, and their detection is often challenging without access to dermatological expertise. In countries like Bangladesh, which is highly populated, the number of qualified skin specialists and diagnostic instruments is insufficient to meet the demand. The lack of proper detection and treatment of skin diseases may lead to severe health consequences, including death. Skin diseases commonly change the color, texture, and pattern of the skin, and in this era of artificial intelligence and machine learning, they can be detected using image processing and computer vision techniques. In response to this challenge, we develop a publicly available dataset focused on common skin disease detection using machine learning techniques. We focus on five prevalent skin diseases in Bangladesh: Contact Dermatitis, Vitiligo, Eczema, Scabies, and Tinea Ringworm. The dataset consists of 1612 images (of which 250 are distinct while the others are augmented), collected directly from patients at the outpatient department of Faridpur Medical College, Faridpur, Bangladesh. The data comprise 302, 381, 301, 316, and 312 images of Dermatitis, Eczema, Scabies, Tinea Ringworm, and Vitiligo, respectively. Although the data are collected regionally, the selected diseases are common across many countries, especially in South Asia, making the dataset potentially valuable for global applications in machine learning-based dermatology. We also apply several machine learning and deep learning models on the dataset and report classification performance.
We expect that this research would garner attention from machine learning and deep learning researchers and practitioners working in the field of automated disease diagnosis.
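
The per-class counts quoted in the abstract can be checked against the stated total with a few lines of Python. The variable names are labels chosen here for illustration; the numbers are copied from the abstract.

```python
class_counts = {
    "Dermatitis": 302,
    "Eczema": 381,
    "Scabies": 301,
    "Tinea Ringworm": 316,
    "Vitiligo": 312,
}

total_images = sum(class_counts.values())
distinct_images = 250  # originals before augmentation, per the abstract
augmented_images = total_images - distinct_images

# Share of each class, useful for spotting class imbalance before training.
class_share = {name: count / total_images for name, count in class_counts.items()}
```

The counts sum to the stated 1612, of which 1362 are augmented. Anyone benchmarking on such a dataset would probably want to keep each original and its augmented copies on the same side of a train/test split, although the abstract does not say how the authors split the data.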

[138] A Unified Spatial Alignment Framework for Highly Transferable Transformation-Based Attacks on Spatially Structured Tasks

Jiaming Liang,Chi-Man Pun

Main category: cs.CV

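TL;DR: This paper proposes a unified Spatial Alignment Framework (SAF) that transforms inputs and labels synchronously to improve the transferability of transformation-based adversarial attacks (TAAs) on spatially structured tasks such as semantic segmentation and object detection.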

Details Motivation: Existing transformation-based adversarial attacks (TAAs) perform poorly or fail on structured tasks (such as semantic segmentation and object detection), mainly because labels are not transformed in sync with inputs under spatial transformations, which causes spatial misalignment and erroneous gradients. Method: The Spatial Alignment Framework (SAF) includes a Spatial Alignment (SA) algorithm that, whenever the attack spatially transforms the input, applies the same transformation to the corresponding spatially structured label. Result: In non-targeted attacks, SAF substantially lowers several benchmark metrics: average mIoU on Cityscapes from 24.50 to 11.34, on Kvasir-SEG from 49.91 to 31.80, and average mAP on COCO from 17.89 to 5.25. Conclusion: Spatial alignment is key to improving the transferability of TAAs on structured tasks, and SAF offers an effective, general solution for adversarial attacks on structured vision tasks. Abstract: Transformation-based adversarial attacks (TAAs) demonstrate strong transferability when deceiving classification models. However, existing TAAs often perform unsatisfactorily or even fail when applied to structured tasks such as semantic segmentation and object detection. Encouragingly, recent studies that categorize transformations into non-spatial and spatial transformations inspire us to address this challenge. We find that for non-structured tasks, labels are spatially non-structured, and thus TAAs are not required to adjust labels when applying spatial transformations. In contrast, for structured tasks, labels are spatially structured, and failing to transform labels synchronously with inputs can cause spatial misalignment and yield erroneous gradients. To address these issues, we propose a novel unified Spatial Alignment Framework (SAF) for highly transferable TAAs on spatially structured tasks, where the TAAs spatially transform labels synchronously with the input using the proposed Spatial Alignment (SA) algorithm. Extensive experiments demonstrate the crucial role of our SAF for TAAs on structured tasks. Specifically, in non-targeted attacks, our SAF degrades the average mIoU on Cityscapes from 24.50 to 11.34, and on Kvasir-SEG from 49.91 to 31.80, while reducing the average mAP of COCO from 17.89 to 5.25.
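
The core idea, transforming the label together with the input, can be sketched with a horizontal flip on a toy segmentation example. This only illustrates synchronized spatial transformation and is not the authors' SA algorithm; `hflip`, `pixel_agreement`, and the toy grids are hypothetical.

```python
from typing import List

Grid = List[List[int]]


def hflip(grid: Grid) -> Grid:
    """Mirror a 2D grid left to right."""
    return [list(reversed(row)) for row in grid]


def pixel_agreement(pred: Grid, label: Grid) -> float:
    """Fraction of pixels where the prediction matches the label."""
    total = sum(len(row) for row in label)
    same = sum(p == t for pr, lr in zip(pred, label) for p, t in zip(pr, lr))
    return same / total


# Toy image whose "object" (value 1) sits on the left; the mask marks it.
image = [[1, 1, 0, 0],
         [1, 1, 0, 0]]
mask = [[1, 1, 0, 0],
        [1, 1, 0, 0]]

flipped_image = hflip(image)

# Stand-in for a perfect segmenter: it predicts the object where it appears.
prediction = [[1 if v else 0 for v in row] for row in flipped_image]

misaligned = pixel_agreement(prediction, mask)      # label left untouched
aligned = pixel_agreement(prediction, hflip(mask))  # label flipped with input
```

With the untouched label, even a perfect prediction on the flipped input disagrees with the label everywhere, so a loss computed against it would give the attack a misleading gradient. Flipping the label with the input restores full agreement.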

[139] Efficient Preemptive Robustification with Image Sharpening

Jiaming Liang,Chi-Man Pun

Main category: cs.CV

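TL;DR: This paper proposes image sharpening as a pre-attack defense that is surrogate-free, optimization-free, generator-free, and human-interpretable, for improving the robustness of deep neural networks to adversarial examples.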

Details Motivation: Existing pre-attack defenses (such as preemptive robustification) rely on surrogate models, incur heavy computation, and lack interpretability, so a more efficient, simple, and transparent robustification strategy is needed. Method: Motivated by the positive correlation between texture intensity and image robustness, the authors use only a classical image sharpening operation as a preemptive robustification step, with no training, optimization, or additional model. Result: Experiments show that sharpening markedly improves robustness in transfer attack scenarios at very low computational cost, with results better than or comparable to existing pre-attack methods. Conclusion: Image sharpening is a simple, efficient, interpretable, and lightweight paradigm for preemptive robustification and offers a new direction for adversarial robustness research. Abstract: Despite their great success, deep neural networks rely on high-dimensional, non-robust representations, making them vulnerable to imperceptible perturbations, even in transfer scenarios. To address this, both training-time defenses (e.g., adversarial training and robust architecture design) and post-attack defenses (e.g., input purification and adversarial detection) have been extensively studied. Recently, a limited body of work has preliminarily explored a pre-attack defense paradigm, termed preemptive robustification, which introduces subtle modifications to benign samples prior to attack to proactively resist adversarial perturbations. Unfortunately, their practical applicability remains questionable due to several limitations, including (1) reliance on well-trained classifiers as surrogates to provide robustness priors, (2) substantial computational overhead arising from iterative optimization or trained generators for robustification, and (3) limited interpretability of the optimization- or generation-based robustification processes. Inspired by recent studies revealing a positive correlation between texture intensity and the robustness of benign samples, we show that image sharpening alone can efficiently robustify images. To the best of our knowledge, this is the first surrogate-free, optimization-free, generator-free, and human-interpretable robustification approach. Extensive experiments demonstrate that sharpening yields remarkable robustness gains with low computational cost, especially in transfer scenarios.
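
A classical sharpening operation of the kind the abstract refers to can be written as a 3×3 convolution. The kernel below is a common Laplacian-based sharpening kernel chosen for illustration; the abstract does not specify which sharpening operator or strength the authors use, and `sharpen` is a hypothetical helper.

```python
from typing import List

Image = List[List[float]]

# Identity plus a 4-neighbour Laplacian: boosts the centre, subtracts neighbours.
SHARPEN_KERNEL = [[0.0, -1.0, 0.0],
                  [-1.0, 5.0, -1.0],
                  [0.0, -1.0, 0.0]]


def sharpen(image: Image, lo: float = 0.0, hi: float = 255.0) -> Image:
    """Apply the kernel with edge replication and clip to [lo, hi]."""
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0.0
            for ky in range(3):
                for kx in range(3):
                    # Clamp coordinates so border pixels reuse their nearest neighbour.
                    yy = min(max(y + ky - 1, 0), height - 1)
                    xx = min(max(x + kx - 1, 0), width - 1)
                    acc += SHARPEN_KERNEL[ky][kx] * image[yy][xx]
            out[y][x] = min(max(acc, lo), hi)
    return out


flat = [[100.0] * 4 for _ in range(4)]                  # uniform region
edge = [[50.0, 50.0, 200.0, 200.0] for _ in range(4)]   # vertical step edge
sharp_flat = sharpen(flat)
sharp_edge = sharpen(edge)
```

Uniform regions pass through unchanged because the kernel weights sum to one, while the step edge gains contrast on both sides. That is the sense in which sharpening raises texture intensity without any model, training, or optimization.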

[140] FEAST: Fully Connected Expressive Attention for Spatial Transcriptomics

Taejin Jeong,Joohyeok Kim,Jinyeong Kim,Chanyoung Kim,Seong Jae Hwang

Main category: cs.CV

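TL;DR: This paper proposes FEAST, a framework that uses a fully connected graph and negative-aware attention, together with an off-grid sampling strategy, to improve gene expression prediction for spatial transcriptomics and produce biologically meaningful attention maps.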

Details Motivation: Spatial transcriptomics (ST) is expensive, which limits its adoption; existing graph neural network methods rely on pre-defined sparse graphs and struggle to capture complex biological interactions. Method: FEAST 1) models the tissue as a fully connected graph to capture interactions between all spot pairs; 2) introduces negative-aware attention to model both excitatory and inhibitory interactions; 3) adopts an off-grid sampling strategy that extracts images from intermediate regions to enrich morphological context. Result: On public ST datasets, FEAST surpasses state-of-the-art methods in gene expression prediction and produces interpretable, biologically plausible positive and negative attention maps. Conclusion: Through more complete structural modeling and biologically inspired attention design, FEAST improves the accuracy and interpretability of spatial gene expression inference. Abstract: Spatial Transcriptomics (ST) provides spatially-resolved gene expression, offering crucial insights into tissue architecture and complex diseases. However, its prohibitive cost limits widespread adoption, leading to significant attention on inferring spatial gene expression from readily available whole slide images. While graph neural networks have been proposed to model interactions between tissue regions, their reliance on pre-defined sparse graphs prevents them from considering potentially interacting spot pairs, resulting in a structural limitation in capturing complex biological relationships. To address this, we propose FEAST (Fully connected Expressive Attention for Spatial Transcriptomics), an attention-based framework that models the tissue as a fully connected graph, enabling the consideration of all pairwise interactions. To better reflect biological interactions, we introduce negative-aware attention, which models both excitatory and inhibitory interactions, capturing essential negative relationships that standard attention often overlooks. Furthermore, to mitigate the information loss from truncated or ignored context in standard spot image extraction, we introduce an off-grid sampling strategy that gathers additional images from intermediate regions, allowing the model to capture a richer morphological context. Experiments on public ST datasets show that FEAST surpasses state-of-the-art methods in gene expression prediction while providing biologically plausible attention maps that clarify positive and negative interactions. Our code is available at https://github.com/starforTJ/FEAST.

[141] Semantic-Aware Prefix Learning for Token-Efficient Image Generation

Qingfeng Li,Haoxian Zhang,Xu He,Songlin Tang,Zhixue Fang,Xiaoqiang Liu,Pengfei Wan,Guoqi Li

Main category: cs.CV

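TL;DR: This paper proposes SMAP, a semantic-aware prefix tokenizer that makes semantic information indispensable during visual tokenization by introducing class-level semantic conditions and a tail token dropping strategy. Paired with the hybrid generator CARD, it shows that the semantically grounded latent space is effective for generation under compact token budgets.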

Details Motivation: Existing visual tokenizers are mostly trained with reconstruction objectives, which leaves latent representations semantically weak; some methods strengthen semantic alignment, but the semantic signal remains an auxiliary regularizer rather than a necessary component of representation learning. Method: SMAP injects class-level semantic conditions into a query-based 1D tokenization framework and uses a tail token dropping strategy that forces the semantic conditions and early latents to carry more of the representation under progressively reduced token budgets; CARD, a hybrid causal autoregressive-diffusion generator, is proposed to verify generation capability. Result: Experiments on ImageNet show that SMAP improves reconstruction quality in both discrete and continuous tokenization settings, and that its semantically aligned latent space yields strong downstream generation performance under compact token budgets. Conclusion: Semantic information can be designed as a core driver of tokenization rather than an auxiliary constraint; SMAP makes semantics necessary through a structural mechanism and markedly improves the usefulness of the latent space for generation. Abstract: Visual tokenizers play a central role in latent image generation by bridging high-dimensional images and tractable generative modeling. However, most existing tokenizers are still trained with reconstruction-dominated objectives, which often yield latent representations that are only weakly grounded in high-level semantics. Recent approaches improve semantic alignment, but typically treat semantic signals as auxiliary regularization rather than making them functionally necessary for representation learning. We propose SMAP, a SeMantic-Aware Prefix tokenizer that injects class-level semantic conditions into a query-based 1D tokenization framework. To make semantics indispensable during training, SMAP introduces a tail token dropping strategy, which forces semantic conditions and early latent prefixes to bear increasing responsibility under progressively reduced token budgets. To verify that the resulting latent space is useful for generation rather than reconstruction alone, we further introduce CARD, a hybrid Causal AutoRegressive-Diffusion generator. Extensive experiments on ImageNet show that SMAP consistently improves reconstruction quality across discrete and continuous tokenization settings, and that its semantically grounded latent space yields strong downstream generation performance under compact token budgets.
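
One plausible reading of tail token dropping is that, at each training step, a prefix length is sampled and every latent token after it is discarded, so the decoder must reconstruct from the semantic condition plus a short prefix. The sketch below follows that reading only; the uniform sampling scheme, the function name `drop_tail_tokens`, and the `min_keep` parameter are assumptions and are not taken from the abstract.

```python
import random
from typing import List, Sequence, Tuple


def drop_tail_tokens(
    tokens: Sequence[int],
    rng: random.Random,
    min_keep: int = 1,
) -> Tuple[List[int], int]:
    """Keep a random-length prefix of the 1D token sequence and drop the rest."""
    if not 1 <= min_keep <= len(tokens):
        raise ValueError("min_keep must be between 1 and len(tokens)")
    keep = rng.randint(min_keep, len(tokens))  # inclusive on both ends
    return list(tokens[:keep]), keep


rng = random.Random(0)            # fixed seed so the sketch is repeatable
latent = list(range(32))          # stand-in for 32 latent token ids
prefixes = [drop_tail_tokens(latent, rng, min_keep=4) for _ in range(5)]
```

Because only the tail is ever removed, early tokens are present in every training sample and later tokens only in some, which pushes the most important information toward the front of the sequence.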

[142] Activation Matters: Test-time Activated Negative Labels for OOD Detection with Vision-Language Models

Yabin Zhang,Maya Varma,Yunhe Gao,Jean-Benoit Delbrouck,Jiaming Liu,Chong Wang,Curtis Langlotz

Main category: cs.CV

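TL;DR: This paper proposes TANL, a training-free, test-efficient, and theoretically grounded OOD detection method that dynamically mines negative labels with strong responses to OOD samples at test time, markedly improving detection performance; on ImageNet in particular, FPR95 drops substantially.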

Details Motivation: Existing OOD detection methods based on fixed negative labels struggle to capture OOD characteristics because the negative labels are only weakly activated by OOD samples. Method: Test-time Activated Negative Labels (TANL) identifies high-confidence test samples online and accumulates their assignment probabilities over the corpus to build a label activation metric; a batch-adaptive variant is further introduced, along with an activation-aware score function that increases the contribution of highly activated negative labels. Result: TANL is effective across diverse backbones and task settings; on ImageNet, FPR95 drops from 17.5% to 9.8%. Conclusion: TANL is a training-free, test-efficient, and theoretically supported approach to OOD detection that improves accuracy and robustness by dynamically selecting strongly activated negative labels in a distribution-adaptive way. Abstract: Out-of-distribution (OOD) detection aims to identify samples that deviate from in-distribution (ID). One popular pipeline addresses this by introducing negative labels distant from ID classes and detecting OOD based on their distance to these labels. However, such labels may present poor activation on OOD samples, failing to capture the OOD characteristics. To address this, we propose Test-time Activated Negative Labels (TANL) by dynamically evaluating activation levels across the corpus dataset and mining candidate labels with high activation responses during the testing process. Specifically, TANL identifies high-confidence test images online and accumulates their assignment probabilities over the corpus to construct a label activation metric. Such a metric leverages historical test samples to adaptively align with the test distribution, enabling the selection of distribution-adaptive activated negative labels. By further exploring the activation information within the current testing batch, we introduce a more fine-grained, batch-adaptive variant. To fully utilize label activation knowledge, we propose an activation-aware score function that emphasizes negative labels with stronger activations, boosting performance and enhancing its robustness to the label number. Our TANL is training-free, test-efficient, and grounded in theoretical justification. Experiments on diverse backbones and wide task settings validate its effectiveness. Notably, on the large-scale ImageNet benchmark, TANL significantly reduces the FPR95 from 17.5% to 9.8%.
Code is available at https://github.com/YBZh/OpenOOD-VLM.

[143] Hyperspectral Trajectory Image for Multi-Month Trajectory Anomaly Detection

Md Awsafur Rahman,Chandrakanth Gudavalli,Hardik Prajapati,B. S. Manjunath

Main category: cs.CV

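TL;DR: This paper proposes TITAnD, which recasts trajectory anomaly detection as a vision problem. It builds a Hyperspectral Trajectory Image (HTI) that unifies dense and sparse trajectory representations and designs a Cyclic Factorized Transformer (CFT) to model the periodic structure of human behavior, enabling efficient anomaly detection on multi-month dense GPS trajectories for the first time while maintaining high accuracy.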

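Details Motivation: Existing dense GPS methods are computationally expensive (quadratic cost) and cannot support multi-month analysis; sparse stay-point methods scale but lose fine-grained information, which forces separate architectures and prevents knowledge transfer. The authors argue that this bottleneck is unnecessary because human trajectories have a natural two-dimensional cyclic structure within and across days. Method: The TITAnD framework 1) represents a trajectory as a Hyperspectral Trajectory Image (HTI), a day × time-of-day grid whose channels encode spatial, semantic, temporal, and kinematic information; 2) casts agent-level detection as image classification and temporal localization as semantic segmentation; 3) introduces the Cyclic Factorized Transformer (CFT), which factorizes attention along the two temporal axes, embedding a cyclic prior and greatly reducing computation. Result: TITAnD achieves the best AUC-PR on both sparse and dense benchmarks, outperforms vision models such as UNet, is 11-75x faster than a standard Transformer with comparable memory, and enables anomaly detection on multi-month dense GPS trajectories for the first time. Conclusion: Recasting trajectory anomaly detection as a vision problem and combining it with structure-aware modeling (such as a cyclic prior) is key to achieving both accuracy and efficiency; TITAnD offers a unified, scalable paradigm for trajectory analysis across scales. Abstract: Trajectory anomaly detection underpins applications from fraud detection to urban mobility analysis. Dense GPS methods preserve fine-grained evidence such as abnormal speeds and short-duration events, but their quadratic cost makes multi-month analysis intractable; consequently, no existing approach detects anomalies over multi-month dense GPS trajectories. The field instead relies on scalable sparse stay-point methods that discard this evidence, forcing separate architectures for each regime and preventing knowledge transfer. We argue this bottleneck is unnecessary: human trajectories, dense or sparse, share a natural two-dimensional cyclic structure along within-day and across-day axes. We therefore propose TITAnD (Trajectory Image Transformer for Anomaly Detection), which reformulates trajectory anomaly detection as a vision problem by representing trajectories as a Hyperspectral Trajectory Image (HTI): a day × time-of-day grid whose channels encode spatial, semantic, temporal, and kinematic information from either modality, unifying both under a single representation. Under this formulation, agent-level detection reduces to image classification and temporal localization to semantic segmentation.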
To model this representation, we introduce the Cyclic Factorized Transformer (CFT), which factorizes attention along the two temporal axes, encoding the cyclic inductive bias of human routines, while reducing attention cost by orders of magnitude and enabling dense multi-month anomaly detection for the first time. Empirically, TITAnD achieves the best AUC-PR across sparse and dense benchmarks, surpassing vision models like UNet while being 11-75x faster than the Transformer with comparable memory, demonstrating that vision reformulation and structure-aware modeling are jointly essential. Code will be made public soon.
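
The cost argument behind factorizing attention along the two temporal axes can be made concrete by counting query-key pairs. The grid size below (90 days at 5-minute resolution) is an assumed example, and `hti_cell` is a hypothetical helper showing how a timestamp could map to a cell of a day by time-of-day grid; the paper's actual resolution and attention layout are not given in the abstract.

```python
from datetime import datetime


def hti_cell(ts: datetime, start: datetime, slot_minutes: int) -> tuple:
    """Map a timestamp to (day index, time-of-day slot) in the trajectory image."""
    day = (ts.date() - start.date()).days
    slot = (ts.hour * 60 + ts.minute) // slot_minutes
    return day, slot


def full_attention_pairs(days: int, slots: int) -> int:
    """Every cell attends to every cell: (D*T)^2 pairs."""
    return (days * slots) ** 2


def factorized_attention_pairs(days: int, slots: int) -> int:
    """Each cell attends within its day (T cells) and across days at its slot (D cells)."""
    return days * slots * (days + slots)


DAYS, SLOTS = 90, 288  # assumed: about three months, 5-minute slots
full = full_attention_pairs(DAYS, SLOTS)
factored = factorized_attention_pairs(DAYS, SLOTS)
reduction = full / factored
```

In this simplified count the number of pairs drops by a factor of D·T/(D+T), which grows as more days are added. This is a back-of-the-envelope model of attention cost only and is not meant to reproduce the speedups reported in the abstract.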

[144] Towards Practical Lossless Neural Compression for LiDAR Point Clouds

Pengpeng Yu,Haoran Li,Runqing Jiang,Dingquan Li,Jing Wang,Liang Lin,Yulan Guo

Main category: cs.CV

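TL;DR: This paper proposes an efficient lossless predictive coding framework for LiDAR point clouds, consisting of a Geometry Re-Densification Module and a Cross-scale Feature Propagation Module, plus an integer-only inference pipeline that guarantees bit-exact cross-platform consistency, achieving competitive compression performance at real-time speed.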

Details Motivation: The geometric details of LiDAR point clouds are highly sparse, which makes context modeling inefficient and limits the speed and performance of existing compression methods. Method: Two lightweight modules are proposed: 1) a Geometry Re-Densification Module that achieves efficient prediction through iterative densification, feature extraction, and sparsification; 2) a Cross-scale Feature Propagation Module that uses multi-resolution occupancy cues to guide hierarchical feature propagation. An integer-only inference pipeline ensures bit-exact consistency. Result: The method achieves competitive lossless compression performance at real-time coding speed and avoids the entropy-coding collapse seen in existing neural compression methods. Conclusion: The compact representation framework balances computational efficiency and compression performance, offering a new approach to real-time lossless compression of high-precision LiDAR point clouds. Abstract: LiDAR point clouds are fundamental to various applications, yet the extreme sparsity of high-precision geometric details hinders efficient context modeling, thereby limiting the compression speed and performance of existing methods. To address this challenge, we propose a compact representation for efficient predictive lossless coding. Our framework comprises two lightweight modules. First, the Geometry Re-Densification Module iteratively densifies encoded sparse geometry, extracts features at a dense scale, and then sparsifies the features for predictive coding. This module avoids costly computation on highly sparse details while maintaining a lightweight prediction head. Second, the Cross-scale Feature Propagation Module leverages occupancy cues from multiple resolution levels to guide hierarchical feature propagation, enabling information sharing across scales and reducing redundant feature extraction. Additionally, we introduce an integer-only inference pipeline to enable bit-exact cross-platform consistency, which avoids the entropy-coding collapse observed in existing neural compression methods and further accelerates coding. Experiments demonstrate competitive compression performance at real-time speed. Code will be released upon acceptance. Code is available at https://github.com/pengpeng-yu/FastPCC.

[145] ViewSplat: View-Adaptive Dynamic Gaussian Splatting for Feed-Forward Synthesis

Moonyeon Jeong,Seunggi Min,Suhyeon Lee,Hongje Seong

Main category: cs.CV

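TL;DR: ViewSplat proposes a view-adaptive 3D Gaussian splatting method that uses dynamic MLPs to correct Gaussian attributes according to the target view at render time, improving novel view synthesis from unposed images while combining high fidelity with real-time rendering.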

Details Motivation: Existing feed-forward 3D Gaussian splatting methods are limited by the expressive capacity of static Gaussian primitives and struggle to maintain high-fidelity reconstruction across all viewpoints, leaving a fidelity gap. Method: View-adaptive dynamic splatting is proposed: the network first predicts base Gaussian primitives and the weights of dynamic MLPs; during rendering, the MLPs take target view coordinates as input and output view-dependent residual updates for each Gaussian attribute (position, scale, rotation, opacity, and color). Result: State-of-the-art fidelity on novel view synthesis, with inference at 17 FPS and real-time rendering at 154 FPS. Conclusion: The view-adaptive dynamic splatting mechanism alleviates the modeling limits of static primitives and markedly improves reconstruction quality and generalization from unposed images while remaining efficient. Abstract: We present ViewSplat, a view-adaptive 3D Gaussian splatting network for novel view synthesis from unposed images. While recent feed-forward 3D Gaussian splatting has significantly accelerated 3D scene reconstruction by bypassing per-scene optimization, a fundamental fidelity gap remains. We attribute this bottleneck to the limited capacity of single-step feed-forward networks to regress static Gaussian primitives that satisfy all viewpoints. To address this limitation, we shift the paradigm from static primitive regression to view-adaptive dynamic splatting. Instead of a rigid Gaussian representation, our pipeline learns a view-adaptable latent representation. Specifically, ViewSplat initially predicts base Gaussian primitives alongside the weights of dynamic MLPs. During rendering, these MLPs take target view coordinates as input and predict view-dependent residual updates for each Gaussian attribute (i.e., 3D position, scale, rotation, opacity, and color). This mechanism, which we term view-adaptive dynamic splatting, allows each primitive to rectify initial estimation errors, effectively capturing high-fidelity appearances. Extensive experiments demonstrate that ViewSplat achieves state-of-the-art fidelity while maintaining fast inference (17 FPS) and real-time rendering (154 FPS).

[146] EagleNet: Energy-Aware Fine-Grained Relationship Learning Network for Text-Video Retrieval

Yuhan Chen,Pengwen Dai,Chuan Wang,Dayan Wu,Xiaochun Cao

Main category: cs.CV

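TL;DR: This paper proposes EagleNet, which uses fine-grained relationship learning and an energy-aware matching mechanism to make text embeddings more context-aware and improve text-video retrieval performance.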

Details Motivation: Existing methods model only the interactions between text and video frames and ignore the rich contextual relationships among frames within a video, so the text representation lacks frame-level context, leaving a semantic gap between modalities. Method: The Energy-Aware Fine-Grained Relationship Learning Network (EagleNet) comprises: 1) a Fine-Grained Relationship Learning mechanism (FRL) that builds a text-frame graph and aggregates text candidates to incorporate frame context; 2) Energy-Aware Matching (EAM), which models the energy of text-frame interactions to better fit the distribution of real text-video pairs; 3) a sigmoid contrastive loss that replaces the conventional softmax loss to improve cross-modal alignment and training stability. Result: State-of-the-art or leading performance on four mainstream text-video retrieval datasets: MSRVTT, DiDeMo, MSVD, and VATEX. Conclusion: By explicitly modeling intra-video frame context and energy-aware fine-grained interactions, EagleNet enriches the semantics of text representations and improves cross-modal alignment, confirming the value of internal video structure for text-video retrieval. Abstract: Text-video retrieval tasks have seen significant improvements due to the recent development of large-scale vision-language pre-trained models. Traditional methods primarily focus on video representations or cross-modal alignment, while recent works shift toward enriching text expressiveness to better match the rich semantics in videos. However, these methods use only interactions between text and frames/video, and ignore rich interactions among the internal frames within a video, so the final expanded text cannot capture frame contextual information, leading to disparities between text and video. In response, we introduce Energy-Aware Fine-Grained Relationship Learning Network (EagleNet) to generate accurate and context-aware enriched text embeddings. Specifically, the proposed Fine-Grained Relationship Learning mechanism (FRL) first constructs a text-frame graph by the generated text candidates and frames, then learns relationships among texts and frames, which are finally used to aggregate text candidates into an enriched text embedding that incorporates frame contextual information. To further improve fine-grained relationship learning in FRL, we design Energy-Aware Matching (EAM) to model the energy of text-frame interactions and thus accurately capture the distribution of real text-video pairs. Moreover, for more effective cross-modal alignment and stable training, we replace the conventional softmax-based contrastive loss with the sigmoid loss.
Extensive experiments demonstrate the superiority of EagleNet across MSRVTT, DiDeMo, MSVD, and VATEX. Code is available at https://github.com/draym28/EagleNet.
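
To make the softmax-versus-sigmoid distinction in the abstract concrete, here is a minimal sketch of a pairwise sigmoid loss, in which every text-video pair is scored as an independent binary decision instead of being normalized across the batch. The function name, the temperature `t`, and the bias `b` are illustrative assumptions in the style of commonly used sigmoid contrastive losses; this is not taken from the EagleNet code.

```python
import math

def sigmoid_pairwise_loss(sim, t=10.0, b=-10.0):
    """Mean sigmoid loss over an n x n text-video similarity matrix.

    sim[i][j] is the similarity of text i and video j; matched pairs sit on
    the diagonal. t (temperature) and b (bias) are illustrative values.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0   # +1 for matched, -1 otherwise
            z = label * (t * sim[i][j] + b)
            # -log(sigmoid(z)) = log(1 + exp(-z)), computed in a stable form
            total += max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
    return total / n

aligned = [[1.0, 0.0], [0.0, 1.0]]      # matched pairs score highest
misaligned = [[0.0, 1.0], [1.0, 0.0]]   # matched pairs score lowest
print(sigmoid_pairwise_loss(aligned) < sigmoid_pairwise_loss(misaligned))
```

Because each term depends on a single pair, the loss needs no batch-wide normalization, which is one plausible reason such losses are reported to train more stably.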

[147] V2U4Real: A Real-world Large-scale Dataset for Vehicle-to-UAV Cooperative Perception

Weijia Li,Haoen Xiang,Tianxu Wang,Shuaibing Wu,Qiming Xia,Cheng Wang,Chenglu Wen

Main category: cs.CV

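TL;DR: This paper presents V2U4Real, the first large-scale real-world multi-modal dataset for vehicle-to-UAV cooperative perception, aimed at the limitations of ground-level cooperative perception under large-scale occlusion and at long range.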

Details Motivation: Modern autonomous driving perception systems are constrained by occlusions, blind spots, and sensing range; existing Vehicle-to-Vehicle (V2V) and Vehicle-to-Infrastructure (V2I) cooperative paradigms are limited to ground-level collaboration and struggle with large-scale occlusions and long-range perception in complex environments. Method: The authors build V2U4Real, the first real-world multi-modal dataset for Vehicle-to-UAV (V2U) cooperative perception, collected jointly by a ground vehicle and a UAV equipped with multi-view LiDARs and RGB cameras. It covers diverse scenes and provides over 56K LiDAR frames, 56K images, and 700K annotated 3D bounding boxes, with benchmarks for single-agent detection, cooperative detection, and object tracking. Result: Evaluations of several state-of-the-art models show that V2U cooperation significantly improves perception robustness and long-range awareness; the dataset and code are open-sourced. Conclusion: V2U4Real fills the gap in cross-view cooperative perception datasets and provides a foundation for moving beyond ground-level cooperation toward air-ground cooperative perception research. Abstract: Modern autonomous vehicle perception systems are often constrained by occlusions, blind spots, and limited sensing range. While existing cooperative perception paradigms, such as Vehicle-to-Vehicle (V2V) and Vehicle-to-Infrastructure (V2I), have demonstrated their effectiveness in mitigating these challenges, they remain limited to ground-level collaboration and cannot fully address large-scale occlusions or long-range perception in complex environments. To advance research in cross-view cooperative perception, we present V2U4Real, the first large-scale real-world multi-modal dataset for Vehicle-to-UAV (V2U) cooperative object perception. V2U4Real is collected by a ground vehicle and a UAV equipped with multi-view LiDARs and RGB cameras. The dataset covers urban streets, university campuses, and rural roads under diverse traffic scenarios, comprising over 56K LiDAR frames, 56K multi-view camera images, and 700K annotated 3D bounding boxes across four classes. To support a wide range of research tasks, we establish benchmarks for single-agent 3D object detection, cooperative 3D object detection, and object tracking. Comprehensive evaluations of several state-of-the-art models demonstrate the effectiveness of V2U cooperation in enhancing perception robustness and long-range awareness. The V2U4Real dataset and codebase are available at https://github.com/VjiaLi/V2U4Real.

[148] Towards Controllable Low-Light Image Enhancement: A Continuous Multi-illumination Dataset and Efficient State Space Framework

Hongru Han,Tingrui Guo,Liming Zhang,Yan Su,Qiwen Xu,Zhuohua Ye

Main category: cs.CV

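TL;DR: This paper proposes a Controllable Low-light Enhancement (CLE) framework that recasts the traditional deterministic mapping problem as a conditional generation task. With the new Light100 benchmark, noise-decoupled supervision in the HVI color space, and an RWKV architecture improved with Space-to-Depth (S2D), it achieves controllable luminance with faithful chromaticity and significantly reduces reliance on gt-mean post-processing.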

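Details Motivation: Traditional low-light enhancement is treated as a deterministic mapping, but the task is inherently ill-posed (its solutions are multimodal), which causes luminance mismatch between predictions and labels and often requires gt-mean post-processing; environmental and sensor uncertainty needs to be modeled to achieve controllable, robust enhancement. Method: The paper proposes the Controllable Low-light Enhancement (CLE) paradigm; builds Light100, a new benchmark with continuous real-world illumination changes; designs noise-decoupled supervision in the HVI color space to separate illumination modulation from texture restoration; and improves the RWKV model with a Space-to-Depth strategy so that SSMs suit dense prediction while retaining local inductive biases. Result: Leading or competitive performance on seven benchmarks; precise luminance control and high chromatic fidelity; the need for gt-mean post-processing is greatly reduced or even eliminated; the effectiveness and robustness of the CLE paradigm in real multi-illumination scenes are validated. Conclusion: Moving LLIE from deterministic mapping to controllable conditional generation is a modeling paradigm that better matches the physics of the task; the CLE-RWKV framework, together with the Light100 benchmark and the HVI and S2D designs, offers an interpretable, adjustable, and practical path for low-light enhancement. Abstract: Low-light image enhancement (LLIE) has traditionally been formulated as a deterministic mapping. However, this paradigm often struggles to account for the ill-posed nature of the task, where unknown ambient conditions and sensor parameters create a multimodal solution space. Consequently, state-of-the-art methods frequently encounter luminance discrepancies between predictions and labels, often necessitating "gt-mean" post-processing to align output luminance for evaluation. To address this fundamental limitation, we propose a transition toward Controllable Low-light Enhancement (CLE), explicitly reformulating the task as a well-posed conditional problem. To this end, we introduce CLE-RWKV, a holistic framework supported by Light100, a new benchmark featuring continuous real-world illumination transitions. To resolve the conflict between luminance control and chromatic fidelity, a noise-decoupled supervision strategy in the HVI color space is employed, effectively separating illumination modulation from texture restoration. Architecturally, to adapt efficient State Space Models (SSMs) for dense prediction, we leverage a Space-to-Depth (S2D) strategy. By folding spatial neighborhoods into channel dimensions, this design allows the model to recover local inductive biases and effectively bridge the "scanning gap" inherent in flattened visual sequences without sacrificing linear complexity.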
Experiments across seven benchmarks demonstrate that our approach achieves competitive performance and robust controllability, providing a real-world multi-illumination alternative that significantly reduces the reliance on gt-mean post-processing.
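
The Space-to-Depth step described in the abstract ("folding spatial neighborhoods into channel dimensions") can be sketched generically as follows. This is a plain-Python illustration of the standard rearrangement on a nested-list image; the function name and the `block` parameter are assumptions for illustration, and the sketch says nothing about how CLE-RWKV integrates the operation.

```python
def space_to_depth(img, block=2):
    """Fold each block x block spatial patch of an H x W x C image into channels.

    img is a nested list indexed as img[row][col][channel]; H and W must be
    divisible by block. Returns an (H/block) x (W/block) x (C*block*block) list.
    """
    h, w = len(img), len(img[0])
    if h % block or w % block:
        raise ValueError("height and width must be divisible by block")
    out = []
    for r in range(0, h, block):
        row = []
        for c in range(0, w, block):
            cell = []
            # Concatenate the channels of every pixel in the patch.
            for dr in range(block):
                for dc in range(block):
                    cell.extend(img[r + dr][c + dc])
            row.append(cell)
        out.append(row)
    return out

# 4 x 4 single-channel image whose pixel value is its raster index.
image = [[[r * 4 + c] for c in range(4)] for r in range(4)]
folded = space_to_depth(image)
print(len(folded), len(folded[0]), folded[0][0])
```

After folding, each position in the shortened sequence already carries its 2 x 2 neighborhood, so a model that scans a flattened sequence sees local context without extra sequence length.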

[149] Adaptive Learned Image Compression with Graph Neural Networks

Yunuo Chen,Bing He,Zezheng Lyu,Hongwei Hu,Qunshan Gu,Yuan Tian,Guo Lu

Main category: cs.CV

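TL;DR: This paper proposes GLIC, an adaptive image compression framework based on Graph Neural Networks (GNNs). By constructing dual-scale graphs and dynamically adjusting each node's number of neighbors, it overcomes the rigidity of CNNs and Transformers in modeling image redundancy and reaches state-of-the-art performance on multiple benchmarks.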

Details Motivation: Mainstream learned image compression methods (based on CNNs or Transformers) have fixed receptive fields and static connectivity patterns, so they cannot adaptively model spatially varying redundancy in images (especially global redundancy) and tend to couple non-redundant pixels merely because they are close in Euclidean space. Method: GLIC is a GNN-based framework in which 1) dual-scale graphs support flexible, data-driven receptive fields, and 2) an adaptive connectivity mechanism dynamically adjusts each node's number of neighbors according to local content complexity. Result: On Kodak, Tecnick, and CLIC, GLIC achieves BD-rate reductions of 19.29%, 21.69%, and 18.71% relative to VTM-9.1, reaching state-of-the-art performance. Conclusion: Through content-adaptive graph modeling, GLIC substantially improves the representation of multi-scale image redundancy and demonstrates the effectiveness and potential of GNNs for image compression. Abstract: Efficient image compression relies on modeling both local and global redundancy. Most state-of-the-art (SOTA) learned image compression (LIC) methods are based on CNNs or Transformers, which are inherently rigid. Standard CNN kernels and window-based attention mechanisms impose fixed receptive fields and static connectivity patterns, which potentially couple non-redundant pixels simply due to their proximity in Euclidean space. This rigidity limits the model's ability to adaptively capture spatially varying redundancy across the image, particularly at the global level. To overcome these limitations, we propose a content-adaptive image compression framework based on Graph Neural Networks (GNNs). Specifically, our approach constructs dual-scale graphs that enable flexible, data-driven receptive fields. Furthermore, we introduce adaptive connectivity by dynamically adjusting the number of neighbors for each node based on local content complexity. These innovations empower our Graph-based Learned Image Compression (GLIC) model to effectively model diverse redundancy patterns across images, leading to more efficient and adaptive compression. Experiments demonstrate that GLIC achieves state-of-the-art performance, achieving BD-rate reductions of 19.29%, 21.69%, and 18.71% relative to VTM-9.1 on Kodak, Tecnick, and CLIC, respectively. Code will be released at https://github.com/UnoC-727/GLIC.

[150] MACRO: Advancing Multi-Reference Image Generation with Structured Long-Context Data

Zhekai Chen,Yuqing Wang,Manyuan Zhang,Xihui Liu

Main category: cs.CV

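TL;DR: This paper introduces the MacroData dataset and the MacroBench benchmark to address the performance degradation caused by a data bottleneck in multi-reference image generation, substantially improving generation under multi-reference conditioning.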

Details Motivation: Existing models degrade severely on multi-reference image generation as the number of reference images grows; the root cause is the lack of structured multi-reference datasets that support long-context supervision. Method: The authors build MacroData, a large-scale multi-reference image dataset (400K samples, up to 10 reference images each) organized along four dimensions: Customization, Illustration, Spatial reasoning, and Temporal dynamics. They also propose MacroBench, a standardized evaluation benchmark of 4,000 samples, and validate the approach through fine-tuning and ablation experiments. Result: Fine-tuning on MacroData substantially improves multi-reference generation; ablations show that cross-task co-training and long-context handling strategies yield synergistic gains. Conclusion: MacroData and MacroBench provide the data foundation and evaluation standard needed for multi-reference image generation and help advance the field. Abstract: Generating images conditioned on multiple visual references is critical for real-world applications such as multi-subject composition, narrative illustration, and novel view synthesis, yet current models suffer from severe performance degradation as the number of input references grows. We identify the root cause as a fundamental data bottleneck: existing datasets are dominated by single- or few-reference pairs and lack the structured, long-context supervision needed to learn dense inter-reference dependencies. To address this, we introduce MacroData, a large-scale dataset of 400K samples, each containing up to 10 reference images, systematically organized across four complementary dimensions -- Customization, Illustration, Spatial reasoning, and Temporal dynamics -- to provide comprehensive coverage of the multi-reference generation space. Recognizing the concurrent absence of standardized evaluation protocols, we further propose MacroBench, a benchmark of 4,000 samples that assesses generative coherence across graded task dimensions and input scales. Extensive experiments show that fine-tuning on MacroData yields substantial improvements in multi-reference generation, and ablation studies further reveal synergistic benefits of cross-task co-training and effective strategies for handling long-context complexity. The dataset and benchmark will be publicly released.

[151] HeSS: Head Sensitivity Score for Sparsity Redistribution in VGGT

Yongsung Kim,Wooseok Song,Jaihyun Lew,Hun Hwangbo,Jaehoon Lee,Sungroh Yoon

Main category: cs.CV

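TL;DR: This paper proposes HeSS-Guided Sparsification, which quantifies how differently attention heads respond to sparsification and substantially mitigates the performance degradation of VGGT at high sparsity.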

Details Motivation: Existing sparsification-based acceleration methods apply a uniform sparsity pattern in VGGT, ignoring differences in how sensitive individual attention heads are to sparsification, which causes large accuracy drops. Method: A two-stage sparsification pipeline is proposed: the first stage measures each head's sensitivity with the Head Sensitivity Score (HeSS), based on a Hessian approximation with respect to two error terms on a small calibration set; the second stage allocates the attention budget according to HeSS, keeping more connections for sensitive heads and applying higher sparsity to robust heads. Result: Experiments show that HeSS captures the heterogeneity of head-wise sensitivity, and the method substantially limits performance degradation at high sparsity while remaining robust across sparsity levels. Conclusion: Attention heads differ markedly in their sensitivity to sparsification, and a HeSS-based differentiated sparsity strategy balances efficiency and accuracy, offering a new approach to deploying Transformer-based 3D vision models efficiently. Abstract: Visual Geometry Grounded Transformer (VGGT) has advanced 3D vision, yet its global attention layers suffer from quadratic computational costs that hinder scalability. Several sparsification-based acceleration techniques have been proposed to alleviate this issue, but they often suffer from substantial accuracy degradation. We hypothesize that the accuracy degradation stems from the heterogeneity in head-wise sparsification sensitivity, as the existing methods apply a uniform sparsity pattern across all heads. Motivated by this hypothesis, we present a two-stage sparsification pipeline that effectively quantifies and exploits head-wise sparsification sensitivity. In the first stage, we measure head-wise sparsification sensitivity using a novel metric, the Head Sensitivity Score (HeSS), which approximates the Hessian with respect to two distinct error terms on a small calibration set. In the inference stage, we perform HeSS-Guided Sparsification, leveraging the pre-computed HeSS to reallocate the total attention budget, assigning denser attention to sensitive heads and sparser attention to more robust ones. We demonstrate that HeSS effectively captures head-wise sparsification sensitivity and empirically confirm that attention heads in the global attention layers exhibit heterogeneous sensitivity characteristics. Extensive experiments further show that our method effectively mitigates performance degradation under high sparsity, demonstrating strong robustness across varying sparsification levels. Code is available at https://github.com/libary753/HeSS.
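
The budget reallocation idea in this entry (denser attention for sensitive heads, sparser for robust ones, under a fixed total budget) can be illustrated with a small sketch. The proportional rule, the `floor` safeguard, and the function name below are assumptions chosen for illustration; the abstract does not specify the actual allocation rule, and the scores here are stand-ins for HeSS values.

```python
def redistribute_budget(scores, mean_keep, floor=0.05):
    """Split a total attention budget across heads in proportion to sensitivity.

    scores: non-negative per-head sensitivity scores (stand-ins for HeSS).
    mean_keep: average fraction of attention entries kept per head.
    floor: minimum keep ratio given to every head (hypothetical safeguard).
    """
    n = len(scores)
    if not 0.0 < floor <= mean_keep <= 1.0:
        raise ValueError("need 0 < floor <= mean_keep <= 1")
    total = sum(scores)
    if total == 0:
        return [mean_keep] * n               # no signal: fall back to uniform
    spare = (mean_keep - floor) * n          # budget left after the floor
    keep = [floor + spare * s / total for s in scores]
    # Clip at 1.0; any clipped surplus is simply dropped in this sketch.
    return [min(k, 1.0) for k in keep]

# Four heads, 25% average budget: the most sensitive head gets the most.
ratios = redistribute_budget([3.0, 1.0, 0.0, 0.0], mean_keep=0.25)
print([round(r, 2) for r in ratios])
```

A uniform pattern would give every head 0.25 here; the reallocation keeps the same total while shifting capacity toward the heads whose scores are largest.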

[152] Image Rotation Angle Estimation: Comparing Circular-Aware Methods

Maximilian Woehrer

Main category: cs.CV

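TL;DR: This paper systematically studies five circular-aware methods for automatic image rotation estimation, evaluated across multiple modern architectures. Probabilistic methods (especially the circular Gaussian distribution) are the most robust, while classification is the most accurate on well-matched backbones; the best configurations reach mean absolute errors of 1.23°, 3.71°, and 2.84° on DRC-D, COCO 2014, and COCO 2017, respectively.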

Details Motivation: Angles have circular topology, which creates discontinuities at the boundary for standard regression methods and makes automatic image rotation estimation challenging. Method: Five circular-aware methods are compared: direct angle regression with a circular loss, classification via angular binning, unit-vector regression, a phase-shifting coder, and a circular Gaussian distribution. ImageNet-pretrained models are used for transfer learning, with output heads adapted to predict the global rotation angle. Result: The circular Gaussian distribution is the most robust across backbones; classification is the most accurate on well-matched backbones such as EfficientViT-B3 (MAE of 1.23° on DRC-D) but is unstable to train on others; the circular Gaussian with MambaOut Base reaches 1.24° and is more robust; MAE on COCO 2014 and COCO 2017 reaches 3.71° and 2.84°, substantially better than prior work. Conclusion: Probabilistic modeling that respects circular structure (such as the circular Gaussian distribution) is the more robust choice for rotation estimation, whereas classification is slightly more accurate but generalizes less well across backbones; matching the method to the backbone is critical to performance. Abstract: Automatic image rotation estimation is a key preprocessing step in many vision pipelines. This task is challenging because angles have circular topology, creating boundary discontinuities that hinder standard regression methods. We present a comprehensive study of five circular-aware methods for global orientation estimation: direct angle regression with circular loss, classification via angular binning, unit-vector regression, phase-shifting coder, and circular Gaussian distribution. Using transfer learning from ImageNet-pretrained models, we systematically evaluate these methods across sixteen modern architectures by adapting their output heads for rotation-specific predictions. Our results show that probabilistic methods, particularly the circular Gaussian distribution, are the most robust across architectures, while classification achieves the best accuracy on well-matched backbones but suffers training instabilities on others. The best configuration (classification with EfficientViT-B3) achieves a mean absolute error (MAE) of 1.23° (mean across five independent runs) on the DRC-D dataset, while the circular Gaussian distribution with MambaOut Base achieves a virtually identical 1.24° with greater robustness across backbones. When our top-performing method-architecture combinations are trained and evaluated on COCO 2014, the best configuration reaches 3.71° MAE, improving substantially over prior work, with further improvement to 2.84° on the larger COCO 2017 dataset.
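
The boundary discontinuity mentioned in this entry is easy to see numerically: a prediction of 359° for a true angle of 1° is off by 2°, not 358°. The sketch below shows a wrap-aware absolute error and the (cos, sin) encoding that underlies unit-vector regression. The helper names are illustrative assumptions, not identifiers from the paper.

```python
import math

def circular_abs_error(pred_deg, true_deg):
    """Smallest absolute angular difference in degrees, in [0, 180]."""
    diff = (pred_deg - true_deg) % 360.0
    return min(diff, 360.0 - diff)

def angle_to_unit_vector(deg):
    """Encode an angle as (cos, sin), the target used in unit-vector regression."""
    rad = math.radians(deg)
    return math.cos(rad), math.sin(rad)

def unit_vector_to_angle(x, y):
    """Decode a (possibly unnormalized) 2D vector back to degrees in [0, 360)."""
    return math.degrees(math.atan2(y, x)) % 360.0

print(abs(359.0 - 1.0))                  # naive error across the wrap point
print(circular_abs_error(359.0, 1.0))    # wrap-aware error
```

Encoding the angle as a point on the unit circle removes the jump at 0°/360°, since nearby angles always map to nearby vectors; decoding with `atan2` also tolerates predictions that are not exactly unit length.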

[153] InstanceAnimator: Multi-Instance Sketch Video Colorization

Yinhan Zhang,Yue Ma,Bingyuan Wang,Kunyu Feng,Yeying Jin,Qifeng Chen,Anyi Rao,Zeyu Wang

Main category: cs.CV

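TL;DR: InstanceAnimator is a Diffusion Transformer framework for multi-instance sketch video colorization. Through a canvas guidance condition, an instance matching mechanism, and an adaptive decoupled control module, it addresses the shortcomings of existing methods in user control, instance alignment, and detail fidelity.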

Details Motivation: Existing methods have three limitations: inflexible user control caused by reliance on a single reference frame, poor instance controllability that leads to misalignment in multi-character scenes, and degraded detail fidelity in fine-grained regions. Method: Three innovations are proposed: 1) a Canvas Guidance Condition that allows free placement of reference elements and background; 2) an Instance Matching Mechanism that integrates instance features with the sketches to improve multi-character alignment; 3) an Adaptive Decoupled Control Module that injects semantic features from characters, backgrounds, and text into the diffusion process to enhance detail. Result: Extensive experiments show that InstanceAnimator achieves better user control, higher visual quality, and stronger instance consistency in multi-instance colorization. Conclusion: InstanceAnimator overcomes key challenges in multi-instance sketch video colorization and offers a new paradigm for interactive video colorization. Abstract: We propose InstanceAnimator, a novel Diffusion Transformer framework for multi-instance sketch video colorization. Existing methods suffer from three core limitations: inflexible user control due to heavy reliance on single reference frames, poor instance controllability leading to misalignment in multi-character scenarios, and degraded detail fidelity in fine-grained regions. To address these challenges, we introduce three corresponding innovations. First, a Canvas Guidance Condition eliminates workflow fragmentation by allowing free placement of reference elements and background, enabling unprecedented user flexibility. Second, an Instance Matching Mechanism resolves misalignment by integrating instance features with the sketches, ensuring precise control over multiple characters. Third, an Adaptive Decoupled Control Module enhances detail fidelity by injecting semantic features from characters, backgrounds, and text conditions into the diffusion process. Extensive experiments demonstrate that InstanceAnimator achieves superior multi-instance colorization with enhanced user control, high visual quality, and strong instance consistency.

[154] CLIP-RD: Relational Distillation for Efficient CLIP Knowledge Distillation

Jeannie Chung,Hanna Jang,Ingyeong Yang,Uiwon Hwang,Jaehyung Sim

Main category: cs.CV

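TL;DR: This paper proposes CLIP-RD, a relational knowledge distillation framework comprising Vertical Relational Distillation (VRD) and Cross Relational Distillation (XRD), to better preserve the structural relationships in the CLIP teacher's multimodal embeddings and improve the zero-shot performance of lightweight student models.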

Details Motivation: Existing CLIP distillation methods do not explicitly model multi-directional relational dependencies between teacher and student embeddings, so the student struggles to preserve the structural relationships encoded by the teacher. Method: The paper proposes VRD, which enforces consistency of teacher-student distillation strength across modalities at the distribution level, and XRD, which imposes bidirectional symmetry on cross-modal teacher-student similarity distributions; together they model multi-directional relational structure. Result: CLIP-RD outperforms existing methods by 0.8 percentage points on zero-shot classification. Conclusion: By explicitly modeling multi-directional relations between teacher and student embeddings, CLIP-RD aligns the student's embedding geometry more faithfully with the teacher's and improves distillation. Abstract: CLIP aligns image and text embeddings via contrastive learning and demonstrates strong zero-shot generalization. Its large-scale architecture requires substantial computational and memory resources, motivating the distillation of its capabilities into lightweight student models. However, existing CLIP distillation methods do not explicitly model multi-directional relational dependencies between teacher and student embeddings, limiting the student's ability to preserve the structural relationships encoded by the teacher. To address this, we propose a relational knowledge distillation framework that introduces two novel methods, Vertical Relational Distillation (VRD) and Cross Relational Distillation (XRD). VRD enforces consistency of teacher-student distillation strength across modalities at the distribution level, while XRD imposes bidirectional symmetry on cross-modal teacher-student similarity distributions. By jointly modeling multi-directional relational structures, CLIP-RD promotes faithful alignment of the student embedding geometry with that of the teacher, outperforming existing methods by 0.8%p.

[155] Multimodal Dataset Distillation via Phased Teacher Models

Shengbin Guo,Hang Zhao,Senqiao Yang,Chenyang Jiang,Yuhang Cheng,Xiangru Peng,Rui Shao,Zhuotao Tian

Main category: cs.CV

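TL;DR: This paper proposes PTM-ST, a phased teacher model framework for multimodal dataset distillation. Stage-aware modeling and a shortcut trajectory strategy improve the stability and expressiveness of distillation, and the method clearly outperforms existing approaches on Flickr30k and COCO.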

Details Motivation: Existing methods fail to capture the complex, dynamically evolving knowledge in the later training stages of teacher models, which degrades student performance and the quality of the distilled data; they also suffer from large cross-stage performance gaps and unstable teacher trajectories. Method: The Phased Teacher Model with Shortcut Trajectory (PTM-ST) uses stage-aware teacher modeling and a shortcut-based trajectory construction strategy to accurately fit the teacher's learning dynamics across training phases. Result: Theoretical analysis and experiments show that PTM-ST significantly mitigates optimization oscillations and inter-phase knowledge gaps and reduces storage overhead; it consistently surpasses state-of-the-art methods on Flickr30k and COCO, with up to 13.5% absolute improvement and a 9.53% average gain on Flickr30k. Conclusion: PTM-ST is a new framework that improves the stability, expressiveness, and efficiency of multimodal dataset distillation, offering a new approach to compressing and transferring knowledge from large-scale image-text data. Abstract: Multimodal dataset distillation aims to construct compact synthetic datasets that enable efficient compression and knowledge transfer from large-scale image-text data. However, existing approaches often fail to capture the complex, dynamically evolving knowledge embedded in the later training stages of teacher models. This limitation leads to degraded student performance and compromises the quality of the distilled data. To address critical challenges such as pronounced cross-stage performance gaps and unstable teacher trajectories, we propose Phased Teacher Model with Shortcut Trajectory (PTM-ST) -- a novel phased distillation framework. PTM-ST leverages stage-aware teacher modeling and a shortcut-based trajectory construction strategy to accurately fit the teacher's learning dynamics across distinct training phases. This enhances both the stability and expressiveness of the distillation process. Through theoretical analysis and comprehensive experiments, we show that PTM-ST significantly mitigates optimization oscillations and inter-phase knowledge gaps, while also reducing storage overhead. Our method consistently surpasses state-of-the-art baselines on Flickr30k and COCO, achieving up to 13.5% absolute improvement and an average gain of 9.53% on Flickr30k. Code: https://github.com/Previsior/PTM-ST.

[156] FSGNet: A Frequency-Aware and Semantic Guidance Network for Infrared Small Target Detection

Yingmei Zhang,Wangtao Bao,Yong Yang,Weiguo Wan,Qin Xiao,Xueting Zou

Main category: cs.CV

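TL;DR: This paper proposes FSGNet, a lightweight and efficient infrared small target detection framework that combines frequency-aware and semantic guidance mechanisms. Multi-directional interactive attention, multi-scale frequency-domain filtering, and global semantic guidance flows mitigate the semantic degradation of U-Net and improve small-target localization.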

Details Motivation: In infrared small target detection, U-Net suffers from semantic degradation when features are passed from deep to shallow layers, which limits precise localization of small targets. Method: The FSGNet framework comprises: 1) a multi-directional interactive attention module in the encoder that increases sensitivity to small, low-contrast targets; 2) an FFT-based multi-scale frequency-aware module that suppresses background interference in skip connections; 3) global pooling at the deepest layer, followed by upsampling and semantic guidance flows, to maintain semantic consistency across scales. Result: Experiments on four public IRSTD datasets show that FSGNet outperforms existing methods in both detection performance and computational efficiency, demonstrating practicality and robustness. Conclusion: FSGNet mitigates the semantic degradation of U-Net and, by jointly modeling frequency and semantic information, improves the accuracy and efficiency of infrared small target detection. Abstract: Infrared small target detection (IRSTD) aims to identify and distinguish small targets from complex backgrounds. Leveraging the powerful multi-scale feature fusion capability of the U-Net architecture, IRSTD has achieved significant progress. However, U-Net suffers from semantic degradation when transferring high-level features from deep to shallow layers, limiting the precise localization of small targets. To address this issue, this paper proposes FSGNet, a lightweight and effective detection framework incorporating frequency-aware and semantic guidance mechanisms. Specifically, a multi-directional interactive attention module is proposed throughout the encoder to capture fine-grained and directional features, enhancing the network's sensitivity to small, low-contrast targets. To suppress background interference propagated through skip connections, a multi-scale frequency-aware module leverages the Fast Fourier Transform to filter out target-similar clutter while preserving salient target structures. At the deepest layer, a global pooling module captures high-level semantic information, which is subsequently upsampled and propagated to each decoder stage through the global semantic guidance flows, ensuring semantic consistency and precise localization across scales. Extensive experiments on four public IRSTD datasets demonstrate that FSGNet achieves superior detection performance and maintains high efficiency, highlighting its practical applicability and robustness. The code will be released at https://github.com/Wangtao-Bao/FSGNet.

[157] PMT: Plain Mask Transformer for Image and Video Segmentation with Frozen Vision Encoders

Niccolò Cavagnero,Narges Norouzi,Gijs Dubbelman,Daan de Geus

Main category: cs.CV

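TL;DR: This paper proposes the Plain Mask Transformer (PMT), built on the Plain Mask Decoder (PMD), a lightweight Transformer decoder that operates on features from a frozen Vision Foundation Model (VFM). It delivers efficient, accurate image and video segmentation while keeping the encoder frozen and shareable.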

Details Motivation: Existing VFM-based segmentation models such as EoMT and VidEoMT are fast but require finetuning the encoder, giving up the multi-task sharing that makes VFMs attractive; this work aims to combine the practicality of a frozen encoder with an efficient decoder. Method: The Plain Mask Decoder (PMD) is a lightweight, fast Transformer decoder that operates directly on features extracted by a frozen VFM; the resulting end-to-end Plain Mask Transformer (PMT) framework supports both image and video segmentation. Result: On image segmentation, PMT matches the frozen-encoder state of the art while running about 3x faster; on video segmentation, it performs on par with fully finetuned methods and is up to 8x faster than state-of-the-art frozen-encoder models. Conclusion: PMT reconciles the generality of a frozen VFM encoder with the high-accuracy, low-latency demands of segmentation, offering a practical paradigm for large-scale multi-task deployment. Abstract: Vision Foundation Models (VFMs) pre-trained at scale enable a single frozen encoder to serve multiple downstream tasks simultaneously. Recent VFM-based encoder-only models for image and video segmentation, such as EoMT and VidEoMT, achieve competitive accuracy with remarkably low latency, yet they require finetuning the encoder, sacrificing the multi-task encoder sharing that makes VFMs practically attractive for large-scale deployment. To reconcile encoder-only simplicity and speed with frozen VFM features, we propose the Plain Mask Decoder (PMD), a fast Transformer-based segmentation decoder that operates on top of frozen VFM features. The resulting model, the Plain Mask Transformer (PMT), preserves the architectural simplicity and low latency of encoder-only designs while keeping the encoder representation unchanged and shareable. The design seamlessly applies to both image and video segmentation, inheriting the generality of the encoder-only framework. On standard image segmentation benchmarks, PMT matches the frozen-encoder state of the art while running up to ~3x faster. For video segmentation, it even performs on par with fully finetuned methods, while being up to 8x faster than state-of-the-art frozen-encoder models. Code: https://github.com/tue-mps/pmt.

[158] LaMP: Learning Vision-Language-Action Policies with 3D Scene Flow as Latent Motion Prior

Xinkai Wang,Chenyi Wang,Yifu Xu,Mingzhe Ye,Fu-Cheng Zhang,Jialin Tian,Xinyu Zhan,Lifeng Zhu,Cewu Lu,Lixin Yang

Main category: cs.CV

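TL;DR: LaMP is a dual-expert Vision-Language-Action (VLA) framework that introduces 3D scene flow as a latent motion prior, improving the generalization and robustness of robotic manipulation under complex spatial dynamics.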

Details Motivation: Existing VLA models regress actions directly from 2D semantic features alone, cannot explicitly model 3D physical interactions, and degrade under unfamiliar spatial dynamics. Method: A dual-expert architecture is proposed: the Motion Expert generates a one-step, partially denoised 3D scene flow, and the Action Expert is conditioned on the Motion Expert's hidden states through gated cross-attention, without full multi-step reconstruction. Result: LaMP outperforms existing VLA baselines on the LIBERO, LIBERO-Plus, and SimplerEnv-WidowX simulation benchmarks and in real-world experiments; under LIBERO-Plus OOD perturbations, it achieves an average gain of 9.7% over the strongest prior baseline. Conclusion: Explicitly modeling 3D scene flow as a motion prior and using a cooperative dual-expert mechanism for efficient action prediction markedly improves the robustness and generalization of VLA models in out-of-distribution scenarios. Abstract: We introduce LaMP, a dual-expert Vision-Language-Action framework that embeds dense 3D scene flow as a latent motion prior for robotic manipulation. Existing VLA models regress actions directly from 2D semantic visual features, forcing them to learn complex 3D physical interactions implicitly. This implicit learning strategy degrades under unfamiliar spatial dynamics. LaMP addresses this limitation by aligning a flow-matching Motion Expert with a policy-predicting Action Expert through gated cross-attention. Specifically, the Motion Expert generates a one-step partially denoised 3D scene flow, and its hidden states condition the Action Expert without full multi-step reconstruction. We evaluate LaMP on the LIBERO, LIBERO-Plus, and SimplerEnv-WidowX simulation benchmarks as well as real-world experiments. LaMP consistently outperforms evaluated VLA baselines across LIBERO, LIBERO-Plus, and SimplerEnv-WidowX benchmarks, achieving the highest reported average success rates under the same training budgets. On LIBERO-Plus OOD perturbations, LaMP shows improved robustness with an average 9.7% gain over the strongest prior baseline. Our project page is available at https://summerwxk.github.io/lamp-project-page/.

[159] HiSpatial: Taming Hierarchical 3D Spatial Understanding in Vision-Language Models

Huizhi Liang,Yichao Shen,Yu Deng,Sicheng Xu,Zhiyuan Feng,Tong Zhang,Yaobo Liang,Jiaolong Yang

Main category: cs.CV

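TL;DR: This paper proposes a hierarchical framework for improving the 3D spatial understanding of vision-language models (VLMs), and builds a large-scale 3D spatial VQA dataset and an RGB-D VLM that reach state-of-the-art performance on multiple spatial reasoning benchmarks.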

Details Motivation: Human-like spatial intelligence requires VLMs to infer 3D structure from 2D images, recognize 3D object properties and relations, and perform high-level spatial reasoning, yet existing methods lack systematic hierarchical modeling. Method: The paper proposes a four-level progressive framework for spatial understanding (geometric perception, spatial relations, scene understanding, and abstract reasoning), builds an automated pipeline that generates 3D spatial VQA data from about 5M images and 45M objects, and designs an RGB-D VLM that incorporates metric-scale point maps. Result: State-of-the-art results on multiple spatial understanding and reasoning benchmarks, surpassing specialized spatial models and large models such as Gemini-2.5-pro and GPT-5; the analysis confirms clear dependencies among the hierarchical task levels. Conclusion: Hierarchical task design is a key mechanism for the emergence of 3D spatial intelligence in VLMs, and the proposed framework, data, and model together advance spatial intelligence systematically. Abstract: Achieving human-like spatial intelligence for vision-language models (VLMs) requires inferring 3D structures from 2D observations, recognizing object properties and relations in 3D space, and performing high-level spatial reasoning. In this paper, we propose a principled hierarchical framework that decomposes the learning of 3D spatial understanding in VLMs into four progressively complex levels, from geometric perception to abstract spatial reasoning. Guided by this framework, we construct an automated pipeline that processes approximately 5M images with over 45M objects to generate 3D spatial VQA pairs across diverse tasks and scenes for VLM supervised fine-tuning. We also develop an RGB-D VLM incorporating metric-scale point maps as auxiliary inputs to further enhance spatial understanding. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on multiple spatial understanding and reasoning benchmarks, surpassing specialized spatial models and large proprietary systems such as Gemini-2.5-pro and GPT-5. Moreover, our analysis reveals clear dependencies among hierarchical task levels, offering new insights into how multi-level task design facilitates the emergence of 3D spatial intelligence.

[160] VideoWeaver: Multimodal Multi-View Video-to-Video Transfer for Embodied Agents

George Eskandar,Fengyi Shen,Mohammad Altillawi,Dong Chen,Yang Bai,Liudi Yang,Ziyuan Liu

Main category: cs.CV

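TL;DR: This paper presents VideoWeaver, the first framework to support multi-view video-to-video (V2V) translation. A shared 4D latent space and training views at distinct diffusion timesteps provide cross-view consistency and scalability, substantially improving multi-view resimulation quality for robot policy transfer.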

Details Motivation: Existing single-view V2V methods cannot guarantee consistent appearance across multiple synchronized cameras, and standard Transformers scale poorly to multi-view settings because of the cost of cross-view attention, which limits their use in embodied AI tasks such as robot policy learning. Method: VideoWeaver is first built as a single-view V2V model based on a flow model; it then uses the Pi3 spatial foundation model to construct a shared 4D latent space for view alignment; and by training views at different diffusion timesteps, it models joint and conditional view distributions, supporting autoregressive generation of new views. Result: Performance is state of the art or comparable on single-view benchmarks; physically and stylistically consistent multi-view translation is achieved for the first time, covering challenging settings such as egocentric views and heterogeneous cameras, and is suitable for world randomization in robot training. Conclusion: VideoWeaver offers a scalable, consistent paradigm for multi-view V2V translation and supports transferring pretrained policies to new environments in embodied AI without additional data collection. Abstract: Recent progress in video-to-video (V2V) translation has enabled realistic resimulation of embodied AI demonstrations, a capability that allows pretrained robot policies to be transferable to new environments without additional data collection. However, prior works can only operate on a single view at a time, while embodied AI tasks are commonly captured from multiple synchronized cameras to support policy learning. Naively applying single-view models independently to each camera leads to inconsistent appearance across views, and standard transformer architectures do not scale to multi-view settings due to the quadratic cost of cross-view attention. We present VideoWeaver, the first multimodal multi-view V2V translation framework. VideoWeaver is initially trained as a single-view flow-based V2V model. To achieve an extension to the multi-view regime, we propose to ground all views in a shared 4D latent space derived from a feed-forward spatial foundation model, namely, Pi3. This encourages view-consistent appearance even under wide baselines and dynamic camera motion. To scale beyond a fixed number of cameras, we train views at distinct diffusion timesteps, enabling the model to learn both joint and conditional view distributions. This in turn allows autoregressive synthesis of new viewpoints conditioned on existing ones.
Experiments show superior or similar performance to the state-of-the-art on the single-view translation benchmarks and, for the first time, physically and stylistically consistent multi-view translations, including challenging egocentric and heterogeneous-camera setups central to world randomization for robot learning.
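
The per-view timestep idea in the abstract can be illustrated with a minimal sketch. Everything named here (`sample_view_timesteps`, `num_steps`, and the use of timestep 0 to mark a clean conditioning view) is a hypothetical illustration, not the authors' implementation. Drawing an independent timestep for each view exposes a model to mixes of clean and noised views, which is one plausible way to cover both joint and conditional view distributions during training.

```python
import random

def sample_view_timesteps(num_views, num_steps, rng, clean_views=()):
    """Draw one diffusion timestep per camera view (illustrative only).

    Views listed in `clean_views` get timestep 0, treated here as "no noise",
    so they act as conditioning; every other view gets its own timestep.
    """
    timesteps = []
    for view in range(num_views):
        if view in clean_views:
            timesteps.append(0)
        else:
            timesteps.append(rng.randint(1, num_steps))
    return timesteps

rng = random.Random(0)
# Joint case: all four views are noised, each at its own level.
joint = sample_view_timesteps(4, 1000, rng)
# Conditional case: views 0 and 1 are given, views 2 and 3 are generated.
conditional = sample_view_timesteps(4, 1000, rng, clean_views={0, 1})
```

In this toy setup, autoregressive synthesis of a new viewpoint would correspond to the conditional case, with already generated views marked as clean.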

[161] DC-Reg: Globally Optimal Point Cloud Registration via Tight Bounding with Difference of Convex Programming

Wei Lian,Fei Ma,Hang Pan,Zhesen Cui,Wangmeng Zuo

Main category: cs.CV

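TL;DR: This paper proposes DC-Reg, a framework that uses the difference-of-convex (DC) programming paradigm to build a holistic concave underestimator, substantially tightening the lower bounds used in branch-and-bound (BnB) search and yielding globally optimal point cloud registration, with particular gains under partial overlap and large misalignment.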

Details Motivation: Globally optimal point cloud registration is hard to obtain under partial overlap and large misalignment; methods that jointly optimize the transformation and the correspondences have a non-convex coupled objective, so they tend to get stuck in local minima or converge too slowly. Method: DC-Reg constructs a holistic concave underestimator of the coupled objective via a DC decomposition and computes tight lower bounds efficiently by solving linear assignment problems (LAPs) at the vertices of the search boxes, avoiding the weakness of term-wise relaxations (e.g., McCormick envelopes), which ignore variable interactions. Result: The framework is validated on 2D similarity and 3D rigid registration, using rotation-invariant features to improve efficiency; on synthetic data and the 3DMatch benchmark, it converges faster and is more robust to extreme noise and outliers than state-of-the-art global methods. Conclusion: Through a structured DC decomposition, DC-Reg obtains tighter lower bounds, offering an approach to point cloud registration that combines efficiency with a global optimality guarantee. Abstract: Achieving globally optimal point cloud registration under partial overlaps and large misalignments remains a fundamental challenge. While simultaneous transformation ($\boldsymbol{\theta}$) and correspondence ($\mathbf{P}$) estimation has the advantage of being robust to nonrigid deformation, its non-convex coupled objective often leads to local minima for heuristic methods and prohibitive convergence times for existing global solvers due to loose lower bounds. To address this, we propose DC-Reg, a robust globally optimal framework that significantly tightens the Branch-and-Bound (BnB) search. Our core innovation is the derivation of a holistic concave underestimator for the coupled transformation-assignment objective, grounded in the Difference of Convex (DC) programming paradigm. Unlike prior works that rely on term-wise relaxations (e.g., McCormick envelopes) which neglect variable interplay, our holistic DC decomposition captures the joint structural interaction between $\boldsymbol{\theta}$ and $\mathbf{P}$. This formulation enables the computation of remarkably tight lower bounds via efficient Linear Assignment Problems (LAP) evaluated at the vertices of the search boxes. We validate our framework on 2D similarity and 3D rigid registration, utilizing rotation-invariant features for the latter to achieve high efficiency without sacrificing optimality. Experimental results on synthetic data and the 3DMatch benchmark demonstrate that DC-Reg achieves significantly faster convergence and superior robustness to extreme noise and outliers compared to state-of-the-art global techniques.
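
Why a concave underestimator leads to bounds "evaluated at the vertices of the search boxes" can be seen in a one-dimensional toy problem. The sketch below is not the DC-Reg objective; the function `f`, its split into convex parts, and the helper names are assumptions chosen for illustration. Writing f = g - h with g and h convex, replacing g by its tangent line gives a concave function that lies below f, and a concave function attains its minimum over an interval at an endpoint.

```python
def f(x):
    # Toy non-convex objective with the DC split g(x) = x^4, h(x) = 2 x^2.
    return x**4 - 2.0 * x**2

def concave_underestimator(x, x0):
    # g is convex, so its tangent at x0 never exceeds g.
    tangent_g = x0**4 + 4.0 * x0**3 * (x - x0)
    # Affine minus convex is concave, and it underestimates f everywhere.
    return tangent_g - 2.0 * x**2

def lower_bound(a, b):
    # A concave function on [a, b] is minimized at a or b, so two
    # evaluations give a valid lower bound on f over the interval.
    x0 = 0.5 * (a + b)
    return min(concave_underestimator(a, x0), concave_underestimator(b, x0))

a, b = 0.5, 1.5
bound = lower_bound(a, b)
grid = [a + (b - a) * i / 1000 for i in range(1001)]
grid_min = min(f(x) for x in grid)
```

In a branch-and-bound loop, an interval whose `bound` already exceeds the best objective value found so far could be discarded without further subdivision.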

[162] CIAR: Interval-based Collaborative Decoding for Image Generation Acceleration

Keming Ye,Zhou Zhao,Fan Wu,Shengyu Zhang

Main category: cs.CV

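TL;DR: This paper proposes CIAR, a cloud-device collaboration framework with on-device self-verification that addresses the high compute cost and latency of deploying autoregressive image generation models on devices, substantially speeding up inference and reducing cloud requests.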

Details Motivation: Autoregressive image generation models perform well but are hard to deploy efficiently on devices because they are compute-intensive and sequential, which causes significant latency; verification is also inefficient because of the large visual token vocabulary and the spatial redundancy of images. Method: CIAR, a cloud-device collaboration framework, has two core parts: (1) an on-device token uncertainty quantifier based on continuous probability intervals, which avoids uniformly verifying redundant regions; (2) an interval-enhanced decoding module combined with a distribution alignment training strategy, which accelerates decoding while maintaining image fidelity and semantic consistency. Result: Experiments show a 2.18x speed-up over existing methods and 70% fewer cloud requests while preserving image quality. Conclusion: By introducing uncertainty-aware on-device verification and interval-based decoding, CIAR balances efficiency and quality in autoregressive image generation and offers a viable path to on-device deployment. Abstract: Auto-regressive (AR) models have recently made notable progress in image generation, achieving performance comparable to diffusion-based approaches. However, their computational intensity and sequential nature impede on-device deployment, causing disruptive latency. We address this via a cloud-device collaboration framework, CIAR, which utilizes on-device self-verification to handle two key properties of visual synthesis: the vast token vocabulary required for high-fidelity images and inherent spatial redundancy, which leads to extreme predictability in homogeneous regions, while object boundaries exhibit high uncertainty. Uniform verification wastes resources on such redundant tokens. Our solution centers on an on-device token uncertainty quantifier, which adopts continuous probability intervals to accelerate processing and make it feasible for large visual vocabularies instead of conventional discrete solution sets. Additionally, we incorporate an interval-enhanced decoding module to further speed up decoding while maintaining visual fidelity and semantic consistency via a distribution alignment training strategy. Extensive experiments demonstrate that CIAR achieves a 2.18x speed-up and reduces cloud requests by 70%, while preserving image quality compared to existing methods.

[163] GridVAD: Open-Set Video Anomaly Detection via Spatial Reasoning over Stratified Frame Grids

Mohamed Eltahir,Ahmed O. Ibrahim,Obada Siralkhatim,Tabarak Abdallah,Sondos Mohamed

Main category: cs.CV

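TL;DR: This paper proposes GridVAD, a training-free video anomaly detection framework that uses a vision-language model (VLM) to generate natural-language anomaly descriptions and then applies self-consistency filtering, grounding, and segmentation modules to output pixel-level anomaly masks, balancing zero-shot performance and efficiency.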

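Details Motivation: Applied directly to video anomaly detection, existing VLMs are fragile because they lack calibrated anomaly priors, leading to missed detections or false alarms; the authors argue that the problem lies in how the model is used rather than in the model itself, and that the VLM should act as an "anomaly proposer" supported by dedicated spatial and temporal modules for grounding and propagation. Method: A propose-ground-propagate paradigm: (1) the VLM generates open-set natural-language anomaly descriptions over stratified grid representations of video clips; (2) Self-Consistency Consolidation (SCC) filters hallucinations through consensus across multiple samplings; (3) Grounding DINO anchors the retained descriptions to bounding boxes; (4) SAM2 propagates them as dense pixel masks through the anomaly interval. The VLM budget is fixed at M+1 calls per clip, independent of video length. Result: On UCSD Ped2, GridVAD achieves the highest Pixel-AUROC (77.59), surpassing the partially fine-tuned TAO (75.11); on object-level RBDC it outperforms other zero-shot methods by more than 5x; ablations show that SCC provides a controllable precision-recall tradeoff; efficiency experiments show it is 2.7x more call-efficient than uniform per-frame querying while also producing dense segmentation masks. Conclusion: GridVAD shows that decoupling the VLM into a proposer and pairing it with lightweight dedicated modules is feasible, achieving high-performance, training-free, efficient pixel-level video anomaly detection and suggesting a structured way to apply VLMs to open-world perception tasks. Abstract: Vision-Language Models (VLMs) are powerful open-set reasoners, yet their direct use as anomaly detectors in video surveillance is fragile: without calibrated anomaly priors, they alternate between missed detections and hallucinated false alarms. We argue the problem is not the VLM itself but how it is used. VLMs should function as anomaly proposers, generating open-set candidate descriptions that are then grounded and tracked by purpose-built spatial and temporal modules. We instantiate this propose-ground-propagate principle in GridVAD, a training-free pipeline that produces pixel-level anomaly masks without any domain-specific training. A VLM reasons over stratified grid representations of video clips to generate natural-language anomaly proposals. Self-Consistency Consolidation (SCC) filters hallucinations by retaining only proposals that recur across multiple independent samplings. Grounding DINO anchors each surviving proposal to a bounding box, and SAM2 propagates it as a dense mask through the anomaly interval. The per-clip VLM budget is fixed at M+1 calls regardless of video length, where M can be set according to the proposals needed. On UCSD Ped2, GridVAD achieves the highest Pixel-AUROC (77.59) among all compared methods, surpassing even the partially fine-tuned TAO (75.11), and outperforms other zero-shot approaches on object-level RBDC by over 5x.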
Ablations reveal that SCC provides a controllable precision-recall tradeoff: filtering improves all pixel-level metrics at a modest cost in object-level recall. Efficiency experiments show GridVAD is 2.7x more call-efficient than uniform per-frame VLM querying while additionally producing dense segmentation masks. Code and qualitative video results are available at https://gridvad.github.io.
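
The consensus filter described for Self-Consistency Consolidation can be sketched as a simple vote count. This is a minimal illustration under stated assumptions, not the GridVAD code: the function name `consolidate`, the `min_votes` parameter, and matching proposals by normalized string equality are all hypothetical (a real system would likely need semantic matching of free-form descriptions).

```python
from collections import Counter

def consolidate(samplings, min_votes):
    """Keep proposals that appear in at least `min_votes` independent samplings."""
    votes = Counter()
    for proposals in samplings:
        # Count each proposal at most once per sampling, after light
        # normalization (whitespace and case only; an assumption).
        votes.update({p.strip().lower() for p in proposals})
    return sorted(p for p, n in votes.items() if n >= min_votes)

# Three hypothetical VLM samplings for the same clip.
samplings = [
    ["Cyclist on walkway", "cart on walkway"],
    ["cyclist on walkway"],
    ["cyclist on walkway ", "person flying"],
]
kept = consolidate(samplings, min_votes=2)
```

Raising `min_votes` trades recall for precision, which mirrors the controllable tradeoff reported in the ablations.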

[164] AdaSFormer: Adaptive Serialized Transformers for Monocular Semantic Scene Completion from Indoor Environments

Xuzhi Wang,Xinran Wu,Song Wang,Lingdong Kong,Ziping Zhao

Main category: cs.CV

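TL;DR: This paper proposes AdaSFormer, a serialized transformer framework designed for indoor monocular semantic scene completion (MSSC). With three innovations (an adaptive serialized transformer, center-relative positional encoding, and convolution-modulated layer normalization), it achieves state-of-the-art performance on the NYUv2 and Occ-ScanNet datasets.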

Details Motivation: Indoor monocular semantic scene completion is harder than its outdoor counterpart because of complex spatial layouts and severe occlusions; existing transformers have a high memory cost and struggle to reconstruct fine-grained details. Method: AdaSFormer comprises (1) an adaptive serialized transformer with learnable shifts that dynamically adjust receptive fields; (2) a center-relative positional encoding that captures rich spatial information; (3) a convolution-modulated layer normalization that fuses heterogeneous convolutional and transformer features. Result: State-of-the-art performance on the NYUv2 and Occ-ScanNet datasets. Conclusion: AdaSFormer overcomes the memory and detail-modeling bottlenecks of transformers in indoor MSSC, supporting the effectiveness of the serialized design and of fusing heterogeneous convolutional and transformer features. Abstract: Indoor monocular semantic scene completion (MSSC) is notably more challenging than its outdoor counterpart due to complex spatial layouts and severe occlusions. While transformers are well suited for modeling global dependencies, their high memory cost and difficulty in reconstructing fine-grained details have limited their use in indoor MSSC. To address these limitations, we introduce AdaSFormer, a serialized transformer framework tailored for indoor MSSC. Our model features three key designs: (1) an Adaptive Serialized Transformer with learnable shifts that dynamically adjust receptive fields; (2) a Center-Relative Positional Encoding that captures spatial information richness; and (3) a Convolution-Modulated Layer Normalization that bridges heterogeneous representations between convolutional and transformer features. Extensive experiments on NYUv2 and Occ-ScanNet demonstrate that AdaSFormer achieves state-of-the-art performance. The code is publicly available at: https://github.com/alanWXZ/AdaSFormer.

[165] Knowledge-Guided Failure Prediction: Detecting When Object Detectors Miss Safety-Critical Objects

Jakob Paul Zimmermann,Gerrit Holzbach,David Lerch

Main category: cs.CV

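TL;DR: This paper proposes Knowledge-Guided Failure Prediction (KGFP), a framework for detecting at runtime when an object detector misses safety-critical objects (such as pedestrians), providing reliable warnings by measuring the semantic misalignment between detector features and visual foundation model embeddings.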

Details Motivation: Object detectors can fail silently in safety-critical scenarios (for example, by missing pedestrians), and traditional OOD detection methods cannot directly predict functional failures of the detector itself. Method: KGFP uses a dual-encoder architecture with an angular distance metric to measure semantic misalignment between the detector's internal features and visual foundation model embeddings; when the detector operates outside its competence, or the foundation model encounters novel inputs, the two embeddings diverge and produce a high-angle signal that flags unsafe images. Result: On COCO person detection, using KGFP as a selective-prediction gate raises person recall among accepted images from 64.3% to 84.5% (at 5% FPR), and it clearly outperforms OOD baselines across six COCO-O visual domains. Conclusion: KGFP is an effective and robust runtime monitor for object detectors that reliably identifies missed safety-critical objects and is suitable for risk warning in real deployments. Abstract: Object detectors deployed in safety-critical environments can fail silently, e.g. missing pedestrians, workers, or other safety-critical objects without emitting any warning. Traditional Out Of Distribution (OOD) detection methods focus on identifying unfamiliar inputs, but do not directly predict functional failures of the detector itself. We introduce Knowledge Guided Failure Prediction (KGFP), a representation-based monitoring framework that treats missed safety-critical detections as anomalies to be detected at runtime. KGFP measures semantic misalignment between internal object detector features and visual foundation model embeddings using a dual-encoder architecture with an angular distance metric. A key property is that when either the detector is operating outside its competence or the visual foundation model itself encounters novel inputs, the two embeddings diverge, producing a high-angle signal that reliably flags unsafe images. We compare our novel KGFP method to baseline OOD detection methods. On COCO person detection, applying KGFP as a selective-prediction gate raises person recall among accepted images from 64.3% to 84.5% at 5% False Positive Rate (FPR), and maintains strong performance across six COCO-O visual domains, outperforming OOD baselines by large margins. Our code, models, and features are published at https://gitlab.cc-asp.fraunhofer.de/iosb_public/KGFP.
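
The angular-distance gate can be sketched in a few lines. This is a hedged illustration, not the published KGFP code: it assumes both embeddings have already been projected into a shared space of equal dimension (the role the dual encoder presumably plays), and the names `angular_distance`, `accept_image`, and the threshold value are hypothetical. In practice the threshold would be calibrated on held-out data to hit a target false positive rate.

```python
import math

def angular_distance(u, v):
    """Angle in radians between two non-zero embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / norm))
    return math.acos(cosine)

def accept_image(detector_emb, foundation_emb, threshold):
    """Selective-prediction gate: accept only when the embeddings agree."""
    return angular_distance(detector_emb, foundation_emb) <= threshold

aligned = accept_image([1.0, 0.0], [1.0, 0.1], threshold=0.5)    # small angle
divergent = accept_image([1.0, 0.0], [0.0, 1.0], threshold=0.5)  # right angle
```

Images that fail the gate would be routed to a fallback (for example, human review) rather than trusted to the detector.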

[166] RealRestorer: Towards Generalizable Real-World Image Restoration with Large-Scale Image Editing Models

Yufeng Yang,Xianfang Zeng,Zhangqi Jiang,Fukun Yin,Jianzhuang Liu,Wei Cheng,jinghong lan,Shiyu Liu,Yuqi Peng,Gang YU,Shifeng Chen

Main category: cs.CV

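TL;DR: This paper builds a large-scale dataset covering nine common real-world degradation types and trains a state-of-the-art open-source image restoration model. It also introduces the RealIR-Bench benchmark, with 464 real-world degraded images and tailored evaluation metrics, and achieves state-of-the-art performance among open-source methods.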

Details Motivation: Existing image restoration models are limited by the scale and distribution of their training data and generalize poorly to real-world scenes, while closed-source large models perform well but at high computational and data cost. Method: Build a large-scale real-world degradation dataset (covering nine degradation types), train a high-performance open-source restoration model, and introduce the RealIR-Bench evaluation benchmark (464 real images plus metrics focused on degradation removal and consistency preservation). Result: The proposed model ranks first among open-source methods, achieving state-of-the-art performance; RealIR-Bench provides a standardized evaluation platform for real-world image restoration. Conclusion: High-quality data construction and a targeted benchmark can substantially improve the generalization and practicality of open-source image restoration models in real-world scenes and narrow the gap with closed-source models. Abstract: Image restoration under real-world degradations is critical for downstream tasks such as autonomous driving and object detection. However, existing restoration models are often limited by the scale and distribution of their training data, resulting in poor generalization to real-world scenarios. Recently, large-scale image editing models have shown strong generalization ability in restoration tasks, especially for closed-source models like Nano Banana Pro, which can restore images while preserving consistency. Nevertheless, achieving such performance with those large universal models requires substantial data and computational costs. To address this issue, we construct a large-scale dataset covering nine common real-world degradation types and train a state-of-the-art open-source model to narrow the gap with closed-source alternatives. Furthermore, we introduce RealIR-Bench, which contains 464 real-world degraded images and tailored evaluation metrics focusing on degradation removal and consistency preservation. Extensive experiments demonstrate our model ranks first among open-source methods, achieving state-of-the-art performance.

[167] Challenges in Hyperspectral Imaging for Autonomous Driving: The HSI-Drive Case

Koldo Basterretxea,Jon Gutiérrez-Zaballa,Javier Echanobe

Main category: cs.CV

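TL;DR: This paper analyzes the challenges and technology choices involved in applying hyperspectral imaging (HSI) to autonomous driving (AD) and, drawing on experimental results from the HSI-Drive dataset, discusses the design of custom vision algorithms suited to complex environments and the constraints of embedded platforms.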

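Details Motivation: Address the practical challenges HSI faces in autonomous driving: uncontrolled lighting, wide depth-of-field ranges, fast-changing dynamic scenes, real-time requirements, and limited embedded compute. Method: Analyze several technical approaches explored in existing research on HSI-based vision systems, and evaluate and compare them using experimental results from the latest version of the HSI-Drive dataset. Result: Clarifies the criteria for selecting HSI technologies suited to autonomous driving and the direction for custom vision algorithms that combine spectral and spatial information. Conclusion: HSI shows promise for autonomous driving, but it requires dedicated sensor solutions and lightweight, efficient algorithms designed for its particular constraints; conventional HSI methods cannot be applied directly. Abstract: The use of hyperspectral imaging (HSI) in autonomous driving (AD), while promising, faces many challenges related to the specifics and requirements of this application domain. On the one hand, there are non-controlled and variable lighting conditions, wide depth-of-field ranges, and dynamic scenes with fast-moving objects. On the other hand, there are the requirements for real-time operation and the limited computational resources of embedded platforms. The combination of these factors determines both the criteria for selecting appropriate HSI technologies and the development of custom vision algorithms that leverage the spectral and spatial information obtained from the sensors. In this article, we analyse several techniques explored in the research of HSI-based vision systems with application to AD, using as an example results obtained from experiments using data from the most recent version of the HSI-Drive dataset.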

[168] CHIRP dataset: towards long-term, individual-level, behavioral monitoring of bird populations in the wild

Alex Hoi Hang Chan,Neha Singhal,Onur Kocahan,Andrea Meltzer,Saverio Lubrano,Miyako H. Warrington,Michel Griesser,Fumihiro Kano,Hemal Naik

Main category: cs.CV

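TL;DR: This paper introduces the CHIRP dataset and the CORVID method for individual re-identification and long-term behavioral monitoring of wild birds, evaluating model performance with biologically relevant metrics to bridge the gap between computer vision research and biological applications.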

Details Motivation: Long-term behavioral monitoring of individual animals is essential for conservation and evolutionary biology, but existing computer vision methods still struggle with wild populations, mainly because biologically relevant datasets that support multiple vision tasks are lacking. Method: The authors present the CHIRP dataset (covering re-identification, action recognition, 2D keypoint estimation, object detection, and instance segmentation) and the CORVID method (a probability-based video re-identification pipeline built on segmenting and classifying colored leg rings). Result: CORVID outperforms state-of-the-art re-identification methods on application-specific benchmarks (such as feeding rates and co-occurrence rates); the CHIRP dataset supports multi-task learning and evaluation in realistic settings. Conclusion: The work provides a blueprint for building real-world datasets from ethically approved biological studies and promotes the integration of computer vision with behavioral ecology. Abstract: Long-term behavioral monitoring of individual animals is crucial for studying behavioral changes that occur over different time scales, especially for conservation and evolutionary biology. Computer vision methods have proven to benefit biodiversity monitoring, but automated behavior monitoring in wild populations remains challenging. This stems from the lack of datasets that cover a range of computer vision tasks necessary to extract biologically meaningful measurements of individual animals. Here, we introduce such a dataset (CHIRP) with a new method (CORVID) for individual re-identification of wild birds. The CHIRP (Combining beHaviour, Individual Re-identification and Postures) dataset is curated from a long-term population of wild Siberian jays studied in Swedish Lapland, supporting re-identification (re-id), action recognition, 2D keypoint estimation, object detection, and instance segmentation. In addition to traditional task-specific benchmarking, we introduce application-specific benchmarking with biologically relevant metrics (feeding rates, co-occurrence rates) to evaluate the performance of models in real-world use cases. Finally, we present CORVID (COlouR-based Video re-ID), a novel pipeline for individual identification of birds based on the segmentation and classification of colored leg rings, a widespread approach for visual identification of individual birds. CORVID offers a probability-based id tracking method by matching the detected combination of color rings with a database. We use application-specific benchmarking to show that CORVID outperforms state-of-the-art re-id methods.
We hope this work offers the community a blueprint for curating real-world datasets from ethically approved biological studies to bridge the gap between computer vision research and biological applications.
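
Matching a detected ring combination against a database can be sketched as a small probabilistic lookup. The code below is an assumption-laden illustration, not CORVID itself: the function `match_individual`, the data layout, and the independence of ring classifications are all hypothetical choices made for this example.

```python
def match_individual(ring_probs, database):
    """Score each known bird by how well its ring colors fit the detections.

    ring_probs: one dict per ring position, mapping color -> probability.
    database:   bird id -> tuple of ring colors in the same position order.
    Returns normalized scores per bird (assumes independent ring classifiers).
    """
    scores = {}
    for bird_id, colors in database.items():
        score = 1.0
        for probs, color in zip(ring_probs, colors):
            score *= probs.get(color, 0.0)
        scores[bird_id] = score
    total = sum(scores.values())
    if total == 0.0:
        return {bird_id: 0.0 for bird_id in scores}
    return {bird_id: s / total for bird_id, s in scores.items()}

# Hypothetical database and classifier output for a two-ring combination.
database = {"A": ("red", "blue"), "B": ("red", "green")}
ring_probs = [{"red": 0.9, "white": 0.1}, {"blue": 0.7, "green": 0.3}]
posterior = match_individual(ring_probs, database)
best = max(posterior, key=posterior.get)
```

Keeping the full score distribution, rather than only the top match, is what would let a tracker accumulate evidence for an identity across video frames.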

[169] Beyond the Golden Data: Resolving the Motion-Vision Quality Dilemma via Timestep Selective Training

Xiangyang Luo,Qingyu Li,Yuming Li,Guanbo Huang,Yongjie Zhu,Wenyu Qin,Meng Wang,Pengfei Wan,Shao-Lun Huang

Main category: cs.CV

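TL;DR: This paper proposes Timestep-aware Quality Decoupling (TQD), which dynamically selects, during training, the diffusion timesteps that suit each sample's quality profile. This decouples the inherent conflict between visual quality and motion quality, so that training on quality-imbalanced data alone can surpass conventional training on better data.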

Details Motivation: Video generation models depend on "golden data" that combines high visual quality with high motion quality, but in practice the two are negatively correlated, creating a "Motion-Vision Quality Dilemma" that makes ideal training data hard to obtain. Method: From an analysis of the hierarchical learning dynamics of video diffusion models and a gradient analysis of quality-degraded samples, the authors find that quality-imbalanced data can produce gradients similar to golden data at specific timesteps; on this basis they propose timestep selection in the training process and design TQD, which adapts the sampled timestep distribution to the data type (motion-rich or high visual quality). Result: Trained only on separated, quality-imbalanced data, TQD surpasses conventional training on better data; it also further improves performance when trained on high-quality data, supporting its general effectiveness. Conclusion: Perfect data is not a prerequisite for video generation; a timestep-aware data sampling strategy matched to the model's learning dynamics can ease the quality tradeoff and improve training efficiency and generalization. Abstract: Recent advances in video generation models have achieved impressive results. However, these models heavily rely on the use of high-quality data that combines both high visual quality and high motion quality. In this paper, we identify a key challenge in video data curation: the Motion-Vision Quality Dilemma. We discovered that visual quality and motion intensity inherently exhibit a negative correlation, making it hard to obtain golden data that excels in both aspects. To address this challenge, we first examine the hierarchical learning dynamics of video diffusion models and conduct gradient-based analysis on quality-degraded samples. We discover that quality-imbalanced data can produce gradients similar to golden data at appropriate timesteps. Based on this, we introduce the novel concept of Timestep selection in Training Process. We propose Timestep-aware Quality Decoupling (TQD), which modifies the data sampling distribution to better match the model's learning process. Specifically, the sampling distribution is skewed toward higher timesteps for motion-rich data, while high visual quality data is more likely to be sampled at lower timesteps. Through extensive experiments, we demonstrate that TQD enables training exclusively on separated imbalanced data to achieve performance surpassing conventional training with better data, challenging the necessity of perfect data in video generation. Moreover, our method also boosts model performance when trained on high-quality data, showcasing its effectiveness across different data scenarios.
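
The idea of skewing the timestep distribution by data type can be sketched with a Beta-distributed sampler. This is only an illustration under assumptions: the abstract does not specify the distribution family, so the Beta shapes, the constant `NUM_TIMESTEPS`, and the names `BETA_PARAMS` and `sample_timestep` are all hypothetical.

```python
import random

NUM_TIMESTEPS = 1000  # hypothetical diffusion schedule length

# Hypothetical Beta shapes: (2, 1) leans toward high timesteps,
# (1, 2) leans toward low timesteps.
BETA_PARAMS = {
    "motion_rich": (2.0, 1.0),
    "high_visual_quality": (1.0, 2.0),
}

def sample_timestep(data_type, rng):
    """Draw a training timestep whose distribution depends on the data type."""
    alpha, beta = BETA_PARAMS[data_type]
    u = rng.betavariate(alpha, beta)  # value in [0, 1]
    return min(int(u * NUM_TIMESTEPS), NUM_TIMESTEPS - 1)

rng = random.Random(0)
motion = [sample_timestep("motion_rich", rng) for _ in range(2000)]
visual = [sample_timestep("high_visual_quality", rng) for _ in range(2000)]
mean_motion = sum(motion) / len(motion)
mean_visual = sum(visual) / len(visual)
```

With these shapes, motion-rich clips are mostly seen at noisier timesteps and visually sharp clips at cleaner ones, which matches the direction of the skew described in the abstract.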

[170] BFMD: A Full-Match Badminton Dense Dataset for Dense Shot Captioning

Ning Ding,Keisuke Fujii,Toru Tamaki

Main category: cs.CV

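TL;DR: This paper presents BFMD, the first densely annotated full-match badminton dataset, and builds a VideoMAE-based multimodal captioning framework with a semantic feedback mechanism that improves caption quality and supports match-level analysis of tactical dynamics.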

Details Motivation: Existing badminton datasets mostly consist of short clips or task-specific annotations and lack full-match data with dense multimodal annotations, which makes accurate shot captioning and match-level tactical analysis difficult. Method: Build the BFMD dataset (19 broadcast matches, more than 20 hours, 16,751 hit events, with dense hierarchical annotations) and propose a VideoMAE-based multimodal captioning framework with a semantic feedback mechanism to improve semantic consistency. Result: Multimodal modeling and semantic feedback improve caption quality over RGB-only baselines; the authors also demonstrate BFMD's potential for analyzing the temporal evolution of tactical patterns across full matches. Conclusion: BFMD fills the data gap for full-match multimodal badminton analysis, and the proposed framework improves shot caption quality, providing a new basis for tactical understanding and intelligent assistance. Abstract: Understanding tactical dynamics in badminton requires analyzing entire matches rather than isolated clips. However, existing badminton datasets mainly focus on short clips or task-specific annotations and rarely provide full-match data with dense multimodal annotations. This limitation makes it difficult to generate accurate shot captions and perform match-level analysis. To address this limitation, we introduce the first Badminton Full Match Dense (BFMD) dataset, with 19 broadcast matches (including both singles and doubles) covering over 20 hours of play, comprising 1,687 rallies and 16,751 hit events, each annotated with a shot caption. The dataset provides hierarchical annotations including match segments, rally events, and dense rally-level multimodal annotations such as shot types, shuttle trajectories, player pose keypoints, and shot captions. We develop a VideoMAE-based multimodal captioning framework with a Semantic Feedback mechanism that leverages shot semantics to guide caption generation and improve semantic consistency. Experimental results demonstrate that multimodal modeling and semantic feedback improve shot caption quality over RGB-only baselines. We further showcase the potential of BFMD by analyzing the temporal evolution of tactical patterns across full matches.

[171] Insights on back marking for the automated identification of animals

David Brunner,Marie Bordes,Elisabeth Mayrhuber,Stephan M. Winkler,Viktoria Dorfer,Maciej Oczak

Main category: cs.CV

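TL;DR: This study examines how to design back marks for animals of uniform appearance (such as pigs) to support machine learning-based individual identification. It analyzes the classification behavior of a ResNet-50 model on ten pigs with unique back marks and concludes that mark design must remain robust to motion blur, varied view angles, occlusions, and data augmentation.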

Details Motivation: Existing research offers little guidance on back mark design for uniform-looking species such as pigs, while the rise of machine learning-based monitoring creates a need for design principles that let algorithms recognize marks reliably. Method: Train a ResNet-50 neural network to classify ten pigs with unique back marks, then analyze the model's predictions to assess how robust different mark designs are to motion blur, varied view angles, occlusions, and common data augmentations (colour, flip, crop). Result: Back marks must stay unambiguous under motion blur, varied view angles, and behaviour-induced occlusions, and the design must be compatible with the augmentation strategies commonly used in training; certain design choices noticeably affect recognition performance even in controlled settings. Conclusion: The study provides practical guidance for designing animal back marks for machine learning-based identification, which can improve the accuracy and practicality of individual-level monitoring systems. Abstract: To date, there is little research on how to design back marks to best support individual-level monitoring of uniform looking species like pigs. With the recent surge of machine learning-based monitoring solutions, there is a particular need for guidelines on the design of marks that can be effectively recognised by such algorithms. This study provides valuable insights on effective back mark design, based on the analysis of a machine learning model, trained to distinguish pigs via their back marks. Specifically, a neural network of type ResNet-50 was trained to classify ten pigs with unique back marks. The analysis of the model's predictions highlights the significance of certain design choices, even in controlled settings. Most importantly, the set of back marks must be designed such that each mark remains unambiguous under conditions of motion blur, diverse view angles and occlusions, caused by animal behaviour. Further, the back mark design must consider data augmentation strategies commonly employed during model training, like colour, flip and crop augmentations. The generated insights can support individual-level monitoring in future studies and real-world applications by optimizing back mark design.
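
The interaction between mark design and flip augmentation can be made concrete with a small check. The sketch below is a hypothetical illustration and not part of the study: marks are modelled as tiny character grids, and the helpers `flips` and `ambiguous_pairs` are invented names. Two marks are flagged when one turns into the other under a horizontal flip, a vertical flip, or both, since a flip augmentation would then present one animal's mark with another animal's label.

```python
def flips(mark):
    """Return the horizontally, vertically, and doubly flipped variants."""
    horizontal = tuple(row[::-1] for row in mark)
    vertical = tuple(mark[::-1])
    both = tuple(row[::-1] for row in vertical)
    return {horizontal, vertical, both}

def ambiguous_pairs(marks):
    """List pairs of mark names that become identical under some flip."""
    names = sorted(marks)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if marks[second] in flips(marks[first]):
                pairs.append((first, second))
    return pairs

# Toy 2x2 marks: "X" is painted, "." is blank.
marks = {
    "pig_1": ("X.", ".."),
    "pig_2": (".X", ".."),  # horizontal mirror image of pig_1
    "pig_3": ("XX", ".X"),
}
conflicts = ambiguous_pairs(marks)
```

A mark set with an empty conflict list is safe to train with flip augmentation under this simplified model; otherwise either the marks or the augmentation policy would need to change.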

[172] PAWS: Perception of Articulation in the Wild at Scale from Egocentric Videos

Yihao Wang,Yang Miao,Wenshuai Zhao,Wenyan Yang,Zihan Wang,Joni Pajarinen,Luc Van Gool,Danda Pani Paudel,Juho Kannala,Xi Wang,Arno Solin

Main category: cs.CV

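TL;DR: This paper proposes PAWS, a method that extracts object articulations directly from large-scale in-the-wild egocentric videos without high-quality 3D annotations, and shows clear benefits for downstream tasks.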

Details Motivation: Existing learning-based methods rely heavily on supervised training with high-quality 3D data and manual annotations, which limits scalability and diversity. Method: PAWS extracts object articulations directly from hand-object interactions in large-scale in-the-wild egocentric videos. Result: It achieves significant improvements over baselines on public datasets such as HD-EPIC and Arti4D, and the extracted articulations prove useful for downstream tasks such as fine-tuning 3D articulation prediction models and robot manipulation. Conclusion: PAWS is a scalable approach to articulation perception that avoids costly annotation and advances 3D scene understanding for robotics, simulation, and animation. Abstract: Articulation perception aims to recover the motion and structure of articulated objects (e.g., drawers and cupboards), and is fundamental to 3D scene understanding in robotics, simulation, and animation. Existing learning-based methods rely heavily on supervised training with high-quality 3D data and manual annotations, limiting scalability and diversity. To address this limitation, we propose PAWS, a method that directly extracts object articulations from hand-object interactions in large-scale in-the-wild egocentric videos. We evaluate our method on public datasets, including HD-EPIC and Arti4D, achieving significant improvements over baselines. We further demonstrate that the extracted articulations benefit downstream tasks, including fine-tuning 3D articulation prediction models and enabling robot manipulation. See the project website at https://aaltoml.github.io/PAWS/.

[173] Towards Comprehensive Real-Time Scene Understanding in Ophthalmic Surgery through Multimodal Image Fusion

Nikolo Rohrmoser,Ghazal Ghazaei,Michael Sommersperger,Nassir Navab

Main category: cs.CV

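TL;DR: This paper proposes a network for real-time multimodal feature fusion (operating microscope OPMI plus intraoperative OCT) in vitreoretinal surgery. It performs instrument detection, keypoint localization, and tool-tissue distance estimation, markedly improves distance estimation accuracy at close range (below 1 mm), and runs in real time.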

Details Motivation: In ophthalmic surgery, OPMI and iOCT are complementary modalities that have not yet been fused effectively; multimodal information is needed to improve surgical scene understanding, especially accurate, real-time estimation of the relationship between instrument and tissue. Method: A multimodal, temporally aware, real-time capable network: YoloNAS and a CNN encoder extract OPMI and iOCT features respectively, a cross-attention module fuses them, and a region-based recurrent module models temporal coherence. Result: 95.79% mAP50; the instrument-to-retina distance error drops from 284 μm (single modality) to 33 μm (multimodal) at distances below 1 mm; each frame takes 22.5 ms to process, which meets real-time requirements. Conclusion: Multimodal feature fusion can substantially improve the accuracy and robustness of multi-task prediction, and a tailored network design can deliver both accuracy and real-time performance; the work shows the potential of this approach for image-guided ophthalmic surgery and also highlights key challenges on the way to more reliable, consistent, and comprehensive surgical scene understanding. Abstract: Purpose: The integration of multimodal imaging into operating rooms paves the way for comprehensive surgical scene understanding. In ophthalmic surgery, by now, two complementary imaging modalities are available: operating microscope (OPMI) imaging and real-time intraoperative optical coherence tomography (iOCT). This first work toward temporal OPMI and iOCT feature fusion demonstrates the potential of multimodal image processing for multi-head prediction through the example of precise instrument tracking in vitreoretinal surgery. Methods: We propose a multimodal, temporal, real-time capable network architecture to perform joint instrument detection, keypoint localization, and tool-tissue distance estimation. Our network design integrates a cross-attention fusion module to merge OPMI and iOCT image features, which are efficiently extracted via a YoloNAS and a CNN encoder, respectively. Furthermore, a region-based recurrent module leverages temporal coherence. Results: Our experiments demonstrate reliable instrument localization and keypoint detection (95.79% mAP50) and show that the incorporation of iOCT significantly improves tool-tissue distance estimation, while achieving a real-time processing time of 22.5 ms per frame. Especially for close distances to the retina (below 1 mm), the distance estimation accuracy improved from 284 $\mu$m (OPMI only) to 33 $\mu$m (multimodal). Conclusion: Feature fusion of multimodal imaging can enhance multi-task prediction accuracy compared to single-modality processing and real-time processing performance can be achieved through tailored network design.
While our results demonstrate the potential of multi-modal processing for image-guided vitreoretinal surgery, they also underline key challenges that motivate future research toward more reliable, consistent, and comprehensive surgical scene understanding.

[174] GeoHeight-Bench: Towards Height-Aware Multimodal Reasoning in Remote Sensing

Xuran Hu,Zhitong Xiong,Zhongcheng Hong,Yifang Ban,Xiaoxiang Zhu,Wufan Zhao

Main category: cs.CV

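TL;DR: This paper proposes a framework for height-aware remote sensing understanding, comprising a data generation pipeline, two new benchmarks (GeoHeight-Bench and GeoHeight-Bench+), and GeoHeightChat, the first height-aware remote sensing LMM baseline, which mitigates the blind spot of existing large models in the vertical dimension.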

Details Motivation: Current large multimodal models (LMMs) in Earth observation generally neglect the critical "vertical" dimension, which limits their reasoning in complex remote sensing geometries and disaster scenarios, where physical spatial structure often matters more than planar texture. Method: Build a scalable, vision-language-model (VLM) driven data generation pipeline that combines systematic prompt engineering with metadata extraction; use it to construct two complementary benchmarks, GeoHeight-Bench (relative height analysis) and the harder GeoHeight-Bench+ (holistic, terrain-aware reasoning); and propose GeoHeightChat, the first height-aware remote sensing LMM baseline, which combines visual semantics with implicitly injected height geometric features. Result: GeoHeightChat mitigates the model's "vertical blind spot", supports the case for height perception, and enables interactive height reasoning on top of existing optical models. Conclusion: Introducing the height dimension is important for improving remote sensing understanding; the proposed framework and baseline provide systematic evaluation tools and a technical path for future height-aware remote sensing modeling. Abstract: Current Large Multimodal Models (LMMs) in Earth Observation typically neglect the critical "vertical" dimension, limiting their reasoning capabilities in complex remote sensing geometries and disaster scenarios where physical spatial structures often outweigh planar visual textures. To bridge this gap, we introduce a comprehensive evaluation framework dedicated to height-aware remote sensing understanding. First, to overcome the severe scarcity of annotated data, we develop a scalable, VLM-driven data generation pipeline utilizing systematic prompt engineering and metadata extraction. This pipeline constructs two complementary benchmarks: GeoHeight-Bench for relative height analysis, and a more challenging GeoHeight-Bench+ for holistic, terrain-aware reasoning. Furthermore, to validate the necessity of height perception, we propose GeoHeightChat, the first height-aware remote sensing LMM baseline. Serving as a strong proof of concept, our baseline demonstrates that synergizing visual semantics with implicitly injected height geometric features effectively mitigates the "vertical blind spot", successfully unlocking a new paradigm of interactive height reasoning in existing optical models.

[175] Hierarchy-Guided Multimodal Representation Learning for Taxonomic Inference

Sk Miraj Ahmed,Xi Yu,Yunqi Li,Yuewei Lin,Wei Xu

Main category: cs.CV

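TL;DR: This paper proposes two hierarchy-aware multimodal learning methods (CLiBD-HiR and CLiBD-HiR-Fuse) that use hierarchical information regularization and a lightweight fusion predictor to substantially improve biodiversity classification accuracy, with especially strong results when DNA data is missing or corrupted.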

Details Motivation: Existing methods treat taxonomy as a flat label space and ignore the hierarchical structure of biological classification, which hurts robustness under noise and missing modalities. Method: The authors propose CLiBD-HiR, which introduces Hierarchical Information Regularization (HiR) to shape embedding geometry, and CLiBD-HiR-Fuse, which adds a lightweight fusion predictor that supports single-modality or joint inference and is resilient to modality corruption. Result: On large-scale biodiversity benchmarks, classification accuracy improves by over 14 percent compared with strong multimodal baselines, with particularly large gains under partial or corrupted DNA conditions. Conclusion: Explicitly modeling the biological taxonomic hierarchy, combined with a flexible fusion strategy, is key to building practical biodiversity foundation models. Abstract: Accurate biodiversity identification from large-scale field data is a foundational problem with direct impact on ecology, conservation, and environmental monitoring. In practice, the core task is taxonomic prediction - inferring order, family, genus, or species from imperfect inputs such as specimen images, DNA barcodes, or both. Existing multimodal methods often treat taxonomy as a flat label space and therefore fail to encode the hierarchical structure of biological classification, which is critical for robustness under noise and missing modalities. We present two end-to-end variants for hierarchy-aware multimodal learning: CLiBD-HiR, which introduces Hierarchical Information Regularization (HiR) to shape embedding geometry across taxonomic levels, yielding structured and noise-robust representations; and CLiBD-HiR-Fuse, which additionally trains a lightweight fusion predictor that supports image-only, DNA-only, or joint inference and is resilient to modality corruption. Across large-scale biodiversity benchmarks, our approach improves taxonomic classification accuracy by over 14 percent compared to strong multimodal baselines, with particularly large gains under partial and corrupted DNA conditions. These results highlight that explicitly encoding biological hierarchy, together with flexible fusion, is key for practical biodiversity foundation models.

[176] UNIC: Neural Garment Deformation Field for Real-time Clothed Character Animation

Chengfeng Zhao,Junbo Qi,Yulou Liu,Zhiyang Dou,Minchen Li,Taku Komura,Ziwei Liu,Wenping Wang,Yuan Liu

Main category: cs.CV

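TL;DR: This paper proposes UNIC, which uses an instance-specific neural deformation field to drive garment deformation on virtual characters in real time, avoiding topology handling and improving smoothness and quality, and is suited to interactive settings such as games.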

Details Motivation: Physics simulation is computationally expensive and hard to run in real time; existing learning-based methods built on graph neural networks struggle with garment meshes that have complex topologies. Method: UNIC, a method based on a neural deformation field, learns an instance-specific mapping from 3D points to deformation offsets and only needs to generalize to new motion sequences, not to new garments. Result: Experiments on a variety of garment meshes show that UNIC outperforms baseline methods in both deformation quality and runtime efficiency and supports real-time use. Conclusion: With an instance-specific neural deformation field, UNIC delivers high-quality, efficient, topology-independent real-time garment animation with practical potential. Abstract: Simulating physically realistic garment deformations is an essential task for virtual immersive experience, which is often achieved by physics simulation methods. However, these methods are typically time-consuming, computationally demanding, and require costly hardware, which is not suitable for real-time applications. Recent learning-based methods tried to resolve this problem by training graph neural networks to learn the garment deformation on vertices, which, however, fail to capture the intricate deformation of garment meshes with complex topologies. In this paper, we introduce a novel neural deformation field-based method, named UNIC, to animate the garments of an avatar in real time, given the motion sequences. Our key idea is to learn the instance-specific neural deformation field to animate the garment meshes. Such an instance-specific learning scheme does not require UNIC to generalize to new garments but only to new motion sequences, which greatly reduces the difficulty in training and improves the deformation quality. Moreover, neural deformation fields map the 3D points to their deformation offsets, which not only avoids handling topologies of the complex garments but also injects a natural smoothness constraint in the deformation learning. Extensive experiments have been conducted on various kinds of garment meshes to demonstrate the effectiveness and efficiency of UNIC over baseline methods, making it potentially practical and useful in real-world interactive applications like video games.

[177] DeepFAN, a transformer-based deep learning model for human-artificial intelligence collaborative assessment of incidental pulmonary nodules in CT scans: a multi-reader, multi-case trial

Zhenchen Zhu,Ge Hu,Weixiong Tan,Kai Gao,Chao Sun,Zhen Zhou,Kepei Xu,Wei Han,Meixia Shang,Xiaoming Qiu,Yiqing Tan,Jinhua Wang,Zhoumeng Ying,Li Peng,Wei Song,Lan Song,Zhengyu Jin,Nan Hong,Yizhou Yu

Main category: cs.CV

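TL;DR: This paper presents DeepFAN, a transformer-based deep learning model for classifying pulmonary nodules as benign or malignant, and validates how much it helps junior radiologists in a multi-center clinical trial.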

Details Motivation: Current deep learning methods for benign-versus-malignant nodule classification do not fully integrate global and local features, and most lack clinical-trial validation. Method: The authors develop the transformer-based DeepFAN model, train it on more than 10,000 pathology-confirmed nodules, and run a multi-reader, multi-case clinical trial to evaluate its value as a diagnostic aid. Result: DeepFAN reaches an AUC of 0.939 on the internal test set and 0.954 on the clinical trial dataset; with its assistance, all diagnostic metrics of the 12 readers improve significantly, and nodule-level inter-reader agreement rises from fair to moderate. Conclusion: DeepFAN effectively assists junior radiologists and may help homogenize diagnostic quality and reduce unnecessary follow-up of indeterminate pulmonary nodules. Abstract: The widespread adoption of CT has notably increased the number of detected lung nodules. However, current deep learning methods for classifying benign and malignant nodules often fail to comprehensively integrate global and local features, and most of them have not been validated through clinical trials. To address this, we developed DeepFAN, a transformer-based model trained on over 10K pathology-confirmed nodules and further conducted a multi-reader, multi-case clinical trial to evaluate its efficacy in assisting junior radiologists. DeepFAN achieved diagnostic area under the curve (AUC) of 0.939 (95% CI 0.930-0.948) on an internal test set and 0.954 (95% CI 0.934-0.973) on the clinical trial dataset involving 400 cases across three independent medical institutions. Explainability analysis indicated higher contributions from global than local features. Twelve readers' average performance significantly improved by 10.9% (95% CI 8.3%-13.5%) in AUC, 10.0% (95% CI 8.9%-11.1%) in accuracy, 7.6% (95% CI 6.1%-9.2%) in sensitivity, and 12.6% (95% CI 10.9%-14.3%) in specificity (P<0.001 for all). Nodule-level inter-reader diagnostic consistency improved from fair to moderate (overall k: 0.313 vs. 0.421; P=0.019). In conclusion, DeepFAN effectively assisted junior radiologists and may help homogenize diagnostic quality and reduce unnecessary follow-up of indeterminate pulmonary nodules. Chinese Clinical Trial Registry: ChiCTR2400084624.

[178] Demographic Fairness in Multimodal LLMs: A Benchmark of Gender and Ethnicity Bias in Face Verification

Ünsal Öztürk,Hatef Otroshi Shahreza,Sébastien Marcel

Main category: cs.CV

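TL;DR: This paper evaluates the accuracy and fairness of nine open-source multimodal large language models (MLLMs) on face verification and finds that the face-specialised FaceLLM-8B substantially outperforms general-purpose MLLMs; bias patterns vary across models and benchmarks, and high accuracy does not imply high fairness.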

Details Motivation: To examine the demographic fairness (across ethnicity and gender) of MLLMs used as face verification systems, which had not previously been studied systematically. Method: Nine open-source MLLMs from six model families, with 2B to 8B parameters, are benchmarked on the IJB-C and RFW face verification protocols; the Equal Error Rate (EER) and the True Match Rate (TMR) at specific false match rates are computed separately for four ethnicity groups and two gender groups, and four FMR-based fairness metrics quantify disparities between groups. Result: FaceLLM-8B, the only face-specialised model, substantially outperforms general-purpose MLLMs on both benchmarks; bias patterns differ from those of traditional face recognition systems and vary by model and benchmark; the most accurate models are not necessarily the fairest, and some low-accuracy models appear fair only because their error rates are uniformly high. Conclusion: MLLMs show significant and complex fairness problems in face verification that call for dedicated design and evaluation; overall accuracy cannot stand in for a fairness assessment, and fine-grained analysis should combine multiple metrics. Abstract: Multimodal Large Language Models (MLLMs) have recently been explored as face verification systems that determine whether two face images are of the same person. Unlike dedicated face recognition systems, MLLMs approach this task through visual prompting and rely on general visual and reasoning abilities. However, the demographic fairness of these models remains largely unexplored. In this paper, we present a benchmarking study that evaluates nine open-source MLLMs from six model families, ranging from 2B to 8B parameters, on the IJB-C and RFW face verification protocols across four ethnicity groups and two gender groups. We measure verification accuracy with the Equal Error Rate and True Match Rate at multiple operating points per demographic group, and we quantify demographic disparity with four FMR-based fairness metrics. Our results show that FaceLLM-8B, the only face-specialised model in our study, substantially outperforms general-purpose MLLMs on both benchmarks. The bias patterns we observe differ from those commonly reported for traditional face recognition, with different groups being most affected depending on the benchmark and the model. We also note that the most accurate models are not necessarily the fairest and that models with poor overall accuracy can appear fair simply because they produce uniformly high error rates across all demographic groups.
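
The point that a weak model can look fair is easy to see with a toy calculation. The sketch below is only an illustration: the abstract does not name its four FMR-based metrics, so the max/min FMR ratio, the function names, and the scores are all assumptions made for this example.

```python
def false_match_rate(impostor_scores, threshold):
    """Fraction of impostor (different-identity) pairs accepted at `threshold`."""
    if not impostor_scores:
        raise ValueError("need at least one impostor score")
    accepted = sum(1 for score in impostor_scores if score >= threshold)
    return accepted / len(impostor_scores)


def fmr_disparity(impostor_scores_by_group, threshold):
    """Per-group FMR at one shared threshold, plus a max/min ratio.

    The ratio is an illustrative disparity measure (an assumption of this
    sketch), not necessarily one of the metrics used in the paper.
    A ratio of 1.0 means every group has the same FMR.
    """
    fmrs = {
        group: false_match_rate(scores, threshold)
        for group, scores in impostor_scores_by_group.items()
    }
    worst, best = max(fmrs.values()), min(fmrs.values())
    if worst == 0:
        ratio = 1.0  # no false matches in any group
    elif best == 0:
        ratio = float("inf")
    else:
        ratio = worst / best
    return fmrs, ratio


# Made-up scores. A weak verifier accepts half of all impostor pairs in both groups.
weak = {"group_a": [0.9, 0.8, 0.2, 0.1], "group_b": [0.95, 0.7, 0.3, 0.05]}
# A stronger verifier makes fewer errors, but they fall unevenly across groups.
strong = {
    "group_a": [0.9, 0.6] + [0.1] * 8,
    "group_b": [0.7] + [0.1] * 9,
}
weak_fmrs, weak_ratio = fmr_disparity(weak, threshold=0.5)
strong_fmrs, strong_ratio = fmr_disparity(strong, threshold=0.5)
```

In this toy data the weak verifier has a much higher FMR in every group yet a perfectly even ratio, while the stronger verifier has lower error rates that are split unevenly, which is the pattern the abstract warns about.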

[179] LanteRn: Latent Visual Structured Reasoning

André G. Viveiros,Nuno Gonçalves,Matthias Lindemann,André Martins

Main category: cs.CV

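TL;DR: This paper proposes LanteRn, a framework that lets large multimodal models reason in latent space by interleaving language with compact visual representations, avoiding pixel-level computation and external tools; two-stage training (supervised fine-tuning followed by reinforcement learning) yields gains on several visual reasoning benchmarks.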

Details Motivation: Current large multimodal models are limited in visual reasoning, especially fine-grained spatial and visual understanding, and usually just convert visual content into text; existing improvements either depend on external modules or reason directly in pixel space at high computational cost. Method: LanteRn adds to a vision-language transformer the ability to generate and attend to continuous visual "thought" embeddings; training has two stages: supervised fine-tuning to ground visual features in latent states, then reinforcement learning to align latent reasoning with task objectives. Result: On three perception-centric benchmarks (VisCoT, V*, and Blink), LanteRn consistently improves visual grounding and fine-grained reasoning. Conclusion: Multimodal reasoning in the model's internal latent space is a more efficient and promising direction. Abstract: While language reasoning models excel in many tasks, visual reasoning remains challenging for current large multimodal models (LMMs). As a result, most LMMs default to verbalizing perceptual content into text, a strong limitation for tasks requiring fine-grained spatial and visual understanding. While recent approaches take steps toward thinking with images by invoking tools or generating intermediate images, they either rely on external modules, or incur unnecessary computation by reasoning directly in pixel space. In this paper, we introduce LanteRn, a framework that enables LMMs to interleave language with compact latent visual representations, allowing visual reasoning to occur directly in latent space. LanteRn augments a vision-language transformer with the ability to generate and attend to continuous visual thought embeddings during inference. We train the model in two stages: supervised fine-tuning to ground visual features in latent states, followed by reinforcement learning to align latent reasoning with task-level utility. We evaluate LanteRn on three perception-centric benchmarks (VisCoT, V*, and Blink), observing consistent improvements in visual grounding and fine-grained reasoning. These results suggest that internal latent representations provide a promising direction for more efficient multimodal reasoning.

[180] Designing Any Imaging System from Natural Language: Agent-Constrained Composition over a Finite Primitive Basis

Chengshuai Yang

Main category: cs.CV

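TL;DR: This paper introduces the spec.md specification format and three autonomous agents (Plan, Judge, Execute) that turn a one-sentence natural-language description into a validated forward model, with a design-to-real error theorem that decomposes reconstruction error; quality matches expert libraries.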

Details Motivation: Designing computational imaging systems depends on specialists, takes a long time, and has a high barrier to entry; the goal is to let the broader scientific community prototype imaging instruments. Method: A structured specification format, spec.md, and three autonomous agents (Plan, Judge, Execute) are combined with the design-to-real error theorem to translate a natural-language description into a forward model with guaranteed error bounds. Result: On 6 real-data modalities covering all 5 carrier families, the automated pipeline matches expert-library quality (98.1 ± 4.2%); it also produces 10 novel designs that chain primitives from 3D to 5D, showing compositional reach beyond single-modality tools. Conclusion: The framework substantially lowers the barrier to designing computational imaging systems and supports fast, reliable, and scalable cross-modality prototyping. Abstract: Designing a computational imaging system -- selecting operators, setting parameters, validating consistency -- requires weeks of specialist effort per modality, creating an expertise bottleneck that excludes the broader scientific community from prototyping imaging instruments. We introduce spec.md, a structured specification format, and three autonomous agents -- Plan, Judge, and Execute -- that translate a one-sentence natural-language description into a validated forward model with bounded reconstruction error. A design-to-real error theorem decomposes total reconstruction error into five independently bounded terms, each linked to a corrective action. On 6 real-data modalities spanning all 5 carrier families, the automated pipeline matches expert-library quality (98.1 +/- 4.2%). Ten novel designs -- composing primitives into chains from 3D to 5D -- demonstrate compositional reach beyond any single-modality tool.

[181] Just Zoom In: Cross-View Geo-Localization via Autoregressive Zooming

Yunus Talha Erzurumlu,Jiyong Kwag,Alper Yilmaz

Main category: cs.CV

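TL;DR: This paper proposes Just Zoom In, which performs cross-view geo-localization (CVGL) by autoregressive coarse-to-fine zooming over a city-scale satellite map, removing the reliance on contrastive learning and hard negative mining, and reaches state-of-the-art performance on a newly built realistic benchmark.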

Details Motivation: Existing CVGL methods cast the problem as image retrieval under contrastive learning, which ties them to large-batch training and hard negative mining and ignores both map geometry and the coverage mismatch between street-view and overhead imagery (for example, key landmarks can fall outside a fixed satellite crop); retrieval targets become ambiguous and explicit spatial reasoning is difficult. Method: Starting from a coarse satellite view, the Just Zoom In framework takes a short sequence of autoregressive zoom decisions to reach a terminal satellite cell at the target resolution, with no contrastive loss or hard negative mining; the authors also build a realistic benchmark from crowd-sourced street views and high-resolution satellite imagery. Result: On the new benchmark, Recall@1 within 50 m improves by 5.5% and Recall@1 within 100 m by 9.6% over the strongest contrastive-retrieval baseline, a state-of-the-art result. Conclusion: Sequential coarse-to-fine spatial reasoning is more effective than conventional embedding-based retrieval and offers a new formulation for CVGL. Abstract: Cross-view geo-localization (CVGL) estimates a camera's location by matching a street-view image to geo-referenced overhead imagery, enabling GPS-denied localization and navigation. Existing methods almost universally formulate CVGL as an image-retrieval problem in a contrastively trained embedding space. This ties performance to large batches and hard negative mining, and it ignores both the geometric structure of maps and the coverage mismatch between street-view and overhead imagery. In particular, salient landmarks visible from the street view can fall outside a fixed satellite crop, making retrieval targets ambiguous and limiting explicit spatial inference over the map. We propose Just Zoom In, an alternative formulation that performs CVGL via autoregressive zooming over a city-scale overhead map. Starting from a coarse satellite view, the model takes a short sequence of zoom-in decisions to select a terminal satellite cell at a target resolution, without contrastive losses or hard negative mining. We further introduce a realistic benchmark with crowd-sourced street views and high-resolution satellite imagery that reflects real capture conditions. On this benchmark, Just Zoom In achieves state-of-the-art performance, improving Recall@1 within 50 m by 5.5% and Recall@1 within 100 m by 9.6% over the strongest contrastive-retrieval baseline. These results demonstrate the effectiveness of sequential coarse-to-fine spatial reasoning for cross-view geo-localization.
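
To make the idea of "a short sequence of zoom-in decisions" concrete, here is a minimal sketch of how such a sequence could address a terminal map cell. The 2x2 split per step, the quadrant numbering, and the function name are assumptions for illustration; the abstract does not specify the subdivision scheme, and in the actual method the decisions would come from a learned autoregressive model, not a fixed list.

```python
def zoom_to_cell(bounds, decisions):
    """Apply a sequence of quadrant choices to a bounding box.

    bounds: (x_min, y_min, x_max, y_max) of the coarse map.
    decisions: iterable of ints in {0, 1, 2, 3}, meaning
        0 = top-left, 1 = top-right, 2 = bottom-left, 3 = bottom-right
        (y grows downward, as in image coordinates).
    Each decision halves the box along both axes (a hypothetical 2x2 split).
    """
    x_min, y_min, x_max, y_max = bounds
    for choice in decisions:
        if choice not in (0, 1, 2, 3):
            raise ValueError(f"invalid quadrant: {choice}")
        x_mid = (x_min + x_max) / 2
        y_mid = (y_min + y_max) / 2
        if choice in (0, 2):   # left half
            x_max = x_mid
        else:                  # right half
            x_min = x_mid
        if choice in (0, 1):   # top half
            y_max = y_mid
        else:                  # bottom half
            y_min = y_mid
    return (x_min, y_min, x_max, y_max)


# Three decisions on a 1024-unit map select one 128-unit cell.
city = (0.0, 0.0, 1024.0, 1024.0)
cell = zoom_to_cell(city, [0, 3, 1])
```

Under this assumed scheme, n decisions pick one of 4^n cells, so the output space grows exponentially while the decision sequence stays short.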

[182] LEMMA: Laplacian pyramids for Efficient Marine SeMAntic Segmentation

Ishaan Gakhar,Laven Srivastava,Sankarshanaa Sagaram,Aditya Kasliwal,Ujjwal Verma

Main category: cs.CV

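TL;DR: This paper proposes LEMMA, a lightweight semantic segmentation model designed for resource-constrained marine remote sensing; it uses Laplacian pyramids to strengthen edge recognition, sharply reducing computational cost while keeping accuracy high.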

Details Motivation: Existing marine semantic segmentation methods based on deep CNNs and transformers are computationally expensive and resource-hungry, which makes real-time, low-cost use at sea impractical. Method: LEMMA uses Laplacian pyramids to inject edge information early in feature extraction, avoiding expensive feature-map computation in deeper layers and thereby reducing model complexity. Result: Compared with existing models, LEMMA reduces the parameter count by up to 71x, GFLOPs by up to 88.5%, and inference time by up to 84.65%; it reaches 93.42% IoU on the Oil Spill dataset and 98.97% mIoU on Mastr1325. Conclusion: LEMMA achieves state-of-the-art marine remote sensing segmentation with far lower compute requirements, making it well suited to real-world deployment. Abstract: Semantic segmentation in marine environments is crucial for the autonomous navigation of unmanned surface vessels (USVs) and coastal Earth Observation events such as oil spills. However, existing methods, often relying on deep CNNs and transformer-based architectures, face challenges in deployment due to their high computational costs and resource-intensive nature. These limitations hinder the practicality of real-time, low-cost applications in real-world marine settings. To address this, we propose LEMMA, a lightweight semantic segmentation model designed specifically for accurate remote sensing segmentation under resource constraints. The proposed architecture leverages Laplacian Pyramids to enhance edge recognition, a critical component for effective feature extraction in complex marine environments for disaster response, environmental surveillance, and coastal monitoring. By integrating edge information early in the feature extraction process, LEMMA eliminates the need for computationally expensive feature map computations in deeper network layers, drastically reducing model size, complexity and inference time. LEMMA demonstrates state-of-the-art performance across datasets captured from diverse platforms while reducing trainable parameters and computational requirements by up to 71x, GFLOPs by up to 88.5%, and inference time by up to 84.65%, as compared to existing models. Experimental results highlight its effectiveness and real-world applicability, including 93.42% IoU on the Oil Spill dataset and 98.97% mIoU on Mastr1325.
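
For readers unfamiliar with Laplacian pyramids, the sketch below shows why they expose edges: each level stores what is lost when the image is downsampled and upsampled again, which is concentrated at intensity changes. This is a simplified illustration, not LEMMA's implementation: it assumes 2x2 average pooling and nearest-neighbour upsampling in place of the usual Gaussian filtering, and requires image sides divisible by 2**levels.

```python
def downsample(img):
    """Halve resolution by averaging each 2x2 block (stand-in for blur + subsample)."""
    h, w = len(img), len(img[0])
    return [
        [(img[r][c] + img[r][c + 1] + img[r + 1][c] + img[r + 1][c + 1]) / 4.0
         for c in range(0, w, 2)]
        for r in range(0, h, 2)
    ]


def upsample(img):
    """Double resolution by repeating every pixel (nearest neighbour)."""
    out = []
    for row in img:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def laplacian_pyramid(img, levels):
    """Return [detail_0, ..., detail_{levels-1}, coarsest_residual]."""
    current = [[float(v) for v in row] for row in img]
    pyramid = []
    for _ in range(levels):
        smaller = downsample(current)
        coarse = upsample(smaller)
        # Detail band: what the coarse approximation misses (mostly edges).
        detail = [[a - b for a, b in zip(row_a, row_b)]
                  for row_a, row_b in zip(current, coarse)]
        pyramid.append(detail)
        current = smaller
    pyramid.append(current)
    return pyramid


def reconstruct(pyramid):
    """Invert the pyramid by adding each detail band back, coarse to fine."""
    current = pyramid[-1]
    for detail in reversed(pyramid[:-1]):
        coarse = upsample(current)
        current = [[a + b for a, b in zip(row_a, row_b)]
                   for row_a, row_b in zip(detail, coarse)]
    return current


# A 4x4 image with a vertical edge between the first and second columns.
image = [[0, 8, 8, 8] for _ in range(4)]
pyramid = laplacian_pyramid(image, levels=2)
restored = reconstruct(pyramid)
```

In this example the finest detail band is nonzero only in the two columns around the edge, and adding the bands back recovers the input.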

[183] Wan-Weaver: Interleaved Multi-modal Generation via Decoupled Training

Jinbo Xing,Zeyinzi Jiang,Yuxiang Tuo,Chaojie Mao,Xiaotang Gai,Xi Chen,Jingfeng Zhang,Yulin Pan,Zhen Han,Jie Xiao,Keyu Yan,Chenwei Xie,Chongyang Zhong,Kai Zhu,Tong Shen,Lianghua Huang,Yu Liu,Yujiu Yang

Main category: cs.CV

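TL;DR: This paper proposes Wan-Weaver, a framework that decomposes interleaved generation into textual planning and visual consistency modeling; a planner is trained on textual-proxy interleaved data and a visualizer on reference-guided image data, giving long-range textual coherence and visual consistency without any real interleaved training data.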

Details Motivation: Existing unified multimodal models accept multimodal inputs but struggle to generate interleaved (mixed text-and-image) content, mainly because interleaved training data is scarce and long-range cross-modal context is hard to model. Method: A two-part framework: (1) a planner learns to produce dense textual descriptions that stand in for visual content; (2) a visualizer synthesizes images from those descriptions. Large-scale textual-proxy interleaved data is constructed to train the planner, and reference-guided image data is curated to train the visualizer. Result: Without any real interleaved data, Wan-Weaver shows long-range textual coherence and visual consistency, clearly outperforms existing methods on the authors' multi-dimensional interleaved-generation benchmark, and also has strong task reasoning and generation ability. Conclusion: Decomposed modeling and proxy-data construction can break through the data bottleneck of interleaved generation; Wan-Weaver demonstrates the effectiveness and generality of this approach. Abstract: Recent unified models have made unprecedented progress in both understanding and generation. However, while most of them accept multi-modal inputs, they typically produce only single-modality outputs. This challenge of producing interleaved content is mainly due to training data scarcity and the difficulty of modeling long-range cross-modal context. To address this issue, we decompose interleaved generation into textual planning and visual consistency modeling, and introduce a framework consisting of a planner and a visualizer. The planner produces dense textual descriptions for visual content, while the visualizer synthesizes images accordingly. Under this guidance, we construct large-scale textual-proxy interleaved data (where visual content is represented in text) to train the planner, and curate reference-guided image data to train the visualizer. These designs give rise to Wan-Weaver, which exhibits emergent interleaved generation ability with long-range textual coherence and visual consistency. Meanwhile, the integration of diverse understanding and generation data into planner training enables Wan-Weaver to achieve robust task reasoning and generation proficiency. To assess the model's capability in interleaved generation, we further construct a benchmark that spans a wide range of use cases across multiple dimensions. Extensive experiments demonstrate that, even without access to any real interleaved data, Wan-Weaver achieves superior performance over existing methods.

[184] TRACE: Object Motion Editing in Videos with First-Frame Trajectory Guidance

Quynh Phung,Long Mai,Cusuh Ham,Feng Liu,Jia-Bin Huang,Aniruddha Mahapatra

Main category: cs.CV

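TL;DR: This paper proposes Trace, a framework for editing object motion trajectories in videos: the user designs the target trajectory in a single frame and the system generates a temporally consistent edited video, with no need for complex point tracks, including in scenes with camera motion.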

Details Motivation: Existing video editing methods mostly edit appearance or rely on point tracks for trajectory control, but point tracks are hard for users to supply in videos with camera motion, which limits practicality and ease of use. Method: Trace uses a two-stage pipeline: (1) a cross-view motion transformation module maps the path designed in the first frame to per-frame bounding-box trajectories that account for camera motion; (2) a motion-conditioned video re-synthesis module regenerates the target object along those trajectories while preserving the rest of the original video. Result: Experiments on diverse real-world videos show that the method beats recent image-to-video and video-to-video methods in coherence, realism, and controllability of motion edits. Conclusion: Trace offers a practical, easy-to-use, and robust approach to object-centric motion editing, particularly for complex videos with camera motion. Abstract: We study object motion path editing in videos, where the goal is to alter a target object's trajectory while preserving the original scene content. Unlike prior video editing methods that primarily manipulate appearance or rely on point-track-based trajectory control, which is often challenging for users to provide during inference, especially in videos with camera motion, we offer a practical, easy-to-use approach to controllable object-centric motion editing. We present Trace, a framework that enables users to design the desired trajectory in a single anchor frame and then synthesizes a temporally consistent edited video. Our approach addresses this task with a two-stage pipeline: a cross-view motion transformation module that maps first-frame path design to frame-aligned box trajectories under camera motion, and a motion-conditioned video re-synthesis module that follows these trajectories to regenerate the object while preserving the remaining content of the input video. Experiments on diverse real-world videos show that our method produces more coherent, realistic, and controllable motion edits than recent image-to-video and video-to-video methods.

[185] Seeing to Ground: Visual Attention for Hallucination-Resilient MDLLMs

Vishal Narnaware,Animesh Gupta,Kevin Zhai,Zhenyi Wang,Mubarak Shah

Main category: cs.CV

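TL;DR: This paper proposes VISAGE, a framework that addresses multimodal hallucination in multimodal diffusion large language models (MDLLMs) by calibrating the objective at inference time; it estimates the proxy discrepancy from the spatial entropy of cross-attention and re-ranks tokens to improve visual grounding.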

Details Motivation: MDLLMs are prone to multimodal hallucination during high-concurrency generation because the decoder ranks candidate tokens by textual likelihood alone, without verifying localized visual support, which creates an objective mismatch. Method: VISAGE is a training-free decoding framework that estimates the proxy discrepancy by quantifying the spatial entropy of cross-attention distributions, enforces a localization consensus across attention heads, penalizes spatially uniform distributions, and re-ranks tokens to strengthen visual grounding. Result: VISAGE achieves relative gains of 8.59% on MMMU-val and 7.75% on HallusionBench, and comes with an analytical stability guarantee showing that the objective loss stays bounded under estimation error. Conclusion: VISAGE effectively mitigates multimodal hallucination in MDLLMs, reinterpreting hallucination as a localized optimization error and delivering a robust, training-free improvement through inference-time objective calibration. Abstract: Multimodal Diffusion Large Language Models (MDLLMs) achieve high-concurrency generation through parallel masked decoding, yet the architectures remain prone to multimodal hallucinations. This structural vulnerability stems from an algorithmic flaw: the decoder ranks candidate tokens based on textual likelihood without verifying localized visual support. We establish that this language-only ranking induces an objective mismatch, where language probability mass acts as a misspecified proxy for the intended multimodal task. Consequently, we reinterpret hallucination as a localized optimization error, a phenomenon where the decoder exploits language shortcuts to maximize a proxy score at the expense of visual grounding. To address this objective mismatch, we introduce VISAGE, a training-free decoding framework that calibrates the objective at inference time. VISAGE estimates the proxy discrepancy by quantifying the spatial entropy of cross-attention distributions. By enforcing a localization consensus across attention heads, the method penalizes spatially uniform distributions and re-ranks token commitments to favor visually grounded outcomes. We provide an analytical stability guarantee establishing that VISAGE maintains a bounded objective loss under estimation error. Evaluations across hallucination-sensitive and general-purpose benchmarks demonstrate the robustness of the framework, yielding relative gains of 8.59% on MMMU-val and 7.75% on HallusionBench.
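
The re-ranking idea can be sketched with a toy scoring rule. Everything below is an assumption made for illustration: the abstract does not give VISAGE's scoring formula or how heads are combined, so this sketch takes a single (for example head-averaged) attention map per candidate, uses normalized Shannon entropy as the spatial-uniformity measure, and subtracts it from the log-probability with a hypothetical `penalty` weight.

```python
import math


def normalized_spatial_entropy(attention):
    """Shannon entropy of an attention map over image patches, scaled to [0, 1].

    1.0 means attention is spread uniformly (no localized visual support);
    0.0 means all attention sits on a single patch.
    """
    total = sum(attention)
    if len(attention) < 2 or total <= 0:
        raise ValueError("need at least two patches with positive total attention")
    probs = [a / total for a in attention]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(attention))


def rerank(candidates, penalty=1.0):
    """Order candidate tokens by log-probability minus an entropy penalty.

    candidates: list of (token, log_prob, attention_map) tuples.
    penalty: hypothetical weight trading language likelihood against grounding.
    """
    def score(candidate):
        _, log_prob, attention = candidate
        return log_prob - penalty * normalized_spatial_entropy(attention)

    return [token for token, _, _ in sorted(candidates, key=score, reverse=True)]


# "dog" is more likely under the language prior but attends uniformly;
# "cat" is slightly less likely but attends to one image region.
candidates = [
    ("dog", -0.5, [0.25, 0.25, 0.25, 0.25]),
    ("cat", -0.8, [0.97, 0.01, 0.01, 0.01]),
]
language_only = rerank(candidates, penalty=0.0)
grounded = rerank(candidates, penalty=1.0)
```

With no penalty the language-preferred token wins; with the penalty the spatially localized candidate moves to the front, which is the qualitative behaviour the abstract describes.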

[186] Out of Sight but Not Out of Mind: Hybrid Memory for Dynamic Video World Models

Kaijin Chen,Dingkang Liang,Xin Zhou,Yikang Ding,Xiaoqiang Liu,Pengfei Wan,Xiang Bai

Main category: cs.CV

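TL;DR: This paper proposes the Hybrid Memory paradigm and the HyDRA architecture, which decouple memory of static backgrounds from memory of dynamic subjects; on the HM-World dataset they markedly improve how well video world models keep motion continuous for subjects that leave the view and later reappear.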

Details Motivation: Existing video world models treat the environment as a static canvas and handle poorly the case where a dynamic subject briefly leaves the field of view and then returns, producing frozen, distorted, or vanishing subjects. Method: The authors propose the Hybrid Memory paradigm; build HM-World, a large-scale video dataset (59K high-fidelity clips with decoupled camera and subject trajectories, 17 scenes, 49 subjects, and carefully designed exit-entry events); and design the HyDRA memory architecture, which compresses memory into tokens and uses a spatiotemporal relevance-driven retrieval mechanism to attend selectively to motion cues. Result: Experiments on HM-World show that the method significantly outperforms current state-of-the-art approaches in dynamic subject consistency and overall generation quality. Conclusion: The Hybrid Memory paradigm and HyDRA architecture address the problem of recovering dynamic subjects after they leave the view and offer a new approach to modeling the temporal continuity of the real physical world. Abstract: Video world models have shown immense potential in simulating the physical world, yet existing memory mechanisms primarily treat environments as static canvases. When dynamic subjects hide out of sight and later re-emerge, current methods often struggle, leading to frozen, distorted, or vanishing subjects. To address this, we introduce Hybrid Memory, a novel paradigm requiring models to simultaneously act as precise archivists for static backgrounds and vigilant trackers for dynamic subjects, ensuring motion continuity during out-of-view intervals. To facilitate research in this direction, we construct HM-World, the first large-scale video dataset dedicated to hybrid memory. It features 59K high-fidelity clips with decoupled camera and subject trajectories, encompassing 17 diverse scenes, 49 distinct subjects, and meticulously designed exit-entry events to rigorously evaluate hybrid coherence. Furthermore, we propose HyDRA, a specialized memory architecture that compresses memory into tokens and utilizes a spatiotemporal relevance-driven retrieval mechanism. By selectively attending to relevant motion cues, HyDRA effectively preserves the identity and motion of hidden subjects. Extensive experiments on HM-World demonstrate that our method significantly outperforms state-of-the-art approaches in both dynamic subject consistency and overall generation quality.

[187] No Hard Negatives Required: Concept Centric Learning Leads to Compositionality without Degrading Zero-shot Capabilities of Contrastive Models

Hai X. Pham,David T. Hoffmann,Ricardo Guerrero,Brais Martinez

Main category: cs.CV

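TL;DR: This paper proposes a lightweight method that needs no hard negatives and adds no inference cost: aligning short concept-centric text with images and using cross-modal attention pooling significantly improves the compositionality of vision-language models while maintaining or even improving zero-shot and retrieval performance.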

Details Motivation: Existing contrastive vision-language (V&L) models have limited ability to learn compositional representations; methods that generate hard negatives help but generalize poorly and damage basic capabilities, which makes them impractical. Method: (1) Standard NLP tools extract short concept-centric caption parts, which are aligned with the image; (2) a parameter-free cross-modal attention pooling obtains concept-centric visual embeddings from the image encoder; simple auxiliary contrastive losses complete the recipe. Result: State-of-the-art performance on standard compositionality benchmarks, with zero-shot classification and cross-modal retrieval maintained or improved and no added inference cost. Conclusion: The compositionality deficit stems from long captions that impose too few constraints and from global pooling that discards binding information; the two lightweight changes deliver compositionality together with general V&L capability, and are practical and scalable. Abstract: Contrastive vision-language (V&L) models remain a popular choice for various applications. However, several limitations have emerged, most notably the limited ability of V&L models to learn compositional representations. Prior methods often addressed this limitation by generating custom training data to obtain hard negative samples. Hard negatives have been shown to improve performance on compositionality tasks, but are often specific to a single benchmark, do not generalize, and can cause substantial degradation of basic V&L capabilities such as zero-shot or retrieval performance, rendering them impractical. In this work we follow a different approach. We identify two root causes that limit compositionality performance of V&Ls: 1) Long training captions do not require a compositional representation; and 2) The final global pooling in the text and image encoders lead to a complete loss of the necessary information to learn binding in the first place. As a remedy, we propose two simple solutions: 1) We obtain short concept centric caption parts using standard NLP software and align those with the image; and 2) We introduce a parameter-free cross-modal attention-pooling to obtain concept centric visual embeddings from the image encoder. With these two changes and simple auxiliary contrastive losses, we obtain SOTA performance on standard compositionality benchmarks, while maintaining or improving strong zero-shot and retrieval capabilities. This is achieved without increasing inference cost. We release the code for this work at https://github.com/SamsungLabs/concept_centric_clip.
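
One plausible reading of "parameter-free cross-modal attention pooling" is a softmax over patch-text similarities with no learned weights. The sketch below shows that reading only; the function name, the plain dot-product similarity, and the toy vectors are assumptions, and the released code at the URL above is the reference for what the authors actually do.

```python
import math


def attention_pool(patch_embeddings, text_embedding):
    """Pool image patch embeddings with weights given by similarity to a text embedding.

    No learned parameters are involved: weights = softmax(patch . text).
    Returns (pooled_embedding, weights).
    """
    scores = [
        sum(p * t for p, t in zip(patch, text_embedding))
        for patch in patch_embeddings
    ]
    # Numerically stable softmax over patches.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    dim = len(text_embedding)
    pooled = [
        sum(w * patch[d] for w, patch in zip(weights, patch_embeddings))
        for d in range(dim)
    ]
    return pooled, weights


# Two toy patches; the text concept aligns with the first one.
patches = [[1.0, 0.0], [0.0, 1.0]]
concept = [10.0, 0.0]
pooled, weights = attention_pool(patches, concept)
```

Unlike global average pooling, which would return [0.5, 0.5] here regardless of the caption part, the pooled vector follows the patch that matches the concept.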

[188] AnyHand: A Large-Scale Synthetic Dataset for RGB(-D) Hand Pose Estimation

Chen Si,Yulin Liu,Bo Ai,Jianwen Xie,Rolandos Alexandros Potamias,Chuanxia Zheng,Hao Su

Main category: cs.CV

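TL;DR: This paper presents AnyHand, a large-scale synthetic dataset for improving 3D hand pose estimation from RGB and RGB-D inputs, and validates its effectiveness and generalization on several benchmarks.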

Details Motivation: Existing real-world datasets have limited coverage, and earlier synthetic datasets rarely provide occlusions, arm details, and aligned depth together at scale, which holds back hand pose estimation. Method: The authors build AnyHand, a large-scale synthetic dataset with 2.5M single-hand and 4.1M hand-object interaction RGB-D images, and design a lightweight depth fusion module that plugs into existing RGB models. Result: Performance improves significantly on benchmarks such as FreiHAND and HO-3D; without fine-tuning, the model generalizes strongly to the out-of-domain HO-Cap dataset; the RGB-D model achieves better performance on HO-3D. Conclusion: AnyHand eases the training-data bottleneck and shows the value of combining high-quality synthetic data with depth information for hand pose estimation. Abstract: We present AnyHand, a large-scale synthetic dataset designed to advance the state of the art in 3D hand pose estimation from both RGB-only and RGB-D inputs. While recent works with foundation approaches have shown that an increase in the quantity and diversity of training data can markedly improve performance and robustness in hand pose estimation, existing real-world-collected datasets on this task are limited in coverage, and prior synthetic datasets rarely provide occlusions, arm details, and aligned depth together at scale. To address this bottleneck, our AnyHand contains 2.5M single-hand and 4.1M hand-object interaction RGB-D images, with rich geometric annotations. In the RGB-only setting, we show that extending the original training sets of existing baselines with AnyHand yields significant gains on multiple benchmarks (FreiHAND and HO-3D), even when keeping the architecture and training scheme fixed. More impressively, the model trained with AnyHand shows stronger generalization to the out-of-domain HO-Cap dataset, without any fine-tuning. We also contribute a lightweight depth fusion module that can be easily integrated into existing RGB-based models. Trained with AnyHand, the resulting RGB-D model achieves superior performance on the HO-3D benchmark, showing the benefits of depth integration and the effectiveness of our synthetic data.

[189] PixelSmile: Toward Fine-Grained Facial Expression Editing

Jiabin Hua,Hengyuan Xu,Aojie Li,Wei Cheng,Gang Yu,Xingjun Ma,Yu-Gang Jiang

Main category: cs.CV

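TL;DR: This paper proposes PixelSmile, a diffusion framework that disentangles expression semantics through fully symmetric joint training and combines intensity supervision with contrastive learning to enable continuous, controllable, fine-grained facial expression editing while preserving identity.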

Details Motivation: Fine-grained facial expression editing has long been limited by intrinsic semantic overlap. Method: The authors build the Flex Facial Expression (FFE) dataset with continuous affective annotations and the FFE-Bench evaluation benchmark; PixelSmile disentangles expression semantics via fully symmetric joint training, combines intensity supervision with contrastive learning, and achieves precise, stable linear expression control through textual latent interpolation. Result: PixelSmile shows superior disentanglement and robust identity preservation, and supports continuous, controllable, fine-grained expression editing as well as smooth expression blending. Conclusion: PixelSmile resolves semantic overlap in expression editing and delivers precise, stable fine-grained expression control with good identity preservation. Abstract: Fine-grained facial expression editing has long been limited by intrinsic semantic overlap. To address this, we construct the Flex Facial Expression (FFE) dataset with continuous affective annotations and establish FFE-Bench to evaluate structural confusion, editing accuracy, linear controllability, and the trade-off between expression editing and identity preservation. We propose PixelSmile, a diffusion framework that disentangles expression semantics via fully symmetric joint training. PixelSmile combines intensity supervision with contrastive learning to produce stronger and more distinguishable expressions, achieving precise and stable linear expression control through textual latent interpolation. Extensive experiments demonstrate that PixelSmile achieves superior disentanglement and robust identity preservation, confirming its effectiveness for continuous, controllable, and fine-grained expression editing, while naturally supporting smooth expression blending.

[190] PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference

Xiaofeng Mao,Shaohao Rui,Kaining Ying,Bo Zheng,Chuanhao Li,Mingmin Chi,Kaipeng Zhang

Main category: cs.CV

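TL;DR: PackForcing proposes a three-partition KV-cache management framework that combines Sink/Mid/Recent token partitioning, dynamic top-k selection, and a continuous Temporal RoPE Adjustment to generate 2-minute, 832x480 videos efficiently on a single GPU, achieving 24x temporal extrapolation with supervision from only 5-second clips.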

Details Motivation: Autoregressive video diffusion models are bottlenecked in long-video generation by linear KV-cache growth, temporal repetition, and error accumulation. Method: PackForcing splits the historical context into three types: Sink (early anchor frames kept at full resolution), Mid (compressed by a dual-branch network that fuses progressive 3D convolutions with low-resolution VAE re-encoding, a 32x token reduction), and Recent (full resolution to keep local coherence); dynamic top-k selection over the Mid context and a continuous Temporal RoPE Adjustment re-align the position gaps left by dropped tokens. Result: On a single H200 GPU, the model generates 2-minute, 832x480 videos at 16 FPS; the KV cache stays bounded at 4 GB; it achieves 5 s to 120 s (24x) temporal extrapolation either zero-shot or trained only on 5-second clips; on VBench it reports state-of-the-art temporal consistency (26.07) and dynamic degree (56.25). Conclusion: Hierarchical context compression substantially relieves the compute and memory bottlenecks of long-video generation and shows that short-video supervision suffices for high-quality long-video synthesis. Abstract: Autoregressive video diffusion models have demonstrated remarkable progress, yet they remain bottlenecked by intractable linear KV-cache growth, temporal repetition, and compounding errors during long-video generation. To address these challenges, we present PackForcing, a unified framework that efficiently manages the generation history through a novel three-partition KV-cache strategy. Specifically, we categorize the historical context into three distinct types: (1) Sink tokens, which preserve early anchor frames at full resolution to maintain global semantics; (2) Mid tokens, which achieve a massive spatiotemporal compression (32x token reduction) via a dual-branch network fusing progressive 3D convolutions with low-resolution VAE re-encoding; and (3) Recent tokens, kept at full resolution to ensure local temporal coherence. To strictly bound the memory footprint without sacrificing quality, we introduce a dynamic top-k context selection mechanism for the mid tokens, coupled with a continuous Temporal RoPE Adjustment that seamlessly re-aligns position gaps caused by dropped tokens with negligible overhead. Empowered by this principled hierarchical context compression, PackForcing can generate coherent 2-minute, 832x480 videos at 16 FPS on a single H200 GPU. It achieves a bounded KV cache of just 4 GB and enables a remarkable 24x temporal extrapolation (5s to 120s), operating effectively either zero-shot or trained on merely 5-second clips. Extensive results on VBench demonstrate state-of-the-art temporal consistency (26.07) and dynamic degree (56.25), proving that short-video supervision is sufficient for high-quality, long-video synthesis. https://github.com/ShandaAI/PackForcing
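
The bookkeeping behind a bounded three-partition cache can be sketched in a few lines. This is a simplified accounting model, not PackForcing's implementation: it partitions whole frames rather than tokens, treats the 32x reduction as a fixed per-frame ratio, and ignores how the top-k mid entries are chosen. The function names and the numbers in the example are hypothetical.

```python
def partition_history(num_frames, num_sink, num_recent):
    """Split frame indices into sink (earliest), recent (latest), and mid (the rest)."""
    frames = list(range(num_frames))
    recent_start = max(num_sink, num_frames - num_recent)
    sink = frames[:num_sink]
    mid = frames[num_sink:recent_start]
    recent = frames[recent_start:]
    return sink, mid, recent


def cache_tokens(sink, mid, recent, tokens_per_frame, mid_compression=32, top_k=None):
    """Count cached tokens: sink/recent at full size, mid compressed and optionally capped."""
    kept_mid = len(mid) if top_k is None else min(top_k, len(mid))
    full = (len(sink) + len(recent)) * tokens_per_frame
    compressed = kept_mid * (tokens_per_frame // mid_compression)
    return full + compressed


# Hypothetical sizes: 2 sink frames, 3 recent frames, 64 tokens per frame.
short = partition_history(num_frames=10, num_sink=2, num_recent=3)
long = partition_history(num_frames=1000, num_sink=2, num_recent=3)

short_capped = cache_tokens(*short, tokens_per_frame=64, top_k=4)
long_capped = cache_tokens(*long, tokens_per_frame=64, top_k=4)
long_uncapped = cache_tokens(*long, tokens_per_frame=64)

# The extrapolation factor quoted in the abstract: 5 s clips to 120 s videos.
extrapolation = 120 / 5
```

In this model, compression alone slows cache growth but leaves it linear in video length; the top-k cap on mid entries is what makes the total independent of the number of frames.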

[191] BizGenEval: A Systematic Benchmark for Commercial Visual Content Generation

Yan Li,Zezi Zeng,Ziwei Zhou,Xin Gao,Muzhao Tian,Yifan Yang,Mingxi Cheng,Qi Dai,Yuqing Yang,Lili Qiu,Zhendong Wang,Zhengyuan Yang,Xue Yang,Lijuan Wang,Ji Li,Chong Luo

Main category: cs.CV

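TL;DR: This paper introduces BizGenEval, a systematic benchmark for commercial visual content generation that covers five document types and four key capability dimensions; it evaluates 26 popular image generation models with 400 prompts and 8000 human-verified questions and reveals substantial capability gaps in professional visual content generation.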

Details Motivation: Existing image generation benchmarks focus on natural image synthesis and do not systematically evaluate the structured, multi-constraint requirements of real commercial design tasks. Method: BizGenEval covers five representative document types (slides, charts, webpages, posters, and scientific figures) and defines 20 evaluation tasks across four dimensions (text rendering, layout control, attribute binding, and knowledge-based reasoning), using 400 carefully curated prompts and 8000 human-verified checklist questions. Result: Large-scale evaluation of 26 popular image generation models, including commercial APIs and open-source models, shows that current models still fall well short on the key capabilities needed for commercial visual content generation. Conclusion: BizGenEval provides a standardized evaluation framework for visual content generation in real commercial settings and may help move the field toward more practical, controllable, and reliable systems. Abstract: Recent advances in image generation models have expanded their applications beyond aesthetic imagery toward practical visual content creation. However, existing benchmarks mainly focus on natural image synthesis and fail to systematically evaluate models under the structured and multi-constraint requirements of real-world commercial design tasks. In this work, we introduce BizGenEval, a systematic benchmark for commercial visual content generation. The benchmark spans five representative document types: slides, charts, webpages, posters, and scientific figures, and evaluates four key capability dimensions: text rendering, layout control, attribute binding, and knowledge-based reasoning, forming 20 diverse evaluation tasks. BizGenEval contains 400 carefully curated prompts and 8000 human-verified checklist questions to rigorously assess whether generated images satisfy complex visual and semantic constraints. We conduct large-scale benchmarking on 26 popular image generation systems, including state-of-the-art commercial APIs and leading open-source models. The results reveal substantial capability gaps between current generative models and the requirements of professional visual content creation. We hope BizGenEval serves as a standardized benchmark for real-world commercial visual content generation.
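
The benchmark's size follows from simple counting, which the short sketch below spells out. The task grid (5 document types x 4 capabilities = 20 tasks) comes from the abstract; the per-task and per-prompt averages assume an even split, which the abstract does not state.

```python
from itertools import product

DOCUMENT_TYPES = ["slides", "charts", "webpages", "posters", "scientific figures"]
CAPABILITIES = [
    "text rendering",
    "layout control",
    "attribute binding",
    "knowledge-based reasoning",
]

# Every (document type, capability) pair is one evaluation task.
tasks = list(product(DOCUMENT_TYPES, CAPABILITIES))

NUM_PROMPTS = 400
NUM_QUESTIONS = 8000

# Averages only; an even distribution across tasks and prompts is an assumption.
prompts_per_task = NUM_PROMPTS / len(tasks)
questions_per_prompt = NUM_QUESTIONS / NUM_PROMPTS
```

Under the even-split assumption, each task has about 20 prompts and each prompt about 20 checklist questions.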

[192] SlotVTG: Object-Centric Adapter for Generalizable Video Temporal Grounding

Jiwook Han,Geo Ahn,Youngrae Kim,Jinwoo Choi

Main category: cs.CV

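TL;DR: This paper proposes SlotVTG, a framework that uses a lightweight slot adapter to steer multimodal large language models toward object-centric, input-grounded visual reasoning, significantly improving cross-domain generalization in video temporal grounding while maintaining in-domain performance at minimal overhead.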

Details Motivation: Because their coarse recognition falls short, MLLMs depend on task-specific fine-tuning for video temporal grounding (VTG) and tend to memorize dataset shortcuts instead of the actual visual content, which leads to poor cross-domain generalization; existing object-centric learning methods require retraining the entire multi-stage pipeline from scratch, which is costly. Method: SlotVTG introduces a lightweight slot adapter that uses slot attention to decompose visual tokens into abstract slots and reconstruct the original sequence, and incorporates objectness priors from a self-supervised vision model to encourage semantically coherent slots. Result: On standard cross-domain VTG benchmarks, SlotVTG significantly improves OOD robustness while keeping competitive ID performance, with minimal compute and training overhead. Conclusion: SlotVTG steers MLLMs toward object-centric, input-grounded visual reasoning at very low cost and is an effective way to improve the generalization of VTG models. Abstract: Multimodal Large Language Models (MLLMs) have shown strong performance on Video Temporal Grounding (VTG). However, their coarse recognition capabilities are insufficient for fine-grained temporal understanding, making task-specific fine-tuning indispensable. This fine-tuning causes models to memorize dataset-specific shortcuts rather than faithfully grounding in the actual visual content, leading to poor Out-of-Domain (OOD) generalization. Object-centric learning offers a promising remedy by decomposing scenes into entity-level representations, but existing approaches require re-running the entire multi-stage training pipeline from scratch. We propose SlotVTG, a framework that steers MLLMs toward object-centric, input-grounded visual reasoning at minimal cost. SlotVTG introduces a lightweight slot adapter that decomposes visual tokens into abstract slots via slot attention and reconstructs the original sequence, where objectness priors from a self-supervised vision model encourage semantically coherent slot formation. Cross-domain evaluation on standard VTG benchmarks demonstrates that our approach significantly improves OOD robustness while maintaining competitive In-Domain (ID) performance with minimal overhead.

[193] Unleashing Guidance Without Classifiers for Human-Object Interaction Animation

Ziyin Wang,Sirui Xu,Chuan Guo,Bing Zhou,Jiangshan Gong,Jian Wang,Yu-Xiong Wang,Liang-Yan Gui

Main category: cs.CV

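TL;DR: This paper proposes LIGHT, which uses asynchronous denoising schedules and cross-modal attention to generate high-quality human-object interaction animations without hand-crafted contact priors.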

Details Motivation: Existing diffusion-based methods rely on hand-crafted contact priors or kinematic constraints and struggle to model dynamic human actions and diverse object geometries together. Method: Building on diffusion forcing, the representation is factored into modality-specific components, each assigned its own noise level and an asynchronous denoising schedule. Cleaner components guide noisier ones through cross-attention, yielding data-driven, implicit contact guidance. Result: LIGHT outperforms conventional classifier-free guidance in contact fidelity, realism of HOI generation, and generalization to unseen objects and tasks, and it can effectively stand in for explicit contact priors. Conclusion: The denoising pace itself can serve as an effective data-driven guidance signal, and asynchronous denoising combined with training on synthetic geometry augmentation improves the invariance of contact semantics and generation quality. Abstract: Generating realistic human-object interaction (HOI) animations remains challenging because it requires jointly modeling dynamic human actions and diverse object geometries. Prior diffusion-based approaches often rely on hand-crafted contact priors or human-imposed kinematic constraints to improve contact quality. We propose LIGHT, a data-driven alternative in which guidance emerges from the denoising pace itself, reducing dependence on manually designed priors. Building on diffusion forcing, we factor the representation into modality-specific components and assign individualized noise levels with asynchronous denoising schedules. In this paradigm, cleaner components guide noisier ones through cross-attention, yielding guidance without auxiliary classifiers. We find that this data-driven guidance is inherently contact-aware, and can be enhanced when training is augmented with a broad spectrum of synthetic object geometries, encouraging invariance of contact semantics to geometric diversity. Extensive experiments show that pace-induced guidance more effectively mirrors the benefits of contact priors than conventional classifier-free guidance, while achieving higher contact fidelity, more realistic HOI generation, and stronger generalization to unseen objects and tasks.
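
To make "asynchronous denoising schedules" concrete, here is a small sketch of two components denoised at different paces. It is an illustration under stated assumptions, not LIGHT's actual schedule: the linear noise ramp, the fixed `lead` offset, and the "leader"/"follower" roles are all hypothetical, and the abstract does not say which modality leads.

```python
def async_schedule(num_steps, lead):
    """Return per-step (leader_noise, follower_noise) pairs in [0, 1].

    Assumption: both components follow a linear ramp from 1 (pure noise)
    to 0 (clean), and the leader starts `lead` steps before the follower,
    so the leader is never noisier than the follower.
    """
    pairs = []
    for step in range(num_steps + lead + 1):
        leader = max(0.0, 1.0 - step / num_steps)
        follower = max(0.0, min(1.0, 1.0 - (step - lead) / num_steps))
        pairs.append((leader, follower))
    return pairs

schedule = async_schedule(num_steps=4, lead=2)
for leader_noise, follower_noise in schedule:
    # At every step the cleaner (leader) component could condition the
    # noisier (follower) one, e.g. through cross-attention.
    print(f"leader={leader_noise:.2f} follower={follower_noise:.2f}")
```

The gap between the two noise levels is what would carry the guidance signal: while the follower is still noisy, the leader already carries a cleaner estimate for it to attend to.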

[194] How good was my shot? Quantifying Player Skill Level in Table Tennis

Akihiro Kubota,Tomoya Hasegawa,Ryo Kawahara,Ko Nishino

Main category: cs.CV

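TL;DR: This paper proposes a method based on generative models and joint embedding that analyzes players' tactical strokes in table tennis matches together with their context (such as positioning and opponent behavior), models individual characteristics (including skill level) in a latent space, and thereby quantifies skill.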

Details Motivation: Skill is a key latent factor shaping human behavior, but it cannot be observed directly, and existing methods struggle to quantify it accurately in complex interactive behaviors such as competitive sports. Method: A generative model of tactical racket strokes is built for each player, and the per-player models are jointly embedded in a shared latent space that encodes individual characteristics, including skill level. The models are trained on 3D-reconstructed professional matches and conditioned on game context such as positioning and opponent behavior. Result: The learned latent space reflects distinct play styles and skill-related attributes, and a simple relative ranking network trained on these embeddings yields both relative and absolute skill predictions. Conclusion: The method encodes skill level explicitly in the learned player latent space and offers a workable framework for automated skill assessment in complex interactive behaviors. Abstract: Gauging an individual's skill level is crucial, as it inherently shapes their behavior. Quantifying skill, however, is challenging because it is latent to the observed actions. To explore skill understanding in human behavior, we focus on dyadic sports -- specifically table tennis -- where skill manifests not just in complex movements, but in the subtle nuances of execution conditioned on game context. Our key idea is to learn a generative model of each player's tactical racket strokes and jointly embed them in a common latent space that encodes individual characteristics, including those pertaining to skill levels. By training these player models on a large-scale dataset of 3D-reconstructed professional matches and conditioning them on comprehensive game context -- including player positioning and opponent behaviors -- the models capture individual tactical identities within their latent space. We probe this learned player space and find that it reflects distinct play styles and attributes that collectively represent skill. By training a simple relative ranking network on these embeddings, we demonstrate that both relative and absolute skill predictions can be achieved. These results demonstrate that the learned player space effectively quantifies skill levels, providing a foundation for automated skill assessment in complex, interactive behaviors.
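
One way a "relative ranking network" over player embeddings could produce both relative and absolute predictions is a pairwise logistic ranking loss on a scalar score. The sketch below assumes a linear scorer and made-up 2-D embeddings; the paper's actual network, loss, and embeddings are not given in the abstract, so every name and number here is hypothetical.

```python
import math

def score(embedding, weights):
    """Scalar skill score; a linear scorer is an assumption for brevity."""
    return sum(e * w for e, w in zip(embedding, weights))

def pairwise_ranking_loss(emb_better, emb_worse, weights):
    """Logistic loss that is small when the better player scores higher."""
    margin = score(emb_better, weights) - score(emb_worse, weights)
    return math.log1p(math.exp(-margin))

def train(pairs, dim, lr=0.1, epochs=200):
    weights = [0.0] * dim
    for _ in range(epochs):
        for better, worse in pairs:
            margin = score(better, weights) - score(worse, weights)
            # d(loss)/d(margin) = -sigmoid(-margin)
            grad_scale = -1.0 / (1.0 + math.exp(margin))
            for d in range(dim):
                weights[d] -= lr * grad_scale * (better[d] - worse[d])
    return weights

# Hypothetical player embeddings and known (better, worse) comparisons.
players = {"a": [0.9, 0.2], "b": [0.5, 0.4], "c": [0.1, 0.3]}
pairs = [
    (players["a"], players["b"]),
    (players["b"], players["c"]),
    (players["a"], players["c"]),
]
weights = train(pairs, dim=2)
scores = {name: score(emb, weights) for name, emb in players.items()}
```

Training only ever sees relative comparisons, but the learned scalar `scores` can be read as an absolute scale on which any new embedding can be placed.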

[195] PSDesigner: Automated Graphic Design with a Human-Like Creative Workflow

Xincheng Shuai,Song Tang,Yutong Huang,Henghui Ding,Dacheng Tao

Main category: cs.CV

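TL;DR: This paper proposes PSDesigner, an automated graphic design system that emulates the creative workflow of human designers. It builds the CreativePSD dataset to strengthen tool-use capabilities and outperforms existing methods across multiple design tasks.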

Details Motivation: Existing automated design systems simplify professional workflows, which limits flexibility and intuitiveness and makes it hard to translate user intent accurately into editable design files. Method: PSDesigner combines theme-related asset collection with autonomous tool calls (such as integrating assets and refining elements). The authors also build CreativePSD, a dataset of high-quality PSD files annotated with operation traces, to train models on expert design procedures. Result: Across a range of graphic design tasks, PSDesigner outperforms existing methods and lets non-specialists conveniently produce production-quality designs. Conclusion: By emulating the human design workflow and learning tool use from high-quality data, PSDesigner improves the practicality, flexibility, and professionalism of automated graphic design. Abstract: Graphic design is a creative and innovative process that plays a crucial role in applications such as e-commerce and advertising. However, developing an automated design system that can faithfully translate user intentions into editable design files remains an open challenge. Although recent studies have leveraged powerful text-to-image models and MLLMs to assist graphic design, they typically simplify professional workflows, resulting in limited flexibility and intuitiveness. To address these limitations, we propose PSDesigner, an automated graphic design system that emulates the creative workflow of human designers. Building upon multiple specialized components, PSDesigner collects theme-related assets based on user instructions, and autonomously infers and executes tool calls to manipulate design files, such as integrating new assets or refining inferior elements. To endow the system with strong tool-use capabilities, we construct a design dataset, CreativePSD, which contains a large amount of high-quality PSD design files annotated with operation traces across a wide range of design scenarios and artistic styles, enabling models to learn expert design procedures. Extensive experiments demonstrate that PSDesigner outperforms existing methods across diverse graphic design tasks, empowering non-specialists to conveniently create production-quality designs.

[196] MegaFlow: Zero-Shot Large Displacement Optical Flow

Dingxi Zhang,Fangjinhua Wang,Marc Pollefeys,Haofei Xu

Main category: cs.CV

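TL;DR: MegaFlow is a new method for zero-shot, large-displacement optical flow estimation. It models motion matching with global features from a pre-trained ViT, adds lightweight iterative refinement, and reaches state-of-the-art zero-shot performance on multiple benchmarks.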

Details Motivation: Existing methods rely on iterative local search or domain-specific fine-tuning and cope poorly with large displacements and zero-shot generalization. Method: Optical flow estimation is formulated as a global matching problem over global features from a pre-trained ViT, followed by a few lightweight iterative refinements that improve sub-pixel accuracy. Result: The model achieves state-of-the-art zero-shot performance on multiple optical flow benchmarks and shows strong generalization and cross-task transfer on long-range point tracking. Conclusion: MegaFlow shows that strong vision priors, rather than complex task-specific architectures, can deliver efficient and general large-displacement motion estimation, suggesting a unified paradigm for generalizable motion estimation. Abstract: Accurate estimation of large displacement optical flow remains a critical challenge. Existing methods typically rely on iterative local search and/or domain-specific fine-tuning, which severely limits their performance in large displacement and zero-shot generalization scenarios. To overcome this, we introduce MegaFlow, a simple yet powerful model for zero-shot large displacement optical flow. Rather than relying on highly complex, task-specific architectural designs, MegaFlow adapts powerful pre-trained vision priors to produce temporally consistent motion fields. In particular, we formulate flow estimation as a global matching problem by leveraging pre-trained global Vision Transformer features, which naturally capture large displacements. This is followed by a few lightweight iterative refinements to further improve the sub-pixel accuracy. Extensive experiments demonstrate that MegaFlow achieves state-of-the-art zero-shot performance across multiple optical flow benchmarks. Moreover, our model also delivers highly competitive zero-shot performance on long-range point tracking benchmarks, demonstrating its robust transferability and suggesting a unified paradigm for generalizable motion estimation. Our project page is at: https://kristen-z.github.io/projects/megaflow.
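
The sketch below illustrates why global matching handles large displacements naturally: every source position is compared against every target position, so the match distance is not bounded by a local search window. It is a 1-D toy under loud assumptions: one-hot descriptors stand in for pre-trained ViT features, and the softmax expected-position readout is a common choice in global-matching flow methods, not something the abstract confirms for MegaFlow. The refinement stage is omitted.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def global_match_flow(feat_src, feat_tgt, temperature=0.05):
    """Flow per source position from an all-pairs softmax match (1-D toy)."""
    flows = []
    for i, f in enumerate(feat_src):
        # Similarity to every target position, not just a local window.
        sims = [dot(f, g) / temperature for g in feat_tgt]
        probs = softmax(sims)
        expected_pos = sum(p * j for j, p in enumerate(probs))
        flows.append(expected_pos - i)
    return flows

def one_hot(index, size):
    return [1.0 if k == index else 0.0 for k in range(size)]

# Six positions with distinct descriptors; the target is the source
# rolled by three positions, i.e. a displacement of half the signal.
feat_src = [one_hot(k, 6) for k in range(6)]
feat_tgt = [one_hot((k + 3) % 6, 6) for k in range(6)]
flows = global_match_flow(feat_src, feat_tgt)
```

Here the first three positions recover a flow close to +3 and the last three close to -3, even though the displacement spans half the signal.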

[197] Vega: Learning to Drive with Natural Language Instructions

Sicheng Zuo,Yuxuan Li,Wenzhao Zheng,Zheng Zhu,Jie Zhou,Jiwen Lu

Main category: cs.CV

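TL;DR: This paper proposes Vega, a unified vision-language-world-action model for instruction-based autonomous driving planning. It builds InstructScene, a large-scale driving instruction dataset, and combines the autoregressive and diffusion paradigms for multimodal fusion and trajectory generation.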

Details Motivation: Existing autonomous driving models use language only for scene description or reasoning and cannot follow diverse user instructions for personalized driving. Method: The authors build the InstructScene dataset of about 100,000 scenes with corresponding instruction-trajectory pairs, and propose the Vega model, which processes visual and language inputs autoregressively, generates future world states and action trajectories with a diffusion model, and uses joint attention plus separate projection layers for interaction across modalities. Result: Experiments show that the method outperforms existing approaches in both planning performance and instruction following. Conclusion: Vega offers a new path toward more intelligent and personalized autonomous driving systems. Abstract: Vision-language-action models have reshaped autonomous driving to incorporate language into the decision-making process. However, most existing pipelines only utilize the language modality for scene descriptions or reasoning and lack the flexibility to follow diverse user instructions for personalized driving. To address this, we first construct a large-scale driving dataset (InstructScene) containing around 100,000 scenes annotated with diverse driving instructions and the corresponding trajectories. We then propose a unified Vision-Language-World-Action model, Vega, for instruction-based generation and planning. We employ the autoregressive paradigm to process visual inputs (vision) and language instructions (language) and the diffusion paradigm to generate future predictions (world modeling) and trajectories (action). We perform joint attention to enable interactions between the modalities and use individual projection layers for different modalities for more capabilities. Extensive experiments demonstrate that our method not only achieves superior planning performance but also exhibits strong instruction-following abilities, paving the way for more intelligent and personalized driving systems.

[198] RefAlign: Representation Alignment for Reference-to-Video Generation

Lei Wang,YuXin Song,Ge Wu,Haocheng Feng,Hang Zhou,Jingdong Wang,Yaxing Wang,jian Yang

Main category: cs.CV

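TL;DR: This paper proposes RefAlign, a framework that explicitly aligns reference image features with the semantic space of a visual foundation model (VFM). It improves identity consistency and semantic discriminability in reference-to-video generation and balances text controllability with reference fidelity, without adding inference overhead.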

Details Motivation: Existing reference-to-video (R2V) methods rely on auxiliary semantic or cross-modal features to alleviate information leakage in the VAE latent space, yet they still struggle with copy-paste artifacts and multi-subject confusion, which stem from modality mismatch across heterogeneous encoder features. Method: RefAlign centers on a reference alignment loss that, during training, pulls the DiT reference-branch features and VFM features of the same subject closer and pushes apart the corresponding features of different subjects. The loss applies only at training time and adds no inference overhead. Result: On the OpenS2V-Eval benchmark, RefAlign surpasses current state-of-the-art methods in TotalScore, supporting the effectiveness of explicit reference alignment for R2V. Conclusion: Explicitly aligning reference features to the VFM semantic space is a simple and effective way to better balance text controllability and reference fidelity, and it offers a new direction for R2V generation. Abstract: Reference-to-video (R2V) generation is a controllable video synthesis paradigm that constrains the generation process using both text prompts and reference images, enabling applications such as personalized advertising and virtual try-on. In practice, existing R2V methods typically introduce additional high-level semantic or cross-modal features alongside the VAE latent representation of the reference image and jointly feed them into the diffusion Transformer (DiT). These auxiliary representations provide semantic guidance and act as implicit alignment signals, which can partially alleviate pixel-level information leakage in the VAE latent space. However, they may still struggle to address copy--paste artifacts and multi-subject confusion caused by modality mismatch across heterogeneous encoder features. In this paper, we propose RefAlign, a representation alignment framework that explicitly aligns DiT reference-branch features to the semantic space of a visual foundation model (VFM). The core of RefAlign is a reference alignment loss that pulls the reference features and VFM features of the same subject closer to improve identity consistency, while pushing apart the corresponding features of different subjects to enhance semantic discriminability. This simple yet effective strategy is applied only during training, incurring no inference-time overhead, and achieves a better balance between text controllability and reference fidelity. Extensive experiments on the OpenS2V-Eval benchmark demonstrate that RefAlign outperforms current state-of-the-art methods in TotalScore, validating the effectiveness of explicit reference alignment for R2V tasks.
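
A pull-together, push-apart alignment loss of the kind described can be sketched with an InfoNCE-style objective over cosine similarities. This is an assumption about the general shape only: the abstract does not give RefAlign's exact formula, and the sketch further assumes that reference and VFM features already share one dimension (a real system would likely need a projection). The function name and toy vectors are hypothetical.

```python
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def reference_alignment_loss(ref_feats, vfm_feats, temperature=0.1):
    """InfoNCE-style loss: ref_feats[i] and vfm_feats[i] share a subject.

    The matching pair is pulled together (positive), while VFM features of
    the other subjects act as negatives and are pushed apart.
    """
    losses = []
    for i, ref in enumerate(ref_feats):
        logits = [cosine(ref, v) / temperature for v in vfm_feats]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)

# Toy features for two subjects.
vfm_feats = [[1.0, 0.1], [0.1, 1.0]]
aligned_refs = [[1.0, 0.0], [0.0, 1.0]]   # each ref matches its own subject
confused_refs = [[0.0, 1.0], [1.0, 0.0]]  # refs swapped between subjects

loss_aligned = reference_alignment_loss(aligned_refs, vfm_feats)
loss_confused = reference_alignment_loss(confused_refs, vfm_feats)
```

Because the loss only shapes features during training, dropping it at inference changes nothing in the forward pass, which is consistent with the "no inference-time overhead" claim.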

[199] MuRF: Unlocking the Multi-Scale Potential of Vision Foundation Models

Bocheng Zou,Mu Cai,Mark Stanley,Dingfu Lu,Yong Jae Lee

Main category: cs.CV

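TL;DR: This paper proposes Multi-Resolution Fusion (MuRF), a simple and general inference-time strategy for fusing multi-resolution features that improves the performance of vision foundation models (VFMs) across tasks without additional training.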

Details Motivation: Existing vision foundation models typically use a single fixed input scale at inference, overlooking a basic property of visual perception: different resolutions offer complementary inductive biases, with low resolution favoring global semantic recognition and high resolution favoring fine-grained refinement. Method: MuRF feeds an image at multiple resolutions through a frozen VFM, extracts features at each, and fuses these multi-scale features into a unified representation. It is architecture-agnostic and needs no fine-tuning or training. Result: MuRF yields significant and consistent gains across multiple VFM families (such as DINOv2 and SigLIP2) and a variety of key vision tasks, supporting its generality and practicality. Conclusion: MuRF is a general, plug-and-play, training-free inference enhancement that exposes and exploits the complementary value of multi-resolution information for visual representations, offering a new option for deploying VFMs in practice. Abstract: Vision Foundation Models (VFMs) have become the cornerstone of modern computer vision, offering robust representations across a wide array of tasks. While recent advances allow these models to handle varying input sizes during training, inference typically remains restricted to a single, fixed scale. This prevalent single-scale paradigm overlooks a fundamental property of visual perception: varying resolutions offer complementary inductive biases, where low-resolution views excel at global semantic recognition and high-resolution views are essential for fine-grained refinement. In this work, we propose Multi-Resolution Fusion (MuRF), a simple yet universally effective strategy to harness this synergy at inference time. Instead of relying on a single view, MuRF constructs a unified representation by processing an image at multiple resolutions through a frozen VFM and fusing the resulting features. The universality of MuRF is its most compelling attribute. It is not tied to a specific architecture, serving instead as a fundamental, training-free enhancement to visual representation. We empirically validate this by applying MuRF to a broad spectrum of critical computer vision tasks across multiple distinct VFM families - primarily DINOv2, but also demonstrating successful generalization to contrastive models like SigLIP2.
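
The multi-resolution pipeline can be sketched end to end with a toy encoder. Everything specific here is an assumption: `frozen_encoder` is a hypothetical stand-in for a frozen VFM (it just averages 2x2 patches into a one-channel feature), and fusing by nearest-neighbor upsampling followed by channel concatenation is one plausible operator, since the abstract does not specify how MuRF fuses features.

```python
def avg_pool(img, k):
    """Downsample a 2-D list by averaging non-overlapping k x k blocks."""
    h, w = len(img), len(img[0])
    return [
        [
            sum(img[y * k + dy][x * k + dx] for dy in range(k) for dx in range(k))
            / (k * k)
            for x in range(w // k)
        ]
        for y in range(h // k)
    ]

def frozen_encoder(img, patch=2):
    """Hypothetical frozen VFM: one single-channel feature per patch."""
    return [[[v] for v in row] for row in avg_pool(img, patch)]

def upsample_nearest(grid, out_h, out_w):
    h, w = len(grid), len(grid[0])
    return [
        [grid[y * h // out_h][x * w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]

def murf_features(img, downscale_factors=(1, 2)):
    """Encode the image at several resolutions and concatenate channels."""
    per_scale = []
    for k in downscale_factors:
        view = img if k == 1 else avg_pool(img, k)
        per_scale.append(frozen_encoder(view))  # same frozen weights each time
    out_h, out_w = len(per_scale[0]), len(per_scale[0][0])
    aligned = [upsample_nearest(g, out_h, out_w) for g in per_scale]
    return [
        [sum((g[y][x] for g in aligned), []) for x in range(out_w)]
        for y in range(out_h)
    ]

image = [[float(x + y) for x in range(8)] for y in range(8)]
fused = murf_features(image)
```

For the 8x8 toy image, `fused` is a 4x4 grid with two channels per cell: one from the full-resolution view and one from the half-resolution view. No parameter is trained anywhere in the pipeline.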

[200] Less Gaussians, Texture More: 4K Feed-Forward Textured Splatting

Yixing Lao,Xuyang Bai,Xiaoyang Wu,Nuoyuan Yan,Zixin Luo,Tian Fang,Jean-Daniel Nahmias,Yanghai Tsin,Shiwei Li,Hengshuang Zhao

Main category: cs.CV

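TL;DR: LGTM is a new feed-forward 3D Gaussian Splatting framework that pairs compact Gaussian primitives with per-primitive textures, decoupling geometric complexity from rendering resolution and enabling 4K novel view synthesis without per-scene optimization.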

Details Motivation: Existing feed-forward 3D Gaussian Splatting methods predict pixel-aligned primitives, so the primitive count grows quadratically with resolution, which makes scaling to high resolutions such as 4K impractical. Method: The LGTM framework predicts compact Gaussian primitives and assigns each a per-primitive texture, decoupling the geometric representation from the rendering resolution. Result: Without per-scene optimization, it achieves high-quality 4K novel view synthesis while using significantly fewer Gaussian primitives. Conclusion: LGTM removes the resolution-scaling bottleneck of feed-forward Gaussian splatting and offers a viable path toward high-resolution, real-time novel view synthesis. Abstract: Existing feed-forward 3D Gaussian Splatting methods predict pixel-aligned primitives, leading to a quadratic growth in primitive count as resolution increases. This fundamentally limits their scalability, making high-resolution synthesis such as 4K intractable. We introduce LGTM (Less Gaussians, Texture More), a feed-forward framework that overcomes this resolution scaling barrier. By predicting compact Gaussian primitives coupled with per-primitive textures, LGTM decouples geometric complexity from rendering resolution. This approach enables high-fidelity 4K novel view synthesis without per-scene optimization, a capability previously out of reach for feed-forward methods, all while using significantly fewer Gaussian primitives. Project page: https://yxlao.github.io/lgtm/
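
The quadratic growth mentioned above is easy to check with arithmetic. The sketch assumes one Gaussian per pixel per input view, which is the usual meaning of "pixel-aligned"; the function name and the 512x512 baseline are illustrative choices, and 4K is taken as 3840x2160.

```python
def pixel_aligned_primitives(width, height, views=1):
    """Primitive count when every pixel of every view yields one Gaussian."""
    return width * height * views

low = pixel_aligned_primitives(512, 512)
doubled = pixel_aligned_primitives(1024, 1024)
uhd = pixel_aligned_primitives(3840, 2160)

print(low)             # 262144
print(doubled // low)  # 4: doubling the side length quadruples the count
print(uhd)             # 8294400
print(uhd / low)       # 31.640625
```

A single 4K view therefore needs roughly 8.3 million pixel-aligned primitives, about 32 times the 512x512 case, which is the scaling pressure that a texture-per-primitive design is meant to relieve.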

[201] ShotStream: Streaming Multi-Shot Video Generation for Interactive Storytelling

Yawen Luo,Xiaoyu Shi,Junhao Zhuang,Yutian Chen,Quande Liu,Xintao Wang,Pengfei Wan,Tianfan Xue

Main category: cs.CV

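TL;DR: ShotStream is a causal multi-shot video generation architecture that supports interactive storytelling and low-latency, on-the-fly generation. It uses a dual-cache mechanism and two-stage distillation to address consistency and error accumulation.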

Details Motivation: Existing bidirectional multi-shot video generation models offer limited interactivity and high latency, which makes them ill-suited to long narratives and real-time interactive storytelling. Method: ShotStream is a causal architecture that recasts the task as next-shot generation conditioned on historical context. A bidirectional model is distilled into a causal student via Distribution Matching Distillation. A dual-cache (global/local) mechanism maintains inter-shot and intra-shot visual consistency, with a RoPE discontinuity indicator distinguishing the two caches, and a two-stage self-forcing distillation strategy mitigates error accumulation. Result: The system reaches sub-second latency and 16 FPS on a single GPU. The generated videos are coherent, with quality that matches or exceeds slower bidirectional models, and streaming prompt interaction is supported. Conclusion: ShotStream offers an efficient and practical new paradigm for real-time, interactive multi-shot video generation and moves long-narrative video generation closer to practical use. Abstract: Multi-shot video generation is crucial for long narrative storytelling, yet current bidirectional architectures suffer from limited interactivity and high latency. We propose ShotStream, a novel causal multi-shot architecture that enables interactive storytelling and efficient on-the-fly frame generation. By reformulating the task as next-shot generation conditioned on historical context, ShotStream allows users to dynamically instruct ongoing narratives via streaming prompts. We achieve this by first fine-tuning a text-to-video model into a bidirectional next-shot generator, which is then distilled into a causal student via Distribution Matching Distillation. To overcome the challenges of inter-shot consistency and error accumulation inherent in autoregressive generation, we introduce two key innovations. First, a dual-cache memory mechanism preserves visual coherence: a global context cache retains conditional frames for inter-shot consistency, while a local context cache holds generated frames within the current shot for intra-shot consistency. A RoPE discontinuity indicator is employed to explicitly distinguish the two caches and eliminate ambiguity. Second, to mitigate error accumulation, we propose a two-stage distillation strategy. This begins with intra-shot self-forcing conditioned on ground-truth historical shots and progressively extends to inter-shot self-forcing using self-generated histories, effectively bridging the train-test gap. Extensive experiments demonstrate that ShotStream generates coherent multi-shot videos with sub-second latency, achieving 16 FPS on a single GPU. 
It matches or exceeds the quality of slower bidirectional models, paving the way for real-time interactive storytelling. Training and inference code, as well as the models, are available on our