Table of Contents
cs.CL
[1] When Consistency Becomes Bias: Interviewer Effects in Semi-Structured Clinical Interviews
Hasindri Watawana,Sergio Burdisso,Diego A. Moreno-Galván,Fernando Sánchez-Vega,A. Pastor López-Monroy,Petr Motlicek,Esaú Villatoro-Tello
Main category: cs.CL
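TL;DR: This paper shows that, in automatic depression detection from doctor-patient conversations, model performance is systematically inflated by interviewer prompts, and argues that models should be restricted to participant language so that predictions stay interpretable and genuine.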
Details
Motivation: Despite progress in automatic depression detection, predictions remain hard to interpret, and existing methods may inadvertently exploit artifacts such as the interviewer's fixed prompts, which inflates reported performance.
Method: The authors analyze three public datasets (ANDROIDS, DAIC-WOZ, E-DAIC), comparing models that use the full dialogue with models that use only participant utterances, to test whether interviewer prompts introduce a systematic bias and whether that bias generalizes across architectures and datasets.
Result: Models that see interviewer turns often rely on fixed prompts and their positions to reach high classification accuracy; once those turns are removed, decision evidence is spread more evenly across participant language. The bias appears across datasets and model architectures.
Conclusion: Speaker roles should be strictly separated, models should be evaluated on participant utterances only, and analyses should localize decision evidence by time and speaker, so that models actually learn depression-related linguistic features.
Abstract: Automatic depression detection from doctor-patient conversations has gained momentum thanks to the availability of public corpora and advances in language modeling. However, interpretability remains limited: strong performance is often reported without revealing what drives predictions. We analyze three datasets (ANDROIDS, DAIC-WOZ, and E-DAIC) and identify a systematic bias from interviewer prompts in semi-structured interviews. Models trained on interviewer turns exploit fixed prompts and positions to distinguish depressed from control subjects, often achieving high classification scores without using participant language. Restricting models to participant utterances distributes decision evidence more broadly and reflects genuine linguistic cues. While semi-structured protocols ensure consistency, including interviewer prompts inflates performance by leveraging script artifacts. Our results highlight a cross-dataset, architecture-agnostic bias and emphasize the need for analyses that localize decision evidence by time and speaker to ensure models learn from participants' language.
[2] Demystifying When Pruning Works via Representation Hierarchies
Shwai He,Guoheng Sun,Haichao Zhang,Yun Fu,Ang Li
Main category: cs.CL
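TL;DR: This paper analyzes the effect of network pruning on language models from a representation-hierarchy perspective. Pruning is relatively robust in the embedding and logit spaces, but the nonlinear logit-to-probability transformation amplifies errors and causes marked degradation on generative tasks; non-generative tasks, which rely on a stable probability subspace and the embedding space, are affected much less.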
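As a toy illustration of the logit-to-probability amplification summarized above, the following Python sketch perturbs a logit vector slightly and compares the relative change before and after softmax. The logit values and the perturbation are made-up numbers chosen for illustration, not taken from the paper, and relative L2 deviation is only one assumed way to measure the effect.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def relative_deviation(original, perturbed):
    """Relative L2 deviation: ||perturbed - original|| / ||original||."""
    diff = math.sqrt(sum((p - o) ** 2 for o, p in zip(original, perturbed)))
    norm = math.sqrt(sum(o ** 2 for o in original))
    return diff / norm


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


# Hypothetical logits with two close competitors, and a small perturbation
# standing in for the effect of pruning (illustrative numbers only).
logits = [10.0, 9.8, 2.0]
pruned_logits = [9.8, 10.0, 2.0]

logit_dev = relative_deviation(logits, pruned_logits)
prob_dev = relative_deviation(softmax(logits), softmax(pruned_logits))
top_token_flipped = argmax(logits) != argmax(pruned_logits)

print(f"relative deviation in logit space:       {logit_dev:.3f}")
print(f"relative deviation in probability space: {prob_dev:.3f}")
print(f"top token changed: {top_token_flipped}")
```

In this toy case the deviation is several times larger in probability space than in logit space, and the most likely token changes. Under greedy decoding a changed token would also alter the context of every later step, which is one way such deviations could compound during generation.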
Details
Motivation: Explain why network pruning works on non-generative tasks but often fails on generative ones.
Method: Decompose the internal computation of language models into a representation hierarchy (embedding, logit, probability) and analyze how pruning-induced perturbations propagate and are amplified in each space.
Result: The embedding and logit spaces are robust to pruning perturbations, but the nonlinear logit-to-probability transformation amplifies the deviations, which accumulate across time steps and severely degrade generation quality; the stability of the probability subspace underpins the effectiveness of pruning on non-generative tasks.
Conclusion: The effect of pruning depends strongly on the task type and on the properties of the representation space the task relies on, so pruning strategies should be applied selectively per task.
Abstract: Network pruning, which removes less important parameters or architectures, is often expected to improve efficiency while preserving performance. However, this expectation does not consistently hold across language tasks: pruned models can perform well on non-generative tasks but frequently fail in generative settings. To understand this discrepancy, we analyze network pruning from a representation-hierarchy perspective, decomposing the internal computation of language models into three sequential spaces: embedding (hidden representations), logit (pre-softmax outputs), and probability (post-softmax distributions). We find that representations in the embedding and logit spaces are largely robust to pruning-induced perturbations. However, the nonlinear transformation from logits to probabilities amplifies these deviations, which accumulate across time steps and lead to substantial degradation during generation. In contrast, the stability of the categorical-token probability subspace, together with the robustness of the embedding space, supports the effectiveness of pruning for non-generative tasks such as retrieval and multiple-choice selection. Our analysis disentangles the effects of pruning across tasks and provides practical guidance for its application. Code is available at https://github.com/CASE-Lab-UMD/Pruning-on-Representations
[3] Fine-Tuning A Large Language Model for Systematic Review Screening
Kweku Yamoah,Noah Schroeder,Emmanuel Dorley,Neha Rani,Caleb Schutz
Main category: cs.CL
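TL;DR: This paper explores using a fine-tuned large language model (LLM) to make title and abstract screening in systematic reviews more efficient; the fine-tuned model substantially outperforms the base model across metrics.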
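The screening metrics reported for this entry (agreement with the human coder, true positive rate, true negative rate) can be computed from paired include/exclude decisions. The sketch below is a minimal Python illustration; the function name and the eight example labels are hypothetical and do not come from the paper's data.

```python
def screening_metrics(human, model):
    """Compare model include/exclude decisions against a human coder.

    Both arguments are equal-length lists of booleans (True = include).
    """
    if len(human) != len(model) or not human:
        raise ValueError("inputs must be non-empty and of equal length")

    tp = sum(1 for h, m in zip(human, model) if h and m)
    tn = sum(1 for h, m in zip(human, model) if not h and not m)
    fn = sum(1 for h, m in zip(human, model) if h and not m)
    fp = sum(1 for h, m in zip(human, model) if not h and m)

    positives = tp + fn  # studies the human coder included
    negatives = tn + fp  # studies the human coder excluded
    return {
        "agreement": (tp + tn) / len(human),
        "true_positive_rate": tp / positives if positives else float("nan"),
        "true_negative_rate": tn / negatives if negatives else float("nan"),
    }


# Hypothetical decisions for eight studies (illustrative only).
human_labels = [True, True, False, False, False, False, True, False]
model_labels = [True, False, False, False, True, False, True, False]

metrics = screening_metrics(human_labels, model_labels)
```

In screening, a missed relevant study (false negative) is usually costlier than an extra full-text check, which is why the true positive rate is worth reporting separately from overall agreement.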
Details
Motivation: Systematic reviews are time- and labor-intensive, especially at the title and abstract screening stage; existing prompting-based LLM approaches give inconsistent results, which the authors attribute to prompts alone not providing enough context.
Method: Fine-tune a 1.2-billion-parameter open-weight LLM on human-rated titles and abstracts (more than 8,500, from one systematic review), framed as binary classification (include or exclude).
Result: The weighted F1 score improves by 80.79% over the base model; on the full set of 8,277 studies, the fine-tuned model reaches 86.40% agreement with the human coder, a 91.18% true positive rate, and an 86.38% true negative rate, with identical results across multiple inference runs.
Conclusion: Fine-tuning a small LLM for a specific task can substantially improve initial screening in systematic reviews and shows practical promise.
Abstract: Systematic reviews traditionally have taken considerable amounts of human time and energy to complete, in part due to the extensive number of titles and abstracts that must be reviewed for potential inclusion. Recently, researchers have begun to explore how to use large language models (LLMs) to make this process more efficient. However, research to date has shown inconsistent results. We posit this is because prompting alone may not provide sufficient context for the model(s) to perform well. In this study, we fine-tune a small 1.2 billion parameter open-weight LLM specifically for study screening in the context of a systematic review in which humans rated more than 8,500 titles and abstracts for potential inclusion. Our results showed strong performance improvements from the fine-tuned model, with the weighted F1 score improving 80.79% compared to the base model. When run on the full dataset of 8,277 studies, the fine-tuned model had 86.40% agreement with the human coder, a 91.18% true positive rate, an 86.38% true negative rate, and perfect agreement across multiple inference runs. Taken together, our results show that there is promise for fine-tuning LLMs for title and abstract screening in large-scale systematic reviews.
[4] Evaluating Fine-Tuned LLM Model For Medical Transcription With Small Low-Resource Languages Validated Dataset
Mohammed Nowshad Ruhani Chowdhury,Mohammed Nowaz Rabbani Chowdhury,Sakari Lukkarinen
Main category: cs.CL
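TL;DR: This study fine-tunes LLaMA 3.1-8B on a small validated corpus of simulated Finnish clinical conversations to examine its effectiveness for Finnish medical transcription; the results show strong semantic similarity and support the feasibility of using the approach for clinical documentation in Finnish.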
Details
Motivation: Address the administrative burden and physician burnout associated with electronic health records (EHRs) in low-resource languages such as Finnish, and improve clinical documentation quality and patient safety.
Method: Fine-tune LLaMA 3.1-8B on a corpus of clinical conversations simulated by students at Metropolia University of Applied Sciences, using controlled preprocessing and optimization, and evaluate with sevenfold cross-validation.
Result: BLEU = 0.1214, ROUGE-L = 0.4982, BERTScore F1 = 0.8230, indicating low n-gram overlap but high semantic similarity.
Conclusion: Fine-tuning LLaMA 3.1-8B can effectively support transcription of Finnish medical discourse, supports the feasibility of privacy-oriented, domain-specific large language models for Finnish clinical documentation, and points to directions for future work.
Abstract: Clinical documentation is a critical factor for patient safety, diagnosis, and continuity of care. The administrative burden of EHRs is a significant factor in physician burnout. This is a critical issue for low-resource languages, including Finnish. This study aims to investigate the effectiveness of a domain-aligned natural language processing (NLP) large language model for medical transcription in Finnish by fine-tuning LLaMA 3.1-8B on a small validated corpus of simulated clinical conversations by students at Metropolia University of Applied Sciences. The fine-tuning process for medical transcription used a controlled preprocessing and optimization approach. The fine-tuning effectiveness was evaluated by sevenfold cross-validation. The evaluation metrics for fine-tuned LLaMA 3.1-8B were BLEU = 0.1214, ROUGE-L = 0.4982, and BERTScore F1 = 0.8230. The results showed a low n-gram overlap but a strong semantic similarity with reference transcripts. This study indicates that fine-tuning can be an effective approach for translation of medical discourse in spoken Finnish and supports the feasibility of fine-tuning a privacy-oriented domain-specific large language model for clinical documentation in Finnish. It also provides directions for future work.
[5] Enhancing Structured Meaning Representations with Aspect Classification
Claire Benét Post,Paul Bontempo,August Milliken,Alvin Po-Chun Chen,Nicholas Derby,Saksham Khatwani,Sumeyye Nabieva,Karthik Sairam,Alexis Palmer
Main category: cs.CL
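TL;DR: This paper introduces a new dataset of English sentences that adds UMR aspect labels to AMR graphs lacking the feature, together with an annotation scheme and a multi-step adjudication process for quality control; baseline experiments with three modeling approaches provide initial benchmarks for automatic UMR aspect prediction.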
Details
Motivation: Aspect is essential for fully describing the temporal structure of events, yet it is largely unannotated in existing semantic representation frameworks such as AMR, which hinders both manual annotation and the development of automatic prediction systems.
Method: Build an annotation scheme and guidelines based on the UMR aspect lattice, label eventive predicates in AMR graphs with aspect, and use a multi-step adjudication process to ensure consistency and quality; then run baseline experiments with three modeling approaches (such as sequence labeling and graph neural networks).
Result: Released the first dataset of UMR aspect annotations over AMR graphs and established initial benchmarks for automatic aspect prediction, demonstrating the dataset's usefulness for bringing aspect into semantic representations.
Conclusion: The work fills a gap in aspect annotation for semantic representations and lays the data and methodological groundwork for later automated modeling and richer semantic representations.
Abstract: To fully capture the meaning of a sentence, semantic representations should encode aspect, which describes the internal temporal structure of events. In graph-based meaning representation frameworks such as Uniform Meaning Representations (UMR), aspect lets one know how events unfold over time, including distinctions such as states, activities, and completed events. Despite its importance, aspect remains sparsely annotated across semantic meaning representation frameworks. This has, in turn, hindered not only current manual annotation, but also the development of automatic systems capable of predicting aspectual information. In this paper, we introduce a new dataset of English sentences annotated with UMR aspect labels over Abstract Meaning Representation (AMR) graphs that lack the feature. We describe the annotation scheme and guidelines used to label eventive predicates according to the UMR aspect lattice, as well as the annotation pipeline used to ensure consistency and quality across annotators through a multi-step adjudication process. To demonstrate the utility of our dataset for future automation, we present baseline experiments using three modeling approaches. Our results establish initial benchmarks for automatic UMR aspect prediction and provide a foundation for integrating aspect into semantic meaning representations more broadly.
[6] Synthetic Rewriting as a Quality Multiplier: Evidence from Portuguese Continued Pretraining
Thales Sales Almeida,Rodrigo Nogueira,Hélio Pedrini
Main category: cs.CL
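TL;DR: This paper studies how synthetic rewriting interacts with source data quality in Portuguese continued pretraining. Rewriting high-quality data substantially improves model performance, while rewriting low-quality data helps little, indicating that synthetic rewriting is mainly a quality multiplier rather than a substitute for data curation, and that the effect depends on model scale.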
Details
Motivation: Existing work on synthetic data generation focuses mostly on English and does not systematically control for source data quality; this paper fills that gap by studying how source quality relates to rewriting gains in a Portuguese setting.
Method: Starting from ClassiCC-PT, a Portuguese corpus annotated with STEM and educational quality scores, build two 10B-token subsets (high and low quality); rewrite each into four styles with a 7B instruction-tuned model, producing about 40B tokens of synthetic data per condition; run continued pretraining on English-centric base models with 1.1B and 7B parameters; evaluate on PoETa V2, a 44-task Portuguese benchmark.
Result: At the 7B scale, rewriting high-quality data yields a +3.4 NPM gain, while rewriting low-quality data yields only +0.5 NPM; at the 1.1B scale the difference is weaker, and unmodified low-quality data performs comparably to rewritten high-quality data.
Conclusion: Synthetic rewriting acts mainly as a quality multiplier, cannot replace data curation, and becomes more effective as model scale grows.
Abstract: Synthetic data generation through document rewriting has emerged as a promising technique for improving language model pretraining, yet most studies focus on English and do not systematically control for the quality of the source data being rewritten. We present a controlled study of how synthetic rewriting interacts with source data quality in the context of Portuguese continued pretraining. Starting from ClassiCC-PT, a Portuguese corpus annotated with STEM and Educational quality scores, we construct two 10B-token subsets at different quality levels and rewrite each into four styles using a 7B instruction-tuned model, producing approximately 40B tokens of synthetic data per condition. We train two English-centric base models (1.1B and 7B parameters) on each condition and evaluate on PoETa V2, a comprehensive 44-task Portuguese benchmark. At the 7B scale, rewriting high-quality data yields a +3.4 NPM gain over the same data unmodified, while rewriting low-quality data provides only +0.5 NPM. At the 1.1B scale, this interaction is weaker, with unmodified low-quality data performing comparably to rewritten high-quality data. Our results demonstrate that synthetic rewriting acts primarily as a quality multiplier rather than a substitute for data curation, and that this effect is scale-dependent.
[7] Prune as You Generate: Online Rollout Pruning for Faster and Better RLVR
Haobo Xu,Sirui Chen,Ruizhong Qiu,Yuchen Yan,Chen Luo,Monica Cheng,Jingrui He,Hanghang Tong
Main category: cs.CL
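TL;DR: This paper proposes arrol, which speeds up Reinforcement Learning with Verifiable Rewards (RLVR) training by pruning rollouts online while keeping the surviving rollouts balanced in correctness, improving both accuracy and training speed.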
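To see why correctness balance among rollouts matters, consider a group-relative advantage of the kind commonly associated with GRPO-style methods, where each rollout's reward is normalized by the mean and standard deviation of its group. The Python sketch below is a minimal illustration under that assumption; the exact normalization used in the paper's baselines is not given in this entry, and the function name and `eps` value are hypothetical.

```python
import math


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by its group's mean and standard deviation.

    `eps` guards against division by zero when all rewards are identical.
    """
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


# A group where every rollout is correct: zero variance, so every
# advantage is zero and the group contributes no learning signal.
all_correct = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

# A correctness-balanced group: advantages are close to +1 and -1.
balanced = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Under this normalization, a group that is all-correct or all-incorrect yields zero advantages, while a balanced group yields the largest-magnitude signal, which is consistent with the entry's emphasis on keeping surviving rollouts correctness-balanced.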
Details
Motivation: Existing RLVR methods such as GRPO and DAPO are computationally expensive, and sparse relative advantages yield weak learning signals.
Method: arrol is an online rollout pruning method: it trains a lightweight quality head to predict the success probability of partial rollouts, prunes on the fly based on those predictions, and steers the surviving rollouts toward a correctness-balanced mix to strengthen the learning signal; at the system level, pruning and re-batching happen inside the inference engine.
Result: On Qwen-3 and LLaMA-3.2 models (1B-8B), arrol improves the average accuracy of GRPO and DAPO by +2.30 to +2.99, speeds up training by up to 1.7x, and adds up to +8.33 average accuracy under test-time scaling.
Conclusion: arrol mitigates both the high computational cost and the weak learning signal of RLVR, improving accuracy and efficiency together.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capabilities of Large Language Models (LLMs). However, methods such as GRPO and DAPO suffer from substantial computational cost, since they rely on sampling many rollouts for each prompt. Moreover, in RLVR the relative advantage is often sparse: many samples become nearly all-correct or all-incorrect, yielding low within-group reward variance and thus weak learning signals. In this paper, we introduce arrol (Accelerating RLVR via online Rollout Pruning), an online rollout pruning method that prunes rollouts during generation while explicitly steering the surviving ones to be more correctness-balanced to enhance learning signals. Specifically, arrol trains a lightweight quality head on-the-fly to predict the success probability of partial rollouts and uses it to make early pruning decisions. The learned quality head can further weigh candidates to improve inference accuracy during test-time scaling. To improve efficiency, we present a system design that prunes rollouts inside the inference engine and re-batches the remaining ones for log-probability computation and policy updates. Across GRPO and DAPO on Qwen-3 and LLaMA-3.2 models (1B-8B), arrol improves average accuracy by +2.30 to +2.99 while achieving up to 1.7x training speedup, and yielding up to +8.33 additional gains in average accuracy in test-time scaling. The code is available at https://github.com/Hsu1023/ARRoL.
[8] LogSigma at SemEval-2026 Task 3: Uncertainty-Weighted Multitask Learning for Dimensional Aspect-Based Sentiment Analysis
Baraa Hikal,Jonas Becker,Bela Gipp
Main category: cs.CL
TL;DR: LogSigma is a system for SemEval-2026 Task 3 (DimABSA), predicting continuous Valence and Arousal scores using learned homoscedastic uncertainty to balance regression objectives, achieving first place on five datasets.
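The balancing idea in this TL;DR can be sketched with the commonly used homoscedastic uncertainty weighting, in which each task loss L_i is scaled by exp(-s_i) and the log-variance s_i is added as a regularizer. The Python sketch below assumes that form; the exact objective used by LogSigma (for example, any 1/2 factors) is not stated in this entry, the loss values are made up, and in a real system the log-variances would be trainable parameters rather than plain floats.

```python
import math


def uncertainty_weighted_loss(task_losses, log_variances):
    """Combine per-task losses with log-variances s_i (assumed form):

        total = sum_i ( exp(-s_i) * L_i + s_i )
    """
    if len(task_losses) != len(log_variances):
        raise ValueError("need one log-variance per task loss")
    return sum(
        math.exp(-s) * loss + s for loss, s in zip(task_losses, log_variances)
    )


def task_weights(log_variances):
    """Effective multiplier exp(-s_i) applied to each task loss."""
    return [math.exp(-s) for s in log_variances]


# Hypothetical Valence and Arousal regression losses (illustrative only).
losses = [2.0, 0.5]

# With all log-variances at zero, the objective is the plain sum of losses.
equal_weighting = uncertainty_weighted_loss(losses, [0.0, 0.0])

# For fixed losses, each term is minimized at s_i = ln(L_i), which
# down-weights the harder (higher-loss) task and up-weights the easier one.
tuned_log_vars = [math.log(2.0), math.log(0.5)]
tuned_weighting = uncertainty_weighted_loss(losses, tuned_log_vars)
tuned_weights = task_weights(tuned_log_vars)
```

Because the weights fall out of the optimization rather than being fixed in advance, they can differ by language, which is the kind of variation the entry reports.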
Details
Motivation: Traditional ABSA predicts discrete sentiment labels, but DimABSA requires continuous Valence and Arousal scores; prediction difficulty varies across languages and domains, necessitating adaptive task balancing.
Method: LogSigma employs learned homoscedastic uncertainty, introducing task-specific log-variance parameters to automatically balance Valence and Arousal regression objectives during training, combined with language-specific encoders and multi-seed ensembling.
Result: LogSigma achieves 1st place on five datasets across both tracks of SemEval-2026 Task 3; learned variance weights differ substantially across languages (e.g., 0.66x for German, 2.18x for English), confirming language-dependent difficulty in VA prediction.
Conclusion: Optimal task balancing in DimABSA is language-dependent and must be learned rather than set a priori; LogSigma's uncertainty-aware approach effectively addresses cross-lingual VA prediction challenges.
Abstract: This paper describes LogSigma, our system for SemEval-2026 Task 3: Dimensional Aspect-Based Sentiment Analysis (DimABSA). Unlike traditional Aspect-Based Sentiment Analysis (ABSA), which predicts discrete sentiment labels, DimABSA requires predicting continuous Valence and Arousal (VA) scores on a 1-9 scale. A central challenge is that Valence and Arousal differ in prediction difficulty across languages and domains. We address this using learned homoscedastic uncertainty, where the model learns task-specific log-variance parameters to automatically balance each regression objective during training. Combined with language-specific encoders and multi-seed ensembling, LogSigma achieves 1st place on five datasets across both tracks. The learned variance weights vary substantially across languages due to differing Valence-Arousal difficulty profiles, from 0.66x for German to 2.18x for English, demonstrating that optimal task balancing is language-dependent and cannot be determined a priori.
[9] Estimating near-verbatim extraction risk in language models with decoding-constrained beam search
A. Feder Cooper,Mark A. Lemley,Christopher De Sa,Lea Duesterwald,Allison Casasola,Jamie Hayes,Katherine Lee,Daniel E. Ho,Percy Liang
Main category: cs.CL
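TL;DR: This paper proposes decoding-constrained beam search for efficiently quantifying near-verbatim extraction risk in large language models, sharply reducing computational cost and revealing memorization risk patterns that verbatim methods miss.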
Details
Motivation: Existing greedy-decoding methods cannot capture how extraction risk varies across sequences. Probabilistic extraction can model that risk but is tractable only for the verbatim case; near-verbatim cases carry comparable privacy and copyright risks yet are hard to assess, because the set of candidate suffixes is combinatorially large and Monte Carlo estimation is expensive (about 100,000 samples per sequence).
Method: Introduce decoding-constrained beam search, which yields deterministic lower bounds on near-verbatim extraction risk at a per-sequence cost comparable to about 20 Monte Carlo samples.
Result: The method surfaces many more extractable sequences than verbatim methods, substantially larger per-sequence extraction mass, and systematic patterns in how near-verbatim risk varies with model size and text type.
Conclusion: Decoding-constrained beam search is an efficient, reliable, and informative tool that fills a key gap in near-verbatim memorization risk assessment, with value for model safety and compliance analysis.
Abstract: Recent work shows that standard greedy-decoding extraction methods for quantifying memorization in LLMs miss how extraction risk varies across sequences. Probabilistic extraction -- computing the probability of generating a target suffix given a prefix under a decoding scheme -- addresses this, but is tractable only for verbatim memorization, missing near-verbatim instances that pose similar privacy and copyright risks. Quantifying near-verbatim extraction risk is expensive: the set of near-verbatim suffixes is combinatorially large, and reliable Monte Carlo (MC) estimation can require ~100,000 samples per sequence. To mitigate this cost, we introduce decoding-constrained beam search, which yields deterministic lower bounds on near-verbatim extraction risk at a cost comparable to ~20 MC samples per sequence. Across experiments, our approach surfaces information invisible to verbatim methods: many more extractable sequences, substantially larger per-sequence extraction mass, and patterns in how near-verbatim extraction risk manifests across model sizes and types of text.
[10] Toward domain-specific machine translation and quality estimation systems
Javad Pourmostafa Roshan Sharami
Main category: cs.CL
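TL;DR: This dissertation studies how to adapt machine translation (MT) and quality estimation (QE) systems to specialized domains with data-driven methods, proposing similarity-based data selection, staged QE training, subword tokenization and vocabulary alignment, and QE-guided in-context learning, which improve performance in cross-domain, low-resource, and zero-shot settings.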
Details
Motivation: MT and QE perform well in general domains but degrade markedly under domain mismatch, so effective domain adaptation methods are needed.
Method: Four data-focused contributions: (1) similarity-based data selection for MT; (2) a staged QE training pipeline that combines domain adaptation with lightweight data augmentation; (3) an analysis of how subword tokenization and vocabulary alignment affect fine-tuning; (4) QE-guided in-context learning for large language models.
Result: Small in-domain subsets outperform much larger generic datasets; the QE method is effective in cross-lingual, zero-shot, and low-resource settings; aligned tokenization-vocabulary setups improve training stability and translation quality; QE-guided in-context learning needs no parameter updates and supports a reference-free setup.
Conclusion: The effectiveness of domain adaptation depends on data selection, representation design, and efficient adaptation strategies; the work offers a systematic set of methods for building robust domain-specific MT and QE systems.
Abstract: Machine Translation (MT) and Quality Estimation (QE) perform well in general domains but degrade under domain mismatch. This dissertation studies how to adapt MT and QE systems to specialized domains through a set of data-focused contributions. Chapter 2 presents a similarity-based data selection method for MT. Small, targeted in-domain subsets outperform much larger generic datasets and reach strong translation quality at lower computational cost. Chapter 3 introduces a staged QE training pipeline that combines domain adaptation with lightweight data augmentation. The method improves performance across domains, languages, and resource settings, including zero-shot and cross-lingual cases. Chapter 4 studies the role of subword tokenization and vocabulary in fine-tuning. Aligned tokenization-vocabulary setups lead to stable training and better translation quality, while mismatched configurations reduce performance. Chapter 5 proposes a QE-guided in-context learning method for large language models. QE models select examples that improve translation quality without parameter updates and outperform standard retrieval methods. The approach also supports a reference-free setup, reducing reliance on a single reference set. These results show that domain adaptation depends on data selection, representation, and efficient adaptation strategies. The dissertation provides methods for building MT and QE systems that perform reliably in domain-specific settings.
[11] LLM-Driven Reasoning for Constraint-Aware Feature Selection in Industrial Systems
Yuhang Zhou,Zhuokai Zhao,Ke Li,Spilios Evmorfos,Gökalp Demirci,Mingyi Wang,Qiao Liu,Qifei Wang,Serena Li,Weiwei Li,Tingting Wang,Mingze Gao,Gedi Zhou,Abhishek Kumar,Xiangjun Fan,Lizhu Zhang,Jiayi Liu
Main category: cs.CL
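TL;DR: This paper proposes the MoFA framework, which uses a large language model to perform interpretable, constraint-aware, sequential feature selection from semantic and quantitative information; it targets industrial settings where labeled data are scarce and is validated in three real-world applications.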
Details
Motivation: Traditional feature selection methods rely on labeled data and statistical heuristics, which makes them hard to apply in production environments where labeled data are limited and multiple operational constraints must be met.
Method: Propose Model Feature Agent (MoFA), a model-driven framework that folds feature definitions, importance scores, correlations, and metadata into structured prompts and performs sequential feature selection through interpretable, constraint-aware reasoning.
Result: Validated in real industrial settings: (1) improved prediction accuracy with reduced feature group complexity; (2) discovery of high-order interaction terms that substantially raise online user engagement; (3) selection of compact, high-value feature subsets that balance accuracy and inference efficiency.
Conclusion: LLM-driven, reasoning-based feature selection is practical and effective in real production systems and offers a new approach for label-scarce settings.
Abstract: Feature selection is a crucial step in large-scale industrial machine learning systems, directly affecting model accuracy, efficiency, and maintainability. Traditional feature selection methods rely on labeled data and statistical heuristics, making them difficult to apply in production environments where labeled data are limited and multiple operational constraints must be satisfied. To address this, we propose Model Feature Agent (MoFA), a model-driven framework that performs sequential, reasoning-based feature selection using both semantic and quantitative feature information. MoFA incorporates feature definitions, importance scores, correlations, and metadata (e.g., feature groups or types) into structured prompts and selects features through interpretable, constraint-aware reasoning. We evaluate MoFA in three real-world industrial applications: (1) True Interest and Time-Worthiness Prediction, where it improves accuracy while reducing feature group complexity, (2) Value Model Enhancement, where it discovers high-order interaction terms that yield substantial engagement gains in online experiments, and (3) Notification Behavior Prediction, where it selects compact, high-value feature subsets that improve both model accuracy and inference efficiency. Together, these results demonstrate the practicality and effectiveness of LLM-based reasoning for feature selection in real production systems.
[12] Exons-Detect: Identifying and Amplifying Exonic Tokens via Hidden-State Discrepancy for Robust AI-Generated Text Detection
Xiaowei Zhu,Yubing Ren,Fang Fang,Shi Wang,Yanan Cao,Li Guo
Main category: cs.CL
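TL;DR: This paper proposes Exons-Detect, a training-free method for AI-generated text detection that uses an exon-aware token reweighting mechanism to improve robustness and interpretability, achieving state-of-the-art results on multiple metrics.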
Details
Motivation: Existing training-free detectors assume that every token contributes uniformly, which makes them less robust on short texts or under localized edits, while the spread of AI-generated text brings societal risks such as misinformation and copyright disputes, so more reliable detection is needed.
Method: Exons-Detect identifies and amplifies highly informative "exonic" tokens from the hidden-state discrepancy between two models, reweights tokens accordingly, and computes an interpretable translation score.
Result: On DetectRL, average AUROC improves by 2.2% relative to the strongest baseline, with strong robustness to adversarial attacks and varying input lengths.
Conclusion: Exons-Detect validates modeling non-uniform token contributions and offers a new approach to training-free AI text detection that balances performance, robustness, and interpretability.
Abstract: The rapid advancement of large language models has increasingly blurred the boundary between human-written and AI-generated text, raising societal risks such as misinformation dissemination, authorship ambiguity, and threats to intellectual property rights. These concerns highlight the urgent need for effective and reliable detection methods. While existing training-free approaches often achieve strong performance by aggregating token-level signals into a global score, they typically assume uniform token contributions, making them less robust under short sequences or localized token modifications. To address these limitations, we propose Exons-Detect, a training-free method for AI-generated text detection based on an exon-aware token reweighting perspective. Exons-Detect identifies and amplifies informative exonic tokens by measuring hidden-state discrepancy under a dual-model setting, and computes an interpretable translation score from the resulting importance-weighted token sequence. Empirical evaluations demonstrate that Exons-Detect achieves state-of-the-art detection performance and exhibits strong robustness to adversarial attacks and varying input lengths. In particular, it attains a 2.2% relative improvement in average AUROC over the strongest prior baseline on DetectRL.
[13] Imperative Interference: Social Register Shapes Instruction Topology in Large Language Models
Tony Mason
Main category: cs.CL
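TL;DR: This paper finds that models treat instructions as social acts rather than technical specifications: the force of an instruction depends on its social register (for example, the imperative mood). Rewriting imperatives as declaratives substantially reduces cross-linguistic differences in instruction following, which suggests that constitutional AI principles written in the imperative mood may produce language-dependent alignment.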
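This entry reports its significance level with a permutation test (see the abstract). As a generic illustration of that technique, and not a reconstruction of the paper's analysis, the Python sketch below runs an exact two-sided permutation test on the absolute difference in group means. The function name and the six measurements are hypothetical.

```python
from itertools import combinations


def exact_permutation_test(group_a, group_b):
    """Exact two-sided permutation test on |mean(a) - mean(b)|.

    Enumerates every way to split the pooled values into groups of the
    original sizes and returns the fraction of splits whose absolute mean
    difference is at least as large as the observed one.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    all_indices = set(range(len(pooled)))

    def abs_mean_diff(a_indices):
        a = [pooled[i] for i in a_indices]
        b = [pooled[i] for i in all_indices - set(a_indices)]
        return abs(sum(a) / len(a) - sum(b) / len(b))

    observed = abs_mean_diff(range(n_a))
    extreme = 0
    total = 0
    for a_indices in combinations(range(len(pooled)), n_a):
        total += 1
        # Small tolerance so floating-point ties count as "at least as large".
        if abs_mean_diff(a_indices) >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Hypothetical variance measurements under two prompt styles.
imperative = [0.9, 1.1, 1.3]
declarative = [0.2, 0.3, 0.4]

p_value = exact_permutation_test(imperative, declarative)
```

Exact enumeration is only practical for small samples; with larger groups, a random subset of permutations is typically sampled instead.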
Details
Motivation: Investigate why multilingual large language models show opposite interaction topologies (cooperative vs. competitive) for semantically identical instructions in different languages, and test whether social register (such as mood) affects how instructions are understood and followed.
Method: Run instruction-level ablation experiments across four languages and four models; hand-author 22 probes against a production system prompt decomposed into 56 blocks, systematically rewrite imperatives as declaratives, and assess the effect with a permutation test.
Result: Rewriting a single instruction block from imperative to declarative reduces cross-linguistic variance by 81% (p = 0.029); rewriting 3 of 11 imperative blocks shifts Spanish instruction topology from competitive to cooperative, with spillover effects on unrewritten blocks.
Conclusion: Models rely on language-specific social register when processing instructions: "NEVER do X" is an exercise of authority whose force varies by language, whereas "X: disabled" is a factual statement that transfers more robustly across languages; constitutional AI principles written in the imperative mood may therefore create language-dependent alignment risks.
Abstract: System prompt instructions that cooperate in English compete in Spanish, with the same semantic content, but opposite interaction topology. We present instruction-level ablation experiments across four languages and four models showing that this topology inversion is mediated by social register: the imperative mood carries different obligatory force across speech communities, and models trained on multilingual data have learned these conventions. Declarative rewriting of a single instruction block reduces cross-linguistic variance by 81% (p = 0.029, permutation test). Rewriting three of eleven imperative blocks shifts Spanish instruction topology from competitive to cooperative, with spillover effects on unrewritten blocks. These findings suggest that models process instructions as social acts, not technical specifications: "NEVER do X" is an exercise of authority whose force is language-dependent, while "X: disabled" is a factual description that transfers across languages. If register mediates instruction-following at inference time, it plausibly does so during training. We state this as a testable prediction: constitutional AI principles authored in imperative mood may create language-dependent alignment. Corpus: 22 hand-authored probes against a production system prompt decomposed into 56 blocks.
[14] Approaches to Analysing Historical Newspapers Using LLMs
Filip Dobranić,Tina Munda,Oliver Pejić,Vojko Gorjanc,Uroš Šmajdek,David Bordon,Jakob Lenardič,Tjaša Konovšek,Kristina Pahor de Maiti Tekavčič,Ciril Bohak,Darja Fišer
Main category: cs.CL
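TL;DR: This study combines topic modelling, large language model (LLM)-based fine-grained sentiment analysis, named-entity graph visualisation, and qualitative discourse analysis to examine how collective identities, political orientations, and national belonging are represented in the Slovene historical newspapers Slovenec and Slovenski narod, exposing their ideological differences and assessing the strengths and limitations of an LLM adapted to Slovene (GaMS3-12B-Instruct) for sentiment classification of OCR-degraded text.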
Details
Motivation: Examine the discursive construction of collective identity and political affiliation in Slovene historical newspapers at the turn of the twentieth century, addressing the limited efficiency and depth of traditional humanities research when handling large, noisy historical text collections.
Method: A mixed-methods approach combining BERTopic topic modelling, aspect-level sentiment analysis with an LLM adapted to Slovene (GaMS3-12B-Instruct) applied to OCR-degraded text, NER-based entity relationship graphs, quantitative network analysis, and critical discourse analysis.
Result: The two newspapers share topics of concern but diverge sharply in ideology; GaMS3-12B-Instruct was the most suitable of the evaluated models but performs better on neutral sentiment than on positive or negative sentiment; the entity graphs show that some groups appear mostly in neutral descriptive contexts while others appear more often in evaluative or conflict-related discourse.
Conclusion: Combining scalable computational methods with critical interpretation effectively supports digital humanities research on noisy historical newspaper data, and both method adaptation and humanistic interpretation remain essential.
Abstract: This study presents a computational analysis of the Slovene historical newspapers Slovenec and Slovenski narod from the sPeriodika corpus, combining topic modelling, large language model (LLM)-based aspect-level sentiment analysis, entity-graph visualisation, and qualitative discourse analysis to examine how collective identities, political orientations, and national belonging were represented in public discourse at the turn of the twentieth century. Using BERTopic, we identify major thematic patterns and show both shared concerns and clear ideological differences between the two newspapers, reflecting their conservative-Catholic and liberal-progressive orientations. We further evaluate four instruction-following LLMs for targeted sentiment classification in OCR-degraded historical Slovene and select the Slovene-adapted GaMS3-12B-Instruct model as the most suitable for large-scale application, while also documenting important limitations, particularly its stronger performance on neutral sentiment than on positive or negative sentiment. Applied at dataset scale, the model reveals meaningful variation in the portrayal of collective identities, with some groups appearing predominantly in neutral descriptive contexts and others more often in evaluative or conflict-related discourse. We then create NER graphs to explore the relationships between collective identities and places. We apply a mixed methods approach to analyse the named entity graphs, combining quantitative network analysis with critical discourse analysis. The investigation focuses on the emergence and development of intertwined historical political and socionomic identities. Overall, the study demonstrates the value of combining scalable computational methods with critical interpretation to support digital humanities research on noisy historical newspaper data.
[15] Closing the Confidence-Faithfulness Gap in Large Language Models
Miranda Muqing Miao,Lyle Ungar
Main category: cs.CL
TL;DR: This paper investigates how large language models (LLMs) verbalize confidence scores, revealing that calibration (actual accuracy) and verbalized confidence are linearly encoded but orthogonal, which leads to miscalibration. It identifies a "Reasoning Contamination Effect" in which reasoning disrupts confidence signals, and proposes a two-stage adaptive steering method that aligns verbalized confidence with internal accuracy estimates, improving calibration.
Details
Motivation: LLMs often produce confidence scores poorly aligned with their actual accuracy, and the underlying geometric and mechanistic reasons remain unclear.
Method: Uses mechanistic interpretability tools (linear probes and contrastive activation addition (CAA) steering) to analyze how calibration and verbalized confidence are encoded in three open-weight LLMs across four datasets; identifies the "Reasoning Contamination Effect"; designs a two-stage adaptive steering pipeline that reads internal accuracy estimates and steers verbalized confidence accordingly.
Result: Calibration and verbalized confidence signals are linearly encoded but orthogonal; reasoning interferes with confidence encoding; the proposed two-stage steering substantially improves calibration alignment across all evaluated models.
Conclusion: Verbalized confidence is not a faithful reflection of model calibration due to geometric orthogonality and reasoning-induced interference; explicit intervention via internal accuracy-guided steering can effectively realign them.
Abstract: Large language models (LLMs) tend to verbalize confidence scores that are largely detached from their actual accuracy, yet the geometric relationship governing this behavior remains poorly understood. In this work, we present a mechanistic interpretability analysis of verbalized confidence, using linear probes and contrastive activation addition (CAA) steering to show that calibration and verbalized confidence signals are encoded linearly but are orthogonal to one another -- a finding consistent across three open-weight models and four datasets. Interestingly, when models are prompted to simultaneously reason through a problem and verbalize a confidence score, the reasoning process disrupts the verbalized confidence direction, exacerbating miscalibration. We term this the "Reasoning Contamination Effect." Leveraging this insight, we introduce a two-stage adaptive steering pipeline that reads the model's internal accuracy estimate and steers verbalized output to match it, substantially improving calibration alignment across all evaluated models.
[16] OMIND: Framework for Knowledge Grounded Finetuning and Multi-Turn Dialogue Benchmark for Mental Health LLMs
Suraj Racha,Prashant Harish Joshi,Utkarsh Maurya,Nitin Yadav,Mridul Sharma,Ananya Kunisetty,Saranya Darisipudi,Nirmal Punjabi,Ganesh Ramakrishnan
Main category: cs.CL
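TL;DR: This paper presents the oMind framework, which builds a high-quality multi-task dataset and a multi-turn dialogue evaluation benchmark to improve the adaptation and performance of large language models in the mental health domain.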
Details
Motivation: Mental health problems are a growing concern, but LLMs in this domain face a lack of high-quality, interpretable, knowledge-grounded training data, restricted training paradigms, and difficulty evaluating multi-turn dialogue.
Method: The oMind framework includes: (1) a data generation pipeline based on structured knowledge retrieval, LLM-based pruning, and human review, which yields a multi-task supervised fine-tuning (SFT) dataset of about 164k examples; (2) oMind-Chat, an expert-annotated multi-turn dialogue benchmark; (3) training and aligning LLMs for diverse capabilities.
Result: oMind LLMs consistently outperform baselines on core capabilities and conversation tasks, and oMind-LLM reaches up to an 80% win rate on reasoning.
Conclusion: The oMind framework improves the applicability and reliability of large language models in mental health and offers a systematic approach to adapting LLMs to specialized domains.
Abstract: Large Language Models (LLMs) have shown remarkable capabilities for complex tasks, yet adaptation to the medical domain, specifically mental health, poses specific challenges. Mental health is a rising concern globally, and LLMs have large potential to help address it. We highlight three primary challenges for LLMs in mental health: a lack of high-quality, interpretable, knowledge-grounded training data; training paradigms restricted to core capabilities; and evaluation of multi-turn dialogue settings. To address these, we present the oMind framework, which includes training and aligning LLM agents for diverse capabilities including conversations, and a high-quality ~164k multi-task SFT dataset produced by our generation pipeline based on structured knowledge retrieval, LLM-based pruning, and review actions. We also introduce oMind-Chat, a novel multi-turn benchmark dataset with expert-annotated turn-level and conversation-level rubrics. Our diverse experiments on both core capabilities and conversations show that oMind LLMs consistently outperform baselines. oMind-LLM also shows significantly better reasoning, with up to an 80% win rate.
[17] Do LLMs Know What They Know? Measuring Metacognitive Efficiency with Signal Detection Theory
Jon-Paul Cacioli
Main category: cs.CL
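TL;DR: This paper proposes an evaluation framework based on Type-2 signal detection theory that uses meta-d' and M-ratio to separate an LLM's Type-1 sensitivity (how much it knows) from its Type-2 metacognitive sensitivity (how well it knows what it knows), and finds substantial differences across models in how well they "know what they don't know".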
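As a rough illustration of the quantities named in this entry, the sketch below computes the textbook Type-1 sensitivity d' from hit and false-alarm rates, a Type-2 AUROC from per-trial confidences, and the M-ratio as meta-d' divided by d'. It is not the paper's code: the function names are hypothetical, the definitions are the generic signal-detection ones, and meta-d' itself is taken as an input because estimating it requires fitting a Type-2 SDT model, which is not attempted here.

```python
from statistics import NormalDist


def d_prime(hit_rate: float, false_alarm_rate: float) -> float:
    """Type-1 sensitivity: d' = z(H) - z(FA), where z is the inverse normal CDF.

    Rates of exactly 0 or 1 are not handled; inv_cdf raises an error for them.
    """
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(false_alarm_rate)


def auroc2(confidences: list[float], correct: list[bool]) -> float:
    """Type-2 AUROC: probability that a random correct trial received higher
    confidence than a random incorrect trial (ties count as 0.5)."""
    right = [c for c, ok in zip(confidences, correct) if ok]
    wrong = [c for c, ok in zip(confidences, correct) if not ok]
    if not right or not wrong:
        raise ValueError("need at least one correct and one incorrect trial")
    wins = sum((r > w) + 0.5 * (r == w) for r in right for w in wrong)
    return wins / (len(right) * len(wrong))


def m_ratio(meta_d: float, d: float) -> float:
    """Metacognitive efficiency: meta-d' (fitted elsewhere) divided by d'."""
    return meta_d / d
```

Because AUROC_2 depends on Type-1 performance while the M-ratio normalizes by d', the two can rank models differently, which is the contrast this entry reports.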
Details
Motivation: Existing LLM confidence metrics such as ECE and the Brier score conflate how much a model knows (Type-1 sensitivity) with how well it knows what it knows (Type-2 metacognitive sensitivity); the two need to be evaluated separately.
Method: Building on Type-2 signal detection theory (SDT), the authors apply meta-d' and M-ratio to four LLMs across 224,000 factual QA trials, and analyze temperature manipulation, domain specificity, and how these metrics compare with AUROC_2.
Result: (1) Metacognitive efficiency varies substantially across models even when Type-1 sensitivity is similar; (2) metacognitive efficiency is domain-specific; (3) temperature shifts the Type-2 criterion while meta-d' stays stable for two of the four models; (4) AUROC_2 and M-ratio produce fully inverted model rankings.
Conclusion: The meta-d' framework distinguishes models that genuinely "know what they don't know" from those that merely appear well calibrated because of criterion placement, with direct implications for model selection, deployment, and human-AI collaboration.
Abstract: Standard evaluation of LLM confidence relies on calibration metrics (ECE, Brier score) that conflate two distinct capacities: how much a model knows (Type-1 sensitivity) and how well it knows what it knows (Type-2 metacognitive sensitivity). We introduce an evaluation framework based on Type-2 Signal Detection Theory that decomposes these capacities using meta-d' and the metacognitive efficiency ratio M-ratio. Applied to four LLMs (Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.3, Llama-3-8B-Base, Gemma-2-9B-Instruct) across 224,000 factual QA trials, we find: (1) metacognitive efficiency varies substantially across models even when Type-1 sensitivity is similar -- Mistral achieves the highest d' but the lowest M-ratio; (2) metacognitive efficiency is domain-specific, with different models showing different weakest domains, invisible to aggregate metrics; (3) temperature manipulation shifts Type-2 criterion while meta-d' remains stable for two of four models, dissociating confidence policy from metacognitive capacity; (4) AUROC_2 and M-ratio produce fully inverted model rankings, demonstrating these metrics answer fundamentally different evaluation questions. The meta-d' framework reveals which models "know what they don't know" versus which merely appear well-calibrated due to criterion placement -- a distinction with direct implications for model selection, deployment, and human-AI collaboration. Pre-registered analysis; code and data publicly available.
[18] Goodness-of-pronunciation without phoneme time alignment
Jeremy H. M. Wong,Nancy F. Chen
Main category: cs.CL
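TL;DR: This paper proposes feature extraction for speech evaluation that is compatible with weakly-supervised ASR models: ASR hypotheses are mapped to a phoneme confusion network, speaking rate and duration are computed at the word level, and phoneme- and frame-level features are combined by cross-attention, which avoids phoneme time alignment. On English and low-resource Tamil datasets it performs comparably with standard frame-synchronous features.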
Details
Motivation: Limited ASR training data makes it hard to extend speech evaluation to low-resource languages. Open-source weakly-supervised models cover many languages, but they are frame-asynchronous and not phonemic, which hinders feature extraction.
Method: (1) Map ASR hypotheses to a phoneme confusion network to obtain phoneme posteriors; (2) use word-level rather than phoneme-level speaking rate and duration; (3) combine phoneme-level and frame-level features with a cross-attention architecture, avoiding phoneme time alignment.
Result: On English speechocean762 and low-resource Tamil datasets, the method performs comparably with standard frame-synchronous features.
Conclusion: The method resolves the incompatibility between weakly-supervised ASR models and feature extraction for speech evaluation, easing the extension of speech evaluation to low-resource languages.
Abstract: In speech evaluation, an Automatic Speech Recognition (ASR) model often computes time boundaries and phoneme posteriors for input features. However, limited data for ASR training hinders expansion of speech evaluation to low-resource languages. Open-source weakly-supervised models are capable of ASR over many languages, but they are frame-asynchronous and not phonemic, hindering feature extraction for speech evaluation. This paper proposes to overcome incompatibilities for feature extraction with weakly-supervised models, easing expansion of speech evaluation to low-resource languages. Phoneme posteriors are computed by mapping ASR hypotheses to a phoneme confusion network. Word-level instead of phoneme-level speaking rate and duration are used. Phoneme and frame-level features are combined using a cross-attention architecture, obviating phoneme time alignment. This performs comparably with standard frame-synchronous features on English speechocean762 and low-resource Tamil datasets.
[19] To Write or to Automate Linguistic Prompts, That Is the Question
Marina Sánchez-Torrón,Daria Akselrod,Jason Rauchwerk
Main category: cs.CL
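TL;DR: This paper systematically compares hand-crafted zero-shot expert prompts, base DSPy signatures, and GEPA-optimized DSPy signatures on translation, terminology insertion, and language quality assessment. Results are task-dependent, and most expert-versus-optimized comparisons show no statistically significant difference.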
Details
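Motivation: To examine whether automatic prompt optimization can replace expert prompt engineering in linguistic tasks, a question that remains unexplored.
Method: Across translation, terminology insertion, and language quality assessment (LQA), the authors compare hand-crafted zero-shot expert prompts, base DSPy signatures, and GEPA-optimized DSPy signatures on five model configurations, with statistical significance testing.
Result: In terminology insertion, optimized and manual prompts are mostly statistically indistinguishable in quality. In translation, each approach wins on different models. In LQA, expert prompts are stronger at error detection while optimization improves characterization. GEPA lifts minimal DSPy signatures across all tasks, and most expert-versus-optimized comparisons show no statistically significant difference.
Conclusion: Automatic prompt optimization such as GEPA can match expert prompt engineering on most linguistic tasks, but the two paradigms differ in kind: optimization searches over labeled data, whereas expert prompts rely on domain expertise and iterative refinement. Automatic optimization therefore complements expert engineering rather than fully replacing it.
Abstract: LLM performance is highly sensitive to prompt design, yet whether automatic prompt optimization can replace expert prompt engineering in linguistic tasks remains unexplored. We present the first systematic comparison of hand-crafted zero-shot expert prompts, base DSPy signatures, and GEPA-optimized DSPy signatures across translation, terminology insertion, and language quality assessment, evaluating five model configurations. Results are task-dependent. In terminology insertion, optimized and manual prompts produce mostly statistically indistinguishable quality. In translation, each approach wins on different models. In LQA, expert prompts achieve stronger error detection while optimization improves characterization. Across all tasks, GEPA elevates minimal DSPy signatures, and the majority of expert-optimized comparisons show no statistically significant difference. We note that the comparison is asymmetric: GEPA optimization searches programmatically over gold-standard splits, whereas expert prompts require in principle no labeled data, relying instead on domain expertise and iterative refinement.
[20] Prompt Attack Detection with LLM-as-a-Judge and Mixture-of-Models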
Hieu Xuan Le,Benjamin Goh,Quy Anh Tang
Main category: cs.CL
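TL;DR: This paper shows that lightweight general-purpose LLMs such as gemini-2.0-flash-lite-001 can act as low-latency security judges: a structured reasoning process (intent decomposition, safety-signal verification, harm assessment, and self-reflection) detects prompt attacks effectively, and the approach is deployed for public service chatbots in Singapore.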
Details
Motivation: Lightweight classifiers and rule-based systems generalize poorly under distribution shift, while high-capacity LLM judges are too slow or costly for live enforcement. This leaves a deployment gap for production guardrails that need low latency and robust prompt-attack detection.
Method: Prompts and output formats are designed to guide a lightweight general-purpose LLM through four reasoning steps: explicit intent decomposition, safety-signal verification, harm assessment, and self-reflection. The method is evaluated on a curated dataset that combines benign queries from real-world chatbots with adversarial prompts from automated red teaming (ART), and a Mixture-of-Models (MoM) setting is also examined.
Result: Lightweight general-purpose LLMs such as gemini-2.0-flash-lite-001 serve as effective low-latency judges and are deployed in production as a centralized guardrail service; the MoM setting yields only modest gains.
Conclusion: With careful prompt design, lightweight general-purpose LLMs can handle live guardrail duties under strict latency constraints, which helps close the deployment gap between security and performance.
Abstract: Prompt attacks, including jailbreaks and prompt injections, pose a critical security risk to Large Language Model (LLM) systems. In production, guardrails must mitigate these attacks under strict low-latency constraints, resulting in a deployment gap in which lightweight classifiers and rule-based systems struggle to generalize under distribution shift, while high-capacity LLM-based judges remain too slow or costly for live enforcement. In this work, we examine whether lightweight, general-purpose LLMs can reliably serve as security judges under real-world production constraints. Through careful prompt and output design, lightweight LLMs are guided through a structured reasoning process involving explicit intent decomposition, safety-signal verification, harm assessment, and self-reflection. We evaluate our method on a curated dataset combining benign queries from real-world chatbots with adversarial prompts generated via automated red teaming (ART), covering diverse and evolving patterns. Our results show that general-purpose LLMs, such as gemini-2.0-flash-lite-001, can serve as effective low-latency judges for live guardrails. This configuration is currently deployed in production as a centralized guardrail service for public service chatbots in Singapore. We additionally evaluate a Mixture-of-Models (MoM) setting to assess whether aggregating multiple LLM judges improves prompt-attack detection performance relative to single-model judges, with only modest gains observed.
[21] Cross-Preference Learning for Sentence-Level and Context-Aware Machine Translation
Ying Li,Xinglin Lyu,Junhui Li,Jinlong Yang,Hengchao Shang,Min Zhang,Shimin Tao,Daimeng Wei
Main category: cs.CL
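TL;DR: This paper proposes Cross-Preference Learning (CPL), a preference-based training framework for context-aware machine translation. By modeling the complementary benefits of sentence-level and context-aware translation, CPL improves a model's ability to exploit context adaptively, and it is validated on multiple models and datasets.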
Details
Motivation: Context-aware MT uses document-level information but does not consistently outperform sentence-level MT, because contextual signals help some sentences more than others. Existing training objectives do not model this variability explicitly, which limits adaptive use of context.
Method: CPL integrates intra-condition and cross-condition preferences into the preference optimization objective, giving the model explicit supervision on when and how context improves translation quality.
Result: On several public context-aware MT tasks with Qwen3-4B, Qwen3-8B, and Llama-3-8B, CPL consistently improves translation quality and robustness under both input conditions, with no architectural changes.
Conclusion: By explicitly modeling how the benefit of context varies, CPL improves context-aware MT and serves as a general training approach that requires no architectural modification.
Abstract: Context-aware machine translation (MT) leverages document-level information, yet it does not consistently outperform sentence-level MT, as contextual signals are unevenly beneficial across sentences. Existing training objectives do not explicitly model this variability, limiting a model's ability to adaptively exploit context. In this paper, we propose Cross-Preference Learning (CPL), a preference-based training framework that explicitly captures the complementary benefits of sentence-level and context-aware MT. CPL achieves this by integrating both intra- and cross-condition preferences into the preference optimization objective. The introduction of intra- and cross-condition preferences provides explicit supervision on when and how contextual information improves translation quality. We validate the proposed approach on several public context-aware MT tasks using multiple models, including Qwen3-4B, Qwen3-8B, and Llama-3-8B. Experimental results demonstrate consistent improvements in translation quality and robustness across both input conditions, achieved without any architectural modifications.
[22] Probing the Lack of Stable Internal Beliefs in LLMs
Yifan Luo,Kangping Xu,Yanzhen Lu,Yang Yuan,Andrew Chi-Chih Yao
Main category: cs.CL
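TL;DR: This paper studies whether LLMs can maintain implicit consistency, that is, persistent adherence to an unstated goal, across multi-turn dialogue. Current LLMs struggle to keep an implicit goal stable unless it is stated explicitly in context, which exposes a key bottleneck for persona-driven LLMs.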
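One way to make "implicit consistency" concrete in a 20-questions setting is to ask whether any single target is compatible with every yes/no answer given so far. The sketch below is a hypothetical check, not the authors' evaluation protocol: the candidate set, the attribute encoding, and the name `consistent_targets` are all assumptions introduced for illustration.

```python
from typing import Callable

Candidate = dict[str, bool]              # attribute name -> value
Question = Callable[[Candidate], bool]   # a yes/no question about a candidate


def consistent_targets(
    candidates: dict[str, Candidate],
    transcript: list[tuple[Question, bool]],
) -> set[str]:
    """Return the names of candidates that agree with every yes/no answer."""
    return {
        name
        for name, attrs in candidates.items()
        if all(question(attrs) == answer for question, answer in transcript)
    }


# Toy candidate set (illustrative only).
candidates = {
    "cat": {"animal": True, "flies": False},
    "eagle": {"animal": True, "flies": True},
    "kite": {"animal": False, "flies": True},
}


def is_animal(c: Candidate) -> bool:
    return c["animal"]


def can_fly(c: Candidate) -> bool:
    return c["flies"]


stable = [(is_animal, True), (can_fly, False)]
drifting = [(is_animal, True), (can_fly, False), (is_animal, False)]

print(consistent_targets(candidates, stable))    # exactly one candidate remains
print(consistent_targets(candidates, drifting))  # empty: the answers contradict each other
```

An empty result means no fixed target could have produced the transcript, which is one observable symptom of the goal drift described in this entry.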
Details
Motivation: Persona-driven LLMs need consistent behavior across turns to simulate personality traits such as persistence or reliability, but current LLMs often lack stable internal representations that anchor their responses, which leads to behavioral drift.
Method: A 20-question-style riddle game: the LLM secretly selects a target and answers the user's guesses only with "yes" or "no", and the evaluation measures how consistently it adheres to the selected target across turns.
Result: The model's implicit goal shifts across turns unless the selected target is explicitly provided in context.
Conclusion: Current LLMs lack a mechanism that anchors implicit goals, which limits realistic personality modeling; new methods are needed to keep unstated internal goals stable over long interactions.
Abstract: Persona-driven large language models (LLMs) require consistent behavioral tendencies across interactions to simulate human-like personality traits, such as persistence or reliability. However, current LLMs often lack stable internal representations that anchor their responses over extended dialogues. This work explores whether LLMs can maintain "implicit consistency", defined as persistent adherence to an unstated goal in multi-turn interactions. We designed a 20-question-style riddle game paradigm where an LLM is tasked with secretly selecting a target and responding to users' guesses with "yes/no" answers. Through evaluations, we find that LLMs struggle to preserve latent consistency: their implicit "goals" shift across turns unless explicitly provided their selected target in context. These findings highlight critical limitations in the building of persona-driven LLMs and underscore the need for mechanisms that anchor implicit goals over time, which is key to realistic personality modeling in interactive applications such as dialogue systems.
[23] A Catalog of Basque Dialectal Resources: Online Collections and Standard-to-Dialectal Adaptations
Jaione Bengoetxea,Itziar Gonzalez-Dios,Rodrigo Agerri
Main category: cs.CL
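TL;DR: This paper catalogs contemporary Basque dialectal data and resources, distinguishing online data originally written in a dialect from standard-to-dialect adapted data. It also presents a manually adapted, high-quality multi-dialect XNLI test set and evaluates the quality of automatically adapted data.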
Details
Motivation: Data scarcity is a primary limitation for dialectal NLP.
Method: The authors catalog and classify Basque dialectal data sources (natively dialectal online data and standard-to-dialect adaptations), manually adapt the XNLI test split into three Basque dialects, and have native speakers manually evaluate the automatically adapted BasPhyCowest dataset.
Result: A catalog of natively dialectal data covering news and radio sites, social media, dictionaries, atlases, and grammar rules; a high-quality parallel XNLI evaluation set in the Western, Central, and Navarrese-Lapurdian dialects; and a native-speaker assessment of whether automatically adapted data can serve as silver data.
Conclusion: The catalog gives Basque dialectal NLP a systematic resource base. The manually adapted data form a gold standard, and automatically adapted data that has been manually evaluated is considered as a complementary silver resource.
Abstract: Recent research on dialectal NLP has identified data scarcity as a primary limitation. To address this limitation, this paper presents a catalog of contemporary Basque dialectal data and resources, offering a systematic and comprehensive compilation of the dialectal data currently available in Basque. Two types of data sources have been distinguished: online data originally written in some dialect, and standard-to-dialect adapted data. The former includes all dialectal data that can be found online, such as news and radio sites, informal tweets, as well as online resources such as dictionaries, atlases, grammar rules, or videos. The latter consists of data that has been adapted from the standard variety to dialectal varieties, either manually or automatically. Regarding the manual adaptation, the test split of the XNLI Natural Language Inference dataset was manually adapted into three Basque dialects: Western, Central, and Navarrese-Lapurdian, yielding a high-quality parallel gold standard evaluation dataset. With respect to the automatic dialectal adaptation, the automatically adapted physical commonsense dataset (BasPhyCowest) underwent additional manual evaluation by native speakers to assess its quality and determine whether it could serve as a viable substitute for full manual adaptation (i.e., silver data creation).
[24] A Decade-Scale Benchmark Evaluating LLMs' Clinical Practice Guidelines Detection and Adherence in Multi-turn Conversations
Andong Tan,Shuyu Dai,Jinglu Wang,Fengtao Zhou,Yan Lu,Xi Wang,Yingcong Chen,Can Yang,Shujie Liu,Hao Chen
Main category: cs.CL
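TL;DR: This paper introduces CPGBench, an automated framework that benchmarks how well LLMs detect and adhere to clinical practice guidelines (CPGs) in multi-turn conversations. LLMs detect most recommendations (71.1%-89.6%) but rarely reference the source correctly (3.6%-29.7%) and adhere to the guidelines only 21.8%-63.2% of the time; a human evaluation with 56 clinicians supports the validity of the automatic analysis.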
Details
Motivation: LLMs are widely used in healthcare scenarios, yet their ability to identify and adhere to clinical practice guidelines (CPGs) in multi-turn conversations is unclear. A systematic evaluation framework is needed to support safe and reliable clinical use.
Method: CPGBench collects 3,418 CPG documents from 9 countries/regions and 2 international organizations, covering 24 specialties, and extracts 32,155 clinical recommendations with metadata (institute, date, recommendation strength, evidence level, and so on). One multi-turn conversation is generated per recommendation, and 8 leading LLMs are evaluated on guideline detection, source referencing, and adherence. A human evaluation with 56 clinicians from different specialties validates the results.
Result: LLMs correctly detect 71.1%-89.6% of recommendations but correctly reference the title (that is, the source) for only 3.6%-29.7%; adherence rates are 21.8%-63.2%. The human evaluation confirms the validity of the automatic analysis.
Conclusion: Current LLMs still show substantial gaps in tracing and applying clinical guidelines, and these gaps need targeted work before safe, responsible deployment in real clinical settings.
Abstract: Clinical practice guidelines (CPGs) play a pivotal role in ensuring evidence-based decision-making and improving patient outcomes. While Large Language Models (LLMs) are increasingly deployed in healthcare scenarios, it is unclear to what extent LLMs can identify and adhere to CPGs during conversations. To address this gap, we introduce CPGBench, an automated framework benchmarking the clinical guideline detection and adherence capabilities of LLMs in multi-turn conversations. We collect 3,418 CPG documents from 9 countries/regions and 2 international organizations published in the last decade, spanning 24 specialties. From these documents, we extract 32,155 clinical recommendations with corresponding publication institute, date, country, specialty, recommendation strength, evidence level, etc. One multi-turn conversation is generated for each recommendation accordingly to evaluate the detection and adherence capabilities of 8 leading LLMs. We find that 71.1%-89.6% of recommendations can be correctly detected, while only 3.6%-29.7% of the corresponding titles can be correctly referenced, revealing the gap between knowing the guideline contents and where they come from. The adherence rates range from 21.8% to 63.2% in different models, indicating a large gap between knowing the guidelines and being able to apply them. To confirm the validity of our automatic analysis, we further conduct a comprehensive human evaluation involving 56 clinicians from different specialties.
To our knowledge, CPGBench is the first benchmark systematically revealing which clinical recommendations LLMs fail to detect or adhere to during conversations. Given that each clinical recommendation may affect a large population and that clinical applications are inherently safety-critical, addressing these gaps is crucial for the safe and responsible deployment of LLMs in real-world clinical practice.
[25] SafeMath: Inference-time Safety improves Math Accuracy
Sagnik Basu,Subhrajit Mitra,Aman Juneja,Somnath Banerjee,Rima Hazra,Animesh Mukherjee
Main category: cs.CL
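TL;DR: This paper shows that math word problems can carry harmful, biased, or psychologically damaging content, introduces the ToxicGSM dataset, and proposes SafeMath, which reduces harmful outputs while maintaining, and in some cases improving, mathematical reasoning accuracy.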
Details
Motivation: Prior work shows that LLMs can be manipulated through adversarial or seemingly benign inputs. Math word problems, as natural language narratives, can become a subtle channel for biased, unethical, or harmful content, with heightened risk in educational settings involving children.
Method: The authors build ToxicGSM, a dataset of 1.9k arithmetic problems that embed harmful or sensitive context while keeping the math well defined; audit existing LLMs on it for both safety and mathematical correctness; and propose SafeMath, a safety alignment technique.
Result: SafeMath reduces harmful outputs while maintaining, and in some cases improving, mathematical reasoning performance. Linguistic harm can be disentangled from math reasoning, so safety alignment need not cost accuracy.
Conclusion: Math word problems can be misused as carriers of harmful content, which calls for alignment methods that handle safety and reasoning accuracy together; ToxicGSM and SafeMath provide a dataset and a method for this direction.
Abstract: Recent research points toward LLMs being manipulated through adversarial and seemingly benign inputs, resulting in harmful, biased, or policy-violating outputs. In this paper, we study an underexplored issue concerning harmful and toxic mathematical word problems. We show that math questions, particularly those framed as natural language narratives, can serve as a subtle medium for propagating biased, unethical, or psychologically harmful content, with heightened risks in educational settings involving children. To support a systematic study of this phenomenon, we introduce ToxicGSM, a dataset of 1.9k arithmetic problems in which harmful or sensitive context is embedded while preserving mathematically well-defined reasoning tasks. Using this dataset, we audit the behaviour of existing LLMs and analyse the trade-offs between safety enforcement and mathematical correctness. We further propose SafeMath -- a safety alignment technique that reduces harmful outputs while maintaining, and in some cases improving, mathematical reasoning performance. Our results highlight the importance of disentangling linguistic harm from math reasoning and demonstrate that effective safety alignment need not come at the cost of accuracy. We release the source code and dataset at https://github.com/Swagnick99/SafeMath/tree/main.
[26] Translation or Recitation? Calibrating Evaluation Scores for Machine Translation of Extremely Low-Resource Languages
Danlu Chen,Ka Sing He,Jiahe Tian,Chenghao Xiao,Zhaofeng Wu,Taylor Berg-Kirkpatrick,Freda Shi
Main category: cs.CL
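TL;DR: This paper proposes the FRED Difficulty Metrics (Fertility Ratio, Retrieval Proxy, Pre-training Exposure, Corpus Diversity) to quantify the intrinsic difficulty of datasets in extremely low-resource machine translation. The metrics show that much of the variability in reported results comes from train-test overlap and pre-training exposure rather than model capability, and that extinct and non-Latin indigenous languages suffer from poor tokenization coverage.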
Details
Motivation: In extremely low-resource MT, results across language pairs are hard to compare, and it is unclear whether reported breakthroughs reflect better methods or artifacts of the data. Researchers working on specific language groups, such as ancient languages, lack a usable reference point.
Method: Four dataset-intrinsic difficulty metrics are proposed: Fertility Ratio (how fragmented the tokenization is), Retrieval Proxy (an approximation of retrieval difficulty), Pre-training Exposure (how often the language appears in pre-training), and Corpus Diversity. They are used to analyze where performance differences on existing benchmarks come from.
Result: A significant portion of result variability is explained by train-test overlap and pre-training exposure rather than model capability. Extinct and non-Latin indigenous languages show high token fertility, which exposes a fundamental limitation of cross-lingual transfer.
Conclusion: The FRED metrics give extremely low-resource (XLR) MT a more transparent and reliable evaluation framework, moving the community from reporting scores alone toward attribution and comparison that account for dataset difficulty.
Abstract: The landscape of extremely low-resource machine translation (MT) is characterized by perplexing variability in reported performance, often making results across different language pairs difficult to contextualize. For researchers focused on specific language groups -- such as ancient languages -- it is nearly impossible to determine if breakthroughs reported in other contexts (e.g., native African or American languages) result from superior methodologies or are merely artifacts of benchmark collection. To address this problem, we introduce the FRED Difficulty Metrics, which include the Fertility Ratio (F), Retrieval Proxy (R), Pre-training Exposure (E), and Corpus Diversity (D) and serve as dataset-intrinsic metrics to contextualize reported scores. These metrics reveal that a significant portion of result variability is explained by train-test overlap and pre-training exposure rather than model capability. Additionally, we identify that some languages -- particularly extinct and non-Latin indigenous languages -- suffer from poor tokenization coverage (high token fertility), highlighting a fundamental limitation of transferring models from high-resource languages that lack a shared vocabulary. By providing these indices alongside performance scores, we enable more transparent evaluation of cross-lingual transfer and provide a more reliable foundation for the XLR MT community.
[27] Comparing Natural and Synthetic Structured Data: A Study of the Passive Verb Alternation in French and Italian
Giuseppe Samo,Paola Merlo
Main category: cs.CL
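TL;DR: This study compares natural and synthetic data for training and evaluating large language models (LLMs), using the passive verb alternation in French and Italian. Models trained on natural data perform robustly on both natural and synthetic test sets, whereas models trained only on synthetic data do not reliably generalize to natural sentences, which underlines the value of natural data in linguistic evaluation.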
Details
Motivation: To examine how natural versus synthetic data affects the ability of LLMs to capture abstract linguistic patterns such as the passive verb alternation, in both training and evaluation, and to assess structured evaluation with Blackbird Language Matrices (BLMs).
Method: Using BLMs as structured probing datasets, the authors compare structured templates instantiated with natural sentences from Universal Dependencies against templates of synthetic sentences, training and testing models on the passive verb alternation in French and Italian.
Result: Models trained and tested on synthetic data reach ceiling performance but do not reliably generalize to natural sentences. Models trained on natural data perform robustly on both natural and synthetic test suites.
Conclusion: Natural data supports the learning and generalization of abstract syntactic and semantic knowledge better than synthetic data, and structured probes such as BLMs are a useful way to evaluate the linguistic abilities of LLMs.
Abstract: This study compares the impact of natural and synthetic data on training and evaluating large language models (LLMs), using the case of passive verb alternation in French and Italian. We use Blackbird Language Matrices (BLMs), structured datasets designed to probe linguistic knowledge of underlying patterns across sentence sets. We compare structured templates instantiated with natural sentences extracted from Universal Dependencies to structured templates of synthetic sentences. Experiments show that while models achieve ceiling performance when trained and tested on synthetic datasets, they do not reliably generalize to natural sentences. In contrast, models trained on natural data exhibit robust performance across both natural and synthetic test suites, demonstrating their superior ability to capture abstract linguistic patterns. These results corroborate the value of natural data and of structured setups in linguistic evaluation for probing LLMs' syntactic and semantic knowledge.
[28] MolQuest: A Benchmark for Agentic Evaluation of Abductive Reasoning in Chemical Structure Elucidation
Taolin Han,Shuang Wu,Jinghang Wang,Yuhao Zhou,Renquan Lv,Bing Zhao,Wei Hu
Main category: cs.CL
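TL;DR: This paper introduces MolQuest, an agent-based evaluation framework built on authentic chemical experimental data that assesses the dynamic reasoning of LLMs in molecular structure elucidation. The task requires multi-turn interaction, integration of multiple spectral sources, and iterative refinement of hypotheses. Even the strongest models reach only about 50% accuracy, which points to a serious gap in strategic scientific reasoning.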
Details
Motivation: Existing scientific benchmarks are mostly static, single-turn question answering, which cannot measure the dynamic reasoning that complex scientific tasks demand when they involve multi-step iteration and experimental interaction.
Method: MolQuest is an agent-based evaluation framework built on authentic chemical experimental data. It frames molecular structure elucidation as a multi-turn interactive task in which the model plans experimental steps, integrates heterogeneous spectral data such as NMR and MS, and iteratively refines structural hypotheses.
Result: State-of-the-art models reach only about 50% accuracy on MolQuest, and most other models stay below 30%, which exposes clear limits in strategic reasoning under realistic research conditions.
Conclusion: MolQuest offers a reproducible and extensible framework for science-oriented LLM evaluation, and the findings identify a key capability gap for LLMs that are meant to take an active part in scientific discovery.
Abstract: Large language models (LLMs) hold considerable potential for advancing scientific discovery, yet systematic assessment of their dynamic reasoning in real-world research remains limited. Current scientific evaluation benchmarks predominantly rely on static, single-turn Question Answering (QA) formats, which are inadequate for measuring model performance in complex scientific tasks that require multi-step iteration and experimental interaction. To address this gap, we introduce MolQuest, a novel agent-based evaluation framework for molecular structure elucidation built upon authentic chemical experimental data. Unlike existing datasets, MolQuest formalizes molecular structure elucidation as a multi-turn interactive task, requiring models to proactively plan experimental steps, integrate heterogeneous spectral sources (e.g., NMR, MS), and iteratively refine structural hypotheses. This framework systematically evaluates LLMs' abductive reasoning and strategic decision-making abilities within a vast and complex chemical space. Empirical results reveal that contemporary frontier models exhibit significant limitations in authentic scientific scenarios: notably, even state-of-the-art (SOTA) models achieve an accuracy of only approximately 50%, while the performance of most other models remains below the 30% threshold.
This work provides a reproducible and extensible framework for science-oriented LLM evaluation. Our findings highlight the critical gap in current LLMs' strategic scientific reasoning, setting a clear direction for future research toward AI that can actively participate in the scientific process.
[29] When Hate Meets Facts: LLMs-in-the-Loop for Check-worthiness Detection in Hate Speech
Nicolás Benjamín Ocampo,Tommaso Caselli,Davide Ceolin
Main category: cs.CL
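TL;DR: This paper combines hate speech (HS) with check-worthiness, releases WSF-ARG+, the first dataset that carries both kinds of labels, and introduces an LLM-in-the-loop framework that makes annotation more efficient without loss of quality. Experiments show that adding check-worthiness information improves HS detection.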
Details
Motivation: Hate speech often spreads in the form of fact-like but not necessarily correct claims. Handling HS and misinformation separately can deepen prejudice, harm bystanders, and increase the burden on moderators, so the two need to be addressed jointly.
Method: The authors release WSF-ARG+, the first dataset that combines hate speech with check-worthiness labels; propose an LLM-in-the-loop annotation framework, tested with 12 open-weight LLMs and validated through human evaluation; and add check-worthiness labels to LLM-based HS detection.
Result: The LLM-in-the-loop framework reduces human annotation effort without compromising quality. HS messages with check-worthy claims show significantly higher harassment and hate. Adding check-worthiness labels improves LLM-based HS detection by up to 0.213 macro-F1, and by 0.154 macro-F1 on average for large models.
Conclusion: Bringing check-worthiness into hate speech detection is both needed and effective; WSF-ARG+ and the LLM-in-the-loop method offer practical tools for addressing HS and misinformation together.
Abstract: Hateful content online is often expressed using fact-like, not necessarily correct information, especially in coordinated online harassment campaigns and extremist propaganda. Failing to jointly address hate speech (HS) and misinformation can deepen prejudice, reinforce harmful stereotypes, and expose bystanders to psychological distress, while polluting public debate. Moreover, these messages require more effort from content moderators because they must assess both harmfulness and veracity, i.e., fact-check them. To address this challenge, we release WSF-ARG+, the first dataset which combines hate speech with check-worthiness information. We also introduce a novel LLM-in-the-loop framework to facilitate the annotation of check-worthy claims. We run our framework, testing it with 12 open-weight LLMs of different sizes and architectures. We validate it through extensive human evaluation, and show that our LLM-in-the-loop framework reduces human effort without compromising the annotation quality of the data. Finally, we show that HS messages with check-worthy claims show significantly higher harassment and hate, and that incorporating check-worthiness labels improves LLM-based HS detection by up to 0.213 macro-F1, and by 0.154 macro-F1 on average for large models.
[30] Separate Before You Compress: The WWHO Tokenization Architecture
Kusal Darshana
Main category: cs.CL
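TL;DR: This paper proposes SGPE, a tokenization algorithm for Abugida scripts such as Sinhala and Devanagari, together with the three-layer WWHO architecture. The approach substantially reduces token counts, makes better use of the context window, and guarantees that syllables are never split.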
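The efficiency figures in this entry rest on simple ratios, sketched below under stated assumptions. Counting words by whitespace is an assumption about how the Token-to-Word Ratio (TWR) is defined, the two toy tokenizers are stand-ins rather than real tokenizers, and `context_extension` encodes one reading of "extending the usable context window": with a fractional token reduction r, a fixed token budget holds 1 / (1 - r) times as much text.

```python
from typing import Callable


def token_to_word_ratio(text: str, tokenize: Callable[[str], list[str]]) -> float:
    """Tokens produced per whitespace-delimited word."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    return len(tokenize(text)) / len(words)


def token_reduction(baseline_tokens: int, new_tokens: int) -> float:
    """Fractional reduction in token count relative to a baseline tokenizer."""
    return 1.0 - new_tokens / baseline_tokens


def context_extension(reduction: float) -> float:
    """How many times more text fits into the same token budget."""
    return 1.0 / (1.0 - reduction)


# Toy tokenizers (illustrative stand-ins only).
def char_tokens(s: str) -> list[str]:
    return [ch for ch in s if not ch.isspace()]


def word_tokens(s: str) -> list[str]:
    return s.split()
```

Under this reading, the 61.7 percent reduction reported for Sinhala in the abstract would correspond to roughly 2.6 times as much text per token budget; the abstract's 4.38 times figure is quoted as an upper bound and is not derived here.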
Details
Motivation: Standard BPE tokenizers break the conjuncts of structurally complex Abugida scripts into meaningless sub-character units. This lowers inference efficiency and imposes a "Token Tax" that falls mainly on language communities in the Global South.
Method: The paper proposes the three-layer WWHO (Where-What-How Often) architecture and SGPE (Syllable-aware Grapheme Pair Encoding), which separate the linguistic rules of a script from statistical compression while supporting seamless multilingual tokenization.
Result: For Sinhala, the Token to Word Ratio (TWR) is 1.274, a 61.7 percent token reduction compared with o200k base; for Hindi, the TWR is 1.181, a 27.0 percent reduction. On mixed-script data, tokens fall by 36.7, 39.6, and 60.2 percent relative to o200k base, Llama 4 Scout, and DeepSeek V3, respectively. The usable context window grows by up to 4.38 times, and no valid syllable is split across tokens.
Conclusion: SGPE addresses tokenization for Abugida scripts in a linguistically sound way while substantially improving tokenization efficiency, which supports LLMs for low-resource languages.
Abstract: Current Large Language Models (LLMs) mostly use BPE (Byte Pair Encoding) based tokenizers, which are very effective for simply structured Latin-script languages such as English. However, standard BPE tokenizers struggle to process complex Abugida scripts due to their structural complexity. The problem is that these tokenizers break complex conjuncts, which are multi-codepoint grapheme clusters, into meaningless sub-character units. This degrades the LLM's reasoning efficiency by forcing it to learn basic orthographic structures at inference time and raises inference costs, resulting in a significant "Token Tax" for the Global South. We propose a new three-layer architecture, the WWHO (Where-What-How Often), and an algorithm named SGPE (Syllable-aware Grapheme Pair Encoding) that separates the linguistic rules of the script from the statistical compression process while enabling seamless multilingual tokenization. Using Sinhala and Devanagari (Hindi/Sanskrit) as highly complex Abugida scripts, we trained WWHO on a cleaned 30-million-sentence dataset and evaluated on a 1,499,950-sentence test set. For Sinhala, SGPE achieves a Token to Word Ratio (TWR) of 1.274 with 4.83 characters per token, representing a 61.7 percent reduction in tokens compared to OpenAI's o200k base. For Hindi, it achieves a TWR of 1.181 (27.0 percent reduction vs o200k).
On the mixed-script (Sinhala, Devanagari, and English) dataset, SGPE achieves an overall TWR of 1.240, representing token reductions of 36.7 percent, 39.6 percent, and 60.2 percent relative to o200k base, Llama 4 Scout, and DeepSeek V3, respectively. This effectively extends the usable context window by up to 4.38 times for these Abugida languages while ensuring a Linguistic Zero-Breakage Guarantee, which ensures that no valid syllable is ever split across multiple tokens.
[31] Beyond Detection: Rethinking Education in the Age of AI-writing
Maria Marina,Alexander Panchenko,Vasily Konovalov
Main category: cs.CL
TL;DR: This paper argues that writing is essential for human deep learning and cognitive development, and warns against outsourcing it to AI tools like ChatGPT; it advocates for pedagogical adaptation and AI-text literacy instead of bans.
Details
Motivation: The increasing use of generative AI in education and daily life risks diminishing writing's cognitive value, prompting an investigation into what is lost when machines replace human writing.
Method: Drawing on cognitive psychology, educational theory, and real classroom practices, the paper analyzes the cognitive role of writing and evaluates AI-text detection and pedagogical responses.
Result: The paper concludes that the writing process itself, despite being messy and slow, is vital for deep human learning, and that recognizing AI-generated text is becoming a crucial 21st-century literacy skill.
Conclusion: Writing must be preserved as a core cognitive practice; educators should prioritize thoughtful pedagogy and AI-literacy over restrictive policies.
Abstract: As generative AI tools like ChatGPT enter classrooms, workplaces and everyday thinking, writing is at risk of becoming a formality -- outsourced, automated and stripped of its cognitive value. But writing is not just output; it is how we learn to think. This paper explores what is lost when we let machines write for us, drawing on cognitive psychology, educational theory and real classroom practices. We argue that the process of writing -- messy, slow, often frustrating -- is where deep human learning happens. The paper also explores the current possibilities of AI-text detection, how educators can adapt through smarter pedagogy rather than bans, and why the ability to recognize machine-generated language may become a critical literacy of the 21st century. In a world where writing can be faked, learning cannot.
[32] Adaptive Chunking: Optimizing Chunking-Method Selection for RAG
Paulo Roberto de Moura Júnior,Jean Lelong,Annabelle Blangero
Main category: cs.CL
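TL;DR: This paper proposes Adaptive Chunking, a framework that uses five new intrinsic, document-based metrics (such as References Completeness and Intrachunk Cohesion) to evaluate chunkings and select the most suitable chunking strategy for each document, which substantially improves RAG performance.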
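The selection idea can be sketched as follows: chunk a document with each candidate strategy, score each chunking with intrinsic metrics, and keep the best. This is a minimal sketch under stated assumptions, not the released implementation. The two chunkers and the single metric are toy stand-ins (the metric is only loosely inspired by the paper's Size Compliance and does not reproduce its definition), and averaging metric scores with equal weight is an assumption.

```python
from typing import Callable

Chunker = Callable[[str], list[str]]
Metric = Callable[[str, list[str]], float]  # (document, chunks) -> score in [0, 1]


def fixed_size_chunker(size: int) -> Chunker:
    """Toy chunker: cut the document every `size` characters."""
    return lambda doc: [doc[i:i + size] for i in range(0, len(doc), size)]


def paragraph_chunker(doc: str) -> list[str]:
    """Toy chunker: split on blank lines."""
    return [p.strip() for p in doc.split("\n\n") if p.strip()]


def size_compliance(min_len: int, max_len: int) -> Metric:
    """Toy metric: share of chunks whose length lies in [min_len, max_len]."""
    def score(doc: str, chunks: list[str]) -> float:
        if not chunks:
            return 0.0
        return sum(min_len <= len(c) <= max_len for c in chunks) / len(chunks)
    return score


def select_chunker(
    doc: str, chunkers: dict[str, Chunker], metrics: list[Metric]
) -> tuple[str, list[str]]:
    """Chunk with every strategy and keep the one with the best mean score."""
    best_name, best_chunks, best_score = "", [], float("-inf")
    for name, chunker in chunkers.items():
        chunks = chunker(doc)
        mean_score = sum(m(doc, chunks) for m in metrics) / len(metrics)
        if mean_score > best_score:
            best_name, best_chunks, best_score = name, chunks, mean_score
    return best_name, best_chunks


doc = "First paragraph here.\n\nSecond paragraph here."
name, chunks = select_chunker(
    doc,
    {"fixed": fixed_size_chunker(5), "paragraph": paragraph_chunker},
    [size_compliance(10, 40)],
)
print(name, chunks)
```

Because the scores depend only on the document and its chunks, the choice is made per document and before any retrieval or generation step, in line with the entry's emphasis on evaluation that is independent of downstream performance.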
Details
Motivation: 现有RAG中的分块方法多为“一刀切”,难以适配多样文本结构;且缺乏独立于下游任务的分块质量评估框架。 Method: 提出Adaptive Chunking框架,定义五个内在分块质量指标(RC、ICC、DCC、BI、SC),并设计两种新分块器(LLM-regex splitter和split-then-merge递归分块器)及后处理技术,实现基于指标的自适应分块选择。 Result: 在法律、技术与社会科学多领域数据集上验证,RAG答案正确率从62–64%提升至72%,成功回答问题数增加超30%(65 vs. 49),且不改变模型或提示词。 Conclusion: 文档感知的自适应分块结合内在指标评估,是提升RAG鲁棒性的实用有效路径。 Abstract: The effectiveness of Retrieval-Augmented Generation (RAG) is highly dependent on how documents are chunked, that is, segmented into smaller units for indexing and retrieval. Yet, commonly used "one-size-fits-all" approaches often fail to capture the nuanced structure and semantics of diverse texts. Despite its central role, chunking lacks a dedicated evaluation framework, making it difficult to assess and compare strategies independently of downstream performance. We challenge this paradigm by introducing Adaptive Chunking, a framework that selects the most suitable chunking strategy for each document based on a set of five novel intrinsic, document-based metrics: References Completeness (RC), Intrachunk Cohesion (ICC), Document Contextual Coherence (DCC), Block Integrity (BI), and Size Compliance (SC), which directly assess chunking quality across key dimensions. To support this framework, we also introduce two new chunkers, an LLM-regex splitter and a split-then-merge recursive splitter, alongside targeted post-processing techniques. On a diverse corpus spanning legal, technical, and social science domains, our metric-guided adaptive method significantly improves downstream RAG performance. Without changing models or prompts, our framework increases RAG outcomes, raising answers correctness to 72% (from 62-64%) and increasing the number of successfully answered questions by over 30% (65 vs. 49). These results demonstrate that adaptive, document-aware chunking, guided by a complementary suite of intrinsic metrics, offers a practical and effective path to more robust RAG systems. 
Code available at https://github.com/ekimetrics/adaptive-chunking.
[33] TAPO: Translation Augmented Policy Optimization for Multilingual Mathematical Reasoning
Xu Huang,Zhejian Lai,Zixian Huang,Jiajun Chen,Shujian Huang
Main category: cs.CL
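TL;DR: This paper proposes TAPO, a reinforcement learning framework that uses English as a pivot language, follows an understand-then-reason paradigm, and introduces a step-level relative advantage mechanism, improving large language models on multilingual mathematical reasoning.
Details
Motivation: Large language models excel at mathematical reasoning in English, but their performance drops markedly in multilingual settings because of weaker language understanding, so understanding and reasoning need to be improved jointly.
Method: Building on GRPO, the authors construct Translation-Augmented Policy Optimization (TAPO), which uses English as a pivot, adopts a two-stage understand-then-reason paradigm, and designs a step-level relative advantage mechanism that decouples understanding from reasoning so that translation quality rewards can be integrated.
Result: TAPO outperforms baseline methods on both multilingual mathematical reasoning and translation tasks, and generalizes well to unseen languages and out-of-domain tasks.
Conclusion: By explicitly aligning language understanding with reasoning, TAPO narrows the performance gap in multilingual mathematical reasoning and is compatible with various models.
Abstract: Large Language Models (LLMs) have demonstrated remarkable proficiency in English mathematical reasoning, yet a significant performance disparity persists in multilingual contexts, largely attributed to deficiencies in language understanding. To bridge this gap, we introduce Translation-Augmented Policy Optimization (TAPO), a novel reinforcement learning framework built upon GRPO. TAPO enforces an explicit alignment strategy where the model leverages English as a pivot and follows an understand-then-reason paradigm. Crucially, we employ a step-level relative advantage mechanism that decouples understanding from reasoning, allowing the integration of translation quality rewards without introducing optimization conflicts. Extensive experiments reveal that TAPO effectively synergizes language understanding with reasoning capabilities and is compatible with various models. It outperforms baseline methods in both multilingual mathematical reasoning and translation tasks, while generalizing well to unseen languages and out-of-domain tasks.
[34] Navigating the Prompt Space: Improving LLM Classification of Social Science Texts Through Prompt Engineering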
Erkan Gunes,Christoffer Florczak,Tevfik Murat Yildirim
Main category: cs.CL
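TL;DR: This paper examines how prompt engineering can improve large language model (LLM) performance on social science text classification. A modest increase in prompt context (label descriptions, instructional nudges, and few-shot examples) yields the largest accuracy gains, while further increases can reduce accuracy; effects vary by model, task, and batch size, so each task needs its own validation.
Details
Motivation: LLM performance on text classification varies widely, which calls for a systematic study of how to improve accuracy, with prompt context design as a low-cost, high-potential direction.
Method: The authors systematically vary three aspects of prompt engineering (label descriptions, instructional nudges, and few-shot examples) on two different text classification tasks, evaluate how the amount and combination of context affect performance, and examine the moderating effects of model, task, and batch size.
Result: A minimal increase in prompt context brings the largest gain; further increases yield only marginal gains and sometimes lower accuracy; the gains are substantially heterogeneous across models, tasks, and batch sizes.
Conclusion: Prompt context must be tuned with care; there is no universally optimal strategy, and each LLM text-coding task should be validated individually rather than relying on general rules.
Abstract: Recent developments in text classification using Large Language Models (LLMs) in the social sciences suggest that costs can be cut significantly, while performance can sometimes rival existing computational methods. However, with a wide variance in performance in current tests, we move to the question of how to maximize performance. In this paper, we focus on prompt context as a possible avenue for increasing accuracy by systematically varying three aspects of prompt engineering: label descriptions, instructional nudges, and few-shot examples. Across two different examples, our tests illustrate that a minimal increase in prompt context yields the highest increase in performance, while further increases in context only tend to yield marginal performance increases thereafter. Alarmingly, increasing prompt context sometimes decreases accuracy. Furthermore, our tests suggest substantial heterogeneity across models, tasks, and batch size, underlining the need for individual validation of each LLM coding task rather than reliance on general rules.
[35] Translation Asymmetry in LLMs as a Data Augmentation Factor: A Case Study for 6 Romansh Language Varieties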
Jannis Vamvas,Ignacio Pérez Prat,Angela Heldstab,Dominic P. Fischer,Sina Ahmadi,Rico Sennrich
Main category: cs.CL
TL;DR: This paper proposes a new data augmentation strategy for low-resource machine translation, aligned with the resource gradient between source and target languages, which significantly improves translation quality for Romansh varieties.
Details
Motivation: LLMs fail to generate high-quality synthetic data for Romansh due to confusion among its 6 distinct language varieties, highlighting the need for a more targeted data augmentation approach.
Method: The authors propose aligning data augmentation direction with the resource gradient between source and target languages, rather than relying on LLMs to generate synthetic data from higher-resource languages.
Result: The proposed method surpasses Gemini 3 Pro by 23 BLEU in the lowest-resource Romansh variety and produces fluent translations across all individual Romansh varieties, as confirmed by human evaluation.
Conclusion: Aligning data augmentation with the resource gradient is a more effective strategy than LLM-based synthetic data generation for extremely low-resource language varieties like Romansh.
Abstract: Recent strategies for low-resource machine translation rely on LLMs to generate synthetic data from higher-resource languages. We find that this method fails for Romansh, because LLMs tend to confuse its 6 distinct language varieties. Our experiments show that instead, the direction of data augmentation should be aligned with the resource gradient between source and target language. This approach surpasses Gemini 3 Pro in the lowest-resource variety of Romansh by 23 BLEU. A human evaluation confirms that our experiments yield the first model that generates fluent translations in the individual Romansh varieties.
[36] An Experimental Comparison of the Most Popular Approaches to Fake News Detection
Pietro Dell'Oglio,Alessandro Bondielli,Francesco Marcelloni,Lucia C. Passaro
Main category: cs.CL
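TL;DR: This paper systematically evaluates 12 representative fake news detection approaches on 10 English text datasets, covering traditional machine learning, deep learning, transformers, and cross-domain architectures. In-domain, multi-domain, and cross-domain experiments under a unified binary classification protocol show that fine-tuned models generalize poorly, cross-domain architectures need large amounts of data, and large language models are promising in zero- and few-shot settings.
Details
Motivation: Fake news production and spread have grown more sophisticated with large language models and social media, and the robustness and generalization of existing detectors under realistic conditions (domain shift, out-of-distribution data) remain unclear, calling for a systematic, standardized evaluation.
Method: The authors select 12 representative fake news detection approaches (traditional ML, DL, transformers, and cross-domain architectures), cast 10 heterogeneous English text datasets as a single binary (Real/Fake) task, run in-domain, multi-domain, and cross-domain experiments, and analyze zero- and few-shot behavior.
Result: Fine-tuned models perform well in-domain but generalize poorly; cross-domain architectures reduce the gap but are data-hungry; LLMs offer a promising alternative in zero- and few-shot settings; label harmonization improves comparability at the cost of semantic nuance; all results are limited to the English, text-only setting.
Conclusion: Current fake news detection methods still face significant generalization challenges in realistic settings; future work should focus on cross-domain robustness, data efficiency, and lightweight LLM-based adaptation, and evaluation results should be interpreted with their scope in mind.
Abstract: In recent years, fake news detection has received increasing attention in public debate and scientific research. Despite advances in detection techniques, the production and spread of false information have become more sophisticated, driven by Large Language Models (LLMs) and the amplification power of social media. We present a critical assessment of 12 representative fake news detection approaches, spanning traditional machine learning, deep learning, transformers, and specialized cross-domain architectures. We evaluate these methods on 10 publicly available datasets differing in genre, source, topic, and labeling rationale. We address text-only English fake news detection as a binary classification task by harmonizing labels into "Real" and "Fake" to ensure a consistent evaluation protocol. We acknowledge that label semantics vary across datasets and that harmonization inevitably removes such semantic nuances. Each dataset is treated as a distinct domain. We conduct in-domain, multi-domain and cross-domain experiments to simulate real-world scenarios involving domain shift and out-of-distribution data. Fine-tuned models perform well in-domain but struggle to generalize. Cross-domain architectures can reduce this gap but are data-hungry, while LLMs offer a promising alternative through zero- and few-shot learning.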
Given inherent dataset confounds and possible pre-training exposure, results should be interpreted as robustness evaluations within this English, text-only protocol.
[37] Humans vs Vision-Language Models: A Unified Measure of Narrative Coherence
Nikolai Ilinykh,Hyewon Jang,Shalom Lappin,Asad Sayeed,Sharid Loáiciga
Main category: cs.CL
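TL;DR: This paper compares human-written narratives with those generated by vision-language models (VLMs) on the Visual Writing Prompts corpus to assess narrative coherence in visually grounded stories. Despite surface fluency, VLM narratives differ systematically from human ones in coreference, discourse relations, topic continuity, character persistence, and multimodal character grounding.
Details
Motivation: To understand how far narratives generated by vision-language models reach human-level narrative coherence, particularly in visually grounded settings.
Method: Using the Visual Writing Prompts corpus, the authors build a set of coherence metrics covering coreference, discourse relation types, topic continuity, character persistence, and multimodal character grounding, compute a combined narrative coherence score, and compare human and VLM outputs.
Result: VLMs show coherence profiles that are broadly similar to one another and differ systematically from those of humans; differences on individual measures are often subtle but become clearer when the measures are considered jointly; model narratives deviate from human patterns in how discourse is organised across the story.
Conclusion: Current VLMs show human-like surface fluency in visually grounded storytelling but have not acquired human-like discourse organisation, which calls for more attention to discourse-level modeling.
Abstract: We study narrative coherence in visually grounded stories by comparing human-written narratives with those generated by vision-language models (VLMs) on the Visual Writing Prompts corpus. Using a set of metrics that capture different aspects of narrative coherence, including coreference, discourse relation types, topic continuity, character persistence, and multimodal character grounding, we compute a narrative coherence score. We find that VLMs show broadly similar coherence profiles that differ systematically from those of humans. In addition, differences for individual measures are often subtle, but they become clearer when considered jointly. Overall, our results indicate that, despite human-like surface fluency, model narratives exhibit systematic differences from those of humans in how they organise discourse across a visually grounded story. Our code is available at https://github.com/GU-CLASP/coherence-driven-humans.
[38] PICon: A Multi-Turn Interrogation Framework for Evaluating Persona Agent Consistency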
Minseo Kim,Sujeong Im,Junseong Choi,Junhee Lee,Chaeeun Shim,Edward Choi
Main category: cs.CL
TL;DR: This paper proposes PICon, a framework that uses logically chained multi-turn questioning to evaluate LLM-based persona agents along three dimensions (internal, external, and retest consistency), and finds that existing systems still fall clearly short of the human baseline.
Details
Motivation: There is no systematic method for verifying whether LLM-based persona agents stay free of contradictions and factual errors throughout an interaction; the work is inspired by the interrogation principle that an elaborate fabricated identity will eventually expose its contradictions.
Method: The PICon evaluation framework probes persona agents through logically chained multi-turn questioning and quantifies three aspects: internal consistency (no self-contradiction), external consistency (agreement with real-world facts), and retest consistency (stable answers under repeated questioning).
Result: In experiments comparing seven groups of persona agents with 63 real human participants, even systems previously reported as highly consistent fail to reach the human baseline on all three dimensions, revealing contradictions and evasive answers.
Conclusion: PICon provides a conceptual foundation and a practical methodology for evaluating persona agents, and underlines that rigorous consistency checks are needed before treating them as substitutes for human participants.
Abstract: Large language model (LLM)-based persona agents are rapidly being adopted as scalable proxies for human participants across diverse domains. Yet there is no systematic method for verifying whether a persona agent's responses remain free of contradictions and factual inaccuracies throughout an interaction. A principle from interrogation methodology offers a lens: no matter how elaborate a fabricated identity, systematic interrogation will expose its contradictions. We apply this principle to propose PICon, an evaluation framework that probes persona agents through logically chained multi-turn questioning. PICon evaluates consistency along three core dimensions: internal consistency (freedom from self-contradiction), external consistency (alignment with real-world facts), and retest consistency (stability under repetition). Evaluating seven groups of persona agents alongside 63 real human participants, we find that even systems previously reported as highly consistent fail to meet the human baseline across all three dimensions, revealing contradictions and evasive responses under chained questioning. This work provides both a conceptual foundation and a practical methodology for evaluating persona agents before trusting them as substitutes for human participants. We provide the source code and an interactive demo at: https://kaist-edlab.github.io/picon/
[39] Beyond Via: Analysis and Estimation of the Impact of Large Language Models in Academic Papers
Mingmeng Geng,Yuhang Dong,Thierry Poibeau
Main category: cs.CL
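TL;DR: By analyzing arXiv papers, this paper reveals shifts in academic word usage that are likely driven by large language models (LLMs), such as more frequent "beyond" and "via" in titles and less frequent "the" and "of" in abstracts. It shows that current classifiers struggle to identify which specific model generated a text, and uses an interpretable linear approach to quantify the heterogeneity and dynamics of LLM usage.
Details
Motivation: Prior work has paid too little attention to the subtle influence of LLMs on word frequencies in academic text, and lacks quantitative estimates of how varied and dynamic real-world LLM usage is.
Method: The authors adopt a direct and highly interpretable linear approach that accounts for differences between LLMs and prompts, and statistically analyze and model word-frequency changes in arXiv titles and abstracts.
Result: LLMs have measurably changed word usage in academic text ("beyond" and "via" rising, "the" and "of" falling); multi-class classification of the generating model has low accuracy, indicating that model outputs are highly similar; real-world LLM usage is heterogeneous and dynamic.
Conclusion: LLMs are already influencing academic writing style in complex and varied ways, which calls for finer-grained, interpretable methods rather than reliance on black-box classifiers.
Abstract: Through an analysis of arXiv papers, we report several shifts in word usage that are likely driven by large language models (LLMs) but have not previously received sufficient attention, such as the increased frequency of "beyond" and "via" in titles and the decreased frequency of "the" and "of" in abstracts. Due to the similarities among different LLMs, experiments show that current classifiers struggle to accurately determine which specific model generated a given text in multi-class classification tasks. Meanwhile, variations across LLMs also result in evolving patterns of word usage in academic papers. By adopting a direct and highly interpretable linear approach and accounting for differences between models and prompts, we quantitatively assess these effects and show that real-world LLM usage is heterogeneous and dynamic.
[40] Measuring What Matters -- or What's Convenient?: Robustness of LLM-Based Scoring Systems to Construct-Irrelevant Factors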
Cole Walsh,Rodica Ivan
Main category: cs.CL
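TL;DR: This study investigates how construct-irrelevant factors affect a dual-architecture, LLM-based automated scoring system. The system is robust to padding with meaningless text, spelling errors, and changes in writing sophistication, but is sensitive to large duplicated passages and off-topic responses; overall the results support the robustness of LLM-based scoring when it is designed with construct relevance in mind.
Details
Motivation: As large language models are increasingly used in automated scoring, their robustness to construct-irrelevant factors (such as spelling errors, redundant text, and off-topic responses) and the problem of "hallucinations" are receiving renewed attention.
Method: The study uses a dual-architecture LLM-based scoring system that scores short essay-like open responses in a situational judgment test, and systematically tests its behavior under construct-irrelevant perturbations: added meaningless text, spelling errors, changes in writing sophistication, large duplicated passages, and off-topic responses.
Result: The system is generally robust to meaningless padding, spelling errors, and writing sophistication; duplicating large passages lowers predicted scores on average (the opposite of earlier findings for non-LLM systems); off-topic responses are heavily penalized.
Conclusion: When designed around construct relevance, LLM-based automated scoring systems show good robustness, which offers encouraging empirical support for their future use in educational assessment.
Abstract: Automated systems have been widely adopted across the educational testing industry for open-response assessment and essay scoring. These systems commonly achieve performance levels comparable or superior to trained human raters, but have frequently been demonstrated to be vulnerable to the influence of construct-irrelevant factors (i.e., features of responses that are unrelated to the construct assessed) and adversarial conditions. Given the rising usage of large language models in automated scoring systems, there is a renewed focus on "hallucinations" and the robustness of these LLM-based automated scoring approaches to construct-irrelevant factors. This study investigates the effects of construct-irrelevant factors on a dual-architecture LLM-based scoring system designed to score short essay-like open-response items in a situational judgment test. It was found that the scoring system was generally robust to padding responses with meaningless text, spelling errors, and writing sophistication. Duplicating large passages of text resulted in lower scores predicted by the system, on average, contradicting results from previous studies of non-LLM-based scoring systems, while off-topic responses were heavily penalized by the scoring system. These results provide encouraging support for the robustness of future LLM-based scoring systems when designed with construct relevance in mind.
[41] Self-Improvement of Large Language Models: A Technical Overview and Future Outlook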
Haoyan Yang,Mario Xerri,Solha Park,Huajian Zhang,Yiyang Feng,Sai Akhil Kogilathota,Jiawei Zhou
Main category: cs.CL
TL;DR: This paper presents a unified, system-level framework for self-improving large language models, modeling self-improvement as a closed-loop lifecycle of data acquisition, data selection, model optimization, and inference refinement, with an autonomous evaluation layer that monitors and guides the whole process.
Details
Motivation: Human supervision is costly and scales poorly, and human feedback becomes less informative as models approach human-level ability; meanwhile, models' growing capacity for autonomous decisions makes it possible to automate parts of the development process.
Method: The authors propose a closed-loop lifecycle framework with four stages (data acquisition, data selection, model optimization, inference refinement) plus an autonomous evaluation layer, and systematically review and technically analyze representative methods for each component.
Result: The paper provides a systematic, structured unified framework for self-improving LLMs, clarifies how the components relate to one another, and identifies key current limitations.
Conclusion: Self-improvement is an important paradigm for the continued development of LLMs; the framework offers a systematic basis for understanding, designing, and evaluating self-improving systems, and the authors outline future research toward fully self-improving LLMs.
Abstract: As large language models (LLMs) continue to advance, improving them solely through human supervision is becoming increasingly costly and limited in scalability. As models approach human-level capabilities in certain domains, human feedback may no longer provide sufficiently informative signals for further improvement. At the same time, the growing ability of models to make autonomous decisions and execute complex actions naturally enables abstractions in which components of the model development process can be progressively automated. Together, these challenges and opportunities have driven increasing interest in self-improvement, where models autonomously generate data, evaluate outputs, and iteratively refine their own capabilities. In this paper, we present a system-level perspective on self-improving language models and introduce a unified framework that organizes existing techniques. We conceptualize the self-improvement system as a closed-loop lifecycle, consisting of four tightly coupled processes: data acquisition, data selection, model optimization, and inference refinement, along with an autonomous evaluation layer. Within this framework, the model itself plays a central role in driving each stage: collecting or generating data, selecting informative signals, updating its parameters, and refining outputs, while the autonomous evaluation layer continuously monitors progress and guides the improvement cycle across stages. Following this lifecycle perspective, we systematically review and analyze representative methods for each component from a technical standpoint.
We further discuss current limitations and outline our vision for future research toward fully self-improving LLMs.
[42] S2D2: Fast Decoding for Diffusion LLMs via Training-Free Self-Speculation
Ligong Han,Hao Wang,Han Gao,Kai Xu,Akash Srivastava
Main category: cs.CL
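TL;DR: This paper presents S2D2, a training-free self-speculative decoding framework for block-diffusion language models. The same model serves as both drafter (block diffusion) and verifier (autoregressive at block size one), and lightweight routing decides when to run local sequence-level verification, markedly improving the accuracy-speed tradeoff.
Details
Motivation: With few denoising steps, block-diffusion language models rely on confidence-thresholded decoding, where aggressive thresholds hurt quality and conservative thresholds waste computation; existing remedies mostly require additional training or extra test-time compute.
Method: S2D2 exploits the fact that a block-diffusion model becomes autoregressive when the block size is one, so the same pretrained model can act as drafter (block diffusion) and verifier (single-token autoregressive). A speculative verification step, controlled by lightweight routing policies, is inserted into standard block-diffusion decoding, giving a hybrid trajectory of parallel diffusion proposals and local autoregressive checks.
Result: Across three mainstream block-diffusion families, S2D2 consistently outperforms strong confidence-thresholding baselines. On SDAR it is up to 4.7× faster than autoregressive decoding and up to 1.57× faster than a tuned dynamic decoding baseline while improving accuracy by up to 4.5 points; on LLaDA2.1-Mini it complements built-in self-correction and, in a conservative setting, is 4.4× faster than the static baseline with slightly higher accuracy.
Conclusion: S2D2 is an efficient, general, training-free decoding scheme for block-diffusion models that uses the model's own two decoding modes for dynamic verification, improving generation quality while keeping latency low.
Abstract: Block-diffusion language models offer a promising path toward faster-than-autoregressive generation by combining block-wise autoregressive decoding with within-block parallel denoising. However, in the few-step regime needed for practical acceleration, standard confidence-thresholded decoding is often brittle: aggressive thresholds hurt quality, while conservative thresholds require unnecessary denoising steps. Existing approaches that address this issue either require additional training or incur extra test-time compute. We present S2D2, a training-free self-speculative decoding framework for block-diffusion language models. Our key observation is that a block-diffusion model becomes autoregressive when the block size is reduced to one, allowing the same pretrained model to act as both drafter and verifier. S2D2 inserts a speculative verification step into standard block-diffusion decoding and uses lightweight routing policies to decide when verification is worth its cost. This yields a hybrid decoding trajectory in which diffusion proposes tokens in parallel, while the autoregressive mode acts as a local sequence-level critic. Across three mainstream block-diffusion families, S2D2 consistently improves the accuracy-speed tradeoff over strong confidence-thresholding baselines. On SDAR, we observe up to 4.7× speedup over autoregressive decoding, and up to 1.57× over a tuned dynamic decoding baseline while improving accuracy by up to 4.5 points.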
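The draft-then-verify idea can be sketched with generic greedy speculative verification. This is an illustration under stated assumptions, not S2D2's actual acceptance rule or routing policy: `next_token` is a hypothetical stand-in for one autoregressive (block-size-one) step of the model, `should_verify` is an invented confidence-based router, and a real system would check all drafted positions in a single forward pass rather than in a Python loop.

```python
def should_verify(confidences, threshold=0.9):
    """Hypothetical routing rule: verify only when some drafted token looks uncertain."""
    return min(confidences) < threshold

def verify_draft(prefix, draft, next_token):
    """Accept the longest draft prefix the verifier would have produced itself.

    `next_token(context)` returns the verifier's greedy token for the given context.
    """
    accepted = []
    context = list(prefix)
    for token in draft:
        expected = next_token(context)
        if token != expected:
            # First disagreement: keep the verifier's token, drop the rest of the draft.
            accepted.append(expected)
            return accepted
        accepted.append(token)
        context.append(token)
    return accepted

def toy_next_token(context):
    """Toy verifier that always continues a counting sequence."""
    return context[-1] + 1

prefix = [1, 2, 3]
draft = [4, 5, 9, 10]               # pretend these came from one parallel denoising pass
confidences = [0.99, 0.97, 0.55, 0.60]

if should_verify(confidences):
    result = verify_draft(prefix, draft, toy_next_token)
else:
    result = draft
print(result)
```

In this toy run the router triggers verification, the first two drafted tokens are accepted, and the third is replaced by the verifier's own prediction, so drafting resumes from a corrected prefix.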
On LLaDA2.1-Mini, S2D2 remains complementary to built-in self-correction, including a conservative setting where it is 4.4× faster than the static baseline with slightly higher accuracy.
[43] Natural-Language Agent Harnesses
Linyue Pan,Lexiao Zou,Shuo Guo,Jingchen Ni,Hai-Tao Zheng
Main category: cs.CL
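TL;DR: This paper introduces Natural-Language Agent Harnesses (NLAHs) and the Intelligent Harness Runtime (IHR), which decouple agent control logic from code and externalize it as editable natural language to make harnesses easier to transfer, compare, and study, and evaluates the approach on coding and computer-use benchmarks.
Details
Motivation: Agent performance depends heavily on harness engineering, yet harness design is usually buried in controller code and runtime-specific conventions, making it hard to transfer, compare, and study systematically.
Method: The authors propose Natural-Language Agent Harnesses (NLAHs), which describe high-level control logic in natural language, and the Intelligent Harness Runtime (IHR), which executes NLAHs through explicit contracts, durable artifacts, and lightweight adapters.
Result: Controlled evaluations on coding and computer-use benchmarks examine operational viability, module ablation, and code-to-text harness migration.
Conclusion: The authors argue that externalizing harness logic as executable natural-language artifacts is feasible and gives agent engineering a more reproducible basis for study.
Abstract: Agent performance increasingly depends on harness engineering, yet harness design is usually buried in controller code and runtime-specific conventions, making it hard to transfer, compare, and study as a scientific object. We ask whether the high-level control logic of an agent harness can instead be externalized as a portable executable artifact. We introduce Natural-Language Agent Harnesses (NLAHs), which express harness behavior in editable natural language, and Intelligent Harness Runtime (IHR), a shared runtime that executes these harnesses through explicit contracts, durable artifacts, and lightweight adapters. Across coding and computer-use benchmarks, we conduct controlled evaluations of operational viability, module ablation, and code-to-text harness migration.
cs.CV [Back]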
[44] MedOpenClaw: Auditable Medical Imaging Agents Reasoning over Uncurated Full Studies
Weixiang Shen,Yanzhu Hu,Che Liu,Junde Wu,Jiayuan Zhu,Chengzhi Shen,Min Xu,Yueming Jin,Benedikt Wiestler,Daniel Rueckert,Jiazhen Pan
Main category: cs.CV
TL;DR: This paper proposes the MEDOPENCLAW runtime and the MEDFLOWBENCH benchmark to evaluate how well vision-language models actively navigate and make decisions over real clinical 3D, multi-modality imaging studies, and finds that current frontier models perform worse when given professional tools because their spatial grounding is imprecise.
Details
Motivation: Existing evaluations of medical-imaging VLMs are limited to manually pre-selected 2D images, far from clinical practice, where an agent must actively navigate 3D multi-sequence or multi-modality data, gather evidence, and reach a decision.
Method: The authors build MEDOPENCLAW, an auditable runtime that lets VLMs operate dynamically within standard medical tools (e.g., 3D Slicer), and on top of it MEDFLOWBENCH, a full-study benchmark covering multi-sequence brain MRI and lung CT/PET with viewer-only, tool-use, and open-method tracks.
Result: Frontier models such as Gemini 3.1 Pro and GPT-5.4 can solve basic study-level tasks in viewer-only mode, but their performance drops when they are given professional support tools, mainly because they lack precise spatial grounding.
Conclusion: MEDOPENCLAW and MEDFLOWBENCH extend VLM evaluation from static-image perception to interactive clinical workflows and provide a reproducible foundation for building auditable, full-study medical imaging agents.
Abstract: Currently, evaluating vision-language models (VLMs) in medical imaging tasks oversimplifies clinical reality by relying on pre-selected 2D images that demand significant manual labor to curate. This setup misses the core challenge of real-world diagnostics: a true clinical agent must actively navigate full 3D volumes across multiple sequences or modalities to gather evidence and ultimately support a final decision. To address this, we propose MEDOPENCLAW, an auditable runtime designed to let VLMs operate dynamically within standard medical tools or viewers (e.g., 3D Slicer). On top of this runtime, we introduce MEDFLOWBENCH, a full-study medical imaging benchmark covering multi-sequence brain MRI and lung CT/PET. It systematically evaluates medical agentic capabilities across viewer-only, tool-use, and open-method tracks. Initial results reveal a critical insight: while state-of-the-art LLMs/VLMs (e.g., Gemini 3.1 Pro and GPT-5.4) can successfully navigate the viewer to solve basic study-level tasks, their performance paradoxically degrades when given access to professional support tools due to a lack of precise spatial grounding. By bridging the gap between static-image perception and interactive clinical workflows, MEDOPENCLAW and MEDFLOWBENCH establish a reproducible foundation for developing auditable, full-study medical imaging agents.
[45] From Weights to Concepts: Data-Free Interpretability of CLIP via Singular Vector Decomposition
Francesco Gentile,Nicola Dall'Asen,Francesco Tonini,Massimiliano Mancini,Lorenzo Vaquero,Elisa Ricci
Main category: cs.CV
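TL;DR: This paper proposes SITH, a data-free, training-free interpretability framework that works in the weight space of CLIP's vision transformer. Using singular vector decomposition and the COMP algorithm, it produces fine-grained, semantically coherent intra-head explanations and supports interpretable weight edits and analysis of model adaptation.
Details
Motivation: Existing interpretability methods rely on activations, which makes them dataset-dependent, sensitive to data bias, and limited to coarse explanations; a more robust, fine-grained, data-free weight-space method is needed.
Method: SITH decomposes the value-output matrix of each attention head in CLIP's vision transformer into singular vectors, and uses a new algorithm, COMP, to explain each singular vector as a sparse, semantically coherent combination of human-interpretable concepts.
Result: SITH yields coherent and faithful intra-head explanations (validated through reconstruction fidelity and interpretability experiments), supports precise concept-level weight edits that improve downstream performance, and shows that fine-tuning mainly reweights a stable semantic basis rather than learning new features.
Conclusion: SITH is an efficient, data-free, fine-grained approach to transformer interpretability that increases model transparency and supports controllable, interpretable weight editing and analysis of model behavior.
Abstract: As vision-language models are deployed at scale, understanding their internal mechanisms becomes increasingly critical. Existing interpretability methods predominantly rely on activations, making them dataset-dependent, vulnerable to data bias, and often restricted to coarse head-level explanations. We introduce SITH (Semantic Inspection of Transformer Heads), a fully data-free, training-free framework that directly analyzes CLIP's vision transformer in weight space. For each attention head, we decompose its value-output matrix into singular vectors and interpret each one via COMP (Coherent Orthogonal Matching Pursuit), a new algorithm that explains them as sparse, semantically coherent combinations of human-interpretable concepts. We show that SITH yields coherent, faithful intra-head explanations, validated through reconstruction fidelity and interpretability experiments. This allows us to use SITH for precise, interpretable weight-space model edits that amplify or suppress specific concepts, improving downstream performance without retraining. Furthermore, we use SITH to study model adaptation, showing how fine-tuning primarily reweights a stable semantic basis rather than learning entirely new features.
[46] ReDiPrune: Relevance-Diversity Pre-Projection Token Pruning for Efficient Multimodal LLMs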
An Yu,Ting Yu Tsai,Zhenfei Zhang,Weiheng Lu,Felix X. -F. Ye,Ming-Ching Chang
Main category: cs.CV
TL;DR: ReDiPrune is a training-free visual token pruning method that selects informative tokens directly from the vision encoder outputs, before the vision-language projector, balancing text relevance and diversity and clearly improving the accuracy-efficiency trade-off of multimodal large language models.
Details
Motivation: Multimodal large language models are computationally expensive because they must process many visual tokens, so efficient token compression that needs no retraining is required.
Method: ReDiPrune is a training-free, pre-projection visual token pruning method. A lightweight rule scores each token by jointly considering text-conditioned relevance and max-min diversity, and the key tokens are selected directly at the vision encoder output.
Result: The accuracy-efficiency trade-off improves consistently across four video and five image benchmarks; for example, on EgoSchema with LLaVA-NeXT-Video-7B, keeping only 15% of visual tokens gives a +2.0% absolute accuracy gain while reducing computation by more than 6× in TFLOPs.
Conclusion: ReDiPrune is a plug-and-play token pruning scheme that needs no architectural changes or retraining, preserves fine-grained spatial and semantic information, and improves both inference efficiency and performance of multimodal models.
Abstract: Recent multimodal large language models are computationally expensive because Transformers must process a large number of visual tokens. We present ReDiPrune, a training-free token pruning method applied before the vision-language projector, where visual features remain rich and discriminative. Unlike post-projection pruning methods that operate on compressed representations, ReDiPrune selects informative tokens directly from vision encoder outputs, preserving fine-grained spatial and semantic cues. Each token is scored by a lightweight rule that jointly considers text-conditioned relevance and max-min diversity, ensuring the selected tokens are both query-relevant and non-redundant. ReDiPrune is fully plug-and-play, requiring no retraining or architectural modifications, and can be seamlessly inserted between the encoder and projector. Across four video and five image benchmarks, it consistently improves the accuracy-efficiency trade-off. For example, on EgoSchema with LLaVA-NeXT-Video-7B, retaining only 15% of visual tokens yields a +2.0% absolute accuracy gain while reducing computation by more than 6× in TFLOPs. Code is available at https://github.com/UA-CVML/ReDiPrune.
[47] KitchenTwin: Semantically and Geometrically Grounded 3D Kitchen Digital Twins
Quanyun Wu,Kyle Gao,Daniel Long,David A. Clausi,Jonathan Li,Yuhao Chen
Main category: cs.CV
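TL;DR: This paper proposes a scale-aware 3D fusion framework that uses a vision-language-model-guided geometric anchor mechanism to resolve scale and coordinate mismatches between transformer-predicted point clouds and locally reconstructed meshes, producing metrically consistent, semantically grounded digital twin environments.
Details
Motivation: Point clouds produced by transformer-based reconstruction from monocular video suffer from inherent scale ambiguity and inconsistent coordinate conventions, so they cannot be reliably fused with locally reconstructed object meshes; this blocks the construction of the object-centric, metrically accurate, semantically grounded digital twins that embodied AI needs.
Method: The scale-aware 3D fusion framework has two parts: (1) a VLM-guided geometric anchor mechanism that recovers real-world metric scale; (2) a geometry-aware registration pipeline that enforces physical plausibility through gravity-aligned vertical estimation, Manhattan-world structural constraints, and collision-free local refinement.
Result: Experiments in real indoor kitchen environments show improved cross-network object alignment and geometric consistency, benefiting downstream multi-primitive fitting and metric measurement; the authors also release an open-source indoor digital twin dataset with metric scale, semantic annotations, and registered object meshes.
Conclusion: The method bridges the scale and coordinate gap between visual reconstruction and geometric modeling, giving embodied AI a more reliable, measurable, and semantically rich digital twin foundation.
Abstract: Embodied AI training and evaluation require object-centric digital twin environments with accurate metric geometry and semantic grounding. Recent transformer-based feedforward reconstruction methods can efficiently predict global point clouds from sparse monocular videos, yet these geometries suffer from inherent scale ambiguity and inconsistent coordinate conventions. This mismatch prevents the reliable fusion of these dimensionless point cloud predictions with locally reconstructed object meshes. We propose a novel scale-aware 3D fusion framework that registers visually grounded object meshes with transformer-predicted global point clouds to construct metrically consistent digital twins. Our method introduces a Vision-Language Model (VLM)-guided geometric anchor mechanism that resolves this fundamental coordinate mismatch by recovering an accurate real-world metric scale. To fuse these networks, we propose a geometry-aware registration pipeline that explicitly enforces physical plausibility through gravity-aligned vertical estimation, Manhattan-world structural constraints, and collision-free local refinement. Experiments on real indoor kitchen environments demonstrate improved cross-network object alignment and geometric consistency for downstream tasks, including multi-primitive fitting and metric measurement.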
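To make the scale-recovery step concrete, here is a minimal sketch of how one anchor of known real-world size fixes the metric scale of a unitless point cloud. How the paper's anchor mechanism actually obtains and uses its anchors is not described here; the anchor points, the 2.5 m length, and both helper functions are assumptions for illustration only.

```python
import math

def recover_scale(anchor_a, anchor_b, real_distance_m):
    """Factor that maps reconstruction units to metres, from one anchor of known size."""
    measured = math.dist(anchor_a, anchor_b)
    if measured == 0:
        raise ValueError("anchor points coincide; scale is undefined")
    return real_distance_m / measured

def apply_scale(points, scale):
    """Uniformly rescale a list of 3D points."""
    return [tuple(scale * c for c in p) for p in points]

# Hypothetical anchor: two ends of a countertop edge in the unitless point cloud,
# with its real-world length assumed to come from an external estimate.
edge_start = (0.0, 0.0, 0.0)
edge_end = (0.0, 3.0, 4.0)          # 5 reconstruction units from edge_start
scale = recover_scale(edge_start, edge_end, real_distance_m=2.5)
cloud_m = apply_scale([edge_start, edge_end, (2.0, 0.0, 0.0)], scale)
print(scale, cloud_m)
```

A single uniform factor only resolves scale; aligning axes (for example, gravity and Manhattan-world directions) is a separate rotation estimate that this sketch does not attempt.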
We additionally introduce an open-source indoor digital twin dataset with metrically scaled scenes and semantically grounded and registered object-centric mesh annotations.
[48] UniICL: Systematizing Unified Multimodal In-context Learning through a Capability-Oriented Taxonomy
Yicheng Xu,Jiangning Zhang,Zhucun Xue,Teng Hu,Ran Yi,Xiaobin Hu,Yong Liu,Dacheng Tao
Main category: cs.CV
TL;DR: This paper proposes a six-level capability-oriented taxonomy for diagnosing the non-monotonic, task-dependent behavior of In-context Learning in unified multimodal models, builds the large-scale UniICL-760K dataset and the UniICL-Bench benchmark, and designs a lightweight, plug-and-play Context-Adaptive Prototype Modulator that improves the stability and performance of few-shot adaptation.
Details
Motivation: In unified multimodal models, In-context Learning is unstable, non-monotonic, and highly task-dependent because of cross-modal interference and varying cognitive demands, so systematic diagnosis and stabilization methods are needed.
Method: The authors propose a six-level capability-oriented taxonomy of the functional role of demonstrations; construct UniICL-760K, a large-scale corpus of 8-shot episodes across 15 subtasks, together with the controlled benchmark UniICL-Bench; and design the lightweight, plug-and-play Context-Adaptive Prototype Modulator to stabilize few-shot adaptation.
Result: On UniICL-Bench, the approach outperforms larger-parameter multimodal large language model baselines on most understanding In-context Learning tasks.
Conclusion: Systematic analysis based on a cognitive capability taxonomy, combined with a targeted architectural intervention, can reduce the sensitivity and instability of In-context Learning in unified multimodal models.
Abstract: In-context Learning enables training-free adaptation via demonstrations but remains highly sensitive to example selection and formatting. In unified multimodal models spanning understanding and generation, this sensitivity is exacerbated by cross-modal interference and varying cognitive demands. Consequently, In-context Learning efficacy is often non-monotonic and highly task-dependent. To diagnose these behaviors, we introduce a six-level capability-oriented taxonomy that categorizes the functional role of demonstrations from basic perception to high-order discernment. Guided by this cognitive framework, we construct UniICL-760K, a large-scale corpus featuring curated 8-shot In-context Learning episodes across 15 subtasks, alongside UniICL-Bench for rigorous, controlled evaluation. As an architectural intervention to stabilize few-shot adaptation, we propose the Context-Adaptive Prototype Modulator, a lightweight, plug-and-play module. Evaluations on UniICL-Bench show that our approach yields highly competitive unified results, outperforming larger-parameter multimodal large language model baselines on most understanding In-context Learning tasks. Data and code will be available soon at https://github.com/xuyicheng-zju/UniICL.
[49] BCMDA: Bidirectional Correlation Maps Domain Adaptation for Mixed Domain Semi-Supervised Medical Image Segmentation
Bentao Song,Jun Huang,Qingfeng Wang
Main category: cs.CV
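TL;DR: This paper proposes the BCMDA framework, which uses virtual domain bridging and prototype-based alignment and correction to address difficult knowledge transfer and confirmation bias in mixed domain semi-supervised medical image segmentation under domain shift and limited annotations.
Details
Motivation: Mixed domain semi-supervised medical image segmentation (MiDSS) faces two main challenges: large distributional differences between labeled and unlabeled data, and inefficient learning from unlabeled data that causes severe confirmation bias.
Method: The bidirectional correlation maps domain adaptation (BCMDA) framework has two parts: (1) knowledge transfer via virtual domain bridging (KTVDB), which synthesizes images from bidirectional correlation maps, builds virtual domains with fixed-ratio and progressive dynamic MixUp, and transfers knowledge across domains with dual bidirectional CutMix; (2) prototypical alignment and pseudo label correction (PAPLC), which uses learnable prototype cosine similarity classifiers for bidirectional prototype alignment between the virtual and real domains and corrects pseudo labels to reduce confirmation bias.
Result: Experiments on three public multi-domain datasets show the method's advantage, including strong performance with very few labeled samples.
Conclusion: BCMDA eases knowledge transfer and reduces confirmation bias under domain shift and scarce annotations, improving mixed domain semi-supervised medical image segmentation.
Abstract: In mixed domain semi-supervised medical image segmentation (MiDSS), achieving superior performance under domain shift and limited annotations is challenging. This scenario presents two primary issues: (1) distributional differences between labeled and unlabeled data hinder effective knowledge transfer, and (2) inefficient learning from unlabeled data causes severe confirmation bias. In this paper, we propose the bidirectional correlation maps domain adaptation (BCMDA) framework to overcome these issues. On the one hand, we employ knowledge transfer via virtual domain bridging (KTVDB) to facilitate cross-domain learning. First, to construct a distribution-aligned virtual domain, we leverage bidirectional correlation maps between labeled and unlabeled data to synthesize both labeled and unlabeled images, which are then mixed with the original images to generate virtual images using two strategies, a fixed ratio and a progressive dynamic MixUp. Next, dual bidirectional CutMix is used to enable initial knowledge transfer within the fixed virtual domain and gradual knowledge transfer from the dynamically transitioning labeled domain to the real unlabeled domains. On the other hand, to alleviate confirmation bias, we adopt prototypical alignment and pseudo label correction (PAPLC), which utilizes learnable prototype cosine similarity classifiers for bidirectional prototype alignment between the virtual and real domains, yielding smoother and more compact feature representations. Finally, we use prototypical pseudo label correction to generate more reliable pseudo labels.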
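The two mixing strategies named above, a fixed ratio and a progressive dynamic MixUp, can be sketched with standard MixUp on toy data. This is an illustration rather than the authors' implementation: images are flat lists of pixel values, and the linear schedule in `progressive_lambda` is an assumption, since the actual schedule is not specified here.

```python
def mixup(a, b, lam):
    """Standard MixUp: pixel-wise convex combination lam * a + (1 - lam) * b."""
    if len(a) != len(b):
        raise ValueError("images must have the same number of pixels")
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]

def progressive_lambda(step, total_steps, start=1.0, end=0.0):
    """Assumed linear schedule that moves the mixing weight from `start` to `end`."""
    t = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * t

original = [0.0, 1.0, 0.5, 0.25]     # toy stand-in for an original image
synthesized = [1.0, 0.0, 0.5, 0.75]  # toy stand-in for a synthesized image

# Fixed-ratio virtual image: the same weight at every training step.
fixed_virtual = mixup(original, synthesized, lam=0.5)

# Progressive virtual images: the weight shifts as training advances.
progressive_virtual = [
    mixup(original, synthesized, progressive_lambda(step, total_steps=10))
    for step in (0, 5, 10)
]
print(fixed_virtual, progressive_virtual)
```

With the assumed schedule the virtual image starts as the original and ends as the synthesized one, which is one simple way to realize a gradually transitioning domain.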
Empirical evaluations on three public multi-domain datasets demonstrate the superiority of our method, particularly showing excellent performance even with very limited labeled samples. Code is available at https://github.com/pascalcpp/BCMDA.
[50] LLaVA-LE: Large Language-and-Vision Assistant for Lunar Exploration
Gokce Inal,Pouyan Navard,Alper Yilmaz
Main category: cs.CV
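TL;DR: This paper presents LLaVA-LE, a vision-language model specialized for lunar surface and subsurface characterization, builds LUCID, a large-scale multimodal lunar dataset, and uses a two-stage fine-tuning strategy that substantially improves performance on lunar terrain analysis tasks.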
Details
Motivation: Existing vision-language models (VLMs) see limited use in planetary science, mainly because there are no large-scale datasets pairing real planetary imagery with detailed scientific descriptions. Method: Curates LUCID, a multimodal lunar dataset of 96k high-resolution panchromatic images with captions and 81k question-answer pairs; on top of it, fine-tunes LLaVA with a two-stage curriculum: stage 1 is concept alignment for domain-specific terrain description, and stage 2 is instruction-tuned visual question answering. Result: On benchmarks spanning several levels of reasoning complexity for lunar terrain analysis, LLaVA-LE achieves a 3.3x overall gain over Base LLaVA and 2.1x over the Stage 1 model; its reasoning score of 1.070 exceeds the judge's own reference score. Conclusion: Domain-specific multimodal data and instruction tuning can effectively advance vision-language models for planetary exploration. Abstract: Recent advances in multimodal vision-language models (VLMs) have enabled joint reasoning over visual and textual information, yet their application to planetary science remains largely unexplored. A key hindrance is the absence of large-scale datasets that pair real planetary imagery with detailed scientific descriptions. In this work, we introduce LLaVA-LE (Large Language-and-Vision Assistant for Lunar Exploration), a vision-language model specialized for lunar surface and subsurface characterization. To enable this capability, we curate a new large-scale multimodal lunar dataset, LUCID (LUnar Caption Image Dataset), consisting of 96k high-resolution panchromatic images paired with detailed captions describing lunar terrain characteristics, and 81k question-answer (QA) pairs derived from approximately 20k images in the LUCID dataset. Leveraging this dataset, we fine-tune LLaVA using a two-stage training curriculum: (1) concept alignment for domain-specific terrain description, and (2) instruction-tuned visual question answering. We further design evaluation benchmarks spanning multiple levels of reasoning complexity relevant to lunar terrain analysis. Evaluated against GPT and Gemini judges, LLaVA-LE achieves a 3.3x overall performance gain over Base LLaVA and 2.1x over our Stage 1 model, with a reasoning score of 1.070, exceeding the judge's own reference score, highlighting the effectiveness of domain-specific multimodal data and instruction tuning to advance VLMs in planetary exploration. Code is available at https://github.com/OSUPCVLab/LLaVA-LE.
[51] Lookalike3D: Seeing Double in 3D
Chandan Yeshwanth,Angela Dai
Main category: cs.CV
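TL;DR: This paper introduces the task of lookalike object detection, which exploits cues from repeated and similar objects in indoor scenes by classifying object pairs as identical, similar, or different from multiview images. It presents Lookalike3D, a multiview image transformer, and the 3DTwins dataset, reports a 104% IoU improvement over baselines, and demonstrates benefits for downstream tasks such as joint 3D reconstruction and part co-segmentation.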
Details
Motivation: Existing 3D understanding and generation methods often overlook the repeated objects that are pervasive in real scenes, even though identical and near-identical objects provide strong cues for semantic and geometric consistency. Method: Introduces the lookalike object detection task and designs Lookalike3D, a multiview image transformer that draws on strong semantic priors from large image foundation models; collects the 3DTwins dataset of 76k manually annotated object pairs based on ScanNet++. Result: Improves IoU on lookalike detection by 104% over baselines and benefits downstream tasks such as joint 3D object reconstruction and part co-segmentation. Conclusion: Repeated and similar objects are a powerful cue for consistent, high-quality 3D perception; Lookalike3D and 3DTwins provide a method and a benchmark resource for this direction. Abstract: 3D object understanding and generation methods produce impressive results, yet they often overlook a pervasive source of information in real-world scenes: repeated objects. We introduce the task of lookalike object detection in indoor scenes, which leverages repeated and complementary cues from identical and near-identical object pairs. Given an input scene, the task is to classify pairs of objects as identical, similar or different using multiview images as input. To address this, we present Lookalike3D, a multiview image transformer that effectively distinguishes such object pairs by harnessing strong semantic priors from large image foundation models. To support this task, we collected the 3DTwins dataset, containing 76k manually annotated identical, similar and different pairs of objects based on ScanNet++, and show an improvement of 104% IoU over baselines. We demonstrate how our method improves downstream tasks such as enabling joint 3D object reconstruction and part co-segmentation, turning repeated and lookalike objects into a powerful cue for consistent, high-quality 3D perception. Our code, dataset and models will be made publicly available.
[52] Accurate Point Measurement in 3DGS -- A New Alternative to Traditional Stereoscopic-View Based Measurements
Deyan Deng,Rongjun Qin
Main category: cs.CV
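TL;DR: This paper proposes a 3D point measurement method based on 3D Gaussian Splatting (3DGS). It exploits the high-quality, complete novel view synthesis of 3DGS: users manually pick congruent points across views and triangulate them to obtain accurate geometric measurements, without a stereo workstation or a specially trained operator, and with clearly better results than direct mesh-based picking on hard-to-reconstruct regions such as thin structures and sharp corners.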
Details
Motivation: 3D Gaussian Splatting (3DGS) excels at novel view synthesis, but its potential for accurate geometric measurement is underexplored; existing measurement methods rely on incomplete or inaccurate meshes or on demanding stereoscopic workstations, which limits practicality and accuracy. Method: Using the high-quality, smoothly interpolated views rendered by 3DGS, the user picks the same physical point (congruent points) in different views, and its 3D coordinates are computed by triangulation; both two-view and multi-view intersection are supported, the latter for higher accuracy; implemented as a web-based application. Result: On UAV aerial datasets: RMSE of 1-2 cm on well-defined points; on thin structures, 0.037 m versus 0.062 m for the mesh-based method; on sharp corners where the mesh method failed entirely, 0.013 m RMSE; accuracy matches or exceeds traditional stereoscopic measurement. Conclusion: 3DGS can serve not only for rendering but also as a high-fidelity platform for geometric measurement; the method lowers hardware and operator requirements and extends 3DGS to applications that need quantitative geometric analysis, such as surveying and industrial inspection. Abstract: 3D Gaussian Splatting (3DGS) has revolutionized real-time rendering with its state-of-the-art novel view synthesis, but its utility for accurate geometric measurement remains underutilized. Compared to multi-view stereo (MVS) point clouds or meshes, 3DGS rendered views present superior visual quality and completeness. However, current point measurement methods still rely on demanding stereoscopic workstations or direct picking on often-incomplete and inaccurate 3D meshes. As a novel view synthesizer, 3DGS renders exact source views and smoothly interpolates in-between views. This allows users to intuitively pick congruent points across different views while operating 3DGS models. By triangulating these congruent points, one can precisely generate 3D point measurements. This approach mimics traditional stereoscopic measurement but is significantly less demanding: it requires neither a stereo workstation nor specialized operator stereoscopic capability. Furthermore, it enables multi-view intersection (more than two views) for higher measurement accuracy. We implemented a web-based application to demonstrate this proof-of-concept (PoC). Using several UAV aerial datasets, we show this PoC allows users to successfully perform highly accurate point measurements, achieving accuracy matching or exceeding traditional stereoscopic methods on standard hardware. Specifically, our approach significantly outperforms direct mesh-based measurements. Quantitatively, our method achieves RMSEs in the 1-2 cm range on well-defined points.
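The multi-view intersection step described above can be sketched as a least-squares ray intersection. This is a generic textbook formulation, not the code of the released tool: it assumes each picked pixel has already been converted to a viewing ray (camera center plus direction) in a common world frame, and the function names are hypothetical.

```python
import math

def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def solve3(a, b):
    """Solve the 3x3 linear system a x = b with Cramer's rule."""
    d = det3(a)
    if abs(d) < 1e-12:
        raise ValueError("rays are (nearly) parallel; the point is not determined")
    solution = []
    for col in range(3):
        m = [row[:] for row in a]
        for r in range(3):
            m[r][col] = b[r]
        solution.append(det3(m) / d)
    return solution

def triangulate(origins, directions):
    """Point minimizing the summed squared distance to all viewing rays.

    For a ray with origin o and unit direction u, the squared distance from x
    is |(I - u u^T)(x - o)|^2, so the minimizer solves
    sum(I - u u^T) x = sum((I - u u^T) o).
    """
    a = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for o, d in zip(origins, directions):
        norm = math.sqrt(sum(c * c for c in d))
        u = [c / norm for c in d]
        for r in range(3):
            for c in range(3):
                p = (1.0 if r == c else 0.0) - u[r] * u[c]
                a[r][c] += p
                b[r] += p * o[c]
    return solve3(a, b)

# Three toy camera centers observing the same point (illustrative values only).
target = [1.0, 2.0, 3.0]
origins = [[0.0, 0.0, 0.0], [5.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
directions = [[t - o for t, o in zip(target, origin)] for origin in origins]
print(triangulate(origins, directions))
```

Adding more views only adds terms to the same 3x3 system, which is one way to read the abstract's remark that intersecting more than two views can improve accuracy when the individual picks are noisy.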
More critically, on challenging thin structures where mesh-based RMSE was 0.062 m, our method achieved 0.037 m. On sharp corners poorly reconstructed in the mesh, our method successfully measured all points with a 0.013 m RMSE, whereas the mesh method failed entirely. Code is available at: https://github.com/GDAOSU/3dgs_measurement_tool.
[53] Scalable Object Relation Encoding for Better 3D Spatial Reasoning in Large Language Models
Shengli Zhou,Minghang Zheng,Feng Zheng,Yang Liu
Main category: cs.CV
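TL;DR: This paper proposes two methods, QuatRoPE and IGRE, to improve 3D spatial reasoning in large language models. QuatRoPE is a quaternion-based positional embedding whose input length is linear in the number of objects and which explicitly models pairwise spatial relations through the dot product in attention layers; IGRE limits its influence to object-related tokens to preserve the LLM's original capabilities. Experiments support their effectiveness.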
Details
Motivation: Existing methods either struggle to extract spatial relations (absolute position encodings) or scale poorly (explicitly encoding all pairwise spatial relations makes the input length grow quadratically), and 3D scene-language paired data is scarce. Method: Proposes QuatRoPE, a quaternion-based positional embedding whose input length is linear in the number of objects and which explicitly computes pairwise spatial relations through vector dot products in the attention mechanism; introduces the Isolated Gated RoPE Extension (IGRE) to confine QuatRoPE's effect to object-related tokens and avoid interfering with the LLM's existing positional embeddings. Result: Extensive experiments demonstrate the effectiveness of QuatRoPE and IGRE in preserving geometric consistency, improving reasoning, and maintaining the LLM's original capabilities. Conclusion: QuatRoPE and IGRE offer an efficient, scalable, and geometrically consistent approach to 3D spatial reasoning that eases the challenges of data scarcity and model integration. Abstract: Spatial reasoning focuses on locating target objects based on spatial relations in 3D scenes, which plays a crucial role in developing intelligent embodied agents. Due to the limited availability of 3D scene-language paired data, it is challenging to train models with strong reasoning ability from scratch. Previous approaches have attempted to inject 3D scene representations into the input space of Large Language Models (LLMs) and leverage the pretrained comprehension and reasoning abilities for spatial reasoning. However, models encoding absolute positions struggle to extract spatial relations from prematurely fused features, while methods explicitly encoding all spatial relations (which is quadratic in the number of objects) as input tokens suffer from poor scalability. To address these limitations, we propose QuatRoPE, a novel positional embedding method with an input length that is linear in the number of objects, and which explicitly calculates pairwise spatial relations through the dot product in attention layers. QuatRoPE's holistic vector encoding of 3D coordinates guarantees a high degree of spatial consistency, maintaining fidelity to the scene's geometric integrity. Additionally, we introduce the Isolated Gated RoPE Extension (IGRE), which effectively limits QuatRoPE's influence to object-related tokens, thereby minimizing interference with the LLM's existing positional embeddings and maintaining the LLM's original capabilities. Extensive experiments demonstrate the effectiveness of our approaches.
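The idea that a rotary embedding lets the attention dot product expose pairwise relations, while each object still contributes only one token, can be illustrated with ordinary 1-D rotary position embedding (RoPE). The sketch below is standard 2-D RoPE on a scalar position, not QuatRoPE itself: the abstract does not give the quaternion construction for 3D coordinates, and the `theta` value and function names are assumptions.

```python
import math

def rotate(vec, position, theta=0.5):
    """Rotate a 2-D vector by an angle proportional to its position (plain RoPE)."""
    x, y = vec
    angle = position * theta
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)

def attention_score(q, k, pos_q, pos_k):
    """Dot product of the rotated query and key, as computed inside attention."""
    rq = rotate(q, pos_q)
    rk = rotate(k, pos_k)
    return rq[0] * rk[0] + rq[1] * rk[1]

q = (1.0, 2.0)
k = (0.5, -1.0)

# Same offset (pos_k - pos_q = -2) at different absolute positions: equal scores.
print(attention_score(q, k, 3, 1), attention_score(q, k, 7, 5))
# A different offset changes the score.
print(attention_score(q, k, 3, 2))
```

Each token is encoded once from its own position, so the input stays linear in the number of tokens, yet every query-key product depends only on the offset between the two positions. The paper's contribution, as summarized above, is extending this kind of property to 3D object coordinates.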
The code and data are available at https://github.com/oceanflowlab/QuatRoPE.
[54] Is Geometry Enough? An Evaluation of Landmark-Based Gaze Estimation
Daniele Agostinelli,Thomas Agostinelli,Andrea Generosi,Maura Mengoni
Main category: cs.CV
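TL;DR: This paper systematically evaluates landmark-based gaze estimation. It introduces a standardized pipeline for extracting and normalizing facial landmarks and trains lightweight regression models (XGBoost, a holistic MLP, and a binocular siamese MLP) on several large-scale datasets, finding that in cross-domain settings the MLP models are comparable to ResNet18, which suggests that sparse geometric features suffice for robust, efficient, and interpretable gaze estimation on edge devices.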
Details
Motivation: Existing CNN-based appearance methods for gaze estimation are accurate but computationally expensive and hard to interpret; lightweight geometric methods based on landmarks remain underexplored on modern benchmarks in terms of performance limits and generalization. Method: Builds a standardized pipeline to extract and normalize facial landmarks from three large datasets (Gaze360, ETH-XGaze, and GazeGene) and trains three lightweight models: XGBoost, a holistic MLP, and a siamese MLP that models binocular geometry. Result: Landmark-based models perform somewhat worse in within-domain evaluation (likely because the landmark detector introduces noise), but in cross-domain evaluation the MLP architectures generalize comparably to ResNet18 baselines. Conclusion: Sparse geometric features carry enough information for robust gaze estimation, offering a viable path to efficient, interpretable, and privacy-friendly edge applications. Abstract: Appearance-based gaze estimation frequently relies on deep Convolutional Neural Networks (CNNs). These models are accurate, but computationally expensive and act as "black boxes", offering little interpretability. Geometric methods based on facial landmarks are a lightweight alternative, but their performance limits and generalization capabilities remain underexplored in modern benchmarks. In this study, we conduct a comprehensive evaluation of landmark-based gaze estimation. We introduce a standardized pipeline to extract and normalize landmarks from three large-scale datasets (Gaze360, ETH-XGaze, and GazeGene) and train lightweight regression models, specifically Extreme Gradient Boosted trees and two neural architectures: a holistic Multi-Layer Perceptron (MLP) and a siamese MLP designed to capture binocular geometry. We find that landmark-based models exhibit lower performance in within-domain evaluation, likely due to noise introduced into the datasets by the landmark detector. Nevertheless, in cross-domain evaluation, the proposed MLP architectures show generalization capabilities comparable to those of ResNet18 baselines. These findings suggest that sparse geometric features encode sufficient information for robust gaze estimation, paving the way for efficient, interpretable, and privacy-friendly edge applications. The source code and generated landmark-based datasets are available at https://github.com/daniele-agostinelli/LandmarkGaze.git.
[55] Confidence-Based Mesh Extraction from 3D Gaussians
Lukas Radl,Felix Windisch,Andreas Kurz,Thomas Köhler,Michael Steiner,Markus Steinberger
Main category: cs.CV
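TL;DR: This paper proposes a self-supervised confidence framework for 3D Gaussian Splatting (3DGS) in which learnable confidence values dynamically balance photometric and geometric supervision, and adds color and normal variance penalty losses and a decoupled D-SSIM appearance model, improving the accuracy of unbounded mesh extraction in scenes with view-dependent effects while remaining efficient.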
Details
Motivation: Existing 3DGS methods struggle to extract accurate mesh surfaces in scenes with abundant view-dependent effects, while remedies that rely on multi-view techniques, iterative extraction, or large pre-trained models sacrifice the inherent efficiency of 3DGS. Method: Proposes a self-supervised confidence framework with learnable confidence values that dynamically weight the photometric and geometric losses; designs losses that penalize per-primitive color and normal variance; improves the appearance model by decoupling the individual terms of the D-SSIM loss. Result: Achieves state-of-the-art results for unbounded mesh extraction while remaining highly efficient. Conclusion: The method reduces the interference of view-dependent effects with 3DGS surface reconstruction in a lightweight way, without extra data or models, balancing accuracy and efficiency. Abstract: Recently, 3D Gaussian Splatting (3DGS) greatly accelerated mesh extraction from posed images due to its explicit representation and fast software rasterization. While the addition of geometric losses and other priors has improved the accuracy of extracted surfaces, mesh extraction remains difficult in scenes with abundant view-dependent effects. To resolve the resulting ambiguities, prior works rely on multi-view techniques, iterative mesh extraction, or large pre-trained models, sacrificing the inherent efficiency of 3DGS. In this work, we present a simple and efficient alternative by introducing a self-supervised confidence framework to 3DGS: within this framework, learnable confidence values dynamically balance photometric and geometric supervision. Extending our confidence-driven formulation, we introduce losses which penalize per-primitive color and normal variance and demonstrate their benefits to surface extraction. Finally, we complement the above with an improved appearance model, by decoupling the individual terms of the D-SSIM loss. Our final approach delivers state-of-the-art results for unbounded meshes while remaining highly efficient.
[56] A Framework for Generating Semantically Ambiguous Images to Probe Human and Machine Perception
Yuqi Hu,Vasha DuTell,Ahna R. Girshick,Jennifer E. Corbett
Main category: cs.CV
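TL;DR: This paper proposes a framework that interpolates in the CLIP embedding space to generate spectra of semantically ambiguous images, used to compare where humans and machine classifiers place concept boundaries; it finds that models lean more towards "rabbit", that human judgments align more closely with the CLIP embedding used for synthesis, and that humans are more sensitive to the guidance scale.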
Details
Motivation: To examine how humans and machine vision models differ in placing concept boundaries on semantically ambiguous images, and thereby how well model representations align with human perception. Method: Interpolates between semantic concepts (such as duck/rabbit) in the CLIP embedding space to generate continuous spectra of ambiguous images; combines psychophysical experiments with machine classifier tests to measure the boundary positions and sensitivities of humans and models. Result: Machine classifiers show a stronger bias towards "rabbit"; human judgments align more closely with the CLIP embedding used for synthesis; the guidance scale appears to affect human sensitivity more strongly than it affects the classifiers. Conclusion: Controlled ambiguity can serve as a diagnostic tool linking human psychophysical analysis, image classification, and generative image models, offering insight into human-model alignment, robustness, model interpretability, and image synthesis methods. Abstract: The classic duck-rabbit illusion reveals that when visual evidence is ambiguous, the human brain must decide what it sees. But where exactly do human observers draw the line between "duck" and "rabbit", and do machine classifiers draw it in the same place? We use semantically ambiguous images as interpretability probes to expose how vision models represent the boundaries between concepts. We present a psychophysically-informed framework that interpolates between concepts in the CLIP embedding space to generate continuous spectra of ambiguous images, allowing us to precisely measure where and how humans and machine classifiers place their semantic boundaries. Using this framework, we show that machine classifiers are more biased towards seeing "rabbit", whereas humans are more aligned with the CLIP embedding used for synthesis, and the guidance scale seems to affect human sensitivity more strongly than machine classifiers. Our framework demonstrates how controlled ambiguity can serve as a diagnostic tool to bridge the gap between human psychophysical analysis, image classification, and generative image models, offering insight into human-model alignment, robustness, model interpretability, and image synthesis methods.
[57] OpenCap Monocular: 3D Human Kinematics and Musculoskeletal Dynamics from a Single Smartphone Video
Selim Gilon,Emily Y. Miller,Scott D. Uhlrich
Main category: cs.CV
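TL;DR: This paper presents OpenCap Monocular, an algorithm that estimates 3D skeletal kinematics and kinetics from a single smartphone video, outperforms a regression-only computer vision baseline, and is validated in clinically relevant applications such as frailty and knee osteoarthritis assessment.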
Details
Motivation: Traditional biomechanical analysis depends on costly, time-intensive laboratory equipment (such as marker-based motion capture and force plates), which is hard to deploy clinically at scale; low-cost, scalable, accurate, and portable assessment tools are needed. Method: Starting from the output of the monocular pose estimation model WHAM, refines the 3D human pose via optimization; maps it to a biomechanically constrained skeletal model to compute kinematics; then estimates kinetics (such as joint moments and ground reaction forces) by combining physics-based simulation and machine learning. Result: For walking, squatting, and sit-to-stand tasks, kinematic error is 4.8° (rotational degrees of freedom) and 3.4 cm (pelvis translations); compared with a regression-only baseline, rotational and translational accuracy improve by 48% and 69%; clinically important quantities such as knee extension moment and knee adduction moment are estimated with clinically meaningful accuracy. Conclusion: OpenCap Monocular enables accurate, clinically actionable biomechanical assessment from a single smartphone; it is deployed for free via a smartphone app, web app, and cloud computing, and could support remote screening and monitoring of mobility impairments. Abstract: Quantifying human movement (kinematics) and musculoskeletal forces (kinetics) at scale, such as estimating quadriceps force during a sit-to-stand movement, could transform prediction, treatment, and monitoring of mobility-related conditions. However, quantifying kinematics and kinetics traditionally requires costly, time-intensive analysis in specialized laboratories, limiting clinical translation. Scalable, accurate tools for biomechanical assessment are needed. We introduce OpenCap Monocular, an algorithm that estimates 3D skeletal kinematics and kinetics from a single smartphone video. The method refines 3D human pose estimates from a monocular pose estimation model (WHAM) via optimization, computes kinematics of a biomechanically constrained skeletal model, and estimates kinetics via physics-based simulation and machine learning. We validated OpenCap Monocular against marker-based motion capture and force plate data for walking, squatting, and sit-to-stand tasks. OpenCap Monocular achieved low kinematic error (4.8° mean absolute error for rotational degrees of freedom; 3.4 cm for pelvis translations), outperforming a regression-only computer vision baseline by 48% in rotational accuracy (p = 0.036) and 69% in translational accuracy (p < 0.001). OpenCap Monocular also estimated ground reaction forces during walking with accuracy comparable to, or better than, our prior two-camera OpenCap system.
We demonstrate that the algorithm estimates important kinetic outcomes with clinically meaningful accuracy in applications related to frailty and knee osteoarthritis, including estimating knee extension moment during sit-to-stand transitions and knee adduction moment during walking. OpenCap Monocular is deployed via a smartphone app, web app, and secure cloud computing (https://opencap.ai), enabling free, accessible single-smartphone biomechanical assessments.
[58] TIGeR: A Unified Framework for Time, Images and Geo-location Retrieval
David G. Shatwell,Sirnam Swetha,Mubarak Shah
Main category: cs.CV
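TL;DR: This paper proposes TIGeR, a multi-modal transformer that maps image, geolocation, and time into a unified geo-temporal embedding space, supports several query configurations, and outperforms existing methods on geo-localization, time-of-capture prediction, and geo-time-aware image retrieval.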
Details
Motivation: Real-world applications (such as digital forensics, urban monitoring, and environmental analysis) require joint reasoning about an image's visual appearance, geolocation, and time, and existing methods fall short of the more complex need for joint geo-temporal retrieval. Method: Proposes TIGeR, which uses a multi-modal transformer to build a unified geo-temporal embedding space; supports single-modality and multi-modality inputs; trains and evaluates on a curated benchmark of 4.5M training triplets and 86k test triplets. Result: TIGeR outperforms strong baselines and state-of-the-art methods by up to 16% on time-of-year prediction, 8% on time-of-day prediction, and 14% in geo-time-aware retrieval recall. Conclusion: Modeling location and time in a unified way improves cross-modal joint reasoning and is more robust when appearance changes greatly while location identity stays the same. Abstract: Many real-world applications in digital forensics, urban monitoring, and environmental analysis require jointly reasoning about visual appearance, geolocation, and time. Beyond standard geo-localization and time-of-capture prediction, these applications increasingly demand more complex capabilities, such as retrieving an image captured at the same location as a query image but at a specified target time. We formalize this problem as Geo-Time Aware Image Retrieval and curate a diverse benchmark of 4.5M paired image-location-time triplets for training and 86k high-quality triplets for evaluation. We then propose TIGeR, a multi-modal-transformer-based model that maps image, geolocation, and time into a unified geo-temporal embedding space. TIGeR supports flexible input configurations (single-modality and multi-modality queries) and uses the same representation to perform (i) geo-localization, (ii) time-of-capture prediction, and (iii) geo-time-aware retrieval. By better preserving underlying location identity under large appearance changes, TIGeR enables retrieval based on where and when a scene is, rather than purely on visual similarity. Extensive experiments show that TIGeR consistently outperforms strong baselines and state-of-the-art methods by up to 16% on time-of-year prediction, 8% on time-of-day prediction, and 14% in geo-time aware retrieval recall, highlighting the benefits of unified geo-temporal modeling.
[59] Synthetic Cardiac MRI Image Generation using Deep Generative Models
Ishan Kumarasinghe,Dasuni Kawya,Madhura Edirisooriya,Isuri Devindi,Isuru Nawinne,Vajira Thambawita
Main category: cs.CV
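TL;DR: This paper reviews recent progress in synthetic cardiac MRI (CMRI) generation, comparing generative models in terms of image fidelity, downstream utility, and patient privacy, and points out that current approaches suffer from inconsistent evaluation and lack an integrated framework for clinical use.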
Details
Motivation: To address practical clinical problems such as the scarcity of annotated cardiac MRI data, large variability across scanner vendors, and privacy leakage through model memorization. Method: Surveys generative approaches based on GANs, VAEs, diffusion models, and flow matching, together with techniques such as mask conditioning, vendor-style conditioning, and intensity normalization that improve structural fidelity and cross-domain generalization, and privacy measures such as membership inference attacks, nearest-neighbor analyses, and differential privacy. Result: Anatomically constrained synthetic CMRI data can improve the accuracy and robustness of downstream segmentation in multi-vendor settings; diffusion and flow-matching models offer strong boundary preservation and efficient deterministic transformations; privacy checks are increasingly part of evaluation. Conclusion: Existing CMRI generation methods still struggle to satisfy fidelity, utility, and privacy at once, and an integrated, evaluation-driven framework is needed for reliable clinical use. Abstract: Synthetic cardiac MRI (CMRI) generation has emerged as a promising strategy to overcome the scarcity of annotated medical imaging data. Recent advances in GANs, VAEs, diffusion probabilistic models, and flow-matching techniques aim to generate anatomically accurate images while addressing challenges such as limited labeled datasets, vendor variability, and risks of privacy leakage through model memorization. Mask-conditioned generation improves structural fidelity by guiding synthesis with segmentation maps, while diffusion and flow-matching models offer strong boundary preservation and efficient deterministic transformations. Cross-domain generalization is further supported through vendor-style conditioning and preprocessing steps like intensity normalization. To ensure privacy, studies increasingly incorporate membership inference attacks, nearest-neighbor analyses, and differential privacy mechanisms. Utility evaluations commonly measure downstream segmentation performance, with evidence showing that anatomically constrained synthetic data can enhance accuracy and robustness across multi-vendor settings. This review aims to compare existing CMRI generation approaches through the lenses of fidelity, utility, and privacy, highlighting current limitations and the need for integrated, evaluation-driven frameworks for reliable clinical workflows.
[60] DRoPS: Dynamic 3D Reconstruction of Pre-Scanned Objects
Narek Tumanyan,Samuel Rota Bulò,Denis Rozumny,Lorenzo Porzi,Adam Harley,Tali Dekel,Peter Kontschieder,Jonathon Luiten
Main category: cs.CV
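TL;DR: This paper proposes DRoPS, which uses a static pre-scan of the dynamic object as a geometric and appearance prior and combines grid-aligned Gaussian primitives with a CNN-based motion parameterization, substantially improving dynamic scene reconstruction quality from extreme novel viewpoints as well as 3D tracking accuracy.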
Details
Motivation: Existing methods struggle to reconstruct dynamic scenes with high quality under highly articulated motion and extreme novel viewpoints, because they do not fully exploit the strong geometric and appearance prior that a static pre-scan provides. Method: DRoPS builds a surface-aligned, grid-structured representation of Gaussian primitives and parameterizes motion with a CNN conditioned on those grids, which injects strong implicit regularization and correlates the motion of nearby points. Result: Significantly outperforms the current state of the art in rendering quality and 3D tracking accuracy. Conclusion: Explicitly using a static pre-scan together with a structured representation and learned motion modeling constrains the ill-posed dynamic reconstruction problem and improves geometric consistency. Abstract: Dynamic scene reconstruction from casual videos has seen recent remarkable progress. Numerous approaches have attempted to overcome the ill-posedness of the task by distilling priors from 2D foundational models and by imposing hand-crafted regularization on the optimized motion. However, these methods struggle to reconstruct scenes from extreme novel viewpoints, especially when highly articulated motions are present. In this paper, we present DRoPS, a novel approach that leverages a static pre-scan of the dynamic object as an explicit geometric and appearance prior. While existing state-of-the-art methods fail to fully exploit the pre-scan, DRoPS leverages our novel setup to effectively constrain the solution space and ensure geometrical consistency throughout the sequence. The core of our novelty is twofold: first, we establish a grid-structured and surface-aligned model by organizing Gaussian primitives into pixel grids anchored to the object surface. Second, by leveraging the grid structure of our primitives, we parameterize motion using a CNN conditioned on those grids, injecting strong implicit regularization and correlating the motion of nearby points. Extensive experiments demonstrate that our method significantly outperforms the current state of the art in rendering quality and 3D tracking accuracy.
[61] AVControl: Efficient Framework for Training Audio-Visual Controls
Matan Ben-Yosef,Tavi Halperin,Naomi Ken Korem,Mohammad Salama,Harel Cain,Asaf Joseph,Anthony Chen,Urska Jelercic,Ofir Bibi
Main category: cs.CV
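TL;DR: This paper proposes AVControl, a framework built on the LTX-2 joint audio-visual foundation model that trains a separate LoRA adapter per control modality on a parallel canvas, giving lightweight, extendable, multi-modality control of audio-visual generation without changing the backbone architecture. It supports depth, pose, edges, camera trajectory, sparse motion, video editing, and what the authors describe as the first modular audio-visual controls, outperforms the evaluated baselines on several VACE Benchmark tasks, and is efficient to train.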
Details
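Motivation: Existing controllable video and audio generation methods either depend on a single monolithic model with a fixed set of modalities or introduce costly architectural changes for each new modality, so they lack flexibility and extensibility. Method: Proposes AVControl, built on the LTX-2 joint audio-visual foundation model, which trains a separate LoRA adapter for each control modality (depth, pose, audio, and so on); introduces a parallel canvas that provides the reference signal as additional tokens in the attention layers without changing the base architecture; this avoids the failure on structural control that occurs when image-based in-context methods are simply extended to video. Result: On the VACE Benchmark, outperforms all evaluated baselines on depth- and pose-guided generation, inpainting, and outpainting; shows competitive results on camera control and audio-visual benchmarks; supports a range of independently trained modalities (depth, pose, edges, camera trajectory, sparse motion, video editing, modular audio-visual controls); each modality needs only a small dataset and converges within a few hundred to a few thousand steps. Conclusion: AVControl is a lightweight, extendable, and efficient framework for multi-modality controllable audio-visual generation that requires no architectural changes, provides modular audio-visual controls for a joint generation model, is both compute- and data-efficient, and releases its code and LoRA checkpoints. Abstract: Controlling video and audio generation requires diverse modalities, from depth and pose to camera trajectories and audio transformations, yet existing approaches either train a single monolithic model for a fixed set of controls or introduce costly architectural changes for each new modality. We introduce AVControl, a lightweight, extendable framework built on LTX-2, a joint audio-visual foundation model, where each control modality is trained as a separate LoRA on a parallel canvas that provides the reference signal as additional tokens in the attention layers, requiring no architectural changes beyond the LoRA adapters themselves. We show that simply extending image-based in-context methods to video fails for structural control, and that our parallel canvas approach resolves this. On the VACE Benchmark, we outperform all evaluated baselines on depth- and pose-guided generation, inpainting, and outpainting, and show competitive results on camera control and audio-visual benchmarks. Our framework supports a diverse set of independently trained modalities: spatially-aligned controls such as depth, pose, and edges, camera trajectory with intrinsics, sparse motion control, video editing, and, to our knowledge, the first modular audio-visual controls for a joint generation model. Our method is both compute- and data-efficient: each modality requires only a small dataset and converges within a few hundred to a few thousand training steps, a fraction of the budget of monolithic alternatives.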
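Since each control modality above is trained as a separate LoRA, a minimal sketch of what a LoRA adapter computes may help. This is the generic low-rank update on a single linear layer, written with plain lists; it does not reproduce LTX-2, the parallel canvas, or the released checkpoints, and the names `lora_forward`, `alpha`, and the toy matrices are assumptions for illustration.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def lora_forward(w, a, b, x, alpha=1.0):
    """Frozen linear layer plus a low-rank update: y = W x + (alpha / r) * B (A x).

    w: frozen base weight, shape (out, in)
    a: trainable down-projection, shape (r, in)
    b: trainable up-projection, shape (out, r)
    """
    rank = len(a)
    base = matvec(w, x)
    update = matvec(b, matvec(a, x))
    scale = alpha / rank
    return [y + scale * u for y, u in zip(base, update)]

w = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight (toy identity layer)
a = [[1.0, 1.0]]               # rank-1 down-projection
x = [1.0, 2.0]

# With B initialized to zero, the adapter leaves the base model's output unchanged.
b_zero = [[0.0], [0.0]]
print(lora_forward(w, a, b_zero, x))

# After training, B is nonzero and shifts the output.
b_trained = [[2.0], [0.0]]
print(lora_forward(w, a, b_trained, x))
```

Only `a` and `b` are trained while `w` stays frozen. The number of adapter parameters grows with the rank rather than with the full weight size, which is consistent with the abstract's emphasis on keeping the base architecture unchanged and training each modality with a small budget.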
We publicly release our code and trained LoRA checkpoints.
[62] Calibri: Enhancing Diffusion Transformers via Parameter-Efficient Calibration
Danil Tokhchukov,Aysel Mirzoeva,Andrey Kuznetsov,Konstantin Sobolev
Main category: cs.CV
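TL;DR: This paper proposes Calibri, which introduces learned scaling parameters and optimizes DiT components with an evolutionary algorithm, improving text-to-image generation quality and reducing the number of inference steps while adjusting only about 100 parameters.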
Details
Motivation: To uncover the latent potential of Diffusion Transformers (DiTs) in generative tasks, based on an analysis of the denoising process. Method: Proposes Calibri, which frames DiT calibration as a black-box reward optimization problem solved efficiently with an evolutionary algorithm, modifying only about 100 parameters. Result: Calibri consistently improves generation quality across several text-to-image models and reduces the inference steps required, while maintaining high-quality outputs. Conclusion: A calibration strategy that combines lightweight learned scaling parameters with evolutionary optimization can improve both the generative performance and the efficiency of DiTs. Abstract: In this paper, we uncover the hidden potential of Diffusion Transformers (DiTs) to significantly enhance generative tasks. Through an in-depth analysis of the denoising process, we demonstrate that introducing a single learned scaling parameter can significantly improve the performance of DiT blocks. Building on this insight, we propose Calibri, a parameter-efficient approach that optimally calibrates DiT components to elevate generative quality. Calibri frames DiT calibration as a black-box reward optimization problem, which is efficiently solved using an evolutionary algorithm and modifies just ~100 parameters. Experimental results reveal that despite its lightweight design, Calibri consistently improves performance across various text-to-image models. Notably, Calibri also reduces the inference steps required for image generation, all while maintaining high-quality outputs.
[63] Dissecting Model Failures in Abdominal Aortic Aneurysm Segmentation through Explainability-Driven Analysis
Abu Noman Md Sakib,Merjulah Roby,Zijie Zhang,Satish Muluk,Mark K. Eskandari,Ender A. Finol
Main category: cs.CV
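TL;DR: This paper proposes an Explainable AI (XAI) guided encoder shaping framework to make segmentation of complex abdominal aortic aneurysm (AAA) CT images more reliable. An XAI-derived "XAI field" is used to explicitly shape encoder focus, serving for focus alignment during training and lightweight refinement at inference, with substantial improvements over a base SAM setup.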
Details
Motivation: Existing AAA CT segmentation models often fail because they focus on irrelevant structures or miss thin, low-contrast targets; where the model looks is a key training signal, so explainability guidance is introduced to correct encoder focus. Method: Proposes an XAI-guided encoder shaping framework: first, a dense, attribution-based "XAI field" is computed from the final encoder block; then (i) during training, the predicted probability mass is aligned to the XAI field to promote agreement between focus and output; (ii) at inference, the XAI field is routed into a lightweight refinement pathway and a confidence prior that modulates logits, suppressing distractors while preserving subtle structures; the objective terms act only as control signals, and the core contribution is integrating attribution guidance into representation and decoding. Result: Evaluated on clinically validated, challenging cases curated for failure-prone scenarios, the method yields substantial improvements over a base SAM setup. Conclusion: Explicitly optimizing encoder focus via XAI guidance is a practical and effective principle for more reliable segmentation of complex medical images. Abstract: Computed tomography image segmentation of complex abdominal aortic aneurysms (AAA) often fails because the models assign internal focus to irrelevant structures or do not focus on thin, low-contrast targets. Where the model looks is the primary training signal, and thus we propose an Explainable AI (XAI) guided encoder shaping framework. Our method computes a dense, attribution-based encoder focus map ("XAI field") from the final encoder block and uses it in two complementary ways: (i) we align the predicted probability mass to the XAI field to promote agreement between focus and output; and (ii) we route the field into a lightweight refinement pathway and a confidence prior that modulates logits at inference, suppressing distractors while preserving subtle structures. The objective terms serve only as control signals; the contribution is the integration of attribution guidance into representation and decoding. We evaluate clinically validated challenging cases curated for failure-prone scenarios. Compared to a base SAM setup, our implementation yields substantial improvements. The observed gains suggest that explicitly optimizing encoder focus via XAI guidance is a practical and effective principle for reliable segmentation in complex scenarios.
[64] GoldiCLIP: The Goldilocks Approach for Balancing Explicit Supervision for Language-Image Pretraining
Deen Dayal Mohan,Hossein Souri,Vitali Petsiuk,Juhong Min,Gopal Sharma,Luowei Zhou,Suren Kumar
Main category: cs.CV
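TL;DR: GoldiCLIP is a data-efficient training framework for vision-language models built on a Goldilocks principle of balanced supervision. It combines text-conditioned self-distillation, an encoder-integrated decoder with a VQA objective, and uncertainty-based loss weighting, and reaches state-of-the-art results among data-efficient approaches with only 30 million images (300x less data than leading methods).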
Details
Motivation: Large vision-language models (VLMs) have relied heavily on billion-sample datasets, which limits research accessibility; existing improvements mostly target individual weaknesses of contrastive pretraining and lack a systematic design for balancing supervision signals. Method: Proposes the GoldiCLIP framework with three innovations: (1) text-conditioned self-distillation to align text-agnostic and text-conditioned features; (2) an encoder-integrated decoder with a VQA objective, which helps the encoder generalize beyond caption-like queries; (3) an uncertainty-based weighting mechanism that automatically balances the heterogeneous losses. Result: Trained on only 30M images, GoldiCLIP improves over the best comparable baseline by 2.2 points on MSCOCO retrieval, 2.0 on fine-grained retrieval, and 5.9 on question-based retrieval, while remaining competitive with billion-scale models. Conclusion: GoldiCLIP shows that high-quality, well-balanced supervision from multiple signals can greatly reduce a VLM's dependence on massive data, pointing to a data-efficient approach to vision-language pretraining. Abstract: Until recently, the success of large-scale vision-language models (VLMs) has primarily relied on billion-sample datasets, posing a significant barrier to progress. Latest works have begun to close this gap by improving supervision quality, but each addresses only a subset of the weaknesses in contrastive pretraining. We present GoldiCLIP, a framework built on a Goldilocks principle of finding the right balance of supervision signals. Our multifaceted training framework synergistically combines three key innovations: (1) a text-conditioned self-distillation method to align both text-agnostic and text-conditioned features; (2) an encoder-integrated decoder with a Visual Question Answering (VQA) objective that enables the encoder to generalize beyond caption-like queries; and (3) an uncertainty-based weighting mechanism that automatically balances all heterogeneous losses. Trained on just 30 million images, 300x less data than leading methods, GoldiCLIP achieves state-of-the-art among data-efficient approaches, improving over the best comparable baseline by 2.2 points on MSCOCO retrieval, 2.0 on fine-grained retrieval, and 5.9 on question-based retrieval, while remaining competitive with billion-scale models. Project page: https://petsi.uk/goldiclip.
[65] Attention-based Pin Site Image Classification in Orthopaedic Patients with External Fixators
Yubo Wang,Marie Fridberg,Anirejuoritse Bafor,Ole Rahbek,Christopher Iobst,Søren Vedding Kold,Ming Shen
Main category: cs.CV
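TL;DR: This paper proposes a deep learning model based on an attention mechanism and a new convolution method (ERRC) that classifies pin sites as infected or not from their appearance in images, achieving strong AUC and F1-score.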
Details
Motivation: Pin sites are prone to infection, which causes pain and increased morbidity for patients, so more accurate and efficient identification and management are needed. Method: Builds a dataset of pin site wound infections and designs a deep learning model that combines an attention mechanism with Efficient Redundant Reconstruction Convolution (ERRC), emphasizing the relevant regions at the metal pin/skin interface and enriching the feature maps. Result: The model reaches an AUC of 0.975 and an F1-score of 0.927 with only 5.77 M parameters, outperforming baseline methods. Conclusion: The DL model can differentiate infected from uninfected pin sites from visual signs alone, in line with healthcare professional assessments, and shows potential as a clinical aid, although validation on more data is still needed. Abstract: Pin sites represent the interface where a metal pin or wire from the external environment passes through the skin into the internal environment of the limb. These pins or wires connect an external fixator to the bone to stabilize the bone segments in a patient with trauma or deformity. Because these pin sites represent an opportunity for external skin flora to enter the internal environment of the limb, infections of the pin site are common. These pin site infections are painful, annoying, and cause increased morbidity to the patients. Improving the identification and management of pin site infections would greatly enhance the patient experience when external fixators are used. For this, this paper collects and produces a dataset on pin site wound infections and proposes a deep learning (DL) method to classify pin site images based on their appearance: Group A displayed signs of inflammation or infection, while Group B showed no evident complications. Unlike studies that primarily focus on open wounds, our research includes potential interventions at the metal pin/skin interface. Our attention-based deep learning model addresses this complexity by emphasizing relevant regions and minimizing distractions from the pins. Moreover, we introduce an Efficient Redundant Reconstruction Convolution (ERRC) method to enhance the richness of feature maps while reducing the number of parameters. Our model outperforms baseline methods with an AUC of 0.975 and an F1-score of 0.927, requiring only 5.77 M parameters.
These results highlight the potential of DL in differentiating pin sites only based on visual signs of infection, aligning with healthcare professional assessments, while further validation with more data remains essential.
[66] Generative Adversarial Perturbations with Cross-paradigm Transferability on Localized Crowd Counting
Alabi Mehzabin Anisha,Guangjing Wang,Sriram Chellappan
Main category: cs.CV
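TL;DR: This paper proposes a new cross-paradigm adversarial attack framework that effectively attacks both density-map-based and point-regression-based crowd counting models, substantially increasing error while remaining visually imperceptible, and that transfers across multiple SOTA models with high success rates.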
Details
Motivation: Prior work has explored adversarial transfer only among density-map models, while cross-paradigm attacks (density map vs. point regression) remain unstudied; crowd counting is also used in security-critical settings where robustness requirements are very high.
Method: The authors propose an adversarial framework optimized with a multi-task loss: scene-density-specific high-confidence logit suppression for point-regression models, peak-targeted density map suppression for density-map models, and a shared model-agnostic perceptual constraint that keeps the perturbation imperceptible.
Result: The attack raises mean absolute error (MAE) by about 7x relative to clean samples; it transfers across seven SOTA crowd counting models with transfer ratios of 0.55-1.69; visual quality remains good and compares favorably with existing transferable attack methods.
Conclusion: The cross-paradigm framework is the first to deliver a unified, efficient, imperceptible, and transferable attack on both mainstream crowd counting paradigms, exposing a shared robustness weakness in current models and providing a benchmark for future defense research.
Abstract: State-of-the-art crowd counting and localization are primarily modeled using two paradigms: density maps and point regression. Given the field's security ramifications, there is active interest in model robustness against adversarial attacks. Recent studies have demonstrated transferability across density-map-based approaches via adversarial patches, but cross-paradigm attacks (i.e., across both density-map-based models and point-regression-based models) remain unexplored. We introduce a novel adversarial framework that compromises both density map and point regression architectural paradigms through a comprehensive multi-task loss optimization. For point-regression models, we employ scene-density-specific high-confidence logit suppression; for density-map approaches, we use peak-targeted density map suppression. Both are combined with model-agnostic perceptual constraints to ensure that perturbations are effective and imperceptible to the human eye. Extensive experiments demonstrate the effectiveness of our attack, achieving on average a 7x increase in Mean Absolute Error compared to clean images while maintaining competitive visual quality, and successfully transferring across seven state-of-the-art crowd models with transfer ratios ranging from 0.55 to 1.69. Our approach strikes a balance between attack effectiveness and imperceptibility compared to state-of-the-art transferable attack strategies. The source code is available at https://github.com/simurgh7/CrowdGen
[67] DCARL: A Divide-and-Conquer Framework for Autoregressive Long-Trajectory Video Generation
Junyi Ouyang,Wenbin Teng,Gonglin Chen,Yajie Zhao,Haiwei Chen
Main category: cs.CV
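TL;DR: DCARL is a new divide-and-conquer, autoregressive video generation framework in which a keyframe generator and an interpolation generator work together to preserve long-range structural consistency while improving visual fidelity and camera-motion controllability.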
Details
Motivation: Existing video diffusion models (VDMs) scale poorly, and autoregressive models, although capable of unbounded rollout, suffer from visual drift and poor controllability; a method is needed that offers both long-range structural stability and high-fidelity detail.
Method: The DCARL framework has two parts: 1) a keyframe generator trained without temporal compression, which provides global structural anchors; 2) an autoregressive interpolation generator that works on overlapping segments, using the keyframes as global context and a single preceding frame for local coherence.
Result: Trained on a large-scale internet long-trajectory video dataset, the method outperforms SOTA autoregressive and divide-and-conquer baselines on FID/FVD (visual quality) and ATE/ARE (camera motion error), and supports stable, high-quality generation of videos up to 32 seconds long.
Conclusion: DCARL combines the structural stability of the divide-and-conquer strategy with the high-fidelity generation of VDMs, offering an efficient, controllable, and scalable approach to long-trajectory video modeling.
Abstract: Long-trajectory video generation is a crucial yet challenging task for world modeling, primarily due to the limited scalability of existing video diffusion models (VDMs). Autoregressive models, while offering infinite rollout, suffer from visual drift and poor controllability. To address these issues, we propose DCARL, a novel divide-and-conquer, autoregressive framework that effectively combines the structural stability of the divide-and-conquer scheme with the high-fidelity generation of VDMs. Our approach first employs a dedicated Keyframe Generator trained without temporal compression to establish long-range, globally consistent structural anchors. Subsequently, an Interpolation Generator synthesizes the dense frames in an autoregressive manner with overlapping segments, utilizing the keyframes for global context and a single clean preceding frame for local coherence. Trained on a large-scale internet long-trajectory video dataset, our method achieves superior performance in both visual quality (lower FID and FVD) and camera adherence (lower ATE and ARE) compared to state-of-the-art autoregressive and divide-and-conquer baselines, demonstrating stable and high-fidelity generation for long-trajectory videos up to 32 seconds in length.
[68] WAFT-Stereo: Warping-Alone Field Transforms for Stereo Matching
Yihan Wang,Jia Deng
Main category: cs.CV
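TL;DR: WAFT-Stereo is a warping-based stereo matching method that needs no conventional cost volume and delivers leading accuracy with higher efficiency.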
Details
Motivation: To challenge the prevailing design paradigm, in which leading methods depend on a cost volume, and to explore a simpler and more efficient approach to stereo matching.
Method: The authors propose WAFT-Stereo, a warping-based method that drops the explicit cost volume and models disparity directly through warping.
Result: WAFT-Stereo ranks first on the ETH3D, KITTI, and Middlebury benchmarks, reduces zero-shot error on ETH3D by 81%, and runs 1.8-6.7x faster than competing methods.
Conclusion: A cost volume is not necessary for high-performance stereo matching; a lightweight warping-based design can improve both accuracy and speed.
Abstract: We introduce WAFT-Stereo, a simple and effective warping-based method for stereo matching. WAFT-Stereo demonstrates that cost volumes, a common design used in many leading methods, are not necessary for strong performance and can be replaced by warping with improved efficiency. WAFT-Stereo ranks first on the ETH3D, KITTI, and Middlebury public benchmarks, reducing the zero-shot error by 81% on the ETH3D benchmark, while being 1.8-6.7x faster than competitive methods. Code and model weights are available at https://github.com/princeton-vl/WAFT-Stereo.
[69] NeuroVLM-Bench: Evaluation of Vision-Enabled Large Language Models for Clinical Reasoning in Neurological Disorders
Katarina Trojachanec Dineva,Stefan Andonov,Ilinka Ivanoska,Ivan Kitanovski,Sasho Gramatikov,Tamara Kostova,Monika Simjanoska Misheva,Kostadin Mishev
Main category: cs.CV
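TL;DR: This paper benchmarks 20 frontier multimodal large models on diagnosis from 2D neuroimaging (MRI/CT), evaluating tasks such as diagnosis, subtype classification, and modality recognition. Recognition of technical attributes is nearly solved, while diagnostic reasoning (especially subtype prediction) remains challenging. Gemini-2.5-Pro and GPT-5-Chat show the strongest diagnostic performance, Gemini-2.5-Flash is the most efficient, and the open-weight MedGemma-1.5-4B performs notably well under few-shot prompting.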
Details
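Motivation: The reliability and operational trade-offs of multimodal large language models for neuroimaging decision support are not yet well understood, so a systematic benchmark is needed.
Method: The authors build a standardized 2D MRI/CT dataset covering five clinical categories (multiple sclerosis, stroke, brain tumors, other abnormalities, and normal controls) and require models to output five fields at once: diagnosis, subtype, modality, sequence, and anatomical plane. Evaluation covers four dimensions (discriminative classification with abstention, calibration, structured-output validity, and computational efficiency), and a multi-phase framework controls for selection bias.
Result: Accuracy on technical attributes such as modality and scan plane is close to saturation. Among diagnostic tasks, tumor classification is the most reliable, stroke is moderate, and multiple sclerosis and rare abnormalities are the hardest. Few-shot prompting improves some models but adds overhead. Gemini-2.5-Pro and GPT-5-Chat have the strongest overall diagnostic performance, Gemini-2.5-Flash offers the best efficiency-performance balance, and the open-weight MedGemma-1.5-4B, under few-shot prompting, approaches the zero-shot performance of proprietary models with perfect structured output.
Conclusion: The study maps the capability limits and practical bottlenecks of current multimodal large models in neuroimaging, provides quantitative evidence on performance, reliability, and efficiency for clinical deployment, and supports a standardized evaluation paradigm.
Abstract: Recent advances in multimodal large language models enable new possibilities for image-based decision support. However, their reliability and operational trade-offs in neuroimaging remain insufficiently understood. We present a comprehensive benchmarking study of vision-enabled large language models for 2D neuroimaging using curated MRI and CT datasets covering multiple sclerosis, stroke, brain tumors, other abnormalities, and normal controls. Models are required to generate multiple outputs simultaneously, including diagnosis, diagnosis subtype, imaging modality, specialized sequence, and anatomical plane. Performance is evaluated across four directions: discriminative classification with abstention, calibration, structured-output validity, and computational efficiency. A multi-phase framework ensures fair comparison while controlling for selection bias. Across twenty frontier multimodal models, the results show that technical imaging attributes such as modality and plane are nearly solved, whereas diagnostic reasoning, especially subtype prediction, remains challenging. Tumor classification emerges as the most reliable task, stroke is moderately solvable, while multiple sclerosis and rare abnormalities remain difficult. Few-shot prompting improves performance for several models but increases token usage, latency, and cost. Gemini-2.5-Pro and GPT-5-Chat achieve the strongest overall diagnostic performance, while Gemini-2.5-Flash offers the best efficiency-performance trade-off.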
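Calibration is listed above as one of the four evaluation directions, but the abstract does not say which metric is used. As an illustration only, the sketch below computes expected calibration error (ECE), one common calibration measure; the function name, the equal-width binning, and the default bin count are assumptions of this sketch, not details from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Expected calibration error with equal-width confidence bins.

    confidences: predicted probabilities in [0, 1] for the chosen class.
    correct:     booleans, True where the prediction was right.
    Assumption: abstained predictions are filtered out before calling.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Weight each bin's confidence/accuracy gap by its share of samples.
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece
```

A model whose stated confidence matches its empirical accuracy in every bin scores 0; larger values indicate over- or under-confidence.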
Among open-weight architectures, MedGemma-1.5-4B demonstrates the most promising results: under few-shot prompting, it approaches the zero-shot performance of several proprietary models while maintaining perfect structured output. These findings provide practical insights into performance, reliability, and efficiency trade-offs, supporting standardized evaluation of multimodal LLMs in neuroimaging.
[70] CORA: A Pathology Synthesis Driven Foundation Model for Coronary CT Angiography Analysis and MACE Risk Assessment
Jinkui Hao,Gorkem Durak,Halil Ertugrul Aktas,Ulas Bagci,Bradley D. Allen,Nilay S. Shah,Bo Zhou
Main category: cs.CV
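TL;DR: This paper proposes CORA, a pathology-centric, synthesis-driven 3D vision foundation model designed for coronary CT angiography (CCTA) analysis. It learns in a self-supervised way through anatomy-guided lesion synthesis, clearly outperforms existing models on multi-center data, and is coupled with a large language model to improve prediction of adverse cardiovascular event risk.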
Details
Motivation: Expert-annotated CCTA datasets are scarce in clinical practice, and existing label-free pretraining methods such as masked image modeling are biased toward global anatomical statistics and struggle to capture localized pathology such as coronary plaques.
Method: CORA uses a pathology-centric, synthesis-driven self-supervised framework: an anatomy-guided lesion synthesis engine drives 3D self-supervised training on 12,801 unlabeled CCTA volumes, and the imaging encoder is then coupled with a large language model to form a multimodal risk prediction framework.
Result: On multi-center data from nine independent hospitals, CORA outperforms SOTA 3D vision foundation models on plaque characterization, stenosis detection, and coronary artery segmentation, with gains of up to 29%, and significantly improves 30-day major adverse cardiac event (MACE) risk stratification.
Conclusion: CORA is a scalable and extensible 3D cardiovascular imaging foundation model that supports both anatomical assessment and cardiovascular risk prediction, moving automated CCTA analysis closer to clinical use.
Abstract: Coronary artery disease, the leading cause of cardiovascular mortality worldwide, can be assessed non-invasively by coronary computed tomography angiography (CCTA). Despite progress in automated CCTA analysis using deep learning, clinical translation is constrained by the scarcity of expert-annotated datasets. Furthermore, widely adopted label-free pretraining strategies, such as masked image modeling, are intrinsically biased toward global anatomical statistics, frequently failing to capture the spatially localized pathological features of coronary plaques. Here, we introduce CORA, a 3D vision foundation model for comprehensive cardiovascular risk assessment. CORA learns directly from volumetric CCTA via a pathology-centric, synthesis-driven self-supervised framework. By utilizing an anatomy-guided lesion synthesis engine, the model is explicitly trained to detect simulated vascular abnormalities, biasing representation learning toward clinically relevant disease features rather than dominant background anatomy. We trained CORA on a large-scale cohort of 12,801 unlabeled CCTA volumes and comprehensively evaluated the model across multi-center datasets from nine independent hospitals. Across diagnostic and anatomical tasks, including plaque characterization, stenosis detection, and coronary artery segmentation, CORA consistently outperformed the state-of-the-art 3D vision foundation models, achieving up to a 29% performance gain.
Crucially, by coupling the imaging encoder with a large language model, we extended CORA into a multimodal framework that significantly improved 30-day major adverse cardiac event (MACE) risk stratification. Our results establish CORA as a scalable and extensible foundation for unified anatomical assessment and cardiovascular risk prediction.
[71] Towards automatic smoke detector inspection: Recognition of the smoke detectors in industrial facilities and preparation for future drone integration
Lukas Kratochvila,Jakub Stefansky,Simon Bilik,Robert Rous,Tomas Zemcik,Michal Wolny,Frantisek Rusnak,Ondrej Cech,Karel Horak
Main category: cs.CV
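TL;DR: This paper presents a smoke detector recognition method for drone-based automatic inspection systems. It compares the YOLOv11, SSD, and RT-DETRv2 object detectors, trains them on real and semi-synthetic data with several augmentation strategies, and evaluates them on two challenging test sets, where YOLOv11n performs best with mAP@0.5 = 0.884.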
Details
Motivation: Smoke detectors are often mounted on high ceilings or in hard-to-reach places, which makes manual inspection dangerous and costly; an automatic recognition system that can be integrated into a drone is needed.
Method: The authors compare three object detectors (YOLOv11, SSD, and RT-DETRv2, with different backbones), train them on real and semi-synthetic data with several data augmentation strategies, and evaluate them on two test sets that include challenging conditions such as motion blur, low resolution, and occlusion.
Result: YOLOv11n reaches the best performance with an mAP@0.5 of 0.884; the code, pretrained models, and dataset are publicly available.
Conclusion: YOLOv11n is robust at recognizing smoke detectors in complex real-world scenes and is well suited to serve as the core recognition module of a drone-based automatic inspection system.
Abstract: Fire safety consists of a complex pipeline, and it is a very important topic of concern. Smoke detectors are one of its front-line components and are supposed to raise an alarm before a major fire develops. As they are often difficult to reach due to high ceilings or problematic locations, an automatic inspection system would be very beneficial, as it could allow faster inspections, spare workers dangerous work at heights, and make the whole process cheaper. In this study, we present the smoke detector recognition part of the automatic inspection system, which could easily be integrated into a drone system. As part of our research, we compare two popular convolution-based object detectors, YOLOv11 and SSD, which are widely used on embedded devices, with the state-of-the-art transformer-based RT-DETRv2, using backbones of different sizes. Because collecting a sufficient amount of training data in a real-world environment is difficult, we also compare several training strategies using real and semi-synthetic data together with various augmentation methods. For robust testing, all models were evaluated on two test datasets, one with expected and one with difficult appearances of the smoke detectors, including motion blur, low resolution, and incomplete objects. The best-performing detector is YOLOv11n, which reaches an average mAP@0.5 score of 0.884. Our code, pretrained models, and dataset are publicly available.
[72] OptiSAR-Net++: A Large-Scale Benchmark and Transformer-Free Framework for Cross-Domain Remote Sensing Visual Grounding
Xiaoyu Tang,Jun Dong,Jintao Cheng,Rui Fan
Main category: cs.CV
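TL;DR: This paper introduces the new task of cross-domain remote sensing visual grounding (CD-RSVG), builds OptSAR-RSVG, the first large-scale combined optical/SAR dataset, and proposes the efficient OptiSAR-Net++ framework, which reaches SOTA accuracy and efficiency through PL-MoE, a contrastive learning paradigm, dynamic adversarial negative sampling, and text-guided dual-gate fusion.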
Details
Motivation: Existing remote sensing visual grounding methods are limited to a single sensor (optical or SAR) and cannot meet the needs of real multi-source remote sensing scenarios; a unified model that supports both domains is needed.
Method: The OptiSAR-Net++ framework: 1) uses a patch-level Low-Rank Adaptation Mixture of Experts (PL-MoE) to decouple cross-domain features; 2) adopts a CLIP-based contrastive paradigm with dynamic adversarial negative sampling in place of computationally expensive Transformer decoding and regression; 3) adds a text-guided dual-gate fusion module (TGDF-SSA) and a region-aware auxiliary head to improve semantic-visual alignment and spatial modeling.
Result: The method achieves SOTA performance on both the newly built OptSAR-RSVG benchmark and the existing DIOR-RSVG benchmark, with clear gains in localization accuracy and inference efficiency.
Conclusion: CD-RSVG is an important and practical new task, and OptiSAR-Net++ shows that cross-domain feature decoupling and efficient contrastive matching are effective, offering a new approach to multi-source remote sensing understanding.
Abstract: Remote sensing visual grounding (RSVG) aims to localize specific targets in remote sensing images using natural language expressions. However, existing methods are restricted to single-sensor domains, i.e., either optical or synthetic aperture radar (SAR), limiting their real-world applicability. In this paper, we introduce the Cross-Domain RSVG (CD-RSVG) task and construct OptSAR-RSVG, the first large-scale benchmark dataset for this setting. To tackle the challenges of cross-domain feature modeling, computational inefficiency, and fine-grained semantic discrimination, we propose OptiSAR-Net++. Our framework features a patch-level Low-Rank Adaptation Mixture of Experts (PL-MoE) for efficient cross-domain feature decoupling. To mitigate the substantial computational overhead of Transformer decoding frameworks, we adopt a CLIP-based contrastive paradigm and further incorporate dynamic adversarial negative sampling, thereby transforming generative regression into an efficient cross-modal matching process. Additionally, a text-guided dual-gate fusion module (TGDF-SSA) and a region-aware auxiliary head are introduced to enhance semantic-visual alignment and spatial modeling. Extensive experiments demonstrate that OptiSAR-Net++ achieves SOTA performance on both OptSAR-RSVG and DIOR-RSVG benchmarks, offering significant advantages in localization accuracy and efficiency. Our code and dataset will be made publicly available.
[73] SurgPhase: Time efficient pituitary tumor surgery phase recognition via an interactive web platform
Yan Meng,Jack Cook,X. Y. Han,Kaan Duman,Shauna Otto,Dhiraj Pangal,Jonathan Chainey,Ruth Lau,Margaux Masson-Forsythe,Daniel A. Donoho,Danielle Levy,Gabriel Zada,Sébastien Froelich,Juan Fernandez-Miranda,Mike Chang
Main category: cs.CV
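TL;DR: This paper proposes a surgical phase recognition framework for pituitary tumor surgery videos that combines self-supervised learning, temporal modeling, and a scalable annotation strategy, reaching 90% accuracy on the test set, and builds a collaborative online platform for surgeons to support data collection and model iteration.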
Details
Motivation: Accurate surgical phase recognition is essential for analyzing surgical workflows, supporting intraoperative decisions, and driving data-based improvements in surgical education and performance evaluation; pituitary tumor surgery (PTS) adds the challenges of scarce labeled data, class imbalance, and high procedural variability.
Method: A ResNet-50 is first pretrained with self-supervised learning on 251 unlabeled PTS videos and then fine-tuned on 81 labeled procedures using focal loss, gradual layer unfreezing, and dynamic sampling; a collaborative online platform lets surgeons upload videos, receive automated phase analysis, and contribute data.
Result: The method reaches 90% phase recognition accuracy on a held-out test set, surpassing current state-of-the-art methods and generalizing well across different surgical cases.
Conclusion: The framework eases the shortage of labeled data in medical video analysis, shows that combining self-supervised learning with temporal modeling works for surgical phase recognition, and uses the clinical collaboration platform to close the loop between deployment and continued model improvement.
Abstract: Accurate surgical phase recognition is essential for analyzing procedural workflows, supporting intraoperative decision-making, and enabling data-driven improvements in surgical education and performance evaluation. In this work, we present a comprehensive framework for phase recognition in pituitary tumor surgery (PTS) videos, combining self-supervised representation learning, robust temporal modeling, and scalable data annotation strategies. Our method achieves 90% accuracy on a held-out test set, outperforming current state-of-the-art approaches and demonstrating strong generalization across variable surgical cases. A central contribution of this work is the integration of a collaborative online platform designed for surgeons to upload surgical videos, receive automated phase analysis, and contribute to a growing dataset. This platform not only facilitates large-scale data collection but also fosters knowledge sharing and continuous model improvement. To address the challenge of limited labeled data, we pretrain a ResNet-50 model using the self-supervised framework on 251 unlabeled PTS videos, enabling the extraction of high-quality feature representations. Fine-tuning is performed on a labeled dataset of 81 procedures using a modified training regime that incorporates focal loss, gradual layer unfreezing, and dynamic sampling to address class imbalance and procedural variability.
[74] Self-Supervised Learning for Knee Osteoarthritis: Diagnostic Limitations and Prognostic Value of Uncurated Hospital Data
Haresh Rengaraj Rajamohan,Yuxuan Chen,Kyunghyun Cho,Cem M. Deniz
Main category: cs.CV
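TL;DR: This study evaluates self-supervised learning (SSL) for diagnostic and prognostic modeling of knee osteoarthritis (OA). Multimodal image-text SSL did not improve diagnostic performance because of severe bias in the pretraining data (93% KL grade 3), but it significantly improved prognostic tasks such as predicting 4-year structural progression.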
Details
Motivation: To test whether self-supervised learning, especially SSL on unlabeled real-world medical data, yields better representations than ImageNet pretraining for knee osteoarthritis diagnosis (KL grading) and prognosis (structural progression).
Method: Two SSL initializations are compared: (i) image-only SSL on knee radiographs from the OAI, MOST, and NYU cohorts; (ii) multimodal image-text SSL on uncurated hospital knee radiographs paired with radiologist reports. Both are compared with an ImageNet-pretrained baseline under linear probing and full fine-tuning, on KL grade prediction (diagnosis) and 4-year structural progression prediction (prognosis).
Result: Image-only SSL improves KL grading accuracy under linear probing but does not beat ImageNet under full fine-tuning. Multimodal SSL gives no gain on diagnosis, which the authors attribute to the heavy KL grade 3 bias of the pretraining data, but clearly outperforms ImageNet on prognosis, including external validation on MOST (AUROC 0.701 vs. 0.599 with 10% labeled data).
Conclusion: Uncurated hospital image-text data is poorly suited to diagnostic modeling because of its distribution bias, but it provides a strong representation signal when the downstream prognostic task matches the pretraining distribution, which points to a distinct role for SSL in medical prognosis.
Abstract: This study assesses whether self-supervised learning (SSL) improves knee osteoarthritis (OA) modeling for diagnosis and prognosis relative to ImageNet-pretrained initialization. We compared (i) image-only SSL pretrained on knee radiographs from the OAI, MOST, and NYU cohorts, and (ii) multimodal image-text SSL pretrained on uncurated hospital knee radiographs paired with radiologist impressions. For diagnostic Kellgren-Lawrence (KL) grade prediction, SSL offered mixed results. While image-only SSL improved accuracy during linear probing (frozen encoder), it did not outperform ImageNet pretraining during full fine-tuning. Similarly, multimodal SSL failed to improve grading performance. We attribute this to severe bias in the uncurated hospital pretraining corpus (93% estimated KL grade 3), which limited alignment with the balanced diagnostic task. In contrast, this same multimodal initialization significantly improved prognostic modeling. It outperformed ImageNet baselines in predicting 4-year structural incidence and progression, including on external validation (MOST AUROC: 0.701 vs. 0.599 at 10% labeled data). Overall, while uncurated hospital image-text data may be ineffective for learning diagnosis due to severity bias, it provides a strong signal for prognostic modeling when the downstream task aligns with the pretraining data distribution.
[75] ICTPolarReal: A Polarized Reflection and Material Dataset of Real World Objects
Jing Yang,Krithika Dharanikota,Emily Jia,Haiwei Chen,Yajie Zhao
Main category: cs.CV
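TL;DR: This paper presents a large-scale real-world polarized reflection and material dataset with 218 everyday objects and more than 1.2 million high-resolution images, supporting multiview, multi-illumination, polarization, reflectance separation, and material attribute analysis. Inverse and forward rendering models trained on it improve markedly on intrinsic decomposition, relighting, and sparse-view 3D reconstruction.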
Details
Motivation: Modeling how real materials reflect light is hard mainly because measured reflectance data is scarce; existing methods rely on synthetic data with simplified lighting and limited realism, so models generalize poorly to real images.
Method: The authors build a large-scale real-world polarized reflection and material dataset with an 8-camera, 346-light Light Stage equipped with cross/parallel polarization. It covers five acquisition dimensions (multiview, multi-illumination, polarization, reflectance separation, and material attributes) and provides diffuse-specular separation, analytically derived diffuse and specular albedo, and surface normals. State-of-the-art inverse and forward rendering models are trained and evaluated on it.
Result: On intrinsic image decomposition, relighting, and sparse-view 3D reconstruction, the models show better material separation, illumination fidelity, and geometric consistency.
Conclusion: The dataset provides a new benchmark for physically grounded material understanding and helps inverse rendering move from synthetic training toward real-world generalization.
Abstract: Accurately modeling how real-world materials reflect light remains a core challenge in inverse rendering, largely due to the scarcity of real measured reflectance data. Existing approaches rely heavily on synthetic datasets with simplified illumination and limited material realism, preventing models from generalizing to real-world images. We introduce a large-scale polarized reflection and material dataset of real-world objects, captured with an 8-camera, 346-light Light Stage equipped with cross/parallel polarization. Our dataset spans 218 everyday objects across five acquisition dimensions (multiview, multi-illumination, polarization, reflectance separation, and material attributes), yielding over 1.2M high-resolution images with diffuse-specular separation and analytically derived diffuse albedo, specular albedo, and surface normals. Using this dataset, we train and evaluate state-of-the-art inverse and forward rendering models on intrinsic decomposition, relighting, and sparse-view 3D reconstruction, demonstrating significant improvements in material separation, illumination fidelity, and geometric consistency. We hope that our work can establish a new foundation for physically grounded material understanding and enable real-world generalization beyond synthetic training regimes. Project page: https://jingyangcarl.github.io/ICTPolarReal/
[76] TIGFlow-GRPO: Trajectory Forecasting via Interaction-Aware Flow Matching and Reward-Driven Optimization
Xuepeng Jing,Wenhuan Lu,Hao Meng,Zhizhi Yu,Jianguo Wei
Main category: cs.CV
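TL;DR: This paper proposes TIGFlow-GRPO, a two-stage generative framework that aligns flow-based trajectory generation with behavioral rules. The first stage strengthens context modeling with conditional flow matching (CFM) and a Trajectory-Interaction-Graph (TIG); the second stage applies Flow-GRPO post-training, which turns deterministic flow rollout into stochastic SDE sampling and optimizes with rewards for social compliance and physical feasibility. Experiments on ETH/UCY and SDD show gains in accuracy, long-horizon stability, and behavioral plausibility.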
Details
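Motivation: Existing trajectory forecasting methods based on conditional flow matching (CFM) rely mainly on supervised fitting and do not adequately reflect social norms and scene constraints, so generated trajectories can lack behavioral plausibility and physical feasibility.
Method: TIGFlow-GRPO is a two-stage framework. The first stage builds a CFM predictor with a Trajectory-Interaction-Graph (TIG) module that explicitly models fine-grained spatio-temporal interactions between agents and between agents and the scene. The second stage applies Flow-GRPO post-training, reformulating the deterministic ODE rollout as stochastic SDE sampling and designing a composite reward that combines view-aware social compliance with map-aware physical feasibility, so that reinforcement learning refines the multimodal trajectory distribution.
Result: On the ETH/UCY and SDD benchmarks, TIGFlow-GRPO clearly improves forecasting accuracy (e.g., ADE/FDE) and long-horizon stability, and its trajectories better respect social norms (such as yielding and following) and physical constraints (such as not crossing obstacles); qualitative and quantitative results exceed existing CFM and GAN/VAE methods.
Conclusion: TIGFlow-GRPO combines flow-based generative modeling with rule-driven behavioral alignment, offering an interpretable and robust approach to trajectory forecasting in dynamic multimedia environments that is consistent with human behavior.
Abstract: Human trajectory forecasting is important for intelligent multimedia systems operating in visually complex environments, such as autonomous driving and crowd surveillance. Although Conditional Flow Matching (CFM) has shown strong ability in modeling trajectory distributions from spatio-temporal observations, existing approaches still focus primarily on supervised fitting, which may leave social norms and scene constraints insufficiently reflected in generated trajectories. To address this issue, we propose TIGFlow-GRPO, a two-stage generative framework that aligns flow-based trajectory generation with behavioral rules. In the first stage, we build a CFM-based predictor with a Trajectory-Interaction-Graph (TIG) module to model fine-grained visual-spatial interactions and strengthen context encoding. This stage captures both agent-agent and agent-scene relations more effectively, providing more informative conditional features for subsequent alignment. In the second stage, we perform Flow-GRPO post-training, where deterministic flow rollout is reformulated as stochastic ODE-to-SDE sampling to enable trajectory exploration, and a composite reward combines view-aware social compliance with map-aware physical feasibility. By evaluating trajectories explored through SDE rollout, GRPO progressively steers multimodal predictions toward behaviorally plausible futures.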
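The reward-driven second stage can be illustrated with the group-relative advantage computation that gives GRPO its name: several trajectories are sampled for the same scene, each is scored, and each score is normalized against its own group. The sketch below is a generic illustration under stated assumptions, not the paper's implementation; the function names, the equal reward weights, and the epsilon value are hypothetical.

```python
import math

def composite_reward(social, feasibility, w_social=0.5, w_feas=0.5):
    """Weighted sum of two reward terms.

    Assumption: both inputs are scalar scores where higher is better.
    The weights are placeholders, not values from the paper.
    """
    return w_social * social + w_feas * feasibility

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]

# Three candidate trajectories for one scene, each scored on
# (social compliance, physical feasibility).
scores = [(0.2, 0.4), (0.6, 0.8), (1.0, 1.0)]
rewards = [composite_reward(s, f) for s, f in scores]
advantages = group_relative_advantages(rewards)
```

Trajectories that score above their group's mean receive positive advantages and are reinforced; those below the mean are discouraged. No separate value network is needed because the group itself serves as the baseline.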
Experiments on the ETH/UCY and SDD datasets show that TIGFlow-GRPO improves forecasting accuracy and long-horizon stability while generating trajectories that are more socially compliant and physically feasible. These results suggest that the proposed framework provides an effective way to connect flow-based trajectory modeling with behavior-aware alignment in dynamic multimedia environments.
[77] Infinite Gaze Generation for Videos with Autoregressive Diffusion
Jenna Kang,Colin Groth,Tong Wu,Finley Torrens,Patsorn Sangkloy,Gordon Wetzstein,Qi Sun
Main category: cs.CV
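TL;DR: This paper proposes an infinite-horizon video gaze prediction framework based on an autoregressive diffusion model. It generates gaze trajectories with continuous spatial coordinates and high-precision timestamps, and significantly outperforms existing methods in long-range spatio-temporal accuracy and trajectory realism.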
Details
Motivation: Traditional saliency maps and scanpaths fail to capture the fine-grained temporal dynamics of gaze, and existing models are limited to short windows (about 3-5 seconds), so they cannot model the long-range behavioral dependencies found in real content.
Method: The authors propose a generative framework based on an autoregressive diffusion model, conditioned on a saliency-aware visual latent space, that synthesizes raw gaze trajectories with continuous spatial coordinates and high-resolution timestamps.
Result: Quantitative and qualitative evaluations show that the method significantly outperforms existing approaches in long-range spatio-temporal accuracy and trajectory realism.
Conclusion: The framework achieves infinite-horizon raw gaze prediction for videos of arbitrary length, giving video scene understanding and multimodal interaction a finer-grained and longer-term model of gaze.
Abstract: Predicting human gaze in video is fundamental to advancing scene understanding and multimodal interaction. While traditional saliency maps provide spatial probability distributions and scanpaths offer ordered fixations, both abstractions often collapse the fine-grained temporal dynamics of raw gaze. Furthermore, existing models are typically constrained to short-term windows (≈ 3-5 s), failing to capture the long-range behavioral dependencies inherent in real-world content. We present a generative framework for infinite-horizon raw gaze prediction in videos of arbitrary length. By leveraging an autoregressive diffusion model, we synthesize gaze trajectories characterized by continuous spatial coordinates and high-resolution timestamps. Our model is conditioned on a saliency-aware visual latent space. Quantitative and qualitative evaluations demonstrate that our approach significantly outperforms existing approaches in long-range spatio-temporal accuracy and trajectory realism.
[78] BiFM: Bidirectional Flow Matching for Few-Step Image Editing and Generation
Yasong Dai,Zeeshan Hayder,David Ahmedt-Aristizabal,Hongdong Li
Main category: cs.CV
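TL;DR: This paper proposes BiFM (Bidirectional Flow Matching), a framework that learns generation and inversion jointly. By estimating average velocity fields in both directions and adding continuous time-interval supervision and a bidirectional consistency objective, it achieves high-quality image editing and generation in a small number of steps.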
Details
Motivation: Existing few-step sampling methods approximate the forward process poorly, which degrades editing quality, and they often depend on pretrained generators and auxiliary modules, which limits scalability and generalization across architectures.
Method: BiFM learns generation and inversion jointly, estimating average velocity fields in both the image-to-noise and noise-to-image directions under the constraint of a shared instantaneous velocity field; training uses continuous time-interval supervision, a bidirectional consistency loss, and a lightweight time-interval embedding.
Result: BiFM consistently outperforms existing few-step methods across a range of image editing and generation tasks, supports one-step inversion, and integrates seamlessly into popular diffusion and flow matching backbones.
Conclusion: BiFM offers a more general, efficient, and scalable framework for few-step image generation and editing, addressing inaccurate forward modeling and heavy architecture dependence.
Abstract: Recent diffusion and flow matching models have demonstrated strong capabilities in image generation and editing by progressively removing noise through iterative sampling. While this enables flexible inversion for semantic-preserving edits, few-step sampling regimes suffer from poor forward process approximation, leading to degraded editing quality. Existing few-step inversion methods often rely on pretrained generators and auxiliary modules, limiting scalability and generalization across different architectures. To address these limitations, we propose BiFM (Bidirectional Flow Matching), a unified framework that jointly learns generation and inversion within a single model. BiFM directly estimates average velocity fields in both "image → noise" and "noise → image" directions, constrained by a shared instantaneous velocity field derived from either predefined schedules or pretrained multi-step diffusion models. Additionally, BiFM introduces a novel training strategy using continuous time-interval supervision, stabilized by a bidirectional consistency objective and a lightweight time-interval embedding. This bidirectional formulation also enables one-step inversion and can integrate seamlessly into popular diffusion and flow matching backbones. Across diverse image editing and generation tasks, BiFM consistently outperforms existing few-step approaches, achieving superior performance and editability.
[79] Select, Hypothesize and Verify: Towards Verified Neuron Concept Interpretation
ZeBin Ji,Yang Hu,Xiuli Bi,Bo Liu,Bin Xiao
Main category: cs.CV
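TL;DR: This paper proposes a Select-Hypothesize-Verify framework that selects representative samples through activation-distribution analysis, generates concept hypotheses, and verifies how strongly each concept activates the neuron, yielding more accurate explanations of neuron function and markedly more reliable concept descriptions.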
Details
Motivation: Existing neuron concept interpretation methods assume that every neuron has a well-defined, discriminative function, but in practice some neurons are redundant or misleading, which leads to wrong readings of what drives a model's decisions.
Method: The Select-Hypothesize-Verify framework: 1) selects the samples that best represent a neuron's function through activation-distribution analysis; 2) generates natural-language concept hypotheses for the selected neurons; 3) verifies whether each concept highly activates the corresponding neuron.
Result: Experiments show that the generated concepts activate their target neurons with a probability about 1.5 times that of the current state-of-the-art method, a clear gain in concept accuracy.
Conclusion: Adding a verification step for neuron function and building the Select-Hypothesize-Verify framework reduces concept misinterpretation caused by redundant or misleading neurons and improves neural network interpretability.
Abstract: Interpreting the functionality (also known as concepts) of neurons is essential for understanding neural network decisions. Existing approaches describe neuron concepts by generating natural language descriptions, thereby advancing the understanding of the neural network's decision-making mechanism. However, these approaches assume that each neuron has well-defined functions and provides discriminative features for neural network decision-making. In fact, some neurons may be redundant or may offer misleading concepts. Thus, the descriptions for such neurons may cause misinterpretations of the factors driving the neural network's decisions. To address the issue, we introduce a verification of neuron functions, which checks whether the generated concept highly activates the corresponding neuron. Furthermore, we propose a Select-Hypothesize-Verify framework for interpreting neuron functionality. This framework consists of: 1) selecting activation samples that best capture a neuron's well-defined functional behavior through activation-distribution analysis; 2) forming hypotheses about concepts for the selected neurons; and 3) verifying whether the generated concepts accurately reflect the functionality of the neuron. Extensive experiments show that our method produces more accurate neuron concepts. Our generated concepts activate the corresponding neurons with a probability approximately 1.5 times that of the current state-of-the-art method.
[80] Self-Corrected Image Generation with Explainable Latent Rewards
Yinyi Luo,Hrishikesh Gokhale,Marios Savvides,Jindong Wang,Shengfeng He
Main category: cs.CV
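TL;DR: This paper proposes xLARD, a framework in which a multimodal large language model guides text-to-image generation through Explainable Latent Rewards, enabling self-correction and better alignment of fine-grained semantics and spatial relations.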
Details
Motivation: In text-to-image generation, feed-forward generation has difficulty anticipating how well the output will align with a complex prompt, whereas evaluating an image is comparatively easy; this asymmetry motivates a self-correction mechanism.
Method: xLARD includes a lightweight corrector that refines latent representations using structured feedback from model-generated reference images, and a differentiable mapping from latent edits to interpretable reward signals, so that non-differentiable image-level evaluation can guide the latent space continuously.
Result: Across a range of generation and editing tasks, xLARD clearly improves semantic alignment and visual fidelity while preserving the original generative prior.
Conclusion: xLARD shows that multimodal large models can be used effectively for self-evaluation and correction in latent space, offering a new approach to controllable image generation.
Abstract: Despite significant progress in text-to-image generation, aligning outputs with complex prompts remains challenging, particularly for fine-grained semantics and spatial relations. This difficulty stems from the feed-forward nature of generation, which requires anticipating alignment without fully understanding the output. In contrast, evaluating generated images is more tractable. Motivated by this asymmetry, we propose xLARD, a self-correcting framework that uses multimodal large language models to guide generation through Explainable LAtent RewarDs. xLARD introduces a lightweight corrector that refines latent representations based on structured feedback from model-generated references. A key component is a differentiable mapping from latent edits to interpretable reward signals, enabling continuous latent-level guidance from non-differentiable image-level evaluations. This mechanism allows the model to understand, assess, and correct itself during generation. Experiments across diverse generation and editing tasks show that xLARD improves semantic alignment and visual fidelity while maintaining generative priors. Code is available at https://yinyiluo.github.io/xLARD/.
[81] PASDiff: Physics-Aware Semantic Guidance for Joint Real-world Low-Light Face Enhancement and Restoration
Yilin Ni,Wenjie Li,Zhengxue Wang,Juncheng Li,Guangwei Gao,Jian Yang
Main category: cs.CV
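TL;DR: This paper proposes PASDiff, a training-free physics-aware semantic diffusion method that handles the multiple degradations of low-light face images through photometric constraints and style-agnostic structural injection, and builds WildDark-Face, a real-world benchmark dataset.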
Details
Motivation: Face images captured in real-world low light suffer from multiple degradations (low illumination, blur, noise, and low visibility); existing cascaded methods accumulate errors severely, and generic joint models lack explicit facial priors and struggle to recover clear facial structure.
Method: PASDiff: 1) uses inverse intensity weighting and Retinex theory to introduce photometric constraints that restore plausible illumination and natural chromaticity; 2) adds a Style-Agnostic Structural Injection (SASI) module that extracts structure from an off-the-shelf facial prior while filtering out its photometric bias, combining identity features with physical constraints; 3) builds WildDark-Face, a real-world low-light face dataset of 700 images.
Result: PASDiff significantly outperforms existing methods on multiple metrics and strikes a better balance among natural illumination, color recovery, and identity consistency.
Conclusion: PASDiff unifies physical plausibility and semantic fidelity without training, supporting the importance of explicitly modeling photometric laws and structural priors for low-light face enhancement.
Abstract: Face images captured in real-world low light suffer from multiple degradations, including low illumination, blur, noise, and low visibility. Existing cascaded solutions often suffer from severe error accumulation, while generic joint models lack explicit facial priors and struggle to resolve clear face structures. In this paper, we propose PASDiff, a training-free Physics-Aware Semantic Diffusion method. To achieve a plausible illumination and color distribution, we leverage inverse intensity weighting and Retinex theory to introduce photometric constraints, thereby reliably recovering visibility and natural chromaticity. To faithfully reconstruct facial details, our Style-Agnostic Structural Injection (SASI) extracts structures from an off-the-shelf facial prior while filtering out its intrinsic photometric biases, seamlessly harmonizing identity features with physical constraints. Furthermore, we construct WildDark-Face, a real-world benchmark of 700 low-light facial images with complex degradations. Extensive experiments demonstrate that PASDiff significantly outperforms existing methods, achieving a superior balance among natural illumination, color recovery, and identity consistency.
[82] MoE-GRPO: Optimizing Mixture-of-Experts via Reinforcement Learning in Vision-Language Models
Dohwan Ko,Jinyoung Park,Seoung Choi,Sanghyeok Lee,Seohyun Lee,Hyunwoo J. Kim
Main category: cs.CV
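TL;DR: This paper proposes MoE-GRPO, a reinforcement-learning (GRPO) based framework for optimizing expert routing that increases the diversity of expert selection in Mixture-of-Experts (MoE) vision-language models (VLMs), mitigates expert overfitting, and enables task-level expert specialization.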
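As a rough illustration of the group-relative step that GRPO is named for, the sketch below normalizes each rollout's reward against the other rollouts sampled for the same input. The reward values are hypothetical, and the use of the population standard deviation and the `eps` term are assumptions; the entry does not say how MoE-GRPO defines its rewards or normalization details.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Score each rollout relative to its own group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four expert-routing rollouts sampled for one input.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Rollouts that beat the group average receive positive advantages and the rest negative ones, so no separate value network is needed to provide a baseline.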
Details
Motivation: 现有MoE在VLM中采用确定性top-K路由,易忽略更优专家组合并导致专家过拟合,亟需提升路由多样性与适应性。 Method: 将专家选择建模为序列决策问题,采用Group Relative Policy Optimization(GRPO)进行强化学习优化;引入模态感知的路由器引导机制,抑制对特定模态不活跃专家的探索。 Result: 在多模态图像与视频基准上,MoE-GRPO显著优于标准top-K及其变体,提升了专家选择多样性、缓解了专家过拟合,并实现了任务级专家特化。 Conclusion: MoE-GRPO通过RL驱动的自适应路由与模态感知引导,有效提升了MoE-VLM的泛化性与效率,为多模态MoE架构设计提供了新范式。 Abstract: Mixture-of-Experts (MoE) has emerged as an effective approach to reduce the computational overhead of Transformer architectures by sparsely activating a subset of parameters for each token while preserving high model capacity. This paradigm has recently been extended to Vision-Language Models (VLMs), enabling scalable multi-modal understanding with reduced computational cost. However, the widely adopted deterministic top-K routing mechanism may overlook more optimal expert combinations and lead to expert overfitting. To address this limitation and improve the diversity of expert selection, we propose MoE-GRPO, a reinforcement learning (RL)-based framework for optimizing expert routing in MoE-based VLMs. Specifically, we formulate expert selection as a sequential decision-making problem and optimize it using Group Relative Policy Optimization (GRPO), allowing the model to learn adaptive expert routing policies through exploration and reward-based feedback. Furthermore, we introduce a modality-aware router guidance that enhances training stability and efficiency by discouraging the router from exploring experts that are infrequently activated for a given modality. Extensive experiments on multi-modal image and video benchmarks show that MoE-GRPO consistently outperforms standard top-K routing and its variants by promoting more diverse expert selection, thereby mitigating expert overfitting and enabling a task-level expert specialization.[83] Few-Shot Left Atrial Wall Segmentation in 3D LGE MRI via Meta-Learning
Yusri Al-Sanaani,Rebecca Thornhill,Pablo Nery,Elena Pena,Robert deKemp,Calum Redpath,David Birnie,Sreeraman Rajan
Main category: cs.CV
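TL;DR: This paper proposes a 3D left atrial wall segmentation framework based on Model-Agnostic Meta-Learning (MAML) that combines multi-task meta-training with a boundary-aware composite loss, improving thin-structure segmentation accuracy and robustness under few-shot (5/10/20-shot) conditions.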
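The results in this entry are reported as Dice scores (DSC). For readers unfamiliar with the metric, here is a minimal sketch of Dice on flat binary masks; the toy masks are invented for illustration and are not data from the paper.

```python
def dice_score(pred, target):
    """Dice = 2 * |P and T| / (|P| + |T|) for flat binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total

# A three-voxel-thick "wall" predicted one voxel away from the reference.
pred = [0, 1, 1, 1, 0, 0]
target = [0, 0, 1, 1, 1, 0]
score = dice_score(pred, target)
print(round(score, 3))
```

The one-voxel shift already costs a third of the score here, which is one reason thin structures such as the atrial wall tend to show lower Dice values than bulky ones.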
Details
Motivation: 左心房壁在LGE-MRI中因结构薄、对比度低、专家标注稀缺,导致分割困难,亟需高效少样本学习方法。 Method: 采用Model-Agnostic Meta-Learning(MAML)框架,联合左心房壁、左/右心房腔体三任务元训练,并引入边界感知复合损失以增强薄结构分割精度。 Result: 在保留测试集上,5-shot时Dice达0.64(优于监督微调的0.52),HD95为5.70 mm(优于7.60 mm);20-shot时接近全监督性能(0.69 vs. 0.71 DSC);在未见域偏移和本地队列中仍保持稳健(5-shot下Dice分别为0.59和0.57)。 Conclusion: 该MAML方法可在极少量新数据(如5例)下实现更准确可靠的左心房壁边界分割,有望推动临床中仅需极少额外标注的房颤重构评估应用。 Abstract: Segmenting the left atrial wall from late gadolinium enhancement magnetic resonance images (MRI) is challenging due to the wall's thin geometry, low contrast, and the scarcity of expert annotations. We propose a Model-Agnostic Meta-Learning (MAML) framework for K-shot (K = 5, 10, 20) 3D left atrial wall segmentation that is meta-trained on the wall task together with auxiliary left atrial and right atrial cavity tasks and uses a boundary-aware composite loss to emphasize thin-structure accuracy. We evaluated MAML segmentation performance on a hold-out test set and assessed robustness under an unseen synthetic shift and on a distinct local cohort. On the hold-out test set, MAML appeared to improve segmentation performance compared to the supervised fine-tuning model, achieving a Dice score (DSC) of 0.64 vs. 0.52 and HD95 of 5.70 vs. 7.60 mm at 5-shot, and approached the fully supervised reference at 20-shot (0.69 vs. 0.71 DSC). Under unseen shift, performance degraded but remained robust: at 5-shot, MAML attained 0.59 DSC and 5.99 mm HD95 on the unseen domain shift and 0.57 DSC and 6.01 mm HD95 on the local cohort, with consistent gains as K increased. These results suggest that more accurate and reliable thin-wall boundaries are achievable in low-shot adaptation, potentially enabling clinical translation with minimal additional labeling for the assessment of atrial remodeling.[84] Towards Video Anomaly Detection from Event Streams: A Baseline and Benchmark Datasets
Peng Wu,Yuting Yan,Guansong Pang,Yujia Sun,Qingsen Yan,Peng Wang,Yanning Zhang
Main category: cs.CV
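TL;DR: This paper proposes EWAD, the first video anomaly detection (VAD) framework for event streams, and builds several synchronized event-RGB benchmark datasets; dynamic sampling, density-modulated modeling, and knowledge distillation strengthen the event representations, and EWAD significantly outperforms existing methods on three benchmarks.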
Details
Motivation: 事件相机具有低冗余、聚焦动态运动和天然隐私保护等特性,适合视频异常检测,但缺乏专用数据集和有效建模方法,严重制约该方向发展。 Method: 构建同步事件- RGB基准数据集;提出EWAD框架,包含事件密度感知的动态采样策略、密度调制的时间建模方法、以及RGB到事件的知识蒸馏机制。 Result: 在三个新构建的事件流基准上,EWAD显著优于现有方法;所有基准数据集将公开。 Conclusion: 事件驱动建模在视频异常检测中具有巨大潜力和有效性,本工作为事件式VAD建立了统一研究范式。 Abstract: Event-based vision, characterized by low redundancy, focus on dynamic motion, and inherent privacy-preserving properties, naturally fits the demands of video anomaly detection (VAD). However, the absence of dedicated event-stream anomaly detection datasets and effective modeling strategies has significantly hindered progress in this field. In this work, we take the first major step toward establishing event-based VAD as a unified research direction. We first construct multiple event-stream based benchmarks for video anomaly detection, featuring synchronized event and RGB recordings. Leveraging the unique properties of events, we then propose an EVent-centric spatiotemporal Video Anomaly Detection framework, namely EWAD, with three key innovations: an event density aware dynamic sampling strategy to select temporally informative segments; a density-modulated temporal modeling approach that captures contextual relations from sparse event streams; and an RGB-to-event knowledge distillation mechanism to enhance event-based representations under weak supervision. Extensive experiments on three benchmarks demonstrate that our EWAD achieves significant improvements over existing approaches, highlighting the potential and effectiveness of event-driven modeling for video anomaly detection. The benchmark datasets will be made publicly available.[85] Beyond Attention Magnitude: Leveraging Inter-layer Rank Consistency for Efficient Vision-Language-Action Models
Peiju Liu,Jinming Liu,Xipeng Qiu,Xuanjing Huang
Main category: cs.CV
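TL;DR: This paper proposes TIES, a framework that selects visual tokens by dynamically balancing attention magnitude against inter-layer ranking consistency, significantly reducing the inference latency of VLA models and improving task success rates.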
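To make "inter-layer ranking consistency" concrete, the sketch below measures how well two layers agree on the ordering of tokens using Kendall's tau and then uses that agreement to weight a token score. Both choices are assumptions made for illustration: the entry does not confirm that the "Tau" in TIES is Kendall's tau, and the blending rule in `select_tokens` is hypothetical rather than the paper's selection rule.

```python
from itertools import combinations

def kendall_tau(a, b):
    """Kendall's tau-a between two equal-length score lists (ties count as neither)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        sign = (a[i] - a[j]) * (b[i] - b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (len(a) * (len(a) - 1) / 2)

def select_tokens(attn_prev, attn_curr, keep):
    """Hypothetical rule: trust current attention more when layers agree on ranking."""
    tau = kendall_tau(attn_prev, attn_curr)
    trust = max(tau, 0.0)
    # When rankings disagree, fall back toward the two-layer average.
    smoothed = [(p + c) / 2 for p, c in zip(attn_prev, attn_curr)]
    scores = [trust * c + (1 - trust) * s for c, s in zip(attn_curr, smoothed)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep]), tau

# Invented attention scores for four visual tokens at two consecutive layers.
attn_prev = [0.40, 0.30, 0.20, 0.10]
attn_curr = [0.35, 0.15, 0.30, 0.20]
kept, tau = select_tokens(attn_prev, attn_curr, keep=2)
print(kept, round(tau, 3))
```

The point of the sketch is only that a ranking-agreement statistic can modulate how much weight raw attention magnitude receives, without any additional training.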
Details
Motivation: Existing VLA models incur high inference latency from processing dense visual tokens, and mainstream token reduction methods rely on static selection by attention magnitude, which ignores task dependence and can hurt policy performance. Method: TIES (Tau-guided Inter-layer Efficient Selection) uses inter-layer token ranking consistency as a dynamic guidance signal and adaptively combines attention magnitude with that consistency to select tokens, with no additional training. Result: On the CogACT + SIMPLER benchmark, the average success rate improves by 6% while token usage drops by 78%, and the method generalizes well across diverse decoders and benchmarks. Conclusion: TIES shows that dynamic, consistency-guided token selection outperforms static attention-driven selection, offering a new paradigm for efficient VLA models. Abstract: Vision-Language-Action (VLA) models excel in robotic manipulation but suffer from significant inference latency due to processing dense visual tokens. Existing token reduction methods predominantly rely on attention magnitude as a static selection criterion. In this work, we challenge this assumption, revealing that high-attention tokens are task-dependent and can even degrade policy performance. To address this, we introduce TIES (Tau-guided Inter-layer Efficient Selection), a dynamic framework guided by inter-layer token ranking consistency. By adaptively balancing attention magnitude with ranking consistency, TIES ensures robust token selection without requiring additional training. On the CogACT + SIMPLER benchmark, TIES improves average success rates by 6% while reducing token usage by 78%, and demonstrates strong generalization across diverse decoders and benchmarks.
[86] C2W-Tune: Cavity-to-Wall Transfer Learning for Thin Atrial Wall Segmentation in 3D Late Gadolinium-enhanced Magnetic Resonance
Yusri Al-Sanaani,Rebecca Thornhill,Sreeraman Rajan
Main category: cs.CV
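TL;DR: This paper proposes C2W-Tune, a two-stage cavity-to-wall transfer framework that uses a high-accuracy left atrial cavity segmentation model as an anatomical prior to improve segmentation of the thin left atrial wall in 3D LGE-MRI; with pre-training followed by progressive-unfreezing fine-tuning, it significantly outperforms a from-scratch baseline on several boundary metrics.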
Details
Motivation: 左心房(LA)壁在3D晚期钆增强MRI(LGE-MRI)中因结构薄、解剖复杂、对比度低,导致精确分割困难,而准确分割对壁厚映射和纤维化量化至关重要。 Method: 提出C2W-Tune:第一阶段用带ResNeXt编码器和实例归一化的3D U-Net预训练LA腔分割;第二阶段迁移权重,采用渐进式层解冻策略微调网络以适配LA壁分割,保留心内膜特征并细化壁特异性信息。 Result: 在2018 LA Segmentation Challenge数据集上,相比同构从头训练基线,壁Dice从0.623提升至0.814,1mm Surface Dice从0.553升至0.731,HD95从2.95 mm降至2.55 mm,ASSD从0.71 mm降至0.63 mm;仅用70例训练数据时仍达Dice 0.78、HD95 3.15 mm,优于典型多类分割方法(Dice约0.6–0.7)。 Conclusion: 基于解剖结构的任务迁移配合可控微调策略,可有效提升3D LGE-MRI中薄左心房壁分割的边界精度。 Abstract: Accurate segmentation of the left atrial (LA) wall in 3D late gadolinium-enhanced MRI (LGE-MRI) is essential for wall thickness mapping and fibrosis quantification, yet it remains challenging due to the wall's thinness, complex anatomy, and low contrast. We propose C2W-Tune, a two-stage cavity-to-wall transfer framework that leverages a high-accuracy LA cavity model as an anatomical prior to improve thin-wall delineation. Using a 3D U-Net with a ResNeXt encoder and instance normalization, Stage 1 pre-trains the network to segment the LA cavity, learning robust atrial representations. Stage 2 transfers these weights and adapts the network to LA wall segmentation using a progressive layer-unfreezing schedule to preserve endocardial features while enabling wall-specific refinement. Experiments on the 2018 LA Segmentation Challenge dataset demonstrate substantial gains over an architecture-matched baseline trained from scratch: wall Dice improves from 0.623 to 0.814, and Surface Dice at 1 mm improves from 0.553 to 0.731. Boundary errors were substantially reduced, with the 95th-percentile Hausdorff distance (HD95) decreasing from 2.95 mm to 2.55 mm and the average symmetric surface distance (ASSD) from 0.71 mm to 0.63 mm. 
Furthermore, even with reduced supervision (70 training volumes sampled from the same training pool), C2W-Tune achieved a Dice score of 0.78 and an HD95 of 3.15 mm, maintaining competitive performance and exceeding multi-class benchmarks that typically report Dice values around 0.6-0.7. These results show that anatomically grounded task transfer with controlled fine-tuning improves boundary accuracy for thin LA wall segmentation in 3D LGE-MRI.
[87] Relaxed Rigidity with Ray-based Grouping for Dynamic Gaussian Splatting
Junoh Leea,Junmyeong Lee,Yeon-Ji Song,Inhwan Bae,Jisu Shin,Hae-Gon Jeon,Jin-Hwa Kim
Main category: cs.CV
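TL;DR: This paper proposes a new method that explicitly keeps the local geometric structure of Gaussians consistent over time in 4D scenes; view-space ray grouping with α-weighted constraints avoids reliance on external priors such as optical flow and improves the physical plausibility and quality of monocular dynamic 3D reconstruction.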
Details
Motivation: 现有基于3D高斯溅射的动态场景重建方法难以建模符合真实物理规律的运动,尤其在单目视频数据下,高斯运动不一致导致局部几何结构失真、重建质量下降,且多依赖光学流或2D轨迹等外部先验。 Method: 提出视图空间射线分组策略:对同一条射线穿过的、α-blending权重超过阈值的高斯进行聚类,并对其施加空间分布一致性约束,从而在时序上稳定局部几何结构,实现更符合物理规律的运动建模。 Result: 在多个具挑战性的单目数据集上验证,该方法集成于两个基线模型后,显著提升时间一致性与重建质量,优于现有主流方法。 Conclusion: 所提方法无需外部先验即可有效建模物理合理的动态高斯运动,为单目4D场景重建提供了更鲁棒、自洽的几何时序建模范式。 Abstract: The reconstruction of dynamic 3D scenes using 3D Gaussian Splatting has shown significant promise. A key challenge, however, remains in modeling realistic motion, as most methods fail to align the motion of Gaussians with real-world physical dynamics. This misalignment is particularly problematic for monocular video datasets, where failing to maintain coherent motion undermines local geometric structure, ultimately leading to degraded reconstruction quality. Consequently, many state-of-the-art approaches rely heavily on external priors, such as optical flow or 2D tracks, to enforce temporal coherence. In this work, we propose a novel method to explicitly preserve the local geometric structure of Gaussians across time in 4D scenes. Our core idea is to introduce a view-space ray grouping strategy that clusters Gaussians intersected by the same ray, considering only those whose $α$-blending weights exceed a threshold. We then apply constraints to these groups to maintain a consistent spatial distribution, effectively preserving their local geometry. This approach enforces a more physically plausible motion model by ensuring that local geometry remains stable over time, eliminating the reliance on external guidance. We demonstrate the efficacy of our method by integrating it into two distinct baseline models. Extensive experiments on challenging monocular datasets show that our approach significantly outperforms existing methods, achieving superior temporal consistency and reconstruction quality.[88] Probabilistic Concept Graph Reasoning for Multimodal Misinformation Detection
Ruichao Yang,Wei Gao,Xiaobin Zhu,Jing Ma,Hongzhan Lin,Ziyang Luo,Bo-Wen Zhang,Xu-Cheng Yin
Main category: cs.CV
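TL;DR: This paper proposes Probabilistic Concept Graph Reasoning (PCGR), an interpretable and evolvable framework for multimodal misinformation detection that builds a human-understandable concept graph and reasons over it with hierarchical attention, achieving high accuracy and strong robustness.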
Details
Motivation: 传统多模态虚假信息检测器是不透明的黑箱,且难以应对新型操纵手段,亟需可解释、可演化的解决方案。 Method: PCGR采用‘先构建后推理’范式:首先利用多模态大语言模型(MLLMs)自动发现并验证高层概念,构建人类可理解的概念图;然后在该图上应用分层注意力机制进行推理,生成可追溯的推理链。 Result: PCGR在多模态虚假信息检测任务中达到SOTA精度和鲁棒性,在粗粒度检测与细粒度操纵识别两方面均优于先前方法。 Conclusion: PCGR通过结构化、概念驱动的推理方式,有效提升了多模态虚假信息检测的可解释性、适应性和性能。 Abstract: Multimodal misinformation poses an escalating challenge that often evades traditional detectors, which are opaque black boxes and fragile against new manipulation tactics. We present Probabilistic Concept Graph Reasoning (PCGR), an interpretable and evolvable framework that reframes multimodal misinformation detection (MMD) as structured and concept-based reasoning. PCGR follows a build-then-infer paradigm, which first constructs a graph of human-understandable concept nodes, including novel high-level concepts automatically discovered and validated by multimodal large language models (MLLMs), and then applies hierarchical attention over this concept graph to infer claim veracity. This design produces interpretable reasoning chains linking evidence to conclusions. Experiments demonstrate that PCGR achieves state-of-the-art MMD accuracy and robustness to emerging manipulation types, outperforming prior methods in both coarse detection and fine-grained manipulation recognition.[89] Distributed Real-Time Vehicle Control for Emergency Vehicle Transit: A Scalable Cooperative Method
WenXi Wang,JunQi Zhang
Main category: cs.CV
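TL;DR: This paper proposes a scalable distributed vehicle control method in which ordinary vehicles use only local information to cooperatively yield to emergency vehicles in real time, balancing efficiency, safety, and generalization.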
Details
Motivation: Existing methods for rapid emergency vehicle transit (centralized optimization and reinforcement learning) are computationally expensive and hard to scale to large or dynamic traffic scenarios. Method: A distributed vehicle control method based on local information is proposed and proven approximately equivalent to the optimal solution under global information; a distributed conflict resolution mechanism is further designed to guarantee safety. Result: Simulations on real-world traffic datasets show faster decision-making, less disruption to ordinary vehicles, and stronger scalability across traffic densities and road network structures. Conclusion: The proposed distributed method overcomes the two main drawbacks of conventional methods, high computational cost and poor scalability, and achieves efficient, adaptive cooperative control for emergency vehicle transit without pre-training while guaranteeing safety. Abstract: Rapid transit of emergency vehicles is critical for saving lives and reducing property loss but often relies on surrounding ordinary vehicles to cooperatively adjust their driving behaviors. It is important to ensure rapid transit of emergency vehicles while minimizing the impact on ordinary vehicles. Centralized mathematical solvers and reinforcement learning are the state-of-the-art methods. The former obtain optimal solutions but are only practical for small-scale scenarios. The latter learns implicitly through extensive centralized training, but the trained model exhibits limited scalability to different traffic conditions. Hence, existing methods suffer from two fundamental limitations: high computational cost and lack of scalability. To overcome these limitations, this work proposes a scalable distributed vehicle control method, where vehicles adjust their driving behaviors in a distributed manner online using only local instead of global information. We prove that the proposed distributed method using only local information is approximately equivalent to the one using global information, which enables vehicles to evaluate their candidate states and make approximately optimal decisions in real time without pre-training and with natural adaptability to varying traffic conditions. Then, a distributed conflict resolution mechanism is further proposed to guarantee vehicles' safety by avoiding their decision conflicts, which eliminates the single-point-of-failure risk of centralized methods and provides deterministic safety guarantees that learned methods cannot offer. Simulation experiments based on real-world traffic datasets demonstrate that, compared with existing methods, the proposed method achieves faster decision-making, has less impact on ordinary vehicles, and maintains much stronger scalability across different traffic densities and road configurations.
[90] Interpretable Zero-shot Referring Expression Comprehension with Query-driven Scene Graphs
Yike Wu,Necva Bolucu,Stephen Wan,Dadong Wang,Jiahao Xia,Jian Zhang
Main category: cs.CV
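TL;DR: This paper proposes SGREC, which uses query-driven scene graphs as a structured intermediary between a vision-language model (VLM) and a large language model (LLM) to achieve interpretable zero-shot referring expression comprehension (REC), with leading performance on several benchmarks.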
Details
Motivation: Existing VLMs struggle to capture fine-grained visual details and complex object relationships, while LLMs excel at high-level semantic reasoning but cannot directly abstract visual features; combining their strengths is needed to improve zero-shot REC performance and interpretability. Method: SGREC first uses a VLM to build a query-driven scene graph that explicitly encodes spatial relationships, descriptive captions, and object interactions; the structured graph is then passed to an LLM as a textual representation, from which the LLM infers the target object and generates an explanation. Result: SGREC achieves the highest top-1 accuracy on zero-shot REC benchmarks including RefCOCO val (66.78%), RefCOCO+ testB (53.43%), and RefCOCOg val (73.28%). Conclusion: By introducing query-driven scene graphs, SGREC bridges low-level visual regions and high-level semantic understanding, balancing performance and interpretability and supporting the value of structured intermediate representations for zero-shot REC. Abstract: Zero-shot referring expression comprehension (REC) aims to locate target objects in images given natural language queries without relying on task-specific training data, demanding strong visual understanding capabilities. Existing Vision-Language Models (VLMs), such as CLIP, commonly address zero-shot REC by directly measuring feature similarities between textual queries and image regions. However, these methods struggle to capture fine-grained visual details and understand complex object relationships. Meanwhile, although Large Language Models (LLMs) excel at high-level semantic reasoning, their inability to directly abstract visual features into textual semantics limits their application in REC tasks. To overcome these limitations, we propose SGREC, an interpretable zero-shot REC method leveraging query-driven scene graphs as structured intermediaries. Specifically, we first employ a VLM to construct a query-driven scene graph that explicitly encodes spatial relationships, descriptive captions, and object interactions relevant to the given query. By leveraging this scene graph, we bridge the gap between low-level image regions and the higher-level semantic understanding required by LLMs. Finally, an LLM infers the target object from the structured textual representation provided by the scene graph, responding with detailed explanations for its decisions that ensure interpretability in the inference process. Extensive experiments show that SGREC achieves the best top-1 accuracy on most zero-shot REC benchmarks, including RefCOCO val (66.78%), RefCOCO+ testB (53.43%), and RefCOCOg val (73.28%), highlighting its strong visual scene understanding.
[91] Improving Fine-Grained Rice Leaf Disease Detection via Angular-Compactness Dual Loss Learning
Md. Rokon Mia,Rakib Hossain Sajib,Abdullah Al Noman,Abir Ahmed,B M Taslimul Haque
Main category: cs.CV
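TL;DR: This paper proposes a dual-loss framework combining Center Loss and ArcFace Loss to improve fine-grained classification of rice leaf diseases, reaching a top accuracy of 99.6% on the public Rice Leaf dataset.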
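A minimal sketch of how an angular-margin term and a center term can be combined for a single sample is shown below. The scale `scale`, margin `margin`, and weight `lam` are assumed values, since the entry does not give the paper's hyperparameters, and the usual ArcFace safeguards for angles near pi are omitted for brevity.

```python
import math

def arcface_logits(cosines, target, scale=30.0, margin=0.5):
    """Target class gets scale*cos(theta + margin); other classes get scale*cos(theta)."""
    logits = []
    for j, c in enumerate(cosines):
        if j == target:
            theta = math.acos(max(-1.0, min(1.0, c)))
            logits.append(scale * math.cos(theta + margin))
        else:
            logits.append(scale * c)
    return logits

def cross_entropy(logits, target):
    """Numerically stable softmax cross-entropy for one sample."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]

def center_loss(feature, center):
    """Half the squared distance between a feature and its class center."""
    return 0.5 * sum((f - c) ** 2 for f, c in zip(feature, center))

def dual_loss(cosines, feature, center, target, lam=0.01):
    """Angular-margin classification loss plus a weighted compactness term."""
    margin_term = cross_entropy(arcface_logits(cosines, target), target)
    return margin_term + lam * center_loss(feature, center)

# Hypothetical sample: cosine similarity to two class weights, a 2-D feature, and its class center.
loss = dual_loss([0.8, 0.1], feature=[1.0, 2.0], center=[0.0, 0.0], target=0)
print(loss)
```

The margin lowers the target logit, so the network must separate classes by a wider angle, while the center term pulls features of the same class together; together they address the inter-class similarity and intra-class variance named in the entry.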
Details
Motivation: Conventional deep learning models rely on cross-entropy loss, which struggles with the high intra-class variance and inter-class similarity of rice disease images and makes high-accuracy fine-grained classification difficult. Method: A dual-loss training framework combining Center Loss and ArcFace Loss (an angular-margin loss) is proposed and integrated into three mainstream backbones: InceptionNetV3, DenseNet201, and EfficientNetB0. Result: On the Rice Leaf dataset, the three backbones reach classification accuracies of 99.6%, 99.2%, and 99.2% respectively; angular-margin and center constraints substantially strengthen the discriminative power of the feature embeddings without changes to the network architecture. Conclusion: The dual-loss framework improves rice disease recognition accuracy and is lightweight and easy to deploy, making it suitable for early disease detection in real farming environments. Abstract: Early detection of rice leaf diseases is critical, as rice is a staple crop supporting a substantial share of the world's population. Timely identification of these diseases enables more effective intervention and significantly reduces the risk of large-scale crop losses. However, traditional deep learning models primarily rely on cross-entropy loss, which often struggles with high intra-class variance and inter-class similarity, common challenges in plant pathology datasets. To tackle this, we propose a dual-loss framework that combines Center Loss and ArcFace Loss to enhance fine-grained classification of rice leaf diseases. The method is applied to three state-of-the-art backbone architectures (InceptionNetV3, DenseNet201, and EfficientNetB0), each trained on the public Rice Leaf Dataset. Our approach achieves significant performance gains, with accuracies of 99.6%, 99.2%, and 99.2% respectively. The results demonstrate that angular margin-based and center-based constraints substantially boost the discriminative strength of feature embeddings. In particular, the framework does not require major architectural modifications, making it efficient and practical for real-world deployment in farming environments.
[92] Few TensoRF: Enhance the Few-shot on Tensorial Radiance Fields
Thanh-Hai Le,Hoang-Hau Tran,Trong-Nghia Vu
Main category: cs.CV
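TL;DR: Few TensoRF is an efficient few-shot 3D reconstruction method that combines the TensorRF representation with FreeNeRF-style regularization, markedly improving reconstruction quality and stability under sparse views while keeping training fast.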
Details
Motivation: 解决稀疏输入视角下 3D 重建质量差、不稳定的问题,兼顾效率与数据有效性。 Method: 将 TensorRF 的张量表示与 FreeNeRF 的频率驱动少样本正则化相结合,并引入频率掩码和遮挡掩码以增强鲁棒性。 Result: 在 Synthesis NeRF 上 PSNR 提升至 23.70 dB(微调达 24.52 dB),训练时间仍为约 10–15 分钟;在 THuman 2.0 上仅用 8 张图即达 27.37–34.00 dB。 Conclusion: Few TensoRF 是一种高效、少样本、适用于多样化场景的实时 3D 重建新方案。 Abstract: This paper presents Few TensoRF, a 3D reconstruction framework that combines TensorRF's efficient tensor based representation with FreeNeRF's frequency driven few shot regularization. Using TensorRF to significantly accelerate rendering speed and introducing frequency and occlusion masks, the method improves stability and reconstruction quality under sparse input views. Experiments on the Synthesis NeRF benchmark show that Few TensoRF method improves the average PSNR from 21.45 dB (TensorRF) to 23.70 dB, with the fine tuned version reaching 24.52 dB, while maintaining TensorRF's fast \(\approx10-15\) minute training time. Experiments on the THuman 2.0 dataset further demonstrate competitive performance in human body reconstruction, achieving 27.37 - 34.00 dB with only eight input images. These results highlight Few TensoRF as an efficient and data effective solution for real-time 3D reconstruction across diverse scenes.[93] GDPO-Listener: Expressive Interactive Head Generation via Auto-Regressive Flow Matching and Group reward-Decoupled Policy Optimization
Zhangyu Jin,Maksim Siniukov,Deuksin Kwon,Ashutosh Chaubey,Mohammad Soleymani
Main category: cs.CV
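TL;DR: This paper proposes GDPO-Listener, a framework that uses auto-regressive flow matching, Group reward-Decoupled Policy Optimization (GDPO), and semantic text control to markedly improve the expressiveness, dynamics, and controllability of speaker and listener 3D head motion in dyadic interaction.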
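The sketch below illustrates why normalizing rewards separately per parameter group can matter, which is the idea this entry attributes to GDPO. The group names, reward values, and the "sum of per-group normalized rewards" rule are assumptions chosen for illustration; the entry does not specify the actual grouping of FLAME parameters or the exact objective.

```python
from statistics import mean, pstdev

def normalize(values, eps=1e-8):
    """Zero-mean, unit-variance normalization across rollouts."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]

def decoupled_advantages(rewards_by_group):
    """Normalize each group's rewards on its own scale, then sum per rollout."""
    per_group = [normalize(r) for r in rewards_by_group.values()]
    return [sum(col) for col in zip(*per_group)]

def coupled_advantages(rewards_by_group):
    """Baseline for comparison: sum raw rewards per rollout, then normalize once."""
    totals = [sum(col) for col in zip(*rewards_by_group.values())]
    return normalize(totals)

# Hypothetical rewards for three rollouts; the two groups live on very different scales.
rewards = {
    "expression": [0.01, 0.02, 0.03],
    "head_pose": [10.0, 30.0, 20.0],
}
decoupled = decoupled_advantages(rewards)
coupled = coupled_advantages(rewards)
print(decoupled, coupled)
```

With a single shared normalization, the third rollout's best-in-group expression reward is drowned out by the larger head-pose scale; with per-group normalization it still earns a clearly positive advantage.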
Details
Motivation: 现有方法在生成倾听者头部运动时存在‘回归均值’问题(导致静态面孔)且缺乏复杂非言语动作的参数空间。 Method: 提出GDPO-Listener框架:1)采用自回归流匹配实现稳定监督学习;2)引入分组奖励解耦策略优化(GDPO),对FLAME不同参数组独立归一化奖励,鼓励高方差表达性生成;3)支持显式语义文本控制以定制响应。 Result: 在Seamless Interaction和DualTalk数据集上,该方法在长期运动方差、视觉表现力和语义可控性方面均优于现有基线。 Conclusion: GDPO-Listener有效解决了倾听者运动呆滞与表达受限问题,为虚拟人双人交互提供了更自然、丰富、可控的3D头部运动合成方案。 Abstract: Generating realistic 3D head motion for dyadic interactions is a significant challenge in virtual human synthesis. While recent methods achieve impressive results with speaking heads, they frequently suffer from the `Regression-to-the-Mean' problem in listener motions, collapsing into static faces, and lack the parameter space for complex nonverbal motions. In this paper, we propose GDPO-Listener, a novel framework that achieves highly expressive speaking and listening motion generation. First, we introduce an Auto-Regressive Flow Matching architecture enabling stable supervised learning. Second, to overcome kinematic stillness, we apply the Group reward-Decoupled Policy Optimization (GDPO). By isolating reward normalization across distinct FLAME parameter groups, GDPO explicitly incentivizes high variance expressive generations. Finally, we enable explicit semantic text control for customizable responses. Extensive evaluations across the Seamless Interaction and DualTalk datasets demonstrate superior performance compared to existing baselines on long-term kinematic variance, visual expressivity and semantic controllability.[94] VideoTIR: Accurate Understanding for Long Videos with Efficient Tool-Integrated Reasoning
Zhe Gao,Shiyu Shen,Taifeng Chai,Weinong Wang,Haotian Xu,Xing W,Wenbin Li,Qi Fan,Yang Gao,Dacheng Tao
Main category: cs.CV
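TL;DR: This paper proposes VideoTIR, which uses reinforcement learning (RL) to optimize tool calling by multimodal large language models in long video understanding, mitigating hallucinations and improving accuracy and efficiency.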
Details
Motivation: 现有MLLMs在长视频理解中因文本与视觉token不平衡而易产生幻觉;虽有基于SFT的工具调用方法,但依赖大量高质量标注数据且调用路径受限。 Method: 提出VideoTIR框架,结合Zero-RL和SFT冷启动,并设计Toolkit Action Grouped Policy Optimization(TAGPO)实现分步奖励与失败轨迹复用;构建沙盒式轨迹合成框架生成高质量训练轨迹。 Result: 在三个长视频问答基准上实验表明,VideoTIR显著提升理解准确性与工具调用效率,减少冗余调用。 Conclusion: VideoTIR通过RL驱动的多级工具调用与高效策略优化,为长视频理解提供了更鲁棒、可扩展的解决方案。 Abstract: Existing Multimodal Large Language Models (MLLMs) often suffer from hallucinations in long video understanding (LVU), primarily due to the imbalance between textual and visual tokens. Observing that MLLMs handle short visual inputs well, recent LVU works alleviate hallucinations by automatically parsing the vast visual data into manageable segments that can be effectively processed by MLLMs. SFT-based tool-calling methods can serve this purpose, but they typically require vast amounts of fine-grained, high-quality data and suffer from constrained tool-calling trajectories. We propose a novel VideoTIR that leverages Reinforcement Learning (RL) to encourage proper usage of comprehensive multi-level toolkits for efficient long video understanding. VideoTIR explores both Zero-RL and SFT cold-starting to enable MLLMs to retrieve and focus on meaningful video segments/images/regions, enhancing long video understanding both accurately and efficiently. To reduce redundant tool-calling, we propose Toolkit Action Grouped Policy Optimization (TAGPO), which enhances the efficiency of the calling process through stepwise reward assignment and reuse of failed rollouts. Additionally, we develop a sandbox-based trajectory synthesis framework to generate high-quality trajectories data. Extensive experiments on three long-video QA benchmarks demonstrate the effectiveness and efficiency of our method.[95] CARE: Training-Free Controllable Restoration for Medical Images via Dual-Latent Steering
Xu Liu
Main category: cs.CV
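TL;DR: This paper proposes CARE, a training-free controllable medical image restoration framework that balances structure preservation against prior-guided enhancement with a dual-latent branch strategy and uses a risk-aware adaptive controller to adjust the two branches' contributions dynamically, enabling conservative or enhancement-focused restoration modes without retraining the model.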
Details
Motivation: Existing medical image restoration methods depend on task-specific retraining and offer little control over the trade-off between faithful reconstruction and prior-driven enhancement; overly aggressive restoration can introduce hallucinated details or alter diagnostically critical structures, which limits clinical safety. Method: CARE uses a dual-latent restoration strategy (one branch enforces data fidelity and anatomical consistency, the other uses a generative prior to recover degraded information) together with a risk-aware adaptive controller that adjusts the two branches' weights according to restoration uncertainty and local structural reliability. Result: On noisy and incomplete medical imaging scenarios, CARE achieves high-quality restoration, better preserves clinically relevant structures, and markedly reduces the risk of implausible reconstructions. Conclusion: CARE offers a practical path toward safer, more controllable, plug-and-play medical image restoration. Abstract: Medical image restoration is essential for improving the usability of noisy, incomplete, and artifact-corrupted clinical scans, yet existing methods often rely on task-specific retraining and offer limited control over the trade-off between faithful reconstruction and prior-driven enhancement. This lack of controllability is especially problematic in clinical settings, where overly aggressive restoration may introduce hallucinated details or alter diagnostically important structures. In this work, we propose CARE, a training-free controllable restoration framework for real-world medical images that explicitly balances structure preservation and prior-guided refinement during inference. CARE uses a dual-latent restoration strategy, in which one branch enforces data fidelity and anatomical consistency while the other leverages a generative prior to recover missing or degraded information. A risk-aware adaptive controller dynamically adjusts the contribution of each branch based on restoration uncertainty and local structural reliability, enabling conservative or enhancement-focused restoration modes without additional model training. We evaluate CARE on noisy and incomplete medical imaging scenarios and show that it achieves strong restoration quality while better preserving clinically relevant structures and reducing the risk of implausible reconstructions. The proposed approach offers a practical step toward safer, more controllable, and more deployment-ready medical image restoration.
[96] GeoNDC: A Queryable Neural Data Cube for Planetary-Scale Earth Observation
Jianbo Qi,Mengyao Li,Baogui Jiang,Yidan Chen,Qiao Wang
Main category: cs.CV
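TL;DR: This paper proposes GeoNDC, a queryable neural data cube for Earth observation based on implicit neural fields that provides efficient compression, continuous spatiotemporal reconstruction, and on-demand querying of massive remote sensing archives.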
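As a quick sanity check on the figures this entry reports, the sketch below derives the baseline size implied by a 0.44 GB representation at roughly 95:1, and spells out the two fidelity metrics the abstract quotes (RMSE and R²). It assumes the ratio means baseline size divided by compressed size; the sample arrays are invented and are not data from the paper.

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-square error between reference and reconstructed values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Reported figures: 0.44 GB compressed at roughly 95:1 against an Int16 baseline.
compressed_gb = 0.44
ratio = 95
implied_baseline_gb = compressed_gb * ratio  # about 41.8 GB under the stated assumption

# Invented reflectance samples, only to exercise the metric definitions.
reference = [0.10, 0.20, 0.30, 0.40]
reconstructed = [0.11, 0.19, 0.31, 0.39]
print(round(implied_baseline_gb, 1), rmse(reference, reconstructed), r_squared(reference, reconstructed))
```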
Details
Motivation: Satellite remote sensing data are currently stored as discrete raster files, which makes storage, transmission, and querying costly and hinders large-scale analysis of environmental dynamics. Method: GeoNDC is a queryable neural data cube that encodes global-scale Earth observation data as a continuous spatiotemporal implicit neural field, supporting on-demand queries and continuous-time reconstruction without full decompression. Result: Validated on MODIS, Sentinel-2, and HiGLASS data, GeoNDC reaches a compression ratio of about 95:1 (20 years of MODIS in 0.44 GB), R² values from above 0.85 to above 0.98, and an RMSE of 0.021, and supports queries on consumer hardware as well as high-fidelity reconstruction under cloud occlusion. Conclusion: GeoNDC offers a unified, AI-native representation for planetary-scale remote sensing data that integrates query, reconstruction, and compression, serving as a compact, analysis-ready layer that complements raw archives. Abstract: Satellite Earth observation has accumulated massive spatiotemporal archives essential for monitoring environmental change, yet these remain organized as discrete raster files, making them costly to store, transmit, and query. We present GeoNDC, a queryable neural data cube that encodes planetary-scale Earth observation data as a continuous spatiotemporal implicit neural field, enabling on-demand queries and continuous-time reconstruction without full decompression. Experiments on a 20-year global MODIS MCD43A4 reflectance record (7 bands, 5 km, 8-day sampling) show that the learned representation supports direct spatiotemporal queries on consumer hardware. On Sentinel-2 imagery (10 m), continuous temporal parameterization recovers cloud-free dynamics with high fidelity ($R^2 > 0.85$) under simulated 2-km cloud occlusion. On HiGLASS biophysical products (LAI and FPAR), GeoNDC attains near-perfect accuracy ($R^2 > 0.98$). The representation compresses the 20-year MODIS archive to 0.44 GB (approximately 95:1 relative to an optimized Int16 baseline) with high spectral fidelity (mean $R^2 > 0.98$, mean RMSE $= 0.021$). These results suggest GeoNDC offers a unified AI-native representation for planetary-scale Earth observation, complementing raw archives with a compact, analysis-ready data layer integrating query, reconstruction, and compression in a single framework.
[97] MoRGS: Efficient Per-Gaussian Motion Reasoning for Streamable Dynamic 3D Scenes
Wonjoon Lee,Sungmin Woo,Donghyeong Kim,Jungho Lee,Sangheon Park,Sangyoun Lee
Main category: cs.CV
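TL;DR: This paper proposes MoRGS, a framework that introduces optical flow as a lightweight motion cue and designs a per-Gaussian motion offset field and a motion confidence mechanism to achieve efficient, high-quality online 4D dynamic scene reconstruction.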
Details
Motivation: 现有在线动态场景重建方法缺乏对每个高斯椭球体(Gaussian)真实3D运动的建模能力,仅依赖像素级光度损失易导致运动伪影,无法区分动态与静态区域,影响时间一致性和大运动建模效果。 Method: 提出MoRGS:1)利用稀疏关键视图上的光流作为轻量运动监督;2)学习每个高斯的运动偏移场,对齐投影3D运动与观测光流;3)引入每个高斯的运动置信度,加权属性更新并抑制静态区域冗余运动。 Result: 在多个动态数据集上,MoRGS在重建质量与运动保真度方面达到在线方法SOTA,同时保持可流式处理的实时性能。 Conclusion: 显式建模每个高斯的运动并结合稀疏光流监督与置信度加权机制,能显著提升在线4D重建的真实性、一致性与效率,为动态场景实时理解提供新范式。 Abstract: Online reconstruction of dynamic scenes aims to learn from streaming multi-view inputs under low-latency constraints. The fast training and real-time rendering capabilities of 3D Gaussian Splatting have made on-the-fly reconstruction practically feasible, enabling online 4D reconstruction. However, existing online approaches, despite their efficiency and visual quality, fail to learn per-Gaussian motion that reflects true scene dynamics. Without explicit motion cues, appearance and motion are optimized solely under photometric loss, causing per-Gaussian motion to chase pixel residuals rather than true 3D motion. To address this, we propose MoRGS, an efficient online per-Gaussian motion reasoning framework that explicitly models per-Gaussian motion to improve 4D reconstruction quality. Specifically, we leverage optical flow on a sparse set of key views as lightweight motion cues that regularize per-Gaussian motion beyond photometric supervision. To compensate for the sparsity of flow supervision, we learn a per-Gaussian motion offset field that reconciles discrepancies between projected 3D motion and observed flow across views and time. In addition, we introduce a per-Gaussian motion confidence that separates dynamic from static Gaussians and weights Gaussian attribute residual updates, thereby suppressing redundant motion in static regions for better temporal consistency and accelerating the modeling of large motions. 
Extensive experiments demonstrate that MoRGS achieves state-of-the-art reconstruction quality and motion fidelity among online methods, while maintaining streamable performance.
[98] GaussFusion: Improving 3D Reconstruction in the Wild with A Geometry-Informed Video Generator
Liyuan Zhu,Manjunath Narayana,Michal Stary,Will Hutchcroft,Gordon Wetzstein,Iro Armeni
Main category: cs.CV
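TL;DR: GaussFusion is a geometry-informed video generation method that improves 3D Gaussian splatting (3DGS) reconstruction quality in the wild. It mitigates floaters, flickering, and blur, supports multiple reconstruction paradigms, and achieves state-of-the-art performance on novel view synthesis.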
Details
Motivation: Address common 3DGS artifacts in real scenes, such as floaters, flickering, and blur, caused by camera pose errors, incomplete coverage, and noisy geometry initialization. Method: A geometry-informed video-to-video generator takes as input rendered Gaussian primitive buffers encoding depth, normals, opacity, and covariance. It is trained on diverse synthetically generated degradations and works with both optimization-based and feed-forward 3DGS reconstruction pipelines. Result: State-of-the-art results on novel view synthesis benchmarks; a lightweight variant runs in real time at 21 FPS, making it suitable for interactive 3D applications. Conclusion: GaussFusion is a general, efficient, and robust post-processing framework that significantly improves the reconstruction quality and temporal consistency of 3DGS in complex real-world scenes. Abstract: We present GaussFusion, a novel approach for improving 3D Gaussian splatting (3DGS) reconstructions in the wild through geometry-informed video generation. GaussFusion mitigates common 3DGS artifacts, including floaters, flickering, and blur caused by camera pose errors, incomplete coverage, and noisy geometry initialization. Unlike prior RGB-based approaches limited to a single reconstruction pipeline, our method introduces a geometry-informed video-to-video generator that refines 3DGS renderings across both optimization-based and feed-forward methods. Given an existing reconstruction, we render a Gaussian primitive video buffer encoding depth, normals, opacity, and covariance, which the generator refines to produce temporally coherent, artifact-free frames. We further introduce an artifact synthesis pipeline that simulates diverse degradation patterns, ensuring robustness and generalization. GaussFusion achieves state-of-the-art performance on novel-view synthesis benchmarks, and an efficient variant runs in real time at 21 FPS while maintaining similar performance, enabling interactive 3D applications.
[99] Synergistic Event-SVE Imaging for Quantitative Propellant Combustion Diagnostics
Jing Tao,Taihang Lei,Banglei Guan,Ying Qu,Xudong Na,Likun Ma,Yang Shang,Qifeng Yu
Main category: cs.CV
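TL;DR: This paper presents a closed-loop Event-SVE measurement system that couples a spatially variant exposure (SVE) camera with a stereo pair of neuromorphic event cameras for real-time 3D monitoring of high-energy propellant combustion under extreme conditions such as heavy smoke, high dynamic range, and microsecond-scale motion.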
Details
Motivation: Real-time monitoring of high-energy propellant combustion is difficult: extreme high dynamic range (HDR), microsecond-scale particle motion, and heavy smoke cause saturation, motion blur, and unstable particle extraction in conventional imaging. Method: A closed-loop Event-SVE system is built. The SVE camera produces HDR images with a smoke-aware fusion strategy; a multi-cue smoke-likelihood map separates particle emission from smoke scattering to yield calibrated intensity maps; the HDR maps serve as the absolute-intensity reference for the event cameras and suppress smoke-induced event artifacts; finally, a stereo event-based 3D pipeline performs feature extraction and triangulation on the cleaned event data to estimate particle separation height and equivalent particle size. Result: Experiments on boron-based propellants yield multimodal equivalent-radius statistics and capture fast separation transients that conventional sensors struggle to observe. The maximum calibration error of the 3D measurement is 0.56%, and the system achieves microsecond-resolved 3D combustion measurement under smoke-obscured HDR conditions. Conclusion: The framework offers a practical, calibration-consistent route to microsecond-resolved 3D combustion diagnostics under smoke and high dynamic range. Abstract: Real-time monitoring of high-energy propellant combustion is difficult. Extreme high dynamic range (HDR), microsecond-scale particle motion, and heavy smoke often occur together. These conditions drive saturation, motion blur, and unstable particle extraction in conventional imaging. We present a closed-loop Event-SVE measurement system that couples a spatially variant exposure (SVE) camera with a stereo pair of neuromorphic event cameras. The SVE branch produces HDR maps with an explicit smoke-aware fusion strategy. A multi-cue smoke-likelihood map is used to separate particle emission from smoke scattering, yielding calibrated intensity maps for downstream analysis. The resulting HDR maps also provide the absolute-intensity reference missing in event cameras. This reference is used to suppress smoke-driven event artifacts and to improve particle-state discrimination. Based on the cleaned event observations, a stereo event-based 3D pipeline estimates separation height and equivalent particle size through feature extraction and triangulation (maximum calibration error 0.56%). Experiments on boron-based propellants show multimodal equivalent-radius statistics. The system also captures fast separation transients that are difficult to observe with conventional sensors. Overall, the proposed framework provides a practical, calibration-consistent route to microsecond-resolved 3D combustion measurement under smoke-obscured HDR conditions.
[100] Learning Explicit Continuous Motion Representation for Dynamic Gaussian Splatting from Monocular Videos
Xuankai Zhang,Junjin Xiao,Shangwei Huang,Wei-shi Zheng,Qing Zhang
Main category: cs.CV
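TL;DR: This paper proposes a method for high-quality dynamic Gaussian splatting from monocular videos. It explicitly models the continuous position and orientation deformation of Gaussians with SE(3) B-spline motion bases, and adds an adaptive control mechanism, a soft segment reconstruction strategy, and a multi-view diffusion model to strengthen motion modeling, suppress long-interval motion interference, and improve generalization.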
Details
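Motivation: Existing dynamic Gaussian splatting methods struggle to model complex continuous motion precisely and are prone to long-interval motion interference and overfitting; high-quality novel view synthesis from monocular video is needed. Method: SE(3) B-spline motion bases explicitly model the continuous position and orientation deformation of Gaussians; an adaptive control mechanism dynamically adjusts the number of motion bases and control points; a soft segment reconstruction strategy mitigates long-interval motion interference; a multi-view diffusion model supplies additional multi-view priors to avoid overfitting. Result: The method outperforms current state-of-the-art methods on novel view synthesis, giving higher-quality dynamic scene reconstruction and rendering. Conclusion: By combining geometric motion modeling, adaptive computation, and multi-view priors, the method improves both quality and efficiency in monocular dynamic scene modeling. Abstract: We present an approach for high-quality dynamic Gaussian Splatting from monocular videos. To this end, we go one step beyond previous methods and explicitly model the continuous position and orientation deformation of dynamic Gaussians, using SE(3) B-spline motion bases with a compact set of control points. To improve computational efficiency while enhancing the ability to model complex motions, an adaptive control mechanism is devised to dynamically adjust the number of motion bases and control points. In addition, we develop a soft segment reconstruction strategy to mitigate long-interval motion interference, and employ a multi-view diffusion model to provide multi-view cues for avoiding overfitting to training views. Extensive experiments demonstrate that our method outperforms state-of-the-art methods in novel view synthesis. Our code is available at https://github.com/hhhddddddd/se3bsplinegs.
[101] GIFT: Global Irreplaceability Frame Targeting for Efficient Video Understanding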
Junpeng Ma,Sashuai Zhou,Guanghao Li,Xin Gao,Yue Cao,Hengyu Zeng,Yuxiang Yan,Zhibin Wang,Jun Song,Bo Zheng,Shanghang Zhang,Jian Pu
Main category: cs.CV
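TL;DR: This paper proposes GIFT, a framework that selects keyframes by assessing each video frame's intrinsic irreplaceability, in order to reduce the computational cost of video large language models.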
Details
Motivation: Existing keyframe selection methods make greedy decisions and evaluate relevance and diversity separately, so they tend to fall into local optima and select irrelevant noise frames. Method: GIFT introduces Directed Diversity to quantify a frame's uniqueness conditioned on relevance, yielding a unified irreplaceability score, and a Budget-Aware Refinement strategy that adaptively and iteratively first selects highly irreplaceable frames and then expands the temporal context around them. Result: On LLaVA-Video-7B, GIFT improves on uniform sampling by up to 12.5% on average across long-form video benchmarks. Conclusion: GIFT is a training-free keyframe selection method that balances relevance and diversity more effectively and significantly improves video understanding performance. Abstract: Video Large Language Models (VLMs) have achieved remarkable success in video understanding, but the significant computational cost from processing dense frames severely limits their practical application. Existing methods alleviate this by selecting keyframes, but their greedy decision-making, combined with a decoupled evaluation of relevance and diversity, often falls into local optima and results in erroneously selecting irrelevant noise frames. To address these challenges, we propose GIFT: Global Irreplaceability Frame Targeting, a novel training-free framework that selects frames by assessing their intrinsic irreplaceability. Specifically, we first introduce Directed Diversity to quantify a frame's uniqueness conditioned on relevance, which allows us to formulate a unified irreplaceability score. Subsequently, our Budget-Aware Refinement strategy employs an adaptive iterative process that first secures a core set of frames with the highest irreplaceability, and then shifts its priority to building crucial temporal context around these selections as the budget expands. Extensive experiments demonstrate that GIFT achieves a maximum average improvement of 12.5% across long-form video benchmarks on LLaVA-Video-7B compared to uniform sampling.
[102] Z-Erase: Enabling Concept Erasure in Single-Stream Diffusion Transformers
Nanxiang Jiang,Zhaoxin Fan,Baisen Wang,Daiheng Gao,Junhang Cheng,Jifeng Guo,Yalan Qin,Yeying Jin,Hongwei Zheng,Faguo Wu,Wenjun Wu
Main category: cs.CV
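TL;DR: This paper proposes Z-Erase, the first concept erasure method designed for single-stream text-to-image diffusion transformers (e.g., Z-Image). Using a decoupled update mechanism and a Lagrangian-guided adaptive modulation algorithm, it balances erasure of sensitive concepts against image fidelity while avoiding generation collapse, and its convergence is proven theoretically.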
Details
Motivation: Existing concept erasure methods tend to cause generation collapse on single-stream diffusion transformers (e.g., Z-Image), and the task remains under-explored in this emerging paradigm. Method: A Stream Disentangled Concept Erasure Framework decouples updates; within it, a Lagrangian-Guided Adaptive Erasure Modulation algorithm is designed; a convergence analysis proves convergence to a Pareto stationary point. Result: Z-Erase mitigates generation collapse and achieves state-of-the-art performance across a range of tasks. Conclusion: Z-Erase is the first concept erasure method adapted to single-stream T2I models. It balances erasure effectiveness with generation stability and has both theoretical guarantees and empirical advantages. Abstract: Concept erasure serves as a vital safety mechanism for removing unwanted concepts from text-to-image (T2I) models. While extensively studied in U-Net and dual-stream architectures (e.g., Flux), this task remains under-explored in the recently emerging paradigm of single-stream diffusion transformers (e.g., Z-Image). In this new paradigm, text and image tokens are processed as a single unified sequence via shared parameters. Consequently, directly applying prior erasure methods typically leads to generation collapse. To bridge this gap, we introduce Z-Erase, the first concept erasure method tailored for single-stream T2I models. To guarantee stable image generation, Z-Erase first proposes a Stream Disentangled Concept Erasure Framework that decouples updates and enables existing methods to work on single-stream models. Subsequently, within this framework, we introduce Lagrangian-Guided Adaptive Erasure Modulation, a constrained algorithm that further balances the sensitive erasure-preservation trade-off. Moreover, we provide a rigorous convergence analysis proving that Z-Erase can converge to a Pareto stationary point. Experiments demonstrate that Z-Erase successfully overcomes the generation collapse issue, achieving state-of-the-art performance across a wide range of tasks.
[103] Bridging Perception and Reasoning: Token Reweighting for RLVR in Multimodal LLMs
Jinda Lu,Junkang Wu,Jinghan Li,Kexin Huang,Shuo Yang,Guoyin Wang,Jiancan Wu,Xiang Wang,Xiangnan He
Main category: cs.CV
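TL;DR: This paper proposes Token-Reweighting (ToR), a strategy for extending Reinforcement Learning with Verifiable Rewards (RLVR) to multimodal large language models (MLLMs). It models the coupling between perception-related and reasoning-related tokens at the token level and dynamically reweights critical tokens, improving visual grounding and symbolic reasoning together.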
Details
Motivation: In MLLM responses, perception-related tokens (visual grounding) and reasoning-related tokens (building reasoning chains) are inherently interleaved and interdependent. Optimizing either type alone works poorly, so their coupling must be modeled jointly. Method: A token-level empirical analysis reveals the strong coupling between the two token types, and a plug-and-play Token-Reweighting (ToR) strategy is proposed: it identifies critical perception and reasoning tokens and dynamically reweights them, and it is compatible with existing RLVR methods such as GRPO and DAPO. Result: ToR consistently improves performance on multiple multimodal reasoning benchmarks, improving both visual grounding accuracy and reasoning coherence and reaching state-of-the-art results. Conclusion: The intrinsic coupling of perception and reasoning tokens is a key bottleneck in RLVR training for MLLMs, and ToR, which models this coupling explicitly, is an effective and general improvement. Abstract: Extending Reinforcement Learning with Verifiable Rewards (RLVR) to multimodal large language models (MLLMs) faces a fundamental challenge: their responses inherently interleave perception-related tokens, which ground visual content, with reasoning-related tokens, which construct reasoning chains. These token types instantiate distinct yet interdependent capacities (visual grounding and symbolic reasoning), making isolated optimization insufficient. Through token-level empirical analysis, we demonstrate that optimizing either perception- or reasoning-only tokens consistently underperforms full optimization, underscoring their inherent coupling. To address this, we propose a plug-and-play Token-Reweighting (ToR) strategy that explicitly models this interdependence by identifying critical tokens of both types and dynamically reweighting them during RLVR training. Applied on top of existing methods (e.g., GRPO and DAPO), ToR delivers consistent performance gains across multiple multi-modal reasoning benchmarks, achieving state-of-the-art performance with both accurate visual grounding and coherent reasoning.
[104] Learning domain-invariant features through channel-level sparsification for Out-Of Distribution Generalization
Haoran Pei,Yuguang Yang,Kexin Liu,Juan Zhang,Baochang Zhang
Main category: cs.CV
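TL;DR: This paper proposes Hierarchical Causal Dropout (HCD), which performs causal intervention at the representation level through channel-level causal masks and a Matrix-based Mutual Information (MMI) objective, separating causal features from spurious ones to improve out-of-distribution (OOD) generalization.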
Details
Motivation: Deep learning models tend to rely on non-causal, domain-specific context (shortcut features), which makes cross-domain performance unstable. Existing approaches such as invariance learning struggle to disentangle highly mixed features in deep latent spaces. Method: Hierarchical Causal Dropout (HCD): (1) channel-level causal masks sparsify features and perform a representation-level causal intervention; (2) a Matrix-based Mutual Information (MMI) objective minimizes the mutual information between latent features and domain labels while maximizing it with class labels; (3) a StyleMix-driven VICReg module keeps training stable and prevents essential causal features from being filtered out. Result: HCD outperforms current top-tier methods on multiple OOD benchmarks. Conclusion: Through explicit causal intervention and an information-bottleneck constraint, HCD alleviates shortcut learning and improves OOD generalization, offering a new direction for causal representation learning. Abstract: Out-of-Distribution (OOD) generalization has become a primary metric for evaluating image analysis systems. Since deep learning models tend to capture domain-specific context, they often develop shortcut dependencies on these non-causal features, leading to inconsistent performance across different data sources. Current techniques, such as invariance learning, attempt to mitigate this. However, they struggle to isolate highly mixed features within deep latent spaces. This limitation prevents them from fully resolving the shortcut learning problem. In this paper, we propose Hierarchical Causal Dropout (HCD), a method that uses channel-level causal masks to enforce feature sparsity. This approach allows the model to separate causal features from spurious ones, effectively performing a causal intervention at the representation level. The training is guided by a Matrix-based Mutual Information (MMI) objective to minimize the mutual information between latent features and domain labels, while simultaneously maximizing the information shared with class labels. To ensure stability, we incorporate a StyleMix-driven VICReg module, which prevents the masks from accidentally filtering out essential causal data. Experimental results on OOD benchmarks show that HCD performs better than existing top-tier methods.
[105] Visual Attention Drifts, but Anchors Hold: Mitigating Hallucination in Multimodal Large Language Models via Cross-Layer Visual Anchors
Chengxu Yang,Jingling Yuan,Chuang Hu,Jiawei Jiang
Main category: cs.CV
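TL;DR: This paper proposes CLVA (Cross-Layer Visual Anchors), which mitigates object hallucination in multimodal large language models by reinforcing mid-layer visual anchors and suppressing the regression of deep-layer attention toward early visual noise. It requires no additional training and is efficient.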
Details
Motivation: Existing methods offer limited interpretability of attention drift, especially in the deep stages of the model. The authors find that hallucination stems from deep-layer attention regressing toward visual noise from early layers, and that output reliability depends on visual anchors at intermediate rather than final layers. Method: CLVA is a training-free method that uses critical mid-layer visual anchors captured from attention dynamics to reinforce mid-layer features and suppress regressive noise, pulling deep-layer attention back to the correct visual regions. Result: CLVA is validated across multiple architectures and benchmarks; it significantly reduces object hallucination without a noticeable increase in computation time or GPU memory. Conclusion: Mid-layer visual anchors are crucial for suppressing hallucination. CLVA provides an efficient, plug-and-play post-hoc mechanism that improves the robustness of vision-language alignment in multimodal large language models. Abstract: Multimodal Large Language Models often suffer from object hallucination. While existing research utilizes attention enhancement and visual retracing, we find these works lack sufficient interpretability regarding attention drift in final model stages. In this paper, we investigate the layer-wise evolution of visual features and discover that hallucination stems from deep-layer attention regressing toward initial visual noise from early layers. We observe that output reliability depends on acquiring visual anchors at intermediate layers rather than final layers. Based on these insights, we propose CLVA, which stands for Cross-Layer Visual Anchors, a training-free method that reinforces critical mid-layer features while suppressing regressive noise. This approach effectively pulls deep-layer attention back to correct visual regions by utilizing essential anchors captured from attention dynamics. We evaluate our method across diverse architectures and benchmarks, demonstrating outstanding performance without a significant increase in computational time or GPU memory.
[106] THEMIS: Towards Holistic Evaluation of MLLMs for Scientific Paper Fraud Forensics
Tzu-Yen Ma,Bo Zhang,Zichen Tang,Junpeng Ding,Haolin Tian,Yuanze Li,Zhuodi Hao,Zixin Ding,Zirui Wang,Xinyu Yu,Shiyao Peng,Yizhuo Zhao,Ruomeng Jiang,Yiling Huang,Peizhi Zhao,Jiayuan Chen,Weisheng Tan,Haocheng Gao,Yang Liu,Jiacheng Liu,Zhongjun Yang,Jiayu Huang,Haihong E
Main category: cs.CV
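TL;DR: THEMIS is a multi-task benchmark for visual academic-fraud reasoning, with over 4,000 real and synthetic samples, five fraud types, and 16 fine-grained manipulation operations. It covers complex-texture images and multi-dimensional capability evaluation, and the strongest current MLLM (GPT-5) reaches only 56.15% accuracy.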
Details
Motivation: Existing MLLM benchmarks lack coverage of real academic fraud scenarios, complex image textures, and fine-grained manipulation operations, so they cannot effectively assess fraud reasoning in realistic settings. Method: THEMIS, a new multi-task benchmark, spans seven scenarios derived from real retracted-paper cases, with over 4,000 questions, five fraud types, and 16 fine-grained image manipulation operations, and it maps fraud types to five core visual reasoning capabilities. Result: Experiments on 16 mainstream MLLMs show that the best model, GPT-5, reaches an overall accuracy of only 56.15%, well below its performance on conventional understanding tasks, which confirms the benchmark's difficulty. Conclusion: THEMIS fills a gap in evaluating visual fraud reasoning on real academic material and provides a rigorous, systematic, and extensible framework for advancing MLLMs on difficult, high-fidelity fraud detection tasks. Abstract: We present THEMIS, a novel multi-task benchmark designed to comprehensively evaluate multimodal large language models (MLLMs) on visual fraud reasoning within real-world academic scenarios. Compared to existing benchmarks, THEMIS introduces three major advances. (1) Real-World Scenarios and Complexity: Our benchmark comprises over 4,000 questions spanning seven scenarios, derived from authentic retracted-paper cases and carefully curated multimodal synthetic data. With 60.47% complex-texture images, THEMIS bridges the critical gap between existing benchmarks and the complexity of real-world academic fraud. (2) Fraud-Type Diversity and Granularity: THEMIS systematically covers five challenging fraud types and introduces 16 fine-grained manipulation operations. On average, each sample undergoes multiple stacked manipulation operations, with the diversity and difficulty of these manipulations demanding a high level of visual fraud reasoning from the models. (3) Multi-Dimensional Capability Evaluation: We establish a mapping from fraud types to five core visual fraud reasoning capabilities, thereby enabling an evaluation that reveals the distinct strengths and specific weaknesses of different models across these core capabilities. Experiments on 16 leading MLLMs show that even the best-performing model, GPT-5, achieves an overall performance of only 56.15%, demonstrating that our benchmark presents a stringent test. We expect THEMIS to advance the development of MLLMs for complex, real-world fraud reasoning tasks.
[107] Pixelis: Reasoning in Pixels, from Seeing to Acting
Yunpeng Zhou
Main category: cs.CV
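TL;DR: This paper proposes Pixelis, a vision-language agent that operates directly in pixel space and learns through executable pixel-level operations (zoom, segment, track, and so on). It is trained in three phases (supervised fine-tuning, curiosity-coherence reward fine-tuning, and pixel test-time reinforcement learning) and improves performance on multiple image and video benchmarks.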
Details
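Motivation: Most existing vision-language systems are static observers that cannot act or interact with the physical world, and they cannot safely improve under distribution shift. An action-grounded learning paradigm is needed for embodied, generalizable visual intelligence. Method: Pixelis trains in three phases: (1) supervised fine-tuning on Chain-of-Thought-Action traces with a masked imitation loss that upweights operation/argument tokens, plus auxiliary heads; (2) curiosity-coherence reward fine-tuning, which jointly optimizes prediction-error curiosity, adjacent-step coherence, and an efficiency prior under a KL anchor; (3) pixel test-time RL, a label-free adaptation that retrieves neighbors, votes over complete trajectories, updates toward short, high-fidelity exemplars, and controls drift with a KL-to-EMA constraint. Result: Across six public image and video benchmarks, Pixelis achieves an average relative gain of +4.08% (up to +6.03%), produces shorter, auditable toolchains, and keeps KL within its corridor during test-time learning. Conclusion: Pixelis shows that acting in pixel space rather than on abstract tokens better links multimodal perception with visual reasoning in the physical world, supports embodied adaptation without external feedback, and moves vision-language systems from passive description toward active intervention. Abstract: Most vision-language systems are static observers: they describe pixels, do not act, and cannot safely improve under shift. This passivity limits generalizable, physically grounded visual intelligence. Learning through action, not static description, is essential beyond curated data. We present Pixelis, a pixel-space agent that operates directly on images and videos via a compact set of executable operations (zoom/crop, segment, track, OCR, temporal localization) and learns from its consequences. Pixelis trains in three phases: (1) Supervised Fine-Tuning learns a pixel-tool grammar from Chain-of-Thought-Action traces with a masked imitation loss that upweights operation/argument tokens and auxiliary heads to stabilize pixel-grounded arguments; (2) Curiosity-Coherence Reward Fine-Tuning optimizes a dual-drive objective marrying prediction-error curiosity with adjacent-step coherence and a mild efficiency prior under a KL anchor, yielding short, valid, structured toolchains; (3) Pixel Test-Time RL performs label-free adaptation by retrieving neighbors, voting over complete trajectories rather than answers, and updating toward short, high-fidelity exemplars while constraining drift with a KL-to-EMA safety control.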
Across six public image and video benchmarks, Pixelis yields consistent improvements: the average relative gain is +4.08% over the same 8B baseline (peaking at +6.03% on VSI-Bench), computed as (ours-baseline)/baseline, while producing shorter, auditable toolchains and maintaining in-corridor KL during test-time learning. Acting within pixels, rather than abstract tokens, grounds multimodal perception in the physical world, linking visual reasoning with actionable outcomes, and enables embodied adaptation without external feedback.
[108] Label What Matters: Modality-Balanced and Difficulty-Aware Multimodal Active Learning
Yuqiao Zeng,Xu Wang,Tengfei Liang,Yiqing Hao,Yi Jin,Hui Yu
Main category: cs.CV
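TL;DR: This paper proposes RL-MBA, a framework that uses reinforcement learning for modality-balanced, difficulty-aware multimodal active learning. Through adaptive modality weighting and difficulty estimation based on evidential fusion, it improves classification accuracy and modality fairness under limited labeling budgets.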
Details
Motivation: Existing multimodal active learning methods assume that modality importance is stable and keep selection rules fixed. They ignore how the relative value of modalities and the difficulty of samples change during training, so they make poor use of limited labeled data. Method: RL-MBA, a reinforcement-learning framework, models sample selection as a Markov Decision Process. It has two core components: (1) Adaptive Modality Contribution Balancing (AMCB), which dynamically adjusts modality weights via reinforcement feedback; (2) Evidential Fusion for Difficulty-Aware Policy Adjustment (EFDA), which estimates sample difficulty from uncertainty to guide sampling. Result: On Food101, KineticsSound, and VGGSound, RL-MBA significantly outperforms strong baselines under limited labeling budgets, improving both classification accuracy and modality fairness. Conclusion: Dynamically modeling modality contribution and sample difficulty is essential for multimodal active learning, and RL-MBA uses reinforcement learning to select samples for labeling more efficiently and more fairly. Abstract: Multimodal learning integrates complementary information from different modalities such as image, text, and audio to improve model performance, but its success relies on large-scale labeled data, which is costly to obtain. Active learning (AL) mitigates this challenge by selectively annotating informative samples. In multimodal settings, many approaches implicitly assume that modality importance is stable across rounds and keep selection rules fixed at the fusion stage, which leaves them insensitive to the dynamic nature of multimodal learning, where the relative value of modalities and the difficulty of instances shift as training proceeds. To address this issue, we propose RL-MBA, a reinforcement-learning framework for modality-balanced, difficulty-aware multimodal active learning. RL-MBA models sample selection as a Markov Decision Process, where the policy adapts to modality contributions, uncertainty, and diversity, and the reward encourages accuracy gains and balance. Two key components drive this adaptability: (1) Adaptive Modality Contribution Balancing (AMCB), which dynamically adjusts modality weights via reinforcement feedback, and (2) Evidential Fusion for Difficulty-Aware Policy Adjustment (EFDA), which estimates sample difficulty via uncertainty-based evidential fusion to prioritize informative samples.
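The abstract does not spell out how EFDA turns evidence into a difficulty score. One common evidential formulation (subjective logic over a Dirichlet distribution) derives an uncertainty mass from per-class evidence, and the sketch below uses that standard formula as a stand-in. The function names, the weighted averaging across modalities, and the example inputs are assumptions for illustration, not the RL-MBA implementation.

```python
from typing import Dict, List


def evidential_uncertainty(evidence: List[float]) -> float:
    """Uncertainty mass u = K / S, with Dirichlet strength S = sum(e_k + 1)."""
    if not evidence or any(e < 0 for e in evidence):
        raise ValueError("evidence must be a non-empty list of non-negative values")
    num_classes = len(evidence)
    dirichlet_strength = sum(e + 1.0 for e in evidence)
    return num_classes / dirichlet_strength


def sample_difficulty(evidence_by_modality: Dict[str, List[float]],
                      modality_weights: Dict[str, float]) -> float:
    """Hypothetical difficulty score: weighted mean of per-modality uncertainty."""
    total_weight = sum(modality_weights[m] for m in evidence_by_modality)
    if total_weight <= 0:
        raise ValueError("modality weights must sum to a positive value")
    weighted = sum(modality_weights[m] * evidential_uncertainty(ev)
                   for m, ev in evidence_by_modality.items())
    return weighted / total_weight
```

With no evidence for any class the uncertainty is 1, and it shrinks as evidence accumulates, so a sample on which every modality is unsure would score as difficult under this assumed rule.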
Experiments on Food101, KineticsSound, and VGGSound demonstrate that RL-MBA consistently outperforms strong baselines, improving both classification accuracy and modality fairness under limited labeling budgets.
[109] MSRL: Scaling Generative Multimodal Reward Modeling via Multi-Stage Reinforcement Learning
Chenglong Wang,Yifu Huo,Yang Gan,Qiaozhi He,Qi Meng,Bei Li,Yan Wang,Junfu Liu,Tianhua Zhou,Jingbo Zhu,Tong Xiao
Main category: cs.CV
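TL;DR: This paper proposes Multi-Stage Reinforcement Learning (MSRL), which pretrains on large-scale textual preference data, progressively transfers to image-text tasks, and adds cross-modal knowledge distillation. It trains generative multimodal reward models (MRMs) efficiently with only a small amount of multimodal data and significantly improves their performance on visual understanding and generation tasks.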
Details
Motivation: Existing methods based on reinforcement learning from verifiable rewards (RLVR) depend on multimodal preference annotations that are expensive and scarce, which makes multimodal reward model (MRM) training hard to scale. Method: The Multi-Stage Reinforcement Learning (MSRL) framework has three stages: the first learns a general reward reasoning capability from large-scale textual preference data; the second transfers it to image-text tasks through caption-based reinforcement learning; the third performs end-to-end multimodal reinforcement learning. Cross-modal knowledge distillation is added to improve preference generalization. Result: MSRL raises accuracy from 66.6% to 75.9% on VL-RewardBench and from 70.2% to 75.7% on GenAI-Bench, without additional multimodal preference annotations. Conclusion: MSRL offers a scalable, low-annotation paradigm for multimodal reward modeling and eases RLVR's heavy dependence on labeled multimodal data. Abstract: Recent advances in multimodal reward modeling have been largely driven by a paradigm shift from discriminative to generative approaches. Building on this progress, recent studies have further employed reinforcement learning from verifiable rewards (RLVR) to enhance multimodal reward models (MRMs). Despite their success, RLVR-based training typically relies on labeled multimodal preference data, which are costly and labor-intensive to obtain, making it difficult to scale MRM training. To overcome this limitation, we propose a Multi-Stage Reinforcement Learning (MSRL) approach, which can achieve scalable RL for MRMs with limited multimodal data. MSRL replaces the conventional RLVR-based training paradigm by first learning a generalizable reward reasoning capability from large-scale textual preference data, and then progressively transferring this capability to multimodal tasks through caption-based and fully multimodal reinforcement-learning stages. Furthermore, we introduce a cross-modal knowledge distillation approach to improve preference generalization within MSRL. Extensive experiments demonstrate that MSRL effectively scales the RLVR-based training of generative MRMs and substantially improves their performance across both visual understanding and visual generation tasks (e.g., from 66.6% to 75.9% on VL-RewardBench and from 70.2% to 75.7% on GenAI-Bench), without requiring additional multimodal preference annotations. Our code is available at: https://github.com/wangclnlp/MSRL.
[110] MoireMix: A Formula-Based Data Augmentation for Improving Image Classification Robustness
Yuto Matsuo,Yoshihiro Fukuhara,Yuki M. Asano,Rintaro Yanagi,Hirokatsu Kataoka,Akio Nakamura
Main category: cs.CV
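TL;DR: This paper proposes a lightweight, analytic data augmentation method based on Moire interference patterns. It needs no external data or generative model, has very low computational cost, and significantly improves ViT performance on several robustness benchmarks (such as ImageNet-C/R and adversarial examples).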
Details
Motivation: Existing augmentation methods based on diffusion models or feature mixing are computationally expensive or depend on external data; an efficient, self-contained, low-resource alternative is needed. Method: A procedural augmentation based on analytic Moire interference patterns uses a closed-form mathematical formulation to generate multi-scale structured perturbations on the fly. These are mixed into training images and immediately discarded, giving a storage-free pipeline with no external dependencies. Result: With Vision Transformers, the method consistently outperforms standard augmentation and other external-data-free augmentation methods on ImageNet-C, ImageNet-R, and adversarial robustness benchmarks. Conclusion: Analytic interference patterns are a practical and efficient augmentation paradigm that can replace data-driven generative augmentation. Abstract: Data augmentation is a key technique for improving the robustness of image classification models. However, many recent approaches rely on diffusion-based synthesis or complex feature mixing strategies, which introduce substantial computational overhead or require external datasets. In this work, we explore a different direction: procedural augmentation based on analytic interference patterns. Unlike conventional augmentation methods that rely on stochastic noise, feature mixing, or generative models, our approach exploits Moire interference to generate structured perturbations spanning a wide range of spatial frequencies. We propose a lightweight augmentation method that procedurally generates Moire textures on-the-fly using a closed-form mathematical formulation. The patterns are synthesized directly in memory with negligible computational cost (0.0026 seconds per image), mixed with training images during training, and immediately discarded, enabling a storage-free augmentation pipeline without external data. Extensive experiments with Vision Transformers demonstrate that the proposed method consistently improves robustness across multiple benchmarks, including ImageNet-C, ImageNet-R, and adversarial benchmarks, outperforming standard augmentation baselines and existing external-data-free augmentation approaches. These results suggest that analytic interference patterns provide a practical and efficient alternative to data-driven generative augmentation methods.
[111] AnyDoc: Enhancing Document Generation via Large-Scale HTML/CSS Data Synthesis and Height-Aware Reinforcement Optimization
Jiawei Lin,Wanrong Zhu,Vlad I Morariu,Christopher Tensmeyer
Main category: cs.CV
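TL;DR: AnyDoc is a document generation framework for multiple tasks and document categories. It builds DocHTML, a large-scale synthetic dataset, and combines fine-tuning of multimodal large models with height-aware reinforcement learning post-training, significantly improving intention-to-document, document derendering, and element-to-document generation.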
Details
Motivation: Existing human-annotated document datasets are small and cover few categories, so they cannot support diverse, large-scale document generation. Models also suffer from content overflow during generation, which lowers output quality. Method: The AnyDoc framework: (1) a scalable HTML/CSS document synthesis pipeline produces DocHTML, a large-scale dataset with rich metadata (265K samples, 111 categories, 32 styles); (2) multimodal large language models (MLLMs) are fine-tuned on this dataset to support three document generation tasks; (3) a height-aware reinforcement learning (HARL) post-training stage uses the difference between predicted and target height as the reward signal to suppress content overflow. Result: AnyDoc significantly outperforms general-purpose MLLMs and task-specific baselines on all three tasks (intention-to-document, document derendering, element-to-document); HARL mitigates content overflow and improves generation stability and quality. Conclusion: By combining data synthesis, multi-task modeling, and reinforcement learning optimization, AnyDoc offers a scalable paradigm for general, high-quality document generation and makes AI-driven content creation more practical and controllable. Abstract: Document generation has gained growing attention in the field of AI-driven content creation. In this work, we push its boundaries by introducing AnyDoc, a framework capable of handling multiple generation tasks across a wide spectrum of document categories, all represented in a unified HTML/CSS format. To overcome the limited coverage and scale of existing human-crafted document datasets, AnyDoc first establishes a scalable data synthesis pipeline to automatically generate documents in HTML/CSS form. This pipeline yields DocHTML, a large-scale dataset containing 265,206 document samples, while spanning 111 categories and 32 distinct styles. Additionally, all documents are equipped with comprehensive metadata, including design intentions, HTML/CSS source code, visual assets, and rendered screenshots. Building on the curated dataset, AnyDoc fine-tunes multi-modal large language models (MLLMs) to achieve three practical document generation tasks: intention-to-document, document derendering, and element-to-document. To address the content overflow issue observed during fine-tuning, AnyDoc further incorporates a height-aware reinforcement learning (HARL) post-training procedure. By defining a reward function based on the difference between predicted and target document heights, overflow is penalized and gradually mitigated during HARL, thereby enhancing overall performance.
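As a rough illustration of the height-based reward described above, the sketch below penalizes a generated document whose rendered height differs from the target. The function name `height_reward`, the normalization by the target height, and the clipping to the range [-1, 0] are assumptions made for this example; the abstract only states that the reward depends on the difference between predicted and target heights.

```python
def height_reward(predicted_height: float, target_height: float) -> float:
    """Hypothetical reward: 0 when heights match, falling to -1 as they diverge."""
    if target_height <= 0:
        raise ValueError("target_height must be positive")
    # Relative height error; documents that are too tall (overflow) and too short
    # are penalized symmetrically in this sketch.
    relative_error = abs(predicted_height - target_height) / target_height
    return -min(relative_error, 1.0)
```

For instance, `height_reward(1500, 1000)` gives -0.5 under these assumptions, while an exact match gives zero penalty. An asymmetric variant that penalizes only overflow would be an equally plausible reading of the abstract.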
Qualitative and quantitative experiments demonstrate that AnyDoc outperforms both general-purpose MLLMs and task-specific baselines across all three tasks.
[112] AirSplat: Alignment and Rating for Robust Feed-Forward 3D Gaussian Splatting
Minh-Quan Viet Bui,Jaeho Moon,Munchurl Kim
Main category: cs.CV
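TL;DR: This paper proposes AirSplat, a framework that uses Self-Consistent Pose Alignment (SCPA) and Rating-based Opacity Matching (ROM) to transfer the geometric priors of 3D Vision Foundation Models (3DVFMs) to pose-free novel view synthesis (NVS), significantly improving reconstruction quality.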
Details
Motivation: 3D Vision Foundation Models perform well at zero-shot geometry estimation, but applying them directly to generalizable novel view synthesis remains difficult; pose-geometry inconsistency and degraded primitives must be addressed. Method: The AirSplat training framework includes two key techniques: (1) Self-Consistent Pose Alignment (SCPA), a training-time feedback loop that provides pixel-aligned supervision; (2) Rating-based Opacity Matching (ROM), which uses local 3D geometry consistency knowledge from a sparse-view NVS teacher model to filter out degraded primitives. Result: On large-scale benchmarks, AirSplat significantly outperforms existing pose-free NVS methods in reconstruction quality. Conclusion: AirSplat shows that adapting 3DVFMs to NVS is feasible and offers a path to accurate geometry estimation and high-quality view synthesis at the same time. Abstract: While 3D Vision Foundation Models (3DVFMs) have demonstrated remarkable zero-shot capabilities in visual geometry estimation, their direct application to generalizable novel view synthesis (NVS) remains challenging. In this paper, we propose AirSplat, a novel training framework that effectively adapts the robust geometric priors of 3DVFMs into high-fidelity, pose-free NVS. Our approach introduces two key technical contributions: (1) Self-Consistent Pose Alignment (SCPA), a training-time feedback loop that ensures pixel-aligned supervision to resolve pose-geometry discrepancy; and (2) Rating-based Opacity Matching (ROM), which leverages the local 3D geometry consistency knowledge from a sparse-view NVS teacher model to filter out degraded primitives. Experimental results on large-scale benchmarks demonstrate that our method significantly outperforms state-of-the-art pose-free NVS approaches in reconstruction quality. Our AirSplat highlights the potential of adapting 3DVFMs to enable simultaneous visual geometry estimation and high-quality view synthesis.
[113] Robust Principal Component Completion
Yinjian Wang,Wei Li,Yuanyuan Gui,James E. Fowler,Gemine Vivone
Main category: cs.CV
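TL;DR: This paper proposes a new robust principal component completion (RPCC) framework that solves a probabilistic sparse tensor factorization with variational Bayesian inference to identify the support of the sparse foreground indirectly. It avoids the post-hoc thresholding required by conventional RPCA and performs well on synthetic data and on real video and hyperspectral data.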
Details
Motivation: Conventional RPCA assumes that the low-rank background and the sparse foreground are additive, but in practice the sparse foreground often occludes or replaces background elements, so the model does not match the data. Method: The robust principal component completion (RPCC) framework models the sparse component indirectly through its support. It uses a fully probabilistic Bayesian sparse tensor factorization solved by variational Bayesian inference, and convergence to a hard classifier is proven, so the support is determined automatically. Result: Near-optimal estimates on synthetic data; robust foreground extraction on real color video; effective anomaly detection on hyperspectral data. Conclusion: RPCC fits occlusion and replacement scenarios better and needs no post-hoc thresholding, which improves robustness and practicality; the source code and appendices are publicly available. Abstract: Robust principal component analysis (RPCA) seeks a low-rank component and a sparse component from their summation. Yet, in many applications of interest, the sparse foreground actually replaces, or occludes, elements from the low-rank background. To address this mismatch, a new framework is proposed in which the sparse component is identified indirectly through determining its support. This approach, called robust principal component completion (RPCC), is solved via variational Bayesian inference applied to a fully probabilistic Bayesian sparse tensor factorization. Convergence to a hard classifier for the support is shown, thereby eliminating the post-hoc thresholding required of most prior RPCA-driven approaches. Experimental results reveal that the proposed approach delivers near-optimal estimates on synthetic data as well as robust foreground-extraction and anomaly-detection performance on real color video and hyperspectral datasets, respectively. Source implementation and Appendices are available at https://github.com/WongYinJ/BCP-RPCC.
[114] EgoXtreme: A Dataset for Robust Object Pose Estimation in Egocentric Views under Extreme Conditions
Taegyoon Yoon,Yegyu Han,Seojin Ji,Jaewoo Park,Sojeong Kim,Taein Kwon,Hyung-Sin Kim
Main category: cs.CV
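TL;DR: This paper introduces the EgoXtreme dataset to address challenges that existing 6D object pose estimation benchmarks miss in real egocentric applications (such as smart glasses), including motion blur, lighting changes, and occlusion, and it shows the limits of current methods under extreme conditions.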
Details
Motivation: Existing 6D pose estimation benchmarks do not reflect the severe motion blur, dynamic illumination, and visual obstructions of real egocentric settings (such as smart glasses), which leaves a large gap between lab performance and real-world deployment.
Method: The authors build EgoXtreme, a new large-scale 6D pose estimation dataset captured entirely from an egocentric perspective and covering three extreme scenarios (industrial maintenance, sports, and emergency rescue), and evaluate generalizable pose estimators, image restoration methods, and tracking-based methods on it.
Result: State-of-the-art generalizable pose estimators degrade sharply on EgoXtreme, especially under low light; image restoration alone (e.g., deblurring) brings no improvement; tracking-based methods that use temporal information show some gain.
Conclusion: EgoXtreme is a key resource for developing and evaluating robust pose estimation models for real-world egocentric vision.
Abstract: Smart glasses are emerging as a useful device since they provide plenty of insights under hands-busy, eyes-on-task situations. To understand the context of the wearer, 6D object pose estimation in egocentric view is becoming essential. However, existing 6D object pose estimation benchmarks fail to capture the challenges of real-world egocentric applications, which are often dominated by severe motion blur, dynamic illumination, and visual obstructions. This discrepancy creates a significant gap between controlled lab data and chaotic real-world application. To bridge this gap, we introduce EgoXtreme, a new large-scale 6D pose estimation dataset captured entirely from an egocentric perspective. EgoXtreme features three challenging scenarios (industrial maintenance, sports, and emergency rescue) designed to introduce severe perceptual ambiguities through extreme lighting, heavy motion blur, and smoke. Evaluations of state-of-the-art generalizable pose estimators on EgoXtreme indicate that their generalization fails to hold in extreme conditions, especially under low light. We further demonstrate that simply applying image restoration (e.g., deblurring) offers no improvement under extreme conditions. In contrast, a tracking-based approach does yield a performance gain, implying that temporal information is useful in fast-motion scenarios. We conclude that EgoXtreme is an essential resource for developing and evaluating the next generation of pose estimation models robust enough for real-world egocentric vision.
The dataset and code are available at https://taegyoun88.github.io/EgoXtreme/
[115] FD$^2$: A Dedicated Framework for Fine-Grained Dataset Distillation
Hongxu Ma,Guang Li,Shijie Wang,Dongzhan Zhou,Baoli Sun,Takahiro Ogawa,Miki Haseyama,Zhihui Wang
Main category: cs.CV
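TL;DR: This paper proposes FD², a framework dedicated to fine-grained dataset distillation. It localizes discriminative regions and builds fine-grained representations, and it introduces counterfactual attention learning in pretraining plus a fine-grained characteristic constraint and a similarity constraint in distillation, improving performance in most settings.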
Details
Motivation: 现有解耦式数据集蒸馏方法依赖粗粒度类别监督,难以捕捉细粒度数据中细微的类间差异和丰富的类内变化,导致蒸馏样本判别性不足。 Method: FD²框架包含:1)预训练阶段采用反事实注意力学习聚合判别性表征更新类原型;2)蒸馏阶段引入细粒度特征约束(对齐本类原型、排斥他类原型)和相似性约束(增强同类样本间的注意力多样性)。 Result: 在多个细粒度及通用数据集上实验表明,FD²能无缝集成到解耦式DD流程中,并在多数设置下提升性能,展现出强泛化与迁移能力。 Conclusion: FD²有效解决了细粒度数据集蒸馏中判别性不足与类内多样性缺失的问题,为细粒度视觉任务的数据高效学习提供了新思路。 Abstract: Dataset distillation (DD) compresses a large training set into a small synthetic set, reducing storage and training cost, and has shown strong results on general benchmarks. Decoupled DD further improves efficiency by splitting the pipeline into pretraining, sample distillation, and soft-label generation. However, existing decoupled methods largely rely on coarse class-label supervision and optimize samples within each class in a nearly identical manner. On fine-grained datasets, this often yields distilled samples that (i) retain large intra-class variation with subtle inter-class differences and (ii) become overly similar within the same class, limiting localized discriminative cues and hurting recognition. To solve the above-mentioned problems, we propose FD$^{2}$, a dedicated framework for Fine-grained Dataset Distillation. FD$^{2}$ localizes discriminative regions and constructs fine-grained representations for distillation. During pretraining, counterfactual attention learning aggregates discriminative representations to update class prototypes. During distillation, a fine-grained characteristic constraint aligns each sample with its class prototype while repelling others, and a similarity constraint diversifies attention across same-class samples. Experiments on multiple fine-grained and general datasets show that FD$^{2}$ integrates seamlessly with decoupled DD and improves performance in most settings, indicating strong transferability.[116] Learning to Rank Caption Chains for Video-Text Alignment
Ansel Blume,Burak Uzkent,Shalini Chaudhuri,Garin Kessler
Main category: cs.CV
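TL;DR: This paper proposes training vision-language models with ranking optimization as an alternative to conventional binary direct preference optimization (DPO), focusing on video-text alignment. It builds fine-grained, totally ordered chains of degraded captions to better model visual fidelity, and finds that the vision encoder must be finetuned for these approaches to be effective, which challenges the usual view that DPO acts only on the language side.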
Details
Motivation: 传统DPO采用二元胜/负偏好,忽视了视觉-语言模型中‘失败响应’仍可能具备高视觉保真度这一关键特性,导致对视觉内容的建模不够精细。 Method: 提出基于排序优化的训练范式,聚焦视频-文本对齐;通过反复降质生成大规模、全序排列的详细视频字幕链作为训练信号;并联合微调视觉编码器与语言模型。 Result: 在长文本生成与评估任务上,排序优化显著优于标准DPO;且必须微调视觉编码器才有效,表明DPO类方法并非纯粹的语言重加权过程。 Conclusion: 排序优化能更精细地建模响应对视觉输入的忠实度;视觉编码器的可训练性是成功应用偏好学习的关键前提,推动视觉-语言偏好建模向多模态协同优化演进。 Abstract: Direct preference optimization (DPO) is an effective technique to train language models to generate preferred over dispreferred responses. However, this binary "winner-takes-all" approach is suboptimal for vision-language models whose response quality is highly dependent on visual content. In particular, a response may still be faithful to the visual inputs even if it is less preferable than an alternative. The standard Bradley-Terry DPO formulation lacks this nuance, upweighting winning responses without sufficient regard for whether the "losing" response still maintains high visual fidelity. In this work, we investigate ranking optimization as an alternative that more precisely situates responses' faithfulness to visual inputs. We focus on video-text alignment using detailed video captions, proposing a method to generate challenging, totally ordered caption chains at scale through repeated caption degradation. Our results show ranking optimization outperforms binary DPO for long-form content generation and assessment, and importantly, we find that these approaches require finetuning of the vision encoder to be effective, challenging the view of DPO as purely a language-reweighting process.[117] A Semantically Disentangled Unified Model for Multi-category 3D Anomaly Detection
SuYeon Kim,Wongyu Lee,MyeongAh Cho
Main category: cs.CV
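TL;DR: This paper proposes a semantically disentangled unified model for 3D anomaly detection. Coarse-to-fine global tokenization, category-conditioned contrastive learning, and a geometry-guided decoder mitigate inter-category entanglement, reaching state-of-the-art performance on Real3D-AD and Anomaly-ShapeNet.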
Details
Motivation: Existing unified models for multi-category 3D anomaly detection suffer from Inter-Category Entanglement (ICE), which leads to incorrect semantic priors and unreliable anomaly scores.
Method: The authors propose a semantically disentangled unified model with three components: (i) coarse-to-fine global tokenization to form an instance-level semantic identity; (ii) category-conditioned contrastive learning to disentangle category semantics; and (iii) a geometry-guided decoder for semantically consistent reconstruction.
Result: On Real3D-AD and Anomaly-ShapeNet, object-level AUROC improves by 2.8% (unified models) and 9.1% (category-specific models), and unified 3D anomaly detection becomes more reliable.
Conclusion: Semantic disentanglement mitigates inter-category entanglement and improves the generalization and reliability of unified 3D anomaly detection, offering a new direction for multi-category unsupervised anomaly detection.
Abstract: 3D anomaly detection targets the detection and localization of defects in 3D point clouds using models trained solely on normal data. While a unified model improves scalability by learning across multiple categories, it often suffers from Inter-Category Entanglement (ICE), where latent features from different categories overlap, causing the model to adopt incorrect semantic priors during reconstruction and ultimately yielding unreliable anomaly scores. To address this issue, we propose the Semantically Disentangled Unified Model for 3D Anomaly Detection, which reconstructs features conditioned on disentangled semantic representations. Our framework consists of three key components: (i) Coarse-to-Fine Global Tokenization for forming instance-level semantic identity, (ii) Category-Conditioned Contrastive Learning for disentangling category semantics, and (iii) a Geometry-Guided Decoder for semantically consistent reconstruction. Extensive experiments on Real3D-AD and Anomaly-ShapeNet demonstrate that our method achieves state-of-the-art performance for both unified and category-specific models, improving object-level AUROC by 2.8% and 9.1%, respectively, while enhancing the reliability of unified 3D anomaly detection.
[118] SportSkills: Physical Skill Learning from Sports Instructional Videos
Kumar Ashutosh,Chi Hsuan Wu,Kristen Grauman
Main category: cs.CV
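TL;DR: This paper introduces SportSkills, the first large-scale sports video dataset for physical skill learning, with more than 360k instructional videos and more than 630k visual demonstrations, and proposes the task of mistake-conditioned instructional video retrieval, which significantly improves the ability of video models to provide personalized visual instruction for a user query.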
Details
Motivation: 现有大规模视频数据集侧重于通用人类活动,缺乏对细粒度物理技能学习所需活动的深度覆盖。 Method: 构建SportSkills数据集(36万 instructional 视频、63万视觉示范、55项运动),开展系列实验验证其在细粒度动作理解上的优势,并提出错误条件下的 instructional 视频检索新任务。 Result: 在相同模型下,基于SportSkills的表征性能较传统活动中心数据集提升最高达4倍;专业教练正式评估表明,所提检索方法显著提升视频模型为用户查询提供个性化视觉指导的能力。 Conclusion: SportSkills填补了物理技能学习领域高质量视频数据的空白,推动了细粒度动作理解与可操作反馈生成的结合,为个性化运动教学提供了新范式。 Abstract: Current large-scale video datasets focus on general human activity, but lack depth of coverage on fine-grained activities needed to address physical skill learning. We introduce SportSkills, the first large-scale sports dataset geared towards physical skill learning with in-the-wild video. SportSkills has more than 360k instructional videos containing more than 630k visual demonstrations paired with instructional narrations explaining the know-how behind the actions from 55 varied sports. Through a suite of experiments, we show that SportSkills unlocks the ability to understand fine-grained differences between physical actions. Our representation achieves gains of up to 4x with the same model trained on traditional activity-centric datasets. Crucially, building on SportSkills, we introduce the first large-scale task formulation of mistake-conditioned instructional video retrieval, bridging representation learning and actionable feedback generation (e.g., "here's my execution of a skill; which video clip should I watch to improve it?"). Formal evaluations by professional coaches show our retrieval approach significantly advances the ability of video models to personalize visual instructions for a user query.[119] Towards Foundation Models for 3D Scene Understanding: Instance-Aware Self-Supervised Learning for Point Clouds
Bin Yang,Mohamed Abdelsamad,Miao Zhang,Alexandru Paul Condurache
Main category: cs.CV
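TL;DR: This paper proposes PointINS, an instance-oriented self-supervised learning framework that enriches point cloud representations through geometry-aware learning and strengthens instance awareness, improving indoor instance segmentation by 3.5% mAP and outdoor panoptic segmentation by 4.1% PQ on average.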
Details
Motivation: Existing self-supervised learning methods improve semantic understanding but transfer poorly to instance localization and often need full finetuning. Instance awareness is fundamental to 3D perception, so closing this gap matters for building 3D foundation models that support all downstream tasks.
Method: The authors propose PointINS, which adds an orthogonal offset branch to jointly learn high-level semantics and geometric reasoning, and design two regularization strategies: Offset Distribution Regularization (ODR), which aligns predicted offsets with geometric priors, and Spatial Clustering Regularization (SCR), which enforces local coherence using pseudo-instance masks.
Result: Across five datasets, PointINS improves indoor instance segmentation by 3.5% mAP and outdoor panoptic segmentation by 4.1% PQ on average.
Conclusion: Through geometry-aware, instance-oriented self-supervised learning, PointINS bridges the gap between semantic and instance understanding and lays groundwork for scalable 3D foundation models.
Abstract: Recent advances in self-supervised learning (SSL) for point clouds have substantially improved 3D scene understanding without human annotations. Existing approaches emphasize semantic awareness by enforcing feature consistency across augmented views or by masked scene modeling. However, the resulting representations transfer poorly to instance localization, and often require full finetuning for strong performance. Instance awareness is a fundamental component of 3D perception, so bridging this gap is crucial for progressing toward true 3D foundation models that support all downstream tasks on 3D data. In this work, we introduce PointINS, an instance-oriented self-supervised framework that enriches point cloud representations through geometry-aware learning. PointINS employs an orthogonal offset branch to jointly learn high-level semantic understanding and geometric reasoning, yielding instance awareness. We identify two consistent properties essential for robust instance localization and formulate them as complementary regularization strategies: Offset Distribution Regularization (ODR), which aligns predicted offsets with empirically observed geometric priors, and Spatial Clustering Regularization (SCR), which enforces local coherence by regularizing offsets with pseudo-instance masks. Through extensive experiments across five datasets, PointINS achieves on average +3.5% mAP improvement for indoor instance segmentation and +4.1% PQ gain for outdoor panoptic segmentation, paving the way for scalable 3D foundation models.
[120] ET-SAM: Efficient Point Prompt Prediction in SAM for Unified Scene Text Detection and Layout Analysis
Xike Zhang,Maoyuan Ye,Juhua Liu,Bo Du
Main category: cs.CV
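TL;DR: This paper proposes ET-SAM, an efficient SAM-based framework for unified scene text detection and layout analysis. A lightweight point decoder produces word heatmaps to reduce the number of prompt points, and a joint training strategy combines heterogeneous text annotations from multiple sources, giving roughly 3x faster inference and gains on several datasets.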
Details
Motivation: Existing SAM-based methods rely on pixel-level text segmentation to sample a large number of foreground points as prompts, which causes high inference latency and limited data utilization.
Method: The authors propose ET-SAM: (1) a customized lightweight point decoder produces word heatmaps to reduce the number of foreground points; (2) a joint training strategy combines datasets with multi-level, word-level only, and line-level only annotations, and introduces three corresponding sets of learnable task prompts in both the point decoder and the hierarchical mask decoder.
Result: Compared with the previous SAM-based architecture, ET-SAM runs about 3x faster at inference with competitive performance on HierText, and improves the F-score by an average of 11.0% on Total-Text, CTW1500, and ICDAR15.
Conclusion: ET-SAM substantially improves inference efficiency while maintaining or improving text detection and layout analysis performance, which supports reducing the reliance on prompt points and combining heterogeneous annotations.
Abstract: Previous works based on Segment Anything Model (SAM) have achieved promising performance in unified scene text detection and layout analysis. However, the typical reliance on pixel-level text segmentation for sampling thousands of foreground points as prompts leads to unsatisfactory inference latency and limited data utilization. To address these issues, we propose ET-SAM, an Efficient framework with two decoders for unified scene Text detection and layout analysis based on SAM. Technically, we customize a lightweight point decoder that produces word heatmaps to obtain a few foreground points, thereby eliminating excessive point prompts and accelerating inference. Without the dependence on pixel-level segmentation, we further design a joint training strategy to leverage existing data with heterogeneous text-level annotations. Specifically, the datasets with multi-level, word-level only, and line-level only annotations are combined in parallel as a unified training set. For these datasets, we introduce three corresponding sets of learnable task prompts in both the point decoder and hierarchical mask decoder to mitigate discrepancies across datasets. Extensive experiments demonstrate that, compared to the previous SAM-based architecture, ET-SAM achieves about 3$\times$ inference acceleration while obtaining competitive performance on HierText, and improves the F-score by an average of 11.0% on Total-Text, CTW1500, and ICDAR15.
[121] Knowledge-Guided Adversarial Training for Infrared Object Detection via Thermal Radiation Modeling
Shiji Zhao,Shukun Xiong,Maoxun Yuan,Yao Huang,Ranjie Duan,Qing Guo,Jiansheng Chen,Haibin Duan,Xingxing Wei
Main category: cs.CV
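TL;DR: This paper proposes Knowledge-Guided Adversarial Training (KGAT), an adversarial training method guided by infrared physical knowledge. It models the thermal radiation relations between classes to improve both the clean accuracy of infrared object detectors and their robustness to adversarial attacks and common corruptions.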
Details
Motivation: 现有红外目标检测方法多采用数据驱动范式,忽视红外图像特有的物理特性(如热辐射规律),导致鲁棒性有限;而红外图像中类间相对热辐射关系在复杂干扰下仍稳定可靠,可作为先验知识加以利用。 Method: 基于灰度值排序建模类间热辐射关系的秩序结构,并量化其稳定性;将该物理知识嵌入对抗训练过程,设计知识引导的损失函数,约束模型预测符合真实物理规律。 Result: 在三个红外数据集和六个主流检测模型上验证,KGAT显著提升了模型在干净样本上的准确率,以及对对抗攻击和常见退化(如噪声、模糊等)的鲁棒性。 Conclusion: 引入红外物理先验知识(特别是类间热辐射秩序关系)可有效增强检测模型的泛化性与鲁棒性,KGAT为知识引导的鲁棒视觉感知提供了新范式。 Abstract: In complex environments, infrared object detection exhibits broad applicability and stability across diverse scenarios. However, infrared object detection is vulnerable to both common corruptions and adversarial examples, leading to potential security risks. To improve the robustness of infrared object detection, current methods mostly adopt a data-driven ideology, which only superficially drives the network to fit the training data without specifically considering the unique characteristics of infrared images, resulting in limited robustness. In this paper, we revisit infrared physical knowledge and find that relative thermal radiation relations between different classes can be regarded as a reliable knowledge source under the complex scenarios of adversarial examples and common corruptions. Thus, we theoretically model thermal radiation relations based on the rank order of gray values for different classes, and further quantify the stability of various inter-class thermal radiation relations. Based on the above theoretical framework, we propose Knowledge-Guided Adversarial Training (KGAT) for infrared object detection, in which infrared physical knowledge is embedded into the adversarial training process, and the predicted results are optimized to be consistent with the actual physical laws. 
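As a rough illustration of the rank-order idea described above (not the authors' implementation), the sketch below measures how many inter-class gray-value orderings survive a perturbation. The class names, the mean gray values, and the helper names `pairwise_order` and `rank_stability` are hypothetical and chosen only for this example; the abstract does not specify KGAT's actual stability measure or loss.

```python
from itertools import combinations


def pairwise_order(mean_gray):
    """Map each unordered class pair to the sign of its gray-value difference."""
    order = {}
    for a, b in combinations(sorted(mean_gray), 2):
        diff = mean_gray[a] - mean_gray[b]
        order[(a, b)] = (diff > 0) - (diff < 0)  # +1, 0, or -1
    return order


def rank_stability(clean, perturbed):
    """Fraction of class pairs whose gray-value ordering is unchanged."""
    ref = pairwise_order(clean)
    new = pairwise_order(perturbed)
    kept = sum(1 for pair in ref if ref[pair] == new[pair])
    return kept / len(ref)


# Hypothetical mean gray values per class (brighter = warmer in this toy setup).
clean = {"person": 180.0, "car": 140.0, "bicycle": 90.0}
# After a perturbation, "car" becomes brighter than "person": one relation flips.
perturbed = {"person": 150.0, "car": 155.0, "bicycle": 80.0}

stability = rank_stability(clean, perturbed)
print(f"{stability:.3f}")
```

In this toy case two of the three pairwise relations are preserved. A training objective in the spirit of the abstract would penalize predictions that contradict the relations found to be stable, but that part is left out here because its form is not given in the source.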
Extensive experiments on three infrared datasets and six mainstream infrared object detection models demonstrate that KGAT effectively enhances both clean accuracy and robustness against adversarial attacks and common corruptions.
[122] AG-EgoPose: Leveraging Action-Guided Motion and Kinematic Joint Encoding for Egocentric 3D Pose Estimation
Md Mushfiqur Azam,John Quarles,Kevin Desai
Main category: cs.CV
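TL;DR: This paper proposes AG-EgoPose, a dual-stream framework that fuses short- and long-range motion context with fine-grained spatial cues for egocentric 3D human pose estimation from fisheye camera input, achieving state-of-the-art performance on real-world datasets.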
Details
Motivation: 解决第一人称视角下因严重透视畸变、身体可见性低和相机运动复杂导致的3D姿态估计难题,现有方法未能有效利用视频中的丰富运动上下文。 Method: 设计双流架构:空间流(共享权重ResNet-18编码器-解码器)生成2D关节点热图与空间特征令牌;时间流(ResNet-50+动作识别骨干网络)提取并建模运动动态;二者通过含可学习关节点令牌的Transformer解码器进行融合与精细化。 Result: 在真实世界数据集上实现SOTA定量与定性结果。 Conclusion: AG-EgoPose通过协同建模空间细节与多尺度时间动态,并引入关节级约束,在第一人称3D姿态估计任务中展现出更强鲁棒性与精度。 Abstract: Egocentric 3D human pose estimation remains challenging due to severe perspective distortion, limited body visibility, and complex camera motion inherent in first-person viewpoints. Existing methods typically rely on single-frame analysis or limited temporal fusion, which fails to effectively leverage the rich motion context available in egocentric videos. We introduce AG-EgoPose, a novel dual-stream framework that integrates short- and long-range motion context with fine-grained spatial cues for robust pose estimation from fisheye camera input. Our framework features two parallel streams: A spatial stream uses a weight-sharing ResNet-18 encoder-decoder to generate 2D joint heatmaps and corresponding joint-specific spatial feature tokens. Simultaneously, a temporal stream uses a ResNet-50 backbone to extract visual features, which are then processed by an action recognition backbone to capture the motion dynamics. These complementary representations are fused and refined in a transformer decoder with learnable joint tokens, which allows for the joint-level integration of spatial and temporal evidence while maintaining anatomical constraints. Experiments on real-world datasets demonstrate that AG-EgoPose achieves state-of-the-art performance in both quantitative and qualitative metrics. Code is available at: https://github.com/Mushfiq5647/AG-EgoPose.[123] Bilingual Text-to-Motion Generation: A New Benchmark and Baselines
Wanjiang Weng,Xiaofeng Tan,Xiangbo Shu,Guo-Sen Xie,Pan Zhou,Hongsong Wang
Main category: cs.CV
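TL;DR: This paper introduces BiHumanML3D, the first bilingual text-to-motion generation benchmark, and designs BiMD, a bilingual motion diffusion model with Cross-Lingual Alignment (CLA), which significantly improves cross-lingual motion generation and supports zero-shot code-switching.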
Details
Motivation: Existing text-to-motion generation methods are limited by the lack of bilingual datasets and by the weak cross-lingual semantic understanding of language models.
Method: The authors build BiHumanML3D, the first bilingual text-to-motion benchmark (LLM-assisted annotation followed by manual correction), and propose the Bilingual Motion Diffusion model (BiMD), whose core is an explicit Cross-Lingual Alignment (CLA) module that unifies the conditional representation space across the two languages.
Result: On BiHumanML3D, BiMD with CLA reaches an FID of 0.045 (vs. 0.169) and an R@3 of 82.8% (vs. 80.8%), significantly outperforming monolingual diffusion models and translation baselines.
Conclusion: The bilingual dataset BiHumanML3D and the CLA alignment strategy are both important and effective for cross-lingual motion generation, and offer a new approach to multilingual embodied intelligence.
Abstract: Text-to-motion generation holds significant potential for cross-linguistic applications, yet it is hindered by the lack of bilingual datasets and the poor cross-lingual semantic understanding of existing language models. To address these gaps, we introduce BiHumanML3D, the first bilingual text-to-motion benchmark, constructed via LLM-assisted annotation and rigorous manual correction. Furthermore, we propose a simple yet effective baseline, Bilingual Motion Diffusion (BiMD), featuring Cross-Lingual Alignment (CLA). CLA explicitly aligns semantic representations across languages, creating a robust conditional space that enables high-quality motion generation from bilingual inputs, including zero-shot code-switching scenarios. Extensive experiments demonstrate that BiMD with CLA achieves an FID of 0.045 vs. 0.169 and an R@3 of 82.8% vs. 80.8%, significantly outperforming monolingual diffusion models and translation baselines on BiHumanML3D, underscoring the critical necessity and reliability of our dataset and the effectiveness of our alignment strategy for cross-lingual motion synthesis. The dataset and code are released at https://wengwanjiang.github.io/BilingualT2M-page
[124] VolDiT: Controllable Volumetric Medical Image Synthesis with Diffusion Transformers
Marvin Seyfarth,Salman Ul Hassan Dar,Yannik Frisch,Philipp Wild,Norbert Frey,Florian André,Sandy Engelhardt
Main category: cs.CV
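TL;DR: This paper proposes VolDiT, the first purely transformer-based 3D diffusion model for volumetric medical image synthesis. Volumetric patch embeddings, global self-attention, and a timestep-gated control adapter enable high-fidelity, controllable, and globally coherent generation.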
Details
Motivation: Most existing 3D medical image generation methods build on convolutional U-Nets, which are limited in receptive field, global modeling, and flexible conditioning; a fully transformer-based diffusion architecture better suited to volumetric data is worth exploring.
Method: The authors propose VolDiT, which uses (1) volumetric (3D) patch embeddings; (2) global self-attention operating directly over 3D tokens; and (3) a timestep-gated control adapter that maps segmentation masks into learnable control tokens that modulate the denoising process.
Result: On high-resolution 3D medical image synthesis, VolDiT shows better global coherence, generative fidelity, and controllability than state-of-the-art U-Net-based latent diffusion models.
Conclusion: Fully transformer-based diffusion models are a flexible and promising foundation for volumetric medical image synthesis.
Abstract: Diffusion models have become a leading approach for high-fidelity medical image synthesis. However, most existing methods for 3D medical image generation rely on convolutional U-Net backbones within latent diffusion frameworks. While effective, these architectures impose strong locality biases and limited receptive fields, which may constrain scalability, global context integration, and flexible conditioning. In this work, we introduce VolDiT, the first purely transformer-based 3D Diffusion Transformer for volumetric medical image synthesis. Our approach extends diffusion transformers to native 3D data through volumetric patch embeddings and global self-attention operating directly over 3D tokens. To enable structured control, we propose a timestep-gated control adapter that maps segmentation masks into learnable control tokens that modulate transformer layers during denoising. This token-level conditioning mechanism allows precise spatial guidance while preserving the modeling advantages of transformer architectures. We evaluate our model on high-resolution 3D medical image synthesis tasks and compare it to state-of-the-art 3D latent diffusion models based on U-Nets. Results demonstrate improved global coherence, superior generative fidelity, and enhanced controllability. Our findings suggest that fully transformer-based diffusion models provide a flexible foundation for volumetric medical image synthesis. The code and models trained on public data are available at https://github.com/Cardio-AI/voldit.
[125] AnyID: Ultra-Fidelity Universal Identity-Preserving Video Generation from Any Visual References
Jiahao Wang,Hualian Sheng,Sijia Cai,Yuxiao Yang,Weizhan Zhang,Caixia Yan,Bing Deng,Jieping Ye
Main category: cs.CV
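TL;DR: This paper proposes the AnyID framework, which achieves high-fidelity identity-preserving video generation through a scalable omni-referenced architecture and a primary-referenced generation paradigm, and is fine-tuned with reinforcement learning on a preference dataset built from human evaluations.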
Details
Motivation: 现有方法通常仅针对单一身份参考进行设计和优化,限制了创意灵活性,且单源依赖导致身份复现模糊、不准确。 Method: 提出AnyID框架:1)可扩展的全参考架构,统一处理异构身份输入(如人脸、肖像、视频);2)主参考生成范式,以一个参考为锚点,结合差分提示实现属性级可控生成;训练采用大规模标注数据集,再用基于人类偏好的强化学习进行微调。 Result: AnyID在多种任务设置下均实现了超高身份保真度与优异的属性级可控性。 Conclusion: AnyID有效解决了多源身份输入下的视频生成难题,显著提升了身份保真度与可控性,为创意视频生成提供了更灵活、鲁棒的工具。 Abstract: Identity-preserving video generation offers powerful tools for creative expression, allowing users to customize videos featuring their beloved characters. However, prevailing methods are typically designed and optimized for a single identity reference. This underlying assumption restricts creative flexibility by inadequately accommodating diverse real-world input formats. Relying on a single source also constitutes an ill-posed scenario, causing an inherently ambiguous setting that makes it difficult for the model to faithfully reproduce an identity across novel contexts. To address these issues, we present AnyID, an ultra-fidelity identity-preservation video generation framework that features two core contributions. First, we introduce a scalable omni-referenced architecture that effectively unifies heterogeneous identity inputs (e.g., faces, portraits, and videos) into a cohesive representation. Second, we propose a primary-referenced generation paradigm, which designates one reference as a canonical anchor and uses a novel differential prompt to enable precise, attribute-level controllability. We conduct training on a large-scale, meticulously curated dataset to ensure robustness and high fidelity, and then perform a final fine-tuning stage using reinforcement learning. This process leverages a preference dataset constructed from human evaluations, where annotators performed pairwise comparisons of videos based on two key criteria: identity fidelity and prompt controllability. 
Extensive evaluations validate that AnyID achieves ultra-high identity fidelity as well as superior attribute-level controllability across different task settings.
[126] CardioDiT: Latent Diffusion Transformers for 4D Cardiac MRI Synthesis
Marvin Seyfarth,Sarah Kaye Müller,Arman Ghanaat,Isabelle Ayx,Fabian Fastenrath,Philipp Wild,Alexander Hertel,Theano Papavassiliu,Salman Ul Hassan Dar,Sandy Engelhardt
Main category: cs.CV
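TL;DR: This paper proposes CardioDiT, a fully 4D latent diffusion framework based on diffusion transformers for short-axis cine cardiac MRI (CMR) synthesis, which models space and time jointly to improve the spatiotemporal consistency and physiological plausibility of cardiac dynamics.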
Details
Motivation: 现有生成方法对 cine CMR 这类具有时间同步特性的 3D 医学影像多采用空间与时间解耦建模或依赖辅助机制(如解剖掩码)保证时序一致性,易引入结构偏差,导致时空不连续或心脏动力学不真实。 Method: 提出 CardioDiT:使用时空VQ-VAE将2D+t切片编码为紧凑潜在表示,并用扩散Transformer对完整3D+t体积进行端到端联合建模,实现空间与时间在生成全过程中的统一耦合。 Result: 在公开及私有CMR数据集上验证,CardioDiT相较基线方法显著提升层间一致性、时间运动连贯性及心脏功能分布的真实性;代码与模型已开源。 Conclusion: 显式4D建模结合扩散Transformer为心脏影像的时空合成提供了原理上更优的基础框架。 Abstract: Latent diffusion models (LDMs) have recently achieved strong performance in 3D medical image synthesis. However, modalities like cine cardiac MRI (CMR), representing a temporally synchronized 3D volume across the cardiac cycle, add an additional dimension that most generative approaches do not model directly. Instead, they factorize space and time or enforce temporal consistency through auxiliary mechanisms such as anatomical masks. Such strategies introduce structural biases that may limit global context integration and lead to subtle spatiotemporal discontinuities or physiologically inconsistent cardiac dynamics. We investigate whether a unified 4D generative model can learn continuous cardiac dynamics without architectural factorization. We propose CardioDiT, a fully 4D latent diffusion framework for short-axis cine CMR synthesis based on diffusion transformers. A spatiotemporal VQ-VAE encodes 2D+t slices into compact latents, which a diffusion transformer then models jointly as complete 3D+t volumes, coupling space and time throughout the generative process. We evaluate CardioDiT on public CMR datasets and a larger private cohort, comparing it to baselines with progressively stronger spatiotemporal coupling. Results show improved inter-slice consistency, temporally coherent motion, and realistic cardiac function distributions, suggesting that explicit 4D modeling with a diffusion transformer provides a principled foundation for spatiotemporal cardiac image synthesis. 
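To make the cost of coupling space and time concrete, the following back-of-the-envelope sketch counts tokens and attention pairs for a 4D latent under joint attention versus a space-then-time factorization. The latent shape, the patch size, and the function names are assumptions for illustration only; the abstract does not give CardioDiT's actual dimensions or attention layout.

```python
def token_grid(shape, patch):
    """Number of patches along each axis; assumes every axis divides evenly."""
    if any(s % p for s, p in zip(shape, patch)):
        raise ValueError("each axis length must be divisible by its patch size")
    return tuple(s // p for s, p in zip(shape, patch))


def attention_pairs(latent_shape, patch):
    """Return (token count, joint pairs, factorized pairs) for a (T, D, H, W) latent."""
    t, d, h, w = token_grid(latent_shape, patch)
    spatial = d * h * w
    tokens = t * spatial
    # Joint 4D attention: every token attends to every other token.
    joint = tokens ** 2
    # Factorized: spatial attention within each frame, then temporal
    # attention across frames at each spatial location.
    factorized = t * spatial ** 2 + spatial * t ** 2
    return tokens, joint, factorized


# Hypothetical latent of 8 frames x 8 slices x 32 x 32, patch size 2 x 2 x 4 x 4.
tokens, joint, factorized = attention_pairs((8, 8, 32, 32), (2, 2, 4, 4))
print(tokens, joint, factorized)  # 1024 1048576 266240
```

Under these assumed sizes, joint attention evaluates roughly four times as many token pairs as the factorized variant, which is the price paid for letting every spatial location interact with every time point directly.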
Code and models trained on public data are available at https://github.com/Cardio-AI/cardiodit.
[127] TacSIm: A Dataset and Benchmark for Football Tactical Style Imitation
Peng Wen,Yuting Wang,Qiurui Wang
Main category: cs.CV
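TL;DR: This paper introduces TacSIm, a large-scale dataset and benchmark for football tactical style imitation that aims to reproduce the tactical behavior of real teams accurately rather than only optimize reward objectives. Built from broadcast footage of Premier League matches, it projects the positions and actions of all 22 players onto a standard pitch coordinate system, defines spatial occupancy and movement vector similarity metrics, and evaluates several baseline methods quantitatively and visually in a unified virtual environment.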
Details
Motivation: Existing football imitation research mostly targets reward-oriented objectives (such as goals scored or win rate) and neglects accurate reproduction of real teams' tactical behavior.
Method: The authors build the TacSIm dataset and benchmark: from single-view Premier League broadcast footage they extract the starting positions and actions of all 22 players on both sides and map them onto a standard pitch coordinate system; they define the tactical style imitation task with spatial occupancy similarity and movement vector similarity as evaluation metrics; and they run several baseline methods in a unified virtual environment to generate full-team behavior.
Result: TacSIm provides a standardized dataset and evaluation protocol for tactical style imitation, supporting quantitative and visual assessment of tactical coordination in both space and time.
Conclusion: TacSIm establishes a unified evaluation pipeline from broadcast video to simulated behavior and offers a rigorous, reproducible benchmark for style-aligned football tactical imitation.
Abstract: Current football imitation research primarily aims to optimize reward-based objectives, such as goals scored or win rate proxies, paying less attention to accurately replicating real-world team tactical behaviors. We introduce TacSIm, a large-scale dataset and benchmark for Tactical Style Imitation in football. The task is to imitate the actions of all 11 players of one team in given broadcast footage of Premier League matches captured from a single broadcast view. Given offensive or defensive broadcast footage, TacSIm projects the starting positions and actions of all 22 players from both sides onto a standard pitch coordinate system. TacSIm offers an explicit style imitation task and evaluation protocols. Tactical style imitation is measured using spatial occupancy similarity and movement vector similarity over a defined time window, supporting the evaluation of spatial and temporal similarities for one team. We run multiple baseline methods in a unified virtual environment to generate full-team behaviors, enabling both quantitative and visual assessment of tactical coordination. By using unified data and metrics from broadcast to simulation, TacSIm establishes a rigorous benchmark for measuring and modeling style-aligned tactical imitation in football.
[128] CIV-DG: Conditional Instrumental Variables for Domain Generalization in Medical Imaging
Shaojin Bai,Yuting Su,Weizhi Nie
Main category: cs.CV
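TL;DR: This paper proposes CIV-DG, a framework that uses conditional instrumental variables (CIV) to disentangle pathological semantics from scanner-induced artifacts, addressing the structural confounding caused by selection bias in cross-site generalization of medical AI. Implemented with a DeepGMM architecture, it significantly outperforms existing methods on Camelyon17 and chest X-ray datasets.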
Details
Motivation: 医疗AI跨站点泛化受选择偏倚(如患者年龄、病情严重程度非随机决定就诊医院)导致的结构性混杂严重制约;传统领域泛化方法无法消除站点特异性伪影与诊断标签间的虚假相关。 Method: 提出基于条件工具变量(CIV)的因果框架CIV-DG,放宽标准工具变量的随机分配假设,适配临床中由人口统计学因素内生驱动的医院选择;通过深度广义矩估计(DeepGMM)架构,设计条件判别器在人口统计子群内最小化矩条件违规并强制工具变量与误差正交。 Result: 在Camelyon17和大规模胸部X光数据集上,CIV-DG显著优于主流领域泛化基线方法。 Conclusion: 条件因果机制(如CIV)可有效应对医疗AI中的结构性混杂,提升模型跨中心鲁棒性与泛化能力。 Abstract: Cross-site generalizability in medical AI is fundamentally compromised by selection bias, a structural mechanism where patient demographics (e.g., age, severity) non-randomly dictate hospital assignment. Conventional Domain Generalization (DG) paradigms, which predominantly target image-level distribution shifts, fail to address the resulting spurious correlations between site-specific variations and diagnostic labels. To surmount this identifiability barrier, we propose CIV-DG, a causal framework that leverages Conditional Instrumental Variables to disentangle pathological semantics from scanner-induced artifacts. By relaxing the strict random assignment assumption of standard IV methods, CIV-DG accommodates complex clinical scenarios where hospital selection is endogenously driven by patient demographics. We instantiate this theory via a Deep Generalized Method of Moments (DeepGMM) architecture, employing a conditional critic to minimize moment violations and enforce instrument-error orthogonality within demographic strata. Extensive experiments on the Camelyon17 benchmark and large-scale Chest X-Ray datasets demonstrate that CIV-DG significantly outperforms leading baselines, validating the efficacy of conditional causal mechanisms in resolving structural confounding for robust medical AI.[129] Free-Lunch Long Video Generation via Layer-Adaptive O.O.D Correction
Jiahao Tian,Chenxi Song,Wei Cheng,Chi Zhang
Main category: cs.CV
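TL;DR: This paper proposes FreeLOC, a framework built on two techniques, Video-based Relative Position Re-encoding (VRPR) and Tiered Sparse Attention (TSA). Without any additional training, it addresses the quality degradation that pre-trained video diffusion models suffer on long videos when frame-level relative positions and context length fall out of distribution, and it adds a layer-adaptive probing mechanism to improve efficiency.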
Details
Motivation: 预训练视频扩散模型通常在短片段上训练,直接用于长视频生成时会出现视觉质量显著下降,主要源于帧级相对位置和上下文长度两类分布外(O.O.D)问题。 Method: 提出FreeLOC:1)Video-based Relative Position Re-encoding(VRPR),分层重编码时间相对位置以对齐预训练分布;2)Tiered Sparse Attention(TSA),跨时间尺度结构化稀疏注意力以兼顾局部细节与长程依赖;3)层自适应探测机制,识别各Transformer层对O.O.D的敏感性,实现选择性、高效应用。 Result: 在时序一致性和视觉质量上显著超越现有无训练方法,达到SOTA性能。 Conclusion: FreeLOC是一种训练无关、层自适应的通用框架,有效缓解长视频生成中的两类关键分布外问题,为基于预训练扩散模型的长视频合成提供了新范式。 Abstract: Generating long videos using pre-trained video diffusion models, which are typically trained on short clips, presents a significant challenge. Directly applying these models for long-video inference often leads to a notable degradation in visual quality. This paper identifies that this issue primarily stems from two out-of-distribution (O.O.D) problems: frame-level relative position O.O.D and context-length O.O.D. To address these challenges, we propose FreeLOC, a novel training-free, layer-adaptive framework that introduces two core techniques: Video-based Relative Position Re-encoding (VRPR) for frame-level relative position O.O.D, a multi-granularity strategy that hierarchically re-encodes temporal relative positions to align with the model's pre-trained distribution, and Tiered Sparse Attention (TSA) for context-length O.O.D, which preserves both local detail and long-range dependencies by structuring attention density across different temporal scales. Crucially, we introduce a layer-adaptive probing mechanism that identifies the sensitivity of each transformer layer to these O.O.D issues, allowing for the selective and efficient application of our methods. Extensive experiments demonstrate that our approach significantly outperforms existing training-free methods, achieving state-of-the-art results in both temporal consistency and visual quality. Code is available at https://github.com/Westlake-AGI-Lab/FreeLOC.[130] SDD-YOLO: A Small-Target Detection Framework for Ground-to-Air Anti-UAV Surveillance with Edge-Efficient Deployment
Pengyu Chen,Haotian Sa,Yiwei Hu,Yuhan Cheng,Junbo Wang
Main category: cs.CV
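TL;DR: This paper proposes SDD-YOLO, a small-target detection framework designed for ground-to-air (G2A) anti-UAV surveillance. With a high-resolution P2 detection head, a DFL-free and NMS-free architecture, and the MuSGD hybrid training strategy, it reaches 86.0% mAP@0.5 on the self-built large-scale dataset DroneSOD-30K while keeping inference fast (226 FPS on GPU, 35 FPS on CPU).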
Details
Motivation: Detecting small UAVs from the ground faces extremely low pixel occupancy, cluttered aerial backgrounds, and strict real-time requirements. Existing YOLO models lack sufficient feature resolution and are complex to deploy.
Method: SDD-YOLO introduces a high-resolution P2 detection head (4x downsampling); integrates the DFL-free, NMS-free lightweight architecture from YOLO26; adopts the MuSGD hybrid training strategy (with ProgLoss and STAL) to reduce gradient oscillation on sparse small-target signals; and builds a new dataset, DroneSOD-30K (about 30,000 annotated images).
Result: SDD-YOLO-n reaches 86.0% mAP@0.5 on DroneSOD-30K, 7.8 percentage points above YOLOv5n. Inference runs at 226 FPS on an RTX 5090 and 35 FPS on a Xeon CPU.
Conclusion: SDD-YOLO strikes a good balance between accuracy and efficiency, markedly improves detection of tiny UAVs in G2A scenes, and shows strong potential for edge deployment.
Abstract: Detecting small unmanned aerial vehicles (UAVs) from a ground-to-air (G2A) perspective presents significant challenges, including extremely low pixel occupancy, cluttered aerial backgrounds, and strict real-time constraints. Existing YOLO-based detectors are primarily optimized for general object detection and often lack adequate feature resolution for sub-pixel targets, while introducing complexities during deployment. In this paper, we propose SDD-YOLO, a small-target detection framework tailored for G2A anti-UAV surveillance. To capture fine-grained spatial details critical for micro-targets, SDD-YOLO introduces a P2 high-resolution detection head operating at 4 times downsampling. Furthermore, we integrate the recent architectural advancements from YOLO26, including a DFL-free, NMS-free architecture for streamlined inference, and the MuSGD hybrid training strategy with ProgLoss and STAL, which substantially mitigates gradient oscillation on sparse small-target signals. To support our evaluation, we construct DroneSOD-30K, a large-scale G2A dataset comprising approximately 30,000 annotated images covering diverse meteorological conditions. Experiments demonstrate that SDD-YOLO-n achieves a mAP@0.5 of 86.0% on DroneSOD-30K, surpassing the YOLOv5n baseline by 7.8 percentage points. Extensive inference analysis shows our model attains 226 FPS on an NVIDIA RTX 5090 and 35 FPS on an Intel Xeon CPU, demonstrating exceptional efficiency for future edge deployment.
[131] An Image Dataset of Common Skin Diseases of Bangladesh and Benchmarking Performance with Machine Learning Models
Sazzad Hossain,Saiful Islam,Muhammad Ibrahim,Md. Rasel Ahmed,Md Shuayb,Ahmedul Kabir
Main category: cs.CV
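TL;DR: This paper presents a public image dataset covering five common skin diseases in Bangladesh (contact dermatitis, vitiligo, eczema, scabies, and tinea ringworm), with 1612 images in total including augmented ones, and evaluates the classification performance of several machine learning and deep learning models, aiming to ease the shortage of dermatology resources in primary care.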
Details
Motivation: Densely populated countries such as Bangladesh lack enough dermatologists and diagnostic equipment, which raises the risk of missed and wrong diagnoses of skin diseases and creates an urgent need for AI-based diagnostic aids.
Method: The authors build a clinically collected image dataset of five skin disease classes (1612 images, including augmented data) and run classification experiments with several traditional machine learning models (such as SVM and RF) and deep learning models (such as VGG16 and ResNet50).
Result: Classification performance is reported for each model on the dataset (specific metrics are not detailed), confirming that the dataset can be used to train effective skin disease recognition models.
Conclusion: The public dataset fills a data gap in regional skin disease AI research, is particularly suited to resource-limited regions such as South Asia, and may advance research on and applications of AI-based automated skin disease diagnosis worldwide.
Abstract: Skin diseases are a major public health concern worldwide, and their detection is often challenging without access to dermatological expertise. In countries like Bangladesh, which is highly populated, the number of qualified skin specialists and diagnostic instruments is insufficient to meet the demand. Due to the lack of proper detection and treatment of skin diseases, that may lead to severe health consequences including death. Common properties of skin diseases are, changing the color, texture, and pattern of skin and in this era of artificial intelligence and machine learning, we are able to detect skin diseases by using image processing and computer vision techniques. In response to this challenge, we develop a publicly available dataset focused on common skin disease detection using machine learning techniques. We focus on five prevalent skin diseases in Bangladesh: Contact Dermatitis, Vitiligo, Eczema, Scabies, and Tinea Ringworm. The dataset consists of 1612 images (of which, 250 are distinct while others are augmented), collected directly from patients at the outpatient department of Faridpur Medical College, Faridpur, Bangladesh. The data comprises of 302, 381, 301, 316, and 312 images of Dermatitis, Eczema, Scabies, Tinea Ringworm, and Vitiligo, respectively. Although the data are collected regionally, the selected diseases are common across many countries especially in South Asia, making the dataset potentially valuable for global applications in machine learning-based dermatology. We also apply several machine learning and deep learning models on the dataset and report classification performance. We expect that this research would garner attention from machine learning and deep learning researchers and practitioners working in the field of automated disease diagnosis.
[132] A Unified Spatial Alignment Framework for Highly Transferable Transformation-Based Attacks on Spatially Structured Tasks
Jiaming Liang,Chi-Man Pun
Main category: cs.CV
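TL;DR: This paper proposes a unified Spatial Alignment Framework (SAF) that transforms inputs and labels synchronously to improve the transferability of transformation-based adversarial attacks (TAAs) on spatially structured tasks such as semantic segmentation and object detection.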
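The core idea, applying the same spatial transformation to the input and to its spatially structured label, can be illustrated with a minimal sketch. It assumes a horizontal flip as the spatial transformation and a toy segmentation mask as the label; the function names `hflip` and `aligned_transform` are hypothetical and are not taken from the paper's SA algorithm.

```python
def hflip(grid):
    """Horizontally flip a 2D grid given as a list of rows."""
    return [row[::-1] for row in grid]


def aligned_transform(image, label, flip):
    """Apply the same spatial transform to the input and its dense label.

    Hypothetical helper: only illustrates input/label synchronization.
    """
    if flip:
        return hflip(image), hflip(label)
    return image, label


# Toy 2x3 "image" and its per-pixel class mask.
image = [[0.1, 0.9, 0.2],
         [0.3, 0.8, 0.4]]
label = [[0, 1, 0],
         [0, 1, 2]]

flipped_image, flipped_label = aligned_transform(image, label, flip=True)

# If only the input were flipped, the original mask would no longer line up
# with the pixels it describes, so the loss gradient would be computed
# against the wrong locations.
misaligned_label = label
```

In this toy case the pixel with value 0.4 and its class 2 both move from the right edge to the left edge, whereas `misaligned_label` still places class 2 on the right.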
Details
Motivation: Existing transformation-based adversarial attacks (TAAs) transfer well on non-structured classification tasks but perform poorly or even fail on spatially structured tasks such as semantic segmentation and object detection. The main cause is that spatially structured labels are not transformed together with the inputs, which produces spatial misalignment and erroneous gradients.
Method: The paper proposes the Spatial Alignment Framework (SAF), whose core is the Spatial Alignment (SA) algorithm: whenever a TAA applies a spatial transformation to the input, it applies the same transformation to the corresponding spatially structured label to keep the two spatially consistent.
Result: In non-targeted attacks, the average mIoU drops from 24.50 to 11.34 on Cityscapes and from 49.91 to 31.80 on Kvasir-SEG, and the average mAP on COCO drops from 17.89 to 5.25, confirming the key role of SAF in improving TAA transferability on structured tasks.
Conclusion: Transforming inputs and labels synchronously is key to making TAAs transfer on structured vision tasks, and SAF offers a general and effective solution for this class of attacks.
Abstract: Transformation-based adversarial attacks (TAAs) demonstrate strong transferability when deceiving classification models. However, existing TAAs often perform unsatisfactorily or even fail when applied to structured tasks such as semantic segmentation and object detection. Encouragingly, recent studies that categorize transformations into non-spatial and spatial transformations inspire us to address this challenge. We find that for non-structured tasks, labels are spatially non-structured, and thus TAAs are not required to adjust labels when applying spatial transformations. In contrast, for structured tasks, labels are spatially structured, and failing to transform labels synchronously with inputs can cause spatial misalignment and yield erroneous gradients. To address these issues, we propose a novel unified Spatial Alignment Framework (SAF) for highly transferable TAAs on spatially structured tasks, where the TAAs spatially transform labels synchronously with the input using the proposed Spatial Alignment (SA) algorithm. Extensive experiments demonstrate the crucial role of our SAF for TAAs on structured tasks. Specifically, in non-targeted attacks, our SAF degrades the average mIoU on Cityscapes from 24.50 to 11.34, and on Kvasir-SEG from 49.91 to 31.80, while reducing the average mAP of COCO from 17.89 to 5.25.
[133] Efficient Preemptive Robustification with Image Sharpening
Jiaming Liang,Chi-Man Pun
Main category: cs.CV
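TL;DR: This paper proposes image sharpening as a simple and efficient image robustification method that needs no surrogate model, no optimization, and no generator, and is human-interpretable; it markedly improves model robustness in transfer-attack scenarios.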
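For readers unfamiliar with sharpening as a purely feed-forward operation, the sketch below shows one common form, unsharp masking with a 3x3 box blur. This is an assumption for illustration: the summary does not state which sharpening operator or strength the authors use, and `box_blur`, `sharpen`, and `amount` are hypothetical names.

```python
def box_blur(img):
    """3x3 mean filter with edge replication on a 2D list of floats."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)  # clamp row index
                    xx = min(max(x + dx, 0), w - 1)  # clamp column index
                    total += img[yy][xx]
            out[y][x] = total / 9.0
    return out


def sharpen(img, amount=1.0):
    """Unsharp masking: img + amount * (img - blur(img)), clipped to [0, 1]."""
    blurred = box_blur(img)
    return [
        [min(1.0, max(0.0, p + amount * (p - b))) for p, b in zip(row, brow)]
        for row, brow in zip(img, blurred)
    ]


# A vertical step edge: sharpening should widen the intensity gap across it.
edge = [[0.25, 0.25, 0.75, 0.75] for _ in range(3)]
sharp_edge = sharpen(edge)

# A flat image has no texture to amplify and stays unchanged.
flat = [[0.5] * 3 for _ in range(3)]
sharp_flat = sharpen(flat)
```

The operation involves no model, no gradient step, and no learned parameters, which matches the surrogate-free and optimization-free properties the summary emphasizes.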
Details
Motivation: Existing pre-attack defenses (preemptive robustification) rely on surrogate models, carry heavy computational cost, and lack interpretability. The recent finding that texture intensity correlates positively with image robustness motivates a lighter, more interpretable approach.
Method: Building on the positive correlation between texture intensity and robustness, the paper uses image sharpening directly as the robustification operation. It needs no surrogate training, iterative optimization, or generator network, and is fully feed-forward and interpretable.
Result: Sharpening markedly improves robustness across a range of transfer-attack settings at very low computational cost. The authors describe it as the first surrogate-free, optimization-free, generator-free, and human-interpretable robustification method.
Conclusion: Image sharpening, a minimal yet effective preprocessing step, offers a new direction for robust machine learning and shows that low-complexity, highly interpretable robustification is feasible and practical.
Abstract: Despite their great success, deep neural networks rely on high-dimensional, non-robust representations, making them vulnerable to imperceptible perturbations, even in transfer scenarios. To address this, both training-time defenses (e.g., adversarial training and robust architecture design) and post-attack defenses (e.g., input purification and adversarial detection) have been extensively studied. Recently, a limited body of work has preliminarily explored a pre-attack defense paradigm, termed preemptive robustification, which introduces subtle modifications to benign samples prior to attack to proactively resist adversarial perturbations. Unfortunately, their practical applicability remains questionable due to several limitations, including (1) reliance on well-trained classifiers as surrogates to provide robustness priors, (2) substantial computational overhead arising from iterative optimization or trained generators for robustification, and (3) limited interpretability of the optimization- or generation-based robustification processes. Inspired by recent studies revealing a positive correlation between texture intensity and the robustness of benign samples, we show that image sharpening alone can efficiently robustify images. To the best of our knowledge, this is the first surrogate-free, optimization-free, generator-free, and human-interpretable robustification approach. Extensive experiments demonstrate that sharpening yields remarkable robustness gains with low computational cost, especially in transfer scenarios.
[134] FEAST: Fully Connected Expressive Attention for Spatial Transcriptomics
Taejin Jeong,Joohyeok Kim,Jinyeong Kim,Chanyoung Kim,Seong Jae Hwang
Main category: cs.CV
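TL;DR: This paper proposes FEAST, a framework that combines a fully connected graph, negative-aware attention, and an off-grid sampling strategy to improve gene expression prediction for spatial transcriptomics and to produce biologically meaningful attention maps.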
Details
Motivation: Spatial transcriptomics (ST) is expensive, which limits its adoption. Existing graph neural network methods depend on pre-defined sparse graphs and struggle to capture complex biological interactions.
Method: FEAST 1) models the tissue as a fully connected graph to capture interactions between all spot pairs; 2) introduces negative-aware attention, which models both excitatory and inhibitory interactions; 3) uses an off-grid sampling strategy to collect images from intermediate regions and enrich the morphological context.
Result: On several public ST datasets, FEAST surpasses state-of-the-art methods in gene expression prediction and produces biologically plausible attention maps that clearly reveal positive and negative interactions.
Conclusion: With more flexible graph modeling, a more biologically meaningful attention mechanism, and richer image context, FEAST improves the analysis of spatial transcriptomics data and offers a new route to low-cost ST inference.
Abstract: Spatial Transcriptomics (ST) provides spatially-resolved gene expression, offering crucial insights into tissue architecture and complex diseases. However, its prohibitive cost limits widespread adoption, leading to significant attention on inferring spatial gene expression from readily available whole slide images. While graph neural networks have been proposed to model interactions between tissue regions, their reliance on pre-defined sparse graphs prevents them from considering potentially interacting spot pairs, resulting in a structural limitation in capturing complex biological relationships. To address this, we propose FEAST (Fully connected Expressive Attention for Spatial Transcriptomics), an attention-based framework that models the tissue as a fully connected graph, enabling the consideration of all pairwise interactions. To better reflect biological interactions, we introduce negative-aware attention, which models both excitatory and inhibitory interactions, capturing essential negative relationships that standard attention often overlooks. Furthermore, to mitigate the information loss from truncated or ignored context in standard spot image extraction, we introduce an off-grid sampling strategy that gathers additional images from intermediate regions, allowing the model to capture a richer morphological context. Experiments on public ST datasets show that FEAST surpasses state-of-the-art methods in gene expression prediction while providing biologically plausible attention maps that clarify positive and negative interactions. Our code is available at https://github.com/starforTJ/FEAST.
[135] Semantic-Aware Prefix Learning for Token-Efficient Image Generation
Qingfeng Li,Haoxian Zhang,Xu He,Songlin Tang,Zhixue Fang,Xiaoqiang Liu,Pengfei Wan,Guoqi Li
Main category: cs.CV
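TL;DR: This paper proposes SMAP, a semantic-aware prefix tokenizer that makes semantic information indispensable during training by injecting class-level semantic conditions and applying a tail token dropping strategy; paired with the CARD hybrid generator, it shows that the semantically aligned latent space is effective for image generation.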
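Tail token dropping can be pictured as keeping only a prefix of the 1D latent token sequence during training, so that the surviving early tokens (together with the semantic condition) must carry the reconstruction. The sketch below assumes the kept length is sampled uniformly per training step; the actual schedule for reducing the token budget is not given in this summary, and `drop_tail_tokens` and `min_keep` are hypothetical names.

```python
import random


def drop_tail_tokens(tokens, min_keep, rng):
    """Keep a random-length prefix of a 1D token sequence and drop the tail.

    Assumption: uniform sampling of the kept length in [min_keep, len(tokens)].
    """
    if not 1 <= min_keep <= len(tokens):
        raise ValueError("min_keep must be between 1 and len(tokens)")
    keep = rng.randint(min_keep, len(tokens))  # inclusive on both ends
    return tokens[:keep]


rng = random.Random(0)          # seeded for reproducibility
latent_tokens = list(range(32))  # stand-in for 32 ordered latent tokens
kept = drop_tail_tokens(latent_tokens, min_keep=4, rng=rng)
```

Because only tails are removed, token order is preserved and any kept sequence is always a valid prefix of the full one.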
Details
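Motivation: Most existing visual tokenizers are trained with reconstruction objectives, which yields latent representations with weak semantics. Some methods strengthen semantic alignment, but they treat the semantic signal only as an auxiliary regularizer rather than a necessary part of representation learning.
Method: The SMAP tokenizer injects class-level semantic conditions into a query-based 1D tokenization framework and applies a tail token dropping strategy, which forces the semantic condition and the early latent prefix to take on more responsibility under a reduced token budget. The paper also designs CARD, a hybrid generator (causal autoregressive plus diffusion), to verify that the latent space is useful for generation and not only for reconstruction.
Result: Experiments on ImageNet show that SMAP improves reconstruction quality in both discrete and continuous tokenization settings, and that its semantically aligned latent space delivers strong downstream generation under compact token budgets.
Conclusion: Semantic information can be designed as a structurally necessary part of tokenization rather than an auxiliary constraint. By combining architecture and training strategy, SMAP achieves efficient visual tokenization with strong semantic alignment and markedly improves generation quality.
Abstract: Visual tokenizers play a central role in latent image generation by bridging high-dimensional images and tractable generative modeling. However, most existing tokenizers are still trained with reconstruction-dominated objectives, which often yield latent representations that are only weakly grounded in high-level semantics. Recent approaches improve semantic alignment, but typically treat semantic signals as auxiliary regularization rather than making them functionally necessary for representation learning. We propose SMAP, a SeMantic-Aware Prefix tokenizer that injects class-level semantic conditions into a query-based 1D tokenization framework. To make semantics indispensable during training, SMAP introduces a tail token dropping strategy, which forces semantic conditions and early latent prefixes to bear increasing responsibility under progressively reduced token budgets. To verify that the resulting latent space is useful for generation rather than reconstruction alone, we further introduce CARD, a hybrid Causal AutoRegressive--Diffusion generator. Extensive experiments on ImageNet show that SMAP consistently improves reconstruction quality across discrete and continuous tokenization settings, and that its semantically grounded latent space yields strong downstream generation performance under compact token budgets.
[136] Activation Matters: Test-time Activated Negative Labels for OOD Detection with Vision-Language Models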
Yabin Zhang,Maya Varma,Yunhe Gao,Jean-Benoit Delbrouck,Jiaming Liu,Chong Wang,Curtis Langlotz
Main category: cs.CV
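TL;DR: This paper proposes TANL, a training-free, test-efficient, and theoretically grounded OOD detection method that dynamically activates negative labels at test time. It adaptively selects negative labels from the activation responses of historical and current-batch samples, and uses an activation-aware score function to improve performance and robustness.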
Details
Motivation: In existing OOD detection methods, preset negative labels that lie far from the ID classes are weakly activated on OOD samples and therefore fail to capture OOD characteristics.
Method: Test-time Activated Negative Labels (TANL) identifies high-confidence test images online and accumulates their assignment probabilities over the corpus to build a label activation metric. It combines the historical test distribution with current-batch information to mine negative labels in a distribution-adaptive and batch-adaptive way, and it uses an activation-aware score function that gives more weight to strongly activated negative labels.
Result: On the large-scale ImageNet benchmark, FPR95 drops from 17.5% to 9.8%. Effectiveness is confirmed across multiple backbones and task settings. The method is training-free, test-efficient, and robust to the number of negative labels.
Conclusion: TANL is a novel, practical, and theoretically interpretable test-time OOD detection framework whose dynamic activation of negative labels markedly improves detection performance and generalization.
Abstract: Out-of-distribution (OOD) detection aims to identify samples that deviate from in-distribution (ID). One popular pipeline addresses this by introducing negative labels distant from ID classes and detecting OOD based on their distance to these labels. However, such labels may present poor activation on OOD samples, failing to capture the OOD characteristics. To address this, we propose Test-time Activated Negative Labels (TANL) by dynamically evaluating activation levels across the corpus dataset and mining candidate labels with high activation responses during the testing process. Specifically, TANL identifies high-confidence test images online and accumulates their assignment probabilities over the corpus to construct a label activation metric. Such a metric leverages historical test samples to adaptively align with the test distribution, enabling the selection of distribution-adaptive activated negative labels. By further exploring the activation information within the current testing batch, we introduce a more fine-grained, batch-adaptive variant. To fully utilize label activation knowledge, we propose an activation-aware score function that emphasizes negative labels with stronger activations, boosting performance and enhancing its robustness to the label number. Our TANL is training-free, test-efficient, and grounded in theoretical justification. Experiments on diverse backbones and wide task settings validate its effectiveness. Notably, on the large-scale ImageNet benchmark, TANL significantly reduces the FPR95 from 17.5% to 9.8%. Codes are available at https://github.com/YBZh/OpenOOD-VLM.
[137] Hyperspectral Trajectory Image for Multi-Month Trajectory Anomaly Detection
Md Awsafur Rahman,Chandrakanth Gudavalli,Hardik Prajapati,B. S. Manjunath
Main category: cs.CV
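TL;DR: This paper proposes TITAnD, which recasts trajectory anomaly detection as a vision problem. It builds a Hyperspectral Trajectory Image (HTI) that unifies dense and sparse trajectory representations and designs a Cyclic Factorized Transformer (CFT) to model the cyclic structure of human activity, enabling efficient anomaly detection on multi-month dense GPS trajectories for the first time.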
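The day x time-of-day layout of an HTI can be sketched as a simple binning step. The version below is a minimal illustration under stated assumptions: timestamps are seconds since midnight of day 0, each point carries an already-computed feature vector, and features falling in the same cell are averaged. The paper's actual channel definitions and aggregation rule are not given here, and `build_hti` is a hypothetical name.

```python
SECONDS_PER_DAY = 86400


def build_hti(points, num_days, bins_per_day, num_channels):
    """Bin (timestamp, features) points into a [day][time-of-day][channel] grid.

    Cells with several points hold the per-channel mean; empty cells stay 0.
    """
    bin_len = SECONDS_PER_DAY / bins_per_day
    grid = [[[0.0] * num_channels for _ in range(bins_per_day)]
            for _ in range(num_days)]
    counts = [[0] * bins_per_day for _ in range(num_days)]
    for t, feats in points:
        day = int(t // SECONDS_PER_DAY)
        if not 0 <= day < num_days:
            continue  # ignore points outside the observation window
        slot = min(int((t % SECONDS_PER_DAY) // bin_len), bins_per_day - 1)
        counts[day][slot] += 1
        for c in range(num_channels):
            grid[day][slot][c] += feats[c]
    for d in range(num_days):
        for s in range(bins_per_day):
            if counts[d][s]:
                grid[d][s] = [v / counts[d][s] for v in grid[d][s]]
    return grid


# Two points in hour 1 of day 0, one point in hour 2 of day 1.
points = [(3600, [1.0, 2.0]), (3700, [3.0, 4.0]), (86400 + 7200, [5.0, 6.0])]
hti = build_hti(points, num_days=2, bins_per_day=24, num_channels=2)
```

Once trajectories are in this fixed-size grid, dense GPS traces and sparse stay-points differ only in how many cells are filled, which is what lets a single image model consume both.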
Details
Motivation: Existing dense GPS methods have quadratic computational cost and cannot support multi-month analysis. Sparse stay-point methods scale but discard fine-grained information, so each regime needs its own architecture and knowledge cannot transfer between them. The authors argue that this bottleneck is unnecessary because human trajectories share a two-dimensional cyclic structure within days and across days.
Method: The TITAnD framework 1) represents a trajectory as a Hyperspectral Trajectory Image (HTI), a "day x time-of-day" grid whose channels encode spatial, semantic, temporal, and kinematic information; 2) turns anomaly detection into image classification (agent level) and semantic segmentation (temporal localization); 3) introduces the Cyclic Factorized Transformer (CFT), which factorizes attention along the two temporal axes, embedding a cyclic prior and greatly reducing computational cost.
Result: TITAnD achieves the best AUC-PR on both dense and sparse benchmarks, outperforms vision models such as UNet, and runs 11-75x faster than a standard Transformer with comparable memory. It enables anomaly detection on multi-month dense GPS trajectories for the first time.
Conclusion: Recasting trajectory anomaly detection as a vision task and combining it with structure-aware modeling (such as the cyclic prior) is key to balancing accuracy, efficiency, and generalization. TITAnD offers a unified and efficient paradigm for multi-scale trajectory analysis.
Abstract: Trajectory anomaly detection underpins applications from fraud detection to urban mobility analysis. Dense GPS methods preserve fine-grained evidence such as abnormal speeds and short-duration events, but their quadratic cost makes multi-month analysis intractable; consequently, no existing approach detects anomalies over multi-month dense GPS trajectories. The field instead relies on scalable sparse stay-point methods that discard this evidence, forcing separate architectures for each regime and preventing knowledge transfer. We argue this bottleneck is unnecessary: human trajectories, dense or sparse, share a natural two-dimensional cyclic structure along within-day and across-day axes. We therefore propose TITAnD (Trajectory Image Transformer for Anomaly Detection), which reformulates trajectory anomaly detection as a vision problem by representing trajectories as a Hyperspectral Trajectory Image (HTI): a day x time-of-day grid whose channels encode spatial, semantic, temporal, and kinematic information from either modality, unifying both under a single representation. Under this formulation, agent-level detection reduces to image classification and temporal localization to semantic segmentation. To model this representation, we introduce the Cyclic Factorized Transformer (CFT), which factorizes attention along the two temporal axes, encoding the cyclic inductive bias of human routines, while reducing attention cost by orders of magnitude and enabling dense multi-month anomaly detection for the first time. Empirically, TITAnD achieves the best AUC-PR across sparse and dense benchmarks, surpassing vision models like UNet while being 11-75x faster than the Transformer with comparable memory, demonstrating that vision reformulation and structure-aware modeling are jointly essential. Code will be made public soon.
[138] Towards Practical Lossless Neural Compression for LiDAR Point Clouds
Pengpeng Yu,Haoran Li,Runqing Jiang,Dingquan Li,Jing Wang,Liang Lin,Yulan Guo
Main category: cs.CV
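TL;DR: This paper proposes an efficient lossless predictive coding framework for LiDAR point clouds. Two lightweight modules, geometry re-densification and cross-scale feature propagation, improve compression performance while keeping coding fast, and an integer-only inference pipeline ensures bit-exact cross-platform consistency.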
Details
Motivation: The geometric details of LiDAR point clouds are extremely sparse, which makes context modeling inefficient and limits the speed and performance of existing compression methods.
Method: The framework has two modules: 1) a Geometry Re-Densification Module that iterates densification, feature extraction, and re-sparsification to avoid costly computation on sparse details; 2) a Cross-scale Feature Propagation Module that uses multi-resolution occupancy cues to guide hierarchical feature propagation. An integer-only inference pipeline guarantees bit-exact results.
Result: Experiments show competitive lossless compression performance at real-time speed.
Conclusion: The proposed compact representation and integer-only inference balance efficiency, performance, and deployment robustness for LiDAR point cloud compression.
Abstract: LiDAR point clouds are fundamental to various applications, yet the extreme sparsity of high-precision geometric details hinders efficient context modeling, thereby limiting the compression speed and performance of existing methods. To address this challenge, we propose a compact representation for efficient predictive lossless coding. Our framework comprises two lightweight modules. First, the Geometry Re-Densification Module iteratively densifies encoded sparse geometry, extracts features at a dense scale, and then sparsifies the features for predictive coding. This module avoids costly computation on highly sparse details while maintaining a lightweight prediction head. Second, the Cross-scale Feature Propagation Module leverages occupancy cues from multiple resolution levels to guide hierarchical feature propagation, enabling information sharing across scales and reducing redundant feature extraction. Additionally, we introduce an integer-only inference pipeline to enable bit-exact cross-platform consistency, which avoids the entropy-coding collapse observed in existing neural compression methods and further accelerates coding. Experiments demonstrate competitive compression performance at real-time speed. Code will be released upon acceptance. Code is available at https://github.com/pengpeng-yu/FastPCC.
[139] ViewSplat: View-Adaptive Dynamic Gaussian Splatting for Feed-Forward Synthesis
Moonyeon Jeong,Seunggi Min,Suhyeon Lee,Hongje Seong
Main category: cs.CV
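TL;DR: ViewSplat is a view-adaptive 3D Gaussian splatting method in which dynamic MLPs adjust Gaussian attributes according to the target view at render time, markedly improving novel view synthesis quality while keeping fast inference and real-time rendering.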
Details
Motivation: Existing feed-forward 3D Gaussian splatting methods are limited by the expressive capacity of static Gaussian primitives and cannot maintain fidelity across all viewpoints, leaving a fidelity gap.
Method: ViewSplat learns a view-adaptable latent representation. It first predicts base Gaussian primitives and the weights of dynamic MLPs. At render time, the MLPs take the target view coordinates as input and output view-dependent residual updates for each Gaussian attribute (position, scale, rotation, opacity, and color).
Result: ViewSplat reaches SOTA fidelity on novel view synthesis, with inference at 17 FPS and real-time rendering at 154 FPS.
Conclusion: View-adaptive dynamic splatting compensates for the limited expressiveness of static primitives and combines high fidelity with high efficiency.
Abstract: We present ViewSplat, a view-adaptive 3D Gaussian splatting network for novel view synthesis from unposed images. While recent feed-forward 3D Gaussian splatting has significantly accelerated 3D scene reconstruction by bypassing per-scene optimization, a fundamental fidelity gap remains. We attribute this bottleneck to the limited capacity of single-step feed-forward networks to regress static Gaussian primitives that satisfy all viewpoints. To address this limitation, we shift the paradigm from static primitive regression to view-adaptive dynamic splatting. Instead of a rigid Gaussian representation, our pipeline learns a view-adaptable latent representation. Specifically, ViewSplat initially predicts base Gaussian primitives alongside the weights of dynamic MLPs. During rendering, these MLPs take target view coordinates as input and predict view-dependent residual updates for each Gaussian attribute (i.e., 3D position, scale, rotation, opacity, and color). This mechanism, which we term view-adaptive dynamic splatting, allows each primitive to rectify initial estimation errors, effectively capturing high-fidelity appearances. Extensive experiments demonstrate that ViewSplat achieves state-of-the-art fidelity while maintaining fast inference (17 FPS) and real-time rendering (154 FPS).
[140] EagleNet: Energy-Aware Fine-Grained Relationship Learning Network for Text-Video Retrieval
Yuhan Chen,Pengwen Dai,Chuan Wang,Dayan Wu,Xiaochun Cao
Main category: cs.CV
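TL;DR: This paper proposes EagleNet, which uses fine-grained relationship learning and an energy-aware matching mechanism to enrich text representations so that they better match video semantics, improving cross-modal retrieval performance.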
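EagleNet replaces the softmax-based contrastive loss with a sigmoid loss. As a rough illustration of what a pairwise sigmoid loss looks like, the sketch below treats every text-video pair as an independent binary classification, in the style popularized by SigLIP. The exact formulation, temperature, and bias used by the authors are not stated in this summary, so `temperature`, `bias`, and the function names are assumptions.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid_contrastive_loss(sim, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss over a square text-video similarity matrix.

    sim[i][j] is the similarity of text i and video j; matched pairs sit on
    the diagonal. Each pair contributes -log(sigmoid(z * (t * sim + b))),
    with z = +1 for matches and z = -1 for non-matches.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            z = 1.0 if i == j else -1.0
            logit = z * (temperature * sim[i][j] + bias)
            total += softplus(-logit)  # equals -log(sigmoid(logit))
    return total / n


# Matched pairs scoring high (aligned) should give a lower loss than
# mismatched pairs scoring high (misaligned).
aligned = [[1.0, 0.0], [0.0, 1.0]]
misaligned = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = sigmoid_contrastive_loss(aligned)
loss_misaligned = sigmoid_contrastive_loss(misaligned)
```

Unlike a softmax loss, each pair's term does not depend on the other entries in its row or column, so no batch-wide normalization is needed.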
Details
Motivation: Existing methods model only the interactions between text and video frames and ignore the rich contextual relations among frames within a video. The expanded text therefore lacks frame context, and text and video representations become inconsistent.
Method: The Energy-Aware Fine-Grained Relationship Learning Network (EagleNet) includes: 1) a Fine-Grained Relationship Learning mechanism (FRL) that builds a text-frame graph and aggregates text candidates to incorporate frame context; 2) Energy-Aware Matching (EAM), which models the energy of text-frame interactions to capture the distribution of real text-video pairs accurately; 3) a sigmoid contrastive loss in place of the softmax loss to improve cross-modal alignment and training stability.
Result: EagleNet reaches SOTA or leading performance on four mainstream text-video retrieval datasets: MSRVTT, DiDeMo, MSVD, and VATEX.
Conclusion: By explicitly modeling inter-frame context and energy-aware fine-grained interactions, EagleNet makes text representations semantically richer and more context-aware, which markedly improves text-video cross-modal retrieval.
Abstract: Text-video retrieval tasks have seen significant improvements due to the recent development of large-scale vision-language pre-trained models. Traditional methods primarily focus on video representations or cross-modal alignment, while recent works shift toward enriching text expressiveness to better match the rich semantics in videos. However, these methods use only interactions between text and frames/video, and ignore rich interactions among the internal frames within a video, so the final expanded text cannot capture frame contextual information, leading to disparities between text and video. In response, we introduce Energy-Aware Fine-Grained Relationship Learning Network (EagleNet) to generate accurate and context-aware enriched text embeddings. Specifically, the proposed Fine-Grained Relationship Learning mechanism (FRL) first constructs a text-frame graph by the generated text candidates and frames, then learns relationships among texts and frames, which are finally used to aggregate text candidates into an enriched text embedding that incorporates frame contextual information. To further improve fine-grained relationship learning in FRL, we design Energy-Aware Matching (EAM) to model the energy of text-frame interactions and thus accurately capture the distribution of real text-video pairs. Moreover, for more effective cross-modal alignment and stable training, we replace the conventional softmax-based contrastive loss with the sigmoid loss. Extensive experiments have demonstrated the superiority of EagleNet across MSRVTT, DiDeMo, MSVD, and VATEX. Codes are available at https://github.com/draym28/EagleNet.
[141] V2U4Real: A Real-world Large-scale Dataset for Vehicle-to-UAV Cooperative Perception
Weijia Li,Haoen Xiang,Tianxu Wang,Shuaibing Wu,Qiming Xia,Cheng Wang,Chenglu Wen
Main category: cs.CV
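TL;DR: This paper presents V2U4Real, the first large-scale real-world multi-modal dataset for vehicle-to-UAV cooperative perception, aimed at the limits of ground-level cooperative perception under large-scale occlusion and at long range.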
Details
Motivation: Modern autonomous driving perception systems are constrained by occlusions, blind spots, and sensing range. Existing Vehicle-to-Vehicle (V2V) and Vehicle-to-Infrastructure (V2I) cooperation is limited to the ground level and struggles with large-scale occlusion and long-range perception in complex environments.
Method: The authors build V2U4Real, the first real-world multi-modal dataset for Vehicle-to-UAV (V2U) cooperative perception, with data from multi-view LiDARs and RGB cameras mounted on a ground vehicle and a UAV, and establish three benchmark tasks: single-agent 3D detection, cooperative 3D detection, and object tracking.
Result: The dataset covers urban streets, campuses, and rural roads, with over 56K LiDAR frames, 56K multi-view images, and 700K annotated 3D boxes across four classes. Several SOTA models confirm that V2U cooperation markedly improves perception robustness and long-range awareness.
Conclusion: V2U4Real provides key data and a benchmark platform for cross-view cooperative perception research and advances air-ground cooperative perception.
Abstract: Modern autonomous vehicle perception systems are often constrained by occlusions, blind spots, and limited sensing range. While existing cooperative perception paradigms, such as Vehicle-to-Vehicle (V2V) and Vehicle-to-Infrastructure (V2I), have demonstrated their effectiveness in mitigating these challenges, they remain limited to ground-level collaboration and cannot fully address large-scale occlusions or long-range perception in complex environments. To advance research in cross-view cooperative perception, we present V2U4Real, the first large-scale real-world multi-modal dataset for Vehicle-to-UAV (V2U) cooperative object perception. V2U4Real is collected by a ground vehicle and a UAV equipped with multi-view LiDARs and RGB cameras. The dataset covers urban streets, university campuses, and rural roads under diverse traffic scenarios, comprising over 56K LiDAR frames, 56K multi-view camera images, and 700K annotated 3D bounding boxes across four classes. To support a wide range of research tasks, we establish benchmarks for single-agent 3D object detection, cooperative 3D object detection, and object tracking. Comprehensive evaluations of several state-of-the-art models demonstrate the effectiveness of V2U cooperation in enhancing perception robustness and long-range awareness. The V2U4Real dataset and codebase is available at https://github.com/VjiaLi/V2U4Real.
[142] Towards Controllable Low-Light Image Enhancement: A Continuous Multi-illumination Dataset and Efficient State Space Framework
Hongru Han,Tingrui Guo,Liming Zhang,Yan Su,Qiwen Xu,Zhuohua Ye
Main category: cs.CV
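TL;DR: This paper proposes CLE-RWKV, a framework for Controllable Low-light Enhancement (CLE). It introduces the new Light100 benchmark and a noise-decoupled supervision strategy in the HVI color space to address the luminance inconsistency that the ill-posed nature of the task causes in traditional LLIE methods, and it uses a Space-to-Depth strategy to adapt SSMs to dense prediction.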
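Space-to-Depth itself is a standard rearrangement: each block x block spatial neighborhood is folded into the channel dimension, so a flattened sequence of the result keeps local neighbors together inside one token. The sketch below shows the generic operation on nested lists in height x width x channel layout; how CLE-RWKV wires it into the network is not described in this summary, and `space_to_depth` here is an illustrative helper rather than the authors' code.

```python
def space_to_depth(x, block):
    """Fold block x block neighborhoods of an H x W x C array into channels.

    Input:  nested lists with shape H x W x C.
    Output: nested lists with shape (H/block) x (W/block) x (C*block*block).
    """
    h, w = len(x), len(x[0])
    if h % block or w % block:
        raise ValueError("height and width must be divisible by block")
    out = []
    for i in range(0, h, block):
        row = []
        for j in range(0, w, block):
            cell = []
            for di in range(block):       # rows inside the neighborhood
                for dj in range(block):   # columns inside the neighborhood
                    cell.extend(x[i + di][j + dj])
            row.append(cell)
        out.append(row)
    return out


# A 2x2 single-channel image becomes one position with 4 channels.
tiny = [[[1], [2]],
        [[3], [4]]]
folded = space_to_depth(tiny, block=2)

# A 4x4 single-channel image becomes a 2x2 grid with 4 channels each.
image = [[[r * 4 + c] for c in range(4)] for r in range(4)]
folded_image = space_to_depth(image, block=2)
```

The sequence length seen by the model shrinks by a factor of block squared while no pixel values are discarded, which is consistent with the summary's point about keeping linear complexity.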
Details
Motivation: Traditional low-light image enhancement (LLIE) is modeled as a deterministic mapping, but the task is inherently ill-posed and has a multimodal solution space. This causes luminance mismatches between predictions and labels and often requires gt-mean post-processing. A well-posed paradigm that controls illumination intensity explicitly is needed.
Method: The paper proposes the Controllable Low-light Enhancement (CLE) paradigm and the CLE-RWKV framework: 1) it releases Light100, a benchmark with continuous real-world illumination changes; 2) it applies noise-decoupled supervision in the HVI color space to separate illumination modulation from texture restoration; 3) it designs a Space-to-Depth (S2D) strategy that adapts SSMs to dense prediction, restoring local inductive biases and bridging the "scanning gap" of flattened scan sequences.
Result: Experiments on seven benchmarks show competitive performance and robust luminance control, with much less reliance on gt-mean post-processing, providing a viable alternative for real-world multi-illumination scenes.
Conclusion: Moving LLIE from an ill-posed deterministic mapping to a well-posed, conditionally controllable task is reasonable and effective. CLE-RWKV and its supporting components (Light100, HVI decoupled supervision, S2D-SSM) together form a more practical, controllable, and physically meaningful route to low-light enhancement.
Abstract: Low-light image enhancement (LLIE) has traditionally been formulated as a deterministic mapping. However, this paradigm often struggles to account for the ill-posed nature of the task, where unknown ambient conditions and sensor parameters create a multimodal solution space. Consequently, state-of-the-art methods frequently encounter luminance discrepancies between predictions and labels, often necessitating "gt-mean" post-processing to align output luminance for evaluation. To address this fundamental limitation, we propose a transition toward Controllable Low-light Enhancement (CLE), explicitly reformulating the task as a well-posed conditional problem. To this end, we introduce CLE-RWKV, a holistic framework supported by Light100, a new benchmark featuring continuous real-world illumination transitions. To resolve the conflict between luminance control and chromatic fidelity, a noise-decoupled supervision strategy in the HVI color space is employed, effectively separating illumination modulation from texture restoration. Architecturally, to adapt efficient State Space Models (SSMs) for dense prediction, we leverage a Space-to-Depth (S2D) strategy. By folding spatial neighborhoods into channel dimensions, this design allows the model to recover local inductive biases and effectively bridge the "scanning gap" inherent in flattened visual sequences without sacrificing linear complexity. Experiments across seven benchmarks demonstrate that our approach achieves competitive performance and robust controllability, providing a real-world multi-illumination alternative that significantly reduces the reliance on gt-mean post-processing.
[143] Adaptive Learned Image Compression with Graph Neural Networks
Yunuo Chen,Bing He,Zezheng Lyu,Hongwei Hu,Qunshan Gu,Yuan Tian,Guo Lu
Main category: cs.CV
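TL;DR: This paper proposes GLIC, an adaptive image compression framework based on graph neural networks (GNNs). By building dual-scale graphs and dynamically adjusting the number of neighbors per node, it overcomes the rigidity of CNNs and Transformers in modeling image redundancy and reaches SOTA performance on several datasets.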
Details
Motivation: Mainstream learned image compression methods (CNNs or Transformers) have fixed receptive fields and static connectivity patterns. They struggle to model spatially varying redundancy adaptively, especially global redundancy, and tend to couple non-redundant pixels by mistake.
Method: The GNN-based GLIC framework builds dual-scale graphs for flexible, data-driven receptive fields and introduces an adaptive connectivity mechanism that adjusts the number of neighbors of each node according to local content complexity.
Result: On Kodak, Tecnick, and CLIC, GLIC reduces BD-rate by 19.29%, 21.69%, and 18.71% respectively relative to VTM-9.1, reaching SOTA performance.
Conclusion: Content-adaptive graph modeling markedly improves GLIC's ability to model multi-scale image redundancy and demonstrates the effectiveness and potential of GNNs for efficient, adaptive image compression.
Abstract: Efficient image compression relies on modeling both local and global redundancy. Most state-of-the-art (SOTA) learned image compression (LIC) methods are based on CNNs or Transformers, which are inherently rigid. Standard CNN kernels and window-based attention mechanisms impose fixed receptive fields and static connectivity patterns, which potentially couple non-redundant pixels simply due to their proximity in Euclidean space. This rigidity limits the model's ability to adaptively capture spatially varying redundancy across the image, particularly at the global level. To overcome these limitations, we propose a content-adaptive image compression framework based on Graph Neural Networks (GNNs). Specifically, our approach constructs dual-scale graphs that enable flexible, data-driven receptive fields. Furthermore, we introduce adaptive connectivity by dynamically adjusting the number of neighbors for each node based on local content complexity. These innovations empower our Graph-based Learned Image Compression (GLIC) model to effectively model diverse redundancy patterns across images, leading to more efficient and adaptive compression. Experiments demonstrate that GLIC achieves state-of-the-art performance, achieving BD-rate reductions of 19.29%, 21.69%, and 18.71% relative to VTM-9.1 on Kodak, Tecnick, and CLIC, respectively. Code will be released at https://github.com/UnoC-727/GLIC.
[144] MACRO: Advancing Multi-Reference Image Generation with Structured Long-Context Data
Zhekai Chen,Yuqing Wang,Manyuan Zhang,Xihui Liu
Main category: cs.CV
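TL;DR: This paper introduces the MacroData dataset and the MacroBench benchmark to address the performance degradation that a data bottleneck causes in multi-reference image generation, and substantially improves generation under multi-reference conditions.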
Details
Motivation: Existing models degrade severely in multi-reference image generation as the number of reference images grows. The root cause is the lack of large-scale, structured, long-context supervision data that supports learning dense dependencies between references.
Method: The authors build MacroData, a dataset of 400K samples with up to 10 reference images each, organized along four dimensions: customization, illustration, spatial reasoning, and temporal dynamics. They also propose MacroBench, a benchmark of 4,000 samples that evaluates generative coherence across task dimensions and input scales.
Result: Fine-tuning on MacroData substantially improves multi-reference image generation. Ablations show synergistic gains from cross-task co-training and from effective strategies for handling long context.
Conclusion: MacroData and MacroBench provide a key data foundation and evaluation standard for multi-reference image generation and advance the field.
Abstract: Generating images conditioned on multiple visual references is critical for real-world applications such as multi-subject composition, narrative illustration, and novel view synthesis, yet current models suffer from severe performance degradation as the number of input references grows. We identify the root cause as a fundamental data bottleneck: existing datasets are dominated by single- or few-reference pairs and lack the structured, long-context supervision needed to learn dense inter-reference dependencies. To address this, we introduce MacroData, a large-scale dataset of 400K samples, each containing up to 10 reference images, systematically organized across four complementary dimensions -- Customization, Illustration, Spatial reasoning, and Temporal dynamics -- to provide comprehensive coverage of the multi-reference generation space. Recognizing the concurrent absence of standardized evaluation protocols, we further propose MacroBench, a benchmark of 4,000 samples that assesses generative coherence across graded task dimensions and input scales. Extensive experiments show that fine-tuning on MacroData yields substantial improvements in multi-reference generation, and ablation studies further reveal synergistic benefits of cross-task co-training and effective strategies for handling long-context complexity. The dataset and benchmark will be publicly released.
[145] HeSS: Head Sensitivity Score for Sparsity Redistribution in VGGT
Yongsung Kim,Wooseok Song,Jaihyun Lew,Hun Hwangbo,Jaehoon Lee,Sungroh Yoon
Main category: cs.CV
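TL;DR: This paper proposes a two-stage sparsification method based on differences in head sensitivity (HeSS) to mitigate the accuracy loss that uniform sparsification causes in the global attention layers of the Visual Geometry Grounded Transformer (VGGT).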
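The budget-reallocation idea in this entry can be illustrated with a minimal Python sketch. It assumes per-head sensitivity scores are already available (the paper derives them from a Hessian approximation, which is not reproduced here) and splits a fixed average keep ratio across heads in proportion to those scores. The function name `reallocate_budget`, the `min_keep` floor, and the proportional rule are all illustrative assumptions, not the authors' implementation.

```python
def reallocate_budget(scores, mean_keep, min_keep=0.05, max_keep=1.0):
    """Split a fixed attention budget across heads by sensitivity.

    scores    : hypothetical per-head sensitivity scores (non-negative floats)
    mean_keep : average fraction of attention entries kept per head
    Returns one keep ratio per head; their mean equals mean_keep.
    """
    if not min_keep <= mean_keep <= max_keep:
        raise ValueError("mean_keep must lie between min_keep and max_keep")
    n = len(scores)
    keep = [min_keep] * n                   # every head keeps a small floor
    remaining = (mean_keep - min_keep) * n  # budget still to distribute
    free = list(range(n))                   # heads not yet capped at max_keep
    while remaining > 1e-12 and free:
        weight = sum(scores[i] for i in free)
        # Fall back to an equal split if all remaining scores are zero.
        shares = {i: (scores[i] / weight if weight > 0 else 1.0 / len(free))
                  for i in free}
        capped = [i for i in free if keep[i] + remaining * shares[i] >= max_keep]
        if not capped:
            for i in free:
                keep[i] += remaining * shares[i]
            remaining = 0.0
        else:
            # Saturate the capped heads, then redistribute what is left.
            for i in capped:
                remaining -= max_keep - keep[i]
                keep[i] = max_keep
            free = [i for i in free if i not in capped]
    return keep


ratios = reallocate_budget([4.0, 1.0, 1.0, 2.0], mean_keep=0.25)
print(ratios)  # sensitive heads receive denser attention
```

The total budget is unchanged by the reallocation; only its distribution over heads differs from the uniform baseline.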
Details
Motivation: Existing sparsification methods apply a uniform sparsity pattern to all attention heads, ignoring how differently heads respond to sparsification, which causes substantial accuracy loss.
Method: A two-stage sparsification pipeline: stage one defines and computes the Head Sensitivity Score (HeSS) to quantify each head's sensitivity to sparsification; stage two allocates the attention budget according to HeSS, keeping more connections for sensitive heads and applying higher sparsity to robust ones.
Result: Experiments show that HeSS accurately captures the heterogeneity of head sensitivity; the method markedly mitigates degradation at high sparsity and is robust across sparsity levels.
Conclusion: Heterogeneous head sensitivity is a key factor in sparsification quality; HeSS-based differentiated sparsification helps models such as VGGT stay accurate while remaining efficient.
Abstract: Visual Geometry Grounded Transformer (VGGT) has advanced 3D vision, yet its global attention layers suffer from quadratic computational costs that hinder scalability. Several sparsification-based acceleration techniques have been proposed to alleviate this issue, but they often suffer from substantial accuracy degradation. We hypothesize that the accuracy degradation stems from the heterogeneity in head-wise sparsification sensitivity, as the existing methods apply a uniform sparsity pattern across all heads. Motivated by this hypothesis, we present a two-stage sparsification pipeline that effectively quantifies and exploits head-wise sparsification sensitivity. In the first stage, we measure head-wise sparsification sensitivity using a novel metric, the Head Sensitivity Score (HeSS), which approximates the Hessian with respect to two distinct error terms on a small calibration set. In the inference stage, we perform HeSS-Guided Sparsification, leveraging the pre-computed HeSS to reallocate the total attention budget, assigning denser attention to sensitive heads and sparser attention to more robust ones. We demonstrate that HeSS effectively captures head-wise sparsification sensitivity and empirically confirm that attention heads in the global attention layers exhibit heterogeneous sensitivity characteristics. Extensive experiments further show that our method effectively mitigates performance degradation under high sparsity, demonstrating strong robustness across varying sparsification levels. Code is available at https://github.com/libary753/HeSS.
[146] Image Rotation Angle Estimation: Comparing Circular-Aware Methods
Maximilian Woehrer
Main category: cs.CV
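TL;DR: This paper systematically studies five circular-aware methods for automatic image rotation estimation across several modern architectures. Probabilistic methods (especially the circular Gaussian distribution) are the most robust, while classification is the most accurate on well-matched backbones; the best configurations reach mean absolute errors of 1.23°, 3.71°, and 2.84° on DRC-D, COCO 2014, and COCO 2017, respectively.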
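The boundary discontinuity that motivates circular-aware methods can be shown with a short Python sketch: a naive absolute difference treats 359° and 1° as 358° apart, whereas a wrapped difference gives 2°. The helper names below are illustrative assumptions; the unit-vector encoding is one standard way to avoid the discontinuity and is not claimed to match the paper's exact heads or losses.

```python
import math


def circular_abs_error(pred_deg, target_deg):
    """Smallest absolute angular difference in degrees, always in [0, 180]."""
    diff = (pred_deg - target_deg) % 360.0
    return min(diff, 360.0 - diff)


def to_unit_vector(angle_deg):
    """Encode an angle as (cos, sin), which is continuous across 0/360."""
    rad = math.radians(angle_deg)
    return math.cos(rad), math.sin(rad)


def from_unit_vector(x, y):
    """Decode a (cos, sin) pair back to an angle in [0, 360)."""
    return math.degrees(math.atan2(y, x)) % 360.0


print(abs(359.0 - 1.0))                 # naive error: 358.0
print(circular_abs_error(359.0, 1.0))   # wrapped error: 2.0
```

A regression head trained on the (cos, sin) pair never sees the jump at the 0°/360° boundary, which is why unit-vector targets are a common alternative to regressing the raw angle.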
Details
Motivation: Automatic image rotation estimation is a key preprocessing step in vision pipelines, but angles have circular topology, so standard regression suffers from boundary discontinuities.
Method: Systematically evaluates five circular-aware methods (direct angle regression with a circular loss, classification via angular binning, unit-vector regression, phase-shifting coder, and circular Gaussian distribution), starting from ImageNet-pretrained models and adapting the output heads of sixteen modern architectures.
Result: The circular Gaussian distribution is the most robust across backbones; classification with EfficientViT-B3 reaches 1.23° MAE on DRC-D, while the circular Gaussian distribution with MambaOut Base reaches 1.24° and is more stable; on COCO 2014 and COCO 2017 the best configuration reaches 3.71° and 2.84° MAE, substantially improving on prior work.
Conclusion: Circular-aware modeling is essential for rotation estimation; probabilistic modeling (such as the circular Gaussian distribution) generalizes better and is more stable, whereas classification requires a carefully matched backbone.
Abstract: Automatic image rotation estimation is a key preprocessing step in many vision pipelines. This task is challenging because angles have circular topology, creating boundary discontinuities that hinder standard regression methods. We present a comprehensive study of five circular-aware methods for global orientation estimation: direct angle regression with circular loss, classification via angular binning, unit-vector regression, phase-shifting coder, and circular Gaussian distribution. Using transfer learning from ImageNet-pretrained models, we systematically evaluate these methods across sixteen modern architectures by adapting their output heads for rotation-specific predictions. Our results show that probabilistic methods, particularly the circular Gaussian distribution, are the most robust across architectures, while classification achieves the best accuracy on well-matched backbones but suffers training instabilities on others. The best configuration (classification with EfficientViT-B3) achieves a mean absolute error (MAE) of 1.23° (mean across five independent runs) on the DRC-D dataset, while the circular Gaussian distribution with MambaOut Base achieves a virtually identical 1.24° with greater robustness across backbones. When our top-performing method-architecture combinations are trained and evaluated on COCO 2014, the best configuration reaches 3.71° MAE, improving substantially over prior work, with further improvement to 2.84° on the larger COCO 2017 dataset.
[147] InstanceAnimator: Multi-Instance Sketch Video Colorization
Yinhan Zhang,Yue Ma,Bingyuan Wang,Kunyu Feng,Yeying Jin,Qifeng Chen,Anyi Rao,Zeyu Wang
Main category: cs.CV
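TL;DR: InstanceAnimator is a new Diffusion Transformer framework for multi-instance sketch video colorization; a canvas guidance condition, an instance matching mechanism, and an adaptive decoupled control module improve user control, multi-character alignment accuracy, and detail fidelity.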
Details
Motivation: Existing methods have three limitations: reliance on a single reference frame makes user control inflexible, poor instance controllability causes misalignment in multi-character scenes, and colorization detail degrades in fine-grained regions.
Method: Three core innovations: (1) a Canvas Guidance Condition that allows free placement of reference elements and background; (2) an Instance Matching Mechanism that integrates instance features with the sketches for precise multi-character control; (3) an Adaptive Decoupled Control Module that injects semantic features from characters, backgrounds, and text into the diffusion process to improve detail fidelity.
Result: Experiments show that InstanceAnimator delivers stronger user control, higher visual quality, and better instance consistency on multi-instance colorization.
Conclusion: InstanceAnimator addresses the control, alignment, and detail-fidelity challenges of multi-instance sketch video colorization and offers a new paradigm for interactive video generation.
Abstract: We propose InstanceAnimator, a novel Diffusion Transformer framework for multi-instance sketch video colorization. Existing methods suffer from three core limitations: inflexible user control due to heavy reliance on single reference frames, poor instance controllability leading to misalignment in multi-character scenarios, and degraded detail fidelity in fine-grained regions. To address these challenges, we introduce three corresponding innovations. First, a Canvas Guidance Condition eliminates workflow fragmentation by allowing free placement of reference elements and background, enabling unprecedented user flexibility. Second, an Instance Matching Mechanism resolves misalignment by integrating instance features with the sketches, ensuring precise control over multiple characters. Third, an Adaptive Decoupled Control Module enhances detail fidelity by injecting semantic features from characters, backgrounds, and text conditions into the diffusion process. Extensive experiments demonstrate that InstanceAnimator achieves superior multi-instance colorization with enhanced user control, high visual quality, and strong instance consistency.
[148] CLIP-RD: Relational Distillation for Efficient CLIP Knowledge Distillation
Jeannie Chung,Hanna Jang,Ingyeong Yang,Uiwon Hwang,Jaehyung Sim
Main category: cs.CV
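TL;DR: This paper proposes CLIP-RD, a relational knowledge distillation framework comprising Vertical Relational Distillation (VRD) and Cross Relational Distillation (XRD), to better preserve the relational structure of the CLIP teacher's multimodal embeddings and improve the zero-shot performance of lightweight student models.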
Details
Motivation: Existing CLIP distillation methods do not explicitly model multi-directional relational dependencies between teacher and student embeddings, so the student struggles to preserve the structural relationships encoded by the teacher.
Method: Proposes VRD (which enforces distribution-level consistency of distillation strength across modalities) and XRD (which imposes bidirectional symmetry on cross-modal teacher-student similarity distributions) to jointly model multi-directional relational structure.
Result: CLIP-RD outperforms existing methods by 0.8 percentage points on zero-shot classification.
Conclusion: By explicitly modeling multi-directional relations between teacher and student embeddings, CLIP-RD aligns the student's embedding geometry more faithfully with the teacher's and improves lightweight model performance.
Abstract: CLIP aligns image and text embeddings via contrastive learning and demonstrates strong zero-shot generalization. Its large-scale architecture requires substantial computational and memory resources, motivating the distillation of its capabilities into lightweight student models. However, existing CLIP distillation methods do not explicitly model multi-directional relational dependencies between teacher and student embeddings, limiting the student's ability to preserve the structural relationships encoded by the teacher. To address this, we propose a relational knowledge distillation framework that introduces two novel methods, Vertical Relational Distillation (VRD) and Cross Relational Distillation (XRD). VRD enforces consistency of teacher-student distillation strength across modalities at the distribution level, while XRD imposes bidirectional symmetry on cross-modal teacher-student similarity distributions. By jointly modeling multi-directional relational structures, CLIP-RD promotes faithful alignment of the student embedding geometry with that of the teacher, outperforming existing methods by 0.8 percentage points.
[149] Multimodal Dataset Distillation via Phased Teacher Models
Shengbin Guo,Hang Zhao,Senqiao Yang,Chenyang Jiang,Yuhang Cheng,Xiangru Peng,Rui Shao,Zhuotao Tian
Main category: cs.CV
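TL;DR: This paper proposes PTM-ST, a phased teacher model framework for multimodal dataset distillation that uses stage-aware modeling and a shortcut trajectory strategy to improve distillation stability and expressiveness, clearly outperforming existing methods on Flickr30k and COCO.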
Details
Motivation: Existing methods struggle to capture the complex, dynamically evolving knowledge in the later training stages of teacher models, which degrades student performance and the quality of the distilled data; they also suffer from large cross-stage performance gaps and unstable teacher trajectories.
Method: Proposes the Phased Teacher Model with Shortcut Trajectory (PTM-ST), which uses stage-aware teacher modeling and shortcut-based trajectory construction to fit the teacher's learning dynamics across training phases.
Result: Theoretical analysis and experiments show that PTM-ST significantly mitigates optimization oscillations and inter-phase knowledge gaps while reducing storage overhead; it consistently surpasses SOTA methods on Flickr30k and COCO, with up to 13.5% absolute improvement and a 9.53% average gain on Flickr30k.
Conclusion: PTM-ST improves the stability, expressiveness, and efficiency of multimodal dataset distillation and offers a new paradigm for compressing and transferring knowledge from large-scale image-text data.
Abstract: Multimodal dataset distillation aims to construct compact synthetic datasets that enable efficient compression and knowledge transfer from large-scale image-text data. However, existing approaches often fail to capture the complex, dynamically evolving knowledge embedded in the later training stages of teacher models. This limitation leads to degraded student performance and compromises the quality of the distilled data. To address critical challenges such as pronounced cross-stage performance gaps and unstable teacher trajectories, we propose Phased Teacher Model with Shortcut Trajectory (PTM-ST) -- a novel phased distillation framework. PTM-ST leverages stage-aware teacher modeling and a shortcut-based trajectory construction strategy to accurately fit the teacher's learning dynamics across distinct training phases. This enhances both the stability and expressiveness of the distillation process. Through theoretical analysis and comprehensive experiments, we show that PTM-ST significantly mitigates optimization oscillations and inter-phase knowledge gaps, while also reducing storage overhead. Our method consistently surpasses state-of-the-art baselines on Flickr30k and COCO, achieving up to 13.5% absolute improvement and an average gain of 9.53% on Flickr30k. Code: https://github.com/Previsior/PTM-ST.
[150] FSGNet: A Frequency-Aware and Semantic Guidance Network for Infrared Small Target Detection
Yingmei Zhang,Wangtao Bao,Yong Yang,Weiguo Wan,Qin Xiao,Xueting Zou
Main category: cs.CV
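TL;DR: This paper proposes FSGNet, a lightweight and efficient infrared small target detection framework that combines frequency-aware and semantic guidance mechanisms; multi-directional interactive attention, multi-scale frequency-aware, and global semantic guidance flow modules alleviate U-Net's semantic degradation and improve localization accuracy and robustness for small targets.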
Details
Motivation: In infrared small target detection, U-Net suffers from semantic degradation when passing features from deep to shallow layers, which limits precise localization of small targets.
Method: Proposes the FSGNet framework, comprising (1) a multi-directional interactive attention module in the encoder to increase sensitivity to small, low-contrast targets; (2) a multi-scale frequency-aware module that uses the FFT to suppress background interference in skip connections; (3) global pooling at the deepest layer plus semantic guidance flows for cross-scale semantic consistency and precise localization.
Result: Experiments on four public IRSTD datasets show that FSGNet outperforms existing methods in both detection performance and runtime efficiency, with good practicality and robustness.
Conclusion: FSGNet alleviates U-Net's semantic degradation; frequency-domain modeling and semantic guidance together improve detection accuracy and localization, making it a lightweight and practical approach to infrared small target detection.
Abstract: Infrared small target detection (IRSTD) aims to identify and distinguish small targets from complex backgrounds. Leveraging the powerful multi-scale feature fusion capability of the U-Net architecture, IRSTD has achieved significant progress. However, U-Net suffers from semantic degradation when transferring high-level features from deep to shallow layers, limiting the precise localization of small targets. To address this issue, this paper proposes FSGNet, a lightweight and effective detection framework incorporating frequency-aware and semantic guidance mechanisms. Specifically, a multi-directional interactive attention module is proposed throughout the encoder to capture fine-grained and directional features, enhancing the network's sensitivity to small, low-contrast targets. To suppress background interference propagated through skip connections, a multi-scale frequency-aware module leverages the Fast Fourier Transform to filter out target-similar clutter while preserving salient target structures. At the deepest layer, a global pooling module captures high-level semantic information, which is subsequently upsampled and propagated to each decoder stage through the global semantic guidance flows, ensuring semantic consistency and precise localization across scales. Extensive experiments on four public IRSTD datasets demonstrate that FSGNet achieves superior detection performance and maintains high efficiency, highlighting its practical applicability and robustness. The code will be released at https://github.com/Wangtao-Bao/FSGNet.
[151] PMT: Plain Mask Transformer for Image and Video Segmentation with Frozen Vision Encoders
Niccolò Cavagnero,Narges Norouzi,Gijs Dubbelman,Daan de Geus
Main category: cs.CV
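TL;DR: This paper proposes the Plain Mask Transformer (PMT), built on the Plain Mask Decoder (PMD), a fast Transformer decoder that operates on frozen Vision Foundation Model (VFM) features; it keeps the advantages of a frozen, shareable encoder while delivering accurate, low-latency image and video segmentation.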
Details
Motivation: Existing VFM-based encoder-only segmentation models (such as EoMT and VidEoMT) are fast but require fine-tuning the encoder, giving up the core VFM advantage of sharing one encoder across tasks; a new architecture is needed that combines a frozen encoder with efficient decoding.
Method: Proposes the Plain Mask Decoder (PMD), a lightweight Transformer decoder that operates directly on features from a frozen VFM; builds the Plain Mask Transformer (PMT), which never updates encoder parameters and handles images and video in a unified way.
Result: On image segmentation, PMT matches the frozen-encoder SOTA accuracy while running about 3x faster; on video segmentation, it performs on par with fully fine-tuned methods and is 8x faster than the best current frozen-encoder models.
Conclusion: PMT reconciles the generality and shareability of a frozen VFM encoder with the demands of real-time segmentation, offering a better encoder-decoder design for large-scale multi-task deployment.
Abstract: Vision Foundation Models (VFMs) pre-trained at scale enable a single frozen encoder to serve multiple downstream tasks simultaneously. Recent VFM-based encoder-only models for image and video segmentation, such as EoMT and VidEoMT, achieve competitive accuracy with remarkably low latency, yet they require finetuning the encoder, sacrificing the multi-task encoder sharing that makes VFMs practically attractive for large-scale deployment. To reconcile encoder-only simplicity and speed with frozen VFM features, we propose the Plain Mask Decoder (PMD), a fast Transformer-based segmentation decoder that operates on top of frozen VFM features. The resulting model, the Plain Mask Transformer (PMT), preserves the architectural simplicity and low latency of encoder-only designs while keeping the encoder representation unchanged and shareable. The design seamlessly applies to both image and video segmentation, inheriting the generality of the encoder-only framework. On standard image segmentation benchmarks, PMT matches the frozen-encoder state of the art while running up to ~3x faster. For video segmentation, it even performs on par with fully finetuned methods, while being up to 8x faster than state-of-the-art frozen-encoder models. Code: https://github.com/tue-mps/pmt.
[152] LaMP: Learning Vision-Language-Action Policies with 3D Scene Flow as Latent Motion Prior
Xinkai Wang,Chenyi Wang,Yifu Xu,Mingzhe Ye,Fu-Cheng Zhang,Jialin Tian,Xinyu Zhan,Lifeng Zhu,Cewu Lu,Lixin Yang
Main category: cs.CV
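TL;DR: LaMP is a dual-expert Vision-Language-Action (VLA) framework that uses dense 3D scene flow as a latent motion prior to improve the generalization and robustness of robotic manipulation under unfamiliar spatial dynamics.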
Details
Motivation: Existing VLA models regress actions directly from 2D semantic features and cannot explicitly model complex 3D physical interactions, so performance drops under unseen spatial dynamics.
Method: Proposes a dual-expert architecture: the Motion Expert uses flow matching to generate a one-step, partially denoised 3D scene flow; the Action Expert is conditioned on the Motion Expert's hidden states through gated cross-attention, without full multi-step reconstruction.
Result: On the LIBERO, LIBERO-Plus, and SimplerEnv-WidowX simulation benchmarks and in real-world experiments, LaMP achieves the highest average success rates under the same training budgets; under LIBERO-Plus OOD perturbations it improves average success rate by 9.7% over the strongest baseline.
Conclusion: Using 3D scene flow as an explicit motion prior, combined with a dual-expert mechanism for efficient conditioned action prediction, markedly improves the generalization and robustness of VLA models in out-of-distribution scenarios.
Abstract: We introduce LaMP, a dual-expert Vision-Language-Action framework that embeds dense 3D scene flow as a latent motion prior for robotic manipulation. Existing VLA models regress actions directly from 2D semantic visual features, forcing them to learn complex 3D physical interactions implicitly. This implicit learning strategy degrades under unfamiliar spatial dynamics. LaMP addresses this limitation by aligning a flow-matching Motion Expert with a policy-predicting Action Expert through gated cross-attention. Specifically, the Motion Expert generates a one-step partially denoised 3D scene flow, and its hidden states condition the Action Expert without full multi-step reconstruction. We evaluate LaMP on the LIBERO, LIBERO-Plus, and SimplerEnv-WidowX simulation benchmarks as well as real-world experiments. LaMP consistently outperforms evaluated VLA baselines across LIBERO, LIBERO-Plus, and SimplerEnv-WidowX benchmarks, achieving the highest reported average success rates under the same training budgets. On LIBERO-Plus OOD perturbations, LaMP shows improved robustness with an average 9.7% gain over the strongest prior baseline. Our project page is available at https://summerwxk.github.io/lamp-project-page/.
[153] HiSpatial: Taming Hierarchical 3D Spatial Understanding in Vision-Language Models
Huizhi Liang,Yichao Shen,Yu Deng,Sicheng Xu,Zhiyuan Feng,Tong Zhang,Yaobo Liang,Jiaolong Yang
Main category: cs.CV
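TL;DR: This paper proposes a hierarchical framework that decomposes the learning of 3D spatial understanding in vision-language models (VLMs) into four progressive levels, and builds a large-scale 3D spatial VQA dataset and an RGB-D VLM, substantially improving spatial understanding and reasoning.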
Details
Motivation: Human-like spatial intelligence requires a VLM to infer 3D structure from 2D images, recognize object properties and relations in 3D space, and perform high-level spatial reasoning.
Method: Proposes a four-level progressive framework for spatial understanding; builds an automated pipeline that processes about 5M images with over 45M objects to generate 3D spatial VQA data; designs an RGB-D VLM that incorporates metric-scale point maps.
Result: Achieves SOTA on multiple spatial understanding and reasoning benchmarks, surpassing specialized spatial models and large models such as Gemini-2.5-pro and GPT-5; the analysis confirms clear dependencies among the hierarchical task levels.
Conclusion: Multi-level task design promotes the emergence of 3D spatial intelligence in VLMs; the proposed framework and model offer a systematic path toward VLMs with spatial cognition.
Abstract: Achieving human-like spatial intelligence for vision-language models (VLMs) requires inferring 3D structures from 2D observations, recognizing object properties and relations in 3D space, and performing high-level spatial reasoning. In this paper, we propose a principled hierarchical framework that decomposes the learning of 3D spatial understanding in VLMs into four progressively complex levels, from geometric perception to abstract spatial reasoning. Guided by this framework, we construct an automated pipeline that processes approximately 5M images with over 45M objects to generate 3D spatial VQA pairs across diverse tasks and scenes for VLM supervised fine-tuning. We also develop an RGB-D VLM incorporating metric-scale point maps as auxiliary inputs to further enhance spatial understanding. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on multiple spatial understanding and reasoning benchmarks, surpassing specialized spatial models and large proprietary systems such as Gemini-2.5-pro and GPT-5. Moreover, our analysis reveals clear dependencies among hierarchical task levels, offering new insights into how multi-level task design facilitates the emergence of 3D spatial intelligence.
[154] VideoWeaver: Multimodal Multi-View Video-to-Video Transfer for Embodied Agents
George Eskandar,Fengyi Shen,Mohammad Altillawi,Dong Chen,Yang Bai,Liudi Yang,Ziyuan Liu
Main category: cs.CV
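TL;DR: This paper presents VideoWeaver, the first framework to support multi-view video-to-video (V2V) translation; a shared 4D latent space and training views at distinct diffusion timesteps yield cross-view appearance consistency and support autoregressive generation of new viewpoints.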
Details
Motivation: Existing V2V methods support only a single view and cannot meet the needs of embodied AI tasks captured by multiple synchronized cameras; processing each view independently yields inconsistent appearance, and standard Transformers scale poorly because of the high cost of cross-view attention.
Method: VideoWeaver first builds a single-view V2V base on a flow-based model; it uses the Pi3 spatial foundation model to construct a shared 4D latent space that enforces multi-view consistency; it trains views at distinct diffusion timesteps to model joint and conditional view distributions, enabling autoregressive synthesis of new viewpoints.
Result: Reaches SOTA or comparable performance on single-view benchmarks; for the first time achieves physically and stylistically consistent multi-view V2V translation, including challenging egocentric and heterogeneous-camera setups.
Conclusion: VideoWeaver offers a scalable, consistent, and flexible paradigm for multi-view V2V translation, making it considerably more practical for world randomization in robot learning.
Abstract: Recent progress in video-to-video (V2V) translation has enabled realistic resimulation of embodied AI demonstrations, a capability that allows pretrained robot policies to be transferable to new environments without additional data collection. However, prior works can only operate on a single view at a time, while embodied AI tasks are commonly captured from multiple synchronized cameras to support policy learning. Naively applying single-view models independently to each camera leads to inconsistent appearance across views, and standard transformer architectures do not scale to multi-view settings due to the quadratic cost of cross-view attention. We present VideoWeaver, the first multimodal multi-view V2V translation framework. VideoWeaver is initially trained as a single-view flow-based V2V model. To achieve an extension to the multi-view regime, we propose to ground all views in a shared 4D latent space derived from a feed-forward spatial foundation model, namely, Pi3. This encourages view-consistent appearance even under wide baselines and dynamic camera motion. To scale beyond a fixed number of cameras, we train views at distinct diffusion timesteps, enabling the model to learn both joint and conditional view distributions. This in turn allows autoregressive synthesis of new viewpoints conditioned on existing ones.
Experiments show superior or similar performance to the state-of-the-art on the single-view translation benchmarks and, for the first time, physically and stylistically consistent multi-view translations, including challenging egocentric and heterogeneous-camera setups central to world randomization for robot learning.
[155] DC-Reg: Globally Optimal Point Cloud Registration via Tight Bounding with Difference of Convex Programming
Wei Lian,Fei Ma,Hang Pan,Zhesen Cui,Wangmeng Zuo
Main category: cs.CV
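TL;DR: This paper proposes DC-Reg, a framework that uses the difference of convex (DC) programming paradigm to build a holistic concave underestimator, which markedly tightens the lower bounds of branch-and-bound (BnB) search and yields globally optimal point cloud registration, with particular gains under partial overlap and large misalignment.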
Details
Motivation: Globally optimal point cloud registration is hard to obtain under partial overlap and large misalignment; existing methods that jointly optimize the transformation and the correspondences have a non-convex coupled objective, so they either fall into local minima or converge too slowly.
Method: Proposes the DC-Reg framework, which uses difference of convex (DC) programming to build a holistic concave underestimator of the coupled objective and computes tight lower bounds efficiently by solving linear assignment problems (LAP) at the vertices of the search boxes, avoiding the flaw of term-wise relaxations that ignore variable interactions.
Result: Validated on 2D similarity and 3D rigid registration; rotation-invariant features improve efficiency; on synthetic data and the 3DMatch benchmark, it converges faster and is more robust to noise and outliers than state-of-the-art global methods.
Conclusion: By introducing a holistic DC decomposition, DC-Reg markedly improves the tightness of BnB lower bounds and search efficiency, offering a paradigm for point cloud registration that is both globally optimal and practical.
Abstract: Achieving globally optimal point cloud registration under partial overlaps and large misalignments remains a fundamental challenge. While simultaneous transformation ($\boldsymbol{\theta}$) and correspondence ($\mathbf{P}$) estimation has the advantage of being robust to nonrigid deformation, its non-convex coupled objective often leads to local minima for heuristic methods and prohibitive convergence times for existing global solvers due to loose lower bounds. To address this, we propose DC-Reg, a robust globally optimal framework that significantly tightens the Branch-and-Bound (BnB) search. Our core innovation is the derivation of a holistic concave underestimator for the coupled transformation-assignment objective, grounded in the Difference of Convex (DC) programming paradigm. Unlike prior works that rely on term-wise relaxations (e.g., McCormick envelopes) which neglect variable interplay, our holistic DC decomposition captures the joint structural interaction between $\boldsymbol{\theta}$ and $\mathbf{P}$. This formulation enables the computation of remarkably tight lower bounds via efficient Linear Assignment Problems (LAP) evaluated at the vertices of the search boxes. We validate our framework on 2D similarity and 3D rigid registration, utilizing rotation-invariant features for the latter to achieve high efficiency without sacrificing optimality.
Experimental results on synthetic data and the 3DMatch benchmark demonstrate that DC-Reg achieves significantly faster convergence and superior robustness to extreme noise and outliers compared to state-of-the-art global techniques.
[156] CIAR: Interval-based Collaborative Decoding for Image Generation Acceleration
Keming Ye,Zhou Zhao,Fan Wu,Shengyu Zhang
Main category: cs.CV
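TL;DR: This paper proposes CIAR, a framework that uses cloud-device collaboration and on-device self-verification to speed up inference for auto-regressive image generation models, markedly reducing latency and the number of cloud requests while preserving image quality.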
Details
Motivation: Auto-regressive image generation models are compute-intensive and sequential, which makes on-device deployment difficult and causes high latency; images also contain many redundant tokens (for example in homogeneous regions) alongside high-uncertainty regions (for example object boundaries), so a uniform verification strategy wastes resources.
Method: Proposes the CIAR framework: (1) an on-device token uncertainty quantifier that replaces discrete sets with continuous probability intervals to suit large visual vocabularies; (2) an interval-enhanced decoding module combined with a distribution alignment training strategy to speed up decoding while maintaining visual fidelity and semantic consistency.
Result: Experiments show that CIAR achieves a 2.18x speed-up and 70% fewer cloud requests than existing methods while preserving image quality.
Conclusion: By introducing uncertainty-aware cloud-device collaboration, CIAR eases the efficiency bottleneck of deploying auto-regressive image generation on edge devices and offers a feasible path to high-quality real-time image generation.
Abstract: Auto-regressive (AR) models have recently made notable progress in image generation, achieving performance comparable to diffusion-based approaches. However, their computational intensity and sequential nature impede on-device deployment, causing disruptive latency. We address this via a cloud-device collaboration framework, CIAR, which utilizes on-device self-verification to handle two key properties of visual synthesis: the vast token vocabulary required for high-fidelity images and inherent spatial redundancy, which leads to extreme predictability in homogeneous regions, while object boundaries exhibit high uncertainty. Uniform verification wastes resources on such redundant tokens. Our solution centers on an on-device token uncertainty quantifier, which adopts continuous probability intervals instead of conventional discrete solution sets to accelerate processing and make it feasible for large visual vocabularies. Additionally, we incorporate an Interval-enhanced decoding module to further speed up decoding while maintaining visual fidelity and semantic consistency via a distribution alignment training strategy. Extensive experiments demonstrate that CIAR achieves a 2.18x speed-up and reduces cloud requests by 70%, while preserving image quality compared to existing methods.
[157] GridVAD: Open-Set Video Anomaly Detection via Spatial Reasoning over Stratified Frame Grids
Mohamed Eltahir,Ahmed O. Ibrahim,Obada Siralkhatim,Tabarak Abdallah,Sondos Mohamed
Main category: cs.CV
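TL;DR: This paper proposes GridVAD, a training-free video anomaly detection framework in which a vision-language model (VLM) generates natural-language anomaly descriptions; self-consistency filtering, Grounding DINO localization, and SAM2 propagation then produce pixel-level anomaly masks, reaching the highest Pixel-AUROC (77.59) on UCSD Ped2.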
Details
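Motivation: Used directly for video anomaly detection, existing VLMs are fragile because they lack calibrated anomaly priors, leading to missed detections or false alarms; the authors argue that the problem lies not in the VLM but in how it is used: it should act as an anomaly proposer, with dedicated modules handling spatio-temporal grounding and propagation.
Method: Proposes a propose-ground-propagate paradigm: a VLM generates open-set anomaly descriptions over stratified grid representations of video clips; Self-Consistency Consolidation (SCC) keeps only proposals that are stable across multiple samplings; Grounding DINO anchors each proposal to a bounding box; SAM2 propagates it as a dense pixel mask through the anomaly interval; the VLM budget is fixed at M+1 calls per video clip.
Result: Reaches a Pixel-AUROC of 77.59 on UCSD Ped2, surpassing the partially fine-tuned TAO (75.11) and all zero-shot methods; object-level RBDC exceeds other zero-shot methods by over 5x; SCC gives a controllable precision-recall tradeoff; it is 2.7x more call-efficient than uniform per-frame VLM querying and additionally outputs segmentation masks.
Conclusion: GridVAD shows that decoupling the VLM into a proposer backed by lightweight dedicated modules is effective, delivering accurate, efficient, training-free pixel-level video anomaly detection and suggesting a new role for VLMs in downstream perception tasks.
Abstract: Vision-Language Models (VLMs) are powerful open-set reasoners, yet their direct use as anomaly detectors in video surveillance is fragile: without calibrated anomaly priors, they alternate between missed detections and hallucinated false alarms. We argue the problem is not the VLM itself but how it is used. VLMs should function as anomaly proposers, generating open-set candidate descriptions that are then grounded and tracked by purpose-built spatial and temporal modules. We instantiate this propose-ground-propagate principle in GridVAD, a training-free pipeline that produces pixel-level anomaly masks without any domain-specific training. A VLM reasons over stratified grid representations of video clips to generate natural-language anomaly proposals. Self-Consistency Consolidation (SCC) filters hallucinations by retaining only proposals that recur across multiple independent samplings. Grounding DINO anchors each surviving proposal to a bounding box, and SAM2 propagates it as a dense mask through the anomaly interval. The per-clip VLM budget is fixed at M+1 calls regardless of video length, where M can be set according to the proposals needed. On UCSD Ped2, GridVAD achieves the highest Pixel-AUROC (77.59) among all compared methods, surpassing even the partially fine-tuned TAO (75.11) and outperforming other zero-shot approaches on object-level RBDC by over 5x.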
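The Self-Consistency Consolidation step described in this abstract amounts to voting across independent samplings, which the following minimal Python sketch illustrates. It assumes proposals are short strings and treats two proposals as the same when they match after lowercasing and trimming; the actual matching rule used by GridVAD is not given here, so `consolidate` and its `min_votes` parameter are hypothetical.

```python
from collections import Counter


def consolidate(samplings, min_votes):
    """Keep proposals that recur in at least `min_votes` samplings.

    samplings : list of lists, one list of proposal strings per VLM sampling
    min_votes : minimum number of samplings that must contain a proposal
    """
    votes = Counter()
    for proposals in samplings:
        # Normalize, and count each proposal at most once per sampling.
        votes.update({p.strip().lower() for p in proposals})
    return sorted(p for p, count in votes.items() if count >= min_votes)


samplings = [
    ["person riding a bicycle", "cart on walkway"],
    ["Person riding a bicycle"],
    ["person riding a bicycle", "flying dog"],
]
print(consolidate(samplings, min_votes=2))  # one-off proposals are dropped
```

Raising `min_votes` trades recall for precision, which is consistent with the controllable tradeoff the abstract attributes to SCC.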
Ablations reveal that SCC provides a controllable precision-recall tradeoff: filtering improves all pixel-level metrics at a modest cost in object-level recall. Efficiency experiments show GridVAD is 2.7x more call-efficient than uniform per-frame VLM querying while additionally producing dense segmentation masks. Code and qualitative video results are available at https://gridvad.github.io.
[158] AdaSFormer: Adaptive Serialized Transformers for Monocular Semantic Scene Completion from Indoor Environments
Xuzhi Wang,Xinran Wu,Song Wang,Lingdong Kong,Ziping Zhao
Main category: cs.CV
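TL;DR: This paper proposes AdaSFormer, a serialized Transformer framework for indoor monocular semantic scene completion (MSSC); an adaptive serialized Transformer, center-relative positional encoding, and convolution-modulated layer normalization address the complex layouts and severe occlusions of indoor scenes, reaching SOTA performance on NYUv2 and Occ-ScanNet.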
Details
Motivation: Indoor monocular semantic scene completion is harder than its outdoor counterpart, mainly because of complex spatial layouts and severe occlusions; conventional Transformers also have high memory cost and difficulty reconstructing fine-grained details, which limits their use for this task.
Method: Proposes the AdaSFormer framework with three core designs: (1) an adaptive serialized Transformer with learnable shifts that dynamically adjust receptive fields; (2) a center-relative positional encoding that strengthens spatial modeling; (3) a convolution-modulated layer normalization that bridges heterogeneous convolutional and Transformer features.
Result: Achieves state-of-the-art (SOTA) performance on the NYUv2 and Occ-ScanNet datasets.
Conclusion: AdaSFormer overcomes the memory cost and weak detail reconstruction that limit Transformers in indoor MSSC, showing that serialization combined with structural co-design is effective for understanding complex indoor scenes.
Abstract: Indoor monocular semantic scene completion (MSSC) is notably more challenging than its outdoor counterpart due to complex spatial layouts and severe occlusions. While transformers are well suited for modeling global dependencies, their high memory cost and difficulty in reconstructing fine-grained details have limited their use in indoor MSSC. To address these limitations, we introduce AdaSFormer, a serialized transformer framework tailored for indoor MSSC. Our model features three key designs: (1) an Adaptive Serialized Transformer with learnable shifts that dynamically adjust receptive fields; (2) a Center-Relative Positional Encoding that captures spatial information richness; and (3) a Convolution-Modulated Layer Normalization that bridges heterogeneous representations between convolutional and transformer features. Extensive experiments on NYUv2 and Occ-ScanNet demonstrate that AdaSFormer achieves state-of-the-art performance. The code is publicly available at: https://github.com/alanWXZ/AdaSFormer.
[159] Knowledge-Guided Failure Prediction: Detecting When Object Detectors Miss Safety-Critical Objects
Jakob Paul Zimmermann,Gerrit Holzbach,David Lerch
Main category: cs.CV
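TL;DR: This paper proposes Knowledge-Guided Failure Prediction (KGFP), a framework that detects at runtime when an object detector misses safety-critical objects (such as pedestrians) by measuring the semantic misalignment between detector features and visual foundation model embeddings.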
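The angular-distance gate at the heart of this entry can be sketched in a few lines of Python. The sketch assumes the two embeddings have already been projected into a shared space (KGFP learns this with a dual-encoder architecture, which is not reproduced here) and simply rejects an image when the angle between them exceeds a threshold. The function names and the threshold value are illustrative assumptions.

```python
import math


def angular_distance(u, v):
    """Angle in radians between two non-zero vectors, in [0, pi]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embeddings must be non-zero")
    # Clamp to guard against rounding slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cosine)


def accept_image(detector_emb, foundation_emb, max_angle):
    """Accept the detector's output only when the embeddings agree."""
    return angular_distance(detector_emb, foundation_emb) <= max_angle


# Hypothetical 2D embeddings, for illustration only.
print(accept_image([1.0, 0.1], [1.0, 0.0], max_angle=0.3))  # small angle
print(accept_image([1.0, 0.0], [0.0, 1.0], max_angle=0.3))  # orthogonal
```

In a selective-prediction setting, the threshold would be chosen on held-out data to meet a target false positive rate such as the 5% reported in the abstract.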
Details
Motivation: Object detectors can fail silently in safety-critical scenarios (for example by missing pedestrians), and traditional OOD detection methods cannot directly predict functional failures of the detector itself.
Method: KGFP is a representation-based monitoring framework with a dual-encoder architecture that uses an angular distance metric to quantify the semantic misalignment between the object detector's internal features and visual foundation model embeddings.
Result: On COCO person detection, using KGFP as a selective-prediction gate raises person recall among accepted images from 64.3% to 84.5% (at FPR = 5%), and it clearly outperforms existing OOD baselines across six COCO-O visual domains.
Conclusion: KGFP identifies failure states in which the detector operates outside its competence or faces novel inputs, providing a reliable and interpretable real-time monitoring mechanism for safety-critical applications.
Abstract: Object detectors deployed in safety-critical environments can fail silently, e.g. missing pedestrians, workers, or other safety-critical objects without emitting any warning. Traditional Out Of Distribution (OOD) detection methods focus on identifying unfamiliar inputs, but do not directly predict functional failures of the detector itself. We introduce Knowledge Guided Failure Prediction (KGFP), a representation-based monitoring framework that treats missed safety-critical detections as anomalies to be detected at runtime. KGFP measures semantic misalignment between internal object detector features and visual foundation model embeddings using a dual-encoder architecture with an angular distance metric. A key property is that when either the detector is operating outside its competence or the visual foundation model itself encounters novel inputs, the two embeddings diverge, producing a high-angle signal that reliably flags unsafe images. We compare our novel KGFP method to baseline OOD detection methods. On COCO person detection, applying KGFP as a selective-prediction gate raises person recall among accepted images from 64.3% to 84.5% at 5% False Positive Rate (FPR), and maintains strong performance across six COCO-O visual domains, outperforming OOD baselines by large margins. Our code, models, and features are published at https://gitlab.cc-asp.fraunhofer.de/iosb_public/KGFP.
[160] RealRestorer: Towards Generalizable Real-World Image Restoration with Large-Scale Image Editing Models
Yufeng Yang,Xianfang Zeng,Zhangqi Jiang,Fukun Yin,Jianzhuang Liu,Wei Cheng,jinghong lan,Shiyu Liu,Yuqi Peng,Gang YU,Shifeng Chen
Main category: cs.CV
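TL;DR: This paper builds a large dataset covering nine real-world degradation types and trains an open-source model that reaches state-of-the-art performance on real-world image restoration; it also introduces the RealIR-Bench evaluation benchmark.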
Details
Motivation: Existing image restoration models are limited by the scale and distribution of their training data and generalize poorly to real scenes; large closed-source editing models perform well but are costly.
Method: Builds a large-scale dataset of real-world degradations (nine types), trains a high-performing open-source restoration model, and proposes the RealIR-Bench benchmark, which contains 464 real-world degraded images and tailored evaluation metrics.
Result: The open-source model ranks first among open-source methods on RealIR-Bench, reaching state-of-the-art performance and narrowing the gap with closed-source models.
Conclusion: With high-quality data construction and targeted evaluation, open-source image restoration models can approach the performance of closed-source solutions in real-world scenes.
Abstract: Image restoration under real-world degradations is critical for downstream tasks such as autonomous driving and object detection. However, existing restoration models are often limited by the scale and distribution of their training data, resulting in poor generalization to real-world scenarios. Recently, large-scale image editing models have shown strong generalization ability in restoration tasks, especially for closed-source models like Nano Banana Pro, which can restore images while preserving consistency. Nevertheless, achieving such performance with those large universal models requires substantial data and computational costs. To address this issue, we construct a large-scale dataset covering nine common real-world degradation types and train a state-of-the-art open-source model to narrow the gap with closed-source alternatives. Furthermore, we introduce RealIR-Bench, which contains 464 real-world degraded images and tailored evaluation metrics focusing on degradation removal and consistency preservation. Extensive experiments demonstrate our model ranks first among open-source methods, achieving state-of-the-art performance.
[161] Challenges in Hyperspectral Imaging for Autonomous Driving: The HSI-Drive Case
Koldo Basterretxea,Jon Gutiérrez-Zaballa,Javier Echanobe
Main category: cs.CV
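TL;DR: This paper analyzes the challenges and technology choices involved in applying hyperspectral imaging (HSI) to autonomous driving (AD) and, drawing on experimental results from the latest version of the HSI-Drive dataset, discusses HSI technologies and custom vision algorithms suited to AD scenarios.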
Details
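Motivation: Hyperspectral imaging holds promise for autonomous driving but faces several challenges: uncontrolled lighting, wide depth-of-field ranges, dynamic scenes with fast motion, and the real-time and compute constraints of embedded platforms. Technology selection and algorithm design adapted to AD requirements are therefore needed. Method: Analyze several HSI-based vision system techniques, using experimental results from the latest version of the HSI-Drive dataset as a case study. Result: Clarifies the criteria for selecting HSI technologies for autonomous driving and the direction for developing custom vision algorithms, stressing that spectral and spatial information must be exploited jointly to cope with real driving constraints. Conclusion: Effective use of HSI in autonomous driving depends on sensor technologies and dedicated vision algorithms designed for domain-specific challenges (lighting variation, real-time operation, limited compute); future work should focus on hardware-software co-optimization. Abstract: The use of hyperspectral imaging (HSI) in autonomous driving (AD), while promising, faces many challenges related to the specifics and requirements of this application domain. On the one hand, there are non-controlled and variable lighting conditions, wide depth-of-field ranges, and dynamic scenes with fast-moving objects. On the other hand, there are the requirements for real-time operation and the limited computational resources of embedded platforms. The combination of these factors determines both the criteria for selecting appropriate HSI technologies and the development of custom vision algorithms that leverage the spectral and spatial information obtained from the sensors. In this article, we analyse several techniques explored in the research of HSI-based vision systems with application to AD, using as an example results obtained from experiments using data from the most recent version of the HSI-Drive dataset.
[162] CHIRP dataset: towards long-term, individual-level, behavioral monitoring of bird populations in the wild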
Alex Hoi Hang Chan,Neha Singhal,Onur Kocahan,Andrea Meltzer,Saverio Lubrano,Miyako H. Warrington,Michel Griesser,Fumihiro Kano,Hemal Naik
Main category: cs.CV
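TL;DR: This paper introduces the CHIRP dataset and the CORVID method for individual re-identification and long-term behavioral monitoring of wild birds, aiming to bridge the gap between computer vision research and biological applications.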
Details
Motivation: Long-term behavioral monitoring of individual animals is essential for conservation and evolutionary biology, but datasets that support the range of computer vision tasks needed to extract biologically meaningful measurements are lacking. Method: Introduce the CHIRP dataset (covering re-identification, action recognition, 2D keypoint estimation, object detection, and instance segmentation) and the CORVID method (a probabilistic video re-identification pipeline based on segmenting and classifying colored leg rings). Result: CORVID outperforms state-of-the-art re-identification methods on application-specific benchmarks (e.g., feeding rates, co-occurrence rates); CHIRP supports multi-task learning and evaluation in real-world settings. Conclusion: The work offers a blueprint for curating real-world datasets from ethically approved biological studies and advances research at the intersection of computer vision and ecology. Abstract: Long-term behavioral monitoring of individual animals is crucial for studying behavioral changes that occur over different time scales, especially for conservation and evolutionary biology. Computer vision methods have proven to benefit biodiversity monitoring, but automated behavior monitoring in wild populations remains challenging. This stems from the lack of datasets that cover a range of computer vision tasks necessary to extract biologically meaningful measurements of individual animals. Here, we introduce such a dataset (CHIRP) with a new method (CORVID) for individual re-identification of wild birds. The CHIRP (Combining beHaviour, Individual Re-identification and Postures) dataset is curated from a long-term population of wild Siberian jays studied in Swedish Lapland, supporting re-identification (re-id), action recognition, 2D keypoint estimation, object detection, and instance segmentation. In addition to traditional task-specific benchmarking, we introduce application-specific benchmarking with biologically relevant metrics (feeding rates, co-occurrence rates) to evaluate the performance of models in real-world use cases. Finally, we present CORVID (COlouR-based Video re-ID), a novel pipeline for individual identification of birds based on the segmentation and classification of colored leg rings, a widespread approach for visual identification of individual birds. CORVID offers a probability-based id tracking method by matching the detected combination of color rings with a database. We use application-specific benchmarking to show that CORVID outperforms state-of-the-art re-id methods. We hope this work offers the community a blueprint for curating real-world datasets from ethically approved biological studies to bridge the gap between computer vision research and biological applications.
[163] Beyond the Golden Data: Resolving the Motion-Vision Quality Dilemma via Timestep Selective Training
Xiangyang Luo,Qingyu Li,Yuming Li,Guanbo Huang,Yongjie Zhu,Wenyu Qin,Meng Wang,Pengfei Wan,Shao-Lun Huang
Main category: cs.CV
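TL;DR: This paper proposes Timestep-aware Quality Decoupling (TQD), which selects sampling timesteps during training according to data quality (visual quality versus motion quality) to resolve the "Motion-Vision Quality Dilemma" in video generation, allowing training on non-golden data to surpass conventional training on higher-quality data.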
Details
Motivation: Video generation models depend on data with both high visual quality and high motion quality, but in practice the two are negatively correlated and hard to obtain together, which creates the "Motion-Vision Quality Dilemma". Method: Building on findings about the hierarchical learning dynamics of video diffusion models and a gradient analysis, the authors introduce the concept of timestep selection during training and design TQD: motion-rich data is sampled preferentially at higher timesteps and high-visual-quality data at lower timesteps, matching the model's learning process. Result: When trained only on separated, quality-imbalanced data, TQD outperforms conventional training on higher-quality data; it also further improves performance when trained on high-quality data. Conclusion: Perfect data is not a prerequisite for video generation; a timestep-aware data sampling strategy adapted to the model's learning dynamics can ease the data-quality bottleneck. Abstract: Recent advances in video generation models have achieved impressive results. However, these models heavily rely on the use of high-quality data that combines both high visual quality and high motion quality. In this paper, we identify a key challenge in video data curation: the Motion-Vision Quality Dilemma. We discovered that visual quality and motion intensity inherently exhibit a negative correlation, making it hard to obtain golden data that excels in both aspects. To address this challenge, we first examine the hierarchical learning dynamics of video diffusion models and conduct gradient-based analysis on quality-degraded samples. We discover that quality-imbalanced data can produce gradients similar to golden data at appropriate timesteps. Based on this, we introduce the novel concept of Timestep selection in Training Process. We propose Timestep-aware Quality Decoupling (TQD), which modifies the data sampling distribution to better match the model's learning process. Specifically, the sampling distribution is skewed toward higher timesteps for motion-rich data, while data with high visual quality is more likely to be sampled at lower timesteps. Through extensive experiments, we demonstrate that TQD enables training exclusively on separated imbalanced data to achieve performance surpassing conventional training with better data, challenging the necessity of perfect data in video generation. Moreover, our method also boosts model performance when trained on high-quality data, showcasing its effectiveness across different data scenarios.
[164] BFMD: A Full-Match Badminton Dense Dataset for Dense Shot Captioning
Ning Ding,Keisuke Fujii,Toru Tamaki
Main category: cs.CV
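TL;DR: This paper presents BFMD, the first densely annotated full-match badminton dataset, and develops a VideoMAE-based multimodal captioning framework with a semantic feedback mechanism that improves the semantic consistency of captions, outperforming RGB-only baselines on shot captioning.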
Details
Motivation: Existing badminton datasets mostly contain short clips or task-specific annotations and lack full-match data with dense multimodal annotations, which makes accurate shot captioning and match-level tactical analysis difficult. Method: Build the BFMD dataset (19 full matches and 16,751 hit events, with hierarchical annotations including shot types, shuttle trajectories, player pose keypoints, and shot captions); propose a VideoMAE-based multimodal captioning framework with a semantic feedback mechanism that uses shot semantics to guide caption generation. Result: Experiments show that multimodal modeling and semantic feedback improve shot caption quality; BFMD also enables analysis of how tactical patterns evolve over full matches. Conclusion: BFMD fills the gap in full-match multimodal annotated data, and the proposed method supports the value of combining multimodal and semantic cues for fine-grained sports understanding, providing a new benchmark and tool for badminton tactical analysis. Abstract: Understanding tactical dynamics in badminton requires analyzing entire matches rather than isolated clips. However, existing badminton datasets mainly focus on short clips or task-specific annotations and rarely provide full-match data with dense multimodal annotations. This limitation makes it difficult to generate accurate shot captions and perform match-level analysis. To address this limitation, we introduce the first Badminton Full Match Dense (BFMD) dataset, with 19 broadcast matches (including both singles and doubles) covering over 20 hours of play, comprising 1,687 rallies and 16,751 hit events, each annotated with a shot caption. The dataset provides hierarchical annotations including match segments, rally events, and dense rally-level multimodal annotations such as shot types, shuttle trajectories, player pose keypoints, and shot captions. We develop a VideoMAE-based multimodal captioning framework with a Semantic Feedback mechanism that leverages shot semantics to guide caption generation and improve semantic consistency. Experimental results demonstrate that multimodal modeling and semantic feedback improve shot caption quality over RGB-only baselines. We further showcase the potential of BFMD by analyzing the temporal evolution of tactical patterns across full matches.
[165] Insights on back marking for the automated identification of animals
David Brunner,Marie Bordes,Elisabeth Mayrhuber,Stephan M. Winkler,Viktoria Dorfer,Maciej Oczak
Main category: cs.CV
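TL;DR: This study analyzes design principles for pig back marks using a ResNet-50 model, stressing that marks must remain distinguishable under realistic conditions such as motion blur, varying view angles, and occlusion, and must be compatible with data augmentation strategies (colour, flip, crop); it offers practical design guidance for machine-learning-based individual-level monitoring.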
Details
Motivation: Back mark design for monitoring individuals of uniform-looking species such as pigs has received little study, and with the rise of machine-learning-based monitoring there is a need for design guidelines optimized for algorithmic recognition. Method: Train a ResNet-50 neural network to classify ten pigs with unique back marks; assess the effectiveness of the mark designs by analyzing the model's predictions with respect to motion blur, view-angle changes, occlusion, and common data augmentations (colour, flip, crop). Result: Back marks must remain unambiguous under motion blur, diverse view angles, and behaviour-induced occlusion, and the mark design must take into account the augmentation strategies commonly used in training. Conclusion: The study provides actionable back mark design principles for individual-level monitoring of uniform-looking animals, which can help make machine learning models more robust and practical in real farming settings. Abstract: To date, there is little research on how to design back marks to best support individual-level monitoring of uniform looking species like pigs. With the recent surge of machine learning-based monitoring solutions, there is a particular need for guidelines on the design of marks that can be effectively recognised by such algorithms. This study provides valuable insights on effective back mark design, based on the analysis of a machine learning model, trained to distinguish pigs via their back marks. Specifically, a neural network of type ResNet-50 was trained to classify ten pigs with unique back marks. The analysis of the model's predictions highlights the significance of certain design choices, even in controlled settings. Most importantly, the set of back marks must be designed such that each mark remains unambiguous under conditions of motion blur, diverse view angles and occlusions, caused by animal behaviour. Further, the back mark design must consider data augmentation strategies commonly employed during model training, like colour, flip and crop augmentations. The generated insights can support individual-level monitoring in future studies and real-world applications by optimizing back mark design.
[166] PAWS: Perception of Articulation in the Wild at Scale from Egocentric Videos
Yihao Wang,Yang Miao,Wenshuai Zhao,Wenyan Yang,Zihan Wang,Joni Pajarinen,Luc Van Gool,Danda Pani Paudel,Juho Kannala,Xi Wang,Arno Solin
Main category: cs.CV
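TL;DR: This paper proposes PAWS, a method that extracts object articulations directly from large-scale in-the-wild egocentric videos without high-quality 3D annotations, significantly improving recovery of articulated motion and structure and benefiting downstream tasks such as robot manipulation.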
Details
Motivation: Existing learning-based methods rely heavily on supervised training with high-quality 3D data and manual annotations, which limits scalability and diversity. Method: Propose PAWS, which extracts object articulations directly from hand-object interactions in large-scale in-the-wild egocentric videos. Result: Significant improvements over baselines on public datasets including HD-EPIC and Arti4D; the extracted articulations help fine-tune 3D articulation prediction models and enable robot manipulation. Conclusion: PAWS removes the dependence on dense manual annotation and offers a more scalable and practical paradigm for understanding articulated objects. Abstract: Articulation perception aims to recover the motion and structure of articulated objects (e.g., drawers and cupboards), and is fundamental to 3D scene understanding in robotics, simulation, and animation. Existing learning-based methods rely heavily on supervised training with high-quality 3D data and manual annotations, limiting scalability and diversity. To address this limitation, we propose PAWS, a method that directly extracts object articulations from hand-object interactions in large-scale in-the-wild egocentric videos. We evaluate our method on public datasets, including HD-EPIC and Arti4D, achieving significant improvements over baselines. We further demonstrate that the extracted articulations benefit downstream tasks, including fine-tuning 3D articulation prediction models and enabling robot manipulation. See the project website at https://aaltoml.github.io/PAWS/.
[167] Towards Comprehensive Real-Time Scene Understanding in Ophthalmic Surgery through Multimodal Image Fusion
Nikolo Rohrmoser,Ghazal Ghazaei,Michael Sommersperger,Nassir Navab
Main category: cs.CV
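TL;DR: This paper proposes a network for real-time multimodal feature fusion (operating microscope OPMI plus intraoperative OCT) in vitreoretinal surgery that performs instrument detection, keypoint localization, and tool-tissue distance estimation; it markedly improves distance estimation at close range (<1 mm) and runs at real-time speed.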
Details
Motivation: In ophthalmic surgery, OPMI and iOCT are complementary modalities, but they have not yet been fused effectively for real-time multi-task intraoperative perception; instrument tracking and tool-tissue distance estimation need to become more accurate, especially in the critical region close to the retina. Method: Propose a multimodal, temporally aware, real-time capable network architecture: YoloNAS and a CNN encoder extract OPMI and iOCT features respectively, a cross-attention module fuses them, and a region-based recurrent module models temporal coherence. Result: Instrument detection and keypoint localization reach 95.79% mAP50; tool-tissue distance estimation error drops from 284 μm with OPMI alone to 33 μm with multimodal input (at distances below 1 mm), and processing takes 22.5 ms per frame, meeting real-time requirements. Conclusion: Multimodal feature fusion can substantially improve multi-task prediction accuracy, and tailored network design keeps it real-time capable; the work demonstrates potential for image-guided vitreoretinal surgery while also exposing key challenges on the way to more reliable, consistent, and comprehensive surgical scene understanding. Abstract: Purpose: The integration of multimodal imaging into operating rooms paves the way for comprehensive surgical scene understanding. In ophthalmic surgery, by now, two complementary imaging modalities are available: operating microscope (OPMI) imaging and real-time intraoperative optical coherence tomography (iOCT). This first work toward temporal OPMI and iOCT feature fusion demonstrates the potential of multimodal image processing for multi-head prediction through the example of precise instrument tracking in vitreoretinal surgery. Methods: We propose a multimodal, temporal, real-time capable network architecture to perform joint instrument detection, keypoint localization, and tool-tissue distance estimation. Our network design integrates a cross-attention fusion module to merge OPMI and iOCT image features, which are efficiently extracted via a YoloNAS and a CNN encoder, respectively. Furthermore, a region-based recurrent module leverages temporal coherence. Results: Our experiments demonstrate reliable instrument localization and keypoint detection (95.79% mAP50) and show that the incorporation of iOCT significantly improves tool-tissue distance estimation, while achieving real-time processing at 22.5 ms per frame. Especially for close distances to the retina (below 1 mm), the distance estimation error decreased from 284 μm (OPMI only) to 33 μm (multimodal). Conclusion: Feature fusion of multimodal imaging can enhance multi-task prediction accuracy compared to single-modality processing and real-time processing performance can be achieved through tailored network design. While our results demonstrate the potential of multi-modal processing for image-guided vitreoretinal surgery, they also underline key challenges that motivate future research toward more reliable, consistent, and comprehensive surgical scene understanding.
[168] GeoHeight-Bench: Towards Height-Aware Multimodal Reasoning in Remote Sensing
Xuran Hu,Zhitong Xiong,Zhongcheng Hong,Yifang Ban,Xiaoxiang Zhu,Wufan Zhao
Main category: cs.CV
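TL;DR: This paper proposes a height-aware evaluation framework for remote sensing understanding in Earth Observation, comprising a data generation pipeline, two new benchmarks (GeoHeight-Bench and GeoHeight-Bench+), and GeoHeightChat, the first height-aware LMM baseline, to address the lack of vertical-dimension modeling in existing large models.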
Details
Motivation: Current large models for Earth Observation generally ignore the critical "vertical" dimension, which limits their reasoning in complex remote sensing geometries and disaster scenarios, where physical spatial structure often matters more than planar texture. Method: Build a scalable, vision-language-model (VLM) driven data generation pipeline (with systematic prompt engineering and metadata extraction), create two height-aware benchmarks (GeoHeight-Bench and GeoHeight-Bench+), and propose GeoHeightChat, the first height-aware remote sensing LMM baseline, which combines visual semantics with implicitly injected height geometric features. Result: The GeoHeightChat baseline mitigates the model's "vertical blind spot", supports the necessity of height perception for remote sensing understanding, and enables interactive height reasoning on top of existing optical models. Conclusion: Introducing the height dimension is key to improving the spatial reasoning of remote sensing large models; the proposed framework and baseline provide a systematic foundation and empirical path for vertically aware Earth Observation AI. Abstract: Current Large Multimodal Models (LMMs) in Earth Observation typically neglect the critical "vertical" dimension, limiting their reasoning capabilities in complex remote sensing geometries and disaster scenarios where physical spatial structures often outweigh planar visual textures. To bridge this gap, we introduce a comprehensive evaluation framework dedicated to height-aware remote sensing understanding. First, to overcome the severe scarcity of annotated data, we develop a scalable, VLM-driven data generation pipeline utilizing systematic prompt engineering and metadata extraction. This pipeline constructs two complementary benchmarks: GeoHeight-Bench for relative height analysis, and a more challenging GeoHeight-Bench+ for holistic, terrain-aware reasoning. Furthermore, to validate the necessity of height perception, we propose GeoHeightChat, the first height-aware remote sensing LMM baseline. Serving as a strong proof of concept, our baseline demonstrates that synergizing visual semantics with implicitly injected height geometric features effectively mitigates the "vertical blind spot", successfully unlocking a new paradigm of interactive height reasoning in existing optical models.
[169] Hierarchy-Guided Multimodal Representation Learning for Taxonomic Inference
Sk Miraj Ahmed,Xi Yu,Yunqi Li,Yuewei Lin,Wei Xu
Main category: cs.CV
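TL;DR: This paper proposes two hierarchy-aware multimodal learning methods, CLiBD-HiR and CLiBD-HiR-Fuse, which use Hierarchical Information Regularization (HiR) and a lightweight fusion predictor to substantially improve taxonomic classification accuracy, with the largest gains when DNA data is incomplete or corrupted.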
Details
Motivation: Existing methods treat taxonomy as a flat label space and ignore the hierarchical structure of biological classification, which makes them less robust to noise and missing modalities. Method: Propose CLiBD-HiR (which introduces HiR to shape embedding geometry across taxonomic levels) and CLiBD-HiR-Fuse (which adds a lightweight fusion predictor that supports image-only, DNA-only, or joint inference and is robust to modality corruption). Result: On large-scale biodiversity benchmarks, classification accuracy improves by over 14 percent compared to strong multimodal baselines, with particularly large gains when DNA is partially missing or corrupted. Conclusion: Explicitly modeling the biological taxonomic hierarchy, combined with a flexible fusion mechanism, is key to building practical biodiversity foundation models. Abstract: Accurate biodiversity identification from large-scale field data is a foundational problem with direct impact on ecology, conservation, and environmental monitoring. In practice, the core task is taxonomic prediction - inferring order, family, genus, or species from imperfect inputs such as specimen images, DNA barcodes, or both. Existing multimodal methods often treat taxonomy as a flat label space and therefore fail to encode the hierarchical structure of biological classification, which is critical for robustness under noise and missing modalities. We present two end-to-end variants for hierarchy-aware multimodal learning: CLiBD-HiR, which introduces Hierarchical Information Regularization (HiR) to shape embedding geometry across taxonomic levels, yielding structured and noise-robust representations; and CLiBD-HiR-Fuse, which additionally trains a lightweight fusion predictor that supports image-only, DNA-only, or joint inference and is resilient to modality corruption. Across large-scale biodiversity benchmarks, our approach improves taxonomic classification accuracy by over 14 percent compared to strong multimodal baselines, with particularly large gains under partial and corrupted DNA conditions. These results highlight that explicitly encoding biological hierarchy, together with flexible fusion, is key for practical biodiversity foundation models.
[170] UNIC: Neural Garment Deformation Field for Real-time Clothed Character Animation
Chengfeng Zhao,Junbo Qi,Yulou Liu,Zhiyang Dou,Minchen Li,Taku Komura,Ziwei Liu,Wenping Wang,Yuan Liu
Main category: cs.CV
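TL;DR: This paper proposes UNIC, which uses an instance-specific neural deformation field to drive garment deformation of virtual characters in real time, avoiding topology handling and improving smoothness and quality; it targets interactive applications such as games.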
Details
Motivation: Physics simulation methods are computationally expensive and hard to run in real time, and existing graph neural network methods struggle to capture the fine deformation of garments with complex topologies. Method: Propose UNIC, a neural deformation field-based method that learns an instance-specific mapping from 3D points to deformation offsets and only needs to generalize to new motion sequences, not to new garments. Result: Experiments on a variety of garment meshes show that UNIC outperforms baseline methods in both deformation quality and runtime efficiency. Conclusion: UNIC achieves high-quality, real-time, topology-independent garment animation with potential for practical interactive applications. Abstract: Simulating physically realistic garment deformations is an essential task for virtual immersive experience, which is often achieved by physics simulation methods. However, these methods are typically time-consuming, computationally demanding, and require costly hardware, which is not suitable for real-time applications. Recent learning-based methods tried to resolve this problem by training graph neural networks to learn the garment deformation on vertices, which, however, fail to capture the intricate deformation of garment meshes with complex topologies. In this paper, we introduce a novel neural deformation field-based method, named UNIC, to animate the garments of an avatar in real time, given the motion sequences. Our key idea is to learn the instance-specific neural deformation field to animate the garment meshes. Such an instance-specific learning scheme does not require UNIC to generalize to new garments but only to new motion sequences, which greatly reduces the difficulty in training and improves the deformation quality. Moreover, neural deformation fields map the 3D points to their deformation offsets, which not only avoids handling topologies of the complex garments but also injects a natural smoothness constraint in the deformation learning. Extensive experiments have been conducted on various kinds of garment meshes to demonstrate the effectiveness and efficiency of UNIC over baseline methods, making it potentially practical and useful in real-world interactive applications like video games.
[171] DeepFAN, a transformer-based deep learning model for human-artificial intelligence collaborative assessment of incidental pulmonary nodules in CT scans: a multi-reader, multi-case trial
Zhenchen Zhu,Ge Hu,Weixiong Tan,Kai Gao,Chao Sun,Zhen Zhou,Kepei Xu,Wei Han,Meixia Shang,Xiaoming Qiu,Yiqing Tan,Jinhua Wang,Zhoumeng Ying,Li Peng,Wei Song,Lan Song,Zhengyu Jin,Nan Hong,Yizhou Yu
Main category: cs.CV
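TL;DR: This paper presents DeepFAN, a transformer-based deep learning model for classifying pulmonary nodules as benign or malignant, trained on a large pathology-confirmed dataset and evaluated in a multi-center clinical trial as a diagnostic aid for junior radiologists, where it significantly improved diagnostic performance and inter-reader consistency.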
Details
Motivation: Current deep learning methods for benign-versus-malignant classification of lung nodules do not fully integrate global and local features, and most lack validation through clinical trials. Method: Develop the transformer-based DeepFAN model, train it on over 10,000 pathology-confirmed lung nodules, and run a multi-reader, multi-case, multi-center clinical trial to evaluate its effectiveness as a diagnostic aid. Result: DeepFAN reaches an AUC of 0.939 on the internal test set and 0.954 on the clinical trial dataset (400 cases, three institutions); the average AUC of 12 readers improves by 10.9%, with significant gains in accuracy, sensitivity, and specificity; inter-reader consistency rises from fair to moderate (k from 0.313 to 0.421). Conclusion: DeepFAN effectively assists junior radiologists and may help homogenize diagnostic quality and reduce unnecessary follow-up of indeterminate pulmonary nodules. Abstract: The widespread adoption of CT has notably increased the number of detected lung nodules. However, current deep learning methods for classifying benign and malignant nodules often fail to comprehensively integrate global and local features, and most of them have not been validated through clinical trials. To address this, we developed DeepFAN, a transformer-based model trained on over 10K pathology-confirmed nodules and further conducted a multi-reader, multi-case clinical trial to evaluate its efficacy in assisting junior radiologists. DeepFAN achieved a diagnostic area under the curve (AUC) of 0.939 (95% CI 0.930-0.948) on an internal test set and 0.954 (95% CI 0.934-0.973) on the clinical trial dataset involving 400 cases across three independent medical institutions. Explainability analysis indicated higher contributions from global than local features. Twelve readers' average performance significantly improved by 10.9% (95% CI 8.3%-13.5%) in AUC, 10.0% (95% CI 8.9%-11.1%) in accuracy, 7.6% (95% CI 6.1%-9.2%) in sensitivity, and 12.6% (95% CI 10.9%-14.3%) in specificity (P<0.001 for all). Nodule-level inter-reader diagnostic consistency improved from fair to moderate (overall k: 0.313 vs. 0.421; P=0.019). In conclusion, DeepFAN effectively assisted junior radiologists and may help homogenize diagnostic quality and reduce unnecessary follow-up of indeterminate pulmonary nodules. Chinese Clinical Trial Registry: ChiCTR2400084624.
[172] Demographic Fairness in Multimodal LLMs: A Benchmark of Gender and Ethnicity Bias in Face Verification
Ünsal Öztürk,Hatef Otroshi Shahreza,Sébastien Marcel
Main category: cs.CV
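TL;DR: This paper evaluates the accuracy and fairness of nine open-source multimodal large language models (MLLMs) on face verification and finds that the face-specialised FaceLLM-8B substantially outperforms general-purpose MLLMs; bias patterns vary by model and benchmark, and high accuracy does not imply high fairness.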
Details
Motivation: Examine the demographic fairness (with respect to ethnicity and gender) of multimodal large language models (MLLMs) used as face verification systems, an aspect that has lacked systematic study. Method: Evaluate nine open-source MLLMs from six model families, with 2B to 8B parameters, on face verification using the IJB-C and RFW datasets; compute the Equal Error Rate (EER) and the True Match Rate (TMR) at specific False Match Rates (FMR) for four ethnicity groups and two gender groups, and quantify inter-group disparity with four FMR-based fairness metrics. Result: FaceLLM-8B (the only face-specialised model) substantially outperforms general-purpose MLLMs on both benchmarks; bias patterns differ from those of traditional face recognition systems and vary by model and benchmark; the most accurate models are not necessarily the fairest, and low-accuracy models can appear fair because their error rates are uniformly high across groups. Conclusion: MLLMs show significant and complex fairness issues in face verification that call for dedicated design and evaluation; overall accuracy cannot stand in for fairness evaluation, and fine-grained group analysis with multiple metrics is needed. Abstract: Multimodal Large Language Models (MLLMs) have recently been explored as face verification systems that determine whether two face images are of the same person. Unlike dedicated face recognition systems, MLLMs approach this task through visual prompting and rely on general visual and reasoning abilities. However, the demographic fairness of these models remains largely unexplored. In this paper, we present a benchmarking study that evaluates nine open-source MLLMs from six model families, ranging from 2B to 8B parameters, on the IJB-C and RFW face verification protocols across four ethnicity groups and two gender groups. We measure verification accuracy with the Equal Error Rate and True Match Rate at multiple operating points per demographic group, and we quantify demographic disparity with four FMR-based fairness metrics. Our results show that FaceLLM-8B, the only face-specialised model in our study, substantially outperforms general-purpose MLLMs on both benchmarks. The bias patterns we observe differ from those commonly reported for traditional face recognition, with different groups being most affected depending on the benchmark and the model. We also note that the most accurate models are not necessarily the fairest and that models with poor overall accuracy can appear fair simply because they produce uniformly high error rates across all demographic groups.
[173] LanteRn: Latent Visual Structured Reasoning
André G. Viveiros,Nuno Gonçalves,Matthias Lindemann,André Martins
Main category: cs.CV
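TL;DR: This paper proposes LanteRn, a framework that lets large multimodal models reason in latent space by interleaving language with compact visual representations, avoiding reliance on external modules and redundant pixel-level computation, and yielding consistent gains on several perception-centric benchmarks.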
Details
Motivation: Current large multimodal models are limited in visual reasoning, especially fine-grained spatial and visual understanding, and mostly fall back on verbalizing perceptual content as text; existing remedies either depend on external tools or reason directly in pixel space at high computational cost. Method: Propose the LanteRn framework, which gives a vision-language transformer the ability to generate and attend to continuous latent visual thought embeddings; training has two stages: supervised fine-tuning to ground visual features in latent states, followed by reinforcement learning to align latent reasoning with task-level utility. Result: On three perception-centric benchmarks (VisCoT, V*, and Blink), LanteRn consistently improves visual grounding and fine-grained reasoning. Conclusion: Interleaving visual and language reasoning within the model's internal latent space is a promising, more efficient direction for multimodal reasoning. Abstract: While language reasoning models excel in many tasks, visual reasoning remains challenging for current large multimodal models (LMMs). As a result, most LMMs default to verbalizing perceptual content into text, a strong limitation for tasks requiring fine-grained spatial and visual understanding. While recent approaches take steps toward thinking with images by invoking tools or generating intermediate images, they either rely on external modules, or incur unnecessary computation by reasoning directly in pixel space. In this paper, we introduce LanteRn, a framework that enables LMMs to interleave language with compact latent visual representations, allowing visual reasoning to occur directly in latent space. LanteRn augments a vision-language transformer with the ability to generate and attend to continuous visual thought embeddings during inference. We train the model in two stages: supervised fine-tuning to ground visual features in latent states, followed by reinforcement learning to align latent reasoning with task-level utility. We evaluate LanteRn on three perception-centric benchmarks (VisCoT, V*, and Blink), observing consistent improvements in visual grounding and fine-grained reasoning. These results suggest that internal latent representations provide a promising direction for more efficient multimodal reasoning.
[174] Designing Any Imaging System from Natural Language: Agent-Constrained Composition over a Finite Primitive Basis
Chengshuai Yang
Main category: cs.CV
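TL;DR: This paper introduces the spec.md specification format and three autonomous agents (Plan, Judge, Execute) that turn a one-sentence natural-language description into a validated forward model, and uses a design-to-real error theorem to decompose reconstruction error, matching expert-library quality.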
Details
Motivation: Designing computational imaging systems depends on specialists, takes a long time, and has a high barrier to entry; the goal is to let the broader scientific community prototype imaging instruments. Method: Introduce the structured specification format spec.md; build three autonomous agents (Plan, Judge, Execute); establish a design-to-real error theorem that decomposes total reconstruction error into five independently bounded terms, each linked to a corrective action. Result: On 6 real-data modalities spanning all 5 carrier families, the automated pipeline matches expert-library quality (98.1 ± 4.2%); ten novel designs, composing primitives into chains from 3D to 5D, go beyond what any single-modality tool can do. Conclusion: The framework lowers the barrier to designing computational imaging systems and improves scalability and cross-modality composition, moving the field toward broader access and automation. Abstract: Designing a computational imaging system -- selecting operators, setting parameters, validating consistency -- requires weeks of specialist effort per modality, creating an expertise bottleneck that excludes the broader scientific community from prototyping imaging instruments. We introduce spec.md, a structured specification format, and three autonomous agents -- Plan, Judge, and Execute -- that translate a one-sentence natural-language description into a validated forward model with bounded reconstruction error. A design-to-real error theorem decomposes total reconstruction error into five independently bounded terms, each linked to a corrective action. On 6 real-data modalities spanning all 5 carrier families, the automated pipeline matches expert-library quality (98.1 ± 4.2%). Ten novel designs -- composing primitives into chains from 3D to 5D -- demonstrate compositional reach beyond any single-modality tool.
[175] Just Zoom In: Cross-View Geo-Localization via Autoregressive Zooming
Yunus Talha Erzurumlu,Jiyong Kwag,Alper Yilmaz
Main category: cs.CV
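TL;DR: This paper proposes Just Zoom In, which performs cross-view geo-localization (CVGL) by autoregressive zooming over a city-scale overhead map, removing the dependence on contrastive learning and hard negative mining, and achieves state-of-the-art performance on a newly built realistic benchmark.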
Details
Motivation: Existing CVGL methods cast the problem as image retrieval under contrastive learning, which ties them to large-batch training and hard negative mining and ignores both the geometric structure of maps and the coverage mismatch between street-view and overhead imagery (for example, salient landmarks visible from the street can fall outside a fixed satellite crop), leaving retrieval targets ambiguous and explicit spatial reasoning difficult. Method: Propose Just Zoom In: starting from a coarse satellite view, the model takes a short sequence of autoregressive zoom-in decisions to select a terminal satellite cell at the target resolution, without contrastive losses or hard negative mining; the authors also build a realistic benchmark from crowd-sourced street views and high-resolution satellite imagery. Result: On the new benchmark, Just Zoom In improves Recall@1 within 50 m by 5.5% and Recall@1 within 100 m by 9.6% over the strongest contrastive-retrieval baseline, supporting the effectiveness of autoregressive coarse-to-fine spatial reasoning. Conclusion: CVGL need not rely solely on retrieval in an embedding space and can instead be modeled as a sequential spatial localization task; Just Zoom In shows how map structure and hierarchical spatial reasoning can improve localization accuracy. Abstract: Cross-view geo-localization (CVGL) estimates a camera's location by matching a street-view image to geo-referenced overhead imagery, enabling GPS-denied localization and navigation. Existing methods almost universally formulate CVGL as an image-retrieval problem in a contrastively trained embedding space. This ties performance to large batches and hard negative mining, and it ignores both the geometric structure of maps and the coverage mismatch between street-view and overhead imagery. In particular, salient landmarks visible from the street view can fall outside a fixed satellite crop, making retrieval targets ambiguous and limiting explicit spatial inference over the map. We propose Just Zoom In, an alternative formulation that performs CVGL via autoregressive zooming over a city-scale overhead map. Starting from a coarse satellite view, the model takes a short sequence of zoom-in decisions to select a terminal satellite cell at a target resolution, without contrastive losses or hard negative mining. We further introduce a realistic benchmark with crowd-sourced street views and high-resolution satellite imagery that reflects real capture conditions. On this benchmark, Just Zoom In achieves state-of-the-art performance, improving Recall@1 within 50 m by 5.5% and Recall@1 within 100 m by 9.6% over the strongest contrastive-retrieval baseline. These results demonstrate the effectiveness of sequential coarse-to-fine spatial reasoning for cross-view geo-localization.
[176] LEMMA: Laplacian pyramids for Efficient Marine SeMAntic Segmentation
Ishaan Gakhar,Laven Srivastava,Sankarshanaa Sagaram,Aditya Kasliwal,Ujjwal Verma
Main category: cs.CV
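TL;DR: This paper proposes LEMMA, a lightweight semantic segmentation model designed for marine remote sensing segmentation under resource constraints, which uses Laplacian pyramids to enhance edge recognition and sharply reduces computational cost while maintaining high accuracy.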
Details
Motivation: Existing marine semantic segmentation methods based on deep CNNs and Transformers are computationally expensive and resource-intensive, which makes them hard to use in real-time, low-cost maritime applications. Method: Propose the LEMMA model, which uses Laplacian pyramids to integrate edge information early in feature extraction, avoiding expensive feature map computations in deeper layers and thereby greatly reducing model size and computation. Result: State-of-the-art performance on several remote sensing datasets: 93.42% IoU on the Oil Spill dataset and 98.97% mIoU on Mastr1325; trainable parameters, GFLOPs, and inference time are reduced by up to 71x, 88.5%, and 84.65%, respectively. Conclusion: LEMMA greatly improves efficiency while preserving high accuracy and suits real marine scenarios such as autonomous USV navigation, oil spill monitoring, and coastal surveillance. Abstract: Semantic segmentation in marine environments is crucial for the autonomous navigation of unmanned surface vessels (USVs) and for monitoring coastal Earth Observation events such as oil spills. However, existing methods, often relying on deep CNNs and transformer-based architectures, face challenges in deployment due to their high computational costs and resource-intensive nature. These limitations hinder the practicality of real-time, low-cost applications in real-world marine settings. To address this, we propose LEMMA, a lightweight semantic segmentation model designed specifically for accurate remote sensing segmentation under resource constraints. The proposed architecture leverages Laplacian Pyramids to enhance edge recognition, a critical component for effective feature extraction in complex marine environments for disaster response, environmental surveillance, and coastal monitoring. By integrating edge information early in the feature extraction process, LEMMA eliminates the need for computationally expensive feature map computations in deeper network layers, drastically reducing model size, complexity and inference time. LEMMA demonstrates state-of-the-art performance across datasets captured from diverse platforms while reducing trainable parameters and computational requirements by up to 71x, GFLOPs by up to 88.5%, and inference time by up to 84.65%, as compared to existing models. Experimental results highlight its effectiveness and real-world applicability, including 93.42% IoU on the Oil Spill dataset and 98.97% mIoU on Mastr1325.
[177] Wan-Weaver: Interleaved Multi-modal Generation via Decoupled Training
Jinbo Xing,Zeyinzi Jiang,Yuxiang Tuo,Chaojie Mao,Xiaotang Gai,Xi Chen,Jingfeng Zhang,Yulin Pan,Zhen Han,Jie Xiao,Keyu Yan,Chenwei Xie,Chongyang Zhong,Kai Zhu,Tong Shen,Lianghua Huang,Yu Liu,Yujiu Yang
Main category: cs.CV
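TL;DR: This paper proposes the Wan-Weaver framework, which decomposes interleaved generation into textual planning and visual consistency modeling. It trains a planner on textual-proxy interleaved data and a visualizer on reference-guided image data, achieving interleaved generation with long-range textual coherence and visual consistency without any real interleaved data.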
Details
Motivation: Existing unified multimodal models can process multimodal inputs but struggle to produce interleaved multimodal outputs, mainly because interleaved training data is scarce and long-range cross-modal context is hard to model.
Method: A two-part framework: (1) a planner generates dense textual descriptions that represent the visual content; (2) a visualizer synthesizes images from those descriptions. Large-scale textual-proxy interleaved data is constructed to train the planner, and reference-guided image data is curated to train the visualizer.
Result: Without any real interleaved data, Wan-Weaver exhibits interleaved generation with long-range textual coherence and visual consistency, and outperforms existing methods on a self-built multi-dimensional interleaved generation benchmark.
Conclusion: The work validates decoupling interleaved generation into textual planning and visual synthesis, offering a new paradigm for multimodal interleaved generation that needs no real interleaved annotations.
Abstract: Recent unified models have made unprecedented progress in both understanding and generation. However, while most of them accept multi-modal inputs, they typically produce only single-modality outputs. This challenge of producing interleaved content is mainly due to training data scarcity and the difficulty of modeling long-range cross-modal context. To address this issue, we decompose interleaved generation into textual planning and visual consistency modeling, and introduce a framework consisting of a planner and a visualizer. The planner produces dense textual descriptions for visual content, while the visualizer synthesizes images accordingly. Under this guidance, we construct large-scale textual-proxy interleaved data (where visual content is represented in text) to train the planner, and curate reference-guided image data to train the visualizer. These designs give rise to Wan-Weaver, which exhibits emergent interleaved generation ability with long-range textual coherence and visual consistency. Meanwhile, the integration of diverse understanding and generation data into planner training enables Wan-Weaver to achieve robust task reasoning and generation proficiency. To assess the model's capability in interleaved generation, we further construct a benchmark that spans a wide range of use cases across multiple dimensions. Extensive experiments demonstrate that, even without access to any real interleaved data, Wan-Weaver achieves superior performance over existing methods.
[178] TRACE: Object Motion Editing in Videos with First-Frame Trajectory Guidance
Quynh Phung,Long Mai,Cusuh Ham,Feng Liu,Jia-Bin Huang,Aniruddha Mahapatra
Main category: cs.CV
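TL;DR: This paper proposes the Trace framework for controllable editing of a target object's motion path in video: the user designs a trajectory in a single frame, and the system generates a temporally consistent edited video.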
Details
Motivation: Existing video editing methods mainly manipulate appearance or rely on point-track-based trajectory control, but point tracks are hard for users to provide in scenes with camera motion, which limits practicality and ease of use.
Method: A two-stage pipeline: (1) a cross-view motion transformation module maps the first-frame path design to frame-aligned box trajectories that account for camera motion; (2) a motion-conditioned video re-synthesis module regenerates the target object along those trajectories while preserving the rest of the original video.
Result: Experiments on diverse real-world videos show that the method produces more coherent, realistic, and controllable motion edits than recent image-to-video and video-to-video methods.
Conclusion: Trace offers a practical, easy-to-use approach to object-centric motion editing that improves the controllability and robustness of object trajectory editing in video.
Abstract: We study object motion path editing in videos, where the goal is to alter a target object's trajectory while preserving the original scene content. Unlike prior video editing methods that primarily manipulate appearance or rely on point-track-based trajectory control, which is often challenging for users to provide during inference, especially in videos with camera motion, we offer a practical, easy-to-use approach to controllable object-centric motion editing. We present Trace, a framework that enables users to design the desired trajectory in a single anchor frame and then synthesizes a temporally consistent edited video. Our approach addresses this task with a two-stage pipeline: a cross-view motion transformation module that maps first-frame path design to frame-aligned box trajectories under camera motion, and a motion-conditioned video re-synthesis module that follows these trajectories to regenerate the object while preserving the remaining content of the input video. Experiments on diverse real-world videos show that our method produces more coherent, realistic, and controllable motion edits than recent image-to-video and video-to-video methods.
[179] Seeing to Ground: Visual Attention for Hallucination-Resilient MDLLMs
Vishal Narnaware,Animesh Gupta,Kevin Zhai,Zhenyi Wang,Mubarak Shah
Main category: cs.CV
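TL;DR: This paper proposes VISAGE, a framework that mitigates multimodal hallucination in Multimodal Diffusion Large Language Models (MDLLMs) by calibrating the objective at inference time. Its core idea is to estimate the proxy discrepancy from the spatial entropy of cross-attention and re-rank tokens accordingly, improving visual grounding.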
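The entropy-based re-ranking idea can be illustrated with a small sketch. This is not the VISAGE algorithm: the entry does not give the exact scoring rule, so the code below assumes a simple score of log-probability minus a penalty times the mean normalised entropy across heads, and it approximates the "consensus across heads" by that mean. The function names and the `penalty` parameter are hypothetical.

```python
import math


def spatial_entropy(attn):
    """Shannon entropy of one attention map over image patches, scaled to [0, 1].

    Assumes at least two patches and non-negative weights.
    """
    total = sum(attn)
    probs = [a / total for a in attn]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(probs))


def rerank(candidates, penalty=1.0):
    """Order candidate tokens by language score minus an entropy penalty.

    candidates: list of (token, log_prob, heads), where heads is a list of
    per-head attention maps (each a flat list of patch weights).
    """
    scored = []
    for token, log_prob, heads in candidates:
        mean_entropy = sum(spatial_entropy(h) for h in heads) / len(heads)
        # Spatially uniform attention (entropy near 1) is penalised the most.
        scored.append((log_prob - penalty * mean_entropy, token))
    return [token for _, token in sorted(scored, reverse=True)]


uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]
candidates = [
    ("cat", -0.8, [uniform, uniform]),  # likelier under the language prior, not localised
    ("dog", -1.0, [peaked, peaked]),    # slightly less likely, but attention is focused
]
order = rerank(candidates)
```

With these toy numbers the focused candidate overtakes the one that the language prior alone would have preferred.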
Details
Motivation: MDLLMs achieve high-concurrency generation through parallel masked decoding, but the decoder ranks candidate tokens by textual likelihood alone, without localized visual verification, which causes multimodal hallucination. At root this is an objective mismatch that arises when language probability serves as a proxy objective for a multimodal task.
Method: VISAGE is a training-free decoding framework. At inference time it estimates the proxy discrepancy by quantifying the spatial entropy of cross-attention distributions, enforces a localization consensus across attention heads, and penalizes spatially uniform distributions, re-ranking tokens to strengthen visual grounding.
Result: VISAGE is robust on both hallucination-sensitive and general-purpose benchmarks, with relative gains of 8.59% on MMMU-val and 7.75% on HallusionBench. It also comes with an analytical stability guarantee that the objective loss stays bounded under estimation error.
Conclusion: VISAGE mitigates the localized optimization errors (that is, hallucinations) caused by objective mismatch in MDLLMs. It reframes hallucination as a decoding bias that can be modeled and corrected, offering a new paradigm for training-free, inference-time multimodal alignment.
Abstract: Multimodal Diffusion Large Language Models (MDLLMs) achieve high-concurrency generation through parallel masked decoding, yet the architectures remain prone to multimodal hallucinations. This structural vulnerability stems from an algorithmic flaw: the decoder ranks candidate tokens based on textual likelihood without verifying localized visual support. We establish that this language-only ranking induces an objective mismatch, where language probability mass acts as a misspecified proxy for the intended multimodal task. Consequently, we reinterpret hallucination as a localized optimization error, a phenomenon where the decoder exploits language shortcuts to maximize a proxy score at the expense of visual grounding. To address this objective mismatch, we introduce VISAGE, a training-free decoding framework that calibrates the objective at inference time. VISAGE estimates the proxy discrepancy by quantifying the spatial entropy of cross-attention distributions. By enforcing a localization consensus across attention heads, the method penalizes spatially uniform distributions and re-ranks token commitments to favor visually grounded outcomes. We provide an analytical stability guarantee establishing that VISAGE maintains a bounded objective loss under estimation error. Evaluations across hallucination-sensitive and general-purpose benchmarks demonstrate the robustness of the framework, yielding relative gains of 8.59% on MMMU-val and 7.75% on HallusionBench.
[180] Out of Sight but Not Out of Mind: Hybrid Memory for Dynamic Video World Models
Kaijin Chen,Dingkang Liang,Xin Zhou,Yikang Ding,Xiaoqiang Liu,Pengfei Wan,Xiang Bai
Main category: cs.CV
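TL;DR: This paper proposes the Hybrid Memory paradigm and the HyDRA architecture to preserve the identity and motion continuity of dynamic subjects that leave the view and later reappear in video world models, and builds HM-World, the first large-scale video dataset dedicated to hybrid memory.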
Details
Motivation: Memory mechanisms in existing video world models treat the environment as a static canvas and handle dynamic subjects poorly: subjects that briefly leave the view and then reappear come back frozen, distorted, or missing.
Method: The Hybrid Memory paradigm requires a model to archive static backgrounds precisely while continuously tracking dynamic subjects. The HM-World dataset contains 59K high-fidelity clips with decoupled camera and subject trajectories, 17 scenes, 49 subjects, and carefully designed exit-entry events. The HyDRA memory architecture compresses memory into tokens and uses a spatiotemporal relevance-driven retrieval mechanism to attend selectively to motion cues.
Result: Experiments on HM-World show that HyDRA significantly outperforms state-of-the-art methods in both dynamic subject consistency and overall generation quality.
Conclusion: Hybrid Memory is an effective new paradigm for long-term modeling of dynamic subjects in video world models, and the HyDRA architecture and HM-World dataset provide key groundwork for this direction.
Abstract: Video world models have shown immense potential in simulating the physical world, yet existing memory mechanisms primarily treat environments as static canvases. When dynamic subjects hide out of sight and later re-emerge, current methods often struggle, leading to frozen, distorted, or vanishing subjects. To address this, we introduce Hybrid Memory, a novel paradigm requiring models to simultaneously act as precise archivists for static backgrounds and vigilant trackers for dynamic subjects, ensuring motion continuity during out-of-view intervals. To facilitate research in this direction, we construct HM-World, the first large-scale video dataset dedicated to hybrid memory. It features 59K high-fidelity clips with decoupled camera and subject trajectories, encompassing 17 diverse scenes, 49 distinct subjects, and meticulously designed exit-entry events to rigorously evaluate hybrid coherence. Furthermore, we propose HyDRA, a specialized memory architecture that compresses memory into tokens and utilizes a spatiotemporal relevance-driven retrieval mechanism. By selectively attending to relevant motion cues, HyDRA effectively preserves the identity and motion of hidden subjects. Extensive experiments on HM-World demonstrate that our method significantly outperforms state-of-the-art approaches in both dynamic subject consistency and overall generation quality.
[181] No Hard Negatives Required: Concept Centric Learning Leads to Compositionality without Degrading Zero-shot Capabilities of Contrastive Models
Hai X. Pham,David T. Hoffmann,Ricardo Guerrero,Brais Martinez
Main category: cs.CV
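TL;DR: This paper proposes an improvement to contrastive vision-language models that needs no hard negatives, based on concept-centric captions and cross-modal attention pooling. It significantly improves performance on compositionality tasks while maintaining or improving zero-shot and retrieval capabilities.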
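A parameter-free cross-modal attention pooling of the kind named in this entry can be sketched as follows. This is an illustration under assumptions, not the authors' code: it takes one text-side concept embedding as the query, uses plain dot products with a softmax, and the `temperature` argument and function name are hypothetical. No learnable weights are involved.

```python
import math


def attention_pool(query, patches, temperature=1.0):
    """Pool image patch embeddings with softmax weights given by a text query.

    query:   concept embedding from the text side (list of floats)
    patches: patch embeddings from the image encoder (list of lists of floats)
    Returns (pooled_embedding, weights).
    """
    scores = [sum(q * p for q, p in zip(query, patch)) / temperature
              for patch in patches]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    dim = len(patches[0])
    pooled = [sum(w * patch[d] for w, patch in zip(weights, patches))
              for d in range(dim)]
    return pooled, weights


# Two toy patches; the query matches the first one.
patches = [[1.0, 0.0], [0.0, 1.0]]
pooled, weights = attention_pool([5.0, 0.0], patches)
```

The pooled vector is dominated by the patch that matches the concept, whereas a global average would weight both patches equally.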
Details
Motivation: Existing contrastive vision-language models have limited ability to learn compositional representations. Hard negatives help but generalize poorly and degrade basic V&L capabilities, which makes them impractical.
Method: (1) Extract short concept-centric caption parts with standard NLP tools and align them with the image; (2) introduce a parameter-free cross-modal attention pooling that yields concept-centric visual embeddings from the image encoder. Both are combined with simple auxiliary contrastive losses.
Result: State-of-the-art performance on standard compositionality benchmarks, with zero-shot classification and cross-modal retrieval maintained or improved, and no added inference cost.
Conclusion: Compositionality is limited mainly because long captions do not impose the necessary constraints and because global pooling discards binding information. Concept-centric modeling and attention pooling address both efficiently, balancing performance and practicality.
Abstract: Contrastive vision-language (V&L) models remain a popular choice for various applications. However, several limitations have emerged, most notably the limited ability of V&L models to learn compositional representations. Prior methods often addressed this limitation by generating custom training data to obtain hard negative samples. Hard negatives have been shown to improve performance on compositionality tasks, but are often specific to a single benchmark, do not generalize, and can cause substantial degradation of basic V&L capabilities such as zero-shot or retrieval performance, rendering them impractical. In this work we follow a different approach. We identify two root causes that limit compositionality performance of V&Ls: 1) Long training captions do not require a compositional representation; and 2) The final global pooling in the text and image encoders leads to a complete loss of the necessary information to learn binding in the first place. As a remedy, we propose two simple solutions: 1) We obtain short concept centric caption parts using standard NLP software and align those with the image; and 2) We introduce a parameter-free cross-modal attention-pooling to obtain concept centric visual embeddings from the image encoder. With these two changes and simple auxiliary contrastive losses, we obtain SOTA performance on standard compositionality benchmarks, while maintaining or improving strong zero-shot and retrieval capabilities. This is achieved without increasing inference cost. We release the code for this work at https://github.com/SamsungLabs/concept_centric_clip.
[182] AnyHand: A Large-Scale Synthetic Dataset for RGB(-D) Hand Pose Estimation
Chen Si,Yulin Liu,Bo Ai,Jianwen Xie,Rolandos Alexandros Potamias,Chuanxia Zheng,Hao Su
Main category: cs.CV
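TL;DR: This paper presents AnyHand, a large-scale synthetic dataset for improving 3D hand pose estimation from RGB and RGB-D inputs, together with a lightweight depth fusion module. The two improve performance and generalization on multiple benchmarks.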
Details
Motivation: Real-world hand datasets have limited coverage, and prior synthetic datasets rarely provide occlusions, arm details, and aligned depth together at scale, which limits gains in model performance and robustness.
Method: Build AnyHand, a large-scale synthetic dataset with 2.5M single-hand and 4.1M hand-object interaction RGB-D images and rich geometric annotations; propose a lightweight depth fusion module that plugs into existing RGB models; and extend the training sets of baseline models with AnyHand while keeping architecture and training scheme fixed.
Result: Significant gains on benchmarks such as FreiHAND and HO-3D; stronger generalization to the out-of-domain HO-Cap dataset without fine-tuning; and an RGB-D model trained with AnyHand that achieves superior performance on HO-3D.
Conclusion: AnyHand eases the shortage of high-quality, diverse hand training data, demonstrates the potential of synthetic data for 3D hand pose estimation, and offers a practical recipe for RGB-D fusion.
Abstract: We present AnyHand, a large-scale synthetic dataset designed to advance the state of the art in 3D hand pose estimation from both RGB-only and RGB-D inputs. While recent works with foundation approaches have shown that an increase in the quantity and diversity of training data can markedly improve performance and robustness in hand pose estimation, existing real-world-collected datasets on this task are limited in coverage, and prior synthetic datasets rarely provide occlusions, arm details, and aligned depth together at scale. To address this bottleneck, our AnyHand contains 2.5M single-hand and 4.1M hand-object interaction RGB-D images, with rich geometric annotations. In the RGB-only setting, we show that extending the original training sets of existing baselines with AnyHand yields significant gains on multiple benchmarks (FreiHAND and HO-3D), even when keeping the architecture and training scheme fixed. More impressively, the model trained with AnyHand shows stronger generalization to the out-of-domain HO-Cap dataset, without any fine-tuning. We also contribute a lightweight depth fusion module that can be easily integrated into existing RGB-based models. Trained with AnyHand, the resulting RGB-D model achieves superior performance on the HO-3D benchmark, showing the benefits of depth integration and the effectiveness of our synthetic data.
[183] PixelSmile: Toward Fine-Grained Facial Expression Editing
Jiabin Hua,Hengyuan Xu,Aojie Li,Wei Cheng,Gang Yu,Xingjun Ma,Yu-Gang Jiang
Main category: cs.CV
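TL;DR: This paper proposes PixelSmile, a diffusion framework that disentangles expression semantics through fully symmetric joint training and combines intensity supervision with contrastive learning, enabling continuous, controllable, fine-grained facial expression editing while preserving identity.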
Details
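Motivation: Fine-grained facial expression editing has long been limited by intrinsic semantic overlap, and the field lacks high-quality datasets with continuous affective annotations as well as comprehensive evaluation benchmarks.
Method: Build the Flex Facial Expression (FFE) dataset with continuous affective annotations and the FFE-Bench evaluation benchmark. Propose the PixelSmile diffusion model, which disentangles expression semantics through fully symmetric joint training, combines intensity supervision with contrastive learning, and achieves linear expression control through textual latent interpolation.
Result: PixelSmile outperforms existing methods in suppressing structural confusion, editing accuracy, linear controllability, and identity preservation, and supports smooth expression blending, which confirms its effectiveness for continuous, controllable, fine-grained expression editing.
Conclusion: PixelSmile offers a new paradigm for fine-grained facial expression editing that balances expression intensity, semantic distinguishability, and identity consistency, moving controllable generation toward finer and more natural results.
Abstract: Fine-grained facial expression editing has long been limited by intrinsic semantic overlap. To address this, we construct the Flex Facial Expression (FFE) dataset with continuous affective annotations and establish FFE-Bench to evaluate structural confusion, editing accuracy, linear controllability, and the trade-off between expression editing and identity preservation. We propose PixelSmile, a diffusion framework that disentangles expression semantics via fully symmetric joint training. PixelSmile combines intensity supervision with contrastive learning to produce stronger and more distinguishable expressions, achieving precise and stable linear expression control through textual latent interpolation. Extensive experiments demonstrate that PixelSmile achieves superior disentanglement and robust identity preservation, confirming its effectiveness for continuous, controllable, and fine-grained expression editing, while naturally supporting smooth expression blending.
[184] PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference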
Xiaofeng Mao,Shaohao Rui,Kaining Ying,Bo Zheng,Chuanhao Li,Mingmin Chi,Kaipeng Zhang
Main category: cs.CV
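TL;DR: PackForcing proposes a three-partition KV-cache strategy (sink, mid, and recent tokens) with dynamic top-k selection and a continuous RoPE adjustment mechanism for efficient long-video generation. It supports 2-minute, 832x480 video synthesis on a single GPU and reaches state-of-the-art temporal consistency with supervision from only 5-second clips.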
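The bookkeeping behind a sink/mid/recent partition with top-k selection can be sketched roughly as below, assuming each history frame carries a scalar relevance score. The sketch only decides which frames stay in context and re-indexes their positions contiguously. The method described in this entry also compresses mid tokens with a learned network and adjusts RoPE positions, and neither is modeled here. All names and default sizes are hypothetical.

```python
def select_context(num_frames, relevance, n_sink=2, n_recent=3, k=2):
    """Pick which history frames stay in the KV cache.

    relevance: one score per frame; higher means more useful to keep.
    Returns (kept_frames, positions), where positions maps each kept frame
    to a gap-free index so dropped frames leave no holes.
    """
    frames = list(range(num_frames))
    recent_start = max(n_sink, num_frames - n_recent)
    sink = frames[:n_sink]                # early anchor frames, always kept
    mid = frames[n_sink:recent_start]     # candidates for top-k selection
    recent = frames[recent_start:]        # latest frames, always kept
    top_mid = sorted(sorted(mid, key=lambda f: relevance[f], reverse=True)[:k])
    kept = sink + top_mid + recent
    positions = {frame: pos for pos, frame in enumerate(kept)}
    return kept, positions


relevance = [0.0, 0.0, 0.1, 0.9, 0.2, 0.8, 0.3, 0.0, 0.0, 0.0]
kept, positions = select_context(10, relevance)
```

Because the number of kept frames is capped at n_sink + k + n_recent, the context size stays bounded however long the history grows, which is the property behind a bounded cache. The 24x extrapolation figure in the abstract is simply 120 s divided by 5 s.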
Details
Motivation: Autoregressive video diffusion models face three bottlenecks in long-video generation: linear KV-cache growth, temporal repetition, and error accumulation.
Method: The PackForcing framework (1) partitions the historical context into three token types: sink tokens (early anchor frames kept at full resolution), mid tokens (compressed by a dual-branch network that fuses progressive 3D convolutions with low-resolution VAE re-encoding, for a 32x token reduction), and recent tokens (kept at full resolution for local coherence); (2) applies dynamic top-k selection to the mid tokens; and (3) uses a continuous Temporal RoPE Adjustment to recalibrate position encodings.
Result: On a single H200 GPU it generates 2-minute, 832x480, 16 FPS videos; the KV cache stays bounded at 4 GB; it achieves 5 s to 120 s (24x) temporal extrapolation; and on VBench it reaches a temporal consistency of 26.07 and a dynamic degree of 56.25, both state of the art.
Conclusion: Through hierarchical context compression and lightweight position recalibration, PackForcing shows that short-video supervision is enough for high-quality long-video synthesis, offering a new paradigm for efficient video generation.
Abstract: Autoregressive video diffusion models have demonstrated remarkable progress, yet they remain bottlenecked by intractable linear KV-cache growth, temporal repetition, and compounding errors during long-video generation. To address these challenges, we present PackForcing, a unified framework that efficiently manages the generation history through a novel three-partition KV-cache strategy. Specifically, we categorize the historical context into three distinct types: (1) Sink tokens, which preserve early anchor frames at full resolution to maintain global semantics; (2) Mid tokens, which achieve a massive spatiotemporal compression (32x token reduction) via a dual-branch network fusing progressive 3D convolutions with low-resolution VAE re-encoding; and (3) Recent tokens, kept at full resolution to ensure local temporal coherence. To strictly bound the memory footprint without sacrificing quality, we introduce a dynamic top-k context selection mechanism for the mid tokens, coupled with a continuous Temporal RoPE Adjustment that seamlessly re-aligns position gaps caused by dropped tokens with negligible overhead. Empowered by this principled hierarchical context compression, PackForcing can generate coherent 2-minute, 832x480 videos at 16 FPS on a single H200 GPU. It achieves a bounded KV cache of just 4 GB and enables a remarkable 24x temporal extrapolation (5s to 120s), operating effectively either zero-shot or trained on merely 5-second clips. Extensive results on VBench demonstrate state-of-the-art temporal consistency (26.07) and dynamic degree (56.25), proving that short-video supervision is sufficient for high-quality, long-video synthesis. https://github.com/ShandaAI/PackForcing
[185] BizGenEval: A Systematic Benchmark for Commercial Visual Content Generation
Yan Li,Zezi Zeng,Ziwei Zhou,Xin Gao,Muzhao Tian,Yifan Yang,Mingxi Cheng,Qi Dai,Yuqing Yang,Lili Qiu,Zhendong Wang,Zhengyuan Yang,Xue Yang,Lijuan Wang,Ji Li,Chong Luo
Main category: cs.CV
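TL;DR: This paper introduces BizGenEval, a systematic benchmark for commercial visual content generation covering five document types and four key capability dimensions. It evaluates 26 mainstream image generation models with 400 prompts and 8000 human-verified questions, revealing substantial capability gaps in professional visual content generation.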
Details
Motivation: Existing image generation benchmarks focus mainly on natural image synthesis and do not systematically evaluate the structured, multi-constraint requirements of real commercial design tasks.
Method: Build the BizGenEval benchmark, covering five representative commercial document types (slides, charts, webpages, posters, and scientific figures) and four dimensions (text rendering, layout control, attribute binding, and knowledge-based reasoning), for 20 tasks in total. It contains 400 carefully curated prompts and 8000 human-verified checklist questions, and is used for a large-scale evaluation of 26 mainstream image generation models.
Result: Current generative models still show substantial capability gaps in meeting the complex visual and semantic constraints of professional commercial visual content.
Conclusion: BizGenEval provides a standardized benchmark for commercial visual content generation in real-world settings and may help drive progress in the field and in model capability.
Abstract: Recent advances in image generation models have expanded their applications beyond aesthetic imagery toward practical visual content creation. However, existing benchmarks mainly focus on natural image synthesis and fail to systematically evaluate models under the structured and multi-constraint requirements of real-world commercial design tasks. In this work, we introduce BizGenEval, a systematic benchmark for commercial visual content generation. The benchmark spans five representative document types: slides, charts, webpages, posters, and scientific figures, and evaluates four key capability dimensions: text rendering, layout control, attribute binding, and knowledge-based reasoning, forming 20 diverse evaluation tasks. BizGenEval contains 400 carefully curated prompts and 8000 human-verified checklist questions to rigorously assess whether generated images satisfy complex visual and semantic constraints. We conduct large-scale benchmarking on 26 popular image generation systems, including state-of-the-art commercial APIs and leading open-source models. The results reveal substantial capability gaps between current generative models and the requirements of professional visual content creation. We hope BizGenEval serves as a standardized benchmark for real-world commercial visual content generation.
[186] SlotVTG: Object-Centric Adapter for Generalizable Video Temporal Grounding
Jiwook Han,Geo Ahn,Youngrae Kim,Jinwoo Choi
Main category: cs.CV
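TL;DR: This paper proposes SlotVTG, a framework that uses a lightweight slot adapter to steer multimodal large language models toward object-centric, input-grounded visual reasoning. It significantly improves out-of-domain generalization in video temporal grounding while maintaining in-domain performance at minimal overhead.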
Details
Motivation: In video temporal grounding (VTG), the coarse recognition capabilities of existing MLLMs make task-specific fine-tuning necessary, which tends to make models memorize dataset shortcuts rather than the actual visual content and so hurts out-of-domain (OOD) generalization. Existing object-centric learning methods, meanwhile, require re-running a multi-stage training pipeline from scratch, which is costly.
Method: SlotVTG introduces a lightweight slot adapter that decomposes visual tokens into abstract slots via slot attention and then reconstructs the original sequence. Objectness priors from a self-supervised vision model encourage semantically coherent slots.
Result: Cross-domain evaluation on standard VTG benchmarks shows significantly better OOD robustness with competitive in-domain (ID) performance and minimal computational overhead.
Conclusion: SlotVTG steers MLLMs toward object-centric, input-grounded visual reasoning at low cost, offering a new paradigm for improving the generalization of VTG models.
Abstract: Multimodal Large Language Models (MLLMs) have shown strong performance on Video Temporal Grounding (VTG). However, their coarse recognition capabilities are insufficient for fine-grained temporal understanding, making task-specific fine-tuning indispensable. This fine-tuning causes models to memorize dataset-specific shortcuts rather than faithfully grounding in the actual visual content, leading to poor Out-of-Domain (OOD) generalization. Object-centric learning offers a promising remedy by decomposing scenes into entity-level representations, but existing approaches require re-running the entire multi-stage training pipeline from scratch. We propose SlotVTG, a framework that steers MLLMs toward object-centric, input-grounded visual reasoning at minimal cost. SlotVTG introduces a lightweight slot adapter that decomposes visual tokens into abstract slots via slot attention and reconstructs the original sequence, where objectness priors from a self-supervised vision model encourage semantically coherent slot formation. Cross-domain evaluation on standard VTG benchmarks demonstrates that our approach significantly improves OOD robustness while maintaining competitive In-Domain (ID) performance with minimal overhead.
[187] Unleashing Guidance Without Classifiers for Human-Object Interaction Animation
Ziyin Wang,Sirui Xu,Chuan Guo,Bing Zhou,Jiangshan Gong,Jian Wang,Yu-Xiong Wang,Liang-Yan Gui
Main category: cs.CV
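TL;DR: This paper proposes LIGHT, which derives data-driven guidance from the denoising pace in a diffusion model without hand-crafted contact priors, improving the realism and generalization of human-object interaction animation.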
Details
Motivation: Generating realistic human-object interaction (HOI) animation is challenging because it requires jointly modeling dynamic human actions and diverse object geometries. Existing diffusion-based methods rely on hand-crafted contact priors or kinematic constraints, which limits flexibility and generalization.
Method: The LIGHT framework builds on diffusion forcing: it factors the representation into modality-specific components and assigns each an asynchronous denoising schedule with an individualized noise level. Through cross-attention, cleaner components guide noisier ones, giving data-driven guidance without auxiliary classifiers.
Result: Experiments show that pace-induced guidance mirrors the benefits of contact priors more effectively than conventional classifier-free guidance, with gains in contact fidelity, HOI realism, and generalization to unseen objects and tasks.
Conclusion: LIGHT offers a more natural and robust data-driven guidance paradigm that improves HOI animation quality and generalization while reducing reliance on hand-crafted priors.
Abstract: Generating realistic human-object interaction (HOI) animations remains challenging because it requires jointly modeling dynamic human actions and diverse object geometries. Prior diffusion-based approaches often rely on hand-crafted contact priors or human-imposed kinematic constraints to improve contact quality. We propose LIGHT, a data-driven alternative in which guidance emerges from the denoising pace itself, reducing dependence on manually designed priors. Building on diffusion forcing, we factor the representation into modality-specific components and assign individualized noise levels with asynchronous denoising schedules. In this paradigm, cleaner components guide noisier ones through cross-attention, yielding guidance without auxiliary classifiers. We find that this data-driven guidance is inherently contact-aware, and can be enhanced when training is augmented with a broad spectrum of synthetic object geometries, encouraging invariance of contact semantics to geometric diversity. Extensive experiments show that pace-induced guidance more effectively mirrors the benefits of contact priors than conventional classifier-free guidance, while achieving higher contact fidelity, more realistic HOI generation, and stronger generalization to unseen objects and tasks.
[188] How good was my shot? Quantifying Player Skill Level in Table Tennis
Akihiro Kubota,Tomoya Hasegawa,Ryo Kawahara,Ko Nishino
Main category: cs.CV
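TL;DR: This paper proposes an approach based on generative models and joint embedding: by analyzing players' tactical strokes in table tennis matches together with their context (such as positioning and opponent behavior), it models individual characteristics, including skill level, in a latent space and thereby quantifies skill.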
Details
Motivation: Skill is a key latent factor shaping human behavior but is hard to observe directly, and existing methods struggle to quantify it accurately in complex interactive behavior such as dyadic sports.
Method: Learn a generative model of each player's tactical racket strokes and jointly embed these models in a shared latent space that encodes individual characteristics, including skill. The models are trained on a large-scale dataset of 3D-reconstructed professional matches and conditioned on game context such as player positioning and opponent behavior.
Result: The learned latent space reflects distinct play styles and skill-related attributes, and a simple relative ranking network trained on the embeddings can predict both relative and absolute skill.
Conclusion: The proposed latent space effectively quantifies skill level, providing a foundation for automated skill assessment in complex interactive behavior.
Abstract: Gauging an individual's skill level is crucial, as it inherently shapes their behavior. Quantifying skill, however, is challenging because it is latent to the observed actions. To explore skill understanding in human behavior, we focus on dyadic sports, specifically table tennis, where skill manifests not just in complex movements, but in the subtle nuances of execution conditioned on game context. Our key idea is to learn a generative model of each player's tactical racket strokes and jointly embed them in a common latent space that encodes individual characteristics, including those pertaining to skill levels. By training these player models on a large-scale dataset of 3D-reconstructed professional matches and conditioning them on comprehensive game context, including player positioning and opponent behaviors, the models capture individual tactical identities within their latent space. We probe this learned player space and find that it reflects distinct play styles and attributes that collectively represent skill. By training a simple relative ranking network on these embeddings, we demonstrate that both relative and absolute skill predictions can be achieved. These results demonstrate that the learned player space effectively quantifies skill levels, providing a foundation for automated skill assessment in complex, interactive behaviors.
[189] PSDesigner: Automated Graphic Design with a Human-Like Creative Workflow
Xincheng Shuai,Song Tang,Yutong Huang,Henghui Ding,Dacheng Tao
Main category: cs.CV
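TL;DR: This paper proposes PSDesigner, an automated graphic design system that emulates the creative workflow of human designers. Trained for tool use on the newly built CreativePSD dataset, it outperforms existing methods across a variety of graphic design tasks.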
Details
Motivation: Existing automated design systems simplify professional workflows, which limits flexibility and intuitiveness and makes it hard to translate user intent faithfully into editable design files.
Method: PSDesigner combines multiple specialized components to collect theme-related assets, then autonomously infers and executes tool calls to manipulate design files. The CreativePSD dataset, which contains high-quality PSD files annotated with operation traces, is built for training.
Result: Experiments show that PSDesigner outperforms existing methods across diverse graphic design tasks and lets non-specialists conveniently create production-quality designs.
Conclusion: PSDesigner narrows the gap between user intent and professional design output, offering an automated graphic design solution closer to real workflows.
Abstract: Graphic design is a creative and innovative process that plays a crucial role in applications such as e-commerce and advertising. However, developing an automated design system that can faithfully translate user intentions into editable design files remains an open challenge. Although recent studies have leveraged powerful text-to-image models and MLLMs to assist graphic design, they typically simplify professional workflows, resulting in limited flexibility and intuitiveness. To address these limitations, we propose PSDesigner, an automated graphic design system that emulates the creative workflow of human designers. Building upon multiple specialized components, PSDesigner collects theme-related assets based on user instructions, and autonomously infers and executes tool calls to manipulate design files, such as integrating new assets or refining inferior elements. To endow the system with strong tool-use capabilities, we construct a design dataset, CreativePSD, which contains a large amount of high-quality PSD design files annotated with operation traces across a wide range of design scenarios and artistic styles, enabling models to learn expert design procedures. Extensive experiments demonstrate that PSDesigner outperforms existing methods across diverse graphic design tasks, empowering non-specialists to conveniently create production-quality designs.
[190] MegaFlow: Zero-Shot Large Displacement Optical Flow
Dingxi Zhang,Fangjinhua Wang,Marc Pollefeys,Haofei Xu
Main category: cs.CV
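TL;DR: This paper proposes MegaFlow, a simple yet powerful model for zero-shot large-displacement optical flow. It formulates flow estimation as a global matching problem over features from a pre-trained global Vision Transformer, followed by lightweight iterative refinement, and achieves state-of-the-art zero-shot performance along with cross-task generalization.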
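Flow estimation as global matching can be illustrated with a soft-argmax over all-pairs feature correlations. The sketch below is a generic formulation under assumptions, not MegaFlow's implementation: features are given as plain lists, the `temperature` value is arbitrary, and the iterative refinement step is omitted.

```python
import math


def global_matching_flow(feat1, feat2, coords, temperature=0.1):
    """Estimate flow by matching every source feature against all target features.

    feat1, feat2: per-location feature vectors for frame 1 and frame 2
    coords:       (x, y) position of each location, shared by both frames
    Returns one (dx, dy) per source location.
    """
    flows = []
    for f1, (x1, y1) in zip(feat1, coords):
        # Correlation with every target location, not just a local window.
        scores = [sum(a * b for a, b in zip(f1, f2)) / temperature for f2 in feat2]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        norm = sum(exps)
        # Expected matching position under the softmax distribution.
        ex = sum(e / norm * x for e, (x, _) in zip(exps, coords))
        ey = sum(e / norm * y for e, (_, y) in zip(exps, coords))
        flows.append((ex - x1, ey - y1))
    return flows


# Four locations on a line; the content of frame 1 is shifted by two in frame 2.
coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
e = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
feat1 = [e[0], e[1], e[2], e[3]]
feat2 = [e[2], e[3], e[0], e[1]]
flows = global_matching_flow(feat1, feat2, coords)
```

Because every source location is compared with every target location, the size of the displacement does not limit what can be matched, which is why this formulation suits large motions.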
Details
Motivation: Existing methods rely on iterative local search or domain-specific fine-tuning, which limits their performance under large displacements and in zero-shot generalization.
Method: MegaFlow extracts global features with a pre-trained Vision Transformer, formulates optical flow estimation as a global matching problem, and adds lightweight iterative refinement to improve sub-pixel accuracy.
Result: State-of-the-art zero-shot performance on multiple optical flow benchmarks, and strong generalization to long-range point tracking.
Conclusion: MegaFlow shows that strong pre-trained vision priors, without complex task-specific architectures, can deliver robust and generalizable motion estimation.
Abstract: Accurate estimation of large displacement optical flow remains a critical challenge. Existing methods typically rely on iterative local search and/or domain-specific fine-tuning, which severely limits their performance in large displacement and zero-shot generalization scenarios. To overcome this, we introduce MegaFlow, a simple yet powerful model for zero-shot large displacement optical flow. Rather than relying on highly complex, task-specific architectural designs, MegaFlow adapts powerful pre-trained vision priors to produce temporally consistent motion fields. In particular, we formulate flow estimation as a global matching problem by leveraging pre-trained global Vision Transformer features, which naturally capture large displacements. This is followed by a few lightweight iterative refinements to further improve the sub-pixel accuracy. Extensive experiments demonstrate that MegaFlow achieves state-of-the-art zero-shot performance across multiple optical flow benchmarks. Moreover, our model also delivers highly competitive zero-shot performance on long-range point tracking benchmarks, demonstrating its robust transferability and suggesting a unified paradigm for generalizable motion estimation. Our project page is at: https://kristen-z.github.io/projects/megaflow.
[191] Vega: Learning to Drive with Natural Language Instructions
Sicheng Zuo,Yuxuan Li,Wenzhao Zheng,Zheng Zhu,Jie Zhou,Jiwen Lu
Main category: cs.CV
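TL;DR: This paper proposes Vega, a unified vision-language-world-action model that combines the autoregressive and diffusion paradigms to plan autonomous driving from user instructions, and builds InstructScene, a large-scale instruction-driven driving dataset.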
Details
Motivation: Existing autonomous driving models use language only for scene description or reasoning and cannot follow diverse user instructions for personalized driving.
Method: Build InstructScene, a large-scale instruction-driven driving dataset with around 100,000 scenes. Propose the Vega model, which processes visual and language inputs autoregressively, generates future predictions and trajectories with a diffusion model, and enables multimodal interaction through joint attention and modality-specific projection layers.
Result: Experiments show superior planning performance and strong instruction-following ability.
Conclusion: Vega opens a path toward more intelligent and personalized autonomous driving systems.
Abstract: Vision-language-action models have reshaped autonomous driving to incorporate language into the decision-making process. However, most existing pipelines only utilize the language modality for scene descriptions or reasoning and lack the flexibility to follow diverse user instructions for personalized driving. To address this, we first construct a large-scale driving dataset (InstructScene) containing around 100,000 scenes annotated with diverse driving instructions with the corresponding trajectories. We then propose a unified Vision-Language-World-Action model, Vega, for instruction-based generation and planning. We employ the autoregressive paradigm to process visual inputs (vision) and language instructions (language) and the diffusion paradigm to generate future predictions (world modeling) and trajectories (action). We perform joint attention to enable interactions between the modalities and use individual projection layers for different modalities for more capabilities. Extensive experiments demonstrate that our method not only achieves superior planning performance but also exhibits strong instruction-following abilities, paving the way for more intelligent and personalized driving systems.
[192] RefAlign: Representation Alignment for Reference-to-Video Generation
Lei Wang,YuXin Song,Ge Wu,Haocheng Feng,Hang Zhou,Jingdong Wang,Yaxing Wang,jian Yang
Main category: cs.CV
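TL;DR: This paper proposes RefAlign, a framework that explicitly aligns reference-image features with the semantic space of a visual foundation model (VFM) to improve identity consistency and semantic discriminability in reference-to-video (R2V) generation. It balances text controllability against reference fidelity and adds no inference overhead after training.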
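A pull-together, push-apart alignment loss of the kind described in this entry can be sketched as below. The exact form of RefAlign's loss is not given here, so this assumes cosine similarity, a pull term of 1 minus similarity for matching subjects, and a hinge push term for mismatched subjects. The function names and the `margin` parameter are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def reference_alignment_loss(ref_feats, vfm_feats, margin=0.0):
    """ref_feats[i] and vfm_feats[i] describe the same subject i.

    Pull: same-subject pairs should have cosine similarity close to 1.
    Push: different-subject pairs are penalised when similarity exceeds margin.
    """
    n = len(ref_feats)
    pull = sum(1.0 - cosine(ref_feats[i], vfm_feats[i]) for i in range(n)) / n
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    push = sum(max(0.0, cosine(ref_feats[i], vfm_feats[j]) - margin)
               for i, j in pairs)
    push = push / len(pairs) if pairs else 0.0
    return pull + push


ref = [[1.0, 0.0], [0.0, 1.0]]
aligned = reference_alignment_loss(ref, [[1.0, 0.0], [0.0, 1.0]])
swapped = reference_alignment_loss(ref, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is zero when each reference feature matches its own subject's VFM feature and grows when subjects are confused. Since such a term only shapes training, dropping it at inference costs nothing, consistent with the entry's no-overhead claim.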
Details
Motivation: Existing R2V methods rely on auxiliary features from multiple sources to alleviate information leakage in the VAE latent space, but they still struggle with copy-paste artifacts and multi-subject confusion, whose root cause is modality mismatch across heterogeneous encoder features.
Method: RefAlign adds a reference alignment loss to the reference branch of the diffusion Transformer (DiT): it pulls the DiT reference features and VFM features of the same subject together and pushes apart the corresponding features of different subjects, explicitly aligning the two semantic spaces. The loss is used only during training and adds no inference overhead.
Result: On the OpenS2V-Eval benchmark, RefAlign surpasses current state-of-the-art methods in TotalScore, validating explicit reference alignment for R2V.
Conclusion: Explicitly aligning reference-image features to the VFM semantic space more effectively mitigates the generation defects caused by modality mismatch, improving reference fidelity while retaining text controllability, and suggests a new direction for controllable video generation.
Abstract: Reference-to-video (R2V) generation is a controllable video synthesis paradigm that constrains the generation process using both text prompts and reference images, enabling applications such as personalized advertising and virtual try-on. In practice, existing R2V methods typically introduce additional high-level semantic or cross-modal features alongside the VAE latent representation of the reference image and jointly feed them into the diffusion Transformer (DiT). These auxiliary representations provide semantic guidance and act as implicit alignment signals, which can partially alleviate pixel-level information leakage in the VAE latent space. However, they may still struggle to address copy-paste artifacts and multi-subject confusion caused by modality mismatch across heterogeneous encoder features. In this paper, we propose RefAlign, a representation alignment framework that explicitly aligns DiT reference-branch features to the semantic space of a visual foundation model (VFM). The core of RefAlign is a reference alignment loss that pulls the reference features and VFM features of the same subject closer to improve identity consistency, while pushing apart the corresponding features of different subjects to enhance semantic discriminability. This simple yet effective strategy is applied only during training, incurring no inference-time overhead, and achieves a better balance between text controllability and reference fidelity. Extensive experiments on the OpenS2V-Eval benchmark demonstrate that RefAlign outperforms current state-of-the-art methods in TotalScore, validating the effectiveness of explicit reference alignment for R2V tasks.
[193] MuRF: Unlocking the Multi-Scale Potential of Vision Foundation Models
Bocheng Zou,Mu Cai,Mark Stanley,Dingfu Lu,Yong Jae Lee
Main category: cs.CV
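TL;DR: This paper proposes Multi-Resolution Fusion (MuRF), a simple and general inference-time strategy for fusing multi-resolution features that improves the performance of vision foundation models (VFMs) without additional training and applies to a variety of VFM architectures.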
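As a rough illustration of the multi-resolution idea summarized above, the sketch below runs one image through a fixed encoder at several resolutions and fuses the results. It rests on stated assumptions: `toy_frozen_encoder` is a hypothetical stand-in for a frozen VFM such as DINOv2, nearest-neighbor resizing is chosen only for simplicity, and fusion by concatenation is a guess, since the entry does not specify how MuRF fuses features.

```python
def resize_nearest(image, out_h, out_w):
    """Nearest-neighbor resize of a 2D list of pixel values."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def toy_frozen_encoder(image):
    """Hypothetical stand-in for a frozen VFM.

    Returns [mean intensity, mean absolute horizontal difference].
    """
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    diffs = [abs(row[c + 1] - row[c]) for row in image for c in range(len(row) - 1)]
    edge = sum(diffs) / len(diffs) if diffs else 0.0
    return [mean, edge]


def multi_resolution_features(image, sizes, encoder):
    """Encode the image at each (height, width) and concatenate the features."""
    fused = []
    for out_h, out_w in sizes:
        fused.extend(encoder(resize_nearest(image, out_h, out_w)))
    return fused


checkerboard = [
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
]
print(multi_resolution_features(checkerboard, [(2, 2), (4, 4)], toy_frozen_encoder))
# -> [0.0, 0.0, 0.5, 1.0]
```

In this toy case the 2x2 view loses the checkerboard pattern entirely while the 4x4 view retains it, so the fused vector carries information that neither scale provides alone. The encoder is never updated, matching the training-free setting.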
Details
Motivation: Existing vision foundation models typically use a single fixed-scale input at inference, overlooking the complementary inductive biases offered by different resolutions (low resolution favors global semantic recognition, while high resolution favors fine-grained detail extraction).
Method: MuRF extracts features from the same image at multiple resolutions with a frozen vision foundation model and fuses these multi-scale features into a unified representation. The method does not depend on a specific model architecture and requires no fine-tuning or retraining.
Result: MuRF significantly improves performance on several key computer vision tasks and generalizes to multiple VFM families such as DINOv2 and SigLIP2, validating its generality and effectiveness.
Conclusion: MuRF is a training-free, architecture-agnostic, plug-and-play inference-time enhancement that reveals and exploits the synergy of multi-resolution visual information, offering a new direction for improving the practical deployment of VFMs.
Abstract: Vision Foundation Models (VFMs) have become the cornerstone of modern computer vision, offering robust representations across a wide array of tasks. While recent advances allow these models to handle varying input sizes during training, inference typically remains restricted to a single, fixed scale. This prevalent single-scale paradigm overlooks a fundamental property of visual perception: varying resolutions offer complementary inductive biases, where low-resolution views excel at global semantic recognition and high-resolution views are essential for fine-grained refinement. In this work, we propose Multi-Resolution Fusion (MuRF), a simple yet universally effective strategy to harness this synergy at inference time. Instead of relying on a single view, MuRF constructs a unified representation by processing an image at multiple resolutions through a frozen VFM and fusing the resulting features. The universality of MuRF is its most compelling attribute. It is not tied to a specific architecture, serving instead as a fundamental, training-free enhancement to visual representation. We empirically validate this by applying MuRF to a broad spectrum of critical computer vision tasks across multiple distinct VFM families - primarily DINOv2, but also demonstrating successful generalization to contrastive models like SigLIP2.
[194] Less Gaussians, Texture More: 4K Feed-Forward Textured Splatting
Yixing Lao,Xuyang Bai,Xiaoyang Wu,Nuoyuan Yan,Zixin Luo,Tian Fang,Jean-Daniel Nahmias,Yanghai Tsin,Shiwei Li,Hengshuang Zhao
Main category: cs.CV
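TL;DR: LGTM is a new feed-forward 3D Gaussian Splatting framework that combines compact Gaussian primitives with per-primitive textures, decoupling geometric complexity from rendering resolution and enabling 4K novel view synthesis without per-scene optimization.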
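To make the resolution-scaling issue behind this entry concrete, the short calculation below counts primitives for a pixel-aligned predictor. It assumes one Gaussian primitive per input pixel per view, which is the usual reading of "pixel-aligned" but is not spelled out in the entry; the function name and the `num_views` parameter are hypothetical.

```python
def pixel_aligned_primitives(width, height, num_views=1):
    """Primitive count if each pixel of each input view yields one Gaussian."""
    return width * height * num_views


full_hd = pixel_aligned_primitives(1920, 1080)
ultra_hd = pixel_aligned_primitives(3840, 2160)

print(full_hd)              # 2073600
print(ultra_hd)             # 8294400
print(ultra_hd // full_hd)  # 4: doubling each side quadruples the count
```

Under this assumption, a single 4K view already implies more than eight million primitives, which is the growth that a compact set of textured primitives is meant to avoid.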
Details
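Motivation: Existing feed-forward 3D Gaussian Splatting methods are pixel-aligned, so the number of primitives grows quadratically with resolution, which makes them hard to scale to high-resolution synthesis such as 4K.
Method: The authors propose the LGTM framework, which predicts compact Gaussian primitives and assigns each a per-primitive texture, decoupling the geometric representation from the rendering resolution.
Result: Without per-scene optimization, LGTM achieves high-quality 4K novel view synthesis while significantly reducing the number of Gaussian primitives required.
Conclusion: LGTM breaks through the resolution-scaling bottleneck of feed-forward Gaussian Splatting methods and provides a viable path toward high-resolution, real-time novel view synthesis.
Abstract: Existing feed-forward 3D Gaussian Splatting methods predict pixel-aligned primitives, leading to a quadratic growth in primitive count as resolution increases. This fundamentally limits their scalability, making high-resolution synthesis such as 4K intractable. We introduce LGTM (Less Gaussians, Texture More), a feed-forward framework that overcomes this resolution scaling barrier. By predicting compact Gaussian primitives coupled with per-primitive textures, LGTM decouples geometric complexity from rendering resolution. This approach enables high-fidelity 4K novel view synthesis without per-scene optimization, a capability previously out of reach for feed-forward methods, all while using significantly fewer Gaussian primitives. Project page: https://yxlao.github.io/lgtm/
[195] ShotStream: Streaming Multi-Shot Video Generation for Interactive Storytelling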
Yawen Luo,Xiaoyu Shi,Junhao Zhuang,Yutian Chen,Quande Liu,Xintao Wang,Pengfei Wan,Tianfan Xue
Main category: cs.CV
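TL;DR: This paper proposes ShotStream, a new causal multi-shot video generation architecture that supports interactive storytelling and real-time frame generation. It addresses cross-shot consistency and error accumulation with a dual-cache mechanism and a two-stage distillation strategy, achieving real-time generation at 16 FPS on a single GPU.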