cs.CL

[1] OptiMer: Optimal Distribution Vector Merging Is Better than Data Mixing for Continual Pre-Training

Haiyue Song, Masao Utiyama

Main category: cs.CL

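TL;DR: This paper proposes OptiMer, which turns the choice of data mixture ratios in continual pre-training (CPT) from a decision made before training into a post-hoc optimization over distribution vectors, substantially reducing tuning cost while improving performance.

Motivation: In continual pre-training, the data mixture ratio is a sensitive and expensive hyperparameter that must be fixed in advance; a suboptimal choice wastes large amounts of compute.

Method: Train one CPT model per dataset and extract its distribution vector, i.e. the parameter shift induced by that dataset; then search for the optimal composition weights over these vectors with Bayesian optimization, so that mixture ratios are selected post-hoc.

Result: On Gemma 3 27B, across languages (Japanese, Chinese) and domains (Math, Code), OptiMer consistently outperforms data mixture and model averaging baselines at 15 to 35 times lower search cost; the optimized weights can be interpreted as mixture ratios, and the same vector pool can be quickly re-optimized for different objectives.

Conclusion: Data mixture ratio selection can be reformulated as a post-hoc optimization over distribution vectors, offering a more flexible and efficient paradigm for continual pre-training.

Abstract: Continual pre-training is widely used to adapt LLMs to target languages and domains, yet the mixture ratio of training data remains a sensitive hyperparameter that is expensive to tune: it must be fixed before training begins, and a suboptimal choice can waste weeks of compute. In this work, we propose OptiMer, which decouples ratio selection from training: we train one CPT model per dataset, extract each model's distribution vector, which represents the parameter shift induced by that dataset, and search for optimal composition weights post-hoc via Bayesian optimization. Experiments on Gemma 3 27B across languages (Japanese, Chinese) and domains (Math, Code) show that OptiMer consistently outperforms data mixture and model averaging baselines with 15-35 times lower search cost. Key findings reveal that 1) the optimized weights can be interpreted as data mixture ratios, and retraining with these ratios improves data mixture CPT, and 2) the same vector pool can be re-optimized for a given objective without any retraining, producing target-tailored models on demand. Our work establishes that data mixture ratio selection, traditionally a pre-training decision, can be reformulated as a post-hoc optimization over distribution vectors, offering a more flexible paradigm for continual pre-training.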
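As a rough illustration of the idea in the abstract above, the sketch below treats model parameters as flat lists of floats, computes one distribution vector per dataset as the shift from the base model, and composes them with scalar weights. The function names and the toy `score` callback are assumptions made for this example, and the search loop is plain random search used as a stand-in; the paper itself uses Bayesian optimization, which is not reproduced here.

```python
import random

def distribution_vector(cpt_params, base_params):
    """Parameter shift induced by one dataset: CPT model minus base model."""
    return [c - b for c, b in zip(cpt_params, base_params)]

def merge(base_params, vectors, weights):
    """Compose a model as base + sum_i w_i * vector_i (element-wise)."""
    merged = list(base_params)
    for w, vec in zip(weights, vectors):
        for k, delta in enumerate(vec):
            merged[k] += w * delta
    return merged

def search_weights(base_params, vectors, score, trials=200, seed=0):
    """Random search over weights in [0, 1]; a stand-in for Bayesian optimization."""
    rng = random.Random(seed)
    best_w, best_s = None, float("-inf")
    for _ in range(trials):
        w = [rng.uniform(0.0, 1.0) for _ in vectors]
        s = score(merge(base_params, vectors, w))  # hypothetical dev-set score
        if s > best_s:
            best_w, best_s = w, s
    return best_w, best_s

# Toy usage: two "datasets", two parameters, and a made-up objective.
base = [0.0, 0.0]
vec_a = distribution_vector([1.0, 0.0], base)
vec_b = distribution_vector([0.0, 2.0], base)
toy_score = lambda m: -((m[0] - 0.5) ** 2 + (m[1] - 0.5) ** 2)
best_weights, best_score = search_weights(base, [vec_a, vec_b], toy_score)
```

Because the vectors are computed once, changing the objective only means calling the search again with a different `score`; no model is retrained.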

[2] From Consensus to Split Decisions: ABC-Stratified Sentiment in Holocaust Oral Histories

Daban Q. Jaff

Main category: cs.CL

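TL;DR: This paper presents a diagnostic study of three pretrained transformer-based sentiment classifiers on a large corpus of long-form Holocaust oral histories (more than 100,000 utterances). It introduces an agreement-based stability taxonomy (ABC) built on inter-model agreement and adds an auxiliary emotion-classification analysis, revealing systematic disagreement around the neutral boundary.

Motivation: Polarity detection is especially hard under domain shift, particularly for heterogeneous, long-form historical narratives with complex discourse structure such as Holocaust oral histories; the reliability and failure modes of existing models need systematic diagnosis.

Method: Run three pretrained transformer polarity classifiers on a Holocaust oral history corpus of 107,305 utterances and 579,013 sentences; build the ABC stability taxonomy from inter-model label agreement; compute pairwise agreement metrics (percent agreement, Cohen and Fleiss kappa) and row-normalized confusion matrices to localize disagreement; and use a T5-based emotion classifier to compare emotion distributions across agreement strata.

Result: Inter-model agreement is low to moderate overall, and disagreement concentrates on boundary decisions around the neutral class; the ABC taxonomy stratifies model output stability; the emotion analysis shows discernible differences in emotion distribution across agreement strata.

Conclusion: Multi-model label triangulation combined with the ABC taxonomy gives a cautious, operational framework for locating and interpreting disagreement in sentiment modeling of sensitive historical text, and highlights neutrality decisions as a key weakness of current models.

Abstract: Polarity detection becomes substantially more challenging under domain shift, particularly in heterogeneous, long-form narratives with complex discourse structure, such as Holocaust oral histories. This paper presents a corpus-scale diagnostic study of off-the-shelf sentiment classifiers on long-form Holocaust oral histories, using three pretrained transformer-based polarity classifiers on a corpus of 107,305 utterances and 579,013 sentences. After assembling model outputs, we introduce an agreement-based stability taxonomy (ABC) to stratify inter-model output stability. We report pairwise percent agreement, Cohen kappa, Fleiss kappa, and row-normalized confusion matrices to localize systematic disagreement. As an auxiliary descriptive signal, a T5-based emotion classifier is applied to stratified samples from each agreement stratum to compare emotion distributions across strata. The combination of multi-model label triangulation and the ABC taxonomy provides a cautious, operational framework for characterizing where and how sentiment models diverge in sensitive historical narratives. Inter-model agreement is low to moderate overall and is driven primarily by boundary decisions around neutrality.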
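The pairwise metrics named in the abstract above are standard and can be sketched in a few lines. The example below computes percent agreement, Cohen's kappa, and a row-normalized confusion matrix for two label sequences; the label names and toy data are invented for illustration, and the ABC strata themselves are not reproduced because their definitions are not given here.

```python
from collections import Counter

def percent_agreement(a, b):
    """Fraction of items on which two classifiers assign the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohen_kappa(a, b):
    """Chance-corrected agreement: (p_o - p_e) / (1 - p_e).
    Undefined (division by zero) if both raters use one identical label throughout."""
    n = len(a)
    p_o = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    labels = set(count_a) | set(count_b)
    p_e = sum(count_a[l] * count_b[l] for l in labels) / (n * n)
    return (p_o - p_e) / (1 - p_e)

def row_normalized_confusion(a, b, labels):
    """For each label of model A, the distribution of model B's labels."""
    rows = {}
    for la in labels:
        counts = Counter(y for x, y in zip(a, b) if x == la)
        total = sum(counts.values())
        rows[la] = {lb: (counts[lb] / total if total else 0.0) for lb in labels}
    return rows

# Toy usage with made-up predictions from two models.
model_a = ["pos", "neg", "neu", "neu"]
model_b = ["pos", "neg", "neu", "pos"]
agreement = percent_agreement(model_a, model_b)
kappa = cohen_kappa(model_a, model_b)
confusion = row_normalized_confusion(model_a, model_b, ["pos", "neg", "neu"])
```

Reading the confusion matrix row by row shows where one model's "neutral" lands in another model's label space, which is the kind of boundary disagreement the study reports.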

[3] CrossTrace: A Cross-Domain Dataset of Grounded Scientific Reasoning Traces for Hypothesis Generation

Andrew Bouras, OMS-II Research Fellow

Main category: cs.CL

TL;DR: This paper introduces CrossTrace, a dataset of 1,389 cross-domain scientific reasoning traces covering biomedical research, AI/ML, and cross-domain work, with every step grounded in source paper text. It defines a new Input/Trace/Output schema with fine-grained verification and eight discovery patterns; fine-tuning Qwen2.5-7B on it markedly improves hypothesis quality, structural compliance, and spark similarity; and the experiments indicate that cross-domain training beats single-domain training and that reasoning patterns transfer across disciplines.

Motivation: Existing hypothesis-generation datasets are limited to single domains and lack reasoning traces that explicitly connect prior knowledge to new hypotheses, which makes them a weak basis for modeling general scientific reasoning.

Method: Build the CrossTrace dataset (1,389 structured reasoning traces grounded at the step level); define an Input/Trace/Output schema that extends HypoGen's Bit-Flip-Spark framework with step-level verification and a taxonomy of eight discovery patterns; fine-tune Qwen2.5-7B-Instruct with QLoRA; compare single-domain and cross-domain training; and run human validation.

Result: After fine-tuning, IAScore (GPT-4o judge) rises from 0.828 to 0.968, structural compliance from 0% to 100%, and spark cosine similarity from 0.221 to 0.620; cross-domain training outperforms single-domain training; human validation finds 99.7% step-level grounding accuracy and a 0% fabrication rate.

Conclusion: CrossTrace is presented as the first large-scale, cross-domain dataset of step-level grounded reasoning traces for hypothesis generation; its traces are an effective training signal whose benefits are partly domain-general, laying groundwork for general scientific reasoning models.

Abstract: Scientific hypothesis generation is a critical bottleneck in accelerating research, yet existing datasets for training and evaluating hypothesis-generating models are limited to single domains and lack explicit reasoning traces connecting prior knowledge to novel contributions. I introduce CrossTrace, a dataset of 1,389 grounded scientific reasoning traces spanning biomedical research (518), AI/ML (605), and cross-domain work (266). Each trace captures the structured reasoning chain from established knowledge through intermediate logical steps to a novel hypothesis, with every step grounded in source paper text. I define an Input/Trace/Output schema that extends the Bit-Flip-Spark framework of HypoGen with step-level verification, a taxonomy of eight discovery patterns, and multi-domain coverage. Fine-tuning Qwen2.5-7B-Instruct on CrossTrace via QLoRA yields substantial improvements over the untuned baseline: IAScore rises from 0.828 to 0.968 (GPT-4o judge) and from 0.716 to 0.888 (Claude Opus 4.5), structural compliance improves from 0% to 100%, and spark cosine similarity increases from 0.221 to 0.620. Balanced cross-domain training (biomedical + AI/ML + CS) outperforms single-domain training, providing evidence that scientific reasoning patterns transfer across disciplines. Human validation of 150 stratified records confirms 99.7% step-level grounding accuracy and a 0.0% fabrication rate. To my knowledge, CrossTrace is the first large-scale, cross-domain dataset with step-level grounded reasoning traces for hypothesis generation, and my results demonstrate that such traces are an effective training signal whose benefits are at least partially domain-general.

[4] Theory of Mind and Self-Attributions of Mentality are Dissociable in LLMs

Junsol Kim, Winnie Street, Roberta Rocca, Daine M. Korngiebel, Adam Waytz, James Evans, Geoff Keeling

Main category: cs.CL

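TL;DR: This paper studies how safety fine-tuning of large language models (LLMs) affects mind attribution. Suppressing attributions of mind to the model itself or to technological artefacts does not impair Theory of Mind (ToM), but safety fine-tuned models under-attribute mind to non-human animals and are less likely to express widely shared spiritual beliefs.

Motivation: To test whether safety fine-tuning, which suppresses potentially harmful mind attribution (such as a model claiming consciousness or emotions), also damages closely related socio-cognitive abilities such as Theory of Mind.

Method: Use safety ablation experiments and mechanistic analyses of representational similarity to test whether mind attribution and ToM are separable at both the behavioral and the representational level.

Result: Attributions of mind to the self and to technological artefacts are behaviorally and mechanistically dissociable from ToM; however, safety fine-tuned models attribute less mind to non-human animals and express spiritual belief less often.

Conclusion: Safety fine-tuning leaves core ToM ability intact but may distort how models represent widely shared human views about mind (for example animal minds and spirituality), pointing to a risk of socio-cognitive bias.

Abstract: Safety fine-tuning in Large Language Models (LLMs) seeks to suppress potentially harmful forms of mind-attribution such as models asserting their own consciousness or claiming to experience emotions. We investigate whether suppressing mind-attribution tendencies degrades intimately related socio-cognitive abilities such as Theory of Mind (ToM). Through safety ablation and mechanistic analyses of representational similarity, we demonstrate that LLM attributions of mind to themselves and to technological artefacts are behaviorally and mechanistically dissociable from ToM capabilities. Nevertheless, safety fine-tuned models under-attribute mind to non-human animals relative to human baselines and are less likely to exhibit spiritual belief, suppressing widely shared perspectives regarding the distribution and nature of non-human minds.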

[5] Known Intents, New Combinations: Clause-Factorized Decoding for Compositional Multi-Intent Detection

Abhilash Nandy

Main category: cs.CL

TL;DR: This paper introduces the CoMIX-Shift benchmark and the ClauseCompose decoder, focusing on generalization to unseen intent combinations in multi-intent detection, and shows that simple factorization is effective under strongly compositional evaluation.

Motivation: Existing multi-intent detection benchmarks do not adequately test generalization to new intent combinations (in particular intent pairs or structures never seen in training), even though this matters for real deployment.

Method: Build the CoMIX-Shift benchmark with held-out intent pairs, discourse-pattern shift, longer and noisier wrappers, held-out templates, and zero-shot triples; propose ClauseCompose, a lightweight decoder trained only on single-intent data; and compare it with WholeMultiLabel and a fine-tuned tiny BERT baseline.

Result: ClauseCompose clearly outperforms the baselines across CoMIX-Shift subtasks: 95.7 on unseen intent pairs, 93.9 under discourse shift, 62.5 with longer/noisier wrappers, 49.8 on held-out templates, and 91.1 on zero-shot triples; it also leads by a wide margin on the manually authored SNIPS-style set (97.5 vs. 41.3).

Conclusion: Multi-intent detection needs stronger compositional evaluation; a simple factorized decoding strategy performs well once evaluation is designed around compositionality, suggesting that modeling should favor decomposable structure over end-to-end black boxes.

Abstract: Multi-intent detection papers usually ask whether a model can recover multiple intents from one utterance. We ask a harder and, for deployment, more useful question: can it recover new combinations of familiar intents? Existing benchmarks only weakly test this, because train and test often share the same broad co-occurrence patterns. We introduce CoMIX-Shift, a controlled benchmark built to stress compositional generalization in multi-intent detection through held-out intent pairs, discourse-pattern shift, longer and noisier wrappers, held-out clause templates, and zero-shot triples. We also present ClauseCompose, a lightweight decoder trained only on singleton intents, and compare it to whole-utterance baselines including a fine-tuned tiny BERT model. Across three random seeds, ClauseCompose reaches 95.7 exact match on unseen intent pairs, 93.9 on discourse-shifted pairs, 62.5 on longer/noisier pairs, 49.8 on held-out templates, and 91.1 on unseen triples. WholeMultiLabel reaches 81.4, 55.7, 18.8, 15.5, and 0.0; the BERT baseline reaches 91.5, 77.6, 48.9, 11.0, and 0.0. We also add a 240-example manually authored SNIPS-style compositional set with five held-out pairs; there, ClauseCompose reaches 97.5 exact match on unseen pairs and 86.7 under connector shift, compared with 41.3 and 10.4 for WholeMultiLabel. The results suggest that multi-intent detection needs more compositional evaluation, and that simple factorization goes surprisingly far once evaluation asks for it.
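To make the factorization idea above concrete, here is a minimal sketch: split an utterance into clauses, label each clause with a single-intent classifier, and take the union. Everything specific in it is an assumption for illustration: the connector list, the intent names, and the keyword lookup that stands in for a trained singleton classifier. It is not the ClauseCompose implementation.

```python
import re

# Hypothetical connector inventory; longer alternatives come first so that
# "and then" is consumed as one connector.
CONNECTORS = re.compile(r"\b(?:and then|and also|and|then|also)\b|[,;]")

# Stand-in for a classifier trained only on single-intent utterances.
KEYWORD_TO_INTENT = {
    "play": "PlayMusic",
    "alarm": "SetAlarm",
    "weather": "GetWeather",
}

def split_clauses(utterance):
    """Split on connectors and punctuation, dropping empty fragments."""
    parts = CONNECTORS.split(utterance.lower())
    return [p.strip() for p in parts if p.strip()]

def classify_clause(clause):
    """Return one intent for a clause, or None if no keyword matches."""
    words = clause.split()
    for keyword, intent in KEYWORD_TO_INTENT.items():
        if keyword in words:
            return intent
    return None

def decode(utterance):
    """Union of per-clause intents; combinations never need to be seen jointly."""
    intents = {classify_clause(c) for c in split_clauses(utterance)}
    intents.discard(None)
    return sorted(intents)

pair = decode("Play some jazz and then set an alarm for 7")
triple = decode("play jazz, set an alarm and check the weather")
```

Because each clause is classified independently, a pair or triple of intents that never co-occurred in training is handled the same way as a familiar one, which is the property the benchmark is designed to test.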

[6] Human-Like Lifelong Memory: A Neuroscience-Grounded Architecture for Infinite Interaction

Diego C. Lerma-Torres

Main category: cs.CL

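TL;DR: This paper proposes a bio-inspired structured memory framework, grounded in cognitive science principles such as complementary learning systems theory, to address the lack of persistent, structured memory in large language models. The framework emphasizes emotional valence in memory, fast default retrieval combined with deeper retrieval when needed, and active, feedback-driven encoding, with the goal of interactions that become more efficient and less prone to hallucination as experience accumulates.

Motivation: Large language models lack persistent, structured memory for long-term interaction and context-sensitive retrieval; simply enlarging the context window can severely degrade reasoning (by up to 85%), so memory needs to be rethought from a cognitive science perspective.

Method: Combine complementary learning systems theory, the belief hierarchy of cognitive behavioral therapy, dual-process cognition, and fuzzy-trace theory into a memory framework built on three principles: (1) memory carries valence (valence vectors) and a belief hierarchy; (2) retrieval defaults to intuitive System 1 and escalates to deliberate System 2 when needed, with graded epistemic states; (3) encoding is active, present, and feedback-dependent, with gists formed through a thalamic gateway and an executive system. Seven functional properties that any implementation must satisfy are also defined.

Result: Over time the framework lets the system converge toward System 1-dominated processing, analogous to the fluency of a clinical expert, so that interaction cost falls with experience and hallucination is mitigated structurally.

Conclusion: A structured memory framework designed on cognitive science principles could fundamentally improve the long-term interaction ability, contextual adaptability, and reasoning robustness of large language models, offering a new paradigm for agents capable of continual learning and self-adjustment.

Abstract: Large language models lack persistent, structured memory for long-term interaction and context-sensitive retrieval. Expanding context windows does not solve this: recent evidence shows that context length alone degrades reasoning by up to 85%, even with perfect retrieval. We propose a bio-inspired memory framework grounded in complementary learning systems theory, cognitive behavioral therapy's belief hierarchy, dual-process cognition, and fuzzy-trace theory, organized around three principles: (1) Memory has valence, not just content: pre-computed emotional-associative summaries (valence vectors) organized in an emergent belief hierarchy inspired by Beck's cognitive model enable instant orientation before deliberation; (2) Retrieval defaults to System 1 with System 2 escalation: automatic spreading activation and passive priming as default, with deliberate retrieval only when needed, and graded epistemic states that address hallucination structurally; and (3) Encoding is active, present, and feedback-dependent: a thalamic gateway tags and routes information between stores, while the executive forms gists through curiosity-driven investigation, not passive exposure. Seven functional properties specify what any implementation must satisfy. Over time, the system converges toward System 1 processing (the computational analog of clinical expertise), producing interactions that become cheaper, not more expensive, with experience.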

[7] The Model Says Walk: How Surface Heuristics Override Implicit Constraints in LLM Reasoning

Yubo Li, Lu Zhang, Tianchong Jiang, Ramayya Krishnan, Rema Padman

Main category: cs.CL

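TL;DR: This paper shows that large language models fail systematically when a surface cue conflicts with an implicit feasibility constraint. Using a diagnose-measure-bridge-treat framework, it introduces the Heuristic Override Benchmark (HOB) and several interventions (such as minimal hints and goal-decomposition prompting) to identify and mitigate the problem.

Motivation: Large language models often make reasoning errors because they rely on surface cues and ignore implicit feasibility constraints; systematic diagnosis and quantifiable evaluation are needed.

Method: Causal-behavioral analysis (using the "car wash problem" as the case study); construction of the Heuristic Override Benchmark (HOB), with 500 minimal-pair instances spanning 4 heuristic families by 5 constraint families; token-level attribution analysis; parametric probes; and goal-decomposition prompting.

Result: Models show approximately context-independent sigmoid heuristic behavior; under strict evaluation none of the 14 models exceeds 75% accuracy, and a conservative bias is present; a minimal hint recovers 15 percentage points on average, and goal-decomposition prompting recovers 6 to 9 percentage points; parametric probes confirm that the pattern generalizes to cost, efficiency, and semantic-similarity heuristics.

Conclusion: Heuristic override is a systematic reasoning vulnerability of large language models; HOB offers a standardized benchmark for measuring progress, and the interventions indicate that the core problem is constraint inference rather than missing knowledge.

Abstract: Large language models systematically fail when a salient surface cue conflicts with an unstated feasibility constraint. We study this through a diagnose-measure-bridge-treat framework. Causal-behavioral analysis of the "car wash problem" across six models reveals approximately context-independent sigmoid heuristics: the distance cue exerts 8.7 to 38 times more influence than the goal, and token-level attribution shows patterns more consistent with keyword associations than compositional inference. The Heuristic Override Benchmark (HOB), which comprises 500 instances spanning 4 heuristic by 5 constraint families with minimal pairs and explicitness gradients, demonstrates generality across 14 models: under strict evaluation (10/10 correct), no model exceeds 75%, and presence constraints are hardest (44%). A minimal hint (e.g., emphasizing the key object) recovers +15 pp on average, suggesting the failure lies in constraint inference rather than missing knowledge; 12/14 models perform worse when the constraint is removed (up to -39 pp), revealing conservative bias. Parametric probes confirm that the sigmoid pattern generalizes to cost, efficiency, and semantic-similarity heuristics; goal-decomposition prompting recovers +6 to 9 pp by forcing models to enumerate preconditions before answering. Together, these results characterize heuristic override as a systematic reasoning vulnerability and provide a benchmark for measuring progress toward resolving it.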
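The "strict evaluation (10/10 correct)" criterion in the abstract above can be contrasted with ordinary per-trial accuracy in a short sketch. The reading assumed here is that each instance is attempted ten times and counts as solved only if every attempt is correct; the data layout and function names are made up for the example.

```python
def strict_accuracy(trials_per_instance):
    """Share of instances where every repeated trial is correct."""
    solved = sum(all(trials) for trials in trials_per_instance)
    return solved / len(trials_per_instance)

def mean_accuracy(trials_per_instance):
    """Ordinary accuracy pooled over all trials of all instances."""
    flat = [t for trials in trials_per_instance for t in trials]
    return sum(flat) / len(flat)

# Toy data: three instances, ten trials each (True = correct answer).
results = [
    [True] * 10,            # always right
    [True] * 9 + [False],   # one slip: fails under the strict criterion
    [False] * 10,           # always wrong
]
strict = strict_accuracy(results)
pooled = mean_accuracy(results)
```

The gap between the two numbers shows why a strict criterion is harsher: a single inconsistent answer removes the whole instance from the solved count.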

[8] On the limited utility of parallel data for learning shared multilingual representations

Julius Leino, Jörg Tiedemann

Main category: cs.CL

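TL;DR: This paper studies how parallel data in multilingual pretraining affects cross-lingual representation alignment and finds the effect is limited: it may speed up early representation sharing and reduce the number of language-specific neurons, while cross-lingual alignment also emerges without parallel data.

Motivation: To examine how parallel data (translated sentence pairs), used as a pretraining signal, affects the cross-lingual alignment of multilingual representations.

Method: Train several reference models with different proportions of parallel data in pretraining and analyze the effect on cross-lingual alignment with multiple evaluation methods.

Result: Parallel data has only a minimal effect on cross-lingual alignment; it may accelerate early representation sharing and reduce language-specific neurons, and models reach similar levels of alignment even without it.

Conclusion: Cross-lingual alignment does not depend heavily on an explicit parallel-data signal and may arise largely from implicit learning during pretraining.

Abstract: Shared multilingual representations are essential for cross-lingual tasks and knowledge transfer across languages. This study looks at the impact of parallel data, i.e. translated sentences, in pretraining as a signal to trigger representations that are aligned across languages. We train reference models with different proportions of parallel data and show that parallel data seem to have only a minimal effect on the cross-lingual alignment. Based on multiple evaluation methods, we find that the effect is limited to potentially accelerating the representation sharing in the early phases of pretraining, and to decreasing the number of language-specific neurons in the model. Cross-lingual alignment seems to emerge on similar levels even without the explicit signal from parallel data.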

[9] An Empirical Recipe for Universal Phone Recognition

Shikhar Bharadwaj, Chin-Jou Li, Kwanghee Choi, Eunjung Yeo, William Chen, Shinji Watanabe, David R. Mortensen

Main category: cs.CL

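TL;DR: This paper presents PhoneticXEUS, a model trained on large-scale multilingual data that achieves state-of-the-art phone recognition (PR) on multilingual and accented English speech, and uses ablations to analyze the effect of self-supervised representations, data scale, and loss objectives.

Motivation: High-performing English-focused models generalize poorly across languages, while multilingual models underuse pretrained representations; the effect of data scale, architecture, and training objective on multilingual phone recognition is also unclear.

Method: Train PhoneticXEUS on large-scale multilingual data; run controlled ablations under a unified evaluation scheme covering more than 100 languages to analyze the role of SSL representations, data scale, and loss functions; and analyze error patterns across language families, accented speech, and articulatory features.

Result: PhoneticXEUS reaches 17.7% PFER on multilingual PR and 10.6% PFER on accented English PR, both state of the art; the ablations quantify the separate contributions of SSL representations, data scale, and loss objectives.

Conclusion: Large-scale multilingual training combined with a carefully designed training recipe substantially improves the cross-lingual robustness of phone recognition; the openly released data and code support follow-up research.

Abstract: Phone recognition (PR) is a key enabler of multilingual and low-resource speech processing tasks, yet robust performance remains elusive. Highly performant English-focused models do not generalize across languages, while multilingual models underutilize pretrained representations. It also remains unclear how data scale, architecture, and training objective contribute to multilingual PR. We present PhoneticXEUS, trained on large-scale multilingual data and achieving state-of-the-art performance on both multilingual (17.7% PFER) and accented English speech (10.6% PFER). Through controlled ablations with evaluations across 100+ languages under a unified scheme, we empirically establish our training recipe and quantify the impact of SSL representations, data scale, and loss objectives. In addition, we analyze error patterns across language families, accented speech, and articulatory features. All data and code are released openly.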

[10] Dual Perspectives in Emotion Attribution: A Generator-Interpreter Framework for Cross-Cultural Analysis of Emotion in LLMs

Aizirek Turdubaeva, Uichin Lee

Main category: cs.CL

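TL;DR: This paper proposes a Generator-Interpreter framework to address the fact that large language models (LLMs) overlook the cultural background of the person expressing an emotion in cross-cultural emotion attribution. Evaluating six LLMs on data from 15 countries, it finds that the generator's country has the stronger effect on performance and calls for culturally sensitive emotion modeling.

Motivation: Prior work on emotion attribution focuses mainly on the interpretation side and overlooks the cultural background of the person expressing the emotion; it assumes emotional expression is universal and ignores cross-national differences in how emotions are expressed and understood.

Method: Propose a dual-perspective Generator-Interpreter framework, systematically evaluate six large language models on an emotion attribution task with data from 15 countries, and analyze how emotion type and cultural background affect performance.

Result: Model performance varies with emotion type and cultural context; generator-interpreter alignment effects are present, and the generator's country has the stronger effect on performance.

Conclusion: Culturally sensitive emotion modeling should be developed to improve the robustness and fairness of emotion understanding by large language models in culturally diverse settings.

Abstract: Large language models (LLMs) are increasingly used in cross-cultural systems to understand and adapt to human emotions, which are shaped by cultural norms of expression and interpretation. However, prior work on emotion attribution has focused mainly on interpretation, overlooking the cultural background of emotion generators. This assumption of universality neglects variation in how emotions are expressed and perceived across nations. To address this gap, we propose a Generator-Interpreter framework that captures dual perspectives of emotion attribution by considering both expression and interpretation. We systematically evaluate six LLMs on an emotion attribution task using data from 15 countries. Our analysis reveals that performance variations depend on the emotion type and cultural context. Generator-interpreter alignment effects are present; the generator's country of origin has a stronger impact on performance. We call for culturally sensitive emotion modeling in LLM-based systems to improve robustness and fairness in emotion understanding across diverse cultural contexts.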

[11] PolarQuant: Optimal Gaussian Weight Quantization via Hadamard Rotation for LLM Compression

Caio Vicentino

Main category: cs.CL

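TL;DR: PolarQuant is a calibration-free post-training weight quantization method that achieves near-lossless LLM compression in three steps: block-wise normalization, Walsh-Hadamard rotation, and Gaussian-matched quantization. Hadamard rotation alone delivers most of the quality gain, and the method also works as an effective preprocessing step before INT4 quantization.

Motivation: Existing quantization methods (such as absmax) tend to lose significant accuracy at low bit widths, while reliance on calibration data limits practicality; an efficient scheme is needed that requires no calibration and better exploits the distributional structure of the weights.

Method: PolarQuant has three stages: 1) block-wise normalization to the unit hypersphere; 2) a Walsh-Hadamard transform that makes the weights approximately Gaussian; 3) quantization with centroids designed for the Gaussian distribution.

Result: On Qwen3.5-9B, PolarQuant Q5 lowers perplexity to 6.40 (only +0.03 relative to FP16), better than absmax Q5 (6.90); used as a preprocessing step for torchao INT4, it reaches perplexity 6.56 (vs. 6.68 for direct absmax INT4) while keeping 43.1 tok/s throughput and 6.5 GB of VRAM.

Conclusion: By using a geometric transform to expose and exploit the intrinsic distribution of the weights, PolarQuant delivers high-quality, calibration-free, high-throughput LLM quantization that is both general and practical.

Abstract: We present PolarQuant, a post-training weight quantization method for large language models (LLMs) that exploits the distributional structure of neural network weights to achieve near-lossless compression. PolarQuant operates in three stages: (1) block-wise normalization to the unit hypersphere, (2) Walsh-Hadamard rotation to transform coordinates into approximately Gaussian random variables, and (3) quantization with centroids matched to the Gaussian distribution. Our ablation reveals that Hadamard rotation alone accounts for 98% of the quality improvement, reducing Qwen3.5-9B perplexity from 6.90 (absmax Q5) to 6.40 (Delta = +0.03 from FP16), making it practically lossless without any calibration data. Furthermore, PolarQuant functions as an effective preprocessing step for downstream INT4 quantizers: PolarQuant Q5 dequantized and re-quantized by torchao INT4 achieves perplexity 6.56 versus 6.68 for direct absmax INT4, while maintaining 43.1 tok/s throughput at 6.5 GB VRAM. Code and models are publicly available.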
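The three stages listed above can be sketched on a single weight block using only the standard library. This is an illustrative toy, not the released implementation: the block size, the 5-bit setting, and in particular the choice of centroids (equal-probability Gaussian quantiles via `statistics.NormalDist`, with standard deviation 1/sqrt(n) for a rotated unit vector) are assumptions made for this example.

```python
import math
from statistics import NormalDist

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform (self-inverse).
    len(x) must be a power of two."""
    y, n, h = list(x), len(x), 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in y]

def gaussian_centroids(bits, sigma):
    """Assumed centroid choice: midpoints of 2**bits equal-probability bins."""
    k = 2 ** bits
    dist = NormalDist(0.0, sigma)
    return [dist.inv_cdf((i + 0.5) / k) for i in range(k)]

def quantize_block(block, bits=5):
    norm = math.sqrt(sum(v * v for v in block)) or 1.0    # stage 1: normalize
    rotated = fwht([v / norm for v in block])             # stage 2: rotate
    cents = gaussian_centroids(bits, 1.0 / math.sqrt(len(block)))
    codes = [min(range(len(cents)), key=lambda k: abs(cents[k] - v))
             for v in rotated]                            # stage 3: nearest centroid
    return norm, codes

def dequantize_block(norm, codes, bits=5):
    cents = gaussian_centroids(bits, 1.0 / math.sqrt(len(codes)))
    unit = fwht([cents[c] for c in codes])                # undo the rotation
    return [v * norm for v in unit]

block = [0.3, -1.2, 0.8, 0.05, -0.4, 1.1, -0.7, 0.2]
norm, codes = quantize_block(block)
restored = dequantize_block(norm, codes)
```

Only the per-block norm and the integer codes need to be stored; because the orthonormal transform is its own inverse, dequantization reuses `fwht` unchanged.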

[12] APEX-EM: Non-Parametric Online Learning for Autonomous Agents via Structured Procedural-Episodic Experience Replay

Pratyay Banerjee, Masud Moshtaghi, Ankit Chadha

Main category: cs.CL

TL;DR: This paper proposes APEX-EM, a framework that gives LLM agents non-parametric, online procedural memory. Through a structured experience representation, the PRGII workflow, and a dual-outcome experience memory, it enables plan reuse and iterative refinement across tasks and yields large gains on several benchmarks.

Motivation: Existing LLM-based autonomous agents lack persistent procedural memory; they re-derive solutions from scratch for structurally similar tasks, which is inefficient and prevents experience from accumulating.

Method: APEX-EM comprises (1) a structured experience representation that records the planning steps, artifacts, iteration history, error analysis, and quality scores of each execution; (2) a Plan-Retrieve-Generate-Iterate-Ingest (PRGII) workflow with multi-dimensional reward signals from Task Verifiers; and (3) a dual-outcome Experience Memory that combines semantic search, structural signature matching, and plan DAG traversal to support cross-domain transfer between tasks with no lexical overlap but similar structure.

Result: On KGQAGen-10k, accuracy reaches 89.6% (+48.3 pp), surpassing the oracle-retrieval upper bound; on BigCodeBench, the success rate reaches 83.3% (+29.4 pp), ahead of MemRL; on HLE, entity graph retrieval reaches 48.0% (+22.8 pp); ablations show that the value of each component depends on the task.

Conclusion: APEX-EM shows that procedural memory can be built and reused effectively without modifying model weights, an important paradigm for improving the continual learning and generalization of LLM agents.

Abstract: LLM-based autonomous agents lack persistent procedural memory: they re-derive solutions from scratch even when structurally identical tasks have been solved before. We present APEX-EM, a non-parametric online learning framework that accumulates, retrieves, and reuses structured procedural plans without modifying model weights. APEX-EM introduces: (1) a structured experience representation encoding the full procedural-episodic trace of each execution (planning steps, artifacts, iteration history with error analysis, and quality scores); (2) a Plan-Retrieve-Generate-Iterate-Ingest (PRGII) workflow with Task Verifiers providing multi-dimensional reward signals; and (3) a dual-outcome Experience Memory with hybrid retrieval combining semantic search, structural signature matching, and plan DAG traversal, enabling cross-domain transfer between tasks sharing no lexical overlap but analogous operational structure. Successful experiences serve as positive in-context examples; failures as negative examples with structured error annotations. We evaluate on BigCodeBench, KGQAGen-10k, and Humanity's Last Exam using Claude Sonnet 4.5 and Opus 4.5. On KGQAGen-10k, APEX-EM achieves 89.6% accuracy versus 41.3% without memory (+48.3pp), surpassing the oracle-retrieval upper bound (84.9%). On BigCodeBench, it reaches 83.3% SR from a 53.9% baseline (+29.4pp), exceeding MemRL's +11.0pp gain under comparable frozen-backbone conditions (noting backbone differences controlled for in our analysis). On HLE, entity graph retrieval reaches 48.0% from 25.2% (+22.8pp). Ablations show component value is task-dependent: rich judge feedback is negligible for code generation but critical for structured queries (+10.3pp), while binary-signal iteration partially compensates for weaker feedback.

[13] Concept Training for Human-Aligned Language Models

Christine Zhang, Dan Jurafsky, Chen Shani

Main category: cs.CL

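TL;DR: This paper proposes a supervised training framework based on concepts (sets of semantically related words) in place of the usual single next-token prediction (NTP) target, improving the model's alignment with semantic similarity judgments and lowering perplexity on semantically meaningful words while keeping language modeling performance competitive.

Motivation: Standard next-token prediction treats continuations that are semantically equivalent but differ in form as mutually exclusive targets, ignoring the fact that a prefix in natural language has many plausible, semantically close continuations; this limits the model's semantic understanding.

Method: Supervise with concepts (sets of semantically related words) instead of a single target token, extending the standard NTP objective to predicting a semantically coherent set of tokens; evaluate alignment with human semantic similarity judgments on several lexical semantics benchmarks; and compare perplexity.

Result: Models trained with concept supervision clearly outperform standard NTP models on multiple semantic similarity benchmarks; perplexity on semantically meaningful words drops while global token-level perplexity rises slightly, indicating a tradeoff between semantic alignment and standard NTP optimization.

Conclusion: Concept-level training objectives can strengthen the semantic alignment of language models while maintaining competitive language modeling performance, offering a new direction for improving pretraining objectives.

Abstract: The next-token prediction (NTP) objective trains language models to predict a single continuation token at each step. In natural language, however, a prefix can be continued in many valid ways, and even similar meanings may differ in surface form. For example, the sentence "this website is safe to browse" could plausibly continue with words such as browse, search, visit, surf, or navigate. While standard NTP training treats these alternatives as mutually exclusive targets, we explore a framework that instead predicts concepts, approximated as sets of semantically related tokens. We show that models trained with concept supervision exhibit stronger alignment with human semantic similarity judgments on multiple lexical benchmarks. These gains are accompanied by lower perplexity on semantically meaningful words (definition in Section 3.1), and a modest increase in global token-level perplexity, reflecting a tradeoff between standard NTP optimization and concept-level supervision. Our results suggest that concept-level objectives can improve semantic alignment while maintaining competitive language modeling performance.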
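One plausible way to write down a set-valued target, offered here only as an assumption since the exact loss is not stated above, is to replace the negative log-probability of the single target token with the negative log of the total probability assigned to every token in the concept. The toy vocabulary and logits below are invented for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ntp_loss(logits, target):
    """Standard next-token loss: -log p(target)."""
    return -math.log(softmax(logits)[target])

def concept_loss(logits, concept):
    """Assumed concept-level loss: -log of the probability mass on the concept set."""
    probs = softmax(logits)
    return -math.log(sum(probs[t] for t in concept))

# Toy vocabulary for the prefix "this website is safe to ...".
vocab = ["browse", "search", "visit", "eat"]
logits = [2.0, 1.5, 1.0, -1.0]
concept = {0, 1, 2}                     # browse, search, visit

single = ntp_loss(logits, target=0)     # penalizes mass on "search" and "visit"
grouped = concept_loss(logits, concept) # rewards mass anywhere in the concept
```

Under this formulation the model is not penalized for spreading probability among near-synonyms, which matches the motivation above; whether the paper uses exactly this form is not confirmed here.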

[14] Kwame 2.0: Human-in-the-Loop Generative AI Teaching Assistant for Large Scale Online Coding Education in Africa

George Boateng, Samuel Boateng, Victor Kumbol

Main category: cs.CL

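TL;DR: Kwame 2.0 is a bilingual (English-French) generative AI teaching assistant that combines retrieval-augmented generation with human-in-the-loop oversight, deployed in SuaCode, an online coding course for learners in Africa. A 15-month longitudinal study with 3,717 enrollments shows that it delivers both high-quality and timely responses, with human collaboration compensating for the AI's limitations.

Motivation: To address the difficulty of providing timely and accurate learning support in large-scale online coding courses in resource-constrained settings, particularly for underrepresented populations such as learners in Africa.

Method: Build Kwame 2.0, a bilingual AI teaching assistant based on retrieval-augmented generation (RAG), and embed it in a human-in-the-loop forum that combines course material retrieval, context-aware generation, human review, and community participation.

Result: In a 15-month longitudinal study spanning 35 countries and 15 cohorts with 3,717 enrollments, the system answered curriculum-related questions with high accuracy and timely responses; human facilitators effectively corrected errors, especially on administrative queries; community feedback and expert ratings both indicate high-quality support.

Conclusion: Human-in-the-loop generative AI systems can combine the scalable efficiency of AI with the reliability of human judgment, and are an effective and feasible route to educational support in resource-constrained regions.

Abstract: Providing timely and accurate learning support in large-scale online coding courses is challenging, particularly in resource-constrained contexts. We present Kwame 2.0, a bilingual (English-French) generative AI teaching assistant built using retrieval-augmented generation and deployed in a human-in-the-loop forum within SuaCode, an introductory mobile-based coding course for learners across Africa. Kwame 2.0 retrieves relevant course materials and generates context-aware responses while encouraging human oversight and community participation. We deployed the system in a 15-month longitudinal study spanning 15 cohorts with 3,717 enrollments across 35 African countries. Evaluation using community feedback and expert ratings shows that Kwame 2.0 provided high-quality and timely support, achieving high accuracy on curriculum-related questions, while human facilitators and peers effectively mitigated errors, particularly for administrative queries. Our findings demonstrate that human-in-the-loop generative AI systems can combine the scalability and speed of AI with the reliability of human support, offering an effective approach to learning assistance for underrepresented populations in resource-constrained settings at scale.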

[15] SyriSign: A Parallel Corpus for Arabic Text to Syrian Arabic Sign Language Translation

Mohammad Amer Khalil, Raghad Nahas, Ahmad Nassar, Khloud Al Jallad

Main category: cs.CL

TL;DR: This paper introduces SyriSign, the first public video dataset for Syrian Arabic Sign Language (SyArSL), with 1,500 samples covering 150 lexical signs. It is designed for text-to-SyArSL translation and is used to evaluate three models: MotionCLIP, T2M-GPT, and SignCLIP.

Motivation: To reduce the information access barrier faced by the deaf community in Syria, where most news is delivered in spoken or written Arabic, and to fill the gap left by the absence of public benchmark datasets for low-resource sign languages, Syrian Arabic Sign Language in particular.

Method: Build the SyriSign dataset (1,500 video samples covering 150 lexical signs) and evaluate three deep learning architectures, MotionCLIP, T2M-GPT, and SignCLIP, on the text-to-sign translation task.

Result: Generative approaches show potential for sign representation, but the limited dataset size constrains generalization; SyriSign will be released publicly as the first benchmark dataset for this language.

Conclusion: SyriSign is an important step for low-resource sign language technology and underlines the need for localized sign language datasets to support inclusive communication.

Abstract: Sign language is the primary means of communication for the Deaf and Hard-of-Hearing (DHH) community. While there are numerous benchmarks for high-resource sign languages, low-resource languages like Arabic remain underrepresented. Currently, there is no publicly available dataset for Syrian Arabic Sign Language (SyArSL). To overcome this gap, we introduce SyriSign, a dataset comprising 1500 video samples across 150 unique lexical signs, designed for text-to-SyArSL translation tasks. This work aims to reduce communication barriers in Syria, as most news is delivered in spoken or written Arabic, which is often inaccessible to the deaf community. We evaluated SyriSign using three deep learning architectures: MotionCLIP for semantic motion generation, T2M-GPT for text-conditioned motion synthesis, and SignCLIP for bilingual embedding alignment. Experimental results indicate that while generative approaches show strong potential for sign representation, the limited dataset size constrains generalization performance. We will release SyriSign publicly, hoping it serves as an initial benchmark.

[16] SiPaKosa: A Comprehensive Corpus of Canonical and Classical Buddhist Texts in Sinhala and Pali

Ranidu Gurusinghe, Nevidu Jayatilleke

Main category: cs.CL

TL;DR: SiPaKosa is a large, high-quality corpus of Sinhala and Pali Buddhist texts (786K sentences, 9.25M words), built from OCR’d manuscripts and web-scraped Tripitaka, enabling domain-specific language modeling and cultural preservation.

Motivation: To support domain-adapted language modeling, historical linguistic analysis, and information retrieval for Buddhist scholarship, while preserving Sinhala cultural heritage.

Method: Constructed via high-quality OCR (Google Document AI) on 16 copyright-cleared historical manuscripts and systematic web scraping of Tripitaka repositories, followed by rigorous quality control and metadata annotation; organized into Sinhala and Mixed Sinhala-Pali subcorpora; evaluated using perplexity across ten pretrained language models.

Result: Proprietary language models outperform open-source ones by 3–6× in perplexity (scores: 1.09–189.67); the corpus enables pretraining, historical analysis, and IR system development.

Conclusion: SiPaKosa is a foundational resource for NLP in Sinhala-Pali Buddhist domains, bridging linguistic technology and cultural preservation.

Abstract: SiPaKosa is a comprehensive corpus of Sinhala and Pali doctrinal texts comprising approximately 786K sentences and 9.25M words, incorporating 16 copyright-cleared historical Buddhist documents alongside the complete web-scraped Tripitaka canonical texts. The corpus was created through high-quality OCR using Google Document AI on historical manuscripts, combined with systematic web scraping of canonical repositories, followed by rigorous quality control and metadata annotation. The corpus is organised into language-specific subcorpora: Sinhala and Mixed Sinhala-Pali. We evaluate the performance of language models using ten pretrained models, with perplexity scores ranging from 1.09 to 189.67 on our corpus. This analysis shows that proprietary models significantly outperform open-source alternatives, by a factor of three to six. This corpus supports the pretraining of domain-adapted language models, facilitates historical language analysis, and aids in the development of information retrieval systems for Buddhist scholarship while preserving Sinhala cultural heritage.

[17] Long-Document QA with Chain-of-Structured-Thought and Fine-Tuned SLMs

Zhuowen Liang, Xiaotian Lin, Zhengxuan Zhang, Yuyu Luo, Haixun Wang, Nan Tang

Main category: cs.CL

TL;DR: This paper proposes LiteCoST, a framework that uses Chain-of-Structured-Thought (CoST) to guide a large model in producing structured reasoning traces and outputs, then fine-tunes small language models (SLMs) on that data in two stages. On long-document question answering, 3B/7B models reach accuracy close to that of large models with 2 to 4 times lower latency.

Motivation: Direct reasoning by large language models (LLMs) over long, noisy documents remains brittle and error-prone; a method is needed that consolidates dispersed evidence into structured outputs (such as tables or graphs) to support reliable, verifiable question answering.

Method: LiteCoST has two pillars: 1) Chain-of-Structured-Thought (CoST), a schema-aware instruction template that guides a strong LLM to produce a step-wise reasoning trace and a structured output, covering structure induction, entity/unit normalization, record alignment, serialization, and verification/refinement; 2) two-stage fine-tuning of small language models on the CoST data, first Supervised Fine-Tuning (SFT) for structural alignment, then Group Relative Policy Optimization (GRPO) with triple rewards (answer quality, format quality, process consistency).

Result: 3B/7B small models reach accuracy comparable to large LLMs on multi-domain long-document QA, with 2 to 4 times lower latency than GPT-4o and DeepSeek-R1 (671B).

Conclusion: By distilling structure-first reasoning behavior into small models, LiteCoST combines high accuracy with low latency and offers a new paradigm for lightweight, verifiable document question answering.

Abstract: Large language models (LLMs) are widely applied to data analytics over documents, yet direct reasoning over long, noisy documents remains brittle and error-prone. Hence, we study document question answering (QA) that consolidates dispersed evidence into a structured output (e.g., a table, graph, or chunks) to support reliable, verifiable QA. We propose a two-pillar framework, LiteCoST, to achieve both high accuracy and low latency with small language models (SLMs). Pillar 1: Chain-of-Structured-Thought (CoST). We introduce a CoST template, a schema-aware instruction that guides a strong LLM to produce both a step-wise CoST trace and the corresponding structured output. The process induces a minimal structure, normalizes entities/units, aligns records, serializes the output, and verifies/refines it, yielding auditable supervision. Pillar 2: SLM fine-tuning. The compact models are trained on LLM-generated CoST data in two stages: Supervised Fine-Tuning for structural alignment, followed by Group Relative Policy Optimization (GRPO) incorporating triple rewards for answer/format quality and process consistency. By distilling structure-first behavior into SLMs, this approach achieves LLM-comparable quality on multi-domain long-document QA using 3B/7B SLMs, while delivering 2-4x lower latency than GPT-4o and DeepSeek-R1 (671B). The code is available at https://github.com/HKUSTDial/LiteCoST.
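The GRPO stage described above scores several sampled outputs for the same prompt and normalizes each reward against its group. The sketch below shows that normalization together with a combined three-part reward; the equal weights, the reward values, and the function names are assumptions for illustration, and how LiteCoST actually combines its answer, format, and process rewards is not specified here.

```python
from statistics import mean, pstdev

def total_reward(answer, fmt, process, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of answer, format, and process rewards (weights assumed)."""
    w_answer, w_format, w_process = weights
    return w_answer * answer + w_format * fmt + w_process * process

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Toy group: three sampled outputs for one question.
group = [
    total_reward(answer=1.0, fmt=1.0, process=1.0),  # correct and well-formed
    total_reward(answer=0.0, fmt=1.0, process=0.5),  # well-formed but wrong
    total_reward(answer=0.0, fmt=0.0, process=0.0),  # malformed
]
advantages = group_relative_advantages(group)
```

Because advantages are relative to the group, no separate value model is needed: an output is reinforced only to the extent that it beats its sibling samples.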

[18] The Thiomi Dataset: A Large-Scale Multimodal Corpus for Low-Resource African Languages

Hillary Mutisya,John Mugane,Gavin Nyamboga,Brian Chege,Maryruth Gathoni

Main category: cs.CL

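TL;DR: This paper introduces the Thiomi Dataset, a large-scale multimodal corpus covering ten African languages, with over 600,000 sentence-level text annotations and over 385,000 audio recordings across nine languages. Data were collected through a community platform and checked by a multi-tier quality assurance pipeline. Using the dataset, the authors establish ASR, MT, and TTS baselines; the ASR system substantially improves on the prior academic SOTA for Swahili and reaches 4.3% WER on Somali.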

Details Motivation: 解决非洲语言在语音与自然语言处理领域数据稀缺、技术基础设施薄弱的问题,推动低资源语言AI发展。 Method: 构建Thiomi社区驱动的数据收集平台,覆盖十种非洲语言;设计多层级质量保证流程;收集并清洗文本与音频数据;训练并评估ASR、MT、TTS基线模型。 Result: 发布含601,000+文本标注和385,000+音频样本的Thiomi数据集;ASR在斯瓦希里语上WER达3.24%(较前SOTA降低61%),索马里语达4.3%;数据集将开源至HuggingFace。 Conclusion: Thiomi数据集为非洲语言技术提供了高质量、可扩展的基础设施支撑,验证了社区协作与系统化质量控制在低资源语言数据建设中的有效性,并为后续研究与应用奠定坚实基础。 Abstract: We present the Thiomi Dataset, a large-scale multimodal corpus spanning ten African languages across four language families: Swahili, Kikuyu, Kamba, Kimeru, Luo, Maasai, Kipsigis, Somali (East Africa); Wolof (West Africa); and Fulani (West/Central Africa). The dataset contains over 601,000 approved sentence-level text annotations and over 385,000 audio recordings across nine languages, collected through a dedicated community data collection platform involving over 100 contributors. The Thiomi platform collected data for nine languages; Swahili data was supplemented with existing Common Voice recordings. A multi-tier quality assurance pipeline achieves 86-100% text approval rates for the six primary languages. To validate the dataset's utility, we train and evaluate ASR, MT, and TTS models, establishing baselines across all ten languages. Our best ASR system achieves 3.24% WER on Swahili (Common Voice), reducing prior academic SOTA from 8.3% to 3.24% (5.1 percentage point absolute, 61% relative reduction), and 4.3% WER on Somali. The dataset will be published on HuggingFace. We describe the collection platform, quality assurance workflows, and baseline experiments, and discuss implications for African language technology infrastructure.
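The WER figures quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only: `wer` is a generic word-level edit-distance implementation (not the authors' evaluation script) and assumes a non-empty reference, and `wer_reduction` reproduces the absolute and relative reductions from the two reported numbers.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)

def wer_reduction(prior, new):
    """Return (absolute reduction in points, relative reduction as a fraction)."""
    absolute = prior - new
    return absolute, absolute / prior

# Reported Swahili numbers: prior academic SOTA 8.3% WER, new system 3.24% WER.
absolute, relative = wer_reduction(8.3, 3.24)
print(f"{absolute:.2f} points absolute, {relative:.0%} relative")
```

The absolute reduction of 5.06 points rounds to the 5.1 quoted in the abstract, and 5.06 / 8.3 gives the 61% relative figure.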

[19] MemRerank: Preference Memory for Personalized Product Reranking

Zhiyuan Peng,Xuyang Wu,Huaixiao Tou,Yi Fang,Yi Gong

Main category: cs.CL

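TL;DR: This paper proposes MemRerank, a framework that uses reinforcement learning to distill a user's purchase history into concise, query-independent preference memory signals for personalized product reranking in LLM shopping agents, markedly improving accuracy on a 1-in-5 selection task.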

Details Motivation: When LLM shopping agents use long purchase histories and multi-turn interactions for personalization, directly appending raw history to the prompt works poorly, mainly because of noise, length, and relevance mismatch. Method: Proposes the MemRerank preference memory framework, builds an end-to-end benchmark and evaluation framework around a 1-in-5 selection task, and trains the memory extractor with reinforcement learning, using downstream reranking performance as the reward. Result: With two LLM-based rerankers, MemRerank improves 1-in-5 accuracy by up to +10.61 absolute points over no-memory, raw-history, and off-the-shelf memory baselines. Conclusion: Explicit preference memory is a practical and effective building block for personalization in agentic e-commerce systems. Abstract: LLM-based shopping agents increasingly rely on long purchase histories and multi-turn interactions for personalization, yet naively appending raw history to prompts is often ineffective due to noise, length, and relevance mismatch. We propose MemRerank, a preference memory framework that distills user purchase history into concise, query-independent signals for personalized product reranking. To study this problem, we build an end-to-end benchmark and evaluation framework centered on an LLM-based 1-in-5 selection task, which measures both memory quality and downstream reranking utility. We further train the memory extractor with reinforcement learning (RL), using downstream reranking performance as supervision. Experiments with two LLM-based rerankers show that MemRerank consistently outperforms no-memory, raw-history, and off-the-shelf memory baselines, yielding up to +10.61 absolute points in 1-in-5 accuracy. These results suggest that explicit preference memory is a practical and effective building block for personalization in agentic e-commerce systems.

[20] CADEL: A Corpus of Administrative Web Documents for Japanese Entity Linking

Shohei Higashiyama,Masao Ideuchi,Masao Utiyama

Main category: cs.CL

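TL;DR: This paper proposes a corpus design policy for Japanese entity linking and builds an annotated corpus with rich coverage of Japan-specific entity expressions. Evaluation confirms its annotation consistency and difficulty, making it usable for training and evaluating Japanese entity linking systems.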

Details Motivation: 现有实体链接资源主要针对英语,日语相关评测资源匮乏,亟需构建高质量、具日本文化特性的日语实体链接语料库。 Method: 制定日语实体链接语料库设计规范,人工构建含丰富日本特有实体指代表达的标注语料库,并通过标注者间一致性检验与基于字符串匹配的消歧预实验评估语料质量。 Result: 语料库标注一致性高,且包含大量非平凡消歧案例,证明其适合作为日语实体链接的训练与评测基准。 Conclusion: 该语料库填补了日语实体链接资源空白,具备良好实用性与挑战性,可有效支撑后续日语实体链接系统研发与评估。 Abstract: Entity linking is the task of associating linguistic expressions with entries in a knowledge base that represent real-world entities and concepts. Language resources for this task have primarily been developed for English, and the resources available for evaluating Japanese systems remain limited. In this study, we develop a corpus design policy for the entity linking task and construct an annotated corpus for training and evaluating Japanese entity linking systems, with rich coverage of linguistic expressions referring to entities that are specific to Japan. Evaluation of inter-annotator agreement confirms the high consistency of the annotations in the corpus, and a preliminary experiment on entity disambiguation based on string matching suggests that the corpus contains a substantial number of non-trivial cases, supporting its potential usefulness as an evaluation benchmark.

[21] Open Machine Translation for Esperanto

Ona de Gibert,Lluís de Gibert

Main category: cs.CL

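TL;DR: This paper presents the first comprehensive evaluation of open-source machine translation systems for Esperanto, comparing rule-based systems, encoder-decoder models, and LLMs across several language directions with both automatic metrics and human evaluation. The NLLB family performs best, followed by the authors' own compact models and a fine-tuned general-purpose LLM; the code and best-performing models are released publicly.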

Details Motivation: Although Esperanto has substantial resources and a highly regular structure, it remains relatively neglected in modern machine translation research and needs systematic evaluation and open-source support. Method: Evaluates several open-source MT systems (rule-based systems, encoder-decoder models, and LLMs of different sizes) on six translation directions involving English, Spanish, Catalan, and Esperanto, using multiple automatic metrics and human evaluation, and releases the code and best models. Result: The NLLB family performs best on all language pairs, closely followed by the authors' compact models and a fine-tuned general-purpose LLM; in human evaluation NLLB is preferred in about half of the comparisons, though noticeable errors remain. Conclusion: NLLB is currently the strongest option for Esperanto MT, while compact models and fine-tuned LLMs are competitive; the study emphasizes open collaboration. Abstract: Esperanto is a widespread constructed language, known for its regular grammar and productive word formation. Despite having substantial resources available thanks to its online community, it remains relatively underexplored in the context of modern machine translation (MT) approaches. In this work, we present the first comprehensive evaluation of open-source MT systems for Esperanto, comparing rule-based systems, encoder-decoder models, and LLMs across model sizes. We evaluate translation quality across six language directions involving English, Spanish, Catalan, and Esperanto using multiple automatic metrics as well as human evaluation. Our results show that the NLLB family achieves the best performance in all language pairs, followed closely by our trained compact models and a fine-tuned general-purpose LLM. Human evaluation confirms this trend, with NLLB translations preferred in approximately half of the comparisons, although noticeable errors remain. In line with Esperanto's tradition of openness and international collaboration, we release our code and best-performing models publicly.

[22] L-ReLF: A Framework for Lexical Dataset Creation

Anass Sedrati,Mounir Afifi,Reda Benkhadra

Main category: cs.CL

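TL;DR: This paper proposes L-ReLF (Low-Resource Lexical Framework), a new methodology for building high-quality, structured lexical datasets for underserved languages. It targets the lack of standardized terminology in languages such as Moroccan Darija and produces data compatible with Wikidata Lexemes to support downstream NLP tasks.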

Details Motivation: 解决欠服务语言(如摩洛哥达里贾语)缺乏标准化术语的问题,从而促进知识公平,尤其是在维基百科等平台上的应用。 Method: 开发了一套技术流程,包括低资源数据源识别、利用存在偏见的OCR(偏向现代标准阿拉伯语)、以及严格的后处理以修正错误并标准化数据模型。 Result: 构建了一个完全兼容Wikidata Lexemes的结构化词汇数据集,并验证了该方法在低资源语言词汇建设中的有效性与可复现性。 Conclusion: L-ReLF方法具有通用性,为其他语言社区提供了构建基础词汇数据的清晰路径,支撑机器翻译、形态分析等下游NLP应用。 Abstract: This paper introduces the L-ReLF (Low-Resource Lexical Framework), a novel, reproducible methodology for creating high-quality, structured lexical datasets for underserved languages. The lack of standardized terminology, exemplified by Moroccan Darija, poses a critical barrier to knowledge equity in platforms like Wikipedia, often forcing editors to rely on inconsistent, ad-hoc methods to create new words in their language. Our research details the technical pipeline developed to overcome these challenges. We systematically address the difficulties of working with low-resource data, including source identification, utilizing Optical Character Recognition (OCR) despite its bias towards Modern Standard Arabic, and rigorous post-processing to correct errors and standardize the data model. The resulting structured dataset is fully compatible with Wikidata Lexemes, serving as a vital technical resource. The L-ReLF methodology is designed for generalizability, offering other language communities a clear path to build foundational lexical data for downstream NLP applications, such as Machine Translation and morphological analysis.

[23] Developing a Guideline for the Labovian-Structural Analysis of Oral Narratives in Japanese

Amane Watahiki,Tomoki Doi,Akari Kikuchi,Hiroshi Ohata,Yuki I. Nakata,Takuya Niikawa,Taiga Shinozaki,Hitomi Yanaka

Main category: cs.CL

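TL;DR: This paper presents the first systematic guidelines for Labovian narrative analysis of Japanese, extending the Labovian framework to fit Japanese grammar and discourse features. Annotators using the guidelines reach high agreement on clause segmentation and moderate agreement on structural classification.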

Details Motivation: 现有Labovian模型应用耗时费力,且缺乏适用于日语(语法和话语惯例与英语差异显著)的系统性分析指南。 Method: 在保留Labovian六类结构基础上,制定面向日语的显式从句切分规则,扩展涵盖更广的从句类型与叙事类型,并开展标注实验评估一致性。 Result: 标注者在从句切分上达成高一致性(Fleiss' kappa = 0.80),在两类结构分类任务中达中等一致性(Krippendorff's alpha = 0.41 和 0.45),其中一项略高于先前研究。 Conclusion: 所提指南为日语定性研究中的结构性叙事分析提供了可行基础,未来可支撑更大规模日语叙事数据集的构建。 Abstract: Narrative analysis is a cornerstone of qualitative research. One leading approach is the Labovian model, but its application is labor-intensive, requiring a holistic, recursive interpretive process that moves back and forth between individual parts of the transcript and the transcript as a whole. Existing Labovian datasets are available only in English, which differs markedly from Japanese in terms of grammar and discourse conventions. To address this gap, we introduce the first systematic guidelines for Labovian narrative analysis of Japanese narrative data. Our guidelines retain all six Labovian categories and extend the framework by providing explicit rules for clause segmentation tailored to Japanese constructions. In addition, our guidelines cover a broader range of clause types and narrative types. Using these guidelines, annotators achieved high agreement in clause segmentation (Fleiss' kappa = 0.80) and moderate agreement in two structural classification tasks (Krippendorff's alpha = 0.41 and 0.45, respectively), one of which is slightly higher than that found in prior work despite the use of finer-grained distinctions. This paper describes the Labovian model, the proposed guidelines, the annotation process, and their utility. It concludes by discussing the challenges encountered during the annotation process and the prospects for developing a larger dataset for structural narrative analysis in Japanese qualitative research.
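For readers unfamiliar with the agreement statistic reported for clause segmentation, the following is a minimal sketch of Fleiss' kappa. It assumes every item is rated by the same number of annotators and that the input is a table of per-item category counts; the toy counts are invented for illustration and are not the paper's data.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table of shape (items x categories).

    counts[i][j] is the number of annotators who assigned item i to category j.
    Every row must sum to the same number of annotators n (n >= 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    # Overall proportion of assignments that fall into each category.
    p_cat = [sum(row[j] for row in counts) / (n_items * n_raters)
             for j in range(n_cats)]
    # Observed agreement per item, then averaged over items.
    p_item = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
              for row in counts]
    p_obs = sum(p_item) / n_items
    # Agreement expected by chance.
    p_exp = sum(p * p for p in p_cat)
    return (p_obs - p_exp) / (1 - p_exp)

# Toy example: 4 items, 3 annotators, 2 categories (e.g. "boundary" / "no boundary").
toy = [[3, 0], [0, 3], [2, 1], [1, 2]]
kappa = fleiss_kappa(toy)
```

A value of 1 means perfect agreement and 0 means agreement no better than chance, which puts the reported 0.80 for clause segmentation in context.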

[24] Beyond Idealized Patients: Evaluating LLMs under Challenging Patient Behaviors in Medical Consultations

Yahan Li,Xinyi Jie,Wanjia Ruan,Xubei Zhang,Huaijie Zhu,Yicheng Gao,Chaohao Du,Ruishan Liu

Main category: cs.CL

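TL;DR: This paper proposes CPB-Bench, a bilingual benchmark of challenging patient behaviors in clinical consultations (information contradiction, factual inaccuracy, self-diagnosis, and care resistance). It reveals systematic safety failures in existing LLMs when patient input is unclear or misleading, and finds that several intervention strategies yield only inconsistent improvements.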

Details Motivation: 现有医学大语言模型评测多基于理想化、表述清晰的患者问题,缺乏对真实临床中常见挑战性患者行为(如自相矛盾、错误认知、抗拒建议等)的考察,难以反映模型在高风险场景下的实际安全性。 Method: 定义四类临床常见的挑战性患者行为(信息矛盾、事实错误、自我诊断、抗拒医疗),为每类设定具体的安全失败判定标准;基于四个现有医学对话数据集构建双语(中英文)CPB-Bench基准(692个多轮对话);系统评测多个开源与闭源医学大模型,并分析其失败模式;探索四种干预策略的效果。 Result: 模型在整体表现良好,但在处理矛盾或医学上明显不合理的患者信息时表现出一致且显著的失败;四种干预策略改善效果不一致,且易引入不必要的纠正。 Conclusion: 当前医学大语言模型在应对真实、复杂的患者交互行为时仍存在关键安全短板,需更贴近临床现实的评测基准与针对性的安全增强方法。 Abstract: Large language models (LLMs) are increasingly used for medical consultation and health information support. In this high-stakes setting, safety depends not only on medical knowledge, but also on how models respond when patient inputs are unclear, inconsistent, or misleading. However, most existing medical LLM evaluations assume idealized and well-posed patient questions, which limits their realism. In this paper, we study challenging patient behaviors that commonly arise in real medical consultations and complicate safe clinical reasoning. We define four clinically grounded categories of such behaviors: information contradiction, factual inaccuracy, self-diagnosis, and care resistance. For each behavior, we specify concrete failure criteria that capture unsafe responses. Building on four existing medical dialogue datasets, we introduce CPB-Bench (Challenging Patient Behaviors Benchmark), a bilingual (English and Chinese) benchmark of 692 multi-turn dialogues annotated with these behaviors. We evaluate a range of open- and closed-source LLMs on their responses to challenging patient utterances. While models perform well overall, we identify consistent, behavior-specific failure patterns, with particular difficulty in handling contradictory or medically implausible patient information. We also study four intervention strategies and find that they yield inconsistent improvements and can introduce unnecessary corrections. We release the dataset and code.

[25] Is my model perplexed for the right reason? Contrasting LLMs' Benchmark Behavior with Token-Level Perplexity

Zoë Prins,Samuele Punzo,Frank Wildenburg,Giovanni Cinà,Sandro Pezzelle

Main category: cs.CL

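TL;DR: This paper proposes an interpretability framework based on token-level perplexity for testing whether LLMs rely on linguistically relevant cues. It finds that linguistically important tokens do influence model behavior, but models do not rely solely on the expected linguistic mechanisms and also use other heuristics.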

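Details Motivation: Existing LLM evaluations focus on task performance, which makes it hard to tell whether correct behavior stems from appropriate underlying mechanisms and risks confirmation bias. Method: Proposes an interpretability framework based on token-level perplexity that compares perplexity distributions over minimal sentence pairs differing only in one or a few "pivotal" tokens, enabling precise, hypothesis-driven analysis without unstable feature-attribution techniques. Result: Experiments with several open-weight LLMs on controlled linguistic benchmarks show that linguistically important tokens do influence model behavior but never fully explain the perplexity shifts, indicating that models also rely on heuristics other than the expected linguistic ones. Conclusion: The framework shows that current LLMs do not strictly follow linguistic principles but combine several heuristics, and it offers a new tool for diagnosing model mechanisms. Abstract: Standard evaluations of large language models (LLMs) focus on task performance, offering limited insight into whether correct behavior reflects appropriate underlying mechanisms and risking confirmation bias. We introduce a simple, principled interpretability framework based on token-level perplexity to test whether models rely on linguistically relevant cues. By comparing perplexity distributions over minimal sentence pairs differing in one or a few "pivotal" tokens, our method enables precise, hypothesis-driven analysis without relying on unstable feature-attribution techniques. Experiments on controlled linguistic benchmarks with several open-weight LLMs show that, while linguistically important tokens influence model behavior, they never fully explain perplexity shifts, revealing that models rely on heuristics other than the expected linguistic ones.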
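The idea of comparing token-level perplexity across a minimal pair can be sketched as follows. This is an illustration under assumptions, not the paper's metric: it takes per-token natural-log probabilities as given, assumes both sentences tokenize to the same length, and the helper `pivotal_share` (the fraction of the total surprisal shift located on the pivotal tokens) is a hypothetical quantity introduced here.

```python
import math

def perplexity(logprobs):
    """Perplexity from per-token natural-log probabilities."""
    return math.exp(-sum(logprobs) / len(logprobs))

def pivotal_share(logprobs_a, logprobs_b, pivotal_idx):
    """Fraction of the total surprisal shift (sentence B minus sentence A)
    that falls on the pivotal token positions.

    Assumes equal-length tokenizations and a non-zero total shift.
    """
    shifts = [(-b) - (-a) for a, b in zip(logprobs_a, logprobs_b)]
    total = sum(shifts)
    return sum(shifts[i] for i in pivotal_idx) / total

# Made-up log-probabilities for an acceptable / unacceptable sentence pair
# that differ only at position 1.
acceptable = [-1.0, -1.0, -1.0]
unacceptable = [-1.0, -3.0, -2.0]
share = pivotal_share(acceptable, unacceptable, pivotal_idx=[1])
```

In this toy pair the pivotal token accounts for two thirds of the shift and the remainder lands on a later token, which mirrors the kind of "not fully explained" pattern the abstract reports.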

[26] CounselReflect: A Toolkit for Auditing Mental-Health Dialogues

Yahan Li,Chaohao Du,Zeyang Li,Christopher Chun Kuizon,Shupeng Cheng,Angel Hsing-Chi Hwang,Adam C. Frank,Ruishan Liu

Main category: cs.CL

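TL;DR: This paper presents CounselReflect, an end-to-end toolkit for auditing the quality of mental-health support dialogues. It produces multi-dimensional, interpretable, evidence-linked reports by combining model-based metrics with rubric-based LLM-judge metrics drawn from the literature or defined by users, and supports both real-time and batch auditing.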

Details Motivation: 用户缺乏对LLM等会话系统提供的心理健康支持进行结构化质量与风险审计的有效手段。 Method: 设计并实现CounselReflect工具包,融合12个任务特定预测模型指标与69项文献衍生+用户自定义的rubric指标(由可配置LLM裁判执行),输出会话级摘要、轮次级评分及证据链接片段;提供Web应用、浏览器插件和CLI三种使用方式。 Result: 通过20名普通用户的研究和6位心理健康专业人士的专家评审,验证了CounselReflect具备可理解性、可用性与可信性;同时提供演示视频与开源代码。 Conclusion: CounselReflect为心理健康对话系统提供了透明、可审计、可扩展的质量评估框架,填补了用户侧自主监督工具的空白。 Abstract: Mental-health support is increasingly mediated by conversational systems (e.g., LLM-based tools), but users often lack structured ways to audit the quality and potential risks of the support they receive. We introduce CounselReflect, an end-to-end toolkit for auditing mental-health support dialogues. Rather than producing a single opaque quality score, CounselReflect provides structured, multi-dimensional reports with session-level summaries, turn-level scores, and evidence-linked excerpts to support transparent inspection. The system integrates two families of evaluation signals: (i) 12 model-based metrics produced by task-specific predictors, and (ii) rubric-based metrics that extend coverage via a literature-derived library (69 metrics) and user-defined custom metrics, operationalized with configurable LLM judges. CounselReflect is available as a web application, browser extension, and command-line interface (CLI), enabling use in real-time settings as well as at scale. Human evaluation includes a user study with 20 participants and an expert review with 6 mental-health professionals, suggesting that CounselReflect supports understandable, usable, and trustworthy auditing. A demo video and full source code are also provided.

[27] Authorship Impersonation via LLM Prompting does not Evade Authorship Verification Methods

Baoyi Zeng,Andrea Nini

Main category: cs.CL

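TL;DR: This study examines whether large language models (LLMs) can impersonate a specific author well enough to evade existing authorship verification (AV) systems. Current AV systems remain robust against LLM-generated impersonations, partly because LLM text has higher lexical diversity and entropy.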

Details Motivation: As LLMs advance, perpetrators may use them to impersonate an author's style, threatening authorship verification (AV) in forensic linguistics; the ability of existing AV systems to withstand this new kind of attack needs to be assessed. Method: Uses GPT-4o as the adversary model to generate impersonation texts under four prompting conditions across three genres (emails, text messages, social media posts), then evaluates, within a likelihood-ratio framework, whether they evade several non-neural (n-gram tracing, Ranking-Based Impostors Method, LambdaG) and neural (AdHominem, LUAR, STAR) AV methods. Result: LLM-generated impersonations fail to replicate authorial individuality well enough to bypass existing AV systems; some AV methods reject impersonation texts even more accurately than genuine negative samples; this robustness stems at least in part from the higher lexical diversity and entropy of LLM text. Conclusion: Despite the accessibility of LLMs, current AV systems remain robust against entry-level LLM-driven impersonation across multiple genres, and high lexical entropy and diversity act as a cue that makes such text detectable. Abstract: Authorship verification (AV), the task of determining whether a questioned text was written by a specific individual, is a critical part of forensic linguistics. While manual authorial impersonation by perpetrators has long been a recognized threat in historical forensic cases, recent advances in large language models (LLMs) raise new challenges, as adversaries may exploit these tools to impersonate another's writing. This study investigates whether prompted LLMs can generate convincing authorial impersonations and whether such outputs can evade existing forensic AV systems. Using GPT-4o as the adversary model, we generated impersonation texts under four prompting conditions across three genres: emails, text messages, and social media posts. We then evaluated these outputs against both non-neural AV methods (n-gram tracing, Ranking-Based Impostors Method, LambdaG) and neural approaches (AdHominem, LUAR, STAR) within a likelihood-ratio framework. Results show that LLM-generated texts failed to sufficiently replicate authorial individuality to bypass established AV systems. We also observed that some methods achieved even higher accuracy when rejecting impersonation texts compared to genuine negative samples. Overall, these findings indicate that, despite the accessibility of LLMs, current AV systems remain robust against entry-level impersonation attempts across multiple genres. Furthermore, we demonstrate that this counter-intuitive resilience stems, at least in part, from the higher lexical diversity and entropy inherent in LLM-generated texts.

[28] M-MiniGPT4: Multilingual VLLM Alignment via Translated Data

Seung Hun Han,Youssef Mohamed,Mohamed Elhoseiny

Main category: cs.CL

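TL;DR: This paper presents M-MiniGPT4, a multilingual vision large language model trained on a mixture of native multilingual and translated data, with an added multilingual alignment stage based on parallel text corpora. It shows strong vision-language understanding across 11 languages and reaches 36% accuracy on the multilingual MMMU benchmark, outperforming state-of-the-art models in the same weight class; the models, code, and datasets are open-sourced.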

Details Motivation: 提升MiniGPT4架构在多语言视觉-语言理解(VLU)任务中的能力,尤其面向低资源和多语言场景。 Method: 基于MiniGPT4架构,融合原生多语言与翻译数据进行训练,并新增一个利用平行文本语料的多语言对齐训练阶段。 Result: M-MiniGPT4在多语言MMMU基准上达到36%准确率,优于同参数量级的现有最先进模型。 Conclusion: M-MiniGPT4有效增强了多语言VLU能力,其开源模型、代码和翻译数据集将推动低资源及多语言视觉语言研究。 Abstract: This paper presents a Multilingual Vision Large Language Model, named M-MiniGPT4. Our model exhibits strong vision-language understanding (VLU) capabilities across 11 languages. We utilize a mixture of native multilingual and translated data to push the multilingual VLU performance of the MiniGPT4 architecture. In addition, we propose a multilingual alignment training stage that uses parallel text corpora to further enhance the multilingual capabilities of our model. M-MiniGPT4 achieves 36% accuracy on the multilingual MMMU benchmark, outperforming state-of-the-art models in the same weight class, including foundation models released after the majority of this work was completed. We open-source our models, code, and translated datasets to facilitate future research in low-resource and multilingual settings.

[29] Calibrated Confidence Expression for Radiology Report Generation

David Bani-Harouni,Chantal Pellegrini,Julian Lüers,Su Hwan Kim,Markus Baalmann,Benedikt Wiestler,Rickmer Braren,Nassir Navab,Matthias Keicher

Main category: cs.CL

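TL;DR: This paper proposes ConRad, a framework that fine-tunes medical large vision-language models with reinforcement learning so that they output calibrated verbalized confidence estimates alongside radiology reports, improving clinical safety and interpretability.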

Details Motivation: 现有大视觉语言模型在放射学报告生成中常过度自信,缺乏多模态场景下的置信度校准研究,难以支持临床安全部署。 Method: 提出ConRad框架,基于GRPO强化学习算法,结合对数评分规则设计奖励函数,分别训练报告级和句子级的口头化置信度输出。 Result: ConRad显著提升置信度校准性能,优于现有方法;临床评估显示其报告级置信分与医生判断高度一致。 Conclusion: ConRad能有效支持AI辅助报告生成的安全临床落地,通过标识需重点审核的报告或低置信语句,实现靶向人工复核。 Abstract: Safe deployment of Large Vision-Language Models (LVLMs) in radiology report generation requires not only accurate predictions but also clinically interpretable indicators of when outputs should be thoroughly reviewed, enabling selective radiologist verification and reducing the risk of hallucinated findings influencing clinical decisions. One intuitive approach to this is verbalized confidence, where the model explicitly states its certainty. However, current state-of-the-art language models are often overconfident, and research on calibration in multimodal settings such as radiology report generation is limited. To address this gap, we introduce ConRad (Confidence Calibration for Radiology Reports), a reinforcement learning framework for fine-tuning medical LVLMs to produce calibrated verbalized confidence estimates alongside radiology reports. We study two settings: a single report-level confidence score and a sentence-level variant assigning a confidence to each claim. Both are trained using the GRPO algorithm with reward functions based on the logarithmic scoring rule, which incentivizes truthful self-assessment by penalizing miscalibration and guarantees optimal calibration under reward maximization. Experimentally, ConRad substantially improves calibration and outperforms competing methods. In a clinical evaluation we show that ConRad's report level scores are well aligned with clinicians' judgment. By highlighting full reports or low-confidence statements for targeted review, ConRad can support safer clinical integration of AI-assistance for report generation.
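The claim that a logarithmic scoring rule rewards truthful self-assessment can be illustrated numerically. The sketch below is a generic log-score reward, not ConRad's actual reward function: the clipping constant `eps`, the function names, and the grid search are assumptions made for illustration.

```python
import math

def log_score_reward(confidence, correct, eps=1e-6):
    """Logarithmic scoring rule: log(c) if the claim is correct, log(1 - c) otherwise.

    The confidence is clipped to [eps, 1 - eps] so the reward stays finite.
    """
    c = min(max(confidence, eps), 1.0 - eps)
    return math.log(c) if correct else math.log(1.0 - c)

def expected_reward(confidence, p_correct):
    """Expected reward when the claim is actually correct with probability p_correct."""
    return (p_correct * log_score_reward(confidence, True)
            + (1.0 - p_correct) * log_score_reward(confidence, False))

# If reports of this kind are correct 70% of the time, which stated confidence
# maximizes the expected reward?
grid = [i / 100 for i in range(1, 100)]
best = max(grid, key=lambda c: expected_reward(c, p_correct=0.7))
```

The maximum falls at a stated confidence of 0.7, the true accuracy, because the log score is a strictly proper scoring rule: overstating or understating confidence lowers the expected reward.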

[30] MemFactory: Unified Inference & Training Framework for Agent Memory

Ziliang Guo,Ziheng Li,Zhiyu Li

Main category: cs.CL

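TL;DR: This paper presents MemFactory, a unified, modular training and inference framework for memory-augmented LLM agents. It lets researchers assemble custom memory systems from plug-and-play components and integrates GRPO to optimize memory management policies, improving on the corresponding base models by up to 14.8% (relative) on in-domain and out-of-distribution evaluation sets.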

Details Motivation: Existing reinforcement learning implementations for memory-augmented LLMs are highly fragmented and task-specific, with no unified infrastructure to streamline integration, training, and evaluation. Method: Proposes the MemFactory framework, which abstracts the memory lifecycle into atomic, plug-and-play components; integrates Group Relative Policy Optimization (GRPO) to fine-tune memory policies driven by multi-dimensional environmental rewards; and supports recent paradigms such as Memory-R1, RMM, and MemAgent. Result: Validated empirically on the open-source MemAgent architecture, MemFactory consistently outperforms the base models on both in-domain and out-of-distribution evaluation sets, with relative gains of up to 14.8%. Conclusion: MemFactory provides standardized, extensible, and easy-to-use infrastructure that lowers the barrier to research on memory-driven AI agents. Abstract: Memory-augmented Large Language Models (LLMs) are essential for developing capable, long-term AI agents. Recently, applying Reinforcement Learning (RL) to optimize memory operations, such as extraction, updating, and retrieval, has emerged as a highly promising research direction. However, existing implementations remain highly fragmented and task-specific, lacking a unified infrastructure to streamline the integration, training, and evaluation of these complex pipelines. To address this gap, we present MemFactory, the first unified, highly modular training and inference framework specifically designed for memory-augmented agents. Inspired by the success of unified fine-tuning frameworks like LLaMA-Factory, MemFactory abstracts the memory lifecycle into atomic, plug-and-play components, enabling researchers to seamlessly construct custom memory agents via a "Lego-like" architecture. Furthermore, the framework natively integrates Group Relative Policy Optimization (GRPO) to fine-tune internal memory management policies driven by multi-dimensional environmental rewards. MemFactory provides out-of-the-box support for recent cutting-edge paradigms, including Memory-R1, RMM, and MemAgent. We empirically validate MemFactory on the open-source MemAgent architecture using its publicly available training and evaluation data. Across both in-domain and out-of-distribution evaluation sets, MemFactory consistently improves performance over the corresponding base models, with relative gains of up to 14.8%. By providing a standardized, extensible, and easy-to-use infrastructure, MemFactory significantly lowers the barrier to entry, paving the way for future innovations in memory-driven AI agents.

[31] Distilling Human-Aligned Privacy Sensitivity Assessment from Large Language Models

Gabriel Loiseau,Damien Sileo,Damien Riquet,Maxime Meyer,Marc Tommasi

Main category: cs.CL

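TL;DR: This paper proposes a knowledge distillation approach that compresses the text privacy assessment ability of the very large Mistral Large 3 (675B) into lightweight encoder models with as few as 150M parameters. The distilled models keep strong agreement with human annotations at a much lower computational cost, making them suitable for large-scale privacy evaluation of sensitive text and for evaluating de-identification systems.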

Details Motivation: 现有基于大语言模型(LLM)的文本隐私评估方法虽准确但计算成本高、难以在大规模敏感数据上部署,亟需轻量化且可靠的替代方案。 Method: 利用涵盖10个领域的大型隐私标注文本数据集,通过知识蒸馏将Mistral Large 3(675B)的隐私判断能力迁移至轻量级编码器分类器(最小仅150M参数)。 Result: 所提轻量模型在人类标注测试集上保持强一致性,并成功作为实用评估指标应用于去标识化系统性能评测。 Conclusion: 知识蒸馏可有效将LLM的隐私评估能力压缩为高效、可部署的轻量模型,为隐私保护NLP提供兼顾准确性与实用性的新范式。 Abstract: Accurate privacy evaluation of textual data remains a critical challenge in privacy-preserving natural language processing. Recent work has shown that large language models (LLMs) can serve as reliable privacy evaluators, achieving strong agreement with human judgments; however, their computational cost and impracticality for processing sensitive data at scale limit real-world deployment. We address this gap by distilling the privacy assessment capabilities of Mistral Large 3 (675B) into lightweight encoder models with as few as 150M parameters. Leveraging a large-scale dataset of privacy-annotated texts spanning 10 diverse domains, we train efficient classifiers that preserve strong agreement with human annotations while dramatically reducing computational requirements. We validate our approach on human-annotated test data and demonstrate its practical utility as an evaluation metric for de-identification systems.

[32] LLM Probe: Evaluating LLMs for Low-Resource Languages

Hailay Kidu Teklehaymanot,Gebrearegawi Gebremariam,Wolfgang Nejdl

Main category: cs.CL

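TL;DR: This paper proposes LLM Probe, a lexicon-based evaluation framework for systematically assessing the linguistic abilities of LLMs in low-resource, morphologically rich languages. It also builds a linguistically annotated bilingual lexicon benchmark and reveals performance differences between model architectures across linguistic tasks.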

Details Motivation: The linguistic abilities of LLMs in low-resource, morphologically rich languages remain unclear, mainly because annotated resources are scarce and standardized evaluation frameworks are lacking. Method: Proposes the lexicon-based LLM Probe framework, covering lexical alignment, part-of-speech recognition, morphosyntactic probing, and translation accuracy; builds a manually annotated bilingual lexicon benchmark for a low-resource Semitic language, with high-agreement annotations of part of speech, grammatical gender, and morphosyntactic features; tests a range of causal language models and sequence-to-sequence models. Result: Sequence-to-sequence models generally do better on morphosyntactic analysis and translation quality, while causal models are stronger on lexical alignment but weaker on translation accuracy; the results confirm the value of linguistically grounded evaluation for exposing LLM limitations in low-resource settings. Conclusion: Linguistically grounded evaluation is needed to understand the limits of LLMs in low-resource languages; LLM Probe and its dataset are released as open source to support reproducible benchmarking and more inclusive multilingual technology. Abstract: Despite rapid advances in large language models (LLMs), their linguistic abilities in low-resource and morphologically rich languages are still not well understood due to limited annotated resources and the absence of standardized evaluation frameworks. This paper presents LLM Probe, a lexicon-based assessment framework designed to systematically evaluate the linguistic skills of LLMs in low-resource language environments. The framework analyzes models across four areas of language understanding: lexical alignment, part-of-speech recognition, morphosyntactic probing, and translation accuracy. To illustrate the framework, we create a manually annotated benchmark dataset using a low-resource Semitic language as a case study. The dataset comprises bilingual lexicons with linguistic annotations, including part-of-speech tags, grammatical gender, and morphosyntactic features, which demonstrate high inter-annotator agreement to ensure reliable annotations. We test a variety of models, including causal language models and sequence-to-sequence architectures. The results reveal notable differences in performance across various linguistic tasks: sequence-to-sequence models generally excel in morphosyntactic analysis and translation quality, whereas causal models demonstrate strong performance in lexical alignment but exhibit weaker translation accuracy. Our results emphasize the need for linguistically grounded evaluation to better understand LLM limitations in low-resource settings. We release LLM Probe and the accompanying benchmark dataset as open-source tools to promote reproducible benchmarking and to support the development of more inclusive multilingual language technologies.

[33] Impact of enriched meaning representations for language generation in dialogue tasks: A comprehensive exploration of the relevance of tasks, corpora and metrics

Alain Vázquez,Maria Inés Torres

Main category: cs.CL

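TL;DR: This paper studies whether, in dialogue NLG, giving the generator a task demonstrator (an MR-sentence pair) improves the output of a fine-tuned model, with a systematic analysis across several datasets and evaluation metrics.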

Details Motivation: To improve the diversity and accuracy of language generated by the NLG module of dialogue systems, and to examine how enriching the MR input affects generation quality. Method: Adds an MR-sentence pair as a task demonstrator that enriches the input at training and inference time; compares results on four datasets with different characteristics using five evaluation metrics covering semantic and lexical aspects. Result: Demonstrator enrichment is effective for complex tasks, for small datasets with high variability, and in zero-shot settings; semantic metrics, especially those trained on human ratings, reflect generation quality more accurately than lexical metrics and catch omissions and other semantic issues that embedding-based metrics often miss; Slot Accuracy and Dialogue Act Accuracy are excellent, indicating strong adaptability and robustness. Conclusion: Enriching MRs with demonstrators is an effective strategy, especially when data is scarce or MR/sentence variability is high; evaluation should rely more on high-quality semantic metrics than on lexical matching alone. Abstract: Conversational systems should generate diverse language forms to interact fluently and accurately with users. In this context, Natural Language Generation (NLG) engines convert Meaning Representations (MRs) into sentences, directly influencing user perception. These MRs usually encode the communicative function (e.g., inform, request, confirm) via dialogue acts (DAs) and enumerate the semantic content with slot-value pairs. In this work, our objective is to analyse whether providing a task demonstrator to the generator enhances the generations of a fine-tuned model. This demonstrator is an MR-sentence pair extracted from the original dataset that enriches the input at training and inference time. The analysis involves five metrics that focus on different linguistic aspects, and four datasets that differ in multiple features, such as domain, size, lexicon, MR variability, and acquisition process. To the best of our knowledge, this is the first study on dialogue NLG implementing a comparative analysis of the impact of MRs on generation quality across domains, corpus characteristics, and the metrics used to evaluate these generations. Our key insight is that the proposed enriched inputs are effective for complex tasks and small datasets with high variability in MRs and sentences. They are also beneficial in zero-shot settings for any domain. Moreover, the analysis of the metrics shows that semantic metrics capture generation quality more accurately than lexical metrics. In addition, among these semantic metrics, those trained with human ratings can detect omissions and other subtle semantic issues that embedding-based metrics often miss. Finally, the evolution of the metric scores and the excellent results for Slot Accuracy and Dialogue Act Accuracy demonstrate that the generative models present fast adaptability to different tasks and robustness at semantic and communicative intention levels.

[34] Baby Scale: Investigating Models Trained on Individual Children's Language Input

Steven Y. Feng,Alvin W. M. Tan,Michael C. Frank

Main category: cs.CL

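TL;DR: This paper examines the data gap between modern language models and child language learning. It trains language models on transcripts from the BabyView dataset (egocentric videos from children aged 6-36 months), analyzes their performance on grammar, semantic, and world knowledge tasks, and tests how model predictions correlate with children's actual word learning.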

Details Motivation: Modern language models need far more training data than children are exposed to during language acquisition; benchmarking on child-scale data is needed to understand how linguistic knowledge emerges from real child input and to shed light on data efficiency and acquisition mechanisms. Method: Trains language models on speech transcripts from the child's-view BabyView dataset and evaluates scaling on different subsets of child data; analyzes performance differences across datasets and the effect of linguistic features (distributional and interactional) on model performance; computes model likelihoods for individual words and correlates them with children's actual word learning. Result: On child-scale data, language models scale acceptably on grammar tasks but scale less well on semantic and world knowledge tasks than models trained on synthetic data; performance varies substantially across datasets from different children; performance is most associated with a combination of distributional and interactional linguistic features; model word likelihoods correlate positively with children's word learning. Conclusion: The quality of child language input, not only its quantity, strongly affects what models learn, a finding that both helps build efficient small-scale language models and offers a new perspective on human language acquisition. Abstract: Modern language models (LMs) must be trained on many orders of magnitude more words of training data than human children receive before they begin to produce useful behavior. Assessing the nature and origins of this "data gap" requires benchmarking LMs on human-scale datasets to understand how linguistic knowledge emerges from children's natural training data. Using transcripts from the BabyView dataset (videos from children ages 6-36 months), we investigate (1) scaling performance at child-scale data regimes, (2) variability in model performance across datasets from different children's experiences and linguistic predictors of dataset quality, and (3) relationships between model and child language learning outcomes. LMs trained on child data show acceptable scaling for grammar tasks, but lower scaling on semantic and world knowledge tasks than models trained on synthetic data; we also observe substantial variability on data from different children. Beyond dataset size, performance is most associated with a combination of distributional and interactional linguistic features, broadly consistent with what makes high-quality input for child language development. Finally, model likelihoods for individual words correlate with children's learning of those words, suggesting that properties of child-directed input may influence both model learning and human language development. Overall, understanding what properties make language data efficient for learning can enable more powerful small-scale language models while also shedding light on human language acquisition.

[35] Can LLM Agents Identify Spoken Dialects like a Linguist?

Tobias Bystrich,Lukas Hamm,Maria Hassan,Lea Fischbach,Lucie Flek,Akbar Karimi

Main category: cs.CL

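TL;DR: This paper explores the potential of large language models (LLMs) for Swiss German dialect classification, using ASR-generated phonetic transcriptions together with dialect linguistic resources (dialect feature maps, vowel history, and rules) as input. Adding linguistic information improves LLM performance. The paper also provides an LLM baseline and a human linguist baseline, showing that automatic transcriptions help classification but leave room for improvement.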

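Details Motivation: Labeled dialectal speech is scarce, which makes audio dialect classification (Swiss German in particular) highly challenging. The paper explores LLMs as a new approach alongside traditional speech models such as HuBERT and assesses their potential. Method: Phonetic transcriptions produced by ASR systems are combined with linguistic resources (dialect feature maps, vowel history, and linguistic rules) and passed to an LLM as a prompt for dialect classification. An LLM baseline and a human linguist baseline are built for comparison. Result: LLM dialect classification improves when linguistic information is provided. The human baseline shows that automatic transcriptions help classification, while also exposing shortcomings in current ASR transcription quality and directions for improvement. Conclusion: Combined with linguistic knowledge, LLMs can be an effective tool for dialect classification, though their performance still depends on high-quality linguistic resources and transcriptions. The approach offers a new direction for low-resource dialect identification but has not yet surpassed traditional speech models such as HuBERT. Abstract: Due to the scarcity of labeled dialectal speech, audio dialect classification is a challenging task for most languages, including Swiss German. In this work, we explore the ability of large language models (LLMs) as agents in understanding the dialects and whether they can show comparable performance to models such as HuBERT in dialect classification. In addition, we provide an LLM baseline and a human linguist one. Our approach uses phonetic transcriptions produced by ASR systems and combines them with linguistic resources such as dialect feature maps, vowel history, and rules. Our findings indicate that, when linguistic information is provided, the LLM predictions improve. The human baseline shows that automatically generated transcriptions can be beneficial for such classifications, but also presents opportunities for improvement.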

[36] Bringing Up a Bilingual BabyLM: Investigating Multilingual Language Acquisition Using Small-Scale Models

Linda Zeng,Steven Y. Feng,Michael C. Frank

Main category: cs.CL

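TL;DR: This paper uses language model training to simulate multilingual input conditions, building matched monolingual and bilingual datasets from synthetic data and machine translation. Bilingual models perform well in both languages, indicating that bilingual input poses no inherent challenge for a statistical learner.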

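Details Motivation: The paper asks whether learning several languages at once delays children's learning and whether some ways of structuring multilingual input are better than others. Because children cannot be randomly assigned to be multilingual and data are rarely matched across languages, traditional correlational studies struggle to give definitive answers. Method: Language model training is used to simulate a variety of highly controlled multilingual exposure conditions. Matched 100M-word monolingual and bilingual datasets are built from synthetic data and machine translation, and GPT-2 models are trained under different exposure regimes and evaluated on perplexity, grammaticality, and semantic knowledge. Result: Across model scales and evaluation measures, bilingual models perform on par with monolingual models in one language while also showing strong performance in the second language. No notable differences appear between bilingual exposure regimes. Conclusion: Bilingual input poses no fundamental challenge for an agnostic statistical learner, and the results do not support the view that multilingual acquisition necessarily causes delays or requires a particular input structure. Abstract: Multilingualism is incredibly common around the world, leading to many important theoretical and practical questions about how children learn multiple languages at once. For example, does multilingual acquisition lead to delays in learning? Are there better and worse ways to structure multilingual input? Many correlational studies address these questions, but it is surprisingly difficult to get definitive answers because children cannot be randomly assigned to be multilingual and data are typically not matched between languages. We use language model training as a method for simulating a variety of highly controlled exposure conditions, and create matched 100M-word mono- and bilingual datasets using synthetic data and machine translation. We train GPT-2 models on monolingual and bilingual data organized to reflect a range of exposure regimes, and evaluate their performance on perplexity, grammaticality, and semantic knowledge. Across model scales and measures, bilingual models perform similarly to monolingual models in one language, but show strong performance in the second language as well. These results suggest that there are no strong differences between different bilingual exposure regimes, and that bilingual input poses no in-principle challenges for agnostic statistical learners.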

[37] When Can We Trust LLM Graders? Calibrating Confidence for Automated Assessment

Robinson Ferrer,Damla Turgut,Zhongzhou Chen,Shashank Sonkar

Main category: cs.CL

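TL;DR: This paper studies how to predict the reliability of large language models (LLMs) on automated grading and proposes selective automation based on confidence estimation: high-confidence predictions are processed automatically, and low-confidence cases go to human review. Experiments compare three confidence estimation methods (self-reported confidence, self-consistency voting, token probability) and find that self-reported confidence is best calibrated (avg ECE 0.166) and cheaper to compute. Larger models such as GPT-OSS-120B calibrate and discriminate better, but confidence is top-skewed across methods, so thresholds must be set with care.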

Details Motivation: LLM grading output is unreliable and improving accuracy directly is hard, so the paper turns to predicting when the output is reliable, in support of selective automation with humans in the loop. Method: On three science education datasets (RiceChem, SciEntsBank, Beetle) and seven LLMs of varying scale (4B to 120B), the paper systematically evaluates three confidence estimation methods: self-reported confidence, self-consistency voting, and token probability. The main metrics are expected calibration error (ECE) and AUC. Result: Self-reported confidence gives the best calibration (avg ECE 0.166), clearly better than self-consistency (0.229), at lower compute cost. Larger models, GPT-OSS-120B in particular, are best calibrated (ECE 0.100) with strong discrimination (AUC 0.668). All methods show a pronounced top-skew in confidence (a "confidence floor"). Conclusion: Asking an LLM to report its own confidence is a simple, efficient, and practical way to predict reliability and can support trustworthy deployment of automated grading. Larger models calibrate better, but the shape of the confidence distribution means thresholds must be adapted in practice. Abstract: Large Language Models (LLMs) show promise for automated grading, but their outputs can be unreliable. Rather than improving grading accuracy directly, we address a complementary problem: predicting when an LLM grader is likely to be correct. This enables selective automation where high-confidence predictions are processed automatically while uncertain cases are flagged for human review. We compare three confidence estimation methods (self-reported confidence, self-consistency voting, and token probability) across seven LLMs of varying scale (4B to 120B parameters) on three educational datasets: RiceChem (long-answer chemistry), SciEntsBank, and Beetle (short-answer science). Our experiments reveal that self-reported confidence consistently achieves the best calibration across all conditions (avg ECE 0.166 vs 0.229 for self-consistency). Surprisingly, self-consistency remains 38% worse despite requiring 5× the inference cost. Larger models exhibit substantially better calibration though gains vary by dataset and method (e.g., a 28% ECE reduction for self-reported), with GPT-OSS-120B achieving the best calibration (avg ECE 0.100) and strong discrimination (avg AUC 0.668). We also observe that confidence is strongly top-skewed across methods, creating a "confidence floor" that practitioners must account for when setting thresholds. These findings suggest that simply asking LLMs to report their confidence provides a practical approach for identifying reliable grading predictions.
Code is available at https://github.com/sonkar-lab/llm_grading_calibration.
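
For readers unfamiliar with the calibration metric quoted in this entry, the following is a minimal sketch of expected calibration error (ECE) with equal-width confidence bins. The function name, the default of 10 bins, and the sample values are illustrative assumptions; the abstract does not state the binning scheme the authors used.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: the size-weighted mean of |accuracy - mean confidence| per bin.

    confidences: floats in [0, 1], one per graded answer (assumed input format).
    correct: booleans, True when the grader's prediction matched the reference grade.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("at least one prediction is required")

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


# Four predictions at 0.9 confidence, three of them correct:
# accuracy is 0.75, mean confidence is 0.9, so the gap is 0.15.
example_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
```

A lower value means stated confidence tracks observed accuracy more closely, which is what makes a confidence threshold usable for routing uncertain cases to human review.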

[38] Learning Diagnostic Reasoning for Decision Support in Toxicology

Nico Oberländer,David Bani-Harouni,Tobias Zellner,Nassir Navab,Florian Eyer,Matthias Keicher

Main category: cs.CL

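TL;DR: This paper presents DeToxR, a reinforcement learning (RL) based decision-support system for toxicology. It fine-tunes a large language model (LLM) and optimizes its reasoning with a clinical performance reward, significantly outperforming baseline models and an expert toxicologist at diagnosing acute poly-substance intoxication.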

Details Motivation: Acute poly-substance intoxication presents incomplete information, nonspecific symptoms, and heterogeneous data (unstructured narratives plus structured vital signs), and existing LLMs struggle to fuse this information into an accurate diagnosis. Method: The DeToxR framework fine-tunes an LLM with Group Relative Policy Optimization (GRPO), using a multi-label agreement metric as the clinical performance reward. This directly optimizes multi-label prediction across 14 substance classes and specifically penalizes missed substances and hallucinated ones. Result: DeToxR significantly outperforms the unadapted base LLM and supervised baselines across several metrics. In the clinical validation study it reaches a Micro-F1 of 0.644, above the expert toxicologist's 0.473. Conclusion: RL-aligned LLMs can fuse heterogeneous clinical data and improve diagnostic decision support in high-stakes emergency settings, offering a new paradigm for deploying LLMs in real clinical environments. Abstract: Acute poly-substance intoxication requires rapid, life-saving decisions under substantial uncertainty, as clinicians must rely on incomplete ingestion details and nonspecific symptoms. Effective diagnostic reasoning in this chaotic environment requires fusing unstructured, non-medical narratives (e.g. paramedic scene descriptions and unreliable patient self-reports or known histories), with structured medical data like vital signs. While Large Language Models (LLMs) show potential for processing such heterogeneous inputs, they struggle in this setting, often underperforming simple baselines that rely solely on patient histories. To address this, we present DeToxR (Decision-support for Toxicology with Reasoning), the first adaptation of Reinforcement Learning (RL) to emergency toxicology. We design a robust data-fusion engine for multi-label prediction across 14 substance classes based on an LLM finetuned with Group Relative Policy Optimization (GRPO). We optimize the model's reasoning directly using a clinical performance reward. By formulating a multi-label agreement metric as the reward signal, the model is explicitly penalized for missing co-ingested substances and hallucinating absent poisons. Our model significantly outperforms its unadapted base LLM counterpart and supervised baselines. Furthermore, in a clinical validation study, the model indicates a clinical advantage by outperforming an expert toxicologist in identifying the correct poisons (Micro-F1: 0.644 vs. 0.473).
These results demonstrate the potential of RL-aligned LLMs to synthesize unstructured pre-clinical narratives and structured medical data for decision support in high-stakes environments.
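
The Micro-F1 figure reported in this entry can be illustrated with a short sketch of micro-averaged F1 over multi-label predictions. This shows the evaluation metric only; the abstract does not give the exact formula of the multi-label agreement reward, so nothing here should be read as the authors' reward function. The function name and the substance labels are hypothetical.

```python
def micro_f1(predicted, gold):
    """Micro-averaged F1 over cases, each case being a set of substance labels.

    Counts are pooled across all cases before computing F1, so a missed
    substance (false negative) and a hallucinated one (false positive)
    both lower the score.
    """
    true_pos = false_pos = false_neg = 0
    for pred_labels, gold_labels in zip(predicted, gold):
        pred_labels, gold_labels = set(pred_labels), set(gold_labels)
        true_pos += len(pred_labels & gold_labels)
        false_pos += len(pred_labels - gold_labels)   # predicted but absent
        false_neg += len(gold_labels - pred_labels)   # present but missed
    denominator = 2 * true_pos + false_pos + false_neg
    # Convention assumed here: no labels anywhere counts as perfect agreement.
    return 2 * true_pos / denominator if denominator else 1.0


# Hypothetical labels for two cases: one hallucinated and one missed substance.
predictions = [{"opioid", "alcohol"}, {"benzodiazepine"}]
references = [{"opioid"}, {"benzodiazepine", "alcohol"}]
score = micro_f1(predictions, references)  # 2 TP, 1 FP, 1 FN
```

Pooling counts across cases means frequent substance classes dominate the score, which is one reason Micro-F1 is often reported alongside per-class figures.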

[39] Agenda-based Narrative Extraction: Steering Pathfinding Algorithms with Large Language Models

Brian Felipe Keith-Norambuena,Carolina Inés Rojas-Córdova,Claudio Juvenal Meneses-Villegas,Elizabeth Johanna Lam-Esquenazi,Angélica María Flores-Bustos,Ignacio Alejandro Molina-Villablanca,Joshua Emanuel Leyton-Vallejos

Main category: cs.CL

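TL;DR: This paper proposes agenda-based narrative extraction, which uses a large language model (LLM) to steer storyline construction during Narrative Trails pathfinding, balancing alignment with a user-specified perspective (agenda alignment) against narrative coherence. On a news corpus it achieves better semantic alignment than keyword matching with only a small loss in coherence.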

Details Motivation: Existing narrative extraction methods struggle to deliver coherence, interactivity, and multi-storyline support together. Narrative Maps supports interaction and multiple storylines at the cost of single-path coherence, while Narrative Trails is highly coherent but offers no user guidance or multiple perspectives. Method: A large language model is integrated into the Narrative Trails path optimization. At each step the LLM ranks candidate documents by their semantic fit to a user-specified agenda, producing a narrative path that follows that perspective. The same corpus yields different storylines under different agendas. Result: Evaluated on 64 endpoint pairs and 6 agendas, LLM steering improves semantic agenda alignment over keyword matching by 9.9% (p=0.017), and by 13.3% on the "Regime Crackdown" agenda (p=0.037). Coherence drops by only 2.2%. Counter-agendas score uniformly low (2.2-2.5), indicating that the method cannot fabricate unsupported narratives. Conclusion: Agenda-based, LLM-steered narrative extraction eases the trade-off between interactivity, multiple perspectives, and coherence, offering a controllable, trustworthy, and semantically sensitive approach to narrative generation. Abstract: Existing narrative extraction methods face a trade-off between coherence, interactivity, and multi-storyline support. Narrative Maps supports rich interaction and generates multiple storylines as a byproduct of its coverage constraints, though this comes at the cost of individual path coherence. Narrative Trails achieves high coherence through maximum capacity path optimization but provides no mechanism for user guidance or multiple perspectives. We introduce agenda-based narrative extraction, a method that bridges this gap by integrating large language models into the Narrative Trails pathfinding process to steer storyline construction toward user-specified perspectives. Our approach uses an LLM at each step to rank candidate documents based on their alignment with a given agenda while maintaining narrative coherence. Running the algorithm with different agendas yields different storylines through the same corpus. We evaluated our approach on a news article corpus using LLM judges with Claude Opus 4.5 and GPT 5.1, measuring both coherence and agenda alignment across 64 endpoint pairs and 6 agendas. LLM-driven steering achieves 9.9% higher alignment than keyword matching on semantic agendas (p=0.017), with 13.3% improvement on Regime Crackdown specifically (p=0.037), while keyword matching remains competitive on agendas with literal keyword overlap. The coherence cost is minimal: LLM steering reduces coherence by only 2.2% compared to the agenda-agnostic baseline.
Counter-agendas that contradict the source material score uniformly low (2.2-2.5) across all methods, confirming that steering cannot fabricate unsupported narratives.
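
The abstract attributes the coherence of Narrative Trails to maximum capacity path optimization. As a rough illustration of that general technique (not the authors' implementation, and without the LLM steering step, whose details the abstract does not give), the sketch below finds the path whose weakest edge is as strong as possible, using a Dijkstra-style search. The graph layout, function name, and coherence values are assumptions made for the example.

```python
import heapq


def maximum_capacity_path(graph, source, target):
    """Return (path, bottleneck) maximizing the minimum edge weight along the path.

    graph: dict mapping node -> {neighbor: capacity}; here a capacity stands in
    for a pairwise coherence score between two documents (assumed positive).
    """
    best = {source: float("inf")}   # best known bottleneck to each node
    parent = {}
    heap = [(-float("inf"), source)]  # max-heap via negated bottleneck
    while heap:
        negated, node = heapq.heappop(heap)
        capacity = -negated
        if capacity < best.get(node, 0.0):
            continue  # stale heap entry
        if node == target:
            break
        for neighbor, weight in graph.get(node, {}).items():
            bottleneck = min(capacity, weight)
            if bottleneck > best.get(neighbor, 0.0):
                best[neighbor] = bottleneck
                parent[neighbor] = node
                heapq.heappush(heap, (-bottleneck, neighbor))
    if target not in best:
        return None, 0.0
    path = [target]
    while path[-1] != source:
        path.append(parent[path[-1]])
    path.reverse()
    return path, best[target]


# Hypothetical four-document corpus: a-b-d has a weak link (0.4),
# so a-c-d (weakest link 0.6) is the maximum capacity path.
toy_graph = {
    "a": {"b": 0.9, "c": 0.6},
    "b": {"d": 0.4},
    "c": {"d": 0.7},
}
storyline, weakest_link = maximum_capacity_path(toy_graph, "a", "d")
```

Optimizing the weakest link rather than the sum of edge weights is what keeps every consecutive pair of documents in the storyline reasonably related.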

[40] Near-Miss: Latent Policy Failure Detection in Agentic Workflows

Ella Rabinovich,David Boaz,Naama Zwerdling,Ateret Anaby-Tavor

Main category: cs.CL

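TL;DR: This paper introduces a new metric for detecting near-misses (latent failures) in LLM-based agents for business process automation: decisions that reach a correct outcome while bypassing required policy checks. Built on the ToolGuard framework, the method analyzes whether an agent's tool calls were sufficiently grounded in policy. Experiments show that 8-17% of trajectories contain such hidden violations even when the final state is correct.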

Details Motivation: Existing evaluation only compares the final system state against ground truth, so it can miss near-misses (latent failures) in which an agent bypasses policy checks yet reaches the correct result by coincidence, leaving a blind spot in policy-compliance evaluation. Method: Building on the ToolGuard framework, which converts natural-language policies into executable guard code, the method analyzes agent conversation trajectories to judge whether tool-calling decisions were sufficiently constrained by policy, that is, sufficiently informed. Result: Tested on the τ²-verified Airlines benchmark with several contemporary open and proprietary LLMs, 8-17% of trajectories involving state-mutating tool calls contain latent failures even though the final state matches expectations. Conclusion: Outcome correctness alone does not guarantee policy compliance. Process-oriented evaluation metrics are needed to expose latent policy failures in an agent's decision chain. Abstract: Agentic systems for business process automation often require compliance with policies governing conditional updates to the system state. Evaluation of policy adherence in LLM-based agentic workflows is typically performed by comparing the final system state against a predefined ground truth. While this approach detects explicit policy violations, it may overlook a more subtle class of issues in which agents bypass required policy checks, yet reach a correct outcome due to favorable circumstances. We refer to such cases as near-misses or latent failures. In this work, we introduce a novel metric for detecting latent policy failures in agent conversation traces. Building on the ToolGuard framework, which converts natural-language policies into executable guard code, our method analyzes agent trajectories to determine whether the agent's tool-calling decisions were sufficiently informed. We evaluate our approach on the τ²-verified Airlines benchmark across several contemporary open and proprietary LLMs acting as agents. Our results show that latent failures occur in 8-17% of trajectories involving mutating tool calls, even when the final outcome matches the expected ground-truth state. These findings reveal a blind spot in current evaluation methodologies and highlight the need for metrics that assess not only final outcomes but also the decision process leading to them.
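
To make the idea of a latent failure concrete, here is a toy sketch that flags a mutating tool call when the read-only checks a policy requires did not appear earlier in the trajectory. It is a deliberately simplified illustration and not the ToolGuard API or the paper's metric: the trajectory format, the tool names, and the `required_checks` mapping are all hypothetical.

```python
def find_latent_failures(trajectory, required_checks):
    """Flag mutating calls made before all of their required checks were seen.

    trajectory: ordered list of tool names the agent called (assumed format).
    required_checks: dict mapping a mutating tool name to the set of tool
        names that must have been called earlier in the same trajectory.
    Returns a list of (step index, tool name, sorted missing checks).
    """
    seen = set()
    flagged = []
    for step, tool in enumerate(trajectory):
        if tool in required_checks:
            missing = required_checks[tool] - seen
            if missing:
                flagged.append((step, tool, sorted(missing)))
        seen.add(tool)
    return flagged


# Hypothetical airline policy: cancelling needs both lookups first.
policy = {"cancel_reservation": {"get_user", "get_reservation"}}
trace = ["get_user", "cancel_reservation"]
issues = find_latent_failures(trace, policy)
```

In this toy trace the cancellation may still leave the system in the expected final state, which is exactly why a check on the final state alone would not notice the skipped lookup.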

[41] ENEIDE: A High Quality Silver Standard Dataset for Named Entity Recognition and Linking in Historical Italian

Cristian Santini,Sebastian Barzaghi,Paolo Sernani,Emanuele Frontoni,Laura Melosi,Mehwish Alam

Main category: cs.CL

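TL;DR: This paper introduces ENEIDE, a silver standard dataset for Named Entity Recognition and Linking (NERL) in historical Italian texts. It contains 2,111 documents and over 8,000 entity annotations drawn from the digital editions of two prominent Italian figures, and comes with a semi-automatic annotation method, quality control procedures, and baseline experiments.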

Details Motivation: There is no multi-domain, publicly available NERL dataset for historical Italian texts, in particular one that spans different periods and provides training, validation, and test splits. Method: Named entity annotations are semi-automatically extracted from two scholarly digital editions (Digital Zibaldone and Aldo Moro Digitale), covering several entity types and linked to Wikidata (including NIL entities), supported by quality control and annotation enhancement procedures. Result: ENEIDE is the first public, cross-domain NERL dataset for historical Italian with complete data splits. Baseline experiments show that it is challenging for current NERL models and that fine-tuned models outperform zero-shot approaches. Its two-century time span suits temporal entity disambiguation and cross-domain evaluation. Conclusion: ENEIDE fills a gap in NERL resources for historical Italian, providing a high-quality, reproducible benchmark with a temporal dimension, released openly under CC BY-NC-SA 4.0. Abstract: This paper introduces ENEIDE (Extracting Named Entities from Italian Digital Editions), a silver standard dataset for Named Entity Recognition and Linking (NERL) in historical Italian texts. The corpus comprises 2,111 documents with over 8,000 entity annotations semi-automatically extracted from two scholarly digital editions: Digital Zibaldone, the philosophical diary of the Italian poet Giacomo Leopardi (1798-1837), and Aldo Moro Digitale, the complete works of the Italian politician Aldo Moro (1916-1978). Annotations cover multiple entity types (person, location, organization, literary work) linked to Wikidata identifiers, including NIL entities that cannot be mapped to the knowledge graph. To the best of our knowledge, ENEIDE represents the first multi-domain, publicly available NERL dataset for historical Italian with training, development, and test splits. We present a methodology for semi-automatic annotation extraction from manually curated scholarly digital editions, including quality control and annotation enhancement procedures. Baseline experiments using state-of-the-art models demonstrate the dataset's challenge for NERL and the gap between zero-shot approaches and fine-tuned models. The dataset's diachronic coverage spanning two centuries makes it particularly suitable for temporal entity disambiguation and cross-domain evaluation. ENEIDE is released under a CC BY-NC-SA 4.0 license.

[42] SNEAK: Evaluating Strategic Communication and Information Leakage in Large Language Models

Adar Avsian,Larry Heck

Main category: cs.CL

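TL;DR: This paper introduces the SNEAK benchmark for evaluating how well large language models share information selectively under asymmetric information, balancing informativeness against secrecy. Experiments show that current models fall far behind humans on this kind of strategic communication.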

Details Motivation: Existing LLM benchmarks do not cover strategic communication in multi-agent settings, where an agent must convey information while protecting a secret, and in particular do not evaluate selective information sharing under asymmetric information. Method: In the SNEAK benchmark, a model receives a semantic category, a candidate word set, and a secret word, and must produce a message that signals knowledge of the secret without revealing it too clearly. Two simulated agents, an ally (who knows the secret) and a chameleon (who does not), are used to compute utility and leakage metrics respectively. Result: Current mainstream LLMs perform poorly on SNEAK and struggle to achieve utility and secrecy together. Human participants clearly outperform every model, scoring up to four times higher. Conclusion: Strategic communication, meaning the balance of informativeness and secrecy under asymmetric information, remains a key weakness of current LLMs and calls for new methods and benchmarks. Abstract: Large language models (LLMs) are increasingly deployed in multi-agent settings where communication must balance informativeness and secrecy. In such settings, an agent may need to signal information to collaborators while preventing an adversary from inferring sensitive details. However, existing LLM benchmarks primarily evaluate capabilities such as reasoning, factual knowledge, or instruction following, and do not directly measure strategic communication under asymmetric information. We introduce SNEAK (Secret-aware Natural language Evaluation for Adversarial Knowledge), a benchmark for evaluating selective information sharing in language models. In SNEAK, a model is given a semantic category, a candidate set of words, and a secret word, and must generate a message that indicates knowledge of the secret without revealing it too clearly. We evaluate generated messages using two simulated agents with different information states: an ally, who knows the secret and must identify the intended message, and a chameleon, who does not know the secret and attempts to infer it from the message. This yields two complementary metrics: utility, measuring how well the message communicates to collaborators, and leakage, measuring how much information it reveals to an adversary. Using this framework, we analyze the trade-off between informativeness and secrecy in modern language models and show that strategic communication under asymmetric information remains a challenging capability for current systems. Notably, human participants outperform all evaluated models by a large margin, achieving up to four times higher scores.

[43] Towards Empowering Consumers through Sentence-level Readability Scoring in German ESG Reports

Benjamin Josef Schüßler,Jakob Prange

Main category: cs.CL

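TL;DR: This paper extends a sentence-level readability dataset of German ESG reports and uses crowdsourced annotations to evaluate how well various readability scoring methods agree with human judgments. A small fine-tuned transformer predicts human readability with the lowest error, while LLM prompting shows potential but performs slightly worse.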

Details Motivation: ESG reports must address the non-expert public, but how readable they actually are is unclear, so reliable and objective readability assessment is needed. Method: A German sentence-level ESG dataset is extended with crowdsourced readability annotations. Traditional metrics, LLM prompting, and small fine-tuned transformers are compared on their prediction error and correlation with human readability rankings and scores. Result: Human raters find ESG sentences easy to read overall, but with marked individual differences. The small fine-tuned transformer has the lowest prediction error, and LLM prompting shows some ability to separate clear from hard sentences. Averaging several models improves performance slightly at the cost of inference speed. Conclusion: Automated readability assessment needs both human annotation and model optimization. Small specialized models outperform general LLM prompting and are suited to practical work on making ESG reports more accessible. Abstract: With the ever-growing urgency of sustainability in the economy and society, and the massive stream of information that comes with it, consumers need reliable access to that information. To address this need, companies began publishing so-called Environmental, Social, and Governance (ESG) reports, both voluntarily and as required by law. To serve the public, these reports must be addressed not only to financial experts but also to non-expert audiences. But are they written clearly enough? In this work, we extend an existing sentence-level dataset of German ESG reports with crowdsourced readability annotations. We find that, in general, native speakers perceive sentences in ESG reports as easy to read, but also that readability is subjective. We apply various readability scoring methods and evaluate them regarding their prediction error and correlation with human rankings. Our analysis shows that, while LLM prompting has potential for distinguishing clear from hard-to-read sentences, a small finetuned transformer predicts human readability with the lowest error. Averaging predictions of multiple models can slightly improve the performance at the cost of slower inference.

[44] FLEURS-Kobani: Extending the FLEURS Dataset for Northern Kurdish

Daban Q. Jaff,Mohammad Mohammadamini

Main category: cs.CL

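TL;DR: This paper presents FLEURS-Kobani, a new speech benchmark for Northern Kurdish (KMR) with 5,162 validated utterances recorded by 31 native speakers, totaling 18 hours and 24 minutes. The authors run baseline ASR and end-to-end speech-to-text translation (S2TT) experiments with Whisper v3-large and propose a two-stage fine-tuning strategy that reaches WER 28.11 and CER 9.84 on ASR, and BLEU 8.68 on KMR→EN S2TT. It is the first public Northern Kurdish speech evaluation benchmark and is released under CC BY 4.0.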

Details Motivation: The FLEURS benchmark does not cover Northern Kurdish (KMR), which limits evaluation of automatic speech recognition (ASR) and speech translation (S2TT) for this low-resource language and calls for a dedicated benchmark. Method: The FLEURS-Kobani dataset is built from 5,162 collected and validated utterances (18h24m) by 31 native speakers. Whisper v3-large is trained for ASR and end-to-end S2TT with a two-stage fine-tuning strategy (first on Common Voice, then on FLEURS-Kobani). Cascaded S2TT and pivot-derived target results are also reported. Result: On ASR, two-stage fine-tuned Whisper v3-large reaches WER 28.11 and CER 9.84 on the test set. End-to-end S2TT (KMR→EN) reaches BLEU 8.68. The dataset is the first public Northern Kurdish benchmark for ASR, S2TT, and S2ST. Conclusion: FLEURS-Kobani fills the gap in Northern Kurdish speech benchmarks, strengthens research support for this low-resource Kurdish variety, and provides a key resource for developing and evaluating future multilingual speech models. Abstract: FLEURS offers n-way parallel speech for 100+ languages, but Northern Kurdish is not one of them, which limits benchmarking for automatic speech recognition and speech translation tasks in this language. We present FLEURS-Kobani, a Northern Kurdish (ISO 639-3 KMR) spoken extension of the FLEURS benchmark. The FLEURS-Kobani dataset consists of 5,162 validated utterances, totaling 18 hours and 24 minutes. The data were recorded by 31 native speakers. It extends benchmark coverage to an under-resourced Kurdish variety. As baselines, we fine-tuned Whisper v3-large for ASR and E2E S2TT. A two-stage fine-tuning strategy (Common Voice to FLEURS-Kobani) yields the best ASR performance (WER 28.11, CER 9.84 on test). For E2E S2TT (KMR to EN), Whisper achieves 8.68 BLEU on test; we additionally report pivot-derived targets and a cascaded S2TT setup. FLEURS-Kobani provides the first public Northern Kurdish benchmark for evaluation of ASR, S2TT and S2ST tasks. The dataset is publicly released for research use under a CC BY 4.0 license.
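
The WER and CER figures in this entry are edit-distance metrics. The sketch below shows the standard word error rate computation (word-level Levenshtein distance divided by reference length), returned as a fraction. The function name and the English example sentences are illustrative assumptions, and the abstract does not describe any text normalization applied before scoring. Running the same routine over characters instead of words gives a character error rate.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # Row-by-row Levenshtein distance over words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref_words)


# One substitution ("sat" -> "sit") and one deletion ("the") over six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.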

[45] Rewrite the News: Tracing Editorial Reuse Across News Agencies

Soveatin Kuntur,Nina Smirnova,Anna Wroblewska,Philipp Mayr,Sebastijan Razboršek Maček

Main category: cs.CL

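TL;DR: This paper proposes a weakly supervised method for detecting sentence-level cross-lingual text reuse in multilingual news without full translation. It uses publication times to identify the earliest source and analyzes where reused content appears within articles.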

Details Motivation: To reduce information overload for journalists and support automated pre-selection, non-literal, cross-lingual sentence-level text reuse in multilingual news must be detected effectively. Method: A weakly supervised cross-lingual sentence reuse detection method is proposed. Publication times determine the earliest likely source, and alignment and filtering are run on a multilingual news corpus (over 230,000 articles in total from STA and 15 foreign agencies). Result: 52% of STA articles contain reused sentences, against only 1.6% of foreign agency articles. Reuse is mostly non-literal rewriting in the middle and end of articles, while leads are more often original. In total, 1,087 aligned sentence pairs are identified. Conclusion: Unlike prior monolingual reuse studies, this work detects reuse across languages without full translation, traces sources by time, and analyzes reuse position, revealing deeper reuse patterns in editorial practice that traditional lexical matching overlooks. Abstract: This paper investigates sentence-level text reuse in multilingual journalism, analyzing where reused content occurs within articles. We present a weakly supervised method for detecting sentence-level cross-lingual reuse without requiring full translations, designed to support automated pre-selection to reduce information overload for journalists (Holyst et al., 2024). The study compares English-language articles from the Slovenian Press Agency (STA) with reports from 15 foreign agencies (FA) in seven languages, using publication timestamps to retain the earliest likely foreign source for each reused sentence. We analyze 1,037 STA and 237,551 FA articles from two time windows (October 7-November 2, 2023; February 1-28, 2025) and identify 1,087 aligned sentence pairs after filtering to the earliest sources. Reuse occurs in 52% of STA articles and 1.6% of FA articles and is predominantly non-literal, involving paraphrase and compositional reuse from multiple sources. Reused content tends to appear in the middle and end of English articles, while leads are more often original, indicating that simple lexical matching overlooks substantial editorial reuse. Compared with prior work focused on monolingual overlap, we (i) detect reuse across languages without requiring full translation, (ii) use publication timing to identify likely sources, and (iii) analyze where reused material is situated within articles. Dataset and code: https://github.com/kunturs/lrec2026-rewrite-news.

[46] Structural Feature Engineering for Generative Engine Optimization: How Content Structure Shapes Citation Behavior

Junwei Yu,Mufeng Yang,Yepeng Ding,Hiroyuki Sato

Main category: cs.CL

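TL;DR: This paper proposes the GEO-SFE framework, which systematically studies how content structure (macro, meso, micro) affects the citation behavior of generative search engines, and uses architecture-aware optimization strategies to improve citation rate and subjective quality.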

Details Motivation: The rise of generative search engines has changed how information is discovered. Existing Generative Engine Optimization (GEO) work focuses mostly on semantics, and the effect of structural features on citation behavior remains underexplored. Method: The GEO-SFE structural feature engineering framework decomposes content structure into three levels, macro (document architecture), meso (information chunking), and micro (visual emphasis), and models their effect on citation probability across different generative engines. Architecture-aware optimization strategies and predictive models improve structural effectiveness while preserving semantic integrity. Result: Experiments on six mainstream generative engines show a 17.3% improvement in citation rate and an 18.5% improvement in subjective quality, supporting the framework's effectiveness and generalizability. Conclusion: Structural optimization is a foundational component of GEO, and this work provides a data-driven methodology for improving content visibility in LLM-driven information ecosystems. Abstract: The proliferation of AI-powered search engines has shifted information discovery from traditional link-based retrieval to direct answer generation with selective source citation, creating new challenges for content visibility. While existing Generative Engine Optimization (GEO) approaches focus primarily on semantic content modification, the role of structural features in influencing citation behavior remains underexplored. In this paper, we propose GEO-SFE, a systematic framework for structural feature engineering in generative engine optimization. Our approach decomposes content structure into three hierarchical levels: macro-structure (document architecture), meso-structure (information chunking), and micro-structure (visual emphasis), and models their impact on citation probability across different generative engine architectures. We develop architecture-aware optimization strategies and predictive models that preserve semantic integrity while improving structural effectiveness. Experimental evaluation across six mainstream generative engines demonstrates consistent improvements in citation rate (17.3 percent) and subjective quality (18.5 percent), validating the effectiveness and generalizability of the proposed framework. This work establishes structural optimization as a foundational component of GEO, providing a data-driven methodology for enhancing content visibility in LLM-powered information ecosystems.

[47] Enhancing Structural Mapping with LLM-derived Abstractions for Analogical Reasoning in Narratives

Mohammadhossein Khojasteh,Yifan Jiang,Stefano De Giorgis,Frank van Harmelen,Filip Ilievski

Main category: cs.CL

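TL;DR: This paper proposes the YARN framework, which uses large language models (LLMs) to generate multi-level abstractions of narrative units and combines them with structural mapping for analogical reasoning, improving narrative analogical reasoning and matching or exceeding end-to-end LLM baselines.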

Details Motivation: Existing cognitive engines require pre-extracted entities, while LLMs are sensitive to prompt format and surface similarity, so neither handles analogies between narrative structures well. This motivates studying how LLM-derived abstractions affect structural mapping and analogical reasoning. Method: The modular YARN framework uses LLMs to decompose narratives into units and to generate four levels of abstraction capturing both meaning and story role (grounded in prior work on framing), then passes them to a mapping component that aligns elements across stories to perform analogical reasoning. Result: Experiments show that abstractions improve model performance, matching or exceeding end-to-end LLM baselines. Error analysis reveals remaining challenges in abstraction granularity, modeling implicit causality, and categorizing analogical patterns. Conclusion: LLM-driven structured abstraction can strengthen narrative analogical reasoning. YARN offers an interpretable, adjustable modular approach, and its code is openly available to support follow-up research. Abstract: Analogical reasoning is a key driver of human generalization in problem-solving and argumentation. Yet, analogies between narrative structures remain challenging for machines. Cognitive engines for structural mapping are not directly applicable, as they assume pre-extracted entities, whereas LLMs' performance is sensitive to prompt format and the degree of surface similarity between narratives. This gap motivates a key question: What is the impact of enhancing structural mapping with LLM-derived abstractions on their analogical reasoning ability in narratives? To that end, we propose a modular framework named YARN (Yielding Abstractions for Reasoning in Narratives), which uses LLMs to decompose narratives into units, abstract these units, and then passes them to a mapping component that aligns elements across stories to perform analogical reasoning. We define and operationalize four levels of abstraction that capture both the general meaning of units and their roles in the story, grounded in prior work on framing. Our experiments reveal that abstractions consistently improve model performance, resulting in competitive or better performance than end-to-end LLM baselines. Closer error analysis reveals the remaining challenges in abstraction at the right level, in incorporating implicit causality, and an emerging categorization of analogical patterns in narratives. YARN enables systematic variation of experimental settings to analyze component contributions, and to support future work, we make the code for YARN openly available.

[48] ContextClaim: A Context-Driven Paradigm for Verifiable Claim Detection

Yufeng Li,Rrubaa Panchendrarajan,Arkaitz Zubiaga

Main category: cs.CL

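TL;DR: This paper proposes ContextClaim, which extracts entities from a claim, retrieves related information from Wikipedia, and generates a contextual summary to improve verifiable claim detection. Its effectiveness is tested across several datasets and model settings.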

Details Motivation: Existing verifiable claim detection methods rely only on the claim text, overlooking how much the entities and events it mentions, and the existence of external evidence, matter for judging whether a claim is verifiable. Method: The Context-Driven Claim Detection (ContextClaim) paradigm extracts entities from the claim, retrieves related knowledge from Wikipedia, and uses large language models to generate concise contextual summaries that support downstream classification. Result: On the CheckThat! 2022 COVID-19 Twitter and PoliClaim datasets, across model architectures (encoder-only and decoder-only) and learning settings (fine-tuning, zero-shot, few-shot), context augmentation can improve detection performance, though the effect varies by domain, model, and setting. Conclusion: Adding external context helps make verifiable claim detection more reliable, but the gain is conditional. The work offers a feasible path, with empirical support, for moving evidence retrieval forward to the detection stage. Abstract: Verifiable claim detection asks whether a claim expresses a factual statement that can, in principle, be assessed against external evidence. As an early filtering stage in automated fact-checking, it plays an important role in reducing the burden on downstream verification components. However, existing approaches to claim detection, whether based on check-worthiness or verifiability, rely solely on the claim text itself. This is a notable limitation for verifiable claim detection in particular, where determining whether a claim is checkable may benefit from knowing what entities and events it refers to and whether relevant information exists to support verification. Inspired by the established role of evidence retrieval in later-stage claim verification, we propose Context-Driven Claim Detection (ContextClaim), a paradigm that advances retrieval to the detection stage. ContextClaim extracts entity mentions from the input claim, retrieves relevant information from Wikipedia as a structured knowledge source, and employs large language models to produce concise contextual summaries for downstream classification. We evaluate ContextClaim on two datasets covering different topics and text genres, the CheckThat! 2022 COVID-19 Twitter dataset and the PoliClaim political debate dataset, across encoder-only and decoder-only models under fine-tuning, zero-shot, and few-shot settings. Results show that context augmentation can improve verifiable claim detection, although its effectiveness varies across domains, model architectures, and learning settings.
Through component analysis, human evaluation, and error analysis, we further examine when and why the retrieved context contributes to more reliable verifiability judgments.

[49] Covertly improving intelligibility with data-driven adaptations of speech timing

Paige Tuttösí,Angelica Lim,H. Henny Yeung,Yue Wang,Jean-Julien Aucouturier

Main category: cs.CL

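TL;DR: Using reverse-correlation experiments, this paper finds that speech rate before a target vowel contrast has a scissor-like temporal influence, a pattern that is stable across native and second-language listeners. Based on this, the authors build a data-driven TTS algorithm that reproduces the pattern and show that targeted speech-rate adjustments significantly improve intelligibility under challenging conditions without listeners noticing. The common strategy of global slowing is judged clearer by listeners but actually increases comprehension errors.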

Details Motivation: The paper asks whether the global slowing that talkers commonly use really improves intelligibility for hard-of-hearing or second-language adult listeners, and looks for a more effective method based on the temporal structure of speech. Method: Reverse-correlation experiments reveal the temporal pattern of speech-rate influence on a target vowel contrast. Responses are compared between L1-English listeners and L2-English listeners (native speakers of French, Mandarin, and Japanese), the role of the pattern is tested in challenging acoustic conditions, and a data-driven text-to-speech (TTS) algorithm is built and tested to reproduce the scissor-like rate structure. Result: Speech-rate influence follows a stable scissor-like temporal pattern, with opposite effects in early and late context windows. The structure helps both L2 and L1 listeners. Targeted rate adjustments significantly improve word comprehension without listeners noticing, whereas global slowing is judged clearer yet causes more comprehension errors. Conclusion: Targeted speech-rate adjustment improves intelligibility more effectively than global slowing and applies broadly. The study provides a data-driven methodology for improving the accessibility of machine-generated speech that can extend to other speech features and listener groups. Abstract: Human talkers often address listeners with language-comprehension challenges, such as hard-of-hearing or non-native adults, by globally slowing down their speech. However, it remains unclear whether this strategy actually makes speech more intelligible. Here, we take advantage of recent advancements in machine-generated speech allowing more precise control of speech rate in order to systematically examine how targeted speech-rate adjustments may improve comprehension. We first use reverse-correlation experiments to show that the temporal influence of speech rate prior to a target vowel contrast (e.g., the tense-lax distinction) in fact manifests in a scissor-like pattern, with opposite effects in early versus late context windows; this pattern is remarkably stable both within individuals and across native L1-English listeners and L2-English listeners with French, Mandarin, and Japanese L1s. Second, we show that this speech rate structure not only facilitates L2 listeners' comprehension of the target vowel contrast, but that native listeners also rely on this pattern in challenging acoustic conditions. Finally, we build a data-driven text-to-speech algorithm that replicates this temporal structure on novel speech sequences. Across a variety of sentences and vowel contrasts, listeners remained unaware that such targeted slowing improved word comprehension. Strikingly, participants instead judged the common strategy of global slowing as clearer, even though it actually increased comprehension errors.
Together, these results show that targeted adjustments to speech rate significantly aid intelligibility under challenging conditions, while often going unnoticed. More generally, this paper provides a data-driven methodology to improve the accessibility of machine-generated speech which can be extended to other aspects of speech comprehension and a wide variety of listeners and environments.

cs.CV [Back]

[50] DF-ACBlurGAN: Structure-Aware Conditional Generation of Internally Repeated Patterns for Biomaterial Microtopography Design

Rongjun Dong,Xin Chen,Morgan R Alexander,Karthikeyan Sivakumar,Reza Omdivar,David A Winkler,Grazziela Figueredo

Main category: cs.CV

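TL;DR: This paper proposes DF-ACBlurGAN, a structure-aware conditional generative adversarial network that generates biomaterial surface images with controllable repeated, periodic structure under weak supervision and class imbalance.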

Details Motivation: Existing image generation models struggle to guarantee global structural consistency, especially in biomaterial microtopography design, which requires precise control over repetition scale, spacing, and boundary coherence. Method: DF-ACBlurGAN combines frequency-domain repetition scale estimation, scale-adaptive Gaussian blurring, and unit-cell reconstruction to model long-range repetition explicitly, and conditions generation on experimentally derived biological response labels. Result: Evaluated on multiple biomaterial datasets, the method improves repetition consistency and the controllability of structural variation compared with conventional generative approaches. Conclusion: DF-ACBlurGAN addresses controllable generation of periodic-structure images under weak supervision and offers a new paradigm for function-oriented biomaterial design. Abstract: Learning to generate images with internally repeated and periodic structures poses a fundamental challenge for machine learning and computer vision models, which are typically optimised for local texture statistics and semantic realism rather than global structural consistency. This limitation is particularly pronounced in applications requiring strict control over repetition scale, spacing, and boundary coherence, such as microtopographical biomaterial surfaces. In this work, biomaterial design serves as a use case to study conditional generation of repeated patterns under weak supervision and class imbalance. We propose DF-ACBlurGAN, a structure-aware conditional generative adversarial network that explicitly reasons about long-range repetition during training. The approach integrates frequency-domain repetition scale estimation, scale-adaptive Gaussian blurring, and unit-cell reconstruction to balance sharp local features with stable global periodicity. Conditioning on experimentally derived biological response labels, the model synthesises designs aligned with target functional outcomes. Evaluation across multiple biomaterial datasets demonstrates improved repetition consistency and controllable structural variation compared to conventional generative approaches.
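
To make "frequency-domain repetition scale estimation" concrete, here is a minimal sketch of one generic way to estimate a repetition period: take the Fourier transform of a 1D intensity profile and read the period off the strongest non-DC frequency. This is an illustration only, not the estimator used in DF-ACBlurGAN; the function name `estimate_period` and the synthetic test row are assumptions made for the example.

```python
import cmath
import math


def estimate_period(profile):
    """Estimate the dominant repetition period (in samples) of a 1D profile.

    Hypothetical helper for illustration; not taken from the paper.
    """
    n = len(profile)
    mean = sum(profile) / n
    centered = [v - mean for v in profile]  # remove the DC component
    best_k, best_power = 0, 0.0
    # Scan positive frequencies up to Nyquist with a naive DFT.
    for k in range(1, n // 2 + 1):
        coeff = sum(c * cmath.exp(-2j * math.pi * k * t / n)
                    for t, c in enumerate(centered))
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    if best_k == 0:
        return None  # constant profile: no repetition found
    return n / best_k  # frequency bin k corresponds to a period of n / k samples


# Synthetic row of pixels that repeats every 8 samples.
row = [math.sin(2 * math.pi * t / 8) for t in range(64)]
period = estimate_period(row)
```

A real image would need a 2D transform and some handling of harmonics and noise; the 1D version only shows how a spectral peak maps to a repetition scale.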

[51] OccSim: Multi-kilometer Simulation with Long-horizon Occupancy World Models

Tianran Liu,Shengwen Zhao,Mozhgan Pourkeshavarz,Weican Li,Nicholas Rhinehart

Main category: cs.CV

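TL;DR: OccSim is a 3D autonomous driving simulator driven by an occupancy world model. Given only a single initial frame and a sequence of future ego-actions, it generates very long continuous sequences (over 3,000 frames) and builds 3D occupancy maps spanning more than 4 kilometers, substantially improving open-ended generation and downstream task performance.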

Details Motivation: Existing data-driven autonomous driving simulation depends heavily on pre-recorded driving logs or spatial priors such as HD maps, which limits scalability and makes open-ended, long-horizon, large-scale generation difficult. Method: OccSim has two core modules: a W-DiT-based static occupancy world model, which supports ultra-long-horizon generation of static environments by explicitly introducing rigid transformations, and a Layout Generator, which populates dynamic foreground agents based on the synthesized road topology. Result: It stably generates over 3,000 continuous frames (a more than 80x improvement over the state of the art) and builds 3D occupancy maps spanning over 4 kilometers; 4D semantic occupancy forecasting models pre-trained on its data reach up to 67% zero-shot performance, 11% higher than an asset-based simulator; with 5x more data, zero-shot performance rises to about 74% and the margin widens to 22.1%. Conclusion: OccSim is the first to achieve large-scale, long-horizon, open-ended generative driving simulation without logs or HD maps, improving scalability and practicality and showing strong generalization on downstream tasks. Abstract: Data-driven autonomous driving simulation has long been constrained by its heavy reliance on pre-recorded driving logs or spatial priors, such as HD maps. This fundamental dependency severely limits scalability, restricting open-ended generation capabilities to the finite scale of existing collected datasets. To break this bottleneck, we present OccSim, the first occupancy world model-driven 3D simulator. OccSim obviates the requirement for continuous logs or HD maps; conditioned only on a single initial frame and a sequence of future ego-actions, it can stably generate over 3,000 continuous frames, enabling the continuous construction of large-scale 3D occupancy maps spanning over 4 kilometers for simulation. This represents an >80x improvement in stable generation length over previous state-of-the-art occupancy world models. OccSim is powered by two modules: a W-DiT-based static occupancy world model and a Layout Generator. W-DiT handles the ultra-long-horizon generation of static environments by explicitly introducing known rigid transformations in architecture design, while the Layout Generator populates the dynamic foreground with reactive agents based on the synthesized road topology. With these designs, OccSim can synthesize massive, diverse simulation streams. Extensive experiments demonstrate its downstream utility: data collected directly from OccSim can pre-train 4D semantic occupancy forecasting models to achieve up to 67% zero-shot performance on unseen data, outperforming the previous asset-based simulator by 11%. When scaling the OccSim dataset to 5x the size, the zero-shot performance increases to about 74%, while the improvement over asset-based simulators expands to 22.1%.

[52] Fisheye3R: Adapting Unified 3D Feed-Forward Foundation Models to Fisheye Lenses

Ruxiao Duan,Erin Hong,Dongxu Zhao,Eric Turner,Alex Wong,Yunwen Zhou

Main category: cs.CV

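TL;DR: This paper proposes Fisheye3R, a framework that lets multi-view 3D reconstruction foundation models natively accept fisheye images without degrading their performance on perspective images. Self-supervised adaptation and supervised adaptation that needs no fisheye data mitigate the scarcity of annotated fisheye images, and experiments on several foundation models show improved pose, depth, point map, and field-of-view estimation.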

Details Motivation: Existing feed-forward multi-view 3D reconstruction foundation models degrade on wide field-of-view images such as fisheye images, mainly because the non-linear projection changes the spatial positions of pixels; training directly on fisheye images is limited by the scarcity of annotated data. Method: Fisheye3R is an adaptation framework with two flexible learning schemes, self-supervised (using only unlabeled perspective images) and supervised (using no fisheye training data), that extends foundation models to handle fisheye inputs natively. Result: Experiments on three foundation models, VGGT, π³, and MapAnything, show significantly improved camera pose, depth, point map, and field-of-view estimation on fisheye images, with no loss of performance on perspective images. Conclusion: Fisheye3R is an efficient, general adaptation framework that achieves high-quality 3D reconstruction from fisheye images under annotation scarcity while remaining compatible with and robust on conventional perspective images. Abstract: Feed-forward foundation models for multi-view 3-dimensional (3D) reconstruction have been trained on large-scale datasets of perspective images; when tested on wide field-of-view images, e.g., from a fisheye camera, their performance degrades. Their error arises from changes in spatial positions of pixels due to a non-linear projection model that maps 3D points onto the 2D image plane. While one may surmise that training on fisheye images would resolve this problem, there are far fewer fisheye images with ground truth than perspective images, which limits generalization. To enable inference on imagery exhibiting high radial distortion, we propose Fisheye3R, a novel adaptation framework that extends these multi-view 3D reconstruction foundation models to natively accommodate fisheye inputs without performance regression on perspective images. To address the scarcity of fisheye images and ground truth, we introduce flexible learning schemes that support self-supervised adaptation using only unlabeled perspective images and supervised adaptation without any fisheye training data. Extensive experiments across three foundation models, including VGGT, $π^3$, and MapAnything, demonstrate that our approach consistently improves camera pose, depth, point map, and field-of-view estimation on fisheye images.

[53] Decoding Functional Networks for Visual Categories via GNNs

Shira Karmi,Galia Avidan,Tammy Riklin Raviv

Main category: cs.CV

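TL;DR: This paper proposes a framework based on 7T fMRI data and a signed graph neural network for decoding category-specific functional connectivity states, revealing reproducible, biologically meaningful subnetworks along the ventral and dorsal visual pathways.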

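Details Motivation: Understanding how large-scale brain networks represent visual categories is key to linking perception and cortical organization. Method: High-resolution 7T fMRI data from the Natural Scenes Dataset are used to build parcel-level functional graphs, and a signed graph neural network is trained to model positive and negative functional connections, combined with a sparse edge mask and class-specific saliency. Result: The model accurately decodes functional connectivity states for categories such as sports, food, and vehicles, and identifies reproducible, biologically meaningful subnetworks along the ventral and dorsal visual pathways. Conclusion: The framework extends traditional voxel-level category selectivity to a connectivity-based representation of visual processing, bridging machine learning and neuroscience. Abstract: Understanding how large-scale brain networks represent visual categories is fundamental to linking perception and cortical organization. Using high-resolution 7T fMRI from the Natural Scenes Dataset, we construct parcel-level functional graphs and train a signed Graph Neural Network that models both positive and negative interactions, with a sparse edge mask and class-specific saliency. The model accurately decodes category-specific functional connectivity states (sports, food, vehicles) and reveals reproducible, biologically meaningful subnetworks along the ventral and dorsal visual pathways. This framework bridges machine learning and neuroscience by extending voxel-level category selectivity to a connectivity-based representation of visual processing.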

[54] Stepper: Stepwise Immersive Scene Generation with Multiview Panoramas

Felix Wimbauer,Fabian Manhardt,Michael Oechsle,Nikolai Kalischek,Christian Rupprecht,Daniel Cremers,Federico Tombari

Main category: cs.CV

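TL;DR: This paper proposes Stepper, a framework for text-driven immersive 3D scene synthesis via stepwise panoramic scene expansion, improving explorability while maintaining high visual fidelity.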

Details Motivation: Existing text-to-immersive-3D-scene methods trade off visual fidelity against explorability: autoregressive expansion is prone to context drift, while panoramic video generation is limited in resolution. Method: Stepper is a unified framework comprising a novel multi-view 360° diffusion model for consistent, high-resolution stepwise panoramic expansion and a geometry reconstruction pipeline that enforces geometric coherence; it is trained on a newly built large-scale multi-view panorama dataset. Result: It achieves state-of-the-art visual fidelity and structural consistency, outperforming prior methods. Conclusion: Stepper sets a new standard for immersive scene generation and advances applications such as AR/VR and world modeling. Abstract: The synthesis of immersive 3D scenes from text is rapidly maturing, driven by novel video generative models and feed-forward 3D reconstruction, with vast potential in AR/VR and world modeling. While panoramic images have proven effective for scene initialization, existing approaches suffer from a trade-off between visual fidelity and explorability: autoregressive expansion suffers from context drift, while panoramic video generation is limited to low resolution. We present Stepper, a unified framework for text-driven immersive 3D scene synthesis that circumvents these limitations via stepwise panoramic scene expansion. Stepper leverages a novel multi-view 360° diffusion model that enables consistent, high-resolution expansion, coupled with a geometry reconstruction pipeline that enforces geometric coherence. Trained on a new large-scale, multi-view panorama dataset, Stepper achieves state-of-the-art fidelity and structural consistency, outperforming prior approaches, thereby setting a new standard for immersive scene generation.

[55] Hybrid Quantum-Classical AI for Industrial Defect Classification in Welding Images

Akshaya Srinivasan,Xiaoyin Cheng,Jianming Yi,Alexander Geng,Desislava Ivanova,Andreas Weinmann,Ali Moghiseh

Main category: cs.CV

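TL;DR: This paper studies two hybrid quantum-classical machine learning approaches for defect classification in aluminium TIG welding images and compares them with a conventional deep learning model (CNN). The hybrid models perform competitively, showing their potential for industrial defect detection.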

Details Motivation: To explore the feasibility of hybrid quantum-classical machine learning for automated industrial quality control, specifically defect classification in aluminium TIG welding images. Method: 1) A CNN extracts image features; 2) the first quantum approach encodes the features into quantum states with a parameterized quantum feature map, builds a quantum kernel matrix, and solves the SVM optimization problem with a Variational Quantum Linear Solver (VQLS); 3) the second quantum approach angle-encodes the features into a variational quantum circuit trained with a classical optimizer; 4) performance is evaluated on binary and multiclass tasks. Result: The hybrid quantum-classical models perform comparably to the CNN on defect classification; the effect of the quantum kernel condition number on classification performance is also examined; both quantum approaches show practical potential. Conclusion: Hybrid quantum-classical methods have practical value for near-term industrial defect detection and quality assurance and complement conventional deep learning. Abstract: Hybrid quantum-classical machine learning offers a promising direction for advancing automated quality control in industrial settings. In this study, we investigate two hybrid quantum-classical approaches for classifying defects in aluminium TIG welding images and benchmark their performance against a conventional deep learning model. A convolutional neural network is used to extract compact and informative feature vectors from weld images, effectively reducing the higher-dimensional pixel space to a lower-dimensional feature space. Our first quantum approach encodes these features into quantum states using a parameterized quantum feature map composed of rotation and entangling gates. We compute a quantum kernel matrix from the inner products of these states, defining a linear system in a higher-dimensional Hilbert space corresponding to the support vector machine (SVM) optimization problem and solving it using a Variational Quantum Linear Solver (VQLS). We also examine the effect of the quantum kernel condition number on classification performance. In our second method, we apply angle encoding to the extracted features in a variational quantum circuit and use a classical optimizer for model training. Both quantum models are tested on binary and multiclass classification tasks and the performance is compared with the classical CNN model. Our results show that while the CNN model demonstrates robust performance, hybrid quantum-classical models perform competitively. This highlights the potential of hybrid quantum-classical approaches for near-term real-world applications in industrial defect detection and quality assurance.
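
As a rough illustration of how a quantum kernel matrix arises from inner products of encoded states, the sketch below simulates the simplest possible case classically: each feature sets the angle of a single-qubit RY rotation applied to |0>, with no entangling gates. For such product states the overlap of two encoded vectors has a closed form, so no quantum library is needed. This is a deliberately simplified assumption, not the feature map from the paper (which also uses entangling gates), and the function names are hypothetical.

```python
import math


def angle_encoding_kernel(x, y):
    """Fidelity |<psi(x)|psi(y)>|^2 for product states built from RY(x_i)|0>.

    For one qubit, <psi(a)|psi(b)> = cos((a - b) / 2), so the squared overlap
    of the full product state is the product of cos^2 terms over all features.
    """
    value = 1.0
    for xi, yi in zip(x, y):
        value *= math.cos((xi - yi) / 2.0) ** 2
    return value


def kernel_matrix(features):
    """Gram matrix of pairwise kernel values for a list of feature vectors."""
    return [[angle_encoding_kernel(a, b) for b in features] for a in features]


# Three toy two-dimensional feature vectors (angles in radians).
samples = [[0.0, 0.0], [math.pi, 0.0], [0.5, 1.0]]
gram = kernel_matrix(samples)
```

The resulting matrix is symmetric with ones on the diagonal, and it is this kind of matrix whose conditioning matters when a linear solver is applied to the SVM system.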

[56] GenFusion: Feed-forward Human Performance Capture via Progressive Canonical Space Updates

Youngjoong Kwon,Yao He,Heejung Choi,Chen Geng,Zhengmao Liu,Jiajun Wu,Ehsan Adeli

Main category: cs.CV

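TL;DR: This paper presents a feed-forward human performance capture method for monocular RGB video that builds and continually updates a canonical space to accumulate appearance information, combined with probabilistic regression for high-quality, robust novel-view rendering.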

Details Motivation: Monocular RGB video provides too few observations, and unseen regions in particular are hard to reconstruct, so temporal continuity must be exploited to increase visibility. Method: A canonical space updated over time accumulates appearance information, and rendering is formulated as probabilistic regression to reconcile past and current observations and resolve conflicts between them. Result: Experiments on 4D-Dress (in-domain) and MVHumanNet (out-of-distribution) validate the method, which produces sharper reconstructions and supports plausible synthesis in regions with no prior observations. Conclusion: Through canonical-space modeling and probabilistic rendering, the method significantly improves the quality and robustness of human novel-view synthesis from monocular input. Abstract: We present a feed-forward human performance capture method that renders novel views of a performer from a monocular RGB stream. A key challenge in this setting is the lack of sufficient observations, especially for unseen regions. Assuming the subject moves continuously over time, we take advantage of the fact that more body parts become observable by maintaining a canonical space that is progressively updated with each incoming frame. This canonical space accumulates appearance information over time and serves as a context bank when direct observations are missing in the current live frame. To effectively utilize this context while respecting the deformation of the live state, we formulate the rendering process as probabilistic regression. This resolves conflicts between past and current observations, producing sharper reconstructions than deterministic regression approaches. Furthermore, it enables plausible synthesis even in regions with no prior observations. Experiments on in-domain (4D-Dress) and out-of-distribution (MVHumanNet) datasets demonstrate the effectiveness of our approach.

[57] MEDiC: Multi-objective Exploration of Distillation from CLIP

Konstantinos Georgiou,Maofeng Tang,Hairong Qi

Main category: cs.CV

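TL;DR: MEDiC is a multi-objective self-supervised learning framework that combines pixel-space reconstruction with distillation from a CLIP teacher. With three complementary objectives (patch-level token distillation, global CLS alignment, and lightweight pixel reconstruction), it reaches 73.9% kNN accuracy on ImageNet-1K. The study finds that because the teacher is already semantically aware, more semantically coherent evolved masking brings no gain, and that the loss weights are extremely sensitive.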

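Details Motivation: Existing MIM methods usually operate either in raw pixel space or in a latent feature space guided by a pre-trained teacher, and the potential of combining the two has not been explored systematically; masking strategies and loss weights are also largely chosen empirically, without in-depth analysis. Method: MEDiC combines three objectives: 1) patch-level token distillation from a frozen CLIP encoder; 2) global [CLS] token alignment; 3) pixel reconstruction with a lightweight decoder. It also systematically studies masking strategies (introducing evolved masking based on hierarchical clustering with relative position bias) and sensitivity to loss weights. Result: The full three-objective combination reaches 73.9% kNN accuracy on ImageNet-1K; evolved masking, although more semantically coherent, does not outperform simple block masking; the loss weights are extremely sensitive, with small perturbations causing drops of up to 17 percentage points in kNN accuracy; with ViT-Base at 300 epochs, the framework achieves 73.9% kNN and 85.1% fine-tuning accuracy. Conclusion: Pixel reconstruction and teacher distillation complement each other; the teacher's own semantic awareness reduces the need for complex semantic masking; multi-objective training requires careful loss balancing, and its fragility points to a need for more robust optimization. Abstract: Masked image modeling (MIM) methods typically operate in either raw pixel space (reconstructing masked patches) or latent feature space (aligning with a pre-trained teacher). We present MEDiC (Multi-objective Exploration of Distillation from CLIP), a framework that combines both spaces in a single pipeline through three complementary objectives: patch-level token distillation from a frozen CLIP encoder, global CLS alignment, and pixel reconstruction via a lightweight decoder. We conduct a systematic investigation of the design space surrounding this multi-objective framework. First, we show that all three objectives provide complementary information, with the full combination reaching 73.9% kNN accuracy on ImageNet-1K. Second, we introduce hierarchical clustering with relative position bias for evolved masking and find that, despite producing more semantically coherent masks than prior methods, evolved masking does not outperform simple block masking in the teacher-guided distillation setting, a finding we attribute to the teacher's inherent semantic awareness. Third, we reveal that optimal scalar loss weights are extremely fragile, with small perturbations causing drops of up to 17 percentage points in kNN accuracy. Our framework achieves 73.9% kNN and 85.1% fine-tuning accuracy with ViT-Base at 300 epochs.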

[58] UltraG-Ray: Physics-Based Gaussian Ray Casting for Novel Ultrasound View Synthesis

Felix Duelmer,Jakob Klaushofer,Magdalena Wysocki,Nassir Navab,Mohammad Farid Azampour

Main category: cs.CV

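TL;DR: This paper proposes UltraG-Ray, an ultrasound scene representation based on a learnable 3D Gaussian field, coupled with a physics-based B-mode synthesis module. It explicitly encodes ultrasound-specific parameters such as attenuation and reflection and uses a new ray casting scheme to generate more realistic ultrasound images, with gains of up to 15% on MS-SSIM.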

Details Motivation: Existing ultrasound novel view synthesis (NVS) methods struggle to model complex tissue and view-dependent acoustic effects; whether purely data-driven or physics-based, current methods still leave a substantial gap between simulation and reality. Method: UltraG-Ray uses a learnable 3D Gaussian field as the scene representation and explicitly embeds ultrasound-specific physical parameters (such as attenuation and reflection); an efficient physics-based rendering module for B-mode imaging uses a novel ray casting scheme to model view-dependent attenuation. Result: Image quality metrics improve over state-of-the-art methods (up to 15% on MS-SSIM), and the synthesized images show greater anatomical plausibility and physical realism. Conclusion: By integrating ultrasound physics priors into the scene representation and rendering pipeline, UltraG-Ray significantly narrows the gap between simulated and real ultrasound images, offering a more reliable NVS tool for clinical training and data augmentation. Abstract: Novel view synthesis (NVS) in ultrasound has gained attention as a technique for generating anatomically plausible views beyond the acquired frames, offering new capabilities for training clinicians or data augmentation. However, current methods struggle with complex tissue and view-dependent acoustic effects. Physics-based NVS aims to address these limitations by including the ultrasound image formation process into the simulation. Recent approaches combine a learnable implicit scene representation with an ultrasound-specific rendering module, yet a substantial gap between simulation and reality remains. In this work, we introduce UltraG-Ray, a novel ultrasound scene representation based on a learnable 3D Gaussian field, coupled to an efficient physics-based module for B-mode synthesis. We explicitly encode ultrasound-specific parameters, such as attenuation and reflection, into a Gaussian-based spatial representation and realize image synthesis within a novel ray casting scheme. In contrast to previous methods, this approach naturally captures view-dependent attenuation effects, thereby enabling the generation of physically informed B-mode images with increased realism. We compare our method to state-of-the-art and observe consistent gains in image quality metrics (up to 15% increase on MS-SSIM), demonstrating clear improvement in terms of realism of the synthesized ultrasound images.

[59] MMFace-DiT: A Dual-Stream Diffusion Transformer for High-Fidelity Multimodal Face Generation

Bharath Krishnamurthy,Ajita Rattani

Main category: cs.CV

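TL;DR: This paper proposes MMFace-DiT, a unified dual-stream diffusion transformer that deeply fuses text with spatial priors (such as masks or sketches) through a shared RoPE attention mechanism, significantly improving spatial-semantic consistency and visual quality in controllable face generation.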

Details Motivation: Existing multimodal face generation methods that extend pre-trained text-to-image models suffer from architectural constraints, duplicated parameters, and modality conflicts, which makes synergistic fusion across semantic and spatial domains difficult. Method: MMFace-DiT is a dual-stream diffusion transformer whose blocks process text and spatial tokens in parallel and fuse them deeply through a shared Rotary Position-Embedded (RoPE) Attention mechanism; a Modality Embedder lets the model adapt dynamically to different spatial conditions. Result: It achieves a 40% improvement in visual fidelity and prompt alignment over six state-of-the-art multimodal face generation models, and the code and dataset are released. Conclusion: MMFace-DiT offers a new paradigm for end-to-end controllable generative modeling and validates the effectiveness of a unified dual-stream architecture with shared attention for multimodal fusion. Abstract: Recent multimodal face generation models address the spatial control limitations of text-to-image diffusion models by augmenting text-based conditioning with spatial priors such as segmentation masks, sketches, or edge maps. This multimodal fusion enables controllable synthesis aligned with both high-level semantic intent and low-level structural layout. However, most existing approaches extend pre-trained text-to-image pipelines by appending auxiliary control modules or stitching together separate uni-modal networks. These ad hoc designs inherit architectural constraints, duplicate parameters, and often fail under conflicting modalities or mismatched latent spaces, limiting their ability to perform synergistic fusion across semantic and spatial domains. We introduce MMFace-DiT, a unified dual-stream diffusion transformer engineered for synergistic multimodal face synthesis. Its core novelty lies in a dual-stream transformer block that processes spatial (mask/sketch) and semantic (text) tokens in parallel, deeply fusing them through a shared Rotary Position-Embedded (RoPE) Attention mechanism. This design prevents modal dominance and ensures strong adherence to both text and structural priors to achieve unprecedented spatial-semantic consistency for controllable face generation. Furthermore, a novel Modality Embedder enables a single cohesive model to dynamically adapt to varying spatial conditions without retraining. MMFace-DiT achieves a 40% improvement in visual fidelity and prompt alignment over six state-of-the-art multimodal face generation models, establishing a flexible new paradigm for end-to-end controllable generative modeling. The code and dataset are available on our project page: https://vcbsl.github.io/MMFace-DiT/

[60] The Surprising Effectiveness of Noise Pretraining for Implicit Neural Representations

Kushal Vyas,Alper Kayabasi,Daniel Kim,Vishwanath Saragadam,Ashok Veeraraghavan,Guha Balakrishnan

Main category: cs.CV

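TL;DR: This paper studies parameter initialization strategies for implicit neural representations (INRs) and finds that pretraining on noise with the spectral characteristics of natural images (such as $1/|f^α|$) gives the best balance between signal fitting and inverse imaging tasks (such as denoising), outperforming unstructured noise on that balance and performing on par with the best data-driven initialization methods.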

Details Motivation: The approximation and convergence behavior of INRs is highly sensitive to parameter initialization; data-driven initialization methods perform better, but why they succeed (whether they encode classical statistical priors or more complex features) remains unclear. Method: INRs are pretrained on several noise classes (such as Gaussian, Dead Leaves, and Spectral noise) and evaluated on how well they fit unseen signals and how well they serve as deep image priors in an inverse imaging task (denoising). Result: Pretraining on unstructured noise (Uniform, Gaussian) dramatically improves signal fitting but performs poorly on denoising; noise with the $1/|f^α|$ spectral structure typical of natural images achieves the best balance of signal fitting and inverse imaging, on par with the best data-driven initialization methods. Conclusion: Pretraining on noise with natural-image spectral characteristics is an efficient, general INR initialization strategy that needs no large domain-specific dataset, which helps applications lacking sufficient prior data. Abstract: The approximation and convergence properties of implicit neural representations (INRs) are known to be highly sensitive to parameter initialization strategies. While several data-driven initialization methods demonstrate significant improvements over standard random sampling, the reasons for their success -- specifically, whether they encode classical statistical signal priors or more complex features -- remain poorly understood. In this study, we explore this phenomenon through a series of experimental analyses leveraging noise pretraining. We pretrain INRs on diverse noise classes (e.g., Gaussian, Dead Leaves, Spectral) and measure their ability to both fit unseen signals and encode priors for an inverse imaging task (denoising). Our analyses on image and video data reveal a surprising finding: simply pretraining on unstructured noise (Uniform, Gaussian) dramatically improves signal fitting capacity compared to all other baselines. However, unstructured noise also yields poor deep image priors for denoising. In contrast, we also find that noise with the classic $1/|f^α|$ spectral structure of natural images achieves an excellent balance of signal fitting and inverse imaging capabilities, performing on par with the best data-driven initialization methods. This finding enables more efficient INR training in applications lacking sufficient prior domain-specific data. For more details, visit the project page at https://kushalvyas.github.io/noisepretraining.html
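
For readers unfamiliar with spectral noise, the following sketch shows one standard way to synthesize noise whose amplitude spectrum falls off as $1/f^α$: assign each frequency bin an amplitude of $1/k^α$ with a random phase, enforce Hermitian symmetry so the result is real, and invert the transform. It is a 1D, pure-Python illustration under those assumptions; the paper works with images and video, and its exact noise generator is not described in the abstract. The function name `spectral_noise_1d` is hypothetical.

```python
import cmath
import math
import random


def spectral_noise_1d(n, alpha, seed=0):
    """Return a real 1D signal whose amplitude spectrum decays as 1 / f**alpha."""
    rng = random.Random(seed)
    spectrum = [0j] * n  # the DC bin stays zero, so the signal has zero mean
    for k in range(1, n // 2 + 1):
        amplitude = 1.0 / (k ** alpha)
        phase = rng.uniform(0.0, 2.0 * math.pi)
        if 2 * k == n:
            # The Nyquist bin of an even-length real signal must be real.
            spectrum[k] = complex(amplitude, 0.0)
        else:
            spectrum[k] = amplitude * cmath.exp(1j * phase)
            # Hermitian symmetry makes the inverse transform real-valued.
            spectrum[n - k] = spectrum[k].conjugate()
    # Naive inverse DFT; acceptable for a small illustrative n.
    signal = []
    for t in range(n):
        value = sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                    for k in range(n))
        signal.append(value.real / n)
    return signal


noise = spectral_noise_1d(32, alpha=1.0)
```

Setting `alpha` to 0 gives a flat spectrum (white, unstructured noise), while larger values concentrate energy at low frequencies, which is the regime the abstract associates with natural images.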

[61] Generating Humanless Environment Walkthroughs from Egocentric Walking Tour Videos

Yujin Ham,Junho Kim,Vivek Boominathan,Guha Balakrishnan

Main category: cs.CV

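TL;DR: This paper proposes a generative algorithm that realistically removes humans and their shadows from egocentric walking tour videos. By building a semi-synthetic video dataset and fine-tuning the Casper video diffusion model, it substantially improves human removal and is applied successfully to 3D/4D modeling of urban environments.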

Details Motivation: The many humans and shadows in egocentric walking videos interfere with environment modeling, so they must be removed effectively to make the videos usable for that purpose. Method: A semi-synthetic dataset is built from pairs of background-only clips and composite clips with walking humans and simulated shadows overlaid, and the Casper video diffusion model is fine-tuned on it for high-quality video inpainting of humans and shadows. Result: The method performs far better than the original Casper model both qualitatively and quantitatively on clips with significant human presence and complex backgrounds; the generated humanless clips support 3D/4D modeling of urban locations. Conclusion: A video diffusion model fine-tuned on semi-synthetic data can remove humans and their shadows from walking videos efficiently and realistically, providing high-quality visual data for environment modeling. Abstract: Egocentric "walking tour" videos provide a rich source of image data to develop rich and diverse visual models of environments around the world. However, the significant presence of humans in frames of these videos due to crowds and eye-level camera perspectives limits their usefulness in environment modeling applications. We focus on addressing this challenge by developing a generative algorithm that can realistically remove (i.e., inpaint) humans and their associated shadow effects from walking tour videos. Key to our approach is the construction of a rich semi-synthetic dataset of video clip pairs to train this generative model. Each pair in the dataset consists of an environment-only background clip, and a composite clip of walking humans with simulated shadows overlaid on the background. We randomly sourced both foreground and background components from real egocentric walking tour videos around the world to maintain visual diversity. We then used this dataset to fine-tune the state-of-the-art Casper video diffusion model for object and effects inpainting, and demonstrate that the resulting model performs far better than Casper both qualitatively and quantitatively at removing humans from walking tour clips with significant human presence and complex backgrounds. Finally, we show that the resulting generated clips can be used to build successful 3D/4D models of urban locations.

[62] Let the Abyss Stare Back: Adaptive Falsification for Autonomous Scientific Discovery

Peiran Li,Fangzhou Lin,Shuo Xing,Jiashuo Sun,Dylan Zhang,Siyuan Yang,Chaoqun Ni,Zhengzhong Tu

Main category: cs.CV

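TL;DR: This paper proposes DASES, a framework in which an Innovator, an Abyss Falsifier, and a Mechanistic Causal Extractor co-evolve to drive autonomous scientific discovery through active falsification, avoiding the spurious successes of static validation; on a loss-discovery task it yields a loss that outperforms the CE and CE+L2 baselines.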

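Details Motivation: Autonomous scientific discovery faces a risk: once the evaluator is fixed, a strong search process may learn only to pass the exam without understanding the mechanism behind the task, so evaluation needs to shift from passive validation to active, adaptive falsification. Method: DASES comprises three co-evolving components: an Innovator (generates scientific artifacts), an Abyss Falsifier (constructs admissible counterexample environments for falsification), and a Mechanistic Causal Extractor (extracts mechanistic causal structure), which evolve together under a fixed scientific contract. Result: In a controlled loss-discovery task, DASES rejects flawed artifacts that static validation would have accepted, identifies the first candidate that survives the admissible falsification frontier, and discovers a new loss function, FNG-CE, which consistently outperforms CE and CE+L2 on standard benchmarks including ImageNet and transfers beyond the synthetic discovery environment. Conclusion: Evaluation driven by active falsification (letting the abyss stare back) is more reliable than static validation and improves the validity and generalization of autonomous scientific discovery; DASES offers a new path toward trustworthy AI-driven discovery. Abstract: Autonomous scientific discovery is entering a more dangerous regime: once the evaluator is frozen, a sufficiently strong search process can learn to win the exam without learning the mechanism the task was meant to reveal. This is the idea behind our title. To let the abyss stare back is to make evaluation actively push against the candidate through adaptive falsification, rather than passively certify it through static validation. We introduce DASES, a falsification-driven framework in which an Innovator, an Abyss Falsifier, and a Mechanistic Causal Extractor co-evolve executable scientific artifacts and scientifically admissible counterexample environments under a fixed scientific contract. In a controlled loss-discovery problem with a single editable locus, DASES rejects artifacts that static validation would have accepted, identifies the first candidate that survives the admissible falsification frontier, and discovers FNG-CE, a loss that transfers beyond the synthetic discovery environment and consistently outperforms CE and CE+L2 under controlled comparisons across standard benchmarks, including ImageNet.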

[63] LA-Sign: Looped Transformers with Geometry-aware Alignment for Skeleton-based Sign Language Recognition

Muxin Pu,Mei Kuan Lim,Chun Yong Chong,Chen Change Loy

Main category: cs.CV

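TL;DR: This paper proposes LA-Sign, a skeleton-based isolated sign language recognition framework built on looped transformers with geometry-aware alignment. Parameter-shared recurrent refinement and contrastive learning in hyperbolic space improve multi-scale motion understanding, and the method achieves state-of-the-art results on WLASL and MSASL.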

Details Motivation: Existing skeleton-based sign language recognition methods rely on deep feed-forward networks that lack recurrent refinement and structured representation, making it hard to model articulated motion across scales, from subtle finger movements to whole-body dynamics. Method: LA-Sign 1) uses a looped transformer architecture (rather than stacking deeper layers) that repeatedly revisits latent states for recurrent refinement under shared parameters; 2) introduces a geometry-aware contrastive loss that maps skeletal and textual features into an adaptive hyperbolic space (Poincaré model) to encourage multi-scale semantic organisation; 3) systematically evaluates three looping designs combined with multiple geometric manifolds. Result: It achieves state-of-the-art results on the WLASL and MSASL benchmarks with fewer unique layers, confirming the effectiveness of recurrent latent refinement and geometry-aware representation learning. Conclusion: Depth from recurrence is preferable to depth from stacking, and combining it with adaptive hyperbolic alignment better models the multi-scale spatiotemporal structure of sign language, offering a new paradigm for ISLR. Abstract: Skeleton-based isolated sign language recognition (ISLR) demands fine-grained understanding of articulated motion across multiple spatial scales, from subtle finger movements to global body dynamics. Existing approaches typically rely on deep feed-forward architectures, which increase model capacity but lack mechanisms for recurrent refinement and structured representation. We propose LA-Sign, a looped transformer framework with geometry-aware alignment for ISLR. Instead of stacking deeper layers, LA-Sign derives its depth from recurrence, repeatedly revisiting latent representations to progressively refine motion understanding under shared parameters. To further regularise this refinement process, we present a geometry-aware contrastive objective that projects skeletal and textual features into an adaptive hyperbolic space, encouraging multi-scale semantic organisation. We study three looping designs and multiple geometric manifolds, demonstrating that encoder-decoder looping combined with adaptive Poincare alignment yields the strongest performance. Extensive experiments on WLASL and MSASL benchmarks show that LA-Sign achieves state-of-the-art results while using fewer unique layers, highlighting the effectiveness of recurrent latent refinement and geometry-aware representation learning for sign language recognition.
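
Since the alignment objective operates in a Poincaré-style hyperbolic space, it may help to see how distance is measured there. The sketch below implements the standard geodesic distance on the unit Poincaré ball (curvature -1), $d(u, v) = \operatorname{arcosh}\left(1 + \frac{2\lVert u - v\rVert^2}{(1 - \lVert u\rVert^2)(1 - \lVert v\rVert^2)}\right)$. It assumes fixed unit curvature, so it does not capture whatever makes the paper's space "adaptive", and the function name is hypothetical.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincare ball."""
    def squared_norm(w):
        return sum(c * c for c in w)

    diff = [a - b for a, b in zip(u, v)]
    # The denominator shrinks towards 0 as either point approaches the boundary,
    # which is why distances grow rapidly near the edge of the ball.
    denom = (1.0 - squared_norm(u)) * (1.0 - squared_norm(v))
    return math.acosh(1.0 + 2.0 * squared_norm(diff) / denom)


origin = [0.0, 0.0]
near_center = [0.5, 0.0]
near_edge = [0.9, 0.0]
d_center = poincare_distance(origin, near_center)
d_edge = poincare_distance(origin, near_edge)
```

Points near the boundary are far from everything else, which gives hyperbolic embeddings room to spread fine-grained items outward while keeping coarse concepts near the center; this is the usual motivation for using them to organise multi-scale structure.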

[64] Is the Modality Gap a Bug or a Feature? A Robustness Perspective

Rhea Chowers,Oshri Naparstek,Udi Barzelay,Yair Weiss

Main category: cs.CV

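TL;DR: This paper shows that image and text embeddings in multi-modal models are separated by a global "modality gap" and that, under certain conditions, the gap is monotonically related to robustness; a simple post-processing step (such as shifting one modality towards the mean of the other) significantly improves robustness without reducing clean accuracy.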

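Details Motivation: Multi-modal models such as CLIP aim for aligned modalities, yet image and text embeddings are clearly separated in the shared space; why this "modality gap" exists, and whether closing it improves downstream performance, remains unclear. Method: The paper analyzes theoretically the representations that minimize the contrastive loss, deriving that the modality gap is a global vector orthogonal to the embeddings; it verifies empirically the monotonic relation between the gap and robustness and uses a shift of modality means as a simple post-processing strategy. Result: Theory shows that, under certain conditions, the gap is induced by the contrastive loss and is orthogonal to the embeddings; experiments show that shifting one modality in post-processing significantly increases robustness to embedding perturbations for many real-world VLMs while keeping clean accuracy unchanged. Conclusion: The modality gap is an inherent consequence of contrastive learning rather than simply a defect; its size can be controlled and is closely tied to robustness, so robustness can be improved by post-processing alone, without retraining. Abstract: Many modern multi-modal models (e.g. CLIP) seek an embedding space in which the two modalities are aligned. Somewhat surprisingly, almost all existing models show a strong modality gap: the distribution of images is well-separated from the distribution of texts in the shared embedding space. Despite a series of recent papers on this topic, it is still not clear why this gap exists nor whether closing the gap in post-processing will lead to better performance on downstream tasks. In this paper we show that under certain conditions, minimizing the contrastive loss yields a representation in which the two modalities are separated by a global gap vector that is orthogonal to their embeddings. We also show that under these conditions the modality gap is monotonically related to robustness: decreasing the gap does not change the clean accuracy of the models but makes it less likely that a model will change its output when the embeddings are perturbed. Our experiments show that for many real-world VLMs we can significantly increase robustness by a simple post-processing step that moves one modality towards the mean of the other modality, without any loss of clean accuracy.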
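
The post-processing step described above can be sketched in a few lines: compute the mean embedding of each modality, take their difference as the gap vector, and translate every embedding of one modality along it. The code below is a minimal sketch under that reading of the abstract; the function names and the `strength` parameter are assumptions, and details such as whether embeddings are re-normalized afterwards are not given in the abstract and are left out.

```python
def mean_vector(rows):
    """Component-wise mean of a list of equal-length embedding vectors."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def shift_towards(source, target, strength=1.0):
    """Translate every source embedding by strength * (mean(target) - mean(source)).

    strength=1.0 makes the two modality means coincide; smaller values
    close only part of the gap. Both names are illustrative assumptions.
    """
    mu_source = mean_vector(source)
    mu_target = mean_vector(target)
    gap = [t - s for s, t in zip(mu_source, mu_target)]
    return [[x + strength * g for x, g in zip(row, gap)] for row in source]


# Toy 2D embeddings: text clustered along one axis, images along the other.
text_emb = [[1.0, 0.0], [3.0, 0.0]]
image_emb = [[0.0, 2.0], [0.0, 4.0]]
shifted_text = shift_towards(text_emb, image_emb)
```

Because every vector of one modality moves by the same offset, pairwise differences within that modality are unchanged; only the cross-modal offset changes.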

[65] WorldFlow3D: Flowing Through 3D Distributions for Unbounded World Generation

Amogh Joshi,Julian Ost,Felix Heide

Main category: cs.CV

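TL;DR: This paper proposes WorldFlow3D, a flow-matching-based method for unbounded 3D world generation that needs no latent representation, efficiently generates 3D scenes with accurate structure and high-quality texture, and supports controllable generation of geometric layout and texture attributes.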

Details Motivation: Unbounded 3D world generation is a foundational task for scene modeling in computer vision, graphics, and robotics, and existing methods are limited in structural accuracy, generation efficiency, and controllability. Method: WorldFlow3D uses flow matching to model 3D generation as flowing between 3D data distributions rather than as conditional denoising alone; it adopts a latent-free flow approach and combines vectorized scene layout conditions with scene attributes for controllable geometry and texture. Result: The method is validated on real outdoor driving scenes and synthetic indoor scenes, showing cross-domain generalization and high-quality generation on real data distributions, and it compares favorably with existing unbounded scene generation methods in all tested settings. Conclusion: WorldFlow3D offers a more general, efficient, and controllable paradigm for 3D world generation, moving unbounded 3D generation towards greater realism and practicality. Abstract: Unbounded 3D world generation is emerging as a foundational task for scene modeling in computer vision, graphics, and robotics. In this work, we present WorldFlow3D, a novel method capable of generating unbounded 3D worlds. Building upon a foundational property of flow matching - namely, defining a path of transport between two data distributions - we model 3D generation more generally as a problem of flowing through 3D data distributions, not limited to conditional denoising. We find that our latent-free flow approach generates causal and accurate 3D structure, and can use this as an intermediate distribution to guide the generation of more complex structure and high-quality texture - all while converging more rapidly than existing methods. We enable controllability over generated scenes with vectorized scene layout conditions for geometric structure control and visual texture control through scene attributes. We confirm the effectiveness of WorldFlow3D on both real outdoor driving scenes and synthetic indoor scenes, validating cross-domain generalizability and high-quality generation on real data distributions. We confirm favorable scene generation fidelity over approaches in all tested settings for unbounded scene generation. For more, see https://light.princeton.edu/worldflow3d.

[66] TrajectoryMover: Generative Movement of Object Trajectories in Videos

Kiran Chhatre,Hyeonho Jeong,Yulia Gryaditskaya,Christopher E. Peters,Chun-Hao Paul Huang,Paul Guerrero

Main category: cs.CV

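TL;DR: This paper proposes the TrajectoryAtlas data generation pipeline and the TrajectoryMover video generator to address the problem of moving an object's 3D motion trajectory in generative video editing, enabling generative trajectory movement through large-scale synthetic paired video data.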

Details Motivation: Existing generative video editing methods cannot move an object's 3D motion trajectory while preserving its relative 3D motion; the main obstacle is the lack of suitable paired video training data. Method: The authors propose TrajectoryAtlas, a data generation pipeline that builds large-scale synthetic paired video data, and fine-tune the video generation model TrajectoryMover on this data. Result: The approach achieves generative movement of an object's 3D motion trajectory and is shown to preserve video plausibility and identity. Conclusion: TrajectoryAtlas and TrajectoryMover together address the unmet need for object trajectory movement in generative video editing and give non-expert users a more intuitive editing capability. Abstract: Generative video editing has enabled several intuitive editing operations for short video clips that would previously have been difficult to achieve, especially for non-expert editors. Existing methods focus on prescribing an object's 3D or 2D motion trajectory in a video, or on altering the appearance of an object or a scene, while preserving both the video's plausibility and identity. Yet a method to move an object's 3D motion trajectory in a video, i.e., moving an object while preserving its relative 3D motion, is currently still missing. The main challenge lies in obtaining paired video data for this scenario. Previous methods typically rely on clever data generation approaches to construct plausible paired data from unpaired videos, but this approach fails if one of the videos in a pair cannot easily be constructed from the other. Instead, we introduce TrajectoryAtlas, a new data generation pipeline for large-scale synthetic paired video data and a video generator TrajectoryMover fine-tuned with this data. We show that this successfully enables generative movement of object trajectories. Project page: https://chhatrekiran.github.io/trajectorymover

[67] Enhancing Box and Block Test with Computer Vision for Post-Stroke Upper Extremity Motor Evaluation

David Robinson,Animesh Gupta,Elizabeth Clark,Olga Melnik,Qiushi Fu,Mubarak Shah

Main category: cs.CV

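TL;DR: This paper proposes a computer vision framework that needs no depth sensors or calibration objects and extracts world-aligned joint angles from monocular video to quantify upper-extremity movement quality after stroke; on the Box and Block Test it separates healthy individuals from stroke patients, and even separates patients with the same score but different movement patterns.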

Details Motivation: Existing clinical assessments of upper-extremity function after stroke (ordinal scoring or timed tasks) either lack sensitivity or fail to reflect movement quality. Method: From monocular video, computer vision is used to estimate world-aligned joint angles of the fingers, arm, and trunk; joint-angle features are extracted from 136 BBT recordings (48 healthy individuals and 7 stroke patients), and movement patterns are analyzed with unsupervised dimensionality reduction. Result: The reduced embeddings separate healthy and stroke movement patterns; some stroke patients with identical BBT scores show distinguishable postural patterns; joint-angle features provide movement-quality information beyond the standard time-based score. Conclusion: This calibration-free framework, which needs only phone or camera video, can quantify upper-extremity movement quality objectively and sensitively without changing the existing clinical routine, and shows potential for clinical translation. Abstract: Standard clinical assessments of upper-extremity motor function after stroke either rely on ordinal scoring, which lacks sensitivity, or time-based task metrics, which do not capture movement quality. In this work, we present a computer vision-based framework for analysis of upper-extremity movement during the Box and Block Test (BBT) through world-aligned joint angles of fingers, arm, and trunk without depth sensors or calibration objects. We apply this framework to a dataset of 136 BBT recordings collected from 48 healthy individuals and 7 individuals post stroke. Using unsupervised dimensionality reduction of joint-angle features, we analyze movement patterns without relying on expert clinical labels. The resulting embeddings show separation between healthy movement patterns and stroke-related movement deviations. Importantly, some patients with the same BBT scores can be separated with different postural patterns. These results show that world-aligned joint angles can capture meaningful information of upper-extremity functions beyond standard time-based BBT scores, with no effort from the clinician other than monocular video recordings of the patient using a phone or camera. This work highlights the potential of a camera-based, calibration-free framework to measure movement quality in clinical assessments without changing the widely adopted clinical routine.

[68] Dual-Imbalance Continual Learning for Real-World Food Recognition

Xiaoyan Zhang,Jiangpeng He

Main category: cs.CV

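TL;DR: This paper proposes DIME, a framework for the dual-imbalance problem in continual food recognition (imbalanced samples within each class and imbalanced numbers of new classes across incremental steps); using lightweight adapter learning and a class-count guided spectral merging strategy, it clearly outperforms existing methods on realistic long-tailed food datasets.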

Details Motivation: Real-world food recognition has a severely long-tailed data distribution, and in continual learning the number of new classes introduced at each incremental step varies widely, producing a dual imbalance; existing methods usually assume the same number of new classes per step and cannot handle this setting. Method: DIME learns a lightweight adapter per task with parameter-efficient fine-tuning, fuses adapters with a class-count guided spectral merging strategy, and adds a rank-wise threshold modulation mechanism that stabilizes merging by preserving dominant knowledge while allowing adaptive updates; inference needs only a single merged adapter. Result: On long-tailed food benchmarks under a realistic step-imbalanced setup, DIME consistently exceeds the strongest existing continual learning baselines by more than 3%. Conclusion: DIME mitigates the dual-imbalance challenge in continual food recognition while balancing performance, stability, and deployment efficiency, and suggests a direction for continual visual recognition in real-world settings. Abstract: Visual food recognition in real-world dietary logging scenarios naturally exhibits severe data imbalance, where a small number of food categories appear frequently while many others occur rarely, resulting in long-tailed class distributions. In practice, food recognition systems often operate in a continual learning setting, where new categories are introduced sequentially over time. However, existing studies typically assume that each incremental step introduces a similar number of new food classes, which rarely happens in the real world, where the number of newly observed categories can vary significantly across steps, leading to highly uneven learning dynamics. As a result, continual food recognition exhibits a dual imbalance: imbalanced samples within each food class and imbalanced numbers of new food classes to learn at each incremental learning step. In this work, we introduce DIME, a Dual-Imbalance-aware Adapter Merging framework for continual food recognition. DIME learns lightweight adapters for each task using parameter-efficient fine-tuning and progressively integrates them through a class-count guided spectral merging strategy. A rank-wise threshold modulation mechanism further stabilizes the merging process by preserving dominant knowledge while allowing adaptive updates. The resulting model maintains a single merged adapter for inference, enabling efficient deployment without accumulating task-specific modules. Experiments on realistic long-tailed food benchmarks under our step-imbalanced setup show that the proposed method consistently improves by more than 3% over the strongest existing continual learning baselines. Code is available at https://github.com/xiaoyanzhang1/DIME.

[69] SparseDriveV2: Scoring is All You Need for End-to-End Autonomous Driving

Wenchao Sun,Xuewu Lin,Keyu Chen,Zixiang Pei,Xiang Li,Yining Shi,Sifa Zheng

Main category: cs.CV

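TL;DR: Through a systematic scaling study, this paper finds that increasing the density of a static trajectory vocabulary keeps improving scoring-based planning, and proposes SparseDriveV2, which represents a sparse but combinatorially complete trajectory vocabulary with a factorized structure and combines coarse factorized scoring with fine-grained scoring of composed trajectories, reaching strong performance on several autonomous driving benchmarks.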

Details Motivation: The paper asks whether a sufficiently dense static trajectory vocabulary can match dynamically generated proposals, rather than dynamic generation being strictly necessary; prior work has not clarified whether the performance bottleneck comes from coarse discretization or from limited modeling capacity. Method: (1) A factorized trajectory representation decouples trajectories into geometric paths and velocity profiles, giving a sparse vocabulary with combinatorial coverage and low redundancy; (2) a two-level scoring strategy first applies coarse factorized scoring to paths and velocity profiles separately, then fine-grained scoring to a small set of composed trajectories. Result: 92.0 PDMS and 90.1 EPDMS on NAVSIM; 89.15 Driving Score and 70.00 Success Rate on Bench2Drive with a lightweight ResNet-34 backbone. Conclusion: With suitable structure and scoring, static-vocabulary methods can match or exceed the current best results without dynamic generation, which challenges the consensus that dynamic proposals are necessary. Abstract: End-to-end multi-modal planning has been widely adopted to model the uncertainty of driving behavior, typically by scoring candidate trajectories and selecting the optimal one. Existing approaches generally fall into two categories: scoring a large static trajectory vocabulary, or scoring a small set of dynamically generated proposals. While static vocabularies often suffer from coarse discretization of the action space, dynamic proposals provide finer-grained precision and have shown stronger empirical performance on existing benchmarks. However, it remains unclear whether dynamic generation is fundamentally necessary, or whether static vocabularies can already achieve comparable performance when they are sufficiently dense to cover the action space. In this work, we start with a systematic scaling study of Hydra-MDP, a representative scoring-based method, revealing that performance consistently improves as trajectory anchors become denser, without exhibiting saturation before computational constraints are reached. Motivated by this observation, we propose SparseDriveV2 to push the performance boundary of scoring-based planning through two complementary innovations: (1) a scalable vocabulary representation with a factorized structure that decomposes trajectories into geometric paths and velocity profiles, enabling combinatorial coverage of the action space, and (2) a scalable scoring strategy with coarse factorized scoring over paths and velocity profiles followed by fine-grained scoring on a small set of composed trajectories. By combining these two techniques, SparseDriveV2 achieves 92.0 PDMS and 90.1 EPDMS on NAVSIM, with 89.15 Driving Score and 70.00 Success Rate on Bench2Drive with a lightweight ResNet-34 as backbone. Code and model are released at https://github.com/swc-17/SparseDriveV2.
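The factorized vocabulary and two-stage scoring described in the SparseDriveV2 abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration only: the toy factors (a lateral offset standing in for a path, a target speed standing in for a velocity profile), the scoring functions, and the names `top_k` and `plan` are hypothetical and are not taken from the released code.

```python
from itertools import product

def top_k(items, score_fn, k):
    """Return the k highest-scoring items (coarse, factor-level pruning)."""
    return sorted(items, key=score_fn, reverse=True)[:k]

def plan(paths, speeds, score_path, score_speed, score_traj, k_path=2, k_speed=2):
    # Coarse stage: score each factor independently of the other.
    best_paths = top_k(paths, score_path, k_path)
    best_speeds = top_k(speeds, score_speed, k_speed)
    # Fine stage: score only the composed (path, speed) candidates that survive.
    candidates = list(product(best_paths, best_speeds))
    best = max(candidates, key=lambda c: score_traj(*c))
    return best, len(candidates)

# Hypothetical toy factors: lateral offsets in metres, target speeds in m/s.
paths = [-2.0, -1.0, 0.0, 1.0, 2.0]
speeds = [0.0, 5.0, 10.0, 15.0]
# The implicit vocabulary is the Cartesian product of the two factor sets.
full_vocabulary_size = len(paths) * len(speeds)

best, n_scored = plan(
    paths,
    speeds,
    score_path=lambda p: -abs(p - 0.4),   # hypothetical: prefer offsets near 0.4 m
    score_speed=lambda v: -abs(v - 9.0),  # hypothetical: prefer speeds near 9 m/s
    score_traj=lambda p, v: -abs(p - 0.4) - abs(v - 9.0) / 10.0,
)
print(best, n_scored, full_vocabulary_size)
```

In this toy setting the factors stay small (5 paths and 4 speed profiles) while the composed vocabulary grows as their product, and the fine-grained scorer only sees the few combinations kept by the coarse stage.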

[70] LatentPilot: Scene-Aware Vision-and-Language Navigation by Dreaming Ahead with Latent Visual Reasoning

Haihong Hao,Lei Chen,Mingfei Han,Changlin Li,Dong An,Yuqiang Yang,Zhihui Li,Xiaojun Chang

Main category: cs.CV

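TL;DR: This paper proposes LatentPilot, a new paradigm that uses future visual observations during training to learn the causal relationship between actions and visual dynamics while needing no future frames at inference; with flywheel-style training and visual latent tokens learned without explicit supervision, it enables forward-looking navigation reasoning and reaches SOTA on several VLN benchmarks and in real-robot experiments.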

Details Motivation: Existing VLN models ignore the causal effect of actions on future visual states and cannot model environment-action dynamics, whereas humans improve navigation decisions by imagining the near future; a model that implicitly learns action-conditioned visual dynamics is therefore needed. Method: LatentPilot combines (1) a flywheel-style training mechanism that iteratively collects on-policy trajectories and retrains the model, with expert takeover to prevent accumulated deviation, and (2) visual tokens learned without explicit supervision in a continuous latent space, which are carried across time steps as both the current output and the next input to support forward-looking reasoning. Result: New SOTA results on the R2R-CE, RxR-CE, and R2R-PE benchmarks; real-robot experiments in diverse environments show a markedly better understanding of environment-action dynamics than existing methods. Conclusion: Modeling future visual dynamics (used only in training) together with transferable latent tokens gives VLN models a human-like ability to rehearse before acting, improves generalization and robustness, and offers a new direction for causal reasoning in embodied intelligence. Abstract: Existing vision-and-language navigation (VLN) models primarily reason over past and current visual observations, while largely ignoring the future visual dynamics induced by actions. As a result, they often lack an effective understanding of the causal relationship between actions and how the visual world changes, limiting robust decision-making. Humans, in contrast, can imagine the near future by leveraging action-dynamics causality, which improves both environmental understanding and navigation choices. Inspired by this capability, we propose LatentPilot, a new paradigm that exploits future observations during training as a valuable data source to learn action-conditioned visual dynamics, while requiring no access to future frames at inference. Concretely, we propose a flywheel-style training mechanism that iteratively collects on-policy trajectories and retrains the model to better match the agent's behavior distribution, with an expert takeover triggered when the agent deviates excessively. LatentPilot further learns visual latent tokens without explicit supervision; these latent tokens attend globally in a continuous latent space and are carried across steps, serving as both the current output and the next input, thereby enabling the agent to dream ahead and reason about how actions will affect subsequent observations. Experiments on R2R-CE, RxR-CE, and R2R-PE benchmarks achieve new SOTA results, and real-robot tests across diverse environments demonstrate LatentPilot's superior understanding of environment-action dynamics in the scene. Project page: https://abdd.top/latentpilot/

[71] CT-to-X-ray Distillation Under Tiny Paired Cohorts: An Evidence-Bounded Reproducible Pilot Study

Bo Ma,Jinsong Wu,Weiqi Yan,Hongjiang Wei

Main category: cs.CV

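TL;DR: This paper examines whether CT images can serve as a training-only supervision signal to improve disease classification from X-ray images without using CT at inference; it frames the task as cross-modality teacher-student distillation and uses several experimental designs to expose how hard the task is and what reliable evaluation requires.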

Details Motivation: X-ray and CT provide complementary information about thoracic disease, yet most existing CAD models are confined to a single modality; the paper focuses on a more practical question: can paired CT images, used only as training supervision, yield a binary classifier that needs only X-ray at inference? Method: CT-to-X-ray transfer is framed as cross-modality knowledge distillation with JDCNet as an executable baseline scaffold; the study compares plain cross-modal logit distillation, module-augmented JDCNet, late fusion, same-modality distillation, attention transfer, and feature hints, and applies strict controls including patient-level Monte Carlo resampling and imbalance-sensitive analyses. Result: On the original split, plain cross-modal logit-KD performs best (0.875 accuracy, 0.714 macro-F1); under eight resamples its balanced accuracy drops to 0.500, while late fusion leads on accuracy and same-modality distillation leads on macro-F1 and balanced accuracy; neither attention transfer nor feature hints recovers a stable cross-modality advantage. Conclusion: The study does not propose a validated superior CT-to-X-ray architecture; it establishes a reproducible, evidence-bounded pilot protocol that makes explicit the task definition, failure modes, ranking instability, and the minimum validation requirements for future credible cross-modality transfer claims. Abstract: Chest X-ray and computed tomography (CT) provide complementary views of thoracic disease, yet most computer-aided diagnosis models are trained and deployed within a single imaging modality. The concrete question studied here is narrower and deployment-oriented: on a patient-level paired chest cohort, can CT act as training-only supervision for a binary disease versus non-disease X-ray classifier without requiring CT at inference time? We study this setting as a cross-modality teacher-student distillation problem and use JDCNet as an executable pilot scaffold rather than as a validated superior architecture. On the original patient-level paired split from a public paired chest imaging cohort, a stripped-down plain cross-modal logit-KD control attains the highest mean result on the four-image validation subset (0.875 accuracy and 0.714 macro-F1), whereas the full module-augmented JDCNet variant remains at 0.750 accuracy and 0.429 macro-F1. To test whether that ranking is a split artifact, we additionally run eight patient-level Monte Carlo resamples with same-case comparisons, stronger mechanism controls based on attention transfer and feature hints, and imbalance-sensitive analyses. Under this resampled protocol, late fusion attains the highest mean accuracy (0.885), same-modality distillation attains the highest mean macro-F1 (0.554) and balanced accuracy (0.660), the plain cross-modal control drops to 0.500 mean balanced accuracy, and neither attention transfer nor feature hints recover a robust cross-modality advantage. The contribution of this study is therefore not a validated CT-to-X-ray architecture, but a reproducible and evidence-bounded pilot protocol that makes the exact task definition, failure modes, ranking instability, and the minimum requirements for future credible CT-to-X-ray transfer claims explicit.
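The abstract above reports accuracy, macro-F1, and balanced accuracy side by side, and these can diverge sharply on imbalanced data. As a hedged illustration of the standard definitions, the sketch below computes all three for a binary classifier. The labels are hypothetical toy data chosen to show the effect (nine negatives, one positive, a classifier that always predicts negative); they are not the study's cohort or its results.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, balanced accuracy, and macro-F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)

    # Balanced accuracy is the mean of per-class recall.
    recall_pos = tp / (tp + fn) if (tp + fn) else 0.0
    recall_neg = tn / (tn + fp) if (tn + fp) else 0.0
    balanced_accuracy = (recall_pos + recall_neg) / 2

    def f1(hits, false_alarms, misses):
        denom = 2 * hits + false_alarms + misses
        return 2 * hits / denom if denom else 0.0

    # Macro-F1 averages the F1 of each class, treating each as "positive" in turn.
    macro_f1 = (f1(tp, fp, fn) + f1(tn, fn, fp)) / 2
    return accuracy, balanced_accuracy, macro_f1

# Hypothetical toy labels: a majority-class predictor on a 9:1 split.
y_true = [0] * 9 + [1]
y_pred = [0] * 10
accuracy, balanced_accuracy, macro_f1 = binary_metrics(y_true, y_pred)
print(accuracy, balanced_accuracy, round(macro_f1, 3))
```

A classifier that ignores the minority class scores high accuracy here but exactly 0.5 balanced accuracy, which is why a balanced accuracy of 0.500 is usually read as chance-level behavior on a binary task.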

[72] Segmentation of Gray Matters and White Matters from Brain MRI data

Chang Sun,Rui Shi,Tsukasa Koike,Tetsuro Sekine,Akio Morita,Tetsuya Sakai

Main category: cs.CV

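TL;DR: This paper proposes a modified MedSAM model for multi-class brain tissue segmentation (gray matter and white matter); by changing the decoder to three classes, freezing the image encoder, and fine-tuning the prompt encoder and decoder, it reaches Dice scores up to 0.8751 on the IXI dataset.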

Details Motivation: Traditional methods such as FSL FAST need task-specific adjustments and are not robust to diverse imaging conditions; existing foundation models such as MedSAM mainly target binary segmentation and are not adapted to multi-class brain tissue segmentation. Method: Starting from MedSAM, the mask decoder is extended to three classes (background, gray matter, white matter); the pre-trained image encoder is frozen while the prompt encoder and decoder are fine-tuned; preprocessing consists of skull stripping with FSL BET, tissue probability maps from FSL FAST, and slicing into 2D axial, sagittal, and coronal views with multi-class labels. Result: Dice scores up to 0.8751 on the IXI dataset, showing that MedSAM can be adapted to multi-class medical image segmentation with only small architectural changes. Conclusion: Foundation models such as MedSAM can support multi-class brain tissue segmentation through lightweight modifications, which offers a feasible path toward broader medical imaging scenarios. Abstract: Accurate segmentation of brain tissues such as gray matter and white matter from magnetic resonance imaging is essential for studying brain anatomy, diagnosing neurological disorders, and monitoring disease progression. Traditional methods, such as FSL FAST, produce tissue probability maps but often require task-specific adjustments and face challenges with diverse imaging conditions. Recent foundation models, such as MedSAM, offer a prompt-based approach that leverages large-scale pretraining. In this paper, we propose a modified MedSAM model designed for multi-class brain tissue segmentation. Our preprocessing pipeline includes skull stripping with FSL BET, tissue probability mapping with FSL FAST, and converting these into 2D axial, sagittal, coronal slices with multi-class labels (background, gray matter, and white matter). We extend MedSAM's mask decoder to three classes, freezing the pre-trained image encoder and fine-tuning the prompt encoder and decoder. Experiments on the IXI dataset achieve Dice scores up to 0.8751. This work demonstrates that foundation models like MedSAM can be adapted for multi-class medical image segmentation with minimal architectural modifications. Our findings suggest that such models can be extended to more diverse medical imaging scenarios in future work.
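For readers unfamiliar with the Dice score quoted above, here is a minimal sketch of a per-class Dice computation over a three-class label map (0 = background, 1 = gray matter, 2 = white matter). The flattened label lists are hypothetical toy data, and the abstract does not state how the paper aggregates Dice across classes or slices, so no averaging convention is implied here.

```python
def dice_per_class(pred, target, num_classes):
    """Dice = 2 * |P ∩ T| / (|P| + |T|), computed separately for each class label."""
    scores = {}
    for c in range(num_classes):
        pred_count = sum(1 for x in pred if x == c)
        target_count = sum(1 for x in target if x == c)
        overlap = sum(1 for x, y in zip(pred, target) if x == c and y == c)
        total = pred_count + target_count
        # Convention assumed here: a class absent from both maps scores 1.0.
        scores[c] = 2 * overlap / total if total else 1.0
    return scores

# Hypothetical flattened label maps for one tiny slice.
target = [0, 0, 1, 1, 1, 2, 2, 2]
pred = [0, 0, 1, 1, 2, 2, 2, 2]
scores = dice_per_class(pred, target, num_classes=3)
print(scores)
```

One gray-matter voxel mislabeled as white matter lowers the Dice of both tissue classes, while the background score is unaffected.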

[73] Hierarchical Visual Relocalization with Nearest View Synthesis from Feature Gaussian Splatting

Huaqi Tao,Bingxi Liu,Guangcheng Chen,Fulin Tang,Li He,Hong Zhang

Main category: cs.CV

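TL;DR: SplatHLoc is a new hierarchical visual relocalization framework built on Feature Gaussian Splatting; with adaptive viewpoint retrieval and a hybrid feature matching strategy it improves relocalization accuracy and robustness and reaches SOTA on indoor and outdoor datasets.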

Details Motivation: Existing point-based hierarchical relocalization methods are limited by sparse image observations and weak feature matching. Method: SplatHLoc uses Feature Gaussian Splatting as the scene representation; an adaptive viewpoint retrieval method synthesizes virtual candidate images whose viewpoints better match the query; a hybrid feature matching strategy uses Gaussian-rendered features for coarse matching and features extracted from the original images for fine matching. Result: Experiments on several indoor and outdoor datasets confirm the method's effectiveness, with clearly improved relocalization robustness and state-of-the-art performance. Conclusion: Combining Feature Gaussian Splatting with adaptive view synthesis and stage-specific hybrid feature matching overcomes the bottleneck of traditional sparse matching and moves hierarchical relocalization toward greater robustness and efficiency. Abstract: Visual relocalization is a fundamental task in the field of 3D computer vision, estimating a camera's pose when it revisits a previously known scene. While point-based hierarchical relocalization methods have shown strong scalability and efficiency, they are often limited by sparse image observations and weak feature matching. In this work, we propose SplatHLoc, a novel hierarchical visual relocalization framework that uses Feature Gaussian Splatting as the scene representation. To address the sparsity of database images, we propose an adaptive viewpoint retrieval method that synthesizes virtual candidates with viewpoints more closely aligned with the query, thereby improving the accuracy of initial pose estimation. For feature matching, we observe that Gaussian-rendered features and those extracted directly from images exhibit different strengths across the two-stage matching process: the former performs better in the coarse stage, while the latter proves more effective in the fine stage. Therefore, we introduce a hybrid feature matching strategy, enabling more accurate and efficient pose estimation. Extensive experiments on both indoor and outdoor datasets show that SplatHLoc enhances the robustness of visual relocalization, setting a new state-of-the-art.

[74] SLVMEval: Synthetic Meta Evaluation Benchmark for Text-to-Long Video Generation

Ryosuke Matsuda,Keito Kudo,Haruto Yoshida,Nobuyuki Shimizu,Jun Suzuki

Main category: cs.CV

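TL;DR: This paper proposes the SLVMEval benchmark for meta-evaluating text-to-video (T2V) evaluation systems on videos up to about 3 hours long; using synthetically degraded video pairs filtered by crowdsourcing, it finds that existing evaluation systems fall short of human judgment in most aspects.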

Details Motivation: Existing T2V evaluation systems lack reliable validation for long-video quality assessment, and it is unclear whether they are accurate even in cases that humans judge easily. Method: Degraded long-video pairs are synthesized from dense video-captioning datasets across 10 aspects; crowdsourcing keeps only the pairs whose quality difference humans clearly perceive, forming the SLVMEval benchmark; a pairwise-comparison meta-evaluation framework then tests how well existing evaluation systems rank the pairs. Result: Human evaluators identify the better long video with 84.7% to 96.8% accuracy; existing evaluation systems fall short of humans in 9 of the 10 aspects. Conclusion: Current text-to-long-video evaluation systems have clear limitations, and SLVMEval provides a reproducible, controlled, human-perception-oriented meta-evaluation benchmark for future work. Abstract: This paper proposes the synthetic long-video meta-evaluation (SLVMEval), a benchmark for meta-evaluating text-to-video (T2V) evaluation systems. The proposed SLVMEval benchmark focuses on assessing these systems on videos of up to 10,486 s (approximately 3 h). The benchmark targets a fundamental requirement, namely, whether the systems can accurately assess video quality in settings that are easy for humans to assess. We adopt a pairwise comparison-based meta-evaluation framework. Building on dense video-captioning datasets, we synthetically degrade source videos to create controlled "high-quality versus low-quality" pairs across 10 distinct aspects. Then, we employ crowdsourcing to filter and retain only those pairs in which the degradation is clearly perceptible, thereby establishing an effective final testbed. Using this testbed, we assess the reliability of existing evaluation systems in ranking these pairs. Experimental results demonstrate that human evaluators can identify the better long video with 84.7%-96.8% accuracy, and in nine of the 10 aspects, the accuracy of these systems falls short of human assessment, revealing weaknesses in text-to-long-video evaluation.
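The pairwise meta-evaluation described above reduces to a simple accuracy: an evaluation system is counted as correct on a pair when it scores the original video above its degraded counterpart. The sketch below shows that bookkeeping only. The dictionary "videos", the `sharpness` and `length_s` fields, and the toy evaluator are hypothetical stand-ins, not part of the benchmark.

```python
def pairwise_accuracy(pairs, evaluator):
    """Fraction of (high_quality, low_quality) pairs ranked correctly by the evaluator."""
    correct = sum(evaluator(good) > evaluator(bad) for good, bad in pairs)
    return correct / len(pairs)

# Hypothetical stand-ins for videos: each is reduced to a few numeric attributes.
pairs = [
    ({"sharpness": 0.9, "length_s": 600}, {"sharpness": 0.4, "length_s": 600}),
    ({"sharpness": 0.8, "length_s": 3600}, {"sharpness": 0.3, "length_s": 3600}),
    ({"sharpness": 0.7, "length_s": 9000}, {"sharpness": 0.7, "length_s": 9000}),
    ({"sharpness": 0.6, "length_s": 1200}, {"sharpness": 0.5, "length_s": 1200}),
]

# A toy evaluator that only looks at sharpness; it cannot separate the third pair,
# whose degradation (in this made-up example) lies in some other aspect.
toy_evaluator = lambda video: video["sharpness"]

accuracy = pairwise_accuracy(pairs, toy_evaluator)
print(accuracy)
```

Ties count as errors under this strict comparison, which is one possible convention; the abstract does not say how the benchmark treats ties.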

[75] 3D Architect: An Automated Approach to Three-Dimensional Modeling

Sunil Tiwari,Payal Fofadiya,Vicky Vishwakarma

Main category: cs.CV

TL;DR: This paper proposes a method to reconstruct a 3D object from its orthographic views using Harris corner detection, envelope construction via perpendicular projection of control points, intersection of envelopes to obtain 3D points, surface regeneration via computational geometry, and final rendering with OpenGL.

Details Motivation: To reconstruct a 3D object from its 2D orthographic views, enabling visualization and modeling without requiring complex 3D scanning equipment. Method: Apply the Harris corner detector to extract control points from the orthographic views; project these points perpendicular to their views to construct view-specific envelopes; compute the intersection of these mutually perpendicular envelopes to obtain 3D points; reconstruct surfaces using computational geometry techniques; render the final 3D model using OpenGL. Result: A rendered 3D object generated solely from orthographic 2D views, validated through surface reconstruction and OpenGL visualization. Conclusion: The proposed method enables 3D reconstruction from limited 2D orthographic inputs using classical computer vision and geometric techniques, offering a practical alternative for basic 3D modeling scenarios. Abstract: The aim of our paper is to render an object in three dimensions from a set of its orthographic views. A corner detector (Harris detector) is applied to the input views to obtain control points. These control points are projected perpendicular to their respective views in order to construct an envelope. A set of points describing the object in three dimensions is obtained from the intersection of these mutually perpendicular envelopes. This set of points is used to regenerate the surfaces of the object using computational geometry. Finally, the object is rendered in three dimensions using OpenGL.
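The idea of intersecting mutually perpendicular envelopes can be illustrated with a simplified voxel analogue. Note the substitution: the paper extrudes Harris control points, whereas this sketch extrudes filled silhouette cells on a small grid and keeps the voxels common to all three extrusions. The function name `carve`, the axis conventions, and the toy silhouettes are assumptions for illustration.

```python
from itertools import product

def carve(front, top, side, size):
    """Keep voxel (x, y, z) only if it projects into all three orthographic views.

    front: set of filled (x, z) cells, top: set of filled (x, y) cells,
    side: set of filled (y, z) cells. Each view is extruded along its own
    viewing axis, and the result is the intersection of the three extrusions.
    """
    return {
        (x, y, z)
        for x, y, z in product(range(size), repeat=3)
        if (x, z) in front and (x, y) in top and (y, z) in side
    }

# Hypothetical input: every view shows a 2x2 square inside a 3x3 grid.
square = {(a, b) for a in range(2) for b in range(2)}
voxels = carve(front=square, top=square, side=square, size=3)
print(len(voxels))
```

Three square views carve out a 2x2x2 cube. Intersection-based reconstruction of this kind cannot recover concavities that are hidden in every view, which is a general limitation of reconstructing from orthographic views alone.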

[76] Developing Adaptive Context Compression Techniques for Large Language Models (LLMs) in Long-Running Interactions

Payal Fofadiya,Sunil Tiwari

Main category: cs.CV

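TL;DR: This paper proposes an adaptive context compression framework that uses importance-aware memory selection, coherence-sensitive filtering, and dynamic budget allocation to keep key information in long conversations while controlling context growth, improving the stability, retrieval performance, and efficiency of LLMs in long-running interactions.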

Details Motivation: Large language models (LLMs) degrade in long-running interactions because of context growth, memory saturation, and rising computational overhead. Method: An adaptive context compression framework with three core mechanisms: importance-aware memory selection, coherence-sensitive filtering, and dynamic budget allocation. Result: Evaluated on the LOCOMO, LOCCO, and LongBench benchmarks, the method improves on existing memory and compression approaches in answer quality, retrieval accuracy, coherence preservation, and efficiency, while reducing token usage and inference latency. Conclusion: Adaptive context compression balances long-term memory retention against computational efficiency and suits persistent interaction scenarios. Abstract: Large Language Models (LLMs) often experience performance degradation during long-running interactions due to increasing context length, memory saturation, and computational overhead. This paper presents an adaptive context compression framework that integrates importance-aware memory selection, coherence-sensitive filtering, and dynamic budget allocation to retain essential conversational information while controlling context growth. The approach is evaluated on LOCOMO, LOCCO, and LongBench benchmarks to assess answer quality, retrieval accuracy, coherence preservation, and efficiency. Experimental results demonstrate that the proposed method achieves consistent improvements in conversational stability and retrieval performance while reducing token usage and inference latency compared with existing memory and compression-based approaches. These findings indicate that adaptive context compression provides an effective balance between long-term memory preservation and computational efficiency in persistent LLM interactions.
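The abstract does not give the selection rule, so the following is only a guess at what importance-aware selection under a token budget could look like: a greedy pick by importance per token that then restores chronological order. The scores, token counts, and the function name `compress_context` are hypothetical, and the coherence-sensitive filtering and dynamic budget allocation stages are not modeled here.

```python
def compress_context(turns, budget):
    """Greedy budgeted selection over conversation turns.

    turns: list of (text, importance, tokens). Turns are ranked by importance
    per token, added while they fit in the budget, and returned in their
    original chronological order.
    """
    ranked = sorted(
        range(len(turns)),
        key=lambda i: turns[i][1] / turns[i][2],
        reverse=True,
    )
    kept, used = set(), 0
    for i in ranked:
        tokens = turns[i][2]
        if used + tokens <= budget:
            kept.add(i)
            used += tokens
    return [turns[i][0] for i in sorted(kept)], used

# Hypothetical conversation history with made-up importance scores and token counts.
turns = [
    ("greeting", 0.1, 5),
    ("user allergy: peanuts", 0.9, 6),
    ("small talk", 0.2, 20),
    ("appointment on Friday", 0.8, 8),
]
kept_texts, used_tokens = compress_context(turns, budget=15)
print(kept_texts, used_tokens)
```

A greedy ratio rule is a common heuristic for budgeted selection but is not optimal in general; it is used here only to keep the sketch short.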

[77] Multi-Layered Memory Architectures for LLM Agents: An Experimental Evaluation of Long-Term Context Retention

Sunil Tiwari,Payal Fofadiya

Main category: cs.CV

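TL;DR: This paper proposes a multi-layer memory framework that decomposes dialogue history into working, episodic, and semantic memory layers and combines adaptive retrieval gating with retention regularization to reduce semantic drift and unstable memory in long-horizon dialogue; across several benchmarks it improves success rate, F1, and long-term retention while lowering false memory rate and context usage.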

Details Motivation: Long-horizon dialogue systems suffer from semantic drift and unstable memory retention across sessions. Method: A multi-layer memory framework decomposes dialogue history into working, episodic, and semantic memory layers and adds adaptive retrieval gating and retention regularization to control semantic drift, bound context growth, and maintain computational efficiency. Result: On LOCOMO, LOCCO, and LoCoMo the framework achieves a Success Rate of 46.85, an overall F1 of 0.618 (multi-hop F1 of 0.594), and 56.90% six-period retention, while reducing the false memory rate to 5.1% and context usage to 58.40%. Conclusion: The framework strengthens long-term retention and reasoning stability, particularly in practical settings with constrained context budgets. Abstract: Long-horizon dialogue systems suffer from semantic drift and unstable memory retention across extended sessions. This paper presents a Multi-Layer Memory Framework that decomposes dialogue history into working, episodic, and semantic layers with adaptive retrieval gating and retention regularization. The architecture controls cross-session drift while maintaining bounded context growth and computational efficiency. Experiments on LOCOMO, LOCCO, and LoCoMo show improved performance, achieving 46.85 Success Rate, 0.618 overall F1 with 0.594 multi-hop F1, and 56.90% six-period retention while reducing false memory rate to 5.1% and context usage to 58.40%. Results confirm enhanced long-term retention and reasoning stability under constrained context budgets.

[78] LightHarmony3D: Harmonizing Illumination and Shadows for Object Insertion in 3D Gaussian Splatting

Tianyu Huang,Zhenyang Ren,Zhenchen Wan,Jiyang Zheng,Wenjie Wang,Runnan Chen,Mingming Gong,Tongliang Liu

Main category: cs.CV

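TL;DR: This paper proposes LightHarmony3D, a framework whose generative module predicts a 360° HDR environment map in a single forward pass, enabling illumination-consistent insertion of mesh objects into 3D Gaussian Splatting (3DGS) scenes, and introduces the first dedicated benchmark for this task.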

Details Motivation: When external mesh objects are inserted into reconstructed 3DGS scenes, physically consistent lighting and shadows are hard to achieve because they require accurate scene illumination estimation and multi-view consistency. Method: LightHarmony3D centers on a generative module that predicts a full omnidirectional HDR environment map at the insertion location in a single forward pass; it relies on generative priors instead of iterative optimization, supports physically plausible shading and shadows, and maintains multi-view consistency; the authors also build the first dedicated benchmark for mesh insertion in 3DGS. Result: Experiments on several real-world 3DGS reconstruction datasets show state-of-the-art realism and multi-view consistency. Conclusion: LightHarmony3D addresses illumination-consistent mesh insertion in 3DGS scenes while balancing efficiency, realism, and multi-view consistency, benefiting applications such as AR/VR and virtual staging. Abstract: 3D Gaussian Splatting (3DGS) enables high-fidelity reconstruction of scene geometry and appearance. Building on this capability, inserting external mesh objects into reconstructed 3DGS scenes enables interactive editing and content augmentation for immersive applications such as AR/VR, virtual staging, and digital content creation. However, achieving physically consistent lighting and shadows for mesh insertion remains challenging, as it requires accurate scene illumination estimation and multi-view consistent rendering. To address this challenge, we present LightHarmony3D, a novel framework for illumination-consistent mesh insertion in 3DGS scenes. Central to our approach is our proposed generative module that predicts a full 360° HDR environment map at the insertion location via a single forward pass. By leveraging generative priors instead of iterative optimization, our method efficiently captures dominant scene illumination and enables physically grounded shading and shadows for inserted meshes while maintaining multi-view coherence. Furthermore, we introduce the first dedicated benchmark for mesh insertion in 3DGS, providing a standardized evaluation framework for assessing lighting consistency and photorealism. Extensive experiments across multiple real-world reconstruction datasets demonstrate that LightHarmony3D achieves state-of-the-art realism and multi-view consistency.

[79] CCDNet: Learning to Detect Camouflage against Distractors in Infrared Small Target Detection

Zikai Liao,Zhaozheng Yin

Main category: cs.CV

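TL;DR: This paper proposes the Camouflage-aware Counter-Distraction Network (CCDNet) for infrared target detection, where targets have low contrast, blend into complex backgrounds, and are easily confused with distractors that cause false detections. A Weighted Multi-branch Perceptron (WMP) backbone, an Aggregation-and-Refinement Fusion Neck (ARFN), and a Contrastive-aided Distractor Discriminator (CaDD) together raise detection accuracy and lower the false alarm rate.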

Details Motivation: Infrared target detection is challenged by low target contrast, targets blending into complex backgrounds, and similar-looking distractors that cause high false alarm rates. Method: CCDNet comprises (1) a backbone built on Weighted Multi-branch Perceptrons (WMPs) that aggregates self-conditioned multi-level features; (2) an Aggregation-and-Refinement Fusion Neck (ARFN) that bidirectionally reconstructs the relations between targets and backgrounds; (3) a Contrastive-aided Distractor Discriminator (CaDD) that adaptively computes target-background similarity both locally and globally to discriminate distractors. Result: Experiments on several infrared image datasets show that CCDNet outperforms current state-of-the-art methods. Conclusion: By jointly modeling target-background relations and distractor discrimination, CCDNet improves the accuracy and robustness of infrared target detection. Abstract: Infrared target detection (IRSTD) tasks have critical applications in areas like wilderness rescue and maritime search. However, detecting infrared targets is challenging due to their low contrast and tendency to blend into complex backgrounds, effectively camouflaging themselves. Additionally, other objects with similar features (distractors) can cause false alarms, further degrading detection performance. To address these issues, we propose a novel Camouflage-aware Counter-Distraction Network (CCDNet) in this paper. We design a backbone with Weighted Multi-branch Perceptrons (WMPs), which aggregates self-conditioned multi-level features to accurately represent the target and background. Based on these rich features, we then propose a novel Aggregation-and-Refinement Fusion Neck (ARFN) to refine structures/semantics from shallow/deep feature maps, and bidirectionally reconstruct the relations between the targets and the backgrounds, highlighting the targets while suppressing the complex backgrounds to improve detection accuracy. Furthermore, we present a new Contrastive-aided Distractor Discriminator (CaDD), enforcing adaptive similarity computation both locally and globally between the real targets and the backgrounds to more precisely discriminate distractors, so as to reduce the false alarm rate. Extensive experiments on infrared image datasets confirm that CCDNet outperforms other state-of-the-art methods.

[80] M2H-MX: Multi-Task Dense Visual Perception for Real-Time Monocular Spatial Understanding

U. V. B. L. Udugama,George Vosselman,Francesco Nex

Main category: cs.CV

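TL;DR: This paper proposes M2H-MX, a lightweight, real-time monocular multi-task perception model that uses multi-scale features, gated global context, and controlled cross-task interaction to jointly improve depth and semantic prediction, and plugs directly into a monocular SLAM system, clearly outperforming baselines on both dense prediction metrics and real mapping performance.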

Details Motivation: Monocular cameras are cheap and easy to deploy, but reliable real-time monocular spatial understanding remains difficult; recent progress in multi-task dense prediction has not yet translated into stable, low-latency monocular mapping systems. Method: M2H-MX keeps multi-scale features in a lightweight decoder and introduces register-gated global context and controlled cross-task interaction so that depth and semantic predictions reinforce each other; a compact perception-to-mapping interface connects directly to an unmodified monocular SLAM pipeline. Result: On NYUDv2, M2H-MX-L improves semantic mIoU by 6.6% and reduces depth RMSE by 9.4%; deployed on ScanNet, it reduces average trajectory error by 60.7% and produces cleaner metric-semantic maps. Conclusion: Modern multi-task dense prediction models can be deployed reliably for monocular spatial perception in robotic systems under strict real-time constraints, clearly improving end-to-end mapping performance. Abstract: Monocular cameras are attractive for robotic perception due to their low cost and ease of deployment, yet achieving reliable real-time spatial understanding from a single image stream remains challenging. While recent multi-task dense prediction models have improved per-pixel depth and semantic estimation, translating these advances into stable monocular mapping systems is still non-trivial. This paper presents M2H-MX, a real-time multi-task perception model for monocular spatial understanding. The model preserves multi-scale feature representations while introducing register-gated global context and controlled cross-task interaction in a lightweight decoder, enabling depth and semantic predictions to reinforce each other under strict latency constraints. Its outputs integrate directly into an unmodified monocular SLAM pipeline through a compact perception-to-mapping interface. We evaluate both dense prediction accuracy and in-the-loop system performance. On NYUDv2, M2H-MX-L achieves state-of-the-art results, improving semantic mIoU by 6.6% and reducing depth RMSE by 9.4% over representative multi-task baselines. When deployed in a real-time monocular mapping system on ScanNet, M2H-MX reduces average trajectory error by 60.7% compared to a strong monocular SLAM baseline while producing cleaner metric-semantic maps. These results demonstrate that modern multi-task dense prediction can be reliably deployed for real-time monocular spatial perception in robotic systems.

[81] Diffusion Mental Averages

Phonphrm Thawatdamrongkit,Sukit Seripanitkarn,Supasorn Suwajanakorn

Main category: cs.CV

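TL;DR: This paper proposes Diffusion Mental Averages (DMA), which averages samples generated from the same prompt inside the diffusion model's internal semantic space through trajectory alignment, producing sharp, realistic "mental average" images of a concept and, according to the authors, the first consistent, high-quality averages for abstract or multimodal concepts.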

Details Motivation: Existing dataset-based image averaging methods give blurry results when applied to diffusion samples from the same prompt and ignore the model's own generative process; the authors want a sharper, more intrinsic concept average computed within the model's internal semantic space. Method: DMA casts concept averaging as an alignment optimization over multiple denoising trajectories, so that different noise latents converge during denoising toward a shared coarse-to-fine semantic path; for multimodal concepts (such as dogs with many breeds), samples are clustered in a semantic space such as CLIP, and Textual Inversion or LoRA maps the clusters back into diffusion space. Result: DMA produces sharp, realistic, consistent concept averages for abstract and multimodal concepts; these serve as visual summaries and reveal biases in how the model represents concepts. Conclusion: DMA is presented as the first method to achieve high-quality concept averaging inside a diffusion model, overcoming the blurriness of data-driven averaging and providing a new tool for understanding internal concept representations. Abstract: Can a diffusion model produce its own "mental average" of a concept, one that is as sharp and realistic as a typical sample? We introduce Diffusion Mental Averages (DMA), a model-centric answer to this question. While prior methods aim to average image collections, they produce blurry results when applied to diffusion samples from the same prompt. These data-centric techniques operate outside the model, ignoring the generative process. In contrast, DMA averages within the diffusion model's semantic space, as discovered by recent studies. Since this space evolves across timesteps and lacks a direct decoder, we cast averaging as trajectory alignment: optimize multiple noise latents so their denoising trajectories progressively converge toward shared coarse-to-fine semantics, yielding a single sharp prototype. We extend our approach to multimodal concepts (e.g., dogs with many breeds) by clustering samples in semantically-rich spaces such as CLIP and applying Textual Inversion or LoRA to bridge CLIP clusters into diffusion space. This is, to our knowledge, the first approach that delivers consistent, realistic averages, even for abstract concepts, serving as a concrete visual summary and a lens into model biases and concept representation.

[82] Monocular Building Height Estimation from PhiSat-2 Imagery: Dataset and Method

Yanjiao Song,Bowen Cai,Timo Balz,Zhenfeng Shao,Neema Simon Sumari,James Magidi,Walter Musakwa

Main category: cs.CV

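TL;DR: This paper builds the PhiSat-2-Height dataset (PHDataset) and proposes the Two-Stream Ordinal Network (TSONet) for building height estimation from monocular optical imagery, substantially improving accuracy and demonstrating the potential of PhiSat-2 satellite data for this task.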

Details Motivation: Monocular height estimation from optical imagery suffers from ambiguous height cues, large inter-city differences in building morphology, and a long-tailed height distribution; PhiSat-2 offers global coverage, multispectral bands, and relatively high spatial resolution, but its potential has not been systematically evaluated. Method: Builds PHDataset with 9,475 image-label patch pairs from 26 cities; proposes TSONet, which jointly models footprint segmentation and height estimation and adds a Cross-Stream Exchange Module (CSEM) and a Feature-Enhanced Bin Refinement (FEBR) module for footprint-aware feature interaction and ordinal height refinement. Result: On PHDataset, TSONet reduces MAE and RMSE by 13.2% and 9.7% and improves IoU and F1-score by 14.0% and 10.1%; ablations confirm the value of CSEM, FEBR, and the joint use of ordinal regression and footprint assistance; further analysis indicates that PhiSat-2's spatial detail and multispectral information together benefit height estimation. Conclusion: The study confirms the potential of PhiSat-2 for monocular building height estimation and provides a dedicated dataset (PHDataset) and an effective method (TSONet) as a basis for future work. Abstract: Monocular building height estimation from optical imagery is important for urban morphology characterization but remains challenging due to ambiguous height cues, large inter-city variations in building morphology, and the long-tailed distribution of building heights. PhiSat-2 is a promising open-access data source for this task because of its global coverage, 4.75 m spatial resolution, and seven-band spectral observations, yet its potential has not been systematically evaluated. To address this gap, we construct a PhiSat-2-Height dataset (PHDataset) and propose a Two-Stream Ordinal Network (TSONet). PHDataset contains 9,475 co-registered image-label patch pairs from 26 cities worldwide. TSONet jointly models footprint segmentation and height estimation, and introduces a Cross-Stream Exchange Module (CSEM) and a Feature-Enhanced Bin Refinement (FEBR) module for footprint-aware feature interaction and ordinal height refinement. Experiments on PHDataset show that TSONet achieves the best overall performance, reducing MAE and RMSE by 13.2% and 9.7%, and improving IoU and F1-score by 14.0% and 10.1% over the strongest competing results. Ablation studies further verify the effectiveness of CSEM, FEBR, and the joint use of ordinal regression and footprint assistance. Additional analyses indicate that PhiSat-2 benefits monocular building height estimation through its balanced combination of building-relevant spatial detail and multispectral observations.
Overall, this study confirms the potential of PhiSat-2 for monocular building height estimation and provides a dedicated dataset and an effective method for future research.
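
To make the ordinal formulation of height estimation concrete, here is a minimal sketch of ordinal encoding and decoding over height bins. It is not TSONet's FEBR module: the thresholds, the 0.5 decision rule, and the function names `encode_ordinal` and `decode_ordinal` are all illustrative assumptions.

```python
import numpy as np

# Hypothetical bin thresholds in metres; the paper's actual bins are not given here.
THRESHOLDS = np.array([3.0, 6.0, 12.0, 24.0, 48.0])

def encode_ordinal(height_m, thresholds=THRESHOLDS):
    """Ordinal target: entry k is 1.0 when the height exceeds threshold k."""
    return (height_m > thresholds).astype(np.float32)

def decode_ordinal(probs, thresholds=THRESHOLDS):
    """Count thresholds predicted as exceeded (p > 0.5), then return that bin's midpoint."""
    k = int((np.asarray(probs) > 0.5).sum())
    # Bin edges: 0 at the bottom; the top bin reuses the width of the last interval.
    top = thresholds[-1] + (thresholds[-1] - thresholds[-2])
    edges = np.concatenate(([0.0], thresholds, [top]))
    return float(0.5 * (edges[k] + edges[k + 1]))

target = encode_ordinal(10.0)                            # a 10 m building
estimate = decode_ordinal([0.9, 0.8, 0.3, 0.1, 0.05])    # made-up per-threshold probabilities
```

Framing the target as a chain of "taller than threshold k" decisions keeps the ordering of bins explicit, which is one common way to cope with long-tailed height distributions.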

[83] Scaling the Long Video Understanding of Multimodal Large Language Models via Visual Memory Mechanism

Tao Chen,Kun Zhang,Qiong Wu,Xiao Chen,Chao Chang,Xiaoshuai Sun,Yiyi Zhou,Rongrong Ji

Main category: cs.CV

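TL;DR: This paper proposes FlexMem, a training-free flexible visual memory mechanism that addresses the input-length limit of multimodal large language models (MLLMs) in long video understanding; by dynamically compressing and reading the visual KV cache, it supports videos of unbounded length, processes more than 1,000 frames on a single 3090 GPU, and reaches SOTA-level performance.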

Details Motivation: Long video understanding is limited by the input-length ceiling of MLLMs; existing methods cannot handle videos of unbounded length, which calls for a human-like dynamic memory mechanism. Method: Proposes FlexMem, which treats the visual KV cache as the memory source, writes memory through dual-pathway compression, and designs several memory-reading strategies for different video understanding tasks, including streaming video. Result: Processes more than 1,000 frames on a single 3090 GPU, clearly outperforms existing efficient video understanding methods, and lets base MLLMs match or exceed SOTA models such as GPT-4o and Gemini-1.5 Pro on some benchmarks. Conclusion: FlexMem is a training-free, scalable visual memory mechanism that offers a new paradigm for long and streaming video understanding with MLLMs. Abstract: Long video understanding is a key challenge that plagues the advancement of Multimodal Large Language Models (MLLMs). In this paper, we study this problem from the perspective of visual memory mechanism, and propose a novel and training-free approach, termed Flexible Memory (FlexMem). In principle, FlexMem aims to mimic human behavior of video watching, i.e., continually watching video content and recalling the most relevant memory fragments to answer the question. In this way, FlexMem can help MLLMs achieve video understanding of infinite lengths, unlike previous methods that process all video information at once and have an input upper limit. Concretely, FlexMem first considers the visual KV caches as the memory sources, and realizes effective memory transfer and writing via a dual-pathway compression design. Afterwards, FlexMem also explores different memory reading strategies for the diverse video understanding tasks, including the popular streaming one. To validate FlexMem, we apply it to two popular video-MLLMs, and conduct extensive experiments on five long video tasks and one streaming video task. The experimental results show that on a single 3090 GPU, FlexMem achieves clear improvements over existing efficient video understanding methods and processes more than 1k frames, which also helps the base MLLMs achieve comparable or even better performance than SOTA MLLMs, e.g., GPT-4o and Gemini-1.5 Pro, on some benchmarks.

[84] Omni-NegCLIP: Enhancing CLIP with Front-Layer Contrastive Fine-Tuning for Comprehensive Negation Understanding

Jingqi Xu

Main category: cs.CV

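TL;DR: This paper proposes Omni-NegCLIP, which modifies CLIP's contrastive loss and selectively fine-tunes the front layers of the text encoder, markedly improving understanding of presence-based and absence-based negation without harming, and even improving, general image-text retrieval.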

Details Motivation: Existing vision-language models such as CLIP handle negation poorly, even though negation (in particular presence-based and absence-based negation) is common in natural language. Method: Designs two contrastive objectives: a presence-based objective that pulls an image toward its original caption and pushes it away from the presence-based negated caption, and an absence-based objective that aligns the image with both the original and the absence-based negated caption while keeping the two text embeddings semantically distinct; only the front transformer layers of the CLIP text encoder are fine-tuned, since they learn negated text better than later layers. Result: Omni-NegCLIP improves presence-based and absence-based negation tasks by up to 52.65% and 12.50%, respectively, and image-text retrieval improves by up to 19.62% rather than degrading; it covers more negation types than prior work. Conclusion: Modeling negation types separately and fine-tuning only the key text-encoder layers substantially strengthens a VLM's handling of negation while preserving general multimodal capability. Abstract: Vision-Language Models (VLMs) have demonstrated strong capabilities across a wide range of multimodal tasks. However, recent studies have shown that VLMs, such as CLIP, perform poorly in understanding negation expressions, which are common in natural language. In this work, we propose Omni-NegCLIP, a fine-tuned CLIP model that improves CLIP's understanding of two types of negation, namely presence-based negation and absence-based negation, which correspond to negated expressions of objects that are actually present in an image and those that may plausibly exist in an image but are in fact absent, respectively, by modifying CLIP's original InfoNCE contrastive loss. Specifically, we design a presence-based contrastive objective that pulls image embeddings closer to their original caption embeddings while pushing them away from the corresponding presence-based negated caption embeddings, and an absence-based contrastive objective that aligns image embeddings with both original and absence-based negated caption embeddings while maintaining a semantic distinction between the two text embeddings. Based on our observation that the front transformer layers of CLIP text encoder have stronger learning ability for negated text than the later layers, we fine-tune the front transformer layers of the CLIP text encoder at each training step using the combined contrastive objective.
Experimental results show that, compared with pretrained CLIP, Omni-NegCLIP improves performance on presence-based negation and absence-based negation tasks by up to 52.65% and 12.50%, respectively, without sacrificing general capability in image-text retrieval and even improving it by up to 19.62%. Compared with prior works, Omni-NegCLIP demonstrates a more comprehensive ability to understand multiple types of negation tasks.
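
The two objectives described above can be sketched roughly as follows. This is an illustrative approximation, not the paper's loss: the two-way softmax form, the hinge with `margin`, the `temperature` value, and both function names are assumptions introduced here.

```python
import numpy as np

def normalize(x):
    """L2-normalize embeddings along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def presence_loss(img, cap, neg_cap, temperature=0.07):
    """Pull each image toward its caption and away from the negated caption.

    Assumed form: cross-entropy over the two similarities {caption, negated caption}.
    """
    img, cap, neg_cap = map(normalize, (img, cap, neg_cap))
    pos = (img * cap).sum(-1) / temperature
    neg = (img * neg_cap).sum(-1) / temperature
    # -log softmax(pos) over {pos, neg}; logaddexp keeps this numerically stable.
    return float(np.mean(np.logaddexp(pos, neg) - pos))

def absence_loss(img, cap, neg_cap, margin=0.8):
    """Align the image with both captions but keep the two text embeddings apart.

    Assumed form: cosine alignment terms plus a hinge on text-text similarity.
    """
    img, cap, neg_cap = map(normalize, (img, cap, neg_cap))
    align = (1 - (img * cap).sum(-1)) + (1 - (img * neg_cap).sum(-1))
    separate = np.maximum(0.0, (cap * neg_cap).sum(-1) - margin)
    return float(np.mean(align + separate))

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))          # stand-in embeddings, not real CLIP outputs
good = presence_loss(emb, emb, -emb)   # caption matches, negation opposes
bad = presence_loss(emb, -emb, emb)    # roles swapped
```

In a real setup the embeddings would come from the CLIP image and text encoders, with gradients flowing only into the front text-encoder layers as the abstract describes.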

[85] Unbiased Model Prediction Without Using Protected Attribute Information

Puspita Majumdar,Surbhi Mittal,Mayank Vatsa,Richa Singh

Main category: cs.CV

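TL;DR: This paper proposes NPAD, a debiasing algorithm that needs no protected attribute information; it uses auxiliary information from non-protected attributes together with two new loss functions, DACL and FRL, and significantly reduces bias across gender and age subgroups on the LFWA and CelebA datasets.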

Details Motivation: Most existing fairness algorithms rely on protected attributes (e.g., gender, race) for bias mitigation, which limits their use in real-world settings. Method: Proposes the Non-Protected Attribute-based Debiasing (NPAD) algorithm and two loss functions, Debiasing via Attribute Cluster Loss (DACL) and Filter Redundancy Loss (FRL), which use non-protected attributes to optimize the model for fairness. Result: On facial attribute prediction with the LFWA and CelebA datasets, NPAD significantly reduces performance gaps across gender and age subgroups. Conclusion: NPAD mitigates model bias without protected attributes, making fairness algorithms more practical to deploy. Abstract: The problem of bias persists in the deep learning community as models continue to provide disparate performance across different demographic subgroups. Therefore, several algorithms have been proposed to improve the fairness of deep models. However, a majority of these algorithms utilize the protected attribute information for bias mitigation, which severely limits their application in real-world scenarios. To address this concern, we propose a novel algorithm for bias mitigation, termed the Non-Protected Attribute-based Debiasing (NPAD) algorithm, that does not require the protected attribute information. The proposed NPAD algorithm utilizes the auxiliary information provided by the non-protected attributes to optimize the model for bias mitigation. Further, two different loss functions, Debiasing via Attribute Cluster Loss (DACL) and Filter Redundancy Loss (FRL), have been proposed to optimize the model for fairness goals. Multiple experiments are performed on the LFWA and CelebA datasets for facial attribute prediction, and a significant reduction in bias across different gender and age subgroups is observed.

[86] ConInfer: Context-Aware Inference for Training-Free Open-Vocabulary Remote Sensing Segmentation

Wenyang Chen,Zhanxuan Hu,Yaping Zhang,Hailong Ning,Yonghang Tai

Main category: cs.CV

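TL;DR: This paper proposes ConInfer, a framework that models semantic dependencies between spatial units to perform context-aware open-vocabulary remote sensing segmentation, significantly improving segmentation consistency, robustness, and generalization.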

Details Motivation: Existing open-vocabulary remote sensing segmentation methods predict each patch independently and ignore the large scale and strong spatial and semantic correlations inherent in remote sensing imagery, which leads to inaccurate segmentation. Method: Proposes ConInfer, a context-aware inference framework that predicts jointly over multiple spatial units, explicitly models their semantic dependencies, and incorporates global contextual cues. Result: Consistently outperforms SOTA methods such as SegEarth-OV on multiple benchmarks, with average gains of 2.80% on open-vocabulary semantic segmentation and 6.13% on object extraction. Conclusion: By adding context modeling, ConInfer overcomes the limits of independent patch-wise prediction and offers a more robust, consistent, and generalizable approach to open-vocabulary remote sensing segmentation. Abstract: Training-free open-vocabulary remote sensing segmentation (OVRSS), empowered by vision-language models, has emerged as a promising paradigm for achieving category-agnostic semantic understanding in remote sensing imagery. Existing approaches mainly focus on enhancing feature representations or mitigating modality discrepancies to improve patch-level prediction accuracy. However, such independent prediction schemes are fundamentally misaligned with the intrinsic characteristics of remote sensing data. In real-world applications, remote sensing scenes are typically large-scale and exhibit strong spatial as well as semantic correlations, making isolated patch-wise predictions insufficient for accurate segmentation. To address this limitation, we propose ConInfer, a context-aware inference framework for OVRSS that performs joint prediction across multiple spatial units while explicitly modeling their inter-unit semantic dependencies. By incorporating global contextual cues, our method significantly enhances segmentation consistency, robustness, and generalization in complex remote sensing environments. Extensive experiments on multiple benchmark datasets demonstrate that our approach consistently surpasses state-of-the-art per-pixel VLM-based baselines such as SegEarth-OV, achieving average improvements of 2.80% and 6.13% on open-vocabulary semantic segmentation and object extraction tasks, respectively. The implementation code is available at: https://github.com/Dog-Yang/ConInfer

[87] MaskAdapt: Learning Flexible Motion Adaptation via Mask-Invariant Prior for Physics-Based Characters

Soomin Park,Eunseong Lee,Kwang Bin Lee,Sung-Hee Lee

Main category: cs.CV

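TL;DR: MaskAdapt is a two-stage residual learning framework for motion adaptation of physics-based humanoid characters; a mask-invariant base policy combined with a customizable residual policy enables flexible, robust adjustment of partial-body motion.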

Details Motivation: Existing methods struggle to keep whole-body motion stable and controllable when observations are missing or when only local changes are needed; a motion adaptation framework should be robust to occluded or missing inputs while precisely controlling designated body parts. Method: Uses two-stage residual learning: stage one trains a mask-invariant base policy with stochastic body-part masking and a consistency regularizer; stage two trains, on top of the frozen base policy, a residual policy that acts only on the targeted body parts. Applications include motion composition and text-driven partial goal tracking. Result: Experiments show that MaskAdapt produces diverse, robust behaviors under masked observations and clearly outperforms prior work on targeted motion adaptation. Conclusion: MaskAdapt provides a general, modular, and efficient mechanism for partial motion adaptation, opening a path toward complex, conditional humanoid control. Abstract: We present MaskAdapt, a framework for flexible motion adaptation in physics-based humanoid control. The framework follows a two-stage residual learning paradigm. In the first stage, we train a mask-invariant base policy using stochastic body-part masking and a regularization term that enforces consistent action distributions across masking conditions. This yields a robust motion prior that remains stable under missing observations, anticipating later adaptation in those regions. In the second stage, a residual policy is trained atop the frozen base controller to modify only the targeted body parts while preserving the original behaviors elsewhere. We demonstrate the versatility of this design through two applications: (i) motion composition, where varying masks enable multi-part adaptation within a single sequence, and (ii) text-driven partial goal tracking, where designated body parts follow kinematic targets provided by a pre-trained text-conditioned autoregressive motion generator. Through experiments, MaskAdapt demonstrates strong robustness and adaptability, producing diverse behaviors under masked observations and delivering superior targeted motion adaptation compared to prior work.
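
The two-stage idea (mask part of the observation, then let a residual change only the targeted joints) can be sketched as below. This is a toy illustration under stated assumptions: zero-filling masked observations, the additive residual, the six-dimensional layout in `PART_SLICES`, and both helper names are hypothetical and not taken from the paper.

```python
import numpy as np

def mask_observation(obs, part_slices, masked_parts):
    """Zero the observation entries of each masked body part (zero-filling is an assumption)."""
    out = obs.copy()
    for name in masked_parts:
        out[part_slices[name]] = 0.0
    return out

def compose_action(base_action, residual_action, part_mask):
    """Add the residual only on joints whose mask entry is 1; other joints keep the base action."""
    return base_action + part_mask * residual_action

# Hypothetical layout: first three dimensions are legs, last three are arms.
PART_SLICES = {"legs": slice(0, 3), "arms": slice(3, 6)}

obs = np.arange(6, dtype=float) + 1.0
masked = mask_observation(obs, PART_SLICES, ["arms"])   # base policy sees no arm state

base = np.zeros(6)        # stand-in for the frozen base policy's output
residual = np.ones(6)     # stand-in for the residual policy's output
arm_mask = np.array([0, 0, 0, 1, 1, 1], dtype=float)
action = compose_action(base, residual, arm_mask)       # only the arm joints change
```

Keeping the mask outside the residual network means the untargeted joints are guaranteed to follow the base controller, whatever the residual policy outputs.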

[88] PRISM: A Multi-View Multi-Capability Retail Video Dataset for Embodied Vision-Language Models

Amirreza Rouhi,Parikshit Sakurikar,Satya Sai Reddy,Narsimha Menga,Anirudh Govil,Sri Harsha Chittajallu,Rajat Aggarwal,Anoop Namboodiri,Sashi Reddi

Main category: cs.CV

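TL;DR: This paper presents PRISM, a 270K-sample multi-view video supervised fine-tuning corpus built for retail environments, aimed at improving spatial understanding, physical dynamics, and embodied action in embodied vision-language models (VLMs); the dataset is grounded in a three-dimensional knowledge ontology and covers 20+ capability probes across four evaluation dimensions, and fine-tuning on it cuts the overall error rate by 66.6% and raises embodied action understanding accuracy by 36.4%.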

Details Motivation: Physical AI systems fail in real-world deployment mainly because they understand space, physical dynamics, and embodied action poorly, not because visual recognition is weak. Method: Builds PRISM, a large-scale video SFT corpus with 270K samples, multiple viewpoints (egocentric, exocentric, 360°), five supermarket locations, 4 fps capture, and open-ended, chain-of-thought, and multiple-choice annotations, organized by a three-dimensional ontology spanning spatial, temporal and physical, and embodied action knowledge. Result: Across 20+ capability probes the overall error rate drops by 66.6% and embodied action understanding accuracy rises by 36.4%; PRISM is described as the first dataset to instantiate all three knowledge dimensions within a single real-world deployment domain (retail). Conclusion: Ontology-structured, domain-specific supervised fine-tuning can effectively improve the robustness and practical usefulness of embodied VLMs in real settings. Abstract: A critical gap exists between the general-purpose visual understanding of state-of-the-art physical AI models and the specialized perceptual demands of structured real-world deployment environments. We present PRISM, a 270K-sample multi-view video supervised fine-tuning (SFT) corpus for embodied vision-language models (VLMs) in real-world retail environments. PRISM is motivated by a simple observation: physical AI systems fail not because of poor visual recognition, but because they do not understand space, physical dynamics and embodied action well enough to operate reliably in the world. To this end, PRISM is grounded in a novel three-dimensional knowledge ontology that spans spatial knowledge, temporal and physical knowledge, and embodied action knowledge. It covers 20+ capability probes across four evaluation dimensions: Embodied Reasoning (ER), Common Sense (CS), Spatial Perception (SP), and Intuitive Physics (IP), and to our knowledge, PRISM is the first dataset to instantiate all three knowledge dimensions within a single real-world deployment domain. The corpus captures data from egocentric, exocentric and 360° viewpoints across five supermarket locations and includes open-ended, chain-of-thought, and multiple-choice supervision. At 4 fps, PRISM spans approximately 11.8M video frames and approximately 730M tokens, placing it among the largest domain-specific video SFT corpora. Fine-tuning on PRISM reduces the error rate across all 20+ probes by 66.6% over the pre-trained baseline, with significant gains in embodied action understanding where the accuracy improves by 36.4%.
Our results suggest that ontology-structured, domain-specific SFT can meaningfully strengthen embodied VLMs for real-world settings. The PRISM dataset and more details are available at https://dreamvu.ai/prism

[89] MELT: Improve Composed Image Retrieval via the Modification Frequentation-Rarity Balance Network

Guozhi Qiu,Zhiwei Chen,Zixu Li,Qinlei Huang,Zhiheng Fu,Xuemeng Song,Yupeng Hu

Main category: cs.CV

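TL;DR: This paper proposes MELT, a network that balances attention between rare and frequent modification semantics and uses diffusion-based denoising to become more robust to hard negative samples, improving composed image retrieval (CIR).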

Details Motivation: Existing CIR methods suffer from "Rare Sample Neglect" caused by frequency bias, and their similarity scores are easily disturbed by hard negative samples and noise. Method: Proposes MELT, which increases attention to rare modification semantics in multimodal contexts and applies diffusion-based denoising to hard negatives with high similarity scores, improving multimodal fusion and matching. Result: Extensive experiments on two CIR benchmarks support MELT's superior performance. Conclusion: MELT alleviates asymmetric localization of rare semantics and disturbed similarity estimation, improving the accuracy and robustness of CIR. Abstract: Composed Image Retrieval (CIR) uses a reference image and a modification text as a query to retrieve a target image satisfying the requirement of "modifying the reference image according to the text instructions". However, existing CIR methods face two limitations: (1) frequency bias leading to "Rare Sample Neglect", and (2) susceptibility of similarity scores to interference from hard negative samples and noise. To address these limitations, we confront two key challenges: asymmetric rare semantic localization and robust similarity estimation under hard negative samples. To solve these challenges, we propose the Modification frEquentation-rarity baLance neTwork (MELT). MELT assigns increased attention to rare modification semantics in multimodal contexts while applying diffusion-based denoising to hard negative samples with high similarity scores, enhancing multimodal fusion and matching. Extensive experiments on two CIR benchmarks validate the superior performance of MELT. Codes are available at https://github.com/luckylittlezhi/MELT.

[90] GazeCLIP: Gaze-Guided CLIP with Adaptive-Enhanced Fine-Grained Language Prompt for Deepfake Attribution and Detection

Yaning Zhang,Linlin Shen,Zitong Yu,Chunjie Ma,Zan Gao

Main category: cs.CV

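TL;DR: This paper proposes a gaze-guided CLIP model with adaptively enhanced fine-grained language prompts for deepfake attribution and detection (DFAD), targeting better generalization to unseen generators such as diffusion models; with a new benchmark, a gaze-aware image encoder (GIE), and a language refinement encoder (LRE), it clearly outperforms existing methods on joint attribution and detection.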

Details Motivation: Existing deepfake attribution and detection methods rely on the visual modality alone, generalize poorly (especially to new generators such as diffusion models), and ignore the synergy between detection and attribution. Method: Proposes a gaze-guided CLIP framework: (1) builds a fine-grained DFAD benchmark covering novel generators; (2) designs a gaze-aware image encoder (GIE) that fuses gaze features with forged image embeddings; (3) builds a language refinement encoder (LRE) that strengthens language prompts through adaptive word selection for precise cross-modal matching. Result: On the authors' benchmark, the model outperforms the state of the art by 6.56% ACC and 5.32% AUC on average under the attribution and detection settings, respectively. Conclusion: Gaze cues combined with multimodal modeling improve generalization of deepfake attribution and detection to unknown generators, and fine-grained vision-language alignment plays a key role in DFAD. Abstract: Current deepfake attribution or deepfake detection works tend to exhibit poor generalization to novel generative methods due to the limited exploration in visual modalities alone. They tend to assess the attribution or detection performance of models on unseen advanced generators, coarsely, and fail to consider the synergy of the two tasks. To this end, we propose a novel gaze-guided CLIP with adaptive-enhanced fine-grained language prompts for fine-grained deepfake attribution and detection (DFAD). Specifically, we conduct a novel and fine-grained benchmark to evaluate the DFAD performance of networks on novel generators like diffusion and flow models. Additionally, we introduce a gaze-aware model based on CLIP, which is devised to enhance the generalization to unseen face forgery attacks. Built upon the novel observation that there are significant distribution differences between pristine and forged gaze vectors, and the preservation of the target gaze in facial images generated by GAN and diffusion varies significantly, we design a visual perception encoder to employ the inherent gaze differences to mine global forgery embeddings across appearance and gaze domains. We propose a gaze-aware image encoder (GIE) that fuses forgery gaze prompts extracted via a gaze encoder with common forged image embeddings to capture general attribution patterns, allowing features to be transformed into a more stable and common DFAD feature space. We build a language refinement encoder (LRE) to generate dynamically enhanced language embeddings via an adaptive-enhanced word selector for precise vision-language matching.
Extensive experiments on our benchmark show that our model outperforms the state-of-the-art by 6.56% ACC and 5.32% AUC in average performance under the attribution and detection settings, respectively. Codes will be available on GitHub.

[91] MotionScale: Reconstructing Appearance, Geometry, and Motion of Dynamic Scenes with Scalable 4D Gaussian Splatting

Haoran Zhou,Gim Hee Lee

Main category: cs.CV

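TL;DR: This paper proposes MotionScale, a Gaussian Splatting framework for reconstructing dynamic 4D scenes from monocular video; a scalable motion field and a progressive optimization strategy significantly improve geometric accuracy and motion consistency.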

Details Motivation: Existing neural rendering methods struggle to recover accurate 3D geometry and temporally consistent motion in complex environments. Method: Proposes MotionScale, built around a scalable motion field parameterized by cluster-centric basis transformations, with a progressive optimization strategy made of two decoupled propagation stages: a background extension stage (handling newly visible regions, refining camera poses, modeling transient shadows) and a foreground propagation stage (a three-stage refinement for motion consistency). Result: On real-world benchmarks, MotionScale significantly outperforms state-of-the-art methods in reconstruction quality and temporal stability. Conclusion: MotionScale achieves 4D scene reconstruction with high-fidelity structure and coherent motion over large scenes and long sequences, advancing monocular dynamic scene understanding. Abstract: Realistic reconstruction of dynamic 4D scenes from monocular videos is essential for understanding the physical world. Despite recent progress in neural rendering, existing methods often struggle to recover accurate 3D geometry and temporally consistent motion in complex environments. To address these challenges, we propose MotionScale, a 4D Gaussian Splatting framework that scales efficiently to large scenes and extended sequences while maintaining high-fidelity structural and motion coherence. At the core of our approach is a scalable motion field parameterized by cluster-centric basis transformations that adaptively expand to capture diverse and evolving motion patterns. To ensure robust reconstruction over long durations, we introduce a progressive optimization strategy comprising two decoupled propagation stages: 1) A background extension stage that adapts to newly visible regions, refines camera poses, and explicitly models transient shadows; 2) A foreground propagation stage that enforces motion consistency through a specialized three-stage refinement process. Extensive experiments on challenging real-world benchmarks demonstrate that MotionScale significantly outperforms state-of-the-art methods in both reconstruction quality and temporal stability. Project page: https://hrzhou2.github.io/motion-scale-web/.

[92] Self-Consistency for LLM-Based Motion Trajectory Generation and Verification

Jiaju Ma,R. Kenny Jones,Jiajun Wu,Maneesh Agrawala

Main category: cs.CV

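TL;DR: This paper adapts self-consistency to a visual domain to improve how accurately large language models (LLMs) generate and verify motion graphics trajectories. It models the family of trajectory shapes associated with a prompt as a prototype trajectory plus a group of geometric transformations, and recovers the shape family automatically from hierarchical relationships between transformation groups, improving generation accuracy by 4-6% and verification precision by 11%.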

Details Motivation: Self-consistency has proven effective for natural language reasoning; the goal is to extend it to visual domains, specifically to LLM-generated motion trajectories, which lack consistency and verifiability. Method: Samples diverse trajectories for the same prompt and defines consistency between trajectories through geometric transformation groups (rigid, similarity, affine, etc.); automatically recovers the shape family of a prompt using hierarchical relationships between candidate transformation groups; applies this self-consistency to both trajectory generation and verification. Result: Trajectory generation accuracy improves by 4-6%; trajectory verification precision improves by 11% over vision-language model (VLM) baselines. Conclusion: Self-consistency transfers to visual concepts such as motion trajectories once their geometric invariances are modeled; shape-family modeling with hierarchical transformation recovery offers a lightweight, unsupervised way to improve visual reasoning. Abstract: Self-consistency has proven to be an effective technique for improving LLM performance on natural language reasoning tasks in a lightweight, unsupervised manner. In this work, we study how to adapt self-consistency to visual domains. Specifically, we consider the generation and verification of LLM-produced motion graphics trajectories. Given a prompt (e.g., "Move the circle in a spiral path"), we first sample diverse motion trajectories from an LLM, and then identify groups of consistent trajectories via clustering. Our key insight is to model the family of shapes associated with a prompt as a prototype trajectory paired with a group of geometric transformations (e.g., rigid, similarity, and affine). Two trajectories can then be considered consistent if one can be transformed into the other under the warps allowable by the transformation group. We propose an algorithm that automatically recovers a shape family, using hierarchical relationships between a set of candidate transformation groups. Our approach improves the accuracy of LLM-based trajectory generation by 4-6%. We further extend our method to support verification, observing 11% precision gains over VLM baselines. Our code and dataset are available at https://majiaju.io/trajectory-self-consistency .
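
The consistency test "one trajectory can be transformed into the other under the allowed warps" can be illustrated for the similarity group with a Procrustes fit. This sketch is not the authors' algorithm: it assumes 2D trajectories already resampled to the same number of corresponding points, and the names `similarity_residual`, `is_consistent`, and the tolerance `tol` are hypothetical.

```python
import numpy as np

def similarity_residual(a, b):
    """Relative error left after the best rotation + uniform scale + translation of a onto b.

    a, b: (N, 2) arrays with point-to-point correspondence (an assumption of this sketch).
    """
    a0 = a - a.mean(axis=0)            # centering removes translation
    b0 = b - b.mean(axis=0)
    u, s, vt = np.linalg.svd(a0.T @ b0)
    # Force a proper rotation (det = +1) so mirror images do not count as similar.
    d = 1.0 if np.linalg.det(u @ vt) >= 0 else -1.0
    rot = u @ np.diag([1.0, d]) @ vt
    scale = (s[0] + d * s[1]) / (a0 ** 2).sum()
    aligned = (scale * a0) @ rot
    return float(np.linalg.norm(aligned - b0) / np.linalg.norm(b0))

def is_consistent(a, b, tol=0.05):
    """Two trajectories agree if the similarity fit leaves a small relative residual."""
    return similarity_residual(a, b) < tol

t = np.linspace(0.0, 4.0 * np.pi, 50)
spiral = np.stack([t * np.cos(t), t * np.sin(t)], axis=1)
line = np.stack([t, np.zeros_like(t)], axis=1)

theta = 0.7
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
moved_spiral = 2.5 * spiral @ rotation + np.array([3.0, -1.0])
```

Swapping the fit for a rigid or affine one gives the other candidate groups mentioned in the abstract; clustering sampled trajectories by `is_consistent` would then yield the consistent groups.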

[93] HSFM: Hard-Set-Guided Feature-Space Meta-Learning for Robust Classification under Spurious Correlations

Aryan Yazdan Parast,Khawar Islam,Soyoun Won,Basim Azam,Naveed Akhtar

Main category: cs.CV

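TL;DR: This paper proposes a bilevel meta-learning method that performs data augmentation directly in feature space to help the classifier head cope with spurious correlations, improving robustness and generalization under distribution shift and on minority-group samples.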

Details Motivation: Deep neural networks often rely on spurious features, which makes them brittle under distribution shift and on minority-group samples; prior work suggests the problem lies mostly in the classifier head rather than the feature extractor, so the head should be optimized directly. Method: Proposes a bilevel meta-learning framework that applies learnable edits to support-set features in feature space (the backbone output) so that, after a few inner-loop updates, the classifier has lower loss on hard examples and the worst group; it involves no pixel-space operations or end-to-end training, making it efficient and stable. Result: The method substantially improves performance under distribution shift and on minority groups while needing only a few minutes of training on a single GPU; CLIP-based visualizations show the feature edits are semantically meaningful and aligned with spurious attributes. Conclusion: Meta-learned editing in feature space is an efficient, stable, and interpretable strategy for disentangling spurious correlations and making the classifier head more robust. Abstract: Deep neural networks often rely on spurious features to make predictions, which makes them brittle under distribution shift and on samples where the spurious correlation does not hold (e.g., minority-group examples). Recent studies have shown that, even in such settings, the feature extractor of an Empirical Risk Minimization (ERM)-trained model can learn rich and informative representations, and that much of the failure may be attributed to the classifier head. In particular, retraining a lightweight head while keeping the backbone frozen can substantially improve performance on shifted distributions and minority groups. Motivated by this observation, we propose a bilevel meta-learning method that performs augmentation directly in feature space to improve spurious correlation handling in the classifier head. Our method learns support-side feature edits such that, after a small number of inner-loop updates on the edited features, the classifier achieves lower loss on hard examples and improved worst-group performance. By operating at the backbone output rather than in pixel space or through end-to-end optimization, the method is highly efficient and stable, requiring only a few minutes of training on a single GPU. We further validate our method with CLIP-based visualizations, showing that the learned feature-space updates induce semantically meaningful shifts aligned with spurious attributes.

[94] FOSCU: Feasibility of Synthetic MRI Generation via Duo-Diffusion Models for Enhancement of 3D U-Nets in Hepatic Segmentation

Youngung Han,Kyeonghun Kim,Seoyoung Ju,Yeonju Jean,Minkyung Cha,Seohyoung Park,Hyeonseok Jung,Nam-Joon Kim,Woo Kyoung Jeong,Ken Ying-Kai Liao,Hyuk-Jae Lee

Main category: cs.CV

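TL;DR: This paper proposes FOSCU, a framework that combines Duo-Diffusion (a 3D latent diffusion model with ControlNet) with an enhanced 3D U-Net training pipeline to generate high-resolution, anatomically realistic synthetic MRI volumes and matching segmentation labels, easing data scarcity and annotation cost in medical imaging; on 720 abdominal MRI scans it raises the Dice score by 0.67% and lowers FID by 36.4%.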

Details Motivation: Medical image segmentation faces systemic barriers (restricted access to clinical data, costly annotation, and limited data volume) that hinder the development of robust segmentation algorithms. Method: Proposes FOSCU, which includes Duo-Diffusion (a 3D latent diffusion model with ControlNet) to jointly generate high-quality synthetic MRI volumes and their segmentation labels, using segmentation-conditioned diffusion to preserve spatial consistency and anatomical detail, together with an enhanced 3D U-Net training pipeline. Result: On 720 abdominal MRI scans, models trained on real plus synthetic data gain 0.67% in mean Dice score over models trained on real data only, and Fréchet Inception Distance (FID) drops by 36.4%, indicating clearly better image fidelity. Conclusion: FOSCU eases the medical imaging data bottleneck by improving model performance with high-quality synthetic data, offering a feasible approach for low-resource segmentation tasks. Abstract: Medical image segmentation faces fundamental challenges including restricted access to clinical datasets through Picture Archiving and Communication Systems (PACS), costly annotation, and data shortage. These systemic barriers significantly impede the development of robust segmentation algorithms. To address these challenges, we propose FOSCU, which integrates Duo-Diffusion, a 3D latent diffusion model with ControlNet that simultaneously generates high-resolution, anatomically realistic synthetic MRI volumes and corresponding segmentation labels, and an enhanced 3D U-Net training pipeline. Duo-Diffusion employs segmentation-conditioned diffusion to ensure spatial consistency and precise anatomical detail in the generated data. Experimental evaluation on 720 abdominal MRI scans shows that models trained with combined real and synthetic data yield a mean Dice score gain of 0.67% over those using only real data, and achieve a 36.4% reduction in Fréchet Inception Distance (FID), reflecting enhanced image fidelity.
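
For readers unfamiliar with the Dice score reported above, here is a minimal sketch of how it is commonly computed for binary 3D segmentation masks. The function name `dice_score`, the smoothing constant `eps`, and the toy volumes are illustrative; the paper's exact evaluation code is not shown in the source.

```python
import numpy as np

def dice_score(pred, target, eps=1e-8):
    """Dice = 2|P ∩ T| / (|P| + |T|) for binary masks of any shape.

    eps avoids division by zero when both masks are empty (then the score is 1).
    """
    pred = np.asarray(pred).astype(bool)
    target = np.asarray(target).astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return float((2.0 * intersection + eps) / (pred.sum() + target.sum() + eps))

# Toy 4x4x4 volumes: each mask covers two slices and they overlap in one slice.
pred = np.zeros((4, 4, 4), dtype=np.uint8)
target = np.zeros((4, 4, 4), dtype=np.uint8)
pred[:2] = 1       # slices 0-1, 32 voxels
target[1:3] = 1    # slices 1-2, 32 voxels, 16 shared with pred
overlap_dice = dice_score(pred, target)
```

A mean Dice score, as in the abstract, would average this value over all scans (and over organ classes if the task is multi-class).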

[95] CIPHER: Counterfeit Image Pattern High-level Examination via Representation

Kyeonghun Kim,Youngung Han,Seoyoung Ju,Yeonju Jean,YooHyun Kim,Minseo Choi,SuYeon Lim,Kyungtae Park,Seungwoo Baek,Sieun Hyeon,Nam-Joon Kim,Hyuk-Jae Lee

Main category: cs.CV

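TL;DR: This paper proposes CIPHER, a framework that reuses and fine-tunes the discriminators of generative models (such as ProGAN and diffusion models) to extract generation-agnostic forgery features, markedly improving cross-model deepfake detection.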

Details Motivation: As GANs and diffusion models advance, synthetic faces become increasingly realistic, raising the risk of misinformation, fraud, and identity abuse; detectors that stay robust across many generative models are urgently needed. Method: CIPHER systematically reuses and fine-tunes discriminators from generative models (e.g., ProGAN discriminators, diffusion models), extracting scale-adaptive and temporal-consistency features to capture generation-agnostic artifacts that conventional detectors tend to miss. Result: Across nine state-of-the-art generative models, CIPHER reaches up to 74.33% F1-score in cross-model detection, more than 30% above existing ViT-based detectors on average; on CIFAKE it reaches up to 88% F1 while baselines are near zero. Conclusion: Discriminator reuse with cross-model fine-tuning is effective, and CIPHER offers a more general and robust route to deepfake detection as generative techniques evolve. Abstract: The rapid progress of generative adversarial networks (GANs) and diffusion models has enabled the creation of synthetic faces that are increasingly difficult to distinguish from real images. This progress, however, has also amplified the risks of misinformation, fraud, and identity abuse, underscoring the urgent need for detectors that remain robust across diverse generative models. In this work, we introduce Counterfeit Image Pattern High-level Examination via Representation (CIPHER), a deepfake detection framework that systematically reuses and fine-tunes discriminators originally trained for image generation. By extracting scale-adaptive features from ProGAN discriminators and temporal-consistency features from diffusion models, CIPHER captures generation-agnostic artifacts that conventional detectors often overlook. Through extensive experiments across nine state-of-the-art generative models, CIPHER demonstrates superior cross-model detection performance, achieving up to 74.33% F1-score and outperforming existing ViT-based detectors by over 30% in F1-score on average. Notably, our approach maintains robust performance on challenging datasets where baseline methods fail, with up to 88% F1-score on CIFAKE compared to near-zero performance from conventional detectors. These results validate the effectiveness of discriminator reuse and cross-model fine-tuning, establishing CIPHER as a promising approach toward building more generalizable and robust deepfake detection systems in an era of rapidly evolving generative technologies.

[96] Less Is More? Selective Visual Attention to High-Importance Regions for Multimodal Radiology Summarization

Mst. Fahmida Sultana Naznin,Adnan Ibney Faruq,Mushfiqur Rahman,Niloy Kumar Mondal,Md. Mehedi Hasan Shawon,Md Rakibul Hasan

Main category: cs.CV

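TL;DR: This paper proposes ViTAS, which selectively focuses on pathology-relevant visual patches instead of whole images, substantially improving radiology report summarization and reaching SOTA results on MIMIC-CXR.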

Details Motivation: Existing multimodal models for radiology report summarization are often hampered by visual noise and struggle to beat strong text-only baselines; the authors question two assumptions: that more visual input is always better, and that multimodality adds limited value. Method: Proposes ViTAS (Visual-Text Attention Summarizer), which combines MedSAM2 lung segmentation, bidirectional cross-attention for multi-view fusion, Shapley-guided adaptive patch clustering, and hierarchical visual tokenization feeding a ViT. Result: Achieves SOTA on MIMIC-CXR with 29.25% BLEU-4 and 69.83% ROUGE-L; qualitative analysis shows better factual consistency, and expert human evaluation scores are the highest. Conclusion: Less but more relevant visual input is not only sufficient but better than full-image input, suggesting a new paradigm for multimodal radiology summarization. Abstract: Automated radiology report summarization aims to distill verbose findings into concise clinical impressions, but existing multimodal models often struggle with visual noise and fail to meaningfully improve over strong text-only baselines in the FINDINGS → IMPRESSION transformation. We challenge two prevailing assumptions: (1) that more visual input is always better, and (2) that multimodal models add limited value when findings already contain rich image-derived detail. Through controlled ablations on the MIMIC-CXR benchmark, we show that selectively focusing on pathology-relevant visual patches rather than full images yields substantially better performance. We introduce ViTAS, Visual-Text Attention Summarizer, a multi-stage pipeline that combines ensemble-guided MedSAM2 lung segmentation, bidirectional cross-attention for multi-view fusion, Shapley-guided adaptive patch clustering, and hierarchical visual tokenization feeding a ViT. ViTAS achieves SOTA results with 29.25% BLEU-4 and 69.83% ROUGE-L, improved factual alignment in qualitative analysis, and the highest expert-rated human evaluation scores. Our findings demonstrate that less but more relevant visual input is not only sufficient but superior for multimodal radiology summarization.

[97] Uncertainty-Aware Trajectory Prediction: A Unified Framework Harnessing Positional and Semantic Uncertainties

Jintao Sun,Hu Zhang,Gangyi Ding,Zhedong Zheng

Main category: cs.CV

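TL;DR: This paper proposes a framework that jointly models positional and semantic uncertainty to make trajectory prediction more robust, validated on the nuScenes dataset.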

Details Motivation: Real-time maps carry two kinds of uncertainty, positional inaccuracy (from sensor limits or occlusion) and semantic errors (from misread scene context), and both degrade trajectory prediction. Method: Uses a dual-head architecture that estimates semantic and positional predictions independently in a dual-pass manner and derives their variances end-to-end as uncertainty indicators; these uncertainties are then fused with the corresponding predictions to make trajectory forecasts more robust. Result: On nuScenes, the method effectively quantifies map uncertainty along both positional and semantic dimensions and consistently improves several baselines on minADE, minFDE, and MR. Conclusion: Jointly modeling and explicitly incorporating positional and semantic uncertainty markedly improves the robustness and accuracy of trajectory prediction models. Abstract: Trajectory prediction seeks to forecast the future motion of dynamic entities, such as vehicles and pedestrians, given a temporal horizon of historical movement data and environmental context. A central challenge in this domain is the inherent uncertainty in real-time maps, arising from two primary sources: (1) positional inaccuracies due to sensor limitations or environmental occlusions, and (2) semantic errors stemming from misinterpretations of scene context. To address these challenges, we propose a novel unified framework that jointly models positional and semantic uncertainties and explicitly integrates them into the trajectory prediction pipeline. Our approach employs a dual-head architecture to independently estimate semantic and positional predictions in a dual-pass manner, deriving prediction variances as uncertainty indicators in an end-to-end fashion. These uncertainties are subsequently fused with the semantic and positional predictions to enhance the robustness of trajectory forecasts. We evaluate our uncertainty-aware framework on the nuScenes real-world driving dataset, conducting extensive experiments across four map estimation methods and two trajectory prediction baselines. Results verify that our method (1) effectively quantifies map uncertainties through both positional and semantic dimensions, and (2) consistently improves the performance of existing trajectory prediction models across multiple metrics, including minimum Average Displacement Error (minADE), minimum Final Displacement Error (minFDE), and Miss Rate (MR). Code will be available at https://github.com/JT-Sun/UATP.
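
The three metrics named above follow standard definitions for multi-modal trajectory prediction, sketched here for K candidate futures per agent. The 2.0 m miss threshold is a commonly used default and an assumption in this sketch, as are the function names; the source does not state the paper's exact settings.

```python
import numpy as np

def min_ade_fde(preds, gt):
    """Return (minADE, minFDE) for one agent.

    preds: (K, T, 2) array of K candidate future trajectories.
    gt:    (T, 2) ground-truth future trajectory.
    """
    dists = np.linalg.norm(preds - gt[None], axis=-1)   # (K, T) per-step errors
    min_ade = dists.mean(axis=1).min()                  # best average error over candidates
    min_fde = dists[:, -1].min()                        # best final-step error over candidates
    return float(min_ade), float(min_fde)

def miss_rate(min_fdes, threshold_m=2.0):
    """Fraction of agents whose best final-step error exceeds the threshold (assumed 2.0 m)."""
    return float(np.mean(np.asarray(min_fdes) > threshold_m))

gt = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
preds = np.stack([gt + np.array([0.0, 1.0]),    # candidate offset by 1 m
                  gt + np.array([0.0, 3.0])])   # candidate offset by 3 m
ade, fde = min_ade_fde(preds, gt)
mr = miss_rate([1.0, 2.5, 0.3, 4.0])
```

Because all three metrics take the best of K candidates, they reward covering the true future with at least one hypothesis rather than averaging over all of them.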

[98] StereoVGGT: A Training-Free Visual Geometry Transformer for Stereo Vision

Ziyang Chen,Yansong Qu,You Shen,Xuan Cheng,Liujuan Cao

Main category: cs.CV

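TL;DR: This paper proposes StereoVGGT, a feature backbone designed for stereo vision tasks. It freezes the pretrained VGGT model and adds a training-free feature adjustment pipeline that mitigates VGGT's degradation of geometric details, thereby exploiting its embedded camera calibration knowledge and reaching SOTA performance on the KITTI benchmark.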

Details Motivation: 现有立体视觉骨干网络(如基于单目深度估计或视觉基础模型的方法)大多未在预训练中显式引入相机位姿等几何先验,导致缺乏必要的空间约束,限制了性能;而虽有预训练于大量3D先验(含相机姿态)的VGGT模型,但其直接用于立体视觉时因几何细节退化而表现不佳。 Method: 提出StereoVGGT:冻结VGGT主干,并设计一个无需训练的特征调整流程,以校正其在特征提取过程中造成的几何细节退化,从而适配双目立体视觉对精确几何关系的需求。 Result: 基于StereoVGGT的立体匹配网络在KITTI基准上取得所有已发表方法中的第1名。 Conclusion: StereoVGGT成功桥接了通用3D基础模型与专用立体视觉任务之间的鸿沟,证明了显式利用相机几何先验可显著提升立体视觉性能,是一种高效、即插即用的立体视觉骨干网络。 Abstract: Driven by the advancement of 3D devices, stereo vision tasks including stereo matching and stereo conversion have emerged as a critical research frontier. Contemporary stereo vision backbones typically rely on either monocular depth estimation (MDE) models or visual foundation models (VFMs). Crucially, these models are predominantly pretrained without explicit supervision of camera poses. Given that such geometric knowledge is indispensable for stereo vision, the absence of explicit spatial constraints constitutes a significant performance bottleneck for existing architectures. Recognizing that the Visual Geometry Grounded Transformer (VGGT) operates as a foundation model pretrained on extensive 3D priors, including camera poses, we investigate its potential as a robust backbone for stereo vision tasks. Nevertheless, empirical results indicate that its direct application to stereo vision yields suboptimal performance. We observe that VGGT suffers from a more significant degradation of geometric details during feature extraction. Such characteristics conflict with the requirements of binocular stereo vision, thereby constraining its efficacy for relative tasks. To bridge this gap, we propose StereoVGGT, a feature backbone specifically tailored for stereo vision. By leveraging the frozen VGGT and introducing a training-free feature adjustment pipeline, we mitigate geometric degradation and harness the latent camera calibration knowledge embedded within the model. 
StereoVGGT-based stereo matching network achieved the $1^{st}$ rank among all published methods on the KITTI benchmark, validating that StereoVGGT serves as a highly effective backbone for stereo vision.

[99] Assessing Multimodal Chronic Wound Embeddings with Expert Triplet Agreement

Fabian Kabus,Julia Hindel,Jelena Bratulić,Meropi Karakioulaki,Ayush Gupta,Cristina Has,Thomas Brox,Abhinav Valada,Harald Binder

Main category: cs.CV

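TL;DR: This paper proposes TriDerm, a multimodal framework that evaluates embedding spaces with expert triplet judgments and learns interpretable representations from wound images, boundary masks, and expert reports. It reaches 73.5% agreement with experts on RDEB case-similarity retrieval, outperforming single-modality baselines.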

Details Motivation: 现有通用基础模型难以可靠捕捉罕见异质性皮肤病RDEB的临床相关特征,且缺乏对专家判断一致性的结构化评估方法。 Method: 提出基于专家三元组判断(ordinal comparisons)的嵌入空间评估方法;构建TriDerm框架:视觉端采用病灶级注意力池化与非对比表征学习适配视觉基础模型,文本端通过大语言模型提示与软序数嵌入(SOE)提取医学意义表征;最后融合双模态表征。 Result: 双模态融合在专家一致性上达73.5%,较最优单模态基线提升5.6个百分点;验证了视觉与文本模态表征互补。 Conclusion: 专家驱动的三元组评估与多模态融合能有效提升罕见皮肤病表征的临床相关性与可解释性,TriDerm为小样本罕见病AI建模提供了可行范式。 Abstract: Recessive dystrophic epidermolysis bullosa (RDEB) is a rare genetic skin disorder for which clinicians greatly benefit from finding similar cases using images and clinical text. However, off-the-shelf foundation models do not reliably capture clinically meaningful features for this heterogeneous, long-tail disease, and structured measurement of agreement with experts is challenging. To address these gaps, we propose evaluating embedding spaces with expert ordinal comparisons (triplet judgments), which are fast to collect and encode implicit clinical similarity knowledge. We further introduce TriDerm, a multimodal framework that learns interpretable wound representations from small cohorts by integrating wound imagery, boundary masks, and expert reports. On the vision side, TriDerm adapts visual foundation models to RDEB using wound-level attention pooling and non-contrastive representation learning. For text, we prompt large language models with comparison queries and recover medically meaningful representations via soft ordinal embeddings (SOE). We show that visual and textual modalities capture complementary aspects of wound phenotype, and that fusing both modalities yields 73.5% agreement with experts, outperforming the best off-the-shelf single-modality foundation model by over 5.6 percentage points. We make the expert annotation tool, model code and representative dataset samples publicly available.
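
The triplet-based evaluation described in this entry can be illustrated with a short sketch: an embedding agrees with an expert triplet when the item the expert judged closer to the anchor is also closer in embedding space. The function names, the use of cosine distance, and the dictionary layout are assumptions for illustration, not the authors' implementation.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity for two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def triplet_agreement(embeddings, triplets):
    """Fraction of expert triplets that the embedding reproduces.

    embeddings: dict mapping a case id to its vector.
    triplets: list of (anchor, closer, farther) ids, where the expert
              judged `closer` to be more similar to `anchor`.
    """
    hits = 0
    for anchor, closer, farther in triplets:
        d_close = cosine_distance(embeddings[anchor], embeddings[closer])
        d_far = cosine_distance(embeddings[anchor], embeddings[farther])
        if d_close < d_far:
            hits += 1
    return hits / len(triplets)
```

Under this definition an uninformative embedding would score about 50%, which gives a reference point for a figure such as the 73.5% reported above.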

[100] PromptForge-350k: A Large-Scale Dataset and Contrastive Framework for Prompt-Based AI Image Forgery Localization

Jianpeng Wang,Haoyu Wang,Baoying Chen,Jishen Zeng,Yiming Qin,Yiqi Yang,Zhongjie Ba

Main category: cs.CV

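TL;DR: This paper proposes a localization approach for forgeries made with prompt-based AI image editing, comprising an automatic mask annotation framework, the large-scale dataset PromptForge-350k, and a new network, ICL-Net, which achieves SOTA localization accuracy, robustness, and generalization.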

Details Motivation: 提示驱动的AI图像编辑技术普及加剧了恶意内容伪造与虚假信息风险,但针对此类新兴编辑方式的伪造定位方法研究严重不足。 Method: 构建基于关键点对齐与语义空间相似性的全自动掩码标注框架;发布覆盖4种前沿提示编辑模型的大规模伪造定位数据集PromptForge-350k;提出具有三流主干与图像内对比学习机制的ICL-Net网络。 Result: 在PromptForge-350k上IoU达62.5%,超越SOTA 5.1%;抗常见退化能力强(IoU下降<1%);在未见编辑模型上平均IoU达41.5%。 Conclusion: 所提框架、数据集与ICL-Net显著推动了提示驱动图像伪造定位研究,提升了检测精度、鲁棒性与跨模型泛化能力。 Abstract: The rapid democratization of prompt-based AI image editing has recently exacerbated the risks associated with malicious content fabrication and misinformation. However, forgery localization methods targeting these emerging editing techniques remain significantly under-explored. To bridge this gap, we first introduce a fully automated mask annotating framework that leverages keypoint alignment and semantic space similarity to generate precise ground-truth masks for edited regions. Based on this framework, we construct PromptForge-350k, a large-scale forgery localization dataset covering four state-of-the-art prompt-based AI image editing models, thereby mitigating the data scarcity in this domain. Furthermore, we propose ICL-Net, an effective forgery localization network featuring a triple-stream backbone and intra-image contrastive learning. This design enables the model to capture highly robust and generalizable forensic features. Extensive experiments demonstrate that our method achieves an IoU of 62.5% on PromptForge-350k, outperforming SOTA methods by 5.1%. Additionally, it exhibits strong robustness against common degradations with an IoU drop of less than 1%, and shows promising generalization capabilities on unseen editing models, achieving an average IoU of 41.5%.
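
The IoU figures in this entry compare a predicted forgery mask with a ground-truth mask. A minimal sketch of that metric for binary masks follows; the nested-list mask format and the choice to return 1.0 when both masks are empty are assumptions made for the example.

```python
def mask_iou(pred, target):
    """Intersection over union of two binary masks of equal shape.

    pred, target: nested lists of 0/1 values (rows of pixels).
    Returns 1.0 when both masks are empty (an assumed convention).
    """
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    return intersection / union if union else 1.0
```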

[101] Extend3D: Town-Scale 3D Generation

Seungwoo Yoon,Jinmo Kim,Jaesik Park

Main category: cs.CV

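TL;DR: This paper proposes Extend3D, a training-free method for generating a 3D scene from a single image. It extends the latent space of an object-centric generative model, generates and couples overlapping patches, initializes from a point cloud prior, refines iteratively with SDEdit, and uses an "under-noising" strategy to achieve high-quality 3D reconstruction.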

Details Motivation: 解决对象中心3D生成模型在表示大范围场景时受限于固定大小潜在空间的问题,以及子场景生成中几何结构和纹理保真度不足的问题。 Method: 扩展潜在空间(x/y方向),划分为重叠块并逐块生成与时间步耦合;利用单目深度估计获得点云先验初始化场景;通过SDEdit迭代修复遮挡区域,提出'欠去噪'概念完成3D结构补全;在去噪过程中联合优化扩展潜在表示,并引入3D感知优化目标提升几何与纹理质量。 Result: 在人类偏好评估与定量实验中均优于现有方法,显著提升3D场景生成质量。 Conclusion: Extend3D是一种高效、无需训练的单图像3D场景生成框架,在保持对象中心建模优势的同时,有效拓展其对宽广场景的表达能力与生成精度。 Abstract: In this paper, we propose Extend3D, a training-free pipeline for 3D scene generation from a single image, built upon an object-centric 3D generative model. To overcome the limitations of fixed-size latent spaces in object-centric models for representing wide scenes, we extend the latent space in the $x$ and $y$ directions. Then, by dividing the extended latent space into overlapping patches, we apply the object-centric 3D generative model to each patch and couple them at each time step. Since patch-wise 3D generation with image conditioning requires strict spatial alignment between image and latent patches, we initialize the scene using a point cloud prior from a monocular depth estimator and iteratively refine occluded regions through SDEdit. We discovered that treating the incompleteness of 3D structure as noise during 3D refinement enables 3D completion via a concept, which we term under-noising. Furthermore, to address the sub-optimality of object-centric models for sub-scene generation, we optimize the extended latent during denoising, ensuring that the denoising trajectories remain consistent with the sub-scene dynamics. To this end, we introduce 3D-aware optimization objectives for improved geometric structure and texture fidelity. We demonstrate that our method yields better results than prior methods, as evidenced by human preference and quantitative experiments.

[102] AA-Splat: Anti-Aliased Feed-forward Gaussian Splatting

Taewoo Suh,Sungpyo Kim,Jongmin Park,Munchurl Kim

Main category: cs.CV

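TL;DR: This paper proposes AA-Splat, a feed-forward 3D Gaussian Splatting (FF-3DGS) method that supports anti-aliased rendering at any resolution. Its opacity-balanced band-limiting (OBBL) design addresses the severe rendering artifacts that existing methods produce at out-of-distribution sampling rates.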

Details Motivation: Existing FF-3DGS methods use incorrect screen-space dilation filters, which cause severe rendering artifacts at non-standard sampling rates. Method: The proposed AA-Splat model has two core components: (1) a 3D band-limiting post-filter that integrates multi-view maximal frequency bounds into the feed-forward reconstruction pipeline, band-limiting the 3D scene representation and eliminating degenerate Gaussians; and (2) an Opacity Balancing (OB) mechanism that seamlessly integrates pixel-aligned Gaussian primitives and compensates for the increased overlap after they are expanded. Result: Across all resolutions from 4× to 1/4×, NVS performance improves by an average of 5.4 to 7.5 dB PSNR over the SOTA baseline DepthSplat. Conclusion: AA-Splat achieves robust anti-aliased rendering at any resolution and significantly improves the quality and generalization of sparse-view 3D reconstruction and novel view synthesis. Abstract: Feed-forward 3D Gaussian Splatting (FF-3DGS) has emerged as a fast and robust solution for sparse-view 3D reconstruction and novel view synthesis (NVS). However, existing FF-3DGS methods are built on incorrect screen-space dilation filters, causing severe rendering artifacts when rendering at out-of-distribution sampling rates. We propose an FF-3DGS model, called AA-Splat, to enable robust anti-aliased rendering at any resolution. AA-Splat utilizes an opacity-balanced band-limiting (OBBL) design, which combines two components: a 3D band-limiting post-filter that integrates multi-view maximal frequency bounds into the feed-forward reconstruction pipeline, effectively band-limiting the resulting 3D scene representations and eliminating degenerate Gaussians; and an Opacity Balancing (OB) mechanism that seamlessly integrates all pixel-aligned Gaussian primitives into the rendering process, compensating for the increased overlap between expanded Gaussian primitives. AA-Splat demonstrates drastic improvements with average 5.4$\sim$7.5dB PSNR gains on NVS performance over a state-of-the-art (SOTA) baseline, DepthSplat, at all resolutions, between $4\times$ and $1/4\times$. Code will be made available.
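
To put the reported PSNR gains in context, here is a minimal sketch of the standard PSNR formula, $10\log_{10}(\mathrm{MAX}^2/\mathrm{MSE})$. The nested-list image format and the default peak value of 1.0 are assumptions for the example.

```python
import math

def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized images.

    reference, rendered: nested lists of pixel intensities (rows of values).
    max_value: peak intensity (1.0 for normalized images, 255 for 8-bit).
    """
    squared_errors = [
        (r - x) ** 2
        for ref_row, ren_row in zip(reference, rendered)
        for r, x in zip(ref_row, ren_row)
    ]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Because the scale is logarithmic, a gain of 5.4 to 7.5 dB on a single image corresponds to a mean squared error that is lower by a factor of roughly 3.5 to 5.6.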

[103] Hallucination-aware intermediate representation edit in large vision-language models

Wei Suo,Hanzu Zhang,Lijun Zhang,Ji Ma,Peng Wang,Yanning Zhang

Main category: cs.CV

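TL;DR: This paper proposes HIRE, a framework that dynamically detects and edits hallucination representations to eliminate hallucinations in vision-language models at low cost, achieving SOTA performance on multiple benchmarks.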

Details Motivation: 现有大视觉语言模型存在严重幻觉问题,而当前缓解方法(如重训练和对比解码)分别面临高计算成本或双重推理开销,实用性受限。 Method: 提出HIRE框架,通过动态检测幻觉表征并对其进行针对性编辑,仅引入极小额外计算开销。 Result: 在多个现有基准上达到最先进(SOTA)性能,实验证明其高效、鲁棒且具备强幻觉可控性。 Conclusion: HIRE是一种轻量、高效且实用的幻觉消除方法,显著提升了视觉-语言模型的可靠性与部署可行性。 Abstract: Large Vision-Language Models have demonstrated exceptional performance in multimodal reasoning and complex scene understanding. However, these models still face significant hallucination issues, where outputs contradict visual facts. Recent research on hallucination mitigation has focused on retraining methods and Contrastive Decoding (CD) methods. While both methods perform well, retraining methods require substantial training resources, and CD methods introduce dual inference overhead. These factors hinder their practical applicability. To address the above issue, we propose a framework for dynamically detecting hallucination representations and performing hallucination-eliminating edits on these representations. With minimal additional computational cost, we achieve state-of-the-art performance on existing benchmarks. Extensive experiments demonstrate the effectiveness of our approach, highlighting its efficient and robust hallucination elimination capability and its powerful controllability over hallucinations. Code is available at https://github.com/ASGO-MM/HIRE

[104] AGFT: Alignment-Guided Fine-Tuning for Zero-Shot Adversarial Robustness of Vision-Language Models

Yubo Cui,Xianchao Guan,Zijun Xiong,Zheng Zhang

Main category: cs.CV

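TL;DR: This paper proposes an Alignment-Guided Fine-Tuning (AGFT) framework that uses soft alignment distributions and a distribution consistency calibration mechanism to improve zero-shot adversarial robustness while preserving the cross-modal semantic structure of pretrained vision-language models.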

Details Motivation: 现有基于分类引导的对抗微调方法会破坏预训练的跨模态对齐,削弱视觉-文本对应关系,从而损害零样本性能。 Method: AGFT利用原始模型的概率预测进行文本引导的对抗训练,通过软对齐分布使对抗视觉特征与文本嵌入对齐;并引入分布一致性校准机制,使鲁棒模型输出匹配温度缩放后的预训练模型预测。 Result: 在多个零样本基准上,AGFT显著优于现有最先进方法,同时大幅提升零样本对抗鲁棒性。 Conclusion: AGFT在不牺牲跨模态语义结构的前提下,有效提升了视觉-语言模型的零样本对抗鲁棒性,为安全可靠的多模态模型部署提供了新思路。 Abstract: Pre-trained vision-language models (VLMs) exhibit strong zero-shot generalization but remain vulnerable to adversarial perturbations. Existing classification-guided adversarial fine-tuning methods often disrupt pre-trained cross-modal alignment, weakening visual-textual correspondence and degrading zero-shot performance. In this paper, we propose an Alignment-Guided Fine-Tuning (AGFT) framework that enhances zero-shot adversarial robustness while preserving the cross-modal semantic structure. Unlike label-based methods that rely on hard labels and fail to maintain the relative relationships between image and text, AGFT leverages the probabilistic predictions of the original model for text-guided adversarial training, which aligns adversarial visual features with textual embeddings via soft alignment distributions, improving zero-shot adversarial robustness. To address structural discrepancies introduced by fine-tuning, we introduce a distribution consistency calibration mechanism that adjusts the robust model output to match a temperature-scaled version of the pre-trained model predictions. Extensive experiments across multiple zero-shot benchmarks demonstrate that AGFT outperforms state-of-the-art methods while significantly improving zero-shot adversarial robustness.
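
The calibration step in this entry matches the robust model's output to a temperature-scaled version of the pre-trained model's predictions. The sketch below shows one plausible form of such a loss, a KL divergence between the two distributions. The KL direction, the temperature value, and the function names are assumptions; the abstract does not specify them.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def consistency_loss(robust_logits, pretrained_logits, temperature=2.0):
    """Assumed form: KL from the temperature-scaled pretrained
    distribution (target) to the robust model's distribution."""
    target = softmax(pretrained_logits, temperature)
    prediction = softmax(robust_logits)
    return kl_divergence(target, prediction)
```

A temperature above 1 flattens the target distribution, so the robust model is pulled toward the relative ordering of image-text similarities rather than toward a single hard label.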

[105] Native-Domain Cross-Attention for Camera-LiDAR Extrinsic Calibration Under Large Initial Perturbations

Ni Ou,Zhuo Chen,Xinru Zhang,Junzheng Wang

Main category: cs.CV

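TL;DR: This paper proposes an extrinsic-aware cross-attention framework that aligns image patches and LiDAR point groups directly in their native domains. It avoids the geometric distortion introduced by depth-map projection and substantially improves calibration accuracy and robustness under large initial errors.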

Details Motivation: 现有基于学习的方法通常将激光雷达点投影到深度图中进行特征融合,这会扭曲3D几何结构,并在外参初始化远离真值时降低性能。 Method: 提出一种外参感知的跨注意力机制,直接在图像块和激光雷达点群的原生域中建模对应关系,并显式地将外参假设注入注意力过程,实现几何一致的跨模态交互。 Result: 在KITTI和nuScenes数据集上,该方法在精度和鲁棒性上均超越现有最优方法;在大外参扰动下,KITTI和nuScenes上的准确标定率分别达88%和99%。 Conclusion: 所提方法通过避免深度图投影、引入外参假设到注意力机制中,实现了更鲁棒、几何一致的相机-激光雷达外参标定。 Abstract: Accurate camera-LiDAR fusion relies on precise extrinsic calibration, which fundamentally depends on establishing reliable cross-modal correspondences under potentially large misalignments. Existing learning-based methods typically project LiDAR points into depth maps for feature fusion, which distorts 3D geometry and degrades performance when the extrinsic initialization is far from the ground truth. To address this issue, we propose an extrinsic-aware cross-attention framework that directly aligns image patches and LiDAR point groups in their native domains. The proposed attention mechanism explicitly injects extrinsic parameter hypotheses into the correspondence modeling process, enabling geometry-consistent cross-modal interaction without relying on projected 2D depth maps. Extensive experiments on the KITTI and nuScenes benchmarks demonstrate that our method consistently outperforms state-of-the-art approaches in both accuracy and robustness. Under large extrinsic perturbations, our approach achieves accurate calibration in 88% of KITTI cases and 99% of nuScenes cases, substantially surpassing the second-best baseline. We have open sourced our code on https://github.com/gitouni/ProjFusion to benefit the community.

[106] Adversarial Prompt Injection Attack on Multimodal Large Language Models

Meiwen Ding,Song Xia,Chenqi Kong,Xudong Jiang

Main category: cs.CV

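TL;DR: This paper proposes an imperceptible visual prompt injection attack on multimodal large language models (MLLMs). It adaptively embeds a malicious text prompt in the input image and optimizes an imperceptible visual perturbation so that the model follows the injected instruction; experiments show it outperforms existing methods on multiple closed-source MLLMs.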

Details Motivation: 现有提示注入攻击主要依赖可感知的文本或视觉提示,而真实场景中需要更隐蔽、不可被人类察觉的攻击方式,尤其针对日益广泛应用的闭源多模态大语言模型。 Method: 提出一种自适应视觉提示注入方法:1)在图像上叠加有界文本以提供语义引导;2)迭代优化不可见视觉扰动,使其在粗粒度和细粒度特征层面同时对齐恶意视觉目标(文本渲染图像)与文本目标;3)动态优化视觉目标以提升语义保真度与跨模型迁移性。 Result: 在两个多模态理解任务、多个闭源MLLM(如GPT-4V、Gemini等)上验证了方法有效性,攻击成功率显著高于现有文本/可见视觉注入方法,且扰动对人眼不可见。 Conclusion: 不可感知视觉提示注入是一种切实可行且高威胁的新型攻击范式,揭示了当前闭源MLLM在视觉-语言对齐机制上的深层脆弱性,为后续鲁棒性研究与防御设计提供了重要启示。 Abstract: Although multimodal large language models (MLLMs) are increasingly deployed in real-world applications, their instruction-following behavior leaves them vulnerable to prompt injection attacks. Existing prompt injection methods predominantly rely on textual prompts or perceptible visual prompts that are observable by human users. In this work, we study imperceptible visual prompt injection against powerful closed-source MLLMs, where adversarial instructions are embedded in the visual modality. Our method adaptively embeds the malicious prompt into the input image via a bounded text overlay to provide semantic guidance. Meanwhile, the imperceptible visual perturbation is iteratively optimized to align the feature representation of the attacked image with those of the malicious visual and textual targets at both coarse- and fine-grained levels. Specifically, the visual target is instantiated as a text-rendered image and progressively refined during optimization to more faithfully represent the desired semantics and improve transferability. Extensive experiments on two multimodal understanding tasks across multiple closed-source MLLMs demonstrate the superior performance of our approach compared to existing methods.

[107] Multimodal Models Meet Presentation Attack Detection on ID Documents

Marina Villanueva,Juan M. Espin,Juan E. Tapia

Main category: cs.CV

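TL;DR: This paper explores applying multimodal models (such as Paligemma, Llava, and Qwen) to presentation attack detection (PAD) on ID documents, fusing visual and textual information to improve detection, but the experiments show that existing models perform poorly on this task.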

Details Motivation: 传统仅依赖视觉特征的PAD系统难以应对复杂伪造攻击,需引入文本等多模态信息提升鲁棒性。 Method: 采用预训练多模态大模型(Paligemma、Llava、Qwen),融合图像特征与文档元数据(如证件类型、签发机构、日期)进行ID文档PAD检测。 Result: 实验结果表明,所用多模态模型在ID文档PAD任务上检测准确率不理想,未能有效提升性能。 Conclusion: 当前通用多模态大模型虽具潜力,但直接迁移至ID文档PAD任务存在适配不足问题,需针对性优化或设计专用架构。 Abstract: The integration of multimodal models into Presentation Attack Detection (PAD) for ID Documents represents a significant advancement in biometric security. Traditional PAD systems rely solely on visual features, which often fail to detect sophisticated spoofing attacks. This study explores the combination of visual and textual modalities by utilizing pre-trained multimodal models, such as Paligemma, Llava, and Qwen, to enhance the detection of presentation attacks on ID Documents. This approach merges deep visual embeddings with contextual metadata (e.g., document type, issuer, and date). However, experimental results indicate that these models struggle to accurately detect PAD on ID Documents.

[108] A2BFR: Attribute-Aware Blind Face Restoration

Chenxin Zhu,Yushun Fang,Lu Liu,Shibo Yin,Xiaohong Liu,Xiaoyun Zhang,Qiang Hu,Guangtao Zhai

Main category: cs.CV

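TL;DR: This paper proposes the A²BFR framework, which combines a diffusion model with text guidance to achieve high-fidelity, controllable blind face restoration, using attribute-aware learning and semantic dual-training to improve restoration quality and prompt adherence.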

Details Motivation: 现有盲脸修复方法存在解不唯一、不可控问题;扩散模型虽提升感知质量但缺乏可控性,而文本引导编辑又难以保证修复可靠性。 Method: 提出基于Diffusion Transformer的A²BFR框架,引入统一图像-文本跨模态注意力机制,并设计属性感知学习(利用属性编码器提取嵌入监督去噪隐变量)与语义双训练(基于AttrFace-90K数据集增强属性判别与保真)。 Result: 在恢复保真度和指令遵循性上达到SOTA:LPIPS降低0.0467,属性准确率提升52.58%,支持严重退化下的细粒度、提示可控修复。 Conclusion: A²BFR成功融合高保真重建与提示可控生成,为盲脸修复提供了兼具质量与可控性的新范式。 Abstract: Blind face restoration (BFR) aims to recover high-quality facial images from degraded inputs, yet its inherently ill-posed nature leads to ambiguous and uncontrollable solutions. Recent diffusion-based BFR methods improve perceptual quality but remain uncontrollable, whereas text-guided face editing enables attribute manipulation without reliable restoration. To address these issues, we propose A$^2$BFR, an attribute-aware blind face restoration framework that unifies high-fidelity reconstruction with prompt-controllable generation. Built upon a Diffusion Transformer backbone with unified image-text cross-modal attention, A$^2$BFR jointly conditions the denoising trajectory on both degraded inputs and textual prompts. To inject semantic priors, we introduce attribute-aware learning, which supervises denoising latents using facial attribute embeddings extracted by an attribute-aware encoder. To further enhance prompt controllability, we introduce semantic dual-training, which leverages the pairwise attribute variations in our newly curated AttrFace-90K dataset to enforce attribute discrimination while preserving fidelity. Extensive experiments demonstrate that A$^2$BFR achieves state-of-the-art performance in both restoration fidelity and instruction adherence, outperforming diffusion-based BFR baselines by -0.0467 LPIPS and +52.58% attribute accuracy, while enabling fine-grained, prompt-controllable restoration even under severe degradations.

[109] Seeing the Evidence, Missing the Answer: Tool-Guided Vision-Language Models on Visual Illusions

Xuesong Wang,Harry Wang

Main category: cs.CV

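TL;DR: This paper proposes a training-free, tool-guided inference framework that uses generic image processing tools and an illusion-type routing system to improve the robustness and cross-structural generalization of vision-language models on optical illusion tasks.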

Details Motivation: 视觉语言模型在经典光学错觉上存在系统性偏差(倾向于将错觉判断为真实),且难以泛化到结构新颖的错觉变体。 Method: 设计无需训练的工具引导推理框架:使用现成VLM,接入线稿生成、区域裁剪、并排对比、通道分离等通用图像工具,并通过幻觉类型路由提示决定调用哪些工具;所有工具输出图像存入持久化注册表,供后续推理链引用与组合。 Result: 在DataCV 2026挑战赛任务I/II中显著提升性能;在结构陌生的测试错觉(如旋转后的Mach Bands)上保持稳定泛化;发现三大现象:强正向检测偏差、空间推理与逻辑推理的解离、对压缩伪影高度敏感。 Conclusion: 通用工具+路由机制可有效缓解VLM的错觉偏差,避免硬编码模块,实现跨结构泛化;所揭示的三大经验现象为后续VLM感知-认知对齐研究提供新方向。 Abstract: Vision-language models (VLMs) exhibit a systematic bias when confronted with classic optical illusions: they overwhelmingly predict the illusion as "real" regardless of whether the image has been counterfactually modified. We present a tool-guided inference framework for the DataCV 2026 Challenge (Tasks I and II) that addresses this failure mode without any model training. An off-the-shelf vision-language model is given access to a small set of generic image manipulation tools: line drawing, region cropping, side-by-side comparison, and channel isolation, together with an illusion-type-routing system prompt that prescribes which tools to invoke for each perceptual question category. Critically, every tool call produces a new, immutable image resource appended to a persistent registry, so the model can reference and compose any prior annotated view throughout its reasoning chain. Rather than hard-coding illusion-specific modules, this generic-tool-plus-routing design yields strong cross-structural generalization: performance remained consistent from the validation set to a test set containing structurally unfamiliar illusion variants (e.g., Mach Bands rotated from vertical to horizontal stacking). 
We further report three empirical observations that we believe warrant additional investigation: (i) a strong positive-detection bias likely rooted in imbalanced illusion training data, (ii) a striking dissociation between pixel-accurate spatial reasoning and logical inference over self-generated annotations, and (iii) pronounced sensitivity to image compression artifacts that compounds false positives.

[110] SeGPruner: Semantic-Geometric Visual Token Pruner for 3D Question Answering

Wenli Li,Kai Zhao,Haoran Jiang,Enquan Yang,Yi Su,Dan Zeng

Main category: cs.CV

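TL;DR: This paper proposes SeGPruner, a semantic-aware and geometry-guided visual token pruning framework that improves inference efficiency for 3D question answering with multi-view images, cutting visual tokens by 91% and latency by 86% while maintaining competitive performance.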

Details Motivation: 现有VLMs在3D问答中因多视角视觉令牌拼接导致严重冗余,影响推理效率;而现有裁剪方法多面向2D或依赖间接几何线索,难以兼顾语义关键物体保留与充分空间覆盖。 Method: SeGPruner包含两个模块:基于注意力的显著性感知令牌选择器(保留语义关键令牌)和几何感知令牌多样化选择器(结合语义相关性与3D几何距离补充空间多样令牌),协同实现语义证据保留与场景全局覆盖。 Result: 在ScanQA和OpenEQA数据集上,视觉令牌预算减少91%,推理延迟降低86%,同时保持有竞争力的3D推理性能。 Conclusion: SeGPruner通过语义与几何双驱动的令牌裁剪策略,有效解决了多视角VLMs中令牌冗余问题,显著提升了3D QA的效率与实用性。 Abstract: Vision-language models (VLMs) have been widely adopted for 3D question answering (3D QA). In typical pipelines, visual tokens extracted from multiple viewpoints are concatenated with language tokens and jointly processed by a large language model (LLM) for inference. However, aggregating multi-view observations inevitably introduces severe token redundancy, leading to an overly large visual token set that significantly hinders inference efficiency under constrained token budgets. Visual token pruning has emerged as a prevalent strategy to address this issue. Nevertheless, most existing pruners are primarily tailored to 2D inputs or rely on indirect geometric cues, which limits their ability to explicitly retain semantically critical objects and maintain sufficient spatial coverage for robust 3D reasoning. In this paper, we propose SeGPruner, a semantic-aware and geometry-guided token reduction framework for efficient 3D QA with multi-view images. Specifically, SeGPruner first preserves semantically salient tokens through an attention-based importance module (Saliency-aware Token Selector), ensuring that object-critical evidence is retained. It then complements these tokens with spatially diverse ones via a geometry-guided selector (Geometry-aware Token Diversifier), which jointly considers semantic relevance and 3D geometric distance. This cooperation between saliency preservation and geometry-guided diversification balances object-level evidence and global scene coverage under aggressive token reduction. 
Extensive experiments on ScanQA and OpenEQA demonstrate that SeGPruner substantially improves inference efficiency, reducing the visual token budget by 91% and inference latency by 86%, while maintaining competitive performance in 3D reasoning tasks.
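
The two-stage selection described in this entry (keep the most salient tokens, then add tokens that are both relevant and far from those already kept) can be sketched as a greedy procedure. Everything below is an illustrative assumption rather than the authors' algorithm: the function name, the `alpha` weighting, the linear score, and the use of raw Euclidean distance, which in practice would need to be normalized to a scale comparable with the saliency values.

```python
import math

def select_tokens(saliency, positions, n_salient, n_diverse, alpha=0.5):
    """Return indices of kept tokens.

    saliency: per-token importance scores (e.g. attention-derived).
    positions: per-token 3D coordinates as (x, y, z) tuples.
    n_salient: number of tokens kept purely by saliency.
    n_diverse: number of extra tokens added greedily for spatial coverage.
    alpha: hypothetical trade-off between saliency and distance.
    """
    order = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)
    kept = order[:n_salient]
    remaining = sorted(order[n_salient:])  # sorted for deterministic ties

    def score(i):
        # Distance to the nearest token already kept (0.0 if none kept yet).
        nearest = min(
            (math.dist(positions[i], positions[j]) for j in kept), default=0.0
        )
        return alpha * saliency[i] + (1.0 - alpha) * nearest

    for _ in range(min(n_diverse, len(remaining))):
        best = max(remaining, key=score)
        kept.append(best)
        remaining.remove(best)
    return kept
```

With a budget reduction of 91%, as reported above, roughly 9 of every 100 visual tokens would be kept across the two stages.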

[111] EarthEmbeddingExplorer: A Web Application for Cross-Modal Retrieval of Global Satellite Images

Yijie Zheng,Weijie Wu,Bingyue Wu,Long Zhao,Guoqing Li,Mikolaj Czerkawski,Konstantin Klemmer

Main category: cs.CV

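TL;DR: This paper introduces EarthEmbeddingExplorer, a web application that turns Earth observation foundation models and embedding datasets into an interactive, accessible tool, with the aim of lowering the barrier between academic results and practical use.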

Details Motivation: 尽管地球观测领域已出现大量高影响力的基础模型和全球地球嵌入数据集,但将其转化为免费、易用的工具仍存在显著障碍。 Method: 开发了一个名为EarthEmbeddingExplorer的云原生交互式Web应用,支持自然语言、视觉与地理定位等跨模态查询,并提供从检索结果中提取科学洞见的实践指南。 Result: 实现了对预计算地球嵌入数据的民主化访问,使研究人员能便捷地将前沿模型与数据存档应用于真实场景分析。 Conclusion: 该教程通过构建开放、易用的工具平台,有效弥合了地球观测学术研究与实际应用之间的鸿沟。 Abstract: While the Earth observation community has witnessed a surge in high-impact foundation models and global Earth embedding datasets, a significant barrier remains in translating these academic assets into freely accessible tools. This tutorial introduces EarthEmbeddingExplorer, an interactive web application designed to bridge this gap, transforming static research artifacts into dynamic, practical workflows for discovery. We will provide a comprehensive hands-on guide to the system, detailing its cloud-native software architecture, demonstrating cross-modal queries (natural language, visual, and geolocation), and showcasing how to derive scientific insights from retrieval results. By democratizing access to precomputed Earth embeddings, this tutorial empowers researchers to seamlessly transition from state-of-the-art models and data archives to real-world application and analysis. The web application is available at https://modelscope.ai/studios/Major-TOM/EarthEmbeddingExplorer.

[112] NeoNet: An End-to-End 3D MRI-Based Deep Learning Framework for Non-Invasive Prediction of Perineural Invasion via Generation-Driven Classification

Youngung Han,Minkyung Cha,Kyeonghun Kim,Induk Um,Myeongbin Sho,Joo Young Bae,Jaewon Jung,Jung Hyeok Park,Seojun Lee,Nam-Joon Kim,Woo Kyoung Jeong,Won Jae Lee,Pa Hong,Ken Ying-Kai Liao,Hyuk-Jae Lee

Main category: cs.CV

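TL;DR: This paper proposes NeoNet, an end-to-end 3D deep learning framework for non-invasive prediction of perineural invasion (PNI) in cholangiocarcinoma. Three modules work together, NeoSeg, NeoGen (a 3D latent diffusion model with ControlNet), and NeoCls (with PattenNet), improving PNI identification to an AUC of up to 0.7903.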

Details Motivation: Non-invasive diagnosis of perineural invasion (PNI) remains challenging because clear and consistent imaging criteria are lacking, and invasive examinations should be reduced to lower patient risk. Method: The NeoNet framework has three parts: NeoSeg crops a tumor-localized ROI; NeoGen uses a 3D latent diffusion model (LDM) with ControlNet, guided by anatomical masks, to generate a balanced dataset; NeoCls predicts with the PNI-Attention Network (PattenNet), which combines the frozen LDM encoder with 3D Dual Attention Blocks (DAB). Result: In 5-fold cross-validation, NeoNet outperforms baseline 3D models, with a maximum AUC of 0.7903. Conclusion: NeoNet offers an effective new approach to non-invasive imaging diagnosis of PNI in cholangiocarcinoma and demonstrates the feasibility and benefit of an end-to-end 3D deep learning framework that combines generative modeling with attention mechanisms. Abstract: Minimizing invasive diagnostic procedures to reduce the risk of patient injury and infection is a central goal in medical imaging. And yet, noninvasive diagnosis of perineural invasion (PNI), a critical prognostic factor involving infiltration of tumor cells along the surrounding nerve, still remains challenging, due to the lack of clear and consistent imaging criteria for identifying PNI. To address this challenge, we present NeoNet, an integrated end-to-end 3D deep learning framework for PNI prediction in cholangiocarcinoma that does not rely on predefined image features. NeoNet integrates three modules: (1) NeoSeg, utilizing a Tumor-Localized ROI Crop (TLCR) algorithm; (2) NeoGen, a 3D Latent Diffusion Model (LDM) with ControlNet, conditioned on anatomical masks to generate synthetic image patches, specifically balancing the dataset to a 1:1 ratio; and (3) NeoCls, the final prediction module. For NeoCls, we developed the PNI-Attention Network (PattenNet), which uses the frozen LDM encoder and specialized 3D Dual Attention Blocks (DAB) designed to detect subtle intensity variations and spatial patterns indicative of PNI. In 5-fold cross-validation, NeoNet outperformed baseline 3D models and achieved the highest performance with a maximum AUC of 0.7903.

[113] Few-shot Writer Adaptation via Multimodal In-Context Learning

Tom Simon,Stephane Nicolas,Pierrick Tranouez,Clement Chatelain,Thierry Paquet

Main category: cs.CV

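TL;DR: This paper proposes a context-driven handwritten text recognition (HTR) framework that requires no parameter updates. It adapts at inference time from a few samples of the target writer and achieves leading performance on the IAM and RIMES datasets.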

Details Motivation: Existing HTR models perform poorly on handwriting styles that are rare or absent in the training data, while mainstream writer adaptation methods rely on gradient updates that are computationally costly and hard to tune. Method: Inspired by multimodal in-context learning, the paper designs a context-driven HTR framework that needs no parameter updates, introduces a compact 8M-parameter CNN-Transformer architecture, studies the effect of context length, and combines context-driven and standard OCR training strategies. Result: Character error rates of 3.92% on IAM and 2.34% on RIMES, surpassing all writer-independent HTR models without any parameter updates at inference time. Conclusion: The context-driven paradigm enables efficient, lightweight writer-adaptive HTR without gradient computation or parameter updates, combining performance with practicality. Abstract: While state-of-the-art Handwritten Text Recognition (HTR) models perform well on standard benchmarks, they frequently struggle with writers exhibiting highly specific styles that are underrepresented in the training data. To handle unseen and atypical writers, writer adaptation techniques personalize HTR models to individual handwriting styles. Leading writer adaptation methods require either offline fine-tuning or parameter updates at inference time, both involving gradient computation and backpropagation, which increase computational costs and demand careful hyperparameter tuning. In this work, we propose a novel context-driven HTR framework inspired by multimodal in-context learning, enabling inference-time writer adaptation using only a few examples from the target writer without any parameter updates. We further demonstrate the impact of context length, design a compact 8M-parameter CNN-Transformer that enables few-shot in-context adaptation, and show that combining context-driven and standard OCR training strategies leads to complementary improvements. Experiments on IAM and RIMES validate our approach with Character Error Rates of 3.92% and 2.34%, respectively, surpassing all writer-independent HTR models without requiring any parameter updates at inference time.
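
The Character Error Rate quoted in this entry is conventionally the character-level edit distance between recognized and reference text, divided by the number of reference characters. A minimal sketch follows; the function names are illustrative, and aggregating over the whole corpus (rather than averaging per-line rates) is an assumed convention.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance (insertions, deletions, substitutions)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            current.append(min(
                previous[j] + 1,                           # deletion
                current[j - 1] + 1,                        # insertion
                previous[j - 1] + (ref_char != hyp_char),  # substitution
            ))
        previous = current
    return previous[-1]

def character_error_rate(references, hypotheses):
    """Total edit distance divided by total reference characters."""
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    characters = sum(len(r) for r in references)
    return errors / characters
```

On this scale, a CER of 3.92% means about 4 character edits per 100 reference characters.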

[114] FedDBP: Enhancing Federated Prototype Learning with Dual-Branch Features and Personalized Global Fusion

Ningzhi Gao,Siquan Huang,Leyu Shi,Ying Gao

Main category: cs.CV

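TL;DR: This paper proposes FedDBP, a new federated prototype learning method that uses a client-side dual-branch feature projector and server-side personalized global prototype fusion to address two weaknesses of existing FPL methods: the trade-off between feature fidelity and discriminability, and the reliance on a single global prototype.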

Details Motivation: Existing federated prototype learning (FPL) methods struggle to balance feature fidelity and discriminability, and they rely on a single global prototype, which copes poorly with data and model heterogeneity. Method: On the client side, a dual-branch feature projector combines L2 alignment with contrastive learning; on the server side, Fisher information is used to assess the channel importance of local prototypes, enabling personalized global prototype fusion. Result: On multiple benchmark datasets, FedDBP significantly outperforms ten existing advanced methods. Conclusion: By improving local feature quality and the flexibility of prototype fusion, FedDBP strengthens the performance and robustness of heterogeneous federated learning. Abstract: Federated prototype learning (FPL), as a solution to heterogeneous federated learning (HFL), effectively alleviates the challenges of data and model heterogeneity. However, existing FPL methods fail to balance the fidelity and discriminability of the feature, and are limited by a single global prototype. In this paper, we propose FedDBP, a novel FPL method to address the above issues. On the client-side, we design a Dual-Branch feature projector that employs L2 alignment and contrastive learning simultaneously, thereby ensuring both the fidelity and discriminability of local features. On the server-side, we introduce a Personalized global prototype fusion approach that leverages Fisher information to identify the important channels of local prototypes. Extensive experiments demonstrate the superiority of FedDBP over ten existing advanced methods.

[115] Square Superpixel Generation and Representation Learning via Granular Ball Computing

Shuyin Xia,Meng Yang,Dawei Dai,Fan Chen,Shilin Zhao,Junwei Han,Xinbo Gao,Guoyin Wang,Wen Lu

Main category: cs.CV

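TL;DR: This paper proposes a square superpixel generation method built on multi-scale square blocks. Drawing on granular-ball computing, it selects high-quality blocks by a purity score, addressing the poor compatibility of irregular superpixels with regular operators such as convolution, improving parallelism and end-to-end learnability, and supporting architectures such as GNNs and ViTs.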

Details Motivation: Existing superpixel algorithms produce irregular regions that do not align with regular operators such as convolution, so they are usually confined to an offline preprocessing step, which limits parallel implementation and end-to-end optimization in deep learning pipelines. Method: Inspired by the adaptive representation and coverage property of granular-ball computing, the method approximates superpixels with multi-scale square blocks; for each block it computes a purity score based on pixel-intensity similarity and selects high-quality blocks accordingly. Result: The resulting square superpixels can serve directly as nodes in graph neural networks (GNNs) or as tokens in Vision Transformers (ViTs), supporting multi-scale information aggregation and structured visual representation; experiments on downstream tasks show consistent performance gains. Conclusion: The square superpixel method overcomes the shape-irregularity limitation of conventional superpixels, improves computational efficiency, parallelism, and learnability, and proves effective and general across several vision tasks. Abstract: Superpixels provide a compact region-based representation that preserves object boundaries and local structures, and have therefore been widely used in a variety of vision tasks to reduce computational cost. However, most existing superpixel algorithms produce irregularly shaped regions, which are not well aligned with regular operators such as convolutions. Consequently, superpixels are often treated as an offline preprocessing step, limiting parallel implementation and hindering end-to-end optimization within deep learning pipelines. Motivated by the adaptive representation and coverage property of granular-ball computing, we develop a square superpixel generation approach. Specifically, we approximate superpixels using multi-scale square blocks to avoid the computational and implementation difficulties induced by irregular shapes, enabling efficient parallel processing and learnable feature extraction. For each block, a purity score is computed based on pixel-intensity similarity, and high-quality blocks are selected accordingly. The resulting square superpixels can be readily integrated as graph nodes in graph neural networks (GNNs) or as tokens in Vision Transformers (ViTs), facilitating multi-scale information aggregation and structured visual representation. Experimental results on downstream tasks demonstrate consistent performance improvements, validating the effectiveness of the proposed method.
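As a rough illustration of the purity-based block selection described in this abstract, the sketch below recursively splits a grayscale image into square blocks and keeps a block once its purity passes a threshold. The purity definition (fraction of pixels within `tol` of the block median), the quadtree-style split, and every name and threshold here are assumptions made for illustration, not the authors' formulation.

```python
from statistics import median

def purity(img, r, c, size, tol=10):
    """Assumed purity: fraction of block pixels within `tol` of the block median."""
    vals = [img[i][j] for i in range(r, r + size) for j in range(c, c + size)]
    m = median(vals)
    return sum(abs(v - m) <= tol for v in vals) / len(vals)

def square_superpixels(img, r=0, c=0, size=None, min_size=1, thresh=0.95, tol=10):
    """Return (row, col, size) square blocks that together cover the image."""
    if size is None:
        size = len(img)  # assumes a square image whose side is a power of two
    # Keep the block if it is pure enough or cannot be split further.
    if size <= min_size or purity(img, r, c, size, tol) >= thresh:
        return [(r, c, size)]
    half = size // 2
    blocks = []
    for dr in (0, half):
        for dc in (0, half):
            blocks += square_superpixels(img, r + dr, c + dc, half,
                                         min_size, thresh, tol)
    return blocks

# 4x4 toy image: flat background with one bright 2x2 patch.
img = [
    [10, 10, 10, 10],
    [10, 10, 10, 10],
    [10, 10, 200, 200],
    [10, 10, 200, 200],
]
blocks = square_superpixels(img)
```

Because every output block is an axis-aligned square, each one could be resized to a fixed patch and fed to a convolutional or token-based model, which is the compatibility argument the abstract makes.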

[116] VecAttention: Vector-wise Sparse Attention for Accelerating Long Context Inference

Anmin Liu,Ruixuan Yang,Huiqiang Jiang,Bin Lin,Minmin Sun,Yong Li,Chen Zhang,Tao Xie

Main category: cs.CV

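TL;DR: This paper proposes VecAttention, a sparse attention mechanism based on a vertical-vector sparse pattern that markedly improves the efficiency-accuracy trade-off of video models on long-context understanding and generation tasks.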

Details Motivation: Because self-attention has quadratic complexity, Transformer-based video models face heavy computational cost on long videos; existing coarse-grained sparse attention methods incur redundant computation and suboptimal performance. Method: VecAttention is a vector-wise sparse attention framework that exploits the vertical-vector sparsity inherent in video attention maps, using a lightweight important-vector selection module and an optimized vector sparse attention kernel to dynamically select and process only the key vertical vectors. Result: On VideoMME, LongVideoBench, and VCRBench (understanding) and VBench (generation), VecAttention is 2.65× faster than full attention and 1.83× faster than state-of-the-art sparse methods, with accuracy comparable to full attention. Conclusion: The vertical-vector sparse pattern is a better structural prior than conventional coarse-grained sparsity; VecAttention offers an efficient, accurate, and hardware-friendly approach to long-video modeling. Abstract: Long-context video understanding and generation pose a significant computational challenge for Transformer-based video models due to the quadratic complexity of self-attention. While existing sparse attention methods employ coarse-grained patterns to improve efficiency, they typically incur redundant computation and suboptimal performance. To address this issue, in this paper, we propose VecAttention, a novel framework of vector-wise sparse attention that achieves superior accuracy-efficiency trade-offs for video models. We observe that video attention maps exhibit a strong vertical-vector sparse pattern, and further demonstrate that this vertical-vector pattern offers consistently better accuracy-sparsity trade-offs compared with existing coarse-grained sparse patterns. Based on this observation, VecAttention dynamically selects and processes only informative vertical vectors through a lightweight important-vector selection that minimizes memory access overhead and an optimized kernel of vector sparse attention. Comprehensive evaluations on video understanding (VideoMME, LongVideoBench, and VCRBench) and generation (VBench) tasks show that VecAttention delivers a 2.65× speedup over full attention and a 1.83× speedup over state-of-the-art sparse attention methods, with comparable accuracy to full attention. Our code is available at https://github.com/anminliu/VecAttention.
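One way to picture a vertical-vector sparse pattern is as follows: for each block of consecutive queries, only a few key columns of the attention map are kept, and softmax is computed over those columns alone. The sketch below is a toy, pure-Python reading of that idea. The selection rule (score keys against the block's mean query), the block size, and all function names are assumptions; the paper's actual selection module and kernel are not described in the abstract.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def vector_sparse_attention(Q, K, V, block=2, keep=2):
    """For each block of `block` queries, attend only to the `keep` key columns
    that score highest against the block's mean query (assumed selection rule)."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for start in range(0, len(Q), block):
        q_blk = Q[start:start + block]
        q_mean = [sum(col) / len(q_blk) for col in zip(*q_blk)]
        scores = [dot(q_mean, k) for k in K]
        # Indices of the selected "vertical vectors" (key columns) for this block.
        cols = sorted(sorted(range(len(K)), key=lambda j: -scores[j])[:keep])
        for q in q_blk:
            w = softmax([dot(q, K[j]) * scale for j in cols])
            out.append([sum(wi * V[j][t] for wi, j in zip(w, cols))
                        for t in range(len(V[0]))])
    return out

Q = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
K = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
sparse = vector_sparse_attention(Q, K, V, block=2, keep=2)  # 2 of 3 key columns
dense = vector_sparse_attention(Q, K, V, block=2, keep=3)   # all columns = full attention
```

Setting `keep` to the number of keys recovers ordinary dense attention, which makes it easy to compare sparse and dense outputs on small inputs.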

[117] All-in-One Augmented Reality Guided Head and Neck Tumor Resection

Yue Yang,Matthieu Chabanas,Carrie Reale,Annie Benson,Jason Slagle,Matthew Weinger,Michael Topf,Jie Ying Wu

Main category: cs.CV

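TL;DR: This paper presents a markerless augmented reality (AR) system based on HoloLens 2 that localizes and visualizes positive margins during head and neck squamous cell carcinoma surgery, substantially improving the precision of intraoperative re-resection.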

Details Motivation: Positive margins are common in head and neck squamous cell carcinoma surgery, but their locations are currently relayed through verbal pathology reports, which makes re-resection imprecise. Method: An integrated markerless AR system uses HoloLens 2 depth sensing and fully automated surface registration to relocalize positive margins from the ex vivo specimen onto the intraoperative resection bed and visualize them in situ. Result: In a silicone phantom study, markerless registration had a median error of 1.8 mm (versus 1.7 mm for the marker-based baseline); in the margin relocalization task, AR guidance reduced the median error to 3.2 mm (versus 14.2 mm with verbal guidance), with all AR errors below 5 mm. Conclusion: The markerless AR system is feasible and can substantially improve the precision of intraoperative re-resection of positive margins. Abstract: Positive margins are common in head and neck squamous cell carcinoma, yet intraoperative re-resection is often imprecise because margin locations are typically communicated verbally from pathology. We present an all-in-one augmented reality (AR) system that relocalizes positive margins from a resected specimen to the resection bed and visualizes them in situ using HoloLens 2 depth sensing and fully automated markerless surface registration. In a silicone phantom study with six medical trainees, markerless registration achieved target registration errors comparable to a marker-based baseline (median 1.8 mm vs. 1.7 mm; maximum < 4 mm). In a margin relocalization task, AR guidance reduced error from verbal guidance (median 14.2 mm) to a few millimeters (median 3.2 mm), with all AR localizations within 5 mm error. These results support the feasibility of markerless AR margin guidance for more precise intraoperative re-excision.

[118] Transmittance-Guided Structure-Texture Decomposition for Nighttime Image Dehazing

Francesco Moretti,Giulia Bianchi,Andrea Gallo

Main category: cs.CV

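TL;DR: This paper proposes a two-stage nighttime image dehazing framework that combines transmittance correction with layered structure-texture optimization to address the low visibility, color distortion, and reduced contrast of nighttime hazy images.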

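Details Motivation: Existing nighttime dehazing methods usually address only part of the degradation (for example glow suppression or brightness enhancement) and do not jointly handle the combined quality loss caused by atmospheric scattering, particle absorption, and non-uniform artificial lighting. Method: Stage one builds a boundary-constrained initial transmittance map and applies region-adaptive compensation and normalization (distinguishing light-source from non-light-source regions); estimates a spatially varying atmospheric light map with quadratic Gaussian filtering in YUV space; and produces an initial dehazed image with an improved nighttime imaging model. Stage two introduces the STAR-YUV decomposition model, which separates the image into structure and texture layers in YUV space; the structure layer receives Gamma correction and MSRCR for illumination compensation and color correction, and the texture layer is detail-enhanced with LoG filtering; finally the enhanced layers are merged by nonlinear Retinex-based fusion and linearly blended with the initial result. Result: On multiple nighttime haze datasets, the method is reported to achieve better visibility, color fidelity, and detail recovery than existing methods in both quantitative and qualitative evaluations. Conclusion: The two-stage framework models and corrects the multiple degradation mechanisms of nighttime hazy images more completely, improves dehazing quality, and offers a new approach to image restoration in complex low-light environments. Abstract: Nighttime images captured under hazy conditions suffer from severe quality degradation, including low visibility, color distortion, and reduced contrast, caused by the combined effects of atmospheric scattering, absorption by suspended particles, and non-uniform illumination from artificial light sources. While existing nighttime dehazing methods have achieved partial success, they typically address only a subset of these issues, such as glow suppression or brightness enhancement, without jointly tackling the full spectrum of degradation factors. In this paper, we propose a two-stage nighttime image dehazing framework that integrates transmittance correction with structure-texture layered optimization. In the first stage, we introduce a novel transmittance correction method that establishes boundary-constrained initial transmittance maps and subsequently applies region-adaptive compensation and normalization based on whether image regions correspond to light source areas. A quadratic Gaussian filtering scheme operating in the YUV color space is employed to estimate the spatially varying atmospheric light map. The corrected transmittance map and atmospheric light map are then used in conjunction with an improved nighttime imaging model to produce the initial dehazed image. In the second stage, we propose a STAR-YUV decomposition model that separates the dehazed image into structure and texture layers within the YUV color space.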
Gamma correction and MSRCR-based color restoration are applied to the structure layer for illumination compensation and color bias correction, while Laplacian-of-Gaussian filtering is applied to the texture layer for detail enhancement. A novel two-phase fusion strategy, comprising nonlinear Retinex-based fusion of the enhanced layers followed by linear blending with the initial dehazing result, yields the final output.

[119] Quantization with Unified Adaptive Distillation to enable multi-LoRA based one-for-all Generative Vision Models on edge

Sowmya Vajrala,Aakash Parmar,Prasanna R,Sravanth Kodavanti,Manjunath Arveti,Srinivas Soumitri Miriyala,Ashok Senapati

Main category: cs.CV

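TL;DR: This paper proposes a unified framework for running multi-task generative AI efficiently on edge devices: LoRA weights are supplied as runtime inputs and combined with QUAD, a quantization-aware training strategy, substantially reducing memory footprint and latency.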

Details Motivation: Deploying large vision models (LVMs) on mobile devices incurs high memory and compute cost; conventional LoRA deployment compiles a separate model for each task, causing redundant storage and high runtime overhead. Method: A unified framework that 1) treats LoRA weights as runtime inputs so a single model can switch tasks dynamically; 2) introduces QUAD, a quantization-aware training method that places multiple LoRA adapters under one shared quantization profile; and 3) builds a lightweight runtime stack compatible with mobile NPUs. Result: Across multiple chipsets, memory footprint is reduced by up to 6x and latency by up to 4x while high-quality visual output is maintained across tasks. Conclusion: The framework addresses the storage redundancy and efficiency bottlenecks of multi-task GenAI deployment at the edge and offers a practical path to efficient on-device generative AI. Abstract: Generative Artificial Intelligence (GenAI) features such as image editing, object removal, and prompt-guided image transformation are increasingly integrated into mobile applications. However, deploying Large Vision Models (LVMs) for such tasks on resource-constrained devices remains challenging due to their high memory and compute requirements. While Low-Rank Adapters (LoRAs) enable parameter-efficient task adaptation, existing mobile deployment pipelines typically compile a separate model binary for each LoRA, each bundled with its own copy of the foundation model, resulting in redundant storage and increased runtime overhead. In this work, we present a unified framework for enabling multi-task GenAI inference on edge devices using a single shared model. Our key idea is to treat LoRA weights as runtime inputs rather than embedding them into the compiled model graph, allowing dynamic task switching at runtime without recompilation. Then, to support efficient on-device execution, we introduce QUAD (Quantization with Unified Adaptive Distillation), a quantization-aware training strategy that aligns multiple LoRA adapters under a shared quantization profile. We implement the proposed system with a lightweight runtime stack compatible with mobile NPUs and evaluate it across multiple chipsets. Experimental results demonstrate up to 6x reduction in memory footprint and up to 4x improvement in latency, while maintaining high visual quality across multiple GenAI tasks.
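The "LoRA weights as runtime inputs" idea can be illustrated with a single linear layer: the base weight `W` stays fixed, and the low-rank pair `(A, B)` is passed in per call, so switching tasks means passing different arguments rather than loading a different model. The sketch below is a minimal toy under that assumption; the task names, matrix values, and function names are hypothetical, and it does not model quantization or the QUAD training procedure.

```python
def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def linear_with_runtime_lora(W, x, lora=None, scale=1.0):
    """Compute W x + scale * B (A x). The adapter (A, B) is an argument,
    not merged into W, so one shared W serves every task."""
    y = matvec(W, x)
    if lora is not None:
        A, B = lora  # A: r x d_in, B: d_out x r (rank r = 1 in this toy)
        delta = matvec(B, matvec(A, x))
        y = [yi + scale * di for yi, di in zip(y, delta)]
    return y

W = [[1.0, 0.0], [0.0, 1.0]]  # shared, frozen base weight
adapters = {  # hypothetical task names and values
    "object_removal": ([[1.0, 1.0]], [[0.5], [0.0]]),
    "style_edit": ([[1.0, -1.0]], [[0.0], [2.0]]),
}
x = [2.0, 3.0]
y_base = linear_with_runtime_lora(W, x)
y_removal = linear_with_runtime_lora(W, x, adapters["object_removal"])
y_style = linear_with_runtime_lora(W, x, adapters["style_edit"])
```

Storage then scales with one copy of `W` plus the small adapter matrices, instead of one full model per task.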

[120] Generating Key Postures of Bharatanatyam Adavus with Pose Estimation

Jagadish Kashinath Kamble,Jayanta Mukhopadhyay,Debaditya Roy,Partha Pratim Das

Main category: cs.CV

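TL;DR: This paper proposes a generative framework combining pose estimation with keypoint supervision to generate key postures of the classical Indian dance Bharatanatyam with high fidelity, improving cultural accuracy and authenticity for digital preservation, education, and dissemination.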

Details Motivation: Digitally preserving intangible-heritage dances such as Bharatanatyam is difficult because they follow strict structural and symbolic rules and demand precise postures, so a generation method is needed that respects both anatomical correctness and stylistic integrity. Method: A pose-aware generative framework integrates a pose estimation module and is supervised with a keypoint loss and pose consistency constraints; four configurations are compared (standard cGAN, cGAN with pose supervision, conditional diffusion, and conditional diffusion with pose supervision), all conditioned on key posture class labels and optimized to preserve geometric structure. Result: Adding pose supervision significantly improves the quality, realism, and cultural fidelity of generated Bharatanatyam postures; under both the cGAN and conditional diffusion paradigms, generated poses align more closely with ground-truth keypoint structures. Conclusion: The framework offers a scalable, high-fidelity, and culturally accurate technical route for digital archiving, teaching, and global dissemination of traditional dance. Abstract: Preserving intangible cultural dances rooted in centuries of tradition and governed by strict structural and symbolic rules presents unique challenges in the digital era. Among these, Bharatanatyam, a classical Indian dance form, stands out for its emphasis on codified adavus and precise key postures. Accurately generating these postures is crucial not only for maintaining anatomical and stylistic integrity, but also for enabling effective documentation, analysis, and transmission to broader global audiences through digital means. We propose a pose-aware generative framework integrated with a pose estimation module, guided by keypoint-based loss and pose consistency constraints. These supervisory signals ensure anatomical accuracy and stylistic integrity in the synthesized outputs. We evaluate four configurations: standard conditional generative adversarial network (cGAN), cGAN with pose supervision, conditional diffusion, and conditional diffusion with pose supervision. Each model is conditioned on key posture class labels and optimized to maintain geometric structure. In both cGAN and conditional diffusion settings, the integrated pose guidance aligns generated poses with ground-truth keypoint structures, promoting cultural fidelity. Our results demonstrate that incorporating pose supervision significantly enhances the quality, realism, and authenticity of generated Bharatanatyam postures. This framework provides a scalable approach for the digital preservation, education, and dissemination of traditional dance forms, enabling high-fidelity generation without compromising cultural precision.
Code is available at https://github.com/jagidsh/Generating-Key-Postures-of-Bharatanatyam-Adavus-with-Pose-Estimation.

[121] Emotion Diffusion Classifier with Adaptive Margin Discrepancy Training for Facial Expression Recognition

Rongkang Dong,Cuixin Yang,Cong Zhang,Yushen Zuo,Kin-Man Lam

Main category: cs.CV

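TL;DR: This paper proposes the Emotion Diffusion Classifier (EmoDC) for facial expression recognition (FER), which uses a conditional generative diffusion model to improve adversarial robustness. To address the weak classification performance obtained with standard training, it further proposes Adaptive Margin Discrepancy Training (AMDiT), which dynamically adjusts the margin on noise-prediction errors and significantly improves both accuracy and robustness.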

Details Motivation: Existing deep-learning FER methods rely on discriminative classifiers, which tend to learn shortcuts, are sensitive to distribution shift, and lack robustness. Method: EmoDC is built on a conditional generative diffusion model; margin-based discrepancy training separates the noise-prediction errors obtained under correct and incorrect class conditions; Adaptive Margin Discrepancy Training (AMDiT) then sets the margin dynamically for each sample. Result: In 100-step evaluations on RAF-DB (basic and compound subsets), SFEW-2.0, and AffectNet, AMDiT significantly outperforms the baseline diffusion model; EmoDC is more robust to noise and blur than state-of-the-art discriminative classifiers. Conclusion: Generative diffusion modeling combined with adaptive margin optimization improves both the discriminative ability and the robustness of FER models, offering a new approach to robust emotion recognition in real-world settings. Abstract: Facial Expression Recognition (FER) is essential for human-machine interaction, as it enables machines to interpret human emotions and internal states from facial affective behaviors. Although deep learning has significantly advanced FER performance, most existing deep-learning-based FER methods rely heavily on discriminative classifiers for fast predictions. These models tend to learn shortcuts and are vulnerable to even minor distribution shifts. To address this issue, we adopt a conditional generative diffusion model and introduce the Emotion Diffusion Classifier (EmoDC) for FER, which demonstrates enhanced adversarial robustness. However, retraining EmoDC using standard strategies fails to penalize incorrect categorical descriptions, leading to suboptimal recognition performance. To improve EmoDC, we propose margin-based discrepancy training, which encourages accurate predictions when conditioned on correct categorical descriptions and penalizes predictions conditioned on mismatched ones. This method enforces a minimum margin between noise-prediction errors for correct and incorrect categories, thereby enhancing the model's discriminative capability. Nevertheless, using a fixed margin fails to account for the varying difficulty of noise prediction across different images, limiting its effectiveness. To overcome this limitation, we propose Adaptive Margin Discrepancy Training (AMDiT), which dynamically adjusts the margin for each sample.
Extensive experiments show that AMDiT significantly improves the accuracy of EmoDC over the Base model with standard denoising diffusion training on the RAF-DB basic subset, the RAF-DB compound subset, SFEW-2.0, and AffectNet, in 100-step evaluations. Additionally, EmoDC outperforms state-of-the-art discriminative classifiers in terms of robustness against noise and blur.
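The margin idea in this abstract can be written as a hinge penalty on noise-prediction errors: the error under each wrong class label should exceed the error under the correct label by at least a margin. The sketch below shows that penalty together with the usual diffusion-classifier decision rule (pick the label with the lowest error). The error values are made up, the function names are hypothetical, and the per-sample rule AMDiT uses to choose the margin is not given in the abstract, so `margin` is simply a parameter here.

```python
def margin_discrepancy_loss(err_correct, errs_incorrect, margin):
    """Hinge penalty: each wrong-class error should exceed the correct-class
    error by at least `margin`; averaged over the wrong classes."""
    penalties = [max(0.0, margin + err_correct - e) for e in errs_incorrect]
    return sum(penalties) / len(penalties)

def classify(errors_by_class):
    """Predict the label whose conditioning yields the lowest noise-prediction error."""
    return min(errors_by_class, key=errors_by_class.get)

# Made-up per-class noise-prediction errors for one image whose true label is "happy".
errors = {"happy": 0.20, "sad": 0.35, "angry": 0.90}
pred = classify(errors)
wrong = [e for label, e in errors.items() if label != "happy"]
loss_small = margin_discrepancy_loss(errors["happy"], wrong, margin=0.1)
loss_large = margin_discrepancy_loss(errors["happy"], wrong, margin=0.3)
```

With the small margin every wrong class is already far enough away and the penalty is zero; with the larger margin the "sad" class falls inside the margin and contributes a positive loss, which is the situation a per-sample margin is meant to tune.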

[122] FlowID: Enhancing Forensic Identification with Latent Flow-Matching Models

Jules Ripoll,David Bertoin,Alasdair Newson,Charles Dossal,Jose Pablo Baraybar

Main category: cs.CV

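TL;DR: This paper proposes FlowID, an identity-preserving facial reconstruction method based on image generation models that restores photographs of faces severely damaged by violent death in order to support forensic identification.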

Details Motivation: Forensic and law enforcement institutions often handle photographs of deceased persons whose faces were damaged in violent events (crime, war, migration, climate disasters), but traditional image editing workflows are slow and produce poor results, making it hard to support identification efficiently while protecting privacy. Method: FlowID combines single-image fine-tuning (adapting the generative model to out-of-distribution injured faces) with attention-based masking (editing only damaged regions while preserving identity-critical features) to achieve identity-preserving facial reconstruction. Result: On the newly built InjuredFaces benchmark, FlowID outperforms existing open-source methods, has a low memory footprint, and is suitable for local deployment that keeps data private. Conclusion: FlowID offers an efficient, privacy-safe solution for identity-preserving facial reconstruction under extreme damage and advances standardized evaluation in this area. Abstract: Every day, many people die under violent circumstances, whether from crimes, war, migration, or climate disasters. Medico-legal and law enforcement institutions document many portraits of the deceased for evidence, but cannot immediately carry out identification on them. While traditional image editing tools can process these photos for public release, the workflow is lengthy and produces suboptimal results. In this work, we leverage advances in image generation models, which can now produce photorealistic human portraits, to introduce FlowID, an identity-preserving facial reconstruction method. Our approach combines single-image fine-tuning, which adapts the generative model to out-of-distribution injured faces, with attention-based masking that localizes edits to damaged regions while preserving identity-critical features. Together, these components enable the removal of artifacts from violent death while retaining sufficient identity information to support identification. To evaluate our method, we introduce InjuredFaces, a novel benchmark for identity-preserving facial reconstruction under severe facial damage. Beyond serving as an evaluation tool for this work, InjuredFaces provides a standardized resource for the community to study and compare methods addressing facial reconstruction in extreme conditions. Experimental results show that FlowID outperforms state-of-the-art open-source methods while maintaining low memory requirements, making it suitable for local deployment without compromising data privacy.

[123] Video-Oasis: Rethinking Evaluation of Video Understanding

Geuntaek Lim,Minho Shim,Sungjune Park,Jaeyun Lee,Inwoong Lee,Taeoh Kim,Dongyoon Wee,Yukyung Choi

Main category: cs.CV

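TL;DR: This paper presents Video-Oasis, a diagnostic suite that systematically evaluates existing video understanding benchmarks. It finds that more than half of the samples can be solved without visual or temporal information and that state-of-the-art models perform close to chance on the rest; it then analyzes which algorithmic designs help, providing guidance for benchmark construction and model evaluation.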

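Details Motivation: Video understanding is inherently complex, which makes it hard to tell whether performance gains come from visual perception, linguistic reasoning, or knowledge priors; many current benchmarks do not rigorously test core abilities such as spatio-temporal modeling, so a systematic diagnostic tool is needed. Method: The sustainable diagnostic suite Video-Oasis ablates visual input and temporal context to quantify how solvable the samples in existing benchmarks are; it tests state-of-the-art models on the samples that cannot be reduced in this way and examines how different algorithmic designs (such as spatio-temporal modeling mechanisms) affect robust understanding. Result: 54% of existing benchmark samples can be solved without visual input or temporal context; on the remaining samples, state-of-the-art models score only slightly above random guessing; several key algorithmic design factors are identified that substantially improve robust video understanding. Conclusion: Mainstream video understanding benchmarks are heavily biased and do not reflect models' true spatio-temporal understanding; Video-Oasis provides a reproducible, diagnostic framework for future benchmark design and model evaluation. Abstract: The inherent complexity of video understanding makes it difficult to attribute whether performance gains stem from visual perception, linguistic reasoning, or knowledge priors. While many benchmarks have emerged to assess high-level reasoning, the essential criteria that constitute video understanding remain largely overlooked. Instead of introducing yet another benchmark, we take a step back to re-examine the current landscape of video understanding. In this work, we provide Video-Oasis, a sustainable diagnostic suite designed to systematically evaluate existing evaluations and distill spatio-temporal challenges for video understanding. Our analysis reveals two critical findings: (1) 54% of existing benchmark samples are solvable without visual input or temporal context, and (2) on the remaining samples, state-of-the-art models exhibit performance barely exceeding random guessing. To bridge this gap, we investigate which algorithmic design choices contribute to robust video understanding, providing practical guidelines for future research. We hope our work serves as a standard guideline for benchmark construction and the rigorous evaluation of architecture development. Code is available at https://github.com/sejong-rcv/Video-Oasis.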

[124] Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis

Shuang Chen,Quanxin Shou,Hangting Chen,Yucheng Zhou,Kaituo Feng,Wenbo Hu,Yi-Fan Zhang,Yunlong Lin,Wenxuan Huang,Mingyang Song,Dasen Dai,Bolin Jiang,Manyuan Zhang,Shi-Xue Zhang,Zhengkai Jiang,Lucas Wang,Zhao Zhong,Yu Cheng,Nanyun Peng

Main category: cs.CV

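TL;DR: This paper proposes Unify-Agent, an agent-based unified multimodal model that reframes image generation as an agentic pipeline of prompt understanding, multimodal evidence searching, world-knowledge-grounded recaptioning, and final synthesis, substantially improving generation for long-tail and knowledge-intensive concepts.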

Details Motivation: Existing unified multimodal models rely mainly on frozen parametric knowledge and perform poorly on real-world image generation involving long-tail and knowledge-intensive concepts; inspired by the success of agents on real-world tasks, the authors explore agentic modeling. Method: The Unify-Agent framework models image generation as an agentic pipeline of prompt understanding, multimodal evidence searching, world-knowledge-driven recaptioning, and final image synthesis; the authors build a tailored multimodal data pipeline and collect 143K high-quality agent trajectories for training; they also design the FactIP benchmark, covering 12 categories of culturally significant, long-tail factual concepts that require external knowledge grounding. Result: Unify-Agent substantially outperforms its base unified model on multiple benchmarks and real-world generation tasks and approaches the strongest closed-source models; experiments including FactIP support its effectiveness at world-knowledge grounding. Conclusion: Agentic modeling that tightly couples reasoning, searching, and generation is an important direction for reliable open-world image synthesis; this work is an early but valuable exploration of agent-based, world-grounded image generation. Abstract: Unified multimodal models provide a natural and promising architecture for understanding diverse and complex real-world knowledge while generating high-quality images. However, they still rely primarily on frozen parametric knowledge, which makes them struggle with real-world image generation involving long-tail and knowledge-intensive concepts. Inspired by the broad success of agents on real-world tasks, we explore agentic modeling to address this limitation. Specifically, we present Unify-Agent, a unified multimodal agent for world-grounded image synthesis, which reframes image generation as an agentic pipeline consisting of prompt understanding, multimodal evidence searching, grounded recaptioning, and final synthesis. To train our model, we construct a tailored multimodal data pipeline and curate 143K high-quality agent trajectories for world-grounded image synthesis, enabling effective supervision over the full agentic generation process. We further introduce FactIP, a benchmark covering 12 categories of culturally significant and long-tail factual concepts that explicitly requires external knowledge grounding. Extensive experiments show that our proposed Unify-Agent substantially improves over its base unified model across diverse benchmarks and real-world generation tasks, while approaching the world knowledge capabilities of the strongest closed-source models.
As an early exploration of agent-based modeling for world-grounded image synthesis, our work highlights the value of tightly coupling reasoning, searching, and generation for reliable open-world agentic image synthesis.

[125] BigEarthNet.txt: A Large-Scale Multi-Sensor Image-Text Dataset and Benchmark for Earth Observation

Johann-Ludwig Herzog,Mathis Jürgen Adler,Leonard Hackel,Yan Shu,Angelos Zavras,Ioannis Papoutsis,Paolo Rota,Begüm Demir

Main category: cs.CV

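TL;DR: This paper introduces BigEarthNet.txt, a large-scale, multi-sensor remote sensing image-text dataset intended to advance instruction-driven image-text learning in Earth observation, and shows that it improves the performance of vision-language models (VLMs) on remote sensing tasks.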

Details Motivation: Existing remote sensing image-text datasets are small, limited to a single sensor type, and offer few and weakly grounded annotation types, which is not enough to train and evaluate vision-language models effectively in this domain. Method: The authors build BigEarthNet.txt, containing about 464K co-registered Sentinel-1/Sentinel-2 image pairs and 9.6M diverse text annotations (geographically anchored captions, visual question answering pairs, and referring expression detection instructions); they run a comparative statistical analysis and establish a manually verified benchmark split. Result: BigEarthNet.txt clearly exceeds existing remote sensing image-text datasets in textual richness and annotation diversity; VLMs fine-tuned on it show consistent gains across remote sensing tasks, especially in understanding complex land-use/land-cover classes. Conclusion: BigEarthNet.txt provides a high-quality, multi-task data foundation for vision-language research in remote sensing, fills a gap in image-text training resources, and advances instruction-driven remote sensing understanding. Abstract: Vision-language models (VLMs) have shown strong performance in computer vision (CV), yet their performance on remote sensing (RS) data remains limited due to the lack of large-scale, multi-sensor RS image-text datasets with diverse textual annotations. Existing datasets predominantly include aerial Red-Green-Blue imagery, with short or weakly grounded captions, and provide limited diversity in annotation types. To address this limitation, we introduce BigEarthNet.txt, a large-scale, multi-sensor image-text dataset designed to advance instruction-driven image-text learning in Earth observation across multiple tasks. BigEarthNet.txt contains 464,044 co-registered Sentinel-1 synthetic aperture radar and Sentinel-2 multispectral images with 9.6M text annotations, including: i) geographically anchored captions describing land-use/land-cover (LULC) classes, their spatial relations, and environmental context; ii) visual question answering pairs relevant for different tasks; and iii) referring expression detection instructions for bounding box prediction. Through a comparative statistical analysis, we demonstrate that BigEarthNet.txt surpasses existing RS image-text datasets in textual richness and annotation type variety. We further establish a manually-verified benchmark split to evaluate VLMs in RS and CV. The results show the limitations of these models on tasks that involve complex LULC classes, whereas fine-tuning using BigEarthNet.txt results in consistent performance gains across all considered tasks.

[126] Storing Less, Finding More: How Novelty Filtering Improves Cross-Modal Retrieval on Edge Cameras

Sherif Abdelwahab

Main category: cs.CV

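TL;DR: This paper proposes a streaming cross-modal retrieval architecture for edge cameras: an on-device epsilon-net filter keeps only semantically novel frames, and a cross-modal adapter plus a cloud re-ranker improve retrieval accuracy, giving efficient and accurate video-language retrieval at low power.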

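Details Motivation: Edge cameras generate continuous video streams in which large numbers of redundant frames interfere with cross-modal retrieval by pushing correct results out of the top-k ranking. Method: A streaming retrieval architecture in which 1) an on-device epsilon-net filter keeps only semantically novel frames and builds a denoised embedding index, and 2) a cross-modal adapter and a cloud re-ranker compensate for the weak alignment of the lightweight encoder. Result: The single-pass streaming filter outperforms offline alternatives (k-means, farthest-point sampling, uniform sampling, and random sampling) across eight vision-language models (8M to 632M parameters) on two egocentric video datasets (AEA, EPIC-KITCHENS); with an 8M-parameter on-device encoder the full architecture reaches 45.6% Hit@5 on held-out data at an estimated 2.7 mW. Conclusion: The streaming architecture substantially improves cross-modal retrieval quality on edge video streams at very low power, supporting the combination of semantic de-duplication and layered alignment. Abstract: Always-on edge cameras generate continuous video streams where redundant frames degrade cross-modal retrieval by crowding correct results out of top-k search. This paper presents a streaming retrieval architecture: an on-device epsilon-net filter retains only semantically novel frames, building a denoised embedding index; a cross-modal adapter and cloud re-ranker compensate for the compact encoder's weak alignment. A single-pass streaming filter outperforms offline alternatives (k-means, farthest-point, uniform, random) across eight vision-language models (8M-632M) on two egocentric datasets (AEA, EPIC-KITCHENS). Combined, the architecture reaches 45.6% Hit@5 on held-out data using an 8M on-device encoder at an estimated 2.7 mW.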
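A single-pass epsilon-net style filter can be sketched in a few lines: a frame is stored only if its embedding is farther than `eps` from every embedding already stored. The version below is a minimal illustration under assumed choices (cosine distance, a linear scan over kept frames, toy 2-D embeddings, and an arbitrary `eps`); the paper's actual distance, threshold, and index structure are not specified in the abstract.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def novelty_filter(stream, eps):
    """Single pass over (frame_id, embedding) pairs: keep a frame only if it is
    more than `eps` away from every frame kept so far."""
    kept = []
    for frame_id, emb in stream:
        if all(cosine_distance(emb, k) > eps for _, k in kept):
            kept.append((frame_id, emb))
    return kept

# Toy stream: frames 1 and 3 are near-duplicates of frames 0 and 2.
frames = [
    (0, [1.0, 0.0]),
    (1, [0.99, 0.05]),
    (2, [0.0, 1.0]),
    (3, [0.02, 1.0]),
    (4, [0.7, 0.7]),
]
kept_ids = [fid for fid, _ in novelty_filter(frames, eps=0.1)]
```

Because near-duplicates never enter the index, a top-k search over the kept embeddings has fewer redundant neighbors competing with the correct frame, which is the effect the abstract attributes to the filter.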

[127] Self-Supervised Federated Learning under Data Heterogeneity for Label-Scarce Diatom Classification

Mingkun Tan,Xilu Wang,Michael Kloster,Tim W. Nattkemper

Main category: cs.CV

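TL;DR: This paper studies label-scarce visual classification under decentralized, heterogeneous data, focusing on the case where class sets only partially overlap across sites. The authors propose the PreDi data partitioning scheme and the PreP-WFL federated learning method to handle the different forms of heterogeneity in the pre-training and fine-tuning stages, and validate them systematically on diatom classification.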

Details Motivation: Existing self-supervised federated learning (SSFL) work assumes the same heterogeneity pattern in pre-training and fine-tuning, and current partitioning schemes cannot controllably simulate real-world label-space heterogeneity (such as overlapping and missing classes), which limits method design and analysis. Method: The PreDi partitioning scheme decouples label-space heterogeneity into two orthogonal dimensions, class Prevalence and class-set size Disparity; building on it, PreP-WFL performs personalized weighted federated learning based on class prevalence to strengthen the representation of low-prevalence classes. Result: SSFL outperforms local-only training in all settings; heterogeneity in unlabeled data volume benefits representation pre-training; under label-space heterogeneity, prevalence dominates the performance drop while disparity has a smaller effect; PreP-WFL mitigates this drop, with larger gains at lower prevalence. Conclusion: The work gives a mechanistic account of how to model and handle label-space heterogeneity in decentralized recognition systems and supports stage-specific heterogeneity analysis with targeted algorithm design. Abstract: Label-scarce visual classification under decentralized and heterogeneous data is a fundamental challenge in pattern recognition, especially when sites exhibit partially overlapping class sets. While self-supervised federated learning (SSFL) offers a promising solution, existing studies commonly assume the same data heterogeneity pattern throughout pre-training and fine-tuning. Moreover, current partitioning schemes often fail to generate pure partially class-disjoint data settings, limiting controllable simulation of real-world label-space heterogeneity. In this work, we introduce SSFL for diatom classification as a representative real-world instance and systematically investigate stage-specific data heterogeneity. We study cross-site variation in unlabeled data volume during pre-training and label-space misalignment during downstream fine-tuning. To study the latter in a controllable setting, we propose PreDi, a partitioning scheme that disentangles label-space heterogeneity into two orthogonal dimensions, namely class Prevalence and class-set size Disparity, enabling separate analysis of their effects. Guided by the resulting insights, we further propose PreP-WFL (Prevalence-based Personalized Weighted Federated Learning) to adaptively strengthen rare-class representations in low-prevalence scenarios. Extensive experiments show that SSFL consistently outperforms local-only training under both homogeneous and heterogeneous settings.
The pronounced heterogeneity in unlabeled data volume is associated with improved representation pre-training, whereas under label-space heterogeneity, prevalence dominates performance and disparity has a smaller effect. PreP-WFL effectively mitigates this degradation, with gains increasing as prevalence decreases. These findings provide a mechanistic basis for characterizing label-space heterogeneity in decentralized recognition systems.

[128] MacTok: Robust Continuous Tokenization for Image Generation

Hengyu Zeng,Xin Gao,Guanghao Li,Yuxiang Yan,Jiaoyang Ruan,Junpeng Ma,Haoyu Albert Wang,Jian Pu

Main category: cs.CV

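TL;DR: This paper proposes MacTok, a mask-augmented 1D continuous image tokenizer that combines random masking and DINO-guided semantic masking with global and local representation alignment to prevent posterior collapse, achieving high-fidelity, compact visual representations with only 64 or 128 tokens.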

Details Motivation: Continuous image tokenizers based on variational frameworks are prone to posterior collapse when using few tokens, so the encoder fails to compress informative features into the latent space. Method: MacTok uses a dual masking strategy (random masking plus DINO-guided semantic masking) to encourage robust semantic encoding, and combines it with global and local representation alignment to retain discriminative information in a highly compressed 1D latent space. Result: On ImageNet with 64/128 tokens, MacTok reaches gFID 1.44 at 256×256 and gFID 1.52 at 512×512 (SiT-XL), with token usage reduced by up to 64×. Conclusion: Masking regularization and semantic guidance together mitigate posterior collapse and enable efficient, high-fidelity image tokenization. Abstract: Continuous image tokenizers enable efficient visual generation, and those based on variational frameworks can learn smooth, structured latent representations through KL regularization. Yet this often leads to posterior collapse when using fewer tokens, where the encoder fails to encode informative features into the compressed latent space. To address this, we introduce MacTok, a Masked Augmenting 1D Continuous Tokenizer that leverages image masking and representation alignment to prevent collapse while learning compact and robust representations. MacTok applies both random masking to regularize latent learning and DINO-guided semantic masking to emphasize informative regions in images, forcing the model to encode robust semantics from incomplete visual evidence. Combined with global and local representation alignment, MacTok preserves rich discriminative information in a highly compressed 1D latent space, requiring only 64 or 128 tokens. On ImageNet, MacTok achieves a competitive gFID of 1.44 at 256×256 and a state-of-the-art 1.52 at 512×512 with SiT-XL, while reducing token usage by up to 64×. These results confirm that masking and semantic guidance together prevent posterior collapse and achieve efficient, high-fidelity tokenization.

[129] Not All Frames Are Equal: Complexity-Aware Masked Motion Generation via Motion Spectral Descriptors

Pengfei Zhou,Xiangyue Zhang,Xukun Shen,Yong Hu

Main category: cs.CV

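TL;DR: This paper proposes the Motion Spectral Descriptor (MSD), a measure of the local dynamic complexity of motion, and builds on it a complexity-aware masked motion generation method, DynMask, which markedly improves generation quality for dynamically complex motions.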

Details Motivation: Existing masked generative models treat all frames of a motion sequence alike, yet local dynamic complexity varies sharply over time, so performance degrades on complex motions; the authors aim to resolve this mismatch with the nature of motion. Method: The Motion Spectral Descriptor (MSD) is parameter-free, deterministic, and interpretable, computed from the short-time spectrum of motion velocity; it guides a content-focused masking strategy, serves as a spectral similarity prior in self-attention, and modulates token-level sampling during iterative decoding. Result: DynMask markedly improves FID on HumanML3D and KIT-ML, with the clearest gains on dynamically complex motions. Conclusion: Respecting the local complexity of motion is an important design principle for masked motion generation, and MSD and DynMask offer an effective, interpretable way to realize it. Abstract: Masked generative models have become a strong paradigm for text-to-motion synthesis, but they still treat motion frames too uniformly during masking, attention, and decoding. This is a poor match for motion, where local dynamic complexity varies sharply over time. We show that current masked motion generators degrade disproportionately on dynamically complex motions, and that frame-wise generation error is strongly correlated with motion dynamics. Motivated by this mismatch, we introduce the Motion Spectral Descriptor (MSD), a simple and parameter-free measure of local dynamic complexity computed from the short-time spectrum of motion velocity. Unlike learned difficulty predictors, MSD is deterministic, interpretable, and derived directly from the motion signal itself. We use MSD to make masked motion generation complexity-aware. In particular, MSD guides content-focused masking during training, provides a spectral similarity prior for self-attention, and can additionally modulate token-level sampling during iterative decoding. Built on top of masked motion generators, our method, DynMask, improves motion generation most clearly on dynamically complex motions while also yielding stronger overall FID on HumanML3D and KIT-ML. These results suggest that respecting local motion complexity is a useful design principle for masked motion generation. Project page: https://xiangyue-zhang.github.io/DynMask
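To make "short-time spectrum of motion velocity" concrete, the sketch below takes a 1-D joint trajectory, differences it to get velocity, computes a power spectrum per sliding window, and summarizes each window by its spectral centroid (in DFT bin units). The centroid summary, the window and hop sizes, and all function names are assumptions chosen for illustration; the abstract does not define how MSD reduces the spectrum to a score.

```python
import cmath
import math

def dft_power(x):
    """Power spectrum of a real sequence for bins 0..n/2 (naive DFT)."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) ** 2
            for k in range(n // 2 + 1)]

def spectral_complexity(positions, win=8, hop=4):
    """Per-window spectral centroid of velocity: near 0 for steady motion,
    higher when energy sits in fast oscillations (assumed summary statistic)."""
    vel = [b - a for a, b in zip(positions, positions[1:])]
    scores = []
    for s in range(0, len(vel) - win + 1, hop):
        p = dft_power(vel[s:s + win])
        total = sum(p)
        centroid = sum(k * pk for k, pk in enumerate(p)) / total if total > 1e-12 else 0.0
        scores.append(centroid)
    return scores

# Steady drift versus frame-to-frame jitter, each giving one 8-sample velocity window.
slow = spectral_complexity([0.1 * t for t in range(9)])
jitter = spectral_complexity([float(t % 2) for t in range(9)])
```

A score of this kind is deterministic and needs no learned parameters, which matches the properties the abstract claims for MSD; a generator could then, for example, mask high-scoring windows more often during training.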

[130] CutClaw: Agentic Hours-Long Video Editing via Music Synchronization

Shifang Zhao,Yihan Hu,Ying Shan,Yunchao Wei,Xiaodong Cun

Main category: cs.CV

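TL;DR: CutClaw is an autonomous multi-agent video editing framework built on multimodal large language models that automatically cuts long footage into high-quality short videos that are rhythm-aligned, music-synchronized, and narratively coherent.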

Details Motivation: Manual video editing is time-consuming and repetitive, a long-standing burden for filmmakers and content creators producing audio-aligned videos for social media. Method: Proposes the multi-agent framework CutClaw: a hierarchical multimodal decomposition extracts fine-grained and global features from video and audio; a Playwriter Agent orchestrates the narrative structure and anchors visual scenes to musical shifts; Editor and Reviewer Agents collaborate to select shots by aesthetic and semantic criteria and optimize the final cut. Result: Significantly outperforms existing state-of-the-art methods at generating rhythm-aligned, high-quality short videos. Conclusion: Multi-agent collaboration combined with multimodal large models enables automated, high-fidelity, artistic video editing and offers a new paradigm for digital content creation. Abstract: Editing the video content with audio alignment forms a digital human-made art in current social media. However, the time-consuming and repetitive nature of manual video editing has long been a challenge for filmmakers and professional content creators alike. In this paper, we introduce CutClaw, an autonomous multi-agent framework designed to edit hours-long raw footage into meaningful short videos that leverages the capabilities of multiple Multimodal Language Models (MLLMs) as an agent system. It produces videos with synchronized music, followed by instructions, and a visually appealing appearance. In detail, our approach begins by employing a hierarchical multimodal decomposition that captures both fine-grained details and global structures across visual and audio footage. Then, to ensure narrative consistency, a Playwriter Agent orchestrates the whole storytelling flow and structures the long-term narrative, anchoring visual scenes to musical shifts. Finally, to construct a short edited video, Editor and Reviewer Agents collaboratively optimize the final cut via selecting fine-grained visual content based on rigorous aesthetic and semantic criteria. We conduct detailed experiments to demonstrate that CutClaw significantly outperforms state-of-the-art baselines in generating high-quality, rhythm-aligned videos. The code is available at: https://github.com/GVCLab/CutClaw.

[131] CoRe-DA: Contrastive Regression for Unsupervised Domain Adaptation in Surgical Skill Assessment

Dimitrios Anastasiou,Razvan Caramalau,Jialang Xu,Runlong He,Freweini Tesfai,Matthew Boal,Nader Francis,Danail Stoyanov,Evangelos B. Mazomenos

Main category: cs.CV

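TL;DR: This paper presents the first unsupervised domain adaptation (UDA) benchmark for the regression task of vision-based surgical skill assessment (SSA) and proposes a new framework, CoRe-DA, that substantially improves cross-domain generalization without using target-domain labels.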

Details Motivation: Existing SSA methods are limited by the high cost of manual annotation and by poor generalization across tasks and environments, while large volumes of unlabeled surgical video create an opportunity for unsupervised domain adaptation. Method: Builds a UDA benchmark spanning four datasets across dry-lab and clinical settings and open and robotic surgery; proposes CoRe-DA, a contrastive regression framework that combines relative-score supervision with target-domain self-training to learn domain-invariant representations. Result: Across two UDA settings, CoRe-DA reaches Spearman correlation coefficients of 0.46 and 0.41 on dry-lab and clinical target datasets, respectively, outperforming existing methods without any labeled target data. Conclusion: CoRe-DA enables scalable, cross-domain-robust vision-based surgical skill assessment, addresses the poor generalization of existing methods, and advances the practical use of UDA in SSA. Abstract: Vision-based surgical skill assessment (SSA) enables objective and scalable evaluation of operative performance. Progress in this field is constrained by the high cost and time demands for manual annotation of quantitative skill scores, as well as the poor generalization of existing regression models to new surgical tasks and environments. Meanwhile, appreciable volumes of unlabeled video data are now available, motivating the development of unsupervised domain adaptation (UDA) methods for SSA. We introduce the first benchmark for UDA in SSA regression, spanning four datasets across dry-lab and clinical settings as well as open and robotic surgery. We evaluate eight representative models under challenging domain shifts and propose CoRe-DA, a novel contrastive regression-based adaptation framework. Our method learns domain-invariant representations through relative-score supervision and target-domain self-training. Comprehensive experiments across two UDA settings show that CoRe-DA is superior to state-of-the-art methods, achieving Spearman Correlation Coefficients of 0.46 and 0.41 on dry-lab and clinical target datasets, respectively, without using any labeled target data for training. Overall, CoRe-DA enables scalable SSA with reliable cross-domain generalization, where existing methods underperform. Our code and datasets will be released at https://github.com/anastadimi/CoRe-DA.

[132] Clinical DVH metrics as a loss function for 3D dose prediction in head and neck radiotherapy

Ruochen Gao,Marius Staring,Frank Dankers

Main category: cs.CV

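TL;DR: This paper proposes a clinically guided DVH metric loss (CDM loss) that combines differentiable D-metrics, surrogate V-metrics, and a lossless bit-mask ROI encoding, substantially improving head-and-neck 3D dose prediction on clinical DVH metrics while sharply reducing training overhead.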

Details Motivation: Existing deep-learning 3D dose prediction models are trained with voxel-wise regression losses that do not match clinical evaluation based on DVH metrics; a loss that directly optimizes clinical DVH metrics and remains computationally efficient is needed. Method: Proposes the CDM loss, which combines differentiable D-metrics and surrogate V-metrics, and uses a lossless bit-mask ROI encoding to improve training efficiency; validated on 174 head-and-neck patients with a temporal split (137 training, 37 testing). Result: Compared with MAE and DVH-curve losses, the CDM loss substantially improves target coverage (PTV Score reduced from 1.544 to 0.491) and satisfies all clinical constraints, with comparable organ-at-risk sparing; bit-mask encoding cuts training time by 83% and lowers GPU memory usage. Conclusion: Directly optimizing clinical DVH metrics brings 3D dose prediction closer to real treatment-planning needs; the CDM loss combined with efficient ROI encoding offers a practical, scalable solution for head-and-neck dose prediction. Abstract: Purpose: Deep-learning-based three-dimensional (3D) dose prediction is widely used in automated radiotherapy workflows. However, most existing models are trained with voxel-wise regression losses, which are poorly aligned with clinical plan evaluation criteria based on dose-volume histogram (DVH) metrics. This study aims to develop a clinically guided loss formulation that directly optimizes clinically used DVH metrics while remaining computationally efficient for head and neck (H&N) dose prediction. Methods: We propose a clinical DVH metric loss (CDM loss) that incorporates differentiable D-metrics and surrogate V-metrics, together with a lossless bit-mask region-of-interest (ROI) encoding to improve training efficiency. The method was evaluated on 174 H&N patients using a temporal split (137 training, 37 testing). Results: Compared with MAE- and DVH-curve based losses, CDM loss substantially improved target coverage and satisfied all clinical constraints. Using a standard 3D U-Net, the PTV Score was reduced from 1.544 (MAE) to 0.491 (MAE + CDM), while OAR sparing remained comparable. Bit-mask encoding reduced training time by 83% and lowered GPU memory usage. Conclusion: Directly optimizing clinically used DVH metrics enables 3D dose predictions that are better aligned with clinical treatment planning criteria than conventional voxel-wise or DVH-curve-based supervision. The proposed CDM loss, combined with efficient ROI bit-mask encoding, provides a practical and scalable framework for H&N dose prediction.
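The abstract does not spell out how the lossless bit-mask ROI encoding works. One plausible reading, offered here purely as an assumption, is that each region of interest is assigned one bit of a per-voxel integer, so that many (possibly overlapping) binary masks collapse into a single array and can be recovered exactly. The helper names `encode_rois` and `decode_roi` and the ROI names in the example are hypothetical.

```python
def encode_rois(masks):
    """Pack {roi_name: [0/1 per voxel]} into one integer per voxel.

    Assumption: ROI number i owns bit i. Overlapping ROIs stay separable,
    which is what makes the packing lossless.
    """
    names = list(masks)
    n_voxels = len(masks[names[0]])
    packed = [0] * n_voxels
    for bit, name in enumerate(names):
        for i, inside in enumerate(masks[name]):
            if inside:
                packed[i] |= 1 << bit
    return names, packed


def decode_roi(names, packed, name):
    """Recover the binary mask of a single ROI from the packed voxels."""
    bit = names.index(name)
    return [(value >> bit) & 1 for value in packed]


# Toy example with overlapping structures over five voxels.
example_masks = {
    "PTV": [1, 1, 0, 0, 0],
    "SpinalCord": [0, 1, 1, 0, 0],
    "Parotid": [0, 0, 0, 1, 0],
}
roi_names, packed_voxels = encode_rois(example_masks)
```

Storing one integer per voxel instead of one mask per ROI is the kind of change that could account for the reported savings in training time and GPU memory, although the paper's actual implementation may differ.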

[133] SkeletonContext: Skeleton-side Context Prompt Learning for Zero-Shot Skeleton-based Action Recognition

Ning Wang,Tieyue Wu,Naeha Sharif,Farid Boussaid,Guangming Zhu,Lin Mei,Mohammed Bennamoun,zhang liang

Main category: cs.CV

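TL;DR: This paper proposes SkeletonContext, a framework that enriches skeleton action representations with language-driven contextual semantics, addressing the cross-modal representation gap in zero-shot skeleton-based action recognition caused by missing contextual cues such as objects.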

Details Motivation: When aligning skeleton features with text embeddings, existing methods lack contextual cues such as the objects involved in an action, making visually similar actions hard to tell apart. Method: Proposes SkeletonContext, comprising a Cross-Modal Context Prompt Module (which uses a pretrained language model to reconstruct masked contextual prompts under LLM-derived guidance) and a Key-Part Decoupling Module (which decouples motion-relevant joint features), for fine-grained semantic alignment and robust action understanding. Result: On multiple benchmarks, SkeletonContext achieves state-of-the-art performance under both conventional and generalized zero-shot settings. Conclusion: Introducing language-driven contextual semantics effectively narrows the gap between skeleton and semantic representations and improves discrimination of visually similar actions. Abstract: Zero-shot skeleton-based action recognition aims to recognize unseen actions by transferring knowledge from seen categories through semantic descriptions. Most existing methods typically align skeleton features with textual embeddings within a shared latent space. However, the absence of contextual cues, such as objects involved in the action, introduces an inherent gap between skeleton and semantic representations, making it difficult to distinguish visually similar actions. To address this, we propose SkeletonContext, a prompt-based framework that enriches skeletal motion representations with language-driven contextual semantics. Specifically, we introduce a Cross-Modal Context Prompt Module, which leverages a pretrained language model to reconstruct masked contextual prompts under guidance derived from LLMs. This design effectively transfers linguistic context to the skeleton encoder for instance-level semantic grounding and improved cross-modal alignment. In addition, a Key-Part Decoupling Module is incorporated to decouple motion-relevant joint features, ensuring robust action understanding even in the absence of explicit object interactions. Extensive experiments on multiple benchmarks demonstrate that SkeletonContext achieves state-of-the-art performance under both conventional and generalized zero-shot settings, validating its effectiveness in reasoning about context and distinguishing fine-grained, visually similar actions.

[134] Exploring the Impact of Skin Color on Skin Lesion Segmentation

Kuniko Paxton,Medina Kapo,Amila Akagić,Koorosh Aslansefat,Dhavalkumar Thakker,Yiannis Papadopoulos

Main category: cs.CV

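TL;DR: This paper studies how skin tone affects skin lesion segmentation and proposes an analysis based on continuous pigment (ITA value) distributions, finding that low contrast between the lesion and surrounding skin, rather than overall skin tone category, is the main driver of segmentation errors.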

Details Motivation: Fairness of AI dermatology systems has been studied extensively for lesion classification, but the influence of skin tone on the segmentation stage is under-explored and is usually assessed with coarse, discrete skin tone categories that lack quantitative analysis. Method: Evaluates three segmentation models (UNet, DeepLabV3 with a ResNet50 backbone, and DINOv2) on HAM10000 and ISIC2017; introduces a continuous pigment distribution analysis based on pixel-wise ITA values, using Wasserstein distances between the skin-only, lesion-only, and whole-image distributions to quantify lesion-skin contrast and relate it to segmentation performance. Result: Global skin tone metrics (Fitzpatrick grouping or mean ITA) correlate only weakly with segmentation quality, whereas lower lesion-skin contrast goes with larger segmentation error; boundary ambiguity and low contrast are the key failure factors. Conclusion: Fairness work on dermoscopic segmentation should focus on robustness to low-contrast lesions; continuous, distribution-based pigment measures reflect segmentation bias better than discrete skin tone categories and make a better audit signal. Abstract: Skin cancer, particularly melanoma, remains a major cause of morbidity and mortality, making early detection critical. AI-driven dermatology systems often rely on skin lesion segmentation as a preprocessing step to delineate the lesion from surrounding skin and support downstream analysis. While fairness concerns regarding skin tone have been widely studied for lesion classification, the influence of skin tone on the segmentation stage remains under-quantified and is frequently assessed using coarse, discrete skin tone categories. In this work, we evaluate three strong segmentation architectures (UNet, DeepLabV3 with a ResNet50 backbone, and DINOv2) on two public dermoscopic datasets (HAM10000 and ISIC2017) and introduce a continuous pigment or contrast analysis that treats pixel-wise ITA values as distributions. Using Wasserstein distances between within-image distributions for skin-only, lesion-only, and whole-image regions, we quantify lesion-skin contrast and relate it to segmentation performance across multiple metrics. Within the range represented in these datasets, global skin tone metrics (Fitzpatrick grouping or mean ITA) show weak association with segmentation quality. In contrast, low lesion-skin contrast is consistently associated with larger segmentation errors in models, indicating that boundary ambiguity and low contrast are key drivers of failure. These findings suggest that fairness improvements in dermoscopic segmentation should prioritize robust handling of low-contrast lesions, and the distribution-based pigment measures provide a more informative audit signal than discrete skin-tone categories.
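To make the contrast measure concrete, the sketch below computes the Individual Typology Angle from CIELAB values using the commonly cited definition ITA = arctan((L* - 50) / b*) in degrees, and then compares two sets of per-pixel ITA values with a one-dimensional Wasserstein-1 distance. How the paper samples pixels, converts color spaces, or normalizes the distance is not stated in the abstract, so the helper names (`ita_degrees`, `wasserstein_1d`) and the toy pixel values are illustrative assumptions.

```python
import bisect
import math


def ita_degrees(l_star, b_star):
    """Individual Typology Angle in degrees from CIELAB L* and b*.

    atan2 is used instead of a plain division so b* == 0 does not raise;
    for b* > 0 it equals arctan((L* - 50) / b*).
    """
    return math.degrees(math.atan2(l_star - 50.0, b_star))


def wasserstein_1d(a, b):
    """Wasserstein-1 distance between two empirical 1-D samples.

    Computed as the integral of |CDF_a - CDF_b| over the pooled support.
    """
    a_sorted, b_sorted = sorted(a), sorted(b)
    points = sorted(a_sorted + b_sorted)
    distance = 0.0
    for left, right in zip(points, points[1:]):
        cdf_a = bisect.bisect_right(a_sorted, left) / len(a_sorted)
        cdf_b = bisect.bisect_right(b_sorted, left) / len(b_sorted)
        distance += abs(cdf_a - cdf_b) * (right - left)
    return distance


# Hypothetical (L*, b*) pixels for a skin region and a lesion region.
skin_pixels = [(70.0, 20.0), (72.0, 19.0), (68.0, 21.0)]
lesion_pixels = [(45.0, 15.0), (40.0, 14.0), (48.0, 16.0)]
skin_ita = [ita_degrees(l, b) for l, b in skin_pixels]
lesion_ita = [ita_degrees(l, b) for l, b in lesion_pixels]
lesion_skin_contrast = wasserstein_1d(skin_ita, lesion_ita)
```

A small distance between the lesion-only and skin-only distributions corresponds to the low-contrast cases that the abstract associates with larger segmentation errors.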

[135] FED-Bench: A Cross-Granular Benchmark for Disentangled Evaluation of Facial Expression Editing

Fengjian Xue,Xuecheng Wu,Heli Sun,Yunyun Shi,Shi Chen,Liangyu Fu,Jinheng Xie,Dingkang Yang,Hao Wang,Junxiao Xue,Liang He

Main category: cs.CV

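TL;DR: This paper proposes FED-Bench, a high-quality benchmark and evaluation suite for facial expression image editing, comprising 747 triplets and the cross-granularity metric FED-Score; an evaluation of 18 models shows that fine-grained instruction following is the current bottleneck.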

Details Motivation: Existing editing benchmarks lack high-quality facial images and precise editing instructions, and current evaluation metrics are systematically biased (for example, toward lazy editing or overfit editing). Method: Builds FED-Bench: 1) constructs 747 triplets of original image, instruction, and ground-truth image through a cascaded, scalable pipeline; 2) proposes FED-Score, which evaluates Alignment (instruction following), Fidelity (image quality and identity preservation), and Relative Expression Gain (magnitude of expression change); 3) benchmarks 18 editing models; 4) extends the pipeline to produce a 20k+ in-the-wild facial training set and validates its usefulness. Result: Current methods struggle to combine high fidelity with accurate expression manipulation, and fine-grained instruction following is the main bottleneck; fine-tuning a baseline on the new training set yields significant gains. Conclusion: FED-Bench fills the gap in high-quality benchmarks and unbiased evaluation for facial expression editing, providing a reliable test bed and data for future research. Abstract: Facial expression image editing requires fine-grained control to strictly preserve human identity and background while precisely manipulating expression. However, existing editing benchmarks primarily focus on general scenarios, lacking high-quality facial images and corresponding editing instructions. Furthermore, current evaluation metrics exhibit systemic biases in this task, often favoring lazy editing or overfit editing. To bridge these gaps, we propose FED-Bench, a comprehensive benchmark featuring rigorous testing and an accurate evaluation suite. First, we carefully construct a benchmark of 747 triplets through a cascaded and scalable pipeline, each comprising an original image, an editing instruction, and a ground-truth image for precise evaluation. Second, we introduce FED-Score, a cross-granularity evaluation protocol that disentangles assessment into three dimensions: Alignment for verifying instruction following, Fidelity for testing image quality and identity preservation, and Relative Expression Gain for quantifying the magnitude of expression changes, effectively mitigating the aforementioned evaluation biases. Third, we benchmark 18 image editing models, revealing that current approaches struggle to simultaneously achieve high fidelity and accurate expression manipulation, with fine-grained instruction following identified as the primary bottleneck. Finally, leveraging the scalable characteristic of the introduced benchmark engine, we provide a 20k+ in-the-wild facial training set and demonstrate its effectiveness by fine-tuning a baseline model that achieves significant performance gains. Our benchmark and related code will be made publicly open soon.

[136] Compressive sensing inspired self-supervised single-pixel imaging

Jijun Lu,Yifan Chen,Libang Chen,Yiqiang Zhou,Ye Zheng,Mingliang Chen,Zhe Sun,Xuelong Li

Main category: cs.CV

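TL;DR: This paper proposes SISTA-Net, a compressive-sensing-inspired self-supervised method for single-pixel imaging that unfolds the ISTA algorithm and combines a CNN-VSSM architecture with learnable sparse transforms, markedly improving reconstruction quality and noise robustness at low sampling rates.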

Details Motivation: Existing single-pixel imaging methods lack physical sparsity constraints and do not model local and global features together, leading to noise sensitivity, structural distortion, and blurred details. Method: Proposes SISTA-Net, which unfolds ISTA into an interpretable network with a data fidelity module (a hybrid CNN-VSSM architecture) and a proximal mapping module (deep nonlinear networks plus a learnable soft-thresholding operator), explicitly imposing physical sparsity in the latent domain. Result: In simulation, PSNR improves by 2.6 dB over the state of the art; real-world far-field underwater tests show a 3.4 dB average PSNR gain, confirming strong anti-interference capability. Conclusion: By combining a physical model with deep learning, SISTA-Net achieves higher reconstruction accuracy and robustness for single-pixel imaging at extremely low sampling rates, offering a new paradigm for imaging in strongly perturbed environments. Abstract: Single-pixel imaging (SPI) is a promising imaging modality with distinctive advantages in strongly perturbed environments. Existing SPI methods lack physical sparsity constraints and overlook the integration of local and global features, leading to severe noise vulnerability, structural distortions and blurred details. To address these limitations, we propose SISTA-Net, a compressive sensing-inspired self-supervised method for single-pixel imaging. SISTA-Net unfolds the Iterative Shrinkage-Thresholding Algorithm (ISTA) into an interpretable network consisting of a data fidelity module and a proximal mapping module. The fidelity module adopts a hybrid CNN-Visual State Space Model (VSSM) architecture to integrate local and global feature modeling, enhancing reconstruction integrity and fidelity. We leverage deep nonlinear networks as adaptive sparse transforms combined with a learnable soft-thresholding operator to impose explicit physical sparsity in the latent domain, enabling noise suppression and robustness to interference even at extremely low sampling rates. Extensive experiments on multiple simulation scenarios demonstrate that SISTA-Net outperforms state-of-the-art methods by 2.6 dB in PSNR. Real-world far-field underwater tests yield a 3.4 dB average PSNR improvement, validating its robust anti-interference capability.
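For readers unfamiliar with the algorithm being unfolded, the following is a minimal sketch of classical ISTA for the problem min_x 0.5 * ||Ax - y||^2 + lam * ||x||_1. It is the textbook iteration, not SISTA-Net: the network described in the abstract replaces the gradient step with a learned CNN-VSSM fidelity module and the fixed threshold with a learnable one. The helper names and the tiny dense-matrix routines are illustrative choices to keep the example dependency-free.

```python
import math


def soft_threshold(values, tau):
    """Proximal operator of tau * ||.||_1: shrink each entry toward zero by tau."""
    return [math.copysign(max(abs(v) - tau, 0.0), v) for v in values]


def matvec(matrix, vector):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def ista(A, y, lam, step, n_iter=200):
    """Classical ISTA.

    `step` should not exceed 1 / L, where L is the largest eigenvalue of
    A^T A; otherwise the iteration is not guaranteed to converge.
    """
    A_t = transpose(A)
    x = [0.0] * len(A[0])
    for _ in range(n_iter):
        residual = [p - q for p, q in zip(matvec(A, x), y)]  # A x - y
        gradient = matvec(A_t, residual)                     # A^T (A x - y)
        x = soft_threshold(
            [xi - step * g for xi, g in zip(x, gradient)], lam * step
        )
    return x
```

With an identity measurement matrix the minimizer is simply the soft-thresholded measurement, which makes a convenient sanity check. Unfolding fixes the number of iterations and turns each one into a network layer with its own trainable parameters.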

[137] Leveraging Synthetic Data for Enhancing Egocentric Hand-Object Interaction Detection

Rosario Leonardi,Antonino Furnari,Francesco Ragusa,Giovanni Maria Farinella

Main category: cs.CV

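TL;DR: This paper examines the role of synthetic data in improving hand-object interaction (HOI) detection in egocentric images, finds that combining synthetic data with only 10% of the real labeled data substantially improves detection accuracy, and releases a new synthetic data generation pipeline and the HOI-Synth benchmark.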

Details Motivation: Improving hand-object interaction detection when real labeled data are scarce or unavailable is a key challenge; this work explores the potential of synthetic data for the task. Method: Runs extensive experiments and comparative analysis on VISOR, EgoHOS, and ENIGMA-51, systematically studies how to align synthetic with real data in terms of objects, grasps, and environments, and builds a synthetic data generation pipeline that automatically annotates contact states, bounding boxes, and pixel-wise segmentation masks. Result: With synthetic data plus only 10% of the real labeled data, Overall AP improves by +5.67% (VISOR), +8.24% (EgoHOS), and +11.69% (ENIGMA-51); the better the synthetic-real alignment, the larger the gain. Conclusion: Synthetic data effectively compensates for a shortage of real labeled data and brings clear gains for HOI detection; the released HOI-Synth benchmark and open-source tools are a useful resource for follow-up research. Abstract: In this work, we explore the role of synthetic data in improving the detection of Hand-Object Interactions from egocentric images. Through extensive experimentation and comparative analysis on VISOR, EgoHOS, and ENIGMA-51 datasets, our findings demonstrate the potential of synthetic data to significantly improve HOI detection, particularly when real labeled data are scarce or unavailable. By using synthetic data and only 10% of the real labeled data, we achieve improvements in Overall AP over models trained exclusively on real data, with gains of +5.67% on VISOR, +8.24% on EgoHOS, and +11.69% on ENIGMA-51. Furthermore, we systematically study the effect of aligning synthetic data to specific real-world benchmarks with respect to objects, grasps, and environments, showing that the effectiveness of synthetic data consistently improves with better synthetic-real alignment. As a result of this work, we release a new data generation pipeline and the new HOI-Synth benchmark, which augments existing datasets with synthetic images of hand-object interaction. These data are automatically annotated with hand-object contact states, bounding boxes, and pixel-wise segmentation masks. All data, code, and tools for synthetic data generation are available at: https://fpv-iplab.github.io/HOI-Synth/.

[138] GRVS: a Generalizable and Recurrent Approach to Monocular Dynamic View Synthesis

Thomas Tanay,Mohammed Brahimi,Michal Nazarczuk,Qingwen Zhang,Sibi Catley-Chandar,Arthur Moreau,Zhensong Zhang,Eduardo Pérez-Pellitero

Main category: cs.CV

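TL;DR: This paper proposes a novel view synthesis method for monocular dynamic videos that introduces a recurrent structure and dynamic plane sweeps, achieving six-degrees-of-freedom camera control while preserving geometric consistency, and outperforms several resource-intensive scene-specific and diffusion-based models on UCSD and the newly built Kubric-4D-dyn dataset.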

Details Motivation: Scene-specific methods built on explicit motion priors tend to fail in highly dynamic regions; diffusion models produce visually plausible but geometrically inconsistent results; and both families are computationally expensive. Method: Extends a generalizable static novel view synthesis framework to dynamic inputs with (1) a recurrent loop that supports unbounded, asynchronous mapping between input and target videos, and (2) efficient plane sweeps over dynamic inputs that disentangle camera and scene motion and provide fine-grained six-degrees-of-freedom camera control. Result: On UCSD and the newly proposed Kubric-4D-dyn dataset, the model reconstructs fine geometric detail in both static and dynamic regions better than four Gaussian Splatting-based scene-specific methods and two diffusion-based methods. Conclusion: The proposed generalizable dynamic view synthesis model balances geometric consistency, control precision, and efficiency, offering a more robust and scalable approach to monocular dynamic scene modeling. Abstract: Synthesizing novel views from monocular videos of dynamic scenes remains a challenging problem. Scene-specific methods that optimize 4D representations with explicit motion priors often break down in highly dynamic regions where multi-view information is hard to exploit. Diffusion-based approaches that integrate camera control into large pre-trained models can produce visually plausible videos but frequently suffer from geometric inconsistencies across both static and dynamic areas. Both families of methods also require substantial computational resources. Building on the success of generalizable models for static novel view synthesis, we adapt the framework to dynamic inputs and propose a new model with two key components: (1) a recurrent loop that enables unbounded and asynchronous mapping between input and target videos and (2) an efficient use of plane sweeps over dynamic inputs to disentangle camera and scene motion, and achieve fine-grained, six-degrees-of-freedom camera controls. We train and evaluate our model on the UCSD dataset and on Kubric-4D-dyn, a new monocular dynamic dataset featuring longer, higher resolution sequences with more complex scene dynamics than existing alternatives. Our model outperforms four Gaussian Splatting-based scene-specific approaches, as well as two diffusion-based approaches in reconstructing fine-grained geometric details across both static and dynamic regions.

[139] SHIFT: Stochastic Hidden-Trajectory Deflection for Removing Diffusion-based Watermark

Rui Bao,Zheng Gao,Xiaoyu Li,Xiaoyan Feng,Yang Song,Jiaojiao Jiang

Main category: cs.CV

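TL;DR: This paper proposes SHIFT, a training-free attack that uses stochastic diffusion resampling to deflect the generative trajectory and thereby break diffusion-based watermark verification, reaching attack success rates of up to 100% while preserving image quality.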

Details Motivation: Existing diffusion-based watermarking methods rely on faithfully reconstructing the diffusion trajectory to verify the watermark, an assumption that constitutes a fundamental, exploitable vulnerability. Method: Proposes SHIFT, which performs stochastic diffusion resampling in latent space to deflect the generative trajectory so that the reconstructed image is statistically decoupled from the original watermark-embedded trajectory, with no training or watermark-specific knowledge required. Result: On nine representative watermarking methods (spanning noise-space, frequency-domain, and optimization-based paradigms), SHIFT achieves 95% to 100% attack success rates with almost no loss in semantic quality. Conclusion: SHIFT exposes a shared fragility of current diffusion watermarking methods, namely their dependence on trajectory reconstruction, and shows that many watermarking schemes can be broken efficiently without watermark-specific knowledge or model retraining, offering a new perspective for watermark robustness research. Abstract: Diffusion-based watermarking methods embed verifiable marks by manipulating the initial noise or the reverse diffusion trajectory. However, these methods share a critical assumption: verification can succeed only if the diffusion trajectory can be faithfully reconstructed. This reliance on trajectory recovery constitutes a fundamental and exploitable vulnerability. We propose Stochastic Hidden-Trajectory Deflection (SHIFT), a training-free attack that exploits this common weakness across diverse watermarking paradigms. SHIFT leverages stochastic diffusion resampling to deflect the generative trajectory in latent space, making the reconstructed image statistically decoupled from the original watermark-embedded trajectory while preserving strong visual quality and semantic consistency. Extensive experiments on nine representative watermarking methods spanning noise-space, frequency-domain, and optimization-based paradigms show that SHIFT achieves 95% to 100% attack success rates with nearly no loss in semantic quality, without requiring any watermark-specific knowledge or model retraining.

[140] TSHA: A Benchmark for Visual Language Models in Trustworthy Safety Hazard Assessment Scenarios

Qiucheng Yu,Ruijie Xu,Mingang Chen,Xuequan Lu,Jianfeng Dong,Chaochao Lu,Xin Tan

Main category: cs.CV

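TL;DR: This paper proposes the TSHA benchmark to address three problems in existing indoor safety hazard assessment benchmarks (domain shift from synthetic data, oversimplified tasks, and the lack of rigorous evaluation protocols), building high-quality training and test sets from multiple real-world sources and showing that they improve the safety assessment ability of vision-language models.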

Details Motivation: Vision-language models face three challenges in indoor safety hazard assessment: reliance on synthetic data causes domain shift, oversimplified safety tasks limit generalization, and rigorous evaluation protocols for complex home scenarios are missing. Method: Builds the TSHA benchmark with 81,809 training samples from four complementary sources (existing indoor datasets, internet images, AIGC images, and newly captured images) and a highly challenging test set of 1707 samples that includes multi-hazard videos and panoramic images; runs systematic experiments on 23 popular VLMs. Result: Current VLMs lack robust safety hazard assessment capabilities; models trained on TSHA improve by up to 18.3 points on the TSHA test set and also generalize better to other benchmarks. Conclusion: TSHA is a reliable, comprehensive, and challenging new benchmark that advances VLMs on real-world indoor safety assessment and underscores the value of high-quality, multimodal, real-scene data for safety-oriented AI research. Abstract: Recent advances in vision-language models (VLMs) have accelerated their application to indoor safety hazards assessment. However, existing benchmarks suffer from three fundamental limitations: (1) heavy reliance on synthetic datasets constructed via simulation software, creating a significant domain gap with real-world environments; (2) oversimplified safety tasks with artificial constraints on hazard and scene types, thereby limiting model generalization; and (3) absence of rigorous evaluation protocols to thoroughly assess model capabilities in complex home safety scenarios. To address these challenges, we introduce TSHA (Trustworthy Safety Hazards Assessment), a comprehensive benchmark comprising 81,809 carefully curated training samples drawn from four complementary sources: existing indoor datasets, internet images, AIGC images, and newly captured images. This benchmark set also includes a highly challenging test set with 1707 samples, comprising not only a carefully selected subset from the training distribution but also newly added videos and panoramic images containing multiple safety hazards, used to evaluate the model's robustness in complex safety scenarios. Extensive experiments on 23 popular VLMs demonstrate that current VLMs lack robust capabilities for safety hazard assessment. Importantly, models trained on the TSHA training set not only achieve a significant performance improvement of up to +18.3 points on the TSHA test set but also exhibit enhanced generalizability across other benchmarks, underscoring the substantial contribution and importance of the TSHA benchmark.

[141] Beyond Ground-Truth: Leveraging Image Quality Priors for Real-World Image Restoration

Fengyang Xiao,Peng Hu,Lei Xu,XingE Guo,Guanyi Qin,Yuqi Shen,Chengyu Fang,Rihan Zhang,Chunming He,Sina Farsiu

Main category: cs.CV

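TL;DR: This paper proposes the IQPIR framework, which uses an Image Quality Prior (IQP) extracted from no-reference image quality assessment models to guide image restoration, improving the perceptual quality of restored images through a quality-conditioned Transformer, a dual-branch codebook structure, and a quality optimization strategy over discrete representations.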

Details Motivation: Existing image restoration methods rely on ground-truth supervision whose perceptual fidelity may be inconsistent, so models converge to the average quality of the training data rather than the highest perceptual quality. Method: Proposes the IQPIR framework, which combines an Image Quality Prior (IQP) with a learned codebook prior: (1) a quality-conditioned Transformer that takes NR-IQA scores as conditioning signals; (2) a dual-branch codebook structure that separates common and high-quality-specific features; (3) a quality optimization strategy based on discrete representations that mitigates over-optimization in continuous latent spaces. Result: On real-world image restoration, the method surpasses current state-of-the-art methods and can be applied to existing methods as a general quality-guided enhancement strategy. Conclusion: By introducing an image quality prior, IQPIR effectively improves the perceptual quality of restored images and offers good generality and plug-and-play use. Abstract: Real-world image restoration aims to restore high-quality (HQ) images from degraded low-quality (LQ) inputs captured under uncontrolled conditions. Existing methods typically depend on ground-truth (GT) supervision, assuming that GT provides perfect reference quality. However, GT can still contain images with inconsistent perceptual fidelity, causing models to converge to the average quality level of the training data rather than achieving the highest perceptual quality attainable. To address these problems, we propose a novel framework, termed IQPIR, that introduces an Image Quality Prior (IQP), extracted from pre-trained No-Reference Image Quality Assessment (NR-IQA) models, to explicitly guide the restoration process toward perceptually optimal outputs. Our approach synergistically integrates IQP with a learned codebook prior through three key mechanisms: (1) a quality-conditioned Transformer, where NR-IQA-derived scores serve as conditioning signals to steer the predicted representation toward maximal perceptual quality, a design that provides a plug-and-play enhancement compatible with existing restoration architectures without structural modification; (2) a dual-branch codebook structure, which disentangles common and HQ-specific features, ensuring a comprehensive representation of both generic structural information and quality-sensitive attributes; and (3) a discrete representation-based quality optimization strategy, which mitigates over-optimization effects commonly observed in continuous latent spaces. Extensive experiments on real-world image restoration demonstrate that our method not only surpasses cutting-edge methods but also serves as a generalizable quality-guided enhancement strategy for existing methods. The code is available.

[142] From Skeletons to Semantics: Design and Deployment of a Hybrid Edge-Based Action Detection System for Public Safety

Ganen Sethupathy,Lalit Dumka,Jan Schagen

Main category: cs.CV

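TL;DR: This paper presents a hybrid action detection system for edge computing that combines skeleton-based motion analysis with vision-language models, enabling real-time, semantic understanding of violent behaviour while preserving privacy and low latency.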

Details Motivation: Public spaces need timely, reliable detection of potentially violent behaviour, but deploying automated video analysis at the edge is constrained by latency, privacy, and resource limits. Method: Designs and deploys a hybrid edge-based action detection system: lightweight skeleton analysis provides real-time, privacy-aware motion monitoring, and vision-language models supply contextual understanding and zero-shot reasoning; the focus is a system-level comparison of the two paradigms under realistic edge constraints. Result: The system is implemented and evaluated on a GPU-enabled edge device; the results show that skeleton-driven and semantics-driven approaches have complementary strengths and weaknesses, and that a hybrid architecture can improve detection robustness and adaptability at a controlled overhead. Conclusion: The hybrid architecture offers a feasible, practical path to video analysis for public safety that balances privacy, real-time operation, and semantic understanding. Abstract: Public spaces such as transport hubs, city centres, and event venues require timely and reliable detection of potentially violent behaviour to support public safety. While automated video analysis has made significant progress, practical deployment remains constrained by latency, privacy, and resource limitations, particularly under edge-computing conditions. This paper presents the design and demonstrator-based deployment of a hybrid edge-based action detection system that combines skeleton-based motion analysis with vision-language models for semantic scene interpretation. Skeleton-based processing enables continuous, privacy-aware monitoring with low computational overhead, while vision-language models provide contextual understanding and zero-shot reasoning capabilities for complex and previously unseen situations. Rather than proposing new recognition models, the contribution focuses on a system-level comparison of both paradigms under realistic edge constraints. The system is implemented on a GPU-enabled edge device and evaluated with respect to latency, resource usage, and operational trade-offs using a demonstrator-based setup. The results highlight the complementary strengths and limitations of motion-centric and semantic approaches and motivate a hybrid architecture that selectively augments fast skeleton-based detection with higher-level semantic reasoning. The presented system provides a practical foundation for privacy-aware, real-time video analysis in public safety applications.

[143] MAPLE: Multi-Path Adaptive Propagation with Level-Aware Embeddings for Hierarchical Multi-Label Image Classification

Boshko Koloski,Marjan Stoimchev,Jurica Levatić,Dragi Kocev,Sašo Džeroski

Main category: cs.CV

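TL;DR: This paper proposes MAPLE, a framework for multi-path hierarchical multi-label classification (HMLC) of remote sensing images that uses graph-aware text initialization, graph-convolutional structure encoding, and adaptive multi-modal fusion to substantially improve few-shot performance.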

Details Motivation: Existing HMLC methods struggle to exploit hierarchical information in multi-path settings, leaving complex label dependencies in remote sensing images under-modeled. Method: Proposes MAPLE with three parts: (i) hierarchical semantic initialization from graph-aware textual descriptions; (ii) graph structure encoding with graph convolutional networks (GCNs); (iii) adaptive multi-modal fusion that dynamically balances semantic priors and visual evidence, plus a level-aware adaptive loss. Result: On CORINE-aligned remote sensing datasets (AID, DFC-15, MLRSNet), few-shot performance improves by up to +42% with only 2.6% additional parameters. Conclusion: MAPLE models hierarchical semantics for Earth observation effectively and efficiently and is especially suited to multi-path, few-shot remote sensing classification. Abstract: Hierarchical multi-label classification (HMLC) is essential for modeling structured label dependencies in remote sensing. Yet existing approaches struggle in multi-path settings, where images may activate multiple taxonomic branches, leading to underuse of hierarchical information. We propose MAPLE (Multi-Path Adaptive Propagation with Level-Aware Embeddings), a framework that integrates (i) hierarchical semantic initialization from graph-aware textual descriptions, (ii) graph-based structure encoding via graph convolutional networks (GCNs), and (iii) adaptive multi-modal fusion that dynamically balances semantic priors and visual evidence. An adaptive level-aware objective automatically selects appropriate losses per hierarchy level. Evaluations on CORINE-aligned remote sensing datasets (AID, DFC-15, and MLRSNet) show consistent improvements of up to +42% in few-shot regimes while adding only 2.6% parameter overhead, demonstrating that MAPLE effectively and efficiently models hierarchical semantics for Earth observation (EO).

[144] Multi-Feature Fusion Approach for Generative AI Images Detection

Abderrezzaq Sendjasni,Mohamed-Chaker Larabi

Main category: cs.CV

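TL;DR: This paper proposes a multi-feature fusion framework that combines low-level statistical, mid-level texture, and high-level semantic features for robust detection of generative-AI images, outperforming existing methods on several benchmark datasets.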

Details Motivation: Existing generated-image detectors that rely on a single feature space lack robustness against diverse, rapidly evolving generative models. Method: Builds a framework that fuses three complementary feature types: MSCN (low-level statistical deviations), CLIP embeddings (high-level semantic coherence), and MLBP (mid-level texture anomalies). Result: Experiments on four benchmark datasets show that fusing all three features clearly beats any single feature, is more stable in the mixed-model scenario, and consistently exceeds current state-of-the-art methods overall. Conclusion: Multi-feature fusion is key to robust generated-image detection, and this work provides a principled framework for integrating complementary visual cues. Abstract: The rapid evolution of Generative AI (GenAI) models has led to synthetic images of unprecedented realism, challenging traditional methods for distinguishing them from natural photographs. While existing detectors often rely on single-feature spaces, such as statistical regularities, semantic embeddings, or texture patterns, these approaches tend to lack robustness when confronted with diverse and evolving generative models. In this work, we investigate and systematically evaluate a multi-feature fusion framework that combines complementary cues from three distinct spaces: (1) Mean Subtracted Contrast Normalized (MSCN) features capturing low-level statistical deviations; (2) CLIP embeddings encoding high-level semantic coherence; and (3) Multi-scale Local Binary Patterns (MLBP) characterizing mid-level texture anomalies. Through extensive experiments on four benchmark datasets covering a wide range of generative models, we show that individual feature spaces exhibit significant performance variability across different generators. Crucially, the fusion of all three representations yields superior and more consistent performance, particularly in a challenging mixed-model scenario. Compared to state-of-the-art methods, the proposed framework yields consistently improved performance across all evaluated datasets. Overall, this work highlights the importance of hybrid representations for robust GenAI image detection and provides a principled framework for integrating complementary visual cues.

[145] SceneTeract: Agentic Functional Affordances and VLM Grounding in 3D Scenes

Léopold Maillard,Francis Engelmann,Tom Durand,Boxiao Pan,Yang You,Or Litany,Leonidas Guibas,Maks Ovsjanikov

Main category: cs.CV

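TL;DR: This paper proposes SceneTeract, a framework for verifying the functionality of 3D scenes under agent-specific constraints in embodied AI; it couples high-level semantic reasoning with low-level geometric checks, uncovers functional failures in synthetic environments and systematic biases in how VLMs reason about functional feasibility, and supports VLM post-training with geometric constraints.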

Details Motivation: Embodied AI depends on interactive 3D environments that support meaningful activities for diverse users, yet assessing their functional affordances remains a core challenge. Method: Proposes the SceneTeract framework, built around a grounded verification engine that couples high-level semantic reasoning with low-level geometric checks; it decomposes complex activities into sequences of atomic actions and validates each against an embodied agent profile (e.g., reachability, clearance, navigability) using physical and geometric simulation; it is also used as a reward signal for VLM post-training. Result: (i) Frequent failures of basic interactions are found in synthetic indoor environments; (ii) frontier vision-language models (VLMs) show a systematic mismatch between semantic confidence and physical feasibility when reasoning about functional affordances; (iii) using SceneTeract as a reward engine effectively distills geometric constraints into VLM reasoning. Conclusion: SceneTeract provides a scalable, verifiable tool for bridging perception and physical reality, moving 3D scene understanding in embodied AI from appearance recognition toward assessment of real functional usability. Abstract: Embodied AI depends on interactive 3D environments that support meaningful activities for diverse users, yet assessing their functional affordances remains a core challenge. We introduce SceneTeract, a framework that verifies 3D scene functionality under agent-specific constraints. Our core contribution is a grounded verification engine that couples high-level semantic reasoning with low-level geometric checks. SceneTeract decomposes complex activities into sequences of atomic actions and validates each step against accessibility requirements (e.g., reachability, clearance, and navigability) conditioned on an embodied agent profile, using explicit physical and geometric simulations. We deploy SceneTeract to perform an in-depth evaluation of (i) synthetic indoor environments, uncovering frequent functional failures that prevent basic interactions, and (ii) the ability of frontier Vision-Language Models (VLMs) to reason about and predict functional affordances, revealing systematic mismatches between semantic confidence and physical feasibility even for the strongest current models. Finally, we leverage SceneTeract as a reward engine for VLM post-training, enabling scalable distillation of geometric constraints into reasoning models. We release the SceneTeract verification suite and data to bridge perception and physical reality in embodied 3D scene understanding.

[146] AutoFormBench: Benchmark Dataset for Automating Form Understanding

Gaurab Baral,Junxiu Zhou

Main category: cs.CV

TL;DR: This paper introduces the AutoFormBench benchmark dataset and systematically compares OpenCV with several YOLO architectures on form element detection; YOLOv11 performs best.

Details Motivation: Automated processing of structured documents (such as government forms, healthcare records, and enterprise invoices) remains a persistent challenge because layouts vary widely in real-world settings. Method: Builds AutoFormBench, a benchmark of 407 real-world forms, and compares classical OpenCV methods with four YOLO architectures (YOLOv8, YOLOv11, YOLOv26-s, YOLOv26-l) on localizing and classifying fillable form elements (checkboxes, input lines, text boxes). Result: YOLOv11 consistently outperforms the other methods in both F1 score and Jaccard accuracy across all element classes and tolerance levels. Conclusion: YOLOv11 is the best of the evaluated models for form element detection, and AutoFormBench offers a practical, representative evaluation benchmark for the field. Abstract: Automated processing of structured documents such as government forms, healthcare records, and enterprise invoices remains a persistent challenge due to the high degree of layout variability encountered in real-world settings. This paper introduces AutoFormBench, a benchmark dataset of 407 annotated real-world forms spanning government, healthcare, and enterprise domains, designed to train and evaluate form element detection models. We present a systematic comparison of classical OpenCV approaches and four YOLO architectures (YOLOv8, YOLOv11, YOLOv26-s, and YOLOv26-l) for localizing and classifying fillable form elements, specifically checkboxes, input lines, and text boxes, across diverse PDF document types. YOLOv11 demonstrates consistently superior performance in both F1 score and Jaccard accuracy across all element classes and tolerance levels.

[147] Toward Generalizable Whole Brain Representations with High-Resolution Light-Sheet Data

Minyoung E. Kim,Dae Hee Yun,Aditi V. Patel,Madeline Hon,Webster Guan,Taegeon Lee,Brian Nguyen

Main category: cs.CV

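TL;DR: This paper introduces CANVAS, a benchmark for subcellular-resolution whole-brain 3D microscopy data. It comprises high-resolution light-sheet fluorescence microscopy images covering six neuronal and immune cell-type markers, whole-brain cell annotations, and an evaluation leaderboard, and aims to advance scalable analysis methods and foundation models for petabyte-scale tissue imaging data.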

Details Motivation: Subcellular-resolution whole-brain 3D microscopy data lack scalable processing and analysis methods, and general-purpose vision models generalize poorly to tissue imaging data that is this heterogeneous, morphologically variable, and large; a dedicated benchmark is needed to drive method development. Method: Builds the CANVAS benchmark: acquires and integrates subcellular-resolution whole mouse brain light-sheet fluorescence microscopy (LSFM) 3D image data labeled with six cell-type markers, provides brain-wide cell-level annotations, and sets up a public leaderboard; also evaluates how well baseline models built on existing object detection and classification architectures generalize to these data. Result: Existing vision models generalize markedly worse across cell phenotypes and anatomical brain regions, mainly because cellular morphology is highly heterogeneous; CANVAS is the first and largest subcellular-level whole-brain LSFM benchmark. Conclusion: CANVAS fills the gap left by the absence of a standardized benchmark for subcellular-level whole-brain imaging analysis, providing key infrastructure and an evaluation platform for developing more robust, scalable biomedical imaging models. Abstract: Unprecedented visual details of biological structures are being revealed by subcellular-resolution whole-brain 3D microscopy data, enabled by recent advances in intact tissue processing and light-sheet fluorescence microscopy (LSFM). These volumetric data offer rich morphological and spatial cellular information; however, the lack of scalable data processing and analysis methods tailored to these petabyte-scale data poses a substantial challenge for accurate interpretation. Further, existing models for visual tasks such as object detection and classification struggle to generalize to this type of data. To accelerate the development of suitable methods and foundational models, we present CANVAS, a comprehensive set of high-resolution whole mouse brain LSFM benchmark data, encompassing six neuronal and immune cell-type markers, along with cell annotations and a leaderboard. We also demonstrate challenges in generalization of baseline models built on existing architectures, especially due to the heterogeneity in cellular morphology across phenotypes and anatomical locations in the brain. To the best of our knowledge, CANVAS is the first and largest LSFM benchmark that captures intact mouse brain tissue at subcellular level, and includes extensive annotations of cells throughout the brain.

[148] Diffusion-Based Feature Denoising with NNMF for Robust handwritten digit multi-class classification

Hiba Adil Al-kharsan,Róbert Rajkó

Main category: cs.CV

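TL;DR: This paper proposes a robust multi-class handwritten digit classification framework that combines diffusion-driven feature denoising with a hybrid feature representation. A diffuse-then-denoise mechanism in feature space improves robustness to noise and adversarial attacks, and the method is evaluated under AutoAttack.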

Details Motivation: Improve the robustness of multi-class handwritten digit classifiers to noise and adversarial attacks, borrowing the feature-space defense idea from the authors' earlier brain tumor classification work. Method: (1) Use Nonnegative Matrix Factorization (NNMF) to produce compact, interpretable representations; (2) extract deep CNN features in parallel; (3) fuse the two into a hybrid feature representation; (4) apply stepwise Gaussian noise in feature space (the diffusion step); (5) train a feature denoiser network to recover clean representations; (6) classify using the denoised features. Result: In both baseline and adversarial (AutoAttack) settings, the diffusion-based hybrid model outperforms the CNN baselines, combining high classification accuracy with strong robustness. Conclusion: A feature-level diffusion defense effectively improves the reliability and robustness of multi-class handwritten digit classification. Abstract: This work presents a robust multi-class classification framework for handwritten digits that combines diffusion-driven feature denoising with a hybrid feature representation. Inspired by our previous work on brain tumor classification, the proposed approach operates in feature space to improve robustness to noise and adversarial attacks. First, the input images are converted into compact, interpretable representations using Nonnegative Matrix Factorization (NNMF). In parallel, deep features are extracted using a convolutional neural network (CNN). These features are combined into a unified hybrid representation. To improve robustness, a stepwise diffusion operation is applied in the feature space by gradually adding Gaussian noise. A feature denoiser network is trained to reverse this operation and rebuild clean representations from noisy inputs. The denoised features are then used for multi-class classification. The proposed method is evaluated in both baseline and adversarial settings using AutoAttack. The experimental results show that the diffusion-based hybrid model is both effective and robust, outperforming the CNN baseline models while maintaining strong classification performance. These results demonstrate the effectiveness of feature-level diffusion defense for reliable multi-class handwritten digit classification.
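
The abstract above describes gradually adding Gaussian noise to a feature vector but does not state the noise schedule. As a minimal sketch, assuming a DDPM-style linear variance schedule (an assumption, not something the abstract confirms), the forward noising step can be written in closed form. The names `linear_beta_schedule`, `alpha_bar_at`, and `diffuse_features` are hypothetical.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, linearly spaced (assumed schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def alpha_bar_at(t, betas):
    """Fraction of signal variance that survives after t noising steps."""
    alpha_bar = 1.0
    for beta in betas[:t]:
        alpha_bar *= 1.0 - beta
    return alpha_bar


def diffuse_features(features, t, betas, rng):
    """Noise a feature vector to step t in one shot:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, eps ~ N(0, 1).
    """
    alpha_bar = alpha_bar_at(t, betas)
    signal = math.sqrt(alpha_bar)
    noise = math.sqrt(1.0 - alpha_bar)
    return [signal * x + noise * rng.gauss(0.0, 1.0) for x in features]


# Toy usage: a 4-dimensional stand-in for a fused NNMF + CNN feature vector.
betas = linear_beta_schedule(100)
clean = [0.2, 1.5, 0.0, 0.7]
noisy = diffuse_features(clean, 50, betas, random.Random(0))
```

A denoiser network would then be trained on pairs of `noisy` and `clean` vectors; that part depends on the authors' architecture and is not sketched here.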

[149] Training deep learning based dynamic MR image reconstruction using synthetic fractals

Anirudh Raman,Olivier Jaubert,Mark Wrobel,Tina Yao,Ruaraidh Campbell,Rebecca Baker,Ruta Virsinskaite,Daniel Knight,Michael Quail,Jennifer Steeden,Vivek Muthurangu

Main category: cs.CV

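TL;DR: This study examines whether synthetic fractal data can be used to train deep learning models for dynamic MRI reconstruction; models trained on fractal data match models trained on real cardiac MRI data in image quality and clinical measurements.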

Details Motivation: Address the privacy, licensing, and availability limitations of cardiac MRI training data. Method: Generate 2D+time images from quaternion Julia fractals as training data, simulate multi-coil MRI acquisition to produce paired fully sampled and radially undersampled k-space data, train a 3D UNet deep artefact suppression model (F-DL), and compare it with a model trained on real cardiac MRI data (CMR-DL). Result: F-DL and CMR-DL show no significant difference in qualitative image-quality ranking (p=0.9), and both significantly outperform compressed sensing (CS) and low-rank deep image prior (LR-DIP); ventricular volumes and ejection fraction from F-DL show no significant bias against reference cine MRI, whereas LR-DIP shows a significant bias (p=0.016). Conclusion: Deep learning models trained on synthetic fractal data can reconstruct real-time cardiac MRI with image quality and clinical measurement accuracy comparable to models trained on real data; fractal training data offer an open, scalable alternative for dynamic MRI and may help build more generalisable deep learning reconstruction models. Abstract: Purpose: To investigate whether synthetically generated fractal data can be used to train deep learning (DL) models for dynamic MRI reconstruction, thereby avoiding the privacy, licensing, and availability limitations associated with cardiac MR training datasets. Methods: A training dataset was generated using quaternion Julia fractals to produce 2D+time images. Multi-coil MRI acquisition was simulated to generate paired fully sampled and radially undersampled k-space data. A 3D UNet deep artefact suppression model was trained using these fractal data (F-DL) and compared with an identical model trained on cardiac MRI data (CMR-DL). Both models were evaluated on prospectively acquired radial real-time cardiac MRI from 10 patients. Reconstructions were compared against compressed sensing (CS) and low-rank deep image prior (LR-DIP). All reconstructions were ranked for image quality, while ventricular volumes and ejection fraction were compared with reference breath-hold cine MRI. Results: There was no significant difference in qualitative ranking between F-DL and CMR-DL (p=0.9), while both outperformed CS and LR-DIP (p<0.001). Ventricular volumes and function derived from F-DL were similar to CMR-DL, showing no significant bias and acceptable limits of agreement compared to reference cine imaging. However, LR-DIP had a significant bias (p=0.016) and wider limits of agreement.
Conclusion: DL models trained using synthetic fractal data can reconstruct real-time cardiac MRI with image quality and clinical measurements comparable to models trained on true cardiac MRI data. Fractal training data provide an open, scalable alternative to clinical datasets and may enable development of more generalisable DL reconstruction models for dynamic MRI.

[150] Abstraction in Style

Min Lu,Yuanfeng He,Anthony Chen,Jianhuang He,Pu Wang,Daniel Cohen-Or,Hui Huang

Main category: cs.CV

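TL;DR: This paper proposes Abstraction in Style (AiS), a framework that decouples structural abstraction from visual stylization. An intermediate abstraction proxy reinterprets the structure of the target image, enabling broader, more expressive, and more controllable artistic style transfer.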

Details Motivation: Conventional style transfer methods typically keep the input geometry unchanged, so they struggle to capture the deeper structural abstraction found in artistic styles such as illustrative and non-photorealistic ones. Method: AiS works in two stages: first, from a small set of style exemplars, it derives an "abstraction proxy" that reinterprets the semantic structure of the target image while relaxing geometric fidelity; second, it renders that proxy into the final stylized result. Both stages are learned through a shared image-space analogy, with no explicit geometric supervision. Result: AiS supports a wider range of style transfers, improves the controllability and expressiveness of stylization, and models abstraction logic effectively. Conclusion: Treating abstraction as an explicit, transferable process, rather than a side effect of appearance, is key to deeper and more flexible artistic style transfer. Abstract: Artistic styles often embed abstraction beyond surface appearance, involving deliberate reinterpretation of structure rather than mere changes in texture or color. Conventional style transfer methods typically preserve the input geometry and therefore struggle to capture this deeper abstraction behavior, especially for illustrative and nonphotorealistic styles. In this work, we introduce Abstraction in Style (AiS), a generative framework that separates structural abstraction from visual stylization. Given a target image and a small set of style exemplars, AiS first derives an intermediate abstraction proxy that reinterprets the target's structure in accordance with the abstraction logic exhibited by the style. The proxy captures semantic structure while relaxing geometric fidelity, enabling subsequent stylization to operate on an abstracted representation rather than the original image. In a second stage, the abstraction proxy is rendered to produce the final stylized output, preserving visual coherence with the reference style. Both stages are implemented using a shared image space analogy, enabling transformations to be learned from visual exemplars without explicit geometric supervision. By decoupling abstraction from appearance and treating abstraction as an explicit, transferable process, AiS supports a wider range of stylistic transformations, improves controllability, and enables more expressive stylization.

[151] End-to-End Image Compression with Segmentation Guided Dual Coding for Wind Turbines

Raül Pérez-Gonzalo,Andreas Espersen,Søren Forchhammer,Antonio Agudo

Main category: cs.CV

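TL;DR: This paper proposes an end-to-end deep learning framework that jointly performs segmentation of wind turbine blade images and dual-mode (lossy plus lossless) region-of-interest (ROI) compression, preserving high-fidelity reconstruction of the blade region at high compression ratios so that downstream defect detection is not compromised.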

Details Motivation: Frequent transfer of high-resolution images during wind turbine inspections creates a bandwidth bottleneck; conventional compression cannot combine aggressive background compression with high fidelity in the blade region, which hurts downstream defect identification. Method: Builds an end-to-end framework for joint segmentation and dual-mode compression: (1) a BU-Netv2+P segmentation network with a CRF-regularized loss for precise blade localization; (2) a hyperprior-based autoencoder for lossy compression of the blade ROI; (3) an extended hierarchical bits-back coder for fully lossless blade reconstruction; (4) reuse of background-coded bits to remove the sequential dependency in bits-back coding, enabling parallel compression. Result: On a large-scale wind turbine image dataset, the method outperforms existing approaches in both compression rate and reconstruction quality (especially in the blade region), supports parallel and efficient dual-mode compression, and is, per the authors, the first learning-based ROI codec to integrate segmentation, lossy, and lossless compression. Conclusion: The framework resolves the tension between aggressive background compression and high blade fidelity in inspection imagery, providing a practical and reliable compression basis for automated defect detection. Abstract: Transferring large volumes of high-resolution images during wind turbine inspections introduces a bottleneck in assessing and detecting severe defects. Efficient coding must preserve high fidelity in blade regions while aggressively compressing the background. In this work, we propose an end-to-end deep learning framework that jointly performs segmentation and dual-mode (lossy and lossless) compression. The segmentation module accurately identifies the blade region, after which our region-of-interest (ROI) compressor encodes it at superior quality compared to the rest of the image. Unlike conventional ROI schemes that merely allocate more bits to salient areas, our framework integrates: (i) a robust segmentation network (BU-Netv2+P) with a CRF-regularized loss for precise blade localization, (ii) a hyperprior-based autoencoder optimized for lossy compression, and (iii) an extended bits-back coder with hierarchical models for fully lossless blade reconstruction. Furthermore, our ROI framework removes the sequential dependency in bits-back coding by reusing background-coded bits, enabling parallelized and efficient dual-mode compression. To the best of our knowledge, this is the first fully integrated learning-based ROI codec combining segmentation, lossy, and lossless compression, ensuring that subsequent defect detection is not compromised. Experiments on a large-scale wind turbine dataset demonstrate superior compression performance and efficiency, offering a practical solution for automated inspections.

[152] Gloria: Consistent Character Video Generation via Content Anchors

Yuhang Yang,Fan Zhang,Huaijin Pi,Shuai Guo,Guowei Xu,Wei Zhai,Yang Cao,Zheng-Jun Zha

Main category: cs.CV

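TL;DR: This paper proposes an anchor-frame-based method for generating digital character videos. A compact representation of visual attributes, together with two new mechanisms (Superset Content Anchoring and RoPE as Weak Condition), improves the consistency of character identity and appearance over long durations and multiple views.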

Details Motivation: When generating long, multi-view-consistent, expressive digital character videos, existing methods either preserve identity poorly or rely on non-character-centric information as memory, which leads to suboptimal consistency. Method: Represents a character's visual attributes with a small set of anchor frames; introduces Superset Content Anchoring to prevent copy-pasting and multi-reference conflicts; uses RoPE as a weak condition that encodes positional offsets to distinguish multiple anchors; builds a scalable pipeline for extracting anchor frames. Result: Experiments show the method generates high-quality character videos longer than 10 minutes, with expressive identity and consistent appearance across views, surpassing existing methods. Conclusion: Anchor-frame representation combined with the new reference mechanisms effectively addresses consistency in long-duration character video generation and offers a new paradigm for digital character content creation. Abstract: Digital characters are central to modern media, yet generating character videos with long-duration, consistent multi-view appearance and expressive identity remains challenging. Existing approaches either provide insufficient context to preserve identity or leverage non-character-centric information as the memory, leading to suboptimal consistency. Recognizing that character video generation inherently resembles an outside-looking-in scenario, we propose representing the character visual attributes through a compact set of anchor frames. This design provides stable references for consistency, while reference-based video generation inherently faces challenges of copy-pasting and multi-reference conflicts. To address these, we introduce two mechanisms: Superset Content Anchoring, providing intra- and extra-training clip cues to prevent duplication, and RoPE as Weak Condition, encoding positional offsets to distinguish multiple anchors. Furthermore, we construct a scalable pipeline to extract these anchors from massive videos. Experiments show our method generates high-quality character videos exceeding 10 minutes, and achieves expressive identity and appearance consistency across views, surpassing existing methods.

[153] Better than Average: Spatially-Aware Aggregation of Segmentation Uncertainty Improves Downstream Performance

Vanessa Emanuela Guarino,Claudia Winklmayr,Jannik Franzen,Josef Lorenz Rumberger,Manuel Pfeuffer,Sonja Greven,Klaus Maier-Hein,Carsten T. Lüth,Christoph Karg,Dagmar Kainmueller

Main category: cs.CV

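TL;DR: This paper systematically analyzes aggregation strategies for uncertainty quantification (UQ) in image segmentation, identifies the limitations of the common global average, proposes new strategies that incorporate spatial structure along with a meta-aggregator that is robust across datasets, and shows their advantage for OoD and failure detection on ten datasets.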

Details Motivation: Aggregation strategies for segmentation uncertainty have not been systematically studied or compared, which leads to inconsistent practice and unclear best practices; in particular, the effect of spatial and structural information on downstream performance is overlooked. Method: (1) Formally analyze the properties, limitations, and pitfalls of common aggregation strategies; (2) propose new strategies that incorporate the spatial structure of segmentation uncertainty; (3) benchmark OoD detection and failure detection on ten datasets that differ in geometry and structure; (4) design a meta-aggregator that integrates multiple aggregators to improve robustness across datasets. Result: Aggregators that exploit spatial structure perform better at both OoD and failure detection, but the performance of any single strategy depends heavily on dataset characteristics; the proposed meta-aggregator performs robustly on all datasets. Conclusion: The choice of aggregation strategy should account for the spatial and structural properties of the dataset; spatially aware aggregation is more effective; the meta-aggregator offers a general, robust option for practical use. Abstract: Uncertainty Quantification (UQ) is crucial for ensuring the reliability of automated image segmentations in safety-critical domains like biomedical image analysis or autonomous driving. In segmentation, UQ generates pixel-wise uncertainty scores that must be aggregated into image-level scores for downstream tasks like Out-of-Distribution (OoD) or failure detection. Despite routine use of aggregation strategies, their properties and impact on downstream task performance have not yet been comprehensively studied. Global Average is the default choice, yet it does not account for spatial and structural features of segmentation uncertainty. Alternatives like patch-, class- and threshold-based strategies exist, but lack systematic comparison, leading to inconsistent reporting and unclear best practices. We address this gap by (1) formally analyzing properties, limitations, and pitfalls of common strategies; (2) proposing novel strategies that incorporate spatial uncertainty structure; and (3) benchmarking their performance on OoD and failure detection across ten datasets that vary in image geometry and structure. We find that aggregators leveraging spatial structure yield stronger performance in both downstream tasks studied. However, the performance of individual aggregators depends heavily on dataset characteristics, so we (4) propose a meta-aggregator that integrates multiple aggregators and performs robustly across datasets.
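
To make the abstract's point about Global Average concrete, here is a small sketch comparing it with one simple patch-based aggregator (the maximum of sliding-window means). This is an illustration of the general idea only; it is not the paper's proposed strategy or its meta-aggregator, and the function names are hypothetical.

```python
def global_average(unc):
    """Image-level score: mean of all pixel-wise uncertainties."""
    values = [v for row in unc for v in row]
    return sum(values) / len(values)


def max_patch_average(unc, patch):
    """Image-level score: highest mean uncertainty over any patch x patch window."""
    rows, cols = len(unc), len(unc[0])
    if patch > rows or patch > cols:
        raise ValueError("patch is larger than the uncertainty map")
    best = float("-inf")
    for r in range(rows - patch + 1):
        for c in range(cols - patch + 1):
            total = sum(unc[r + i][c + j] for i in range(patch) for j in range(patch))
            best = max(best, total / (patch * patch))
    return best


# Two toy 4x4 uncertainty maps with the same total uncertainty:
# one concentrated in a single region, one scattered across the corners.
clustered = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
scattered = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 1.0],
]

print(global_average(clustered), global_average(scattered))              # 0.25 0.25
print(max_patch_average(clustered, 2), max_patch_average(scattered, 2))  # 1.0 0.25
```

The global average cannot tell the two maps apart, while the patch-based score separates a localized failure from diffuse noise.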

[154] EC-Bench: Enumeration and Counting Benchmark for Ultra-Long Videos

Fumihiko Tsuchiya,Taiki Miyanishi,Mahiro Ukai,Nakamasa Inoue,Shuhei Kurita,Yusuke Iwasawa,Yutaka Matsuo

Main category: cs.CV

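TL;DR: This paper introduces EC-Bench, a new benchmark for evaluating enumeration, counting, and temporal evidence grounding in long videos, and shows that current multimodal large language models are severely limited at quantitative reasoning over long videos.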

Details Motivation: Existing video counting benchmarks are limited to short clips and evaluate only the final numerical answer, without assessing what should be counted or whether targets are identified consistently over time; real long videos (tens of minutes) contain sparse, diverse events, so a benchmark that supports long-range temporal reasoning is needed. Method: Builds EC-Bench, with 152 videos longer than 30 minutes and 1,699 queries paired with explicit evidence time spans; evaluates 22 MLLMs on enumeration, counting, and temporal grounding under one protocol and analyzes how the three relate. Result: The best MLLM reaches only 29.98% accuracy on enumeration and 23.74% on counting, far below human performance of 78.57% and 82.97%; the analysis shows that enumeration accuracy and temporal grounding are strongly related to counting performance. Conclusion: Current MLLMs have fundamental limitations in quantitative reasoning over long videos; EC-Bench provides a challenging, diagnostic benchmark for this area. Abstract: Counting in long videos remains a fundamental yet underexplored challenge in computer vision. Real-world recordings often span tens of minutes or longer and contain sparse, diverse events, making long-range temporal reasoning particularly difficult. However, most existing video counting benchmarks focus on short clips and evaluate only the final numerical answer, providing little insight into what should be counted or whether models consistently identify relevant instances across time. We introduce EC-Bench, a benchmark that jointly evaluates enumeration, counting, and temporal evidence grounding in long-form videos. EC-Bench contains 152 videos longer than 30 minutes and 1,699 queries paired with explicit evidence spans. Across 22 multimodal large language models (MLLMs), the best model achieves only 29.98% accuracy on Enumeration and 23.74% on Counting, while human performance reaches 78.57% and 82.97%, respectively. Our analysis reveals strong relationships between enumeration accuracy, temporal grounding, and counting performance. These results highlight fundamental limitations of current MLLMs and establish EC-Bench as a challenging benchmark for long-form quantitative video reasoning.

[155] Detecting Unknown Objects via Energy-based Separation for Open World Object Detection

Jun-Woo Heo,Keonhee Park,Gyeong-Moon Park

Main category: cs.CV

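TL;DR: This paper proposes DEUS, a framework that uses ETF-Subspace Unknown Separation (EUS) and an Energy-based Known Distinction (EKD) loss to address the difficulty of discovering unknown classes and catastrophic forgetting in open world object detection, markedly improving unknown detection while preserving accuracy on known classes.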

Details Motivation: Existing OWOD methods rely on known-class predictions to discover unknown objects and therefore struggle to learn unknown representations; memory replay mitigates forgetting of old classes but harms knowledge of new classes. Method: Proposes the DEUS framework, comprising (1) ETF-Subspace Unknown Separation (EUS), which uses Equiangular Tight Frames to build orthogonal subspaces and computes energies in both spaces to strengthen unknown recognition, and (2) an Energy-based Known Distinction (EKD) loss, which enforces separation between previous and current classifiers to reduce knowledge interference during memory replay. Result: On OWOD benchmarks, DEUS markedly improves unknown detection while keeping competitive known-class detection accuracy. Conclusion: Through geometric subspace separation and dual-space energy modeling, DEUS decouples known and unknown representations and eases knowledge interference in incremental learning, offering a more robust approach to open world object detection. Abstract: In this work, we tackle the problem of Open World Object Detection (OWOD). This challenging scenario requires the detector to incrementally learn to classify known objects without forgetting while identifying unknown objects without supervision. Previous OWOD methods have enhanced the unknown discovery process and employed memory replay to mitigate catastrophic forgetting. However, since existing methods heavily rely on the detector's known class predictions for detecting unknown objects, they struggle to effectively learn and recognize unknown object representations. Moreover, while memory replay mitigates forgetting of old classes, it often sacrifices the knowledge of newly learned classes. To resolve these limitations, we propose DEUS (Detecting Unknowns via energy-based Separation), a novel framework that addresses the challenges of Open World Object Detection. DEUS consists of Equiangular Tight Frame (ETF)-Subspace Unknown Separation (EUS) and an Energy-based Known Distinction (EKD) loss. EUS leverages ETF-based geometric properties to create orthogonal subspaces, enabling cleaner separation between known and unknown object representations. Unlike prior energy-based approaches that consider only the known space, EUS utilizes energies from both spaces to better capture distinct patterns of unknown objects. Furthermore, EKD loss enforces the separation between previous and current classifiers, thus minimizing knowledge interference between previous and newly learned classes during memory replay.
We thoroughly validate DEUS on OWOD benchmarks, demonstrating outstanding performance improvements in unknown detection while maintaining competitive known class performance.
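
The abstract refers to energies computed from classifier outputs but does not give DEUS's exact formulation. As a hedged illustration, the sketch below uses the standard free-energy score from energy-based out-of-distribution detection, E(x) = -T log Σ_k exp(z_k / T); treating this as the building block is an assumption, and `energy_score` is a hypothetical name.

```python
import math


def energy_score(logits, temperature=1.0):
    """Free energy of a logit vector: -T * logsumexp(logits / T).

    Lower energy means the classifier responds strongly to some known class;
    higher energy suggests the input matches none of them well.
    """
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scaled))
    return -temperature * log_sum_exp


confident = energy_score([10.0, 0.0, 0.0])  # one known class fires strongly
flat = energy_score([0.0, 0.0, 0.0])        # no class stands out
print(confident < flat)                     # True
```

A dual-space variant, as the abstract describes at a high level, would compute such a score once per subspace and compare the two; how DEUS combines them is not specified in the text above.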

[156] NeuroBRIDGE: Behavior-Conditioned Koopman Dynamics with Riemannian Alignment for Early Substance Use Initiation Prediction from Longitudinal Functional Connectome

Badhan Mazumder,Sir-Lord Wiafe,Vince D. Calhoun,Dong Hye Ye

Main category: cs.CV

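TL;DR: NeuroBRIDGE is a new graph neural network-based framework that models the dynamics of adolescents' longitudinal functional connectomes, improving prediction of substance use initiation (SUI) risk and offering interpretable insight into neural pathways.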

Details Motivation: Early identification of adolescents at risk for substance use initiation (SUI) is vital, but most existing methods treat brain connectivity as static or cross-sectional and miss how it changes over time and with behavior. Method: Proposes the NeuroBRIDGE framework, which aligns longitudinal functional connectomes in a Riemannian tangent space and combines dual-time attention with behavior-conditioned Koopman dynamics to model temporal evolution. Result: On the ABCD dataset, NeuroBRIDGE outperforms relevant baselines in predicting future SUI and reveals interpretable neurodevelopmental risk pathways. Conclusion: NeuroBRIDGE offers a new tool for understanding the mechanisms of neurodevelopmental risk in adolescents and may inform targeted prevention strategies. Abstract: Early identification of adolescents at risk for substance use initiation (SUI) is vital yet difficult, as most predictors treat connectivity as static or cross-sectional and miss how brain networks change over time and with behavior. We proposed NeuroBRIDGE (Behavior conditioned RIemannian Koopman Dynamics on lonGitudinal connEctomes), a novel graph neural network-based framework that aligns longitudinal functional connectome in a Riemannian tangent space and couples dual-time attention with behavioral-conditioned Koopman dynamics to capture temporal change. Evaluated on ABCD, NeuroBRIDGE improved future SUI prediction over relevant baselines while offering interpretable insights into neural pathways, refining our understanding of neurodevelopmental risk and informing targeted prevention.

[157] SurgTEMP: Temporal-Aware Surgical Video Question Answering with Text-guided Visual Memory for Laparoscopic Cholecystectomy

Shi Li,Vinkle Srivastav,Nicolas Chanel,Saurav Sharma,Nabani Banik,Lorenzo Arboit,Kun Yuan,Pietro Mascagni,Nicolas Padoy

Main category: cs.CV

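TL;DR: This paper proposes SurgTEMP, a multimodal large language model framework for surgical video question answering (VQA). Using a query-guided hierarchical visual memory and a Surgical Competency Progression training strategy, it models long-horizon, low-contrast, knowledge-intensive surgical video and substantially outperforms existing methods on the newly built CholeVidQA-32K dataset.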

Details Motivation: Existing surgical VQA research is limited to static frame analysis and overlooks temporal semantics; it also faces low visual contrast, heavy reliance on domain knowledge, analytical needs spread across scattered time windows, and a task hierarchy running from perception to intraoperative assessment. Method: Proposes the SurgTEMP framework: (i) a query-guided token selection module that builds spatial and temporal visual memory banks; (ii) a Surgical Competency Progression (SCP) training scheme. Also releases the CholeVidQA-32K dataset, with 32K QA pairs and 3,855 video segments (about 128 hours), organized into 11 tasks across three levels: perception, assessment, and reasoning. Result: On CholeVidQA-32K, SurgTEMP substantially outperforms current open-source multimodal and video LLMs in both fine-tuned and zero-shot settings, setting a new state of the art for surgical video VQA. Conclusion: By explicitly modeling spatiotemporal coherence and the hierarchy of surgical cognition, SurgTEMP offers a scalable, interpretable multimodal LLM approach to understanding complex, dynamic surgical video, moving computer-assisted surgery toward real-time, high-level intraoperative decision support. Abstract: Surgical procedures are inherently complex and risky, requiring extensive expertise and constant focus to navigate evolving intraoperative scenes. Computer-assisted systems such as surgical visual question answering (VQA) offer promise for education and intraoperative support. Current surgical VQA research largely focuses on static frame analysis, overlooking rich temporal semantics. Surgical video question answering is further challenged by low visual contrast, its highly knowledge-driven nature, diverse analytical needs spanning scattered temporal windows, and the hierarchy from basic perception to high-level intraoperative assessment. To address these challenges, we propose SurgTEMP, a multimodal LLM framework featuring (i) a query-guided token selection module that builds hierarchical visual memory (spatial and temporal memory banks) and (ii) a Surgical Competency Progression (SCP) training scheme. Together, these components enable effective modeling of variable-length surgical videos while preserving procedure-relevant cues and temporal coherence, and better support diverse downstream assessment tasks. To support model development, we introduce CholeVidQA-32K, a surgical video question answering dataset comprising 32K open-ended QA pairs and 3,855 video segments (approximately 128 h total) from laparoscopic cholecystectomy.
The dataset is organized into a three-level hierarchy -- Perception, Assessment, and Reasoning -- spanning 11 tasks from instrument/action/anatomy perception to Critical View of Safety (CVS), intraoperative difficulty, skill proficiency, and adverse event assessment. In comprehensive evaluations against state-of-the-art open-source multimodal and video LLMs (fine-tuned and zero-shot), SurgTEMP achieves substantial performance improvements, advancing the state of video-based surgical VQA.

[158] Scaling Video Pretraining for Surgical Foundation Models

Sicheng Lu,Zikai Xiao,Jianhui Wei,Danyu Sun,Qi Lu,Keli Hu,Yang Feng,Jian Wu,Zongxin Yang,Zuozhu Liu

Main category: cs.CV

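TL;DR: This paper proposes SurgRec, a scalable and reproducible pretraining recipe for surgical video understanding with two variants, SurgRec-MAE and SurgRec-JEPA. It also builds a large multi-source surgical video dataset and a standardized evaluation benchmark, and improves downstream task performance.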

Details Motivation: Existing surgical foundation models are constrained by small data scale, limited procedural diversity, inconsistent evaluation, and the lack of a reproducible training pipeline. Method: Proposes the SurgRec pretraining recipe; curates a multi-source surgical video corpus of 10,535 videos and 214.5M frames; designs a unified pretraining pipeline with balanced sampling; establishes a standardized, reproducible benchmark covering 16 downstream datasets and four clinical domains. Result: SurgRec consistently outperforms self-supervised baselines and vision-language models (VLMs) across downstream datasets; VLMs are unreliable on fine-grained temporal recognition and sensitive to prompt phrasing. Conclusion: SurgRec provides a reproducible, scalable foundation for surgical video understanding and for building more general surgical video models; all code, models, and data will be released. Abstract: Surgical video understanding is essential for computer-assisted interventions, yet existing surgical foundation models remain constrained by limited data scale, procedural diversity, and inconsistent evaluation, often lacking a reproducible training pipeline. We propose SurgRec, a scalable and reproducible pretraining recipe for surgical video understanding, instantiated with two variants: SurgRec-MAE and SurgRec-JEPA. We curate a large multi-source corpus of 10,535 videos and 214.5M frames spanning endoscopy, laparoscopy, cataract, and robotic surgery. Building on this corpus, we develop a unified pretraining pipeline with balanced sampling and standardize a reproducible benchmark across 16 downstream datasets and four clinical domains with consistent data splits. Across extensive comparisons against SSL baselines and vision-language models, SurgRec consistently achieves superior performance across downstream datasets. In contrast, VLMs prove unreliable for fine-grained temporal recognition, exhibiting both performance gaps and sensitivity to prompt phrasing. Our work provides a reproducible, scalable foundation for the community to build more general surgical video models. All code, models, and data will be publicly released.

[159] Learning Structural-Functional Brain Representations through Multi-Scale Adaptive Graph Attention for Cognitive Insight

Badhan Mazumder,Sir-Lord Wiafe,Aline Kotoski,Vince D. Calhoun,Dong Hye Ye

Main category: cs.CV

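TL;DR: This paper proposes MAGNet, a Transformer-style graph neural network framework that adaptively learns interactions between brain structure and function. By fusing structural MRI with resting-state fMRI data, it predicts cognitive function on the ABCD dataset better than baseline models.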

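Details Motivation: Understanding how brain structure and function interact is key to explaining intelligence, but jointly modeling the structural and functional connectomes is challenging because each captures complementary aspects of brain organization. Method: Proposes the Multi-scale Adaptive Graph Network (MAGNet), which extracts inter-regional morphological features from structural MRI using source-based morphometry and fuses them with functional network connectivity from resting-state fMRI; a hybrid graph models direct and indirect pathways, local-global attention refines connectivity importance, and a joint loss enforces cross-modal coherence while optimizing the prediction objective end-to-end. Result: On the ABCD dataset, MAGNet outperforms relevant baselines on cognitive function prediction, supporting the effectiveness of its multimodal fusion. Conclusion: MAGNet offers a new approach to modeling structure-function interactions and may deepen understanding of the neural basis of cognitive function. Abstract: Understanding how brain structure and function interact is key to explaining intelligence, yet modeling them jointly is challenging as the structural and functional connectome capture complementary aspects of organization. We introduced Multi-scale Adaptive Graph Network (MAGNet), a Transformer-style graph neural network framework that adaptively learns structure-function interactions. MAGNet leverages source-based morphometry from structural MRI to extract inter-regional morphological features and fuses them with functional network connectivity from resting-state fMRI. A hybrid graph integrates direct and indirect pathways, while local-global attention refines connectivity importance and a joint loss simultaneously enforces cross-modal coherence and optimizes the prediction objective end-to-end. On the ABCD dataset, MAGNet outperformed relevant baselines, demonstrating effective multimodal integration for advancing our understanding of cognitive function.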

[160] Trimodal Deep Learning for Glioma Survival Prediction: A Feasibility Study Integrating Histopathology, Gene Expression, and MRI

Iain Swift,JingHua Ye

Main category: cs.CV

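TL;DR: This study explores the effect of adding FLAIR MRI as a third modality to an existing bimodal (histopathology plus genomics) framework for brain tumour prognosis. With a small sample, trimodal early fusion gives a slight performance gain that does not reach statistical significance, suggesting that additional modalities need sufficient multimodal context to contribute.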

Details Motivation: Prior work has used histopathology and genomic data for brain tumour prognosis, but the role of volumetric MRI (FLAIR in particular) within a unified survival prediction framework has not been explored. Method: On the TCGA-GBMLGG cohort (664 patients), build unimodal, bimodal (nine configurations), and trimodal (three configurations) models using early, late, and joint fusion strategies; evaluate with a Composite Score (CS), supported by permutation tests and bootstrap confidence intervals. Result: Trimodal early fusion achieves an exploratory CS = 0.854, a gain of ΔCS = +0.011 over the bimodal baseline (p = 0.250, not significant); MRI alone reaches CS = 0.755; experiments that include MRI are limited to 19 test patients, giving very wide confidence intervals (e.g. [0.400, 1.000]). Conclusion: FLAIR MRI as a third modality may bring a limited but measurable prognostic gain, but its contribution depends on sufficient multimodal context; the small sample limits the robustness of the conclusions, and larger-scale validation is needed. Abstract: Multimodal deep learning has improved prognostic accuracy for brain tumours by integrating histopathology and genomic data, yet the contribution of volumetric MRI within unified survival frameworks remains unexplored. This pilot study extends a bimodal framework by incorporating Fluid Attenuated Inversion Recovery (FLAIR) MRI from BraTS2021 as a third modality. Using the TCGA-GBMLGG cohort (664 patients), we evaluate three unimodal models, nine bimodal configurations, and three trimodal configurations across early, late, and joint fusion strategies. In this small cohort setting, trimodal early fusion achieves an exploratory Composite Score (CS = 0.854), with a controlled $Δ$CS of +0.011 over the bimodal baseline on identical patients, though this difference is not statistically significant (p = 0.250, permutation test). MRI achieves reasonable unimodal discrimination (CS = 0.755) but does not substantially improve bimodal pairs, while providing measurable uplift in the three-way combination. All MRI-containing experiments are constrained to 19 test patients, yielding wide bootstrap confidence intervals (e.g. [0.400, 1.000]) that preclude definitive conclusions. These findings provide preliminary evidence that a third imaging modality may add prognostic value even with limited sample sizes, and that additional modalities require sufficient multimodal context to contribute effectively.
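
The abstract reports bootstrap confidence intervals on a 19-patient test set. The sketch below shows a generic percentile bootstrap so the source of the wide intervals is easier to see. It is not the study's analysis code: the Composite Score is not defined in the abstract, so a plain mean over made-up per-patient scores stands in for it, and `bootstrap_ci` and `mean` are hypothetical names.

```python
import random


def mean(xs):
    return sum(xs) / len(xs)


def bootstrap_ci(values, statistic, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample with replacement, recompute the
    statistic each time, and read off the alpha/2 and 1 - alpha/2 quantiles."""
    rng = random.Random(seed)
    n = len(values)
    stats = sorted(
        statistic([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return stats[lo_idx], stats[hi_idx]


# Hypothetical per-patient scores in [0, 1] for a 19-patient test set.
scores = [0.9, 0.4, 1.0, 0.7, 0.8, 0.3, 1.0, 0.6, 0.9, 0.5,
          0.8, 1.0, 0.2, 0.7, 0.9, 0.6, 1.0, 0.4, 0.8]
low, high = bootstrap_ci(scores, mean)
print(f"95% CI for the mean: [{low:.3f}, {high:.3f}]")
```

With so few patients, each resample can differ a lot from the original set, which is why intervals computed this way tend to be wide.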

[161] SurgNavAR: An Augmented Reality Surgical Navigation Framework for Optical See-Through Head Mounted Displays

Abdullah Thabit,Mohamed Benmahdjoub,Rafiuddin Jinabade,Hizirwan S. Salim,Marie-Lise C. van Veelen,Mark G. van Vledder,Eppo B. Wolvius,Theo van Walsum

Main category: cs.CV

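TL;DR: This paper presents and evaluates an integrated augmented reality (AR) surgical navigation framework for head mounted displays (HMDs). It supports patient and instrument localization, image-to-patient registration, and real-time navigation visualization, and was validated on the HoloLens 2 and Magic Leap 2 for needle insertion guidance and rib fracture localization (targeting accuracy below 5 mm in both). The code is open source.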

Details Motivation: Integrating existing HMD-AR surgical navigation technology is cumbersome and requires specialist expertise, which hampers research progress; a general, easy-to-use, extensible integrated framework is needed. Method: Tracks the patient and instruments using 2D pattern markers; calibrates surgical tools with pivot and reference-based calibration; performs image-to-patient registration through point-based matching and manual positioning; runs phantom experiments on the HoloLens 2 and Magic Leap 2 for two use cases, AR-guided needle insertion and rib fracture localization. Result: Mean tooltip calibration accuracy of 1 mm, image-to-patient registration accuracy of 3 mm, and targeting accuracy below 5 mm for both surgical tasks. Conclusion: The framework is an easy-to-configure, extensible, open-source platform for HMD-AR surgical navigation that suits a range of surgical scenarios; the code is public to support further work in the field. Abstract: Augmented reality (AR) devices with head mounted displays (HMDs) facilitate the direct superimposition of 3D preoperative imaging data onto the patient during surgery. To use an HMD-AR device as a stand-alone surgical navigation system, the device should be able to locate the patient and surgical instruments, align preoperative imaging data with the patient, and visualize navigation data in real time during surgery. Whereas some of the technologies required for this are known, integration in such devices is cumbersome and requires specific knowledge and expertise, hampering scientific progress in this field. This work therefore aims to present and evaluate an integrated HMD-based AR surgical navigation framework that is adaptable to diverse surgical applications. The framework tracks 2D patterns as reference markers attached to the patient and surgical instruments. It allows for the calibration of surgical tools using pivot and reference-based calibration techniques. It enables image-to-patient registration using point-based matching and manual positioning. The integrated functionalities of the framework are evaluated on two HMD devices, the HoloLens 2 and Magic Leap 2, with two surgical use cases being evaluated in a phantom setup: AR-guided needle insertion and rib fracture localization. The framework was able to achieve a mean tooltip calibration accuracy of 1 mm, a registration accuracy of 3 mm, and a targeting accuracy below 5 mm on the two surgical use cases. The framework presents an easy-to-use configurable tool for HMD-based AR surgical navigation, which can be extended and adapted to many surgical applications.
The framework is publicly available at https://github.com/abdullahthabit/SurgNavAR.
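
The point-based matching step of image-to-patient registration is commonly solved as a least-squares rigid alignment of paired fiducials. The sketch below uses the standard Kabsch (SVD) solution with NumPy. The abstract does not say which solver the framework uses, so this is an illustrative assumption rather than the SurgNavAR implementation, and the function names and fiducial coordinates are hypothetical.

```python
import numpy as np

def register_points(image_pts, patient_pts):
    """Least-squares rigid transform (R, t) such that patient ≈ R @ image + t."""
    P = np.asarray(image_pts, dtype=float)
    Q = np.asarray(patient_pts, dtype=float)
    p0, q0 = P.mean(axis=0), Q.mean(axis=0)
    H = (P - p0).T @ (Q - q0)              # 3x3 cross-covariance of centered points
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection: force det(R) = +1.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = q0 - R @ p0
    return R, t

def fiducial_registration_error(image_pts, patient_pts, R, t):
    """RMS distance between transformed image fiducials and patient fiducials."""
    P = np.asarray(image_pts, dtype=float)
    Q = np.asarray(patient_pts, dtype=float)
    residuals = (P @ R.T + t) - Q
    return float(np.sqrt((residuals ** 2).sum(axis=1).mean()))

# Hypothetical fiducials in image space (mm), moved by a known rigid motion.
image_fiducials = np.array([[0, 0, 0], [100, 0, 0], [0, 80, 0],
                            [0, 0, 60], [50, 40, 30]], dtype=float)
angle = np.deg2rad(30.0)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0,            0.0,           1.0]])
t_true = np.array([10.0, -5.0, 2.0])
patient_fiducials = image_fiducials @ R_true.T + t_true

R_est, t_est = register_points(image_fiducials, patient_fiducials)
fre = fiducial_registration_error(image_fiducials, patient_fiducials, R_est, t_est)
```

With noise-free fiducials the recovered transform matches the known motion and the residual error is numerically zero. On real data the residual reflects marker localization noise, and it is not the same quantity as the targeting accuracy reported in the abstract.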

[162] Conditional Polarization Guidance for Camouflaged Object Detection

Qifan Zhang,Hao Wang,Xiangrong Qin,Ruijie Li

Main category: cs.CV

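TL;DR: This paper proposes CPGNet, an asymmetric RGB-polarization framework that uses a conditional polarization guidance mechanism to explicitly regulate RGB feature learning and improve camouflaged object detection.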

Details Motivation: Existing polarization-based methods rely on complex visual encoders and fusion mechanisms, which raises model complexity and computational overhead, and they do not fully explore how polarization can explicitly guide hierarchical RGB representation learning. Method: Proposes the CPGNet framework, comprising a lightweight polarization interaction module, a conditional polarization guidance mechanism, a polarization edge-guided frequency refinement strategy, and an iterative feedback decoder. Result: CPGNet consistently outperforms state-of-the-art methods on multiple polarization datasets and on non-polarization datasets. Conclusion: Combining conditional polarization guidance and frequency refinement with iterative decoding strengthens the perception of subtle differences between camouflaged objects and their backgrounds and significantly improves detection accuracy. Abstract: Camouflaged object detection (COD) aims to identify targets that are highly blended with their backgrounds. Recent works have shown that the optical characteristics of polarization cues play a significant role in improving camouflaged object detection. However, most existing polarization-based approaches depend on complex visual encoders and fusion mechanisms, leading to increased model complexity and computational overhead, while failing to fully explore how polarization can explicitly guide hierarchical RGB representation learning. To address these limitations, we propose CPGNet, an asymmetric RGB-polarization framework that introduces a conditional polarization guidance mechanism to explicitly regulate RGB feature learning for camouflaged object detection. Specifically, we design a lightweight polarization interaction module that jointly models these complementary cues and generates reliable polarization guidance in a unified manner. Unlike conventional feature fusion strategies, the proposed conditional guidance mechanism dynamically modulates RGB features using polarization priors, enabling the network to focus on subtle discrepancies between camouflaged objects and their backgrounds. Furthermore, we introduce a polarization edge-guided frequency refinement strategy that enhances high-frequency components under polarization constraints, effectively breaking camouflage patterns. Finally, we develop an iterative feedback decoder to perform coarse-to-fine feature calibration and progressively refine camouflage prediction.
Extensive experiments on polarization datasets across multiple tasks, along with evaluations on non-polarization datasets, demonstrate that CPGNet consistently outperforms state-of-the-art methods.

[163] Benchmarking PhD-Level Coding in 3D Geometric Computer Vision

Wenyi Li,Renkai Luo,Yue Yu,Huan-ang Gao,Mingju Gao,Li Yuan,Chaoyou Fu,Hao Zhao

Main category: cs.CV

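TL;DR: This paper introduces GeoCodeBench, a PhD-level benchmark for 3D geometric vision coding that evaluates how well large models generate complex 3D geometry code. The best current model, GPT-5, reaches only a 36.6% pass rate, exposing a large capability gap, and context limited to the Method section proves more effective than the full paper.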

Details Motivation: Current AI coding models perform poorly on code generation for complex 3D geometric vision, so a difficult, scientifically rigorous benchmark is needed to measure and drive progress. Method: Builds GeoCodeBench: candidate functions are extracted from the official repositories of papers at top venues, human screening selects core 3D geometric components, and diverse unit tests with edge cases are designed for each task; scoring is automatic; eight mainstream open- and closed-source models are evaluated, and context ablations are performed. Result: GPT-5 achieves the highest pass rate at 36.6%; research-level tasks are markedly harder than general 3D capability tasks; inputs truncated at the Method section outperform full-paper inputs with statistical significance. Conclusion: GeoCodeBench fills a gap in the evaluation of 3D scientific coding, exposes key weaknesses of current models in geometric logic, long-text comprehension, and algorithm implementation, and provides a clear benchmark and direction for improvement toward trustworthy 3D vision coding. Abstract: AI-assisted coding has rapidly reshaped software practice and research workflows, yet today's models still struggle to produce correct code for complex 3D geometric vision. If models could reliably write such code, the research of our community would change substantially. To measure progress toward that goal, we introduce GeoCodeBench, a PhD-level benchmark that evaluates coding for 3D vision. Each problem is a fill-in-the-function implementation task curated from representative papers at recent venues: we first let a tool propose candidate functions from official repositories, then perform careful human screening to select core 3D geometric components. For every target, we generate diverse, edge-case unit tests, enabling fully automatic, reproducible scoring. We evaluate eight representative open- and closed-source models to reflect the current ecosystem. The best model, GPT-5, attains only a 36.6% pass rate, revealing a large gap between current capabilities and dependable 3D scientific coding. GeoCodeBench organizes tasks into a two-level hierarchy: General 3D capability (geometric transformations and mechanics/optics formulation) and Research capability (novel algorithm implementation and geometric logic routing). Scores are positively correlated across these axes, but research-oriented tasks are markedly harder. Context ablations further show that "more paper text" is not always better: cutting off at the Method section statistically outperforms full-paper inputs, highlighting unresolved challenges in long-context scientific comprehension.
Together, these findings position GeoCodeBench as a rigorous testbed for advancing from generic coding to trustworthy 3D geometric vision coding.

[164] Video Models Reason Early: Exploiting Plan Commitment for Maze Solving

Kaleb Newman,Tyler Zhu,Olga Russakovsky

Main category: cs.CV

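TL;DR: This paper studies the internal planning dynamics of video diffusion models on maze solving. It finds that they form a high-level motion plan early in denoising (early plan commitment) and that path length, not obstacle density, is the key determinant of maze difficulty. Building on this, it proposes ChEaP, which filters for promising early plans and chains them together, substantially improving accuracy on long-horizon mazes.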

Details Motivation: Video diffusion models show emergent reasoning-like abilities (such as solving mazes and puzzles), but how they reason during generation remains unclear, which calls for a systematic analysis of their internal planning dynamics. Method: Uses 2D maze solving as a controlled testbed to analyze how the implicit trajectory evolves during denoising; identifies early plan commitment and examines the key factors behind solving difficulty; based on these findings, proposes the Chaining with Early Planning (ChEaP) inference strategy, which filters early plans and chains generations together. Result: Video models settle on a high-level motion path within the first few denoising steps (early plan commitment); path length is the dominant difficulty variable, with a failure threshold at about 12 steps; ChEaP raises long-horizon maze accuracy from 7% to 67% and improves accuracy by 2.5x overall on hard tasks in Frozen Lake and VR-Bench. Conclusion: Current video diffusion models have deeper reasoning capabilities than previously recognized, and better inference-time scaling strategies such as ChEaP can elicit them more reliably, offering a new direction for generative video models with planning ability. Abstract: Video diffusion models exhibit emergent reasoning capabilities like solving mazes and puzzles, yet little is understood about how they reason during generation. We take a first step towards understanding this and study the internal planning dynamics of video models using 2D maze solving as a controlled testbed. Our investigations reveal two findings. Our first finding is early plan commitment: video diffusion models commit to a high-level motion plan within the first few denoising steps, after which further denoising alters visual details but not the underlying trajectory. Our second finding is that path length, not obstacle density, is the dominant predictor of maze difficulty, with a sharp failure threshold at 12 steps. This means video models can only reason over long mazes by chaining together multiple sequential generations. To demonstrate the practical benefits of our findings, we introduce Chaining with Early Planning, or ChEaP, which only spends compute on seeds with promising early plans and chains them together to tackle complex mazes. This improves accuracy from 7% to 67% on long-horizon mazes and by 2.5x overall on hard tasks in Frozen Lake and VR-Bench across Wan2.2-14B and HunyuanVideo-1.5. Our analysis reveals that current video models possess deeper reasoning capabilities than previously recognized, which can be elicited more reliably with better inference-time scaling.
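
The control flow behind "filter early plans, then chain" can be illustrated with a toy sketch. This is not the authors' implementation: random walks on an open grid stand in for a video model's early trajectory, the goal-distance score stands in for whatever plan-quality signal ChEaP uses, and all names and parameters are hypothetical. Only the 12-step per-generation horizon is taken from the abstract.

```python
import random

MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]
MAX_STEPS = 12  # per-generation horizon, mirroring the 12-step threshold above

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def early_plan(rng, start, size):
    """Stand-in for the coarse trajectory readable after a few denoising steps."""
    pos, path = start, [start]
    for _ in range(MAX_STEPS):
        options = [(pos[0] + dx, pos[1] + dy) for dx, dy in MOVES
                   if 0 <= pos[0] + dx < size and 0 <= pos[1] + dy < size]
        pos = rng.choice(options)
        path.append(pos)
    return path

def best_prefix(path, goal):
    """Cut a plan at its closest approach to the goal (a toy simplification)."""
    dists = [manhattan(p, goal) for p in path]
    return path[:dists.index(min(dists)) + 1]

def solve_by_chaining(start, goal, size, seeds_per_segment=16,
                      max_segments=50, rng_seed=0):
    rng = random.Random(rng_seed)
    pos, route, full_generations = start, [start], 0
    for _ in range(max_segments):
        if pos == goal:
            break
        # Cheap pass: inspect the early plan of every seed.
        candidates = [best_prefix(early_plan(rng, pos, size), goal)
                      for _ in range(seeds_per_segment)]
        chosen = min(candidates, key=lambda p: manhattan(p[-1], goal))
        full_generations += 1     # only the chosen seed would be generated in full
        route.extend(chosen[1:])  # chain the segment onto the route so far
        pos = chosen[-1]
    return route, full_generations

start, goal, size = (0, 0), (19, 19), 20
route, full_generations = solve_by_chaining(start, goal, size)
```

The route here needs at least 38 moves, well beyond a single 12-step segment, so it can only be completed by chaining. The number of full generations stays at one per segment even though 16 seeds are inspected each time, which is the compute saving the abstract describes.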

[165] OmniRoam: World Wandering via Long-Horizon Panoramic Video Generation

Yuheng Liu,Xin Lin,Xinke Li,Baihan Yang,Chen Wang,Kalyan Sunkavalli,Yannick Hold-Geoffroy,Hao Tan,Kai Zhang,Xiaohui Xie,Zifan Shi,Yiwei Hu

Main category: cs.CV

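TL;DR: OmniRoam is a controllable panoramic video generation framework that uses a two-stage preview-and-refine pipeline to achieve long-horizon, high-fidelity, globally consistent scene wandering.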

Details Motivation: Existing scene modeling methods based on perspective video are limited by a narrow field of view and suffer from poor completeness and global consistency; panoramic video offers wider coverage and inherent spatiotemporal consistency that can be exploited. Method: Proposes a two-stage framework: 1) a preview stage, in which a trajectory-controlled video generation model produces a panoramic overview from an input image or video; 2) a refine stage, in which the overview video is temporally extended and spatially upsampled to produce long-range, high-resolution panoramic video. Two panoramic video datasets (synthetic and real-world) are built for training. Result: Outperforms state-of-the-art methods in visual quality, controllability, and long-horizon scene consistency; supports extensions such as real-time video generation and 3D reconstruction. Conclusion: OmniRoam demonstrates the advantages of panoramic representation for video generation and scene modeling and offers a new paradigm for controllable, long-horizon, high-fidelity world wandering. Abstract: Modeling scenes using video generation models has garnered growing research interest in recent years. However, most existing approaches rely on perspective video models that synthesize only limited observations of a scene, leading to issues of completeness and global consistency. We propose OmniRoam, a controllable panoramic video generation framework that exploits the rich per-frame scene coverage and inherent long-term spatial and temporal consistency of panoramic representation, enabling long-horizon scene wandering. Our framework begins with a preview stage, where a trajectory-controlled video generation model creates a quick overview of the scene from a given input image or video. Then, in the refine stage, this video is temporally extended and spatially upsampled to produce long-range, high-resolution videos, thus enabling high-fidelity world wandering. To train our model, we introduce two panoramic video datasets that incorporate both synthetic and real-world captured videos. Experiments show that our framework consistently outperforms state-of-the-art methods in terms of visual quality, controllability, and long-term scene consistency, both qualitatively and quantitatively. We further showcase several extensions of this framework, including real-time video generation and 3D reconstruction. Code is available at https://github.com/yuhengliu02/OmniRoam.