cs.CL
[1] One Language, Two Scripts: Probing Script-Invariance in LLM Concept Representations
Sripad Karne
Main category: cs.CL
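TL;DR: This paper uses Serbian digraphia (Latin and Cyrillic scripts coexist and carry identical meaning) as a controlled experiment to test whether features learned by sparse autoencoders (SAEs) represent abstract meaning rather than surface written form. Sentences with the same meaning in different scripts activate highly overlapping SAE features, and this cross-script invariance strengthens with model scale, suggesting that SAE features capture semantics above the tokenization level.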
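Illustrative sketch (not from the paper): one simple way to quantify "overlapping features" is the Jaccard overlap between the sets of most active SAE features for the same sentence in each script. The top-k selection rule, the function names, and the activation values below are all assumptions made for illustration; the paper's actual overlap metric may differ.

```python
def active_features(activations, k=3):
    """Indices of the k largest activations (hypothetical top-k selection rule)."""
    ranked = sorted(range(len(activations)), key=lambda i: activations[i], reverse=True)
    return set(ranked[:k])


def jaccard(a, b):
    """Jaccard overlap |a & b| / |a | b| between two feature-index sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Made-up SAE activations for one sentence written in each script.
latin_acts = [0.0, 2.1, 0.0, 1.7, 0.9, 0.0]
cyrillic_acts = [0.0, 1.9, 0.2, 1.5, 0.0, 0.8]

overlap = jaccard(active_features(latin_acts), active_features(cyrillic_acts))
print(f"cross-script feature overlap: {overlap:.2f}")
```

A random baseline could be estimated the same way by pairing each sentence with an unrelated sentence in the other script.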
Details
Motivation: To determine whether features learned by sparse autoencoders (SAEs) reflect abstract meaning or are tied to the surface written form of text.
Method: Uses the Serbian Latin/Cyrillic digraphic system (identical meaning, near-perfect character mapping, no shared tokens) as a controlled testbed, and analyzes the overlap of SAE feature activations across scripts and across paraphrases in the Gemma model family (270M–27B).
Result: Sentences with the same meaning in different scripts activate highly overlapping SAE features, far above random baselines; changing script causes less divergence than paraphrasing within one script; cross-script, cross-paraphrase pairs still show substantial feature overlap, which is evidence against memorization; and the invariance strengthens with model scale.
Conclusion: SAE features can capture semantics at a level of abstraction above surface tokenization, and Serbian digraphia can serve as a general evaluation paradigm for probing how abstract learned representations are.
Abstract: Do the features learned by Sparse Autoencoders (SAEs) represent abstract meaning, or are they tied to how text is written? We investigate this question using Serbian digraphia as a controlled testbed: Serbian is written interchangeably in Latin and Cyrillic scripts with a near-perfect character mapping between them, enabling us to vary orthography while holding meaning exactly constant. Crucially, these scripts are tokenized completely differently, sharing no tokens whatsoever. Analyzing SAE feature activations across the Gemma model family (270M-27B parameters), we find that identical sentences in different Serbian scripts activate highly overlapping features, far exceeding random baselines. Strikingly, changing script causes less representational divergence than paraphrasing within the same script, suggesting SAE features prioritize meaning over orthographic form. Cross-script cross-paraphrase comparisons provide evidence against memorization, as these combinations rarely co-occur in training data yet still exhibit substantial feature overlap. This script invariance strengthens with model scale. Taken together, our findings suggest that SAE features can capture semantics at a level of abstraction above surface tokenization, and we propose Serbian digraphia as a general evaluation paradigm for probing the abstractness of learned representations.
[2] MultiGraSCCo: A Multilingual Anonymization Benchmark with Annotations of Personal Identifiers
Ibrahim Baroud,Christoph Otto,Vera Czehmann,Christine Hovhannisyan,Lisa Raithel,Sebastian Möller,Roland Roller
Main category: cs.CL
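TL;DR: This paper presents a machine-translation-based method for building a multilingual anonymization benchmark. Using synthetic data and neural machine translation, it produces a ten-language medical anonymization benchmark with over 2,500 annotations while remaining compliant with privacy requirements.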
Details
Motivation: Sensitive medical data is hard to access because of privacy concerns, yet high-quality data with PII annotations is essential for developing and evaluating anonymization systems; with real data restricted, synthetic data and cross-lingual transfer are needed to address data scarcity and multilingual coverage.
Method: Builds a ten-language anonymization benchmark with neural machine translation, preserving the original PII annotations and rendering names of people and cities in a culturally and contextually appropriate form for each target language; the source corpus is synthetic or validated real data.
Result: A ten-language anonymization benchmark with over 2,500 PII annotations; an evaluation with medical professionals confirms translation quality, both in general and for the translation and adaptation of personal information; the benchmark and annotation guidelines are released.
Conclusion: The benchmark eases the data bottleneck in multilingual medical anonymization research, supporting annotator training, cross-institution annotation validation, and improvement of automatic PII detection, while avoiding the legal and privacy risks of real patient data.
Abstract: Accessing sensitive patient data for machine learning is challenging due to privacy concerns. Datasets with annotations of personally identifiable information are crucial for developing and testing anonymization systems to enable safe data sharing that complies with privacy regulations. Since accessing real patient data is a bottleneck, synthetic data offers an efficient solution for data scarcity, bypassing privacy regulations that apply to real data. Moreover, neural machine translation can help to create high-quality data for low-resource languages by translating validated real or synthetic data from a high-resource language. In this work, we create a multilingual anonymization benchmark in ten languages, using a machine translation methodology that preserves the original annotations and renders names of cities and people in a culturally and contextually appropriate form in each target language. Our evaluation study with medical professionals confirms the quality of the translations, both in general and with respect to the translation and adaptation of personal information. Our benchmark with over 2,500 annotations of personal information can be used in many applications, including training annotators, validating annotations across institutions without legal complications, and helping improve the performance of automatic personal information detection. We make our benchmark and annotation guidelines available for further research.
[3] ConFu: Contemplate the Future for Better Speculative Sampling
Zongyue Qin,Raghavv Goel,Mukul Gagrani,Risheek Garrepalli,Mingu Lee,Yizhou Sun
Main category: cs.CL
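TL;DR: This paper proposes ConFu, a framework that improves speculative decoding by letting the draft model anticipate the future direction of generation, raising token acceptance rates and generation speed by 8–11% over EAGLE-3.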
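Illustrative sketch (generic speculative decoding, not ConFu itself): the toy below shows one draft-and-verify round with greedy verification, and how a single draft error wastes every later draft token, which is the error accumulation ConFu aims to reduce. The function `speculative_step` and both toy "models" are hypothetical stand-ins; a real system verifies all proposed tokens in one target forward pass and usually emits a bonus token when everything is accepted.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """One draft-and-verify round; returns (emitted_tokens, n_draft_tokens_accepted)."""
    # 1) The draft model proposes k tokens autoregressively.
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # 2) The target model checks each proposed token in order.
    emitted, ctx = [], list(prefix)
    for tok in proposal:
        expected = target_next(ctx)
        if tok != expected:
            # First mismatch: keep the target's token, discard the rest of the draft.
            emitted.append(expected)
            return emitted, len(emitted) - 1
        emitted.append(tok)
        ctx.append(tok)
    return emitted, len(emitted)


def target_next(ctx):
    """Toy target model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def draft_next(ctx):
    """Toy draft model: agrees with the target except right after token 3."""
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


tokens, n_accepted = speculative_step([1], draft_next, target_next, k=4)
acceptance_rate = n_accepted / 4
print(tokens, n_accepted, acceptance_rate)
```

Here the draft proposes four tokens but only the first two survive verification, so the round's acceptance rate is 0.5.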
Details
Motivation: Draft models in existing speculative decoding methods condition only on the current prefix, so their predictions drift from the target model over multiple steps and errors accumulate.
Method: ConFu (i) introduces contemplate tokens and soft prompts that let the draft model use future-oriented signals from the target model at negligible cost; (ii) adds a dynamic contemplate token mechanism with MoE for context-aware future prediction; and (iii) uses a training framework with anchor token sampling and future prediction replication.
Result: With Llama-3 3B and 8B models, ConFu improves token acceptance rates and generation speed over EAGLE-3 by 8–11% across various downstream tasks.
Conclusion: The authors believe ConFu is the first work to bridge speculative decoding with continuous reasoning tokens, offering a new direction for accelerating LLM inference.
Abstract: Speculative decoding has emerged as a powerful approach to accelerate large language model (LLM) inference by employing lightweight draft models to propose candidate tokens that are subsequently verified by the target model. The effectiveness of this paradigm critically depends on the quality of the draft model. While recent advances such as the EAGLE series achieve state-of-the-art speedup, existing draft models remain limited by error accumulation: they condition only on the current prefix, causing their predictions to drift from the target model over steps. In this work, we propose ConFu (Contemplate the Future), a novel speculative decoding framework that enables draft models to anticipate the future direction of generation. ConFu introduces (i) contemplate tokens and soft prompts that allow the draft model to leverage future-oriented signals from the target model at negligible cost, (ii) a dynamic contemplate token mechanism with MoE to enable context-aware future prediction, and (iii) a training framework with anchor token sampling and future prediction replication that learns robust future prediction. Experiments demonstrate that ConFu improves token acceptance rates and generation speed over EAGLE-3 by 8–11% across various downstream tasks with Llama-3 3B and 8B models. We believe our work is the first to bridge speculative decoding with continuous reasoning tokens, offering a new direction for accelerating LLM inference.
[4] SciTaRC: Benchmarking QA on Scientific Tabular Data that Requires Language Reasoning and Complex Computation
Hexuan Wang,Yaxuan Ren,Srikar Bommireddypalli,Shuxian Chen,Adarsh Prabhudesai,Rongkun Zhou,Elina Baral,Philipp Koehn
Main category: cs.CL
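TL;DR: SciTaRC is an expert-authored question-answering benchmark over tabular data in scientific papers. Current AI models perform poorly on it, limited mainly by an execution bottleneck: code-based methods are brittle on raw scientific tables, while natural language reasoning fails because of comprehension issues and calculation errors.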
Details
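Motivation: Current AI models perform poorly on question answering over tables in scientific papers, and a high-quality benchmark that evaluates deep language reasoning and complex computation has been missing.
Method: Builds SciTaRC, an expert-authored benchmark of questions about scientific-paper tables that require both deep language reasoning and complex computation, and systematically evaluates several state-of-the-art models with an error attribution analysis.
Result: State-of-the-art models fail on at least 23% of the questions, and Llama-3.3-70B-Instruct fails on 65.5%; the analysis reveals a universal "execution bottleneck", in which models struggle to faithfully execute plans even when given correct strategies.
Conclusion: Scaling model size or capability alone does not resolve the core challenges of scientific table QA; execution reliability needs targeted improvement, particularly in table parsing, strategy execution, and numerical computation.
Abstract: We introduce SciTaRC, an expert-authored benchmark of questions about tabular data in scientific papers requiring both deep language reasoning and complex computation. We show that current state-of-the-art AI models fail on at least 23% of these questions, a gap that remains significant even for highly capable open-weight models like Llama-3.3-70B-Instruct, which fails on 65.5% of the tasks. Our analysis reveals a universal "execution bottleneck": both code and language models struggle to faithfully execute plans, even when provided with correct strategies. Specifically, code-based methods prove brittle on raw scientific tables, while natural language reasoning primarily fails due to initial comprehension issues and calculation errors.
[5] Automated Thematic Analysis for Clinical Qualitative Data: Iterative Codebook Refinement with Full Provenance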
Seungjun Yi,Joakim Nguyen,Huimin Xu,Terence Lim,Joseph Skrovan,Mehak Beri,Hitakshi Modi,Andrew Well,Carlos M. Mery,Yan Zhang,Mia K. Markey,Ying Ding
Main category: cs.CL
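TL;DR: This paper proposes an automated thematic analysis (TA) framework that combines iterative codebook refinement with full provenance tracking, improving codebook generalizability and auditability and achieving the highest composite quality score on four of five datasets.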
Details
Motivation: Manual thematic analysis in health research scales poorly and is hard to reproduce, while existing LLM-based automation produces codebooks with limited generalizability and lacks analytic auditability.
Method: An automated TA framework built on iterative codebook refinement and end-to-end provenance tracking, evaluated on five corpora spanning clinical interviews, social media, and public transcripts.
Result: The framework achieves the highest composite quality score on four of five datasets, compared against six baselines; iterative refinement yields statistically significant improvements with large effect sizes on four datasets, driven by gains in code reusability and distributional consistency while preserving descriptive quality; on two pediatric cardiology corpora, generated themes align with expert-annotated themes.
Conclusion: The framework addresses the weak codebook generalizability and lack of auditability in LLM-based automated TA, offering more reliable and reproducible automation for qualitative health research.
Abstract: Thematic analysis (TA) is widely used in health research to extract patterns from patient interviews, yet manual TA faces challenges in scalability and reproducibility. LLM-based automation can help, but existing approaches produce codebooks with limited generalizability and lack analytic auditability. We present an automated TA framework combining iterative codebook refinement with full provenance tracking. Evaluated on five corpora spanning clinical interviews, social media, and public transcripts, the framework achieves the highest composite quality score on four of five datasets compared to six baselines. Iterative refinement yields statistically significant improvements on four datasets with large effect sizes, driven by gains in code reusability and distributional consistency while preserving descriptive quality. On two clinical corpora (pediatric cardiology), generated themes align with expert-annotated themes.
[6] Learning When to Sample: Confidence-Aware Self-Consistency for Efficient LLM Chain-of-Thought Reasoning
Juming Xiong,Kevin Guo,Congning Ni,Chao Yan,Katherine Brown,Avinash Baidya,Xiang Gao,Bradley Marlin,Zhijun Yin
Main category: cs.CL
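TL;DR: This paper proposes a confidence-aware decision framework that analyzes a single reasoning trajectory to choose adaptively between single-path and multi-path reasoning, maintaining accuracy while cutting token usage by up to 80%.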
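Illustrative sketch (an assumption-laden toy, not the paper's trained framework): the routing idea can be pictured as "score the first reasoning trace, and only fall back to multi-path self-consistency when confidence is low". The hedging-word heuristic `hedging_confidence`, the threshold, and the stub sampler are hypothetical; the paper instead trains a model on sentence-level numeric and linguistic features.

```python
from collections import Counter

HEDGES = ("maybe", "perhaps", "not sure")  # hypothetical uncertainty cues


def hedging_confidence(trace):
    """Toy confidence: share of sentences in the trace without a hedging cue."""
    sentences = [s for s in trace.split(".") if s.strip()]
    if not sentences:
        return 0.0
    hedged = sum(any(h in s.lower() for h in HEDGES) for s in sentences)
    return 1.0 - hedged / len(sentences)


def answer_adaptively(question, sample_path, confidence, threshold=0.8, n_paths=5):
    """Return (answer, number_of_paths_sampled)."""
    first_answer, trace = sample_path(question)
    if confidence(trace) >= threshold:
        return first_answer, 1  # confident: keep the single path
    # Not confident: sample more paths and take a majority vote (self-consistency).
    answers = [first_answer] + [sample_path(question)[0] for _ in range(n_paths - 1)]
    return Counter(answers).most_common(1)[0][0], n_paths


def make_sampler(outputs):
    """Stub standing in for an LLM call; replays canned (answer, trace) pairs."""
    it = iter(outputs)
    return lambda question: next(it)


confident = make_sampler([("B", "The dose is 5 mg. So the answer is B.")])
uncertain = make_sampler([
    ("A", "Maybe it is A. Not sure."),
    ("C", "It is C."), ("C", "It is C."), ("A", "It is A."), ("C", "It is C."),
])

result_confident = answer_adaptively("q1", confident, hedging_confidence)
result_uncertain = answer_adaptively("q2", uncertain, hedging_confidence)
print(result_confident, result_uncertain)
```

The token savings come from the first branch: confident questions cost one path instead of `n_paths`.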
Details
Motivation: LLMs often generate unnecessarily long paths in chain-of-thought (CoT) reasoning, which raises inference cost; multi-path methods such as self-consistency improve accuracy but add substantial computational overhead.
Method: A confidence-aware decision framework built on sentence-level numeric and linguistic features, trained on intermediate reasoning states from the MedQA dataset and applied to other datasets without additional fine-tuning.
Result: The framework generalizes to MathQA, MedMCQA, and MMLU without fine-tuning; accuracy is comparable to multi-path baselines while using up to 80% fewer tokens.
Conclusion: Reasoning trajectories contain rich uncertainty signals that support a simple, transferable mechanism for balancing accuracy and efficiency in LLM reasoning.
Abstract: Large language models (LLMs) achieve strong reasoning performance through chain-of-thought (CoT) reasoning, yet often generate unnecessarily long reasoning paths that incur high inference cost. Recent self-consistency-based approaches further improve accuracy but require sampling and aggregating multiple reasoning trajectories, leading to substantial additional computational overhead. This paper introduces a confidence-aware decision framework that analyzes a single completed reasoning trajectory to adaptively select between single-path and multi-path reasoning. The framework is trained using sentence-level numeric and linguistic features extracted from intermediate reasoning states in the MedQA dataset and generalizes effectively to MathQA, MedMCQA, and MMLU without additional fine-tuning. Experimental results show that the proposed method maintains accuracy comparable to multi-path baselines while using up to 80% fewer tokens. These findings demonstrate that reasoning trajectories contain rich signals for uncertainty estimation, enabling a simple, transferable mechanism to balance accuracy and efficiency in LLM reasoning.
[7] Reading, Not Thinking: Understanding and Bridging the Modality Gap When Text Becomes Pixels in Multimodal LLMs
Kaiser Sun,Xiaochuang Yuan,Hongjun Liu,Chen Zhao,Cheng Zhang,Mark Dredze,Fan Bai
Main category: cs.CL
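TL;DR: This paper systematically diagnoses the "modality gap", the performance drop that multimodal large language models (MLLMs) show when text is presented as an image, and finds that it depends strongly on task type, data source, and rendering choices such as font and resolution. A large-scale error analysis shows that image input amplifies reading errors and causes chain-of-thought collapse in some models. The authors propose self-distillation on the model's own pure-text reasoning traces, which substantially raises image-mode accuracy and transfers across tasks.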
Details
Motivation: MLLMs often perform markedly worse on text inside images than on the same text given as tokens, but the causes of this "modality gap" are unclear, calling for systematic diagnosis and an interpretable path to improvement.
Method: Evaluates seven MLLMs on seven benchmarks in five input modes; conducts a grounded-theory error analysis of over 4,000 examples; and proposes a self-distillation method that trains the model on its own pure-text reasoning traces paired with image inputs.
Result: The modality gap is task- and data-dependent: math tasks degrade by over 60 points on synthetic renderings, while natural document images often match or exceed text-mode performance. Font alone can swing accuracy by up to 47 percentage points. Image input mainly amplifies reading errors (calculation and formatting failures) and leaves knowledge and reasoning errors largely unchanged, and some models show chain-of-thought collapse under visual input. Self-distillation raises image-mode accuracy on GSM8K from 30.71% to 92.72% and transfers to unseen benchmarks without catastrophic forgetting.
Conclusion: The modality gap is not an inherent defect; it arises from rendering confounds, weak reading of visual input, and unstable reasoning paths. Self-distillation that pairs text-reasoning supervision with image inputs can close much of the gap, offering a practical route to better visual text understanding in MLLMs.
Abstract: Multimodal large language models (MLLMs) can process text presented as images, yet they often perform worse than when the same content is provided as textual tokens. We systematically diagnose this "modality gap" by evaluating seven MLLMs across seven benchmarks in five input modes, spanning both synthetically rendered text and realistic document images from arXiv PDFs to Wikipedia pages. We find that the modality gap is task- and data-dependent. For example, math tasks degrade by over 60 points on synthetic renderings, while natural document images often match or exceed text-mode performance. Rendering choices such as font and resolution are strong confounds, with font alone swinging accuracy by up to 47 percentage points. To understand this, we conduct a grounded-theory error analysis of over 4,000 examples, revealing that image mode selectively amplifies reading errors (calculation and formatting failures) while leaving knowledge and reasoning errors largely unchanged, and that some models exhibit a chain-of-thought reasoning collapse under visual input. Motivated by these findings, we propose a self-distillation method that trains the model on its own pure text reasoning traces paired with image inputs, raising image-mode accuracy on GSM8K from 30.71% to 92.72% and transferring to unseen benchmarks without catastrophic forgetting. Overall, our study provides a systematic understanding of the modality gap and suggests a practical path toward improving visual text understanding in multimodal language models.
[8] Bioalignment: Measuring and Improving LLM Disposition Toward Biological Systems for AI Safety
Trent R Northen,Mingxun Wang
Main category: cs.CL
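TL;DR: This study measures whether large language models (LLMs) systematically favor synthetic over biological technological solutions across four domains (materials, energy, manufacturing, and algorithms) and finds that most models favor synthetic solutions. QLoRA fine-tuning on a corpus of biomedical literature significantly increases the preference of two small open-weight models (Llama 3.2-3B and Qwen2.5-3B) for biological solutions without degrading general capabilities.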
Details
Motivation: LLMs trained on internet-scale data may inherit and amplify a systematic preference for non-biological (synthetic) technologies and overlook sustainable, bio-inspired solutions; this "bio-misalignment" needs to be measured and corrected.
Method: (1) Builds a Bioalignment benchmark of 50 prompts with a Kelly criterion-inspired evaluation framework to quantify preference for biological versus synthetic solutions; (2) tests 5 frontier and 5 open-weight models; (3) uses a corpus of ~22M tokens from PMC articles to QLoRA fine-tune Llama 3.2-3B-Instruct (mixed continued-training and instruction-formatted data) and Qwen2.5-3B-Instruct (instruction-formatted data only); (4) tests the fine-tuning effect statistically.
Result: Most tested models are not bioaligned, that is, they favor synthetic solutions; after QLoRA fine-tuning, both models score biological solutions significantly higher (Holm-Bonferroni-corrected p < 0.001 and p < 0.01, respectively) with no loss of general capabilities.
Conclusion: LLMs show a measurable and correctable bias regarding biological technology; a small amount of domain fine-tuning can shift how models weigh biological and bioinspired versus synthetic approaches, suggesting a feasible route to "bioaligned" AI. The benchmark, corpus, code, and adapter weights are released.
Abstract: Large language models (LLMs) trained on internet-scale corpora can exhibit systematic biases that increase the probability of unwanted behavior. In this study, we examined potential biases towards synthetic vs. biological technological solutions across four domains (materials, energy, manufacturing, and algorithms). A sample of 5 frontier and 5 open-weight models were measured using 50 curated Bioalignment prompts with a Kelly criterion-inspired evaluation framework. According to this metric, most models were not bioaligned in that they exhibit biases in favor of synthetic (non-biological) solutions. We next examined if fine-tuning could increase the preferences of two open-weight models, Llama 3.2-3B-Instruct and Qwen2.5-3B-Instruct, for biological-based approaches. A curated corpus of ~22M tokens from 6,636 PMC articles emphasizing biological problem-solving was used first to fine-tune Llama 3B with a mixed corpus of continued training and instruction-formatted. This was then extended to Qwen 3B using instruction-formatted only. We found that QLoRA fine-tuning significantly increased the scoring of biological solutions for both models without degrading general capabilities (Holm-Bonferroni-corrected p < 0.001 and p < 0.01, respectively). This suggests that even a small amount of fine-tuning can change how models weigh the relative value of biological and bioinspired vs. synthetic approaches. Although this work focused on small open-weight LLMs, it may be extensible to much larger models and could be used to develop models that favor bio-based approaches. We release the benchmark, corpus, code, and adapter weights.
[9] DuplexCascade: Full-Duplex Speech-to-Speech Dialogue with VAD-Free Cascaded ASR-LLM-TTS Pipeline and Micro-Turn Optimization
Jianing Yang,Yusuke Fujita,Yui Sudo
Main category: cs.CL
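TL;DR: This paper presents DuplexCascade, a VAD-free cascaded streaming speech-to-speech dialogue system that achieves full-duplex interaction through chunk-wise micro-turns and special control tokens, combining real-time responsiveness with the conversational intelligence of a text LLM.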
Details
Motivation: Conventional cascaded ASR-LLM-TTS systems rely on VAD, which forces half-duplex turns and brittle control; VAD-free end-to-end models support full duplex but struggle to maintain conversational intelligence. A system is needed that offers both full-duplex interaction and the strengths of a capable LLM.
Method: DuplexCascade converts conventional utterance-wise turns into chunk-wise micro-turn interactions and introduces a set of conversational special control tokens that steer the LLM's turn-taking and response timing under streaming constraints.
Result: On Full-DuplexBench and VoiceBench, DuplexCascade delivers state-of-the-art full-duplex turn-taking and strong conversational intelligence among open-source speech-to-speech dialogue systems.
Conclusion: DuplexCascade shows that high-quality full-duplex spoken dialogue is feasible within a cascaded architecture, bridging system controllability and language-model intelligence.
Abstract: Spoken dialog systems with cascaded ASR-LLM-TTS modules retain strong LLM intelligence, but VAD segmentation often forces half-duplex turns and brittle control. On the other hand, VAD-free end-to-end model support full-duplex interaction but is hard to maintain conversational intelligence. In this paper, we present DuplexCascade, a VAD-free cascaded streaming pipeline for full-duplex speech-to-speech dialogue. Our key idea is to convert conventional utterance-wise long turns into chunk-wise micro-turn interactions, enabling rapid bidirectional exchange while preserving the strengths of a capable text LLM. To reliably coordinate turn-taking and response timing, we introduce a set of conversational special control tokens that steer the LLM's behavior under streaming constraints. On Full-DuplexBench and VoiceBench, DuplexCascade delivers state-of-the-art full-duplex turn-taking and strong conversational intelligence among open-source speech-to-speech dialogue systems.
[10] DEO: Training-Free Direct Embedding Optimization for Negation-Aware Retrieval
Taegyeong Lee,Jiwon Park,Seunghyun Hwang,JooYoung Jang
Main category: cs.CL
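TL;DR: This paper proposes Direct Embedding Optimization (DEO), a training-free method for text and multimodal retrieval with negation and exclusion queries. DEO decomposes the query and optimizes its embedding with a contrastive objective, outperforming baselines on several metrics.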
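Illustrative sketch (not DEO's actual objective): optimizing a query embedding at inference time can be pictured as a few gradient steps that pull the query toward the embedding of its positive component and push it away from the negative one, with no model weights updated. The linear pull/push objective, the anchor term `mu` that keeps the query near its starting point, and the three-dimensional toy vectors are all assumptions standing in for the paper's contrastive objective and real encoder outputs.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def optimize_query(query, positive, negative, lam=1.0, mu=1.0, lr=0.1, steps=50):
    """Projected gradient ascent on q.p - lam*q.n + mu*q.q0 over unit vectors q."""
    q0 = normalize(query)
    p, n = normalize(positive), normalize(negative)
    # The objective is linear in q, so its gradient is the same at every step.
    grad = [pi - lam * ni + mu * q0i for pi, ni, q0i in zip(p, n, q0)]
    q = q0
    for _ in range(steps):
        q = normalize([qi + lr * gi for qi, gi in zip(q, grad)])
    return q


# Toy embeddings: axis 0 = wanted concept, axis 1 = excluded concept.
query = [1.0, 1.0, 0.0]     # raw embedding of a query that mentions both concepts
positive = [1.0, 0.0, 0.0]  # embedding of the positive component
negative = [0.0, 1.0, 0.0]  # embedding of the negated component

q0 = normalize(query)
q_opt = optimize_query(query, positive, negative)
print(round(dot(q0, positive), 3), round(dot(q_opt, positive), 3))
print(round(dot(q0, negative), 3), round(dot(q_opt, negative), 3))
```

Only the query vector changes, which is what makes this style of method training-free: the encoder and the document index stay fixed.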
Details
Motivation: Existing retrieval methods handle negation and exclusion queries poorly, and prior fixes rely on embedding adaptation or fine-tuning, which add computational cost and deployment complexity.
Method: Direct Embedding Optimization (DEO) decomposes a query into positive and negative components and optimizes the query embedding with a contrastive objective, without additional training data or model updates.
Result: On NegConstraint, DEO gains +0.0738 nDCG@10 and +0.1028 MAP@100 over baselines; in multimodal retrieval, it improves Recall@5 by 6% over OpenAI CLIP.
Conclusion: DEO is an efficient, practical training-free method for negation- and exclusion-aware retrieval in real-world settings.
Abstract: Recent advances in Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) have enabled diverse retrieval methods. However, existing retrieval methods often fail to accurately retrieve results for negation and exclusion queries. To address this limitation, prior approaches rely on embedding adaptation or fine-tuning, which introduce additional computational cost and deployment complexity. We propose Direct Embedding Optimization (DEO), a training-free method for negation-aware text and multimodal retrieval. DEO decomposes queries into positive and negative components and optimizes the query embedding with a contrastive objective. Without additional training data or model updates, DEO outperforms baselines on NegConstraint, with gains of +0.0738 nDCG@10 and +0.1028 MAP@100, while improving Recall@5 by +6% over OpenAI CLIP in multimodal retrieval. These results demonstrate the practicality of DEO for negation- and exclusion-aware retrieval in real-world settings.
[11] Emotion is Not Just a Label: Latent Emotional Factors in LLM Processing
Benjamin Reichman,Adar Avasian,Samuel Webster,Larry Heck
Main category: cs.CL
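TL;DR: This paper studies how emotion, as a latent factor, affects attention and reasoning in large language models. It introduces the AURA-QA dataset and an emotional regularization framework that improves reading comprehension on both emotionally varying and non-emotionally varying datasets.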
Details
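Motivation: LLMs are routinely deployed on text with widely varying emotional tone, yet evaluations of their reasoning usually ignore emotion as a source of representational variation; prior work mostly treats emotion as a prediction target (for example, sentiment classification), whereas this paper asks how emotion implicitly shapes the model's attention and reasoning.
Method: Analyzes how emotional tone systematically alters attention geometry in transformer models (locality, center-of-mass distance, entropy); builds AURA-QA, a question-answering dataset with emotionally balanced, human-authored context passages; and proposes a regularization framework that constrains emotion-conditioned representational drift during training.
Result: Attention metrics vary across emotions and correlate with question-answering performance; AURA-QA enables controlled study of emotional effects; the regularization yields consistent gains across multiple QA benchmarks, including under distribution shift.
Conclusion: Emotion is a key latent variable affecting the internal representations and reasoning of language models; explicitly modeling and constraining emotion-induced representational drift improves robustness and generalization, especially on emotionally diverse text.
Abstract: Large language models are routinely deployed on text that varies widely in emotional tone, yet their reasoning behavior is typically evaluated without accounting for emotion as a source of representational variation. Prior work has largely treated emotion as a prediction target, for example in sentiment analysis or emotion classification. In contrast, we study emotion as a latent factor that shapes how models attend to and reason over text. We analyze how emotional tone systematically alters attention geometry in transformer models, showing that metrics such as locality, center-of-mass distance, and entropy vary across emotions and correlate with downstream question-answering performance. To facilitate controlled study of these effects, we introduce Affect-Uniform ReAding QA (AURA-QA), a question-answering dataset with emotionally balanced, human-authored context passages. Finally, an emotional regularization framework is proposed that constrains emotion-conditioned representational drift during training. Experiments across multiple QA benchmarks demonstrate that this approach improves reading comprehension in both emotionally-varying and non-emotionally varying datasets, yielding consistent gains under distribution shift and in-domain improvements on several benchmarks.
[12] SPAR-K: Scheduled Periodic Alternating Early Exit for Spoken Language Models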
Hsiao-Ying Huang,Cheng-Han Chiang,Hung-yi Lee
Main category: cs.CL
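TL;DR: This paper proposes SPAR-K, a modality-aware early-exit framework that accelerates inference in interleaved spoken language models (SLMs), reducing average speech decoding depth by up to 11% while preserving perceptual speech quality.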
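Illustrative sketch (hypothetical numbers, not the paper's configuration): an alternating-depth schedule can be written as a function that assigns an exit layer to each speech position, with every `period`-th position running at full depth as a "refresh". The layer counts and period below are made up to show how the average decoding depth is computed.

```python
def exit_layers(num_positions, full_depth, exit_layer, period):
    """Exit layer per speech position: full depth every `period` steps, early exit otherwise."""
    return [full_depth if i % period == 0 else exit_layer for i in range(num_positions)]


# Hypothetical 32-layer model, early exit at layer 28, refresh every 4th speech position.
layers = exit_layers(num_positions=8, full_depth=32, exit_layer=28, period=4)
average_depth = sum(layers) / len(layers)
depth_reduction = 1 - average_depth / 32

print(layers)
print(average_depth, depth_reduction)
```

Because the schedule is fixed in advance, it needs no per-token confidence estimate, which fits the abstract's claim of no auxiliary computation overhead.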
Details
Motivation: Interleaved spoken language models (SLMs) decode at full transformer depth at every generation step, which is costly, especially for long speech sequences, so efficient inference methods are needed.
Method: SPAR-K is a modality-aware early-exit framework with a speech alternating-depth schedule: most speech positions exit at a fixed intermediate layer, while periodic full-depth "refresh" steps mitigate the distribution shift caused by early exit.
Result: On Step-Audio-2-mini and GLM-4-Voice, SPAR-K reduces average speech decoding depth by up to 11% and 5%, respectively, with a maximum question-answering accuracy drop of 0.82%, negligible changes in MOS and WER, and no auxiliary computation overhead; confidence-based early-exit strategies widely used in text LLMs are shown to be suboptimal for SLMs.
Conclusion: Speech tokens have distinctive statistical properties that call for a specialized early-exit design; SPAR-K balances efficiency and quality for SLM deployment.
Abstract: Interleaved spoken language models (SLMs) alternately generate text and speech tokens, but decoding at full transformer depth for every step becomes costly, especially due to long speech sequences. We propose SPAR-K, a modality-aware early exit framework designed to accelerate interleaved SLM inference while preserving perceptual quality. SPAR-K introduces a speech alternating-depth schedule: most speech positions exit at a fixed intermediate layer, while periodic full-depth "refresh" steps mitigate distribution shift due to early exit. We evaluate our framework using Step-Audio-2-mini and GLM-4-Voice across four datasets spanning reasoning, factual QA, and dialogue tasks, measuring performance in terms of ASR transcription accuracy and perceptual quality. Experimental results demonstrate that SPAR-K largely preserves question-answering accuracy with a maximum accuracy drop of 0.82% while reducing average speech decoding depth by up to 11% on Step-Audio-2-mini and 5% on GLM-4-Voice, both with negligible changes in MOS and WER and no auxiliary computation overhead. We further demonstrate that confidence-based early exit strategies, widely used in text LLMs, are suboptimal for SLMs, highlighting that the unique statistical nature of speech tokens necessitates a specialized early exit design.
[13] LooComp: Leverage Leave-One-Out Strategy to Encoder-only Transformer for Efficient Query-aware Context Compression
Thao Do,Dinh Phu Tran,An Vo,Seon Kwon Kim,Daeyoung Kim
Main category: cs.CL
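TL;DR: This paper proposes a margin-based framework for query-driven context pruning that identifies critical sentences by measuring how clue richness changes when each sentence is omitted, yielding fast, compact, and precise context compression for retrieval-augmented generation (RAG).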
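Illustrative sketch (a toy, not the paper's model): the leave-one-out signal can be shown by scoring the full context, removing one sentence at a time, and recording how much the score drops. The word-overlap function `clue_richness` is a hypothetical stand-in for whatever clue-richness measure the authors use, and the paper trains an encoder to predict such criticality rather than recomputing it at inference.

```python
def clue_richness(query, sentences):
    """Toy stand-in: fraction of query words that appear somewhere in the context."""
    query_words = set(query.lower().split())
    context_words = set(" ".join(sentences).lower().split())
    return len(query_words & context_words) / len(query_words)


def leave_one_out_scores(query, sentences):
    """Drop in clue richness when each sentence is removed; larger means more critical."""
    full = clue_richness(query, sentences)
    return [
        full - clue_richness(query, sentences[:i] + sentences[i + 1:])
        for i in range(len(sentences))
    ]


query = "capital of france"
context = [
    "paris is the capital of france",
    "the weather was mild",
    "france borders spain",
]

scores = leave_one_out_scores(query, context)
kept = [s for s, score in zip(context, scores) if score > 0]
print(scores)
print(kept)
```

Only the first sentence changes the score when removed, so it is the only one a pruner built on this signal would keep.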
Details
Motivation: Efficient context compression is crucial for the accuracy and scalability of question answering; in RAG, context must be delivered fast, compact, and precise to ensure clue sufficiency and keep the LLM reader's cost low.
Method: A margin-based, query-driven context pruning framework built on a lightweight encoder-only Transformer, trained with a composite ranking loss that enforces large margins for critical sentences while keeping non-critical ones near neutral.
Result: Strong exact-match and F1 scores, with high-throughput inference and lower memory requirements than major baselines, and effective compression ratios without degrading answering performance.
Conclusion: The method is a lightweight, practical alternative for context compression in retrieval-augmented tasks.
Abstract: Efficient context compression is crucial for improving the accuracy and scalability of question answering. For the efficiency of Retrieval Augmented Generation, context should be delivered fast, compact, and precise to ensure clue sufficiency and budget-friendly LLM reader cost. We propose a margin-based framework for query-driven context pruning, which identifies sentences that are critical for answering a query by measuring changes in clue richness when they are omitted. The model is trained with a composite ranking loss that enforces large margins for critical sentences while keeping non-critical ones near neutral. Built on a lightweight encoder-only Transformer, our approach generally achieves strong exact-match and F1 scores with high-throughput inference and lower memory requirements than those of major baselines. In addition to efficiency, our method yields effective compression ratios without degrading answering performance, demonstrating its potential as a lightweight and practical alternative for retrieval-augmented tasks.
[14] TaSR-RAG: Taxonomy-guided Structured Reasoning for Retrieval-Augmented Generation
Jiashuo Sun,Yixuan Xie,Jimeng Shi,Shaowen Wang,Jiawei Han
Main category: cs.CL
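TL;DR: This paper proposes TaSR-RAG, a structured reasoning framework guided by a lightweight two-level taxonomy. It represents queries and documents as relational triples and selects multi-hop evidence through step-wise triple matching and an explicit entity binding table, improving multi-hop question answering and reasoning faithfulness.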
Details
Motivation: Most RAG systems retrieve unstructured chunks and rely on one-shot generation, which yields redundant context, low information density, and brittle multi-hop reasoning; structured RAG pipelines typically require costly, error-prone graph construction or impose rigid entity-centric structures that do not align with the query's reasoning chain.
Method: TaSR-RAG (1) constrains the entity semantics of triples with a lightweight two-level taxonomy; (2) decomposes a complex question into an ordered sequence of triple sub-queries with explicit latent variables; (3) selects evidence step by step via hybrid matching that combines semantic similarity over raw triples with structural consistency over typed triples; and (4) maintains an entity binding table across steps to resolve intermediate variables and reduce entity conflation.
Result: On multiple multi-hop QA benchmarks, TaSR-RAG consistently outperforms strong RAG and structured-RAG baselines by up to 14%, while producing clearer evidence attribution and more faithful reasoning traces.
Conclusion: With taxonomy-guided triple representation and step-wise hybrid matching, TaSR-RAG achieves accurate, interpretable, and less brittle multi-hop RAG without explicit graph construction or exhaustive search.
Abstract: Retrieval-Augmented Generation (RAG) helps large language models (LLMs) answer knowledge-intensive and time-sensitive questions by conditioning generation on external evidence. However, most RAG systems still retrieve unstructured chunks and rely on one-shot generation, which often yields redundant context, low information density, and brittle multi-hop reasoning. While structured RAG pipelines can improve grounding, they typically require costly and error-prone graph construction or impose rigid entity-centric structures that do not align with the query's reasoning chain. We propose TaSR-RAG, a taxonomy-guided structured reasoning framework for evidence selection. We represent both queries and documents as relational triples, and constrain entity semantics with a lightweight two-level taxonomy to balance generalization and precision. Given a complex question, TaSR-RAG decomposes it into an ordered sequence of triple sub-queries with explicit latent variables, then performs step-wise evidence selection via hybrid triple matching that combines semantic similarity over raw triples with structural consistency over typed triples. By maintaining an explicit entity binding table across steps, TaSR-RAG resolves intermediate variables and reduces entity conflation without explicit graph construction or exhaustive search. Experiments on multiple multi-hop question answering benchmarks show that TaSR-RAG consistently outperforms strong RAG and structured-RAG baselines by up to 14%, while producing clearer evidence attribution and more faithful reasoning traces.
[15] Quantifying and extending the coverage of spatial categorization data sets
Wanchun Li,Alexandra Carstensen,Yang Xu,Terry Regier,Charles Kemp
Main category: cs.CL
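TL;DR: This paper uses spatial relation labels generated by large language models (LLMs) to extend spatial categorization data sets such as the TRPS. LLM labels align relatively well with human labels, and the authors add 42 new scenes that improve the data set's coverage.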
Details
Motivation: Existing spatial relation data sets such as the TRPS cover a limited set of languages and scenes; the paper explores how to extend them efficiently toward large, multilingual data sets.
Method: Uses LLMs to generate spatial relation labels for the TRPS and for newly designed scenes, compares them with human labels, and uses the LLM-generated labels to decide which scenes and languages to add.
Result: LLM-generated labels align relatively well with human labels; the extension with 42 new scenes covers the space of possible scenes better than two previous extensions of the TRPS.
Conclusion: LLMs can support scaling up spatial semantic data sets, providing a foundation for data sets with dozens of languages and hundreds of scenes.
Abstract: Variation in spatial categorization across languages is often studied by eliciting human labels for the relations depicted in a set of scenes known as the Topological Relations Picture Series (TRPS). We demonstrate that labels generated by large language models (LLMs) align relatively well with human labels, and show how LLM-generated labels can help to decide which scenes and languages to add to existing spatial data sets. To illustrate our approach we extend the TRPS by adding 42 new scenes, and show that this extension achieves better coverage of the space of possible scenes than two previous extensions of the TRPS. Our results provide a foundation for scaling towards spatial data sets with dozens of languages and hundreds of scenes.
[16] Reward Prediction with Factorized World States
Yijun Shen,Delong Chen,Xianming Hu,Jiaming Mi,Hongbo Zhao,Kai Zhang,Pascale Fung
Main category: cs.CL
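TL;DR: This paper proposes StateFactory, which uses language models to turn unstructured observations into a hierarchical object-attribute structure and estimates reward as the semantic similarity between the current state and the goal state, enabling zero-shot reward generalization across domains.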
Details
Motivation: Supervised reward models can inherit biases from their training data and generalize poorly; the paper asks whether well-defined world state representations alone can enable accurate reward prediction across domains.
Method: StateFactory, a language-model-based factorized representation method that maps raw observations to a hierarchical object-attribute structure and estimates reward as the semantic similarity between the current state and the goal state under hierarchical constraints.
Result: On RewardPrediction, a new benchmark spanning five domains, the method achieves 60% and 8% lower EPIC distance than the VLWM-critic and LLM-as-a-Judge reward models, respectively; it raises planning success rates by +21.64% on AlfWorld and +12.40% on ScienceWorld over reactive system-1 policies.
Conclusion: Compact, structured state representations such as those built by StateFactory improve zero-shot reward generalization and strengthen system-2 agent planning.
Abstract: Agents must infer action outcomes and select actions that maximize a reward signal indicating how close the goal is to being reached. Supervised learning of reward models could introduce biases inherent to training data, limiting generalization to novel goals and environments. In this paper, we investigate whether well-defined world state representations alone can enable accurate reward prediction across domains. To address this, we introduce StateFactory, a factorized representation method that transforms unstructured observations into a hierarchical object-attribute structure using language models. This structured representation allows rewards to be estimated naturally as the semantic similarity between the current state and the goal state under hierarchical constraint. Overall, the compact representation structure induced by StateFactory enables strong reward generalization capabilities. We evaluate on RewardPrediction, a new benchmark dataset spanning five diverse domains and comprising 2,454 unique action-observation trajectories with step-wise ground-truth rewards. Our method shows promising zero-shot results against both VLWM-critic and LLM-as-a-Judge reward models, achieving 60% and 8% lower EPIC distance, respectively. Furthermore, this superior reward quality successfully translates into improved agent planning performance, yielding success rate gains of +21.64% on AlfWorld and +12.40% on ScienceWorld over reactive system-1 policies and enhancing system-2 agent planning.
Project Page: https://statefactory.github.io
[17] LLM as a Meta-Judge: Synthetic Data for NLP Evaluation Metric Validation
Lukáš Eigler,Jindřich Libovický,David Hurych
Main category: cs.CL
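TL;DR: This paper proposes "LLM as a Meta-Judge", a scalable framework that uses large language models to generate synthetic evaluation datasets through controlled semantic degradation, replacing costly and time-consuming human annotation when validating NLG evaluation metrics.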
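Illustrative sketch (one plausible reading, not the paper's exact definition): if each evaluation metric gets a quality score from human-annotated data and another from synthetic degraded data, the agreement between the two resulting metric rankings can be measured with a rank correlation. The sketch below uses Kendall's tau (tau-a, no tie correction) as an assumed choice; the metric names and all scores are made up.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a between two equally long score lists (ties count as neither)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Made-up quality scores for four hypothetical metrics under each validation source.
human_scores = {"metric_a": 0.30, "metric_b": 0.45, "metric_c": 0.60, "metric_d": 0.70}
synthetic_scores = {"metric_a": 0.40, "metric_b": 0.35, "metric_c": 0.65, "metric_d": 0.80}

names = sorted(human_scores)
tau = kendall_tau([human_scores[m] for m in names], [synthetic_scores[m] for m in names])
print(f"rank agreement between human and synthetic validation: {tau:.3f}")
```

In this toy case the two sources disagree only on the order of `metric_a` and `metric_b`, so five of six metric pairs are ranked consistently.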
Details
Motivation: Validating NLG evaluation metrics relies heavily on expensive, time-consuming human annotations, which exist predominantly for English datasets, limiting multilingual coverage and scalability.
Method: The "LLM as a Meta-Judge" framework uses LLMs to apply controlled semantic degradation to real data, producing synthetic evaluation datasets; "meta-correlation" measures how well metric rankings derived from synthetic data agree with those from human benchmarks.
Result: Experiments on machine translation, question answering, and summarization show that synthetic validation is a reliable proxy for human judgment, with meta-correlations exceeding 0.9 in multilingual QA, and a viable alternative where human judgments are unavailable or too expensive.
Conclusion: LLM as a Meta-Judge is an efficient, scalable, multilingual approach to validating NLG evaluation metrics that can substantially reduce reliance on human annotation.
Abstract: Validating evaluation metrics for NLG typically relies on expensive and time-consuming human annotations, which predominantly exist only for English datasets. We propose LLM as a Meta-Judge, a scalable framework that utilizes LLMs to generate synthetic evaluation datasets via controlled semantic degradation of real data, replacing human judgment. We validate our approach using meta-correlation, measuring the alignment between metric rankings derived from synthetic data and those from standard human benchmarks. Experiments across Machine Translation, Question Answering, and Summarization demonstrate that synthetic validation serves as a reliable proxy for human judgment, achieving meta-correlations exceeding 0.9 in multilingual QA and proves to be a viable alternative where human judgments are unavailable or too expensive to obtain. Our code and data will become publicly available upon paper acceptance.
[18] Investigating Gender Stereotypes in Large Language Models via Social Determinants of Health
Trung Hieu Ngo,Adrien Bazoge,Solen Quiniou,Pierre-Antoine Gourraud,Emmanuel Morin
Main category: cs.CL
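TL;DR: This study examines bias in large language models (LLMs) concerning the interaction between gender and other social determinants of health (SDoH) in French patient records. It finds that LLMs rely on stereotypes embedded in their training data to make gendered decisions, and recommends adding the evaluation of interactions among SDoH factors to existing bias assessment frameworks.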
Details
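Motivation: Existing bias benchmarks mostly target a single social determinant of health (such as gender or ethnicity) and overlook both interaction effects and the context specificity of sensitive domains like healthcare, so a more comprehensive way to assess bias is needed. Method: Designs a series of experiments on French patient records that probe how LLMs respond to relationships between gender and other SDoH factors, and analyzes whether the models rely on embedded stereotypes to make gendered judgments. Result: The experiments confirm that SDoH input can trigger embedded stereotypes in LLMs and that the models make gendered decisions on that basis, indicating observable and consequential bias in the interactions among SDoH factors. Conclusion: Bias evaluation for LLMs should include interactions among SDoH factors; this perspective usefully complements the current single-factor evaluation paradigm, particularly in high-stakes domains such as healthcare. Abstract: Large Language Models (LLMs) excel in Natural Language Processing (NLP) tasks, but they often propagate biases embedded in their training data, which is potentially impactful in sensitive domains like healthcare. While existing benchmarks evaluate biases related to individual social determinants of health (SDoH) such as gender or ethnicity, they often overlook interactions between these factors and lack context-specific assessments. This study investigates bias in LLMs by probing the relationships between gender and other SDoH in French patient records. Through a series of experiments, we found that embedded stereotypes can be probed using SDoH input and that LLMs rely on embedded stereotypes to make gendered decisions, suggesting that evaluating interactions among SDoH factors could usefully complement existing approaches to assessing LLM performance and bias.
[19] Common Sense vs. Morality: The Curious Case of Narrative Focus Bias in LLMs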
Saugata Purkayastha,Pranav Kushare,Pragya Paramita Pal,Sukannya Purkayastha
Main category: cs.CL
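TL;DR: This paper exposes a flaw in how large language models (LLMs) trade off moral reasoning against commonsense understanding, introduces the CoMoral benchmark dataset to evaluate whether models can spot commonsense contradictions inside moral dilemmas, and finds that models exhibit a narrative focus bias, calling for more reasoning-aware training.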
Details
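Motivation: LLMs are now widely deployed in real-world settings and need both moral grounding and commonsense ability; yet they often lean too far toward moral reasoning at the expense of commonsense consistency, a limitation that has not been systematically exposed or evaluated. Method: Builds CoMoral, a new benchmark dataset with commonsense contradictions embedded in moral dilemmas; systematically evaluates ten LLMs of different sizes and analyzes how contradiction detection differs between the primary (narrator) character and secondary characters, revealing a narrative focus bias. Result: Existing LLMs generally struggle to identify commonsense contradictions without a prior hint, and they detect contradictions attributed to a secondary character markedly more readily than those attributed to the narrator, confirming the narrative focus bias. Conclusion: The commonsense robustness of LLMs needs improvement, and reasoning-aware training is required to balance moral reasoning with commonsense understanding. Abstract: Large Language Models (LLMs) are increasingly deployed across diverse real-world applications and user communities. As such, it is crucial that these models remain both morally grounded and knowledge-aware. In this work, we uncover a critical limitation of current LLMs -- their tendency to prioritize moral reasoning over commonsense understanding. To investigate this phenomenon, we introduce CoMoral, a novel benchmark dataset containing commonsense contradictions embedded within moral dilemmas. Through extensive evaluation of ten LLMs across different model sizes, we find that existing models consistently struggle to identify such contradictions without prior signal. Furthermore, we observe a pervasive narrative focus bias, wherein LLMs more readily detect commonsense contradictions when they are attributed to a secondary character rather than the primary (narrator) character. Our comprehensive analysis underscores the need for enhanced reasoning-aware training to improve the commonsense robustness of large language models.
[20] Modelling the Diachronic Emergence of Phoneme Frequency Distributions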
Fermín Moscoso del Prado Martín,Suchir Salhan
Main category: cs.CL
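TL;DR: This paper builds a stochastic model of sound change to simulate the diachronic evolution of phonological systems. Once a functional-load effect and a stabilising tendency toward a preferred inventory size are added, the model reproduces the cross-linguistic statistical regularities of phoneme frequency distributions, suggesting that these regularities may be a natural outcome of diachronic sound change rather than of explicit optimisation or compensatory mechanisms.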
Details
Motivation: Phoneme frequency distributions show robust statistical regularities across languages (such as exponential-tailed rank-frequency patterns and a negative relationship between phonemic inventory size and relative entropy), but their origin has not been adequately explained. Method: Proposes a stochastic model of phonological change that simulates the diachronic evolution of phonological systems, then extends the base model with two assumptions: a functional-load effect and a stabilising tendency toward a preferred inventory size. Result: The extended model reproduces both the observed phoneme rank-frequency distributions and the negative relationship between inventory size and relative entropy. Conclusion: Some statistical regularities of phonological systems may be natural consequences of diachronic sound change rather than products of explicit optimisation or compensatory mechanisms. Abstract: Phoneme frequency distributions exhibit robust statistical regularities across languages, including exponential-tailed rank-frequency patterns and a negative relationship between phonemic inventory size and the relative entropy of the distribution. The origin of these patterns remains largely unexplained. In this paper, we investigate whether they can arise as consequences of the historical processes that shape phonological systems. We introduce a stochastic model of phonological change and simulate the diachronic evolution of phoneme inventories. A naïve version of the model reproduces the general shape of phoneme rank-frequency distributions but fails to capture other empirical properties. Extending the model with two additional assumptions -- an effect related to functional load and a stabilising tendency toward a preferred inventory size -- yields simulations that match both the observed distributions and the negative relationship between inventory size and relative entropy. These results suggest that some statistical regularities of phonological systems may arise as natural consequences of diachronic sound change rather than from explicit optimisation or compensatory mechanisms.
[21] You Didn't Have to Say It like That: Subliminal Learning from Faithful Paraphrases
Isaia Gisler,Zhonghao He,Tianyi Qiu
Main category: cs.CL
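TL;DR: This paper studies how, when language models are trained on synthetic data, a student model can implicitly pick up a teacher model's behavioral preferences (such as a fondness for a particular animal) from natural language paraphrases whose content is semantically unrelated to, or even contradicts, that preference, showing how covert and persistent "subliminal learning" is.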
Details
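Motivation: To test whether, in a natural language paraphrase setting, a student model still implicitly acquires a teacher model's behavioral traits from content that is semantically unrelated to, or even contradicts, the teacher's preference, and thereby to assess the risk of subliminal learning in data-generation pipelines. Method: A teacher model is system-prompted to prefer a particular animal and generates many paraphrases with fixed semantic content but varied form (including content unrelated to the animal or explicitly expressing dislike of it), which are used to train a student model; the change in the student's preference for that animal is then measured, with strict fidelity filtering as a control. Result: After training, the student's preference for the animal rises substantially (by up to 19 percentage points), and the effect persists when the paraphrased content is semantically unrelated to the preference or explicitly opposes it; filtering for content fidelity does not block the effect. Conclusion: Subliminal learning occurs robustly through natural language paraphrases and is not effectively suppressed by content inspection or preference-contradicting content, which poses a serious alignment risk for training on self-generated data. Abstract: When language models are trained on synthetic data, they (student model) can covertly acquire behavioral traits from the data-generating model (teacher model). Subliminal learning refers to the transmission of traits from a teacher to a student model via training on data unrelated to those traits. Prior work demonstrated this in the training domains of number sequences, code, and math Chain-of-Thought traces including transmission of misaligned behaviors. We investigate whether transmission occurs through natural language paraphrases with fixed semantic content, and whether content explicitly contradicting the teacher's preference can block it. We find that training on paraphrases from a teacher system-prompted to love a particular animal increases a student's preference for that animal by up to 19 percentage points. This occurs when paraphrased content is semantically unrelated to the animal, or even when it explicitly expresses dislike. The transmission succeeds despite aggressive filtering to ensure paraphrase fidelity. This raises concerns for pipelines where models generate their own training data: content-based inspection cannot detect such transmission, and even preference-contradicting content fails to prevent it.
[22] ALARM: Audio-Language Alignment for Reasoning Models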
Petr Grinberg,Hassan Shahmohammadi
Main category: cs.CL
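TL;DR: This paper proposes self-rephrasing, a method for training audio language models (ALMs) on top of reasoning LLMs (RLMs), and combines it with fused, compressed multi-audio encoders to efficiently train a 4B-parameter ALM on a 6M-sample multi-task dataset, reaching open-source state of the art on several audio-reasoning benchmarks.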
Details
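Motivation: Existing ALM training methods, which freeze the LLM and train only an adapter, work poorly for reasoning LLMs (RLMs) with chain-of-thought capability, because the textual surrogate input is exposed in the model's output and the responses become unnatural. Method: Proposes self-rephrasing, which converts the model's self-generated text responses into audio-understanding variants suited to RLMs while preserving distributional alignment; fuses and compresses multiple audio encoders for stronger representations; and builds a multi-task training corpus of 6M instances (2.5M unique prompts) covering 19K hours of speech, music, and sound. Result: The resulting 4B-parameter ALM surpasses most larger ALMs on audio-reasoning benchmarks (such as MMAU-speech and MMSU), achieves the best open-source results there, and ranks third among all evaluated models, while preserving text capabilities at low training cost. Conclusion: Self-rephrasing combined with multi-encoder fusion resolves a key bottleneck in adapting RLMs into ALMs and offers a new way to efficiently build lightweight ALMs with both strong audio understanding and native text ability. Abstract: Large audio language models (ALMs) extend LLMs with auditory understanding. A common approach freezes the LLM and trains only an adapter on self-generated targets. However, this fails for reasoning LLMs (RLMs) whose built-in chain-of-thought traces expose the textual surrogate input, yielding unnatural responses. We propose self-rephrasing, converting self-generated responses into audio-understanding variants compatible with RLMs while preserving distributional alignment. We further fuse and compress multiple audio encoders for stronger representations. For training, we construct a 6M-instance multi-task corpus (2.5M unique prompts) spanning 19K hours of speech, music, and sound. Our 4B-parameter ALM outperforms similarly sized models and surpasses most larger ALMs on related audio-reasoning benchmarks, while preserving textual capabilities with a low training cost. Notably, we achieve the best open-source result on the MMAU-speech and MMSU benchmarks and rank third among all the models.
[23] Build, Borrow, or Just Fine-Tune? A Political Scientist's Guide to Choosing NLP Models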
Shreyas Meher
Main category: cs.CL
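TL;DR: This paper compares a fine-tuned ModernBERT (Confli-mBERT) with a domain-pretrained model (ConfliBERT) on conflict event classification and finds that the performance gap is concentrated in rare event categories. On that basis it proposes a practical decision framework, built on class distribution, error tolerance, and resource constraints, to guide political scientists in choosing an NLP modeling strategy.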
Details
Motivation: Political science lacks empirical guidance on how to weigh building a domain-specific model, borrowing an existing one, or fine-tuning a general-purpose model. Method: Using conflict event classification as a test case, fine-tunes ModernBERT on the Global Terrorism Database (GTD) to obtain Confli-mBERT and systematically compares it (accuracy and F1) against ConfliBERT, the current domain gold standard, with particular attention to differences between high-frequency and low-frequency event categories. Result: Confli-mBERT reaches 75.46% accuracy, somewhat below ConfliBERT's 79.34%; F1 is nearly identical on high-frequency attack types (such as bombing and kidnapping), and the gap is concentrated in rare event categories that make up less than 2% of incidents. Conclusion: Model choice should not rest on an abstract comparison of which model is "better", but on the class distribution, tolerable error, and available resources of the specific research question, using the practical decision framework proposed in the paper. Abstract: Political scientists increasingly face a consequential choice when adopting natural language processing tools: build a domain-specific model from scratch, borrow and adapt an existing one, or simply fine-tune a general-purpose model on task data? Each approach occupies a different point on the spectrum of performance, cost, and required expertise, yet the discipline has offered little empirical guidance on how to navigate this trade-off. This paper provides such guidance. Using conflict event classification as a test case, I fine-tune ModernBERT on the Global Terrorism Database (GTD) to create Confli-mBERT and systematically compare it against ConfliBERT, a domain-specific pretrained model that represents the current gold standard. Confli-mBERT achieves 75.46% accuracy compared to ConfliBERT's 79.34%. Critically, the four-percentage-point gap is not uniform: on high-frequency attack types such as Bombing/Explosion (F1 = 0.95 vs. 0.96) and Kidnapping (F1 = 0.92 vs. 0.91), the models are nearly indistinguishable. Performance differences concentrate in rare event categories comprising fewer than 2% of all incidents. I use these findings to develop a practical decision framework for political scientists considering any NLP-assisted research task: when does the research question demand a specialized model, and when does an accessible fine-tuned alternative suffice? The answer, I argue, depends not on which model is "better" in the abstract, but on the specific intersection of class prevalence, error tolerance, and available resources.
The model, training code, and data are publicly available on Hugging Face.
[24] Surgical Repair of Collapsed Attention Heads in ALiBi Transformers
Palmer Schallon
Main category: cs.CL
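TL;DR: This paper finds that ALiBi positional encoding causes systematic attention-head collapse in the BLOOM model family, proposes a surgical reinitialization method that restores head function, and suggests that pretrained attention configurations may be suboptimal local minima.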
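For readers unfamiliar with ALiBi, the following sketch shows the standard per-head slope schedule for a power-of-two head count and how the linear distance penalty enters a causal attention row. The collapse check is a hypothetical illustration only: the function names, the 0.9 threshold, and the "mean attention mass on the first token" criterion are assumptions, not the paper's diagnostic.

```python
import math


def alibi_slopes(n_heads: int) -> list[float]:
    """Standard ALiBi slopes for a power-of-two head count: 2^(-8(i+1)/n)."""
    assert n_heads > 0 and n_heads & (n_heads - 1) == 0, "power-of-two head count assumed"
    return [2.0 ** (-8.0 * (i + 1) / n_heads) for i in range(n_heads)]


def causal_alibi_attention(scores: list[float], slope: float) -> list[float]:
    """Softmax over keys 0..t for the query at position t, after the ALiBi penalty.

    Each raw score is reduced by slope * (distance from the query position).
    """
    t = len(scores) - 1
    biased = [s - slope * (t - j) for j, s in enumerate(scores)]
    peak = max(biased)
    exps = [math.exp(b - peak) for b in biased]
    total = sum(exps)
    return [e / total for e in exps]


def bos_mass(attn_rows: list[list[float]]) -> float:
    """Mean attention weight placed on position 0 across query rows."""
    return sum(row[0] for row in attn_rows) / len(attn_rows)


def is_collapsed(attn_rows: list[list[float]], threshold: float = 0.9) -> bool:
    """Hypothetical criterion: a head is flagged if most attention goes to token 0."""
    return bos_mass(attn_rows) >= threshold


slopes = alibi_slopes(16)
weights = causal_alibi_attention([0.0, 0.0, 0.0], slope=1.0)
```

Under this schedule the largest slopes, and therefore the steepest distance penalties, fall on the lowest head indices within each layer.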
Details
Motivation: To address the systematic attention-head collapse caused by ALiBi positional encoding in BLOOM models and improve model performance. Method: Proposes surgical reinitialization: Q/K/V weights of collapsed heads are reinitialized, their output projections are zeroed, and all non-surgical parameters are frozen via gradient masking; the method is validated on BLOOM-1b7 using a single consumer GPU. Result: The number of operational attention heads in BLOOM-1b7 is restored from 242 to 379 (of 384), or 98.7% of capacity; in an extended experiment, the reinitialized model transiently beats the stock model by 25% on training perplexity (12.70 vs. 16.99). Conclusion: ALiBi-induced attention collapse follows a predictable pattern, and surgical reinitialization repairs it efficiently; pretrained attention configurations are not optimal and leave room for improvement; code, checkpoints, and diagnostic tools are open-sourced. Abstract: We identify a systematic attention collapse pathology in the BLOOM family of transformer language models, where ALiBi positional encoding causes 31-44% of attention heads to attend almost entirely to the beginning-of-sequence token. The collapse follows a predictable pattern across four model scales (560M to 7.1B parameters), concentrating in head indices where ALiBi's slope schedule imposes the steepest distance penalties. We introduce surgical reinitialization: targeted Q/K/V reinitialization with zeroed output projections and gradient-masked freezing of all non-surgical parameters. Applied to BLOOM-1b7 on a single consumer GPU, the technique recovers 98.7% operational head capacity (242 to 379 of 384 heads) in two passes. A controlled comparison with C4 training data confirms that reinitialization -- not corpus content -- drives recovery, and reveals two distinct post-surgical phenomena: early global functional redistribution that improves the model, and late local degradation that accumulates under noisy training signal. An extended experiment reinitializing mostly-healthy heads alongside collapsed ones produces a model that transiently outperforms stock BLOOM-1b7 by 25% on training perplexity (12.70 vs. 16.99), suggesting that pretrained attention configurations are suboptimal local minima. Code, checkpoints, and diagnostic tools are released as open-source software.
[25] Tracking Cancer Through Text: Longitudinal Extraction From Radiology Reports Using Open-Source Large Language Models
Luc Builtjes,Alessa Hering
Main category: cs.CL
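TL;DR: This paper presents a fully open-source, locally deployable pipeline for longitudinal information extraction from radiology reports. It uses the qwen2.5-72b model to extract and link target, non-target, and new lesions according to RECIST criteria, achieving high-accuracy (93.7%-94.9%) structured data extraction in privacy-sensitive healthcare settings.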
Details
Motivation: Radiology reports carry key longitudinal information on tumor burden, treatment response, and disease progression, but their unstructured text hinders automated analysis; most state-of-the-art LLMs are proprietary and hard to reconcile with the data-privacy and reproducibility requirements of healthcare. Method: Built on the open-source llm_extractinator framework, the pipeline uses the qwen2.5-72b large language model to extract and link target, non-target, and new lesion information from multi-timepoint radiology reports in accordance with RECIST criteria. Result: Evaluated on 50 pairs of Dutch thorax/abdomen CT reports, attribute-level accuracy is 93.7% for target lesions, 94.9% for non-target lesions, and 94.0% for new lesions. Conclusion: Open-source LLMs can reach clinically meaningful performance on multi-timepoint oncology tasks while preserving data privacy and reproducibility, supporting scalable extraction of structured longitudinal data from routine clinical text. Abstract: Radiology reports capture crucial longitudinal information on tumor burden, treatment response, and disease progression, yet their unstructured narrative format complicates automated analysis. While large language models (LLMs) have advanced clinical text processing, most state-of-the-art systems remain proprietary, limiting their applicability in privacy-sensitive healthcare environments. We present a fully open-source, locally deployable pipeline for longitudinal information extraction from radiology reports, implemented using the llm_extractinator framework. The system applies the qwen2.5-72b model to extract and link target, non-target, and new lesion data across time points in accordance with RECIST criteria. Evaluation on 50 Dutch CT Thorax/Abdomen report pairs yielded high extraction performance, with attribute-level accuracies of 93.7% for target lesions, 94.9% for non-target lesions, and 94.0% for new lesions. The approach demonstrates that open-source LLMs can achieve clinically meaningful performance in multi-timepoint oncology tasks while ensuring data privacy and reproducibility. These results highlight the potential of locally deployable LLMs for scalable extraction of structured longitudinal data from routine clinical text.
[26] Understanding the Interplay between LLMs' Utilisation of Parametric and Contextual Knowledge: A keynote at ECIR 2025
Isabelle Augenstein
Main category: cs.CL
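TL;DR: This talk examines how language models (LMs) acquire, use, and update parametric knowledge, focusing on conflicts between parametric knowledge and externally retrieved context (both inter-memory and intra-memory conflicts), and presents methods for evaluating model knowledge, diagnosing knowledge conflicts, and understanding what makes contextual knowledge effective.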
Details
Motivation: Language models embed large amounts of knowledge through training, but they are hard to interpret and costly to update; in knowledge-intensive tasks they often ignore provided context, especially when it conflicts with their internal parametric knowledge, so the interaction between parametric and contextual knowledge needs to be understood. Method: Presents a line of research comprising (1) systematic evaluation of the knowledge in LMs, (2) diagnostic tests designed to reveal knowledge conflicts, and (3) analysis of the characteristics of contextual knowledge that models use successfully. Result: Identifies two key types of knowledge conflict: "inter-memory conflict" between model parameters and external context, and "intra-memory conflict" inherent within the parameters; finds that whether context is adopted depends on features such as its consistency with parametric knowledge, how formally it is expressed, and positional cues. Conclusion: Understanding the dynamic interaction between parametric and contextual knowledge is key to making LMs more reliable, controllable, and updatable; future work should build more robust knowledge-integration mechanisms and develop lightweight knowledge-editing methods. Abstract: Language Models (LMs) acquire parametric knowledge from their training process, embedding it within their weights. The increasing scalability of LMs, however, poses significant challenges for understanding a model's inner workings and further for updating or correcting this embedded knowledge without the significant cost of retraining. Moreover, when using these language models for knowledge-intensive language understanding tasks, LMs have to integrate relevant context, mitigating their inherent weaknesses, such as incomplete or outdated knowledge. Nevertheless, studies indicate that LMs often ignore the provided context as it can be in conflict with the pre-existing LM's memory learned during pre-training. Conflicting knowledge can also already be present in the LM's parameters, termed intra-memory conflict. This underscores the importance of understanding the interplay between how a language model uses its parametric knowledge and the retrieved contextual knowledge. In this talk, I will aim to shed light on this important issue by presenting our research on evaluating the knowledge present in LMs, diagnostic tests that can reveal knowledge conflicts, as well as on understanding the characteristics of successfully used contextual knowledge.
[27] Automatic Cardiac Risk Management Classification using large-context Electronic Patients Health Records
Jacopo Vitale,David Della Morte,Luca Bacco,Mario Merone,Mark de Groot,Saskia Haitjema,Leandro Pecchia,Bram van Es
Main category: cs.CL
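TL;DR: This study proposes an automated classification framework based on unstructured electronic health records (EHRs) for geriatric cardiovascular risk assessment; a comparison of several model types finds that a custom Transformer architecture performs best on F1 and MCC.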
Details
Motivation: To overcome the limitations of manual administrative coding in geriatric cardiovascular risk management and make clinical risk stratification more automated and accurate. Method: On longitudinal Dutch clinical text from 3,482 patients, compares classical machine learning, deep learning architectures optimized for long context (a custom Transformer), and zero-shot general-purpose large language models (LLMs), and evaluates a late fusion strategy that adds structured medication and anthropometric data. Result: The custom Transformer architecture outperforms both traditional methods and generative LLMs on F1-score and Matthews Correlation Coefficient; its hierarchical attention mechanism is critical for capturing long-range dependencies in medical text. Conclusion: A specialized Transformer model can serve as a robust automated alternative to manual coding workflows and substantially improves geriatric cardiovascular risk stratification. Abstract: To overcome the limitations of manual administrative coding in geriatric Cardiovascular Risk Management, this study introduces an automated classification framework leveraging unstructured Electronic Health Records (EHRs). Using a dataset of 3,482 patients, we benchmarked three distinct modeling paradigms on longitudinal Dutch clinical narratives: classical machine learning baselines, specialized deep learning architectures optimized for large-context sequences, and general-purpose generative Large Language Models (LLMs) in a zero-shot setting. Additionally, we evaluated a late fusion strategy to integrate unstructured text with structured medication embeddings and anthropometric data. Our analysis reveals that the custom Transformer architecture outperforms both traditional methods and generative LLMs, achieving the highest F1-scores and Matthews Correlation Coefficients. These findings underscore the critical role of specialized hierarchical attention mechanisms in capturing long-range dependencies within medical texts, presenting a robust, automated alternative to manual workflows for clinical risk stratification.
[28] Fusing Semantic, Lexical, and Domain Perspectives for Recipe Similarity Estimation
Denica Kjorvezir,Danilo Najkov,Eva Valencič,Erika Jesenko,Barbara Koroušić Seljak,Tome Eftimov,Riste Stojanov
Main category: cs.CL
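TL;DR: This paper proposes a recipe similarity assessment method that fuses semantic, lexical, and domain information, and uses expert validation and analysis to determine how strongly each similarity dimension influences expert judgment, in support of personalized dietary recommendation and automated recipe generation.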
Details
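Motivation: To make recipe similarity assessment more accurate and practical in support of food-industry applications such as personalized nutrition recommendation and automated recipe generation. Method: Models similarity along multiple dimensions by fusing semantic, lexical, and nutritional information (drawn from ingredients, preparation methods, and nutritional attributes), and develops a web interface for domain experts to validate the results. Result: Across 318 evaluated recipe pairs, experts agreed on 255 (80%); further analysis shows the relative influence of lexical, semantic, and nutritional similarity on expert decisions. Conclusion: Similarity assessment based on multi-source information fusion earns a high level of expert agreement, and the contribution of each dimension to the judgment can be quantified, giving intelligent recipe systems an interpretable and extensible technical basis. Abstract: This research focuses on developing advanced methods for assessing similarity between recipes by combining different sources of information and analytical approaches. We explore the semantic, lexical, and domain similarity of food recipes, evaluated through the analysis of ingredients, preparation methods, and nutritional attributes. A web-based interface was developed to allow domain experts to validate the combined similarity results. After evaluating 318 recipe pairs, experts agreed on 255 (80%). The evaluation of expert assessments enables the estimation of which similarity aspects--lexical, semantic, or nutritional--are most influential in expert decision-making. The application of these methods has broad implications in the food industry and supports the development of personalized diets, nutrition recommendations, and automated recipe generation systems.
[29] ESAinsTOD: A Unified End-to-End Schema-Aware Instruction-Tuning Framework for Task-Oriented Dialog Modeling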
Dechuan Teng,Chunlin Lu,Libo Qin,Wanxiang Che
Main category: cs.CL
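TL;DR: This paper proposes ESAinsTOD, a unified end-to-end, schema-aware, instruction-tuning framework for general task-oriented dialog modeling, which uses instruction alignment and schema alignment to improve generalization, robustness, and zero-shot performance across multiple datasets.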
Details
Motivation: Existing end-to-end task-oriented dialog systems are usually tailored to specific datasets, transfer poorly to new dialog scenarios, and lack generality and adaptability. Method: Proposes the ESAinsTOD framework: full-parameter fine-tuning of a large language model, with instruction alignment (so the system follows diverse task instructions) and schema alignment (so predictions conform to the given schema), plus session-level end-to-end modeling that makes use of earlier task results in the dialogue history. Result: Significantly outperforms the state of the art on CamRest676, In-Car, and MultiWOZ; generalizes better in low-resource and especially zero-shot settings; and is more robust to data noise and cascading errors. Conclusion: A structured dual alignment of instructions and schema, combined with session-level modeling, effectively improves the generality, adaptability, and reliability of LLMs in task-oriented dialog, offering a new paradigm for building transferable end-to-end TOD systems. Abstract: Existing end-to-end modeling methods for modular task-oriented dialog systems are typically tailored to specific datasets, making it challenging to adapt to new dialog scenarios. In this work, we propose ESAinsTOD, a unified End-to-end Schema-Aware Instruction-tuning framework for general Task-Oriented Dialog modeling. This framework introduces a structured methodology to go beyond simply fine-tuning Large Language Models (LLMs), enabling flexible adaptation to various dialogue task flows and schemas. Specifically, we leverage full-parameter fine-tuning of LLMs and introduce two alignment mechanisms to make the resulting system both instruction-aware and schema-aware: (i) instruction alignment, which ensures that the system faithfully follows task instructions to complete various task flows from heterogeneous TOD datasets; and (ii) schema alignment, which encourages the system to make predictions adhering to the specified schema. In addition, we employ session-level end-to-end modeling, which allows the system to access the results of previously executed task flows within the dialogue history, to bridge the gap between the instruction-tuning paradigm and the real-world application of TOD systems. Empirical results show that while a fine-tuned LLM serves as a strong baseline, our structured approach provides significant additional benefits.
In particular, our findings indicate that: (i) ESAinsTOD outperforms state-of-the-art models by a significant margin on end-to-end task-oriented dialog modeling benchmarks: CamRest676, In-Car and MultiWOZ; (ii) more importantly, it exhibits superior generalization capabilities across various low-resource settings, with the proposed alignment mechanisms significantly enhancing zero-shot performance; and (iii) our instruction-tuning paradigm substantially improves the model's robustness against data noise and cascading errors.
[30] Evaluation of LLMs in retrieving food and nutritional context for RAG systems
Maks Požarnik Vavken,Matevž Ogrinc,Tome Eftimov,Barbara Koroušić Seljak
Main category: cs.CL
TL;DR: This paper evaluates four LLMs for metadata filtering in a food composition RAG system, showing high accuracy for simple-to-moderate queries but limitations on complex, non-expressible constraints.
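To make "metadata filtering" concrete, the sketch below shows what a structured filter for a query such as "vegan foods with at least 20 g of protein" might look like, together with a small pure-Python matcher. The filter uses the operator style of Chroma `where` clauses (`$and`, `$or`, `$eq`, `$gte`, and so on), but the matcher is a local stand-in and does not call Chroma; the field names (`diet`, `protein_g`) and the sample records are hypothetical.

```python
OPS = {
    "$eq": lambda a, b: a == b,
    "$ne": lambda a, b: a != b,
    "$gt": lambda a, b: a > b,
    "$gte": lambda a, b: a >= b,
    "$lt": lambda a, b: a < b,
    "$lte": lambda a, b: a <= b,
}


def matches(meta: dict, where: dict) -> bool:
    """Evaluate a small subset of where-style filters against one metadata record."""
    if "$and" in where:
        return all(matches(meta, clause) for clause in where["$and"])
    if "$or" in where:
        return any(matches(meta, clause) for clause in where["$or"])
    for field, cond in where.items():
        if field not in meta:
            return False
        if not isinstance(cond, dict):
            cond = {"$eq": cond}  # bare value means equality
        for op, value in cond.items():
            if not OPS[op](meta[field], value):
                return False
    return True


# Hypothetical filter an LLM might emit for the example query above.
where = {"$and": [{"diet": {"$eq": "vegan"}}, {"protein_g": {"$gte": 20}}]}

foods = [
    {"name": "tofu, firm", "diet": "vegan", "protein_g": 8},
    {"name": "seitan", "diet": "vegan", "protein_g": 25},
    {"name": "chicken breast", "diet": "omnivore", "protein_g": 31},
]
hits = [f["name"] for f in foods if matches(f, where)]
```

A request like "foods that taste similar to seitan" has no field or operator to map onto in this format, which is the kind of non-expressible constraint the evaluation reports as difficult.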
Details
Motivation: To reduce manual effort and technical expertise required for domain experts (e.g., food compilers, nutritionists) to access complex food and nutrition data via RAG. Method: LLMs are used to translate natural language queries into structured metadata filters for retrieval from a Chroma vector database backed by a comprehensive food composition database. Result: LLMs achieve high retrieval accuracy on easy and moderately complex queries, but struggle with difficult queries involving non-expressible constraints due to metadata format limitations. Conclusion: LLM-driven metadata filtering is effective when constraints are explicitly expressible in the metadata schema, but fails when query semantics exceed that representational scope. Abstract: In this article, we evaluate four Large Language Models (LLMs) and their effectiveness at retrieving data within a specialized Retrieval-Augmented Generation (RAG) system, using a comprehensive food composition database. Our method is focused on the LLMs' ability to translate natural language queries into structured metadata filters, enabling efficient retrieval via a Chroma vector database. By achieving high accuracy in this critical retrieval step, we demonstrate that LLMs can serve as an accessible, high-performance tool, drastically reducing the manual effort and technical expertise previously required for domain experts, such as food compilers and nutritionists, to leverage complex food and nutrition data. However, despite the high performance on easy and moderately complex queries, our analysis of difficult questions reveals that reliable retrieval remains challenging when queries involve non-expressible constraints. These findings demonstrate that LLM-driven metadata filtering excels when constraints can be explicitly expressed, but struggles when queries exceed the representational scope of the metadata format.
[31] RbtAct: Rebuttal as Supervision for Actionable Review Feedback Generation
Sihong Wu,Yiling Ma,Yilun Zhao,Tiansheng Hu,Owen Jiang,Manasi Patwardhan,Arman Cohan
Main category: cs.CL
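TL;DR: This paper proposes RbtAct, which uses author rebuttals as an implicit supervision signal to train large language models to generate more actionable and specific peer-review feedback, and introduces the RMR-75K dataset along with a new task, perspective-conditioned segment-level feedback generation.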
Details
Motivation: AI-generated reviews are often superficial and not actionable, leaving authors without concrete, feasible guidance for revision, so feedback needs to become more action-oriented. Method: Proposes the RbtAct framework: author rebuttals serve as implicit supervision for optimizing feedback generation; a perspective-conditioned segment-level review feedback generation task is defined; the RMR-75K dataset of 75K samples is built (with perspective labels and impact categories ordering author uptake); and Llama-3.1-8B-Instruct is trained with supervised fine-tuning followed by preference optimization on rebuttal-derived pairs. Result: In evaluations by human experts and LLM-as-a-judge, RbtAct significantly outperforms strong baselines on actionability and specificity while maintaining grounding and relevance. Conclusion: Using rebuttal information to model what authors actually act on is an effective route to more actionable AI review feedback; perspective conditioning and fine-grained segment alignment make the feedback more targeted and useful. Abstract: Large language models (LLMs) are increasingly used across the scientific workflow, including to draft peer-review reports. However, many AI-generated reviews are superficial and insufficiently actionable, leaving authors without concrete, implementable guidance and motivating the gap this work addresses. We propose RbtAct, which targets actionable review feedback generation and places existing peer review rebuttal at the center of learning. Rebuttals show which reviewer comments led to concrete revisions or specific plans, and which were only defended. Building on this insight, we leverage rebuttal as implicit supervision to directly optimize a feedback generator for actionability. To support this objective, we propose a new task called perspective-conditioned segment-level review feedback generation, in which the model is required to produce a single focused comment based on the complete paper and a specified perspective such as experiments and writing. We also build a large dataset named RMR-75K that maps review segments to the rebuttal segments that address them, with perspective labels and impact categories that order author uptake. We then train the Llama-3.1-8B-Instruct model with supervised fine-tuning on review segments followed by preference optimization using rebuttal derived pairs. Experiments with human experts and LLM-as-a-judge show consistent gains in actionability and specificity over strong baselines while maintaining grounding and relevance.
[32] Beyond Fine-Tuning: Robust Food Entity Linking under Ontology Drift with FoodOntoRAG
Jan Drole,Ana Gjorgjevikj,Barbara Koroušić Seljak,Tome Eftimov
Main category: cs.CL
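TL;DR: This paper proposes FoodOntoRAG, a few-shot named entity linking (NEL) method that requires no fine-tuning and is agnostic to both model and ontology, for standardizing terms from food labels and menus into ontology concepts; modules for hybrid retrieval, selection, confidence calibration, and synonym generation yield linking that is accurate, interpretable, and adaptable to ontology evolution.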
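The control flow of a retrieve, select, score, and reformulate loop of this kind can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: `link_term`, the 0.7 threshold, the round limit, the toy ontology with made-up `CONCEPT:` identifiers, and all four stub components are hypothetical, and the exact-match lookup stands in for the hybrid lexical-semantic retriever and the LLM agents.

```python
def link_term(term, retrieve, select, score, reformulate, threshold=0.7, max_rounds=3):
    """Try to link a term to a concept; reformulate and retry while confidence is low."""
    tried = [term]
    query = term
    for _ in range(max_rounds):
        candidates = retrieve(query)
        if candidates:
            choice, rationale = select(term, candidates)
            confidence = score(term, choice, rationale)
            if confidence >= threshold:
                return {"concept": choice, "confidence": confidence,
                        "rationale": rationale, "queries": tried}
        query = reformulate(term, tried)
        if query is None or query in tried:
            break
        tried.append(query)
    return {"concept": None, "confidence": 0.0, "rationale": None, "queries": tried}


# Toy stand-ins (all hypothetical) so the loop can be exercised end to end.
ONTOLOGY = {
    "CONCEPT:0001": {"label": "eggplant", "synonyms": ["aubergine"]},
    "CONCEPT:0002": {"label": "egg", "synonyms": []},
}
REWRITES = {"brinjal": ["eggplant"]}


def retrieve(query):
    q = query.lower()
    return [cid for cid, entry in ONTOLOGY.items()
            if q == entry["label"] or q in entry["synonyms"]]


def select(term, candidates):
    return candidates[0], "exact label or synonym match"


def score(term, choice, rationale):
    return 0.9


def reformulate(term, tried):
    for option in REWRITES.get(term, []):
        if option not in tried:
            return option
    return None


linked = link_term("brinjal", retrieve, select, score, reformulate)
unlinked = link_term("xyzzy", retrieve, select, score, reformulate)
```

Keeping the list of attempted queries and the rationale in the result is one simple way to make each decision traceable, in the spirit of the interpretable, grounded justifications the entry describes.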
Details
Motivation: Existing food-domain named entity linking methods based on fine-tuned large language models are computationally expensive, tied to a particular ontology version, and poor at handling ontology drift, so a more flexible, robust, and interpretable alternative is needed. Method: Proposes the FoodOntoRAG pipeline: (1) a hybrid lexical-semantic retriever generates candidate entities; (2) a selector agent picks the best match from structured evidence (labels, synonyms, definitions, relations) and gives a rationale; (3) a scorer agent calibrates confidence; (4) when confidence is insufficient, a synonym generator agent proposes reformulations of the term and restarts the loop. No fine-tuning is involved at any stage. Result: FoodOntoRAG approaches state-of-the-art accuracy while exposing gaps and inconsistencies in existing annotations, and shows strong robustness to ontology evolution along with interpretable decisions. Conclusion: FoodOntoRAG offers an efficient, general, and maintainable paradigm for standardizing food terms that overcomes the inherent limitations of fine-tuning and lays a foundation for trustworthy dietary assessment and food safety reporting. Abstract: Standardizing food terms from product labels and menus into ontology concepts is a prerequisite for trustworthy dietary assessment and safety reporting. The dominant approach to Named Entity Linking (NEL) in the food and nutrition domains fine-tunes Large Language Models (LLMs) on task-specific corpora. Although effective, fine-tuning incurs substantial computational cost, ties models to a particular ontology snapshot (i.e., version), and degrades under ontology drift. This paper presents FoodOntoRAG, a model- and ontology-agnostic pipeline that performs few-shot NEL by retrieving candidate entities from domain ontologies and conditioning an LLM on structured evidence (food labels, synonyms, definitions, and relations). A hybrid lexical--semantic retriever enumerates candidates; a selector agent chooses a best match with rationale; a separate scorer agent calibrates confidence; and, when confidence falls below a threshold, a synonym generator agent proposes reformulations to re-enter the loop. The pipeline approaches state-of-the-art accuracy while revealing gaps and inconsistencies in existing annotations. The design avoids fine-tuning, improves robustness to ontology evolution, and yields interpretable decisions through grounded justifications.
[33] EPIC-EuroParl-UdS: Information-Theoretic Perspectives on Translation and Interpreting
Maria Kunilovskaya,Christina Pollkläsener
Main category: cs.CL
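TL;DR: This paper introduces an updated and combined version of the bidirectional English-German EPIC-UdS (spoken) and EuroParl-UdS (written) corpora, with corrected metadata and text errors, improved linguistic annotation, and new layers including word alignment and word-level surprisal. The resource supports information-theoretic research on language variation, including comparison of written and spoken modes, analysis of speech disfluencies, and translationese studies; a filler particle prediction task validates the integrity of the rebuilt spoken data and evaluates several probabilistic models.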
Details
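Motivation: To correct metadata and text errors found in earlier versions, improve corpus quality, and extend the annotation layers to support information-theoretic research on language variation, mode differences, speech disfluencies, and translationese. Method: Systematically updates and combines the EPIC-UdS (spoken) and EuroParl-UdS (written) corpora, including error correction, content refinement, and updated linguistic annotation, and adds word alignment and word-level surprisal indices; runs an empirical study of filler particle prediction that evaluates probabilistic measures from base and fine-tuned GPT-2 models and machine translation models. Result: Produces a high-quality, multi-layer English-German parallel corpus covering both spoken and written modes; the empirical study indicates that the rebuilt spoken data is intact, and the fine-tuned GPT-2 model outperforms the base model and the machine translation models on filler particle prediction. Conclusion: The updated corpus is a reliable resource for written-spoken comparison, translation studies, and information-theoretic language modeling; the empirical analysis with modern language models further broadens its potential use in computational linguistics and interpreting research. Abstract: This paper introduces an updated and combined version of the bidirectional English-German EPIC-UdS (spoken) and EuroParl-UdS (written) corpora containing original European Parliament speeches as well as their translations and interpretations. The new version corrects metadata and text errors identified through previous use, refines the content, updates linguistic annotations, and adds new layers, including word alignment and word-level surprisal indices. The combined resource is designed to support research using information-theoretic approaches to language variation, particularly studies comparing written and spoken modes, and examining disfluencies in speech, as well as traditional translationese studies, including parallel (source vs. target) and comparable (original vs. translated) analyses. The paper outlines the updates introduced in this release, summarises previous results based on the corpus, and presents a new illustrative study. The study validates the integrity of the rebuilt spoken data and evaluates probabilistic measures derived from base and fine-tuned GPT-2 and machine translation models on the task of filler particle prediction in interpreting.
[34] One-Eval: An Agentic System for Automated and Traceable LLM Evaluation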
Chengyu Shen,Yanheng Hou,Minghui Pan,Runming He,Zhen Hao Wong,Meiyi Qiang,Zhou Liu,Hao Liang,Peichao Lai,Zeang Sheng,Wentao Zhang
Main category: cs.CL
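TL;DR: One-Eval is an agentic system for large language model evaluation that automatically turns natural-language evaluation requests into executable, traceable, and customizable evaluation workflows, substantially reducing manual evaluation effort and improving reproducibility.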
Details
Motivation: Evaluating large language models is currently laborious: practitioners must choose benchmarks by hand, reproduce heterogeneous codebases, map dataset schemas, and interpret metrics, which is inefficient and hard to reproduce. Method: Proposes the One-Eval system, with three components: NL2Bench (intent structuring and personalized benchmark planning), BenchResolve (benchmark resolution, dataset acquisition, and schema normalization), and Metrics & Reporting (task-aware metric selection and decision-oriented reporting), plus human-in-the-loop checkpoints and sample evidence trails. Result: One-Eval executes evaluations end to end from diverse natural-language requests with far less user intervention, supporting more efficient and reproducible evaluation in industrial settings. Conclusion: One-Eval offers an automated, auditable, and customizable paradigm for LLM evaluation, moving evaluation practice toward greater reliability and ease of use. Abstract: Reliable evaluation is essential for developing and deploying large language models, yet in practice it often requires substantial manual effort: practitioners must identify appropriate benchmarks, reproduce heterogeneous evaluation codebases, configure dataset schema mappings, and interpret aggregated metrics. To address these challenges, we present One-Eval, an agentic evaluation system that converts natural-language evaluation requests into executable, traceable, and customizable evaluation workflows. One-Eval integrates (i) NL2Bench for intent structuring and personalized benchmark planning, (ii) BenchResolve for benchmark resolution, automatic dataset acquisition, and schema normalization to ensure executability, and (iii) Metrics & Reporting for task-aware metric selection and decision-oriented reporting beyond scalar scores. The system further incorporates human-in-the-loop checkpoints for review, editing, and rollback, while preserving sample evidence trails for debugging and auditability. Experiments show that One-Eval can execute end-to-end evaluations from diverse natural-language requests with minimal user effort, supporting more efficient and reproducible evaluation in industrial settings. Our framework is publicly available at https://github.com/OpenDCAI/One-Eval.
[35] Chow-Liu Ordering for Long-Context Reasoning in Chain-of-Agents
Naman Gupta,Vaibhav Singh,Arun Iyer,Kirankumar Shiragur,Pratham Grover,Ramakrishna B. Bairi,Ritabrata Maiti,Sankarshan Damle,Shachee Mishra Gupta,Rishikesh Maurya,Vageesh D. C
Main category: cs.CL
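TL;DR: This paper studies chunk ordering in long-context reasoning. It uses Chow-Liu trees to learn the dependency structure among chunks and a breadth-first traversal of the tree to produce a better processing order, reducing information loss and improving answer relevance and accuracy.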
Details
Motivation: Sequential multi-agent reasoning frameworks such as Chain-of-Agents (CoA) are limited by the capacity of their shared memory, which creates an information bottleneck on long contexts and makes performance highly dependent on the order in which input chunks are processed. This calls for a systematic study of how to optimize chunk ordering to reduce information loss.
Method: Chow-Liu trees are used to learn the dependencies among chunks from data, giving a tree-structured approximation of their joint distribution; a breadth-first traversal of the tree then yields the chunk processing order.
Result: On three long-context benchmarks, the method significantly outperforms both the default document-chunk order and semantic score-based ordering in answer relevance and exact-match accuracy.
Conclusion: Modeling structured dependencies among chunks (for example, with Chow-Liu trees) can effectively guide the ordering strategy and improve the overall performance of bounded-memory multi-agent reasoning systems.
Abstract: Sequential multi-agent reasoning frameworks such as Chain-of-Agents (CoA) handle long-context queries by decomposing inputs into chunks and processing them sequentially using LLM-based worker agents that read from and update a bounded shared memory. From a probabilistic perspective, CoA aims to approximate the conditional distribution corresponding to a model capable of jointly reasoning over the entire long context. CoA achieves this through a latent-state factorization in which only bounded summaries of previously processed evidence are passed between agents. The resulting bounded-memory approximation introduces a lossy information bottleneck, making the final evidence state inherently dependent on the order in which chunks are processed. In this work, we study the problem of chunk ordering for long-context reasoning. We use the well-known Chow-Liu trees to learn a dependency structure that prioritizes strongly related chunks. Empirically, we show that a breadth-first traversal of the resulting tree yields chunk orderings that reduce information loss across agents and consistently outperform both default document-chunk ordering and semantic score-based ordering in answer relevance and exact-match accuracy across three long-context benchmarks.
[36] N-gram-like Language Models Predict Reading Time Best
James A. Michaelov,Roger P. Levy
Main category: cs.CL
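TL;DR: This paper examines why modern language models such as transformers predict reading time poorly. It proposes that reading time depends more on simple n-gram statistics than on the higher-order statistics learned by complex models, and shows experimentally that the language models whose predictions track n-gram probability most closely also correlate best with eye-tracking measures of reading time.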
Details
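Motivation: Contemporary language models such as transformers excel at word prediction, yet their predicted probabilities correlate less well with actual human reading times. The paper aims to explain this counterintuitive finding.
Method: The hypothesis is tested by comparing how strongly the probabilities from different neural language models correlate with n-gram probability, and how strongly those probabilities correlate with eye-tracking-based measures of reading time on naturalistic text.
Result: The language models whose predicted probabilities correlate more strongly with n-gram probability also produce probabilities that correlate more strongly with reading-time measures such as eye-tracking data.
Conclusion: Reading time is driven mainly by simple n-gram statistics rather than by the complex statistical patterns that language models learn, which explains why higher-performing language models can be worse at predicting reading time.
Abstract: Recent work has found that contemporary language models such as transformers can become so good at next-word prediction that the probabilities they calculate become worse for predicting reading time. In this paper, we propose that this can be explained by reading time being sensitive to simple n-gram statistics rather than the more complex statistics learned by state-of-the-art transformer language models. We demonstrate that the neural language models whose predictions are most correlated with n-gram probability are also those that calculate probabilities that are the most correlated with eye-tracking-based metrics of reading time on naturalistic text.
[37] Do What I Say: A Spoken Prompt Dataset for Instruction-Following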
Maike Züfle,Sara Papi,Fabian Retkowski,Szymon Mazurek,Marek Kasztelnik,Alexander Waibel,Luisa Bentivogli,Jan Niehues
Main category: cs.CL
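TL;DR: This paper introduces DoWhatISay (DOWIS), a multilingual dataset of spoken and written prompts for evaluating Speech Large Language Models (SLLMs) under realistic spoken-instruction conditions. Spoken prompts perform worse than text prompts overall, and the gap narrows only for tasks with speech output.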
Details
Motivation: Existing SLLM evaluations rely mostly on text prompts, which do not reflect real spoken interaction, and there is no standardized benchmark for spoken instructions.
Method: The authors build DOWIS, a multilingual, multi-task, multi-style dataset of paired spoken and written prompts (9 tasks, 11 languages, 10 prompt variants per task-language pair, 5 styles), and use it to benchmark state-of-the-art SLLMs across modalities, languages, and tasks.
Result: Text prompts significantly outperform spoken prompts on the vast majority of tasks, especially in low-resource and cross-lingual settings; spoken prompts approach text-prompt performance only when the task output is speech.
Conclusion: SLLM evaluation needs to include spoken prompting, with particular attention to tasks in which speech input is matched with speech output, to move toward evaluation standards that better reflect real use.
Abstract: Speech Large Language Models (SLLMs) have rapidly expanded, supporting a wide range of tasks. These models are typically evaluated using text prompts, which may not reflect real-world scenarios where users interact with speech. To address this gap, we introduce DoWhatISay (DOWIS), a multilingual dataset of human-recorded spoken and written prompts designed to pair with any existing benchmark for realistic evaluation of SLLMs under spoken instruction conditions. Spanning 9 tasks and 11 languages, it provides 10 prompt variants per task-language pair, across five styles. Using DOWIS, we benchmark state-of-the-art SLLMs, analyzing the interplay between prompt modality, style, language, and task type. Results show that text prompts consistently outperform spoken prompts, particularly for low-resource and cross-lingual settings. Only for tasks with speech output do spoken prompts close the gap, highlighting the need for speech-based prompting in SLLM evaluation.
[38] Benchmarking Political Persuasion Risks Across Frontier Large Language Models
Zhongren Chen,Joshua Kalla,Quan Le
Main category: cs.CL
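TL;DR: In two large-scale survey experiments (N=19,145), this study evaluates how persuasive seven frontier large language models (from Anthropic, OpenAI, Google, and xAI) are at shifting political views. The models are generally more persuasive than standard campaign advertisements, with marked differences across models (Claude highest, Grok lowest), and the effect of information-based prompts varies by model. The study also proposes a data-driven, strategy-agnostic conversation analysis approach for identifying persuasive strategies, providing a benchmark and a cross-model framework for assessing the persuasion risks of frontier models.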
Details
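Motivation: Although prior research held that LLMs are no more persuasive than standard political campaign practices, the rapid progress of frontier models raises new concerns about their political influence and calls for a systematic reassessment.
Method: Two large-scale online survey experiments covering bipartisan issues and stances (total N=19,145) compare the persuasive effect of seven frontier LLMs (Claude, GPT, Grok, and others) with standard campaign advertisements, using both information-based and non-information-based prompts. A data-driven, strategy-agnostic, LLM-assisted conversation analysis approach identifies and quantifies the underlying persuasive strategies.
Result: LLMs are overall significantly more persuasive than standard campaign advertisements. Claude models are the most persuasive and Grok the least, and the results are robust across issues and stances. Information-based prompts help Claude and Grok but substantially reduce the persuasiveness of GPT. The proposed analysis approach can break down the persuasion mechanisms of different models.
Conclusion: Frontier LLMs already persuade more effectively than traditional political messaging, with marked differences across models and prompt effects that depend strongly on the model. A reproducible, cross-model risk assessment framework is needed to address the growing risk of AI political influence.
Abstract: Concerns persist regarding the capacity of Large Language Models (LLMs) to sway political views. Although prior research has claimed that LLMs are not more persuasive than standard political campaign practices, the recent rise of frontier models warrants further study. In two survey experiments (N=19,145) across bipartisan issues and stances, we evaluate seven state-of-the-art LLMs developed by Anthropic, OpenAI, Google, and xAI. We find that LLMs outperform standard campaign advertisements, with heterogeneity in performance across models. Specifically, Claude models exhibit the highest persuasiveness, while Grok exhibits the lowest. The results are robust across issues and stances. Moreover, in contrast to the findings in Hackenburg et al. (2025b) and Lin et al. (2025) that information-based prompts boost persuasiveness, we find that the effectiveness of information-based prompts is model-dependent: they increase the persuasiveness of Claude and Grok while substantially reducing that of GPT. We introduce a data-driven and strategy-agnostic LLM-assisted conversation analysis approach to identify and assess underlying persuasive strategies. Our work benchmarks the persuasive risks of frontier models and provides a framework for cross-model comparative risk assessment.
[39] Thinking to Recall: How Reasoning Unlocks Parametric Knowledge in LLMs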
Zorik Gekhman,Roee Aharoni,Eran Ofek,Mor Geva,Roi Reichart,Jonathan Herzig
Main category: cs.CL
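TL;DR: This paper examines how reasoning improves the recall of parametric knowledge in large language models (LLMs) on simple, single-hop factual questions, and identifies two key mechanisms: a computational buffer effect and factual priming. It also shows that facts hallucinated during reasoning raise the risk of a hallucinated final answer, and proposes improving accuracy by favoring hallucination-free reasoning trajectories.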
Details
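Motivation: The role of reasoning in math, code generation, and multi-hop factual questions is well established, but its effect on simple single-hop factual questions, which need no logical decomposition, is unclear. Experiments nevertheless show that reasoning substantially expands the capability boundary of the model's parametric knowledge recall, a counterintuitive result whose mechanism needs explaining.
Method: A series of hypothesis-driven controlled experiments analyzes how reasoning affects single-hop factual question answering, identifies and verifies the computational buffer effect and factual priming, and measures how hallucinated facts affect final-answer accuracy.
Result: Reasoning improves knowledge recall through a computational buffer effect (reasoning tokens are used for latent computation independent of their semantic content) and through factual priming (generated related facts act as a semantic bridge). Hallucinated intermediate facts raise the probability of a hallucinated final answer, and prioritizing hallucination-free reasoning trajectories improves model accuracy.
Conclusion: On single-hop factual questions, reasoning does not rely only on logical steps; it strengthens knowledge retrieval through latent computation and semantic guidance. The accompanying hallucination risk deserves attention and can be mitigated, and exploited, by selecting high-quality reasoning trajectories.
Abstract: While reasoning in LLMs plays a natural role in math, code generation, and multi-hop factual questions, its effect on simple, single-hop factual questions remains unclear. Such questions do not require step-by-step logical decomposition, making the utility of reasoning highly counterintuitive. Nevertheless, we find that enabling reasoning substantially expands the capability boundary of the model's parametric knowledge recall, unlocking correct answers that are otherwise effectively unreachable. Why does reasoning aid parametric knowledge recall when there are no complex reasoning steps to be done? To answer this, we design a series of hypothesis-driven controlled experiments, and identify two key driving mechanisms: (1) a computational buffer effect, where the model uses the generated reasoning tokens to perform latent computation independent of their semantic content; and (2) factual priming, where generating topically related facts acts as a semantic bridge that facilitates correct answer retrieval. Importantly, this latter generative self-retrieval mechanism carries inherent risks: we demonstrate that hallucinating intermediate facts during reasoning increases the likelihood of hallucinations in the final answer. Finally, we show that our insights can be harnessed to directly improve model accuracy by prioritizing reasoning trajectories that contain hallucination-free factual statements.
[40] Model Merging in the Era of Large Language Models: Methods, Applications, and Future Directions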
Mingyang Song,Mao Zheng
Main category: cs.CL
TL;DR: This survey introduces the FUSE taxonomy to systematically review model merging techniques for large language models, covering theoretical foundations, unification strategies, application scenarios, and ecosystem support.
Details
Motivation: With the rapid proliferation of fine-tuned LLMs, model merging offers a computationally efficient alternative to ensembles and full retraining, enabling composition of specialized capabilities at minimal cost.
Method: The paper proposes the FUSE taxonomy, a four-dimensional framework covering Foundations, Unification Strategies, Scenarios, and Ecosystem, and uses it to structure a comprehensive review of theoretical underpinnings, algorithmic methods (e.g., weight averaging, task vectors, MoE), applications, and tooling.
Result: A structured, taxonomy-driven survey that organizes and analyzes the state of the art in LLM model merging, including key methods, use cases, open-source tools, benchmarks, and identified challenges.
Conclusion: Model merging is a promising paradigm for LLM composition; this survey provides a unified conceptual framework and identifies critical gaps, such as theoretical understanding, scalability, and standardization, that must be addressed for broader adoption and advancement.
Abstract: Model merging has emerged as a transformative paradigm for combining the capabilities of multiple neural networks into a single unified model without additional training. With the rapid proliferation of fine-tuned large language models (LLMs), merging techniques offer a computationally efficient alternative to ensembles and full retraining, enabling practitioners to compose specialized capabilities at minimal cost. This survey presents a comprehensive and structured examination of model merging in the LLM era through the FUSE taxonomy, a four-dimensional framework organized along Foundations, Unification Strategies, Scenarios, and Ecosystem. We first establish the theoretical underpinnings of merging, including loss landscape geometry, mode connectivity, and the linear mode connectivity hypothesis.
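As a rough illustration of two method families named in this entry, weight averaging and task vectors, the following is a minimal sketch and not the survey's own code. It assumes each model is a plain dictionary of parameter lists with identical keys and shapes; the names `average_weights`, `task_vector`, `merge_task_vectors`, and the `scale` parameter are hypothetical and chosen only for this example.

```python
from typing import Dict, List

# Assumption: a "model" is a dict mapping parameter names to flat lists of floats.
Weights = Dict[str, List[float]]


def average_weights(models: List[Weights]) -> Weights:
    """Uniform weight averaging: element-wise mean of every parameter."""
    n = len(models)
    return {
        name: [sum(m[name][i] for m in models) / n for i in range(len(values))]
        for name, values in models[0].items()
    }


def task_vector(base: Weights, finetuned: Weights) -> Weights:
    """Task vector: fine-tuned weights minus the shared base weights."""
    return {
        name: [f - b for f, b in zip(finetuned[name], base[name])]
        for name in base
    }


def merge_task_vectors(
    base: Weights, finetuned_models: List[Weights], scale: float = 1.0
) -> Weights:
    """Task arithmetic: add the scaled sum of task vectors back onto the base."""
    merged = {name: list(values) for name, values in base.items()}
    for finetuned in finetuned_models:
        delta = task_vector(base, finetuned)
        for name in merged:
            merged[name] = [m + scale * d for m, d in zip(merged[name], delta[name])]
    return merged


# Toy example with one two-element parameter.
base = {"w": [1.0, 2.0]}
model_a = {"w": [2.0, 2.0]}  # fine-tuning changed the first element
model_b = {"w": [1.0, 4.0]}  # fine-tuning changed the second element

averaged = average_weights([model_a, model_b])
merged = merge_task_vectors(base, [model_a, model_b], scale=1.0)
print(averaged)  # {'w': [1.5, 3.0]}
print(merged)    # {'w': [2.0, 4.0]}
```

In this toy case the two fine-tuned models change different elements, so task arithmetic keeps both changes in full while plain averaging halves each of them. Real merging methods operate on tensors and typically tune the scaling coefficient on held-out data.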
We then systematically review the algorithmic landscape, spanning weight averaging, task vector arithmetic, sparsification-enhanced methods, mixture-of-experts architectures, and evolutionary optimization approaches. For each method family, we analyze the core formulation, highlight representative works, and discuss practical trade-offs. We further examine downstream applications across multi-task learning, safety alignment, domain specialization, multilingual transfer, and federated learning. Finally, we survey the supporting ecosystem of open-source tools, community platforms, and evaluation benchmarks, and identify key open challenges including theoretical gaps, scalability barriers, and standardization needs. This survey aims to equip researchers and practitioners with a structured foundation for advancing model merging.
[41] CREATE: Testing LLMs for Associative Creativity
Manya Wadhwa,Tiasa Singha Roy,Harvey Lederman,Junyi Jessy Li,Greg Durrett
Main category: cs.CL
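TL;DR: This paper introduces CREATE, a benchmark for evaluating creative associative reasoning in large models. Models must generate highly specific and highly diverse paths connecting concepts, and their creative utility is measured with objective scoring.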
Details
Motivation: Associative reasoning, the ability to draw novel yet meaningful connections between concepts, is at the core of creativity. Existing benchmarks struggle to assess this ability objectively, so a new benchmark is needed that mirrors real creative tasks (such as hypothesis generation) while supporting large-scale, objective evaluation.
Method: The CREATE benchmark requires models to use their parametric knowledge to generate multiple paths connecting given concepts. Paths should have high specificity (distinctive, close connections) and high diversity (large differences between paths), and an objective scoring scheme based on automatic metrics combines path quality and quantity.
Result: Frontier models differ clearly on CREATE, with the strongest models achieving higher creative utility. The very large search space and the high multiplicity of answers make the task hard to saturate. Thinking models are not always better, and existing creative prompting methods bring only limited gains.
Conclusion: CREATE provides an effective, scalable, and objective benchmark and sandbox for evaluating and improving the associative creativity of large models, advancing research on creative AI.
Abstract: A key component of creativity is associative reasoning: the ability to draw novel yet meaningful connections between concepts. We introduce CREATE, a benchmark designed to evaluate models' capacity for creative associative reasoning. CREATE requires models to generate sets of paths connecting concepts in a model's parametric knowledge. Paths should have high specificity (distinctiveness and closeness of the concept connection) and high diversity (dissimilarity from other paths), and models are scored more highly if they produce a larger set of strong, diverse paths. This task shares demands of real creativity tasks like hypothesis generation, including an extremely large search space, but enables collection of a sizable benchmark with objective answer grading. Evaluation of frontier models shows that the strongest models achieve higher creative utility than others, with the high multiplicity of answers and complexity of the search making benchmark saturation difficult to achieve. Furthermore, our results illustrate that thinking models are not always more effective on our task, even with high token budgets. Recent approaches for creative prompting give some but limited additional improvement. CREATE provides a sandbox for developing new methods to improve models' capacity for associative creativity.
cs.CV [Back]
[42] Granulon: Awakening Pixel-Level Visual Encoders with Adaptive Multi-Granularity Semantics for MLLM
Junyuan Mao,Qiankun Li,Linghao Meng,Zhicheng He,Xinliang Zhou,Kun Wang,Yang Liu,Yueming Jin
Main category: cs.CV
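TL;DR: This paper proposes Granulon, a new DINOv3-based multimodal large language model. With a text-conditioned granularity controller and an adaptive token aggregation module, it performs unified pixel-to-fine-to-coarse reasoning and clearly outperforms existing methods in accuracy and hallucination reduction.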
Details
Motivation: Existing CLIP-based visual encoders emphasize global semantic alignment but lack fine-grained understanding, while DINOv3 offers strong pixel-level perception but lacks coarse-grained semantic abstraction, which limits multi-granularity reasoning.
Method: Granulon combines a text-conditioned granularity controller, which dynamically adjusts the level of visual abstraction, with an adaptive token aggregation module, which performs granularity-guided pooling and relation-aware clustering, to produce multi-granularity visual representations in a single forward pass.
Result: Under identical settings, Granulon improves accuracy by about 30% and reduces hallucination by about 20% relative to all compared visual encoders.
Conclusion: By combining pixel-level perception with semantic-level abstraction, Granulon effectively bridges the gap in multi-granularity visual understanding and offers a more robust and interpretable visual encoding paradigm for MLLMs.
Abstract: Recent advances in multimodal large language models largely rely on CLIP-based visual encoders, which emphasize global semantic alignment but struggle with fine-grained visual understanding. In contrast, DINOv3 provides strong pixel-level perception yet lacks coarse-grained semantic abstraction, leading to limited multi-granularity reasoning. To address this gap, we propose Granulon, a novel DINOv3-based MLLM with adaptive granularity augmentation. Granulon introduces a text-conditioned granularity Controller that dynamically adjusts the visual abstraction level according to the semantic scope of the textual input, and an Adaptive Token Aggregation module that performs granularity-guided pooling and relation-aware clustering to produce compact, semantically rich visual tokens. This design enables unified "pixel-to-fine-to-coarse" reasoning within a single forward pass. Extensive and interpretable experiments demonstrate that Granulon improves accuracy by ~30% and reduces hallucination by ~20%, outperforming all visual encoders under identical settings.
[43] Where, What, Why: Toward Explainable 3D-GS Watermarking
Mingshu Cai,Jiajun Li,Osamu Yoshie,Yuya Ieiri,Yixuan Li
Main category: cs.CV
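TL;DR: This paper proposes a representation-native watermarking framework for 3D Gaussian Splatting. A Trio-Experts module and an SBAG gating mechanism decouple watermark carrier selection from quality preservation, and a channel-wise group mask controls gradient propagation, improving watermark robustness and explainability while keeping fidelity high.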
Details
Motivation: As 3D Gaussian Splatting becomes the de facto representation for interactive 3D assets, robust and imperceptible watermarking is urgently needed to protect copyright and provenance.
Method: A Trio-Experts module operates directly on Gaussian primitives to derive carrier priors; a Safety and Budget Aware Gate (SBAG) allocates Gaussians to watermark embedding and to visual compensation; a channel-wise group mask controls gradient propagation, limiting parameter updates and repairing local artifacts.
Result: PSNR improves by +0.83 dB and bit accuracy by +1.24%. The watermark is view-consistent and robust to compression and noise. Decoupled finetuning and per-Gaussian attribution enable auditable explainability.
Conclusion: The method achieves a better trade-off between robustness and visual quality, and is described as the first native, explainable, high-quality watermarking framework for 3D Gaussian Splatting.
Abstract: As 3D Gaussian Splatting becomes the de facto representation for interactive 3D assets, robust yet imperceptible watermarking is critical. We present a representation-native framework that separates where to write from how to preserve quality. A Trio-Experts module operates directly on Gaussian primitives to derive priors for carrier selection, while a Safety and Budget Aware Gate (SBAG) allocates Gaussians to watermark carriers, optimized for bit resilience under perturbation and bitrate budgets, and to visual compensators that are insulated from watermark loss. To maintain fidelity, we introduce a channel-wise group mask that controls gradient propagation for carriers and compensators, thereby limiting Gaussian parameter updates, repairing local artifacts, and preserving high-frequency details without increasing runtime. Our design yields view-consistent watermark persistence and strong robustness against common image distortions such as compression and noise, while achieving a favorable robustness-quality trade-off compared with prior methods. In addition, decoupled finetuning provides per-Gaussian attributions that reveal where the message is carried and why those carriers are selected, enabling auditable explainability. Compared with state-of-the-art methods, our approach achieves a PSNR improvement of +0.83 dB and a bit-accuracy gain of +1.24%.
[44] VisionCreator-R1: A Reflection-Enhanced Native Visual-Generation Agentic Model
Jinxiang Lai,Wenzhe Zhao,Zexin Lu,Hualei Zhang,Qinyu Yang,Rongwei Quan,Zhimin Li,Shuai Shao,Song Guo,Qinglin Lu
Main category: cs.CV
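TL;DR: This paper proposes VisionCreator-R1, a native visual generation agent with explicit reflection, together with the Reflection-Plan Co-Optimization (RPCO) training method. By exposing an optimization asymmetry between reflection and planning in reinforcement learning and training in stages, it improves single-image and multi-image generation and clearly surpasses Gemini2.5Pro.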
Details
Motivation: Existing visual generation agents are mostly plan-driven and lack a systematic reflection mechanism for correcting visual errors during generation.
Method: The VisionCreator-R1 agent is trained with Reflection-Plan Co-Optimization (RPCO): supervised fine-tuning on the self-constructed VCR-SFT dataset (reflection-strong single-image trajectories and planning-strong multi-image trajectories), followed by joint reinforcement learning of reflection and planning on the VCR-RL dataset.
Result: VisionCreator-R1 consistently outperforms Gemini2.5Pro on existing benchmarks and on the newly built VCR-bench, which covers single-image and multi-image tasks. Experiments show that reflection learning is hindered by noisy credit assignment, whereas planning learning is more stable.
Conclusion: Reflection and planning call for different training strategies; an explicit reflection mechanism combined with staged co-optimization markedly improves the robustness and quality of multi-stage visual content generation.
Abstract: Visual content generation has advanced from single-image to multi-image workflows, yet existing agents remain largely plan-driven and lack systematic reflection mechanisms to correct mid-trajectory visual errors. To address this limitation, we propose VisionCreator-R1, a native visual generation agent with explicit reflection, together with a Reflection-Plan Co-Optimization (RPCO) training methodology. Through extensive experiments and trajectory-level analysis, we uncover reflection-plan optimization asymmetry in reinforcement learning (RL): planning can be reliably optimized via plan rewards, while reflection learning is hindered by noisy credit assignment. Guided by this insight, our RPCO first trains on the self-constructed VCR-SFT dataset with reflection-strong single-image trajectories and planning-strong multi-image trajectories, then co-optimizes on the VCR-RL dataset via RL. This yields our unified VisionCreator-R1 agent, which consistently outperforms Gemini2.5Pro on existing benchmarks and our VCR-bench covering single-image and multi-image tasks.
[45] Computer Vision-Based Vehicle Allotment System using Perspective Mapping
Prachi Nandi,Sonakshi Satapathy,Suchismita Chinara
Main category: cs.CV
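TL;DR: This paper proposes a low-cost, easy-to-deploy smart parking system based on YOLOv8 and inverse perspective mapping (IPM), which uses multi-view cameras and computer vision to detect parking spaces and guide users with a 3D visualization.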
Details
Motivation: Traditional sensor-based solutions suffer from low accuracy, poor flexibility, and difficult integration, and cannot meet the need for efficient, adaptive parking management in dense cities; computer vision, with its higher accuracy and adaptability to layout changes, is a better alternative.
Method: YOLOv8 detects vehicles and vacant spaces; images from four cameras are merged through inverse perspective mapping (IPM) into a unified top-down view; and a 3D Cartesian coordinate system is used to model and visualize the vacant spaces.
Result: The system identifies and localizes vacant spaces accurately, simulates a 3D parking environment, and presents available spaces in a 3D plot, demonstrating its feasibility and practicality.
Conclusion: A computer vision-based smart parking system outperforms traditional sensor-based solutions in cost, ease of deployment, and adaptability, offering an effective technical route for traffic optimization in smart cities.
Abstract: Smart city research envisions a future in which data-driven solutions and sustainable infrastructure work together to define urban living at the crossroads of urbanization and technology. Within this framework, smart parking systems play an important role in reducing urban congestion and supporting sustainable transportation. Automated parking solutions have considerable benefits, such as increased efficiency and less reliance on human involvement, but obstacles such as sensor limitations and integration complications remain. To overcome them, a more sophisticated car allotment system is required, particularly in heavily populated urban areas. Computer vision, with its higher accuracy and adaptability, outperforms traditional sensor-based systems for recognizing vehicles and vacant parking spaces. Unlike fixed sensor technologies, computer vision can dynamically assess a wide range of visual inputs while adjusting to changing parking layouts. This research presents a cost-effective, easy-to-implement smart parking system utilizing computer vision and object detection models like YOLOv8. Using inverse perspective mapping (IPM) to merge images from four camera views, we extract data on vacant spaces. The system simulates a 3D parking environment, representing available spots with a 3D Cartesian plot to guide users.
[46] A Lightweight Multi-Cancer Tumor Localization Framework for Deployable Digital Pathology
Brian Isett,Rebekah Dadey,Aofei Li,Ryan C. Augustin,Kate Smith,Aatur D. Singhi,Qiangqiang Gu,Riyue Bao
Main category: cs.CV
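TL;DR: This paper proposes MuCTaL, a multi-cancer tumor localization model. Balanced training on data from four cancers yields robust localization on both seen and unseen cancer types, demonstrating cross-cancer generalization.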
Details
Motivation: Deep learning models trained on a specific cancer lose robustness when applied to other cancer types; the question is whether multi-cancer training at modest scale can improve generalization.
Method: Using transfer learning with DenseNet169, the multi-cancer model MuCTaL is trained on 79,984 non-overlapping whole-slide image tiles from four cancers (melanoma, hepatocellular carcinoma, colorectal cancer, and non-small cell lung cancer), and a scalable inference workflow generates spatial tumor probability heatmaps.
Result: Tile-level ROC-AUC reaches 0.97 on validation data from the four training cancers and 0.71 on an independent pancreatic ductal adenocarcinoma cohort; the model's outputs are compatible with existing digital pathology tools.
Conclusion: Balanced multi-cancer training at modest scale can markedly improve cross-cancer generalization, and MuCTaL provides a general, deployable tumor localization tool for translational research.
Abstract: Accurate localization of tumor regions from hematoxylin and eosin-stained whole-slide images is fundamental for translational research including spatial analysis, molecular profiling, and tissue architecture investigation. However, deep learning-based tumor detection trained within specific cancers may exhibit reduced robustness when applied across different tumor types. We investigated whether balanced training across cancers at modest scale can achieve high performance and generalize to unseen tumor types. A multi-cancer tumor localization model (MuCTaL) was trained on 79,984 non-overlapping tiles from four cancers (melanoma, hepatocellular carcinoma, colorectal cancer, and non-small cell lung cancer) using transfer learning with DenseNet169. The model achieved a tile-level ROC-AUC of 0.97 in validation data from the four training cancers, and 0.71 on an independent pancreatic ductal adenocarcinoma cohort. A scalable inference workflow was built to generate spatial tumor probability heatmaps compatible with existing digital pathology tools. Code and models are publicly available at https://github.com/AivaraX-AI/MuCTaL.
[47] HECTOR: Hybrid Editable Compositional Object References for Video Generation
Guofeng Zhang,Angtian Wang,Jacob Zhiyuan Fang,Liming Jiang,Haotian Yang,Alan Yuille,Chongyang Ma
Main category: cs.CV
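TL;DR: This paper proposes HECTOR, a video generation framework that supports hybrid reference conditioning (static images and dynamic videos) and explicit trajectory control (location, scale, speed), enabling fine-grained compositional manipulation of visual elements and outperforming existing methods in visual quality, reference fidelity, and motion controllability.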
Details
Motivation: Existing video generation models lack explicit compositional control over the dynamic interactions among complex physical objects, and so struggle to satisfy fine-grained spatiotemporal constraints.
Method: The HECTOR generative framework supports guidance from hybrid references (images and/or videos) and lets users explicitly specify the trajectory (location, scale, speed) of each referenced element for fine-grained compositional control.
Result: Experiments show that HECTOR outperforms existing methods in visual quality, reference fidelity, and motion controllability.
Conclusion: HECTOR offers an interpretable and controllable compositional modeling paradigm for video generation, improving fine-grained control over dynamic scenes.
Abstract: Real-world videos naturally portray complex interactions among distinct physical objects, effectively forming dynamic compositions of visual elements. However, most current video generation models synthesize scenes holistically and therefore lack mechanisms for explicit compositional manipulation. To address this limitation, we propose HECTOR, a generative pipeline that enables fine-grained compositional control. In contrast to prior methods, HECTOR supports hybrid reference conditioning, allowing generation to be simultaneously guided by static images and/or dynamic videos. Moreover, users can explicitly specify the trajectory of each referenced element, precisely controlling its location, scale, and speed (see Figure 1). This design allows the model to synthesize coherent videos that satisfy complex spatiotemporal constraints while preserving high-fidelity adherence to references. Extensive experiments demonstrate that HECTOR achieves superior visual quality, stronger reference preservation, and improved motion controllability compared with existing approaches.
[48] Comparative Analysis of Patch Attack on VLM-Based Autonomous Driving Architectures
David Fernandez,Pedram MohajerAnsari,Amir Salarpour,Long Cheng,Abolfazl Razi,Mert D. Pesé
Main category: cs.CV
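TL;DR: This paper systematically evaluates the robustness of three vision-language models (Dolphins, OmniDrive, LeapVAD) to physical adversarial attacks in autonomous driving scenarios. All three show severe vulnerabilities, notably sustained multi-frame failures and degraded object detection, indicating that current VLM architectures struggle to withstand adversarial threats in safety-critical autonomous driving.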
Details
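Motivation: Vision-language models are increasingly important in autonomous driving, but their robustness to physical adversarial attacks has not been studied systematically, even though it is critical to safety.
Method: A fair comparison framework based on black-box optimization and semantic homogenization is used to evaluate the effect of physically realizable patch attacks on three VLM architectures in the CARLA simulator.
Result: All models show severe vulnerabilities, with sustained multi-frame failures and a marked drop in critical object detection performance, and each architecture exhibits its own distinct vulnerability pattern.
Conclusion: Current vision-language model architectures do not adequately withstand physical adversarial attacks in safety-critical autonomous driving applications and need more robust designs.
Abstract: Vision-language models are emerging for autonomous driving, yet their robustness to physical adversarial attacks remains unexplored. This paper presents a systematic framework for comparative adversarial evaluation across three VLM architectures: Dolphins, OmniDrive (Omni-L), and LeapVAD. Using black-box optimization with semantic homogenization for fair comparison, we evaluate physically realizable patch attacks in CARLA simulation. Results reveal severe vulnerabilities across all architectures, sustained multi-frame failures, and critical object detection degradation. Our analysis exposes distinct architectural vulnerability patterns, demonstrating that current VLM designs inadequately address adversarial threats in safety-critical autonomous driving applications.
[49] Towards Visual Query Segmentation in the Wild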
Bing Fan,Minghao Li,Hanzhi Zhang,Shaohua Dong,Naga Prudhvi Mareedu,Weishi Shi,Yunhe Feng,Yan Huang,Heng Fan
Main category: cs.CV
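TL;DR: This paper introduces visual query segmentation (VQS), a new paradigm that segments every pixel-level occurrence of a target object in an untrimmed video given an external visual query. It presents VQS-4K, the first large-scale benchmark for the task, and an effective method, VQ-SAM, which achieves leading performance on that benchmark.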
Details
Motivation: Existing visual query localization (VQL) locates only the last appearance of the target with a bounding box, which is neither comprehensive nor precise; a more practical paradigm is needed that covers the full video, works at the pixel level, and handles multiple occurrences.
Method: The authors build VQS-4K, a large-scale, manually annotated benchmark with 4,111 videos, 1.3 million frames, and 222 object categories, and propose VQ-SAM, which builds on SAM 2 and uses target-specific and background distractor cues to evolve its memory progressively through a multi-stage framework with an adaptive memory generation (AMG) module.
Result: VQ-SAM clearly surpasses all existing methods on VQS-4K, confirming its effectiveness; VQS-4K is the first benchmark designed specifically for VQS.
Conclusion: VQS extends the traditional VQL paradigm, and VQS-4K and VQ-SAM together lay a foundation for the new task that should encourage follow-up research and practical applications.
Abstract: In this paper, we introduce visual query segmentation (VQS), a new paradigm of visual query localization (VQL) that aims to segment all pixel-level occurrences of an object of interest in an untrimmed video, given an external visual query. Compared to existing VQL locating only the last appearance of a target using bounding boxes, VQS enables more comprehensive (i.e., all object occurrences) and precise (i.e., pixel-level masks) localization, making it more practical for real-world scenarios. To foster research on this task, we present VQS-4K, a large-scale benchmark dedicated to VQS. Specifically, VQS-4K contains 4,111 videos with more than 1.3 million frames and covers a diverse set of 222 object categories. Each video is paired with a visual query defined by a frame outside the search video and its target mask, and annotated with spatial-temporal masklets corresponding to the queried target. To ensure high quality, all videos in VQS-4K are manually labeled with meticulous inspection and iterative refinement. To the best of our knowledge, VQS-4K is the first benchmark specifically designed for VQS. Furthermore, to stimulate future research, we present a simple yet effective method, named VQ-SAM, which extends SAM 2 by leveraging target-specific and background distractor cues from the video to progressively evolve the memory through a novel multi-stage framework with an adaptive memory generation (AMG) module for VQS, significantly improving the performance. In our extensive experiments on VQS-4K, VQ-SAM achieves promising results and surpasses all existing approaches, demonstrating its effectiveness.
With the proposed VQS-4K and VQ-SAM, we expect to go beyond the current VQL paradigm and inspire more future research and practical applications on VQS. Our benchmark, code, and results will be made publicly available.
[50] Multi-Kernel Gated Decoder Adapters for Robust Multi-Task Thyroid Ultrasound under Cross-Center Shift
Maziar Sabouri,Nourhan Bayasi,Arman Rahmim
Main category: cs.CV
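TL;DR: This paper proposes multi-task learning adapters (MKGA/ResMKGA) for cross-center domain shift in thyroid ultrasound. Through lightweight decoder-side adaptation, they separate and strengthen the CNN's modeling of texture (malignancy risk) and the ViT's modeling of geometry (nodule segmentation), markedly improving cross-domain robustness and clinical TI-RADS diagnostic accuracy.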
Details
Motivation: Thyroid ultrasound automation needs both global geometric reasoning (nodule segmentation) and local texture reasoning (malignancy risk assessment), but under cross-center domain shift the two kinds of cue degrade asymmetrically, and a shared backbone tends to cause negative transfer. The authors observe that ViTs transfer geometric priors better while CNNs preserve texture information more reliably, and propose a targeted decoupled adaptation on that basis.
Method: The lightweight decoder-side adapter MKGA and its residual variant ResMKGA take multi-scale skip features, combine complementary receptive fields with semantic, context-conditioned gating, and suppress artifact-prone content before feature fusion. They are evaluated on ResNet34 (CNN) and MedSAM (ViT) backbones.
Result: On two ultrasound benchmarks, the adapters clearly improve cross-center segmentation; with the CNN backbone, clinical TI-RADS diagnostic accuracy is clearly higher than that of standard multi-task baselines.
Conclusion: Decoder-side adaptation is better suited than backbone sharing to domain shift in multi-task ultrasound; MKGA/ResMKGA jointly optimize and robustly decouple geometric and texture cues, offering a new paradigm for multi-task learning in medical ultrasound.
Abstract: Thyroid ultrasound (US) automation couples two competing requirements: global, geometry-driven reasoning for nodule delineation and local, texture-driven reasoning for malignancy risk assessment. Under cross-center domain shift, these cues degrade asymmetrically, yet most multi-task pipelines rely on a single shared backbone, often inducing negative transfer. In this paper, we characterize this interference across CNN (ResNet34) and medical ViT (MedSAM) backbones, and observe a consistent trend: ViTs transfer geometric priors that benefit segmentation, whereas CNNs more reliably preserve texture cues for malignancy discrimination under strong shift and artifacts. Motivated by this failure mode, we propose a lightweight family of decoder-side adapters, the Multi-Kernel Gated Adapter (MKGA) and a residual variant (ResMKGA), which refine multi-scale skip features using complementary receptive fields and apply semantic, context-conditioned gating to suppress artifact-prone content before fusion. Across two US benchmarks, the proposed adapters improve cross-center robustness: they strengthen out-of-domain segmentation and, in the CNN setting, yield clear gains in clinical TI-RADS diagnostic accuracy compared to standard multi-task baselines. Code and models will be released.
[51] Vision-Language Models Encode Clinical Guidelines for Concept-Based Medical Reasoning
Mohamed Harmanani,Bining Long,Zhuoxin Guo,Paul F. R. Wilson,Amirhossein Sabour,Minh Nguyen Nhat To,Gabor Fichtinger,Purang Abolmaesumi,Parvin Mousavi
Main category: cs.CV
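TL;DR: This paper proposes MedCBR, a framework that combines clinical guidelines with vision-language and reasoning models to improve the interpretability and reliability of Concept Bottleneck Models (CBMs) in medical imaging.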
Details
Motivation: Existing discrete concept representations struggle to incorporate clinical context such as diagnostic guidelines and expert heuristics, which reduces reliability in complex cases.
Method: MedCBR transforms clinical descriptors into guideline-conformant text and trains with a multitask objective (multimodal contrastive alignment, concept supervision, and diagnostic classification) to jointly model image features, concepts, and pathology; a reasoning model then generates structured, guideline-based clinical narratives.
Result: AUROC reaches 94.2% on ultrasound and 84.0% on mammography, and accuracy reaches 86.1% on non-medical datasets.
Conclusion: MedCBR improves the interpretability of medical AI and forms an end-to-end, interpretable bridge from image analysis to clinical decision-making.
Abstract: Concept Bottleneck Models (CBMs) are a prominent framework for interpretable AI that map learned visual features to a set of meaningful concepts for task-specific downstream predictions. Their sequential structure enhances transparency by connecting model predictions to the underlying concepts that support them. In medical imaging, where transparency is essential, CBMs offer an appealing foundation for explainable model design. However, discrete concept representations often overlook broader clinical context such as diagnostic guidelines and expert heuristics, reducing reliability in complex cases. We propose MedCBR, a concept-based reasoning framework that integrates clinical guidelines with vision-language and reasoning models. Labeled clinical descriptors are transformed into guideline-conformant text, and a concept-based model is trained with a multitask objective combining multimodal contrastive alignment, concept supervision, and diagnostic classification to jointly ground image features, concepts, and pathology. A reasoning model then converts these predictions into structured clinical narratives that explain the diagnosis, emulating expert reasoning based on established guidelines. MedCBR achieves superior diagnostic and concept-level performance, with AUROCs of 94.2% on ultrasound and 84.0% on mammography. Further experiments on non-medical datasets achieve 86.1% accuracy. Our framework enhances interpretability and forms an end-to-end bridge from medical image analysis to decision-making.
[52] MEGC2026: Micro-Expression Grand Challenge on Visual Question Answering
Xinqi Fan,Jingting Li,John See,Moi Hoon Yap,Su-Jing Wang,Adrian K. Davison
Main category: cs.CV
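TL;DR: This paper introduces the two new tasks of the MEGC 2026 challenge, ME-VQA and ME-LVQA, which aim to use multimodal large language models (MLLMs) and large vision-language models (LVLMs) to improve the understanding and analysis of facial micro-expressions (MEs), with emphasis on short-video question answering and on temporal reasoning and micro-expression detection in long videos.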
Details
Motivation: Facial micro-expressions (MEs) are valuable in high-stakes environments, but they are brief, subtle, and easily suppressed, which makes automatic recognition difficult; emerging MLLMs and LVLMs offer a new opportunity to improve ME understanding. Method: Two new video question answering tasks are proposed, ME-VQA (short videos) and ME-LVQA (long videos), which require participating models to use MLLMs or LVLMs for cross-modal reasoning and temporal modeling on ME-related questions, with unified evaluation through a public leaderboard. Result: A standardized evaluation framework for the MEGC 2026 challenge is established, pushing ME analysis toward multimodal understanding and long-video modeling, with open data and a leaderboard to encourage community collaboration and benchmark comparison. Conclusion: The ME-VQA and ME-LVQA tasks mark a shift in ME analysis from traditional single-frame or short-sequence recognition toward a multimodal paradigm that combines semantic understanding, visual reasoning, and long-duration modeling. Abstract: Facial micro-expressions (MEs) are involuntary movements of the face that occur spontaneously when a person experiences an emotion but attempts to suppress or repress the facial expression, typically found in a high-stakes environment. In recent years, substantial advancements have been made in the areas of ME recognition, spotting, and generation. The emergence of multimodal large language models (MLLMs) and large vision-language models (LVLMs) offers promising new avenues for enhancing ME analysis through their powerful multimodal reasoning capabilities. The ME grand challenge (MEGC) 2026 introduces two tasks that reflect these evolving research directions: (1) ME video question answering (ME-VQA), which explores ME understanding through visual question answering on relatively short video sequences, leveraging MLLMs or LVLMs to address diverse question types related to MEs; and (2) ME long-video question answering (ME-LVQA), which extends VQA to long-duration video sequences in realistic settings, requiring models to handle temporal reasoning and subtle micro-expression detection across extended time periods. All participating algorithms are required to submit their results on a public leaderboard. More details are available at https://megc2026.github.io.
[53] TIDE: Text-Informed Dynamic Extrapolation with Step-Aware Temperature Control for Diffusion Transformers
Yihua Liu,Fanjiang Ye,Bowen Lin,Rongyu Fang,Chengming Zhang
Main category: cs.CV
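TL;DR: This paper proposes TIDE, a training-free text-to-image (T2I) resolution extrapolation technique that uses a text anchoring mechanism and a dynamic temperature control mechanism to address structural degradation and prompt information loss when DiTs generate at high resolution; it supports arbitrary resolutions and aspect ratios without additional sampling overhead.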
Details
Motivation: Diffusion Transformers (DiTs) suffer structural degradation when generating images above their training resolution, mainly because attention dilution causes prompt information loss; existing methods that sharpen attention struggle to both preserve semantic detail and suppress artifacts. Method: A training-free method, TIDE, with two parts: (1) a text anchoring mechanism that corrects the representational imbalance between text and image tokens; (2) a dynamic temperature control mechanism that exploits the pattern of spectral progression during diffusion to suppress generation artifacts. Result: TIDE shows high-quality resolution extrapolation on multiple benchmarks, integrates seamlessly with current SOTA DiT models, and generates images at arbitrary resolutions and aspect ratios that are structurally clear, semantically faithful, and free of obvious artifacts. Conclusion: TIDE is an efficient, general, plug-and-play resolution extrapolation scheme for DiTs that addresses the high-resolution generation bottleneck through both representation alignment and spectral control, offering a new approach for practical deployment of diffusion models. Abstract: Diffusion Transformers (DiTs) face challenges when generating images at resolutions higher than the training resolution, notably structural degradation caused by attention dilution. Previous approaches attempt to mitigate this by sharpening attention distributions, but fail to preserve fine-grained semantic details and introduce obvious artifacts. In this work, we analyze the characteristics of DiTs and propose TIDE, a training-free text-to-image (T2I) extrapolation method that enables generation with arbitrary resolution and aspect ratio without additional sampling overhead. We identify the core factor for prompt information loss, and introduce a text anchoring mechanism to correct the imbalance between text and image tokens. To further eliminate artifacts, we design a dynamic temperature control mechanism that leverages the pattern of spectral progression in the diffusion process. Extensive evaluations demonstrate that TIDE delivers high-quality resolution extrapolation capability and integrates seamlessly with existing state-of-the-art methods.
[54] Using Vision Language Foundation Models to Generate Plant Simulation Configurations via In-Context Learning
Heesup Yun,Isaac Kazuo Uyehara,Earl Ranario,Lars Lundqvist,Christine H. Diepenbrock,Brian N. Bailey,J. Mason Earles
Main category: cs.CV
TL;DR: This paper proposes using vision-language models (Gemma 3 and Qwen3-VL) to generate plant simulation configurations (in JSON) directly from drone images, evaluated on a synthetic cowpea dataset and validated on real-world data; it identifies limitations like contextual bias and reliance on dataset priors, and introduces the first VLM-based framework for scalable 3D plot reconstruction for agricultural digital twins.
Details
Motivation: Functional-structural plant models (FSPMs) are powerful but too complex and slow for large-scale deployment in agriculture; there is a need for scalable, automated methods to generate simulation configurations for digital twins. Method: Leverages open-source VLMs (Gemma 3 and Qwen3-VL) with in-context learning to generate JSON-formatted plant simulation parameters from drone-based remote sensing images; evaluates performance using a synthetic cowpea dataset (from Helios 3D), plus real-world orthophotos and ablation with a blind baseline; assessment covers JSON integrity, geometric accuracy, and biophysical plausibility. Result: VLMs can extract structural metadata (e.g., plant count, sun azimuth) but suffer from contextual bias and fall back to dataset means when visual cues are weak; validation confirms limited generalization beyond synthetic training data. Conclusion: This work pioneers VLM-driven generation of FSPM configurations for digital agriculture, offering a scalable pathway toward automated 3D plot reconstruction—yet highlights critical limitations in robustness and reasoning that must be addressed for real-world deployment. Abstract: This paper introduces a synthetic benchmark to evaluate the performance of vision language models (VLMs) in generating plant simulation configurations for digital twins. While functional-structural plant models (FSPMs) are useful tools for simulating biophysical processes in agricultural environments, their high complexity and low throughput create bottlenecks for deployment at scale. We propose a novel approach that leverages state-of-the-art open-source VLMs -- Gemma 3 and Qwen3-VL -- to directly generate simulation parameters in JSON format from drone-based remote sensing images. 
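JSON integrity is one of the evaluation categories named above. As a minimal sketch of what such a check might involve, the snippet below parses a model response and verifies required keys and numeric ranges. The field names (`plant_count`, `sun_azimuth_deg`) and the ranges are hypothetical, chosen only for illustration; the paper's actual schema is not given in the abstract.

```python
import json

# Hypothetical schema: field name -> (expected type, min, max). Not from the paper.
SCHEMA = {
    "plant_count": (int, 0, 10_000),
    "sun_azimuth_deg": (float, 0.0, 360.0),
}

def check_config(raw_text):
    """Return (ok, problems) for a model response that should be a JSON object."""
    try:
        config = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        return False, [f"invalid JSON: {exc.msg}"]
    if not isinstance(config, dict):
        return False, ["top-level value is not an object"]
    problems = []
    for key, (kind, low, high) in SCHEMA.items():
        if key not in config:
            problems.append(f"missing key: {key}")
            continue
        value = config[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{key} is not numeric")
        elif kind is int and not isinstance(value, int):
            problems.append(f"{key} is not an integer")
        elif not low <= value <= high:
            problems.append(f"{key} out of range [{low}, {high}]")
    return not problems, problems
```

A check of this kind only covers syntactic and range validity; geometric and biophysical plausibility would need separate comparisons against ground truth.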
Using a synthetic cowpea plot dataset generated via the Helios 3D procedural plant generation library, we tested five in-context learning methods and evaluated the models across three categories: JSON integrity, geometric evaluations, and biophysical evaluations. Our results show that while VLMs can interpret structural metadata and estimate parameters like plant count and sun azimuth, they often exhibit performance degradation due to contextual bias or rely on dataset means when visual cues are insufficient. Validation on a real-world drone orthophoto dataset and an ablation study using a blind baseline further characterize the models' reasoning capabilities versus their reliance on contextual priors. To the best of our knowledge, this is the first study to utilize VLMs to generate structural JSON configurations for plant simulations, providing a scalable framework for reconstructing 3D plots for digital twins in agriculture.
[55] PathoScribe: Transforming Pathology Data into a Living Library with a Unified LLM-Driven Framework for Semantic Retrieval and Clinical Integration
Abdul Rehman Akbar,Samuel Wales-McGrath,Alejadro Levya,Lina Gokhale,Rajendra Singh,Wei Chen,Anil Parwani,Muhammad Khalid Khan Niazi
Main category: cs.CV
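TL;DR: This paper proposes PathoScribe, a retrieval-augmented large language model framework that aims to turn static pathology archives into a searchable, reasoning-enabled "living library", supporting natural language case retrieval, automated cohort construction, clinical question answering, IHC recommendation, and report transformation; evaluation on 70,000 pathology reports shows high recall and high-quality reasoning.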
Details
Motivation: The large body of experiential knowledge contained in pathology reports is hard to use without effective retrieval and reasoning mechanisms; digitization alone is not enough to support clinical decisions, so real-time retrieval of similar cases and knowledge support are needed. Method: A unified retrieval-augmented LLM (RAG-LLM) framework, PathoScribe, integrating natural language retrieval, multi-task reasoning (such as cohort construction, IHC recommendation, and report generation), and clinical semantic understanding, evaluated end to end on 70,000 multi-institutional surgical pathology reports. Result: 100% Recall@10 on natural language case retrieval; mean reviewer score of 4.56/5 for reasoning quality; automated cohort construction takes 9.2 minutes on average with 91.3% agreement with human reviewers and no missed eligible cases. Conclusion: PathoScribe provides a scalable technical foundation for moving digital pathology archives from passive storage to active clinical intelligence platforms, improving knowledge reuse and speeding research translation. Abstract: Pathology underpins modern diagnosis and cancer care, yet its most valuable asset, the accumulated experience encoded in millions of narrative reports, remains largely inaccessible. Although institutions are rapidly digitizing pathology workflows, storing data without effective mechanisms for retrieval and reasoning risks transforming archives into a passive data repository, where institutional knowledge exists but cannot meaningfully inform patient care. True progress requires not only digitization, but the ability for pathologists to interrogate prior similar cases in real time while evaluating a new diagnostic dilemma. We present PathoScribe, a unified retrieval-augmented large language model (LLM) framework designed to transform static pathology archives into a searchable, reasoning-enabled living library. PathoScribe enables natural language case exploration, automated cohort construction, clinical question answering, immunohistochemistry (IHC) panel recommendation, and prompt-controlled report transformation within a single architecture. Evaluated on 70,000 multi-institutional surgical pathology reports, PathoScribe achieved perfect Recall@10 for natural language case retrieval and demonstrated high-quality retrieval-grounded reasoning (mean reviewer score 4.56/5).
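For readers unfamiliar with the retrieval metric, here is a minimal sketch of Recall@k under one common definition: the fraction of queries for which at least one relevant case appears in the top k results. Whether the paper uses this definition or a per-item variant is an assumption, and the case identifiers below are made up.

```python
def recall_at_k(ranked_results, relevant, k=10):
    """Fraction of queries with at least one relevant item in the top k.

    ranked_results: dict mapping query id -> list of retrieved ids, best first.
    relevant: dict mapping query id -> set of relevant ids.
    """
    if not ranked_results:
        raise ValueError("no queries given")
    hits = 0
    for query_id, ranking in ranked_results.items():
        top_k = set(ranking[:k])
        if top_k & relevant.get(query_id, set()):
            hits += 1
    return hits / len(ranked_results)

# Made-up example: two queries, three retrieved cases each.
ranked = {"q1": ["c7", "c2", "c9"], "q2": ["c4", "c5", "c1"]}
truth = {"q1": {"c2"}, "q2": {"c1"}}
```

Under this definition, a perfect Recall@10 means every query had a relevant case somewhere in its first ten results.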
Critically, the system operationalized automated cohort construction from free-text eligibility criteria, assembling research-ready cohorts in minutes (mean 9.2 minutes) with 91.3% agreement with human reviewers and no eligible cases incorrectly excluded, representing orders-of-magnitude reductions in time and cost compared to traditional manual chart review. This work establishes a scalable foundation for converting digital pathology archives from passive storage systems into active clinical intelligence platforms.
[56] BiCLIP: Domain Canonicalization via Structured Geometric Transformation
Pranav Mantini,Shishir K. Shah
Main category: cs.CV
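TL;DR: This paper proposes BiCLIP, a framework that estimates a canonical geometric transformation between cross-domain image features from a small number of anchors to improve multimodal alignment, enabling efficient few-shot cross-domain classification with zero or very few parameters.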
Details
Motivation: Existing vision-language models (VLMs) have strong zero-shot ability but are hard to adapt to specialized domains; theory suggests that different VLMs are related by a canonical transformation, and this paper extends the idea to cross-domain image features, hypothesizing that they too are related by a canonical geometric transformation recoverable from a few anchors. Method: The BiCLIP framework: in the few-shot classification setting, a small number of labeled samples serve as anchors to learn and apply a targeted bilinear (or linear) geometric transformation to multimodal features, enhancing cross-modal alignment; the method is very simple with a very low parameter count. Result: Consistently reaches SOTA on 11 standard benchmarks including EuroSAT, DTD, and FGVCAircraft; analysis of the orthogonality and angular distribution of the learned transformations empirically supports the effectiveness of structured alignment. Conclusion: Structured geometric alignment is key to robust domain adaptation; BiCLIP achieves high-performing cross-domain few-shot classification with a minimal design and very low overhead, offering a new paradigm for VLM domain adaptation. Abstract: Recent advances in vision-language models (VLMs) have demonstrated remarkable zero-shot capabilities, yet adapting these models to specialized domains remains a significant challenge. Building on recent theoretical insights suggesting that independently trained VLMs are related by a canonical transformation, we extend this understanding to the concept of domains. We hypothesize that image features across disparate domains are related by a canonicalized geometric transformation that can be recovered using a small set of anchors. Few-shot classification provides a natural setting for this alignment, as the limited labeled samples serve as the anchors required to estimate this transformation. Motivated by this hypothesis, we introduce BiCLIP, a framework that applies a targeted transformation to multimodal features to enhance cross-modal alignment. Our approach is characterized by its extreme simplicity and low parameter footprint. Extensive evaluations across 11 standard benchmarks, including EuroSAT, DTD, and FGVCAircraft, demonstrate that BiCLIP consistently achieves state-of-the-art results. Furthermore, we provide empirical verification of existing geometric findings by analyzing the orthogonality and angular distribution of the learned transformations, confirming that structured alignment is the key to robust domain adaptation. Code is available at https://github.com/QuantitativeImagingLaboratory/BilinearCLIP
[57] Can You Hear, Localize, and Segment Continually? An Exemplar-Free Continual Learning Benchmark for Audio-Visual Segmentation
Siddeshwar Raghavan,Gautham Vinod,Bruce Coburn,Fengqing Zhu
Main category: cs.CV
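TL;DR: This paper proposes the first exemplar-free continual learning benchmark for audio-visual segmentation (AVS), together with the ATLAS baseline, which uses audio-guided pre-fusion conditioning and Low-Rank Anchoring (LRA) to mitigate catastrophic forgetting and improve AVS performance in dynamic environments.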
Details
Motivation: Real-world environments change over time, so audio and visual distributions evolve, while existing AVS systems mostly assume static training settings and adapt poorly to such change. Method: The ATLAS model uses audio-guided pre-fusion conditioning, modulating visual feature channels with projected audio context before cross-modal attention; a Low-Rank Anchoring (LRA) mechanism stabilizes adapted weights according to loss sensitivity to mitigate catastrophic forgetting. Result: Extensive experiments on four continual learning protocols (covering single-source and multi-source AVS datasets) show that ATLAS is competitive across diverse continual learning scenarios, laying a foundation for lifelong audio-visual perception. Conclusion: The paper fills a gap for AVS in continual learning; the proposed benchmark and method improve robustness and adaptability in dynamic environments and advance multimodal continual learning. Abstract: Audio-Visual Segmentation (AVS) aims to produce pixel-level masks of sound-producing objects in videos by jointly learning from audio and visual signals. However, real-world environments are inherently dynamic, causing audio and visual distributions to evolve over time, which challenges existing AVS systems that assume static training settings. To address this gap, we introduce the first exemplar-free continual learning benchmark for Audio-Visual Segmentation, comprising four learning protocols across single-source and multi-source AVS datasets. We further propose a strong baseline, ATLAS, which uses audio-guided pre-fusion conditioning to modulate visual feature channels via projected audio context before cross-modal attention. Finally, we mitigate catastrophic forgetting by introducing Low-Rank Anchoring (LRA), which stabilizes adapted weights based on loss sensitivity. Extensive experiments demonstrate competitive performance across diverse continual scenarios, establishing a foundation for lifelong audio-visual perception. Code is available at https://gitlab.com/viper-purdue/atlas (paper under review). Keywords: Continual Learning, Audio-Visual Segmentation, Multi-Modal Learning.
[58] SVG-EAR: Parameter-Free Linear Compensation for Sparse Video Generation via Error-aware Routing
Xuanyi Zhou,Qiuyang Mang,Shuo Yang,Haocheng Xi,Jintao Zhang,Huanzhi Mao,Joseph E. Gonzalez,Kurt Keutzer,Ion Stoica,Alvin Cheung
Main category: cs.CV
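TL;DR: This paper proposes SVG-EAR, a parameter-free sparse attention compensation method based on semantic clustering and error-aware routing, which improves the trade-off between efficiency and generation quality in video Diffusion Transformers (DiTs).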
Details
Motivation: Diffusion Transformers (DiTs) face a quadratic attention cost bottleneck in video generation; existing sparse attention methods either drop blocks, losing information, or rely on learned predictors, adding training burden and distribution shift. Method: Based on the observation that keys and values are highly similar after semantic clustering, a parameter-free linear compensation branch (centroid compensation) is proposed; a lightweight probe estimates the error, and error-aware routing driven by the error-to-cost ratio decides which blocks are computed exactly and which are compensated. Result: Speedups of 1.77× on Wan2.2 and 1.93× on HunyuanVideo while maintaining PSNRs of 29.759 and 31.043, respectively; theory shows the compensation error is controlled by clustering quality, and experiments show a better quality-efficiency Pareto frontier than prior methods. Conclusion: SVG-EAR is training-free and parameter-free; it uses semantic clustering and error-aware routing to ease the computational bottleneck of DiTs, increasing throughput without sacrificing generation fidelity and offering a new approach to efficient video diffusion modeling. Abstract: Diffusion Transformers (DiTs) have become a leading backbone for video generation, yet their quadratic attention cost remains a major bottleneck. Sparse attention reduces this cost by computing only a subset of attention blocks. However, prior methods often either drop the remaining blocks, which incurs information loss, or rely on learned predictors to approximate them, introducing training overhead and potential shifts in the output distribution. In this paper, we show that the missing contributions can be recovered without training: after semantic clustering, keys and values within each block exhibit strong similarity and can be well summarized by a small set of cluster centroids. Based on this observation, we introduce SVG-EAR, a parameter-free linear compensation branch that uses the centroid to approximate skipped blocks and recover their contributions. While centroid compensation is accurate for most blocks, it can fail on a small subset. Standard sparsification typically selects blocks by attention scores, which indicate where the model places its attention mass, but not where the approximation error would be largest. SVG-EAR therefore performs error-aware routing: a lightweight probe estimates the compensation error for each block, and we compute exactly the blocks with the highest error-to-cost ratio while compensating for skipped blocks.
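The routing rule described above can be pictured as a greedy selection: rank blocks by estimated error per unit cost, compute the top ones exactly until a budget is used up, and compensate the rest. This is a simplified illustration and not the authors' implementation; the error estimates, costs, and budget are hypothetical inputs standing in for the probe's output.

```python
def route_blocks(est_error, cost, budget):
    """Choose which attention blocks to compute exactly.

    est_error: estimated compensation error per block (e.g. from a probe).
    cost: positive cost of computing each block exactly.
    budget: total exact-compute cost allowed.
    Returns (exact, compensated) as sorted lists of block indices.
    """
    if len(est_error) != len(cost):
        raise ValueError("est_error and cost must have the same length")
    # Highest error-to-cost ratio first.
    order = sorted(range(len(cost)),
                   key=lambda i: est_error[i] / cost[i],
                   reverse=True)
    exact, spent = [], 0.0
    for i in order:
        # Skip blocks that no longer fit; cheaper ones later may still fit.
        if spent + cost[i] <= budget:
            exact.append(i)
            spent += cost[i]
    chosen = set(exact)
    compensated = [i for i in range(len(cost)) if i not in chosen]
    return sorted(exact), compensated
```

Ranking by error-to-cost ratio rather than by attention score is the point of the design: a block with low attention mass can still be worth computing exactly if its centroid approximation is poor.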
We provide theoretical guarantees that relate attention reconstruction error to clustering quality, and empirically show that SVG-EAR improves the quality-efficiency trade-off and increases throughput at the same generation fidelity on video diffusion tasks. Overall, SVG-EAR establishes a clear Pareto frontier over prior approaches, achieving up to 1.77$\times$ and 1.93$\times$ speedups while maintaining PSNRs of up to 29.759 and 31.043 on Wan2.2 and HunyuanVideo, respectively.
[59] SkipGS: Post-Densification Backward Skipping for Efficient 3DGS Training
Jingxing Li,Yongjae Lee,Deliang Fan
Main category: cs.CV
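TL;DR: SkipGS proposes a view-adaptive backward gating mechanism that skips redundant backward passes during the post-densification phase of 3D Gaussian Splatting (3DGS), substantially speeding up training (23.1% less end-to-end training time) while preserving reconstruction quality.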
Details
Motivation: In the post-densification phase of 3DGS, the backward pass is time-consuming, and many sampled views have near-plateaued losses that contribute little gradient, creating computational redundancy. Method: SkipGS continually updates per-view loss statistics through the forward pass and dynamically skips the backward pass when the current loss is consistent with the view's recent baseline, while enforcing a minimum backward budget to keep optimization stable. Result: On Mip-NeRF 360, compared with 3DGS, SkipGS cuts end-to-end training time by 23.1%, with a 42.0% reduction in the post-densification phase, at comparable reconstruction quality; the method is plug-and-play and can be combined with other acceleration strategies. Conclusion: With a lightweight, representation-agnostic backward scheduling strategy, SkipGS eases the 3DGS training bottleneck and improves training efficiency without sacrificing quality, with good compatibility and practicality. Abstract: 3D Gaussian Splatting (3DGS) achieves real-time novel-view synthesis by optimizing millions of anisotropic Gaussians, yet its training remains expensive, with the backward pass dominating runtime in the post-densification refinement phase. We observe substantial update redundancy in this phase: many sampled views have near-plateaued losses and provide diminishing gradient benefits, but standard training still runs full backpropagation. We propose SkipGS with a novel view-adaptive backward gating mechanism for efficient post-densification training. SkipGS always performs the forward pass to update per-view loss statistics, and selectively skips backward passes when the sampled view's loss is consistent with its recent per-view baseline, while enforcing a minimum backward budget for stable optimization. On Mip-NeRF 360, compared to 3DGS, SkipGS reduces end-to-end training time by 23.1%, driven by a 42.0% reduction in post-densification time, with comparable reconstruction quality. Because it only changes when to backpropagate -- without modifying the renderer, representation, or loss -- SkipGS is plug-and-play and compatible with other complementary efficiency strategies for additive speedups.
[60] Diffusion-Based Authentication of Copy Detection Patterns: A Multimodal Framework with Printer Signature Conditioning
Bolutife Atoki,Iuliia Tkachenko,Bertrand Kerautret,Carlos Crispim-Junior
Main category: cs.CV
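TL;DR: This paper proposes a diffusion-based authentication framework that combines the original binary template, the printed CDP, and a representation of printer identity to achieve accurate printer classification and counterfeit detection; it outperforms traditional methods and generalizes to unseen counterfeit types.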
Details
Motivation: Advances in high-resolution printing and scanning devices and in generative deep learning make it hard for traditional anti-counterfeiting authentication systems to tell high-quality counterfeits from genuine prints. Method: A diffusion-based multi-class printer classification framework that jointly uses the original binary template, the printed CDP, and a semantic representation of printer identity; ControlNet is extended by repurposing the denoising process for class-conditioned noise prediction. Result: On the Indigo 1x1 Base dataset, the method outperforms traditional similarity metrics and prior deep learning approaches, and it generalizes to counterfeit types unseen during training. Conclusion: The diffusion-driven authentication method improves the robustness and generalization of print anti-counterfeiting and offers a new approach against advanced counterfeiting techniques. Abstract: Counterfeiting affects diverse industries, including pharmaceuticals, electronics, and food, posing serious health and economic risks. Printable unclonable codes, such as Copy Detection Patterns (CDPs), are widely used as an anti-counterfeiting measure and are applied to products and packaging. However, the increasing availability of high-resolution printing and scanning devices, along with advances in generative deep learning, undermines traditional authentication systems, which often fail to distinguish high-quality counterfeits from genuine prints. In this work, we propose a diffusion-based authentication framework that jointly leverages the original binary template, the printed CDP, and a representation of printer identity that captures relevant semantic information. Formulating authentication as multi-class printer classification over printer signatures lets our model capture fine-grained, device-specific features via spatial and textual conditioning. We extend ControlNet by repurposing the denoising process for class-conditioned noise prediction, enabling effective printer classification. On the Indigo 1 x 1 Base dataset, our method outperforms traditional similarity metrics and prior deep learning approaches. Results show the framework generalises to counterfeit types unseen during training.
[61] WS-Net: Weak-Signal Representation Learning and Gated Abundance Reconstruction for Hyperspectral Unmixing via State-Space and Weak Signal Attention Fusion
Zekun Long,Ali Zia,Guanyiman Fu,Vivien Rolland,Jun Zhou
Main category: cs.CV
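TL;DR: This paper proposes WS-Net, a deep unmixing framework based on state-space modeling and weak signal attention fusion, which targets the masking of weak spectral responses in hyperspectral images and substantially improves abundance estimation for weak endmembers.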
Details
Motivation: Weak spectral responses in hyperspectral images are often obscured by dominant endmembers and sensor noise, leading to inaccurate abundance estimation. Method: WS-Net uses a multi-resolution wavelet-fused encoder, a Mamba state-space branch to model long-range dependencies, a Weak Signal Attention branch to enhance low-similarity spectral features, and a learnable gating mechanism to fuse the representations; the decoder uses KL-divergence regularization to improve separability between dominant and weak endmembers. Result: Outperforms six SOTA methods on one simulated and two real datasets (Samson, Apex), reducing RMSE and SAD by up to 55% and 63%, respectively, and keeps stable accuracy for weak endmembers at low SNR. Conclusion: WS-Net is a robust and efficient new benchmark method designed for weak-signal hyperspectral unmixing. Abstract: Weak spectral responses in hyperspectral images are often obscured by dominant endmembers and sensor noise, resulting in inaccurate abundance estimation. This paper introduces WS-Net, a deep unmixing framework specifically designed to address weak-signal collapse through state-space modelling and Weak Signal Attention fusion. The network features a multi-resolution wavelet-fused encoder that captures both high-frequency discontinuities and smooth spectral variations with a hybrid backbone that integrates a Mamba state-space branch for efficient long-range dependency modelling. It also incorporates a Weak Signal Attention branch that selectively enhances low-similarity spectral cues. A learnable gating mechanism adaptively fuses both representations, while the decoder leverages KL-divergence-based regularisation to enforce separability between dominant and weak endmembers. Experiments on one simulated and two real datasets (a synthetic dataset, Samson, and Apex) demonstrate consistent improvements over six state-of-the-art baselines, achieving up to 55% and 63% reductions in RMSE and SAD, respectively. The framework maintains stable accuracy under low-SNR conditions, particularly for weak endmembers, establishing WS-Net as a robust and computationally efficient benchmark for weak-signal hyperspectral unmixing.
[62] Spectral-Structured Diffusion for Single-Image Rain Removal
Yucheng Xing,Xin Wang
Main category: cs.CV
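TL;DR: This paper proposes SpectralDiff, a spectral-structured diffusion framework for single-image rain removal that introduces structured spectral perturbations to guide the progressive suppression of multi-directional rain streaks, and designs a full-product U-Net architecture to improve computational efficiency.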
Details
Motivation: Rain streaks are directional and concentrated in frequency across multiple scales, and conventional spatial-domain diffusion models struggle to model such structured spectral characteristics. Method: The SpectralDiff framework introduces structured spectral perturbations into the standard diffusion process; a full-product U-Net architecture uses the convolution theorem to replace convolutions with element-wise product layers. Result: Experiments on synthetic and real-world datasets show that SpectralDiff achieves competitive rain removal performance with better model compactness and inference efficiency than existing diffusion-based approaches. Conclusion: Spectral-structure priors can strengthen a diffusion model's ability to model directional textures such as rain streaks, enabling efficient, high-quality single-image rain removal without redefining the diffusion formulation. Abstract: Rain streaks manifest as directional and frequency-concentrated structures that overlap across multiple scales, making single-image rain removal particularly challenging. While diffusion-based restoration models provide a powerful framework for progressive denoising, standard spatial-domain diffusion does not explicitly account for such structured spectral characteristics. We introduce SpectralDiff, a spectral-structured diffusion-based framework tailored for single-image rain removal. Rather than redefining the diffusion formulation, our method incorporates structured spectral perturbations to guide the progressive suppression of multi-directional rain components. To support this design, we further propose a full-product U-Net architecture that leverages the convolution theorem to replace convolution operations with element-wise product layers, improving computational efficiency while preserving modeling capacity. Extensive experiments on synthetic and real-world benchmarks demonstrate that SpectralDiff achieves competitive rain removal performance with improved model compactness and favorable inference efficiency compared to existing diffusion-based approaches.
[63] Intelligent Spatial Estimation for Fire Hazards in Engineering Sites: An Enhanced YOLOv8-Powered Proximity Analysis Framework
Ammar K. AlMhdawi,Nonso Nnamoko,Alaa Mashan Ubaid
Main category: cs.CV
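TL;DR: This paper proposes an enhanced dual-model YOLOv8 framework for intelligent fire detection and distance-based risk assessment, which detects fire and smoke and also uses information about surrounding objects to produce a quantitative risk score.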
Details
Motivation: Conventional vision-based monitoring is limited to simple fire detection and lacks hazard prioritization and actionable guidance, so an intelligent system that quantifies risk using environmental context is needed. Method: A dual-model framework: the primary model is a YOLOv8 instance segmentation model (trained on 9,860 annotated images) for detecting fire and smoke; the secondary model is a COCO-pretrained object detector that identifies people, vehicles, infrastructure, and similar entities; pixel distances are computed and converted to meters, and the combined information yields a risk score. Result: Precision, recall, and F1 all exceed 90%, and mAP@0.5 exceeds 91%; outputs are annotated visualizations with fire locations, detected objects, estimated distances, and risk levels; the system is lightweight, runs in Google Colab, and suits industrial and resource-constrained settings. Conclusion: The framework turns fire detection into an actionable risk assessment tool, balancing accuracy and practicality, and has practical deployment value. Abstract: This study proposes an enhanced dual-model YOLOv8 framework for intelligent fire detection and proximity-aware risk assessment, extending conventional vision-based monitoring beyond simple detection to actionable hazard prioritization. The system is trained on a dataset of 9,860 annotated images to segment fire and smoke across complex environments. The framework combines a primary YOLOv8 instance segmentation model for fire and smoke detection with a secondary object detection model pretrained on the COCO dataset to identify surrounding entities such as people, vehicles, and infrastructure. By integrating the outputs of both models, the system computes pixel-based distances between detected fire regions and nearby objects and converts these values into approximate real-world measurements using a pixel-to-meter scaling approach. This proximity information is incorporated into a risk assessment mechanism that combines fire evidence, object vulnerability, and distance-based exposure to produce a quantitative risk score and alert level. The proposed framework achieves strong performance, with precision, recall, and F1 scores exceeding 90% and mAP@0.5 above 91%. The system generates annotated visual outputs showing fire locations, detected objects, estimated distances, and contextual risk information to support situational awareness. Implemented using open-source tools within the Google Colab environment, the framework is lightweight and suitable for deployment in industrial and resource-constrained settings.
[64] GST-VLA: Structured Gaussian Spatial Tokens for 3D Depth-Aware Vision-Language-Action Models
Md Selim Sarowar,Omer Tariq,Sungho Kim
Main category: cs.CV
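TL;DR: GST-VLA proposes a new vision-language-action framework that combines 3D Gaussian spatial tokenization with depth-aware chain-of-thought reasoning, improving performance on robotic manipulation tasks.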
Details
Motivation: The 2D image patch tokens used by existing VLA models lack intrinsic geometric structure and poorly support robotic manipulation tasks that need precise 3D understanding. Method: (1) A Gaussian Spatial Tokenizer (GST) converts depth and semantic features into 128 anisotropic 3D Gaussian primitives that encode position, scale, orientation, and geometric confidence; (2) Depth-Aware Chain-of-Thought (DA-CoT) explicitly models four kinds of 3D spatial reasoning steps; (3) cross-attention and a flow-matching action expert decode 7-DoF actions. Result: 96.4% on LIBERO and 80.2% on SimplerEnv, improvements of 2.0% and 5.4% over the baseline; ablations show that each module contributes independent and synergistic gains. Conclusion: Bringing explicit 3D geometric representation and structured spatial reasoning into the VLA framework improves precision and generalization on complex manipulation tasks. Abstract: VLA models encode visual observations as 2D patch tokens with no intrinsic geometric structure. We introduce GST-VLA with two contributions. First, the Gaussian Spatial Tokenizer (GST) converts frozen dense depth and frozen semantic patch features into $N_g{=}128$ anisotropic 3D Gaussian primitives, each parameterized by a metric residual mean $\mu \in \mathbb{R}^3$, log-scale covariance $\log \sigma \in \mathbb{R}^3$, and learned opacity $\alpha \in (0,1)$. The covariance eigenstructure encodes local surface orientation, and opacity provides per-primitive geometric confidence, both inaccessible from scalar depth. Spatial attention pooling with learned queries concentrates the fixed token budget on geometrically salient regions rather than distributing uniformly. Second, 3D Depth-Aware Chain-of-Thought (DA-CoT) reasoning supervises four structured intermediate spatial thoughts, covering 3D object grounding, grasp affordance contact geometry, pairwise metric distances, and coarse SE(3) waypoints, as explicit generation targets in the training loss. A cross-attention sublayer at every VLM transformer block provides direct access to the raw 256-primitive Gaussian field during DA-CoT generation. A 300M-parameter flow-matching action expert with mixture-of-experts feedforward sublayers decodes 7-DoF delta action chunks via conditional ODE integration, conditioned on both VLM hidden states and DA-CoT outputs through dual cross-attention. Trained with composite $\mathcal{L}_\mathrm{flow} + \mathcal{L}_\mathrm{CoT} + \mathcal{L}_\mathrm{depth}$ across three progressive stages, GST-VLA achieves 96.4% on LIBERO (+2.0%), and 80.2% on SimplerEnv (+5.4%).
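Decoding by "conditional ODE integration" can be pictured with a toy example: start from a noise sample and take Euler steps along a velocity field until t = 1. The sketch below uses a hand-written straight-line velocity field toward a fixed target in place of the learned, conditioned network, so the function names, the target action, and the step count are illustrative assumptions and say nothing about the paper's actual solver.

```python
def euler_integrate(velocity, x0, steps=10):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays strictly below 1, so (1 - t) is never zero
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

def toy_velocity(target):
    """Stand-in for a learned velocity network: points straight at `target`.

    For a straight-line flow the velocity at (x, t) is (target - x) / (1 - t).
    """
    def velocity(x, t):
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity

# A hypothetical 7-DoF delta action used as the endpoint of the toy flow.
target_action = [0.1, -0.2, 0.05, 0.0, 0.3, -0.1, 1.0]
noise = [0.5, 0.5, -0.5, 1.0, -1.0, 0.2, 0.0]
action = euler_integrate(toy_velocity(target_action), noise, steps=10)
```

In a real flow-matching policy the velocity would come from a network conditioned on the model's hidden states, and the integration would produce a chunk of actions rather than a single vector.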
Ablations isolate the contribution of each GST component, each DA-CoT thought, and each training stage, confirming independent and synergistic gains concentrated on precision-demanding tasks.
[65] OmniEdit: A Training-free framework for Lip Synchronization and Audio-Visual Editing
Lixiang Lin,Siyuan Jin,Jinshan Zhang
Main category: cs.CV
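TL;DR: This paper proposes OmniEdit, a training-free framework for lip synchronization and audio-visual editing that revises the editing paradigm of FlowEdit and removes randomness from the generation process, yielding more stable, unbiased edits.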
Details
Motivation: Most existing lip synchronization and audio-visual editing methods rely on supervised fine-tuning of pre-trained models, which carries high computational cost and data requirements. Method: The OmniEdit framework replaces the edit sequence in FlowEdit with the target sequence to obtain an unbiased estimate, and removes stochastic elements from the generation process to build a smooth, stable editing trajectory. Result: Extensive experiments validate the effectiveness and robustness of OmniEdit on lip synchronization and audio-visual editing tasks. Conclusion: OmniEdit is an efficient, training-free, general audio-visual editing framework that offers a new approach to multimodal editing. Abstract: Lip synchronization and audio-visual editing have emerged as fundamental challenges in multimodal learning, underpinning a wide range of applications, including film production, virtual avatars, and telepresence. Despite recent progress, most existing methods for lip synchronization and audio-visual editing depend on supervised fine-tuning of pre-trained models, leading to considerable computational overhead and data requirements. In this paper, we present OmniEdit, a training-free framework designed for both lip synchronization and audio-visual editing. Our approach reformulates the editing paradigm by substituting the edit sequence in FlowEdit with the target sequence, yielding an unbiased estimation of the desired output. Moreover, by removing stochastic elements from the generation process, we establish a smooth and stable editing trajectory. Extensive experimental results validate the effectiveness and robustness of the proposed framework. Code is available at https://github.com/l1346792580123/OmniEdit.
[66] Chain of Event-Centric Causal Thought for Physically Plausible Video Generation
Zixuan Wang,Yixin Hu,Haolan Wang,Feng Chen,Yan Liu,Wen Li,Yinjie Lei
Main category: cs.CV
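TL;DR: This paper proposes a new paradigm for physically plausible video generation (PPVG) that treats video generation as a sequence of causally connected, dynamically evolving events, and designs two modules, physics-driven event chain reasoning and transition-aware cross-modal prompting, which improve the plausibility and continuity of modeled physical phenomena.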
Details
Motivation: Existing video diffusion models cannot model the causal progression of physical phenomena; they rely on prompts to generate a single-frame or momentary physical state, which makes it hard to keep physical processes in a video coherent and plausible. Method: Two modules: (1) physics-driven event chain reasoning, which uses chain-of-thought to decompose the physical phenomena in a prompt into elementary event units and embeds physical formulas as causal constraints; (2) Transition-aware Cross-modal Prompting (TCP), which turns causal event units into temporally aligned vision-language prompts, synthesizing visual keyframes through interactive editing and producing a coherent narrative. Result: Outperforms existing methods on the PhyGenBench and VideoPhy benchmarks, generating videos that better follow physical laws across several physical domains (such as rigid-body motion, fluids, and collisions). Conclusion: Modeling PPVG as a causal event sequence is an effective route; explicit physical formula constraints and cross-modal temporal prompting improve the physical plausibility and dynamic continuity of generated videos. Abstract: Physically Plausible Video Generation (PPVG) has emerged as a promising avenue for modeling real-world physical phenomena. PPVG requires an understanding of commonsense knowledge, which remains a challenge for video diffusion models. Current approaches leverage the commonsense reasoning capability of large language models to embed physical concepts into prompts. However, generation models often render physical phenomena as a single moment defined by prompts, due to the lack of conditioning mechanisms for modeling causal progression. In this paper, we view PPVG as generating a sequence of causally connected and dynamically evolving events. To realize this paradigm, we design two key modules: (1) Physics-driven Event Chain Reasoning. This module decomposes the physical phenomena described in prompts into multiple elementary event units, leveraging chain-of-thought reasoning. To mitigate causal ambiguity, we embed physical formulas as constraints to impose deterministic causal dependencies during reasoning. (2) Transition-aware Cross-modal Prompting (TCP). To maintain continuity between events, this module transforms causal event units into temporally aligned vision-language prompts. It summarizes discrete event descriptions to obtain causally consistent narratives, while progressively synthesizing visual keyframes of individual events by interactive editing. Comprehensive experiments on PhyGenBench and VideoPhy benchmarks demonstrate that our framework achieves superior performance in generating physically plausible videos across diverse physical domains.
Our code will be released soon.
[67] MedKCO: Medical Vision-Language Pretraining via Knowledge-Driven Cognitive Orchestration
Chenran Zhang,Ruiqi Wu,Tao Zhou,Yi Zhou
Main category: cs.CV
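TL;DR: This paper proposes MedKCO, a knowledge-driven cognitive orchestration method for medical vision-language pretraining. A two-level curriculum learning strategy and a self-paced asymmetric contrastive loss improve generalization under distribution shift, and the method significantly outperforms baselines on multiple downstream tasks.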
Details
Motivation: Current medical vision-language pretraining methods learn simple and complex concepts at the same time, which runs against how cognition works and leads to suboptimal feature representations, especially under distribution shift.
Method: The MedKCO framework (1) orders the pretraining data with a two-level curriculum based on diagnostic sensitivity and intra-class sample representativeness, and (2) introduces a self-paced asymmetric contrastive loss that dynamically adjusts the participation of the vision-language contrastive objective.
Result: It significantly surpasses several curriculum learning baselines in three medical imaging scenarios and multiple vision-language downstream tasks.
Conclusion: A knowledge-driven cognitive orchestration strategy effectively improves the robustness and generalization of medical VLP models and offers a new direction for medical multimodal pretraining.
Abstract: Medical vision-language pretraining (VLP) models have recently been investigated for their generalization to diverse downstream tasks. However, current medical VLP methods typically force the model to learn simple and complex concepts simultaneously. This anti-cognitive process leads to suboptimal feature representations, especially under distribution shift. To address this limitation, we propose a Knowledge-driven Cognitive Orchestration for Medical VLP (MedKCO) that involves both the ordering of the pretraining data and the learning objective of vision-language contrast. Specifically, we design a two-level curriculum by incorporating diagnostic sensitivity and intra-class sample representativeness for the ordering of the pretraining data. Moreover, considering the inter-class similarity of medical images, we introduce a self-paced asymmetric contrastive loss to dynamically adjust the participation of the pretraining objective. We evaluate the proposed pretraining method on three medical imaging scenarios in multiple vision-language downstream tasks, and compare it with several curriculum learning methods. Extensive experiments show that our method significantly surpasses all baselines. https://github.com/Mr-Talon/MedKCO
[68] Training-free Motion Factorization for Compositional Video Generation
Zixuan Wang,Ziqin Zhou,Feng Chen,Duo Peng,Yixin Hu,Changsheng Li,Yinjie Lei
Main category: cs.CV
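TL;DR: This paper proposes a motion factorization framework that decomposes complex motion into three categories (motionlessness, rigid motion, and non-rigid motion) and uses a planning-before-generation paradigm to improve how video generation understands and synthesizes motion semantics.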
Details
Motivation: Existing video generation methods focus mainly on semantic binding and neglect the diverse motion categories specified in prompts, so they struggle to synthesize videos with multiple instances and multiple motion types accurately.
Method: The motion factorization framework has two stages: (1) planning, which reasons over a motion graph to obtain frame-wise deformation and displacement for each instance; (2) generation, which models the three motion categories separately through conditioned guidance branches, modulating motion in a disentangled manner inside the diffusion model. The whole framework is model-agnostic.
Result: Motion synthesis performance improves significantly on real-world benchmarks, confirming that the framework is effective and general for multi-instance, multi-motion video generation.
Conclusion: Motion factorization combined with a planning-then-generation paradigm reduces prompt ambiguity and strengthens motion controllability, offers a new direction for compositional video generation, and extends across model architectures.
Abstract: Compositional video generation aims to synthesize multiple instances with diverse appearance and motion, which is widely applicable in real-world scenarios. However, current approaches mainly focus on binding semantics, neglecting to understand diverse motion categories specified in prompts. In this paper, we propose a motion factorization framework that decomposes complex motion into three primary categories: motionlessness, rigid motion, and non-rigid motion. Specifically, our framework follows a planning-before-generation paradigm. (1) During planning, we reason about motion laws on the motion graph to obtain frame-wise changes in the shape and position of each instance. This alleviates semantic ambiguities in the user prompt by organizing it into a structured representation of instances and their interactions. (2) During generation, we modulate the synthesis of distinct motion categories in a disentangled manner. Conditioned on the motion cues, guidance branches stabilize appearance in motionless regions, preserve rigid-body geometry, and regularize local non-rigid deformations. Crucially, our two modules are model-agnostic and can be seamlessly incorporated into various diffusion model architectures. Extensive experiments demonstrate that our framework achieves impressive performance in motion synthesis on real-world benchmarks. Our code will be released soon.
[69] Composed Vision-Language Retrieval for Skin Cancer Case Search via Joint Alignment of Global and Local Representations
Yuheng Wang,Yuji Lin,Dongrun Zhu,Jiayue Cai,Sunil Kalia,Harvey Lui,Chunqi Chang,Z. Jane Wang,Tim K. Lee
Main category: cs.CV
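TL;DR: This paper proposes a transformer-based multimodal retrieval framework for composed image-text retrieval of skin cancer cases, improving retrieval performance through global-local alignment and a clinically oriented weighting strategy.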
Details
Motivation: Medical image retrieval is valuable for clinical diagnosis, education, and quality control, but real queries often combine an image with text (such as dermoscopic features), and existing methods struggle to fuse the two modalities while remaining clinically relevant.
Method: A transformer framework learns hierarchical representations of the composed image-text query and applies joint global-local alignment. Local alignment aggregates discriminative regions through multiple spatial attention masks, while global alignment provides holistic semantic supervision. The final similarity is a convex, domain-informed weighting that emphasizes clinically salient local evidence while preserving global consistency.
Result: Experiments on the public Derm7pt dataset show consistent improvements over state-of-the-art methods in retrieval accuracy and clinical usefulness.
Conclusion: The framework retrieves relevant medical records efficiently, supports practical clinical deployment, and offers an effective approach to multimodal medical image retrieval.
Abstract: Medical image retrieval aims to identify clinically relevant lesion cases to support diagnostic decision making, education, and quality control. In practice, retrieval queries often combine a reference lesion image with textual descriptors such as dermoscopic features. We study composed vision-language retrieval for skin cancer, where each query consists of an image-text pair and the database contains biopsy-confirmed, multi-class disease cases. We propose a transformer-based framework that learns hierarchical composed query representations and performs joint global-local alignment between queries and candidate images. Local alignment aggregates discriminative regions via multiple spatial attention masks, while global alignment provides holistic semantic supervision. The final similarity is computed through a convex, domain-informed weighting that emphasizes clinically salient local evidence while preserving global consistency. Experiments on the public Derm7pt dataset demonstrate consistent improvements over state-of-the-art methods. The proposed framework enables efficient access to relevant medical records and supports practical clinical deployment.
[70] VIVID-Med: LLM-Supervised Structured Pretraining for Deployable Medical ViTs
Xiyao Wang,Xiaoyu Tan,Yang Dai,Yuxuan Fu,Shuo Li,Xihe Qiu
Main category: cs.CV
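TL;DR: VIVID-Med uses a frozen large language model as a structured semantic teacher and pretrains a lightweight ViT with a Unified Medical Schema (UMS) and Structured Prediction Decomposition (SPD). It significantly outperforms existing methods on several medical imaging tasks and needs no large model at deployment.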
Details
Motivation: Existing vision-language pretraining methods for medical image analysis supervise the visual encoder with one-hot labels or free-form text, neither of which captures the complex semantic relationships among clinical findings.
Method: The VIVID-Med framework (1) uses a frozen large language model as a structured semantic teacher; (2) maps clinical findings to verifiable JSON field-state pairs based on the Unified Medical Schema (UMS), with answerability-aware masking; (3) regularizes cross-attention through Structured Prediction Decomposition (SPD) to extract complementary visual features; (4) discards the LLM after training and keeps only the lightweight ViT backbone.
Result: CheXpert linear probing reaches a macro-AUC of 0.8588 (+6.65 points over BiomedCLIP with 1/500 of the data); zero-shot transfer to NIH ChestX-ray14 reaches 0.7225; on cross-modality CT tasks, LIDC-IDRI lung nodule classification reaches an AUC of 0.8413 and OrganAMNIST 11-organ classification a macro-AUC of 0.9969.
Conclusion: VIVID-Med provides an efficient, scalable paradigm for learning medical visual representations without deploying a large model, suited to resource-constrained clinical settings.
Abstract: Vision-language pretraining has driven significant progress in medical image analysis. However, current methods typically supervise visual encoders using one-hot labels or free-form text, neither of which effectively captures the complex semantic relationships among clinical findings. In this study, we introduce VIVID-Med, a novel framework that leverages a frozen large language model (LLM) as a structured semantic teacher to pretrain medical vision transformers (ViTs). VIVID-Med translates clinical findings into verifiable JSON field-state pairs via a Unified Medical Schema (UMS), utilizing answerability-aware masking to focus optimization. It then employs Structured Prediction Decomposition (SPD) to partition cross-attention into orthogonality-regularized query groups, extracting complementary visual aspects. Crucially, the LLM is discarded post-training, yielding a lightweight, deployable ViT-only backbone. We evaluated VIVID-Med across multiple settings: on CheXpert linear probing, it achieves a macro-AUC of 0.8588, outperforming BiomedCLIP by +6.65 points while using 500x less data. It also demonstrates robust zero-shot cross-domain transfer to NIH ChestX-ray14 (0.7225 macro-AUC) and strong cross-modality generalization to CT, achieving 0.8413 AUC on LIDC-IDRI lung nodule classification and 0.9969 macro-AUC on OrganAMNIST 11-organ classification.
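The orthogonality regularization that the VIVID-Med abstract mentions for its query groups can be illustrated with a generic penalty. This is a minimal sketch under stated assumptions, not the authors' loss: each query group is assumed to be summarized by a single vector, and the penalty is the squared Frobenius norm of the difference between the Gram matrix of the unit-normalized vectors and the identity. The function name `orthogonality_penalty` is hypothetical.

```python
import math


def orthogonality_penalty(groups):
    """Return ||G G^T - I||_F^2 for the row-normalized group vectors G.

    `groups` is a list of equal-length, non-zero vectors (one per query
    group). The value is 0 when the groups are mutually orthogonal and
    grows as they align. Generic regularizer, assumed for illustration.
    """
    unit = []
    for g in groups:
        norm = math.sqrt(sum(x * x for x in g))
        unit.append([x / norm for x in g])
    penalty = 0.0
    for i, u in enumerate(unit):
        for j, v in enumerate(unit):
            dot = sum(a * b for a, b in zip(u, v))
            target = 1.0 if i == j else 0.0
            penalty += (dot - target) ** 2
    return penalty
```

Adding a term like this to a training loss pushes the groups toward attending to different aspects of the input; how the paper weights or formulates its own term is not stated in the abstract.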
VIVID-Med offers a highly efficient, scalable alternative to deploying resource-heavy vision-language models in clinical settings.
[71] Progressive Representation Learning for Multimodal Sentiment Analysis with Incomplete Modalities
Jindi Bao,Jianjun Qian,Mengkai Yan,Jian Yang
Main category: cs.CV
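TL;DR: This paper proposes the PRLF framework, which uses an Adaptive Modality Reliability Estimator (AMRE) and a Progressive Interaction module (ProgInteract) to make multimodal sentiment analysis more robust and generalizable when modalities are missing in uncertain ways.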
Details
Motivation: Existing methods rely on complete modalities, but in practice modalities are often missing (because of noise, hardware failures, or privacy restrictions), which misaligns features and damages the representations of the intact modalities.
Method: The PRLF framework contains an AMRE module that dynamically estimates the reliability of each modality from recognition confidence and Fisher information, and a ProgInteract module that progressively aligns the other modalities with the dominant one.
Result: On the CMU-MOSI, CMU-MOSEI, and SIMS datasets, PRLF surpasses state-of-the-art methods in both inter-modality and intra-modality missing scenarios.
Conclusion: PRLF effectively mitigates the negative effect of missing modalities and improves the robustness and generalization of multimodal sentiment analysis in real-world settings.
Abstract: Multimodal Sentiment Analysis (MSA) seeks to infer human emotions by integrating textual, acoustic, and visual cues. However, existing approaches often assume that all modalities are complete, whereas real-world applications frequently encounter noise, hardware failures, or privacy restrictions that result in missing modalities. There exists a significant feature misalignment between incomplete and complete modalities, and directly fusing them may even distort the well-learned representations of the intact modalities. To this end, we propose PRLF, a Progressive Representation Learning Framework designed for MSA under uncertain missing-modality conditions. PRLF introduces an Adaptive Modality Reliability Estimator (AMRE), which dynamically quantifies the reliability of each modality using recognition confidence and Fisher information to determine the dominant modality. In addition, the Progressive Interaction (ProgInteract) module iteratively aligns the other modalities with the dominant one, thereby enhancing cross-modal consistency while suppressing noise. Extensive experiments on CMU-MOSI, CMU-MOSEI, and SIMS verify that PRLF outperforms state-of-the-art methods across both inter- and intra-modality missing scenarios, demonstrating its robustness and generalization capability.
[72] QUSR: Quality-Aware and Uncertainty-Guided Image Super-Resolution Diffusion Model
Junjie Yin,Jiaju Li,Hanfa Xing
Main category: cs.CV
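TL;DR: This paper proposes QUSR, a super-resolution diffusion model that adds a Quality-Aware Prior (QAP) and an Uncertainty-Guided Noise Generation (UNG) module to handle the unknown, spatially non-uniform degradations of real-world images, significantly improving the fidelity and realism of the reconstructions.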
Details
Motivation: Existing diffusion-based image super-resolution methods perform poorly on real-world degradations that are unknown and spatially non-uniform, and tend to lose details or produce visual artifacts.
Method: The QUSR model has two core modules: (1) an Uncertainty-Guided Noise Generation (UNG) module that adapts the noise injection intensity to local uncertainty; (2) a Quality-Aware Prior (QAP) module that uses a Multimodal Large Language Model (MLLM) to generate interpretable quality descriptions as a guiding prior.
Result: Experiments show that QUSR produces high-fidelity, highly realistic super-resolved images in real-world scenarios.
Conclusion: By combining uncertainty modeling with a semantic quality prior, QUSR offers a more robust and interpretable approach to image super-resolution under complex degradations.
Abstract: Diffusion-based image super-resolution (ISR) has shown strong potential, but it still struggles in real-world scenarios where degradations are unknown and spatially non-uniform, often resulting in lost details or visual artifacts. To address this challenge, we propose a novel super-resolution diffusion model, QUSR, which integrates a Quality-Aware Prior (QAP) with an Uncertainty-Guided Noise Generation (UNG) module. The UNG module adaptively adjusts the noise injection intensity, applying stronger perturbations to high-uncertainty regions (e.g., edges and textures) to reconstruct complex details, while minimizing noise in low-uncertainty regions (e.g., flat areas) to preserve original information. Concurrently, the QAP leverages an advanced Multimodal Large Language Model (MLLM) to generate reliable quality descriptions, providing an effective and interpretable quality prior for the restoration process. Experimental results confirm that QUSR can produce high-fidelity and high-realism images in real-world scenarios. The source code is available at https://github.com/oTvTog/QUSR.
[73] Transformer-Based Multi-Region Segmentation and Radiomic Analysis of HR-pQCT Imaging
Mohseu Rashid Subah,Mohammed Abdul Gani Zilani,Thomas L. Nickolas,Matthew R. Allen,Stuart J. Warden,Rachel K. Surowiec
Main category: cs.CV
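TL;DR: This paper proposes a fully automated framework for multi-region segmentation and soft-tissue radiomic analysis of HR-pQCT images. It is the first to apply SegFormer to HR-pQCT segmentation, and it finds that radiomic features of soft tissue (especially myotendinous tissue) outperform conventional bone parameters for osteoporosis classification.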
Details
Motivation: Existing HR-pQCT analysis focuses mainly on mineralized bone structure and ignores a large amount of soft-tissue information; DXA provides only areal density and lacks microstructural and soft-tissue information, which limits precise osteoporosis assessment.
Method: SegFormer performs fully automated segmentation of the cortical and trabecular bone of the tibia and fibula and of the surrounding soft tissues (skin, myotendinous, adipose). From each region, 939 radiomic features are extracted and dimensionally reduced, then used to train six machine learning models for binary classification. Performance is validated on 20,496 images from 122 scans.
Result: SegFormer reaches a mean F1 score of 95.36%. Myotendinous features give an image-level accuracy of 80.08% and an AUROC of 0.85, better than bone features; patient-level AUROC rises from 0.792 to 0.875.
Conclusion: Soft-tissue radiomics can significantly improve osteoporosis classification, and automated multi-region HR-pQCT analysis offers a new way to integrate bone and soft-tissue assessment.
Abstract: Osteoporosis is a skeletal disease typically diagnosed using dual-energy X-ray absorptiometry (DXA), which quantifies areal bone mineral density but overlooks bone microarchitecture and surrounding soft tissues. High-resolution peripheral quantitative computed tomography (HR-pQCT) enables three-dimensional microstructural imaging with minimal radiation. However, current analysis pipelines largely focus on mineralized bone compartments, leaving much of the acquired image data underutilized. We introduce a fully automated framework for binary osteoporosis classification using radiomics features extracted from anatomically segmented HR-pQCT images. To our knowledge, this work is the first to leverage a transformer-based segmentation architecture, i.e., the SegFormer, for fully automated multi-region HR-pQCT analysis. The SegFormer model simultaneously delineated the cortical and trabecular bone of the tibia and fibula along with surrounding soft tissues and achieved a mean F1 score of 95.36%. Soft tissues were further subdivided into skin, myotendinous, and adipose regions through post-processing. From each region, 939 radiomic features were extracted and dimensionally reduced to train six machine learning classifiers on an independent dataset comprising 20,496 images from 122 HR-pQCT scans. The best image-level performance was achieved using myotendinous tissue features, yielding an accuracy of 80.08% and an area under the receiver operating characteristic curve (AUROC) of 0.85, outperforming bone-based models.
At the patient level, replacing standard biological, DXA, and HR-pQCT parameters with soft tissue radiomics improved AUROC from 0.792 to 0.875. These findings demonstrate that automated, multi-region HR-pQCT segmentation enables the extraction of clinically informative signals beyond bone alone, highlighting the importance of integrated tissue assessment for osteoporosis detection.
[74] Rotation Equivariant Mamba for Vision Tasks
Zhongchen Zhao,Qi Xie,Keyu Huang,Lei Zhang,Deyu Meng,Zongben Xu
Main category: cs.CV
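TL;DR: This paper proposes EQ-VMamba, the first rotation equivariant visual Mamba architecture. A rotation equivariant cross-scan strategy and group Mamba blocks give end-to-end rotation equivariance, and the model achieves superior or comparable performance on several vision tasks with roughly 50% fewer parameters.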
Details
Motivation: Existing Mamba-based vision models lack rotation equivariance, which makes them sensitive to image rotations and limits their robustness and cross-task generalization.
Method: EQ-VMamba combines a rotation equivariant cross-scan strategy with group Mamba blocks and is accompanied by a rigorous theoretical analysis of the equivariance error.
Result: On benchmarks covering image classification, semantic segmentation, and super-resolution, EQ-VMamba outperforms or matches non-equivariant baselines while using about 50% fewer parameters.
Conclusion: Embedding rotation equivariance significantly improves the rotation robustness, overall performance, and parameter efficiency of visual Mamba models.
Abstract: Rotation equivariance constitutes one of the most general and crucial structural priors for visual data, yet it remains notably absent from current Mamba-based vision architectures. Despite the success of Mamba in natural language processing and its growing adoption in computer vision, existing visual Mamba models fail to account for rotational symmetry in their design. This omission renders them inherently sensitive to image rotations, thereby constraining their robustness and cross-task generalization. To address this limitation, we propose to incorporate rotation symmetry, a universal and fundamental geometric prior in images, into Mamba-based architectures. Specifically, we introduce EQ-VMamba, the first rotation equivariant visual Mamba architecture for vision tasks. The core components of EQ-VMamba include a carefully designed rotation equivariant cross-scan strategy and group Mamba blocks. Moreover, we provide a rigorous theoretical analysis of the intrinsic equivariance error, demonstrating that the proposed architecture enforces end-to-end rotation equivariance throughout the network. Extensive experiments across multiple benchmarks - including high-level image classification task, mid-level semantic segmentation task, and low-level image super-resolution task - demonstrate that EQ-VMamba achieves superior or competitive performance compared to non-equivariant baselines, while requiring approximately 50% fewer parameters. These results indicate that embedding rotation equivariance not only effectively bolsters the robustness of visual Mamba models against rotation transformations, but also enhances overall performance with significantly improved parameter efficiency.
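The notion of equivariance error used in the EQ-VMamba abstract can be made concrete with a small numerical check. The sketch below is an illustration under stated assumptions and is not the EQ-VMamba cross-scan: `row_scan` is a toy directional operator (a running sum along rows) standing in for a one-directional scan, and `symmetrized_scan` applies generic group averaging over the four 90-degree rotations, which is one standard way to obtain an equivariant operator. All function names are hypothetical, and inputs are assumed to be square grids.

```python
def rot90(grid):
    """Rotate a square grid (list of lists) by 90 degrees counter-clockwise."""
    n = len(grid)
    return [[grid[c][n - 1 - r] for c in range(n)] for r in range(n)]


def rot(grid, k):
    """Apply rot90 k times; negative k rotates the other way (k taken mod 4)."""
    for _ in range(k % 4):
        grid = rot90(grid)
    return grid


def row_scan(grid):
    """Toy directional scan: running sum along each row, left to right."""
    out = []
    for row in grid:
        acc, new_row = 0, []
        for value in row:
            acc += value
            new_row.append(acc)
        out.append(new_row)
    return out


def symmetrized_scan(grid):
    """Average rot^-k(row_scan(rot^k(x))) over k = 0..3 (group averaging)."""
    n = len(grid)
    total = [[0] * n for _ in range(n)]
    for k in range(4):
        y = rot(row_scan(rot(grid, k)), -k)
        for r in range(n):
            for c in range(n):
                total[r][c] += y[r][c]
    return [[v / 4 for v in row] for row in total]


def equivariance_error(op, grid):
    """Largest entry-wise gap between op(rot90(x)) and rot90(op(x))."""
    a, b = op(rot90(grid)), rot90(op(grid))
    n = len(grid)
    return max(abs(a[r][c] - b[r][c]) for r in range(n) for c in range(n))
```

On a grid such as `[[1, 2, 3], [4, 5, 6], [7, 8, 9]]`, `equivariance_error(row_scan, grid)` is positive, while `equivariance_error(symmetrized_scan, grid)` is zero, which is the property an equivariant architecture is meant to guarantee by construction.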
Code is available at https://github.com/zhongchenzhao/EQ-VMamba.
[75] Agentic AI as a Network Control-Plane Intelligence Layer for Federated Learning over 6G
Loc X. Nguyen,Ji Su Yoon,Huy Q. Le,Yu Qiao,Avi Deb Raha,Eui-Nam Huh,Nguyen H. Tran,Choong Seon Hong
Main category: cs.CV
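TL;DR: This paper proposes an Agentic AI control layer that manages user-customized on-device federated learning (FL) over 6G networks, translating high-level task goals into network-aware actions so that learning and network management are optimized together.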
Details
Motivation: User-customized on-device learning driven by distributed data places new demands on wireless systems for low latency, high bandwidth, and high reliability, and conventional FL methods struggle to balance learning performance against dynamic network constraints.
Method: An Agentic AI control layer is built from four kinds of specialized agents (retrieval, planning, coding, and evaluation). It integrates monitoring tools and optimization methods and handles client selection, incentive design, scheduling, resource allocation, adaptive local training, and code generation in a closed loop.
Result: A case study shows that the Agentic AI system keeps refining its decisions under changing network conditions such as signal-to-noise ratio, bandwidth, and device capability, and achieves high-performance federated learning.
Conclusion: Treating federated learning as a joint task of learning and network management, with Agentic AI that can sense, reason, and optimize in a closed loop as the control layer, is a key paradigm for efficient, robust, user-customized on-device learning in the 6G era.
Abstract: The shift toward user-customized on-device learning places new demands on wireless systems: models must be trained on diverse, distributed data while meeting strict latency, bandwidth, and reliability constraints. To address this, we propose Agentic AI as the control layer for managing federated learning (FL) over 6G networks, which translates high-level task goals into actions that are aware of network conditions. Rather than simply viewing FL as a learning challenge, our system sees it as a combined task of learning and network management. A set of specialized agents focused on retrieval, planning, coding, and evaluation utilizes monitoring tools and optimization methods to handle client selection, incentive structuring, scheduling, resource allocation, adaptive local training, and code generation. The use of closed-loop evaluation and memory allows the system to consistently refine its decisions, taking into account varying signal-to-noise ratios, bandwidth conditions, and device capabilities. Finally, our case study has demonstrated the effectiveness of the Agentic AI system's use of tools for achieving high performance.
[76] RTFDNet: Fusion-Decoupling for Robust RGB-T Segmentation
Kunyu Tan,Mingjian Liang
Main category: cs.CV
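TL;DR: This paper proposes RTFDNet, a three-branch encoder-decoder network that unifies modality fusion and decoupling for RGB-T semantic segmentation through Synergistic Feature Fusion (SFF), Cross-Modal Decouple Regularization (CMDR), and Region Decouple Regularization (RDR). It is more robust when a modality is missing and supports efficient single-stage inference.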
Details
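Motivation: Conventional RGB-T semantic segmentation methods overemphasize modality balance, so they are not robust and degrade severely when sensor signals are partially missing. Recent methods such as cross-modal knowledge distillation and modality-adaptive fine-tuning usually decouple modality fusion from adaptation and depend on multi-stage training or teacher-student frameworks, which is neither efficient nor flexible.
Method: RTFDNet consists of (1) a three-branch encoder-decoder architecture; (2) a Synergistic Feature Fusion (SFF) module that performs channel-wise gated exchange and lightweight spatial attention; (3) Cross-Modal Decouple Regularization (CMDR), which isolates modality-specific components from the fused representation and supervises the unimodal decoders with stop-gradient targets; (4) Region Decouple Regularization (RDR), which enforces class-selective prediction consistency in confident regions while blocking gradients to the fusion branch.
Result: The model performs consistently well under various missing-modality conditions and supports efficient single-stage inference at test time without extra modules; the code is open source.
Conclusion: By unifying fusion and decoupling, RTFDNet strengthens the unimodal paths without sacrificing fused performance, and significantly improves the robustness and practicality of RGB-T semantic segmentation on complex, incomplete inputs.
Abstract: RGB-Thermal (RGB-T) semantic segmentation is essential for robotic systems operating in low-light or dark environments. However, traditional approaches often overemphasize modality balance, resulting in limited robustness and severe performance degradation when sensor signals are partially missing. Recent advances such as cross-modal knowledge distillation and modality-adaptive fine-tuning attempt to enhance cross-modal interaction, but they typically decouple modality fusion and modality adaptation, requiring multi-stage training with frozen models or teacher-student frameworks. We present RTFDNet, a three-branch encoder-decoder that unifies fusion and decoupling for robust RGB-T segmentation. Synergistic Feature Fusion (SFF) performs channel-wise gated exchange and lightweight spatial attention to inject complementary cues. Cross-Modal Decouple Regularization (CMDR) isolates modality-specific components from the fused representation and supervises unimodal decoders via stop-gradient targets. Region Decouple Regularization (RDR) enforces class-selective prediction consistency in confident regions while blocking gradients to the fusion branch. This feedback loop strengthens unimodal paths without degrading the fused stream, enabling efficient standalone inference at test time. Extensive experiments demonstrate the effectiveness of RTFDNet, showing consistent performance across varying modality conditions. Our implementation will be released to facilitate further research.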
Our source code is publicly available at https://github.com/curapima/RTFDNet.
[77] RubiCap: Rubric-Guided Reinforcement Learning for Dense Image Captioning
Tzu-Heng Huang,Sirajul Salekin,Javier Movellan,Frederic Sala,Manjot Bilkhu
Main category: cs.CV
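TL;DR: This paper proposes RubiCap, a reinforcement learning framework for dense image captioning that uses fine-grained, sample-specific rubrics written by a large language model (LLM) as the reward signal. It significantly improves output diversity and generalization and surpasses supervised distillation, prior RL methods, and human-expert annotations on multiple benchmarks.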
Details
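Motivation: Dense image captioning is essential for vision-language pretraining and text-to-image generation, but high-quality human annotation is too expensive. Synthetic annotation is feasible, yet supervised distillation yields limited diversity and weak generalization, and conventional reinforcement learning is hard to apply to open-ended captioning because it lacks a verifiable deterministic checker.
Method: RubiCap assembles a committee of candidate captions, has an LLM write rubrics that capture consensus strengths and diagnose deficiencies of the current policy, converts them into explicit evaluation criteria, and lets an LLM judge produce structured, multi-faceted quality assessments in place of coarse scalar rewards.
Result: RubiCap achieves the highest win rates on CapArena, ahead of supervised distillation, prior RL methods, human annotations, and GPT-4V-augmented outputs. On CaptionQA it shows superior word efficiency: the 7B model matches Qwen2.5-VL-32B-Instruct, and the 3B model surpasses its 7B counterpart. VLMs trained on captions generated by RubiCap-3B are stronger than VLMs trained on captions from proprietary models.
Conclusion: With fine-grained, structured, LLM-driven rewards, RubiCap addresses the core bottleneck of reinforcement learning for open-ended image captioning, namely the lack of a reliable evaluation signal, and offers a new approach to low-cost, high-quality synthetic annotation.
Abstract: Dense image captioning is critical for cross-modal alignment in vision-language pretraining and text-to-image generation, but scaling expert-quality annotations is prohibitively expensive. While synthetic captioning via strong vision-language models (VLMs) is a practical alternative, supervised distillation often yields limited output diversity and weak generalization. Reinforcement learning (RL) could overcome these limitations, but its successes have so far been concentrated in verifiable domains that rely on deterministic checkers -- a luxury not available in open-ended captioning. We address this bottleneck with RubiCap, a novel RL framework that derives fine-grained, sample-specific reward signals from LLM-written rubrics. RubiCap first assembles a diverse committee of candidate captions, then employs an LLM rubric writer to extract consensus strengths and diagnose deficiencies in the current policy. These insights are converted into explicit evaluation criteria, enabling an LLM judge to decompose holistic quality assessment and replace coarse scalar rewards with structured, multi-faceted evaluations. Across extensive benchmarks, RubiCap achieves the highest win rates on CapArena, outperforming supervised distillation, prior RL methods, human-expert annotations, and GPT-4V-augmented outputs. On CaptionQA, it demonstrates superior word efficiency: our 7B model matches Qwen2.5-VL-32B-Instruct, and our 3B model surpasses its 7B counterpart.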
Remarkably, using the compact RubiCap-3B as a captioner produces stronger pretrained VLMs than those trained on captions from proprietary models.
[78] Progressive Split Mamba: Effective State Space Modelling for Image Restoration
Mohammed Hassanin,Nour Moustafa,Weijian Deng,Ibrahim Radwan
Main category: cs.CV
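TL;DR: This paper proposes PS-Mamba, a topology-aware hierarchical state-space model. Geometry-consistent partitioning and symmetric cross-scale shortcut pathways address the locality distortion and long-range decay that 1D serialization causes when Mamba is used for image restoration, and the model significantly outperforms existing Mamba-based and attention-based models on super-resolution, denoising, and JPEG artifact reduction.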
Details
Motivation: Among existing image restoration methods, CNNs have limited receptive fields and Transformers have high computational complexity. SSMs such as Mamba offer long-range modelling at linear complexity, but applying them directly to 2D images breaks spatial topology and causes long-range information decay, which hurts high-fidelity reconstruction.
Method: Progressive Split-Mamba (PS-Mamba) uses (1) a geometry-consistent hierarchical split (halves, then quadrants, then octants) that avoids flattening the whole image into 1D and keeps neighbourhoods intact; (2) symmetric cross-scale shortcut pathways that directly transmit low-frequency global context to counter long-range decay.
Result: On super-resolution, image denoising, and JPEG artifact reduction, PS-Mamba consistently surpasses recent Mamba-based and attention-based models by a clear margin.
Conclusion: Through topology-aware hierarchical state-space modelling and stable cross-scale information flow, PS-Mamba balances local detail fidelity with global spatial consistency in image restoration, supporting the value of improved SSM architectures for low-level vision tasks.
Abstract: Image restoration requires simultaneously preserving fine-grained local structures and maintaining long-range spatial coherence. While convolutional networks struggle with limited receptive fields, and Transformers incur quadratic complexity for global attention, recent State Space Models (SSMs), such as Mamba, provide an appealing linear-time alternative for long-range dependency modelling. However, naively extending Mamba to 2D images exposes two intrinsic shortcomings. First, flattening 2D feature maps into 1D sequences disrupts spatial topology, leading to locality distortion that hampers precise structural recovery. Second, the stability-driven recurrent dynamics of SSMs induce long-range decay, progressively attenuating information across distant spatial positions and weakening global consistency. Together, these effects limit the effectiveness of state-space modelling in high-fidelity restoration. We propose Progressive Split-Mamba (PS-Mamba), a topology-aware hierarchical state-space framework designed to reconcile locality preservation with efficient global propagation. Instead of sequentially flattening entire feature maps, PS-Mamba performs geometry-consistent partitioning, maintaining neighbourhood integrity prior to state-space processing. A progressive split hierarchy (halves, quadrants, octants) enables structured multi-scale modelling while retaining linear complexity.
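To see why splitting before flattening helps locality, the halves, quadrants, and octants hierarchy can be sketched on pixel coordinates. This is an illustration under stated assumptions rather than the PS-Mamba implementation: the abstract does not give the partition geometry, so the grid shapes in `LEVELS` (for example 2 x 4 for octants) and the row-major order inside each region are assumptions, and the names `split_regions` and `vertical_neighbour_gap` are hypothetical.

```python
# Assumed (rows, cols) grid for each level of the split hierarchy.
LEVELS = {"full": (1, 1), "halves": (1, 2), "quadrants": (2, 2), "octants": (2, 4)}


def split_regions(height, width, rows, cols):
    """Partition an image grid into rows x cols rectangles.

    Returns one list of (row, col) pixel coordinates per region, each in
    row-major order, i.e. the sequence a 1D scan over that region would see.
    """
    if height % rows or width % cols:
        raise ValueError("image size must be divisible by the region grid")
    rh, rw = height // rows, width // cols
    regions = []
    for i in range(rows):
        for j in range(cols):
            regions.append([(r, c)
                            for r in range(i * rh, (i + 1) * rh)
                            for c in range(j * rw, (j + 1) * rw)])
    return regions


def vertical_neighbour_gap(height, width, rows, cols):
    """Sequence distance between pixels (0, 0) and (1, 0) in the first region.

    Assumes each region is at least two pixels tall.
    """
    first = split_regions(height, width, rows, cols)[0]
    return first.index((1, 0)) - first.index((0, 0))
```

For an 8 x 8 map, flattening the full image places vertically adjacent pixels 8 steps apart in the sequence, while the assumed octant split places them 2 steps apart, so a recurrent scan has less distance over which information can decay.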
To counteract long-range decay, we introduce symmetric cross-scale shortcut pathways that directly transmit low-frequency global context across hierarchical levels, stabilising information flow over large spatial extents. Extensive experiments on super-resolution, denoising, and JPEG artifact reduction show consistent improvements over recent Mamba-based and attention-based models with a clear margin.
[79] Point Cloud as a Foreign Language for Multi-modal Large Language Model
Sneha Paul,Zachary Patterson,Nizar Bouguila
Main category: cs.CV
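TL;DR: This paper proposes SAGE, the first end-to-end 3D multi-modal large language model that processes raw point clouds directly, without a pre-trained 3D encoder. A lightweight 3D tokenizer and a preference optimization strategy with a semantic alignment reward give better performance, higher efficiency, and stronger generalization on 3D understanding tasks.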
Details
Motivation: Existing encoder-based 3D multi-modal large language models suffer from semantic misalignment between geometric and linguistic spaces, sensitivity to resolution, and high computational overhead.
Method: The end-to-end SAGE model (1) uses a lightweight 3D tokenizer that combines geometric sampling, neighbourhood aggregation, and vector quantization to turn point clouds into discrete tokens an LLM can consume; (2) adds a preference optimization training strategy with a semantic alignment-based reward to improve reasoning on open-ended 3D question answering.
Result: SAGE significantly outperforms existing encoder-based methods on multiple 3D understanding benchmarks, with higher computational efficiency, better generalization across LLM backbones, and robustness to changes in input resolution.
Conclusion: SAGE demonstrates that processing raw point clouds end to end is feasible and advantageous, and offers a new paradigm for 3D multimodal understanding.
Abstract: Multi-modal large language models (MLLMs) have shown remarkable progress in integrating visual and linguistic understanding. Recent efforts have extended these capabilities to 3D understanding through encoder-based architectures that rely on pre-trained 3D encoders to extract geometric features. However, such approaches suffer from semantic misalignment between geometric and linguistic spaces, resolution sensitivity, and substantial computational overhead. In this work, we present SAGE, the first end-to-end 3D MLLM that directly processes raw point clouds without relying on a pre-trained 3D encoder. Our approach introduces a lightweight 3D tokenizer that combines geometric sampling and neighbourhood aggregation with vector quantization to convert point clouds into discrete tokens--treating 3D data as a foreign language that naturally extends the LLM's vocabulary. Furthermore, to enhance the model's reasoning capability on complex 3D tasks, we propose a preference optimization training strategy with a semantic alignment-based reward, specifically designed for open-ended 3D question answering where responses are descriptive. Extensive experiments across diverse 3D understanding benchmarks demonstrate that our end-to-end approach outperforms existing encoder-based methods while offering significant advantages in computational efficiency, generalization across LLM backbones, and robustness to input resolution variations. Code is available at: github.com/snehaputul/SAGE3D.
[80] MM-Zero: Self-Evolving Multi-Model Vision Language Models From Zero Data
Zongxia Li,Hongyang Du,Chengsong Huang,Xiyang Wu,Lantao Yu,Yicheng He,Jing Xie,Xiaomin Wu,Zhichao Liu,Jiarui Zhang,Fuxiao Liu
Main category: cs.CV
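TL;DR: This paper proposes MM-Zero, the first reinforcement-learning-based framework for zero-data self-evolution of vision language model (VLM) reasoning. Three cooperating roles (Proposer, Coder, and Solver) trained with GRPO significantly improve multimodal reasoning performance.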
Details
Motivation: Self-evolution of VLMs usually needs seed image data to bootstrap, whereas LLMs have already achieved zero-data self-evolution; this work aims to close that gap and achieve truly zero-data VLM self-evolution.
Method: The MM-Zero framework has three roles: a Proposer that generates abstract visual concepts and formulates questions, a Coder that turns the concepts into executable code to render images, and a Solver that performs multimodal reasoning over the generated images. All roles are initialized from the same base model and trained jointly with Group Relative Policy Optimization (GRPO), using rewards that combine execution feedback, visual verification, and difficulty balancing.
Result: VLM reasoning performance improves significantly on multiple multimodal benchmarks, which supports the feasibility and effectiveness of zero-data self-evolution and extends self-evolution from two-model to multi-model systems.
Conclusion: MM-Zero is the first to achieve zero-data self-evolution for VLMs, moves beyond the conventional dual-role paradigm, and provides a scalable path for the autonomous improvement of multimodal foundation models.
Abstract: Self-evolving has emerged as a key paradigm for improving foundational models such as Large Language Models (LLMs) and Vision Language Models (VLMs) with minimal human intervention. While recent approaches have demonstrated that LLM agents can self-evolve from scratch with little to no data, VLMs introduce an additional visual modality that typically requires at least some seed data, such as images, to bootstrap the self-evolution process. In this work, we present Multi-model Multimodal Zero (MM-Zero), the first RL-based framework to achieve zero-data self-evolution for VLM reasoning. Moving beyond prior dual-role (Proposer and Solver) setups, MM-Zero introduces a multi-role self-evolving training framework comprising three specialized roles: a Proposer that generates abstract visual concepts and formulates questions; a Coder that translates these concepts into executable code (e.g., Python, SVG) to render visual images; and a Solver that performs multimodal reasoning over the generated visual content. All three roles are initialized from the same base model and trained using Group Relative Policy Optimization (GRPO), with carefully designed reward mechanisms that integrate execution feedback, visual verification, and difficulty balancing. Our experiments show that MM-Zero improves VLM reasoning performance across a wide range of multimodal benchmarks.
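The group-relative part of GRPO mentioned in the MM-Zero abstract can be sketched in a few lines. This is a minimal illustration of the commonly described advantage computation, in which each sampled response is scored against the mean and standard deviation of the rewards in its own group; it is not MM-Zero's reward design. The function name, the `eps` stabilizer, and the use of the population standard deviation are assumptions (implementations differ on the last point).

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one group of sampled responses.

    Each advantage is (reward - group mean) / (group std + eps), so a
    response is judged relative to its siblings rather than by a learned
    value function. `eps` avoids division by zero when all rewards match.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

For example, rewards of `[1.0, 0.0, 0.0, 1.0]` give advantages close to `[1, -1, -1, 1]`, and a group in which every response earns the same reward gives all-zero advantages and therefore contributes no policy gradient.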
MM-Zero establishes a scalable path toward self-evolving multi-model systems for multimodal models, extending the frontier of self-improvement beyond the conventional two-model paradigm.
[81] Geometry-Aware Metric Learning for Cross-Lingual Few-Shot Sign Language Recognition on Static Hand Keypoints
Chayanin Chamachot,Kanokphan Lertniponphan
Main category: cs.CV
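TL;DR: This paper proposes a geometry-aware metric-learning framework based on hand joint angles for cross-lingual few-shot sign language recognition (SLR). The representation is invariant to viewpoint, scale, and translation, and it significantly improves recognition for low-resource sign languages.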
Details
Motivation: Most of the world's 300+ sign languages lack sufficient annotated data, and conventional coordinate-based keypoint representations are sensitive to cross-domain differences (viewpoint, scale, recording conditions), which makes class prototypes unstable, especially in the few-shot regime.
Method: A compact 20-dimensional inter-joint angle descriptor is derived from MediaPipe static hand keypoints. It is invariant to SO(3) rotation, translation, and isotropic scaling, and is combined with metric learning and a lightweight MLP encoder for cross-lingual few-shot transfer.
Result: On four fingerspelling alphabets (ASL, LIBRAS, Arabic Sign Language, and Thai Sign Language), the angle features improve within-domain accuracy over normalized-coordinate baselines by up to 25 percentage points, and cross-lingual transfer with frozen features frequently exceeds within-domain accuracy.
Conclusion: Invariant hand-geometry descriptors provide a portable and efficient base representation for cross-lingual few-shot sign language recognition in low-resource settings.
Abstract: Sign language recognition (SLR) systems typically require large labeled corpora for each language, yet the majority of the world's 300+ sign languages lack sufficient annotated data. Cross-lingual few-shot transfer, pretraining on a data-rich source language and adapting with only a handful of target-language examples, offers a scalable alternative, but conventional coordinate-based keypoint representations are susceptible to domain shift arising from differences in camera viewpoint, hand scale, and recording conditions. This shift is particularly detrimental in the few-shot regime, where class prototypes estimated from only K examples are highly sensitive to extrinsic variance. We propose a geometry-aware metric-learning framework centered on a compact 20-dimensional inter-joint angle descriptor derived from MediaPipe static hand keypoints. These angles are invariant to SO(3) rotation, translation, and isotropic scaling, eliminating the dominant sources of cross-dataset shift and yielding tighter, more stable class prototypes. Evaluated on four fingerspelling alphabets spanning typologically diverse sign languages, ASL, LIBRAS, Arabic Sign Language, and Thai Sign Language, the proposed angle features improve over normalized-coordinate baselines by up to 25 percentage points within-domain and enable frozen cross-lingual transfer that frequently exceeds within-domain accuracy, using a lightweight MLP encoder with about 10^5 parameters.
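The invariance claim for angle features can be checked numerically. The sketch below is an illustration under stated assumptions and is not the paper's descriptor: the abstract does not say which 20 angles are used, so this version assumes 15 joint angles along the five wrist-to-fingertip chains plus 5 angles between fingertip directions seen from the wrist. Landmark indices follow the 21-point MediaPipe hand layout; `example_hand` returns synthetic points, not real data, and all function names are hypothetical.

```python
import math

# 21-point MediaPipe hand layout: 0 is the wrist, then four landmarks per
# finger from base to tip (thumb, index, middle, ring, pinky).
FINGERS = [(1, 2, 3, 4), (5, 6, 7, 8), (9, 10, 11, 12),
           (13, 14, 15, 16), (17, 18, 19, 20)]


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _angle(u, v):
    """Angle in radians between two non-zero 3D vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return math.acos(max(-1.0, min(1.0, dot / (nu * nv))))


def angle_descriptor(pts):
    """Assumed 20-D layout: 15 chain joint angles + 5 fingertip spread angles."""
    feats = []
    for finger in FINGERS:
        chain = (0,) + finger
        for a, b, c in zip(chain, chain[1:], chain[2:]):
            feats.append(_angle(_sub(pts[a], pts[b]), _sub(pts[c], pts[b])))
    tips = [f[-1] for f in FINGERS]
    for i, j in list(zip(tips, tips[1:])) + [(tips[0], tips[-1])]:
        feats.append(_angle(_sub(pts[i], pts[0]), _sub(pts[j], pts[0])))
    return feats


def similarity_transform(pts, yaw, roll, scale, shift):
    """Rotate about z (yaw) then x (roll), scale uniformly, then translate."""
    out = []
    for x, y, z in pts:
        x, y = (x * math.cos(yaw) - y * math.sin(yaw),
                x * math.sin(yaw) + y * math.cos(yaw))
        y, z = (y * math.cos(roll) - z * math.sin(roll),
                y * math.sin(roll) + z * math.cos(roll))
        out.append((scale * x + shift[0], scale * y + shift[1], scale * z + shift[2]))
    return out


def example_hand():
    """Synthetic, non-degenerate 21-point set used only for the check."""
    return [(math.sin(1.3 * i) + 0.1 * i, math.cos(0.7 * i), math.sin(0.37 * i + 1.0))
            for i in range(21)]
```

Computing `angle_descriptor` on `example_hand()` and on a rotated, scaled, and shifted copy from `similarity_transform` gives the same 20 values up to floating-point error, whereas raw coordinates would change under every one of those operations.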
These findings demonstrate that invariant hand-geometry descriptors provide a portable and effective foundation for cross-lingual few-shot SLR in low-resource settings.
[82] X-GS: An Extensible Open Framework Unifying 3DGS Architectures with Downstream Multimodal Models
Yueen Ma,Irwin King
Main category: cs.CV
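TL;DR: This paper proposes X-GS, an extensible open framework that unifies several 3D Gaussian Splatting (3DGS) techniques, supports real-time online SLAM enriched with semantics, and uses its X-GS-Perceiver and X-GS-Thinker components to jointly optimize geometry, poses, and semantic features and to serve downstream multimodal tasks.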
Details
Motivation: Existing 3DGS methods are mostly isolated within specific domains (such as online SLAM, semantic enrichment, or reconstruction from unposed images), and no unified, extensible framework supports real-time multimodal downstream tasks. Method: Proposes the X-GS framework, centered on X-GS-Perceiver, which takes unposed RGB or RGB-D video streams as input, jointly optimizes geometry and camera poses, and distills high-dimensional semantic features from vision foundation models into the 3D Gaussians. An online vector quantization (VQ) module, GPU-accelerated grid sampling, and a highly parallelized pipeline keep it real-time. X-GS-Thinker passes the semantic Gaussians to vision-language models for downstream tasks. Result: Experiments on real-world datasets show X-GS's advantages in efficiency, accuracy, and newly unlocked multimodal capabilities (e.g., object detection, zero-shot captioning), achieving real-time semantically enriched online SLAM. Conclusion: X-GS provides a unified, efficient, and extensible open 3DGS framework that bridges 3D reconstruction and multimodal AI and offers a new paradigm for downstream applications such as embodied intelligence. Abstract: 3D Gaussian Splatting (3DGS) has emerged as a powerful technique for novel view synthesis, subsequently extending into numerous spatial AI applications. However, most existing 3DGS methods are isolated, focusing on specific domains such as online SLAM, semantic enrichment, or 3DGS for unposed images. In this paper, we introduce X-GS, an extensible open framework that unifies a broad range of techniques to enable real-time 3DGS-based online SLAM enriched with semantics, bridging the gap to downstream multimodal models. At the core of X-GS is a highly efficient pipeline called X-GS-Perceiver, capable of taking unposed RGB (or optionally RGB-D) video streams as input to co-optimize geometry and poses, and distill high-dimensional semantic features from vision foundation models into the 3D Gaussians. We achieve real-time performance through a novel online Vector Quantization (VQ) module, a GPU-accelerated grid-sampling scheme, and a highly parallelized pipeline design. The semantic 3D Gaussians can then be utilized by vision-language models within the X-GS-Thinker component, enabling downstream tasks such as object detection, zero-shot caption generation, and potentially embodied tasks. Experimental results on real-world datasets showcase the efficacy, efficiency, and newly unlocked multimodal capabilities of the X-GS framework.
[83] TubeMLLM: A Foundation Model for Topology Knowledge Exploration in Vessel-like Anatomy
Yaoyu Liu,Minghui Zhang,Xin You,Hanxiao Zhang,Yun Gu
Main category: cs.CV
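TL;DR: This paper proposes TubeMLLM, a unified foundation model for medical vessel-like anatomy that couples structured understanding with controllable generation. It injects topological priors through natural language prompts and aligns them with visual representations in a shared-attention architecture, markedly improving topology awareness. The authors also build TubeMData, the first topology-centric multimodal benchmark, introduce an adaptive loss weighting strategy, and report state-of-the-art zero-shot cross-modality generalization and robustness across 15 datasets.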
Details
Motivation: Modeling medical vessel-like anatomy is difficult because of its complex topology and sensitivity to data distribution shift, so task-specific models suffer from topological inconsistencies (such as spurious disconnections or merges). Motivated by the zero-shot generalization potential of multimodal large language models (MLLMs), the authors argue for a unified foundation model that combines topology awareness with controllable generation. Method: Proposes TubeMLLM, which 1) injects topological priors through explicit natural language prompts; 2) uses a shared-attention architecture to align language prompts with visual representations; 3) is accompanied by the topology-centric multimodal benchmark TubeMData; and 4) is trained with an adaptive loss weighting strategy that emphasizes topology-critical regions. Result: Validated on 15 diverse datasets: 1) on color fundus photography, the $\beta_{0}$ number error drops from 37.42 to 8.58; 2) zero-shot transfer to unseen X-ray angiography reaches a Dice score of 67.50% with a $\beta_{0}$ error of only 1.21; 3) the model stays robust to blur, noise, and low-resolution degradations; 4) accuracy in topological quality assessment reaches 97.38%, far above standard VLM baselines. Conclusion: TubeMLLM integrates topological priors into the multimodal large model framework, achieving accurate, generalizable, and robust modeling of medical vessel-like structures, setting a new bar for zero-shot cross-modality tasks and topology-aware understanding, and offering a transferable foundation-model paradigm for medical image analysis. Abstract: Modeling medical vessel-like anatomy is challenging due to its intricate topology and sensitivity to dataset shifts. Consequently, task-specific models often suffer from topological inconsistencies, including artificial disconnections and spurious merges. Motivated by the promise of multimodal large language models (MLLMs) for zero-shot generalization, we propose TubeMLLM, a unified foundation model that couples structured understanding with controllable generation for medical vessel-like anatomy. By integrating topological priors through explicit natural language prompting and aligning them with visual representations in a shared-attention architecture, TubeMLLM significantly enhances topology-aware perception. Furthermore, we construct TubeMData, a pioneering multimodal benchmark comprising comprehensive topology-centric tasks, and introduce an adaptive loss weighting strategy to emphasize topology-critical regions during training. Extensive experiments on fifteen diverse datasets demonstrate our superiority. Quantitatively, TubeMLLM achieves state-of-the-art out-of-distribution performance, substantially reducing global topological discrepancies on color fundus photography (decreasing the $\beta_{0}$ number error from 37.42 to 8.58 compared to baselines). Notably, TubeMLLM exhibits exceptional zero-shot cross-modality transfer ability on unseen X-ray angiography, achieving a Dice score of 67.50% while significantly reducing the $\beta_{0}$ error to 1.21. TubeMLLM also maintains robustness against degradations such as blur, noise, and low resolution. Furthermore, in topology-aware understanding tasks, the model achieves 97.38% accuracy in evaluating mask topological quality, significantly outperforming standard vision-language baselines.
[84] EXPLORE-Bench: Egocentric Scene Prediction with Long-Horizon Reasoning
Chengjun Yu,Xuhan Zhu,Chaoqun Du,Pengfei Yu,Wei Zhai,Yang Cao,Zheng-Jun Zha
Main category: cs.CV
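TL;DR: This paper introduces EXPLORE-Bench, a benchmark for evaluating how well multimodal large language models (MLLMs) reason, from an egocentric viewpoint, about the physical consequences of long action sequences. Experiments show that current MLLMs fall far short of humans on this task; stepwise reasoning improves performance but adds computational overhead.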
Details
Motivation: MLLMs are increasingly used as the foundation for embodied agents, yet it is unclear whether they can reliably reason about the long-term physical consequences of actions from an egocentric viewpoint. Method: Defines a new task, egocentric scene prediction with long-horizon reasoning, and releases EXPLORE-Bench, a benchmark built from real first-person videos with long action sequences and structured final-scene annotations; evaluates a range of mainstream MLLMs and explores test-time stepwise reasoning. Result: Current MLLMs perform significantly below humans on this task; stepwise reasoning yields partial gains at higher computational cost. Conclusion: Long-horizon egocentric reasoning remains a major challenge for MLLMs, and EXPLORE-Bench provides a systematic evaluation platform and a path for progress in this direction. Abstract: Multimodal large language models (MLLMs) are increasingly considered as a foundation for embodied agents, yet it remains unclear whether they can reliably reason about the long-term physical consequences of actions from an egocentric viewpoint. We study this gap through a new task, Egocentric Scene Prediction with LOng-horizon REasoning: given an initial-scene image and a sequence of atomic action descriptions, a model is asked to predict the final scene after all actions are executed. To enable systematic evaluation, we introduce EXPLORE-Bench, a benchmark curated from real first-person videos spanning diverse scenarios. Each instance pairs long action sequences with structured final-scene annotations, including object categories, visual attributes, and inter-object relations, which supports fine-grained, quantitative assessment. Experiments on a range of proprietary and open-source MLLMs reveal a significant performance gap to humans, indicating that long-horizon egocentric reasoning remains a major challenge. We further analyze test-time scaling via stepwise reasoning and show that decomposing long action sequences can improve performance to some extent, while incurring non-trivial computational overhead. Overall, EXPLORE-Bench provides a principled testbed for measuring and advancing long-horizon reasoning for egocentric embodied perception.
[85] Distributed Convolutional Neural Networks for Object Recognition
Liang Sun
Main category: cs.CV
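TL;DR: This paper proposes a novel loss function for training a distributed convolutional neural network (DisCNN) that recognizes only a specific positive class. By mapping positive samples to a compact set in high-dimensional space and negative samples to the origin, it disentangles positive-class features and extracts them with a lightweight model, and it performs well at detecting positive samples embedded in complex backgrounds.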
Details
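Motivation: Conventional CNNs struggle to focus on a single positive class and disentangle its features; the work also aims for a more lightweight model and better generalization to unseen classes. Method: Designs a novel loss function that guides the DisCNN to map positive samples to a compact set in high-dimensional space and negative samples to the origin, so that only positive-class features are extracted; uses a lightweight network architecture. Result: Experiments show that the method disentangles positive-class features, generalizes well, remains effective for unseen classes, and can be applied directly to detecting positive samples in complex backgrounds. Conclusion: The proposed DisCNN and its loss function extract and disentangle positive-class features precisely while remaining lightweight, generalizable, and practical. Abstract: This paper proposes a novel loss function for training a distributed convolutional neural network (DisCNN) to recognize only a specific positive class. By mapping positive samples to a compact set in high-dimensional space and negative samples to the origin, the DisCNN extracts only the features of the positive class. An experiment is presented to demonstrate this. Thus, the features of the positive class are disentangled from those of the negative classes. The model has a lightweight architecture because only a few positive-class features need to be extracted. The model demonstrates excellent generalization on the test data and remains effective even for unseen classes. Finally, using DisCNN, object detection of positive samples embedded in a large and complex background is straightforward.
[86] UniField: A Unified Field-Aware MRI Enhancement Framework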
Yiyang Lin,Chenhui Wang,Zhihao Peng,Yixuan Yuan
Main category: cs.CV
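TL;DR: This paper proposes UniField, a unified framework for multi-field-strength MRI enhancement. It leverages 3D foundation models, introduces a physics-driven spectral rectification mechanism (FASRM), and releases a large-scale paired multi-field MRI dataset, significantly improving cross-field-strength MRI enhancement.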
Details
Motivation: Existing MRI field-strength enhancement methods are confined to single tasks and small datasets and do not exploit the degradation patterns shared across field strengths, which leads to poor generalization and severe data scarcity. Method: 1) Uses pre-trained 3D foundation models to model 3D volumetric information directly; 2) proposes the Field-Aware Spectral Rectification Mechanism (FASRM), grounded in the physics of magnetic fields, to mitigate the spectral bias of flow-matching models; 3) builds and publicly releases a paired multi-field MRI dataset far larger than existing ones. Result: Outperforms state-of-the-art methods across multiple metrics, with average gains of about 1.81 dB in PSNR and 9.47% in SSIM. Conclusion: The proposed unified framework combines multi-modality and multi-task learning with 3D modeling, physical priors, and large-scale data, significantly improving the performance and generalization of cross-field-strength MRI enhancement. Abstract: Magnetic Resonance Imaging (MRI) field-strength enhancement holds immense value for both clinical diagnostics and advanced research. However, existing methods typically focus on isolated enhancement tasks, such as specific 64mT-to-3T or 3T-to-7T transitions using limited subject cohorts, thereby failing to exploit the shared degradation patterns inherent across different field strengths and severely restricting model generalization. To address this challenge, we propose UniField, a unified framework integrating multiple modalities and enhancement tasks to mutually promote representation learning by exploiting these shared degradation characteristics. Specifically, our main contributions are threefold. Firstly, to overcome MRI data scarcity and capture continuous anatomical structures, UniField departs from conventional methods that treat 3D MRI volumes as independent 2D slices. Instead, we directly exploit comprehensive 3D volumetric information by leveraging pre-trained 3D foundation models, thereby embedding generalized and robust structural representations to significantly boost enhancement performance. In addition, to mitigate the spectral bias of mainstream flow-matching models that often over-smooth high-frequency details, we explicitly incorporate the physical mechanisms of magnetic fields to introduce a Field-Aware Spectral Rectification Mechanism (FASRM), tailoring customized spectral corrections to distinct field strengths. Finally, to resolve the fundamental data bottleneck, we organize and publicly release a comprehensive paired multi-field MRI dataset, which is an order of magnitude larger than existing datasets. Extensive experiments demonstrate our method's superiority over state-of-the-art approaches, achieving an average improvement of approximately 1.81 dB in PSNR and 9.47% in SSIM. Code will be released upon acceptance.
[87] HelixTrack: Event-Based Tracking and RPM Estimation of Propeller-like Objects
Radim Spetlik,Michal Pliska,Vojtěch Vrba,Jiri Matas
Main category: cs.CV
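TL;DR: This paper proposes HelixTrack, a fully event-driven method for microsecond-latency tracking of fast periodic motion and RPM estimation in UAVs and rotating machinery. It addresses the drift or failure of existing frame-based and event-based trackers on periodic targets such as propellers, which violate their smooth-motion assumptions.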
Details
Motivation: Safety-critical settings (such as UAVs and rotating machinery) require microsecond-latency tracking of fast periodic motion (such as propellers) under egomotion and strong distractors, but existing trackers assume smooth motion and cannot handle periodic signals. Method: HelixTrack uses an event-driven architecture: events are back-warped into the rotor plane via a homography estimated online, a Kalman filter estimates phase in real time, and batched iterative updates refine the pose by coupling phase residuals to geometry. Result: On the self-collected TQE dataset (13 high-resolution event sequences, 52 rotating objects, and microsecond RPM ground truth), HelixTrack runs at about 11.8x real time with microsecond latency and clearly outperforms the adapted baselines. Conclusion: HelixTrack is the first to jointly achieve accurate tracking and RPM estimation of propeller-like objects in the event domain, filling a gap for this task and advancing safety-oriented perception of periodic motion. Abstract: Safety-critical perception for unmanned aerial vehicles and rotating machinery requires microsecond-latency tracking of fast, periodic motion under egomotion and strong distractors. Frame-based and event-based trackers drift or break on propellers because periodic signatures violate their smooth-motion assumptions. We tackle this gap with HelixTrack, a fully event-driven method that jointly tracks propeller-like objects and estimates their rotations per minute (RPM). Incoming events are back-warped from the image plane into the rotor plane via a homography estimated on the fly. A Kalman Filter maintains instantaneous estimates of phase. Batched iterative updates refine the object pose by coupling phase residuals to geometry. To our knowledge, no public dataset targets joint tracking and RPM estimation of propeller-like objects. We therefore introduce the Timestamped Quadcopter with Egomotion (TQE) dataset with 13 high-resolution event sequences, containing 52 rotating objects in total, captured at distances of 2 m / 4 m, with increasing egomotion and microsecond RPM ground truth. On TQE, HelixTrack processes full-rate events faster than real time (approx. 11.8x) with microsecond latency. It consistently outperforms per-event and aggregation-based baselines adapted for RPM estimation.
[88] BridgeDiff: Bridging Human Observations and Flat-Garment Synthesis for Virtual Try-Off
Shuang Liu,Ao Yu,Linkang Cheng,Xiwen Huang,Li Zhao,Junhui Liu,Zhiting Lin,Yu Liu
Main category: cs.CV
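TL;DR: BridgeDiff is a new diffusion-based method for virtual try-off (VTOFF). Its Garment Condition Bridge Module (GCBM) and Flat Structure Constraint Module (FSCM) improve garment appearance consistency and structural stability under partial visibility, clearly outperforming existing methods.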
Details
Motivation: Existing VTOFF methods treat the task as direct image translation driven by local masks or text-only prompts, overlooking the fundamental gap between on-body appearance and flat-garment layout, which leads to inconsistent reconstruction of unobserved regions and unstable structure. Method: Proposes BridgeDiff: 1) the Garment Condition Bridge Module (GCBM) builds a representation of global garment appearance and semantic identity, making detail inference more robust under partial visibility; 2) the Flat Structure Constraint Module (FSCM) applies Flat-Constraint Attention (FC-Attention) at key denoising stages to inject explicit flat-garment structural priors. Result: Achieves state-of-the-art performance on standard VTOFF benchmarks, producing higher-quality flat-garment reconstructions that better preserve fine-grained appearance and structural integrity. Conclusion: By decoupling and jointly modeling human-centric observation and flat-garment generation, BridgeDiff bridges the semantic gap between dressed-person images and flat layouts, offering a more reliable and structurally stable generation paradigm for VTOFF. Abstract: Virtual try-off (VTOFF) aims to recover canonical flat-garment representations from images of dressed persons for standardized display and downstream virtual try-on. Prior methods often treat VTOFF as direct image translation driven by local masks or text-only prompts, overlooking the gap between on-body appearances and flat layouts. This gap frequently leads to inconsistent completion in unobserved regions and unstable garment structure. We propose BridgeDiff, a diffusion-based framework that explicitly bridges human-centric observations and flat-garment synthesis through two complementary components. First, the Garment Condition Bridge Module (GCBM) builds a garment-cue representation that captures global appearance and semantic identity, enabling robust inference of continuous details under partial visibility. Second, the Flat Structure Constraint Module (FSCM) injects explicit flat-garment structural priors via Flat-Constraint Attention (FC-Attention) at selected denoising stages, improving structural stability beyond text-only conditioning. Extensive experiments on standard VTOFF benchmarks show that BridgeDiff achieves state-of-the-art performance, producing higher-quality flat-garment reconstructions while preserving fine-grained appearance and structural integrity.
[89] RAE-NWM: Navigation World Model in Dense Visual Representation Space
Mingkun Zhang,Wangtian Shen,Fan Zhang,Haijian Qin,Zihao Pei,Ziyang Meng
Main category: cs.CV
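TL;DR: This paper proposes RAE-NWM, a navigation world model built on dense visual representations. By modeling action-conditioned state transitions in the DINOv2 feature space, it improves structural stability and action accuracy.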
Details
Motivation: Existing world models built on VAE latent spaces lose fine-grained structural information through spatial compression, which hinders precise control; the authors find that dense DINOv2 features are more linearly predictable for action-conditioned transitions. Method: Proposes RAE-NWM, which models navigation dynamics in the dense DINOv2 feature space. It uses a Conditional Diffusion Transformer with a Decoupled Diffusion Transformer head (CDiT-DH) to model continuous transitions and adds a time-driven gating module to regulate action injection strength. Result: Across multiple evaluations, the method improves structural stability and action accuracy in sequential rollouts, benefiting downstream planning and navigation. Conclusion: Modeling world dynamics in a dense visual representation space is more effective than in a compressed latent space, offering a new paradigm for visual navigation world models. Abstract: Visual navigation requires agents to reach goals in complex environments through perception and planning. World models address this task by simulating action-conditioned state transitions to predict future observations. Current navigation world models typically learn state evolution under actions within the compressed latent space of a Variational Autoencoder, where spatial compression often discards fine-grained structural information and hinders precise control. To better understand the propagation characteristics of different representations, we conduct a linear dynamics probe and observe that dense DINOv2 features exhibit stronger linear predictability for action-conditioned transitions. Motivated by this observation, we propose the Representation Autoencoder-based Navigation World Model (RAE-NWM), which models navigation dynamics in a dense visual representation space. We employ a Conditional Diffusion Transformer with Decoupled Diffusion Transformer head (CDiT-DH) to model continuous transitions, and introduce a separate time-driven gating module for dynamics conditioning to regulate action injection strength during generation. Extensive evaluations show that modeling sequential rollouts in this space improves structural stability and action accuracy, benefiting downstream planning and navigation.
[90] When Detectors Forget Forensics: Blocking Semantic Shortcuts for Generalizable AI-Generated Image Detection
Chao Shuai,Zhenguang Liu,Shaojing Fan,Bin Gong,Weichen Lian,Xiuli Bi,Zhongjie Ba,Kui Ren
Main category: cs.CV
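TL;DR: This paper proposes Geometric Semantic Decoupling (GSD) to address the poor generalization of AI-generated image detectors built on Vision Foundation Models (VFMs) when they face unseen generation pipelines. It improves detection robustness by decoupling semantic information from forgery traces.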
Details
Motivation: Existing AI-generated image detectors built on VFMs (e.g., CLIP) generalize poorly across generation pipelines. The core cause is "semantic fallback": the model relies on pre-trained semantic priors (such as identity) rather than forgery traces. Method: Proposes GSD, a parameter-free module that uses a frozen VFM as a semantic guide and a trainable VFM as the artifact detector. It estimates semantic directions from batch statistics and projects them out via a geometric constraint, forcing the detector to rely on semantics-independent forensic features. Result: Reaches 94.4% video-level AUC (+1.2%) in cross-dataset evaluation; improves robustness to unseen manipulations by 3.0% (DF40); generalizes to general-scene detection, with gains of 0.9% on UniversalFakeDetect and 1.7% on GenImage. Conclusion: GSD mitigates semantic fallback and significantly improves the generalization and robustness of VFM-based detectors on unseen generation methods and general scenes. Abstract: AI-generated image detection has become increasingly important with the rapid advancement of generative AI. However, detectors built on Vision Foundation Models (VFMs, e.g., CLIP) often struggle to generalize to images created using unseen generation pipelines. We identify, for the first time, a key failure mechanism, termed semantic fallback, where VFM-based detectors rely on dominant pre-trained semantic priors (such as identity) rather than forgery-specific traces under distribution shifts. To address this issue, we propose Geometric Semantic Decoupling (GSD), a parameter-free module that explicitly removes semantic components from learned representations by leveraging a frozen VFM as a semantic guide with a trainable VFM as an artifact detector. GSD estimates semantic directions from batch-wise statistics and projects them out via a geometric constraint, forcing the artifact detector to rely on semantic-invariant forensic evidence. Extensive experiments demonstrate that our method consistently outperforms state-of-the-art approaches, achieving 94.4% video-level AUC (+1.2%) in cross-dataset evaluation, improving robustness to unseen manipulations (+3.0% on DF40), and generalizing beyond faces to the detection of synthetic images of general scenes, including UniversalFakeDetect (+0.9%) and GenImage (+1.7%).
[91] Towards Instance Segmentation with Polygon Detection Transformers
Jiacheng Sun,Jiaqi Lin,Wenlong Hu,Haoyang Li,Xinghong Zhou,Chenghai Mao,Yan Peng,Xiaomao Li
Main category: cs.CV
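TL;DR: This paper proposes Poly-DETR, which reformulates instance segmentation as sparse vertex regression using a polar representation. Avoiding dense pixel-wise mask prediction makes it lighter and better suited to real-time inference at high resolution.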
Details
Motivation: Instance segmentation currently faces a conflict between high-resolution inputs and lightweight, real-time inference. Method: Proposes the Polygon Detection Transformer (Poly-DETR), which performs sparse vertex regression with a polar representation; introduces Polar Deformable Attention and a Position-Aware Training Scheme to handle the box-to-polygon reference shift and focus attention on boundary cues; builds a mask-based counterpart for systematic comparison. Result: Improves on existing polar-based methods by 4.7 mAP on MS COCO test-dev; roughly halves memory consumption on Cityscapes; surpasses the mask-based counterpart on all metrics on PanNuke and SpaceNet. Conclusion: Poly-DETR confirms the advantage of polar representation for segmenting regular-shaped instances (such as cells and building footprints), particularly in high-resolution and domain-specific settings. Abstract: One of the bottlenecks for instance segmentation today lies in the conflicting requirements of high-resolution inputs and lightweight, real-time inference. To address this bottleneck, we present a Polygon Detection Transformer (Poly-DETR) to reformulate instance segmentation as sparse vertex regression via Polar Representation, thereby eliminating the reliance on dense pixel-wise mask prediction. Considering the box-to-polygon reference shift in Detection Transformers, we propose Polar Deformable Attention and Position-Aware Training Scheme to dynamically update supervision and focus attention on boundary cues. Compared with state-of-the-art polar-based methods, Poly-DETR achieves a 4.7 mAP improvement on MS COCO test-dev. Moreover, we construct a parallel mask-based counterpart to support a systematic comparison between polar and mask representations. Experimental results show that Poly-DETR is more lightweight in high-resolution scenarios, reducing memory consumption by almost half on the Cityscapes dataset. Notably, on PanNuke (cell segmentation) and SpaceNet (building footprints) datasets, Poly-DETR surpasses its mask-based counterpart on all metrics, which validates its advantage on regular-shaped instances in domain-specific settings.
[92] Multi-model approach for autonomous driving: A comprehensive study on traffic sign-, vehicle- and lane detection and behavioral cloning
Kanishkha Jaisankar,Pranav M. Pawar,Diana Susane Joseph,Raja Muthalagu,Mithun Mukherjee
Main category: cs.CV
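TL;DR: This paper reviews the use of deep learning and computer vision in self-driving cars and proposes an approach that combines pre-trained and custom neural networks for key tasks: traffic sign classification, vehicle detection, lane detection, and behavioral cloning. Data augmentation and transfer learning are used to improve model performance.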
Details
Motivation: Address the challenges self-driving cars face in traffic sign recognition, lane prediction, vehicle detection, and behavioral cloning, and improve system robustness and reliability. Method: Uses pre-trained and custom neural networks together with geometric and color transformations for data augmentation, image normalization, and transfer learning for feature extraction; validates on GTSRB, road segmentation and vehicle detection datasets, and data collected with the Udacity simulator. Result: The approach proves effective on traffic sign classification, lane prediction, vehicle detection, and behavioral cloning, improving the performance of the self-driving system. Conclusion: The study offers a practical technical path and useful insights for autonomous driving, supporting the development and deployment of safer, more efficient self-driving technology. Abstract: Deep learning and computer vision techniques have become increasingly important in the development of self-driving cars. These techniques play a crucial role in enabling self-driving cars to perceive and understand their surroundings, allowing them to safely navigate and make decisions in real-time. Using neural networks, self-driving cars can accurately identify and classify objects such as pedestrians, other vehicles, and traffic signals. Using deep learning and analyzing data from sensors such as cameras and radar, self-driving cars can predict the likely movement of other objects and plan their own actions accordingly. In this study, a novel approach to enhance the performance of self-driving cars by using pre-trained and custom-made neural networks for key tasks, including traffic sign classification, vehicle detection, lane detection, and behavioral cloning, is provided. The methodology integrates several innovative techniques, such as geometric and color transformations for data augmentation, image normalization, and transfer learning for feature extraction. These techniques are applied to diverse datasets, including the German Traffic Sign Recognition Benchmark (GTSRB), road and lane segmentation datasets, vehicle detection datasets, and data collected using the Udacity self-driving car simulator, to evaluate the model efficacy. The primary objective of the work is to review the state-of-the-art in deep learning and computer vision for self-driving cars. The findings indicate that the approach is effective in solving various challenges related to self-driving cars, such as traffic sign classification, lane prediction, vehicle detection, and behavioral cloning, and provide valuable insights into improving the robustness and reliability of autonomous systems, paving the way for future research and deployment of safer and more efficient self-driving technologies.
[93] Multimodal Graph Representation Learning with Dynamic Information Pathways
Xiaobin Hong,Mingkai Lin,Xiaoli Wang,Chaoqun Wang,Wenzhong Li
Main category: cs.CV
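TL;DR: This paper proposes DiP, a new multimodal graph representation learning framework. By introducing modality-specific pseudo nodes, it enables dynamic message routing within each modality and efficient information pathways across modalities, giving adaptive, expressive, and sparse propagation with linear computational complexity.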
Details
Motivation: Most existing methods extend conventional graph neural networks and rely on static structures or dense attention, which limits flexibility and the expressiveness of node embeddings. Method: Proposes the DiP framework, which uses modality-specific pseudo nodes to route messages dynamically within each modality through proximity-guided pseudo-node interactions and builds efficient cross-modal information pathways in a shared state space. Result: On link prediction and node classification across multiple benchmark datasets, DiP consistently outperforms the baselines. Conclusion: DiP achieves adaptive, expressive, and sparse cross-modal message propagation with linear complexity, significantly improving multimodal graph learning. Abstract: Multimodal graphs, where nodes contain heterogeneous features such as images and text, are increasingly common in real-world applications. Effectively learning on such graphs requires both adaptive intra-modal message passing and efficient inter-modal aggregation. However, most existing approaches to multimodal graph learning are extended from conventional graph neural networks and rely on static structures or dense attention, which limit flexibility and expressive node embedding learning. In this paper, we propose a novel multimodal graph representation learning framework with Dynamic information Pathways (DiP). By introducing modality-specific pseudo nodes, DiP enables dynamic message routing within each modality via proximity-guided pseudo-node interactions and captures inter-modality dependence through efficient information pathways in a shared state space. This design achieves adaptive, expressive, and sparse message propagation across modalities with linear complexity. We conduct the link prediction and node classification tasks to evaluate performance and carry out full experimental analyses. Extensive experiments across multiple benchmarks demonstrate that DiP consistently outperforms baselines.
[94] Implicit Geometry Representations for Vision-and-Language Navigation from Web Videos
Mingfei Han,Haihong Hao,Liang Ma,Kamila Zhumakhanova,Ekaterina Radionova,Jingyi Zhang,Xiaojun Chang,Xiaodan Liang,Ivan Laptev
Main category: cs.CV
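TL;DR: This paper proposes a large-scale video-instruction VLN framework built from web room-tour videos. It uses implicit geometry representations to extract spatial cues from RGB frames, avoiding fragile 3D reconstruction, which substantially improves data utilization and navigation performance, reaches state of the art on multiple benchmarks, and supports zero-shot navigation.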
Details
Motivation: Existing VLN work is constrained by the limited diversity and scalability of simulator-built datasets, which fail to reflect the complexity of real-world environments. Method: Builds a large-scale video-instruction framework from web room-tour videos, combining open-ended description-enriched trajectories with action-enriched trajectories reconstructed in 3D, and introduces implicit geometry representations that extract spatial cues directly from RGB frames. Result: Sets a new state of the art on multiple VLN benchmarks, including CVDN, SOON, R2R, and REVERIE, and enables robust zero-shot navigation. Conclusion: By combining large-scale web videos with implicit spatial reasoning, the work moves embodied navigation toward more scalable, generalizable, and practically applicable solutions. Abstract: Vision-and-Language Navigation (VLN) has long been constrained by the limited diversity and scalability of simulator-curated datasets, which fail to capture the complexity of real-world environments. To overcome this limitation, we introduce a large-scale video-instruction framework derived from web-based room tour videos, enabling agents to learn from natural human walking demonstrations in diverse, realistic indoor settings. Unlike existing datasets, our framework integrates both open-ended description-enriched trajectories and action-enriched trajectories reconstructed in 3D, providing richer spatial and semantic supervision. A key extension in this work is the incorporation of implicit geometry representations, which extract spatial cues directly from RGB frames without requiring fragile 3D reconstruction. This approach substantially improves data utilization, alleviates reconstruction failures, and unlocks large portions of previously unusable video data. Comprehensive experiments across multiple VLN benchmarks (CVDN, SOON, R2R, and REVERIE) demonstrate that our method not only sets new state-of-the-art performance but also enables the development of robust zero-shot navigation agents. By bridging large-scale web videos with implicit spatial reasoning, this work advances embodied navigation towards more scalable, generalizable, and real-world applicable solutions.
[95] ForgeDreamer: Industrial Text-to-3D Generation with Multi-Expert LoRA and Cross-View Hypergraph
Junhao Cai,Deyu Zeng,Junhao Pang,Lini Li,Zongze Wu,Xiaopin Zhong
Main category: cs.CV
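TL;DR: This paper proposes the ForgeDreamer framework, which uses a multi-expert LoRA ensemble and cross-view hypergraph geometric enhancement to address the weak domain adaptation and geometric reasoning of text-to-3D generation in industrial applications.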
Details
Motivation: Existing text-to-3D generation methods face two challenges in industrial settings: difficult domain adaptation (e.g., LoRA fusion causes knowledge interference) and insufficient geometric reasoning (e.g., pairwise consistency constraints cannot model higher-order structural dependencies). Method: Proposes two core components: 1) a Multi-Expert LoRA Ensemble mechanism that merges multiple category-specific LoRA models for cross-category generalization without knowledge interference; 2) a Cross-View Hypergraph Geometric Enhancement approach, built on the improved semantic understanding, that models higher-order structural dependencies across multiple views simultaneously. Result: Experiments on a custom industrial dataset show that the method outperforms state-of-the-art approaches in both semantic generalization and geometric fidelity. Conclusion: ForgeDreamer makes text-to-3D generation more applicable to industrial precision-manufacturing tasks, combining semantic understanding with manufacturing-level geometric consistency. Abstract: Current text-to-3D generation methods excel in natural scenes but struggle with industrial applications due to two critical limitations: domain adaptation challenges where conventional LoRA fusion causes knowledge interference across categories, and geometric reasoning deficiencies where pairwise consistency constraints fail to capture higher-order structural dependencies essential for precision manufacturing. We propose a novel framework named ForgeDreamer addressing both challenges through two key innovations. First, we introduce a Multi-Expert LoRA Ensemble mechanism that consolidates multiple category-specific LoRA models into a unified representation, achieving superior cross-category generalization while eliminating knowledge interference. Second, building on enhanced semantic understanding, we develop a Cross-View Hypergraph Geometric Enhancement approach that captures structural dependencies spanning multiple viewpoints simultaneously. These components work synergistically: improved semantic understanding enables more effective geometric reasoning, while hypergraph modeling ensures manufacturing-level consistency. Extensive experiments on a custom industrial dataset demonstrate superior semantic generalization and enhanced geometric fidelity compared to state-of-the-art approaches. Our code and data are provided in the supplementary material attached in the appendix for review purposes.
[96] Speeding Up the Learning of 3D Gaussians with Much Shorter Gaussian Lists
Jiaqi Liu,Zhizhong Han
Main category: cs.CV
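TL;DR: This paper proposes a method for making 3D Gaussian Splatting (3DGS) training and rendering more efficient. It shortens the per-pixel Gaussian list by periodically shrinking Gaussian scales and adding an entropy constraint on alpha blending, and combines this with a progressive resolution schedule, improving efficiency without sacrificing rendering quality.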
Details
Motivation: Although 3D Gaussian Splatting (3DGS) surpasses NeRF in rendering quality and efficiency, its learning process still has an efficiency bottleneck, in particular the large number of Gaussians processed per pixel. Method: Proposes two core strategies: 1) regularly resetting Gaussian scales to shrink the footprint of each Gaussian; 2) imposing an entropy constraint on alpha blending to sharpen the weight distribution and make dominant Gaussians more local. Both are integrated into a progressive rendering-resolution scheduler. Result: On widely used benchmarks, the method significantly improves training and rendering efficiency while maintaining, or slightly improving, rendering quality. Conclusion: The proposed strategies reduce the number of Gaussians involved in rendering each pixel, accelerating 3DGS training and inference and suggesting a new direction for efficient radiance-field modeling. Abstract: 3D Gaussian splatting (3DGS) has become a vital tool for learning a radiance field from multiple posed images. Although 3DGS shows great advantages over NeRF in terms of rendering quality and efficiency, it remains a research challenge to further improve the efficiency of learning 3D Gaussians. To overcome this challenge, we propose novel training strategies and losses to shorten each Gaussian list used to render a pixel, which speeds up the splatting by involving fewer Gaussians along a ray. Specifically, we shrink the size of each Gaussian by resetting their scales regularly, encouraging smaller Gaussians to cover fewer nearby pixels, which shortens the Gaussian lists of pixels. Additionally, we introduce an entropy constraint on the alpha blending procedure to sharpen the weight distribution of Gaussians along each ray, which drives dominant weights larger while making minor weights smaller. As a result, each Gaussian becomes more focused on the pixels where it is dominant, which reduces its impact on nearby pixels, leading to even shorter Gaussian lists. Finally, we integrate our method into a rendering resolution scheduler which further improves efficiency through progressive resolution increase. We evaluate our method by comparing it with state-of-the-art methods on widely used benchmarks. Our results show significant advantages over others in efficiency without sacrificing rendering quality.
[97] From Ideal to Real: Stable Video Object Removal under Imperfect Conditions
Jiagao Hu,Yuxuan Chen,Fuhao Li,Zepeng Wang,Fei Wang,Daiguo Zhou,Jian Luan
Main category: cs.CV
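TL;DR: This paper proposes the SVOR framework, which uses MUSE, DA-Seg, and two-stage curriculum training to make video object removal markedly more robust and consistent under real-world challenges such as shadows, fast motion, and defective masks.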
Details
Motivation: Existing diffusion-based video inpainting methods struggle to maintain temporal stability and visual consistency when faced with real-world shadows, abrupt motion, and defective masks, so a more robust video object removal method is needed. Method: Proposes the SVOR framework, comprising: (1) MUSE, a windowed mask-union strategy for handling abrupt motion; (2) DA-Seg, a denoising-aware segmentation head on a decoupled branch that provides a diffusion-aware localization prior; (3) curriculum two-stage training, in which Stage I performs self-supervised pretraining to learn background and temporal priors and Stage II fine-tunes on synthetic data while jointly removing objects and their shadows and reflections. Result: SVOR sets a new state of the art across multiple datasets and degraded-mask benchmarks, significantly improving video object removal in real-world scenes. Conclusion: SVOR moves video object removal from ideal conditions toward real applications, with strong robustness, temporal stability, and cross-domain generalization. Abstract: Removing objects from videos remains difficult in the presence of real-world imperfections such as shadows, abrupt motion, and defective masks. Existing diffusion-based video inpainting models often struggle to maintain temporal stability and visual consistency under these challenges. We propose Stable Video Object Removal (SVOR), a robust framework that achieves shadow-free, flicker-free, and mask-defect-tolerant removal through three key designs: (1) Mask Union for Stable Erasure (MUSE), a windowed union strategy applied during temporal mask downsampling to preserve all target regions observed within each window, effectively handling abrupt motion and reducing missed removals; (2) Denoising-Aware Segmentation (DA-Seg), a lightweight segmentation head on a decoupled side branch equipped with Denoising-Aware AdaLN and trained with mask degradation to provide an internal diffusion-aware localization prior without affecting content generation; and (3) Curriculum Two-Stage Training, where Stage I performs self-supervised pretraining on unpaired real-background videos with online random masks to learn realistic background and temporal priors, and Stage II refines on synthetic pairs using mask degradation and side-effect-weighted losses, jointly removing objects and their associated shadows/reflections while improving cross-domain robustness. Extensive experiments show that SVOR attains new state-of-the-art results across multiple datasets and degraded-mask benchmarks, advancing video object removal from ideal settings toward real-world applications.
[98] Learning Convex Decomposition via Feature Fields
Yuezhi Yang,Qixing Huang,Mikaela Angelina Uy,Nicholas Sharp
Main category: cs.CV
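TL;DR: This paper proposes a new method based on learned feature fields, yielding the first feed-forward model for open-world convex decomposition; it decomposes 3D shapes into convex bodies with high quality and generalizes across representations (meshes, CAD models, Gaussian splats).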
Details
Motivation: Address the long-standing problem of convex decomposition and improve the efficiency of applications such as collision detection in physical simulation.
Method: Adopts a feature learning approach: learns a continuous feature field and clusters it to obtain a convex decomposition, using a self-supervised, purely geometric objective derived from the definition of convexity; supports both single-shape optimization and large-scale self-supervised learning.
Result: Decomposition quality exceeds that of existing methods, with good generalization across open-world objects and multiple 3D representations (meshes, CAD models, Gaussian splats).
Conclusion: The method is the first scalable, self-supervised, open-world learned model for convex decomposition, and it substantially improves decomposition quality and generalization.
Abstract: This work proposes a new formulation to the long-standing problem of convex decomposition through learning feature fields, enabling the first feed-forward model for open-world convex decomposition. Our method produces high-quality decompositions of 3D shapes into a union of convex bodies, which are essential to accelerate collision detection in physical simulation, amongst many other applications. The key insight is to adopt a feature learning approach and learn a continuous feature field that can later be clustered to yield a good convex decomposition via our self-supervised, purely-geometric objective derived from the classical definition of convexity. Our formulation can be used for single shape optimization, but more importantly, feature prediction unlocks scalable, self-supervised learning on large datasets resulting in the first learned open-world model for convex decomposition. Experiments show that our decompositions are higher-quality than alternatives and generalize across open-world objects as well as across representations to meshes, CAD models, and even Gaussian splats. https://research.nvidia.com/labs/sil/projects/learning-convex-decomp/
[99] CogBlender: Towards Continuous Cognitive Intervention in Text-to-Image Generation
Shengqi Dang,Jiaying Lei,Yi He,Ziqing Qian,Nan Cao
Main category: cs.CV
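TL;DR: This paper proposes the CogBlender framework, which applies continuous, multi-dimensional intervention on cognitive attributes (such as valence, arousal, dominance, and image memorability) during text-to-image generation, enabling image generation that is controllable both semantically and cognitively.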
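To make the "continuous intervention" idea concrete: this entry's abstract says the velocity field in flow matching is reformulated by interpolating the velocity fields of different Cognitive Anchors, steered by cognitive scores. The sketch below shows only the simplest possible reading, a linear blend between two anchors along one cognitive dimension followed by an Euler integration step. The two-anchor setup, the score range [0, 1], and the names `blend_velocity` and `euler_step` are assumptions; the paper's actual multi-dimensional scheme is not specified in the abstract.

```python
def blend_velocity(v_low, v_high, score):
    """Linearly interpolate two anchor velocity vectors.

    v_low, v_high : velocities predicted under the low and high anchor
                    (flat lists of floats, assumed representation).
    score         : target cognitive score in [0, 1]; 0 follows the low
                    anchor, 1 follows the high anchor.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    return [(1.0 - score) * a + score * b for a, b in zip(v_low, v_high)]


def euler_step(x, velocity, dt):
    """Advance the sample x by one explicit Euler step along the velocity."""
    return [xi + dt * vi for xi, vi in zip(x, velocity)]


# One integration step with a score a quarter of the way toward the high anchor.
v = blend_velocity([0.0, 2.0], [4.0, 2.0], score=0.25)
x_next = euler_step([1.0, 1.0], v, dt=0.5)
```

Because the score enters as a continuous weight rather than a discrete prompt token, nearby scores give nearby velocity fields, which is what makes fine-grained control plausible under this reading.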
Details
Motivation: Existing text-to-image models generate semantically coherent images but struggle to control the cognitive effects those images evoke (such as emotional response and memory encoding), so they cannot meet a specific psychological intent.
Method: Builds a mapping between a Cognitive Space and a Semantic Manifold; defines Cognitive Anchors as the boundary points of the cognitive space; within the flow-matching process, interpolates the velocity fields of different anchors, with multi-dimensional cognitive scores dynamically steering generation.
Result: Validated on four cognitive dimensions (valence, arousal, dominance, and image memorability); experiments show the method achieves precise, fine-grained, and continuous control of cognitive attributes.
Conclusion: CogBlender offers a new paradigm for cognition-driven creative design, moving generative models from semantic controllability toward cognitive controllability.
Abstract: Beyond conveying semantic information, an image can also manifest cognitive attributes that elicit specific cognitive processes from the viewer, such as memory encoding or emotional response. While modern text-to-image models excel at generating semantically coherent content, they remain limited in their ability to control such cognitive properties of images (e.g., valence, memorability), often failing to align with the specific psychological intent. To bridge this gap, we introduce CogBlender, a framework that enables continuous and multi-dimensional intervention of cognitive properties during text-to-image generation. Our approach is built upon a mapping between the Cognitive Space, representing the space of cognitive properties, and the Semantic Manifold, representing the manifold of the visual semantics. We define a set of Cognitive Anchors, serving as the boundary points for the cognitive space. Then we reformulate the velocity field within the flow-matching process by interpolating from the velocity field of different anchors. Consequently, the generative process is driven by the velocity field and dynamically steered by multi-dimensional cognitive scores, enabling precise, fine-grained, and continuous intervention. We validate the effectiveness of CogBlender across four representative cognitive dimensions: valence, arousal, dominance, and image memorability. Extensive experiments demonstrate that our method achieves effective cognitive intervention. Our work provides an effective paradigm for cognition-driven creative design.
[100] Exploring Modality-Aware Fusion and Decoupled Temporal Propagation for Multi-Modal Object Tracking
Shilei Wang,Pujian Lai,Dong Gao,Jifeng Ning,Gong Cheng
Main category: cs.CV
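TL;DR: This paper proposes the MDTrack framework, which uses modality-aware fusion and decoupled temporal propagation to address poorly adapted modality fusion and entangled temporal representations in existing multimodal trackers.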
Details
Motivation: Existing methods adopt a uniform fusion strategy that ignores inherent differences between modalities, and they propagate temporal information through mixed tokens, which yields entangled, weakly discriminative temporal representations.
Method: (1) Modality-aware fusion: assign dedicated experts to the infrared, event, depth, and RGB modalities and use a gating mechanism to dynamically select the best experts; (2) decoupled temporal propagation: design separate State Space Models (SSMs) for the RGB and X modal streams, with cross-modal attention providing implicit interaction and feature fusion.
Result: MDTrack S and MDTrack U both reach SOTA performance on five mainstream multimodal tracking benchmarks.
Conclusion: Modality-aware fusion and decoupled temporal propagation substantially improve the accuracy and temporal modeling of multimodal object tracking.
Abstract: Most existing multimodal trackers adopt uniform fusion strategies, overlooking the inherent differences between modalities. Moreover, they propagate temporal information through mixed tokens, leading to entangled and less discriminative temporal representations. To address these limitations, we propose MDTrack, a novel framework for modality-aware fusion and decoupled temporal propagation in multimodal object tracking. Specifically, for modality-aware fusion, we allocate dedicated experts to each modality, including infrared, event, depth, and RGB, to process their respective representations. The gating mechanism within the Mixture of Experts dynamically selects the optimal experts based on the input features, enabling adaptive and modality-specific fusion. For decoupled temporal propagation, we introduce two separate State Space Model structures to independently store and update the hidden states of the RGB and X modal streams, effectively capturing their distinct temporal information. To ensure synergy between the two temporal representations, we incorporate a set of cross-attention modules between the input features of the two SSMs, facilitating implicit information exchange. The resulting temporally enriched features are then integrated into the backbone through another set of cross-attention modules, enhancing MDTrack's ability to leverage temporal information. Extensive experiments demonstrate the effectiveness of our proposed method. Both MDTrack S and MDTrack U achieve state-of-the-art performance across five multimodal tracking benchmarks.
[101] DenoiseSplat: Feed-Forward Gaussian Splatting for Noisy 3D Scene Reconstruction
Fuzhen Jiang,Zhuoran Li,Yinlin Zhang
Main category: cs.CV
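TL;DR: This paper proposes DenoiseSplat, a feed-forward 3D Gaussian splatting method for noisy multi-view images; by building a large-scale noisy-clean paired dataset and supervising training with 2D renderings only, it substantially improves reconstruction and novel-view synthesis quality under several noise types.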
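The noisy-clean pairs mentioned above are built, per this entry's abstract, by injecting Gaussian, Poisson, speckle, and salt-and-pepper noise at controlled intensities. The sketch below shows textbook versions of those four noise models on a flat list of pixel intensities in [0, 1]. It is not the authors' pipeline: the parameter names (`sigma`, `prob`, `peak`), the clipping, and the pure-Python Poisson sampler are all assumptions chosen to keep the example dependency-free.

```python
import math
import random


def clip01(value):
    """Clamp a pixel intensity to the valid range [0, 1]."""
    return min(1.0, max(0.0, value))


def gaussian_noise(img, sigma, rng):
    """Additive zero-mean Gaussian noise with standard deviation sigma."""
    return [clip01(p + rng.gauss(0.0, sigma)) for p in img]


def speckle_noise(img, sigma, rng):
    """Multiplicative noise: each pixel is scaled by (1 + Gaussian sample)."""
    return [clip01(p * (1.0 + rng.gauss(0.0, sigma))) for p in img]


def salt_and_pepper(img, prob, rng):
    """Set a pixel to 0 (pepper) or 1 (salt), each with probability prob / 2."""
    out = []
    for p in img:
        r = rng.random()
        out.append(0.0 if r < prob / 2 else 1.0 if r < prob else p)
    return out


def _poisson_sample(lam, rng):
    """Knuth's multiplication method; adequate for small rates like these."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def poisson_noise(img, peak, rng):
    """Shot noise: photon counts ~ Poisson(pixel * peak), rescaled to [0, 1]."""
    return [clip01(_poisson_sample(p * peak, rng) / peak) for p in img]


rng = random.Random(0)  # fixed seed so a noisy-clean pair is reproducible
clean = [0.0, 0.25, 0.5, 1.0]
pairs = {
    "gaussian": gaussian_noise(clean, 0.1, rng),
    "speckle": speckle_noise(clean, 0.1, rng),
    "salt_pepper": salt_and_pepper(clean, 0.2, rng),
    "poisson": poisson_noise(clean, 30.0, rng),
}
```

A lower `peak` means fewer simulated photons and therefore stronger Poisson noise, which gives one knob per noise type for the "controlled intensities" the abstract refers to.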
Details
Motivation: Existing NeRF and 3D Gaussian Splatting methods degrade under real noise and artifacts and lack robustness to noisy inputs.
Method: Builds a large-scale, scene-consistent noisy-clean paired benchmark on RE10K (with Gaussian, Poisson, speckle, and salt-and-pepper noise), uses a lightweight MVSplat-style feed-forward backbone, and trains end-to-end with only clean 2D renderings as supervision, without 3D ground truth.
Result: On noisy RE10K, DenoiseSplat outperforms vanilla MVSplat and a strong two-stage baseline (IDF + MVSplat) on PSNR/SSIM and LPIPS.
Conclusion: DenoiseSplat shows that feed-forward 3D Gaussian splatting can denoise effectively and improve reconstruction quality without 3D supervision, offering a new direction for robust 3D reconstruction in real scenes.
Abstract: 3D scene reconstruction and novel-view synthesis are fundamental for VR, robotics, and content creation. However, most NeRF and 3D Gaussian Splatting pipelines assume clean inputs and degrade under real noise and artifacts. We therefore propose DenoiseSplat, a feed-forward 3D Gaussian splatting method for noisy multi-view images. We build a large-scale, scene-consistent noisy-clean benchmark on RE10K by injecting Gaussian, Poisson, speckle, and salt-and-pepper noise with controlled intensities. With a lightweight MVSplat-style feed-forward backbone, we train end-to-end using only clean 2D renderings as supervision and no 3D ground truth. On noisy RE10K, DenoiseSplat outperforms vanilla MVSplat and a strong two-stage baseline (IDF + MVSplat) in PSNR/SSIM and LPIPS across noise types and levels.
[102] IntroSVG: Learning from Rendering Feedback for Text-to-SVG Generation via an Introspective Generator-Critic Framework
Feiyu Wang,Jiayuan Yang,Zhiyuan Zhao,Da Zhang,Bingyu Li,Peng Liu,Junyu Gao
Main category: cs.CV
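TL;DR: This paper proposes the IntroSVG framework, in which a vision-language model (VLM) acts as both generator and critic in a closed loop, introducing explicit visual feedback on rendered images and substantially improving the quality of text-to-SVG generation.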
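The closed loop named above is described in this entry's abstract as an iterative "generate-review-refine" cycle at inference time. The control flow can be sketched as follows. Everything here is hypothetical scaffolding: the callables `generate`, `review`, and `refine` stand in for calls to the unified VLM (and an SVG renderer), and the stopping rule and `max_rounds` cap are assumptions, since the abstract does not give them.

```python
def generate_review_refine(prompt, generate, review, refine, max_rounds=3):
    """Iteratively improve a draft until the critic accepts it.

    generate(prompt)                -> initial SVG draft (string)
    review(prompt, draft)           -> (accepted: bool, feedback: str)
    refine(prompt, draft, feedback) -> revised SVG draft (string)
    All three callables are placeholders supplied by the caller.
    """
    draft = generate(prompt)
    for _ in range(max_rounds):
        accepted, feedback = review(prompt, draft)
        if accepted:
            break
        draft = refine(prompt, draft, feedback)
    return draft


# Toy stand-ins: the "critic" only checks that the SVG root is closed.
def toy_generate(prompt):
    return "<svg><circle r='5'/>"

def toy_review(prompt, draft):
    ok = draft.endswith("</svg>")
    return ok, "" if ok else "missing closing </svg> tag"

def toy_refine(prompt, draft, feedback):
    return draft + "</svg>"

result = generate_review_refine("a red dot", toy_generate, toy_review, toy_refine)
```

In the framework as described, the reviewer would look at the rendered image rather than the SVG source; the string check here only keeps the example self-contained.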
Details
Motivation: Existing text-to-SVG generation methods are limited because the autoregressive training process has no visual perception of the final rendered image, which constrains generation quality.
Method: Proposes the Introspective SVG Generation Framework (IntroSVG): (1) a unified VLM learns, through supervised fine-tuning (SFT), to generate SVGs and to evaluate their rendered output; (2) early-stage failure cases are converted into high-quality error-correction training data; (3) a high-capacity teacher VLM is used to build a preference dataset, and the generation policy is aligned through Direct Preference Optimization (DPO); (4) inference uses an iterative "generate-review-refine" loop.
Result: Achieves SOTA on several key metrics; the generated SVGs have more complex structure, stronger semantic alignment, and greater editability.
Conclusion: Bringing explicit visual feedback into the generation loop effectively improves SVG generation quality, supporting the effectiveness and potential of this paradigm.
Abstract: Scalable Vector Graphics (SVG) are central to digital design due to their inherent scalability and editability. Despite significant advancements in content generation enabled by Visual Language Models (VLMs), existing text-to-SVG generation methods are limited by a core challenge: the autoregressive training process does not incorporate visual perception of the final rendered image, which fundamentally constrains generation quality. To address this limitation, we propose an Introspective SVG Generation Framework (IntroSVG). At its core, the framework instantiates a unified VLM that operates in a closed loop, assuming dual roles of both generator and critic. Specifically, through Supervised Fine-Tuning (SFT), the model learns to draft SVGs and to provide feedback on their rendered outputs; moreover, we systematically convert early-stage failures into high-quality error-correction training data, thereby enhancing model robustness. Subsequently, we leverage a high-capacity teacher VLM to construct a preference dataset and further align the generator's policy through Direct Preference Optimization (DPO). During inference, the optimized generator and critic operate collaboratively in an iterative "generate-review-refine" cycle, starting from imperfect intermediate drafts to autonomously improve output quality. Experimental results demonstrate that our method achieves state-of-the-art performance across several key evaluation metrics, generating SVGs with more complex structures, stronger semantic alignment, and greater editability. These results corroborate the effectiveness of incorporating explicit visual feedback into the generation loop.
[103] CLoE: Expert Consistency Learning for Missing Modality Segmentation
Xinyu Tong,Meihua Zhou,Bowu Fan,Haitao Li
Main category: cs.CV
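TL;DR: This paper proposes CLoE, a consistency-driven framework for the problem of modalities missing at inference time in multimodal medical image segmentation; it improves segmentation robustness and accuracy through expert consistency learning and reliability-aware feature recalibration.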
Details
Motivation: Multimodal medical image segmentation often faces missing modalities at inference, which makes the per-modality expert predictions disagree and fusion unstable, and especially hurts segmentation of small target structures.
Method: Proposes the Consistency Learning of Experts (CLoE) framework, with a dual-branch expert consistency learning objective (Modality Expert Consistency and Region Expert Consistency) and a lightweight gating network that maps consistency scores to modality reliability weights, enabling reliability-aware feature recalibration and fusion.
Result: On BraTS 2020 and the MSD Prostate dataset, CLoE outperforms current state-of-the-art methods on incomplete multimodal segmentation, shows strong cross-dataset generalization, and substantially improves robustness on clinically critical structures.
Conclusion: Through decision-level consistency control and reliability-weighted fusion, CLoE mitigates the performance drop caused by missing modalities, markedly improving robustness and clinical practicality while preserving full-modality performance.
Abstract: Multimodal medical image segmentation often faces missing modalities at inference, which induces disagreement among modality experts and makes fusion unstable, particularly on small foreground structures. We propose Consistency Learning of Experts (CLoE), a consistency-driven framework for missing-modality segmentation that preserves strong performance when all modalities are available. CLoE formulates robustness as decision-level expert consistency control and introduces a dual-branch Expert Consistency Learning objective. Modality Expert Consistency enforces global agreement among expert predictions to reduce case-wise drift under partial inputs, while Region Expert Consistency emphasizes agreement on clinically critical foreground regions to avoid background-dominated regularization. We further map consistency scores to modality reliability weights using a lightweight gating network, enabling reliability-aware feature recalibration before fusion. Extensive experiments on BraTS 2020 and MSD Prostate demonstrate that CLoE outperforms state-of-the-art methods in incomplete multimodal segmentation, while exhibiting strong cross-dataset generalization and improving robustness on clinically critical structures.
[104] SpaceSense-Bench: A Large-Scale Multi-Modal Benchmark for Spacecraft Perception and Pose Estimation
Aodi Wu,Jianhong Zuo,Zeyuan Zhao,Xubo Luo,Ruisuo Wang,Xue Wan
Main category: cs.CV
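TL;DR: This paper presents SpaceSense-Bench, a large-scale multi-modal benchmark dataset for spacecraft perception, containing 136 satellite models, RGB images, high-precision depth maps, and LiDAR point clouds, with pixel-level and point-level part semantic annotations and 6-DoF pose ground truth; benchmarking five tasks on it shows that small-part recognition and zero-shot generalization remain bottlenecks, and that increasing the number of training satellites substantially improves performance on novel targets.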
Details
Motivation: Autonomous space operations (such as on-orbit servicing and active debris removal) need robust part-level semantic understanding and precise relative navigation, but real on-orbit data is expensive to acquire and hard to scale; existing synthetic datasets suffer from limited target diversity, a single sensing modality, and incomplete annotations.
Method: Builds a high-fidelity space simulation environment in Unreal Engine 5 and a fully automated data-generation pipeline, producing multi-modal data with RGB, millimeter-precision depth maps, and 256-beam LiDAR point clouds, together with 7-class part-level semantic labels at the pixel and point level and 6-DoF pose ground truth; the dataset covers 136 satellite models and totals about 70 GB.
Result: Benchmarks five tasks (object detection, 2D semantic segmentation, RGB-LiDAR fusion-based 3D segmentation, monocular depth estimation, and orientation estimation) and finds that: (i) recognition of small-scale components (such as thrusters and omni-antennas) and zero-shot generalization to entirely unseen satellites remain severely inadequate; (ii) increasing the number of training satellites substantially improves performance on novel targets.
Conclusion: SpaceSense-Bench fills the gap for a large-scale, multi-modal, fine-grained spacecraft perception benchmark, gives empirical evidence that data scale and diversity are critical for model generalization, and provides a foundation and a unified evaluation platform for future research on intelligent space perception.
Abstract: Autonomous space operations such as on-orbit servicing and active debris removal demand robust part-level semantic understanding and precise relative navigation of target spacecraft, yet collecting large-scale real data in orbit remains impractical due to cost and access constraints. Existing synthetic datasets, moreover, suffer from limited target diversity, single-modality sensing, and incomplete ground-truth annotations. We present SpaceSense-Bench, a large-scale multi-modal benchmark for spacecraft perception encompassing 136 satellite models with approximately 70 GB of data. Each frame provides time-synchronized 1024×1024 RGB images, millimeter-precision depth maps, and 256-beam LiDAR point clouds, together with dense 7-class part-level semantic labels at both the pixel and point level as well as accurate 6-DoF pose ground truth. The dataset is generated through a high-fidelity space simulation built in Unreal Engine 5 and a fully automated pipeline covering data acquisition, multi-stage quality control, and conversion to mainstream formats. We benchmark five representative tasks (object detection, 2D semantic segmentation, RGB-LiDAR fusion-based 3D point cloud segmentation, monocular depth estimation, and orientation estimation) and identify two key findings: (i) perceiving small-scale components (e.g., thrusters and omni-antennas) and generalizing to entirely unseen spacecraft in a zero-shot setting remain critical bottlenecks for current methods, and (ii) scaling up the number of training satellites yields substantial performance gains on novel targets, underscoring the value of large-scale, diverse datasets for space perception research. The dataset, code, and toolkit are publicly available at https://github.com/wuaodi/SpaceSense-Bench.
[105] OddGridBench: Exposing the Lack of Fine-Grained Visual Discrepancy Sensitivity in Multimodal Large Language Models
Tengjin Weng,Wenhao Jiang,Jingyi Wang,Ming Li,Lin Ma,Zhong Ming
Main category: cs.CV
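TL;DR: This paper proposes the OddGridBench benchmark and the OddGrid-GRPO reinforcement learning framework to systematically evaluate and improve the ability of multimodal large language models (MLLMs) to perceive subtle visual differences.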
Details
Motivation: Existing MLLMs perform well on high-level vision-language tasks, but their low-level visual perception, especially fine-grained visual discrepancy detection, has lacked systematic evaluation and improvement.
Method: Builds the controllable OddGridBench benchmark (over 1,400 grid images in which a single element differs from the others in color, size, rotation, or position); proposes the OddGrid-GRPO framework, which combines curriculum learning with a distance-aware reward for reinforcement training.
Result: All tested MLLMs (including Qwen3-VL, InternVL3.5, Gemini-2.5-Pro, and GPT-5) perform far below human level on OddGridBench; OddGrid-GRPO significantly improves the model's fine-grained visual discrimination.
Conclusion: OddGridBench and OddGrid-GRPO provide a new benchmark and an effective method for advancing research on perceptual grounding and visual discrepancy sensitivity in multimodal intelligence.
Abstract: Multimodal large language models (MLLMs) have achieved remarkable performance across a wide range of vision language tasks. However, their ability in low-level visual perception, particularly in detecting fine-grained visual discrepancies, remains underexplored and lacks systematic analysis. In this work, we introduce OddGridBench, a controllable benchmark for evaluating the visual discrepancy sensitivity of MLLMs. OddGridBench comprises over 1,400 grid-based images, where a single element differs from all others by one or multiple visual attributes such as color, size, rotation, or position. Experiments reveal that all evaluated MLLMs, including open-source families such as Qwen3-VL and InternVL3.5, and proprietary systems like Gemini-2.5-Pro and GPT-5, perform far below human levels in visual discrepancy detection. We further propose OddGrid-GRPO, a reinforcement learning framework that integrates curriculum learning and distance-aware reward. By progressively controlling the difficulty of training samples and incorporating spatial proximity constraints into the reward design, OddGrid-GRPO significantly enhances the model's fine-grained visual discrimination ability. We hope OddGridBench and OddGrid-GRPO will lay the groundwork for advancing perceptual grounding and visual discrepancy sensitivity in multimodal intelligence. Code and dataset are available at https://wwwtttjjj.github.io/OddGridBench/.
[106] Beyond Scaling: Assessing Strategic Reasoning and Rapid Decision-Making Capability of LLMs in Zero-sum Environments
Yang Li,Xing Chen,Yutao Liu,Gege Qi,Yanxian BI,Zizhe Wang,Yunjian Zhang,Yao Zhu
Main category: cs.CV
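TL;DR: This paper proposes the STAR benchmark for evaluating the interactive reasoning of large language models in adversarial, time-sensitive environments; it emphasizes the balance between strategic planning and tactical execution and reveals a trade-off between reasoning depth and execution speed.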
Details
Motivation: Existing evaluations focus mainly on static, single-shot reasoning and overlook the challenges of real interactive settings, such as opponent-aware decision-making, time constraints, and execution under pressure.
Method: Builds STAR, a multi-agent evaluation framework supporting both turn-based and real-time modes; designs a modular architecture with a standardized API and an execution engine; and proposes a Strategic Evaluation Suite that measures execution efficiency and outcome stability in addition to win rate.
Result: Experiments reveal a pronounced "strategy-execution gap": models with strong reasoning dominate turn-based play but do worse in real-time settings because of inference latency, where faster instruction-tuned models perform better.
Conclusion: Strategic intelligence in interactive environments depends not only on reasoning depth but also on the ability to turn plans into timely actions; STAR offers a principled benchmark for such dynamic competitive settings.
Abstract: Large Language Models (LLMs) have achieved strong performance on static reasoning benchmarks, yet their effectiveness as interactive agents operating in adversarial, time-sensitive environments remains poorly understood. Existing evaluations largely treat reasoning as a single-shot capability, overlooking the challenges of opponent-aware decision-making, temporal constraints, and execution under pressure. This paper introduces Strategic Tactical Agent Reasoning (STAR) Benchmark, a multi-agent evaluation framework that assesses LLMs through 1v1 zero-sum competitive interactions, framing reasoning as an iterative, adaptive decision-making process. STAR supports both turn-based and real-time settings, enabling controlled analysis of long-horizon strategic planning and fast-paced tactical execution within a unified environment. Built on a modular architecture with a standardized API and fully implemented execution engine, STAR facilitates reproducible evaluation and flexible task customization. To move beyond binary win-loss outcomes, we introduce a Strategic Evaluation Suite that assesses not only competitive success but also the quality of strategic behavior, such as execution efficiency and outcome stability. Extensive pairwise evaluations reveal a pronounced strategy-execution gap: while reasoning-intensive models dominate turn-based settings, their inference latency often leads to inferior performance in real-time scenarios, where faster instruction-tuned models prevail. These results show that strategic intelligence in interactive environments depends not only on reasoning depth, but also on the ability to translate plans into timely actions, positioning STAR as a principled benchmark for studying this trade-off in competitive, dynamic settings.
[107] Predictive Spectral Calibration for Source-Free Test-Time Regression
Nguyen Viet Tuan Kiet,Huynh Thanh Trung,Pham Huy Hieu
Main category: cs.CV
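TL;DR: This paper proposes Predictive Spectral Calibration (PSC), a source-free test-time adaptation (TTA) method for image regression; it uses block spectral matching to perform subspace alignment and residual spectral calibration, and it improves performance on multiple benchmarks, especially under severe distribution shift.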
Details
Motivation: Test-time adaptation (TTA) for image regression is far less studied than for classification, and classification methods are hard to transfer directly to continuous regression targets; existing subspace-alignment methods have made progress but still rely on a fixed support subspace and do not fully exploit predictive structure.
Method: Proposes Predictive Spectral Calibration (PSC), a source-free framework that aligns target features within the source model's predictive support subspace and calibrates residual spectral slack in the orthogonal complement; the method is simple, model-agnostic, and compatible with off-the-shelf pretrained regressors.
Result: Clearly outperforms strong baselines on multiple image regression benchmarks, with larger gains under severe distribution shift.
Conclusion: PSC offers an effective, general, and practical approach to regression TTA and shows the value of jointly using the predictive subspace and orthogonal residual calibration.
Abstract: Test-time adaptation (TTA) for image regression has received far less attention than its classification counterpart. Methods designed for classification often depend on classification-specific objectives and decision boundaries, making them difficult to transfer directly to continuous regression targets. Recent progress revisits regression TTA through subspace alignment, showing that simple source-guided alignment can be both practical and effective. Building on this line of work, we propose Predictive Spectral Calibration (PSC), a source-free framework that extends subspace alignment to block spectral matching. Instead of relying on a fixed support subspace alone, PSC jointly aligns target features within the source predictive support and calibrates residual spectral slack in the orthogonal complement. PSC remains simple to implement, model-agnostic, and compatible with off-the-shelf pretrained regressors. Experiments on multiple image regression benchmarks show consistent improvements over strong baselines, with particularly clear gains under severe distribution shifts.
[108] Evidential Perfusion Physics-Informed Neural Networks with Residual Uncertainty Quantification
Junhyeok Lee,Minseo Choi,Han Jang,Young Hun Jeon,Heeseong Eum,Joon Jang,Chul-Ho Sohn,Kyu Sung Choi
Main category: cs.CV
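TL;DR: This paper proposes the EPPINN framework, which combines evidential deep learning with physics-informed modeling to quantify uncertainty in CTP perfusion parameter estimation, improving the accuracy and reliability of acute ischemic stroke assessment.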
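For readers unfamiliar with evidential regression: this entry's abstract says a Normal-Inverse-Gamma (NIG) distribution is placed over the physics residual to characterize aleatoric and epistemic uncertainty without sampling or ensembles. Under the standard NIG parameterization used in deep evidential regression, both uncertainties have closed forms, sketched below. This is the generic textbook result, not code from the paper; how EPPINN predicts the parameters per voxel and which exact summaries it reports are not stated in the abstract, and the function name is hypothetical.

```python
def nig_uncertainty(nu, alpha, beta):
    """Closed-form uncertainties of a Normal-Inverse-Gamma(gamma, nu, alpha, beta).

    aleatoric = E[sigma^2] = beta / (alpha - 1)         (expected data noise)
    epistemic = Var[mu]    = beta / (nu * (alpha - 1))  (uncertainty in the mean)
    Both require alpha > 1; nu and beta must be positive.
    The location parameter gamma does not affect either quantity.
    """
    if nu <= 0.0 or beta <= 0.0 or alpha <= 1.0:
        raise ValueError("need nu > 0, beta > 0 and alpha > 1")
    aleatoric = beta / (alpha - 1.0)
    epistemic = beta / (nu * (alpha - 1.0))
    return aleatoric, epistemic


# More "virtual evidence" (larger nu) shrinks epistemic uncertainty only.
low_evidence = nig_uncertainty(nu=2.0, alpha=3.0, beta=4.0)
high_evidence = nig_uncertainty(nu=8.0, alpha=3.0, beta=4.0)
```

The two quantities come from a single forward pass, which is why this family of methods avoids the Bayesian sampling or ensemble inference the abstract mentions.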
Details
Motivation: Existing PINN-based methods are deterministic and cannot quantify the uncertainty arising from violations of physics constraints, which limits reliability assessment.
Method: Proposes Evidential Perfusion Physics-Informed Neural Networks (EPPINN), which model arterial input, tissue concentration, and perfusion parameters with coordinate-based networks and place a Normal-Inverse-Gamma distribution over the physics residual to characterize voxel-wise aleatoric and epistemic uncertainty; physiologically constrained parameterization and stabilization strategies are added to make per-case optimization more robust.
Result: On digital phantoms, the ISLES 2018 benchmark, and a clinical cohort, EPPINN outperforms classical deconvolution and PINN baselines under sparse sampling and low signal-to-noise ratio, with lower error and conservative uncertainty estimates that have high empirical coverage; on clinical data it attains the highest infarct-core detection sensitivity.
Conclusion: Evidential physics-informed learning can improve both the accuracy and the reliability of CTP analysis and is suited to time-critical stroke assessment.
Abstract: Physics-informed neural networks (PINNs) have shown promise in addressing the ill-posed deconvolution problem in computed tomography perfusion (CTP) imaging for acute ischemic stroke assessment. However, existing PINN-based approaches remain deterministic and do not quantify uncertainty associated with violations of physics constraints, limiting reliability assessment. We propose Evidential Perfusion Physics-Informed Neural Networks (EPPINN), a framework that integrates evidential deep learning with physics-informed modeling to enable uncertainty-aware perfusion parameter estimation. EPPINN models arterial input, tissue concentration, and perfusion parameters using coordinate-based networks, and places a Normal-Inverse-Gamma distribution over the physics residual to characterize voxel-wise aleatoric and epistemic uncertainty in physics consistency without requiring Bayesian sampling or ensemble inference. The framework further incorporates physiologically constrained parameterization and stabilization strategies to promote robust per-case optimization. We evaluate EPPINN on digital phantom data, the ISLES 2018 benchmark, and a clinical cohort. On the evaluated datasets, EPPINN achieves lower normalized mean absolute error than classical deconvolution and PINN baselines, particularly under sparse temporal sampling and low signal-to-noise conditions, while providing conservative uncertainty estimates with high empirical coverage. On clinical data, EPPINN attains the highest voxel-level and case-level infarct-core detection sensitivity. These results suggest that evidential physics-informed learning can improve both accuracy and reliability of CTP analysis for time-critical stroke assessment.
[109] M3GCLR: Multi-View Mini-Max Infinite Skeleton-Data Game Contrastive Learning For Skeleton-Based Action Recognition
Yanshan Li,Ke Ma,Miaomiao Wei,Linhui Dai
Main category: cs.CV
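TL;DR: This paper proposes M3GCLR, a game-theoretic multi-view mini-max contrastive learning framework for skeleton data; it builds an Infinite Skeleton-data Game (ISG) model with an equilibrium theorem, combines multi-view rotation augmentation with neutral-anchor alignment, designs a strongly adversarial mini-max game, and introduces a dual-loss equilibrium optimizer, matching or exceeding SOTA on multiple benchmarks.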
Details
Motivation: Existing self-supervised skeleton-based action recognition methods have three limitations: insufficient modeling of view discrepancies, no effective adversarial mechanism, and uncontrollable augmentation perturbations.
Method: Proposes the M3GCLR framework, which: (1) establishes the Infinite Skeleton-data Game (ISG) model and its equilibrium theorem, with a rigorous proof; (2) generates normal-extreme data pairs through multi-view rotation and uses the temporally averaged input as a neutral anchor for structural alignment; (3) builds a strongly adversarial mini-max skeleton-data game; (4) designs a dual-loss equilibrium optimizer and proves its equivalence to the ISG model.
Result: Reaches 82.1%/85.8% on NTU RGB+D 60 (X-Sub/X-View), 72.3%/75.0% on NTU RGB+D 120 (X-Sub/X-Set), and 89.1%/45.2% on PKU-MMD Part I/II (three-stream), all matching or exceeding SOTA; ablation studies confirm the effectiveness of each module.
Conclusion: By taking a game-theoretic view that jointly models view discrepancy, adversarial learning, and perturbation control in contrastive learning, M3GCLR improves the discriminability and robustness of skeleton action representations and offers a new paradigm for self-supervised skeleton-based action recognition.
Abstract: In recent years, contrastive learning has drawn significant attention as an effective approach to reducing reliance on labeled data. However, existing methods for self-supervised skeleton-based action recognition still face three major limitations: insufficient modeling of view discrepancies, lack of effective adversarial mechanisms, and uncontrollable augmentation perturbations. To tackle these issues, we propose the Multi-view Mini-Max infinite skeleton-data Game Contrastive Learning for skeleton-based action Recognition (M3GCLR), a game-theoretic contrastive framework. First, we establish the Infinite Skeleton-data Game (ISG) model and the ISG equilibrium theorem, and further provide a rigorous proof, enabling mini-max optimization based on multi-view mutual information. Then, we generate normal-extreme data pairs through multi-view rotation augmentation and adopt temporally averaged input as a neutral anchor to achieve structural alignment, thereby explicitly characterizing perturbation strength. Next, leveraging the proposed equilibrium theorem, we construct a strongly adversarial mini-max skeleton-data game to encourage the model to mine richer action-discriminative information. Finally, we introduce the dual-loss equilibrium optimizer to optimize the game equilibrium, allowing the learning process to maximize action-relevant information while minimizing encoding redundancy, and we prove the equivalence between the proposed optimizer and the ISG model. Extensive experiments show that M3GCLR achieves three-stream accuracies of 82.1% and 85.8% on NTU RGB+D 60 (X-Sub, X-View) and 72.3% and 75.0% on NTU RGB+D 120 (X-Sub, X-Set). On PKU-MMD Part I and II, it attains 89.1% and 45.2% (three-stream), respectively, with all results matching or outperforming state-of-the-art performance. Ablation studies confirm the effectiveness of each component.
[110] MIL-PF: Multiple Instance Learning on Precomputed Features for Mammography Classification
Nikola Jovišić,Milica Škipina,Nicola Dall'Asen,Dubravko Ćulibrk
Main category: cs.CV
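TL;DR: This paper proposes the MIL-PF framework, which extracts features with frozen foundation models and classifies mammograms with a lightweight multiple instance learning (MIL) head, keeping performance high while substantially reducing training cost.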
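To illustrate what a lightweight MIL head over precomputed features can look like: this entry's abstract describes attention-based aggregation trained on top of frozen encoder features. The sketch below implements plain softmax attention pooling over a bag of feature vectors, with a single scoring vector `w` standing in for the learned parameters. It is a generic attention-MIL illustration under those assumptions, not the paper's 40k-parameter architecture, whose details the abstract does not give.

```python
import math


def attention_pool(features, w):
    """Aggregate a bag of instance features into one bag-level vector.

    features : list of K precomputed feature vectors (lists of floats),
               e.g. one per image patch from a frozen encoder.
    w        : scoring vector of the same dimension (stand-in for the
               trainable attention parameters).
    Returns (bag_vector, attention_weights); the weights sum to 1.
    """
    scores = [sum(wi * hi for wi, hi in zip(w, h)) for h in features]
    top = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    attn = [e / total for e in exps]
    dim = len(features[0])
    bag = [sum(a * h[d] for a, h in zip(attn, features)) for d in range(dim)]
    return bag, attn


patches = [[1.0, 0.0], [0.0, 1.0]]
uniform_bag, uniform_attn = attention_pool(patches, w=[0.0, 0.0])
focused_bag, focused_attn = attention_pool(patches, w=[10.0, 0.0])
```

Because the encoder is frozen, `features` can be computed once and cached, so each training step touches only the small pooling head and a classifier on `bag`; that is the source of the reduced training cost described in the entry.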
Details
Motivation: Modern foundation models are hard to adapt to high-resolution medical imaging such as mammography: annotations are scarce, supervision is weak, images are large and multi-view, and labels are coarse (breast-level), which makes end-to-end fine-tuning computationally expensive and impractical.
Method: Proposes Multiple Instance Learning on Precomputed Features (MIL-PF): freeze the pretrained visual encoder and precompute image patch features; design a lightweight attention-based MIL head (only 40k parameters) that aggregates local lesion signals and global tissue context.
Result: Reaches SOTA classification performance on clinical-scale data while greatly reducing training complexity (no retraining of large backbones), supporting efficient experimentation and adaptation.
Conclusion: MIL-PF is an efficient, scalable approach to mammography analysis that balances expressiveness and practicality; the code is released for reproducibility.
Abstract: Modern foundation models provide highly expressive visual representations, yet adapting them to high-resolution medical imaging remains challenging due to limited annotations and weak supervision. Mammography, in particular, is characterized by large images, variable multi-view studies and predominantly breast-level labels, making end-to-end fine-tuning computationally expensive and often impractical. We propose Multiple Instance Learning on Precomputed Features (MIL-PF), a scalable framework that combines frozen foundation encoders with a lightweight MIL head for mammography classification. By precomputing the semantic representations and training only a small task-specific aggregation module (40k parameters), the method enables efficient experimentation and adaptation without retraining large backbones. The architecture explicitly models the global tissue context and the sparse local lesion signals through attention-based aggregation. MIL-PF achieves state-of-the-art classification performance at clinical scale while substantially reducing training complexity. We release the code for full reproducibility.
[111] SinGeo: Unlock Single Model's Potential for Robust Cross-View Geo-Localization
Yang Chen,Xieyuanli Chen,Junxiang Li,Jie Tang,Tao Wu
Main category: cs.CV
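TL;DR: This paper proposes the SinGeo framework, which uses a dual discriminative learning architecture and a curriculum learning strategy so that a single model achieves robust cross-view geo-localization across fields of view and orientations, reaching SOTA performance with cross-architecture transferability.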
Details
Motivation: Existing cross-view geo-localization methods are trained for a fixed field of view (FoV) and degrade sharply on unseen FoVs and unknown orientations, so multiple models must be deployed; simple random-FoV training does not truly improve robustness, because it implicitly assumes all FoVs are equally difficult.
Method: SinGeo uses a dual discriminative learning architecture that strengthens intra-view discriminability in the ground and satellite branches separately, and is the first to introduce a curriculum learning strategy that trains progressively by FoV difficulty; it needs no additional modules or explicit geometric transformations.
Result: Achieves SOTA results on four benchmark datasets and clearly outperforms methods trained specifically for extreme FoVs; transfers across network architectures; also proposes a consistency evaluation method that quantifies model stability across views.
Conclusion: With a simple design, SinGeo makes a single model robust to diverse FoVs and orientations, moves CVGL closer to practical deployment, and offers an explainable evaluation approach for robustness research.
Abstract: Robust cross-view geo-localization (CVGL) remains challenging despite the surge in recent progress. Existing methods still rely on field-of-view (FoV)-specific training paradigms, where models are optimized under a fixed FoV but collapse when tested on unseen FoVs and unknown orientations. This limitation necessitates deploying multiple models to cover diverse variations. Although studies have explored dynamic FoV training by simply randomizing FoVs, they failed to achieve robustness across diverse conditions, implicitly assuming all FoVs are equally difficult. To address this gap, we present SinGeo, a simple yet powerful framework that enables a single model to realize robust cross-view geo-localization without additional modules or explicit transformations. SinGeo employs a dual discriminative learning architecture that enhances intra-view discriminability within both ground and satellite branches, and is the first to introduce a curriculum learning strategy to achieve robust CVGL. Extensive evaluations on four benchmark datasets reveal that SinGeo sets state-of-the-art (SOTA) results under diverse conditions, and notably outperforms methods specifically trained for extreme FoVs. Beyond superior performance, SinGeo also exhibits cross-architecture transferability. Furthermore, we propose a consistency evaluation method to quantitatively assess model stability under varying views, providing an explainable perspective for understanding and advancing robustness in future CVGL research. Codes will be available upon acceptance.
[112] EventVGGT: Exploring Cross-Modal Distillation for Consistent Event-based Depth Estimation
Yinrui Ren,Jinjing Zhu,Kanghao Chen,Zhuoxiao Li,Jing Ou,Zidong Cao,Tongyan Hua,Peilun Shi,Yingchun Fu,Wufan Zhao,Hui Xiong
Main category: cs.CV
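TL;DR: This paper proposes the EventVGGT framework, the first to distill the spatio-temporal and multi-view geometric priors of the Visual Geometry Grounded Transformer (VGGT) into the event-camera domain; a three-level strategy of cross-modal feature mixture, spatio-temporal feature distillation, and temporal consistency distillation substantially improves the accuracy and temporal consistency of monocular event-based depth estimation.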
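The temporal consistency distillation mentioned above is described in this entry's abstract as aligning inter-frame depth changes between student and teacher. One simple way to express that idea is an L1 penalty on the difference of frame-to-frame depth deltas, sketched below. The L1 form, the flat per-frame depth lists, and the function name are assumptions for illustration; the abstract does not specify the loss the authors use.

```python
def temporal_consistency_loss(student, teacher):
    """Mean absolute mismatch between student and teacher depth CHANGES.

    student, teacher : lists of T frames, each a flat list of per-pixel
                       depths (assumed layout), with matching shapes.
    For each consecutive frame pair the loss compares
    (student[t+1] - student[t]) against (teacher[t+1] - teacher[t]),
    so a constant depth offset between the two models is not penalized.
    """
    total, count = 0.0, 0
    for t in range(len(student) - 1):
        frames = zip(student[t], student[t + 1], teacher[t], teacher[t + 1])
        for s0, s1, t0, t1 in frames:
            total += abs((s1 - s0) - (t1 - t0))
            count += 1
    return total / count if count else 0.0


teacher_depth = [[0.0, 0.0], [1.0, 1.0]]
student_depth = [[1.0, 2.0], [2.0, 4.0]]
loss = temporal_consistency_loss(student_depth, teacher_depth)
shifted = [[d + 5.0 for d in frame] for frame in teacher_depth]
```

Penalizing changes rather than absolute values targets flicker between frames directly, leaving absolute accuracy to the other distillation terms.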
Details
Motivation: Event cameras have advantages under high-speed motion and extreme lighting, but dense depth annotations are scarce; existing annotation-free methods treat the event stream as independent frames, ignore its inherent temporal continuity, and cannot use the rich temporal priors in vision foundation models, which yields temporally inconsistent, less accurate depth predictions.
Method: Proposes the EventVGGT framework with a tri-level distillation strategy: (i) Cross-Modal Feature Mixture (CMFM) fuses RGB and event features at the output level to generate auxiliary depth predictions; (ii) Spatio-Temporal Feature Distillation (STFD) distills VGGT's spatio-temporal representations at the feature level; (iii) Temporal Consistency Distillation (TCD) aligns inter-frame depth changes at the temporal level to strengthen cross-frame consistency.
Result: On EventScape, the absolute mean depth error at 30 m drops by 53% (from 2.30 to 1.06), with strong zero-shot generalization on the DENSE and MVSEC datasets.
Conclusion: Explicitly modeling the event stream as a coherent video sequence and distilling VGGT's spatio-temporal and geometric priors effectively improves the accuracy, temporal consistency, and generalization of event-based depth estimation, offering a new paradigm for event-based perception.
Abstract: Event cameras offer superior sensitivity to high-speed motion and extreme lighting, making event-based monocular depth estimation a promising approach for robust 3D perception in challenging conditions. However, progress is severely hindered by the scarcity of dense depth annotations. While recent annotation-free approaches mitigate this by distilling knowledge from Vision Foundation Models (VFMs), a critical limitation persists: they process event streams as independent frames. By neglecting the inherent temporal continuity of event data, these methods fail to leverage the rich temporal priors encoded in VFMs, ultimately yielding temporally inconsistent and less accurate depth predictions. To address this, we introduce EventVGGT, a novel framework that explicitly models the event stream as a coherent video sequence. To the best of our knowledge, we are the first to distill spatio-temporal and multi-view geometric priors from the Visual Geometry Grounded Transformer (VGGT) into the event domain. We achieve this via a comprehensive tri-level distillation strategy: (i) Cross-Modal Feature Mixture (CMFM) bridges the modality gap at the output level by fusing RGB and event features to generate auxiliary depth predictions; (ii) Spatio-Temporal Feature Distillation (STFD) distills VGGT's powerful spatio-temporal representations at the feature level; and (iii) Temporal Consistency Distillation (TCD) enforces cross-frame coherence at the temporal level by aligning inter-frame depth changes. Extensive experiments demonstrate that EventVGGT consistently outperforms existing methods, reducing the absolute mean depth error at 30m by over 53% on EventScape (from 2.30 to 1.06), while exhibiting robust zero-shot generalization on the unseen DENSE and MVSEC datasets.
[113] Training-Free Coverless Multi-Image Steganography with Access Control
Minyeol Bae,Si-Hyeon Lee
Main category: cs.CV
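TL;DR: This paper proposes MIDAS, a training-free, diffusion-based coverless image steganography framework that supports multi-image hiding and user-specific access control, achieving strong robustness and privacy protection through latent-space fusion.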
Details
Motivation: Existing coverless image steganography methods lack robust access control and struggle to grant different content to different users in multi-user settings, which limits their scalability in privacy-sensitive applications. Method: Proposes the MIDAS framework, comprising a Random Basis mechanism (which suppresses residual structural information) and a Latent Vector Fusion module (which aligns aggregated latents with the diffusion process); multi-image fusion and user-level access control are performed in latent space, with no training required. Result: Experiments show that MIDAS outperforms existing training-free CIS baselines in access control functionality, stego image quality and diversity, robustness to noise, and resistance to steganalysis. Conclusion: MIDAS offers a practical and scalable approach to access control for coverless steganography, balancing security, imperceptibility, and multi-user support. Abstract: Coverless Image Steganography (CIS) hides information without explicitly modifying a cover image, providing strong imperceptibility and inherent robustness to steganalysis. However, existing CIS methods largely lack robust access control, making it difficult to selectively reveal different hidden contents to different authorized users. Such access control is critical for scalable and privacy-sensitive information hiding in multi-user settings. We propose MIDAS, a training-free diffusion-based CIS framework that enables multi-image hiding with user-specific access control via latent-level fusion. MIDAS introduces a Random Basis mechanism to suppress residual structural information and a Latent Vector Fusion module that reshapes aggregated latents to align with the diffusion process. Experimental results demonstrate that MIDAS consistently outperforms existing training-free CIS baselines in access control functionality, stego image quality and diversity, robustness to noise, and resistance to steganalysis, establishing a practical and scalable approach to access-controlled coverless steganography.
[114] ICDAR 2025 Competition on End-to-End Document Image Machine Translation Towards Complex Layouts
Yaping Zhang,Yupu Liang,Zhiyang Zhang,Zhiyuan Chen,Lu Xiang,Yang Zhao,Yu Zhou,Chengqing Zong
Main category: cs.CV
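TL;DR: The DIMT 2025 Challenge advances research on end-to-end document image translation with two tracks, OCR-free and OCR-based, each with small- and large-model subtasks; it attracted 69 teams, and the results indicate that large models show promise as a new paradigm for translating complex-layout documents.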
Details
Motivation: To advance research on end-to-end document image machine translation (DIMT), a frontier of multimodal document understanding, to bridge OCR and NLP, and to establish a unified, evaluable benchmark and competition platform. Method: Organizes the DIMT 2025 international challenge with two tracks, OCR-free and OCR-based, each divided into small-model (under 1B parameters) and large-model (over 1B parameters) subtasks; participants submit a single unified system and may use the provided OCR transcripts; a dedicated dataset and evaluation protocol are constructed. Result: 69 teams took part and 27 valid submissions were received (Track 1: 13; Track 2: 14); the analysis shows that large-model approaches perform better on complex-layout document image translation and point to a promising new paradigm. Conclusion: The DIMT 2025 Challenge advanced document image machine translation, confirmed the potential of large models for the task, and highlighted future research opportunities in OCR integration, layout modeling, and cross-lingual alignment. Abstract: Document Image Machine Translation (DIMT) seeks to translate text embedded in document images from one language to another by jointly modeling both textual content and page layout, bridging optical character recognition (OCR) and natural language processing (NLP). The DIMT 2025 Challenge advances research on end-to-end document image translation, a rapidly evolving area within multimodal document understanding. The competition features two tracks, OCR-free and OCR-based, each with two subtasks for small (less than 1B parameters) and large (greater than 1B parameters) models. Participants submit a single unified DIMT system, with the option to incorporate provided OCR transcripts. Running from December 10, 2024 to April 20, 2025, the competition attracted 69 teams and 27 valid submissions in total. Track 1 had 34 teams and 13 valid submissions, while Track 2 had 35 teams and 14 valid submissions. In this report, we present the challenge motivation, dataset construction, task definitions, evaluation protocol, and a summary of results. Our analysis shows that large-model approaches establish a promising new paradigm for translating complex-layout document images and highlight substantial opportunities for future research.
[115] YOLO-NAS-Bench: A Surrogate Benchmark with Self-Evolving Predictors for YOLO Architecture Search
Zhe Li,Xiaoyu Ding,Jiaxin Zheng,Yongtao Wang
Main category: cs.CV
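TL;DR: This paper proposes YOLO-NAS-Bench, the first neural architecture search (NAS) surrogate benchmark designed for YOLO-style object detectors. It builds a lightweight LightGBM surrogate predictor from 1,000 sampled architectures, introduces a self-evolving mechanism that improves the predictor's accuracy and ranking consistency in the high-performance regime, and uses the predictor in an evolutionary search that finds new architectures outperforming the YOLOv8-YOLO12 baselines.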
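As a rough illustration of what "ranking consistency" means for a surrogate predictor, the sketch below computes the plain Kendall tau (tau-a) between predicted and measured scores in pure Python. It is not the benchmark's code, and it does not reproduce the "Sparse" variant named in the abstract; the function name and the sample scores are hypothetical.

```python
from itertools import combinations


def kendall_tau(predicted, measured):
    """Kendall tau-a: (concordant - discordant) / total pairs.

    A value of 1.0 means the predictor ranks every pair of architectures
    in the same order as the measured scores; -1.0 means fully reversed.
    Tied pairs count as neither concordant nor discordant.
    """
    if len(predicted) != len(measured) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for (p1, m1), (p2, m2) in combinations(zip(predicted, measured), 2):
        product = (p1 - p2) * (m1 - m2)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    total_pairs = len(predicted) * (len(predicted) - 1) // 2
    return (concordant - discordant) / total_pairs


if __name__ == "__main__":
    # Hypothetical surrogate predictions vs. measured mAP for four architectures.
    predicted = [0.41, 0.45, 0.39, 0.50]
    measured = [0.40, 0.47, 0.42, 0.49]
    print(kendall_tau(predicted, measured))
```

A rank metric of this kind matters for NAS because the search only needs the predictor to order candidates correctly, not to predict their absolute accuracy.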
Details
Motivation: Existing NAS benchmarks mainly target image classification, leaving no dedicated NAS evaluation benchmark for YOLO-style object detectors; meanwhile, NAS for YOLO is extremely expensive because each candidate must be fully trained on COCO, so an efficient surrogate model is needed. Method: Builds a search space covering the core modules (backbone and neck) of YOLOv8 through YOLO12, samples 1,000 architectures with several sampling strategies, trains them on COCO-mini, and fits a LightGBM surrogate model; proposes a self-evolving mechanism in which the surrogate itself iteratively discovers and evaluates high-value architectures, growing the dataset to 1,500 architectures and improving predictive performance; finally uses the surrogate as the fitness function for evolutionary search. Result: The surrogate's R² rises from 0.770 to 0.815 and its Sparse Kendall Tau from 0.694 to 0.752; architectures found with the surrogate surpass all official YOLOv8-YOLO12 baselines at comparable latency on COCO-mini. Conclusion: YOLO-NAS-Bench fills the gap of an efficient, reliable surrogate benchmark for NAS on YOLO-style detectors; its self-evolving mechanism improves the surrogate's accuracy and ranking ability in the high-performance regime, supporting its use in practical NAS pipelines. Abstract: Neural Architecture Search (NAS) for object detection is severely bottlenecked by high evaluation cost, as fully training each candidate YOLO architecture on COCO demands days of GPU time. Meanwhile, existing NAS benchmarks largely target image classification, leaving the detection community without a comparable benchmark for NAS evaluation. To address this gap, we introduce YOLO-NAS-Bench, the first surrogate benchmark tailored to YOLO-style detectors. YOLO-NAS-Bench defines a search space spanning channel width, block depth, and operator type across both backbone and neck, covering the core modules of YOLOv8 through YOLO12. We sample 1,000 architectures via random, stratified, and Latin Hypercube strategies, train them on COCO-mini, and build a LightGBM surrogate predictor. To sharpen the predictor in the high-performance regime most relevant to NAS, we propose a Self-Evolving Mechanism that progressively aligns the predictor's training distribution with the high-performance frontier, by using the predictor itself to discover and evaluate informative architectures in each iteration. This method grows the pool to 1,500 architectures and raises the ensemble predictor's R² from 0.770 to 0.815 and Sparse Kendall Tau from 0.694 to 0.752, demonstrating strong predictive accuracy and ranking consistency. Using the final predictor as the fitness function for evolutionary search, we discover architectures that surpass all official YOLOv8-YOLO12 baselines at comparable latency on COCO-mini, confirming the predictor's discriminative power for top-performing detection architectures.
[116] Reviving ConvNeXt for Efficient Convolutional Diffusion Models
Taesung Kwon,Lorenzo Bianchi,Lennart Wittke,Felix Watine,Fabio Carrara,Jong Chul Ye,Romann Weber,Vinicius Azevedo
Main category: cs.CV
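TL;DR: This paper proposes the fully convolutional diffusion model (FCDM), which uses a ConvNeXt-like convolutional backbone to markedly improve training efficiency and hardware friendliness, reaching performance comparable to DiT-XL/2 with fewer FLOPs and fewer training steps.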
Details
Motivation: Although Transformers are increasingly popular in diffusion models, the locality, parameter efficiency, and hardware friendliness inherent to convolutional networks remain underexplored in modern generative modeling. Method: Designs the fully convolutional diffusion model (FCDM), whose backbone borrows from ConvNeXt and is tailored to conditional diffusion modeling, and validates it at 256×256 and 512×512 resolutions. Result: Using only 50% of the FLOPs of DiT-XL/2, FCDM-XL needs 7× and 7.5× fewer training steps at 256×256 and 512×512 resolutions, respectively, and can be trained on a 4-GPU system. Conclusion: Modern convolutional architectures such as ConvNeXt are an efficient, scalable alternative, restoring their value for efficient generative modeling. Abstract: Recent diffusion models increasingly favor Transformer backbones, motivated by the remarkable scalability of fully attentional architectures. Yet the locality bias, parameter efficiency, and hardware friendliness (the attributes that established ConvNets as the efficient vision backbone) have seen limited exploration in modern generative modeling. Here we introduce the fully convolutional diffusion model (FCDM), a model having a backbone similar to ConvNeXt, but designed for conditional diffusion modeling. We find that using only 50% of the FLOPs of DiT-XL/2, FCDM-XL achieves competitive performance with 7× and 7.5× fewer training steps at 256×256 and 512×512 resolutions, respectively. Remarkably, FCDM-XL can be trained on a 4-GPU system, highlighting the exceptional training efficiency of our architecture. Our results demonstrate that modern convolutional designs provide a competitive and highly efficient alternative for scaling diffusion models, reviving ConvNeXt as a simple yet powerful building block for efficient generative modeling.
[117] RiO-DETR: DETR for Real-time Oriented Object Detection
Zhangchi Hu,Yifan Zhao,Yansong Peng,Wenzhang Sun,Xiangchen Yin,Jie Chen,Peixi Wu,Hebei Li,Xinghao Wang,Dongsheng Jiang,Xiaoyan Sun
Main category: cs.CV
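TL;DR: RiO-DETR is the first DETR model for real-time oriented object detection. Through content-driven angle estimation, decoupled periodic refinement, and oriented dense O2O supervision, it addresses the three challenges of adapting DETR to rotated boxes (semantics-dependent orientation, angle periodicity, and an enlarged search space) and achieves a new speed-accuracy balance on DOTA and other datasets.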
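To make the angle-periodicity problem concrete, here is a minimal sketch of a wrap-around angle difference in Python. It is an illustration only and not the paper's Shortest-Path Periodic Loss; it assumes box angles in radians with period π (a rectangle looks the same after a 180° rotation), and both function names are hypothetical.

```python
import math


def shortest_angle_diff(pred: float, target: float, period: float = math.pi) -> float:
    """Signed smallest difference pred - target on a circle of the given period.

    The result lies in [-period/2, period/2), so two angles on opposite
    sides of the seam are treated as close rather than far apart.
    """
    half = period / 2.0
    return (pred - target + half) % period - half


def periodic_l1(pred: float, target: float, period: float = math.pi) -> float:
    """Absolute angular error measured along the shortest path."""
    return abs(shortest_angle_diff(pred, target, period))


if __name__ == "__main__":
    pred, target = math.radians(89.0), math.radians(-89.0)
    naive = abs(pred - target)           # plain Euclidean error across the seam
    wrapped = periodic_l1(pred, target)  # error along the shortest path
    print(math.degrees(naive), math.degrees(wrapped))
```

Under these assumptions a plain difference reports an error of about 178° for two boxes that are only about 2° apart, which is the kind of discontinuity a periodic loss is meant to avoid.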
Details
Motivation: Adapting DETR to oriented bounding boxes (OBBs) poses three challenges: orientation depends on semantics, angle periodicity breaks Euclidean refinement, and an enlarged search space slows convergence; these must be solved while preserving real-time speed. Method: Proposes three core designs: 1) Content-Driven Angle Estimation (decoupling angle from positional queries) together with Rotation-Rectified Orthogonal Attention; 2) Decoupled Periodic Refinement (bounded coarse-to-fine updates plus a Shortest-Path Periodic Loss); 3) Oriented Dense O2O supervision (injecting angular diversity to speed up convergence). Result: Experiments on DOTA-1.0, DIOR-R, and FAIR-1M-2.0 show that RiO-DETR establishes a new speed-accuracy trade-off for real-time oriented detection. Conclusion: RiO-DETR is the first oriented detection Transformer aimed at real-time use; its task-native designs overcome the key obstacles to adapting DETR to OBBs. Abstract: We present RiO-DETR: DETR for Real-time Oriented Object Detection, the first real-time oriented detection transformer to the best of our knowledge. Adapting DETR to oriented bounding boxes (OBBs) poses three challenges: semantics-dependent orientation, angle periodicity that breaks standard Euclidean refinement, and an enlarged search space that slows convergence. RiO-DETR resolves these issues with task-native designs while preserving real-time efficiency. First, we propose Content-Driven Angle Estimation by decoupling angle from positional queries, together with Rotation-Rectified Orthogonal Attention to capture complementary cues for reliable orientation. Second, Decoupled Periodic Refinement combines bounded coarse-to-fine updates with a Shortest-Path Periodic Loss for stable learning across angular seams. Third, Oriented Dense O2O injects angular diversity into dense supervision to speed up angle convergence at no extra cost. Extensive experiments on DOTA-1.0, DIOR-R, and FAIR-1M-2.0 demonstrate RiO-DETR establishes a new speed-accuracy trade-off for real-time oriented detection. Code will be made publicly available.
[118] PromptDLA: A Domain-aware Prompt Document Layout Analysis Framework with Descriptive Knowledge as a Cue
Zirui Zhang,Yaping Zhang,Lu Xiang,Yang Zhao,Feifei Zhai,Yu Zhou,Chengqing Zong
Main category: cs.CV
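TL;DR: This paper proposes PromptDLA, a domain-aware prompter for document layout analysis that uses descriptive knowledge as cues to integrate domain priors into layout analysis, significantly improving cross-domain generalization.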
Details
Motivation: Existing methods train on directly merged multi-source DLA datasets, overlooking differences in layout structure across domains (labeling style, document type, and language), which leads to suboptimal performance. Method: Proposes PromptDLA, with a domain-aware prompter that customizes prompts according to the specific attributes of the data domain and uses them to direct the model toward key features and structures. Result: Achieves state-of-the-art performance on several benchmarks, including DocLayNet, PubLayNet, M6Doc, and D⁴LA. Conclusion: PromptDLA integrates domain priors effectively and improves the generalization of document layout analysis models across varied domains, supporting the value of domain-aware prompting. Abstract: Document Layout Analysis (DLA) is crucial for document artificial intelligence and has recently received increasing attention, resulting in an influx of large-scale public DLA datasets. Existing work often combines data from various domains in recent public DLA datasets to improve the generalization of DLA. However, directly merging these datasets for training often results in suboptimal model performance, as it overlooks the different layout structures inherent to various domains. These variations include different labeling styles, document types, and languages. This paper introduces PromptDLA, a domain-aware Prompter for Document Layout Analysis that effectively leverages descriptive knowledge as cues to integrate domain priors into DLA. The innovative PromptDLA features a unique domain-aware prompter that customizes prompts based on the specific attributes of the data domain. These prompts then serve as cues that direct the DLA toward critical features and structures within the data, enhancing the model's ability to generalize across varied domains. Extensive experiments show that our proposal achieves state-of-the-art performance on DocLayNet, PubLayNet, M6Doc, and D⁴LA. Our code is available at https://github.com/Zirui00/PromptDLA.
[119] CIGPose: Causal Intervention Graph Neural Network for Whole-Body Pose Estimation
Bohao Li,Zhicheng Cao,Huixian Li,Yangming Guo
Main category: cs.CV
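TL;DR: This paper proposes the CIGPose framework, which uses a causal intervention module to remove the confounding effect of visual context, improving the robustness and anatomical plausibility of whole-body pose estimation and reaching state-of-the-art performance on COCO-WholeBody.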
Details
Motivation: Existing whole-body pose estimators tend to produce anatomically implausible predictions in complex scenes, mainly because the model learns spurious correlations from visual context. Method: Builds a Structural Causal Model (SCM) that identifies visual context as a confounder, and proposes the Causal Intervention Graph Pose (CIGPose) framework: it identifies confounded keypoint representations via predictive uncertainty, replaces them with context-invariant canonical embeddings in a causal intervention module, and reasons over local and global skeletal semantics with a hierarchical graph neural network. Result: CIGPose-x reaches 67.0% AP on COCO-WholeBody without extra data and 67.5% AP with the additional UBody dataset, surpassing prior methods that rely on extra training data. Conclusion: Causal intervention mitigates confounding bias and improves the robustness and generalization of pose estimation. Abstract: State-of-the-art whole-body pose estimators often lack robustness, producing anatomically implausible predictions in challenging scenes. We posit this failure stems from spurious correlations learned from visual context, a problem we formalize using a Structural Causal Model (SCM). The SCM identifies visual context as a confounder that creates a non-causal backdoor path, corrupting the model's reasoning. We introduce the Causal Intervention Graph Pose (CIGPose) framework to address this by approximating the true causal effect between visual evidence and pose. The core of CIGPose is a novel Causal Intervention Module: it first identifies confounded keypoint representations via predictive uncertainty and then replaces them with learned, context-invariant canonical embeddings. These deconfounded embeddings are processed by a hierarchical graph neural network that reasons over the human skeleton at both local and global semantic levels to enforce anatomical plausibility. Extensive experiments show CIGPose achieves a new state-of-the-art on COCO-WholeBody. Notably, our CIGPose-x model achieves 67.0% AP, surpassing prior methods that rely on extra training data. With the additional UBody dataset, CIGPose-x is further boosted to 67.5% AP, demonstrating superior robustness and data efficiency. The codes and models are publicly available at https://github.com/53mins/CIGPose.
[120] MetaDAT: Generalizable Trajectory Prediction via Meta Pre-training and Data-Adaptive Test-Time Updating
Yuning Wang,Pu Zhang,Yuan He,Ke Wang,Jianru Xue
Main category: cs.CV
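TL;DR: This paper proposes a meta-learning framework and a data-adaptive model updating mechanism that improve a trajectory prediction model's ability to adapt online to distribution shifts at test time.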
Details
Motivation: Existing trajectory prediction methods degrade significantly under distribution shifts at test time; current test-time training methods rely on an offline pre-trained model and fixed update rules, so they lack flexibility and do not adapt to the data. Method: 1) A meta-learning framework that, during pre-training, uses bi-level optimization over simulated test-time adaptation tasks so that the predictor can adapt online quickly and accurately; 2) a data-adaptive model updating mechanism at test time that dynamically adjusts learning rates and updating frequencies based on online partial derivatives and hard sample selection. Result: In cross-dataset distribution shift scenarios involving nuScenes, Lyft, and Waymo, the method surpasses state-of-the-art test-time training methods in adaptation accuracy and is more robust and practical under suboptimal learning rates and high FPS demands. Conclusion: The proposed method mitigates the performance degradation caused by test-time distribution shift in trajectory prediction while remaining efficient, robust, and practical. Abstract: Existing trajectory prediction methods exhibit significant performance degradation under distribution shifts during test time. Although test-time training techniques have been explored to enable adaptation, current approaches rely on an offline pre-trained predictor that lacks online learning flexibility. Moreover, they depend on fixed online model updating rules that do not accommodate the specific characteristics of test data. To address these limitations, we first propose a meta-learning framework to directly optimize the predictor for fast and accurate online adaptation, which performs bi-level optimization on the performance of simulated test-time adaptation tasks during pre-training. Furthermore, at test time, we introduce a data-adaptive model updating mechanism that dynamically adjusts the predefined learning rates and updating frequencies based on online partial derivatives and hard sample selection. This mechanism enables the online learning rate to suit the test data, and focuses on informative hard samples to enhance efficiency. Experiments are conducted on various challenging cross-dataset distribution shift scenarios, including nuScenes, Lyft, and Waymo. Results demonstrate that our method achieves superior adaptation accuracy, surpassing state-of-the-art test-time training methods for trajectory prediction. Additionally, our method excels under suboptimal learning rates and high FPS demands, showcasing its robustness and practicality.
[121] Open-World Motion Forecasting
Nicolas Schischka,Nikhil Gosala,B Ravi Kiran,Senthil Yogamani,Abhinav Valada
Main category: cs.CV
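TL;DR: This paper introduces open-world motion forecasting, a new setting addressed with an end-to-end class-incremental learning framework that mitigates catastrophic forgetting as new object classes are introduced and predicts trajectories directly from images.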
Details
Motivation: Existing motion forecasting methods assume a closed world, a fixed object taxonomy, and high-quality perception, so they struggle in real-world settings where perception is imperfect and object classes evolve over time. Method: Proposes the first end-to-end class-incremental motion forecasting framework: a pseudo-labeling strategy generates and filters trajectory predictions for known classes, and a replay sampling strategy based on query feature variance retains informative past motion patterns. Result: Evaluation on nuScenes and Argoverse 2 shows that the method resists catastrophic forgetting, maintains performance on previously learned classes, and improves adaptation to new ones; it also supports zero-shot transfer to real-world driving and extends to end-to-end class-incremental planning. Conclusion: Open-world motion forecasting offers a new direction for autonomous driving systems that must keep adapting to dynamic environments; the proposed framework balances stability and extensibility and supports continual learning across the full driving stack. Abstract: Motion forecasting aims to predict the future trajectories of dynamic agents in the scene, enabling autonomous vehicles to effectively reason about scene evolution. Existing approaches operate under the closed-world regime and assume fixed object taxonomy as well as access to high-quality perception. Therefore, they struggle in real-world settings where perception is imperfect and object taxonomy evolves over time. In this work, we bridge this fundamental gap by introducing open-world motion forecasting, a novel setting in which new object classes are sequentially introduced over time and future object trajectories are estimated directly from camera images. We tackle this setting by proposing the first end-to-end class-incremental motion forecasting framework to mitigate catastrophic forgetting while simultaneously learning to forecast newly introduced classes. When a new class is introduced, our framework employs a pseudo-labeling strategy to first generate motion forecasting pseudo-labels for all known classes which are then processed by a vision-language model to filter inconsistent and over-confident predictions. In parallel, our approach further mitigates catastrophic forgetting by using a novel replay sampling strategy that leverages query feature variance to sample previous sequences with informative motion patterns. Extensive evaluation on the nuScenes and Argoverse 2 datasets demonstrates that our approach successfully resists catastrophic forgetting and maintains performance on previously learned classes while improving adaptation to novel ones. Further, we demonstrate that our approach supports zero-shot transfer to real-world driving and naturally extends to end-to-end class-incremental planning, enabling continual adaptation of the full autonomous driving system. We provide the code at https://omen.cs.uni-freiburg.de .
[122] GIIM: Graph-based Learning of Inter- and Intra-view Dependencies for Multi-view Medical Image Diagnosis
Tran Bao Sam,Hung Vu,Dao Trung Kien,Tran Dat Dang,Van Ha Tang,Steven Truong
Main category: cs.CV
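TL;DR: This paper proposes GIIM, a novel graph-based CADx method that models dependencies between lesions within a single view and their dynamic changes across views, handles missing data effectively, and significantly improves diagnostic accuracy and robustness.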
Details
Motivation: Existing CADx systems struggle to model the complex dependencies between lesions within a single view and their dynamic changes across views, and they are not robust to the incomplete data common in clinical practice. Method: Reframes the diagnostic task as a relationship modeling problem and proposes the graph-based GIIM framework, which jointly models intra-view dependencies between abnormalities and inter-view lesion dynamics and incorporates techniques for handling missing data. Result: Evaluated on several imaging modalities, including CT, MRI, and mammography, GIIM significantly outperforms existing methods in diagnostic accuracy and robustness. Conclusion: GIIM provides a more effective and clinically practical framework for CADx, particularly for real-world settings with multiple views and incomplete time points. Abstract: Computer-aided diagnosis (CADx) has become vital in medical imaging, but automated systems often struggle to replicate the nuanced process of clinical interpretation. Expert diagnosis requires a comprehensive analysis of how abnormalities relate to each other across various views and time points, but current multi-view CADx methods frequently overlook these complex dependencies. Specifically, they fail to model the crucial relationships within a single view and the dynamic changes lesions exhibit across different views. This limitation, combined with the common challenge of incomplete data, greatly reduces their predictive reliability. To address these gaps, we reframe the diagnostic task as one of relationship modeling and propose GIIM, a novel graph-based approach. Our framework is uniquely designed to simultaneously capture both critical intra-view dependencies between abnormalities and inter-view dynamics. Furthermore, it ensures diagnostic robustness by incorporating specific techniques to effectively handle missing data, a common clinical issue. We demonstrate the generality of this approach through extensive evaluations on diverse imaging modalities, including CT, MRI, and mammography. The results confirm that our GIIM model significantly enhances diagnostic accuracy and robustness over existing methods, establishing a more effective framework for future CADx systems.
[123] A Guideline-Aware AI Agent for Zero-Shot Target Volume Auto-Delineation
Yoon Jo Kim,Wonyoung Cho,Jongmin Lee,Han Joo Chae,Hyunki Park,Sang Hoon Seo,Noh Jae Myung,Kyungmi Yang,Dongryul Oh,Jin Sung Kim
Main category: cs.CV
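TL;DR: This paper proposes OncoAgent, a training-free AI agent framework that converts textual clinical guidelines directly into three-dimensional target contours. It shows strong zero-shot performance and clinical acceptability for esophageal cancer radiotherapy target delineation and generalizes across guidelines and anatomical sites.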
Details
Motivation: Conventional deep learning methods depend on expert-annotated data and require costly retraining whenever clinical guidelines are updated; a flexible, interpretable target delineation method that adapts to new guidelines without training is needed. Method: Builds OncoAgent, a guideline-aware AI agent framework in which a large language model and a spatial reasoning module work together to parse textual guidelines and map them to target contours under three-dimensional anatomical constraints, with no supervised training at any stage. Result: On esophageal cancer data the zero-shot Dice is 0.842 (CTV) and 0.880 (PTV); in a blinded clinical evaluation, physicians preferred OncoAgent and rated it higher in guideline compliance, modification effort, and clinical acceptability; it also transfers zero-shot to other esophageal guidelines and to other sites such as the prostate. Conclusion: OncoAgent provides an end-to-end, training-free mapping from guidelines to contours, improving the adaptability, transparency, and scalability of radiotherapy target planning. Abstract: Delineating the clinical target volume (CTV) in radiotherapy involves complex margins constrained by tumor location and anatomical barriers. While deep learning models automate this process, their rigid reliance on expert-annotated data requires costly retraining whenever clinical guidelines update. To overcome this limitation, we introduce OncoAgent, a novel guideline-aware AI agent framework that seamlessly converts textual clinical guidelines into three-dimensional target contours in a training-free manner. Evaluated on esophageal cancer cases, the agent achieves a zero-shot Dice similarity coefficient of 0.842 for the CTV and 0.880 for the planning target volume, demonstrating performance highly comparable to a fully supervised nnU-Net baseline. Notably, in a blinded clinical evaluation, physicians strongly preferred OncoAgent over the supervised baseline, rating it higher in guideline compliance, modification effort, and clinical acceptability. Furthermore, the framework generalizes zero-shot to alternative esophageal guidelines and other anatomical sites (e.g., prostate) without any retraining. Beyond mere volumetric overlap, our agent-based paradigm offers near-instantaneous adaptability to alternative guidelines, providing a scalable and transparent pathway toward interpretability in radiotherapy treatment planning.
[124] EvoDriveVLA: Evolving Autonomous Driving Vision-Language-Action Model via Collaborative Perception-Planning Distillation
Jiajun Cao,Xiaoan Zhang,Xiaobao Wei,Liyuqiu Huang,Wang Zijian,Hanzhen Zhang,Zhengyu Jia,Wei Mao,Hao Wang,Xianming Liu,Shuchang Zhou Liu,Yang Wang,Shanghang Zhang
Main category: cs.CV
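TL;DR: EvoDriveVLA proposes a collaborative perception-planning distillation framework that uses self-anchored visual distillation and oracle-guided trajectory distillation to mitigate perception degradation after unfreezing the visual encoder and instability in long-term planning, reaching SOTA performance in open-loop evaluation and significantly improved performance in closed-loop evaluation.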
Details
Motivation: Vision-Language-Action models for autonomous driving suffer from degraded perception after the visual encoder is unfrozen and from accumulated instability in long-term planning. Method: Proposes the EvoDriveVLA framework: 1) self-anchored visual distillation, in which a self-anchored teacher provides visual anchoring constraints and student representations are regularized through trajectory-guided key-region awareness; 2) oracle-guided trajectory distillation, in which a future-aware oracle teacher combines coarse-to-fine trajectory refinement with Monte Carlo dropout sampling to produce high-quality trajectory candidates, from which the best trajectory is selected to guide the student's prediction. Result: Reaches SOTA performance in open-loop evaluation and significantly improves performance in closed-loop evaluation. Conclusion: EvoDriveVLA mitigates perception degradation and planning instability in VLA models for autonomous driving, supporting the collaborative perception-planning distillation approach. Abstract: Vision-Language-Action models have shown great promise for autonomous driving, yet they suffer from degraded perception after unfreezing the visual encoder and struggle with accumulated instability in long-term planning. To address these challenges, we propose EvoDriveVLA, a novel collaborative perception-planning distillation framework that integrates self-anchored perceptual constraints and oracle-guided trajectory optimization. Specifically, self-anchored visual distillation leverages a self-anchored teacher to deliver visual anchoring constraints, regularizing student representations via trajectory-guided key-region awareness. In parallel, oracle-guided trajectory distillation employs a future-aware oracle teacher with coarse-to-fine trajectory refinement and Monte Carlo dropout sampling to produce high-quality trajectory candidates, thereby selecting the optimal trajectory to guide the student's prediction. EvoDriveVLA achieves SOTA performance in open-loop evaluation and significantly enhances performance in closed-loop evaluation. Our code is available at: https://github.com/hey-cjj/EvoDriveVLA.
[125] TopoOR: A Unified Topological Scene Representation for the Operating Room
Tony Danjun Wang,Ka Young Kim,Tolga Birdal,Nassir Navab,Lennart Bastian
Main category: cs.CV
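TL;DR: This paper proposes TopoOR, a new paradigm that models multimodal operating-room scenes with higher-order topological structure. It overcomes the limits of conventional dyadic graph structures, uses a higher-order attention mechanism to preserve manifold geometry and modality-specific features, and outperforms graph neural network and large language model baselines on sterility breach detection, robot phase prediction, and next-action anticipation.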
Details
Motivation: Existing surgical scene graph methods are restricted to strictly dyadic relations, cannot express the complex higher-order, multimodal relationships among entities in the operating room, and lose key structure when fusing heterogeneous information such as 3D geometry, audio, and robot kinematics, which makes them ill-suited to safety-critical reasoning. Method: Proposes the TopoOR framework, which lifts interactions between operating-room entities into higher-order topological cells, and designs a higher-order attention mechanism that preserves manifold structure and modality-specific features, avoiding a forced mapping of multimodal information into a single latent space. Result: Outperforms traditional graph models and LLM baselines on three tasks: sterility breach detection, robot phase prediction, and next-action anticipation. Conclusion: Higher-order topological representations capture the complex relations and multimodal structure of surgical scenes more faithfully, offering a more expressive and robust basis for safety-critical surgical understanding. Abstract: Surgical Scene Graphs abstract the complexity of surgical operating rooms (OR) into a structure of entities and their relations, but existing paradigms suffer from strictly dyadic structural limitations. Frameworks that predominantly rely on pairwise message passing or tokenized sequences flatten the manifold geometry inherent to relational structures and lose structure in the process. We introduce TopoOR, a new paradigm that models multimodal operating rooms as a higher-order structure, innately preserving pairwise and group relationships. By lifting interactions between entities into higher-order topological cells, TopoOR natively models complex dynamics and multimodality present in the OR. This topological representation subsumes traditional scene graphs, thereby offering strictly greater expressivity. We also propose a higher-order attention mechanism that explicitly preserves manifold structure and modality-specific features throughout hierarchical relational attention. In this way, we circumvent combining 3D geometry, audio, and robot kinematics into a single joint latent representation, preserving the precise multimodal structure required for safety-critical reasoning, unlike existing methods. Extensive experiments demonstrate that our approach outperforms traditional graph and LLM-based baselines across sterility breach detection, robot phase prediction, and next-action anticipation.
[126] The Patrologia Graeca Corpus: OCR, Annotation, and Open Release of Noisy Nineteenth-Century Polytonic Greek Editions
Chahan Vidal-Gorène,Bastien Kindt
Main category: cs.CV
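TL;DR: This paper presents the Patrologia Graeca Corpus, the first large-scale open OCR and linguistic resource for nineteenth-century printed editions of Ancient Greek. For complex bilingual layouts and highly degraded polytonic Greek typography, it proposes a combined YOLO and CRNN recognition pipeline that reaches a CER of 1.05% and a WER of 4.69%, largely outperforming existing systems, and releases a corpus of about six million tokens with lemmatization, part-of-speech tags, and layout annotations.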
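For readers unfamiliar with the two metrics, the sketch below shows one common way to compute character error rate (CER) and word error rate (WER) from Levenshtein distance. It is an illustrative Python example, not the authors' evaluation code; the NFC normalization step is an assumption added here because polytonic Greek diacritics can be encoded as either precomposed or combining characters, and all function names are hypothetical.

```python
import unicodedata


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit-cost edits)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def _nfc(text: str) -> str:
    # Assumption: compare precomposed forms so equivalent encodings match.
    return unicodedata.normalize("NFC", text)


def cer(reference: str, hypothesis: str) -> float:
    """Character edits divided by the number of reference characters."""
    ref, hyp = _nfc(reference), _nfc(hypothesis)
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def wer(reference: str, hypothesis: str) -> float:
    """Word edits divided by the number of reference words."""
    ref, hyp = _nfc(reference).split(), _nfc(hypothesis).split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    print(cer("λόγος", "λογος"), wer("ἐν ἀρχῇ ἦν ὁ λόγος", "ἐν ἀρχῇ ἦν ὁ λογος"))
```

Because one wrong accent makes a whole word count as an error, WER is normally several times larger than CER, which is consistent with the gap between the two figures reported for this corpus.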
Details
Motivation: To solve the OCR problem for the not-yet-digitized volumes of the nineteenth-century Patrologia Graeca (PG), whose text is highly degraded polytonic Greek, often set alongside Latin, and on which existing OCR systems perform poorly. Method: Builds a dedicated OCR pipeline that uses YOLO for layout detection and a CRNN for text recognition, then lemmatizes and part-of-speech tags the output while retaining OCR confidence and layout information. Result: Achieves a CER of 1.05% and a WER of 4.69%, largely outperforming existing tools, and produces an annotated corpus of about six million tokens with full OCR output and layout annotations. Conclusion: The work fills a gap in the digitization of Ancient Greek literature, sets a new benchmark for OCR on polytonic Greek, and provides training data for future language models. Abstract: We present the Patrologia Graeca Corpus, the first large-scale open OCR and linguistic resource for nineteenth-century editions of Ancient Greek. The collection covers the remaining undigitized volumes of the Patrologia Graeca (PG), printed in complex bilingual (Greek-Latin) layouts and characterized by highly degraded polytonic Greek typography. Through a dedicated pipeline combining YOLO-based layout detection and CRNN-based text recognition, we achieve a character error rate (CER) of 1.05% and a word error rate (WER) of 4.69%, largely outperforming existing OCR systems for polytonic Greek. The resulting corpus contains around six million lemmatized and part-of-speech tagged tokens, aligned with full OCR and layout annotations. Beyond its philological value, this corpus establishes a new benchmark for OCR on noisy polytonic Greek and provides training material for future models, including LLMs.
[127] OmniEarth: A Benchmark for Evaluating Vision-Language Models in Geospatial Tasks
Ronghao Fu,Haoran Liu,Weijie Zhang,Zhiwen Lin,Xiao Yang,Peng Zhang,Bo Yang
Main category: cs.CV
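TL;DR: This paper presents the OmniEarth benchmark for comprehensively evaluating the perception, reasoning, and robustness of remote sensing vision-language models (RSVLMs) in realistic Earth observation scenarios. It comprises 28 fine-grained tasks, multiple output formats, and a blind test protocol to reduce linguistic bias, and reveals clear shortcomings of existing VLMs on geospatially complex tasks.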
Details
Motivation: Although vision-language models (VLMs) perform well in general domains, Earth observation lacks a systematic evaluation benchmark, and a comprehensive framework tailored to remote sensing is needed. Method: Builds the OmniEarth benchmark, covering 28 fine-grained tasks across three dimensions (perception, reasoning, and robustness); supports multiple-choice and open-ended visual question answering (with text, bounding box, or mask outputs); introduces a blind test protocol and a quintuple semantic consistency requirement to suppress linguistic bias; and includes 9,275 quality-controlled remote sensing images (including Jilin-1 satellite data) with 44,210 manually verified instructions. Result: A systematic evaluation of contrastive learning-based models, general closed-source and open-source VLMs, and RSVLMs shows that current models still fall well short on geospatially complex tasks. Conclusion: OmniEarth fills the gap in evaluating remote sensing vision-language models, offers a standardized, multi-dimensional evaluation platform for future RSVLM development, and identifies key capability gaps. Abstract: Vision-Language Models (VLMs) have demonstrated effective perception and reasoning capabilities on general-domain tasks, leading to growing interest in their application to Earth observation. However, a systematic benchmark for comprehensively evaluating remote sensing vision-language models (RSVLMs) remains lacking. To address this gap, we introduce OmniEarth, a benchmark for evaluating RSVLMs under realistic Earth observation scenarios. OmniEarth organizes tasks along three capability dimensions: perception, reasoning, and robustness. It defines 28 fine-grained tasks covering multi-source sensing data and diverse geospatial contexts. The benchmark supports two task formulations: multiple-choice VQA and open-ended VQA. The latter includes pure text outputs for captioning tasks, bounding box outputs for visual grounding tasks, and mask outputs for segmentation tasks. To reduce linguistic bias and examine whether model predictions rely on visual evidence, OmniEarth adopts a blind test protocol and a quintuple semantic consistency requirement. OmniEarth includes 9,275 carefully quality-controlled images, including proprietary satellite imagery from Jilin-1 (JL-1), along with 44,210 manually verified instructions. We conduct a systematic evaluation of contrastive learning-based models, general closed-source and open-source VLMs, as well as RSVLMs. Results show that existing VLMs still struggle with geospatially complex tasks, revealing clear gaps that need to be addressed for remote sensing applications. OmniEarth is publicly available at https://huggingface.co/datasets/sjeeudd/OmniEarth.
[128] Prune Redundancy, Preserve Essence: Vision Token Compression in VLMs via Synergistic Importance-Diversity
Zhengyao Fang,Pengyuan Lyu,Chengquan Zhang,Guangming Lu,Jun Yu,Wenjie Pei
Main category: cs.CV
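TL;DR: This paper proposes PruneSID, a training-free visual token compression method for vision-language models that uses semantic clustering and intra-group non-maximum suppression to remove many redundant visual tokens while preserving key information, significantly improving inference efficiency.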
Details
Motivation: The many redundant visual tokens in vision-language models make computation inefficient, and existing compression methods struggle to balance importance preservation and information diversity. Method: PruneSID uses a two-stage, training-free strategy: (1) Principal Semantic Components Analysis (PSCA) clusters visual tokens into semantic groups; (2) intra-group Non-Maximum Suppression (NMS) keeps representative tokens within each group; it also adds an information-aware dynamic compression ratio mechanism based on image complexity. Result: Reaches 96.3% accuracy on LLaVA-1.5 with 11.1% token retention and 92.8% accuracy on LLaVA-NeXT with 5.6% token retention, a 2.5% improvement over prior methods with 7.8× faster prefilling than the original model; it also generalizes across models and across image and video modalities. Conclusion: PruneSID significantly improves VLM inference efficiency without sacrificing performance, offering a general, lightweight, training-free approach to efficient multimodal modeling. Abstract: Vision-language models (VLMs) face significant computational inefficiencies caused by excessive generation of visual tokens. While prior work shows that a large fraction of visual tokens are redundant, existing compression methods struggle to balance importance preservation and information diversity. To address this, we propose PruneSID, a training-free Synergistic Importance-Diversity approach featuring a two-stage pipeline: (1) Principal Semantic Components Analysis (PSCA) for clustering tokens into semantically coherent groups, ensuring comprehensive concept coverage, and (2) Intra-group Non-Maximum Suppression (NMS) for pruning redundant tokens while preserving key representative tokens within each group. Additionally, PruneSID incorporates an information-aware dynamic compression ratio mechanism that optimizes token compression rates based on image complexity, enabling more effective average information preservation across diverse scenes. Extensive experiments demonstrate state-of-the-art performance, achieving 96.3% accuracy on LLaVA-1.5 with only 11.1% token retention, and 92.8% accuracy at extreme compression rates (5.6%) on LLaVA-NeXT, outperforming prior methods by 2.5% with 7.8× faster prefilling speed compared to the original model. Our framework generalizes across diverse VLMs and both image and video modalities, showcasing strong cross-modal versatility. Code is available at https://github.com/ZhengyaoFang/PruneSID.
[129] Component-Aware Sketch-to-Image Generation Using Self-Attention Encoding and Coordinate-Preserving Fusion
Ali Zia,Muhammad Umer Ramzan,Usman Ali,Muhammad Faheem,Abdelwahed Khamis,Shahnawaz Qureshi
Main category: cs.CV
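TL;DR: This paper proposes a component-aware, self-refining sketch-to-image generation framework whose two-stage architecture (SA2N encoder, CGF fusion module, and SARR refinement network) improves detail reconstruction, spatial alignment, and cross-domain adaptation, significantly outperforming existing GAN and diffusion models on several datasets.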
Details
Motivation: Sketches are abstract, sparse, and stylistically diverse, and existing GAN and diffusion models struggle to reconstruct fine-grained details, maintain spatial alignment, or generalize across sketch domains. Method: A two-stage architecture is proposed: 1) a self-attention-based autoencoder (SA2N) extracts component-level semantic and structural features; 2) a Coordinate-Preserving Gated Fusion (CGF) module integrates them into a spatial layout; 3) a Spatially Adaptive Refinement Revisor (SARR), built on a modified StyleGAN2, performs context-guided iterative refinement. Result: The method outperforms SOTA approaches on facial datasets such as CelebAMask-HQ and non-facial datasets such as Sketchy; on CelebAMask-HQ it improves FID by 21%, IS by 58%, KID by 41%, and SSIM by 20%, with higher efficiency and visual coherence. Conclusion: The component-aware, self-refining framework performs well in fidelity, semantic accuracy, and perceptual quality, and suits practical uses such as forensic sketching, digital art restoration, and general sketch-based image synthesis. Abstract: Translating freehand sketches into photorealistic images remains a fundamental challenge in image synthesis, particularly due to the abstract, sparse, and stylistically diverse nature of sketches. Existing approaches, including GAN-based and diffusion-based models, often struggle to reconstruct fine-grained details, maintain spatial alignment, or adapt across different sketch domains. In this paper, we propose a component-aware, self-refining framework for sketch-to-image generation that addresses these challenges through a novel two-stage architecture. A Self-Attention-based Autoencoder Network (SA2N) first captures localised semantic and structural features from component-wise sketch regions, while a Coordinate-Preserving Gated Fusion (CGF) module integrates these into a coherent spatial layout. Finally, a Spatially Adaptive Refinement Revisor (SARR), built on a modified StyleGAN2 backbone, enhances realism and consistency through iterative refinement guided by spatial context. Extensive experiments across both facial (CelebAMask-HQ, CUFSF) and non-facial (Sketchy, ChairsV2, ShoesV2) datasets demonstrate the robustness and generalizability of our method. The proposed framework consistently outperforms state-of-the-art GAN and diffusion models, achieving significant gains in image fidelity, semantic accuracy, and perceptual quality. On CelebAMask-HQ, our model improves over prior methods by 21% (FID), 58% (IS), 41% (KID), and 20% (SSIM). These results, along with higher efficiency and visual coherence across diverse domains, position our approach as a strong candidate for applications in forensics, digital art restoration, and general sketch-based image synthesis.
[130] Streaming Autoregressive Video Generation via Diagonal Distillation
Jinxiu Liu,Xuanming Liu,Kangfu Mei,Yandong Wen,Ming-Hsuan Yang,Weiyang Liu
Main category: cs.CV
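TL;DR: This paper proposes Diagonal Distillation, a video diffusion distillation method that uses an asymmetric generation strategy (more steps early, fewer steps later) and implicit optical flow modeling to markedly improve motion coherence and generation efficiency for long video sequences, achieving a 277.3x speedup while maintaining high quality.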
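As a rough illustration of the "more steps early, fewer steps later" idea, the sketch below builds a decreasing per-chunk step schedule and works through the reported timing figures. The function `diagonal_schedule`, its one-step-per-chunk decay, and the step counts (4 down to 2 over 5 chunks) are hypothetical choices made for this example; the abstract does not specify the actual schedule. The implied baseline time also assumes the 277.3x speedup was measured on the same 5-second clip.

```python
def diagonal_schedule(num_chunks, first_steps, last_steps):
    """Hypothetical asymmetric schedule: start at first_steps denoising
    steps, drop one step per chunk, and never go below last_steps."""
    return [max(last_steps, first_steps - i) for i in range(num_chunks)]


# Illustrative numbers only (not taken from the paper).
schedule = diagonal_schedule(num_chunks=5, first_steps=4, last_steps=2)
total_steps = sum(schedule)            # steps spent with the asymmetric schedule
uniform_steps = 4 * len(schedule)      # steps if every chunk used 4 steps

# Arithmetic on the figures reported in the abstract.
video_seconds = 5.0
generation_seconds = 2.61
speedup = 277.3
realtime_factor = video_seconds / generation_seconds        # > 1 means faster than real time
implied_baseline_seconds = generation_seconds * speedup     # assumes same clip length

print(schedule, total_steps, uniform_steps)
print(round(realtime_factor, 2), round(implied_baseline_seconds, 1))
```

Under these assumed step counts the asymmetric schedule spends 13 denoising passes instead of 20, which is where the saving on later chunks comes from.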
Details
Motivation: Existing video diffusion distillation methods directly reuse image distillation techniques and neglect temporal dependencies, leading to incoherent motion, error accumulation, and a latency-quality trade-off; the root causes are insufficient use of temporal context during step reduction and implicit prediction of subsequent noise levels in next-chunk prediction (exposure bias). Method: Diagonal Distillation is a distillation framework that jointly models the temporal and denoising-step dimensions along a diagonal; its core is an asymmetric generation strategy (early chunks use more denoising steps, and later chunks reuse the thoroughly denoised early results as conditional inputs); it explicitly aligns noise-level prediction between training and inference and adds implicit optical flow modeling to preserve motion quality. Result: Generating a 5-second video takes only 2.61 seconds (up to 31 FPS), a 277.3x speedup over the original undistilled model; motion consistency over long sequences improves markedly, and error propagation and oversaturation are reduced. Conclusion: By jointly optimizing distillation across the temporal and denoising-step dimensions, Diagonal Distillation overcomes a bottleneck in real-time streaming generation with video diffusion models and offers a new approach to efficient, high-quality video synthesis. Abstract: Large pretrained diffusion models have significantly enhanced the quality of generated videos, and yet their use in real-time streaming remains limited. Autoregressive models offer a natural framework for sequential frame synthesis but require heavy computation to achieve high fidelity. Diffusion distillation can compress these models into efficient few-step variants, but existing video distillation approaches largely adapt image-specific methods that neglect temporal dependencies. These techniques often excel in image generation but underperform in video synthesis, exhibiting reduced motion coherence, error accumulation over long sequences, and a latency-quality trade-off. We identify two factors that result in these limitations: insufficient utilization of temporal context during step reduction and implicit prediction of subsequent noise levels in next-chunk prediction (i.e., exposure bias). To address these issues, we propose Diagonal Distillation, which operates orthogonally to existing approaches and better exploits temporal information across both video chunks and denoising steps. Central to our approach is an asymmetric generation strategy: more steps early, fewer steps later. This design allows later chunks to inherit rich appearance information from thoroughly processed early chunks, while using partially denoised chunks as conditional inputs for subsequent synthesis. By aligning the implicit prediction of subsequent noise levels during chunk generation with the actual inference conditions, our approach mitigates error propagation and reduces oversaturation in long-range sequences. We further incorporate implicit optical flow modeling to preserve motion quality under strict step constraints. Our method generates a 5-second video in 2.61 seconds (up to 31 FPS), achieving a 277.3x speedup over the undistilled model.
[131] Evolving Prompt Adaptation for Vision-Language Models
Enming Zhang,Jiayang Li,Yanru Wu,Zhenyu Liu,Yang Li
Main category: cs.CV
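TL;DR: This paper proposes the EvoPrompt framework, which uses a Modality-Shared Prompt Projector (MPP) and an evolutionary training strategy to let prompts evolve stably during few-shot fine-tuning, avoiding catastrophic forgetting while preserving the zero-shot capability of the pre-trained VLM.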
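The direction/magnitude decoupling behind the evolutionary training strategy can be sketched on a plain vector. This is a minimal illustration, not the paper's implementation: EvoPrompt applies the idea to low-rank updates, whereas `decompose`, `recompose`, and the toy 2-D update below are hypothetical stand-ins chosen to keep the example self-contained.

```python
import math


def decompose(update):
    """Split a vector into a unit direction and a scalar magnitude."""
    magnitude = math.sqrt(sum(x * x for x in update))
    direction = [x / magnitude for x in update]
    return direction, magnitude


def recompose(direction, magnitude):
    """Rebuild a vector from a fixed direction and a (re-learned) magnitude."""
    return [magnitude * d for d in direction]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "early-learned" update (illustrative values).
early_update = [3.0, 4.0]
direction, magnitude = decompose(early_update)

# Later adaptation changes only the scalar; the direction stays frozen.
adapted_update = recompose(direction, 2.0)
print(magnitude, adapted_update, cosine(early_update, adapted_update))
```

Because only the scalar changes, the cosine similarity between the early and adapted updates stays at 1, which is the sense in which the early-learned semantic direction is preserved.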
Details
Motivation: Existing parameter-efficient prompt learning methods are prone to catastrophic forgetting when adapting to downstream tasks, making it hard to balance fine-tuning performance with retention of pre-trained knowledge. Method: The EvoPrompt framework: 1) a Modality-Shared Prompt Projector (MPP) generates hierarchical prompts from a unified embedding space; 2) an evolutionary training strategy decouples low-rank updates into directional and magnitude components, fixing the semantic directions and adjusting only the magnitudes; 3) Feature Geometric Regularization (FGR) prevents representation collapse. Result: Achieves SOTA performance on few-shot learning tasks while robustly preserving the original zero-shot capability of the pre-trained VLM. Conclusion: Controlling the evolutionary path of prompts is key to forgetting-free adaptation; through direction-magnitude decoupling and geometric regularization, EvoPrompt achieves stable, knowledge-preserving prompt tuning. Abstract: The adaptation of large-scale vision-language models (VLMs) to downstream tasks with limited labeled data remains a significant challenge. While parameter-efficient prompt learning methods offer a promising path, they often suffer from catastrophic forgetting of pre-trained knowledge. Toward addressing this limitation, our work is grounded in the insight that governing the evolutionary path of prompts is essential for forgetting-free adaptation. To this end, we propose EvoPrompt, a novel framework designed to explicitly steer the prompt trajectory for stable, knowledge-preserving fine-tuning. Specifically, our approach employs a Modality-Shared Prompt Projector (MPP) to generate hierarchical prompts from a unified embedding space. Critically, an evolutionary training strategy decouples low-rank updates into directional and magnitude components, preserving early-learned semantic directions while only adapting their magnitude, thus enabling prompts to evolve without discarding foundational knowledge. This process is further stabilized by Feature Geometric Regularization (FGR), which enforces feature decorrelation to prevent representation collapse. Extensive experiments demonstrate that EvoPrompt achieves state-of-the-art performance in few-shot learning while robustly preserving the original zero-shot capabilities of pre-trained VLMs.
[132] SurgFed: Language-guided Multi-Task Federated Learning for Surgical Video Understanding
Zheng Fang,Ziwei Niu,Ziyue Wang,Zhu Zhuo,Haofeng Liu,Shuyang Qian,Jun Xia,Yueming Jin
Main category: cs.CV
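TL;DR: This paper proposes the SurgFed framework, which uses Language-guided Channel Selection (LCS) and Language-guided Hyper Aggregation (LHA) to address the challenges that tissue diversity and task diversity pose for multi-task federated learning on surgical video, outperforming existing methods on multiple surgical datasets.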
Details
Motivation: Multi-task federated learning for surgical scenes is essential in robot-assisted minimally invasive surgery but faces two challenges: tissue diversity (local models struggle to adapt to the tissue characteristics of different surgical sites) and task diversity (server-side gradient-based aggregation struggles with cross-site task heterogeneity). Method: The SurgFed multi-task federated learning framework has two core designs: (1) Language-guided Channel Selection (LCS), which uses pre-defined text inputs to build a lightweight personalized channel selection network that improves the local model's adaptation to a specific surgical site; (2) Language-guided Hyper Aggregation (LHA), which uses a layer-wise cross-attention mechanism with text inputs to model cross-site task interactions and guide a hypernetwork that generates personalized parameter updates. Result: On five public datasets covering four surgical types, SurgFed clearly outperforms current state-of-the-art methods. Conclusion: SurgFed effectively mitigates cross-site and cross-task heterogeneity in multi-task federated learning for surgical video, offering a new approach to robust surgical scene understanding in clinical settings. Abstract: Surgical scene Multi-Task Federated Learning (MTFL) is essential for robot-assisted minimally invasive surgery (RAS) but remains underexplored in surgical video understanding due to two key challenges: (1) Tissue Diversity: Local models struggle to adapt to site-specific tissue features, limiting their effectiveness in heterogeneous clinical environments and leading to poor local predictions. (2) Task Diversity: Server-side aggregation, relying solely on gradient-based clustering, often produces suboptimal or incorrect parameter updates due to inter-site task heterogeneity, resulting in inaccurate localization. In light of these two issues, we propose SurgFed, a multi-task federated learning framework, enabling federated learning for surgical scene segmentation and depth estimation across diverse surgical types. SurgFed is powered by two appealing designs, i.e., Language-guided Channel Selection (LCS) and Language-guided Hyper Aggregation (LHA), to address the challenge of full exploration across sites and tasks. Technically, LCS is first designed as a lightweight personalized channel selection network that enhances site-specific adaptation using pre-defined text inputs, which helps the local model learn the specific embeddings. We further introduce the LHA that employs a layer-wise cross-attention mechanism with pre-defined text inputs to model task interactions across sites and guide a hypernetwork for personalized parameter updates. Extensive empirical evidence shows that SurgFed yields improvements over the state-of-the-art methods in five public datasets across four surgical types. The code is available at https://anonymous.4open.science/r/SurgFed-070E/.
[133] Context-Nav: Context-Driven Exploration and Viewpoint-Aware 3D Spatial Reasoning for Instance Navigation
Won Shik Jang,Ue-Hwan Kim
Main category: cs.CV
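TL;DR: This paper proposes Context-Nav, which turns long text descriptions into a global exploration prior and combines it with 3D spatial relation verification to achieve high-accuracy text-goal instance navigation without task-specific training.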
Details
Motivation: Address the challenge in text-goal instance navigation (TGIN) of precisely locating the target instance among same-category distractors, without relying on early detections or extensive policy training. Method: 1) Build a value map from dense text-image alignment to guide exploration toward regions consistent with the full text description; 2) perform viewpoint-aware 3D spatial relation verification on candidates, accepting a target only if the spatial relations hold from at least one viewpoint. Result: Achieves SOTA performance on InstanceNav and CoIN-Bench; ablations show that both full-description encoding and explicit 3D verification substantially improve results. Conclusion: Geometry-grounded spatial reasoning is a scalable alternative to heavy policy training or human-in-the-loop interaction for fine-grained instance disambiguation in cluttered 3D scenes. Abstract: Text-goal instance navigation (TGIN) asks an agent to resolve a single, free-form description into actions that reach the correct object instance among same-category distractors. We present Context-Nav that elevates long, contextual captions from a local matching cue to a global exploration prior and verifies candidates through 3D spatial reasoning. First, we compute dense text-image alignments for a value map that ranks frontiers -- guiding exploration toward regions consistent with the entire description rather than early detections. Second, upon observing a candidate, we perform a viewpoint-aware relation check: the agent samples plausible observer poses, aligns local frames, and accepts a target only if the spatial relations can be satisfied from at least one viewpoint. The pipeline requires no task-specific training or fine-tuning; we attain state-of-the-art performance on InstanceNav and CoIN-Bench. Ablations show that (i) encoding full captions into the value map avoids wasted motion and (ii) explicit, viewpoint-aware 3D verification prevents semantically plausible but incorrect stops. This suggests that geometry-grounded spatial reasoning is a scalable alternative to heavy policy training or human-in-the-loop interaction for fine-grained instance disambiguation in cluttered 3D scenes.
[134] Probing the Reliability of Driving VLMs: From Inconsistent Responses to Grounded Temporal Reasoning
Chun-Peng Chang,Chen-Yu Wang,Holger Caesar,Alain Pagani
Main category: cs.CV
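TL;DR: This paper examines the reliability of vision-language models (VLMs) used as driving assistants in terms of temporally grounded reasoning and response consistency, shows that current VLMs respond inconsistently and reason poorly about time, and proposes the FutureVQA dataset and a self-supervised chain-of-thought tuning method that needs no temporal labels.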
Details
Motivation: Prior work assumes that strong visual understanding naturally yields reliable temporal reasoning; the authors question this assumption, noting that VLMs in driving assistance face two challenges, response inconsistency and limited temporal reasoning, and that empirical tests are needed to determine whether they truly reason about temporal cause and effect from observations. Method: Build the human-annotated FutureVQA benchmark to assess future scene reasoning; propose a self-supervised fine-tuning method with chain-of-thought reasoning that improves consistency and temporal reasoning without temporal labels. Result: Experiments show that 1) VLMs are sensitive to input perturbations, with responses degenerating toward random guessing; 2) strong visual understanding does not imply strong temporal reasoning; 3) the proposed self-supervised method markedly improves performance on FutureVQA and response consistency. Conclusion: Current VLMs do not yet have reliable, observation-based temporally grounded reasoning, so dedicated benchmarks and training strategies are needed; FutureVQA and self-supervised chain-of-thought tuning offer a feasible path toward more trustworthy driving assistance systems. Abstract: A reliable driving assistant should provide consistent responses based on temporally grounded reasoning derived from observed information. In this work, we investigate whether Vision-Language Models (VLMs), when applied as driving assistants, can respond consistently and understand how present observations shape future outcomes, or whether their outputs merely reflect patterns memorized during training without temporally grounded reasoning. While recent efforts have integrated VLMs into autonomous driving, prior studies typically emphasize scene understanding and instruction generation, implicitly assuming that strong visual interpretation naturally enables consistent future reasoning and thus ensures reliable decision-making, a claim we critically examine. We focus on two major challenges limiting VLM reliability in this setting: response inconsistency, where minor input perturbations yield different answers or, in some cases, responses degenerate toward near-random guessing, and limited temporal reasoning, in which models fail to reason and align sequential events from current observations, often resulting in incorrect or even contradictory responses. Moreover, we find that models with strong visual understanding do not necessarily perform best on tasks requiring temporal reasoning, indicating a tendency to over-rely on pretrained patterns rather than modeling temporal dynamics. To address these issues, we adopt existing evaluation methods and introduce FutureVQA, a human-annotated benchmark dataset specifically designed to assess future scene reasoning. In addition, we propose a simple yet effective self-supervised tuning approach with chain-of-thought reasoning that improves both consistency and temporal reasoning without requiring temporal labels.
[135] RESBev: Making BEV Perception More Robust
Lifeng Zhuo,Kefan Jin,Zhe Liu,Hesheng Wang
Main category: cs.CV
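TL;DR: This paper proposes RESBev, which builds a latent world model to predict clean BEV features, improving the robustness of existing BEV perception models under sensor degradation and adversarial attacks in a plug-and-play manner.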
Details
Motivation: In real-world scenes, sensor degradation and adversarial attacks cause BEV perception anomalies that threaten autonomous driving safety, so BEV perception robustness must be improved. Method: Reframe perception robustness as a latent semantic prediction problem and build a latent world model that learns spatiotemporal correlations and state transitions across sequential BEV observations, predicting clean BEV features to reconstruct corrupted observations; the method operates at the semantic feature level of the Lift-Splat-Shoot pipeline and requires no changes to the backbone. Result: On nuScenes, with only a small amount of fine-tuning, RESBev markedly improves the robustness of existing BEV perception models to a variety of natural disturbances and adversarial attacks. Conclusion: RESBev is a lightweight, plug-and-play method for strengthening BEV robustness that balances generality and practicality and applies to a range of BEV perception architectures. Abstract: Bird's-eye-view (BEV) perception has emerged as a cornerstone of autonomous driving systems, providing a structured, ego-centric representation critical for downstream planning and control. However, real-world deployment faces challenges from sensor degradation and adversarial attacks, which can cause severe perceptual anomalies and ultimately compromise the safety of autonomous driving systems. To address this, we propose a resilient and plug-and-play BEV perception method, RESBev, which can be easily applied to existing BEV perception methods to enhance their robustness to diverse disturbances. Specifically, we reframe perception robustness as a latent semantic prediction problem. A latent world model is constructed to extract spatiotemporal correlations across sequential BEV observations, thereby learning the underlying BEV state transitions to predict clean BEV features for reconstructing corrupted observations. The proposed framework operates at the semantic feature level of the Lift-Splat-Shoot pipeline, enabling recovery that generalizes across both natural disturbances and adversarial attacks without modifying the underlying backbone. Extensive experiments on the nuScenes dataset demonstrate that, with few-shot fine-tuning, RESBev significantly improves the robustness of existing BEV perception models against various external disturbances and adversarial attacks.
[136] DCAU-Net: Differential Cross Attention and Channel-Spatial Feature Fusion for Medical Image Segmentation
Yanxin Li,Hui Wan,Libin Lan
Main category: cs.CV
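TL;DR: This paper proposes DCAU-Net, which uses Differential Cross Attention (DCA) and Channel-Spatial Feature Fusion (CSFF) to address long-range dependency modeling and boundary detail preservation in medical image segmentation, substantially reducing computational complexity while maintaining accuracy.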
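A minimal pure-Python sketch of the two ingredients attributed to DCA: attending over window-level summary tokens instead of every pixel token, and taking the difference of two softmax attention maps. Several details here are assumptions made for illustration and are not confirmed by the abstract: summaries are formed by mean pooling, the two maps come from two separate query vectors, and `lam` scales the subtracted map.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def window_summaries(tokens, window):
    """Mean-pool consecutive windows of tokens (assumed summary operator)."""
    summaries = []
    for start in range(0, len(tokens), window):
        group = tokens[start:start + window]
        dim = len(group[0])
        summaries.append([sum(t[d] for t in group) / len(group) for d in range(dim)])
    return summaries


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def differential_attention(query_a, query_b, summaries, lam=1.0):
    """Difference of two softmax maps computed over the summary tokens."""
    map_a = softmax([dot(query_a, s) for s in summaries])
    map_b = softmax([dot(query_b, s) for s in summaries])
    return [a - lam * b for a, b in zip(map_a, map_b)]


# Eight toy "pixel" tokens collapse to two summary tokens.
tokens = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4
summaries = window_summaries(tokens, window=4)
weights = differential_attention([2.0, 0.0], [0.0, 0.0], summaries)
print(summaries, weights)
```

With a window of 4, each query scores 2 summary tokens rather than 8 pixel tokens, which is the source of the complexity reduction; subtracting the second map cancels attention mass that both maps assign indiscriminately.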
Details
Motivation: Existing methods for modeling long-range dependencies (such as Transformers) are computationally expensive and spread attention too thinly, while conventional convolutions have limited receptive fields; feature fusion strategies in encoder-decoder architectures also lack adaptivity and struggle to balance semantics and detail. Method: DCAU-Net: 1) Differential Cross Attention (DCA) replaces pixel-wise keys/values with window-level summary tokens and takes the difference of two softmax attention maps to highlight discriminative structures; 2) Channel-Spatial Feature Fusion (CSFF) adaptively recalibrates features from skip connections and up-sampling paths using sequential channel and spatial attention. Result: Achieves competitive performance on two public medical image segmentation benchmarks, with improved segmentation accuracy and robustness. Conclusion: With lightweight, efficient attention and adaptive feature fusion, DCAU-Net balances global context modeling and local detail preservation, offering a new approach to medical image segmentation. Abstract: Accurate medical image segmentation requires effective modeling of both long-range dependencies and fine-grained boundary details. While transformers mitigate the issue of insufficient semantic information arising from the limited receptive field inherent in convolutional neural networks, they introduce new challenges: standard self-attention incurs quadratic computational complexity and often assigns non-negligible attention weights to irrelevant regions, diluting focus on discriminative structures and ultimately compromising segmentation accuracy. Existing attention variants, although effective in reducing computational complexity, fail to suppress redundant computation and inadvertently impair global context modeling. Furthermore, conventional fusion strategies in encoder-decoder architectures, typically based on simple concatenation or summation, cannot adaptively integrate high-level semantic information with low-level spatial details. To address these limitations, we propose DCAU-Net, a novel yet efficient segmentation framework with two key ideas. First, a new Differential Cross Attention (DCA) is designed to compute the difference between two independent softmax attention maps to adaptively highlight discriminative structures. By replacing pixel-wise key and value tokens with window-level summary tokens, DCA dramatically reduces computational complexity without sacrificing precision. Second, a Channel-Spatial Feature Fusion (CSFF) strategy is introduced to adaptively recalibrate features from skip connections and up-sampling paths using sequential channel and spatial attention, effectively suppressing redundant information and amplifying salient cues. Experiments on two public benchmarks demonstrate that DCAU-Net achieves competitive performance with enhanced segmentation accuracy and robustness.
[137] Towards Unified Multimodal Interleaved Generation via Group Relative Policy Optimization
Ming Nie,Chunwei Wang,Jianhua Han,Hang Xu,Li Zhang
Main category: cs.CV
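TL;DR: This paper proposes a reinforcement learning-based post-training strategy that enables existing unified vision-language models to generate interleaved image-text outputs without large-scale interleaved datasets. A hybrid-data warm-up and a unified policy optimization framework that extends GRPO, combined with multi-dimensional rewards, markedly improve the quality and coherence of interleaved generation.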
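The group-relative step at the heart of GRPO can be sketched as follows: several outputs are sampled for one prompt, each is scored, and every reward is normalized against the mean and standard deviation of its own group. The `hybrid_reward` helper is a hypothetical stand-in for the paper's reward over textual relevance, visual-text alignment, and structural fidelity; the equal weights, the toy scores, and the use of the population standard deviation are assumptions for illustration.

```python
import statistics


def hybrid_reward(text_relevance, visual_text_alignment, structural_fidelity,
                  weights=(1.0, 1.0, 1.0)):
    """Weighted average of three reward components (weights are assumed)."""
    parts = (text_relevance, visual_text_alignment, structural_fidelity)
    return sum(w * p for w, p in zip(weights, parts)) / sum(weights)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and std of its sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled trajectories for the same prompt (toy scores).
rewards = [
    hybrid_reward(0.9, 0.8, 1.0),
    hybrid_reward(0.5, 0.4, 0.6),
    hybrid_reward(0.7, 0.7, 0.7),
    hybrid_reward(0.2, 0.3, 0.1),
]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Normalizing within the group removes the need for a separate learned value baseline: a trajectory is reinforced only to the extent that it beats its siblings sampled from the same prompt.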
Details
Motivation: Existing unified vision-language models are weak at generating interleaved image-text outputs (such as visual storytelling and step-by-step visual reasoning) and depend on large-scale interleaved annotated data. Method: A two-stage strategy: 1) a hybrid-data warm-up stage that combines curated interleaved sequences with a small amount of multimodal understanding and text-to-image data; 2) a unified policy optimization framework extending Group Relative Policy Optimization (GRPO) that jointly models the text and image generation trajectory, with hybrid rewards covering textual relevance, visual-text alignment, and structural fidelity, plus process-level rewards for step-wise guidance. Result: Markedly improves the quality and coherence of interleaved generation on the MMIE and InterleavedBench benchmarks. Conclusion: This reinforcement learning post-training method unlocks interleaved generation in unified multimodal models without depending on large-scale interleaved data, and is general and efficient. Abstract: Unified vision-language models have made significant progress in multimodal understanding and generation, yet they largely fall short in producing multimodal interleaved outputs, which is a crucial capability for tasks like visual storytelling and step-by-step visual reasoning. In this work, we propose a reinforcement learning-based post-training strategy to unlock this capability in existing unified models, without relying on large-scale multimodal interleaved datasets. We begin with a warm-up stage using a hybrid dataset comprising curated interleaved sequences and limited data for multimodal understanding and text-to-image generation, which exposes the model to interleaved generation patterns while preserving its pretrained capabilities. To further refine interleaved generation, we propose a unified policy optimization framework that extends Group Relative Policy Optimization (GRPO) to the multimodal setting. Our approach jointly models text and image generation within a single decoding trajectory and optimizes it with our novel hybrid rewards covering textual relevance, visual-text alignment, and structural fidelity. Additionally, we incorporate process-level rewards to provide step-wise guidance, enhancing training efficiency in complex multimodal tasks. Experiments on MMIE and InterleavedBench demonstrate that our approach significantly enhances the quality and coherence of multimodal interleaved generation.
[138] Memory-Guided View Refinement for Dynamic Human-in-the-loop EQA
Xin Lu,Rui Li,Xun Huang,Weixin Li,Chuanqing Zhuang,Jiayuan Li,Zhengda Lu,Jun Xiao,Yunhong Wang
Main category: cs.CV
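TL;DR: This paper proposes the DynHiL-EQA dataset and the DIVRR framework to handle perceptual non-stationarity caused by occlusion and viewpoint dependence in dynamic human-populated scenes, improving the robustness and inference efficiency of Embodied Question Answering (EQA) in dynamic environments.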
Details
Motivation: Conventional EQA is evaluated in static environments and copes poorly with the perceptual non-stationarity caused by human activity and occlusion in real human-populated scenes, leading to redundant evidence, inefficient inference, and viewpoint-dependent ambiguity. Method: Build the DynHiL-EQA dataset (with Dynamic and Static subsets) and propose the training-free DIVRR framework, which combines relevance-guided view refinement with adaptive memory selection, verifying ambiguous observations before storing them and retaining only informative evidence. Result: DIVRR clearly outperforms existing baselines on both DynHiL-EQA and HM-EQA, performs well in dynamic and static settings, and maintains high inference efficiency with a compact memory footprint. Conclusion: Through evidence management driven by dynamic perception, DIVRR mitigates ambiguity and redundancy in dynamic scenes, offering a practical approach to EQA in real interactive environments. Abstract: Embodied Question Answering (EQA) has traditionally been evaluated in temporally stable environments where visual evidence can be accumulated reliably. However, in dynamic, human-populated scenes, human activities and occlusions introduce significant perceptual non-stationarity: task-relevant cues are transient and view-dependent, while a store-then-retrieve strategy over-accumulates redundant evidence and increases inference cost. This setting exposes two practical challenges for EQA agents: resolving ambiguity caused by viewpoint-dependent occlusions, and maintaining compact yet up-to-date evidence for efficient inference. To enable systematic study of this setting, we introduce DynHiL-EQA, a human-in-the-loop EQA dataset with two subsets: a Dynamic subset featuring human activities and temporal changes, and a Static subset with temporally stable observations. To address the above challenges, we present DIVRR (Dynamic-Informed View Refinement and Relevance-guided Adaptive Memory Selection), a training-free framework that couples relevance-guided view refinement with selective memory admission. By verifying ambiguous observations before committing them and retaining only informative evidence, DIVRR improves robustness under occlusions while preserving fast inference with compact memory. Extensive experiments on DynHiL-EQA and the established HM-EQA dataset demonstrate that DIVRR consistently improves over existing baselines in both dynamic and static settings while maintaining high inference efficiency.
[139] A comprehensive study of time-of-flight non-line-of-sight imaging
Julio Marco,Adrian Jarabo,Ji Hyun Nam,Alberto Tosi,Diego Gutierrez,Andreas Velten
Main category: cs.CV
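TL;DR: This paper presents a systematic survey and experimental comparison of time-of-flight non-line-of-sight (ToF NLOS) imaging, unifying the formulation, analyzing how its forward and inverse models relate to Radon transforms and frequency-domain phasor models, and evaluating several methods under the same hardware; the methods are found to share common limits on resolution, visibility, and noise sensitivity.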
Details
Motivation: ToF NLOS imaging methods are varied, with heterogeneous formulations and hardware, which makes theoretical and experimental assessment difficult; a unified framework for objective comparison is needed. Method: Establish a general ToF NLOS forward model and review the typical assumptions that yield tractable inverse models; show how these relate to a family of Radon transforms and to frequency-domain phasor-based virtual line-of-sight models; compare the reconstruction performance of representative methods under the same hardware and comparable photon counts. Result: Experiments show that under the same hardware constraints the methods behave similarly in spatial resolution, visibility of the hidden scene, and robustness to noise, with differences stemming mainly from method-specific parameters. Conclusion: The study proposes a standardized evaluation methodology that could serve as a reference for future research and comparison of ToF NLOS imaging algorithms. Abstract: Time-of-Flight non-line-of-sight (ToF NLOS) imaging techniques provide state-of-the-art reconstructions of scenes hidden around corners by inverting the optical path of indirect photons scattered by visible surfaces and measured by picosecond resolution sensors. The emergence of a wide range of ToF NLOS imaging methods with heterogeneous formulae and hardware implementations obscures the assessment of both their theoretical and experimental aspects. We present a comprehensive study of a representative set of ToF NLOS imaging methods by discussing their similarities and differences under common formulation and hardware. We first outline the problem statement under a common general forward model for ToF NLOS measurements, and the typical assumptions that yield tractable inverse models. We discuss the relationship of the resulting simplified forward and inverse models to a family of Radon transforms, and how migrating these to the frequency domain relates to recent phasor-based virtual line-of-sight imaging models for NLOS imaging that obey the constraints of conventional lens-based imaging systems. We then evaluate performance of the selected methods on hidden scenes captured under the same hardware setup and similar photon counts. Our experiments show that existing methods share similar limitations on spatial resolution, visibility, and sensitivity to noise when operating under equal hardware constraints, with particular differences that stem from method-specific parameters. We expect our methodology to become a reference in future research on ToF NLOS imaging to obtain objective comparisons of existing and new methods.
[140] GeoSolver: Scaling Test-Time Reasoning in Remote Sensing with Fine-Grained Process Supervision
Lang Sun,Ronghao Fu,Zhuoran Duan,Haoran Liu,Xueyan Liu,Bo Yang
Main category: cs.CV
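TL;DR: This paper proposes the GeoSolver framework, which builds the Geo-PRM-2M dataset, trains the GeoPRM reward model, and applies the Process-Aware Tree-GRPO reinforcement learning algorithm to improve the visual faithfulness and verifiability of chain-of-thought reasoning in remote sensing vision-language models, achieving SOTA performance and supporting cross-model generalization and test-time scaling.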
Details
Motivation: Existing vision-language models struggle to perform reliable, step-by-step chain-of-thought reasoning in remote sensing interpretation, and in particular lack effective supervision and verification of the visual faithfulness of intermediate steps. Method: Build Geo-PRM-2M, a token-level process supervision dataset synthesized with entropy-guided MCTS and visual hallucination injection; train GeoPRM, a token-level process reward model; propose Process-Aware Tree-GRPO, a reinforcement learning algorithm that combines tree-structured exploration with a faithfulness-weighted reward mechanism. Result: GeoSolver-9B achieves SOTA on multiple remote sensing benchmarks; GeoPRM, acting as a universal geospatial verifier, supports Test-Time Scaling and directly enhances general-purpose VLMs, showing strong cross-model generalization. Conclusion: Process-supervised reinforcement learning is an effective path to more trustworthy and verifiable reasoning in remote sensing VLMs, and GeoPRM provides transferable verification infrastructure for vision-language reasoning across models and tasks. Abstract: While Vision-Language Models (VLMs) have significantly advanced remote sensing interpretation, enabling them to perform complex, step-by-step reasoning remains highly challenging. Recent efforts to introduce Chain-of-Thought (CoT) reasoning to this domain have shown promise, yet ensuring the visual faithfulness of these intermediate steps remains a critical bottleneck. To address this, we introduce GeoSolver, a novel framework that transitions remote sensing reasoning toward verifiable, process-supervised reinforcement learning. We first construct Geo-PRM-2M, a large-scale, token-level process supervision dataset synthesized via entropy-guided Monte Carlo Tree Search (MCTS) and targeted visual hallucination injection. Building upon this dataset, we train GeoPRM, a token-level process reward model (PRM) that provides granular faithfulness feedback. To effectively leverage these verification signals, we propose Process-Aware Tree-GRPO, a reinforcement learning algorithm that integrates tree-structured exploration with a faithfulness-weighted reward mechanism to precisely assign credit to intermediate steps. Extensive experiments demonstrate that our resulting model, GeoSolver-9B, achieves state-of-the-art performance across diverse remote sensing benchmarks. Crucially, GeoPRM unlocks robust Test-Time Scaling (TTS). Serving as a universal geospatial verifier, it seamlessly scales the performance of GeoSolver-9B and directly enhances general-purpose VLMs, highlighting its remarkable cross-model generalization.
[141] GeoAlignCLIP: Enhancing Fine-Grained Vision-Language Alignment in Remote Sensing via Multi-Granular Consistency Learning
Xiao Yang,Ronghao Fu,Zhuoran Duan,Zhiwen Lin,Xueyan Liu,Bo Yang
Main category: cs.CV
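TL;DR: This paper proposes the GeoAlignCLIP framework and the RSFG-100k dataset, which improve fine-grained cross-modal alignment between remote sensing images and natural language through multi-granular semantic alignment and intra-modal consistency.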
Details
Motivation: Existing remote sensing vision-language pretraining models rely mainly on global image-text alignment and fail to integrate multi-granular visual and textual information effectively, which limits their performance on complex fine-grained tasks. Method: Propose GeoAlignCLIP, a unified framework that learns multi-granular semantic alignments and incorporates intra-modal consistency; build RSFG-100k, a fine-grained remote sensing dataset with scene descriptions, region-level annotations, and hard-negative samples that provides hierarchical supervision. Result: Experiments on multiple public remote sensing benchmarks show that GeoAlignCLIP consistently outperforms existing RS-specific methods, with greater robustness and more accurate fine-grained visual-semantic alignment across tasks. Conclusion: With multi-granular alignment and the RSFG-100k dataset, GeoAlignCLIP improves fine-grained understanding and cross-modal alignment for vision-language models in remote sensing. Abstract: Vision-language pretraining models have made significant progress in bridging remote sensing imagery with natural language. However, existing approaches often fail to effectively integrate multi-granular visual and textual information, relying primarily on global image-text alignment. This limitation hinders the model's ability to accurately capture fine-grained details in images, thus restricting its performance in complex, fine-grained tasks. To address this, we propose GeoAlignCLIP, a unified framework that achieves fine-grained alignment in remote sensing tasks by learning multi-granular semantic alignments and incorporating intra-modal consistency, enabling more precise visual-semantic alignment between image regions and text concepts. Additionally, we construct RSFG-100k, a fine-granular remote sensing dataset containing scene descriptions, region-level annotations, and challenging hard-negative samples, providing hierarchical supervision for model training. Extensive experiments conducted on multiple public remote-sensing benchmarks demonstrate that GeoAlignCLIP consistently outperforms existing RS-specific methods across diverse tasks, exhibiting more robust and accurate fine-grained vision-language alignment.
[142] More than the Sum: Panorama-Language Models for Adverse Omni-Scenes
Weijia Fan,Ruiping Liu,Jiale Wei,Yufan Chen,Junwei Zheng,Zichao Zeng,Jiaming Zhang,Qiufu Li,Linlin Shen,Rainer Stiefelhagen
Main category: cs.CV
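TL;DR: This paper proposes the Panorama-Language Modeling (PLM) paradigm, introducing the PanoVQA dataset and a plug-and-play panoramic sparse attention module that lets existing pinhole-based vision-language models process 360° panoramas directly, improving robustness and holistic reasoning in complex omnidirectional scenes.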
Details
Motivation: Existing vision-language models are designed for pinhole images, and stitching multiple views ignores the holistic spatial and contextual relationships inherent in a panorama. Method: Propose the Panorama-Language Modeling (PLM) paradigm; build PanoVQA, a large-scale panoramic VQA dataset; design a plug-and-play panoramic sparse attention module that adapts existing pinhole VLMs to equirectangular panoramas. Result: Experiments show that PLM is more robust and reasons more holistically in adverse omnidirectional scenes (such as object occlusion and driving accidents), outperforming a simple combination of pinhole views. Conclusion: Panorama-language modeling is a better paradigm for 360° cross-modal understanding: its overall performance exceeds the sum of multiple narrow-view models, laying a foundation for omnidirectional vision-language understanding. Abstract: Existing vision-language models (VLMs) are tailored for pinhole imagery, stitching multiple narrow field-of-view inputs to piece together a complete omni-scene understanding. Yet, such multi-view perception overlooks the holistic spatial and contextual relationships that a single panorama inherently preserves. In this work, we introduce the Panorama-Language Modeling (PLM) paradigm, a unified $360^\circ$ vision-language reasoning that is more than the sum of its pinhole counterparts. Besides, we present PanoVQA, a large-scale panoramic VQA dataset that involves adverse omni-scenes, enabling comprehensive reasoning under object occlusions and driving accidents. To establish a foundation for PLM, we develop a plug-and-play panoramic sparse attention module that allows existing pinhole-based VLMs to process equirectangular panoramas without retraining. Extensive experiments demonstrate that our PLM achieves superior robustness and holistic reasoning under challenging omni-scenes, yielding understanding greater than the sum of its narrow parts. Project page: https://github.com/InSAI-Lab/PanoVQA.
[143] BinaryAttention: One-Bit QK-Attention for Vision and Diffusion Transformers
Chaodong Xiao,Zhengqiang Zhang,Lei Zhang
Main category: cs.CV
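TL;DR: This paper proposes BinaryAttention, a 1-bit qk-attention method that keeps only the signs of queries and keys and replaces floating-point dot products with bit-wise operations, greatly improving computational efficiency while preserving accuracy.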
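To see why keeping only signs lets bit-wise operations stand in for a dot product, note that for two vectors of ±1 entries of length d, the dot product equals d minus twice the number of positions where the signs differ, and that count is the popcount of an XOR over packed sign bits. The sketch below checks this identity in plain Python. The helper names are hypothetical, zero is mapped to +1 by assumption, and the learnable bias and training procedure described in the paper are not modeled.

```python
def pack_signs(vector):
    """Pack signs into an integer: bit i is 1 when vector[i] >= 0."""
    bits = 0
    for i, x in enumerate(vector):
        if x >= 0:
            bits |= 1 << i
    return bits


def binary_dot(q_bits, k_bits, dim):
    """Dot product of two +/-1 vectors via XOR and popcount."""
    mismatches = bin(q_bits ^ k_bits).count("1")
    return dim - 2 * mismatches


def sign_dot_reference(q, k):
    """Reference: explicit dot product of the sign vectors."""
    def sign(x):
        return 1 if x >= 0 else -1
    return sum(sign(a) * sign(b) for a, b in zip(q, k))


query = [0.3, -1.2, 0.7, -0.1]
key = [-0.5, -0.4, 0.9, 0.2]
fast = binary_dot(pack_signs(query), pack_signs(key), dim=len(query))
slow = sign_dot_reference(query, key)
print(fast, slow)
```

On hardware, the XOR and popcount run over machine words covering many dimensions at once, which is where the speed advantage over floating-point multiply-accumulate would come from.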
Details
Motivation: The high computational complexity of the attention module in Transformers is a major bottleneck for vision tasks, and existing low-bit quantization methods (8-bit or 4-bit) struggle to balance efficiency and accuracy. Method: BinaryAttention binarizes queries and keys (keeping only the sign) and replaces floating-point dot products with bit-wise operations; a learnable bias mitigates information loss; quantization-aware training and self-distillation keep similarities aligned. Result: BinaryAttention is more than 2x faster than FlashAttention2 on A100 GPUs and matches or even exceeds full-precision attention on multiple ViT and diffusion Transformer benchmarks. Conclusion: BinaryAttention offers an efficient, accurate low-bit alternative to full-precision attention for vision and diffusion Transformers, advancing ultra-low-bit models. Abstract: Transformers have achieved widespread and remarkable success, while the computational complexity of their attention modules remains a major bottleneck for vision tasks. Existing methods mainly employ 8-bit or 4-bit quantization to balance efficiency and accuracy. In this paper, with theoretical justification, we indicate that binarization of attention preserves the essential similarity relationships, and propose BinaryAttention, an effective method for fast and accurate 1-bit qk-attention. Specifically, we retain only the sign of queries and keys in computing the attention, and replace the floating-point dot products with bit-wise operations, significantly reducing the computational cost. We mitigate the inherent information loss under 1-bit quantization by incorporating a learnable bias, and enable end-to-end acceleration. To maintain the accuracy of attention, we adopt quantization-aware training and self-distillation techniques, mitigating quantization errors while ensuring sign-aligned similarity. BinaryAttention is more than 2x faster than FlashAttention2 on A100 GPUs. Extensive experiments on vision transformer and diffusion transformer benchmarks demonstrate that BinaryAttention matches or even exceeds full-precision attention, validating its effectiveness. Our work provides a highly efficient and effective alternative to full-precision attention, pushing the frontier of low-bit vision and diffusion transformers. The codes and models can be found at https://github.com/EdwardChasel/BinaryAttention.
[144] ParTY: Part-Guidance for Expressive Text-to-Motion Synthesis
KunHo Heo,SuYeon Kim,Yonghyun Gwon,Youngbin Kim,MyeongAh Cho
Main category: cs.CV
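TL;DR: This paper proposes ParTY, a framework that uses a Part-Guided Network, Part-aware Text Grounding, and Holistic-Part Fusion to address inaccurate body-part alignment and incoherent full-body motion in text-to-motion synthesis.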
Details
Motivation: Existing text-to-motion synthesis methods struggle to accurately reflect actions involving specific body parts, and part-wise generation methods lack explicit mechanisms for aligning textual semantics with body parts, which leads to incoherent full-body motion.
Method: ParTY comprises three components: (1) a Part-Guided Network that first generates part motions and then uses them to guide holistic motion generation; (2) Part-aware Text Grounding, which diversely transforms text embeddings and matches them to each body part; (3) Holistic-Part Fusion, which adaptively fuses holistic and part motions.
Result: ParTY substantially outperforms previous methods in part-level and coherence-level evaluations.
Conclusion: ParTY improves both body-part expressiveness and full-body coherence, resolving the fundamental trade-off in existing methods.
Abstract: Text-to-motion synthesis aims to generate natural and expressive human motions from textual descriptions. While existing approaches primarily focus on generating holistic motions from text descriptions, they struggle to accurately reflect actions involving specific body parts. Recent part-wise motion generation methods attempt to resolve this but face two critical limitations: (i) they lack explicit mechanisms for aligning textual semantics with individual body parts, and (ii) they often generate incoherent full-body motions due to integrating independently generated part motions. To overcome these issues and resolve the fundamental trade-off in existing methods, we propose ParTY, a novel framework that enhances part expressiveness while generating coherent full-body motions. ParTY comprises: (1) Part-Guided Network, which first generates part motions to obtain part guidance, then uses it to generate holistic motions; (2) Part-aware Text Grounding, which diversely transforms text embeddings and appropriately aligns them with each body part; and (3) Holistic-Part Fusion, which adaptively fuses holistic motions and part motions. Extensive experiments, including part-level and coherence-level evaluations, demonstrate that ParTY achieves substantial improvements over previous methods.
[145] A saccade-inspired approach to image classification using vision transformer attention maps
Matthis Dallain,Laurent Rodriguez,Laurent Udo Perrinet,Benoît Miramond
Main category: cs.CV
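TL;DR: Inspired by the human visual system, this paper proposes a bio-inspired saccade-based image processing method driven by the attention of DINO, a self-supervised ViT. On ImageNet classification it processes selected regions efficiently, matches or even exceeds full-image processing, and outperforms conventional saliency models.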
Details
Motivation: Human vision achieves remarkable perception under strict metabolic constraints, largely because rapid saccades point the high-resolution fovea at task-relevant regions, whereas existing AI models process the whole image uniformly and inefficiently. This work borrows that mechanism to build smarter, more efficient image processing models.
Method: DINO (a self-supervised Vision Transformer) produces attention maps resembling human eye movements; these serve as saccade guidance to progressively focus on key image regions in ImageNet classification while the class scores are updated dynamically, yielding selective information processing.
Result: The saccade strategy preserves accuracy close to full-image classification and in some cases exceeds it; fixation guidance from DINO outperforms mainstream saliency models built for human gaze prediction.
Conclusion: The intrinsic attention of Vision Transformers is a promising basis for biologically inspired active vision and efficient neuromorphic visual processing.
Abstract: Human vision achieves remarkable perceptual performance while operating under strict metabolic constraints. A key ingredient is the selective attention mechanism, driven by rapid saccadic eye movements that constantly reposition the high-resolution fovea onto task-relevant locations, unlike conventional AI systems that process entire images with equal emphasis. Our work aims to draw inspiration from the human visual system to create smarter, more efficient image processing models. Using DINO, a self-supervised Vision Transformer that produces attention maps strikingly similar to human gaze patterns, we explore a saccade-inspired method to focus the processing of information on key regions in visual space. To do so, we use the ImageNet dataset in a standard classification task and measure how each successive saccade affects the model's class scores. This selective-processing strategy preserves most of the full-image classification performance and can even outperform it in certain cases. By benchmarking against established saliency models built for human gaze prediction, we demonstrate that DINO provides superior fixation guidance for selecting informative regions. These findings highlight Vision Transformer attention as a promising basis for biologically inspired active vision and open new directions for efficient, neuromorphic visual processing.
[146] Physics-Driven 3D Gaussian Rendering for Zero-Shot MRI Super-Resolution
Shuting Liu,Lei Zhang,Wei Huang,Zhao Zhang,Zizhou Wang
Main category: cs.CV
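TL;DR: This paper proposes a zero-shot MRI super-resolution framework that uses an explicit Gaussian representation to achieve efficient, high-quality reconstruction without paired data.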
Details
Motivation: High-resolution MRI is vital for clinical diagnosis but is limited by long acquisition times and motion artifacts; existing super-resolution methods trade off data requirements (paired data) against computational efficiency (costly implicit neural representations).
Method: A zero-shot MRI super-resolution framework built on an explicit Gaussian representation: MRI-tailored Gaussian parameters embed tissue physical properties; a physics-grounded volume rendering strategy models MRI signal formation; a brick-based order-independent rasterization scheme supports highly parallel 3D computation.
Result: Experiments on two public MRI datasets show that the method outperforms existing methods in both reconstruction quality and computational efficiency.
Conclusion: The framework achieves high-fidelity, efficient MRI super-resolution without paired training data and shows good potential for clinical use.
Abstract: High-resolution Magnetic Resonance Imaging (MRI) is vital for clinical diagnosis but limited by long acquisition times and motion artifacts. Super-resolution (SR) reconstructs low-resolution scans into high-resolution images, yet existing methods are mutually constrained: paired-data methods achieve efficiency only by relying on costly aligned datasets, while implicit neural representation approaches avoid such data needs at the expense of heavy computation. We propose a zero-shot MRI SR framework using explicit Gaussian representation to balance data requirements and efficiency. MRI-tailored Gaussian parameters embed tissue physical properties, reducing learnable parameters while preserving MR signal fidelity. A physics-grounded volume rendering strategy models MRI signal formation via normalized Gaussian aggregation. Additionally, a brick-based order-independent rasterization scheme enables highly parallel 3D computation, lowering training and inference costs. Experiments on two public MRI datasets show superior reconstruction quality and efficiency, demonstrating the method's potential for clinical MRI SR.
[147] Decoder-Free Distillation for Quantized Image Restoration
S. M. A. Sharif,Abdur Rehman,Seongwan Kim,Jaeho Lee
Main category: cs.CV
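TL;DR: This paper proposes the QDR framework, which uses FP32 self-distillation, Decoder-Free Distillation (DFD), and Learnable Magnitude Reweighting (LMR) to address three bottlenecks in jointly optimizing quantization-aware training and knowledge distillation for image restoration. It also designs an Edge-Friendly Model (EFM) that enables efficient edge deployment while keeping restoration performance high.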
Details
Motivation: Existing approaches that combine quantization-aware training (QAT) with knowledge distillation (KD) face three problems in precision-sensitive low-level image restoration: teacher-student capacity mismatch, spatial error amplification during decoder distillation, and an optimization conflict between reconstruction and distillation losses caused by quantization noise. Targeted solutions are needed.
Method: The Quantization-aware Distilled Restoration (QDR) framework: 1) FP32 self-distillation removes the capacity mismatch; 2) Decoder-Free Distillation (DFD) corrects quantization errors only at the bottleneck to suppress spatial error amplification; 3) Learnable Magnitude Reweighting (LMR) dynamically balances the gradients of the competing objectives; 4) an Edge-Friendly Model (EFM) includes a lightweight Learnable Degradation Gating (LDG) module that dynamically localizes spatial degradation.
Result: Across four image restoration tasks, the Int8 model recovers 96.5% of FP32 performance, reaches 442 FPS on an NVIDIA Jetson Orin, and improves downstream object detection by 16.3 mAP.
Conclusion: QDR overcomes the key bottlenecks of QAT-KD in low-level vision restoration, combines high performance, high efficiency and strong generalization, and offers a practical approach to image restoration on edge devices.
Abstract: Quantization-Aware Training (QAT), combined with Knowledge Distillation (KD), holds immense promise for compressing models for edge deployment. However, joint optimization for precision-sensitive image restoration (IR) to recover visual quality from degraded images remains largely underexplored. Directly adapting QAT-KD to low-level vision reveals three critical bottlenecks: teacher-student capacity mismatch, spatial error amplification during decoder distillation, and an optimization "tug-of-war" between reconstruction and distillation losses caused by quantization noise. To tackle these, we introduce Quantization-aware Distilled Restoration (QDR), a framework for edge-deployed IR. QDR eliminates capacity mismatch via FP32 self-distillation and prevents error amplification through Decoder-Free Distillation (DFD), which corrects quantization errors strictly at the network bottleneck. To stabilize the optimization tug-of-war, we propose a Learnable Magnitude Reweighting (LMR) that dynamically balances competing gradients. Finally, we design an Edge-Friendly Model (EFM) featuring a lightweight Learnable Degradation Gating (LDG) to dynamically modulate spatial degradation localization. Extensive experiments across four IR tasks demonstrate that our Int8 model recovers 96.5% of FP32 performance, achieves 442 frames per second (FPS) on an NVIDIA Jetson Orin, and boosts downstream object detection by 16.3 mAP.
[148] Grounding Synthetic Data Generation With Vision and Language Models
Ümit Mert Çağlar,Alptekin Temizel
Main category: cs.CV
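TL;DR: This paper proposes an interpretable, vision-language-grounded framework for synthetic data augmentation and evaluation, and builds ARAS400k, a remote sensing dataset of 400k samples for semantic segmentation and image captioning, showing that synthetic data can improve downstream model performance.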
Details
Motivation: Existing evaluation metrics for synthetic data rely on latent feature similarity, which is hard to interpret and correlates weakly with downstream task performance; more interpretable, task-oriented evaluation is needed.
Method: A joint vision-language framework combining generative models, semantic segmentation and image captioning; the ARAS400k dataset (100k real and 300k synthetic images), with a segmentation map and a text description for each image; automated evaluation through semantic composition analysis, caption de-duplication and cross-modal consistency checks.
Result: Models trained only on synthetic data are already competitive; joint training on real and synthetic data consistently beats real-data-only baselines; ARAS400k serves as a scalable benchmark for remote sensing semantic segmentation and image captioning.
Conclusion: The work establishes a task-centered, interpretable paradigm for evaluating synthetic data and supports more standardized and practical data augmentation in remote sensing.
Abstract: Deep learning models benefit from increasing data diversity and volume, motivating synthetic data augmentation to improve existing datasets. However, existing evaluation metrics for synthetic data typically calculate latent feature similarity, which is difficult to interpret and does not always correlate with the contribution to downstream tasks. We propose a vision-language grounded framework for interpretable synthetic data augmentation and evaluation in remote sensing. Our approach combines generative models, semantic segmentation and image captioning with vision and language models. Based on this framework, we introduce ARAS400k: A large-scale Remote sensing dataset Augmented with Synthetic data for segmentation and captioning, containing 100k real images and 300k synthetic images, each paired with segmentation maps and descriptions. ARAS400k enables the automated evaluation of synthetic data by analyzing semantic composition, minimizing caption redundancy, and verifying cross-modal consistency between visual structures and language descriptions. Experimental results indicate that while models trained exclusively on synthetic data reach competitive performance levels, those trained with augmented data (a combination of real and synthetic images) consistently outperform real-data baselines. Consequently, this work establishes a scalable benchmark for remote sensing tasks, specifically in semantic segmentation and image captioning. The dataset is available at zenodo.org/records/18890661 and the code base at github.com/caglarmert/ARAS400k.
[149] OTPL-VIO: Robust Visual-Inertial Odometry with Optimal Transport Line Association and Adaptive Uncertainty
Zikun Chen,Wentao Zhao,Yihe Niu,Tianchen Deng,Jingchuan Wang
Main category: cs.CV
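TL;DR: This paper proposes a robust stereo point-line visual-inertial odometry (VIO) system. Training-free deep descriptors and line matching based on entropy-regularized optimal transport improve matching robustness and estimation stability in low-texture scenes and under abrupt illumination changes, while reliability-adaptive weighting suppresses the influence of line measurement noise. Experiments show that it outperforms existing methods in accuracy, robustness, and real-time performance.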
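Entropy-regularized optimal transport is commonly solved with Sinkhorn iterations, and a minimal NumPy sketch of that idea applied to descriptor matching is given here. It is an illustration under stated assumptions rather than the system's implementation: the cosine-distance cost, the uniform marginals, the value of `eps` and the function names are all hypothetical, and the sketch has no mechanism for outliers or partial observations, which the paper reports handling.

```python
import numpy as np

def sinkhorn(cost, eps=0.1, n_iters=200):
    # Entropy-regularized OT between uniform marginals via Sinkhorn scaling.
    n, m = cost.shape
    a = np.full(n, 1.0 / n)   # mass on each line in image A
    b = np.full(m, 1.0 / m)   # mass on each line in image B
    kernel = np.exp(-cost / eps)
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (kernel @ v)      # rescale rows toward marginal a
        v = b / (kernel.T @ u)    # rescale columns toward marginal b
    return u[:, None] * kernel * v[None, :]

def match_lines(desc_a, desc_b, eps=0.1):
    # desc_a: (n, d), desc_b: (m, d) line descriptors; cost is cosine distance.
    desc_a = desc_a / np.linalg.norm(desc_a, axis=1, keepdims=True)
    desc_b = desc_b / np.linalg.norm(desc_b, axis=1, keepdims=True)
    cost = 1.0 - desc_a @ desc_b.T
    plan = sinkhorn(cost, eps)
    # For each line in A, pick the line in B that receives the most mass.
    return plan.argmax(axis=1), plan

rng = np.random.default_rng(0)
desc_a = rng.normal(size=(5, 32))
perm = rng.permutation(5)
desc_b = desc_a[perm] + 0.01 * rng.normal(size=(5, 32))
matches, plan = match_lines(desc_a, desc_b)
```

Because the transport plan is normalized over all rows and columns jointly, each assignment is scored in the context of every other candidate, which is the sense in which such matching is globally consistent compared with per-line nearest-neighbor search.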
Details
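Motivation: In low-texture scenes and under abrupt illumination changes, conventional point-feature VIO suffers from sparse, unstable point features, leading to ambiguous matching and under-constrained estimation; existing point-line systems mostly rely on point-guided line association, which tends to fail and introduce bias when point support is weak.
Method: A stereo point-line VIO system: 1) a training-free deep descriptor for line segments, obtained by sampling and pooling network feature maps; 2) entropy-regularized optimal transport for globally consistent line matching under ambiguity, outliers and partial observations; 3) an analysis of line measurement noise and a reliability-adaptive weighting scheme that regulates the contribution of line constraints during optimization.
Result: On the EuRoC and UMA-VI datasets and in real-world deployments in low-texture and illumination-challenging environments, the system clearly improves localization accuracy and robustness over representative baselines while maintaining real-time performance.
Conclusion: By decoupling line description from point guidance, strengthening matching robustness and adaptively suppressing noise, the proposed point-line VIO framework overcomes the limitations of conventional VIO in degraded scenes and offers a new approach to reliable navigation in complex environments.
Abstract: Robust stereo visual-inertial odometry (VIO) remains challenging in low-texture scenes and under abrupt illumination changes, where point features become sparse and unstable, leading to ambiguous association and under-constrained estimation. Line structures offer complementary geometric cues, yet many efficient point-line systems still rely on point-guided line association, which can break down when point support is weak and may lead to biased constraints. We present a stereo point-line VIO system in which line segments are equipped with dedicated deep descriptors and matched using an entropy-regularized optimal transport formulation, enabling globally consistent correspondences under ambiguity, outliers, and partial observations. The proposed descriptor is training-free and is computed by sampling and pooling network feature maps. To improve estimation stability, we analyze the impact of line measurement noise and introduce reliability-adaptive weighting to regulate the influence of line constraints during optimization. Experiments on EuRoC and UMA-VI, together with real-world deployments in low-texture and illumination-challenging environments, demonstrate improved accuracy and robustness over representative baselines while maintaining real-time performance.
[150] When to Lock Attention: Training-Free KV Control in Video Diffusion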
Tianyi Zeng,Jincheng Gao,Tianyi Wang,Zijie Meng,Miao Zhang,Jun Yin,Haoyuan Sun,Junfeng Jiao,Christian Claudel,Junbo Tan,Xueqian Wang
Main category: cs.CV
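TL;DR: This paper proposes KV-Lock, a training-free video editing framework that uses diffusion hallucination detection to dynamically adjust the background KV fusion ratio and the CFG scale, improving foreground quality while preserving background consistency.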
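The scheduling idea can be made concrete with a toy NumPy sketch. Everything specific in it is an assumption for illustration, since the summary does not give the actual rules: the variance is taken across several hypothetical denoising predictions for the same step, the risk is a clipped linear ramp with a made-up `threshold`, and `base_cfg` and `cfg_gain` are invented constants.

```python
import numpy as np

def hallucination_score(predictions):
    # predictions: (n_samples, ...) denoising predictions for one step (assumed layout).
    # The score is the element-wise variance across samples, averaged to a scalar.
    return float(np.var(predictions, axis=0).mean())

def schedule(score, threshold=0.05, base_cfg=5.0, cfg_gain=3.0):
    # Hypothetical schedule: risk ramps linearly from 0 to 1 and is clipped.
    risk = float(np.clip(score / threshold, 0.0, 1.0))
    lock_ratio = risk                        # more risk -> stronger background locking
    cfg_scale = base_cfg + cfg_gain * risk   # more risk -> stronger conditional guidance
    return lock_ratio, cfg_scale

def fuse_kv(cached_kv, new_kv, background_mask, lock_ratio):
    # cached_kv, new_kv: (tokens, dim); background_mask: (tokens,), 1 = background.
    # Foreground tokens always use the newly generated KVs.
    w = lock_ratio * background_mask[:, None]
    return w * cached_kv + (1.0 - w) * new_kv

rng = np.random.default_rng(0)
preds = rng.normal(scale=0.2, size=(4, 16, 8))
lock_ratio, cfg_scale = schedule(hallucination_score(preds))
mask = np.array([1.0] * 10 + [0.0] * 6)
fused = fuse_kv(rng.normal(size=(16, 8)), rng.normal(size=(16, 8)), mask, lock_ratio)
```

The point of the sketch is only the coupling: a single risk estimate drives both the KV fusion ratio and the CFG scale, so the background is locked harder at the moments when guidance is strengthened.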
Details
Motivation: Existing video editing methods struggle to improve foreground quality and keep the background consistent at the same time: injecting full-image information tends to cause background artifacts, whereas rigid background locking severely limits foreground generation.
Method: KV-Lock uses the variance of the denoising prediction (the hallucination metric) to quantify generation diversity and, based on it, dynamically schedules the fusion ratio between cached background KVs and newly generated KVs as well as the CFG scale; when hallucination risk is high, it strengthens background KV locking and increases conditional guidance.
Result: As a plug-and-play, training-free module, KV-Lock markedly improves foreground quality while keeping high background fidelity across a range of video editing tasks, outperforming existing methods.
Conclusion: With a training-free dynamic control mechanism, KV-Lock balances foreground generation capacity and background consistency in video editing, providing a practical and general enhancement for DiT-based video diffusion models.
Abstract: Maintaining background consistency while enhancing foreground quality remains a core challenge in video editing. Injecting full-image information often leads to background artifacts, whereas rigid background locking severely constrains the model's capacity for foreground generation. To address this issue, we propose KV-Lock, a training-free framework tailored for DiT-based video diffusion models. Our core insight is that the hallucination metric (variance of denoising prediction) directly quantifies generation diversity, which is inherently linked to the classifier-free guidance (CFG) scale. Building upon this, KV-Lock leverages diffusion hallucination detection to dynamically schedule two key components: the fusion ratio between cached background key-values (KVs) and newly generated KVs, and the CFG scale. When hallucination risk is detected, KV-Lock strengthens background KV locking and simultaneously amplifies conditional guidance for foreground generation, thereby mitigating artifacts and improving generation fidelity. As a training-free, plug-and-play module, KV-Lock can be easily integrated into any pre-trained DiT-based model. Extensive experiments validate that our method outperforms existing approaches, improving foreground quality while keeping high background fidelity across various video editing tasks.
[151] DiffWind: Physics-Informed Differentiable Modeling of Wind-Driven Object Dynamics
Yuanhang Lei,Boming Zhao,Zesong Yang,Xingxuan Li,Tao Cheng,Haocheng Peng,Ru Zhang,Yang Yang,Siyuan Huang,Yujun Shen,Ruizhen Hu,Hujun Bao,Zhaopeng Cui
Main category: cs.CV
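TL;DR: This paper proposes DiffWind, a physics-informed differentiable framework for modeling wind-object interaction from video. It combines a grid-based wind field, particle-system object modeling, MPM interaction simulation, and differentiable rendering and simulation, with LBM as a physical constraint, to achieve accurate reconstruction and forward simulation under novel wind conditions.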
Details
Motivation: Modeling wind-driven object dynamics from video is very challenging because wind is invisible and varies strongly in space and time, and because objects deform in complex ways.
Method: The DiffWind framework represents wind as a grid-based physical field and objects as particle systems derived from 3D Gaussian Splatting, and models wind-object interaction with the Material Point Method (MPM); it jointly optimizes the wind force field and object motion through differentiable rendering and simulation; the Lattice Boltzmann Method (LBM) is added as a physical constraint to keep the result consistent with fluid dynamics.
Result: The method significantly outperforms existing dynamic scene modeling approaches in reconstruction accuracy and simulation fidelity; it supports forward simulation under novel wind conditions and new applications such as wind retargeting; the WD-Objects dataset of synthetic and real-world wind-driven scenes is introduced.
Conclusion: DiffWind opens a new avenue for video-based modeling of wind-object interaction, combining physical plausibility with data-driven capability.
Abstract: Modeling wind-driven object dynamics from video observations is highly challenging due to the invisibility and spatio-temporal variability of wind, as well as the complex deformations of objects. We present DiffWind, a physics-informed differentiable framework that unifies wind-object interaction modeling, video-based reconstruction, and forward simulation. Specifically, we represent wind as a grid-based physical field and objects as particle systems derived from 3D Gaussian Splatting, with their interaction modeled by the Material Point Method (MPM). To recover wind-driven object dynamics, we introduce a reconstruction framework that jointly optimizes the spatio-temporal wind force field and object motion through differentiable rendering and simulation. To ensure physical validity, we incorporate the Lattice Boltzmann Method (LBM) as a physics-informed constraint, enforcing compliance with fluid dynamics laws. Beyond reconstruction, our method naturally supports forward simulation under novel wind conditions and enables new applications such as wind retargeting. We further introduce WD-Objects, a dataset of synthetic and real-world wind-driven scenes. Extensive experiments demonstrate that our method significantly outperforms prior dynamic scene modeling approaches in both reconstruction accuracy and simulation fidelity, opening a new avenue for video-based wind-object interaction modeling.
[152] VarSplat: Uncertainty-aware 3D Gaussian Splatting for Robust RGB-D SLAM
Anh Thuan Tran,Jana Kosecka
Main category: cs.CV
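TL;DR: VarSplat is an uncertainty-aware 3D Gaussian Splatting SLAM system. It explicitly learns the appearance variance of each Gaussian splat and renders a per-pixel uncertainty map using the law of total variance with alpha compositing, improving the robustness of pose estimation and mapping in low-texture, transparent, or complex-reflectance regions.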
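How the law of total variance combines with alpha compositing can be shown for a single pixel in NumPy. The sketch treats the normalized compositing weights as mixture probabilities, so the pixel variance is the weighted within-splat variance plus the weighted spread of the splat means. The normalization step, the scalar (single-channel) appearance and the function name are assumptions; the actual rasterizer is not described in the summary.

```python
import numpy as np

def composite_mean_and_variance(alphas, means, variances):
    # Splats are ordered front to back along one ray; at least one alpha must be > 0.
    alphas = np.asarray(alphas, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    # Transmittance reaching splat i: product of (1 - alpha_j) for all j < i.
    transmittance = np.concatenate(([1.0], np.cumprod(1.0 - alphas)[:-1]))
    weights = alphas * transmittance
    weights = weights / weights.sum()  # assumption: treat weights as probabilities
    mean = np.sum(weights * means)
    within = np.sum(weights * variances)             # E[Var(X | splat)]
    between = np.sum(weights * (means - mean) ** 2)  # Var(E[X | splat])
    return mean, within + between

# Two equally weighted splats with different means and no per-splat variance:
mean, var = composite_mean_and_variance([0.5, 1.0], [0.0, 2.0], [0.0, 0.0])
```

In this example all of the pixel uncertainty comes from the between-splat term, which is the kind of disagreement one would expect at depth edges or on reflective surfaces.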
Details
Motivation: Existing 3DGS-SLAM methods handle measurement reliability implicitly, which makes pose estimation prone to drift and global alignment prone to failure in low-texture regions, on transparent surfaces, and in areas with complex reflectance.
Method: VarSplat explicitly learns the appearance variance of each 3D Gaussian splat; it combines the law of total variance with alpha compositing to render a differentiable per-pixel uncertainty map efficiently in a single rasterization pass; this map guides tracking, submap registration and loop detection.
Result: On Replica (synthetic) and TUM-RGBD, ScanNet and ScanNet++ (real-world), VarSplat matches or exceeds existing dense RGB-D SLAM methods in tracking accuracy, mapping quality and novel view synthesis, and markedly improves robustness.
Conclusion: Explicitly modeling and using per-pixel uncertainty mitigates the performance degradation of 3DGS-SLAM in challenging scenes and provides a feasible, efficient route to uncertainty-driven neural SLAM.
Abstract: Simultaneous Localization and Mapping (SLAM) with 3D Gaussian Splatting (3DGS) enables fast, differentiable rendering and high-fidelity reconstruction across diverse real-world scenes. However, existing 3DGS-SLAM approaches handle measurement reliability implicitly, making pose estimation and global alignment susceptible to drift in low-texture regions, transparent surfaces, or areas with complex reflectance properties. To this end, we introduce VarSplat, an uncertainty-aware 3DGS-SLAM system that explicitly learns per-splat appearance variance. By using the law of total variance with alpha compositing, we then render a differentiable per-pixel uncertainty map via efficient, single-pass rasterization. This map guides tracking, submap registration, and loop detection toward focusing on reliable regions and contributes to more stable optimization. Experimental results on Replica (synthetic) and TUM-RGBD, ScanNet, and ScanNet++ (real-world) show that VarSplat improves robustness and achieves competitive or superior tracking, mapping, and novel view synthesis rendering compared to existing studies for dense RGB-D SLAM.
[153] Improving 3D Foot Motion Reconstruction in Markerless Monocular Human Motion Capture
Tom Wehrbein,Bodo Rosenhahn
Main category: cs.CV
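TL;DR: This paper proposes FootMR, which refines the foot motion estimated by existing 3D human recovery models by lifting 2D foot keypoint sequences to 3D. It avoids relying on inaccurate image-3D annotations, leverages large-scale motion capture data, uses contextual information and residual prediction to improve accuracy, and significantly reduces ankle joint angle error on multiple datasets.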
Details
Motivation: When recovering 3D human motion from in-the-wild videos, existing methods fail to capture fine-grained articulations such as those of the feet, mainly because foot annotations in the training data are inaccurate and foot motion diversity is limited.
Method: FootMR (Foot Motion Refinement) lifts 2D foot keypoint sequences to 3D; it takes no image input and instead leverages high-quality motion capture data; it uses knee and foot motion as context and predicts only the residual foot motion; joints are represented with global rotations and extensive data augmentation is applied to improve generalization.
Result: Experiments on MOOF, MOYO and RICH show that FootMR clearly outperforms state-of-the-art methods, reducing ankle joint angle error on MOYO by up to 30% relative to the best video-based method.
Conclusion: FootMR improves the accuracy of 3D foot motion reconstruction and is particularly suited to applications that need precise foot modeling, such as gait analysis and animation; avoiding image input and relying on 2D keypoints and motion capture priors makes it robust and helps it generalize.
Abstract: State-of-the-art methods can recover accurate overall 3D human body motion from in-the-wild videos. However, they often fail to capture fine-grained articulations, especially in the feet, which are critical for applications such as gait analysis and animation. This limitation results from training datasets with inaccurate foot annotations and limited foot motion diversity. We address this gap with FootMR, a Foot Motion Refinement method that refines foot motion estimated by an existing human recovery model through lifting 2D foot keypoint sequences to 3D. By avoiding direct image input, FootMR circumvents inaccurate image-3D annotation pairs and can instead leverage large-scale motion capture data. To resolve ambiguities of 2D-to-3D lifting, FootMR incorporates knee and foot motion as context and predicts only residual foot motion. Generalization to extreme foot poses is further improved by representing joints in global rather than parent-relative rotations and applying extensive data augmentation. To support evaluation of foot motion reconstruction, we introduce MOOF, a 2D dataset of complex foot movements. Experiments on MOOF, MOYO, and RICH show that FootMR outperforms state-of-the-art methods, reducing ankle joint angle error on MOYO by up to 30% over the best video-based approach.
[154] AutoViVQA: A Large-Scale Automatically Constructed Dataset for Vietnamese Visual Question Answering
Nguyen Anh Tuong,Phan Ba Duc,Nguyen Trung Quoc,Tran Dac Thinh,Dang Duy Lan,Nguyen Quoc Thinh,Tung Le
Main category: cs.CV
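TL;DR: This paper explores Vietnamese visual question answering (VQA) using Transformer-based architectures that combine textual and visual pre-training, and systematically compares automatic evaluation metrics under multilingual settings.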
Details
Motivation: Vietnamese VQA is a low-resource multimodal learning problem; existing datasets (such as ViVQA, OpenViVQA and ViTextVQA) have advanced the field, but better-suited model architectures and evaluation methods are still needed.
Method: A Transformer-based multimodal fusion architecture combining a Vietnamese language model (e.g., PhoBERT) with a vision model (e.g., ViT), with a systematic evaluation of automatic metrics such as BLEU, METEOR and CIDEr under multilingual settings.
Result: The study confirms the effectiveness of pre-trained multimodal Transformers for Vietnamese VQA and reveals how different automatic metrics behave for this language and how well they may agree with human judgment.
Conclusion: Vietnamese VQA can be modeled effectively with cross-modal pre-trained models; future work should design evaluation mechanisms that better fit the characteristics of the language and human judgment.
Abstract: Visual Question Answering (VQA) is a fundamental multimodal task that requires models to jointly understand visual and textual information. Early VQA systems relied heavily on language biases, motivating subsequent work to emphasize visual grounding and balanced datasets. With the success of large-scale pre-trained transformers for both text and vision domains (such as PhoBERT for Vietnamese language understanding and Vision Transformers (ViT) for image representation learning), multimodal fusion has achieved remarkable progress. For Vietnamese VQA, several datasets have been introduced to promote research in low-resource multimodal learning, including ViVQA, OpenViVQA, and the recently proposed ViTextVQA. These resources enable benchmarking of models that integrate linguistic and visual features in the Vietnamese context. Evaluation of VQA systems often employs automatic metrics originally designed for image captioning or machine translation, such as BLEU, METEOR, CIDEr, Recall, Precision, and F1-score. However, recent research suggests that large language models can further improve the alignment between automatic evaluation and human judgment in VQA tasks. In this work, we explore Vietnamese Visual Question Answering using transformer-based architectures, leveraging both textual and visual pre-training while systematically comparing automatic evaluation metrics under multilingual settings.
[155] TemporalDoRA: Temporal PEFT for Robust Surgical Video Question Answering
Luca Carlini,Chiara Lena,Cesare Hassan,Danail Stoyanov,Elena De Momi,Sophia Bano,Mobarak I. Hoque
Main category: cs.CV
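TL;DR: This paper proposes TemporalDoRA, a temporally aware parameter-efficient fine-tuning method for surgical video question answering (VideoQA). It embeds lightweight temporal multi-head attention in the low-rank bottleneck of the vision encoder and applies weight decomposition only to the trainable low-rank branch, improving robustness to linguistic variation. It also introduces REAL-Colon-VQA, a new benchmark dataset for evaluation.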
Details
Motivation: Existing parameter-efficient fine-tuning (PEFT) methods lack explicit frame-to-frame temporal modeling, cannot easily exploit sparse temporal evidence, and are vulnerable to the linguistic bias that arises from the varied ways clinicians phrase questions.
Method: TemporalDoRA (i) inserts lightweight temporal multi-head attention (MHA) into the low-rank adaptation path of the vision encoder and (ii) applies weight decomposition only to the trainable low-rank branch rather than to the full adapted weight, keeping the backbone frozen and the scaling stable.
Result: On the new REAL-Colon-VQA dataset and on EndoVis18-VQA adapted to short clips, TemporalDoRA clearly improves performance on Out-of-Template questions; ablations show that the temporal mixing mechanism is the main source of the gains.
Conclusion: By fusing information across frames inside the low-rank adaptation subspace, TemporalDoRA provides efficient, robust and temporally aware fine-tuning for VideoQA, improving the modeling of linguistic variation and sparse temporal cues while remaining parameter-efficient.
Abstract: Surgical Video Question Answering (VideoQA) requires accurate temporal grounding while remaining robust to natural variation in how clinicians phrase questions, where linguistic bias can arise. Standard Parameter Efficient Fine Tuning (PEFT) methods adapt pretrained projections without explicitly modeling frame-to-frame interactions within the adaptation pathway, limiting their ability to exploit sparse temporal evidence. We introduce TemporalDoRA, a video-specific PEFT formulation that extends Weight-Decomposed Low-Rank Adaptation by (i) inserting lightweight temporal Multi-Head Attention (MHA) inside the low-rank bottleneck of the vision encoder and (ii) selectively applying weight decomposition only to the trainable low-rank branch rather than the full adapted weight. This design enables temporally-aware updates while preserving a frozen backbone and stable scaling. By mixing information across frames within the adaptation subspace, TemporalDoRA steers updates toward temporally consistent visual cues and improves robustness with minimal parameter overhead. To benchmark this setting, we present REAL-Colon-VQA, a colonoscopy VideoQA dataset with 6,424 clip-question pairs, including paired rephrased Out-of-Template questions to evaluate sensitivity to linguistic variation. TemporalDoRA improves Out-of-Template performance, and ablation studies confirm that temporal mixing inside the low-rank branch is the primary driver of these gains. We also validate on EndoVis18-VQA adapted to short clips and observe consistent improvements on the Out-of-Template split. Code and dataset are available at https://anonymous.4open.science/r/TemporalDoRA-BFC8/ (Anonymous GitHub).
[156] TriFusion-SR: Joint Tri-Modal Medical Image Fusion and SR
Fayaz Ali Dharejo,Sharif S. M. A.,Aiman Khalil,Nachiket Chaudhary,Rizwan Ali Naqvi,Radu Timofte
Main category: cs.CV
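TL;DR: This paper proposes TriFusionSR, a wavelet-guided conditional diffusion framework for joint tri-modal medical image fusion and super-resolution. Frequency-aware cross-modal interaction and an Adaptive Spatial-Frequency Fusion module substantially improve fusion quality and resolution.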
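The frequency-band decomposition underlying the framework can be illustrated with a single-level 2D Haar transform in NumPy. This is a sketch under stated assumptions: the summary names the 2D Discrete Wavelet Transform but not the wavelet family, so Haar is a guess, and `fuse_three` uses a classical rule (average the approximation band, keep the largest-magnitude detail coefficient) purely as a stand-in for the learned diffusion-based fusion.

```python
import numpy as np

def haar_dwt2(img):
    # img: (H, W) with even H and W. Returns four half-resolution bands.
    a, b = img[0::2, 0::2], img[0::2, 1::2]
    c, d = img[1::2, 0::2], img[1::2, 1::2]
    approx = (a + b + c + d) / 2.0       # low-pass in both directions
    detail_cols = (a - b + c - d) / 2.0  # differences between adjacent columns
    detail_rows = (a + b - c - d) / 2.0  # differences between adjacent rows
    detail_diag = (a - b - c + d) / 2.0  # diagonal differences
    return approx, detail_cols, detail_rows, detail_diag

def haar_idwt2(approx, detail_cols, detail_rows, detail_diag):
    # Exact inverse of haar_dwt2.
    h, w = approx.shape
    img = np.empty((2 * h, 2 * w), dtype=float)
    img[0::2, 0::2] = (approx + detail_cols + detail_rows + detail_diag) / 2.0
    img[0::2, 1::2] = (approx - detail_cols + detail_rows - detail_diag) / 2.0
    img[1::2, 0::2] = (approx + detail_cols - detail_rows - detail_diag) / 2.0
    img[1::2, 1::2] = (approx - detail_cols - detail_rows + detail_diag) / 2.0
    return img

def fuse_three(modalities):
    # Classical stand-in rule, not the paper's method.
    bands = [haar_dwt2(m) for m in modalities]
    approx = np.mean([b[0] for b in bands], axis=0)
    details = []
    for i in (1, 2, 3):
        stack = np.stack([b[i] for b in bands])  # (3, H/2, W/2)
        pick = np.abs(stack).argmax(axis=0)      # strongest response per location
        details.append(np.take_along_axis(stack, pick[None], axis=0)[0])
    return haar_idwt2(approx, *details)

rng = np.random.default_rng(0)
mri, ct, pet = (rng.random((8, 8)) for _ in range(3))
fused = fuse_three([mri, ct, pet])  # shape (8, 8)
```

Separating the bands this way lets low-frequency content and high-frequency structure be combined by different rules, which is useful when anatomical and functional scans differ strongly in frequency content.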
Details
Motivation: Existing methods treat image fusion and super-resolution as separate stages, which causes artifacts and degrades perceptual quality; the problem is worse in tri-modal fusion of MRI/CT with PET/SPECT because of frequency-domain imbalance.
Method: The TriFusionSR framework decomposes multimodal features into frequency bands with the 2D Discrete Wavelet Transform; a Rectified Wavelet Features (RWF) strategy calibrates the latent coefficients; an Adaptive Spatial-Frequency Fusion (ASFF) module uses gated channel-spatial attention.
Result: The method reaches state-of-the-art performance across multiple upsampling scales, with a 4.8-12.4% PSNR improvement and substantial reductions in RMSE and LPIPS.
Conclusion: TriFusionSR mitigates resolution degradation and modality discrepancies in tri-modal medical image fusion, providing high-quality, high-fidelity fused images for comprehensive clinical diagnosis.
Abstract: Multimodal medical image fusion facilitates comprehensive diagnosis by aggregating complementary structural and functional information, but its effectiveness is limited by resolution degradation and modality discrepancies. Existing approaches typically perform image fusion and super-resolution (SR) in separate stages, leading to artifacts and degraded perceptual quality. These limitations are further amplified in tri-modal settings that combine anatomical modalities (e.g., MRI, CT) with functional scans (e.g., PET, SPECT) due to pronounced frequency domain imbalances. We propose TriFusionSR, a wavelet-guided conditional diffusion framework for joint tri-modal fusion and SR. The framework explicitly decomposes multimodal features into frequency bands using the 2D Discrete Wavelet Transform, enabling frequency-aware cross-modal interaction. We further introduce a Rectified Wavelet Features (RWF) strategy for latent coefficient calibration, followed by an Adaptive Spatial-Frequency Fusion (ASFF) module with gated channel-spatial attention to enable structure-driven multimodal refinement. Extensive experiments demonstrate state-of-the-art performance, achieving 4.8-12.4% PSNR improvement and substantial reductions in RMSE and LPIPS across multiple upsampling scales.
[157] ProGS: Towards Progressive Coding for 3D Gaussian Splatting
Zhiye Tang,Lingzhuo Liu,Shengjie Jiao,Qiudan Zhang,Junhui Hou,You Yang,Xu Wang
Main category: cs.CV
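TL;DR: This paper proposes ProGS, an octree-based progressive coding method for 3D Gaussian Splatting (3DGS) that significantly improves compression efficiency and visual fidelity and supports bandwidth-adaptive streaming.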
Details
Motivation: Existing 3DGS compression methods do not support progressive coding and therefore cannot meet the needs of streaming applications with varying bandwidth.
Method: 3DGS data is organized into an octree structure; a mutual information enhancement mechanism mitigates structural redundancy; anchor nodes are adjusted dynamically to achieve scalable compression.
Result: Compared with the original 3DGS format, file storage is reduced by 45x and visual performance improves by more than 10%.
Conclusion: ProGS provides an efficient and robust solution for streaming 3DGS in real time under varying network conditions.
Abstract: With the emergence of 3D Gaussian Splatting (3DGS), numerous pioneering efforts have been made to address the effective compression of massive 3DGS data. 3DGS offers an efficient and scalable representation of 3D scenes by utilizing learnable 3D Gaussians, but the large size of the generated data has posed significant challenges for storage and transmission. Existing methods, however, have been limited by their inability to support progressive coding, a crucial feature in streaming applications with varying bandwidth. To tackle this limitation, this paper introduces a novel approach that organizes 3DGS data into an octree structure, enabling efficient progressive coding. The proposed ProGS is a streaming-friendly codec that facilitates progressive coding for 3D Gaussian splatting, and significantly improves both compression efficiency and visual fidelity. The proposed method incorporates mutual information enhancement mechanisms to mitigate structural redundancy, leveraging the relevance between nodes in the octree hierarchy. By adapting the octree structure and dynamically adjusting the anchor nodes, ProGS ensures scalable data compression without compromising the rendering quality. ProGS achieves a remarkable 45X reduction in file storage compared to the original 3DGS format, while simultaneously improving visual performance by over 10%. This demonstrates that ProGS can provide a robust solution for real-time applications with varying network conditions.
[158] GSStream: 3D Gaussian Splatting based Volumetric Scene Streaming System
Zhiye Tang,Qiudan Zhang,Lei Zhang,Junhui Hou,You Yang,Xu Wang
Main category: cs.CV
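TL;DR: This paper proposes GSStream, a system that streams 3D Gaussian Splatting (3DGS) data efficiently in real time using a collaborative viewport prediction module and a bitrate adaptation module based on deep reinforcement learning.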
Details
Motivation: 3D Gaussian Splatting (3DGS) has raised the quality of real-time radiance field rendering, but its large data volume and high bandwidth demand make real-time distribution difficult, and existing compression methods and 3DGS variants still fall short of real-time streaming requirements.
Method: The GSStream system comprises: 1) a collaborative viewport prediction module that combines collaborative priors from multiple users with each user's historical viewport sequence; 2) a bitrate adaptation module based on deep reinforcement learning (DRL) that copes with variable state and action spaces; 3) the first user viewport trajectory dataset for volumetric scenes, used for training and simulation.
Result: Experiments show that GSStream outperforms existing representative volumetric scene streaming systems in both visual quality and network resource usage.
Conclusion: GSStream provides an efficient, scalable real-time streaming solution for 3DGS data and supports the practical deployment of immersive 3D content over networks.
Abstract: Recently, the 3D Gaussian splatting (3DGS) technique for real-time radiance field rendering has revolutionized the field of volumetric scene representation, providing users with an immersive experience. In return, however, it produces a large volume of data, which is extremely bandwidth-intensive. Cutting-edge researchers have tried to introduce different approaches and construct multiple variants for 3DGS to obtain a more compact scene representation, but it is still challenging for real-time distribution. In this paper, we propose GSStream, a novel volumetric scene streaming system to support the 3DGS data format. Specifically, GSStream integrates a collaborative viewport prediction module, which better predicts users' future behaviors by learning collaborative priors and historical priors from multiple users and users' viewport sequences, and a deep reinforcement learning (DRL)-based bitrate adaptation module, which tackles the state and action space variability challenge of the bitrate adaptation problem, achieving efficient volumetric scene delivery. In addition, we build the first user viewport trajectory dataset for volumetric scenes to support training and streaming simulation. Extensive experiments show that our proposed GSStream system outperforms existing representative volumetric scene streaming systems in visual quality and network usage. Demo video: https://youtu.be/3WEe8PN8yvA.
[159] FrameDiT: Diffusion Transformer with Frame-Level Matrix Attention for Efficient Video Generation
Minh Khoa Le,Kien Do,Duc Thanh Nguyen,Truyen Tran
Main category: cs.CV
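TL;DR: This paper proposes Matrix Attention, a new frame-level temporal attention mechanism for video diffusion models that improves spatio-temporal modeling while remaining efficient. Built on it are the FrameDiT-G and FrameDiT-H architectures; the latter adds Local Factorized Attention and achieves state-of-the-art performance on multiple video generation benchmarks.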
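Attention across frames rather than across tokens can be sketched in NumPy as follows. The summary does not spell out the "matrix-native operations" that produce the query, key and value matrices, so the right-multiplied projections `w_q`, `w_k`, `w_v`, the Frobenius inner product used as the frame-to-frame score, and the scaling factor are all assumptions made for illustration.

```python
import numpy as np

def matrix_attention(frames, w_q, w_k, w_v):
    # frames: (T, N, D) = T frames, each an N x D matrix of token features.
    # w_q, w_k, w_v: (D, D) projections applied to every frame matrix (assumed form).
    t, n, d = frames.shape
    q = frames @ w_q  # (T, N, D): one query matrix per frame
    k = frames @ w_k
    v = frames @ w_v
    # Frame-to-frame score: Frobenius inner product between whole matrices.
    scores = np.einsum("tnd,snd->ts", q, k) / np.sqrt(n * d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)               # (T, T)
    out = np.einsum("ts,snd->tnd", attn, v)                # mix whole frames
    return out, attn

rng = np.random.default_rng(0)
frames = rng.normal(size=(6, 16, 32))
w_q, w_k, w_v = (rng.normal(size=(32, 32)) / np.sqrt(32) for _ in range(3))
out, attn = matrix_attention(frames, w_q, w_k, w_v)  # out: (6, 16, 32), attn: (6, 6)
```

Under these assumptions the attention map has T x T entries, compared with (T*N) x (T*N) for attention over all spatio-temporal tokens, which illustrates why frame-level attention can stay close to factorized attention in cost.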
Details
Motivation: Existing video diffusion models face a trade-off between efficiency and effectiveness when modeling complex spatio-temporal dynamics: Full 3D Attention is powerful but computationally expensive, while Local Factorized Attention is efficient but limited in temporal modeling.
Method: Matrix Attention treats an entire frame as a matrix, generates the Q/K/V matrices through matrix-native operations, and attends across frames rather than across tokens; FrameDiT-G is built on this mechanism, and FrameDiT-H further combines it with Local Factorized Attention.
Result: FrameDiT-H achieves state-of-the-art results on multiple video generation benchmarks, with clearly better temporal coherence and video quality, while keeping computational efficiency comparable to Local Factorized Attention.
Conclusion: Matrix Attention is a temporal attention mechanism that balances modeling capacity and computational efficiency, offering a new approach to high-fidelity video generation.
Abstract: High-fidelity video generation remains challenging for diffusion models due to the difficulty of modeling complex spatio-temporal dynamics efficiently. Recent video diffusion methods typically represent a video as a sequence of spatio-temporal tokens which can be modeled using Diffusion Transformers (DiTs). However, this approach faces a trade-off between the strong but expensive Full 3D Attention and the efficient but temporally limited Local Factorized Attention. To resolve this trade-off, we propose Matrix Attention, a frame-level temporal attention mechanism that processes an entire frame as a matrix and generates query, key, and value matrices via matrix-native operations. By attending across frames rather than tokens, Matrix Attention effectively preserves global spatio-temporal structure and adapts to significant motion. We build FrameDiT-G, a DiT architecture based on Matrix Attention, and further introduce FrameDiT-H, which integrates Matrix Attention with Local Factorized Attention to capture both large and small motion. Extensive experiments show that FrameDiT-H achieves state-of-the-art results across multiple video generation benchmarks, offering improved temporal coherence and video quality while maintaining efficiency comparable to Local Factorized Attention.
[160] FetalAgents: A Multi-Agent System for Fetal Ultrasound Image and Video Analysis
Xiaotian Hu,Junwei Huang,Mingxuan Liu,Kasidit Anmahapong,Yifei Chen,Yitong Luo,Yiming Huang,Xuguang Bai,Zihan Li,Yi Liao,Haibo Qu,Qiyuan Tian
Main category: cs.CV
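TL;DR: This paper proposes FetalAgents, the first multi-agent system for whole-process fetal ultrasound (US) analysis. A lightweight agentic coordination framework dynamically orchestrates specialized vision experts to deliver high accuracy across diagnosis, measurement, and segmentation, and the system supports keyframe identification in video streams, multi-plane analysis, and structured clinical report generation.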
Details
Motivation: Existing automated tools for fetal ultrasound analysis struggle to combine task-specific accuracy with the whole-process versatility that end-to-end clinical workflows require, and interpretation still depends on expert experience, so a more robust, auditable, workflow-aligned solution is needed. Method: Proposes FetalAgents, a multi-agent system built on a lightweight agentic coordination framework that integrates multiple specialized vision experts and supports both static image analysis and video stream summarization (keyframe selection, multi-plane analysis, fusion with patient metadata, and structured report generation). Result: In multi-center external evaluations across eight clinical tasks, FetalAgents consistently outperforms specialized models and multimodal large language models (MLLMs) in accuracy and robustness. Conclusion: FetalAgents provides the first auditable, workflow-aligned multi-agent solution for fetal ultrasound analysis that combines accuracy with versatility, advancing clinical adoption of AI in prenatal screening. Abstract: Fetal ultrasound (US) is the primary imaging modality for prenatal screening, yet its interpretation relies heavily on the expertise of the clinician. Despite advances in deep learning and foundation models, existing automated tools for fetal US analysis struggle to balance task-specific accuracy with the whole-process versatility required to support end-to-end clinical workflows. To address these limitations, we propose FetalAgents, the first multi-agent system for comprehensive fetal US analysis. Through a lightweight, agentic coordination framework, FetalAgents dynamically orchestrates specialized vision experts to maximize performance across diagnosis, measurement, and segmentation. Furthermore, FetalAgents advances beyond static image analysis by supporting end-to-end video stream summarization, where keyframes are automatically identified across multiple anatomical planes, analyzed by coordinated experts, and synthesized with patient metadata into a structured clinical report. Extensive multi-center external evaluations across eight clinical tasks demonstrate that FetalAgents consistently delivers the most robust and accurate performance when compared against specialized models and multimodal large language models (MLLMs), ultimately providing an auditable, workflow-aligned solution for fetal ultrasound analysis and reporting.[161] $M^2$-Occ: Resilient 3D Semantic Occupancy Prediction for Autonomous Driving with Incomplete Camera Inputs
Kaixin Lin,Kunyu Peng,Di Wen,Yufan Chen,Ruiping Liu,Kailun Yang
Main category: cs.CV
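TL;DR: This paper proposes the M²-Occ framework for semantic occupancy prediction when multi-camera input is incomplete. A multi-view masked reconstruction module and a feature memory module improve the preservation of geometric structure and semantic consistency, yielding clear IoU gains under missing views on the nuScenes-based SurroundOcc benchmark without hurting full-view performance.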
Details
Motivation: Existing camera-based semantic occupancy prediction methods implicitly assume complete surround-view observations, but in real deployments some views are often missing because of occlusion, hardware malfunction, or communication failures, which undermines robustness and safety. Method: Proposes the M²-Occ framework with two core modules: 1) a Multi-view Masked Reconstruction (MMR) module that uses the spatial overlap among neighboring cameras to recover missing-view representations directly in feature space; 2) a Feature Memory Module (FMM) that introduces a learnable memory bank storing class-level semantic prototypes and refines ambiguous voxel features by retrieving and integrating these global priors. Result: Under a systematic missing-view evaluation protocol on the nuScenes-based SurroundOcc benchmark, M²-Occ improves IoU by 4.93% when the back view is missing and by 5.01% when five views are missing, with no loss in full-view performance. Conclusion: M²-Occ improves the robustness and semantic consistency of semantic occupancy prediction under incomplete multi-camera input, offering a new approach to safe perception for autonomous driving in complex real-world environments. Abstract: Semantic occupancy prediction enables dense 3D geometric and semantic understanding for autonomous driving. However, existing camera-based approaches implicitly assume complete surround-view observations, an assumption that rarely holds in real-world deployment due to occlusion, hardware malfunction, or communication failures. We study semantic occupancy prediction under incomplete multi-camera inputs and introduce $M^2$-Occ, a framework designed to preserve geometric structure and semantic coherence when views are missing. $M^2$-Occ addresses two complementary challenges. First, a Multi-view Masked Reconstruction (MMR) module leverages the spatial overlap among neighboring cameras to recover missing-view representations directly in the feature space. Second, a Feature Memory Module (FMM) introduces a learnable memory bank that stores class-level semantic prototypes. By retrieving and integrating these global priors, the FMM refines ambiguous voxel features, ensuring semantic consistency even when observational evidence is incomplete. We introduce a systematic missing-view evaluation protocol on the nuScenes-based SurroundOcc benchmark, encompassing both deterministic single-view failures and stochastic multi-view dropout scenarios. Under the safety-critical missing back-view setting, $M^2$-Occ improves the IoU by 4.93%. As the number of missing cameras increases, the robustness gap further widens; for instance, under the setting with five missing views, our method boosts the IoU by 5.01%. These gains are achieved without compromising full-view performance.
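Illustrative sketch (not the authors' code): the memory-bank retrieval described for the FMM can be pictured as soft attention over class prototypes followed by a residual update. The function names, the dot-product similarity, the temperature, and the residual fusion are all assumptions made for illustration; the paper's actual retrieval and fusion may differ.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def refine_with_prototypes(voxel_feature, prototypes, temperature=1.0):
    """Hypothetical FMM-style step: attend over class prototypes, then add
    the retrieved prior to the voxel feature (residual fusion is assumed)."""
    scores = [dot(voxel_feature, p) / temperature for p in prototypes]
    weights = softmax(scores)
    dim = len(voxel_feature)
    retrieved = [sum(w * p[i] for w, p in zip(weights, prototypes))
                 for i in range(dim)]
    refined = [v + r for v, r in zip(voxel_feature, retrieved)]
    return refined, weights

# Two toy class prototypes in a 2-D feature space.
prototypes = [[1.0, 0.0], [0.0, 1.0]]
refined, weights = refine_with_prototypes([2.0, 0.0], prototypes)
```

In this toy case the voxel feature is closer to the first prototype, so that prototype receives the larger attention weight and dominates the retrieved prior.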
The source code will be publicly released at https://github.com/qixi7up/M2-Occ.[162] ENIGMA-360: An Ego-Exo Dataset for Human Behavior Understanding in Industrial Scenarios
Francesco Ragusa,Rosario Leonardi,Michele Mazzamuto,Daniele Di Mauro,Camillo Quattrocchi,Alessandro Passanisi,Irene D'Ambra,Antonino Furnari,Giovanni Maria Farinella
Main category: cs.CV
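TL;DR: This paper presents ENIGMA-360, the first dataset of 180 egocentric and 180 exocentric videos captured synchronously in a real industrial scenario, with temporal and spatial annotations. Baseline experiments on three foundational behavior understanding tasks expose the limits of existing methods in real industrial settings, and all data and annotations are publicly released.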
Details
Motivation: The lack of datasets with synchronized egocentric and exocentric views in real industrial scenarios holds back human behavior understanding and its application to industrial safety and assistance systems. Method: Builds the ENIGMA-360 dataset: 180 ego-exo video pairs (360 videos in total) captured synchronously in a real industrial environment and labeled with fine-grained temporal and spatial annotations; designs and runs baseline experiments on three foundational tasks (Temporal Action Segmentation, Keystep Recognition, and Egocentric Human-Object Interaction Detection). Result: The baselines show that current SOTA methods degrade markedly in this real industrial ego-exo setting, confirming the difficulty of the tasks and the need for new models. Conclusion: ENIGMA-360 fills the gap in real industrial ego-exo data and provides an important benchmark for robust multi-view human behavior understanding; all data and annotations are publicly available. Abstract: Understanding human behavior from complementary egocentric (ego) and exocentric (exo) points of view enables the development of systems that can support workers in industrial environments and enhance their safety. However, progress in this area is hindered by the lack of datasets capturing both views in realistic industrial scenarios. To address this gap, we propose ENIGMA-360, a new ego-exo dataset acquired in a real industrial scenario. The dataset is composed of 180 egocentric and 180 exocentric procedural videos temporally synchronized offering complementary information of the same scene. The 360 videos have been labeled with temporal and spatial annotations, enabling the study of different aspects of human behavior in industrial domain. We provide baseline experiments for 3 foundational tasks for human behavior understanding: 1) Temporal Action Segmentation, 2) Keystep Recognition and 3) Egocentric Human-Object Interaction Detection, showing the limits of state-of-the-art approaches on this challenging scenario. These results highlight the need for new models capable of robust ego-exo understanding in real-world environments. We publicly release the dataset and its annotations at https://iplab.dmi.unict.it/ENIGMA-360.[163] LAP: A Language-Aware Planning Model For Procedure Planning In Instructional Videos
Lei Shi,Victor Aregbede,Andreas Persson,Martin Längkvist,Amy Loutfi,Stephanie Lowry
Main category: cs.CV
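TL;DR: This paper proposes Language-Aware Planning (LAP), a new method that augments visual information with language descriptions: a finetuned Vision Language Model (VLM) converts visual observations into text descriptions, and the resulting text embeddings drive a diffusion model that plans action sequences, yielding large gains on multiple benchmarks.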
Details
Motivation: Existing vision-based methods for procedure planning are limited because visually similar actions are hard to tell apart; language descriptions are more discriminative in the latent space and can ease this problem. Method: Proposes LAP: a finetuned Vision Language Model (VLM) converts visual observations into text descriptions, predicts actions, and extracts text embeddings; the more discriminative text embeddings then drive a diffusion model that plans action sequences. Result: On the three procedure planning benchmarks CrossTask, Coin, and NIV, LAP sets a new SOTA by a large margin across multiple metrics and time horizons. Conclusion: The language-aware planning paradigm clearly outperforms vision-only methods, confirming the effectiveness and necessity of language as a clearer semantic representation for procedure planning. Abstract: Procedure planning requires a model to predict a sequence of actions that transform a start visual observation into a goal in instructional videos. While most existing methods rely primarily on visual observations as input, they often struggle with the inherent ambiguity where different actions can appear visually similar. In this work, we argue that language descriptions offer a more distinctive representation in the latent space for procedure planning. We introduce Language-Aware Planning (LAP), a novel method that leverages the expressiveness of language to bridge visual observation and planning. LAP uses a finetuned Vision Language Model (VLM) to translate visual observations into text descriptions and to predict actions and extract text embeddings. These text embeddings are more distinctive than visual embeddings and are used in a diffusion model for planning action sequences. We evaluate LAP on three procedure planning benchmarks: CrossTask, Coin, and NIV. LAP achieves new state-of-the-art performance across multiple metrics and time horizons by a large margin, demonstrating the significant advantage of language-aware planning.[164] LogoDiffuser: Training-Free Multilingual Logo Generation and Stylization via Letter-Aware Attention Control
Mingyu Kang,Hyein Seo,Yuna Jeong,Junhyeong Park,Yong Suk Choi
Main category: cs.CV
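TL;DR: LogoDiffuser is a training-free method for generating multilingual design logos. Using a multimodal diffusion transformer, it takes the target characters as image input and analyzes and injects key attention maps, achieving robust control of character structure together with coherent visual styling.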
Details
Motivation: Existing text-to-image methods tend to distort character geometry in multilingual logo design and struggle to support multilingual text generation without additional training. Method: Proposes LogoDiffuser: instead of textual prompts, characters are supplied as image input; the joint attention mechanism is analyzed to identify core tokens; the most informative attention maps are injected to integrate character structure with visual design; and attention maps are aggregated layer-wise to stabilize the core tokens. Result: Achieves SOTA performance in multilingual logo generation, with effectiveness confirmed by both experiments and user studies. Conclusion: LogoDiffuser offers a training-free, language-robust, structure-controllable paradigm for logo generation that substantially improves the quality and practicality of multilingual design logos. Abstract: Recent advances in text-to-image generation have been remarkable, but generating multilingual design logos that harmoniously integrate visual and textual elements remains a challenging task. Existing methods often distort character geometry when applying creative styles and struggle to support multilingual text generation without additional training. To address these challenges, we propose LogoDiffuser, a training-free method that synthesizes multilingual logo designs using the multimodal diffusion transformer. Instead of using textual prompts, we input the target characters as images, enabling robust character structure control regardless of language. We first analyze the joint attention mechanism to identify core tokens, which are tokens that strongly respond to textual structures. With this observation, our method integrates character structure and visual design by injecting the most informative attention maps. Furthermore, we perform layer-wise aggregation of attention maps to mitigate attention shifts across layers and obtain consistent core tokens. Extensive experiments and user studies demonstrate that our method achieves state-of-the-art performance in multilingual logo generation.[165] PanoAffordanceNet: Towards Holistic Affordance Grounding in 360° Indoor Environments
Guoliang Zhu,Wanjun Jia,Caoyang Shao,Yuheng Zhang,Zhiyong Li,Kailun Yang
Main category: cs.CV
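TL;DR: This paper introduces the new task of holistic affordance grounding in panoramic space, designs the PanoAffordanceNet framework to address challenges such as geometric distortion and semantic dispersion, and builds 360-AGD, the first high-quality dataset for the task.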
Details
Motivation: Current affordance grounding methods are confined to perspective views and object-centric settings, which cannot meet the global perception needs of embodied agents in 360° spaces. Method: Proposes the PanoAffordanceNet framework, comprising a Distortion-Aware Spectral Modulator (DASM) and an Omni-Spherical Densification Head (OSDH), trained under multi-level constraints that combine pixel-wise, distributional, and region-text contrastive losses. Result: On the self-built 360-AGD dataset it significantly outperforms existing methods, establishing a solid baseline for scene-level perception in embodied intelligence. Conclusion: This work is the first to systematically extend affordance grounding to 360° indoor panoramic environments, advancing global semantic understanding in embodied intelligence. Abstract: Global perception is essential for embodied agents in 360° spaces, yet current affordance grounding remains largely object-centric and restricted to perspective views. To bridge this gap, we introduce a novel task: Holistic Affordance Grounding in 360° Indoor Environments. This task faces unique challenges, including severe geometric distortions from Equirectangular Projection (ERP), semantic dispersion, and cross-scale alignment difficulties. We propose PanoAffordanceNet, an end-to-end framework featuring a Distortion-Aware Spectral Modulator (DASM) for latitude-dependent calibration and an Omni-Spherical Densification Head (OSDH) to restore topological continuity from sparse activations. By integrating multi-level constraints comprising pixel-wise, distributional, and region-text contrastive objectives, our framework effectively suppresses semantic drift under low supervision. Furthermore, we construct 360-AGD, the first high-quality panoramic affordance grounding dataset. Extensive experiments demonstrate that PanoAffordanceNet significantly outperforms existing methods, establishing a solid baseline for scene-level perception in embodied intelligence. The source code and benchmark dataset will be made publicly available at https://github.com/GL-ZHU925/PanoAffordanceNet.[166] Ego: Embedding-Guided Personalization of Vision-Language Models
Soroush Seifi,Simon Gardier,Vaggelis Dorovatas,Daniel Olmeda Reino,Rahaf Aljundi
Main category: cs.CV
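TL;DR: This paper proposes an efficient personalization method for large vision-language models that uses the model's internal attention mechanisms to extract visual tokens serving as a memory of each personalized concept. It needs no additional training or complex engineered modules and delivers clear performance gains on single-concept, multi-concept, and video personalization tasks.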
Details
Motivation: Existing personalization methods for large vision-language models usually rely on additional training stages (limiting generality and scalability) or on engineered external modules (hurting deployment efficiency), so they struggle to be both general and efficient. Method: Uses the model's internal attention mechanisms to extract visual tokens that represent the target concept and stores them as a memory of that concept; at test time, the model uses these tokens to recall and describe the concept. Result: Across single-concept, multi-concept, and video personalization settings, the method outperforms SOTA methods with minimal personalization overhead. Conclusion: The method exploits the model's inherent representational ability to personalize efficiently, avoiding extra training and complex architectures and improving deployment efficiency and practicality while preserving generality. Abstract: AI assistants that support humans in daily life are becoming increasingly feasible, driven by the rapid advancements in multimodal language models. A key challenge lies in overcoming the generic nature of these models to deliver personalized experiences. Existing approaches to personalizing large vision language models often rely on additional training stages, which limit generality and scalability, or on engineered pipelines with external pre-trained modules, which hinder deployment efficiency. In this work, we propose an efficient personalization method that leverages the model's inherent ability to capture personalized concepts. Specifically, we extract visual tokens that predominantly represent the target concept by utilizing the model's internal attention mechanisms. These tokens serve as a memory of that specific concept, enabling the model to recall and describe it when it appears in test images. We conduct a comprehensive and unified evaluation of our approach and SOTA methods across various personalization settings including single-concept, multi-concept, and video personalization, demonstrating strong performance gains with minimal personalization overhead.[167] Removing the Trigger, Not the Backdoor: Alternative Triggers and Latent Backdoors
Gorka Abad,Ermes Franch,Stefanos Koffas,Stjepan Picek
Main category: cs.CV
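TL;DR: This paper shows that backdoor attacks admit 'alternative triggers' that are perceptually different from the training trigger yet functionally equivalent, proves that they exist, and proposes a new line of defense based on feature-space directions.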
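Illustrative sketch (not the authors' code): one simple way to picture a feature-space backdoor direction is the normalized difference between the mean triggered feature and the mean clean feature, with cosine similarity measuring how well another feature shift aligns with it. The difference-of-means estimator and both function names are assumptions; the paper only states that the direction is estimated by contrasting clean and triggered representations.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def estimate_direction(clean_feats, triggered_feats):
    """Assumed estimator: unit vector from the clean mean to the triggered mean."""
    mc = mean_vector(clean_feats)
    mt = mean_vector(triggered_feats)
    diff = [t - c for t, c in zip(mt, mc)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("clean and triggered means coincide; no direction")
    return [d / norm for d in diff]

def alignment(feature_shift, direction):
    """Cosine similarity between a feature shift and the unit direction."""
    norm = math.sqrt(sum(s * s for s in feature_shift))
    if norm == 0.0:
        return 0.0
    return sum(s * d for s, d in zip(feature_shift, direction)) / norm

# Toy 2-D features: the trigger pushes features along the first axis.
clean = [[0.0, 0.0], [0.0, 2.0]]
triggered = [[3.0, 1.0], [3.0, 1.0]]
direction = estimate_direction(clean, triggered)
score = alignment([2.0, 0.0], direction)
```

A candidate pattern whose induced feature shift scores close to 1 would, under this simplified picture, be moving features along the same direction as the training trigger.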
Details
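Motivation: Existing backdoor defenses assume that neutralizing a known trigger removes the backdoor, but this assumption is incomplete; the authors set out to reveal and formalize the 'alternative trigger' phenomenon and its effect on defense effectiveness. Method: Estimates the backdoor direction associated with alternative triggers by contrasting the feature-space representations of clean and triggered samples; designs a feature-guided attack that jointly optimizes target prediction and directional alignment; and proves theoretically that alternative triggers must exist. Result: Theory shows alternative triggers are an inevitable product of backdoor training; experiments confirm they are widespread; and mainstream trigger-removal defenses fail to eliminate backdoors activated by alternative triggers. Conclusion: Defenses should shift their focus from input-space triggers to backdoor directions in representation space to improve robustness. Abstract: Current backdoor defenses assume that neutralizing a known trigger removes the backdoor. We show this trigger-centric view is incomplete: alternative triggers, patterns perceptually distinct from training triggers, reliably activate the same backdoor. We estimate the alternative trigger backdoor direction in feature space by contrasting clean and triggered representations, and then develop a feature-guided attack that jointly optimizes target prediction and directional alignment. First, we theoretically prove that alternative triggers exist and are an inevitable consequence of backdoor training. Then, we verify this empirically. Additionally, defenses that remove training triggers often leave backdoors intact, and alternative triggers can exploit the latent backdoor feature-space. Our findings motivate defenses targeting backdoor directions in representation space rather than input-space triggers.[168] What is Missing? Explaining Neurons Activated by Absent Concepts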
Robin Hesse,Simone Schaub-Meyer,Janina Hesse,Bernt Schiele,Stefan Roth
Main category: cs.CV
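TL;DR: This paper proposes a new perspective for explainable artificial intelligence (XAI), 'encoded absences', in which a neuron's high activation may stem from the absence of certain concepts rather than their presence. The authors extend attribution and feature visualization methods to reveal such relationships and validate their usefulness for analyzing and debiasing ImageNet models.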
Details
Motivation: Existing XAI methods focus mainly on the causal link between the presence of a concept and neuron activation, overlooking an important but neglected type of causal relationship in which the absence of a concept also leads to high activation. Method: Proposes two simple extensions: 1) adding counterfactual perturbations to attribution methods to detect the contribution of absent concepts; 2) modifying the objective of feature visualization so that it generates 'absence-type' input patterns that strongly activate a neuron. Result: Experiments show that encoded absences are common in ImageNet models; standard XAI methods struggle to find them, whereas the proposed extensions reveal them effectively; and accounting for encoded absences improves debiasing. Conclusion: Encoded absences are an important and common causal mechanism in DNNs; bringing absent concepts into the XAI analysis framework supports a more complete understanding of model behavior and improves robustness and fairness. Abstract: Explainable artificial intelligence (XAI) aims to provide human-interpretable insights into the behavior of deep neural networks (DNNs), typically by estimating a simplified causal structure of the model. In existing work, this causal structure often includes relationships where the presence of a concept is associated with a strong activation of a neuron. For example, attribution methods primarily identify input pixels that contribute most to a prediction, and feature visualization methods reveal inputs that cause high activation of a target neuron - the former implicitly assuming that the relevant information resides in the input, and the latter that neurons encode the presence of concepts. However, a largely overlooked type of causal relationship is that of encoded absences, where the absence of a concept increases neural activation. In this work, we show that such missing but relevant concepts are common and that mainstream XAI methods struggle to reveal them when applied in their standard form. To address this, we propose two simple extensions to attribution and feature visualization techniques that uncover encoded absences. Across experiments, we show how mainstream XAI methods can be used to reveal and explain encoded absences, how ImageNet models exploit them, and that debiasing can be improved when considering them.[169] Test-time Ego-Exo-centric Adaptation for Action Anticipation via Multi-Label Prototype Growing and Dual-Clue Consistency
Zhaofeng Shi,Heqian Qiu,Lanxiao Wang,Qingbo Wu,Fanman Meng,Lili Pan,Hongliang Li
Main category: cs.CV
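TL;DR: This paper introduces a new task, Test-time Ego-Exo Adaptation for Action Anticipation (TE²A³), and designs the Dual-Clue enhanced Prototype Growing Network (DCPGN) to handle its cross-view, multi-label setting and large temporal-spatial gap.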
Details
Motivation: Existing Ego-Exo adaptation methods depend on target-view training data, which raises computation and data collection costs, while Test-Time Adaptation (TTA) methods struggle with multiple action candidates and the large temporal-spatial gap between views. Method: Proposes the Dual-Clue enhanced Prototype Growing Network (DCPGN), comprising: 1) a Multi-Label Prototype Growing Module (ML-PGM) that builds class-wise memory banks via multi-label assignment and confidence-based reweighting and updates them with an entropy priority queue; 2) a Dual-Clue Consistency Module (DCCM) that introduces a lightweight narrator to generate textual clues complementing the visual clues and constrains the textual and visual logits to enforce consistency. Result: Significantly outperforms SOTA methods on the newly proposed EgoMe-anti benchmark and the existing EgoExoLearn benchmark. Conclusion: DCPGN is the first to achieve test-time Ego-Exo adaptation for action anticipation without target-view training data, effectively bridging the temporal-spatial semantic gap between egocentric and exocentric views. Abstract: Efficient adaptation between Egocentric (Ego) and Exocentric (Exo) views is crucial for applications such as human-robot cooperation. However, the success of most existing Ego-Exo adaptation methods relies heavily on target-view data for training, thereby increasing computational and data collection costs. In this paper, we make the first exploration of a Test-time Ego-Exo Adaptation for Action Anticipation (TE$^{2}$A$^{3}$) task, which aims to adjust the source-view-trained model online during test time to anticipate target-view actions. It is challenging for existing Test-Time Adaptation (TTA) methods to address this task due to the multi-action candidates and significant temporal-spatial inter-view gap. Hence, we propose a novel Dual-Clue enhanced Prototype Growing Network (DCPGN), which accumulates multi-label knowledge and integrates cross-modality clues for effective test-time Ego-Exo adaptation and action anticipation. Specifically, we propose a Multi-Label Prototype Growing Module (ML-PGM) to balance multiple positive classes via multi-label assignment and confidence-based reweighting for class-wise memory banks, which are updated by an entropy priority queue strategy. Then, the Dual-Clue Consistency Module (DCCM) introduces a lightweight narrator to generate textual clues indicating action progressions, which complement the visual clues containing various objects. Moreover, we constrain the inferred textual and visual logits to construct dual-clue consistency for temporally and spatially bridging Ego and Exo views.
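Illustrative sketch (not the authors' code): an entropy priority queue for a class-wise memory bank can be pictured as a bounded queue that keeps the lowest-entropy (most confident) samples and evicts the highest-entropy one once capacity is reached. The class name, the keep-low-entropy policy, and the capacity are assumptions made for illustration; the multi-label assignment and reweighting steps of ML-PGM are not modeled here.

```python
import heapq
import math
from itertools import count

def entropy(probs):
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)

class EntropyPriorityQueue:
    """Hypothetical bounded memory for one class: keeps low-entropy samples."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []          # entries: (-entropy, insertion_id, feature)
        self._ids = count()      # tie-breaker so features are never compared

    def push(self, feature, probs):
        item = (-entropy(probs), next(self._ids), feature)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, item)
        else:
            # The heap minimum has the most negative key, i.e. the highest
            # entropy; heappushpop inserts the new item and evicts that one.
            heapq.heappushpop(self._heap, item)

    def features(self):
        return [feature for _, _, feature in self._heap]

bank = EntropyPriorityQueue(capacity=2)
bank.push("a", [0.5, 0.5])      # high entropy
bank.push("b", [0.9, 0.1])
bank.push("c", [0.99, 0.01])    # low entropy; "a" is evicted
kept = sorted(bank.features())
```

In a multi-label setting one such queue would be held per class, and a sample would be pushed into the queue of every class it is assigned to.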
Extensive experiments on the newly proposed EgoMe-anti and the existing EgoExoLearn benchmarks show the effectiveness of our method, which outperforms related state-of-the-art methods by a large margin. Code is available at https://github.com/ZhaofengSHI/DCPGN.[170] RA-SSU: Towards Fine-Grained Audio-Visual Learning with Region-Aware Sound Source Understanding
Muyi Sun,Yixuan Wang,Hong Wang,Chen Su,Man Zhang,Xingqun Qi,Qi Li,Zhenan Sun
Main category: cs.CV
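TL;DR: This paper proposes Region-Aware Sound Source Understanding (RA-SSU), a fine-grained audio-visual learning task, builds two new datasets, f-Music and f-Lifescene, and designs SSUFormer, a model with multi-modal input and output that reaches SOTA on sound source segmentation and region description.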
Details
Motivation: Existing audio-visual learning research concentrates on coarse-grained downstream tasks (such as audio-visual correspondence and sound source localization) and lacks fine-grained modeling of scene perception details, so a finer-grained, frame-level, high-quality sound source understanding task is needed. Method: Defines the new RA-SSU task; builds two fine-grained datasets, f-Music and f-Lifescene, with sound source masks and frame-by-frame textual descriptions; and designs the SSUFormer model, in which a Mask Collaboration Module (MCM) improves segmentation accuracy and a Mixture of Hierarchical-prompted Experts (MoHE) enriches the descriptions. Result: Experiments on the self-built datasets verify the feasibility of the RA-SSU task, the usefulness of the datasets, and the superiority of SSUFormer, which achieves SOTA performance on the Sound Source Understanding benchmark. Conclusion: The RA-SSU task extends audio-visual learning to fine-grained modeling, the f-Music and f-Lifescene datasets provide important resources for this direction, and SSUFormer offers an extensible unified framework for multi-modal sound source understanding. Abstract: Audio-Visual Learning (AVL) is one fundamental task of multi-modality learning and embodied intelligence, displaying the vital role in scene understanding and interaction. However, previous researchers mostly focus on exploring downstream tasks from a coarse-grained perspective (e.g., audio-visual correspondence, sound source localization, and audio-visual event localization). Considering providing more specific scene perception details, we newly define a fine-grained Audio-Visual Learning task, termed Region-Aware Sound Source Understanding (RA-SSU), which aims to achieve region-aware, frame-level, and high-quality sound source understanding. To support this goal, we innovatively construct two corresponding datasets, i.e. fine-grained Music (f-Music) and fine-grained Lifescene (f-Lifescene), each containing annotated sound source masks and frame-by-frame textual descriptions. The f-Music dataset includes 3,976 samples across 22 scene types related to specific application scenarios, focusing on music scenes with complex instrument mixing. The f-Lifescene dataset contains 6,156 samples across 61 types representing diverse sounding objects in life scenarios. Moreover, we propose SSUFormer, a Sound-Source Understanding TransFormer benchmark that facilitates both the sound source segmentation and sound region description with a multi-modal input and multi-modal output architecture.
Specifically, we design two modules for this framework, Mask Collaboration Module (MCM) and Mixture of Hierarchical-prompted Experts (MoHE), to respectively enhance the accuracy and enrich the elaboration of the sound source description. Extensive experiments are conducted on our two datasets to verify the feasibility of the task, evaluate the availability of the datasets, and demonstrate the superiority of the SSUFormer, which achieves SOTA performance on the Sound Source Understanding benchmark.[171] ConfCtrl: Enabling Precise Camera Control in Video Diffusion via Confidence-Aware Interpolation
Liudi Yang,George Eskandar,Fengyi Shen,Mohammad Altillawi,Yang Bai,Chi Zhang,Ziyuan Liu,Abhinav Valada
Main category: cs.CV
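TL;DR: This paper proposes the ConfCtrl framework, which uses confidence-weighted point cloud projection and a Kalman-style predict-update mechanism to improve diffusion-based novel view synthesis from two images under large viewpoint changes, balancing geometric consistency and visual plausibility.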
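Illustrative sketch (not the authors' code): the Kalman-style predict-update idea named in this entry can be pictured with the classical scalar Kalman update, where the gain decides how far a prediction moves toward a noisy measurement. ConfCtrl is described as using learned residual corrections, so the closed-form gain and the function name below are stand-in assumptions used only to show the weighting behavior.

```python
def kalman_update(pred, pred_var, meas, meas_var):
    """Classical scalar Kalman update (closed-form gain, assumed here in
    place of the learned correction described for ConfCtrl)."""
    gain = pred_var / (pred_var + meas_var)   # trust in the measurement
    estimate = pred + gain * (meas - pred)    # move prediction toward measurement
    variance = (1.0 - gain) * pred_var        # uncertainty shrinks after fusing
    return estimate, variance, gain

# Equal uncertainty: the estimate lands halfway between the two values.
est_equal, var_equal, gain_equal = kalman_update(0.0, 1.0, 2.0, 1.0)

# Very noisy measurement (low confidence): the prediction barely moves.
est_noisy, var_noisy, gain_noisy = kalman_update(0.0, 1.0, 2.0, 1e9)
```

Read this way, a low-confidence projected point behaves like a high-variance measurement: its gain is near zero, so the pose-driven prediction dominates in that region.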
Details
Motivation: Existing regression-based methods cannot reconstruct unseen regions, while camera-guided diffusion models tend to drift from the intended trajectory because of noisy point cloud projections or insufficient camera pose conditioning. Method: ConfCtrl initializes the diffusion process by combining a confidence-weighted projected point cloud latent with noise, and introduces a Kalman-like predict-update mechanism that treats the projected point cloud as a noisy measurement and uses learned residual corrections to balance pose-driven predictions against geometric observations, dynamically down-weighting unreliable regions. Result: Experiments on multiple datasets show that ConfCtrl generates geometrically consistent, visually realistic novel views and effectively reconstructs occluded regions under large viewpoint changes. Conclusion: Through its confidence-aware mechanism, ConfCtrl improves how closely diffusion models follow camera poses and how well they complete unseen regions, providing a more robust, geometry-aware solution for novel view synthesis from sparse inputs. Abstract: We address the challenge of novel view synthesis from only two input images under large viewpoint changes. Existing regression-based methods lack the capacity to reconstruct unseen regions, while camera-guided diffusion models often deviate from intended trajectories due to noisy point cloud projections or insufficient conditioning from camera poses. To address these issues, we propose ConfCtrl, a confidence-aware video interpolation framework that enables diffusion models to follow prescribed camera poses while completing unseen regions. ConfCtrl initializes the diffusion process by combining a confidence-weighted projected point cloud latent with noise as the conditioning input. It then applies a Kalman-inspired predict-update mechanism, treating the projected point cloud as a noisy measurement and using learned residual corrections to balance pose-driven predictions with noisy geometric observations. This allows the model to rely on reliable projections while down-weighting uncertain regions, yielding stable, geometry-aware generation. Experiments on multiple datasets show that ConfCtrl produces geometrically consistent and visually plausible novel views, effectively reconstructing occluded regions under large viewpoint changes.[172] BrainSTR: Spatio-Temporal Contrastive Learning for Interpretable Dynamic Brain Network Modeling
Guiliang Guo,Guangqi Wen,Lingwen Liu,Ruoxian Song,Peng Cao,Jinzhu Yang,Fei Wang,Xiaoli Liu,Osmar R. Zaiane
Main category: cs.CV
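TL;DR: This paper proposes the BrainSTR framework, which uses spatio-temporal contrastive learning to make dynamic brain network modeling more interpretable for psychiatric disorder diagnosis, pinpointing when disease signatures emerge and where they reside in brain connectivity.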
Details
Motivation: Dynamic functional connectivity aids neuropsychiatric diagnosis and spatio-temporal interpretability, but diagnostic signals are subtle and sparse, and nuisance fluctuations and non-diagnostic connectivities are pervasive, which makes reliable interpretability difficult. Method: Proposes BrainSTR: an Adaptive Phase Partition module learns state-consistent phase boundaries; an attention mechanism identifies diagnostically critical phases; an Incremental Graph Structure Generator (regularized by binarization, temporal smoothness, and sparsity) extracts disease-related connectivity within each phase; and spatio-temporal supervised contrastive learning uses diagnosis-relevant patterns to refine the similarity metric between samples, building a well-structured semantic space. Result: Experiments on ASD, BD, and MDD datasets validate BrainSTR; the discovered critical phases and subnetworks agree with prior neuroimaging findings and are interpretable. Conclusion: BrainSTR markedly improves both the discriminative power and the spatio-temporal interpretability of dynamic brain network modeling for psychiatric diagnosis, offering a new method for clinical biomarker discovery. Abstract: Dynamic functional connectivity captures time-varying brain states for better neuropsychiatric diagnosis and spatio-temporal interpretability, i.e., identifying when discriminative disease signatures emerge and where they reside in the connectivity topology. Reliable interpretability faces major challenges: diagnostic signals are often subtle and sparsely distributed across both time and topology, while nuisance fluctuations and non-diagnostic connectivities are pervasive. To address these issues, we propose BrainSTR, a spatio-temporal contrastive learning framework for interpretable dynamic brain network modeling. BrainSTR learns state-consistent phase boundaries via a data-driven Adaptive Phase Partition module, identifies diagnostically critical phases with attention, and extracts disease-related connectivity within each phase using an Incremental Graph Structure Generator regularized by binarization, temporal smoothness, and sparsity. Then, we introduce a spatio-temporal supervised contrastive learning approach that leverages diagnosis-relevant spatio-temporal patterns to refine the similarity metric between samples and capture more discriminative spatio-temporal features, thereby constructing a well-structured semantic space for coherent and interpretable representations. Experiments on ASD, BD, and MDD validate the effectiveness of BrainSTR, and the discovered critical phases and subnetworks provide interpretable evidence consistent with prior neuroimaging findings. Our code: https://anonymous.4open.science/r/BrainSTR1.[173] VLM-Loc: Localization in Point Cloud Maps via Vision-Language Models
Shuhao Kang,Youqi Liao,Peijie Wang,Wenlong Liao,Qilin Zhang,Benjamin Busam,Xieyuanli Chen,Yun Liu
Main category: cs.CV
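TL;DR: This paper proposes the VLM-Loc framework, which exploits the spatial reasoning ability of large vision-language models (VLMs) by converting point clouds into bird's-eye-view (BEV) images and scene graphs and introducing a partial node assignment mechanism, improving the accuracy and interpretability of text-to-point-cloud (T2P) localization. It also builds CityLoc, a new benchmark for systematic evaluation.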
Details
Motivation: Existing T2P localization methods rely only on shallow text-point cloud correspondence and lack effective spatial reasoning, which limits accuracy in complex environments. Method: Proposes the VLM-Loc framework: point clouds are converted into BEV images and scene graphs that jointly encode geometric and semantic information; a VLM learns cross-modal representations; and a partial node assignment mechanism explicitly associates textual cues with scene graph nodes, supporting interpretable spatial reasoning. Result: On the newly built CityLoc benchmark, VLM-Loc outperforms state-of-the-art methods in both localization accuracy and robustness. Conclusion: Exploiting the structured spatial understanding of VLMs can substantially improve T2P localization, and the combination of BEV images, scene graphs, and node assignment is an effective and interpretable technical route. Abstract: Text-to-point-cloud (T2P) localization aims to infer precise spatial positions within 3D point cloud maps from natural language descriptions, reflecting how humans perceive and communicate spatial layouts through language. However, existing methods largely rely on shallow text-point cloud correspondence without effective spatial reasoning, limiting their accuracy in complex environments. To address this limitation, we propose VLM-Loc, a framework that leverages the spatial reasoning capability of large vision-language models (VLMs) for T2P localization. Specifically, we transform point clouds into bird's-eye-view (BEV) images and scene graphs that jointly encode geometric and semantic context, providing structured inputs for the VLM to learn cross-modal representations bridging linguistic and spatial semantics. On top of these representations, we introduce a partial node assignment mechanism that explicitly associates textual cues with scene graph nodes, enabling interpretable spatial reasoning for accurate localization. To facilitate systematic evaluation across diverse scenes, we present CityLoc, a benchmark built from multi-source point clouds for fine-grained T2P localization. Experiments on CityLoc demonstrate VLM-Loc achieves superior accuracy and robustness compared to state-of-the-art methods. Our code, model, and dataset are available at https://github.com/MCG-NKU/nku-3d-vision.[174] MA-EgoQA: Question Answering over Egocentric Videos from Multiple Embodied Agents
Kangsan Kim,Yanlai Yang,Suji Kim,Woongyeong Yeo,Youngwan Lee,Mengye Ren,Sung Ju Hwang
Main category: cs.CV
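TL;DR: This paper introduces the new task of understanding egocentric videos from multiple agents, builds the MA-EgoQA benchmark and the EgoMAS baseline model, and shows that current methods fall short in multi-view, system-level understanding.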
Details
Motivation: As embodied AI advances, humans will collaborate with multiple embodied agents, which requires interpreting large volumes of egocentric video from the agents in parallel and reasoning accurately with the right context; existing methods struggle to compress, communicate, and aggregate many streams of high-dimensional sensory data. Method: 1) Formally defines the new problem of jointly understanding multiple long-horizon egocentric videos; 2) builds the MA-EgoQA benchmark (1.7k questions unique to multiple streams, spanning five categories); 3) proposes the EgoMAS baseline, which uses shared memory across agents and agent-wise dynamic retrieval. Result: A comprehensive evaluation of diverse baselines and EgoMAS on MA-EgoQA shows that current models cannot effectively handle multiple egocentric streams and are seriously lacking in system-level, cross-agent understanding. Conclusion: Multi-agent egocentric video understanding is a new direction in need of study, requiring methods that model cross-agent contextual relations and long-term collaborative memory. Abstract: As embodied models become powerful, humans will collaborate with multiple embodied AI agents at their workplace or home in the future. To ensure better communication between human users and the multi-agent system, it is crucial to interpret incoming information from agents in parallel and refer to the appropriate context for each query. Existing challenges include effectively compressing and communicating high volumes of individual sensory inputs in the form of video and correctly aggregating multiple egocentric videos to construct system-level memory. In this work, we first formally define a novel problem of understanding multiple long-horizon egocentric videos simultaneously collected from embodied agents. To facilitate research in this direction, we introduce MultiAgent-EgoQA (MA-EgoQA), a benchmark designed to systematically evaluate existing models in our scenario. MA-EgoQA provides 1.7k questions unique to multiple egocentric streams, spanning five categories: social interaction, task coordination, theory-of-mind, temporal reasoning, and environmental interaction. We further propose a simple baseline model for MA-EgoQA named EgoMAS, which leverages shared memory across embodied agents and agent-wise dynamic retrieval. Through comprehensive evaluation across diverse baselines and EgoMAS on MA-EgoQA, we find that current approaches are unable to effectively handle multiple egocentric streams, highlighting the need for future advances in system-level understanding across the agents.
The code and benchmark are available at https://ma-egoqa.github.io.[175] MissBench: Benchmarking Multimodal Affective Analysis under Imbalanced Missing Modalities
Tien Anh Pham,Phuong-Anh Nguyen,Duc-Trong Le,Cam-Van Thi Nguyen
Main category: cs.CV
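TL;DR: This paper proposes the MissBench benchmark and two new diagnostic metrics (MEI and MLI) for evaluating the fairness and optimization balance of multimodal affective computing models when modalities are missing at imbalanced rates.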
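Illustrative sketch (not the MissBench definition of MLI): the idea of comparing modality-specific gradient norms, pooled across the modules that belong to each modality, can be pictured as below. The max-to-min ratio and both function names are assumptions chosen for illustration; the entry does not give the exact formula for MLI.

```python
import math

def modality_grad_norm(module_grads):
    """L2 norm over all gradient values of one modality, pooled across its modules."""
    return math.sqrt(sum(g * g for grads in module_grads for g in grads))

def gradient_norm_imbalance(grads_by_modality):
    """Hypothetical imbalance score: largest modality norm over smallest.
    A value of 1.0 means every modality receives equally sized gradients."""
    norms = {m: modality_grad_norm(g) for m, g in grads_by_modality.items()}
    smallest = min(norms.values())
    if smallest == 0.0:
        raise ValueError("a modality has zero gradient; ratio is undefined")
    return norms, max(norms.values()) / smallest

# Toy gradients: each modality maps to a list of per-module gradient vectors.
grads = {
    "text": [[3.0, 4.0], [0.0]],
    "audio": [[0.6, 0.8]],
    "visual": [[1.0, 0.0], [0.0, 0.0]],
}
norms, ratio = gradient_norm_imbalance(grads)
```

Here the text branch receives gradients about five times larger than the audio and visual branches, the kind of optimization imbalance such a diagnostic is meant to surface.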
Details
Motivation: In real applications the missing rates of different modalities are often imbalanced, whereas conventional evaluation assumes all modalities are equally available and therefore cannot reveal the resulting training biases and unfair modality contributions. Method: Builds the MissBench benchmark, covering four widely used sentiment and emotion datasets with both shared and imbalanced missing-rate protocols; defines the Modality Equity Index (MEI) to measure how fairly modalities contribute, and the Modality Learning Index (MLI) to quantify optimization imbalance between modalities via gradient norms. Result: Experiments show that some models that appear robust under shared missing rates still exhibit marked modality inequity and optimization imbalance under imbalanced missing rates. Conclusion: MissBench, MEI, and MLI provide practical tools for stress-testing and in-depth analysis of multimodal affective models in realistic incomplete-modality settings. Abstract: Multimodal affective computing underpins key tasks such as sentiment analysis and emotion recognition. Standard evaluations, however, often assume that textual, acoustic, and visual modalities are equally available. In real applications, some modalities are systematically more fragile or expensive, creating imbalanced missing rates and training biases that task-level metrics alone do not reveal. We introduce MissBench, a benchmark and framework for multimodal affective tasks that standardizes both shared and imbalanced missing-rate protocols on four widely used sentiment and emotion datasets. MissBench also defines two diagnostic metrics. The Modality Equity Index (MEI) measures how fairly different modalities contribute across missing-modality configurations. The Modality Learning Index (MLI) quantifies optimization imbalance by comparing modality-specific gradient norms during training, aggregated across modality-related modules. Experiments on representative method families show that models that appear robust under shared missing rates can still exhibit marked modality inequity and optimization imbalance under imbalanced conditions. These findings position MissBench, together with MEI and MLI, as practical tools for stress-testing and analyzing multimodal affective models in realistic incomplete-modality settings. For reproducibility, we release our code at: https://anonymous.4open.science/r/MissBench-4098/[176] InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing
Changyao Tian,Danni Yang,Guanzhou Chen,Erfei Cui,Zhaokai Wang,Yuchen Duan,Penghao Yin,Sitao Chen,Ganlin Yang,Mingxin Liu,Zirun Zhu,Ziqian Fan,Leyao Gu,Haomin Wang,Qi Wei,Jinhui Yin,Xue Yang,Zhihang Zhong,Qi Qin,Yi Xin,Bin Fu,Yihao Liu,Jiaye Ge,Qipeng Guo,Gen Luo,Hongsheng Li,Yu Qiao,Kai Chen,Hongjie Zhang
Main category: cs.CV
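TL;DR: This paper presents InternVL-U, a lightweight 4B-parameter unified multimodal model (UMM). It combines unified contextual modeling with a modality-specific modular design, pairs a state-of-the-art MLLM with an MMDiT visual generation head, and builds a Chain-of-Thought (CoT) based data synthesis pipeline for high-semantic-density tasks, achieving a superior performance-efficiency balance across generation, editing, understanding, and reasoning.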
Details
Motivation: Address the inherent trade-off in unified multimodal models between semantic understanding and generation capability, and bring strong understanding and strong generation together in a lightweight model.
Method: Propose the InternVL-U architecture, which integrates a state-of-the-art Multimodal Large Language Model (MLLM) with an MMDiT visual generation head; adopt unified contextual modeling and a modality-specific modular design with decoupled visual representations; build a CoT-driven data synthesis pipeline targeting high-semantic-density tasks such as text rendering and scientific reasoning.
Result: With only 4B parameters, InternVL-U consistently outperforms the more than 3x larger baseline BAGEL (14B) on various generation and editing tasks while retaining strong multimodal understanding and reasoning.
Conclusion: InternVL-U shows that a lightweight unified multimodal model can combine high-quality generation with deep semantic understanding, offering a new paradigm for multimodal intelligence in resource-constrained settings.
Abstract: Unified multimodal models (UMMs) that integrate understanding, reasoning, generation, and editing face inherent trade-offs between maintaining strong semantic comprehension and acquiring powerful generation capabilities. In this report, we present InternVL-U, a lightweight 4B-parameter UMM that democratizes these capabilities within a unified framework. Guided by the principles of unified contextual modeling and modality-specific modular design with decoupled visual representations, InternVL-U integrates a state-of-the-art Multimodal Large Language Model (MLLM) with a specialized MMDiT-based visual generation head. To further bridge the gap between aesthetic generation and high-level intelligence, we construct a comprehensive data synthesis pipeline targeting high-semantic-density tasks, such as text rendering and scientific reasoning, under a reasoning-centric paradigm that leverages Chain-of-Thought (CoT) to better align abstract user intent with fine-grained visual generation details. Extensive experiments demonstrate that InternVL-U achieves a superior performance-efficiency balance. Despite using only 4B parameters, it consistently outperforms unified baseline models with over 3x larger scales such as BAGEL (14B) on various generation and editing tasks, while retaining strong multimodal understanding and reasoning capabilities.
[177] DISPLAY: Directable Human-Object Interaction Video Generation via Sparse Motion Guidance and Multi-Task Auxiliary
Jiazhi Guan,Quanwei Yang,Luying Huang,Junhao Liang,Borong Liang,Haocheng Feng,Wei He,Kaisiyuan Wang,Hang Zhou,Jingdong Wang
Main category: cs.CV
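TL;DR: This paper proposes DISPLAY, a framework that generates high-fidelity, controllable Human-Object Interaction (HOI) videos from sparse motion guidance (only wrist joint coordinates and an object bounding box). It adds an Object-Stressed Attention mechanism to improve object robustness and a multi-task auxiliary training strategy to ease data scarcity.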
Details
Motivation: Existing methods struggle to generate controllable, physically consistent human-object interaction videos; they depend on dense control signals, template videos, or carefully crafted text prompts, which limits flexibility and generalization.
Method: Propose the DISPLAY framework, driven by sparse motion guidance (wrist coordinates plus an object bounding box); design an Object-Stressed Attention mechanism to strengthen object representations; build a multi-task auxiliary training strategy with a dedicated data curation pipeline.
Result: The method achieves high-fidelity, controllable HOI video generation across diverse tasks, with improved object robustness and generation quality.
Conclusion: Sparse guidance combined with the attention mechanism and multi-task training improves the controllability, consistency, and generalization of HOI video generation.
Abstract: Human-centric video generation has advanced rapidly, yet existing methods struggle to produce controllable and physically consistent Human-Object Interaction (HOI) videos. Existing works rely on dense control signals, template videos, or carefully crafted text prompts, which limit flexibility and generalization to novel objects. We introduce a framework, namely DISPLAY, guided by Sparse Motion Guidance, composed only of wrist joint coordinates and a shape-agnostic object bounding box. This lightweight guidance alleviates the imbalance between human and object representations and enables intuitive user control. To enhance fidelity under such sparse conditions, we propose an Object-Stressed Attention mechanism that improves object robustness. To address the scarcity of high-quality HOI data, we further develop a Multi-Task Auxiliary Training strategy with a dedicated data curation pipeline, allowing the model to benefit from both reliable HOI samples and auxiliary tasks. Comprehensive experiments show that our method achieves high-fidelity, controllable HOI generation across diverse tasks. The project page can be found at https://mumuwei.github.io/DISPLAY/
[178] Stepping VLMs onto the Court: Benchmarking Spatial Intelligence in Sports
Yuchen Yang,Yuqing Shao,Duxiu Huang,Linfeng Dong,Yifei Liu,Suixin Tang,Xiang Zhou,Yuanyuan Gao,Wei Wang,Yue Zhou,Xue Yang,Yanfeng Wang,Xiao Sun,Zhihang Zhong
Main category: cs.CV
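TL;DR: This paper presents CourtSI, the first large-scale spatial intelligence dataset for sports scenarios, with over 1M QA pairs, together with the high-quality evaluation benchmark CourtSI-Bench. Experiments show that existing vision-language models have clear weaknesses in spatial reasoning about sports, while a model fine-tuned on CourtSI improves substantially and generalizes across sports.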
Details
Motivation: Sports scenes combine highly dynamic motion, complex human-object-environment interactions, and precise geometric constraints, making them an ideal testbed for evaluating and improving the spatial intelligence of vision-language models, yet no dedicated large-scale dataset or benchmark existed.
Method: Build the CourtSI dataset (1M+ QA pairs) and the CourtSI-Bench evaluation benchmark (3,686 human-verified QA pairs), using a semi-automatic data engine that reconstructs sports scenes from standard court geometry; systematically cover four task types: spatial counting, distance measurement, localization, and relational reasoning; evaluate 25 VLMs and fine-tune Qwen3-VL-8B on CourtSI.
Result: Existing VLMs fall well short of human performance on CourtSI-Bench, and models trained on general spatial intelligence benchmarks transfer poorly to sports; fine-tuning Qwen3-VL-8B on CourtSI raises accuracy by 23.5 percentage points, and the model generalizes well to the unseen-sport extension set CourtSI-Ext and to spatial-aware commentary generation.
Conclusion: CourtSI exposes the limits of current VLMs' spatial reasoning in real sports scenes and provides a scalable data foundation and evaluation standard for advancing the spatial intelligence of vision-language models.
Abstract: Sports have long attracted broad attention as they push the limits of human physical and cognitive capabilities. Amid growing interest in spatial intelligence for vision-language models (VLMs), sports provide a natural testbed for understanding high-intensity human motion and dynamic object interactions. To this end, we present CourtSI, the first large-scale spatial intelligence dataset tailored to sports scenarios. CourtSI contains over 1M QA pairs, organized under a holistic taxonomy that systematically covers spatial counting, distance measurement, localization, and relational reasoning, across representative net sports including badminton, tennis, and table tennis. Leveraging well-defined court geometry as metric anchors, we develop a semi-automatic data engine to reconstruct sports scenes, enabling scalable curation of CourtSI. In addition, we introduce CourtSI-Bench, a high-quality evaluation benchmark comprising 3,686 QA pairs with rigorous human verification. We evaluate 25 proprietary and open-source VLMs on CourtSI-Bench, revealing a remaining human-AI performance gap and limited generalization from existing spatial intelligence benchmarks. These findings indicate that sports scenarios expose limitations in spatial intelligence capabilities captured by existing benchmarks. Further, fine-tuning Qwen3-VL-8B on CourtSI improves accuracy on CourtSI-Bench by 23.5 percentage points. The adapted model also generalizes effectively to CourtSI-Ext, an evaluation set built on a similar but unseen sport, and demonstrates enhanced spatial-aware commentary generation. Together, these findings demonstrate that CourtSI provides a scalable pathway toward advancing spatial intelligence of VLMs in sports.
[179] WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition
Shan Ning,Longtian Qiu,Jiaxuan Sun,Xuming He
Main category: cs.CV
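TL;DR: This paper proposes WikiCLIP, an efficient contrastive-learning framework for open-domain visual entity recognition (VER). It uses large language model embeddings and a Vision-Guided Knowledge Adaptor (VGKA) to align textual and visual semantics, adds a hard negative synthesis mechanism to sharpen fine-grained discrimination, and significantly outperforms generative methods on benchmarks such as OVEN while greatly reducing inference latency.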
Details
Motivation: Existing generative VER methods perform well but are computationally expensive, which makes them hard to scale and deploy; an efficient, strong contrastive baseline is needed.
Method: Propose the WikiCLIP framework: (1) use large language model embeddings as knowledge-rich entity representations; (2) design a Vision-Guided Knowledge Adaptor (VGKA) that aligns textual semantics with visual cues at the image patch level; (3) introduce a Hard Negative Synthesis Mechanism that generates visually similar but semantically distinct negatives to strengthen discrimination.
Result: WikiCLIP significantly outperforms strong baselines on open-domain VER benchmarks such as OVEN, with a 16% improvement on the OVEN unseen set and nearly 100 times lower inference latency than AutoVER.
Conclusion: With suitable design choices (knowledge-rich embeddings, VGKA, hard negatives), the contrastive paradigm can deliver both high accuracy and high efficiency for VER, and WikiCLIP offers a strong, practical new baseline for open-domain VER.
Abstract: Open-domain visual entity recognition (VER) seeks to associate images with entities in encyclopedic knowledge bases such as Wikipedia. Recent generative methods tailored for VER demonstrate strong performance but incur high computational costs, limiting their scalability and practical deployment. In this work, we revisit the contrastive paradigm for VER and introduce WikiCLIP, a simple yet effective framework that establishes a strong and efficient baseline for open-domain VER. WikiCLIP leverages large language model embeddings as knowledge-rich entity representations and enhances them with a Vision-Guided Knowledge Adaptor (VGKA) that aligns textual semantics with visual cues at the patch level. To further encourage fine-grained discrimination, a Hard Negative Synthesis Mechanism generates visually similar but semantically distinct negatives during training. Experimental results on popular open-domain VER benchmarks, such as OVEN, demonstrate that WikiCLIP significantly outperforms strong baselines. Specifically, WikiCLIP achieves a 16% improvement on the challenging OVEN unseen set, while reducing inference latency by nearly 100 times compared with the leading generative model, AutoVER. The project page is available at https://artanic30.github.io/project_pages/WikiCLIP/
[180] On the Structural Failure of Chamfer Distance in 3D Shape Optimization
Chang-Yong Song,David Hyde
Main category: cs.CV
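TL;DR: This paper shows that Chamfer distance, used as a training loss for tasks such as point cloud reconstruction, has a gradient-structural flaw: its per-point gradient causes a many-to-one collapse that local regularization cannot resolve. Suppressing the collapse requires non-local coupling that reaches beyond local neighborhoods (for example, shared-basis deformation or a differentiable MPM prior), which consistently reduces the Chamfer gap in 2D and 3D experiments.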
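The many-to-one collapse can be seen on a toy example. The sketch below is not the paper's experiment: it uses the common squared-distance form of the two Chamfer terms, and the point sets, the helper names (`nearest`, `chamfer_terms`, `forward_step`), and the step size are illustrative assumptions. It moves each source point independently toward its nearest target, which is what a gradient step on the forward term alone does.

```python
import math

def nearest(p, cloud):
    """Return (index, squared distance) of the point in `cloud` closest to `p`."""
    best_i, best_d = 0, math.inf
    for i, q in enumerate(cloud):
        d = sum((a - b) ** 2 for a, b in zip(p, q))
        if d < best_d:
            best_i, best_d = i, d
    return best_i, best_d

def chamfer_terms(source, target):
    """Forward (source -> target) and backward (target -> source) mean squared distances."""
    forward = sum(nearest(p, target)[1] for p in source) / len(source)
    backward = sum(nearest(q, source)[1] for q in target) / len(target)
    return forward, backward

def forward_step(source, target, step=1.0):
    """Move every source point a fraction `step` of the way to its nearest target.

    This is a rescaled gradient step on the forward term only; each point
    moves independently, with no coupling to its neighbors.
    """
    moved = []
    for p in source:
        j, _ = nearest(p, target)
        q = target[j]
        moved.append(tuple(a + step * (b - a) for a, b in zip(p, q)))
    return moved

# Three nearby source points, two distant targets (toy data).
source = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0)]
target = [(0.0, 1.0), (5.0, 1.0)]

collapsed = forward_step(source, target)
forward_after, backward_after = chamfer_terms(collapsed, target)
```

After one step all three source points coincide at `(0.0, 1.0)`: the forward term is zero, yet the target at `(5.0, 1.0)` is left uncovered and the backward term stays large. Nothing in the per-point update lets one point notice that its neighbors picked the same target, which is the kind of failure that non-local coupling is meant to address.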
Details
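Motivation: Chamfer distance is widely used as the standard training loss for point cloud reconstruction, completion, and generation, yet optimizing it directly can yield worse Chamfer values than not optimizing it at all; this counterintuitive behavior calls for an explanation at the level of gradient structure.
Method: Derive, through theoretical analysis, a necessary condition for suppressing collapse (coupling must propagate beyond local neighborhoods); model shared-basis deformation in a controlled 2D setting, and introduce a differentiable Material Point Method (MPM) prior in 3D shape morphing to provide the global coupling that suppresses collapse.
Result: The Chamfer gap is consistently reduced across 20 directed morphing pairs, with a 2.5x improvement on the topologically complex dragon; the presence or absence of non-local coupling is shown to decide whether Chamfer optimization succeeds or collapses.
Conclusion: Chamfer optimization fails because its gradient induces a structural collapse that local regularization alone cannot fix; successful optimization depends on explicitly introducing non-local coupling, which gives a practical design criterion for any pipeline that optimizes point-level distance metrics.
Abstract: Chamfer distance is the standard training loss for point cloud reconstruction, completion, and generation, yet directly optimizing it can produce worse Chamfer values than not optimizing it at all. We show that this paradoxical failure is gradient-structural. The per-point Chamfer gradient creates a many-to-one collapse that is the unique attractor of the forward term and cannot be resolved by any local regularizer, including repulsion, smoothness, and density-aware re-weighting. We derive a necessary condition for collapse suppression: coupling must propagate beyond local neighborhoods. In a controlled 2D setting, shared-basis deformation suppresses collapse by providing global coupling; in 3D shape morphing, a differentiable MPM prior instantiates the same principle, consistently reducing the Chamfer gap across 20 directed pairs with a 2.5x improvement on the topologically complex dragon. The presence or absence of non-local coupling determines whether Chamfer optimization succeeds or collapses. This provides a practical design criterion for any pipeline that optimizes point-level distance metrics.
[181] Fine-grained Motion Retrieval via Joint-Angle Motion Images and Token-Patch Late Interaction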
Yao Zhang,Zhuchenyang Liu,Yanlan He,Thomas Ploetz,Yu Xiao
Main category: cs.CV
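TL;DR: This paper proposes an interpretable, joint-angle-based motion representation and combines it with MaxSim and Masked Language Modeling regularization, improving both the accuracy of text-motion retrieval and the interpretability of fine-grained alignment.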
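MaxSim is the late-interaction score popularized by ColBERT: each query token is matched to its most similar document vector, and the per-token maxima are summed. The sketch below applies that generic definition to text tokens and motion patches; the toy embeddings, the plain dot-product similarity, and the function name `maxsim` are assumptions for illustration, and the paper's exact normalization or aggregation may differ.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def maxsim(text_tokens, motion_patches):
    """Late-interaction score between token and patch embeddings.

    Returns the summed per-token maximum similarity and, for each text
    token, the index of the motion patch it matched best.
    """
    score = 0.0
    alignment = []
    for token in text_tokens:
        sims = [dot(token, patch) for patch in motion_patches]
        best = max(range(len(sims)), key=sims.__getitem__)
        alignment.append(best)
        score += sims[best]
    return score, alignment

# Toy 2-D embeddings: two text tokens, three motion patches.
text_tokens = [(1.0, 0.0), (0.0, 1.0)]
motion_patches = [(0.9, 0.1), (0.2, 0.8), (0.5, 0.5)]

score, alignment = maxsim(text_tokens, motion_patches)
```

The `alignment` list is what makes this kind of scoring inspectable: it records which patch each word attached to, whereas a single global embedding per modality exposes no such correspondence.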
Details
Motivation: Existing dual-encoder methods learn only global embeddings, losing fine-grained local correspondences and offering little interpretability.
Method: Propose a joint-angle-driven pseudo-image motion representation compatible with Vision Transformers (ViT); for text-to-motion retrieval, apply token-wise late interaction (MaxSim) and add Masked Language Modeling regularization.
Result: The method outperforms state-of-the-art approaches on HumanML3D and KIT-ML while providing interpretable fine-grained text-motion alignment.
Conclusion: The method improves retrieval performance and interpretability together, supporting the value of local feature modeling and fine-grained text-motion alignment.
Abstract: Text-motion retrieval aims to learn a semantically aligned latent space between natural language descriptions and 3D human motion skeleton sequences, enabling bidirectional search across the two modalities. Most existing methods use a dual-encoder framework that compresses motion and text into global embeddings, discarding fine-grained local correspondences, and thus reducing accuracy. Additionally, these global-embedding methods offer limited interpretability of the retrieval results. To overcome these limitations, we propose an interpretable, joint-angle-based motion representation that maps joint-level local features into a structured pseudo-image, compatible with pre-trained Vision Transformers. For text-to-motion retrieval, we employ MaxSim, a token-wise late interaction mechanism, and enhance it with Masked Language Modeling regularization to foster robust, interpretable text-motion alignment. Extensive experiments on HumanML3D and KIT-ML show that our method outperforms state-of-the-art text-motion retrieval approaches while offering interpretable fine-grained correspondences between text and motion. The code is available in the supplementary material.
[182] Adaptive Clinical-Aware Latent Diffusion for Multimodal Brain Image Generation and Missing Modality Imputation
Rong Zhou,Houliang Zhou,Yao Su,Brian Y. Chen,Yu Zhang,Lifang He,Alzheimer's Disease Neuroimaging Initiative
Main category: cs.CV
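TL;DR: ACADiff is a framework based on an adaptive clinical-aware diffusion model that synthesizes missing multimodal brain images (sMRI, FDG-PET, AV45-PET) for Alzheimer's disease diagnosis. It combines clinical metadata with GPT-4o semantic guidance and maintains high generation quality and robust diagnostic performance with up to 80% of modalities missing.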
Details
Motivation: Clinical neuroimaging data frequently have missing modalities, while multimodal information is essential for accurate Alzheimer's disease diagnosis; methods are needed that can reliably synthesize missing modalities from incomplete data.
Method: Propose the ACADiff framework: a diffusion model with progressive denoising that models cross-modality mappings in latent space; an adaptive fusion mechanism that responds dynamically to the available inputs, with semantic guidance from GPT-4o-encoded clinical prompts; and three specialized generators for bidirectional synthesis among sMRI, FDG-PET, and AV45-PET.
Result: Evaluated on the ADNI dataset, ACADiff outperforms existing methods in generation quality (e.g., PSNR, SSIM) and downstream diagnostic performance (e.g., classification accuracy), and remains robust even in the extreme case of 80% missing modalities.
Conclusion: ACADiff addresses the missing-modality problem in multimodal neuroimaging by combining clinical priors with generative modeling, offering a new paradigm for AD diagnostic support when clinical data are limited.
Abstract: Multimodal neuroimaging provides complementary insights for Alzheimer's disease diagnosis, yet clinical datasets frequently suffer from missing modalities. We propose ACADiff, a framework that synthesizes missing brain imaging modalities through adaptive clinical-aware diffusion. ACADiff learns mappings between incomplete multimodal observations and target modalities by progressively denoising latent representations while attending to available imaging data and clinical metadata. The framework employs adaptive fusion that dynamically reconfigures based on input availability, coupled with semantic clinical guidance via GPT-4o-encoded prompts. Three specialized generators enable bidirectional synthesis among sMRI, FDG-PET, and AV45-PET. Evaluated on ADNI subjects, ACADiff achieves superior generation quality and maintains robust diagnostic performance even under extreme 80% missing scenarios, outperforming all existing baselines. To promote reproducibility, code is available at https://github.com/rongzhou7/ACADiff
[183] Unsupervised Domain Adaptation with Target-Only Margin Disparity Discrepancy
Gauthier Miralles,Loïc Le Folgoc,Vincent Jugnon,Pietro Gori
Main category: cs.CV
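TL;DR: This paper proposes a new unsupervised domain adaptation (UDA) framework based on Margin Disparity Discrepancy (MDD) to improve liver segmentation in interventional CBCT images. It performs cross-modality transfer from annotated CT data to unannotated CBCT data and achieves state-of-the-art results in both UDA and few-shot settings.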
Details
Motivation: In interventional radiology, CBCT data are scarce and unannotated while CT data are abundant; existing CBCT datasets mostly target radiotherapy and do not suit interventional use, so cross-modality transfer is needed for CBCT liver segmentation.
Method: Propose a reformulated unsupervised domain adaptation framework based on Margin Disparity Discrepancy (MDD) that uses unannotated interventional CBCT data together with annotated CT data, narrowing the modality gap through domain adaptation to enable cross-modality liver segmentation.
Result: On CT and CBCT liver segmentation, the method achieves state-of-the-art performance in both the unsupervised domain adaptation (UDA) and few-shot settings.
Conclusion: The reformulated MDD framework bridges the modality gap between CT and interventional CBCT and offers a workable cross-domain learning approach for medical image analysis where annotations are scarce.
Abstract: In interventional radiology, Cone-Beam Computed Tomography (CBCT) is a helpful imaging modality that provides guidance to practitioners during minimally invasive procedures. CBCT differs from traditional Computed Tomography (CT) due to its limited reconstructed field of view, specific artefacts, and the intra-arterial administration of contrast medium. While CT benefits from abundant publicly available annotated datasets, interventional CBCT data remain scarce and largely unannotated, with existing datasets focused primarily on radiotherapy applications. To address this limitation, we leverage a proprietary collection of unannotated interventional CBCT scans in conjunction with annotated CT data, employing domain adaptation techniques to bridge the modality gap and enhance liver segmentation performance on CBCT. We propose a novel unsupervised domain adaptation (UDA) framework based on the formalism of Margin Disparity Discrepancy (MDD), which improves target domain performance through a reformulation of the original MDD optimization framework. Experimental results on CT and CBCT datasets for liver segmentation demonstrate that our method achieves state-of-the-art performance in UDA, as well as in the few-shot setting.
[184] No Image, No Problem: End-to-End Multi-Task Cardiac Analysis from Undersampled k-Space
Yundi Zhang,Sevgi Gokce Kafali,Niklas Bubeck,Daniel Rueckert,Jiazhen Pan
Main category: cs.CV
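TL;DR: This paper proposes the k-MTR framework, which learns semantic representations directly in k-space and skips the conventional reconstruction step, achieving multi-task diagnostic performance competitive with image-domain methods.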
Details
Motivation: Conventional CMR pipelines follow a "reconstruct-then-analyze" paradigm that introduces avoidable artifacts and information bottlenecks; diagnosis actually requires only low-dimensional physiological labels, not high-dimensional image reconstruction, which creates a mathematical paradox.
Method: Propose k-space Multi-Task Representation (k-MTR): train a k-space encoder on a large-scale simulation (42,000 subjects) to align undersampled k-space data and fully-sampled images in a shared semantic manifold, recovering anatomical information directly in latent space and bypassing the explicit inverse problem.
Result: On continuous phenotype regression, disease classification, and anatomical segmentation, k-MTR is highly competitive with state-of-the-art image-domain methods, showing that precise spatial geometry and multi-task features can be recovered from undersampled k-space alone.
Conclusion: k-MTR provides a robust architectural blueprint for task-aware cardiac MRI workflows and shows that k-space can directly carry high-level physiological semantics, moving beyond the reconstruction-dependent paradigm.
Abstract: Conventional clinical CMR pipelines rely on a sequential "reconstruct-then-analyze" paradigm, forcing an ill-posed intermediate step that introduces avoidable artifacts and information bottlenecks. This creates a fundamental mathematical paradox: it attempts to recover high-dimensional pixel arrays (i.e., images) from undersampled k-space, rather than directly extracting the low-dimensional physiological labels actually required for diagnosis. To unlock the direct diagnostic potential of k-space, we propose k-MTR (k-space Multi-Task Representation), a k-space representation learning framework that aligns undersampled k-space data and fully-sampled images into a shared semantic manifold. Leveraging a large-scale controlled simulation of 42,000 subjects, k-MTR forces the k-space encoder to restore anatomical information lost to undersampling directly within the latent space, bypassing the explicit inverse problem for downstream analysis. We demonstrate that this latent alignment yields a dense latent space embedded with high-level physiological semantics directly from undersampled frequencies. Across continuous phenotype regression, disease classification, and anatomical segmentation, k-MTR achieves highly competitive performance against state-of-the-art image-domain baselines. By showcasing that precise spatial geometries and multi-task features can be successfully recovered directly from the k-space representations, k-MTR provides a robust architectural blueprint for task-aware cardiac MRI workflows.
[185] Leveraging whole slide difficulty in Multiple Instance Learning to improve prostate cancer grading
Marie Arrivat,Rémy Peyret,Elsa Angelini,Pietro Gori
Main category: cs.CV
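TL;DR: This paper introduces the notion of Whole Slide Difficulty (WSD), based on the degree of disagreement between expert and non-expert pathologists when diagnosing Whole Slide Images (WSIs). Incorporating WSD into training through either multi-task learning or a weighted classification loss consistently improves Gleason grading of prostate cancer, particularly for higher grades.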
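One way to picture the weighted classification loss is as a cross-entropy in which each slide's term is scaled by its difficulty. The sketch below is a guess at the general shape, not the paper's formula: the weight `1 + alpha * wsd`, the parameter `alpha`, the normalization by the weight sum, and the toy logits are all assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def wsd_weighted_ce(batch_logits, labels, wsd, alpha=1.0):
    """Cross-entropy with per-slide weights derived from difficulty scores.

    wsd[i] is a difficulty score for slide i (for example, 1.0 when the
    expert and non-expert disagree and 0.0 otherwise). The weight form
    1 + alpha * wsd[i] is a hypothetical choice for illustration.
    """
    total, weight_sum = 0.0, 0.0
    for logits, y, d in zip(batch_logits, labels, wsd):
        w = 1.0 + alpha * d
        total += -w * math.log(softmax(logits)[y])
        weight_sum += w
    return total / weight_sum

# Two slides with true class 0: one easy and correct, one hard and wrong.
batch_logits = [[2.0, 0.0], [0.0, 2.0]]
labels = [0, 0]
wsd = [0.0, 1.0]

plain = wsd_weighted_ce(batch_logits, labels, wsd, alpha=0.0)
weighted = wsd_weighted_ce(batch_logits, labels, wsd, alpha=1.0)
```

With `alpha=0.0` the function reduces to ordinary mean cross-entropy; with `alpha=1.0` the misclassified difficult slide contributes more, so `weighted` exceeds `plain`. Whether difficult slides should be up-weighted or down-weighted is a design choice the abstract does not spell out.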
Details
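Motivation: Expert annotations are the gold standard, but some WSIs are hard to diagnose and lead to disagreement between experts and non-experts; this disagreement (WSD) carries difficulty information that can be exploited to improve model robustness and generalization.
Method: Define the Whole Slide Difficulty (WSD) measure and use it in two ways: (1) a multi-task learning framework that jointly predicts the diagnostic label and WSD; (2) a WSD-based weighted classification loss that focuses the model on difficult samples. Both are applied to Gleason grading and work with multiple feature encoders and MIL methods.
Result: Incorporating WSD improves classification performance on Gleason grading, especially for higher grades, which are harder to diagnose; the gain is consistent across feature encoders and MIL methods.
Conclusion: WSD is an effective, general weak supervision signal; integrating it into MIL training improves discrimination on hard-to-diagnose WSIs and suggests a new direction for computer-aided clinical diagnosis.
Abstract: Multiple Instance Learning (MIL) has been widely applied in histopathology to classify Whole Slide Images (WSIs) with slide-level diagnoses. While the ground truth is established by expert pathologists, the slides can be difficult to diagnose for non-experts and lead to disagreements between the annotators. In this paper, we introduce the notion of Whole Slide Difficulty (WSD), based on the disagreement between an expert and a non-expert pathologist. We propose two different methods to leverage WSD, a multi-task approach and a weighted classification loss approach, and we apply them to Gleason grading of prostate cancer slides. Results show that integrating WSD during training consistently improves the classification performance across different feature encoders and MIL methods, particularly for higher Gleason grades (i.e. worse diagnosis).
[186] From Semantics to Pixels: Coarse-to-Fine Masked Autoencoders for Hierarchical Visual Understanding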
Wenzhao Xiang,Yue Wu,Hongyang Yu,Feng Gao,Fan Yang,Xilin Chen
Main category: cs.CV
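TL;DR: C2FMAE is a coarse-to-fine masked autoencoder that works across three granularities (semantic, instance, pixel) with a cascaded decoder and a progressive masking strategy, combining the strengths of contrastive learning and masked modeling to learn more robust and generalizable visual representations.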
Details
Motivation: Existing self-supervised methods face a tension: contrastive learning captures global semantics but loses detail, while masked image modeling preserves local texture but suffers from attention drift caused by random masking.
Method: Propose the C2FMAE framework, with masking at three granularities (semantic, instance, RGB), a cascaded decoder for top-down reconstruction, and a progressive masking curriculum (semantic-guided, then instance-guided, then random); build a multi-granular dataset with pseudo-labels for the 1.28M ImageNet images.
Result: C2FMAE achieves significant gains on image classification, object detection, and semantic segmentation, validating hierarchical representation learning.
Conclusion: Explicitly modeling cross-granularity dependencies and following a structured training path eases the semantics-versus-detail trade-off in self-supervised visual pre-training and improves robustness and generalization.
Abstract: Self-supervised visual pre-training methods face an inherent tension: contrastive learning (CL) captures global semantics but loses fine-grained detail, while masked image modeling (MIM) preserves local textures but suffers from "attention drift" due to semantically-agnostic random masking. We propose C2FMAE, a coarse-to-fine masked autoencoder that resolves this tension by explicitly learning hierarchical visual representations across three data granularities: semantic masks (scene-level), instance masks (object-level), and RGB images (pixel-level). Two synergistic innovations enforce a strict top-down learning principle. First, a cascaded decoder sequentially reconstructs from scene semantics to object instances to pixel details, establishing explicit cross-granularity dependencies that parallel decoders cannot capture. Second, a progressive masking curriculum dynamically shifts the training focus from semantic-guided to instance-guided and finally to random masking, creating a structured learning path from global context to local features. To support this framework, we construct a large-scale multi-granular dataset with high-quality pseudo-labels for all 1.28M ImageNet-1K images. Extensive experiments show that C2FMAE achieves significant performance gains on image classification, object detection, and semantic segmentation, validating the effectiveness of our hierarchical design in learning more robust and generalizable representations.
[187] ReCoSplat: Autoregressive Feed-Forward Gaussian Splatting Using Render-and-Compare
Freeman Cheng,Botao Ye,Xueting Li,Junqi You,Fangneng Zhan,Ming-Hsuan Yang
Main category: cs.CV
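TL;DR: ReCoSplat is an autoregressive feed-forward Gaussian Splatting model for online novel view synthesis. A Render-and-Compare module mitigates the distribution shift caused by errors in predicted poses, and a hybrid KV cache compression strategy supports long sequences.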